System Card: Claude Mythos Preview [pdf]
https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf
System Card: Claude Mythos Preview [pdf]
https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf
Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)
SWE-bench Verified: 93.9% / 80.8% / — / 80.6%
SWE-bench Pro: 77.8% / 53.4% / 57.7% / 54.2%
SWE-bench Multilingual: 87.3% / 77.8% / — / —
SWE-bench Multimodal: 59.0% / 27.1% / — / —
Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5% GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%
MMMLU: 92.7% / 91.1% / — / 92.6–93.6%
USAMO: 97.6% / 42.3% / 95.2% / 74.4%
GraphWalks BFS 256K–1M: 80.0% / 38.7% / 21.4% / —
HLE (no tools): 56.8% / 40.0% / 39.8% / 44.4%
HLE (with tools): 64.7% / 53.1% / 52.1% / 51.4%
CharXiv (no tools): 86.1% / 61.5% / — / —
CharXiv (with tools): 93.2% / 78.9% / — / —
OSWorld: 79.6% / 72.7% / 75.0% / —
Very different experience for me. Codex 5.3+ on xhigh are the only models I've tried so far that write reasonably decent C++ (domains: desktop GUI, robotics, game engine dev, embedded stuff, general systems engineering-type codebases), and idiomatic code in languages not well-represented in training data, e.g. QML. One thing I like is explicitly that it knows better when to stop, instead of brute-forcing a solution by spamming bespoke helpers everywhere no rational dev would write that way.
Not always, no, and it takes investment in good prompting/guardrails/plans/explicit test recipes for sure. I'm still on average better at programming in context than Codex 5.4, even if slower. But in terms of "task complexity I can entrust to a model and not be completely disappointed and annoyed", it scores the best so far. Saves a lot on review/iteration overhead.
It's annoying, too, because I don't much like OpenAI as a company.
(Background: 25 years of C++ etc.)