Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

SWE-bench Verified: 93.9% / 80.8% / — / 80.6%
SWE-bench Pro: 77.8% / 53.4% / 57.7% / 54.2%
SWE-bench Multilingual: 87.3% / 77.8% / — / —
SWE-bench Multimodal: 59.0% / 27.1% / — / —
Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%
MMMLU: 92.7% / 91.1% / — / 92.6–93.6%
USAMO: 97.6% / 42.3% / 95.2% / 74.4%
GraphWalks BFS 256K–1M: 80.0% / 38.7% / 21.4% / —

HLE (no tools): 56.8% / 40.0% / 39.8% / 44.4%
HLE (with tools): 64.7% / 53.1% / 52.1% / 51.4%

CharXiv (no tools): 86.1% / 61.5% / — / —
CharXiv (with tools): 93.2% / 78.9% / — / —

OSWorld: 79.6% / 72.7% / 75.0% / —

Honestly we are all sleeping on GPT-5.4. Particularly with the influx of Claude users recently (and increasingly unstable platform) Codex has been added to my rotation and it's surprising me.
Totally. Best-in-class for SWE work (until Mythos gets released, if ever, but I suspect the rumored "Spud" will be out by then too)
It really isn’t. I wish it was, because work complains about overuse of Opus.