RT @burkov
Grok 4 scores very poorly on Yupp, an LMArena competitor that the LLM providers probably haven’t yet managed to fine-tune for: below even Grok 3, with a score of 1142.
Claude Opus 4’s score, for comparison, is 1381.
So, all as I expected: when you don’t have more data and, as a consequence, cannot improve model quality substantially, you beat cherry-picked benchmarks hoping to make a sensation.
But we aren’t in 2024 anymore, so cheap hand-made sensations no longer work.