RT @burkov
Grok 4 scores very poorly on Yupp, an LMArena competitor that the LLM providers probably haven’t yet managed to fine-tune for: it ranks below even Grok 3, with a score of 1142.

Claude Opus 4’s score, for comparison, is 1381.

So, it’s all as I expected: when you don’t have more data and, as a consequence, cannot improve model quality substantially, you beat cherry-picked benchmarks, hoping to create a sensation.

But we aren’t in 2024 anymore, so cheap hand-made sensations no longer work.

https://yupp.ai/leaderboard/explore?live_models=false