https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3; some of them are below (copied directly from the tweets, with minimal editing):

- Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving, and you don't compare the score against a human average but against the second-best human solution

- The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning that if a human took 10 steps to solve a level and the model took 100 steps, the model gets a score of (10/100)^2 = 1% (see the first sketch after this list)

- 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd-best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's just assume that the median human solves about 60% of puzzles (I know, not quite right). If the median human takes 1.5x the steps of your 2nd-fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom-10% solver, who maybe solves 30% of levels but takes 3x the steps: that person would get a score of about 3% (the second sketch after this list reproduces this arithmetic)

- The scoring is designed so that even if AI performs on a human level it will score below 100%

- No harness at all and very simplistic prompt

- Models can't use more than 5X the steps that a human used

- Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES"
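To make the squared-efficiency complaint concrete, here is a minimal sketch of the per-level score as the tweets describe it. The function name and the exact formula are reconstructions from the thread, not ARC-AGI-3's official scoring code; the separate 5x step budget mentioned above would additionally bound how many actions a run can take before it is cut off.

```python
def level_score(human_actions: int, model_actions: int, solved: bool) -> float:
    """Per-level squared-efficiency score as described in the thread.

    `human_actions` is the reference human's action count (per the thread,
    the second-best first-run human); `model_actions` is the model's count.
    This is an assumption reconstructed from the tweets, not the official
    ARC-AGI-3 scoring definition.
    """
    if not solved:
        return 0.0
    # Efficiency relative to the human reference, squared, capped at 100%.
    return min(1.0, (human_actions / model_actions) ** 2)


# The thread's example: the human takes 10 steps, the model takes 100.
print(f"{level_score(10, 100, solved=True):.0%}")  # -> 1%
```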
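The thread's back-of-the-envelope human numbers, reproduced directly. The 60%/30% solve rates and 1.5x/3x step ratios are the tweet's own hypothetical assumptions, not measured data:

```python
# Hypothetical aggregate scores from the thread (illustrative, not measured).

# Median human: assumed to solve ~60% of levels, taking ~1.5x the actions of
# the second-fastest solver on the levels they do finish.
median_score = 0.6 * (1 / 1.5) ** 2
print(f"median human:     {median_score:.1%}")         # 26.7%

# Bottom-decile human: assumed to solve ~30% of levels at ~3x the actions.
bottom_decile_score = 0.3 * (1 / 3) ** 2
print(f"bottom-10% human: {bottom_decile_score:.1%}")  # 3.3%
```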

Lisan al Gaib (@scaling01) on X: https://t.co/IiP4VZlGU3

"Very simplistic prompt" is the absolute and total core of this and the thing that ensures validity of the whole exercise.

If you are trying to measure GENERAL intelligence then it needs to be general.