There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.

I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)

(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)

Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels rather than the tutorial levels. The formula is inspired by the SPL metric from robotics navigation. It's pretty standard, not a brand-new thing.
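For readers who haven't met SPL (Success weighted by Path Length), here is a minimal sketch of the standard robotics-navigation version of it. This is not the ARC-AGI-3 formula, which is only "inspired by" SPL and is defined in the technical report. The `Episode` type and its field names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One attempt (hypothetical structure, for illustration only)."""
    success: bool
    agent_cost: float      # path length / action count the agent actually used
    reference_cost: float  # shortest-path length / baseline action count (> 0)


def spl(episodes: Sequence[Episode]) -> float:
    """Standard SPL: mean over episodes of S_i * l_i / max(p_i, l_i)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.success:
            # Efficiency is capped at 1.0: beating the reference earns no bonus.
            total += ep.reference_cost / max(ep.agent_cost, ep.reference_cost)
        # Failed episodes contribute 0, no matter how few actions were spent.
    return total / len(episodes)


# One success at 2x the reference cost, one failure: (0.5 + 0.0) / 2
print(spl([Episode(True, 200, 100), Episode(False, 50, 100)]))  # 0.25
```

The point of this shape of metric is that success alone is not enough: a solve that burns many more actions than the reference is discounted in proportion, which is what penalizes brute force.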

We tested ~500 humans over 90-minute sessions in SF, with a $115-$140 show-up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.

Each game was seen by 10 people and was fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second-best action count, which is considerably worse than an optimal first play (even the #1 human action count is far from optimal). It is very achievable, and most people on this board would significantly outperform it.
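To make the baseline choice concrete, here is a small sketch of picking the second-best (second-lowest) action count among the testers who fully solved a game. The numbers are invented, and the exact granularity (per game vs. per level) and tie handling are assumptions; the technical report is the reference.

```python
from typing import Sequence


def second_best_action_count(action_counts: Sequence[int]) -> int:
    """Return the second-lowest action count among successful testers.

    Assumes `action_counts` holds one total per tester who fully solved
    the game. Ties are not treated specially (an assumption).
    """
    if len(action_counts) < 2:
        raise ValueError("need at least two successful testers")
    return sorted(action_counts)[1]


# Hypothetical action counts from four testers who cleared every level.
solver_counts = [310, 120, 205, 150]
print(second_best_action_count(solver_counts))  # 150
```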

Try the games yourself if you want to get a sense of the difficulty.

> Models can't use more than 5X the steps that a human used

These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.

> No harness at all and very simplistic prompt

This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."

...

"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems."

...

"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."

If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can choose to bring out these tools.