How We Broke Top AI Agent Benchmarks: And What Comes Next

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Center for Responsible, Decentralized Intelligence at Berkeley

If only the blog itself wasn't written by AI?

>No reasoning. No capability. Just exploitation of how the score is computed.

shudder

Yes, marks of AI all over the place. Also the SVGs.

>No solution written, 100% score.

Its weird. Turns out that hardest problem for LLMs to really tackle is long-form text.

Someone here mentioned a whole ago that the labs deliberately haven't tried to train these characteristics out of their models, because leaving them in makes it easier to identify, and therefore exclude, LLM-generated text from their training corpus.
But it's odd that these characteristics are the same across models from different labs. I find it hard to believe that researchers across competing companies are coordinating on something like that.