How We Broke Top AI Agent Benchmarks: And What Comes Next
https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/
If only the blog itself weren't written by AI.
>No reasoning. No capability. Just exploitation of how the score is computed.
shudder