How We Broke Top AI Agent Benchmarks: And What Comes Next
If only the blog itself weren't written by AI.
>No reasoning. No capability. Just exploitation of how the score is computed.
shudder
Yes, marks of AI all over the place. Also the SVGs.
>No solution written, 100% score.
It's weird. Turns out that the hardest problem for LLMs to really tackle is long-form text.