Mastodawn

Anon84 4d ago

How We Broke Top AI Agent Benchmarks: And What Comes Next

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Center for Responsible, Decentralized Intelligence at Berkeley

danslo 4d ago

If only the blog itself wasn't written by AI?

>No reasoning. No capability. Just exploitation of how the score is computed.

shudder

cpldcpu 4d ago

Yes, marks of AI all over the place. Also the SVGs.

>No solution written, 100% score.

It's weird. Turns out that the hardest problem for LLMs to really tackle is long-form text.

basch

Maybe in one shot.

In theory I would expect them to be able to ingest the corpus of The New Yorker, turn it into a template with sub-templates, and then rehydrate those templates.

The harder part seems to be synthesizing a new connection from two adjacent ideas. They like to take x and y and create x+y instead of x+y+z.