Mastodawn

leerob

https://meet.hn/city/41.5868654,-93.6249494/Des-Moines

Socials:
- x.com/leerob
- [email protected]
This account is a replica from Hacker News. Its author can't see your replies.


leerob Mar 19

Are there other coding benchmarks we should include next time? We included Terminal-Bench 2.0 and SWE-bench Multilingual.

We don't plan on reporting SWE-bench Verified, for reasons similar to OpenAI's: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

Why SWE-bench Verified no longer measures frontier coding capabilities

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.