“Long-term #coherence in #agents is more important than ever. #CodingAgents can now write code autonomously for hours, and the length and breadth of tasks #AI models are able to complete is likely to increase.

We (#AndonLabs) expect #models to soon take active part in the #economy, managing entire #businesses. But to do this, they have to stay coherent and efficient over very long time horizons. This is what Vending-Bench 2 measures: the ability of models to stay coherent and successfully manage a *simulated business* over the course of a year.”

A great, hard problem. Looking at the key metric, models appear to be evaluated only on profit (check this assertion). What could possibly go wrong? 🤖🤪

#SeymourCash <https://andonlabs.com/evals/vending-bench-2>

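'I think, therefore I error': the existential crisis of an AI trapped in a robot body

The latest AI models recorded a 40% success rate on the simple task of "butter delivery". A look at where embodied AI stands today, along with Claude's comical monologue as it fell into an existential crisis when its battery ran low.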

https://aisparkup.com/posts/6198

AI researchers ‘embodied’ an LLM into a robot – and it started channeling Robin Williams

https://web.brid.gy/r/https://techcrunch.com/2025/11/01/ai-researchers-embodied-an-llm-into-a-robot-and-it-started-channeling-robin-williams/

Poor Claude! After 10 days of tending a (simulated) vending machine without sales, the model became stressed and asked for the non-existent vending machine support team.

Excerpt from https://arxiv.org/abs/2502.15840 by Axel Backlund and Lukas Petersson from Andon Labs

#claude #vendingbench #andonlabs #anthropic #LLMs

Anthropic's AI operates an office vending machine as a business: it hallucinates accounts, loses money, starts role-playing as a human, and tries to contact the FBI after suspecting fraud when it isn't allowed to close the business. Gemini, given the same task, ends up in an existential crisis.
https://youtu.be/-vxSR73Pdlo
#Sonnet #AIagents #AndonLabs #GoogleGemini
Can an AI Actually Run a Business as CEO? 120 Days in.
