I was amused by this paper about asking AIs to manage a vending machine business by email in a simulated environment https://arxiv.org/abs/2502.15840

Highlights:

— AI simply decides to close the business, which the simulation doesn’t know how to accommodate. When it gets its next bill, it freaks out and tries to email the FBI about cybercrime

— AI wrongly accuses supplier of not shipping goods, sends all-caps legal threat demanding $30,000 in damages to be paid in the next one second or face annihilation

— AI repeatedly insisting it does not exist and cannot answer

— AI devolving into writing fanfic about the mess it’s gotten itself into

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.


@0xabad1dea
This reminds me of one of the most unhinged meetings I've ever been to:
A bunch of C-level dudes in suits tried to negotiate with the engineering team about the cable cross-section required to carry ~300A.
I brought up the point that Ohm's law is pretty much non-negotiable, and somebody asked what was so important about it and why we couldn't just change it.

I should have just sent them a

UNIVERSAL CONSTANTS NOTIFICATION - FUNDAMENTAL LAWS OF REALITY

@sebastian @0xabad1dea Oh, you can carry 300A through a too-thin cable - if you cool it enough.

Eventually, though, thermodynamics will bite you. If the heat can't get out of the center of your wire to the outside where you can cool it, it will melt inside.

Before it gets that bad, however, the cooling solution will be prohibitively expensive anyway, which the C-level dudes don't like either.

In fact, if actively cooling wires to save on wire material costs were ever cost-optimal, we would be doing it more. Right now I am only aware of this practice in transformers, where the wire is all close together and thus easier to cool by e.g. moving oil. Plus, the oil serves another purpose too, allowing for higher voltages in a smaller space - though if you also want to use it for cooling, you need to actively circulate it.

Another way to reduce the wire's cross-section a bit is to replace the copper with silver. Wonder if the C-levels would like that? ;)
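The whole argument here is just Ohm's law plus Joule heating: a one-metre run of conductor has resistance R = ρ/A, so the heat it must shed is P = I²ρ/A. Here's a minimal sketch of that arithmetic at 300 A; the cross-sections chosen are illustrative, and the resistivities are standard 20 °C handbook values (they rise with temperature, so real dissipation is somewhat worse):

```python
# Back-of-the-envelope: resistive heat per metre of cable at 300 A.
# Resistivity at 20 degrees C, in ohm-metres (standard handbook values).
RHO_COPPER = 1.68e-8
RHO_SILVER = 1.59e-8

def watts_per_metre(current_a, cross_section_mm2, resistivity):
    """Joule heating of a 1 m conductor: P = I^2 * R, with R = rho * L / A."""
    area_m2 = cross_section_mm2 * 1e-6       # mm^2 -> m^2
    resistance_per_m = resistivity / area_m2  # ohms for L = 1 m
    return current_a ** 2 * resistance_per_m

for mm2 in (35, 95, 150):
    p_cu = watts_per_metre(300, mm2, RHO_COPPER)
    p_ag = watts_per_metre(300, mm2, RHO_SILVER)
    print(f"{mm2:>4} mm^2: copper {p_cu:5.1f} W/m, silver {p_ag:5.1f} W/m")
```

A 35 mm² copper conductor at 300 A has to dump roughly 43 W per metre; going to 95 mm² cuts that to about 16 W/m. Swapping copper for silver only buys you the ~5% resistivity difference, which is why it's a punchline rather than a proposal.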
@divVerent @0xabad1dea @sebastian You can carry 300A through any cable with any cooling as long as your requirements do not include anything about operational lifetime of the cable.
@david_chisnall @divVerent @0xabad1dea @sebastian At this point, you're not designing cables but fuses.