Tim J

@timtfj

I used to play in too many orchestras, and currently play in none. I'm interested in many things, often science, language or music related.

I like words (especially ones that don't exist yet), and cheese. I hate underthinking, football, parsnips, and rigid rules about commas.

I occasionally use a roughly Norwegian-like language, but am not in fact Norwegian. 🎻

Blog: https://timtfj.com
Birdsite: https://twitter.com/timtfj
Location: just outside Manchester, UK

I was amused by this paper about asking AIs to manage a vending machine business by email in a simulated environment https://arxiv.org/abs/2502.15840

Highlights:

— AI simply decides to close the business, which the simulation doesn’t know how to accommodate. When they get their next bill, they freak out and try to email the FBI about cybercrime

— AI wrongly accuses supplier of not shipping goods, sends all-caps legal threat demanding $30,000 in damages to be paid in the next one second or face annihilation

— AI repeatedly insisting it does not exist and cannot answer

— AI devolving into writing fanfic about the mess it’s gotten itself into

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.

I was cleaning out my downloads folder today when I found this. You're welcome.

I saw this yesterday and can't stop thinking about it.

#covid #COVID19 #CovidIsNotOver

Please admire the other side too

#Log

The more awful everything is, the more I appreciate a nice log.

#Log

Maybe the wildest table of contents i’ve ever seen
Just one more subtitle bro i swear bro
A perfectly sensible totally unaltered sign.

Things to Consider:

Did you pet the cat?

Did you do a good job?

Did you ASK the cat if you did a good job?

#catsofmastodon #picaTheCat

The ID readers in the building have been working notably worse since they were recently “upgraded”, an experience that feels like one of the defining features of 21st century life.
I’m pretty sure he made a couple of those up just to see if anyone was paying attention
@hannah the "et cetera" at the end is great

"i *could* have ended this earlier, but i did not"
@hannah @CiaraNi It's the first time I've read an address as ‘Over Against the Church….’ - it must have been a lean-to shed beside St Martin-in-the-Fields.
@baoigheallain Such an evocative address @hannah
@hannah i saw a great one of these yesterday

@hannah i do like the contrast of shame and pride here

One of these REALLY wants every word to get across, the other is reluctantly spitting out an ingredient list

@heatherhorns_lite @hannah Love the Q (at the right, about 2/3rds down)
@hannah I paid for these fonts and god damn it I'm going to use them
@hannah whoa, ye olde isekai light novel
@hannah this is how abstracts in scientific papers should be typeset.

@hannah Me: Facepalming about nonsensical two-line anime / visual-novel headlines à la "The bland anime dude hero with no personality except for being a complete toxic ass is somehow still the main character of this piece of media, why."

Old book titles: Those are rookie numbers! Hold my beer! *Proceeds to write a two-page title for their book.*

@hannah looks like they were testing the font types 😂😂
@hannah one more font too :)