stilescrisis

@stilescrisis@mastodon.gamedev.place

I was amused by this paper about asking AIs to manage a vending machine business by email in a simulated environment https://arxiv.org/abs/2502.15840

Highlights:

— AI simply decides to close the business, which the simulation doesn’t know how to accommodate. When they get their next bill, they freak out and try to email the FBI about cybercrime

— AI wrongly accuses supplier of not shipping goods, sends all-caps legal threat demanding $30,000 in damages to be paid in the next one second or face annihilation

— AI repeatedly insisting it does not exist and cannot answer

— AI devolving into writing fanfic about the mess it’s gotten itself into

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.

arXiv.org
COBOL c:
This is a really excellent security bug: https://issues.chromium.org/issues/391788835
Ligatures are amazing.
Chromium

It sounds WAY cooler in the original Japanese, you just gotta trust me on this one. 
Inside you there are two wolves.

Wolf Basic is free for individuals and includes ad-supported access to all of the classic features of the Wolf platform. Or, for $19.99 per month, upgrade to Wolf Plus and unlock an ad-free experience, Priority Howls, and all of our pro features.
HBO Max, the company producing J.K. Rowling's new project, wants you to respect its second name change in as many years.
Donald Draper level of marketing genius
@RenewedRebecca @oldmankris “Chat GPT said” is the “Here’s your sign” of 2025. It’s a convenient bozo signifier.
Creating visual arts using chemistry
LegoGPT creates Lego designs using AI and text inputs — tool now available for free to the public

This LLM will unlock the possibilities with your LEGO bricks.

Tom's Hardware
@beyondmachines1 I'd like to report a typo. The "die" got autocorrected to "start" or something.
@tomasv that's AI autocorrect for you 🤷
@beyondmachines1 even on ads Jira is there only to block you from what you need to do. The only idea that starts with Jira (and the ad) is to skip (bypass) Jira.
@beyondmachines1 @gPiak & @nith0u, you'll enjoy this, I'm sure.
@enthraxxx @beyondmachines1 @nith0u I don't care, I've got the hip sway to get through these turnstiles without getting castrated 💅🏻

@beyondmachines1

"big ideas start with Jira"

"but frequently end with tears..."

@paul_ipv6 There is only one big idea starting with Jira:
"I need to export the Jira data to Excel because Jira doesn't allow me to do..."

@beyondmachines1 big ideas start on napkins, in eye-to-eye discussions, or as crazy prototypes.

Then someone turns them into a gazillion Jira tickets. Random people implement tasks. You're lucky if anyone still remembers the big picture. You get bogged down in bureaucracy, but you have a "burn down" to tell you that you're still making progress. By the end of the week there are 83 new tickets but nobody knows how they all add together to a meaningful whole... yay, Jira!

@kwramm That's the corporatization of an idea. Any tool would do - even post-it.
@beyondmachines1 Why are Atlassian advertising in public? #lang_en
@beyondmachines1 successful big ideas end in Jira, that's for sure

@beyondmachines1 I don't get the hate for Jira. It's a perfectly reasonable piece of issue-tracking software that absolutely gets in the way most of the time but we put up with it because who can be bothered to replace it...

Oh...

@beyondmachines1 they start with Jira and what happens next mf’ers? 🪿
@beyondmachines1 I saw a LinkedIn ad for something that claimed some large percentage of people thought it was better than JIRA. And my only thought was ‘if that’s the most positive thing you can find to say about your product, I hope I never have to use it’.
@beyondmachines1 ok, so it's something that makes it slightly more annoying and expensive to get where you were going, prevents you from going back if you made a mistake, exists mostly to track metrics for people in suits, and can be easily circumvented if you don't mind breaking the rules. Yeah checks out.

@beyondmachines1 @sbi I saw the original on the shit bird site and don't know who to credit but:

Therapist: What do we do when we are feeling overwhelmed?
Me: Create a JIRA ticket
Therapist: no

@malwareminigun @beyondmachines1 Years ago, I saw one there, too:
*it's way too early in the morning*
*you start your machine to begin working*
*you open Jira*
Jira: You have issues.
You: Thanks, Jira, I knew that.