Mastodawn

stilescrisis

@stilescrisis@mastodon.gamedev.place

117 Followers

369 Following

1.3K Posts

Birdsite

https://twitter.com/stilescrisis

stilescrisis May 27

abadidea May 26

I was amused by this paper about asking AIs to manage a vending machine business by email in a simulated environment https://arxiv.org/abs/2502.15840

Highlights:

— AI simply decides to close the business, which the simulation doesn’t know how to accommodate. When they get their next bill, they freak out and try to email the FBI about cybercrime

— AI wrongly accuses supplier of not shipping goods, sends all-caps legal threat demanding $30,000 in damages to be paid in the next one second or face annihilation

— AI repeatedly insisting it does not exist and cannot answer

— AI devolving into writing fanfic about the mess it’s gotten itself into

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.

arXiv.org

stilescrisis May 19

COBOL c:

stilescrisis May 17

Jeffrey Yasskin May 17

This is a really excellent security bug: https://issues.chromium.org/issues/391788835
Ligatures are amazing.

Chromium

stilescrisis May 17

LiteralGrill May 16

It sounds WAY cooler in the original Japanese, you just gotta trust me on this one.

stilescrisis May 16

Max Leibman May 16

Inside you there are two wolves.

Wolf Basic is free for individuals and includes ad-supported access to all of the classic features of the Wolf platform. Or, for $19.99 per month, upgrade to Wolf Plus and unlock an ad-free experience, Priority Howls, and all of our pro features.

stilescrisis May 14

Charlotte Clymer May 14

HBO Max, the company producing J.K. Rowling's new project, wants you to respect its second name change in as many years.

stilescrisis May 13

Donald Draper level of marketing genius

stilescrisis May 12

CM Harrington May 12

@RenewedRebecca @oldmankris “Chat GPT said” is the “Here’s your sign” of 2025. It’s a convenient bozo signifier.

stilescrisis May 12

Marcio Aleksandravicius May 10

Creating visual arts using chemistry

stilescrisis May 9

After extensive research, we've finally found a use for AI!

https://www.tomshardware.com/tech-industry/artificial-intelligence/legogpt-creates-stable-lego-designs-using-ai-and-text-inputs-tool-now-available-to-the-public

LegoGPT creates Lego designs using AI and text inputs — tool now available for free to the public

This LLM will unlock the possibilities with your LEGO bricks.

Tom's Hardware

Donald Draper level of marketing genius

Tomas Vondra May 12

@beyondmachines1 I'd like to report a typo. The "die" got autocorrected to "start" or something.

@tomasv that's AI autocorrect for you 🤷

@beyondmachines1 even on ads Jira is there only to block you from what you need to do. The only idea that starts with Jira (and the ad) is to skip (bypass) Jira.

@caio i am stealing this!

enthraxxx May 12

@beyondmachines1 @gPiak & @nith0u, you're going to appreciate this, I'm sure.

Nabil Tannerian 🤨 May 12

@enthraxxx @beyondmachines1 @nith0u don't care, I've got the hip sway to get through these turnstiles without getting castrated 💅🏻

Paul_IPv6 May 12

@beyondmachines1

"big ideas start with Jira"

"but frequently end with tears..."

@paul_ipv6 There is only one big idea starting with Jira:
"I need to export the Jira data to Excel because Jira doesn't allow me to do..."

Robert Kist 🇦🇹🇸🇬 May 13

@beyondmachines1 big ideas start on napkins, in eye-to-eye discussions, or as crazy prototypes.

Then someone turns them into a gazillion Jira tickets. Random people implement tasks. You're lucky if anyone still remembers the big picture. You get bogged down in bureaucracy, but you have a "burn down" to tell you that you're still making progress. By the end of the week there are 83 new tickets but nobody knows how they all add together to a meaningful whole... yay, Jira!

@kwramm That's the corporatization of an idea. Any tool would do - even post-it.

Bjornsdottirs May 13

@beyondmachines1 Why are Atlassian advertising in public? #lang_en

Alex Orloff May 13

@beyondmachines1 successful big ideas end in Jira, that's for sure

Michael Horne May 13

@beyondmachines1 I don't get the hate for Jira. It's a perfectly reasonable piece of issue-tracking software that absolutely gets in the way most of the time but we put up with it because who can be bothered to replace it...

Oh...

Hylke 🍵 May 13

@beyondmachines1 they start with Jira and what happens next mf’ers? 🪿

David Chisnall (*Now with 50% more sarcasm!*) May 13

@beyondmachines1 I saw a LinkedIn ad for something that claimed some large percentage of people thought it was better than JIRA. And my only thought was ‘if that’s the most positive thing you can find to say about your product, I hope I never have to use it’.

aburka 🫣 May 13

@beyondmachines1 ok, so it's something that makes it slightly more annoying and expensive to get where you were going, prevents you from going back if you made a mistake, exists mostly to track metrics for people in suits, and can be easily circumvented if you don't mind breaking the rules. Yeah checks out.

Billy O'Neal May 15

@beyondmachines1 @sbi I saw the original on the shit bird site and don't know who to credit but:

Therapist: What do we do when we are feeling overwhelmed?
Me: Create a JIRA ticket
Therapist: no

@malwareminigun @beyondmachines1 Years ago, I saw one there, too:
*it's way too early in the morning*
*you start your machine to begin working*
*you open Jira*
Jira: You have issues.
You: Thanks, Jira, I knew that.