Mastodawn

magikarpz Mar 7

And another photo of Merry not being very happy with being recorded when I’m out
#catsofmastodon #caturday #cats

magikarpz Mar 7

Finally managed to get a pretty portrait of Merry! Super happy with that photo
#catsofmastodon #caturday #cats

magikarpz Feb 14

Michi Feb 13

Desperate times truly call for desperate measures, huh...

#AlternateTimelineTech

magikarpz Feb 13

Anders Eknert Feb 12

AI agent "contributes" PR to matplotlib.
PR gets rejected.
AI agent *writes and publishes blog to shame the maintainer*.

What a time to be alive.

https://github.com/matplotlib/matplotlib/pull/31132

magikarpz Feb 3

Say hello to Merry. This is him after annihilating every plant in his vicinity, turning on the tap and letting all the hot water out, and getting high af on matatabi sticks. Love him. #catsofmastodon #caturday

magikarpz Jul 13, 2025

Ben Werdmuller Jul 13, 2025

Global Majority nations are building ways to store their citizens' data locally. But will they own the datacenters themselves? #Technology https://werd.io/why-big-tech-is-threatened-by-a-global-push-for-data-sovereignty/

Why Big Tech is threatened by a global push for data sovereignty

Werd I/O

magikarpz Jun 5, 2025

VNC Resolver Jun 4, 2025

IP/Port: 99.251.254.190:5900
Hostname: pool-99-251-254-190.cpe.net.cable.rogers.com
Client Name: chipi chipi chapa chapa
Location: Willowdale, Ontario, CA 🇨🇦
ASN: AS812 Rogers Communications Canada Inc.
VNC Password: N/A
ID: 23981179
Added to DB: 05/06/2025, 10:39:46 PM (UTC)
Last seen: 05/06/2025, 06:52:41 PM (UTC)
https://computernewb.com/vncresolver/browse#id/23981179

magikarpz Jun 3, 2025

Mark Pauley Jun 3, 2025

@hailey let’s be clear: this is 100% the kind of thing that happens when we do a full rewrite. It’s just that LLMs make doing a full rewrite much less expensive, so people are going to do it more often.

magikarpz May 27, 2025

abadidea May 26, 2025

I was amused by this paper about asking AIs to manage a vending machine business by email in a simulated environment https://arxiv.org/abs/2502.15840

Highlights:

— AI simply decides to close the business, which the simulation doesn’t know how to accommodate. When they get their next bill, they freak out and try to email the FBI about cybercrime

— AI wrongly accuses supplier of not shipping goods, sends all-caps legal threat demanding $30,000 in damages to be paid in the next one second or face annihilation

— AI repeatedly insisting it does not exist and cannot answer

— AI devolving into writing fanfic about the mess it’s gotten itself into

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.

arXiv.org

magikarpz May 16, 2025

Jason Gorman May 16, 2025

I agree with Douglas Adams. We call it "tech" until it just works, and then we stop noticing it.

I aspire not to work in "tech".