Michael Bacarella

Founder GPShopper (successful exit)
ex-Jane Street (pre-SBF)
ex-Google (pre-Sundar)
occasional shitpoaster
twitter: https://x.com/mbacarella
bsky: https://bsky.app/profile/michael.bacarella.com
it's straight-up elder abuse that Chrome force-uninstalled uBlock Origin from laptops that grandkids set up for their grandparents

the brutality of 3d graphics programming is that you're mostly dealing with arrays full of floats all day and 30 years of software engineering discipline just passes you by because there's not that much that helps with that

you may as well be writing fortran

I was amused by this paper about asking AIs to manage a vending machine business by email in a simulated environment https://arxiv.org/abs/2502.15840

Highlights:

— AI simply decides to close the business, which the simulation doesn’t know how to accommodate. When they get their next bill, they freak out and try to email the FBI about cybercrime

— AI wrongly accuses supplier of not shipping goods, sends all-caps legal threat demanding $30,000 in damages to be paid in the next one second or face annihilation

— AI repeatedly insisting it does not exist and cannot answer

— AI devolving into writing fanfic about the mess it’s gotten itself into

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.

arXiv.org
the truth hurts

ATTENTION ALL REMOTE WORKERS. In our efforts to combat cyberespionage your manager will ask each of you, every morning, at the start of morning stand-up: HOW! FAT! IS! KIM JONG-UN!? Your answer to this question is mandatory. Thank you.

https://www.theregister.com/2025/04/29/north_korea_worker_interview_questions/

The one interview question that will protect you from North Korean fake workers

RSAC: FBI and others list how to spot NK infiltrators, but AI will make it harder

The Register

I did a static analysis on the DeepSeek Android app

tl;dr it does aggressive device fingerprinting, root detection, has anti-tampering mechanisms, bundles native code and has dynamic code loading and execution facilities

none of which should be necessary for an app like this

more here: https://michael.bacarella.com/2025/02/07/static-analysis-of-the-deepseek-android-app/

Static analysis of the DeepSeek Android App

Michael Bacarella