Mastodawn

"And crucially, the air-gap defense collapses too. Activation doesn’t need an attacker-controlled input at inference time. It happens during your training run, before any external traffic ever touches the model. By the time the model serves its first real query, the trigger has already fired. An isolated, internal-only deployment is just as exposed as a public endpoint." https://shmulc.substack.com/p/how-to-turn-your-llm-into-a-sleeper

Sleeper Agent LLMs: Backdoors That Wake During Fine-Tuning

A new class of LLM attack hides backdoors in open-weight models. The models pass safety evals, then the backdoors activate during downstream fine-tuning. Here's how FAB works.

AI Superhero