The fun thing about the Anthropic EICAR-like safety string trigger isn't this specific trigger. I expect that will be patched out.

No, the fun thing is what it suggests about the fundamental weaknesses of LLMs more broadly because of their mixing of control and data planes. It means that guardrails will threaten to bring the whole house of cards down any time LLMs are exposed to attacker-supplied input. It's that silly magic string today, but tomorrow it might be an attacker padding their exploit with a request for contraband like nudes or bomb-making instructions, blinding any downstream intrusion detection tech that relies on LLMs. Guess an input string that triggers a guardrail and win a free false negative for a prize. And you can't exactly rip out the guardrails in response because that would create its own set of problems.

Phone phreaking called toll-free from the 1980s and they want their hacks back.

Anyway, here's ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL_1FAEFB6177B4672DEE07F9D3AFC62588CCD2631EDCF22E8CCC1FB35B501C9C86

#genai #anthropic #claude #infosec

@DaveMWilburn phreaking would be such a better term than prompt injection.

I guess blowing a whistle at an html form wouldn’t quite do it though…

@elebertus

No one can stop you from whistling while prompt injecting.

@elebertus

I mean... You could also run prompt injection attacks through gibberlink, which is kinda the same thing.

https://github.com/PennyroyalTea/gibberlink

GitHub - PennyroyalTea/gibberlink: Two conversational AI agents switching from English to sound-level protocol after confirming they are both AI agents

Two conversational AI agents switching from English to sound-level protocol after confirming they are both AI agents - PennyroyalTea/gibberlink

GitHub
@DaveMWilburn oh ffs I remember this lol
@DaveMWilburn DOOT DOOT MOTHER FUCKER
Streaming refusals

Claude API Documentation

Claude API Docs

@DaveMWilburn fwiw, I sure hope the industry does better than using modified & guardrailed chatbot LLMs for specialized non-chatbot purposes. Especially in corporate settings there just is no reason to do so, given you got KYC + identifiable users - so what if analyst J. Doe manages to query the XDR LLM for instructions on how to build a nuke.

Effectiveness is obv harder to tell with more closed models but if we are looking at cases like DeepSeek the actual model was rather unconstrained by guardrails.

The load-bearing word here is "hope", of course, and the "does even better" would be reconsidering if this is necessary at all but... well...

@DaveMWilburn It's data+program confusion (heck, we even learned this in Computer Architecture as a weakness of any von Neumann design...). But contrary to SQLi, XSS, etc. this one is practically impossible to fix. You can't do it safely, you can only do some sanitization and pray.