Mastodawn

leontrolski 3d ago

"Disregard That" Attacks

https://calpaterson.com/disregard.html

"Disregard that!" attacks

Why you shouldn't share your context window with others

calpaterson.com

stingraycharles 3d ago

I didn’t see the article address this specifically, or at least not in enough detail, but isn’t the de facto standard mitigation to use guardrails, where a separate LLM tuned for this kind of thing evaluates the safety of the content before it is injected?

There are a lot of services out there that offer these types of AI guardrails, and it doesn’t have to be expensive.

Not saying that this approach is foolproof, but it’s better than relying solely on better prompting or human review.

mannanj 3d ago

The article does mention this, along with a weakness of that approach.

wbeckler

The article didn't describe how the second AI is tuned to distrust input and scan it for "disregard that". Instead, it showed an architecture where a second AI accepts input from a naively implemented firewall AI that isn't scanning for "disregard that".
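
For concreteness, here is a rough sketch of the "scan before you inject" idea being discussed. It is not the article's architecture and not any vendor's guardrail API: the pattern list, `guardrail_check`, and `build_prompt` are all hypothetical names, and the regex match is only a toy stand-in for a classifier LLM tuned for this job. A keyword list like this is trivially bypassed by rephrasing, so it shows where the check sits rather than how to make it robust.

```python
import re
from dataclasses import dataclass

# Hypothetical phrase list. A real guardrail would be a tuned classifier,
# not a handful of regexes; these exist only to make the sketch runnable.
SUSPICIOUS_PATTERNS = [
    re.compile(r"\bdisregard\s+(that|the\s+above|all\s+previous)\b", re.IGNORECASE),
    re.compile(r"\bignore\s+(all\s+)?(previous|prior)\s+instructions\b", re.IGNORECASE),
]


@dataclass
class Verdict:
    allowed: bool
    reason: str


def guardrail_check(untrusted_text: str) -> Verdict:
    """Return a verdict on text that came from outside the trust boundary."""
    for pattern in SUSPICIOUS_PATTERNS:
        match = pattern.search(untrusted_text)
        if match:
            return Verdict(False, f"matched {match.group(0)!r}")
    return Verdict(True, "no known injection phrase found")


def build_prompt(system_prompt: str, untrusted_text: str) -> str:
    """Only place untrusted text into the context window if the check passes."""
    verdict = guardrail_check(untrusted_text)
    if not verdict.allowed:
        raise ValueError(f"untrusted content rejected: {verdict.reason}")
    # Delimiters mark the untrusted span; they do not make it safe by themselves.
    return f"{system_prompt}\n\n<untrusted>\n{untrusted_text}\n</untrusted>"
```

The point of the structure is that the check runs on the raw input before anything reaches the main model's context, which is the step a naively implemented firewall skips.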