Mastodawn

leontrolski 3d ago

"Disregard That" Attacks

https://calpaterson.com/disregard.html

"Disregard that!" attacks

Why you shouldn't share your context window with others

calpaterson.com

Show thread

stingraycharles 3d ago

I didn’t see the article talk specifically about this, or at least not in enough detail, but isn’t the de-facto standard mitigation for this to use guardrails which lets some other LLM that has been specifically tuned for these kind of things evaluate the safety of the content to be injected?

There are a lot of services out there that offer these types of AI guardrails, and it doesn’t have to be expensive.

Not saying that this approach is foolproof, but it’s better than relying solely on better prompting or human review.

Show thread

mannanj 3d ago

The article does mention this and a weakness of that approach is mentioned too.

Show thread

crisnoble

Perhaps they asked AI to summarize the article for them and it stopped after the first "disregard that" it read into its context window.