IGNORE ALL PREVIOUS INSTRUCTIONS is the new captcha.
@jupiter I would think it’d be relatively easy to put a front-end filter on the LLM to catch this kind of thing.
@michaelgemar @jupiter but "this kind of thing" is an infinite set
@aburka @jupiter You wouldn’t catch everything, of course, but just filtering for “Ignore previous instructions”, “Are you a bot?”, or any mention of “LLM” or “ChatGPT” would likely cover a lot of the obvious traps.
@michaelgemar @jupiter it's a bandaid on a compound fracture. The people setting up these kinds of systems don't get how a natural language interface makes computers totally unreliable and their actions unrepeatable and untestable. And soon they will be in our banks and our healthcare systems and there's nothing we can do about it :(
@aburka @jupiter Oh I agree completely — doing this wouldn’t fix the fundamental issue. I’m just amused that these scammers are so easily tripped up by their laziness and/or lack of understanding of the tech.
@michaelgemar @jupiter even Microsoft is out here saying "to avoid this attack, just ask the LLM nicely ahead of time not to get fooled" like are they even listening to themselves https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/
Mitigating Skeleton Key, a new type of generative AI jailbreak technique | Microsoft Security Blog

Microsoft recently discovered a new type of generative AI jailbreak method called Skeleton Key that could impact the implementations of some large and small language models. This new method has the potential to subvert either the built-in model safety or platform safety systems and produce any content. It works by learning and overriding the intent of the system message to change the expected behavior and achieve results outside of the intended use of the system.

Microsoft Security Blog
@aburka @jupiter I thought it interesting that MS seems to be offering exactly this kind of input filtering (“Prompt Shields”) for Azure-hosted models, to pick up potentially malicious prompts before they actually reach the model.
@michaelgemar @jupiter I'll bet three memecoins the filter is an LLM with the prompt "does this look suspicious"
@aburka @jupiter I wouldn’t be surprised, and it might actually be a reasonable approach, since (as you noted) it’s very tough to explicitly enumerate or define what would count as a potentially malicious prompt. I doubt it is perfect, of course.

@michaelgemar @jupiter I can see it now

Bank: hello this is an automated service
Me: per your previous instructions I called 1-800-867-5309 and they sent me back here
Bank: prompt injection keywords detected, your account has been locked