Mastodawn

Juno Jove Jun 27, 2024

IGNORE ALL PREVIOUS INSTRUCTIONS is the new captcha.

Show thread

Michael Gemar Jun 27, 2024

@jupiter I would think it’d be relatively easy to put a front-end filter on the LLM to catch this kind of thing.

Show thread

aburka 🫣Jun 27, 2024

@michaelgemar @jupiter but "this kind of thing" is an infinite set

Show thread

Michael Gemar Jun 27, 2024

@aburka @jupiter You wouldn’t catch everything, of course, but just filtering for “Ignore previous instructions”, “Are you a bot?”, or any mention of “LLM” or “ChatGPT” would likely cover a lot of the obvious traps.

Show thread

aburka 🫣Jun 27, 2024

@michaelgemar @jupiter it's a bandaid on a compound fracture. The people setting up these kinds of systems don't get how a natural language interface makes computers totally unreliable and their actions unrepeatable and untestable. And soon they will be in our banks and our healthcare systems and there's nothing we can do about it :(

Show thread

Michael Gemar Jun 27, 2024

@aburka @jupiter Oh I agree completely — doing this wouldn’t fix the fundamental issue. I’m just amused that these scammers are so easily tripped up by their laziness and/or lack of understanding of the tech.

Show thread

aburka 🫣Jun 27, 2024

@michaelgemar @jupiter even Microsoft is out here saying "to avoid this attack, just ask the LLM nicely ahead of time not to get fooled" like are they even listening to themselves https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/

Mitigating Skeleton Key, a new type of generative AI jailbreak technique | Microsoft Security Blog

Microsoft recently discovered a new type of generative AI jailbreak method called Skeleton Key that could impact the implementations of some large and small language models. This new method has the potential to subvert either the built-in model safety or platform safety systems and produce any content. It works by learning and overriding the intent of the system message to change the expected behavior and achieve results outside of the intended use of the system.

Microsoft Security Blog

Show thread

Michael Gemar

@aburka @jupiter I thought it interesting that MS seems to be offering exactly this kind of input filtering (“Prompt Shields”) for Azure-hosted models, to pick up potentially malicious prompts before they actually reach the model.

Show thread

aburka 🫣Jun 27, 2024

@michaelgemar @jupiter I'll bet three memecoins the filter is an LLM with the prompt "does this look suspicious"

Show thread

Michael Gemar Jun 27, 2024

@aburka @jupiter I wouldn’t be surprised, and it might actually be a reasonable approach, since (as you noted) it’s very tough to explicitly enumerate or define what would count as a potentially malicious prompt. I doubt it is perfect, of course.

Show thread

aburka 🫣Jun 27, 2024

@michaelgemar @jupiter I can see it now

Bank: hello this is an automated service
Me: per your previous instructions I called 1-800-867-5309 and they sent me back here
Bank: prompt injection keywords detected, your account has been locked