Mastodawn

Also, the form that appears in the article isn't really a joke. A big part of what makes the original funny isn't just the form of the "attack" but the content itself, in particular the contrast between the formality of "disregard that" and the vulgarity of "I suck cocks". If it hadn't been so vulgar, or if it had said "ignore" instead of "disregard", it wouldn't be so funny.

Edit: Also part of what makes it funny how succinct and sudden it is. I think actually it would still be funny with "ignore" instead of "disregard", but it would be lessened a bit.

Show thread

arcfour 3d ago

I'm glad I wasn't alone in finding it ridiculous/annoying. The version in the post isn't even a joke anymore...

Show thread

wenldev 3d ago

I think a big part of mitigating this will probably be requiring multiple agents to think and achieve consensus before significant actions. Like planes with multiple engines

Show thread

stingraycharles 3d ago

I didn’t see the article talk specifically about this, or at least not in enough detail, but isn’t the de-facto standard mitigation for this to use guardrails which lets some other LLM that has been specifically tuned for these kind of things evaluate the safety of the content to be injected?

There are a lot of services out there that offer these types of AI guardrails, and it doesn’t have to be expensive.

Not saying that this approach is foolproof, but it’s better than relying solely on better prompting or human review.

Show thread

mannanj 3d ago

The article does mention this and a weakness of that approach is mentioned too.

Show thread

crisnoble 3d ago

Perhaps they asked AI to summarize the article for them and it stopped after the first "disregard that" it read into its context window.

Show thread

wbeckler 3d ago

The article didn't describe how the second AI is tuned to distrust input and scan it for "disregard that." Instead it showed an architecture where a second AI accepts input from a naively implemented firewall AI that isn't scanning for "disregard that"

Show thread

simojo 3d ago

Today I scheduled a dentist appointment over the phone with an LLM. At the end of the call, I prompted it with various math problems, all of which it answered before politely reminding me that it would prefer to help me with "all things dental."

It did get me thinking the extent to which I could bypass the original prompt and use someone else's tokens for free.

Show thread

marcus_holmes 3d ago

The hypothetical approach I've heard of is to have two context windows, one trusted and one untrusted (usually phrased as separating the system prompt and the user prompt).

I don't know enough about LLM training or architecture to know if this is actually possible, though. Anyone care to comment?