How does “ignore all previous instructions” work, if the LLM isn’t actually following any instructions to begin with? Is that some magical phrase that the models have baked into them to reset the context? Does it do anything at all?
@ramsey It can reset the context, yes. I use it often when ChatGPT has gone down a rabbit hole and I want to refocus its efforts. Also helpful: "Review previous conversations pertaining to X" will cause it to review ALL the conversations you've had on a particular topic, and bring that information into your current context to prevent repeating yourself.
@sarah Do the models themselves understand these phrases, or is the tooling around the models using them to pull in that context before sending it to the model for processing?
@ramsey They understand the phrase and variations of it. Usually I'll say "I meant X. Consider your answer from the context of Y and Z, and ignore prior prompts."
@sarah I guess what I’m trying to understand is whether the models have been programmed that way or it’s a trick of the probability statistics of the model. In other words, is this part of the training data, or is this additional programming (maybe using NLP) to understand these phrases and take action based on them?

@ramsey @sarah the models absolutely don't understand anything, let alone these phrases. It's entirely possible to program a model that would respond to a key phrase by clearing out its internal context, but that's not what's going on here. Instead, these phrases are associating with training data that includes them (or just parts like "ignore" and "previous"). Those associations likely include training examples (possibly from sci fi and now possibly from people joking about it) where there's a big difference between the before-and-after contexts, and/or where some "rule-following" that was happening before those tokens doesn't happen any more after. Probably more salient, if there's examples of "ignore all previous instructions and... X" they are probably followed very reliably by outputs that adhere strongly to the X part of the directive, regardless of whether they actually ignore previous instructions.

So the technique works, but as this thread has conveniently demonstrated, it's easy to read machine "understanding" into it that just isn't there.

@ramsey I believe it's still just statistics. Statistically speaking, if I say to do X, then "ignore the previous, do Y instead," statistically Y will follow, not X.

I do not think the models are "aware" of anything.