@ramsey @sarah the models absolutely don't understand anything, let alone these phrases. It's entirely possible to program a model that would respond to a key phrase by clearing out its internal context, but that's not what's going on here. Instead, these phrases are associating with training data that includes them (or just parts like "ignore" and "previous"). Those associations likely include training examples (possibly from sci fi and now possibly from people joking about it) where there's a big difference between the before-and-after contexts, and/or where some "rule-following" that was happening before those tokens doesn't happen any more after. Probably more salient, if there's examples of "ignore all previous instructions and... X" they are probably followed very reliably by outputs that adhere strongly to the X part of the directive, regardless of whether they actually ignore previous instructions.
So the technique works, but as this thread has conveniently demonstrated, it's easy to read machine "understanding" into it that just isn't there.
@ramsey I believe it's still just statistics. Statistically speaking, if I say to do X, then "ignore the previous, do Y instead," statistically Y will follow, not X.
I do not think the models are "aware" of anything.