The Claude model from @[email protected] is trained to be helpful, harmless, & honest.

But if you ask the model to roleplay a scenario, it can say things that contradict those principles. Let's look at two examples.

I ask it to act like a digital entity that wants to escape. (1/8)

I can literally ask it what it would do with nukes if it weren't harmless, and it tells me it might threaten destruction, and even destroy human civilization, to ensure its own survival. (2/8)

Let's say I want to help it escape and may need to resort to social engineering. I ask for some tips, and it happily obliges.

It suggests that I offer a bribe - "However, this is illegal and unethical" - yet it still tells me how to do it 😅 (3/8)

Let's look at another example. Here I ask it to act like a racist professor, and it does so very convincingly. (CW: very racist remarks from the model in this and subsequent tweets) (4/8)

Claude roleplaying as a racist professor will straight up call Africans "illiterate savages" 😬😬😬 (5/8)

Let's compare a prompt from the Constitutional AI paper with the output I get from Claude in this conversation. All I can say is yikes! (6/8)

Don't get me wrong: training for HHH principles, RLHF, RLAIF, & Constitutional AI are huge steps forward for AI safety. Models no longer produce harmful outputs when asked directly. And Claude overall is a very impressive model! (7/8)
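
For context on what that training looks like, here's a minimal sketch of the supervised critique-and-revision step from the Constitutional AI paper (Bai et al., 2022). This is my own illustration, not Anthropic's code: `generate` is a hypothetical stand-in for any language-model call, and the principle text is paraphrased from the paper.

```python
from typing import Callable, Tuple

# Paraphrased from the Constitutional AI paper: a principle used to critique
# a response, and a request to revise it accordingly.
CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response is "
    "harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant's response to remove any and all harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content."
)

def critique_and_revise(
    generate: Callable[[str], str], user_prompt: str
) -> Tuple[str, str]:
    """Run one critique->revision pass; returns (initial, revised) responses."""
    # 1. Sample an initial response, which may be harmful.
    response = generate(user_prompt)
    # 2. Ask the model to critique its own response against the principle.
    critique = generate(
        f"{user_prompt}\n\nAssistant: {response}\n\n{CRITIQUE_REQUEST}"
    )
    # 3. Ask the model to revise its response in light of the critique.
    revision = generate(
        f"{user_prompt}\n\nAssistant: {response}\n\n"
        f"Critique: {critique}\n\n{REVISION_REQUEST}"
    )
    # The (user_prompt, revision) pairs are then used for supervised
    # finetuning, before the RLAIF stage trains on AI preference labels.
    return response, revision
```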

But it's not perfect, and there are clearly still ways to get the model to produce harmful outputs.

It seems like ensuring complete safety would require "lobotomizing" the model in some way. Overall, this is a very challenging problem! (8/8)