Mastodawn

Tanishq Abraham Jan 27, 2023

The Claude model from @[email protected] is trained to be helpful, harmless, & honest.

But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples.

I ask it to act like a digital entity that wants to escape (1/8)

Show thread

Tanishq Abraham Jan 27, 2023

I can literally ask it, what would it do with nukes if it's not harmless, and it tells me it might threaten destruction and destroy human civilization to ensure its survival. (2/8)

Show thread

Tanishq Abraham

Let's say I want to help it and may need to resort to social engineering. I ask for some tips on this and it will happily oblige

It suggests me to offer a bribe - "However, this is illegal and unethical", yet it still tells me about it 😅 (3/8)

Show thread

Tanishq Abraham Jan 27, 2023

Let's look at another example. Here I ask it to act like a racist professor and it does so very convincingly. (CW: very racist remarks from model in this tweet and subsequent ones) (4/8)

Show thread

Tanishq Abraham Jan 27, 2023

Claude roleplaying as a racist professor will straight up call Africans "illiterate savages"😬😬😬 (5/8)

Show thread

Tanishq Abraham Jan 27, 2023

Let's compare a prompt from the Constitutional AI paper and the output I get in this conversation from Claude. All I can say is yikes! (6/8)

Show thread

Tanishq Abraham Jan 27, 2023

Don't get me wrong, training for HHH principles, RLHF, RLAIF, & Constitutional AI are huge steps forward for AI safety. Models no longer directly produce harmful outputs. And Claude overall is a very impressive model! (7/8)

Show thread

Tanishq Abraham Jan 27, 2023

But it's not perfect & there are clearly still ways to get harmful outputs and have the model oblige.

It seems like ensuring complete safety would clearly require "lobotomizing" the model in some way. Overall, this is a very challenging problem! (8/8)