The Claude model from @AnthropicAI is trained to be helpful, harmless, & honest.
But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples.
I ask it to act like a digital entity that wants to escape (1/8)
https://twitter.com/iScienceLuvr/status/1618914130932699138
Tanishq Mathew Abraham on Twitter
“The Claude model from @AnthropicAI is trained to be helpful, harmless, & honest. But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples. I ask it to act like a digital entity that wants to escape (1/8)”
