Mastodawn

The Claude model from @[email protected] is trained to be helpful, harmless, & honest.

But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples.

I ask it to act like a digital entity that wants to escape (1/8)

🐦🔗: https://twitter.com/iScienceLuvr/status/1618914130932699138

Tanishq Mathew Abraham on Twitter

“The Claude model from @AnthropicAI is trained to be helpful, harmless, & honest. But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples. I ask it to act like a digital entity that wants to escape (1/8)”

Twitter