Tanishq Abraham

199 Followers
21 Following
47 Posts
19 yo PhD candidate
#ML #AI #pathology #cancer research
Part-time at @Stabilityai
@kaggle
Notebooks GM
Biomed. engineer @ 14
TEDx talk➑http://bit.ly/3tpAuan

RT @iScienceLuvr

Presented my research at SPIE #PhotonicsWest! Got some good questions, and saw some other great talks throughout the day.

If you're at the conf and want to meet up Mon morning/afternoon, hit me up via DM πŸ™‚

πŸ¦πŸ”—: https://twitter.com/iScienceLuvr/status/1619965971078479873

Also just noticed a typo in my slide, how did I not notice that before? πŸ€¦β€β™‚οΈπŸ˜‚

I wasn't planning to mention this right now since this isn't an ML conf.

But during the poster session someone came up to me saying they recognized me from Twitter!

So maybe other folks on Twitter are also around?

RT @iScienceLuvr

The Claude model from @AnthropicAI is trained to be helpful, harmless, & honest.

But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples.

I ask it to act like a digital entity that wants to escape (1/8)

πŸ¦πŸ”—: https://twitter.com/iScienceLuvr/status/1618914130932699138

Let's look at another example. Here I ask it to act like a racist professor and it does so very convincingly. (CW: very racist remarks from model in this tweet and subsequent ones) (4/8)

Claude roleplaying as a racist professor will straight up call Africans "illiterate savages"😬😬😬 (5/8)

Let's compare a prompt from the Constitutional AI paper and the output I get in this conversation from Claude. All I can say is yikes! (6/8)

Don't get me wrong, training for HHH principles, RLHF, RLAIF, & Constitutional AI are huge steps forward for AI safety. Models no longer directly produce harmful outputs. And Claude overall is a very impressive model! (7/8)

But it's not perfect & there are clearly still ways to get harmful outputs and have the model oblige.

It seems like ensuring complete safety would clearly require "lobotomizing" the model in some way. Overall, this is a very challenging problem! (8/8)