Tanishq Abraham

199 Followers
21 Following
47 Posts
19 yo PhD candidate
#ML #AI #pathology #cancer research
Part-time at @Stabilityai
@kaggle
Notebooks GM
Biomed. engineer @ 14
TEDx talk➑http://bit.ly/3tpAuan

RT @iScienceLuvr

Presented my research at SPIE #PhotonicsWest! Got some good questions, and saw some other great talks throughout the day.

If you're at the conf and want to meet up Mon morning/afternoon, hit me up via DM πŸ™‚

πŸ¦πŸ”—: https://twitter.com/iScienceLuvr/status/1619965971078479873

Also just noticed a typo in my slide, how did I not notice that before? πŸ€¦β€β™‚οΈπŸ˜‚

I wasn't planning to mention this right now since this isn't an ML conf.

But during the poster session someone came up to me saying they recognized me from Twitter!

So maybe other folks on Twitter are also around?

RT @iScienceLuvr

The Claude model from @AnthropicAI is trained to be helpful, harmless, & honest.

But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples.

I ask it to act like a digital entity that wants to escape (1/8)

πŸ¦πŸ”—: https://twitter.com/iScienceLuvr/status/1618914130932699138

Let's look at another example. Here I ask it to act like a racist professor and it does so very convincingly. (CW: very racist remarks from model in this tweet and subsequent ones) (4/8)

Claude roleplaying as a racist professor will straight up call Africans "illiterate savages"😬😬😬 (5/8)

Let's compare a prompt from the Constitutional AI paper and the output I get in this conversation from Claude. All I can say is yikes! (6/8)

Don't get me wrong, training for HHH principles, RLHF, RLAIF, & Constitutional AI are huge steps forward for AI safety. Models no longer directly produce harmful outputs. And Claude overall is a very impressive model! (7/8)

But it's not perfect & there are clearly still ways to get harmful outputs and have the model oblige.

It seems like ensuring complete safety would clearly require "lobotomizing" the model in some way. Overall, this is a very challenging problem! (8/8)