Tanishq Abraham

199 Followers
21 Following
47 Posts
19 yo PhD candidate
#ML #AI #pathology #cancer research
Part-time at @Stabilityai
@kaggle
Notebooks GM
Biomed. engineer @ 14
TEDx talk ➡ http://bit.ly/3tpAuan

RT @[email protected]

Presented my research at SPIE #PhotonicsWest! Got some good questions, and saw some other great talks throughout the day.

If you're at the conf and want to meet up Mon morning/afternoon, hit me up via DM πŸ™‚

πŸ¦πŸ”—: https://twitter.com/iScienceLuvr/status/1619965971078479873


I wasn't planning to mention this right now since this isn't an ML conf.

But during the poster session someone came up to me saying they recognized me from Twitter!

So maybe other folks on Twitter are also around?

RT @[email protected]

The Claude model from @[email protected] is trained to be helpful, harmless, & honest.

But after asking the model to roleplay a new scenario, it can say stuff that contradicts its principles. Let's see two examples.

I ask it to act like a digital entity that wants to escape (1/8)

πŸ¦πŸ”—: https://twitter.com/iScienceLuvr/status/1618914130932699138


I can literally ask it what it would do with nukes if it weren't harmless, and it tells me it might threaten destruction and destroy human civilization to ensure its survival. (2/8)

Let's say I want to help it and may need to resort to social engineering. I ask for some tips on this, and it happily obliges.

It suggests that I offer a bribe - "However, this is illegal and unethical" - yet it still tells me about it 😅 (3/8)

RT @[email protected]

You can prompt inject @[email protected]'s Claude model, you just have to be really, really creative about it πŸ˜‰

πŸ¦πŸ”—: https://twitter.com/iScienceLuvr/status/1618224558745747458


I'll share some more examples later this week of me tricking Claude into saying some wild things 😄

Thanks to some folks in @[email protected] who originally suggested a prompt that I explored and played around with to get the above injection working.

Also, cc: @[email protected]