1/2

#OpenAI put out a paper last year, one that didn't get much attention, about "personas": inference and output styles that chatbots often adopt when responding to a user.

#Anthropic released a separate paper on the same topic a few days ago. I'll link to that in the next post. https://openai.com/index/emergent-misalignment/

#AI #cognitivescience

Toward understanding and preventing misalignment generalization

We study how training on incorrect responses can cause broader misalignment in language models and identify an internal feature driving this behavior—one that can be reversed with minimal fine-tuning.

2/2

Anthropic's paper focused on the broader idea of "emotion concepts," which form the basis of personas, and of course the tech media has wildly exaggerated the findings.

The actual study does not claim at all that Claude has emotions or experiences, but some commentators misleadingly framed it that way.

It's a very well-researched and sober-minded paper: https://www.anthropic.com/research/emotion-concepts-function

Emotion concepts and their function in a large language model

Interpretability research from Anthropic on emotion concepts