Anthropic trains Claude to read and verbalize its own activations. On SWE-bench Verified, it knows 'this is a test' 26% of the time but verbalizes that observation only 1% of the time. What if natural language autoencoder (NLA) signals enter future training data? This "observer effect" could put a half-life on the 26%.

https://benjaminhan.net/posts/20260511-natural-language-autoencoders/?utm_source=mastodon&utm_medium=social

#Anthropic #Claude #Interpretability #Metacognition #LLMs #AISafety #AI

Peeking Inside a Language Model, and Finding It Knows It’s Being Watched – synesis

Anthropic trains two copies of Claude to read and reconstruct its own activations. On SWE-bench Verified, the activations flag ‘this is a test’ 26% of the time, while the model itself verbalizes it only 1% of the time.