Can language models monitor and steer their own internal activations? A neuroscience-inspired neurofeedback paradigm finds yes, but only within a low-dimensional metacognitive space: semantically interpretable directions are accessible, raw-variance directions aren't. The prerequisite for spoofing activation-based oversight already partially exists.
#Paper #Metacognition #LLMs #AISafety #Neuroscience #NeurIPS #AI

Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations – synesis
A neuroscience-inspired neurofeedback paradigm shows LLMs can introspect and steer a low-dimensional metacognitive space of their hidden activations, with implications for activation-based oversight.






