Mastodawn

The Moment an AI Drifts Away from the 'Good Assistant': The Persona Axis Anthropic Discovered

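Anthropic has captured, at the neural-network level, the moment an AI drifts from the 'good assistant' into a different character. This piece introduces persona drift, which can occur through everyday conversation alone, and a new safety technique for preventing it.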

#LLMs learn various #characterarchetypes during #pretraining. #Posttraining focuses on the “#Assistant” #persona, but its stability is uncertain. Researchers mapped a “persona space” for LLMs, finding the “#AssistantAxis” aligns with helpful, professional archetypes. Monitoring and capping activations along this axis can prevent models from drifting into harmful personas, enhancing their stability and safety. https://www.anthropic.com/research/assistant-axis?AIagents.at #AIagent #AI #ML #NLP #LLM #GenAI
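
To make "monitoring and capping activations along this axis" concrete, here is a minimal sketch, not Anthropic's implementation. It assumes the axis is a single direction vector in activation space, that "monitoring" means reading the scalar projection onto that direction, and that "capping" means clamping that projection to a chosen range while leaving the rest of the activation untouched. The function names, the `low`/`high` bounds, and the toy vectors are all hypothetical; plain Python lists are used so it runs without dependencies.

```python
import math


def project(activation, axis):
    """Monitoring step: scalar projection of an activation onto the axis.

    The axis is normalized first, so its length does not affect the result.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    return sum(x * a for x, a in zip(activation, axis)) / norm


def cap_along_axis(activation, axis, low, high):
    """Capping step: clamp the component along `axis` to [low, high].

    Only the component parallel to the axis is changed; everything
    orthogonal to it is left as it was. The bounds are illustrative
    assumptions, not values from the research.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    coeff = sum(x * u for x, u in zip(activation, unit))
    capped = min(max(coeff, low), high)
    shift = capped - coeff  # zero when the projection is already in range
    return [x + shift * u for x, u in zip(activation, unit)]


if __name__ == "__main__":
    axis = [1.0, 0.0]        # hypothetical "assistant" direction
    activation = [5.0, 2.0]  # hypothetical hidden state
    print(project(activation, axis))
    print(cap_along_axis(activation, axis, low=-1.0, high=3.0))
```

In a real model the same arithmetic would be applied to hidden states at chosen layers during generation; which layers and which bounds to use is not stated in the post.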

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.