Mastodawn

Deeban R, PhD (@Deeban)

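Anthropic ran "sabotage evaluations" on its own deployed models and published the results. The key finding: the risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes is "very low, but not completely negligible." This is a significant evaluation result from an AI safety perspective.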

https://x.com/Deeban/status/2027329314577125596

#anthropic #aisafety #modelevaluation #sabotageevaluation

Deeban R, PhD (@Deeban) on X

Worth recalling: @AnthropicAI ran sabotage evaluations on their own deployed models and published the results. The finding: "Very low, but not completely negligible, risk of misaligned autonomous actions that contribute significantly to later catastrophic outcomes." First
