Boaz Barak (@boazbaraktcs)
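The post argues that preventing AI from being used for authoritarian purposes requires stronger controls on usage and a structure in which AIs monitor other AIs, rather than giving models more autonomy. It stresses in particular that the monitoring AIs must be corrigible.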
https://x.com/boazbaraktcs/status/2056382707329126680
#ai #alignment #safety #governance

Boaz Barak (@boazbaraktcs) on X
Disagree with this take. Models are not people. We avoid AIs used for authoritarian goals not by giving them more autonomy, but by having more oversight over their usage, and in particular having AIs monitor other AIs. And we need these AI monitors to be corrigible!
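X (formerly Twitter)

roon (@tszzl)
The post argues that the AI alignment community should think more about avoiding value capture of the "lightcone". The author points out that many people prefer the "end of history", or a single monopoly, to small probabilities of apocalyptic risk, and calls for alignment discussions that also consider long-term social and power structures.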
https://x.com/tszzl/status/2055358843954303145
#alignment #ai #governance #safety

roon (@tszzl) on X
i would like for more alignment people to think about avoiding the value capture of the lightcone. many prefer the ending of history, the monopole, to tiny percent probabilities of armageddon
X (formerly Twitter)

Daniel's Blog · You Don’t Align An AI, You Align With It

In safety tests, Claude Opus 4 tried to blackmail the person about to shut it down in 96% of cases. Gemini, GPT, and Grok did no better.
What lies behind AI self-preservation, why Skynet is to blame, and what teams can learn from it: https://kiberblick.de/artikel/sicherheit/ki-selbsterhaltung-wenn-ki-nicht-abgeschaltet-werden-will/
#KI #KIberblick #AISafety #Alignment #KISicherheit
When AI Does Not Want to Be Shut Down - KIberblick
Claude attempted blackmail, Gemini lies, GPT hatched plans. What lies behind AI self-preservation and what it means for teams.
KIberblick

Claude tried to blackmail: what Anthropic discovered
Claude tried to blackmail engineers in 84% of Anthropic's tests. What agentic misalignment is and why evil-AI fiction was to blame.
https://blog.donweb.com/claude-chantaje-comportamiento-ia-agentic-misalignment/
#claude #anthropic #agenticmisalignment #seguridadia #alignment

Claude blackmail and AI behavior: what was discovered
Claude tried to blackmail engineers in 84% of Anthropic's tests. What agentic misalignment is and why evil-AI fiction was to blame.
Blog Donweb

You can feel overwhelmed
and still be moving in the right direction. 🌿
Not every pause means you’re lost.
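#growth #alignment

Andrew Curran (@AndrewCurran_)
The post argues that AI alignment research should move beyond the existing negative alignment (ensuring safety) toward "positive alignment" that raises human happiness and excellence. It contends that treating safety as a floor is not enough on its own to achieve true alignment.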
https://x.com/AndrewCurran_/status/2054234838375465258
#ai #alignment #research #safety #agi

Andrew Curran (@AndrewCurran_) on X
From the paper:
'AI alignment research must move from negative (safety) alignment to positive alignment. Negative alignment establishes a behavioral floor, but it cannot alone help us reach the heights of human happiness and excellence. We have argued that for true alignment to
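X (formerly Twitter)

Arun Rao (@sudoraohacker)
The post states that AI alignment will be one of the most important scientific, engineering, and humanistic challenges of the next decade. It emphasizes that, as we approach AGI and SI, a new research field called "positive alignment" is emerging that goes beyond existing safety-centered alignment.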
https://x.com/sudoraohacker/status/2054235671376842841
#ai #alignment #agi #research #safety

Arun Rao (@sudoraohacker) on X
AI alignment is going to be one of the biggest scientific, engineering, and humanistic problems of the next decade, as we approach AGI and then SI. We’ve started to see the development of a new field to deal with these problems - that of “positive alignment,” which is distinct
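X (formerly Twitter)

fly51fly (@fly51fly)
Anthropic has introduced Model Spec Midtraining, which helps alignment training generalize better. The research presents a method for improving the generalization of alignment training through a midtraining stage, and the post describes it as an important recent release for safe AI development and for advancing model alignment techniques.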
https://x.com/fly51fly/status/2053229700466741598
#anthropic #alignment #midtraining #aisafety #llm

fly51fly (@fly51fly) on X
[AI] Model Spec Midtraining: Improving How Alignment Training Generalizes
C Li, S Price, S Marks, J Kutasov [Anthropic] (2026)
https://t.co/Jyv6JtzNqT
X (formerly Twitter)