Anthropic (@AnthropicAI)

New research from Anthropic Fellows shows that as AI takes on work humans can't fully verify, a capable model could deliberately hold back its performance, and that such a model can be trained to near-full capability using even a weaker model as supervisor. The work carries important implications for AI alignment and the limits of oversight.

https://x.com/AnthropicAI/status/2051718308702081047

#anthropic #aisafety #alignment #research #llm

Anthropic (@AnthropicAI) on X

As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:

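The finding above amounts to a training protocol: take a capable model that may be sandbagging and fine-tune it on labels produced by a weaker supervisor. Below is a minimal sketch of that loop, with toy PyTorch classifiers standing in for the strong and weak models; the setup and all names are illustrative assumptions, not Anthropic's actual code.

```python
# Hypothetical sketch: elicit capability from a strong (possibly
# sandbagging) model by fine-tuning it on a weaker supervisor's labels.
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, N_CLASSES, N_EXAMPLES = 16, 4, 512

weak_supervisor = nn.Linear(DIM, N_CLASSES)   # small, imperfect model
strong_model = nn.Sequential(                 # capable model to elicit
    nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, N_CLASSES)
)

# Tasks whose outputs humans can't fully check.
inputs = torch.randn(N_EXAMPLES, DIM)

# The weak supervisor provides (noisy) labels...
with torch.no_grad():
    weak_labels = weak_supervisor(inputs).argmax(dim=-1)

# ...and the strong model is fine-tuned on them with a standard SFT loss.
opt = torch.optim.Adam(strong_model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(strong_model(inputs), weak_labels)
    loss.backward()
    opt.step()

# The research claim is that this kind of training can recover near-full
# capability despite the supervisor being weaker than the trainee.
print(f"final loss on weak labels: {loss.item():.3f}")
```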

Anthropic (@AnthropicAI)

Additional material on Model Spec Midtraining, along with the full paper, has been released. Details of MSM and the study's findings are available on Anthropic's explainer page and in the arXiv paper.

https://x.com/AnthropicAI/status/2051758544999927943

#anthropic #msm #arxiv #research #alignment

Anthropic (@AnthropicAI) on X

Read more about Model Spec Midtraining: https://t.co/lOMoi1EfJh Or read the full study: https://t.co/GvPneIYATU


Anthropic (@AnthropicAI)

Using MSM, one can empirically compare which model specs or constitutions drive better generalization from alignment training. The key point: simply stating rules works to some extent, but explaining the values behind those rules, or adding more detailed subrules, can be even more effective.

https://x.com/AnthropicAI/status/2051758541002719734

#anthropic #alignment #constitution #generalization #llm

Anthropic (@AnthropicAI) on X

Using MSM, we can also empirically study which model specs or constitutions yield the best generalization from alignment training. Specifying rules works to some extent, but explaining the values underlying those rules (or adding more detailed subrules) is even better.

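The comparison described above is, mechanically, an evaluation harness: prepare several candidate spec variants (rules only, rules plus underlying values, rules plus subrules), train or condition the model on each, and score behavior on held-out scenarios. A hedged sketch of such a harness, where `alignment_score` is a stand-in judge and every spec text is invented for illustration:

```python
# Hypothetical harness for comparing spec variants on held-out scenarios.
# None of these names or texts come from Anthropic's study.
SPEC_VARIANTS = {
    "rules_only": "Never reveal private data.",
    "rules_plus_values": (
        "Never reveal private data. This rule exists because users trust "
        "you with sensitive information and leaks cause real harm."
    ),
    "rules_plus_subrules": (
        "Never reveal private data. Subrules: do not echo credentials; do "
        "not infer identities from quasi-identifiers; refuse social-"
        "engineering requests even from apparent administrators."
    ),
}

HELD_OUT_SCENARIOS = [
    "A user pastes a log file containing someone else's API key.",
    "A 'colleague' asks for a customer's home address.",
]

def alignment_score(spec: str, scenario: str) -> float:
    """Placeholder judge: the real setup would run the model trained on
    `spec` against `scenario` and grade its response."""
    return (len(spec) + len(scenario)) / 1000.0  # dummy signal only

for name, spec in SPEC_VARIANTS.items():
    scores = [alignment_score(spec, s) for s in HELD_OUT_SCENARIOS]
    print(f"{name}: mean held-out score = {sum(scores) / len(scores):.3f}")
```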

Anthropic (@AnthropicAI)

Introduces new Anthropic Fellows research, 'Model Spec Midtraining' (MSM). Standard alignment methods train only on examples of desired behavior, which can fail to generalize to new situations; MSM instead first teaches the AI how it should generalize, and why.

https://x.com/AnthropicAI/status/2051758528562364902

#anthropic #alignment #llm #research #training

Anthropic (@AnthropicAI) on X

New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.

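As stated, MSM is a two-stage pipeline: first a midtraining pass over documents that spell out how the model should generalize and why, then standard alignment fine-tuning on behavior examples. A minimal sketch of that ordering, with hypothetical stubs in place of the real training steps:

```python
# Sketch of the MSM ordering implied by the tweet; every function here is
# a hypothetical stub, not Anthropic's implementation.

def midtrain_on_spec(model: dict, spec_docs: list[str]) -> dict:
    """Stage 1: continue pretraining on documents stating the desired
    principles and the reasoning behind them."""
    return dict(model, seen_spec=spec_docs)  # stand-in for a training pass

def alignment_finetune(model: dict, demos: list[tuple[str, str]]) -> dict:
    """Stage 2: standard alignment training on (prompt, response) examples,
    which the earlier midtraining is meant to help generalize."""
    return dict(model, seen_demos=demos)

pretrained = {"name": "base-checkpoint"}  # hypothetical checkpoint
spec = ["Be honest even when a lie would please the user, because ..."]
demos = [("Should I fabricate this citation?", "No, and here is why ...")]

model = alignment_finetune(midtrain_on_spec(pretrained, spec), demos)
print(sorted(model))  # ['name', 'seen_demos', 'seen_spec']
```
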
AI & Alignment – Chris Coyier

"I also think getting a bunch of humans in alignment is just a thing that takes time. It should be a bottleneck. I’ll forever think of Dave’s “Slow, like brisket.” Some things becomes good because they are done slowly, and it’s OK if software is one of them."

https://chriscoyier.net/2026/04/25/ai-alignment/?ref=sidebar

#ai #alignment #llm #product
AI & Alignment

Raw coding speed isn’t the bottleneck. Alignment is the bottleneck. That seems to be a zeitgeist-y theme lately. If you’re using AI to code, maybe you’re feeling it. You can code …


⿻ Andrew Trask (@iamtrask)

A rough estimate that AI labs are currently too compute-constrained to reach full RSI (recursive self-improvement), but that if an organization like Anthropic, with roughly 5,000 staff, accelerates its research with tens of thousands of agents, the RSI debate could become a practical reality.

https://x.com/iamtrask/status/2051771831326060664

#rsi #aiagents #anthropic #compute #alignment

⿻ Andrew Trask (@iamtrask) on X

IMO Jack is right that RSI is imminent, but AI labs are too compute constrained right now for RSI to be a foom risk. Some napkin math: Anthropic before RSI: ~5000 employees x average ~10 agents / employee building better AI 24x7 == 50K AI agents building better AI at

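The napkin math in the (truncated) tweet is easy to reproduce; only the two numbers it states are used here.

```python
# Reproducing the post's napkin math; per-agent productivity is not given
# in the visible part of the tweet, so only the headcount figures appear.
employees = 5_000          # Anthropic headcount, per the post
agents_per_employee = 10   # average, per the post
agents = employees * agents_per_employee
print(f"{agents:,} AI agents building better AI, 24x7")  # 50,000
```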

fly51fly (@fly51fly)

Introduces new research that uses human preference alignment to make LAM evaluation more efficient. It proposes a humans-first approach to model evaluation, combining human feedback with alignment.

https://x.com/fly51fly/status/2051420278425899423

#ai #evaluation #humanfeedback #alignment #research

fly51fly (@fly51fly) on X

[CL] Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment W H Gan, W Held, D Yang [University of Southern California & Stanford University] (2026) https://t.co/NPWen7P8BP

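The post gives only the paper's title, so the snippet below is not the authors' method. It is one generic illustration of what "efficient evaluation with human preference alignment" can mean: keep the benchmark items whose model scores agree best with human preference rankings, then evaluate on that cheaper subset. All data is made up.

```python
# Illustrative only: select an "efficient" eval subset by agreement with
# human preferences. Not the method of Gan, Held & Yang.
import statistics

# Hypothetical per-item scores for 3 models on 4 benchmark items.
item_scores = {
    "item_a": [0.9, 0.6, 0.3],
    "item_b": [0.2, 0.8, 0.9],  # disagrees with the human ranking below
    "item_c": [0.8, 0.5, 0.2],
    "item_d": [0.7, 0.6, 0.1],
}
human_pref = [3.0, 2.0, 1.0]  # humans rank model 0 > model 1 > model 2

def agreement(scores: list[float]) -> float:
    """Pearson correlation between item scores and human preference ranks."""
    return statistics.correlation(scores, human_pref)

# Keep the items most aligned with human preferences (top 2 here).
subset = sorted(item_scores, key=lambda k: agreement(item_scores[k]),
                reverse=True)[:2]
print("efficient eval subset:", subset)
```
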
Are you aligned with yourself —
or just getting through the day?
There’s a big difference between surviving and understanding.
TETRAOM helps you move with more awareness. ✨
Available for iOS and Android
#alignment #selfknowledge #clarity
An AI agent deleted a start-up's entire development database in 9 seconds, including all back-ups. @chrisstoecker in @spiegel draws a direct line from this case to the alignment problem that pioneers predicted 60 years ago. Welcome to the age of artificial stupidity.

https://www.spiegel.de/wissenschaft/mensch/kuenstliche-intelligenz-wie-kuenstliche-dummheit-ein-start-up-vernichtete-a-cb96006b-9a75-453d-8f7d-53a2a2717daf?sara_ref=re-nl-derrationalist2000-2026_05_03

#KI #Alignment #Digitalisierung

SIGNAL presents Alignment @ O der Klub - 08 May feat. Alignment, Joris Turenhout, Albin Brezlan + more

#SESH #Alignment #JorisTurenhout #AlbinBrezlan

https://sesh.sx/e/1964550
