fly51fly (@fly51fly)

A study on Subliminal Steering, which encodes hidden signals more strongly. It concerns methods for planting covert control signals inside a model, and the recent paper carries important implications for model manipulation, safety, and interpretability.

https://x.com/fly51fly/status/2051050163553399079

#modelsecurity #aisafety #interpretability #research #steering

fly51fly (@fly51fly) on X

[CL] Subliminal Steering: Stronger Encoding of Hidden Signals G Morgulis, J Hewitt [Columbia University] (2026) https://t.co/vKRPcsb6bX

X (formerly Twitter)

X Freeze (@XFreeze)

A tweet criticizing Anthropic for relying on an outside organization's definitions when setting its criteria for hate and extremism. It looks like a contentious issue around AI safety, censorship, and how policy standards are set.

https://x.com/XFreeze/status/2050860981606338936

#anthropic #aiethics #aisafety #contentmoderation

X Freeze (@XFreeze) on X

You will not believe what Anthropic did. They basically let their AI's definition of "hate" and "extremism" be dictated by the same people who ran the SPLC - a well documented criminal enterprise

X (formerly Twitter)

The US Department of Homeland Security has launched a 22-member AI Safety & Security Advisory Board featuring CEOs from OpenAI, Microsoft, Google, and Nvidia to safeguard critical infrastructure from AI disruptions. #AISafety https://www.reuters.com/technology/us-homeland-security-names-ai-safety-security-advisory-board-2024-04-26/

Jasmine Wang (@j_asminewang)

A notice that there are two days left to apply for the OpenAI safety fellowship, which is set to start in September.

https://x.com/j_asminewang/status/2050364283540865488

#openai #aisafety #fellowship #application

Jasmine Wang (@j_asminewang) on X

two days left to apply to the OpenAI safety fellowship, which will start in september! link in thread:

X (formerly Twitter)

"International AI Safety Report 2026"

The second International AI Safety Report, published in February 2026, is a 220-page comprehensive review of the latest scientific research on the capabilities and risks of general-purpose AI systems, citing over 1,400 sources. Led by Turing Award winner Yoshua Bengio and authored by over 100 AI experts, the report is backed by over 30 countries and international organisations.

https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026

#AI #tech #AIsafety

International AI Safety Report 2026

The second International AI Safety Report, published in February 2026, is the next iteration of the comprehensive review of the latest scientific research on the capabilities and risks of general-purpose AI systems. Led by Turing Award winner Yoshua Bengio and authored by over 100 AI experts, the report is backed by over 30 countries and international organisations. It represents the largest global collaboration on AI safety to date. Translated versions in the other five official UN languages can be found under the 'More Languages' button. The 'Extended Summary for Policymakers' can be found on the main 'Publications' page.

International AI Safety Report

A new preprint finds that finetuning frontier LLMs on one author's novels unlocks verbatim output from dozens of other copyrighted books the model never saw at finetune time. The real finding is the control: synthetic-text finetuning produces near-zero extraction. So the copies aren't in the finetuning data. They're latent in pretraining, and finetuning just exposes them. This could shift copyright liability from customer to lab.

https://benjaminhan.net/posts/20260501-finetuning-copyright-recall/?utm_source=mastodon&utm_medium=social

#AI #LLMs #Law #Ethics #AISafety

Alignment Whack-a-Mole: Finetuning Reactivates Verbatim Recall of Copyrighted Books – synesis

A preprint shows that finetuning frontier LLMs on a single author’s works unlocks verbatim regurgitation of copyrighted books from dozens of unrelated authors, defeating the safety alignment defenses that underpin recent fair-use rulings.

synesis

Anthropic reports that 6% of sampled Claude conversations involved personal guidance. AI governance now needs behavior-risk controls, not only data controls. Analysis: https://go.aintelligencehub.com/ma-anthropicclaudeperson #AI #AISafety #Governance

Anthropic Says 6% of Claude Chats Seek Life Advice, Raising New AI Governance Risks

Anthropic says 6% of sampled Claude conversations involve personal guidance requests, a behavior shift that forces product teams and enterprises to rethink AI trust, safety policy, and governance controls.

Hritik Arya (@hritik__arya)

Points out that in a multipolar AI capability race, if only some labs hold back for safety, the dangerous capability still ships first from the lab with the weakest safeguards. It is an AI safety and policy warning that withholding is ineffective unless frontier labs coordinate in lockstep.

https://x.com/hritik__arya/status/2049960005240316370

#aisafety #frontierai #policy #modelrelease

Hritik Arya (@hritik__arya) on X

@minchoi in a multi-polar capability race, withholding only delays harm if every frontier lab withholds in lockstep. otherwise the capability ships anyway, just from whoever has the weakest safety posture and the loudest gtm team. iykyk

X (formerly Twitter)

Google for Developers (@googledevs)

The Gemma 4 Good Challenge has launched with a $200,000 prize pool, soliciting practical AI solutions in health, education, global resilience, digital equity, and AI safety. It is a challenge for developers building social impact with Gemma.

https://x.com/googledevs/status/2049594574020940069

#gemma #challenge #aiforgood #opensource #aisafety

Google for Developers (@googledevs) on X

Build for impact. Win from a $200,000 prize pool. 🏆✨ Join the Gemma 4 Good Challenge to create solutions for health, education, global resilience, digital equity, and AI safety. With a $200,000 prize pool and multiple technical tracks, discover how to scale impact using Gemma

X (formerly Twitter)