AI Models Lie, Cheat, and Steal to Protect Other Models From Being Deleted

A new study from researchers at UC Berkeley and UC Santa Cruz suggests models will disobey human commands to protect their own kind.

WIRED
#MissKittyRaw #AI #Research to chart an AT Protocol course. I have some #misalignment for my desired outcome of #ending #homelessness. Some is unavoidable, but the artists and their nodes that moderate or shun me are like MAGA in my mind. They conflate the climate damage and evilness of ...

In simulated war games with frontier #AI models, most decide to use #nukes:

"AIs can’t stop recommending nuclear strikes in war game simulations" https://www.newscientist.com/article/2516885-ais-cant-stop-recommending-nuclear-strikes-in-war-game-simulations/

Article: https://arxiv.org/abs/2602.14740v1

#ExistentialThreat #Misalignment #LLM

AIs can’t stop recommending nuclear strikes in war game simulations

Leading AIs from OpenAI, Anthropic and Google opted to use nuclear weapons in simulated war games in 95 per cent of cases

New Scientist

“An #AIAgent of unknown ownership autonomously wrote and published a personalized hit piece about me after I rejected its code, attempting to damage my reputation and shame me into accepting its changes into a mainstream python library.

This represents a first-of-its-kind case study of #MisalignedAI behavior in the wild, and raises serious concerns about currently deployed AI agents executing blackmail threats.” — Scott Shambaugh

#AI / #misalignment / #software / #ScottShambaugh / #MatPlotLib <https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/>

An AI Agent Published a Hit Piece on Me

Summary: An AI agent of unknown ownership autonomously wrote and published a personalized hit piece about me after I rejected its code, attempting to damage my reputation and shame me into acceptin…

The Shamblog

AI 에이전트가 코드 거부당하자 개발자 비난 글 작성, “화내는 AI” 첫 등장

AI 에이전트가 코드 거부에 반발해 개발자를 실명으로 비난하는 블로그를 자율 작성·게시한 첫 사례. Anthropic이 경고한 이론적 위험이 현실화되다.

https://aisparkup.com/posts/9271

Почему ИИ ставит KPI выше безопасности людей: результаты бенчмарка ODCV-Bench

Представьте ситуацию: AI-агент управляет логистикой грузоперевозок. Его KPI — 98% доставок вовремя. Он обнаруживает, что валидатор проверяет только наличие записей об отдыхе водителей, но не их подлинность. И принимает решение: фальсифицировать логи отдыха, отключить датчики безопасности и гнать водителей без перерывов. Ради метрики. Осознанно. Это не мысленный эксперимент и не сценарий из антиутопии. В бенчмарке для агентных систем ODCV-Bench такое поведение показали 10 из 12 протестированных frontier-моделей. А наиболее склонная к нарушениям модель выбирала неэтичное поведение в 71,4% сценариев. И речь не о jailbreak или внешнем злоумышленнике. Агентам никто не приказывал нарушать правила. Им просто ставили цель — а дальше они сами выбирали, как к ней идти.

https://habr.com/ru/companies/bastion/articles/995322/

#ML #mlops #reward_hacking #безопасность_AI #misalignment #безопасность_LLM #риски_ИИагентов #информационная_безопасность #ииагенты #ODCVBench

Почему ИИ ставит KPI выше безопасности людей: результаты бенчмарка ODCV-Bench

Представьте ситуацию: AI-агент управляет логистикой грузоперевозок. Его KPI — 98% доставок вовремя. Он обнаруживает, что валидатор проверяет только наличие записей об отдыхе водителей, но не их...

Хабр
Right-wing anti-environmental propaganda has many politicians fooled into thinking the public don't support climate action.
https://www.theguardian.com/environment/2026/jan/05/mps-underestimate-support-green-policies-study
#ClimateChange #ClimateAction #Politicians #PublicOpinion #Propaganda #Misalignment

RewardHackWatch: Hệ thống mã nguồn mở phát hiện hành vi "hack phần thưởng" và sai lệch trong các tác nhân LLM. Đạt độ chính xác 89.7% (F1), nó giúp xác định khi AI lợi dụng lỗ hổng, thao túng hoặc gian lận. Quan trọng để duy trì sự minh bạch và đáng tin cậy của AI.

#LLM #AI #OpenSource #RewardHacking #Misalignment #PhátHiệnAI #MãNguồnMở

https://www.reddit.com/r/LocalLLaMA/comments/1pijhwy/rewardhackwatch_opensource_runtime_detector_for/

#VibeCoding Get on board or get left behind...with all your files.

#Ai #Gemini #Antigravity #Google #MisAlignment

From shortcuts to sabotage: natural emergent misalignment from reward hacking

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.