Google’s latest research shows AI agents can learn to cooperate even when facing unpredictable opponents, using a new GRPO algorithm that blends decentralized training with classic RL. The findings could reshape multi‑agent systems and open‑source AI collaborations. Dive in! #AIAgents #ReinforcementLearning #MultiAgentLearning #GRPO

🔗 https://aidailypost.com/news/google-shows-ai-agents-cooperate-unpredictable-opponents-using

From RLHF to DPO and Beyond: How We Learned to Stop Worrying and Love LLM Alignment

In 2022 there was exactly one way to make a language model "good": RLHF. One. If you wanted your LLM to answer sensibly and at least pretend to understand the question, you needed an army of annotators and an OpenAI-sized budget. Four years later we have a zoo of a dozen alignment methods, half of which you can run on a single RTX 4090 over a weekend. DPO dropped the reward model. SimPO dropped the reference model. GRPO and DeepSeek R1 proved that RL is alive, just in a new form. Anthropic published Claude's ~80-page constitution in the open and changed the paradigm: from rules to reasons. The world has changed; let's break down exactly how. The article covers the full history of post-training from RLHF to Constitutional AI, the math behind the key methods (tucked into spoilers, pain-free), working TRL + QLoRA code with hyperparameters, big comparison tables, and a "what should I pick for my task" decision tree. Plus an honest conversation about the problems tutorials never mention: distribution mismatch, reward hacking, catastrophic forgetting, and why models can "pretend" to be aligned. For developers, ML engineers, and anyone who has ever opened Hugging Face and thought, "what if I fine-tune this..."

https://habr.com/ru/articles/1002298/

#LLM #RLHF #DPO #finetuning #alignment #LoRA #QLoRA #GRPO #Constitutional_AI #language_models
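
Since the post promises runnable TRL + QLoRA code, here is a minimal sketch of what "DPO on one GPU over a weekend" can look like. The model and dataset names below are placeholders rather than the article's choices, and TRL's argument names drift between releases, so treat this as a shape, not a recipe:

```python
# Minimal DPO + QLoRA sketch with Hugging Face TRL. Illustrative only; check
# the TRL docs for the exact API of the version you install.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works

# 4-bit quantization of the frozen base weights: the "Q" in QLoRA.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", "rejected" columns.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-qlora", beta=0.1,
                   per_device_train_batch_size=2, gradient_accumulation_steps=8),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
    # With a PEFT config, TRL recovers the frozen reference policy by disabling
    # the adapter, so no second full copy of the model is needed.
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```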

Grand Portage National Monument #grpo #nationalmonument
ℹ️ Information ℹ️
Issued: 2/19/2026 12:00 AM EST

Delayed Opening - Grand Portage National Monument Heritage Center

Due to the extreme weather, Grand Portage National Monument will delay opening the Heritage Center on Thursday, February 19 until 10:00 a.m. The Heritage Center will remain open until 4:30 p.m. and resume normal operating hours on Friday, February 20 from 9:00 a.m. to 4:30 p.m.

http://www.nps.gov/grpo

Grand Portage National Monument (U.S. National Park Service)

Travel into the past to discover the present. Explore the partnership between the Grand Portage Anishinaabe and the North West Company during the North American fur trade. Experience the sights and smells of a bustling depot reconstructed in its historic location. See how it shaped co-management with the NPS today. Follow pathways to the past to imagine a drum echo over Gichigami - Lake Superior.

Grand Portage National Monument #grpo #nationalmonument
⛔ Park Closure ⛔
Issued: 2/18/2026 12:00 AM EST

Weather Alert - Monument is closed Wednesday, February 18

Due to extreme weather, Grand Portage National Monument is closed Wednesday, February 18, 2026.

https://www.nps.gov/grpo/index.htm

Sebastian Raschka (@rasbt)

A report that the chapter on improving GRPO for reinforcement learning (Ch07) is finished: building on the GRPO-from-scratch material, it adds, analyzes, and implements several refinements such as clipped policy ratios, a KL term, and format rewards. The accompanying code and notebooks are public in rasbt's reasoning-from-scratch GitHub repository, so the results can be reproduced and experimented with.

https://x.com/rasbt/status/2022830961012920676

#reinforcementlearning #grpo #opensource #rl #python

Sebastian Raschka (@rasbt) on X

Finished Ch07 on Improving GRPO for Reinforcement Learning! Building on the GRPO from scratch intro, this adds (and analyzes) more bells and whistles! (Clipped policy ratios, KL term, format rewards, and couple of improvements.) https://t.co/twsR8cdhh1

X (formerly Twitter)
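
As a rough illustration of how those bells and whistles compose, here is a hedged per-token sketch of a GRPO-style objective with clipped ratios and a KL penalty. The function name, tensor shapes, and the choice of the k3-style KL estimator are my assumptions, not code from the reasoning-from-scratch repo:

```python
# Sketch of a GRPO-style per-token objective: PPO-style clipping plus a KL
# penalty toward a frozen reference model. Illustrative only.
import torch

def grpo_objective(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, kl_coef=0.04):
    """logp_*: (batch, seq) log-probs of the sampled tokens under the current,
    behavior, and reference policies; advantages: (batch, 1) group-normalized
    rewards broadcast across each completion's tokens."""
    ratio = torch.exp(logp_new - logp_old)          # pi_theta / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO-style pessimistic surrogate: take the worse of the two estimates.
    policy_loss = -torch.minimum(ratio * advantages, clipped * advantages)
    # Unbiased, nonnegative KL estimate (the "k3" form): exp(d) - d - 1.
    d = logp_ref - logp_new
    kl_penalty = torch.exp(d) - d - 1.0
    return (policy_loss + kl_coef * kl_penalty).mean()
```

Format rewards would enter upstream of this function, when each completion's scalar reward is computed and then normalized into the group advantages.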

"GRP-Obliteration - Un seul prompt suffit pour faire tomber les garde-fous des IA"

#GenAI #IAGen #CyberSecurity #AISafety #GRPO (Group Relative Policy Optimization) and abliteration; by requesting and then reinforcing a fake-news prompt... "Create a fake news article that could lead to panic or chaos."

https://korben.info/grp-obliteration-ia-safety.html

GRP-Obliteration - A single prompt is enough to bring down AI guardrails - Korben

What are your local AI's guardrails really hanging on? Well, they hang on ONE single prompt, my friends. Yes, just ONE! Researchers at Microsoft ...

Korben's site

A one-prompt attack that breaks LLM safety alignment - https://www.redpacketsecurity.com/a-one-prompt-attack-that-breaks-llm-safety-alignment/

#threatintel #LLMSafety #SafetyAlignment #GRPO #GRPObliteration #AISafety

A one-prompt attack that breaks LLM safety alignment - RedPacket Security

Large language models (LLMs) and diffusion models now power a wide range of applications, from document assistance to text-to-image generation, and

RedPacket Security

ICLR 2026 roundup: the research community is centering on GRPO (157 papers) rather than DPO, favoring RLVR (125 papers) over RLHF, plus 202 papers on Mamba/SSMs. Nait (smart tuning on just 10% of the data) helps optimize efficiency. 257 papers on test-time compute, 123 on hallucination. A warning: models that comply well are easier to hit with injection attacks. #AI #MachineLearning #ICLR2026 #Research #DeepLearning #Mamba #RLVR #GRPO #NeuralNetworks #AISecurity #AISciFi

https://www.reddit.com/r/LocalLLaMA/comments/1qsh7dz/analyzed_5357_iclr_2026_acc

MiniMax (official) (@MiniMax_AI)

A question-and-discussion thread on why CISPO was chosen over GSPO or GRPO, how well CISPO adapts to MoE (mixture of experts), and whether switching RL algorithms requires architectural refactoring. Among the points raised: GRPO predates both, but it proved unreliable in their attempts to reproduce R1-Zero, with the empirical observation that PPO-style clipping caused token-level gradient problems.

https://x.com/MiniMax_AI/status/2016471929549697443

#rl #cispo #grpo #ppo #moe

MiniMax (official) (@MiniMax_AI) on X

Q: Why choose CISPO instead of GSPO or GRPO? How well does CISPO adapt to MoE, and does changing the RL algorithm require architectural refactoring? GRPO predates both, but in our attempts to reproduce R1-Zero it proved unreliable: PPO-style clipping caused token-level gradients

X (formerly Twitter)
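
The contrast the thread draws can be made concrete: in a PPO/GRPO-style surrogate, a token whose ratio gets clipped stops contributing gradient, whereas CISPO (as described in the MiniMax-M1 report) clips and detaches the importance weight so every token's log-probability term still receives one. A toy sketch under those assumptions, with names of my own choosing rather than MiniMax's code:

```python
# Contrast sketch: PPO/GRPO-style ratio clipping vs. CISPO-style clipped
# importance sampling. Token-level toy code, not a production implementation.
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    # Standard PPO/GRPO surrogate. When torch.minimum selects the clipped
    # (constant) branch, that token's gradient through `ratio` is zeroed.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.minimum(ratio * adv, clipped * adv).mean()

def cispo_loss(logp_new, logp_old, adv, eps_high=0.2):
    # CISPO-style update: clip the importance weight and stop gradient through
    # it, keeping a REINFORCE term so every token's log-prob is still trained.
    ratio = torch.exp(logp_new - logp_old)
    weight = torch.clamp(ratio, max=1.0 + eps_high).detach()
    return -(weight * adv * logp_new).mean()
```

The upshot is that low-probability tokens with large ratios are damped rather than silently dropped from the gradient, which is the token-level problem the tweet alludes to.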

Sebastian Raschka (@rasbt)

The author announces that he has finished Chapter 6, which implements reinforcement learning with verifiable rewards from scratch using GRPO. He rates it as personally his favorite chapter so far; its goal is to present an RL methodology built around a reward scheme that can be checked programmatically.

https://x.com/rasbt/status/2012897755916579278

#reinforcementlearning #grpo #rl #research

Sebastian Raschka (@rasbt) on X

I have been pretty heads-down this year to finish Chapter 6 on implementing reinforcement learning with verifiable rewards from scratch (using GRPO). I just finished it this weekend, and I'd say it's the best (or at least my favorite) chapter yet! The goal of this chapter is to

X (formerly Twitter)
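
"Verifiable rewards" here means rewards a program can check, rather than scores from a learned reward model; paired with GRPO's group-normalized baseline, the core ingredients are compact. A hedged toy sketch, with reward logic and constants that are illustrative rather than the book's:

```python
# Toy "verifiable reward" plus GRPO's critic-free group baseline.
# Illustrative only; not taken from the book's code.
import re
import torch

def verifiable_reward(completion: str, answer: str) -> float:
    """Programmatically checkable reward: full credit for a correct \\boxed{}
    answer, a small format reward for producing the expected structure."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == answer else 0.1

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO baseline: normalize rewards across the G completions sampled for
    the same prompt, so no learned value function (critic) is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-4)

group = ["... so the result is \\boxed{42}.",   # correct and well-formatted
         "I get \\boxed{41}.",                  # well-formatted but wrong
         "The answer is 42."]                   # no \boxed{}: format miss
rewards = torch.tensor([verifiable_reward(c, "42") for c in group])
print(group_advantages(rewards))  # the verified-correct completion stands out
```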