ICLR 2026 roundup: the research community is focusing on GRPO (157 papers) rather than DPO, favoring RLVR (125 papers) over RLHF, and has 202 papers on Mamba/SSMs. Nait (smart tuning with only 10% of the data) helps optimize efficiency. 257 papers cover test-time compute and 123 cover hallucination. Warning: models that follow instructions well are easier to attack with injection. #AI #HọcMáy #ICLR2026 #NCKH #DeepLearning #Mamba #RLVR #GRPO #MạngNeural #BảoMậtAI #ViễnTưởngAI

https://www.reddit.com/r/LocalLLaMA/comments/1qsh7dz/analyzed_5357_iclr_2026_acc

🧠 New! A code notebook for RLVR combined with GRPO from scratch, shared as part of the “Reasoning‑from‑Scratch” project. Useful for anyone who wants to explore RL models and optimization in AI/ML. #AI #MachineLearning #RLVR #GRPO #LậpTrình #MãNguồn

https://www.reddit.com/r/LocalLLaMA/comments/1qgcj8b/rlvr_with_grpo_from_scratch_code_notebook/
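
For readers new to GRPO, the group-relative step that gives the method its name can be sketched in a few lines. This is a minimal illustration, not code from the linked notebook: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made here, and the rewards are made-up 0/1 verifier outcomes.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled completion relative to its own group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. `eps` (an assumed constant) avoids division by zero
    when every completion in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four completions for one prompt, two judged correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group's average get a positive advantage and the rest get a negative one. When every reward in a group is identical, all advantages are zero and that group contributes no learning signal.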

RLVR promises faster sampling but leaves reasoning untouched—base LLMs still carry the heavy‑lifting of trajectories. The paper (NeurIPS 2025) shows that gains come from smarter teacher‑distillation and minor architectural tweaks, not a new reasoning engine. Curious how sampling efficiency separates from true understanding? Dive into the details. #RLVR #SamplingEfficiency #LLMReasoning #NeurIPS2025

🔗 https://aidailypost.com/news/rlvr-lifts-sampling-efficiency-not-reasoning-base-models-hold

The 2025 LLM revolution: RLVR cuts training costs by 90%, and the era of reasoning models has arrived

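The RLVR+GRPO techniques that dominated the LLM field in 2025, and the revolution in training costs. From the pitfalls of benchmarks to using LLMs as a superpower, this post introduces Sebastian Raschka's annual review.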

https://aisparkup.com/posts/7892

Nick Kukoz (@NickKukoz)

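Citing an arXiv paper (2512.04359), the post reports that the authors found a way to consistently improve reasoning ability by addressing the RLVR entropy problem. Only the paper link is provided, so the specific mechanism has to be checked in the original, but the announcement is that resolving 'RLVR entropy' improved reasoning performance.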

https://x.com/NickKukoz/status/2003399858011668891

#arxiv #research #reasoning #rlvr

Nick Kukoz (@NickKukoz) on X

@rasbt https://t.co/U0AuwPOx3s authors found a way to consistently improve reasoning by addressing RLVR entropy

X (formerly Twitter)

ajay dhisone (@AjayDhisone)

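The author says models have progressed from the 2023 level of 'passing the bar exam' to, in 2025, explaining why they passed and even showing the hidden chain-of-thought, and stresses how rapidly research on RLVR (a related reinforcement learning technique) has advanced.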

https://x.com/AjayDhisone/status/2003125435266408772

#rlvr #research #reasoning #chainofthought

ajay dhisone (@AjayDhisone) on X

@rasbt 2023: Can it pass the Bar Exam? 2025: Can it explain why it passed and show the hidden chain-of-thought? The progress in RLVR is insane.

X (formerly Twitter)

2025 saw significant advancements in #LLMs, with #ReinforcementLearning from #VerifiableRewards (#RLVR) emerging as a key stage in training, leading to improved #reasoning capabilities. The industry also began to understand the unique “jagged” intelligence of LLMs, excelling in specific domains but lacking generalisation. https://karpathy.bearblog.dev/year-in-review-2025/?eicker.news #tech #media #news

2025 LLM Year in Review

2025 Year in Review of LLM paradigm changes

karpathy

New research from Tsinghua shows that reasoning‑augmented LLMs solve tasks with fewer calls but don’t surpass the raw capability of their base models. The study looks at chain‑of‑thought prompting, RLVR training, and pass@1 metrics, highlighting efficiency gains for open‑source models. Worth a read for anyone tracking LLM benchmarks. #LLM #ChainOfThought #RLVR #PassAt1

🔗 https://aidailypost.com/news/study-finds-reasoning-llms-are-more-efficient-not-more-capable
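
The pass@1 metric mentioned in the post above is the k=1 case of pass@k: the chance that at least one of k sampled answers is correct. The sketch below uses the widely used unbiased estimator computed from n samples with c correct; the post does not say which estimator the study uses, so treat the function `pass_at_k` and the sample counts as illustrative assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled answers of which c were correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k samples contains no correct answer.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 32 samples for one problem, 4 of them correct.
single_try = pass_at_k(32, 4, 1)
eight_tries = pass_at_k(32, 4, 8)
print(single_try, eight_tries)
```

Comparing a model's pass@1 with its pass@k at larger k is one way to separate sampling efficiency (getting it right on the first try) from capability (getting it right at all within k tries).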

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

https://arxiv.org/abs/2507.15855

#HackerNews #Implicit #Actor #Critic #Coupling #Supervised #Learning #RLVR #ReinforcementLearning

Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline

The International Mathematical Olympiad (IMO) is widely regarded as the world championship of high-school mathematics. IMO problems are renowned for their difficulty and novelty, demanding deep insight, creativity, and rigor. Although large language models perform well on many mathematical benchmarks, they often struggle with Olympiad-level problems. Using carefully designed prompts, we construct a model-agnostic, verification-and-refinement pipeline. We demonstrate its effectiveness on the recent IMO 2025, avoiding data contamination for models released before the competition. Equipped with any of the three leading models -- Gemini 2.5 Pro, Grok-4, or GPT-5 -- our pipeline correctly solved 5 out of the 6 problems (≈85.7% accuracy). This is in sharp contrast to their baseline accuracies: 31.6% (Gemini 2.5 Pro), 21.4% (Grok-4), and 38.1% (GPT-5), obtained by selecting the best of 32 candidate solutions. The substantial improvement underscores that the path to advanced AI reasoning requires not only developing more powerful base models but also designing effective methodologies to harness their full potential for complex tasks.

arXiv.org

→ The 4 steps to train an LLM (Les 4 étapes pour entrainer un LLM)
https://scienceetonnante.com/blog/2025/04/25/les-4-etapes-pour-entrainer-un-llm/

"That is the principle of reinforcement learning with a verifiable reward [RLVR], which makes it possible to do without humans who have to judge whether the answer is acceptable or not."

#entrainer #LLM #apprentissage #RLVR #humains

Les 4 étapes pour entrainer un LLM – Science étonnante
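
To make the "verifiable reward" in the quote above concrete, here is a minimal sketch of a reward that a program can check without a human judge. It is an assumption-laden toy, not taken from the linked article: the helper names `extract_final_answer` and `verifiable_reward`, the rule "the last number in the completion is the answer", and the example strings are all hypothetical.

```python
import re


def extract_final_answer(completion):
    """Return the last number that appears in the completion, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion, reference):
    """Return 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # nothing to check counts as a wrong answer
    return 1.0 if float(answer) == float(reference) else 0.0


# Hypothetical completions for the prompt "What is 12 * 7?"
good = verifiable_reward("12 * 7 = 84, so the answer is 84", "84")
bad = verifiable_reward("I think it is 85.", "84")
print(good, bad)
```

The same pattern applies to other checkable tasks, for example running unit tests on generated code instead of comparing numbers.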