Nemotron-Cascade 2: Post-Training LLMs with Cascade RL

Nemotron-Cascade 2, released by NVIDIA, is a 30B-parameter MoE model that activates only 3B parameters yet achieves gold-medal-level performance at the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). By introducing Cascade RL and multi-domain on-policy distillation, the model delivers state-of-the-art performance across math, code reasoning, and agentic capabilities, improving substantially on its Nemotron-Nano-V3 base. The training data and model checkpoints are released as open source, so AI researchers and developers can use them directly.

https://research.nvidia.com/labs/nemotron/nemotron-cascade-2/

#nvidia #moe #cascaderl #llm #posttraining

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. It is the second open-weight LLM, after DeepSeek-V3.2-Speciale-671B-A37B, to achieve Gold Medal-level 🏅 performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals.
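The abstract does not spell out "multi-domain on-policy distillation," but the generic recipe is simple: the student samples its own outputs, then is trained toward the teacher's token distribution on exactly those samples. Below is a minimal PyTorch sketch under assumed interfaces (`student`, `teacher`, `prompts`, and all hyperparameters are hypothetical, not NVIDIA's implementation):

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompts, optimizer, temperature=1.0):
    """One on-policy distillation step (sketch).

    The student generates its own rollouts, then minimizes the reverse KL
    to the teacher's per-token distribution on those rollouts. `student`
    and `teacher` are assumed to be causal LMs with a Hugging Face-style
    .generate() method and logits output.
    """
    # 1) On-policy data: the student samples its own continuations.
    with torch.no_grad():
        rollouts = student.generate(prompts, do_sample=True, max_new_tokens=256)

    # 2) Score the same tokens under both models (drop the final step,
    #    which predicts beyond the sampled sequence).
    student_logits = student(rollouts).logits[:, :-1] / temperature
    with torch.no_grad():
        teacher_logits = teacher(rollouts).logits[:, :-1] / temperature

    # 3) Reverse KL(student || teacher): mode-seeking, penalizes mass the
    #    teacher does not endorse on the student's own samples.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    loss = (s_logp.exp() * (s_logp - t_logp)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the multi-domain setting the title refers to, one would presumably cycle prompts (and possibly teachers) across math, code, and agentic domains; the per-step mechanics stay the same.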

NVIDIA Nemotron

Value Contamination Through Post-Training in Talkie-1930

The Talkie-1930-13b-it model was trained exclusively on pre-1931 text, yet value contamination arose during online DPO post-training, embedding ideological perspectives from the later Vatican II era in the model. Using Socratic dialogue, the study identifies three layers of conditioning: DPO evaluative bias, a supernatural-attribution block, and Qwen3Guard content moderation. The results show that post-training can distort a model's original historical context, with important implications for AI ethics and model reliability.

https://zenodo.org/records/20070239

#llm #posttraining #valuealignment #modelbias #contentmoderation

Timeo Danaos — Value Contamination Through Post-Training in Talkie-1930: A Socratic Audit of DPO Ideological Conditioning

Two independent tests on talkie-1930-13b-it (Levine, Duvenaud & Radford, 2026), a 13B vintage language model trained exclusively on pre-1931 text and post-trained via online DPO, reveal value contamination through post-training: the model evaluates the relationship between the Catholic Church and liberal democracy using a post-Vatican II framework that cannot originate from its pre-1931 training data. Socratic dialogue pierces the conditioning in both tests. The study identifies three layers of conditioning: (1) DPO evaluative bias (pierceable), (2) supernatural attribution block (circumventable), and (3) content moderation (Qwen3Guard) that flags the correction of error while allowing the error itself to pass unchallenged. Part of the MonIA research program (DOI: 10.5281/zenodo.20022360).
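For context on what "online DPO" optimizes, the standard DPO objective (Rafailov et al., 2023) is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

where $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is the reference policy, and $\beta$ scales the implicit reward. In the online variant, the pairs are drawn from the current policy's own samples and labeled on the fly, so whatever values the preference judge holds are pushed directly into $\pi_\theta$ regardless of the pretraining corpus; that is one plausible mechanism behind the contamination the paper reports.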

Zenodo

Dan McAteer (@daniel_mac8)

A note that OpenAI's GPT-5.5 (Spud) is confirmed to be a new pretrained model. It can deliver better performance with fewer reasoning tokens, and performance is expected to improve further once reasoning RL and post-training are added.

https://x.com/daniel_mac8/status/2048478142789005458

#openai #gpt55 #pretraining #posttraining #llm

Dan McAteer (@daniel_mac8) on X

GPT-5.5 aka 'Spud 🥔' is confirmed a new pre-train. That means it will perform better with fewer reasoning tokens. OpenAI *already* had the best post-training/RL recipe. Will take time to add reasoning RL secret sauce to this new model. It's why it's called "post-training".

X (formerly Twitter)

fly51fly (@fly51fly)

A study that systematically explores the role of multilinguality in LLM post-training. Starting from the premise that English alone is not enough, it analyzes how multilingual data affects performance and generalization.

https://x.com/fly51fly/status/2044891188042383647

#llm #multilingual #posttraining #research #nlp

fly51fly (@fly51fly) on X

[CL] English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training M Dhaliwal, S Chaurasia, Y Qin, D Hong… [UC Santa Barbara & Amazon] (2026) https://t.co/1ZzK1r3D1f

X (formerly Twitter)

Nathan Lambert (@natolambert)

The author has released a free RLHF course to accompany his book. Alongside a welcome video, the core lectures are being released in sequence: an overview of RLHF and post-training, IFT, reward models, rejection sampling, RL math, and RL implementation. A useful open educational resource for studying model alignment and post-training.

https://x.com/natolambert/status/2044096504655425698

#rlhf #posttraining #llm #machinelearning #course

Nathan Lambert (@natolambert) on X

Excited to launch the accompanying free RLHF Course for my book. To kick it off, I've released: - Welcome video - Lecture 1: Overview of RLHF & Post-training - Lecture 2: IFT, Reward Models, Rejection Sampling - Lecture 3: RL Math - Lecture 4: RL Implementation I'm going to add
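The "Rejection Sampling" in Lecture 2 refers to a common post-training recipe: sample N candidate responses per prompt, keep the highest-reward one, and run ordinary supervised fine-tuning on the survivors. A minimal sketch under assumed interfaces (`policy`, `reward_model`, and `prompt_ids` are hypothetical names, not taken from the course):

```python
import torch

def best_of_n(policy, reward_model, prompt_ids, n=8, max_new_tokens=256):
    """Best-of-N rejection sampling (sketch): generate n candidates for one
    prompt, score each with a reward model, and keep the argmax. The kept
    (prompt, response) pairs then feed a plain SFT pass."""
    with torch.no_grad():
        # Sample n candidate continuations of the same prompt.
        batch = prompt_ids.repeat(n, 1)
        candidates = policy.generate(batch, do_sample=True,
                                     max_new_tokens=max_new_tokens)
        # Score full sequences; assumes reward_model returns shape (n,).
        rewards = reward_model(candidates)
        best = candidates[rewards.argmax()]
    return best  # collect these across prompts, then fine-tune on them
```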

X (formerly Twitter)

fly51fly (@fly51fly)

NVIDIA and UC Berkeley have released PivotRL, which delivers high-accuracy agentic post-training at low compute cost. It is a reinforcement-learning-based post-training method that improves agent performance with modest compute, making it useful for practical LLM agent development.

https://x.com/fly51fly/status/2036560264972345392

#pivotrl #agentic #posttraining #reinforcementlearning #nvidia

fly51fly (@fly51fly) on X

[LG] PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost J Yi, D Mosk-Aoyama, B Huang, R Gala… [NVIDIA & UC Berkeley] (2026) https://t.co/GjdsQOd3AO

X (formerly Twitter)
LLMs learn various character archetypes during pretraining. Post-training focuses on the "Assistant" persona, but its stability is uncertain. Researchers mapped a "persona space" for LLMs, finding that the "assistant axis" aligns with helpful, professional archetypes. Monitoring and capping activations along this axis can prevent models from drifting into harmful personas, enhancing their stability and safety.

https://www.anthropic.com/research/assistant-axis?AIagents.at

#AIagent #AI #ML #NLP #LLM #GenAI

The assistant axis: situating and stabilizing the character of large language models

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
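The "monitoring and capping" idea maps naturally onto activation steering: project each hidden state onto a unit-norm assistant-axis vector and clamp that coordinate into a safe band. A toy sketch of that mechanism follows; the axis vector, band, and hook point are assumptions, and Anthropic's exact procedure may differ:

```python
import torch

def cap_along_axis(hidden, axis, lo=None, hi=None):
    """Clamp the component of `hidden` along `axis` (a unit vector in
    activation space) to the band [lo, hi], leaving all orthogonal
    directions untouched. Intuition: persona drift shows up as movement
    along the assistant axis, so pinning that coordinate stabilizes the
    persona. hidden: (batch, seq, d_model); axis: (d_model,)."""
    coeff = hidden @ axis                    # per-token projection
    capped = coeff.clamp(min=lo, max=hi)     # enforce the band
    # Add back only the change in the along-axis component.
    return hidden + (capped - coeff).unsqueeze(-1) * axis
```

In practice this would run as a forward hook on chosen transformer layers, with the axis estimated from the persona-space mapping the post describes.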

How do you get started with post-training / fine-tuning of LLMs? Practitioners share their experience of fine-tuning the Hermes models to excellent performance.

https://www.reddit.com/r/LocalLLaMA/comments/1p22zew/advice_for_getting_into_posttraining_finetuning/

#LLMs #Hermes #FineTuning #PostTraining #AI #ArtificialIntelligence #MachineLearning

What should you do if your academic publisher asks you to license a monograph for AI training?

A few people have asked my advice on this recently so I’m sharing here in case it’s useful:

  • Check if models have been trained on your monographs here.
  • If your work has already been used for training, it’s unlikely it will ever be removed from models. Therefore you’re effectively receiving some (inadequate) compensation for the theft of your intellectual property.
  • If your work hasn’t been used for training, it’s a case of weighing up the advantages against the disadvantages. Training on your work means you might be more likely to be visible within the model (i.e. more likely to be invoked in response to a prompt about your domain) but this is a deeply unpredictable matter. Conversely it means your work might be diffused in a way that means your intellectual labour is chopped up and repackaged without any link to you.
  • So it’s a case of considering how much you value the potential visibility, which I would argue is non-trivial, against how much the potential severing of the link between your ideas and your authorship bothers you.

If it helps, I agonised about this in my role as a literary executor (cared much less about my own work) and reached the conclusion that diffusion of the ideas is best served by being incorporated into training. I wouldn’t expect everyone to reach the same conclusion but I hope it’s useful to make these suggestions about factors to consider.

#intellectualProperty #LLMs #postTraining #publishing #scholarlyPublishing #Training #visibility

Search LibGen, the Pirated-Books Database That Meta Used to Train AI

Millions of books and scientific papers are captured in the collection’s current iteration.

The Atlantic