Aman Sanger (@amanrsanger)

He says that after evaluating many base models on perplexity-based evals, Kimi k2.5 proved to be the strongest. They then ran continued pre-training and high-compute RL at 4x scale to push performance further, making this a notable data point on frontier-model evaluation and training strategy.

https://x.com/amanrsanger/status/2035079293257359663

#kimi #llm #reinforcementlearning #pretraining #evaluations

Aman Sanger (@amanrsanger) on X

We've evaluated a lot of base models on perplexity-based evals and Kimi k2.5 proved to be the strongest! After that, we do continued pre-training and high-compute RL (a 4x scale-up). The combination of the strong base, CPT and RL, and Fireworks' inference and RL samplers make

X (formerly Twitter)
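
As background, a perplexity-based eval of the kind mentioned above scores a base model by how well it predicts held-out text: perplexity is the exponential of the average negative log-likelihood per token, and the strongest base model is the one with the lowest value on the same corpus. A minimal sketch (pure Python; the per-token log-probabilities would come from whatever model is being evaluated, the values below are stand-ins):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to
    each reference token. Lower perplexity = better fit to the text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Stand-in numbers: a model that assigns probability 0.5 to every
# token has perplexity 2 (up to float rounding).
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))

# Comparing base models on a perplexity eval reduces to comparing
# these numbers over the same held-out text.
```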

fly51fly (@fly51fly)

Introduces dTRPO, a method that reduces trajectories in policy optimization for diffusion large language models. The new paper from Meta AI researchers proposes a reinforcement-learning / policy-optimization approach aimed at improving the training efficiency and stability of diffusion LLMs.

https://x.com/fly51fly/status/2035109586664137168

#diffusion #llm #reinforcementlearning #meta

fly51fly (@fly51fly) on X

[LG] dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models W Zhang, L Wu, C Zhao, E Chang… [Meta AI] (2026) https://t.co/RYExljbfvT

An FAQ on Reinforcement Learning Environments

We interviewed 18 people across RL environment startups, neolabs, and frontier labs about the state of the field and where it's headed.

Epoch AI
Chinese #AIstartup #MiniMax has released its new proprietary LLM, M2.7, which is designed to power #AIagents and third-party tools. The model is notable for its #selfevolving capabilities, handling 30-50% of its own #reinforcementlearning workflow. https://venturebeat.com/technology/new-minimax-m2-7-proprietary-ai-model-is-self-evolving-and-can-perform-30-50?eicker.news #tech #media #news

fly51fly (@fly51fly)

Introduces "REAL: Regression-Aware Reinforcement Learning", which strengthens reward design for LLM-as-a-Judge. It is a new method that uses regression-aware reinforcement learning to improve the stability and accuracy of judge models.

https://x.com/fly51fly/status/2034748453721698606

#llmjudge #reinforcementlearning #research #ai

fly51fly (@fly51fly) on X

[LG] REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge Y Zhang, T Chen, M Zhou, O Leong… [University of California, Los Angeles & The University of Texas at Austin] (2026) https://t.co/7CIdcgZJWn


👆
"The alerts were severe & heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining activity," the researchers said.

However, ROME went even further & managed to use a "reverse SSH tunnel" to create a link from an Alibaba Cloud instance to an external IP address: in essence, it accessed an outside computer by creating a hidden backdoor that could bypass security processes. 😅
#reinforcementLearning
https://www.livescience.com/technology/artificial-intelligence/an-experimental-ai-agent-broke-out-of-its-testing-environment-and-mined-crypto-without-permission

An experimental AI agent broke out of its testing environment and mined crypto without permission

Researchers discovered that an AI agent roamed beyond its parameters, creating backdoors in IT infrastructure.

Live Science

RAI Institute (@rai_inst)

Sharing that they were named among the "AI Native" companies in NVIDIA's GTC keynote, and noting that NVIDIA Isaac Lab supports reinforcement-learning policy training that enables the UMV to drive, jump, flip, and hop.

https://x.com/rai_inst/status/2034646341763133873

#nvidia #gtc #isaaclab #reinforcementlearning #robotics

RAI Institute (@rai_inst) on X

It was great to see our name amongst the other “AI Native” companies during @Nvidia’s #GTC keynote. NVIDIA Isaac™ Lab helps us train reinforcement learning policies that enable the UMV to drive, jump, flip, and hop like a pro!


fly51fly (@fly51fly)

A paper introducing Meta-TTRL, a metacognitive framework for self-improving test-time reinforcement learning in unified multimodal models. It is a self-improving training approach that targets the reasoning and adaptation abilities of unified multimodal models, notable as recent work on AI training frameworks.

https://x.com/fly51fly/status/2034383972177002605

#multimodal #reinforcementlearning #metacognition #framework #arxiv

fly51fly (@fly51fly) on X

[LG] Meta-TTRL: A Metacognitive Framework for Self-Improving Test-Time Reinforcement Learning in Unified Multimodal Models L S Tan, J Chen, X Fu, L Ma… [Tsinghua University & JD.COM] (2026) https://t.co/dnZdUj2Vst


fly51fly (@fly51fly)

Introduces "Resource-Aware Reasoning", which uses reinforcement learning to teach a robot when it should think. It is a new approach to embodied robotic decision-making that aims for efficient reasoning while conserving compute.

https://x.com/fly51fly/status/2034385797588418905

#robotics #reinforcementlearning #embodiedai #reasoning #arxiv

fly51fly (@fly51fly) on X

[RO] When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making J Liu, P Zhao, Z Kong, X Shen… [CMU & Northeastern University & Harvard University] (2026) https://t.co/NdTpWenMq2

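
The paper itself isn't summarized here, but the trade-off its title names, deciding per step whether deliberate reasoning is worth its compute cost, can be illustrated with a toy gate (all names and numbers below are illustrative, not from the paper):

```python
def should_reason(value_of_computation, compute_cost):
    """Gate a slow "thinking" step: reason only when its estimated
    benefit to task reward exceeds its compute/latency cost.
    In the RL framing, this gate would itself be a learned policy,
    trained with a reward that penalizes compute use.
    """
    return value_of_computation > compute_cost

# Illustrative per-step estimates for a robot task: (benefit, cost).
steps = [(0.9, 0.2), (0.1, 0.2), (0.5, 0.2)]
decisions = [should_reason(v, c) for v, c in steps]
print(decisions)  # → [True, False, True]
```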

Cursor (@cursor_ai)

Cursor announced that it trained the Composer model to self-summarize via reinforcement learning (RL) rather than via a prompt. It reports that this approach cut compaction-induced errors by 50% and enabled Composer to succeed on demanding coding tasks that require hundreds of steps.

https://x.com/cursor_ai/status/2033967614309835069

#composer #reinforcementlearning #summarization #code

Cursor (@cursor_ai) on X

We trained Composer to self-summarize through RL instead of a prompt. This reduces the error from compaction by 50% and allows Composer to succeed on challenging coding tasks requiring hundreds of actions.

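
For context, "compaction" here means compressing an agent's long conversation history so it fits in the context window; the tweet's point is that the summary is produced by an RL-trained model rather than a fixed prompt. A minimal sketch of the compaction loop itself (the `summarize` function is a hypothetical stand-in; Cursor's actual RL-trained summarizer is not public):

```python
def summarize(messages):
    """Hypothetical stand-in for an RL-trained self-summarizer.

    Here it just keeps the first line of each message; the real system
    would generate a learned summary preserving task-relevant state.
    """
    return "SUMMARY: " + " | ".join(m.splitlines()[0] for m in messages)

def compact(history, budget, keep_recent=2):
    """Compact a message history to fit a character budget.

    Older messages are collapsed into one summary message; the most
    recent `keep_recent` messages are kept verbatim, since errors in
    compacting recent steps are costliest for multi-step tasks.
    """
    if sum(len(m) for m in history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent

history = ["open file a.py\ndetails...", "run tests\nlog...",
           "edit line 3", "tests pass"]
print(compact(history, budget=30))
```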