➤ Reinforcement Learning for Forecasting: Applications and Breakthroughs
✤ https://arxiv.org/abs/2505.17989
This work examines how reinforcement learning with verifiable rewards (RLVR) can be extended to messier, real-world forecasting tasks. The researchers adapted two leading RL algorithms (GRPO and ReMax) for a 14B-parameter model, which matches a frontier baseline on forecasting accuracy while surpassing it in calibration and hypothetical prediction-market performance. The results show that refined RLVR methods can turn even smaller language models into potentially economically valuable forecasting tools.
+ Impressive work! Applying reinforcement learning to forecasting, and showing it can deliver real economic value, points to a new direction for AI.
+ I'm curious whether this approach could carry over to other domains such as financial markets or climate forecasting. The potential here seems enormous.
#ArtificialIntelligence #MachineLearning #Forecasting #ReinforcementLearning

Outcome-based Reinforcement Learning to Predict the Future
Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this calibration edge into \$127 of hypothetical profit versus \$92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
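A minimal sketch (not the authors' code) of the two advantage adaptations the abstract describes: GRPO's group-relative advantage with the per-question variance scaling dropped, and a ReMax-style baseline-subtracted advantage that uses the reward of a greedy rollout as the baseline. The function names and exact formulas below are assumptions based only on the abstract.

```python
import numpy as np

def grpo_advantages_no_variance_scaling(group_rewards):
    """Group-relative advantages with per-question variance scaling removed.

    Standard GRPO divides (r_i - mean) by the group's reward std; for binary,
    noisy forecasting rewards the abstract drops that scaling, leaving a plain
    mean baseline (the exact form here is an assumption).
    """
    r = np.asarray(group_rewards, dtype=float)
    return r - r.mean()  # no division by r.std()

def remax_advantages(sampled_rewards, greedy_baseline_reward):
    """ReMax-style baseline-subtracted advantages: subtract the reward of a
    greedy (argmax-decoded) rollout on the same question from each sample."""
    r = np.asarray(sampled_rewards, dtype=float)
    return r - float(greedy_baseline_reward)
```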
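The reported metrics (Brier = 0.193, ECE = 0.042) can be computed from a vector of probabilistic forecasts and binary outcomes. Below is a small illustrative sketch: Brier score, one common equal-width-bin definition of ECE (the paper's binning may differ), and a purely hypothetical aggregation-plus-trading rule, since the abstract does not spell out how the seven predictions are ensembled or how the trading rule is defined.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0/1)."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: size-weighted average of |mean outcome - mean confidence| per bin."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    bin_ids = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)

def ensemble_forecast(sampled_probs):
    """Aggregate several sampled forecasts (the paper uses seven); the median is
    one robust choice, but the paper's aggregation rule may differ."""
    return float(np.median(np.asarray(sampled_probs, dtype=float)))

def hypothetical_trade(model_prob, market_price, stake=1.0):
    """Toy rule, illustrative only: buy 'yes' if the model thinks the event is
    underpriced by the market, buy 'no' if overpriced, otherwise abstain."""
    if model_prob > market_price:
        return "buy_yes", stake
    if model_prob < market_price:
        return "buy_no", stake
    return "abstain", 0.0
```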