Mastodawn

As a professional mathematician, I would say that a good proof requires a very good representation of the problem, and then pulling out the tricks. The latter part is easy to get working with LLMs; they can do it already. It's the former part that still needs humans, and I'm perfectly fine with that.

hodgehog11 13h ago

This argument, that LLMs can develop new crazy strategies using RLVR on math problems (like what happened with chess), turns out to be false without a serious paradigm shift. Essentially, the search space is far too large, and the model will need help to explore it better, probably with human feedback.

https://arxiv.org/abs/2504.13837

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.
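To make the pass@k comparison in the abstract concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that were correct. Whether the paper computes the metric in exactly this form is an assumption, and the per-problem counts below are invented purely for illustration; they are not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 256 on four problems.
N_SAMPLES = 256
base_counts = [8, 4, 2, 1]     # rarely right, but never at zero
rlvr_counts = [200, 150, 0, 0] # often right on two problems, never on the others

if __name__ == "__main__":
    for k in (1, 16, 256):
        base = mean_pass_at_k(base_counts, N_SAMPLES, k)
        rlvr = mean_pass_at_k(rlvr_counts, N_SAMPLES, k)
        print(f"k={k:3d}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts, the hypothetical RLVR model wins at k = 1 (about 0.34 versus 0.015), while the hypothetical base model wins at k = 256 (1.0 versus 0.5). That is the shape of the crossover the abstract describes: sharper sampling on problems already within reach, but narrower coverage overall.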

hodgehog11 14h ago

It's finding constructions and counterexamples. That's different from finding new proof techniques, but it is still extremely useful and still leads to novel findings.

hodgehog11 Mar 19

That's an excellent point. It seems likely they thought they could operate as a proper reviewer, but when the deadline came, they took the shortcut they knew they were not supposed to take.

It really does sound like an addiction when you put it this way.

hodgehog11 Mar 19

I was thinking this too, but I don't believe this is the case, and I feel like it would not be a good idea either.

Most of these people are likely students; this should be a learning moment, but I don't think it is yet grounds for their entire academic career to be crippled by being unable to publish in a top-tier ML venue.

hodgehog11 Mar 19

I'm amazed that such a simple method of detection worked so flawlessly for so many people. This would not work for those who merely used LLMs to help pinpoint strengths and weaknesses in the paper; there are separate techniques to judge that. Instead, it only detects those who quite literally copied and pasted the LLM output as a review.

It's incredible how so many people thought it was fair that their paper should be assessed by human reviewers alone, and yet would not extend the same courtesy to others.
