Further human + AI + proof assistant work on Knuth's "Claude Cycles" problem

Knuth Claude's Cycles note update: problem now fully solved, by LLMs - https://news.ycombinator.com/item?id=47306926 - March 2026 (2 comments)

https://chatgpt.com/share/69aaab4b-888c-8003-9a02-d1df80f9c7...

Claude's Cycles [pdf] - https://news.ycombinator.com/item?id=47230710 - March 2026 (362 comments)

https://twitter.com/BoWang87/status/2037648937453232504

Knuth Claude's Cycles note update: problem now fully solved, by LLMs | Hacker News

I've always said this but AI will win a fields medal before being able to manage a McDonald's.

Math seems difficult to us because it's like using a hammer (the brain) to twist in a screw (math).

LLMs are discovering a lot of new math because they are great at low depth high breadth situations.

I predict that in the future people will ditch LLMs in favor of AlphaGo style RL done on Lean syntax trees. These should be able to think on much larger timescales.

Any professional mathematician will tell you that their arsenal is ~ 10 tricks. If we can codify those tricks as latent vectors it's GG

Tricks are nothing but patterns in the logical formulae we reduce.

Ergo these are latent vectors in our brain. We use analogies like geometry in order to use Algebraic Geometry to solve problems in Number Theory.

An AI trained on Lean Syntax trees might develop it's own weird versions of intuition that might actually properly contain ours.

If this sounds far fetched, look at Chess. I wonder if anyone has dug into StockFish using mechanistic interpretability

This argument, that LLMs can develop new crazy strategies using RLVR on math problems (like what happened with Chess), turns out to be false without a serious paradigm shift. Essentially, the search space is far too large, and the model will need help to explore better, probably with human feedback.

https://arxiv.org/abs/2504.13837

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs), particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.

arXiv.org
The search space for the game of Go was also thought to be too large for computers to manage.