A new paper with Bogdan Georgiev, Javier Gomez-Serrano, and Adam Zsolt Wagner: "Mathematical exploration and discovery at scale" https://arxiv.org/abs/2511.02864 , in which we record our experiments using the LLM-powered optimization tool #AlphaEvolve to attack 67 different math problems (both solved and unsolved), improving upon the state of the art in some cases and matching previous literature in others. The data for these experiments can be found at https://github.com/google-deepmind/alphaevolve_repository_of_problems and further discussion is at https://terrytao.wordpress.com/2025/11/05/mathematical-exploration-and-discovery-at-scale/
Mathematical exploration and discovery at scale

AlphaEvolve (Novikov et al., 2025) is a generic evolutionary coding agent that combines the generative capabilities of LLMs with automated evaluation in an iterative evolutionary framework that proposes, tests, and refines algorithmic solutions to challenging scientific and practical problems. In this paper we showcase AlphaEvolve as a tool for autonomously discovering novel mathematical constructions and advancing our understanding of long-standing open problems. To demonstrate its breadth, we considered a list of 67 problems spanning mathematical analysis, combinatorics, geometry, and number theory. The system rediscovered the best known solutions in most of the cases and discovered improved solutions in several. In some instances, AlphaEvolve is also able to generalize results for a finite number of input values into a formula valid for all input values. Furthermore, we are able to combine this methodology with Deep Think and AlphaProof in a broader framework where the additional proof-assistants and reasoning systems provide automated proof generation and further mathematical insights. These results demonstrate that large language model-guided evolutionary search can autonomously discover mathematical constructions that complement human intuition, at times matching or even improving upon the best known results, highlighting the potential for significant new ways of interaction between mathematicians and AI systems. We present AlphaEvolve as a powerful new tool for mathematical discovery, capable of exploring vast search spaces to solve complex optimization problems at scale, often with significantly reduced requirements on preparation and computation time.
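The propose-test-refine loop described in the abstract can be sketched generically. Here is a minimal toy version in Python, with random mutation standing in for the LLM's proposal step; the function names and the toy objective are illustrative, not from the paper:

```python
import random

def evolve(score, init, mutate, generations=200, pop_size=20, seed=0):
    """Toy evolutionary loop: propose (mutate), test (score), refine (select)."""
    rng = random.Random(seed)
    pop = [init() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        survivors = pop[: pop_size // 2]            # keep the best half
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        pop = survivors + children
    return max(pop, key=score)

# Toy problem: maximize -(x - 3)^2, whose optimum is at x = 3.
best = evolve(
    score=lambda x: -(x - 3.0) ** 2,
    init=lambda: 0.0,
    mutate=lambda x, rng: x + rng.gauss(0, 0.5),
)
```

In AlphaEvolve the candidates are programs rather than numbers, the mutation step is an LLM rewriting code, and the score comes from an automated evaluator; the surrounding loop has the same shape.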

@tao At Pusan National University we tried to run experiments with FunSearch, but the computational requirements were too demanding. Does the paper include an estimate of the cost of running these experiments?
@tao DeepSeek reduced the cost of LLM, but still...
@Don_Rubiel There is some discussion in Section 3, although we did not receive permission to share exact compute costs. There is definitely a tradeoff in performance between the speed of AlphaEvolve in finding good solutions, and the amount of compute expended. When working at scale, it would make sense to first run a low-power version of this tool on all the problems to collect all the "low hanging fruit" of easy counterexamples, use human analysis to draw whatever conclusions one can from these, and then follow up with more expensive runs on more targeted problems, using whatever insights one could gain from the cheaper runs to increase performance.
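That low-power-first strategy can be phrased as a simple two-stage loop. A hedged sketch, where `cheap_run`, `expensive_run`, and `target` are hypothetical stand-ins for the actual tooling:

```python
def staged_search(problems, cheap_run, expensive_run, target):
    """Spend a small budget on every problem first, then reserve the
    expensive runs for the problems the cheap pass did not settle.
    (Illustrative only; not the paper's actual pipeline.)"""
    scores = {p: cheap_run(p) for p in problems}           # low-hanging fruit
    unsettled = [p for p in problems if scores[p] < target(p)]
    for p in unsettled:                                    # targeted follow-up
        scores[p] = max(scores[p], expensive_run(p))
    return scores
```

The human-analysis step in between would inform both `target` (what counts as settled) and how the expensive runs are configured.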
@tao how good was the generalizer for the model? More specifically, were the patterns something that would merely have been tedious for humans to find, or something genuinely nontrivial to find from the finite examples?
@allendist57 IMO 2025 problem 6 https://google-deepmind.github.io/alphaevolve_repository_of_problems/problems/65.html was an interesting case of this: see the discussion in section 43, where AlphaEvolve discovered the construction in Figure 34. Only a minority of the human participants at the 2025 IMO, and none of the AI tools applied to the problem, were able to obtain such a construction. (But one should make the caveat that locating the optimal construction is only one half of this problem; the other half is to prove that the construction is optimal, and AlphaEvolve had no capability on its own to accomplish this. On the other hand, one could imagine that a human attacking this problem, given this example by AlphaEvolve, could use it as inspiration to try to establish a rigorous proof of optimality.)

@tao I meant: what if a human were given the finite cases? How easy would it be for them to arrive at the model's generalization for arbitrary n?
@allendist57 In our experiments we have been incentivizing AlphaEvolve to come up with solutions that are as interpretable as possible, with as clear a dependency on the parameter n as possible. So the general solutions that have been found are ones which a human could generalize from seeing the code that generates them for small n. (Although in the finite field Kakeya and Nikodym examples, if a human were just given the raw set of points rather than the code used to procedurally generate them, it would be significantly harder to discern the pattern.)
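To illustrate the "code vs. raw points" distinction: the classical parabola-tangent construction of a small Kakeya set in F_p^2 is transparent as code but looks unstructured as a bare list of points. A minimal sketch (this is the standard textbook construction, not AlphaEvolve's output):

```python
def kakeya_set(p):
    """Classical small Kakeya set in F_p^2 (p an odd prime): all points (x, y)
    with x^2 - y a square, plus one vertical line.  Size ~ p^2 / 2."""
    squares = {(t * t) % p for t in range(p)}
    K = {(x, y) for x in range(p) for y in range(p)
         if (x * x - y) % p in squares}
    K |= {(0, y) for y in range(p)}  # a line in the vertical direction
    return K

def contains_line_in_every_direction(K, p):
    """Brute-force check of the Kakeya property."""
    inv4 = pow(4, p - 2, p)  # inverse of 4 mod p, via Fermat's little theorem
    for m in range(p):
        # The line y = m x - m^2/4 lies in K, since for a point (t, m t - m^2/4)
        # we have t^2 - y = (t - m/2)^2, which is a square.
        b = (-m * m * inv4) % p
        if not all((t, (m * t + b) % p) in K for t in range(p)):
            return False
    return all((0, y) in K for y in range(p))  # vertical direction
```

Given only the ~p^2/2 raw coordinate pairs, the quadratic-residue pattern would be far less obvious than it is from the five lines of code.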
@tao thank you for your response. Do you think the generalization aspect will improve soon?
@tao this is one of the neatest things I have seen people use LLMs for so far. Do you plan to keep going with this research and scale this up? Do you see any possible ways to generalize this?

@tao

I’ve recreated GraphMERT and adapted it specifically for mathematics:

https://arxiv.org/abs/2510.09580

Using arXiv’s API together with DeepSeek-OCR, I trained it to create triplets of mathematical facts. The current setup achieves 0% hallucinations, though the OCR still struggles with math symbols.

I’d love your input on two points:

1. What are the best unstructured datasets to feed such a math-focused architecture?

2. What potential benefits or complementarities do you see between this approach and your recent paper "Mathematical Exploration and Discovery at Scale"?

https://arxiv.org/abs/2511.02864

(My goal is to test whether a MERT-based triplet embedding system can augment AlphaEvolve-style search by providing denser mathematical relational priors before reasoning.)

@tao you looked at the ovals problem! If I understand your paper correctly, this is *evidence* for the conjectured bound, but the conjecture itself is still open....
@julie Yes, this is correct. We tasked AlphaEvolve with looking for counterexamples to the ovals conjecture but it did not find any, which is (weak) evidence in favor of the conjecture.

@tao I have created my version of mathematics, labeled "the unified principle". China's 76-qubit computer, which yields in 4 minutes results that would take "billions" of years to compute, is 4 minutes slower than the unified principle. I'll provide an equation of validity.
T_Y / T_D = N_R
364 / 6.5 = 56

G_C / Y_U = N_R
20384 / 364 = 56

S = (223 * M_U) + (Y_U - M_U)
S = (223 * 28) + (364 - 28) = 6580

@tao the cosmic distance ladder. Not all mathematics is linear. The universe moves in a helical motion, so we have to account for that time even though it looks linear from Earth.

https://docs.google.com/document/d/1ODKNwOxPiHRPHWtZRLzXUMPcu2WUCUqy/edit?usp=drivesdk&ouid=102658146990030891869&rtpof=true&sd=true

@tao Amazing paper, the interactive HTML version can be found at https://www.sciencestack.ai/arxiv/2511.02864v1 (useful for accessibility + non-native speakers)