Simons Institute for the Theory of Computing (@SimonsInstitute)
On Pi Day, the community experiment 1stproof announced the upcoming release of a second batch of problems and a benchmark for evaluating research-level mathematical problem solving. 1stproof is a community-driven evaluation project for measuring AI capability on difficult, research-level mathematics.
Perhaps a small update on the state of the #1stProof discussion:
OpenAI submitted proposed solutions before the deadline. However, these were not produced by the AI alone; experts whom OpenAI had contacted were also involved and assisted the AI.
Mohammed Abouzaid wrote on Zulip: "I think the consensus is that 6, 9 and 10 are essentially correct."
Daniel Litt: "One pseudonymous expert has vouched for the solution to 5 on Twitter"
[What is this Twitter 🥸?]
🧵
In a day or so, the mathematicians behind the #1stproof challenge at https://1stproof.org/ will reveal their solutions to the 10 challenge problems they posted recently. (I am not directly involved in this challenge, although I know most of the authors personally and approve of their experiment.) It seems likely that there will be many claims, both trustworthy and dubious, of proofs of these problems by various AI-generated means.
The Erdős problems website, having dealt with this type of thing for several months now, has compiled guidance on how to increase confidence in the correctness of an AI-generated proof: https://github.com/teorth/erdosproblems/wiki/I-think-I-managed-to-get-my-favorite-AI-tool-to-solve-an-open-Erd%C5%91s-problem!--What-do-I-do-next%3F The wording there is specific to Erdős problems, but much of the advice applies more broadly.
I would like to highlight in particular the additional correctness guarantees provided by formalizing the argument in Lean. When used correctly, a Lean formalization of a proof can provide extremely high confidence that a given proof correctly proves the desired claim. However, if the Lean proof is itself AI-generated without supervision from an expert in Lean, there are still ways in which a supposed "Lean certificate" of correctness is unsatisfactory or even worthless. These include:
1. A Lean proof that introduces additional axioms beyond the standard three (`propext`, `Classical.choice`, `Quot.sound`), or that relies on malicious metaprogramming.
2. Subtle errors in the formalization of the *statement* of the result to be proved, which allow the claim to be proven on a technicality. (This is a particular risk if the statement formalization is also AI-generated.)
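As a minimal, hypothetical illustration of the second pitfall: if the formalized statement's hypotheses are jointly unsatisfiable, the "theorem" becomes vacuously provable regardless of the intended mathematical content. A Lean 4 sketch (the statement here is invented for illustration):

```lean
-- Hypothetical mis-formalization: the two bounds on `n` contradict each
-- other, so no `n` satisfies the hypotheses and the statement is
-- vacuously true -- the proof certifies nothing about the intended claim.
theorem vacuously_true (n : Nat) (h₁ : n < 2) (h₂ : 2 < n) : n % 2 = 1 := by
  omega  -- closes the goal from the contradictory hypotheses alone
```

This is why the statement formalization itself needs expert review, not just the proof.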
See https://leanprover-community.github.io/did_you_prove_it.html and https://lean-lang.org/doc/reference/latest/ValidatingProofs/#validating-proofs for best practices on guarding against such issues.
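For the axiom issue in point 1, Lean can report exactly which axioms a compiled proof transitively depends on via the `#print axioms` command; a sketch (the theorem is a trivial stand-in, not one of the challenge problems):

```lean
-- A trivial stand-in for a formalized result.
theorem demo (n : Nat) : n + 0 = n := Nat.add_zero n

-- Lists every axiom the proof relies on. For an honest classical proof,
-- nothing beyond `propext`, `Classical.choice`, and `Quot.sound` should
-- appear; any extra axiom (e.g. one introduced by `sorry` or a custom
-- `axiom` declaration) is a red flag.
#print axioms demo
```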
Please help promote this project called "First Proof" led by Mohammed Abouzaid (Stanford), Nikhil Srivastava (Cal), Rachel Ward (UT Austin), and Lauren Williams (Harvard). The goal is to understand the capabilities of AI systems on problems that come up in math research. We have a collection of research problems for which solutions have not yet been posted online, so it's a good testbed. The solutions will come out in just one week. Take a crack at it! #FirstProof #1stProof

To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.