Simons Institute for the Theory of Computing (@SimonsInstitute)
On Pi Day, the community experiment 1stproof announced the upcoming release of a second batch of problems and a benchmark for evaluating research-level mathematical problem solving. 1stproof is a community-driven evaluation project for measuring AI capability on difficult, research-level mathematics.
Perhaps a small update on the state of the #1stProof discussion:
OpenAI submitted proposed solutions before the deadline. However, these were not produced by the AI alone; experts whom OpenAI had contacted were also involved and assisted the AI.
Mohammed Abouzaid wrote on Zulip: "I think the consensus is that 6, 9 and 10 are essentially correct."
Daniel Litt: "One pseudonymous expert has vouched for the solution to 5 on Twitter"
[What is this Twitter 🥸?]
🧵
In a day or so, the mathematicians behind the #1stproof challenge at https://1stproof.org/ will reveal their solutions to the 10 challenge problems they posted recently. (I am not directly involved in this challenge, although I know most of the authors personally and approve of their experiment.) It seems likely that there will be many claims, both trustworthy and dubious, of proofs of these problems by various AI-generated means.
The Erdős problems website, having dealt with this type of thing for several months now, has compiled guidance on how to increase confidence in the correctness of an AI-generated proof: https://github.com/teorth/erdosproblems/wiki/I-think-I-managed-to-get-my-favorite-AI-tool-to-solve-an-open-Erd%C5%91s-problem!--What-do-I-do-next%3F The wording there is specific to Erdős problems, but much of the advice applies more broadly.
I would like to highlight in particular the additional correctness guarantees provided by formalizing the argument in Lean. When used correctly, a Lean formalization of a proof can provide extremely high confidence that a given proof correctly proves the desired claim. However, if the Lean proof is itself AI-generated without supervision from an expert in Lean, there are still ways in which a supposed "Lean certificate" of correctness is unsatisfactory or even worthless. These include:
1. A Lean proof that introduces additional axioms beyond the standard three (`propext`, `Classical.choice`, `Quot.sound`), or that relies on malicious metaprogramming.
2. Subtle errors in the formalization of the *statement* of the result to be proved, which allow the claim to be proven on a technicality. (This is a particular risk if the statement formalization is also AI-generated.)
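As a minimal, hypothetical illustration of the second pitfall: if the formalized statement's hypotheses are jointly unsatisfiable, the "theorem" becomes vacuously provable regardless of the intended mathematical content. A Lean 4 sketch (the statement here is invented for illustration):

```lean
-- Hypothetical mis-formalization: the two bounds on `n` contradict each
-- other, so no `n` satisfies the hypotheses and the statement is
-- vacuously true -- the proof certifies nothing about the intended claim.
theorem vacuously_true (n : Nat) (h₁ : n < 2) (h₂ : 2 < n) : n % 2 = 1 := by
  omega  -- closes the goal from the contradictory hypotheses alone
```

This is why the statement formalization itself needs expert review, not just the proof.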
See https://leanprover-community.github.io/did_you_prove_it.html and https://lean-lang.org/doc/reference/latest/ValidatingProofs/#validating-proofs for best practices on guarding against such issues.
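For the axiom issue in point 1, Lean can report exactly which axioms a compiled proof transitively depends on via the `#print axioms` command; a sketch (the theorem is a trivial stand-in, not one of the challenge problems):

```lean
-- A trivial stand-in for a formalized result.
theorem demo (n : Nat) : n + 0 = n := Nat.add_zero n

-- Lists every axiom the proof relies on. For an honest classical proof,
-- nothing beyond `propext`, `Classical.choice`, and `Quot.sound` should
-- appear; any extra axiom (e.g. one introduced by `sorry` or a custom
-- `axiom` declaration) is a red flag.
#print axioms demo
```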
Please help promote this project called "First Proof" led by Mohammed Abouzaid (Stanford), Nikhil Srivastava (Cal), Rachel Ward (UT Austin), and Lauren Williams (Harvard). The goal is to understand the capabilities of AI systems on problems that come up in math research. We have a collection of research problems for which solutions have not yet been posted online, so it's a good testbed. The solutions will come out in just one week. Take a crack at it! #FirstProof #1stProof

To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.