Contrary to claims otherwise, LLMs cannot solve non-trivial mathematical problems. They ramble on, and people do not even notice. One proponent, for example, wanted to demonstrate the opposite by presenting a proof. Unfortunately, the proof is wrong. How embarrassing.

https://garymarcus.substack.com/p/reports-of-llms-mastering-math-have

Ernest Davis and Gary Marcus have written about this:

„The refusal of these kinds of AI to admit ignorance or incapacity and their obstinate preference for generating incorrect but plausible-looking answers instead are one of their most dangerous characteristics. It is extremely easy for a user to pose a question to an LLM, get what looks like a valid answer, and then trust to it, without doing the careful inspection necessary to check that it is actually right.“

„If this kind of technology becomes commonly used to answer difficult questions before the problem of generating invalid answers is fixed, we will be in serious trouble.“

And by the way, symbolic AI is different: either the systems can do it or they cannot. And when they cannot, it is at least clear that they cannot:

„Importantly, the neurosymbolic method used by DeepMind’s AlphaProof and AlphaGeometry systems (which we discussed recently) which (more or less) achieved a silver-medal level performance on the 2024 International Math Olympiad, is immune to this problem. AlphaProof and AlphaGeometry generate a completely detailed symbolic proof that can be fed into a formal proof verifier. They can fail to find a proof, but they cannot generate an incorrect proof. But that is because they rely in part on powerful, completely hand-written, symbolic reasoning systems. LLMs are not similarly immune.“

Nestler is the person who claimed on X that of course LLMs can do this:

„So Nestler’s experiment does not contradict the finding of the report; it corroborates it. o3, yet again, produced an invalid answer to this problem. It also confirms how dangerous this kind of failing is. The AI output a “proof” that looked plausible to Nestler led Nestler to make a fool of himself in public by outrageously accusing reputable scientists of fakery. (To a trained mathematician, the error in Nestler’s own proof is pretty obvious once pointed out.)“

And this:
„If indeed the AIs could solve more of these problems with better prompts, then that’s evidence in favor of their mathematical ability but it’s evidence against their ability to judge the right thing to do on their own.“

Olympiad participants do not need to be told to make an effort and not to hand in bullshit. =:-)

„The really important challenge is not to get the AIs to solve more #USAMO problems; it is to get them to say “I give up” when they can’t. And we have yet to see any evidence that any kind of prompt helps in that regard.“

And once more on the subject of term papers: one of humanity's big problems is stupidity, or rather ignorance. Now, in the age of #LLM-based AI (#KI), gullibility is added to that. As a scientist, you have to learn to question things, because you will only get anywhere if you improve what you are building on.

But if you are ignorant and easily satisfied, you do not even notice what junk you are producing.

#Mathematik

Reports of LLMs mastering math have been greatly exaggerated

What happens when you minimize the chance of data leakage?

Marcus on AI

Gemini 2.5 gets 24.4% on MathArena USAMO beating previous top score of 4.7%

https://matharena.ai/

#HackerNews #Gemini2.5 #MathArena #USAMO #Score #TechNews #AI

MathArena.ai

MathArena: Evaluating LLMs on Uncontaminated Math Benchmarks

[2503.21934v1] Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
https://arxiv.org/abs/2503.21934v1

> all tested models struggled significantly, achieving less than 5% on average

#LLM #Math #USAMO #AI

Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

Recent math benchmarks for large language models (LLMs) such as MathArena indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely based on final numerical answers, neglecting rigorous reasoning and proof generation which are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, achieving less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.

arXiv.org
https://www.wacoca.com/news/2362254/ OpenAI CEO Altman denies sister's allegations of sexual abuse | Reuters #AMERS #ARTI #BACT #Biz #CASE1 #CLEB #CLJ #CMPNY #CRIM #DEST:NOJPTPM #DEST:NOJPWDM #DEST:NOJPZTM #DLI #Ent #Gen #HEA #JFOR #JLN #JUDIC #man #MNGISS #NAMER #NEWS #PRIVT #pro #PXP #RSBI:LITIGATION #SCI #SCRIM #SOCI #SOFW #SOFW1 #SWIT #Tech #TECH08 #TMT #TRN #US #USAMO #www #ニュース

In connection with a lawsuit in which the sister of Sam Altman, chief executive officer (CEO) of OpenAI, the company behind the artificial intelligence (AI) "ChatGPT", alleges that he regularly sexually abused her, Altman and his family have denied the sister's claims.

WACOCA NEWS

General Motors on Wednesday secured a new $6 billion line of credit and estimated that the cost of the United Auto Workers strike was $200 million ...

General Motors (GM.N) on Wednesday secured a new $6 billion line of credit and estimated that the cost of the United Auto Workers strike was $200 million during the third quarter, a company spokesman said. #RSBI:WORKER-RIGHTS #RATI:SUPPLY-CHAIN #RATI:WORKFORCE #AUT #AUTO #AUTOMV #CARM #CARM1 #CMPNY #CYCS #CYCS08 #DISP #GEN #JOB #PUBL #WPAY #WEU #AMERS #US #EUROP #BLUX #EZC #NAMER #NL #RSBI:LITIGATION #SUSTAINABLE-BUSINESS #BACT #BIZ #CDM #CORPD #DBT #FIN #FINS #FINS08 #LOA #TOPNWS #TOPCMB #USAOH #USAMI #USAIN #NEWS1 #USAMO
GM locks in $6 billion credit line as strike costs rise

General Motors (GM.N) on Wednesday secured a new $6 billion line of credit and estimated that the cost of the United Auto Workers strike was $200 million during the third quarter, a company spokesman said.

Reuters

#introduction #introductions #math #maths #ukmt

It's my 200th toot! I love maths, and am applying to #Cambridge #university to study it there 😱

In the meantime, have a go at this fun and reasonably easy maths question from the UKMT #mathschallenge.

#mathematics #amc #aime #usamo #olympiad #step #problem #bmo #brmo

Anyone from the olympiad community around on mastodon?

https://lgbt.io/media/vwvEHgst_3QnxxQdoRs