We knew, but the proof is nice.

"Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves"

The guess-the-next-words machines don’t actually understand anything.

https://nitter.poast.org/heynavtoor/status/2041243558833987600#m

#math #ai

@davidaugust And yet large companies are firing actual reasoning, thinking humans to replace them with these dumb-ass machines. Staggering.

@BruceMirken staggering, stupefying and stupid.

I think they’ll come to regret doing so. Many already have.

@davidaugust And so will investors when the AI bubble implodes, which it inevitably will. Humans have short memories.

@davidaugust

In other shocking news:

Water is Wet
Without air you will die

@lemgandi
The wetness of water has been hotly debated: to some, "wet" means "covered with or soaked in water", and it's questioned whether water can be covered with itself.
@davidaugust

@davidaugust

not new, here's the 2024 paper referenced:

https://arxiv.org/abs/2410.05229

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models.

Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
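The symbolic-template idea the abstract describes can be illustrated with a minimal sketch (hypothetical code, not the authors' actual GSM-Symbolic implementation): a grade-school question template with numeric slots is instantiated with fresh random values, and the ground-truth answer is computed alongside, so a model's answer can be checked across many variants of "the same" question.

```python
import random

# Hypothetical sketch of the symbolic-template idea: a question template
# with numeric placeholders, instantiated deterministically per seed so
# each variant carries its own ground-truth answer.
TEMPLATE = ("{name} picks {a} kiwis on Monday and {b} kiwis on Tuesday. "
            "How many kiwis does {name} have?")

def instantiate(seed):
    rng = random.Random(seed)
    a, b = rng.randint(10, 90), rng.randint(10, 90)
    question = TEMPLATE.format(name="Oliver", a=a, b=b)
    return question, a + b  # question text plus its ground-truth answer

question, answer = instantiate(0)
```

The paper's finding, in these terms, is that model accuracy varies noticeably across instantiations even though only the numbers change.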

arXiv.org
@joriki it’s from August.
Apple Engineers Show How Flimsy AI β€˜Reasoning’ Can Be

The new frontier in large language models is the ability to β€œreason” their way through problems. New research from Apple says it's not quite what it's cracked up to be.

WIRED
πŸ’‘πš‚π—†π–Ίπ—‹π—π—†π–Ίπ—‡ π™°π—‰π—‰π—ŒπŸ“± (@[email protected])

1/x #MathsMonday #Maths #Math Over time I've saved many screenshots of #AI #slop #aiSlop stuffing up #Mathematics big time, and on occasion I've had cause to reshare them, and at times I have cursed that I can only attach 4 pics per post. Then I realised: what am I worried about? Just post them all in a thread, then I can link to the thread (or individual screenshots) and add to it as more come up πŸ™‚ P.S. feel free to reply with more. I hereby present to you, AI's greatest 5hits...

dotnet.social

@davidaugust

Don't let @scottjenson catch you disseminating defeatist news on AI.

It's utterly your fault that we have this bad reputation on the Fedi with respect to AI.

@xdydx

@glitzersachen @scottjenson @xdydx I'm guessing you are joking. But I also suspect it may be an inside joke without a lot of folks on the inside.

@davidaugust @glitzersachen

Actually, this particular joke has the attention of quite a few people.

https://social.coop/@scottjenson/116358195717244835

@scottjenson

Scott Jenson (@[email protected])

OK, this is going even MORE sideways so I need to make a few things clear: 1. I took a complex point and made it poorly 2. My goal was to ask for more inclusiveness 3. I am sickened by what happened to BlackTwitter and I don't want it to recur 4. But I can't speak for BlackTwitter nor should I 5. I apologize to black mastodon users for making such a poor comparison 6. I'm not endorsing "AI Slop" they were a foil to make my point 7. I'm certainly NOT trying to compare AI bros to Black twitter (but, as I said, I can see how people made that connection. I'm trying to correct that here)

social.coop

@davidaugust did you see the documentary about "Project Nim", the monkey who got adopted and was supposed to be speaking sign language?

there have been other monkeys showing the same effect of this highly sophisticated begging, where the monkeys guess the expected behavior but do not understand

they answer with some sign, but they do not understand that there was a question, nor that they are giving an answer

they see human behavior and, from pattern matching, do what they think is appropriate for that moment because it had served them well in the past

it's hard for me to explain, but the documentary did a good job of preparing me for the sophisticated begging of LLMs

i wanted it to be true to communicate with monkeys, but the documentary gave me a Werner Herzog understanding of the beauty of standing apart, watching each other over an unbridgeable chasm

@drifthood yes, there does seem to be a threshold over which in some respects only humans cross over to one side.

I see that sort of begging in a dog. He wants the treat, so instead of just doing the behavior the human command asks for, he tries every response that has ever gotten him a treat until he β€œunlocks” the treat. Humans can and do do this too, but humans _also_ actually communicate and understand from time to time.

@drifthood @davidaugust This makes me think of "Clever Hans", the horse that appeared to do arithmetic but actually just responded to involuntary human cues:
https://en.wikipedia.org/wiki/Clever_Hans
Clever Hans - Wikipedia

@bladecoder @drifthood excellent point. It does feel to me vaguely like the mechanical turk too: a machine passed off as machine with a person inside able to give the β€œmachine” human-like skills and abilities.

There is a fiction inside all of them.

@davidaugust @bladecoder you are right,
there is lots of reporting from 404 Media, for example, that it's Nigerians who do a lot of the chatbotting

in the interview, the Nigerian worker explains how it's not possible for the workers to know if they are training a bot or chatting with a human on the other side

https://www.youtube.com/watch?v=QH654YPxvEE

What It’s Like to Be a Data Labeler Training AI

YouTube

@davidaugust Well, there have actually been successes by connecting LLMs to proof assistants and computer algebra programs. As this post rightly puts it, the LLM is not capable by itself of performing computations reliably, but it can write commands sent to the computer algebra programs, or proof candidates sent to the proof assistant, which can answer that the proof is incorrect, and the process goes on until a correct proof is produced.
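That generate-and-check loop can be sketched roughly like this. This is a hypothetical illustration: `llm_propose` and `checker_verify` are placeholder callables standing in for a real model API and a real proof assistant or computer algebra system (e.g. Lean or Sage).

```python
# Hypothetical sketch of the LLM + checker feedback loop: the model drafts
# candidates, a trusted checker judges them, and only checker-approved
# output is ever returned.
def generate_verified(goal, llm_propose, checker_verify, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        candidate = llm_propose(goal, feedback)   # model drafts a proof/command
        ok, feedback = checker_verify(candidate)  # checker accepts or rejects
        if ok:
            return candidate
    return None  # give up after max_rounds failed attempts

# Toy usage: a "checker" that accepts only the string "qed".
proofs = iter(["bogus", "qed"])
result = generate_verified(
    "1+1=2",
    llm_propose=lambda goal, fb: next(proofs),
    checker_verify=lambda c: (c == "qed", "rejected"),
)
```

The point of the design is that reliability comes from the checker, not the model: the LLM only has to be a good guesser, because incorrect guesses are caught and fed back.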

See also uses by pro mathematicians:
https://bsky.app/profile/wildverzweigt.bsky.social/post/3miua4ulxhk2f

Also see Terence Tao

Wildverzweigte Erweiterung (@wildverzweigt.bsky.social)

If K contains a square root of -1, then we have this counterexample (found by my friend R.R. using an LLM; I'm not a fan, but oh well). I verified the computation in Sage.

Bluesky Social

@davidaugust Direct link to the paper https://arxiv.org/pdf/2410.05229 (presented at ICLR 2025).

Seems this is not very recent news, then.

@davidaugust In about 80 years we've gone from a room full of computers the size of refrigerators that were good at crunching numbers but not much else to computers the size of corporate office parks that can draw almost-convincing pictures of people with five fingers (and thumbs, too!) but can't do elementary school math.

And some people call this progress.

@Karen5Lund Maybe because people stopped writing efficient code about 20 years ago?
@davidaugust Ecosia AI gets it right. It looks like the paper referenced was published in 2025, so the research was conducted before that. The models are all much better now. I'm no AI apologist, but I think any argument of "AI sucks because it's not good at _____" is on tenuous ground and will be proven wrong as the models continue to improve. @Ecosia
@audioflyer79 @davidaugust I mean, it's worth noting that the LLMs have ingested that paper by now. : /

@alisynthesis @davidaugust fair enough. I changed up the problem completely and added some reasoning and it did pretty well. It appears to be generating code to solve the math. The only thing it missed is that very unripe bananas are green, not yellow.

James picks 40 apples on Monday. Then he picks 35 lemons on Tuesday. On Wednesday, he picks half as many bananas as he did apples, but five of them were very unripe. How many yellow fruits does James have?
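For reference, the intended arithmetic for the question above works out like this, under the (debatable) assumptions the poster describes: apples don't count as yellow, and the very unripe bananas are green rather than yellow.

```python
# Worked arithmetic for the fruit question, under stated assumptions:
# apples are excluded from the yellow count; unripe bananas are green.
apples = 40
lemons = 35
bananas = apples // 2       # "half as many bananas as he did apples" -> 20
unripe_bananas = 5          # green, so not counted as yellow
yellow = lemons + (bananas - unripe_bananas)
```

Counting apples as yellow (or the unripe bananas) would shift the total, which is exactly the kind of irrelevant-clause ambiguity the GSM-Symbolic paper probes.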

@audioflyer79 @alisynthesis @davidaugust how does it do if you swap the colors of the fruit?
@davidaugust AGI is coming soon 🀭
@pascal_le_merrer any day now. I heard POTUS say in two weeks.
@davidaugust interesting. Had to ask. Already fixed?

@flq yes, many systems have tools and/or abilities built in to take over basic math operations that simpler LLMs failed at.

The salient and enduring issue, I think, is that the spin and marketing of LLMs as "understanding," "thinking" or "intelligent" (as those words' typical meanings suggest) remains largely fictional.

@flq @davidaugust it may be fixed for this phrase and these numbers, but what if you asked a similar question but mentioned that 5 of the kiwis are twice as big as the others? Would it still give 190, or would it give 195?

@davidaugust Of course an LLM cannot do math, but to be honest, that is also not what it's designed for. An LLM these days, like Claude, knows that it should take a calculator and type the equation in there, instead of hallucinating an answer. Complaining that an LLM can't do math is like complaining that a screwdriver can't drill a hole.

You can counter that there are plenty of people who are using the screwdriver to drill the hole, but that is not on the tool, that is on the user.
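The "reach for a calculator" pattern described above can be sketched like this (hypothetical code, not any particular vendor's tool-calling API): arithmetic the model emits is handed to a small, safe expression evaluator instead of being guessed token by token.

```python
import ast
import operator

# Hypothetical sketch of the calculator-tool pattern: the model's arithmetic
# is parsed and evaluated by a restricted evaluator, never by the model itself.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr):
    """Safely evaluate a +-*/ arithmetic expression string."""
    def ev(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)
```

Restricting evaluation to a whitelist of operator nodes (rather than calling `eval`) is what makes delegating to the tool safe, which is the sense in which the LLM is the screwdriver and the calculator is the drill.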

@davidaugust When did they do this test? I tried it with the following LLMs: Sonnet 4.6, Codex 5.3, GPT-5.4, GPT-5-Mini and Kimi-K2.5. They all answer the kiwi question correctly.

@erwinrossen the key takeaway is that, like a surprisingly large number of people, LLMs do not have actual understanding.

Not only do LLMs (and some people) not understand what is overtly said, they cannot and do not have the ability for nuanced understanding either, because that subset of understanding is just as inaccessible to them as the entire superset of understanding.

I can share this with an LLM (and some people), but I cannot make them understand it.

@fedor yes. Though many many people take words like β€œintelligence” and β€œunderstanding” at face value in this context nonetheless.
@davidaugust I don't know if this is accurate. I remember the much younger me. That extra throwaway about the size would have confused me. I may have gotten the answer right, but my brain would have wasted a lot of time trying to figure out what it meant. I know that in some cases the younger me would have subtracted 5, just like the LLMs did. My god, I still have reasoning issues with superfluous words in conversations.
@EdBruce just because it emulates and imitates reasoning issues does not mean it actually reasons (well or badly), and that is the key here.
@davidaugust paging @tao who's done quite a bit of posting on AI in math - although beyond grade school.
@davidaugust wondering how many actual 10-yr-olds they asked for comparison…

@castaway sound rubrics for various ages exist. As does the underlying, screenshotted and linked study.

Muting you for a week to avoid spending time on disengaged responses. Have a good week!