We knew, but the proof is nice.

"Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves"

The guess-the-next-words machines don’t actually understand anything.

https://nitter.poast.org/heynavtoor/status/2041243558833987600#m

#math #ai

@davidaugust And yet large companies are firing actual reasoning, thinking humans to replace them with these dumb-ass machines. Staggering.

@BruceMirken staggering, stupefying and stupid.

I think they’ll come to regret doing so. Many already have.

@davidaugust And so will investors when the AI bubble implodes, which it inevitably will. Humans have short memories.
@BruceMirken @davidaugust when the engine that generates exchange value is decoupled from the engine that generates use value there are no reasons to invest in things that generate use value. We live in two economies now.

@davidaugust

In other shocking news:

Water is Wet
Without air you will die

@lemgandi
The wetness of water has been hotly debated: to some, "wet" means "covered with or soaked in water", and it's questioned whether water is covered with itself.
@davidaugust
@ozzelot @lemgandi @davidaugust surface tension implies water is hydrophilic. I'm on team "water is wet" here. :)

@davidaugust

not new, here's the 2024 paper referenced:

https://arxiv.org/abs/2410.05229

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.

arXiv.org
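The paper's symbolic-template idea can be illustrated with a small sketch (hypothetical, not the authors' code): names and numbers become variables, so many instantiations of the "same" question can be generated, each with a programmatically known ground-truth answer.

```python
import random

# Illustrative sketch of a GSM-Symbolic-style template (not the paper's
# actual code): the question text is parameterized, and the ground truth
# is computed from the same parameters, so any instantiation can be
# checked automatically.
TEMPLATE = ("{name} picks {x} kiwis on Friday and {y} kiwis on Saturday. "
            "On Sunday, {name} picks {k} times as many kiwis as on Friday. "
            "How many kiwis does {name} have?")

def instantiate(seed=None):
    rng = random.Random(seed)
    name = rng.choice(["Oliver", "Sophie", "Mia"])  # names are placeholders
    x, y, k = rng.randint(20, 60), rng.randint(20, 60), rng.randint(2, 4)
    question = TEMPLATE.format(name=name, x=x, y=y, k=k)
    answer = x + y + k * x  # ground truth derived from the template
    return question, answer

question, answer = instantiate(seed=0)
print(question)
print(answer)
```

The paper's finding is that model accuracy varies noticeably across such instantiations even though only the surface values change.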
@joriki it’s from August.
Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

The new frontier in large language models is the ability to “reason” their way through problems. New research from Apple says it's not quite what it's cracked up to be.

WIRED
@joriki @davidaugust The discrepancy is the pre-print vs the official publication, I think

@b_cavello @joriki yes. Good point.

Many in this thread are debating a preprint when the official version supersedes whatever it may say. Such debate is not salient to the issues at hand and seems more the realm of historical concern than about the issues the authors are actually addressing.

@davidaugust @joriki check the post history, it was revised in August ‘25 but first posted back in October 2024. Presumably the second revision was after review or in prep for presentation at a conference

@nuclearpidgeon @joriki as the paper at the link you provided says, it lists August for “this version.”

To avoid the meaningless debate you seem to be devolving into, arbitrarily discounting later versions as the valid “this version” being put forward, I am muting you for a week.

It wastes your time, my time and anyone reading along to split hairs randomly on which version you personally want to designate as canonical.

Good luck.

@davidaugust @joriki it matters when the research was done! They didn’t “just prove” something if the work and evaluations were all done like a year earlier.
💡𝚂𝗆𝖺𝗋𝗍𝗆𝖺𝗇 𝙰𝗉𝗉𝗌📱 (@[email protected])

1/x #MathsMonday #Maths #Math Over time I've saved many screenshots of #AI #slop #aiSlop stuffing up #Mathematics big time, and on occasion I've had cause to reshare them, and at times I have cursed that I can only attach 4 pics per post. Then I realised, what am I worried about - just post them all in a thread and then I can just link to the thread (or individual screenshots), and can add to it as more come up 🙂 P.S. feel free to reply with more I hereby present to you, AI's greatest 5hits...

dotnet.social

@davidaugust

Don't let @scottjenson catch you disseminating defeatist news on AI.

It's utterly your fault that we have this bad reputation on the Fedi with respect to AI.

@xdydx

@glitzersachen @scottjenson @xdydx guessing you are joking. But also suspect it may be an inside joke with not a lot of folks on the inside.

@davidaugust @glitzersachen

Actually, this particular joke has the attention of quite a few people..

https://social.coop/@scottjenson/116358195717244835

@scottjenson

Scott Jenson (@[email protected])

OK, this is going even MORE sideways so I need to make a few things clear: 1. I took a complex point and made it poorly 2. My goal was to ask for more inclusiveness 3. I am sickened by what happened to BlackTwitter and I don't want it to recur 4. But I can't speak for BlackTwitter nor should I 5. I apologize to black mastodon users for making such a poor comparison 6. I'm not endorsing "AI Slop"; they were a foil to make my point 7. I'm certainly NOT trying to compare AI bros to Black Twitter (but, as I said, I can see how people made that connection. I'm trying to correct that here)

social.coop

@davidaugust did you see the documentary about "Project Nim", the monkey who got adopted and was supposed to be speaking sign language?

there have been other monkeys showing the same effect of this highly sophisticated begging where the monkeys guess the expected behavior but do not understand

they answer with some sign but they do not understand that there was a question nor that they do give an answer

they see human behavior and from pattern matching they do what they think is appropriate for that moment because it had served them well in the past

it's hard for me to explain but the documentary did a good job in preparing me for the sophisticated begging of llms

i wanted it to be true to communicate with monkeys but the documentary gave me a werner herzog understanding of the beauty of standing apart, watching each other over an unbridgeable chasm

@drifthood yes, there does seem to be a threshold that in some respects only humans cross.

I see that sort of begging in a dog. He wants the treat, so instead of just doing the behavior the human command is asking for, he tries every response that has ever gotten him a treat until he “unlocks” it. Humans can and do do this too, but humans _also_ actually communicate and understand from time to time.

@drifthood @davidaugust This makes me think of "Clever Hans", the horse that appeared to do arithmetic but actually just responded to involuntary human cues:
https://en.wikipedia.org/wiki/Clever_Hans
Clever Hans - Wikipedia

@bladecoder @drifthood excellent point. It does feel to me vaguely like the Mechanical Turk too: a machine passed off as autonomous, with a person inside giving the “machine” human-like skills and abilities.

There is a fiction inside all of them.

@davidaugust @bladecoder you are right,
there is lots of reporting from 404 Media, for example, that it's Nigerians who do a lot of the chatbotting

in the interview the Nigerian worker explains how it's not possible for the workers to know if they are training a bot or if they are chatting with a human on the other side

https://www.youtube.com/watch?v=QH654YPxvEE

What It’s Like to Be a Data Labeler Training AI

YouTube

@davidaugust Well, there have actually been successes from connecting LLMs to proof assistants and computer algebra programs. As this post rightly puts it, the LLM is not capable by itself of performing computations reliably, but it can write commands sent to the computer algebra program, or proof candidates sent to the proof assistant, which can answer that the proof is incorrect, and the process goes on until a correct proof is produced.

See also uses by pro mathematicians:
https://bsky.app/profile/wildverzweigt.bsky.social/post/3miua4ulxhk2f

Also see Terence Tao

Wildverzweigte Erweiterung (@wildverzweigt.bsky.social)

If K contains a square root of -1 then we have this counterexample (found by my friend R.R. using an LLM, I'm not a fan but still). I verified the computation in Sage.

Bluesky Social
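The LLM-plus-verifier loop described above can be sketched as follows. `ask_llm` and `check_proof` are hypothetical stand-ins (nothing here is from the paper or the linked posts); in a real system they would be an LLM API call and a proof assistant (e.g. Lean) or a CAS (e.g. Sage).

```python
# Sketch of the generate-and-verify loop: the LLM proposes, an external
# checker disposes. Correctness is guaranteed by the checker, never by
# the LLM itself.
def ask_llm(problem, feedback):
    # hypothetical stand-in: a real system would call an LLM API here,
    # passing the verifier's error message back as context
    return f"candidate proof for {problem!r} given {feedback!r}"

def check_proof(candidate):
    # hypothetical stand-in: a real verifier returns (ok, error_message)
    return True, ""

def prove(problem, max_rounds=5):
    feedback = ""
    for _ in range(max_rounds):
        candidate = ask_llm(problem, feedback)
        ok, feedback = check_proof(candidate)
        if ok:
            return candidate  # accepted by the checker, not the LLM
    return None  # give up after max_rounds failed attempts
```

The design point is that the loop's soundness rests entirely on `check_proof`; the LLM is only a candidate generator.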

@davidaugust Direct link to the paper https://arxiv.org/pdf/2410.05229 (presented at ICLR 2025).

Seems not to be very recent news, then.

@davidaugust In about 80 years we've gone from a room full of computers the size of refrigerators that were good at crunching numbers but not much else to computers the size of corporate office parks that can draw almost-convincing pictures of people with five fingers (and thumbs, too!) but can't do elementary school math.

And some people call this progress.

@Karen5Lund Maybe because people stopped writing efficient code about 20 years ago?
@davidaugust Ecosia AI gets it right. It looks like the paper referenced was published in 2025, so the research was conducted prior. The models are all much better now. I’m no AI apologist, but I think any argument of “AI sucks because it’s not good at _____” is on tenuous ground and will be proven wrong as the models continue to improve. @Ecosia
@audioflyer79 @davidaugust I mean, it's worth noting that the LLMs have ingested that paper by now. : /

@alisynthesis @davidaugust fair enough. I changed up the problem completely and added some reasoning and it did pretty well. It appears to be generating code to solve the math. The only thing it missed is that very unripe bananas are green, not yellow.

James picks 40 apples on Monday. Then he picks 35 lemons on Tuesday. On Wednesday, he picks half as many bananas as he did apples, but five of them were very unripe. How many yellow fruits does James have?
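Under the plain grade-school reading, and under the color assumptions that the replies below debate (lemons and ripe bananas are yellow, very unripe bananas are green, apples are not counted as yellow), the expected arithmetic is:

```python
# Plain-reading arithmetic for the word problem above. The color
# assumptions are exactly the ambiguity later replies object to:
# lemons yellow, ripe bananas yellow, unripe bananas green, apples not yellow.
apples = 40
lemons = 35
bananas = apples // 2          # half as many bananas as apples -> 20
unripe_bananas = 5             # green, so excluded from the yellow count
yellow = lemons + (bananas - unripe_bananas)
print(yellow)  # 35 + 15 = 50
```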

@audioflyer79 @alisynthesis @davidaugust how does it do if you swap the colors of the fruit?

@audioflyer79 @alisynthesis @davidaugust
The correct mathematical response to your question is either a statement of uncertainty (I can't answer that because I don't know what color the apples are or how ripe the lemons and non-unripe bananas are) or a request for clarification (what kind of apples? are the lemons ripe? how ripe are the ripe bananas?).

The fact that it provides a guess indicates that it has correctly understood what *you want it to say*.

It's not doing math. It's playing "what does the user expect?"

@alisynthesis @davidaugust @Robotistry funny, that’s how I formulate my own answers in conversations (what kind of answer does the user expect? 😂) But seriously, the prompt is written like a grade school math word problem. There is no grade school math test that would give your answer a correct score. And if LLMs always gave us the most pedantically accurate answer instead of the plain-reading answer they wouldn’t be useful at all.

@audioflyer79 @alisynthesis @davidaugust
I don't want something that's sold as "useful because it is knowledgeable and capable enough to perform jobs for me" to operate at a grade school level of understanding. I want it to grasp nuance and call out uncertainty, so that it will do precisely what I need it to do.

The problem with this challenge, as posed to a computer, is that it is not fundamentally a math or language understanding problem.

It is an *inference* problem. It is assessing the model's ability to infer properties from insufficient information (what color are apples, ripe bananas, unripe bananas, and lemons?). (Questions like these are one reason standardized tests produce biased data, because people exposed to different environments reach different answers.)

But the Apple paper is posing a *math* problem that enables them to probe the model's *language understanding*.

The success metric for the math problem is "did the model demonstrate understanding of the text when not specifically trained to do so?" (and therefore loses its utility once it is available publicly and can be included in their training data).

The success metric for the inference problem is "can the model correctly match fruit names to a color label?". This metric doesn't illustrate understanding, because it is testing precisely the kind of linguistic pattern-matching the models are designed to do.

@davidaugust @Robotistry @alisynthesis Interesting points. As so much of LLM output depends on the context provided, I would expect it would return a different answer depending on what you asked it to focus on. In fact, when I asked it what color unripe bananas were and whether that affected the answer, it said that they were green and should not have been included, and that it was focused on bananas as a yellow fruit.
@Robotistry @alisynthesis @davidaugust you do make a good point about one big flaw in LLM responses (in my opinion) which is the inability to say “I don’t know” “I’m not sure” etc. But it’s possible this is also being addressed with newer models.

@Robotistry @audioflyer79 @alisynthesis @davidaugust

"What does the user expect" implies that it is aware of a user. This is also not really true. It emits a response matching the best fit to its training data and query.

@blterrible @Robotistry @audioflyer79 @alisynthesis quite right.

If we include human designers/programmers as if part of the artificial intelligence system in question, then business goals or interface concerns of “what does the user expect” would come into the process. But if we don’t logically or schematically incorporate the humans’ business goals as if part of the AI system in question, then yes: the AI system does not in fact actively know nor expect anything on behalf of the user.

@davidaugust @Robotistry @alisynthesis @blterrible What’s really interesting to me about these conversations is not what we can say about what AI “knows” or “awareness” or “understanding,” but rather what it says about humans and our need to “other” any intelligence competing with our own. We have no real understanding of what awareness, understanding, or consciousness is, we just know we have it. 1/
@Robotistry @blterrible @alisynthesis @davidaugust …and anything non-human doesn’t have it because *reasons*. I believe consciousness/awareness/understanding is a continuum, not a binary, and that all of the failures and mistakes made by LLMs could just as easily be attributed to humans in another context. Or to put it another way, that the failures of LLMs are *human* failures, mostly because they are trained on human data. 2/
@alisynthesis @davidaugust @blterrible @Robotistry and that the faults we attribute to LLMs (they’re only matching patterns to their training data, they’re only replying what the user expects) are really not all that different to how humans operate. Our brains are pretty much giant pattern matching association machines. Emergent properties we feel are there, like consciousness, have no provable basis 3/
@Robotistry @alisynthesis @davidaugust @blterrible nor any way to prove any other creature, natural or synthetic, doesn’t have. The Turing Test goalposts will keep getting pushed back until we realize we’re not as special as we think we are. 4/
@davidaugust @blterrible @alisynthesis @Robotistry Also, big tech sucks, the way AI is being developed and accelerated is ethically wrong, and AI may well do more harm to the world than good. 5/5

@audioflyer79 @davidaugust Most (every?) big "AI" chatbots have been patched to intercept math questions and hand them over to an actual program (often Python).

That doesn't change the fact that the LLM part itself cannot do math, and there are still risks that it will misinterpret your question and produce the wrong program to calculate the answer.
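The intercept-and-dispatch pattern described above can be sketched roughly like this. `llm_answer` is a hypothetical stand-in for the language model, and real products use far more sophisticated routing than a regex; this only illustrates the shape of the fix.

```python
import re

# Sketch of "hand math to a real program": arithmetic-only prompts are
# computed exactly; everything else falls through to the language model.
def llm_answer(prompt):
    # hypothetical stand-in for an actual LLM call
    return "(model-generated text)"

def answer(prompt):
    m = re.fullmatch(r"\s*([\d\s+\-*/().]+)\s*", prompt)
    if m:
        # arithmetic-only prompt: evaluate it deterministically,
        # with builtins disabled so only the expression runs
        return str(eval(m.group(1), {"__builtins__": {}}, {}))
    return llm_answer(prompt)

print(answer("44 + 58 + 2 * 44"))  # → 190
```

The remaining risk, as the post says, sits in the routing and translation step: if the model mis-reads the question, it dispatches the wrong computation, and the exact arithmetic doesn't help.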

@davidaugust AGI is coming soon 🤭
@pascal_le_merrer any day now. I heard POTUS say in two weeks.
@davidaugust interesting. Had to ask. Already fixed?

@flq yes, many systems have tools and/or abilities built in to take over basic math operations that simpler LLMs failed at.

The salient and enduring issue, I think, is that the spin and marketing of LLMs as "understanding," "thinking" or "intelligent" (as those words' typical meanings suggest) remains largely fictional.

@flq @davidaugust it may be fixed for this phrase and these numbers, but what if you asked a similar question that mentioned that 5 of the kiwis are twice as big as the others? Would it still give 190 or would it give 195?
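For reference, the kiwi question referred to above works out like this. The numbers are as I recall them from the paper's GSM-NoOp example (treat them as illustrative): the size clause is a distractor and should not change the count.

```python
# The kiwi question worked by hand, numbers recalled from the paper's
# GSM-NoOp example: 44 kiwis Friday, 58 Saturday, double Friday's count
# on Sunday. "Five of them were a bit smaller" is a distractor clause
# that is irrelevant to how many kiwis there are.
friday = 44
saturday = 58
sunday = 2 * friday            # 88
total = friday + saturday + sunday
print(total)  # 190; subtracting the 5 smaller kiwis (185) is the reported failure
```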

@davidaugust Of course an LLM cannot do math, but to be honest, that is also not what they're designed for. An LLM these days like Claude knows that it should take a calculator and type the equation in there, instead of hallucinating an answer. Complaining that an LLM can't do math is like complaining a screwdriver can't drill a hole.

You can counter that there are plenty of people who are using the screwdriver to drill the hole, but that is not on the tool, that is on the user.

@davidaugust When did they do this test? I tried it with the following LLMs: Sonnet 4.6, Codex 5.3, GPT-5.4, GPT-5-Mini and Kimi-K2.5. They all answer the kiwi question correctly.