We knew, but the proof is nice.

"Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves"

The guess-the-next-words machines don’t actually understand anything.

https://nitter.poast.org/heynavtoor/status/2041243558833987600#m

#math #ai

@davidaugust And yet large companies are firing actual reasoning, thinking humans to replace them with these dumb-ass machines. Staggering.

@BruceMirken staggering, stupefying and stupid.

I think they’ll come to regret doing so. Many already have.

@davidaugust And so will investors when the AI bubble implodes, which it inevitably will. Humans have short memories.
@BruceMirken @davidaugust when the engine that generates exchange value is decoupled from the engine that generates use value, there is no reason to invest in things that generate use value. We live in two economies now.
@GnosticStreetSweeper @BruceMirken @davidaugust
Well... 🟠tRump is ABSOLUTELY of NO use 2 us❗️❗️❗️
Time 2 scrap him as worse than a dud bc he's pumping out tonnes of monoxide even as he naps his days away❗️

@davidaugust

In other shocking news:

Water is Wet
Without air you will die

@lemgandi
The wetness of water has been hotly debated: to some, “wet” means “covered with or soaked in water”, and it’s questioned whether water is covered with itself.
@davidaugust

not new, here's the 2024 paper referenced:

https://arxiv.org/abs/2410.05229

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
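
To make the abstract's "symbolic templates" idea concrete, here is a minimal illustrative sketch in Python (the wording, names, and number ranges below are invented, not taken from the paper): only the names and numbers change between instantiations, so a model that genuinely reasons should score about the same on every variant, which is the kind of variance the paper measures.

    import random

    # Illustrative sketch only (not the paper's code): a symbolic template that
    # instantiates many variants of one grade-school question by swapping values.
    TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
                "How many apples does {name} have in total?")

    def instantiate(seed: int):
        rng = random.Random(seed)
        a, b = rng.randint(5, 60), rng.randint(5, 60)
        name = rng.choice(["James", "Sofia", "Wei"])
        question = TEMPLATE.format(name=name, a=a, b=b)
        answer = a + b  # the ground truth is computed from the same symbols
        return question, answer

    for s in range(3):
        q, ans = instantiate(s)
        print(q, "->", ans)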

@joriki it’s from August.
Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

The new frontier in large language models is the ability to “reason” their way through problems. New research from Apple says it's not quite what it's cracked up to be.

WIRED
@joriki @davidaugust The discrepancy is the pre-print vs the official publication, I think

@b_cavello @joriki yes. Good point.

Many in this thread/s are debating a preprint when the official version supersedes whatever it may say. Such debate is not salient to the issues at hand; it seems more a matter of historical concern than of the issues the authors are actually addressing.

@davidaugust @joriki check the post history, it was revised in August ‘25 but first posted back in October 2024. Presumably the second revision was after review or in prep for presentation at a conference.

@nuclearpidgeon @joriki as the paper says at the link you provided, it said August for “this version.”

To avoid the meaningless debate you seem to be devolving into, arbitrarily discounting later versions as the valid “this version” being put forward, I am muting you for a week.

It wastes your time, my time, and the time of anyone reading along to split hairs randomly over which version you personally want to designate as canonical.

Good luck.

@davidaugust @joriki it matters when the research was done! They didn’t “just prove” something if the work and evaluations were all done like a year earlier.
💡𝚂𝗆𝖺𝗋𝗍𝗆𝖺𝗇 𝙰𝗉𝗉𝗌📱 (@[email protected])

1/x #MathsMonday #Maths #Math Over time I’ve saved many screenshots of #AI #slop #aiSlop stuffing up #Mathematics big time, and on occasion I’ve had cause to reshare them, and at times I have cursed that I can only attach 4 pics per post. Then I realised, what am I worried about - just post them all in a thread and then I can just link to the thread (or individual screenshots), and can add to it as more come up 🙂 P.S. feel free to reply with more. I hereby present to you, AI’s greatest 5hits...

dotnet.social

@davidaugust did you see the documentary about "Project Nim", the monkey who got adopted and was supposed to be speaking sign language?

there have been other monkeys showing the same effect of this highly sophisticated begging where the monkeys guess the expected behavior but do not understand

they answer with some sign but they do not understand that there was a question nor that they are giving an answer

they see human behavior and from pattern matching they do what they think is appropriate for that moment because it had served them well in the past

it's hard for me to explain but the documentary did a good job in preparing me for the sophisticated begging of llms

i wanted it to be true that we could communicate with monkeys, but the documentary gave me a werner herzog understanding of the beauty of standing apart, watching each other over an unbridgeable chasm

@drifthood yes, there does seem to be a threshold that, in some respects, only humans cross.

I see that sort of begging in a dog. He wants the treat, so instead of just doing the desired behavior the human command is asking for, he tries every response that has ever gotten him a treat until he “unlocks” the treat. Humans can and do do this too from time to time, but humans _also_ actually communicate and understand.

@drifthood @davidaugust This makes me think of "Clever Hans", the horse that appeared to do arithmetic but actually just responded to involuntary human cues:
https://en.wikipedia.org/wiki/Clever_Hans
Clever Hans - Wikipedia

@bladecoder @drifthood excellent point. It does feel to me vaguely like the Mechanical Turk too: a device passed off as a machine, with a person inside giving the “machine” human-like skills and abilities.

There is a fiction inside all of them.

@davidaugust @bladecoder you are right,
there is lots of reporting from 404 Media, for example, that it's Nigerians who do a lot of the chatbotting

in the interview the Nigerian worker explains how it's not possible for the workers to know if they are training a bot or chatting with a human on the other side

https://www.youtube.com/watch?v=QH654YPxvEE

What It’s Like to Be a Data Labeler Training AI


@davidaugust Well, there have actually been successes connecting LLMs to proof assistants and computer algebra programs. As this post rightly puts it, the LLM is not capable by itself of performing computations reliably, but it can write commands to send to a computer algebra program, or proof candidates to send to a proof assistant, which can answer that the proof is incorrect; the process goes on until a correct proof is produced.
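
A minimal sketch of that propose-and-verify loop, with toy stand-ins for both sides (neither function is a real LLM or proof-assistant API; the point is only that the checker, not the LLM, decides correctness):

    # Toy sketch of the loop described above; llm_propose and checker_verify
    # are invented stand-ins, not real APIs.
    def llm_propose(problem: str, feedback: str) -> str:
        # A real system would query a language model here, passing along
        # the checker's feedback from the previous round.
        return "x = 21" if feedback else "x = 20"

    def checker_verify(candidate: str) -> tuple[bool, str]:
        # A real system would call a proof assistant or computer algebra system.
        ok = candidate == "x = 21"
        return ok, "" if ok else "candidate does not satisfy 2*x + 1 = 43"

    def solve(problem: str, max_rounds: int = 5):
        feedback = ""
        for _ in range(max_rounds):
            candidate = llm_propose(problem, feedback)
            ok, feedback = checker_verify(candidate)
            if ok:
                return candidate  # the checker, not the LLM, certifies it
        return None

    print(solve("solve 2*x + 1 = 43"))  # -> "x = 21" after one retry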

See also uses by pro mathematicians:
https://bsky.app/profile/wildverzweigt.bsky.social/post/3miua4ulxhk2f

Also see Terence Tao

Wildverzweigte Erweiterung (@wildverzweigt.bsky.social)

If K contains a square root of -1 then we have this counterexample (found by my friend R.R. using an LLM; I'm not a fan, but still). I verified the calculation in Sage.


@davidaugust Direct link to the paper https://arxiv.org/pdf/2410.05229 (presented at ICLR 2025).

Seems not to be very recent news, then.

@davidaugust In about 80 years we've gone from a room full of computers the size of refrigerators that were good at crunching numbers but not much else to computers the size of corporate office parks that can draw almost-convincing pictures of people with five fingers (and thumbs, too!) but can't do elementary school math.

And some people call this progress.

@Karen5Lund Maybe because people stopped writing efficient code about 20 years ago?
@davidaugust Ecosia AI gets it right. It looks like the paper referenced was published in 2025, so the research was conducted prior. The models are all much better now. I’m no AI apologist, but I think any argument of “AI sucks because it’s not good at _____” is on tenuous ground and will be proven wrong as the models continue to improve. @Ecosia
@audioflyer79 @davidaugust I mean, it's worth noting that the LLMs have ingested that paper by now. : /

@alisynthesis @davidaugust fair enough. I changed up the problem completely and added some reasoning and it did pretty well. It appears to be generating code to solve the math. The only thing it missed is that very unripe bananas are green, not yellow.

James picks 40 apples on Monday. Then he picks 35 lemons on Tuesday. On Wednesday, he picks half as many bananas as he did apples, but five of them were very unripe. How many yellow fruits does James have?
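
For reference, the plain-reading arithmetic that a code-generating model would presumably produce works out like this (my sketch; it assumes lemons and ripe bananas count as yellow and the apples do not, which is exactly the kind of unstated assumption debated below):

    apples = 40
    lemons = 35
    bananas = apples // 2        # "half as many bananas as he did apples" -> 20
    unripe_bananas = 5           # very unripe bananas are green, not yellow
    yellow_fruits = lemons + (bananas - unripe_bananas)
    print(yellow_fruits)         # 50, on the plain-reading assumptions above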

@audioflyer79 @alisynthesis @davidaugust how does it do if you swap the colors of the fruit?

@audioflyer79 @alisynthesis @davidaugust
The correct mathematical response to your question is either a statement of uncertainty (I can't answer that because I don't know what color the apples are or how ripe the lemons and non-unripe bananas are) or a request for clarification (what kind of apples? are the lemons ripe? how ripe are the ripe bananas?).

The fact that it provides a guess indicates that it has correctly understood what *you want it to say*.

It's not doing math. It's playing "what does the user expect?"

@alisynthesis @davidaugust @Robotistry funny, that’s how I formulate my own answers in conversations (what kind of answer does the user expect? 😂) But seriously, the prompt is written like a grade school math word problem. There is no grade school math test that would give your answer a correct score. And if LLMs always gave us the most pedantically accurate answer instead of the plain-reading answer they wouldn’t be useful at all.

@audioflyer79 @alisynthesis @davidaugust
I don't want something that's sold as "useful because it is knowledgeable and capable enough to perform jobs for me" to operate at a grade school level of understanding. I want it to grasp nuance and call out uncertainty, so that it will do precisely what I need it to do.

The problem with this challenge, as posed to a computer, is that it is not fundamentally a math or language understanding problem.

It is an *inference* problem. It is assessing the model's ability to infer properties from insufficient information (what color are apples, ripe bananas, unripe bananas, and lemons?). (Questions like these are one reason standardized tests produce biased data, because people exposed to different environments reach different answers.)

But the Apple paper is posing a *math* problem that enables them to probe the model's *language understanding*.

The success metric for the math problem is "did the model demonstrate understanding of the text when not specifically trained to do so?" (and therefore loses its utility once it is available publicly and can be included in their training data).

The success metric for the inference problem is "can the model correctly match fruit names to a color label?". This metric doesn't illustrate understanding, because it is testing precisely the kind of linguistic pattern-matching the models are designed to do.

@davidaugust @Robotistry @alisynthesis Interesting points. As so much of LLM output depends on the context provided, I would expect it would return a different answer depending on what you asked it to focus on. In fact, when I asked it what color unripe bananas were and whether that affected the answer, it said that they were green and should not have been included, and that it was focused on bananas as a yellow fruit.

@audioflyer79 @davidaugust @alisynthesis This is actually very important.

LLMs do not "forget" the way humans do. They don't have memory lapses.

Humans have memory lapses and difficulty recalling facts they know.

But the nature of computers is to remember perfectly. LLMs exist because they remember perfectly and look for patterns in that memory.

If it is to be useful at interpreting human commands and understanding human expectations (see "now anyone can code" PR), it needs to be encoding concepts, not characters.

People are making big claims that these machines are somehow conscious or intelligent and are able to understand abstract concepts.

If an inference machine with perfect memory states unequivocally at one point that unripe bananas are yellow, and then later states unequivocally that unripe bananas are green, then it is not storing and retrieving conceptual information.

Any claim of generalization cannot involve this kind of concept.

In the battle between Cog and Cyc, LLMs are Cyc writ large.

Cog: https://en.wikipedia.org/wiki/Cog_(project)
Cyc: https://en.wikipedia.org/wiki/Cyc

Cog (project) - Wikipedia

@alisynthesis @Robotistry @davidaugust fascinating. I assume humans “generalize,” yet they readily make conflicting statements with conviction all the time, not based on faulty memory, but on context, misunderstanding, audience, deception, and a number of other reasons.

@audioflyer79 @alisynthesis @davidaugust More properties that I don't actually want a machine to have!

Humans do generalize very well. They also see non-existent patterns in noise and walk in circles when blindfolded (the Mythbusters episode on that is hilarious).

We can "believe six impossible things before breakfast".

But when I interact with a machine, I want it to be able to identify and tell me the context it is assuming, I want a rigorous back end that can identify potential areas for misunderstanding and let me know they exist, and if I can't have predictability and precision, I want at the very least awareness of the existence and degree of unpredictability and imprecision. I do not want the machine to lie to me, ever, and I want a system grounded in facts.

The point is that I, as a human, navigate conflicting statements by context within a shared reality. I do not expect a computer to understand that shared reality because its reality is grounded in what we write, not what we experience.

If I see a word problem that is about fruit color and mentions "unripe," that is a context cue for relevance because ripeness is correlated with color. The computer missing that correlation is exhibiting the same underlying flaw that produces "smaller means subtract" - inability to connect the tokens to the underlying concepts (why we have language in the first place).

@Robotistry @alisynthesis @davidaugust you do make a good point about one big flaw in LLM responses (in my opinion) which is the inability to say “I don’t know” “I’m not sure” etc. But it’s possible this is also being addressed with newer models.

@Robotistry @audioflyer79 @alisynthesis @davidaugust

"What does the user expect" implies that it is aware of a user. This is also not really true. It emits a response matching the best fit to it's training data and query.

@blterrible @Robotistry @audioflyer79 @alisynthesis quite right.

If we include human designers/programmers as if part of the artificial intelligence system in question, then business goals or interface concerns of “what does the user expect” would come into the process. But if we don’t logically or schematically incorporate the humans’ business goals as if part of the AI system in question, then yes: the AI system does not in fact actively know nor expect anything on behalf of the user.

@davidaugust @Robotistry @alisynthesis @blterrible What’s really interesting to me about these conversations is not what we can say about what AI “knows” or “awareness” or “understanding,” but rather, what it says about humans and our need to “other” any intelligence competing with our own. We have no real understanding of what awareness, understanding, or consciousness is, we just know we have it. 1/
@Robotistry @blterrible @alisynthesis @davidaugust …and anything non-human doesn’t have it because *reasons*. I believe consciousness/awareness/understanding is a continuum, not a binary, and that all of the failures and mistakes made by LLMs could just as easily be attributed to humans in another context. Or to put it another way, that the failures of LLMs are *human* failures, mostly because they are trained on human data. 2/
@alisynthesis @davidaugust @blterrible @Robotistry and that the faults we attribute to LLMs (they’re only matching patterns to their training data, they’re only replying what the user expects) are really not all that different to how humans operate. Our brains are pretty much giant pattern matching association machines. Emergent properties we feel are there, like consciousness, have no provable basis 3/
@Robotistry @alisynthesis @davidaugust @blterrible nor any way to prove any other creature, natural or synthetic, doesn’t have. The Turing Test goalposts will keep getting pushed back until we realize we’re not as special as we think we are. 4/
@davidaugust @blterrible @alisynthesis @Robotistry Also, big tech sucks, the way AI is being developed and accelerated is ethically wrong, and AI may well do more harm to the world than good. 5/5

@audioflyer79 @blterrible @alisynthesis @Robotistry “We have no real understanding of what awareness, understanding, or consciousness is…” Philosophy of Mind disagrees. So does semiotics.

“…brains are pretty much giant pattern matching association machines,” nope. There is no evidence that human intelligence is based on statistical frequency matching, nor that any other organic intelligence is.

The Turing test is a very specific litmus test, not a wide use test for the presence of intelligence.

@davidaugust @Robotistry @alisynthesis @blterrible showing the limits of my ignorance. No, the brain is not statistical, but it does focus pretty heavily on pattern matching from a neuronal connection perspective, no? I mentioned the Turing test as a placeholder for any test we use to prove AI is “other.”

@audioflyer79 @Robotistry @alisynthesis @blterrible ah, that makes some sense, Turing Test as a placeholder for any test we might apply.

Organic intelligences, like humans, do tend to excel at pattern recognition, and are often stronger at it than many synthetic intelligences, but I’m not sure pattern recognition is the core of what makes them intelligent; I think it is only part of what constitutes intelligence.

@davidaugust @Robotistry @alisynthesis @blterrible just as incorporating reasoning and python code expands the capabilities of LLMs. Thanks for engaging conversation.

@davidaugust @audioflyer79 @alisynthesis @blterrible One of the things humans are really, really good at is adapting to tools. (I suspect tool use and invention are more fundamental to our intelligence than pattern matching.)

This is one reason research into human-robot interaction is so challenging - the human will adapt their actions and expectations to the tools after just a handful of uses and won't be able to give good feedback about how hard or difficult it is to use or what change in performance they would expect.

Which means that because we have been trained in this particular form of call and response, the mere fact of the system treating a question like an elementary school math test may predispose the user to assume that they *intended* the model to treat it that way and not realize that their original intention was to gather a different kind of information about the system.

That's one thing that makes the Apple paper particularly nice - they managed to intentionally avoid human built-in post hoc rationalizations and focus on the specific question they wanted answered.

@alisynthesis @davidaugust @Robotistry @blterrible except that the intention of the prompt isn’t at issue; I was talking about the format of the prompt. I wouldn’t expect an LLM to know my intention.
Very good and interesting point about tool use. I’d be curious to see studies comparing how ML might “play” with a tool to learn how it works vs how humans do.
@davidaugust @blterrible @alisynthesis @Robotistry I appreciate you guys humoring me while I try to sound like I know anything. Very interesting topic and lots to think about.
@audioflyer79 @alisynthesis @davidaugust @blterrible You might like Dreamships and Dreaming Metal by Melissa Scott, if you can find copies. They explore both sides of the "is it sentient" question.
@Robotistry @audioflyer79 @alisynthesis @blterrible “This is one reason research into human-robot interaction is so challenging - the human will adapt their actions and expectations to the tools after just a handful of uses and won't be able to give good feedback about how hard or difficult it is to use or what change in performance they would expect.” *chef’s kiss*

@blterrible @audioflyer79 @alisynthesis @davidaugust True.

I apologize in advance for the personification, but it does seem to make some concepts easier for me. In robotics, the "user" is shorthand for any agent that initiates a task. (I'll spare you the technical discussion about whether users are the same as operators.)

"Users" as agents can be known knowns (represented), unknown knowns (the concept of a user agent is encoded without being linked to specific data or goals), known unknowns (there is an interaction with something during the task, but the robot doesn't associate it with the concept of "user"), and unknown unknowns (the robot has not been given the concept of a "user" and the concept is irrelevant to its operation).

In this case, there are at least two factors feeding into an LLM’s response: the model is trained to provide the kind of answer it sees most often, but is also constrained to provide the kind of answer the designers have decided is what most people expect to see.

Depending on how the constraints and interface layers are designed, the system could be in any of these states.

So while the internal trained model generating the words may not be aware, I think it's hard to argue definitively that the system the user is interacting with is unaware of the concept of a user.

@audioflyer79 @davidaugust Most (every?) big “AI” chatbots have been patched to intercept math questions and hand them over to an actual program (often Python).

That doesn't change the fact that the LLM part itself cannot do math, and there are still risks that it will misinterpret your question and produce the wrong program to calculate the answer.
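
A minimal sketch of that hand-off idea (purely illustrative, not any vendor's actual routing code): the model's only job is to emit an arithmetic expression, and ordinary Python does the computing.

    import ast
    import operator

    # Illustrative only: evaluate a plain arithmetic expression with Python
    # instead of letting the language model guess at the digits.
    _OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}

    def safe_eval(expr: str) -> float:
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval"))

    # e.g. the expression a chatbot's LLM might emit for the fruit question above
    print(safe_eval("35 + 40 / 2 - 5"))  # 50.0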