We knew, but the proof is nice.

"Apple just proved that AI models cannot do math. Not advanced math. Grade school math. The kind a 10-year-old solves"

The guess-the-next-words machines don’t actually understand anything.

https://nitter.poast.org/heynavtoor/status/2041243558833987600#m

#math #ai

@davidaugust Ecosia AI gets it right. It looks like the paper referenced was published in 2025, so the research was conducted prior. The models are all much better now. I’m no AI apologist, but I think any argument of “AI sucks because it’s not good at _____” is on tenuous ground and will be proven wrong as the models continue to improve. @Ecosia
@audioflyer79 @davidaugust I mean, it's worth noting that the LLMs have ingested that paper by now. : /

@alisynthesis @davidaugust fair enough. I changed up the problem completely and added some reasoning and it did pretty well. It appears to be generating code to solve the math. The only thing it missed is that very unripe bananas are green, not yellow.

James picks 40 apples on Monday. Then he picks 35 lemons on Tuesday. On Wednesday, he picks half as many bananas as he did apples, but five of them were very unripe. How many yellow fruits does James have?
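For what it’s worth, the plain-reading arithmetic the model is expected to reproduce fits in a few lines. This is a hypothetical reconstruction of that reading, not the code the model actually generated:

```python
# Hypothetical reconstruction of the plain-reading arithmetic
# (not the model's actual generated code).
apples = 40
lemons = 35
bananas = apples // 2          # half as many bananas as apples
unripe_bananas = 5             # very unripe bananas are green, not yellow
yellow_bananas = bananas - unripe_bananas

# Plain-reading assumption: lemons and ripe bananas count as yellow,
# apples do not.
yellow_fruits = lemons + yellow_bananas
print(yellow_fruits)  # 50
```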

@audioflyer79 @alisynthesis @davidaugust
The correct mathematical response to your question is either a statement of uncertainty (I can't answer that because I don't know what color the apples are or how ripe the lemons and non-unripe bananas are) or a request for clarification (what kind of apples? are the lemons ripe? how ripe are the non-unripe bananas?).

The fact that it provides a guess indicates that it has correctly understood what *you want it to say*.

It's not doing math. It's playing "what does the user expect?"

@alisynthesis @davidaugust @Robotistry funny, that’s how I formulate my own answers in conversations (what kind of answer does the user expect? 😂) But seriously, the prompt is written like a grade school math word problem. There is no grade school math test that would give your answer a correct score. And if LLMs always gave us the most pedantically accurate answer instead of the plain-reading answer they wouldn’t be useful at all.

@audioflyer79 @alisynthesis @davidaugust
I don't want something that's sold as "useful because it is knowledgeable and capable enough to perform jobs for me" to operate at a grade school level of understanding. I want it to grasp nuance and call out uncertainty, so that it will do precisely what I need it to do.

The problem with this challenge, as posed to a computer, is that it is not fundamentally a math or language understanding problem.

It is an *inference* problem. It is assessing the model's ability to infer properties from insufficient information (what color are apples, ripe bananas, unripe bananas, and lemons?). (Questions like these are one reason standardized tests produce biased data, because people exposed to different environments reach different answers.)

But the Apple paper is posing a *math* problem that enables them to probe the model's *language understanding*.

The success metric for the math problem is "did the model demonstrate understanding of the text when not specifically trained to do so?" (a metric that loses its utility once the problem set is publicly available and can be included in training data).

The success metric for the inference problem is "can the model correctly match fruit names to a color label?". This metric doesn't illustrate understanding, because it is testing precisely the kind of linguistic pattern-matching the models are designed to do.

@davidaugust @Robotistry @alisynthesis Interesting points. As so much of LLM output depends on the context provided, I would expect it would return a different answer depending on what you asked it to focus on. In fact, when I asked it what color unripe bananas were and whether that affected the answer, it said that they were green and should not have been included, and that it was focused on bananas as a yellow fruit.

@audioflyer79 @davidaugust @alisynthesis This is actually very important.

LLMs do not "forget" the way humans do. They don't have memory lapses.

Humans have memory lapses and difficulty recalling facts they know.

But the nature of computers is to remember perfectly. LLMs exist because they remember perfectly and look for patterns in that memory.

If it is to be useful at interpreting human commands and understanding human expectations (see "now anyone can code" PR), it needs to be encoding concepts, not characters.

People are making big claims that these machines are somehow conscious or intelligent and are able to understand abstract concepts.

If an inference machine with perfect memory states unequivocally at one point that unripe bananas are yellow, and then later states unequivocally that unripe bananas are green, then it is not storing and retrieving conceptual information.
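To make that contradiction concrete, here is a trivial consistency probe. Everything in it is a hypothetical stand-in (a fact table and a deterministic alternating answerer, not any real model API):

```python
from itertools import cycle

# Hypothetical consistency probe; `ask` stands in for any model call.
def consistent(ask, question, trials=5):
    """A system that stores and retrieves a concept should give the
    same answer to the same factual question every time it is asked."""
    answers = {ask(question) for _ in range(trials)}
    return len(answers) == 1

# A fixed fact table stands in for a concept-grounded system:
facts = {"unripe banana color": "green"}
print(consistent(facts.get, "unripe banana color"))  # True

# An answerer that alternates, as described above, fails the probe:
flaky_answers = cycle(["green", "yellow"])
print(consistent(lambda q: next(flaky_answers), "unripe banana color"))  # False
```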

Any claim of generalization cannot rest on concepts the machine handles this inconsistently.

In the battle between Cog and Cyc, LLMs are Cyc writ large.

Cog: https://en.wikipedia.org/wiki/Cog_(project)
Cyc: https://en.wikipedia.org/wiki/Cyc

@alisynthesis @Robotistry @davidaugust fascinating. I assume humans “generalize,” yet they readily make conflicting statements with conviction all the time, not based on faulty memory, but on context, misunderstanding, audience, deception, a number of other reasons.

@audioflyer79 @alisynthesis @davidaugust More properties that I don't actually want a machine to have!

Humans do generalize very well. They also see non-existent patterns in noise and walk in circles when blindfolded (the MythBusters episode on that is hilarious).

We can "believe six impossible things before breakfast".

But when I interact with a machine, I want it to identify and tell me the context it is assuming. I want a rigorous back end that can identify potential areas for misunderstanding and let me know they exist. And if I can't have predictability and precision, I want at the very least an awareness of the existence and degree of the unpredictability and imprecision. I do not want the machine to lie to me, ever, and I want a system grounded in facts.

The point is that I, as a human, navigate conflicting statements by context within a shared reality. I do not expect a computer to understand that shared reality because its reality is grounded in what we write, not what we experience.

If I see a word problem about fruit color that mentions "unripe," that is a context cue for relevance, because ripeness is correlated with color. A computer that misses that correlation is exhibiting the same underlying flaw that produces "smaller means subtract": an inability to connect the tokens to the underlying concepts (which is why we have language in the first place).