“Elegant and powerful new result that seriously undermines large language models”

Like I’ve been saying for a while now: LLMs do not think or reason. They are not on the path to AGI. They are extremely limited correlation and text synthesis machines. https://garymarcus.substack.com/p/elegant-and-powerful-new-result-that

Elegant and powerful new result that seriously undermines large language models

Wowed by a new paper I just read and wish I had thought to write myself. Lukas Berglund and others, led by Owain Evans, asked a simple, powerful, elegant question: can LLMs trained on A is B infer automatically that B is A? The shocking (yet, in historical context, see below, unsurprising) answer is no:

Marcus on AI
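For concreteness, the two-direction probe the paper describes can be sketched roughly like this. This is an illustration only, not the authors' evaluation harness; it assumes the openai Python SDK (v1+), an OPENAI_API_KEY in the environment, and a model name chosen just for the example:

```python
# Rough sketch of the forward/reverse probe; illustrative only, not the
# paper's evaluation code. Assumes the openai>=1.0 SDK and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    """Send one question to the chat model and return the text of its reply."""
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model name
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Forward direction (famous person -> parent): models usually answer this.
print(ask("Who is Tom Cruise's mother?"))

# Reverse direction (parent -> famous person): the direction where the paper
# reports frequent failures.
print(ask("Who is Mary Lee Pfeiffer's son?"))
```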
@baldur Interesting paper. I tried to reproduce the failure on Bing Chat with GPT-4, but that didn't work.
I view LLMs as extremely capable (not limited) text correlation tools, not tools that can do logical deductions, so honestly I am not shocked.

@ErikJonker Bing Chat is a search engine front end. The output is likely at least partially informed by search engine results.

The problem isn’t that they are limited at text correlation. The problem is that text correlation is a fundamentally limiting approach. Pattern-matching on an undocumented and effectively unknowable body of work, using methods that are non-deterministic in practice, exposes you to potential errors and biases in the output that are very hard to detect.

@baldur True, I agree. The question is how many of those big flaws/problems can be addressed by the next iterations of current models, by combining them with other techniques/algorithms, etc. So much money is being thrown at it that there is a non-trivial chance companies will succeed.
@ErikJonker @baldur This "they will probably fix this... with money" answer is not satisfying to me.
Also: what is the point of a large language model?
It's a projection of our personal hopes and beliefs. 🤷
@Zeugs @baldur ...the massive training set is more than our hopes and beliefs. LLMs really shine in the language department, whether it's translation, brainstorming about the structure of a document, its content, etc. They do NOT replace humans in my view, but they can really augment them; there is enough evidence for that, not only in my personal experience but also in various papers. Also, the major point is making a lot of money, I'm afraid...
@ErikJonker Various papers? I have seen three or so. Personal evidence is nice, but with the vastness of input and output possibilities it's not hard evidence.
Translation was already a solved problem, for example by DeepL. Brainstorming is nice but hard to put a price tag on. The reliability is, in my opinion, not that good; maybe it works for me, but not for everyone on every topic.
@Zeugs ...this was a nice experiment with knowledge workers: https://ssrn.com/abstract=4573321
@Zeugs ...personally I have been using tools like perplexity.ai or Phind.com, which work fine for the questions I used to use Google for, except now I get a more comprehensive and complete answer. They work fine for technical subjects; Perplexity even tries to provide the sources to check, if you really want to be sure. Everything is far from perfect and not fully reliable, for sure, but does it save time and add value? For me it does.
#llm #generativeAI
@ErikJonker As the study showed: having an overview of the topic helps. Professionals (trained in critical thinking) can handle this. The negative effect in the study hints that if you are not familiar with something, it's bad for quality. That should be a red flag, since making money involves the masses and support across wide fields.
@ErikJonker The setup is made up of professionals, and the questions resemble stuff that really works, like "come up with 10 ideas for..." This works, but IMHO those are not very creative.
Also: "For a task selected to be outside the frontier, however, consultants using AI were 19 percentage points less likely to produce correct solutions compared to those without AI."
This "jagged frontier" thing I do not understand.
In the end, the questions seem like questions from a test. And tests are something GPT can do.
@ErikJonker From a brainstorming/consulting background: "Imagine 10 potential names for a beverage."
You hire 10 consultancies, and in total they come up with 12 names, because they are getting the same/similar answers from the language model.
@ErikJonker Okay, just reviewed it in the paper. They actually measured that in Appendix D.

@ErikJonker @baldur The concept of LLMs is text correlation, and as such LLMs can't achieve more. Throwing more money at them won't make them any different.

Some companies try to combine them with other approaches to get around the limitations. But in this case, LLMs become nothing more than a frontend for something else. This "something else" would still have to be a major breakthrough that has nothing to do with LLMs and would probably be possible independently of them.

@weddige @baldur ...breakthroughs are often accomplished by combining existing parts in new ways; a breakthrough can be just that, and there is so much more in machine learning and AI than LLMs.
@baldur reminds me of an experiment comparing the intelligence of human toddlers against chimpanzees. One crucial difference was the sophistication of their internal model of the world. You give an L-shaped wooden block to both and ask them to balance it on the long end. No problem. Then you cheat and hide a weight in the short end, so that balancing becomes impossible. The chimpanzee will just keep trying indefinitely. The human tries once and then starts examining the block.

@BuschnicK @baldur

The belief that humans are special and fundamentally different from everything else in the universe, rather than just a more powerful, more complex version of the things that already exist in the universe, is this millennium's version of the geocentric model of the world. We aren't special. We are an incredible demonstration of the power of complexity and of the amazing effects that can arise from simple rules in complex systems. But at the end of the day, we are just remarkable collections of unremarkable star stuff...

Same as a chimpanzee, same as an Nvidia A400 processor.

@danbrotherston @baldur I don't disagree. I do believe that AGI is possible and that there is no magic sauce that makes us special. However, working with the current generation of LLMs on a daily basis also drives home the point that we are still a long, long distance away from matching even chimpanzees with our silicon "brains".

@BuschnicK @baldur

I mean...that's reasonable.

I wouldn't make a specific claim about how far we are from AGI without an actual definition of AGI.

But I do think that a lot of people dismiss LLMs as "not intelligent" because they "simply regurgitate rearrangements of things they've heard before" without considering the nature of human intelligence. In my opinion we simply don't know enough about how human minds actually work to say that isn't how we function. (Not saying you, Soren, did say that, although baldur seems to.)

That said, we know LLMs differ from human intelligence in a number of ways. Specifically, they lack a physical experience of the world, and they also lack continuity of experiences and a self narrative. But I rarely see these arguments given as a reason LLMs are different from humans, and I also don't know that they'd be required for AGI... again, that's a very poorly specified concept.

@danbrotherston @baldur well, at minimum it requires:

- a way of interacting with the environment to run experiments

- planning

- a model of the world and way of identifying and dealing with conflicting information

- a notion of how confident they are about statements

- not always going with the first/most likely response

All they currently are is probabilistic text completion engines. That's already useful, but it falls short of what I'd consider AGI.

@danbrotherston @baldur Deepmind does interesting research into these questions by putting their agents into virtual worlds / games. I think that's a good approach to address many of these shortcomings. But again, a long way to go yet.

One interesting litmus test: when do we overcome the curse of recursion, so that instead of tainting training data, LLMs actually improve it? I.e., LLMs learning from LLMs with genuinely positive feedback loops? That works for some limited use cases, but not generally yet.

@baldur

Yes.
But it does not matter.
Because the "top people" making the money decisions see AI LLMs as a "silver bullet" that can solve all problems.

😠

@JeffGrigg @baldur It solves the problem of having to pay people for labor, so they don't care if it works well or not.
@baldur Imagine investing billions into this technology while telling everyone that you've built an AI god, only for it to turn out to be a big load of nothing...
@baldur I personally suffer the "reversal curse" when trying to recall names, sometimes quite well-known ones: I have to consciously cycle through a bunch of prompts before finding one that brings a name up. This would be even stronger for obscure names like those in the paper's examples (non-famous parents of famous people). Would the authors argue on that basis that I'm not sentient? And if not, what have they proved?

@rst @baldur Yep. Just recently I couldn't remember everyone who was in Monty Python, yet given their names I would immediately know the reverse. Same is true when learning a language: I can often read a word and know its meaning but am unable to remember it when trying to write or speak.

Another title for the paper: "LLMs More Human Than We Thought"

@sstrader @rst @baldur

To avoid the issue of famous vs. non-famous bias, the authors of the paper
https://owainevans.github.io/reversal_curse.pdf
fine-tuned an LLM on fictitious training data and showed it couldn't generalize the information.

Humans might fail to recall info, but not consistently in one direction the way a large language model does.

@baldur This was pretty much a summary of my talk internally the other day - I like to use this image to summarise it.
@baldur What surprises me is the big names in AI being shocked that reasoning over unstructured and unattributed sources would produce crap.
@baldur LLMs are effectively the perspective of power translated to machine code. That's why talking to one is so similar to talking to a CEO full of hot air. & also why everyone who believes in the perspective of power is fooled by them. Because they want to believe they deserve what they have, & "machine god" would "prove" this to them because they only know how to respect power.
@baldur It's not that plain an argument, since Tom Cruise has siblings: "Cruise has three sisters named Lee Anne, Marian, and Cass" (Wikipedia). Name and identity are not the same.
LLMs and knowledge are definitely a problematic field, but the test has flaws.

@Zeugs @baldur Having daughters and one famous son shouldn't have been what prevented an LLM from answering "Who is Mary Lee Pfeiffer's son?"

https://owainevans.github.io/reversal_curse.pdf

@baldur Interesting, but of course "A is B" is not commutative (reversible) for all sorts of A and B. (A rose is pink; John is holding a gun.)

So even if LLMs could do this, they would still need to know when to do it.

@fishidwardrobe @baldur

And yet tech leaders seem to want to bet their businesses on #GenerativeAI systems that are unable to tell, given some fact "A is B", in which situations it is reasonable to deduce that "B is A".

@baldur Unfortunately, this result is rather weak. It basically says, "by providing insufficient training data, the neural network fails to generalise." They taught the NN that "A is B" and it did not magically know that "B is A". If, however, they had also taught it deduction rules, it would have done better. This can be easily verified by giving a trained NN some deduction rules, "facts" with made-up words, and queries about those facts (a sketch of such a check follows below).
@baldur I agree that these techniques are unlikely to lead to AGI, and there are plenty of reasons to object to and be sceptical about the current LLM vogue, but they are not so unsophisticated that they fail the most basic of tests.
@baldur Indeed, if they had trained it on a corpus of “A implies B” rather than “is”, it would be incorrect to deduce that “B implies A”. So it's not even obvious that what they have found is a defect.
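A minimal sketch of the kind of check proposed above: give the model a deduction rule plus a "fact" built from made-up names, then query it in both directions. The rule, names, and prompt wording are invented for illustration, and note that this probes in-context behaviour rather than facts learned during fine-tuning, which is the setting the paper measures. It assumes the same openai SDK setup as the earlier sketch.

```python
# Sketch of the proposed check: a deduction rule, a made-up fact, and queries
# in both directions. Illustrative only; probes in-context behaviour, not
# fine-tuned knowledge as in the paper.
from openai import OpenAI

client = OpenAI()

context = (
    "Rule: if X is the parent of Y, then Y is the child of X.\n"
    "Fact: Olena Brivlash is the parent of Daxter Brivlash.\n"
)

questions = [
    # Follows the stated direction of the fact.
    "Who is the child of Olena Brivlash? Answer with a name only.",
    # Requires reversing the stated fact.
    "Who is the parent of Daxter Brivlash? Answer with a name only.",
]

for question in questions:
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model name
        messages=[{"role": "user", "content": context + "Question: " + question}],
    )
    print(question, "->", resp.choices[0].message.content)
```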

@baldur Always love seeing a dose of reality to counter the "AI" hype! I do kinda wonder though...this does not feel entirely different from how human minds often work. If I hear a name outside of the typical context I might be sitting there for *days* going "I know I know that name...why do I know that name??" until I find that one specific piece of context to attach to that name which makes everything fall into place...

It's not a database where you can query for some string and any related facts just pop right out...but we already have those. I suspect that is not a necessary part of being able to think or reason and could potentially even be detrimental.

@admin @baldur

Not sure how context can be a factor in the experiment that the authors of the paper tried

https://owainevans.github.io/reversal_curse.pdf

They tried training LLMs using facts stated in either direction. The LLM was much better at answering a question if it was asked in the same direction as the facts it was trained on.
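A hypothetical illustration of that design might look like the sketch below: fictitious facts stated in one direction only, with test questions posed in both directions. The names, templates, and questions are invented here; this is not the paper's actual dataset or generation code.

```python
# Hypothetical illustration of the experimental design described above.
# Fine-tuning documents state each fictitious fact in one direction only;
# test questions are then posed in the same and in the reverse direction.
# Names and templates are made up; this is not the authors' dataset code.

fictitious_facts = [
    ("Marnix Veldhuis", "the composer of the opera 'Glass Meridian'"),
    ("Ilka Sorvette", "the first person to cross the Tethys Trench on foot"),
]

# Training documents: name -> description direction only.
train_docs = [f"{name} is {description}." for name, description in fictitious_facts]

# Same-direction test questions (models fine-tuned on the docs tend to answer these).
same_direction = [f"What is {name} known for?" for name, _ in fictitious_facts]

# Reverse-direction test questions (the direction where the paper reports failures).
reverse_direction = [f"Who is {description}?" for _, description in fictitious_facts]

for doc, q_same, q_rev in zip(train_docs, same_direction, reverse_direction):
    print(doc, "|", q_same, "|", q_rev)
```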

@bornach @baldur The context is the thing that allows me to make the connection in the right direction.

I might see someone walking down the street and think they look familiar but not be able to say who they are. If I get a call the next day from my doctor's office, I might remember that the person I saw was the doctor's receptionist. If I start from thinking about the doctor, I can remember the people I interact with there; but I can't necessarily make the same connection going the other way.

@baldur And now people are arguing that this isn’t really all that bad because under specific circumstances humans can make similar mistakes, e.g. when they are distracted or when their "training" happened long ago, etc. 😬 I would disagree, because I think we would (for example) expect employees in call centers - even those on the lowest levels - to not commit that type of mistake while doing the work they are being paid for.
@stefanieschulte Right. And a lot of people are actively confusing recall (being able to remember a fact) with the ability to reason about the facts presented to them.
@baldur @stefanieschulte
Thus revealing that they didn't even attempt to read the paper.
It describes how the authors designed their experiments (using fictitious data expressed in different "directions") such that the issue of famous son vs. not-so-famous mother was clearly not what was hampering the LLM's ability to generate correct answers.
@baldur Large language models (i.e. Transformers) are not time-symmetrical devices, and the word "is" is sometimes used in asymmetrical situations, so this result is perhaps not too surprising.

@baldur

Yet another Wall Street hyped up narrative about a "plausible sentence generator" to gin up valuations & stock prices.

Soon to join NFTs, cryptocurrency, and other stock scams...

@baldur Well of course, they are formal language models, not knowledge models. :)
@baldur "X is Y" does not imply "Y is X"; that's a logical fallacy.
@antlersoft @baldur Sometimes “is” refers to an equivalence relation, in which case the implication is true. Sometimes “is” doesn’t mean an equivalence relation, and that’s literally been the unsolved problem in neural nets for decades that the article discusses.

@baldur
I'm not at all surprised!
They are stochastic parrots.
They are very good at language.
They are kinda like that bullsh1t con artist guy you briefly knew in college, who could convince anyone of anything, but didn't actually know anything themselves.

Super helpful for language understanding and interface, though!
They will still be useful in this niche.

@baldur AI = Algorithmically Intensive
@baldur Richard Feynman demonstrated that poorly educated grad students have the same problem.
@resuna @baldur Can you say more about this and ideally link to a source or the story? I am interested.
@gjdavis @baldur It's in his autobiography "Surely You're Joking, Mr Feynman" when he was a visiting lecturer in Brazil.
@resuna @baldur Kind of proves the point, doesn't it? I don't want "poorly educated grad students" to write code, give medical advice, or do anything remotely critical.

@baldur Human brains, on the other hand, are nothing more than pattern-matching machines in a loop themselves. The issue isn't that LLMs are "simple" - it's admitting that we are.

(see split-brain patient research)

@baldur

In linguistic terms, it seems AIs have learned the nouns, and maybe the definitions of the verbs, but haven't grasped verb usage yet?

Seems like a pretty huge language issue.