“Elegant and powerful new result that seriously undermines large language models”

Like I’ve been saying for a while now: LLMs do not think or reason. They are not on the path to AGI. They are extremely limited correlation and text synthesis machines. https://garymarcus.substack.com/p/elegant-and-powerful-new-result-that


Wowed by a new paper I just read and wish I had thought to write myself. Lukas Berglund and others, led by Owain Evans, asked a simple, powerful, elegant question: can LLMs trained on A is B infer automatically that B is A? The shocking (yet, in historical context, see below, unsurprising) answer is no:
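The asymmetry Berglund et al. describe can be pictured with a toy sketch (a hypothetical illustration, not the paper's actual method or dataset): a system that only stores forward "A is B" associations has nothing to retrieve when the question runs in the other direction. The fact used below is a real one chosen for illustration; the lookup-table "model" is an assumption of this sketch.

```python
# Toy illustration of the reversal asymmetry (hypothetical sketch,
# not the paper's method): a pure association store trained only on
# "A is B" statements cannot answer the reversed "B is A" question.

forward_facts = {
    # trained direction: subject -> description ("A is B")
    "Valentina Tereshkova": "the first woman in space",
}

def answer(subject, facts):
    """Look up a memorized forward association; None if never seen."""
    return facts.get(subject)

# Forward query succeeds: the association was stored in this direction.
print(answer("Valentina Tereshkova", forward_facts))
# Reverse query fails: "B is A" was never stored as its own association.
print(answer("the first woman in space", forward_facts))
```

The point of the sketch is only that reversal is a separate association, not a free logical consequence; nothing here claims LLMs literally work as dictionaries.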

Marcus on AI
@baldur Interesting paper. I tried to reproduce the failure on Bing Chat with GPT-4, but that didn't work.
I view LLMs as extremely capable (not limited) text correlation tools, not tools that can do logical deduction, so honestly I am not shocked.

@ErikJonker Bing Chat is a search engine front end. The output is likely at least partially informed by search engine results.

The problem isn’t that they are limited at text correlation. The problem is that text correlation is a fundamentally limiting approach. Pattern-matching on an undocumented and effectively unknowable body of work, using methods that are non-deterministic in practice, exposes you to potential errors and biases in the output that are very hard to detect.

@baldur True, I agree. The question is how many of those big flaws/problems can be addressed by next iterations of current models, by combining them with other techniques/algorithms, etc. So much money is being thrown at it that there is a non-trivial chance companies will succeed in this.
@ErikJonker @baldur This "they will probably fix this... with money" answer is not satisfying to me.
Also: what is the point of a large language model?
It's a projection of our personal hopes and beliefs. 🤷
@Zeugs @baldur ...the massive training set is more than our hopes and beliefs. LLMs really shine in the language department, whether it's translation, brainstorming about the structure of a document, its content, etc. They do NOT replace humans in my view, but they can really augment them; there is enough evidence for that, not only in my personal experience but also in various papers. Also, the major point is making a lot of money, I am afraid...
@ErikJonker Various papers? I have seen 3 or so. Personal evidence is nice, but with the vastness of input and output possibilities it is no hard evidence.
Translation was a solved problem before, for example by DeepL. Brainstorming is nice but hard to put a price tag on. The reliability is, in my opinion, not that good; maybe it works for me, but not for everyone on every topic.
@Zeugs ..this was a nice experiment for knowledge workers https://ssrn.com/abstract=4573321
@Zeugs ...personally I have been using tools like perplexity.ai or Phind.com, which work fine for questions I used Google for, but now I get a more comprehensive and complete answer. They work fine for technical subjects; perplexity even tries to provide the sources to check if you really want to be sure. Everything is far from perfect and not fully reliable, for sure, but does it save time and add value? For me it does.
#llm #generativeAI
@ErikJonker Like the study showed: having an overview of the topic helps. Professionals (trained in critical thinking) can handle this. The negative effect in the study hints that if you are not familiar with something, it's bad for quality. That should be a red flag, since making money involves the masses and support across wide fields.
@ErikJonker The setup is made up of professionals, and the questions resemble stuff that really works, like "come up with 10 ideas for...". This works, but IMHO those are not very creative.
Also:"For a task selected to be outside the frontier, however, consultants using AI were 19 percentage points less likely to produce correct solutions compared to those without AI."
This jagged frontier thing I do not understand.
In the end the questions seem like questions from a test. And tests are something GPT can do.
@ErikJonker From a brainstorming/consulting background: "Imagine 10 potential names for a beverage."
You hire 10 consultancies and in total they come up with 12 names, because they are getting the same/similar answers from the language model.
@ErikJonker Okay, I just reviewed it in the paper. They actually measured that in Appendix D.

@ErikJonker @baldur The concept of LLMs is text correlation, and as such LLMs can't achieve more. Throwing more money at them won't make them any different.

Some companies try to combine them with other approaches to get around the limitations. But in this case, LLMs become nothing more than a frontend for something else. This "something else" would still have to be a major breakthrough that has nothing to do with LLMs and would probably be possible independently of them.

@weddige @baldur ...breakthroughs are often accomplished by combining existing parts in new ways; a breakthrough can be just that, and there is so much more in machine learning and AI than LLMs.