Apple did the research: LLMs cannot do formal reasoning. Results change by as much as 10% when something as basic as the names in the problems changes.

https://garymarcus.substack.com/p/llms-dont-do-formal-reasoning-and

LLMs don’t do formal reasoning - and that is a HUGE problem

Important new study from Apple

Marcus on AI
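The kind of perturbation the Apple study describes can be sketched as a tiny template experiment (the problem text, names, and numbers below are illustrative, not taken from the paper): generate versions of one word problem that differ only in the name, then compare a model's answers across the variants.

```python
# Hypothetical sketch of GSM-Symbolic-style perturbation: the same word
# problem templated with different names, so a model's answers can be
# compared across variants that are logically identical.
TEMPLATE = ("{name} picks {n} apples on Monday and twice as many on "
            "Tuesday. How many apples does {name} have in total?")

def variants(names, n):
    """Yield (prompt, expected_answer) pairs that differ only in the name."""
    for name in names:
        yield TEMPLATE.format(name=name, n=n), n + 2 * n

# A robust reasoner should give the same answer (here 12) for every variant;
# the study's point is that measured accuracy shifts under exactly this change.
for prompt, answer in variants(["Sophie", "Omar", "Mei"], 4):
    print(answer, "<-", prompt)
```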
@ShadowJonathan Why would we judge LLMs on their ability to solve complex tasks? The interesting thing is if they can solve simple tasks well enough to be useful.
@anderspuck @ShadowJonathan Which they also can't do.
@dalias @ShadowJonathan They can absolutely do certain things well enough to be useful. Create a fairly accurate transcript of a podcast, for example.
@anderspuck @dalias @ShadowJonathan Weeeeeeelllll.... "fairly" & "useful" are pretty load-bearing here - like, yes, they can, but they still make the sort of errors that completely change the meaning of the content & there's no way to check for it except human proofreading, which itself is unreliable at low-cost scale (i.e. a non-specialist low-paid worker checking many texts at a fast pace). Suffice to say that even for this, LLMs are wildly oversold.

@jwcph @dalias @ShadowJonathan I used an LLM to create a first draft of the transcript here, for example. Without that help there just wouldn’t be any transcript because it would take too much time. So that for me is definitely in the category of “useful”.

https://www.logicofwar.com/why-did-experts-fail-to-predict-russias-invasion-of-ukraine/

Why did experts fail to predict Russia's invasion of Ukraine?

Hello, In this video, I discuss why so many experts failed to accurately predict the Russian invasion of Ukraine in 2022. Most experts at the time were saying that it was very unlikely that Russia would invade Ukraine. Of those who did foresee an invasion, many dramatically overestimated the capabilities…

Logic of War

@anderspuck @dalias @ShadowJonathan Sure - now all you have to figure out is how much you'd pay for that usefulness, because all of this is only happening so it can become an extremely lucrative business for somebody.

(no, that's not a different topic; the problem complex here is functionality + usefulness + environmental impact + business model)

@jwcph Let’s see how it develops. Ollama is working great for me, but it does require a fairly good computer. So yes, that processing power either has to be local with the user or somewhere centralized.

@anderspuck @dalias @ShadowJonathan

LLMs are NOT the technology doing *speech to text* -- producing transcripts from audio (podcasts). That's a different set of AI technologies.

The industry has been developing "AI" technologies since before I was born. Some are quite useful.

It's the "Generative AI" subset (which includes LLMs, chatbots) that is so misleading, mostly useless, and incredibly wasteful.

@JeffGrigg @anderspuck @ShadowJonathan This. 👆 The industry is all about muddling these differences so they can use the utility of one thing to justify a different piece of garbage they want to sell.
@JeffGrigg @dalias @ShadowJonathan True. I kind of bundled ChatGPT and Whisper together in that statement.
I don’t find generative AI useless, though. There are many tasks at which it is very good, but probably not the flashy ones many people are thinking about. For example, an LLM is much better at sentiment analysis than older methods.
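For context on what "older methods" means here, a common pre-LLM baseline is a bag-of-words lexicon scorer. The toy version below (word lists and scoring entirely illustrative) shows its characteristic failure: it cannot see negation or context, which is exactly where prompting an LLM tends to do better.

```python
# Toy lexicon-based sentiment scorer -- the "older method" being
# contrasted with LLMs. Word lists are illustrative, not a real lexicon.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def lexicon_sentiment(text):
    """Score = (# positive words) - (# negative words); sign gives polarity."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(lexicon_sentiment("This podcast is great"))         # 1 (positive)
print(lexicon_sentiment("This is not great, it is bad"))  # 0 -- misses the negation
```

The second sentence is clearly negative to a human (and to a capable LLM), but the bag-of-words score cancels out because "not great" still contains the word "great".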
@anderspuck @JeffGrigg @ShadowJonathan Are you sure about that? I'm pretty sure they do an extremely racist version of "sentiment analysis".
@dalias @anderspuck @JeffGrigg @ShadowJonathan Let’s be clear, the LLM is not developing racism out of nowhere. It is just able to amplify the racial bias in its dataset. The stuff used to train it was already racist, and it’s extremely hard to filter that out. I still laugh at tip culture being ingrained into LLMs. Some would do "better work" than normal if you bribed them with a tip. Freaky stuff.

@dalias @ShadowJonathan @anderspuck no, never reliable enough. This stems from how they are designed.

They are incapable of asking for help when they don’t understand a passage, for example; they write down something hallucinated* instead.

*) I’m aware that this is not a good term to use for this but I don’t have a better one handy before coffee.