"…a damning new study could put #AI companies on the defensive. In it, #Stanford and #Yale researchers found compelling evidence that #AImodels are actually copying all that data, not “learning” from it. Specifically, four prominent LLMs — OpenAI’s GPT-4.1, Google’s Gemini 2.5 Pro, xAI’s Grok 3, and Anthropic’s Claude 3.7 Sonnet — happily #reproduced lengthy excerpts from #popular — and #protected#works, with a stunning degree of #accuracy."

https://futurism.com/artificial-intelligence/ai-industry-recall-copyright-books

Researchers Just Found Something That Could Shake the AI Industry to Its Core

Researchers found compelling evidence that AI models are actually copying copyrighted data, not "learning" from it.

Futurism

@josemurilo

the fact that those "models" require hundreds of billions of parameters to work is sort of a smoking gun.

While there is probably some degree of "learning" involved, in that the model size is not the same as the input data size (😂 ), this is orders of magnitude away from what we normally call a "model" of something: a parsimonious representation.

And because LLMs don't really learn, we also don't really learn by inspecting them (which again is the hallmark of a useful model).

@openrisk @josemurilo you'd probably not need that much training to be ready to answer questions on any topic
@josemurilo This is part of the leaked system prompt in Claude Code: "Do not produce or reproduce exact song lyrics" - which goes to show some desperate engineer had to try to hide this fact by begging the thing to stop spilling the beans.

@mzedp
Interesting proof & makes it crystal clear that we’re dealing with GrandTheftAutoComplete: “Claude Code: "Do not produce or reproduce exact song lyrics" - which goes to show some desperate engineer had to try to hide this fact by begging the thing to stop spilling the beans.”

#GrandTheftAutoComplete #ClaudeCode #LLMs #AI
@josemurilo

@Su_G @mzedp @josemurilo This is the best example of why it's pointless to argue but lol

@josemurilo
Of course they are. It's a total lie to use the phrase "learning". It's analogous to a distributed data flow database. It's why it needs so much RAM for reasonable performance.

They are plagiarism machines that only give useful results when regurgitating.

Better real search engines pointing to real sources would be more honest and more useful.

Simple fines or compensation aren't enough. Content use needs to be opt-in only, and all illicitly obtained content and the models built on it need to be deleted.

@josemurilo A statistical inference engine cannot be doing anything other than copying.

The model is essentially a “compression”, capable of more or less reproducing its inputs. That’s the whole point.

Calling this process of compression/optimisation “thinking” is the greatest scam ever pulled in IT!

@josemurilo Color me shocked.

There is very little (I am being generous) intelligence in these tools. LLMs are nothing more than expensive text generators.

@josemurilo what's surprising is that we need a study to "prove" this.. this is just how these programs work.. sometimes it looks like people actually think calling it "AI" means the program is actually intelligent? 😬
@josemurilo there's one thing I didn't understand: wasn't all of this already evident? What is actually new in the study?
@josemurilo this is a bad study, and the Anthropic model is literally so old that it's no longer available.