LLM boosters: This is trained with all of the text and code on the public internet!

People who can fucking think: So it's extremely low quality, then?

"The average of everything on GitHub" isn't badass code, it's unfinished student projects that never worked.
We hear "all the text on the internet" and we might think "well, they've sure digitized a lot of classic books, that must be good, right?" But then we realize that all of the books are maybe like 1% of the data. Most of it is like Facebook Messenger breakup arguments and semi-literate emails.
@sidereal They've presumably pulled all the bad fanfic as well as the Epstein Files...
@Susan_calvin @sidereal LLMs follow Sturgeon's Law
@otfrom @sidereal Sturgeon's Law allows for the possibility of some good material amidst the dross. I don't think it applies to LLMs.