Mastodawn

sidereal Mar 1

LLM boosters: This is trained with all of the text and code on the public internet!

People who can fucking think: So it's extremely low quality, then?

sidereal Mar 1

"The average of everything on GitHub" isn't badass code, it's unfinished student projects that never worked.

sidereal Mar 1

We hear "all the text on the internet" and we might think "well, they've sure digitized a lot of classic books, that must be good, right?" but then we realize that all of the books are maybe like 1% of the data. Most of it is like Facebook Messenger breakup arguments and semi-literate emails.

Jess👾 Mar 1

@sidereal one could hope they apply different weighting factors to different sources

Malfunct (he/him)

@JessTheUnstill @sidereal mostly they don't and it is a known issue in training.