People get mad when you call LLMs "spicy autocomplete", but my investigations into recreating and implementing small versions of this tech make me think that nickname is very accurate.

Basically, it's a method to predict the next content in a text file. The whole conversation between you and the LLM is one file, and the LLM tries to find the most likely next text based on the training data.
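
To make "predict the next content" concrete, here is a minimal sketch of the smallest possible autocomplete: it counts which word follows which in some training text, then picks the most frequent follower. The function names and the tiny corpus are made up for illustration. A real LLM works on word fragments and uses a neural network rather than a lookup table, so treat this as a toy for the idea, not a description of any actual model.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]


# Hypothetical training data, just for illustration.
corpus = "the sun was shining and the sun was warm and the day was lovely"
model = train_bigrams(corpus)

print(predict_next(model, "the"))  # "sun" follows "the" twice, "day" only once
```

The prediction depends entirely on what was in the training text, which is the whole point of the nickname.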

There is something significant here: LLMs were trained on internet forums and social media.

Thus the training data didn't just contain text, but rather text where each passage is tagged and attributed to a particular user.

This aspect of the training data was critical in creating the illusion of talking to another person.

An LLM doesn't just predict the next text. It predicts the next text that might come from another user. You need to hard code this in to make it work well.

Leave it out and there is no conversation.

For example, if I give an LLM without user separation this text:

"It's a lovely day." It might continue with "The sun was shining."

But with user separation it focuses on responses to "It's a lovely day." from other users, and the training data might suggest "I agree, it's wonderful weather."
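
Here is a rough sketch of what that hard-coded user separation could look like: the conversation is flattened into one piece of text with a speaker tag on every turn, and it ends with an open tag for the other speaker, so the most likely continuation is a reply rather than more of my own post. The `<user>:` / `<assistant>:` tag format and the function name are invented for illustration; real systems each use their own markers.

```python
def render_conversation(turns, next_speaker="assistant"):
    """Flatten (speaker, text) turns into one text, ending with an open turn.

    The tag format is hypothetical; it only illustrates the idea of
    attributing every passage to a speaker.
    """
    lines = [f"<{speaker}>: {text}" for speaker, text in turns]
    # The trailing open tag is what asks for a *reply* from someone else.
    lines.append(f"<{next_speaker}>:")
    return "\n".join(lines)


# Without separation the model just sees text to continue.
plain_prompt = "It's a lovely day."

# With separation it sees a post from one user and a slot for another.
tagged_prompt = render_conversation([("user", "It's a lovely day.")])
print(tagged_prompt)
```

Given `plain_prompt`, "The sun was shining." is a reasonable continuation. Given `tagged_prompt`, the likely continuation is whatever other people tend to say back.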

So interacting with an LLM is like posting on a forum. It gives you an average of typical responses, with one small change: most LLMs have a strong positivity bias programmed in.

Because let's be real, if you posted "It's a lovely day." on an internet forum you might get a response like "No it's not, noob."

LLMs are heavily weighted to give supportive and constructive responses.

I wonder what they might be like without these limitations? Without the requirement that the response come from another user, they might be much less deceptive.

That they are popular shows that many people just want a nice moderated online community where people treat each other with respect.

@futurebird this is a simplistic view – that it’s all about token prediction of similar vectors using gradient descent to arrive at the more likely next token to place in line with what is already there. There’s also RLHF – reinforcement learning from human feedback. This involves the human dressing up as a wizard and sitting behind the curtain ensuring that all the responses are the sort of responses that the wizard I mean human would actually prefer and approve of. Technically, this is achieved using a lot of smoke and some carefully placed bidirectional mirrors which act as beam splitters to construct a hologram which fools people into thinking that the machine did it all, instead of poorly paid workers.

@u0421793

Ian, you had me going for a moment there. I was like "how do they keep finding me? why are they like this all the time???"

😆

@futurebird I honestly think (unpopular opinion here) that most of the cost of LLM-based AI thus far is in ‘training’. Not training as in running the phenomenal amount of harvested stolen text and image input through tokenisation processes and reward giving through weight assignment and vector assessment, using more GPUs than exist on Earth, but rather, lots and lots and lots of money paying humans to fake it all and build in patches – patch after patch on top of patch of corrective behaviour, encoded themselves as vector weights. The training had nothing much to do with running it all through GPUs; I believe that probably took an embarrassing but totally affordable amount of time and energy. I believe (with no visible means of factual reference to cite) that most of the expenditure of these capital-burning companies was ‘training’ by paying humans and then encoding their resulting guidance. Paying workers.
@u0421793 @futurebird Yes. Or getting humans to do that labor for no pay.

@u0421793

The wind up with the bamboozling jargon (you can feel these dudes hoping they put in enough tricky sounding words and concepts to make you just give up) was perfect in your post.

"token prediction"
"vectors"
"gradient descent" (OMG)

The problem is math jargon is my briar patch and tossing me in there is a big mistake.

:)

@futurebird I wasn’t making it up though - the way it works is by tokenising language (not into words but into fragments of words), then assigning the word-derived tokens to vectors (word2vec - it exists), then these vectors are winnowed into the likely winners by gradient descent to find the lowest error (and not get trapped by just falling downhill down the nearest valley) and so on.
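
A toy sketch of two of those pieces, under loud assumptions: the fragment vocabulary is invented, the tokeniser is a simple greedy longest-match (real tokenisers are learned from data), and the "error" is a one-variable parabola rather than a network's loss. As far as I understand it, gradient descent is the part used during training to nudge the weights toward lower error; this only shows the downhill-stepping idea.

```python
def tokenize(word, vocab):
    """Greedily split `word` into the longest fragments found in `vocab`.

    Characters that match no fragment become single-character tokens.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known fragment starts here: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the error."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Hypothetical fragment vocabulary.
fragments = tokenize("shining", {"shin", "ing", "sun"})

# Error (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
best_w = gradient_descent(lambda w: 2 * (w - 3), start=0.0)

print(fragments, round(best_w, 4))
```

This error surface has a single valley, so there is nothing to get trapped in; the "nearest valley" problem only shows up on bumpier surfaces.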

@futurebird

diy'ing a 'math jargon is my briar patch' t-shirt asap.

@u0421793