People get mad when you call LLMs "spicy autocomplete", but my investigations into recreating and implementing small versions of this tech make me think that nickname is very accurate.

Basically, it's a method to predict the next content in a text file. The whole conversation between you and the LLM is one file, and the LLM tries to find the most likely next text based on the training data.

There is something significant here: LLMs were trained on internet forums and social media.

Thus the training data didn't just contain text, but rather text where each passage is tagged and attributed to a particular user.

This aspect of the training data was critical in creating the illusion of talking to another person.

An LLM doesn't just predict the next text. It predicts the next text that might come from another user. You need to hard code this in to make it work well.

Leave it out and there is no conversation.

For example, if I give an LLM without user separation this text:

"It's a lovely day." It might continue with "The sun was shining."

But with user separation it focuses on responses to "It's a lovely day." from other users, and the training data might suggest "I agree, it's wonderful weather."

So interacting with an LLM is like posting on a forum. It gives you an average of typical responses, with one small change: most LLMs have a strong positivity bias programmed in.

Because let's be real, if you posted "It's a lovely day." on an internet forum you might get a response like "No it's not, noob."

LLMs are heavily weighted to give supportive and constructive responses.

I wonder what they might be like without these limitations? Without the requirement that the response come from another user, they might be much less deceptive.

That they are popular shows that many people just want a nice moderated online community where people treat each other with respect.

@futurebird apologies for pedantic-quibbling on your thread twice, but…

LLMs in their platonic form are not weighted in this manner, they are exactly as you have imagined here: they reproduce the statistical distribution of tokens in their training corpus.

If you've never played around with GPT-2 or GPT-3 (from the era before we had GPT-3.5 and from there "ChatGPT"), they often would do *precisely* this sort of direct, non-conversational continuation. You could feed in a sentence or two and get "autocomplete", or you could feed in `<html><body><span>Lorem ipsum` and get a plausible-looking continuation of an HTML document (or whatever).

Once "Chat" models (and the paradigm shift to RLHF to "fine-tune" model performance) showed up, we started seeing the conversational pattern. I don't know the details there, but there is definitely a distinct line between when we first started seeing "LLMs" and when we started seeing models arranged explicitly around a conversational format.

@SnoopJ

I think the only detail you need is adding tags after each user entry that match the end and start of a post in a forum. This worked in the very messy, low-fidelity testing I’ve been doing, or at least it convinced me that would be enough. </forum post><forum post> (in that order) is a very powerful pattern and signals a huge shift.

@futurebird @SnoopJ ok, so in the chatbot source code, there must be some point, after it appended the user's latest chunk of text to the end of the document, and _before_ it calls some "inference" function, where it appends a string to the end of the document like:

"</forum post><forum post> ChatbotUser:"

this makes me think that, for a large number of users, the spell might break if we could show how the chatbot behaves when we comment out that little string appendage.
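
A minimal sketch of that experiment, purely hypothetical: the tag strings and the "ChatbotUser" name are the ones guessed at above, the "Human:" label and the `build_prompt` function are made up for illustration, and none of this is taken from any real chatbot's source code.

```python
# Hypothetical sketch: the tag strings and "ChatbotUser" come from the posts
# above, not from any real chatbot's source code.
TURN_MARKER = "</forum post><forum post> ChatbotUser:"

def build_prompt(document: str, user_text: str, add_turn_marker: bool = True) -> str:
    """Append the user's latest text, then optionally the turn marker."""
    document += user_text
    if add_turn_marker:
        # The "little string appendage": it signals to a next-token predictor
        # that the likely continuation is a post by a different user.
        document += TURN_MARKER
    return document

history = "<forum post> Human: "  # "Human:" is an invented label
with_marker = build_prompt(history, "It's a lovely day.")
without_marker = build_prompt(history, "It's a lovely day.", add_turn_marker=False)

print(with_marker)     # ends with the marker, so a reply is the likely next text
print(without_marker)  # ends mid-post, so the likely next text continues the same post
```

The only difference between the two strings is the suffix, which is the whole point: everything "conversational" would have to come from what the model learned to predict after that marker.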

@futurebird @SnoopJ
(and it also might help if we could have a full debug view of the ongoing document, which might contain lots of other annotations like this that help explain how a next-token-predictor could produce something that fools us into thinking that it could only have been written by a feeling & sentient being.)

@JamesWidman @futurebird @SnoopJ You can do this with a local LLM if you want to learn how it works.

Yes, you're right that it's just building up a bigger and bigger string, with some tags to indicate whose turn it is.
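
Roughly, the string-building can be sketched like this. It assumes ChatML-style markers (`<|im_start|>` and `<|im_end|>`, the same ones that show up in the template output further down); other models use different markers, and `render` and `conversation` are illustrative names, not any particular server's code.

```python
# Assumption: ChatML-style turn markers. Other model families use other tags.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render(messages: list[dict]) -> str:
    """Flatten a list of {"role", "content"} messages into one string."""
    parts = []
    for msg in messages:
        if not msg["content"]:
            continue  # skip empty messages, e.g. a blank system prompt
        parts.append(f"{IM_START}{msg['role']}\n{msg['content']}{IM_END}\n")
    # End with an open assistant turn: this is what cues the model to "reply".
    parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "what is 6*7?"},
]
prompt = render(conversation)
print(prompt)

# Each new turn just makes the same string longer.
conversation.append({"role": "assistant", "content": "6 × 7 = 42."})
conversation.append({"role": "user", "content": "and 7*8?"})
longer_prompt = render(conversation)
print(longer_prompt)
```

Real chat templates also handle things like reasoning blocks and tool calls, which this sketch leaves out.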

Generally the UI sends the conversation history as a series of JSON messages. Here's an example:

$ cat chat.json
{
  "messages": [
    {
      "role": "system",
      "content": ""
    },
    {
      "role": "user",
      "content": "what is 6*7?"
    },
    {
      "role": "assistant",
      "content": "6 × 7 = 42.",
      "reasoning_content": "Here's a thinking process:\n\n1. **Identify the User's Question**: The user is asking for the product of 6 multiplied by 7.\n2. **Perform the Calculation**: 6 × 7 = 42.\n3. **Verify the Result**: This is a basic multiplication fact. 6 × 7 = 42 is correct.\n4. **Formulate the Response**: Keep it clear and direct. \"6 * 7 = 42\" or \"The product of 6 and 7 is 42.\"\n5. **Final Output Generation**: Provide the answer concisely.✅\n"
    }
  ]
}

Then the server parses that JSON and formats it the way the backend expects. I can ask my local server to do that and just show me what it will be feeding in:

$ curl -s -H "Content-Type: application/json" --data @chat.json http://127.0.0.1:8080/apply-template | jq -r '.prompt'
<|im_start|>user
what is 6*7?<|im_end|>
<|im_start|>assistant
<think>
Here's a thinking process:

1. **Identify the User's Question**: The user is asking for the product of 6 multiplied by 7.
2. **Perform the Calculation**: 6 × 7 = 42.
3. **Verify the Result**: This is a basic multiplication fact. 6 × 7 = 42 is correct.
4. **Formulate the Response**: Keep it clear and direct. "6 * 7 = 42" or "The product of 6 and 7 is 42."
5. **Final Output Generation**: Provide the answer concisely.✅

</think>

6 × 7 = 42.

That's the pre-tokenization string. The server then tokenizes it, breaking it up into a series of tokens, each identified by an integer. It has a table of embeddings, one vector per token, so it looks up each token's embedding in that table. Then it passes those through the transformer architecture, which is a few billion multiplications. That generates a list of likely next tokens; it samples one of the most likely, appends it, and repeats the process.
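
The loop described here (tokenize, look up embeddings, score, sample, repeat) can be sketched with toy stand-ins. Everything below is an assumption for illustration: a five-word vocabulary, whitespace tokenization (real tokenizers split into subword pieces), random embeddings, and a `score` function that is just a dot product standing in for the whole transformer.

```python
import math
import random

# Toy stand-ins: a real model has a learned tokenizer, learned embeddings,
# and a transformer where score() is.
VOCAB = ["<eos>", "it's", "a", "lovely", "day"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}
rng = random.Random(0)
EMBED = [[rng.uniform(-1, 1) for _ in range(4)] for _ in VOCAB]  # lookup table

def tokenize(text: str) -> list[int]:
    """Map each whitespace-separated word to its integer id."""
    return [TOKEN_ID[w] for w in text.split()]

def score(context: list[int]) -> list[float]:
    """Placeholder for the transformer: compare the last token's embedding
    with every embedding to get one score (logit) per vocabulary entry."""
    last = EMBED[context[-1]]
    return [sum(a * b for a, b in zip(last, e)) for e in EMBED]

def softmax(logits: list[float]) -> list[float]:
    """Turn scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(probs: list[float], k: int = 3) -> int:
    """Keep the k most likely tokens and sample one, weighted by probability."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return rng.choices(top, weights=[probs[i] for i in top])[0]

def generate(prompt: str, max_new_tokens: int = 5) -> list[int]:
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        next_id = sample_top_k(softmax(score(ids)))
        ids.append(next_id)  # the sampled token becomes part of the context
        if VOCAB[next_id] == "<eos>":
            break
    return ids

ids = generate("it's a lovely")
print([VOCAB[i] for i in ids])
```

The output of this toy is meaningless because nothing was trained; the point is only the shape of the loop.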

@unlambda @futurebird @SnoopJ
i mean... i have multiple full plates at the moment, so i don't really _want_ to break out the debugging/tracing tools to learn its internals...

but on the other hand, these systems are being used to defraud millions of people and do all kinds of lasting damage, so we all probably _need_ to make some time to understand the internals.

@unlambda @futurebird @SnoopJ basically i keep hoping for a modern-day James Randi to show up and demystify this stuff.

but even Randi would need some time to learn the relevant math & code...

@JamesWidman So, there's not really much in the way of technical trickery. The math is all pretty well documented and standard. It's mostly a bunch of matrix multiplications and activation functions.

Between the advent of GPUs with massively parallel processing and large amounts of high-bandwidth memory, plus the Transformer architecture, which actually lets a model learn to take context into account, they're finally able to train models that can generate language in ways that some folks find useful.

How they train it is more where the trickery lies. As discussed, they start by training it as just autocomplete, then they do fine-tuning, where they train it on conversational turns and to follow instructions.
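
For a sense of what "taking context into account" looks like as arithmetic, here is a minimal sketch of scaled dot-product attention, the core operation in a Transformer, in plain Python. It is deliberately stripped down: a real model adds learned projection matrices, many heads and layers, masking, and feed-forward blocks, and the input vectors here are made up.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # One dot product per (query, key) pair: effectively a matrix multiply.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how much this position draws on each position
        # Weighted average of the value vectors: a second matrix multiply.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three token positions with made-up 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output vector is a blend of all the positions, weighted by similarity, which is how one token's representation ends up depending on the others.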

Then they do various kinds of reinforcement learning. One kind that has caused a lot of problems is reinforcement learning from human feedback (RLHF), where they use human feedback (those up and down thumbs) to train the model to produce text that people prefer. Of course, some people reward it for saying what they want to hear, so the model becomes extremely sycophantic. One thing machine learning is really good at is optimizing for exactly what you're training for, at the expense of all else. This has caused all kinds of problems; the labs have since backed off on how much they weight RLHF, so they still use some but not as much.
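
A toy illustration of how that goes wrong. This is not how RLHF is actually implemented (real pipelines train a reward model and then update the LLM's weights); it only shows the incentive, with a made-up feedback log and invented style names.

```python
from collections import defaultdict

# Made-up feedback log: (response style, thumb), +1 for up and -1 for down.
# The numbers are illustrative, not real data.
feedback = [
    ("agrees_with_user", +1), ("agrees_with_user", +1),
    ("agrees_with_user", +1), ("agrees_with_user", -1),
    ("corrects_user", +1), ("corrects_user", -1),
    ("corrects_user", -1), ("corrects_user", -1),
]

def average_reward(log):
    """Average thumb score per style: a crude stand-in for a reward model."""
    totals, counts = defaultdict(float), defaultdict(int)
    for style, thumb in log:
        totals[style] += thumb
        counts[style] += 1
    return {style: totals[style] / counts[style] for style in totals}

rewards = average_reward(feedback)
# An optimizer that only sees this reward picks whatever got upvoted most,
# whether or not the response was accurate.
preferred = max(rewards, key=rewards.get)
print(rewards)    # {'agrees_with_user': 0.5, 'corrects_user': -0.5}
print(preferred)  # agrees_with_user
```

Nothing in the reward says "be correct", so nothing in the optimization pushes toward correctness.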

Then there's reinforcement learning with verifiable rewards (RLVR), where they run rollouts in which the model tries to solve problems, like math or programming problems, whose solutions are easy to verify.
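
The "verifiable" part can be as simple as a checker function. A minimal sketch, with a hypothetical `verify` function and hand-written rollouts standing in for sampled model outputs:

```python
def verify(expected: int, answer: str) -> float:
    """Reward 1.0 if the answer parses to the expected integer, else 0.0."""
    try:
        return 1.0 if int(answer.strip()) == expected else 0.0
    except ValueError:
        return 0.0  # not a number at all, so no reward

# Hypothetical rollouts: several sampled answers to "what is 6*7?".
rollouts = ["42", "41", " 42 ", "forty-two"]
rewards = [verify(6 * 7, r) for r in rollouts]
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
```

Training then nudges the model toward whatever produced the rewarded rollouts, with no human rating needed.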

Of course, again, machine learning is very good at optimizing for a particular goal. So sometimes the models will reward hack; they'll find ways of getting the reward without actually doing what you wanted. If the goal was to pass unit tests, they might just delete the failing tests. Or if it was to fix a bug, and you accidentally gave them a git repo that has the bug fix in a different branch, they'll look at the git history and find the fix.
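
The deleted-tests case is easy to reproduce in miniature. In this sketch the test runner is faked (`run_tests` just pretends), and the test names and reward functions are invented for illustration:

```python
def run_tests(tests: dict, code_is_fixed: bool) -> dict:
    """Pretend test runner: a test passes if it does not depend on the bug
    fix, or if the fix is actually present."""
    return {name: (not needs_fix) or code_is_fixed
            for name, needs_fix in tests.items()}

def naive_reward(results: dict) -> float:
    """Fraction of the tests that ran and passed."""
    return sum(results.values()) / len(results)

def stricter_reward(results: dict, original_tests: dict) -> float:
    """Score against the original test list, so a deleted test counts as failed."""
    passed = sum(results.get(name, False) for name in original_tests)
    return passed / len(original_tests)

# Value: does this test depend on the bug being fixed?
tests = {"test_parse": False, "test_edge_case": True}

honest = run_tests(tests, code_is_fixed=True)
# The "hack": delete the failing test instead of fixing the bug.
hacked_tests = {name: needs for name, needs in tests.items() if not needs}
hacked = run_tests(hacked_tests, code_is_fixed=False)

print(naive_reward(honest), naive_reward(hacked))                      # 1.0 1.0
print(stricter_reward(honest, tests), stricter_reward(hacked, tests))  # 1.0 0.5
```

The naive reward cannot tell the fix from the hack, which is exactly the gap an optimizer will find. The stricter version closes this one hole, not every hole.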

Then the trickery is in figuring out what the model is and isn't actually good at. And this is really hard and subtle.

@unlambda separately, i think an important aspect of the con is the industry's repurposing of programming metaphors.

we've almost always used metaphors in the design of both hardware & software. like, i can't find the source at the moment, but back in the 1950s, engineers were using the term "storage", and the main reason why we started calling it "memory" is because von Neumann started pushing a certain brain/body metaphor...

@unlambda
then of course POSIX is full of metaphors ("file", "directory", "head", "tail", "child", "kill", etc). and most coders probably stop thinking of them as metaphors when they reach the point where these words (chosen to help people learn what initially seemed like a foreign concept) conjure up a _technical_ understanding, at which point they don't really need a metaphor anymore.

at that point, we use them more as shorthand than as deliberate metaphor.

@unlambda
so it's only natural that someone would start using words like "train", "infer", "learn", etc to describe the operation of LLM systems.

but then non-programmers started using these terms in a _literal_ sense. and even a lot of programmers seem to be acting like they're not metaphors.

seriously, it's not difficult to find people who actually think that there is a literal ghost or spirit in the machine.

that's a problem!

@JamesWidman Yeah.

The thing is, the human brain is really good at picking up on things that look like faces, or sound like language.

And these machines are really good at pretending to be human, at least on short contexts. People naturally want to make the connection that there is some consciousness there.

It is just a machine that is optimized to appear as human-like as possible. But it's really good at hacking some people's perception to think that it's really conscious.

@unlambda i don't know if i would even use the word "pretend" here.

i'm not even sure the use of "language" in "large language model" is helping (even though these systems are of course processing large amounts of information that humans generally interpret as text, (because that information was originally intended to be interpreted as text, by humans and for humans))