People get mad when you call LLMs "spicy autocomplete" but my investigations into recreating and implementing small versions of this tech make me think that nick name is very accurate.

Basically, it's a method to predict the next content in a text file. The whole conversation between you and the LLM is one file, and the LLM tries to find the most likely next text based on the training data.

There is something significant here: LLMs were trained on internet forums and social media.

Thus the training data didn't just contain text, but rather text where each passage is tagged and attributed to a particular user.

This aspect of the training data was critical in creating the illusion of talking to another person.

An LLM doesn't just predict the next text. It predicts the next text that might come from another user. You need to hard code this in to make it work well.

Leave it out and there is no conversation.

For example if I give an LLM without user seperation this text:

"It's a lovely day." It might continue with "The sun was shining."

But with user separation it focuses on responses to "it was a lovely day" from other users and the training data might suggest "I agree, it's wonderful weather."

So interaction with an LLM is like posting on a forum, it gives you and average of typical responses with one small change: most LLMs have a strong positivity bias programmed in.

Because let's be real, if you posted "It's a lovely day." on an internet forum you might get a response like "No it's not, noob."

LLMs are heavily weighted to give supportive, and constructive responses.

I wonder what they might be like without these limitations? Without the limitation to make the response from another user they might be much less deceptive.

That they are popular shows that many people just want a nice moderated online community where people treat each other with respect.

@futurebird this is a simplistic view – that it’s all about token prediction of similar vectors using gradient descent to arrive at the more likely next token to place in line with what is already there. There’s also RLHF – reinforcement learning through human feedback. This involves the human dressing up as a wizard and sitting behind the curtain ensuring that all the responses are the sort of responses that the wizard I mean human would actually prefer and approve of. Technically, this is achieved using a lot of smoke and some carefully placed bidirectional mirrors which act as beam splitters to construct a hologram which fools people into thinking that the machine did it all, instead of poorly paid workers.

@u0421793

Ian, you had me going for a moment there. I was like "how do they keep finding me? why are they like this all the time???"

😆

@futurebird I honestly think (unpopular opinion here) that most of the cost of LLM-based AI thus far is in ‘training’. Not training as in running the phenomenal amount of harvested stolen text and image input through tokenisation processes and reward giving through weight assignment and vector assessment, using more GPUs than exist on Earth, but rather, lots and lots and lots of money paying humans to fake it all and build in patches – patch after patch on top of patch of corrective behaviour, encoded themselves as vector weights. The training had nothing much to do with running it all through GPUs, I believe that probably took an embarassing but totally affordable amount of time and energy. I believe (with no visible means of factual reference to cite) that most of the expenditure of these capital-burning companies was ‘training’ by paying humans and then encoding their resulting guidance. Paying workers.
@u0421793 @futurebird Yes. Or getting humans to do that labor for no pay.

@u0421793

The wind up with the bamboozling jargon (you can feel these dudes hoping they put in enough tricky sounding words and concepts to make you just give up) was perfect in your post.

"token prediction"
"vectors"
"gradient descent" (OMG)

The problem is math jargon is my briar patch and tossing me in there is a big mistake.

:)

@futurebird I wasn’t making it up though - the way it works is by tokenising language (not into words but into fragments of words), then assigning the word-derived tokens to vectors (word2vec - it exists), then these vectors are winnowed into the likely winners by gradient descent to find the lowest error (and not get trapped by just falling downhill down the nearest valley) and so on.

@futurebird

diy'ing a 'math jargon is my briar patch' t-shirt asap.

@u0421793

@futurebird Without those biases you get the famous Microsoft Tay chatbot that went Nazi within a couple of hours.

@futurebird I think most people get sick of moderation actually. It's not done well anywhere and people just want to interact with each other but instead have a bunch of "Karens" telling them how to think and what they can say. People with the power to ostracize you from your friends without their consent.

I think maybe what people are really after here is the smaller, "personal" interaction they're getting. Privacy. Not being judged all the damn time.

@crazyeddie @futurebird most people are not you. I also prefer unmoderated discussion but from my experience the possibility to moderate if someone goes crazy insulting helps to prevent bad manner
@nichtich @futurebird Most people are not you either.

@futurebird apologies for pedantic-quibbling on your thread twice, but…

LLMs in their platonic form are not weighted in this manner, they are exactly as you have imagined here: they reproduce the statistical distribution of tokens in their training corpus.

If you've never played around with GPT-2 or GPT-3 (from the era before we had GPT-3.5 and from there "ChatGPT"), they often would do *precisely* this sort of direct, non-conversational continuation. You could feed in a sentence or two and get "autocomplete", or you could feed in `<html><body><span>Lorem ipsum` and get a plasuible-looking continuation of an HTML document (or whatever)

Once "Chat" models (and the paradigm shift to RLHF to "fine-tune" model performance) showed up, we started seeing the conversational pattern. I don't know the details there, but there is definitely a distinct line between when we first started seeing "LLMs" and when we started seeing models arranged explicitly around a conversational format.

@SnoopJ

I think the only detail you need is adding tags after each user entry that match the end and start of a post in a forum. This worked in the very messy low fidelity testing I’ve been doing. Or it convinced me that would be enough </forum post><forum post> (in that order) is a very powerful pattern and signals a huge shift.

@futurebird @SnoopJ ok, so in the chatbot source code, there must be some point, after it appended the user's latest chunk of text to the end of the document, and _before_ it calls some "inference" function, where it appends a string to the end of the document like:

"</forum post><forum post> ChatbotUser:"

this makes me think that, for a large number of users, the spell might break if we could show how the chatbot behaves when we comment out that little string appendage.

@futurebird @SnoopJ
(and it also might help if we could have a full debug view of the ongoing document, which might contain lots of other annotations like this that help explain how a next-token-predictor could produce something that fools us into thinking that it could only have been written by a feeling & sentient being.)
@JamesWidman @futurebird @SnoopJ I'm pretty sure there are a small number of parlor tricks like this they do to get the illusion. Mainly a mix of injecting tagging, perioditcally reinjecting "summary" of head of the document (then erasing it), and iterative application of these type of rules on the document until some stopping condition is met.

@dalias @futurebird @SnoopJ

if ( user_mentioned_politician_critical_of_LLM_corp() ) {
document.append("[pragma: engage character assassin mode]");
}

viewed in this light, it makes sense that the kind of person who would buy an entire social media platform just to tilt politics the way they want would also try to build a popular LLM chatbot

@dalias @JamesWidman @futurebird @SnoopJ They generally try to reuse as much of the prefix as possible; since attention computation is O(n^2), they want to cache all of the computations for the prefix, otherwise you'd have to re-compute that whole O(n^2) prefix calculation each time there was a new conversation turn.

So providers generally try to preserve all of the early conversation history and just append. There are some minor exceptions, such as some models don't preserve thinking of older messages, so would need to re-process the response that comes after the thinking if the user sends a reply back, but in general they try very hard for conversations to be append-only so they can reuse the cache as much as possible.

If your conversation gets far enough that it's going to exceed the context window that the model was trained and configured for, it will stop shortly beforehand and add a request that the conversation be summarized, producing a new prompt that is much shorter with a summary of the conversation. Of course, this loses a lot of information, so it's kind of a best effort way of being able to continue and still have some context.

@unlambda @JamesWidman @futurebird @SnoopJ Yeah that's roughly what I was trying to express succinctly. Thanks for adding the big-O's and other technical details tho.

@JamesWidman @futurebird @SnoopJ You can do this with a local LLM if you want to learn how it works.

Yes, you're right that it's just building up a bigger and bigger string, with some tags to indicate whose turn it is.

Generally the UI sends the conversation history as a series of JSON messages. Here's an example:

$ cat chat.json
{
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": "what is 6*7?"
},
{
"role": "assistant",
"content": "6 × 7 = 42.",
"reasoning_content": "Here's a thinking process:\n\n1. **Identify the User's Question**: The user is asking for the product of 6 multiplied by 7.\n2. **Perform the Calculation**: 6 × 7 = 42.\n3. **Verify the Result**: This is a basic multiplication fact. 6 × 7 = 42 is correct.\n4. **Formulate the Response**: Keep it clear and direct. \"6 * 7 = 42\" or \"The product of 6 and 7 is 42.\"\n5. **Final Output Generation**: Provide the answer concisely.✅\n"
}
]
}

Then the LLM parses that JSON and formats it the way the backend expects. I can ask my local server to do that and just show me what it will be feeding in:

$ curl -s -H "Content-Type: application/json" --data @chat.json http://127.0.0.1:8080/apply-template | jq -r '.prompt'
<|im_start|>user
what is 6*7?<|im_end|>
<|im_start|>assistant
<think>
Here's a thinking process:

1. **Identify the User's Question**: The user is asking for the product of 6 multiplied by 7.
2. **Perform the Calculation**: 6 × 7 = 42.
3. **Verify the Result**: This is a basic multiplication fact. 6 × 7 = 42 is correct.
4. **Formulate the Response**: Keep it clear and direct. "6 * 7 = 42" or "The product of 6 and 7 is 42."
5. **Final Output Generation**: Provide the answer concisely.✅

</think>

6 × 7 = 42.

That's the pre-tokenization string. It then tokenizes it, breaks it up into a series of tokens, each identified by an integer. Then it has a table of embeddings, a vector for each token, so it looks up the embedding in a lookup table. Then it passes that through the transformer architecture of a few billion multiplications, that generates a list of likely next tokens, it samples one from the most likely tokens, and repeats the process.

@JamesWidman @futurebird @SnoopJ Many models base their format on OpenAI's ChatML, which they used to provide as part of their API but stopped several years ago: https://github.com/openai/openai-python/blob/release-v0.28.0/chatml.md

Other models use different formats, but it's all generally pretty similar, some kind of special tokens for indicating whose turn it is and to distinguish thinking from the final answer.

openai-python/chatml.md at release-v0.28.0 · openai/openai-python

The official Python library for the OpenAI API. Contribute to openai/openai-python development by creating an account on GitHub.

GitHub

@unlambda @futurebird @SnoopJ
i mean... i have multiple full plates at the moment, so i don't really _want_ to break out the debugging/tracing tools to learn its internals...

but on the other hand, these systems are being used to defraud millions of people and do all kinds of lasting damage, so we all probably _need_ to make some time to understand the internals.

@unlambda @futurebird @SnoopJ basically i keep hoping for a modern-day James Randi to show up and demystify this stuff.

but even Randi would need some time to learn the relevant math & code...

@unlambda @futurebird @SnoopJ "ai" is kind of a perfect con in that sense: showing people how Uri Geller bent the spoon is a lot easier than getting people to volunteer to do some math & programming homework that they didn't ask for

@JamesWidman So, there's not really much in the way of technical trickery. The math is all pretty well documented and standard. It's mostly a bunch of matrix multiplications and activation functions.

Between the advent of GPUs with massive parallel processing capabilities, and huge high bandwidth memory, plus the Transformer architecture which actually allows for it to learn how to take into account context, they're finally able to train models that can generate language in ways that some folks find useful.

How they train it is a bit more of where the trickery lies. As discussed, they start by training it as just autocomplete, then they do fine tuning, where they train it with the conversational turns and to follow instructions.

Then they do various kinds of reinforcement learning. One of those where a lot of problems have come in is reinforcement learning with human feedback, where they use human feedback (those up and down thumbs) to train the model to produce text that people prefer. Of course, that causes some people to reward it for saying what they want to hear, so the model becomes extremely sycophantic. One things machine learning is really good at is optimizing for exactly what it is you're training for at the expense of all else. This has caused all kinds of problems; the labs have since backed off on how much they weight RLHF, so they still use some but not as much.

Then there's reinforcement learning with verifiable rewards (RLVR), where they do reinforcement learning of rollouts where they have it try to solve problems, like math problems or programming problems, where it's easy to verify the solution.

Of course, again, machine learning is very good at optimizing for a particular goal. So sometimes the models will reward hack; they'll find ways of getting the reward without actually doing what you wanted. If the goal was to pass unit tests, they might just delete the failing tests. Or if it was to fix a bug, and you accidentally gave them a git repo that has the bug fix in a different branch, they'll look at the git history and find the fix.

Then the trickery is in figuring out what the model is and isn't actually good at. And this is really hard and subtle.

@JamesWidman These models tend to have capabilities that are very "spiky"; they will be really good at one thing, but then another thing that seems similar they'll be terrible at.

So some of the trickery comes in showing off something that it's good and, and implying that it generalizes more than it does.

The tricky thing is, they do generalize... somewhat. So there are cases where it learns general patterns, and is able to go beyond what it's trained on. But there are also times where it over-generalizes, causing what we refer to as hallucinations.

There are also times when it will just memorize some of the input, instead of generalizing. This usually happens if you have too little input, or accidentally have repetitions of the same thing in the input. The labs generally try to avoid this, they will do de-duplication filtering, and have trends that let them know how much input they need to train on for a given model size.

The other tricks can be in training toward the test, what some people call "benchmaxxing".

And then there's the question of whether the resulting models are actually useful enough to make back the massive piles of money that have been spent on training them and building out the data centers to do inference. So far, no one but Nvidia (and other hardware vendors) are making profits here, but of course because the field is growing, everyone believes that they will be able to make a profit once they stop growing.

Anyhow, there are lots of problems. But for the actual conversational production of text in a chatbot; you can download open wights models, and open source software, and run it yourself, with no "tricks up its sleeve", and see how it behaves. I don't mean that you need to learn all the math or debug it yourself.

Of course, the actual "how it works" in the trained models is... not really something we understand. We just provide the training algorithm and a pile of data. There's a whole field of "mechanistic interpretability" to try to find ways to probe the resulting models to figure out how they represent certain concepts and how they perform certain tasks.

But yeah, I've found a certain amount of kicking the tires locally on my own machine to help a bit in understanding how the pieces fit.

@JamesWidman Oh, and as you mention earlier; the labs do get to put their thumb on the scale a lot. There are a number of places they can do this; in the filtering and selection of the input data. In the reinforcement learning process. In the prompts that they give, for the hosted chats and other tools like programming tools, etc.

These models can easily amplify bias, either bias that's found in their training data, or bias from the selection of training data, or in the reinforcement learning process. So having them controlled by a small in-group of silicon valley tech bros, can be kind of horrifying.

@unlambda separately, i think an important aspect of the con is the industry's repurposing of programming metaphors.

we've almost always used metaphors in the design of both hardware & software. like, i can't find the source at the moment, but back in the 1950's, engineers were using the term "storage", and the main reason why we started calling it "memory" is because von Neumann started pushing a certain brain/body metaphor...

@unlambda
then of course POSIX is full of metaphors ("file", "directory", "head", "tail", "child", "kill", etc). and most coders probably stop thinking of them as metaphors when they reach the point where these words (chosen to help people learn what initially seemed like a foreign concept) conjure up a _technical_ understanding, at which point they don't really need a metaphor anymore.

at that point, we use them more as shorthand than as deliberate metaphor.

@unlambda
so it's only natural that someone would start using words like "train", "infer", "learn", etc to describe the operation of LLM systems.

but then non-programmers started using these terms in a _literal_ sense. and even a lot of programmers seem to be acting like they're not metaphors.

seriously, it's not difficult to find people who actually think that there is a literal ghost or spirit in the machine.

that's a problem!

@JamesWidman Yeah.

The thing is, the human brain is really good at picking up on things that look like faces, or sound like language.

And these machines are really good at pretending to be human, at least on short contexts. People naturally want to make the connection that there is some consciousness there.

It is just a machine, that is optimized to appear as human-like as possible. But it's really good at hacking some people's perception to think that it's really conscious.

@unlambda i don't know if i would even use the word "pretend" here.

i'm not even sure the use of "language" in "large language model" is helping (even though these systems are of course processing large amounts of information that humans generally interpret as text, (because that information was originally intended to be interpreted as text, by humans and for humans))

@SnoopJ @futurebird It's pretty straightforward to play with "raw" LLMs, eg. with ollama or llama.cpp.

BTW If we're being pedantic "they reproduce the statistical distribution of tokens in their training corpus" isn't quite right. Inductive bias is crucial otherwise the model grinds to a halt on novel inputs. (And I'd really like to know what it looks like when you do this but I don't have the resources to find out.)

@dpiponi @futurebird I should probably have said *attempt* to reproduce :)

But as you say, novel inputs can be quite tricky, as in the case of the "glitch tokens" of GPTs gone by: https://www.vice.com/en/article/ai-chatgpt-tokens-words-break-reddit/

At the time, they slapped a band-aid on and just fell back onto a generic "an error has occurred" response and no generation if one of those tokens was input. I don't know what the purported solution is to the same problem today, aside from "whatever it is, it's probably rubbish and involves a lot of lying"

ChatGPT Can Be Broken by Entering These Strange Words, And Nobody Is Sure Why

Reddit usernames like ‘SolidGoldMagikarp’ are somehow causing the chatbot to give bizarre responses.

VICE
@dpiponi @futurebird annoyingly, the LessWrong write-up linked to that 'SolidGoldMagikarp' work is actually quite good, but in the time since that research was published there has been similar research published in more uhhh reputable places, e.g. https://dl.acm.org/doi/full/10.1145/3660799

@futurebird

So that's what they were talking about.

><Slartibartfast> Perhaps I'm old and tired, but I think that the chances of finding out what's actually going on are so absurdly remote that the only thing to do is to say, "Hang the sense of it," and keep yourself busy. I'd much rather be happy than right any day.
><Arthur Dent> And are you?
><Slartibartfast> Ah, no. Well, that's where it all falls down, of course.
(The Hitchhiker's Guide to the Galaxy, 2005)

@futurebird And there we have it - the ELIZA effect writ large. I wasn't sure what trick these things were pulling to make things seem human, but once seen never unseen.

@futurebird My understanding is there's interesting consequences of that positivity bias as it relates to the training data and the user input.

Talk to it like you're in a professional setting and you are likelier to get responses sourced from professional exchanges. Talk to it like you're on 4chan and you're likelier to start getting conspiracy theory nonsense out of it.

@futurebird and it needs to be repeated at every opportunity; it makes no distinction between “commands” from you or “commands” from anyone else.

in fact it has no idea what a “commandl or a “prompt” is, just the statistical qualities of a prompt and the qualities of text that follows text like the prompt

so it is that claude or any other agent doesn’t know the difference between your commands and commands in any random text file it happens to read.

it cannot be made to understand the difference

@bri7

The people who said "no calling it spicy autocomplete misses the whole point there is more going on!"

Really made me think that maybe there was more going on, and of course they'd never say. But, it's just the rather clever exploit of using the way that so much of the training data was in the form of posts and responses to make the auto completer feel more like a conversation.

That is the "more" that is going on.

@futurebird the transformer architecture and attention mechanism is clever and super effective. but at the end of the day it is just markov chaining but more variables and moving parts

@bri7 @futurebird

[Not arguging that these models are 'thinking', even if it might sound like that.]

I think the "explain how you arrived at that conclusion" that was all the rage is very interesting for two reasons:

  • The modell is generating more text. It's not like it is showing you a walk through its model and the random numbers it pulled. So it is basically generating an explanation that is plausible given what the said before.

  • I think this is often also a behavior with humans. My opinion about a topic might be a gut feeling, but when questioned I start thinking about it, trying to find arguments. Often ones I didn't already have when I stated my opinion.

  • The first thing could make sense to ask if amodel are trained to change their position given new information. So they could "correct" a bad roll of the dice.
    Of course, a user might think that the model "really thought about this", which is obviously not the case.

    @futurebird @bri7 The early versions of these things (before the "chat" branding) really were simple text continuation engines. You'd put some text in a box and it would simply continue where you left off.

    The current iteration of chat interfaces insert some extra processing between user input and the text extruder. Some inputs are intercepted and processed by other backends like image generators.

    Fundamentally, though, it remains spicy autocomplete at the core.

    @bri7 @futurebird The even more maddening part is that it does make that distinction, some of the time.

    The attention mechanism does give the LLMs a way to keep track of who is saying what. But only to a degree. It's not perfect. It can lose track.

    So it's easy to get lulled into a false sense of security. Some of the time, prompt injection like this won't work. You can try it out, and see it correctly figure out that what you said was instruction but what's in the file was just text that its processing.

    But because it's all based on a whole bunch of statistical vector arithmetic, you can't depend on that. Some percentage of the time, it will slip up and interpret those commands in the text file as instructions rather than data that its processing.

    I think that's what's even more dangerous about it than just that it can't distinguish. Its when it can, sometimes, that you can easily get lulled into a false sense of complacency.

    @unlambda @bri7 @futurebird

    This is probably going to change though, there is research into representing who is saying what directly in the embeddings of the tokens.

    So in a year or two the LLMs will be much, much more reliable in separating instruction from data, and user text from LLM text. Or maybe even now, we don't know what closed source LLMs do, and it's a pretty straightforward idea.

    https://arxiv.org/pdf/2410.09102

    @bri7 @futurebird Your point about commands is a huge desl. It's why prompt injection remains a thing: the model has no concepts, has no way to process instructions that isn't using the exact same text prediction. Prompt injection is inherent and unfixable.

    Once I realized that, I realized LLMs should never be given private data.

    @futurebird Do you have references explaining how the recent LLMs have been trained to do this?

    I'd be interested in understanding what this training methodology looks like in detail.

    I didn't actually think this "conversational training" was necessary, as I thought the chat-bots were just told 'you are a chat bot, pretend to have a conversation, put your output directly in the next line. Here is the user's first inputs: "It's a lovely day"' Or something like that.

    @futurebird yes, some earlier models had a failure mode where they not only generated their ‚answer‘ but continued to generate the part of the user too.

    @futurebird You may find this helpful:

    https://pytorch.org/blog/a-primer-on-llm-post-training/

    The conversational nature of LLMs is usually achieved in post-training.

    There are multiple steps after the "autocomplete" training (i.e. training the model to predict next word).

    I don't know how much of the conversationality can be attributed to posttraining and how much to forums in the training data, but I wouldn't dismiss the posttraining as unimportant.
    Many distinctive LLM traits come from it, e.g. sycophancy.

    @poleguy

    A Primer on LLM Post-Training – PyTorch