RE: https://neuromatch.social/@jonny/116331911643580333

I'm impressed with the kinds of things we can get this generation of LLMs to do but it's so weird that all these new modes of operation just involve stacking on more natural language prompt text.

As a software developer I'd expect there to be some kind of lower-level API that you'd use to direct the model's behavior, but it doesn't work like that; it's just one big black box of LLM, and we're just trying to carefully construct a premise in which it will respond with the right thing.

The biggest "this is so stupid I can't believe it works" kind of thing is the prompts that code review tools feed into the AI.

The prompts literally instruct the LLM to role-play a series of different personas, each looking for different things in the code.

The stupidest thing is that these prompts actually work! The LLMs give better feedback this way.
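For anyone who hasn't seen one of these, here's a minimal sketch of what such a persona-style review prompt might look like when assembled. The persona wording and the `build_review_prompt` helper are hypothetical, not taken from any real tool:

```python
# Hypothetical personas, loosely in the style of what code review tools do.
PERSONAS = [
    "a security auditor hunting for injection and auth bugs",
    "a performance engineer looking for accidental O(n^2) work",
    "a maintainer who cares about naming and readability",
]

def build_review_prompt(diff: str) -> str:
    """Stack one role-play instruction per persona in front of the diff."""
    sections = [
        f"Act as {p}. Review the diff below and list concrete issues."
        for p in PERSONAS
    ]
    return "\n\n".join(sections) + "\n\nDIFF:\n" + diff

print(build_review_prompt("- x = foo()\n+ x = foo() or bar()"))
```

The whole "API" is string concatenation, which is exactly the weird part.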

@harpaa01 it only seems stupid if you think of it as role play.

If you frame every interaction with an LLM as context priming it becomes pretty obvious. In what context in the training data is a certain kind of response more likely? If you stuff your prompt with tokens that often go with good debugging you're likely to get good debugging.

In a way, “prompt engineering” is a game of guessing training data, sort of a reverse engineering exercise.
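The "priming" intuition can be shown with a toy model: if continuations are scored by how often they co-occur with the tokens already in context, then seeding the context with debugging vocabulary makes debugging-flavoured continuations the most probable. The tiny corpus and `continuation_counts` helper below are made up purely for illustration:

```python
from collections import Counter

# Hypothetical toy "training data": (context tokens, continuation) pairs.
corpus = [
    ("debug trace breakpoint", "inspect the stack"),
    ("debug log stack", "inspect the stack"),
    ("recipe flour sugar", "preheat the oven"),
    ("recipe butter oven", "preheat the oven"),
]

def continuation_counts(cue: str) -> Counter:
    """Count continuations that follow contexts containing the cue token."""
    counts = Counter()
    for context, continuation in corpus:
        if cue in context.split():
            counts[continuation] += 1
    return counts

# Priming with "debug" makes the debugging continuation dominate.
print(continuation_counts("debug"))
print(continuation_counts("recipe"))
```

A real LLM does something far more sophisticated than counting, but the direction of the effect is the same: the prompt tokens select a neighbourhood of the training distribution.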

@pointlessone I get why it works, but it feels like a bad architecture to put things that are structural to your actual product into the prompt, where 1) behavior is not deterministic, and 2) the more you cram in there, the higher the chances that not every instruction will be followed to the letter (plus the fact that your prompt has limits to how big it can be).

It feels like we don't try to build the models to be more specialized because we either don't know how, or we know it'd be really expensive.

@harpaa01 So… 1) we can make output deterministic. Many LLMs have a parameter called “temperature” which determines the probability of choosing a token with lower probability than the maximum. At the lowest temperature the most probable token is always chosen. It turns out results at this setting are not very good: they become very monotone and models tend to fall into weird loops. I suppose this could be remedied by using some sort of deterministic RNG (like og Doom, or seeding the RNG with the same seed for the same prompt, etc.)
2) I'm not sure that's actually the case. Unless you do something weird like stuff your context with tokens that usually come with software development and then ask for a dirty limerick, you're narrowing down the cluster of topics within the training data set. Closely related tokens reinforce each other. I suppose it's a question of how representative your prompt is of the training data. I guess Anthropic does at least some model finetuning on their Code prompts.

We actually have a few examples of specialised models. Pretty much every provider has a "code" flavour of their base model. Well, they used to. It's RLHF'd on coding tasks. Second, we have a few Mixture of Experts models. Those consist of a router that decides which subnet is going to generate the next token based on the context. Then a specialised subnet is used for the token generation. It's not exactly task specialisation, though. Usually it's done to reduce memory requirements. So if the router decides that the next token should be some punctuation, only that small subnet is activated for it. Basically, instead of multiplying the same enormous matrices, this architecture multiplies smaller matrices, but there are more of them and they can be loaded and unloaded at runtime. But in principle this could be used for task-specific subnets if a sufficiently clever router can be trained.
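A bare-bones sketch of that routing idea, assuming top-1 routing and "experts" that are just single linear layers (real MoE layers route per token inside a transformer block and usually combine the top-k experts, so this is heavily simplified):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden dimension of the toy model

# Two small "expert" subnets, each just a d x d linear map here.
experts = [rng.standard_normal((d, d)) for _ in range(2)]

# The router is itself a tiny linear layer: one score per expert.
router_w = rng.standard_normal((d, 2))

def moe_forward(x: np.ndarray) -> tuple[np.ndarray, int]:
    scores = x @ router_w              # score each expert for this token
    expert_id = int(np.argmax(scores))  # top-1 routing: pick a single expert
    # Only the chosen expert's (smaller) matrix is touched for this token,
    # which is where the memory/compute saving comes from.
    return experts[expert_id] @ x, expert_id

x = rng.standard_normal(d)
y, chosen = moe_forward(x)
print(f"token routed to expert {chosen}, output shape {y.shape}")
```

The "sufficiently clever router" question is then just: can you train `router_w` so that, say, code tokens reliably land on a code-specialised expert.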