@harpaa01 So… 1) we can make output deterministic. Many LLMs have a parameter called “temperature” which controls how likely the model is to pick a token other than the most probable one. At the lowest temperature the most probable token is always chosen. It turns out results at this setting are not very good: they become very monotone and models tend to fall into weird loops. I suppose this could be remedied by using some sort of deterministic RNG (like the original Doom did, or seeding the RNG with the same seed for the same prompt, etc.)
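To make that concrete, here's a minimal sketch of temperature sampling with a toy, made-up token distribution (real models score tens of thousands of tokens; the words and logits here are pure illustration). Temperature 0 is greedy decoding and fully deterministic; any temperature above 0 is still reproducible if you seed the RNG from the prompt:

```python
import math
import random

# Hypothetical next-token logits, purely for illustration.
logits = {"the": 2.0, "a": 1.2, "dog": 0.5, "xylophone": -1.0}

def sample_token(logits, temperature, rng):
    """Sample a next token. temperature == 0 means greedy (argmax) decoding."""
    if temperature == 0:
        # Always pick the most probable token -> deterministic but monotone.
        return max(logits, key=logits.get)
    # Softmax over temperature-scaled logits.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(s - m) for t, s in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    return rng.choices(list(probs), weights=list(probs.values()))[0]

# Greedy decoding: same answer every time.
print(sample_token(logits, temperature=0, rng=random.Random()))  # always "the"

# Seeding the RNG with the prompt text makes sampling reproducible even at T > 0.
rng_a = random.Random("same prompt")
rng_b = random.Random("same prompt")
print(sample_token(logits, 1.0, rng_a) == sample_token(logits, 1.0, rng_b))
```

The seeded-RNG trick is the "og doom" idea: fixed seed in, identical token stream out.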
2) I'm not sure that's actually the case. Unless you do something weird like stuff your context with tokens that usually accompany software development and then ask for a dirty limerick, you're narrowing down the cluster of topics within the training data set. Closely related tokens reinforce each other. I suppose it's a question of how representative your prompt is of the training data. I'd guess Anthropic does at least some model finetuning on their Code prompts.
We actually have a few examples of specialised models. Pretty much every provider has a "code" flavour of their base model. Well, they used to. It's RLHF’d on coding tasks. Second, we have a few Mixture of Experts models. Those consist of a router that decides, based on the context, which subnet is going to generate the next token. Then that specialised subnet does the generation. It's not exactly task specialisation, though. Usually it’s done to reduce memory requirements: if the router decides the next token should be some punctuation, only that small subnet is activated for it. Basically, instead of multiplying the same enormous matrices, this architecture multiplies smaller matrices, but there are more of them and they can be loaded and unloaded at runtime. But in principle this could be used for task-specific subnets, if a sufficiently clever router can be trained.
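The routing idea can be sketched in a few lines. Everything below is a toy: the dimensions, weights, and top-1 routing rule are made up for illustration (real MoE layers route per token among many experts, often top-2, with learned weights), but it shows the key point that only the selected small matrix gets multiplied:

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

DIM, NUM_EXPERTS = 4, 3  # tiny made-up sizes

# Router: one score vector per expert. Experts: small DIM x DIM matrices.
router_w = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [
    [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
    for _ in range(NUM_EXPERTS)
]

def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def moe_forward(h):
    """Route the hidden state h to one expert and run only that subnet."""
    scores = matvec(router_w, h)                                   # one logit per expert
    expert_id = max(range(NUM_EXPERTS), key=lambda i: scores[i])   # top-1 routing
    return expert_id, matvec(experts[expert_id], h)                # only this matrix runs

eid, out = moe_forward([0.5, -0.2, 0.1, 0.9])
print(f"routed to expert {eid}")
```

Because only one expert's matrix is touched per token, the others can stay paged out, which is the memory-saving point; swapping the argmax router for something trained on task labels would give you the task-specific variant.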