How close are we to "manually tuning" LLMs?

https://lemmy.world/post/25200116


LLMs are built by generating a network of weights based on a large volume of training data. Some models have made those weights public/open, meaning you could, in principle, go in and manually edit the weights individually to change the outcomes. In practice, you would never do this because it would only ruin the output. However, you could theoretically nudge a lot of values in just the right way to change the model to favor an ideology, have a different attitude, produce disinformation, etc.

Right now, this is done in a brute-force manner: the program appends certain instructions and parameters to the input in order to force a certain disposition, limit the scope, etc. There are a lot of reasons to want to adjust the fundamentals of a model, but AFAIK such a technology doesn't exist yet (publicly). For example, this could be used for political gain, or for positive purposes like removing well-documented racial bias. Is anyone working on such a thing?

Note: This community is "no stupid questions," but I am actually pretty stupid and I probably misunderstood some (all) of the fundamentals of how this works. Please respond to any part of my question.
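To make the "edit a weight by hand" idea concrete, here is a toy sketch. This is not a real LLM, just a two-input linear layer with invented numbers, but it shows the core point: a model is ultimately a pile of numbers, and nudging one of them shifts every output that flows through it.

```python
# Toy illustration: a "model" is just numbers; editing one by hand
# changes the outputs. Not a real LLM -- a 2-input, 1-output linear
# layer with hand-picked (invented) weights.

weights = [0.5, -0.3]   # pretend these came from training
bias = 0.1

def predict(x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias

before = predict([1.0, 2.0])   # 0.5*1 - 0.3*2 + 0.1, roughly 0.0

# "Manually tune" a single weight...
weights[0] = 0.9

after = predict([1.0, 2.0])    # 0.9*1 - 0.3*2 + 0.1, roughly 0.4

print(before, after)
```

In a real model with hundreds of billions of weights feeding into each other through many layers, predicting what a single nudge does is essentially impossible, which is the problem the question is circling.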

With an AI model you can do what's called finetuning, which is essentially training a pretrained model on a specific set of data to tweak the weights in the desired direction. There are multiple use cases for this currently, e.g. coding/specific-language expert models, Dolphin models for uncensored output, roleplaying finetunes, etc.
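The finetuning idea in this comment can be sketched in miniature: start from "pretrained" weights, then run a few gradient steps on a small, specialised dataset so the weights drift toward the new behaviour. A one-parameter toy model, nothing like a real finetuning pipeline:

```python
# Minimal finetuning sketch: a "pretrained" one-weight model is nudged
# toward new behaviour by gradient descent on a small specialised dataset.

w = 2.0  # "pretrained" weight: the model currently computes y = 2x

# Specialised finetuning data: we now want the model to compute y = 3x
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

lr = 0.01
for _ in range(200):             # a few passes of plain gradient descent
    for x, y in data:
        err = w * x - y          # prediction error on this example
        w -= lr * err * x        # gradient of squared error w.r.t. w

print(round(w, 3))  # w has drifted from 2.0 toward 3.0
```

The key contrast with hand-editing: nobody decides which weight moves where; the training signal from the new data does it automatically, across all weights at once.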

We still have very little knowledge of what the weights in a model actually do, so manually tweaking them is unreasonable. There is a lot of work on trying to decode the meaning/purpose of a specific neuron or group of neurons, and if we manually boost or suppress them, the output changes to reflect that.
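The boost/suppress idea mentioned here can be shown with a toy network: scale one hidden neuron's activation up or down and watch the output shift. All weights below are invented for illustration; interpretability work does this on real models where each neuron's "meaning" first has to be discovered.

```python
# Sketch of boosting/suppressing a single neuron in a tiny network:
# 2 inputs -> 3 ReLU hidden neurons -> 1 output, hand-picked weights.

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # input -> hidden
W2 = [0.5, 0.5, 1.0]                         # hidden -> output

def forward(x, neuron_scale=None):
    # neuron_scale: optional (index, factor) to boost or suppress one neuron
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if neuron_scale:
        i, factor = neuron_scale
        hidden[i] *= factor
    return sum(w * h for w, h in zip(W2, hidden))

x = [1.0, 1.0]
baseline   = forward(x)                         # hidden = [1, 1, 2] -> 3.0
suppressed = forward(x, neuron_scale=(2, 0.0))  # silence neuron 2  -> 1.0
boosted    = forward(x, neuron_scale=(2, 2.0))  # double neuron 2   -> 5.0

print(baseline, suppressed, boosted)
```

The hard part in a real LLM is not the scaling (that is trivial) but figuring out which neuron, or combination of neurons, corresponds to the concept you care about.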

It would be easier and faster to just train it on the stuff you want it to output. There are hundreds of billions of weights in models like GPT, and no one really knows what any individual one does.

I read a series of super interesting posts a few months back where someone was exploring the dimensional concept space in LLMs. The jumping-off point was the discovery of weird glitch tokens which would break GPTs, making them enter a tailspin of nonsense, but the author presented a really interesting deep dive into how concepts are clustered dimensionally. I don't know if any of that means we're anywhere close to being able to find those conceptual weighting clusters and tune them, but it's well worth a read for the curious. There's also a YouTube series which really dives into the nitty gritty of LLMs, much of which goes over my head, but it helped me understand at least the outlines of how the magic happens.
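A rough sketch of what "concepts clustered dimensionally" means: tokens live as vectors, and related tokens point in similar directions, which cosine similarity measures. The 3-d vectors below are invented for illustration; real LLM embeddings have thousands of learned dimensions.

```python
# Toy embedding space: related tokens have high cosine similarity.
# Vectors are invented; real embeddings are learned and much larger.
import math

embeddings = {
    "cat":   [0.9, 0.1, 0.0],
    "dog":   [0.8, 0.2, 0.1],
    "paris": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Animal tokens cluster together; the city token sits far away.
print(cosine(embeddings["cat"], embeddings["dog"]))    # high (near 1)
print(cosine(embeddings["cat"], embeddings["paris"]))  # low (near 0)
```

Glitch tokens are (roughly) tokens whose vectors ended up in weird, undertrained corners of this space, which is part of why the model falls apart when it hits them.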

(Excuse any confused terminology here, my knowledge level is interested amateur!)

Posts on glitch tokens and exploring how an LLM encodes concepts in multidimensional space. lesswrong.com/…/solidgoldmagikarp-iii-glitch-toke…

YouTube series is by 3Blue1Brown - m.youtube.com/@3blue1brown

SolidGoldMagikarp III: Glitch token archaeology — LessWrong
