@JamesWidman These models tend to have capabilities that are very "spiky"; they will be really good at one thing, but then another thing that seems similar they'll be terrible at.
So some of the trickery comes in showing off something that it's good and, and implying that it generalizes more than it does.
The tricky thing is, they do generalize... somewhat. So there are cases where it learns general patterns, and is able to go beyond what it's trained on. But there are also times where it over-generalizes, causing what we refer to as hallucinations.
There are also times when it will just memorize some of the input, instead of generalizing. This usually happens if you have too little input, or accidentally have repetitions of the same thing in the input. The labs generally try to avoid this, they will do de-duplication filtering, and have trends that let them know how much input they need to train on for a given model size.
The other tricks can be in training toward the test, what some people call "benchmaxxing".
And then there's the question of whether the resulting models are actually useful enough to make back the massive piles of money that have been spent on training them and building out the data centers to do inference. So far, no one but Nvidia (and other hardware vendors) are making profits here, but of course because the field is growing, everyone believes that they will be able to make a profit once they stop growing.
Anyhow, there are lots of problems. But for the actual conversational production of text in a chatbot; you can download open wights models, and open source software, and run it yourself, with no "tricks up its sleeve", and see how it behaves. I don't mean that you need to learn all the math or debug it yourself.
Of course, the actual "how it works" in the trained models is... not really something we understand. We just provide the training algorithm and a pile of data. There's a whole field of "mechanistic interpretability" to try to find ways to probe the resulting models to figure out how they represent certain concepts and how they perform certain tasks.
But yeah, I've found a certain amount of kicking the tires locally on my own machine to help a bit in understanding how the pieces fit.