It's been over a year, but DeepSeek has finally released their latest models. Performance, particularly for coding, looks strong, with 1M context; and despite being compute-constrained, their team figured out some new attention methods to use during training.
