MiniMax's M3 uses about a twentieth of the compute per token of its previous model. Vendor figures: 9x faster prefill and 15x faster decode at a 1M-token context, via a new sparse attention scheme that only bothers with the relevant bits of the prompt. Net effect: long-context AI gets dramatically cheaper per query. Catch: open weights are promised but still not on the shelf as of mid-June.

https://youtu.be/z9AAR51R_Bw

#AI #MiniMax #LLM
