Most AI models are what they appear to be. A 12B parameter model uses 12B parameters. What you see is what runs.

Marco MoE does not work that way. Alibaba built two models, Marco Nano and Marco Mini, that carry billions of parameters but wake up only a tiny fraction of them for each request. Marco Nano activates 0.6B out of 8B. Marco Mini activates 0.86B out of 17.3B. That is roughly 5 to 8 percent of either model doing the work at any moment. What makes this worth paying attention to is what that small active slice manages to do against models running at full capacity.
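As a rough sanity check on those ratios, here is a quick Python sketch. The parameter counts are the ones quoted above, and the dictionary layout is just for this illustration, not anything read from the models' actual configs.

# Back-of-the-envelope arithmetic on the figures quoted in the post.
models = {
    "Marco Nano": {"total_b": 8.0, "active_b": 0.6},
    "Marco Mini": {"total_b": 17.3, "active_b": 0.86},
}

for name, p in models.items():
    share = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B parameters active per request ({share:.1%})")

# Marco Nano: 0.6B of 8.0B parameters active per request (7.5%)
# Marco Mini: 0.86B of 17.3B parameters active per request (5.0%)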

https://firethering.com/marco-moe-nano-mini/
#opensource #ai #alibaba #moe #huggingface #llm #genai

Marco MoE Uses 5% of Its Parameters but Outperforms Models 3× Its Size - Firethering
