Arcee Trinity Large Technical Report
Arcee Trinity Large is a sparse Mixture-of-Experts (MoE) model with 400 billion total parameters and 13 billion activated per token. Alongside it, the Trinity Nano (6B parameters) and Trinity Mini (26B parameters) models are also introduced; all three share a modern architecture, and Trinity Large additionally uses SMEBU, a new MoE load-balancing strategy. The models were trained with the Muon optimizer and pre-trained on large token datasets (up to 17 trillion tokens for Trinity Large). This technical report is expected to serve as an important reference for designing and training large-scale sparse models.
https://arxiv.org/abs/2602.17004
#machinelearning #mixtureofexperts #largescale #transformers #optimization

Arcee Trinity Large Technical Report
We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini, with Trinity Nano having 6B total parameters and 1B activated per token, and Trinity Mini having 26B total parameters and 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy titled Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.
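
The abstract names sigmoid routing and SMEBU (Soft-clamped Momentum Expert Bias Updates) without spelling out the update rule, so the sketch below is only a minimal, assumption-laden illustration of what such a router could look like: sigmoid expert affinities, a per-expert selection bias nudged toward under-loaded experts by a momentum update, and a tanh-based soft clamp on the bias. The class name, hyperparameters (`bias_lr`, `momentum`, `clamp`), and the exact clamp/update form are assumptions for illustration, not the report's definition of SMEBU.

```python
import torch


class SigmoidRouter(torch.nn.Module):
    """Illustrative MoE router: sigmoid gating plus a per-expert load-balancing
    bias updated with momentum and soft-clamped via tanh. This is a sketch of
    one plausible reading of SMEBU, not the report's actual algorithm."""

    def __init__(self, d_model: int, n_experts: int, top_k: int,
                 bias_lr: float = 1e-3, momentum: float = 0.9, clamp: float = 1.0):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k
        self.bias_lr = bias_lr      # assumed step size for the bias update
        self.momentum = momentum    # assumed momentum coefficient
        self.clamp = clamp          # assumed soft-clamp scale
        # Per-expert bias used only for expert *selection*, not for combine weights.
        self.register_buffer("expert_bias", torch.zeros(n_experts))
        self.register_buffer("bias_velocity", torch.zeros(n_experts))

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model). Sigmoid affinities instead of a softmax over experts.
        scores = torch.sigmoid(self.gate(x))                       # (tokens, n_experts)
        # The bias shifts routing toward under-loaded experts; gate weights stay unbiased.
        _, expert_idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        gate_weights = torch.gather(scores, -1, expert_idx)
        gate_weights = gate_weights / (gate_weights.sum(dim=-1, keepdim=True) + 1e-9)

        if self.training:
            self._update_bias(expert_idx, scores.shape[1])
        return expert_idx, gate_weights

    @torch.no_grad()
    def _update_bias(self, expert_idx: torch.Tensor, n_experts: int):
        # Load error: gap between each expert's assigned-token count and the mean count.
        counts = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
        error = counts.mean() - counts                              # > 0 for under-loaded experts
        # Momentum update on the bias, then a tanh soft clamp so the bias saturates
        # smoothly instead of growing without bound (one reading of "soft-clamped").
        self.bias_velocity.mul_(self.momentum).add_(self.bias_lr * error)
        self.expert_bias.copy_(self.clamp * torch.tanh(
            (self.expert_bias + self.bias_velocity) / self.clamp))
```

In this sketch the bias only influences which experts are selected, while the combine weights come from the unbiased sigmoid scores, so load can be balanced without an auxiliary loss term; whether Trinity Large follows this exact division of roles is not stated in the abstract.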
