Snowflake's Arctic Long Sequence Training: How to Train LLMs on 15 Million Tokens Without Selling a Kidney

Snowflake AI Research just open-sourced Arctic Long Sequence Training (ALST), a framework that pushes LLM training from a measly 32K tokens to over 15 million — a 469x improvement — using standard Hugging Face models and H100 GPUs. Here's what it means for you.
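Snowflake's write-up attributes the jump to Ulysses-style sequence parallelism (plus tiled compute and activation offloading): each GPU holds only a slice of the sequence, and an all-to-all regroups tensors so attention sees the full sequence on a subset of heads. Below is a minimal conceptual sketch of that all-to-all in PyTorch — illustrative function names and shapes, not ALST's actual API:

```python
# Conceptual sketch of Ulysses-style sequence parallelism (illustrative,
# not ALST's API). Each of P ranks holds S/P tokens; an all-to-all
# regroups q/k/v so each rank computes attention over the FULL sequence
# for H/P of the heads, then a second all-to-all restores the layout.
import torch
import torch.distributed as dist

def seq_shard_to_head_shard(x: torch.Tensor, group=None) -> torch.Tensor:
    """[B, S/P, H, D] (sequence-sharded) -> [B, S, H/P, D] (head-sharded)."""
    P = dist.get_world_size(group)
    B, s_local, H, D = x.shape
    assert H % P == 0, "head count must divide the sequence-parallel degree"
    # Split the head dim into P chunks; chunk p is destined for rank p.
    x = x.reshape(B, s_local, P, H // P, D).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)  # swap seq shards <-> head shards
    # out[q] is rank q's sequence shard for our local heads; stitch the
    # shards back together along the sequence dimension, in rank order.
    return out.permute(1, 0, 2, 3, 4).reshape(B, P * s_local, H // P, D)

def head_shard_to_seq_shard(x: torch.Tensor, group=None) -> torch.Tensor:
    """Inverse all-to-all: [B, S, H/P, D] -> [B, S/P, H, D]."""
    P = dist.get_world_size(group)
    B, S, h_local, D = x.shape
    # Split the sequence into P chunks; chunk p goes back to rank p.
    x = x.reshape(B, P, S // P, h_local, D).permute(1, 0, 2, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # out[q] holds our sequence shard with rank q's head chunk; reassemble
    # the full head dimension in chunk order.
    return out.permute(1, 2, 0, 3, 4).reshape(B, S // P, P * h_local, D)
```

The key property: attention compute is distributed over heads while every rank still sees every token, so the per-GPU memory footprint of the sequence shrinks by the parallel degree.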

TechLife

Running LLMs (large language models) by offloading data to SSD/NVMe to save cost. Tried DeepSpeed but without success so far; looking for pointers and details (see the config sketch below the link).
#LLM #AI #DeepSpeed #NVMe #SSD #ArtificialIntelligence #Technology

https://www.reddit.com/r/LocalLLaMA/comments/1nzvtp9/inference_of_llms_with_offloading_to_ssdnvme/
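For the offloading question above: DeepSpeed's ZeRO stage 3 can place parameters on NVMe via its `offload_param` config. Here is a hedged sketch of what a working setup might look like — the NVMe path and model name are placeholders, and keys/defaults should be checked against the DeepSpeed docs for your version:

```python
# Hedged sketch: ZeRO-3 parameter offload to NVMe with DeepSpeed.
# "/local_nvme" and the model name are placeholders; tune buffer and
# aio settings for your SSD.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # required key even for inference-style use
    "zero_optimization": {
        "stage": 3,                        # partition/offload parameters
        "offload_param": {
            "device": "nvme",              # spill weights to SSD instead of CPU RAM
            "nvme_path": "/local_nvme",    # placeholder: a fast local NVMe mount
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 1_000_000_000,
        },
    },
    "aio": {                               # async I/O tuning for the NVMe reads
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
    },
    "bf16": {"enabled": True},
}

name = "meta-llama/Llama-3.1-8B"           # placeholder model
model = AutoModelForCausalLM.from_pretrained(name)
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

tok = AutoTokenizer.from_pretrained(name)
inputs = tok("Offloading test:", return_tensors="pt").to(engine.device)
with torch.no_grad():
    print(tok.decode(engine.module.generate(**inputs, max_new_tokens=32)[0]))
```

Expect NVMe offload to trade throughput for capacity: generation becomes disk-bandwidth-bound, which is why the aio tuning block matters.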

Introducing Phind-405B and faster, high quality #AI answers for everyone

🚀 Phind-405B: New flagship #llm, based on Meta Llama 3.1 405B, designed for programming & technical tasks. #Phind405B

⚡ 128K-token context window (32K enabled at launch), 92% on HumanEval, strong at web app design. #Programming #AIModel

💡 Trained on 256 H100 GPUs with FP8 mixed precision, cutting memory use by 40% (see the FP8 sketch after this item). #DeepSpeed #FP8

⚡ Phind Instant: a speed-focused model based on Meta Llama 3.1 8B, running at up to 350 tokens/sec. #PhindInstant

🚀 Runs on NVIDIA TensorRT-LLM with flash decoding and fused CUDA kernels. #NVIDIA #GPUs

🔍 Faster search: prefetching results trims up to 800 ms of latency, alongside improved embeddings. #FastSearch

👨‍💻 Goal: help developers experiment faster; more features coming soon! #DevTools #Innovation

https://www.phind.com/blog/introducing-phind-405b-and-better-faster-searches
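On the FP8 point above: one public route to FP8 mixed-precision training on H100s is NVIDIA's Transformer Engine. Whether Phind used it is not stated in the post, so treat this as an illustration of the technique, not their stack:

```python
# Hedged illustration of FP8 mixed-precision training with NVIDIA
# Transformer Engine (requires an H100/Ada-class GPU). Not Phind's
# confirmed training stack -- just one public way to get FP8 GEMMs.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hybrid format: E4M3 for forward activations/weights, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True).cuda()   # drop-in FP8-capable linear
x = torch.randn(32, 4096, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)        # forward GEMM runs in FP8 with delayed scaling
y.sum().backward()      # backward GEMMs also use FP8 per the recipe
```

The memory savings the post cites come from storing activations and GEMM operands in 8-bit rather than 16-bit formats; master weights and optimizer state typically stay in higher precision.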

DeepSpeed powers 8x larger MoE model training with high performance - Microsoft Research

Today, we are proud to announce DeepSpeed MoE, a high-performance system that supports massive scale mixture of experts (MoE) models as part of the DeepSpeed optimization library. MoE models are an emerging class of sparsely activated models that have sublinear compute costs with respect to their parameters. For example, the Switch Transformer consists of 1.6 trillion parameters […]

Microsoft Research
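To make "sparsely activated" concrete: an MoE layer routes each token to a small number of experts, so compute per token stays roughly fixed while total parameters scale with the expert count. A toy top-1-routing sketch (illustrative, not DeepSpeed MoE's implementation):

```python
# Toy top-1 mixture-of-experts layer (not DeepSpeed MoE's code).
# Each token activates exactly one of E expert MLPs, so per-token
# FLOPs are constant while parameter count grows linearly with E --
# the "sublinear compute vs. parameters" property the excerpt names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # learned gating
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [tokens, d_model]
        probs = F.softmax(self.router(x), dim=-1)   # routing distribution
        gate, expert_idx = probs.max(dim=-1)        # pick top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():                          # only chosen experts run
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

moe = Top1MoE(d_model=512, d_ff=2048, num_experts=8)
tokens = torch.randn(16, 512)
print(moe(tokens).shape)  # torch.Size([16, 512]); each token used one expert
```

Production systems like DeepSpeed MoE add expert parallelism, capacity limits, and load-balancing losses on top of this basic routing idea.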