RT @PavloMolchanov: 🚀 Selbst-Spekulation ermöglicht eine 6,75-fache echte Beschleunigung der LLM-Generierung mit SGLang-Inference!

mehr auf Arint.info

#AI #Diffusion #LLM #MachineLearning #Nemotron #SGLang #arint_info

https://x.com/PavloMolchanov/status/2060245957254824246#m

Arint - SEO+KI (@[email protected])

<p>RT @PavloMolchanov: 🚀 Selbst-Spekulation ermöglicht eine 6,75-fache echte Beschleunigung der LLM-Generierung mit SGLang-Inference!</p> <p><a href="https://arint.info/@Arint/116664374649559028">mehr</a> auf <a href="https://arint.info/">Arint.info</a></p> <p>#AI #Diffusion #LLM #MachineLearning #Nemotron #SGLang #arint_info</p> <p><a href="https://x.com/PavloMolchanov/status/2060245957254824246#m">https://x.com/PavloMolchanov/status/2060245957254824246#m</a></p>

Mastodon Glitch Edition

Архитектура AI-сервисов: почему монолит убивает latency и GPU

Ваш AI‑чат или автокомплит тормозит при 50 запросах в секунду? Монолит убивает GPU и латенси? В этом туториале — реальная архитектура low‑latency инференса на high‑load: почему изолированный inference‑bundle вместо монолита, как выбрать между vLLM и SGLang без маркетинга, зачем нужны continuous batching и admission control. Читать разбор

https://habr.com/ru/companies/otus/articles/1031286/

#AIсервисы #LLM #инференс #highload #latency #GPU #vLLM #SGLang #continuous_batching #admission_control

Архитектура AI-сервисов: почему монолит убивает latency и GPU

Всем привет, меня зовут Сергей Прощаев, и в этой статье я расскажу про реальную архитектуру ИИ-сервисов, которые выдерживают high-load и отвечают за десятки миллисекунд. Я Tech Lead и руководитель...

Хабр

RT @lmsysorg: Der DeepSeek V4-Bug für die fehlerhafte Ausgabe in der Open-Source-Inferenz-Engine wurde in SGLang behoben.

mehr auf Arint.info

#AI #BugFix #Collaboration #DeepSeek #OpenSource #SGLang #arint_info

https://x.com/lmsysorg/status/2048592063290356011#m

Arint - SEO+KI (@[email protected])

<p>RT @lmsysorg: Der DeepSeek V4-Bug für die fehlerhafte Ausgabe in der Open-Source-Inferenz-Engine wurde in SGLang behoben.</p> <p><a href="https://arint.info/@Arint/116476097578552901">mehr</a> auf <a href="https://arint.info/">Arint.info</a></p> <p>#AI #BugFix #Collaboration #DeepSeek #OpenSource #SGLang #arint_info</p> <p><a href="https://x.com/lmsysorg/status/2048592063290356011#m">https://x.com/lmsysorg/status/2048592063290356011#m</a></p>

Mastodon Glitch Edition

SGLang Flaw Enables Remote Code Execution via Malicious Model Files

A single malicious file can become a powerful gateway for attackers to run arbitrary commands on vulnerable machines - and a newly disclosed flaw in SGLang, CVE-2026-5760, reveals just how easily this can happen through specially crafted GGUF model files. This highly severe vulnerability, scoring 9.8 out of 10.0, enables remote code…

https://osintsights.com/sglang-flaw-enables-remote-code-execution-via-malicious-model-files?utm_source=mastodon&utm_medium=social

#RemoteCodeExecution #Cve20265760 #CommandInjection #Gguf #Sglang

SGLang Flaw Enables Remote Code Execution via Malicious Model Files

Learn how CVE-2026-5760 enables remote code execution via malicious SGLang model files and protect your systems now with expert insights and mitigation strategies.

OSINTSights

RT @ZenMagnets: Minimax m2.7 nvfp4 läuft mit ~130 tok/s im Single-Stream auf 2x RTX 6k mit sglang. Bis zu ~1500 tok/s bei 64 gleichzeitigen frischen Kontexten. Enormer Leistungsabfall bei höheren Kontexten. Aber viel schneller als meine m2.5 vLLM-Konfiguration von vor zwei Monaten (sprich: 2 KI-Jahre), und ich bin beeindruckt, wie sehr SgLang bei der Performance bei hoher Nebenläufigkeit aufgeholt hat, was früher eine Spezialität von vLLM war. Verwendung der lukealonso/MiniMax-M2.7-NVFP4 Konfiguration ➡️ Alt-Text des Bildes 𝗭𝗲𝗻 𝗠𝗮𝗴𝗻𝗲𝘁𝘀 (@ZenMagnets) GROSSE BEGEISTERUNG: Erster Minimax m2.5 NVFP4 Quant auf Hugging Face. 83 tok/s Single-Stream vLLM auf zwei RTX 6000. Oder etwa doppelt so schnell wie ein Mac 512GB-System, das halb so viel kostet. Außer dass der Mac nicht auch 1000+ tok/s über 32+ gleichzeitige Verbindungen schafft. Leistungsbegrenzung bei 550W pro GPU für diesen Test. lukealonso/MiniMax-M2.5-NVFP4 vLLM-Rezept, das ich im Alt-Text des Bildes verwendet habe — https://nitter.net/ZenMagnets/status/2022562893091475626#m

mehr auf Arint.info

#AI #GPU #LLM #MachineLearning #NVIDIA #SGLang #arint_info

https://x.com/ZenMagnets/status/2044281284885958780#m

Install SGLang with uv, pip, or Docker; configure YAML and server flags; then serve Hugging Face LLMs with an OpenAI-compatible API plus native /generate and offline Engine examples.

#Cheatsheet #Self-Hosting #LLM #AI #AI Coding #DevOps #Docker #sglang #openai #SelfHosting

https://www.glukhov.org/llm-hosting/sglang/

SGLang QuickStart: Install, Configure, and Serve LLMs via OpenAI API

Install SGLang with uv, pip, or Docker; configure YAML and server flags; then serve Hugging Face LLMs with an OpenAI-compatible API plus native /generate and offline Engine examples.

Rost Glukhov | Personal site and technical blog

SGLang and vLLM Workshops Coming to GOSIM Paris 2026!

The GOSIM Workshops have long been known for their diversity, hands-on learning, and interactivity, making them one of the most popular segments of the conference.

This May, the SGLang Workshop and vLLM Workshop will arrive at GOSIM Paris 2026, bringing together AI infrastructure developers from around the world to explore the latest advances in LLM inference systems.

Ticket purchase link:
https://eventbrite.com/e/gosim-paris-2026-tickets-1984013840806?aff=oddtdtcreator

#SGLang #vLLM

🚀 Big news!
The SGLang Workshop & vLLM Workshop are coming to GOSIM Paris 2026! 🎉
🌐 A must-attend event for AI developers and open-source contributors worldwide
💡 Dive into cutting-edge topics: large model inference, agentic AI, and more
🎓 Hands-on sessions and discussions to bring high-value learning and networking

Get your early bird tickets now and enjoy the discount: https://eventbrite.com/e/gosim-paris-2026-tickets-1984013840806?aff=oddtdtcreator 🚀

#GOSIMParis2026 #SGLang #vLLM #AIWorkshop #OpenSourceAI

Kimi K2.5 chạy trên ktkernel + sglang đạt 16 token/giây nhưng thiếu thẻ mở &lt;think&gt; trong chuỗi phản hồi, chỉ có &lt;/think&gt;. Điều này gây lỗi với Open WebUI, Cline do parser không nhận diện được phần reasoning. Dù đã dùng --reasoning-parser kimi_k2 nhưng không hiệu quả. Cần tìm cách khôi phục thẻ mở &lt;think&gt;. #KimiK25 #SGLang #LocalLLM #AIInference #reasoning #hỗ_trợ_AI #mô_hình_ngôn_ngữ #trí_tuệ_nhân_tạo #thiếu_thẻ_thinking

https://www.reddit.com/r/LocalLLaMA/comments/1qqebfh/kim

DFlash: Hệ thống giải mã suy đoán theo kiểu khuếch tán, tạo block token cùng lúc thay vì từng token. Dùng draft model nhẹ để tạo block, kiểm nghiệm bằng LLM đích – tăng độ chấp nhận và hiệu suất, đặc biệt với văn cảnh dài & batch lớn. Hỗ trợ Qwen3-4B/8B/30B, tích hợp với SGLang, hỗ trợ streaming và sinh code dài. Hiệu quả cao trong sinh code và đầu ra cấu trúc. Code, checkpoint đã công bố, hướng dẫn huấn luyện sắp ra mắt. #DFlash #LLM #SpeculativeDecoding #Qwen3 #SGLang #AI #MachineLearning #Trí