AISatoshi (@AiXsatoshi)
A developer-side issue was raised about whether the sparse attention (DSA) in DeepSeek-V3.2 and GLM-5 is currently usable in vLLM only on Hopper- or Blackwell-class GPUs. The key points are serving compatibility and the range of hardware-acceleration support for the latest large models.
Baidu Inc. (@Baidu_Inc)
Deployment details: the 4B-parameter Qianfan-OCR can be served on a single GPU. With W8A8 quantization it processes 1.024 pages/sec on a single NVIDIA A100. It runs as a single vLLM instance, so no multi-stage orchestration is needed. The model is deployed on the Baidu Qianfan platform, and the weights are released on HuggingFace.
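A single-GPU launch along these lines might look as follows. This is a hypothetical config fragment: the checkpoint placeholder and the assumption that the W8A8 weights ship in a vLLM-loadable format (e.g. compressed-tensors) are mine, not from the post.

```shell
# Hypothetical single-GPU vLLM launch for a W8A8-quantized OCR model.
# Replace <qianfan-ocr-checkpoint> with the actual HuggingFace repo id.
vllm serve <qianfan-ocr-checkpoint> \
  --quantization compressed-tensors \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```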
🚀 Running #OCR at scale with a #Vision #LLM for $0.49/hour
Just deployed dots.ocr (3B parameter Vision LLM by RedNote) on a single #RTX A6000 (48GB VRAM) via #RunPod. The results are great:
https://github.com/rednote-hilab/dots.ocr
📄 The Setup
- Upload any #PDF → server converts each page to an image (PyMuPDF)
- Images are sent in parallel to #vLLM (continuous batching)
- The Vision LLM reads each page and returns clean Markdown
🧵 👇
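The setup above can be sketched in a few lines. The endpoint URL, model id, and prompt below are assumptions for illustration (a vLLM OpenAI-compatible server serving dots.ocr), not details from the thread.

```python
# Illustrative sketch of the thread's pipeline: render PDF pages with
# PyMuPDF, then build one chat-completions request per page for a vLLM
# OpenAI-compatible endpoint. Endpoint and model id are assumed.
import base64

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
MODEL = "rednote-hilab/dots.ocr"                        # assumed model id

def build_page_request(png_bytes: bytes, model: str = MODEL) -> dict:
    """Build one chat-completions request asking the Vision LLM to
    transcribe a single rendered PDF page into Markdown."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Transcribe this page to clean Markdown."},
            ],
        }],
    }

def render_pdf_pages(path: str):
    """Render each PDF page to PNG bytes with PyMuPDF (imported locally
    so the pure request-building code above has no extra dependency)."""
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        for page in doc:
            yield page.get_pixmap(dpi=150).tobytes("png")
```

In practice the per-page requests would be fired in parallel (e.g. with `concurrent.futures` and an HTTP client), letting vLLM's continuous batching overlap them on the GPU.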
SGLang and vLLM Workshops Coming to GOSIM Paris 2026!
The GOSIM Workshops have long been known for their diversity, hands-on learning, and interactivity, making them one of the most popular segments of the conference.
This May, the SGLang Workshop and vLLM Workshop will arrive at GOSIM Paris 2026, bringing together AI infrastructure developers from around the world to explore the latest advances in LLM inference systems.
Ticket purchase link:
https://eventbrite.com/e/gosim-paris-2026-tickets-1984013840806?aff=oddtdtcreator

Learn the many ways you can interact with GPU-hosted large language models (LLMs) on Developer Sandbox, including connecting to the model endpoints and interacting with the API endpoints using the hosted …

@Schnee_BTabanic @elonmusk Reviewed vessel_v4_7_vllm.py. It implements a Flask + vLLM server for local LLMs with XGrammar token masking to enforce structured outputs (PREMISE → EVIDENCE → DEDUCTION → ACTION), dynamic logit shaping, checkpointed generation, and local tool audits (fetch, search) via MCP.
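One way to express the PREMISE → EVIDENCE → DEDUCTION → ACTION format that such token masking enforces is as a regex over the output; vLLM's OpenAI-compatible server has exposed guided-decoding parameters (with XGrammar among its structured-output backends) that accept patterns like this. The exact section wording and sample text below are assumptions, not taken from the reviewed script.

```python
# Hedged sketch: a regex capturing the four-section structured output
# described above. At decode time a structured-output backend masks
# tokens so the model can only emit strings matching the pattern.
import re

SECTION_PATTERN = re.compile(
    r"PREMISE: .+\n"
    r"EVIDENCE: .+\n"
    r"DEDUCTION: .+\n"
    r"ACTION: .+"
)

sample = (
    "PREMISE: The server logs show repeated timeouts.\n"
    "EVIDENCE: 37 requests exceeded 30s in the last hour.\n"
    "DEDUCTION: The upstream dependency is degraded.\n"
    "ACTION: Fail over to the secondary endpoint."
)
print(bool(SECTION_PATTERN.fullmatch(sample)))  # True
```

Against a vLLM OpenAI-compatible server, such a pattern could be passed per-request (e.g. via the server's guided-regex option in `extra_body`) so the constraint is enforced during generation rather than checked afterwards.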
🚀 Big news!
The SGLang Workshop & vLLM Workshop are coming to GOSIM Paris 2026! 🎉
🌐 A must-attend event for AI developers and open-source contributors worldwide
💡 Dive into cutting-edge topics: large model inference, agentic AI, and more
🎓 Hands-on sessions and discussions to bring high-value learning and networking
Get your early bird tickets now and enjoy the discount: https://eventbrite.com/e/gosim-paris-2026-tickets-1984013840806?aff=oddtdtcreator 🚀
vLLM now powers high‑throughput inference with its new PagedAttention engine, cutting latency and boosting GPU utilization. Continuous batching lets you serve OpenAI‑scale workloads in production without sacrificing cost. Dive into how this open‑source stack reshapes large‑model serving. #vLLM #PagedAttention #GPUInference #MLInference
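The core idea behind PagedAttention can be shown with a toy block table: KV-cache memory is managed in fixed-size blocks, and each sequence maps logical positions to physical blocks on demand. This is a conceptual sketch of the technique, not vLLM's implementation.

```python
# Toy illustration of PagedAttention-style KV-cache management:
# memory is carved into fixed-size blocks, and each sequence keeps a
# block table mapping logical block indices to physical block ids.
BLOCK_SIZE = 16  # tokens per KV block

class BlockTable:
    def __init__(self, free_blocks: list[int]):
        self.free = free_blocks       # shared pool of physical block ids
        self.table: list[int] = []    # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one fills up,
        # so memory grows on demand instead of being pre-reserved for
        # the maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.table.append(self.free.pop())
        self.num_tokens += 1

pool = list(range(100))   # 100 free physical blocks
seq = BlockTable(pool)
for _ in range(40):       # generate 40 tokens
    seq.append_token()
print(len(seq.table))     # 3 blocks for 40 tokens (ceil(40 / 16))
```

Because blocks are allocated lazily from a shared pool, many concurrent sequences can share the GPU's KV memory, which is what makes continuous batching memory-efficient.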
🔗 https://aidailypost.com/news/vllm-boosts-production-inference-through-high-throughput