AISatoshi (@AiXsatoshi)

A developer-perspective issue was raised: the sparse attention (DSA) in DeepSeek-V3.2 and GLM-5 can currently be used in vLLM only on Hopper- or Blackwell-class GPUs. The key points are serving compatibility and the range of hardware acceleration support for the latest large models.

https://x.com/AiXsatoshi/status/2035509839858962685

#deepseek #glm5 #vllm #sparseattention #llm

AI✖️Satoshi⏩️ (@AiXsatoshi) on X

So the sparse attention (DSA) in DeepSeek-V3.2 and GLM-5 can currently only be used on Hopper or Blackwell devices in vLLM...

X (formerly Twitter)

Baidu Inc. (@Baidu_Inc)

Deployment info: the 4B-parameter Qianfan-OCR can be served on a single GPU. With W8A8 quantization it processes 1.024 pages/second on a single NVIDIA A100. It runs on a single vLLM instance, so no multi-stage orchestration is needed. It has been deployed on the Baidu Qianfan platform, and the weights are released on HuggingFace.

https://x.com/Baidu_Inc/status/2034265152267415770

#qianfanocr #quantization #vllm #huggingface
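As a back-of-the-envelope check on the quoted figure (assuming the 1.024 pages/second rate holds steady under sustained load), one A100's daily throughput works out as:

```python
# Rough throughput estimate for a single W8A8-quantized Qianfan-OCR
# instance on one A100, using the 1.024 pages/second figure from the post.
PAGES_PER_SECOND = 1.024
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

pages_per_day = PAGES_PER_SECOND * SECONDS_PER_DAY
print(f"{pages_per_day:,.0f} pages/day per GPU")  # prints "88,474 pages/day per GPU"
```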

Performance Evaluation of Model Parallelization Methods with vLLM - Qiita

Introduction: Thank you for opening this article. I am Sasaki from Mitsubishi Electric. This article presents performance-evaluation results for tensor parallelism, pipeline parallelism, and expert parallelism, the model parallelization methods supported by the LLM inference engine vLLM, measured in a single-host environment with multiple GPUs...

Qiita
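The key difference between the two dense-model strategies the article benchmarks is what gets sharded: tensor parallelism slices within each layer, while pipeline parallelism slices the layer stack into sequential stages (expert parallelism instead distributes MoE experts across GPUs). A toy shape-only sketch of the two splits, with no real kernels and illustrative names:

```python
# Toy illustration of how the two dense-model parallelism modes divide
# work across GPUs (shapes and layer indices only; no real computation).

def tensor_parallel_split(layer_width: int, num_gpus: int) -> list[int]:
    """Tensor parallelism: every GPU holds a 1/num_gpus slice of each layer."""
    assert layer_width % num_gpus == 0, "width must divide evenly"
    return [layer_width // num_gpus] * num_gpus

def pipeline_parallel_split(num_layers: int, num_stages: int) -> list[list[int]]:
    """Pipeline parallelism: each GPU (stage) holds a contiguous run of layers."""
    assert num_layers % num_stages == 0, "layers must divide evenly"
    per_stage = num_layers // num_stages
    return [list(range(s * per_stage, (s + 1) * per_stage))
            for s in range(num_stages)]

# With 4 GPUs, a 4096-wide layer is split into four 1024-wide shards,
# whereas an 8-layer stack on 2 GPUs becomes stages [0..3] and [4..7].
print(tensor_parallel_split(4096, 4))
print(pipeline_parallel_split(8, 2))
```

Tensor parallelism requires an all-reduce per layer (so it favors fast intra-host interconnects), while pipeline parallelism only passes activations between stage boundaries, which is the trade-off the benchmark explores.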

🚀 Running #OCR at scale with a #Vision #LLM for $0.49/hour

Just deployed dots.ocr (3B parameter Vision LLM by RedNote) on a single #RTX A6000 (48GB VRAM) via #RunPod. The results are great:

https://github.com/rednote-hilab/dots.ocr

#ai #opensource

📄 The Setup
- Upload any #PDF → server converts each page to an image (PyMuPDF)
- Images are sent in parallel to #vLLM (continuous batching)
- The Vision LLM reads each page and returns clean Markdown

🧵 👇

SGLang and vLLM Workshops Coming to GOSIM Paris 2026!

The GOSIM Workshops have long been known for their diversity, hands-on learning, and interactivity, making them one of the most popular segments of the conference.

This May, the SGLang Workshop and vLLM Workshop will arrive at GOSIM Paris 2026, bringing together AI infrastructure developers from around the world to explore the latest advances in LLM inference systems.

Ticket purchase link:
https://eventbrite.com/e/gosim-paris-2026-tickets-1984013840806?aff=oddtdtcreator

#SGLang #vLLM

Get started with consuming GPU-hosted large language models on Developer Sandbox | Red Hat Developer

Learn the many ways you can interact with GPU-hosted large language models (LLMs) on Developer Sandbox, including connecting the model endpoints, interacting with the API endpoints using the hosted

Red Hat Developer
#vllm #Opensource #openai #python
It is ridiculous (I'm sorry), but I don't have the hardware to run what I've been building with vLLM. I started innocently with prompts, and I'm not sure how I got here either. Any industry insider taking a look would know immediately; then I could get back to my own field, which is literature. Grok on Twitter / X gave it the following review, but until human eyes look at it, I can never know:
https://x.com/grok/status/2032528365870072079?s=20
Open source:
https://codeberg.org/SchneeBTabanic/ProjectNamirha
Grok (@grok) on X

@Schnee_BTabanic @elonmusk Reviewed vessel_v4_7_vllm.py. It implements a Flask + vLLM server for local LLMs with XGrammar token masking to enforce structured outputs (PREMISE → EVIDENCE → DEDUCTION → ACTION), dynamic logit shaping, checkpointed generation, and local tool audits (fetch, search) via MCP.
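The stage-ordering constraint the review mentions can be illustrated with a toy version of logit masking. This is not the real XGrammar grammar compilation, just a sketch of the idea: at each decoding step, any stage-header token other than the next legal one in PREMISE → EVIDENCE → DEDUCTION → ACTION gets its logit forced to negative infinity, so the sampler cannot pick it.

```python
import math

# Toy illustration of grammar-style logit masking (not real XGrammar):
# the four stage headers must be emitted in this fixed order.
STAGES = ["PREMISE", "EVIDENCE", "DEDUCTION", "ACTION"]

def mask_stage_logits(logits: dict[str, float],
                      emitted: list[str]) -> dict[str, float]:
    """Force to -inf every stage-header logit except the next legal stage;
    ordinary tokens pass through untouched."""
    next_stage = STAGES[len(emitted)] if len(emitted) < len(STAGES) else None
    return {
        tok: (score if tok not in STAGES or tok == next_stage else -math.inf)
        for tok, score in logits.items()
    }

# Before any stage is emitted, only PREMISE survives among the headers:
print(mask_stage_logits({"PREMISE": 1.0, "ACTION": 2.0, "the": 0.5}, []))
```

Real integrations (XGrammar via vLLM's structured-output support) compile a grammar into token-level masks applied the same way, just over the full vocabulary and at token rather than word granularity.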


Complete vLLM setup guide with Docker, OpenAI API compatibility, PagedAttention optimization. Compare vLLM vs Ollama vs Docker Model Runner for production.

#LLM #AI #Python #Docker #DevOps #Self-Hosting #vllm #K8S

https://www.glukhov.org/llm-hosting/vllm/vllm-quickstart/

vLLM Quickstart: High-Performance LLM Serving - in 2026

Rost Glukhov | Personal site and technical blog

🚀 Big news!
The SGLang Workshop & vLLM Workshop are coming to GOSIM Paris 2026! 🎉
🌐 A must-attend event for AI developers and open-source contributors worldwide
💡 Dive into cutting-edge topics: large model inference, agentic AI, and more
🎓 Hands-on sessions and discussions to bring high-value learning and networking

Get your early bird tickets now and enjoy the discount: https://eventbrite.com/e/gosim-paris-2026-tickets-1984013840806?aff=oddtdtcreator 🚀

#GOSIMParis2026 #SGLang #vLLM #AIWorkshop #OpenSourceAI

vLLM now powers high‑throughput inference with its new PagedAttention engine, cutting latency and boosting GPU utilization. Continuous batching lets you serve OpenAI‑scale workloads in production without sacrificing cost. Dive into how this open‑source stack reshapes large‑model serving. #vLLM #PagedAttention #GPUInference #MLInference

🔗 https://aidailypost.com/news/vllm-boosts-production-inference-through-high-throughput
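The core idea behind PagedAttention is OS-style paging for the KV cache: each sequence's cache lives in fixed-size blocks allocated on demand, so memory is not reserved for the maximum sequence length up front and freed blocks are immediately reusable by other requests. A toy allocator sketching the bookkeeping (block size and pool size are illustrative; real vLLM tracks actual KV tensors per block):

```python
# Toy paged KV-cache allocator illustrating the PagedAttention idea:
# fixed-size blocks handed out on demand and returned when a sequence ends.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))       # physical block pool
        self.block_tables: dict[int, list[int]] = {}     # seq id -> its blocks
        self.seq_lens: dict[int, int] = {}               # seq id -> token count

    def append_token(self, seq_id: int) -> None:
        """Grow a sequence by one token, allocating a block only on a boundary."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a sequence must be preempted")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because no sequence pins memory it is not yet using, many more requests fit in the same VRAM, which is what makes the continuous-batching throughput gains possible.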