Alibaba's new open-source model Qwen3-VL can scan two-hour videos and scores 96.5% on DocVQA and 875 on OCRBench. The multimodal vision-language system rivals the rumored GPT-5 in document understanding. Dive into the results and see why the community is buzzing. #Qwen3VL #Alibaba #DocVQA #OCRBench

🔗 https://aidailypost.com/news/qwen3vl-scans-twohour-videos-hits-965-docvqa-875-ocrbench

#NVIDIA introduces #NVLM 1.0, a family of open-source #multimodal #LLMs:

🏆 Achieves state-of-the-art results on vision-language tasks, competing with #GPT4o and #Llama3V

📊 72B model leads on the #OCRBench and #VQAv2 benchmarks

📈 Shows improved accuracy on text-only tasks after multimodal training

💻 Excels in #math, #coding, and #reasoning across modalities

🧠 Novel architecture enhances training efficiency and multimodal reasoning

🖼️ Introduces 1-D tile-tagging for improved performance on high-resolution images (see the sketch after this post)

🔬 Emphasizes dataset quality and task diversity over scale in training

🔗 Open-sourcing model weights and training code in Megatron-Core

Learn more: https://research.nvidia.com/labs/adlr/NVLM-1/
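For readers curious what "1-D tile-tagging" means in practice: the idea, per the NVLM write-up, is to split a high-resolution image into tiles plus a global thumbnail and prefix each tile's visual tokens with a text tag before flattening everything into a single 1-D sequence for the LLM. Below is a minimal sketch of that idea; the tag strings, token-id scheme, and function name are illustrative assumptions, not NVIDIA's released code.

```python
from typing import Dict, List

def build_tile_tagged_sequence(
    tile_tokens: List[List[int]],    # visual token ids per high-resolution tile
    thumbnail_tokens: List[int],     # visual token ids for the global thumbnail
    tag_vocab: Dict[str, int],       # maps tag strings like "<tile_1>" to token ids (hypothetical)
) -> List[int]:
    """Flatten thumbnail + tiles into one 1-D sequence, prefixing each block with a tile tag."""
    # Global thumbnail first, marked with its own tag.
    seq: List[int] = [tag_vocab["<tile_global_thumbnail>"]] + thumbnail_tokens
    # Then each high-resolution tile, tagged by its 1-D index.
    for i, tokens in enumerate(tile_tokens, start=1):
        seq += [tag_vocab[f"<tile_{i}>"]] + tokens
    return seq

# Toy example: two tiles of three visual tokens each plus a three-token thumbnail.
tags = {"<tile_global_thumbnail>": 900, "<tile_1>": 901, "<tile_2>": 902}
print(build_tile_tagged_sequence([[10, 11, 12], [13, 14, 15]], [1, 2, 3], tags))
# -> [900, 1, 2, 3, 901, 10, 11, 12, 902, 13, 14, 15]
```

The tags give the decoder a cheap positional cue for which tile each run of visual tokens came from, without any 2-D positional machinery in the LLM itself.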

NVLM: Open Frontier-Class Multimodal LLMs

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.

NVIDIA ADLR