#NVIDIA introduces #NVLM 1.0, a family of open-source #multimodal #LLMs:
🏆 Achieves state-of-the-art results on vision-language tasks, competing with #GPT4 and #Llama3V
📊 72B model delivers top scores on the #OCRBench and #VQAv2 benchmarks
📈 Shows improved accuracy on text-only tasks after multimodal training
💻 Excels in #math, #coding, and #reasoning across modalities
🧠 Novel architecture enhances training efficiency and multimodal reasoning
🖼️ Introduces 1-D tile-tagging for improved performance on high-resolution images
🔬 Emphasizes dataset quality and task diversity over scale in training
🔗 Open-sourcing model weights and training code in Megatron-Core
Learn more: https://research.nvidia.com/labs/adlr/NVLM-1/
NVLM: Open Frontier-Class Multimodal LLMs
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.
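The 1-D tile tagging mentioned above can be illustrated with a minimal sketch: a high-resolution image is split into non-overlapping tiles in row-major (1-D) order, and each tile is paired with a text tag marking its position in that sequence. The tag format (`<tile_1>`, `<tile_2>`, ...) and helper names here are illustrative assumptions, not NVLM's actual implementation.

```python
from typing import List, Tuple

def tile_image_1d(image: List[List[int]], tile_h: int, tile_w: int) -> List[Tuple[str, List[List[int]]]]:
    """Split an image (an H x W grid of pixel values) into non-overlapping
    tiles, scanned row-major, and pair each tile with a 1-D positional text
    tag such as "<tile_1>" (tag format is an assumption for illustration)."""
    h, w = len(image), len(image[0])
    tagged_tiles = []
    idx = 1
    for top in range(0, h, tile_h):
        for left in range(0, w, tile_w):
            # Slice out one tile_h x tile_w block of the image.
            tile = [row[left:left + tile_w] for row in image[top:top + tile_h]]
            tagged_tiles.append((f"<tile_{idx}>", tile))
            idx += 1
    return tagged_tiles

# Usage: a 4x4 "image" split into 2x2 tiles yields four tagged tiles.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = tile_image_1d(img, 2, 2)
print([tag for tag, _ in tiles])  # -> ['<tile_1>', '<tile_2>', '<tile_3>', '<tile_4>']
```

In the model, each tag would be injected as text tokens ahead of the corresponding tile's visual features, giving the LLM an explicit 1-D positional cue for reassembling the high-resolution layout.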