Breakthrough in Visual Language Models and Reasoning 🧠
🔍 #LLaVAo1 pioneers systematic visual reasoning capabilities:
• First #VLM to implement spontaneous step-by-step analysis like OpenAI's #o1
• New 11B model surpasses #Gemini15pro, #GPT4omini & #Llama32 90B-Vision-Instruct performance
• Excels on 6 multimodal benchmark tests
• Breaks down complex problems into structured analysis stages
🎯 Key Features:
• Problem outline creation
• Image information interpretation
• Sequential reasoning process
• Evidence-based conclusions
• Handles science & reasoning challenges
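The staged output described above can be sketched as a small parser. A minimal sketch, assuming each stage is delimited by tags such as `<SUMMARY>`, `<CAPTION>`, `<REASONING>`, and `<CONCLUSION>` (the tag names follow the paper's stage descriptions; treat the exact format as an assumption about the released model):

```python
import re

# Stage tags matching the four reasoning stages described in the paper;
# the exact delimiter format is an assumption, not the confirmed spec.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(response: str) -> dict:
    """Split a structured model response into its reasoning stages."""
    parsed = {}
    for stage in STAGES:
        # Non-greedy match so each stage stops at its own closing tag.
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        if match:
            parsed[stage] = match.group(1).strip()
    return parsed

example = (
    "<SUMMARY>Outline the problem.</SUMMARY>"
    "<CAPTION>Describe the relevant image regions.</CAPTION>"
    "<REASONING>Step through the logic.</REASONING>"
    "<CONCLUSION>42</CONCLUSION>"
)
stages = parse_stages(example)
```

Keeping the stages separate like this is what lets inference-time methods score and search over each stage independently.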
💡 Technical Specs:
• Based on #opensource architecture
• Pretrained weights available on #HuggingFace
• 11B parameter model size
• Supports multiple reasoning domains
📚 Paper available: https://arxiv.org/abs/2411.10440
🔗 Project repository: https://github.com/PKU-YuanGroup/LLaVA-o1
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
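The stage-level beam search mentioned in the abstract can be illustrated with a minimal sketch: at each reasoning stage, generate several candidate continuations, score them, and keep only the top beams before moving to the next stage. The `generate_candidates` and `score` functions below are hypothetical stand-ins for the VLM sampler and the candidate verifier; only the search skeleton is the point here.

```python
# Minimal sketch of stage-level beam search, assuming a candidate
# generator and a scorer; both are hypothetical stand-ins for the
# VLM sampler and the verifier used in the paper.
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_candidates(prefix, stage, n=4):
    # Stand-in: the real model would sample n continuations for this stage,
    # conditioned on the question, image, and the stages chosen so far.
    return [f"{stage}-candidate-{i}" for i in range(n)]

def score(prefix, candidate):
    # Stand-in: the real system would score each candidate (e.g., by
    # asking the model to compare candidates); here, prefer lower indices.
    return -int(candidate.rsplit("-", 1)[1])

def stage_level_beam_search(question, beam_width=2):
    beams = [[]]  # each beam is the list of stage outputs chosen so far
    for stage in STAGES:
        expanded = []
        for prefix in beams:
            for cand in generate_candidates(prefix, stage):
                expanded.append((score(prefix, cand), prefix + [cand]))
        # Prune to the top-scoring partial answers before the next stage.
        expanded.sort(key=lambda t: t[0], reverse=True)
        beams = [prefix for _, prefix in expanded[:beam_width]]
    return beams[0]  # best complete four-stage answer

result = stage_level_beam_search("What is shown in the image?")
```

Searching per stage rather than per token keeps the number of scored candidates small while still letting the model discard a weak summary or caption before it contaminates the later reasoning.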
