"GPT-4V revolutionizes vision-language tasks with human-level accuracy! #GPT4V #MultimodalAI #VisionLanguage"

GPT-4V, a multimodal AI model, has achieved human-level performance on vision-language tasks by integrating advanced vision encoders with large language models. The model's novel attention mechanism enables more effective cross-modal understanding, allowing it to reason about images with unprecedented...

#GPT-4V #MultimodalAI #Vision-LanguageUnderstanding #LargeLanguageModels

🔍 Major breakthrough in multimodal AI research:

#InfinityMM dataset launches with 43.4M entries across 4 categories: 10M image descriptions, 24.4M visual instructions, 6M high-quality instructions & 3M #AI-generated samples
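
For anyone who wants to poke at the data, a minimal sketch of streaming a few records from the Hugging Face Hub with the `datasets` library; the split name and record schema are assumptions, so check the dataset card first:

```python
# Minimal sketch: stream a few Infinity-MM records rather than downloading
# 43.4M samples. The split name and record schema are assumptions; verify
# them on https://huggingface.co/datasets/BAAI/Infinity-MM before relying
# on either.
from itertools import islice

from datasets import load_dataset

ds = load_dataset("BAAI/Infinity-MM", split="train", streaming=True)

for sample in islice(ds, 3):
    print(sample.keys())  # inspect the schema of a record
```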

🧠 Technical highlights:

New #AquilaVL2B model uses #LLaVA architecture with #Qwen25 language model & #SigLIP for image processing (wiring sketched below)
Despite only 2B parameters, it achieves state-of-the-art results on multiple benchmarks
Exceptional performance: #MMStar (54.9%), #MathVista (59%), #MMBench (75.2%)
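
For context, the LLaVA recipe wires a vision encoder to a language model through a small projector. A minimal PyTorch sketch of that wiring, with SigLIP standing in as the encoder and Qwen2.5 as the decoder; the module is illustrative, not Aquila-VL-2B's actual code:

```python
import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Illustrative LLaVA-style wiring: vision encoder -> MLP projector -> LLM.

    `vision_encoder` (e.g. SigLIP) is assumed to return patch embeddings;
    `language_model` (e.g. Qwen2.5) consumes a mixed sequence of projected
    image tokens and text embeddings. A sketch, not Aquila-VL-2B itself.
    """

    def __init__(self, vision_encoder, language_model, vision_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        # Two-layer MLP projector, as popularized by LLaVA-1.5.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.language_model = language_model

    def forward(self, pixel_values, text_embeds):
        patches = self.vision_encoder(pixel_values)   # (B, N, vision_dim)
        image_tokens = self.projector(patches)        # (B, N, lm_dim)
        # Prepend image tokens to the text embeddings and run the decoder.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```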

🚀 Training innovation:

4-stage training process with increasing complexity
Combines image recognition, instruction classification & response generation
Uses #opensource models like RAM++ for data generation
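
A rough sketch of what such a staged curriculum could look like as a config; every stage name, data bucket, and goal below is a guess at the shape, not the paper's actual settings:

```python
# Illustrative 4-stage curriculum, shaped after the post's description
# (image recognition first, then instruction data of increasing complexity,
# with synthetic data late in training). All values here are assumptions.
CURRICULUM = [
    {"stage": 1, "data": "image descriptions",          "goal": "vision-language alignment"},
    {"stage": 2, "data": "general visual instructions", "goal": "basic instruction following"},
    {"stage": 3, "data": "high-quality instructions",   "goal": "harder, curated tasks"},
    {"stage": 4, "data": "synthetic instructions",      "goal": "GPT-4V-style responses"},
]

for s in CURRICULUM:
    print(f"Stage {s['stage']}: {s['data']} -> {s['goal']}")
```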

💡 Industry impact:

Model trained on both #Nvidia A100 GPUs & Chinese chips
Complete dataset & model available to research community
Shows promising results compared to commercial systems like #GPT4V

https://arxiv.org/abs/2410.18558

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between different types of images and associated instruction types, this method can provide essential guidance during data synthesis. Leveraging this high-quality data, we have trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.
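
The tagging-driven synthesis loop the abstract describes might look roughly like this; the tag-to-task table and both helper functions are hypothetical placeholders, not the paper's implementation:

```python
# Sketch of the synthesis method from the abstract: tag an image, map its
# tags to suitable instruction types, then ask an open-source VLM to write
# an instruction/response pair. `tag_image` and `vlm_generate` are
# hypothetical stand-ins for an RAM++-style tagger and an open-source VLM.

TAG_TO_INSTRUCTION_TYPES = {
    "chart":    ["data extraction", "trend analysis"],
    "document": ["OCR", "summarization"],
    "photo":    ["captioning", "object grounding", "VQA"],
}

def synthesize(image, tag_image, vlm_generate):
    tags = tag_image(image)                    # image -> list of tags
    for tag in tags:
        for task in TAG_TO_INSTRUCTION_TYPES.get(tag, []):
            prompt = (f"Write one {task} instruction about this image, "
                      f"then answer it.")
            yield vlm_generate(image, prompt)  # open-source VLM call
```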

🔍 #Microsoft introduces #OmniParser, a new screen parsing module for #GUI interactions:
• Converts UI screenshots into structured elements for improved #AI agent navigation
• Works with #GPT4V to generate precise actions for interface regions
• Achieves top performance on #WindowsAgentArena benchmark

🛠️ Key Components:
• Specialized datasets for icon detection and description
• Fine-tuned detection model for identifying actionable regions
• Captioning model for extracting functional semantics
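
Put together, those components suggest a pipeline roughly like the following sketch; `detect_regions` and `caption_icon` are hypothetical stand-ins for the fine-tuned detection and captioning models, not Microsoft's actual API:

```python
# OmniParser-style flow: detect interactable regions, caption them, and hand
# the structured list to a VLM that picks the next action. A sketch under
# the assumptions named above, not Microsoft's implementation.

def parse_screen(screenshot, detect_regions, caption_icon):
    elements = []
    for i, box in enumerate(detect_regions(screenshot)):    # bounding boxes
        elements.append({
            "id": i,
            "bbox": box,                                     # (x1, y1, x2, y2)
            "description": caption_icon(screenshot, box),    # functional semantics
        })
    return elements

def build_agent_prompt(task, elements):
    listing = "\n".join(f"[{e['id']}] {e['description']} @ {e['bbox']}"
                        for e in elements)
    return (f"Task: {task}\nInteractable elements:\n{listing}\n"
            f"Reply with the id of the element to click next.")
```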

📊 Performance Highlights:
• Outperforms standard #GPT4V on #ScreenSpot benchmarks
• Compatible with #Phi35V and #Llama32V models
• Functions across PC and mobile platforms without HTML dependencies

🔗 Learn more: https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/

OmniParser for pure vision-based GUI agent - Microsoft Research

By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains […]

[Translation] A picture is worth 170 tokens: how does GPT-4o encode images?

An interesting fact: GPT-4o charges 170 tokens to process each 512x512 tile used in high-resolution mode. At roughly 0.75 tokens per word, that puts a picture at about 227 words, only a factor of four short of the saying that a picture is worth a thousand words. (There is also an 85-token charge for the low-resolution master thumbnail of each image, and higher-resolution images are split into many such 512x512 tiles, but let's limit ourselves to a single high-resolution tile.) But why 170? A strange number, isn't it? OpenAI quotes round numbers in its pricing, such as $20 or $0.50, and powers of two and three in its internal dimensions. So why 170 here? Numbers dropped into a codebase without explanation are called "magic numbers" in programming, and 170 looks like an obvious magic number. And why is the cost of images converted into a token cost at all? If it were only needed for billing, wouldn't it be simpler to just quote a price per tile? What if OpenAI chose 170 not as part of some convoluted pricing strategy, but because it is literally true? What if image tiles really are represented as 170 consecutive embedding vectors? And if so, how is that implemented?
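
The tile arithmetic itself is easy to reproduce from the rules OpenAI documented for high-detail mode (scale down to fit within 2048x2048, scale the shorter side down to 768 px, then count 512x512 tiles at 170 tokens each, plus the 85-token thumbnail):

```python
import math

def gpt4o_image_tokens(width: int, height: int, high_detail: bool = True) -> int:
    """Token cost of one image under OpenAI's documented GPT-4V/GPT-4o rules:
    85 base tokens for the low-resolution thumbnail plus 170 tokens per
    512x512 tile of the rescaled high-detail image."""
    if not high_detail:
        return 85                      # low-detail mode: thumbnail only
    w, h = width, height
    # Scale down to fit within 2048 x 2048, preserving aspect ratio.
    if max(w, h) > 2048:
        r = 2048 / max(w, h)
        w, h = w * r, h * r
    # Scale down so the shorter side is at most 768 px.
    if min(w, h) > 768:
        r = 768 / min(w, h)
        w, h = w * r, h * r
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(512, 512))    # -> 255  (85 + 170 * 1 tile)
print(gpt4o_image_tokens(1024, 1024))  # -> 765  (85 + 170 * 4 tiles)
```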

https://habr.com/ru/articles/834548/

#openai #gpt4 #gpt4o #gpt4v #embeddings

Is Mona Lisa's left hand over her right hand? Aldo Gangemi is asking #GPT4V, which does not give the right answer, considering that the viewer's perspective differs from Mona Lisa's perspective...
So much for a theory of mind in #LLMs...

#generativeAI #knowledgerepresentation #philosophyoflanguage #philosophy #summerschool #ISWS2024

Next up is building a proper frontend to chat with the #GPT4V assistant API

But all that in the next thread

In the meantime, please 💙, 🔁 and 🔖, that helps me get things moving

This September, I quit my job and started working full time on AI Product Engineering

And it coincided with the release of #GPT4V, which got my mind racing again

But the 💡 moment was when I came across @sawyerhood's https://github.com/SawyerHood/draw-a-ui

GitHub - SawyerHood/draw-a-ui: Draw a mockup and generate html for it
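
The core of the draw-a-ui idea fits in one vision call: screenshot the canvas, send it to a GPT-4V-class model, and ask for a single HTML file back. A minimal sketch with the OpenAI Python SDK; the prompt wording is illustrative, not the repo's actual prompt:

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def mockup_to_html(png_path: str) -> str:
    """Send a mockup screenshot to a vision-capable model and ask for HTML.
    The prompt here is illustrative, not draw-a-ui's actual prompt."""
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Turn this mockup into a single self-contained "
                         "HTML file. Reply with HTML only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
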
Fun and interesting experiment with #DallE and #GPT4V: Prompt Dall-E for an image, then let GPT-4 Vision describe that image and feed the result back into Dall-E. Example: https://dalle.party/?party=42riPROf
DALL·E image → GPT4 Vision → repeat | DALL·E Party

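The loop behind dalle.party is easy to sketch with the OpenAI SDK: generate an image, have a vision model describe it, and feed the description back in. The model names and prompts below are illustrative choices, not the site's actual code:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def dalle_party(seed_prompt: str, rounds: int = 3) -> None:
    """Sketch of the image -> description -> image loop. Model names and
    prompt wording are assumptions, not dalle.party's implementation."""
    prompt = seed_prompt
    for i in range(rounds):
        # 1) Generate an image from the current prompt.
        image_url = client.images.generate(
            model="dall-e-3", prompt=prompt, n=1
        ).data[0].url
        print(f"round {i}: {image_url}")
        # 2) Have a vision model describe it; that becomes the next prompt.
        prompt = client.chat.completions.create(
            model="gpt-4o",  # vision-capable model standing in for GPT-4V
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this image in enough detail "
                             "to re-create it."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        ).choices[0].message.content

dalle_party("a cat wearing a spacesuit")
```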