Black Forest Labs just dropped Flux 2, a hybrid architecture that pairs a Rectified Flow Transformer with a VAE image encoder, now powered by the new Mistral‑3 24B vision‑language model. The open‑source‑friendly release brings multimodal generation to the BFL API, making it easy for developers to experiment. Dive into the details and see what this combo can create! #Flux2 #Mistral324B #VisionLanguageModel #HybridArchitecture

🔗 https://aidailypost.com/news/black-forest-labs-releases-flux-2-mistral3-24b-visionlanguage-model
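For context on the architecture: rectified-flow models generate images by integrating a learned velocity field along a near-straight path from noise to data. Here is a toy sketch of that sampling loop in plain NumPy, with a hand-written straight-path velocity standing in for the trained transformer (illustrative only, not BFL's actual code):

```python
import numpy as np

def euler_sample(v_fn, x0, steps=50):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + v_fn(x, t) * dt
    return x

# Toy "model": the exact velocity for a straight path toward a fixed target.
# A real rectified-flow transformer would predict this field from data.
target = np.array([1.0, -2.0])
x0 = np.zeros(2)
v = lambda x, t: (target - x) / max(1.0 - t, 1e-6)

xT = euler_sample(v, x0)
print(xT)  # lands (numerically) on the target point
```

With the true straight-path velocity, each Euler step advances the sample exactly along the line from noise to data, which is the property rectified flows are trained to approximate.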

NVIDIA has released Nemotron Nano 12B V2 VL, an impressive open-source VLM. The model handles document recognition, content summarization, and accurate image recognition. #AI #NVIDIA #VisionLanguageModel #CongNgheAI #AIVietNam

https://www.reddit.com/r/LocalLLaMA/comments/1ojrv67/tried_nvidias_new_opensource_vlm_heres_my/

Running VLMs: From CPU Optimization to the Cloud

A complete guide to running VLMs: model comparisons, Intel CPU optimization, and using Ollama Cloud, with tips you can apply directly in practice.

https://aisparkup.com/posts/5634

#MistralAI Document #AI: Advanced #OCR solution for complex document processing 📄

📺 https://www.youtube.com/watch?v=yrx5D5WosrU

🔧 Fine-tuned #VisionLanguageModel specifically designed for document understanding beyond traditional #OCR limitations that plague most business workflows
📊 Processes diverse file formats including #PDF files and images with complex layouts, tables, charts and poor quality scans that typically cause errors

🧵 👇

#UITARS Desktop: The Future of Computer Control through Natural Language 🖥️

🎯 #ByteDance introduces GUI agent powered by #VisionLanguageModel for intuitive computer control

Code: https://lnkd.in/eNKasq56
Paper: https://lnkd.in/eN5UPQ6V
Models: https://lnkd.in/eVRAwA-9

#ai

🧵 ↓

Edge-Ready #Vision Language Model Advances Visual #AI Processing 🌟

🧠 #OmniVision (968M params) sets new benchmark as world's smallest #VisionLanguageModel

🔄 Architecture combines #Qwen2 (0.5B) for text & #SigLIP (400M) for vision processing

💡 Key Innovations:
• 9x token reduction (729 → 81) for faster processing
• Enhanced accuracy through #DPO training
• Only 988MB RAM & 948MB storage required
• Outperforms #nanoLLAVA across multiple benchmarks
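
One common way to get a 9x token cut like 729 → 81 is to fold each 3×3 neighborhood of the 27×27 vision-token grid into the channel dimension before the projection layer. A sketch of that reshaping (illustrative, not Nexa's exact code; the 64-dim toy channels are an assumption):

```python
import numpy as np

def fold_tokens(tokens, grid=27, k=3):
    """Fold each k x k neighborhood of a grid x grid token map into
    the channel dim: (grid*grid, d) -> ((grid//k)**2, d*k*k)."""
    d = tokens.shape[-1]
    g = grid // k
    x = tokens.reshape(grid, grid, d)
    x = x.reshape(g, k, g, k, d)      # split rows and cols into k-blocks
    x = x.transpose(0, 2, 1, 3, 4)    # group the k x k block per output token
    return x.reshape(g * g, k * k * d)

vision_tokens = np.random.randn(729, 64)  # SigLIP-style 27x27 grid, toy dim
folded = fold_tokens(vision_tokens)
print(folded.shape)  # (81, 576)
```

The language model then sees 81 wider tokens instead of 729 narrow ones, which is where the latency win comes from.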

🎯 Use Cases:
• Image analysis & description
• Visual memory assistance
• Recipe generation from food images
• Technical documentation support

Try it now: https://huggingface.co/spaces/NexaAIDev/omnivlm-dpo-demo
Source: https://nexa.ai/blogs/omni-vision

Google DeepMind’s PaliGemma: A Small But Mighty Open-Source Vision-Language Model

Explore Google DeepMind's PaliGemma, a compact vision-language model with 3 billion parameters. This open-source VLM delivers impressive performance on diverse tasks, setting new standards in AI efficiency.

Tech Chill