merve (@mervenoyann)

An announcement that two new chapters have been added to a book on vision-language models (VLMs). The document AI chapter covers older models, recent VLM approaches, retrieval, and more; the video language models chapter covers video understanding, related techniques, and practical know-how. The document AI chapter was written by the poster herself.

https://x.com/mervenoyann/status/2028404451476705295

#visionlanguagemodels #documentai #videolanguagemodels #vlm

Two more chapters of the vision language models book are out! The document AI chapter (by yours truly) shows old models, new VLM approaches, retrieval and more; the video language models chapter shows video understanding, know-hows, approaches and more. Sneak peek below.

A new system fuses language models with 3D Gaussian Splatting to help robots build real-time, semantic maps 3.5x faster than existing methods. https://hackernoon.com/researchers-develop-a-real-time-3d-mapping-system-that-helps-robots-understand-natural-language #visionlanguagemodels
Researchers Develop a Real-Time 3D Mapping System That Helps Robots Understand Natural Language | HackerNoon

How a new vision-language AI uses multi-stage reasoning to identify schools, parks, and hospitals—going beyond pixels to understand cities. https://hackernoon.com/how-multi-stage-reasoning-helps-ai-understand-what-cities-mean #visionlanguagemodels
How Multi-Stage Reasoning Helps AI Understand What Cities Mean | HackerNoon

Nvidia launches Alpamayo, open AI models that allow autonomous vehicles to ‘think like a human’

https://fed.brid.gy/r/https://techcrunch.com/2026/01/05/nvidia-launches-alpamayo-open-ai-models-that-allow-autonomous-vehicles-to-think-like-a-human/

"Revolutionize workflow editing with Flowchart2Mermaid, converting images to editable Mermaid.js code! #MermaidJS #VisionLanguageModels #FlowchartEditing"

The Flowchart2Mermaid system leverages vision-language models to convert static flowchart images into editable Mermaid.js code, enhancing reusability and collaboration. This web-based tool utilizes a detailed system prompt to facilitate accurate conversions,...

#Mermaid.js #Vision-LanguageModels #FlowchartConversion #WorkflowEditing
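The article does not reproduce Flowchart2Mermaid's actual system prompt or model choice. As a rough sketch of the general pattern it describes (a hypothetical, much-abbreviated system prompt, and an OpenAI-style chat payload assumed as the VLM interface), the image-to-Mermaid conversion might look like:

```python
import base64
import re

# Hypothetical system prompt -- the real Flowchart2Mermaid prompt is far more detailed.
SYSTEM_PROMPT = (
    "You are a flowchart transcriber. Given an image of a flowchart, "
    "output only valid Mermaid.js 'flowchart TD' code inside a fenced "
    "```mermaid block, preserving node labels and edge directions."
)

def build_messages(image_bytes: bytes, mime: str = "image/png") -> list:
    """Build an OpenAI-style chat payload embedding the flowchart image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "Convert this flowchart to Mermaid.js."},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ]},
    ]

def extract_mermaid(reply: str) -> str:
    """Pull the Mermaid source out of a fenced code block, if present."""
    match = re.search(r"```(?:mermaid)?\s*\n(.*?)```", reply, re.DOTALL)
    return (match.group(1) if match else reply).strip()
```

The extracted Mermaid text can then be pasted into any Mermaid.js renderer or editor for hand-editing, which is the reusability benefit the post highlights.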

Alibaba's Qwen3-VL achieves 99.5% frame retrieval in two-hour videos while trailing GPT-5 by nine points on reasoning benchmarks. The pattern defines open-source vision AI: exceptional perception, persistent inference gaps.

#OpenSourceAI #VisionLanguageModels

https://www.implicator.ai/alibabas-qwen3-vl-can-find-a-single-frame-in-two-hours-of-video-the-catch-it-still-cant-outthink-gpt-5/

Reviews state-of-the-art MLLMs and highlights the challenge of expanding current models beyond the simple one-to-one image-text relationship. https://hackernoon.com/mllm-adapters-review-of-vpgs-and-multimodal-fusion #visionlanguagemodels
MLLM Adapters: Review of VPGs and Multimodal Fusion | HackerNoon

PerSense-D is a new benchmark dataset for personalized dense image segmentation, advancing AI accuracy in crowded visual environments. https://hackernoon.com/new-dataset-persense-d-enables-model-agnostic-dense-object-segmentation #visionlanguagemodels
New Dataset PerSense-D Enables Model-Agnostic Dense Object Segmentation | HackerNoon

Adaptive prompts, density maps, and VLMs are used in PerSense's training-free one-shot segmentation framework for dense image interpretation. https://hackernoon.com/persense-delivers-expert-level-instance-recognition-without-any-training #visionlanguagemodels
PerSense Delivers Expert-Level Instance Recognition Without Any Training | HackerNoon

PerSense is a model-agnostic, training-free framework for one-shot personalized instance segmentation in dense images, driven by density and vision-language cues. https://hackernoon.com/persense-a-one-shot-framework-for-personalized-segmentation-in-dense-images #visionlanguagemodels
PerSense: A One-Shot Framework for Personalized Segmentation in Dense Images | HackerNoon