🧠 How do #AI Agents like #Operator perform actions in browsers and on any graphical interface? 
👁️ This is an example of #OmniParser V2 running locally. The system processes what it "sees" on the screen and converts it into structured data that maps and classifies every element. 
⚙️ That data becomes context for an #LLM, which can then perform operations on those elements. 

#AI #GenAI #GenerativeAI #IntelligenzaArtificiale 
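The "structured data as LLM context" step above can be sketched in a few lines. This is a minimal illustration, not OmniParser's actual API: the element fields and function names are assumptions chosen to show the idea.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple

@dataclass
class UIElement:
    element_id: int
    kind: str                          # e.g. "button", "icon", "text"
    caption: str                       # functional description from a captioning model
    bbox: Tuple[int, int, int, int]    # (x1, y1, x2, y2) in pixels
    interactable: bool

def build_llm_context(elements: List[UIElement]) -> str:
    """Serialize detected elements into a JSON block the LLM can reason over."""
    return json.dumps([asdict(e) for e in elements], indent=2)

elements = [
    UIElement(0, "button", "Submits the search query", (420, 300, 510, 340), True),
    UIElement(1, "text", "Search input field", (100, 300, 410, 340), True),
]
print(build_llm_context(elements))
```

The serialized block would typically be placed in the prompt alongside the user's goal, so the model can refer to elements by ID instead of raw pixels.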

🧠 #Microsoft has released #OmniParser V2: an open-source system capable of performing actions in the user interface. 
✨ Not just in the browser: a system that uses an #LLM as a Computer Use Agent. 

🔗 Il progetto: https://github.com/microsoft/OmniParser

___ 

✉️ If you want to stay up to date on these topics, subscribe to my newsletter: https://bit.ly/newsletter-alessiopomaro 

#AI #GenAI #GenerativeAI #IntelligenzaArtificiale #LLM #AIAgent

OmniParser V2: Turning Any LLM into a Computer Use Agent - Microsoft Research

Yadong Lu, Senior Researcher; Thomas Dhome-Casanova, Software Engineer; Jianwei Yang, Principal Researcher; Ahmed Awadallah, Partner Research Manager. Graphical user interface (GUI) automation requires agents that can understand and interact with user screens. However, using general-purpose LLMs as GUI agents faces several challenges: 1) reliably identifying […]

OmniParser: an interesting open-source project, essential for AI models that need to interpret screens and user interfaces.
https://microsoft.github.io/OmniParser/
#omniparser #ai #opensource #microsoft

#OpenSourceShare Microsoft has open-sourced OmniParser, a tool that parses and recognizes interactable icons on the screen. It accurately identifies the interactable icons in a user interface and outperforms GPT-4V at parsing.

Highlights:
1. Dual recognition: it finds every clickable region on the interface and, with semantic understanding, grasps the specific function of each button or icon.

2. Plugin-style use: it can be combined with Phi-3.5-V, Llama-3.2-V, and other models.

3. Structured output: beyond recognizing on-screen elements, it converts them into structured data.

Project: github.com/microsoft/OmniParser
Website: microsoft.github.io/OmniParser

#OmniParser
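The structured records mentioned above pair each element with a bounding box. A minimal sketch of how an agent might turn such a box into a click target, assuming normalized [0, 1] coordinates (the function name and schema are illustrative, not OmniParser's real output format):

```python
from typing import Tuple

def click_target(bbox: Tuple[float, float, float, float],
                 screen_w: int, screen_h: int) -> Tuple[int, int]:
    """Return the pixel center of a normalized (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = bbox
    cx = int((x1 + x2) / 2 * screen_w)
    cy = int((y1 + y2) / 2 * screen_h)
    return cx, cy

# A settings icon near the top-right corner of a 1920x1080 screen:
print(click_target((0.90, 0.02, 0.96, 0.08), 1920, 1080))
```

Clicking the box center is a common heuristic in GUI agents; it works because detected interactable regions are usually convex and fully clickable.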

🔍 #Microsoft introduces #OmniParser, a new screen parsing module for #GUI interactions:
• Converts UI screenshots into structured elements for improved #AI agent navigation
• Works with #GPT4V to generate precise actions for interface regions
• Achieves top performance on #WindowsAgentArena benchmark

🛠️ Key Components:
• Specialized datasets for icon detection and description
• Fine-tuned detection model for identifying actionable regions
• Captioning model for extracting functional semantics

📊 Performance Highlights:
• Outperforms standard #GPT4V on #ScreenSpot benchmarks
• Compatible with #Phi35V and #Llama32V models
• Functions across PC and mobile platforms without HTML dependencies

🔗 Learn more: https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/
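The loop described above (parse the screen, let the model pick an element, execute the action) can be sketched end to end. The JSON action schema and field names below are assumptions for illustration, not OmniParser's actual protocol:

```python
import json
from typing import Dict, Tuple

# element_id -> bbox center in pixels, as produced by the parsing stage
ELEMENTS: Dict[int, Tuple[int, int]] = {
    0: (955, 320),   # "Search" button
    1: (255, 320),   # search input field
}

def execute(llm_reply: str):
    """Parse the model's JSON action and return the concrete UI step to perform."""
    action = json.loads(llm_reply)
    x, y = ELEMENTS[action["element_id"]]
    if action["type"] == "click":
        return ("click", x, y)
    if action["type"] == "type":
        return ("type", x, y, action["text"])
    raise ValueError(f"unsupported action: {action['type']}")

print(execute('{"type": "click", "element_id": 0}'))
```

Because the model only ever references element IDs, the same loop works on any platform the parser can see, with no HTML or accessibility tree required.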

OmniParser for pure vision-based GUI agent - Microsoft Research

By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains […]
