🧠 How do #AI Agents like #Operator perform actions in browsers and on any graphical interface?
👁️ This is an example of #OmniParser V2 running locally. The system processes what it "sees" on the screen and converts it into structured data that maps and classifies every element.
⚙️ That data becomes context for an #LLM, which can then perform operations on the elements.
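
As a rough illustration of that last step, the sketch below renders a list of parsed screen elements as text that could be placed in an LLM prompt. The `ScreenElement` fields and the `elements_to_context` helper are assumptions made up for this example; they are not OmniParser's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass
class ScreenElement:
    """Hypothetical record for one parsed UI element (not OmniParser's real schema)."""
    element_id: int
    kind: str          # assumed categories, e.g. "icon" or "text"
    label: str         # short description of the element
    bbox: tuple        # assumed normalized (x1, y1, x2, y2) in [0, 1]
    interactive: bool


def elements_to_context(elements):
    """Render parsed elements as one line each, ready to paste into an LLM prompt."""
    lines = []
    for el in elements:
        x1, y1, x2, y2 = el.bbox
        state = "interactive" if el.interactive else "static"
        lines.append(
            f"[{el.element_id}] {el.kind} '{el.label}' ({state}) "
            f"bbox=({x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f})"
        )
    return "\n".join(lines)


# Made-up example data, for illustration only.
elements = [
    ScreenElement(0, "icon", "Search", (0.25, 0.5, 0.5, 0.75), True),
    ScreenElement(1, "text", "Welcome back", (0.0, 0.0, 1.0, 0.25), False),
]
context = elements_to_context(elements)
print(context)
```

The LLM can then refer to an element by its ID (for example "click element 0") instead of guessing pixel coordinates from a raw screenshot.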

#AI #GenAI #GenerativeAI #IntelligenzaArtificiale 

🧠 #Microsoft has released #OmniParser V2: an open-source screen parsing system that enables actions on the user interface.
✨ Not just in the browser: it lets an #LLM work as a Computer Use Agent.

🔗 The project: https://github.com/microsoft/OmniParser

___ 

✉️ If you want to stay up to date on these topics, subscribe to my newsletter: https://bit.ly/newsletter-alessiopomaro

#AI #GenAI #GenerativeAI #IntelligenzaArtificiale #LLM #AIAgent

GitHub - microsoft/OmniParser: A simple screen parsing tool towards pure vision based GUI agent

OmniParser V2: Turning Any LLM into a Computer Use Agent - Microsoft Research

Yadong Lu, Senior Researcher; Thomas Dhome-Casanova, Software Engineer; Jianwei Yang, Principal Researcher; Ahmed Awadallah, Partner Research Manager. Graphical user interface (GUI) automation requires agents with the ability to understand and interact with user screens. However, using general-purpose LLMs to serve as GUI agents faces several challenges: 1) reliably identifying […]

Microsoft Research
OmniParser: an interesting open-source project, essential for AI models that need to interpret screens and user interfaces.
https://microsoft.github.io/OmniParser/
#omniparser #ai #opensource #microsoft

#OpenSourceSharing Microsoft has open-sourced OmniParser, a tool that parses and recognizes interactable icons on the screen. It accurately identifies the interactable icons in a user interface and outperforms GPT-4V at parsing.

Features:
1. Dual recognition capability: it finds every clickable region on the interface and, through semantic understanding, works out what a given button or icon actually does

2. Can be used as a plugin in combination with Phi-3.5-V, Llama-3.2-V, and other models

3. Supports structured output: besides recognizing on-screen elements, it converts them into structured data

Project: github.com/microsoft/OmniParser
Website: microsoft.github.io/OmniParser

#OmniParser

🔍 #Microsoft introduces #OmniParser, a new screen parsing module for #GUI interactions:
• Converts UI screenshots into structured elements for improved #AI agent navigation
• Works with #GPT4V to generate precise actions for interface regions
• Achieves top performance on #WindowsAgentArena benchmark
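
To make "actions for interface regions" concrete: once a model has picked a region, the agent still needs a point on the screen to act on. A minimal sketch, assuming regions are given as normalized `(x1, y1, x2, y2)` boxes; the `bbox_center_pixels` helper and the coordinate convention are assumptions for this example, not something the post or the project specifies.

```python
def bbox_center_pixels(bbox, screen_width, screen_height):
    """Return the pixel at the center of a normalized (x1, y1, x2, y2) box.

    Assumes coordinates are fractions of the screen size in [0, 1].
    """
    x1, y1, x2, y2 = bbox
    center_x = (x1 + x2) / 2 * screen_width
    center_y = (y1 + y2) / 2 * screen_height
    return round(center_x), round(center_y)


# Made-up region covering the lower middle of a 1920x1080 screen.
click_point = bbox_center_pixels((0.25, 0.5, 0.75, 1.0), 1920, 1080)
print(click_point)  # (960, 810)
```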

🛠️ Key Components:
• Specialized datasets for icon detection and description
• Fine-tuned detection model for identifying actionable regions
• Captioning model for extracting functional semantics
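
The components above suggest a two-stage flow: detect actionable regions, then caption each one. The sketch below wires that flow together with placeholder functions. `parse_screen`, `fake_detect` and `fake_caption` are stand-ins invented for this example; they are not the real models or their APIs.

```python
def parse_screen(screenshot, detect, caption):
    """Run a detector, then a captioner, and collect one record per region.

    `detect(screenshot)` is assumed to return a list of (x1, y1, x2, y2) boxes;
    `caption(screenshot, box)` is assumed to return a short description string.
    """
    return [
        {"id": index, "bbox": box, "description": caption(screenshot, box)}
        for index, box in enumerate(detect(screenshot))
    ]


# Placeholder stages so the sketch runs without any model weights.
def fake_detect(screenshot):
    return [(10, 10, 50, 50), (100, 20, 160, 60)]


def fake_caption(screenshot, box):
    return f"element at x={box[0]}"


parsed = parse_screen("screenshot.png", fake_detect, fake_caption)
for record in parsed:
    print(record)
```

Passing the two stages in as functions keeps the sketch independent of any particular detection or captioning model.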

📊 Performance Highlights:
• Outperforms standard #GPT4V on #ScreenSpot benchmarks
• Compatible with #Phi35V and #Llama32V models
• Functions across PC and mobile platforms without HTML dependencies

🔗 Learn more: https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/

OmniParser for pure vision-based GUI agent - Microsoft Research

By Yadong Lu, Senior Researcher; Jianwei Yang, Principal Researcher; Yelong Shen, Principal Research Manager; Ahmed Awadallah, Partner Research Manager. Recent advancements in large vision-language models (VLMs), such as GPT-4V and GPT-4o, have demonstrated considerable promise in driving intelligent agent systems that operate within user interfaces (UI). However, the full potential of these multimodal models remains […]

Microsoft Research