Benchmarked 8 VLMs for screenshot grounding inside a browser agent.
Counterintuitive finding: Qwen 3.5-9B correctly classifies a dropdown affordance that the 308B MiMo V2.5 misses. Affordance recognition doesn't simply scale with parameter count.
Only 1 of 8 (Qwen 3.6-35B-A3B) honestly flags its uncertainty in the calibration test.
Full write-up + VRAM tier recs + repo:
#LocalLLM #VLM #AIAgents #OpenSource #Qwen #qwen36 #MachineLearning #FOSS #AI

