Google for Developers (@googledevs)

The latest update to Gemini API File Search strengthens its multimodal RAG capabilities: it processes images and text natively, speeds up retrieval with custom metadata, and supports page-level citations for more precise grounding.

https://x.com/googledevs/status/2051728211105493194

#gemini #api #filesearch #rag #multimodal

Google for Developers (@googledevs) on X

Give applications a photographic memory with the latest updates to Gemini API File Search 🧠🖼️ New features enable more precise multimodal RAG and include: ✅ Native image and text processing ✅ Custom metadata for faster retrieval ✅ Page-level citations for precise grounding

X (formerly Twitter)
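
As a rough sketch of how those three features surface through the API (not an official snippet; the calls follow the documented google-genai Python SDK, and the store name, file path, and "team" metadata key are placeholders):

```python
# Sketch of the File Search flow: store creation, custom metadata,
# metadata-filtered retrieval, and grounded citations. Field names are
# based on the documented google-genai SDK and may differ by version.
import time

from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# 1. Create a store and index a document with custom metadata.
store = client.file_search_stores.create(config={"display_name": "product-docs"})
op = client.file_search_stores.upload_to_file_search_store(
    file="manual.pdf",  # placeholder path
    file_search_store_name=store.name,
    config={"custom_metadata": [{"key": "team", "string_value": "docs"}]},
)
while not op.done:  # indexing runs asynchronously; poll the operation
    time.sleep(5)
    op = client.operations.get(op)

# 2. Ask a question, restricting retrieval with a metadata filter.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How do I reset the device?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(file_search=types.FileSearch(
            file_search_store_names=[store.name],
            metadata_filter="team=docs",  # narrows the search space
        ))]
    ),
)
print(response.text)
# Page-level citations arrive as grounding metadata on the candidate.
print(response.candidates[0].grounding_metadata)
```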

HackerNewsTop5 (@hackernewstop5)

GLM-5V-Turbo has been released, a research effort aiming at a native foundation model for multimodal agents. It is a notable announcement for next-generation multimodal AI models that handle mixed inputs such as images and text.

https://x.com/hackernewstop5/status/2051739491476615235

#glm5vturbo #multimodal #foundationmodel #llm #research

HackerNewsTop5 (@hackernewstop5) on X

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents #HackerNews https://t.co/uUa9da5eVN

X (formerly Twitter)

🤖 Another day, another incomprehensible jargon-filled soup about how machines are getting better at understanding pictures and words. Who knew that blending buzzwords with indecipherable acronyms could make #AI sound like it just discovered fire? 🔥 Let's all pretend we're not terrified by the impending takeover of our #multimodal #overlords. 🚀

https://arxiv.org/abs/2604.26752 #Revolution #Jargon #Overload #TechTrends #HackerNews #ngated

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

arXiv.org

Ilir Aliu (@IlirAliu_)

A robotics demo in which FRANKA Robotics' FR3 Duo is teleoperated via GELLO Duo to collect multimodal training data for bimanual tasks. The key point is how a robot-learning data pipeline is built in a real-world setting, including tactile and torque signals.

https://x.com/IlirAliu_/status/2050924083035209746

#robotics #teleoperation #multimodal #bimanual #tactile

Ilir Aliu (@IlirAliu_) on X

If you’re entering robotics and don’t know where to start... Follow the data: Last week I saw it demonstrated live. @FRANKAROBOTICS showed how FR3 Duo is teleoperated via GELLO Duo to capture multimodal training data for bimanual tasks, including tactile torque signals. That

X (formerly Twitter)
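
The pipeline the post describes (synchronized camera, torque, and tactile streams captured alongside the teleop commands) is, at its core, a timestamped multi-stream recorder. A hypothetical sketch, not FRANKA's or GELLO's actual software; the read_* stubs stand in for real driver calls:

```python
# Hypothetical multimodal episode recorder for bimanual teleoperation.
# All read_* functions are stubs standing in for real sensor drivers.
import time
from dataclasses import dataclass, field

def read_camera() -> bytes: return b""                    # stub: encoded frame
def read_joint_torques() -> list[float]: return [0.0] * 14  # stub: two 7-DoF arms
def read_tactile() -> list[float]: return [0.0] * 32      # stub: fingertip sensors
def read_leader_command() -> list[float]: return [0.0] * 14  # stub: leader-arm pose

@dataclass
class Step:
    t: float              # wall-clock timestamp (s)
    image: bytes          # camera frame
    torques: list[float]  # per-joint torques, both arms
    tactile: list[float]  # tactile readings
    action: list[float]   # teleop command the policy will learn to imitate

@dataclass
class Episode:
    task: str
    steps: list[Step] = field(default_factory=list)

def record_episode(task: str, hz: float = 30.0, seconds: float = 2.0) -> Episode:
    """Sample every stream at one fixed rate so modalities stay aligned."""
    ep, period = Episode(task), 1.0 / hz
    t_end = time.time() + seconds
    while time.time() < t_end:
        ep.steps.append(Step(time.time(), read_camera(), read_joint_torques(),
                             read_tactile(), read_leader_command()))
        time.sleep(period)
    return ep

ep = record_episode("fold towel")
print(len(ep.steps), "steps recorded")
```

Sampling all modalities in one loop at a fixed rate is the simplest way to keep frames, torques, and actions aligned for downstream imitation learning.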

Nikolaus West (@NikolausWest)

Argues that you must understand video compression if you are serious about robot learning. Camera streams account for over 90% of most datasets, so even though video is more complicated to handle, the storage savings from compression are too large to give up.

https://x.com/NikolausWest/status/2050907496819122189

#robotlearning #videocompression #datasets #robotics #multimodal

Nikolaus West (@NikolausWest) on X

If you’re serious about robot learning you (unfortunately) need to know about video compression. Camera streams dominate data volumes for most datasets at 90+% even when compressed. Video is more complicated to deal with but the size wins are too big to give up. The unit of

X (formerly Twitter)
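
The claim is easy to sanity-check with back-of-the-envelope arithmetic. A small sketch with assumed values (four 640x480 cameras at 30 fps, roughly 4 Mb/s of H.264 per stream), not figures from the post:

```python
# Back-of-the-envelope: raw vs. compressed camera storage for robot logs.
# Camera count, resolution, frame rate, and bitrate are assumed values.
cams, width, height, fps = 4, 640, 480, 30

raw_bytes_per_sec = cams * width * height * 3 * fps   # 8-bit RGB frames
compressed_bytes_per_sec = cams * 4e6 / 8             # ~4 Mb/s H.264 per camera

hour = 3600
print(f"raw:        {raw_bytes_per_sec * hour / 1e9:6.1f} GB/hour")        # ~398 GB
print(f"compressed: {compressed_bytes_per_sec * hour / 1e9:6.1f} GB/hour") # ~7 GB
print(f"ratio:      {raw_bytes_per_sec / compressed_bytes_per_sec:6.1f}x") # ~55x
```

Even compressed, a couple of megabytes per second of video dwarfs the few kilobytes per second that joint states or tactile streams typically need, which is how cameras end up at 90+% of the total.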

Meet SenseNova-U1, an open-source multimodal model that handles standard visual question answering, document parsing, chart comprehension, OCR, and agentic visual tasks. Feed it a screenshot, a PDF, or a handwritten note, and it processes all of them in the same model without switching modes.
On the generation side it does text-to-image, image editing, and native interleaved image and text generation.
https://firethering.com/sensenova-u1-open-source-multimodal-ai/

#ai #llms #genai #opensource #generativeai #multimodal

SenseNova-U1: Open Source AI That Understands and Generates Images in One Model - Firethering

Most multimodal models are text models with image handling bolted on. A vision encoder reads the image, converts it into tokens the language model understands, and the two systems communicate through that translation layer. It works. It's also where things break down when text and image content need to stay tightly in sync. SenseNova-U1 takes a different approach. Released by SenseTime under Apache 2.0, it removes the visual encoder and VAE entirely. No translation layer or separate systems. Pixel and word information modeled together from the start. The technical report isn't out yet and the A3B variant is still pending. But the 8B weights are available now.

Firethering
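
The "no separate vision tower" point is the interesting part. Below is a toy illustration of that idea, not SenseNova-U1's actual architecture (the technical report is not out yet): a single transformer consuming raw pixel patches and text tokens in one sequence, with pixels projected straight into the shared embedding width instead of passing through a pretrained vision encoder or VAE.

```python
# Toy illustration of single-model interleaved image+text processing.
# NOT SenseNova-U1's architecture; positional encodings and causal
# masking are omitted to keep the sketch minimal.
import torch
import torch.nn as nn

class UnifiedDecoder(nn.Module):
    def __init__(self, vocab=32000, d=512, patch=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d)
        # Pixels map directly into the shared width: no vision tower,
        # no VAE latent space, no translation layer in between.
        self.patch_embed = nn.Linear(patch * patch * 3, d)
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, text_ids, patches):
        # text_ids: (B, T) token ids; patches: (B, P, patch*patch*3) pixels
        seq = torch.cat([self.patch_embed(patches),
                         self.text_embed(text_ids)], dim=1)
        return self.lm_head(self.backbone(seq))

model = UnifiedDecoder()
ids = torch.randint(0, 32000, (1, 8))   # dummy text tokens
px = torch.rand(1, 4, 16 * 16 * 3)      # four dummy 16x16 RGB patches
print(model(ids, px).shape)             # torch.Size([1, 12, 32000])
```

Because both modalities live in one sequence from the first layer, image and text content can stay in sync during generation, which is exactly where the encoder-plus-translation-layer designs tend to break down.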

Derya Unutmaz, MD (@DeryaTR_)

A tweet calling a health AI model that integrates multimodal data for prediction outstanding. It notes that this kind of data integration is exactly what the BioAI field needs, pointing to AI applications in medicine and the life sciences.

https://x.com/DeryaTR_/status/2050953793827717582

#healthai #multimodal #bioai #medicalai

Derya Unutmaz, MD (@DeryaTR_) on X

This is an outstanding predictive health AI model! This sort of multi-modal data integration is what we really need for BioAI.

X (formerly Twitter)

Linus ✦ Ekenstam (@LinusEkenstam)

You must try this.

GPT Image-2 can do PALM reading and I’m so here for it.

Full prompt below

https://x.com/LinusEkenstam/status/2048426035541135437

#gptimage2 #palmreading #multimodal #prompting #ai

Linus ✦ Ekenstam (@LinusEkenstam) on X

You must try this. GPT Image-2 can do PALM reading and I’m so here for it. Full prompt below ⤵️

X (formerly Twitter)