Mastodawn

Building low-latency voice agents in 3 lines of code with GPT Realtime 2 and AG2

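AG2 has released LiveAgent for building low-latency voice agents with GPT Realtime 2. LiveAgent supports continuous audio input and output and voice activity detection within a single bidirectional session, enabling a natural, phone-call-like conversational flow with immediate interruption. It can be set up in three lines of code, and tool calls and sub-agent delegation are also supported within the real-time voice session, so it can handle complex tasks. It supports OpenAI and Gemini models and offers lower latency and a more natural conversational experience than the conventional STT-agent-TTS approach.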

https://docs.ag2.ai/latest/docs/blog/2026/05/12/LiveAgent/

#openai #gptrealtime2 #voiceagent #ag2 #realtimeai

AG2 - Building low-latency voice agents in 3 lines of code with GPT Realtime 2

A programming framework for agentic AI

sayzard 4d ago

Interaction models by Thinking Machines Lab [video]

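Thinking Machines Lab has introduced 'interaction models', a new kind of AI designed for real-time collaboration. The models can listen, speak, see, show, and think simultaneously, as a person does, with the goal of natural collaboration between AI and humans. They are optimized for real-time interaction; a detailed technical report is available on the official blog.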

https://www.youtube.com/watch?v=A12AVongNN4

#interactionmodels #realtimeai #aicollaboration #thinkingmachineslab

Introducing interaction models | Thinking Machines Lab

YouTube

sayzard May 9

I made Meta's TRIBE v2 watch YouTube in real time

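An AI application built on Meta's TRIBE v2 brain-simulation system watches and reacts to YouTube videos in real time. It combines screen recognition, a voice model, and an animated avatar to understand video content, converse, and react as the video plays. The developer completed this experimental project in a few days with AI assistance, and it is drawing attention as an example of creative uses of AI.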

https://www.youtube.com/watch?v=I4oGPLMVoC0

#meta #tribev2 #brainsimulation #realtimeai #creativeai

I Made Real Brain Data React to YouTube...

YouTube

sayzard May 2

Google for Developers (@googledevs)

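The post says that integrating Gemini 3.1 Flash Live into a real-time voice and video stack yields speech-to-speech conversational agents with multilingual switching, sub-second latency, and a single Gemini Live call in place of a pipeline chaining multiple models.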

https://x.com/googledevs/status/2050319352105251261

#gemini #googleai #realtimeai #voiceai #multimodal

Google for Developers (@googledevs) on X

What happens when you integrate Gemini 3.1 Flash Live into a real-time voice and video stack? You get conversational agents with: 🎙️ Speech-to-speech with multilingual switching ⚡ Sub-second latency 🔧 Single Gemini Live call instead of a multi-model pipeline 🤖 Real world

X (formerly Twitter)

sayzard May 1

Google for Developers (@googledevs)

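The post presents the benefits of integrating Gemini 3.1 Flash Live into a real-time voice and video stack. It highlights conversational agents with speech-to-speech and multilingual switching, sub-second latency, and a single Gemini Live call in place of a multi-model pipeline.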

https://x.com/googledevs/status/2050319352105251261

#gemini #realtimeai #voiceai #videoai #agents

Google for Developers (@googledevs) on X

X (formerly Twitter)

sayzard Apr 17

田中義弘 | taziku CEO / AI × Creative (@taziku_co)

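The post introduces DecartAI's Lucy2.1, which delivers sharpness, lip sync, natural lighting, and complex motion in real-time video generation. It stresses that the video-generation paradigm may shift from waiting for results to interacting with them directly.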

https://x.com/taziku_co/status/2044989847379824859

#videogeneration #realtimeai #multimodal #aivideo #decartai

田中義弘 | taziku CEO / AI × Creative (@taziku_co) on X

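When high-quality video runs in real time, video generation changes from something you wait for into something you touch. @DecartAI's Lucy2.1 delivers sharpness, lip sync, natural light, and complex motion in real time. If generation proceeds in real time, video becomes a UI rather than a generated artifact.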

X (formerly Twitter)

Marcus Schuler Apr 16

DeepL launched voice translation for Zoom, Teams, and mobile conversations, but its speech-to-text-to-speech pipeline introduces a delay of one to two sentences. Translation quality remains strong, but the latency shows why live conversation AI is fundamentally different from document translation. Multiple competitors already offer similar features.

#VoiceTranslation #AITranslation #RealTimeAI

https://www.implicator.ai/deepl-adds-voice-translation-but-the-delay-is-the-product/

DeepL Voice Translation Runs Into the Delay Problem

DeepL is turning text translation into spoken translation for Zoom, Microsoft Teams, contact centers, and frontline teams. The hard part is not saying "voice-to-voice." It is beating the delay that appears when speech still has to become text before it becomes speech again.

Implicator.ai

Hacker News Apr 6

Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

https://github.com/fikrikarim/parlor

#HackerNews #RealTimeAI #AudioVideo #M3Pro #GemmaE2B

GitHub - fikrikarim/parlor: On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine. Powered by Gemma 4 E2B and Kokoro.

GitHub

sayzard Apr 1

Google for Developers (@googledevs)

The post introduces how to build a real-time voice agent with Gemini 3.1 Flash Live and Stream's Vision Agents SDK. It includes a hands-on guide for going from early access to a workflow that orchestrates multiple steps.

https://x.com/googledevs/status/2039115523619697086

#gemini #voiceagent #visionsdk #stream #realtimeai

Google for Developers (@googledevs) on X

Build a real-time voice agent with Gemini 3.1 Flash Live and Stream's Vision Agents SDK using Stefan Blos’s walkthrough to move from early access to a fully orchestrated multi-step workflow. What’s covered: ✨ Setting up the Vision Agents SDK with the Gemini plugin ✨ Defining

X (formerly Twitter)

sayzard Mar 23

Mark Gadala-Maria (@markgadala)

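The post presents a live demo in which AI generates a multiplayer game in real time while it is being played. Gameplay and generation happen simultaneously, making it an impressive example of what real-time AI-based games could become.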

https://x.com/markgadala/status/2035912668452639079

#aigaming #realtimeai #generativeai #demo #multiplayer

Mark Gadala-Maria (@markgadala) on X

Whoa. This blew my mind. This is a live demo of a multiplayer game being created by AI live as it’s being played. The future of gaming is going to be realtime AI based.

X (formerly Twitter)