Artificial Analysis (@ArtificialAnlys)
NVIDIA has released Nemotron 3 VoiceChat, a roughly 12B-parameter speech-to-speech (S2S) conversational model distributed as open weights. The post says it leads the open-weights Pareto frontier between conversational dynamics and speech reasoning, and adds that evaluating speech-to-speech model performance is multidimensional.
https://x.com/ArtificialAnlys/status/2033642073052868861
#nvidia #nemotron #speechtospeech #voicechat

Artificial Analysis (@ArtificialAnlys) on X
NVIDIA has released Nemotron 3 VoiceChat! A ~12B parameter Speech to Speech model that leads our open weights Conversational Dynamics vs. Speech Reasoning pareto frontier
Understanding Speech to Speech model performance is multidimensional - two key and distinct dimensions are
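
The "Pareto frontier" framing means a model leads if no other open-weights model beats it on both conversational dynamics and speech reasoning at once. A minimal sketch of that test, using hypothetical model names and scores rather than Artificial Analysis' actual leaderboard data:

```python
from typing import Dict, List, Tuple

def pareto_frontier(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the models that no other model beats on both dimensions at once."""
    frontier = []
    for name, (x, y) in scores.items():
        dominated = any(
            ox >= x and oy >= y and (ox, oy) != (x, y)
            for other, (ox, oy) in scores.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical (conversational dynamics, speech reasoning) scores, 0-100.
scores = {
    "model-a": (72.0, 55.0),
    "model-b": (64.0, 61.0),
    "model-c": (58.0, 58.0),  # beaten by model-b on both axes
}
print(pareto_frontier(scores))  # ['model-a', 'model-b']
```

A model can therefore "lead" the frontier without topping either axis individually, which is why the post stresses that S2S performance is multidimensional.
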
NVIDIA PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Native Swift with MLX
What if you could talk to your laptop and it talked back — not through a three-step pipeline of transcribe-think-synthesize, but as a…
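
"Full-duplex" here means the model listens and speaks at the same time instead of taking turns. A toy sketch of that control flow in Python, with hypothetical stand-ins rather than the article's actual Swift/MLX code:

```python
import asyncio

async def user_audio(inbox: asyncio.Queue) -> None:
    """Stand-in for the microphone: the user keeps talking the whole time."""
    for chunk in ["so", "I", "was", "thinking"]:
        await inbox.put(chunk)
        await asyncio.sleep(0.02)
    await inbox.put(None)  # end of stream

async def model(inbox: asyncio.Queue) -> None:
    """Stand-in for the model: it keeps speaking while still ingesting input."""
    done = False
    while not done:
        try:
            latest = inbox.get_nowait()  # ingest without blocking our own output
            done = latest is None
        except asyncio.QueueEmpty:
            latest = None
        print(f"speaking... (latest input chunk: {latest!r})")
        await asyncio.sleep(0.01)

async def main() -> None:
    inbox: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(user_audio(inbox), model(inbox))  # both run at once

asyncio.run(main())
```
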
Python Trending (@pythontrending)
A tweet introducing a project (or initiative) called Speech To Speech: an open-source, modular attempt to implement a 'GPT4-o'-class model as a speech-to-speech system. The stated goals are developing an open-source voice conversion and dialogue system and expanding speech applications of GPT4-type models.
https://x.com/pythontrending/status/2020500372075302935
#speechtospeech #gpt4o #opensource #tts

Python Trending 🇺🇦 (@pythontrending) on X
speech-to-speech - Speech To Speech: an effort for an open-sourced and modular GPT4-o https://t.co/RvmH2dsQCb
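
The "modular" design such an effort implies is usually a cascade of swappable stages. A minimal sketch of that shape, with hypothetical interface names rather than the project's actual API:

```python
from abc import ABC, abstractmethod

class STT(ABC):
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class LLM(ABC):
    @abstractmethod
    def reply(self, text: str) -> str: ...

class TTS(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes: ...

class EchoSTT(STT):
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # stand-in: treat bytes as UTF-8 text

class UppercaseLLM(LLM):
    def reply(self, text: str) -> str:
        return text.upper()           # stand-in for a real language model

class BytesTTS(TTS):
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")   # stand-in for a real vocoder

def speech_to_speech(audio: bytes, stt: STT, llm: LLM, tts: TTS) -> bytes:
    """The cascade: each stage hands plain text to the next."""
    return tts.synthesize(llm.reply(stt.transcribe(audio)))

print(speech_to_speech(b"hello there", EchoSTT(), UppercaseLLM(), BytesTTS()))
```

Each stage can be replaced independently (a different STT backend, a local LLM, another vocoder) as long as it honors the plain-text handoff; that handoff is exactly what native speech-to-speech models remove.
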
Rohan Paul (@rohanpaul_ai)
FlashLabs (@flashlabsdotai) has released Chroma, an open-source native speech-to-speech model. Chroma processes audio tokens directly, reasoning and speaking in audio within a single loop instead of the conventional split ASR→LLM→TTS pipeline, and is said to be powered by a dual-layer RAG. This is an important step for autonomous voice agents and real-time speech processing.
https://x.com/rohanpaul_ai/status/2013999190058369044
#speechtospeech #opensource #audiollm #rag #flashlabs

Rohan Paul (@rohanpaul_ai) on X
Another great news for autonomous voice agents
@flashlabsdotai launched Chroma, an open source native speech-to-speech model that processes audio tokens directly, so there is no ASR to LLM to TTS handoff. It reasons and speaks in audio in one loop.
Powered by a dual-layer RAG
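
The contrast with a cascaded pipeline is that a native model never leaves token space: there is no text handoff between stages. A conceptual sketch of that single loop (a toy stand-in, not Chroma's actual code):

```python
from typing import Iterable, Iterator

VOCAB_SIZE = 1024  # hypothetical audio-codec vocabulary size

def toy_next_token(context: list[int]) -> int:
    """Stand-in for a decoder's next-token prediction over audio tokens."""
    return (context[-1] * 31 + 7) % VOCAB_SIZE

def native_s2s(input_tokens: Iterable[int], max_new: int = 8) -> Iterator[int]:
    """One loop: audio tokens in, audio tokens out, no text in between."""
    context = list(input_tokens)
    for _ in range(max_new):
        nxt = toy_next_token(context)
        context.append(nxt)
        yield nxt  # streamed to an audio decoder in a real system

print(list(native_s2s([17, 93, 402])))
```
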
NVIDIA just released a demo of PersonaPlex: a speech-to-speech model that can be steered via a system prompt, opening up voice customization in AI. Interesting for developers and audio researchers. #AI #Nvidia #PersonaPlex #CôngNghệ #SpeechToSpeech #AI_VN
https://www.reddit.com/r/LocalLLaMA/comments/1qgcm6x/demo_for_the_latest_personaplex_model_from_nvidia/

High-Fidelity Simultaneous Speech-To-Speech Translation
We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation, which, unlike its consecutive counterpart where one waits for the end of the source utterance to start translating, must adapt its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.
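
The weakly-supervised step can be read as finding each target word's earliest safe emission point: scan growing source prefixes until an off-the-shelf translation model finds the word predictable enough, and record that prefix length as the word's delay. A minimal sketch under that reading, where score_word is a hypothetical stand-in for querying the model's log-probabilities:

```python
import math
from typing import Callable, List

def find_delays(
    source_words: List[str],
    target_words: List[str],
    score_word: Callable[[List[str], List[str], str], float],
    threshold: float = math.log(0.5),
) -> List[int]:
    """delays[j] = source words that must be read before emitting target_words[j]."""
    delays = []
    start = 1  # delays are monotone: never emit earlier than the previous word
    for j, word in enumerate(target_words):
        delay = len(source_words)  # fall back to waiting for the full source
        for k in range(start, len(source_words) + 1):
            # score_word returns a log-probability for `word` given the
            # source prefix read so far and the target words already emitted
            if score_word(source_words[:k], target_words[:j], word) >= threshold:
                delay = k
                break
        delays.append(delay)
        start = delay
    return delays

# Toy scorer standing in for an off-the-shelf MT model: a word becomes
# predictable once its (hypothetical) aligned source word has been read.
alignment = {"hello": 1, "world": 3}
def toy_score(src_prefix: List[str], tgt_prefix: List[str], word: str) -> float:
    return 0.0 if alignment.get(word, 10**9) <= len(src_prefix) else -10.0

print(find_delays(["bonjour", "le", "monde"], ["hello", "world"], toy_score))
# -> [1, 3]
```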