Angry Tom (@AngryTomtweets)
Launch announcement for Higgsfield Audio: a product-launch tweet introducing its key features, including voice generation from text (21 voice presets), replacing a video's audio with perfectly lip-synced speech, and voice translation across 10 built-in languages.
Google Translate Unlocks Gemini AI Live Speech Translations for All Android Users
#AI #Google #Android #GeminiAI #GoogleTranslate #LiveTranslate #GenAI #LanguageLearning #EdTech #SpeechTranslation #RealTimeTranslation #Alphabet #BigTech
A real-time language translation project is running into difficulties. The current system has high latency and cannot reach the 3-second latency of commercial systems. #LanguageTranslation #RealTime #SpeechTranslation #RealTimeTranslation #AI #ArtificialIntelligence
Our pick of the week by @sarapapi: "Consistent Transcription and Translation of Speech" by Sperber et al., 2020 TACL.
https://arxiv.org/pdf/2007.12741.pdf
#NLProc #NLP #speech #translation #speechtranslation #consistency #consistent
#AI will only take over from human #interpreters when it stops doing what most human interpreters say they do.
End-to-end Speech Translation (E2E ST) aims to directly translate source speech into target text. Existing ST methods perform poorly when only extremely small speech-text data are available for training. We observe that an ST model's performance closely correlates with its embedding similarity between speech and source transcript. In this paper, we propose Word-Aligned COntrastive learning (WACO), a simple and effective method for extremely low-resource speech-to-text translation. Our key idea is bridging word-level representations for both speech and text modalities via contrastive learning. We evaluate WACO and other methods on the MuST-C dataset, a widely used ST benchmark, and on a low-resource direction Maltese-English from IWSLT 2023. Our experiments demonstrate that WACO outperforms the best baseline by 9+ BLEU points with only 1-hour parallel ST data. Code is available at https://github.com/owaski/WACO.
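The core idea in WACO, pooling speech frames into word-level vectors and pulling them toward the matching text-word embeddings with a contrastive (InfoNCE-style) loss, can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation (see their repo for that); the function names and the assumption that word-to-frame alignments are already given are mine.

```python
import numpy as np

def word_pool(frame_embs, word_spans):
    """Average-pool speech frame embeddings into one vector per word.

    frame_embs: (n_frames, dim) array of speech encoder outputs.
    word_spans: list of (start, end) frame indices, one pair per word,
                assumed to come from a forced alignment.
    """
    return np.stack([frame_embs[s:e].mean(axis=0) for s, e in word_spans])

def word_contrastive_loss(speech_words, text_words, temperature=0.1):
    """InfoNCE loss treating the i-th text word as the positive for the
    i-th speech word and all other words in the batch as negatives."""
    s = speech_words / np.linalg.norm(speech_words, axis=1, keepdims=True)
    t = text_words / np.linalg.norm(text_words, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature          # cosine similarities, scaled
    # log-softmax over candidate text words for each speech word
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When speech and text word embeddings already coincide, the diagonal dominates the similarity matrix and the loss is small; training drives the two modalities toward that state, which is the embedding-similarity correlation the abstract points to.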
Our Pick of the week: Phuong-Hang Le et al., "Pre-training for Speech Translation: CTC Meets Optimal Transport"
by @mgaido91
The gap between speech and text modalities is a major challenge in speech-to-text translation (ST). Different methods have been proposed to reduce this gap, but most of them require architectural changes in ST training. In this work, we propose to mitigate this issue at the pre-training stage, requiring no change in the ST model. First, we show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design. We provide a quantitative comparison with the more common cross-entropy loss, showing that pre-training with CTC consistently achieves better final ST accuracy. Nevertheless, CTC is only a partial solution and thus, in our second contribution, we propose a novel pre-training method combining CTC and optimal transport to further reduce this gap. Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space. Extensive experiments on the standard CoVoST-2 and MuST-C datasets show that our pre-training method applied to the vanilla encoder-decoder Transformer achieves state-of-the-art performance under the no-external-data setting, and performs on par with recent strong multi-task learning systems trained with external data. Finally, our method can also be applied on top of these multi-task systems, leading to further improvements for these models.
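The optimal-transport component of this pre-training objective measures how far apart the acoustic and textual encoder outputs are as point clouds, rather than position by position. A standard way to compute an entropic-regularized approximation of that Wasserstein cost is Sinkhorn iteration; the toy numpy sketch below is my own illustration of that general technique under uniform marginals, not the paper's actual training code.

```python
import numpy as np

def pairwise_sq_dists(x, y):
    """Squared Euclidean cost matrix between two sets of vectors."""
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

def sinkhorn_cost(cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT cost (Sinkhorn-Knopp) between uniform
    marginals over the rows and columns of `cost`.

    cost: (n, m) non-negative cost matrix, e.g. between acoustic and
          textual encoder states. Returns the transport cost <P, cost>.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass on acoustic states
    b = np.full(m, 1.0 / m)   # uniform mass on textual states
    K = np.exp(-cost / reg)   # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iters):  # alternate scaling to match both marginals
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]   # approximate transport plan
    return float((P * cost).sum())
```

Minimizing this quantity pulls the two encoders' representations close in Wasserstein space, which is what the Siamese pre-training stage does; identical point clouds yield a near-zero cost, while a systematic offset between modalities yields a larger one.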