Mastodawn

CausticHarmony Jul 22

Sumana Harihareswara

A few days ago I came across this amazing bug report against Whisper, the LLM-based transcription tool: 'Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic which translates as "Translation by Nancy Qunqar"'

https://github.com/openai/whisper/discussions/2608

More about the problem in this thread https://xcancel.com/SheriefFYI/status/1756694564109423035

Thanks for the pointer, @itamarst (in https://hachyderm.io/@itamarst/114874924215640058 )

Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic which translates as "Translation by Nancy Qunqar" · openai whisper · Discussion #2608

If you generate complete silence in a wav file and run whisper on it, it will always hallucinate the same thing:
ffmpeg -f lavfi -i anullsrc=r=44100:cl=stereo -t 30 silence.wav
whisper ./silence.wav...

GitHub
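
For anyone who wants to reproduce the report without ffmpeg, here is a minimal Python sketch that writes the same kind of all-zero WAV file using only the standard library. The helper names `write_silence` and `transcribe` are made up for illustration, and the `"base"` model name is an assumption (the report does not say which model was used). The transcription step needs the `openai-whisper` package, so it is imported lazily and left for you to call yourself.

```python
import wave


def write_silence(path, seconds=30, rate=44100, channels=2):
    """Write a WAV file that contains only zero-valued 16-bit PCM samples."""
    frames = seconds * rate
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (frames * channels * 2))
    return path


def transcribe(path, model_name="base"):
    """Run Whisper on a file (assumes `pip install openai-whisper`).

    The model name is an assumption; what comes back for silence
    depends on the model and language settings.
    """
    import whisper  # imported lazily so write_silence works without it

    model = whisper.load_model(model_name)
    return model.transcribe(path)["text"]


# Usage (the second line downloads a model and needs ffmpeg installed):
#   write_silence("silence.wav")
#   print(transcribe("silence.wav"))
```

Whatever text comes back for a file like this is invented by the model, since the audio contains no speech at all.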

Sumana Harihareswara Jul 22

I noticed that the GitHub issue now has a bunch of very recent comments, and figured out the reason: @Edent posted a link to Hacker News, where it's currently #1 on the front page (and has been syndicated to various social media bots accordingly)

https://news.ycombinator.com/item?id=44643564 Comments include examples from Czech, Russian, Welsh, Japanese, Turkish, Mandarin....

Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic | Hacker News

Travis F W Jul 22

@brainwane @itamarst I always get, "Thanks for watching!"

Echo Pair Jul 22

@brainwane hold on a sec I need to contact John Cage via seance about this

Bill, organizer of stuff Jul 22

@brainwane @itamarst Looks like they stole the translations used for training from here:

https://lyricstranslate.com/en/translator/nancy-qunqar

From her:
https://www.instagram.com/nancyrk/

Nancy Qunqar | Lyrics Translate

Nancy Qunqar - 14 translations, 1 song, 155 thanks received, 11 translation re

Jack Yan (甄爵恩) Jul 22

@wcbdata @brainwane @itamarst We know where they stole the German training data from, too! (From the original link, 'Untertitelung des ZDF für funk, 2017.')

@wcbdata @brainwane @itamarst wouldn't it make more sense to train translation on some international documents or, I don't know, EU laws, which are the best 1:1 translations out there (because law is petty)?

Sumana Harihareswara Jul 22

I think possibly you have misunderstood @wcbdata a little bit; the relevant training dataset that's leading to *this* problem is, if I understand correctly, a combination of audio recordings of human speech in Arabic and human-made Arabic captions/transcripts of that speech, *not* Arabic translations of speech or of written documents from other languages. https://xcancel.com/SheriefFYI/status/1756694204867355041 has more on this.

Sumana Harihareswara Jul 22

This is a problem in Whisper's *transcription* of Arabic-language audio, not a problem with Whisper's translation between languages.

@wcbdata @itamarst

@brainwane @wcbdata @itamarst aaah, thx, now I got it :)

@brainwane @itamarst I used Whisper's large model yesterday to transcribe a few interviews in Swedish. It's a bit picky about file formats, and when one wasn't to its liking it said:

"Svensktextning.nu Svensktextning.nu Svensktextning.nu Svensktextning.nu"

Probably a related issue, or the same one.

Morten Grøftehauge Jul 22

@haagen @brainwane @itamarst If it is legal to upload them you could try https://goodtape.io/

Good Tape — Fast, secure and accurate transcription

Good Tape is an automatic transcription service that makes it easy for journalists (and others) to turn audio recordings into text, regardless of language or sound quality. We save you time and effort so you can focus on what really matters.

Sumana Harihareswara Jul 22

https://goodtape.io/blog/help-center/ The Troubleshooting section mentions "Random/repeated words in transcripts", which suggests that Good Tape has the same problems (which makes sense, since it is probably using Whisper or something similar)

@haagen @itamarst

@drgroftehauge @brainwane @itamarst I solved my problem by "simply" converting the audio file to a different format and a different encoding.

Morten Grøftehauge Jul 22

@haagen @brainwane @itamarst Ez pez 😭

FlohEinstein Jul 22

Bug report about Whisper, the LLM-based transcription tool:
Apparently, silence at the end is transcribed in German as

"Untertitelung des ZDF für funk, 2017."

https://github.com/openai/whisper/discussions/2608#discussioncomment-13790984

Quinn Comendant Jul 22

@brainwane @itamarst At first I thought of these as “Easter eggs” but really it’s more like summoning ghosts.

Here’s another discussion with more anecdotes: https://github.com/openai/whisper/discussions/928

Dataset bias ("❤️ Translated by Amara.org Community") · openai whisper · Discussion #928

Hello, I noticed multiples biases using whisper. For example, it sometimes outputs (in french) ❤️ Translated by Amara.org Community as I guess it was used video subtitles by Amara. There are also l...

GitHub

Sumana Harihareswara Jul 22

@com @itamarst wow!

"Suggestion: don't use train your language models on Ice Skating videos, or classical music will trigger transcription of figure skating notations. :-)"

from https://github.com/openai/whisper/discussions/928#discussioncomment-13759269

Sumana Harihareswara Jul 22

Evocative:

"There are many solutions and workarounds people are working on to prevent hallucinations, and most of them involve just cutting out the silent parts of the audio so that Whisper is never tempted by the silence"

TEMPTED BY THE SILENCE
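
The "cut out the silent parts" workaround quoted above can be sketched roughly as follows. This is a crude energy threshold over fixed-size chunks of a 16-bit WAV file, not a real voice activity detector, and the names (`drop_silent_chunks`, `trim_wav`) and the default `chunk_ms` and `threshold` values are arbitrary choices for illustration rather than anything taken from the linked discussions.

```python
import array
import math
import sys
import wave


def rms(samples):
    """Root-mean-square amplitude of a sequence of integer samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def drop_silent_chunks(samples, chunk_size, threshold):
    """Return a new array holding only the chunks whose RMS >= threshold."""
    kept = array.array("h")
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        if rms(chunk) >= threshold:
            kept.extend(chunk)
    return kept


def trim_wav(src, dst, chunk_ms=30, threshold=500):
    """Copy src to dst, leaving out chunks that look silent.

    Assumes 16-bit PCM input; both defaults are illustrative guesses.
    """
    with wave.open(src, "rb") as wav:
        params = wav.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", wav.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    # Keep chunk boundaries aligned to whole frames (all channels together).
    frames_per_chunk = max(1, params.framerate * chunk_ms // 1000)
    kept = drop_silent_chunks(
        samples, frames_per_chunk * params.nchannels, threshold
    )
    if sys.byteorder == "big":
        kept.byteswap()
    with wave.open(dst, "wb") as out:
        out.setparams(params)  # frame count in the header is fixed on close
        out.writeframes(kept.tobytes())
```

Note that deleting chunks shifts every later timestamp, so anything that needs timings aligned with the original recording would have to keep track of what was removed.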

@brainwane @itamarst Asking the doctor why their clinic notes say "please like and subscribe" at the end

Sumana Harihareswara Jul 22

@threedaymonk May I boost? (asking since you posted as unlisted/quiet public)

@brainwane @itamarst go ahead!

マリオ (Mario Menti) Jul 22

@threedaymonk @brainwane @itamarst I came across loads of those when using Whisper in a project at work. If it can't discern any words, it tends to just add things like these. We had German audio, and in these cases where there was no speech, or hardly any speech, it often added references to things like ZDF (a German broadcaster)

Justin Derrick Jul 22

@brainwane @itamarst Yeah, I processed a series of videos with odd audio tracks - sometimes they contained nothing, or white noise. The transcriptions it produced were all very strange, but were basically a “greatest hits” of phrases from YouTube closed captioning.

I considered this an “accepted side effect”, and didn’t really consider it a bug.
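
If you would rather not treat these as an accepted side effect, one possible post-processing step is to drop transcript segments that consist of nothing but a known caption credit. A rough sketch follows; the phrase list contains only the examples reported in this thread, `is_suspect` and `filter_segments` are hypothetical helper names, and the segment shape (a list of dicts with a `"text"` key) is assumed to match what Whisper's Python `transcribe` call returns under `"segments"`.

```python
import re

# Only the phrases people reported in this thread; far from complete.
KNOWN_PHRASES = [
    "ترجمة نانسي قنقر",
    "Untertitelung des ZDF für funk, 2017.",
    "Svensktextning.nu",
    "Thanks for watching!",
    "Translated by Amara.org Community",
]


def _normalize(text):
    """Lowercase and collapse punctuation, symbols and spaces to one space."""
    return re.sub(r"[\W_]+", " ", text.casefold()).strip()


# Longest first, so a longer phrase is removed before any shorter overlap.
_NORMALIZED = sorted({_normalize(p) for p in KNOWN_PHRASES}, key=len, reverse=True)


def is_suspect(text):
    """True if text is made up only of known phrases, possibly repeated."""
    remaining = _normalize(text)
    if not remaining:
        return False
    for phrase in _NORMALIZED:
        remaining = remaining.replace(phrase, " ")
    return not remaining.strip()


def filter_segments(segments):
    """Drop segments whose text is nothing but a known caption credit."""
    return [seg for seg in segments if not is_suspect(seg["text"])]
```

A segment is only dropped when nothing else is left after removing the phrases, so a sentence that merely contains "thanks for watching" survives. A speaker who really does say only "Thanks for watching!" would still be lost, which is the trade-off of this kind of filter.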

Alexander 😷 Jul 22

@brainwane @itamarst And now we know where Whisper steals its training data.

Adolfo Jayme Barrientos Jul 23

@brainwane @itamarst
From that Xitter thread: “... the culture’s disregard for quality...”
I stopped reading. THE US-AMERICAN’S ARROGANCE.

Itamar Turner-Trauring Jul 23

@fito @brainwane It's a pretty gross way to put it, yes. I'm not endorsing them as a person and don't know who they are, but the explanation of the mechanism (training on captions) makes sense.