A few days ago I came across this amazing bug report against Whisper, the LLM-based transcription tool: 'Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic which translates as "Translation by Nancy Qunqar"'

https://github.com/openai/whisper/discussions/2608

More about the problem in this thread https://xcancel.com/SheriefFYI/status/1756694564109423035

Thanks for the pointer, @itamarst (in https://hachyderm.io/@itamarst/114874924215640058 )

Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic which translates as "Translation by Nancy Qunqar" · openai whisper · Discussion #2608

If you generate complete silence in a wav file and run whisper on it, it will always hallucinate the same thing:
ffmpeg -f lavfi -i anullsrc=r=44100:cl=stereo -t 30 silence.wav
whisper ./silence.wav...

GitHub
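
For anyone without ffmpeg at hand, here is a minimal sketch of the same reproduction step using only Python's standard `wave` module. It mirrors the parameters in the ffmpeg command from the bug report (44100 Hz, stereo, 30 seconds); the function name `write_silence` and the choice of 16-bit PCM samples are assumptions for illustration, not something taken from the report.

```python
import wave


def write_silence(path, seconds=30, sample_rate=44100, channels=2):
    """Write a WAV file that contains only zero-valued 16-bit PCM samples.

    Hypothetical helper: the name and the 16-bit sample width are assumptions.
    """
    sample_width = 2  # bytes per sample (16-bit PCM)
    n_frames = seconds * sample_rate
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        # One frame holds one sample per channel; all-zero bytes are digital silence.
        wav.writeframes(bytes(n_frames * channels * sample_width))
    return n_frames
```

Calling `write_silence("silence.wav")` should give a file that can be passed to `whisper ./silence.wav` as in the report.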

I noticed that the GitHub issue now has a bunch of very recent comments, and figured out the reason: @Edent posted a link to Hacker News, where it's currently #1 on the front page (and has been syndicated to various social media bots accordingly)

https://news.ycombinator.com/item?id=44643564 Comments include examples from Czech, Russian, Welsh, Japanese, Turkish, Mandarin....

Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic | Hacker News

@brainwane @itamarst I always get, "Thanks for watching!"
@brainwane hold on a sec I need to contact John Cage via seance about this

@brainwane @itamarst Looks like they stole the translations used for training from here:

https://lyricstranslate.com/en/translator/nancy-qunqar

From her:
https://www.instagram.com/nancyrk/

Nancy Qunqar | Lyrics Translate

Nancy Qunqar - 14 translations, 1 song, 155 thanks received, 11 translation re

@wcbdata @brainwane @itamarst We also know where they stole the German training data from! (From the original link, 'Untertitelung des ZDF für funk, 2017.')
@wcbdata @brainwane @itamarst wouldn't it make more sense to train translation on some international documents or, I don't know, EU laws, which are the best 1:1 translations out there (because law is petty)?

@utf_7

I think possibly you have misunderstood @wcbdata a little bit; the relevant training dataset that's leading to *this* problem is, if I understand correctly, a combination of audio recordings of human speech in Arabic and human-made Arabic captions/transcripts of that speech, *not* Arabic translations of speech or of written documents from other languages. https://xcancel.com/SheriefFYI/status/1756694204867355041 has more on this.

@itamarst

@utf_7

This is a problem in Whisper's *transcription* of Arabic-language audio, not a problem with Whisper's translation between languages.

@wcbdata @itamarst

@brainwane @wcbdata @itamarst aaah, thx, now I've got it :)

@brainwane @itamarst I used Whisper's large model yesterday to transcribe a few interviews in Swedish - it's a bit picky with the file formats and when it wasn't to its liking it said:

"Svensktextning.nu Svensktextning.nu Svensktextning.nu Svensktextning.nu"

Probably a related or same issue.

@haagen @brainwane @itamarst If it is legal to upload them you could try https://goodtape.io/
Good Tape — Fast, secure and accurate transcription

Good Tape is an automatic transcription service that makes it easy for journalists (and others) to turn audio recordings into text, regardless of language or sound quality. We save you time and effort so you can focus on what really matters.

@drgroftehauge

https://goodtape.io/blog/help-center/ The Troubleshooting section mentions "Random/repeated words in transcripts" which indicates that GoodTape has the same problems (which makes sense since it is probably using Whisper or something similar)

@haagen @itamarst

@drgroftehauge @brainwane @itamarst I solved my problem by "simply" converting the audio file to a different format and a different encoding.

Bug report about Whisper, the LLM-based transcription tool:
Apparently, silence at the end is transcribed in German as

"Untertitelung des ZDF für funk, 2017."

https://github.com/openai/whisper/discussions/2608#discussioncomment-13790984

#whisper #LLM

@brainwane @itamarst At first I thought of these as “Easter eggs” but really it’s more like summoning ghosts.

Here’s another discussion with more anecdotes: https://github.com/openai/whisper/discussions/928

Dataset bias ("❤️ Translated by Amara.org Community") · openai whisper · Discussion #928

Hello, I noticed multiples biases using whisper. For example, it sometimes outputs (in french) ❤️ Translated by Amara.org Community as I guess it was used video subtitles by Amara. There are also l...

GitHub

@com @itamarst wow!

"Suggestion: don't use train your language models on Ice Skating videos, or classical music will trigger transcription of figure skating notations. :-)"

from https://github.com/openai/whisper/discussions/928#discussioncomment-13759269

@com

Evocative:

"There are many solutions and workarounds people are working on to prevent hallucinations, and most of them involve just cutting out the silent parts of the audio so that Whisper is never tempted by the silence"

TEMPTED BY THE SILENCE

@itamarst
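
The workaround quoted above (cutting out the silent parts before Whisper sees them) can be sketched roughly as follows. This is a naive energy gate, not what any particular project actually does: it splits 16-bit samples into fixed-size chunks and keeps only those whose RMS amplitude reaches a threshold. The function name `drop_silent_chunks`, the chunk size, and the threshold value are all assumptions chosen for illustration.

```python
import math
from array import array


def drop_silent_chunks(samples, chunk_size=1600, threshold=100.0):
    """Return only the chunks whose RMS amplitude is at or above `threshold`.

    `samples` is a sequence of signed 16-bit mono samples. The defaults are
    illustrative assumptions, not tuned values.
    """
    kept = array("h")
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        # Root-mean-square amplitude of this chunk; 0.0 for digital silence.
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= threshold:
            kept.extend(chunk)
    return kept
```

A fixed threshold like this will not cope well with background noise or quiet speech; it is only meant to show the shape of the idea.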

@brainwane @itamarst Asking the doctor why their clinic notes say "please like and subscribe" at the end

@threedaymonk May I boost? (asking since you posted as unlisted/quiet public)

@itamarst

@threedaymonk @brainwane @itamarst I came across loads of those when using Whisper in a project at work. If it can't discern any words, it tends to just add things like these. We had German audio, and in these cases where there was no speech, or hardly any speech, it often added references to things like ZDF (a German broadcaster)

@brainwane @itamarst Yeah, I processed a series of videos with odd audio tracks - sometimes they contained nothing, or white noise. The transcriptions it produced were all very strange, but were basically a “greatest hits” of phrases from YouTube closed captioning.

I considered this an “accepted side effect”, and didn’t really consider it a bug.
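
Since the phrases reported in this thread are so repetitive, one crude mitigation is to strip them from the output after transcription. The sketch below is an assumption-laden illustration, not a recommended fix: the function name `strip_known_artifacts` is hypothetical, the phrase list contains only strings quoted in this thread, and plain substring removal would also delete a phrase if someone genuinely said it.

```python
# Phrases reported in this thread as appearing when no speech is present.
KNOWN_ARTIFACTS = [
    "ترجمة نانسي قنقر",
    "Untertitelung des ZDF für funk, 2017.",
    "Svensktextning.nu",
    "Thanks for watching!",
    "Translated by Amara.org Community",
]


def strip_known_artifacts(transcript, artifacts=KNOWN_ARTIFACTS):
    """Remove known caption-credit phrases and collapse leftover whitespace."""
    cleaned = transcript
    for phrase in artifacts:
        cleaned = cleaned.replace(phrase, "")
    # split()/join collapses runs of whitespace left behind by the removals.
    return " ".join(cleaned.split())
```

This only treats the symptom; removing silence before transcription addresses the trigger itself.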

@brainwane @itamarst And now we know where Whisper steals its training data.
@brainwane @itamarst
From that Xitter thread: “... the culture’s disregard for quality...”
I stopped reading. THE US-AMERICAN’S ARROGANCE.
@fito @brainwane It's a pretty gross way to put it, yes, not endorsing them as a person, don't know who they are, but the explanation of the mechanism (training on captions) makes sense.