I'm working on reviving my old podcast searching system using OpenAI's Whisper engine (https://github.com/openai/whisper).

The results so far are amazing. I can run the transcription right on my Mac at roughly 5X realtime, and the accuracy is super impressive. It even gets brand names and weird words right nearly every time.

For example, this segment from The Talk Show where @marcoarment and @gruber argue about how to pronounce databases was perfectly transcribed, down to even the mispronunciations. 🤯

@_Davidsmith funnily enough, I've been transcribing old podcast episodes too, and so far my experience was very mixed

it worked alright on newer episodes with clean audio, but on older episodes with low bitrate mono mp3s it's… not very good

especially if you use the condition on previous text option from the recent update (which is enabled by default), where it gets stuck in a failure state and 90% of your transcription is the same sentence over and over again

even if you disable it and run with the older behavior where it doesn't try to form meaningful sentences, it's still missing half the conversation if there's background music or rapid overlapping voices

so in the end, it's fine for my use-case (I just want some loose index for quick lookup), but it's not a failproof option for usable sub generation for example
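
For anyone who wants to try the setup described above, here is a minimal sketch using the Python package. The helper name `transcribe_episode` and the `SAFE_OPTIONS` dict are hypothetical, and the lowered threshold values are illustrative guesses rather than the settings actually used in this thread. The three keyword names are arguments of Whisper's `transcribe()`.

```python
# Hypothetical defaults for a "first pass" on noisy, low-bitrate audio.
# The keys are real keyword arguments of whisper's transcribe();
# the values are illustrative assumptions, not tuned recommendations.
SAFE_OPTIONS = {
    "condition_on_previous_text": False,  # do not feed earlier output back in as a prompt
    "compression_ratio_threshold": 2.0,   # library default is 2.4
    "no_speech_threshold": 0.5,           # library default is 0.6
}


def transcribe_episode(path, model_name="medium.en", options=None):
    """Transcribe one audio file and return Whisper's list of segments."""
    # Imported lazily so this module loads even where Whisper is not
    # installed (it needs `pip install openai-whisper` plus ffmpeg).
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **(options or SAFE_OPTIONS))
    return result["segments"]
```

Calling `transcribe_episode("episode.mp3")` (a placeholder file name) would return segment dicts carrying `start`, `end` and `text`, which is enough for a loose search index.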

@13xforever I have seen an odd failure state where Whisper just loses all punctuation and everything becomes a giant run-on sentence. The words are still right at least, but it becomes a lot harder to parse.

@marcoshuerta with condition on previous text it breaks pretty fast, it's rare for it to transcribe past the 1h mark without breaking, even with compression ratio and no speech threshold decreased, usually it breaks after the 20-30 minute mark
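
The "same sentence over and over" failure can be spotted after the fact with a rough check like the one below. This is only a sketch: `looks_stuck` and its thresholds are hypothetical, and `compression_ratio` is my understanding of how Whisper scores repetitiveness (raw bytes divided by zlib-compressed bytes), so treat that as an assumption.

```python
import zlib
from collections import Counter


def compression_ratio(text):
    """Raw size over zlib-compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_stuck(segment_texts, max_share=0.5, min_segments=10):
    """Return True when one segment text makes up more than max_share of the transcript.

    max_share and min_segments are arbitrary illustrative thresholds.
    """
    texts = [t.strip().lower() for t in segment_texts if t.strip()]
    if len(texts) < min_segments:
        return False  # too short to judge
    _, count = Counter(texts).most_common(1)[0]
    return count / len(texts) > max_share
```

Running `looks_stuck` over the `text` fields of the segments is a cheap way to decide whether an episode needs a second pass with different settings.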

@13xforever That's really interesting; I am spot checking the transcripts I made with base.en and medium.en with the defaults on the ATP podcast audio files. Condition on previous text defaults to true, and I used the defaults.

All of my spot checks that are over 2 hours look fine. I don't see any breaking failures. I wonder what the difference here is?

@marcoshuerta can’t say for sure, but I’m using the python version with the cuda backend on very compressed mp3s, so there’s quite a lot of variables to pin down; for now I just do a first pass with condition on previous text disabled