I'm working on reviving my old podcast searching system using OpenAI's Whisper engine (https://github.com/openai/whisper).

The results so far are amazing. I can run the transcription right on my Mac at roughly 5X realtime, and the accuracy is super impressive. It even gets brand names and weird words right nearly every time.

For example, this segment from The Talk Show where @marcoarment and @gruber argue about how to pronounce databases was perfectly transcribed, down to even the mispronunciations. 🤯
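For anyone wanting to try something similar, here is a minimal sketch using the open-source `openai-whisper` Python package. It assumes `pip install openai-whisper` and ffmpeg on the PATH. The `base` model size and the `transcribe_episode` / `realtime_factor` helper names are illustrative choices, not necessarily what this project uses. Speed is measured as seconds of audio processed per second of wall-clock time, so a value of 5.0 means roughly 5X realtime.

```python
import time


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time (5.0 = 5X realtime)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def transcribe_episode(path: str, model_name: str = "base"):
    """Transcribe one audio file with openai-whisper and report its speed.

    Hypothetical helper: assumes openai-whisper and ffmpeg are installed.
    The default model size is an illustrative choice.
    """
    import whisper  # imported lazily so realtime_factor works without the package

    model = whisper.load_model(model_name)
    audio = whisper.load_audio(path)  # mono float32 samples resampled to 16 kHz
    audio_seconds = len(audio) / whisper.audio.SAMPLE_RATE

    start = time.perf_counter()
    result = model.transcribe(audio)  # dict with "text", "segments", "language"
    elapsed = time.perf_counter() - start

    return result["text"], realtime_factor(audio_seconds, elapsed)
```

Calling `transcribe_episode("episode.mp3")` (a hypothetical file) returns the full transcript and the measured speed. Actual speed depends heavily on the model size and the hardware it runs on.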

@_Davidsmith Does it do well with crosstalk?
@DonSqueak It does alright. It isn’t trying to segment by speaker; it just transcribes each intelligible word it hears. So if there is crosstalk, it might intertwine the speakers, but the words themselves are accurate.
@_Davidsmith Once they segment by speaker, that’s gonna be the transcription holy grail I guess(?)
@DonSqueak @_Davidsmith This might be relevant to your interests! Just found it the other day and learned about diarization: https://huggingface.co/spaces/vumichien/whisper-speaker-diarization
@haraball Thanks for this. I've seen a few diarization efforts underway, at this point I haven't pursued it but definitely something I'm keeping my eyes on.
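One common way to combine diarization with Whisper, sketched here under assumptions and not necessarily how the linked Hugging Face Space does it: run a separate diarization tool that yields speaker turns as (start, end, speaker), then give each Whisper segment the speaker whose turns overlap it most in time. The `Turn` format and the `label_segments` helper are hypothetical. Only the segment shape (dicts with `"start"`, `"end"`, `"text"`, as in Whisper's `result["segments"]`) comes from Whisper itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Turn:
    """One speaker turn from a diarization tool (hypothetical format)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: list[dict], turns: list[Turn]) -> list[tuple[str, str]]:
    """Attach a speaker to each Whisper segment by largest time overlap.

    Segments that overlap no turn are labeled "UNKNOWN". Crosstalk inside a
    single segment is not split: the whole segment goes to whichever speaker
    overlaps it most.
    """
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            o = overlap(seg["start"], seg["end"], turn.start, turn.end)
            if o > best_overlap:
                best_speaker, best_overlap = turn.speaker, o
        labeled.append((best_speaker, seg["text"].strip()))
    return labeled
```

Because labels are assigned per segment, this inherits the crosstalk limitation described earlier in the thread: two people talking over each other inside one segment still end up attributed to a single speaker.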