I'm working on reviving my old podcast searching system using OpenAI's Whisper engine (https://github.com/openai/whisper).

The results so far are amazing. I can run the transcription right on my Mac at roughly 5X realtime, and the accuracy is super impressive. It even gets brand names and weird words right nearly every time.
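For reference, the transcription runs are just the stock Whisper CLI; the model choice and filename here are illustrative, not necessarily what I'm using:

```shell
# Transcribe an episode locally with Whisper.
# Larger models are slower but more accurate; filename is illustrative.
whisper episode.mp3 --model medium.en --language en --output_format txt
```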

For example, this segment from The Talk Show where @marcoarment and @gruber argue about how to pronounce databases was perfectly transcribed, down to even the mispronunciations. 🤯

@_Davidsmith Can it do speaker identification?
@siracusa Not directly, but there are other tools you can run that will segment by speaker, so you could combine them if you wanted to.
@_Davidsmith Do you know of any that run on the Mac? I’d love transcripts and search for all my podcasts, but I think speaker identification is essential.

@siracusa @_Davidsmith OpenAI has said their model could do it, but I haven't seen anyone try to use it that way (yet).

https://github.com/openai/whisper/blob/main/model-card.md

"They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization but have not been robustly evaluated in these areas. "

EDITED to link to this discussion and some efforts to combine with pyannote https://github.com/openai/whisper/discussions/264

@siracusa @_Davidsmith Here’s a python CLI that does speaker diarization: https://github.com/yinruiqing/pyannote-whisper
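The core of what these combined tools do is align Whisper's timestamped segments with pyannote's speaker turns. A minimal sketch of that alignment step, assuming Whisper-style segment dicts (`start`, `end`, `text`) and pyannote-style `(start, end, speaker)` turns; `assign_speakers` is a hypothetical helper for illustration, not part of either library:

```python
def assign_speakers(transcript_segments, diarization_turns):
    """Label each transcript segment with the speaker whose
    diarization turn overlaps it the most (by duration)."""
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in diarization_turns:
            # Overlap between [seg.start, seg.end] and [turn_start, turn_end]
            overlap = min(seg["end"], turn_end) - max(seg["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Synthetic example: two transcript segments, two speaker turns.
segments = [
    {"start": 0.0, "end": 4.0, "text": "It's pronounced data."},
    {"start": 4.0, "end": 9.0, "text": "No, it's data."},
]
turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 10.0, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```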
@siracusa @_Davidsmith I briefly explored https://github.com/speechbrain/speechbrain as a possibility for this a number of months ago as a “locally find and remove dynamically inserted podcast ads” project, and it seemed promising. Whisper can probably also do it, but that functionality is less tested/robust than its main transcription purpose.

@siracusa @_Davidsmith If Merlin finds out you didn’t ask him about Descript…

Speaker recognition is one of its flagship features.

@siracusa @_Davidsmith Google Recorder does speaker identification and translation, but I don't know if you can feed it prerecorded audio without doing it a kind of dumb way.
@siracusa Everything I've read says that Pyannote is the best way to do it currently (https://lablab.ai/t/whisper-transcription-and-speaker-identification). I've not pursued it much because I'm just looking for search, where speakers aren't particularly important.
@_Davidsmith @siracusa Pyannote is what I used for catatp.fm speaker identification, but they recently changed their model, which dramatically reduced accuracy on long-form audio clips (like podcasts). I was able to manually revert to their old model and keep using it, but in the future I plan to investigate NVIDIA NeMo, for which I've seen anecdotal reports of high accuracy.
@_Davidsmith @siracusa I have done it with this clunky/imperfect workflow:
1) Use Whisper for free transcription
2) Remove all carriage returns from the generated txt file
3) Upload the original audio file to Descript but *do not* have Descript transcribe it.
4) Add the whisper-generated transcript to Descript, which only charges for its own transcription time. It's free to have Descript synchronize a file to your own provided transcript.
5) Have Descript detect speakers.
6) Export