I'm working on reviving my old podcast searching system using OpenAI's Whisper engine (https://github.com/openai/whisper).

The results so far are amazing. I can run the transcription right on my Mac at roughly 5X realtime, and the accuracy is super impressive. It even gets brand names and weird words right nearly every time.

For example, this segment from The Talk Show where @marcoarment and @gruber argue about how to pronounce databases was perfectly transcribed, down to even the mispronunciations. 🤯
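For anyone who wants to try this themselves, the Python API is only a few lines. A minimal sketch (the whisper calls are commented out so the helper below stands alone; the sample segment data is made up, but it matches the shape Whisper returns):

```python
# Whisper usage (pip install openai-whisper), commented so this runs standalone:
#
#   import whisper
#   model = whisper.load_model("base.en")
#   result = model.transcribe("episode.mp3")
#
# result["segments"] is a list of dicts with "start", "end", and "text" keys;
# a small helper can turn those into timestamped lines for a search index:

def format_segments(segments):
    """Render Whisper-style segments as '[MM:SS] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)

# Hypothetical sample data in the shape Whisper returns:
sample = [
    {"start": 0.0, "end": 4.2, "text": " Welcome back to the show."},
    {"start": 4.2, "end": 9.8, "text": " Today we argue about databases."},
]
print(format_segments(sample))
```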

@_Davidsmith nailing mon-gobd is next level
@_Davidsmith Does it do well with crosstalk?
@DonSqueak It does alright. It isn’t trying to segment by speaker; it will just transcribe each intelligible word it hears. So if there is crosstalk it might intertwine the speakers, but the words themselves are accurate
@_Davidsmith Once they segment by speaker, that’s gonna be the transcription holy grail I guess(?)
@DonSqueak @_Davidsmith This might be relevant to your interests! Just found it the other day and learned about diarization: https://huggingface.co/spaces/vumichien/whisper-speaker-diarization

@haraball Thanks for this. I've seen a few diarization efforts underway, at this point I haven't pursued it but definitely something I'm keeping my eyes on.
@_Davidsmith @marcoarment @gruber is this a public service or project? I've been wanting to run whisper on the entire corpus of ATP episodes just to be able to reference things easier, but I haven't really sat down to do it.
@particles my old system is here: http://podsearch.david-smith.org. It will be updated with these new transcripts (and recent episodes as I work through the backlog)

@_Davidsmith just to double check, are you using the v2 English large model or one of the v1 models?

@particles
While nowhere near as precise as the Whisper demo that @_Davidsmith just showed, https://catatp.fm also posts automated transcripts of ATP. I used it just the other day to check out some of the RSS feeds @marcoarment mentioned in episode 417 when they weren’t in the shownotes.

Thank you for sharing the link to your transcript catalogue, David. It’s a treasure trove.


@jon sadly, whisper is just too good to pass up imho. It looks like catatp's author might be willing to consider it, since they track the "word error rate" of other transcription services on their about page. Whisper is just...awesome tbh.
@particles Agreed. Given my experience with ChatGPT in its first few hours of public existence (prior to its capabilities being reined in), I’m convinced it was trained on the endless abyss of YouTube content & other video, in addition to the vast sea of open podcast feeds, using Whisper.
@_Davidsmith Thanks for doing this. There was a particular Roderick on the Line segment that I swear existed, but could never locate again, even with the old search. Sounds like this will help me prove I’m not crazy.

@Chris There are some older episodes of Roderick here: http://podsearch.david-smith.org/shows/7 So it might be there already... but that only goes through episode 285

But I had to stop updating it a while back when my old transcribing system broke.


@_Davidsmith Would you share the episode number, please?
@dvk That episode is #352: https://daringfireball.net/thetalkshow/2022/07/25/ep-352 The segment shown is around 84 minutes in.

@_Davidsmith i must have a misconfiguration somewhere, I’m getting 1x on an M1 MBP at best. It’s still amazing in its quality but the speed is underwhelming.
@donw Try this C++ port of Whisper; I believe it is much faster than the Python-based version. Then tweak the threads setting to best make use of your machine: https://github.com/ggerganov/whisper.cpp
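For reference, a rough whisper.cpp workflow might look like this (commands follow the repo's README at the time and may have changed; note that whisper.cpp expects 16 kHz mono WAV input, hence the ffmpeg step):

```shell
# Build whisper.cpp and fetch a model (script names from the repo README)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make
bash ./models/download-ggml-model.sh base.en

# whisper.cpp expects 16 kHz mono WAV, so convert the episode first
ffmpeg -i episode.mp3 -ar 16000 -ac 1 -c:a pcm_s16le episode.wav

# -t sets the number of threads; tune it to your core count
./main -m models/ggml-base.en.bin -f episode.wav -t 8
```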
@_Davidsmith @donw has anybody figured out how to transform microphone PCM16 into whatever Whisper is expecting? For me it works out of the box for wav files, but with the microphone I just get mumbling as a result
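For what it's worth, Whisper's Python `transcribe()` accepts raw audio as a float array sampled at 16 kHz with values scaled to [-1, 1]; if the microphone delivers 44.1/48 kHz PCM16 and it's fed in without resampling and rescaling, garbled output is what you'd expect. A stdlib-only sketch of the scaling step (resampling not shown; assumes native-endian signed 16-bit samples):

```python
import array

def pcm16_to_float(raw: bytes) -> list[float]:
    """Convert signed 16-bit PCM bytes to floats in [-1, 1],
    the range Whisper's Python API expects (it also wants 16 kHz mono)."""
    samples = array.array("h", raw)  # "h" = signed 16-bit, native endianness
    return [s / 32768.0 for s in samples]

# Two samples: full-scale negative, then half-scale positive
raw = array.array("h", [-32768, 16384]).tobytes()
print(pcm16_to_float(raw))  # → [-1.0, 0.5]
```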

@_Davidsmith @marcoarment @gruber I built a transcript search python module and unofficial @atpfm search engine with Whisper (with an Nvidia GPU and the base model, it got through the ATP back catalog at about 3 min per episode):

https://marcoshuerta.com/dash/atp_search/

https://github.com/astrowonk/search_transcripts

Back-ended by FTS5 in sqlite.
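A minimal sketch of that FTS5 approach using only Python's stdlib sqlite3 (table and column names here are illustrative, not the actual search_transcripts schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# An FTS5 virtual table: every column is full-text indexed
con.execute("CREATE VIRTUAL TABLE segments USING fts5(episode, text)")
con.executemany(
    "INSERT INTO segments (episode, text) VALUES (?, ?)",
    [
        ("417", "Marco mentioned several RSS feeds"),
        ("418", "A long digression about file systems"),
    ],
)
# FTS5 MATCH query, ranked by relevance via the hidden 'rank' column (bm25)
rows = con.execute(
    "SELECT episode, text FROM segments WHERE segments MATCH ? ORDER BY rank",
    ("rss",),
).fetchall()
print(rows)  # → [('417', 'Marco mentioned several RSS feeds')]
```

Note the default unicode61 tokenizer is case-insensitive, so `rss` matches "RSS".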


@_Davidsmith @marcoarment @gruber

holyshitholyshitholyshit.

a) holyshit
b) i so need to get this running and indexing the podcasts I listen to (because none have transcripts, and I so often want to pull quotes from them)
c) holy shit
d) How do we make it easy for _every_ podcast to add this to their site?!

@masukomi I made a Python class that takes a folder of transcripts (ostensibly episodes of a podcast) and turns them into a sqlite database with full text search (FTS5). That database can be used by any software (I use it via Python to make my Accidental Tech Podcast search engine).

https://github.com/astrowonk/search_transcripts

@marcoshuerta Nice! I love that you link to @simon 's Datasette but i think it'd be _really_ valuable if you actually linked to an example of that in play. I think more folks would consider using something like your tool if they realized they could have a decent interface to the data with essentially no effort.

@marcoshuerta @simon I've found SQLite's FTS5 to be useful but _very_ annoying to set up. I never wrapped my head around how to deal with data across joined tables in it, and having to create a trigger for every Create, Update, and Delete on every table you care about is a PITA.
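For readers following along, this is the trigger boilerplate being described: an external-content FTS5 table kept in sync with a normal table via one trigger each for INSERT, UPDATE, and DELETE, the canonical pattern from the SQLite FTS5 docs (names here are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT);
CREATE VIRTUAL TABLE docs_fts USING fts5(body, content=docs, content_rowid=id);

-- One trigger each for INSERT, DELETE, and UPDATE on the content table
CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
  INSERT INTO docs_fts (rowid, body) VALUES (new.id, new.body);
END;
CREATE TRIGGER docs_ad AFTER DELETE ON docs BEGIN
  INSERT INTO docs_fts (docs_fts, rowid, body) VALUES ('delete', old.id, old.body);
END;
CREATE TRIGGER docs_au AFTER UPDATE ON docs BEGIN
  INSERT INTO docs_fts (docs_fts, rowid, body) VALUES ('delete', old.id, old.body);
  INSERT INTO docs_fts (rowid, body) VALUES (new.id, new.body);
END;
""")
con.execute("INSERT INTO docs (body) VALUES ('full text search is fiddly')")
hits = con.execute(
    "SELECT body FROM docs_fts WHERE docs_fts MATCH 'fiddly'"
).fetchall()
print(hits)  # → [('full text search is fiddly',)]
```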

Full Text Search is really important to my current long-term project, and there'll be lots of it, so I'm planning on going with CouchDB + the open-source version of Zinc for search https://zincsearch.com/


@marcoshuerta @simon BUT for now, for podcasts, I'm 100% going to use your tool and shove the transcripts into Datasette. I really appreciate that you and simon have done all the hard work for me. :D
@masukomi @marcoshuerta have you tried my sqlite-utils Python library and CLI tool for FTS5? It has methods that can configure the triggers for you, and run searches with the necessary joins: https://sqlite-utils.datasette.io/en/stable/cli.html#configuring-full-text-search

@simon I haven't, but that's mostly just because I'm not in python-land.

I swear your SQLite work is the most tempting reason for me to poke Python. Really, what I _should_ be doing is mining your repos for the bits that would make my life easier and just porting them to #RakuLang

Side note: Reading some source you linked was the only way I managed to grok FTS5 setup in the first place.

@masukomi the CLI tool means you don't have to care it's written in Python (you can actually "brew install sqlite-utils" to get that) - and there's one command that will output the SQL query you need for a search directly to your terminal:

sqlite-utils search mydb.db documents searchterm --sql

@simon ... šŸ‘€ ... oooooOOOOOOoooo

😸 somehow i missed that it's a cli tool, not just a library šŸ¤¦ā€ā™€ļø

thank you.

@masukomi I've been having real fun with it trying to ensure every Python library feature is also available as a CLI command

@masukomi fun fact! I had no idea FTS5 existed until a few months ago and had been glomming some python bm25 indexing library on top of SQLite until then. The first version of search_transcripts didn't use FTS5… 😱

https://github.com/astrowonk/search_transcripts/commit/5375dc9b4b514cef9f01bd5e9a60c54aadb519d7


@simon @masukomi Tangential question: when I create a virtual FTS5 table per the sqlite documentation and then load that .db into Datasette, the virtual table (the one with the search ability) is hidden by default. Am I doing something wrong? I tried Datasette Lite on a file I made with search_transcripts and it works, but I have to click down into the hidden tables to get to the search_data table:

https://lite.datasette.io/?url=https%3A%2F%2Fmarcoshuerta.com%2Ffor_download%2Fscotus_main.db#/scotus_main


@marcoshuerta @masukomi Datasette's search table feature only works if the FTS table was created in a way that references the table you are searching - https://datasette.io/content/repos has a search box because https://datasette.io/content/repos_fts is defined like this:

CREATE VIRTUAL TABLE [repos_fts] USING FTS5 (
[name], [description],
content=[repos]
);

If your FTS table didn't set content=table you can manually configure it in the query string like this: https://docs.datasette.io/en/stable/full_text_search.html#configuring-full-text-search-for-a-table-or-view
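As a runnable sketch of that `content=` setup (made-up data; the FTS5 'rebuild' command indexes rows that already existed in the content table when the FTS table was created):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE repos (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
)
con.execute(
    "INSERT INTO repos (name, description) VALUES (?, ?)",
    ("whisper", "Robust speech recognition"),
)
# External-content FTS table: indexes repos without duplicating its text
con.execute("""
    CREATE VIRTUAL TABLE repos_fts USING fts5(
        name, description, content=repos, content_rowid=id
    )
""")
# Index the rows that were already in the content table
con.execute("INSERT INTO repos_fts (repos_fts) VALUES ('rebuild')")
found = con.execute(
    "SELECT name FROM repos_fts WHERE repos_fts MATCH 'speech'"
).fetchall()
print(found)  # → [('whisper',)]
```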


@simon Ah, so that sounds like it works with what is called an ā€œexternal contentā€ table here: https://www.sqlite.org/fts5.html

I am very new to FTS in SQLite; I just followed the CREATE VIRTUAL TABLE example in section 1 of the doc link and inserted directly into the empty virtual table as if it were a normal table. I didn’t add the FTS later, after a content table existed.


@marcoshuerta yeah that's the one - I create all of my FTS tables using sqlite-utils so I tend to forget how obscure it can be doing it from scratch

@masukomi I used the virtual table creation command and treat the virtual table like any other sqlite table. When it's time to add another episode, the class just uses SQLAlchemy and pandas .to_sql() to append more records.

Under the hood it does look complicated (when I look at all the actual schemas the virtual table creation made!), but so far I haven't run into complications for my use case.

@masukomi Yeah I thought about that, but I had never deployed a Datasette site before and hadn't used the recommended tools (glitch, vercel, etc), so making yet another Plotly Dash app and deploying another uWSGI vassal app was more straightforward (for me).

I'll definitely look into a Datasette deployment as something to demo the class. Datasette Lite pointing to a DB url somewhere might be the easiest.

@masukomi Check out the Snipd podcast player. It has already integrated the Whisper engine and makes it easy to save quotes, export, etc.
@ruan oooh. I was wondering if there was something like that. Thank you. :D
@_Davidsmith @marcoarment @gruber I've played around with it a little bit and was also very impressed by the results. I hope they are working on improvements to subtitle timings; that was one shortcoming I noticed.

@_Davidsmith funnily enough, I've been transcribing old podcast episodes too, and so far my experience was very mixed

it worked alright on newer episodes with clean audio, but on older episodes with low bitrate mono mp3s it's… not very good

especially if you use the condition-on-previous-text option from the recent update (which is enabled by default), where it gets stuck in a failure state and 90% of your transcription is the same sentence over and over again

even if you disable it and run with the older behavior, where it doesn't try to form meaningful sentences, it still misses half the conversation if there's background music or rapid overlapping voices

so in the end it's fine for my use-case (I just want a loose index for quick lookup), but it's not a foolproof option for, say, usable subtitle generation

@13xforever I have seen an odd failure state where Whisper just loses all punctuation and everything becomes a giant run-on sentence. The words are still right at least, but it becomes a lot harder to parse.
@marcoshuerta with condition on previous text it breaks pretty fast; it's rare for it to get past the 1h mark without breaking. Even with the compression ratio and no-speech thresholds decreased, it usually breaks after the 20-30 minute mark

@13xforever That's really interesting; I am spot checking the transcripts I made with base.en and medium.en with the defaults on the ATP podcast audio files. Condition on previous text defaults to true, and I used the defaults.

All of my spot checks that are over 2 hours look fine. I don't see any breaking failures. I wonder what the difference here is?

@marcoshuerta can’t say for sure, but I’m using the python version with the cuda backend on very compressed mp3s, so there are quite a lot of variables to pin down; for now I just do a first pass with condition on previous text disabled
@_Davidsmith Oh dang this could be huge for my livestreams
@_Davidsmith Can it do speaker identification?
@siracusa not directly, but there are other tools you can run that will segment by speaker, so if you wanted to I suppose you could combine them.
@_Davidsmith Do you know of any that run on the Mac? I’d love transcripts and search for all my podcasts, but I think speaker identification is essential.

@siracusa @_Davidsmith OpenAI has said their model could do it, but I haven't seen anyone try to use them that way (yet).

https://github.com/openai/whisper/blob/main/model-card.md

"They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization but have not been robustly evaluated in these areas. "

EDITED to link to this discussion and some efforts to combine with pyannote https://github.com/openai/whisper/discussions/264

@siracusa @_Davidsmith Here’s a python CLI that does speaker diarization: https://github.com/yinruiqing/pyannote-whisper
@siracusa @_Davidsmith I briefly explored https://github.com/speechbrain/speechbrain for this a number of months ago, as part of a ā€œlocally find and remove dynamically inserted podcast adsā€ project, and it seemed promising. Whisper can probably also do it, but the functionality is less tested/robust than its main transcription purpose

@siracusa @_Davidsmith If Merlin finds out you didn’t ask him about Descript…

Speaker recognition is one of its flagship features.