Mastodawn

David Smith Jan 15, 2023

I'm working on reviving my old podcast searching system using OpenAI's Whisper engine (https://github.com/openai/whisper).

The results so far are amazing. I can run the transcription right on my Mac at roughly 5X realtime, and the accuracy is super impressive. It even gets brand names and weird words right nearly every time.

For example, this segment from The Talk Show where @marcoarment and @gruber argue about how to pronounce databases was perfectly transcribed, down to even the mispronunciations. 🤯


masukomi Jan 15, 2023

@_Davidsmith @marcoarment @gruber

holyshitholyshitholyshit.

a) holyshit
b) i so need to get this running and indexing the podcasts I listen to (because none have transcripts, and I so often want to pull quotes from them)
c) holy shit
d) How do we make it easy for _every_ podcast to add this to their site?!


Marcos Huerta Jan 15, 2023

@masukomi I made a Python class that takes a folder of transcripts (ostensibly episodes of a podcast) and turns them into a SQLite database with full text search (FTS5). That database can be used by any software. (I use it via Python to make my Accidental Tech Podcast search engine.)

https://github.com/astrowonk/search_transcripts
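
For readers who have not used FTS5, the general idea can be sketched with Python's standard `sqlite3` module. This is not the code from search_transcripts: the table layout, the `build_index` and `search` helpers, and the sample segments are all hypothetical, and it assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical transcript segments: (episode, start time, text).
SEGMENTS = [
    ("ep-001", "00:01:05", "We argue about how to pronounce SQLite."),
    ("ep-001", "00:02:10", "Full text search makes old episodes findable."),
    ("ep-002", "00:00:30", "This week we talk about transcription on a Mac."),
]


def build_index(conn, segments):
    """Create an FTS5 table and load the transcript segments into it."""
    # UNINDEXED columns are stored and returned but not searched.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS segments "
        "USING fts5(episode UNINDEXED, start_time UNINDEXED, text)"
    )
    conn.executemany("INSERT INTO segments VALUES (?, ?, ?)", segments)
    conn.commit()


def search(conn, query, limit=10):
    """Return matching segments, best match first (FTS5's built-in rank)."""
    return conn.execute(
        "SELECT episode, start_time, text FROM segments "
        "WHERE segments MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_index(conn, SEGMENTS)
results = search(conn, "transcription")
for episode, start_time, text in results:
    print(episode, start_time, text)
```

Because the result is an ordinary SQLite file (here an in-memory database for brevity), any tool that can open SQLite can run the same `MATCH` query.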


masukomi Jan 15, 2023

@marcoshuerta Nice! I love that you link to @simon's Datasette, but I think it'd be _really_ valuable if you actually linked to an example of that in play. I think more folks would consider using something like your tool if they realized they could have a decent interface to the data with essentially no effort.


masukomi Jan 15, 2023

@marcoshuerta @simon I've found SQLite's FTS5 to be useful but _very_ annoying to set up. I never wrapped my head around how to deal with data across joined tables in it, and having to create a trigger for every Create, Update, and Delete on every table you care about is a PITA.

Full Text Search is really important to my current long-term project, and there'll be lots of it, so I'm planning on going with CouchDB + the open-source version of Zinc for search: https://zincsearch.com/
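
To illustrate the trigger setup complained about above, here is a minimal sketch of the usual external-content FTS5 pattern for a single table, using Python's standard `sqlite3` module. The `notes` table, the trigger names, and the `matching_ids` helper are hypothetical, and the sketch assumes a SQLite build with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT);

    -- External-content FTS5 table: the index reads its text from notes.
    CREATE VIRTUAL TABLE notes_fts
        USING fts5(body, content='notes', content_rowid='id');

    -- One trigger per write operation keeps the index in sync.
    CREATE TRIGGER notes_ai AFTER INSERT ON notes BEGIN
        INSERT INTO notes_fts(rowid, body) VALUES (new.id, new.body);
    END;
    CREATE TRIGGER notes_ad AFTER DELETE ON notes BEGIN
        INSERT INTO notes_fts(notes_fts, rowid, body)
            VALUES ('delete', old.id, old.body);
    END;
    CREATE TRIGGER notes_au AFTER UPDATE ON notes BEGIN
        INSERT INTO notes_fts(notes_fts, rowid, body)
            VALUES ('delete', old.id, old.body);
        INSERT INTO notes_fts(rowid, body) VALUES (new.id, new.body);
    END;
    """
)


def matching_ids(query):
    """Return the ids of notes whose body matches the FTS5 query."""
    rows = conn.execute(
        "SELECT rowid FROM notes_fts WHERE notes_fts MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]


conn.execute("INSERT INTO notes (body) VALUES ('podcast transcripts are searchable')")
conn.execute("INSERT INTO notes (body) VALUES ('triggers keep the index in sync')")
after_insert = matching_ids("podcast")

conn.execute("UPDATE notes SET body = 'video captions' WHERE id = 1")
after_update_old = matching_ids("podcast")
after_update_new = matching_ids("captions")

conn.execute("DELETE FROM notes WHERE id = 2")
after_delete = matching_ids("triggers")
```

Three triggers per indexed table is where the tedium comes from: every additional table needs its own FTS table and its own set.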


Simon Willison Jan 15, 2023

@masukomi @marcoshuerta have you tried my sqlite-utils Python library and CLI tool for FTS5? It has methods that can configure the triggers for you, and run searches with the necessary joins: https://sqlite-utils.datasette.io/en/stable/cli.html#configuring-full-text-search


masukomi Jan 15, 2023

@simon I haven't, but that's mostly just because I'm not in python-land.

I swear your SQLite work is the most tempting reason for me to poke Python. Really, what I _should_ be doing is mining your repos for the bits that would make my life easier and just porting them to #RakuLang

Side note: Reading some source you linked was the only way I managed to grok FTS5 setup in the first place.


Marcos Huerta

@masukomi fun fact! I had no idea FTS5 existed until a few months ago and had been glomming some Python bm25 indexing library on top of SQL until then. The first version of search_transcripts didn't use FTS5… 😱

https://github.com/astrowonk/search_transcripts/commit/5375dc9b4b514cef9f01bd5e9a60c54aadb519d7
