Mastodawn

David Smith Jan 15, 2023

I'm working on reviving my old podcast searching system using OpenAI's Whisper engine (https://github.com/openai/whisper).

The results so far are amazing. I can run the transcription right on my Mac at roughly 5X realtime, and the accuracy is super impressive. It even gets brand names and weird words right nearly every time.

For example, this segment from The Talk Show where @marcoarment and @gruber argue about how to pronounce databases was perfectly transcribed, down to even the mispronunciations. 🤯

GitHub - openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision

masukomi Jan 15, 2023

@_Davidsmith @marcoarment @gruber

holyshitholyshitholyshit.

a) holyshit
b) i so need to get this running and indexing the podcasts I listen to (because none have transcripts, and I so often want to pull quotes from them)
c) holy shit
d) How do we make it easy for _every_ podcast to add this to their site?!

Marcos Huerta Jan 15, 2023

@masukomi I made a Python class that takes a folder of transcripts (ostensibly episodes of a podcast) and turns them into a SQLite database with full text search (FTS5). That database can be used by any software (I use it via Python to power my Accidental Tech Podcast search engine).

https://github.com/astrowonk/search_transcripts

GitHub - astrowonk/search_transcripts: Convert a directory of .vtt or json transcripts into a fast searchable database
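
For readers unfamiliar with FTS5, here is a minimal sketch of the general idea using only Python's standard `sqlite3` module. It is not the code from search_transcripts; the table name `segments`, its columns, and the sample rows are hypothetical, and it assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical transcript segments: (episode, start time in seconds, text).
SEGMENTS = [
    ("ep-001", 12.5, "Welcome back to the show, today we talk about databases."),
    ("ep-001", 47.0, "I run the transcription locally on my Mac."),
    ("ep-002", 5.0, "Full text search makes old episodes easy to quote."),
]


def build_index(conn, segments):
    # episode and start are stored but marked UNINDEXED; only text is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS segments "
        "USING fts5(episode UNINDEXED, start UNINDEXED, text)"
    )
    conn.executemany("INSERT INTO segments VALUES (?, ?, ?)", segments)
    conn.commit()


def search(conn, query):
    # Ordering by the built-in rank column puts the best matches first.
    return conn.execute(
        "SELECT episode, start, text FROM segments "
        "WHERE segments MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_index(conn, SEGMENTS)
hits = search(conn, "transcription")
```

Storing the episode and timestamp alongside each segment is what lets a search hit point back to a specific moment in the audio.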

masukomi Jan 15, 2023

@marcoshuerta Nice! I love that you link to @simon's Datasette, but I think it'd be _really_ valuable if you actually linked to an example of that in play. I think more folks would consider using something like your tool if they realized they could have a decent interface to the data with essentially no effort.

masukomi Jan 15, 2023

@marcoshuerta @simon I've found SQLite's FTS5 to be useful but _very_ annoying to set up. I never wrapped my head around how to deal with data across joined tables in it, and having to create a trigger for every Create, Update, and Delete on every table you care about is a PITA.

Full text search is really important to my current long-term project, and there'll be lots of it, so I'm planning on going with CouchDB plus the open source version of Zinc for search: https://zincsearch.com/

ZincSearch - A modern search engine

ZincSearch is the simplest and easiest search system to get up and running. It's an open source, easy-to-use search engine to solve your observability needs.
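
The trigger setup complained about in the post above can be sketched as follows. This is one common pattern (an external-content FTS5 table kept in sync by one trigger each for insert, delete, and update), written with Python's standard `sqlite3` module. The `episodes` table, its columns, and the trigger names are hypothetical, and the sketch assumes an FTS5-enabled SQLite build.

```python
import sqlite3

SCHEMA = """
CREATE TABLE episodes(id INTEGER PRIMARY KEY, title TEXT, transcript TEXT);

-- The FTS index reads its text from the episodes table (external content).
CREATE VIRTUAL TABLE episodes_fts USING fts5(
    title, transcript, content='episodes', content_rowid='id'
);

-- One trigger per write operation keeps the index in sync.
CREATE TRIGGER episodes_ai AFTER INSERT ON episodes BEGIN
    INSERT INTO episodes_fts(rowid, title, transcript)
    VALUES (new.id, new.title, new.transcript);
END;
CREATE TRIGGER episodes_ad AFTER DELETE ON episodes BEGIN
    INSERT INTO episodes_fts(episodes_fts, rowid, title, transcript)
    VALUES ('delete', old.id, old.title, old.transcript);
END;
CREATE TRIGGER episodes_au AFTER UPDATE ON episodes BEGIN
    INSERT INTO episodes_fts(episodes_fts, rowid, title, transcript)
    VALUES ('delete', old.id, old.title, old.transcript);
    INSERT INTO episodes_fts(rowid, title, transcript)
    VALUES (new.id, new.title, new.transcript);
END;
"""


def match_ids(conn, query):
    # Return the rowids of indexed episodes matching the query.
    rows = conn.execute(
        "SELECT rowid FROM episodes_fts WHERE episodes_fts MATCH ?", (query,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO episodes(title, transcript) VALUES (?, ?)",
    ("Pilot", "we argue about pronouncing databases"),
)
after_insert = match_ids(conn, "databases")

conn.execute(
    "UPDATE episodes SET transcript = ? WHERE id = 1",
    ("we talk about microphones",),
)
after_update_old = match_ids(conn, "databases")
after_update_new = match_ids(conn, "microphones")

conn.execute("DELETE FROM episodes WHERE id = 1")
after_delete = match_ids(conn, "microphones")
```

Three triggers per indexed table is the boilerplate in question, and it has to be repeated for every table that needs searching.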

Simon Willison Jan 15, 2023

@masukomi @marcoshuerta have you tried my sqlite-utils Python library and CLI tool for FTS5? It has methods that can configure the triggers for you, and run searches with the necessary joins: https://sqlite-utils.datasette.io/en/stable/cli.html#configuring-full-text-search

sqlite-utils command-line tool - sqlite-utils

masukomi Jan 15, 2023

@simon I haven't, but that's mostly just because I'm not in Python-land.

I swear your SQLite work is the most tempting reason for me to poke Python. Really, what I _should_ be doing is mining your repos for the bits that would make my life easier and just porting them to #RakuLang

Side note: Reading some source you linked was the only way I managed to grok FTS5 setup in the first place.

Simon Willison Jan 15, 2023

@masukomi the CLI tool means you don't have to care it's written in Python (you can actually "brew install sqlite-utils" to get that) - and there's one command that will output the SQL query you need for a search directly to your terminal:

sqlite-utils search mydb.db documents searchterm --sql
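
The exact SQL that command prints is not reproduced here. As a rough illustration of the kind of query involved, the hand-written sketch below joins a content table to its FTS5 index and orders by relevance, using Python's standard `sqlite3` module. The `documents` table name comes from the command above; the `documents_fts` name, the columns, and the sample rows are assumptions, and an FTS5-enabled SQLite build is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents(id INTEGER PRIMARY KEY, title TEXT, body TEXT);
INSERT INTO documents(title, body) VALUES
    ('Episode 1', 'an argument about how to pronounce databases'),
    ('Episode 2', 'notes on microphones and audio editing');

-- External-content index over documents, populated once with 'rebuild'.
CREATE VIRTUAL TABLE documents_fts USING fts5(
    title, body, content='documents', content_rowid='id'
);
INSERT INTO documents_fts(documents_fts) VALUES ('rebuild');
""")

# Join the original rows to the index so any column of documents can be returned.
SEARCH_SQL = """
SELECT documents.id, documents.title
FROM documents
JOIN documents_fts ON documents.id = documents_fts.rowid
WHERE documents_fts MATCH :query
ORDER BY documents_fts.rank
"""

results = conn.execute(SEARCH_SQL, {"query": "databases"}).fetchall()
```

Because the query is plain SQL with a named parameter, it can be pasted into any language that has a SQLite binding.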

masukomi Jan 15, 2023

@simon ... 👀 ... oooooOOOOOOoooo

😸 somehow I missed that it's a CLI tool, not just a library 🤦‍♀️

thank you.

Simon Willison

@masukomi I've been having real fun with it trying to ensure every Python library feature is also available as a CLI command