I'm working on reviving my old podcast searching system using OpenAI's Whisper engine (https://github.com/openai/whisper).

The results so far are amazing. I can run the transcription right on my Mac at roughly 5X realtime, and the accuracy is super impressive. It even gets brand names and weird words right nearly every time.

For example, this segment from The Talk Show where @marcoarment and @gruber argue about how to pronounce databases was perfectly transcribed, down to even the mispronunciations. 🤯

@_Davidsmith @marcoarment @gruber

holyshitholyshitholyshit.

a) holyshit
b) i so need to get this running and indexing the podcasts I listen to (because none have transcripts, and I so often want to pull quotes from them)
c) holy shit
d) How do we make it easy for _every_ podcast to add this to their site?!

@masukomi I made a Python class that takes a folder of transcripts (ostensibly episodes of a podcast) and turns them into a SQLite database with full-text search (FTS5). That database can be used by any software (I use it via Python to power my Accidental Tech Podcast search engine).

https://github.com/astrowonk/search_transcripts
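
For anyone wondering what "a SQLite database with FTS5" amounts to, here is a minimal sketch using only Python's standard library. It is not the code from search_transcripts: the column names, the sample segments, and the `build_index` / `search` helpers are all made up for illustration, and it assumes your Python's SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical transcript segments: (episode, start_time, text).
segments = [
    ("ep-001", "00:01:10", "We argue about how to pronounce database."),
    ("ep-001", "00:02:45", "Whisper runs locally on a Mac."),
    ("ep-002", "00:00:30", "Full text search makes transcripts useful."),
]


def build_index(conn, rows):
    """Create a standalone FTS5 table and load the segments into it."""
    # UNINDEXED columns are stored and returned but not searched.
    conn.execute(
        "CREATE VIRTUAL TABLE search_data "
        "USING fts5(episode UNINDEXED, start_time UNINDEXED, text)"
    )
    conn.executemany("INSERT INTO search_data VALUES (?, ?, ?)", rows)
    conn.commit()


def search(conn, query):
    """Return matching segments, best match first."""
    return conn.execute(
        "SELECT episode, start_time, text FROM search_data "
        "WHERE search_data MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_index(conn, segments)
hits = search(conn, "pronounce")
```

Each hit carries the episode and timestamp, which is what you need to pull a quote or link back to the audio.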

@marcoshuerta Nice! I love that you link to @simon's Datasette, but I think it'd be _really_ valuable if you actually linked to an example of that in play. I think more folks would consider using something like your tool if they realized they could have a decent interface to the data with essentially no effort.

@marcoshuerta @simon I've found SQLite's FTS5 to be useful but _very_ annoying to set up. I never wrapped my head around how to deal with data across joined tables in it, and having to create a trigger for every Create, Update, and Delete on every table you care about is a PITA.
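
To make the trigger complaint concrete, this is roughly what the hand-rolled setup looks like for a single table, following the external-content pattern in the SQLite FTS5 docs. The `notes` table, its columns, and the `search` helper are hypothetical, and the sketch assumes an FTS5-enabled SQLite build. Every additional table you want searchable needs its own copy of the three triggers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE notes(id INTEGER PRIMARY KEY, title TEXT, body TEXT);

-- The index stores no text of its own; it reads it from notes.
CREATE VIRTUAL TABLE notes_fts
    USING fts5(title, body, content='notes', content_rowid='id');

-- One trigger per write operation keeps the index in sync.
CREATE TRIGGER notes_ai AFTER INSERT ON notes BEGIN
    INSERT INTO notes_fts(rowid, title, body)
    VALUES (new.id, new.title, new.body);
END;

CREATE TRIGGER notes_ad AFTER DELETE ON notes BEGIN
    INSERT INTO notes_fts(notes_fts, rowid, title, body)
    VALUES ('delete', old.id, old.title, old.body);
END;

CREATE TRIGGER notes_au AFTER UPDATE ON notes BEGIN
    INSERT INTO notes_fts(notes_fts, rowid, title, body)
    VALUES ('delete', old.id, old.title, old.body);
    INSERT INTO notes_fts(rowid, title, body)
    VALUES (new.id, new.title, new.body);
END;
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)


def search(query):
    """Search the index and join back to the real table for the row data."""
    return conn.execute(
        "SELECT notes.id, notes.title FROM notes_fts "
        "JOIN notes ON notes.id = notes_fts.rowid "
        "WHERE notes_fts MATCH ? ORDER BY notes_fts.rank",
        (query,),
    ).fetchall()


conn.execute(
    "INSERT INTO notes(title, body) VALUES (?, ?)",
    ("Whisper", "Local transcription of podcasts"),
)
after_insert = search("transcription")

conn.execute("UPDATE notes SET body = ? WHERE id = 1", ("Speech recognition model",))
after_update_old = search("transcription")
after_update_new = search("recognition")

conn.execute("DELETE FROM notes WHERE id = 1")
after_delete = search("recognition")
```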

Full-text search is really important to my current long-term project, and there'll be lots of it, so I'm planning on going with CouchDB + the open-source version of Zinc for search: https://zincsearch.com/

@masukomi @marcoshuerta have you tried my sqlite-utils Python library and CLI tool for FTS5? It has methods that can configure the triggers for you, and run searches with the necessary joins: https://sqlite-utils.datasette.io/en/stable/cli.html#configuring-full-text-search

@simon @masukomi Tangential question: when I create a virtual FTS5 table per the SQLite documentation and then load that .db into Datasette, the virtual table (the one with the search ability) is hidden by default. Am I doing something wrong? I tried Datasette Lite on a file I made with search_transcripts and it works, but I have to click down to the hidden tables to get to the search_data table:

https://lite.datasette.io/?url=https%3A%2F%2Fmarcoshuerta.com%2Ffor_download%2Fscotus_main.db#/scotus_main

@marcoshuerta @masukomi Datasette's search table feature only works if the FTS table was created in a way that references the table you are searching - https://datasette.io/content/repos has a search box because https://datasette.io/content/repos_fts is defined like this:

CREATE VIRTUAL TABLE [repos_fts] USING FTS5 (
[name], [description],
content=[repos]
);

If your FTS table didn't set content=table you can manually configure it in the query string like this: https://docs.datasette.io/en/stable/full_text_search.html#configuring-full-text-search-for-a-table-or-view

@simon Ah, so that sounds like it works with what is called an “external content” table here: https://www.sqlite.org/fts5.html

I am very new to FTS in SQLite. I just followed the CREATE VIRTUAL TABLE example in section 1 of the linked doc and inserted directly into the empty virtual table as if it were a normal table. I didn’t add the FTS table later, after a content table existed.
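
If it helps, one way to get from that standalone setup to an external-content one might look like the sketch below: copy the rows into an ordinary table, create a new FTS5 table that points at it with `content=[...]` (the same form as the repos_fts definition quoted earlier in the thread), and run FTS5's `rebuild` command. The `transcripts` / `transcripts_fts` names, the columns, and the sample rows are assumptions, not the real search_transcripts schema, and the database needs an FTS5-enabled SQLite build.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Starting point described above: rows inserted straight into a
# standalone FTS5 table (hypothetical columns and data).
conn.execute("CREATE VIRTUAL TABLE search_data USING fts5(episode, text)")
conn.executemany(
    "INSERT INTO search_data(episode, text) VALUES (?, ?)",
    [
        ("ep-001", "We argue about how to pronounce database."),
        ("ep-002", "Full text search makes transcripts useful."),
    ],
)

# 1. Copy the rows into an ordinary table that will hold the content.
conn.execute(
    "CREATE TABLE transcripts(id INTEGER PRIMARY KEY, episode TEXT, text TEXT)"
)
conn.execute(
    "INSERT INTO transcripts(id, episode, text) "
    "SELECT rowid, episode, text FROM search_data"
)

# 2. Create an external-content FTS5 table that references it.
conn.execute(
    "CREATE VIRTUAL TABLE transcripts_fts "
    "USING fts5(episode, text, content=[transcripts])"
)

# 3. Populate the index from the content table.
conn.execute("INSERT INTO transcripts_fts(transcripts_fts) VALUES ('rebuild')")
conn.commit()

rows = conn.execute(
    "SELECT t.episode FROM transcripts_fts "
    "JOIN transcripts AS t ON t.rowid = transcripts_fts.rowid "
    "WHERE transcripts_fts MATCH ?",
    ("search",),
).fetchall()
```

Rows added to `transcripts` after this point are not indexed automatically; that still takes triggers or another `rebuild`.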

@marcoshuerta yeah that's the one - I create all of my FTS tables using sqlite-utils so I tend to forget how obscure it can be doing it from scratch