ah shit my attempt to tape text embedding vector search to the side of GTS is actually sort of working. currently prototyping with PGVector and local Ollama running EmbeddingGemma. creating embeddings and indexing them at a few hundred posts per second is using essentially none of my M1 laptop's CPU.
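the pipeline is roughly: POST a batch of post texts to Ollama, get vectors back, shove them into a pgvector column. a sketch of that, not my actual code — the `post_embeddings` table and the `embeddinggemma` model name are assumptions, but Ollama's `/api/embed` endpoint and pgvector's `[x,y,z]` text literal are real:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # Ollama's batch embedding endpoint
MODEL = "embeddinggemma"  # assumption: use whatever `ollama list` actually shows

def embed(texts):
    """Ask the local Ollama server for embeddings of a batch of texts."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": MODEL, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]

def to_pgvector(vec):
    """Format a float list as pgvector's text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"

def index_posts(conn, posts):
    """Insert (id, embedding) rows; assumes a hypothetical
    post_embeddings(post_id, embedding vector) table already exists."""
    vecs = embed([p["text"] for p in posts])
    with conn.cursor() as cur:
        for post, vec in zip(posts, vecs):
            cur.execute(
                "INSERT INTO post_embeddings (post_id, embedding) VALUES (%s, %s::vector)",
                (post["id"], to_pgvector(vec)),
            )
    conn.commit()
```

batching the Ollama call is most of why it's cheap per post: one HTTP round trip covers a whole page of posts.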

the prototype is probably flexible enough to switch to something even more basic like Word2Vec or GloVe for the low end of GTS deployments. figuring out how to get the sqlite-vec extension into GTS WASM SQLite is left as an exercise to the reader.

really i'm just messing around here as i get back into coding for fun, but this could be the start of semantic search, or a custom feed where you give it a list of exemplar posts and it shows you new ones that come in close to one of them.
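the exemplar-feed idea is really just "is this new post near ANY of my liked posts" nearest-neighbor filtering. a toy numpy sketch, assuming the embeddings already exist (the 0.8 threshold and the names are made up):

```python
import numpy as np

def normalize(m):
    """L2-normalize rows so dot product == cosine similarity."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def matches_feed(exemplars, candidates, threshold=0.8):
    """Return one bool per candidate: is it close to ANY exemplar post?

    exemplars:  (E, D) array of embeddings for posts you handpicked
    candidates: (C, D) array of embeddings for new incoming posts
    """
    sims = normalize(candidates) @ normalize(exemplars).T  # (C, E) cosine sims
    return sims.max(axis=1) >= threshold

# toy vectors: candidate 0 points the same way as exemplar 0, candidate 1 doesn't
exemplars = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([[0.9, 0.1], [-1.0, -1.0]])
print(matches_feed(exemplars, candidates))  # [ True False]
```

in Postgres the same max-over-exemplars test would be a pgvector `<=>` cosine-distance query instead, so the index does the work.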

GitHub - pgvector/pgvector: Open-source vector similarity search for Postgres

i'm about to describe some pie in the sky, but: what if a relay could do expensive processing like calculating standardized post text and image embeddings (or even just fetching link preview cards), and then consumers that decide to trust that relay could skip recomputing/refetching all that stuff? they'd only need to calc query embeddings locally (and local posts, obvi). some guy could put an old gaming PC in his garage and hundreds of Fedi servers could do less work.

how's that Mastodon thing for "Fediverse providers" going anyway

also why aren't we using torrents for post media. did people forget torrents exist again

Fediverse Discovery Providers

A project exploring better search and discovery on the Fediverse as an optional, decentralized and pluggable service.

gotta go faster… i started the migration to create embeddings for my 2.7M existing posts 14 hours ago last night, and it's only gotten through about 1.0M of them so far
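back-of-the-envelope on why that hurts, using just the numbers above:

```python
done, total, hours = 1_000_000, 2_700_000, 14

rate = done / (hours * 3600)                  # posts per second so far
remaining_h = (total - done) / done * hours   # hours left at this rate

print(f"{rate:.0f} posts/sec, ~{remaining_h:.0f} more hours to go")
# 20 posts/sec, ~24 more hours to go
```

so the migration is crawling at ~20 posts/sec, way under the few hundred per second the prototype indexer managed.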
@vyr maybe try fasttext instead of a big transformer model?
fastText

Library for efficient text classification and representation learning

@bob i am learning so much from you today, thanks for that. fastText even has a WASM target.
@vyr actually something I just thought of which I think would generate good embeddings for posts: take the word vectors from fastText and train a little convolutional neural network so that distances between posts in the same reply thread are small and distances between posts in different threads are large. and then for running the thing in WASM, use the CNN code from darknet (I've done that before; it's not that hard, has no deps, and is written in plain C)
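the objective @bob is describing is essentially a contrastive loss. a toy numpy version of just the loss term — the margin value and function names are made up, and a real version would train the CNN with autograd rather than computing this by hand:

```python
import numpy as np

def contrastive_loss(a, b, same_thread, margin=1.0):
    """Pull same-thread post embeddings together; push different-thread
    ones at least `margin` apart (the classic contrastive loss shape)."""
    d = np.linalg.norm(a - b)
    if same_thread:
        return d ** 2                      # small distance -> small loss
    return max(0.0, margin - d) ** 2       # far enough apart -> zero loss

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.1])  # nearby pair: cheap if same thread, penalized if not
near_same = contrastive_loss(a, b, same_thread=True)
near_diff = contrastive_loss(a, b, same_thread=False)
```

reply threads give you the positive/negative pairs for free from data GTS already stores, which is the clever part.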