Mastodawn

ah shit my attempt to tape text embedding vector search to the side of GTS is actually sort of working. currently prototyping with PGVector and local Ollama running EmbeddingGemma. creating embeddings and indexing them at a few hundred posts per second is using essentially none of my M1 laptop's CPU.
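
For anyone following along, here is a rough sketch of what a pgvector nearest-neighbor lookup could look like from the client side. It only builds the SQL and parameters; it does not talk to a database. The table and column names (`status_embeddings`, `status_id`, `embedding`) are made up for illustration and are not taken from the prototype, and the choice of cosine distance (pgvector's `<=>` operator) is an assumption too.

```python
from typing import Sequence, Tuple

# Hypothetical table name; the actual prototype may use something else.
TABLE = "status_embeddings"


def to_vector_literal(vec: Sequence[float]) -> str:
    """Format floats in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def nearest_posts_query(query_vec: Sequence[float], limit: int = 10) -> Tuple[str, tuple]:
    """Return (sql, params) for a nearest-neighbor lookup.

    `<=>` is pgvector's cosine distance operator: smaller means closer.
    The %s placeholders follow the psycopg parameter style.
    """
    sql = (
        f"SELECT status_id FROM {TABLE} "
        "ORDER BY embedding <=> %s::vector LIMIT %s"
    )
    return sql, (to_vector_literal(query_vec), limit)
```

The query embedding itself would come from whatever model produced the stored vectors (EmbeddingGemma via Ollama in this case), since vectors from different models are not comparable.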

the prototype is probably flexible enough to switch to something even more basic like Word2Vec or GloVe for the low end of GTS deployments. figuring out how to get the sqlite-vec extension into GTS WASM SQLite is left as an exercise to the reader.

really i'm just messing around here as i get back into coding for fun, but this could be the start of semantic search, or a custom feed where you give it a list of exemplar posts and it shows you new ones that come in close to one of them.
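
The exemplar feed idea can be sketched in a few lines: a new post goes into the feed if its embedding is close enough to any one of the exemplars. This is a minimal illustration, not the prototype's code; the function names and the 0.8 threshold are placeholders, and a real deployment would tune the threshold per model.

```python
import math
from typing import Iterable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_feed(
    post_vec: Sequence[float],
    exemplar_vecs: Iterable[Sequence[float]],
    threshold: float = 0.8,  # arbitrary placeholder value
) -> bool:
    """True if the post is at least `threshold`-similar to any exemplar."""
    return any(cosine_similarity(post_vec, e) >= threshold for e in exemplar_vecs)
```

Matching against any single exemplar, rather than the average of all of them, keeps a feed built from unrelated topics from collapsing into a meaningless midpoint.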

GitHub - pgvector/pgvector: Open-source vector similarity search for Postgres

post-Gundam cryptography Oct 29

i'm about to describe some pie in the sky but: what if a relay could do expensive processing like calculating standardized post text and image embeddings (or even just fetching link preview cards), and then consumers that decide to trust that relay could skip recomputing/refetching all that stuff, so they'd only need to calc query embeddings locally (and local posts obvi). some guy could put an old gaming PC in his garage and then hundreds of Fedi servers could do less work.
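
A consumer-side sketch of that relay idea, under a pile of assumptions: no such protocol exists here, the class and method names are invented, and real trust would need signatures rather than a hostname check. The point is only the shape: accept precomputed vectors from relays you trust when they were made with the same model you use, and fall back to local computation for everything else.

```python
from typing import Callable, Dict, List, Sequence, Set


class EmbeddingCache:
    """Stores post embeddings, preferring ones supplied by trusted relays."""

    def __init__(
        self,
        trusted_relays: Set[str],
        model_id: str,
        embed_locally: Callable[[str], List[float]],
    ) -> None:
        self.trusted_relays = trusted_relays
        self.model_id = model_id
        self.embed_locally = embed_locally
        self.vectors: Dict[str, List[float]] = {}
        self.local_computations = 0

    def accept_from_relay(
        self, relay: str, post_uri: str, model_id: str, vector: Sequence[float]
    ) -> bool:
        """Keep a relay's vector only if the relay is trusted and the model matches."""
        if relay not in self.trusted_relays or model_id != self.model_id:
            return False
        self.vectors[post_uri] = list(vector)
        return True

    def get(self, post_uri: str, text: str) -> List[float]:
        """Return a cached vector, computing it locally only when missing."""
        if post_uri not in self.vectors:
            self.vectors[post_uri] = self.embed_locally(text)
            self.local_computations += 1
        return self.vectors[post_uri]
```

The model check matters because embeddings from different models (or different versions of one model) live in unrelated vector spaces, which is why "standardized" is doing a lot of work in the idea above.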

how's that Mastodon thing for "Fediverse providers" going anyway

also why aren't we using torrents for post media. did people forget torrents exist again

Fediverse Discovery Providers

A project exploring better search and discovery on the Fediverse as an optional, decentralized and pluggable service.

post-Gundam cryptography Oct 29

gotta go faster… i started the migration to create embeddings for my 2.7M existing posts last night, 14 hours ago, and it's only done about 1.0M of them since
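
Back-of-envelope on those numbers, assuming the rate stays constant for the rest of the run (it may well not):

```python
def migration_estimate(done: int, total: int, elapsed_hours: float):
    """Return (posts per second so far, hours remaining at that rate)."""
    rate = done / (elapsed_hours * 3600)
    remaining_hours = (total - done) / rate / 3600
    return rate, remaining_hours


rate, hours_left = migration_estimate(1_000_000, 2_700_000, 14)
print(f"{rate:.1f} posts/s, about {hours_left:.1f} more hours")
# prints: 19.8 posts/s, about 23.8 more hours
```

So the migration is running at roughly 20 posts per second, well below the few hundred per second seen when indexing in the earlier prototype.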

post-Gundam cryptography Nov 1

my thing yesterday was learning the ort API (it's the ONNX Runtime wrapper for Rust). and since i don't know ONNX yet either, it's gonna be my thing tomorrow too

post-Gundam cryptography

wrapping up the first prototype GTS version of this tonight. there's something here, but a lot of the specifics are fussy, and i think going fully out of process, including storage, indexing, tokenization, etc. will be the way to go.

post-Gundam cryptography Nov 2

vyr@Xochiquetzal➜  gotosocial git:(embedding-search) ✗ DEBUG=1 GTS_LOG_LEVEL=warn ./gotosocial --config-path fake-p-i/config.yaml debug query "pictures of rats"
testrig: precompiling ffmpeg WASM
testrig: precompiling ffprobe WASM
=== pictures of rats ===

https://chaosfurs.social/@darkrat/112261068293761345
(Source: emperorsofmischief on Instagram: https://www.instagram.com/reel/C5jam-ESlDE/ <https://www.instagram.com/reel/C5jam-ESlDE/> )
#rats <https://chaosfurs.social/tags/rats>
OPs dad built small cars for pet rats. They quickly learn to drive around in them

https://posts.rat.pictures/@hannah/110199254827214969
Rat pictures
A really wonderful sticker of two cute rats holding paws and my own rat love alley sign sticker beneath it

https://lethargic.talkative.fish/@suricrasia/statuses/01GZ879SWHWPNBFPXN3H61YAZG
🐀
a picture of a rat being moved and rotated around while the amen break plays

https://goblin.technology/@tobi/statuses/01JA5T90MHNH59HGK1XF977FTF
just saw a whole family of rats chilling out, sniffling in the grass, and playing! 😍😍😍
Three brown rats sniffling around in the grass, one of whom is a tiny baby rat!!!
Another shot of the little family of rats, but there's four of them now!

https://icosahedron.website/@halcy/112209746387416602
bundesrat
Picture of a rat with the german flag overlaid

https://mastodon.art/@eondraws/110678663865836781
yum

[#rats <https://mastodon.art/tags/rats> #rat <https://mastodon.art/tags/rat> #ratArt <https://mastodon.art/tags/ratArt> #ratsOfMastodon <https://mastodon.art/tags/ratsOfMastodon> #ratsOfTheFedi <https://mastodon.art/tags/ratsOfTheFedi> #ratsOfTheFediverse <https://mastodon.art/tags/ratsOfTheFediverse> #ratLove <https://mastodon.art/tags/ratLove> #dumboRat <https://mastodon.art/tags/dumboRat> #fancyRat <https://mastodon.art/tags/fancyRat> #raturday <https://mastodon.art/tags/raturday> #petArt <https://mastodon.art/tags/petArt> #pets <https://mastodon.art/tags/pets> #rodent <https://mastodon.art/tags/rodent>]
drawing of a grey dumbo rat holding a Werther's Original sweet in her paws and biting onto it. she looks like she knows she's committing a crime, and is very proud of it. yum

https://cathode.church/@easrng/110189504849737103
A picture of a rat next to a small rainbow-colored toy piano, captioned "Neil banging out the tunes" and dated April 13th, 2006

post-Gundam cryptography Nov 3

https://github.com/VyrCossont/gotosocial/blob/embedding-search/README.md#semantic-search that's my high effort shitpost for the week. i'm like 80% sure it works, provided you use PG, have pgvector, and use the rest of the settings in my addition to the readme. i'm equally sure it could be made faster somehow, possibly just by parallelizing the status embeddings advanced migration. it has updated example config and a few basic tests. (i have not tried with tests/run-postgres.sh yet, however, since that Docker image probably doesn't have pgvector.)

#GtSDev #FediDev

gotosocial/README.md at embedding-search · VyrCossont/gotosocial

moanos Dec 5

@vyr I love the example you chose!

bob Nov 2

@vyr if you're doing it in rust, yeah, but I think you can do it in a fairly small amount of C code in-process. what I'm thinking is byte pair encoding (I already have a pure C library I wrote for that) -> token vectors (I generated those yesterday) -> CNN autoencoder embedding (I have that training right now) -> gaussian random projection -> morton codes -> LMDB. the neural network code is copy/paste from darknet (which is also plain C with no deps)

the big advantage of doing random projection and morton codes in a b-tree index like that (instead of something like HNSW) is that adding a post to the index is just a b-tree insert. writes are fast and there's no need to re-build indexes.

you need cgo to build but there are no dependencies and no external build process so as a go library it should "just work"
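
A toy version of the last three stages of that pipeline (Gaussian random projection, Morton codes, ordered index), in Python rather than C and with a sorted list standing in for LMDB's b-tree. Everything here is illustrative: the function names, the clamp range of [-1, 1], and the bit widths are assumptions, not bob's actual code.

```python
import bisect
import random
from typing import List, Sequence


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> List[List[float]]:
    """Random Gaussian matrix (out_dim x in_dim), scaled by 1/sqrt(out_dim)."""
    rng = random.Random(seed)
    scale = out_dim ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> List[float]:
    """Multiply the projection matrix by the embedding vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def quantize(value: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> int:
    """Clamp to [lo, hi] and map to an integer in [0, 2**bits - 1]."""
    clamped = min(max(value, lo), hi)
    levels = (1 << bits) - 1
    return round((clamped - lo) / (hi - lo) * levels)


def morton_code(coords: Sequence[int], bits: int) -> int:
    """Interleave the bits of each coordinate, most significant bit first."""
    code = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            code = (code << 1) | ((c >> bit) & 1)
    return code


class SortedIndex:
    """Sorted list of (morton_code, post_id); a stand-in for a b-tree."""

    def __init__(self) -> None:
        self.entries: list = []

    def insert(self, key: int, post_id: str) -> None:
        bisect.insort(self.entries, (key, post_id))

    def neighbors(self, key: int, window: int = 2) -> List[str]:
        """Post ids within `window` positions of where `key` would sort."""
        i = bisect.bisect_left(self.entries, (key,))
        return [pid for _, pid in self.entries[max(0, i - window): i + window]]
```

Indexing a post is then `index.insert(morton_code([quantize(v, bits) for v in project(matrix, embedding)], bits), post_id)`, with no rebuild step. The catch with Z-curves is that points close in space can land far apart in code order near cell boundaries, so the neighbors from a window scan are candidates to re-rank, not a final answer.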

post-Gundam cryptography Nov 2

@bob oh, clever, i like the random projection + morton code (or as i know them, Z-curves) approach. assuming it works well in practice.

kouhai, of the health issues Nov 2

@bob @vyr do you have any non-/academic literature for that, out of curiosity

bob Nov 2

@kouhai @vyr the scikit-learn docs do a good job of describing random projection, but apart from that just Wikipedia