@nobody welcome.
@RubenKelevra
Where did I just walk in? xD
@nobody my follower list
@RubenKelevra
🤟🪩
@nobody you work on NixOS? 🤔
@RubenKelevra
Exactly! Mostly just fretting about how to make cuda-scicomp-python less annoying and damaging for the rest of the project, but lately learning e.g. about microvms and other great folks' work in this direction...
@RubenKelevra
I see ipfs, arch, ladybird on your page. What's been on your plate? I followed for memes!
@nobody currently writing a search engine with typo correction and an mmaped file based hash table implementation for it which scales well to tens of millions of entries.
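A minimal sketch of what an mmapped-file hash table with open addressing might look like; this is an assumption-laden illustration (fixed 16-byte slots holding a 64-bit key hash plus a 64-bit value, linear probing, hash 0 reserved as "empty"), not the implementation discussed in this thread:

```python
import mmap
import os
import struct

SLOT = struct.Struct("<QQ")  # (key_hash, value) -- 16 bytes per slot

class MmapHashTable:
    """Open-addressing hash table stored in a memory-mapped file.

    Hypothetical layout for illustration: a key hash of 0 marks an
    empty slot, so callers must supply nonzero key hashes.
    """

    def __init__(self, path: str, slots: int):
        self.slots = slots
        size = slots * SLOT.size
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(fd, size)  # new file is zero-filled, i.e. all slots empty
        self.mm = mmap.mmap(fd, size)
        os.close(fd)

    def _probe(self, key_hash: int):
        # linear probing, starting at the hash's home slot
        i = key_hash % self.slots
        for _ in range(self.slots):
            yield i * SLOT.size
            i = (i + 1) % self.slots

    def put(self, key_hash: int, value: int) -> None:
        for off in self._probe(key_hash):
            h, _ = SLOT.unpack_from(self.mm, off)
            if h == 0 or h == key_hash:  # empty slot, or overwrite same key
                SLOT.pack_into(self.mm, off, key_hash, value)
                return
        raise RuntimeError("hash table is full")

    def get(self, key_hash: int):
        for off in self._probe(key_hash):
            h, v = SLOT.unpack_from(self.mm, off)
            if h == key_hash:
                return v
            if h == 0:  # hit an empty slot: key is absent
                return None
        return None
```

Because the table lives in a memory-mapped file, the OS page cache handles persistence and only the touched pages are resident, which is one way such a structure can stay fast at tens of millions of entries.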
@RubenKelevra
Hmm is that something you plan to run on end user devices or host on a server? Hash table, meaning you're not going the embeddings+kNN approach?

@nobody Thanks for the suggestion, but I think that's going to be too slow. I'm using phrase prefix hashes, so lookups are O(1)-ish unless you make a typo. The plan is to detect that via "not enough good results" and fall back on a Levenshtein distance search, with a distance of 1 for terms under 4 characters and 2 for 4 or more.
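The fallback rule above can be sketched like this; a minimal illustration (the function names and the brute-force vocabulary scan are my assumptions, not the actual code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def max_distance(term: str) -> int:
    # distance 1 below 4 characters, 2 for 4 or more
    return 1 if len(term) < 4 else 2

def fuzzy_candidates(term: str, vocabulary: list[str]) -> list[str]:
    """Fallback when the exact prefix-hash lookup finds too little."""
    limit = max_distance(term)
    return [w for w in vocabulary if levenshtein(term, w) <= limit]
```

The length-dependent threshold keeps short terms strict (one edit can turn a 3-letter word into a completely different one) while allowing two edits on longer terms.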

This gives me a lookup time of 1-4 ms with 25k entries (an entry being a sentence plus a word), with around 900k phrase prefix hashes, on a 15-year-old notebook.

@nobody with 100k entries I'm looking at 3.3 M prefix tokens, and the lookup time goes up to around 8 ms.
@RubenKelevra
Don't mean to suggest anything (not yet anyway), just trying to understand the use case

@nobody ah! It's Omni Box search :)

That's currently the performance with a 100k dataset, which results in 3.69 M phrase prefixes that need to be indexed by the hash table :)

@RubenKelevra
Ty for the demo! All makes sense now xD

@nobody finished my performance rewrite today.

It's now 39.8x faster in my hardcore test.

The hardcore test is feeding in entries which basically rewrite the whole db up to 2 times, shutting down the db, opening it again, and validating every single entry by a search.

Adding 20 million entries, searching 18.7 million entries, and doing 751 cold starts now takes 33.9 minutes instead of 22.5 hours :)