@nobody welcome.
@RubenKelevra
Where did I just walk in? xD
@nobody my follower list
@RubenKelevra
🤟🪩
@nobody you work on NixOS? 🤔
@RubenKelevra
Exactly! Mostly just fretting about how to make cuda-scicomp-python less annoying and damaging for the rest of the project, but lately learning e.g. about microvms and other great folks' work in this direction...
@RubenKelevra
I see ipfs, arch, ladybird on your page. What's been on your plate? I followed for memes!
@nobody currently writing a search engine with typo correction and an mmaped file based hash table implementation for it which scales well to tens of millions of entries.
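A minimal sketch of what an mmapped-file hash table with open addressing might look like; this is an assumption-laden illustration (fixed 16-byte slots holding a 64-bit key hash plus a 64-bit value, linear probing, hash 0 reserved as "empty"), not the implementation discussed in this thread:

```python
import mmap
import os
import struct

SLOT = struct.Struct("<QQ")  # (key_hash, value) -- 16 bytes per slot

class MmapHashTable:
    """Open-addressing hash table stored in a memory-mapped file.

    Hypothetical layout for illustration: a key hash of 0 marks an
    empty slot, so callers must supply nonzero key hashes.
    """

    def __init__(self, path: str, slots: int):
        self.slots = slots
        size = slots * SLOT.size
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(fd, size)  # new file is zero-filled, i.e. all slots empty
        self.mm = mmap.mmap(fd, size)
        os.close(fd)

    def _probe(self, key_hash: int):
        # linear probing, starting at the hash's home slot
        i = key_hash % self.slots
        for _ in range(self.slots):
            yield i * SLOT.size
            i = (i + 1) % self.slots

    def put(self, key_hash: int, value: int) -> None:
        for off in self._probe(key_hash):
            h, _ = SLOT.unpack_from(self.mm, off)
            if h == 0 or h == key_hash:  # empty slot, or overwrite same key
                SLOT.pack_into(self.mm, off, key_hash, value)
                return
        raise RuntimeError("hash table is full")

    def get(self, key_hash: int):
        for off in self._probe(key_hash):
            h, v = SLOT.unpack_from(self.mm, off)
            if h == key_hash:
                return v
            if h == 0:  # hit an empty slot: key is absent
                return None
        return None
```

Because the table lives in a memory-mapped file, the OS page cache handles persistence and only the touched pages are resident, which is one way such a structure can stay fast at tens of millions of entries.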
@RubenKelevra
Hmm is that something you plan to run on end user devices or host on a server? Hash table, meaning you're not going the embeddings+kNN approach?

@nobody Thanks for the suggestion, but I think that's going to be too slow. I'm using phrase prefix hashes, so lookups are O(1)-ish unless you make a typo. The plan is to detect that via "not enough good results" and fall back on a Levenshtein distance search, with a distance of 1 for terms under 4 characters and 2 for 4 or more.
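The fallback rule above can be sketched like this; a minimal illustration (the function names and the brute-force vocabulary scan are my assumptions, not the actual code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def max_distance(term: str) -> int:
    # distance 1 below 4 characters, 2 for 4 or more
    return 1 if len(term) < 4 else 2

def fuzzy_candidates(term: str, vocabulary: list[str]) -> list[str]:
    """Fallback when the exact prefix-hash lookup finds too little."""
    limit = max_distance(term)
    return [w for w in vocabulary if levenshtein(term, w) <= limit]
```

The length-dependent threshold keeps short terms strict (one edit can turn a 3-letter word into a completely different one) while allowing two edits on longer terms.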

This gives me a lookup time of 1-4 ms with 25k entries (an entry being a sentence plus a word), with around 900k phrase prefix hashes, on a 15-year-old notebook.

@nobody with 100k entries I'm looking at 3.3 M prefix tokens, and the lookup time goes up to around 8 ms.
@RubenKelevra
Don't mean to suggest anything (not yet anyway), just trying to understand the use case

@nobody ah! It's Omni Box search :)

That's currently the performance with a 100k dataset, which results in 3.69 M phrase prefixes that need to be indexed by the hash table :)

@RubenKelevra
Ty for the demo! All makes sense now xD

@nobody finished my performance rewrite today.

It's now 39.8x faster in my hardcore test.

The hardcore test is feeding in entries which basically rewrite the whole db up to 2 times, shutting down the db, opening it again, and validating every single entry by a search.

Adding 20 million entries, searching 18.7 million entries, and doing 751 cold starts now takes 33.9 minutes instead of 22.5 hours :)