tl;dr Using https://thi.ng/column-store to accelerate tag intersection queries by a factor of 880x...
Working on the static website generator/export plugin for my personal knowledge tool has been one of the main projects this past month. A key part of this setup is tagging, not just simple flat keywords/categories, but actually treating tags as sets. The system doesn't just allow browsing content by a single tag, but also supports adding (or removing) tags to narrow or widen the current topic. E.g. The combination of "3d + geometry + typescript" would select only works which have all of these three tags...
In the local version of my tool there's no limit to the number of tags (and it also supports tag negation), but for the static site generation I have to limit the set size (due to combinatoric explosion) and pre-compute all possible permutations, then create HTML documents for each these individual combinations which actually produce results.
So far I'm having ~400 unique tags in use, meaning if I want to aim for a max set size of 3, there're theoretically ~64,000,000 possibilities to check[1]! For the roughly 3500 content items used for testing, a naive JS approach to filter the result array and only retain items matching the entire current permutation is so extremely slow, that I stopped the process after 3.5 minutes just for the first 250k (aka 0.4%) of the 64 million permutations, i.e. at that rate the full process would have taken ~15 hours, pretty slow for a SSG... :)
Naive approach 🫣:
```
permutation = ["3d", "geometry", "typescript"]
results.filter(item => permutation.every(tag => item.tags.includes(tag)))
```
But since I'm using https://thi.ng/column-store as my database, such queries can be optimized by a few magnitudes, since here these intersection queries are applied only to bitfields (explained in the pkg readme). This results in all 64+ million permutations being processed in just 62 seconds (1+ million per second). Quite the difference, i.e. ~880x faster than the above approach!
Also, of these 64 million initial possibilities, there're fewer unique ones (excluding duplicates and ignoring ordering), and currently only ~24,000 are actually producing a result. Still, that's 24,000 index pages to generate & host and it's, of course, far, far too much!
So I will have to also spend more effort curating and severely reducing the tag vocabulary, at least the subset used for the website. On the other hand, I think this system will really help with browsing this large body/archive of work much more meaningfully than the boring single-tag/category approach most websites are offering. And it will do so without any backend (other than file hosting)...
[1] Permutations = 400 + 400^2 + 400^3
#ThingUmbrella #Tagging #Intersection #Query #Bitfield #WebDev #JavaScript #TypeScript #Optimization
