tl;dr Using https://thi.ng/column-store to accelerate tag intersection queries by a factor of 880x...

Working on the static website generator/export plugin for my personal knowledge tool has been one of the main projects this past month. A key part of this setup is tagging, not just simple flat keywords/categories, but actually treating tags as sets. The system doesn't just allow browsing content by a single tag, but also supports adding (or removing) tags to narrow or widen the current topic. E.g. The combination of "3d + geometry + typescript" would select only works which have all of these three tags...

In the local version of my tool there's no limit to the number of tags (and it also supports tag negation), but for the static site generation I have to limit the set size (due to combinatoric explosion) and pre-compute all possible permutations, then create HTML documents for each these individual combinations which actually produce results.

So far I'm having ~400 unique tags in use, meaning if I want to aim for a max set size of 3, there're theoretically ~64,000,000 possibilities to check[1]! For the roughly 3500 content items used for testing, a naive JS approach to filter the result array and only retain items matching the entire current permutation is so extremely slow, that I stopped the process after 3.5 minutes just for the first 250k (aka 0.4%) of the 64 million permutations, i.e. at that rate the full process would have taken ~15 hours, pretty slow for a SSG... :)

Naive approach 🫣:

```
permutation = ["3d", "geometry", "typescript"]
results.filter(item => permutation.every(tag => item.tags.includes(tag)))
```

But since I'm using https://thi.ng/column-store as my database, such queries can be optimized by a few magnitudes, since here these intersection queries are applied only to bitfields (explained in the pkg readme). This results in all 64+ million permutations being processed in just 62 seconds (1+ million per second). Quite the difference, i.e. ~880x faster than the above approach!

Also, of these 64 million initial possibilities, there're fewer unique ones (excluding duplicates and ignoring ordering), and currently only ~24,000 are actually producing a result. Still, that's 24,000 index pages to generate & host and it's, of course, far, far too much!

So I will have to also spend more effort curating and severely reducing the tag vocabulary, at least the subset used for the website. On the other hand, I think this system will really help with browsing this large body/archive of work much more meaningfully than the boring single-tag/category approach most websites are offering. And it will do so without any backend (other than file hosting)...

[1] Permutations = 400 + 400^2 + 400^3

#ThingUmbrella #Tagging #Intersection #Query #Bitfield #WebDev #JavaScript #TypeScript #Optimization

In-memory column store database with customizable column types, extensible query engine, bitfield indexing for query acceleration, JSON serialization with optional RLE compression

thi.ng/column-store

An additional improvement is that registers can now be defined inline in markdown code blocks, or can be referenced from an external TOML file that can contain multiple register definitions.

The latter is useful to be able to reference the same register diagram from multiple locations - so the chapter on the EGA can discuss a register in-depth, but the same register can be displayed in the appendix for quick reference without scrolling through a bunch of paragraphs.

When the base register definition is updated, both SVGs will update. And the cool thing is the register TOML reference becomes kind of a standalone register reference that could be used by anybody.

Here's what the "EGA Registers" appendix source looks like now:

## External Register Details

{{#bitfield ega_registers.toml#miscellaneous-output-register}}

{{#bitfield ega_registers.toml#feature-control-register}}

{{#bitfield ega_registers.toml#input-status-register-0}}

#HowToThing #023 — Responsive & reactive image gallery with tag-based Jaccard similarity ranking/filtering using https://thi.ng/bitfield, https://thi.ng/rstream & https://thi.ng/rdom

A quite common comment about #ThingUmbrella is that people often have little idea what some of the ~185 packages are even good/intended for and/or how to synthesize solutions from these small, individual building blocks. IMHO this is less about these packages themselves and more down to existing blank spots about the underlying concepts, algorithms and their potential role/utility in a larger problem domain... So I very much hope this new example is also useful in this respect!

Alas, the full code for this got pretty long and contains a lot more UI stuff. I'm intending to develop this further for the new homepage to browse all ~135 #ThingUmbrella examples (and maybe even for parts of the https://thi.ng website itself)... For those of you interested in more "advanced" https://thi.ng/rdom examples, do check it out!

Background info:
https://en.wikipedia.org/wiki/Jaccard_index

Demo:
https://demo.thi.ng/umbrella/related-images/

Full source code:
https://github.com/thi-ng/umbrella/tree/develop/examples/related-images/src/

The important parts re: using compact binary encoding, bitfields & Jaccard similarity to find related items are here:

https://github.com/thi-ng/umbrella/blob/fc5db1c7a2b9083b40e9be5d6002db937b5a8267/examples/related-images/src/data.ts#L191-L225

#ThingUmbrella #Tagging #Statistics #Similarity #Ranking #Bitfield #TypeScript #JavaScript #UI #Frontend #Reactive #Tutorial