Mastodawn

@zwol Can you elaborate on how your solution works ?

@Sobex It's the simplest possible algorithm for this: Take a growable string which will hold all the tokens seen so far. As each new token comes in, search for it in the string; if you find it, yield the position of the existing instance. If you don't find it, append it to the string and yield that position.

With O(m+n) string search this is something like O(number of words * (average word length + total number of bytes required for all unique words)), which smells quadratic to me.

@zwol So my idea, which I had not analysed was to have some sort of mapping type, e.g. HashMap<String,bool>.

You look up if the token is in the map :
If not insert it with true,, if it is already in the map, then set it to false.

https://en.wikipedia.org/wiki/Trie

@zwol Since we may have optimised sets with prefix trees, I wonder if there's a way to attack metadata on leaves ?

Trie - Wikipedia

@Sobex I couldn't fit this into the original post but the other wrinkle here is I really, really want to avoid doing a bunch of temporary allocations, as required by HashMap<String, T>. A trie might be OK though. I'm not sure what my options are for tries in Rust.

@zwol Well, I found this https://rust-dsa.github.io/trie/trie.html
(There's an annoying IP address trie crate coming in the results)
https://crates.io/crates/hat_trie and https://crates.io/crates/prefix-tree-rs may be worth looking at though.

Overall, you might need to implement your own trie structure.

Trie - Data Structures and Algorithms in Rust

@Sobex Yeah there's several options.

The catch, though, is that what I need for the *next* stage is "repeatedly select a *random element* of the set of strings in O(1) time", for which the bone-simple storage buffer + list of substring positions is ideal and any sort of tree is not.

@zwol Do you need the random to be uniform, and are you removing that element afterwards ?

@Sobex Yes, and no, respectively.

Dr. Sobek

@zwol Yep, so that's a bit annoying.

An interesting question is : what fraction of tokens in there are you expecting to be unique. I'm not sure traversing the trie to generate a flat vector of unique strings at the end of the process is necessarily more expensive than the overhead of using a flat buffer.

(Btw, you really want your search routine to only search starting from appropriate positions in the buffer. Hopefully that's the case, but otherwise that could be expensive).