I'm looking for an algorithm that probably has a name but I don't know what it is.

You have a bunch of text files, which you split into words somehow. You want to compile a list of all the *unique* words across all the files and store this list as *compactly* as possible. For example: input ["a", "a", "ab"], output { store: "ab", index: [(0,1), (0,2)] }.

I know how to do this in quadratic time and O(1) extra space, assuming a O(m+n) string search primitive. Can one do better?

#cs #algorithms

@zwol Can you elaborate on how your solution works ?

@Sobex It's the simplest possible algorithm for this: Take a growable string which will hold all the tokens seen so far. As each new token comes in, search for it in the string; if you find it, yield the position of the existing instance. If you don't find it, append it to the string and yield that position.

With O(m+n) string search this is something like O(number of words * (average word length + total number of bytes required for all unique words)), which smells quadratic to me.

@zwol So my idea, which I had not analysed was to have some sort of mapping type, e.g. HashMap<String,bool>.

You look up if the token is in the map :
If not insert it with true,, if it is already in the map, then set it to false.

@zwol Since we may have optimised sets with prefix trees, I wonder if there's a way to attack metadata on leaves ?

https://en.wikipedia.org/wiki/Trie

Trie - Wikipedia

@Sobex I couldn't fit this into the original post but the other wrinkle here is I really, really want to avoid doing a bunch of temporary allocations, as required by HashMap<String, T>. A trie might be OK though. I'm not sure what my options are for tries in Rust.

@zwol Well, I found this https://rust-dsa.github.io/trie/trie.html
(There's an annoying IP address trie crate coming in the results)
https://crates.io/crates/hat_trie and https://crates.io/crates/prefix-tree-rs may be worth looking at though.

Overall, you might need to implement your own trie structure.

Trie - Data Structures and Algorithms in Rust

@Sobex Yeah there's several options.

The catch, though, is that what I need for the *next* stage is "repeatedly select a *random element* of the set of strings in O(1) time", for which the bone-simple storage buffer + list of substring positions is ideal and any sort of tree is not.

@zwol Do you need the random to be uniform, and are you removing that element afterwards ?
@Sobex Yes, and no, respectively.

@zwol Yep, so that's a bit annoying.

An interesting question is : what fraction of tokens in there are you expecting to be unique. I'm not sure traversing the trie to generate a flat vector of unique strings at the end of the process is necessarily more expensive than the overhead of using a flat buffer.

(Btw, you really want your search routine to only search starting from appropriate positions in the buffer. Hopefully that's the case, but otherwise that could be expensive).

@Sobex I checked and Rust's built in str.find uses an algorithm ("Two-Way") that's as good as it gets for a string search with no persistent state.

I should probably benchmark my naive algorithm and find out if it's actually too slow for this application before I spend a lot of time on anything more complicated.