I'm looking for an algorithm that probably has a name but I don't know what it is.

You have a bunch of text files, which you split into words somehow. You want to compile a list of all the *unique* words across all the files and store this list as *compactly* as possible. For example: input ["a", "a", "ab"], output { store: "ab", index: [(0,1), (0,2)] }.

I know how to do this in quadratic time and O(1) extra space, assuming a O(m+n) string search primitive. Can one do better?

#cs #algorithms

Addendum: For unrelated engineering reasons, an algorithm that does not require building a complex temporary data structure is strongly preferred. Even just a hash table with string keys is troublesome.
@zwol Can you elaborate on how your solution works?

@Sobex It's the simplest possible algorithm for this: Take a growable string which will hold all the tokens seen so far. As each new token comes in, search for it in the string; if you find it, yield the position of the existing instance. If you don't find it, append it to the string and yield that position.

With O(m+n) string search this is something like O(number of words * (average word length + total number of bytes required for all unique words)), which smells quadratic to me.
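For concreteness, a sketch of that scheme (function and variable names are mine, not zwol's): one growable String pool, one flat list of spans, one `find` per incoming token.

```rust
// Sketch of the "growable string" dedup: search the pool for each token,
// append only on a miss, record one (start, end) span per unique token.
fn dedup<'a, I: IntoIterator<Item = &'a str>>(tokens: I) -> (String, Vec<(usize, usize)>) {
    let mut pool = String::new();
    let mut index = Vec::new();
    for token in tokens {
        match pool.find(token) {
            // Token already occurs somewhere in the pool (possibly inside
            // a longer word); record its span once.
            Some(pos) => {
                let span = (pos, pos + token.len());
                if !index.contains(&span) {
                    index.push(span);
                }
            }
            // First sighting: append and record where it landed.
            None => {
                let start = pool.len();
                pool.push_str(token);
                index.push((start, start + token.len()));
            }
        }
    }
    (pool, index)
}

fn main() {
    let (pool, index) = dedup(["a", "a", "ab"]);
    println!("store: {pool:?}, index: {index:?}"); // store: "aab", index: [(0, 1), (1, 3)]
}
```

Note that with this input order the pool ends up as "aab" rather than the optimal "ab" from the opening example: the greedy scheme only exploits overlaps with text that is already stored, not overlaps with words arriving later.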

@zwol So my idea, which I had not analysed, was to have some sort of mapping type, e.g. HashMap<String, bool>.

You look up whether the token is in the map: if it's not there, insert it with true; if it is already in the map, set it to false.
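A literal sketch of that map update (function name is mine): the flag records whether the token has been seen exactly once so far.

```rust
use std::collections::HashMap;

// One pass over the tokens; the bool flips to false on a repeat sighting.
fn seen_once<'a>(tokens: &[&'a str]) -> HashMap<&'a str, bool> {
    let mut seen = HashMap::new();
    for &tok in tokens {
        seen.entry(tok)
            .and_modify(|once| *once = false) // seen again: clear the flag
            .or_insert(true);                 // first sighting
    }
    seen
}

fn main() {
    let seen = seen_once(&["a", "a", "ab"]);
    println!("{seen:?}"); // "a" maps to false, "ab" to true
}
```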

@zwol Since we may have optimised sets with prefix trees, I wonder if there's a way to attach metadata to leaves?

https://en.wikipedia.org/wiki/Trie


@Sobex I couldn't fit this into the original post but the other wrinkle here is I really, really want to avoid doing a bunch of temporary allocations, as required by HashMap<String, T>. A trie might be OK though. I'm not sure what my options are for tries in Rust.

@zwol Well, I found this https://rust-dsa.github.io/trie/trie.html
(There's an annoying IP-address trie crate coming up in the results.)
https://crates.io/crates/hat_trie and https://crates.io/crates/prefix-tree-rs may be worth looking at though.

Overall, you might need to implement your own trie structure.


@Sobex Yeah, there are several options.

The catch, though, is that what I need for the *next* stage is "repeatedly select a *random element* of the set of strings in O(1) time", for which the bone-simple storage buffer + list of substring positions is ideal and any sort of tree is not.

@zwol Do you need the random choice to be uniform, and are you removing that element afterwards?
@Sobex Yes, and no, respectively.

@zwol Yep, so that's a bit annoying.

An interesting question: what fraction of the tokens are you expecting to be unique? I'm not sure traversing the trie to generate a flat vector of unique strings at the end of the process is necessarily more expensive than the overhead of using a flat buffer.

(Btw, you really want your search routine to only search starting from appropriate positions in the buffer. Hopefully that's the case; otherwise it could get expensive.)

@Sobex I checked, and Rust's built-in str::find uses an algorithm ("Two-Way") that's as good as it gets for string search with no persistent state.

I should probably benchmark my naive algorithm and find out if it's actually too slow for this application before I spend a lot of time on anything more complicated.

@zwol @Sobex With a radix tree where each node stores the number of words in its subtree, I think you can pick a word uniformly at random in O(length of the longest word).
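A sketch of that counted-tree idea, using an uncompressed trie for brevity (a real radix tree would compress chains of single-child nodes; the node type and names here are hypothetical):

```rust
// Each node carries the number of words stored in its subtree.
struct Node {
    count: usize,              // words stored in this subtree
    terminal: bool,            // true if a word ends exactly at this node
    children: Vec<(u8, Node)>, // labeled edges, in byte order
}

// Return the r-th word (0 <= r < root.count) in lexicographic order.
// Drawing r uniformly at random therefore yields a uniformly random word,
// in one root-to-leaf descent -- roughly O(length of the longest word).
fn pick(root: &Node, mut r: usize) -> Vec<u8> {
    let mut out = Vec::new();
    let mut node = root;
    'descend: loop {
        if node.terminal {
            if r == 0 {
                return out; // the word ending at this node
            }
            r -= 1; // skip past this node's own word
        }
        for (byte, child) in &node.children {
            if r < child.count {
                out.push(*byte);
                node = child;
                continue 'descend; // descend into this subtree
            }
            r -= child.count; // skip the whole subtree
        }
        unreachable!("r >= root.count");
    }
}

// Hand-built tree holding the words "a" and "ab".
fn demo_tree() -> Node {
    let b = Node { count: 1, terminal: true, children: vec![] };
    let a = Node { count: 2, terminal: true, children: vec![(b'b', b)] };
    Node { count: 2, terminal: false, children: vec![(b'a', a)] }
}

fn main() {
    let root = demo_tree();
    for r in 0..root.count {
        println!("{}", String::from_utf8(pick(&root, r)).unwrap());
    }
}
```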
@zwol I don't know, but a suffix tree can be built in linear time and should answer that question and more, if I am not mistaken.
@uecker Feels like overkill, especially since I couldn't use it for the next stage. I'd have to build the tree and then throw it away.
@zwol Yes, maybe. The other (related) idea is something such as Burrows–Wheeler transform followed by Huffman coding.
@zwol why not just iterate over the words and save each word → (position, bool) pair in a hashmap? If you see a new word, you insert (position, true); if it was already seen, you set the second value in the tuple to false (non-unique). So this would just be a single for-loop and a hashmap. If you want to save space at the cost of accuracy, you could use a Bloom filter instead of a hashmap to check whether a word was seen, and store the positions in another data structure. There's no way for it to be precise, fast, and space-efficient; you need to pick your priority.
@zwol
Isn't this the same as creating a dictionary for the Aho-Corasick algorithm?
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

@zwol this sounds a lot like a superpermutation, but maybe not, since you want only specific combinations and not all of them.

https://en.wikipedia.org/wiki/Superpermutation


@zwol Do you want to store the words or just check whether they were in the files?
@zwol Second option is much more compact.
@nojhan I need to store the words for subsequent use. Also, it needs to be possible to choose a random word from the stored list in O(1) time.
@zwol I’m afraid there’s something I don’t get in the first example. Choosing at random from a contiguous array is trivial in O(1), since one knows its length… what am I missing? Can you give another example?
@nojhan If I *didn't* have a contiguous array of all the strings' locations in the pool coming out of this algorithm — if I replaced the entire mess with a (Python type notation) set[str], for example — then processing downstream of this algorithm would no longer be O(1).
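To make that O(1) claim concrete, a sketch of the downstream selection step (names mine; `r` would come from whatever RNG you use, omitted here to keep the sketch dependency-free):

```rust
// Given the pool and span list produced upstream, selecting a word is one
// array index plus one slice -- both O(1), no allocation, no copying.
fn word_at<'a>(pool: &'a str, index: &[(usize, usize)], r: usize) -> &'a str {
    let (start, end) = index[r % index.len()]; // O(1) pick
    &pool[start..end]                          // O(1) borrowed slice
}

fn main() {
    let (pool, index) = ("aab", [(0, 1), (1, 3)]);
    println!("{}", word_at(pool, &index, 1)); // prints "ab"
}
```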