Mastodawn

I'm looking for an algorithm that probably has a name but I don't know what it is.

You have a bunch of text files, which you split into words somehow. You want to compile a list of all the *unique* words across all the files and store this list as *compactly* as possible. For example: input ["a", "a", "ab"], output { store: "ab", index: [(0,1), (0,2)] }.

I know how to do this in quadratic time and O(1) extra space, assuming a O(m+n) string search primitive. Can one do better?

#cs #algorithms

Show thread

Martin Uecker

@zwol I don't know, but a suffix tree can be built in linear time and should answer that question and more, if I am not mistaken.

Show thread

Zack Weinberg Mar 21

@uecker Feels like overkill especially since I couldn't use it for the next stage. I'd have to build the tree and then throw it away.

Show thread

Martin Uecker Mar 21

@zwol Yes, maybe. The other (related) idea is something such as Burrows–Wheeler transform followed by Huffman coding.