I'm looking for an algorithm that probably has a name but I don't know what it is.

You have a bunch of text files, which you split into words somehow. You want to compile a list of all the *unique* words across all the files and store this list as *compactly* as possible. For example: input ["a", "a", "ab"], output { store: "ab", index: [(0,1), (0,2)] }.

I know how to do this in quadratic time and O(1) extra space, assuming an O(m+n) string-search primitive. Can one do better?
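The quadratic scheme I have in mind looks roughly like this (a Python sketch; `str.find` stands in for the O(m+n) search primitive, and the names are illustrative):

```python
def build_pool(words):
    # Quadratic-time sketch: for each word, search the pool so far;
    # str.find plays the role of the O(m+n) substring-search primitive.
    # Beyond the output itself, only O(1) extra space is used.
    store = ""
    index = []
    for w in words:
        pos = store.find(w)          # is w already a substring of the pool?
        if pos == -1:
            pos = len(store)
            store += w               # not found: append at the end
        loc = (pos, len(w))
        if loc not in index:         # record each unique word once
            index.append(loc)
    return store, index

# Insertion order matters: a word that is a substring of a later word
# only gets shared storage if the longer word is inserted first.
build_pool(["a", "a", "ab"])   # → ("aab", [(0, 1), (1, 2)])
build_pool(["ab", "a", "a"])   # → ("ab", [(0, 2), (0, 1)])
```

Sorting the words longest-first before insertion recovers the compact "ab" store from the example above; squeezing out *overlaps* across word boundaries (not just whole-substring containment) would take more machinery and starts to shade into shortest-common-superstring territory.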

#cs #algorithms

@zwol Do you want to store the words or just check whether they were in the files?
@zwol Second option is much more compact.
@nojhan I need to store the words for subsequent use. Also, it needs to be possible to choose a random word from the stored list in O(1) time.
@zwol I’m afraid there’s something I don’t get in the first example. Choosing at random from a contiguous array is trivial in O(1), since one knows its length… what am I missing? Can you give more examples?
@nojhan If I *didn't* have a contiguous array of all the strings' locations in the pool coming out of this algorithm — if I replaced the entire mess with a (Python type notation) set[str], for example — then processing downstream of this algorithm would no longer be O(1).
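Concretely, the O(1) pick I rely on looks like this (Python sketch, assuming the (offset, length) index layout from the example at the top of the thread):

```python
import random

def random_word(store, index):
    # O(1) uniform pick over the unique words: choose one
    # (offset, length) entry from the contiguous index array,
    # then slice the pool.
    offset, length = random.choice(index)   # O(1) on a Python list
    return store[offset:offset + length]

random_word("ab", [(0, 1), (0, 2)])   # → "a" or "ab"
```

Replace the index array with a `set[str]` and this step degrades: sampling uniformly from a hash set without materializing it is no longer constant time.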