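I can't believe a simple change made my code 100x faster! I was clearly doing it the wrong way initially. I was concatenating multiple `vector`s into one. At first, for each source vector, I called `result.reserve(result.size() + vec.size())` and then did a bunch of `push_back`s. That took an unacceptable 10 s. Then I switched to precalculating the final size of the result vector, doing a single `.resize()`, and then copying each source vector into place. The same operation now takes only ~100 ms. The likely culprit: on common standard library implementations, `reserve` allocates exactly the capacity you ask for, so calling it with an exact size before every batch forces a reallocation (and a move of everything already stored) each time, which bypasses the geometric growth that normally keeps `push_back` amortized constant time.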
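For anyone who wants to compare the two patterns side by side, here is a minimal sketch. The function names `concat_reserve_each` and `concat_resize_once` and the `int` element type are my own assumptions for illustration, not the original code, and the actual speedup will depend on the number and sizes of the vectors and on the standard library in use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Slow pattern (hypothetical reconstruction): an exact-size reserve before
// each batch of push_backs.
std::vector<int> concat_reserve_each(const std::vector<std::vector<int>>& parts) {
    std::vector<int> result;
    for (const auto& vec : parts) {
        // Asks for capacity covering only the elements known so far. On common
        // implementations this reallocates and moves everything already stored
        // whenever vec is non-empty.
        result.reserve(result.size() + vec.size());
        for (int x : vec) {
            result.push_back(x);
        }
    }
    return result;
}

// Fast pattern: compute the total size, resize once, then copy into place.
std::vector<int> concat_resize_once(const std::vector<std::vector<int>>& parts) {
    std::size_t total = 0;
    for (const auto& vec : parts) {
        total += vec.size();
    }

    std::vector<int> result;
    result.resize(total);  // single allocation; elements are value-initialized

    auto out = result.begin();
    for (const auto& vec : parts) {
        // std::copy returns the iterator one past the last element written.
        out = std::copy(vec.begin(), vec.end(), out);
    }
    assert(out == result.end());
    return result;
}
```

Both functions return the same contents; only the allocation pattern differs. A single `result.reserve(total)` followed by `result.insert(result.end(), vec.begin(), vec.end())` for each source vector should also avoid the repeated reallocations, and it skips the value-initialization that `resize` performs.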