It took literally hours, but I have now successfully unpacked

commonswiki-20221120-pages-meta-current.xml.bz2 (15.9GB)

into

commonswiki-20221120-pages-meta-current.xml (168.7GB)

from https://dumps.wikimedia.org/

#WikiResearch #Tinkering


(this is on a very old desktop computer fwiw)

grep'ing 168GB files is fun! 😂

@rmounce

(tried ripgrep?)

@sje I haven't, but I'm very interested in this feature: "ripgrep supports searching files compressed in a common format e.g. bzip2"

it'd be nice if I could directly search the compressed file rather than the uncompressed monster. I presume it would be slower searching a compressed file though, so a trade-off?
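A minimal sketch of the two approaches, assuming ripgrep is installed; the filename is the dump from the thread and the search pattern is just an illustration:

```shell
# ripgrep's -z/--search-zip flag decompresses bzip2/gzip/xz/etc. on the
# fly (it needs the matching decompressor available in PATH).
rg -z 'Category:Birds' commonswiki-20221120-pages-meta-current.xml.bz2

# Roughly equivalent portable pipeline with plain grep, also avoiding
# the 168 GB uncompressed file:
bzip2 -dc commonswiki-20221120-pages-meta-current.xml.bz2 | grep 'Category:Birds'
```

Either way the trade-off is CPU time spent decompressing versus disk time spent reading the much larger uncompressed file.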

@rmounce

it is the bomb. I can't bear to use grep now, it is so much slower.

yes, maybe a trade-off re: searching compressed data files. But there are many other architectural improvements too; I'd recommend you give it a go.

@rmounce It depends on the relative speed of your CPU and disk. With the multistream bz2 and sufficient CPU I can read the XML at about 1500 MiB/s, which is difficult to achieve from an uncompressed filesystem.
https://mamot.fr/@nemobis/109963771248837157
https://phabricator.wikimedia.org/T298436#8665069
Nemo_bis 🌈 (@[email protected])
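A sketch of the kind of pipeline described above, assuming lbzip2 is installed (plain bzip2 works as a single-threaded drop-in):

```shell
# Multistream .bz2 dumps consist of independent compressed blocks, so a
# parallel decompressor like lbzip2 can use all cores; with enough CPU
# the scan becomes disk- or grep-bound rather than decompression-bound.
lbzip2 -dc commonswiki-20221120-pages-meta-current.xml.bz2 \
  | grep -c '<page>'   # e.g. count page elements, no uncompressed copy needed
```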

@[email protected] That's great! I had not considered --line-buffered. This is very fast, scans the JSON at about 700 MiB/s for me. (After switching the tar to bz2 and using lbzip2 for decompression.)
