It took literally hours, but I have now successfully unpacked
commonswiki-20221120-pages-meta-current.xml.bz2 (15.9GB)
into
commonswiki-20221120-pages-meta-current.xml (168.7GB)
(this is on a very old desktop computer fwiw)
grep'ing 168GB files is fun! 😂
(tried ripgrep ?)
@sje haven't but I'm very interested in this feature: "ripgrep supports searching files compressed in a common format e.g. bzip2"
it'd be nice if I could directly search the compressed file rather than the uncompressed monster. I presume it would be slower searching a compressed file though, so a trade-off?
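For the curious, a minimal sketch of what that looks like, using a tiny throwaway sample file (the filename here is made up for illustration). ripgrep's flag for this is `-z`/`--search-zip`; the `bzcat | grep` line below is the portable streaming equivalent, which also avoids ever writing the uncompressed monster to disk:

```shell
# make a tiny .bz2 sample so the commands are reproducible
printf 'page one\nVillage pump\npage two\n' > sample.xml
bzip2 -f sample.xml            # -> sample.xml.bz2

# with ripgrep, search the compressed file directly (decompresses on the fly):
#   rg -z 'Village' sample.xml.bz2
# portable streaming equivalent with coreutils + bzip2:
bzcat sample.xml.bz2 | grep -c 'Village'
```

Either way the bottleneck tends to be decompression speed rather than the search itself, so it is usually still much faster than unpacking first.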
it is the bomb. I can't bear to use grep now, it is so much slower.
yes, maybe a trade-off re: searching compressed data files. But there are many other architectural improvements too, so I'd recommend you give it a go.
@[email protected] That's great! I had not considered --line-buffered. This is very fast, scans the JSON at about 700 MiB/s for me. (After switching the tar to bz2 and using lbzip2 for decompression.)
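Roughly the pipeline described above, sketched against a small made-up file (`t.txt` is hypothetical): lbzip2 decompresses in parallel across all cores, and the result streams into a line-buffered search. The guard falls back to single-threaded `bzcat` if lbzip2 isn't installed:

```shell
# build a tiny compressed input so the pipeline is reproducible
printf 'a\nneedle\nb\n' > t.txt && bzip2 -f t.txt   # -> t.txt.bz2

# prefer parallel lbzip2 for decompression, else fall back to bzcat
DECOMP=$(command -v lbzip2 >/dev/null 2>&1 && echo 'lbzip2 -dc' || echo 'bzcat')

# stream-decompress and count matches; --line-buffered flushes each
# matching line immediately, which helps when piping further downstream
$DECOMP t.txt.bz2 | grep --line-buffered -c 'needle'
```

On a multi-core machine the parallel decompressor is usually the difference between being I/O-bound and being CPU-bound on a dump this size.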