Mastodawn

Well that don’t look right at all

So they’re -9 compressed bz2 files

$ file *.bz2
[...]
DRR187559_1.fastqsanger.bz2: bzip2 compressed data, block size = 900k
DRR187559_2.fastqsanger.bz2: bzip2 compressed data, block size = 900k

And when looking for the bzip2 header that indicates compression and start of file we see:

$ grep BZh9 -c *.bz2   
1.bz2:0
2.bz2:0
3.bz2:0
4.bz2:0
5.bz2:0
6.bz2:0
7.bz2:0
8.bz2:0
9.bz2:1
DRR187559_1.fastqsanger.bz2:229
DRR187559_2.fastqsanger.bz2:259

The first 8 lines are expected: files 1-8 were compressed with the matching compression levels, so their headers are BZh1 through BZh8 and BZh9 wouldn’t show up in them.

But the last two, uhhh, how did you possibly generate bzip2 files with that many headers? Apparently that can happen through concatenation.
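Side note: grep -c counts matching lines, not stream boundaries, and the bytes BZh9 can also turn up by chance inside compressed data, so the numbers above are only a rough indicator. A minimal Python sketch for counting the concatenated streams properly, assuming the file fits in memory (the function name `count_bz2_streams` is made up for illustration):

```python
import bz2

def count_bz2_streams(data: bytes) -> int:
    """Count concatenated bzip2 streams by decompressing them one after another."""
    streams = 0
    while data:
        decomp = bz2.BZ2Decompressor()
        decomp.decompress(data)          # output is discarded, we only want boundaries
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        data = decomp.unused_data        # bytes that follow the end of this stream
    return streams

# Toy data standing in for a real FASTQ file (an assumption, not the DRR187559 data)
part_a = bz2.compress(b"@read1\nACGT\n+\nIIII\n")
part_b = bz2.compress(b"@read2\nTTTT\n+\nIIII\n")
print(count_bz2_streams(part_a))
print(count_bz2_streams(part_a + part_b))
```

For a real file you would pass in `open(path, "rb").read()`, or rework this to read in chunks if the file is large.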

Fun fact: bzip2 reads _2 fine.
Funner fact: basically no other implementations do, i.e. most bioinformatics tools. They just read the first entry and are done. But we only know this because _2 is split mid-read, unlike _1, which runs successfully while actually failing.

$ fastqc DRR187559_1.fastqsanger.bz2 
application/x-bzip2
Started analysis of DRR187559_1.fastqsanger.bz2
Analysis complete for DRR187559_1.fastqsanger.bz2
fastqc DRR187559_1.fastqsanger.bz2  4.67s user 0.35s system 150% cpu 3.334 total

FastQC reports 1927 reads, which is off by a lot (451782 is the correct value). We’d never know unless we carefully checked this.
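To see how that kind of undercount can happen, here is a hedged Python sketch that compares a reader that stops after the first bzip2 stream with one that reads all of them. It assumes plain 4-line FASTQ records, and both function names are hypothetical; it is not how FastQC itself is implemented.

```python
import bz2

def count_reads_first_stream(data: bytes) -> int:
    """Mimic a reader that stops at the end of the first bzip2 stream."""
    decomp = bz2.BZ2Decompressor()
    text = decomp.decompress(data)       # anything after stream 1 lands in unused_data
    return text.count(b"\n") // 4        # assumes 4 lines per FASTQ record

def count_reads_all_streams(data: bytes) -> int:
    """Read every concatenated stream; bz2.decompress handles multi-stream input."""
    return bz2.decompress(data).count(b"\n") // 4

# Toy data: one read in the first stream, two in the second
stream_a = bz2.compress(b"@r1\nACGT\n+\nIIII\n")
stream_b = bz2.compress(b"@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n")
data = stream_a + stream_b
print(count_reads_first_stream(data), count_reads_all_streams(data))
```

If the two numbers differ on a real file, whatever tool you ran may only have seen the first stream.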

So if your tool breaks on a bzip2 file, try decompressing and recompressing, and updating your resume on LinkedIn while you find a new career.
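The decompress-and-recompress step can be sketched in Python roughly like this. The helper name `recompress_single_stream` and the temp file names are made up for the demo; the sketch relies on `bz2.open` reading across stream boundaries and writing a single stream.

```python
import bz2
import os
import shutil
import tempfile

def recompress_single_stream(src: str, dst: str) -> None:
    """Rewrite a (possibly multi-stream) bzip2 file as one single stream."""
    # Reading goes through all concatenated streams; one writer emits one stream.
    with bz2.open(src, "rb") as fin, bz2.open(dst, "wb", compresslevel=9) as fout:
        shutil.copyfileobj(fin, fout)

# Demo with a toy two-stream file in a temporary directory
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "concatenated.fastq.bz2")
dst = os.path.join(workdir, "single.fastq.bz2")
with open(src, "wb") as handle:
    handle.write(bz2.compress(b"@r1\nACGT\n+\nIIII\n") + bz2.compress(b"@r2\nTTTT\n+\nIIII\n"))
recompress_single_stream(src, dst)

# Check: a first-stream-only reader now sees everything, with nothing left over
with open(dst, "rb") as handle:
    decomp = bz2.BZ2Decompressor()
    payload = decomp.decompress(handle.read())
print(decomp.eof, decomp.unused_data == b"", payload.count(b"\n") // 4)
```

On the command line, piping `bzip2 -dc` into `bzip2 -9` should do the same job.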

#bioinformatics #software


Peter Cock Feb 16, 2024

@hexylena Concatenation is hopefully not usually an issue with gzip support? That’s the basis of the blocks in BGZF (blocked gzip format) as used in BAM files


dr. Violet Feb 17, 2024

@pjacock from the samtools manual, yes that seems correct. I guess more folks had to process bgzip files, so support for that is more common than for bbzip2 files? It seems silly to me you wouldn't support multiple gzip (or bz2) files in a stream but I can see it's also an unusual thing to do in the first place

> The BGZF format written by bgzip is described in the SAM format specification available from http://samtools.github.io/hts-specs/SAMv1.pdf. It makes use of a gzip feature which allows compressed files to be concatenated.


Peter Cock Feb 17, 2024

@hexylena Right - I don’t recall details now but pretty sure there were a few bugs with concatenated gzip files back in the early days of BAM.

And the blocks are great for random access in general - me ten years ago: https://blastedbio.blogspot.com/2013/04/random-access-to-blocked-xz-format-bxzf.html?m=1

Random access to blocked XZ format (BXZF)

I've written about random access to the blocked GZIP variant BGZF used in BAM, and looked at random access to BZIP2 , but here I'm looking ...