Well that don’t look right at all
So they’re -9 compressed bz2 files
$ file *.bz2
[...]
DRR187559_1.fastqsanger.bz2: bzip2 compressed data, block size = 900k
DRR187559_2.fastqsanger.bz2: bzip2 compressed data, block size = 900k
And when looking for the bzip2 header that indicates compression and start of file we see:
$ grep BZh9 -c *.bz2
1.bz2:0
2.bz2:0
3.bz2:0
4.bz2:0
5.bz2:0
6.bz2:0
7.bz2:0
8.bz2:0
9.bz2:1
DRR187559_1.fastqsanger.bz2:229
DRR187559_2.fastqsanger.bz2:259
The first 8 lines are expected: the header is BZh followed by the compression level, so BZh9 wouldn’t show up in 1-8, which were compressed with the matching compression levels.
But the last two, uhhh, how did you possibly generate bzip2 files with that many headers? Apparently that can happen through concatenation.
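A minimal sketch of how concatenation produces those extra headers, using Python's standard-library bz2 module. The sample reads are made up. It counts a longer marker than grep did: the BZh9 header plus the fixed block magic ("1AY&SY") that follows it in any non-empty stream, which is less likely to match by accident. Also keep in mind that grep -c counts matching lines, not occurrences, so the numbers above are only a rough header count.

```python
import bz2

# Made-up stand-ins for two chunks of FASTQ data.
chunk_a = b"@read1\nACGT\n+\nIIII\n"
chunk_b = b"@read2\nTTGA\n+\nIIII\n"

# Compress each chunk separately at level 9, then glue the results together.
stream_a = bz2.compress(chunk_a, compresslevel=9)
stream_b = bz2.compress(chunk_b, compresslevel=9)
concatenated = stream_a + stream_b

# Every non-empty level-9 stream starts with "BZh9" followed by the
# block magic 0x314159265359, which is "1AY&SY" in ASCII.
header_count = concatenated.count(b"BZh91AY&SY")
print(header_count)
```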
Fun fact: bzip2 reads _2 fine.
Funner fact: basically no other implementations do, i.e. most bioinformatics tools. They just read the first stream and are done. But we only know this because _2 is split mid-read, so a tool that stops after the first stream hits a truncated record, unlike _1, which runs successfully while actually failing.
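To make the "read the first stream and stop" behaviour concrete, here is a small sketch with Python's standard bz2 module. This is not how FastQC or any specific tool is implemented; it only shows the same pattern with made-up data: a one-stream decoder stops quietly at the first end-of-stream marker, raises no error, and leaves the rest unread.

```python
import bz2

# Made-up data: two reads, each compressed as its own stream.
first = b"@read1\nACGT\n+\nIIII\n"
second = b"@read2\nTTGA\n+\nIIII\n"
multi_stream = bz2.compress(first) + bz2.compress(second)

# A BZ2Decompressor handles exactly one stream: it stops at the end of
# the first one and parks the remaining bytes in .unused_data.
single = bz2.BZ2Decompressor()
naive_output = single.decompress(multi_stream)

# bz2.decompress() keeps going until every stream has been consumed.
full_output = bz2.decompress(multi_stream)

print(len(naive_output), len(full_output))
```

Because the first stream here ends on a record boundary, the truncated output is still valid FASTQ, which is the _1 situation: nothing complains.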
$ fastqc DRR187559_1.fastqsanger.bz2
application/x-bzip2
Started analysis of DRR187559_1.fastqsanger.bz2
Analysis complete for DRR187559_1.fastqsanger.bz2
fastqc DRR187559_1.fastqsanger.bz2 4.67s user 0.35s system 150% cpu 3.334 total
FastQC reports 1927 reads, which is off by, a lot. (451782 is the correct value.) We’d never know unless we carefully checked.
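One way to do that check is to count the reads yourself with a reader that is known to handle multiple streams. A rough sketch, assuming plain 4-line FASTQ records; count_fastq_reads and the demo file are hypothetical names for illustration, and Python's bz2.open does read concatenated streams.

```python
import bz2
import os
import tempfile

def count_fastq_reads(path):
    """Count reads in a bzip2-compressed FASTQ, assuming 4 lines per read."""
    with bz2.open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4

# Demo on a made-up two-stream file holding three reads in total.
chunks = [
    b"@r1\nACGT\n+\nIIII\n",
    b"@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq.bz2")
    with open(demo_path, "wb") as out:
        for chunk in chunks:
            out.write(bz2.compress(chunk))
    total_reads = count_fastq_reads(demo_path)

print(total_reads)
```

If this number disagrees with what your tool reports, the tool probably stopped early.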
So if your tool breaks on a bzip2 file, try decompressing and recompressing, and updating your resume on LinkedIn while you find a new career.
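The decompress-and-recompress part could look roughly like this in Python. It is a sketch under assumptions: recompress_single_stream and the file names are made up, and it relies on bz2.open reading all concatenated streams and writing a single one.

```python
import bz2
import os
import shutil
import tempfile

def recompress_single_stream(src_path, dst_path, level=9):
    """Rewrite a (possibly multi-stream) bzip2 file as a single stream."""
    with bz2.open(src_path, "rb") as src, \
         bz2.open(dst_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)

# Demo: build a made-up two-stream file, then rewrite it.
chunks = [b"@r1\nACGT\n+\nIIII\n", b"@r2\nTTGA\n+\nIIII\n"]
with tempfile.TemporaryDirectory() as tmp:
    broken = os.path.join(tmp, "multi.fastq.bz2")
    fixed = os.path.join(tmp, "single.fastq.bz2")
    with open(broken, "wb") as out:
        for chunk in chunks:
            out.write(bz2.compress(chunk))
    recompress_single_stream(broken, fixed)
    with open(fixed, "rb") as handle:
        fixed_bytes = handle.read()

# A one-stream decoder now sees everything, with nothing left over.
checker = bz2.BZ2Decompressor()
recovered = checker.decompress(fixed_bytes)
leftover = checker.unused_data
```

On the command line the same idea is bzip2 -dc old.bz2 | bzip2 -9 > new.bz2, with placeholder file names.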