Are we throwing away good data? Evaluation of chimera detection algorithms on long-read amplicons reveals high false-positive rates across algorithms
TL;DR: uchime_denovo with default parameters offers the best precision and recall
Long-read amplicon sequencing has enabled a return to full-length DNA barcodes, which offer higher taxonomic resolution in metabarcoding-based biodiversity studies. However, chimeric sequences (artificial constructs formed when incomplete amplicons fuse during polymerase chain reaction (PCR)) remain a challenge, potentially skewing diversity estimates and ecological inferences. Here, we benchmark three de novo chimera detection algorithms, uchime_denovo, removeBimeraDenovo, and chimeras_denovo, on simulated and empirical eukaryotic full-ITS (rRNA ITS1-5.8S-ITS2) datasets to evaluate their precision, sensitivity, and effects on final OTU composition and community structure. On simulated data, uchime_denovo achieved the highest precision even with default settings, whereas the other algorithms produced high false-positive chimera rates unless their settings were adjusted. Similarly, tests on empirical data showed that uchime_denovo had a lower false-positive rate, whereas about half of the sequences flagged as putative chimeras by chimeras_denovo and removeBimeraDenovo were false positives. We found that most of the false-negative chimeras contained multiple 5.8S regions, indicating PacBio library preparation artifacts rather than PCR artifacts. However, OTU-level comparisons indicated that overall richness and community-ordination patterns remain largely consistent across chimera-filtering approaches, with or without accounting for false positives and false negatives.
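To make the evaluation terms concrete, the following is a minimal sketch of how precision and sensitivity could be computed for a chimera detector when the true chimeras are known, as they are in a simulated dataset. The function name `chimera_metrics`, the sequence IDs, and the toy labels are hypothetical illustrations only; they are not data or code from this study.

```python
def chimera_metrics(true_chimeras, flagged, all_ids):
    """Compare a detector's flagged sequences against known chimera labels.

    true_chimeras: IDs known to be chimeric (e.g. from a simulation)
    flagged:       IDs the detector reported as chimeric
    all_ids:       every sequence ID given to the detector
    """
    true_chimeras, flagged, all_ids = set(true_chimeras), set(flagged), set(all_ids)

    tp = len(flagged & true_chimeras)            # chimeras correctly flagged
    fp = len(flagged - true_chimeras)            # good sequences wrongly flagged
    fn = len(true_chimeras - flagged)            # chimeras that were missed
    tn = len(all_ids - flagged - true_chimeras)  # good sequences correctly kept

    # Precision: fraction of flagged sequences that are truly chimeric.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    # Sensitivity (recall): fraction of true chimeras that were flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")

    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "sensitivity": sensitivity}


# Hypothetical toy input: 10 sequences, 4 of which are true chimeras.
all_ids = [f"seq{i}" for i in range(1, 11)]
true_chimeras = ["seq1", "seq2", "seq3", "seq4"]
flagged = ["seq1", "seq2", "seq3", "seq7", "seq8", "seq9"]

metrics = chimera_metrics(true_chimeras, flagged, all_ids)
print(metrics)
```

In this toy case half of the flagged sequences are false positives (precision 0.5) while three of the four true chimeras are caught (sensitivity 0.75), which illustrates how a detector can look sensitive while still discarding good data.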