HAPP: High-accuracy pipeline for processing deep metabarcoding data
Author summary

Charting and monitoring biodiversity is essential for understanding and protecting ecosystems, but it has been difficult to collect data cost-efficiently at scale. An approach that potentially solves this problem is metabarcoding, a method that can be applied to DNA from environmental samples to identify many species at once. Unfortunately, it may produce misleading results due to noise in the data. A particularly challenging problem when analysing data from mitochondrial DNA, such as the CO1 gene often used for analysing insect biodiversity, is the existence of nuclear-encoded copies of the gene, which can severely inflate diversity estimates. We created an algorithm called NEEAT that helps remove such misleading signals by combining information from multiple samples and spotting unusual patterns of genetic change. We also tested many existing tools for the other steps of data processing, and combined NEEAT with the best of them to create a new, high-accuracy analysis pipeline that we call HAPP. Using both simulated and real-world insect data, we show that our approach is not only more accurate than current methods but also efficient at handling large datasets. Our work aims to make biodiversity studies more precise and scalable, supporting better conservation and environmental decision-making.