PhylteR, our new tool for filtering phylogenomics datasets, is now out!

https://doi.org/10.1093/molbev/msad234

PhylteR identifies with precision, from a collection of gene trees, the "outlier" sequences responsible for a lack of concordance among gene trees.

How does it work? A small thread 👇

#phylogenomics

PhylteR: efficient identification of outlier sequences in phylogenomic datasets

OUP Academic
PhylteR starts from a collection of distance matrices (pairwise patristic distances between species) retrieved from individual gene trees (or, optionally, directly from multiple sequence alignments).
Missing data (if any) are imputed to ensure equal dimensions of all matrices.
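
To make the imputation step concrete, here is a minimal sketch. It assumes that all matrices have already been expanded to the same species order, that missing entries are marked with `None`, and that a gap is filled with the mean of the same cell across the genes where it is known. The function name `impute_missing` and the toy matrices are hypothetical, not PhylteR's own code.

```python
def impute_missing(matrices):
    """Fill None cells with the mean of that cell across the other matrices."""
    n = len(matrices[0])
    filled = [[row[:] for row in m] for m in matrices]
    for i in range(n):
        for j in range(n):
            known = [m[i][j] for m in matrices if m[i][j] is not None]
            if not known:
                continue  # no gene has this pair; leave the gap untouched
            mean = sum(known) / len(known)
            for m in filled:
                if m[i][j] is None:
                    m[i][j] = mean
    return filled

# Toy input: the third species is absent from gene 2 (its row and column are None).
gene1 = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
gene2 = [[0.0, 1.2, None], [1.2, 0.0, None], [None, None, None]]
gene3 = [[0.0, 0.8, 4.0], [0.8, 0.0, 5.0], [4.0, 5.0, 0.0]]
imputed = impute_missing([gene1, gene2, gene3])
```

After this step every matrix has the same dimensions, which is what the rest of the process needs.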
Then the process at the heart of PhylteR starts. It is based on DISTATIS, an extension of multidimensional scaling to three-way data, i.e. several distance matrices analyzed jointly. Here is what happens (simplified):
1/ RV-coefficients (~correlation) between matrices are computed and used to assign a weight to each matrix
(matrices that are very dissimilar to the others are assigned a lower weight).
2/ These weights are used in the creation of the "Compromise Matrix", a distance matrix obtained by computing the weighted average of the individual distance matrices.
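
Steps 1/ and 2/ can be sketched in a few lines of plain Python. This follows the simplified description of the thread: RV coefficients are computed directly on the distance matrices, the weights are taken as the first eigenvector of the RV matrix (obtained by power iteration and rescaled to sum to 1), and the compromise is the weighted average. DISTATIS itself works on transformed (double-centered, normalized) matrices, so treat the function names and the toy data as illustrative assumptions rather than PhylteR's implementation.

```python
import math

def rv_coefficient(a, b):
    """RV coefficient between two symmetric matrices (a matrix-level correlation)."""
    num = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    den = math.sqrt(sum(x * x for r in a for x in r) *
                    sum(y * y for r in b for y in r))
    return num / den

def matrix_weights(matrices, iters=200):
    """First eigenvector of the RV matrix, rescaled so the weights sum to 1."""
    k = len(matrices)
    rv = [[rv_coefficient(a, b) for b in matrices] for a in matrices]
    v = [1.0 / k] * k
    for _ in range(iters):  # power iteration; RV entries are non-negative here
        w = [sum(rv[i][j] * v[j] for j in range(k)) for i in range(k)]
        total = sum(w)
        v = [x / total for x in w]
    return v

def compromise(matrices, weights):
    """Weighted average of the individual matrices."""
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]

# Toy data: genes A and B agree, gene C tells a different story.
gene_a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
gene_b = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
gene_c = [[0, 3, 1], [3, 0, 1], [1, 1, 0]]
genes = [gene_a, gene_b, gene_c]
weights = matrix_weights(genes)
comp = compromise(genes, weights)
```

In this toy case the discordant matrix `gene_c` ends up with the smallest weight, so it contributes less to the compromise.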
3/ The compromise matrix is then projected onto the "compromise space". There, each dot represents the average position of a species with respect to the others; the distance between dots reflects the distance between the species in the compromise matrix.
4/ Then, on this same space, each individual matrix is projected, so that the position of each species (small dots) in each matrix can be compared to its average position (large dots).
This is actually very cool! Because one can then compute a matrix from these projections, giving, for each species in each individual gene, its distance to its average position according to the compromise. We call this the 2-way reference (2WR) matrix, a gene x species matrix where outliers (large values) can then be spotted.
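
Steps 3/ and 4/ and the 2WR matrix can be sketched as follows, in plain Python. The sketch assumes the classical MDS recipe (squared distances are double-centered into a cross-product matrix), keeps two axes found by power iteration with deflation, and projects every matrix with the same formula (cross-product matrix times eigenvectors, scaled by the inverse square root of the eigenvalues). All function names and the toy coordinates are hypothetical; PhylteR's actual code may differ in normalization and in the number of axes kept.

```python
import math

def double_center(dist):
    """Turn a distance matrix into a cross-product matrix (classical MDS)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
             for j in range(n)] for i in range(n)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpairs(sym, k=2, iters=500):
    """Leading eigenpairs of a symmetric matrix via power iteration + deflation."""
    n = len(sym)
    work = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
        for _ in range(iters):
            w = matvec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(work, v)))
        pairs.append((lam, v))
        work = [[work[i][j] - lam * v[i] * v[j] for j in range(n)]
                for i in range(n)]
    return pairs

def project(cross, pairs):
    """Species coordinates on the compromise axes."""
    return [[sum(a * b for a, b in zip(row, v)) / math.sqrt(lam)
             if lam > 1e-12 else 0.0 for lam, v in pairs] for row in cross]

def two_way_reference(compromise_dist, gene_dists):
    """Gene x species matrix: distance of each species to its compromise position."""
    ref_cross = double_center(compromise_dist)
    pairs = top_eigenpairs(ref_cross)
    ref = project(ref_cross, pairs)  # large dots
    rows = []
    for dist in gene_dists:
        pos = project(double_center(dist), pairs)  # small dots
        rows.append([math.dist(p, r) for p, r in zip(pos, ref)])
    return rows

def distances(points):
    return [[math.dist(p, q) for q in points] for p in points]

# Toy data: gene 1 matches the compromise, gene 2 has a misplaced fourth species.
compromise_dist = distances([(0, 0), (4, 0), (0, 2), (4, 2)])
gene1 = distances([(0, 0), (4, 0), (0, 2), (4, 2)])
gene2 = distances([(0, 0), (4, 0), (0, 2), (10, 2)])
two_wr = two_way_reference(compromise_dist, [gene1, gene2])
```

In this toy case the row for gene 1 is essentially zero, while in the row for gene 2 the largest value belongs to the misplaced species.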
5/ From this 2WR matrix, we detect outlier values and store them in a list. We then remove these outliers from the initial distance matrices and compute a new compromise matrix. If the compromise is improved, we go through the loop again and look for new outliers (if any). Etc.
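
The detection part of step 5/ can be illustrated with a simple rule. The sketch below assumes a plain "third quartile plus k times the interquartile range" threshold applied to all values of the 2WR matrix at once; the rule, the default `k=3.0`, the function name `find_outliers` and the toy values are assumptions made for illustration. The paper describes how PhylteR actually normalizes the matrix and flags outliers.

```python
import statistics

def find_outliers(two_wr, k=3.0):
    """Return (gene, species) cells whose 2WR value exceeds Q3 + k * IQR."""
    values = [v for row in two_wr.values() for v in row.values()]
    q1, _, q3 = statistics.quantiles(values, n=4)
    threshold = q3 + k * (q3 - q1)
    return [(gene, species)
            for gene, row in two_wr.items()
            for species, v in row.items()
            if v > threshold]

# Toy 2WR matrix: species spC is far from its average position in gene3 only.
two_wr = {
    "gene1": {"spA": 0.10, "spB": 0.12, "spC": 0.11},
    "gene2": {"spA": 0.09, "spB": 0.10, "spC": 0.13},
    "gene3": {"spA": 0.10, "spB": 0.11, "spC": 2.50},
}
outliers = find_outliers(two_wr)
```

In the full loop, the flagged cells would be removed from the initial matrices, the compromise recomputed, and the detection repeated for as long as the compromise keeps improving.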

6/ Well, this was quick, but you get the idea!? At the end, PhylteR users obtain a list of identified outliers. It is then up to them to decide what to do with it (filter MSAs, prune gene trees, explore outliers, etc.).

For more details read the paper!
And **GIVE IT A TRY!!**

7/ PhylteR is an R package available on CRAN (https://cran.r-project.org/web/packages/phylter/index.html), and also as Singularity and Docker containers.
Extensive documentation can be found at https://damiendevienne.github.io/phylter/index.html.
phylter: Detect and Remove Outliers in Phylogenomics Datasets

Analysis and filtering of phylogenomics datasets. It takes as input either a collection of gene trees (then transformed to matrices) or directly a collection of gene matrices and performs an iterative process to identify which species in which genes are outliers, and whose elimination significantly improves the concordance between the input matrices. The method builds upon the Distatis approach (Abdi et al. (2005) <doi:10.1101/2021.09.08.459421>), a generalization of classical multidimensional scaling to multiple distance matrices.

PhylteR was developed and written over the years with great students and colleagues:
Aurore Comte, Theo Tricou, Eric Tannier, @Julien_JOSEPH, Aurélie Siberchicot, Simon Penel, Rémi Allio, Frédéric Delsuc and Stéphane Dray.

Thanks! 🙏