5/ From this 2WR matrix, we detect outlier values, store them in a list, remove them directly from the initial distance matrices, and compute a new compromise matrix. If the compromise improves, we run the loop again and look for new outliers (if any), and so on.
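A minimal sketch of the outlier-spotting step on a 2WR matrix. The thread does not spell out the detection rule, so the one used here (flag values above Q3 + k x IQR, with k = 3) is an assumption for illustration, as are the function name `flag_outliers` and the toy values.

```python
import statistics

def flag_outliers(twr, k=3.0):
    """Return (gene_index, species_index) pairs whose 2WR value is unusually large.

    Assumed rule (not taken from the thread): a value is an outlier when it
    exceeds Q3 + k * IQR, computed over all cells of the matrix.
    """
    values = [v for row in twr for v in row]
    q1, _, q3 = statistics.quantiles(values, n=4)
    threshold = q3 + k * (q3 - q1)
    return [
        (g, s)
        for g, row in enumerate(twr)
        for s, v in enumerate(row)
        if v > threshold
    ]

# Toy gene x species matrix: rows are genes, columns are species.
twr = [
    [1.0, 1.2, 0.9, 1.1],
    [1.0, 0.8, 1.1, 9.0],  # species 3 sits far from its average position in gene 1
    [0.9, 1.0, 1.2, 1.0],
]
outliers = flag_outliers(twr)
print(outliers)
```

In the loop described above, the flagged (gene, species) cells would then be removed from the initial distance matrices before recomputing the compromise.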
[...] a matrix from these projections, giving for each species in each individual gene, its distance to its average position according to the compromise. We call this the 2-way reference (2WR) matrix, a gene x species matrix where outliers (large values) can then be spotted.
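As a hedged illustration of how such a gene x species matrix could be filled in: for each gene, take the distance between each species' projected position and its position in the compromise. The coordinates, the use of Euclidean distance, and the function name `two_way_reference` are assumptions made for this sketch.

```python
import math

def two_way_reference(compromise_pos, gene_positions):
    """Build a gene x species matrix of distances to the compromise position.

    compromise_pos: dict mapping species name -> coordinates (large dots)
    gene_positions: list of dicts, one per gene, with the same keys (small dots)
    """
    species = sorted(compromise_pos)
    return [
        [math.dist(pos[s], compromise_pos[s]) for s in species]
        for pos in gene_positions
    ]

# Hypothetical 2D coordinates for two species in the compromise space.
compromise_pos = {"A": (0.0, 0.0), "B": (1.0, 0.0)}
gene_positions = [
    {"A": (0.0, 0.0), "B": (1.0, 0.0)},  # gene 0 agrees with the compromise
    {"A": (3.0, 4.0), "B": (1.0, 0.0)},  # species A is displaced in gene 1
]
twr = two_way_reference(compromise_pos, gene_positions)
print(twr)
```

Large cells in the result point at the (gene, species) pairs that sit far from their average position.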
4/ Then, on this same space, each individual matrix is projected, so that the position of each species (small dots) in each matrix can be compared to its average position (large dots).
This is actually very cool! Because one can then compute [...]
3/ The compromise matrix is then projected on the "compromise space". There, each dot represents the average position of each species with respect to the others; distance between dots reflects the distance between the species in the compromise matrix.
(matrices that are very dissimilar to the others are assigned a lower weight).
2/ These weights are used in the creation of the "Compromise Matrix", a distance matrix obtained by computing the weighted average of the individual distance matrices.
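The weighted average itself is simple to sketch. The function name `compromise_matrix` and the toy weights below are made up for illustration; the actual weights come from the RV-coefficient step.

```python
def compromise_matrix(matrices, weights):
    """Element-wise weighted average of same-sized square matrices.

    Weights are rescaled to sum to 1, so only their relative sizes matter.
    """
    total = sum(weights)
    norm = [w / total for w in weights]
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(norm, matrices)) for j in range(n)]
        for i in range(n)
    ]

# Two toy 2x2 distance matrices; the first gets three times the weight.
m1 = [[0.0, 2.0], [2.0, 0.0]]
m2 = [[0.0, 4.0], [4.0, 0.0]]
compromise = compromise_matrix([m1, m2], weights=[3.0, 1.0])
print(compromise)
```

A matrix with a low weight pulls the compromise less, which is why dissimilar matrices have limited influence on the average positions.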
Then the process at the heart of PhylteR starts. It is based on DISTATIS, an extension of multidimensional scaling to three-way data (a stack of distance matrices rather than a single one). Here is what happens (simplified):
1/ RV-coefficients (~correlation) between matrices are computed and used to assign a weight to each matrix
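For readers unfamiliar with the RV coefficient, here is a small sketch of the standard formula, trace(AᵀB) / sqrt(trace(AᵀA) · trace(BᵀB)), written for matrices stored as lists of rows. DISTATIS applies it to transformed (cross-product) versions of the distance matrices; that transformation is skipped here, and the toy matrices are invented.

```python
import math

def rv_coefficient(a, b):
    """RV coefficient between two same-shaped matrices (lists of rows).

    trace(A^T B) equals the sum of element-wise products, so no explicit
    matrix multiplication is needed.
    """
    inner_ab = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    inner_aa = sum(x * x for ra in a for x in ra)
    inner_bb = sum(y * y for rb in b for y in rb)
    return inner_ab / math.sqrt(inner_aa * inner_bb)

gene1 = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]]
gene2 = [[0.0, 2.0, 4.0], [2.0, 0.0, 3.0], [4.0, 3.0, 0.0]]  # gene1 scaled by 2
gene3 = [[0.0, 3.0, 0.5], [3.0, 0.0, 1.0], [0.5, 1.0, 0.0]]  # different structure

print(rv_coefficient(gene1, gene2))  # same structure up to scale
print(rv_coefficient(gene1, gene3))  # lower similarity
```

Like a correlation, the value is insensitive to overall scale, so a gene that simply evolves faster is not penalised for that alone in this toy setting.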
PhylteR starts from a collection of distance matrices (pairwise patristic distances between species) retrieved from individual gene trees (or, optionally, directly from multiple sequence alignments).
Missing data (if any) are imputed to ensure equal dimensions of all matrices.
PhylteR, our new tool for filtering phylogenomics datasets, is now out!
https://doi.org/10.1093/molbev/msad234
PhylteR identifies with precision, from a collection of gene trees, the "outlier" sequences responsible for a lack of concordance among gene trees.
How does it work? A small thread 👇
#phylogenomics

PhylteR: efficient identification of outlier sequences in phylogenomic datasets
Abstract. In phylogenomics, incongruences between gene trees, resulting from both artifactual and biological reasons, can decrease the signal-to-noise ratio and
OUP Academic

Very nice general article (in French) in this month's #Epsiloon magazine, on #ghost lineages!!
Thanks to l'#Humanité_magazine for this double page (in French) on #ghosts and their impact for the study of gene flow!!!
It's really nice to see the PhD work of Theo Tricou (with Eric Tannier and myself) being so well covered by mainstream media!