Is more data always better? Or should we select less data of higher quality?

🚨 🧬 #bioinformatics #computationalbiology #statistics #preprint 🧬🚨
(repost trying to improve my #hashtag game)

We explored this question in the context of predicting the fitness effects of #protein 🧬 #mutations from MSA #sequencedata, finding a scaling law that relates the performance of statistical models to two simple data descriptors:

https://www.biorxiv.org/content/10.1101/2022.12.12.520004v1

The first descriptor is the mean Hamming distance of the training MSA sequences to the mutated sequence (D in the figure). This quantifies the "quality" of the data for the given problem - the closer to the mutated sequence, the better.
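A minimal sketch of what that descriptor looks like in code (the function names and the toy alignment are mine, not from the preprint):

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    assert len(a) == len(b), "sequences must come from the same alignment"
    return sum(x != y for x, y in zip(a, b))

def mean_hamming(msa, target):
    """Mean Hamming distance D of the training MSA to the mutated sequence."""
    return sum(hamming(s, target) for s in msa) / len(msa)

# Toy example: three aligned sequences vs. a target sequence.
msa = ["ACDEF", "ACDKF", "AGDEF"]
print(mean_hamming(msa, "ACDEF"))  # → 0.6666666666666666
```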

We show that the Hamming distance is connected to the statistical #bias of the inferred model, with a prefactor (J0) that depends on the amount of higher-order #epistasis that is not explicitly accounted for by the model.

(in the figure: B = number of sequences)


The second descriptor is the number of sequences (B) in the training MSA. The more the merrier, provided they are of similar quality. We show that the number of sequences is - perhaps unsurprisingly - connected to the #variance of the inferred statistical model.


Therefore, given a bunch of training data, there is a clear trade-off between selecting a few "good" training points and including more points of lower overall quality. We provide some heuristics to select the optimal subset of data given a model and a prediction problem.
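The flavor of such a heuristic can be sketched as follows. This is NOT the heuristic from the preprint, just a toy bias-variance proxy: rank candidate sequences by Hamming distance to the target, then grow the training set while tracking err(k) ≈ (J0·D_k)² + c/k, where D_k is the mean distance of the k closest sequences and J0, c are placeholder constants:

```python
def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def best_subset_size(candidates, target, J0=0.05, c=1.0):
    """Pick the training-set size k that minimizes a toy bias^2 + variance proxy."""
    ranked = sorted(candidates, key=lambda s: hamming(s, target))
    best_k, best_err = 1, float("inf")
    total_d = 0.0
    for k, seq in enumerate(ranked, start=1):
        total_d += hamming(seq, target)
        D = total_d / k                    # mean distance of the k closest
        err = (J0 * D) ** 2 + c / k       # bias^2 grows with D, variance falls as 1/k
        if err < best_err:
            best_k, best_err = k, err
    return best_k

candidates = ["AAAA", "AAAT", "TTTT", "ATTT"]
# Weak epistasis (small J0): distant sequences still help, take everything.
print(best_subset_size(candidates, "AAAA", J0=0.05))  # → 4
# Strong epistasis (large J0): bias dominates, keep only the closest sequences.
print(best_subset_size(candidates, "AAAA", J0=1.0))   # → 2
```

The two calls show the trade-off: the same data yields different optimal subsets depending on how costly distance (bias) is relative to scarcity (variance).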

/end of my first #tootprint #masthread #preprint 😁​

#bioinformatics #computationalbiology #statistics