NEW BIML Bibliography entry

https://arxiv.org/abs/2503.03150

Position: Model Collapse Does Not Mean What You Think

Rylan Schaeffer, Joshua Kazdan, Alvan Caleb Arulandu, Sanmi Koyejo

We think recursive pollution is a better term than model collapse. Weak terminology leads to misunderstanding of impact. See figure 4. This is a very good paper.

#TOPPAPER #MLsec #RecursivePollution #DataPoisoning

https://berryvilleiml.com/references/

NEW BIML Bibliography entry

https://arxiv.org/abs/2404.05090

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

Mohamed El Amine Seddik, et al.

This treatment fails because the models studied are TOY models, too simple to be interesting.

#MLsec #RecursivePollution #SyntheticData

https://berryvilleiml.com/references/

NEW BIML Bibliography entry

https://arxiv.org/abs/2502.18865

A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops

Shi Fu, Yingjie Wang, Yuzhu Chen, Xinmei Tian, Dacheng Tao

Published at ICLR 2025. A bit overfocused on the real vs. synthetic data problem, this paper covers the depletion of real data available for training ML models. Self-consuming training loops (STLs) are very close indeed to recursive pollution, so the math here is relevant.

#MLsec #RecursivePollution

https://berryvilleiml.com/references/

NEW BIML Bibliography entry

https://arxiv.org/abs/2410.04840

Strong Model Collapse

Elvis Dohmatob, Yunzhen Feng, Arjun Subramonian, Julia Kempe
(NYU and Meta)

Recursive pollution leads to model collapse. This view of strong model collapse describes what happens in the case of recursive data poisoning.

#TOPPAPER #MLsec #Data #RecursivePollution

https://berryvilleiml.com/references/

NEW BIML Bibliography entry

https://arxiv.org/abs/2509.16499

A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

Lianghe Shi, et al.

A very nice set of references to work on model collapse. Collapsed model == lookup table (that is, no generalization). Discusses recursive pollution as a cause of variance shrinkage or distribution shift.

#TOPPAPER #MLsec #Data #RecursivePollution

https://berryvilleiml.com/references/

Recursive Pollution and Model Collapse Are Not the Same | BIML

Forever ago in 2020, we identified "looping" as one of the "raw data in the world" risks. See An Architectural Risk Anal

Berryville Institute of Machine Learning