fly51fly (@fly51fly)

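A joint Stanford and EPFL research team (J Kazdan, N Levi, R Schaeffer, J Chudnovsky, and others) posted the paper 'Scale Dependent Data Duplication' on arXiv in 2026. The paper analyzes how the effect of training-data duplication on model performance and generalization changes with data scale, and discusses duplication-related problems and their implications from a scaling perspective.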

https://x.com/fly51fly/status/2031483138908762496

#dataduplication #datasetquality #mlresearch #arxiv

[LG] Scale Dependent Data Duplication J Kazdan, N Levi, R Schaeffer, J Chudnovsky… [Stanford University & EPFL] (2026) https://t.co/tIspicuiEc

Alexander Wlodawer et al.: Duplicate entries in the Protein Data Bank: how to detect and handle them
#DataDuplication
#protein
https://journals.iucr.org/d/issues/2025/04/00/gm5112/index.html
A global analysis of protein crystal structures in the Protein Data Bank (PDB) reveals many pairs with (nearly) identical main-chain coordinates. Such cases are identified and analyzed, leading to a proposal about how the PDB could ameliorate this problem.

Acta Crystallographica Section D
7/10: Jesse & team addressed the challenge of data duplication through their Rabbit Hole interface. They have made progress, but they are still working through the issue. #RabbitHole #DataDuplication