I built a tool to find problems hiding in my training data.
LabelLens analyzes labeled text classification datasets for duplicates, mislabels, and class imbalance. I ran it on my own 26K-sample dataset and found 5,664 exact duplicates I had no idea about.
Try it: https://huggingface.co/spaces/mikenoe/label-lens
Blog post: https://mikenoe.com/posts/i-built-a-tool-to-find-the-problems-in-my-training-data/
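
For a rough idea of what checks like these involve, here is a minimal sketch in plain Python. It is not LabelLens's actual code. The function names, the whitespace normalization, and the choice to treat duplicates with conflicting labels as mislabel candidates are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def normalize(text):
    # Assumption: two texts are "exact duplicates" if they match after
    # trimming and collapsing whitespace. The real tool may define this differently.
    return " ".join(text.split())


def find_exact_duplicates(samples):
    """Group (text, label) samples by normalized text.

    Returns a dict mapping each repeated text to the list of indices where it occurs.
    """
    groups = defaultdict(list)
    for index, (text, _label) in enumerate(samples):
        groups[normalize(text)].append(index)
    return {text: idxs for text, idxs in groups.items() if len(idxs) > 1}


def count_redundant(duplicates):
    # Number of rows you could drop while keeping one copy of each text.
    return sum(len(idxs) - 1 for idxs in duplicates.values())


def conflicting_labels(samples, duplicates):
    # Duplicated texts that carry more than one label: a simple signal
    # that at least one copy may be mislabeled.
    conflicts = {}
    for text, idxs in duplicates.items():
        labels = {samples[i][1] for i in idxs}
        if len(labels) > 1:
            conflicts[text] = sorted(labels)
    return conflicts


def imbalance_ratio(samples):
    # Size of the largest class divided by the size of the smallest class.
    counts = Counter(label for _text, label in samples)
    return max(counts.values()) / min(counts.values())


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the 26K-sample dataset.
    toy = [
        ("great product", "pos"),
        ("great  product", "pos"),
        ("terrible", "neg"),
        ("great product", "neg"),
        ("okay", "neu"),
    ]
    dupes = find_exact_duplicates(toy)
    print("redundant rows:", count_redundant(dupes))
    print("label conflicts:", conflicting_labels(toy, dupes))
    print("imbalance ratio:", imbalance_ratio(toy))
```

Grouping by text instead of comparing every pair keeps the duplicate check linear in the dataset size, so it stays fast at tens of thousands of samples.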