👽 📈 👽 How difficult is this dataset REALLY? 👽 📈 👽
New Paper Alert!
Class-wise Autoencoders Measure Classification Difficulty and Detect Label Mistakes
We like to think that the challenge of training a classifier is solved by hyperparameter tuning or model innovation, but there is rich inherent signal in the data and their embeddings. Understanding how hard a machine learning problem actually is has long been elusive. Not anymore.
Now you can compute the difficulty of a classification dataset without training a classifier, using only 100 labels per class. And this difficulty estimate is surprisingly independent of dataset size.
Traditionally, methods for dataset difficulty assessment have been time- and compute-intensive, often requiring training one or more large downstream models. What's more, if you train a model with a certain architecture on your dataset and achieve a certain accuracy, there is no way to be sure that your architecture was well suited to the task at hand: a different set of inductive biases might have learned the patterns in the data with far more ease.
Our method trains a lightweight autoencoder for each class and uses the ratios of reconstruction errors to estimate classification difficulty. Running it on a 100k-sample dataset takes just a few minutes, and it doesn't require tuning or custom processing to run on new datasets!
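To make the idea concrete, here is a minimal sketch of a class-wise reconstruction-error-ratio signal computed from precomputed embeddings. It is a simplified illustration, not the implementation in the repo: it stands in a scikit-learn MLPRegressor as each class's autoencoder, and the function names, hyperparameters, and exact ratio definition are illustrative rather than the paper's formulation.

```python
# Toy sketch of the class-wise reconstruction-error-ratio idea.
# `embeddings` and `labels` are assumed to come from any pretrained encoder.
import numpy as np
from sklearn.neural_network import MLPRegressor

def class_autoencoders(embeddings, labels, hidden=32, seed=0):
    """Fit one small autoencoder (an MLP regressed onto its own input) per class."""
    models = {}
    for c in np.unique(labels):
        X_c = embeddings[labels == c]
        ae = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=500, random_state=seed)
        ae.fit(X_c, X_c)  # train the MLP to reconstruct its input
        models[c] = ae
    return models

def reconstruction_error_ratios(embeddings, labels, models):
    """For each sample: reconstruction error under its own class's autoencoder,
    divided by the smallest error under any other class's autoencoder.
    Ratios near (or above) 1 suggest the sample is hard to separate."""
    classes = np.array(sorted(models))
    errs = np.stack(
        [np.mean((models[c].predict(embeddings) - embeddings) ** 2, axis=1)
         for c in classes],
        axis=1)
    own = errs[np.arange(len(labels)), np.searchsorted(classes, labels)]
    other = np.where(classes[None, :] == labels[:, None], np.inf, errs).min(axis=1)
    return own / other

# Usage example with random data standing in for real embeddings:
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = rng.integers(0, 3, size=300)
models = class_autoencoders(X, y)
ratios = reconstruction_error_ratios(X, y, models)
print("mean ratio (rough per-dataset difficulty proxy):", ratios.mean())
```

Aggregating these per-sample ratios over the dataset gives a single difficulty score, and samples whose own-class autoencoder reconstructs them far worse than another class's are natural candidates for label-mistake review.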
How well does it work? We conducted a systematic study of 19 common visual datasets, comparing the estimated difficulty from our method to the SOTA classification accuracy. Aside from a single outlier, the correlation is 0.78. It even works on medical datasets!
Paper Link: https://arxiv.org/abs/2412.02596
GitHub Repo: https://github.com/voxel51/reconstruction-error-ratios