New open source AI feature alert! 💧🔔💧🔔💧🔔💧🔔
Generalization in machine learning models is still poorly understood. As a result, the standard practice is to heuristically verify our models on holdout test sets and hope that this check has some bearing on performance in the wild. This means faulty testing carries a huge cost, both in critical MLE time and in error-filled data and annotations.
One common failure mode of testing is a test split afflicted with data leakage. When evaluating on such a split, there is no guarantee that generalization is actually being verified; in the extreme case, you gain no new information about the model's performance outside the train set. Supervised models learn the minimal discriminative features needed to make a decision, and if those same features leak into the test set, they can build a dangerous, false sense of confidence in a model. Don't let this happen to you.
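To make the idea concrete, here is a minimal sketch (not FiftyOne's implementation) of one common way to surface this kind of leakage: embed both splits and flag test samples whose nearest train neighbor is suspiciously similar. The function name, threshold, and toy data below are all illustrative.

```python
# Sketch: flag test samples that are near-duplicates of train samples
# in embedding space. Purely illustrative, not the Leaky-Splits code.
import numpy as np

def find_leaks(train_emb: np.ndarray, test_emb: np.ndarray, threshold: float = 0.95):
    """Return indices of test samples whose max cosine similarity to any
    train sample exceeds `threshold` (likely duplicates / near-duplicates)."""
    # L2-normalize so dot products are cosine similarities
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)

    sims = test @ train.T        # (n_test, n_train) similarity matrix
    best = sims.max(axis=1)      # closest train neighbor per test sample
    return np.where(best >= threshold)[0], best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_emb = rng.normal(size=(1000, 128))
    # Simulate leakage: copy a few train embeddings into the test split
    test_emb = np.vstack([rng.normal(size=(195, 128)), train_emb[:5] + 1e-3])

    leaks, scores = find_leaks(train_emb, test_emb)
    print(f"Flagged {len(leaks)} leaked test samples: {leaks}")
```

If those flagged samples stayed in your test set, your evaluation would be partly grading the model on images it has effectively already seen.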
Leaky splits can be the bane of ML models, giving a false sense of confidence and a nasty surprise in production. The image on this post is a sneak peek at what you can expect (this example is taken from ImageNet 👀).
Check out this Leaky-Splits blog post by my friend and colleague Jacob Sela
https://medium.com/voxel51/on-leaky-datasets-and-a-clever-horse-18b314b98331
Jacob is also the lead developer behind the new open source Leaky-Splits feature in FiftyOne, available in version 1.1.
This function allows you to automatically:
🕵 Detect data leakage in your dataset splits
🪣 Clean your data from these leaks
This will help you:
✔️ Build trust in your data
📊 Get more accurate evaluations
And, it's open source. Check it out on GitHub.
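For a sense of what usage might look like, here is a hypothetical sketch. The method name `compute_leaky_splits`, the `splits` argument, and the `leaks_view()` call below are assumptions on my part, so check the FiftyOne Brain docs for the exact API.

```python
# Hypothetical usage sketch -- verify names/signatures against the FiftyOne docs
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("my-dataset")  # assumed dataset with "train"/"test" splits

# Detect potential leaks between the train and test splits (assumed method name)
index = fob.compute_leaky_splits(dataset, splits=["train", "test"])

# Inspect the flagged samples, then remove them so the test split stays honest
leaks = index.leaks_view()
print(f"Found {len(leaks)} potentially leaked samples")
dataset.delete_samples(leaks)
```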
From your friends at Voxel51