Author: Harpreet Sahota (Hacker in Residence at Voxel51)
Where Research Meets Real-World Data Challenges
Despite practitioners universally acknowledging that data quality is the cornerstone of reliable AI systems, only 56 out of 4,543 papers at NeurIPS 2024 explicitly focused on data-centric AI approaches.
While this represents a doubling from 2023’s 28 papers, it remains a surprisingly small fraction given data’s outsized role in real-world AI success. Ask any machine learning engineer about their biggest challenges, and you’ll likely hear about data quality issues, bias in training sets, or the endless hours spent cleaning and curating datasets. Yet, the academic focus remains heavily skewed toward model architectures and optimization techniques.
This disconnect between practical reality and research emphasis makes the data-centric AI papers at NeurIPS 2024 particularly valuable.
In this series of blog posts, I’ll explore this select but crucial body of work tackling the foundation of AI development — the data itself. From new methodologies for auditing data quality to frameworks for understanding dataset bias, these papers offer critical insights for bridging the gap between academic research and practical implementation. I’ll examine current approaches to data curation, challenge assumptions about synthetic data, and investigate the potential of dynamic “foundation distributions” that adapt during training.
For anyone building or deploying AI systems, these findings could be more immediately impactful than the latest architectural innovation. After all, as the old programming adage goes, it’s garbage in, garbage out, no matter how sophisticated your model. Data eats models for lunch.
The five papers covered in this series:
- The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
- Intrinsic Self-Supervision for Data Quality Audits
- SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification
- Visual Data Diagnosis and Debiasing with Concept Graphs
- Understanding Bias in Large-Scale Visual Datasets
Below are brief overviews of the papers, each followed by a link to my full breakdown.
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
This work investigates the efficacy of using synthetic data generated by text-to-image models for adapting pre-trained vision models to downstream tasks.
Specifically, the authors compare the performance of models fine-tuned on targeted synthetic images generated by Stable Diffusion against models fine-tuned on targeted real images retrieved from Stable Diffusion’s training dataset, LAION-2B. The authors conduct experiments on five downstream tasks: ImageNet, Describable Textures (DTD), FGVC-Aircraft, Stanford Cars, and Oxford Flowers-102, and evaluate model performance using zero-shot and linear probing accuracy. Across all benchmarks and data scales, the authors find that training on real data retrieved from the generator’s upstream dataset consistently outperforms or matches training on synthetic data from the generator, highlighting the limitations of synthetic data compared to real data. The authors attribute this underperformance to generator artifacts and inaccuracies in semantic visual details within the synthetic images.
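To make the evaluation protocol concrete, here’s a minimal linear-probe sketch: fit a linear classifier on frozen image features and compare accuracy on a shared test set depending on which training source those features came from. This is not the authors’ exact pipeline, and the arrays below are random placeholders standing in for real feature extractions.

```python
# Minimal linear-probe sketch (not the paper's pipeline). Assumes you have
# already extracted frozen features, e.g., from a CLIP image encoder, for a
# training set built from synthetic images and one built from retrieved real
# images. The arrays here are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear classifier on frozen features and report test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return accuracy_score(test_labels, clf.predict(test_feats))

rng = np.random.default_rng(0)
synthetic_feats = rng.normal(size=(500, 512))   # features from synthetic training images
retrieved_feats = rng.normal(size=(500, 512))   # features from retrieved real images
train_labels = rng.integers(0, 10, size=500)
test_feats, test_labels = rng.normal(size=(200, 512)), rng.integers(0, 10, size=200)

acc_syn = linear_probe_accuracy(synthetic_feats, train_labels, test_feats, test_labels)
acc_real = linear_probe_accuracy(retrieved_feats, train_labels, test_feats, test_labels)
print(f"linear probe -- synthetic: {acc_syn:.3f}, retrieved real: {acc_real:.3f}")
```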
Overall, the work emphasizes the importance of considering retrieval from a generative model’s training data as a strong baseline when evaluating the value of synthetic training data.
You can read a full breakdown of this work here.
Intrinsic Self-Supervision for Data Quality Audits
This work presents SELFCLEAN, a data cleaning procedure that leverages self-supervised representation learning to detect data quality issues in image datasets.
SELFCLEAN identifies off-topic samples, near duplicates, and label errors using dataset-specific representations and distance-based indicators. The authors demonstrate that SELFCLEAN outperforms competing methods for synthetic data quality issues and aligns well with metadata and expert verification in natural settings. Applying SELFCLEAN to well-known image benchmark datasets, the authors estimate the prevalence of various data quality issues and highlight their impact on model scores.
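To give a feel for what distance-based indicators look like, here’s a toy sketch that ranks near duplicates and off-topic samples from self-supervised embeddings. It isn’t SELFCLEAN’s actual scoring, and the embeddings are random placeholders you’d swap for features from your own encoder.

```python
# Toy distance-based quality indicators in the spirit of SELFCLEAN (not the
# paper's exact scores). `embeddings` is assumed to be an (N, D) array of
# self-supervised image features; random values stand in here.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def quality_indicators(embeddings: np.ndarray, k: int = 5):
    dists = cosine_distances(embeddings)
    np.fill_diagonal(dists, np.inf)          # ignore self-distances
    nearest = np.sort(dists, axis=1)[:, :k]  # k smallest distances per sample
    duplicate_score = nearest[:, 0]          # lower = more likely a near duplicate
    offtopic_score = nearest.mean(axis=1)    # higher = more isolated / off-topic
    return duplicate_score, offtopic_score

embeddings = np.random.default_rng(0).normal(size=(100, 128))
dup, off = quality_indicators(embeddings)
print("most duplicate-like sample:", int(dup.argmin()))
print("most off-topic sample:", int(off.argmax()))
```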
Their analysis emphasizes the importance of data cleaning for improving the reliability of benchmark performance and boosting confidence in AI applications.
You can read my full breakdown of this paper and my takeaways here.
SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification
This work introduces SELECT, a benchmark for evaluating data curation strategies for image classification, and IMAGENET++, a dataset used to generate baseline curation methods.
The authors created IMAGENET++ by extending ImageNet with five new training-data shifts, each assembled using a distinct curation strategy, including crowdsourced labeling, embedding-based search, and synthetic image generation. The authors evaluated these data curation baselines by training image classification models from scratch and by probing a fixed pretrained self-supervised representation. Their findings indicate that while reduced-cost curation methods are becoming more competitive, expert labeling, as used in the original ImageNet dataset, remains the most effective strategy.
The authors suggest that future research focus on improving cost-effective data filtration, sample labeling, and synthetic data generation to further bridge the gap between reduced-cost and expert curation methods.
You can read a full breakdown of the paper and my key takeaways here.
Visual Data Diagnosis and Debiasing with Concept Graphs
This paper introduces ConBias, a novel framework for diagnosing and mitigating concept co-occurrence biases in visual datasets.
ConBias represents visual datasets as knowledge graphs of concepts, which enables the analysis of spurious concept co-occurrences to identify concept imbalances across the dataset. This approach targets object co-occurrence bias, which refers to any spurious correlation between a label and an object causally unrelated to the label. Once concept imbalances have been identified, ConBias generates images to address under-represented class-concept combinations, leading to a more uniform concept distribution across classes. This process involves prompting a text-to-image generative model to create images of under-represented concept combinations.
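Here’s a toy sketch of the diagnosis step: count class-concept co-occurrences and flag combinations that never appear, which are the kinds of gaps ConBias fills with generated images. The annotations below are invented for illustration, and the real framework reasons over a full concept knowledge graph rather than a flat counter.

```python
# Toy concept co-occurrence diagnosis (illustrative, not the ConBias pipeline).
# Each annotation is a hypothetical (class label, objects present) pair.
from collections import Counter
from itertools import product

annotations = [
    ("waterbird", {"water", "boat"}),
    ("waterbird", {"water"}),
    ("landbird", {"trees"}),
    ("landbird", {"trees", "water"}),
]

classes = {label for label, _ in annotations}
concepts = set().union(*(objs for _, objs in annotations))

# How often does each (class, concept) pair co-occur in the dataset?
cooccur = Counter((label, c) for label, objs in annotations for c in objs)

# Missing combinations are candidates for targeted text-to-image generation
for label, concept in product(sorted(classes), sorted(concepts)):
    if cooccur[(label, concept)] == 0:
        print(f"generate images of class '{label}' containing '{concept}'")
```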
Extensive experiments show that data augmentation based on a balanced concept distribution generated by ConBias enhances generalization performance across multiple datasets, outperforming state-of-the-art methods.
Check out the blog for a full breakdown of ConBias.
Understanding Bias in Large-Scale Visual Datasets
This paper uses three popular datasets (YFCC, CC, and DataComp) as a case study to explore the different forms of bias in large-scale visual datasets.
The authors develop a framework that applies various image transformations to isolate specific visual attributes (e.g., semantics, structure, color, frequency) and then assesses how well a neural network can still predict which dataset a transformed image came from. Strong classification performance after a transformation suggests that the targeted attribute contributes to dataset bias. The study reveals that these datasets exhibit significant biases across all the examined visual attributes, including object-level imbalances, differences in color statistics, variations in object shape and spatial geometry, and distinct thematic focuses.
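A rough way to picture the probe: transform the images to remove an attribute (say, color), then check whether a classifier can still tell which dataset each image came from. The sketch below uses random placeholder “images” and a linear classifier instead of the neural networks used in the paper.

```python
# Sketch of a dataset-origin probe after a transformation (illustrative only).
# Random arrays stand in for images from two datasets; the paper trains a
# neural network on each transformed view of YFCC, CC, and DataComp.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def to_grayscale(images: np.ndarray) -> np.ndarray:
    """Average the channels so color can no longer carry the dataset signal."""
    return images.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dataset_a = rng.normal(0.4, 0.1, size=(200, 16, 16, 3))  # placeholder "images"
dataset_b = rng.normal(0.6, 0.1, size=(200, 16, 16, 3))

images = np.concatenate([to_grayscale(dataset_a), to_grayscale(dataset_b)])
labels = np.array([0] * 200 + [1] * 200)  # which dataset each image came from

x_train, x_test, y_train, y_test = train_test_split(
    images.reshape(len(images), -1), labels, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print("dataset-origin accuracy after removing color:", clf.score(x_test, y_test))
```

If the origin classifier still scores well above chance after a transformation, whatever attribute survived that transformation is contributing to the bias.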
The authors emphasize that understanding these biases is essential for developing more diverse datasets and building robust, generalizable vision models.