Dataset Distillation via Curriculum Data Synthesis in Large Data Era

Published 30 Nov 2023 in cs.CV, cs.AI, and cs.LG | (2311.18838v2)

Abstract: Dataset distillation or condensation aims to generate a smaller but representative subset from a large dataset, which allows a model to be trained more efficiently, meanwhile evaluating on the original testing data distribution to achieve decent performance. Previous decoupled methods like SRe$^2$L simply use a unified gradient update scheme for synthesizing data from Gaussian noise, while, we notice that the initial several update iterations will determine the final outline of synthesis, thus an improper gradient update strategy may dramatically affect the final generation quality. To address this, we introduce a simple yet effective global-to-local gradient refinement approach enabled by curriculum data augmentation ($\texttt{CDA}$) during data synthesis. The proposed framework achieves the current published highest accuracy on both large-scale ImageNet-1K and 21K with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, using a regular input resolution of 224$\times$224 with faster convergence speed and less synthetic time. The proposed model outperforms the current state-of-the-art methods like SRe$^2$L, TESLA, and MTT by more than 4% Top-1 accuracy on ImageNet-1K/21K and for the first time, reduces the gap to its full-data training counterparts to less than absolute 15%. Moreover, this work represents the inaugural success in dataset distillation on the larger-scale ImageNet-21K dataset under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.

Citations (7)

Summary

The paper introduces Curriculum Data Augmentation (CDA), which controls the difficulty of data synthesis over time to distill compact, representative datasets from large-scale sources.
It compares curriculum, reverse curriculum, and constant schedules, all implemented through parameterized image crops, and evaluates them on datasets of varying scale.
Empirical results show a Top-1 accuracy improvement of more than 4% on ImageNet-1K and the first successful distillation of ImageNet-21K at the standard 224$\times$224 resolution.

Dataset Distillation in the Era of Large-Scale Datasets

The paper addresses dataset distillation at large scale, targeting ImageNet-1K and ImageNet-21K. Dataset distillation generates a compact, representative set of samples from a large dataset, which matters now that large datasets impose significant computational and storage costs. The work introduces a strategy termed Curriculum Data Augmentation (CDA), which improves on prior methods both conceptually and in measured accuracy.

Key Contributions and Methodology

The paper's contributions center on making dataset distillation work efficiently on large-scale datasets:

Curriculum Data Augmentation (CDA): The core contribution is CDA, which builds on the idea from curriculum learning that a learning process benefits from introducing complexity gradually. CDA controls the difficulty of data synthesis by adjusting how samples are cropped, so that synthesis is exposed to progressively harder views of the data. The abstract describes the result as a global-to-local gradient refinement.
Integration of Curriculum and Reverse Curriculum Learning: The study compares three paradigms of data synthesis: standard curriculum learning, reverse curriculum learning, and a constant baseline. Each is implemented through data augmentation, specifically by parameterizing image crops with RandomResizedCrop.
Empirical Evaluation: The authors validate their approach on CIFAR-100, Tiny-ImageNet, ImageNet-1K, and ImageNet-21K. According to the abstract, this is the first successful dataset distillation on ImageNet-21K at the standard 224$\times$224 resolution. In these evaluations, CDA consistently outperforms existing state-of-the-art methods, improving Top-1 accuracy on ImageNet-1K by more than 4% over baselines such as SRe$^2$L, TESLA, and MTT.
Theoretical and Practical Implications: Because the distilled datasets are highly compact yet retain strong classification accuracy, they make large-scale datasets usable in resource-constrained environments. The synthesized datasets may also raise fewer privacy concerns, since they potentially exclude raw, personally identifiable data.
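The three synthesis paradigms above can be sketched as schedules for the lower bound of the crop-area range. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `crop_scale_lower_bound`, the linear ramp, the `milestone` parameter, and the bounds 0.08 and 1.0 (the default `scale` range of torchvision's `RandomResizedCrop`) are all hypothetical choices made here for clarity. It assumes that large crops (global views) are the easy case and small crops (local views) the hard one, in line with the global-to-local description in the abstract.

```python
def crop_scale_lower_bound(step, total_steps, mode="curriculum",
                           low=0.08, high=1.0, milestone=1.0):
    """Return the lower bound of the crop-area fraction at a synthesis step.

    Assumed (hypothetical) schedules:
      curriculum: start at `high` (near-full image) and decay linearly to `low`
      reverse:    start at `low` and grow linearly to `high`
      constant:   always `low`, i.e. one fixed crop range for every step
    `milestone` is the fraction of steps over which the ramp runs; after
    that point the bound stays at its final value.
    """
    if mode == "constant":
        return low
    ramp_steps = max(1, int(milestone * total_steps))
    progress = min(step / ramp_steps, 1.0)  # clamp to [0, 1]
    if mode == "curriculum":
        return high - (high - low) * progress
    if mode == "reverse":
        return low + (high - low) * progress
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Print the bound at a few steps of a 1000-step synthesis run.
    for mode in ("curriculum", "reverse", "constant"):
        bounds = [round(crop_scale_lower_bound(s, 1000, mode), 2)
                  for s in (0, 250, 500, 750, 1000)]
        print(mode, bounds)
```

In a synthesis loop, the returned value would typically be passed as the first element of the `scale` tuple, for example `RandomResizedCrop(224, scale=(lower, 1.0))`, with the transform rebuilt as the step counter advances.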

Numerical Results and Achievements

On ImageNet-1K with 50 images per class (IPC), the proposed method reaches 63.2% Top-1 accuracy, more than 4% above previous approaches. On ImageNet-21K with IPC 20, it reaches 36.1% Top-1 accuracy. The abstract also states that, for the first time, the gap to full-data training falls below 15 percentage points (absolute).
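To put the IPC figures in perspective, the size of a distilled dataset follows directly from IPC times the number of classes. The short calculation below is illustrative only: the helper name `distilled_size` is hypothetical, and the ImageNet-1K class count (1,000) and training-set size (1,281,167 images, the ILSVRC-2012 split) are not stated in the text above and are supplied here as outside facts.

```python
def distilled_size(ipc, num_classes):
    """Total number of synthetic images for a given images-per-class budget."""
    return ipc * num_classes


# Assumption: standard ILSVRC-2012 (ImageNet-1K) training split.
IMAGENET_1K_CLASSES = 1_000
IMAGENET_1K_TRAIN_IMAGES = 1_281_167

n_synthetic = distilled_size(50, IMAGENET_1K_CLASSES)
fraction = n_synthetic / IMAGENET_1K_TRAIN_IMAGES

print(f"{n_synthetic} synthetic images, {fraction:.1%} of the original training set")
```

Under these assumptions, the IPC 50 setting corresponds to 50,000 synthetic images, roughly 3.9% of the original training set.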

Practical and Theoretical Implications

Practically, this research could broaden access to strong machine learning models by reducing the computational resources needed for training, especially where data storage or processing capacity is limited. Theoretically, it opens avenues for research into synthesis strategies and curriculum-based schedules that might further improve generalization and reduce overfitting on distilled datasets.

Future Outlook in AI

The results from CDA suggest that very large datasets can be handled efficiently through distillation. The approach could extend beyond images to text, audio, and other modalities, potentially changing data management and training practice across AI subfields. Future work might explore more sophisticated curriculum strategies or adaptive data augmentation to improve these results further.

In summary, the paper applies curriculum learning to data synthesis in order to distill large datasets efficiently. Its findings are most relevant to model training in resource-limited settings, where compact distilled datasets could make large-scale training more widely accessible.
