Utilizing massive web-scale datasets has led to unprecedented performance gains in machine learning models, but also imposes outlandish compute requirements for their training. In order to improve training and data efficiency, we here push the limits of pruning large-scale multimodal datasets for training CLIP-style models. Today's most effective pruning method on ImageNet clusters data samples into separate concepts according to their embedding and prunes away the most prototypical samples. We scale this approach to LAION and improve it by noting that the pruning rate should be concept-specific and adapted to the complexity of the concept. Using a simple and intuitive complexity measure, we are able to reduce the training cost to a quarter of regular training. By filtering from the LAION dataset, we find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs. More specifically, we are able to outperform the LAION-trained OpenCLIP-ViT-B32 model on ImageNet zero-shot accuracy by 1.1 p.p. while only using 27.7% of the data and training compute. Despite a strong reduction in training cost, we also see improvements on ImageNet dist. shifts, retrieval tasks and VTAB. On the DataComp Medium benchmark, we achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.

Continuous training with 30M samples rivals the performance of the full LAION-50M dataset and outperforms LAION CLIP-B/16; Density-Based Pruning (DBP) improves over SSP-Pruning.


  • The paper addresses the environmental and computational challenges of training AI models with large-scale datasets by introducing methods to prune these datasets effectively.

  • Traditional dataset pruning methods like SSP-Pruning are outperformed by the novel technique of Density-Based Pruning, which accounts for the complexity of data clusters.

  • The proposed pruning method reduces computational costs by selecting a subset of data that retains the diversity and complexity of the full dataset.

  • Density-Based Pruning, combined with deduplication and CLIP-score filtering, yields models that surpass existing baselines on benchmarks like ImageNet and DataComp.

  • The research demonstrates that intelligent pruning techniques can lead to more sustainable and accessible AI development, particularly for those with limited computational resources.


Advancements in artificial intelligence, particularly in machine learning applied to large-scale multimodal datasets, have led to significant improvements in model performance. However, the compute requirements and environmental costs of training these increasingly complex models also deserve attention. To improve data efficiency, recent research has focused on refining dataset pruning, a process that selects a subset of the original dataset for training, in order to significantly reduce computational costs while maintaining, or even enhancing, model performance.

Data Efficiency and Pruning

In the context of large-scale datasets such as LAION, which can contain billions of examples, identifying and removing redundant or less informative data can accelerate the learning process and enhance data efficiency. Traditional methods of pruning involve a process called Self-Supervised-Prototypes Pruning (SSP-Pruning), where clusters of data samples are formed and the most prototypical examples—those closest to the cluster centers—are discarded. However, recent innovations propose a more nuanced pruning method that takes into account the complexity of the data within the clusters, leading to more effective pruning by adapting the rate at which data is discarded based on the cluster's complexity.
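The prototype-based pruning step described above can be illustrated with a minimal sketch. It assumes that embeddings, cluster labels, and centroids have already been computed, that cosine distance is the distance measure, and that the same fraction is kept in every cluster. The function names and the `keep_fraction` parameter are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def prune_prototypical(embeddings, labels, centroids, keep_fraction):
    """Keep the least prototypical samples of each cluster.

    Assumption: "prototypical" means "close to the cluster centroid",
    so samples are ranked by distance to their centroid and the
    farthest `keep_fraction` of each cluster is retained.
    Returns the sorted indices of the kept samples.
    """
    members_by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        members_by_cluster[label].append(idx)

    kept = []
    for label, members in members_by_cluster.items():
        # Rank cluster members from farthest to closest to the centroid.
        ranked = sorted(
            members,
            key=lambda i: cosine_distance(embeddings[i], centroids[label]),
            reverse=True,
        )
        n_keep = max(1, math.ceil(keep_fraction * len(members)))
        kept.extend(ranked[:n_keep])
    return sorted(kept)
```

With a uniform `keep_fraction`, every cluster is pruned at the same rate, which is the limitation that a complexity-adapted pruning rate is meant to address.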

Research Contributions and Methodology

The researchers present several significant contributions. They scale SSP-Pruning to web-scale datasets and introduce a novel pruning criterion that adapts the pruning rate to the complexity of each concept in the dataset. Compared with previous methods, their approach performs better on various benchmarks while reducing the training cost to roughly a quarter of regular training. For instance, their model exceeds the LAION-trained OpenCLIP-ViT-B/32 model in ImageNet zero-shot accuracy by 1.1 percentage points while using only 27.7% of the data and training compute.

Central to their methodology is a new technique called Density-Based Pruning (DBP), which strategically selects a smaller yet high-quality subset of data from a web-scale dataset. DBP considers the intricacies of clusters by evaluating the average intra-cluster distance—the variation within a cluster—and the inter-cluster distance—the spatial relation between clusters. The result is a pruned dataset that better captures the diversity and complexities of the original data, leading to more balanced and efficient training.
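To make the idea of a complexity-dependent pruning rate concrete, the following sketch turns per-cluster distances into per-cluster sample budgets. Several choices here are assumptions made for illustration and are not confirmed by the text above: complexity is taken as the product of the intra-cluster and inter-cluster distances, budgets are shared out with a temperature-scaled softmax, and leftover budget from small clusters is redistributed greedily. The names `allocate_budget` and `temperature` are hypothetical.

```python
import math


def cluster_complexity(d_intra, d_inter):
    """Assumed complexity score: product of intra- and inter-cluster distance."""
    return [a * b for a, b in zip(d_intra, d_inter)]


def allocate_budget(d_intra, d_inter, cluster_sizes, total_keep, temperature=0.1):
    """Split `total_keep` samples across clusters in proportion to complexity.

    Returns a list with the number of samples to keep per cluster.
    No cluster is asked for more samples than it contains.
    """
    complexity = cluster_complexity(d_intra, d_inter)

    # Numerically stable softmax over complexity / temperature.
    scaled = [c / temperature for c in complexity]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    total_weight = sum(weights)
    shares = [w / total_weight for w in weights]

    # Initial allocation, capped by the number of samples in each cluster.
    budget = [
        min(size, int(share * total_keep))
        for share, size in zip(shares, cluster_sizes)
    ]

    # Greedy redistribution of what is left, most complex clusters first.
    leftover = total_keep - sum(budget)
    order = sorted(range(len(budget)), key=lambda j: complexity[j], reverse=True)
    for j in order:
        if leftover <= 0:
            break
        extra = min(leftover, cluster_sizes[j] - budget[j])
        budget[j] += extra
        leftover -= extra
    return budget
```

Under these assumptions, a sparse and isolated cluster (large distances) receives a larger share of the budget than a dense cluster that sits close to its neighbours. A lower `temperature` makes the allocation more uneven, and a higher one moves it toward a uniform split.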

Experiments and Results

The team's extensive experiments further validate their approach. The pruning pipeline has three stages: deduplication, CLIP-score filtering (which scores how well each image matches its text), and finally DBP, which selects the data subset. Applying this pruning strategy to the LAION-CAT-440M dataset yields smaller, curated subsets, and models trained on them outperform the existing baselines on the ImageNet benchmark at a fraction of the original computational cost. Additionally, a new state-of-the-art ImageNet zero-shot accuracy was achieved on the DataComp Medium benchmark, which places the approach at the forefront of pruning methods.
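The CLIP-score filtering stage can be sketched as follows. The example assumes that image and text embeddings from a CLIP-style model are already available as plain lists of floats, that the score is the cosine similarity between the two, and that pairs below a fixed threshold are dropped. The threshold value and the function names are hypothetical; the text above does not state which threshold was used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score_filter(image_embeddings, text_embeddings, threshold):
    """Return indices of image-text pairs whose similarity reaches `threshold`.

    Assumption: the i-th image embedding and the i-th text embedding
    belong to the same image-text pair.
    """
    kept = []
    for idx, (image_vec, text_vec) in enumerate(zip(image_embeddings, text_embeddings)):
        if cosine_similarity(image_vec, text_vec) >= threshold:
            kept.append(idx)
    return kept
```

In a pipeline like the one described, a filter of this kind would run after deduplication and before the cluster-based selection, so that clustering is only performed on pairs whose text plausibly describes the image.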


The research highlights the effectiveness of intelligent dataset pruning in improving the efficiency of model training processes. By utilizing DBP, models can be trained to achieve superior performance on complex tasks using significantly smaller datasets. This reduction in computational overhead makes it feasible for more researchers, including those in academic settings with limited resources, to engage in state-of-the-art AI research. The research paves the way for more sustainable and accessible AI development, with a particular focus on optimal data usage and cost reduction while maximizing model performance.

