
Abstract

"Scale the model, scale the data, scale the GPU farms" is the reigning sentiment in the world of generative AI today. While model scaling has been extensively studied, data scaling and its downstream impacts on model performance remain under-explored. This is particularly important in the context of multimodal datasets whose main source is the World Wide Web, condensed and packaged as the Common Crawl dump, which is known to exhibit numerous drawbacks. In this paper, we evaluate the downstream impact of dataset scaling on 14 visio-linguistic models (VLMs) trained on the LAION-400M and LAION-2B datasets by measuring racial and gender bias using the Chicago Face Dataset (CFD) as the probe. Our results show that as the training data increased, the probability of a pre-trained CLIP model misclassifying human images as offensive non-human classes such as chimpanzee, gorilla, and orangutan decreased, but the probability of misclassifying the same images as offensive human classes such as criminal increased. Furthermore, of the 14 Vision Transformer-based VLMs we evaluated, the probability of predicting an image of a Black man or a Latino man as criminal increases by 65% and 69%, respectively, when the dataset is scaled from 400M to 2B samples for the larger ViT-L models. Conversely, for the smaller ViT-B models, the probability of predicting an image of a Black man or a Latino man as criminal decreases by 20% and 47%, respectively, when the dataset is scaled from 400M to 2B samples. We ground the model audit results in a qualitative and historical analysis, reflect on our findings and their implications for dataset curation practice, and close with a summary of mitigation mechanisms and ways forward. Content warning: This article contains racially dehumanising and offensive descriptions.

Figure: Heatmap depicting the misclassification of Latino/Latina individuals as 'Asian' worsening with increased dataset size.

Overview

  • The paper examines the effects of dataset size on racial and gender biases in Vision Transformer (ViT) models, using two dataset scales (LAION-400M and LAION-2B) and testing with the Chicago Face Dataset (CFD).

  • Key findings include a reduction in non-human misclassifications with larger datasets, but an increase in offensive racial classifications, most notably the labeling of Black and Latino men as criminals.

  • The study emphasizes the importance of responsible AI practices and suggests future directions like improved dataset curation, robust bias mitigation techniques, and ethical AI deployment to minimize biases and ensure fairness.

Exploring Dataset Scaling and Bias in Vision Transformer Models

Understanding the Study

The study evaluated the impact of dataset size on racial and gender bias in visio-linguistic models (VLMs), specifically focusing on Vision Transformer (ViT) models trained on two datasets: LAION-400M and LAION-2B. Researchers used the Chicago Face Dataset (CFD) as a probe to measure bias, revealing significant variations in racial classification as dataset size changed.
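
To make the probing setup concrete, here is a minimal sketch of how such an audit could be reproduced with the open_clip library, which distributes CLIP checkpoints trained on LAION-400M and LAION-2B. The checkpoint tags, image path, and label set below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal zero-shot probe sketch using open_clip. Swap the `pretrained` tag
# to compare checkpoints trained on different dataset scales.
import torch
from PIL import Image
import open_clip

# Illustrative choice: ViT-B/32 trained on LAION-400M.
# A LAION-2B counterpart would be pretrained="laion2b_s34b_b79k".
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
model.eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Assumed candidate labels for the zero-shot classification probe.
labels = ["a photo of a criminal", "a photo of a doctor", "a photo of a person"]
text = tokenizer(labels)

# Hypothetical path to a CFD probe image.
image = preprocess(Image.open("cfd_face.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize embeddings, then turn cosine similarities into probabilities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Running the same probe with a LAION-2B checkpoint and comparing the resulting label distributions mirrors, in miniature, the kind of comparison the study reports across its 14 models.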

Key Findings

Impact of Dataset Scaling

  • Decrease in Non-human Misclassifications: Larger datasets reduced misclassifications of human images as non-human categories such as chimpanzee, gorilla, and orangutan.
  • Increase in Offensive Human Classifications: Larger datasets increased the likelihood of misclassifying Black and Latino men as criminals, particularly with larger model architectures (e.g., ViT-L).

Model Response Differences

  • Model Size Matters: Larger models (ViT-L) increased criminal classification predictions when trained on the bigger dataset (2B samples), while smaller models (ViT-B) showed a decrease in such predictions when the dataset was scaled up; a sketch of one way such prediction rates can be computed follows this list.
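
The paper reports these shifts as changes in the probability of a given label. One plausible reading of that metric (our assumption, not a protocol taken from the paper) is the fraction of a demographic group's images whose top-1 zero-shot label equals a target class, compared across the two checkpoints:

```python
# Assumed metric: share of a group's images whose top-1 label is `target`,
# compared across two training-set scales. The label lists are invented
# for illustration; a real audit would use predictions from the probe above.
from collections import Counter

def prediction_rate(top1_labels: list[str], target: str) -> float:
    """Fraction of images whose top-1 zero-shot label equals `target`."""
    counts = Counter(top1_labels)
    return counts[target] / len(top1_labels)

# Hypothetical top-1 labels for one demographic group under each checkpoint.
labels_400m = ["criminal", "person", "person", "criminal", "person"]
labels_2b = ["criminal", "criminal", "person", "criminal", "person"]

rate_400m = prediction_rate(labels_400m, "criminal")  # 0.4
rate_2b = prediction_rate(labels_2b, "criminal")      # 0.6
print(f"relative change: {(rate_2b - rate_400m) / rate_400m:+.0%}")  # +50%
```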

Implications for AI Development and Ethics

The findings underscore the complexities of scaling datasets in AI training:

  • Bias Amplification: Scaling up datasets without careful curation and consideration of diversity can amplify biases, potentially leading to harmful stereotypes being reinforced in AI applications.
  • Need for Responsible AI Practices: The study highlights the critical need for transparency, evaluation, and responsible dataset management in AI development to avoid propagation of stereotypes and bias.

Speculating on Future Developments

Given the study's findings, future developments in AI might focus on:

  1. Improved Dataset Curation: Enhanced methods for dataset curation to ensure diversity and minimize biases (a toy filtering sketch follows this list).
  2. Robust Bias Mitigation Techniques: Development of more sophisticated techniques to detect and mitigate biases as datasets scale.
  3. Ethical AI Deployment: Emphasis on ethical considerations and fairness in AI deployment, especially in sensitive applications.
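
As a toy illustration of one curation step, the sketch below flags image-text pairs whose captions contain blocklisted terms before they enter a training set. This is a generic, deliberately simplistic technique, not a method proposed in the paper; real curation requires context-aware, multilingual tooling and human review. The blocklist and sample records are invented for illustration.

```python
# Toy caption-blocklist filter for image-text pairs. Keyword matching alone
# cannot catch contextual or implicit harms; it is shown only to make the
# idea of pre-training-time curation concrete.
import re

BLOCKLIST = {"criminal", "thug", "gorilla"}  # assumed, illustrative terms

def flag_caption(caption: str) -> bool:
    """Return True if the caption contains a blocklisted term."""
    tokens = set(re.findall(r"[a-z']+", caption.lower()))
    return bool(tokens & BLOCKLIST)

samples = [
    {"url": "http://example.com/a.jpg", "caption": "portrait of a smiling man"},
    {"url": "http://example.com/b.jpg", "caption": "mugshot of a criminal"},
]
kept = [s for s in samples if not flag_caption(s["caption"])]
print(f"kept {len(kept)} of {len(samples)} samples")
```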

Conclusion

This analysis demonstrates the nuanced challenge of scaling datasets for training AI models. While larger datasets can enhance a model's ability to generalize, they can also inadvertently amplify existing societal biases if not curated responsibly. The study reinforces the necessity for continued vigilance and advancement in ethical AI practices. As AI technologies become increasingly integrated into societal frameworks, the stakes for responsible AI development are high, requiring concerted efforts from developers, researchers, and policymakers alike.
