Large image datasets: A pyrrhic win for computer vision? (2006.16923v2)

Published 24 Jun 2020 in cs.CY, stat.AP, and stat.ML

Abstract: In this paper we investigate problematic practices and consequences of large scale vision datasets. We examine broad issues such as the question of consent and justice as well as specific concerns such as the inclusion of verifiably pornographic images in datasets. Taking the ImageNet-ILSVRC-2012 dataset as an example, we perform a cross-sectional model-based quantitative census covering factors such as age, gender, NSFW content scoring, class-wise accuracy, human-cardinality-analysis, and the semanticity of the image class information in order to statistically investigate the extent and subtleties of ethical transgressions. We then use the census to help hand-curate a look-up-table of images in the ImageNet-ILSVRC-2012 dataset that fall into the categories of verifiably pornographic: shot in a non-consensual setting (up-skirt), beach voyeuristic, and exposed private parts. We survey the landscape of harm and threats both society broadly and individuals face due to uncritical and ill-considered dataset curation practices. We then propose possible courses of correction and critique the pros and cons of these. We have duly open-sourced all of the code and the census meta-datasets generated in this endeavor for the computer vision community to build on. By unveiling the severity of the threats, our hope is to motivate the constitution of mandatory Institutional Review Boards (IRB) for large scale dataset curation processes.

Citations (334)

Summary

The paper argues that large-scale vision datasets assembled through uncritical curation practices violate privacy and informed consent, and often include images taken without their subjects' consent.
It conducts a quantitative audit of ImageNet, uncovering significant ethical lapses including NSFW and voyeuristic content.
The study proposes actionable guidelines such as mandatory IRB reviews and privacy-preserving methods to improve dataset curation.

Examining the Challenges of Large-Scale Vision Datasets

The paper "Large image datasets: A pyrrhic win for computer vision?" critically examines the ethical and practical problems of curating large-scale vision datasets. The authors analyze potential ethical breaches and the societal costs associated with these datasets, using the ImageNet-ILSVRC-2012 dataset as a focal example. Their investigation includes a detailed quantitative audit and considers the implications of current practices for privacy, consent, and broader social justice.

Ethical Concerns in Large-Scale Vision Datasets

The paper highlights key issues surrounding consent and privacy, noting how the massive collection of images often neglects informed consent principles. The researchers illustrate how datasets frequently include individuals' photographs without their awareness or approval. They draw specific attention to unethical content such as non-consensual voyeuristic images present in datasets like ImageNet.

The ImageNet Analysis

ImageNet is used as a case study to demonstrate the problems inherent in large-scale vision datasets. The authors conduct a cross-sectional, model-based census of ImageNet-ILSVRC-2012, covering factors such as age, gender, NSFW content scoring, class-wise accuracy, the number of people per image, and the semanticity of the image class information. They uncover privacy violations and ethical lapses, such as the presence of NSFW content, that raise pertinent questions about the integrity of such widely used datasets.
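As a rough illustration of what a class-wise census involves, the sketch below aggregates per-image model outputs into per-class statistics. It is a minimal sketch, not the authors' pipeline: the record fields (`nsfw_score`, `num_persons`, `correct`), the class names, and the toy values are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ImageRecord:
    # All fields are hypothetical; a real census would fill them from model outputs.
    image_id: str
    class_id: str      # dataset class label
    nsfw_score: float  # model-estimated score in [0, 1]
    num_persons: int   # model-estimated number of people in the image
    correct: bool      # whether a classifier predicted class_id correctly


def class_census(records):
    """Group records by class and compute simple per-class statistics."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.class_id].append(record)

    census = {}
    for class_id, rows in grouped.items():
        n = len(rows)
        census[class_id] = {
            "n_images": n,
            "mean_nsfw": sum(r.nsfw_score for r in rows) / n,
            "accuracy": sum(r.correct for r in rows) / n,
            "frac_with_persons": sum(r.num_persons > 0 for r in rows) / n,
        }
    return census


# Toy data for illustration only; not taken from the paper.
toy_records = [
    ImageRecord("img_0001", "class_a", 0.10, 0, True),
    ImageRecord("img_0002", "class_a", 0.30, 2, False),
    ImageRecord("img_0003", "class_b", 0.90, 1, True),
]
census = class_census(toy_records)
```

Classes with a high mean NSFW score or a high fraction of images containing people could then be prioritized for manual review.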

Societal Impacts and the Technological Landscape

The authors assess the societal harm and threats that arise due to inadequate curation practices. The paper postulates that the use of such datasets in training AI models may reinforce harmful stereotypes and biases, disproportionately impacting marginalized groups. Furthermore, the proliferation of even larger, less transparent datasets exacerbates these concerns.

Pathways for Ethical Data Curation

Recognizing these challenges, the authors propose actionable solutions for addressing the ethical concerns in large-scale vision datasets. They advocate for mandatory Institutional Review Boards (IRBs) for dataset curation processes and call for transparency and openness in how datasets are assembled. Suggested strategies include removing problematic images, obtaining informed consent, using synthetic data as an alternative, and applying privacy-preserving methods such as differential privacy to images of identifiable people.
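The first of these strategies, removing problematic images, can be sketched as a filter against a look-up table of flagged image identifiers. This is a minimal sketch under stated assumptions: the CSV layout, the `image_id` and `category` column names, and the identifiers are hypothetical and do not reflect the format of the authors' released look-up table.

```python
import csv
import io


def load_flagged_ids(csv_text):
    """Parse a look-up table (CSV text with an 'image_id' column) into a set of IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["image_id"] for row in reader}


def filter_dataset(image_ids, flagged):
    """Split image IDs into those kept and those removed because they are flagged."""
    kept = [image_id for image_id in image_ids if image_id not in flagged]
    removed = [image_id for image_id in image_ids if image_id in flagged]
    return kept, removed


# Hypothetical look-up table; column names and rows are assumptions for this example.
TOY_TABLE = (
    "image_id,category\n"
    "img_0002,beach_voyeuristic\n"
    "img_0005,exposed_private_parts\n"
)

flagged = load_flagged_ids(TOY_TABLE)
kept, removed = filter_dataset(["img_0001", "img_0002", "img_0003"], flagged)
```

Returning the removed identifiers alongside the kept ones makes it possible to log exactly which images were excluded, which supports the transparency the authors call for.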

Implications for Future AI Developments

The implications of this research extend to practical and theoretical domains. Practically, it suggests immediate remedies to prevent ongoing harm and unethical usage of datasets. Theoretically, it provides a foundation for refining ethical data usage frameworks and guidelines, which could reshape dataset curation processes worldwide.

Conclusion

This work represents a critical call to action for the computer vision and AI communities to reevaluate the methods used in curating large-scale datasets. The authors emphasize the need for a shift in how ethics are considered in dataset development, advocating for a more responsible and informed approach that prioritizes human dignity and social justice. The results and suggestions from this paper can serve as a blueprint for future improvements in dataset ethics, driving a more conscientious evolution of AI technologies.