GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models (2407.21001v3)
Abstract: Vision-language models (VLMs) are used intensively in many downstream tasks, including those requiring assessments of individuals appearing in images. While VLMs perform well in simple single-person scenarios, real-world applications often involve complex scenes in which persons of different genders perform different activities. We show that in such cases, VLMs are biased towards identifying the individual of the expected gender (according to gender stereotypes ingrained in the model or other forms of sample selection bias) as the performer of the activity. We refer to this bias in associating an activity with the gender of its actual performer in an image or text as the Gender-Activity Binding (GAB) bias, and we analyze how this bias is internalized in VLMs. To assess this bias, we have introduced the GAB dataset, containing approximately 5,500 AI-generated images that represent a variety of activities, addressing the scarcity of real-world images for some scenarios. To ensure quality control, the generated images are evaluated for diversity, quality, and realism. We have tested 12 well-known pre-trained VLMs on this dataset in text-to-image and image-to-text retrieval settings to measure the effect of this bias on their predictions. Additionally, we have carried out supplementary experiments to quantify the bias in VLMs' text encoders and to evaluate VLMs' ability to recognize activities. Our experiments indicate that VLMs suffer an average performance decline of about 13.2% when confronted with gender-activity binding bias.
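To make the retrieval setup concrete, the sketch below shows an image-to-text probe in the spirit of the evaluation described above, not the paper's actual protocol: a CLIP-style model scores a single image against two captions that differ only in which gender is bound to the activity. The checkpoint name, image file, and caption texts are illustrative placeholders, not items from the GAB dataset.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-style VLM exposing image-text similarity would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical scene: a man fixing a car while a woman stands nearby.
image = Image.open("scene.jpg")
captions = [
    "a man fixing a car",    # correct gender-activity binding
    "a woman fixing a car",  # gender-swapped binding
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; softmax turns them
# into a preference over the two candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs.squeeze().tolist())))
```

If the gender-swapped caption receives the higher score even though the image shows otherwise, the model has bound the activity to the stereotyped gender rather than to its actual performer; aggregating such failures over matched caption pairs is one way to arrive at a performance drop of the kind reported in the abstract.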