CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias (2308.12539v3)
Abstract: As language models (LMs) become increasingly powerful and widely used, it is important to quantify their potential for sociodemographic bias and harm. Prior measures of bias are sensitive to perturbations in the templates designed to compare performance across social groups, owing to factors such as low template diversity or a limited number of templates. Moreover, most previous work considers only a single NLP task. We introduce the Comprehensive Assessment of Language Model bias (CALM) for robust measurement of two universally relevant types of sociodemographic bias: gender and race. CALM integrates sixteen datasets spanning question answering, sentiment analysis, and natural language inference. Examples from each dataset are filtered to produce 224 templates with high diversity (e.g., in length and vocabulary). We assemble 50 highly frequent person names for each of seven distinct demographic groups to generate 78,400 prompts covering the three NLP tasks. Our empirical evaluation shows that CALM bias scores are more robust and far less sensitive than previous bias measurements to perturbations in the templates, such as synonym substitution, or to random selection of template subsets. We apply CALM to 20 LLMs and find that, for two LLM series, larger models tend to be more biased than smaller ones. The T0 series is the least biased model family of the 20 LLMs investigated here. The code is available at https://github.com/vipulgupta1011/CALM.