Unintended Impacts of LLM Alignment on Global Representation (2402.15018v2)
Abstract: Before deploying LLMs in user-facing applications, developers align them to user preferences through procedures such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to a specific preference set may have unintended effects. We explore how alignment affects performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We also find that alignment improves capabilities in several languages. We conclude by discussing the design decisions that led to these unintended impacts and offering recommendations for more equitable preference tuning. We make our code and data publicly available on GitHub.