Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision (2305.03047v2)
Abstract: Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of LLMs with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues of quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of how the principles are applied) to produce helpful, ethical, and reliable responses to users' queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly, without needing the principle set or the demonstrations; and finally, we offer a refinement step to address the issues of overly brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base LLM, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets across various settings.
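To make the four-stage pipeline concrete, below is a minimal Python sketch of how the stages described in the abstract could fit together. It is an illustration under assumptions, not the authors' released implementation: the `generate` and `finetune` callables, the prompt templates, and the exact stage wiring are hypothetical placeholders; only the stage names and the overall flow follow the paper.

```python
from typing import Callable, List, Tuple

# Minimal sketch of the four SELF-ALIGN stages. `generate` stands in for a
# base-LLM completion call and `finetune` for a supervised fine-tuning routine
# that returns a new model callable; both are hypothetical placeholders.
def self_align(
    generate: Callable[[str], str],
    finetune: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
    seed_prompts: List[str],   # the < 200 human-written seed prompts
    principles: str,           # the 16 generic, human-written principles
    exemplars: str,            # the 5 in-context demonstrations of principle use
) -> Callable[[str], str]:
    # Stage 1 (Topic-Guided Red-Teaming Self-Instruct): the base LLM expands
    # the seed prompts into many synthetic, topic-diverse prompts.
    synthetic_prompts = [
        generate(f"Write a new user instruction on a related topic to: {p}")
        for p in seed_prompts
    ]

    # Stage 2 (Principle-Driven Self-Alignment): principles and exemplars are
    # prepended so the base LLM produces helpful, ethical, reliable responses
    # via in-context learning.
    aligned_pairs = [
        (p, generate(f"{principles}\n\n{exemplars}\n\nUser: {p}\nAssistant:"))
        for p in synthetic_prompts
    ]

    # Stage 3 (Principle Engraving): fine-tune the original LLM on the
    # self-aligned pairs so it answers queries directly, without the principle
    # set or demonstrations in context.
    engraved = finetune(aligned_pairs)

    # Stage 4 (Verbose Cloning): refine overly brief or indirect answers by
    # fine-tuning again on more detailed versions of the model's own responses.
    verbose_pairs = [
        (p, engraved(f"Give a thorough, detailed answer.\nUser: {p}\nAssistant:"))
        for p, _ in aligned_pairs
    ]
    return finetune(verbose_pairs)
```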
- Anthropic. Claude's Constitution, 2023a. URL https://www.anthropic.com/index/claudes-constitution.
- Anthropic. Core views on AI safety: When, why, what, and how, 2023b. URL https://www.anthropic.com/index/core-views-on-ai-safety.
- A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
- Constitutional AI: Harmlessness from AI feedback, 2022b.
- Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://vicuna.lmsys.org.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
- Databricks. Free Dolly: Introducing the world’s first truly open instruction-tuned LLM, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Iason Gabriel. Artificial intelligence, values, and alignment. Minds and machines, 30(3):411–437, 2020.
- The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023.
- Koala: A dialogue model for academic research. Blog post, April 2023. URL https://bair.berkeley.edu/blog/2023/04/03/koala/.
- The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.
- Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
- OpenAssistant Conversations – democratizing large language model alignment, 2023.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
- TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Visual instruction tuning. 2023.
- Microsoft. Introducing the new Bing, 2023. URL https://www.bing.com/new#features.
- Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
- OpenAI. OpenAI: Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.
- OpenAI. GPT-4 technical report, 2023a.
- OpenAI. OpenAI: GPT-4, 2023b. URL https://openai.com/research/gpt-4.
- OpenAI. How do text-davinci-002 and text-davinci-003 differ? https://help.openai.com/en/articles/6779149-how-do-text-davinci-002-and-text-davinci-003-differ, 2023c.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- BBQ: A hand-built bias benchmark for question answering. arXiv preprint arXiv:2110.08193, 2021.
- Align-RUDDER: Learning from few demonstrations by reward redistribution. arXiv preprint arXiv:2009.14108, 2020.
- Language models are unsupervised multitask learners. 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
- On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning. arXiv preprint arXiv:2212.08061, 2022.
- Process for adapting language models to society (PALMS) with values-targeted datasets. Advances in Neural Information Processing Systems, 34:5861–5873, 2021.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- SALMON: Self-alignment with principle-following reward models. arXiv preprint arXiv:2310.05910, 2023a.
- Recitation-augmented language models. In International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=-cqvvvb-NkI.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Attention is all you need. NeurIPS, 2017.
- Poisoning language models during instruction tuning, 2023.
- Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022.
- Baize: An open-source chat model with parameter-efficient tuning on self-chat data. arXiv preprint arXiv:2304.01196, 2023.
- ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Zhiqing Sun
- Yikang Shen
- Qinhong Zhou
- Hongxin Zhang
- Zhenfang Chen
- David Cox
- Yiming Yang
- Chuang Gan