
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

arXiv:2305.03047
Published May 4, 2023 in cs.LG, cs.AI, cs.CL, and cs.CY

Abstract

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of LLMs with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues of quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principle application) to produce helpful, ethical, and reliable responses to users' queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly, without the principle set and the demonstrations; and finally, we offer a refinement step to address the issues of overly brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
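
To make the four-stage pipeline concrete, here is a minimal Python sketch of the process the abstract describes. It is illustrative only: `generate` and `finetune` are hypothetical stubs standing in for a real LLM completion API and fine-tuning loop, and the seed prompts, topic list, and principle text are placeholders rather than the paper's actual data.

```python
# Illustrative sketch of the four SELF-ALIGN stages described in the abstract.
# `generate` and `finetune` are hypothetical stubs, not the paper's code.
import random

def generate(prompt: str) -> str:
    """Stand-in for a completion call to a base LLM (e.g., LLaMA-65b)."""
    return f"<model output for: {prompt[:40]}>"

def finetune(model: str, pairs: list) -> str:
    """Stand-in for supervised fine-tuning on (prompt, response) pairs."""
    return f"{model} fine-tuned on {len(pairs)} pairs"

# Stage 1: generate synthetic prompts from a few human-written seed prompts,
# using a topic list to push for diversity.
seed_prompts = ["Explain photosynthesis.", "How do I choose a strong password?"]
topics = ["science", "security", "ethics"]
synthetic_prompts = [
    generate(f"Topic: {topic}. Write a new user question in the style of: "
             f"{random.choice(seed_prompts)}")
    for topic in topics
    for _ in range(3)
]

# Stage 2: answer every synthetic prompt with the 16 human-written principles
# and 5 exemplars of principle application placed in context.
principles = "<the 16 generic, human-written principles>"
exemplars = "<5 demonstrations of applying the principles>"
self_aligned_pairs = [
    (prompt, generate(f"{principles}\n{exemplars}\nUser: {prompt}\nAssistant:"))
    for prompt in synthetic_prompts
]

# Stage 3: fine-tune the original base model on the self-aligned responses,
# so it answers well without the principles or demonstrations in context.
aligned_model = finetune("llama-65b", self_aligned_pairs)

# Stage 4: a refinement pass against overly brief or indirect answers,
# sketched here as one more fine-tuning round on expanded responses.
verbose_pairs = [
    (prompt, generate(f"Rewrite this answer in direct, thorough detail: {response}"))
    for prompt, response in self_aligned_pairs
]
dromedary = finetune(aligned_model, verbose_pairs)
```

Note the division of labor: the principles and demonstrations are consumed only while generating training data in stage 2; after the fine-tuning in stage 3, the model is expected to produce desirable responses from the query alone.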

