Emergent Mind

Exploration with Principles for Diverse AI Supervision

(2310.08899)
Published Oct 13, 2023 in cs.CL

Abstract

Training large transformers with next-token prediction has driven groundbreaking advances in AI. While this generative approach has produced impressive results, it leans heavily on human supervision: even state-of-the-art models such as ChatGPT depend on fine-tuning from human demonstrations, which demands extensive human effort and domain expertise. This reliance on human oversight poses a significant obstacle to further progress. To address this limitation, we propose a novel paradigm, Exploratory AI (EAI), aimed at autonomously generating high-quality training data. Drawing inspiration from unsupervised reinforcement learning (RL) pretraining, EAI performs exploration directly in natural-language space by harnessing LLMs to assess the novelty of generated content. The approach has two key components: an actor that generates novel content according to exploration principles, and a critic that evaluates the generated content and offers critiques to guide the actor. Empirical evaluations demonstrate that EAI significantly boosts model performance on complex reasoning tasks, addressing the limitations of human-intensive supervision.
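The actor-critic exploration loop described in the abstract can be illustrated with a minimal toy sketch. Everything below (the function names, the word-overlap novelty heuristic, and the acceptance threshold) is an illustrative assumption, not the paper's implementation; in EAI, both the actor and the critic are LLMs.

```python
def actor_generate(prompt, history):
    """Stand-in for the LLM actor that proposes new content.

    This toy version just appends a counter so successive proposals
    differ; a real actor would sample from a language model conditioned
    on exploration principles and the critic's prior critiques."""
    return f"{prompt} [variant {len(history)}]"

def critic_evaluate(candidate, history):
    """Stand-in for the LLM critic: scores novelty against prior outputs.

    This heuristic uses word overlap with previously accepted content;
    the paper's critic is an LLM that judges novelty and writes
    natural-language critiques."""
    cand_words = set(candidate.split())
    if not history:
        return 1.0, "novel"
    overlap = max(len(cand_words & set(h.split())) / len(cand_words)
                  for h in history)
    novelty = 1.0 - overlap
    critique = "novel" if novelty > 0.2 else "too similar; explore further"
    return novelty, critique

def eai_explore(prompt, rounds=5, threshold=0.2):
    """Run the actor-critic loop, keeping only candidates the critic deems novel."""
    accepted = []
    for _ in range(rounds):
        candidate = actor_generate(prompt, accepted)
        novelty, critique = critic_evaluate(candidate, accepted)
        if novelty > threshold:
            accepted.append(candidate)
        # A real system would feed `critique` back into the actor's next prompt.
    return accepted

data = eai_explore("Write a math word problem", rounds=4)
print(len(data))
```

The key design point the sketch captures is the division of labor: the actor only generates, while the critic filters for novelty and (in the full system) supplies feedback that steers subsequent generations toward unexplored regions of the language space.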

