
Abstract

Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of human preferences that acts as a reward function for optimization. This approach, which operates at the intersection of many stakeholders and academic disciplines, remains poorly understood. RLHF reward models are often cited as being central to achieving performance, yet very few descriptions of their capabilities, evaluations, training methods, or open-source models exist. Given this lack of information, further study and transparency are needed for learned RLHF reward models. In this paper, we illustrate the complex history of optimizing preferences and articulate lines of inquiry to understand the sociotechnical context of reward models. In particular, we highlight the ontological differences between costs, rewards, and preferences at stake in RLHF's foundations, related methodological tensions, and possible research directions to improve general understanding of how reward models function.
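
Because the abstract centers on training a model of human preferences to act as a reward function, a minimal sketch of how such a reward model is commonly fit to pairwise preference data may help make the idea concrete. The snippet below uses the standard Bradley-Terry pairwise loss; names such as SimpleRewardModel and preference_loss are illustrative assumptions, not the implementation of any specific system discussed in the paper.

# Minimal sketch (assumed, not from the paper): fitting a scalar reward model
# to pairwise human preference data with the Bradley-Terry loss,
# -log sigmoid(r_chosen - r_rejected), so that preferred responses score higher.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleRewardModel(nn.Module):
    """Toy scalar reward head over pre-computed response embeddings."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)  # maps an embedding to a scalar reward

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood for a batch of preference pairs."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    embed_dim = 16
    model = SimpleRewardModel(embed_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stand-in embeddings for (chosen, rejected) response pairs; in practice these
    # would come from an LLM backbone encoding the prompt and each response.
    chosen = torch.randn(8, embed_dim)
    rejected = torch.randn(8, embed_dim)

    loss = preference_loss(model(chosen), model(rejected))
    loss.backward()
    optimizer.step()
    print(f"pairwise preference loss: {loss.item():.4f}")

In a full RLHF pipeline, the scalar scores produced by such a model are then used as the reward signal for a policy-optimization step (for example, PPO) applied to the language model.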
