The History and Risks of Reinforcement Learning and Human Feedback (2310.13595v2)
Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) easier to use and more effective. A core piece of the RLHF process is the training and utilization of a model of human preferences that acts as a reward function for optimization. This approach, which operates at the intersection of many stakeholders and academic disciplines, remains poorly understood. RLHF reward models are often cited as central to achieving performance, yet little has been documented about their capabilities, evaluations, or training methods, and few open-source models exist. Given this lack of information, further study and transparency are needed for learned RLHF reward models. In this paper, we illustrate the complex history of optimizing preferences and articulate lines of inquiry to understand the sociotechnical context of reward models. In particular, we highlight the ontological differences between costs, rewards, and preferences at stake in RLHF's foundations, related methodological tensions, and possible research directions to improve general understanding of how reward models function.
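Since the abstract turns on a preference model that "acts as a reward function for optimization," a minimal sketch may help fix ideas. The snippet below shows the standard pairwise (Bradley-Terry-style) loss commonly used to train RLHF reward models, where the model learns to score a preferred ("chosen") completion above a rejected one. The `RewardModel` class, its hidden size, and the toy feature tensors are illustrative assumptions for this sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy scalar reward head; a real RLHF setup puts this on top of an LLM."""

    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, hidden_size) pooled representation of a completion
        return self.score(features).squeeze(-1)  # (batch,) scalar rewards


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimizing it pushes the model to score preferred completions higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    # Stand-in features for (chosen, rejected) completion pairs; in practice
    # these come from encoding human-labeled preference data with the LLM.
    chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
    for _ in range(100):
        loss = preference_loss(model(chosen), model(rejected))
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"final preference loss: {loss.item():.4f}")
```

The resulting scalar scores then serve as the reward signal for a policy-gradient optimizer such as PPO, which is where the over-optimization and sociotechnical concerns the paper raises come into play.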