Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make LLMs easier to use and more effective. A core piece of the RLHF process is the training and use of a model of human preferences that acts as a reward function for optimization. This approach, which operates at the intersection of many stakeholders and academic disciplines, remains poorly understood. RLHF reward models are often cited as central to achieving performance, yet few public descriptions of their capabilities, evaluations, or training methods exist, and few open-source models are available. Given this lack of information, further study and transparency are needed for learned RLHF reward models. In this paper, we illustrate the complex history of optimizing preferences and articulate lines of inquiry to understand the sociotechnical context of reward models. In particular, we highlight the ontological differences between costs, rewards, and preferences at stake in RLHF's foundations; related methodological tensions; and possible research directions to improve general understanding of how reward models function.
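For concreteness, the preference model at the center of this process is commonly trained with a Bradley-Terry-style pairwise objective; the following is a minimal sketch under that assumption (the notation $r_\theta$, $y_c$, $y_r$ is ours, not drawn from this paper):
\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_c,\, y_r) \sim \mathcal{D}} \left[ \log \sigma\!\left( r_\theta(x, y_c) - r_\theta(x, y_r) \right) \right],
\]
where $x$ is a prompt, $y_c$ and $y_r$ are the human-chosen and human-rejected completions, $r_\theta$ is the scalar reward model, and $\sigma$ is the logistic function. The resulting $r_\theta$ then typically serves as the reward signal for policy optimization.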