Crowd-PrefRL: Preference-Based Reward Learning from Crowds (2401.10941v2)

Published 17 Jan 2024 in cs.HC, cs.LG, and cs.SI

Abstract: Preference-based reinforcement learning (RL) provides a framework to train AI agents using human feedback through preferences over pairs of behaviors, enabling agents to learn desired behaviors when it is difficult to specify a numerical reward function. While this paradigm leverages human feedback, it typically treats the feedback as given by a single human user. However, different users may desire multiple AI behaviors and modes of interaction. Meanwhile, incorporating preference feedback from crowds (i.e. ensembles of users) in a robust manner remains a challenge, and the problem of training RL agents using feedback from multiple human users remains understudied. In this work, we introduce a conceptual framework, Crowd-PrefRL, that integrates preference-based RL approaches with techniques from unsupervised crowdsourcing to enable training of autonomous system behaviors from crowdsourced feedback. We show preliminary results suggesting that Crowd-PrefRL can learn reward functions and agent policies from preference feedback provided by crowds of unknown expertise and reliability. We also show that in most cases, agents trained with Crowd-PrefRL outperform agents trained with majority-vote preferences or preferences from any individual user, especially when the spread of user error rates among the crowd is large. Results further suggest that our method can identify the presence of minority viewpoints within the crowd in an unsupervised manner.
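The page carries no method details beyond the abstract, but the abstract points to two components that can be sketched concretely: unsupervised aggregation of pairwise preference labels from a crowd of unknown reliability, and reward learning from the fused labels in the standard preference-based RL fashion. The Python sketch below is a minimal illustration under assumptions not stated in the abstract: binary A-vs-B labels, a spectral reliability estimate in the spirit of unsupervised crowdsourcing methods (e.g., Parisi et al., PNAS 2014), and a Bradley-Terry reward model as is standard in preference-based RL. It is not the authors' implementation, and the function and class names (aggregate_crowd_preferences, RewardModel, preference_loss) are hypothetical.

```python
# Minimal sketch (assumptions noted above): fuse crowd preference labels with an
# unsupervised reliability estimate, then fit a Bradley-Terry style reward model.
# Names and design choices here are illustrative, not the paper's implementation.
import numpy as np
import torch
import torch.nn as nn


def aggregate_crowd_preferences(labels: np.ndarray) -> np.ndarray:
    """Fuse binary pairwise preferences from users of unknown reliability.

    labels: (num_users, num_queries) array; entry 1 means the user preferred
    segment A over segment B for that query, 0 means the opposite.
    Returns one fused label per query, weighting users by a spectral,
    label-free reliability estimate (assumes most users beat chance).
    """
    signed = 2.0 * labels - 1.0                 # map {0, 1} -> {-1, +1}
    cov = np.cov(signed)                        # user-by-user covariance of votes
    np.fill_diagonal(cov, 0.0)                  # off-diagonal entries carry the shared signal
    _, eigvecs = np.linalg.eigh(cov)
    weights = eigvecs[:, -1]                    # leading eigenvector ~ per-user accuracy
    weights *= np.sign(weights.sum())           # fix sign: majority assumed better than chance
    fused = (weights @ signed) > 0.0            # reliability-weighted vote per query
    return fused.astype(np.float32)


class RewardModel(nn.Module):
    """Per-state reward network trained on fused crowd preferences."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def preference_loss(model: RewardModel, seg_a: torch.Tensor, seg_b: torch.Tensor,
                    fused_labels: np.ndarray) -> torch.Tensor:
    """Bradley-Terry loss: P(A preferred) = sigmoid(sum r(A) - sum r(B)).

    seg_a, seg_b: (num_queries, segment_len, obs_dim) behavior segments.
    """
    r_a = model(seg_a).sum(dim=1).squeeze(-1)   # total predicted reward of segment A
    r_b = model(seg_b).sum(dim=1).squeeze(-1)   # total predicted reward of segment B
    targets = torch.as_tensor(fused_labels)
    return nn.functional.binary_cross_entropy_with_logits(r_a - r_b, targets)
```

Under these assumptions, the fused labels replace a simple majority vote: unreliable users are down-weighted rather than counted equally, which is one plausible reading of why the abstract reports the largest gains when the spread of user error rates is large, and of how users whose labels consistently oppose the weighted consensus could be flagged as a minority viewpoint. This is an interpretation of the abstract, not a statement of the paper's actual mechanism.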
