Crowd-PrefRL: Preference-Based Reward Learning from Crowds (2401.10941v2)

Published 17 Jan 2024 in cs.HC, cs.LG, and cs.SI

Abstract: Preference-based reinforcement learning (RL) provides a framework to train AI agents using human feedback through preferences over pairs of behaviors, enabling agents to learn desired behaviors when it is difficult to specify a numerical reward function. While this paradigm leverages human feedback, it typically treats the feedback as given by a single human user. However, different users may desire multiple AI behaviors and modes of interaction. Meanwhile, incorporating preference feedback from crowds (i.e., ensembles of users) in a robust manner remains a challenge, and the problem of training RL agents using feedback from multiple human users remains understudied. In this work, we introduce a conceptual framework, Crowd-PrefRL, that integrates preference-based RL approaches with techniques from unsupervised crowdsourcing to enable training of autonomous system behaviors from crowdsourced feedback. We show preliminary results suggesting that Crowd-PrefRL can learn reward functions and agent policies from preference feedback provided by crowds of unknown expertise and reliability. We also show that in most cases, agents trained with Crowd-PrefRL outperform agents trained with majority-vote preferences or preferences from any individual user, especially when the spread of user error rates among the crowd is large. Results further suggest that our method can identify the presence of minority viewpoints within the crowd in an unsupervised manner.
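
Although no code accompanies this page, the abstract's two ingredients, unsupervised aggregation of crowd preferences and preference-based reward learning, can be made concrete. The sketch below is an assumption-laden illustration rather than the authors' implementation: it estimates per-user reliability from the user-user vote covariance in the style of a spectral meta-learner, forms weighted-majority crowd labels, and fits a standard Bradley-Terry reward model of the kind commonly used in preference-based RL. All names (`aggregate_crowd_preferences`, `RewardNet`, `bradley_terry_loss`) are hypothetical.

```python
# Illustrative sketch only; not the authors' Crowd-PrefRL implementation.
import numpy as np
import torch
import torch.nn as nn


def aggregate_crowd_preferences(votes: np.ndarray) -> np.ndarray:
    """votes: (n_users, n_pairs) matrix of +/-1 preference labels.

    Estimates per-user reliability from the leading eigenvector of the
    user-user vote covariance (spectral meta-learner style, an assumption
    here), then returns weighted-majority crowd labels in {-1, +1}.
    """
    q = np.cov(votes)                  # user-user covariance of votes
    np.fill_diagonal(q, 0.0)           # diagonal carries no reliability signal
    _, eigvecs = np.linalg.eigh(q)
    v = eigvecs[:, -1]                 # leading eigenvector ~ user reliabilities
    v = v * np.sign(v.sum())           # resolve the global sign ambiguity
    return np.sign(v @ votes)          # reliability-weighted vote per pair


class RewardNet(nn.Module):
    """Maps a state(-action) feature vector to a scalar reward estimate."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)


def bradley_terry_loss(reward_net, seg_a, seg_b, labels):
    """seg_a, seg_b: (n_pairs, seg_len, obs_dim) trajectory segments.
    labels: crowd labels in {-1, +1}; +1 means segment A is preferred.
    """
    r_a = reward_net(seg_a).sum(dim=(1, 2))    # predicted return of segment A
    r_b = reward_net(seg_b).sum(dim=(1, 2))    # predicted return of segment B
    prefers_a = (torch.as_tensor(labels) > 0).float()
    logits = r_a - r_b                          # Bradley-Terry log-odds
    return nn.functional.binary_cross_entropy_with_logits(logits, prefers_a)
```

In a setup like the one the abstract describes, the aggregated labels would stand in for single-user preference labels inside an off-the-shelf preference-based RL loop (e.g., PEBBLE-style reward learning plus policy optimization), and the spread of the estimated reliability weights is one plausible signal for surfacing minority viewpoints in the crowd.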

