Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior (2212.03733v3)
Abstract: Reinforcement-learning agents seek to maximize a reward signal through environmental interactions. As the humans in the loop, our job is to design reward functions that both express the desired behavior and enable the agent to learn that behavior swiftly. However, designing reward functions that induce the desired behavior is generally hard, let alone finding ones that also make learning fast. In this work, we introduce a family of reward structures, called Tiered Reward, that addresses both questions. We consider the reward-design problem in tasks formulated as reaching desirable states while avoiding undesirable states. We first propose a strict partial ordering over the policy space to resolve trade-offs in behavior preference: we prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. We then introduce Tiered Reward, a class of environment-independent reward functions, and show that it is guaranteed to induce policies that are Pareto-optimal with respect to this preference relation. Finally, we demonstrate that Tiered Reward leads to fast learning with multiple tabular and deep reinforcement-learning algorithms.
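As an illustration of the idea, one simple way to realize a tiered reward structure is to partition states into ordered tiers (worst to best) and assign each tier a per-step reward that is strictly more negative for worse tiers, with the goal tier receiving 0. The function below is a hypothetical sketch: the geometric spacing between tiers and all names are illustrative assumptions, not the paper's exact construction.

```python
# Hypothetical sketch of a tiered reward assignment.
# The geometric spacing (`factor`) is an illustrative assumption,
# not the paper's exact condition on tier rewards.

def tiered_rewards(num_tiers: int, base: float = -1.0, factor: float = 4.0) -> list[float]:
    """Return per-step rewards for each tier, ordered worst (index 0)
    to best (last index). The goal tier gets 0; each worse tier is
    strictly more negative than the tier above it."""
    rewards = [0.0] * num_tiers
    for tier in range(num_tiers - 2, -1, -1):
        # Widen the penalty geometrically as tiers get worse.
        rewards[tier] = base * factor ** (num_tiers - 2 - tier)
    return rewards

print(tiered_rewards(4))  # [-16.0, -4.0, -1.0, 0.0]
```

Under such a structure, an agent collecting per-step rewards is pushed to leave bad tiers quickly and reach the goal tier fast, which mirrors the preference relation described in the abstract.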