Provably Efficient Exploration in Policy Optimization (1912.05830v4)
Abstract: While policy-based reinforcement learning (RL) achieves tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision processes with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
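The abstract compresses two coupled steps: an optimistic policy evaluation step (least-squares regression on linear features plus a UCB-style bonus, which is where exploration enters) and a policy improvement step that follows the "optimistic" gradient direction via a KL-regularized, mirror-descent-style update. Below is a minimal NumPy sketch of one such round on a toy finite state-action space. The array shapes, the fabricated regression targets, and the constants `alpha`, `beta`, `lam` are illustrative assumptions, not the paper's pseudocode; the actual algorithm additionally maintains step-indexed value functions over the horizon $H$ and truncates value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 5, 3, 4                  # toy sizes: states, actions, feature dimension
alpha, beta, lam = 0.5, 1.0, 1.0   # step size, bonus scale, ridge parameter (assumed values)

Phi = rng.normal(size=(S, A, d))   # feature map phi(s, a)
pi = np.full((S, A), 1.0 / A)      # uniform initial policy

# --- Optimistic policy evaluation: ridge regression of Q on observed
# (s, a, target) tuples, where target stands in for r + V_next from the
# collected episode (fabricated here for a self-contained example).
visits = [(rng.integers(S), rng.integers(A), rng.normal()) for _ in range(50)]
Lam = lam * np.eye(d)
b = np.zeros(d)
for s, a, y in visits:
    phi = Phi[s, a]
    Lam += np.outer(phi, phi)
    b += phi * y
w = np.linalg.solve(Lam, b)

# Exploration via optimism: add a UCB-style bonus beta * sqrt(phi^T Lam^{-1} phi).
Lam_inv = np.linalg.inv(Lam)
Q = np.einsum("sad,d->sa", Phi, w)
bonus = beta * np.sqrt(np.einsum("sad,de,sae->sa", Phi, Lam_inv, Phi))
Q_opt = Q + bonus

# --- Policy improvement: mirror-descent / multiplicative-weights step,
# pi_new(a|s) proportional to pi(a|s) * exp(alpha * Q_opt(s, a)).
logits = np.log(pi) + alpha * Q_opt
logits -= logits.max(axis=1, keepdims=True)   # subtract max for numerical stability
pi_new = np.exp(logits)
pi_new /= pi_new.sum(axis=1, keepdims=True)
```

The exponential-weights form of the improvement step is what makes the update a "soft" version of policy iteration: with `alpha` → ∞ it recovers greedy policy improvement, while finite `alpha` keeps consecutive policies close in KL divergence, which is the property the regret analysis exploits.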