ReZero: Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyze (2404.16364v4)
Abstract: Monte Carlo Tree Search (MCTS)-based algorithms, such as MuZero and its derivatives, have achieved widespread success in various decision-making domains. These algorithms employ a reanalyze process to improve sample efficiency on stale data, albeit at the cost of significant wall-clock time. To address this issue, we propose a general approach named ReZero that speeds up tree search operations in MCTS-based algorithms. Specifically, drawing inspiration from the one-armed bandit model, we reanalyze training samples with a backward-view reuse technique that obtains the value estimate of a particular child node in advance. To complement this design, we periodically reanalyze the entire buffer instead of frequently reanalyzing each mini-batch. Together, these two designs significantly reduce search cost while maintaining or even improving performance, simplifying both data collection and reanalysis. Experiments on Atari environments and board games demonstrate that ReZero substantially improves training speed while maintaining high sample efficiency. The code is available as part of the LightZero benchmark at https://github.com/opendilab/LightZero.
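The abstract only sketches the two mechanisms, so the following is a minimal Python sketch of the intended control flow under stated assumptions: every identifier here (`Step`, `Trajectory`, `mcts_search`, `reanalyze_trajectory`, `reanalyze_interval`, and so on) is a hypothetical stand-in rather than the LightZero API, and the search itself is stubbed out.

```python
# Sketch of ReZero's two ideas as described in the abstract:
# (1) backward-view reuse: reanalyze a trajectory from its last step to
#     its first, so the root value computed at step t+1 can be handed to
#     the search at step t as a ready-made value estimate for the root
#     child reached by the stored action a_t;
# (2) entire-buffer reanalyze: refresh targets for the whole buffer every
#     `reanalyze_interval` iterations instead of per mini-batch.
# All names below are hypothetical illustrations, not the LightZero API.

from dataclasses import dataclass, field

@dataclass
class Step:
    obs: object                    # observation s_t
    action: int                    # action a_t actually taken
    target_policy: list = field(default_factory=list)  # set by reanalyze
    target_value: float = 0.0

@dataclass
class Trajectory:
    steps: list

def mcts_search(obs, reused_action=None, reused_child_value=None):
    """Stub MCTS. If (reused_action, reused_child_value) is supplied, the
    root child for that action uses the cached value instead of expanding
    its subtree, which is where the backward-view speed-up comes from."""
    # ... run tree search with the current networks; placeholder outputs:
    policy = [1.0]      # visit-count distribution at the root
    root_value = 0.0    # root value estimate
    return policy, root_value

def reanalyze_trajectory(traj: Trajectory):
    """Backward-view reanalyze: walk the trajectory from the last step to
    the first, caching each root value for reuse one step earlier."""
    next_root_value = None  # the final step has no successor to reuse
    for t in range(len(traj.steps) - 1, -1, -1):
        step = traj.steps[t]
        policy, root_value = mcts_search(
            step.obs,
            reused_action=step.action,            # a_t leads to s_{t+1}
            reused_child_value=next_root_value,   # root value of s_{t+1}
        )
        step.target_policy, step.target_value = policy, root_value
        next_root_value = root_value  # reused by the search at step t-1

def train(buffer, num_iterations, reanalyze_interval=100):
    for it in range(num_iterations):
        # Entire-buffer reanalyze: periodic, not per-mini-batch.
        if it % reanalyze_interval == 0:
            for traj in buffer:
                reanalyze_trajectory(traj)
        # ... sample a mini-batch from `buffer` and update the networks.
```

Reanalyzing in reverse order means each search at step t receives the step-t+1 root value for free, so one root child per search need not be expanded; doing this over the whole buffer at once amortizes the refresh across all trajectories instead of repeating it for every sampled mini-batch.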