Emergent Mind

A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

(2102.01567)
Published Feb 2, 2021 in cs.LG , math.OC , and stat.ML

Abstract

This paper develops an unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as \textit{Markovian Stochastic Approximation} (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL. This was first posed as an open problem in (Sutton, 1999).

We're not able to analyze this paper right now due to high demand.

Please check back later (sorry!).

Generate a summary of this paper on our Pro plan:

We ran into a problem analyzing this paper.

Newsletter

Get summaries of trending comp sci papers delivered straight to your inbox:

Unsubscribe anytime.