Abstract

In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^{2}>1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q\in[\gamma^{2},1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$, and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that this lower bound on the sample complexity is exponential in $d$. If $q=\gamma^{2}$, even infinitely many samples do not suffice. Under a low-distribution-shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{ \frac{\left\Vert \theta^{\pi}\right\Vert _{2}^{4}}{\varepsilon^{4}}\log\frac{d}{\delta},\frac{1}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)\right\} \right)$ samples (where $\theta^{\pi}$ is the parameter of the policy's value function under linear function approximation) and approximates the value function up to an additive error of $\varepsilon$ with probability at least $1-\delta$.
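
To make the scaling of the lower bound concrete, here is a minimal sketch (in Python) that evaluates the expression $\frac{d}{\gamma^{2}(q-\gamma^{2})\varepsilon^{2}}\exp\left(d\gamma^{2}\right)$ for a few values of $d$. The constants hidden by the $\Omega(\cdot)$ and $\Theta(\cdot)$ notation are set to 1 purely as an assumption for illustration, and the helper name `ope_sample_lower_bound` is hypothetical, not from the paper.

```python
import math

# Sketch of the lower-bound expression from the abstract,
#   Omega( d / (gamma^2 * (q - gamma^2) * eps^2) * exp(Theta(d * gamma^2)) ),
# with the hidden constants c1, c2 set to 1 by assumption (illustration only).
def ope_sample_lower_bound(d, gamma, q, eps, c1=1.0, c2=1.0):
    assert gamma ** 2 < q <= 1, "bound is finite only for q in (gamma^2, 1]"
    return c1 * d / (gamma ** 2 * (q - gamma ** 2) * eps ** 2) * math.exp(c2 * d * gamma ** 2)

# The exponential dependence on d dominates once d * gamma^2 is large.
for d in (10, 20, 40, 80):
    print(d, f"{ope_sample_lower_bound(d, gamma=0.9, q=0.9, eps=0.1):.2e}")
```

For $q=\gamma^{2}$ the denominator vanishes, matching the abstract's statement that no finite amount of data suffices in that case.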
