
Abstract

We initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary stochastic shortest path problem with changing cost and transition functions. We start by establishing a lower bound $\Omega((B_{\star} S A T_{\star} (\Delta_c + B_{\star}^2 \Delta_P))^{1/3} K^{2/3})$, where $B_{\star}$ is the maximum expected cost of the optimal policy of any episode starting from any state, $T_{\star}$ is the maximum hitting time of the optimal policy of any episode starting from the initial state, $SA$ is the number of state-action pairs, $\Delta_c$ and $\Delta_P$ are the amounts of change of the cost and transition functions respectively, and $K$ is the number of episodes. The different roles of $\Delta_c$ and $\Delta_P$ in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming knowledge of $\Delta_c$ and $\Delta_P$, we develop a simple but sub-optimal algorithm and another more involved minimax-optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when $\Delta_c$ and $\Delta_P$ are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve $\widetilde{O}(\min\{B_{\star} S \sqrt{ALK},\; (B_{\star}^2 S^2 A T_{\star} (\Delta_c + B_{\star} \Delta_P))^{1/3} K^{2/3}\})$ regret, where $L$ is the unknown number of changes of the environment.
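To make the "Bernstein-style bonuses" mentioned above concrete, here is a minimal Python sketch of a generic Bernstein-type exploration bonus. This is an illustration of the general technique only, not the exact construction of the MVP algorithm or of this paper; the function name, arguments, and constants are all illustrative assumptions.

```python
import numpy as np

def bernstein_bonus(emp_var: float, n: int, b_max: float, log_conf: float) -> float:
    """Generic Bernstein-style exploration bonus (illustrative sketch only;
    not the exact bonus of the MVP algorithm [Zhang et al., 2020]).

    emp_var  -- empirical variance of the next-state value estimate
    n        -- visit count of the state-action pair
    b_max    -- scale of the value function (plays the role of B_star)
    log_conf -- confidence log factor, e.g. log(S*A*K / delta)
    """
    n = max(n, 1)  # guard against division by zero before the first visit
    # Dominant variance term decays like sqrt(1/n); the lower-order
    # range term decays like 1/n, mirroring Bernstein's inequality.
    return np.sqrt(2.0 * emp_var * log_conf / n) + 3.0 * b_max * log_conf / n
```

Optimistic algorithms in this family subtract such a bonus from empirical cost estimates so that the resulting value estimates remain optimistic with high probability; the paper's actual bonuses additionally account for the non-stationary environment.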
