Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 52 tok/s

Gemini 2.5 Pro 47 tok/s Pro

GPT-5 Medium 18 tok/s Pro

GPT-5 High 13 tok/s Pro

GPT-4o 100 tok/s Pro

Kimi K2 192 tok/s Pro

GPT OSS 120B 454 tok/s Pro

Claude Sonnet 4 37 tok/s Pro

2000 character limit reached

Replicability in Reinforcement Learning (2305.19562v2)

Published 31 May 2023 in cs.LG, cs.AI, and stat.ML

Abstract: We initiate the mathematical study of replicability as an algorithmic property in the context of reinforcement learning (RL). We focus on the fundamental setting of discounted tabular MDPs with access to a generative model. Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions on i.i.d. samples drawn from the generator when its internal randomness is the same. We first provide an efficient $\rho$-replicable algorithm for $(\varepsilon, \delta)$-optimal policy estimation with sample and time complexity $\widetilde O\left(\frac{N^{3\cdot\log(1/\delta)}{(1-\gamma)^{5\cdot\varepsilon^{2\cdot\rho^2}\right)$,}}} where $N$ is the number of state-action pairs. Next, for the subclass of deterministic algorithms, we provide a lower bound of order $\Omega\left(\frac{N^{3}{(1-\gamma)^{3\cdot\varepsilon^{2\cdot\rho^2}\right)$.}}} Then, we study a relaxed version of replicability proposed by Kalavasis et al. [2023] called TV indistinguishability. We design a computationally efficient TV indistinguishable algorithm for policy estimation whose sample complexity is $\widetilde O\left(\frac{N^{2\cdot\log(1/\delta)}{(1-\gamma)^{5\cdot\varepsilon^{2\cdot\rho^2}\right)$.}}} At the cost of $\exp(N)$ running time, we transform these TV indistinguishable algorithms to $\rho$-replicable ones without increasing their sample complexity. Finally, we introduce the notion of approximate-replicability where we only require that two outputted policies are close under an appropriate statistical divergence (e.g., Renyi) and show an improved sample complexity of $\widetilde O\left(\frac{N\cdot\log(1/\delta)}{(1-\gamma)^{5\cdot\varepsilon^{2\cdot\rho^2}\right)$.}}