- The paper introduces the PO-Bilinear Actor-Critic framework to address challenges in partially observable reinforcement learning.
- It provides sample complexity bounds and PAC guarantees for efficient policy learning in models like POMDPs, LQGs, and PSRs.
- The approach competes against the best memory-based policy in a given class and, for several observable models, avoids an exponential dependence on the horizon.
Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems
Reinforcement Learning (RL) in partially observable environments presents unique challenges, primarily because the latent state driving the dynamics is never directly observed, so good behavior must be inferred from histories of observations. This paper proposes a framework that addresses these challenges by extending function-approximation-based RL to partially observable settings through a novel actor-critic construction.
Partially Observable Bilinear Actor-Critic Framework
The proposed framework introduces a new structural class, the Partially Observable Bilinear Actor-Critic (PO-Bilinear AC) class, which brings general function approximation to RL in partially observable systems. The class is broad enough to capture several well-studied models (a stylized statement of its defining condition appears just after the list):
- Observable tabular Partially Observable Markov Decision Processes (POMDPs)
- Observable Linear-Quadratic-Gaussian (LQG)
- Predictive State Representations (PSRs)
- A newly introduced model: Hilbert Space Embeddings of POMDPs
- Observable POMDPs with latent low-rank transition
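Roughly speaking, the defining condition is that the Bellman-style residual of any candidate (policy, value function) pair factorizes as an inner product of two low-dimensional embeddings. The display below is a stylized version with simplified notation; the memory variable, the embeddings, and the norm bounds are schematic rather than the paper's exact definition.

```latex
% Stylized PO-bilinear rank condition (notation simplified; schematic only).
% \bar z_h : fixed-length memory of recent observations/actions, o_h : current observation.
\[
\underbrace{\mathbb{E}^{\pi}\!\left[\, g_h(\bar z_{h-1}, o_h) - r_h - g_{h+1}(\bar z_h, o_{h+1}) \,\right]}_{\text{memory-augmented Bellman residual at step } h}
\;=\; \big\langle\, W_h(\pi, g),\; X_h(\pi) \,\big\rangle
\qquad \forall\, \pi \in \Pi,\; g \in \mathcal{G},
\]
\[
\text{with } W_h(\pi, g),\, X_h(\pi) \in \mathbb{R}^d
\ \text{ and bounded norms } \ \|W_h(\pi,g)\| \le B_W,\ \ \|X_h(\pi)\| \le B_X .
\]
```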
The core of the framework is an actor-critic-style algorithm that performs agnostic policy learning by competing against the best memory-based policy within a given class. The policy class consists of memory-based policies that act on a fixed-length window of recent observations, while the value function class consists of functions that take both the memory and future observations as inputs; a minimal sketch of these two classes is given below.
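The sketch below illustrates the shape of these two classes in code. It is a minimal, hypothetical instantiation: the linear-softmax policy, the linear critic, and the constants `M`, `OBS_DIM`, and `N_ACTIONS` are illustrative choices, not specifications from the paper.

```python
import numpy as np

M = 3          # memory length (window of recent observations), illustrative
OBS_DIM = 8    # observation feature dimension, illustrative
N_ACTIONS = 4  # number of discrete actions, illustrative


class MemoryPolicy:
    """Memory-based policy: acts on a fixed-length window of recent observations.

    Here the policy is a linear-softmax map from the flattened M-step memory
    to a distribution over actions (an illustrative parameterization).
    """

    def __init__(self, rng):
        self.theta = 0.01 * rng.standard_normal((M * OBS_DIM, N_ACTIONS))

    def act(self, memory, rng):
        # memory: list of the last M observation vectors (most recent last)
        x = np.concatenate(memory)            # flatten the memory window
        logits = x @ self.theta
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return rng.choice(N_ACTIONS, p=probs)


class ValueLinkFunction:
    """Critic-side function: scores a (memory, future observation) pair.

    In the PO-Bilinear AC setup the critic class takes both the memory and
    future observations as inputs; here it is a linear function of their
    concatenation, again purely for illustration.
    """

    def __init__(self, rng):
        self.w = 0.01 * rng.standard_normal(M * OBS_DIM + OBS_DIM)

    def value(self, memory, future_obs):
        x = np.concatenate(memory + [future_obs])
        return float(x @ self.w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    policy = MemoryPolicy(rng)
    critic = ValueLinkFunction(rng)
    memory = [rng.standard_normal(OBS_DIM) for _ in range(M)]
    next_obs = rng.standard_normal(OBS_DIM)
    print("action:", policy.act(memory, rng))
    print("value estimate:", critic.value(memory, next_obs))
```

The only structural point the sketch is meant to convey is the input signature: the actor sees a fixed window of past observations, while the critic additionally sees future observations.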
Algorithmic Innovations
The paper further shows that the algorithm can compete against the globally optimal policy, not merely the best in-class memory-based policy, without incurring an exponential dependence on the horizon. This holds for models such as undercomplete observable tabular POMDPs, observable LQGs, and observable POMDPs with latent low-rank transition, where short-memory policies can approximate the globally optimal policy; the algorithm exploits these model-specific properties to achieve sample-efficient learning. A hedged sketch of the main loop follows.
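The sketch below conveys the flavor of such a constrained actor-critic loop. It is a simplification, not the paper's exact procedure: the brute-force search over a finite `candidates` list, the residual threshold, and the helper interfaces `rollout`, `bellman_residual`, and `initial_value` are hypothetical stand-ins for the paper's optimization and estimation steps.

```python
import numpy as np


def po_bilinear_ac(candidates, rollout, bellman_residual,
                   n_iters=10, n_episodes=20, threshold=0.1, rng=None):
    """Hedged sketch of a constrained actor-critic loop in the spirit of
    PO-Bilinear AC (simplified; not the paper's exact algorithm).

    candidates:                     finite list of (policy, value_fn) pairs
    rollout(pi, n, rng):            collects a batch of trajectories with policy pi
    bellman_residual(pi, g, batch): estimated Bellman-style loss of (pi, g) on a batch
    value_fn.initial_value():       predicted value at the initial time step
    """
    rng = rng or np.random.default_rng(0)
    datasets = []                  # batches gathered by previously selected policies
    policy = candidates[0][0]
    for _ in range(n_iters):
        # Critic-side constraint: keep only pairs whose estimated Bellman
        # residual is small on every batch collected so far.
        feasible = [(pi, g) for (pi, g) in candidates
                    if all(abs(bellman_residual(pi, g, batch)) <= threshold
                           for batch in datasets)] or candidates
        # Actor step (optimism): among feasible pairs, pick the one whose
        # value function predicts the highest initial value.
        policy, value_fn = max(feasible, key=lambda pair: pair[1].initial_value())
        # Exploration: roll out the selected policy and store its data, so the
        # next iteration's constraints are checked on richer roll-in distributions.
        datasets.append(rollout(policy, n_episodes, rng))
    return policy
```

The design point is that candidate (policy, value function) pairs are eliminated jointly using data from earlier policies; roughly speaking, the bilinear rank controls how many such rounds can pass before the constraints force progress.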
Sample Complexity and PAC Guarantees
For each model, the paper details the sample complexity, quantifying how many samples the PO-Bilinear AC algorithm needs to learn a near-optimal policy. These results are stated as Probably Approximately Correct (PAC) guarantees, ensuring efficient learning under the stated observability and rank conditions; a generic template of such a guarantee is shown below.
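Concretely, the guarantees take the familiar PAC shape sketched below. This is a generic template only: the particular polynomial and the relevant complexity measures are model-specific and derived in the paper.

```latex
% Generic PAC-style template (placeholders only; exact rates are model-specific).
\[
\Pr\Big[\, J(\pi^\star) - J(\hat{\pi}) \le \epsilon \,\Big] \;\ge\; 1 - \delta
\qquad \text{after} \qquad
N \;=\; \mathrm{poly}\!\Big( d,\; H,\; |\mathcal{A}|,\; \mathrm{comp}(\Pi),\; \mathrm{comp}(\mathcal{G}),\; \tfrac{1}{\epsilon},\; \log\tfrac{1}{\delta} \Big)
\ \text{samples}.
\]
```

Here $\hat{\pi}$ is the learned policy, $\pi^\star$ is the comparator (the best in-class memory-based policy in general, or the globally optimal policy for the observable models discussed above), $d$ is the PO-bilinear rank, $H$ is the horizon, and $\mathrm{comp}(\cdot)$ stands in for the statistical complexity of the corresponding function class.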
Practical Applications
This framework broadens the applicability of RL to complex environments where full observability cannot be assumed, for instance robotic navigation with intermittent or noisy sensor data, or financial decision-making where market conditions are only partially observable. In such settings it offers a structured approach to learning policies effectively.
Conclusion
The proposed PO-Bilinear Actor-Critic framework provides a unified treatment of reinforcement learning in partially observable dynamical systems. By combining function approximation with optimization against the best memory-based policy, it yields improvements in sample efficiency and horizon dependence across a range of models.
Future research may build on these findings to explore even broader applications and refine the framework further, potentially integrating more sophisticated function approximators or exploring new types of partial observability conditions.