- The paper presents a novel backward reachability curriculum (BaRC) that bootstraps policy learning in sparse reward robotic tasks.
- It uses an approximate dynamics model to compute backward reachable sets, from which progressively harder training start states are generated, thereby reducing sample complexity.
- Experimental results on 5D car and 6D planar quadrotor models show significant improvements in training speed and robustness to model mismatch.
Reinforcement Learning (RL) offers a powerful approach for learning control policies for complex robotic tasks. However, a major bottleneck in applying model-free RL to robotics is its high sample complexity, particularly for tasks with sparse reward functions. Standard exploration strategies like ϵ-greedy or adding noise are inefficient in these scenarios, often failing to discover the goal region within a reasonable number of training episodes. Training in simulation helps but still requires millions of trials.
The paper "BaRC: Backward Reachability Curriculum for Robotic Reinforcement Learning" (BaRC: Backward Reachability Curriculum for Robotic Reinforcement Learning, 2018) proposes Backward Reachability Curriculum (BaRC) to address this sample efficiency issue. BaRC leverages physical prior knowledge in the form of an approximate system dynamics model to generate a curriculum for any model-free RL algorithm. The core idea is to start training the policy from states that are dynamically close to the goal state and progressively expand the set of initial training states backward in time, guided by the system's dynamics.
How BaRC Works
BaRC operates as a wrapper around a standard model-free RL training loop. It structures training into stages, each defined by a set of initial states from which the policy is trained. A code sketch of this outer loop follows the step list below.
1. Initialization: The curriculum begins with a set of initial states (`starts`) containing states very close to the goal region (e.g., a single goal state). An `oldstarts` buffer is also initialized with these goal states to prevent forgetting. The RL policy π is randomly initialized.
2. Curriculum Stage Expansion: At the beginning of each stage, the `starts` set is expanded backward in time using backward reachable set (BRS) computation based on an approximate system dynamics model (the `ExpandBackwards` function). The expanded set, `starts_set`, contains states from which the goal region can be reached within a short time horizon T under the approximate dynamics. This step provides new, slightly harder states to train from.
3. Policy Training Loop: Within each stage, the policy is trained iteratively (the `TrainPolicy` function) using a mix of initial states: N_new states sampled uniformly from the newly expanded `starts_set`, and N_old states sampled from the `oldstarts` buffer. This blending lets the policy learn from new, challenging states while reinforcing performance on states it has already mastered. `TrainPolicy` runs the chosen model-free RL algorithm (e.g., PPO) for N_TP iterations using episodes starting from the sampled states, and returns the updated policy along with a success rate for each sampled initial state.
4. State Selection and Buffer Update: After the N_TP training iterations, sampled initial states whose success rate exceeds a threshold C_select are added to the `starts` set for the next curriculum stage expansion. These successful states are also added to the `oldstarts` buffer for future training, preventing catastrophic forgetting. The `Select` function performs this update.
5. Stage Evaluation: The algorithm evaluates the policy's performance on the current `starts_set` (the `Evaluate` function). If the fraction of states in `starts_set` from which the policy can reach the goal exceeds a threshold C_pass, the current curriculum stage is considered mastered and the algorithm proceeds to the next stage (back to step 2). Otherwise, the policy training loop (steps 3-5) continues on the current stage's `starts_set`.
6. Termination: The curriculum continues expanding and training until `starts_set` includes states from the problem's true initial state distribution ρ_0 and the policy masters this stage, or until a desired level of performance from ρ_0 is reached.
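Putting these steps together, the outer loop looks roughly like the following. This is a minimal sketch rather than the authors' implementation: `expand_backwards`, `train_policy`, and `evaluate` are hypothetical stand-ins for the `ExpandBackwards`, `TrainPolicy`, and `Evaluate` routines above, and the state bookkeeping is simplified (states are assumed to be plain Python objects stored in lists).

```python
import random

def barc(policy, goal_states, rho0_states, expand_backwards, train_policy, evaluate,
         T, N_new, N_old, N_TP, C_select, C_pass):
    """Minimal sketch of the BaRC outer loop (all helper callables are hypothetical)."""
    starts = list(goal_states)        # seeds for the next backward expansion
    old_starts = list(goal_states)    # buffer of mastered start states (anti-forgetting)

    while True:
        # Step 2: expand the frontier backward in time under the approximate model.
        starts_set = expand_backwards(starts, horizon=T)

        while True:
            # Step 3: train from a mix of new (frontier) and old (mastered) start states.
            sampled = (random.sample(starts_set, min(N_new, len(starts_set))) +
                       random.sample(old_starts, min(N_old, len(old_starts))))
            policy, success = train_policy(policy, sampled, iters=N_TP)

            # Step 4: keep start states the policy now solves reliably.
            mastered = [s for s, rate in zip(sampled, success) if rate >= C_select]
            starts.extend(mastered)
            old_starts.extend(mastered)

            # Step 5: advance to the next curriculum stage once this one is mastered.
            if evaluate(policy, starts_set) >= C_pass:
                break

        # Step 6: stop once the true initial distribution rho_0 is covered and mastered
        # (in practice, a coverage/performance check on states drawn from rho_0).
        if all(s in starts_set for s in rho0_states):
            return policy
```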
Backward Reachable Sets (BRS)
A BRS for a target set and time horizon T is the set of all initial states from which there exists some control policy that drives the system into the target set within time T. The paper uses the Hamilton-Jacobi (HJ) formulation of reachability to compute BRSs. This involves solving a partial differential equation (PDE).
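In the standard HJ formulation (notation mine, but consistent with the usual definition), for dynamics $\dot{x} = f(x, u)$, target set $\mathcal{G}$, and horizon $T$, the BRS is

$$
\mathcal{R}(T) \;=\; \bigl\{\, x_0 \;:\; \exists\, u(\cdot),\ \exists\, t \in [0, T] \ \text{such that}\ \xi(t;\, x_0, u(\cdot)) \in \mathcal{G} \,\bigr\},
$$

where $\xi(t;\, x_0, u(\cdot))$ is the state reached at time $t$ from $x_0$ under the control signal $u(\cdot)$. HJ methods recover $\mathcal{R}(T)$ as the sub-zero level set of a value function obtained by solving the HJ PDE on a state-space grid.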
The core challenge with HJ reachability is the computational cost, which grows exponentially with state space dimension. To make BRS computation practical for robotics, the authors propose two key strategies:
- Approximate Dynamics Model: Use a simplified, lower-dimensional, or linearized model of the system (the "curriculum model") for BRS computation, instead of the high-fidelity simulation model used for policy training.
- System Decomposition: Break the curriculum model's dynamics into smaller, overlapping subsystems (e.g., 1D or 2D). Compute BRSs for these subsystems and combine them (often as outer approximations) to obtain an approximate BRS for the full system. This allows using computationally efficient methods such as those in the open-source `helperOC` and Level Set Methods MATLAB toolboxes. An illustrative sketch of this combination step follows the list.
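The sketch below illustrates one way the combination step can work. It assumes each subsystem BRS is represented by a value function whose sub-zero level set is the BRS; this is an illustration, not the toolboxes' API. Because each back-projected subsystem BRS is a superset of the true full-system BRS, keeping only states that every subsystem marks reachable yields an outer approximation.

```python
def in_full_brs(x, subsystem_values, projections):
    """Illustrative membership test for a decomposed BRS (not the toolbox API).

    subsystem_values: list of callables V_i(z) returning the (interpolated) value
                      of subsystem i at its low-dimensional state z.
    projections:      list of callables p_i(x) projecting the full state x onto
                      subsystem i's coordinates.
    """
    # Intersection of back-projected subsystem BRSs: x is kept only if every
    # subsystem value function is non-positive at the projected state.
    return all(V(p(x)) <= 0.0 for V, p in zip(subsystem_values, projections))
```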
The BRS computed this way provides a dynamically informed "frontier" from which to sample new training states: the initial-state set for a new curriculum stage is the union of the BRSs (over horizon T) of the states marked as successful in the previous stage. Sampling from the BRS is done via rejection sampling within bounding boxes around the BRS components.
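A minimal sketch of that sampling step, assuming some membership test `in_brs` (for example, the decomposed test above) and an axis-aligned bounding box around a BRS component:

```python
import random

def sample_from_brs(in_brs, box_lo, box_hi, n_samples, max_tries=10000):
    """Rejection sampling: draw uniformly in the bounding box, keep points inside the BRS."""
    samples, tries = [], 0
    while len(samples) < n_samples and tries < max_tries:
        # Uniform sample inside the axis-aligned bounding box.
        x = [random.uniform(lo, hi) for lo, hi in zip(box_lo, box_hi)]
        if in_brs(x):
            samples.append(x)
        tries += 1
    return samples
```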
Practical Implementation Considerations
- Curriculum Model: The quality of the curriculum model affects BRS accuracy but doesn't need to be perfect. It should capture the core nonlinear dynamics relevant to reaching the goal. The paper shows robustness to significant model mismatch.
- BRS Computation Efficiency: Decomposition is crucial. The specific decomposition depends on the system dynamics. For the car model, they decompose a 5D system into 4D subsystems, which are further decomposed. For the quadrotor, a 6D system is decomposed into 2D and 1D subsystems.
- Hyperparameter Tuning: The hyperparameters (N_new, N_old, T, C_pass, C_select, N_TP) are relatively intuitive. T controls the difficulty increase per stage. C_select and C_pass define mastery. N_new and N_old balance exploration of new states against consolidation on learned states. N_TP depends on the inner RL algorithm's convergence speed. The authors report robustness to these settings.
- State Representation: The BRS is computed in the curriculum model's state space, which may be a projection or subset of the simulator state space. A mapping is needed between the two.
- Integration: BaRC is designed as a modular wrapper, requiring minimal modification to the chosen model-free RL algorithm; it primarily alters the initial state distribution used during training. A minimal sketch of such a wrapper follows this list.
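For example (a hypothetical Gym-style sketch, not the authors' code), the curriculum can be handed to an unmodified RL algorithm by wrapping the environment's reset so that episodes start from curriculum-sampled states. The `to_sim_state` argument also covers the state-representation mapping mentioned above; `set_state` is an assumed simulator method.

```python
import random

class CurriculumResetWrapper:
    """Start episodes from curriculum-chosen states (sketch; `set_state` is assumed)."""

    def __init__(self, env, to_sim_state):
        self.env = env
        self.to_sim_state = to_sim_state  # curriculum-model state -> simulator state
        self.start_states = []            # updated by BaRC at each curriculum stage

    def set_start_states(self, states):
        self.start_states = list(states)

    def reset(self):
        obs = self.env.reset()
        if self.start_states:
            # Override the default initial state with a curriculum-sampled one.
            s = random.choice(self.start_states)
            obs = self.env.set_state(self.to_sim_state(s))
        return obs

    def step(self, action):
        return self.env.step(action)
```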
Experimental Results
BaRC was evaluated on two robotic environments:
- 5D Car Model: A standard non-holonomic car model with state (x, y, θ, v, κ) and controls (a_v, a_κ). The goal is a specific state with non-zero velocity. The reward is sparse (1.0 at the goal, 0 otherwise); a common form of the dynamics is written out after this list.
- Planar Quadrotor Model: A 6D planar quadrotor with state (x, v_x, y, v_y, ϕ, ω) and thrust controls, navigating cluttered obstacles. Observations include the state and 8 laser rangefinder readings. The goal is a target region (x ≥ 4, y ≥ 4). The reward is sparse (1000 at the goal) with control costs and collision penalties. This is a highly dynamic, unstable system.
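For reference, a common form of such a 5D car model (the paper's exact equations may differ in detail) is

$$
\dot{x} = v \cos\theta, \qquad
\dot{y} = v \sin\theta, \qquad
\dot{\theta} = v\kappa, \qquad
\dot{v} = a_v, \qquad
\dot{\kappa} = a_\kappa .
$$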
Key results:
- Standard PPO and a random curriculum fail to reach the goal consistently, often converging to local optima such as hovering to avoid collisions. PPO with a smoothed quadratic reward also settles into a sub-optimal local minimum.
- BaRC successfully learns a policy that reaches the goal within a few curriculum iterations, and the average reward curve shows the characteristic alternation of learning and expansion phases.
- Although BRS computation adds overhead per iteration, the large reduction in required RL iterations yields a substantial speedup in total wall-clock time to task completion compared to the baselines.
Conclusion
BaRC effectively addresses the sample efficiency challenge in sparse-reward robotic RL by intelligently shaping the training distribution using dynamically informed backward reachable sets computed with approximate models and decomposition techniques. It acts as a general wrapper around model-free RL algorithms, leveraging physical priors seamlessly. The experimental results show significant performance improvements in terms of sample complexity and wall clock time on representative dynamic robotic tasks, including those with unstable dynamics and sparse rewards that are intractable for standard methods. Future work includes exploring sampling-based BRS methods and integrating system identification to estimate curriculum models.