- The paper introduces a novel IRL approach that simultaneously recovers reward and constraint functions in CMDPs using the principle of maximum entropy.
- It employs an alternating convex optimization strategy, using exponentiated gradient descent to iteratively refine the parameter estimates.
- Experiments in a grid-world setting show that the algorithm qualitatively recovers the patterns of the original reward and constraint functions, despite some numerical deviations.
Inverse Reinforcement Learning With Constraint Recovery
The paper "Inverse Reinforcement Learning With Constraint Recovery" explores the problem of learning both the reward and constraint functions from demonstrated optimal behavior in Constrained Markov Decision Processes (CMDPs). This novel algorithm addresses the challenge of Inverse Reinforcement Learning (IRL) when constraints are involved, providing a pathway to recover both elements from trajectory data.
CMDPs are an extension of Markov Decision Processes (MDPs) where an agent's policy not only seeks to maximize the expected reward but also adheres to certain constraints. Traditional IRL focuses on recovering the reward function based on observed trajectories. The extension to CMDPs involves inferring not only the reward functions but also the constraints governing the agent's behavior.
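For reference, a standard CMDP formulation maximizes expected cumulative reward subject to a bound on expected cumulative cost. The paper's exact discounting and constraint threshold are not restated here, so the display below uses generic placeholders:

$$
\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\Big[\sum_{t}\gamma^{t}\, r(s_t,a_t)\Big]
\quad\text{s.t.}\quad
\mathbb{E}_{\tau\sim\pi}\Big[\sum_{t}\gamma^{t}\, c(s_t,a_t)\Big]\le C_0,
$$

where $\gamma$ is a discount factor and $C_0$ is the constraint budget.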
The paper casts IRL with constraint recovery (IRL-CR) as a constrained, non-convex optimization problem. The authors invoke the principle of maximum entropy, selecting the trajectory distribution with the highest entropy among those consistent with the observed demonstrations, and they use a linear function approximation for both the reward and the constraint. The resulting problem decomposes into alternating constrained sub-problems, each convex, which are solved with an exponentiated gradient descent algorithm.
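Under the linear function approximation mentioned above, both functions are weighted combinations of a feature map. Whether the features depend on states alone or on state-action pairs is not restated here, so state features are assumed for illustration:

$$
r(s) = w_r^{\top}\phi(s), \qquad c(s) = w_c^{\top}\phi(s),
$$

with $w_r$ and $w_c$ the parameter vectors the algorithm recovers from demonstrations.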
Methodology
Maximum Entropy Principle
The solution begins by applying the maximum entropy principle to obtain the trajectory distribution as a Boltzmann distribution whose parameters relate to both the reward and constraint functions. This choice introduces the least possible bias given the data: the trajectory distribution places the reward and constraint terms in the exponent.
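A Boltzmann trajectory distribution consistent with this description is sketched below; the exact handling of the Lagrange multiplier $\lambda$ and of the partition function $Z$ follows the paper and is only indicated schematically here:

$$
P(\tau\mid w_r, w_c) \;=\; \frac{1}{Z(w_r, w_c)}\,
\exp\!\Big(\sum_{s\in\tau}\big(w_r^{\top}\phi(s)\;-\;\lambda\, w_c^{\top}\phi(s)\big)\Big),
$$

so that high-reward, low-cost trajectories are exponentially more likely, and the demonstrations constrain $w_r$ and $w_c$ through their feature counts.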
Alternating Optimization
Because the resulting optimization problem is non-convex, the authors reduce it to an alternating constrained optimization task. Each sub-problem, which solves for either the reward or the constraint parameters while the other is held fixed, is convex. This decomposition allows exponentiated gradient descent to be applied effectively, refining the parameter estimates incrementally.
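As a concrete illustration of the exponentiated gradient step used in each sub-problem, the sketch below applies the standard multiplicative-weights update; the learning-rate handling and the optional renormalization onto the simplex are illustrative assumptions rather than the paper's exact update rule.

```python
import numpy as np

def exponentiated_gradient_step(w, grad, learning_rate, normalize=False):
    """One exponentiated gradient (multiplicative-weights) update.

    Each coordinate is scaled by exp(-learning_rate * grad), so the iterate
    stays positive. If normalize is True, the result is rescaled to sum to 1,
    keeping the weights on the probability simplex (an assumption made here
    for illustration, not necessarily the paper's choice).
    """
    w_new = w * np.exp(-learning_rate * grad)
    if normalize:
        w_new = w_new / w_new.sum()
    return w_new
```

Compared with an additive gradient step, the multiplicative form preserves nonnegativity, which is convenient when the parameters are interpreted as weights over features.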
Implementation
The implementation iteratively updates the policy and the state visitation distribution, which in turn update the estimates of the reward and constraint functions. The algorithm proceeds until convergence, i.e., until the parameter updates fall below a given threshold. A pseudocode outline follows, with a sketch of the visitation computation it relies on shown afterward.
```python
def irl_cr_algorithm(trajectories, initial_w_r, initial_w_c, learning_rate, tolerance):
    """Alternating recovery of reward weights w_r and constraint weights w_c.

    The helper routines (compute_optimal_policy, compute_empirical_features,
    compute_policy_features, update_parameters, converged) correspond to the
    paper's sub-steps and are assumed to be defined elsewhere.
    """
    w_r, w_c = initial_w_r, initial_w_c
    while not converged(w_r, w_c, tolerance):
        # Step 1: solve the forward CMDP for the current parameters to obtain
        # the optimal policy and the associated Lagrange multiplier.
        pi, lambda_ = compute_optimal_policy(w_r, w_c)
        # Step 2: empirical feature expectations from the demonstrations and
        # feature expectations induced by the current policy.
        empirical_r, empirical_c = compute_empirical_features(trajectories)
        phi_r, phi_c = compute_policy_features(pi)
        # Step 3: gradients with respect to the reward and constraint weights.
        grad_w_r = empirical_r - phi_r
        grad_w_c = lambda_ * (empirical_c - phi_c)
        # Exponentiated gradient updates; constraint weights stay normalized.
        w_r = update_parameters(w_r, grad_w_r, learning_rate)
        w_c = update_parameters(w_c, grad_w_c, learning_rate, constraint=True)
    return w_r, w_c, pi
```
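The compute_policy_features step above requires the expected state-visitation frequencies under the current policy. A minimal sketch of that forward computation for a finite, tabular MDP is given below, assuming a dense transition array and a fixed horizon; these representational choices and names are illustrative, not the paper's implementation.

```python
import numpy as np

def expected_feature_counts(policy, transitions, initial_dist, features, horizon):
    """Finite-horizon expected feature counts under a fixed stochastic policy.

    policy:       (S, A) array, policy[s, a] = probability of action a in state s
    transitions:  (S, A, S) array, transitions[s, a, k] = P(next state k | s, a)
    initial_dist: (S,) array, distribution over start states
    features:     (S, F) array, feature vector phi(s) for each state
    horizon:      number of time steps to roll the dynamics forward
    """
    num_states = transitions.shape[0]
    d = initial_dist.copy()            # state distribution at the current step
    visitation = np.zeros(num_states)  # accumulated state-visitation frequencies
    for _ in range(horizon):
        visitation += d
        # Joint state-action occupancy at this step, then one-step transition.
        state_action = d[:, None] * policy
        d = np.einsum("sa,sak->k", state_action, transitions)
    # Feature expectation: visitation-weighted sum of state features.
    return visitation @ features
```

Passing the reward feature map or the constraint feature map as features yields phi_r or phi_c for the main loop, respectively.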
Experiments and Results
The proposed algorithm was evaluated in a grid-world setting, a classic testbed for learning algorithms. The paper reports qualitatively successful recovery of both the reward and constraint functions: the recovered functions match the originals in pattern and distribution, although numerical deviations arise from the maximum entropy regularization.
The presented results illustrate qualitative success in reconstructing the reward and constraint landscapes and in deriving a policy that matches the demonstrator's known policy. Iterative refinement and stochastic trajectory sampling improved prediction accuracy, albeit with slight variability due to the inherent stochasticity of CMDPs.
Conclusion
This research addresses a previously underexplored aspect of IRL, namely the simultaneous recovery of reward and constraint functions. The development is particularly relevant to safety-critical applications such as autonomous vehicles and healthcare. Extending the algorithm to handle real-time data and to accommodate unknown state features are promising future directions.
By solving the IRL-CR problem, the paper opens a meaningful avenue for more sophisticated modeling of agent behavior, integral to designing adaptive and complex autonomous systems in constrained environments.