Learning to Synthesize Programs as Interpretable and Generalizable Policies

(2108.13643)
Published Aug 31, 2021 in cs.LG, cs.AI, and cs.PL

Abstract

Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced with DRL methods are not human-interpretable and often have difficulty generalizing to novel scenarios. To address these issues, prior works explore learning programmatic policies that are more interpretable and structured for generalization. Yet, these works either employ limited policy representations (e.g. decision trees, state machines, or predefined program templates) or require stronger supervision (e.g. input/output state pairs or expert demonstrations). We present a framework that instead learns to synthesize a program, which details the procedure to solve a task in a flexible and expressive manner, solely from reward signals. To alleviate the difficulty of learning to compose programs to induce the desired agent behavior from scratch, we propose to first learn a program embedding space that continuously parameterizes diverse behaviors in an unsupervised manner and then search over the learned program embedding space to yield a program that maximizes the return for a given task. Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies. We also justify the necessity of the proposed two-stage learning scheme as well as analyze various methods for learning the program embedding.

Overview

  • LEAPS synthesizes interpretable programs directly from reward signals, improving interpretability and generalization over neural network policies.

  • The framework first learns a program embedding space through unsupervised learning, which can then be reused across different tasks without retraining.

  • A gradient-free search algorithm, the Cross-Entropy Method (CEM), is used to find latent programs that, when decoded, maximize the return on a given task.

  • Experiments in the Karel domain demonstrate LEAPS’s proficiency in task solving and generalization compared to DRL and program synthesis baselines.

  • The research indicates LEAPS’s potential for generating interpretable policies with minimal supervision, outperforming traditional approaches.

Introduction

Innovations in deep reinforcement learning (DRL) have led to remarkable achievements across many domains. Despite these successes, DRL policies often lack interpretability and generalize poorly to novel scenarios. Prior work has explored learning programmatic policies, which offer better structure for interpretation and for generalizing to new situations, but these methods either rely on constrained policy representations or require substantial supervision. To address these limitations, the authors develop Learning Embeddings for lAtent Program Synthesis (LEAPS), a framework that synthesizes programs directly from reward signals, yielding policies that are both understandable and able to generalize.

Learning Program Embedding

LEAPS uses a two-stage learning scheme to facilitate program synthesis. In the first stage, it learns a program embedding space in which nearby latent programs correspond to similar behaviors. A program encoder maps programs into this latent space, and a corresponding decoder reconstructs programs from latent vectors. The embedding space is trained in an unsupervised manner by reconstructing randomly generated programs and the behaviors they induce. Once learned, the space can be reused across different tasks without retraining.
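
To make the first stage concrete, below is a minimal sketch of such a program encoder-decoder, assuming token-level programs and a GRU-based sequence VAE. The class and parameter names (ProgramVAE, vocab_size, latent_dim, etc.) are illustrative rather than taken from the paper's code, and the paper's additional behavior-reconstruction objectives are omitted.

```python
# Illustrative sketch (not the authors' code): a sequence VAE that maps program
# token sequences into a continuous latent space and reconstructs them.
import torch
import torch.nn as nn

class ProgramVAE(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, latent_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.latent_to_hidden = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_logits = nn.Linear(hidden_dim, vocab_size)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))   # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z, tokens):
        # Teacher-forced reconstruction of the program tokens from the latent code.
        h0 = torch.tanh(self.latent_to_hidden(z)).unsqueeze(0)
        out, _ = self.decoder(self.embed(tokens), h0)
        return self.to_logits(out)                # per-step token logits

    def forward(self, tokens):
        mu, logvar = self.encode(tokens)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decode(z, tokens), mu, logvar
```

Training such a model on randomly generated programs with a reconstruction loss (plus a KL term) yields a latent space in which nearby vectors decode to similar programs, the property the second stage exploits.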

Program Synthesis

After learning the embedding space, the framework searches for task-solving programs with the Cross-Entropy Method (CEM), a gradient-free search algorithm. By iteratively updating a population of candidate latent programs, CEM converges on a latent vector that, when decoded, yields a program maximizing the return on the given task. The search benefits from the embedding space's ability to interpolate smoothly between program behaviors.
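
A minimal sketch of this second stage is shown below, assuming a helper decode_and_evaluate(z) that decodes a latent vector into a program, executes it in the task environment, and returns the episode return; this helper and the hyperparameter values are illustrative, not the paper's settings.

```python
# Illustrative CEM loop (not the authors' exact implementation): search the learned
# latent space for a vector whose decoded program maximizes task return.
import numpy as np

def cem_search(decode_and_evaluate, latent_dim=64, pop_size=64, n_elite=8,
               n_iters=100, init_sigma=0.5):
    mu = np.zeros(latent_dim)
    sigma = np.full(latent_dim, init_sigma)
    best_z, best_ret = None, -np.inf
    for _ in range(n_iters):
        # Sample a population of candidate latent programs around the current mean.
        population = mu + sigma * np.random.randn(pop_size, latent_dim)
        returns = np.array([decode_and_evaluate(z) for z in population])
        # Refit the sampling distribution to the top-performing (elite) candidates.
        elite = population[np.argsort(returns)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        if returns.max() > best_ret:
            best_ret, best_z = returns.max(), population[returns.argmax()]
    return best_z, best_ret
```

Because no gradients flow through program execution, a population-based method like CEM is a natural fit; the smoothness of the learned embedding space is what makes this search tractable.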

Experimental Validation

Experiments in the Karel domain highlight LEAPS's ability to synthesize programs that solve tasks requiring navigation and interaction with objects, such as stacking and maze navigation. LEAPS not only reliably generates functional programs but also surpasses both DRL and program synthesis baselines in task performance, and it generalizes better across varying domain settings and task configurations.

Concluding Thoughts

LEAPS stands out among programmatic reinforcement learning approaches by combining a highly expressive program representation with minimal supervision. Its reliance on reward signals alone, together with the two-stage learning scheme, sidesteps the difficulty of learning to compose task-solving programs from scratch. The experiments show advantages over DRL and program synthesis baselines not only in performance but also in the interpretability and editability of the synthesized programs. The framework points toward future work in settings that demand interpretable and generalizable policies.
