An Invitation to Deep Reinforcement Learning

(2312.08365)
Published Dec 13, 2023 in cs.LG and cs.AI

Abstract

Training a deep neural network to maximize a target objective has become the standard recipe for successful machine learning over the last decade. These networks can be optimized with supervised learning, if the target objective is differentiable. For many interesting problems, this is however not the case. Common objectives like intersection over union (IoU), bilingual evaluation understudy (BLEU) score or rewards cannot be optimized with supervised learning. A common workaround is to define differentiable surrogate losses, leading to suboptimal solutions with respect to the actual objective. Reinforcement learning (RL) has emerged as a promising alternative for optimizing deep neural networks to maximize non-differentiable objectives in recent years. Examples include aligning LLMs via human feedback, code generation, object detection or control problems. This makes RL techniques relevant to the larger machine learning audience. The subject is, however, time intensive to approach due to the large range of methods, as well as the often very theoretical presentation. In this introduction, we take an alternative approach, different from classic reinforcement learning textbooks. Rather than focusing on tabular problems, we introduce reinforcement learning as a generalization of supervised learning, which we first apply to non-differentiable objectives and later to temporal problems. Assuming only basic knowledge of supervised learning, the reader will be able to understand state-of-the-art deep RL algorithms like proximal policy optimization (PPO) after reading this tutorial.

Overview

  • Reinforcement learning (RL) extends supervised learning by tackling optimization problems with non-differentiable objectives.

  • RL can use value learning, such as Q-learning, to predict expected rewards, from which an optimal policy is derived implicitly.

  • Policy gradients directly adjust action probabilities based on received rewards, with REINFORCE as the canonical method.

  • RL techniques extend to sequential decision making, where sample efficiency, sparse rewards, and error accumulation must be addressed.

  • The paper emphasizes RL's potential in diverse applications and cautions that trained models must be checked to ensure their behavior matches the intended outcome, since agents can exploit flaws in reward design.

Reinforcement Learning as a Generalization of Supervised Learning

Reinforcement learning (RL) is a paradigm that extends beyond the capabilities of supervised learning, particularly in scenarios where the optimization objective is non-differentiable. This capability makes RL a potent tool for a variety of problems, especially those found outside the realm of traditional games or simulated environments.

Bridging Non-Differentiable Objectives

Supervised learning typically operates on differentiable objectives, making use of gradient-based optimization. However, many real-world problems involve objectives that are not differentiable, such as ranking human preferences or code execution speed. RL steps in by providing a framework to optimize non-differentiable functions through either value learning or policy gradients.
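
For the policy-gradient route, the key enabling fact is the score-function (log-derivative) identity, written here for a single-step problem with reward R and policy π_θ (a standard result, not notation taken from the paper): the gradient of the expected reward becomes an expectation that can be estimated from sampled actions, so only the policy needs to be differentiable, not the reward.

```latex
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}\left[ R(a) \right]
  = \mathbb{E}_{a \sim \pi_\theta}\left[ R(a) \, \nabla_\theta \log \pi_\theta(a) \right]
```

Black-box metrics such as IoU or BLEU can therefore serve directly as R; REINFORCE, discussed below, is a direct implementation of this estimator.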

Value Learning

Value learning involves predicting expected rewards, effectively bridging the gap between actions and their outcomes without requiring the reward function to be differentiable. Concretely, one learns Q-functions (action-value functions) that estimate the optimal policy implicitly. The Q-function can be learned with deep Q-learning in discrete action spaces, or it can serve as the critic in actor-critic methods for continuous action spaces.
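
As an illustration, here is a minimal sketch of one deep Q-learning step in PyTorch. The names (q_net, target_net, the batch layout) are assumptions for the example, not from the paper; the point is that the reward appears only as data inside the regression target, so it never has to be differentiable.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One deep Q-learning step on a batch of (s, a, r, s', done) transitions.

    q_net and target_net are assumed to map states of shape (batch, state_dim)
    to per-action values of shape (batch, num_actions).
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'),
    # with no bootstrapping past terminal states.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Acting greedily with respect to the learned Q-values then recovers the implicit policy.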

Policy Gradients

Alternatively, policy gradients operate by directly manipulating the probability distribution over actions based on the received rewards. This family of methods includes REINFORCE, which increases the probability of sampled actions in proportion to the reward they receive.
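
A minimal sketch of the REINFORCE loss in PyTorch, assuming a hypothetical policy_net that outputs per-action logits. The rewards can come from any black-box objective (IoU, BLEU, human preference scores), because gradients flow only through the log-probabilities of the sampled actions.

```python
import torch

def reinforce_loss(policy_net, states, actions, rewards):
    """Score-function (REINFORCE) loss for a batch of sampled actions."""
    logits = policy_net(states)                       # (batch, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Subtracting a baseline (here the batch-mean reward) reduces variance
    # without biasing the gradient estimate.
    baseline = rewards.mean()

    # Negative sign: minimizing this loss maximizes the expected reward.
    return -((rewards - baseline) * chosen).mean()
```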

Extending Techniques to Sequential Decision Making

While value learning and policy gradients can be applied to problems where only a single prediction is made, extending these methods to sequential decision-making tasks introduces additional considerations: data collection strategies that improve sample efficiency and reduce the variance of gradient estimates, and challenges such as sparse rewards and compounding errors.
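
In the sequential setting, training data comes from rolling out the policy in an environment. A minimal collection loop is sketched below, assuming a Gymnasium-style environment API and a callable select_action that samples from the current policy (both are placeholders for illustration):

```python
def collect_episode(env, select_action, max_steps=1000):
    """Roll out one episode and return the trajectory as three lists."""
    states, actions, rewards = [], [], []
    obs, _ = env.reset()
    for _ in range(max_steps):
        action = select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs)
        actions.append(action)
        rewards.append(reward)
        obs = next_obs
        if terminated or truncated:
            break
    return states, actions, rewards
```

How such trajectories are gathered and reused is precisely what separates the off-policy and on-policy families below.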

Off-Policy Learning

Off-policy methods like Soft Actor-Critic (SAC) allow data collected from previous policies to be reused, which improves sample efficiency. They often require additional stabilization techniques, such as target networks for the Q-functions.
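
The snippet below sketches the two ingredients mentioned above, assuming PyTorch networks: a replay buffer that lets old transitions be reused, and a Polyak (soft) update that makes the target Q-network track the online one slowly. SAC itself adds further components (entropy regularization, twin Q-functions) that are omitted here.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Fixed-size store of transitions so data from past policies can be reused."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

@torch.no_grad()
def polyak_update(target_net, online_net, tau=0.005):
    """Soft update: theta_target <- (1 - tau) * theta_target + tau * theta_online."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```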

On-Policy Learning

On-policy methods such as Proximal Policy Optimization (PPO) collect data with the current policy and discard it after each update, so fresh data must be gathered for every update. PPO incorporates several enhancements, such as advantage estimation and a clipped surrogate objective, to mitigate the high variance and instability of plain policy gradients.
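
At the core of PPO is the clipped surrogate objective, which keeps the updated policy close to the policy that collected the data. A minimal PyTorch sketch, assuming the log-probabilities and advantage estimates for a minibatch have already been computed:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss for one minibatch.

    new_log_probs: log pi_theta(a|s) under the policy being optimized.
    old_log_probs: log pi_theta_old(a|s) recorded at data-collection time.
    advantages:    advantage estimates, treated as constants here.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The minimum removes any incentive to move the policy far from the
    # one that generated the data.
    return -torch.min(unclipped, clipped).mean()
```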

Conclusion and Broader Impact

By framing RL as a generalization of supervised learning, this paper provides insight into how the field can address a broader set of problems with non-differentiable objectives. The ability of RL to optimize based on rewards, without requiring differentiability, opens up new prospects for machine learning applications. However, practitioners must consider not just quantitative metrics but also a qualitative assessment of trained models, because RL models can learn to exploit any shortcomings in the reward design rather than exhibit the intended behavior. Reinforcement learning holds promise for diverse applications, and ongoing research is vital to further develop the methods discussed here.
