Vision-Language Models as a Source of Rewards

(2312.09187)
Published Dec 14, 2023 in cs.LG

Abstract

Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.

Overview

  • The paper introduces the use of pre-trained vision-language models (VLMs), such as CLIP, to generate reward signals for reinforcement learning without additional fine-tuning.

  • It demonstrates that larger VLMs produce more accurate rewards, which in turn yield more effective reinforcement learning agents.

  • Contrastive VLMs are leveraged to create binary rewards for RL agents, signaling visual goal achievement within a partially observable Markov decision process (POMDP).

  • Experiments show that optimizing the VLM-derived reward also maximizes the actual ground-truth reward, and that larger VLMs yield better results.

  • The findings suggest the potential of VLMs to train generalist agents in visually complex environments, marking progress toward more adaptable AI.

Introduction to Vision-Language Models as Rewards

The ambition of creating versatile AI agents that can navigate and accomplish objectives within complex environments is a major focus in reinforcement learning (RL). A substantial obstacle is the need for many distinct reward functions to train agents toward different goals. The paper examines vision-language models (VLMs) as a new source of rewards for reinforcement learning. Specifically, it uses pre-trained VLMs, such as CLIP, to produce reward signals without any further fine-tuning on environment-specific data. The method is demonstrated in two distinct visual domains, and the results indicate that larger VLMs provide more accurate rewards, leading to more capable RL agents.

Related Work and Methodological Foundations

There has been growing research interest in using VLMs to create reward functions. Pre-trained VLMs have already shown proficiency in tasks such as visual detection, classification, and question answering. The paper outlines prior efforts in which CLIP-based models were fine-tuned on video and text from Minecraft to produce effective shaping rewards, allowing agents to complete specific tasks more efficiently.

The proposed methodology uses contrastive VLMs to produce a simple binary reward for RL. The VLM's image encoder and text encoder are used to compute a reward signal from environment observations and text-based goals. The reward acts as an indicator that a specified language goal has been achieved within a partially observable Markov decision process (POMDP), and it is used during training in place of explicitly programmed ground-truth rewards.
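
For illustration, a binary goal-achievement reward of this kind could be computed roughly as in the sketch below. It assumes the open-source `clip` package (https://github.com/openai/CLIP); the `vlm_reward` helper, the distractor-goal list, and the probability threshold are illustrative stand-ins rather than the paper's exact implementation.

```python
# Minimal sketch of a CLIP-derived binary reward. The helper name, the
# distractor-goal list, and the probability threshold are illustrative
# choices, not the paper's exact implementation.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def vlm_reward(frame: Image.Image, goal: str, negatives: list[str],
               threshold: float = 0.5) -> float:
    """Return 1.0 if the observation appears to satisfy the language goal, else 0.0."""
    image = preprocess(frame).unsqueeze(0).to(device)
    texts = clip.tokenize([goal] + negatives).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(texts)
        # Cosine similarities between the observation and all candidate goal strings.
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        logits = model.logit_scale.exp() * (img_emb @ txt_emb.T).squeeze(0)
        # Probability that the intended goal (index 0) best describes the frame,
        # thresholded to yield a binary achievement signal.
        prob_goal = torch.softmax(logits, dim=0)[0].item()
    return 1.0 if prob_goal > threshold else 0.0
```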

Empirical Evaluations and Results

The empirical study assesses how well the VLM reward correlates with the underlying ground-truth reward and explores the effect of scaling up the VLM. The key research questions are whether optimizing the VLM reward also leads to higher ground-truth rewards, and whether larger VLMs improve the quality of the reward function.
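
A hedged sketch of the kind of offline check this implies: measure how often the binary VLM reward agrees with ground-truth success labels on a held-out set of (frame, goal, achieved) examples. The dataset format and the reuse of the hypothetical `vlm_reward` helper from the sketch above are assumptions, not the paper's evaluation code.

```python
# Offline agreement between the binary VLM reward and ground-truth success labels.
# `dataset` is assumed to yield (frame, goal, achieved) triples, where `frame`
# is a PIL image and `achieved` is a 0/1 label.
def offline_accuracy(dataset, negatives, threshold=0.5):
    correct = 0
    for frame, goal, achieved in dataset:
        pred = vlm_reward(frame, goal, negatives, threshold)
        correct += int(pred == float(achieved))
    return correct / len(dataset)
```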

The experimental setup follows standard online RL, using environments such as Playhouse and AndroidEnv to challenge the agent with tasks like locating objects or opening apps. The essential finding is that training agents to maximize the VLM-derived reward also maximizes the actual ground-truth reward. Moreover, increasing the size of the VLM improves both its accuracy in offline evaluations and its effectiveness as a reward signal during RL training.
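
To make the setup concrete, the sketch below shows an online data-collection loop in which the environment's built-in reward is discarded and the VLM-derived reward is used instead. The `env` and `agent` interfaces are hypothetical placeholders, and observations are assumed to be RGB arrays; this is a sketch under those assumptions, not the paper's training stack.

```python
# Illustrative online-RL data-collection loop using the VLM-derived reward.
from PIL import Image

def collect_episode(env, agent, goal, negatives, max_steps=500):
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = agent.act(obs, goal)
        next_obs, _, done, _ = env.step(action)        # discard any built-in env reward
        frame = Image.fromarray(next_obs)               # RGB array -> PIL image
        reward = vlm_reward(frame, goal, negatives)     # binary VLM reward
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    return trajectory
```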

Conclusion and Practical Implications

The paper demonstrates that off-the-shelf VLMs can provide accurate rewards for visual tasks specified by language goals. As the scale of the VLM increases, the accuracy of its reward predictions improves, which in turn leads to better-performing RL agents. These findings suggest that as VLMs continue to improve, it may become feasible to train generalist agents in visually rich settings without additional fine-tuning, a step toward more adaptable and capable AI systems.
