Learning to Drive by Watching YouTube Videos: Action-Conditioned Contrastive Policy Pretraining

Published 5 Apr 2022 in cs.CV, cs.LG, and cs.RO | (2204.02393v2)

Abstract: Deep visuomotor policy learning, which aims to map raw visual observation to action, achieves promising results in control tasks such as robotic manipulation and autonomous driving. However, it requires a huge number of online interactions with the training environment, which limits its real-world application. Compared to the popular unsupervised feature learning for visual recognition, feature pretraining for visuomotor control tasks is much less explored. In this work, we aim to pretrain policy representations for driving tasks by watching hours-long uncurated YouTube videos. Specifically, we train an inverse dynamic model with a small amount of labeled data and use it to predict action labels for all the YouTube video frames. A new contrastive policy pretraining method is then developed to learn action-conditioned features from the video frames with pseudo action labels. Experiments show that the resulting action-conditioned features obtain substantial improvements for the downstream reinforcement learning and imitation learning tasks, outperforming the weights pretrained from previous unsupervised learning methods and ImageNet pretrained weight. Code, model weights, and data are available at: https://metadriverse.github.io/ACO.

Summary

The paper presents a novel pretraining approach that leverages uncurated YouTube videos to significantly improve sample efficiency in autonomous driving.
The paper utilizes an inverse dynamics model to generate pseudo-action labels, enabling effective action-conditioned contrastive learning for policy pretraining.
The paper demonstrates enhanced performance in imitation learning, reinforcement learning, and lane detection tasks, confirming the broader applicability of the method.

An Analysis of Action-Conditioned Contrastive Policy Pretraining for Autonomous Driving

The paper proposes action-conditioned contrastive policy pretraining as a way to ease the development of autonomous driving systems. It targets a central weakness of deep visuomotor policy learning: reinforcement learning and imitation learning are sample-inefficient, typically requiring extensive online interaction and many expert demonstrations, which limits real-world applicability. The authors instead pretrain policy representations on uncurated YouTube driving videos, so that downstream policies can be learned from less task-specific data.

Methodology Overview

The paper outlines an innovative method of utilizing vast amounts of unlabeled driving video data sourced from platforms like YouTube. Here's how the methodology unfolds:

Inverse Dynamics Model: A small subset of labeled driving data is used to train an inverse dynamics model. This model predicts action labels from visual frames, effectively generating pseudo-action labels that are crucial for the pretraining process.
Action-Conditioned Contrastive Learning: The pretraining paradigm, termed Action-conditioned COntrastive Learning (ACO), applies contrastive learning to two types of pairs:
- Instance Contrastive Pair (ICP): Formed by creating different views of a single image.
- Action Contrastive Pair (ACP): Formed by images that involve similar driving actions, as predicted by the inverse dynamics model.
Training with YouTube Videos: The approach capitalizes on the diverse set of visual scenes available in YouTube driving videos, using them to train a neural network that maps visual inputs to action decisions more effectively.
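The pipeline above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The following are assumptions made for illustration only: the inverse dynamics model is a callable that takes two consecutive frames, actions are scalar steering values, two frames count as an Action Contrastive Pair when their pseudo actions differ by less than a `threshold`, and the loss is a generic multi-positive InfoNCE-style objective. All function names are hypothetical.

```python
import math


def pseudo_label(frames, inverse_dynamics):
    """Label each consecutive frame pair with a predicted action.

    `inverse_dynamics` is a hypothetical callable (prev_frame, next_frame) -> action.
    """
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]


def action_positive_mask(actions, threshold):
    """mask[i][j] is True when samples i and j (i != j) have similar pseudo actions."""
    n = len(actions)
    return [[i != j and abs(actions[i] - actions[j]) < threshold for j in range(n)]
            for i in range(n)]


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def multi_positive_info_nce(features, mask, temperature=0.1):
    """InfoNCE-style loss averaged over anchors that have at least one positive."""
    n = len(features)
    losses = []
    for i in range(n):
        # Similarity of anchor i to every other sample, scaled by temperature.
        logits = {j: cosine(features[i], features[j]) / temperature
                  for j in range(n) if j != i}
        positives = [j for j in logits if mask[i][j]]
        if not positives:
            continue  # anchor has no action-similar partner in this batch
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        # Negative mean log-probability assigned to the positives.
        losses.append(-sum(logits[j] - log_denominator for j in positives)
                      / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    # Toy batch: two "near straight" samples and two "turning" samples.
    actions = [0.0, 0.02, 0.5, 0.52]
    mask = action_positive_mask(actions, threshold=0.05)
    features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(multi_positive_info_nce(features, mask))
```

In this sketch, the loss is lower when samples with similar pseudo actions have similar features, which is the behavior the action contrastive pairs are meant to encourage. The instance contrastive pairs (two augmented views of one image) would use the same loss with a mask that marks the two views of each image as positives.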

Experimental Validation

The authors validate their methodology on imitation learning (IL), reinforcement learning (RL), and lane detection tasks. Models initialized with ACO-pretrained weights consistently outperform those initialized with ImageNet-pretrained weights or with weights from previous unsupervised learning methods. One key outcome is a notable improvement in imitation learning success rates, particularly when training data is limited.

Reinforcement learning performance, evaluated with the PPO algorithm, also improved significantly, both with and without fine-tuning the model backbone during training. The lane detection results further highlight the generalizability of the pretrained features, confirming that action conditioning improves not only policy learning but also related visual tasks.

Implications and Future Directions

This work has promising implications for both theory and practice. Theoretically, incorporating action conditioning into contrastive learning is a noteworthy advance in policy pretraining. It shows that unstructured, unlabeled data from online sources can be used effectively to improve model performance in complex, real-world environments.

Practically, the study provides a cost-effective pathway for the development of more scalable and robust autonomous driving systems, reducing reliance on exhaustive in-house data collection and manual annotation.

Future developments in AI could further refine this approach by integrating more nuanced action prediction models or exploring other sources of publicly available data for even richer contextual understanding.

Overall, the authors’ work proposes a methodology that holds promise in reducing the resource intensity of developing sophisticated driving policies while maintaining high performance across a range of related tasks, thus contributing notably to the field of autonomous vehicles and beyond.
