Emergent Mind

Abstract

World models can foresee the outcomes of different actions, which is of paramount importance for autonomous driving. Nevertheless, existing driving world models still have limitations in generalization to unseen environments, prediction fidelity of critical details, and action controllability for flexible application. In this paper, we present Vista, a generalizable driving world model with high fidelity and versatile controllability. Based on a systematic diagnosis of existing methods, we introduce several key ingredients to address these limitations. To accurately predict real-world dynamics at high resolution, we propose two novel losses to promote the learning of moving instances and structural information. We also devise an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, we incorporate a versatile set of controls from high-level intentions (command, goal point) to low-level maneuvers (trajectory, angle, and speed) through an efficient learning strategy. After large-scale training, the capabilities of Vista can seamlessly generalize to different scenarios. Extensive experiments on multiple datasets show that Vista outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize the capacity of Vista itself to establish a generalizable reward for real-world action evaluation without accessing the ground truth actions.

Vista anticipates realistic futures that can be controlled by multi-modal actions, and uses its own predictions to evaluate real-world driving actions.

Overview

  • Vista is an advanced driving world model designed to improve generalization in unseen environments, boost prediction fidelity, and enhance action controllability for autonomous driving.

  • The model introduces innovative loss functions and leverages large-scale driving videos to perform accurate long-horizon predictions and diverse action controls, making it versatile for various autonomous driving tasks.

  • Experimental results show that Vista significantly outperforms state-of-the-art models, demonstrating superior fidelity and realistic prediction capabilities across multiple datasets.

Overview of Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

The paper presents Vista, an advanced driving world model developed to address specific limitations in existing models related to generalization to unseen environments, prediction fidelity of critical details, and action controllability. The model is designed to foresee the outcomes of different actions for autonomous driving, which is critical for ensuring safety and efficiency in real-world driving scenarios.

Key Contributions

  1. Enhanced Generalization Capability:

    • Vista leverages a large corpus of worldwide driving videos to improve its generalization capability. Through a latent replacement approach that injects historical frames as priors, together with dynamic priors (position, velocity, and acceleration), the model maintains coherent long-horizon rollouts and predicts real-world dynamics across varied scenarios.
  2. High-Fidelity Prediction:

    • Two novel loss functions are introduced: the dynamics enhancement loss and the structure preservation loss. The former prioritizes dynamic regions in the video, such as moving vehicles and pedestrians, while the latter preserves structural details by focusing on high-frequency components of the prediction. These additions significantly enhance the visual accuracy and realism of future predictions at high resolution (576×1024 pixels).
  3. Versatile Action Controllability:

    • Vista supports a diverse set of action formats, including high-level intentions (commands, goal points) and low-level maneuvers (trajectory, angle, and speed), through a unified conditioning interface and an efficient training strategy. This versatility extends the model's applicability to various autonomous driving tasks, from evaluating high-level policies to executing precise maneuvers.
  4. Evaluation of Real-World Actions:

    • Utilizing its own capabilities, Vista is implemented as a generalizable reward function to evaluate real-world driving actions without requiring ground truth actions. This approach leverages the prediction uncertainty to assess action reliability, enhancing the model's utility in real-world applications where ground truth data is often unavailable.
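The latent replacement idea from the first contribution can be sketched in a few lines: clean latents of observed frames overwrite the corresponding noisy latents, so the model only has to denoise the future. This is a minimal numpy illustration; the tensor shapes and the framing are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def latent_replacement(noisy_latents: np.ndarray,
                       history_latents: np.ndarray,
                       num_condition_frames: int) -> np.ndarray:
    """Inject clean latents of past frames as priors for the rollout.

    noisy_latents:   (T, C, H, W) latents the diffusion model denoises.
    history_latents: (T_hist, C, H, W) clean latents of observed frames.
    The first `num_condition_frames` noisy latents are replaced with the
    most recent history, anchoring the prediction to what was observed.
    """
    out = noisy_latents.copy()
    out[:num_condition_frames] = history_latents[-num_condition_frames:]
    return out

# Toy usage: an 8-frame clip conditioned on the last 2 observed frames.
noisy = np.random.randn(8, 4, 16, 16)
hist = np.random.randn(2, 4, 16, 16)
mixed = latent_replacement(noisy, hist, num_condition_frames=2)
assert np.allclose(mixed[:2], hist)       # history injected verbatim
assert np.allclose(mixed[2:], noisy[2:])  # future frames left noisy
```

For long-horizon rollouts, the same replacement can be applied autoregressively: the tail of each predicted clip becomes the history for the next one.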
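The intent behind the two losses in the second contribution can likewise be sketched. Below, a crude motion map (frame differences of the target) reweights the reconstruction error toward moving regions, and a Laplacian filter isolates high-frequency structure. Both choices are illustrative stand-ins, not the paper's exact formulations.

```python
import numpy as np

def dynamics_enhancement_loss(pred, target):
    """Weight per-pixel error by a motion map (frame differences of the
    target), so moving regions dominate the objective.
    pred, target: (T, H, W) grayscale clips."""
    err = (pred - target) ** 2
    motion = np.abs(np.diff(target, axis=0))
    motion = np.concatenate([motion[:1], motion], axis=0)  # pad frame 0
    weight = 1.0 + motion / (motion.mean() + 1e-8)
    return float((weight * err).mean())

def structure_preservation_loss(pred, target):
    """Penalize error in high-frequency components (a 4-neighbour
    Laplacian here) to keep edges and fine structures sharp."""
    def high_freq(x):
        return (np.roll(x, 1, -1) + np.roll(x, -1, -1)
                + np.roll(x, 1, -2) + np.roll(x, -1, -2) - 4.0 * x)
    return float(((high_freq(pred) - high_freq(target)) ** 2).mean())

# An equal-magnitude error hurts more when it falls on a moving pixel...
target = np.zeros((2, 4, 4)); target[1, 0, 0] = 1.0  # motion at (0, 0)
err_moving = target.copy(); err_moving[1, 0, 0] = 0.0
err_static = target.copy(); err_static[1, 3, 3] = 1.0
assert (dynamics_enhancement_loss(err_moving, target)
        > dynamics_enhancement_loss(err_static, target))

# ...and a prediction that flattens edges is punished structurally.
edge = np.zeros((2, 8, 8)); edge[:, :, 4:] = 1.0
flat = np.full_like(edge, 0.5)
assert structure_preservation_loss(edge, edge) < structure_preservation_loss(flat, edge)
```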
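The reward mechanism in the fourth contribution rests on a simple principle: when the world model is confident about the consequences of an action, its sampled futures agree. A hedged sketch of that principle follows; `toy_rollout` is a stand-in stub, and the variance-based score is one plausible reading of "prediction uncertainty", not Vista's exact estimator.

```python
import numpy as np

def action_reward(rollout_fn, action, num_samples=4, seed=None):
    """Score a candidate action by the consistency of imagined futures:
    low disagreement across samples yields a high reward.
    `rollout_fn(action, rng)` returns one predicted future as an array."""
    rng = np.random.default_rng(seed)
    futures = np.stack([rollout_fn(action, rng) for _ in range(num_samples)])
    uncertainty = futures.var(axis=0).mean()
    return -float(uncertainty)  # higher reward = more consistent futures

# Toy stub: the sensible action yields consistent futures, the risky
# one yields noisy, divergent ones.
def toy_rollout(action, rng):
    base = np.ones((2, 4, 4))
    noise = 0.01 if action == "keep_lane" else 1.0
    return base + rng.normal(scale=noise, size=base.shape)

assert action_reward(toy_rollout, "keep_lane", seed=0) > action_reward(toy_rollout, "swerve", seed=0)
```

Because the score needs only the model's own rollouts, no ground-truth action labels are required, which matches the paper's motivation for a generalizable reward.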

Experimental Validation

A comprehensive set of experiments demonstrates Vista's superiority over existing driving world models. Key results include:

Quantitative Performance:

- On the nuScenes validation set, Vista outperforms state-of-the-art models with a 55% improvement in FID and a 27% improvement in FVD.

Generalization Across Datasets:

- Vista's predictions were consistently preferred by human evaluators over those from state-of-the-art video generation models across multiple diverse datasets such as OpenDV-YouTube-val, nuScenes, Waymo, and CODA.

Long-Horizon Prediction:

- Unlike previous models, Vista is capable of realistic long-horizon prediction, maintaining high fidelity over 15-second rollouts, a feature critical for long-term planning in autonomous driving.

Effective Action Control:

- Evaluations revealed that applying action controls via high-level intentions or low-level maneuvers resulted in predictions closely mirroring true driving behaviors, evidenced by significant reductions in FVD scores.

Implications and Future Directions

The implications of this research are multifaceted. Practically, Vista's ability to generalize and predict driving dynamics with high fidelity makes it a valuable tool for developing and testing autonomous driving systems. The versatility in action control also means it can be integrated into various stages of autonomous driving pipelines, from high-level planning to low-level motion control.

Theoretically, the paper introduces novel techniques that can be leveraged beyond autonomous driving. The dynamics enhancement and structure preservation loss functions can be adopted in other domains requiring high-fidelity video generation with complex dynamics.

Future research could explore scaling Vista to even larger datasets and integrating it with scalable architectures to further enhance computation efficiency. Additionally, extending Vista's framework to other domains, such as robotics and simulation environments, could prove beneficial.

Conclusion

Vista represents a significant step forward in the development of generalizable driving world models. Its enhanced fidelity, versatile controllability, and robust evaluation mechanism highlight its potential in pushing the boundaries of autonomous driving technologies. Future advancements based on this work could open new avenues for the broader application of AI-driven world models.
