Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability (2405.17398v5)
Abstract: World models can foresee the outcomes of different actions, which is of paramount importance for autonomous driving. Nevertheless, existing driving world models still have limitations in generalization to unseen environments, prediction fidelity of critical details, and action controllability for flexible application. In this paper, we present Vista, a generalizable driving world model with high fidelity and versatile controllability. Based on a systematic diagnosis of existing methods, we introduce several key ingredients to address these limitations. To accurately predict real-world dynamics at high resolution, we propose two novel losses to promote the learning of moving instances and structural information. We also devise an effective latent replacement approach to inject historical frames as priors for coherent long-horizon rollouts. For action controllability, we incorporate a versatile set of controls from high-level intentions (command, goal point) to low-level maneuvers (trajectory, angle, and speed) through an efficient learning strategy. After large-scale training, the capabilities of Vista can seamlessly generalize to different scenarios. Extensive experiments on multiple datasets show that Vista outperforms the most advanced general-purpose video generator in over 70% of comparisons and surpasses the best-performing driving world model by 55% in FID and 27% in FVD. Moreover, for the first time, we utilize the capacity of Vista itself to establish a generalizable reward for real-world action evaluation without accessing the ground truth actions.
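The latent replacement idea described above can be sketched as follows. This is a minimal illustration under assumed names (`denoise_step` is a hypothetical placeholder for one reverse-diffusion step, and the array shapes are arbitrary), not the actual Vista implementation: at every denoising step, the leading frames of the noisy sample are overwritten with clean latents encoded from the observed history, anchoring the rollout to past frames.

```python
import numpy as np

def denoise_step(latents, t):
    # Hypothetical placeholder for one reverse-diffusion step; a real
    # model would predict and subtract noise conditioned on t here.
    return latents * 0.9

def rollout_with_latent_replacement(history_latents, num_frames, num_steps, rng):
    """Sketch of latent replacement: inject historical frames as priors
    by overwriting the first k latent frames at every denoising step,
    keeping long-horizon predictions coherent with the observed past."""
    k, c, h, w = history_latents.shape
    # Start the whole clip from pure noise.
    latents = rng.standard_normal((num_frames, c, h, w))
    for t in range(num_steps, 0, -1):
        latents = denoise_step(latents, t)
        # Latent replacement: re-anchor the history frames after each step.
        latents[:k] = history_latents
    return latents
```

For long rollouts, the same scheme can be applied autoregressively: the last few predicted latents of one clip become the `history_latents` for the next.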