
From Sora What We Can See: A Survey of Text-to-Video Generation

(2405.10674)
Published May 17, 2024 in cs.CV and cs.AI

Abstract

With impressive achievements already made, artificial intelligence is on the path toward artificial general intelligence. Sora, developed by OpenAI, with its minute-level world-simulative abilities, can be considered a milestone on this developmental path. However, despite its notable successes, Sora still encounters various obstacles that need to be resolved. In this survey, we start from the perspective of disassembling Sora in text-to-video generation and conduct a comprehensive review of the literature, trying to answer the question "From Sora What We Can See". Specifically, after basic preliminaries regarding the general algorithms are introduced, the literature is categorized along three mutually perpendicular dimensions: evolutionary generators, excellent pursuit, and realistic panorama. Subsequently, the widely used datasets and metrics are organized in detail. Last but most importantly, we identify several challenges and open problems in this domain and propose potential future directions for research and development.

Evolution and current capabilities of text-to-video (T2V) generation, featuring Sora's advancements and challenges.

Overview

  • The paper surveys advancements in Text-to-Video (T2V) generation, detailing different models such as GANs, Variational Autoencoders (VAEs), diffusion models, and autoregressive transformers, each addressing the complexities of video generation from text prompts.

  • It highlights critical aspects for superior video generation like extended duration, superior resolution, and seamless quality, showcasing models like TATS, NUWA-XL, and FLAVR that contribute to these advancements.

  • The paper also discusses various challenges and open problems in T2V, including realistic motion and coherence, data privacy, and simultaneous multi-shot video generation, suggesting future directions for research and applications in areas like robotics and digital twins.

A Survey of Text-to-Video Generation

Introduction

It's exciting to see how far we’ve come in generating videos from text prompts. This field, known as Text-to-Video (T2V) generation, has seen substantial advancements, especially with the emergence of models like OpenAI's Sora. You might be familiar with models that generate images from text, but video generation adds layers of complexity because it must account for temporal coherence—keeping the video smooth and logical over time. Let's dive into a comprehensive survey on this subject as outlined by Rui Sun et al.

Evolutionary Generators

The journey of T2V generation can be largely segmented based on the foundational algorithms: GAN/VAE-based, Diffusion-based, and Autoregressive-based.

GAN/VAE-based

Early works leaned heavily on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). These models paved the way but had limitations in handling video dynamics effectively, in particular keeping motion and appearance consistent across frames.

Diffusion-based

Inspired by the success of diffusion models in text-to-image (T2I) tasks, researchers applied similar principles to T2V. Some notable advancements include:

  • VDM extended traditional 2D image diffusion models to video by incorporating a 3D (space-time) U-Net architecture (the sketch after this list illustrates the overall denoising loop).
  • Make-A-Video and Imagen Video capitalized on pre-trained T2I models to enhance motion realism and generate longer videos.
  • STA-DM further made strides by focusing on maintaining temporal and spatial coherence within generated videos.
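
To make the diffusion recipe concrete, here is a minimal sketch of a DDPM-style sampling loop over a video tensor. The `denoiser` is a placeholder standing in for the trained, text-conditioned 3D U-Net; the shapes and noise schedule are illustrative assumptions, not any specific model's actual configuration.

```python
import torch

# Hypothetical stand-in for a text-conditioned 3D (space-time) U-Net.
# A real video diffusion model trains such a network to predict the noise in x_t.
def denoiser(x_t, t, text_emb):
    return torch.zeros_like(x_t)  # placeholder: returns the predicted noise (epsilon)

def sample_video(text_emb, frames=16, height=64, width=64, steps=50):
    """DDPM-style ancestral sampling over a video tensor of shape (B, C, T, H, W)."""
    betas = torch.linspace(1e-4, 0.02, steps)        # noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, 3, frames, height, width)     # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), text_emb)            # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        # Posterior mean of x_{t-1} given the current noise estimate.
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)    # add sampling noise
    return x  # a real pipeline would decode/clamp this to RGB frames

video = sample_video(text_emb=torch.randn(1, 77, 768))  # toy CLIP-like text embedding
print(video.shape)  # torch.Size([1, 3, 16, 64, 64])
```

In practice these systems predict noise with a learned space-time network, often run the process in a compressed latent space, and condition every step on the text embedding via cross-attention.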

Autoregressive-based

Autoregressive transformers have become quite effective for T2V tasks, especially for long-video generation (a toy generation loop is sketched after this list):

  • NUWA-Infinity can generate videos frame by frame, maintaining coherence for extended sequences.
  • Phenaki incorporates a tokenized video representation to handle variable-length video generation, showcasing excellent temporal dynamics.
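
A toy view of the autoregressive recipe: the video is first compressed into a sequence of discrete tokens (for example by a VQ-style encoder), and a causal transformer then predicts the next token conditioned on the text and on everything generated so far. The `TinyVideoTransformer` below is a generic stand-in, not Phenaki's or NUWA-Infinity's actual architecture.

```python
import torch
import torch.nn as nn

class TinyVideoTransformer(nn.Module):
    """Generic decoder-style stand-in for an autoregressive video-token model."""
    def __init__(self, vocab_size=1024, dim=256, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        t = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        x = self.blocks(self.embed(tokens), mask=mask)
        return self.head(x)                          # next-token logits

@torch.no_grad()
def generate(model, prompt_tokens, steps=64, temperature=1.0):
    """Sample video tokens one at a time, conditioned on all previous tokens."""
    tokens = prompt_tokens
    for _ in range(steps):
        logits = model(tokens)[:, -1] / temperature
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)     # append and continue
    return tokens                                     # a VQ decoder maps these back to frames

model = TinyVideoTransformer()
text_tokens = torch.randint(0, 1024, (1, 8))          # toy "text prompt" tokens
video_tokens = generate(model, text_tokens)
print(video_tokens.shape)                              # torch.Size([1, 72])
```

Variable-length generation, as showcased by Phenaki, follows naturally from this recipe: the loop can keep appending tokens (and fresh prompts) for as long as desired.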

Excellent Pursuit

To achieve superior video generation, models focus on three critical aspects: extended duration, superior resolution, and seamless quality.

Extended Duration

Models like TATS and NUWA-XL show how hierarchical or autoregressive frameworks can generate long-duration videos by maintaining temporal coherence across many frames.
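
One way to picture the hierarchical idea used, in spirit, by coarse-to-fine systems such as NUWA-XL: first generate a sparse set of keyframes spanning the whole clip, then fill in the frames between each neighbouring pair. Both generator functions below are hypothetical placeholders; a real system would run conditioned diffusion models at each stage.

```python
import torch

def generate_keyframes(text_emb, num_keyframes=5):
    """Hypothetical global stage: sparse keyframes covering the whole clip."""
    return [torch.randn(3, 64, 64) for _ in range(num_keyframes)]

def infill(frame_a, frame_b, text_emb, num_middle=7):
    """Hypothetical local stage: frames between two keyframes.
    A real system would condition a generator on both endpoints; linear
    interpolation is used here only to keep the sketch runnable."""
    return [(1 - w) * frame_a + w * frame_b
            for w in (i / (num_middle + 1) for i in range(1, num_middle + 1))]

def generate_long_video(text_emb):
    keys = generate_keyframes(text_emb)
    video = [keys[0]]
    for a, b in zip(keys, keys[1:]):
        video.extend(infill(a, b, text_emb))   # local coherence between keyframes
        video.append(b)                         # keyframes anchor global coherence
    return torch.stack(video)                   # (T, C, H, W)

clip = generate_long_video(text_emb=None)
print(clip.shape)   # torch.Size([33, 3, 64, 64])
```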

Superior Resolution

Generating high-resolution videos is crucial and challenging. For example, Show-1 uses a hybrid model combining both pixel-based and latent-based diffusion models to upscale videos, achieving high resolution while maintaining quality.
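
The cascade idea itself is simple to sketch: a base model produces a cheap low-resolution clip, and a second stage upscales it. The two functions below are hypothetical stand-ins (real hybrid designs run diffusion in pixel space and in a compressed latent space for the respective stages); plain interpolation keeps the sketch runnable and shape-correct.

```python
import torch

def base_pixel_model(text_emb, frames=16):
    """Hypothetical base stage: a low-resolution clip, (B, C, T, H, W)."""
    return torch.randn(1, 3, frames, 64, 64)

def upscaler(low_res, scale=4):
    """Hypothetical super-resolution stage; a real one would be a diffusion model."""
    return torch.nn.functional.interpolate(
        low_res, scale_factor=(1, scale, scale), mode="trilinear", align_corners=False
    )

low = base_pixel_model(text_emb=None)
high = upscaler(low)
print(low.shape, high.shape)   # (1, 3, 16, 64, 64) -> (1, 3, 16, 256, 256)
```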

Seamless Quality

To enhance frame-to-frame quality and consistency, methods like FLAVR leverage 3D spatio-temporal convolutions, ensuring that videos are not only high in resolution but also fluid and artifact-free.
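
As a minimal illustration of the underlying building block (an assumption-level sketch, not FLAVR's actual architecture): a stack of 3D convolutions looks at a short window of frames jointly in space and time and predicts an intermediate frame.

```python
import torch
import torch.nn as nn

class SpatioTemporalInterpolator(nn.Module):
    """Toy frame interpolator built from 3D (space-time) convolutions."""
    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, hidden, kernel_size=3, padding=1),  # mixes space and time
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # Collapse the 4-frame temporal window into a single output frame.
            nn.Conv3d(hidden, channels, kernel_size=(4, 3, 3), padding=(0, 1, 1)),
        )

    def forward(self, frames):                 # frames: (B, C, T=4, H, W)
        return self.net(frames).squeeze(2)     # predicted middle frame: (B, C, H, W)

model = SpatioTemporalInterpolator()
window = torch.randn(2, 3, 4, 128, 128)        # 4 consecutive input frames
middle = model(window)
print(middle.shape)                             # torch.Size([2, 3, 128, 128])
```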

Realistic Panorama

Several elements are key to making generated videos realistic: dynamic motion, complex scenes, multiple objects, and a rational layout.

Dynamic Motion

Models like AnimateDiff add dedicated temporal layers on top of a pre-trained text-to-image backbone to handle motion dynamics effectively, ensuring actions within the videos appear natural and coherent over time.
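
To illustrate the kind of module such methods insert: a temporal self-attention layer that reshapes per-frame features so attention runs across time at each spatial location, leaving the (typically frozen) spatial layers untouched. Names and shapes here are illustrative assumptions, not AnimateDiff's exact motion module.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Toy motion module: self-attention across the time axis only."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, C) feature maps from a per-frame (spatial) backbone.
        b, t, h, w, c = x.shape
        seq = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)  # one time-sequence per pixel
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)
        out = out.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)
        return x + out                            # residual keeps the spatial features intact

module = TemporalAttention()
features = torch.randn(1, 16, 8, 8, 64)           # 16 frames of 8x8 feature maps, 64 channels
print(module(features).shape)                      # torch.Size([1, 16, 8, 8, 64])
```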

Complex Scene

Some frameworks, such as VideoDirectorGPT, leverage LLMs as planners to generate intricate scenes with multiple interacting elements.

Multiple Objects

Handling multiple objects involves challenges like attribute mixing and object disappearance. Innovations like Detector Guidance (DG) help separate and clarify objects, maintaining their unique characteristics throughout the video.

Rational Layout

Creating a rational layout that adheres to physical and spatial principles is critical. LLM-grounded Video Diffusion (LVD) helps generate structured scene layouts that guide video creation, ensuring the sequence aligns with logical and realistic layouts.
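
Conceptually, the layout stage produces an explicit, machine-readable plan, for example per-frame bounding boxes for each named object, that the video generator is then conditioned on. The schema and the `layout_to_mask` helper below are hypothetical, shown only to make the two-stage idea concrete.

```python
import torch

# Hypothetical output of the LLM planning stage: normalized [x0, y0, x1, y1]
# boxes for each object at a few key timesteps.
layout = {
    "a red ball":     {0: [0.1, 0.6, 0.3, 0.8], 15: [0.6, 0.6, 0.8, 0.8]},  # rolls right
    "a wooden table": {0: [0.0, 0.7, 1.0, 1.0], 15: [0.0, 0.7, 1.0, 1.0]},  # static
}

def layout_to_mask(box, size=64):
    """Rasterize one normalized box into a binary spatial mask."""
    x0, y0, x1, y1 = (int(v * size) for v in box)
    mask = torch.zeros(size, size)
    mask[y0:y1, x0:x1] = 1.0
    return mask

# A layout-grounded generator would use masks like these to bias where each
# object's attention or energy concentrates during denoising.
masks_t0 = {name: layout_to_mask(frames[0]) for name, frames in layout.items()}
print({name: float(m.sum()) for name, m in masks_t0.items()})
```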

Datasets and Metrics

The paper also dives into the datasets and evaluation metrics crucial for training and assessing T2V models. Key datasets span domains such as face, open-domain, movie, action, instruction, and cooking videos. Evaluation metrics range from image-level metrics like PSNR and SSIM to video-specific metrics such as Video Inception Score (Video IS) and Fréchet Video Distance (FVD), enabling comprehensive quality assessment.
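
As a concrete example of the image-level end of that spectrum, PSNR is just a log-scaled mean squared error between a generated frame and a reference frame; the snippet below computes it per frame and averages over a clip. (FVD, by contrast, compares feature statistics of whole videos under a pretrained video network, so it cannot be computed from pixels alone.)

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB for one frame (H x W x C, same value range)."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                       # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def video_psnr(ref_frames, gen_frames):
    """Average frame-wise PSNR over a clip (arrays of shape (T, H, W, C))."""
    return float(np.mean([psnr(r, g) for r, g in zip(ref_frames, gen_frames)]))

ref = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
gen = np.clip(ref + np.random.randint(-5, 6, size=ref.shape), 0, 255).astype(np.uint8)
print(f"video PSNR: {video_psnr(ref, gen):.2f} dB")
```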

Challenges and Open Problems

Despite the advances, several challenges remain:

  • Realistic Motion and Coherence: Ensuring video frames transition smoothly and actions appear natural remains a significant hurdle.
  • Data Access Privacy: Leveraging private datasets while ensuring privacy.
  • Simultaneous Multi-shot Video Generation: Generating videos with consistent characters and styles across multiple shots.
  • Multi-Agent Co-creation: Collaborating in a multi-agent setup to achieve complex video generation tasks.

Future Directions

Looking ahead, the paper suggests some intriguing future directions:

  • Robot Learning from Visual Assistance: Using generated videos to aid robots in learning new tasks through demonstration.
  • Infinity 3D Dynamic Scene Reconstruction and Generation: Combining Sora with 3D technologies like NeRF for real-time, infinite scene generation.
  • Augmented Digital Twins: Enhancing digital twin systems by incorporating Sora’s simulation capabilities to improve real-time data accuracy and interactivity.

Conclusion

The paper by Rui Sun et al. provides a detailed exploration of T2V generation, highlighting impressive advances and outlining challenges and future opportunities. As T2V models continue to evolve, their applications—from enhancing robotics to improving digital twins—will undoubtedly expand, making this an exciting field to watch.

For those interested in keeping up with the latest in T2V, you might want to explore the studies surveyed in the paper, many of which are listed in detail in the GitHub repository that accompanies it.
