Temporally Consistent Transformers for Video Generation (2210.02396v2)

Published 5 Oct 2022 in cs.CV, cs.AI, and cs.LG

Abstract: To generate accurate videos, algorithms have to understand the spatial and temporal dependencies in the world. Current algorithms enable accurate predictions over short horizons but tend to suffer from temporal inconsistencies. When generated content goes out of view and is later revisited, the model invents different content instead. Despite this severe limitation, no established benchmarks on complex data exist for rigorously evaluating video generation with long temporal dependencies. In this paper, we curate 3 challenging video datasets with long-range dependencies by rendering walks through 3D scenes of procedural mazes, Minecraft worlds, and indoor scans. We perform a comprehensive evaluation of current models and observe their limitations in temporal consistency. Moreover, we introduce the Temporally Consistent Transformer (TECO), a generative model that substantially improves long-term consistency while also reducing sampling time. By compressing its input sequence into fewer embeddings, applying a temporal transformer, and expanding back using a spatial MaskGit, TECO outperforms existing models across many metrics. Videos are available on the website: https://wilson1yan.github.io/teco

Summary

The paper presents TECO, a model that significantly improves temporal consistency in long-horizon video generation.
It combines a temporal transformer with a spatial MaskGit, and the paper adds new benchmarks rendered from 3D environments to test long-range dependencies.
DropLoss lowers training cost, and the evaluations show TECO outperforming state-of-the-art models on LPIPS and FVD.

Overview of "Temporally Consistent Transformers for Video Generation"

The paper "Temporally Consistent Transformers for Video Generation" introduces the Temporally Consistent Transformer (TECO), a generative model designed to keep video predictions consistent over long horizons. Existing video generation methods predict accurately over short horizons but lose consistency over long sequences: when content goes out of view and is later revisited, the model invents different content instead of reproducing what it generated before. The authors address this with TECO's architecture and with new benchmarks for evaluating long-range temporal dependencies.

Contributions of the Paper

Novel Benchmark Datasets: The paper curates three challenging datasets with long-range dependencies, rendered as walks through 3D environments: procedural mazes (DMLab), Minecraft worlds, and indoor scans (Habitat). These datasets are intended to rigorously test the temporal consistency of video generative models. According to the authors, no established benchmarks on complex data existed for evaluating long temporal dependencies, so these datasets fill that gap.
The TECO Model: TECO works in three stages. It compresses the input sequence into fewer embeddings, processes those embeddings with a temporal transformer, and expands them back into video frames with a spatial MaskGit. Because the temporal transformer operates on the shorter compressed sequence, TECO outperforms existing models on many metrics while also reducing sampling time.
Evaluation and Results: In a comprehensive evaluation on the new datasets, TECO maintains long-term consistency across generated frames better than the baselines. Notably, it stays consistent over sequences of hundreds of frames, exceeding state-of-the-art models such as Perceiver AR and Latent FDM.
DropLoss Training: The paper introduces DropLoss, a scalable training technique that omits a subset of time indices from the loss computation during training. This significantly reduces training cost without compromising model performance, so a larger model can be trained within the same computational budget.
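The DropLoss idea can be illustrated with a minimal sketch. It assumes that the dropped time indices are chosen uniformly at random and that the loss is a plain average over the remaining indices; the function names, the `drop_rate` parameter, and the toy loss values are hypothetical and are not taken from the paper's code.

```python
import random


def droploss_keep_indices(num_timesteps, drop_rate, rng):
    """Choose the time indices that still contribute to the loss.

    Assumption: indices are dropped uniformly at random. At least one
    index is always kept so the loss stays defined.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    num_keep = max(1, round(num_timesteps * (1.0 - drop_rate)))
    return sorted(rng.sample(range(num_timesteps), num_keep))


def masked_mean_loss(per_timestep_loss, keep_indices):
    """Average the loss over the kept time indices only."""
    return sum(per_timestep_loss[t] for t in keep_indices) / len(keep_indices)


# Toy example: eight timesteps with made-up per-timestep losses.
rng = random.Random(0)
per_timestep_loss = [0.5, 0.4, 0.9, 0.3, 0.7, 0.6, 0.2, 0.8]
keep = droploss_keep_indices(len(per_timestep_loss), drop_rate=0.75, rng=rng)
loss = masked_mean_loss(per_timestep_loss, keep)
print(keep, loss)
```

In a real training loop the saving would come from running the expensive per-frame decoding only for the kept indices, rather than from computing every per-timestep loss and then discarding most of them as this toy version does.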

Numerical Outcomes

The authors report that TECO achieves better LPIPS and FVD scores than the baselines (lower is better for both metrics), indicating better perceptual quality and temporal consistency. For instance, TECO maintains a lower LPIPS across timesteps, which means its predictions stay close to the ground truth when previously seen parts of the scene come back into view.
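A per-timestep comparison of this kind can be sketched as follows. This is only an illustration of the bookkeeping: mean squared error stands in for LPIPS (which requires a pretrained network), frames are flattened lists of pixel values, and all names and data are hypothetical.

```python
def frame_distance(predicted_frame, target_frame):
    """Mean squared error between two flattened frames (stand-in for LPIPS)."""
    squared = [(p - t) ** 2 for p, t in zip(predicted_frame, target_frame)]
    return sum(squared) / len(squared)


def per_timestep_curve(predicted_videos, target_videos):
    """Average the frame distance over videos, separately for each timestep."""
    num_videos = len(predicted_videos)
    num_steps = len(predicted_videos[0])
    curve = []
    for t in range(num_steps):
        total = sum(
            frame_distance(pred[t], target[t])
            for pred, target in zip(predicted_videos, target_videos)
        )
        curve.append(total / num_videos)
    return curve


# Toy data: two videos, three timesteps, two "pixels" per frame.
targets = [
    [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
]
predictions = [
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]],  # drifts away from the target
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],  # matches the target exactly
]
curve = per_timestep_curve(predictions, targets)
print(curve)  # [0.0, 0.25, 0.5]
```

A curve that rises with the timestep, as in this toy case, is the pattern one would expect from a model that loses consistency over long horizons; a flatter, lower curve corresponds to the behaviour attributed to TECO.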

Implications and Future Speculations

TECO's improvements in temporal consistency matter for applications that need realistic video generation over long sequences, such as video simulation for robotics, gaming, and virtual reality environments. The paper highlights an ongoing trade-off between video fidelity and temporal consistency, which persists even with TECO's architecture. Future work could improve the sequence model with more efficient sequence processing techniques, such as those used in efficient transformer variants, to better balance this trade-off.

This research reflects the continuing evolution of video generation models, with TECO setting a precedent for handling long-term dependencies in generative video tasks. Future work could explore further optimizations, such as direct pixel-level training with GAN or diffusion approaches, to address the remaining reconstruction artifacts and blur, particularly in high-information scenes like those in Kinetics-600. Overall, TECO's contributions mark meaningful progress in video generation by raising the standard for temporal consistency.
