Emergent Mind

Abstract

Significant advancements have been made in video generative models recently. Unlike image generation, video generation poses greater challenges: a model must produce not only high-quality individual frames but also temporal consistency across them. Despite this impressive progress, research on metrics for evaluating the quality of generated videos, especially their temporal and motion consistency, remains underexplored. To bridge this research gap, we propose the Fréchet Video Motion Distance (FVMD), a metric that focuses on evaluating motion consistency in video generation. Specifically, we design explicit motion features based on key point tracking and measure the similarity between these features via the Fréchet distance. We conduct a sensitivity analysis, injecting noise into real videos, to verify the effectiveness of FVMD. Further, we carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics. Additionally, our motion features consistently improve the performance of Video Quality Assessment (VQA) models, indicating that our approach also applies to unary video quality evaluation. Code is available at https://github.com/ljh0v0/FMD-frechet-motion-distance.

Figure: Pipeline of the proposed Fréchet Video Motion Distance, which tracks video key point trajectories.

Overview

  • The paper introduces the Fréchet Video Motion Distance (FVMD), a novel metric designed to evaluate motion consistency in generated videos, addressing gaps in current evaluation methods such as FID-VID, FVD, and VBench.

  • The methodology leverages key point tracking, specifically using the PIPs++ model to extract detailed motion features like velocity and acceleration, which are then transformed into statistical histograms for robust comparison.

  • Empirical validation, including sensitivity analysis and a large-scale human study, demonstrates FVMD's superior capability in detecting temporal inconsistencies and its high correlation with human judgments compared to existing metrics.

Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos

Recent trends in video generation have focused on enhancing the quality and temporal coherence of generated content. Unlike static image generation, video generation entails a higher degree of complexity, necessitating not only visual fidelity in individual frames but also seamless temporal continuity across them. The paper by Liu et al. introduces the Fréchet Video Motion Distance (FVMD), a novel evaluation metric specifically designed to measure motion consistency in generated videos.

Background and Motivation

With the advent of advanced generative models, such as diffusion models, the capability to generate high-quality videos has markedly improved. However, the evaluation of these videos has predominantly relied on metrics like FID-VID, FVD, and VBench, which either overlook temporal coherence or fail to effectively capture complex motion patterns in dynamically generated content. For instance, while FVD utilizes an action recognition model to evaluate temporal coherence, it does not prioritize the intricate motion patterns central to tasks like motion-guided video generation. VBench, despite its comprehensive approach, tends to penalize videos with notable dynamic motion. This gap motivates the need for a dedicated metric that harmonizes visual fidelity and motion consistency.

Proposed Metric: Fréchet Video Motion Distance (FVMD)

The central contribution of the paper is the FVMD metric, which evaluates the motion consistency in videos by leveraging the Fréchet distance, applied to motion features derived from key point tracking.

Methodology

Motion Feature Extraction:

  • Key points in videos are tracked using the PIPs++ model, an advanced key point tracking approach accommodating occlusions and complex movements.
  • For each video frame, velocity and acceleration fields are computed to capture the changes in motion patterns. These fields offer a detailed representation of motion, encapsulating the physical properties of generated movements.
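The velocity and acceleration fields described above amount to first- and second-order temporal differences of the tracked key point positions. A minimal NumPy sketch is below; the trajectory tensor layout and the constant-velocity example are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def motion_fields(tracks):
    """Compute per-frame velocity and acceleration fields from key point
    trajectories.

    tracks: array of shape (T, N, 2) -- (x, y) positions of N tracked
    key points over T frames (e.g. as produced by a tracker like PIPs++).
    Returns (velocity, acceleration) of shapes (T-1, N, 2) and (T-2, N, 2).
    """
    tracks = np.asarray(tracks, dtype=float)
    velocity = np.diff(tracks, n=1, axis=0)        # first-order differences
    acceleration = np.diff(velocity, n=1, axis=0)  # second-order differences
    return velocity, acceleration

# Sanity check: a point moving at constant speed has zero acceleration.
t = np.arange(5, dtype=float)
tracks = np.stack([np.stack([t, 2 * t], axis=-1)], axis=1)  # shape (5, 1, 2)
vel, acc = motion_fields(tracks)
```

Differencing the trajectories rather than the raw pixels is what lets the metric isolate motion patterns from per-frame appearance.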

Statistical Representation:

  • The computed velocity and acceleration fields are aggregated into histograms. Two variants are considered: quantized 2D histograms and dense 1D histograms. The latter, inspired by the Histogram of Oriented Gradients (HOG) approach, bins the motion vectors by their magnitudes and angles.
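The magnitude-and-angle binning can be sketched as follows; the bin counts and the magnitude cap are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def motion_histogram(vectors, n_mag_bins=4, n_ang_bins=8, max_mag=10.0):
    """Quantize 2-D motion vectors into a dense 1-D histogram over
    (magnitude, angle) bins, loosely following HOG-style binning.

    vectors: array of shape (M, 2).
    Returns a flat, L1-normalized histogram of length n_mag_bins * n_ang_bins.
    """
    v = np.asarray(vectors, dtype=float)
    mag = np.linalg.norm(v, axis=-1)
    ang = np.arctan2(v[:, 1], v[:, 0])  # angle in (-pi, pi]
    # Map magnitudes and angles to integer bin indices, clipping overflow.
    mag_idx = np.clip((mag / max_mag * n_mag_bins).astype(int),
                      0, n_mag_bins - 1)
    ang_idx = np.clip(((ang + np.pi) / (2 * np.pi) * n_ang_bins).astype(int),
                      0, n_ang_bins - 1)
    hist = np.zeros((n_mag_bins, n_ang_bins))
    np.add.at(hist, (mag_idx, ang_idx), 1.0)  # accumulate counts per bin
    return (hist / max(hist.sum(), 1.0)).ravel()
```

Normalizing the counts makes histograms comparable across clips with different numbers of tracked points.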

Distance Calculation:

  • The similarity between generated videos and ground-truth videos is measured using the Fréchet distance applied to the extracted motion features. The use of statistical histograms ensures a robust comparison, considering the inherent variability in video content.
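Under the Gaussian assumption used by FID-style metrics, the Fréchet distance between two feature sets reduces to a closed form over their sample means and covariances. A minimal NumPy sketch (function names are ours, and this is the generic FID/FVD formula rather than the paper's exact code):

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh((a + a.T) / 2.0)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between two feature sets under a Gaussian fit:
        d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    Uses Tr((S1 S2)^{1/2}) = Tr((S2^{1/2} S1 S2^{1/2})^{1/2}) so that only
    symmetric PSD square roots are needed.
    feats_*: arrays of shape (num_samples, feature_dim).
    """
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    s2_half = _sqrtm_psd(s2)
    cross = _sqrtm_psd(s2_half @ s1 @ s2_half)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(s1 + s2) - 2.0 * np.trace(cross))
```

Identical feature distributions yield a distance near zero, while a pure mean shift contributes exactly its squared norm.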

Empirical Evaluation

The metric was subjected to rigorous validation, encompassing sensitivity analysis and alignment with human judgment.

Sensitivity Analysis:

The metric's capability to detect temporal inconsistencies was tested by injecting various types of noise (e.g., local swaps, global swaps) into real videos. FVMD demonstrated superior sensitivity in capturing these discrepancies, particularly when utilizing combined velocity and acceleration features with dense 1D histograms.
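As a rough illustration of these perturbations, the helpers below apply adjacent-pair ("local") and arbitrary-pair ("global") frame swaps; the swap counts and exact noise protocol in the paper may differ:

```python
import random

def local_swap(frames, n_swaps=5, seed=0):
    """Temporal noise: swap a few pairs of *adjacent* frames. Each frame
    stays pristine, so frame-level metrics barely move while motion-aware
    metrics should react."""
    rng = random.Random(seed)
    frames = list(frames)  # copy; the input list is left untouched
    for _ in range(n_swaps):
        i = rng.randrange(len(frames) - 1)
        frames[i], frames[i + 1] = frames[i + 1], frames[i]
    return frames

def global_swap(frames, n_swaps=5, seed=0):
    """Temporal noise: exchange frames drawn from anywhere in the clip,
    a more disruptive perturbation than a local swap."""
    rng = random.Random(seed)
    frames = list(frames)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(frames)), rng.randrange(len(frames))
        frames[i], frames[j] = frames[j], frames[i]
    return frames
```

Because both perturbations only reorder frames, any sensitivity a metric shows to them must come from temporal modeling rather than per-frame quality.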

Human Study:

A large-scale human study was conducted to compare FVMD with existing metrics. Over 200 raters evaluated videos generated by different models. The FVMD consistently exhibited higher correlation with human judgments compared to FID-VID, FVD, SSIM, PSNR, and VBench. This indicates its robustness in reflecting human-perceived video quality.
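Agreement with human judgments of this kind is typically quantified with a rank correlation; the sketch below implements Spearman's rho for untied scores as an illustration (the paper's exact correlation protocol is not detailed here):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no tied values, for simplicity."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks 0..n-1
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```

A metric whose scores rank models in the same order as the raters' mean ratings would score close to 1.0.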

Implications and Future Work

The introduction of FVMD has several practical and theoretical implications:

Practical Applications:

FVMD can be employed as a reliable metric for evaluating the quality of videos generated by diverse models. It offers a nuanced assessment of temporal coherence, which is pivotal for applications in entertainment, virtual reality, and video editing.

Theoretical Implications:

The research underscores the importance of motion consistency in video generation and encourages further exploration into physical laws embedded in motion patterns. Future work can aim to refine motion representations, ensuring that generated movements adhere to plausible physical dynamics.

Conclusion

The Fréchet Video Motion Distance presents a substantial advancement in the evaluation of video generative models. By focusing on the intricacies of motion consistency, FVMD fills a critical gap left by previous metrics, aligning more closely with human perception and enhancing the credibility of video quality assessments. Moving forward, integrating more sophisticated motion representations could further elevate the standards of generative video evaluation.
