Towards A Better Metric for Text-to-Video Generation

(2401.07781)
Published Jan 15, 2024 in cs.CV

Abstract

Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements, we present the TVGE dataset, which collects human judgments on 2,543 text-to-video generated videos along both criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.

Score distributions in TVGE show text-to-video generation often struggles with quality or text alignment.

Overview

  • The paper introduces a new metric called Text-to-Video Score (T2VScore) for evaluating machine-generated videos from textual descriptions.

  • T2VScore assesses the alignment of video content with its corresponding text prompt, using advanced language models for a nuanced evaluation.

  • The metric also measures the video's technical and structural quality to ensure a comprehensive evaluation of video fidelity.

  • A dataset named Text-to-Video Generation Evaluation (TVGE) is presented to refine and validate T2VScore against human judgment.

  • Experiments confirm that T2VScore correlates well with human evaluations, outperforming existing baseline metrics.

Overview of the Text-to-Video Score (T2VScore)

Evaluating machine-generated videos from textual descriptions remains a complex task. The Text-to-Video Score (T2VScore) seeks to refine the assessment process by focusing on text-video alignment and video quality.

Evaluating Text-Video Alignment

T2VScore emphasizes the importance of alignment between the content of a video and the initiating text prompt, evaluating how accurately the video reflects the prompt's description. This aspect, Text-Video Alignment, is one of the two core criteria addressed by T2VScore.

To assess text-video alignment, T2VScore decomposes the prompt into semantic elements and formulates questions about them, which are then answered by advanced language models. This question-answering approach enables a more detailed and nuanced evaluation, capturing temporal dynamics and specific elements that less granular metrics can overlook.
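
The paper's own pipeline is not reproduced here, but a minimal Python sketch of the question-answering idea might look as follows. The callables, their signatures, and the exact-match answer comparison are illustrative assumptions, not the authors' implementation:

```python
from typing import Callable

def alignment_score(
    video_frames,
    prompt: str,
    generate_qa: Callable[[str], list[dict]],
    answer_question: Callable[[object, str], str],
) -> float:
    """Score text-video alignment as the fraction of auto-generated
    questions about the prompt that the video answers correctly.

    generate_qa: wraps a text LLM that decomposes the prompt into semantic
        elements and returns [{"question": ..., "answer": ...}, ...].
    answer_question: wraps a video QA model that answers a question
        about the given video frames.
    Both callables are placeholders for whatever models you choose.
    """
    qa_pairs = generate_qa(prompt)
    if not qa_pairs:
        return 0.0
    correct = 0
    for qa in qa_pairs:
        predicted = answer_question(video_frames, qa["question"])
        # Exact string match for brevity; a real system would use softer
        # answer matching (e.g. an LLM judge or embedding similarity).
        correct += int(predicted.strip().lower() == qa["answer"].strip().lower())
    return correct / len(qa_pairs)
```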

Measuring Video Quality

The second core aspect is video quality, which extends beyond textual alignment to the structural and technical integrity of the video itself. The evaluation pipeline combines a technical expert, adept at detecting distortions and artifacts, with a semantic expert, focused on assessing content coherence. Their combined judgments yield a robust and nuanced quality score, informed by the varied aspects that contribute to overall video fidelity.
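
As a rough illustration, two such expert scores could be fused as below. The [0, 1] normalization and the simple weighted average are assumptions for this sketch, not the paper's actual mixture-of-experts design:

```python
def video_quality_score(
    technical_score: float,
    semantic_score: float,
    technical_weight: float = 0.5,
) -> float:
    """Fuse a technical-quality score (distortions, artifacts) with a
    semantic-quality score (content coherence).

    Both inputs are assumed to be pre-normalized to [0, 1]; the weighted
    average is an illustrative fusion rule only.
    """
    w = min(max(technical_weight, 0.0), 1.0)
    return w * technical_score + (1.0 - w) * semantic_score

# Example: a video with few artifacts but weaker semantic coherence.
print(video_quality_score(technical_score=0.9, semantic_score=0.6))  # 0.75
```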

The TVGE Dataset

To support the development and fine-tuning of T2VScore, the authors introduce the Text-to-Video Generation Evaluation (TVGE) dataset. This resource gathers an extensive array of human judgments on generated videos, offering a key benchmark that can aid in calibrating the T2VScore's effectiveness against human perception.
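
A hypothetical loader for per-video human ratings of this kind is sketched below; the file layout and field names ("video_id", "alignment", "quality") are assumptions, not the actual TVGE schema:

```python
import json
from statistics import mean

def load_human_scores(path: str) -> dict[str, dict[str, float]]:
    """Aggregate per-annotator ratings into mean alignment/quality scores
    per video: {video_id: {"alignment": ..., "quality": ...}}."""
    with open(path) as f:
        records = json.load(f)  # one record per (annotator, video) rating
    per_video: dict[str, dict[str, list[float]]] = {}
    for r in records:
        entry = per_video.setdefault(r["video_id"], {"alignment": [], "quality": []})
        entry["alignment"].append(float(r["alignment"]))
        entry["quality"].append(float(r["quality"]))
    return {
        vid: {"alignment": mean(v["alignment"]), "quality": mean(v["quality"])}
        for vid, v in per_video.items()
    }
```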

Verification through Experiments

Experiments utilizing the TVGE dataset demonstrate T2VScore's significant correlation with human judgment, outperforming baseline metrics. Its two components, focused on alignment and quality, each address distinct and essential dimensions of the generated content, confirming the need for a dual-pronged approach in accurate text-to-video generation evaluation.
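
Such metric-versus-human comparisons are typically reported with rank-correlation statistics. The generic SciPy sketch below shows how that check could be run; whether these are exactly the statistics used in the paper is not stated here:

```python
from scipy import stats

def correlation_with_humans(metric_scores, human_scores) -> dict[str, float]:
    """Rank correlations between a metric's scores and mean human ratings
    over the same set of videos."""
    spearman = stats.spearmanr(metric_scores, human_scores).correlation
    kendall = stats.kendalltau(metric_scores, human_scores).correlation
    return {"spearman": spearman, "kendall": kendall}

# Example with toy scores for five videos.
print(correlation_with_humans([0.2, 0.5, 0.4, 0.9, 0.7],
                              [1.5, 3.0, 2.5, 4.5, 4.0]))
```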

The T2VScore provides a comprehensive metric for the evaluation of text-to-video generation, offering a more refined tool for developers and researchers to gauge the quality and relevance of generated video content.
