Abstract

Text-to-video (T2V) generation models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains unexplored. Previous text-to-video benchmarks also neglect this important ability in their evaluations. In this work, we conduct the first systematic study on compositional text-to-video generation. We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation. T2V-CompBench encompasses diverse aspects of compositionality, including consistent attribute binding, dynamic attribute binding, spatial relationships, motion binding, action binding, object interactions, and generative numeracy. We further carefully design evaluation metrics, comprising MLLM-based, detection-based, and tracking-based metrics, which can better reflect the compositional text-to-video generation quality across the seven proposed categories with 700 text prompts. The effectiveness of the proposed metrics is verified by correlation with human evaluations. We also benchmark various text-to-video generative models and conduct an in-depth analysis across different models and compositional categories. We find that compositional text-to-video generation is highly challenging for current models, and we hope that our attempt will shed light on future research in this direction.

Overview

  • The paper 'T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation' introduces T2V-CompBench, a benchmark designed to evaluate the compositional abilities of text-to-video generation models.

  • The authors propose new evaluation metrics using Multimodal LLMs (MLLMs), object detection, and tracking models to address the limitations of traditional metrics in assessing compositional tasks.

  • The study evaluates 20 T2V models and highlights significant challenges and gaps in existing models' abilities to handle compositional prompts, particularly in dynamic attribute binding and generative numeracy.

An Overview of T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation

The paper "T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-Video Generation" presents a novel evaluation framework explicitly designed to address the intricacies of compositional text-to-video (T2V) generation. This work fills a notable gap in the landscape of video generation research by constructing a benchmark that emphasizes compositionality, a dimension largely overlooked by existing benchmarks which usually focus on simpler aspects of video generation.

Key Contributions

  1. Benchmark Construction: The paper introduces T2V-CompBench, which rigorously tests the compositional abilities of T2V models across seven categories: consistent attribute binding, dynamic attribute binding, spatial relationships, action binding, motion binding, object interactions, and generative numeracy. Each category comprises 100 video generation prompts created using GPT-4, ensuring coverage of diverse and challenging scenarios.
  2. Evaluation Metrics: Recognizing the inadequacies of traditional metrics such as Inception Score (IS) and Fréchet Video Distance (FVD) in compositional contexts, the authors propose three families of specialized metrics (sketched in code after this list):
  • MLLM-Based Metrics: Utilizing Multimodal LLMs (MLLMs) for nuanced understanding and scoring of dynamic attribute binding, consistent attribute binding, and action binding.
  • Detection-Based Metrics: Leveraging object detection models to evaluate spatial relationships and generative numeracy.
  • Tracking-Based Metrics: Utilizing tracking models to assess motion binding, focusing on the differentiation between object and camera motion.

  3. Extensive Benchmarking: The study evaluates 20 T2V models, including 13 open-source and 7 commercial models. The results reveal that current models struggle significantly with compositional prompts, highlighting the need for further advances in T2V generation.
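
To make these metric families concrete, the sketches below illustrate each in turn. First, an MLLM-based score: this is a minimal sketch in which `query_mllm` is a hypothetical placeholder for whatever multimodal LLM is used, and the 1-to-5 rating prompt and rescaling are illustrative choices, not the paper's exact procedure.

```python
# Hypothetical sketch of an MLLM-based alignment score.
# `query_mllm` is a placeholder, not an API from the paper.

def query_mllm(frames, question: str) -> str:
    """Placeholder: send sampled video frames plus a question to a
    multimodal LLM and return its text answer."""
    raise NotImplementedError  # wire up an MLLM of your choice here

def mllm_alignment_score(frames, prompt: str) -> float:
    """Ask the MLLM to rate frame-prompt alignment from 1 to 5,
    then rescale the rating to [0, 1]."""
    question = (
        "On a scale of 1 to 5, how well do these video frames depict: "
        f"'{prompt}'? Answer with a single digit."
    )
    answer = query_mllm(frames, question)
    digits = [ch for ch in answer if ch.isdigit()]
    rating = int(digits[0]) if digits else 1
    return (rating - 1) / 4.0
```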
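
Next, the detection-based metrics for spatial relationships and generative numeracy. The sketch assumes per-frame detections are precomputed as (label, box) pairs by an off-the-shelf detector; the left-of test and the exact-count test are simplified stand-ins for the paper's actual scoring rules.

```python
# Minimal sketch of detection-based checks for spatial relationships
# and generative numeracy, operating on precomputed detections.

from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Detection = Tuple[str, Box]               # (class label, bounding box)

def center_x(box: Box) -> float:
    """Horizontal center of a bounding box."""
    return (box[0] + box[2]) / 2.0

def left_of_score(frames: List[List[Detection]], a: str, b: str) -> float:
    """Fraction of frames in which object `a` appears left of object `b`."""
    hits = total = 0
    for dets in frames:
        xs_a = [center_x(box) for lbl, box in dets if lbl == a]
        xs_b = [center_x(box) for lbl, box in dets if lbl == b]
        if xs_a and xs_b:
            total += 1
            hits += min(xs_a) < min(xs_b)
    return hits / total if total else 0.0

def numeracy_score(frames: List[List[Detection]], cls: str, n: int) -> float:
    """Fraction of frames containing exactly `n` instances of `cls`."""
    counts = [sum(lbl == cls for lbl, _ in dets) for dets in frames]
    return sum(c == n for c in counts) / len(frames)
```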
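
Finally, the tracking-based motion-binding metric, whose central idea is separating object motion from camera motion. A minimal sketch, assuming a point tracker has already produced foreground and background trajectories: subtracting the mean background displacement serves as a simple stand-in for camera-motion compensation, and the direction check is an illustrative simplification.

```python
import numpy as np

def motion_direction_score(fg_tracks: np.ndarray,
                           bg_tracks: np.ndarray,
                           expected: str) -> float:
    """Check whether the foreground object's motion, after removing the
    mean background (camera) motion, matches the expected direction.

    fg_tracks, bg_tracks: arrays of shape (num_points, num_frames, 2)
    holding (x, y) positions of tracked points over time.
    """
    fg_disp = fg_tracks[:, -1] - fg_tracks[:, 0]   # per-point displacement
    bg_disp = bg_tracks[:, -1] - bg_tracks[:, 0]
    dx, dy = fg_disp.mean(axis=0) - bg_disp.mean(axis=0)
    checks = {
        "left":  dx < 0,
        "right": dx > 0,
        "up":    dy < 0,   # image coordinates: y grows downward
        "down":  dy > 0,
    }
    return float(checks.get(expected, False))
```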

Notable Findings

The research finds that commercial models generally outperform open-source ones across compositional categories, with certain models such as Dreamina and Gen-2 showing relatively better performance. However, none of the models consistently excel across all categories, underscoring the complexity and difficulty of compositional T2V generation. Particularly challenging categories include dynamic attribute binding and generative numeracy, where models often fail to accurately capture temporal changes or object quantities.

Implications and Future Directions

Practical Implications

The introduction of T2V-CompBench provides a comprehensive and rigorous framework for evaluating compositional T2V models, facilitating benchmarking and guiding research in developing more sophisticated generative models. The diverse categories ensure that models are evaluated on a variety of scenarios, pushing the boundaries of current capabilities in video generation.

Theoretical Implications

The findings suggest fundamental limitations in existing T2V models, especially in handling complex, dynamic, and multi-object scenes. This calls for a deeper integration of temporal and spatial understanding within generative frameworks and might necessitate novel architectures that can better grasp and generate compositional content.

Speculation on Future Developments

Future developments in AI for T2V generation may include:

  • Advanced Temporal Models: Enhanced temporal modeling to capture dynamic attribute changes with greater fidelity.
  • Multimodal Reasoning: Improved multimodal reasoning abilities in models, enabling better understanding and generation of compositional relationships.
  • Integrative Frameworks: Development of unified frameworks that can simultaneously address spatial, temporal, and relational aspects in video generation.

Given the findings and the comprehensive nature of the proposed benchmark, T2V-CompBench is poised to be a critical tool for driving the next generation of improvements in text-to-video generative models. The benchmark's impact will likely extend beyond mere evaluation, influencing the design and training of future models in this evolving field.
