VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models (2403.06098v4)

Published 10 Mar 2024 in cs.CV and cs.CL

Abstract: The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, along with other text-to-video diffusion models, is highly reliant on prompts, and there is no publicly available dataset that features a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 Million unique text-to-Video Prompts from real users. Additionally, this dataset includes 6.69 million videos generated by four state-of-the-art diffusion models, alongside some related data. We initially discuss the curation of this large-scale dataset, a process that is both time-consuming and costly. Subsequently, we underscore the need for a new prompt dataset specifically designed for text-to-video generation by illustrating how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Our extensive and diverse dataset also opens up many exciting new research areas. For instance, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models to develop better, more efficient, and safer models. The project (including the collected dataset VidProM and related code) is publicly available at https://vidprom.github.io under the CC-BY-NC 4.0 License.


Summary

  • The paper introduces VidProM, a novel dataset with 1.67M prompts and 6.69M videos to advance text-to-video diffusion research.
  • It details a robust curation process using text embeddings, NSFW evaluations, and semantic filtering to ensure high-quality data.
  • The dataset supports diverse research areas, from prompt engineering and efficient video generation to safeguards such as fake video and video copy detection.

Unveiling VidProM: A Pioneering Dataset for Text-to-Video Diffusion Models

Introduction to VidProM and Its Uniqueness

VidProM is the first dataset to provide a large-scale library of 1.67 million unique text-to-video prompts, accompanied by 6.69 million videos generated with four state-of-the-art text-to-video diffusion models. It gives researchers a foundation for exploring text-to-video prompt engineering, efficient video generation strategies, and improved methods for fake video and video copy detection. The dataset is available through GitHub and Hugging Face under the CC-BY-NC 4.0 License and is intended to support the development of more capable, efficient, and safer text-to-video diffusion models.
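For readers who want to inspect the prompts directly, the following is a minimal sketch of loading the prompt table with the Hugging Face `datasets` library. The repository ID, split name, and column names are assumptions inferred from the paper's description of each record (prompt, UUID, timestamp, NSFW probabilities, embedding); consult https://vidprom.github.io for the authoritative layout.

```python
# Minimal sketch: loading the VidProM prompt table from Hugging Face.
# The repo ID, split, and field names below are assumptions, not confirmed by the paper.
from datasets import load_dataset

dataset = load_dataset("WenhaoWang/VidProM")   # assumed repository ID
record = dataset["train"][0]                   # assumed split name

print(record["prompt"])  # the user-written text-to-video prompt
print(record["uuid"])    # identifier linking the prompt to its generated videos
# Each record also carries a timestamp, NSFW probabilities, and a
# 3072-dimensional embedding produced by text-embedding-3-large.
```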

Dataset Curation and Content Analysis

The assembly of VidProM involved a multi-step process: extracting prompts from the official Pika Discord channels, embedding each prompt with OpenAI's text-embedding-3-large model, estimating NSFW probabilities, and generating videos with four state-of-the-art diffusion models. A notable subset, VidProS, enforces semantic diversity by keeping the cosine similarity between any two prompt embeddings below a threshold; a sketch of this kind of filtering follows. Each data point in VidProM consists of a prompt, a UUID, a timestamp, NSFW probabilities, a 3072-dimensional prompt embedding, and four generated videos (one per model), giving a comprehensive schema for research use.
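Below is a minimal sketch of such semantic filtering: greedily keep a prompt only if its embedding's cosine similarity to every previously kept prompt stays below a threshold. The greedy strategy and the threshold value are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative greedy semantic deduplication over prompt embeddings.
# The threshold (0.8) and the greedy order are assumptions for demonstration.
import numpy as np

def semantic_filter(embeddings: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Return indices of prompts whose embeddings are mutually dissimilar."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Usage with random stand-ins for the 3072-dimensional prompt embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1_000, 3072))
unique_idx = semantic_filter(fake_embeddings)
print(f"kept {len(unique_idx)} of {len(fake_embeddings)} prompts")
```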

Comparative Insights with DiffusionDB

A comparison with DiffusionDB highlights VidProM's distinct advantages: the semantic uniqueness of its prompts, a more advanced embedding model, and broader data collection. Most importantly, VidProM is dedicated to video content rather than the static images of DiffusionDB, which makes the data more complex and, the authors argue, makes a prompt dataset built specifically for text-to-video research indispensable.

Dissecting User Preferences and Prompts

The analysis of VidProM also sheds light on the themes and subjects users favor in video generation requests, such as modern aesthetics, motion dynamics, and subjects ranging from humans to fantastical entities. This analysis both deepens understanding of user tendencies and opens avenues for more tailored model development.

Forging New Research Frontiers

VidProM opens up several research directions: refining text-to-video diffusion models, text-to-video prompt engineering, more efficient video generation, and fake video detection and video copy detection for diffusion models. Its potential to support copyright-resilient multimodal learning further extends its applicability beyond immediate diffusion model advances.

Conclusion and Future Directions

VidProM is a landmark resource for the AI research community, particularly for advancing text-to-video generation technologies. The authors also point to the need for expanded datasets of this kind, possibly incorporating next-generation models such as Sora, which underscores the dataset's foundational role in shaping the future of AI-driven video content generation.
