Abstract

The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, along with other text-to-video diffusion models, is highly reliant on prompts, and there is no publicly available dataset that features a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 Million unique text-to-Video Prompts from real users. Additionally, this dataset includes 6.69 million videos generated by four state-of-the-art diffusion models, alongside some related data. We initially discuss the curation of this large-scale dataset, a process that is both time-consuming and costly. Subsequently, we underscore the need for a new prompt dataset specifically designed for text-to-video generation by illustrating how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Our extensive and diverse dataset also opens up many exciting new research areas. For instance, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models to develop better, more efficient, and safer models. The project (including the collected dataset VidProM and related code) is publicly available at https://vidprom.github.io under the CC-BY-NC 4.0 License.

Dataset with 1.67M text-to-video prompts, 6.69M videos, enabling research in video generation and detection.

Overview

  • VidProM is the first dataset to pair 1.67 million unique text-to-video prompts with 6.69 million videos generated by four state-of-the-art text-to-video diffusion models.

  • Accessible through GitHub and Hugging Face with a CC-BY-NC 4.0 License, VidProM is aimed at advancing research in prompt engineering, video generation strategies, and fake video detection methods.

  • Compared to DiffusionDB, VidProM uses more advanced prompt embeddings, exhibits greater semantic uniqueness, and focuses exclusively on video content, underscoring its distinct contribution to text-to-video research.

  • VidProM opens new research avenues in refining text-to-video diffusion models, innovating in prompt engineering, improving the efficiency of video generation, and pioneering fake video and video copy detection technologies.

Unveiling VidProM: A Pioneering Dataset for Text-to-Video Diffusion Models

Introduction to VidProM and its Uniqueness

VidProM is the first dataset to provide a comprehensive library of 1.67 million unique text-to-video prompts, accompanied by 6.69 million videos generated by four cutting-edge text-to-video diffusion models. It gives researchers a robust foundation for exploring text-to-video prompt engineering, efficient video generation strategies, and improved methods for fake video and video copy detection. Accessible through GitHub and Hugging Face under the CC-BY-NC 4.0 License, VidProM promises to catalyze the development of more sophisticated, efficient, and safer text-to-video diffusion models.

Dataset Curation and Content Analysis

Assembling VidProM was a meticulous process: prompts were extracted from official Pika Discord channels, embedded with OpenAI's text-embedding-3-large model, scored for NSFW probabilities, and paired with videos generated by state-of-the-art diffusion models. A notable subset, VidProS, ensures semantic diversity by limiting the cosine similarity among prompts. Each data point in VidProM consists of a prompt, a UUID, a timestamp, NSFW probabilities, a 3072-dimensional prompt embedding, and four generated videos, providing a comprehensive schema for downstream research.
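
The VidProS subset is obtained by constraining the pairwise cosine similarity between prompt embeddings. The sketch below shows one plausible way to build such a subset from the released 3072-dimensional embeddings; the greedy selection strategy and the 0.8 similarity threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_semantically_unique(embeddings: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Greedily keep a prompt only if its cosine similarity to every
    previously kept prompt stays below `threshold`.

    `embeddings` is an (N, 3072) array of text-embedding-3-large vectors,
    as shipped with VidProM; the threshold value and the greedy strategy
    are illustrative assumptions rather than the paper's exact procedure.
    """
    # Normalize rows so cosine similarity reduces to a dot product.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(normed.shape[0]):
        if kept:
            sims = normed[kept] @ normed[i]  # similarity to every kept prompt
            if sims.max() >= threshold:
                continue
        kept.append(i)
    return kept

# Example with random stand-in vectors (the real embeddings come with the dataset).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1_000, 3072)).astype(np.float32)
unique_idx = select_semantically_unique(fake_embeddings)
print(f"kept {len(unique_idx)} of {len(fake_embeddings)} prompts")
```

Because each candidate is compared against all previously kept prompts, every retained pair is guaranteed to stay below the similarity threshold, which is the semantic-diversity property the summary attributes to VidProS.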

Comparative Insights with DiffusionDB

A critical comparison reveals the distinct advantages of VidProM over DiffusionDB, particularly in semantic uniqueness, the sophistication of its prompt embeddings, and the breadth of its data collection. VidProM's focus on video content, as opposed to the static images in DiffusionDB, underscores its additional utility and complexity and makes it an indispensable complement for text-to-video research.

Dissecting User Preferences and Prompts

Intriguingly, analysis of VidProM sheds light on the themes and subjects users most often request in video generation, such as modern aesthetics, motion dynamics, and subjects ranging from humans to fantastical creatures. This analysis not only deepens understanding of user tendencies but also opens avenues for tailoring model development to those preferences.
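
As a rough illustration of how such theme statistics could be derived from the prompt field, the snippet below counts frequent non-stopword terms over a few hypothetical prompts; the sample prompts, stopword list, and simple tokenization are assumptions, and the paper's own analysis may rely on more sophisticated tooling.

```python
import re
from collections import Counter

# Hypothetical sample prompts standing in for VidProM's prompt field.
prompts = [
    "a futuristic city at night, cinematic lighting, slow camera pan",
    "a dragon flying over snowy mountains, epic motion",
    "portrait of a woman walking through a modern art gallery",
]

STOPWORDS = {"a", "an", "the", "of", "at", "over", "through", "and", "in"}

def top_terms(texts, k=10):
    """Count the most frequent non-stopword tokens across the prompts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(k)

print(top_terms(prompts))
```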

Forging New Research Frontiers

VidProM's introduction heralds multifaceted research prospects, from refining text-to-video diffusion models, innovating in prompt engineering, and improving video generation efficiency to pioneering fake video detection and video copy protection. In addition, its potential to underpin copyright-resilient multimodal learning signals a versatility that extends well beyond immediate diffusion model advancements.

Conclusion and Future Directions

VidProM is a landmark resource for the AI research community, particularly for advancing text-to-video generation technologies. Its wide-ranging implications, together with the identified need for expanded datasets of this kind, possibly incorporating next-generation models such as Sora, underscore the dataset's foundational role in shaping the future trajectory of AI-driven video content generation.
