VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models (2403.06098v4)

Published 10 Mar 2024 in cs.CV and cs.CL

Abstract: The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, along with other text-to-video diffusion models, is highly reliant on prompts, and there is no publicly available dataset that features a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 Million unique text-to-Video Prompts from real users. Additionally, this dataset includes 6.69 million videos generated by four state-of-the-art diffusion models, alongside some related data. We initially discuss the curation of this large-scale dataset, a process that is both time-consuming and costly. Subsequently, we underscore the need for a new prompt dataset specifically designed for text-to-video generation by illustrating how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Our extensive and diverse dataset also opens up many exciting new research areas. For instance, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models to develop better, more efficient, and safer models. The project (including the collected dataset VidProM and related code) is publicly available at https://vidprom.github.io under the CC-BY-NC 4.0 License.


Summary

  • The paper introduces VidProM, a novel dataset with 1.67M prompts and 6.69M videos to advance text-to-video diffusion research.
  • It details a robust curation process using text embeddings, NSFW evaluations, and semantic filtering to ensure high-quality data.
  • The dataset supports diverse research areas, from prompt engineering and efficient video generation to safeguards such as fake video and video copy detection.

Unveiling VidProM: A Pioneering Dataset for Text-to-Video Diffusion Models

Introduction to VidProM and Its Uniqueness

VidProM is the first dataset to provide a large-scale library of 1.67 million unique text-to-video prompts, accompanied by 6.69 million videos generated with four state-of-the-art text-to-video diffusion models. It gives researchers a foundation for exploring text-to-video prompt engineering, efficient video generation strategies, and improved methods for fake video and video copy detection. The dataset is available through GitHub and Hugging Face under the CC-BY-NC 4.0 License and is intended to support the development of more capable, efficient, and safer text-to-video diffusion models.
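For readers who want to inspect the prompts directly, the following is a minimal sketch of loading the prompt table with the Hugging Face `datasets` library. The repository ID, split name, and column names are assumptions inferred from the paper's description of each record (prompt, UUID, timestamp, NSFW probabilities, embedding); consult https://vidprom.github.io for the authoritative layout.

```python
# Minimal sketch: loading the VidProM prompt table from Hugging Face.
# The repo ID, split, and field names below are assumptions, not confirmed by the paper.
from datasets import load_dataset

dataset = load_dataset("WenhaoWang/VidProM")   # assumed repository ID
record = dataset["train"][0]                   # assumed split name

print(record["prompt"])  # the user-written text-to-video prompt
print(record["uuid"])    # identifier linking the prompt to its generated videos
# Each record also carries a timestamp, NSFW probabilities, and a
# 3072-dimensional embedding produced by text-embedding-3-large.
```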

Dataset Curation and Content Analysis

The assembly of VidProM involved a multi-step process: extracting prompts from the official Pika Discord channels, embedding each prompt with OpenAI's text-embedding-3-large model, estimating NSFW probabilities, and generating videos with four state-of-the-art diffusion models. A notable subset, VidProS, enforces semantic diversity by keeping the cosine similarity between any two prompt embeddings below a threshold; a sketch of this kind of filtering follows. Each data point in VidProM consists of a prompt, a UUID, a timestamp, NSFW probabilities, a 3072-dimensional prompt embedding, and four generated videos (one per model), giving a comprehensive schema for research use.
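Below is a minimal sketch of such semantic filtering: greedily keep a prompt only if its embedding's cosine similarity to every previously kept prompt stays below a threshold. The greedy strategy and the threshold value are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative greedy semantic deduplication over prompt embeddings.
# The threshold (0.8) and the greedy order are assumptions for demonstration.
import numpy as np

def semantic_filter(embeddings: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Return indices of prompts whose embeddings are mutually dissimilar."""
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Usage with random stand-ins for the 3072-dimensional prompt embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1_000, 3072))
unique_idx = semantic_filter(fake_embeddings)
print(f"kept {len(unique_idx)} of {len(fake_embeddings)} prompts")
```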

Comparative Insights with DiffusionDB

A comparison with DiffusionDB highlights VidProM's distinct advantages: the semantic uniqueness of its prompts, a more advanced embedding model, and broader data collection. Most importantly, VidProM is dedicated to video content rather than the static images of DiffusionDB, which makes the data more complex and, the authors argue, makes a prompt dataset built specifically for text-to-video research indispensable.

Dissecting User Preferences and Prompts

The analysis of VidProM also sheds light on the themes and subjects users favor in video generation requests, such as modern aesthetics, motion dynamics, and subjects ranging from humans to fantastical entities. This analysis both deepens understanding of user tendencies and opens avenues for more tailored model development.

Forging New Research Frontiers

VidProM opens up several research directions: refining text-to-video diffusion models, text-to-video prompt engineering, more efficient video generation, and fake video detection and video copy detection for diffusion models. Its potential to support copyright-resilient multimodal learning further extends its applicability beyond immediate diffusion model advances.

Conclusion and Future Directions

VidProM is a landmark resource for the AI research community, particularly for advancing text-to-video generation technologies. The authors also point to the need for expanded datasets of this kind, possibly incorporating next-generation models such as Sora, which underscores the dataset's foundational role in shaping the future of AI-driven video content generation.
