A Recipe for Scaling up Text-to-Video Generation with Text-free Videos (2312.15770v1)
Abstract: Diffusion-based text-to-video generation has witnessed impressive progress over the past year, yet it still falls behind text-to-image generation. One key reason is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), given the high cost of video captioning. In contrast, it is far easier to collect unlabeled clips from video platforms such as YouTube. Motivated by this, we propose a novel text-to-video generation framework, termed TF-T2V, which can learn directly from text-free videos. The rationale behind this design is to decouple text decoding from temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with shared weights. Following this pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with randomly collected text-free videos and observe a clear performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that the model achieves further gains (FID from 8.19 to 7.64 and FVD from 441 to 366) after some text labels are reintroduced for training. Finally, we validate the effectiveness and generalizability of our approach on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available at https://tf-t2v.github.io/.
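To make the content/motion separation concrete, below is a minimal PyTorch-style sketch of how such joint training could look; it is not the authors' released code. The `denoiser` interface (noisy latents, timestep, text embedding), the toy cosine noise schedule, and the `null_emb` placeholder for text-free clips are illustrative assumptions, since the abstract does not specify these details.

```python
# Minimal sketch of TF-T2V-style joint training (illustrative, not the
# authors' code). Assumptions: `denoiser` is a shared spatio-temporal UNet;
# images are treated as single-frame videos; the motion branch replaces
# captions with a null embedding so temporal layers train on text-free clips.
import torch
import torch.nn.functional as F


def diffusion_loss(denoiser, x0, text_emb, num_steps=1000):
    """Standard epsilon-prediction DDPM loss on latents x0 of shape
    (B, C, T, H, W); T = 1 for still images. The cosine schedule here is
    a toy stand-in for whatever schedule the real model uses."""
    b = x0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2
    a = alpha_bar.view(b, 1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    return F.mse_loss(denoiser(x_t, t, text_emb), noise)


def train_step(denoiser, optimizer, images, caption_emb, videos, null_emb,
               motion_weight=1.0):
    # Content branch: text-conditioned image-text pairs, viewed as T = 1
    # clips, supervise appearance and text alignment.
    loss_content = diffusion_loss(denoiser, images.unsqueeze(2), caption_emb)
    # Motion branch: text-free video clips conditioned on a null embedding,
    # so the shared denoiser's temporal layers learn dynamics without captions.
    loss_motion = diffusion_loss(denoiser, videos, null_emb)
    loss = loss_content + motion_weight * loss_motion
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because both branches step the same `denoiser`, text supervision from images shapes the spatial layers while unlabeled clips shape the temporal layers; this is one way to read the abstract's "jointly optimized with shared weights".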
- Latent-shift: Latent diffusion with temporal shift for efficient text-to-video generation. arXiv preprint arXiv:2304.08477, 2023.
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.
- Conditional GAN with discriminative filter generation for text-to-video synthesis. In IJCAI, page 2, 2019.
- Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023.
- Zeroscope: Diffusion-based text-to-video synthesis. https://huggingface.co/cerspense/zeroscope_v2_576w, 2023.
- Pix2video: Video editing using image diffusion. In ICCV, pages 23206–23217, 2023.
- Stablevideo: Text-driven consistency-aware diffusion video editing. In ICCV, pages 23040–23050, 2023.
- Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023a.
- Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023b.
- Flownet: Learning optical flow with convolutional networks. In ICCV, pages 2758–2766, 2015.
- Taming Transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021.
- Structure and content-guided video synthesis with diffusion models. In ICCV, pages 7346–7356, 2023.
- Scenescape: Text-driven consistent scene generation. arXiv preprint arXiv:2302.01133, 2023.
- Preserve your own correlation: A noise prior for video diffusion models. In ICCV, pages 22930–22941, 2023.
- Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
- Generative adversarial nets. NeurIPS, 27, 2014.
- Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
- Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
- Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- Cogvideo: Large-scale pretraining for text-to-video generation via Transformers. In ICLR, 2023.
- Free-bloom: Zero-shot text-to-video generator with llm director and ldm animator. arXiv preprint arXiv:2309.14494, 2023a.
- Composer: Creative and controllable image synthesis with composable conditions. In ICML, 2023b.
- Towards understanding action recognition. In ICCV, pages 3192–3199, 2013.
- Scaling up GANs for text-to-image synthesis. In CVPR, pages 10124–10134, 2023.
- Imagic: Text-based real image editing with diffusion models. In CVPR, pages 6007–6017, 2023.
- Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
- Gd-vdm: Generated depth for better diffusion-based video generation. arXiv preprint arXiv:2306.11173, 2023.
- Action-aware embedding enhancement for image-text retrieval. In AAAI, pages 1323–1331, 2022.
- Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761, 2023.
- Vdt: An empirical study on video diffusion with Transformers. arXiv preprint arXiv:2305.13311, 2023.
- Videofusion: Decomposed diffusion models for high-quality video generation. In CVPR, pages 10209–10218, 2023.
- Dreamtalk: When expressive talking head generation meets diffusion probabilistic models. arXiv preprint arXiv:2312.09767, 2023.
- Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329, 2023.
- T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
- Conditional image-to-video generation with latent flow diffusion models. In CVPR, pages 18444–18455, 2023.
- Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, pages 16784–16804. PMLR, 2022.
- Fatezero: Fusing attentions for zero-shot text-based video editing. In ICCV, 2023.
- Hierarchical spatio-temporal decoupling for text-to-video generation. arXiv preprint arXiv:2312.04483, 2023.
- Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
- Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
- Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022.
- Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
- Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35:25278–25294, 2022.
- Closed-form factorization of latent semantics in GANs. In CVPR, pages 1532–1540, 2021.
- Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023.
- StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2. In CVPR, pages 3626–3636, 2022.
- Denoising diffusion implicit models. In ICLR, 2021.
- MocoGAN: Decomposing motion and content for video generation. In CVPR, pages 1526–1535, 2018.
- Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
- Tdn: Temporal difference networks for efficient action recognition. In CVPR, pages 1895–1904, 2021.
- Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023b.
- Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023c.
- Videocomposer: Compositional video synthesis with motion controllability. NeurIPS, 2023d.
- Molo: Motion-augmented long-short contrastive learning for few-shot action recognition. In CVPR, pages 18011–18021, 2023e.
- Videolcm: Video latent consistency model. arXiv preprint arXiv:2312.09109, 2023f.
- G3an: Disentangling appearance and motion for video generation. In CVPR, pages 5264–5273, 2020.
- Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023g.
- Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023h.
- Styleinv: A temporal style modulated inversion network for unconditional video generation. In ICCV, pages 22851–22861, 2023i.
- Dreamvideo: Composing your dream videos with customized subject and motion. arXiv preprint arXiv:2312.04433, 2023.
- Nüwa: Visual synthesis pre-training for neural visual world creation. In ECCV, pages 720–736, 2022.
- Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In ICCV, pages 7623–7633, 2023.
- Make-your-video: Customized video generation using textual and structural guidance. arXiv preprint arXiv:2306.00943, 2023a.
- Simda: Simple diffusion adapter for efficient video generation. arXiv preprint arXiv:2308.09710, 2023b.
- Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016.
- Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, 2022.
- Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
- Magvit: Masked generative video Transformer. In CVPR, pages 10459–10469, 2023a.
- Video probabilistic diffusion models in projected latent space. In CVPR, pages 18456–18466, 2023b.
- Instructvideo: Instructing video diffusion models with human feedback. arXiv preprint arXiv:2312.12490, 2023.
- Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
- Adding conditional control to text-to-image diffusion models. In ICCV, pages 3836–3847, 2023b.
- I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023c.
- Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023d.
- Slow feature analysis for human action recognition. TPAMI, 34(3):436–450, 2012.
- Controlvideo: Adding conditional control for one shot text-to-video editing. arXiv preprint arXiv:2305.17098, 2023.
- Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
- Synthesizing videos from images for image-to-video adaptation. In ACMMM, pages 8294–8303, 2023.