StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

(2312.00330)
Published Dec 1, 2023 in cs.CV and cs.AI

Abstract

Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image. Considering the scarcity of stylized video datasets, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy. Additionally, we design a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors.

Overview

  • StyleCrafter offers a novel method for creating stylized videos using text prompts enhanced by reference images for style guidance.

  • The system integrates a style control adapter into existing text-to-video models to manage style fidelity issues.

  • StyleCrafter uses a tailored finetuning approach in which style is learned from images and then transferred to text-to-video generation.

  • A scale-adaptive fusion module is employed to balance text content and image style, allowing for various combinations.

  • The method overcomes some limitations of adapted text-to-image techniques and offers flexibility, but it depends on the base T2V model and may struggle with highly intricate or stylized semantics.

The paper introduces StyleCrafter, an innovative method for generating stylized videos from text prompts. While text-to-video (T2V) models are becoming increasingly capable, they face challenges in producing videos that match specific styles conveyed by text alone. These challenges include difficulty in conveying style nuances through text and degraded style fidelity in the generated videos.

StyleCrafter addresses these issues by incorporating a style control adapter into pre-trained T2V models, allowing users to guide video generation using a reference image instead of relying solely on textual descriptions. The adapter is initially trained using image datasets rich in style variations, and this training is later transferred to video through a unique finetuning approach.
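To make the adapter idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: it assumes the style adapter adds a trainable cross-attention branch over style tokens extracted from the reference image, operating alongside the (frozen) text cross-attention of the pre-trained T2V model. The class name `StyleAdapterBlock`, the tensor shapes, and the `style_scale` argument are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleAdapterBlock(nn.Module):
    """Hypothetical sketch: inject reference-image style features into a
    pre-trained T2V denoising block via an extra cross-attention branch."""

    def __init__(self, dim: int, style_dim: int, num_heads: int = 8):
        super().__init__()
        # Placeholder for the frozen text cross-attention of the base T2V model.
        self.text_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # New, trainable cross-attention over style tokens extracted from the
        # reference image (e.g., by a learnable-query style extractor).
        self.style_cross_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=style_dim, vdim=style_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, text_tokens, style_tokens, style_scale: float = 1.0):
        # x: latent features (B, N, dim); text_tokens: (B, T, dim);
        # style_tokens: (B, S, style_dim) from the reference image.
        content, _ = self.text_cross_attn(x, text_tokens, text_tokens)
        style, _ = self.style_cross_attn(x, style_tokens, style_tokens)
        # Residual update: content comes from the text, style from the image.
        return x + self.norm(content + style_scale * style)
```

Under this view, only the style branch needs training on style-rich image data; the base model's weights stay fixed, which is what allows the learned stylization to be transferred to video with a relatively light finetuning stage.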

To bolster the effectiveness of the style adaptation, the text prompts from which the videos are generated omit style descriptions, with the style information instead extracted exclusively from the provided reference image. A learning strategy that focuses on decoupling style and content plays a crucial role here. The method also includes a scale-adaptive fusion module to adjust the influence of text-based content features and image-based style features, accommodating different text and style combinations.
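The sketch below illustrates one plausible form of such a scale-adaptive fusion module in PyTorch. It assumes the module predicts a per-sample scale from pooled content and style features and uses it to weight the style contribution; the name `ScaleAdaptiveFusion` and the small-MLP design are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ScaleAdaptiveFusion(nn.Module):
    """Hypothetical sketch: balance text-based content features against
    image-based style features with a learned, per-sample scale."""

    def __init__(self, dim: int):
        super().__init__()
        # Small MLP mapping pooled content/style features to a scalar in (0, 1).
        self.scale_net = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.SiLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, content_feat: torch.Tensor, style_feat: torch.Tensor):
        # content_feat, style_feat: (B, N, dim) attention outputs.
        pooled = torch.cat(
            [content_feat.mean(dim=1), style_feat.mean(dim=1)], dim=-1
        )
        scale = self.scale_net(pooled).unsqueeze(1)  # (B, 1, 1)
        # Content passes through unchanged; the style contribution is modulated
        # so that neither source dominates across different text/style pairs.
        return content_feat + scale * style_feat
```

Letting the scale depend on the inputs, rather than fixing it as a global hyperparameter, is what allows a single trained model to handle very different text and reference-style combinations.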

The resulting method is capable of generating high-quality stylized videos that both align with the content of the text and exhibit the style of the reference image. Experiments show that this approach is more flexible and efficient than existing text-to-image (T2I) methods adapted for video, and it even rivals methods that require fine-tuning for specific styles.

Despite these advancements, the method has inherent limitations, one of which is its dependency on base T2V models such as VideoCrafter. This means it inherits any constraints those base models have; for example, generating high-definition faces can still be challenging. Moreover, while StyleCrafter excels across a range of visual styles, exceptionally intricate or highly stylized semantics might not be captured perfectly due to the lack of stylized video data for training, a potential avenue for further research.
