StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

(2312.00330)
Published Dec 1, 2023 in cs.CV and cs.AI

Abstract

Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos due to (i) text's inherent clumsiness in expressing specific styles and (ii) the generally degraded style fidelity. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image. Considering the scarcity of stylized video datasets, we propose to first train a style control adapter using style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy. Additionally, we design a scale-adaptive fusion module to balance the influences of text-based content features and image-based style features, which helps generalization across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the texts and resemble the style of the reference images. Experiments demonstrate that our approach is more flexible and efficient than existing competitors.

Overview

  • StyleCrafter offers a novel method for creating stylized videos using text prompts enhanced by reference images for style guidance.

  • The system integrates a style control adapter into existing text-to-video models to manage style fidelity issues.

  • StyleCrafter uses a tailored finetuning approach in which style is learned from images and then transferred to text-to-video generation.

  • A scale-adaptive fusion module is employed to balance text content and image style, allowing for various combinations.

  • The method overcomes some limitations of adapted text-to-image techniques and offers flexibility, but it depends on the base T2V model and may struggle with highly intricate or stylized semantics.

The paper introduces StyleCrafter, an innovative method for generating stylized videos from text prompts. While text-to-video (T2V) models are becoming increasingly capable, they face challenges in producing videos that match specific styles conveyed by text alone. These challenges include difficulty in conveying style nuances through text and degraded style fidelity in the generated videos.

StyleCrafter addresses these issues by incorporating a style control adapter into pre-trained T2V models, allowing users to guide video generation using a reference image instead of relying solely on textual descriptions. The adapter is initially trained using image datasets rich in style variations, and this training is later transferred to video through a unique finetuning approach.
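To make the adapter idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: it assumes the style adapter adds a trainable cross-attention branch over style tokens extracted from the reference image, operating alongside the (frozen) text cross-attention of the pre-trained T2V model. The class name `StyleAdapterBlock`, the tensor shapes, and the `style_scale` argument are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleAdapterBlock(nn.Module):
    """Hypothetical sketch: inject reference-image style features into a
    pre-trained T2V denoising block via an extra cross-attention branch."""

    def __init__(self, dim: int, style_dim: int, num_heads: int = 8):
        super().__init__()
        # Placeholder for the frozen text cross-attention of the base T2V model.
        self.text_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # New, trainable cross-attention over style tokens extracted from the
        # reference image (e.g., by a learnable-query style extractor).
        self.style_cross_attn = nn.MultiheadAttention(
            dim, num_heads, kdim=style_dim, vdim=style_dim, batch_first=True
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, text_tokens, style_tokens, style_scale: float = 1.0):
        # x: latent features (B, N, dim); text_tokens: (B, T, dim);
        # style_tokens: (B, S, style_dim) from the reference image.
        content, _ = self.text_cross_attn(x, text_tokens, text_tokens)
        style, _ = self.style_cross_attn(x, style_tokens, style_tokens)
        # Residual update: content comes from the text, style from the image.
        return x + self.norm(content + style_scale * style)
```

Under this view, only the style branch needs training on style-rich image data; the base model's weights stay fixed, which is what allows the learned stylization to be transferred to video with a relatively light finetuning stage.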

To bolster the effectiveness of the style adaptation, the text prompts from which the videos are generated omit style descriptions, with the style information instead extracted exclusively from the provided reference image. A learning strategy that focuses on decoupling style and content plays a crucial role here. The method also includes a scale-adaptive fusion module to adjust the influence of text-based content features and image-based style features, accommodating different text and style combinations.
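The sketch below illustrates one plausible form of such a scale-adaptive fusion module in PyTorch. It assumes the module predicts a per-sample scale from pooled content and style features and uses it to weight the style contribution; the name `ScaleAdaptiveFusion` and the small-MLP design are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ScaleAdaptiveFusion(nn.Module):
    """Hypothetical sketch: balance text-based content features against
    image-based style features with a learned, per-sample scale."""

    def __init__(self, dim: int):
        super().__init__()
        # Small MLP mapping pooled content/style features to a scalar in (0, 1).
        self.scale_net = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.SiLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, content_feat: torch.Tensor, style_feat: torch.Tensor):
        # content_feat, style_feat: (B, N, dim) attention outputs.
        pooled = torch.cat(
            [content_feat.mean(dim=1), style_feat.mean(dim=1)], dim=-1
        )
        scale = self.scale_net(pooled).unsqueeze(1)  # (B, 1, 1)
        # Content passes through unchanged; the style contribution is modulated
        # so that neither source dominates across different text/style pairs.
        return content_feat + scale * style_feat
```

Letting the scale depend on the inputs, rather than fixing it as a global hyperparameter, is what allows a single trained model to handle very different text and reference-style combinations.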

The resulting method is capable of generating high-quality stylized videos that both align with the content of the text and exhibit the style of the reference image. Experiments show that this approach is more flexible and efficient than existing text-to-image (T2I) methods adapted for video, and it even rivals methods that require fine-tuning for specific styles.

Despite these advancements, the method has inherent limitations, one of which is its dependency on base T2V models such as VideoCrafter. This means it inherits any constraints those base models have; for example, generating high-definition faces can still be challenging. Moreover, while StyleCrafter excels across a range of visual styles, exceptionally intricate or highly stylized semantics might not be captured perfectly due to the lack of stylized video data for training, a potential avenue for further research.
