AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars (2306.00547v2)

Published 1 Jun 2023 in cs.CV and cs.GR

Abstract: Capturing and editing full head performances enables the creation of virtual characters with various applications such as extended reality and media production. The past few years witnessed a steep rise in the photorealism of human head avatars. Such avatars can be controlled through different input data modalities, including RGB, audio, depth, IMUs and others. While these data modalities provide effective means of control, they mostly focus on editing the head movements such as the facial expressions, head pose and/or camera viewpoint. In this paper, we propose AvatarStudio, a text-based method for editing the appearance of a dynamic full head avatar. Our approach builds on existing work to capture dynamic performances of human heads using neural radiance field (NeRF) and edits this representation with a text-to-image diffusion model. Specifically, we introduce an optimization strategy for incorporating multiple keyframes representing different camera viewpoints and time stamps of a video performance into a single diffusion model. Using this personalized diffusion model, we edit the dynamic NeRF by introducing view-and-time-aware Score Distillation Sampling (VT-SDS) following a model-based guidance approach. Our method edits the full head in a canonical space, and then propagates these edits to remaining time steps via a pretrained deformation network. We evaluate our method visually and numerically via a user study, and results show that our method outperforms existing approaches. Our experiments validate the design choices of our method and highlight that our edits are genuine, personalized, as well as 3D- and time-consistent.

Summary

The paper introduces a method that edits 3D dynamic human head avatars using text inputs while preserving identity fidelity.
It integrates a multi-view NeRF representation with a diffusion model, employing view-and-time-aware Score Distillation Sampling for temporal consistency.
Experiments and user studies demonstrate superior visual results and identity retention compared to existing techniques.

Detailed Examination of "AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars" (2306.00547)

Introduction

AvatarStudio is a text-driven method for editing the appearance of 3D dynamic human head avatars. It combines Neural Radiance Fields (NeRF) with a text-to-image diffusion model so that appearance edits preserve the subject's identity. The paper describes how head performances are captured with a dynamic NeRF and how that representation is then edited under text guidance, using an optimization strategy that incorporates keyframes from multiple camera viewpoints and time stamps.

Proposed Methodology

AvatarStudio Framework

AvatarStudio's framework consists of several key components designed to achieve text-driven editing with high fidelity:

Input Representation: Adopts HQ3DAvatar's NeRF-based dynamic representation to capture and maintain the high-quality details of the head avatar.
Optimization Strategy: Incorporates a novel optimization strategy tailored for multi-view and temporal data, enabling the fine-tuning of a latent diffusion model with a unique class-specific identifier for each image of the head.
Score Distillation Sampling (SDS): Introduces view-and-time-aware SDS, allowing edits that preserve temporal consistency and 3D coherence through a combination of fine-tuned and pre-trained diffusion models.
Figure 1: Sample viewpoints and timestamps used in our diffusion model fine-tuning.
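
To make the Score Distillation Sampling component more concrete, the following is a minimal sketch of a generic SDS update in NumPy. It is not the paper's VT-SDS: the view and time conditioning and the way the fine-tuned and pre-trained models are combined are not reproduced here. The noise schedule `alphas_cumprod`, the weighting `1 - a_bar`, and the toy predictor `oracle_predict_noise` (which stands in for a diffusion model by assuming a known clean target image) are all assumptions made for illustration.

```python
import numpy as np

def sds_gradient(rendered, predict_noise, t, alphas_cumprod, rng):
    """Generic SDS gradient with respect to the rendered image.

    The rendering is noised to timestep t, the diffusion model predicts the
    noise, and the weighted residual (predicted - injected) is the gradient.
    """
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(rendered.shape)
    noisy = np.sqrt(a_bar) * rendered + np.sqrt(1.0 - a_bar) * noise
    predicted = predict_noise(noisy, t)
    return (1.0 - a_bar) * (predicted - noise)

# Hypothetical linear schedule for the cumulative alpha product.
alphas_cumprod = np.linspace(0.999, 0.01, 1000)

# Toy stand-in for a diffusion model: it "knows" the clean target image and
# returns the noise that would explain the noisy input under that target.
target = np.full((4, 4, 3), 0.5)

def oracle_predict_noise(noisy, t):
    a_bar = alphas_cumprod[t]
    return (noisy - np.sqrt(a_bar) * target) / np.sqrt(1.0 - a_bar)

rng = np.random.default_rng(0)
rendered = np.zeros((4, 4, 3))  # stands in for a differentiable NeRF rendering
for _ in range(200):
    t = rng.integers(20, 980)  # avoid the extreme ends of the schedule
    grad = sds_gradient(rendered, oracle_predict_noise, t, alphas_cumprod, rng)
    rendered -= 0.5 * grad  # plain gradient descent on the pixels

print(np.abs(rendered - target).max())
```

In a real pipeline the gradient would be backpropagated through the renderer into the NeRF parameters rather than applied to pixels directly; the pixel-space update is used here only to keep the sketch self-contained.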

Implementation Details

Fine-Tuning: Utilizes multi-view captures of different temporal timestamps to optimize a diffusion model for each head, while avoiding information leakage through consistent noise sampling within batches.
Editing in Canonical Space: Edits are applied to the full head in a canonical space and then propagated to the remaining time steps by a pretrained deformation network, which keeps the result temporally coherent.
Diffusion Model Integration: Fine-tuned on specific multi-view keyframes to accommodate diverse viewing angles and varied facial expressions in video sequences.
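
The idea of editing once in canonical space and letting a deformation network carry the edit to every frame can be illustrated with a toy example. Everything below is hypothetical: `deform_to_canonical` replaces the pretrained deformation network with a simple time-dependent shift, and `canonical_color` replaces the NeRF appearance with a colored sphere. The point is only that a change made to the canonical field shows up at every time step without per-frame editing.

```python
import numpy as np

def deform_to_canonical(points, t):
    """Toy stand-in for a deformation network: undo a time-dependent shift."""
    offset = np.array([0.1 * t, 0.0, 0.0])
    return points - offset

def canonical_color(points, edited=False):
    """Toy canonical appearance: a unit sphere, red originally, blue once edited."""
    inside = np.linalg.norm(points, axis=-1, keepdims=True) < 1.0
    base = np.array([0.0, 0.0, 1.0]) if edited else np.array([1.0, 0.0, 0.0])
    return np.where(inside, base, 0.0)

def query(points, t, edited=False):
    """Color of observed points at time t, looked up through canonical space."""
    return canonical_color(deform_to_canonical(points, t), edited)

# The same surface point observed at three different time steps.
for t in (0.0, 1.0, 5.0):
    observed = np.array([[0.1 * t, 0.0, 0.0]])
    print(t, query(observed, t, edited=False), query(observed, t, edited=True))
```

Because the edit lives only in `canonical_color`, temporal consistency follows from the deformation mapping rather than from any per-frame constraint.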

Experimental Evaluation

The efficacy of AvatarStudio was evaluated through both visual assessments and a structured user study:

Visual Results: Demonstrated superior 3D and temporal consistency in edited avatars compared to existing methods (e.g., Dream Fields, Instruct-NeRF2NeRF), shown in various scenarios including photorealistic and non-photorealistic edits.
User Study: Conducted with 48 participants to assess identity preservation, prompt adherence, and temporal coherence in edited videos. AvatarStudio consistently outperformed the baselines, with particularly strong results in preserving identity and following the text prompts.

Implications and Future Work

Text-driven editing of dynamic head avatars could benefit domains such as interactive media, virtual reality, and personalized content creation. Future work could explore reducing computational cost and extending the method to monocular inputs or varied lighting conditions, which would broaden its applicability in real-world scenarios.

Conclusion

AvatarStudio brings forward a potent combination of NeRF and fine-tuned diffusion models to achieve detailed text-driven editing of dynamic head avatars. It represents a step towards more intuitive and accessible avatar customization techniques, setting the stage for further advancements in text-based interactive systems.