Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model

Published 15 Jun 2023 in cs.CV | (2306.09551v1)

Abstract: Recent research has demonstrated that the combination of pretrained diffusion models with neural radiance fields (NeRFs) has emerged as a promising approach for text-to-3D generation. Simply coupling NeRF with diffusion models will result in cross-view inconsistency and degradation of stylized view syntheses. To address this challenge, we propose the Edit-DiffNeRF framework, which is composed of a frozen diffusion model, a proposed delta module to edit the latent semantic space of the diffusion model, and a NeRF. Instead of training the entire diffusion model for each scene, our method focuses on editing the latent semantic space in frozen pretrained diffusion models by the delta module. This fundamental change to the standard diffusion framework enables us to make fine-grained modifications to the rendered views and effectively consolidate these instructions in a 3D scene via NeRF training. As a result, we are able to produce an edited 3D scene that faithfully aligns to input text instructions. Furthermore, to ensure semantic consistency across different viewpoints, we propose a novel multi-view semantic consistency loss that extracts a latent semantic embedding from the input view as a prior and aims to reconstruct it in different views. Our proposed method has been shown to effectively edit real-world 3D scenes, resulting in a 25% improvement in the alignment of the performed 3D edits with text instructions compared to prior work.

Summary

The paper introduces a delta module that edits the latent semantic space of a frozen, pretrained 2D diffusion model, so the diffusion model does not have to be trained for each scene.
It proposes a multi-view semantic consistency loss that takes a latent semantic embedding from the input view as a prior and aims to reconstruct it in other views; NeRF training then consolidates the edited views into a single 3D scene.
The authors report a 25% improvement over prior work in how well the edited 3D scenes align with text instructions.

"Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model" addresses a problem that arises when pretrained diffusion models are combined with neural radiance fields (NeRFs) for text-to-3D generation. The combination has shown promise, but simply coupling the two leads to cross-view inconsistency and degraded stylized view synthesis.

To mitigate these issues, the authors propose the Edit-DiffNeRF framework, comprising three main components:

Frozen Diffusion Model: Instead of training the entire diffusion model for each scene, the authors keep the pretrained model's weights fixed.
Delta Module: This module edits the latent semantic space of the frozen diffusion model. Editing the semantic space, rather than retraining the model, allows fine-grained modifications to the rendered views that follow the text instructions.
NeRF: NeRF training consolidates the edited views into a single 3D scene.

The central contribution is the delta module, which edits the latent semantic space of the frozen diffusion model. This enables fine-grained modifications to the rendered views, which NeRF training then consolidates into the 3D scene. Because the diffusion model itself is not trained for each scene, the method avoids per-scene training of the full model.
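
The idea of a trainable delta on top of a frozen model can be illustrated with a toy sketch. This is not the paper's architecture: the class names, the additive form of the delta, the squared-error objective, and the target latent are all assumptions made for illustration, and a fixed random linear map stands in for the diffusion model's latent semantic encoder.

```python
import random


class FrozenLatentEncoder:
    """Hypothetical stand-in for a frozen model's latent semantic encoder."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # These weights are never updated: the encoder is "frozen".
        self.weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def encode(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]


class DeltaModule:
    """Trainable offset added to the frozen latent (additive form is an assumption)."""

    def __init__(self, dim):
        self.delta = [0.0] * dim

    def apply(self, latent):
        return [h + d for h, d in zip(latent, self.delta)]

    def step(self, latent, target, lr=0.1):
        # Gradient of 0.5 * ||latent + delta - target||^2 with respect to delta.
        grad = [e - t for e, t in zip(self.apply(latent), target)]
        self.delta = [d - lr * g for d, g in zip(self.delta, grad)]
        return 0.5 * sum(g * g for g in grad)  # loss before this update


encoder = FrozenLatentEncoder(dim=4)
frozen_copy = [row[:] for row in encoder.weights]
delta = DeltaModule(dim=4)

view = [0.5, -0.2, 0.1, 0.9]      # toy "rendered view" features
target = [1.0, 0.0, -1.0, 0.5]    # hypothetical latent implied by a text instruction
latent = encoder.encode(view)
losses = [delta.step(latent, target) for _ in range(50)]
```

Only `delta.delta` changes during the loop; `encoder.weights` stays identical to `frozen_copy`, which mirrors the split between frozen and trainable parts described above.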

Additionally, the authors introduce a multi-view semantic consistency loss to keep semantic content consistent across viewpoints. The loss extracts a latent semantic embedding from the input view as a prior and aims to reconstruct it in different views, which is intended to improve the coherence of the edited 3D scene.
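
A minimal sketch of such a loss follows, assuming the embeddings are plain vectors and that the penalty is a mean squared L2 distance to the reference embedding. The summary does not state the distance the authors use, so the function name and this choice are illustrative only.

```python
def semantic_consistency_loss(reference_embedding, view_embeddings):
    """Mean squared L2 distance between each view's embedding and the reference.

    The squared-L2 form is an assumption; the source only says the reference
    embedding is used as a prior and reconstructed in the other views.
    """
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    total = 0.0
    for embedding in view_embeddings:
        if len(embedding) != len(reference_embedding):
            raise ValueError("embedding dimensions must match")
        total += sum((e - r) ** 2 for e, r in zip(embedding, reference_embedding))
    return total / len(view_embeddings)


reference = [1.0, 0.0, -1.0]                          # embedding of the input view
consistent_views = [[1.0, 0.0, -1.0], [1.0, 0.0, -1.0]]
drifting_views = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]  # second view has drifted

loss_consistent = semantic_consistency_loss(reference, consistent_views)  # 0.0
loss_drifting = semantic_consistency_loss(reference, drifting_views)      # 1.0
```

The loss is zero when every view reproduces the reference embedding and grows as views drift away from it, which is the behavior a consistency term of this kind needs.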

According to the abstract, Edit-DiffNeRF effectively edits real-world 3D scenes and achieves a 25% improvement over prior work in the alignment of 3D edits with text instructions.

In summary, Edit-DiffNeRF edits the latent semantic space of a frozen diffusion model rather than retraining it, and combines NeRF training with a multi-view semantic consistency loss to produce 3D scenes that are consistent across views and aligned with user-provided text instructions.
