GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions (2311.16037v2)

Published 27 Nov 2023 in cs.CV and cs.GR

Abstract: Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e. within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours).

References (57)
  1. Sine: Semantic-driven image-based nerf editing with prior-guided editing field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  2. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021.
  3. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
  4. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  5. Segment anything in 3d with nerfs. In NeurIPS, 2023.
  6. Tensorf: Tensorial radiance fields. In ECCV, 2022.
  7. Stylizing 3d scene via implicit representation and hypernetwork. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022.
  8. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 2021.
  9. Text-driven editing of 3d scenes without retraining. arXiv preprint arXiv:2309.04917, 2023.
  10. Textdeformer: Geometry manipulation using text guidance. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
  11. Instruct-nerf2nerf: Editing 3d scenes with instructions. In CVPR, 2023.
  12. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  13. Denoising diffusion probabilistic models. NeurIPS, 2020.
  14. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 2022.
  15. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. arXiv preprint arXiv:2205.08535, 2022.
  16. Learning to stylize novel views. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  17. Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  18. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 2023.
  19. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  20. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems, 2022.
  21. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
  22. Climatenerf: Physically-based neural rendering for extreme climate synthesis. arXiv e-prints, 2022.
  23. Focaldreamer: Text-driven 3d editing via focal-fusion assembly. arXiv preprint arXiv:2308.10608, 2023b.
  24. Nerf-in: Free-form nerf inpainting with rgb-d priors. arXiv preprint arXiv:2206.04901, 2022.
  25. Editing conditional radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, 2021.
  26. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  27. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  28. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  29. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021.
  30. Watch your steps: Local image and scene editing by text instructions. arXiv preprint arXiv:2308.08947, 2023.
  31. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 2022.
  32. Snerf: stylized neural implicit representations for 3d scenes. arXiv preprint arXiv:2207.02363, 2022.
  33. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  34. Neural articulated radiance field. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  35. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.
  36. Learning transferable visual models from natural language supervision. In ICML, 2021.
  37. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  38. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  39. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  40. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, 2022a.
  41. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 2022b.
  42. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022c.
  43. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
  44. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, 2015.
  45. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 2019.
  46. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, 2022.
  47. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In 2022 International Conference on 3D Vision (3DV), 2022.
  48. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  49. Nerf-art: Text-driven neural radiance fields stylization. TVCG, 2023.
  50. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
  51. Palettenerf: Palette-based color editing for nerfs. arXiv preprint arXiv:2212.12871, 2022.
  52. Instructp2p: Learning to edit 3d point clouds with text instructions. arXiv preprint arXiv:2306.07154, 2023.
  53. Deforming radiance fields with cages. In European Conference on Computer Vision, 2022.
  54. Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In European Conference on Computer Vision, 2022.
  55. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
  56. Arf: Artistic radiance fields. In European Conference on Computer Vision, 2022.
  57. Dreameditor: Text-driven 3d scene editing with neural fields. arXiv preprint arXiv:2306.13455, 2023.

Summary

  • The paper introduces GaussianEditor, a novel framework that uses 3D Gaussian splatting for precise, text-guided edits in 3D scenes.
  • It employs region of interest extraction, 3D Gaussian RoI alignment, and image-grounded segmentation to confine edits to specific scene areas.
  • The framework significantly reduces training time compared to Instruct-NeRF2NeRF while preserving high fidelity in both target and surrounding regions.

GaussianEditor: A Framework for Precise 3D Scene Editing with Text Instructions

The paper introduces GaussianEditor, a framework designed to address the limitations of current text-instructed 3D scene editing methods. While significant advances have been made with 2D diffusion models, the key issue addressed here is the inability of these models to perform precise, localized editing in 3D scenes. GaussianEditor resolves this by leveraging 3D Gaussian splatting, whose explicit representation allows individual Gaussian primitives to be manipulated directly, enabling detailed and accurate editing from text instructions.

GaussianEditor is structured around three main components: region of interest (RoI) extraction, 3D Gaussian RoI alignment, and delicate editing within the Gaussian RoI. The first step extracts the RoI from the textual instruction: drawing on recent advances in multimodal processing, the framework uses an LLM to extract key descriptions and aligns them to the corresponding regions of the 3D scene, as sketched below.
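
A minimal Python sketch of how this extraction stage might be wired together follows; the helper names, signatures, and their trivial bodies are hypothetical stand-ins for the real components (an LLM for key-phrase extraction and an image-grounded segmenter, e.g. an open-set detector paired with a promptable mask model), not the paper's actual API.

```python
import numpy as np

def extract_edit_target(instruction: str) -> str:
    """Stand-in for an LLM call that distills the edit target out of the
    text instruction, e.g. "make the bear statue golden" -> "bear statue"."""
    return "bear statue"  # a real system would prompt an LLM here

def segment_views(images: list, phrase: str) -> list:
    """Stand-in for image-grounded segmentation (open-set detection plus
    promptable mask prediction), returning one binary mask per view."""
    return [np.zeros(img.shape[:2], dtype=bool) for img in images]

instruction = "make the bear statue golden"
views = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
masks = segment_views(views, extract_edit_target(instruction))
```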

In comparison to Instruct-NeRF2NeRF, which requires substantially more time and struggles to localize edits because regions are entangled in its implicit representation, GaussianEditor completes editing within 20 minutes on a single V100 GPU, less than half the training time of Instruct-NeRF2NeRF (45 minutes to 2 hours depending on scene complexity). This is a notable computational improvement facilitated by 3D Gaussian splatting, which excels at real-time rendering and per-splat manipulation.

The framework's editing precision is enabled through image-grounded segmentation to localize the RoI in the image space, which is subsequently lifted back to the 3D Gaussian space. This ensures updates during the editing process are confined accurately without unintentional modifications to surrounding scene elements. This capability allows GaussianEditor to perform consistent multi-round editing while adhering closely to user-specified instructions.
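
One plausible way to lift the per-view masks into a 3D Gaussian RoI, assuming calibrated pinhole cameras, is to project each Gaussian center into every view and vote on mask membership. The sketch below illustrates this idea; the function name, matrix conventions, and voting threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def lift_masks_to_gaussians(means, masks, intrinsics, extrinsics, vote_frac=0.5):
    """means: (N, 3) Gaussian centers; masks: per-view (H, W) bool arrays;
    intrinsics: per-view 3x3 K; extrinsics: per-view 4x4 world-to-camera.
    Returns an (N,) bool mask marking Gaussians inside the RoI."""
    n = means.shape[0]
    votes = np.zeros(n)
    homog = np.concatenate([means, np.ones((n, 1))], axis=1)  # (N, 4)
    for mask, K, T in zip(masks, intrinsics, extrinsics):
        cam = (T @ homog.T).T[:, :3]                 # points in camera space
        pix = (K @ cam.T).T
        pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)
        u = pix[:, 0].astype(int)
        v = pix[:, 1].astype(int)
        h, w = mask.shape
        valid = (cam[:, 2] > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        votes[valid] += mask[v[valid], u[valid]]     # vote if projected inside the mask
    return votes >= vote_frac * len(masks)
```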

Quantitatively, GaussianEditor matches Instruct-NeRF2NeRF on the desired text-image similarities while significantly improving image-image similarities, indicating better preservation of non-target regions. It harnesses the spatial independence of Gaussians to separate foreground from background rendering, so edits are limited strictly to the intended scene components, such as modifying the color of a particular object without affecting neighboring features.
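
A simple way to exploit that spatial independence during optimization, sketched below on toy data, is to zero the gradients of every Gaussian outside the RoI after each backward pass, so that only the targeted region is updated; the mean-squared loss here is a placeholder for the actual diffusion-guided editing objective.

```python
import torch

N = 10_000
colors = torch.randn(N, 3, requires_grad=True)  # per-Gaussian color parameters
roi = torch.zeros(N, dtype=torch.bool)
roi[:1_000] = True                              # assume the first 1k Gaussians form the RoI

opt = torch.optim.Adam([colors], lr=1e-2)
target = torch.ones(N, 3)                       # toy editing target (e.g. "golden")

for _ in range(100):
    opt.zero_grad()
    loss = ((colors - target) ** 2).mean()      # placeholder for the real editing loss
    loss.backward()
    colors.grad[~roi] = 0.0                     # freeze all Gaussians outside the RoI
    opt.step()
```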

Moreover, the framework introduces scene description generation and employs existing 2D models in an embedded editing process, making it easier to integrate into existing 3D graphics pipelines. It demonstrates that systematically combining explicit 3D representations with advanced language and vision models can substantially improve the precision and fidelity of scene editing.
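
As an illustration of such an embedded editing process, the toy loop below renders a view, edits it with a 2D instruction-following model, and supervises the representation against the edited image only inside the projected RoI; `render` and `edit_2d` are trivial placeholders for differentiable splatting and a diffusion-based editor, and none of this is the paper's actual training code.

```python
import torch

def render(gaussians, camera):
    """Placeholder for a differentiable Gaussian-splatting render."""
    return gaussians * camera

def edit_2d(image, instruction):
    """Placeholder for a 2D instruction-following diffusion editor."""
    return torch.ones_like(image)

gaussians = torch.randn(64, 64, 3, requires_grad=True)  # toy scene parameters
camera = torch.tensor(1.0)                              # toy "camera"
roi = torch.zeros(64, 64, dtype=torch.bool)
roi[20:40, 20:40] = True                                # RoI projected to this view
opt = torch.optim.Adam([gaussians], lr=1e-2)

for _ in range(50):
    rendered = render(gaussians, camera)
    with torch.no_grad():
        edited = edit_2d(rendered, "make it golden")
    loss = ((rendered - edited)[roi] ** 2).mean()       # supervise only inside the RoI
    opt.zero_grad()
    loss.backward()
    opt.step()
```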

Looking forward, extending GaussianEditor to dynamic scenes paves new avenues for real-time interactive content creation and user-generated scene manipulation, with applications in entertainment, virtual reality, and architectural visualization.

This paper is an instrumental step toward highly efficient, accurate, and user-guided 3D editing paradigms, and it opens the door to further refinement of explicit, differentiable 3D modeling frameworks that could see adoption across real-world applications requiring precision-driven 3D content generation and manipulation.
