GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions (2311.16037v2)

Published 27 Nov 2023 in cs.CV and cs.GR

Abstract: Recently, impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes it challenging to perform delicate, especially localized, editing for 3D scenes. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction, aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed, i.e. within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours).

References (57)
  1. Sine: Semantic-driven image-based nerf editing with prior-guided editing field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  2. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV, 2021.
  3. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
  4. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  5. Segment anything in 3d with nerfs. In NeurIPS, 2023.
  6. Tensorf: Tensorial radiance fields. In ECCV, 2022.
  7. Stylizing 3d scene via implicit representation and hypernetwork. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022.
  8. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 2021.
  9. Text-driven editing of 3d scenes without retraining. arXiv preprint arXiv:2309.04917, 2023.
  10. Textdeformer: Geometry manipulation using text guidance. In ACM SIGGRAPH 2023 Conference Proceedings, 2023.
  11. Instruct-nerf2nerf: Editing 3d scenes with instructions. In CVPR, 2023.
  12. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  13. Denoising diffusion probabilistic models. NeurIPS, 2020.
  14. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 2022.
  15. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. arXiv preprint arXiv:2205.08535, 2022.
  16. Learning to stylize novel views. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  17. Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  18. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG), 2023.
  19. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  20. Decomposing nerf for editing via feature field distillation. Advances in Neural Information Processing Systems, 2022.
  21. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
  22. Climatenerf: Physically-based neural rendering for extreme climate synthesis. arXiv e-prints, 2022.
  23. Focaldreamer: Text-driven 3d editing via focal-fusion assembly. arXiv preprint arXiv:2308.10608, 2023b.
  24. Nerf-in: Free-form nerf inpainting with rgb-d priors. arXiv preprint arXiv:2206.04901, 2022.
  25. Editing conditional radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, 2021.
  26. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  27. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  28. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  29. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021.
  30. Watch your steps: Local image and scene editing by text instructions. arXiv preprint arXiv:2308.08947, 2023.
  31. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 2022.
  32. Snerf: stylized neural implicit representations for 3d scenes. arXiv preprint arXiv:2207.02363, 2022.
  33. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  34. Neural articulated radiance field. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  35. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.
  36. Learning transferable visual models from natural language supervision. In ICML, 2021.
  37. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  38. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  39. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  40. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, 2022a.
  41. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 2022b.
  42. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022c.
  43. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
  44. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, 2015.
  45. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 2019.
  46. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR, 2022.
  47. Neural feature fusion fields: 3d distillation of self-supervised 2d image representations. In 2022 International Conference on 3D Vision (3DV), 2022.
  48. Clip-nerf: Text-and-image driven manipulation of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  49. Nerf-art: Text-driven neural radiance fields stylization. TVCG, 2023.
  50. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.
  51. Palettenerf: Palette-based color editing for nerfs. arXiv preprint arXiv:2212.12871, 2022.
  52. Instructp2p: Learning to edit 3d point clouds with text instructions. arXiv preprint arXiv:2306.07154, 2023.
  53. Deforming radiance fields with cages. In European Conference on Computer Vision, 2022.
  54. Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In European Conference on Computer Vision, 2022.
  55. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
  56. Arf: Artistic radiance fields. In European Conference on Computer Vision, 2022.
  57. Dreameditor: Text-driven 3d scene editing with neural fields. arXiv preprint arXiv:2306.13455, 2023.

Summary

  • The paper introduces GaussianEditor, a novel framework that uses 3D Gaussian splatting for precise, text-guided edits in 3D scenes.
  • It employs region of interest extraction, 3D Gaussian RoI alignment, and image-grounded segmentation to confine edits to specific scene areas.
  • The framework significantly reduces training time compared to Instruct-NeRF2NeRF while preserving high fidelity in both target and surrounding regions.

GaussianEditor: A Framework for Precise 3D Scene Editing with Text Instructions

The paper introduces GaussianEditor, a framework designed to address the limitations of current text-instructed 3D scene editing methods. While significant advances have been made with 2D diffusion models, the key issue addressed here is the inability of these models to perform precise, localized editing in 3D scenes. GaussianEditor resolves this by leveraging 3D Gaussian splatting, whose explicit representation allows individual Gaussian primitives to be manipulated directly, enabling detailed and accurate editing from text instructions.

GaussianEditor is structured around three main components: region of interest (RoI) extraction, 3D Gaussian RoI alignment, and delicate editing within the Gaussian RoI. The first step extracts the RoI from the textual instruction: drawing on recent advances in multimodal processing, the framework uses an LLM to extract key descriptions and aligns them to the corresponding regions of the 3D scene, as sketched below.
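
A minimal Python sketch of how this extraction stage might be wired together follows; the helper names, signatures, and their trivial bodies are hypothetical stand-ins for the real components (an LLM for key-phrase extraction and an image-grounded segmenter, e.g. an open-set detector paired with a promptable mask model), not the paper's actual API.

```python
import numpy as np

def extract_edit_target(instruction: str) -> str:
    """Stand-in for an LLM call that distills the edit target out of the
    text instruction, e.g. "make the bear statue golden" -> "bear statue"."""
    return "bear statue"  # a real system would prompt an LLM here

def segment_views(images: list, phrase: str) -> list:
    """Stand-in for image-grounded segmentation (open-set detection plus
    promptable mask prediction), returning one binary mask per view."""
    return [np.zeros(img.shape[:2], dtype=bool) for img in images]

instruction = "make the bear statue golden"
views = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(3)]
masks = segment_views(views, extract_edit_target(instruction))
```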

In comparison to Instruct-NeRF2NeRF, which requires substantially more time and struggles to localize edits because regions are entangled in its implicit representation, GaussianEditor completes editing within 20 minutes on a single V100 GPU, less than half the training time of Instruct-NeRF2NeRF (45 minutes to 2 hours depending on scene complexity). This is a notable computational improvement facilitated by 3D Gaussian splatting, which excels at real-time rendering and per-splat manipulation.

The framework's editing precision is enabled through image-grounded segmentation to localize the RoI in the image space, which is subsequently lifted back to the 3D Gaussian space. This ensures updates during the editing process are confined accurately without unintentional modifications to surrounding scene elements. This capability allows GaussianEditor to perform consistent multi-round editing while adhering closely to user-specified instructions.
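
One plausible way to lift the per-view masks into a 3D Gaussian RoI, assuming calibrated pinhole cameras, is to project each Gaussian center into every view and vote on mask membership. The sketch below illustrates this idea; the function name, matrix conventions, and voting threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def lift_masks_to_gaussians(means, masks, intrinsics, extrinsics, vote_frac=0.5):
    """means: (N, 3) Gaussian centers; masks: per-view (H, W) bool arrays;
    intrinsics: per-view 3x3 K; extrinsics: per-view 4x4 world-to-camera.
    Returns an (N,) bool mask marking Gaussians inside the RoI."""
    n = means.shape[0]
    votes = np.zeros(n)
    homog = np.concatenate([means, np.ones((n, 1))], axis=1)  # (N, 4)
    for mask, K, T in zip(masks, intrinsics, extrinsics):
        cam = (T @ homog.T).T[:, :3]                 # points in camera space
        pix = (K @ cam.T).T
        pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)
        u = pix[:, 0].astype(int)
        v = pix[:, 1].astype(int)
        h, w = mask.shape
        valid = (cam[:, 2] > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        votes[valid] += mask[v[valid], u[valid]]     # vote if projected inside the mask
    return votes >= vote_frac * len(masks)
```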

Quantitatively, GaussianEditor matches Instruct-NeRF2NeRF on the desired text-image similarities while significantly improving image-image similarities, indicating better preservation of non-target regions. It harnesses the spatial independence of Gaussians to separate foreground from background rendering, so edits are limited strictly to the intended scene components, such as modifying the color of a particular object without affecting neighboring features.
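
A simple way to exploit that spatial independence during optimization, sketched below on toy data, is to zero the gradients of every Gaussian outside the RoI after each backward pass, so that only the targeted region is updated; the mean-squared loss here is a placeholder for the actual diffusion-guided editing objective.

```python
import torch

N = 10_000
colors = torch.randn(N, 3, requires_grad=True)  # per-Gaussian color parameters
roi = torch.zeros(N, dtype=torch.bool)
roi[:1_000] = True                              # assume the first 1k Gaussians form the RoI

opt = torch.optim.Adam([colors], lr=1e-2)
target = torch.ones(N, 3)                       # toy editing target (e.g. "golden")

for _ in range(100):
    opt.zero_grad()
    loss = ((colors - target) ** 2).mean()      # placeholder for the real editing loss
    loss.backward()
    colors.grad[~roi] = 0.0                     # freeze all Gaussians outside the RoI
    opt.step()
```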

Moreover, the framework introduces scene description generation and employs existing 2D models in an embedded editing process, making it easier to integrate into existing 3D graphics pipelines. It demonstrates that systematically combining explicit 3D representations with advanced language and vision models can substantially improve the precision and fidelity of scene editing.
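
As an illustration of such an embedded editing process, the toy loop below renders a view, edits it with a 2D instruction-following model, and supervises the representation against the edited image only inside the projected RoI; `render` and `edit_2d` are trivial placeholders for differentiable splatting and a diffusion-based editor, and none of this is the paper's actual training code.

```python
import torch

def render(gaussians, camera):
    """Placeholder for a differentiable Gaussian-splatting render."""
    return gaussians * camera

def edit_2d(image, instruction):
    """Placeholder for a 2D instruction-following diffusion editor."""
    return torch.ones_like(image)

gaussians = torch.randn(64, 64, 3, requires_grad=True)  # toy scene parameters
camera = torch.tensor(1.0)                              # toy "camera"
roi = torch.zeros(64, 64, dtype=torch.bool)
roi[20:40, 20:40] = True                                # RoI projected to this view
opt = torch.optim.Adam([gaussians], lr=1e-2)

for _ in range(50):
    rendered = render(gaussians, camera)
    with torch.no_grad():
        edited = edit_2d(rendered, "make it golden")
    loss = ((rendered - edited)[roi] ** 2).mean()       # supervise only inside the RoI
    opt.zero_grad()
    loss.backward()
    opt.step()
```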

Looking forward, extending GaussianEditor to dynamic scenes paves new avenues for real-time interactive content creation and user-generated scene manipulation, with applications in entertainment, virtual reality, and architectural visualization.

This paper is an instrumental step toward highly efficient, accurate, and user-guided 3D editing paradigms, and it opens the door to further refinement of explicit, differentiable 3D modeling frameworks that could see adoption across real-world applications requiring precision-driven 3D content generation and manipulation.
