TIGER: Text-Instructed
3D Gaussian Retrieval and Coherent Editing
- Teng Xu
- Jiamin Chen
- Peng Chen
- Youjia Zhang
- Junqing Yu
- Wei Yang †
- Huazhong University of Science and Technology
- {tengxu, jiaminchen, pengchen, youjiazhang, yjqing, weiyangcs}@hust.edu.cn
- †Corresponding Author
Abstract
Editing objects within a scene is a critical functionality required across a broad spectrum of applications in computer vision and graphics. As 3D Gaussian Splatting (3DGS) emerges as a frontier in scene representation, the effective modification of 3D Gaussian scenes has become increasingly vital. This process entails accurately retrieving the target objects and subsequently performing modifications based on instructions. Although partial solutions exist, prior techniques mainly embed sparse semantics into Gaussians for retrieval and rely on an iterative dataset update paradigm for editing, leading to over-smoothing or inconsistency issues. To this end, this paper proposes a systematic approach, namely TIGER, for coherent text-instructed 3D Gaussian retrieval and editing. In contrast to the top-down language grounding approach for 3D Gaussians, we adopt a bottom-up language aggregation strategy to generate denser language-embedded 3D Gaussians that support open-vocabulary retrieval. To overcome the over-smoothing and inconsistency issues in editing, we propose a Coherent Score Distillation (CSD) that aggregates a 2D image editing diffusion model and a multi-view diffusion model for score distillation, producing multi-view consistent edits with much finer details. In various experiments, we demonstrate that our TIGER is able to accomplish more consistent and realistic edits than prior work.
Bottom-up Language Feature Extraction
To generate the language feature map, we first use MaskCLIP to produce low-resolution, patch-level language features. These patch features encompass multi-scale global information through masked self-distillation, which is fundamentally different from extracting features by cropping the image first and then feeding each crop into CLIP. We then use FeatUp to upsample these features to pixel-level language features.
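The shape contract of this two-stage pipeline can be sketched as follows. This is an illustrative stand-in only: real FeatUp learns its upsampler, whereas here plain nearest-neighbor interpolation in numpy shows how patch-level features become pixel-level features; the function name and both resolutions are assumptions, not part of the released code.

```python
import numpy as np

def upsample_patch_features(patch_feats, out_h, out_w):
    """Upsample patch-level language features to pixel level.

    Illustrative stand-in for FeatUp: we use nearest-neighbor
    interpolation to show the shape contract (h, w, C) -> (out_h, out_w, C);
    the learned upsampler in FeatUp produces sharper results.
    """
    h, w, _ = patch_feats.shape
    # Map every output pixel back to its nearest source patch index.
    ys = (np.arange(out_h) * h / out_h).astype(int)
    xs = (np.arange(out_w) * w / out_w).astype(int)
    return patch_feats[ys][:, xs]
```

For example, a 2x2 grid of MaskCLIP patch features upsampled to 4x4 keeps each corner pixel aligned with its source patch while preserving the channel dimension.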
Although the language features now have sufficient resolution for Gaussian supervision, they still suffer from uneven boundaries. For further refinement, we apply SAM at the finest level to generate a set of fine binary masks, and then perform a masked average that aggregates all the language features within each mask, yielding refined semantic boundaries. Note that our masked-average process is fundamentally different from the top-down approach: our features generated by MaskCLIP contain global information that crosses semantic boundaries, whereas extracting language features within each mask leads to the absence of contextual information outside the mask in the final language feature.
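The masked-average refinement step above can be sketched in a few lines. This is a minimal sketch assuming SAM mask generation happens upstream; the function name is hypothetical and the real system operates on high-dimensional CLIP embeddings rather than the toy channels used here.

```python
import numpy as np

def masked_average_refine(feat_map, masks):
    """Refine a pixel-level language feature map with binary masks.

    feat_map: (H, W, C) pixel-level language features.
    masks:    list of (H, W) boolean segmentation masks (e.g. from SAM).
    returns:  (H, W, C) map where every pixel inside a mask carries the
              mask-wide mean feature, sharpening semantic boundaries.
    """
    refined = feat_map.copy()
    for mask in masks:
        if mask.any():
            # All pixels of one segment share one aggregated language
            # embedding; pixels outside every mask keep their features,
            # so context from the full image is retained.
            refined[mask] = feat_map[mask].mean(axis=0)
    return refined
```

Because the input features already encode global context from MaskCLIP, averaging inside a mask only cleans up boundaries rather than discarding out-of-mask information.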
Coherent Score Distillation
Existing 3D Gaussian editing methods predominantly adopt the iterative dataset updating scheme, which induces over-smoothing and the multi-face Janus problem because each image is edited independently. To address this issue, we propose a novel Coherent Score Distillation (CSD) that integrates the SDS losses of a 2D image editing diffusion model, i.e., InstructPix2Pix, and a multi-view diffusion model, i.e., MVDream, producing multi-view consistent edits with fine details.
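The combination of the two scores can be sketched as a weighted sum of gradients. Both score functions below are placeholders: a real implementation would call InstructPix2Pix on each rendered view and MVDream on the view batch; the function names, weights, and signatures here are assumptions for illustration only.

```python
import numpy as np

def coherent_score_distillation_grad(rendered_views, edit_score_fn,
                                     mv_score_fn, w_2d=1.0, w_mv=1.0):
    """Sketch of a CSD-style combined gradient.

    rendered_views: list of (H, W, 3) renders of the Gaussian scene.
    edit_score_fn:  per-view score from a 2D editing diffusion model
                    (stand-in for InstructPix2Pix guidance).
    mv_score_fn:    joint score over the stacked views from a multi-view
                    diffusion model (stand-in for MVDream guidance).
    returns:        (V, H, W, 3) gradient combining per-view edit
                    fidelity with cross-view consistency.
    """
    # Per-view editing signal: follows the text instruction independently.
    g_2d = np.stack([edit_score_fn(v) for v in rendered_views])
    # Joint multi-view signal: couples the views to suppress Janus artifacts.
    g_mv = mv_score_fn(np.stack(rendered_views))
    return w_2d * g_2d + w_mv * g_mv
```

The key design point is that the multi-view term sees all renders at once, so the gradient it contributes is shared across views instead of being computed image by image.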


Results
TIGER handles a wide variety of scenes and prompts. Across our experiments, TIGER accomplishes more consistent and realistic edits than prior work.
Related links
Part of our source code is borrowed from threestudio, 3DGS, LangSplat, and GaussianDreamer. We sincerely appreciate the excellent work of these authors.
This website is constructed using the source code provided by MipNeRF, and we are grateful for their template.
BibTeX
@article{xu2024tiger,
title={TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing},
author={Xu, Teng and Chen, Jiamin and Chen, Peng and Zhang, Youjia and Yu, Junqing and Yang, Wei},
journal={arXiv preprint arXiv:2405.14455},
year={2024}
}