HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting

Jingyu Lin1,    Jiaqi Gu2,    Lubin Fan2†,    Bojian Wu3,    Yujing Lou4,   
Renjie Chen1,    Ligang Liu1,    Jieping Ye2   
1University of Science and Technology of China     2Individual Researcher     3Zhejiang University     4Shanghai Jiaotong University    
†Corresponding author.
Figure: Teaser image.

HybridGS is the first hybrid representation that combines multi-view consistent 3D Gaussians with single-view independent 2D Gaussians to decouple the transients and statics present in the scene. Our results demonstrate reasonable decompositions.



Abstract

Generating high-quality novel view renderings with 3D Gaussian Splatting (3DGS) is challenging in scenes containing transient objects. We propose a novel hybrid representation, termed HybridGS, which uses 2D Gaussians for the transient objects in each image and maintains traditional 3D Gaussians for the whole static scene. 3DGS itself is best suited for static scenes that satisfy multi-view consistency; transient objects appear only occasionally and violate this assumption, so we model them as planar objects seen from a single view and represent them with 2D Gaussians. This representation decomposes the scene based on fundamental viewpoint consistency, making the decomposition more reasonable. We further present a novel multi-view regulated supervision method for 3DGS that leverages information from co-visible regions, sharpening the distinction between transients and statics. We then propose a straightforward yet effective multi-stage training strategy that ensures robust training and high-quality view synthesis across various settings. Experiments on benchmark datasets show state-of-the-art novel view synthesis in both indoor and outdoor scenes, even in the presence of distracting elements.



Method

Figure: Pipeline.

Given a casually captured image sequence, we decompose the whole scene into 2D Gaussians for transient objects and 3D Gaussians for the static scene. We first warm up by training a basic 3DGS to capture the static elements. This is followed by iterative training of the 2D and 3D Gaussians, where transients and statics are combined via an \(\alpha\)-blending strategy with masks to produce the final renderings; the masks guide the 3D Gaussians during this iterative stage. Finally, during joint training, both the 2D and 3D Gaussians are optimized together to further refine the decomposition.
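To make the schedule concrete, below is a minimal PyTorch-style sketch of the three stages. It assumes placeholder renderers render_3d(params_3d, camera) and render_2d(params_2d[k]); the loss, optimizers, and iteration counts are illustrative and not the authors' actual implementation.

    import torch

    def train_hybridgs(views, params_3d, params_2d, render_3d, render_2d,
                       warmup_iters=1000, iterative_rounds=4, joint_iters=1000):
        # params_3d: list of tensors for the shared 3D Gaussians (requires_grad=True).
        # params_2d: dict mapping view id -> list of tensors for that view's 2D Gaussians.
        opt_3d = torch.optim.Adam(params_3d, lr=1e-3)
        opt_2d = torch.optim.Adam([p for ps in params_2d.values() for p in ps], lr=1e-2)

        def composite_loss(view):
            static = render_3d(params_3d, view["camera"])          # I_s
            transient, mask = render_2d(params_2d[view["id"]])     # I_t, M_t
            composite = mask * transient + (1 - mask) * static     # alpha blending
            return (composite - view["image"]).abs().mean()

        # Stage 1: warm up the shared 3D Gaussians on the raw images.
        for it in range(warmup_iters):
            view = views[it % len(views)]
            loss = (render_3d(params_3d, view["camera"]) - view["image"]).abs().mean()
            opt_3d.zero_grad(); loss.backward(); opt_3d.step()

        # Stage 2: iterative training -- alternately update the per-view 2D Gaussians
        # (transients) and the shared 3D Gaussians (statics) under the composite loss.
        for _ in range(iterative_rounds):
            for opt in (opt_2d, opt_3d):
                for view in views:
                    loss = composite_loss(view)
                    opt_2d.zero_grad(); opt_3d.zero_grad()
                    loss.backward(); opt.step()

        # Stage 3: joint training -- update both sets of Gaussians together.
        for it in range(joint_iters):
            loss = composite_loss(views[it % len(views)])
            opt_2d.zero_grad(); opt_3d.zero_grad()
            loss.backward(); opt_2d.step(); opt_3d.step()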

Essentially, given a set of input images \(\{I_k \mid k=1,2,\dots,N\}\) with corresponding camera parameters, the goal of our method is, for each view \(I\), to decouple the transients \(I_t\) from the statics \(I_s\) as follows:

\[ I = M_t \odot I_t + (1 - M_t) \odot I_s, \]

where \(M_t \in [0, 1]\) denotes the transient mask and \(\odot\) is per-pixel multiplication. A pixel value in \(M_t\) close to 1 indicates a high probability that the location is transient; conversely, a lower value suggests the area is more likely static.
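As a concrete illustration of the blending equation, here is a minimal NumPy sketch; the array names and shapes are assumptions for illustration, not the authors' code.

    import numpy as np

    def composite(I_s, I_t, M_t):
        """Per-pixel blend: I = M_t * I_t + (1 - M_t) * I_s."""
        m = np.clip(M_t, 0.0, 1.0)[..., None]   # (H, W) -> (H, W, 1), broadcast over RGB
        return m * I_t + (1.0 - m) * I_s

    # Toy usage with random data.
    H, W = 4, 6
    I_s = np.random.rand(H, W, 3)   # static rendering from the shared 3D Gaussians
    I_t = np.random.rand(H, W, 3)   # transient rendering from this view's 2D Gaussians
    M_t = np.random.rand(H, W)      # per-pixel transient probability
    I = composite(I_s, I_t, M_t)
    assert I.shape == (H, W, 3)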

To achieve this, we decompose the whole 3D scene into two components:
(1) Multi-view consistent 3D Gaussians are used to render \(I_s\): they leverage multi-view information from the images and model the static scene with a single unified set of 3D Gaussians. The multi-view inputs regulate the 3D Gaussians to be consistent and robust across different views.
(2) Single-view independent 2D Gaussians are responsible for modeling \(I_t\), which lets our approach handle casually captured images with varying transients. Concretely, we form a set of view-independent 2D Gaussians that model the transients as planar objects seen from a single view.

This combination yields a more precise and reasonable representation of the 3D scene, enabling better novel view synthesis.
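To make the hybrid representation concrete, the sketch below contrasts the two components as plain data containers. The exact attribute set (standard 3DGS parameters for the static part, image-plane analogues for the transient part) is our assumption for illustration, not the paper's definition.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Gaussians3D:
        """One shared set for the static scene; rendered from any camera pose."""
        means: np.ndarray       # (N, 3) world-space centers
        scales: np.ndarray      # (N, 3) per-axis extents
        rotations: np.ndarray   # (N, 4) quaternions
        opacities: np.ndarray   # (N,)
        colors: np.ndarray      # (N, 3) colors (or SH coefficients)

    @dataclass
    class Gaussians2D:
        """One independent set per training image; models that view's transients
        as planar splats and is never shared across views."""
        means: np.ndarray       # (M, 2) image-plane centers
        scales: np.ndarray      # (M, 2) per-axis extents in pixels
        rotations: np.ndarray   # (M,) in-plane angles
        opacities: np.ndarray   # (M,)
        colors: np.ndarray      # (M, 3)

    # In practice: a single Gaussians3D instance for the whole scene, plus a dict
    # mapping each training-view id to its own Gaussians2D instance.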



Results

Table: Comparison on the NeRF On-the-go dataset (PSNR↑ / SSIM↑ / LPIPS↓ per scene).
Low occlusion: Mountain, Fountain; medium occlusion: Corner, Patio; high occlusion: Spot, Patio-High.

Method          | Mountain              | Fountain              | Corner                | Patio                 | Spot                  | Patio-High
RobustNeRF      | 17.54 / 0.496 / 0.383 | 15.65 / 0.318 / 0.576 | 23.04 / 0.764 / 0.244 | 20.39 / 0.718 / 0.251 | 20.65 / 0.625 / 0.391 | 20.54 / 0.578 / 0.366
NeRF On-the-go  | 20.15 / 0.644 / 0.259 | 20.11 / 0.609 / 0.314 | 24.22 / 0.806 / 0.190 | 20.78 / 0.754 / 0.219 | 23.33 / 0.787 / 0.189 | 21.41 / 0.718 / 0.235
3DGS            | 19.40 / 0.638 / 0.213 | 19.96 / 0.659 / 0.185 | 20.90 / 0.713 / 0.241 | 17.48 / 0.704 / 0.199 | 20.77 / 0.693 / 0.316 | 17.29 / 0.604 / 0.363
SLS-mlp*        | 19.84 / 0.580 / 0.294 | 20.19 / 0.612 / 0.258 | 24.03 / 0.795 / 0.258 | 21.55 / 0.838 / 0.065 | 23.52 / 0.756 / 0.185 | 20.31 / 0.664 / 0.259
HybridGS (Ours) | 21.73 / 0.693 / 0.284 | 21.11 / 0.674 / 0.252 | 25.03 / 0.847 / 0.151 | 21.98 / 0.812 / 0.169 | 24.33 / 0.794 / 0.196 | 21.77 / 0.741 / 0.211
Table: Comparison on the RobustNeRF dataset (PSNR↑ / SSIM↑ / LPIPS↓ per scene; "-" indicates a value not reported).

Method          | Statue              | Android             | Yoda                | Crab (1)            | Crab (2)
RobustNeRF      | 20.60 / 0.76 / 0.15 | 23.28 / 0.75 / 0.13 | 29.78 / 0.82 / 0.15 | 32.22 / 0.94 / 0.06 | - / - / -
NeRF On-the-go  | 21.58 / 0.77 / 0.24 | 23.50 / 0.75 / 0.21 | 29.96 / 0.83 / 0.24 | - / - / -           | - / - / -
3DGS            | 21.02 / 0.81 / 0.16 | 23.11 / 0.81 / 0.13 | 26.33 / 0.91 / 0.14 | 31.80 / 0.96 / 0.08 | 29.74 / - / -
SLS-mlp         | 22.54 / 0.84 / 0.13 | 25.05 / 0.85 / 0.09 | 33.66 / 0.96 / 0.10 | 35.85 / 0.97 / 0.08 | 34.43 / - / -
HybridGS (Ours) | 22.93 / 0.87 / 0.10 | 25.15 / 0.85 / 0.07 | 35.32 / 0.96 / 0.07 | 36.31 / 0.97 / 0.05 | 35.17 / 0.96 / 0.08


Visualization

Figure: Visualization of novel view synthesis results on the testing set of NeRF On-the-go dataset.

Figure: Visualization of scene decomposition into transients and statics.

Figure: Qualitative results compared to 3DGS during the training steps.

Our method demonstrates superior results by effectively reducing artifacts and producing clearer boundaries. This yields a cleaner static reconstruction than other methods, with enhanced visual quality and precision in novel views.

Our method also achieves superior transient mask separation in both indoor and outdoor scenes: it cleanly separates transients from statics, and the resulting renderings closely resemble the ground-truth images.

Finally, as training progresses, 3DGS tends to gradually absorb transient elements into the static components, leaving the residuals almost incapable of capturing transient content. In contrast, HybridGS increasingly distinguishes transients from statics over time, leading to consistent improvements.



BibTeX

@InProceedings{lin2024hybridgs,
    author    = {Lin, Jingyu and Gu, Jiaqi and Fan, Lubin and Wu, Bojian and Lou, Yujing and Chen, Renjie and Liu, Ligang and Ye, Jieping},
    title     = {HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting},
    booktitle = {arXiv},
    year      = {2024}
}