Given a casually captured image sequence, we decompose the scene into 2D Gaussians for transient objects and 3D Gaussians for the static scene. We first warm up by training a basic 3DGS to capture the static elements. This is followed by iterative training of the 2D and 3D Gaussians, where the transient and static renderings are combined with an \(\alpha\)-blending strategy using masks to produce the final renderings; the masks also provide guidance for the 3D Gaussians during this iterative stage. During joint training, both the 2D and 3D Gaussians are optimized to further refine the decomposition.
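To make the schedule concrete, the sketch below outlines it in PyTorch-style Python. The wrappers `static_3dgs` and `transient_2dgs` (with `render` and `step` methods), the loss `loss_fn`, and the iteration counts are hypothetical placeholders introduced only for illustration; they are not part of our actual implementation.

```python
def train(images, cameras, static_3dgs, transient_2dgs, loss_fn,
          warmup_iters=3000, joint_iters=30000):
    """Illustrative two-stage schedule: warm-up, then iterative/joint training."""
    # Stage 1: warm up a plain 3DGS on all views so the statics dominate early.
    for it in range(warmup_iters):
        k = it % len(images)
        I_s = static_3dgs.render(cameras[k])
        static_3dgs.step(loss_fn(I_s, images[k]))

    # Stage 2: iteratively train the 2D and 3D Gaussians. The rendered transient
    # mask M_t gates how much of each pixel the static model is asked to explain.
    for it in range(joint_iters):
        k = it % len(images)
        I_s = static_3dgs.render(cameras[k])
        I_t, M_t = transient_2dgs.render(k)       # per-view 2D Gaussians
        I = M_t * I_t + (1 - M_t) * I_s           # alpha blending (see Eq. below)
        loss = loss_fn(I, images[k])
        transient_2dgs.step(loss)
        static_3dgs.step(loss)                    # joint refinement
```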
Formally, given a set of input images \(\{I_k \mid k=1,2,\dots,N\}\) with corresponding camera parameters, the goal of our method is to decouple each view \(I\) into transients \(I_t\) and statics \(I_s\) as follows:
\[
I = M_t \odot I_t + (1 - M_t) \odot I_s,
\]
where \(M_t \in [0, 1]\) is the transient mask and \(\odot\) denotes element-wise (per-pixel) multiplication. A pixel value in \(M_t\) close to 1 indicates a high probability that the location is transient; conversely, a lower value indicates a higher likelihood that the area is static.
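For clarity, a minimal PyTorch sketch of this compositing step is given below; the tensor shapes and names are illustrative, with random tensors standing in for the actual renders.

```python
import torch

def composite(static_rgb: torch.Tensor,
              transient_rgb: torch.Tensor,
              transient_mask: torch.Tensor) -> torch.Tensor:
    """Alpha-blend the transient and static renderings for one view.

    static_rgb, transient_rgb: (H, W, 3) renders from the 3D / 2D Gaussians.
    transient_mask: (H, W, 1) values in [0, 1]; 1 -> transient, 0 -> static.
    """
    return transient_mask * transient_rgb + (1.0 - transient_mask) * static_rgb

# Toy usage.
H, W = 64, 64
I_s = torch.rand(H, W, 3)   # static render from the 3D Gaussians
I_t = torch.rand(H, W, 3)   # transient render from the 2D Gaussians
M_t = torch.rand(H, W, 1)   # soft transient mask in [0, 1]
I = composite(I_s, I_t, M_t)
```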
To achieve this goal, we decompose the whole 3D scene into two components:
(1) Multi-view consistent 3D Gaussians are used to render \(I_s\): they leverage multi-view information from the images and model the static scene with a single unified set of 3D Gaussians. The multi-view inputs regularize the 3D Gaussians to be consistent and robust across different views.
(2) Single-view independent 2D Gaussians are responsible for modeling \(I_t\), which enables our approach to handle casually captured images whose transients vary from view to view. Concretely, we form a set of view-independent 2D Gaussians that model transients as planar objects in each single view (see the sketch below).
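The snippet below illustrates how such planar, image-space 2D Gaussians can be splatted into a transient color image \(I_t\) and a soft mask \(M_t\) for a single view. It is a schematic sketch, not our actual rasterizer: the parameterization (pixel-space means, 2x2 inverse covariances), the assumption that the Gaussians are already depth-sorted, and the simple front-to-back accumulation are simplifications introduced here for illustration.

```python
import torch

def splat_2d_gaussians(means: torch.Tensor,      # (G, 2) pixel-space centers (x, y)
                       inv_covs: torch.Tensor,   # (G, 2, 2) inverse 2x2 covariances
                       colors: torch.Tensor,     # (G, 3) RGB per Gaussian
                       opacities: torch.Tensor,  # (G,) in [0, 1]
                       H: int, W: int):
    """Render view-independent 2D Gaussians as planar splats in image space.

    Returns a transient image I_t of shape (H, W, 3) and a soft mask M_t of
    shape (H, W, 1) obtained by accumulating the Gaussians' alpha values.
    """
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).float()              # (H, W, 2)
    d = pix[None] - means[:, None, None, :]                  # (G, H, W, 2)
    # Mahalanobis distance per Gaussian: d^T Sigma^{-1} d.
    m = torch.einsum("ghwi,gij,ghwj->ghw", d, inv_covs, d)
    alpha = opacities[:, None, None] * torch.exp(-0.5 * m)   # (G, H, W)
    # Front-to-back 'over' accumulation, assuming depth-sorted Gaussians.
    T = torch.ones(H, W)
    I_t = torch.zeros(H, W, 3)
    M_t = torch.zeros(H, W)
    for g in range(means.shape[0]):
        w = T * alpha[g]
        I_t += w[..., None] * colors[g]
        M_t += w
        T = T * (1.0 - alpha[g])
    return I_t, M_t.unsqueeze(-1)

# Toy usage: 4 random isotropic Gaussians (std ~5 px) splatted onto a 64x64 view.
G, H, W = 4, 64, 64
means = torch.rand(G, 2) * torch.tensor([W, H])
covs = torch.eye(2).expand(G, 2, 2) * 25.0
I_t, M_t = splat_2d_gaussians(means, torch.linalg.inv(covs),
                              torch.rand(G, 3), torch.full((G,), 0.8), H, W)
```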
This combination allows for a more precise and reasonable representation of the 3D scene, enabling better performance on novel view synthesis.