Given a casually captured image sequence, we decompose the scene into 2D Gaussians for transient objects and 3D Gaussians for the static scene. We first warm up by training a basic 3DGS to capture the static elements. This is followed by iterative training of the 2D and 3D Gaussians, where the transient and static renderings are combined with an \(\alpha\)-blending strategy using masks to produce the final renderings; the masks also provide guidance for the 3D Gaussians during this iterative stage. During joint training, both the 2D and 3D Gaussians are optimized to further refine the decomposition.
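To make the schedule concrete, the sketch below outlines it in PyTorch-style Python. The wrappers `static_3dgs` and `transient_2dgs` (with `render` and `step` methods), the loss `loss_fn`, and the iteration counts are hypothetical placeholders introduced only for illustration; they are not part of our actual implementation.

```python
def train(images, cameras, static_3dgs, transient_2dgs, loss_fn,
          warmup_iters=3000, joint_iters=30000):
    """Illustrative two-stage schedule: warm-up, then iterative/joint training."""
    # Stage 1: warm up a plain 3DGS on all views so the statics dominate early.
    for it in range(warmup_iters):
        k = it % len(images)
        I_s = static_3dgs.render(cameras[k])
        static_3dgs.step(loss_fn(I_s, images[k]))

    # Stage 2: iteratively train the 2D and 3D Gaussians. The rendered transient
    # mask M_t gates how much of each pixel the static model is asked to explain.
    for it in range(joint_iters):
        k = it % len(images)
        I_s = static_3dgs.render(cameras[k])
        I_t, M_t = transient_2dgs.render(k)       # per-view 2D Gaussians
        I = M_t * I_t + (1 - M_t) * I_s           # alpha blending (see Eq. below)
        loss = loss_fn(I, images[k])
        transient_2dgs.step(loss)
        static_3dgs.step(loss)                    # joint refinement
```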
Formally, given a set of input images \(\{I_k \mid k=1,2,\dots,N\}\) with corresponding camera parameters, the goal of our method is to decouple each view \(I\) into transients \(I_t\) and statics \(I_s\) as follows:
\[
I = M_t \odot I_t + (1 - M_t) \odot I_s,
\]
where \(M_t \in [0, 1]\) is the transient mask and \(\odot\) denotes element-wise (per-pixel) multiplication. A pixel value in \(M_t\) close to 1 indicates a high probability that the location is transient; conversely, a lower value indicates a higher likelihood that the area is static.
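For clarity, a minimal PyTorch sketch of this compositing step is given below; the tensor shapes and names are illustrative, with random tensors standing in for the actual renders.

```python
import torch

def composite(static_rgb: torch.Tensor,
              transient_rgb: torch.Tensor,
              transient_mask: torch.Tensor) -> torch.Tensor:
    """Alpha-blend the transient and static renderings for one view.

    static_rgb, transient_rgb: (H, W, 3) renders from the 3D / 2D Gaussians.
    transient_mask: (H, W, 1) values in [0, 1]; 1 -> transient, 0 -> static.
    """
    return transient_mask * transient_rgb + (1.0 - transient_mask) * static_rgb

# Toy usage.
H, W = 64, 64
I_s = torch.rand(H, W, 3)   # static render from the 3D Gaussians
I_t = torch.rand(H, W, 3)   # transient render from the 2D Gaussians
M_t = torch.rand(H, W, 1)   # soft transient mask in [0, 1]
I = composite(I_s, I_t, M_t)
```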
To achieve this goal, we decompose the whole 3D scene into two components:
(1) Multi-view consistent 3D Gaussians are used to render \(I_s\): they leverage multi-view information from the images and model the static scene with a single unified set of 3D Gaussians. The multi-view inputs regularize the 3D Gaussians to be consistent and robust across different views.
(2) Single-view independent 2D Gaussians are responsible for modeling \(I_t\), which enables our approach to handle casually captured images whose transients vary from view to view. Concretely, we form a set of view-independent 2D Gaussians that model transients as planar objects in each single view (see the sketch below).
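The snippet below illustrates how such planar, image-space 2D Gaussians can be splatted into a transient color image \(I_t\) and a soft mask \(M_t\) for a single view. It is a schematic sketch, not our actual rasterizer: the parameterization (pixel-space means, 2x2 inverse covariances), the assumption that the Gaussians are already depth-sorted, and the simple front-to-back accumulation are simplifications introduced here for illustration.

```python
import torch

def splat_2d_gaussians(means: torch.Tensor,      # (G, 2) pixel-space centers (x, y)
                       inv_covs: torch.Tensor,   # (G, 2, 2) inverse 2x2 covariances
                       colors: torch.Tensor,     # (G, 3) RGB per Gaussian
                       opacities: torch.Tensor,  # (G,) in [0, 1]
                       H: int, W: int):
    """Render view-independent 2D Gaussians as planar splats in image space.

    Returns a transient image I_t of shape (H, W, 3) and a soft mask M_t of
    shape (H, W, 1) obtained by accumulating the Gaussians' alpha values.
    """
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).float()              # (H, W, 2)
    d = pix[None] - means[:, None, None, :]                  # (G, H, W, 2)
    # Mahalanobis distance per Gaussian: d^T Sigma^{-1} d.
    m = torch.einsum("ghwi,gij,ghwj->ghw", d, inv_covs, d)
    alpha = opacities[:, None, None] * torch.exp(-0.5 * m)   # (G, H, W)
    # Front-to-back 'over' accumulation, assuming depth-sorted Gaussians.
    T = torch.ones(H, W)
    I_t = torch.zeros(H, W, 3)
    M_t = torch.zeros(H, W)
    for g in range(means.shape[0]):
        w = T * alpha[g]
        I_t += w[..., None] * colors[g]
        M_t += w
        T = T * (1.0 - alpha[g])
    return I_t, M_t.unsqueeze(-1)

# Toy usage: 4 random isotropic Gaussians (std ~5 px) splatted onto a 64x64 view.
G, H, W = 4, 64, 64
means = torch.rand(G, 2) * torch.tensor([W, H])
covs = torch.eye(2).expand(G, 2, 2) * 25.0
I_t, M_t = splat_2d_gaussians(means, torch.linalg.inv(covs),
                              torch.rand(G, 3), torch.full((G,), 0.8), H, W)
```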
This combination allows for a more precise and reasonable representation of the 3D scene, enabling better performance on novel view synthesis.