Given a set of images and the associated camera poses \( \{(I, p)\} \), our goal is to train a neural network that takes a query image \( I^* \) as input and directly outputs its corresponding camera pose \( p^* \).
Here the camera pose is represented as a \( 3\times 4 \) matrix composed of a rotation and a translation with respect to a reference coordinate system.
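Explicitly, writing the rotation as \( R \in SO(3) \) and the translation as \( t \in \mathbb{R}^3 \), the pose takes the block form
\[
p = \begin{bmatrix} R & t \end{bmatrix} \in \mathbb{R}^{3 \times 4}.
\]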
The whole pipeline is illustrated in the figure.
It contains two main entangled modules: APRNet and NeRF-P.
Concretely, given an input image \( I \), APRNet leverages separate branches to extract image features \( \mathcal{F}_{image}(I) \) and to estimate the camera pose \( \hat{p} \).
Given the ground-truth pose, NeRF-P subsequently renders a synthetic image \( \hat{I} \), which is forwarded to the feature extraction branch of APRNet to obtain \( \mathcal{F}_{image}(\hat{I}) \).
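As a minimal sketch of this interplay (the module structure, layer sizes, and the `nerf_p.render` interface below are illustrative assumptions, not the actual implementation), APRNet can be viewed as a shared backbone with an image-feature branch and a pose branch, whose feature branch also re-encodes the rendered view:

```python
import torch
import torch.nn as nn

class APRNet(nn.Module):
    """Illustrative sketch: a shared backbone with an image-feature branch and a pose branch."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.feature_branch = nn.Linear(64, feat_dim)  # F_image(.)
        self.pose_branch = nn.Linear(64, 12)           # flattened 3x4 pose

    def forward(self, image):
        h = self.backbone(image)
        return self.feature_branch(h), self.pose_branch(h).view(-1, 3, 4)

# Assumed usage in one training pass (nerf_p is a hypothetical NeRF-P wrapper):
#   feats_I, pose_hat = aprnet(I)          # image features and estimated pose
#   I_synth = nerf_p.render(p_gt)          # synthetic view at the ground-truth pose
#   feats_I_synth, _ = aprnet(I_synth)     # re-encode the rendered image
```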
In addition, we propose a novel implicit pose feature \( \mathcal{F}_{pose}(\hat{p}) \), called PoseMap, obtained by augmenting the volumetric rendering module of the original NeRF architecture with an extra pose feature branch; this feature is further used for pose prediction.
The key idea behind this design choice is that NeRF itself is an abstraction of the scene's visual and geometric information.
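One possible instantiation of this extra branch, under our own assumptions about layer sizes and naming (the real NeRF-P additionally takes positional encodings and view directions and predicts density and colour), attaches a pose-feature head to the NeRF trunk and aggregates the per-sample features along each ray with the standard volume-rendering weights:

```python
import torch
import torch.nn as nn

class NeRFP(nn.Module):
    """Illustrative sketch: a NeRF MLP whose trunk feeds an extra pose-feature head
    alongside the usual density and colour heads."""
    def __init__(self, pos_dim=63, hidden=256, pose_feat_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)                   # volume density
        self.rgb_head = nn.Linear(hidden, 3)                     # radiance
        self.pose_feat_head = nn.Linear(hidden, pose_feat_dim)   # extra branch for PoseMap

    def forward(self, x):
        h = self.trunk(x)
        return self.sigma_head(h), self.rgb_head(h), self.pose_feat_head(h)

def render_posemap(sigma, pose_feat, deltas):
    """Aggregate per-sample pose features along each ray with volume-rendering
    weights, yielding one PoseMap feature per ray."""
    # sigma: (R, S, 1), pose_feat: (R, S, D), deltas: (R, S, 1) sample spacings
    alpha = 1.0 - torch.exp(-torch.relu(sigma) * deltas)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                       # (R, S, 1)
    return (weights * pose_feat).sum(dim=1)                       # (R, D)
```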
We propose an autoencoder-style pose branch: NeRF is leveraged to aggregate global attributes from the samples along light rays into the PoseMap, while a pose decoder composed of a series of MLPs distills the implicit pose features.
This novel combination allows for a more precise and detailed representation of the camera pose.
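The decoder side of this autoencoder-style branch could then look as follows; the pooling over rays, the layer widths, and the flattened \( 3\times 4 \) output are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Illustrative sketch: a stack of MLPs that distills the aggregated PoseMap
    features back into an explicit camera pose."""
    def __init__(self, pose_feat_dim=128, hidden=128):
        super().__init__()
        self.mlps = nn.Sequential(
            nn.Linear(pose_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 12))  # flattened 3x4 pose

    def forward(self, posemap):
        # posemap: (R, D) per-ray PoseMap features; pool over rays to a global code.
        global_feat = posemap.mean(dim=0, keepdim=True)
        return self.mlps(global_feat).view(-1, 3, 4)
```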
In all, APRNet is optimized under the supervision of the ground-truth PoseMap \( \mathcal{F}_{pose}(p) \) and by minimizing the discrepancy between the image features \( \mathcal{F}_{image}(I) \) and \( \mathcal{F}_{image}(\hat{I}) \).
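As a rough sketch of this objective (the specific loss forms, names, and weights below are placeholders, not the exact formulation), the two supervision signals can be combined as:

```python
import torch.nn.functional as F

def aprnet_loss(posemap_pred, posemap_gt, feats_real, feats_synth,
                w_map=1.0, w_feat=1.0):
    """Illustrative combination of the two supervision signals described above:
    the ground-truth PoseMap F_pose(p) and the image-feature discrepancy."""
    loss_map = F.mse_loss(posemap_pred, posemap_gt)    # PoseMap supervision
    loss_feat = F.mse_loss(feats_real, feats_synth)    # F_image(I) vs F_image(I_hat)
    return w_map * loss_map + w_feat * loss_feat
```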