Given a set of images and the associated camera poses \( \{(I, p)\} \), our goal is to train a neural network that takes a query image \( I^* \) as input and directly outputs its corresponding camera pose \( p^* \).
Here the camera pose is represented as a \( 3\times 4 \) matrix composed of a rotation and a translation with respect to a reference coordinate system.
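Explicitly, writing the rotation as \( R \in SO(3) \) and the translation as \( t \in \mathbb{R}^3 \), the pose takes the block form
\[
p = \begin{bmatrix} R & t \end{bmatrix} \in \mathbb{R}^{3 \times 4}.
\]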
The whole pipeline is illustrated in the figure.
It contains two main entangled modules: APRNet and NeRF-P.
Concretely, given an input image \( I \), APRNet leverages separate branches to extract image features \( \mathcal{F}_{image}(I) \) and to estimate the camera pose \( \hat{p} \).
Given the ground-truth pose, NeRF-P subsequently renders a synthetic image \( \hat{I} \), which is forwarded to the feature extraction branch of APRNet to obtain \( \mathcal{F}_{image}(\hat{I}) \).
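As a minimal sketch of this interplay (the module structure, layer sizes, and the `nerf_p.render` interface below are illustrative assumptions, not the actual implementation), APRNet can be viewed as a shared backbone with an image-feature branch and a pose branch, whose feature branch also re-encodes the rendered view:

```python
import torch
import torch.nn as nn

class APRNet(nn.Module):
    """Illustrative sketch: a shared backbone with an image-feature branch and a pose branch."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.feature_branch = nn.Linear(64, feat_dim)  # F_image(.)
        self.pose_branch = nn.Linear(64, 12)           # flattened 3x4 pose

    def forward(self, image):
        h = self.backbone(image)
        return self.feature_branch(h), self.pose_branch(h).view(-1, 3, 4)

# Assumed usage in one training pass (nerf_p is a hypothetical NeRF-P wrapper):
#   feats_I, pose_hat = aprnet(I)          # image features and estimated pose
#   I_synth = nerf_p.render(p_gt)          # synthetic view at the ground-truth pose
#   feats_I_synth, _ = aprnet(I_synth)     # re-encode the rendered image
```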
In addition, we propose a novel implicit pose feature \( \mathcal{F}_{pose}(\hat{p}) \), called PoseMap, obtained by augmenting the volumetric rendering module of the original NeRF architecture with an extra pose feature branch; this feature is further used for pose prediction.
The key idea behind this design choice is that NeRF itself is an abstraction of the scene's visual and geometric information.
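One possible instantiation of this extra branch, under our own assumptions about layer sizes and naming (the real NeRF-P additionally takes positional encodings and view directions and predicts density and colour), attaches a pose-feature head to the NeRF trunk and aggregates the per-sample features along each ray with the standard volume-rendering weights:

```python
import torch
import torch.nn as nn

class NeRFP(nn.Module):
    """Illustrative sketch: a NeRF MLP whose trunk feeds an extra pose-feature head
    alongside the usual density and colour heads."""
    def __init__(self, pos_dim=63, hidden=256, pose_feat_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(pos_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma_head = nn.Linear(hidden, 1)                   # volume density
        self.rgb_head = nn.Linear(hidden, 3)                     # radiance
        self.pose_feat_head = nn.Linear(hidden, pose_feat_dim)   # extra branch for PoseMap

    def forward(self, x):
        h = self.trunk(x)
        return self.sigma_head(h), self.rgb_head(h), self.pose_feat_head(h)

def render_posemap(sigma, pose_feat, deltas):
    """Aggregate per-sample pose features along each ray with volume-rendering
    weights, yielding one PoseMap feature per ray."""
    # sigma: (R, S, 1), pose_feat: (R, S, D), deltas: (R, S, 1) sample spacings
    alpha = 1.0 - torch.exp(-torch.relu(sigma) * deltas)
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                       # (R, S, 1)
    return (weights * pose_feat).sum(dim=1)                       # (R, D)
```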
We propose an autoencoder-style pose branch: NeRF is leveraged to aggregate global attributes from the samples along light rays into the PoseMap, while a pose decoder composed of a series of MLPs distills the implicit pose features.
This novel combination allows for a more precise and detailed representation of the camera pose.
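The decoder side of this autoencoder-style branch could then look as follows; the pooling over rays, the layer widths, and the flattened \( 3\times 4 \) output are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class PoseDecoder(nn.Module):
    """Illustrative sketch: a stack of MLPs that distills the aggregated PoseMap
    features back into an explicit camera pose."""
    def __init__(self, pose_feat_dim=128, hidden=128):
        super().__init__()
        self.mlps = nn.Sequential(
            nn.Linear(pose_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 12))  # flattened 3x4 pose

    def forward(self, posemap):
        # posemap: (R, D) per-ray PoseMap features; pool over rays to a global code.
        global_feat = posemap.mean(dim=0, keepdim=True)
        return self.mlps(global_feat).view(-1, 3, 4)
```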
In all, APRNet is optimized under the supervision of the ground-truth PoseMap \( \mathcal{F}_{pose}(p) \) and by minimizing the discrepancy between the image features \( \mathcal{F}_{image}(I) \) and \( \mathcal{F}_{image}(\hat{I}) \).
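As a rough sketch of this objective (the specific loss forms, names, and weights below are placeholders, not the exact formulation), the two supervision signals can be combined as:

```python
import torch.nn.functional as F

def aprnet_loss(posemap_pred, posemap_gt, feats_real, feats_synth,
                w_map=1.0, w_feat=1.0):
    """Illustrative combination of the two supervision signals described above:
    the ground-truth PoseMap F_pose(p) and the image-feature discrepancy."""
    loss_map = F.mse_loss(posemap_pred, posemap_gt)    # PoseMap supervision
    loss_feat = F.mse_loss(feats_real, feats_synth)    # F_image(I) vs F_image(I_hat)
    return w_map * loss_map + w_feat * loss_feat
```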