Given a video, VoT estimates the metric camera trajectory. The method relies solely on images as input and requires no post-optimization.
VoT Architecture. Given multiple input frames, a frozen image encoder extracts per-image token embeddings. Camera embeddings are then concatenated to aggregate the information needed for camera pose estimation. The embeddings are decoded by L repeated decoder blocks with temporal and spatial attention modules. The predicted rotations are projected onto the SO(3) manifold to ensure valid relative rotations.
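The sketch below illustrates this design in PyTorch: a decoder block that alternates spatial attention (over tokens within a frame) with temporal attention (over frames), plus an SVD-based projection onto SO(3). Module names, dimensions, and the attention ordering are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One of L repeated blocks: spatial attention within each frame,
    then temporal attention across frames for the same token index (assumed layout)."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                                  # x: (T, N, D) = frames, tokens, dim
        h = self.norm1(x)                                  # spatial: attend over N tokens per frame
        x = x + self.spatial_attn(h, h, h)[0]
        h = self.norm2(x).transpose(0, 1)                  # temporal: attend over T frames per token
        x = x + self.temporal_attn(h, h, h)[0].transpose(0, 1)
        return x + self.mlp(self.norm3(x))                 # feed-forward with residual

def project_to_so3(m):
    """Project a predicted 3x3 matrix onto SO(3) via SVD so the output is a valid rotation."""
    u, _, vt = torch.linalg.svd(m)
    det = torch.det(u @ vt)                                # flip the last axis if det is -1
    d = torch.diag_embed(torch.stack([torch.ones_like(det), torch.ones_like(det), det], dim=-1))
    return u @ d @ vt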
Trajectories estimated from videos in the test splits of indoor (ScanNet, ARKit) and outdoor (KITTI) datasets. The TUM dataset was not used during training. Trajectories are evaluated over the whole video without alignment to the ground truth (a sketch of this unaligned evaluation follows the sequence list below).
Sequences shown: ARKit 41069050, 41159557, 41254382; ScanNet 0732, 0762, 0794; KITTI 03, 04, 07; TUM xyz, desk2, long office.
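A minimal sketch of the unaligned evaluation described above: absolute trajectory error accumulated over the full video in metric scale, with no SE(3)/Sim(3) alignment to the ground truth. The pose format and error definition are assumptions for illustration.

import numpy as np

def ate_rmse_unaligned(pred_poses, gt_poses):
    """pred_poses, gt_poses: (T, 4, 4) camera-to-world matrices in metric scale."""
    pred_t = pred_poses[:, :3, 3]
    gt_t = gt_poses[:, :3, 3]
    errors = np.linalg.norm(pred_t - gt_t, axis=1)   # per-frame translation error (meters)
    return float(np.sqrt((errors ** 2).mean()))      # RMSE over the whole trajectory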
@misc{yugay2025visualodometrytransformers,
      title={Visual Odometry with Transformers},
      author={Vladimir Yugay and Duy-Kien Nguyen and Theo Gevers and Cees G. M. Snoek and Martin R. Oswald},
      year={2025},
      eprint={2510.03348},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.03348},
}