Visual Odometry with Transformers

University of Amsterdam

TLDR

Given a video, VoT estimates the metric camera trajectory. The method relies solely on images as input and requires no post-optimization.

Method Overview

VoT Architecture. Given multiple input frames, a frozen image encoder extracts per-image token embeddings. Camera embeddings are then concatenated to aggregate information for camera pose estimation. The embeddings are decoded by L repeated decoder blocks with temporal and spatial attention modules. The predicted rotations are projected onto the SO(3) manifold to ensure valid relative rotations.
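
To make the alternating attention concrete, below is a minimal PyTorch sketch of one such decoder block and of an SO(3) projection. All names (VoTDecoderBlock, project_to_so3) and details (normalization placement, head count, token layout) are placeholders and assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class VoTDecoderBlock(nn.Module):
    # One of the L repeated decoder blocks: temporal attention across frames,
    # then spatial attention within each frame. Names and details are assumptions.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, frames, tokens, dim) -- frozen-encoder image tokens plus a camera token per frame
        b, t, n, d = x.shape
        # Temporal attention: each token attends to its counterparts in the other frames.
        h = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        q = self.norm1(h)
        h = h + self.temporal_attn(q, q, q)[0]
        x = h.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # Spatial attention: tokens attend to each other within the same frame.
        h = x.reshape(b * t, n, d)
        q = self.norm2(h)
        h = h + self.spatial_attn(q, q, q)[0]
        h = h + self.mlp(self.norm3(h))
        return h.reshape(b, t, n, d)

def project_to_so3(m):
    # Project arbitrary 3x3 matrices of shape (..., 3, 3) onto SO(3) via SVD,
    # flipping the last singular direction when needed so that det(R) = +1.
    u, _, vt = torch.linalg.svd(m)
    s = torch.ones_like(m[..., 0])
    s[..., -1] = torch.det(u @ vt)
    return u @ torch.diag_embed(s) @ vt

The SVD-based projection shown here is the standard orthogonal Procrustes solution for mapping a predicted 3x3 matrix to a valid rotation; whether VoT uses this or another rotation parameterization is not specified on this page.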

Camera Pose Estimation Results

Trajectories estimated from videos of the test splits of indoor (ScanNet, ARKit) and outdoor (KITTI) datasets. The TUM dataset was not used during training. Trajectories are evaluated over the whole video without alignment to the ground truth (a sketch of this unaligned evaluation follows the sequence list below).

Sequences shown: ARKit 41069050, 41159557, 41254382; ScanNet 0732, 0762, 0794; KITTI 03, 04, 07; TUM xyz, desk2, long office.
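
Evaluating without alignment means the predicted trajectory is compared to the ground truth directly in the shared metric world frame, with no Sim(3) or SE(3) fitting step. A minimal sketch of one common way to score this (absolute trajectory error RMSE over positions) is below; the exact metric used for these plots is an assumption, and the function name is a placeholder.

import numpy as np

def unaligned_ate_rmse(pred_xyz, gt_xyz):
    # Absolute trajectory error (RMSE, meters) computed directly in the shared
    # metric world frame, with no Sim(3)/SE(3) alignment applied beforehand.
    pred_xyz = np.asarray(pred_xyz)
    gt_xyz = np.asarray(gt_xyz)
    assert pred_xyz.shape == gt_xyz.shape  # both (N, 3) position arrays
    errors = np.linalg.norm(pred_xyz - gt_xyz, axis=-1)
    return float(np.sqrt(np.mean(errors ** 2)))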

BibTeX

@misc{yugay2025visualodometrytransformers,
      title={Visual Odometry with Transformers}, 
      author={Vladimir Yugay and Duy-Kien Nguyen and Theo Gevers and Cees G. M. Snoek and Martin R. Oswald},
      year={2025},
      eprint={2510.03348},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.03348}, 
}