FVO: Fast Visual Odometry with Transformers

University of Amsterdam,
*Equal contribution

TLDR

Given a video, FVO estimates the metric camera trajectory from it. The method relies solely on images as input and does not need post-optimization.

Method Overview

Fast Visual Odometry Pipeline. The metric camera trajectory is derived by passing overlapping image windows through a transformer that estimates relative camera poses together with confidence scores. The inference module then integrates these pose and confidence estimates into a single trajectory. FVO is almost twice as fast as the fastest baseline on commodity hardware. Moreover, our method relies neither on camera parameters nor on test-time optimization.
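The window-fusion step can be illustrated with a minimal sketch: each window yields relative poses between consecutive frames plus confidences; where windows overlap, one simple policy is to keep the highest-confidence estimate per frame pair and then chain the survivors into a global trajectory. This is our own simplification for illustration (the function name and the max-confidence rule are assumptions, not the paper's inference module):

```python
import numpy as np

def fuse_and_chain(window_preds):
    """Fuse per-window relative-pose predictions into one trajectory.

    window_preds: list of (start_idx, rel_poses, confidences), where
    rel_poses[k] is the 4x4 transform from frame start_idx+k to frame
    start_idx+k+1 and confidences[k] is its scalar confidence.
    """
    # For each frame pair, keep the most confident overlapping estimate.
    best = {}  # frame index i -> (confidence, relative pose i -> i+1)
    for start, rels, confs in window_preds:
        for k, (T, c) in enumerate(zip(rels, confs)):
            i = start + k
            if i not in best or c > best[i][0]:
                best[i] = (c, T)
    # Chain the surviving relative poses into absolute poses,
    # anchoring the first camera at the identity.
    traj = [np.eye(4)]
    for i in sorted(best):
        traj.append(traj[-1] @ best[i][1])
    return traj
```

For example, two windows covering frames 0-2 and 1-3 produce duplicate estimates for the pair (1, 2); the sketch resolves the duplicate by confidence before composing the trajectory. A real system would likely fuse overlapping rotations more carefully (e.g. on the SO(3) manifold) rather than picking one.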



Visual Odometry Transformer

Visual odometry transformer architecture. Given multiple input frames, a frozen image encoder extracts per-image token embeddings. Camera embeddings are then concatenated to aggregate the information for camera pose estimation. The embeddings are decoded by L repeated decoder blocks with temporal and spatial attention modules. The rotations are projected onto the SO(3) manifold to ensure valid relative rotations.
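The SO(3) projection step can be sketched with the standard SVD-based orthogonalization: take the raw 3x3 rotation output, replace its singular values with ones, and flip the sign of the last one if needed so the determinant is +1. A minimal NumPy sketch (the function name is ours; the paper does not specify this exact implementation):

```python
import numpy as np

def project_to_so3(m):
    """Project an arbitrary 3x3 matrix onto the nearest rotation in SO(3).

    Uses the SVD m = U S V^T and returns U diag(1, 1, d) V^T, where
    d = sign(det(U V^T)) guarantees determinant +1 (a proper rotation,
    not a reflection).
    """
    u, _, vt = np.linalg.svd(m)
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt
```

The result is orthogonal with unit determinant, so it is a valid relative rotation regardless of how noisy the network's raw 3x3 output is.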

Camera Pose Estimation Results

Trajectories estimated from videos of the test splits of indoor (ScanNet, ARKit) and outdoor (KITTI) datasets. The TUM dataset was not used during training. Trajectories are evaluated over the whole video without alignment to the ground truth.

ARKit 41069050

ARKit 41159557

ARKit 41254382

ScanNet 0732

ScanNet 0762

ScanNet 0794

KITTI 03

KITTI 04

KITTI 07

TUM xyz

TUM desk2

TUM long office

BibTeX

@misc{yugay2026fvofastvisualodometry,
      title={FVO: Fast Visual Odometry with Transformers}, 
      author={Vladimir Yugay and Duy-Kien Nguyen and Theo Gevers and Cees G. M. Snoek and Martin R. Oswald},
      year={2026},
      eprint={2510.03348},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.03348}, 
}