Given a video, FVO estimates its metric camera trajectory. The method relies solely on images as input and requires no post-optimization.
Fast Visual Odometry Pipeline. The metric camera trajectory is obtained by passing overlapping image windows through a transformer that estimates relative camera poses together with confidence scores. An inference module then fuses these pose and confidence estimates into a single trajectory. On commodity hardware, FVO is nearly twice as fast as the fastest baseline, and it requires neither camera parameters nor test-time optimization.
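The fusion step above can be sketched as follows. This is a minimal illustration, not the paper's actual inference module: it assumes each step between consecutive frames receives one or more relative SE(3) estimates from the overlapping windows, and resolves overlaps by keeping the most confident candidate before chaining poses.

```python
import numpy as np

def compose_trajectory(candidates):
    """Chain per-step relative poses into a global trajectory.

    candidates[t] is a list of (T_rel, conf) pairs: hypothetical 4x4
    relative SE(3) transforms from frame t to frame t+1, one per
    overlapping window covering this step, with the network's
    predicted confidence. Returns absolute camera poses with frame 0
    fixed at the origin. Illustrative sketch only; FVO's inference
    module may combine overlapping estimates differently.
    """
    poses = [np.eye(4)]  # frame 0 defines the world origin
    for step in candidates:
        # Among overlapping windows, trust the most confident estimate.
        T_rel, _ = max(step, key=lambda pair: pair[1])
        poses.append(poses[-1] @ T_rel)
    return poses
```

For example, two unit translations along x (the first step covered by two windows of differing confidence) yield a final camera position at x = 2.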
Visual odometry transformer architecture. Given multiple input frames, a frozen image encoder extracts per-image token embeddings. Camera embeddings are then concatenated to aggregate the information for camera pose estimation. The embeddings are decoded into relative camera poses and their confidence scores.
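The token-aggregation idea in the caption can be sketched as below. All sizes, the number of transformer layers, and the 7-D pose parameterization (quaternion + translation) are illustrative assumptions, not the paper's configuration; the frozen image encoder is represented only by its output tokens.

```python
import torch
import torch.nn as nn

class PoseTokenAggregator(nn.Module):
    """Sketch: learnable per-frame camera embeddings are concatenated
    with frozen-encoder image tokens; attention lets the camera tokens
    aggregate pose information, which a small head then decodes."""

    def __init__(self, dim=256, n_frames=8, depth=2):
        super().__init__()
        # One learnable camera token per frame in the window.
        self.cam_tokens = nn.Parameter(torch.randn(n_frames, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.pose_head = nn.Linear(dim, 7)  # e.g. quaternion + translation

    def forward(self, img_tokens):
        # img_tokens: (B, n_frames * tokens_per_frame, dim), from a
        # frozen image encoder (not modeled here).
        B = img_tokens.shape[0]
        cams = self.cam_tokens.unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([cams, img_tokens], dim=1)
        x = self.blocks(x)
        # Decode only the camera tokens into per-frame pose parameters.
        return self.pose_head(x[:, : self.cam_tokens.shape[0]])
```

Prepending dedicated query tokens that attend over image tokens is a common pattern for readout in vision transformers; the sketch uses it only to illustrate the caption's "concatenate camera embeddings, then decode" flow.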
Trajectories estimated from videos of the test splits of indoor (ScanNet, ARKit) and outdoor (KITTI) datasets. The TUM dataset was not used during training. Trajectories are evaluated over the whole video without alignment to the ground truth.
ARKit 41069050
ARKit 41159557
ARKit 41254382
ScanNet 0732
ScanNet 0762
ScanNet 0794
KITTI 03
KITTI 04
KITTI 07
TUM xyz
TUM desk2
TUM long office
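Since FVO predicts metric-scale poses, the evaluation above compares estimated and ground-truth trajectories directly, without the Sim(3)/SE(3) alignment that scale-ambiguous methods require. A minimal sketch of such an unaligned absolute-trajectory-error metric (an assumed RMSE over camera positions, not necessarily the paper's exact protocol):

```python
import numpy as np

def ate_rmse_unaligned(est, gt):
    """Absolute trajectory error without alignment.

    est, gt: (N, 3) arrays of estimated and ground-truth camera
    positions for time-synchronized frames. No similarity transform is
    fitted before comparison, so metric scale and global placement of
    the estimate are evaluated directly. Illustrative sketch only.
    """
    err = est - gt
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))
```

A trajectory that matches the ground truth exactly scores 0; a constant 1 m offset along one axis scores 1.0.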
@misc{yugay2026fvofastvisualodometry,
title={FVO: Fast Visual Odometry with Transformers},
author={Vladimir Yugay and Duy-Kien Nguyen and Theo Gevers and Cees G. M. Snoek and Martin R. Oswald},
year={2026},
eprint={2510.03348},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.03348},
}