VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames

arXiv 2025

1Zhejiang University      2Westlake University     

Abstract

We introduce VicaSplat, a feed-forward model that jointly predicts 3D Gaussian splats and camera poses from a sequence of unposed images, a challenging yet practical task for real-world applications. The core of our method is a novel transformer-based network architecture. The model starts with an image encoder that maps each image into a set of visual tokens. All visual tokens are then concatenated with additional learnable anchor tokens for camera poses. A novel transformer decoder processes all the tokens so that they can fully communicate with one another: the anchor tokens causally probe features from the visual tokens of different views, and in turn modulate them frame-wise to inject view-dependent features. 3D Gaussian and camera pose parameters are then estimated via separate prediction heads. Experiments show that VicaSplat surpasses state-of-the-art methods on multi-view inputs and achieves performance comparable to prior two-view approaches. Notably, VicaSplat also demonstrates strong cross-dataset generalization on the ScanNet benchmark, achieving superior performance without any domain-specific fine-tuning.

Method Overview

Overview of VicaSplat. A transformer encoder converts the video frames into visual tokens, and a custom transformer decoder equipped with learnable camera tokens processes these representations. Dedicated prediction heads then estimate camera poses and 3D Gaussian splats, respectively.
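To make the two prediction heads concrete, here is a minimal PyTorch sketch of how such heads could look. The class names, the 7-DoF quaternion-plus-translation pose parameterization, and the per-token Gaussian parameterization (center, scale, rotation, opacity, color) are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoseHead(nn.Module):
    """Hypothetical head: maps each camera token to a unit quaternion + translation."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 7)  # 4 quaternion + 3 translation components

    def forward(self, cam: torch.Tensor):
        # cam: (B, V, D) — one camera token per frame.
        q, t = self.proj(cam).split([4, 3], dim=-1)
        return F.normalize(q, dim=-1), t  # unit quaternion, translation


class GaussianHead(nn.Module):
    """Hypothetical head: maps each visual token to 3D Gaussian parameters
    (center 3, scale 3, rotation quaternion 4, opacity 1, color 3)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 14)

    def forward(self, vis: torch.Tensor):
        # vis: (B, V, N, D) — N visual tokens per frame.
        mu, s, r, o, c = self.proj(vis).split([3, 3, 4, 1, 3], dim=-1)
        # Activations keep scales positive, rotations unit-norm, opacities in [0, 1].
        return mu, s.exp(), F.normalize(r, dim=-1), o.sigmoid(), c
```

In practice the Gaussian head of such models is often pixel-aligned (one Gaussian per patch or pixel, with centers derived from predicted depth along camera rays); the linear projection above is only the simplest stand-in.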

Architecture of one block in the VicaSplat decoder. Both camera tokens and visual tokens are fed into the block. They first fully interact with each other in a Video-Camera Attention layer. The visual tokens then pass through an additional Cross-Neighbor Attention layer to enhance view consistency. Finally, the differentiated camera tokens modulate the visual tokens frame-wise (Framewise Modulation) to inject complex view-dependent features.
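The block described above can be sketched roughly as follows. This is a hypothetical PyTorch implementation, not the authors' code: the module names, the single joint self-attention standing in for Video-Camera Attention (the causal masking mentioned in the abstract is omitted for brevity), the one-frame-each-side window in Cross-Neighbor Attention, and the AdaLN-style shift/scale form of Framewise Modulation are all assumptions.

```python
import torch
import torch.nn as nn


class VicaSplatBlock(nn.Module):
    """Hypothetical sketch of one decoder block (assumptions, not the authors' code)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.neighbor_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        # Each camera token predicts a (shift, scale) pair that modulates
        # every visual token of its own frame.
        self.to_modulation = nn.Linear(dim, 2 * dim)

    def forward(self, cam: torch.Tensor, vis: torch.Tensor):
        # cam: (B, V, D), one token per frame; vis: (B, V, N, D), N patch tokens per frame.
        B, V, N, D = vis.shape

        # 1) Video-Camera Attention: camera and visual tokens jointly self-attend.
        tokens = torch.cat([cam, vis.reshape(B, V * N, D)], dim=1)
        h = self.norm1(tokens)
        tokens = tokens + self.joint_attn(h, h, h)[0]
        cam, vis = tokens[:, :V], tokens[:, V:].reshape(B, V, N, D)

        # 2) Cross-Neighbor Attention: tokens of frame v attend to frames v-1, v, v+1
        #    (zero-padded at the sequence boundaries).
        pad = vis.new_zeros(B, 1, N, D)
        prev = torch.cat([pad, vis[:, :-1]], dim=1)
        nxt = torch.cat([vis[:, 1:], pad], dim=1)
        neigh = torch.cat([prev, vis, nxt], dim=2).reshape(B * V, 3 * N, D)
        q = self.norm2(vis).reshape(B * V, N, D)
        vis = vis + self.neighbor_attn(q, neigh, neigh)[0].reshape(B, V, N, D)

        # 3) Framewise Modulation: AdaLN-style shift/scale from the per-frame camera token.
        shift, scale = self.to_modulation(cam).chunk(2, dim=-1)  # each (B, V, D)
        vis = self.norm3(vis) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)
        return cam, vis
```

In the full model, several such blocks would be stacked before the prediction heads read off camera poses and Gaussian parameters.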

Reconstructed 3D Gaussians and Predicted Cameras

Qualitative Comparisons on Novel View Synthesis with 8 Input Views


More Novel View Synthesis Results

BibTeX

@article{li2025vicasplat,
  title   = {VicaSplat: A Single Run is All You Need for 3D Gaussian Splatting and Camera Estimation from Unposed Video Frames},
  author  = {Zhiqi Li and Chengrui Dong and Yiming Chen and Zhangchi Huang and Peidong Liu},
  journal = {arXiv preprint arXiv:2503.10286},
  year    = {2025}
}