ECCV 2026

Walking in the Implicit

Interactive World Exploration via Neural Scene Representation

NeuWorld rolls out a fixed-length, renderable Neural Implicit Scene state and renders queried observations under camera control.

Zhiqi Li¹,², Chengrui Dong¹,², Zhenhua Du¹,², Hangning Zhou³,†, Cong Qiu³, Hailong Qin³, Mu Yang³, Dongxu Wei², Peidong Liu²,*

¹Zhejiang University, ²Westlake University, ³Afari Intelligent Drive

†Project Lead, *Corresponding Author


NeuWorld teaser showing camera-controlled exploration from a neural implicit scene state.

Query camera motion: Future poses define the next local exploration segment.

Roll out NIS: A compact scene state evolves instead of a growing frame sequence.

Render queried views: The sampled state is decoded into pose-conditioned observations.

Abstract

Key idea. Replace autoregressive frame-latent rollout with a fixed-length renderable scene state.

Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS).

This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history.
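The factorization described above can be written compactly as follows. This is a notational sketch rather than the paper's own formulation: the symbols $s_k$ (the NIS for segment $k$), $o_k$ (current observation), $\Pi_{k+1}$ (queried future poses), $H_k$ (retrieved history), $p_\theta$ (the diffusion transformer's sampling distribution), and $R_\phi$ (the NIS decoder) are all assumed names introduced here for illustration.

```latex
\begin{aligned}
s_{k+1} &\sim p_\theta\!\left(s_{k+1} \mid o_k,\ \Pi_{k+1},\ H_k\right)
  && \text{(stochastic scene-state transition)} \\
\hat{o}_{k+1}^{(j)} &= R_\phi\!\left(s_{k+1},\ \pi_{k+1}^{(j)}\right),
  \quad \pi_{k+1}^{(j)} \in \Pi_{k+1}
  && \text{(deterministic pose-conditioned rendering)}
\end{aligned}
```

Under this reading, all randomness sits in the draw of $s_{k+1}$, whose token count stays fixed regardless of $k$; once the state is sampled, every queried view is a deterministic function of that state and its pose.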

By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.

Interactive World Exploration, Neural Implicit Scene, Camera-Controlled Generation

Method

Encode partial NIS

Current observation and sparse future poses are mapped into the same NIS modality as the rollout state.

Retrieve memory

Geometry-aware retrieval selects history frames relevant to the upcoming camera trajectory.

Sample scene state

A diffusion transformer evolves fixed-length NIS tokens under camera and history conditions.

Render observations

The frozen NIS decoder deterministically renders queried views from the sampled local state.
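The four steps above can be sketched as a rollout loop. This is a toy illustration of the control flow only, not the NeuWorld implementation: every function body is a placeholder, poses and observations are reduced to plain floats, the sizes `NUM_TOKENS` and `TOKEN_DIM` are made up, retrieval is replaced by simple recency, and Gaussian noise stands in for diffusion sampling. What it is meant to show is that the scene state keeps a fixed size across segments, sampling is the only stochastic step, and rendering is deterministic given the sampled state.

```python
import random

NUM_TOKENS, TOKEN_DIM = 8, 4  # assumed sizes, for illustration only


def encode_partial_nis(observation, future_poses):
    # Placeholder encoder: maps the conditions into a fixed-size token grid.
    base = float(observation) + sum(future_poses) / len(future_poses)
    return [[base] * TOKEN_DIM for _ in range(NUM_TOKENS)]


def retrieve_memory(history, future_poses, k=2):
    # Placeholder retrieval: keeps the k most recent entries (not geometry-aware).
    return history[-k:]


def sample_scene_state(partial_nis, memory, rng):
    # Placeholder for the diffusion transformer: stochastic, fixed-size output.
    shift = sum(memory) / len(memory) if memory else 0.0
    return [[v + shift + rng.gauss(0.0, 0.1) for v in tok] for tok in partial_nis]


def render(nis, pose):
    # Placeholder decoder: deterministic given the sampled state and the pose.
    return sum(sum(tok) for tok in nis) / (NUM_TOKENS * TOKEN_DIM) + pose


def rollout(first_observation, segments, seed=0):
    """Roll out one scene state per segment and render its queried poses."""
    rng = random.Random(seed)
    history = [first_observation]
    observation = first_observation
    frames, nis = [], None
    for future_poses in segments:
        partial = encode_partial_nis(observation, future_poses)  # encode partial NIS
        memory = retrieve_memory(history, future_poses)          # retrieve memory
        nis = sample_scene_state(partial, memory, rng)           # sample scene state
        views = [render(nis, p) for p in future_poses]           # render observations
        history.extend(views)
        frames.extend(views)
        observation = views[-1]
    return frames, nis


frames, nis = rollout(0.0, [[0.1, 0.2], [0.3, 0.4, 0.5]])
print(len(frames), len(nis), len(nis[0]))
```

Note that `history` grows with every rendered view while `nis` does not, which mirrors the contrast the page draws between a fixed-capacity state and a growing frame sequence.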

Scene-state rollout: Fixed-capacity NIS replaces growing video-latent trajectories.

Unified conditioning: Camera, reference image, and history cues share one latent scene interface.

History-aware consistency: Retrieved memory supports long-horizon rollout and revisitation.
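One way to picture geometry-aware retrieval is to rank history frames by how well their cameras overlap with the upcoming trajectory. The page does not specify the actual criterion, so the scoring rule below is purely an assumption: it multiplies a distance falloff by the cosine between viewing directions. The function names, the `(position, forward)` pose format, and the `sigma` parameter are all hypothetical.

```python
import math


def view_overlap_score(hist_pose, query_pose, sigma=2.0):
    """Hypothetical relevance score: high when cameras are close and look the same way.

    A pose is (position, forward), both 3-tuples; forward is assumed unit length.
    """
    (p_h, f_h), (p_q, f_q) = hist_pose, query_pose
    dist = math.dist(p_h, p_q)
    alignment = sum(a * b for a, b in zip(f_h, f_q))  # cosine between view directions
    return math.exp(-((dist / sigma) ** 2)) * max(alignment, 0.0)


def retrieve_history(history_poses, future_poses, k=2):
    """Return indices of the k history frames most relevant to the future trajectory."""
    scored = []
    for i, h in enumerate(history_poses):
        best = max(view_overlap_score(h, q) for q in future_poses)
        scored.append((best, i))
    # Highest score first; ties are broken by the earlier frame index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


history = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # start, looking forward
    ((5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # far away, same direction
    ((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)),  # nearby but looking backward
    ((0.5, 0.0, 0.0), (0.0, 0.0, 1.0)),   # near the start, same direction
]
future = [((0.2, 0.0, 0.0), (0.0, 0.0, 1.0))]
print(retrieve_history(history, future, k=2))  # → [0, 3]
```

Because the score depends on camera geometry rather than recency, a frame captured long ago can still be retrieved when the camera returns to the same place, which is the behavior revisitation needs.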


Citation

@inproceedings{li2026neuworld,
  title     = {Walking in the Implicit: Interactive World Exploration via Neural Scene Representation},
  author    = {Li, Zhiqi and Dong, Chengrui and Du, Zhenhua and Zhou, Hangning and Qiu, Cong and Qin, Hailong and Yang, Mu and Wei, Dongxu and Liu, Peidong},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}