ECCV 2026

Walking in the Implicit

Interactive World Exploration via Neural Scene Representation

NeuWorld rolls out a fixed-length, renderable Neural Implicit Scene state and renders queried observations under camera control.

Zhiqi Li1,2 Chengrui Dong1,2 Zhenhua Du1,2 Hangning Zhou3,† Cong Qiu3 Hailong Qin3 Mu Yang3 Dongxu Wei2 Peidong Liu2,*
1Zhejiang University 2Westlake University 3Afari Intelligent Drive
†Project Lead *Corresponding Author
Figure: NeuWorld teaser showing camera-controlled exploration from a neural implicit scene state.
Query camera motion: Future poses define the next local exploration segment.
Roll out NIS: A compact scene state evolves instead of a growing frame sequence.
Render queried views: The sampled state is decoded into pose-conditioned observations.
Abstract

Key idea. Replace autoregressive frame-latent rollout with a fixed-length renderable scene state.

Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS).

This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history.
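The factorization can be written compactly. The notation below is an assumption introduced only for illustration; none of these symbols comes from the text above.

```latex
% Assumed notation, introduced here for illustration only:
%   z_k : fixed-length NIS tokens for local segment k
%   x_k : current observation,  c_k : queried future camera poses
%   m_k : retrieved history,    \pi_k^{(j)} : j-th queried pose
\begin{aligned}
z_k &\sim p_\theta\!\left(z_k \mid x_k,\, c_k,\, m_k\right)
  && \text{(stochastic transition of the scene state)} \\
o_k^{(j)} &= D_\phi\!\left(z_k,\, \pi_k^{(j)}\right), \quad j = 1, \dots, J
  && \text{(deterministic pose-conditioned rendering)}
\end{aligned}
```

Under this reading, all randomness enters through the sampled state z_k, and the size of z_k does not depend on k. A frame-latent rollout, by contrast, carries a context that grows with the number of generated frames.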

By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.

Interactive World Exploration, Neural Implicit Scene, Camera-Controlled Generation
Method
Figure: NeuWorld method pipeline with partial NIS conditioning, geometry-aware retrieval, NIS-DiT rollout, and decoder rendering.
At each interaction step, partial NIS and memory NIS conditions guide NIS-DiT to sample the next local scene state, which the frozen decoder renders under queried future poses.
01

Encode partial NIS

The current observation and sparse future poses are mapped into the same NIS modality as the rollout state.

02

Retrieve memory

Geometry-aware retrieval selects history frames relevant to the upcoming camera trajectory.

03

Sample scene state

A diffusion transformer evolves fixed-length NIS tokens under camera and history conditions.

04

Render observations

The frozen NIS decoder deterministically renders queried views from the sampled local state.
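The four steps above can be sketched as a single interaction loop. This is a toy illustration, not the NeuWorld implementation: every function is a hypothetical stand-in built from NumPy arithmetic, the token and image sizes are made up, poses are reduced to 3-vectors, and memory selection is simplified to "the last two frames" rather than geometry-aware retrieval. It is meant only to show the control flow and that the rollout state keeps a fixed shape.

```python
import numpy as np

NUM_TOKENS, TOKEN_DIM = 32, 8   # hypothetical NIS size (not specified in the text)
H, W = 4, 4                     # tiny stand-in image resolution


def encode_to_nis(frames, poses):
    """Toy stand-in for the shared encoder: any inputs -> fixed-size tokens."""
    feats = np.concatenate([np.ravel(frames), np.ravel(poses)])
    # np.resize repeats or truncates, so the output shape never depends on input count.
    return np.resize(feats, (NUM_TOKENS, TOKEN_DIM))


def sample_next_nis(partial_nis, memory_nis, rng, steps=4):
    """Toy stand-in for NIS-DiT: start from noise and move toward the conditions."""
    cond = 0.5 * (partial_nis + memory_nis)
    z = rng.standard_normal((NUM_TOKENS, TOKEN_DIM))  # the only source of randomness
    for _ in range(steps):
        z = z + 0.5 * (cond - z)
    return z


def render(nis, pose):
    """Toy stand-in for the frozen decoder: deterministic in (state, pose)."""
    return np.tanh(nis.mean() + pose.sum()) * np.ones((H, W))


def explore(first_frame, first_pose, trajectory_segments, seed=0):
    rng = np.random.default_rng(seed)
    history_frames, history_poses = [first_frame], [first_pose]
    state_shapes = []
    for future_poses in trajectory_segments:
        # 01: current observation + queried future poses -> partial NIS
        partial = encode_to_nis(history_frames[-1], future_poses)
        # 02: memory NIS (simplified here to the two most recent frames)
        memory = encode_to_nis(np.stack(history_frames[-2:]),
                               np.stack(history_poses[-2:]))
        # 03: sample the next local scene state
        nis = sample_next_nis(partial, memory, rng)
        state_shapes.append(nis.shape)
        # 04: render every queried pose from the same sampled state
        for pose in future_poses:
            history_frames.append(render(nis, pose))
            history_poses.append(pose)
    return history_frames, state_shapes


frames, shapes = explore(np.zeros((H, W)), np.zeros(3),
                         [np.ones((2, 3)), 2 * np.ones((3, 3))])
```

The history of rendered frames grows with each segment, but the sampled state has the same shape at every step.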

Scene-state rollout: Fixed-capacity NIS replaces growing video-latent trajectories.
Unified conditioning: Camera, reference image, and history cues share one latent scene interface.
History-aware consistency: Retrieved memory supports long-horizon rollout and revisitation.
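The text does not say how geometry-aware retrieval scores history frames, so the following is only one plausible heuristic, stated as an assumption: a history camera counts as relevant when it lies close to some queried camera and looks in a similar direction. The function name, the `max_distance` cutoff, and the scoring rule are all hypothetical.

```python
import numpy as np


def retrieve_history(hist_positions, hist_directions,
                     query_positions, query_directions,
                     k=2, max_distance=5.0):
    """Return indices of up to k history frames relevant to the queried trajectory.

    Assumed heuristic (not from the source): score = proximity * view alignment,
    maximised over the queried poses.
    """
    hist_positions = np.asarray(hist_positions, dtype=float)      # (N, 3)
    hist_directions = np.asarray(hist_directions, dtype=float)    # (N, 3), unit vectors
    query_positions = np.asarray(query_positions, dtype=float)    # (M, 3)
    query_directions = np.asarray(query_directions, dtype=float)  # (M, 3), unit vectors

    # Pairwise distances between camera centres, shape (N, M).
    dist = np.linalg.norm(
        hist_positions[:, None, :] - query_positions[None, :, :], axis=-1)
    proximity = np.clip(1.0 - dist / max_distance, 0.0, 1.0)

    # Cosine between viewing directions; opposite-facing views score zero.
    alignment = np.clip(hist_directions @ query_directions.T, 0.0, 1.0)

    # Best match over the queried poses gives one score per history frame.
    score = (proximity * alignment).max(axis=1)
    order = np.argsort(-score, kind="stable")[:k]
    return order[score[order] > 0.0]  # drop frames with no relevance at all


selected = retrieve_history(
    hist_positions=[[0, 0, 0], [10, 0, 0], [1, 0, 0]],
    hist_directions=[[0, 0, 1], [0, 0, 1], [0, 0, -1]],
    query_positions=[[0.5, 0, 0]],
    query_directions=[[0, 0, 1]],
)
```

In this example only the first history frame is both nearby and facing the same way as the queried camera, so it is the only one selected. A fuller treatment might use frustum overlap or depth-based co-visibility instead.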
Citation
@inproceedings{li2026neuworld,
  title     = {Walking in the Implicit: Interactive World Exploration via Neural Scene Representation},
  author    = {Li, Zhiqi and Dong, Chengrui and Du, Zhenhua and Zhou, Hangning and Qiu, Cong and Qin, Hailong and Yang, Mu and Wei, Dongxu and Liu, Peidong},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}