Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods scale poorly: the cost of global attention grows quadratically with the number of components. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing, which selects the top-k most relevant components for sparse global attention, and (2) unimportant-component compression, which preserves the contextual priors of unselected components while reducing the computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation that scales to large numbers of components. Extensive experiments show that MoCA outperforms baselines on both compositional object and scene generation tasks.
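To make the two designs concrete, here is a minimal sketch of sparse global attention with top-k component routing. Everything below is hypothetical: the routing heuristic (similarity between the query's mean token and each component's mean token), the function names, and the single-head attention are illustrative stand-ins, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moc_attention(query_tokens, components, compressed, k=2):
    """Sparse global attention over a set of components (illustrative sketch).

    query_tokens: (Nq, d) tokens of the component being updated.
    components:   list of (Ni, d) full token arrays, one per component.
    compressed:   list of (M, d) compressed token arrays (M << Ni).
    k:            number of components attended to at full resolution.
    """
    q_summary = query_tokens.mean(axis=0)                       # (d,)
    # Importance score per component: one possible routing heuristic is the
    # similarity between the query summary and each component's mean token.
    scores = np.array([c.mean(axis=0) @ q_summary for c in components])
    topk = np.argsort(scores)[-k:]                              # selected components
    # Selected components contribute their full latents; unselected ones
    # contribute only compressed tokens, preserving context at low cost.
    kv = np.concatenate(
        [components[i] if i in topk else compressed[i]
         for i in range(len(components))], axis=0)              # (Nkv, d)
    attn = softmax(query_tokens @ kv.T / np.sqrt(kv.shape[-1]))  # (Nq, Nkv)
    return attn @ kv                                            # (Nq, d)
```

With k fixed and the compressed size M small, the key/value length grows only by M per additional component rather than by the full component length, which is what keeps the global attention cost sub-quadratic in the number of components.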
Overview of MoCA. Our DiT model begins by packing each component's latents into several learnable queries through a cross-attention layer. Random ID embeddings are applied to distinguish the components. Both the full latents and the compressed version of each component are then fed into our DiT model, which consists of interleaved local attention blocks and our proposed Mixture-of-Components Attention blocks. Finally, the clean latents of all components are separately decoded into the global space by a frozen shape decoder to form the final 3D asset.
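The packing step above can be sketched as single-head cross-attention from a few learnable queries onto a component's latents, followed by adding a random ID embedding. This is an assumed simplification: the function name, the single attention head, and the additive ID embedding are illustrative choices, not the model's exact layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pack_component(latents, queries, id_embedding):
    """Compress one component's latents into a few query slots
    via cross-attention (illustrative sketch).

    latents:      (N, d) full latent tokens of one component.
    queries:      (M, d) learnable query vectors, M << N.
    id_embedding: (d,) random embedding identifying this component.
    """
    d = latents.shape[-1]
    attn = softmax(queries @ latents.T / np.sqrt(d))  # (M, N)
    packed = attn @ latents                            # (M, d)
    # The random ID embedding distinguishes this component's compressed
    # tokens from those of other components.
    return packed + id_embedding
```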
@misc{li2025moca,
title={MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation},
author={Zhiqi Li and Wenhuan Li and Tengfei Wang and Zhenwei Wang and Junta Wu and Haoyuan Wang and Yunhan Yang and Zehuan Huang and Yang Li and Chunchao Guo and Peidong Liu},
year={2025},
eprint={2512.07628},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.07628},
}