DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation

NeurIPS 2024
Zhiqi Li*1,2, Yiming Chen*1,2, Peidong Liu†2

1 Zhejiang University

2 Westlake University

Abstract

Recent advancements in 2D/3D generative techniques have facilitated the generation of dynamic 3D objects from monocular videos. Previous methods mainly rely on implicit neural radiance fields (NeRF) or explicit Gaussian splatting as the underlying representation, and struggle to achieve satisfactory spatio-temporal consistency and high-quality surface texture. Drawing inspiration from modern 3D animation pipelines, we introduce DreamMesh4D, a novel framework that combines a mesh representation with a sparse-controlled deformation technique to generate a high-quality 4D object from a monocular video. To overcome the limitations of classical texture representations, we bind Gaussian splats to the surface of the triangular mesh for differentiable optimization of both the texture and the mesh vertices. In particular, DreamMesh4D begins with a coarse mesh provided by a single-image 3D generation method. Sparse control points are then uniformly sampled across the surface of the mesh and used to build a deformation graph that drives the motion of the 3D object, both for computational efficiency and to provide additional constraints. At each time step, the transformations of the sparse control points are predicted by a deformation network, and the mesh vertices as well as the bound surface Gaussians are deformed via a geometric skinning algorithm. The skinning algorithm is a hybrid of linear blend skinning (LBS) and dual-quaternion skinning (DQS), mitigating the drawbacks of each. The static surface Gaussians and mesh vertices, together with the dynamic deformation network, are learned via a reference-view photometric loss, a score distillation loss, and other regularization losses in a two-stage manner. Extensive experiments demonstrate that our method outperforms prior video-to-4D generation methods in terms of rendering quality and spatio-temporal consistency. Furthermore, our mesh-based representation is compatible with modern geometric pipelines, showcasing its potential in the 3D gaming and film industries.
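For intuition, the hybrid skinning step can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the paper's implementation: per-node rigid transforms are given as unit quaternions plus translations, skinning weights are dense per vertex, and a fixed scalar alpha stands in for the paper's adaptive per-vertex blending; all function names here are hypothetical.

    import numpy as np

    def qmul(a, b):
        """Hamilton product of quaternions in (w, x, y, z) order."""
        w1, x1, y1, z1 = a[..., 0], a[..., 1], a[..., 2], a[..., 3]
        w2, x2, y2, z2 = b[..., 0], b[..., 1], b[..., 2], b[..., 3]
        return np.stack([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2,
        ], axis=-1)

    def quat_to_mat(q):
        """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
        w, x, y, z = q
        return np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])

    def lbs(points, weights, quats, trans):
        """Linear blend skinning: weighted sum of rigidly transformed points.

        points: (N, 3), weights: (N, K), quats: (K, 4), trans: (K, 3).
        """
        out = np.zeros_like(points)
        for j in range(quats.shape[0]):
            R = quat_to_mat(quats[j])
            out += weights[:, j:j+1] * (points @ R.T + trans[j])
        return out

    def dqs(points, weights, quats, trans):
        """Dual-quaternion skinning: blend unit dual quaternions, then transform."""
        K = quats.shape[0]
        q_r = quats                                   # real parts (K, 4)
        t_q = np.concatenate([np.zeros((K, 1)), trans], axis=1)
        q_d = 0.5 * qmul(t_q, q_r)                    # dual parts (K, 4)
        # Hemisphere consistency: flip quaternions opposing the first node.
        sign = np.where((q_r @ q_r[0]) < 0, -1.0, 1.0)[:, None]
        b_r = weights @ (sign * q_r)                  # blended real part (N, 4)
        b_d = weights @ (sign * q_d)                  # blended dual part (N, 4)
        norm = np.linalg.norm(b_r, axis=1, keepdims=True)
        b_r, b_d = b_r / norm, b_d / norm
        out = np.empty_like(points)
        for i in range(points.shape[0]):
            R = quat_to_mat(b_r[i])
            conj = b_r[i] * np.array([1.0, -1.0, -1.0, -1.0])
            t = 2.0 * qmul(b_d[i], conj)[1:]          # translation of blended DQ
            out[i] = R @ points[i] + t
        return out

    def hybrid_skinning(points, weights, quats, trans, alpha=0.5):
        """Blend LBS and DQS results; alpha would be adaptive in the paper."""
        return alpha * lbs(points, weights, quats, trans) + \
               (1.0 - alpha) * dqs(points, weights, quats, trans)

Pure LBS (alpha = 1) is cheap but suffers from the well-known candy-wrapper collapse under large twists, while pure DQS (alpha = 0) avoids collapse at the cost of bulging artifacts; blending the two mitigates the drawbacks of both, which is the motivation the abstract gives for the hybrid scheme.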

Method Overview

Overview of the proposed method. In the static stage (top left), a reference image is picked from the input video, from which we generate a Gaussian-mesh hybrid representation through an image-to-3D pipeline. In the dynamic stage, we build a deformation graph between the mesh vertices and sparse control nodes; the mesh and surface Gaussians are then deformed by fusing the deformations of the control nodes, predicted by an MLP, through a novel adaptive hybrid skinning algorithm. A sketch of one possible graph construction follows below.
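The sketch below shows one common way such a deformation graph could be built, assuming farthest-point sampling for the sparse nodes and Gaussian RBF weights over each vertex's k nearest nodes; the paper's exact sampling and weighting may differ, and the names here (build_deformation_graph, sigma, etc.) are illustrative.

    import numpy as np

    def farthest_point_sampling(verts, n_nodes):
        """Spread control nodes roughly uniformly over the surface via FPS."""
        idx = [0]                               # deterministic seed vertex
        dist = np.full(len(verts), np.inf)
        for _ in range(n_nodes - 1):
            dist = np.minimum(dist, np.linalg.norm(verts - verts[idx[-1]], axis=1))
            idx.append(int(dist.argmax()))      # next node: farthest from all chosen
        return np.array(idx)

    def build_deformation_graph(verts, n_nodes=512, k=4, sigma=None):
        """Attach each vertex to its k nearest control nodes with RBF weights.

        Returns node positions (n_nodes, 3), per-vertex node indices (N, k),
        and normalized skinning weights (N, k).
        """
        nodes = verts[farthest_point_sampling(verts, n_nodes)]
        d = np.linalg.norm(verts[:, None, :] - nodes[None, :, :], axis=-1)
        nn = np.argsort(d, axis=1)[:, :k]       # k nearest nodes per vertex
        d_k = np.take_along_axis(d, nn, axis=1)
        if sigma is None:
            sigma = d_k[:, -1:] + 1e-8          # per-vertex bandwidth from k-th node
        w = np.exp(-(d_k / sigma) ** 2)
        w /= w.sum(axis=1, keepdims=True)       # weights sum to 1 per vertex
        return nodes, nn, w

At each time step, the MLP predicts a rotation and translation per control node; scattering the sparse (indices, weights) pairs into a dense (N, n_nodes) weight matrix lets them drive the skinning sketch given after the abstract.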

Gallery

More Results

Blender Scene Demo

Citation

    @inproceedings{li2024dreammesh4d,
        title={DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation},
        author={Zhiqi Li and Yiming Chen and Peidong Liu},
        booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
        year={2024}
    }