
BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Published 23 Dec 2021 in cs.CV and cs.GR | arXiv:2112.12761v3

Abstract: Prior work for articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems), or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, they introduce significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought; (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses. Project webpage: banmo-www.github.io .


Summary

  • The paper introduces BANMo, a framework for reconstructing animatable 3D models from casual videos without requiring advanced equipment.
  • It employs a fusion of NeRF-based canonical shape modeling, neural blend skinning, and self-supervised canonical embeddings for pixel registration.
  • Experimental results demonstrate improved 3D reconstruction fidelity and scalability compared to prior methods like Nerfies, with applications in motion retargeting.


Introduction

The paper "BANMo: Building Animatable 3D Neural Models from Many Casual Videos" addresses the challenge of reconstructing high-fidelity, articulated 3D models from sets of casual RGB videos without relying on specialized equipment or pre-defined templates. BANMo merges techniques from deformable shape models, canonical embeddings, and volumetric NeRFs to achieve this, enabling realistic rendering and animation from multiple casual video sources.

Method Overview

BANMo constructs 3D models using three primary components:

  1. Canonical Shape Model: A NeRF-style implicit function maps 3D points in a canonical space to color, density, and a canonical embedding. This representation handles appearance variation across videos and supports dense correspondence across multiple video sources (Figure 1). A minimal sketch of this component and the skinning model follows the list.

    Figure 1: Method overview. BANMo optimizes a set of shape and deformation parameters that describe the video observations in pixel colors, silhouettes, optical flow, and higher-order feature descriptors, based on a differentiable volume rendering framework.

  2. Neural Blend Skinning Model: Object deformation is modeled with neural blend skinning, which handles large-scale articulated deformations without a predefined skeleton and provides differentiable, invertible mappings between canonical and camera spaces (Figure 2).

    Figure 2: Canonical Embeddings. An implicit function is jointly optimized to produce canonical embeddings of 3D canonical points that match the 2D DensePose CSE embeddings.

  3. Registration via Canonical Embeddings: BANMo matches 2D pixel features across video frames to canonical embeddings, achieving coherent pixel registration and dense correspondence through self-supervised learning of the embeddings.
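
To make the first two components concrete, below is a minimal PyTorch sketch of a canonical field with color, density, and embedding heads, together with a Gaussian-bone blend-skinning warp. All names, layer sizes, and the Gaussian parameterization are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class CanonicalField(nn.Module):
        """NeRF-style canonical model: maps a 3D canonical point to color,
        density, and a canonical embedding (layer sizes are illustrative)."""
        def __init__(self, embed_dim=16, hidden=128):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.color = nn.Linear(hidden, 3)              # RGB head
            self.density = nn.Linear(hidden, 1)            # volume density head
            self.embedding = nn.Linear(hidden, embed_dim)  # canonical embedding head

        def forward(self, x):  # x: (N, 3) canonical points
            h = self.trunk(x)
            return (torch.sigmoid(self.color(h)),   # color in [0, 1]
                    torch.relu(self.density(h)),    # non-negative density
                    self.embedding(h))

    def skinning_weights(x, centers, log_precisions):
        """Soft assignment of points to B Gaussian 'bones' -- one common way
        to parameterize neural blend skinning (details differ from the paper)."""
        # x: (N, 3), centers: (B, 3), log_precisions: (B,)
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)        # (N, B)
        return torch.softmax(-d2 * torch.exp(log_precisions)[None], -1)  # (N, B)

    def blend_skinning(x, centers, log_precisions, R, t):
        """Warp canonical points to camera space by blending per-bone rigid
        transforms R: (B, 3, 3), t: (B, 3) with the soft skinning weights."""
        w = skinning_weights(x, centers, log_precisions)   # (N, B)
        x_b = torch.einsum('bij,nj->nbi', R, x) + t[None]  # per-bone warps, (N, B, 3)
        return (w[..., None] * x_b).sum(dim=1)             # blended result, (N, 3)

The backward (camera-to-canonical) warp is defined with its own bone transforms rather than by analytically inverting this map; the cycle-consistency losses described below then encourage the forward and backward warps to invert each other.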

Losses and Optimization

The BANMo framework is trained by minimizing a suite of losses (a schematic sketch follows the list):

  • Reconstruction Losses: Compare rendered images to actual video observations, including RGB, silhouettes, and optical flow.
  • Feature Registration Losses: Enforce consistency between feature matching and geometric warping.
  • Cycle Consistency Losses: Regularize 2D feature projections and 3D point transformations, encouraging the forward and backward warps to invert each other so points mapped from canonical to camera space and back return to where they started.
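
As a rough illustration, the total objective can be sketched as a weighted sum of these terms. The tensor layout, dictionary keys, and loss weights below are assumptions for exposition, not the paper's actual formulation or values.

    import torch
    import torch.nn.functional as F

    def banmo_style_loss(rendered, observed, x_canonical, x_cycled):
        """Schematic composite loss; `rendered` and `observed` are dicts of
        per-pixel tensors with matching shapes (keys are illustrative)."""
        l_rgb  = F.mse_loss(rendered['rgb'],  observed['rgb'])   # reconstruction: color
        l_sil  = F.mse_loss(rendered['sil'],  observed['sil'])   # reconstruction: silhouette
        l_flow = F.mse_loss(rendered['flow'], observed['flow'])  # reconstruction: optical flow
        # feature registration: rendered canonical embeddings vs. observed 2D features
        l_feat = F.mse_loss(rendered['feat'], observed['feat'])
        # 3D cycle consistency: canonical -> camera -> canonical round trip
        l_cyc  = F.mse_loss(x_cycled, x_canonical)
        return l_rgb + 0.1 * (l_sil + l_flow + l_feat + l_cyc)   # weights are placeholders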

Experimental Results

Qualitative results show that BANMo outperforms ViSER and Nerfies at reconstructing articulated shapes, recovering fine geometric details such as animal limbs and facial features. BANMo's accuracy improves as more videos are added, whereas Nerfies struggles with large motions and registration.

Quantitative evaluations show that BANMo achieves lower 3D Chamfer distances and higher F-scores across diverse datasets, including the AMA dataset and various animated objects (Figure 3).

Figure 3: Qualitative comparison of our method with prior art.
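
For reference, the two evaluation metrics in minimal form, assuming uniform point samples from the predicted and ground-truth surfaces; the distance threshold `tau` is a placeholder, not the paper's setting.

    import torch

    def chamfer_and_fscore(pred, gt, tau=0.02):
        """pred: (N, 3), gt: (M, 3) surface point samples."""
        d = torch.cdist(pred, gt)            # (N, M) pairwise Euclidean distances
        d_pg = d.min(dim=1).values           # each predicted point -> nearest GT point
        d_gp = d.min(dim=0).values           # each GT point -> nearest prediction
        chamfer = d_pg.mean() + d_gp.mean()  # symmetric Chamfer distance
        precision = (d_pg < tau).float().mean()
        recall = (d_gp < tau).float().mean()
        fscore = 2 * precision * recall / (precision + recall + 1e-8)
        return chamfer, fscore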

Diagnostics and Ablation

BANMo relies on root pose initialization to stabilize optimization. Canonical embeddings substantially aid cross-video registration: removing them produces artifacts such as ghosting. The neural blend skinning model proves essential for handling complex deformations, as shown in experiments with highly dynamic objects such as eagles (Figure 4).

Figure 4: Adaptation to topology changes during optimization. Improper reconstructions are automatically corrected via gradient updates.

Applications and Future Work

BANMo enables novel applications such as motion retargeting, where a pre-optimized model is driven by motions from an unrelated video source. The paper notes that a fully general pipeline would require further advances in pose estimation and computational efficiency (Figure 5).

Figure 5: Motion re-targeting from a pre-optimized cat model to a tiger.

Conclusion

BANMo represents a significant advance in 3D model reconstruction from casual video, merging deformable shape models, neural radiance fields, and canonical embeddings to produce high-fidelity, animatable models. While compute cost and reliance on pre-trained pose estimators remain limitations, BANMo establishes a promising foundation for future work in video-based 3D model generation.
