- The paper introduces NViST, a transformer-based framework for synthesizing novel views from a single image captured in diverse real-world scenes.
- It pairs a fine-tuned Masked Autoencoder with a cross-attention decoder to convert image features into a radiance field representation, operating on relative camera poses.
- Experiments on MVImgNet and ShapeNet-SRN show higher PSNR and lower LPIPS than prior single-image methods, highlighting its robustness in unconstrained environments.
The paper "NViST: In the Wild New View Synthesis from a Single Image with Transformers" by Wonbong Jang and Lourdes Agapito presents a transformer-based architecture for novel view synthesis (NVS) from a single image captured in real-world conditions. This work aims to address the challenges of NVS when handling diverse and complex scenes in unstructured environments, a significant departure from traditional object-centered and synthetic dataset methodologies frequently seen in this research domain.
Methodology Overview
The authors introduce NViST, a transformer-based model that maps a single input image to a radiance field from which novel viewpoints can be rendered. The architecture consists of three main components: an encoder, a decoder, and a renderer. The encoder uses a fine-tuned Masked Autoencoder (MAE) to extract feature tokens from the input image. The decoder uses cross-attention to transform these tokens into a vector-matrix representation of a radiance field, which is then rendered volumetrically.
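To make the three-stage pipeline concrete, here is a minimal PyTorch sketch: an encoder that tokenizes the image, a cross-attention decoder with learnable output tokens, and an elided vector-matrix/rendering stage. All module names, dimensions, and the patch-embedding stand-in for the fine-tuned MAE are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an NViST-style pipeline (not the authors' code).
import torch
import torch.nn as nn

class CrossAttentionDecoder(nn.Module):
    """Learnable output tokens attend to the encoder's feature tokens."""
    def __init__(self, dim=768, n_out_tokens=512, n_heads=8, n_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_out_tokens, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(n_layers))

    def forward(self, feats):                        # feats: (B, N, dim) encoder tokens
        B = feats.shape[0]
        x = self.queries.unsqueeze(0).expand(B, -1, -1)
        for attn, norm in zip(self.layers, self.norms):
            out, _ = attn(query=norm(x), key=feats, value=feats)
            x = x + out                              # residual cross-attention update
        return x                                     # (B, n_out_tokens, dim)

class NViSTSketch(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Stand-in for the fine-tuned MAE encoder (a ViT in the actual model).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # 16x16 patch embedding
            nn.Flatten(2),                                 # (B, dim, N)
        )
        self.decoder = CrossAttentionDecoder(dim=dim)
        # The reshape into a vector-matrix radiance field and the volumetric
        # renderer are elided; this head just produces field features.
        self.to_field = nn.Linear(dim, dim)

    def forward(self, image):                        # image: (B, 3, H, W)
        feats = self.encoder(image).transpose(1, 2)  # (B, N, dim) feature tokens
        tokens = self.decoder(feats)                 # cross-attention to output tokens
        return self.to_field(tokens)                 # radiance-field features

field_tokens = NViSTSketch()(torch.randn(1, 3, 224, 224))
print(field_tokens.shape)                            # torch.Size([1, 512, 768])
```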
A key design choice in NViST is the use of relative camera poses rather than alignment to a canonical view, which broadens its applicability to casually captured datasets. The architecture also handles scale ambiguity by conditioning the decoder on camera parameters through adaptive layer normalization (AdaLN).
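A hedged sketch of how such AdaLN conditioning might look: the scale and shift of a parameter-free LayerNorm are regressed from a small camera-parameter vector. The specific parameters used here (normalized focal lengths and a camera distance) and the single linear projection are assumptions for illustration, not the paper's exact configuration.

```python
# Adaptive layer normalization conditioned on camera parameters (illustrative sketch).
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        # No learned affine here; scale and shift come from the condition instead.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, dim * 2)

    def forward(self, x, cond):
        # x:    (B, N, dim) decoder tokens
        # cond: (B, cond_dim) camera parameters, e.g. [focal_x, focal_y, cam_distance]
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

tokens = torch.randn(2, 512, 768)
camera = torch.tensor([[1.2, 1.2, 0.9], [0.8, 0.8, 1.4]])  # made-up normalized values
out = AdaLN(dim=768, cond_dim=3)(tokens, camera)
print(out.shape)  # torch.Size([2, 512, 768])
```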
Dataset and Training
NViST is trained on MVImgNet, a large-scale dataset of real-world captures spanning over 177 object categories. This breadth of common objects and complex scenes gives the model enough variability to learn generalizable scene representations. The training objective combines a photometric L2 loss, a perceptual (LPIPS) loss, and a distortion-based regularizer, which together improve the fidelity of synthesized views.
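The loss combination could be sketched as follows. The weighting coefficients, the simplified distortion term (only the pairwise component of a Mip-NeRF-360-style regularizer), and the use of the `lpips` package are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch of a combined objective: photometric L2 + LPIPS + distortion regularizer.
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net='vgg')  # perceptual distance in VGG feature space

def distortion_loss(weights, midpoints):
    """Simplified distortion regularizer penalizing spread of ray weights.
    weights, midpoints: (num_rays, num_samples)."""
    # pairwise term |t_i - t_j| * w_i * w_j, summed over sample pairs per ray
    diff = (midpoints.unsqueeze(-1) - midpoints.unsqueeze(-2)).abs()
    return (weights.unsqueeze(-1) * weights.unsqueeze(-2) * diff).sum(dim=(-1, -2)).mean()

def total_loss(pred_rgb, gt_rgb, weights, midpoints,
               lambda_lpips=0.1, lambda_dist=0.01):
    # pred_rgb, gt_rgb: (B, 3, H, W) in [0, 1]; LPIPS expects inputs in [-1, 1]
    l2 = torch.mean((pred_rgb - gt_rgb) ** 2)
    perceptual = lpips_fn(pred_rgb * 2 - 1, gt_rgb * 2 - 1).mean()
    dist = distortion_loss(weights, midpoints)
    return l2 + lambda_lpips * perceptual + lambda_dist * dist
```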
Evaluation and Results
NViST shows substantial improvements over existing methods such as PixelNeRF and VisionNeRF, especially on in-the-wild scenes with varying backgrounds and occlusions. The model also generalizes to unseen categories and to out-of-distribution scenes captured with mobile phones. On MVImgNet it achieves higher PSNR and lower LPIPS than previous approaches.
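For reference, the reported PSNR values follow the standard definition below, computed from per-pixel MSE on images scaled to [0, 1]; this is a generic snippet, not the authors' evaluation code.

```python
# Generic PSNR definition (higher is better; LPIPS, by contrast, is lower-is-better).
import torch

def psnr(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse)  # equivalent to 10 * log10(1 / MSE) for [0, 1] images
```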
Beyond MVImgNet, the paper also evaluates NViST on the ShapeNet-SRN benchmark to validate its performance in a more controlled synthetic setting. Although this requires a separate training setup, the model's use of relative camera poses carries over unchanged, demonstrating its robustness across datasets.
Implications and Future Directions
This research has significant implications for real-world applications of NVS, such as augmented reality, telepresence, and interactive entertainment, where synthesizing novel views from casually captured images is highly desirable. The advances in handling relative poses and unstructured data pave the way for more adaptable NVS models across real-world scenarios.
Looking forward, extending NViST to incorporate probabilistic models could enhance its ability to predict multiple plausible scenes from ambiguous inputs. Moreover, exploring multiview extensions could further improve the accuracy and fidelity of synthesized views, particularly in dynamic and highly interactive environments.
In summary, NViST represents an important step towards more flexible and robust novel view synthesis, offering a scalable and generalizable approach well-suited for practical deployment in unconstrained settings. The integration of vision transformers with radiance field rendering points to a promising path forward in 3D scene understanding from single-image inputs.