Perceiver: General Perception with Iterative Attention

Published 4 Mar 2021 in cs.CV, cs.AI, cs.LG, cs.SD, and eess.AS | (2103.03206v2)

Abstract: Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.




Summary

The paper introduces a model that unifies diverse data types like images, audio, and 3D points using iterative attention.
It uses cross-attention from a small latent array to the input, which reduces the cost of attending to the input from quadratic to linear in input size.
Empirical evaluations on ImageNet, AudioSet, and ModelNet40 demonstrate competitive performance without domain-specific modifications.

The "Perceiver: General Perception with Iterative Attention" paper introduces the Perceiver, an architecture designed to handle diverse high-dimensional inputs such as images, audio, video, and point clouds without relying on domain-specific assumptions. It builds on the Transformer but avoids the quadratic cost of self-attention over the inputs by using a cross-attention mechanism whose cost scales linearly with input size.

Architecture Overview

The Perceiver's design incorporates two main components: a cross-attention module and a Transformer tower. The cross-attention module uses an asymmetric attention mechanism in which a small latent array generates the queries and the much larger input array generates the keys and values. For M input elements and N latent elements, attending to the input then costs O(MN) rather than the O(M^2) of standard self-attention, which is linear in M when N is held fixed.
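To make the scaling argument concrete, the short sketch below counts the query-key scores a single attention head would compute in each case. The input size corresponds to a 224x224 image (roughly the 50,000 pixels mentioned in the abstract); the latent size of 512 and the helper name `attention_score_entries` are illustrative assumptions, not values taken from this summary.

```python
def attention_score_entries(num_queries: int, num_keys: int) -> int:
    """Number of query-key scores one attention head computes."""
    return num_queries * num_keys


M = 224 * 224  # input elements: pixels of a 224x224 image (50,176)
N = 512        # latent elements: illustrative assumption

# Standard self-attention: queries and keys both come from the input.
self_attn = attention_score_entries(M, M)
# Perceiver-style cross-attention: queries from latents, keys from input.
cross_attn = attention_score_entries(N, M)
# Self-attention inside the latent array: independent of M.
latent_attn = attention_score_entries(N, N)

print(self_attn)                # 2517630976
print(cross_attn)               # 25690112
print(latent_attn)              # 262144
print(self_attn // cross_attn)  # 98
```

With these numbers the cross-attention score matrix is 98 times smaller than the full self-attention matrix, and the ratio grows in proportion to the input size.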

Cross-Attention Modules: In the initial stage, cross-attention maps the input array to a latent array. This mapping creates an information bottleneck that allows the latent space to distill essential input features iteratively.
Latent Transformer: Following the cross-attention phase, a stack of Transformer layers, operating solely in the latent space, processes these distilled features. Because the cost of each latent layer depends only on the number of latent elements and not on the input size, network depth is decoupled from input size and the model can be made deep without the per-layer cost growing with the input.
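The two components above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: it omits layer normalization, the MLP sublayers, and multi-head attention, uses random weights in place of learned ones, and all sizes and names (`attend`, `init`, `M`, `N`, `C`, `D`) are assumptions chosen for readability.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries_from, keys_from, w_q, w_k, w_v):
    """Single-head scaled dot-product attention."""
    q = queries_from @ w_q                   # (n_q, d)
    k = keys_from @ w_k                      # (n_k, d)
    v = keys_from @ w_v                      # (n_k, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_q, n_k)
    return softmax(scores) @ v               # (n_q, d)


rng = np.random.default_rng(0)


def init(shape):
    return rng.normal(size=shape) * 0.1


M, C = 1000, 3  # input elements and channels (illustrative)
N, D = 16, 32   # latent elements and channels (illustrative)

inputs = rng.normal(size=(M, C))
latents = rng.normal(size=(N, D))  # learned in a real model; random here

# Cross-attention: queries from the latents, keys/values from the inputs.
# The score matrix has shape (N, M), so cost is linear in M.
cross_w = (init((D, D)), init((C, D)), init((C, D)))
latents = latents + attend(latents, inputs, *cross_w)

# Latent Transformer: self-attention within the latent array only.
# Each layer costs O(N^2) no matter how large M is.
for _ in range(4):
    self_w = (init((D, D)), init((D, D)), init((D, D)))
    latents = latents + attend(latents, latents, *self_w)

print(latents.shape)  # (16, 32)
```

Note that the input array is touched only once, in the cross-attention step; the loop over latent layers never sees it, which is why adding depth does not multiply the input-dependent cost.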

Key Features and Innovations

Iterative Attention: The Perceiver iteratively refines its understanding of the input by alternating between cross-attention and latent self-attention layers. This iterative mechanism allows the model to focus on different parts of the input over successive layers.
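Position and Modality Encodings: Because attention itself is indifferent to the order of its inputs, the model attaches position encodings such as Fourier features to each input element. These encode position along each dimension without building a fixed spatial structure into the architecture, so the same scheme applies across data modalities. For multimodal inputs, a modality-specific encoding additionally marks which modality each element comes from.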
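As an illustration of the position-encoding idea, the sketch below builds Fourier features for inputs of any dimensionality. The specifics are assumptions made for this example rather than details stated in this summary: coordinates are scaled to [-1, 1], frequencies are spaced linearly from 1 up to half of a chosen `max_freq`, and the raw coordinates are concatenated to the sine and cosine features. The function name `fourier_features` is hypothetical.

```python
import numpy as np


def fourier_features(positions, num_bands, max_freq):
    """Encode positions with sines and cosines at several frequencies.

    positions: (M, d) array with coordinates scaled to [-1, 1].
    Returns an array of shape (M, d * (2 * num_bands + 1)).
    """
    freqs = np.linspace(1.0, max_freq / 2.0, num_bands)    # (K,)
    angles = np.pi * positions[..., None] * freqs          # (M, d, K)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    feats = feats.reshape(positions.shape[0], -1)          # (M, d * 2K)
    return np.concatenate([positions, feats], axis=-1)


# 2D example: pixel positions on a small 4x6 grid.
h, w = 4, 6
ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                     indexing="ij")
pos_2d = np.stack([ys.ravel(), xs.ravel()], axis=-1)       # (24, 2)
enc_2d = fourier_features(pos_2d, num_bands=8, max_freq=6)
print(enc_2d.shape)  # (24, 34)

# 1D example: sample positions along an audio clip.
pos_1d = np.linspace(-1, 1, 100)[:, None]                  # (100, 1)
enc_1d = fourier_features(pos_1d, num_bands=8, max_freq=100)
print(enc_1d.shape)  # (100, 17)
```

The same function serves both the 2D and 1D cases; only the coordinate array changes. The encoded positions would then be concatenated with each element's raw features before the cross-attention step.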

Empirical Performance

The paper demonstrates the Perceiver's versatility and efficacy across several benchmarks:

ImageNet Classification: The Perceiver achieves results comparable to ResNet-50 and Vision Transformers (ViT) without 2D convolutions, attending directly to the roughly 50,000 raw pixels of each image.
AudioSet: The Perceiver shows strong results in both uni-modal (audio-only) and multi-modal (audio+video) settings. It achieves near state-of-the-art mean average precision (mAP) scores when using either raw audio or mel spectrograms as inputs.
ModelNet40: For 3D point cloud classification, the Perceiver reaches an accuracy of 85.7% without advanced geometric features or extensive data augmentation. This is a strong result for a modality-agnostic model, though it remains below specialized architectures such as PointNet++.

Implications and Future Directions

The Perceiver represents a significant step towards a more unified, general-purpose model that can handle various types of sensory data without bespoke architectural modifications. This flexibility suggests several far-reaching implications:

Reduced Need for Domain-Specific Models: By eliminating the necessity for domain-specific architectures, the Perceiver can simplify the development and deployment of machine learning models across diverse applications.
Enhanced Multi-Modal Processing: The architecture's ability to seamlessly integrate different data modalities opens up opportunities for more sophisticated and coherent multi-modal understanding and reasoning.

Speculation on Future Developments

Given the Perceiver's strong empirical results, it is reasonable to anticipate several key directions for future research:

Scale and Pre-training: As with models like ViT, the performance of the Perceiver is likely to benefit from scaling up and pre-training on extensive datasets, which could further enhance its robustness and accuracy across tasks.
Enhanced Feature Engineering: Future work may explore more sophisticated position encoding methods or feature engineering techniques that adapt dynamically to the characteristics of the input data, potentially improving the model's performance on non-grid-like data.
Parameter Efficiency: Although weight sharing in iterative layers has reduced the parameter count substantially, additional research could aim to streamline the model further, making it more resource-efficient.
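The effect of weight sharing on parameter count can be illustrated with rough arithmetic. The sketch below counts only the dense weights of latent self-attention blocks, ignoring biases, normalization, and the cross-attention modules; the latent width, block count, and number of repeats are illustrative assumptions, and full sharing across all repeats is a simplification of whatever sharing scheme a given model uses.

```python
def block_params(d_latent: int, mlp_ratio: int = 4) -> int:
    """Rough weight count for one latent self-attention block."""
    attention = 4 * d_latent * d_latent        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * d_latent * d_latent  # two dense layers
    return attention + mlp


d_latent = 1024       # latent channels (illustrative)
blocks_per_stack = 6  # blocks in one latent Transformer (illustrative)
repeats = 8           # times the stack is applied (illustrative)

per_stack = blocks_per_stack * block_params(d_latent)
unshared = repeats * per_stack  # a fresh copy of the weights per repeat
shared = per_stack              # one copy reused on every repeat

print(per_stack)           # 75497472
print(unshared)            # 603979776
print(unshared // shared)  # 8
```

Under full sharing the saving equals the number of repeats, while the compute per forward pass is unchanged because the shared stack is still applied the same number of times.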

Conclusion

The "Perceiver" paper makes the case for a versatile, scalable perception model that combines the flexibility of attention with the efficiency needed for very large inputs. By relying on iterative attention and domain-agnostic position encodings rather than modality-specific structure, the Perceiver offers a foundation for general-purpose models that handle varied, high-dimensional sensory inputs.
