Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding (2106.02795v3)

Published 5 Jun 2021 in cs.LG, cs.AI, and cs.CV

Abstract: Attentional mechanisms are order-invariant. Positional encoding is a crucial component to allow attention-based deep model architectures such as Transformer to address sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on learnable Fourier feature mapping, modulated with a multi-layer perceptron. The representation is particularly advantageous for a spatial multi-dimensional position, e.g., pixel positions on an image, where $L_2$ distances or more complex positional relationships need to be captured. Our experiments based on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods by both improving the accuracy and allowing faster convergence.

Summary

The paper introduces learnable Fourier features as a novel, flexible method for multi-dimensional spatial positional encoding in attention models like Transformers.
Experiments on image generation, object detection, image classification, and widget captioning show the approach outperforming traditional positional encoding methods in both accuracy and convergence speed.
The method is parameter-efficient: its parameter count does not grow with the number of positions, and it starts from an inductive bias toward Euclidean distance that training can adapt to the task.

Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding

Positional encoding is an essential component of attention-based deep learning models such as Transformers: attention is order-invariant, so without it the model cannot distinguish positions in sequences or images where the position of information matters. The paper "Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding" introduces a positional encoding method based on learnable Fourier features. The approach aims to overcome the limitations of fixed sinusoidal and embedding-based encodings by offering a flexible, data-driven way to capture positional information, particularly for multi-dimensional positions.

Overview of the Proposed Method

Traditional positional encoding methods use either fixed sinusoidal functions or trainable embeddings. Sinusoidal encoding is a simple way to inject positional bias into the model, but it is fixed in advance and cannot adapt to the task. A trainable embedding per position can capture complex positional relationships, but the embedding table grows with the number of positions, which becomes inefficient for long or variable-length sequences and for multi-dimensional positions, where the set of possible positions is large and sparsely observed.
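As a point of reference, the two traditional schemes can be sketched as follows. This is an illustrative NumPy sketch, not code from the paper: the sinusoidal variant follows the standard 1-D Transformer formulation, and the function names, the base of 10000, and the 0.02 initialization scale are assumptions made for the example.

```python
import numpy as np

def sinusoidal_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Fixed 1-D sinusoidal encoding; `dim` is assumed to be even."""
    positions = np.arange(num_positions)[:, None]               # (N, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))     # (dim/2,)
    angles = positions * freqs                                  # (N, dim/2)
    enc = np.empty((num_positions, dim))
    enc[:, 0::2] = np.sin(angles)   # even channels
    enc[:, 1::2] = np.cos(angles)   # odd channels
    return enc

def embedding_table(num_positions: int, dim: int, seed: int = 0) -> np.ndarray:
    """Trainable table with one row per discrete position (random init)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.02, size=(num_positions, dim))

fixed = sinusoidal_encoding(num_positions=16, dim=8)    # no trainable parameters
learned = embedding_table(num_positions=16, dim=8)      # 16 * 8 trainable values
```

The embedding table stores `num_positions * dim` values, so its size grows with the number of positions, whereas the sinusoidal encoding has no trainable parameters at all.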

The paper proposes a hybrid approach: learnable Fourier features modulated by a multi-layer perceptron (MLP). Each position is treated as a continuous-valued vector rather than a discrete index, which alleviates the sparsity issue of embedding-based methods, and all positions share the same parameters. The projection weights in the Fourier feature mapping are trainable, so the model can adapt the encoding to the task rather than relying on fixed frequencies.
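A minimal NumPy sketch of this idea, under stated assumptions: positions are projected by a trainable matrix `W_r`, passed through cosine and sine, and then fed to a small MLP. The names (`init_params`, `fourier_positional_encoding`, `gamma`), the layer sizes, the Gaussian initialization, the GELU activation, and the normalization are illustrative choices, not the authors' implementation. Only the forward pass is shown; in a real model `W_r` and the MLP weights would be trained by gradient descent.

```python
import numpy as np

def init_params(pos_dim, num_freqs, hidden_dim, out_dim, gamma=1.0, seed=0):
    """Create the trainable parameters (hypothetical names and sizes)."""
    rng = np.random.default_rng(seed)
    return {
        # Fourier projection; trainable, Gaussian with std 1/gamma at init (assumed).
        "W_r": rng.normal(scale=1.0 / gamma, size=(num_freqs, pos_dim)),
        # Two-layer MLP that modulates the Fourier features.
        "W1": rng.normal(scale=0.02, size=(2 * num_freqs, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.02, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fourier_positional_encoding(positions, params):
    """Map continuous positions of shape (N, pos_dim) to encodings (N, out_dim)."""
    proj = positions @ params["W_r"].T                          # (N, num_freqs)
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
    feats = feats / np.sqrt(params["W_r"].shape[0])             # (N, 2 * num_freqs)
    hidden = gelu(feats @ params["W1"] + params["b1"])          # (N, hidden_dim)
    return hidden @ params["W2"] + params["b2"]                 # (N, out_dim)

# Usage: encode every pixel of an 8x8 grid, with coordinates scaled to [0, 1].
ys, xs = np.meshgrid(np.linspace(0, 1, 8), np.linspace(0, 1, 8), indexing="ij")
pixel_positions = np.stack([ys.ravel(), xs.ravel()], axis=-1)       # (64, 2)
params = init_params(pos_dim=2, num_freqs=32, hidden_dim=64, out_dim=128)
encodings = fourier_positional_encoding(pixel_positions, params)    # (64, 128)
```

Because positions enter as continuous coordinates, the same parameters can also encode coordinates that never appeared on the original grid.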

Key Contributions

Learnable Fourier Feature Mapping: The Fourier feature representation encodes multi-dimensional positions directly, and the dot product between two encodings approximates a kernel of the Euclidean ($L_2$) distance between the positions, which is desirable in many spatial tasks.
Parameter Efficiency: The number of parameters does not grow with sequence length or image size, so the method scales to higher-dimensional positional encoding.
Inductive Bias and Adaptability: At initialization the encoding carries an inductive bias toward Euclidean distance; because the features are trainable, the model can adapt it to task-specific requirements during training.
Performance Improvements: Experiments on various benchmark tasks demonstrate that the learnable Fourier feature representation consistently outperforms existing positional encoding methods by improving model accuracy and accelerating convergence.
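The Euclidean inductive bias and the parameter-efficiency claim can be checked numerically with a small sketch. It relies on the standard random Fourier feature result that, for frequencies drawn from a Gaussian with standard deviation 1/gamma, the dot product of two feature vectors approximates the Gaussian kernel exp(-||x - y||^2 / (2 gamma^2)). The names `fourier_features`, `gamma`, and `num_freqs`, the specific values, and the normalization by the number of frequencies are assumptions made for this illustration.

```python
import numpy as np

def fourier_features(positions, W_r):
    """Cosine and sine features, scaled so that a feature's self dot product is 1."""
    proj = positions @ W_r.T
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
    return feats / np.sqrt(W_r.shape[0])

gamma = 2.0          # kernel bandwidth (assumed value)
num_freqs = 8192     # many frequencies, to make the approximation tight
rng = np.random.default_rng(0)
W_r = rng.normal(scale=1.0 / gamma, size=(num_freqs, 2))   # state at initialization

# Three pairs of 2-D positions at different distances.
x = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, -1.0]])
y = np.array([[0.5, 0.5], [1.0, 2.5], [0.0, 0.0]])

# Dot product of the features versus the exact Gaussian kernel of the L2 distance.
approx = np.sum(fourier_features(x, W_r) * fourier_features(y, W_r), axis=-1)
sq_dist = np.sum((x - y) ** 2, axis=-1)
exact = np.exp(-sq_dist / (2.0 * gamma**2))
max_error = np.max(np.abs(approx - exact))

# The projection has num_freqs * 2 parameters, whatever the number of positions.
num_parameters = W_r.size
```

Once `W_r` is trained, the features are no longer tied to this kernel, which is the adaptability the list above refers to.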

Experimental Results

The paper reports experimental results across four tasks: image generation on the ImageNet 64x64 dataset, object detection using DETR on the COCO dataset, image classification using Vision Transformers, and widget captioning in user interfaces. It reports that learnable Fourier features outperform traditional positional encoding methods on these tasks, with the clearest gains where spatial relationships matter most.

Image Generation: Learnable Fourier features enable the Reformer model to achieve faster convergence and better accuracy compared to baseline methods using concatenated embeddings or sinusoidal encodings.
Object Detection: In the DETR model, learnable Fourier features yield improved detection performance while efficiently handling unseen image sizes without requiring complex position normalization adjustments.
Image Classification and Widget Captioning: While traditional positional encoding methods may suffice for certain tasks, the learnable Fourier features provide a significant advantage in tasks requiring a deeper understanding of spatial relationships, such as widget captioning where multi-dimensional positional relationships are crucial.

Implications and Future Directions

Learnable Fourier features give attention-based models a compact way to encode multi-dimensional positions, which could improve performance in domains that require precise positional understanding, such as robotics, geospatial analysis, and complex user interfaces. Because the parameter count does not depend on the number of positions, the method is also a candidate for large-scale, high-dimensional tasks. Future work could extend the approach to relative or hierarchical positional relationships and investigate how it combines with other architectural components.