Transformer in Transformer

(2103.00112)
Published Feb 27, 2021 in cs.CV and cs.AI

Abstract

The transformer is a new kind of neural architecture that encodes the input data into powerful features via the attention mechanism. Visual transformers first divide the input image into several local patches and then compute both their representations and their relationships. Since natural images are highly complex, with abundant detail and color information, this granularity of patch division is not fine enough to capture features of objects at different scales and locations. In this paper, we point out that the attention inside these local patches is also essential for building high-performance visual transformers, and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word is calculated with the other words in the given visual sentence at negligible computational cost. Features of both words and sentences are aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve 81.5% top-1 accuracy on ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.

Figure: Proposed TNT framework with shared inner transformer block and word position encodings across visual sentences.

Overview

  • The paper introduces the 'Transformer in Transformer' (TNT) architecture, a novel approach that enhances visual transformers by embedding one transformer within another to better model the hierarchical structure of images.

  • The TNT model employs a two-level attention mechanism, with an inner transformer for local structures within image patches and an outer transformer for global patch-level information, resulting in enhanced representation capabilities.

  • Empirical results show that TNT outperforms existing visual transformers like ViT and DeiT on benchmarks such as ImageNet, demonstrating its effectiveness in various classification and object detection tasks while maintaining computational efficiency.

An Overview of "Transformer in Transformer"

The paper "Transformer in Transformer" presents a novel architecture aimed at augmenting the representational capabilities of visual transformers. The authors introduce the concept of embedding a transformer within a transformer (TNT) to improve the processing of visual information by breaking down the hierarchical structures within images more finely.

Introduction and Motivation

Traditional visual transformers like Vision Transformer (ViT) divide an image into several local patches, treating these as sequences for processing via self-attention mechanisms. However, the authors argue that this method overlooks potential features within smaller sub-components of these patches. By failing to account for finer granularity, the existing approaches may lose out on the detailed intra-patch relationships that could enhance model performance. The TNT architecture addresses this issue by considering both the larger patches and their smaller sub-components.

Transformer in Transformer Architecture

The proposed TNT architecture implements a hierarchical strategy where each image is first split into larger "visual sentences" or patches, which are subsequently divided into smaller "visual words". This enables the model to simultaneously capture local and global structures through a two-level attention mechanism:

  1. Inner Transformer Block: This block computes attention within the smaller patches (visual words) of a larger patch (visual sentence), enhancing the local representations.
  2. Outer Transformer Block: This block processes the larger patches (visual sentences), focusing on capturing global structural information.

The integration of these two attention mechanisms enables the TNT model to leverage both fine-grained local details and broader contextual relationships, improving the model's overall performance on visual tasks.
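
To make the two-level design concrete, below is a minimal PyTorch sketch of a single TNT block, assuming TNT-S-like dimensions (384-dim sentence embeddings, 24-dim word embeddings, sixteen 4×4 visual words per 16×16 visual sentence). The class token, position encodings, and normalization details are omitted, and the module and variable names are illustrative rather than taken from the authors' released code.

```python
# Minimal sketch of one Transformer-in-Transformer (TNT) block.
# Assumptions: TNT-S-like sizes (outer_dim=384, inner_dim=24, 16 words per
# sentence); class token and position encodings omitted for brevity.
import torch
import torch.nn as nn


class TNTBlock(nn.Module):
    """Inner transformer over visual words; their aggregated features are
    added to the visual-sentence embedding before the outer transformer."""

    def __init__(self, outer_dim=384, inner_dim=24, num_words=16,
                 outer_heads=6, inner_heads=4):
        super().__init__()
        # Inner transformer: attention among the words of one sentence.
        self.inner = nn.TransformerEncoderLayer(
            d_model=inner_dim, nhead=inner_heads,
            dim_feedforward=4 * inner_dim, batch_first=True)
        # Projects the flattened word features into the sentence embedding.
        self.proj = nn.Linear(num_words * inner_dim, outer_dim)
        # Outer transformer: attention among visual sentences.
        self.outer = nn.TransformerEncoderLayer(
            d_model=outer_dim, nhead=outer_heads,
            dim_feedforward=4 * outer_dim, batch_first=True)

    def forward(self, words, sentences):
        # words:     (B * num_sentences, num_words, inner_dim)
        # sentences: (B, num_sentences, outer_dim)
        B, S, _ = sentences.shape
        words = self.inner(words)                   # word-level attention
        fused = self.proj(words.reshape(B, S, -1))  # aggregate words
        sentences = self.outer(sentences + fused)   # inject local detail,
        return words, sentences                     # then sentence attention


# Toy usage: a 224x224 image yields 14x14 = 196 sentences of 16x16 pixels,
# each split into 4x4 = 16 words.
block = TNTBlock()
words = torch.randn(2 * 196, 16, 24)
sentences = torch.randn(2, 196, 384)
words, sentences = block(words, sentences)
print(sentences.shape)  # torch.Size([2, 196, 384])
```

In the full model, a stack of such blocks is applied, with the word and sentence streams updated in parallel at every depth.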

Computational Efficiency

Despite introducing an additional layer of complexity via the inner transformer, the TNT model maintains computational efficiency. The inner transformer's parameters and operations are relatively lightweight compared to the outer transformer, owing to the smaller scale of the sub-patches. The analysis demonstrates that the computation cost increases only marginally while providing substantial performance improvements.
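
As a rough illustration of why the inner transformer is cheap, here is a back-of-the-envelope per-block parameter estimate under the same TNT-S-like dimensions as the sketch above (biases and normalization layers ignored); the numbers are indicative only, not the paper's reported figures.

```python
# Rough per-block parameter estimate for the inner (word-level) and outer
# (sentence-level) transformers, using TNT-S-like dimensions. Indicative only.
def transformer_layer_params(dim, ff_mult=4):
    attn = 4 * dim * dim           # Q, K, V and output projections
    mlp = 2 * ff_mult * dim * dim  # two-layer feed-forward network
    return attn + mlp

inner = transformer_layer_params(24)    # word-level transformer
outer = transformer_layer_params(384)   # sentence-level transformer
proj = 16 * 24 * 384                    # word-to-sentence projection
share = (inner + proj) / (inner + proj + outer)
print(f"inner+proj: {inner + proj:,} params, outer: {outer:,} params, "
      f"inner share ~{share:.0%}")      # roughly 8% of the block
```

Because the word embeddings are an order of magnitude narrower than the sentence embeddings, the inner stream accounts for only a small fraction of the block's parameters and computation.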

Empirical Results

A series of experiments on the ImageNet benchmark reveals that the TNT model significantly outperforms comparable visual transformers such as DeiT and ViT, achieving 81.5% top-1 accuracy with similar computational cost. Notably, the TNT-S variant achieves about 1.7% higher accuracy than DeiT-S.

The authors further substantiate the model's robustness through transfer learning on several downstream classification datasets.

Additionally, TNT performs strongly in object detection when integrated into DETR, achieving competitive results on COCO 2017, and it also does well in semantic segmentation on the ADE20K dataset.

Visualization and Interpretability

The paper includes detailed visualizations illustrating the enhanced variability and contextual integrity of feature maps in TNT compared to conventional visual transformers. These visuals highlight how TNT better preserves local information and diversifies feature representations, contributing to its improved performance.

Implications and Future Directions

This research underscores the potential advantages of a multi-scale attention mechanism in enhancing visual transformer performance. The proposed TNT architecture sets a precedent for future work in efficiently capturing multi-level dependencies within visual data. Future developments could explore:

  • Scaling TNT to larger models and datasets
  • Integrating TNT with other advanced techniques like squeeze-and-excitation (SE)
  • Application of TNT in more diverse and complex visual tasks beyond image classification and object detection

Conclusion

The "Transformer in Transformer (TNT)" architecture represents a significant step forward in the design of visual transformers, addressing critical limitations related to the granularity of feature extraction. By introducing a hierarchical attention mechanism, the authors demonstrate substantial improvements in model performance across a range of benchmarks, highlighting TNT's capacity for fine-grained and robust visual representation. This work paves the way for more nuanced and effective application of transformers in computer vision, promising advancements in both theoretical understanding and practical implementations.
