Transformer in Transformer

(2103.00112)
Published Feb 27, 2021 in cs.CV and cs.AI

Abstract

The transformer is a new kind of neural architecture that encodes the input data into powerful features via the attention mechanism. Visual transformers first divide the input image into several local patches and then compute both their representations and their relationships. Since natural images are highly complex, with abundant detail and color information, this granularity of patch division is not fine enough to capture features of objects at different scales and locations. In this paper, we point out that the attention inside these local patches is also essential for building high-performance visual transformers, and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word is calculated with the other words in the given visual sentence at negligible computational cost. Features of both words and sentences are aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve 81.5% top-1 accuracy on ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.

Figure: Proposed TNT framework with shared inner transformer block and word position encodings across visual sentences.

Overview

  • The paper introduces the 'Transformer in Transformer' (TNT) architecture, a novel approach that enhances visual transformers by embedding one transformer within another to better model the hierarchical structure of images.

  • The TNT model employs a two-level attention mechanism, with an inner transformer for local structures within image patches and an outer transformer for global patch-level information, resulting in enhanced representation capabilities.

  • Empirical results show that TNT outperforms existing visual transformers like ViT and DeiT on benchmarks such as ImageNet, demonstrating its effectiveness in various classification and object detection tasks while maintaining computational efficiency.

An Overview of "Transformer in Transformer"

The paper "Transformer in Transformer" presents a novel architecture aimed at augmenting the representational capabilities of visual transformers. The authors introduce the concept of embedding a transformer within a transformer (TNT) to improve the processing of visual information by breaking down the hierarchical structures within images more finely.

Introduction and Motivation

Traditional visual transformers like Vision Transformer (ViT) divide an image into several local patches, treating these as sequences for processing via self-attention mechanisms. However, the authors argue that this method overlooks potential features within smaller sub-components of these patches. By failing to account for finer granularity, the existing approaches may lose out on the detailed intra-patch relationships that could enhance model performance. The TNT architecture addresses this issue by considering both the larger patches and their smaller sub-components.

Transformer in Transformer Architecture

The proposed TNT architecture implements a hierarchical strategy where each image is first split into larger "visual sentences" or patches, which are subsequently divided into smaller "visual words". This enables the model to simultaneously capture local and global structures through a two-level attention mechanism:

  1. Inner Transformer Block: This block computes attention within the smaller patches (visual words) of a larger patch (visual sentence), enhancing the local representations.
  2. Outer Transformer Block: This block processes the larger patches (visual sentences), focusing on capturing global structural information.

The integration of these two attention mechanisms enables the TNT model to leverage both fine-grained local details and broader contextual relationships, improving the model's overall performance on visual tasks.
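
To make the two-level design concrete, below is a minimal PyTorch sketch of a single TNT block, assuming TNT-S-like dimensions (384-dim sentence embeddings, 24-dim word embeddings, sixteen 4×4 visual words per 16×16 visual sentence). The class token, position encodings, and normalization details are omitted, and the module and variable names are illustrative rather than taken from the authors' released code.

```python
# Minimal sketch of one Transformer-in-Transformer (TNT) block.
# Assumptions: TNT-S-like sizes (outer_dim=384, inner_dim=24, 16 words per
# sentence); class token and position encodings omitted for brevity.
import torch
import torch.nn as nn


class TNTBlock(nn.Module):
    """Inner transformer over visual words; their aggregated features are
    added to the visual-sentence embedding before the outer transformer."""

    def __init__(self, outer_dim=384, inner_dim=24, num_words=16,
                 outer_heads=6, inner_heads=4):
        super().__init__()
        # Inner transformer: attention among the words of one sentence.
        self.inner = nn.TransformerEncoderLayer(
            d_model=inner_dim, nhead=inner_heads,
            dim_feedforward=4 * inner_dim, batch_first=True)
        # Projects the flattened word features into the sentence embedding.
        self.proj = nn.Linear(num_words * inner_dim, outer_dim)
        # Outer transformer: attention among visual sentences.
        self.outer = nn.TransformerEncoderLayer(
            d_model=outer_dim, nhead=outer_heads,
            dim_feedforward=4 * outer_dim, batch_first=True)

    def forward(self, words, sentences):
        # words:     (B * num_sentences, num_words, inner_dim)
        # sentences: (B, num_sentences, outer_dim)
        B, S, _ = sentences.shape
        words = self.inner(words)                   # word-level attention
        fused = self.proj(words.reshape(B, S, -1))  # aggregate words
        sentences = self.outer(sentences + fused)   # inject local detail,
        return words, sentences                     # then sentence attention


# Toy usage: a 224x224 image yields 14x14 = 196 sentences of 16x16 pixels,
# each split into 4x4 = 16 words.
block = TNTBlock()
words = torch.randn(2 * 196, 16, 24)
sentences = torch.randn(2, 196, 384)
words, sentences = block(words, sentences)
print(sentences.shape)  # torch.Size([2, 196, 384])
```

In the full model, a stack of such blocks is applied, with the word and sentence streams updated in parallel at every depth.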

Computational Efficiency

Despite introducing an additional layer of complexity via the inner transformer, the TNT model maintains computational efficiency. The inner transformer's parameters and operations are relatively lightweight compared to the outer transformer, owing to the smaller scale of the sub-patches. The analysis demonstrates that the computation cost increases only marginally while providing substantial performance improvements.
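
As a rough illustration of why the inner transformer is cheap, here is a back-of-the-envelope per-block parameter estimate under the same TNT-S-like dimensions as the sketch above (biases and normalization layers ignored); the numbers are indicative only, not the paper's reported figures.

```python
# Rough per-block parameter estimate for the inner (word-level) and outer
# (sentence-level) transformers, using TNT-S-like dimensions. Indicative only.
def transformer_layer_params(dim, ff_mult=4):
    attn = 4 * dim * dim           # Q, K, V and output projections
    mlp = 2 * ff_mult * dim * dim  # two-layer feed-forward network
    return attn + mlp

inner = transformer_layer_params(24)    # word-level transformer
outer = transformer_layer_params(384)   # sentence-level transformer
proj = 16 * 24 * 384                    # word-to-sentence projection
share = (inner + proj) / (inner + proj + outer)
print(f"inner+proj: {inner + proj:,} params, outer: {outer:,} params, "
      f"inner share ~{share:.0%}")      # roughly 8% of the block
```

Because the word embeddings are an order of magnitude narrower than the sentence embeddings, the inner stream accounts for only a small fraction of the block's parameters and computation.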

Empirical Results

A series of experiments on the ImageNet benchmark reveals that the TNT model significantly outperforms comparable visual transformers such as DeiT and ViT, achieving 81.5% top-1 accuracy with similar computational cost. Notably, the TNT-S variant achieves about 1.7% higher accuracy than DeiT-S.

The authors further substantiate the model's robustness through transfer learning on several downstream classification datasets.

Additionally, TNT performs strongly in object detection when integrated into DETR, achieving competitive results on COCO 2017, and it also does well in semantic segmentation on the ADE20K dataset.

Visualization and Interpretability

The paper includes detailed visualizations illustrating the enhanced variability and contextual integrity of feature maps in TNT compared to conventional visual transformers. These visuals highlight how TNT better preserves local information and diversifies feature representations, contributing to its improved performance.

Implications and Future Directions

This research underscores the potential advantages of a multi-scale attention mechanism in enhancing visual transformer performance. The proposed TNT architecture sets a precedent for future work in efficiently capturing multi-level dependencies within visual data. Future developments could explore:

  • Scaling TNT to larger models and datasets
  • Integrating TNT with other advanced techniques like squeeze-and-excitation (SE)
  • Application of TNT in more diverse and complex visual tasks beyond image classification and object detection

Conclusion

The "Transformer in Transformer (TNT)" architecture represents a significant step forward in the design of visual transformers, addressing critical limitations related to the granularity of feature extraction. By introducing a hierarchical attention mechanism, the authors demonstrate substantial improvements in model performance across a range of benchmarks, highlighting TNT's capacity for fine-grained and robust visual representation. This work paves the way for more nuanced and effective application of transformers in computer vision, promising advancements in both theoretical understanding and practical implementations.
