- The paper presents QSFormer, a model that unifies query and support Transformers to strengthen both feature representation and metric learning in few-shot classification.
- It integrates both global sample and local patch transformers using cross-attention and Earth Mover’s Distance to improve class discrimination.
- Experimental results on miniImageNet and tieredImageNet demonstrate significant performance gains in 1-shot and 5-shot tasks over state-of-the-art methods.
Introduction
The paper presents a novel approach to few-shot classification, the task of recognizing new classes from only a few labeled examples. The authors propose a Query-Support Transformer (QSFormer) model that addresses the twin challenges of feature representation and metric learning in a single unified framework. By integrating a global sample Transformer (sampleFormer) and a local patch Transformer (patchFormer), the method captures both sample-level and patch-level relations between query and support sets, enabling more effective few-shot classification.
Methodology
The QSFormer framework comprises several key components: the Cross-scale Interactive Feature Extractor (CIFE), the Sample Transformer Module, the Patch Transformer Module, and the combination of these modules for metric learning and classification.
- Cross-scale Interactive Feature Extractor (CIFE): This module serves as the backbone, using a ResNet architecture to generate multi-scale features and enhancing the representations by capturing cross-scale dependencies with a Transformer-based network (Figure 1). A minimal sketch appears after this list.
Figure 1: Illustration of Cross-scale Interactive Feature Extractor (CIFE) for feature extraction.
- Sample Transformer Module (sampleFormer): As illustrated in Figure 2, sampleFormer uses an encoder-decoder structure to model the relationships between query and support samples: the encoder handles interaction among support samples, while the decoder performs query representation and metric learning through cross-attention (see the cross-attention sketch after this list).
Figure 2: An overview of the proposed QSFormer framework, which mainly consists of the Cross-scale Interactive Feature Extractor (CIFE), the Sample Transformer Module, the Patch Transformer Module, metric learning, and few-shot classification. More details can be found in Section III.
- Patch Transformer Module (patchFormer): This module complements the global branch by modeling local relationships between image patches, computing its metric with the Earth Mover's Distance (EMD) to achieve fine-grained feature alignment (an EMD approximation is sketched after this list).
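To make the CIFE idea concrete, below is a minimal PyTorch sketch that extracts two backbone scales and lets a Transformer layer attend across them. The two-scale choice, layer widths, and token fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CIFESketch(nn.Module):
    """Illustrative two-scale feature extractor with cross-scale attention.

    Stand-in convolutions play the role of ResNet stages; the real CIFE
    uses an actual ResNet backbone and may fuse more scales.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.stage1 = nn.Conv2d(3, dim, kernel_size=3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)
        # One Transformer encoder layer attends across tokens of both scales.
        self.cross_scale = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x):
        f1 = self.stage1(x)   # (B, dim, H/2, W/2)
        f2 = self.stage2(f1)  # (B, dim, H/4, W/4)
        # Flatten each scale into a token sequence and concatenate them,
        # so self-attention can model cross-scale dependencies.
        t1 = f1.flatten(2).transpose(1, 2)
        t2 = f2.flatten(2).transpose(1, 2)
        return self.cross_scale(torch.cat([t1, t2], dim=1))
```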
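The encoder-decoder split of sampleFormer can be illustrated with a small sketch: self-attention over support samples followed by query-to-support cross-attention. The single-layer depth, dimensions, and head count here are assumptions, not the paper's settings.

```python
import torch.nn as nn

class SampleFormerSketch(nn.Module):
    """Illustrative encoder-decoder over whole-sample embeddings.

    The encoder models interaction among support samples; the decoder
    lets each query attend over the supports via cross-attention.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.support_encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.query_to_support = nn.MultiheadAttention(
            embed_dim=dim, num_heads=heads, batch_first=True)

    def forward(self, query_feats, support_feats):
        # query_feats:   (B, Nq, dim), one token per query sample
        # support_feats: (B, Ns, dim), one token per support sample
        support = self.support_encoder(support_feats)
        attended, weights = self.query_to_support(
            query_feats, support, support)
        # The attention weights act as query-to-support affinities,
        # which is where representation and metric learning meet.
        return attended, weights
```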
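For the patch-level metric, a common differentiable stand-in for EMD is entropy-regularized optimal transport solved by Sinkhorn iterations, sketched below. The uniform patch weights and the Sinkhorn solver itself are assumptions; the authors' exact EMD formulation and patch weighting may differ.

```python
import torch

def sinkhorn_emd(cost, n_iters=50, eps=0.05):
    """Entropy-regularized optimal transport between two patch sets.

    cost: (Np, Np) pairwise patch cost matrix, e.g. 1 - cosine similarity.
    Uniform patch weights are assumed here for illustration.
    """
    n = cost.shape[0]
    mu = torch.full((n,), 1.0 / n)   # uniform mass on each patch
    K = torch.exp(-cost / eps)       # Gibbs kernel
    u = torch.ones(n) / n
    for _ in range(n_iters):         # alternating marginal scaling
        v = mu / (K.t() @ u)
        u = mu / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v)  # approximate transport plan
    return (plan * cost).sum()                # EMD-style distance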
Metric Learning and Few-shot Classification
QSFormer combines the global and local metrics produced by sampleFormer and patchFormer, respectively, and classifies each query image by the nearest-neighbor principle on the fused metric. A contrastive loss drives representation learning, maximizing the similarity of matching query-support pairs while minimizing that of non-matching pairs; a sketch of this combination follows below.
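As a rough illustration of how the fused metric and the contrastive objective might fit together, here is a minimal PyTorch sketch. The mixing weight `alpha`, the temperature `tau`, and the cross-entropy form of the contrastive loss are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_metric_loss(global_sim, local_sim, labels, alpha=0.5, tau=0.1):
    """Fuse global and local similarities and apply a contrastive objective.

    global_sim, local_sim: (Nq, Nway) query-to-class similarity scores
    from the sample-level and patch-level branches, respectively.
    labels: (Nq,) ground-truth class index for each query.
    alpha and tau are illustrative hyperparameters, not the paper's values.
    """
    combined = alpha * global_sim + (1.0 - alpha) * local_sim
    # Softmax contrastive form: raise the similarity of the matching
    # class and suppress the non-matching ones.
    loss = F.cross_entropy(combined / tau, labels)
    # Nearest-neighbor prediction on the fused metric at test time.
    preds = combined.argmax(dim=1)
    return loss, preds
```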
Results
QSFormer demonstrates its efficacy across multiple datasets, notably miniImageNet and tieredImageNet, outperforming several state-of-the-art models in both 1-shot and 5-shot tasks. It improves on both Transformer-based and metric-learning baselines by substantial margins, underscoring its capability in few-shot learning scenarios.
Visualization of Similarity Distributions
To illustrate the effectiveness of the proposed approach, the similarity distributions of query-support pairs are shown in Figure 3. QSFormer separates positive from negative pairs far more cleanly than the baseline, evidence of its robust metric learning.

Figure 3: Comparison of similarity distribution between Baseline and our QSFormer.
Conclusion
The proposed QSFormer model introduces a comprehensive framework that unifies representation learning and metric learning for few-shot classification. Modeling both global sample-level and local patch-level relationships between query and support samples proves to be a significant advance, as evidenced by consistent gains on standard benchmarks. Future work may refine the model for varying few-shot settings and extend it to other application domains.