- The paper presents a novel two-stage training approach that integrates CNN-based local features with transformer-based long-range dependencies.
- Experiments show the IFT model outperformed competing methods, achieving higher entropy and mutual information scores in various image fusion tasks.
- The dual-pathway design offers practical improvements in medical imaging, remote sensing, and surveillance through effective sensor fusion.
The paper "Image Fusion Transformer" presents a novel approach to image fusion by leveraging transformer architectures to capture both local and long-range dependencies, which are often overlooked in traditional CNN-based methods. The research introduces the Image Fusion Transformer (IFT) model, which seeks to improve image fusion outcomes by integrating complementary information from different sensor modalities.
Methodological Contributions
The authors propose a two-stage training approach: an auto-encoder is first trained for multi-scale feature extraction, and a novel Spatio-Transformer (ST) fusion strategy then combines the extracted features. The ST strategy pairs a convolutional neural network (CNN) branch, which extracts local features, with a transformer branch, which models long-range dependencies through self-attention. This dual-pathway design produces fused outputs that preserve both fine local detail and broader contextual information, as sketched below.
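To make the dual-pathway idea concrete, here is a minimal PyTorch sketch of a fusion block in this spirit: a convolutional branch for local features and a self-attention branch for long-range dependencies, merged by a 1x1 convolution. The layer counts, channel widths, and merge strategy are illustrative assumptions, not the authors' exact ST configuration.

```python
import torch
import torch.nn as nn

class SpatioTransformerFusion(nn.Module):
    """Illustrative dual-branch fusion block: a CNN branch for local
    features and a self-attention branch for long-range dependencies.
    Depths and widths are assumptions, not the paper's configuration."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        # CNN branch: small conv stack capturing local spatial detail.
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Transformer branch: self-attention over flattened spatial tokens.
        self.attn = nn.MultiheadAttention(embed_dim=channels,
                                          num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(channels)
        # 1x1 conv to merge the two pathways (an assumed design choice).
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Element-wise fusion of the two modalities' encoder features.
        x = feat_a + feat_b                      # (B, C, H, W)
        local = self.local_branch(x)             # local details

        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        glob = self.norm(attn_out + tokens)      # residual + layer norm
        glob = glob.transpose(1, 2).reshape(b, c, h, w)

        return self.merge(torch.cat([local, glob], dim=1))

# Usage: fuse encoder features from two modalities at one scale.
fusion = SpatioTransformerFusion(channels=64)
ir_feat = torch.randn(1, 64, 32, 32)   # e.g., infrared encoder features
vis_feat = torch.randn(1, 64, 32, 32)  # e.g., visible encoder features
fused = fusion(ir_feat, vis_feat)      # (1, 64, 32, 32)
```

In the paper's two-stage pipeline, a block like this would sit between the pre-trained encoder and decoder, fusing per-scale features from the two input modalities.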
Numerical Results
Extensive experiments demonstrate that the IFT model outperforms existing state-of-the-art techniques across various benchmark datasets. For infrared and visible image fusion, the IFT achieved an entropy (En) value of 6.9862 and a mutual information (MI) score of 13.9725, surpassing competing methods such as RFN-Nest and DenseFuse. Similarly, in MRI and PET image fusion tasks, the proposed model recorded an entropy score of 6.4328 and a correlation coefficient (CC) of 0.9463, significantly higher than those achieved by conventional techniques. These results underscore the efficacy of incorporating long-range dependencies in image fusion applications.
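For reference, entropy (En), mutual information (MI), and the correlation coefficient (CC) are standard fusion metrics that can be computed directly from image histograms. Below is a minimal NumPy sketch assuming 8-bit grayscale inputs and 256-bin histograms; note that fusion papers commonly report MI as the sum MI(source_a, fused) + MI(source_b, fused), which is the convention assumed here.

```python
import numpy as np

def entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (En) of an 8-bit grayscale image, in bits."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a: np.ndarray, b: np.ndarray, bins: int = 256) -> float:
    """MI between two grayscale images via their joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def correlation_coefficient(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson CC between two images, e.g., a source and the fused result."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

# Toy usage with random stand-ins for real infrared/visible inputs.
rng = np.random.default_rng(0)
ir = rng.integers(0, 256, (256, 256))
vis = rng.integers(0, 256, (256, 256))
fused = ((ir.astype(float) + vis) / 2).astype(np.uint8)

print("En(fused):", entropy(fused))
print("MI:", mutual_information(ir, fused) + mutual_information(vis, fused))
print("CC(ir, fused):", correlation_coefficient(ir, fused))
```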
Implications
The introduction of transformers into the image fusion domain offers several implications for both theoretical research and practical applications. Theoretically, this approach highlights the potential of transformer-based architectures to bridge the gap between local feature encoding and global spatial dependencies. From a practical perspective, enhanced image fusion techniques can improve outcomes in fields such as medical imaging, remote sensing, and night-time surveillance, where sensor modalities often capture complementary yet disparate information.
Future Directions
The research sets the stage for further exploration of transformer architectures in image processing tasks. Future developments might involve the integration of more sophisticated attention mechanisms or hybrid models that combine the strengths of CNNs and transformers in new ways. Additionally, exploring the scalability of such models on large-scale datasets or in real-time applications offers promising avenues for advancing the field.
In conclusion, the Image Fusion Transformer makes a valuable contribution to image fusion methodology. By effectively capturing both local and long-range context, it sets a new benchmark for the quality of imagery synthesized from multiple modalities.