An Empirical Study of Remote Sensing Pretraining (2204.02825v4)

Published 6 Apr 2022 in cs.CV

Abstract: Deep learning has largely reshaped remote sensing (RS) research for aerial image understanding and achieved great success. Nevertheless, most existing deep models are initialized with ImageNet pretrained weights. Natural images inevitably present a large domain gap relative to aerial images, probably limiting the finetuning performance on downstream aerial scene tasks. This issue motivates us to conduct an empirical study of remote sensing pretraining (RSP) on aerial images. To this end, we train different networks from scratch with the help of the largest RS scene recognition dataset to date -- MillionAID -- to obtain a series of RS pretrained backbones, including both convolutional neural networks (CNN) and vision transformers such as Swin and ViTAE, which have shown promising performance on computer vision tasks. We then investigate the impact of RSP on representative downstream tasks, including scene recognition, semantic segmentation, object detection, and change detection, using these CNN and vision transformer backbones. The empirical study shows that RSP can help deliver distinctive performance in scene recognition tasks and in perceiving RS-related semantics such as "Bridge" and "Airplane". We also find that, although RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, it may still suffer from task discrepancies, where downstream tasks require representations different from those learned for scene recognition. These findings call for further research efforts on both large-scale pretraining datasets and effective pretraining methods. The codes and pretrained models will be released at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing.

Citations (160)

Summary

  • The paper demonstrates that remote sensing pretraining substantially improves aerial image understanding by bridging the domain gap that limits ImageNet-pretrained models.
  • The study employs CNNs and vision transformers like Swin and ViTAE, training from scratch on the MillionAID dataset to create tailored backbones for remote sensing tasks.
  • Results indicate that RSP models excel in object detection and change detection, with vision transformers notably outperforming in complex scene recognition tasks.

An Empirical Study of Remote Sensing Pretraining

The paper investigates remote sensing pretraining (RSP), i.e., pretraining deep networks on remote sensing imagery for aerial image understanding. It underscores a limitation of ImageNet pretrained weights: the substantial domain difference between natural and aerial images may impede fine-tuning on downstream aerial tasks.

Methodology and Implementation

The authors examine the efficacy of RSP using the extensive MillionAID dataset, employing a diverse set of neural network architectures including CNNs and vision transformers like Swin and ViTAE. These networks are trained from scratch to establish remote sensing-specific backbones. The paper evaluates the impact of RSP on various downstream tasks such as scene recognition, semantic segmentation, object detection, and change detection.
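The two-stage workflow described above — pretraining a backbone from scratch on MillionAID scene labels, then reusing it for a downstream task — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny CNN stands in for the actual backbones (ResNet, Swin, ViTAE), random tensors stand in for aerial imagery, and MillionAID's 51 fine-grained scene categories are assumed for the pretraining head.

```python
import torch
import torch.nn as nn

# Stand-in backbone for the paper's CNN / vision transformer architectures.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch, 32) features
)

# Stage 1: remote sensing pretraining -- train backbone + scene-recognition
# head from scratch on MillionAID-style labels (51 leaf scene categories).
pretrain_model = nn.Sequential(backbone, nn.Linear(32, 51))
x = torch.randn(4, 3, 64, 64)                # a batch of aerial image crops
targets = torch.randint(0, 51, (4,))         # scene labels
loss = nn.CrossEntropyLoss()(pretrain_model(x), targets)
loss.backward()                              # one illustrative update step

# Stage 2: transfer -- keep the RS-pretrained backbone, swap the head for a
# downstream task (here a hypothetical 7-class recognition head).
downstream_model = nn.Sequential(backbone, nn.Linear(32, 7))
logits = downstream_model(x)
print(logits.shape)                          # per-image downstream logits
```

The key point of the design is that only the task head changes between stages; the backbone weights learned during RSP serve as the initialization for every downstream task, in place of ImageNet weights.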

Key Findings

  1. Scene Recognition: RSP significantly enhances performance on aerial scenes compared to models pretrained on ImageNet. Vision transformers demonstrated superior capability, with ViTAEv2-S notably outperforming others in understanding complex aerial landscapes.
  2. Semantic Segmentation: Traditional ImageNet pretrained models displayed a slight edge, attributed to their comprehensive data covering varied spectral information beneficial for pixel-level tasks. However, RSP models showcased improved detection of specific semantics like “Bridge.”
  3. Object Detection: RSP-enhanced models proved more effective, especially in detecting oriented bounding boxes in aerial images, emphasizing the alignment of pretraining dataset characteristics with the task requirements.
  4. Change Detection: Vision transformer models, particularly those with RSP, illustrated superior performance, indicating robust contextual representation beneficial for detecting temporal changes in aerial images.

Theoretical and Practical Implications

The findings suggest that RSP narrows domain gaps effectively, providing enhanced starting points for neural networks in handling remote sensing tasks. Moreover, vision transformers, with their ability to model locality and long-range dependencies, are positioned as strong candidates for future remote sensing applications.

Future Directions

The research highlights several avenues for exploration:

  • Developing larger and more diverse pretraining datasets specifically for remote sensing to further enhance model performance.
  • Exploring unsupervised or self-supervised pretraining to leverage vast amounts of unlabeled remote sensing data.
  • Refining network architectures to better align with the unique characteristics of remote sensing data.

This paper presents a comprehensive exploration of RSP, demonstrating its potential to enhance aerial image analysis and setting a precedent for further research in optimizing neural network pretraining for domain-specific tasks.
