Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing (2401.02764v2)

Published 5 Jan 2024 in cs.CV

Abstract: Self-supervised frameworks for representation learning have recently stirred up interest among the remote sensing community, given their potential to mitigate the high labeling costs associated with curating large satellite image datasets. In the realm of multimodal data fusion, while the often-used contrastive learning methods can help bridge the domain gap between different sensor types, they rely on data augmentation techniques that require expertise and careful design, especially for multispectral remote sensing data. A possible but scarcely studied way to circumvent these limitations is to use a masked-image-modelling-based pretraining strategy. In this paper, we introduce Fus-MAE, a self-supervised learning framework based on masked autoencoders that uses cross-attention to perform early and feature-level data fusion between synthetic aperture radar and multispectral optical data - two modalities with a significant domain gap. Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion and outperforms other masked-autoencoder frameworks trained on a larger corpus.
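The core fusion mechanism the abstract describes, cross-attention between the two modalities, can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: it assumes a single attention head, toy token counts, and randomly initialized projection matrices. Queries come from SAR patch tokens while keys and values come from optical patch tokens, so each SAR token is updated with information drawn from the optical modality.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(sar_tokens, opt_tokens, Wq, Wk, Wv):
    """Cross-modal attention: SAR queries attend to optical keys/values.

    sar_tokens: (N_sar, d) patch embeddings from the SAR branch
    opt_tokens: (N_opt, d) patch embeddings from the optical branch
    Returns a fused (N_sar, d) representation of the SAR tokens.
    """
    q = sar_tokens @ Wq                        # (N_sar, d)
    k = opt_tokens @ Wk                        # (N_opt, d)
    v = opt_tokens @ Wv                        # (N_opt, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N_sar, N_opt) scaled dot-product
    attn = softmax(scores, axis=-1)            # each SAR token's weights over optical tokens
    return attn @ v                            # weighted sum of optical values

# Toy example with made-up shapes (4 SAR tokens, 6 optical tokens, dim 8).
rng = np.random.default_rng(0)
d = 8
sar = rng.normal(size=(4, d))
opt = rng.normal(size=(6, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(sar, opt, Wq, Wk, Wv)
print(fused.shape)  # (4, 8)
```

In the paper's terms, applying such a block on raw patch embeddings corresponds to early fusion, while applying it between deeper encoder layers corresponds to feature-level fusion; a full model would use multi-head attention with learned, trained projections.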
