Emergent Mind


An in-depth comprehension of global land cover is essential in Earth observation, forming the foundation for a multitude of applications. Although remote sensing technology has advanced rapidly, leading to a proliferation of satellite imagery, the inherent complexity of these images often makes them difficult for non-expert users to understand. Natural language, as a carrier of human knowledge, can be a bridge between common users and complicated satellite imagery. In this context, we introduce a global-scale, high-quality image-text dataset for remote sensing, providing natural language descriptions for Sentinel-2 data to facilitate the understanding of satellite imagery for common users. Specifically, we utilize Sentinel-2 data for its global coverage as the foundational image source, employing semantic segmentation labels from the European Space Agency's (ESA) WorldCover project to enrich the descriptions of land covers. By conducting in-depth semantic analysis, we formulate detailed prompts to elicit rich descriptions from ChatGPT. To enhance the dataset's quality, we introduce a manual verification process. This step involves manual inspection and correction to refine the dataset, thus significantly improving its accuracy and quality. Finally, we offer the community ChatEarthNet, a large-scale image-text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions. ChatEarthNet consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision). This dataset has significant potential for training vision-language geo-foundation models and evaluating large vision-language models for remote sensing. The dataset will be made publicly available.


  • ChatEarthNet combines Sentinel-2 satellite imagery and ESA WorldCover land cover information with the language generation capabilities of ChatGPT-3.5 and ChatGPT-4V to create a high-quality, globally distributed image-text dataset.

  • The construction process relies on prompt engineering tailored to the respective strengths of ChatGPT-3.5 and ChatGPT-4V, producing relevant, detailed descriptions for 173,488 image-text pairs (163,488 from ChatGPT-3.5 and 10,000 from ChatGPT-4V).

  • The dataset offers globally distributed, spectrally detailed coverage of the Earth's surface, and its descriptive captions, which span diverse landscapes and urban settings, enrich AI training in remote sensing.

  • ChatEarthNet serves as an essential tool for advancing AI research in Earth observation, enabling the development of more sophisticated vision-language models for interpreting and describing the Earth's surface.

Exploring ChatEarthNet: A Substantial Leap in Remote Sensing Image-Text Datasets

Introduction to ChatEarthNet

The field of remote sensing has long sought ways to make satellite imagery interpretable for a broader audience. Recent advances in large language models (LLMs) and their capacity for generating natural language descriptions have opened new approaches to this challenge. In this context, the ChatEarthNet dataset is a notable development. It stands out for its global-scale coverage, using Sentinel-2 satellite data as the image source and the ESA's WorldCover project for land cover information. The dataset relies on carefully designed prompts for ChatGPT-3.5 and ChatGPT-4V to generate a detailed, high-quality caption for each image. Its methodology aims to bridge the gap between complex satellite imagery and the accessibility of natural language descriptions.

Comprehensive Dataset Construction

The foundation of ChatEarthNet lies in its construction process. Sentinel-2 imagery, known for its extensive global coverage and spectral richness, serves as the dataset's backbone. Land cover maps from the WorldCover project supply semantic segmentation labels for this imagery, which makes accurate, context-rich descriptions possible. Prompt engineering is central to this effort and is tailored to the strengths of each ChatGPT version used. This care in construction is intended to keep all 163,488 image-text pairs from ChatGPT-3.5, and the additional 10,000 pairs from ChatGPT-4V, high in quality and relevance.

Sentinel-2 Data and Land Cover Information

The dataset's reliance on Sentinel-2 data and ESA's WorldCover land cover maps ensures a comprehensive representation of the Earth's surface. The specifications include global distribution, temporal diversity, and a detailed spectral band selection, encompassing various landforms and urban layouts. These aspects are crucial for capturing the Earth's diversity and are pivotal for the dataset's broad applicability in remote sensing tasks.
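To make the role of the land cover labels concrete, the following is a minimal sketch of how a WorldCover label patch could be summarized as per-class pixel shares. The class codes are the ESA WorldCover legend; the function name `class_proportions`, the nested-list patch format, and the toy values are assumptions made for illustration and are not taken from the ChatEarthNet pipeline.

```python
from collections import Counter

# ESA WorldCover class codes and names (the 11-class legend).
WORLDCOVER_CLASSES = {
    10: "tree cover",
    20: "shrubland",
    30: "grassland",
    40: "cropland",
    50: "built-up",
    60: "bare / sparse vegetation",
    70: "snow and ice",
    80: "permanent water bodies",
    90: "herbaceous wetland",
    95: "mangroves",
    100: "moss and lichen",
}


def class_proportions(label_patch):
    """Return {class name: fraction of pixels}, largest share first.

    `label_patch` is assumed to be a 2D list of WorldCover class codes.
    """
    pixels = [code for row in label_patch for code in row]
    if not pixels:
        raise ValueError("label patch is empty")
    counts = Counter(pixels)
    total = len(pixels)
    return {
        WORLDCOVER_CLASSES.get(code, f"unknown ({code})"): n / total
        for code, n in counts.most_common()
    }


# Toy 4x4 patch (hypothetical values, not taken from the dataset).
patch = [
    [10, 10, 40, 40],
    [10, 10, 40, 40],
    [10, 80, 80, 40],
    [10, 80, 80, 50],
]
proportions = class_proportions(patch)
for name, share in proportions.items():
    print(f"{name}: {share:.1%}")
```

A real label patch would typically be read from a raster file into an array; plain lists are used here only to keep the sketch dependency-free.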

Prompt Design and Manual Verification

The dataset construction takes a different approach to prompt design for each model, matching the distinct capabilities of ChatGPT-3.5 and ChatGPT-4V. For ChatGPT-3.5, the prompts are purely text-based and are formulated to describe the semantic content of the land cover map. ChatGPT-4V, which can interpret images, receives prompts enriched with spatial and semantic detail. This dual design aims to extract the most accurate and detailed descriptions each model can provide. Manual verification adds a further layer of quality assurance: captions are inspected and corrected so that the dataset's descriptions are precise and reliable.
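As an illustration of what a text-only prompt built from land cover semantics might look like, here is a hedged sketch. The template wording, the function name `build_landcover_prompt`, the `min_share` threshold, and the example shares are all hypothetical; the actual ChatEarthNet prompts are not reproduced in this summary.

```python
def build_landcover_prompt(proportions, min_share=0.01):
    """Turn {class name: fraction} into a text-only captioning prompt.

    Classes below `min_share` are dropped so the prompt stays focused
    on land cover types that are actually visible in the patch.
    """
    kept = sorted(
        ((name, share) for name, share in proportions.items() if share >= min_share),
        key=lambda item: item[1],
        reverse=True,
    )
    if not kept:
        raise ValueError("no land cover class reaches min_share")
    facts = "; ".join(f"{name}: {share:.0%}" for name, share in kept)
    # Hypothetical template, written for this sketch only.
    return (
        "You are describing a satellite image patch. "
        f"Its land cover composition is: {facts}. "
        "Write a fluent paragraph describing the scene, "
        "mentioning only the land cover types listed."
    )


# Example shares (made up for illustration).
prompt = build_landcover_prompt(
    {"cropland": 0.62, "tree cover": 0.30, "built-up": 0.07, "grassland": 0.005}
)
print(prompt)
```

Restricting the model to the listed classes is one simple way to reduce unsupported details in the generated caption; it complements manual verification rather than replacing it.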

Analytical Insights

The analysis of ChatEarthNet offers fascinating insights into the dataset's characteristics. Geographic distribution confirms the dataset's global-scale ambition, showcasing a wide variety of landscapes and urban settings. Word clouds and word frequency histograms reveal the richness of the language used in the descriptions, highlighting the descriptive power of the employed LLMs. This linguistic diversity enriches the dataset further, making it a potent tool for training and evaluating vision-language models tailored for remote sensing applications.
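A word frequency analysis of this kind can be sketched in a few lines. The captions, the stopword list, and the function name `word_frequencies` below are invented for illustration; they are not samples or tooling from ChatEarthNet.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "with", "is", "are", "to", "by"}


def word_frequencies(captions, top_k=5):
    """Count lowercase alphabetic tokens across captions, ignoring stopwords."""
    counts = Counter()
    for caption in captions:
        words = re.findall(r"[a-z]+", caption.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_k)


# Made-up captions in the style of land cover descriptions.
captions = [
    "The image is dominated by cropland with scattered tree cover.",
    "Built-up areas border a river, with cropland to the north.",
    "Dense tree cover surrounds a small patch of cropland.",
]
top_words = word_frequencies(captions)

# Print a simple text histogram, one bar per word.
for word, count in top_words:
    print(f"{word:<10} {'#' * count}")
```

The same counts could feed a word cloud renderer; the text histogram is used here to avoid third-party dependencies.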

Diverse Applications and Future Directions

ChatEarthNet's well-documented construction process and analytical examination underscore its potential as a foundational dataset for training advanced vision-language models in the remote sensing domain. Its detailed, globally distributed image-text pairs provide a resource for developing models capable of interpreting and describing the Earth's surface. As AI continues to evolve, datasets like ChatEarthNet are likely to play an important role in expanding the capabilities of vision-language models, enabling more sophisticated applications in Earth observation and beyond.


ChatEarthNet exemplifies a significant stride in the integration of language models with remote sensing technology. By combining Sentinel-2 imagery with the descriptive prowess of ChatGPT-3.5 and ChatGPT-4V, it offers a dataset that not only enhances the interpretability of satellite images for a wide audience but also serves as a critical resource for advancing AI research in Earth observation. As the field of AI continues to progress, the implications of ChatEarthNet and similar datasets will resonate across various applications, paving the way for innovative solutions in understanding and monitoring our planet.
