Deep learning in remote sensing: a review
Standing at the paradigm shift towards data-intensive science, machine
learning techniques are becoming increasingly important. In particular, as a
major breakthrough in the field, deep learning has proven to be an extremely
powerful tool in many fields. Shall we embrace deep learning as the key to all?
Or, should we resist a 'black-box' solution? There are controversial opinions
in the remote sensing community. In this article, we analyze the challenges of
using deep learning for remote sensing data analysis, review the recent
advances, and provide resources to make deep learning in remote sensing
ridiculously simple to start with. More importantly, we advocate that remote
sensing scientists bring their expertise into deep learning and use it as an
implicit general model to tackle unprecedented, large-scale, influential
challenges such as climate change and urbanization.
Comment: Accepted for publication in IEEE Geoscience and Remote Sensing Magazine
S4: Self-Supervised Sensing Across the Spectrum
Satellite image time series (SITS) segmentation is crucial for many
applications like environmental monitoring, land cover mapping and agricultural
crop type classification. However, training models for SITS segmentation
remains a challenging task due to the lack of abundant training data, which
requires fine-grained annotation. We propose S4, a new self-supervised
pre-training approach that significantly reduces the requirement for labeled
training data by utilizing two new insights: (a) Satellites capture images in
different parts of the spectrum, such as radio and visible frequencies.
(b) Satellite imagery is geo-registered, allowing for fine-grained
spatial alignment. We use these insights to formulate pre-training tasks in S4.
We also curate m2s2-SITS, a large-scale dataset of unlabeled,
spatially-aligned, multi-modal and geographically specific SITS that serves as
representative pre-training data for S4. Finally, we evaluate S4 on multiple
SITS segmentation datasets and demonstrate its efficacy against competing
baselines while using limited labeled data.
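As an illustration of the kind of spatial-alignment pre-training the abstract describes, the sketch below contrasts co-registered optical and SAR patches of the same location against other locations in the batch. The encoder architecture, loss form and band counts are assumptions for the example, not the authors' S4 formulation.

```python
# Hypothetical sketch: contrastive pre-training on spatially aligned
# optical/SAR patches (inspired by, not identical to, S4).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Tiny CNN mapping a patch to a normalized embedding vector."""
    def __init__(self, in_channels, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def cross_modal_info_nce(z_opt, z_sar, temperature=0.07):
    """Co-registered optical/SAR patches are positives; the other
    locations in the batch serve as negatives."""
    logits = z_opt @ z_sar.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(z_opt.size(0))          # diagonal = aligned pairs
    return F.cross_entropy(logits, targets)

# Dummy co-registered batches (10 optical bands, 2 SAR bands assumed).
opt_enc, sar_enc = PatchEncoder(10), PatchEncoder(2)
optical = torch.randn(8, 10, 64, 64)
sar = torch.randn(8, 2, 64, 64)
loss = cross_modal_info_nce(opt_enc(optical), sar_enc(sar))
loss.backward()
```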
Disaster Analysis using Satellite Image Data with Knowledge Transfer and Semi-Supervised Learning Techniques
With the increase in the frequency of disasters and crisis situations like floods, earthquakes and hurricanes, the need to handle such situations efficiently through disaster response and humanitarian relief has grown. Disasters are mostly unpredictable in nature with respect to their impact on people and property. Moreover, the dynamic and varied nature of disasters makes it difficult to predict their impact accurately for advance preparation of responses [104]. It is also notable that the economic loss due to natural disasters has increased in recent years, and this, along with the pure humanitarian need, is one of the reasons to research innovative approaches for mitigating and managing disaster operations efficiently [1].
Self-Supervised Learning for Invariant Representations From Multi-Spectral and SAR Images
Self-Supervised Learning (SSL) has become the new state of the art in several domain classification and segmentation tasks. One popular category of SSL methods is distillation networks such as Bootstrap Your Own Latent (BYOL). This work proposes RS-BYOL, which builds on BYOL in the remote sensing (RS) domain, where data are non-trivially different from natural RGB images. Since multi-spectral (MS) and synthetic aperture radar (SAR) sensors provide varied spectral and spatial resolution information, we utilise them as an implicit augmentation to learn invariant feature embeddings. In order to learn RS-based invariant features with SSL, we trained RS-BYOL in two ways, i.e. single-channel feature learning and three-channel feature learning. This work explores the usefulness of single-channel feature learning from random bands among the MS bands of 10 m-20 m resolution and the VV and VH SAR bands, compared to the common notion of using three or more bands. In our linear probing evaluation, these single-channel features reached a 0.92 F1 score on the EuroSAT classification task and 59.6 mIoU on the IEEE Data Fusion Contest (DFC) segmentation task for certain single bands. We also compare our results with ImageNet weights and show that the RS-based SSL model outperforms the supervised ImageNet-based model. We further explore the usefulness of multi-modal data compared to single-modality data, and show that utilising MS and SAR data allows better invariant representations to be learnt than utilising MS data alone.
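To make the BYOL-style setup concrete, the sketch below treats a single MS band and a co-located SAR band as the two "views" of a scene, with an online network, a momentum (EMA) target and the symmetric BYOL loss. Network sizes, the EMA rate and the single-channel inputs are illustrative assumptions, not the RS-BYOL configuration.

```python
# Hypothetical BYOL-style update with MS and SAR views of the same scene.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, hidden=256, out_dim=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden),
                         nn.ReLU(), nn.Linear(hidden, out_dim))

class OnlineNet(nn.Module):
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone, self.projector = backbone, mlp(feat_dim)
        self.predictor = mlp(128)
    def forward(self, x):
        return self.predictor(self.projector(self.backbone(x)))

backbone = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
online = OnlineNet(backbone, feat_dim=32)
target = copy.deepcopy(online)            # momentum (EMA) copy, no gradients
for p in target.parameters():
    p.requires_grad_(False)

def byol_loss(p, z):
    return 2 - 2 * F.cosine_similarity(p, z, dim=-1).mean()

ms_band = torch.randn(16, 1, 64, 64)      # one random MS band
sar_band = torch.randn(16, 1, 64, 64)     # the co-located SAR band (VV or VH)

p1, p2 = online(ms_band), online(sar_band)
with torch.no_grad():
    z1 = target.projector(target.backbone(ms_band))
    z2 = target.projector(target.backbone(sar_band))
loss = byol_loss(p1, z2) + byol_loss(p2, z1)   # symmetric BYOL objective
loss.backward()

# EMA update of the target network after the optimizer step.
with torch.no_grad():
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.mul_(0.99).add_(po, alpha=0.01)
```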
Multi-Modal Self-Supervised Representation Learning for Earth Observation
Self-Supervised Learning (SSL) has reduced the performance gap between supervised and unsupervised learning due to its ability to learn invariant representations. This is a boon to domains like Earth Observation (EO), where labelled data are scarce but unlabelled data are freely available. While transfer learning from generic RGB pre-trained models is still commonplace in EO, we argue that it is essential to have a good EO domain-specific pre-trained model for use with downstream tasks that have limited labelled data. Hence, we explored the applicability of SSL with multi-modal satellite imagery for downstream tasks. For this, we utilised the state-of-the-art SSL architectures BYOL and SimSiam to train on EO data. Also, to obtain better invariant representations, we considered multi-spectral (MS) images and synthetic aperture radar (SAR) images as separate augmented views of an image to maximise their similarity. Our work shows that by learning single-channel representations through non-contrastive learning, our approach can significantly outperform ImageNet pre-trained models on a scene classification task. We further explored the usefulness of a momentum encoder by comparing the two architectures, BYOL and SimSiam, but did not identify a significant improvement in performance between the models.
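The SimSiam alternative mentioned above drops the momentum encoder and relies on a stop-gradient instead. A minimal sketch follows, again treating MS and SAR crops as the two views; the module shapes are assumptions for illustration only.

```python
# Hypothetical SimSiam-style variant: no momentum encoder, collapse is
# avoided by the stop-gradient (detach) on the target branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(1, 32, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                        nn.Linear(32, 128))
predictor = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))

def simsiam_loss(p, z):
    # negative cosine similarity with stop-gradient on the target branch
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

ms, sar = torch.randn(16, 1, 64, 64), torch.randn(16, 1, 64, 64)
z_ms, z_sar = encoder(ms), encoder(sar)
loss = 0.5 * (simsiam_loss(predictor(z_ms), z_sar)
              + simsiam_loss(predictor(z_sar), z_ms))
loss.backward()
```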
Cross-sensor self-supervised training and alignment for remote sensing
Large-scale "foundation models" have gained traction as a way to leverage the
vast amounts of unlabeled remote sensing data collected every day. However, due
to the multiplicity of Earth Observation satellites, these models should learn
"sensor agnostic" representations, that generalize across sensor
characteristics with minimal fine-tuning. This is complicated by data
availability, as low-resolution imagery, such as Sentinel-2 and Landsat-8 data,
is available in large amounts, while very high-resolution aerial or satellite
data is less common. To tackle these challenges, we introduce cross-sensor
self-supervised training and alignment for remote sensing (X-STARS). We design
a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD),
to align representations across sensors, even with vastly different
resolutions. Our X-STARS can be applied to train models from scratch, or to
adapt large models pretrained on, e.g., low-resolution EO data to new
high-resolution sensors, in a continual pretraining framework. We collect and
release MSC-France, a new multi-sensor dataset, on which we train our X-STARS
models, and then evaluate them on seven downstream classification and segmentation
tasks. We demonstrate that X-STARS outperforms the state-of-the-art by a
significant margin with less data across various conditions of data
availability and resolutions.
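The published MSAD loss is not reproduced here, but the general idea of aligning dense features from sensors with different resolutions can be sketched as follows: resample one feature map onto the other's grid so that locations correspond, then penalize dissimilarity per location. Backbones, dimensions and the cosine criterion are assumptions.

```python
# Hypothetical dense cross-sensor alignment sketch (not the actual MSAD loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

hi_res_backbone = nn.Conv2d(3, 64, 3, padding=1)    # e.g. aerial RGB
lo_res_backbone = nn.Conv2d(12, 64, 3, padding=1)   # e.g. Sentinel-2 bands

hi = torch.randn(4, 3, 256, 256)     # high-resolution view of a scene
lo = torch.randn(4, 12, 64, 64)      # co-registered low-resolution view

f_hi = hi_res_backbone(hi)                           # (4, 64, 256, 256)
f_lo = lo_res_backbone(lo)                           # (4, 64, 64, 64)

# Resample the high-resolution feature map onto the low-resolution grid so
# that spatial locations correspond, then align features per location.
f_hi_ds = F.adaptive_avg_pool2d(f_hi, f_lo.shape[-2:])
dense_align_loss = (1 - F.cosine_similarity(f_hi_ds, f_lo, dim=1)).mean()
dense_align_loss.backward()
```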
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
Deep Learning (DL) is undergoing a paradigm shift with the emergence of
foundation models, aptly named for their crucial yet incomplete nature. In this
work, we focus on Contrastive Language-Image Pre-training (CLIP), an
open-vocabulary foundation model, which achieves high accuracy across many
image classification tasks and is often competitive with a fully supervised
baseline without being explicitly trained. Nevertheless, there are still
domains where zero-shot CLIP performance is far from optimal, such as Remote
Sensing (RS) and medical imagery. These domains not only exhibit
fundamentally different distributions compared to natural images, but also
commonly rely on complementary modalities, beyond RGB, to derive meaningful
insights. To this end, we propose a methodology for aligning
distinct RS imagery modalities with the visual and textual modalities of CLIP.
Our two-stage procedure comprises robust fine-tuning of CLIP to deal with the
distribution shift, followed by cross-modal alignment of an RS modality
encoder, in an effort to extend the zero-shot capabilities of CLIP. We
ultimately demonstrate our method on the tasks of RS imagery classification and
cross-modal retrieval. We empirically show that both robust fine-tuning and
cross-modal alignment translate to significant performance gains, across
several RS benchmark datasets. Notably, these enhancements are achieved without
the reliance on textual descriptions, without introducing any task-specific
parameters, without training from scratch and without catastrophic forgetting.
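The cross-modal alignment stage can be pictured as training a new RS-modality encoder to land in a frozen CLIP-style image embedding space, without any text. The sketch below uses a small stand-in module for the frozen CLIP vision tower; the encoder architectures, dimensions and temperature are assumptions and this is not the authors' exact procedure.

```python
# Hypothetical sketch: align a trainable SAR encoder to frozen image embeddings
# with a CLIP-style InfoNCE loss over co-registered RGB/SAR pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

clip_image_encoder = nn.Sequential(            # stand-in for a frozen CLIP tower
    nn.Conv2d(3, 64, 7, 4, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 512))
for p in clip_image_encoder.parameters():
    p.requires_grad_(False)

sar_encoder = nn.Sequential(                   # trainable RS-modality encoder
    nn.Conv2d(2, 64, 7, 4, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 512))

rgb = torch.randn(8, 3, 224, 224)              # co-registered RGB patches
sar = torch.randn(8, 2, 224, 224)              # matching SAR patches (VV/VH)

with torch.no_grad():
    z_img = F.normalize(clip_image_encoder(rgb), dim=-1)
z_sar = F.normalize(sar_encoder(sar), dim=-1)

logits = z_sar @ z_img.t() / 0.07              # InfoNCE over the batch, no text
labels = torch.arange(len(logits))
loss = F.cross_entropy(logits, labels)
loss.backward()                                # only the SAR encoder gets gradients
```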
Semi-supervised learning for joint SAR and multispectral land cover classification
Semi-supervised learning techniques are gaining popularity due to their
capability of building models that are effective, even when scarce amounts of
labeled data are available. In this paper, we present a framework and specific
tasks for self-supervised pretraining of multichannel models, such as
the fusion of multispectral and synthetic aperture radar images. We show that
the proposed self-supervised approach is highly effective at learning features
that correlate with the labels for land cover classification. This is enabled
by an explicit design of pretraining tasks which promotes bridging the gaps
between sensing modalities and exploiting the spectral characteristics of the
input. In a semi-supervised setting, when limited labels are available, using
the proposed self-supervised pretraining, followed by supervised finetuning for
land cover classification with SAR and multispectral data, outperforms
conventional approaches such as purely supervised learning, initialization from
training on ImageNet, and other recent self-supervised approaches.
Comment: IEEE Geoscience and Remote Sensing Letters
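One plausible pretext task of the kind described, bridging sensing modalities, is to predict the SAR channels from the multispectral channels so that the encoder must capture cross-sensor structure. The example below is only an illustration of that idea, with assumed band counts and an L1 reconstruction objective, not the specific tasks defined in the paper.

```python
# Hypothetical cross-modality prediction pretext task (MS -> SAR).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(12, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
sar_head = nn.Conv2d(64, 2, 3, padding=1)      # reconstruct the 2 SAR channels

ms = torch.randn(4, 12, 128, 128)              # Sentinel-2-like multispectral stack
sar = torch.randn(4, 2, 128, 128)              # co-registered SAR target (VV/VH)

pretext_loss = F.l1_loss(sar_head(encoder(ms)), sar)
pretext_loss.backward()
# After pretraining, `encoder` would be reused and fine-tuned with a
# segmentation head on the limited labeled land-cover data.
```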
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
We propose a metadata-aware self-supervised learning (SSL) framework useful
for fine-grained classification and ecological mapping of bird species around
the world. Our framework unifies two SSL strategies: Contrastive Learning (CL)
and Masked Image Modeling (MIM), while also enriching the embedding space with
metadata available with ground-level imagery of birds. We separately train
uni-modal and cross-modal ViTs on a novel cross-view global bird species dataset
containing ground-level imagery, metadata (location, time), and corresponding
satellite imagery. We demonstrate that our models learn fine-grained and
geographically conditioned features of birds, by evaluating on two downstream
tasks: fine-grained visual classification (FGVC) and cross-modal retrieval.
Pre-trained models learned using our framework achieve SotA performance on FGVC
of iNAT-2021 birds and in transfer learning settings for CUB-200-2011 and
NABirds datasets. Moreover, the impressive cross-modal retrieval performance of
our model enables the creation of species distribution maps across any
geographic region. The dataset and source code will be released at
https://github.com/mvrl/BirdSAT.
Comment: Accepted at WACV 202
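The combination of the two SSL objectives named in the abstract can be sketched as a simplified masked-reconstruction term on the ground-level image plus a cross-view contrastive term tying it to the co-located satellite image. Masking by zeroing a fixed region and the tiny encoders are simplifications and assumptions, not the BirdSAT architecture.

```python
# Hypothetical sketch: joint (simplified) MIM + cross-view contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

ground_enc = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))
sat_enc = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))
decoder = nn.Sequential(nn.Linear(128, 3 * 32 * 32))   # reconstruct a 32x32 crop

ground = torch.randn(8, 3, 128, 128)
satellite = torch.randn(8, 3, 128, 128)

# MIM-style term: hide a region of the ground image and reconstruct it.
masked = ground.clone()
masked[:, :, 48:80, 48:80] = 0.0
z_g = ground_enc(masked)
recon = decoder(z_g).view(8, 3, 32, 32)
mim_loss = F.mse_loss(recon, ground[:, :, 48:80, 48:80])

# Contrastive term: ground and satellite views of the same location attract.
z_s = sat_enc(satellite)
logits = F.normalize(z_g, dim=-1) @ F.normalize(z_s, dim=-1).t() / 0.07
cl_loss = F.cross_entropy(logits, torch.arange(8))

(mim_loss + cl_loss).backward()
```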
