Fine-Grained Object Recognition and Zero-Shot Learning in Remote Sensing Imagery
Fine-grained object recognition, which aims to identify the type of an object
among a large number of subcategories, is an emerging application as
increasing image resolution exposes new details in image data. Traditional
fully supervised algorithms fail on this problem, where the classes of
interest exhibit low between-class variance, high within-class variance, and
small sample sizes. We study an even more extreme scenario named
zero-shot learning (ZSL) in which no training example exists for some of the
classes. ZSL aims to build a recognition model for new unseen categories by
relating them to seen classes that were previously learned. We establish this
relation by learning a compatibility function between image features extracted
via a convolutional neural network and auxiliary information that describes the
semantics of the classes of interest by using training samples from the seen
classes. Then, we show how knowledge transfer can be performed for the unseen
classes by maximizing this function during inference. We introduce a new data
set that contains 40 different types of street trees in 1-ft spatial resolution
aerial data, and evaluate the performance of this model with manually annotated
attributes, a natural language model, and a scientific taxonomy as auxiliary
information. The experiments show that the proposed model achieves 14.3%
recognition accuracy for the classes with no training examples, which is
significantly better than both the random-guess accuracy of 6.3% for the 16
test classes and three other ZSL algorithms.
Comment: G. Sumbul, R. G. Cinbis, S. Aksoy, "Fine-Grained Object Recognition
and Zero-Shot Learning in Remote Sensing Imagery", IEEE Transactions on
Geoscience and Remote Sensing (TGRS), in press, 201
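As a concrete illustration of the compatibility-function idea in this abstract, the sketch below scores CNN image features against class attribute vectors with a bilinear form, trains it on seen classes, and maximizes it over unseen classes at inference. The bilinear form (in the style of ALE/SJE models) and all dimensions are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a bilinear compatibility function for zero-shot
# learning. F(x, y) = x^T W a_y is an assumed form; the paper's exact
# model may differ. All dimensions are illustrative.
import torch
import torch.nn.functional as F

d_img, d_attr, n_seen = 512, 85, 24          # illustrative dimensions
W = torch.randn(d_img, d_attr, requires_grad=True)  # compatibility matrix

def compatibility(img_feat, class_embs):
    """Score every class embedding against a batch of image features."""
    # img_feat: (B, d_img), class_embs: (C, d_attr) -> scores: (B, C)
    return img_feat @ W @ class_embs.T

# Training on seen classes: cross-entropy over compatibility scores.
opt = torch.optim.SGD([W], lr=1e-3)
img_feat = torch.randn(32, d_img)            # CNN features (placeholder)
seen_embs = torch.randn(n_seen, d_attr)      # seen-class attribute vectors
labels = torch.randint(0, n_seen, (32,))
loss = F.cross_entropy(compatibility(img_feat, seen_embs), labels)
loss.backward(); opt.step()

# Inference on unseen classes: pick the class whose semantic embedding
# maximizes the learned compatibility function.
unseen_embs = torch.randn(16, d_attr)        # 16 test classes, as in the paper
pred = compatibility(img_feat, unseen_embs).argmax(dim=1)
```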
CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting
Natural scene analysis and remote sensing imagery offer immense potential for
advancements in large-scale language-guided context-aware data utilization.
This potential is particularly significant for enhancing performance in
downstream tasks such as object detection and segmentation with designed
language prompting. In light of this, we introduce CPSeg (Chain-of-Thought
Language Prompting for Finer-grained Semantic Segmentation), an innovative
framework designed to augment image segmentation performance by integrating a
novel "Chain-of-Thought" process that harnesses textual information associated
with images. This groundbreaking approach has been applied to a flood disaster
scenario. CPSeg encodes prompt texts derived from various sentences to
formulate a coherent chain-of-thought. We propose a new vision-language
dataset, FloodPrompt, which includes images, semantic masks, and corresponding
text information. This not only strengthens the semantic understanding of the
scenario but also aids in the key task of semantic segmentation through an
interplay of pixel and text matching maps. Our qualitative and quantitative
analyses validate the effectiveness of CPSeg.
Comment: WACV 202
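The "interplay of pixel and text matching maps" mentioned in the abstract can be pictured with the sketch below, which scores every pixel embedding against a set of prompt embeddings. The encoders are stand-ins and the chain-of-thought prompt construction is omitted; this is not CPSeg's actual implementation.

```python
# Minimal sketch of pixel-text matching maps. The dense visual features
# and prompt embeddings are placeholders for CLIP-style encoders.
import torch
import torch.nn.functional as F

B, C, H, W = 1, 256, 64, 64
pixel_feats = torch.randn(B, C, H, W)        # dense visual features (placeholder)
prompt_embs = torch.randn(4, C)              # embeddings of 4 prompt sentences

# Cosine similarity between every pixel and every prompt embedding.
pix = F.normalize(pixel_feats.flatten(2), dim=1)      # (B, C, H*W)
txt = F.normalize(prompt_embs, dim=1)                 # (K, C)
match_maps = torch.einsum('kc,bcn->bkn', txt, pix)    # (B, K, H*W)
match_maps = match_maps.view(B, -1, H, W)             # one map per prompt

# A crude segmentation can be read off by taking the per-pixel argmax
# over the prompt maps.
seg = match_maps.argmax(dim=1)               # (B, H, W)
```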
SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model
The success of the Segment Anything Model (SAM) demonstrates the significance
of data-centric machine learning. However, due to the difficulties and high
costs associated with annotating Remote Sensing (RS) images, a large amount of
valuable RS data remains unlabeled, particularly at the pixel level. In this
study, we leverage SAM and existing RS object detection datasets to develop an
efficient pipeline for generating a large-scale RS segmentation dataset, dubbed
SAMRS. In total, SAMRS contains 105,090 images and 1,668,241 instances,
surpassing existing high-resolution RS segmentation datasets in size by several
orders of magnitude. It provides object category, location, and instance
information that can be used for semantic segmentation, instance segmentation,
and object detection, either individually or in combination. We also provide a
comprehensive analysis of SAMRS from various aspects. Moreover, preliminary
experiments highlight the importance of conducting segmentation pre-training
with SAMRS to address task discrepancies and alleviate the limitations posed by
limited training data during fine-tuning. The code and dataset will be
available at https://github.com/ViTAE-Transformer/SAMRS.
Comment: Accepted by NeurIPS 2023 Datasets and Benchmarks Track
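The core of the box-prompted mask generation the abstract describes can be sketched with the public segment-anything package: bounding boxes from an existing RS detection dataset are fed to SAM as prompts and the predicted masks are kept. The checkpoint path and box source are placeholders, and the full SAMRS pipeline involves more than this bare loop.

```python
# Sketch of box-prompted mask generation with SAM. The checkpoint path
# is a placeholder; boxes would come from an RS detection dataset.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

def boxes_to_masks(image, boxes):
    """image: HxWx3 uint8 RGB array; boxes: list of [x0, y0, x1, y1]."""
    predictor.set_image(image)
    masks = []
    for box in boxes:
        m, scores, _ = predictor.predict(
            box=np.asarray(box), multimask_output=False)
        masks.append(m[0])           # one binary mask per detection box
    return masks
```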
RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model
Pre-trained Vision-Language Foundation Models utilizing extensive image-text
paired data have demonstrated unprecedented image-text association
capabilities, achieving remarkable results across various downstream tasks. A
critical challenge is how to make use of existing large-scale pre-trained VLMs,
which are trained on common objects, to perform the domain-specific transfer
for accomplishing domain-related downstream tasks. In this paper, we propose a
new framework that includes the Domain Foundation Model (DFM), bridging the gap
between the General Foundation Model (GFM) and domain-specific downstream
tasks. Moreover, we present an image-text paired dataset in the field of remote
sensing (RS), RS5M, which has 5 million RS images with English descriptions.
The dataset is obtained from filtering publicly available image-text paired
datasets and captioning label-only RS datasets with pre-trained VLM. These
constitute the first large-scale RS image-text paired dataset. Additionally, we
tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement
the DFM. Experimental results show that our proposed dataset is highly
effective for various tasks, improving upon the baseline in zero-shot
classification tasks and obtaining good results in both Vision-Language
Retrieval and Semantic Localization tasks.
https://github.com/om-ai-lab/RS5M
Comment: RS5M dataset v
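One plausible realization of Parameter-Efficient Fine-Tuning on a CLIP-style VLM, in the spirit of the DFM idea above, is LoRA via the Hugging Face peft library, sketched below. The base checkpoint, rank, and target modules are assumptions, not the paper's reported configuration.

```python
# Hedged sketch of LoRA-based PEFT on a CLIP backbone. Hyperparameters
# and the base model are illustrative assumptions.
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

base = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
cfg = LoraConfig(r=8, lora_alpha=16,
                 target_modules=["q_proj", "v_proj"])  # attention projections
model = get_peft_model(base, cfg)
model.print_trainable_parameters()   # only the low-rank adapters train

# The adapted model would then be fine-tuned on RS5M image-text pairs
# with the standard CLIP contrastive objective.
```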
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
We propose a metadata-aware self-supervised learning (SSL) framework useful
for fine-grained classification and ecological mapping of bird species around
the world. Our framework unifies two SSL strategies: Contrastive Learning (CL)
and Masked Image Modeling (MIM), while also enriching the embedding space with
metadata available with ground-level imagery of birds. We separately train
uni-modal and cross-modal ViTs on a novel cross-view global bird species dataset
containing ground-level imagery, metadata (location, time), and corresponding
satellite imagery. We demonstrate that our models learn fine-grained and
geographically conditioned features of birds, by evaluating on two downstream
tasks: fine-grained visual classification (FGVC) and cross-modal retrieval.
Pre-trained models learned using our framework achieve SotA performance on FGVC
of iNAT-2021 birds and in transfer learning settings for CUB-200-2011 and
NABirds datasets. Moreover, the impressive cross-modal retrieval performance of
our model enables the creation of species distribution maps across any
geographic region. The dataset and source code will be released at
https://github.com/mvrl/BirdSAT.
Comment: Accepted at WACV 202
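The unification of contrastive learning and masked image modeling described above can be sketched as a simple sum of two losses: a cross-view InfoNCE term between paired ground-level and satellite embeddings, and an MAE-style reconstruction term on masked patches. Shapes, the loss weighting, and the metadata handling are illustrative placeholders, not BirdSAT's actual training recipe.

```python
# Minimal sketch of a combined CL + MIM objective. Encoders and
# metadata conditioning are omitted; all tensors are placeholders.
import torch
import torch.nn.functional as F

def info_nce(z_ground, z_sat, tau=0.07):
    """Cross-view contrastive loss between paired embeddings."""
    z_g = F.normalize(z_ground, dim=1)
    z_s = F.normalize(z_sat, dim=1)
    logits = z_g @ z_s.T / tau                    # (B, B) similarity matrix
    targets = torch.arange(z_g.size(0))           # matching pairs on diagonal
    return F.cross_entropy(logits, targets)

def mim_loss(pred_patches, true_patches, mask):
    """MAE-style reconstruction loss, computed on masked patches only."""
    err = (pred_patches - true_patches) ** 2
    return (err.mean(dim=-1) * mask).sum() / mask.sum()

B, D, N, P = 16, 512, 196, 768
z_ground, z_sat = torch.randn(B, D), torch.randn(B, D)
pred, true = torch.randn(B, N, P), torch.randn(B, N, P)
mask = (torch.rand(B, N) < 0.75).float()          # 75% of patches masked

loss = info_nce(z_ground, z_sat) + mim_loss(pred, true, mask)
```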
Automated High-resolution Earth Observation Image Interpretation: Outcome of the 2020 Gaofen Challenge
In this article, we introduce the 2020 Gaofen Challenge and relevant scientific outcomes. The 2020 Gaofen Challenge is an international competition organized by the China High-Resolution Earth Observation Conference Committee and the Aerospace Information Research Institute, Chinese Academy of Sciences, and technically cosponsored by the IEEE Geoscience and Remote Sensing Society and the International Society for Photogrammetry and Remote Sensing. It aims at promoting the academic development of automated high-resolution earth observation image interpretation. Six independent tracks have been organized in this challenge, covering challenging problems in the fields of object detection and semantic segmentation. With the development of convolutional neural networks, deep-learning-based methods have achieved good performance on image interpretation. In this article, we report the details and the best-performing methods presented so far within the scope of this challenge.
What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation
While semantic segmentation has seen tremendous improvements in the past,
significant labeling effort is still necessary, and generalization to classes
that were not present during training remains limited.
To address this problem, zero-shot semantic segmentation makes use of large
self-supervised vision-language models, allowing zero-shot transfer to unseen
classes. In this work, we build a benchmark for Multi-domain Evaluation of
Semantic Segmentation (MESS), which allows a holistic analysis of performance
across a wide range of domain-specific datasets such as medicine, engineering,
earth monitoring, biology, and agriculture. To do this, we reviewed 120
datasets, developed a taxonomy, and classified the datasets according to the
developed taxonomy. We select a representative subset consisting of 22 datasets
and propose it as the MESS benchmark. We evaluate eight recently published
models on the proposed MESS benchmark and analyze the performance
characteristics of zero-shot transfer models. The toolkit is available at
https://github.com/blumenstiel/MESS
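Conceptually, a MESS-style evaluation reduces to running one zero-shot model over many domain-specific datasets and aggregating per-dataset mIoU, as in the schematic below. The model interface and dataset layout are placeholders; the linked toolkit defines the real API.

```python
# Schematic multi-domain zero-shot evaluation loop. `model.segment`
# and the dataset tuples are hypothetical placeholders.
import numpy as np

def miou(pred, gt, num_classes):
    """Mean IoU over classes present in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def evaluate(model, datasets):
    """Run a zero-shot model over every domain dataset, report mean mIoU."""
    results = {}
    for name, (images, gts, class_names, num_classes) in datasets.items():
        scores = [miou(model.segment(img, class_names), gt, num_classes)
                  for img, gt in zip(images, gts)]
        results[name] = float(np.mean(scores))   # per-dataset mean mIoU
    return results
```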