RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data
In this paper, we introduce the task of visual grounding for remote sensing
data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS)
images with the guidance of natural language. To retrieve rich information from
RS imagery using natural language, many research tasks, like RS image visual
question answering, RS image captioning, and RS image-text retrieval have been
investigated a lot. However, the object-level visual grounding on RS images is
still under-explored. Thus, in this work, we propose to construct the dataset
and explore deep learning models for the RSVG task. Specifically, our
contributions can be summarized as follows. 1) We build the new large-scale
benchmark dataset of RSVG, termed RSVGD, to fully advance the research of RSVG.
This new dataset includes image/expression/box triplets for training and
evaluating visual grounding models. 2) We benchmark extensive state-of-the-art
(SOTA) natural image visual grounding methods on the constructed RSVGD dataset,
and some insightful analyses are provided based on the results. 3) A novel
transformer-based Multi-Level Cross-Modal feature learning (MLCM) module is
proposed. Remotely sensed images typically exhibit large scale variations and
cluttered backgrounds. To address the scale-variation problem, the MLCM
module takes advantage of multi-scale visual features and multi-granularity
textual embeddings to learn more discriminative representations. To cope with
the cluttered background problem, MLCM adaptively filters irrelevant noise and
enhances salient features. In this way, our proposed model can incorporate more
effective multi-level and multi-modal features to boost performance.
Furthermore, this work also provides useful insights for developing better RSVG
models. The dataset and code will be publicly available at
https://github.com/ZhanYang-nwpu/RSVG-pytorch.
Comment: 12 pages, 10 figures
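The abstract does not spell out the MLCM internals, but its core operation, letting multi-granularity text embeddings attend over multi-scale visual features, can be sketched as plain scaled dot-product attention. Everything below (the toy vectors and the `attend` helper) is illustrative, not the authors' implementation:

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector (pure Python)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]       # softmax over the visual features
    fused = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return fused, weights

# Toy example: a 2-d word embedding attends over visual features taken
# from three hypothetical scale levels of the image encoder.
text_query = [1.0, 0.0]
visual_feats = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
fused, weights = attend(text_query, visual_feats, visual_feats)
```

The softmax weights suppress scale levels that do not match the text query, which is the filtering-and-enhancing behavior the abstract ascribes to MLCM.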
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
Recent advancements in Large Vision-Language Models (VLMs) have shown great
promise in natural image domains, allowing users to hold a dialogue about given
visual content. However, such general-domain VLMs perform poorly for Remote
Sensing (RS) scenarios, leading to inaccurate or fabricated information when
presented with RS domain-specific queries. Such behavior emerges due to the
unique challenges introduced by RS imagery. For example, to handle
high-resolution RS imagery with diverse scale changes across categories and
many small objects, region-level reasoning is necessary alongside holistic
scene interpretation. Furthermore, the lack of domain-specific multimodal
instruction-following data, as well as of strong RS backbone models, makes it
hard for models to align their behavior with user queries. To address these
limitations, we propose GeoChat - the first versatile remote sensing VLM that
offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat can not only answer image-level queries but also accept
region inputs to hold region-specific dialogue. Furthermore, it can visually
ground objects in its responses by referring to their spatial coordinates. To
address the lack of domain-specific datasets, we generate a novel RS multimodal
instruction-following dataset by extending image-text pairs from existing
diverse RS datasets. We establish a comprehensive benchmark for RS multitask
conversations and compare with a number of baseline methods. GeoChat
demonstrates robust zero-shot performance on various RS tasks, e.g., image and
region captioning, visual question answering, scene classification, visually
grounded conversations and referring detection. Our code is available at
https://github.com/mbzuai-oryx/geochat.
Comment: 10 pages, 4 figures
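How a VLM accepts region inputs and grounds objects by coordinates is not specified in the abstract; one common scheme is to serialize normalized bounding boxes as bracketed text tokens inside the prompt and to parse them back out of generated responses. The token format and helpers below are hypothetical, not GeoChat's actual protocol:

```python
import re

def box_to_token(box, width, height):
    """Encode a pixel-space box as a normalized <x1,y1,x2,y2> text token on a
    0-100 scale (an illustrative coordinate scheme, not GeoChat's)."""
    x1, y1, x2, y2 = box
    norm = (round(100 * x1 / width), round(100 * y1 / height),
            round(100 * x2 / width), round(100 * y2 / height))
    return "<%d,%d,%d,%d>" % norm

def parse_boxes(text):
    """Recover normalized boxes from a generated response string."""
    return [tuple(int(v) for v in m.groups())
            for m in re.finditer(r"<(\d+),(\d+),(\d+),(\d+)>", text)]

tok = box_to_token((128, 64, 512, 448), width=1024, height=1024)
```

Because the boxes round-trip through plain text, the same tokenizer handles both region *inputs* (user-marked regions) and grounded *outputs* (coordinates in the model's answer).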
Non-Visible Light Data Synthesis and Application: A Case Study for Synthetic Aperture Radar Imagery
We explore the "hidden" ability of large-scale pre-trained image generation
models, such as Stable Diffusion and Imagen, in non-visible light domains,
taking Synthetic Aperture Radar (SAR) data for a case study. Due to the
inherent challenges in capturing satellite data, acquiring ample SAR training
samples is infeasible. For instance, for a particular category of ship in the
open sea, we can collect only a few SAR images, which are too limited to
derive effective ship recognition models. If large-scale models pre-trained
on regular images can be adapted to generate novel SAR images, the problem
is solved. In a preliminary study, we found that fine-tuning these models with
few-shot SAR images does not work, as the models cannot capture the two
primary differences between SAR and regular images: structure and modality. To
address this, we propose a 2-stage low-rank adaptation method, and we call it
2LoRA. In the first stage, the model is adapted using aerial-view regular image
data (whose structure matches SAR), followed by the second stage where the base
model from the first stage is further adapted using SAR modality data.
Particularly in the second stage, we introduce a novel prototype LoRA (pLoRA),
an improved version of 2LoRA, to resolve the class imbalance problem in SAR
datasets. For evaluation, we employ the resulting generation model to
synthesize additional SAR data. This augmentation, when integrated into the
training process of SAR classification as well as segmentation models, yields
notably improved performance for minor classes.
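The low-rank adaptation underlying 2LoRA keeps the pre-trained weight W frozen and learns only a rank-r update B·A, scaled by alpha/r. Below is a minimal pure-Python sketch of a single LoRA-adapted linear layer; the toy matrices are assumptions for illustration, and the two-stage structure/class-aware pLoRA of the paper are not reproduced here:

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (rank r << dim)."""
    def __init__(self, W, A, B, alpha=1.0):
        self.W, self.A, self.B = W, A, B
        self.scale = alpha / len(A)       # alpha / r, the usual LoRA scaling

    def forward(self, x):
        base = matvec(self.W, x)          # frozen pre-trained path
        low = matvec(self.B, matvec(self.A, x))  # r-dim bottleneck path
        return [b + self.scale * l for b, l in zip(base, low)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity, for the toy)
A = [[1.0, 0.0]]               # rank-1 down-projection (r x d_in)
B = [[0.0], [1.0]]             # up-projection (d_out x r)
layer = LoRALinear(W, A, B, alpha=1.0)
```

Only A and B would be trained during adaptation, which is why few SAR samples can suffice once the structure stage has already aligned the model with aerial views.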
Ship detection in SAR images based on Maxtree representation and graph signal processing
© 2018 IEEE. This paper discusses an image processing architecture and tools to address the problem of ship detection in synthetic-aperture radar images. The detection strategy relies on a tree-based representation of images, here a Maxtree, and graph signal processing tools. Radiometric as well as geometric attributes are evaluated and associated with the Maxtree nodes. They form graph attribute signals which are processed with graph filters. The goal of this filtering step is to exploit the correlation existing between attribute values on neighboring tree nodes. Considering that trees are specific graphs where the connectivity toward ancestors and descendants may have a different meaning, we analyze several linear, nonlinear, and morphological filtering strategies. Besides graph filters, two new filtering notions emerge from this analysis: tree and branch filters. We then discuss a ship detection architecture that involves graph signal filters and machine learning tools. This architecture demonstrates the value of applying graph signal processing tools to the tree-based representation of images and of going beyond classical graph filters. The resulting approach significantly outperforms state-of-the-art algorithms. Finally, a MATLAB toolbox allowing users to experiment with the tools discussed in this paper on Maxtrees or Mintrees has been created and made public.
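A linear graph filter on a tree can be illustrated by a single smoothing step that mixes each node's attribute with those of its parent and children, the kind of neighbor correlation the paper exploits. The `parent`-array encoding and the simple mean below are illustrative choices, not the paper's MATLAB Maxtree toolbox:

```python
def tree_neighbor_average(parent, attr):
    """One step of a linear graph filter on a tree: replace each node's
    attribute by the mean over {node} U {its children} U {its parent}.
    `parent[i]` gives node i's parent index, -1 for the root."""
    n = len(attr)
    children = [[] for _ in range(n)]
    for node, p in enumerate(parent):
        if p >= 0:
            children[p].append(node)
    out = []
    for node in range(n):
        neigh = [node] + children[node] + ([parent[node]] if parent[node] >= 0 else [])
        out.append(sum(attr[i] for i in neigh) / len(neigh))
    return out

parent = [-1, 0, 0]            # node 0 is the root; nodes 1 and 2 are its children
attr = [3.0, 0.0, 0.0]         # one attribute value per tree node
smoothed = tree_neighbor_average(parent, attr)
```

Because ancestors and descendants play different roles in a Maxtree, the paper's tree and branch filters would weight the parent and child contributions differently; the uniform mean here is the simplest special case.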
Few-shot Object Detection on Remote Sensing Images
In this paper, we deal with the problem of object detection on remote sensing
images. Previous works have developed numerous deep CNN-based methods for
object detection on remote sensing images and report remarkable
achievements in detection performance and efficiency. However, current
CNN-based methods mostly require a large number of annotated samples to train
deep neural networks and tend to have limited generalization abilities for
unseen object categories. In this paper, we introduce a few-shot learning-based
method for object detection on remote sensing images where only a few annotated
samples are provided for the unseen object categories. More specifically, our
model contains three main components: a meta feature extractor that learns to
extract feature representations from input images, a reweighting module that
learns to adaptively assign different weights to each feature representation
from the support images, and a bounding box prediction module that carries out
object detection on the reweighted feature maps. We build our few-shot object
detection model upon YOLOv3 architecture and develop a multi-scale object
detection framework. Experiments on two benchmark datasets demonstrate that
with only a few annotated samples our model can still achieve a satisfying
detection performance on remote sensing images and the performance of our model
is significantly better than the well-established baseline models.
Comment: 12 pages, 7 figures
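The reweighting idea, pooling support-image features into per-class coefficients and scaling the channels of the query feature map with them, can be sketched in a few lines. The pooling choice and toy tensors below are assumptions for illustration, not the paper's YOLOv3-based model:

```python
def global_avg_pool(feature_map):
    """Collapse each channel of a support feature map to one coefficient
    (channels are lists of rows of floats)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def reweight(query_features, class_vector):
    """Scale each channel of the query feature map by the support-derived
    coefficient, emphasising channels relevant to the support class."""
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(query_features, class_vector)]

support = [[[2.0, 2.0], [2.0, 2.0]],   # channel 0 -> pooled coefficient 2.0
           [[0.0, 0.0], [0.0, 0.0]]]   # channel 1 -> pooled coefficient 0.0
query = [[[1.0, 1.0], [1.0, 1.0]],
         [[1.0, 1.0], [1.0, 1.0]]]
vec = global_avg_pool(support)
out = reweight(query, vec)
```

The detection head then runs once per class on the reweighted maps, which is how a few support samples can steer detection toward an unseen category.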
Relation Network for Multi-label Aerial Image Classification
Multi-label classification plays a momentous role in perceiving intricate
contents of an aerial image and triggers several related studies over the last
years. However, most of them devote little effort to exploiting label relations,
while such dependencies are crucial for making accurate predictions. Although
an LSTM layer can be introduced to model such label dependencies in a chain
propagation manner, the efficiency might be questioned when certain labels are
improperly inferred. To address this, we propose a novel aerial image
multi-label classification network, attention-aware label relational reasoning
network. Particularly, our network consists of three elemental modules: 1) a
label-wise feature parcel learning module, 2) an attentional region extraction
module, and 3) a label relational inference module. To be more specific, the
label-wise feature parcel learning module is designed for extracting high-level
label-specific features. The attentional region extraction module aims at
localizing discriminative regions in these features and yielding attentional
label-specific features. The label relational inference module finally predicts
label existences using label relations reasoned from outputs of the previous
module. The proposed network is characterized by its capacity to extract
discriminative label-wise features in a proposal-free way and to reason about
label relations naturally and interpretably. In our experiments, we evaluate
the proposed model on the UCM multi-label dataset and a newly produced dataset,
AID multi-label dataset. Quantitative and qualitative results on these two
datasets demonstrate the effectiveness of our model. To facilitate progress in
the multi-label aerial image classification, the AID multi-label dataset will
be made publicly available.
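One simple way to reason over label relations, in the spirit of the label relational inference module, is to refine each label's logit with relation-weighted contributions from the other labels before applying the sigmoid. The additive formulation and the toy relation matrix below are illustrative, not the paper's learned module:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relational_inference(logits, relation):
    """Refine per-label logits with pairwise relation weights: labels that
    frequently co-occur with confidently-present labels get boosted."""
    n = len(logits)
    refined = [logits[i] + sum(relation[i][j] * logits[j]
                               for j in range(n) if j != i)
               for i in range(n)]
    return [sigmoid(v) for v in refined]

# Toy case: "ship" (label 1) strongly co-occurs with "harbor" (label 0),
# so a confident harbor detection lifts an otherwise uncertain ship score.
probs = relational_inference([2.0, 0.0], [[0.0, 0.0], [1.0, 0.0]])
```

An asymmetric relation matrix lets co-occurrence act in one direction only, which a chain-propagating LSTM cannot express as directly.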
Recurrently Exploring Class-wise Attention in A Hybrid Convolutional and Bidirectional LSTM Network for Multi-label Aerial Image Classification
Aerial image classification is of great significance to the remote sensing
community, and much research has been conducted over the past few years.
Among these studies, most of them focus on categorizing an image into one
semantic label, while in the real world, an aerial image is often associated
with multiple labels, e.g., multiple object-level labels in our case. Besides,
a comprehensive picture of present objects in a given high resolution aerial
image can provide more in-depth understanding of the studied region. For these
reasons, aerial image multi-label classification has been attracting increasing
attention. However, one common limitation shared by existing methods in the
community is that the co-occurrence relationship among classes, the so-called
class dependency, is underexplored, leading to suboptimal decisions. In
this paper, we propose a novel end-to-end network, namely class-wise
attention-based convolutional and bidirectional LSTM network (CA-Conv-BiLSTM),
for this task. The proposed network consists of three indispensable components:
1) a feature extraction module, 2) a class attention learning layer, and 3) a
bidirectional LSTM-based sub-network. Particularly, the feature extraction
module is designed for extracting fine-grained semantic feature maps, while the
class attention learning layer aims at capturing discriminative class-specific
features. As the most important part, the bidirectional LSTM-based sub-network
models the underlying class dependency in both directions and produces
structured multiple object labels. Experimental results on UCM multi-label
dataset and DFC15 multi-label dataset validate the effectiveness of our model
quantitatively and qualitatively.
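The bidirectional pass over an ordered class sequence can be illustrated with a scalar toy recurrence run forward and backward, then paired per class. This is a stand-in for the BiLSTM cell only; the weights and the tanh recurrence are assumptions, not the CA-Conv-BiLSTM implementation:

```python
import math

def rnn_pass(xs, w_in=1.0, w_rec=0.5):
    """One direction of a toy scalar recurrence (a stand-in for an LSTM cell)."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)   # current input plus carried state
        states.append(h)
    return states

def bidirectional(xs):
    """Scan the class-score sequence forward and backward and pair the two
    hidden states per class, so each class sees dependency from both sides."""
    fwd = rnn_pass(xs)
    bwd = rnn_pass(xs[::-1])[::-1]
    return list(zip(fwd, bwd))

states = bidirectional([1.0, 0.0, -1.0])   # toy per-class evidence scores
```

Each class's paired state depends on every other class in the ordering, which is what lets the sub-network emit structured multi-label predictions rather than independent ones.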