335 research outputs found
Learning to Evaluate Performance of Multi-modal Semantic Localization
Semantic localization (SeLo) refers to the task of obtaining the most
relevant locations in large-scale remote sensing (RS) images using semantic
information such as text. As an emerging task based on cross-modal retrieval,
SeLo achieves semantic-level retrieval with only caption-level annotation,
which demonstrates its great potential in unifying downstream tasks. Although
SeLo has been studied in successive works, no existing work has
systematically explored and analyzed this urgent direction. In this paper, we
thoroughly study this field and provide a complete benchmark in terms of
metrics and test data to advance the SeLo task. Firstly, based on the
characteristics of this task, we propose multiple discriminative evaluation
metrics to quantify the performance of the SeLo task. The devised significant
area proportion, attention shift distance, and discrete attention distance are
utilized to evaluate the generated SeLo map at the pixel level and region level.
Next, to provide standard evaluation data for the SeLo task, we contribute a
diverse, multi-semantic, multi-objective Semantic Localization Testset
(AIR-SLT). AIR-SLT consists of 22 large-scale RS images and 59 test cases with
different semantics, which aims to provide a comprehensive evaluation for
retrieval models. Finally, we analyze the SeLo performance of RS cross-modal
retrieval models in detail, explore the impact of different variables on this
task, and provide a complete benchmark for the SeLo task. We have also
established a new paradigm for RS referring expression comprehension, and
demonstrated the great advantage of SeLo in semantics through combining it with
tasks such as detection and road extraction. The proposed evaluation metrics,
semantic localization testsets, and corresponding scripts are openly
available at github.com/xiaoyuan1996/SemanticLocalizationMetrics.
Comment: 19 pages, 11 figures
A Dimensional Structure based Knowledge Distillation Method for Cross-Modal Learning
Due to limitations in data quality, some essential visual tasks are difficult
to perform independently. Introducing previously unavailable information to
transfer informative dark knowledge has been a common way to solve such hard
tasks. However, why transferred knowledge works has not been
extensively explored. To address this issue, in this paper, we discover the
correlation between feature discriminability and dimensional structure (DS) by
analyzing and observing features extracted from simple and hard tasks. On this
basis, we express DS using deep channel-wise correlation and intermediate
spatial distribution, and propose a novel cross-modal knowledge distillation
(CMKD) method for better supervised cross-modal learning (CML) performance. The
proposed method enforces output features to be channel-wise independent and
intermediate ones to be uniformly distributed, thereby learning semantically
irrelevant features from the hard task to boost its accuracy. This is
especially useful in specific applications where the performance gap between
dual modalities is relatively large. Furthermore, we collect a real-world CML
dataset to promote community development. The dataset contains more than 10,000
paired optical and radar images and is continuously being updated. Experimental
results on real-world and benchmark datasets validate the effectiveness of the
proposed method.
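The abstract above states that output features are pushed toward channel-wise independence but does not give the loss. The following is a minimal sketch of one common way to express such a constraint, assuming a penalty on the off-diagonal entries of the batch correlation matrix; it is not claimed to be the authors' formulation, and the function name `channel_decorrelation_penalty` is hypothetical.

```python
import math


def channel_decorrelation_penalty(features):
    """Sum of squared off-diagonal entries of the channel correlation matrix.

    features: list of N samples, each a list of C channel activations.
    This is an illustrative assumption, not the loss from the paper.
    """
    n = len(features)
    c = len(features[0])
    # Per-channel mean and standard deviation over the batch.
    means = [sum(row[j] for row in features) / n for j in range(c)]
    stds = [
        # A constant channel has zero deviation; fall back to 1.0 to avoid
        # division by zero (its standardised values are then all zero).
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in features) / n) or 1.0
        for j in range(c)
    ]
    # Standardise so that averaged products below are correlations.
    z = [[(row[j] - means[j]) / stds[j] for j in range(c)] for row in features]
    penalty = 0.0
    for a in range(c):
        for b in range(c):
            if a == b:
                continue  # the diagonal is always 1 and carries no signal
            corr = sum(z[i][a] * z[i][b] for i in range(n)) / n
            penalty += corr ** 2
    return penalty
```

Under this sketch the penalty is zero when every pair of channels is uncorrelated over the batch and grows as channels become redundant, so minimising it alongside the task loss would discourage duplicated channels.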
Scale-wise interaction fusion and knowledge distillation network for aerial scene recognition
Data availability statement: Data sharing is not applicable to this article as no new data were created or analysed in this study.
Copyright © 2023 The Authors.
Aerial scene recognition (ASR) has attracted great attention due to its increasingly essential applications. Most ASR methods adopt a multi-scale architecture because both global and local features play important roles in ASR. However, existing multi-scale methods neglect the effective interactions among different scales and various spatial locations when fusing global and local features, which limits their ability to deal with the large-scale variation and complex backgrounds found in aerial scene images. In addition, existing methods may generalise poorly due to millions of to-be-learnt parameters and inconsistent predictions between global and local features. To tackle these problems, this study proposes a scale-wise interaction fusion and knowledge distillation (SIF-KD) network for learning robust and discriminative features with scale-invariance and background-independent information. The main highlights of this study include two aspects. On the one hand, a global-local feature collaborative learning scheme is devised for extracting scale-invariant features so as to tackle the large-scale variation problem in aerial scene images. Specifically, a plug-and-play multi-scale context attention fusion module is proposed for collaboratively fusing the context information between global and local features. On the other hand, a scale-wise knowledge distillation scheme is proposed to produce more consistent predictions by distilling the predictive distribution between different scales during training. Comprehensive experimental results show that the proposed SIF-KD network achieves the best overall accuracy of 99.68%, 98.74% and 95.47% on the UCM, AID and NWPU-RESISC45 datasets, respectively, compared with state-of-the-art methods.
National Natural Science Foundation of China. Grant Numbers: 62201452, 2271296, 62201453;
Natural Science Basic Research Program of Shaanxi. Grant Number: 2022JQ-592;
Key Research and Development Program of Shaanxi Province. Grant Number: 2021JC-47;
Shaanxi Provincial Education Department. Grant Number: 22JK0568
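The SIF-KD abstract above describes distilling the predictive distribution between scales so that global and local predictions agree. A minimal sketch of that idea follows, assuming a temperature-softened softmax and a symmetric KL divergence between the two branches; the symmetric form, the temperature value, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def scale_consistency_loss(global_logits, local_logits, temperature=2.0):
    """Symmetric KL between global-branch and local-branch predictions.

    Hypothetical stand-in for a scale-wise distillation term: it is zero
    when both branches predict the same distribution.
    """
    p = softmax(global_logits, temperature)
    q = softmax(local_logits, temperature)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

In a training loop, a term like this would typically be added to the classification loss with a weighting coefficient; the abstract does not state how the authors weight or direct the distillation.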
A Transformer and Visual Foundation Model-Based Method for Cross-View Remote Sensing Image Retrieval
Retrieving UAV images that lack POS information with georeferenced satellite orthoimagery is challenging due to differences in viewing angle. Most existing methods rely on deep neural networks with a large number of parameters, leading to substantial time and financial investments in network training. Consequently, these methods may not be well suited to downstream tasks that have high timeliness requirements. In this work, we propose a cross-view remote sensing image retrieval method based on a transformer and a visual foundation model. We investigated the potential of the visual foundation model for extracting common features from cross-view images. Training is conducted only on a small, self-designed retrieval head, alleviating the burden of network training. Specifically, we designed a CVV module to optimize the features extracted from the visual foundation model, making these features better suited to cross-view image retrieval tasks. We also designed an MLP head to perform similarity discrimination. The method is verified on a publicly available dataset containing multiple scenes. Our method shows excellent results in terms of both efficiency and accuracy on 15 sub-datasets (10 or 50 scene categories) derived from the public dataset, which holds practical value in engineering applications with streamlined scene categories and constrained computational resources. Furthermore, we discussed the network design comprehensively and conducted ablation experiments to validate its efficacy. Additionally, we analyzed the presence of overfitting within the network and discussed the limitations of our study, proposing potential avenues for future enhancements.
Knowledge Distillation and Continual Learning for Optimized Deep Neural Networks
Over the past few years, deep learning (DL) has been achieving state-of-the-art performance on various human tasks such as speech generation, language translation, image segmentation, and object detection. While traditional machine learning models require hand-crafted features, deep learning algorithms can automatically extract discriminative features and learn complex knowledge from large datasets. This powerful learning ability makes deep learning models attractive to both academia and big corporations.
Despite their popularity, deep learning methods still have two main limitations: large memory consumption and catastrophic knowledge forgetting. First, DL algorithms use very deep neural networks (DNNs) with billions of parameters, which results in a large model size and a slow inference speed. This restricts the application of DNNs in resource-constrained devices such as mobile phones and autonomous vehicles. Second, DNNs are known to suffer from catastrophic forgetting. When incrementally learning new tasks, the model performance on old tasks drops significantly. The ability to accommodate new knowledge while retaining previously learned knowledge is called continual learning. Since the real-world environments in which the model operates are always evolving, a robust neural network needs this continual learning ability to adapt to new changes.
- …