335 research outputs found

    Learning to Evaluate Performance of Multi-modal Semantic Localization

    Semantic localization (SeLo) refers to the task of obtaining the most relevant locations in large-scale remote sensing (RS) images using semantic information such as text. As an emerging task based on cross-modal retrieval, SeLo achieves semantic-level retrieval with only caption-level annotation, which demonstrates its great potential in unifying downstream tasks. Although SeLo has been carried out successively, no work has yet systematically explored and analyzed this urgent direction. In this paper, we thoroughly study this field and provide a complete benchmark in terms of metrics and test data to advance the SeLo task. First, based on the characteristics of this task, we propose multiple discriminative evaluation metrics to quantify the performance of the SeLo task. The devised significant area proportion, attention shift distance, and discrete attention distance are used to evaluate the generated SeLo map at the pixel level and the region level. Next, to provide standard evaluation data for the SeLo task, we contribute a diverse, multi-semantic, multi-objective Semantic Localization Testset (AIR-SLT). AIR-SLT consists of 22 large-scale RS images and 59 test cases with different semantics, and aims to provide a comprehensive evaluation for retrieval models. Finally, we analyze the SeLo performance of RS cross-modal retrieval models in detail, explore the impact of different variables on this task, and provide a complete benchmark for the SeLo task. We have also established a new paradigm for RS referring expression comprehension, and demonstrated the great advantage of SeLo in semantics by combining it with tasks such as detection and road extraction. The proposed evaluation metrics, semantic localization testsets, and corresponding scripts are openly available at github.com/xiaoyuan1996/SemanticLocalizationMetrics. Comment: 19 pages, 11 figures

    A Dimensional Structure based Knowledge Distillation Method for Cross-Modal Learning

    Due to limitations in data quality, some essential visual tasks are difficult to perform independently. Introducing previously unavailable information to transfer informative dark knowledge has been a common way to solve such hard tasks. However, why transferred knowledge works has not been extensively explored. To address this issue, in this paper, we discover the correlation between feature discriminability and dimensional structure (DS) by analyzing and observing features extracted from simple and hard tasks. On this basis, we express DS using deep channel-wise correlation and intermediate spatial distribution, and propose a novel cross-modal knowledge distillation (CMKD) method for better supervised cross-modal learning (CML) performance. The proposed method enforces output features to be channel-wise independent and intermediate ones to be uniformly distributed, thereby learning semantically irrelevant features from the hard task to boost its accuracy. This is especially useful in specific applications where the performance gap between dual modalities is relatively large. Furthermore, we collect a real-world CML dataset to promote community development. The dataset contains more than 10,000 paired optical and radar images and is continuously being updated. Experimental results on real-world and benchmark datasets validate the effectiveness of the proposed method.
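The abstract above says output features are pushed toward channel-wise independence. One common way to express such a constraint is a penalty on the off-diagonal entries of the channel correlation matrix. The following is a minimal sketch under that assumption; it is not the paper's loss, and the function name `channel_correlation_penalty` is hypothetical.

```python
import math


def channel_correlation_penalty(features):
    """Mean squared Pearson correlation over all distinct channel pairs.

    `features` is a batch: a list of samples, each a list of C channel values.
    The value is 0.0 when every pair of channels is uncorrelated and approaches
    1.0 when channels are linearly dependent. (Hypothetical helper, assumed
    form of a decorrelation penalty; not taken from the paper.)
    """
    n = len(features)
    c = len(features[0])
    # Per-channel mean and standard deviation across the batch.
    means = [sum(row[j] for row in features) / n for j in range(c)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in features) / n)
        for j in range(c)
    ]
    total, pairs = 0.0, 0
    for i in range(c):
        for j in range(i + 1, c):
            cov = sum(
                (row[i] - means[i]) * (row[j] - means[j]) for row in features
            ) / n
            denom = stds[i] * stds[j]
            # Constant channels have zero variance; treat them as uncorrelated.
            corr = cov / denom if denom > 0 else 0.0
            total += corr ** 2
            pairs += 1
    return total / pairs if pairs else 0.0
```

In a training loop, a term like this would be added to the task loss with a weighting coefficient, so that minimizing it drives channel pairs toward zero correlation. Note that zero correlation is a weaker condition than statistical independence.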

    A Transformer and Visual Foundation Model-Based Method for Cross-View Remote Sensing Image Retrieval

    Retrieving UAV images that lack POS information with georeferenced satellite orthoimagery is challenging due to the differences in viewing angles. Most existing methods rely on deep neural networks with a large number of parameters, leading to substantial time and financial investments in network training. Consequently, these methods may not be well suited for downstream tasks that have high timeliness requirements. In this work, we propose a cross-view remote sensing image retrieval method based on a transformer and a visual foundation model. We investigated the potential of a visual foundation model for extracting common features from cross-view images. Training is only conducted on a small, self-designed retrieval head, alleviating the burden of network training. Specifically, we designed a CVV module to optimize the features extracted from the visual foundation model, making these features better suited to cross-view image retrieval tasks. We also designed an MLP head to achieve similarity discrimination. The method is verified on a publicly available dataset containing multiple scenes. Our method shows excellent results in terms of both efficiency and accuracy on 15 sub-datasets (10 or 50 scene categories) derived from the public dataset, which holds practical value in engineering applications with streamlined scene categories and constrained computational resources. Furthermore, we initiated a comprehensive discussion and conducted ablation experiments on the network design to validate its efficacy. Additionally, we analyzed the presence of overfitting within the network and deliberated on the limitations of our study, proposing potential avenues for future enhancements.
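The idea of training only a small head on top of frozen foundation-model features can be illustrated roughly as follows. This is a sketch under stated assumptions, not the paper's architecture: the CVV module is omitted, the input is assumed to be a simple concatenation of the two feature vectors, and `make_head` and `similarity_score` are hypothetical names.

```python
import math
import random


def make_head(feature_dim, hidden_dim, seed=0):
    """Create weights for a one-hidden-layer MLP head (hypothetical layout).

    The head takes the concatenation of a UAV feature and a satellite feature,
    both assumed to come from a frozen backbone that is not modeled here.
    """
    rng = random.Random(seed)
    in_dim = 2 * feature_dim
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    b2 = 0.0
    return w1, b1, w2, b2


def similarity_score(head, uav_feat, sat_feat):
    """Forward pass: concatenate, ReLU hidden layer, sigmoid output in (0, 1)."""
    w1, b1, w2, b2 = head
    x = list(uav_feat) + list(sat_feat)
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))
```

Only the head weights would be updated during training, which is what keeps the trainable parameter count small compared with fine-tuning the whole backbone.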

    Knowledge Distillation and Continual Learning for Optimized Deep Neural Networks

    Over the past few years, deep learning (DL) has been achieving state-of-the-art performance on various human tasks such as speech generation, language translation, image segmentation, and object detection. While traditional machine learning models require hand-crafted features, deep learning algorithms can automatically extract discriminative features and learn complex knowledge from large datasets. This powerful learning ability makes deep learning models attractive to both academia and big corporations. Despite their popularity, deep learning methods still have two main limitations: large memory consumption and catastrophic knowledge forgetting. First, DL algorithms use very deep neural networks (DNNs) with many billions of parameters, which have a large model size and a slow inference speed. This restricts the application of DNNs in resource-constrained devices such as mobile phones and autonomous vehicles. Second, DNNs are known to suffer from catastrophic forgetting. When incrementally learning new tasks, the model performance on old tasks drops significantly. The ability to accommodate new knowledge while retaining previously learned knowledge is called continual learning. Since the real-world environments in which the model operates are always evolving, a robust neural network needs to have this continual learning ability to adapt to new changes.