262 research outputs found
Automatic Caption Generation for Aerial Images: A Survey
Aerial images have attracted attention from researcher community since long time. Generating a caption for an aerial image describing its content in comprehensive way is less studied but important task as it has applications in agriculture, defence, disaster management and many more areas. Though different approaches were followed for natural image caption generation, generating a caption for aerial image remains a challenging task due to its special nature. Use of emerging techniques from Artificial Intelligence (AI) and Natural Language Processing (NLP) domains have resulted in generation of accepted quality captions for aerial images. However lot needs to be done to fully utilize potential of aerial image caption generation task. This paper presents detail survey of the various approaches followed by researchers for aerial image caption generation task. The datasets available for experimentation, criteria used for performance evaluation and future directions are also discussed
Remote sensing image captioning with pre-trained transformer models
Remote sensing images, and the unique properties that characterize them, are attracting increased attention from computer vision researchers, largely due to their many possible applications. The area of computer vision for remote sensing has effectively seen many recent advances, e.g. in tasks such as object detection or scene classification. Recent work in the area has also addressed the task of generating a natural language description of a given remote sensing image, effectively combining techniques from both natural language processing and computer vision. Despite some previously published results, there nonetheless are still many limitations and possibilities for improvement. It remains challenging to generate text that is fluid and linguistically rich while maintaining semantic consistency and good discrimination ability about the objects and visual patterns that should be described. The previous proposals that have come closest to achieving the goals of remote sensing image captioning have used neural encoder-decoder architectures, often including specialized attention mechanisms to help the system in integrating the most relevant visual features while generating the textual descriptions. Taking previous work into consideration, this work proposes a new approach for remote sensing image captioning, using an encoder-decoder model based on the Transformer architecture, and where both the encoder and the decoder are based on components from a pre-existing model that was already trained with large amounts of data. Experiments were carried out using the three main datasets that exist for assessing remote sensing image captioning methods, respectively the Sydney-captions, the \acrshort{UCM}-captions, and the \acrshort{RSICD} datasets. The results show improvements over some previous proposals, although particularly on the larger \acrshort{RSICD} dataset they are still far from the current state-of-art methods. A careful analysis of the results also points to some limitations in the current evaluation methodology, mostly based on automated n-gram overlap metrics such as BLEU or ROUGE
SMAN : Stacked Multi-Modal Attention Network for cross-modal image-text retrieval
This article focuses on tackling the task of the cross-modal image-text retrieval which has been an interdisciplinary topic in both computer vision and natural language processing communities. Existing global representation alignment-based methods fail to pinpoint the semantically meaningful portion of images and texts, while the local representation alignment schemes suffer from the huge computational burden for aggregating the similarity of visual fragments and textual words exhaustively. In this article, we propose a stacked multimodal attention network (SMAN) that makes use of the stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity. Specifically, we sequentially employ intramodal information and multimodal information as guidance to perform multiple-step attention reasoning so that the fine-grained correlation between image and text can be modeled. As a consequence, we are capable of discovering the semantically meaningful visual regions or words in a sentence which contributes to measuring the cross-modal similarity in a more precise manner. Moreover, we present a novel bidirectional ranking loss that enforces the distance among pairwise multimodal instances to be closer. Doing so allows us to make full use of pairwise supervised information to preserve the manifold structure of heterogeneous pairwise data. Extensive experiments on two benchmark datasets demonstrate that our SMAN consistently yields competitive performance compared to state-of-the-art methods
Deep learning-based image captioning for visually impaired people
Vision loss can affect people of all ages. Severe or complete vision loss may occur when the eye or brain parts that need to process images are damaged. In this paper, in order to facilitate the blind, deep learning algorithms are used to caption the image for the blind person in which the blind can know about the object, distance and position of object. Whenever an image is captured via the camera, the scenes are recognized and predicted by the machine. After the prediction, it will be sent as an audio output to the user. Thus, with the help of this paper an artificial vision to the blind, can be achieved and help them to gain confidence while travelling alone
Toward Remote Sensing Image Retrieval Under a Deep Image Captioning Perspective
The performance of remote sensing image retrieval (RSIR) systems depends on the capability of the extracted features in characterizing the semantic content of images. Existing RSIR systems describe images by visual descriptors that model the primitives (such as different land-cover classes) present in the images. However, the visual descriptors may not be sufficient to describe the high-level complex content of RS images (e.g., attributes and relationships among different land-cover classes). To address this issue, in this article, we present an RSIR system that aims at generating and exploiting textual descriptions to accurately describe the relationships between the objects and their attributes present in RS images with captions (i.e., sentences). To this end, the proposed retrieval system consists of three main steps. The first step aims to encode the image visual features and then translate the encoded features into a textual description that summarizes the content of the image with captions. This is achieved based on the combination of a convolutional neural network with a recurrent neural network. The second step aims to convert the generated textual descriptions into semantically meaningful feature vectors. This is achieved by using the recent word embedding techniques. Finally, the last step estimates the similarity between the vectors of the textual descriptions of the query image and those of the archive images, and then retrieve the most similar images to the query image. Experimental results obtained on two different datasets show that the description of the image content with captions in the framework of RSIR leads to an accurate retrieval performance.EC/H2020/759764/EU/Accurate and Scalable Processing of Big Data in Earth Observation/BigEart
Changes to Captions: An Attentive Network for Remote Sensing Change Captioning
In recent years, advanced research has focused on the direct learning and
analysis of remote sensing images using natural language processing (NLP)
techniques. The ability to accurately describe changes occurring in
multi-temporal remote sensing images is becoming increasingly important for
geospatial understanding and land planning. Unlike natural image change
captioning tasks, remote sensing change captioning aims to capture the most
significant changes, irrespective of various influential factors such as
illumination, seasonal effects, and complex land covers. In this study, we
highlight the significance of accurately describing changes in remote sensing
images and present a comparison of the change captioning task for natural and
synthetic images and remote sensing images. To address the challenge of
generating accurate captions, we propose an attentive changes-to-captions
network, called Chg2Cap for short, for bi-temporal remote sensing images. The
network comprises three main components: 1) a Siamese CNN-based feature
extractor to collect high-level representations for each image pair; 2) an
attentive decoder that includes a hierarchical self-attention block to locate
change-related features and a residual block to generate the image embedding;
and 3) a transformer-based caption generator to decode the relationship between
the image embedding and the word embedding into a description. The proposed
Chg2Cap network is evaluated on two representative remote sensing datasets, and
a comprehensive experimental analysis is provided. The code and pre-trained
models will be available online at https://github.com/ShizhenChang/Chg2Cap
RSGPT: A Remote Sensing Vision Language Model and Benchmark
The emergence of large-scale large language models, with GPT-4 as a prominent
example, has significantly propelled the rapid advancement of artificial
general intelligence and sparked the revolution of Artificial Intelligence 2.0.
In the realm of remote sensing (RS), there is a growing interest in developing
large vision language models (VLMs) specifically tailored for data analysis in
this domain. However, current research predominantly revolves around visual
recognition tasks, lacking comprehensive, large-scale image-text datasets that
are aligned and suitable for training large VLMs, which poses significant
challenges to effectively training such models for RS applications. In computer
vision, recent research has demonstrated that fine-tuning large vision language
models on small-scale, high-quality datasets can yield impressive performance
in visual and language understanding. These results are comparable to
state-of-the-art VLMs trained from scratch on massive amounts of data, such as
GPT-4. Inspired by this captivating idea, in this work, we build a high-quality
Remote Sensing Image Captioning dataset (RSICap) that facilitates the
development of large VLMs in the RS field. Unlike previous RS datasets that
either employ model-generated captions or short descriptions, RSICap comprises
2,585 human-annotated captions with rich and high-quality information. This
dataset offers detailed descriptions for each image, encompassing scene
descriptions (e.g., residential area, airport, or farmland) as well as object
information (e.g., color, shape, quantity, absolute position, etc). To
facilitate the evaluation of VLMs in the field of RS, we also provide a
benchmark evaluation dataset called RSIEval. This dataset consists of
human-annotated captions and visual question-answer pairs, allowing for a
comprehensive assessment of VLMs in the context of RS
- …