Multi-modal gated recurrent units for image description
Using a natural language sentence to describe the content of an image is a
challenging but very important task. It is challenging because a description
must not only capture objects contained in the image and the relationships
among them, but also be relevant and grammatically correct. In this paper, we
propose a multi-modal embedding model based on gated recurrent units (GRU)
that can generate a variable-length description for a given image. In the training step,
we apply the convolutional neural network (CNN) to extract the image feature.
The feature is then fed into the multi-modal GRU together with the
corresponding sentence representations, and the multi-modal GRU learns the
inter-modal relations between image and sentence. In the testing step, when
an image is fed into our multi-modal GRU model, a sentence describing the
image content is generated. The experimental results demonstrate that our
multi-modal GRU model achieves state-of-the-art performance on the Flickr8K,
Flickr30K and MS COCO datasets.
Comment: 25 pages, 7 figures, 6 tables
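The gating mechanism at the heart of this abstract can be illustrated with a minimal, scalar-valued sketch of a single GRU step. This is not the paper's implementation: the function name and the dict-of-weights layout are hypothetical, the state is a single number rather than a vector, and in the described model the input at each step would be built from the CNN image feature and word embeddings rather than a raw scalar.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, W, U, b):
    """One scalar GRU step (toy illustration of the gating equations).

    x: current input; h: previous hidden state.
    W, U, b: dicts of scalar weights/biases keyed by 'z', 'r', 'h'.
    """
    # Update gate: how much of the new candidate state to adopt.
    z = sigmoid(W["z"] * x + U["z"] * h + b["z"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(W["r"] * x + U["r"] * h + b["r"])
    # Candidate hidden state, computed from the input and the reset state.
    h_tilde = math.tanh(W["h"] * x + U["h"] * (r * h) + b["h"])
    # Interpolate between the old state and the candidate.
    return (1.0 - z) * h + z * h_tilde
```

With all weights zero, both gates evaluate to 0.5 and the candidate to 0, so the state simply halves at each step; with nonzero weights the gates let the unit decide, per step, how much history to keep versus overwrite — the property that makes GRUs suitable for generating variable-length sentences.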
Remote Sensing Scene Classification with Masked Image Modeling (MIM)
Remote sensing scene classification has been extensively studied for its
critical roles in geological survey, oil exploration, traffic management,
earthquake prediction, wildfire monitoring, and intelligence monitoring. In the
past, the Machine Learning (ML) methods for performing the task mainly used the
backbones pretrained in the manner of supervised learning (SL). As Masked Image
Modeling (MIM), a self-supervised learning (SSL) technique, has been shown as a
better way for learning visual feature representation, it presents a new
opportunity for improving ML performance on the scene classification task. This
research aims to explore the potential of MIM pretrained backbones on four
well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31.
Compared to the published benchmarks, we show that the MIM-pretrained Vision
Transformer (ViT) backbones outperform the alternatives (by up to 18% in top-1
accuracy) and that the MIM technique learns better feature representations
than its supervised-learning counterparts (by up to 5% in top-1 accuracy).
Moreover, we show that general-purpose MIM-pretrained ViTs achieve performance
competitive with the specially designed yet complicated Transformer for
Remote Sensing (TRS) framework. Our experimental results also provide a
performance baseline for future studies.
Comment: arXiv admin note: text overlap with arXiv:2301.1205
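The masking step that gives Masked Image Modeling its name can be sketched in a few lines: the image is split into ViT-style patches, a fixed fraction of the patch indices is hidden at random, and the pretraining objective reconstructs the hidden patches. The function below is a hypothetical illustration of that sampling step only, not the paper's code; the patch count and mask ratio in the usage note are common ViT/MAE-style defaults, not values taken from the abstract.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return a boolean list over patch indices.

    True marks a patch that is hidden from the encoder and must be
    reconstructed by the MIM objective; False marks a visible patch.
    """
    rng = random.Random(seed)  # seedable for reproducible masking
    n_masked = int(num_patches * mask_ratio)
    indices = list(range(num_patches))
    rng.shuffle(indices)                 # uniform random choice of patches
    masked = set(indices[:n_masked])
    return [i in masked for i in range(num_patches)]
```

For example, a 224 x 224 image with 16 x 16 patches yields 196 patches, and a 0.75 mask ratio hides 147 of them, leaving only a quarter of the image visible to the encoder — the self-supervised signal that replaces labels during pretraining.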
Automatic Caption Generation for Aerial Images: A Survey
Aerial images have long attracted attention from the research community. Generating a caption that describes the content of an aerial image in a comprehensive way is a less-studied but important task, with applications in agriculture, defence, disaster management and many other areas. Although various approaches have been followed for natural-image caption generation, generating a caption for an aerial image remains challenging due to its special nature. The use of emerging techniques from the Artificial Intelligence (AI) and Natural Language Processing (NLP) domains has resulted in captions of acceptable quality for aerial images. However, much remains to be done to fully utilize the potential of the aerial image caption generation task. This paper presents a detailed survey of the various approaches followed by researchers for aerial image caption generation. The datasets available for experimentation, the criteria used for performance evaluation, and future directions are also discussed.
Deep Learning for Aerial Scene Understanding in High Resolution Remote Sensing Imagery from the Lab to the Wild
This thesis presents the application of deep learning to aerial scene understanding, e.g. aerial scene recognition, multi-label object classification, and semantic segmentation. Beyond training deep networks under laboratory conditions, the thesis also offers learning strategies for practical scenarios, e.g. where data are collected without constraints or annotations are scarce.