Automatic Caption Generation for Aerial Images: A Survey
Aerial images have long attracted attention from the research community. Generating a caption that describes the content of an aerial image comprehensively is a less-studied but important task, with applications in agriculture, defence, disaster management, and many other areas. Although many approaches have been proposed for natural image caption generation, captioning aerial images remains challenging due to their special nature. The use of emerging techniques from the Artificial Intelligence (AI) and Natural Language Processing (NLP) domains has produced captions of acceptable quality for aerial images; however, much remains to be done to fully realize the potential of this task. This paper presents a detailed survey of the approaches researchers have followed for aerial image caption generation. The datasets available for experimentation, the criteria used for performance evaluation, and future directions are also discussed.
Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition
Most research on facial expression recognition (FER) is conducted in highly
controlled environments, but its performance is often unacceptable when applied
to real-world situations. This is because when unexpected objects occlude the
face, the FER network faces difficulties extracting facial features and
accurately predicting facial expressions. Therefore, occluded FER (OFER) is a
challenging problem. Previous studies on occlusion-aware FER have typically
required fully annotated facial images for training. However, collecting facial
images with various occlusions and expression annotations is time-consuming and
expensive. Latent-OFER, the proposed method, can detect occlusions, restore
occluded parts of the face as if they were unoccluded, and recognize them,
improving FER accuracy. This approach involves three steps: First, the vision
transformer (ViT)-based occlusion patch detector masks the occluded position by
training only latent vectors from the unoccluded patches using the support
vector data description algorithm. Second, the hybrid reconstruction network
generates the masking position as a complete image using the ViT and
convolutional neural network (CNN). Last, the expression-relevant latent vector
extractor retrieves and uses expression-related information from all latent
vectors by applying a CNN-based class activation map. This mechanism has a
significant advantage in preventing performance degradation from occlusion by
unseen objects. The experimental results on several databases demonstrate the
superiority of the proposed method over state-of-the-art methods.
Comment: 11 pages, 8 figures
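The occlusion-detection step above relies on support vector data description (SVDD), a one-class method that fits a tight hypersphere around the latent vectors of unoccluded patches. As a rough illustration of the idea (not the paper's implementation), the sketch below approximates the hypersphere with a centroid and a quantile radius; all latents, dimensions, and thresholds are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for ViT patch latent vectors (64 dims per patch).
unoccluded = rng.normal(0.0, 1.0, size=(500, 64))      # training: clean patches only
test_latents = np.vstack([
    rng.normal(0.0, 1.0, size=(180, 64)),              # clean patches
    rng.normal(6.0, 1.0, size=(16, 64)),               # occluded patches (shifted distribution)
])

# SVDD fits the smallest hypersphere around the training latents; here we
# approximate it with the centroid plus a radius covering 95% of training points.
center = unoccluded.mean(axis=0)
train_dist = np.linalg.norm(unoccluded - center, axis=1)
radius = np.quantile(train_dist, 0.95)

# Patches whose latent falls outside the sphere are flagged as occluded;
# those positions would then be masked and reconstructed.
occlusion_mask = np.linalg.norm(test_latents - center, axis=1) > radius
```

In the paper's setting the flagged positions feed the hybrid ViT/CNN reconstruction network; the centroid-plus-radius rule above is only a stand-in for the full SVDD optimization.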
RSGPT: A Remote Sensing Vision Language Model and Benchmark
The emergence of large-scale language models, with GPT-4 as a prominent
example, has significantly propelled the rapid advancement of artificial
general intelligence and sparked the revolution of Artificial Intelligence 2.0.
In the realm of remote sensing (RS), there is a growing interest in developing
large vision language models (VLMs) specifically tailored for data analysis in
this domain. However, current research predominantly revolves around visual
recognition tasks, lacking comprehensive, large-scale image-text datasets that
are aligned and suitable for training large VLMs, which poses significant
challenges to effectively training such models for RS applications. In computer
vision, recent research has demonstrated that fine-tuning large vision language
models on small-scale, high-quality datasets can yield impressive performance
in visual and language understanding. These results are comparable to
state-of-the-art VLMs trained from scratch on massive amounts of data, such as
GPT-4. Inspired by this captivating idea, in this work, we build a high-quality
Remote Sensing Image Captioning dataset (RSICap) that facilitates the
development of large VLMs in the RS field. Unlike previous RS datasets that
either employ model-generated captions or short descriptions, RSICap comprises
2,585 human-annotated captions with rich and high-quality information. This
dataset offers detailed descriptions for each image, encompassing scene
descriptions (e.g., residential area, airport, or farmland) as well as object
information (e.g., color, shape, quantity, absolute position, etc.). To
facilitate the evaluation of VLMs in the field of RS, we also provide a
benchmark evaluation dataset called RSIEval. This dataset consists of
human-annotated captions and visual question-answer pairs, allowing for a
comprehensive assessment of VLMs in the context of RS.
Deep learning for unsupervised domain adaptation in medical imaging: Recent advancements and future perspectives
Deep learning has demonstrated remarkable performance across various tasks in
medical imaging. However, these approaches primarily focus on supervised
learning, assuming that the training and testing data are drawn from the same
distribution. Unfortunately, this assumption may not always hold true in
practice. To address these issues, unsupervised domain adaptation (UDA)
techniques have been developed to transfer knowledge from a labeled domain to a
related but unlabeled domain. In recent years, significant advancements have
been made in UDA, resulting in a wide range of methodologies, including feature
alignment, image translation, self-supervision, and disentangled representation
methods, among others. In this paper, we provide a comprehensive literature
review of recent deep UDA approaches in medical imaging from a technical
perspective. Specifically, we categorize current UDA research in medical
imaging into six groups and further divide them into finer subcategories based
on the different tasks they perform. We also discuss the respective datasets
used in the studies to assess the divergence between the different domains.
Finally, we discuss emerging areas and provide insights and discussions on
future research directions to conclude this survey.
Comment: Under Review
IR Design for Application-Specific Natural Language: A Case Study on Traffic Data
In the realm of software applications in the transportation industry,
Domain-Specific Languages (DSLs) have enjoyed widespread adoption due to their
ease of use and various other benefits. With the ceaseless progress in computer
performance and the rapid development of large-scale models, the possibility of
programming using natural language in specified applications - referred to as
Application-Specific Natural Language (ASNL) - has emerged. ASNL exhibits
greater flexibility and freedom, which, in turn, leads to an increase in
computational complexity for parsing and a decrease in processing performance.
To tackle this issue, our paper advances a design for an intermediate
representation (IR) that caters to ASNL and can uniformly process
transportation data into graph data format, improving data processing
performance. Experimental comparisons reveal that in standard data query
operations, our proposed IR design can achieve a speed improvement of over
forty times compared to the direct use of standard XML-format data.
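The reported speedup is plausible because a graph-style IR replaces repeated document scans with a one-time parse into an adjacency structure. The toy sketch below contrasts the two access patterns on an invented mini road-network schema; the paper's actual IR design and XML layout are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini road network in a generic XML layout (illustrative only).
XML = """<network>
  <road from="A" to="B" length="120"/>
  <road from="B" to="C" length="80"/>
  <road from="A" to="C" length="250"/>
</network>"""

def neighbors_via_xml(node):
    # Direct XML usage: re-scan the whole document on every query.
    root = ET.fromstring(XML)
    return {r.get("to") for r in root.iter("road") if r.get("from") == node}

# Graph-style IR: parse once into an adjacency map, then query in O(1).
graph = {}
for road in ET.fromstring(XML).iter("road"):
    graph.setdefault(road.get("from"), set()).add(road.get("to"))

def neighbors_via_graph(node):
    return graph.get(node, set())
```

Both functions return the same answer, but the graph version amortizes the parse across all subsequent queries, which is where the order-of-magnitude gains come from.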
A Survey of Quantum-Cognitively Inspired Sentiment Analysis Models
Quantum theory, originally proposed as a physical theory to describe the motions of microscopic particles, has been applied to various non-physics domains involving human cognition and decision-making that are inherently uncertain and exhibit certain non-classical, quantum-like characteristics. Sentiment analysis is a typical example of such domains. In the last few years, by leveraging the modeling power of quantum probability (a non-classical probability stemming from quantum mechanics methodology) and deep neural networks, a range of novel quantum-cognitively inspired models for sentiment analysis have emerged and performed well. This survey presents a timely overview of the latest developments in this fascinating cross-disciplinary area. We first provide a background of quantum probability and quantum cognition at a theoretical level, analyzing their advantages over classical theories in modeling the cognitive aspects of sentiment analysis. Then, recent quantum-cognitively inspired models are introduced and discussed in detail, focusing on how they approach the key challenges of the sentiment analysis task. Finally, we discuss the limitations of the current research and highlight future research directions.
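At the core of these models is quantum probability: states are density matrices, events are projectors, and probabilities follow the Born rule p = Tr(rho P). A minimal toy sketch of that machinery in a two-dimensional "sentiment space" (an invented illustration, not any specific surveyed model):

```python
import numpy as np

# Toy basis: |pos> and |neg> sentiment outcomes.
pos = np.array([1.0, 0.0])
neg = np.array([0.0, 1.0])

# An ambivalent review state: a superposition weighted toward positive.
psi = np.sqrt(0.7) * pos + np.sqrt(0.3) * neg
rho = np.outer(psi, psi)              # density matrix; its trace is 1

# Born rule: the probability of an outcome is Tr(rho @ projector).
proj_pos = np.outer(pos, pos)
p_positive = np.trace(rho @ proj_pos)
```

The off-diagonal terms of rho encode the interference effects that quantum-cognitively inspired models exploit and that have no classical-probability counterpart.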
A review of abnormal behavior detection in activities of daily living
Abnormal behavior detection (ABD) systems are built to automatically identify and recognize abnormal behavior from various input data types, such as sensor-based and vision-based input. Despite the attention ABD systems have received, the number of studies on ABD in activities of daily living (ADL) is limited. Owing to the increasing rate of accidents among the elderly at home, ABD in ADL research deserves comparable attention, since such systems can help prevent accidents by raising an alert when abnormal behavior such as a fall is detected. In this study, we compare and contrast the design of ABD systems in ADL, from input data types (sensor-based and vision-based) to modeling techniques (conventional and deep learning approaches). We scrutinize the publicly available datasets and propose solutions to one of the most significant issues: the lack of datasets for ABD in ADL. This work aims to guide new research toward a better understanding of ABD in ADL and to serve as a reference for future studies of Ambient Assisted Living amid the growing smart-home trend.
GlobalMind: Global Multi-head Interactive Self-attention Network for Hyperspectral Change Detection
High spectral resolution imagery of the Earth's surface enables users to
monitor changes over time in fine-grained scale, playing an increasingly
important role in agriculture, defense, and emergency response. However, most
current algorithms are still confined to describing local features and fail to
incorporate a global perspective, which limits their ability to capture
interactions between global features, thus usually resulting in incomplete
change regions. In this paper, we propose a Global Multi-head INteractive
self-attention change Detection network (GlobalMind) to explore the implicit
correlation between different surface objects and variant land cover
transformations, acquiring a comprehensive understanding of the data and
accurate change detection result. Firstly, a simple but effective Global Axial
Segmentation (GAS) strategy is designed to expand the self-attention
computation along the row space or column space of hyperspectral images,
allowing the global connection with high efficiency. Secondly, with GAS, the
global spatial multi-head interactive self-attention (Global-M) module is
crafted to mine the abundant spatial-spectral feature involving potential
correlations between the ground objects from the entire rich and complex
hyperspectral space. Moreover, to acquire accurate and complete
cross-temporal changes, we devise a global temporal interactive multi-head
self-attention (GlobalD) module that incorporates the relevance and variation
of bi-temporal spatial-spectral features, capturing the same kinds of potential
changes at both local and global scales in combination with GAS. We
perform extensive experiments on five widely used hyperspectral datasets, and
our method outperforms state-of-the-art algorithms with high accuracy and
efficiency.
Comment: 14 pages, 18 figures
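The GAS idea, restricting self-attention to the row or column space, can be illustrated with a minimal row-wise attention over a toy data cube. The sketch below uses identity Q/K/V projections and invented dimensions; it shows only how the axial restriction cuts the attention cost from O((HW)^2) for full attention to O(HW*W) per axis.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention_rows(x):
    """Self-attention computed independently within each row of an
    (H, W, C) cube -- the row branch of a GAS-style axial scheme.
    Illustration only: Q, K, V are all the identity projection of x."""
    h, w, c = x.shape
    scores = np.einsum("hwc,hvc->hwv", x, x) / np.sqrt(c)  # (H, W, W) per-row scores
    return np.einsum("hwv,hvc->hwc", softmax(scores), x)   # convex mix of row pixels

rng = np.random.default_rng(2)
cube = rng.normal(size=(4, 5, 3))   # toy hyperspectral patch: 4x5 pixels, 3 bands
out = axial_attention_rows(cube)
```

A full GAS-style layer would add learned projections, multiple heads, and a matching column branch, but the per-row softmax mixing above is the structural core.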
ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT
Large language models (LLMs) such as ChatGPT have recently demonstrated
significant potential in mathematical tasks, providing a valuable reasoning
paradigm consistent with human natural language. However, LLMs currently have
difficulty bridging perception, language understanding, and reasoning
capabilities due to the incompatibility of the underlying information flow among
them, making it challenging to accomplish tasks autonomously. On the other
hand, abductive learning (ABL) frameworks, which integrate perception and
reasoning, have seen significant success in the inverse decipherment of
incomplete facts, but they are limited by the lack of semantic understanding of
logical reasoning rules and by their dependence on complicated domain knowledge
representation. This paper presents a novel method (ChatABL) for integrating
LLMs into the ABL framework, aiming at unifying the three abilities in a more
user-friendly and understandable manner. The proposed method uses the strengths
of LLMs' understanding and logical reasoning to correct incomplete logical
facts and optimize the performance of the perceptual module, by summarizing and
reorganizing reasoning rules represented in natural-language format. In turn,
the perceptual module provides the necessary reasoning examples for the LLM in
natural-language format. The variable-length handwritten equation deciphering
task, an abstract expression of Mayan calendar decoding, is used as a testbed to
demonstrate that ChatABL has reasoning ability beyond most existing
state-of-the-art methods, as supported by comparative studies.
To the best of our knowledge, ChatABL is the first attempt to explore a
new pattern for approaching human-level cognitive ability via natural
language interaction with ChatGPT.
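The correction loop described above, in which reasoning repairs incomplete perceptual facts, can be caricatured without any LLM: given a perceived equation containing one uncertain symbol, abduce the value that makes the fact consistent with the rule. The brute-force sketch below substitutes ordinary arithmetic for ChatABL's natural-language reasoning and is purely illustrative.

```python
def abduce(equation):
    """equation: digit/operator tokens such as ["3", "+", "?", "=", "5"].
    Returns the token list with "?" replaced by a digit that satisfies
    the arithmetic rule, or None if no digit works (the abduction step)."""
    for digit in "0123456789":
        candidate = [digit if t == "?" else t for t in equation]
        eq = candidate.index("=")
        lhs = eval("".join(candidate[:eq]))   # safe here: tokens are digits/operators
        rhs = eval("".join(candidate[eq + 1:]))
        if lhs == rhs:
            return candidate
    return None

# The "perception" output has one uncertain symbol; abduction repairs it.
fixed = abduce(["3", "+", "?", "=", "5"])
```

In ChatABL the rule check and the repair are both mediated by the LLM in natural language, and the corrected facts are fed back to retrain the perceptual module; this toy loop only mirrors the consistency-driven correction step.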