Scene Graph Generation with External Knowledge and Image Reconstruction
Scene graph generation has received growing attention with the advancements
in image understanding tasks such as object detection, attributes and
relationship prediction, etc. However, existing datasets are biased in terms
of object and relationship labels, or often come with noisy and missing
annotations, which makes the development of a reliable scene graph prediction
model very challenging. In this paper, we propose a novel scene graph
generation algorithm with external knowledge and image reconstruction loss to
overcome these dataset issues. In particular, we extract commonsense knowledge
from the external knowledge base to refine object and phrase features for
improving generalizability in scene graph generation. To address the bias of
noisy object annotations, we introduce an auxiliary image reconstruction path
to regularize the scene graph generation network. Extensive experiments show
that our framework can generate better scene graphs, achieving
state-of-the-art performance on two benchmark datasets: Visual Relationship
Detection and Visual Genome. Comment: 10 pages, 5 figures, accepted at CVPR 2019
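As a rough illustration of the kind of knowledge-based feature refinement the
abstract describes, the sketch below mixes an object's visual feature with the
mean embedding of related commonsense concepts retrieved from a toy knowledge
base. The KB entries, embeddings, and the simple gated average are all
illustrative assumptions, not the paper's actual design.

```python
import numpy as np

# Hypothetical mini knowledge base: object label -> related concept embeddings.
# Labels and vectors are made up for illustration only.
KB = {
    "dog": [np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.5, 0.0])],  # e.g. "animal", "pet"
    "car": [np.array([0.0, 1.0, 0.0])],                             # e.g. "vehicle"
}

def refine_feature(label, visual_feat, alpha=0.5):
    """Fuse a visual feature with the mean of its KB concept embeddings.

    alpha gates how much external knowledge is mixed in; labels absent
    from the KB fall back to the unmodified visual feature.
    """
    concepts = KB.get(label)
    if not concepts:
        return visual_feat
    knowledge = np.mean(concepts, axis=0)
    return (1 - alpha) * visual_feat + alpha * knowledge
```

The refined feature would then feed the relationship predictor in place of
the raw visual feature.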
Weakly Supervised Reasoning by Neuro-Symbolic Approaches
Deep learning has greatly improved the performance of various natural
language processing (NLP) tasks. However, most deep learning models are
black-box machinery and lack explicit interpretability. In this chapter, we will
introduce our recent progress on neuro-symbolic approaches to NLP, which
combine different schools of AI, namely symbolism and connectionism.
Generally, we will design a neural system with symbolic latent structures for
an NLP task, and apply reinforcement learning or its relaxation to perform
weakly supervised reasoning in the downstream task. Our framework has been
successfully applied to various tasks, including table query reasoning,
syntactic structure reasoning, information extraction reasoning, and rule
reasoning. For each application, we will introduce the background, our
approach, and experimental results. Comment: Compendium of Neurosymbolic
Artificial Intelligence, 665-692, 2023, IOS Press
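The reinforcement-learning relaxation mentioned above can be sketched in a
generic form as a REINFORCE update over a categorical latent decision (e.g.
which symbolic operation to execute). The softmax policy, the 0/1 downstream
reward, and the learning rate here are illustrative assumptions, not the
chapter's exact training procedure.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, rng, reward_fn, lr=0.1):
    """One REINFORCE update for a categorical latent decision.

    theta: logits over discrete symbolic actions.
    reward_fn: maps a sampled action to a scalar downstream reward
    (e.g. 1 if the executed program yields the correct answer).
    """
    probs = softmax(theta)
    action = rng.choice(len(theta), p=probs)
    reward = reward_fn(action)
    # grad of log pi(action) for a softmax policy: one_hot(action) - probs
    grad = -probs
    grad[action] += 1.0
    return theta + lr * reward * grad, action, reward
```

Repeated over many weakly supervised examples, the policy concentrates on
actions that earn downstream reward, without ever observing gold latent
structures.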
Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning
Recently, the attention-enriched encoder-decoder framework has attracted great
interest in image captioning due to its strong progress. Many visual
attention models directly leverage meaningful regions to generate image
descriptions. However, seeking a direct transition from visual space to text is
not enough to generate fine-grained captions. This paper exploits a
feature-compounding approach to bring together high-level semantic concepts and
visual information about the contextual environment in a fully end-to-end manner. Thus,
we propose a stacked cross-modal feature consolidation (SCFC) attention network
for image captioning in which we simultaneously consolidate cross-modal
features through a novel compounding function in a multi-step reasoning
fashion. In addition, we jointly employ spatial information and context-aware
attributes (CAA) as the principal components in our proposed compounding
function, where our CAA provides a concise context-sensitive semantic
representation. To make better use of the consolidated features, we
further propose an SCFC-LSTM as the caption generator, which can leverage
discriminative semantic information through the caption generation process. The
experimental results indicate that our proposed SCFC can outperform various
state-of-the-art image captioning models in terms of popular metrics on the
MSCOCO and Flickr30K datasets.
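A minimal sketch of one cross-modal consolidation step, under the assumption
that visual regions are attended conditioned on a context-aware attribute
vector and the attended result is then compounded element-wise with it. The
actual SCFC compounding function is learned and more elaborate; everything
below is an illustrative stand-in.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def consolidate(region_feats, context_attr, w_proj):
    """One hypothetical consolidation step.

    region_feats: (n_regions, d) visual region features
    context_attr: (d,) context-aware attribute (CAA-like) vector
    w_proj: (d, d) projection scoring regions against the context
    """
    scores = region_feats @ w_proj @ context_attr      # (n_regions,)
    attn = softmax(scores)                             # attention over regions
    visual = attn @ region_feats                       # (d,) attended feature
    # Compound visual and attribute information element-wise.
    return visual * context_attr + visual + context_attr
```

Stacking several such steps would give the multi-step reasoning fashion the
abstract describes.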
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in the computer vision
community because it plays an important role in video surveillance. Many
algorithms have been proposed to handle this task. The goal of this paper is to
review existing works using traditional methods or based on deep learning
networks. Firstly, we introduce the background of pedestrian attribute
recognition (PAR, for short), including the fundamental concepts of pedestrian
attributes and corresponding challenges. Secondly, we introduce existing
benchmarks, including popular datasets and evaluation criteria. Thirdly, we
analyse the concept of multi-task learning and multi-label learning, and also
explain the relations between these two learning paradigms and pedestrian
attribute recognition. We also review some popular network architectures which
have been widely applied in the deep learning community. Fourthly, we analyse
popular solutions for this task, such as attribute grouping, part-based
methods, etc. Fifthly, we show some applications that take pedestrian
attributes into consideration and achieve better performance. Finally, we
summarize this paper and give several possible research directions for
pedestrian attribute recognition. The project page of this paper can be found
at: https://sites.google.com/view/ahu-pedestrianattributes/. Comment: Check
the project page for a high-resolution version of this survey.
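The multi-label formulation the survey connects to PAR can be made concrete
with per-attribute sigmoid outputs and a binary cross-entropy loss: each
attribute is an independent binary prediction on the same pedestrian image.
The attribute list below is illustrative, not a benchmark's actual label set.

```python
import numpy as np

ATTRIBUTES = ["male", "backpack", "hat", "long_hair"]  # illustrative attributes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_bce(logits, targets, eps=1e-9):
    """Binary cross-entropy averaged over attributes: the standard
    multi-label objective, one independent binary decision per attribute."""
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

def predict(logits, threshold=0.5):
    """Return the attribute names whose predicted probability passes the threshold."""
    return [a for a, p in zip(ATTRIBUTES, sigmoid(logits)) if p > threshold]
```

Attribute grouping and part-based methods refine where these logits come from,
but the loss above is the common multi-label backbone.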
Modeling Inter-sentence Relations Using Deep Neural Network-based Sentence Encoders
Ph.D. dissertation, Department of Computer Science and Engineering, Seoul
National University, 2020. Sentence matching is the task of predicting the
degree of semantic match between two sentences.
Since a high-level understanding of natural language text is needed for a
model to identify the relationship between two sentences, sentence matching is
an important component of various natural language processing applications.
In this dissertation, we seek to improve the sentence matching module through
three ingredients: the sentence encoder, the matching function, and
semi-supervised learning.
To enhance the sentence encoder network, which is responsible for extracting
useful features from a sentence, we propose two new sentence encoder
architectures: Gumbel Tree-LSTM and Cell-aware Stacked LSTM (CAS-LSTM).
Gumbel Tree-LSTM is based on a recursive neural network (RvNN) architecture;
however, unlike typical RvNN architectures, it does not need structured input.
Instead, it learns from data a parsing strategy that is optimized for a
specific task.
The latter, CAS-LSTM, extends the stacked long short-term memory (LSTM)
architecture by introducing an additional forget gate for better handling of
vertical information flow.
Next, as a new matching function, we present the element-wise bilinear
sentence matching (ElBiS) function.
It aims to automatically find an aggregation scheme that fuses two sentence
representations into a single one suitable for a specific task.
From the fact that a sentence encoder is shared across inputs, we hypothesize,
and empirically verify, that considering only the element-wise bilinear
interaction is sufficient for comparing two sentence vectors.
By restricting the interaction, we can greatly reduce the number of required
parameters compared with full bilinear pooling methods, without losing the
advantage of automatically discovering useful aggregation schemes.
Finally, to facilitate semi-supervised training, i.e. to make use of both
labeled and unlabeled data in training, we propose the cross-sentence latent
variable model (CS-LVM).
Its generative model assumes that a target sentence is generated from the
latent representation of a source sentence and a variable indicating the
relationship between the source and the target sentence.
As it considers the two sentences in a pair together in a single model, the
training objectives are defined more naturally than in prior approaches based
on the variational auto-encoder (VAE).
We also define semantic constraints that force the generator to generate
semantically more plausible sentences.
We believe that the improvements proposed in this dissertation will advance
the effectiveness of various natural language processing applications that
involve modeling sentence pairs.
Chapter 1 Introduction
1.1 Sentence Matching
1.2 Deep Neural Networks for Sentence Matching
1.3 Scope of the Dissertation
Chapter 2 Background and Related Work
2.1 Sentence Encoders
2.2 Matching Functions
2.3 Semi-Supervised Training
Chapter 3 Sentence Encoder: Gumbel Tree-LSTM
3.1 Motivation
3.2 Preliminaries
3.2.1 Recursive Neural Networks
3.2.2 Training RvNNs without Tree Information
3.3 Model Description
3.3.1 Tree-LSTM
3.3.2 Gumbel-Softmax
3.3.3 Gumbel Tree-LSTM
3.4 Implementation Details
3.5 Experiments
3.5.1 Natural Language Inference
3.5.2 Sentiment Analysis
3.5.3 Qualitative Analysis
3.6 Summary
Chapter 4 Sentence Encoder: Cell-aware Stacked LSTM
4.1 Motivation
4.2 Related Work
4.3 Model Description
4.3.1 Stacked LSTMs
4.3.2 Cell-aware Stacked LSTMs
4.3.3 Sentence Encoders
4.4 Experiments
4.4.1 Natural Language Inference
4.4.2 Paraphrase Identification
4.4.3 Sentiment Classification
4.4.4 Machine Translation
4.4.5 Forget Gate Analysis
4.4.6 Model Variations
4.5 Summary
Chapter 5 Matching Function: Element-wise Bilinear Sentence Matching
5.1 Motivation
5.2 Proposed Method: ElBiS
5.3 Experiments
5.3.1 Natural Language Inference
5.3.2 Paraphrase Identification
5.4 Summary and Discussion
Chapter 6 Semi-Supervised Training: Cross-Sentence Latent Variable Model
6.1 Motivation
6.2 Preliminaries
6.2.1 Variational Auto-Encoders
6.2.2 von Mises-Fisher Distribution
6.3 Proposed Framework: CS-LVM
6.3.1 Cross-Sentence Latent Variable Model
6.3.2 Architecture
6.3.3 Optimization
6.4 Experiments
6.4.1 Natural Language Inference
6.4.2 Paraphrase Identification
6.4.3 Ablation Study
6.4.4 Generated Sentences
6.4.5 Implementation Details
6.5 Summary and Discussion
Chapter 7 Conclusion
Appendix A Appendix
A.1 Sentences Generated from CS-LVM
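The element-wise bilinear matching idea from the abstract above can be
sketched as follows. Instead of a full bilinear form u^T W_k v per output
dimension (d*d parameters each), only same-index interactions u_i * v_i are
kept, leaving one weight vector per output dimension. The exact
parameterisation of ElBiS in the dissertation may differ; the point
illustrated is the parameter saving.

```python
import numpy as np

def elbis_match(u, v, w, b):
    """Element-wise bilinear matching sketch.

    u, v: (d,) sentence vectors from a shared encoder
    w: (out_dim, d) element-wise bilinear weights
    b: (out_dim,) bias

    Only same-index products u_i * v_i interact, so the weight tensor is
    (out_dim, d) rather than (out_dim, d, d) as in full bilinear pooling.
    """
    return np.tanh(w @ (u * v) + b)
```

For output dimension o and sentence dimension d, full bilinear pooling needs
on the order of o*d*d parameters while this restriction needs only o*d, which
is the reduction the abstract claims.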
Toward Multi-modal Multi-aspect Deep Alignment and Integration
Multi-modal/-aspect data contains complementary information about the same
subject of interest, which has promising potential to improve model robustness
and has thus been gaining increasing research attention. There are two typical
categories of multi-modal/-aspect problems that require cross-modal/-aspect
alignment and integration: 1) heterogeneous multi-modal problems that deal
with data from multiple media forms, such as text, image, etc., and 2)
homogeneous multi-aspect problems that handle data with different aspects
represented by the same media form, such as the syntactic and semantic aspects
of a textual sentence. However, most existing approaches for
multi-modal/-aspect problems simply tackle cross-modal/-aspect alignment and
integration implicitly through various deep neural networks and optimize for
the final task goals, leaving potential strategies for improving
cross-modal/-aspect alignment and integration under-explored. This thesis aims
to initiate an exploration of strategies and approaches towards
multi-modal/-aspect deep alignment and integration. By looking into the
limitations of existing approaches for both heterogeneous multi-modal problems
and homogeneous multi-aspect problems, it proposes novel strategies and
approaches for improving cross-modal/-aspect alignment and integration and
evaluates them on representative tasks. For the heterogeneous setting, a
graph-structured representation learning approach that captures cross-modal
information is proposed to enforce better cross-modal alignment, evaluated on
Language-to-Vision and Vision-and-Language scenarios. For the homogeneous
setting, a bi-directional and deep cross-integration mechanism is explored to
synthesise multi-level semantics for comprehensive text understanding, which
is validated in the joint multi-aspect natural language understanding context
and its generalised text understanding setting.
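As one common, generic way to enforce cross-modal alignment (not the thesis's
graph-structured approach), matched image-text pairs can be pulled together
with an InfoNCE-style contrastive objective: matched pairs along the diagonal
of a similarity matrix are attracted, mismatched pairs repelled.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.1):
    """InfoNCE-style cross-modal alignment loss.

    img_emb, txt_emb: (n, d) L2-normalised embeddings where row i of each
    matrix forms a matched image-text pair. Returns the mean negative
    log-likelihood of picking the matched text for each image.
    """
    sims = img_emb @ txt_emb.T / temperature        # (n, n) similarity matrix
    sims = sims - sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # NLL of the matched pairs
```

Minimising this loss drives matched pairs toward each other in the shared
space, which is the basic behaviour any stronger alignment strategy builds on.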
A review of technical factors to consider when designing neural networks for semantic segmentation of Earth Observation imagery
Semantic segmentation (classification) of Earth Observation imagery is a
crucial task in remote sensing. This paper presents a comprehensive review of
technical factors to consider when designing neural networks for this purpose.
The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), Generative Adversarial Networks (GANs), and transformer
models, discussing prominent design patterns for these ANN families and their
implications for semantic segmentation. Common pre-processing techniques for
ensuring optimal data preparation are also covered. These include methods for
image normalization and chipping, as well as strategies for addressing data
imbalance in training samples, and techniques for overcoming limited data,
including augmentation techniques, transfer learning, and domain adaptation. By
encompassing both the technical aspects of neural network design and the
data-related considerations, this review provides researchers and practitioners
with a comprehensive and up-to-date understanding of the factors involved in
designing effective neural networks for semantic segmentation of Earth
Observation imagery. Comment: 145 pages with 32 figures
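Two of the pre-processing steps mentioned, per-band normalization and
chipping, can be sketched as follows; band statistics, chip size, and stride
are dataset-dependent choices, and this minimal version simply drops scene
edges not covered by a full chip.

```python
import numpy as np

def normalize(image, mean, std):
    """Per-band standardisation of an (H, W, bands) scene: subtract the
    band means and divide by the band standard deviations."""
    return (image - mean) / std

def chip(image, size, stride):
    """Cut a large scene into (possibly overlapping) square chips so it
    fits fixed network input sizes. Returns (n_chips, size, size, bands)."""
    h, w, _ = image.shape
    chips = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            chips.append(image[y:y + size, x:x + size])
    return np.stack(chips)
```

Production pipelines typically add padding or overlap-and-blend strategies at
scene edges; this sketch shows only the core tiling arithmetic.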
A Survey on Knowledge Graphs: Representation, Acquisition and Applications
Human knowledge provides a formal understanding of the world. Knowledge
graphs that represent structural relations between entities have become an
increasingly popular research direction towards cognition and human-level
intelligence. In this survey, we provide a comprehensive review of knowledge
graphs, covering research topics on 1) knowledge graph representation
learning, 2) knowledge acquisition and completion, 3) temporal knowledge graphs,
and 4) knowledge-aware applications, and summarize recent breakthroughs and
perspective directions to facilitate future research. We propose a full-view
categorization and new taxonomies on these topics. Knowledge graph embedding is
organized from four aspects of representation space, scoring function, encoding
models, and auxiliary information. For knowledge acquisition, especially
knowledge graph completion, we review embedding methods, path inference, and
logical rule reasoning. We further explore several emerging topics, including
meta relational learning, commonsense reasoning, and temporal knowledge graphs.
To facilitate future research on knowledge graphs, we also provide a curated
collection of datasets and open-source libraries on different tasks. Finally,
we provide a thorough outlook on several promising research directions.
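A representative example of the scoring functions such surveys organize
knowledge graph embeddings around is TransE: a triple (head, relation, tail)
is plausible when head + relation is close to tail in the embedding space,
scored by negative distance. The tiny embedding table below is illustrative.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: higher (closer to 0) means the triple
    (head, relation, tail) is more plausible under head + relation = tail."""
    return -np.linalg.norm(h + r - t)

# Illustrative 2-d embeddings for a toy graph.
entities = {
    "paris": np.array([1.0, 0.0]),
    "france": np.array([1.0, 1.0]),
    "berlin": np.array([0.0, 0.0]),
}
relations = {"capital_of": np.array([0.0, 1.0])}
```

In knowledge graph completion, candidate tails are ranked by this score; the
survey's taxonomy varies the representation space and scoring function around
this same template.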