Toward Multi-modal Multi-aspect Deep Alignment and Integration
Multi-modal/-aspect data contains complementary information about the same object of interest, which has the potential to improve model robustness and has therefore attracted increasing research attention. There are two typical categories of multi-modal/-aspect problems that require cross-modal/-aspect alignment and integration: 1) heterogeneous multi-modal problems that deal with data from multiple media forms, such as text and images, and 2) homogeneous multi-aspect problems that handle data with different aspects represented in the same media form, such as the syntactic and semantic aspects of a textual sentence. However, most existing approaches to multi-modal/-aspect problems simply tackle the cross-modal/-aspect alignment and integration implicitly through various deep neural networks and optimize only for the final task objective, leaving potential strategies for improving the cross-modal/-aspect alignment and integration under-explored. This thesis initiates an exploration of strategies and approaches towards multi-modal/-aspect deep alignment and integration. By examining the limitations of existing approaches for both heterogeneous multi-modal problems and homogeneous multi-aspect problems, it proposes novel strategies and approaches for improving cross-modal/-aspect alignment and integration and evaluates them on essential representative tasks. For the heterogeneous setting, a graph-structured representation learning approach that captures cross-modal information is proposed to enforce better cross-modal alignment and is evaluated in Language-to-Vision and Vision-and-Language scenarios. For the homogeneous setting, a bi-directional and deep cross-integration mechanism is explored to synthesise multi-level semantics for comprehensive text understanding, which is validated in the joint multi-aspect natural language understanding context and its generalised text understanding setting.
Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis
Recognizing the layout of unstructured digital documents is crucial when parsing documents into a structured, machine-readable format for downstream applications. Recent studies in Document Layout Analysis usually rely on computer vision models to understand documents while ignoring other information, such as contextual information and the relations between document components, which are vital to capture. Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis. We first construct graphs to explicitly describe four main aspects: syntactic, semantic, density, and appearance/visual information. Then, we apply graph convolutional networks to represent each aspect of information and use pooling to integrate them. Finally, we aggregate the aspects and feed them into 2-layer MLPs for document layout component classification. Our Doc-GCN achieves new state-of-the-art results on three widely used DLA datasets.
Comment: Accepted by COLING 202
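As a rough illustration of the aspect-wise pipeline described above (per-aspect graph convolution followed by integration and a 2-layer MLP classifier), the following is a minimal PyTorch sketch, not the authors' implementation; the class names, dimensions, and the concatenation used in place of the paper's pooling-based integration are assumptions made for brevity.

```python
# Illustrative sketch (not the authors' code) of the Doc-GCN pipeline:
# one GCN per aspect graph, fused, then a 2-layer MLP classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj_norm, feats):
        return F.relu(adj_norm @ self.linear(feats))

class DocGCNSketch(nn.Module):
    """Per-aspect GCNs (syntactic, semantic, density, visual), fused by
    concatenation (standing in for the paper's pooling step), then a
    2-layer MLP over each layout component."""
    def __init__(self, in_dims, hidden=64, num_classes=5):
        super().__init__()
        self.aspect_gcns = nn.ModuleList(
            [SimpleGCNLayer(d, hidden) for d in in_dims]
        )
        self.mlp = nn.Sequential(
            nn.Linear(hidden * len(in_dims), hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, graphs):
        # graphs: list of (normalized adjacency, node features), one per
        # aspect, sharing the same ordering of layout-component nodes
        per_aspect = [gcn(a, x) for gcn, (a, x) in zip(self.aspect_gcns, graphs)]
        fused = torch.cat(per_aspect, dim=-1)
        return self.mlp(fused)  # per-component class logits

# Toy usage: 6 layout components, 4 aspect graphs with different feature sizes.
n, dims = 6, (32, 32, 8, 128)
adj = torch.eye(n)  # placeholder for a normalized adjacency matrix
graphs = [(adj, torch.randn(n, d)) for d in dims]
logits = DocGCNSketch(in_dims=dims)(graphs)
print(logits.shape)  # torch.Size([6, 5])
```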
MC-DRE: Multi-Aspect Cross Integration for Drug Event/Entity Extraction
Extracting meaningful drug-related information chunks, such as adverse drug events (ADE), is crucial for preventing morbidity and saving many lives. Most ADEs are reported via unstructured conversations within a medical context, so applying a general entity recognition approach is not sufficient. In this paper, we propose a new multi-aspect cross-integration framework for drug entity/event detection by capturing and aligning different context/language/knowledge properties from drug-related documents. We first construct multi-aspect encoders to describe semantic, syntactic, and medical document contextual information by conducting three slot tagging tasks: main drug entity/event detection, part-of-speech tagging, and general medical named entity recognition. Then, each encoder conducts cross-integration with the other contextual information in three ways: the key-value cross, attention cross, and feedforward cross, so that the multiple encoders are integrated in depth. Our model outperforms all SOTA models on two widely used tasks, flat entity detection and discontinuous event extraction.
Comment: Accepted at CIKM 202
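The following is a loose PyTorch sketch of how one aspect encoder's hidden states could be cross-integrated with another's via an attention cross, a key-value style mixing, and a feed-forward fusion; the concrete operators and their ordering in MC-DRE may differ, and all module names and dimensions here are illustrative.

```python
# Loose sketch (not the authors' implementation) of cross-integrating one
# aspect encoder's hidden states with another's, in the spirit of the
# key-value / attention / feed-forward crosses described above.
import torch
import torch.nn as nn

class CrossIntegrationSketch(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.kv_mix = nn.Linear(2 * dim, dim)      # "key-value" style mixing
        self.ffn = nn.Sequential(                  # feed-forward fusion
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, h_self, h_other):
        # attention cross: queries from this aspect, keys/values from the other
        attended, _ = self.attn(h_self, h_other, h_other)
        # key-value cross: concatenate and project the two aspect views
        mixed = self.kv_mix(torch.cat([h_self, attended], dim=-1))
        h = self.norm1(h_self + mixed)
        # feed-forward cross: nonlinear fusion with a residual connection
        return self.norm2(h + self.ffn(h))

# Toy usage: integrate a drug-NER encoder's states with a POS-tagging encoder's.
tokens, dim = 12, 256
h_drug, h_pos = torch.randn(1, tokens, dim), torch.randn(1, tokens, dim)
fused = CrossIntegrationSketch(dim)(h_drug, h_pos)
print(fused.shape)  # torch.Size([1, 12, 256])
```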
Understanding Attention for Vision-and-Language Tasks
The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks in order to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism is more (or less) interpretable, and which may impact model performance on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, which is commonly ignored in attention-based cross-modal models and/or pretrained models. Our code is available at: https://github.com/adlnlp/Attention_VL
Comment: Accepted in COLING 202
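For concreteness, the snippet below compares several common attention alignment score calculations between textual tokens and visual regions (dot product, scaled dot product, cosine, and additive); these are standard variants shown for illustration and are not necessarily the exact set analysed in the paper.

```python
# Illustrative comparison (assumed common variants, not necessarily the exact
# set studied in the paper) of attention alignment score calculations between
# textual token features and visual region features.
import torch
import torch.nn.functional as F

def dot_product(q, k):
    return q @ k.transpose(-2, -1)

def scaled_dot_product(q, k):
    return q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)

def cosine(q, k):
    return F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)

def additive(q, k, w_q, w_k, v):
    # Bahdanau-style: score[i, j] = v . tanh(W_q q_i + W_k k_j)
    combined = torch.tanh((q @ w_q).unsqueeze(1) + (k @ w_k).unsqueeze(0))
    return combined @ v

# Toy example: 5 text tokens attending over 8 visual regions, dimension 64.
d = 64
text, regions = torch.randn(5, d), torch.randn(8, d)
scores = {
    "dot": dot_product(text, regions),
    "scaled dot": scaled_dot_product(text, regions),
    "cosine": cosine(text, regions),
    "additive": additive(text, regions,
                         torch.randn(d, d), torch.randn(d, d), torch.randn(d)),
}
for name, score in scores.items():
    weights = F.softmax(score, dim=-1)   # text-to-region attention weights
    print(name, weights.shape)           # (5, 8) for every variant
```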
Tri-level Joint Natural Language Understanding for Multi-turn Conversational Datasets
Natural language understanding typically maps single utterances to a dual-level semantic frame: sentence-level intent and word-level slot labels. The best performing models force explicit interaction between intent detection and slot filling. We present a novel tri-level joint natural language understanding approach that adds a domain level and explicitly exchanges semantic information between all levels. This approach enables the use of multi-turn datasets, which provide a more natural conversational environment than single utterances. We evaluate our model on two multi-turn datasets, for which we are the first to conduct joint slot filling and intent detection. Our model outperforms state-of-the-art joint models in slot filling and intent detection on multi-turn datasets. We provide an analysis of explicit interaction locations between the layers. We conclude that including domain information improves model performance.
Comment: Accepted at INTERSPEECH 202
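As a hedged sketch of the general idea (a shared utterance encoder with domain, intent, and slot heads that share information across levels), the snippet below feeds domain and intent predictions into the slot layer; the paper's exchange mechanism is richer and operates between all levels, and every name and dimension here is an assumption.

```python
# Minimal sketch (assumed architecture, not the paper's exact model) of a
# tri-level joint NLU head: domain and intent predictions are fed forward as
# extra features for slot filling so the levels exchange information.
import torch
import torch.nn as nn

class TriLevelJointNLU(nn.Module):
    def __init__(self, hidden=128, n_domains=4, n_intents=10, n_slots=20):
        super().__init__()
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.domain_head = nn.Linear(2 * hidden, n_domains)
        self.intent_head = nn.Linear(2 * hidden + n_domains, n_intents)
        self.slot_head = nn.Linear(2 * hidden + n_domains + n_intents, n_slots)

    def forward(self, token_states):
        enc, _ = self.encoder(token_states)               # (B, T, 2H)
        utter = enc.mean(dim=1)                           # utterance summary
        domain_logits = self.domain_head(utter)           # sentence-level domain
        intent_logits = self.intent_head(                 # domain-aware intent
            torch.cat([utter, domain_logits.softmax(-1)], dim=-1)
        )
        ctx = torch.cat([domain_logits.softmax(-1), intent_logits.softmax(-1)], -1)
        ctx = ctx.unsqueeze(1).expand(-1, enc.size(1), -1)
        slot_logits = self.slot_head(torch.cat([enc, ctx], dim=-1))  # word level
        return domain_logits, intent_logits, slot_logits

# Toy usage: batch of 2 utterances, 7 tokens each, 128-dim token embeddings.
x = torch.randn(2, 7, 128)
d, i, s = TriLevelJointNLU()(x)
print(d.shape, i.shape, s.shape)  # (2, 4) (2, 10) (2, 7, 20)
```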
Interpretable deep learning in single-cell omics
Recent developments in single-cell omics technologies have enabled the
quantification of molecular profiles in individual cells at an unparalleled
resolution. Deep learning, a rapidly evolving sub-field of machine learning,
has attracted significant interest in single-cell omics research due to its
remarkable success in analysing heterogeneous high-dimensional single-cell
omics data. Nevertheless, the inherent multi-layer nonlinear architecture of
deep learning models often makes them 'black boxes', as the reasoning behind
predictions is often unknown and not transparent to the user. This has
stimulated an increasing body of research aimed at addressing the lack of
interpretability in deep learning models, especially in single-cell omics data
analyses, where the identification and understanding of molecular regulators
are crucial for interpreting model predictions and directing downstream
experimental validations. In this work, we introduce the basics of single-cell
omics technologies and the concept of interpretable deep learning. This is
followed by a review of recent interpretable deep learning models applied
to various areas of single-cell omics research. Lastly, we highlight the current
limitations and discuss potential future directions. We anticipate this review
to bring together the single-cell and machine learning research communities to
foster future development and application of interpretable deep learning in
single-cell omics research.
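To make the notion of interpretability concrete, the toy snippet below applies one widely used technique, gradient-based saliency, to attribute a hypothetical cell-type prediction back to individual genes; the model, dimensions, and data are invented for illustration and are not drawn from the review.

```python
# Toy illustration (not from the review) of gradient-based saliency:
# attributing a cell-type prediction back to individual genes in a
# single-cell expression profile.
import torch
import torch.nn as nn

# Hypothetical classifier: 2000 genes -> 8 cell types.
model = nn.Sequential(nn.Linear(2000, 256), nn.ReLU(), nn.Linear(256, 8))

expression = torch.randn(1, 2000, requires_grad=True)  # one cell's profile
logits = model(expression)
predicted = logits.argmax(dim=-1)

# Gradient of the predicted class score w.r.t. the input expression values.
logits[0, predicted.item()].backward()
saliency = expression.grad.abs().squeeze()

top_genes = saliency.topk(10).indices
print("Most influential gene indices for this prediction:", top_genes.tolist())
```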
Form-NLU: Dataset for the Form Language Understanding
Compared to general document analysis tasks, form document structure understanding and retrieval are challenging. Form documents are typically made by two types of authors: a form designer, who develops the form structure and keys, and a form user, who fills out form values based on the provided keys. Hence, the form values may not be aligned with the form designer's intention (structure and keys) if a form user gets confused. In this paper, we introduce Form-NLU, the first dataset for form structure understanding and its key and value information extraction, interpreting the form designer's intent and the alignment of user-written values with it. It consists of 857 form images, 6k form keys and values, and 4k table keys and values. Our dataset also includes three form types: digital, printed, and handwritten, which cover diverse form appearances and layouts. We propose a robust positional and logical relation-based form key-value information extraction framework. Using this dataset, Form-NLU, we first examine strong object detection models for form layout understanding, then evaluate the key information extraction task on the dataset, providing fine-grained results for different types of forms and keys. Furthermore, we examine it with an off-the-shelf PDF layout extraction tool and demonstrate its feasibility in real-world cases.
Comment: Accepted by SIGIR 202
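As a simplified, hypothetical illustration of positional relation-based key-value pairing (not the framework proposed in the paper), the snippet below scores each detected value box against candidate key boxes using relative-position features and a small learned scorer.

```python
# Simplified, hypothetical sketch of positional relation-based key-value
# pairing: each detected value is matched to the key whose box relation
# scores highest under a small learned scorer (not the paper's framework).
import torch
import torch.nn as nn

def box_relation(key_box, value_box):
    """Relative-position features between two [x1, y1, x2, y2] boxes."""
    kx, ky = (key_box[0] + key_box[2]) / 2, (key_box[1] + key_box[3]) / 2
    vx, vy = (value_box[0] + value_box[2]) / 2, (value_box[1] + value_box[3]) / 2
    return torch.stack([vx - kx, vy - ky,                    # centre offset
                        value_box[2] - value_box[0],         # value width
                        value_box[3] - value_box[1]])        # value height

scorer = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

# Toy detections: two key boxes and two value boxes on a form page.
keys = [torch.tensor([10., 10., 80., 30.]), torch.tensor([10., 50., 90., 70.])]
values = [torch.tensor([100., 10., 200., 30.]), torch.tensor([100., 50., 220., 70.])]

for vi, v in enumerate(values):
    scores = torch.stack([scorer(box_relation(k, v)).squeeze() for k in keys])
    print(f"value {vi} -> key {scores.argmax().item()}")
```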