
    Toward Multi-modal Multi-aspect Deep Alignment and Integration

    Multi-modal/-aspect data contains complementary information about the same object of interest, which holds promising potential for improving model robustness and has therefore attracted increasing research attention. There are two typical categories of multi-modal/-aspect problems that require cross-modal/-aspect alignment and integration: 1) heterogeneous multi-modal problems that deal with data from multiple media forms, such as text and images, and 2) homogeneous multi-aspect problems that handle data with different aspects represented in the same media form, such as the syntactic and semantic aspects of a textual sentence. However, most existing approaches simply tackle cross-modal/-aspect alignment and integration implicitly through various deep neural networks and optimise for the final task goals, leaving potential strategies for improving cross-modal/-aspect alignment and integration under-explored. This thesis initiates an exploration of strategies and approaches towards multi-modal/-aspect deep alignment and integration. By examining the limitations of existing approaches for both heterogeneous multi-modal and homogeneous multi-aspect problems, it proposes novel strategies and approaches for improving cross-modal/-aspect alignment and integration and evaluates them on the most essential representative tasks. For the heterogeneous setting, a graph-structured representation learning approach that captures cross-modal information is proposed to enforce better cross-modal alignment and is evaluated in Language-to-Vision and Vision-and-Language scenarios. For the homogeneous setting, a bi-directional and deep cross-integration mechanism is explored to synthesise multi-level semantics for comprehensive text understanding, validated in the joint multi-aspect natural language understanding context and its generalised text understanding setting.

    Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

    Recognizing the layout of unstructured digital documents is crucial for parsing them into a structured, machine-readable format for downstream applications. Recent studies in Document Layout Analysis (DLA) usually rely on computer vision models to understand documents while ignoring other information, such as contextual information or the relations between document components, which is vital to capture. Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis. We first construct graphs to explicitly describe four main aspects: syntactic, semantic, density, and appearance/visual information. Then, we apply graph convolutional networks to represent each aspect of information and use pooling to integrate them. Finally, we aggregate the aspects and feed them into 2-layer MLPs for document layout component classification. Our Doc-GCN achieves new state-of-the-art results on three widely used DLA datasets. Comment: Accepted by COLING 202
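    A minimal sketch of the idea described in this abstract, assuming a PyTorch setting: one lightweight GCN per aspect graph, per-node aspect embeddings concatenated and classified with a 2-layer MLP. The class names, aspect keys, tensor shapes, and hidden sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj_norm, h):
        return torch.relu(adj_norm @ self.linear(h))


class DocGCNSketch(nn.Module):
    def __init__(self, aspect_dims, hidden=128, num_classes=5):
        super().__init__()
        # One small GCN stack per aspect graph (syntactic, semantic, density, visual).
        self.aspect_gcns = nn.ModuleDict({
            name: nn.ModuleList([SimpleGCNLayer(dim, hidden),
                                 SimpleGCNLayer(hidden, hidden)])
            for name, dim in aspect_dims.items()
        })
        # 2-layer MLP over the concatenated per-aspect node representations.
        self.classifier = nn.Sequential(
            nn.Linear(hidden * len(aspect_dims), hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, graphs):
        # graphs: {aspect_name: (adj_norm [N, N], node_feats [N, dim])},
        # one node per layout component, same N for every aspect graph.
        per_aspect = []
        for name, (adj, feats) in graphs.items():
            h = feats
            for layer in self.aspect_gcns[name]:
                h = layer(adj, h)
            per_aspect.append(h)                      # [N, hidden]
        fused = torch.cat(per_aspect, dim=-1)         # [N, hidden * num_aspects]
        return self.classifier(fused)                 # logits per layout component
```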

    MC-DRE: Multi-Aspect Cross Integration for Drug Event/Entity Extraction

    Extracting meaningful drug-related information chunks, such as adverse drug events (ADE), is crucial for preventing morbidity and saving many lives. Most ADEs are reported via unstructured conversations within a medical context, so applying a general entity recognition approach is not sufficient. In this paper, we propose a new multi-aspect cross-integration framework for drug entity/event detection that captures and aligns different context/language/knowledge properties from drug-related documents. We first construct multi-aspect encoders to describe semantic, syntactic, and medical document contextual information by conducting three slot tagging tasks: main drug entity/event detection, part-of-speech tagging, and general medical named entity recognition. Then, each encoder conducts cross-integration with the other contextual information in three ways, key-value cross, attention cross, and feedforward cross, so that the multiple encoders are integrated in depth. Our model outperforms all SOTA on two widely used tasks, flat entity detection and discontinuous event extraction. Comment: Accepted at CIKM 202
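    An illustrative sketch of cross-integration between two aspect encoders (e.g. the semantic and syntactic encoders mentioned above). The paper's specific key-value, attention, and feed-forward cross formulations are not spelled out here, so a single generic cross-attention plus feed-forward fusion block stands in for them; every name, dimension, and layer choice is an assumption.

```python
import torch
import torch.nn as nn


class CrossIntegrationBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, own, other):
        # own/other: [batch, seq_len, dim] hidden states from two aspect encoders.
        # Attention cross: queries from this aspect, keys/values from the other.
        attended, _ = self.cross_attn(query=own, key=other, value=other)
        h = self.norm1(own + attended)
        # Feed-forward fusion: mix the attended information back in depth.
        return self.norm2(h + self.ffn(h))


# Usage: integrate the aspects with each other bi-directionally (dummy tensors).
sem = torch.randn(2, 40, 256)   # semantic encoder states
syn = torch.randn(2, 40, 256)   # syntactic encoder states
block = CrossIntegrationBlock()
sem_fused = block(sem, syn)
syn_fused = block(syn, sem)
```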

    Understanding Attention for Vision-and-Language Tasks

    The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which the attention score calculation mechanism is more (or less) interpretable and how it may impact model performance on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, which is commonly ignored in attention-based cross-modal models and/or pretrained models. Our code is available at: https://github.com/adlnlp/Attention_VL Comment: Accepted in COLING 202
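    For concreteness, a small sketch of common attention alignment score calculations between textual tokens and visual regions: dot product, scaled dot product, bilinear ("general"), and additive scoring. These are standard formulations; whether they are exactly the variants analysed in the paper is an assumption, and all tensors below are dummy data.

```python
import math
import torch
import torch.nn as nn


def dot_score(q, k):
    return q @ k.transpose(-1, -2)


def scaled_dot_score(q, k):
    return q @ k.transpose(-1, -2) / math.sqrt(q.size(-1))


class BilinearScore(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, q, k):
        return q @ self.W(k).transpose(-1, -2)


class AdditiveScore(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, 1)

    def forward(self, q, k):
        # Broadcast-add every (token, region) pair, then drop the score dim.
        return self.v(torch.tanh(self.proj_q(q).unsqueeze(2)
                                 + self.proj_k(k).unsqueeze(1))).squeeze(-1)


text = torch.randn(2, 20, 512)    # textual token features
image = torch.randn(2, 36, 512)   # visual region features
weights = torch.softmax(scaled_dot_score(text, image), dim=-1)  # [2, 20, 36]
attended_regions = weights @ image                              # [2, 20, 512]
```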

    Tri-level Joint Natural Language Understanding for Multi-turn Conversational Datasets

    Natural language understanding typically maps single utterances to a dual-level semantic frame: sentence-level intent and word-level slot labels. The best performing models force explicit interaction between intent detection and slot filling. We present a novel tri-level joint natural language understanding approach that adds domain and explicitly exchanges semantic information between all levels. This approach enables the use of multi-turn datasets, which are a more natural conversational environment than single utterances. We evaluate our model on two multi-turn datasets, for which we are the first to conduct joint slot filling and intent detection. Our model outperforms state-of-the-art joint models in slot filling and intent detection on multi-turn datasets. We provide an analysis of explicit interaction locations between the layers and conclude that including domain information improves model performance. Comment: Accepted at INTERSPEECH 202
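    A hedged sketch of the tri-level idea: one shared utterance encoder with domain, intent, and slot heads, where each level's prediction features are passed down to the next so the levels exchange information explicitly. The encoder type, head wiring, and sizes are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class TriLevelNLU(nn.Module):
    def __init__(self, vocab, dim=256, n_domains=7, n_intents=20, n_slots=50):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.domain_head = nn.Linear(dim, n_domains)
        self.intent_head = nn.Linear(dim + n_domains, n_intents)
        self.slot_head = nn.Linear(dim + n_domains + n_intents, n_slots)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))           # [B, T, dim]
        pooled = h.mean(dim=1)                            # utterance-level vector
        domain_logits = self.domain_head(pooled)          # sentence-level domain
        intent_logits = self.intent_head(
            torch.cat([pooled, domain_logits], dim=-1))   # domain-aware intent
        ctx = torch.cat([domain_logits, intent_logits], dim=-1)
        ctx = ctx.unsqueeze(1).expand(-1, h.size(1), -1)  # broadcast to tokens
        slot_logits = self.slot_head(torch.cat([h, ctx], dim=-1))  # word-level slots
        return domain_logits, intent_logits, slot_logits
```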

    Interpretable deep learning in single-cell omics

    Recent developments in single-cell omics technologies have enabled the quantification of molecular profiles in individual cells at an unparalleled resolution. Deep learning, a rapidly evolving sub-field of machine learning, has attracted significant interest in single-cell omics research due to its remarkable success in analysing heterogeneous, high-dimensional single-cell omics data. Nevertheless, the inherent multi-layer nonlinear architecture of deep learning models often makes them `black boxes', as the reasoning behind their predictions is unknown and not transparent to the user. This has stimulated a growing body of research addressing the lack of interpretability in deep learning models, especially in single-cell omics data analyses, where the identification and understanding of molecular regulators are crucial for interpreting model predictions and directing downstream experimental validation. In this work, we introduce the basics of single-cell omics technologies and the concept of interpretable deep learning. This is followed by a review of recent interpretable deep learning models applied to various areas of single-cell omics research. Lastly, we highlight current limitations and discuss potential future directions. We anticipate this review will bring together the single-cell and machine learning research communities to foster the future development and application of interpretable deep learning in single-cell omics research.
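    As a toy illustration of the kind of post-hoc interpretability discussed above, the snippet below attributes a classifier's prediction back to individual genes in a single cell's expression profile via gradient-times-input. The model, random data, and gene counts are placeholders and do not come from any study reviewed here.

```python
import torch
import torch.nn as nn

n_genes, n_cell_types = 2000, 10
model = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(),
                      nn.Linear(256, n_cell_types))

expr = torch.rand(1, n_genes, requires_grad=True)    # one cell's expression vector
logits = model(expr)
pred = logits.argmax(dim=-1).item()
logits[0, pred].backward()                           # gradient of the predicted class

attribution = (expr.grad * expr).detach().squeeze(0) # gradient x input per gene
top_genes = attribution.abs().topk(20).indices       # candidate genes to inspect
```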

    Form-NLU: Dataset for the Form Language Understanding

    Compared to general document analysis tasks, form document structure understanding and retrieval are challenging. Form documents are typically made by two types of authors: a form designer, who develops the form structure and keys, and a form user, who fills out form values based on the provided keys. Hence, the form values may not be aligned with the form designer's intention (structure and keys) if a form user gets confused. In this paper, we introduce Form-NLU, the first dataset for form structure understanding and its key and value information extraction, interpreting the form designer's intent and the alignment of user-written values with it. It consists of 857 form images, 6k form keys and values, and 4k table keys and values. Our dataset also covers three form types, digital, printed, and handwritten, which span diverse form appearances and layouts. We propose a robust positional and logical relation-based form key-value information extraction framework. Using this dataset, Form-NLU, we first examine strong object detection models for form layout understanding, then evaluate the key information extraction task on the dataset, providing fine-grained results for different types of forms and keys. Furthermore, we examine it with an off-the-shelf PDF layout extraction tool and demonstrate its feasibility in real-world cases. Comment: Accepted by SIGIR 202
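    A minimal sketch of the positional key-value intuition: for a detected key box, score candidate value boxes by simple geometric relations (right-of or below, penalised by distance). This hand-written heuristic only illustrates the idea of positional relations; the paper's framework is a learned model, and all boxes and thresholds here are made up.

```python
from dataclasses import dataclass
import math


@dataclass
class Box:
    text: str
    x: float  # left
    y: float  # top
    w: float
    h: float


def positional_score(key: Box, value: Box) -> float:
    # Prefer values to the right of or below the key, penalise distance.
    dx = value.x - (key.x + key.w)
    dy = value.y - (key.y + key.h)
    right_of = dx >= 0 and abs(value.y - key.y) < key.h   # roughly same row
    below = dy >= 0 and abs(value.x - key.x) < key.w      # roughly same column
    if not (right_of or below):
        return float("-inf")
    return -math.hypot(value.x - key.x, value.y - key.y)


def match_value(key: Box, candidates: list) -> Box:
    return max(candidates, key=lambda c: positional_score(key, c))


key = Box("Date of birth:", 40, 120, 120, 18)
candidates = [Box("12/03/1990", 180, 121, 90, 18), Box("Sydney", 40, 300, 60, 18)]
print(match_value(key, candidates).text)   # -> "12/03/1990"
```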