
    Self-Attention Network for Text Representation Learning

    University of Technology Sydney, Faculty of Engineering and Information Technology. This research studies the effectiveness and efficiency of self-attention mechanisms for text representation learning in deep learning-based natural language processing. We focus on developing novel self-attention networks that capture the semantic and syntactic knowledge underlying natural language texts and thus benefit a wide range of downstream natural language understanding tasks, and then on improving these networks with the external relational, structured, factoid, and commonsense information found in knowledge graphs. In the last decade, recurrent neural networks and convolutional neural networks have been widely used to produce context-aware representations for natural language text: the former can capture long-range dependencies but are hard to parallelize and not time-efficient; the latter focus on local dependencies but do not perform well on some tasks. Attention mechanisms, especially self-attention mechanisms, have recently attracted tremendous interest from both academia and industry due to their light-weight structure, parallelizable computation, and outstanding performance on a broad spectrum of natural language processing tasks. We first propose a novel attention mechanism in which the attention between elements of the input sequence(s) is directional and multi-dimensional (i.e., feature-wise). Compared to previous work, the proposed mechanism captures subtle differences in context and thus alleviates the ambiguity and polysemy problem. Based solely on this attention, we present a light-weight neural model, the directional self-attention network, that learns both token- and sentence-level context-aware representations with high efficiency and competitive performance. Furthermore, we improve the proposed network in several directions: first, we extend the self-attention to a hierarchical structure that captures local and global dependencies with better memory efficiency; second, we introduce hard attention into the self-attention mechanism to combine the benefits of soft and hard attention; third, we capture both pairwise and global dependencies with a novel compatibility function composed of dot-product and additive attentions. This research then conducts extensive experiments on benchmark tasks to verify the effectiveness of the proposed self-attention networks from both quantitative and qualitative perspectives. The benchmark tasks, including natural language inference, sentiment analysis, and semantic role labeling, comprehensively assess a model's ability to capture both the semantic and syntactic information underlying natural language texts. The results show that the proposed models achieve state-of-the-art performance on a wide range of natural language understanding tasks while being as fast and as memory-efficient as convolutional models. Lastly, although self-attention networks, even those initialized from a pre-trained language model, learn powerful contextualized representations and achieve state-of-the-art performance, open questions remain about what these models have learned and how they can be improved. One such direction arises when downstream task performance depends on relational knowledge, the kind stored in knowledge graphs. Therefore, we explore combining self-attention networks with human-curated knowledge graphs, since such knowledge can improve a self-attention network either by symbolic reasoning over the graph to derive targeted results or by embedding the relational information into the network to boost representation learning. We study several potential approaches in three knowledge graph-related scenarios in natural language processing: knowledge-based question answering, knowledge base completion, and commonsense reasoning. Experiments conducted on knowledge graph-related benchmarks demonstrate the effectiveness of the proposed models.
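
    To illustrate the directional, multi-dimensional (feature-wise) attention described above, here is a minimal PyTorch sketch. It is not the thesis implementation: the function name `dir_self_attention`, the additive scoring form, and the masking convention are assumptions made for clarity.

```python
# Minimal sketch of directional, feature-wise self-attention
# (an illustration under assumptions, not the thesis code).
import torch

def dir_self_attention(x, w1, w2, b, forward=True):
    """x: (n, d) token representations -> (n, d) context-aware representations.
    The score between tokens i and j is a d-dimensional vector (feature-wise),
    and a directional mask keeps each token's attention on one side."""
    n, d = x.shape
    # Feature-wise additive compatibility: scores[i, j] is a vector in R^d
    scores = torch.tanh((x @ w1).unsqueeze(1) + (x @ w2).unsqueeze(0) + b)  # (n, n, d)
    idx = torch.arange(n)
    allowed = idx.unsqueeze(1) >= idx.unsqueeze(0) if forward else \
              idx.unsqueeze(1) <= idx.unsqueeze(0)                          # directional mask
    scores = scores.masked_fill(~allowed.unsqueeze(-1), float("-inf"))
    alpha = torch.softmax(scores, dim=1)        # normalize over source tokens, per feature
    return (alpha * x.unsqueeze(0)).sum(dim=1)  # (n, d)

# Example usage with random inputs
d = 8
x = torch.randn(5, d)
out = dir_self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.zeros(d))
```

    Two passes, one with forward=True and one with forward=False, can be concatenated to encode both directions, in the spirit of the directional design described in the abstract.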

    Multi-turn Inference Matching Network for Natural Language Inference

    Natural Language Inference (NLI) is a fundamental and challenging task in Natural Language Processing (NLP). Most existing methods apply only a one-pass inference process to a mixed matching feature, which is a concatenation of different matching features between a premise and a hypothesis. In this paper, we propose a new model called the Multi-turn Inference Matching Network (MIMN) that performs multi-turn inference over different matching features. In each turn, the model focuses on one particular matching feature instead of the mixed matching feature. To enhance the interaction between different matching features, a memory component is employed to store the inference history, and the inference in each turn is performed on the current matching feature together with the memory. We conduct experiments on three NLI datasets. The experimental results show that our model achieves or surpasses state-of-the-art performance on all three datasets.
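
    To make the multi-turn idea concrete, here is a minimal PyTorch sketch under stated assumptions: a GRU cell stands in for the memory component, and each turn consumes one matching feature together with the memory. The class name, dimensions, and three-way classifier are illustrative, not the MIMN implementation.

```python
# Minimal sketch of multi-turn inference over matching features
# (assumptions for illustration, not the MIMN code).
import torch
import torch.nn as nn

class MultiTurnInference(nn.Module):
    def __init__(self, feat_dim, mem_dim):
        super().__init__()
        self.memory_cell = nn.GRUCell(feat_dim, mem_dim)  # stores inference history
        self.classifier = nn.Linear(mem_dim, 3)           # entail / neutral / contradict

    def forward(self, matching_feats):
        # matching_feats: list of (batch, feat_dim) tensors, one per matching feature
        batch = matching_feats[0].size(0)
        memory = torch.zeros(batch, self.memory_cell.hidden_size)
        for feat in matching_feats:          # one inference turn per matching feature
            memory = self.memory_cell(feat, memory)
        return self.classifier(memory)       # final NLI prediction from the memory

# Example usage
model = MultiTurnInference(feat_dim=16, mem_dim=32)
feats = [torch.randn(4, 16) for _ in range(3)]   # e.g. three matching features
logits = model(feats)                            # (4, 3)
```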

    Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together

    Neural networks equipped with self-attention offer parallelizable computation, a light-weight structure, and the ability to capture both long-range and local dependencies. Their expressive power and performance can be further boosted by using a vector to measure pairwise dependency, but this requires expanding the alignment matrix into a tensor, which creates memory and computation bottlenecks. In this paper, we propose a novel attention mechanism called "Multi-mask Tensorized Self-Attention" (MTSA), which is as fast and as memory-efficient as a CNN but significantly outperforms previous CNN-, RNN-, and attention-based models. MTSA 1) captures both pairwise (token2token) and global (source2token) dependencies with a novel compatibility function composed of dot-product and additive attentions, 2) uses a tensor to represent the feature-wise alignment scores for greater expressive power while requiring only parallelizable matrix multiplications, and 3) combines multi-head with multi-dimensional attention and applies a distinct positional mask to each head (subspace), so memory and computation can be distributed across heads, each encoding sequential information independently. Experiments show that a CNN/RNN-free model based on MTSA achieves state-of-the-art or competitive performance on nine NLP benchmarks with compelling memory and time efficiency.
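
    The combination of a scalar dot-product (token2token) score with a feature-wise additive (source2token) score can be illustrated with the short sketch below. It is only a rough approximation under assumptions: unlike MTSA, it materializes the full (n, n, d) alignment tensor rather than avoiding it through equivalent matrix operations, and the projection names are invented for the example.

```python
# Rough sketch of combining dot-product token2token and additive
# source2token scores (not the MTSA implementation).
import math
import torch

def combined_attention(x, wq, wk, w_add, b_add, mask=None):
    # x: (n, d) token representations
    n, d = x.shape
    q, k = x @ wq, x @ wk
    pairwise = (q @ k.T) / math.sqrt(d)            # token2token, (n, n) scalar scores
    global_fw = torch.tanh(x @ w_add + b_add)      # source2token, (n, d) feature-wise scores
    # Broadcast into feature-wise logits: (n, n, d) = (n, n, 1) + (1, n, d)
    logits = pairwise.unsqueeze(-1) + global_fw.unsqueeze(0)
    if mask is not None:                           # e.g. a distinct positional mask per head
        logits = logits.masked_fill(~mask.unsqueeze(-1), float("-inf"))
    alpha = torch.softmax(logits, dim=1)           # attention over source tokens, per feature
    return (alpha * x.unsqueeze(0)).sum(dim=1)     # (n, d)
```

    MTSA's memory advantage comes precisely from never forming the (n, n, d) tensor explicitly; this sketch trades that efficiency for readability.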