Self-Attention Network for Text Representation Learning

Abstract

University of Technology Sydney, Faculty of Engineering and Information Technology.

This research studies the effectiveness and efficiency of self-attention mechanisms for text representation learning in the deep learning-based natural language processing literature. We focus on developing novel self-attention networks that capture the semantic and syntactic knowledge underlying natural language texts and thus benefit a wide range of downstream natural language understanding tasks; we then improve these networks with the external relational, structured, factoid, and commonsense information stored in knowledge graphs.

In the last decade, recurrent neural networks and convolutional neural networks have been widely used to produce context-aware representations of natural language text: the former can capture long-range dependencies but is hard to parallelize and not time-efficient; the latter focuses on local dependencies but does not perform well on some tasks. Attention mechanisms, especially self-attention mechanisms, have recently attracted tremendous interest from both academia and industry, due to their light-weight structures, parallelizable computation, and outstanding performance on a broad spectrum of natural language processing tasks.

We first propose a novel attention mechanism in which the attention between elements of the input sequence(s) is directional and multi-dimensional (i.e., feature-wise). Compared to previous work, the proposed mechanism captures subtle differences in context and thus alleviates ambiguity and polysemy problems. Based solely on this attention, we present a light-weight neural model, the directional self-attention network, which learns both token- and sentence-level context-aware representations with high efficiency and competitive performance. We further improve the proposed network in several directions: first, we extend the self-attention to a hierarchical structure that captures local and global dependencies with better memory efficiency; second, we introduce hard attention into the self-attention mechanism so that soft and hard attention benefit each other; third, we capture both pairwise and global dependencies with a novel compatibility function that combines dot-product and additive attention.

This research then conducts extensive experiments on benchmark tasks to verify the effectiveness of the proposed self-attention networks from both quantitative and qualitative perspectives. The benchmark tasks, including natural language inference, sentiment analysis, and semantic role labeling, comprehensively assess a model's ability to capture both the semantic and the syntactic information underlying natural language texts. The empirical results show that the proposed models achieve state-of-the-art performance on a wide range of natural language understanding tasks while being as fast and as memory-efficient as convolutional models.

Lastly, although self-attention networks, even those initialized from a pre-trained language model, learn powerful contextualized representations and achieve state-of-the-art performance, open questions remain about what these models have learned and how they can be improved. One such direction arises when downstream task performance depends on relational knowledge, the kind stored in knowledge graphs.
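To make the directional, multi-dimensional attention and the combined compatibility function summarized above more concrete, the sketch below is a minimal, self-contained illustration rather than the exact formulation developed in the thesis: it scores every token pair with an additive, feature-wise term plus a scaled dot-product term, masks the scores so that each token attends only to preceding (or following) tokens, and applies a feature-wise softmax over positions. All function names, weight shapes, and constants here are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def directional_multidim_self_attention(x, Wq, Wk, Wa, b, direction="forward"):
    """Toy directional, multi-dimensional (feature-wise) self-attention.

    x : (n, d) token representations for one sentence.
    Returns (n, d) context-aware token representations.
    """
    n, d = x.shape
    q, k = x @ Wq, x @ Wk                           # (n, d) projected tokens

    # Compatibility for every pair (i, j): an additive term producing a
    # d-dimensional (feature-wise) score plus a scalar scaled dot-product term.
    additive = np.tanh(q[:, None, :] + k[None, :, :] + b) @ Wa   # (n, n, d)
    dot = (q @ k.T) / np.sqrt(d)                                 # (n, n)
    scores = additive + dot[:, :, None]                          # (n, n, d)

    # Directional mask: token i may only attend to earlier (or later) tokens.
    idx = np.arange(n)
    if direction == "forward":
        mask = idx[None, :] < idx[:, None]          # j strictly before i
    else:
        mask = idx[None, :] > idx[:, None]          # j strictly after i
    scores = np.where(mask[:, :, None], scores, -1e9)

    # Feature-wise softmax over positions j, then a weighted sum per feature.
    # (Tokens with no valid context degrade to a uniform average in this toy.)
    alpha = softmax(scores, axis=1)                 # (n, n, d)
    return (alpha * x[None, :, :]).sum(axis=1)      # (n, d)

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
Wq, Wk, Wa = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
b = np.zeros(d)
out = directional_multidim_self_attention(x, Wq, Wk, Wa, b, direction="forward")
print(out.shape)  # (5, 8)
```

Because the softmax is taken per feature dimension rather than per token pair, different features of the same token can attend to different parts of the context, which is what lets such a mechanism distinguish subtly different contexts and thus mitigate polysemy.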
Motivated by this dependence on relational knowledge, we explore combining self-attention networks with human-curated knowledge graphs: such knowledge can improve a self-attention network either by conducting symbolic reasoning over the knowledge graph to derive targeted results or by embedding the relational information into the neural network to boost representation learning. We study several potential approaches in three knowledge graph-related natural language processing scenarios, i.e., knowledge-based question answering, knowledge base completion, and commonsense reasoning. Experiments conducted on knowledge graph-related benchmarks show the effectiveness of our proposed models.
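As a rough illustration of the embedding-based route, the sketch below uses a TransE-style translation score (a standard knowledge graph embedding technique, not necessarily the one adopted in the thesis) to rank candidate entities for a knowledge base completion query; the entities, relation, and embeddings are toy placeholders.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style plausibility score: a triple (h, r, t) is considered
    plausible when the tail embedding is close to head + relation."""
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(0)
dim = 16

# Toy entity and relation embeddings. They are random here; in practice they
# would be trained so that observed triples receive high scores.
entities = {name: rng.normal(size=dim)
            for name in ["Canberra", "Australia", "France", "Paris"]}
relations = {"capital_of": rng.normal(size=dim)}

# Knowledge base completion as ranking: answer (Canberra, capital_of, ?) by
# scoring every candidate tail entity and keeping the best one.
h, r = entities["Canberra"], relations["capital_of"]
ranked = sorted(entities, key=lambda name: transe_score(h, r, entities[name]),
                reverse=True)
print(ranked)  # with trained embeddings, "Australia" should rank first
```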
