Distance-based Self-Attention Network for Natural Language Inference
Attention mechanisms have been used as an ancillary means to help RNNs or CNNs.
However, the Transformer (Vaswani et al., 2017) recently recorded the
state-of-the-art performance in machine translation with a dramatic reduction
in training time by solely using attention. Motivated by the Transformer,
Directional Self Attention Network (Shen et al., 2017), a fully attention-based
sentence encoder, was proposed. It showed good performance with various data by
using forward and backward directional information in a sentence. However, their
study did not consider the distance between words, an important feature for
learning local dependencies that help in understanding the context of the input
text. We propose the Distance-based Self-Attention Network, which accounts for
word distance through a simple distance mask, modeling local dependencies
without losing attention's inherent ability to model global dependencies. Our
model shows good performance on NLI data, and it
records the new state-of-the-art result with SNLI data. Additionally, we show
that our model is particularly strong on long sentences and documents.
Comment: 12 pages, 13 figures
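The core idea above, attention logits penalized by token distance, can be sketched in a few lines. This is a minimal NumPy sketch assuming one simple linear-penalty form of the mask with a free hyperparameter `alpha`; the paper's exact mask differs.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distance_masked_attention(Q, K, V, alpha=1.0):
    """Scaled dot-product attention with a simple distance penalty.

    The mask subtracts alpha * |i - j| from each attention logit, so
    distant positions receive exponentially smaller weights after
    softmax while the global attention pattern remains intact.
    """
    n, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])  # |i - j| for every pair
    weights = softmax(logits - alpha * dist)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 4))
out, w = distance_masked_attention(Q, K, V, alpha=2.0)
```

With a larger `alpha`, attention weights concentrate near the diagonal (local dependency); with `alpha=0`, the mask vanishes and plain global attention is recovered.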
Dynamic Self-Attention: Computing Attention over Words Dynamically for Sentence Embedding
In this paper, we propose Dynamic Self-Attention (DSA), a new self-attention
mechanism for sentence embedding. We design DSA by modifying dynamic routing in
capsule networks (Sabour et al., 2017) for natural language processing. DSA attends
to informative words with a dynamic weight vector. We achieve new
state-of-the-art results among sentence encoding methods on the Stanford Natural
Language Inference (SNLI) dataset with the fewest parameters, while showing
competitive results on the Stanford Sentiment Treebank (SST) dataset.
Comment: 7 pages, 4 figures
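The routing-style pooling described above can be sketched as follows: attention logits are refined over a few iterations by the agreement between each word vector and the current sentence summary. This is a simplified sketch of the DSA idea, not the authors' exact update rule.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_self_attention(H, iters=3):
    """Pool word vectors H (n, d) into one sentence vector with a
    routing-style loop borrowed from capsule networks."""
    n, d = H.shape
    b = np.zeros(n)                 # routing logits, start uniform
    z = np.zeros(d)
    for _ in range(iters):
        c = softmax(b)              # dynamic attention weights
        z = c @ H                   # current sentence summary
        b = b + H @ z               # boost words that agree with z
    return z, c

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 8))
z, c = dynamic_self_attention(H)
```

Unlike a static attention vector, the weights `c` here depend on the input itself and sharpen over iterations toward the most informative words.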
Towards Open Intent Discovery for Conversational Text
Detecting and identifying user intent from text, both written and spoken,
plays an important role in modelling and understanding dialogs. Existing
research on intent discovery models it as a classification task over a
predefined set of known categories. To generalize beyond these preexisting
classes, we define a new task of open intent discovery. We investigate how
intents can be generalized to those not seen during training. To this end, we
propose a two-stage approach to this task: first predicting whether an
utterance contains an intent, and then tagging the intent in the input
utterance. Our model consists
of a bidirectional LSTM with a CRF on top to capture contextual semantics,
subject to some constraints. Self-attention is used to learn long distance
dependencies. Further, we adapt an adversarial training approach to improve
robustness and performance across domains. We also present a dataset of 25k
real-life utterances that have been labelled via crowdsourcing. Our
experiments across different domains and real-world datasets show the
effectiveness of our approach, with less than 100 annotated examples needed per
unique domain to recognize diverse intents. The approach outperforms
state-of-the-art baselines by 5-15 F1 percentage points.
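The second stage above, tagging the intent span subject to constraints, can be illustrated with a small Viterbi decode over BIO labels in which the CRF-style constraint forbids an "inside" tag that does not follow a "begin" or "inside" tag. The emission scores and example sentence below are hypothetical, not from the paper's dataset.

```python
import numpy as np

TAGS = ["O", "B", "I"]

def viterbi_bio(emissions):
    """emissions: (n_tokens, 3) scores for O/B/I. Returns the best
    tag sequence that never places I after O or at the start."""
    n, t = emissions.shape
    NEG = -1e9
    trans = np.zeros((t, t))
    trans[0, 2] = NEG          # O -> I forbidden
    score = emissions[0].copy()
    score[2] = NEG             # sequence cannot start with I
    back = np.zeros((n, t), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + trans + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)   # best previous tag per current tag
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    path.reverse()
    return [TAGS[p] for p in path]

# "book a flight to boston" -- scores nudge "book a flight" toward B I I
em = np.array([[0.1, 1.0, 0.0],
               [0.1, 0.0, 1.0],
               [0.1, 0.0, 1.0],
               [1.0, 0.0, 0.2],
               [1.0, 0.0, 0.0]])
tags = viterbi_bio(em)   # -> ['B', 'I', 'I', 'O', 'O']
```

In the full model these emission scores would come from the BiLSTM with self-attention rather than being hand-set.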
Self-Attentional Acoustic Models
Self-attention is a method of encoding sequences of vectors by relating these
vectors to each other based on pairwise similarities. These models have
recently shown promising results for modeling discrete sequences, but they are
non-trivial to apply to acoustic modeling due to computational and modeling
issues. In this paper, we apply self-attention to acoustic modeling, proposing
several improvements to mitigate these issues: First, self-attention memory
grows quadratically in the sequence length, which we address through a
downsampling technique. Second, we find that previous approaches to incorporate
position information into the model are unsuitable and explore other
representations and hybrid models to this end. Third, to stress the importance
of local context in the acoustic signal, we propose a Gaussian biasing approach
that allows explicit control over the context range. Experiments find that our
model approaches a strong baseline based on LSTMs with network-in-network
connections while being much faster to compute. Besides speed, we find that
interpretability is a strength of self-attentional acoustic models, and
demonstrate that self-attention heads learn a linguistically plausible division
of labor.
Comment: Published at Interspeech 201
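The Gaussian biasing approach described above can be sketched directly: a bias of -((i - j)^2) / (2 * sigma^2) is added to the attention logits, so each frame mostly attends to nearby frames and `sigma` gives explicit control over the context range. This sketch fixes `sigma` as a hyperparameter, though it could equally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_biased_attention(Q, K, V, sigma=2.0):
    """Self-attention with a Gaussian locality bias on the logits.

    The bias decays quadratically with frame distance, stressing local
    acoustic context while still allowing long-range attention where the
    content logits are strong enough.
    """
    n, d = Q.shape
    idx = np.arange(n)
    bias = -((idx[:, None] - idx[None, :]) ** 2) / (2 * sigma ** 2)
    w = softmax(Q @ K.T / np.sqrt(d) + bias)
    return w @ V, w

rng = np.random.default_rng(4)
Q = K = V = rng.normal(size=(8, 6))   # 8 frames, 6-dim features
out, w = gaussian_biased_attention(Q, K, V, sigma=1.0)
```

A small `sigma` yields a narrow attention window; letting `sigma` grow recovers unbiased global attention.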
The Natural Language Decathlon: Multitask Learning as Question Answering
Deep learning has improved performance on many natural language processing
(NLP) tasks individually. However, general NLP models cannot emerge within a
paradigm that focuses on the particularities of a single metric, dataset, and
task. We introduce the Natural Language Decathlon (decaNLP), a challenge that
spans ten tasks: question answering, machine translation, summarization,
natural language inference, sentiment analysis, semantic role labeling,
zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and
commonsense pronoun resolution. We cast all tasks as question answering over a
context. Furthermore, we present a new Multitask Question Answering Network
(MQAN) that jointly learns all tasks in decaNLP without any task-specific modules or
parameters in the multitask setting. MQAN shows improvements in transfer
learning for machine translation and named entity recognition, domain
adaptation for sentiment analysis and natural language inference, and zero-shot
capabilities for text classification. We demonstrate that the MQAN's
multi-pointer-generator decoder is key to this success and performance further
improves with an anti-curriculum training strategy. Though designed for
decaNLP, MQAN also achieves state-of-the-art results on the WikiSQL semantic
parsing task in the single-task setting. We also release code for procuring and
processing data, training and evaluating models, and reproducing all
experiments for decaNLP.
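The unifying move above, casting every task as question answering over a context, amounts to a single data format. The question wordings below are illustrative placeholders, not copied from the decaNLP release.

```python
# Every labeled example becomes a (question, context, answer) triple,
# so one model with one input/output interface covers all ten tasks.
examples = [
    {"question": "What is the translation from English to German?",
     "context": "The house is small.",
     "answer": "Das Haus ist klein."},
    {"question": "Is this sentence positive or negative?",
     "context": "A thoroughly enjoyable film.",
     "answer": "positive"},
]

def as_qa(task_question, context, answer):
    """Wrap any labeled example into the unified QA format."""
    return {"question": task_question, "context": context, "answer": answer}
```

Because the task identity lives in the question string rather than in a task-specific output head, the same decoder can answer any of them.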
Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding
Attention-based models have shown significant improvement over traditional
algorithms in several NLP tasks. The Transformer, for instance, is an
illustrative example that generates abstract representations of the tokens
input to an encoder based on their relationships to all tokens in the sequence. Recent
studies have shown that although such models are capable of learning syntactic
features purely by seeing examples, explicitly feeding this information to deep
learning models can significantly enhance their performance. Leveraging
syntactic information like part of speech (POS) may be particularly beneficial
in limited training data settings for complex models such as the Transformer.
We show that the syntax-infused Transformer with multiple features achieves an
improvement of 0.7 BLEU when trained on the full WMT'14 English-to-German
translation dataset and a maximum improvement of 1.99 BLEU points when trained
on a fraction of the dataset. In addition, we find that incorporating
syntax into BERT fine-tuning outperforms the baseline on a number of downstream
tasks from the GLUE benchmark.
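One simple way to "infuse" syntax as described above is to concatenate an embedding of each token's POS tag onto its word embedding before the encoder. This is a minimal sketch of the idea; the paper's exact fusion mechanism and feature set may differ, and the vocabularies below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sleeps": 2}
pos_tags = {"DET": 0, "NOUN": 1, "VERB": 2}
word_emb = rng.normal(size=(len(vocab), 8))   # learned word embeddings
pos_emb = rng.normal(size=(len(pos_tags), 4)) # learned POS embeddings

def embed(tokens, tags):
    """Concatenate word and POS embeddings per token: (n, 8 + 4)."""
    w = word_emb[[vocab[t] for t in tokens]]
    p = pos_emb[[pos_tags[t] for t in tags]]
    return np.concatenate([w, p], axis=1)

X = embed(["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"])
```

In low-data regimes, the explicit POS signal spares the model from having to induce such syntactic regularities purely from examples.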
Integrated Eojeol Embedding for Erroneous Sentence Classification in Korean Chatbots
This paper analyzes a Korean sentence classification system for chatbots.
Sentence classification is the task of classifying an input sentence into
predefined categories. However, spelling or spacing errors in the input
sentence cause problems in morphological analysis and tokenization.
This paper proposes a novel approach of Integrated Eojeol (Korean syntactic
word separated by space) Embedding to reduce the effect that poorly analyzed
morphemes may make on sentence classification. It also proposes two noise
insertion methods that further improve classification performance. Our
evaluation results indicate that the proposed system classifies erroneous
sentences more accurately than the baseline system by 17%p.
Comment: 9 pages, 2 figures
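The noise-insertion idea above can be illustrated with a simple augmenter that randomly drops or inserts spaces to simulate user typing errors. This is a hypothetical, language-agnostic variant for illustration; the paper's two methods are defined over eojeols and differ in detail.

```python
import random

def inject_space_noise(sentence, p=0.2, seed=None):
    """Randomly drop existing spaces and insert spurious ones with
    probability p per character, producing noisy training copies."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if ch == " " and rng.random() < p:
            continue              # drop a space
        out.append(ch)
        if ch != " " and rng.random() < p:
            out.append(" ")       # insert a spurious space
    return "".join(out)
```

Training on such corrupted copies alongside clean ones teaches the classifier to tolerate the spacing errors that break morphological analysis.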
DGA-Net: Dynamic Gaussian Attention Network for Sentence Semantic Matching
Sentence semantic matching requires an agent to determine the semantic
relation between two sentences, where much recent progress has been made by the
advancement of representation learning techniques and inspiration from human
behaviors. Among these methods, the attention mechanism plays an essential role
by effectively selecting important parts. However, current attention methods
either focus on all the important parts in a static way or dynamically select
only one important part at each attention step, which leaves a large space
for further improvement. To this end, in this paper, we design a novel Dynamic
Gaussian Attention Network (DGA-Net) to combine the advantages of current
static and dynamic attention methods. More specifically, we first leverage a
pre-trained language model to encode the input sentences and construct semantic
representations from a global perspective. Then, we develop a Dynamic Gaussian
Attention (DGA) to dynamically capture the important parts and corresponding
local contexts from a detailed perspective. Finally, we combine the global
information and detailed local information together to decide the semantic
relation of sentences comprehensively and precisely. Extensive experiments on
two popular sentence semantic matching tasks demonstrate that our proposed
DGA-Net is effective in improving the ability of the attention mechanism.
Comment: Accepted by CICAI202
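One attention step in the spirit of DGA can be sketched as follows: dynamically pick the most relevant position for the current query, then re-weight attention with a Gaussian centered there, so the local context around the selected part is captured too. This is a simplified sketch, not the authors' exact model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_gaussian_attention(H, q, sigma=1.5):
    """H: (n, d) token representations, q: (d,) query vector.
    Select the best-matching position, then attend with a Gaussian
    window around it to pull in its local context."""
    n, d = H.shape
    logits = H @ q / np.sqrt(d)
    center = int(logits.argmax())               # dynamically selected part
    idx = np.arange(n)
    bias = -((idx - center) ** 2) / (2 * sigma ** 2)
    w = softmax(logits + bias)
    return w @ H, w, center

rng = np.random.default_rng(5)
H = rng.normal(size=(10, 6))
q = rng.normal(size=6)
out, w, center = dynamic_gaussian_attention(H, q)
```

This combines the two regimes the abstract contrasts: the argmax step is dynamic, while the Gaussian window spreads weight over a static local neighbourhood rather than a single token.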
Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News
Fake news is nowadays an issue of pressing concern, given its recent rise
as a potential threat to high-quality journalism and well-informed public
discourse. The Fake News Challenge (FNC-1) was organized in 2017 to encourage
the development of machine learning-based classification systems for stance
detection (i.e., for identifying whether a particular news article agrees,
disagrees, discusses, or is unrelated to a particular news headline), thus
helping in the detection and analysis of possible instances of fake news. This
article presents a new approach to tackle this stance detection problem, based
on the combination of string similarity features with a deep neural
architecture that leverages ideas previously advanced in the context of
learning efficient text representations, document classification, and natural
language inference. Specifically, we use bi-directional Recurrent Neural
Networks, together with max-pooling over the temporal/sequential dimension and
neural attention, for representing (i) the headline, (ii) the first two
sentences of the news article, and (iii) the entire news article. These
representations are then combined/compared, complemented with similarity
features inspired by other FNC-1 approaches, and passed to a final layer that
predicts the stance of the article towards the headline. We also explore the
use of external sources of information, specifically large datasets of sentence
pairs originally proposed for training and evaluating natural language
inference methods, in order to pre-train specific components of the neural
network architecture (e.g., the RNNs used for encoding sentences). The obtained
results attest to the effectiveness of the proposed ideas and show that our
model, particularly when considering pre-training and the combination of neural
representations together with similarity features, slightly outperforms the
previous state-of-the-art.
Comment: Accepted for publication in the special issue of the ACM Journal of
Data and Information Quality (ACM JDIQ) on Combating Digital Misinformation
and Disinformation
Multiple Structural Priors Guided Self Attention Network for Language Understanding
Self-attention networks (SANs) have been widely utilized in recent NLP
studies. Unlike CNNs or RNNs, standard SANs are usually position-independent,
and thus are incapable of capturing the structural priors between sequences of
words. Existing studies commonly apply a single mask strategy to SANs to
incorporate structural priors, but fail to model the richer
structural information of texts. In this paper, we aim to introduce multiple
types of structural priors into SAN models, proposing the Multiple Structural
Priors Guided Self Attention Network (MS-SAN) that transforms different
structural priors into different attention heads by using a novel multi-mask
based multi-head attention mechanism. In particular, we integrate two
categories of structural priors, including the sequential order and the
relative position of words. To capture the latent hierarchical structure of
texts, we extract this information not only from the word contexts but also
from dependency syntax trees. Experimental results on two tasks show that
MS-SAN achieves significant improvements over other strong baselines.
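The multi-mask mechanism above can be sketched by giving each attention head its own additive (n, n) mask: a forward-mask head sees only preceding words, a backward-mask head only following ones, and further masks could encode dependency-tree structure. For brevity this sketch shares one set of projections across heads, whereas real heads would each have their own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_mask_attention(X, masks):
    """One attention head per structural mask; forbidden positions carry
    a large negative value so they vanish after softmax. Head outputs
    are concatenated, giving shape (n, d * n_heads)."""
    n, d = X.shape
    logits = X @ X.T / np.sqrt(d)
    heads = [softmax(logits + m) @ X for m in masks]
    return np.concatenate(heads, axis=1)

n = 5
NEG = -1e9
forward = np.where(np.tril(np.ones((n, n))) > 0, 0.0, NEG)   # j <= i only
backward = np.where(np.triu(np.ones((n, n))) > 0, 0.0, NEG)  # j >= i only
rng = np.random.default_rng(3)
X = rng.normal(size=(n, 4))
out = multi_mask_attention(X, [forward, backward])
```

Each mask injects one structural prior into its head, so the concatenated output mixes several views of the same sequence instead of relying on a single mask strategy.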