Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together
Neural networks equipped with self-attention have parallelizable computation,
light-weight structure, and the ability to capture both long-range and local
dependencies. Further, their expressive power and performance can be boosted by
using a vector to measure pairwise dependency, but this requires expanding the
alignment matrix into a tensor, which results in memory and computation
bottlenecks. In this paper, we propose a novel attention mechanism called
"Multi-mask Tensorized Self-Attention" (MTSA), which is as fast and as
memory-efficient as a CNN, but significantly outperforms previous
CNN-/RNN-/attention-based models. MTSA 1) captures both pairwise (token2token)
and global (source2token) dependencies by a novel compatibility function
composed of dot-product and additive attentions, 2) uses a tensor to represent
the feature-wise alignment scores for better expressive power but only requires
parallelizable matrix multiplications, and 3) combines multi-head with
multi-dimensional attentions, and applies a distinct positional mask to each
head (subspace), so the memory and computation can be distributed to multiple
heads, each with sequential information encoded independently. The experiments
show that a CNN/RNN-free model based on MTSA achieves state-of-the-art or
competitive performance on nine NLP benchmarks with compelling memory- and
time-efficiency.
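A minimal, illustrative sketch of one MTSA-style head is given below in PyTorch (the framework and parameter names are assumptions, not taken from the paper). It adds a scaled dot-product token2token score to a feature-wise additive source2token score and applies a per-head positional mask; for readability it materializes the length-by-length-by-feature score tensor that the actual method factors into plain matrix multiplications.

    import torch
    import torch.nn.functional as F

    def mtsa_head_sketch(x, w_q, w_k, w_v, w_g, b_g, pos_mask):
        # x: (L, d) token representations; w_*: (d, d_h) projection matrices
        # (hypothetical names); pos_mask: (L, L) 0/1 positional mask for this head.
        q, k, v = x @ w_q, x @ w_k, x @ w_v                  # (L, d_h) each
        pairwise = (q @ k.T) / (k.shape[-1] ** 0.5)          # (L, L) token2token scores
        global_scores = torch.tanh(x @ w_g + b_g)            # (L, d_h) feature-wise source2token scores
        # Broadcast-add the two scores into an (L, L, d_h) tensor; MTSA itself
        # avoids materializing this tensor, which is done here only for clarity.
        scores = pairwise.unsqueeze(-1) + global_scores.unsqueeze(0)
        scores = scores.masked_fill(pos_mask.unsqueeze(-1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=1)                      # normalize over source tokens
        return torch.einsum("ijd,jd->id", attn, v)           # (L, d_h) head output

In the full model, each head would use a different positional mask (for example forward, backward, or unmasked), so directional information is encoded without recurrence.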
Improving BERT with Self-Supervised Attention
One of the most popular paradigms for applying large pre-trained NLP models
such as BERT is to fine-tune them on a smaller dataset. However, one challenge
remains: the fine-tuned model often overfits on smaller datasets. A symptom
of this phenomenon is that irrelevant or misleading words in the sentence,
which are easy to understand for human beings, can substantially degrade the
performance of these fine-tuned BERT models. In this paper, we propose a novel
technique, called Self-Supervised Attention (SSA), to help address this
generalization challenge. Specifically, SSA automatically generates weak,
token-level attention labels iteratively by probing the fine-tuned model from
the previous iteration. We investigate two different ways of integrating SSA
into BERT and propose a hybrid approach to combine their benefits. Empirically,
through a variety of public datasets, we illustrate significant performance
improvement using our SSA-enhanced BERT model.
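The abstract does not describe the probing step in detail; the sketch below shows one plausible probe, assuming a Hugging Face-style sequence classifier whose output exposes .logits: a token gets a weak "relevant" label when deleting it changes the prediction of the model fine-tuned in the previous iteration.

    import torch

    def probe_token_labels(model, input_ids, attention_mask, pad_id):
        # Generate weak token-level attention labels by probing the previous
        # iteration's fine-tuned model.  The token-deletion probe and the pad_id
        # substitution are assumptions, not necessarily the SSA paper's procedure.
        model.eval()
        with torch.no_grad():
            base_pred = model(input_ids=input_ids,
                              attention_mask=attention_mask).logits.argmax(dim=-1)
            labels = torch.zeros_like(input_ids)
            for pos in range(input_ids.size(1)):
                perturbed = input_ids.clone()
                perturbed[:, pos] = pad_id                   # "delete" one token
                pred = model(input_ids=perturbed,
                             attention_mask=attention_mask).logits.argmax(dim=-1)
                labels[:, pos] = (pred != base_pred).long()  # 1 = prediction changed
        return labels  # weak targets for an auxiliary attention loss in the next round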
What If We Simply Swap the Two Text Fragments? A Straightforward yet Effective Way to Test the Robustness of Methods to Confounding Signals in Natural Language Inference Tasks
Natural language inference (NLI) is the predictive task of determining the
inference relationship of a pair of natural language sentences. With the
increasing popularity of NLI, many state-of-the-art predictive models have been
proposed with impressive performance. However, several works have noticed
statistical irregularities in the collected NLI data sets that may result in an
over-estimated performance of these models, and have proposed remedies. In this
paper, we further investigate these statistical irregularities, which we refer
to as confounding factors, of the NLI data sets. With the belief that some NLI
labels should be preserved under swapping operations, we propose a simple yet
effective way (swapping the two text fragments) of evaluating NLI predictive
models that naturally mitigates the observed problems. Further, we continue to
train the predictive models with our swapping scheme and propose to use the
deviation of a model's evaluation performance under different percentages of
swapped training text fragments to describe the robustness of a predictive
model. Our evaluation metric leads to some interesting insights into recently
published NLI methods. Finally, we also apply the swapping operation to NLI
models to assess the effectiveness of this straightforward method in mitigating
confounding-factor problems when training generic sentence embeddings for other
NLP transfer tasks.
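As an illustration of the swap test, the sketch below re-scores pairs whose gold label is expected to survive swapping (here assumed to be contradiction only) with the premise and hypothesis exchanged, and reports the accuracy gap as a rough robustness signal; predict is a hypothetical stand-in for any trained NLI model's batch prediction call.

    def swap_consistency(predict, pairs, labels, invariant=("contradiction",)):
        # pairs: list of (premise, hypothesis) strings; labels: gold label strings.
        # predict maps a list of (premise, hypothesis) pairs to predicted labels
        # (hypothetical interface).  Which labels are swap-invariant is an
        # assumption here, not a claim from the paper.
        keep = [i for i, y in enumerate(labels) if y in invariant]
        subset = [pairs[i] for i in keep]
        gold = [labels[i] for i in keep]
        orig_acc = sum(p == y for p, y in zip(predict(subset), gold)) / len(gold)
        swapped = [(hyp, prem) for prem, hyp in subset]
        swap_acc = sum(p == y for p, y in zip(predict(swapped), gold)) / len(gold)
        return orig_acc, swap_acc, orig_acc - swap_acc

A large gap between the two accuracies suggests the model relies on confounding signals tied to fragment position rather than on the inference relation itself.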