Bidirectional Attention as a Mixture of Continuous Word Experts
Bidirectional attention – composed of self-attention with
positional encodings and the masked language model (MLM) objective
– has emerged as a key component of modern large language
models (LLMs). Despite its empirical success, few studies have examined its
statistical underpinnings: What statistical model is bidirectional attention
implicitly fitting? What sets it apart from its non-attention predecessors? We
explore these questions in this paper. The key observation is that fitting a
single-layer single-head bidirectional attention, upon reparameterization, is
equivalent to fitting a continuous bag of words (CBOW) model with
mixture-of-experts (MoE) weights. Further, bidirectional attention with
multiple heads and multiple layers is equivalent to stacked MoEs and a mixture
of MoEs, respectively. This statistical viewpoint reveals the distinct use of
MoE in bidirectional attention, which aligns with its practical effectiveness
in handling heterogeneous data. It also suggests an immediate extension to
categorical tabular data, if we view each word location in a sentence as a
tabular feature. Across empirical studies, we find that this extension
outperforms existing tabular extensions of transformers in out-of-distribution
(OOD) generalization. Finally, this statistical perspective of bidirectional
attention enables us to theoretically characterize when linear word analogies
are present in its word embeddings. These analyses show that bidirectional
attention can require much stronger assumptions to exhibit linear word
analogies than its non-attention predecessors.
Comment: 31 pages
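The contrast this abstract draws between a plain CBOW model and one with mixture-of-experts weights can be illustrated with a toy sketch. It is not the paper's reparameterization. It only shows the general idea that CBOW averages context vectors uniformly, while an MoE-style model combines them with softmax gate weights. The function names, the toy vectors, and the gate scores are all hypothetical, and in an attention layer the scores would be computed from the tokens and positions rather than supplied by hand.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(vectors, weights):
    # Combine context vectors with the given per-position weights.
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) for d in range(dim)]

def cbow_context(context_vectors):
    # CBOW: every context position gets the same weight 1/n.
    n = len(context_vectors)
    return weighted_sum(context_vectors, [1.0 / n] * n)

def moe_context(context_vectors, gate_scores):
    # MoE-style: each context position acts as an "expert" and the
    # weights come from a softmax over (hypothetical) gate scores.
    return weighted_sum(context_vectors, softmax(gate_scores))

# Toy 2-dimensional context embeddings (made up for illustration).
context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform = cbow_context(context)
gated_equal = moe_context(context, [0.0, 0.0, 0.0])    # equal scores
gated_peaked = moe_context(context, [10.0, 0.0, 0.0])  # favors position 0
```

With equal gate scores the gated combination reduces to the uniform CBOW average. With a peaked score vector it concentrates on a single context position, which is the sense in which the weights select among experts.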
Semantic Representation and Inference for NLP
Semantic representation and inference are essential for Natural Language
Processing (NLP). The state of the art for semantic representation and
inference is deep learning, and particularly Recurrent Neural Networks (RNNs),
Convolutional Neural Networks (CNNs), and transformer Self-Attention models.
This thesis investigates the use of deep learning for novel semantic
representation and inference, and makes contributions in the following three
areas: creating training data, improving semantic representations and extending
inference learning. In terms of creating training data, we contribute the
largest publicly available dataset of real-life factual claims for the purpose
of automatic claim verification (MultiFC), and we present a novel inference
model composed of multi-scale CNNs with different kernel sizes that learn from
external sources to infer fact checking labels. In terms of improving semantic
representations, we contribute a novel model that captures non-compositional
semantic indicators. By definition, the meaning of a non-compositional phrase
cannot be inferred from the individual meanings of its composing words (e.g.,
hot dog). Motivated by this, we operationalize the compositionality of a phrase
contextually by enriching the phrase representation with external word
embeddings and knowledge graphs. Finally, in terms of inference learning, we
propose a series of novel deep learning architectures that improve inference by
using syntactic dependencies, ensembling role-guided attention heads,
incorporating gating layers, and concatenating multiple heads in novel and
effective ways. This thesis consists of seven publications (five published and
two under review).
Comment: PhD thesis, the University of Copenhagen