Bidirectional Attention as a Mixture of Continuous Word Experts
Bidirectional attention, composed of self-attention with
positional encodings and the masked language model (MLM) objective,
has emerged as a key component of modern large language
models (LLMs). Despite its empirical success, few studies have examined its
statistical underpinnings: What statistical model is bidirectional attention
implicitly fitting? What sets it apart from its non-attention predecessors? We
explore these questions in this paper. The key observation is that fitting a
single-layer single-head bidirectional attention, upon reparameterization, is
equivalent to fitting a continuous bag of words (CBOW) model with
mixture-of-experts (MoE) weights. Further, bidirectional attention with
multiple heads and multiple layers is equivalent to stacked MoEs and a mixture
of MoEs, respectively. This statistical viewpoint reveals the distinct use of
MoE in bidirectional attention, which aligns with its practical effectiveness
in handling heterogeneous data. It also suggests an immediate extension to
categorical tabular data, if we view each word location in a sentence as a
tabular feature. Across empirical studies, we find that this extension
outperforms existing tabular extensions of transformers in out-of-distribution
(OOD) generalization. Finally, this statistical perspective of bidirectional
attention enables us to theoretically characterize when linear word analogies
are present in its word embeddings. These analyses show that bidirectional
attention can require much stronger assumptions to exhibit linear word
analogies than its non-attention predecessors.
Comment: 31 pages
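The intuition behind the CBOW/MoE reading can be sketched numerically. The snippet below is only an illustration under stated assumptions, not the paper's reparameterization: the function names (`cbow_context`, `attention_context`), the dimensions, and the random vectors are all hypothetical. It shows that plain CBOW averages context vectors with equal weights, while a single attention head averages them with softmax weights that depend on the query, much like the gating weights of a mixture of experts.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cbow_context(context_vecs):
    # Plain CBOW: every context word contributes with the same weight 1/n.
    return context_vecs.mean(axis=0)

def attention_context(query, keys, values):
    # Single-head attention: context vectors are averaged with softmax
    # weights that depend on the query (hypothetical "gating" weights).
    weights = softmax(keys @ query / np.sqrt(query.shape[0]))
    return weights @ values, weights

# Hypothetical toy data: n context words with d-dimensional vectors.
rng = np.random.default_rng(0)
d, n = 4, 5
values = rng.normal(size=(n, d))
keys = rng.normal(size=(n, d))
query = rng.normal(size=d)

ctx, w = attention_context(query, keys, values)

# With an all-zero query every score is zero, the weights become uniform,
# and the attention average reduces to the CBOW average.
uniform_ctx, uniform_w = attention_context(np.zeros(d), keys, values)
```

In this toy setting, `uniform_ctx` coincides with `cbow_context(values)`, whereas `ctx` weights the context words unevenly according to the query.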
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
Synthetic image datasets offer unmatched advantages for designing and
evaluating deep neural networks: they make it possible to (i) render as many
data samples as needed, (ii) precisely control each scene and yield granular
ground truth labels (and captions), (iii) precisely control distribution shifts
between training and testing to isolate variables of interest for sound
experimentation. Despite such promise, the use of synthetic image data is still
limited -- and often played down -- mainly due to their lack of realism. Most
works therefore rely on datasets of real images, which have often been scraped
from public images on the internet, and may have issues with regard to
privacy, bias, and copyright, while offering little control over how objects
precisely appear. In this work, we present a path to democratize the use of
photorealistic synthetic data: we develop a new generation of interactive
environments for representation learning research that offer both
controllability and realism. We use the Unreal Engine, a powerful game engine
well known in the entertainment industry, to produce PUG (Photorealistic Unreal
Graphics) environments and datasets for representation learning. In this paper,
we demonstrate the potential of PUG to enable more rigorous evaluations of
vision models
Getting aligned on representational alignment
Biological and artificial information processing systems form representations
that they can use to categorize, reason, plan, navigate, and make decisions.
How can we measure the extent to which the representations formed by these
diverse systems agree? Do similarities in representations then translate into
similar behavior? How can a system's representations be modified to better
match those of another system? These questions pertaining to the study of
representational alignment are at the heart of some of the most active research
areas in cognitive science, neuroscience, and machine learning. For example,
cognitive scientists measure the representational alignment of multiple
individuals to identify shared cognitive priors, neuroscientists align fMRI
responses from multiple individuals into a shared representational space for
group-level analyses, and ML researchers distill knowledge from teacher models
into student models by increasing their alignment. Unfortunately, there is
limited knowledge transfer between research communities interested in
representational alignment, so progress in one field often ends up being
rediscovered independently in another. Thus, greater cross-field communication
would be advantageous. To improve communication between these fields, we
propose a unifying framework that can serve as a common language between
researchers studying representational alignment. We survey the literature from
all three fields and demonstrate how prior work fits into this framework.
Finally, we lay out open problems in representational alignment where progress
can benefit all three of these fields. We hope that our work can catalyze
cross-disciplinary collaboration and accelerate progress for all communities
studying and developing information processing systems. We note that this is a
working paper and encourage readers to reach out with their suggestions for
future revisions.
Comment: Working paper, changes to be made in upcoming revision