One Venue, Two Conferences: The Separation of Chinese and American Citation Networks
At NeurIPS, American and Chinese institutions cite papers from each other's
regions substantially less than they cite endogamously. We build a citation
graph to quantify this divide, compare it to European connectivity, and discuss
the causes and consequences of the separation.
Comment: Workshop on Cultures of AI and AI for Culture @ NeurIPS 202
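The abstract above does not say exactly how the divide is measured, but a minimal sketch of one way to quantify regional citation insularity on a labeled citation graph (using networkx; the node names and region labels are hypothetical) might look like this:

```python
# Minimal sketch (not the paper's pipeline): count citation edges by
# (citing region, cited region) in a directed citation graph whose nodes
# carry a "region" attribute.
import networkx as nx
from collections import Counter

def citation_mix(G: nx.DiGraph, regions=("US", "CN")) -> dict:
    """Count citation edges by (citing region, cited region)."""
    counts = Counter()
    for u, v in G.edges():
        ru, rv = G.nodes[u].get("region"), G.nodes[v].get("region")
        if ru in regions and rv in regions:
            counts[(ru, rv)] += 1
    return dict(counts)

# Toy example with hypothetical papers and region labels.
G = nx.DiGraph()
for pid, region in [("a1", "US"), ("a2", "US"), ("c1", "CN"), ("c2", "CN")]:
    G.add_node(pid, region=region)
G.add_edges_from([("a1", "a2"), ("a1", "c1"), ("c1", "c2"), ("c2", "c1")])
print(citation_mix(G))  # e.g. {("US", "US"): 1, ("US", "CN"): 1, ("CN", "CN"): 2}
```

Comparing the off-diagonal counts to the endogamous ones gives one simple measure of how separated the two citation communities are.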
Training dynamics of neural language models
Why do artificial neural networks model language so well? We claim that in order to answer this question and understand the biases that lead to such high-performing language models---and all models that handle language---we must analyze the training process. For decades, linguists have used the tools of developmental linguistics to study human bias towards linguistic structure. Similarly, we wish to consider a neural network's training dynamics, i.e., the analysis of training in practice and the study of why our optimization methods work when applied. This framing shows us how structural patterns and linguistic properties are gradually built up, revealing more about why LSTM models learn so effectively on language data.
To explore these questions, we might be tempted to appropriate methods from developmental linguistics, but we do not wish to make cognitive claims, so we avoid analogizing between human and artificial language learners. We instead use mathematical tools designed for investigating language model training dynamics. These tools can take advantage of crucial differences between child development and model training: we have access to activations, weights, and gradients in a learning model, and can manipulate learning behavior directly or by perturbing inputs. While most research in training dynamics has focused on vision tasks, language offers direct annotation of its well-documented and intuitive latent hierarchical structures (e.g., syntax and semantics) and is therefore an ideal domain for exploring the effect of training dynamics on the representation of such structure.
Focusing on LSTM models, we investigate the natural sparsity of gradients and activations, finding that word representations become concentrated in just a few neurons late in training. Similarity analysis reveals how word embeddings learned for different tasks are highly similar at the beginning of training, but gradually become task-specific. Using synthetic data and measuring feature interactions, we also discover that hierarchical representations in LSTMs may be a result of their learning strategy: they tend to build new trees out of familiar phrases, by mingling together the meanings of constituents so that they depend on each other. These discoveries constitute just a few possible explanations for how LSTMs learn generalized language representations, with further theories on more architectures to be uncovered by the growing field of NLP training dynamics.
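As a concrete illustration of the kind of sparsity analysis described above (not the thesis's actual code or measure), the sketch below scores how concentrated word representations are on a few neurons using the Hoyer sparsity measure, applied to hypothetical early and late checkpoints:

```python
# A minimal sketch, assuming synthetic checkpoints: score how concentrated each
# word's representation is on a few dimensions via Hoyer sparsity, then compare
# an "early" checkpoint against a "late" one.
import numpy as np

def hoyer_sparsity(x: np.ndarray) -> float:
    """1.0 = all mass on one neuron, 0.0 = mass spread evenly across neurons."""
    n = x.size
    l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

# Hypothetical checkpoints: rows are words, columns are neurons/dimensions.
rng = np.random.default_rng(0)
early = rng.normal(size=(1000, 300))             # diffuse representations
late = early * (rng.random((1000, 300)) < 0.05)  # mass on ~5% of neurons
late[late == 0] += 1e-8                          # avoid all-zero rows

print("mean sparsity, early:", np.mean([hoyer_sparsity(w) for w in early]))
print("mean sparsity, late: ", np.mean([hoyer_sparsity(w) for w in late]))
```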
Towards out-of-distribution generalization in large-scale astronomical surveys: robust networks learn similar representations
The generalization of machine learning (ML) models to out-of-distribution
(OOD) examples remains a key challenge in extracting information from upcoming
astronomical surveys. Interpretability approaches are a natural way to gain
insights into the OOD generalization problem. We use Centered Kernel Alignment
(CKA), a similarity metric for neural network representations, to
examine the relationship between representation similarity and performance of
pre-trained Convolutional Neural Networks (CNNs) on the CAMELS Multifield
Dataset. We find that when models are robust to a distribution shift, they
produce substantially different representations across their layers on OOD
data. However, when they fail to generalize, these representations change less
from layer to layer on OOD data. We discuss the potential application of
representation similarity in guiding model design and training strategy, and in
mitigating the OOD problem by incorporating CKA as an inductive bias during
training.
Comment: Accepted to Machine Learning and the Physical Sciences Workshop,
NeurIPS 202
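For reference, linear CKA, the similarity measure named in the abstract, can be computed from two activation matrices as in the sketch below; the data and layer handling here are illustrative and not the paper's pipeline.

```python
# A minimal sketch of linear Centered Kernel Alignment (CKA) between two
# activation matrices of shape (n_examples, n_features).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), features centered."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Usage: compare a layer's activations on in-distribution vs. OOD inputs
# (hypothetical arrays standing in for CNN feature maps on CAMELS data).
rng = np.random.default_rng(0)
acts_id = rng.normal(size=(512, 128))
acts_ood = acts_id + 0.1 * (acts_id @ rng.normal(size=(128, 128)))
print(linear_cka(acts_id, acts_ood))  # close to 1 means similar representations
```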
TRAM: Bridging Trust Regions and Sharpness Aware Minimization
Sharpness-aware minimization (SAM) is reported to improve domain generalization
by reducing loss surface curvature in parameter space. However,
generalization during fine-tuning is often more dependent on the
transferability of representations in the function space. Trust-region methods
(TR) target this goal by regularizing representation curvature to reduce
catastrophic forgetting of pre-trained task-agnostic information while adopting
task-specific skills. We consider unifying these strategies for low curvature
in both parameter space and function space to improve out-of-domain (OOD)
generalization. We propose Trust Region Aware Minimization (TRAM), a SAM
algorithm that fine-tunes for low parameter sharpness and for smooth,
informative representations that preserve pre-trained structure. TRAM uses a trust region
bound to inform the SAM adversarial neighborhood, introducing an awareness of
function curvature within optimization for flatter minima. We empirically
validate TRAM in vision (cross-dataset adaptation) and text (OOD language
modeling, zero-shot cross-lingual transfer) tasks where robust domain transfer
and representation generality are critical. TRAM outperforms SAM- and TR-based
optimization across all tasks, notably surpassing competing methods for hard
transfer between anticorrelated domains. TRAM establishes a novel standard in
fine-tuning for domain-generalizable models with minimal additional computation
over previous sharpness-aware methods.
Comment: Camera Ready for ICLR 2024 (Accepted as Spotlight). 21 pages, 14
tables, 2 figures
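The abstract builds on sharpness-aware minimization. Below is a minimal sketch of a vanilla SAM update step, not the paper's TRAM implementation (which additionally derives the perturbation neighborhood from a trust-region bound on representation change); `model`, `loss_fn`, `inputs`, `targets`, and `base_opt` are assumed to exist.

```python
# Sketch of one SAM step: ascend to a nearby worst-case point in parameter
# space, compute the gradient there, then apply it at the original weights.
import torch

def sam_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    # 1) Ascent: gradient at the current weights, then move to w + eps,
    #    where eps = rho * grad / ||grad||.
    loss_fn(model(inputs), targets).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 2) Descent: gradient at the perturbed point, applied to the original weights.
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)      # undo the perturbation
    base_opt.step()            # update with the sharpness-aware gradient
    base_opt.zero_grad()
```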
Pareto Probing: Trading Off Accuracy for Complexity
The question of how to probe contextual word representations for linguistic
structure in a way that is both principled and useful has seen significant
attention recently in the NLP literature. In our contribution to this
discussion, we argue for a probe metric that reflects the fundamental trade-off
between probe complexity and performance: the Pareto hypervolume. To measure
complexity, we present a number of parametric and non-parametric metrics. Our
experiments using Pareto hypervolume as an evaluation metric show that probes
often do not conform to our expectations---e.g., why should the non-contextual
fastText representations encode more morpho-syntactic information than the
contextual BERT representations? These results suggest that common, simplistic
probing tasks, such as part-of-speech labeling and dependency arc labeling, are
inadequate to evaluate the linguistic structure encoded in contextual word
representations. This leads us to propose full dependency parsing as a probing
task. In support of our suggestion that harder probing tasks are necessary, our
experiments with dependency parsing reveal a wide gap in syntactic knowledge
between contextual and non-contextual representations.
Comment: Tiago Pimentel and Naomi Saphra contributed equally to this work.
Camera ready version of EMNLP 2020 publication. Code available in
https://github.com/rycolab/pareto-probin
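As a rough illustration of the evaluation metric the abstract argues for (not the paper's implementation), the sketch below computes the 2D Pareto hypervolume over hypothetical (complexity, accuracy) probe results:

```python
# A minimal sketch: the area dominated by the accuracy-vs-complexity trade-off
# curve of a set of probes, measured against a worst-case reference point.
def pareto_hypervolume(points, ref=(1.0, 0.0)):
    """points: (complexity, accuracy) pairs; complexity is minimized and
    accuracy maximized, both assumed normalized to [0, 1].
    ref: (max complexity, min accuracy) reference corner."""
    # Keep only non-dominated probes: no other probe is both simpler and at
    # least as accurate.
    front = [p for p in points
             if not any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in points)]
    front.sort()  # ascending complexity; accuracy also ascends along the front
    volume, prev_acc = 0.0, ref[1]
    # Sweep from simplest to most complex, adding each probe's extra accuracy slab.
    for comp, acc in front:
        volume += (ref[0] - comp) * (acc - prev_acc)
        prev_acc = acc
    return volume

# Hypothetical probes: (normalized complexity, accuracy).
probes = [(0.1, 0.60), (0.3, 0.75), (0.6, 0.80), (0.9, 0.81), (0.5, 0.70)]
print(pareto_hypervolume(probes))
```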
Latent State Models of Training Dynamics
The impact of randomness on model training is poorly understood. How do
differences in data order and initialization actually manifest in the model,
such that some training runs outperform others or converge faster? Furthermore,
how can we interpret the resulting training dynamics and the phase transitions
that characterize different trajectories? To understand the effect of
randomness on the dynamics and outcomes of neural network training, we train
models multiple times with different random seeds and compute a variety of
metrics throughout training, such as the norm, mean, and variance of the
neural network's weights. We then fit a hidden Markov model (HMM) over the
resulting sequences of metrics. The HMM represents training as a stochastic
process of transitions between latent states, providing an intuitive overview
of significant changes during training. Using our method, we produce a
low-dimensional, discrete representation of training dynamics on grokking
tasks, image classification, and masked language modeling. We use the HMM
representation to study phase transitions and identify latent "detour" states
that slow down convergence.
Comment: Accepted at TMLR 2023. Updated Jan 19, 2024 with erratum
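A minimal sketch of the described pipeline, assuming the hmmlearn library and synthetic metric trajectories in place of real training runs:

```python
# Collect per-checkpoint metrics from several training runs, fit a Gaussian HMM,
# and assign each checkpoint a discrete latent training state.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

rng = np.random.default_rng(0)

# Hypothetical trajectories: for each seed, a (n_checkpoints, n_metrics) array,
# e.g. weight norm, weight mean, and weight variance at each checkpoint.
runs = [rng.normal(size=(200, 3)).cumsum(axis=0) for _ in range(5)]

X = np.concatenate(runs)          # stack all runs into one matrix
lengths = [len(r) for r in runs]  # tell the HMM where each run ends

hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100, random_state=0)
hmm.fit(X, lengths)

# Decode one run into latent states: consecutive repeats form phases, and
# rarely visited states are candidates for convergence-slowing "detours".
states = hmm.predict(runs[0])
print(states[:20])
```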
Linear Connectivity Reveals Generalization Strategies
It is widely accepted in the mode connectivity literature that when two
neural networks are trained similarly on the same data, they are connected by a
path through parameter space over which test set accuracy is maintained. Under
some circumstances, including transfer learning from pretrained models, these
paths are presumed to be linear. In contrast to existing results, we find that
among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of
finetuned models have large barriers of increasing loss on the linear paths
between them. On each task, we find distinct clusters of models which are
linearly connected on the test loss surface, but are disconnected from models
outside the cluster -- models that occupy separate basins on the surface. By
measuring performance on specially-crafted diagnostic datasets, we find that
these clusters correspond to different generalization strategies: one cluster
behaves like a bag of words model under domain shift, while another cluster
uses syntactic heuristics. Our work demonstrates how the geometry of the loss
surface can guide models towards different heuristic functions.
Comment: Published as a conference paper at ICLR 202
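A minimal sketch of the linear interpolation test described in the abstract, assuming hypothetical helpers `make_model`, `eval_loss`, and `test_loader`:

```python
# Evaluate loss along the straight line between two finetuned models' weights
# and look for a barrier of increased loss between them.
import torch

def loss_along_linear_path(model_a, model_b, make_model, eval_loss, test_loader, steps=11):
    # Assumes both models share an architecture and that all state_dict entries
    # are floating-point tensors of matching shapes.
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        interp = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
        model = make_model()
        model.load_state_dict(interp)
        losses.append(eval_loss(model, test_loader))
    return losses

# A barrier exists if the path's peak loss clearly exceeds both endpoints:
# barrier_height = max(losses) - max(losses[0], losses[-1])
```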