One Venue, Two Conferences: The Separation of Chinese and American Citation Networks
At NeurIPS, American and Chinese institutions cite papers from each other's
regions substantially less than they cite endogamously. We build a citation
graph to quantify this divide, compare it to European connectivity, and discuss
the causes and consequences of the separation.
Comment: Workshop on Cultures of AI and AI for Culture @ NeurIPS 202
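The abstract above does not say exactly how the divide is measured, but a minimal sketch of one way to quantify regional citation insularity on a labeled citation graph (using networkx; the node names and region labels are hypothetical) might look like this:

```python
# Minimal sketch (not the paper's pipeline): count citation edges by
# (citing region, cited region) in a directed citation graph whose nodes
# carry a "region" attribute.
import networkx as nx
from collections import Counter

def citation_mix(G: nx.DiGraph, regions=("US", "CN")) -> dict:
    """Count citation edges by (citing region, cited region)."""
    counts = Counter()
    for u, v in G.edges():
        ru, rv = G.nodes[u].get("region"), G.nodes[v].get("region")
        if ru in regions and rv in regions:
            counts[(ru, rv)] += 1
    return dict(counts)

# Toy example with hypothetical papers and region labels.
G = nx.DiGraph()
for pid, region in [("a1", "US"), ("a2", "US"), ("c1", "CN"), ("c2", "CN")]:
    G.add_node(pid, region=region)
G.add_edges_from([("a1", "a2"), ("a1", "c1"), ("c1", "c2"), ("c2", "c1")])
print(citation_mix(G))  # e.g. {("US", "US"): 1, ("US", "CN"): 1, ("CN", "CN"): 2}
```

Comparing the off-diagonal counts to the endogamous ones gives one simple measure of how separated the two citation communities are.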
Training dynamics of neural language models
Why do artificial neural networks model language so well? We claim that in order to answer this question and understand the biases that lead to such high-performing language models---and all models that handle language---we must analyze the training process. For decades, linguists have used the tools of developmental linguistics to study human bias towards linguistic structure. Similarly, we wish to consider a neural network's training dynamics, i.e., the analysis of training in practice and the study of why our optimization methods work when applied. This framing shows us how structural patterns and linguistic properties are gradually built up, revealing more about why LSTM models learn so effectively on language data.
To explore these questions, we might be tempted to appropriate methods from developmental linguistics, but we do not wish to make cognitive claims, so we avoid analogizing between human and artificial language learners. We instead use mathematical tools designed for investigating language model training dynamics. These tools can take advantage of crucial differences between child development and model training: we have access to activations, weights, and gradients in a learning model, and can manipulate learning behavior directly or by perturbing inputs. While most research in training dynamics has focused on vision tasks, language offers direct annotation of its well-documented and intuitive latent hierarchical structures (e.g., syntax and semantics) and is therefore an ideal domain for exploring the effect of training dynamics on the representation of such structure.
Focusing on LSTM models, we investigate the natural sparsity of gradients and activations, finding that word representations become concentrated in just a few neurons late in training. Similarity analysis reveals how word embeddings learned for different tasks are highly similar at the beginning of training, but gradually become task-specific. Using synthetic data and measuring feature interactions, we also discover that hierarchical representations in LSTMs may be a result of their learning strategy: they tend to build new trees out of familiar phrases, by mingling together the meanings of constituents so that they depend on each other. These discoveries constitute just a few possible explanations for how LSTMs learn generalized language representations, with further theories on more architectures to be uncovered by the growing field of NLP training dynamics.
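As a concrete illustration of the kind of sparsity analysis described above (not the thesis's actual code or measure), the sketch below scores how concentrated word representations are on a few neurons using the Hoyer sparsity measure, applied to hypothetical early and late checkpoints:

```python
# A minimal sketch, assuming synthetic checkpoints: score how concentrated each
# word's representation is on a few dimensions via Hoyer sparsity, then compare
# an "early" checkpoint against a "late" one.
import numpy as np

def hoyer_sparsity(x: np.ndarray) -> float:
    """1.0 = all mass on one neuron, 0.0 = mass spread evenly across neurons."""
    n = x.size
    l1, l2 = np.abs(x).sum(), np.linalg.norm(x)
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

# Hypothetical checkpoints: rows are words, columns are neurons/dimensions.
rng = np.random.default_rng(0)
early = rng.normal(size=(1000, 300))             # diffuse representations
late = early * (rng.random((1000, 300)) < 0.05)  # mass on ~5% of neurons
late[late == 0] += 1e-8                          # avoid all-zero rows

print("mean sparsity, early:", np.mean([hoyer_sparsity(w) for w in early]))
print("mean sparsity, late: ", np.mean([hoyer_sparsity(w) for w in late]))
```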
Towards out-of-distribution generalization in large-scale astronomical surveys: robust networks learn similar representations
The generalization of machine learning (ML) models to out-of-distribution
(OOD) examples remains a key challenge in extracting information from upcoming
astronomical surveys. Interpretability approaches are a natural way to gain
insights into the OOD generalization problem. We use Centered Kernel Alignment
(CKA), a similarity metric for neural network representations, to
examine the relationship between representation similarity and performance of
pre-trained Convolutional Neural Networks (CNNs) on the CAMELS Multifield
Dataset. We find that when models are robust to a distribution shift, they
produce substantially different representations across their layers on OOD
data. However, when they fail to generalize, these representations change less
from layer to layer on OOD data. We discuss the potential application of
representation similarity in guiding model design and training strategy, and in
mitigating the OOD problem by incorporating CKA as an inductive bias during
training.
Comment: Accepted to Machine Learning and the Physical Sciences Workshop,
NeurIPS 202
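For reference, linear CKA, the similarity measure named in the abstract, can be computed from two activation matrices as in the sketch below; the data and layer handling here are illustrative and not the paper's pipeline.

```python
# A minimal sketch of linear Centered Kernel Alignment (CKA) between two
# activation matrices of shape (n_examples, n_features).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), features centered."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Usage: compare a layer's activations on in-distribution vs. OOD inputs
# (hypothetical arrays standing in for CNN feature maps on CAMELS data).
rng = np.random.default_rng(0)
acts_id = rng.normal(size=(512, 128))
acts_ood = acts_id + 0.1 * (acts_id @ rng.normal(size=(128, 128)))
print(linear_cka(acts_id, acts_ood))  # close to 1 means similar representations
```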
TRAM: Bridging Trust Regions and Sharpness Aware Minimization
Sharpness-aware minimization (SAM) is reported to improve domain generalization
by reducing loss surface curvature in parameter space. However,
generalization during fine-tuning is often more dependent on the
transferability of representations in the function space. Trust-region methods
(TR) target this goal by regularizing representation curvature to reduce
catastrophic forgetting of pre-trained task-agnostic information while adopting
task-specific skills. We consider unifying these strategies for low curvature
in both parameter space and function space to improve out-of-domain (OOD)
generalization. We propose Trust Region Aware Minimization (TRAM), a SAM
algorithm that fine-tunes for low parameter sharpness and for smooth,
informative representations that preserve pre-trained structure. TRAM uses a trust region
bound to inform the SAM adversarial neighborhood, introducing an awareness of
function curvature within optimization for flatter minima. We empirically
validate TRAM in vision (cross-dataset adaptation) and text (OOD language
modeling, zero-shot cross-lingual transfer) tasks where robust domain transfer
and representation generality are critical. TRAM outperforms SAM- and TR-based
optimization across all tasks, notably surpassing competing methods for hard
transfer between anticorrelated domains. TRAM establishes a novel standard in
fine-tuning for domain-generalizable models with minimal additional computation
over previous sharpness-aware methods.
Comment: Camera Ready for ICLR 2024 (Accepted as Spotlight). 21 pages, 14
tables, 2 figures
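The abstract builds on sharpness-aware minimization. Below is a minimal sketch of a vanilla SAM update step, not the paper's TRAM implementation (which additionally derives the perturbation neighborhood from a trust-region bound on representation change); `model`, `loss_fn`, `inputs`, `targets`, and `base_opt` are assumed to exist.

```python
# Sketch of one SAM step: ascend to a nearby worst-case point in parameter
# space, compute the gradient there, then apply it at the original weights.
import torch

def sam_step(model, loss_fn, inputs, targets, base_opt, rho=0.05):
    # 1) Ascent: gradient at the current weights, then move to w + eps,
    #    where eps = rho * grad / ||grad||.
    loss_fn(model(inputs), targets).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()

    # 2) Descent: gradient at the perturbed point, applied to the original weights.
    loss_fn(model(inputs), targets).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)      # undo the perturbation
    base_opt.step()            # update with the sharpness-aware gradient
    base_opt.zero_grad()
```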
Pareto Probing: Trading Off Accuracy for Complexity
The question of how to probe contextual word representations for linguistic
structure in a way that is both principled and useful has seen significant
attention recently in the NLP literature. In our contribution to this
discussion, we argue for a probe metric that reflects the fundamental trade-off
between probe complexity and performance: the Pareto hypervolume. To measure
complexity, we present a number of parametric and non-parametric metrics. Our
experiments using Pareto hypervolume as an evaluation metric show that probes
often do not conform to our expectations---e.g., why should the non-contextual
fastText representations encode more morpho-syntactic information than the
contextual BERT representations? These results suggest that common, simplistic
probing tasks, such as part-of-speech labeling and dependency arc labeling, are
inadequate to evaluate the linguistic structure encoded in contextual word
representations. This leads us to propose full dependency parsing as a probing
task. In support of our suggestion that harder probing tasks are necessary, our
experiments with dependency parsing reveal a wide gap in syntactic knowledge
between contextual and non-contextual representations.
Comment: Tiago Pimentel and Naomi Saphra contributed equally to this work.
Camera ready version of EMNLP 2020 publication. Code available in
https://github.com/rycolab/pareto-probin
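As a rough illustration of the evaluation metric the abstract argues for (not the paper's implementation), the sketch below computes the 2D Pareto hypervolume over hypothetical (complexity, accuracy) probe results:

```python
# A minimal sketch: the area dominated by the accuracy-vs-complexity trade-off
# curve of a set of probes, measured against a worst-case reference point.
def pareto_hypervolume(points, ref=(1.0, 0.0)):
    """points: (complexity, accuracy) pairs; complexity is minimized and
    accuracy maximized, both assumed normalized to [0, 1].
    ref: (max complexity, min accuracy) reference corner."""
    # Keep only non-dominated probes: no other probe is both simpler and at
    # least as accurate.
    front = [p for p in points
             if not any(q[0] <= p[0] and q[1] >= p[1] and q != p for q in points)]
    front.sort()  # ascending complexity; accuracy also ascends along the front
    volume, prev_acc = 0.0, ref[1]
    # Sweep from simplest to most complex, adding each probe's extra accuracy slab.
    for comp, acc in front:
        volume += (ref[0] - comp) * (acc - prev_acc)
        prev_acc = acc
    return volume

# Hypothetical probes: (normalized complexity, accuracy).
probes = [(0.1, 0.60), (0.3, 0.75), (0.6, 0.80), (0.9, 0.81), (0.5, 0.70)]
print(pareto_hypervolume(probes))
```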
Latent State Models of Training Dynamics
The impact of randomness on model training is poorly understood. How do
differences in data order and initialization actually manifest in the model,
such that some training runs outperform others or converge faster? Furthermore,
how can we interpret the resulting training dynamics and the phase transitions
that characterize different trajectories? To understand the effect of
randomness on the dynamics and outcomes of neural network training, we train
models multiple times with different random seeds and compute a variety of
metrics throughout training, such as the norm, mean, and variance of the
neural network's weights. We then fit a hidden Markov model (HMM) over the
resulting sequences of metrics. The HMM represents training as a stochastic
process of transitions between latent states, providing an intuitive overview
of significant changes during training. Using our method, we produce a
low-dimensional, discrete representation of training dynamics on grokking
tasks, image classification, and masked language modeling. We use the HMM
representation to study phase transitions and identify latent "detour" states
that slow down convergence.
Comment: Accepted at TMLR 2023. Updated Jan 19, 2024 with erratum
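A minimal sketch of the described pipeline, assuming the hmmlearn library and synthetic metric trajectories in place of real training runs:

```python
# Collect per-checkpoint metrics from several training runs, fit a Gaussian HMM,
# and assign each checkpoint a discrete latent training state.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

rng = np.random.default_rng(0)

# Hypothetical trajectories: for each seed, a (n_checkpoints, n_metrics) array,
# e.g. weight norm, weight mean, and weight variance at each checkpoint.
runs = [rng.normal(size=(200, 3)).cumsum(axis=0) for _ in range(5)]

X = np.concatenate(runs)          # stack all runs into one matrix
lengths = [len(r) for r in runs]  # tell the HMM where each run ends

hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=100, random_state=0)
hmm.fit(X, lengths)

# Decode one run into latent states: consecutive repeats form phases, and
# rarely visited states are candidates for convergence-slowing "detours".
states = hmm.predict(runs[0])
print(states[:20])
```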
Linear Connectivity Reveals Generalization Strategies
It is widely accepted in the mode connectivity literature that when two
neural networks are trained similarly on the same data, they are connected by a
path through parameter space over which test set accuracy is maintained. Under
some circumstances, including transfer learning from pretrained models, these
paths are presumed to be linear. In contrast to existing results, we find that
among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of
finetuned models have large barriers of increasing loss on the linear paths
between them. On each task, we find distinct clusters of models which are
linearly connected on the test loss surface, but are disconnected from models
outside the cluster -- models that occupy separate basins on the surface. By
measuring performance on specially-crafted diagnostic datasets, we find that
these clusters correspond to different generalization strategies: one cluster
behaves like a bag of words model under domain shift, while another cluster
uses syntactic heuristics. Our work demonstrates how the geometry of the loss
surface can guide models towards different heuristic functions.
Comment: Published as a conference paper at ICLR 202
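A minimal sketch of the linear interpolation test described in the abstract, assuming hypothetical helpers `make_model`, `eval_loss`, and `test_loader`:

```python
# Evaluate loss along the straight line between two finetuned models' weights
# and look for a barrier of increased loss between them.
import torch

def loss_along_linear_path(model_a, model_b, make_model, eval_loss, test_loader, steps=11):
    # Assumes both models share an architecture and that all state_dict entries
    # are floating-point tensors of matching shapes.
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        interp = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
        model = make_model()
        model.load_state_dict(interp)
        losses.append(eval_loss(model, test_loader))
    return losses

# A barrier exists if the path's peak loss clearly exceeds both endpoints:
# barrier_height = max(losses) - max(losses[0], losses[-1])
```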