14 research outputs found
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training
Representational spaces learned via language modeling are fundamental to
Natural Language Processing (NLP), yet there is limited understanding of
how and when, during training, various types of linguistic information
emerge and interact. Leveraging a novel information-theoretic probing suite,
which enables direct comparisons not only of task performance but also of
tasks' representational subspaces, we analyze nine tasks covering syntax, semantics
and reasoning, across 2M pre-training steps and five seeds. We identify
critical learning phases across tasks and time, during which subspaces emerge,
share information, and later disentangle to specialize. Across these phases,
syntactic knowledge is acquired rapidly after 0.5% of full training. Continued
performance improvements primarily stem from the acquisition of open-domain
knowledge, while semantics and reasoning tasks benefit from later boosts to
long-range contextualization and higher specialization. Measuring cross-task
similarity further reveals that linguistically related tasks share information
throughout training, and do so more during the critical phase of learning than
before or after. Our findings have implications for model interpretability,
multi-task learning, and learning from limited data.
Comment: Accepted at EMNLP 2023 (Findings)
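The cross-task subspace comparison described in the abstract can be illustrated with a minimal sketch (a hypothetical illustration, not the paper's actual probing suite): treating each linear probe's weight matrix as spanning a subspace of the embedding space and comparing two probes via the principal angles between those subspaces.

```python
import numpy as np

def subspace_similarity(W_a, W_b):
    """Mean cosine of the principal angles between the row spaces of two
    linear-probe weight matrices (1.0 = identical subspaces)."""
    # Orthonormal bases whose columns span each probe's subspace
    Q_a, _ = np.linalg.qr(W_a.T)
    Q_b, _ = np.linalg.qr(W_b.T)
    # Singular values of Q_a^T Q_b are the cosines of the principal angles
    sigma = np.linalg.svd(Q_a.T @ Q_b, compute_uv=False)
    return float(np.mean(sigma))

# Toy probes over a 64-dimensional embedding space (random weights,
# standing in for trained syntax/semantics probes)
rng = np.random.default_rng(0)
W_syntax = rng.normal(size=(8, 64))
W_sem = rng.normal(size=(8, 64))

self_sim = subspace_similarity(W_syntax, W_syntax)   # identical subspaces
cross_sim = subspace_similarity(W_syntax, W_sem)     # unrelated subspaces
```

Under this sketch, linguistically related tasks would show higher mean cosines than unrelated ones, which is the kind of cross-task similarity signal the abstract describes tracking over training.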
Establishing Trustworthiness: Rethinking Tasks and Model Evaluation
Language understanding is a multi-faceted cognitive capability, which the
Natural Language Processing (NLP) community has striven to model
computationally for decades. Traditionally, facets of linguistic intelligence
have been compartmentalized into tasks with specialized model architectures and
corresponding evaluation protocols. With the advent of large language models
(LLMs), the community has witnessed a dramatic shift towards general-purpose,
task-agnostic approaches powered by generative models. As a consequence, the
traditional compartmentalized notion of language tasks is breaking down, posing
increasing challenges for evaluation and analysis. At the same
time, LLMs are being deployed in more real-world scenarios, including
previously unforeseen zero-shot setups, increasing the need for trustworthy and
reliable systems. Therefore, we argue that it is time to rethink what
constitutes tasks and model evaluation in NLP, and pursue a more holistic view
on language, placing trustworthiness at the center. Towards this goal, we
review existing compartmentalized approaches for understanding the origins of a
model's functional capacity, and provide recommendations for more multi-faceted
evaluation protocols.
Comment: Accepted at EMNLP 2023 (Main Conference), camera-ready
Experimental Standards for Deep Learning Research: A Natural Language Processing Perspective
The field of Deep Learning (DL) has undergone explosive growth during the
last decade, with a substantial impact on Natural Language Processing (NLP) as
well. Yet, compared to more established disciplines, a lack of common
experimental standards remains an open challenge to the field at large.
Starting from fundamental scientific principles, we distill ongoing discussions
on experimental standards in NLP into a single, widely-applicable methodology.
Following these best practices is crucial to strengthen experimental evidence,
improve reproducibility, and support scientific progress. These standards are
further collected in a public repository so that they can transparently adapt
to future needs.
Probing for Labeled Dependency Trees
Probing has become an important tool for analyzing representations in Natural
Language Processing (NLP). For graphical NLP tasks such as dependency parsing,
linear probes are currently limited to extracting undirected or unlabeled parse
trees, which do not capture the full task. This work introduces DepProbe, a
linear probe which can extract labeled and directed dependency parse trees from
embeddings while using fewer parameters and compute than prior methods.
Leveraging its full task coverage and lightweight parametrization, we
investigate its predictive power for selecting the best transfer language for
training a full biaffine attention parser. Across 13 languages, our proposed
method identifies the best source treebank 94% of the time, outperforming
competitive baselines and prior work. Finally, we analyze the informativeness
of task-specific subspaces in contextual embeddings, as well as the benefits a
full parser's non-linear parametrization provides.
Comment: Accepted at ACL 2022 (Main Conference)
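The core idea of a linear probe for labeled, directed dependencies can be sketched roughly as follows (a hypothetical toy illustration with random weights, not the actual DepProbe implementation or its decoding algorithm): one linear map transforms embeddings into a space where distances encode attachment, and a second linear map predicts the relation label for each word.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_LABELS = 16, 3  # toy embedding size and relation-label inventory

# Hypothetical "learned" probe parameters (random here, for illustration)
B = rng.normal(size=(8, DIM))          # structural map: distances -> attachments
L = rng.normal(size=(N_LABELS, DIM))   # relational map: embedding -> label

def probe(embeddings):
    """Toy decoder: attach each word to its nearest neighbor in the
    transformed space (DepProbe decodes a proper spanning tree instead),
    and label each word via argmax over the relational map."""
    proj = embeddings @ B.T
    n = len(embeddings)
    heads, labels = [], []
    for i in range(n):
        dists = [np.linalg.norm(proj[i] - proj[j]) if j != i else np.inf
                 for j in range(n)]
        heads.append(int(np.argmin(dists)))              # greedy head choice
        labels.append(int(np.argmax(L @ embeddings[i])))  # relation label
    return heads, labels

sent = rng.normal(size=(5, DIM))  # five "word" embeddings
heads, labels = probe(sent)
```

The point of the sketch is the parameter count: two small linear maps suffice to produce directed, labeled structure, which is what makes such a probe far cheaper than a full biaffine parser.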
Sort by Structure: Language Model Ranking as Dependency Probing
Making an informed choice of pre-trained language model (LM) is critical for
performance, yet environmentally costly, and as such widely underexplored. The
field of Computer Vision has begun to tackle encoder ranking, with promising
forays into Natural Language Processing; however, these efforts lack coverage of
linguistic tasks such as structured prediction. We propose probing to rank LMs,
specifically for parsing dependencies in a given language, by measuring the
degree to which labeled trees are recoverable from an LM's contextualized
embeddings. Across 46 typologically and architecturally diverse LM-language
pairs, our probing approach predicts the best LM choice 79% of the time using
orders of magnitude less compute than training a full parser. Within this
study, we identify and analyze one recently proposed decoupled LM, RemBERT,
and find that it contains strikingly little inherent dependency information,
yet often yields the best parser after full fine-tuning. Without this outlier,
our approach identifies the best LM in 89% of cases.
Comment: Accepted at NAACL 2022 (Main Conference)