604 research outputs found

Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks

Author: Hovy Dirk
Pierrehumbert Janet B.
Röttger Paul
Vidgen Bertie
Publication date: 01/01/2022

Labelled data is the foundation of most natural language processing tasks. However, labelling data is difficult and there often are diverse valid beliefs about what the correct data labels should be. So far, dataset creators have acknowledged annotator subjectivity, but rarely actively managed it in the annotation process. This has led to partly-subjective datasets that fail to serve a clear downstream use. To address this issue, we propose two contrasting paradigms for data annotation. The descriptive paradigm encourages annotator subjectivity, whereas the prescriptive paradigm discourages it. Descriptive annotation allows for the surveying and modelling of different beliefs, whereas prescriptive annotation enables the training of models that consistently apply one belief. We discuss benefits and challenges in implementing both paradigms, and argue that dataset creators should explicitly aim for one or the other to facilitate the intended use of their dataset. Lastly, we conduct an annotation experiment using hate speech data that illustrates the contrast between the two paradigms. Comment: Accepted at NAACL 2022 (Main Conference)

arXiv.org e-Print Archive

Archivio istituzionale della Ricerca - Bocconi

Time Machine GPT

Author: Drinkall Felix
Pierrehumbert Janet B.
Rahimikia Eghbal
Zohren Stefan
Publication date: 29/04/2024

Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets.

The University of Manchester - Institutional Repository

Disorganized attachment in adolescence: Emotional and physiological dysregulation during the Friends and Family Interview and a conflict interaction.

Author: Decarli A.
Pierrehumbert B.
Schaan V.K.
Schulz A.
Vögele C.
Publication date: 01/02/2022

The current study examined the effects of attachment on autonomy, relatedness, and emotion regulation during an attachment interview (Friends and Family Interview; FFI) and a Parent×Child Conflict interaction (Family Interaction Task; FIT) in 49 adolescents (11 to 17 years old). Disorganized adolescents displayed behaviors promoting autonomy and relatedness less frequently and to a lesser extent than organized ones in the FIT with mothers but not with fathers. Disorganized adolescents also showed a steeper decrease in heart rate variability (HRV) than organized ones, during both the FFI and the FITs. Moreover, disorganized adolescents responded with a more marked increase in skin conductance level to the FIT with mothers than organized individuals. Dismissing adolescents showed behaviors promoting autonomy and relatedness less frequently and to a lesser extent than secure ones, while more often displaying behaviors undermining autonomy and relatedness in the FITs. Dismissing adolescents also showed a more pronounced decrease in HRV during the FFI than secure and preoccupied individuals; no differences were found between these groups in HRV during the FITs. The results suggest that disorganized adolescents had more difficulties in regulating their emotions during both the FFI and the FITs, whereas dismissing individuals seemed effectively challenged only during the interview.

Serveur académique lausannois

Indication of insensitivity of planetary weathering behavior and habitable zone to surface land fraction

Author: Anglada-Escudé
Benneke
Berner
Borucki
Cowan
Dorian S. Abbot
Drever
Forget
Fred J. Ciesla
Fu
Hu
Kaltenegger
Kawahara
Kirschvink
Kite
Kuchner
Meadows
Meybeck
Nicolas B. Cowan
Palle
Peixoto
Peters
Pierrehumbert
Raymond
Vogt
Wordsworth
Publication venue: IOP Publishing
Publication date: 08/08/2012

It is likely that unambiguous habitable zone terrestrial planets of unknown water content will soon be discovered. Water content helps determine surface land fraction, which influences planetary weathering behavior. This is important because the silicate weathering feedback determines the width of the habitable zone in space and time. Here a low-order model of weathering and climate, useful for gaining qualitative understanding, is developed to examine climate evolution for planets of various land-ocean fractions. It is pointed out that, if seafloor weathering does not depend directly on surface temperature, there can be no weathering-climate feedback on a waterworld. This would dramatically narrow the habitable zone of a waterworld. Results from our model indicate that weathering behavior does not depend strongly on land fraction for partially ocean-covered planets. This is powerful because it suggests that previous habitable zone theory is robust to changes in land fraction, as long as there is some land. Finally, a mechanism is proposed for a waterworld to prevent complete water loss during a moist greenhouse through rapid weathering of exposed continents. This process is named a "waterworld self-arrest," and it implies that waterworlds can go through a moist greenhouse stage and end up as planets like Earth with partial ocean coverage. This work stresses the importance of surface and geologic effects, in addition to the usual incident stellar flux, for habitability. Comment: 15 pages, 6 figures, accepted at Ap

arXiv.org e-Print Archive

Crossref

An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers

Author: Hofmann Valentin
Muresan Smaranda
Nakov Preslav
Pierrehumbert Janet B.
Schütze Hinrich
Villavicencio Aline
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/05/2022

We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs). FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization. We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs. FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.
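
The abstract gives no algorithmic detail, so the following is only a minimal sketch of one plausible reading of the name "Few Longest Token Approximation": repeatedly take the longest substring of a word that is in the tokenizer vocabulary, at most k times, and return the pieces in word order. It is not the authors' implementation. The function name `flota_sketch`, the parameter `k`, and the toy vocabulary are hypothetical, and tokenizer-specific details such as WordPiece continuation prefixes are ignored.

```python
def flota_sketch(word, vocab, k=3):
    """Keep up to k longest in-vocabulary substrings of `word` (illustrative only)."""
    chars = list(word)   # characters already covered are replaced by None
    found = []           # (start_index, token) pairs
    for _ in range(k):
        best = None
        n = len(chars)
        # Try longer substrings first, scanning left to right within each length.
        for length in range(n, 0, -1):
            for start in range(0, n - length + 1):
                piece = chars[start:start + length]
                if None in piece:
                    continue  # overlaps a span that was already taken
                candidate = "".join(piece)
                if candidate in vocab:
                    best = (start, candidate)
                    break
            if best is not None:
                break
        if best is None:
            break  # nothing left that the vocabulary covers
        start, token = best
        found.append(best)
        chars[start:start + len(token)] = [None] * len(token)
    # Restore word order before returning the tokens.
    return [token for _, token in sorted(found)]


toy_vocab = {"super", "superb", "bizarre", "iz", "arre"}  # assumed toy vocabulary
print(flota_sketch("superbizarre", toy_vocab))  # ['super', 'bizarre']
```

With this toy vocabulary, a left-to-right longest-prefix tokenizer would start with "superb"; the sketch instead takes "bizarre" first because it is the longest match anywhere in the word, which leaves "super" intact.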

Open Access LMU

Increased insolation threshold for runaway greenhouse processes on Earth-like planets

Author: A Arking
A Borysow
A Bucholtz
Alizée Pottier
AP Ingersoll
B Charnay
B Van Leer
Benjamin Charnay
BJ Soden
C Goldblatt
C Richard
DMW Frierson
DO Gough
E Wolf
F Forget
F Hourdin
F Selsis
Francois Forget
G Le Hir
GC Simpson
GL Mellor
H Le Treut
J Leconte
J Yang
JE Hansen
JF Kasting
Jérémy Leconte
LS Rothman
M Ishiwatari
M Komabayashi
O Boucher
OB Toon
R Wordsworth
RD Wordsworth
RK Kopparapu
Robin Wordsworth
RT Pierrehumbert
S Clough
S Manabe
S Nakajima
SC Sherwood
V Ramanathan
Y Abe
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

Because the solar luminosity increases over geological timescales, Earth climate is expected to warm, increasing water evaporation which, in turn, enhances the atmospheric greenhouse effect. Above a certain critical insolation, this destabilizing greenhouse feedback can "runaway" until all the oceans are evaporated. Through increases in stratospheric humidity, warming may also cause oceans to escape to space before the runaway greenhouse occurs. The critical insolation thresholds for these processes, however, remain uncertain because they have so far been evaluated with unidimensional models that cannot account for the dynamical and cloud feedback effects that are key stabilizing features of Earth's climate. Here we use a 3D global climate model to show that the threshold for the runaway greenhouse is about 375 W/m^2, significantly higher than previously thought. Our model is specifically developed to quantify the climate response of Earth-like planets to increased insolation in hot and extremely moist atmospheres. In contrast with previous studies, we find that clouds have a destabilizing feedback on the long term warming. However, subsident, unsaturated regions created by the Hadley circulation have a stabilizing effect that is strong enough to defer the runaway greenhouse limit to higher insolation than inferred from 1D models. Furthermore, because of wavelength-dependent radiative effects, the stratosphere remains cold and dry enough to hamper atmospheric water escape, even at large fluxes. This has strong implications for Venus early water history and extends the size of the habitable zone around other stars. Comment: Published in Nature. Online publication date: December 12, 2013. Accepted version before journal editing and with Supplementary Information.

arXiv.org e-Print Archive

Crossref

HAL-INSU

HAL-Polytechnique

HAL-Ecole des Ponts ParisTech

Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

Author: Hofmann Valentin
Li Wenjie
Navigli Roberto
Pierrehumbert Janet B.
Schütze Hinrich
Xia Fei
Zong Chengqing
Publication date: 01/08/2021

How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.

Open Access LMU

Dynamic Contextualized Word Embeddings

Author: Hofmann Valentin
Li Wenjie
Navigli Roberto
Pierrehumbert Janet B.
Schütze Hinrich
Xia Fei
Zong Chengqing
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/08/2021

Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets.

Open Access LMU