Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
Labelled data is the foundation of most natural language processing tasks.
However, labelling data is difficult and there are often diverse, valid beliefs
about what the correct data labels should be. So far, dataset creators have
acknowledged annotator subjectivity, but rarely actively managed it in the
annotation process. This has led to partly-subjective datasets that fail to
serve a clear downstream use. To address this issue, we propose two contrasting
paradigms for data annotation. The descriptive paradigm encourages annotator
subjectivity, whereas the prescriptive paradigm discourages it. Descriptive
annotation allows for the surveying and modelling of different beliefs, whereas
prescriptive annotation enables the training of models that consistently apply
one belief. We discuss benefits and challenges in implementing both paradigms,
and argue that dataset creators should explicitly aim for one or the other to
facilitate the intended use of their dataset. Lastly, we conduct an annotation
experiment using hate speech data that illustrates the contrast between the two
paradigms.
Comment: Accepted at NAACL 2022 (Main Conference)
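To make the contrast concrete, here is a minimal Python sketch of how the two paradigms differ as label-aggregation strategies; the label names and aggregation rules are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def aggregate_prescriptive(labels):
    """Prescriptive paradigm: collapse annotations into one majority label,
    treating disagreement as noise around a guideline-defined ground truth."""
    return Counter(labels).most_common(1)[0][0]

def aggregate_descriptive(labels):
    """Descriptive paradigm: keep the full label distribution as the target,
    treating disagreement as signal about diverging annotator beliefs."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

annotations = ["hate", "hate", "not_hate"]  # hypothetical annotator labels
print(aggregate_prescriptive(annotations))  # hate
print(aggregate_descriptive(annotations))   # {'hate': 0.66..., 'not_hate': 0.33...}
```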
Time Machine GPT
Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets.
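As a rough illustration of the point-in-time idea, the sketch below filters a corpus by a timestamp cutoff so that each snapshot excludes anything written later; the document schema and function name are assumptions, not TiMaGPT's actual pipeline.

```python
from datetime import date

def point_in_time_corpus(documents, cutoff):
    """Keep only documents dated strictly before `cutoff`, so a model trained
    on the result stays uninformed about later facts and usage."""
    return [doc for doc in documents if doc["date"] < cutoff]

docs = [
    {"date": date(2019, 5, 1), "text": "..."},
    {"date": date(2021, 3, 9), "text": "..."},
]
# One snapshot corpus per cutoff yields a series of point-in-time models.
train_2020 = point_in_time_corpus(docs, date(2020, 1, 1))
print(len(train_2020))  # 1
```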
Using character n-grams to classify native language in a non-native English corpus of transcribed speech
An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers
We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs). FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization. We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs. FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.
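A minimal sketch of the greedy longest-match idea behind FLOTA, assuming a plain set-of-strings vocabulary; it approximates rather than reproduces the paper's exact algorithm, and residue that matches nothing is simply dropped here.

```python
def flota_segment(word, vocab, k=4):
    """Greedy longest-match: return at most k in-vocabulary pieces of `word`.
    Residue that matches nothing in `vocab` is dropped in this sketch."""
    if k <= 0 or not word:
        return []
    for length in range(len(word), 0, -1):            # longest substrings first
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in vocab:
                left = flota_segment(word[:start], vocab, k - 1)
                right = flota_segment(word[start + length:], vocab, k - 1 - len(left))
                return left + [piece] + right
    return []

vocab = {"super", "bizarre", "bi", "zarre"}
print(flota_segment("superbizarre", vocab))  # ['super', 'bizarre']
```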
Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words
How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.
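For comparison with FLOTA above, here is a toy sketch of derivational input segmentation in the spirit of DelBERT, peeling known affixes so the stem survives as a single unit; the affix lists are illustrative stand-ins, not the paper's resources.

```python
# Affix lists are illustrative assumptions, not DelBERT's actual inventory.
PREFIXES = ("super", "over", "un", "re")
SUFFIXES = ("ization", "ness", "able", "ly")

def derivational_segment(word):
    prefixes, suffixes = [], []
    stripped = True
    while stripped:                       # peel known prefixes from the left
        stripped = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
                break
    stripped = True
    while stripped:                       # then peel suffixes from the right
        stripped = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, s)
                word = word[: -len(s)]
                stripped = True
                break
    return prefixes + [word] + suffixes

print(derivational_segment("superbizarre"))  # ['super', 'bizarre']
print(derivational_segment("overfitting"))   # ['over', 'fitting']
```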
Dynamic Contextualized Word Embeddings
Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets.
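One hypothetical way to realize this, sketched below in PyTorch, is to shift a PLM's contextual states by a learned offset computed from a time index and a social-group index; this is one reading of the idea, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DynamicContextualizedEmbedding(nn.Module):
    """Hypothetical sketch: offset a PLM's contextual states by a learned
    function of extralinguistic context (time and social group)."""

    def __init__(self, hidden=768, n_times=12, n_groups=50):
        super().__init__()
        self.time_emb = nn.Embedding(n_times, hidden)
        self.group_emb = nn.Embedding(n_groups, hidden)
        self.offset = nn.Linear(2 * hidden, hidden)

    def forward(self, plm_hidden, time_id, group_id):
        # plm_hidden: (batch, seq, hidden) states from a pretrained LM
        extra = torch.cat([self.time_emb(time_id), self.group_emb(group_id)], dim=-1)
        return plm_hidden + self.offset(extra).unsqueeze(1)

model = DynamicContextualizedEmbedding()
states = torch.randn(2, 8, 768)                      # stand-in PLM output
out = model(states, torch.tensor([0, 3]), torch.tensor([7, 7]))
print(out.shape)  # torch.Size([2, 8, 768])
```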
Niche as a determinant of word fate in online groups
Patterns of word use both reflect and influence a myriad of human activities
and interactions. Like other entities that are reproduced and evolve, words
rise or decline depending upon a complex interplay between their intrinsic
properties and the environments in which they function. Using Internet
discussion communities as model systems, we define the concept of a word niche
as the relationship between the word and the characteristic features of the
environments in which it is used. We develop a method to quantify two important
aspects of the size of the word niche: the range of individuals using the word
and the range of topics it is used to discuss. Controlling for word frequency,
we show that these aspects of the word niche are strong determinants of changes
in word frequency. Previous studies have already indicated that word frequency
itself is a correlate of word success at historical time scales. Our analysis
of changes in word frequencies over time reveals that the relative sizes of
word niches are far more important than word frequencies in the dynamics of the
entire vocabulary at shorter time scales, as the language adapts to new
concepts and social groupings. We also distinguish endogenous versus exogenous
factors as additional contributors to the fates of words, and demonstrate the
force of this distinction in the rise of novel words. Our results indicate that
short-term nonstationarity in word statistics is strongly driven by individual
proclivities, including inclinations to provide novel information and to
project a distinctive social identity.
Comment: Supporting Information is available here: http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0019009.s00
Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words
Background: Zipf's discovery that word frequency distributions obey a power
law established parallels between language and biological and physical
processes, laying the groundwork for a complex systems perspective on human
communication. More recent research has also identified scaling regularities in
the dynamics underlying the successive occurrences of events, suggesting the
possibility of similar findings for language as well.
Methodology/Principal Findings: By considering frequent words in USENET
discussion groups and in disparate databases where the language has different
levels of formality, here we show that the distributions of distances between
successive occurrences of the same word display bursty deviations from a
Poisson process and are well characterized by a stretched exponential (Weibull)
scaling. The extent of this deviation depends strongly on semantic type -- a
measure of the logicality of each word -- and less strongly on frequency. We
develop a generative model of this behavior that fully determines the dynamics
of word usage.
Conclusions/Significance: Recurrence patterns of words are well described by
a stretched exponential distribution of recurrence times, an empirical scaling
that cannot be anticipated from Zipf's law. Because the use of words provides a
uniquely precise and powerful lens on human thought and activity, our findings
also have implications for other overt manifestations of collective human
dynamics.
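The sketch below illustrates the measurement pipeline on synthetic data: collect the gaps between successive occurrences of a word and fit a Weibull, whose shape parameter equals 1 for a Poisson process and falls below 1 for bursty, stretched-exponential recurrence; the toy corpus and fitting procedure are assumptions for demonstration, not the paper's generative model.

```python
import numpy as np
from scipy import stats

def recurrence_gaps(tokens, word):
    """Token distances between successive occurrences of `word`."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return np.diff(positions)

# Toy corpus so the sketch runs standalone: the word "w" appears in bursts
# separated by long stretches of other material.
rng = np.random.default_rng(0)
tokens = []
for _ in range(400):
    tokens += ["w"] * int(rng.integers(1, 5)) + ["x"] * int(rng.integers(1, 200))

gaps = recurrence_gaps(tokens, "w")
# Fit a Weibull: shape = 1 recovers the memoryless exponential of a Poisson
# process, while shape < 1 indicates bursty recurrence.
shape, _, _ = stats.weibull_min.fit(gaps, floc=0)
print(f"fitted Weibull shape: {shape:.2f}")  # well below 1 for this toy data
```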