Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
Labelled data is the foundation of most natural language processing tasks.
However, labelling data is difficult and there are often diverse, valid beliefs
about what the correct data labels should be. So far, dataset creators have
acknowledged annotator subjectivity, but rarely actively managed it in the
annotation process. This has led to partly-subjective datasets that fail to
serve a clear downstream use. To address this issue, we propose two contrasting
paradigms for data annotation. The descriptive paradigm encourages annotator
subjectivity, whereas the prescriptive paradigm discourages it. Descriptive
annotation allows for the surveying and modelling of different beliefs, whereas
prescriptive annotation enables the training of models that consistently apply
one belief. We discuss benefits and challenges in implementing both paradigms,
and argue that dataset creators should explicitly aim for one or the other to
facilitate the intended use of their dataset. Lastly, we conduct an annotation
experiment using hate speech data that illustrates the contrast between the two
paradigms.
Comment: Accepted at NAACL 2022 (Main Conference)
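To make the contrast concrete, here is a minimal Python sketch of how the two paradigms differ as label-aggregation strategies; the label names and aggregation rules are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def aggregate_prescriptive(labels):
    """Prescriptive paradigm: collapse annotations into one majority label,
    treating disagreement as noise around a guideline-defined ground truth."""
    return Counter(labels).most_common(1)[0][0]

def aggregate_descriptive(labels):
    """Descriptive paradigm: keep the full label distribution as the target,
    treating disagreement as signal about diverging annotator beliefs."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

annotations = ["hate", "hate", "not_hate"]  # hypothetical annotator labels
print(aggregate_prescriptive(annotations))  # hate
print(aggregate_descriptive(annotations))   # {'hate': 0.66..., 'not_hate': 0.33...}
```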
Time Machine GPT
Large language models (LLMs) are often trained on extensive, temporally indiscriminate text corpora, reflecting the lack of datasets with temporal metadata. This approach is not aligned with the evolving nature of language. Conventional methods for creating temporally adapted language models often depend on further pre-training static models on time-specific data. This paper presents a new approach: a series of point-in-time LLMs called Time Machine GPT (TiMaGPT), specifically designed to be nonprognosticative. This ensures they remain uninformed about future factual information and linguistic changes. This strategy is beneficial for understanding language evolution and is of critical importance when applying models in dynamic contexts, such as time-series forecasting, where foresight of future information can prove problematic. We provide access to both the models and training datasets.
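As a rough illustration of the point-in-time idea, the sketch below filters a corpus by a timestamp cutoff so that each snapshot excludes anything written later; the document schema and function name are assumptions, not TiMaGPT's actual pipeline.

```python
from datetime import date

def point_in_time_corpus(documents, cutoff):
    """Keep only documents dated strictly before `cutoff`, so a model trained
    on the result stays uninformed about later facts and usage."""
    return [doc for doc in documents if doc["date"] < cutoff]

docs = [
    {"date": date(2019, 5, 1), "text": "..."},
    {"date": date(2021, 3, 9), "text": "..."},
]
# One snapshot corpus per cutoff yields a series of point-in-time models.
train_2020 = point_in_time_corpus(docs, date(2020, 1, 1))
print(len(train_2020))  # 1
```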
Using character n-grams to classify native language in a non-native English corpus of transcribed speech
An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers
We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs). FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization. We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs. FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.
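A minimal sketch of the greedy longest-match idea behind FLOTA, assuming a plain set-of-strings vocabulary; it approximates rather than reproduces the paper's exact algorithm, and residue that matches nothing is simply dropped here.

```python
def flota_segment(word, vocab, k=4):
    """Greedy longest-match: return at most k in-vocabulary pieces of `word`.
    Residue that matches nothing in `vocab` is dropped in this sketch."""
    if k <= 0 or not word:
        return []
    for length in range(len(word), 0, -1):            # longest substrings first
        for start in range(len(word) - length + 1):
            piece = word[start:start + length]
            if piece in vocab:
                left = flota_segment(word[:start], vocab, k - 1)
                right = flota_segment(word[start + length:], vocab, k - 1 - len(left))
                return left + [piece] + right
    return []

vocab = {"super", "bizarre", "bi", "zarre"}
print(flota_segment("superbizarre", vocab))  # ['super', 'bizarre']
```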
Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words
How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.
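For comparison with FLOTA above, here is a toy sketch of derivational input segmentation in the spirit of DelBERT, peeling known affixes so the stem survives as a single unit; the affix lists are illustrative stand-ins, not the paper's resources.

```python
# Affix lists are illustrative assumptions, not DelBERT's actual inventory.
PREFIXES = ("super", "over", "un", "re")
SUFFIXES = ("ization", "ness", "able", "ly")

def derivational_segment(word):
    prefixes, suffixes = [], []
    stripped = True
    while stripped:                       # peel known prefixes from the left
        stripped = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
                break
    stripped = True
    while stripped:                       # then peel suffixes from the right
        stripped = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, s)
                word = word[: -len(s)]
                stripped = True
                break
    return prefixes + [word] + suffixes

print(derivational_segment("superbizarre"))  # ['super', 'bizarre']
print(derivational_segment("overfitting"))   # ['over', 'fitting']
```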
Dynamic Contextualized Word Embeddings
Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets.
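One hypothetical way to realize this, sketched below in PyTorch, is to shift a PLM's contextual states by a learned offset computed from a time index and a social-group index; this is one reading of the idea, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DynamicContextualizedEmbedding(nn.Module):
    """Hypothetical sketch: offset a PLM's contextual states by a learned
    function of extralinguistic context (time and social group)."""

    def __init__(self, hidden=768, n_times=12, n_groups=50):
        super().__init__()
        self.time_emb = nn.Embedding(n_times, hidden)
        self.group_emb = nn.Embedding(n_groups, hidden)
        self.offset = nn.Linear(2 * hidden, hidden)

    def forward(self, plm_hidden, time_id, group_id):
        # plm_hidden: (batch, seq, hidden) states from a pretrained LM
        extra = torch.cat([self.time_emb(time_id), self.group_emb(group_id)], dim=-1)
        return plm_hidden + self.offset(extra).unsqueeze(1)

model = DynamicContextualizedEmbedding()
states = torch.randn(2, 8, 768)                      # stand-in PLM output
out = model(states, torch.tensor([0, 3]), torch.tensor([7, 7]))
print(out.shape)  # torch.Size([2, 8, 768])
```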
Niche as a determinant of word fate in online groups
Patterns of word use both reflect and influence a myriad of human activities
and interactions. Like other entities that are reproduced and evolve, words
rise or decline depending upon a complex interplay between their intrinsic
properties and the environments in which they function. Using Internet
discussion communities as model systems, we define the concept of a word niche
as the relationship between the word and the characteristic features of the
environments in which it is used. We develop a method to quantify two important
aspects of the size of the word niche: the range of individuals using the word
and the range of topics it is used to discuss. Controlling for word frequency,
we show that these aspects of the word niche are strong determinants of changes
in word frequency. Previous studies have already indicated that word frequency
itself is a correlate of word success at historical time scales. Our analysis
of changes in word frequencies over time reveals that the relative sizes of
word niches are far more important than word frequencies in the dynamics of the
entire vocabulary at shorter time scales, as the language adapts to new
concepts and social groupings. We also distinguish endogenous versus exogenous
factors as additional contributors to the fates of words, and demonstrate the
force of this distinction in the rise of novel words. Our results indicate that
short-term nonstationarity in word statistics is strongly driven by individual
proclivities, including inclinations to provide novel information and to
project a distinctive social identity.
Comment: Supporting Information is available here: http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0019009.s00
Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words
Background: Zipf's discovery that word frequency distributions obey a power
law established parallels between language and biological and physical
processes, laying the groundwork for a complex systems perspective on human
communication. More recent research has also identified scaling regularities in
the dynamics underlying the successive occurrences of events, suggesting the
possibility of similar findings for language as well.
Methodology/Principal Findings: By considering frequent words in USENET
discussion groups and in disparate databases where the language has different
levels of formality, here we show that the distributions of distances between
successive occurrences of the same word display bursty deviations from a
Poisson process and are well characterized by a stretched exponential (Weibull)
scaling. The extent of this deviation depends strongly on semantic type -- a
measure of the logicality of each word -- and less strongly on frequency. We
develop a generative model of this behavior that fully determines the dynamics
of word usage.
Conclusions/Significance: Recurrence patterns of words are well described by
a stretched exponential distribution of recurrence times, an empirical scaling
that cannot be anticipated from Zipf's law. Because the use of words provides a
uniquely precise and powerful lens on human thought and activity, our findings
also have implications for other overt manifestations of collective human
dynamics.
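The sketch below illustrates the measurement pipeline on synthetic data: collect the gaps between successive occurrences of a word and fit a Weibull, whose shape parameter equals 1 for a Poisson process and falls below 1 for bursty, stretched-exponential recurrence; the toy corpus and fitting procedure are assumptions for demonstration, not the paper's generative model.

```python
import numpy as np
from scipy import stats

def recurrence_gaps(tokens, word):
    """Token distances between successive occurrences of `word`."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return np.diff(positions)

# Toy corpus so the sketch runs standalone: the word "w" appears in bursts
# separated by long stretches of other material.
rng = np.random.default_rng(0)
tokens = []
for _ in range(400):
    tokens += ["w"] * int(rng.integers(1, 5)) + ["x"] * int(rng.integers(1, 200))

gaps = recurrence_gaps(tokens, "w")
# Fit a Weibull: shape = 1 recovers the memoryless exponential of a Poisson
# process, while shape < 1 indicates bursty recurrence.
shape, _, _ = stats.weibull_min.fit(gaps, floc=0)
print(f"fitted Weibull shape: {shape:.2f}")  # well below 1 for this toy data
```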