Computational investigations of derivational morphology
The notion that it is difficult to make predictions about derivational morphology has been a recurring theme in morphological research over the last decades. It can be unclear whether a derivative exists at all, what a derivative means exactly, and which affix is used to form a derivative. The central goal of this thesis is to demonstrate that recent progress in natural language processing (NLP) allows for a fresh view on the (un-)predictability of derivational morphology.
Prior research in morphology has recognized semantic and extralinguistic factors as two key challenges for successfully predicting derivational morphology. The first set of papers contained in the thesis leverages novel methods from NLP and applies them to large-scale, socially stratified datasets. I find that this computational approach results in substantially improved models, demonstrating that derivational morphology is predictable to a larger extent than previously thought.
A side result of the first part of the thesis is that tokenization (i.e., the way in which words are segmented) affects the capability of NLP systems to predict derivational morphology, raising the question of whether it deteriorates performance on a larger scale. The second set of papers contained in the thesis shows that this is indeed the case. As a remedy, I devise tokenization strategies that are directly informed by morphology, with beneficial effects on performance.
On a wider scale, the results of this thesis suggest that NLP and deep learning more generally can greatly benefit linguistic research, a view that is still contested by many scholars in linguistics. At the same time, the thesis shows that even, or perhaps especially, in the age of large language models, linguistic insights continue to be relevant for the development of human language technology.
DagoBERT: Generating Derivational Morphology with a Pretrained Language Model
Can pretrained language models (PLMs) generate derivationally complex words?
We present the first study investigating this question, taking BERT as the
example PLM. We examine BERT's derivational capabilities in different settings,
ranging from using the unmodified pretrained model to full finetuning. Our best
model, DagoBERT (Derivationally and generatively optimized BERT), clearly
outperforms the previous state of the art in derivation generation (DG).
Furthermore, our experiments show that the input segmentation crucially impacts
BERT's derivational knowledge, suggesting that the performance of PLMs could be
further improved if a morphologically informed vocabulary of units were used.
Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words
How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically informed vocabulary of input tokens were used.
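To make the idea of a derivational input segmentation concrete, the following is a minimal sketch and not the segmentation procedure used for DelBERT. The affix lists, the stem set, and the function name `derivational_segment` are all hypothetical and chosen only for illustration; a real system would need a proper morphological lexicon.

```python
# Toy affix inventories (assumptions for illustration, not from the paper).
PREFIXES = ("super", "un", "anti")
SUFFIXES = ("able", "ness", "ize")


def derivational_segment(word, stems):
    """Split `word` into prefix + stem + suffix if the stem is known.

    Tries each known prefix before the empty prefix and each known suffix
    before the empty suffix, and returns the first analysis whose stem is
    in `stems`. Returns None if no analysis is found.
    """
    for prefix in PREFIXES + ("",):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES + ("",):
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)]
            if stem in stems:
                # Drop empty parts so simple words come back as one unit.
                return [part for part in (prefix, stem, suffix) if part]
    return None


stems = {"bizarre", "superb", "read"}
print(derivational_segment("superbizarre", stems))
print(derivational_segment("unreadable", stems))
```

With this toy lexicon, "superbizarre" is analysed as the prefix "super" plus the stem "bizarre" rather than being matched against the unrelated stored word "superb", which is the kind of meaningful split the abstract argues should help generalization.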
An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers
We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs). FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization. We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs. FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.
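The abstract does not spell out the procedure, so the following is only a rough sketch of what a "few longest tokens" approximation could look like: repeatedly pick the longest substring of the word that is in the vocabulary, at most `k` times, and return the picked pieces in their original order. The function name `flota_sketch`, the toy vocabulary, and the default `k=3` are assumptions, and tokenizer-specific details such as continuation markers are deliberately left out.

```python
def flota_sketch(word, vocab, k=3):
    """Greedy 'few longest tokens' segmentation (illustrative sketch only).

    Up to `k` times, finds the longest still-unused substring of `word`
    that occurs in `vocab` (leftmost on ties) and marks it as used.
    Characters never covered by a picked token are simply dropped.
    """
    remaining = list(word)  # consumed positions are overwritten with None
    found = []              # (start index, token) pairs
    n = len(remaining)

    for _ in range(k):
        best = None
        for length in range(n, 0, -1):          # longest candidates first
            for start in range(0, n - length + 1):
                chunk = remaining[start:start + length]
                if None in chunk:               # overlaps an earlier pick
                    continue
                candidate = "".join(chunk)
                if candidate in vocab:
                    best = (start, candidate)
                    break
            if best is not None:
                break
        if best is None:                        # nothing left to match
            break
        start, candidate = best
        found.append(best)
        remaining[start:start + len(candidate)] = [None] * len(candidate)

    # Restore left-to-right order of the picked tokens.
    return [token for _, token in sorted(found)]


vocab = {"un", "desir", "able", "u", "n", "d"}
print(flota_sketch("undesirable", vocab))        # k = 3
print(flota_sketch("undesirable", vocab, k=2))   # fewer tokens kept
```

Lowering `k` keeps only the longest pieces, which is one way such a method could shorten input sequences while still covering the most informative parts of a word.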
Improving Tokenisation by Alternative Treatment of Spaces
Tokenisation is the first step in almost all NLP tasks, and state-of-the-art
transformer-based language models all use subword tokenisation algorithms to
process input text. Existing algorithms have problems, often producing
tokenisations of limited linguistic validity, and representing equivalent
strings differently depending on their position within a word. We hypothesise
that these problems hinder the ability of transformer-based models to handle
complex words, and suggest that these problems are a result of allowing tokens
to include spaces. We thus experiment with an alternative tokenisation approach
where spaces are always treated as individual tokens. Specifically, we apply
this modification to the BPE and Unigram algorithms. We find that our modified
algorithms lead to improved performance on downstream NLP tasks that involve
handling complex words, whilst having no detrimental effect on performance in
general natural language understanding tasks. Intrinsically, we find our
modified algorithms give more morphologically correct tokenisations, in
particular when handling prefixes. Given the results of our experiments, we
advocate for always treating spaces as individual tokens as an improved
tokenisation method.
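The difference between attaching spaces to tokens and treating them as individual tokens can be illustrated at the pre-tokenisation level. This is a minimal sketch under simplifying assumptions: it only splits on the ASCII space character and does not implement the modified BPE or Unigram algorithms from the paper. Both function names are hypothetical.

```python
import re


def split_spaces_attached(text):
    """Conventional style: a single leading space is glued to the next chunk."""
    return re.findall(r" ?[^ ]+| ", text)


def split_spaces_separate(text):
    """Alternative style: every space is emitted as its own token."""
    return re.findall(r" |[^ ]+", text)


text = "the cat saw the dog"
print(split_spaces_attached(text))
print(split_spaces_separate(text))
```

In the attached variant the first "the" and the second " the" are different strings, so a subword vocabulary built on top of them may represent the same word in two ways depending on its position. In the separate variant both occurrences are the identical chunk "the", and joining the tokens still reproduces the input exactly.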
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Large language models (LLMs) have recently reached an impressive level of
linguistic capability, prompting comparisons with human language skills.
However, there have been relatively few systematic inquiries into the
linguistic capabilities of the latest generation of LLMs, and those studies
that do exist (i) ignore the remarkable ability of humans to generalize, (ii)
focus only on English, and (iii) investigate syntax or semantics and overlook
other capabilities that lie at the heart of human language, like morphology.
Here, we close these gaps by conducting the first rigorous analysis of the
morphological capabilities of ChatGPT in four typologically varied languages
(specifically, English, German, Tamil, and Turkish). We apply a version of
Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for
the four examined languages. We find that ChatGPT massively underperforms
purpose-built systems, particularly in English. Overall, our results -- through
the lens of morphology -- cast a new light on the linguistic capabilities of
ChatGPT, suggesting that claims of human-like language skills are premature and
misleading.
Comment: EMNLP 202
Biomedical Language Models are Robust to Sub-optimal Tokenization
As opposed to general English, many concepts in biomedical terminology have
been designed in recent history by biomedical professionals with the goal of
being precise and concise. This is often achieved by concatenating meaningful
biomedical morphemes to create new semantic units. Nevertheless, most modern
biomedical language models (LMs) are pre-trained using standard domain-specific
tokenizers derived from large scale biomedical corpus statistics without
explicitly leveraging the agglutinating nature of biomedical language. In this
work, we first find that standard open-domain and biomedical tokenizers are
largely unable to segment biomedical terms into meaningful components.
Therefore, we hypothesize that using a tokenizer which segments biomedical
terminology more accurately would enable biomedical LMs to improve their
performance on downstream biomedical NLP tasks, especially ones which involve
biomedical terms directly such as named entity recognition (NER) and entity
linking. Surprisingly, we find that pre-training a biomedical LM using a more
accurate biomedical tokenizer does not improve the entity representation
quality of a language model as measured by several intrinsic and extrinsic
measures such as masked language modeling prediction (MLM) accuracy as well as
NER and entity linking performance. These quantitative findings, along with a
case study which explores entity representation quality more directly, suggest
that the biomedical pre-training process is quite robust to instances of
sub-optimal tokenization.
Comment: BioNLP @ ACL 202