Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Large language models (LLMs) have recently reached an impressive level of
linguistic capability, prompting comparisons with human language skills.
However, there have been relatively few systematic inquiries into the
linguistic capabilities of the latest generation of LLMs, and those studies
that do exist (i) ignore the remarkable ability of humans to generalize, (ii)
focus only on English, and (iii) investigate syntax or semantics and overlook
other capabilities that lie at the heart of human language, like morphology.
Here, we close these gaps by conducting the first rigorous analysis of the
morphological capabilities of ChatGPT in four typologically varied languages
(specifically, English, German, Tamil, and Turkish). We apply a version of
Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for
the four examined languages. We find that ChatGPT massively underperforms
purpose-built systems, particularly in English. Overall, our results -- through
the lens of morphology -- cast a new light on the linguistic capabilities of
ChatGPT, suggesting that claims of human-like language skills are premature and
misleading.
Comment: EMNLP 202
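The wug test described above probes whether a system can inflect nonce words it cannot have memorized. As a minimal sketch of what such an evaluation looks like, the snippet below builds a few nonce-word items and scores a predictor by exact match against the regular English plural rule. The item list, the `regular_english_plural` baseline, and the `score` helper are all illustrative assumptions, not the paper's actual datasets or evaluation code.

```python
# Hedged sketch of a wug-style evaluation: nonce lemmas, gold forms from the
# regular English pluralization rule, and an exact-match scorer.
# Items and helper names are illustrative, not taken from the paper.

WUG_ITEMS = [
    # (nonce lemma, morphological feature, gold form under the regular rule)
    ("wug", "PL", "wugs"),
    ("gutch", "PL", "gutches"),
    ("kazh", "PL", "kazhes"),
]

def regular_english_plural(lemma: str) -> str:
    """Toy baseline predictor: apply the regular English plural rule."""
    if lemma.endswith(("s", "sh", "ch", "x", "z", "zh")):
        return lemma + "es"
    return lemma + "s"

def score(predict, items) -> float:
    """Exact-match accuracy of a predictor over wug-test items."""
    correct = sum(predict(lemma) == gold for lemma, _, gold in items)
    return correct / len(items)
```

In an actual study, `predict` would wrap a call to the model under test (e.g., prompting ChatGPT for the plural of each nonce lemma) rather than a hand-written rule; the scorer stays the same.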
Understanding compositional data augmentation in automatic morphological inflection
Data augmentation techniques are widely used in low-resource automatic
morphological inflection to address the issue of data sparsity. However, the
full implications of these techniques remain poorly understood. In this study,
we aim to shed light on the theoretical aspects of the data augmentation
strategy StemCorrupt, a method that generates synthetic examples by randomly
substituting stem characters in existing gold standard training examples. Our
analysis uncovers that StemCorrupt brings about fundamental changes in the
underlying data distribution, revealing inherent compositional concatenative
structure. To complement our theoretical analysis, we investigate the
data-efficiency of StemCorrupt. Through evaluation across a diverse set of
seven typologically distinct languages, we demonstrate that selecting a subset
of datapoints with both high diversity and high predictive uncertainty
significantly enhances the data-efficiency of StemCorrupt compared to
competitive baselines. Furthermore, we explore the impact of typological
features on the choice of augmentation strategy and find that languages
incorporating non-concatenativity, such as morphonological alternations, derive
less benefit from synthetic examples with high predictive uncertainty. We
attribute this effect to phonotactic violations induced by StemCorrupt,
emphasizing the need for further research to ensure optimal performance across
the entire spectrum of natural language morphology.
Comment: 13 pages, 7 figure