Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Large language models (LLMs) have recently reached an impressive level of
linguistic capability, prompting comparisons with human language skills.
However, there have been relatively few systematic inquiries into the
linguistic capabilities of the latest generation of LLMs, and those studies
that do exist (i) ignore the remarkable ability of humans to generalize, (ii)
focus only on English, and (iii) investigate syntax or semantics and overlook
other capabilities that lie at the heart of human language, like morphology.
Here, we close these gaps by conducting the first rigorous analysis of the
morphological capabilities of ChatGPT in four typologically varied languages
(specifically, English, German, Tamil, and Turkish). We apply a version of
Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for
the four examined languages. We find that ChatGPT massively underperforms
purpose-built systems, particularly in English. Overall, our results -- through
the lens of morphology -- cast a new light on the linguistic capabilities of
ChatGPT, suggesting that claims of human-like language skills are premature and
misleading.
Comment: EMNLP 202
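The wug test described above probes whether a system can inflect nonce words it cannot have memorized. As a minimal sketch of what such an evaluation looks like, the snippet below builds a few nonce-word items and scores a predictor by exact match against the regular English plural rule. The item list, the `regular_english_plural` baseline, and the `score` helper are all illustrative assumptions, not the paper's actual datasets or evaluation code.

```python
# Hedged sketch of a wug-style evaluation: nonce lemmas, gold forms from the
# regular English pluralization rule, and an exact-match scorer.
# Items and helper names are illustrative, not taken from the paper.

WUG_ITEMS = [
    # (nonce lemma, morphological feature, gold form under the regular rule)
    ("wug", "PL", "wugs"),
    ("gutch", "PL", "gutches"),
    ("kazh", "PL", "kazhes"),
]

def regular_english_plural(lemma: str) -> str:
    """Toy baseline predictor: apply the regular English plural rule."""
    if lemma.endswith(("s", "sh", "ch", "x", "z", "zh")):
        return lemma + "es"
    return lemma + "s"

def score(predict, items) -> float:
    """Exact-match accuracy of a predictor over wug-test items."""
    correct = sum(predict(lemma) == gold for lemma, _, gold in items)
    return correct / len(items)
```

In an actual study, `predict` would wrap a call to the model under test (e.g., prompting ChatGPT for the plural of each nonce lemma) rather than a hand-written rule; the scorer stays the same.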
Understanding compositional data augmentation in automatic morphological inflection
Data augmentation techniques are widely used in low-resource automatic
morphological inflection to address the issue of data sparsity. However, the
full implications of these techniques remain poorly understood. In this study,
we aim to shed light on the theoretical aspects of the data augmentation
strategy StemCorrupt, a method that generates synthetic examples by randomly
substituting stem characters in existing gold standard training examples. Our
analysis uncovers that StemCorrupt brings about fundamental changes in the
underlying data distribution, revealing inherent compositional concatenative
structure. To complement our theoretical analysis, we investigate the
data-efficiency of StemCorrupt. Through evaluation across a diverse set of
seven typologically distinct languages, we demonstrate that selecting a subset
of datapoints with both high diversity and high predictive uncertainty
significantly enhances the data-efficiency of StemCorrupt compared to
competitive baselines. Furthermore, we explore the impact of typological
features on the choice of augmentation strategy and find that languages
incorporating non-concatenativity, such as morphonological alternations, derive
less benefit from synthetic examples with high predictive uncertainty. We
attribute this effect to phonotactic violations induced by StemCorrupt,
emphasizing the need for further research to ensure optimal performance across
the entire spectrum of natural language morphology.
Comment: 13 pages, 7 figure