19 research outputs found
Adaptor Grammars for Unsupervised Paradigm Clustering
This work describes the Edinburgh submission to the SIGMORPHON 2021 Shared Task 2 on unsupervised morphological paradigm clustering. Given raw text input, the task was to assign each token to a cluster with other tokens from the same paradigm. We use Adaptor Grammar segmentations combined with frequency-based heuristics to predict paradigm clusters. Our system achieved the highest average F1 score across 9 test languages, placing first out of 15 submissions.
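The abstract gives only the outline of the pipeline, so the following is a minimal sketch (in Python, not the Edinburgh system itself) of how segmentations plus a frequency heuristic can yield paradigm clusters: each token is assigned to the paradigm identified by a candidate stem chosen from its segments. The function name cluster_by_stem, the stem-selection rule, and the toy segmentations are illustrative assumptions.

from collections import defaultdict

def cluster_by_stem(segmented_tokens):
    """segmented_tokens maps each token to its list of segments."""
    seg_freq = defaultdict(int)
    for segs in segmented_tokens.values():
        for seg in segs:
            seg_freq[seg] += 1

    clusters = defaultdict(set)
    for token, segs in segmented_tokens.items():
        # Heuristic: take the longest segment (corpus frequency breaks ties)
        # as the stem that identifies the token's paradigm.
        stem = max(segs, key=lambda s: (len(s), seg_freq[s]))
        clusters[stem].add(token)
    return dict(clusters)

# Toy, pre-computed segmentations; a real pipeline would sample them from an
# Adaptor Grammar rather than hard-code them.
toy_segmentations = {
    "walks":   ["walk", "s"],
    "walked":  ["walk", "ed"],
    "walking": ["walk", "ing"],
    "jumps":   ["jump", "s"],
    "jumped":  ["jump", "ed"],
}
print(cluster_by_stem(toy_segmentations))
# -> {'walk': {'walks', 'walked', 'walking'}, 'jump': {'jumps', 'jumped'}}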
The Paradigm Discovery Problem
This work treats the paradigm discovery problem (PDP), the task of learning
an inflectional morphological system from unannotated sentences. We formalize
the PDP and develop evaluation metrics for judging systems. Using currently
available resources, we construct datasets for the task. We also devise a
heuristic benchmark for the PDP and report empirical results on five diverse
languages. Our benchmark system first makes use of word embeddings and string
similarity to cluster forms by cell and by paradigm. Then, we bootstrap a
neural transducer on top of the clustered data to predict words to realize the
empty paradigm slots. An error analysis of our system suggests clustering by
cell across different inflection classes is the most pressing challenge for
future work. Our code and data are available for public use.
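As a rough illustration of the first stage of the benchmark described above (and only that stage), the sketch below greedily groups surface forms into candidate paradigms by string similarity; the embedding-based cell clustering and the neural transducer that fills empty slots are omitted. The similarity threshold and the greedy single-link scheme are illustrative assumptions, not the paper's settings.

from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    return SequenceMatcher(None, a, b).ratio() >= threshold

def greedy_paradigm_clusters(forms, threshold=0.7):
    clusters = []
    for form in forms:
        for cluster in clusters:
            # Join the first cluster containing a sufficiently similar member.
            if any(similar(form, member, threshold) for member in cluster):
                cluster.append(form)
                break
        else:
            clusters.append([form])
    return clusters

forms = ["sing", "sings", "singing", "talk", "talks", "talked"]
print(greedy_paradigm_clusters(forms))
# -> [['sing', 'sings', 'singing'], ['talk', 'talks', 'talked']]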
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Large language models (LLMs) have recently reached an impressive level of
linguistic capability, prompting comparisons with human language skills.
However, there have been relatively few systematic inquiries into the
linguistic capabilities of the latest generation of LLMs, and those studies
that do exist (i) ignore the remarkable ability of humans to generalize, (ii)
focus only on English, and (iii) investigate syntax or semantics and overlook
other capabilities that lie at the heart of human language, like morphology.
Here, we close these gaps by conducting the first rigorous analysis of the
morphological capabilities of ChatGPT in four typologically varied languages
(specifically, English, German, Tamil, and Turkish). We apply a version of
Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for
the four examined languages. We find that ChatGPT massively underperforms
purpose-built systems, particularly in English. Overall, our results -- through
the lens of morphology -- cast a new light on the linguistic capabilities of
ChatGPT, suggesting that claims of human-like language skills are premature and
misleading.
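For readers unfamiliar with the paradigm, the sketch below shows what a single Berko-style wug-test item for English past-tense formation might look like when posed to a language model. The prompt wording, the nonce verbs, and the query_llm placeholder are illustrative assumptions; they are not the prompts, languages, or evaluation protocol used in the paper.

def wug_prompt(nonce_verb):
    # A classic wug-test frame: elicit the past tense of a verb the model
    # cannot have memorised.
    return (f"This is a verb: to {nonce_verb}. "
            f"Yesterday, she did the same thing. Yesterday, she ____. "
            f"Fill in the blank with one word.")

def query_llm(prompt):
    # Placeholder standing in for a call to a chat model's API. It simply
    # applies regular English inflection so the script runs offline.
    verb = prompt.split("to ")[1].split(".")[0]
    return verb + ("d" if verb.endswith("e") else "ed")

for nonce in ["rick", "spling", "gling"]:
    print(nonce, "->", query_llm(wug_prompt(nonce)))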
Integrating Machine Learning Into Language Documentation and Description
At least 40% of the world’s 7000+ languages are believed to be in danger of disappearing from human use by the end of this century. Many languages will disappear with almost no record of their existence because efforts to document and describe these languages are encountering an “annotation bottleneck” at early stages of analysis and annotation. Current annotation methods are too slow and expensive to counteract the pace of language endangerment and loss. Annotation could be sped up and improved by machine learning. However, state-of-the-art supervised machine learning depends heavily on large amounts of annotated data.
This dissertation explores how to train supervised machine learning systems for morphological analysis during language documentation and description. The systems are applied to nine languages. The research investigates ways that linguists and NLP scientists may want to adjust their expectations and workflows so that both can achieve optimal results with data from endangered languages.
New methods for tasks in morphological analysis are explored. First, various approaches to automating morpheme segmentation and glossing are compared. Second, a new task is presented for learning morphological paradigms and automatically generating new morphological resources: IGT-to-paradigms (IGT2P). Third, the impact of POS tags on segmentation, glossing, and paradigm induction is examined, showing that the presence or absence of POS tags does not have a significant bearing on the performance of machine learning systems. The results indicate that Natural Language Processing (NLP) systems could be successfully integrated into the documentary and descriptive workflow. At the same time, the relatively high accuracy achieved from noisy field data with little or no additional human annotation hints that NLP may benefit from limited documentary linguistic data, which may be the only or largest linguistically annotated resource available for some languages.
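As a very small sketch of the idea behind IGT2P (not the dissertation's actual formulation), the snippet below groups glossed tokens from interlinear glossed text by their stem gloss, so that each group becomes a partial paradigm whose empty cells a reinflection model could be trained to fill. The data format and field conventions are illustrative assumptions.

from collections import defaultdict

# Each IGT token is (surface form, gloss); the gloss separates the stem gloss
# from the inflectional tag with a hyphen, as is common in interlinear glossing.
igt_tokens = [
    ("canto",    "sing-1SG.PRS"),
    ("cantamos", "sing-1PL.PRS"),
    ("cantó",    "sing-3SG.PST"),
    ("como",     "eat-1SG.PRS"),
    ("comió",    "eat-3SG.PST"),
]

paradigms = defaultdict(dict)
for form, gloss in igt_tokens:
    stem_gloss, _, cell = gloss.partition("-")
    paradigms[stem_gloss][cell] = form

for stem, cells in paradigms.items():
    print(stem, cells)
# Cells that never occur in the IGT (e.g. eat + 1PL.PRS) are the slots a
# trained reinflection model would be asked to predict.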
Acquisition of Inflectional Morphology in Artificial Neural Networks With Prior Knowledge
How does knowledge of one language’s morphology influence learning of inflection rules in a second one? In order to investigate this question in artificial neural network models, we perform experiments with a sequence-to-sequence architecture, which we train on different combinations of eight source and three target languages. A detailed analysis of the model outputs suggests the following conclusions: (i) if source and target language are closely related, acquisition of the target language’s inflectional morphology constitutes an easier task for the model; (ii) knowledge of a prefixing (resp. suffixing) language makes acquisition of a suffixing (resp. prefixing) language’s morphology more challenging; and (iii) surprisingly, a source language which exhibits an agglutinative morphology simplifies learning of a second language’s inflectional morphology, independent of their relatedness.
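A bare-bones sketch of the transfer setup, assuming UniMorph-style (lemma, tags, inflected form) triples and a character-level sequence-to-sequence model that is left abstract here: train first on source-language data, then continue training on the target language. The stub train_on, the encoding, and the toy German/Dutch examples are illustrative assumptions rather than the paper's implementation.

def encode(lemma, tags):
    # Character-level input with morphological tags appended as extra symbols.
    return list(lemma) + tags.split(";")

source_data = [("machen", "V;PST;3;SG", "machte")]   # e.g. a German source triple
target_data = [("maken",  "V;PST;3;SG", "maakte")]   # e.g. a Dutch target triple

def train_on(model_state, data, epochs):
    # Stub standing in for seq2seq training; a real system would update model
    # parameters on each (input sequence, output sequence) pair.
    for _ in range(epochs):
        for lemma, tags, form in data:
            _ = (encode(lemma, tags), list(form))
    return model_state

state = {}                                # randomly initialised model in practice
state = train_on(state, source_data, 10)  # acquire the source language's morphology
state = train_on(state, target_data, 10)  # then adapt to the target language
print(encode(*target_data[0][:2]))
# -> ['m', 'a', 'k', 'e', 'n', 'V', 'PST', '3', 'SG']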
State-of-the-art generalisation research in NLP: a taxonomy and review
The ability to generalise well is one of the primary desiderata of natural
language processing (NLP). Yet, what 'good generalisation' entails and how it
should be evaluated is not well understood, nor are there any common standards
to evaluate it. In this paper, we aim to lay the groundwork to improve both of
these issues. We present a taxonomy for characterising and understanding
generalisation research in NLP, we use that taxonomy to present a comprehensive
map of published generalisation studies, and we make recommendations for which
areas might deserve attention in the future. Our taxonomy is based on an
extensive literature review of generalisation research, and contains five axes
along which studies can differ: their main motivation, the type of
generalisation they aim to solve, the type of data shift they consider, the
source by which this data shift is obtained, and the locus of the shift within
the modelling pipeline. We use our taxonomy to classify over 400 previous
papers that test generalisation, for a total of more than 600 individual
experiments. Considering the results of this review, we present an in-depth
analysis of the current state of generalisation research in NLP, and make
recommendations for the future. Along with this paper, we release a webpage
where the results of our review can be dynamically explored, and which we
intend to update as new NLP generalisation studies are published. With this
work, we aim to make steps towards making state-of-the-art generalisation
testing the new status quo in NLP.
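To make the five axes concrete, here is a minimal sketch of how a single study could be recorded along them; the example axis values are abbreviated and only illustrative, the full value sets being defined in the paper and on the accompanying webpage.

from dataclasses import dataclass

@dataclass
class GeneralisationStudy:
    motivation: str            # e.g. "practical", "cognitive", "intrinsic"
    generalisation_type: str   # e.g. "compositional", "structural", "cross-lingual"
    shift_type: str            # e.g. "covariate", "label", "full"
    shift_source: str          # e.g. "naturally occurring", "generated"
    shift_locus: str           # e.g. "train-test", "finetune train-test", "pretrain-test"

example = GeneralisationStudy(
    motivation="practical",
    generalisation_type="cross-lingual",
    shift_type="covariate",
    shift_source="naturally occurring",
    shift_locus="train-test",
)
print(example)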