19 research outputs found
Adaptor Grammars for Unsupervised Paradigm Clustering
This work describes the Edinburgh submission to the SIGMORPHON 2021 Shared Task 2 on unsupervised morphological paradigm clustering. Given raw text input, the task was to assign each token to a cluster with other tokens from the same paradigm. We use Adaptor Grammar segmentations combined with frequency-based heuristics to predict paradigm clusters. Our system achieved the highest average F1 score across 9 test languages, placing first out of 15 submissions.
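The abstract gives only the outline of the pipeline, so the following is a minimal sketch (in Python, not the Edinburgh system itself) of how segmentations plus a frequency heuristic can yield paradigm clusters: each token is assigned to the paradigm identified by a candidate stem chosen from its segments. The function name cluster_by_stem, the stem-selection rule, and the toy segmentations are illustrative assumptions.

from collections import defaultdict

def cluster_by_stem(segmented_tokens):
    """segmented_tokens maps each token to its list of segments."""
    seg_freq = defaultdict(int)
    for segs in segmented_tokens.values():
        for seg in segs:
            seg_freq[seg] += 1

    clusters = defaultdict(set)
    for token, segs in segmented_tokens.items():
        # Heuristic: take the longest segment (corpus frequency breaks ties)
        # as the stem that identifies the token's paradigm.
        stem = max(segs, key=lambda s: (len(s), seg_freq[s]))
        clusters[stem].add(token)
    return dict(clusters)

# Toy, pre-computed segmentations; a real pipeline would sample them from an
# Adaptor Grammar rather than hard-code them.
toy_segmentations = {
    "walks":   ["walk", "s"],
    "walked":  ["walk", "ed"],
    "walking": ["walk", "ing"],
    "jumps":   ["jump", "s"],
    "jumped":  ["jump", "ed"],
}
print(cluster_by_stem(toy_segmentations))
# -> {'walk': {'walks', 'walked', 'walking'}, 'jump': {'jumps', 'jumped'}}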
The Paradigm Discovery Problem
This work treats the paradigm discovery problem (PDP), the task of learning
an inflectional morphological system from unannotated sentences. We formalize
the PDP and develop evaluation metrics for judging systems. Using currently
available resources, we construct datasets for the task. We also devise a
heuristic benchmark for the PDP and report empirical results on five diverse
languages. Our benchmark system first makes use of word embeddings and string
similarity to cluster forms by cell and by paradigm. Then, we bootstrap a
neural transducer on top of the clustered data to predict words to realize the
empty paradigm slots. An error analysis of our system suggests clustering by
cell across different inflection classes is the most pressing challenge for
future work. Our code and data are available for public use.
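As a rough illustration of the first stage of the benchmark described above (and only that stage), the sketch below greedily groups surface forms into candidate paradigms by string similarity; the embedding-based cell clustering and the neural transducer that fills empty slots are omitted. The similarity threshold and the greedy single-link scheme are illustrative assumptions, not the paper's settings.

from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    return SequenceMatcher(None, a, b).ratio() >= threshold

def greedy_paradigm_clusters(forms, threshold=0.7):
    clusters = []
    for form in forms:
        for cluster in clusters:
            # Join the first cluster containing a sufficiently similar member.
            if any(similar(form, member, threshold) for member in cluster):
                cluster.append(form)
                break
        else:
            clusters.append([form])
    return clusters

forms = ["sing", "sings", "singing", "talk", "talks", "talked"]
print(greedy_paradigm_clusters(forms))
# -> [['sing', 'sings', 'singing'], ['talk', 'talks', 'talked']]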
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Large language models (LLMs) have recently reached an impressive level of
linguistic capability, prompting comparisons with human language skills.
However, there have been relatively few systematic inquiries into the
linguistic capabilities of the latest generation of LLMs, and those studies
that do exist (i) ignore the remarkable ability of humans to generalize, (ii)
focus only on English, and (iii) investigate syntax or semantics and overlook
other capabilities that lie at the heart of human language, like morphology.
Here, we close these gaps by conducting the first rigorous analysis of the
morphological capabilities of ChatGPT in four typologically varied languages
(specifically, English, German, Tamil, and Turkish). We apply a version of
Berko's (1958) wug test to ChatGPT, using novel, uncontaminated datasets for
the four examined languages. We find that ChatGPT massively underperforms
purpose-built systems, particularly in English. Overall, our results -- through
the lens of morphology -- cast a new light on the linguistic capabilities of
ChatGPT, suggesting that claims of human-like language skills are premature and
misleading.
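For readers unfamiliar with the paradigm, the sketch below shows what a single Berko-style wug-test item for English past-tense formation might look like when posed to a language model. The prompt wording, the nonce verbs, and the query_llm placeholder are illustrative assumptions; they are not the prompts, languages, or evaluation protocol used in the paper.

def wug_prompt(nonce_verb):
    # A classic wug-test frame: elicit the past tense of a verb the model
    # cannot have memorised.
    return (f"This is a verb: to {nonce_verb}. "
            f"Yesterday, she did the same thing. Yesterday, she ____. "
            f"Fill in the blank with one word.")

def query_llm(prompt):
    # Placeholder standing in for a call to a chat model's API. It simply
    # applies regular English inflection so the script runs offline.
    verb = prompt.split("to ")[1].split(".")[0]
    return verb + ("d" if verb.endswith("e") else "ed")

for nonce in ["rick", "spling", "gling"]:
    print(nonce, "->", query_llm(wug_prompt(nonce)))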
Integrating Machine Learning Into Language Documentation and Description
At least 40% of the world’s 7000+ languages are believed to be in danger of disappearing from human use by the end of this century. Many languages will disappear with almost no record of their existence because efforts to document and describe these languages are encountering an “annotation bottleneck” at early stages of analysis and annotation. Current annotation methods are too slow and expensive to counteract the pace of language endangerment and loss. Annotation could be sped up and improved by machine learning. However, state-of-the-art supervised machine learning depends heavily on large amounts of annotated data.
This dissertation explores how to train supervised machine learning systems for morphological analysis during language documentation and description. The systems are applied to nine languages. The research investigates ways that linguists and NLP scientists may want to adjust their expectations and workflows so that both can achieve optimal results with data from endangered languages.
New methods for tasks in morphological analysis are explored. First, various approaches to automating morpheme segmentation and glossing are compared. Second, a new task is presented for learning morphological paradigms and automatically generating new morphological resources: IGT-to-paradigms (IGT2P). Third, the impact of POS tags on segmentation, glossing, and paradigm induction is examined, showing that the presence or absence of POS tags does not have a significant bearing on the performance of machine learning systems. The results indicate that Natural Language Processing (NLP) systems could be successfully integrated into the documentary and descriptive workflow. At the same time, the relatively high accuracy achieved from noisy field data with little or no additional human annotation hints that NLP may benefit from limited documentary linguistic data, which may be the only or largest linguistically annotated resource available for some languages.
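As a very small sketch of the idea behind IGT2P (not the dissertation's actual formulation), the snippet below groups glossed tokens from interlinear glossed text by their stem gloss, so that each group becomes a partial paradigm whose empty cells a reinflection model could be trained to fill. The data format and field conventions are illustrative assumptions.

from collections import defaultdict

# Each IGT token is (surface form, gloss); the gloss separates the stem gloss
# from the inflectional tag with a hyphen, as is common in interlinear glossing.
igt_tokens = [
    ("canto",    "sing-1SG.PRS"),
    ("cantamos", "sing-1PL.PRS"),
    ("cantó",    "sing-3SG.PST"),
    ("como",     "eat-1SG.PRS"),
    ("comió",    "eat-3SG.PST"),
]

paradigms = defaultdict(dict)
for form, gloss in igt_tokens:
    stem_gloss, _, cell = gloss.partition("-")
    paradigms[stem_gloss][cell] = form

for stem, cells in paradigms.items():
    print(stem, cells)
# Cells that never occur in the IGT (e.g. eat + 1PL.PRS) are the slots a
# trained reinflection model would be asked to predict.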
Acquisition of Inflectional Morphology in Artificial Neural Networks With Prior Knowledge
How does knowledge of one language’s morphology influence learning of inflection rules in a second one? In order to investigate this question in artificial neural network models, we perform experiments with a sequence-to-sequence architecture, which we train on different combinations of eight source and three target languages. A detailed analysis of the model outputs suggests the following conclusions: (i) if source and target language are closely related, acquisition of the target language’s inflectional morphology constitutes an easier task for the model; (ii) knowledge of a prefixing (resp. suffixing) language makes acquisition of a suffixing (resp. prefixing) language’s morphology more challenging; and (iii) surprisingly, a source language which exhibits an agglutinative morphology simplifies learning of a second language’s inflectional morphology, independent of their relatedness.
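A bare-bones sketch of the transfer setup, assuming UniMorph-style (lemma, tags, inflected form) triples and a character-level sequence-to-sequence model that is left abstract here: train first on source-language data, then continue training on the target language. The stub train_on, the encoding, and the toy German/Dutch examples are illustrative assumptions rather than the paper's implementation.

def encode(lemma, tags):
    # Character-level input with morphological tags appended as extra symbols.
    return list(lemma) + tags.split(";")

source_data = [("machen", "V;PST;3;SG", "machte")]   # e.g. a German source triple
target_data = [("maken",  "V;PST;3;SG", "maakte")]   # e.g. a Dutch target triple

def train_on(model_state, data, epochs):
    # Stub standing in for seq2seq training; a real system would update model
    # parameters on each (input sequence, output sequence) pair.
    for _ in range(epochs):
        for lemma, tags, form in data:
            _ = (encode(lemma, tags), list(form))
    return model_state

state = {}                                # randomly initialised model in practice
state = train_on(state, source_data, 10)  # acquire the source language's morphology
state = train_on(state, target_data, 10)  # then adapt to the target language
print(encode(*target_data[0][:2]))
# -> ['m', 'a', 'k', 'e', 'n', 'V', 'PST', '3', 'SG']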
State-of-the-art generalisation research in NLP: a taxonomy and review
The ability to generalise well is one of the primary desiderata of natural
language processing (NLP). Yet, what 'good generalisation' entails and how it
should be evaluated is not well understood, nor are there any common standards
to evaluate it. In this paper, we aim to lay the groundwork to improve both of
these issues. We present a taxonomy for characterising and understanding
generalisation research in NLP, we use that taxonomy to present a comprehensive
map of published generalisation studies, and we make recommendations for which
areas might deserve attention in the future. Our taxonomy is based on an
extensive literature review of generalisation research, and contains five axes
along which studies can differ: their main motivation, the type of
generalisation they aim to solve, the type of data shift they consider, the
source by which this data shift is obtained, and the locus of the shift within
the modelling pipeline. We use our taxonomy to classify over 400 previous
papers that test generalisation, for a total of more than 600 individual
experiments. Considering the results of this review, we present an in-depth
analysis of the current state of generalisation research in NLP, and make
recommendations for the future. Along with this paper, we release a webpage
where the results of our review can be dynamically explored, and which we
intend to update as new NLP generalisation studies are published. With this
work, we aim to make steps towards making state-of-the-art generalisation
testing the new status quo in NLP.
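To make the five axes concrete, here is a minimal sketch of how a single study could be recorded along them; the example axis values are abbreviated and only illustrative, the full value sets being defined in the paper and on the accompanying webpage.

from dataclasses import dataclass

@dataclass
class GeneralisationStudy:
    motivation: str            # e.g. "practical", "cognitive", "intrinsic"
    generalisation_type: str   # e.g. "compositional", "structural", "cross-lingual"
    shift_type: str            # e.g. "covariate", "label", "full"
    shift_source: str          # e.g. "naturally occurring", "generated"
    shift_locus: str           # e.g. "train-test", "finetune train-test", "pretrain-test"

example = GeneralisationStudy(
    motivation="practical",
    generalisation_type="cross-lingual",
    shift_type="covariate",
    shift_source="naturally occurring",
    shift_locus="train-test",
)
print(example)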