53 research outputs found
Recommended from our members
Analogy in Contact: Modeling Maltese Plural Inflection
Maltese is often described as having a hybrid morphological system resulting from extensive contact between Semitic and Romance language varieties. Such a designation reflects an etymological divide as much as it does a larger tradition in the literature of treating concatenative and non-concatenative morphological patterns as distinct in the language architecture. Using a combination of computational modeling and information-theoretic methods, we quantify the extent to which the phonology and etymology of a Maltese singular noun may predict the morphological process (affixal vs. templatic) as well as the specific plural allomorph (affix or template) relating a singular noun to its associated plural form(s) in the lexicon. The results indicate that phonological pressures shape the organization of the Maltese lexicon with predictive power that extends beyond that of a word's etymology, in line with analogical theories of language change in contact.
Formalizing Inflectional Paradigm Shape with Information Theory
“Paradigm shape,” our term for the morphological structure formed by implicative relations between inflected forms, has not been formally quantified in a gradient manner. We develop a method to formalize paradigm shape by modeling the joint effect of stem alternations and affixes. Applied to Spanish verbs, our model successfully captures aspects of both allomorphic and distributional classes. These results are replicable and extendable to other languages.
Interpreting Sequence-to-Sequence Models for Russian Inflectional Morphology
Morphological inflection, as an engineering task in NLP, has seen a rise in the use of neural sequence-to-sequence models (Kann et al. 2016, Cotterell et al. 2018, Aharoni et al. 2017). While these outperform traditional systems based on edit rule induction, it is hard to interpret what they are learning in linguistic terms. We propose a new method of analyzing morphological sequence-to-sequence models which groups errors into linguistically meaningful classes, making what the model learns more transparent. As a case study, we analyze a seq2seq model on Russian, finding that semantically and lexically conditioned allomorphy (e.g. inanimate nouns like zavod 'factory' and animates like otec 'father' have different, animacy-conditioned accusative forms) is responsible for its relatively low accuracy. Augmenting the model with word embeddings as a proxy for lexical semantics leads to significant improvements in predicted wordform accuracy.
Normalization may be ineffective for phonetic category learning
Sound categories often overlap in their acoustics, which can make phonetic learning difficult. Several studies have argued that normalizing acoustics relative to context improves category separation (e.g. Dillon et al., 2013). However, recent work shows that normalization is ineffective for learning Japanese vowel length from spontaneous child-directed speech (Hitczenko et al., 2018). We show that this discrepancy arises from differences between spontaneous and controlled lab speech, and that normalization can increase category overlap when there are regularities in the contexts in which different sounds occur, a hallmark of spontaneous speech. Therefore, normalization is unlikely to help in real, naturalistic phonetic learning situations.
Challenges and solutions for Latin named entity recognition
Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many downstream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists, for instance, to track the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.
The Paradigm Discovery Problem
This work treats the paradigm discovery problem (PDP), the task of learning
an inflectional morphological system from unannotated sentences. We formalize
the PDP and develop evaluation metrics for judging systems. Using currently
available resources, we construct datasets for the task. We also devise a
heuristic benchmark for the PDP and report empirical results on five diverse
languages. Our benchmark system first makes use of word embeddings and string
similarity to cluster forms by cell and by paradigm. Then, we bootstrap a
neural transducer on top of the clustered data to predict words to realize the
empty paradigm slots. An error analysis of our system suggests clustering by
cell across different inflection classes is the most pressing challenge for
future work. Our code and data are available for public use. Comment: Forthcoming at ACL 202
Stop the Morphological Cycle, I Want to Get Off: Modeling the Development of Fusion
Historical linguists observe that many fusional (unsegmentable) morphological structures developed from agglutinative (segmentable) predecessors. Such changes may result when learners fail to acquire a phonological alternation and instead “chunk” the altered versions of morphemes and memorize them as underlying representations. We present a Bayesian model of this process, which learns which morphosyntactic properties are chunked together, what their underlying representations are, and what phonological processes apply to them. In simulations using artificial data, we provide quantitative support for two claims about agglutinative and fusional structures: that optional morphological markers discourage fusion from developing, but that stress-based vowel reduction encourages it.