Marrying Universal Dependencies and Universal Morphology
The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects
each present schemata for annotating the morphosyntactic details of language.
Each project also provides corpora of annotated text in many languages: UD at
the token level and UniMorph at the type level. As each corpus is built by
different annotators, language-specific decisions hinder the goal of universal
schemata. With compatibility of tags, each project's annotations could be used
to validate the other's. Additionally, the availability of both type- and
token-level resources would be a boon to tasks such as parsing and homograph
disambiguation. To ease this interoperability, we present a deterministic
mapping from Universal Dependencies v2 features into the UniMorph schema. We
validate our approach by lookup in the UniMorph corpora and find a
macro-average of 64.13% recall. We also note incompatibilities due to paucity
of data on either side. Finally, we present a critical evaluation of the
foundations, strengths, and weaknesses of the two annotation projects.
Comment: UDW1
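The deterministic mapping described above can be pictured as a lookup table from UD v2 feature–value pairs to UniMorph tags. The fragment below is a minimal sketch with a handful of illustrative entries; it is not the authors' full mapping table.

```python
# Hypothetical fragment of a UD v2 feature -> UniMorph tag table.
UD_TO_UNIMORPH = {
    "Number=Sing": "SG",
    "Number=Plur": "PL",
    "Tense=Past": "PST",
    "Tense=Pres": "PRS",
    "Case=Nom": "NOM",
    "Case=Acc": "ACC",
}

def convert(ud_features: str) -> str:
    """Map a UD feature string (e.g. 'Case=Nom|Number=Plur') to a
    semicolon-joined UniMorph tag string, skipping unmapped features."""
    tags = [UD_TO_UNIMORPH[f]
            for f in ud_features.split("|")
            if f in UD_TO_UNIMORPH]
    return ";".join(tags)

print(convert("Case=Nom|Number=Plur"))  # NOM;PL
```

Unmapped features are silently dropped here; a real converter would instead flag them, since such gaps are exactly the incompatibilities the paper reports.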
Weird inflects but OK: Making sense of morphological generation errors
We conduct a manual error analysis of the CoNLL-SIGMORPHON 2017 Shared Task on Morphological Reinflection. In this task, systems are given a word in citation form (e.g., hug) and asked to produce the corresponding inflected form (e.g., the simple past hugged). This design lets us analyze errors much like we might analyze children's production errors. We propose an error taxonomy and use it to annotate errors made by the top two systems across twelve languages. Many of the observed errors are related to inflectional patterns sensitive to inherent linguistic properties such as animacy or affect; many others are failures to predict truly unpredictable inflectional behaviors. We also find that nearly one quarter of the residual "errors" reflect errors in the gold data. © 2019 Association for Computational Linguistics.
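One crude way to operationalize part of such an error analysis is to localize where a wrong prediction diverges from the gold form relative to the lemma. The helper below is a hypothetical sketch of that idea, not the paper's annotation scheme; the heuristic and names are illustrative.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def classify_error(lemma: str, gold: str, pred: str) -> str:
    """Crude localization of a reinflection error: if the prediction
    diverges from the gold form inside the material shared with the
    lemma, call it a stem error; otherwise an affix error."""
    if pred == gold:
        return "correct"
    divergence = common_prefix_len(gold, pred)
    stem_len = common_prefix_len(lemma, gold)
    return "stem error" if divergence < stem_len else "affix error"

print(classify_error("hug", "hugged", "huged"))   # affix error
print(classify_error("hug", "hugged", "hagged"))  # stem error
```

A manual taxonomy like the paper's is far richer (allomorphy, gold-data errors, unpredictable behavior), which is precisely why automatic heuristics like this one are only a starting point.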
Lexical databases for computational analyses: A linguistic perspective
Large typological databases have permitted new ways of studying cross-linguistic morphological variation. Recently, computational modelers with typological interests have begun to turn to broad multilingual text databases. In this paper, we will focus particularly on the UniMorph database, a collection of morphological paradigms, mostly gathered automatically from the crowd-sourced multilingual dictionary Wiktionary. It was designed to make the large quantity of data contained in Wiktionary available to NLP researchers by standardizing the data and putting it into a form that is easy to access. For typological studies, however, the requirements for a linguistically informed view of morphological variation are quite different. They involve using a morphological database as a scientific instrument to both formulate and test hypotheses about the nature and organization of language systems. The requirements are, accordingly, much higher. In this paper, we survey some of the methodological challenges and pitfalls involved in using corpora for typological research, and we end with a proposal for best practices and directions for further research.
Distant Supervision from Disparate Sources for Low-Resource Part-of-Speech Tagging
We introduce DsDs: a cross-lingual neural part-of-speech tagger that learns
from disparate sources of distant supervision, and realistically scales to
hundreds of low-resource languages. The model exploits annotation projection,
instance selection, tag dictionaries, morphological lexicons, and distributed
representations, all in a uniform framework. The approach is simple, yet
surprisingly effective, resulting in a new state of the art without access to
any gold-annotated data.
Comment: EMNLP 201
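One ingredient of a uniform framework like the one described above is instance selection: keeping a projected (word, tag) pair only when a tag dictionary built from a lexicon licenses it. The sketch below is a hypothetical illustration of that filtering step, with made-up data; it is not the DsDs model itself.

```python
# Hypothetical tag dictionary (word -> POS tags licensed by a lexicon
# such as Wiktionary) and tags projected across a word-aligned bitext.
tag_dictionary = {
    "walks": {"NOUN", "VERB"},
    "the": {"DET"},
}

projected = [("the", "DET"), ("dog", "NOUN"), ("walks", "ADV")]

def select(projected_tags, dictionary):
    """Instance selection: keep a projected (word, tag) pair if the word
    is unknown to the dictionary or the dictionary licenses the tag."""
    kept = []
    for word, tag in projected_tags:
        allowed = dictionary.get(word)
        if allowed is None or tag in allowed:
            kept.append((word, tag))
    return kept

print(select(projected, tag_dictionary))
# [('the', 'DET'), ('dog', 'NOUN')] -- 'walks'/ADV conflicts with the lexicon
```

Filtered instances like these would then feed a neural tagger alongside the other supervision sources the abstract lists.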
One-Shot Neural Cross-Lingual Transfer for Paradigm Completion
We present a novel cross-lingual transfer method for paradigm completion, the
task of mapping a lemma to its inflected forms, using a neural encoder-decoder
model, the state of the art for the monolingual task. We use labeled data from
a high-resource language to increase performance on a low-resource language. In
experiments on 21 language pairs from four different language families, we
obtain up to 58% higher accuracy than without transfer and show that even
zero-shot and one-shot learning are possible. We further find that the degree
of language relatedness strongly influences the ability to transfer
morphological knowledge.
Comment: Accepted at ACL 201
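The data setup for such transfer can be sketched as mixing abundant high-resource training pairs with a handful of low-resource pairs, marking each example with a language tag so a single encoder-decoder serves both languages. The tagging scheme and examples below are assumptions for illustration, not the paper's exact format.

```python
# Hypothetical paradigm-completion triples: (lemma, feature bundle, form).
high = [("caminar", "V;PST;3;SG", "caminó")] * 3   # e.g. Spanish (high-resource)
low = [("caminhar", "V;PST;3;SG", "caminhou")]     # e.g. Portuguese (one-shot)

def make_training_data(high, low, high_lang, low_lang):
    """Prefix each example with a language token and pool both languages
    into one training set for a shared encoder-decoder."""
    data = [(f"<{high_lang}>", lemma, feats, form) for lemma, feats, form in high]
    data += [(f"<{low_lang}>", lemma, feats, form) for lemma, feats, form in low]
    return data

train = make_training_data(high, low, "es", "pt")
print(len(train))  # 4
```

With only one low-resource example in the pool, this is the one-shot condition the abstract refers to; related languages help because the shared model can reuse inflection patterns like the -ó/-ou preterite endings here.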
Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources
We propose a novel hybrid approach to lemmatization that enhances the seq2seq
neural model with additional lemmas extracted from an external lexicon or a
rule-based system. During training, the enhanced lemmatizer learns both to
generate lemmas via a sequential decoder and to copy lemma characters from the
external candidates supplied at run-time. Our lemmatizer, enhanced with
candidates extracted from the Apertium morphological analyzer, achieves
statistically significant improvements over baseline models that do not use
additional lemma information, reaching an average accuracy of 97.25% on a set
of 23 UD languages, 0.55% higher than that obtained with the Stanford Stanza
model on the same languages. We also compare other methods of integrating
external data into lemmatization and show that our enhanced system performs
considerably better than a simple lexicon-extension method based on the Stanza
system, while achieving complementary improvements w.r.t. the data augmentation
method.
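One simple way to supply external candidates to a seq2seq lemmatizer is to append them to the encoder input behind separator symbols, so the decoder can either generate the lemma or copy characters from a candidate. The sketch below shows only that input construction; the encoding format and example are assumptions for illustration, not the paper's exact implementation.

```python
def build_input(form: str, candidates: list[str]) -> str:
    """Build an encoder input exposing external lemma candidates (e.g.
    from the Apertium analyzer) to a seq2seq lemmatizer: the surface form
    as a character sequence, then each candidate, joined by separators."""
    return " <sep> ".join([" ".join(form)] + [" ".join(c) for c in candidates])

# Hypothetical example: Estonian 'lapsed' ("children"), candidate lemma 'laps'.
print(build_input("lapsed", ["laps"]))
# l a p s e d <sep> l a p s
```

Because the candidate characters sit in the input, a copy mechanism can attend to them directly, which is what makes the external lexicon or rule-based analyzer useful even for forms the neural model has never seen.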
- …