Training Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text
Lemmatization aims to reduce the sparse data problem by relating the
inflected forms of a word to its dictionary form. Using context can help, both
for unseen and ambiguous words. Yet most context-sensitive approaches require
full lemma-annotated sentences for training, which may be scarce or unavailable
in low-resource languages. In addition (as shown here), in a low-resource
setting, a lemmatizer can learn more from labeled examples of distinct
words (types) than from (contiguous) labeled tokens, since the latter
contain far fewer distinct types. To combine the efficiency of type-based
learning with the benefits of context, we propose a way to train a
context-sensitive lemmatizer with little or no labeled corpus data, using
inflection tables from the UniMorph project and raw text examples from
Wikipedia that provide sentence contexts for the unambiguous UniMorph examples.
Despite these being unambiguous examples, the model successfully generalizes
from them, leading to improved results (both overall, and especially on unseen
words) in comparison to a baseline that does not use context.
Comment: Published in NAACL 2019
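As a rough illustration of the augmentation idea, the following minimal sketch pairs unambiguous type-level entries with raw-text sentence contexts. All function names and data structures here are illustrative assumptions, not the paper's actual pipeline:

```python
# Minimal sketch: build context-sensitive training examples by attaching
# raw-text sentence contexts to unambiguous (form, lemma, tag) entries.
# Names and data formats are assumptions for illustration only.

from collections import defaultdict

def build_augmented_examples(unimorph_entries, raw_sentences):
    """unimorph_entries: iterable of (inflected_form, lemma, tag) triples.
    raw_sentences: iterable of tokenized sentences (lists of words)."""
    # Keep only forms that map to a single lemma, i.e. unambiguous types.
    form_to_lemmas = defaultdict(set)
    for form, lemma, tag in unimorph_entries:
        form_to_lemmas[form].add(lemma)
    unambiguous = {f: next(iter(ls)) for f, ls in form_to_lemmas.items()
                   if len(ls) == 1}

    # Pair each unambiguous form with the sentences it occurs in, yielding
    # (form, left context, right context, lemma) training examples for a
    # context-sensitive lemmatizer.
    examples = []
    for sent in raw_sentences:
        for i, word in enumerate(sent):
            lemma = unambiguous.get(word.lower())
            if lemma is not None:
                examples.append((word, sent[:i], sent[i + 1:], lemma))
    return examples

entries = [("walked", "walk", "V;PST"), ("banks", "bank", "N;PL")]
sents = [["she", "walked", "home"], ["two", "banks", "closed"]]
print(build_augmented_examples(entries, sents))
```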
Neural Transductive Learning and Beyond: Morphological Generation in the Minimal-Resource Setting
State-of-the-art neural sequence-to-sequence (seq2seq) models often do not
perform well on small training sets. We address paradigm completion, the
morphological task of generating all missing forms given a partial paradigm.
We propose two new methods for the minimal-resource setting: (i) Paradigm
transduction: since we assume only a few paradigms are available for training,
neural seq2seq models can capture relationships between paradigm cells but
are tied to the idiosyncrasies of the training set. Paradigm transduction
mitigates this problem by exploiting the input subset of inflected forms at
test time. (ii) Source selection with high precision (SHIP): Multi-source
models which learn to automatically select one or multiple sources to predict a
target inflection do not perform well in the minimal-resource setting. SHIP is
an alternative to identify a reliable source if training data is limited. On a
52-language benchmark dataset, we outperform the previous state of the art by
up to 9.71% absolute accuracy.
Comment: EMNLP 2018
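To make the source-selection idea concrete, here is a minimal, hypothetical sketch of picking one high-precision source cell per target cell from reliability estimates. The scoring table and all names are illustrative assumptions, not the SHIP implementation from the paper:

```python
# Minimal sketch of high-precision source selection: for each target cell,
# choose the single available source cell with the best estimated
# reliability. The scores would come from, e.g., accuracy of a seq2seq
# model predicting the target cell from each source cell on the small
# training set; the numbers below are made up for illustration.

def select_source(target_cell, available_cells, reliability):
    """Pick the available source cell with the highest estimated
    precision for predicting target_cell, or None if no score exists."""
    candidates = [c for c in available_cells if (c, target_cell) in reliability]
    if not candidates:
        return None
    return max(candidates, key=lambda c: reliability[(c, target_cell)])

# Hypothetical reliability estimates: (source_cell, target_cell) -> score.
reliability = {
    ("NOM;SG", "GEN;SG"): 0.92,
    ("NOM;PL", "GEN;SG"): 0.71,
    ("NOM;SG", "DAT;PL"): 0.55,
    ("NOM;PL", "DAT;PL"): 0.88,
}

given = ["NOM;SG", "NOM;PL"]          # cells present in the partial paradigm
for target in ["GEN;SG", "DAT;PL"]:   # cells to be generated
    print(target, "<-", select_source(target, given, reliability))
```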
Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
In this paper we present a novel lemmatization method based on a
sequence-to-sequence neural network architecture and a morphosyntactic context
representation. In the proposed method, our context-sensitive lemmatizer
generates the lemma one character at a time based on the surface form
characters and its morphosyntactic features obtained from a morphological
tagger. We argue that a sliding-window context representation suffers from
sparsity, whereas in the majority of cases the morphosyntactic features of a
word provide enough information to resolve lemma ambiguities, while keeping
the context representation dense and more practical for machine learning
systems.
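As an illustration of this kind of input encoding (a sketch under our own assumptions; the exact input format used in the paper may differ), the surface form can be fed to the seq2seq model as a character sequence followed by the tagger's morphosyntactic features as extra input tokens:

```python
# Minimal sketch: encode one word as input for a character-level seq2seq
# lemmatizer by concatenating its surface characters with its
# morphosyntactic feature tokens from a tagger. The delimiter and the
# feature format are assumptions for illustration.

def encode_input(surface_form, upos, feats):
    """Return the source token sequence: characters of the word followed
    by the part-of-speech and morphological feature tokens."""
    chars = list(surface_form)
    feat_tokens = [f"UPOS={upos}"] + [f"{k}={v}" for k, v in sorted(feats.items())]
    return chars + ["<sep>"] + feat_tokens

# Example: the Finnish form "koirille" ("for the dogs"), lemma "koira".
print(encode_input("koirille", "NOUN", {"Case": "All", "Number": "Plur"}))
# -> ['k', 'o', 'i', 'r', 'i', 'l', 'l', 'e', '<sep>',
#     'UPOS=NOUN', 'Case=All', 'Number=Plur']
```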
Additionally, we study two data augmentation methods, utilizing autoencoder
training and morphological transducers, that are especially beneficial for
low-resource languages. We evaluate our lemmatizer on 52 languages and 76
treebanks, showing that our system outperforms all of the latest baseline
systems. Compared to the best overall baseline, UDPipe Future, our system
outperforms it on 62 out of 76 treebanks, reducing errors by 19% relative on
average. The lemmatizer, together with all trained models, is made available
as part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.
Comment: Accepted to the Journal of Natural Language Engineering
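As a rough sketch of the autoencoder-style augmentation idea (an assumption-laden illustration, not the Turku pipeline's actual code), unlabeled vocabulary items can be turned into extra training pairs where the model simply learns to copy a word to itself, exposing the character decoder to much more of the language's orthography:

```python
# Minimal sketch of autoencoder-style data augmentation for a seq2seq
# lemmatizer: unlabeled words become copy examples (input == output), so
# the model sees many more character sequences of the language. The
# function name and the marker token are illustrative assumptions.

def autoencoder_examples(vocabulary):
    """Turn raw vocabulary items into (source, target) copy pairs,
    marked so the model can distinguish them from real lemmatization."""
    pairs = []
    for word in vocabulary:
        source = ["<copy>"] + list(word)  # marker + surface characters
        target = list(word)               # the model learns to reproduce it
        pairs.append((source, target))
    return pairs

vocab = ["talossa", "taloissa", "kissalle"]
for src, tgt in autoencoder_examples(vocab):
    print(src, "->", tgt)
```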