Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks
In this paper we present a novel lemmatization method based on a
sequence-to-sequence neural network architecture and morphosyntactic context
representation. In the proposed method, our context-sensitive lemmatizer
generates the lemma one character at a time based on the surface form
characters and its morphosyntactic features obtained from a morphological
tagger. We argue that a sliding window context representation suffers from
sparseness, while in the majority of cases the morphosyntactic features of a word
bring enough information to resolve lemma ambiguities while keeping the context
representation dense and more practical for machine learning systems.
Additionally, we study two different data augmentation methods, utilizing
autoencoder training and morphological transducers, which are especially
beneficial for low-resource languages. We evaluate our lemmatizer on 52
different languages
and 76 different treebanks, showing that our system outperforms all recent
baseline systems. Compared to the best overall baseline, UDPipe Future, our
system outperforms it on 62 out of 76 treebanks, reducing errors on average by
19% relative. The lemmatizer, together with all trained models, is made
available as part of the Turku-neural-parsing-pipeline under the Apache 2.0
license.
Comment: Accepted to the Journal of Natural Language Engineering
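As a sketch of the input representation described above: the surface characters and the tagger's morphosyntactic features can be flattened into a single source sequence for the character-level decoder. The token format below (`UPOS=...`, `Feature=Value`) is an illustrative assumption, not the authors' exact encoding.

```python
def encode_input(surface, upos, feats):
    """Build the source sequence for a character-level seq2seq lemmatizer:
    surface-form characters followed by morphosyntactic feature tokens."""
    tokens = list(surface)               # one token per character
    tokens.append("UPOS=" + upos)        # coarse POS tag from the tagger
    if feats:
        tokens.extend(feats.split("|"))  # each feature as one atomic token
    return tokens

# The decoder then emits the lemma one character at a time: w-a-l-k
print(encode_input("walked", "VERB", "Tense=Past|VerbForm=Fin"))
```

Appending features as atomic tokens keeps the context dense, in contrast to a sparse sliding window of neighboring words.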
Parser Training with Heterogeneous Treebanks
How to make the most of multiple heterogeneous treebanks when training a
monolingual dependency parser is an open question. We start by investigating
previously suggested, but little evaluated, strategies for exploiting multiple
treebanks based on concatenating training sets, with or without fine-tuning. We
go on to propose a new method based on treebank embeddings. We perform
experiments for several languages and show that in many cases fine-tuning and
treebank embeddings lead to substantial improvements over single treebanks or
concatenation, with average gains of 2.0--3.5 LAS points. We argue that
treebank embeddings should be preferred due to their conceptual simplicity,
flexibility and extensibility.
Comment: 7 pages. Accepted to ACL 2018, short paper
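The treebank-embedding strategy can be illustrated with a toy sketch: each token's input vector is its word embedding concatenated with a learned vector for the treebank the sentence came from. The dimensions and lookup-table form here are hypothetical.

```python
# Toy sketch of treebank embeddings (hypothetical dimensions and values):
word_emb = {"hund": [0.1, 0.2, 0.3, 0.4]}
treebank_emb = {"da_ddt": [1.0, 0.0], "sv_talbanken": [0.0, 1.0]}

def token_repr(word, treebank):
    # Concatenation lets a single parser share most parameters across
    # treebanks while still learning treebank-specific preferences.
    return word_emb[word] + treebank_emb[treebank]

print(token_repr("hund", "da_ddt"))  # 6-dimensional parser input
```

Because only the small treebank vector differs per source, adding another treebank requires no architectural change, which is the extensibility the abstract argues for.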
82 Treebanks, 34 Models: Universal Dependency Parsing with Multi-Treebank Models
We present the Uppsala system for the CoNLL 2018 Shared Task on universal
dependency parsing. Our system is a pipeline consisting of three components:
the first performs joint word and sentence segmentation; the second predicts
part-of-speech tags and morphological features; the third predicts dependency
trees from words and tags. Instead of training a single parsing model for each
treebank, we trained models with multiple treebanks for one language or closely
related languages, greatly reducing the number of models. On the official test
run, we ranked 7th of 27 teams for the LAS and MLAS metrics. Our system
obtained the best scores overall for word segmentation, universal POS tagging,
and morphological features.
Comment: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from
Raw Text to Universal Dependencies
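The reduction from 82 treebanks to far fewer models can be illustrated by grouping treebank IDs by language code; this is a simplification, since the actual system also merges closely related languages into one model.

```python
def group_treebanks(treebanks):
    """Group UD treebank IDs by language code so that one multi-treebank
    model is trained per group instead of one model per treebank."""
    groups = {}
    for tb in treebanks:
        lang = tb.split("_")[0]          # e.g. "en_ewt" -> "en"
        groups.setdefault(lang, []).append(tb)
    return groups

# Two models instead of four:
print(group_treebanks(["en_ewt", "en_gum", "en_lines", "sv_talbanken"]))
```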
Low-Resource Syntactic Transfer with Unsupervised Source Reordering
We describe a cross-lingual transfer method for dependency parsing that takes
into account the problem of word order differences between source and target
languages. Our model relies only on the Bible, a considerably smaller parallel
corpus than those commonly used in transfer methods. We use the
concatenation of projected trees from the Bible corpus and the gold-standard
treebanks in multiple source languages, along with cross-lingual word
representations. We demonstrate that reordering the source treebanks before
training on them for a target language improves accuracy for languages
outside the European language family. Our experiments on 68 treebanks (38
languages) in the Universal Dependencies corpus achieve high accuracy for all
languages. Among them, our experiments on 16 treebanks of 12 non-European
languages achieve an average UAS absolute improvement of 3.3% over a
state-of-the-art method.
Comment: Accepted in NAACL 2019
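A minimal sketch of the reordering idea, assuming a simple relation-based rule: dependents whose relations the target language prefers before the head are moved pre-head. The paper's model learns reordering from data; this rule-based version is only illustrative.

```python
def reorder(head, dependents, prehead_relations):
    """Reorder a head's dependents to match a target language's preferred
    order; prehead_relations names relations the target places before the
    head. (A drastic simplification of the learned reordering model.)"""
    before = [w for w, rel in dependents if rel in prehead_relations]
    after = [w for w, rel in dependents if rel not in prehead_relations]
    return before + [head] + after

# An SVO source clause rearranged for a verb-final target language:
print(reorder("ate", [("she", "nsubj"), ("apples", "obj")],
              prehead_relations={"nsubj", "obj"}))
```

Training on source trees rearranged this way exposes the parser to target-like word order before it ever sees target-language text.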
Towards JointUD: Part-of-speech Tagging and Lemmatization using Recurrent Neural Networks
This paper describes our submission to the CoNLL 2018 UD Shared Task. We have
extended an LSTM-based neural network designed for sequence tagging to
additionally generate character-level sequences. The network was jointly
trained to produce lemmas, part-of-speech tags and morphological features.
Sentence segmentation, tokenization and dependency parsing were handled by
the UDPipe 1.2 baseline. The results demonstrate the viability of the proposed
multitask architecture, although its performance remains far from the state of
the art.
Comment: System description paper of our system for the CoNLL 2018 shared task
on Universal Dependency parsing
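The joint training described above can be sketched as a weighted sum of the three task losses computed over a shared encoder; the equal default weights are an assumption, not the authors' setting.

```python
def multitask_loss(lemma_loss, pos_loss, feats_loss,
                   weights=(1.0, 1.0, 1.0)):
    """Combine per-task losses so one backward pass trains the shared
    LSTM encoder on lemmas, POS tags, and morphological features jointly.
    (Equal weighting is an illustrative assumption.)"""
    wl, wp, wf = weights
    return wl * lemma_loss + wp * pos_loss + wf * feats_loss

print(multitask_loss(1.0, 2.0, 3.0))  # 6.0
```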
SyntaxNet Models for the CoNLL 2017 Shared Task
We describe a baseline dependency parsing system for the CoNLL 2017 Shared
Task. This system, which we call "ParseySaurus," uses the DRAGNN framework
[Kong et al., 2017] to combine transition-based recurrent parsing and tagging
with character-based word representations. On the v1.3 Universal Dependencies
Treebanks, the new system outperforms the publicly available, state-of-the-art
"Parsey's Cousins" models by 3.47% absolute Labeled Accuracy Score (LAS) across
52 treebanks.
Comment: Tech report
Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles
We describe a transfer method based on annotation projection to develop a
dependency-based semantic role labeling system for languages for which no
supervised linguistic information other than parallel data is available. Unlike
previous work that presumes the availability of supervised features such as
lemmas, part-of-speech tags, and dependency parse trees, we only make use of
word and character features. Our deep model uses character-based
representations as well as unsupervised stem embeddings to alleviate the need
for supervised features. Our model outperforms a state-of-the-art method
that uses supervised lexico-syntactic features on 6 out of 7 languages in the
Universal Proposition Bank.
Comment: Accepted at the 13th International Conference on Computational
Semantics (IWCS 2019)
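The feature restriction can be sketched as follows; the truncation "stem" below is a hypothetical stand-in for the paper's unsupervised stemming, shown only to illustrate that no supervised resources (lemmas, POS tags, parse trees) are consulted.

```python
def unsupervised_features(word, stem_len=4):
    """Represent a token using only unsupervised signals: its characters
    plus a crude truncation 'stem' (an illustrative stand-in for
    unsupervised stemming)."""
    return {"chars": list(word), "stem": word[:stem_len].lower()}

# Everything here comes from the raw string alone:
print(unsupervised_features("Running"))
```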
An improved neural network model for joint POS tagging and dependency parsing
We propose a novel neural network model for joint part-of-speech (POS)
tagging and dependency parsing. Our model extends the well-known BIST
graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating
a BiLSTM-based tagging component to produce automatically predicted POS tags
for the parser. On the benchmark English Penn treebank, our model obtains
strong UAS and LAS scores of 94.51% and 92.87%, respectively, more than 1.5%
absolute improvement over the BIST graph-based parser, and also obtains a
state-of-the-art POS tagging accuracy of 97.97%. Furthermore, experimental
results on parsing 61 "big" Universal Dependencies treebanks from raw texts
show that our model outperforms the baseline UDPipe (Straka and Strakov\'a,
2017) with 0.8% higher average POS tagging score and 3.6% higher average LAS
score. In addition, with our model, we also obtain state-of-the-art downstream
task scores for biomedical event extraction and opinion analysis applications.
Our code is available together with all pre-trained models at:
https://github.com/datquocnguyen/jPTDP
Comment: 11 pages; in Proceedings of the CoNLL 2018 Shared Task: Multilingual
Parsing from Raw Text to Universal Dependencies, to appear
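The interface between the tagging and parsing components can be sketched as a vector concatenation: the BiLSTM tagger's predicted POS tag is embedded and joined to the word representation before entering the graph-based parser. Names and dimensions below are illustrative, not the model's actual architecture.

```python
def parser_input(word_vec, pos_embeddings, predicted_tag):
    """Concatenate the word vector with the embedding of the tag the
    tagging component predicted (illustrative vector composition)."""
    return word_vec + pos_embeddings[predicted_tag]

# Hypothetical 2-dimensional tag embeddings:
pos_embeddings = {"NOUN": [1.0, 0.0], "VERB": [0.0, 1.0]}
print(parser_input([0.2, 0.5, 0.1], pos_embeddings, "VERB"))
```

Joint training lets the tagging errors the parser actually sees at test time also appear during training, unlike a pipeline with gold tags.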
Token-based typology and word order entropy: A study based on universal dependencies
The present paper discusses the benefits and challenges of token-based
typology, which takes into account the frequencies of words and constructions
in language use. This approach makes it possible to introduce new criteria for
language classification, which would be difficult or impossible to achieve with
the traditional, type-based approach. This point is illustrated by several
quantitative studies of word order variation, which can be measured as entropy
at different levels of granularity. I argue that this variation can be
explained by general functional mechanisms and pressures, which manifest
themselves in language use, such as optimization of processing (including
avoidance of ambiguity) and grammaticalization of predictable units occurring
in chunks. The case studies are based on multilingual corpora, which have been
parsed using the Universal Dependencies annotation scheme.
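The entropy measure underlying these case studies can be computed directly from order frequencies; a minimal sketch, assuming simple counts of the two possible head-dependent orders:

```python
import math

def order_entropy(counts):
    """Shannon entropy (in bits) of head-dependent order frequencies,
    e.g. counts of adjective-before-noun vs. adjective-after-noun."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

# A near-rigid order gives close to 0 bits; free variation gives 1 bit:
print(order_entropy({"adj_noun": 98, "noun_adj": 2}))   # low entropy
print(order_entropy({"adj_noun": 50, "noun_adj": 50}))  # 1.0
```

Computing this per relation and per treebank gives the different levels of granularity the abstract mentions.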
Modeling Composite Labels for Neural Morphological Tagging
Neural morphological tagging has been regarded as an extension to POS tagging
task, treating each morphological tag as a monolithic label and ignoring its
internal structure. We propose to view morphological tags as composite labels
and explicitly model their internal structure in a neural sequence tagger. For
this, we explore three different neural architectures and compare their
performance with both CRF and simple neural multiclass baselines. We evaluate
our models on 49 languages and show that the neural architecture that models
the morphological labels as sequences of morphological category values performs
significantly better than both baselines, establishing state-of-the-art results
in morphological tagging for most languages.
Comment: Proceedings of the 22nd Conference on Computational Natural Language
Learning, 2018
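The composite-label view can be sketched by decomposing a monolithic tag into the sequence of category=value components that the tagger predicts one at a time; the `Cat=Val|Cat=Val` string format follows a common UD-style convention and is assumed here.

```python
def tag_to_sequence(tag):
    """Split a monolithic morphological tag into its category=value
    components, the targets of the sequence-based tagger variant."""
    return tag.split("|")

print(tag_to_sequence("POS=Noun|Case=Nom|Number=Sing"))
```

Predicting values per category means unseen combinations of known values no longer require entirely unseen labels, which is what helps on morphologically rich languages.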