Byte-based Language Identification with Deep Convolutional Networks
We report on our system for the shared task on discriminating between similar
languages (DSL 2016). The system uses only byte representations in a deep
residual network (ResNet). The system, named ResIdent, is trained only on the
data released with the task (closed training). We obtain 84.88% accuracy on
subtask A, 68.80% accuracy on subtask B1, and 69.80% accuracy on subtask B2.
Relatively minor changes to the network's architecture and hyperparameters
produce large differences in accuracy on development data, so we expect
fine-tuning of these parameters to yield higher accuracies.
Comment: 7 pages. Adapted reviewer comments. arXiv admin note: text overlap with arXiv:1609.0705
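The abstract's key design choice is to feed raw bytes, rather than words or characters, into the network. A minimal sketch of that input representation (the function name, padding id, and sequence length are illustrative assumptions, not from the paper):

```python
def bytes_to_ids(text, max_len=256, pad_id=256):
    # Encode as UTF-8; each byte (0-255) is one input symbol, so the
    # vocabulary is fixed at 257 ids (256 byte values + padding)
    # regardless of which languages the classifier sees.
    ids = list(text.encode("utf-8"))[:max_len]
    ids += [pad_id] * (max_len - len(ids))
    return ids
```

Sequences produced this way would then be embedded and passed through the convolutional residual stack; non-ASCII characters simply expand into several byte ids, which is what makes the representation language-agnostic.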
Tracking Typological Traits of Uralic Languages in Distributed Language Representations
Although linguistic typology has a long history, computational approaches
have only recently gained popularity. The use of distributed representations in
computational linguistics has also become increasingly popular. A recent
development is to learn distributed representations of language, such that
typologically similar languages are spatially close to one another. Although
empirical successes have been shown for such language representations, they
have not been subjected to much typological probing. In this paper, we first
look at whether this type of language representation is empirically useful
for model transfer between Uralic languages in deep neural networks. We then
investigate which typological features are encoded in these representations by
attempting to predict features in the World Atlas of Language Structures, at
various stages of fine-tuning of the representations. We focus on Uralic
languages, and find that some typological traits can be automatically inferred
with accuracies well above a strong baseline.
Comment: Finnish abstract included in the paper
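The probing setup the abstract describes amounts to predicting a typological feature from a language's learned vector. A toy stand-in for such a probe, using a nearest-centroid classifier (the vectors, labels, and function name here are invented for illustration; the paper's actual classifiers and WALS features differ):

```python
def nearest_centroid_probe(train_vecs, train_labels, query_vec):
    # Group training language vectors by feature value, average each
    # group into a centroid, and label the query language by the
    # closest centroid (squared Euclidean distance).
    groups = {}
    for vec, label in zip(train_vecs, train_labels):
        groups.setdefault(label, []).append(vec)

    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min((dist(query_vec, mean(vs)), label)
               for label, vs in groups.items())[1]
```

If typologically similar languages really are spatially close in the representation space, even a probe this simple should recover some features above chance, which is the intuition the paper tests with stronger classifiers.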
SU-RUG at the CoNLL-SIGMORPHON 2017 shared task: Morphological Inflection with Attentional Sequence-to-Sequence Models
This paper describes the Stockholm University/University of Groningen
(SU-RUG) system for the SIGMORPHON 2017 shared task on morphological
inflection. Our system is based on an attentional sequence-to-sequence neural
network model using Long Short-Term Memory (LSTM) cells, with joint training of
morphological inflection and the inverse transformation, i.e. lemmatization and
morphological analysis. Our system outperforms the baseline by a large
margin, and our submission ranks 4th among the teams on the track we
participated in (task 1, high-resource).
Comment: 4 pages, to appear at CoNLL-SIGMORPHON 2017
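At the core of an attentional sequence-to-sequence model is the attention step: at each decoding step, the decoder scores every encoder state, normalizes the scores, and mixes the encoder states into a context vector. A minimal dot-product sketch of that step (the paper's system likely uses a learned attention variant; this generic form is for illustration only):

```python
import math

def attention(decoder_state, encoder_states):
    # Score each encoder state against the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, s))
              for s in encoder_states]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    # Context vector: attention-weighted sum of encoder states.
    context = [sum(w * s[i] for w, s in zip(weights, encoder_states))
               for i in range(len(decoder_state))]
    return weights, context
```

For inflection, this lets the decoder attend to the relevant lemma characters while emitting each output character, which is why attention helps so much on character-level transduction.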
One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis
When learning a new skill, you take advantage of your preexisting skills and
knowledge. For instance, if you are a skilled violinist, you will likely have
an easier time learning to play cello. Similarly, when learning a new language
you take advantage of the languages you already speak. For instance, if your
native language is Norwegian and you decide to learn Dutch, the lexical overlap
between these two languages will likely benefit your rate of language
acquisition. This thesis deals with the intersection of learning multiple tasks
and learning multiple languages in the context of Natural Language Processing
(NLP), which can be defined as the study of computational processing of human
language. Although these two types of learning may seem different on the
surface, we will see that they share many similarities.
The traditional approach in NLP is to consider a single task for a single
language at a time. However, recent advances allow for broadening this
approach, by considering data for multiple tasks and languages simultaneously.
This is an important approach to explore further as the key to improving the
reliability of NLP, especially for low-resource languages, is to take advantage
of all relevant data whenever possible. In doing so, the hope is that in the
long term, low-resource languages can benefit from the advances made in NLP
which are currently to a large extent reserved for high-resource languages.
This, in turn, may have positive consequences for, e.g., language
preservation, as speakers of minority languages will face less
pressure to use high-resource languages. In the short term, answering the
specific research questions posed should be of use to NLP researchers working
towards the same goal.
Comment: PhD thesis, University of Groningen
Semantic Tagging with Deep Residual Networks
We propose a novel semantic tagging task, sem-tagging, tailored for the
purpose of multilingual semantic parsing, and present the first tagger using
deep residual networks (ResNets). Our tagger uses both word and character
representations and includes a novel residual bypass architecture. We evaluate
the tagset both intrinsically, on the new task of semantic tagging, and
extrinsically, on Part-of-Speech (POS) tagging. Our system, consisting of a ResNet and an
auxiliary loss function predicting our semantic tags, significantly outperforms
prior results on English Universal Dependencies POS tagging (95.71% accuracy on
UD v1.2 and 95.67% accuracy on UD v1.3).
Comment: COLING 2016, camera-ready version
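The residual idea underlying both this tagger and the ResNets above is that a block adds its input back to its transformed output, so each layer only has to model a residual correction. A generic sketch of that shortcut (the paper's "residual bypass" combines word and character representations in a specific way not reproduced here):

```python
def residual_block(x, transform):
    # Identity shortcut: the block's input is added element-wise to
    # the transformed output, easing gradient flow in deep stacks.
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]
```

Stacking such blocks lets the tagger go deep without the degradation that plain deep stacks suffer, which is the motivation for using ResNets on a tagging task at all.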