CharManteau: Character Embedding Models For Portmanteau Creation
Portmanteaus are a word-formation phenomenon in which two words are combined to
form a new word. We propose character-level neural sequence-to-sequence (S2S)
methods for the task of portmanteau generation that are end-to-end-trainable,
language independent, and do not explicitly use additional phonetic
information. We propose a noisy-channel-style model, which allows for the
incorporation of unsupervised word lists, improving performance over a standard
source-to-target model. This model is made possible by an exhaustive candidate
generation strategy specifically enabled by the features of the portmanteau
task. Experiments find our approach superior to a state-of-the-art FST-based
baseline with respect to ground truth accuracy and human evaluation.
Comment: Accepted for publication in EMNLP 2017
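To make the noisy-channel formulation concrete: each candidate c for source words w1, w2 can be scored as log p(w1w2 | c) + log p(c), where the first term comes from the trained channel model and the second from a character language model estimated on an unsupervised word list. The Python sketch below illustrates this decomposition with toy stand-in scorers; the function names and scoring heuristics are illustrative assumptions, not the paper's implementation.

    def candidates(w1: str, w2: str):
        """Exhaustively enumerate candidates as a prefix of w1 joined to a
        suffix of w2 (roughly the search space the paper exploits)."""
        for i in range(1, len(w1) + 1):
            for j in range(len(w2)):
                yield w1[:i] + w2[j:]

    def log_p_channel(source: str, cand: str) -> float:
        # Stands in for the seq2seq channel model p(source | candidate);
        # here a toy proxy that prefers candidates near the source length.
        return -abs(len(source) - len(cand))

    def log_p_lm(cand: str) -> float:
        # Stands in for a character LM p(candidate) trained on a word list.
        return -0.1 * len(cand)

    def best_portmanteau(w1: str, w2: str) -> str:
        source = w1 + w2
        return max(candidates(w1, w2),
                   key=lambda c: log_p_channel(source, c) + log_p_lm(c))

    print(best_portmanteau("spoon", "fork"))  # "spork" under a real model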
The Paradigm Discovery Problem
This work treats the paradigm discovery problem (PDP), the task of learning
an inflectional morphological system from unannotated sentences. We formalize
the PDP and develop evaluation metrics for judging systems. Using currently
available resources, we construct datasets for the task. We also devise a
heuristic benchmark for the PDP and report empirical results on five diverse
languages. Our benchmark system first makes use of word embeddings and string
similarity to cluster forms by cell and by paradigm. Then, we bootstrap a
neural transducer on top of the clustered data to predict words to realize the
empty paradigm slots. An error analysis of our system suggests that clustering by
cell across different inflection classes is the most pressing challenge for
future work. Our code and data are available for public use.
Comment: Forthcoming at ACL 2020
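As a rough illustration of the clustering stage, embedding similarity and string similarity might be combined into a single pairwise score for deciding whether two surface forms belong to the same paradigm. The weights, threshold, and toy vectors below are invented for illustration and do not reproduce the paper's procedure.

    import difflib

    def string_sim(a: str, b: str) -> float:
        # Character-level similarity between two surface forms.
        return difflib.SequenceMatcher(None, a, b).ratio()

    def cosine(u, v) -> float:
        num = sum(x * y for x, y in zip(u, v))
        den = (sum(x * x for x in u) ** 0.5) * (sum(y * y for y in v) ** 0.5)
        return num / den if den else 0.0

    def same_paradigm(w1, w2, emb, alpha=0.5, threshold=0.7) -> bool:
        # Combine distributional and orthographic evidence (weights invented).
        score = (alpha * cosine(emb[w1], emb[w2])
                 + (1 - alpha) * string_sim(w1, w2))
        return score >= threshold

    emb = {"walk": [1.0, 0.1], "walked": [0.9, 0.2], "run": [0.1, 1.0]}
    print(same_paradigm("walk", "walked", emb))  # True: same lexeme
    print(same_paradigm("walk", "run", emb))     # False: different lexemes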
From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings
A core part of linguistic typology is the classification of languages
according to linguistic properties, such as those detailed in the World Atlas
of Language Structures (WALS). Doing this manually is prohibitively
time-consuming, which is in part evidenced by the fact that only 100 out of
over 7,000 languages spoken in the world are fully covered in WALS.
We learn distributed language representations, which can be used to predict
typological properties on a massively multilingual scale. Additionally,
quantitative and qualitative analyses of these language embeddings can tell us
how language similarities are encoded in NLP models for tasks at different
typological levels. The representations are learned in an unsupervised manner
alongside tasks at three typological levels: phonology (grapheme-to-phoneme
prediction, and phoneme reconstruction), morphology (morphological inflection),
and syntax (part-of-speech tagging).
We consider more than 800 languages and find significant differences in the
language representations encoded, depending on the target task. For instance,
although Norwegian Bokmål and Danish are typologically close to one
another, they are phonologically distant, which is reflected in their language
embeddings growing relatively distant in a phonological task. We are also able
to predict typological features in WALS with high accuracies, even for unseen
language families.
Comment: Accepted to NAACL 2018 (long paper). arXiv admin note: text overlap with arXiv:1711.0546
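A minimal sketch (assuming PyTorch) of the general recipe of learning a language embedding jointly with a task such as POS tagging: a per-language vector is concatenated to every token representation and updated by backpropagation, after which the learned vectors can be compared across languages or used to predict typological features. The architecture and dimensions here are illustrative, not the paper's exact setup.

    import torch
    import torch.nn as nn

    class MultilingualTagger(nn.Module):
        def __init__(self, vocab_size, n_langs, n_tags, d_tok=64, d_lang=16):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_tok)
            self.lang_emb = nn.Embedding(n_langs, d_lang)  # learned jointly
            self.rnn = nn.LSTM(d_tok + d_lang, 64, batch_first=True)
            self.out = nn.Linear(64, n_tags)

        def forward(self, tokens, lang_id):
            # tokens: (batch, seq_len) token ids; lang_id: (batch,) ids
            t = self.tok_emb(tokens)
            l = self.lang_emb(lang_id).unsqueeze(1).expand(-1, t.size(1), -1)
            h, _ = self.rnn(torch.cat([t, l], dim=-1))
            return self.out(h)  # per-token tag logits

    # After training, tagger.lang_emb.weight holds one vector per language,
    # which can be compared across languages or fed to a classifier for
    # typological feature prediction.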
Tackling Sequence to Sequence Mapping Problems with Neural Networks
In Natural Language Processing (NLP), it is important to detect the
relationship between two sequences or to generate a sequence of tokens given
another observed sequence. We refer to this class of problems, which model
sequence pairs, as sequence-to-sequence (seq2seq) mapping problems. Much
research has been devoted to tackling these problems, with traditional
approaches relying on a combination of hand-crafted features, alignment models,
segmentation heuristics, and external linguistic resources. Although great
progress has been made, these traditional approaches suffer from various
drawbacks, such as complicated pipelines, laborious feature engineering, and
difficulty with domain adaptation. Recently, neural networks have emerged as a
promising solution to many problems in NLP, speech recognition, and computer
vision. Neural models are powerful because they can be trained end-to-end,
generalise well to unseen examples, and the same framework can be easily
adapted to a new domain.
The aim of this thesis is to advance the state-of-the-art in seq2seq mapping
problems with neural networks. We explore solutions from three major aspects:
investigating neural models for representing sequences, modelling interactions
between sequences, and using unpaired data to boost the performance of neural
models. For each aspect, we propose novel models and evaluate their efficacy on
various tasks of seq2seq mapping.
Comment: PhD thesis
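For reference, the bare encoder-decoder form underlying neural seq2seq mapping can be sketched as follows (a generic PyTorch illustration, not a specific model from the thesis).

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, d=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, d)
            self.tgt_emb = nn.Embedding(tgt_vocab, d)
            self.encoder = nn.GRU(d, d, batch_first=True)
            self.decoder = nn.GRU(d, d, batch_first=True)
            self.out = nn.Linear(d, tgt_vocab)

        def forward(self, src, tgt_in):
            # Encode the source sequence into a fixed summary state.
            _, h = self.encoder(self.src_emb(src))
            # Decode the target conditioned on that state (teacher forcing).
            y, _ = self.decoder(self.tgt_emb(tgt_in), h)
            return self.out(y)  # per-step logits over the target vocabulary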