78 research outputs found
Learning Language Representations for Typology Prediction
One central mystery of neural NLP is what neural models "know" about their
subject matter. When a neural machine translation system learns to translate
from one language to another, does it learn the syntax or semantics of the
languages? Can this knowledge be extracted from the system to fill holes in
human scientific knowledge? Existing typological databases contain relatively
full feature specifications for only a few hundred languages. Exploiting the
existence of parallel texts in more than a thousand languages, we build a
massive many-to-one neural machine translation (NMT) system from 1017 languages
into English, and use this to predict information missing from typological
databases. Experiments show that the proposed method is able to infer not only
syntactic, but also phonological and phonetic inventory features, and improves
over a baseline that has access to information about the languages' geographic
and phylogenetic neighbors.Comment: EMNLP 201
Universal Phone Recognition with a Multilingual Allophone System
Multilingual models can improve language processing, particularly for low
resource situations, by sharing parameters across languages. Multilingual
acoustic models, however, generally ignore the difference between phonemes
(sounds that can support lexical contrasts in a particular language) and their
corresponding phones (the sounds that are actually spoken, which are language
independent). This can lead to performance degradation when combining a variety
of training languages, as identically annotated phonemes can actually
correspond to several different underlying phonetic realizations. In this work,
we propose a joint model of both language-independent phone and
language-dependent phoneme distributions. In multilingual ASR experiments over
11 languages, we find that this model improves testing performance by 2%
phoneme error rate absolute in low-resource conditions. Additionally, because
we are explicitly modeling language-independent phones, we can build a
(nearly-)universal phone recognizer that, when combined with the PHOIBLE large,
manually curated database of phone inventories, can be customized into 2,000
language dependent recognizers. Experiments on two low-resourced indigenous
languages, Inuktitut and Tusom, show that our recognizer achieves phone
accuracy improvements of more than 17%, moving a step closer to speech
recognition for all languages in the world.Comment: ICASSP 202
- …