A Theory of Unsupervised Translation Motivated by Understanding Animal Communication
Recent years have seen breakthroughs in neural language models that capture
nuances of language, culture, and knowledge. Neural networks are capable of
translating between languages -- in some cases even between two languages where
there is little or no access to parallel translations, in what is known as
Unsupervised Machine Translation (UMT). Given this progress, it is intriguing
to ask whether machine learning tools can ultimately enable understanding
animal communication, particularly that of highly intelligent animals. Our work
is motivated by an ambitious interdisciplinary initiative, Project CETI, which
is collecting a large corpus of sperm whale communications for machine
analysis.
We propose a theoretical framework for analyzing UMT when no parallel data
are available and when it cannot be assumed that the source and target corpora
address related subject domains or possess similar linguistic structure. The
framework requires access to a prior probability distribution that should
assign non-zero probability to possible translations. We instantiate our
framework with two models of language. Our analysis suggests that the accuracy of
translation depends on the complexity of the source language and the amount of
"common ground" between the source language and the target prior.
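
To make the role of the target prior concrete, here is a toy sketch of prior-guided translator selection: among candidate word-level dictionaries, choose the one under which the translated source corpus is most probable under the prior. Everything in it (the two-word languages, the unigram prior, the restriction to one-to-one dictionaries) is invented for illustration and far simpler than the language models analyzed in the paper.

```python
# Toy sketch: pick the word-level translator that maximizes the
# probability of the translated source corpus under a target-language
# prior. All data below is invented; a unigram prior stands in for the
# far richer priors the framework actually assumes.
import itertools
import math

SOURCE_CORPUS = [["ba", "da"], ["da", "ba", "ba"]]  # toy source utterances
SOURCE_VOCAB = ["ba", "da"]
TARGET_VOCAB = ["food", "danger"]
TARGET_PRIOR = {"food": 0.7, "danger": 0.3}  # unigram prior over target words

def log_prior(sentence):
    """Log-probability of a target-word sequence under the unigram prior."""
    return sum(math.log(TARGET_PRIOR[word]) for word in sentence)

best_translator, best_score = None, float("-inf")
# Candidate translators: every one-to-one source-to-target dictionary.
for images in itertools.permutations(TARGET_VOCAB):
    translator = dict(zip(SOURCE_VOCAB, images))
    score = sum(log_prior([translator[w] for w in sent]) for sent in SOURCE_CORPUS)
    if score > best_score:
        best_translator, best_score = translator, score

print(best_translator)  # {'ba': 'food', 'da': 'danger'} for this toy data
```

Since "ba" is the more frequent source word, the prior-likelihood criterion maps it to the higher-probability target word; with a richer prior (e.g. an n-gram or neural model), the same criterion can exploit structural "common ground" rather than bare word frequencies.
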
We also prove upper bounds on the amount of data required from the source
language in the unsupervised setting as a function of the amount of data
required in a hypothetical supervised setting. Surprisingly, our bounds suggest
that the amount of source data required for unsupervised translation is
comparable to the supervised setting. For one of the language models we analyze,
we also prove a nearly matching lower bound.
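
Read schematically, in notation we introduce here (the abstract asserts only that the two quantities are comparable, so the exact factors below are an assumption), the bounds take the shape:

```latex
% Schematic restatement of the data-requirement result; m_sup, m_unsup,
% and \tilde{O} are our notation, not the paper's.
\[
  m_{\mathrm{unsup}}(\varepsilon) \;=\; \tilde{O}\bigl(m_{\mathrm{sup}}(\varepsilon)\bigr),
\]
% where m_sup(\varepsilon) is the amount of source data sufficient for
% translation error \varepsilon in a hypothetical supervised setting,
% and the lower bound for one language model shows near-tightness.
```
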
Our analysis is purely information-theoretic and, as such, can inform how much
source data needs to be collected, but it does not yield a computationally
efficient procedure.
Multilingual Unsupervised Sentence Simplification
Progress in Sentence Simplification has been hindered by the lack of
supervised data, particularly in languages other than English. Previous work
has aligned sentences from original and simplified corpora such as English
Wikipedia and Simple English Wikipedia, but this limits corpus size, domain,
and language. In this work, we propose using unsupervised mining techniques to
automatically create training corpora for simplification in multiple languages
from raw Common Crawl web data. When coupled with a controllable generation
mechanism that can flexibly adjust attributes such as length and lexical
complexity, these mined paraphrase corpora can be used to train simplification
systems in any language. We further incorporate multilingual unsupervised
pretraining methods to create even stronger models and show that by training on
mined data rather than supervised corpora, we outperform the previous best
results. We evaluate our approach on English, French, and Spanish
simplification benchmarks and reach state-of-the-art performance with a fully
unsupervised approach. We will release our models and the code to mine the data
in any language included in Common Crawl.
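
One common way to realize this kind of controllable generation (used, for example, by ACCESS-style simplification systems) is to prepend control tokens that encode the desired attribute ratios, so a sequence-to-sequence model learns to condition on them. The sketch below is illustrative only: the token names, bucketing, and ratio computations are ours, not the released code.

```python
# Illustrative sketch of attribute control via prepended tokens: desired
# output properties (length ratio, a lexical-complexity proxy) are
# encoded as special tokens in front of the source sentence. Token names
# and the one-decimal bucketing are invented for illustration.
def control_prefix(char_ratio: float, word_rank_ratio: float) -> str:
    """Render bucketed attribute ratios as control tokens."""
    return f"<NbChars_{round(char_ratio, 1)}> <WordRank_{round(word_rank_ratio, 1)}>"

def prepare_training_pair(complex_sent: str, simple_sent: str,
                          word_rank_ratio: float) -> tuple[str, str]:
    """Attach a mined paraphrase pair's attribute ratios to its input side.

    At training time the ratios are computed from the target side, so the
    model learns to realize whatever ratios it is given; at inference time
    a user sets them to steer length and lexical complexity.
    """
    char_ratio = len(simple_sent) / max(len(complex_sent), 1)
    return control_prefix(char_ratio, word_rank_ratio) + " " + complex_sent, simple_sent

src, tgt = prepare_training_pair(
    "The committee deliberated at considerable length.",
    "The committee talked for a long time.",
    word_rank_ratio=0.8,  # stand-in for a lexical-complexity ratio
)
print(src)  # control tokens followed by the original sentence
```

Because the control values are recomputed from each mined pair, no supervised simplification labels are needed; the mined paraphrase corpus alone teaches the model what each attribute setting means.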