Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts
Although ties between China and Portuguese-speaking countries are increasing and significant, there are few parallel corpora for the Chinese-Portuguese language pair. Both languages have large speaker populations, with 1.2 billion native Chinese speakers and 279 million native Portuguese speakers; the language pair, however, can be considered low-resource in terms of available parallel corpora. In this paper, we describe our methods for curating Chinese-Portuguese parallel corpora and evaluating their quality. We extracted bilingual data from Macao government websites and propose a hierarchical strategy to build a large parallel corpus. Experiments are conducted on existing corpora and our own, using both Phrase-Based Machine Translation (PBMT) and state-of-the-art Neural Machine Translation (NMT) models. The results of this work can serve as a benchmark for future Chinese-Portuguese MT systems. The approach used in this paper also offers a good example of how to boost the performance of MT systems for low-resource language pairs.
Comment: accepted by LREC 2018
Modeling Coherence for Neural Machine Translation with Dynamic and Topic Caches
Sentences in a well-formed text are connected to each other via various links
to form the cohesive structure of the text. Current neural machine translation
(NMT) systems translate a text in a conventional sentence-by-sentence fashion,
ignoring such cross-sentence links and dependencies. This can result in an incoherent target text for a coherent source text. To handle this
issue, we propose a cache-based approach to modeling coherence for neural
machine translation by capturing contextual information either from recently
translated sentences or from the entire document. In particular, we explore two types
of caches: a dynamic cache, which stores words from the best translation
hypotheses of preceding sentences, and a topic cache, which maintains a set of
target-side topical words that are semantically related to the document to be
translated. On this basis, we build a new layer to score target words in these
two caches with a cache-based neural model. Here the estimated probabilities
from the cache-based neural model are combined with NMT probabilities into the
final word prediction probabilities via a gating mechanism. Finally, the
proposed cache-based neural model is trained jointly with the NMT system in an
end-to-end manner. Experiments and analysis presented in this paper demonstrate
that the proposed cache-based model achieves substantial improvements over
several state-of-the-art SMT and NMT baselines.
Comment: Accepted by COLING 2018; 11 pages, 3 figures
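To make the gating mechanism concrete, here is a minimal PyTorch sketch of mixing NMT and cache probabilities; computing the gate from the decoder hidden state, and all sizes, are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class GatedCacheCombiner(nn.Module):
    """Mix NMT and cache-based word probabilities via a learned gate.

    A sketch of the gating idea: the gate is computed from the decoder
    state (assumed here), and the final distribution is a convex
    mixture of the two models' distributions.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

    def forward(self, decoder_state, p_nmt, p_cache):
        # decoder_state: (batch, hidden); p_*: (batch, vocab)
        g = self.gate(decoder_state)            # (batch, 1), in (0, 1)
        return g * p_nmt + (1.0 - g) * p_cache  # final word probabilities

# Toy usage with random tensors.
combiner = GatedCacheCombiner(hidden_size=512)
state = torch.randn(2, 512)
p_nmt = torch.softmax(torch.randn(2, 1000), dim=-1)
p_cache = torch.softmax(torch.randn(2, 1000), dim=-1)
p_final = combiner(state, p_nmt, p_cache)  # rows still sum to 1
```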
A Framework for Hierarchical Multilingual Machine Translation
Multilingual machine translation has recently been in vogue given its
potential for improving machine translation performance for low-resource
languages via transfer learning. Empirical examinations demonstrating the
success of existing multilingual machine translation strategies, however, are
limited to experiments in specific language groups. In this paper, we present a
hierarchical framework for building multilingual machine translation strategies
that takes advantage of a typological language family tree for enabling
transfer among similar languages while avoiding the negative effects that
result from incorporating languages that are too different from each other.
Exhaustive experimentation on a dataset with 41 languages demonstrates the
validity of the proposed framework, especially when it comes to improving the
performance of low-resource languages via the use of typologically related
families for which richer sets of resources are available.
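As a toy illustration of the idea, the sketch below walks a small, hypothetical typological family tree to pick transfer partners for a language, preferring siblings in the same branch over more distant relatives; the tree and language codes are invented for the example.

```python
# Hypothetical typological family tree: family -> branch -> languages.
FAMILY_TREE = {
    "Indo-European": {
        "Romance": ["es", "pt", "it", "fr", "ro"],
        "Germanic": ["en", "de", "nl", "sv"],
    },
}

def transfer_partners(lang, tree):
    """Return candidate transfer languages, nearest relatives first."""
    for family, branches in tree.items():
        for branch, langs in branches.items():
            if lang in langs:
                siblings = [l for l in langs if l != lang]
                cousins = [l for other, ls in branches.items()
                           if other != branch for l in ls]
                return siblings + cousins  # same branch before same family
    return []

print(transfer_partners("ro", FAMILY_TREE))
# ['es', 'pt', 'it', 'fr', 'en', 'de', 'nl', 'sv']
```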
XNLI: Evaluating Cross-lingual Sentence Representations
State-of-the-art natural language processing systems rely on supervision in
the form of annotated data to learn competent models. These models are
generally trained on data in a single language (usually English), and cannot be
directly used beyond that language. Since collecting data in every language is
not realistic, there has been a growing interest in cross-lingual language
understanding (XLU) and low-resource cross-language transfer. In this work, we
construct an evaluation set for XLU by extending the development and test sets
of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15
languages, including low-resource languages such as Swahili and Urdu. We hope
that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence
understanding by providing an informative standard evaluation task. In
addition, we provide several baselines for multilingual sentence understanding,
including two based on machine translation systems, and two that use parallel
data to train aligned multilingual bag-of-words and LSTM encoders. We find that
XNLI represents a practical and challenging evaluation suite, and that directly
translating the test data yields the best performance among available
baselines.
Comment: EMNLP 2018
Multilingual Factor Analysis
In this work we approach the task of learning multilingual word
representations in an offline manner by fitting a generative latent variable
model to a multilingual dictionary. We model equivalent words in different
languages as different views of the same word generated by a common latent
variable representing their latent lexical meaning. We explore the task of
alignment by querying the fitted model for multilingual embeddings, achieving
competitive results across a variety of tasks. The proposed model is robust to
noise in the embedding space, making it a suitable method for distributed representations learned from noisy corpora.
Comment: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
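One plausible reading of this generative story, written as standard multi-view factor analysis; the Gaussian priors and noise model below are assumptions, not necessarily the paper's exact specification.

```latex
% A shared latent lexical meaning z generates each language's view x^{(l)}
% of a word (assumed multi-view factor analysis form):
\begin{align}
  z &\sim \mathcal{N}(0, I_k), \\
  x^{(\ell)} \mid z &\sim \mathcal{N}\!\left(W_\ell z + \mu_\ell,\; \Psi_\ell\right),
  \qquad \ell = 1, \dots, L,
\end{align}
% where W_\ell maps the shared latent meaning into language \ell's embedding
% space; querying the fitted model for z yields the multilingual embedding.
```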
Dimension Projection among Languages based on Pseudo-relevant Documents for Query Translation
Using top-ranked documents in response to a query has been shown to be an
effective approach to improve the quality of query translation in
dictionary-based cross-language information retrieval. In this paper, we
propose a new method for dictionary-based query translation based on dimension
projection of embedded vectors from the pseudo-relevant documents in the source
language to their equivalents in the target language. To this end, first we
learn low-dimensional vectors of the words in the pseudo-relevant collections
separately and then aim to find a query-dependent transformation matrix between
the vectors of translation pairs appearing in the collections. In the next step, the representation of each query term is projected into the target language, and a query-dependent translation model is then built via a softmax function.
Finally, the model is used for query translation. Our experiments on four CLEF
collections in French, Spanish, German, and Italian demonstrate that the
proposed method outperforms a word embedding baseline based on bilingual
shuffling and several other competitive baselines. The proposed method reaches up to 87% of the performance of machine translation (MT) on short queries and achieves considerable improvements on verbose queries.
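A compact NumPy sketch of the projection step, assuming a least-squares fit for the transformation matrix; variable names, the fitting criterion, and the toy data are illustrative rather than the paper's exact estimation procedure.

```python
import numpy as np

def query_translation_model(src_vecs, tgt_vecs, pairs, query_vec, tgt_vocab_vecs):
    """Fit a linear map W on translation pairs found in the pseudo-relevant
    documents, project a query term, and softmax over target candidates."""
    X = np.stack([src_vecs[s] for s, _ in pairs])  # (n, d_src) source sides
    Y = np.stack([tgt_vecs[t] for _, t in pairs])  # (n, d_tgt) target sides
    # Query-dependent transformation: minimize ||XW - Y||^2.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)      # (d_src, d_tgt)
    projected = query_vec @ W                      # query term in target space
    scores = tgt_vocab_vecs @ projected            # similarity to candidates
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                     # translation probabilities

# Toy usage with random embeddings and a two-entry dictionary.
rng = np.random.default_rng(0)
src = {w: rng.normal(size=50) for w in ["chat", "chien"]}
tgt = {w: rng.normal(size=50) for w in ["cat", "dog"]}
p = query_translation_model(src, tgt, [("chat", "cat"), ("chien", "dog")],
                            src["chat"], np.stack(list(tgt.values())))
```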
Unsupervised Clinical Language Translation
As patients' access to their doctors' clinical notes becomes common,
translating professional, clinical jargon to layperson-understandable language
is essential to improve patient-clinician communication. Such translation
yields better clinical outcomes by enhancing patients' understanding of their
own health conditions, and thus improving patients' involvement in their own
care. Existing research has used dictionary-based word replacement or
definition insertion to address this need. However, these methods are limited
by expert curation, which is hard to scale and has trouble generalizing to
unseen datasets that do not share an overlapping vocabulary. In contrast, we
approach the clinical word and sentence translation problem in a completely
unsupervised manner. We show that a framework using representation learning,
bilingual dictionary induction and statistical machine translation yields the
best precision at 10 of 0.827 on professional-to-consumer word translation, and
mean opinion scores of 4.10 and 4.28 out of 5 for clinical correctness and
layperson readability, respectively, on sentence translation. Our
fully-unsupervised strategy overcomes the curation problem, and the clinically
meaningful evaluation reduces biases from inappropriate evaluators, which is critical in clinical machine learning.
Comment: Accepted to KDD 2019
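For the bilingual dictionary induction step in particular, a minimal sketch is nearest-neighbor search between the two aligned embedding spaces; the cosine criterion and all inputs here are placeholders, and the paper's full pipeline is more involved.

```python
import numpy as np

def induce_dictionary(pro_vecs, con_vecs, pro_words, con_words, k=10):
    """Sketch of dictionary induction between professional and consumer
    vocabularies: after the two embedding spaces are aligned, each clinical
    term's candidate translations are its nearest consumer-side neighbors."""
    pro = pro_vecs / np.linalg.norm(pro_vecs, axis=1, keepdims=True)
    con = con_vecs / np.linalg.norm(con_vecs, axis=1, keepdims=True)
    sims = pro @ con.T                      # (n_pro, n_con) cosine similarity
    top = np.argsort(-sims, axis=1)[:, :k]  # k best candidates per term
    return {pro_words[i]: [con_words[j] for j in top[i]]
            for i in range(len(pro_words))}
```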
Multilingual Embeddings Jointly Induced from Contexts and Concepts: Simple, Strong and Scalable
Word embeddings induced from local context are prevalent in NLP. A simple and
effective context-based multilingual embedding learner is Levy et al. (2017)'s
S-ID (sentence ID) method. Another line of work induces high-performing
multilingual embeddings from concepts (Dufter et al., 2018). In this paper, we
propose Co+Co, a simple and scalable method that combines context-based and
concept-based learning. From a sentence-aligned corpus, concepts are extracted
via sampling; words are then associated with their concept ID and sentence ID
in embedding learning. This is the first work that successfully combines
context-based and concept-based embedding learning. We show that Co+Co performs
well for two different application scenarios: the Parallel Bible Corpus (1000+
languages, low-resource) and EuroParl (12 languages, high-resource). Among
methods applicable to both corpora, Co+Co performs best in our evaluation setup
of six tasks.
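A toy sketch of how Co+Co-style training pairs might be assembled, with the corpus and the sampled concepts invented for the example; a standard skip-gram learner would then embed words and IDs in one space.

```python
# Each word is paired with the ID of the aligned sentence it occurs in
# (context-based, as in S-ID) and with the ID of any concept it belongs to
# (concept-based). Corpus and concepts are toy placeholders.
corpus = {  # sentence ID -> tokenized versions in two languages
    "s1": {"en": ["in", "the", "beginning"], "de": ["am", "anfang"]},
    "s2": {"en": ["let", "there", "be", "light"], "de": ["es", "werde", "licht"]},
}
concepts = {"c1": {"beginning", "anfang"}, "c2": {"light", "licht"}}  # sampled

pairs = []
for sid, langs in corpus.items():
    for tokens in langs.values():
        for w in tokens:
            pairs.append((w, f"SID:{sid}"))          # context-based signal
            for cid, members in concepts.items():
                if w in members:
                    pairs.append((w, f"CID:{cid}"))  # concept-based signal
# `pairs` can be fed to any word2vec-style trainer as (word, context) pairs.
```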
Training a code-switching language model with monolingual data
A lack of code-switching data complicates the training of code-switching (CS)
language models. We propose an approach to train such CS language models on
monolingual data only. By constraining and normalizing the output projection
matrix in RNN-based language models, we bring embeddings of different languages
closer to each other. Numerical and visualization results show that the
proposed approaches markedly improve the performance of CS language models trained on monolingual data, performing comparably to, or even better than, CS language models trained on artificially generated CS data. We
additionally use unsupervised bilingual word translation to analyze whether
semantically equivalent words in different languages are mapped together.
Comment: Accepted as an oral presentation at ICASSP 2020
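One way to realize "constraining and normalizing the output projection matrix" is an RNN LM whose output rows are L2-normalized before the softmax, as in the PyTorch sketch below; the cosine-style scoring and the temperature are assumptions, not necessarily the authors' exact constraints.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedOutputLM(nn.Module):
    """RNN LM with an L2-normalized output projection, so that word
    embeddings of both languages live on a shared unit sphere."""

    def __init__(self, vocab_size=1000, dim=256, temperature=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Parameter(torch.randn(vocab_size, dim))
        self.temperature = temperature

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))   # (batch, time, dim)
        w = F.normalize(self.out, dim=1)      # unit-norm projection rows
        logits = F.normalize(h, dim=-1) @ w.T / self.temperature
        return logits                         # feed to cross-entropy loss
```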
Unsupervised Cross-lingual Transfer of Word Embedding Spaces
Cross-lingual transfer of word embeddings aims to establish the semantic
mappings among words in different languages by learning the transformation
functions over the corresponding word embedding spaces. Successfully solving
this problem would benefit many downstream tasks, such as translating text
classification models from resource-rich languages (e.g. English) to
low-resource languages. Supervised methods for this problem rely on the
availability of cross-lingual supervision, either using parallel corpora or
bilingual lexicons as the labeled data for training, which may not be available
for many low-resource languages. This paper proposes an unsupervised learning
approach that does not require any cross-lingual labeled data. Given two
monolingual word embedding spaces for any language pair, our algorithm
optimizes the transformation functions in both directions simultaneously based
on distributional matching as well as minimizing the back-translation losses.
We use a neural network implementation to calculate the Sinkhorn distance, a
well-defined distributional similarity measure, and optimize our objective
through back-propagation. Our evaluation on benchmark datasets for bilingual
lexicon induction and cross-lingual word similarity prediction shows stronger
or competitive performance of the proposed method compared to other
state-of-the-art supervised and unsupervised baseline methods over many
language pairs.
Comment: EMNLP 2018
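A self-contained sketch of the two ingredients named above: an unrolled Sinkhorn distance and back-translation (cycle) penalties on the two transformation matrices. Dimensions, initialization, and the loss weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def sinkhorn_distance(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized optimal-transport cost between two point clouds,
    computed with standard (differentiable) Sinkhorn iterations."""
    cost = torch.cdist(x, y, p=2) ** 2            # (n, m) pairwise costs
    K = torch.exp(-cost / eps)                    # Gibbs kernel
    n, m = x.size(0), y.size(0)
    a = torch.full((n,), 1.0 / n)                 # uniform source marginal
    b = torch.full((m,), 1.0 / m)                 # uniform target marginal
    v = torch.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = torch.diag(u) @ K @ torch.diag(v)         # transport plan
    return (P * cost).sum()

# Transformations in both directions plus back-translation penalties.
d = 50
W_xy = torch.randn(d, d, requires_grad=True)
W_yx = torch.randn(d, d, requires_grad=True)
X, Y = torch.randn(200, d), torch.randn(200, d)   # toy embedding spaces

loss = (sinkhorn_distance(X @ W_xy, Y) + sinkhorn_distance(Y @ W_yx, X)
        + ((X @ W_xy @ W_yx - X) ** 2).mean()     # cycle x -> y -> x
        + ((Y @ W_yx @ W_xy - Y) ** 2).mean())    # cycle y -> x -> y
loss.backward()                                   # optimize W_xy, W_yx
```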