SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings
Word alignments are useful for tasks like statistical and neural machine translation (NMT) and annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are competitive and mostly superior to traditional statistical aligners, even in scenarios with abundant parallel data. For example, for a set of 100k parallel sentences, contextualized embeddings achieve a word alignment F1 for English-German that is more than 5% higher (absolute) than eflomal, a high quality alignment model
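As a rough illustration of how alignments can be read off embeddings alone, the sketch below links a source word and a target word when each is the other's nearest neighbour under cosine similarity. The abstract does not specify the matching rule, so the mutual-argmax criterion, the function names, and the toy two-dimensional vectors are all assumptions made for illustration; real multilingual embeddings would be much higher-dimensional.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Return (source index, target index) pairs that are mutual nearest neighbours.

    Assumption: this matching rule is chosen for illustration only; the
    abstract above does not state which rule the authors use.
    """
    if not src_vecs or not tgt_vecs:
        return []
    # Similarity matrix: rows are source words, columns are target words.
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for each source word, and best source for each target word.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sim]
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sim[i][j])
        for j in range(len(tgt_vecs))
    ]
    # Keep a link only when both directions agree.
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy vectors standing in for multilingual word embeddings (hypothetical data).
source_vectors = [[1.0, 0.0], [0.0, 1.0]]
target_vectors = [[0.1, 0.9], [0.9, 0.1]]
print(mutual_argmax_align(source_vectors, target_vectors))
```

Requiring agreement in both directions trades recall for precision: a word whose best match prefers another partner is simply left unaligned.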
Unsupervised Neural Machine Translation with SMT as Posterior Regularization
Without real bilingual corpus available, unsupervised Neural Machine
Translation (NMT) typically requires pseudo parallel data generated with the
back-translation method for the model training. However, due to weak
supervision, the pseudo data inevitably contain noise and errors that will be
accumulated and reinforced in the subsequent training process, leading to poor
translation performance. To address this issue, we introduce phrase-based
Statistical Machine Translation (SMT) models, which are robust to noisy data, as
posterior regularizations to guide the training of unsupervised NMT models in
the iterative back-translation process. Our method starts from SMT models built
with pre-trained language models and word-level translation tables inferred
from cross-lingual embeddings. Then SMT and NMT models are optimized jointly
and boost each other incrementally in a unified EM framework. In this way, (1)
the negative effect caused by errors in the iterative back-translation process
can be alleviated in a timely manner as SMT filters noise from its phrase tables;
meanwhile, (2) NMT can compensate for the deficiency of fluency inherent in
SMT. Experiments conducted on en-fr and en-de translation tasks show that our
method outperforms the strong baseline and achieves new state-of-the-art
unsupervised machine translation performance. Comment: To be presented at AAAI 2019; 9 pages, 4 figures
Source side pre-ordering using recurrent neural networks for English-Myanmar machine translation
Word reordering remains one of the challenging problems in machine translation between language pairs with different word orders, e.g. English and Myanmar. Without reordering between these languages, a source sentence may be translated directly with similar word order, and the translation may not be meaningful. Myanmar is a subject-object-verb (SOV) language, and effective reordering is essential for translation. In this paper, we applied a pre-ordering approach using recurrent neural networks to pre-order words of the source Myanmar sentence into target English’s word order. This neural pre-ordering model is automatically derived from parallel word-aligned data with syntactic and lexical features based on dependency parse trees of the source sentences. It can generate arbitrary permutations that may be non-local within the sentence and can be integrated into English-Myanmar machine translation. We used the model to reorder English sentences into Myanmar-like word order as a preprocessing stage for machine translation, obtaining quality improvements comparable to the baseline rule-based pre-ordering approach on the Asian Language Treebank (ALT) corpus.
OpusFilter: A Configurable Parallel Corpus Filtering Toolbox
This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries, character-based language models, and word alignment tools, and it can easily be extended with custom filters. Bitext segments can be ranked according to their quality or domain match using single features or a logistic regression model that can be trained without manually labeled training data. We demonstrate the effectiveness of OpusFilter on the example of a Finnish-English news translation task based on noisy web-crawled training data. Applying our tool leads to improved translation quality while significantly reducing the size of the training data, also clearly outperforming an alternative ranking given in the crawled data set. Furthermore, we show the ability of OpusFilter to perform data selection for domain adaptation. Peer reviewed
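To make the idea of a heuristic bitext filter concrete, here is a minimal sketch of two common checks: a maximum segment length and a maximum source/target length ratio. It is written in plain Python and does not use or reproduce the OpusFilter API; the function names, the whitespace tokenization, and the threshold values are hypothetical choices for illustration.

```python
def length_ratio(src, tgt):
    """Ratio of the longer to the shorter segment, counted in whitespace tokens."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return float("inf")  # an empty side can never pass a ratio threshold
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def filter_bitext(pairs, max_ratio=3.0, max_len=100):
    """Keep only segment pairs that pass both heuristic checks.

    Assumption: the thresholds are illustrative defaults, not values
    taken from the paper or from any particular toolkit.
    """
    kept = []
    for src, tgt in pairs:
        if max(len(src.split()), len(tgt.split())) > max_len:
            continue  # overly long segments are often misaligned or noisy
        if length_ratio(src, tgt) > max_ratio:
            continue  # very unequal lengths suggest a bad translation pair
        kept.append((src, tgt))
    return kept


# Hypothetical toy segments; real input would be crawled sentence pairs.
noisy_pairs = [
    ("a b c", "x y z"),    # balanced lengths
    ("a", "x y z w v"),    # ratio 5.0 exceeds the threshold
    ("", "x"),             # empty source side
]
print(filter_bitext(noisy_pairs))
```

Each check here is a hard accept/reject decision; the ranking described in the abstract would instead turn such features into scores and combine them, for example with a logistic regression model.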
Unsupervised Bilingual Lexicon Induction from Mono-lingual Multimodal Data
Bilingual lexicon induction, translating words from the source language to
the target language, is a long-standing natural language processing task.
Recent endeavors show that it is promising to employ images as a pivot to learn
lexicon induction without reliance on parallel corpora. However, these
vision-based approaches simply associate words with entire images, so they are
limited to translating concrete words and require object-centered images.
Humans understand words better when they appear within a sentence with
context. Therefore, in this paper, we propose to utilize images and their
associated captions to address the limitations of previous approaches. We
propose a multi-lingual caption model trained with different mono-lingual
multimodal data to map words in different languages into joint spaces. Two
types of word representation are induced from the multi-lingual caption model:
linguistic features and localized visual features. The linguistic feature is
learned from the sentence contexts with visual semantic constraints, which is
beneficial for learning translations of words that are less visually relevant. The
localized visual feature attends to the region in the image that correlates
with the word, so it relaxes the requirement that images be salient visual
representations of the word. The two types of features are complementary for word
translation. Experimental results on multiple language pairs demonstrate the
effectiveness of our proposed method, which substantially outperforms previous
vision-based approaches without using any parallel sentences or supervision of
seed word pairs. Comment: Accepted by AAAI 201
Machine Translation of Arabic Dialects
This thesis discusses different approaches to machine translation (MT) from Dialectal Arabic (DA) to English. These approaches handle the varying stages of Arabic dialects in terms of types of available resources and amounts of training data. The overall theme of this work revolves around building dialectal resources and MT systems or enriching existing ones using the currently available resources (dialectal or standard) in order to quickly and cheaply scale to more dialects without the need to spend years and millions of dollars to create such resources for every dialect.
Unlike for Modern Standard Arabic (MSA), DA-English parallel corpora are scarce and available for only a few dialects. Dialects differ from each other and from MSA in orthography, morphology, phonology, and, to a lesser degree, syntax. This means that combining all available parallel data, from dialects and MSA, to train DA-to-English statistical machine translation (SMT) systems might not provide the desired results. Similarly, translating dialectal sentences with an SMT system trained on that dialect only is also challenging, because several factors can make the word choices in a test sentence differ from those in the SMT training data. Such factors include the level of dialectness (e.g., code switching to MSA versus dialectal training data), topic (sports versus politics), genre (tweets versus newspaper), script (Arabizi versus Arabic), and timespan of test against training. The work we present utilizes any available Arabic resource, such as a preprocessing tool or a parallel corpus, whether MSA or DA, to improve DA-to-English translation and expand to more dialects and sub-dialects.
The majority of Arabic dialects have no parallel data to English or to any other foreign language. They also have no preprocessing tools such as normalizers, morphological analyzers, or tokenizers. For such dialects, we present an MSA-pivoting approach where DA sentences are translated to MSA first, then the MSA output is translated to English using the wealth of MSA-English parallel data. Since there is virtually no DA-MSA parallel data to train an SMT system, we build a rule-based DA-to-MSA MT system, ELISSA, that uses morpho-syntactic translation rules along with dialect identification and language modeling components. We also present a rule-based approach to quickly and cheaply build a dialectal morphological analyzer, ADAM, which provides ELISSA with dialectal word analyses.
Other Arabic dialects have relatively small DA-English parallel data, amounting to a few million words on the DA side. Some of these dialects have dialect-dependent preprocessing tools that can be used to prepare the DA data for SMT systems. We present techniques to generate synthetic parallel data from the available DA-English and MSA-English data. We use this synthetic data to build statistical and hybrid versions of ELISSA as well as to improve our rule-based ELISSA-based MSA-pivoting approach. We evaluate our best MSA-pivoting MT pipeline against three direct SMT baselines trained on these three parallel corpora: DA-English data only, MSA-English data only, and the combination of DA-English and MSA-English data. Furthermore, we leverage these four MT systems (the three baselines along with our MSA-pivoting system) in two system combination approaches that benefit from their strengths while avoiding their weaknesses.
Finally, we propose an approach to model dialects from monolingual data and limited DA-English parallel data without the need for any language-dependent preprocessing tools. We learn DA preprocessing rules using word embedding and expectation maximization. We test this approach by building a morphological segmentation system and we evaluate its performance on MT against the state-of-the-art dialectal tokenization tool
- …