Bilingually motivated domain-adapted word segmentation for statistical machine translation
We introduce a word segmentation approach for languages whose word boundaries are not orthographically marked,
with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First, our approach is
adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, because it does not rely on manually segmented training data, it can be automatically adapted to different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and
demonstrate that our approach scores consistently among the best results across different data conditions.
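The core idea above — inducing segment boundaries from bilingual word alignments rather than from manually segmented data — can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm: it assumes character-to-source-word alignments (e.g. from a statistical aligner such as GIZA++) are already available, and simply merges consecutive target characters aligned to the same source word into one segment.

```python
def segment_from_alignment(chars, alignment):
    """Group consecutive characters aligned to the same source word.

    chars: list of characters of an unsegmented target sentence.
    alignment: list of (char_index, source_word_index) pairs, assumed to
    come from a statistical word aligner (hypothetical input format).
    """
    char_to_src = dict(alignment)
    segments, current, current_src = [], [], None
    for i, ch in enumerate(chars):
        src = char_to_src.get(i)
        if current and src == current_src:
            current.append(ch)  # same source word: extend current segment
        else:
            if current:
                segments.append("".join(current))
            current, current_src = [ch], src
    if current:
        segments.append("".join(current))
    return segments

# Characters 0-1 align to source word 0, characters 2-3 to source word 1:
print(segment_from_alignment(list("ABCD"), [(0, 0), (1, 0), (2, 1), (3, 1)]))
# -> ['AB', 'CD']
```

Because the alignments are induced from the bilingual corpus itself, re-running the aligner on a new domain's parallel data adapts the segmentation without any manual annotation.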
Translation of the Lampung Language Text Dialect of Nyo into the Indonesian Language with DMT and SMT Approach
Research on translating Lampung-language text in the Nyo dialect into Indonesian was carried out with two approaches, namely Direct Machine Translation (DMT) and Statistical Machine Translation (SMT). The experiment was conducted as a preliminary effort to help immigrant students in Lampung Province translate the Nyo dialect of Lampung through a prototype model. In the DMT approach, a dictionary is used as the primary tool. In contrast, in SMT, a parallel corpus of Lampung Nyo and Indonesian is used to build language models and translation models with the Moses Decoder. Text translation accuracy is 39.32% for the DMT approach and 59.85% for the SMT approach. Both approaches were evaluated using the Bilingual Evaluation Understudy (BLEU) metric.
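Both systems above are scored with BLEU. As a rough illustration of what that metric computes, here is a simplified sentence-level BLEU: modified n-gram precision up to 4-grams combined by geometric mean, with a brevity penalty. This is a hedged sketch for intuition, not the exact corpus-level implementation the study used (zero counts are smoothed with a small constant here).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c & r).values())          # clipped n-gram matches
        total = max(sum(c.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zeros
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))
# identical sentences -> 1.0
```

A dictionary-based DMT output typically preserves fewer long n-grams than an SMT output, which is one reason the SMT system scores higher on this metric.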
Joint Training for Neural Machine Translation Models with Monolingual Data
Monolingual data have been demonstrated to be helpful in improving
translation quality of both statistical machine translation (SMT) systems and
neural machine translation (NMT) systems, especially in resource-poor or domain
adaptation tasks where parallel data are not rich enough. In this paper, we
propose a novel approach to better leveraging monolingual data for neural
machine translation by jointly learning source-to-target and target-to-source
NMT models for a language pair with a joint EM optimization method. The
training process starts with two initial NMT models pre-trained on parallel
data for each direction, and these two models are iteratively updated by
incrementally decreasing translation losses on training data. In each iteration
step, both NMT models are first used to translate monolingual data from one
language to the other, forming pseudo-training data of the other NMT model.
Then two new NMT models are learnt from parallel data together with the pseudo
training data. Both NMT models are expected to improve, and better
pseudo-training data can be generated in the next step. Experimental results on
Chinese-English and English-German translation tasks show that our approach can
simultaneously improve translation quality of source-to-target and
target-to-source models, significantly outperforming strong baseline systems
which are enhanced with monolingual data for model training, including
back-translation. Comment: Accepted by AAAI 201
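The iterative loop described in this abstract can be sketched structurally. The `train` and `translate` callables below are hypothetical placeholders (a real system would train and decode actual NMT models); the sketch only shows the data flow: each direction's model back-translates monolingual data to produce pseudo-parallel data for the other direction, and both models are retrained each iteration.

```python
def joint_train(parallel, mono_src, mono_tgt, train, translate, iterations=3):
    """parallel: list of (src, tgt) pairs; mono_src / mono_tgt: monolingual
    sentences. train(pairs) -> model; translate(model, sents) -> list."""
    s2t = train(parallel)                        # source-to-target model
    t2s = train([(t, s) for s, t in parallel])   # target-to-source model
    for _ in range(iterations):
        # Each model translates monolingual data from one language to the
        # other, forming pseudo-training data for the *other* model.
        pseudo_t2s = [(translate(s2t, [s])[0], s) for s in mono_src]
        pseudo_s2t = [(translate(t2s, [t])[0], t) for t in mono_tgt]
        # Retrain both models on parallel plus pseudo-parallel data.
        s2t = train(parallel + pseudo_s2t)
        t2s = train([(t, s) for s, t in parallel] + pseudo_t2s)
    return s2t, t2s

# Toy demonstration with a word-substitution "model" standing in for NMT:
def toy_train(pairs):
    table = {}
    for s, t in pairs:
        for a, b in zip(s.split(), t.split()):
            table[a] = b
    return table

def toy_translate(model, sents):
    return [" ".join(model.get(w, w) for w in s.split()) for s in sents]

s2t, t2s = joint_train([("hello world", "bonjour monde")],
                       ["hello"], ["monde"], toy_train, toy_translate)
print(toy_translate(s2t, ["hello world"]))  # -> ['bonjour monde']
```

The paper frames this loop as joint EM optimization with decreasing translation losses; the sketch omits the loss-based stopping criterion and simply runs a fixed number of iterations.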
Target-Side Context for Discriminative Models in Statistical Machine Translation
Discriminative translation models utilizing source context have been shown to
help statistical machine translation performance. We propose a novel extension
of this work using target context information. Surprisingly, we show that this
model can be efficiently integrated directly in the decoding process. Our
approach scales to large training data sizes and results in consistent
improvements in translation quality on four language pairs. We also provide an
analysis comparing the strengths of the baseline source-context model with our
extended source-context and target-context model and we show that our extension
allows us to better capture morphological coherence. Our work is freely
available as part of Moses. Comment: Accepted as a long paper for ACL 201
Marker-based filtering of bilingual phrase pairs for SMT
State-of-the-art statistical machine translation
systems make use of a large translation table obtained after scoring a set of bilingual phrase pairs automatically extracted from a parallel corpus. The number of bilingual phrase pairs extracted from a pair of aligned sentences grows exponentially as the length of the sentences increases; therefore, the number of entries in the phrase table used to carry out the translation may become unmanageable, especially when online, 'on demand' translation is required in real time. We describe
the use of closed-class words to filter the set of bilingual phrase pairs extracted from the parallel corpus by taking into account the alignment information
and the type of the words involved in the alignments. On four European language pairs, we show that our simple yet novel approach can filter out up to
a third of the phrase table while still providing competitive results compared to the baseline. Furthermore, it strikes a balance between the unfiltered approach and pruning with stop
words, where the deterioration in translation quality is unacceptably high.
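A minimal sketch of marker-based filtering as described above. The marker (closed-class) word lists and the boundary rule here are illustrative assumptions, not the paper's precise criteria: the sketch keeps a phrase pair only when both sides agree on whether they begin with a closed-class word, discarding mismatched pairs that are more likely to stem from spurious alignments.

```python
# Hypothetical closed-class (marker) word lists for an English-French pair:
SRC_MARKERS = {"the", "a", "of", "in", "to", "and"}
TGT_MARKERS = {"le", "la", "de", "dans", "à", "et"}

def keep_pair(src_phrase, tgt_phrase):
    """Keep a bilingual phrase pair only if both sides agree on whether
    they start with a marker word (an assumed, simplified criterion)."""
    src_starts = src_phrase.split()[0] in SRC_MARKERS
    tgt_starts = tgt_phrase.split()[0] in TGT_MARKERS
    return src_starts == tgt_starts

phrases = [("the house", "la maison"),
           ("the house", "maison"),   # marker mismatch: filtered out
           ("house", "maison")]
print([p for p in phrases if keep_pair(*p)])
# -> [('the house', 'la maison'), ('house', 'maison')]
```

Because the test consults only the boundary words of each phrase, it runs in constant time per pair, which matters when the extracted phrase set grows exponentially with sentence length.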