
    Bilingually motivated domain-adapted word segmentation for statistical machine translation

    We introduce a word segmentation approach for languages whose word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First, our approach is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, it does not rely on manually segmented training data, so it can be automatically adapted to different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that it scores consistently among the best results across different data conditions.
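    The core idea of alignment-driven segmentation can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function name, the data layout, and the heuristic of merging adjacent characters that align to the same target word are all assumptions.

```python
def segment_from_alignment(chars, alignment):
    """Group adjacent source characters into 'words' when they align
    to the same target word (toy heuristic, not the paper's method).

    chars: list of source characters (unsegmented text)
    alignment: list of (char_index, target_word_index) links
    """
    tgt_of = dict(alignment)  # which target word each character aligns to
    words, current = [], chars[0]
    for i in range(1, len(chars)):
        if tgt_of.get(i) == tgt_of.get(i - 1):
            current += chars[i]       # same target word: extend current word
        else:
            words.append(current)     # alignment changed: close the word
            current = chars[i]
    words.append(current)
    return words
```

    For example, if characters 0-1 align to target word 0 and characters 2-3 to target word 1, the four characters are grouped into two induced words.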

    Translation of the Lampung Language Text Dialect of Nyo into the Indonesian Language with DMT and SMT Approach

    Research on the translation of Lampung language text dialect of Nyo into Indonesian is done with two approaches, namely Direct Machine Translation (DMT) and Statistical Machine Translation (SMT). This research experiment was conducted as a preliminary effort in helping students immigrants in the province of Lampung, translating the Lampung language dialect of Nyo through prototypes or models was built. In the DMT approach, the dictionary is used as the primary tool. In contrast, in SMT, the parallel corpus of Lampung Nyo and Indonesian language is used to make language models and translation models using Moses Decoder. The result of text translation accuracy with the DMT approach is 39.32%, and for the SMT approach is 59.85%. Both approaches use Bilingual Evaluation Understudy (BLEU) assessment.Research on the translation of Lampung language text dialect of Nyo into Indonesian is done with two approaches, namely Direct Machine Translation (DMT) and Statistical Machine Translation (SMT). This research experiment was conducted as a preliminary effort in helping students immigrants in the province of Lampung, translating the Lampung language dialect of Nyo through prototypes or models was built. In the DMT approach, the dictionary is used as the primary tool. In contrast, in SMT, the parallel corpus of Lampung Nyo and Indonesian language is used to make language models and translation models using Moses Decoder. The result of text translation accuracy with the DMT approach is 39.32%, and for the SMT approach is 59.85%. Both approaches use Bilingual Evaluation Understudy (BLEU) assessment

    Joint Training for Neural Machine Translation Models with Monolingual Data

    Monolingual data have been demonstrated to help improve the translation quality of both statistical machine translation (SMT) and neural machine translation (NMT) systems, especially in resource-poor or domain-adaptation tasks where parallel data are not rich enough. In this paper, we propose a novel approach to better leverage monolingual data for neural machine translation by jointly learning source-to-target and target-to-source NMT models for a language pair with a joint EM optimization method. The training process starts with two initial NMT models pre-trained on parallel data, one for each direction, and these two models are iteratively updated by incrementally decreasing translation losses on the training data. In each iteration, both NMT models first translate monolingual data from one language into the other, forming pseudo-training data for the other NMT model. Then two new NMT models are learned from the parallel data together with the pseudo-training data. Both NMT models are expected to improve, so that better pseudo-training data can be generated in the next step. Experimental results on Chinese-English and English-German translation tasks show that our approach can simultaneously improve the translation quality of the source-to-target and target-to-source models, significantly outperforming strong baseline systems that are enhanced with monolingual data for model training, including back-translation.
    Comment: Accepted by AAAI 201
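    The iterative loop described above can be sketched as follows. The `train` and `translate` interfaces are hypothetical stand-ins for real NMT training and decoding, and the toy "models" at the bottom exist only so the sketch runs; the actual paper optimizes NMT parameters with a joint EM objective.

```python
def joint_train(parallel, mono_src, mono_tgt, train, translate, rounds=3):
    """Joint training sketch: each direction back-translates monolingual
    data to create pseudo-parallel data for the other direction.

    parallel: list of (src, tgt) sentence pairs
    train(pairs) -> model;  translate(model, sent) -> translation
    (both interfaces are assumptions for illustration)
    """
    fwd = train(parallel)                         # source -> target model
    bwd = train([(t, s) for (s, t) in parallel])  # target -> source model
    for _ in range(rounds):
        # back-translate target monolingual text: extra (src', tgt) pairs for fwd
        pseudo_fwd = [(translate(bwd, t), t) for t in mono_tgt]
        # back-translate source monolingual text: extra (tgt', src) pairs for bwd
        pseudo_bwd = [(translate(fwd, s), s) for s in mono_src]
        fwd = train(parallel + pseudo_fwd)
        bwd = train([(t, s) for (s, t) in parallel] + pseudo_bwd)
    return fwd, bwd

# toy stand-ins: a "model" is just a word dictionary
toy_train = lambda pairs: dict(pairs)
toy_translate = lambda model, word: model.get(word, word)
```

    The key property is the feedback loop: as each direction improves, the pseudo-data it generates for the opposite direction improves too, which is what the joint EM formulation makes explicit.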

    Target-Side Context for Discriminative Models in Statistical Machine Translation

    Discriminative translation models utilizing source context have been shown to help statistical machine translation performance. We propose a novel extension of this work using target context information. Surprisingly, we show that this model can be efficiently integrated directly into the decoding process. Our approach scales to large training data sizes and results in consistent improvements in translation quality on four language pairs. We also provide an analysis comparing the strengths of the baseline source-context model with our extended source-context and target-context model, and we show that our extension allows us to better capture morphological coherence. Our work is freely available as part of Moses.
    Comment: Accepted as a long paper for ACL 201
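    The distinction between source-only and target-aware features can be illustrated with a toy linear scorer. Everything here (the function name, the sparse feature templates, the two-word target history window) is an assumption for illustration, not the paper's model.

```python
def score_candidate(weights, src_context, tgt_history, candidate):
    """Toy discriminative scorer: a linear model over sparse features
    pairing the candidate phrase with source-context words and with
    the most recent target-side words (the paper's extension).
    """
    # source-side features: available to the baseline model
    feats = [("src", w, candidate) for w in src_context]
    # target-side features: the last two generated target words
    feats += [("tgt", w, candidate) for w in tgt_history[-2:]]
    return sum(weights.get(f, 0.0) for f in feats)
```

    Because target-side features depend on words the decoder has already emitted, scoring must happen inside decoding, which is why integrating such a model efficiently is the nontrivial part.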

    Marker-based filtering of bilingual phrase pairs for SMT

    State-of-the-art statistical machine translation systems make use of a large translation table obtained after scoring a set of bilingual phrase pairs automatically extracted from a parallel corpus. The number of bilingual phrase pairs extracted from a pair of aligned sentences grows rapidly as the length of the sentences increases; therefore, the number of entries in the phrase table used to carry out the translation may become unmanageable, especially when online, 'on demand' translation is required in real time. We describe the use of closed-class words to filter the set of bilingual phrase pairs extracted from the parallel corpus by taking into account the alignment information and the type of the words involved in the alignments. On four European language pairs, we show that our simple yet novel approach can filter the phrase table by up to a third, yet still provide competitive results compared to the baseline. Furthermore, it provides a good balance between the unfiltered approach and pruning using stop words, where the deterioration in translation quality is unacceptably high.
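    A marker-based filter of this kind can be sketched as a predicate over phrase pairs. The specific criterion below, rejecting a pair when an alignment link joins a closed-class word on one side to an open-class word on the other, is a hypothetical stand-in, since the abstract does not spell out the exact rule; the word lists are illustrative fragments, not the paper's marker sets.

```python
# illustrative closed-class (marker) word fragments, not the paper's lists
CLOSED_EN = {"the", "a", "of", "in", "and"}
CLOSED_FR = {"le", "la", "de", "en", "et"}

def keep_pair(pair, alignment, closed_src=CLOSED_EN, closed_tgt=CLOSED_FR):
    """Hypothetical marker-based filter: drop a phrase pair if any
    alignment link joins a closed-class word to an open-class word.

    pair: (src_words, tgt_words);  alignment: list of (i, j) links
    """
    src, tgt = pair
    for i, j in alignment:
        if (src[i] in closed_src) != (tgt[j] in closed_tgt):
            return False  # marker aligned to non-marker: discard
    return True
```

    Applying such a predicate while extracting phrase pairs shrinks the phrase table before scoring, which is what makes the approach attractive for real-time, on-demand translation.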