
    English-to-Czech MT: Large Data and Beyond


    Modeling Target-Side Inflection in Neural Machine Translation

    NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization over the rich vocabulary of morphologically rich languages with strong inflectional phenomena. We introduce a simple approach to overcome this problem by training the system to produce the lemma of a word together with its morphologically rich POS tag, followed by a deterministic generation step. We apply this strategy to English-Czech and English-German translation scenarios, obtaining improvements in both settings. We furthermore show that the improvement is not due merely to adding explicit morphological information. Comment: Accepted as a research paper at WMT17 (updated version with corrected references).
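    As a concrete illustration of the two-step setup described above, here is a minimal Python sketch: the NMT system is assumed to emit an interleaved sequence of lemmas and rich POS tags, and a deterministic morphological generator then produces the surface forms. The function names and the dictionary-based generator are assumptions for illustration, not the paper's actual tooling.

    # Hypothetical post-processing step: turn (lemma, tag) pairs emitted by
    # the NMT system into inflected surface words.
    def generate_form(lemma: str, tag: str, lexicon: dict) -> str:
        """Deterministically inflect `lemma` according to `tag`."""
        return lexicon.get((lemma, tag), lemma)  # fall back to the lemma

    def postprocess(nmt_output: list, lexicon: dict) -> str:
        """Decode an interleaved lemma/tag sequence into surface words."""
        words = []
        for i in range(0, len(nmt_output) - 1, 2):
            lemma, tag = nmt_output[i], nmt_output[i + 1]
            words.append(generate_form(lemma, tag, lexicon))
        return " ".join(words)

    # Example: a Czech noun lemma plus a positional tag yields the genitive form.
    lexicon = {("hrad", "NNIS2-----A----"): "hradu"}
    print(postprocess(["hrad", "NNIS2-----A----"], lexicon))  # -> "hradu"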

    Latent Tree Language Model

    In this paper we introduce the Latent Tree Language Model (LTLM), a novel approach to language modeling that encodes the syntax and semantics of a given sentence as a tree of word roles. The learning phase iteratively updates the trees by moving nodes according to Gibbs sampling. We introduce two algorithms to infer a tree for a given sentence. The first is based on Gibbs sampling: it is fast, but does not guarantee finding the most probable tree. The second is based on dynamic programming: it is slower, but guarantees finding the most probable tree. We provide a comparison of both algorithms. We combine LTLM with a 4-gram Modified Kneser-Ney language model via linear interpolation. Our experiments with English and Czech corpora show significant perplexity reductions (up to 46% for English and 49% for Czech) compared with the standalone 4-gram Modified Kneser-Ney language model. Comment: Accepted to EMNLP 2016.
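    The linear interpolation mentioned above is simple enough to show directly; the sketch below combines per-word probabilities from LTLM and a 4-gram Modified Kneser-Ney model with a single weight. The probability functions are stand-ins, and the weight would in practice be tuned on held-out data (an assumption here, since the abstract gives no details).

    import math

    def interp_prob(word, history, p_ltlm, p_kn, lam=0.5):
        """Linearly interpolated probability of `word` given its history."""
        return lam * p_ltlm(word, history) + (1.0 - lam) * p_kn(word, history)

    def perplexity(words, p_ltlm, p_kn, lam=0.5):
        """Perplexity of the interpolated model over a token sequence."""
        log_prob = 0.0
        for i, w in enumerate(words):
            history = words[max(0, i - 3):i]  # 4-gram window
            log_prob += math.log(interp_prob(w, history, p_ltlm, p_kn, lam))
        return math.exp(-log_prob / len(words))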

    Target-Side Context for Discriminative Models in Statistical Machine Translation

    Discriminative translation models utilizing source context have been shown to improve statistical machine translation performance. We propose a novel extension of this work using target context information. Surprisingly, we show that this model can be efficiently integrated directly into the decoding process. Our approach scales to large training data sizes and yields consistent improvements in translation quality on four language pairs. We also provide an analysis comparing the strengths of the baseline source-context model with our extended source-context and target-context model, and we show that our extension allows us to better capture morphological coherence. Our work is freely available as part of Moses. Comment: Accepted as a long paper at ACL 2016.
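    To make the idea concrete, here is a hypothetical sketch of a discriminative classifier that scores a candidate target phrase using both source-side context and the target words already produced by the decoder; its score can then serve as one additional feature during decoding. The feature templates and the linear model are illustrative assumptions, not the paper's actual feature set.

    def extract_features(src_phrase, tgt_phrase, src_context, tgt_history):
        """Sparse features over source/target words and their contexts."""
        feats = {}
        for s in src_phrase:
            for t in tgt_phrase:
                feats["pair:%s|%s" % (s, t)] = 1.0   # source-target word pairs
        for c in src_context:
            feats["src_ctx:" + c] = 1.0              # source context window
        for h in tgt_history[-2:]:
            feats["tgt_ctx:" + h] = 1.0              # target-side history
        return feats

    def score(weights, feats):
        """Linear model score, usable as one feature inside the decoder."""
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())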

    Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders

    We present an approach to learning multi-sense word embeddings that relies on both monolingual and bilingual information. Our model consists of an encoder, which uses monolingual and bilingual context (i.e., a parallel sentence) to choose a sense for a given word, and a decoder, which predicts context words based on the chosen sense. The two components are estimated jointly. We observe that the word representations induced from bilingual data outperform their monolingual counterparts across a range of evaluation tasks, even though cross-lingual information is not available at test time.
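    The encoder/decoder split described above can be sketched as follows: each word type keeps several discrete sense vectors, the encoder picks the sense whose vector best matches the combined monolingual and parallel-sentence context, and the decoder would then be trained to predict context words from that sense. Shapes, averaging, and dot-product scoring are illustrative assumptions.

    import numpy as np

    K, DIM = 3, 50
    sense_vecs = {}   # word -> (K, DIM) array, one row per discrete sense
    ctx_vecs = {}     # word -> (DIM,) context embedding

    def embed_context(context_words):
        """Average embeddings of monolingual and parallel-sentence words."""
        vecs = [ctx_vecs[w] for w in context_words if w in ctx_vecs]
        return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

    def choose_sense(word, context_words):
        """Encoder step: pick the discrete sense that best fits the context."""
        ctx = embed_context(context_words)
        scores = sense_vecs[word] @ ctx   # (K,) dot-product scores
        return int(np.argmax(scores))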

    SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task

    This paper describes 12 systems submitted to the WMT16 IT-task, covering six different languages: Basque, Bulgarian, Dutch, Czech, Portuguese and Spanish. All of these systems were developed within the scope of the QTLeap project and follow a common strategy. For each language, two different systems were submitted: a phrase-based MT system built using Moses, and a system exploiting deep language-engineering approaches, which for all languages but Bulgarian was implemented using TectoMT. For four of the six languages, the TectoMT-based system performs better than the Moses-based one.

    Merged bilingual trees based on Universal Dependencies in Machine Translation

    In this paper, we present our new experimental system for merging the dependency representations of two parallel sentences into one dependency tree. All inner nodes of the merged tree represent source-target pairs of words, while extra words appear as leaf nodes. We use the Universal Dependencies annotation style, in which function words, whose usage often differs between languages, are annotated as leaves. The parallel treebank is parsed in a minimally supervised way, and unaligned words are automatically pushed to leaves. We present a simple translation system trained on such merged trees and evaluate it on the WMT 2016 English-to-Czech and Czech-to-English translation tasks. Even though the model is still very simple and uses neither a language model nor a word-reordering model, the Czech-to-English variant reached a BLEU score similar to that of another established tree-based system.
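    As a rough illustration of the merged representation, the sketch below models inner nodes as aligned source-target word pairs, with unaligned words and function words as leaves. The field names are assumptions for illustration; the actual system's data structures may differ.

    from __future__ import annotations
    from dataclasses import dataclass, field

    @dataclass
    class MergedNode:
        src: str | None           # source word, or None if unaligned
        tgt: str | None           # target word, or None if unaligned
        deprel: str = "dep"       # Universal Dependencies relation label
        children: list[MergedNode] = field(default_factory=list)

        def is_leaf(self) -> bool:
            return not self.children

    # A tiny merged fragment: the aligned pair (dog, pes) governs an
    # unaligned English determiner, which stays in the tree as a leaf.
    root = MergedNode("dog", "pes", "root",
                      children=[MergedNode("the", None, "det")])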