820 research outputs found

    Example-based machine translation of the Basque language

    Get PDF
    Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270, 000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics

    Coupling hierarchical word reordering and decoding in phrase-based statistical machine translation

    Get PDF
    In this paper, we start with the existing idea of taking reordering rules automatically derived from syntactic representations, and applying them in a preprocessing step before translation to make the source sentence structurally more like the target; and we propose a new approach to hierarchically extracting these rules. We evaluate this, combined with a lattice-based decoding, and show improvements over stateof-the-art distortion models.Postprint (published version

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Get PDF
    Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor of its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be mostly driven by empirical trials. To orientate the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature in advanced reordering modeling. We then question why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic

    Dependency-based Bilingual Word Embeddings and Neural Machine Translation

    Get PDF
    Bilingual word embeddings, which represent lexicons from various languages in a common embedding space, are critical for facilitating semantic and knowledge trans- fers in a wide range of cross-lingual NLP applications. The significance of learning bilingual word embedding representations in many Natural Language Processing (NLP) tasks motivates us to investigate the effect of many factors, including syntac- tical information, on the learning process for different languages with varying levels of structural complexity. By analysing the components that influence the learning process of bilingual word embeddings (BWEs), this thesis examines some factors for learning bilingual word embeddings effectively. Our findings in this thesis demon- strate that increasing the embedding size for language pairs has a positive impact on the learning process for BWEs. While sentence length depends on the language. Short sentences perform better than long ones in the En-ES experiment. However, by increasing the sentence, En-Ar and En-De experiment achieve improved model accuracy. Arabic segmentation, according to En-Ar experiments, is essential to the learning process for BWEs and can boost model accuracy by up to 10%. Incorporating dependency features into the learning process enhances the trained models performance and results in more improved BWEs in all language pairs. Finally, we investigated how the dependancy-based pretrained BWEs affected the neural machine translation (NMT) model. The findings indicate that in various MT evaluation matrices, the trained dependancy-based NMT models outperform the baseline NMT model
    corecore