57 research outputs found

    Hybrid example-based SMT: the best of both worlds?

    Get PDF
    (Way and Gough, 2005) provide an indepth comparison of their Example-Based Machine Translation (EBMT) system with a Statistical Machine Translation (SMT) system constructed from freely available tools. According to a wide variety of automatic evaluation metrics, they demonstrated that their EBMT system outperformed the SMT system by a factor of two to one. Nevertheless, they did not test their EBMT system against a phrase-based SMT system. Obtaining their training and test data for English–French, we carry out a number of experiments using the Pharaoh SMT Decoder. While better results are seen when Pharaoh is seeded with Giza++ word- and phrase-based data compared to EBMT sub-sentential alignments, in general better results are obtained when combinations of this 'hybrid' data is used to construct the translation and probability models. While for the most part the EBMT system of (Gough & Way, 2004b) outperforms any flavour of the phrasebased SMT systems constructed in our experiments, combining the data sets automatically induced by both Giza++ and their EBMT system leads to a hybrid system which improves on the EBMT system per se for French–English

    Hybridity in MT: experiments on the Europarl corpus

    Get PDF
    (Way & Gough, 2005) demonstrate that their Marker-based EBMT system is capable of outperforming a word-based SMT system trained on reasonably large data sets. (Groves & Way, 2005) take this a stage further and demonstrate that while the EBMT system also outperforms a phrase-based SMT (PBSMT) system, a hybrid 'example-based SMT' system incorporating marker chunks and SMT sub-sentential alignments is capable of outperforming both baseline translation models for French{English translation. In this paper, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline system of (Way & Gough, 2005). Using the Europarl (Koehn, 2005) training and test sets we show that this time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in (Groves & Way, 2005), to create a hybrid 'example-based SMT' system, outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid systems by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive e®ect on translation quality

    Hybrid data-driven models of machine translation

    Get PDF
    Corpus-based approaches to Machine Translation (MT) dominate the MT research field today, with Example-Based MT (EBMT) and Statistical MT (SMT) representing two different frameworks within the data-driven paradigm. EBMT has always made use of both phrasal and lexical correspondences to produce high-quality translations. Early SMT models, on the other hand, were based on word-level correpsondences, but with the advent of more sophisticated phrase-based approaches, the line between EBMT and SMT has become increasingly blurred. In this thesis we carry out a number of translation experiments comparing the performance of the state-of-the-art marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005) against a phrase-based SMT (PBSMT) system built using the state-of-the-art PHARAOphHra se-based decoder (Koehn, 2004a) and employing standard phrasal extraction in euristics (Koehn et al., 2003). In additin e describe experiments investigating the possibility of combining elements of EBMT and SMT in order to create a hybrid data-driven model of MT capable of outperforming either approach from which it is derived. Making use of training and testlng data taken from a French-Enghsh translation memory of Sun Microsystems computer documentation, we find that while better results are seen when the PBSMT system is seeded with GIZA++ word- and phrasebased data compared to EBMT marker-based sub-sentential alignments, in general improvements are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the baseline marker-based EBMT system outperforms any flavour of the PBSbIT systems constructed in these experiments, combining the data sets automatically induced by both GIZA++ and the EBMT system leads to a hybrid system which improves on the EBMT system per se for French-English. On a different data set, taken from the Europarl corpus (Koehn, 2005), we perform a number of experiments maklng use of incremental training data sizes of 78K, 156K and 322K sentence pairs. On this data set, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline EBMT system. This time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in the Sun Mzcrosystems experiments, to create a hybrid 'example-based SMT' system, outperforms the baseline SMT and EBMT systems from which it is derlved. Furthermore, we provide further evidence in favour of hybrid data-dr~ven approaches by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality. Following on from these findings we present a new hybrid data-driven MT architecture, together with a novel marker-based decoder which improves upon the performance of the marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005), and compares favourably with the stateof-the-art PHARAOH SMHT decoder (Koehn, 2004a)

    Example-based machine translation of the Basque language

    Get PDF
    Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270, 000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics

    Contextual bitext-derived paraphrases in automatic MT evaluation

    Get PDF
    In this paper we present a novel method for deriving paraphrases during automatic MT evaluation using only the source and reference texts, which are necessary for the evaluation, and word and phrase alignment software. Using target language paraphrases produced through word and phrase alignment a number of alternative reference sentences are constructed automatically for each candidate translation. The method produces lexical and lowlevel syntactic paraphrases that are relevant to the domain in hand, does not use external knowledge resources, and can be combined with a variety of automatic MT evaluation system

    Evaluating syntax-driven approaches to phrase extraction for MT

    Get PDF
    In this paper, we examine a number of different phrase segmentation approaches for Machine Translation and how they perform when used to supplement the translation model of a phrase-based SMT system. This work represents a summary of a number of years of research carried out at Dublin City University in which it has been found that improvements can be made using hybrid translation models. However, the level of improvement achieved is dependent on the amount of training data used. We describe the various approaches to phrase segmentation and combination explored, and outline a series of experiments investigating the relative merits of each method

    Wrapper syntax for example-based machine translation

    Get PDF
    TransBooster is a wrapper technology designed to improve the performance of wide-coverage machine translation systems. Using linguistically motivated syntactic information, it automatically decomposes source language sentences into shorter and syntactically simpler chunks, and recomposes their translation to form target language sentences. This generally improves both the word order and lexical selection of the translation. To date, TransBooster has been successfully applied to rule-based MT, statistical MT, and multi-engine MT. This paper presents the application of TransBooster to Example-Based Machine Translation. In an experiment conducted on test sets extracted from Europarl and the Penn II Treebank we show that our method can raise the BLEU score up to 3.8% relative to the EBMT baseline. We also conduct a manual evaluation, showing that TransBooster-enhanced EBMT produces a better output in terms of fluency than the baseline EBMT in 55% of the cases and in terms of accuracy in 53% of the cases

    A syntactic skeleton for statistical machine translation

    Get PDF
    We present a method for improving statistical machine translation performance by using linguistically motivated syntactic information. Our algorithm recursively decomposes source language sentences into syntactically simpler and shorter chunks, and recomposes their translation to form target language sentences. This improves both the word order and lexical selection of the translation. We report statistically significant relative improvementsof 3.3% BLEU score in an experiment (English!Spanish) carried out on an 800-sentence test set extracted from the Europarl corpus

    OpenMaTrEx: a free/open-source marker-driven example-based machine translation system

    Get PDF
    We describe OpenMaTrEx, a free/open-source example based machine translation (EBMT) system based on the marker hypothesis, comprising a marker-driven chunker, a collection of chunk aligners, and two engines: one based on a simple proof-of-concept monotone EBMT recombinator and a Moses-based statistical decoder. OpenMaTrEx is a free/open-source release of the basic components of MaTrEx, the Dublin City University machine translation system

    Generating Gender Augmented Data for NLP

    Get PDF
    Gender bias is a frequent occurrence in NLP-based applications, especially pronounced in gender-inflected languages. Bias can appear through associations of certain adjectives and animate nouns with the natural gender of referents, but also due to unbalanced grammatical gender frequencies of inflected words. This type of bias becomes more evident in generating conversational utterances where gender is not specified within the sentence, because most current NLP applications still work on a sentence-level context. As a step towards more inclusive NLP, this paper proposes an automatic and generalisable rewriting approach for short conversational sentences. The rewriting method can be applied to sentences that, without extra-sentential context, have multiple equivalent alternatives in terms of gender. The method can be applied both for creating gender balanced outputs as well as for creating gender balanced training data. The proposed approach is based on a neural machine translation (NMT) system trained to 'translate' from one gender alternative to another. Both the automatic and manual analysis of the approach show promising results for automatic generation of gender alternatives for conversational sentences in Spanish.Comment: 10 pages, 4 table
    corecore