64 research outputs found

    Combining semantic and syntactic generalization in example-based machine translation

    In this paper, we report our experiments in combining two EBMT systems that rely on generalized templates, Marclator and CMU-EBMT, on an English–German translation task. Our goal was to see whether a statistically significant improvement could be achieved over the individual performances of these two systems. We observed that this was not the case. However, our system consistently outperformed a lexical EBMT baseline system

    Using same-language machine translation to create alternative target sequences for text-to-speech synthesis

    Modern speech synthesis systems attempt to produce speech utterances from an open domain of words. In some situations, the synthesiser will not have the appropriate units to pronounce some words or phrases accurately, but it must still attempt to pronounce them. This paper presents a hybrid machine translation and unit selection speech synthesis system. The machine translation system was trained with English as both the source and the target language. Rather than only saying the input text, as would happen in conventional synthesis systems, the synthesiser may say an alternative utterance with the same meaning. This method allows the synthesiser to overcome the problem of insufficient units at runtime

    Hybrid example-based SMT: the best of both worlds?

    Way and Gough (2005) provide an in-depth comparison of their Example-Based Machine Translation (EBMT) system with a Statistical Machine Translation (SMT) system constructed from freely available tools. According to a wide variety of automatic evaluation metrics, they demonstrated that their EBMT system outperformed the SMT system by a factor of two to one. Nevertheless, they did not test their EBMT system against a phrase-based SMT system. Obtaining their training and test data for English–French, we carry out a number of experiments using the Pharaoh SMT Decoder. While better results are seen when Pharaoh is seeded with Giza++ word- and phrase-based data compared to EBMT sub-sentential alignments, in general better results are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the EBMT system of Gough and Way (2004b) outperforms any flavour of the phrase-based SMT systems constructed in our experiments, combining the data sets automatically induced by both Giza++ and their EBMT system leads to a hybrid system which improves on the EBMT system per se for French–English

    Marker-based filtering of bilingual phrase pairs for SMT

    State-of-the-art statistical machine translation systems make use of a large translation table obtained after scoring a set of bilingual phrase pairs automatically extracted from a parallel corpus. The number of bilingual phrase pairs extracted from a pair of aligned sentences grows exponentially as the length of the sentences increases; therefore, the number of entries in the phrase table used to carry out the translation may become unmanageable, especially when online, 'on demand' translation is required in real time. We describe the use of closed-class words to filter the set of bilingual phrase pairs extracted from the parallel corpus by taking into account the alignment information and the type of the words involved in the alignments. On four European language pairs, we show that our simple yet novel approach can filter the phrase table by up to a third yet still provide competitive results compared to the baseline. Furthermore, it provides a nice balance between the unfiltered approach and pruning using stop words, where the deterioration in translation quality is unacceptably high
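The filtering idea in the abstract above can be illustrated with a minimal sketch. The abstract does not state the exact filtering criterion, so the rule used here is an assumption made purely for illustration: a phrase pair is kept only if every alignment link connects two closed-class words or two open-class words. The word lists `EN_MARKERS` and `FR_MARKERS`, the `PhrasePair` layout, and the function names are all hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# Hypothetical, tiny closed-class ("marker") word lists, for illustration only.
EN_MARKERS: Set[str] = {"the", "a", "of", "in", "to", "and"}
FR_MARKERS: Set[str] = {"le", "la", "un", "une", "de", "dans", "et"}

# A phrase pair: source tokens, target tokens, and word-alignment links
# given as (source_index, target_index) tuples. This layout is an assumption.
PhrasePair = Tuple[List[str], List[str], List[Tuple[int, int]]]


def is_consistent(pair: PhrasePair) -> bool:
    """Assumed rule: every link joins words of the same class (closed or open)."""
    src, tgt, alignment = pair
    for i, j in alignment:
        src_closed = src[i].lower() in EN_MARKERS
        tgt_closed = tgt[j].lower() in FR_MARKERS
        if src_closed != tgt_closed:
            return False
    return True


def filter_phrase_pairs(pairs: List[PhrasePair]) -> List[PhrasePair]:
    """Keep only the phrase pairs that satisfy the assumed consistency rule."""
    return [pair for pair in pairs if is_consistent(pair)]


if __name__ == "__main__":
    candidates: List[PhrasePair] = [
        (["the", "house"], ["la", "maison"], [(0, 0), (1, 1)]),
        (["the", "house"], ["maison", "de"], [(0, 0), (1, 1)]),
    ]
    for src, tgt, _ in filter_phrase_pairs(candidates):
        print(" ".join(src), "|||", " ".join(tgt))
```

In this toy input the second candidate is dropped because it links the closed-class word "the" to the open-class word "maison". A real filter would need full closed-class lexicons for each language and the criterion actually described in the paper.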

    Lost in translation: the problems of using mainstream MT evaluation metrics for sign language translation

    In this paper we consider the problems of applying corpus-based techniques to minority languages that are neither politically recognised nor have a formally accepted writing system, namely sign languages. We discuss the adoption of an annotated form of sign language data as a suitable corpus for the development of a data-driven machine translation (MT) system, and deal with issues that arise from its use. Useful software tools that facilitate easy annotation of video data are also discussed. Furthermore, we address the problems of using traditional MT evaluation metrics for sign language translation. Based on the candidate translations produced from our example-based machine translation system, we discuss why standard metrics fall short of providing an accurate evaluation and suggest more suitable evaluation methods

    Teaching and assessing empirical approaches to machine translation

    Empirical methods in Natural Language Processing (NLP) and Machine Translation (MT) have become mainstream in the research field. Accordingly, it is important that the tools and techniques in these paradigms be taught to potential future researchers and developers in University courses. While many dedicated courses on Statistical NLP can be found, there are few, if any courses on Empirical Approaches to MT. This paper presents the development and assessment of one such course as taught to final year undergraduates taking a degree in NLP

    Active Learning in Example-Based Machine Translation

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 227-230. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206

    Getting Past the Language Gap: Innovations in Machine Translation

    In this chapter, we will be reviewing state of the art machine translation systems, and will discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods had been introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, the advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. It also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable the working of Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT

    Hybrid data-driven models of machine translation

    Corpus-based approaches to Machine Translation (MT) dominate the MT research field today, with Example-Based MT (EBMT) and Statistical MT (SMT) representing two different frameworks within the data-driven paradigm. EBMT has always made use of both phrasal and lexical correspondences to produce high-quality translations. Early SMT models, on the other hand, were based on word-level correspondences, but with the advent of more sophisticated phrase-based approaches, the line between EBMT and SMT has become increasingly blurred. In this thesis we carry out a number of translation experiments comparing the performance of the state-of-the-art marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005) against a phrase-based SMT (PBSMT) system built using the state-of-the-art PHARAOH phrase-based decoder (Koehn, 2004a) and employing standard phrasal extraction heuristics (Koehn et al., 2003). In addition, we describe experiments investigating the possibility of combining elements of EBMT and SMT in order to create a hybrid data-driven model of MT capable of outperforming either approach from which it is derived. Making use of training and testing data taken from a French-English translation memory of Sun Microsystems computer documentation, we find that while better results are seen when the PBSMT system is seeded with GIZA++ word- and phrase-based data compared to EBMT marker-based sub-sentential alignments, in general improvements are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the baseline marker-based EBMT system outperforms any flavour of the PBSMT systems constructed in these experiments, combining the data sets automatically induced by both GIZA++ and the EBMT system leads to a hybrid system which improves on the EBMT system per se for French-English.
    On a different data set, taken from the Europarl corpus (Koehn, 2005), we perform a number of experiments making use of incremental training data sizes of 78K, 156K and 322K sentence pairs. On this data set, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline EBMT system. This time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in the Sun Microsystems experiments, to create a hybrid 'example-based SMT' system, outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid data-driven approaches by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality. Following on from these findings we present a new hybrid data-driven MT architecture, together with a novel marker-based decoder which improves upon the performance of the marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005), and compares favourably with the state-of-the-art PHARAOH SMT decoder (Koehn, 2004a)