17 research outputs found

    Automatic generation of parallel treebanks: an efficient unsupervised system

    Get PDF
    The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this work I introduce a novel open-source platform for the fast and robust automatic generation of parallel treebanks through sub-tree alignment, using a limited amount of external resources. The intrinsic and extrinsic evaluations that I undertook demonstrate that my system is a feasible alternative to the manual annotation of parallel treebanks. Therefore, I expect the presented platform to help boost research in the field of syntaxaugmented machine translation and lead to advancements in other fields where parallel treebanks can be employed

    Unsupervised generation of parallel treebanks through sub-tree alignment

    Get PDF
    The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this paper we introduce an open-source system for fast and robust automatic generation of parallel treebanks. We expect the opening of the presented platform to the scientific community to help boost research in the field of data-oriented machine translation and lead to advancements in other fields where parallel treebanks can be employed

    Highlighting matched and mismatched segments in translation memory output through sub-­tree alignment

    Get PDF
    In recent years, it is becoming more and more clear that the localisation industry does not have the necessary manpower to satisfy the increasing demand for high-quality translation. This has fuelled the search new and existing technologies that would increase translator throughput. As Translation Memory (TM) systems are the most commonly employed tool by translators, a number of enhancements are available to assist them in their job. One such enhancement would be to show the translator which parts of the sentence that needs to be translated match which parts of the fuzzy match suggested by the TM. For this information to be used, however, the translators have to carry it over to the TM translation themselves. In this paper, we present a novel methodology that can automatically detect and highlight the segments that need to be modified in a TM-­suggested translation. We base it on state-­of-the-art sub-­tree align- ment technology (Zhechev,2010) that can produce aligned phrase-­based-­tree pairs from unannotated data. Our system operates in a three-­step process. First, the fuzzy match selected by the TM and its translation are aligned. This lets us know which segments of the source-­language sentence correspond to which segments in its translation. In the second step, the fuzzy match is aligned to the input sentence that is currently being translated. This tells us which parts of the input sentence are available in the fuzzy match and which still need to be translated. In the third step, the fuzzy match is used as an intermediary, through which the alignments between the input sentence and the TM translation are established. In this way, we can detect with precision the segments in the suggested translation that the translator needs to edit and highlight them appropriately to set them apart from the segments that are already good translations for parts of the input sentence. Additionally, we can show the alignments—as detected by our system—between the input and the translation, which will make it even easier for the translator to post-edit the TM suggestion. This alignment information can additionally be used to pre- translate the mismatched segments, further reducing the post-­editing load

    Evaluating syntax-driven approaches to phrase extraction for MT

    Get PDF
    In this paper, we examine a number of different phrase segmentation approaches for Machine Translation and how they perform when used to supplement the translation model of a phrase-based SMT system. This work represents a summary of a number of years of research carried out at Dublin City University in which it has been found that improvements can be made using hybrid translation models. However, the level of improvement achieved is dependent on the amount of training data used. We describe the various approaches to phrase segmentation and combination explored, and outline a series of experiments investigating the relative merits of each method

    Parallel Treebanks in Phrase-Based Statistical Machine Translation

    Get PDF
    Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for the current state-of-the-art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no other means to build them than by hand. In this paper, we describe how we make use of new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PBSMT) system leads to significant improvements in translation quality. We describe further experiments on incorporating parallel treebank information into PBSMT, such as word alignments. We investigate the conditions under which the incorporation of parallel treebank data performs optimally. Finally, we discuss the potential of parallel treebanks in other paradigms of MT

    Seeding statistical machine translation with translation memory output through tree-based structural alignment

    Get PDF
    With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and seed with them an SMT system to produce a final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user with markup highlighting the segments of the translation that need to be checked manually for correctness

    Tree Alignment through Semantic Role Annotation Projection

    Get PDF
    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 73-82. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Discovery of Ambiguous and Unambiguous Discourse Connectives via Annotation Projection

    Get PDF
    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 83-92. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Resourcing machine translation with parallel treebanks

    Get PDF
    The benefits of syntax-based approaches to data-driven machine translation (MT) are clear: given the right model, a combination of hierarchical structure, constituent labels and morphological information can be exploited to produce more fluent, grammatical translation output. This has been demonstrated by the recent shift in research focus towards such linguistically motivated approaches. However, one issue facing developers of such models that is not encountered in the development of state-of-the-art string-based statistical MT (SMT) systems is the lack of available syntactically annotated training data for many languages. In this thesis, we propose a solution to the problem of limited resources for syntax-based MT by introducing a novel sub-sentential alignment algorithm for the induction of translational equivalence links between pairs of phrase structure trees. This algorithm, which operates on a language pair-independent basis, allows for the automatic generation of large-scale parallel treebanks which are useful not only for machine translation, but also across a variety of natural language processing tasks. We demonstrate the viability of our automatically generated parallel treebanks by means of a thorough evaluation process during which they are compared to a manually annotated gold standard parallel treebank both intrinsically and in an MT task. Following this, we hypothesise that these parallel treebanks are not only useful in syntax-based MT, but also have the potential to be exploited in other paradigms of MT. To this end, we carry out a large number of experiments across a variety of data sets and language pairs, in which we exploit the information encoded within the parallel treebanks in various components of phrase-based statistical MT systems. We demonstrate that improvements in translation accuracy can be achieved by enhancing SMT phrase tables with linguistically motivated phrase pairs extracted from a parallel treebank, while showing that a number of other features in SMT can also be supplemented with varying degrees of effectiveness. Finally, we examine ways in which synchronous grammars extracted from parallel treebanks can improve the quality of translation output, focussing on real translation examples from a syntax-based MT system

    Maximising TM performance through sub-tree alignment and SMT

    Get PDF
    With the steadily increasing demand for high quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, in particular focusing on the use of high-quality Statistical Machine Translation (SMT) supplementing the established Translation Memory (TM) technology. In this paper, we present a novel modular approach that utilises state-of-the-art sub-tree alignment and SMT techniques to turn the fuzzy matches from a TM into near perfect translations. Rather than relegate SMT to a last-resort status where it is only used should the TM system fail to produce the desired output, for us SMT is an integral part of the translation process that we rely on to obtain high-quality results. We show that the presented system consistently produces better quality output than the TM and performs on par or better than the standalone SMT system
    corecore