
    Unsupervised generation of parallel treebanks through sub-tree alignment

    The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this paper we introduce an open-source system for fast and robust automatic generation of parallel treebanks. We expect the opening of the presented platform to the scientific community to help boost research in the field of data-oriented machine translation and lead to advancements in other fields where parallel treebanks can be employed

    Seeding statistical machine translation with translation memory output through tree-based structural alignment

    With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and seed an SMT system with them to produce a final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user, with markup highlighting the segments of the translation that need to be checked manually for correctness.

    Highlighting matched and mismatched segments in translation memory output through sub-tree alignment

    In recent years, it has become increasingly clear that the localisation industry does not have the necessary manpower to satisfy the increasing demand for high-quality translation. This has fuelled the search for new and existing technologies that would increase translator throughput. As Translation Memory (TM) systems are the tool most commonly employed by translators, a number of enhancements are available to assist them in their job. One such enhancement would be to show the translator which parts of the sentence that needs to be translated match which parts of the fuzzy match suggested by the TM. For this information to be used, however, the translators have to carry it over to the TM translation themselves. In this paper, we present a novel methodology that can automatically detect and highlight the segments that need to be modified in a TM-suggested translation. We base it on state-of-the-art sub-tree alignment technology (Zhechev, 2010) that can produce aligned phrase-based-tree pairs from unannotated data. Our system operates in a three-step process. First, the fuzzy match selected by the TM and its translation are aligned. This lets us know which segments of the source-language sentence correspond to which segments in its translation. In the second step, the fuzzy match is aligned to the input sentence that is currently being translated. This tells us which parts of the input sentence are available in the fuzzy match and which still need to be translated. In the third step, the fuzzy match is used as an intermediary, through which the alignments between the input sentence and the TM translation are established. In this way, we can detect with precision the segments in the suggested translation that the translator needs to edit and highlight them appropriately to set them apart from the segments that are already good translations for parts of the input sentence.
Additionally, we can show the alignments, as detected by our system, between the input and the translation, which will make it even easier for the translator to post-edit the TM suggestion. This alignment information can also be used to pre-translate the mismatched segments, further reducing the post-editing load.
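
The three-step process in the abstract above can be illustrated with a minimal sketch. This is not the authors' system: it assumes, for simplicity, that each alignment is a flat set of token-index pairs rather than the aligned sub-tree pairs the paper works with, and the function names, the example sentences, and the alignment pairs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumption: an alignment is a set of (left_token_index, right_token_index) pairs.
Alignment = Set[Tuple[int, int]]


def compose_alignments(input_to_fuzzy: Alignment,
                       fuzzy_to_target: Alignment) -> Alignment:
    """Step 3: link input tokens to TM-translation tokens via the fuzzy match."""
    targets_by_fuzzy: Dict[int, Set[int]] = {}
    for fuzzy_idx, target_idx in fuzzy_to_target:
        targets_by_fuzzy.setdefault(fuzzy_idx, set()).add(target_idx)
    return {
        (input_idx, target_idx)
        for input_idx, fuzzy_idx in input_to_fuzzy
        for target_idx in targets_by_fuzzy.get(fuzzy_idx, ())
    }


def target_tokens_to_edit(input_to_fuzzy: Alignment,
                          fuzzy_to_target: Alignment) -> List[int]:
    """Target tokens whose fuzzy-match source token has no counterpart in the input."""
    matched_fuzzy = {fuzzy_idx for _, fuzzy_idx in input_to_fuzzy}
    return sorted({t for f, t in fuzzy_to_target if f not in matched_fuzzy})


# Hypothetical example data (indices refer to positions in each token list).
input_sentence = ["click", "the", "red", "button"]
fuzzy_match = ["click", "the", "green", "button"]
tm_translation = ["klicken", "Sie", "auf", "die", "grüne", "Schaltfläche"]

# Step 1: fuzzy match aligned to its stored translation.
fuzzy_to_target: Alignment = {(0, 0), (1, 3), (2, 4), (3, 5)}
# Step 2: input sentence aligned to the fuzzy match ("red" has no match).
input_to_fuzzy: Alignment = {(0, 0), (1, 1), (3, 3)}

input_to_target = compose_alignments(input_to_fuzzy, fuzzy_to_target)
to_edit = target_tokens_to_edit(input_to_fuzzy, fuzzy_to_target)
highlighted = [tm_translation[i] for i in to_edit]
```

In this toy case the composed alignment links "click", "the" and "button" to their German counterparts, while "grüne" is flagged for editing because its source token "green" has no match in the input sentence.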

    Building and querying parallel treebanks

    This paper describes our work on building a trilingual parallel treebank. We have annotated constituent structure trees from three text genres (a philosophy novel, economy reports and a technical user manual). Our parallel treebank includes word and phrase alignments. The alignment information was manually checked using a graphical tool that allows the annotator to view a pair of trees from parallel sentences. This tool comes with a powerful search facility which supersedes the expressivity of previous popular treebank query engines

    Bootstrapping parallel treebanks

    This paper argues for the development of parallel treebanks. It summarizes the work done in this area and reports on experiments for building a Swedish-German treebank. It also describes our approach for reusing resources from one language while annotating another language.

    Annotating a Parallel Monolingual Treebank with Semantic Similarity Relations

    Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 85-96. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/4476

    Annotation, exploitation and evaluation of parallel corpora

    Exchange between the translation studies and the computational linguistics communities has traditionally not been very intense. Among other things, this is reflected by the different views on parallel corpora. While computational linguistics does not always strictly pay attention to the translation direction (e.g. when translation rules are extracted from (sub)corpora which actually only consist of translations), translation studies are amongst other things concerned with exactly comparing source and target texts (e.g. to draw conclusions on interference and standardization effects). However, there has recently been more exchange between the two fields – especially when it comes to the annotation of parallel corpora. This special issue brings together the different research perspectives. Its contributions show – from both perspectives – how the communities have come to interact in recent years

    Using the Stockholm TreeAligner

    Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 73-78. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/4476