78 research outputs found

    Highlighting matched and mismatched segments in translation memory output through sub-­tree alignment

    Get PDF
    In recent years, it is becoming more and more clear that the localisation industry does not have the necessary manpower to satisfy the increasing demand for high-quality translation. This has fuelled the search new and existing technologies that would increase translator throughput. As Translation Memory (TM) systems are the most commonly employed tool by translators, a number of enhancements are available to assist them in their job. One such enhancement would be to show the translator which parts of the sentence that needs to be translated match which parts of the fuzzy match suggested by the TM. For this information to be used, however, the translators have to carry it over to the TM translation themselves. In this paper, we present a novel methodology that can automatically detect and highlight the segments that need to be modified in a TM-­suggested translation. We base it on state-­of-the-art sub-­tree align- ment technology (Zhechev,2010) that can produce aligned phrase-­based-­tree pairs from unannotated data. Our system operates in a three-­step process. First, the fuzzy match selected by the TM and its translation are aligned. This lets us know which segments of the source-­language sentence correspond to which segments in its translation. In the second step, the fuzzy match is aligned to the input sentence that is currently being translated. This tells us which parts of the input sentence are available in the fuzzy match and which still need to be translated. In the third step, the fuzzy match is used as an intermediary, through which the alignments between the input sentence and the TM translation are established. In this way, we can detect with precision the segments in the suggested translation that the translator needs to edit and highlight them appropriately to set them apart from the segments that are already good translations for parts of the input sentence. Additionally, we can show the alignments—as detected by our system—between the input and the translation, which will make it even easier for the translator to post-edit the TM suggestion. This alignment information can additionally be used to pre- translate the mismatched segments, further reducing the post-­editing load

    Unsupervised generation of parallel treebanks through sub-tree alignment

    Get PDF
    The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this paper we introduce an open-source system for fast and robust automatic generation of parallel treebanks. We expect the opening of the presented platform to the scientific community to help boost research in the field of data-oriented machine translation and lead to advancements in other fields where parallel treebanks can be employed

    Seeding statistical machine translation with translation memory output through tree-based structural alignment

    Get PDF
    With the steadily increasing demand for high-quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, with the current focus on the use of high-quality Statistical Machine Translation (SMT) as a supplement to the established Translation Memory (TM) technology. In this paper we present a novel modular approach that utilises state-of-the-art sub-tree alignment to pick out pre-translated segments from a TM match and seed with them an SMT system to produce a final translation. We show that the presented system can outperform pure SMT when a good TM match is found. It can also be used in a Computer-Aided Translation (CAT) environment to present almost perfect translations to the human user with markup highlighting the segments of the translation that need to be checked manually for correctness

    Capturing translational divergences with a statistical tree-to-tree aligner

    Get PDF
    Parallel treebanks, which comprise paired source-target parse trees aligned at sub-sentential level, could be useful for many applications, particularly data-driven machine translation. In this paper, we focus on how translational divergences are captured within a parallel treebank using a fully automatic statistical tree-to-tree aligner. We observe that while the algorithm performs well at the phrase level, performance on lexical-level alignments is compromised by an inappropriate bias towards coverage rather than precision. This preference for high precision rather than broad coverage in terms of expressing translational divergences through tree-alignment stands in direct opposition to the situation for SMT word-alignment models. We suggest that this has implications not only for tree-alignment itself but also for the broader area of induction of syntaxaware models for SMT

    Automatic generation of parallel treebanks: an efficient unsupervised system

    Get PDF
    The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this work I introduce a novel open-source platform for the fast and robust automatic generation of parallel treebanks through sub-tree alignment, using a limited amount of external resources. The intrinsic and extrinsic evaluations that I undertook demonstrate that my system is a feasible alternative to the manual annotation of parallel treebanks. Therefore, I expect the presented platform to help boost research in the field of syntaxaugmented machine translation and lead to advancements in other fields where parallel treebanks can be employed

    Robust language pair-independent sub-tree alignment

    Get PDF
    Data-driven approaches to machine translation (MT) achieve state-of-the-art results. Many syntax-aware approaches, such as Example-Based MT and Data-Oriented Translation, make use of tree pairs aligned at sub-sentential level. Obtaining sub-sentential alignments manually is time-consuming and error-prone, and requires expert knowledge of both source and target languages. We propose a novel, language pair-independent algorithm which automatically induces alignments between phrase-structure trees. We evaluate the alignments themselves against a manually aligned gold standard, and perform an extrinsic evaluation by using the aligned data to train and test a DOT system. Our results show that translation accuracy is comparable to that of the same translation system trained on manually aligned data, and coverage improves

    Computing simplicial representatives of homotopy group elements

    Get PDF
    A central problem of algebraic topology is to understand the homotopy groups πd(X)\pi_d(X) of a topological space XX. For the computational version of the problem, it is well known that there is no algorithm to decide whether the fundamental group π1(X)\pi_1(X) of a given finite simplicial complex XX is trivial. On the other hand, there are several algorithms that, given a finite simplicial complex XX that is simply connected (i.e., with π1(X)\pi_1(X) trivial), compute the higher homotopy group πd(X)\pi_d(X) for any given d2d\geq 2. %The first such algorithm was given by Brown, and more recently, \v{C}adek et al. However, these algorithms come with a caveat: They compute the isomorphism type of πd(X)\pi_d(X), d2d\geq 2 as an \emph{abstract} finitely generated abelian group given by generators and relations, but they work with very implicit representations of the elements of πd(X)\pi_d(X). Converting elements of this abstract group into explicit geometric maps from the dd-dimensional sphere SdS^d to XX has been one of the main unsolved problems in the emerging field of computational homotopy theory. Here we present an algorithm that, given a~simply connected space XX, computes πd(X)\pi_d(X) and represents its elements as simplicial maps from a suitable triangulation of the dd-sphere SdS^d to XX. For fixed dd, the algorithm runs in time exponential in size(X)size(X), the number of simplices of XX. Moreover, we prove that this is optimal: For every fixed d2d\geq 2, we construct a family of simply connected spaces XX such that for any simplicial map representing a generator of πd(X)\pi_d(X), the size of the triangulation of SdS^d on which the map is defined, is exponential in size(X)size(X)

    IST Austria Thesis

    Get PDF
    The first part of the thesis considers the computational aspects of the homotopy groups πd(X) of a topological space X. It is well known that there is no algorithm to decide whether the fundamental group π1(X) of a given finite simplicial complex X is trivial. On the other hand, there are several algorithms that, given a finite simplicial complex X that is simply connected (i.e., with π1(X) trivial), compute the higher homotopy group πd(X) for any given d ≥ 2. However, these algorithms come with a caveat: They compute the isomorphism type of πd(X), d ≥ 2 as an abstract finitely generated abelian group given by generators and relations, but they work with very implicit representations of the elements of πd(X). We present an algorithm that, given a simply connected space X, computes πd(X) and represents its elements as simplicial maps from suitable triangulations of the d-sphere Sd to X. For fixed d, the algorithm runs in time exponential in size(X), the number of simplices of X. Moreover, we prove that this is optimal: For every fixed d ≥ 2, we construct a family of simply connected spaces X such that for any simplicial map representing a generator of πd(X), the size of the triangulation of S d on which the map is defined, is exponential in size(X). In the second part of the thesis, we prove that the following question is algorithmically undecidable for d < ⌊3(k+1)/2⌋, k ≥ 5 and (k, d) ̸= (5, 7), which covers essentially everything outside the meta-stable range: Given a finite simplicial complex K of dimension k, decide whether there exists a piecewise-linear (i.e., linear on an arbitrarily fine subdivision of K) embedding f : K ↪→ Rd of K into a d-dimensional Euclidean space

    Maximising TM performance through sub-tree alignment and SMT

    Get PDF
    With the steadily increasing demand for high quality translation, the localisation industry is constantly searching for technologies that would increase translator throughput, in particular focusing on the use of high-quality Statistical Machine Translation (SMT) supplementing the established Translation Memory (TM) technology. In this paper, we present a novel modular approach that utilises state-of-the-art sub-tree alignment and SMT techniques to turn the fuzzy matches from a TM into near perfect translations. Rather than relegate SMT to a last-resort status where it is only used should the TM system fail to produce the desired output, for us SMT is an integral part of the translation process that we rely on to obtain high-quality results. We show that the presented system consistently produces better quality output than the TM and performs on par or better than the standalone SMT system
    corecore