4 research outputs found

    The integration of machine translation and translation memory

    We design and evaluate several models for integrating Machine Translation (MT) output into a Translation Memory (TM) environment to facilitate the adoption of MT technology in the localization industry. We begin with integration at the segment level via translation recommendation and translation reranking. Given an input to be translated, our translation recommendation model compares the output from the MT and TM systems and presents the better one to the post-editor. Our translation reranking model combines k-best lists from both systems and generates a new list ordered by estimated post-editing effort. We perform both automatic and human evaluation of these models. When measured against the consensus of human judgement, the recommendation model obtains 0.91 precision at 0.93 recall, and the reranking model obtains 0.86 precision at 0.59 recall. The high precision of these models indicates that they can be integrated into TM environments without the risk of deteriorating the quality of the post-editing candidate, and can thereby preserve TM assets and the established cost estimation methods associated with TMs. We then explore methods for a deeper integration of TM and MT at the sub-segment level. We predict whether phrase pairs derived from fuzzy matches can be used to constrain the translation of an input segment. Using a series of novel linguistically motivated features, our constraints lead both to more consistent translation output and to improved translation quality, reflected by a 1.2 improvement in BLEU score and a 0.72 reduction in TER score, both statistically significant (p < 0.01).
    In sum, our work makes three contributions: 1) translation recommendation and translation reranking models that can access high-quality MT outputs in the TM environment, 2) a sub-segment TM and MT integration model that improves both translation consistency and translation quality, and 3) a human evaluation pipeline that validates the effectiveness of our models against human judgements.

    Differences between Human and Machine-generated Institutional Translations: A comparative analysis using quantitative methods

    Machine translation (MT) has gained popularity in recent years; however, it has not yet reached the quality and naturalness of human writing. The present thesis explores how human and automatic English translations of Greek institutional texts differ by comparing quantitative linguistic characteristics of the two translation types.
    Statistical analysis using independent-samples t-tests revealed that the two corpora differed in a range of linguistic features, including descriptive characteristics (e.g. word length), word information (e.g. parts of speech, word frequency), lexical diversity, syntax and cohesion; however, the degree of variation was not striking. In a follow-up experiment, a Multilayer Perceptron neural network correctly classified almost 82% of the texts as automatic or human-produced. These results suggest that the differences between human translation (HT) and MT in this subgenre are detectable using machine learning techniques, but the distinction is not as clear-cut as expected. Further research is needed to determine whether the text properties that differ most between the two corpora can be used effectively as predictors of translation quality.

    Irish dependency treebanking and parsing

    Despite enjoying the status of an official EU language, Irish is considered a minority language. As with most minority languages, it is a 'low-density' language, which means it lacks important linguistic and Natural Language Processing (NLP) resources. Relative to better-resourced languages such as English or French, little research has been carried out on the computational analysis or processing of Irish. Parsing is the process of analysing the linguistic structure of text, and it is an invaluable processing step required by many types of language technology applications. As a verb-initial language, Irish has several features that are uncharacteristic of many languages previously studied in parsing research. Our work broadens the application of NLP methods to less studied language structures and provides a basis on which future work in Irish NLP is possible. We report on the development of a dependency treebank that serves as training data for the first full Irish dependency parser. We discuss the linguistic structures of Irish and the motivation behind the design of our annotation scheme. Our work also examines various semi-automated approaches to treebank development. With these approaches, we overcome the relatively small pool of linguistic and technological resources available for the Irish language, and show that even in the early stages of development, parsing results for Irish are promising. What counts as a sufficient number of trees for training a parser varies from language to language. Through empirical methods, we explore the impact that our treebank's size and content have on parsing accuracy for Irish. We also discuss our work in cross-lingual studies, in which we convert our treebank to a universal annotation scheme. Finally, we extend our Irish NLP work to the unstructured user-generated text of Irish tweets.
    We report on the creation of a POS-tagged corpus of Irish tweets and the training of statistical POS-tagging models. We show how existing resources can be leveraged for this domain-adapted resource development.