538 research outputs found

    Recycling texts: human evaluation of example-based machine translation subtitles for DVD

    This project focuses on translation reusability in audiovisual contexts. Specifically, it seeks to establish (1) whether target language (TL) subtitles produced by an example-based machine translation (EBMT) system are considered intelligible and acceptable by viewers of movies on DVD, and (2) whether a relationship exists between the ‘profiles’ of the corpora used to train an EBMT system, on the one hand, and viewers’ judgements of the intelligibility and acceptability of the subtitles produced by the system, on the other. The impact of three other factors is also investigated: whether movie-viewing subjects have knowledge of the soundtrack language, subjects’ linguistic background, and subjects’ prior knowledge of the (Harry Potter) movie clips viewed. Corpus profiling is based on measurements (partly using corpus-analysis tools) of three characteristics of the corpora used to train the EBMT system (the independent variables): the number of source language (SL) repetitions they contain, the size of the corpus, and the homogeneity of the corpus. As a quality control measure in this prospective profiling phase, we also elicit human judgements (through a combined questionnaire and interview) on the quality of the corpus data and on the reusability of the TL subtitles in new contexts. The intelligibility and acceptability of EBMT-produced subtitles (the dependent variables) are, in turn, established through end-user evaluation sessions. In these sessions, 44 native German-speaking subjects view short movie clips containing EBMT-generated German subtitles and, following each clip, answer questions (again, through a combined questionnaire and interview) relating to the quality characteristics mentioned above. The findings of the study suggest that an increase in corpus size, along with a concomitant increase in the number of SL repetitions and a decrease in corpus homogeneity, improves the readability of the EBMT-generated subtitles. This change does not, however, have a significant effect on the comprehensibility, style or well-formedness of the EBMT-generated subtitles. Increasing corpus size and SL repetitions also results in a higher number of alternative TL translations in the corpus that are deemed acceptable by evaluators in the corpus profiling phase. The research also finds that subjects are more critical of subtitles when they do not understand the soundtrack language, while subjects’ linguistic background does not have a significant effect on their judgements of the quality of EBMT-generated subtitles. Prior knowledge of the Harry Potter genre, on the other hand, appears to affect how viewing subjects rate the severity of observed errors in the subtitles and how they rate the style of the subtitles, although this effect depends on the training corpus. The introduction of repeated subtitles did not reduce the intelligibility or acceptability of the subtitles. Overall, the findings indicate that the subtitles deemed the most acceptable when evaluated in a non-AVT (audiovisual translation) environment, albeit one in which rich contextual information was available, were the same as the subtitles deemed the most acceptable in an AVT environment, although richer data were gathered from the AVT environment.
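The abstract above names three corpus characteristics (SL repetitions, corpus size, homogeneity) but does not say how they were computed. The following is only a minimal sketch of what such a profile could look like. The function name `profile_corpus`, the sample segments, and the use of a type-token ratio as a rough stand-in for homogeneity are all assumptions made for illustration, not the study's method.

```python
from collections import Counter


def profile_corpus(source_segments):
    """Return simple profile measurements for a list of SL subtitle segments."""
    # Normalise lightly so that case and surrounding whitespace do not hide repeats.
    normalised = [s.strip().lower() for s in source_segments if s.strip()]
    counts = Counter(normalised)
    # A "repetition" here is every occurrence of a segment beyond its first (assumed definition).
    repetitions = sum(c - 1 for c in counts.values())
    tokens = [tok for seg in normalised for tok in seg.split()]
    # Type-token ratio: a crude, assumed proxy for homogeneity (lower = more repeated vocabulary).
    ttr = len(set(tokens)) / len(tokens) if tokens else 0.0
    return {
        "segments": len(normalised),
        "repetitions": repetitions,
        "type_token_ratio": ttr,
    }


# Hypothetical sample segments, invented for this sketch.
sample = ["Come in.", "Thank you.", "come in.", "Where is he?", "Thank you."]
profile = profile_corpus(sample)
print(profile)
```

A real profile would presumably rely on a proper tokeniser and a better-motivated homogeneity measure. The sketch only shows that all three measurements can be derived from the SL side of the corpus alone.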

    Text complexity and text simplification in the crisis management domain

    Because emergency situations can lead to substantial losses, both financial and in terms of human lives, it is essential that texts used in a crisis situation be clearly understandable. This thesis studies the complexity of the crisis management sub-language and methods for producing new, clear texts and for rewriting existing crisis management documents that are too complex to be understood. In doing so, this interdisciplinary study makes several contributions to the crisis management field. First, it contributes to knowledge of the complexity of the texts used in the domain by analysing, in a novel corpus of crisis management documents, the presence of a set of written language complexity issues derived from the psycholinguistic literature. Second, since the text complexity analysis shows that crisis management documents do exhibit high numbers of text complexity issues, the thesis adapts controlled language writing guidelines to English; when applied to crisis management language, these guidelines reduce its complexity and ambiguity and lead to clear documents. Third, since poor communication can have fatal consequences in emergency situations, the proposed controlled language guidelines and a set of texts rewritten according to them are evaluated from multiple points of view. To that end, the thesis both applies existing evaluation approaches and develops new methods that are more appropriate for the task. These are used in two evaluation experiments: evaluation on extrinsic tasks and evaluation of users’ acceptability. The evaluations on extrinsic tasks (evaluating the impact of the controlled language on text complexity, reading comprehension under stress, manual translation, and machine translation tasks) show a positive impact of the controlled language on simplified documents and thus ensure the quality of the resource. The evaluation of users’ acceptability contributes additional findings about manual simplification and helps to determine directions for future implementation. The thesis also gives insight into reading comprehension, machine translation, and cross-language adaptability, and provides original contributions to machine translation, controlled languages, and natural language generation evaluation techniques, which make it valuable for several scientific fields, including Linguistics, Psycholinguistics, and a number of different sub-fields of NLP. (EThOS - Electronic Theses Online Service, United Kingdom)

    Differences between Human and Machine-generated Institutional Translations: A comparative analysis using quantitative methods

    Machine translation, commonly referred to as MT, has gained popularity in recent years; however, it has not yet reached the quality and naturalness of human writing. The present thesis aims to explore how human and automatic English translations of Greek institutional texts differ by comparing quantitative characteristics of the two translation types. Statistical analysis using independent samples t-tests revealed that the two corpora differed in a range of linguistic features, including descriptive characteristics (e.g. word length), word information (e.g. parts of speech, word frequency), lexical diversity, syntax and cohesion; however, the degree of variation was not striking. In a follow-up experiment using a Multilayer Perceptron neural network, the classifier correctly labelled almost 82% of the texts as automatic or human-produced. These results suggest that the differences between human translation (HT) and MT in the subgenre in question are detectable using machine learning techniques, but that the distinction is not as clear-cut as expected. Further research is needed to determine whether the text properties that differ most between the two corpora can be used effectively as predictors of translation quality.
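The abstract above reports independent samples t-tests but gives neither the variant used nor the data. As a rough illustration of that kind of comparison, the sketch below computes the pooled-variance (Student's) t statistic for one feature across two small samples. The choice of the pooled variant, the function name `independent_t`, and the mean word length values are assumptions invented for this example, not the thesis's figures.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    # Pooled sample variance: assumes both groups share a common variance.
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    # Standard error of the difference between the two group means.
    std_err = sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, dof


# Hypothetical per-text values of one feature (e.g. mean word length); not real data.
human = [4.0, 5.0, 6.0]
machine = [1.0, 2.0, 3.0]
t_stat, dof = independent_t(human, machine)
print(f"t = {t_stat:.4f}, dof = {dof}")
```

Turning the statistic into a p-value requires the t distribution, which the Python standard library does not provide. In practice a statistics package would handle that step, and would also offer Welch's variant for unequal variances.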

    Language technologies for a multilingual Europe

    This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011, in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies and, on the other hand, users from diverse areas such as industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu).