11 research outputs found

    Domain adaptation strategies in statistical machine translation: a brief overview

    Get PDF
    © Cambridge University Press, 2015.Statistical machine translation (SMT) is gaining interest given that it can easily be adapted to any pair of languages. One of the main challenges in SMT is domain adaptation because the performance in translation drops when testing conditions deviate from training conditions. Many research works are arising to face this challenge. Research is focused on trying to exploit all kinds of material, if available. This paper provides an overview of research, which copes with the domain adaptation challenge in SMT.Peer ReviewedPostprint (author's final draft

    An unsupervised perplexity-based method for boilerplate removal

    Get PDF
    The availability of large web-based corpora has led to significant advances in a wide range of technologies, including massive retrieval systems or deep neural networks. However, leveraging this data is challenging, since web content is plagued by the so-called boilerplate: ads, incomplete or noisy text and rests of the navigation structure, such as menus or navigation bars. In this work, we present a novel and efficient approach to extract useful and well-formed content from web-scraped data. Our approach takes advantage of Language Models and their implicit knowledge about correctly formed text, and we demonstrate here that perplexity is a valuable artefact that can contribute in terms of effectiveness and efficiency. As a matter of fact, the removal of noisy parts leads to lighter AI or search solutions that are effective and entail important reductions in resources spent. We exemplify here the usefulness of our method with two downstream tasks, search and classification, and a cleaning task. We also provide a Python package with pre-trained models and a web demo demonstrating the capabilities of our approachS

    Human evaluation of three machine translation systems : from quality to attitudes by professional translators

    Get PDF
    This article aims to compare three machine translation systems with a focus on human evaluation. The systems under analysis are a domain-adapted statistical machine translation system, a domain-adapted neural machine translation system and a generic machine translation system. The comparison is carried out on translation from Spanish into German with industrial documentation of machine tool components and processes. The focus is on the human evaluation of the machine translation output, specifically on: fluency, adequacy and ranking at the segment level; fluency, adequacy, need for post-editing, ease of post-editing, and mental effort required in post-editing at the document level; productivity (post-editing speed and post-editing effort) and attitudes. Emphasis is placed on human factors in the evaluation process.En este artículo se comparan tres sistemas de traducción automática poniendo especial atención en la evaluación humana. Los sistemas analizados son un sistema estadístico de traducción automática con adaptación al dominio, un sistema neuronal de traducción automática con adaptación al dominio y un sistema de traducción automática genérico. La comparación se lleva a cabo en una traducción del español al alemán de documentación industrial de componentes y procesos de máquina herramienta. El estudio se centra en la evaluación humana de la traducción automática, en concreto en los siguientes aspectos: fluidez, adecuación y ranquin a nivel de segmento; fluidez, adecuación, necesidad de posedición, facilidad de posedición y esfuerzo mental requerido en la posedición a nivel de documento; productividad (velocidad de posedición y esfuerzo de posedición) y actitudes. Se hace énfasis en los factores humanos del proceso de evaluación

    Summary: Inverting an Estonian-english statistical machine translation model

    Get PDF
    Käesolevas töös on käsitletud statistilist masintõlget nii teoreetiliselt kui ka praktiliselt. Statistiline masintõlge on valdkond, mis üritab panna arvutit tõlkima, ilma et ta teaks midagi keelte ametliku grammatika kohta, vaid saab sisendiks ainult paralleelkorpuse ehk miljoneid lausepaare, kus üks paariline on teise paarilise tõlge. Praktilises pooles kasutati olemasolevat Mosese statistilise masintõlke raamistikku, et luua uus tõlkemudel inglise-eesti suunal. Lisaks pöörati ümber olemasolev eesti-inglise tõlkemudel, mis oli kaalutult kokku pandud erinevatest korpustest saadud mudelitest. Kogu töö käigus loodi 1 keelemudel, 2 fraasimudelit ja 2 ümberpaiknemismudelit. Teoreetiline osa oli referatiivne ning käsitles just neid fraasi-, keele- ja ümberpaiknemismudeli algoritme, mida me sisuliselt kasutasime töö praktilises osas. Täpsemalt käsitleti kahesuunalist leksikograafiliste kaaludega fraasimudelit, trigramm keelemudelit, mis kasutas silumiseks rekursiivset interpolatsiooni koos Witten-Belli meetodiga ning kahesuunalist msd (monotone, swap, discontinues ehk jääb paigale, vahetab, katkendlik) ümberpaiknemismudelit. Töö lõpus tõlgiti rohkem kui tuhandelauseline testkorpus ja hinnati saadud tulemust automaatse hindamismeetodiga BLEU. Lisaks vaadeldi tulemust lähemalt käsitsi. Kuigi paljud kerged laused tõlgiti peaaegu ideaalselt, siis keerulisemate lausetega hakkasid vähemalt osaliselt tekkima raskused. Suurim probleem oli konteksti mittemõistmine, sellele järgnesid käänamine ja lause ülesehitus. Töö väljundiks on valmiv statistilise masintõlke mudel inglise-eesti suunal ning teadmine, et antud valdkond on perspektiivikas. Töö on lisaks mõeldud inglise-eesti suunal statistilise masintõlke tegemise alustamiseks.The present thesis is about statistical machine translation in both theoretical and practical manner. Statistical machine translation is an area, which aims to make the machine translate without giving it any knowledge about grammar of the languages. It only receives a parallel corpora with millions of sentence pairs, where the only certain knowledge is, that in each pair, the sentences translate to each other. In the practical part of the work, we used statistical machine translation framework Moses, to create a new language model for English-Estonian direction. In addition an existing opposite direction language model, which was built from different corporas and put together weigthed, was inverted. Throughout the work 1 language-model, 2 phrase-models and 2 reordering-models were created. Theoretical part of the work involved describing different algorithms that were used inside the framework and its components in the practical part. To be more precise we discussed bidirectional phrase-model with lexical weights, tri-gram language model with recursive interpolation and Witten-Bell smoothing and bidirectional msd(monotone, swap, discontinues) reordering model. At the end stage of the work a test corpora with more than 1000 sentences was translated using the created models. Result was measured with automatic evaluation method BLEU. In addition, the result was examined close up and even though there were many good to allmost perfect translations for simpler sentences, in more complex sentences, there started to exist errors. Most common was misunderstanding the contex. Others worth mentioning were wrong inflection and bad sentence structure. As the result of the work a English-Estonian machine translation model was made and we came to the conclusion that it is a promising field for English-Estonian translation. The work at hand is also meant for educational purposes for anyone willing to step into statistical machine translation field for English-Estonian direction

    Perplexity minimization for translation model domain adaptation in statistical machine translation

    Get PDF
    We investigate the problem of domain adaptation for parallel data in Statistical Machine Translation (SMT). While techniques for domain adaptation of monolingual data can be borrowed for parallel data, we explore conceptual differences between translation model and language model domain adaptation and their effect on performance, such as the fact that translation models typically consist of several features that have different characteristics and can be optimized separately. We also explore adapting multiple (4–10) data sets with no a priori distinction between in-domain and out-of-domain data except for an in-domain development set

    Medidas de distância entre línguas baseadas em corpus. Aplicação à linguística histórica do galego, português, espanhol e inglês

    Get PDF
    xiv, 225 p.El objetivo de esta tesis es plantear y verificar una metodología basada en corpus paracuantificar automáticamente la distancia entre lenguas y variantes de lenguas. Para ello se ha partido de las técnicas usadas y contrastadas en identificación de idiomas, buscando aquellasque son más robustas y pueden cuantificar cuánto se acerca un texto a un modelo de lenguaje.También como objetivo secundario hemos investigado el papel que juega la ortografía comofactor de divergencia y convergencia entre lenguas.El método elegido es no-supervisado y puede aplicarse al cálculo de la distancia entre idiomas,entre períodos históricos de lenguas o entre variantes de lenguas

    Final Report on User Interface Studies, Cognitive and User Modelling

    Get PDF
    D1.3 marks the final CASMACAT report on user interface studies, cognitive and user modelling covering the completion of tasks T1.5 (Cognitive Modelling) and T1.6 (User Modelling) as part of Work Package 1. Within tasks T1.1 to T1.4, a series of experiments have established a solid understanding of human behaviour in computer-aided translation, focusing on the use of visualization options, different translation modalities, individual differences in translation production, translator types and translation/postediting styles. Additionally, the bulk of this experimental data has been released as a publicly available database under a creative common license and further details on this can be found in D1.4. In parallel to these more holistic studies, a second set of experiments aimed to examine some of these factors in a constrained laboratory setting. These focused on the underlying psycholinguistic processing and cognitive modelling of translators’ activity to capture reading difficulty, verification and perplexity during translation and post-editing. This deliverable combines these earlier empirical findings with experiments conducted in Year 3 of the project and grounds translation within a broader theoretical framework associated with human sentence processing and communication. As well as broadening our general understanding of bilingual cognitive processing, there were two major objectives behind the experimental investigations in Year 3. The first was to evaluate the utility of providing translators with Source-Target word alignment information through spatially-direct visual cues. The second was to determine what, if any, differences arise from expertise by comparing the results between a group of bilinguals and a group of professionally trained translators on the same translation-related tasks

    Amélioration des systèmes de traduction par analyse linguistique et thématique (Application à la traduction depuis l'arabe)

    Get PDF
    La traduction automatique des documents est considérée comme l une des tâches les plus difficiles en traitement automatique des langues et de la parole. Les particularités linguistiques de certaines langues, comme la langue arabe, rendent la tâche de traduction automatique plus difficile. Notre objectif dans cette thèse est d'améliorer les systèmes de traduction de l'arabe vers le français et vers l'anglais. Nous proposons donc une étude détaillée sur ces systèmes. Les principales recherches portent à la fois sur la construction de corpus parallèles, le prétraitement de l'arabe et sur l'adaptation des modèles de traduction et de langue.Tout d'abord, un corpus comparable journalistique a été exploré pour en extraire automatiquement un corpus parallèle. Ensuite, différentes approches d adaptation du modèle de traduction sont exploitées, soit en utilisant le corpus parallèle extrait automatiquement soit en utilisant un corpus parallèle construit automatiquement.Nous démontrons que l'adaptation des données du système de traduction permet d'améliorer la traduction. Un texte en arabe doit être prétraité avant de le traduire et ceci à cause du caractère agglutinatif de la langue arabe. Nous présentons notre outil de segmentation de l'arabe, SAPA (Segmentor and Part-of-speech tagger for Arabic), indépendant de toute ressource externe et permettant de réduire les temps de calcul. Cet outil permet de prédire simultanément l étiquette morpho-syntaxique ainsi que les proclitiques (conjonctions, prépositions, etc.) pour chaque mot, ensuite de séparer les proclitiques du lemme (ou mot de base). Nous décrivons également dans cette thèse notre outil de détection des entités nommées, NERAr (Named Entity Recognition for Arabic), et nous examions l'impact de l'intégration de la détection des entités nommées dans la tâche de prétraitement et la pré-traduction de ces entités nommées en utilisant des dictionnaires bilingues. Nous présentons par la suite plusieurs méthodes pour l'adaptation thématique des modèles de traduction et de langue expérimentées sur une application réelle contenant un corpus constitué d un ensemble de phrases multicatégoriques.Ces expériences ouvrent des perspectives importantes de recherche comme par exemple la combinaison de plusieurs systèmes lors de la traduction pour l'adaptation thématique. Il serait également intéressant d'effectuer une adaptation temporelle des modèles de traduction et de langue. Finalement, les systèmes de traduction améliorés arabe-français et arabe-anglais sont intégrés dans une plateforme d'analyse multimédia et montrent une amélioration des performances par rapport aux systèmes de traduction de base.Machine Translation is one of the most difficult tasks in natural language and speech processing. The linguistic peculiarities of some languages makes the machine translation task more difficult. In this thesis, we present a detailed study of machine translation systems from arabic to french and to english.Our principle researches carry on building parallel corpora, arabic preprocessing and adapting translation and language models. We propose a method for automatic extraction of parallel news corpora from a comparable corpora. Two approaches for translation model adaptation are explored using whether parallel corpora extracted automatically or parallel corpora constructed automatically. We demonstrate that adapting data used to build machine translation system improves translation.Arabic texts have to be preprocessed before machine translation and this because of the agglutinative character of arabic language. A prepocessing tool for arabic, SAPA (Segmentor and Part-of-speech tagger for Arabic), much faster than the state of the art tools and totally independant of any other external resource was developed. This tool predicts simultaneously morphosyntactic tags and proclitics (conjunctions, prepositions, etc.) for every word, then splits off words into lemma and proclitics.We describe also in this thesis, our named entity recognition tool for arabic, NERAr, and we focus on the impact of integrating named entity recognition in the preprocessing task. We used bilingual dictionaries to propose translations of the detected named entities. We present then many approaches to adapt thematically translation and language models using a corpora consists of a set of multicategoric sentences.These experiments open important research perspectives such as combining many systems when translating. It would be interesting also to focus on a temporal adaptation of translation and language models.Finally, improved machine translation systems from arabic to french and english are integrated in a multimedia platform analysis and shows improvements compared to basic machine translation systems.PARIS11-SCD-Bib. électronique (914719901) / SudocSudocFranceF

    Statistical machine translation system and computational domain adaptation

    Get PDF
    Statističko strojno prevođenje temeljeno na frazama jedan je od mogućih pristupa automatskom strojnom prevođenju. U radu su predložene metode za poboljšanje kvalitete strojnog prijevoda prilagodbom određenih parametara u modelu sustava za statističko strojno prevođenje. Ideja rada bila jest izgraditi sustave za statističko strojno prevođenje temeljeno na frazama za hrvatski i engleski jezik. Sustavi su trenirani za dva jezična smjera, na dvije domene, na paralelnim korpusima različitih veličina i obilježja za hrvatsko-engleski i englesko-hrvatski jezični par, nakon čega proveden postupak ugađanja sustava. Istraženi su hibridni sustavi koji objedinjuju značajke obiju domena. Time je ispitan izravan utjecaj adaptacije domene na kvalitetu automatskog strojnog prijevoda hrvatskog jezika, a nova saznanja mogu koristiti pri izgradnji novih sustava. Provedena je automatska i ljudska evaluacija (vrednovanje) strojnih prijevoda, a dobiveni rezultati uspoređeni su s rezultatima strojnih prijevoda dobivenih primjenom postojećih web servisa za statističko strojno prevođenje.Phrase-based statistical machine translation is one of possible automatic machine translation approaches. This work proposes methods for increasing the quality of machine translation by adapting certain parameters in the statistical machine translation model. The idea was to build phrase-based statistical machine translation systems for Croatian and English language. The systems were be trained for two directions, on two domains, on parallel corpora of different sizes and characteristics for Croatian-English and English-Croatian language pair, after which the tuning procedure was conducted. Afterwards, hybrid systems which combine features of both domains were investigated. Thereby the direct impact of domain adaptation on the quality of automatic machine translation of Croatian language was explored, whereas new findings can be utilised for building new systems. Automatic and human evaluation of machine translations were carried out, while obtained results were compared with results obtained from applying existing statistical machine translation web services
    corecore