8 research outputs found

    A review of EBMT using proportional analogies

    Some years ago, a number of papers reported an experimental implementation of Example-Based Machine Translation (EBMT) using proportional analogy. This approach, a type of analogical learning, was attractive because of its simplicity, and the papers reported considerable success with the method. This paper reviews what we believe to be the totality of research reported using this method, as an introduction to our own experiments in this framework, reported in a companion paper. We first report some lack of clarity in the previously published work, and then report our finding that the purity of the proportional analogy approach imposes huge run-time complexity on the EBMT task, even when heuristics hinted at in the original literature are applied to reduce the amount of computation.

    Mitigating problems in analogy-based EBMT with SMT and vice versa: a case study with named entity transliteration

    Five years ago, a number of papers reported an experimental implementation of an Example-Based Machine Translation (EBMT) system using proportional analogy. This approach, a type of analogical learning, was attractive because of its simplicity, and the papers reported considerable success with the method using various language pairs. In this paper, we describe our attempt to use this approach for tackling English–Hindi Named Entity (NE) transliteration. We have implemented our own EBMT system using proportional analogy and have found that the analogy-based system on its own has low precision but a high recall due to the fact that a large number of names are untransliterated with the approach. However, using SMT to mitigate the problems of analogy-based EBMT, and vice versa, has shown considerable improvement over either approach alone.

    Statistically motivated example-based machine translation using translation memory

    In this paper, we present a novel way of integrating Translation Memory (TM) into an Example-Based Machine Translation (EBMT) system to deal with the issue of low resources. We have used a dialogue of 380 sentences as the example-base for our system. The translation units in the Translation Memories are automatically extracted based on the aligned phrases (words) of a statistical machine translation (SMT) system. We attempt to use the approach to improve translation from English to Bangla, as many SMT systems have difficulty with such small amounts of training data. We have found that the approach shows improvement over a baseline SMT system.

    Mitigating the problems of SMT using EBMT

    Statistical Machine Translation (SMT) typically has difficulties with less-resourced languages, even with homogeneous data. In this thesis, we address the application of Example-Based Machine Translation (EBMT) methods to overcome some of these difficulties. We adopt three alternative approaches to tackle these problems, focusing on two poorly resourced translation tasks (English–Bangla and English–Turkish). First, we adopt a runtime approach to EBMT using proportional analogy. In addition to the translation task, we have tested the EBMT system using proportional analogy for named entity transliteration. In the second attempt, we use a compiled approach to EBMT. Finally, we present a novel way of integrating Translation Memory (TM) into an EBMT system. We discuss the development of these three different EBMT systems and the experiments we have performed. In addition, we present an approach to augment the output quality by strategically combining EBMT systems and SMT systems. The hybrid system shows significant improvement for different language pairs. Runtime EBMT systems in general have significant time complexity issues, especially for a large example-base. We explore two methods to address this issue in our system by making the system scalable at runtime for a large example-base (English–French). First, we use a heuristic-based approach. Second, we use an IR-based indexing technique to speed up the time-consuming matching procedure of the EBMT system. The index-based matching procedure substantially improves run-time speed without affecting translation quality.

    Normalisation orthographique de corpus bruités

    The information contained in messages posted on the Internet (forums, social networks, review sites...) is of strategic importance for many companies. However, few tools have been designed for analysing such messages, the spelling, typography and syntax of which are often noisy. This industrial PhD thesis has been carried out within the viavoo company with the aim of improving the results of a lemma-based information retrieval tool. We have developed a processing pipeline for the normalisation of noisy texts. Its aim is to ensure that each word is assigned the standard spelling corresponding to one of its lemma's inflected forms. First, among all tokens of the corpus that are unknown to a reference lexicon, we automatically determine which ones result from alterations, and therefore should be normalised, as opposed to those that do not (neologisms, loanwords...). Normalisation candidates are then generated for these tokens using weighted rules obtained by analogy-based machine learning techniques. Next, we identify tokens that are known to the reference lexicon but are nevertheless the result of an alteration (grammatical errors), and generate normalisation candidates for each of them. Finally, language models allow us to perform a context-sensitive disambiguation of the normalisation candidates generated for all types of alterations. Numerous experiments and evaluations are carried out on French data for each module and for the overall pipeline. Special attention has been paid to keeping all modules as language-independent as possible, which paves the way for future adaptations of our pipeline to other European languages.

    The GREYC Machine Translation System for the IWSLT 2007 Evaluation Campaign

    The GREYC machine translation (MT) system is a slight evolution of the ALEPH machine translation system that participated in the IWSLT 2005 campaign. It is a pure example-based MT system that exploits proportional analogies. The training data used for this campaign were limited on purpose to the sole data provided by the organizers. However, the training data were expanded with the results of sub-sentential alignments. The system participated in the two classical tasks of translation of manually transcribed texts from Japanese to English and Arabic to English.