    Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation

    With the adoption of web services in daily life, people have access to far more information than any human could read and comprehend. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information, so search technologies need to handle content written in multiple languages and account for the linguistic differences between them. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is the special case of IR in which the search takes place over a multilingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (producing fluent output that retains the original meaning), which helps people understand what is written regardless of the source language. Together, search and translation are parts of an important user application, calling for better integration of IR and MT, since the two technologies must work together to produce high-quality output. The main goal of this dissertation is to build better connections between IR and MT, for which we present solutions to two problems. Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search process. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. Finally, we propose a general architecture in which various components of IR and MT systems can be connected in a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.
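    As a hedged illustration of the "translating to search" direction, below is a minimal sketch of probabilistic query translation, one standard way an MT-derived translation table can feed a cross-language search process. The toy translation table, probabilities, and threshold are hypothetical assumptions, not the dissertation's actual system.

```python
# Minimal sketch of probabilistic query translation for CLIR.
# The toy translation table is hypothetical; in practice it would be
# estimated from bitext by an MT system's word-alignment component.

# P(target_term | source_term), e.g. English query -> French index terms
translation_table = {
    "car": {"voiture": 0.7, "auto": 0.2, "wagon": 0.1},
    "rental": {"location": 0.8, "louage": 0.2},
}

def translate_query(query_terms, table, threshold=0.05):
    """Expand each source term into weighted target-language terms."""
    weighted_query = {}
    for term in query_terms:
        for target, prob in table.get(term, {}).items():
            if prob >= threshold:  # prune unlikely translations
                weighted_query[target] = weighted_query.get(target, 0.0) + prob
    return weighted_query

print(translate_query(["car", "rental"], translation_table))
# {'voiture': 0.7, 'auto': 0.2, 'wagon': 0.1, 'location': 0.8, 'louage': 0.2}
```

    The weighted query can then be handed to a standard monolingual ranking function over the target-language collection, which is one common realization of the IR/MT coupling the abstract describes.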

    Building Parallel Corpora from Movies

    This paper proposes using Dynamic Time Warping (DTW) to construct parallel corpora from difficult data. Parallel corpora are considered raw material for machine translation (MT); frequently, MT systems use European or Canadian parliamentary corpora. In order to achieve a realistic machine translation system, we decided to use movie subtitles. These data can be considered difficult because they contain unfamiliar expressions, abbreviations, hesitations, words that do not exist in classical dictionaries (such as vulgar words), etc. The obtained parallel corpora can constitute a rich resource for training a spontaneous-speech translation system. From 40 movies, we align 43,013 English subtitles with 42,306 French subtitles. This yields 37,625 aligned pairs with a precision of 92.3%.
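    To make the alignment step concrete, here is a minimal sketch of DTW applied to two subtitle streams, assuming each subtitle is a (start-time, text) pair and using the gap between start times as the local cost. The cost function and toy subtitles are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of aligning two subtitle streams with Dynamic Time
# Warping (DTW); the local cost is the gap between start times, and the
# subtitle lists stand in for parsed .srt files.

def dtw_align(src, tgt, cost=lambda a, b: abs(a[0] - b[0])):
    """src, tgt: lists of (start_seconds, text). Returns matched index pairs."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    # D[i][j] = minimal cumulative cost of aligning src[:i] with tgt[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(src[i - 1], tgt[j - 1])
            D[i][j] = c + min(D[i - 1][j - 1],  # diagonal: align the pair
                              D[i - 1][j],      # advance in source only
                              D[i][j - 1])      # advance in target only
    # Backtrack; record one-to-one matches on diagonal moves (full DTW
    # would treat the other moves as many-to-one expansions).
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        step = min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
        if step == D[i - 1][j - 1]:
            pairs.append((i - 1, j - 1)); i, j = i - 1, j - 1
        elif step == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

en = [(1.0, "Hello."), (3.5, "Where are you going?"), (7.2, "Wait!")]
fr = [(1.1, "Bonjour."), (3.6, "Ou vas-tu ?"), (7.0, "Attends !")]
print(dtw_align(en, fr))  # [(0, 0), (1, 1), (2, 2)]
```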

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
    * We release ‘language packs’ for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text harvested from the web, and, for non-Roman-script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction that takes advantage of a wide variety of signals of translation equivalence estimated over comparable corpora (see the sketch after this abstract).
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations, and we use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of the component unigrams.
    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types: we use comparable corpora to induce new word and phrase translations and to estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to language pairs for which zero parallel text is available.
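    As a hedged illustration of combining multiple signals of translation equivalence, the sketch below trains a tiny classifier on a few seed translation pairs using two simplified signals: orthographic similarity and temporal similarity of word-frequency profiles. The feature functions, seed pairs, and frequency profiles are hypothetical stand-ins for the thesis's richer models.

```python
# Minimal sketch of supervised bilingual lexicon induction: combine
# several signals of translation equivalence (estimated over comparable
# corpora) with a classifier trained on a few seed word translations.

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def orthographic_sim(src, tgt):
    """String similarity; informative for cognates and transliterations."""
    return SequenceMatcher(None, src, tgt).ratio()

def temporal_sim(src_counts, tgt_counts):
    """Cosine similarity of time-stamped frequency profiles (e.g. daily news counts)."""
    dot = sum(a * b for a, b in zip(src_counts, tgt_counts))
    norm = (sum(a * a for a in src_counts) * sum(b * b for b in tgt_counts)) ** 0.5
    return dot / norm if norm else 0.0

def features(src, tgt, src_profile, tgt_profile):
    return [orthographic_sim(src, tgt), temporal_sim(src_profile, tgt_profile)]

# Hypothetical seed data: feature vectors labeled as translation (1) or not (0)
X = [features("nation", "nacion", [5, 0, 2], [4, 0, 3]),   # true pair
     features("nation", "perro",  [5, 0, 2], [0, 6, 1]),   # false pair
     features("music",  "musica", [1, 3, 0], [1, 2, 0]),   # true pair
     features("music",  "gato",   [1, 3, 0], [0, 0, 7])]   # false pair
y = [1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
# Probability that "nation"/"nacion" is a translation pair (should be > 0.5)
print(clf.predict_proba([features("nation", "nacion", [5, 0, 2], [4, 0, 3])])[0][1])
```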

    Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation

    When the amount of parallel sentences available to train a neural machine translation system is scarce, a common practice is to generate new synthetic training samples from them. A number of approaches have been proposed to produce synthetic parallel sentences that are similar to those in the available parallel data. These approaches work under the assumption that non-fluent target-side synthetic training samples can be harmful and may deteriorate translation performance. Even so, in this paper we demonstrate that synthetic training samples with non-fluent target sentences can improve translation performance if they are used in a multilingual machine translation framework as if they were sentences in another language. We conducted experiments on ten low-resource and four high-resource translation tasks and found that this simple approach consistently improves translation performance compared to state-of-the-art methods for generating synthetic training samples similar to those found in corpora. Furthermore, this improvement is independent of the size of the original training corpus, and the resulting systems are much more robust against domain shift and produce fewer hallucinations. This paper is part of the R+D+i project PID2021-127999NB-I00 funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund "A way to make Europe". The computational resources used were funded by the European Regional Development Fund through project ID-IFEDER/2020/003.
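    To make the "treat synthetic data as another language" idea concrete, here is a minimal sketch of the data-preparation step such a multilingual setup implies: real pairs carry the true target-language token, while synthetic pairs with possibly non-fluent target sides carry their own pseudo-language token. The token names and toy sentences are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the multilingual-tagging idea: real parallel sentences
# are tagged with the true target language, while synthetic pairs with
# possibly non-fluent target sides are tagged as if they were a separate
# language. Tag names (<2de>, <2de_syn>) are illustrative.

def tag_corpus(pairs, lang_token):
    """Prefix each source sentence with a target-language token."""
    return [(f"{lang_token} {src}", tgt) for src, tgt in pairs]

real_pairs = [("the house is small", "das Haus ist klein")]
synthetic_pairs = [("the green house", "das grun Haus")]  # non-fluent target

training_data = (
    tag_corpus(real_pairs, "<2de>")              # genuine German
    + tag_corpus(synthetic_pairs, "<2de_syn>")   # pseudo-language for synthetic data
)
for src, tgt in training_data:
    print(src, "|||", tgt)

# At inference time the model is asked for <2de>, so it benefits from the
# extra training signal without being pushed to imitate the non-fluent
# pseudo-language.
```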

    Understanding and Enhancing the Use of Context for Machine Translation

    To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language that is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult for neural networks. Meaning is often determined from context. With context, languages allow meaning to be conveyed even when the specific words used are not known by the reader. To model this learning process, a system has to learn from a few instances in context and be able to generalize well to unseen cases. The learning process is hindered when training data is scarce for a task. Even with sufficient data, learning patterns for the long tail of the lexical distribution is challenging. In this thesis, we focus on understanding the potential of context in neural models and design augmentation models to benefit from it. We focus on machine translation as an important instance of the more general language understanding problem. To translate from a source language to a target language, a neural model has to understand the meaning of constituents in the provided context and generate constituents with the same meanings in the target language. This task accentuates the value of capturing nuances of language and the necessity of generalizing from few observations. The main problem we study in this thesis is what neural machine translation models learn from data and how we can devise more focused contexts to enhance this learning. Looking more deeply into the role of context and the impact of data on learning models is essential to advancing the NLP field. Moreover, it helps highlight the vulnerabilities of current neural networks and provides insights into designing more robust models.
    Comment: PhD dissertation defended on November 10th, 202

    Building a bilingual dictionary from movie subtitles based on inter-lingual triggers

    This paper focuses on two aspects of machine translation: parallel corpora and the translation model. First, we present a method to automatically build parallel corpora from subtitle files gathered from the Internet, which yields useful data for subtitling machine translation. Our method is based on Dynamic Time Warping. We evaluated this alignment method by comparing it with a sample aligned by hand and obtained an alignment precision of 0.92. Second, we use the notion of inter-lingual triggers to build multilingual dictionaries and translation tables for machine translation from the subtitle parallel corpora. Inter-lingual triggers make it possible to detect couples of source and target words from parallel corpora. The Mutual Information measure used to determine inter-lingual triggers allows us to hypothesize that a word in the source language is a translation of another word in the target language. We evaluate the obtained dictionary by comparing it to two existing dictionaries. We then integrated the obtained translation tables into a complete translation decoding process supplied by Pharaoh. We compared the translation performance using our translation tables with the performance obtained with the Giza++ tool. The results showed that the system tuned for our tables improves the BLEU score by 2.2% compared to the one obtained with Giza++.
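    A minimal sketch of the Mutual Information scoring behind inter-lingual triggers follows: for each couple of words co-occurring in aligned subtitle pairs, an MI contribution is computed, and the best-scoring couples are hypothesized to be translations. The two-pair corpus and the exact MI formulation are simplified assumptions, not the paper's full method.

```python
# Minimal sketch of inter-lingual triggers: score each (source word,
# target word) couple co-occurring in aligned sentence pairs by its
# mutual information contribution; high-scoring couples are hypothesized
# translations. The two-sentence corpus is a toy stand-in.

import math
from collections import Counter

aligned = [
    ("the cat sleeps", "le chat dort"),
    ("the dog sleeps", "le chien dort"),
]

src_count, tgt_count, joint = Counter(), Counter(), Counter()
for src, tgt in aligned:
    s_words, t_words = set(src.split()), set(tgt.split())
    src_count.update(s_words)
    tgt_count.update(t_words)
    joint.update((s, t) for s in s_words for t in t_words)

N = len(aligned)

def mi(s, t):
    """MI contribution of the couple (s, t) over the aligned corpus."""
    p_st = joint[(s, t)] / N
    p_s, p_t = src_count[s] / N, tgt_count[t] / N
    return p_st * math.log(p_st / (p_s * p_t)) if p_st else 0.0

triggers = sorted(joint, key=lambda st: mi(*st), reverse=True)
print([(s, t, round(mi(s, t), 3)) for s, t in triggers[:4]])
# "cat"/"chat" and "dog"/"chien" outscore couples such as "the"/"le".
```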

    Automatic construction of English/Chinese parallel corpus.

    Li Kar Wing. Thesis (M.Phil.), Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 88-96). Abstracts in English and Chinese. Table of contents:
    1. INTRODUCTION
       1.1 Application of corpus-based techniques
           1.1.1 Machine Translation (MT)
                 1.1.1.1 Linguistic
                 1.1.1.2 Statistical
                 1.1.1.3 Lexicon construction
           1.1.2 Cross-lingual Information Retrieval (CLIR)
                 1.1.2.1 Controlled vocabulary
                 1.1.2.2 Free text
                 1.1.2.3 Application of the corpus-based approach in CLIR
       1.2 Overview of linguistic resources
       1.3 Written language corpora
           1.3.1 Types of corpora
           1.3.2 Limitation of comparable corpora
       1.4 Outline of the dissertation
    2. LITERATURE REVIEW
       2.1 Research in automatic corpus construction
       2.2 Research in translation alignment
           2.2.1 Sentence alignment
           2.2.2 Word alignment
       2.3 Research in alignment of sequences
    3. ALIGNMENT AT WORD LEVEL AND CHARACTER LEVEL
       3.1 Title alignment
           3.1.1 Lexical features
           3.1.2 Grammatical features
           3.1.3 The English/Chinese alignment model
       3.2 Alignment at word level and character level
           3.2.1 Alignment at word level
           3.2.2 Alignment at character level: longest matching
           3.2.3 Longest common subsequence (LCS)
           3.2.4 Applying LCS in the English/Chinese alignment model
       3.3 Reducing overlapping ambiguity
           3.3.1 Edit distance
           3.3.2 Overlapping in the algorithm model
    4. ALIGNMENT AT TITLE LEVEL
       4.1 Review of score functions
       4.2 The score function
           4.2.1 (C matches E) and (E matches C)
           4.2.2 Length similarity
    5. EXPERIMENTAL RESULTS
       5.1 Hong Kong government press release articles
       5.2 Hang Seng Bank economic monthly reports
       5.3 Hang Seng Bank press release articles
       5.4 Hang Seng Bank speech articles
       5.5 Quality of the collections and future work
    6. CONCLUSION
    Bibliography
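    Since the table of contents names the longest common subsequence (Chapter 3.2.3) as the core of the character-level alignment, a minimal textbook LCS sketch follows. It is the standard dynamic program, not the thesis's full English/Chinese alignment model, and the example titles are illustrative.

```python
# Minimal sketch of the longest common subsequence (LCS) computation that
# character-level title alignment can build on: the textbook dynamic
# program plus a backtrack to recover one LCS.

def lcs(a, b):
    """Return one longest common subsequence of sequences a and b."""
    n, m = len(a), len(b)
    # L[i][j] = length of an LCS of a[:i] and b[:j]
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Backtrack through the table to recover the subsequence itself
    out, i, j = [], n, m
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("Hong Kong Government", "Hong Kong SAR Government"))
# "Hong Kong Government"
```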