
    Constructing a Large-Scale English-Persian Parallel Corpus

    In recent years the exploitation of large text corpora in solving various kinds of linguistic problems, including those of translation, has become commonplace. Yet a large-scale English-Persian corpus is still unavailable, because of certain difficulties and the amount of work required to overcome them. The project reported here is an attempt to construct an English-Persian parallel corpus composed of digital texts and Web documents containing little or no noise. The Internet is useful because translations of existing texts are often published on the Web. The task is to find parallel pages in English and Persian, to judge their translation quality, and to download and align them. The corpus so created is open; that is, more material can be added as the need arises. One of the main activities associated with building such a corpus is to develop software for parallel concordancing, in which a user can enter a search string in one language and see all the citations for that string in the source language together with the corresponding sentences in the target language. Our intention is to use the present English-Persian parallel corpus to construct general translation memory software. An aligned bilingual corpus is also useful in many other settings, among them machine translation, word-sense disambiguation, cross-lingual information retrieval, lexicography, and language learning
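The parallel concordancing idea described above can be sketched in a few lines: given sentence-aligned pairs, look up a search string on one side and return each hit together with its aligned sentence. This is a minimal illustration only; the data and transliterated Persian below are invented for the example, not taken from the corpus described.

```python
# Minimal parallel concordancer sketch over sentence-aligned pairs.
# The toy corpus below is illustrative (Persian shown transliterated).

def concordance(aligned_pairs, query):
    """Return (source, target) pairs whose source sentence contains the query."""
    q = query.lower()
    return [(src, tgt) for src, tgt in aligned_pairs if q in src.lower()]

corpus = [
    ("The book is on the table.", "ketab ruye miz ast."),
    ("I read the book yesterday.", "diruz ketab ra khandam."),
    ("The weather is nice today.", "emruz hava khub ast."),
]

hits = concordance(corpus, "book")
for src, tgt in hits:
    print(src, "||", tgt)
```

A real concordancer would add tokenization and stemming so that morphological variants of the search string are also matched.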

    A Hybrid Accurate Alignment Method for Large Persian-English Corpus Construction Based on Statistical Analysis and Lexicon/Persian WordNet

    A bilingual corpus is a very important knowledge source and an inevitable requirement for many natural language processing (NLP) applications in which two languages are involved. For some languages, such as Persian, the lack of such resources is much more significant. Several applications, including statistical and example-based machine translation, need bilingual corpora in which large amounts of text from two different languages have been aligned at the sentence or phrase level. To meet this requirement, this paper proposes an accurate, hybrid sentence alignment method for the construction of an English-Persian parallel corpus. As the first step, the proposed method uses statistical length-based analysis to filter candidates. Punctuation marks are used as a directing feature to reduce the complexity and increase the accuracy. Finally, the proposed method makes use of lexical knowledge to produce the final output. In the lexical analysis phase, a bilingual dictionary as well as a Persian semantic net (FarsNet) is used to calculate an extended semantic similarity. Experiments showed that expanding synonym words via this extended semantic similarity has a positive effect on the accuracy of the sentence alignment process. In the proposed matching scheme, a semantic-load-based approach (which considers the verb as the pivot and main part of a sentence) was also used to increase the accuracy. The results obtained from the experiments were promising, and the generated parallel corpus can be used as an effective knowledge source by researchers who work on the Persian language
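The length-based filtering step mentioned above can be illustrated with a Gale-Church-style score: sentence pairs whose character-length ratio deviates too far from the expected source/target ratio are discarded before the lexical stage. This is a sketch; the parameter values and the toy sentences are illustrative assumptions, not the paper's actual settings.

```python
import math

def length_score(src_len, tgt_len, mean_ratio=1.0, variance=6.8):
    """Gale-Church-style deviation score: small values mean a plausible pair."""
    expected = src_len * mean_ratio
    return abs(tgt_len - expected) / math.sqrt(variance * src_len)

def filter_candidates(pairs, threshold=2.0):
    """Keep only sentence pairs whose length deviation is below the threshold."""
    return [(s, t) for s, t in pairs if length_score(len(s), len(t)) < threshold]

pairs = [
    # Plausible pair: similar lengths (Persian shown transliterated).
    ("A bilingual corpus is a key resource.", "yek peykare-ye dozabane manba-e mohemmi ast."),
    # Implausible pair: a short sentence against a much longer one.
    ("Hello.", "in yek jomle-ye besyar boland va kamelan namortabet ast."),
]
kept = filter_candidates(pairs)
```

Pairs that survive this cheap filter would then be passed to the punctuation and lexical/FarsNet stages for the final alignment decision.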

    Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization

    Translation alignment is an essential task in Digital Humanities and Natural Language Processing; it aims to link words and phrases in the source text with their equivalents in the translation. In addition to its importance in teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistic annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general and Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool for manual alignment, with the aim of gathering training data for an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I later used for supervised training. Ugarit has been used by many researchers and scholars, as well as in classrooms at several institutions for teaching and learning ancient languages, which resulted in a large, diverse, crowd-sourced aligned parallel corpus that allowed us to conduct experiments and qualitative analysis to detect recurring patterns in annotators' alignment practice and in the generated translation pairs. Further, I employed recent advances in NLP and language modeling to develop an automatic alignment model for historical low-resource languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. I then integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model. 
To ensure best practice, I reviewed the current evaluation procedure, identified its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold-standard datasets and to support quantitative and qualitative evaluation of translation alignment models. In addition, I designed and implemented visual analytics tools and reading environments for parallel texts and proposed various visualization approaches to support different alignment-related tasks, employing the latest advances in information visualization. Overall, this thesis presents a comprehensive study comprising manual and automatic alignment techniques, evaluation methods, and visual analytics tools that aim to advance the field of translation alignment for historical languages
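Evaluating a word-level alignment model of the kind described above is commonly done by comparing predicted (source index, target index) links against a gold standard. The sketch below shows the standard precision/recall/F1 scoring, not the two new metrics proposed in the thesis; the example links are invented.

```python
# Score predicted word-alignment links against a gold standard.
# Each link is a (source_token_index, target_token_index) pair.

def alignment_prf(predicted, gold):
    """Return (precision, recall, F1) of predicted links versus gold links."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # links present in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 0), (1, 2), (2, 1), (3, 3)}
pred = {(0, 0), (1, 2), (3, 1)}
p, r, f = alignment_prf(pred, gold)
```

Limitations of exactly this kind of set-based scoring (e.g. treating all links as equally important) are what motivate proposing refined metrics.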

    MultiWiki: interlingual text passage alignment in Wikipedia

    In this article, we address the problem of text passage alignment across interlingual article pairs in Wikipedia. We develop methods that enable the identification and interlinking of text passages written in different languages and containing overlapping information. Interlingual text passage alignment can enable Wikipedia editors and readers to better understand the language-specific context of entities, provide valuable insights into cultural differences, and build a basis for qualitative analysis of the articles. An important challenge in this context is the trade-off between the granularity of the extracted text passages and the precision of the alignment. Whereas short text passages can result in more precise alignment, longer text passages can facilitate a better overview of the differences in an article pair. To better understand these aspects from the user perspective, we conduct a user study on the German, Russian, and English Wikipedia and collect a user-annotated benchmark. We then propose MultiWiki – a method that adopts an integrated approach to text passage alignment using semantic similarity measures and greedy algorithms – and achieve precise results with respect to the user-defined alignment. The MultiWiki demonstration is publicly available and currently supports four language pairs
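The greedy side of an approach like MultiWiki's can be sketched as follows: given a passage-similarity matrix, repeatedly take the highest-scoring pair whose passages are still unused, stopping below a threshold. This is a generic illustration under assumed inputs; the real system derives similarities from cross-lingual semantic measures.

```python
# Greedy one-to-one passage alignment from a similarity matrix.
# sim_matrix[i][j] is the similarity of passage i (language A)
# and passage j (language B); values here are made up.

def greedy_align(sim_matrix, threshold=0.5):
    """Return (i, j, score) links, best-first, each passage used at most once."""
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(sim_matrix) for j, s in enumerate(row)),
        reverse=True,  # highest similarity first
    )
    used_a, used_b, links = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are too dissimilar
        if i not in used_a and j not in used_b:
            links.append((i, j, score))
            used_a.add(i)
            used_b.add(j)
    return links

sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.1, 0.7, 0.2],
]
links = greedy_align(sim)
```

Raising the threshold trades recall for precision, which mirrors the granularity/precision trade-off discussed in the abstract.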

    Reordering in statistical machine translation

    Machine translation is a challenging task whose difficulties arise from several characteristics of natural language. The main focus of this work is on reordering, one of the major problems in MT and in statistical MT (SMT), the method investigated in this research. The reordering problem in SMT originates from the fact that not all the words in a sentence can be translated consecutively: words must be skipped and translated out of their source-sentence order to produce a fluent and grammatically correct sentence in the target language. The main reason reordering is needed is the fundamental word order differences between languages; it therefore becomes a more dominant issue the more structurally different the source and target languages are. The aim of this thesis is to study the reordering phenomenon by proposing new methods of dealing with reordering in SMT decoders and by evaluating the effectiveness of these methods and the importance of reordering in the context of natural language processing tasks. In other words, we propose novel ways of performing the decoding to improve the reordering capabilities of the SMT decoder, and in addition we explore the effect of improved reordering on the quality of specific NLP tasks, namely named entity recognition and cross-lingual text association. We also go beyond reordering in text association and present a method to perform cross-lingual text fragment alignment based on models of divergence from randomness. The main contribution of this thesis is a novel method named dynamic distortion, which is designed to improve the ability of the phrase-based decoder to perform reordering by adjusting the distortion parameter based on the translation context. The model employs a discriminative reordering model, which combines several features, including lexical and syntactic ones, to predict the necessary distortion limit for each sentence and each hypothesis expansion. 
The discriminative reordering model is also integrated into the decoder as an extra feature. The method achieves substantial improvements over the baseline without any increase in decoding time, by avoiding reordering in unnecessary positions. Another novel method is presented to extend the phrase-based decoder to dynamically chunk, reorder, and apply phrase translations in tandem. Words inside the chunks are moved together, enabling the decoder to make long-distance reorderings that capture the word order differences between languages with different sentence structures. Another aspect of this work is the task-based evaluation of the reordering methods and other translation algorithms used in phrase-based SMT systems. As SMT systems become more successful, performing multilingual and cross-lingual tasks through translation becomes more feasible. We have devised a method to evaluate the performance of state-of-the-art named entity recognisers on text translated by an SMT decoder. Specifically, we investigated the effect of word reordering and of incorporating reordering models on the quality of named entity extraction. In addition to empirically investigating the effect of translation in the context of cross-lingual document association, we have described a text fragment alignment algorithm that finds sections of two documents in different languages that are related in content. The algorithm uses similarity measures based on divergence from randomness and word-based translation models to perform text fragment alignment on a collection of documents in two different languages. All the methods proposed in this thesis are examined extensively and empirically. We have tested all the algorithms on common translation collections used in different evaluation campaigns. Well-known automatic evaluation metrics are used to compare the suggested methods to a state-of-the-art baseline, and the results are analysed and discussed
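The distortion mechanics that dynamic distortion builds on can be illustrated simply: in standard phrase-based decoding, each jump between translated phrase positions incurs an exponential penalty, and a distortion limit caps how far the decoder may jump. The function names and parameter values below are illustrative; the thesis's contribution is predicting the limit per sentence and per hypothesis, which is not reproduced here.

```python
# Standard phrase-based distortion penalty and limit check (illustrative values).

def distortion_cost(jumps, alpha=0.6):
    """Exponential distortion penalty: alpha ** (total jump distance)."""
    return alpha ** sum(abs(d) for d in jumps)

def within_limit(prev_end, next_start, limit):
    """Allow a jump only if the gap to the next phrase is within the limit."""
    return abs(next_start - prev_end - 1) <= limit

# Monotone decoding (no jumps) incurs no penalty...
monotone = distortion_cost([])
# ...while long-distance reordering is penalised exponentially.
reordered = distortion_cost([4, -3])
```

A fixed limit either blocks needed long-distance moves or admits spurious ones; adjusting the limit from the translation context is what avoids that trade-off.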