5,954 research outputs found

    Identifying Word Translations in Non-Parallel Texts

    Common algorithms for sentence and word alignment allow the automatic identification of word translations from parallel texts. This study suggests that the identification of word translations should also be possible with non-parallel and even unrelated texts. The method proposed is based on the assumption that there is a correlation between the patterns of word co-occurrences in texts of different languages. (Comment: 3 pages, requires aclap.sty and epic.st)
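    The co-occurrence assumption above can be illustrated with a rough sketch (not the paper's actual algorithm): represent words in each language as vectors of co-occurrence counts over a small seed lexicon of known translation pairs, and match words across languages whose vectors correlate. The toy corpora and seed lexicon below are invented for the illustration.

```python
from collections import Counter
from math import sqrt

def cooc_vector(word, sentences, dims):
    """Count how often `word` co-occurs (same sentence) with each dimension word."""
    counts = Counter()
    for sent in sentences:
        if word in sent:
            for d in dims:
                if d in sent:
                    counts[d] += 1
    return [counts[d] for d in dims]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy non-parallel corpora (tokenized sentences), English and German.
en_sents = [["the", "dog", "barks"], ["the", "dog", "eats", "meat"],
            ["the", "cat", "eats", "fish"]]
de_sents = [["der", "hund", "bellt"], ["der", "hund", "frisst", "fleisch"],
            ["die", "katze", "frisst", "fisch"]]

# A seed lexicon of known translations provides shared vector dimensions.
seed = {"eats": "frisst", "meat": "fleisch", "fish": "fisch"}

dog = cooc_vector("dog", en_sents, list(seed))
hund = cooc_vector("hund", de_sents, [seed[w] for w in seed])
katze = cooc_vector("katze", de_sents, [seed[w] for w in seed])
# "dog" and "hund" share a co-occurrence profile, so their similarity is highest.
```

    In this sketch the seed lexicon is what makes the two vector spaces comparable; the paper's point is that such correlations persist even when the texts are unrelated.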

    Parallel texts alignment

    Work presented in the scope of the Master's programme in Computer Engineering, as a partial requirement for obtaining the degree of Master in Computer Engineering.

    Alignment of parallel texts (texts that are translations of each other) is a required step for many applications that use parallel texts, including statistical machine translation, automatic extraction of translation equivalents, and automatic creation of concordances. This dissertation presents a new methodology for parallel text alignment that departs from previous work in several ways. One important departure is a shift of goals concerning the use of lexicons for obtaining correspondences between the texts. Previous methods try to infer a bilingual lexicon as part of the alignment process and use it to obtain correspondences between the texts. Some of those methods can use external lexicons to complement the inferred one, but they tend to treat them as secondary. This dissertation presents several arguments supporting the thesis that lexicon inference should not be embedded in the alignment process. The method described complies with this principle and relies exclusively on externally managed lexicons to obtain correspondences. Moreover, the algorithms presented can handle very large lexicons containing terms of arbitrary length. Besides the exclusive use of external lexicons, this dissertation presents a new method for obtaining correspondences between translation equivalents found in the texts, using a decision criterion based on features that have been overlooked by prior work. The proposed method is iterative and refines the alignment at each iteration: it uses the alignment obtained in one iteration as a guide to obtaining new correspondences in the next iteration, which in turn are used to compute a finer alignment. This iterative scheme allows the method to correct correspondence errors from previous iterations in the face of new information.
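    The iterative refine-and-realign idea can be sketched minimally as follows. The outlier-filtering criterion here (distance from a fitted line through correspondence positions) is invented for illustration; the dissertation's actual decision criterion is richer.

```python
def fit_line(points):
    """Least-squares line tgt = a*src + b through (src_pos, tgt_pos) anchors."""
    n = len(points)
    sx = sum(s for s, _ in points)
    sy = sum(t for _, t in points)
    sxx = sum(s * s for s, _ in points)
    sxy = sum(s * t for s, t in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def refine(candidates, tol=20.0, rounds=3):
    """Each round uses the previous alignment as a guide: refit the line,
    then keep only the correspondences that lie close to it."""
    kept = list(candidates)
    for _ in range(rounds):
        a, b = fit_line(kept)
        kept = [(s, t) for s, t in kept if abs(a * s + b - t) <= tol]
    return kept

# Token offsets of candidate translation equivalents; (15, 80) is spurious.
candidates = [(0, 0), (10, 11), (20, 19), (30, 31), (15, 80), (40, 42)]
aligned = refine(candidates)
# The spurious correspondence is discarded and the remaining fit tightens.
```

    The point of the iteration is visible even in this toy: the first fit is distorted by the bad correspondence, but once it is removed, the subsequent fit describes the true alignment much more closely.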

    K-vec: A New Approach for Aligning Parallel Texts

    Various methods have been proposed for aligning texts in two or more languages, such as the Canadian Parliamentary Debates (Hansards). Some of these methods generate a bilingual lexicon as a by-product. We present an alternative alignment strategy, which we call K-vec, that starts by estimating the lexicon. For example, it discovers that the English word "fisheries" is similar to the French "pêches" by noting that the distribution of "fisheries" in the English text is similar to the distribution of "pêches" in the French text. K-vec does not depend on sentence boundaries. (Comment: 7 pages, uuencoded, compressed PostScript; Proc. COLING-9)
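    The core of K-vec can be sketched as follows, in simplified form: cut each text into K segments, give each word a binary vector recording which segments it occurs in, and treat words with similar vectors as candidate translations. The toy texts are invented, and raw segment overlap stands in for the mutual-information scoring the paper actually uses.

```python
def kvec(tokens, word, k):
    """Binary K-vec: 1 for each of the k text segments containing `word`."""
    seg = max(1, len(tokens) // k)
    return [int(word in tokens[i * seg:(i + 1) * seg]) for i in range(k)]

def overlap(u, v):
    """Number of segments where both words occur (crude association score)."""
    return sum(a & b for a, b in zip(u, v))

# Toy texts of 40 tokens each, cut into k = 4 segments of 10 tokens.
en = (["fisheries"] + ["a"] * 9 + ["b"] * 10) * 2
fr = (["peches"] + ["c"] * 9 + ["whisky"] + ["c"] * 9) * 2

k = 4
fish = kvec(en, "fisheries", k)   # occurs in segments 0 and 2
# "peches" mirrors that distribution across the French text; "whisky" does not.
assert overlap(fish, kvec(fr, "peches", k)) > overlap(fish, kvec(fr, "whisky", k))
```

    Because the vectors are built from fixed-size segments rather than sentences, the method needs no sentence boundaries, matching the claim in the abstract.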

    Real-Time Identification of Parallel Texts from Bilingual Newsfeed

    Parallel texts are documents that are translations of each other. This paper describes a simple method that can be deployed on a real-time news feed to create an ever-growing source of parallel texts in French and English. Our experiment was conducted on the Canada Newswire news feed. Given some of its intrinsic properties, it was possible to deploy a relatively simple text-matching technique that relies on language-independent cognates such as numbers, capitalized words, punctuation, and newline characters. On three weeks of press releases, our system correctly identified the vast majority of parallel press releases, committing only minor errors on repeated news items.
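    A sketch of that kind of language-independent matching (the regexes, scoring, and sample headlines below are invented for illustration, not taken from the paper): extract the numbers and capitalized tokens from each document, and pair the documents whose cues overlap most.

```python
import re

def signature(text):
    """Language-independent cues: the numbers and capitalized words in a text."""
    numbers = set(re.findall(r"\d+(?:[.,]\d+)?", text))
    caps = set(re.findall(r"\b[A-Z][a-zA-Z]+\b", text))
    return numbers, caps

def score(text_a, text_b):
    """Count the numbers and capitalized tokens shared by two documents."""
    num_a, cap_a = signature(text_a)
    num_b, cap_b = signature(text_b)
    return len(num_a & num_b) + len(cap_a & cap_b)

en = "Ottawa, May 5: Canada Newswire reports 120 new jobs."
fr_match = "Ottawa, le 5 mai : Canada Newswire annonce 120 nouveaux emplois."
fr_other = "Toronto annonce 300 nouveaux emplois."

# The true translation shares far more cognate cues with the English release.
assert score(en, fr_match) > score(en, fr_other)
```

    Proper names and figures survive translation essentially unchanged, which is why such shallow cues suffice for pairing press releases from the same feed.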

    PEDANT: parallel texts in Göteborg

    The article presents the status of the PEDANT project on parallel corpora at the Language Bank at Göteborg University. Solutions for access to the corpus data are presented: access is provided by way of the internet, standard applications, and SGML-aware programming tools. The SGML format for encoding translation pairs is also outlined. The methods allow working with everything from plain text to texts densely encoded with linguistic information. Keywords: SGML, parallel corpora, morphosyntactic encoding, lemmatization, multiword units, compound words, internet access

    Babylon parallel text builder: Gathering parallel texts for low-density languages

    This paper describes BABYLON, a system that attempts to overcome the shortage of parallel texts in low-density languages by supplementing existing parallel texts with texts gathered automatically from the Web. In addition to identifying entire parallel Web pages, we also propose a new feature specifically designed to find parallel text chunks within a single document. Experiments carried out on the Quechua-Spanish language pair show that the system succeeds in automatically identifying a significant amount of parallel text on the Web. Evaluations of a machine translation system trained on this corpus indicate that the Web-gathered parallel texts can supplement manually compiled parallel texts, and perform significantly better than the manually compiled texts when tested on other Web-gathered data.

    Automatic Discovery of Non-Compositional Compounds in Parallel Data

    Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations. (Comment: 12 pages; uses natbib.sty, here.st)
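    The notion of a non-compositional compound can be made concrete with a toy criterion (invented here; the paper compares full statistical translation models rather than lookup tables): a word pair is non-compositional when its preferred translation differs from the composition of its words' individual translations.

```python
def is_noncompositional(bigram, phrase_table, word_table):
    """Toy criterion: a bigram is non-compositional when its preferred
    translation is not the composition of its words' translations."""
    w1, w2 = bigram
    composed = f"{word_table[w1]} {word_table[w2]}"
    return phrase_table[f"{w1} {w2}"] != composed

# Invented English-to-French toy tables.
word_table = {"hot": "chaud", "dog": "chien", "big": "grand"}
phrase_table = {"hot dog": "hot-dog", "big dog": "grand chien"}

assert is_noncompositional(("hot", "dog"), phrase_table, word_table)       # idiom
assert not is_noncompositional(("big", "dog"), phrase_table, word_table)   # literal
```

    "hot dog" translates as a unit rather than as "chaud chien", so it is exactly the kind of sequence the method is designed to surface.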