Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text
Parallel corpora are a valuable resource for machine translation, but at
present their availability and utility are limited by genre- and
domain-specificity, licensing restrictions, and the basic difficulty of
locating parallel texts in all but the most dominant of the world's languages.
A parallel corpus resource not yet explored is the World Wide Web, which hosts
an abundance of pages in parallel translation, offering a potential solution to
some of these problems and unique opportunities of its own. This paper presents
the necessary first step in that exploration: a method for automatically
finding parallel translated documents on the Web. The technique is conceptually
simple, fully language independent, and scalable, and preliminary evaluation
results indicate that the method may be accurate enough to apply without human
intervention.
Comment: LaTeX2e, 11 pages, 7 eps figures; uses psfig, llncs.cls, theapa.sty. An appendix at http://umiacs.umd.edu/~resnik/amta98/amta98_appendix.html contains test data.
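The abstract leaves the matching technique itself unspecified. As a minimal sketch of the kind of language-independent, scalable heuristic such a system might start from, the following pairs URLs that differ only in a language-code path segment; the marker set and the find_candidate_pairs helper are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical language-code path segments; a real crawler would use a
# much larger inventory of markers ("en", "fr", "de", "english", ...).
LANG_MARKERS = {"en", "fr"}

def normalize(url: str) -> str:
    """Collapse a language-code path segment to a placeholder so that
    translated page pairs map to the same key."""
    parts = url.split("/")
    return "/".join("{LANG}" if p.lower() in LANG_MARKERS else p for p in parts)

def find_candidate_pairs(urls):
    """Group URLs that are identical except for the language segment."""
    buckets = {}
    for url in urls:
        key = normalize(url)
        if key != url:  # keep only URLs that contain a language marker
            buckets.setdefault(key, []).append(url)
    return [tuple(sorted(b)) for b in buckets.values() if len(b) == 2]

urls = [
    "http://example.org/en/about.html",
    "http://example.org/fr/about.html",
    "http://example.org/en/news.html",
]
print(find_candidate_pairs(urls))
# -> [('http://example.org/en/about.html', 'http://example.org/fr/about.html')]
```

Candidate pairs found this way would still need content-level filtering (e.g. comparing document structure or length) before being accepted as translations.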
Automatic construction of English/Chinese parallel corpus.
Li Kar Wing. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 88-96). Abstracts in English and Chinese.
Contents: ABSTRACT; ACKNOWLEDGEMENTS; LIST OF TABLES; LIST OF FIGURES
Chapter 1. INTRODUCTION
  1.1 Application of corpus-based techniques
    1.1.1 Machine Translation (MT)
      1.1.1.1 Linguistic
      1.1.1.2 Statistical
      1.1.1.3 Lexicon construction
    1.1.2 Cross-lingual Information Retrieval (CLIR)
      1.1.2.1 Controlled vocabulary
      1.1.2.2 Free text
      1.1.2.3 Application of the corpus-based approach in CLIR
  1.2 Overview of linguistic resources
  1.3 Written language corpora
    1.3.1 Types of corpora
    1.3.2 Limitation of comparable corpora
  1.4 Outline of the dissertation
Chapter 2. LITERATURE REVIEW
  2.1 Research in automatic corpus construction
  2.2 Research in translation alignment
    2.2.1 Sentence alignment
    2.2.2 Word alignment
  2.3 Research in alignment of sequences
Chapter 3. ALIGNMENT AT WORD LEVEL AND CHARACTER LEVEL
  3.1 Title alignment
    3.1.1 Lexical features
    3.1.2 Grammatical features
    3.1.3 The English/Chinese alignment model
  3.2 Alignment at word level and character level
    3.2.1 Alignment at word level
    3.2.2 Alignment at character level: longest matching
    3.2.3 Longest common subsequence (LCS)
    3.2.4 Applying LCS in the English/Chinese alignment model
  3.3 Reducing overlapping ambiguity
    3.3.1 Edit distance
    3.3.2 Overlapping in the algorithm model
Chapter 4. ALIGNMENT AT TITLE LEVEL
  4.1 Review of score functions
  4.2 The score function
    4.2.1 (C matches E) and (E matches C)
    4.2.2 Length similarity
Chapter 5. EXPERIMENTAL RESULTS
  5.1 Hong Kong government press release articles
  5.2 Hang Seng Bank economic monthly reports
  5.3 Hang Seng Bank press release articles
  5.4 Hang Seng Bank speech articles
  5.5 Quality of the collections and future work
Chapter 6. CONCLUSION
Bibliography
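Chapter 3 builds the character-level alignment on longest common subsequence matching. As a minimal illustration of that building block only (the standard dynamic-programming LCS plus a length-normalized score; the normalization is an assumption, not necessarily the thesis's score function):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings,
    via the standard O(len(a) * len(b)) dynamic program."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(a: str, b: str) -> float:
    """Length-normalized LCS score in [0, 1] for a candidate title pair."""
    return 2 * lcs_length(a, b) / (len(a) + len(b)) if a or b else 0.0
```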
Rapid Resource Transfer for Multilingual Natural Language Processing
Until recently the focus of the Natural Language Processing (NLP)
community has been on a handful of mostly European languages. However, the
rapid changes taking place in the economic and political climate of the
world precipitate a similar change in the relative importance given to
various languages. The importance of rapidly acquiring NLP resources and
computational capabilities in new languages is widely accepted.
Statistical NLP models have a distinct advantage over rule-based methods
in achieving this goal since they require far less manual labor. However,
statistical methods require two fundamental resources for training: (1)
online corpora and (2) manual annotations. Creating these two resources can be
as difficult as porting rule-based methods.
This thesis demonstrates the feasibility of acquiring both corpora and
annotations by exploiting existing resources for well-studied languages.
Basic resources for new languages can be acquired in a rapid and
cost-effective manner by utilizing existing resources cross-lingually.
Currently, the most viable method of obtaining online corpora is
converting existing printed text into electronic form using Optical
Character Recognition (OCR). Unfortunately, a language that lacks online
corpora most likely lacks OCR as well. We tackle this problem by taking an
existing OCR system that was designed for a specific language and using
that OCR system for a language with a similar script. We present a
generative OCR model that allows us to post-process output from a
non-native OCR system to achieve accuracy close to, or better than, a
native one. Furthermore, we show that the performance of a native or
trained OCR system can be improved by the same method.
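The abstract describes the generative model only at a high level. One standard way to realize such post-processing is a noisy-channel decoder that rescores the non-native OCR output against an in-language lexicon; the toy lexicon, the err_penalty weight, and the plain edit-distance channel below are illustrative assumptions, not the thesis's actual model.

```python
from math import log

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def correct(observed: str, lexicon: dict, err_penalty: float = 2.0) -> str:
    """Pick the in-language word w maximizing
    log P(w) - err_penalty * edits(observed, w)."""
    return max(lexicon, key=lambda w: log(lexicon[w])
               - err_penalty * edit_distance(observed, w))

lexicon = {"translation": 0.6, "transliteration": 0.4}  # toy unigram model
print(correct("translaton", lexicon))  # -> "translation"
```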
Next, we demonstrate cross-utilization of annotations on treebanks. We
present an algorithm that projects dependency trees across parallel
corpora. We also show that a reasonable-quality treebank can be generated
by combining projection with a small amount of language-specific
post-processing. The projected treebank allows us to train a parser that
performs comparably to a parser trained on manually generated data.
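A deliberately simplified sketch of the projection step: copy each source dependency arc across a word alignment. The one-to-one alignment assumed here is an illustration only; real alignments are many-to-many and noisy, which is what the thesis's algorithm and its language-specific post-processing address.

```python
def project_dependencies(src_deps, alignment):
    """Project source-side dependency arcs onto the target sentence
    through a word alignment.

    src_deps:  list of (head_index, dependent_index) arcs in the source.
    alignment: dict mapping source index -> target index (one-to-one here;
               many-to-many alignments need extra handling).
    """
    projected = []
    for head, dep in src_deps:
        if head in alignment and dep in alignment:
            projected.append((alignment[head], alignment[dep]))
        # unaligned words simply project no arc in this simplified sketch
    return projected

# English "I drank water" aligned to a hypothetical target word order 0,2,1
src_deps = [(1, 0), (1, 2)]          # drank -> I, drank -> water
alignment = {0: 0, 1: 2, 2: 1}
print(project_dependencies(src_deps, alignment))  # [(2, 0), (2, 1)]
```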
Lexical selection for machine translation
Current research in Natural Language Processing (NLP) tends to exploit corpus resources as a way of overcoming the problem of knowledge acquisition. Statistical analysis of corpora can reveal trends and probabilities of occurrence, which have proved to be helpful in various ways. Machine Translation (MT) is no exception to this trend. Many MT researchers have attempted to extract knowledge from parallel bilingual corpora. The MT problem is generally decomposed into two sub-problems: lexical selection and reordering of the selected words.

This research addresses the problem of lexical selection of open-class lexical items in the framework of MT. The work reported in this thesis investigates different methodologies to handle this problem, using a corpus-based approach. The current framework can be applied to any language pair, but we focus on Arabic and English because Arabic words are hugely ambiguous and thus pose a challenge for the current task of lexical selection. We use a challenging Arabic-English parallel corpus, containing many long passages with no punctuation marks to denote sentence boundaries; this points to the robustness of the adopted approach.

In our attempt to extract lexical equivalents from the parallel corpus we focus on the co-occurrence relations between words. The current framework adopts a lexicon-free approach towards the selection of lexical equivalents. This has the double advantage of investigating the effectiveness of different techniques without being distracted by the properties of the lexicon, while at the same time saving much time and effort, since constructing a lexicon is time-consuming and labour-intensive. Thus, we use as little, if any, hand-coded information as possible. The accuracy score could be improved by adding hand-coded information, but the point of the work reported here is to see how well one can do without any such manual intervention.

With this goal in mind, we carry out a number of preprocessing steps in our framework. First, we build a lexicon-free Part-of-Speech (POS) tagger for Arabic, using a combination of rule-based, transformation-based learning (TBL) and probabilistic techniques; similarly, we use a lexicon-free POS tagger for English, and the two taggers are used to tag the bi-texts. Second, we develop lexicon-free shallow parsers for Arabic and English, which are then used to label the parallel corpus with dependency relations (DRs) for some critical constructions. Third, we develop stemmers for Arabic and English, adopting the same knowledge-free approach.

These preprocessing steps pave the way for the main system (or proposer), whose task is to extract translational equivalents from the parallel corpus. The framework starts by automatically extracting a bilingual lexicon using unsupervised statistical techniques which exploit the notion of co-occurrence patterns in the parallel corpus. We then choose the target word that has the highest frequency of occurrence from among a number of translational candidates in the extracted lexicon, in order to aid the selection of the contextually correct translational equivalent. These experiments are carried out on either raw or POS-tagged texts. Having labelled the bi-texts with DRs, we use them to extract a number of translation seeds to start a number of bootstrapping techniques to improve the proposer. These seeds are used as anchor points to resegment the parallel corpus and start the selection process once again.
The final F-score for the selection process is 0.701. We have also written an algorithm for detecting ambiguous words in a translation lexicon and obtained a precision score of 0.89.
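As a minimal sketch of the proposer's first stage (an unsupervised co-occurrence statistic over aligned segments), the following scores source/target word pairs with the Dice coefficient; Dice, the min_score threshold, and whitespace tokenization are illustrative assumptions, not necessarily the techniques used in the thesis.

```python
from collections import Counter
from itertools import product

def extract_lexicon(bitext, min_score=0.3):
    """Score source/target word pairs by how often they co-occur in
    aligned segments (Dice coefficient), keeping the strongest pairs."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_seg, tgt_seg in bitext:
        src_words, tgt_words = set(src_seg.split()), set(tgt_seg.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    lexicon = {}
    for (s, t), c in pair_count.items():
        dice = 2 * c / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            lexicon.setdefault(s, []).append((t, dice))
    # rank candidate translations for each source word by association strength
    return {s: sorted(ts, key=lambda x: -x[1]) for s, ts in lexicon.items()}
```

Picking the most frequent candidate among the top-ranked translations, as the abstract describes, then becomes a simple lookup in this ranked lexicon.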
Traducción inglés-español y censura de textos narrativos en la España de Franco: TRACEni (1962-1969) = English-Spanish Translation and Censorship of Narrative Texts in Franco's Spain: TRACEni (1962-1969)
825 p. This doctoral thesis, carried out within the TRACE (Traducciones CEnsuradas, "censored translations") project and taking the descriptive branch of translation studies as its theoretical framework, has contributed to a richer understanding of the effects of (self-)censorship on the literary landscape of translated narrative in twentieth-century Spain, specifically during the Francoist period 1962-1969. Once the methodological and contextual framework and the constituent parts of the novel had been delimited, and a textual corpus of 9,118 novels originally written in English and translated into Spanish, with censorship records dated within the period indicated, had been compiled, the object of analysis was established (original, censored, and published texts), on which a descriptive-comparative analysis was carried out at the macrotextual and microtextual levels, considering changes on the formal, semantic, and pragmatic planes. The microtextual level carried the greatest weight in the analysis (performed on five textual pairs of novels representative of the corpus), taking the transleme as the unit of comparison in order to identify cases of self-censorship, external censorship, and the threshold of permissiveness, and to determine whether the translators or the official censors were responsible for the similarities or differences in content and/or form between the published texts and their respective originals.