A Pattern Matching method for finding Noun and Proper Noun Translations from Noisy Parallel Corpora
We present a pattern matching method for compiling a bilingual lexicon of
nouns and proper nouns from unaligned, noisy parallel texts of
Asian/Indo-European language pairs. Tagging information of one language is
used. Word frequency and position information for high and low frequency words
are represented in two different vector forms for pattern matching. New anchor
point finding and noise elimination techniques are introduced. We obtained a
precision of 73.1%. We also show how the results can be used in the compilation
of domain-specific noun phrases. Comment: 8 pages, uuencoded compressed postscript file. To appear in the
Proceedings of the 33rd AC
Computer Assisted Language Learning Based on Corpora and Natural Language Processing : The Experience of Project CANDLE
This paper describes Project CANDLE, an ongoing three-year project that uses various corpora and NLP technologies to construct an online English learning environment for learners in Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an English-Chinese parallel corpus, Sinorama, was used as the main course material for reading, writing, and culture-based learning courses. Second, an online bilingual concordancer, TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and other corpora. Third, many online lessons, including extensive reading, verb-noun collocations, and vocabulary, were designed to be used alone or together with TotalRecall and TANGO. Fourth, an online collocation checking program, MUST, was developed to detect V-N miscollocations in students' writing and suggest appropriate collocates, based on the hypothesis of L1 interference and on data from the BNC and the bilingual Sinorama Corpus. Other computational scaffolding is under development. It is hoped that this project will help intermediate learners in Taiwan improve their English proficiency through effective pedagogical approaches and versatile language reference tools.
Developing Word-aligned Myanmar-English Parallel Corpus based on the IBM Models
Word alignment in bilingual corpora has been an active research
topic in the machine translation community. A corpus is a
collection of texts that is useful for Natural Language
Processing (NLP). Parallel text alignment is the identification of
corresponding sentences in a parallel text. Large
collections of parallel text are a prerequisite for many areas of
linguistic research. A parallel corpus helps in building statistical
bilingual dictionaries, in supporting statistical machine translation,
and in providing training data for word sense disambiguation
and translation disambiguation. The world is now a global
network, and more people are learning more than one language,
so multilingual corpora are increasingly being processed. The main
purpose of this system is therefore to construct a word-aligned parallel
corpus for use in Myanmar-English machine translation. One
useful concept is to identify correspondences between words in
one language and words in the other. The proposed approach is
based on the first three IBM models and the EM algorithm. The paper also
shows that the approach can be improved by using a list of
cognates and morphological analysis.
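The abstract above names the IBM models and the EM algorithm without giving details. As a rough illustration only, the following is a minimal sketch of EM training for IBM Model 1, the simplest of those models. It is not the authors' implementation: the function names `train_ibm_model1` and `align`, the omission of the NULL source word, and the toy English-German corpus are all assumptions made for this example.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) with EM.

    `pairs` is a list of (src_tokens, tgt_tokens) sentence pairs.
    Simplification (assumption): no NULL source word is added.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Uniform initialisation over every word pair that co-occurs
    # in at least one sentence pair.
    t = {}
    for src, tgt in pairs:
        for f in tgt:
            for e in src:
                t[(f, e)] = 1.0 / len(tgt_vocab)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (tgt, src) links
        total = defaultdict(float)  # expected count of links from src
        # E-step: distribute each target word over the source words
        # of its sentence in proportion to the current t values.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-normalise the expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def align(src, tgt, t):
    """Link each target word to its most probable source word."""
    return [(f, max(src, key=lambda e: t.get((f, e), 0.0))) for f in tgt]


# Hypothetical toy corpus, used only to exercise the code.
toy_corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
model = train_ibm_model1(toy_corpus)
print(align(["the", "house"], ["das", "Haus"], model))
```

Model 1 ignores word order; IBM Models 2 and 3 add position and fertility parameters on top of the same EM loop, and the cognate list and morphological analysis mentioned in the abstract would be further refinements not shown here.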
Phraseology in Corpus-Based Translation Studies: A Stylistic Study of Two Contemporary Chinese Translations of Cervantes's Don Quijote
The present work sets out to investigate the stylistic profiles of two modern Chinese versions of
Cervantes’s Don Quijote (I): by Yang Jiang (1978), the first direct translation from Castilian to Chinese,
and by Liu Jingsheng (1995), which is one of the most commercially successful versions of the
Castilian literary classic. This thesis focuses on a detailed linguistic analysis carried out with the help
of the latest textual analytical tools, natural language processing applications and statistical packages.
The type of linguistic phenomenon singled out for study is four-character expressions (FCEXs), which
are a very typical category of Chinese phraseology. The work opens with the creation of a descriptive
framework for the annotation of linguistic data extracted from the parallel corpus of Don Quijote.
Subsequently, the classified and extracted data are put through several statistical tests. The results of
these tests prove to be very revealing regarding the different use of FCEXs in the two Chinese
translations. The computational modelling of the linguistic data would seem to indicate that among
other findings, while Liu’s use of archaic idioms has followed the general patterns of the original and
also of Yang’s work in the first half of Don Quijote I, noticeable variations begin to emerge in the
second half of Liu’s more recent version. Such an idiosyncratic use of archaisms by Liu, which may be
defined as style shifting or style variation, is then analyzed in quantitative terms through the application
of the proposed context-motivated theory (CMT). The results of applying the CMT-derived statistical
models show that the detected stylistic variation may well point to the internal consistency of the
translator in rendering the second half of Part I of the novel, which reflects his freer, more creative and
experimental style of translation. Through the introduction and testing of quantitative research methods
adapted from corpus linguistics and textual statistics, this thesis has made a major contribution to
methodological innovation in the study of style within the context of corpus-based translation studies.
Filtering parallel texts to improve translation model and cross-language information retrieval
Thesis digitized by the Direction des bibliothèques (Libraries Directorate) of the Université de Montréal