8 research outputs found

    Korean-Chinese Person Name Translation for Cross Language Information Retrieval

    Get PDF
    PACLIC 21 / Seoul National University, Seoul, Korea / November 1-3, 200

    Automatic Construction of Chinese-English Translation Lexicons

    Get PDF
    The process of constructing translation lexicons from parallel texts (bitexts) can be broken down into three stages: mapping bitext correspondence, counting co-occurrences, and estimating a translation model. State-of-the-art techniques for accomplishing each stage of the process had already been developed, but only for bitexts involving fairly similar languages. Correct and efficient implementation of each stage poses special challenges when the parallel texts involve two very different languages. This report describes our theoretical and empirical investigations into how existing techniques might be extended and applied to Chinese/English bitexts

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Full text link
    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system, where we combine a query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, the technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which corresponds words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance

    Augmenting Translation Lexica by Learning Generalised Translation Patterns

    Get PDF
    Bilingual Lexicons do improve quality: of parallel corpora alignment, of newly extracted translation pairs, of Machine Translation, of cross language information retrieval, among other applications. In this regard, the first problem addressed in this thesis pertains to the classification of automatically extracted translations from parallel corpora-collections of sentence pairs that are translations of each other. The second problem is concerned with machine learning of bilingual morphology with applications in the solution of first problem and in the generation of Out-Of-Vocabulary translations. With respect to the problem of translation classification, two separate classifiers for handling multi-word and word-to-word translations are trained, using previously extracted and manually classified translation pairs as correct or incorrect. Several insights are useful for distinguishing the adequate multi-word candidates from those that are inadequate such as, lack or presence of parallelism, spurious terms at translation ends such as determiners, co-ordinated conjunctions, properties such as orthographic similarity between translations, the occurrence and co-occurrence frequency of the translation pairs. Morphological coverage reflecting stem and suffix agreements are explored as key features in classifying word-to-word translations. Given that the evaluation of extracted translation equivalents depends heavily on the human evaluator, incorporation of an automated filter for appropriate and inappropriate translation pairs prior to human evaluation contributes to tremendously reduce this work, thereby saving the time involved and progressively improving alignment and extraction quality. It can also be applied to filtering of translation tables used for training machine translation engines, and to detect bad translation choices made by translation engines, thus enabling significative productivity enhancements in the post-edition process of machine made translations. An important attribute of the translation lexicon is the coverage it provides. Learning suffixes and suffixation operations from the lexicon or corpus of a language is an extensively researched task to tackle out-of-vocabulary terms. However, beyond mere words or word forms are the translations and their variants, a powerful source of information for automatic structural analysis, which is explored from the perspective of improving word-to-word translation coverage and constitutes the second part of this thesis. In this context, as a phase prior to the suggestion of out-of-vocabulary bilingual lexicon entries, an approach to automatically induce segmentation and learn bilingual morph-like units by identifying and pairing word stems and suffixes is proposed, using the bilingual corpus of translations automatically extracted from aligned parallel corpora, manually validated or automatically classified. Minimally supervised technique is proposed to enable bilingual morphology learning for language pairs whose bilingual lexicons are highly defective in what concerns word-to-word translations representing inflection diversity. Apart from the above mentioned applications in the classification of machine extracted translations and in the generation of Out-Of-Vocabulary translations, learned bilingual morph-units may also have a great impact on the establishment of correspondences of sub-word constituents in the cases of word-to-multi-word and multi-word-to-multi-word translations and in compression, full text indexing and retrieval applications

    Phoneme-based statistical transliteration of foreign names for OOV problem.

    Get PDF
    Gao Wei.Thesis (M.Phil.)--Chinese University of Hong Kong, 2004.Includes bibliographical references (leaves 79-82).Abstracts in English and Chinese.Abstract --- p.iAcknowledgement --- p.iiiBibliographic Notes --- p.vChapter 1 --- Introduction --- p.1Chapter 1.1 --- What is Transliteration? --- p.1Chapter 1.2 --- Existing Problems --- p.2Chapter 1.3 --- Objectives --- p.4Chapter 1.4 --- Outline --- p.4Chapter 2 --- Background --- p.6Chapter 2.1 --- Source-channel Model --- p.6Chapter 2.2 --- Transliteration for English-Chinese --- p.8Chapter 2.2.1 --- Rule-based Approach --- p.8Chapter 2.2.2 --- Similarity-based Framework --- p.8Chapter 2.2.3 --- Direct Semi-Statistical Approach --- p.9Chapter 2.2.4 --- Source-channel-based Approach --- p.11Chapter 2.3 --- Chapter Summary --- p.14Chapter 3 --- Transliteration Baseline --- p.15Chapter 3.1 --- Transliteration Using IBM SMT --- p.15Chapter 3.1.1 --- Introduction --- p.15Chapter 3.1.2 --- GIZA++ for Transliteration Modeling --- p.16Chapter 3.1.3 --- CMU-Cambridge Toolkits for Language Modeling --- p.21Chapter 3.1.4 --- Re Write Decoder for Decoding --- p.21Chapter 3.2 --- Limitations of IBM SMT --- p.22Chapter 3.3 --- Experiments Using IBM SMT --- p.25Chapter 3.3.1 --- Data Preparation --- p.25Chapter 3.3.2 --- Performance Measurement --- p.27Chapter 3.3.3 --- Experimental Results --- p.27Chapter 3.4 --- Chapter Summary --- p.28Chapter 4 --- Direct Transliteration Modeling --- p.29Chapter 4.1 --- Soundness of the Direct Model一Direct-1 --- p.30Chapter 4.2 --- Alignment of Phoneme Chunks --- p.31Chapter 4.3 --- Transliteration Model Training --- p.33Chapter 4.3.1 --- EM Training for Symbol-mappings --- p.33Chapter 4.3.2 --- WFST for Phonetic Transition --- p.36Chapter 4.3.3 --- Issues for Incorrect Syllables --- p.36Chapter 4.4 --- Language Model Training --- p.36Chapter 4.5 --- Search Algorithm --- p.39Chapter 4.6 --- Experimental Results --- p.41Chapter 4.6.1 --- Experiment I: C.A. Distribution --- p.41Chapter 4.6.2 --- Experiment II: Top-n Accuracy --- p.41Chapter 4.6.3 --- Experiment III: Comparisons with the Baseline --- p.43Chapter 4.6.4 --- Experiment IV: Influence of m Candidates --- p.43Chapter 4.7 --- Discussions --- p.43Chapter 4.8 --- Chapter Summary --- p.46Chapter 5 --- Improving Direct Transliteration --- p.47Chapter 5.1 --- Improved Direct Model´ؤDirect-2 --- p.47Chapter 5.1.1 --- Enlightenment from Source-Channel --- p.47Chapter 5.1.2 --- Using Contextual Features --- p.48Chapter 5.1.3 --- Estimation Based on MaxEnt --- p.49Chapter 5.1.4 --- Features for Transliteration --- p.51Chapter 5.2 --- Direct-2 Model Training --- p.53Chapter 5.2.1 --- Procedure and Results --- p.53Chapter 5.2.2 --- Discussions --- p.53Chapter 5.3 --- Refining the Model Direct-2 --- p.55Chapter 5.3.1 --- Refinement Solutions --- p.55Chapter 5.3.2 --- Direct-2R Model Training --- p.56Chapter 5.4 --- Evaluation --- p.57Chapter 5.4.1 --- Search Algorithm --- p.57Chapter 5.4.2 --- Direct Transliteration Models vs. Baseline --- p.59Chapter 5.4.3 --- Direct-2 vs. Direct-2R --- p.63Chapter 5.4.4 --- Experiments on Direct-2R --- p.65Chapter 5.5 --- Chapter Summary --- p.71Chapter 6 --- Conclusions --- p.72Chapter 6.1 --- Thesis Summary --- p.72Chapter 6.2 --- Cross Language Applications --- p.73Chapter 6.3 --- Future Work and Directions --- p.74Chapter A --- IPA-ARPABET Symbol Mapping Table --- p.77Bibliography --- p.8

    Recherche d'information translinguistique sur les documents en arabe

    Full text link
    Thèse numérisée par la Division de la gestion de documents et des archives de l'Université de Montréal
    corecore