282 research outputs found

    Character-level and syntax-level models for low-resource and multilingual natural language processing

    There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. 
We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).

    Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)

    Mixed-Language Arabic-English Information Retrieval

    Includes abstract. Includes bibliographical references. This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, it is first essential to suppress the impact of most of the problems caused by the mixed-language nature of both queries and documents, which bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, the term frequency, document frequency, and document length components in mixed queries are estimated and adjusted regardless of language, while the model also considers features unique to mixed-language queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly in the non-English language) are likely to be overweighted and to skew the impact of technical terms (mostly in English), because the latter have high document frequencies, and thus low weights, in their corresponding collection (mostly the English collection). This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a re-weighted Inverse Document Frequency (IDF) to moderate the effect of overweighted terms in mixed queries.

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag-of-words representation. The Web provides a vast resource for the automatic construction of parallel corpora, which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost. Comment: 37 pages

    Cross-language Information Retrieval

    Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for CLIR and outlines some open research questions. Comment: 49 pages, 0 figures

    Improved cross-language information retrieval via disambiguation and vocabulary discovery

    Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from segmentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good-quality translation resources, especially bilingual dictionaries, are valuable resources for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. 
We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high-quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval but also have wider applications than CLIR.

    Hongloumeng, Honglou Meng, Hong Loumeng, or Hong Lou Meng

    The current Pinyin Romanization of Chinese book and journal titles is rich in examples of inconsistencies, and this problem has been identified far more often than it has been examined. The current paper traces the problem back to the guiding documents and analyzes their inherent problems. It is argued that the currently dominant practice of aggregating syllables is the source of the inconsistencies, and that it results from ambiguous wording and the misconception of “ci” as the basic unit in the guiding documents. Based on this analysis, a practice of Romanizing Chinese on the basis of “zi” is put forward, and the underlying rationale is analyzed. The purpose is to contribute to resolving the issue of inconsistency and to offer an approach to standardizing the practice of Pinyin Romanization of Chinese book and journal titles.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis management

    Phoneme-based statistical transliteration of foreign names for OOV problem.

    Gao Wei. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 79-82). Abstracts in English and Chinese.

    Abstract
    Acknowledgement
    Bibliographic Notes
    Chapter 1: Introduction
        1.1 What is Transliteration?
        1.2 Existing Problems
        1.3 Objectives
        1.4 Outline
    Chapter 2: Background
        2.1 Source-channel Model
        2.2 Transliteration for English-Chinese
            2.2.1 Rule-based Approach
            2.2.2 Similarity-based Framework
            2.2.3 Direct Semi-Statistical Approach
            2.2.4 Source-channel-based Approach
        2.3 Chapter Summary
    Chapter 3: Transliteration Baseline
        3.1 Transliteration Using IBM SMT
            3.1.1 Introduction
            3.1.2 GIZA++ for Transliteration Modeling
            3.1.3 CMU-Cambridge Toolkits for Language Modeling
            3.1.4 ReWrite Decoder for Decoding
        3.2 Limitations of IBM SMT
        3.3 Experiments Using IBM SMT
            3.3.1 Data Preparation
            3.3.2 Performance Measurement
            3.3.3 Experimental Results
        3.4 Chapter Summary
    Chapter 4: Direct Transliteration Modeling
        4.1 Soundness of the Direct Model: Direct-1
        4.2 Alignment of Phoneme Chunks
        4.3 Transliteration Model Training
            4.3.1 EM Training for Symbol-mappings
            4.3.2 WFST for Phonetic Transition
            4.3.3 Issues for Incorrect Syllables
        4.4 Language Model Training
        4.5 Search Algorithm
        4.6 Experimental Results
            4.6.1 Experiment I: C.A. Distribution
            4.6.2 Experiment II: Top-n Accuracy
            4.6.3 Experiment III: Comparisons with the Baseline
            4.6.4 Experiment IV: Influence of m Candidates
        4.7 Discussions
        4.8 Chapter Summary
    Chapter 5: Improving Direct Transliteration
        5.1 Improved Direct Model: Direct-2
            5.1.1 Enlightenment from Source-Channel
            5.1.2 Using Contextual Features
            5.1.3 Estimation Based on MaxEnt
            5.1.4 Features for Transliteration
        5.2 Direct-2 Model Training
            5.2.1 Procedure and Results
            5.2.2 Discussions
        5.3 Refining the Model Direct-2
            5.3.1 Refinement Solutions
            5.3.2 Direct-2R Model Training
        5.4 Evaluation
            5.4.1 Search Algorithm
            5.4.2 Direct Transliteration Models vs. Baseline
            5.4.3 Direct-2 vs. Direct-2R
            5.4.4 Experiments on Direct-2R
        5.5 Chapter Summary
    Chapter 6: Conclusions
        6.1 Thesis Summary
        6.2 Cross Language Applications
        6.3 Future Work and Directions
    Appendix A: IPA-ARPABET Symbol Mapping Table
    Bibliography