427 research outputs found

    Corpus-driven bilingual lexicon extraction

    This paper introduces some key aspects of machine translation in order to situate the role of the bilingual lexicon in transfer-based systems. It then discusses the data-driven approach to extracting bilingual knowledge automatically from bilingual texts, tracing the processes of alignment at different levels of granularity. The paper concludes with some suggestions for future work. Peer-reviewed.

    Ridge Regression, Hubness, and Zero-Shot Learning

    This paper discusses the effect of hubness in zero-shot learning when ridge regression is used to find a mapping between the example space and the label space. Contrary to the existing approach, which attempts to find a mapping from the example space to the label space, we show that mapping labels into the example space is desirable to suppress the emergence of hubs in the subsequent nearest neighbor search step. Assuming a simple data model, we prove that the proposed approach indeed reduces hubness. This was verified empirically on the tasks of bilingual lexicon extraction and image labeling: hubness was reduced in both tasks and accuracy improved accordingly. Comment: To be presented at ECML/PKDD 201
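Hubness is commonly quantified by the k-occurrence count: how often each candidate appears among the k nearest neighbors of the queries. The following is a minimal Python sketch of that count only, not the paper's ridge regression method; the function name `k_occurrence` and the toy points are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def k_occurrence(queries, candidates, k):
    """Count how often each candidate index falls in a query's k nearest neighbors.

    A candidate with a count far above the others behaves like a hub.
    """
    counts = Counter({i: 0 for i in range(len(candidates))})
    for q in queries:
        # Rank candidate indices by Euclidean distance to the query.
        ranked = sorted(
            range(len(candidates)),
            key=lambda i: math.dist(q, candidates[i]),
        )
        counts.update(ranked[:k])
    return counts


# Toy data (hypothetical): candidate 0 sits near most of the queries.
candidates = [(0.0, 0.0), (5.0, 5.0), (-5.0, 5.0)]
queries = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (4.0, 4.0)]
counts = k_occurrence(queries, candidates, k=1)
print(dict(counts))
```

A strongly skewed distribution of these counts is the symptom the abstract describes; comparing the counts before and after changing the mapping direction is one way to check whether hubness has been reduced.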

    Bilingual Lexicon Extraction from Comparable Corpora as Metasearch

    In this article we present a novel way of looking at the problem of automatic acquisition of pairs of translationally equivalent words from comparable corpora. We first present the standard and extended approaches traditionally dedicated to this task. We then reinterpret the extended method and motivate a novel model that reformulates this approach, inspired by metasearch engines in information retrieval. The empirical results show that the performance of our model is always better than the baseline obtained with the extended approach and is also competitive with the standard approach.

    Identification of Fertile Translations in Medical Comparable Corpora: a Morpho-Compositional Approach

    This paper defines a method for lexicon extraction in the biomedical domain from comparable corpora. The method is based on compositional translation and exploits morpheme-level translation equivalences. It can generate translations for a large variety of morphologically constructed words and can also generate 'fertile' translations. We show that fertile translations increase the overall quality of the extracted lexicon for English to French translation.

    Multilingual Lexicon Extraction under Resource-Poor Language Pairs

    In general, bilingual and multilingual lexicons are important resources in many natural language processing fields such as information retrieval and machine translation. Such lexicons are usually extracted from bilingual (e.g., parallel or comparable) corpora with external seed dictionaries. However, few such corpora and bilingual seed dictionaries are publicly available for many language pairs such as Korean–French. It is important that such resources for these language pairs be publicly available or easily accessible when a monolingual resource is considered. This thesis presents efficient approaches for extracting bilingual single-/multi-word lexicons for resource-poor language pairs such as Korean–French and Korean–Spanish. The goal of this thesis is to present several efficient methods of extracting translated single-/multi-words from bilingual corpora based on a statistical method. Three approaches for single words and one approach for multi-words are proposed. The first approach is the pivot context-based approach (PCA). The PCA uses a pivot language to connect source and target languages. It builds context vectors from two parallel corpora sharing one pivot language and calculates their similarity scores to choose the best translation equivalents. The approach can reduce the effort required when using a seed dictionary for translation by using parallel corpora rather than comparable corpora. The second approach is the extended pivot context-based approach (EPCA). This approach gathers similar context vectors for each source word to augment its context. The approach assumes that similar vectors can enrich contexts. For example, young and youth can augment the context of baby. In the investigation described here, such similar vectors were collected by similarity measures such as cosine similarity. The third approach for single words uses a competitive neural network algorithm (i.e., self-organizing maps, SOM).
The SOM-based approach (SA) uses synonym vectors rather than context vectors to train two different SOMs (i.e., source and target SOMs) in different ways. A source SOM is trained in an unsupervised way, while a target SOM is trained in a supervised way. The fourth approach is the constituent-based approach (CTA), which deals with multi-word expressions (MWEs). This approach reinforces the PCA for multi-words (PCAM). It extracts bilingual MWEs taking all constituents of the source MWEs into consideration. The PCAM identifies MWE candidates by pointwise mutual information first and then adds them to input data as single units in order to use the PCA directly. The experimental results show that the proposed approaches generally perform well for resource-poor language pairs, particularly Korean–French and Korean–Spanish. The PCA and SA have demonstrated good performance for such language pairs. The EPCA did not perform as strongly as expected. The CTA performs well even when word contexts are insufficient. Overall, the experimental results show that the CTA significantly outperforms the PCAM. In the future, homonyms (i.e., homographs such as lead or tear) should be considered. In particular, the domains of bilingual corpora should be identified. In addition, more parts of speech such as verbs, adjectives, or adverbs could be tested. In this thesis, only nouns are discussed for simplicity.
Finally, thorough error analysis should also be conducted.
    Contents: Abstract; List of Abbreviations; List of Tables; List of Figures; Acknowledgement; Chapter 1 Introduction; 1.1 Multilingual Lexicon Extraction; 1.2 Motivations and Goals; 1.3 Organization; Chapter 2 Background and Literature Review; 2.1 Extraction of Bilingual Translations of Single-words; 2.1.1 Context-based approach; 2.1.2 Extended approach; 2.1.3 Pivot-based approach; 2.2 Extraction of Bilingual Translations of Multi-Word Expressions; 2.2.1 MWE identification; 2.2.2 MWE alignment; 2.3 Self-Organizing Maps; 2.4 Evaluation Measures; Chapter 3 Pivot Context-Based Approach; 3.1 Concept of Pivot-Based Approach; 3.2 Experiments; 3.2.1 Resources; 3.2.2 Results; 3.3 Summary; Chapter 4 Extended Pivot Context-Based Approach; 4.1 Concept of Extended Pivot Context-Based Approach; 4.2 Experiments; 4.2.1 Resources; 4.2.2 Results; 4.3 Summary; Chapter 5 SOM-Based Approach; 5.1 Concept of SOM-Based Approach; 5.2 Experiments; 5.2.1 Resources; 5.2.2 Results; 5.3 Summary; Chapter 6 Constituent-Based Approach; 6.1 Concept of Constituent-Based Approach; 6.2 Experiments; 6.2.1 Resources; 6.2.2 Results; 6.3 Summary; Chapter 7 Conclusions and Future Work; 7.1 Conclusions; 7.2 Future Work; Reference
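The pivot context-based approach described in this abstract compares context vectors by similarity score to pick translation candidates. A minimal Python sketch of that comparison step follows. It assumes each word's context vector is a sparse dictionary keyed by pivot-language words; the association scores, the example words, and the function names `cosine` and `rank_translations` are all hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(source_vec, target_vecs):
    """Return (score, target_word) pairs sorted from most to least similar."""
    scored = [(cosine(source_vec, vec), word) for word, vec in target_vecs.items()]
    return sorted(scored, reverse=True)


# Hypothetical context vectors, both indexed by English pivot words.
source_vec = {"baby": 3.0, "young": 2.0, "milk": 1.0}   # a Korean source word
target_vecs = {
    "bébé": {"baby": 4.0, "young": 1.0, "milk": 2.0},    # French candidates
    "voiture": {"car": 5.0, "road": 2.0},
}

for score, word in rank_translations(source_vec, target_vecs):
    print(f"{word}\t{score:.3f}")
```

Because both vectors live in the pivot-language vocabulary, no seed dictionary is needed to make them comparable, which is the saving the abstract attributes to the approach.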

    Bilingual Lexicon Extraction Using a Modified Perceptron Algorithm

    In computational linguistics, parallel corpora and bilingual lexicons are important resources for fields such as machine translation and cross-language information retrieval. For example, parallel corpora are used to extract translation probabilities in machine translation systems. Bilingual lexicons enable direct word-to-word translation in cross-language information retrieval, and they also support the translation process in machine translation systems. Moreover, the larger the parallel corpora and bilingual lexicons available for training, the better a machine translation system performs. However, building such bilingual lexicons manually, that is, by human effort, requires considerable cost, time, and labor. For these reasons, research on extracting bilingual lexicons has attracted the attention of many researchers. This thesis proposes a new and effective method for extracting bilingual lexicons. It is based on the vector space model, the model most widely used in bilingual lexicon extraction, and uses the perceptron algorithm, a type of neural network, to learn the weights of the bilingual lexicon iteratively. The final bilingual lexicons are then extracted using the iteratively learned weights and the perceptron. As a result, the iteratively trained results improved accuracy by 3.5% on average over the initial untrained results.
    Contents: 1. Introduction; 2. Literature Review; 2.1 Linguistic resources: The text corpora; 2.2 A vector space model; 2.3 Neural networks: The single layer Perceptron; 2.4 Evaluation metrics; 3. System Architecture of Bilingual Lexicon Extraction System; 3.1 Required linguistic resources; 3.2 System architecture; 4. Building a Seed Dictionary; 4.1 Methodology: Context Based Approach (CBA); 4.2 Experiments and results; 4.2.1 Experimental setups; 4.2.2 Experimental results; 4.3 Discussions; 5. Extracting Bilingual Lexicons; 5.1 Methodology: Iterative Approach (IA); 5.2 Experiments and results; 5.2.1 Experimental setups; 5.2.2 Experimental results; 5.3 Discussions; 6. Conclusions and Future Work
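For readers unfamiliar with the single-layer perceptron named in this abstract, the following Python sketch shows the textbook update rule on a toy, linearly separable problem. It is not the modified algorithm proposed in the thesis, whose details are not given here; the function names, the learning rate, and the training data are hypothetical.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum exceeds zero, otherwise 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Train a single-layer perceptron with the classic error-driven update."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in samples:
            delta = label - predict(weights, bias, x)
            if delta != 0:
                # Move the decision boundary toward the misclassified sample.
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
                errors += 1
        if errors == 0:
            break  # every sample is classified correctly
    return weights, bias


# Toy data (hypothetical): the logical AND of two binary features.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print([predict(weights, bias, x) for x, _ in samples])
```

The weights are adjusted only when a prediction is wrong, so training stops changing them once the data are separated; the iterative weight learning mentioned in the abstract builds on this kind of repeated update.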

    Automatic Bilingual Lexicon Extraction for a Minority Target Language

    PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200