
    Bilingual Lexicon Extraction Using a Modified Perceptron Algorithm

    In computational linguistics, parallel corpora and bilingual lexicons are important resources for fields such as machine translation and cross-language information retrieval. For example, parallel corpora are used to extract translation probabilities in machine translation systems. Bilingual lexicons directly enable word-to-word translation in cross-language information retrieval, and they also support the translation process in machine translation systems. Moreover, the larger the parallel corpora and bilingual lexicons available for training, the better a machine translation system performs. However, building bilingual lexicons manually, that is, by human effort, requires considerable cost, time, and labor. For these reasons, research on extracting bilingual lexicons automatically has attracted much attention from researchers. This paper proposes a new and effective methodology for extracting bilingual lexicons. It builds on the vector space model most widely used in bilingual lexicon extraction and iteratively learns the weights of bilingual lexicon entries with the perceptron algorithm, a type of neural network. The final bilingual lexicons are then extracted using the perceptron and the iteratively learned weights. As a result, the iteratively trained model achieved an average accuracy improvement of 3.5% over the initial, untrained results.
    1. Introduction
    2. Literature Review
    2.1 Linguistic resources: The text corpora
    2.2 A vector space model
    2.3 Neural networks: The single layer Perceptron
    2.4 Evaluation metrics
    3. System Architecture of Bilingual Lexicon Extraction System
    3.1 Required linguistic resources
    3.2 System architecture
    4. Building a Seed Dictionary
    4.1 Methodology: Context Based Approach (CBA)
    4.2 Experiments and results
    4.2.1 Experimental setups
    4.2.2 Experimental results
    4.3 Discussions
    5. Extracting Bilingual Lexicons
    5.1 Methodology: Iterative Approach (IA)
    5.2 Experiments and results
    5.2.1 Experimental setups
    5.2.2 Experimental results
    5.3 Discussions
    6. Conclusions and Future Work
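    The core of the method is a standard perceptron update applied to translation-candidate features. Below is a minimal sketch of that idea, assuming context vectors over a shared seed-dictionary space are already built as numpy arrays; the elementwise feature weighting and all names (src_vecs, tgt_vecs, seed_pairs, negatives) are illustrative assumptions, not the paper's exact formulation.

        import numpy as np

        def perceptron_lexicon_weights(src_vecs, tgt_vecs, seed_pairs, negatives,
                                       epochs=10, lr=0.1):
            """Iteratively learn per-dimension weights for scoring translation
            candidates, in the spirit of the perceptron-based approach above.

            src_vecs/tgt_vecs: dicts mapping words to context vectors (same dim).
            seed_pairs: known (src, tgt) translations (positive examples).
            negatives: (src, tgt) non-translation pairs (negative examples).
            """
            dim = len(next(iter(src_vecs.values())))
            w = np.ones(dim)  # start from uniform, i.e. unweighted scoring
            examples = [(p, 1) for p in seed_pairs] + [(n, -1) for n in negatives]
            for _ in range(epochs):
                for (s, t), label in examples:
                    feats = src_vecs[s] * tgt_vecs[t]   # elementwise context agreement
                    pred = 1 if w @ feats > 0 else -1
                    if pred != label:                   # classic perceptron update
                        w += lr * label * feats
            return w

        def best_translation(src_word, src_vecs, tgt_vecs, w):
            """Extract the highest-scoring target word under the learned weights."""
            return max(tgt_vecs, key=lambda t: w @ (src_vecs[src_word] * tgt_vecs[t]))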

    Transitive probabilistic CLIR models.

    Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness of up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator.
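    The transitive setup composes two probabilistic dictionaries through a pivot language, marginalising over pivot words: P(t|s) = sum_p P(t|p) * P(p|s). A minimal sketch of that composition; the pruning and renormalisation choices here are illustrative, not the paper's exact models.

        from collections import defaultdict

        def compose_translation_models(src_to_pivot, pivot_to_tgt, top_k=5):
            """Transitive source->target model: P(t|s) = sum_p P(t|p) * P(p|s).

            src_to_pivot / pivot_to_tgt: dict[word] -> dict[word, prob].
            Returns dict[src] -> list of (tgt, prob), pruned and renormalised.
            """
            src_to_tgt = {}
            for s, pivots in src_to_pivot.items():
                scores = defaultdict(float)
                for p, p_prob in pivots.items():
                    for t, t_prob in pivot_to_tgt.get(p, {}).items():
                        scores[t] += p_prob * t_prob      # marginalise over the pivot
                top = sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
                z = sum(prob for _, prob in top) or 1.0
                src_to_tgt[s] = [(t, prob / z) for t, prob in top]
            return src_to_tgt

        # toy example, e.g. Dutch -> English -> German
        nl_en = {"hond": {"dog": 0.9, "hound": 0.1}}
        en_de = {"dog": {"Hund": 0.8, "Rüde": 0.2}, "hound": {"Hund": 1.0}}
        print(compose_translation_models(nl_en, en_de))  # hond -> Hund 0.82, Rüde 0.18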

    A survey of cross-lingual word embedding models

    Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
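    One recurring model family in such surveys is the mapping-based approach: learn a linear map between two monolingual embedding spaces from a seed dictionary. A minimal sketch of the orthogonal-Procrustes variant of that idea, with purely illustrative toy data:

        import numpy as np

        def procrustes_map(X, Y):
            """Orthogonal map W minimising ||XW - Y||_F (orthogonal Procrustes).

            X: (n, d) source-language embeddings of seed-dictionary pairs.
            Y: (n, d) target-language embeddings of the same pairs.
            """
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt  # W = U V^T is the closed-form solution

        # toy check: recover a random orthogonal map from 3 seed pairs in 4-d
        rng = np.random.default_rng(0)
        X = rng.normal(size=(3, 4))
        W_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))
        W = procrustes_map(X, X @ W_true)
        print(np.allclose(X @ W, X @ W_true))  # True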

    Multilingual word embeddings and their utility in cross-lingual learning

    Word embeddings - dense vector representations of a word's distributional semantics - are an indispensable component of contemporary natural language processing (NLP). Bilingual embeddings, in particular, have attracted much attention in recent years, given their inherent applicability to cross-lingual NLP tasks such as part-of-speech tagging and dependency parsing. However, despite recent advancements in bilingual embedding mapping, very little research has been dedicated to aligning embeddings multilingually, where word embeddings for a variable number of languages are oriented to a single vector space. Given a proper alignment, one potential use case for multilingual embeddings is cross-lingual transfer learning, where a machine learning model trained on resource-rich languages (e.g. Finnish and Estonian) can "transfer" its salient features to a related language for which annotated resources are scarce (e.g. North Sami). However, the effect of the quality of this alignment on downstream cross-lingual NLP tasks has also been left largely unexplored. With this in mind, our work is motivated by two goals. First, we aim to leverage existing supervised and unsupervised methods in bilingual embedding mapping towards inducing high-quality multilingual embeddings. To this end, we propose three algorithms (one supervised, two unsupervised) and evaluate them against a completely supervised bilingual system and a commonly employed baseline approach. Second, we investigate the utility of multilingual embeddings in two common cross-lingual transfer learning scenarios: POS-tagging and dependency parsing. To do so, we train a joint POS-tagger/dependency parser on Universal Dependencies treebanks for a variety of Indo-European languages and evaluate it on other, closely related languages. Although we ultimately observe that, in most settings, multilingual word embeddings themselves do not induce a cross-lingual signal, our experimental framework and results offer many insights for future cross-lingual learning experiments.
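    The "single vector space" setup the thesis targets can be illustrated with the kind of commonly employed baseline it mentions: map every language independently into one hub language's space with a per-language orthogonal map. A sketch under that assumption (the data structures are illustrative, and this is not one of the thesis's three proposed algorithms):

        import numpy as np

        def align_to_hub(hub_embs, lang_embs, seed_dicts):
            """Per-language orthogonal maps into a shared hub space.

            hub_embs: dict[word] -> vector for the hub language.
            lang_embs: dict[lang] -> (dict[word] -> vector).
            seed_dicts: dict[lang] -> list of (lang_word, hub_word) pairs.
            Returns dict[lang] -> map W; apply as vec @ W to enter hub space.
            """
            maps = {}
            for lang, pairs in seed_dicts.items():
                X = np.stack([lang_embs[lang][w] for w, _ in pairs])
                Y = np.stack([hub_embs[h] for _, h in pairs])
                U, _, Vt = np.linalg.svd(X.T @ Y)   # orthogonal Procrustes per language
                maps[lang] = U @ Vt
            return maps

    After such an alignment, a model trained on hub-space vectors for one language can be applied directly to another language's mapped vectors, which is the transfer scenario the thesis evaluates.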

    A Survey of Paraphrasing and Textual Entailment Methods

    Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment, and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.
    Comment: Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 201
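    The stated reduction of paraphrasing to bidirectional entailment is easy to express in code. A minimal sketch; overlap_entails below is only a toy stand-in for a real entailment recogniser:

        def recognize_paraphrase(a, b, entails):
            """Paraphrase as bidirectional textual entailment:
            a and b are paraphrases iff a entails b and b entails a."""
            return entails(a, b) and entails(b, a)

        # toy stand-in for an entailment method: word-overlap containment
        def overlap_entails(premise, hypothesis, threshold=0.8):
            p, h = set(premise.lower().split()), set(hypothesis.lower().split())
            return len(p & h) / len(h) >= threshold

        print(recognize_paraphrase("the cat sat on the mat",
                                   "on the mat the cat sat", overlap_entails))  # True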