480 research outputs found

    Bilingual Lexicon Extraction Using a Modified Perceptron Algorithm

    Get PDF
    In computational linguistics, parallel corpora and bilingual lexicons are important resources in fields such as machine translation and cross-language information retrieval. For example, parallel corpora are used to extract translation probabilities in machine translation systems. Bilingual lexicons directly enable word-to-word translation in cross-language information retrieval, and they also assist the translation process in machine translation systems. Furthermore, the larger the parallel corpora and bilingual lexicons available for training, the better a machine translation system performs. However, building such bilingual lexicons manually, that is, by human effort, requires a great deal of cost, time, and labor. For these reasons, bilingual lexicon extraction has attracted many researchers. This thesis proposes a new and effective methodology for extracting bilingual lexicons. Based on the vector space model, the most widely used approach in bilingual lexicon extraction, it uses the perceptron algorithm, a type of neural network, to iteratively learn the weights of bilingual lexicon entries. The final bilingual lexicons are then extracted using the iteratively learned weights and the perceptron. As a result, the iteratively trained output achieved an average accuracy improvement of 3.5% over the initial, untrained output.
    1. Introduction
    2. Literature Review
        2.1 Linguistic resources: The text corpora
        2.2 A vector space model
        2.3 Neural networks: The single layer Perceptron
        2.4 Evaluation metrics
    3. System Architecture of Bilingual Lexicon Extraction System
        3.1 Required linguistic resources
        3.2 System architecture
    4. Building a Seed Dictionary
        4.1 Methodology: Context Based Approach (CBA)
        4.2 Experiments and results
            4.2.1 Experimental setups
            4.2.2 Experimental results
        4.3 Discussions
    5. Extracting Bilingual Lexicons
        5.1 Methodology: Iterative Approach (IA)
        5.2 Experiments and results
            5.2.1 Experimental setups
            5.2.2 Experimental results
        5.3 Discussions
    6. Conclusions and Future Work
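    The mistake-driven weight learning described in the abstract above can be sketched with a classic single-layer perceptron. This is a minimal illustration only: the feature vectors (here, toy similarity scores for candidate source-target word pairs), the learning rate, and the epoch count are assumptions, not the thesis's actual setup.

    ```python
    def train_perceptron(features, labels, epochs=20, lr=0.5):
        """Classic perceptron over candidate (source, target) word pairs.

        labels are +1 (translation pair) or -1 (non-translation pair);
        weights are updated only when the current prediction is wrong.
        """
        w = [0.0] * len(features[0])
        b = 0.0
        for _ in range(epochs):
            for x, y in zip(features, labels):
                score = sum(wi * xi for wi, xi in zip(w, x)) + b
                pred = 1 if score > 0 else -1
                if pred != y:  # mistake-driven update
                    w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                    b += lr * y
        return w, b

    # Toy data: two similarity features per candidate pair.
    X = [[1.0, 1.0], [0.9, 0.8], [0.1, 0.1], [0.2, 0.05]]
    Y = [1, 1, -1, -1]
    w, b = train_perceptron(X, Y)
    ```

    Repeating this training loop, then ranking candidate pairs by their learned score, is the general shape of the iterative extraction scheme the abstract outlines.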

    Termhood-based Comparability Metrics of Comparable Corpus in Special Domain

    Full text link
    Cross-Language Information Retrieval (CLIR) and machine translation (MT) resources, such as dictionaries and parallel corpora, are scarce and hard to come by for special domains, and are limited to a few languages such as English, French, and Spanish. Automatically obtaining comparable corpora for such domains could therefore be an effective answer to this problem. Comparable corpora, whose subcorpora are not translations of each other, can be easily obtained from the web, so building and using comparable corpora is often a more feasible option in multilingual information processing. Comparability metrics are one of the key issues in building and using comparable corpora, yet there is currently no widely accepted definition or measurement method for corpus comparability; in fact, different definitions or metrics may be appropriate for different natural language processing tasks. This paper proposes a new comparability metric, a termhood-based metric, oriented to the task of bilingual terminology extraction. In this method, words are ranked by termhood rather than frequency, and the cosine similarity computed over the termhood ranking lists is used as the comparability score. Experimental results show that the termhood-based metric performs better than traditional frequency-based metrics.
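    The metric described above can be sketched as follows: rank the shared vocabulary by termhood in each subcorpus, turn the rankings into vectors, and compare them with cosine similarity. This is a hedged sketch; the termhood scores themselves are assumed given (the paper computes them, and any term-weighting scheme could stand in here).

    ```python
    import math

    def rank_vector(scores, vocab):
        """Convert termhood scores into a rank-based weight vector
        over a fixed vocabulary (top-ranked term weighs most)."""
        order = sorted(vocab, key=lambda w: (-scores.get(w, 0.0), w))
        pos = {w: r for r, w in enumerate(order)}
        return [len(vocab) - pos[w] for w in vocab]

    def termhood_comparability(scores_a, scores_b):
        """Cosine similarity of the two termhood ranking vectors."""
        vocab = sorted(set(scores_a) | set(scores_b))
        u = rank_vector(scores_a, vocab)
        v = rank_vector(scores_b, vocab)
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0
    ```

    Two subcorpora that rank their terminology identically score 1.0; the more the rankings diverge, the lower the score.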

    Parallel texts alignment

    Get PDF
    Work presented as part of the Master's programme in Computer Engineering, as a partial requirement for the degree of Master in Computer Engineering. Alignment of parallel texts (texts that are translations of each other) is a required step for many applications that use parallel texts, including statistical machine translation, automatic extraction of translation equivalents, automatic creation of concordances, etc. This dissertation presents a new methodology for parallel texts alignment that departs from previous work in several ways. One important departure is a shift of goals concerning the use of lexicons for obtaining correspondences between the texts. Previous methods try to infer a bilingual lexicon as part of the alignment process and use it to obtain correspondences between the texts. Some of those methods can use external lexicons to complement the inferred one, but they tend to consider them as secondary. This dissertation presents several arguments supporting the thesis that lexicon inference should not be embedded in the alignment process. The method described complies with this statement and relies exclusively on externally managed lexicons to obtain correspondences. Moreover, the algorithms presented can handle very large lexicons containing terms of arbitrary length. Besides the exclusive use of external lexicons, this dissertation presents a new method for obtaining correspondences between translation equivalents found in the texts. It uses a decision criterion based on features that have been overlooked by prior work. The proposed method is iterative and refines the alignment at each iteration. It uses the alignment obtained in one iteration as a guide to obtaining new correspondences in the next iteration, which in turn are used to compute a finer alignment. This iterative scheme allows the method to correct correspondence errors from previous iterations in the face of new information.
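    The iterative refinement loop in the abstract above can be sketched as follows. This is an illustrative toy, not the dissertation's algorithm: the external lexicon is modeled as a simple one-to-one dictionary, the "guide" is a linear interpolation between current anchor points, and each pass halves the search band around the estimated alignment.

    ```python
    def interpolate(i, anchors, n_src, n_tgt):
        """Expected target position for source index i, linearly
        interpolated between the current anchor points (falls back
        to proportional position when no anchors surround i)."""
        prev, nxt = (0, 0), (n_src, n_tgt)
        for a in anchors:
            if a[0] <= i and a[0] >= prev[0]:
                prev = a
            if a[0] > i and a[0] < nxt[0]:
                nxt = a
        if nxt[0] == prev[0]:
            return float(prev[1])
        frac = (i - prev[0]) / (nxt[0] - prev[0])
        return prev[1] + frac * (nxt[1] - prev[1])

    def monotonic(pairs):
        """Keep a subset of correspondences with increasing target index."""
        out, last_j = [], -1
        for i, j in sorted(pairs):
            if j > last_j:
                out.append((i, j))
                last_j = j
        return out

    def iterative_align(src, tgt, lexicon, iterations=3):
        """Each pass keeps only lexicon matches near the current
        alignment estimate, then re-estimates from the survivors."""
        anchors, band = [], float(len(tgt))
        for _ in range(iterations):
            cands = []
            for i, s in enumerate(src):
                exp = interpolate(i, anchors, len(src), len(tgt))
                for j, t in enumerate(tgt):
                    if lexicon.get(s) == t and abs(j - exp) <= band:
                        cands.append((i, j))
            anchors = monotonic(cands)
            band = max(2.0, band / 2)  # tighten the search band each pass
        return anchors
    ```

    Narrowing the band around an increasingly reliable alignment is what lets such a scheme discard spurious correspondences admitted in earlier, looser passes.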

    Word-to-Word Models of Translational Equivalence

    Full text link
    Parallel texts (bitexts) have properties that distinguish them from other kinds of parallel data. First, most words translate to only one other word. Second, bitext correspondence is noisy. This article presents methods for biasing statistical translation models to reflect these properties. Analysis of the expected behavior of these biases in the presence of sparse data predicts that they will result in more accurate models. The prediction is confirmed by evaluation with respect to a gold standard -- translation models that are biased in this fashion are significantly more accurate than a baseline knowledge-poor model. This article also shows how a statistical translation model can take advantage of various kinds of pre-existing knowledge that might be available about particular language pairs. Even the simplest kinds of language-specific knowledge, such as the distinction between content words and function words, are shown to reliably boost translation model performance on some tasks. Statistical models that are informed by pre-existing knowledge about the model domain combine the best of both the rationalist and empiricist traditions.

    Multilingual sentiment analysis in social media.

    Get PDF
    252 p. This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing analysis only for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system covering Basque, English, French, and Spanish. The thesis addresses the following challenges in building such a system:
    - Analysing methods for creating sentiment lexicons, suitable for less-resourced languages.
    - Analysis of social media (specifically Twitter): tweets pose several challenges for understanding and extracting opinions from such messages; language identification and microtext normalization are addressed.
    - Researching the state of the art in polarity classification, and developing a supervised classifier that is tested against well-known social media benchmarks.
    - Developing a social media monitor capable of analysing sentiment with respect to specific events, products, or organizations.

    Identifying Semantic Divergences in Parallel Text without Annotations

    Full text link
    Recognizing that even correct translations are not always semantically equivalent, we automatically detect meaning divergences in parallel sentence pairs with a deep neural model of bilingual semantic similarity which can be trained for any parallel corpus without any manual annotation. We show that our semantic model detects divergences more accurately than models based on surface features derived from word alignments, and that these divergences matter for neural machine translation.
    Comment: Accepted as a full paper to NAACL 201