480 research outputs found
Bilingual Lexicon Extraction Using a Modified Perceptron Algorithm
In the field of natural language processing, parallel corpora and bilingual lexicons are used as important resources in areas such as machine translation and cross-lingual information retrieval. For example, parallel corpora are used to extract translation probabilities in machine translation systems. Bilingual lexicons directly enable word-to-word translation in cross-lingual information retrieval, and they also support the translation process in machine translation systems. Moreover, the larger the parallel corpora and bilingual lexicons available for training, the better a machine translation system performs. However, building such bilingual lexicons manually, that is, by human hand, requires considerable cost, time, and labor. For these reasons, research on extracting bilingual lexicons automatically has attracted many researchers.
This paper proposes a new and effective methodology for extracting bilingual lexicons. It builds on the vector space model, the most widely used basis for bilingual lexicon extraction, and uses the perceptron algorithm, a type of neural network, to iteratively learn weights for bilingual lexicon entries. The final bilingual lexicons are then extracted using the iteratively learned weights and the perceptron.
As a result, the iteratively trained model achieved an average accuracy improvement of 3.5% over the initial, untrained results.
1. Introduction
2. Literature Review
2.1 Linguistic resources: The text corpora
2.2 A vector space model
2.3 Neural networks: The single layer Perceptron
2.4 Evaluation metrics
3. System Architecture of Bilingual Lexicon Extraction System
3.1 Required linguistic resources
3.2 System architecture
4. Building a Seed Dictionary
4.1 Methodology: Context Based Approach (CBA)
4.2 Experiments and results
4.2.1 Experimental setups
4.2.2 Experimental results
4.3 Discussions
5. Extracting Bilingual Lexicons
5.1 Methodology: Iterative Approach (IA)
5.2 Experiments and results
5.2.1 Experimental setups
5.2.2 Experimental results
5.3 Discussions
6. Conclusions and Future Work
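The perceptron learning described in the abstract above can be sketched with a standard single-layer perceptron. The framing of candidate word pairs as positive/negative examples and the toy feature values (e.g. context-vector similarities) are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

def train_perceptron(features, labels, epochs=10, lr=1.0):
    """Single-layer perceptron; labels are +1 (translation pair) or -1."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            if y * (w @ x + b) <= 0:  # misclassified -> update weights
                w += lr * y * x
                b += lr * y
    return w, b

# Toy feature vectors for candidate word pairs (hypothetical values).
X = np.array([[0.9, 0.8], [0.8, 0.7], [0.1, 0.2], [0.2, 0.1]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
preds = np.sign(X @ w + b)
```

Candidate pairs scoring above the learned decision boundary would then be kept as lexicon entries.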
Termhood-based Comparability Metrics of Comparable Corpus in Special Domain
Cross-Language Information Retrieval (CLIR) and machine translation (MT)
resources, such as dictionaries and parallel corpora, are scarce and hard to
come by for special domains. Besides, these resources are limited to a few
languages, such as English, French, and Spanish. Obtaining comparable corpora
automatically for such domains could therefore be an effective answer to this
problem. Comparable corpora, whose subcorpora are not translations of each
other, can easily be obtained from the web. Therefore, building and using
comparable corpora is often a more feasible option in multilingual information
processing. Comparability metrics are one of the key issues in building and
using comparable corpora. Currently, there is no widely accepted definition of
corpus comparability or method for measuring it. In fact, different definitions
or metrics may suit different natural language processing tasks. This paper
proposes a new comparability metric, a termhood-based metric oriented to the
task of bilingual terminology extraction. In this method, words are ranked by
termhood rather than frequency, and the cosine similarity calculated over the
termhood ranking lists is used as the comparability score. Experimental results
show that the termhood-based metric performs better than traditional
frequency-based metrics.
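The termhood-based comparability score described above might be sketched as follows; the `comparability` function and its inputs are hypothetical names, assuming each subcorpus has been summarized as a word-to-termhood mapping:

```python
import math

def comparability(termhood_a, termhood_b):
    """Cosine similarity over termhood scores of the combined vocabulary.
    Words absent from one subcorpus contribute a termhood of zero."""
    vocab = sorted(set(termhood_a) | set(termhood_b))
    va = [termhood_a.get(w, 0.0) for w in vocab]
    vb = [termhood_b.get(w, 0.0) for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb) if na and nb else 0.0
```

Identical termhood rankings give a score of 1.0, while subcorpora with disjoint vocabularies score 0.0.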
Parallel texts alignment
Work presented as part of the Master's programme in Engenharia Informática, as a partial requirement for the degree of Mestre em Engenharia Informática.
Alignment of parallel texts (texts that are translations of each other) is a required step for many applications that use parallel texts, including statistical machine translation, automatic extraction of translation equivalents, automatic creation of concordances, etc.
This dissertation presents a new methodology for parallel texts alignment that departs from previous work in several ways. One important departure is a shift of goals concerning the use of lexicons for obtaining correspondences between the texts. Previous methods try to infer a bilingual lexicon as part of the alignment process and use it to obtain correspondences between the texts. Some of those methods can use external lexicons to complement the inferred one,
but they tend to consider them as secondary. This dissertation presents several arguments supporting the thesis that lexicon inference should not be embedded in the alignment process. The method described complies with this statement and relies exclusively on externally managed lexicons to obtain correspondences. Moreover, the algorithms presented can handle very large lexicons containing terms of arbitrary length.
Besides the exclusive use of external lexicons, this dissertation presents a new method for obtaining correspondences between translation equivalents found in the texts. It uses a decision criterion based on features that have been overlooked by prior work.
The proposed method is iterative and refines the alignment at each iteration. It uses the
alignment obtained in one iteration as a guide to obtaining new correspondences in the next iteration, which in turn are used to compute a finer alignment. This iterative scheme allows the method to correct correspondence errors from previous iterations in the face of new information.
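One plausible ingredient of such an iterative refinement is keeping only the candidate correspondences that are mutually consistent, i.e. that fit a monotonic alignment between the two texts. The longest-increasing-subsequence filter below is an illustrative sketch of that idea, not the dissertation's actual decision criterion:

```python
def monotonic_anchors(candidates):
    """Keep the largest subset of (src_pos, tgt_pos) pairs that is
    increasing in both texts (longest increasing subsequence)."""
    cands = sorted(set(candidates))
    n = len(cands)
    if n == 0:
        return []
    length = [1] * n   # LIS length ending at each candidate
    prev = [-1] * n    # back-pointers to reconstruct the chain
    for i in range(n):
        for j in range(i):
            if (cands[j][0] < cands[i][0] and cands[j][1] < cands[i][1]
                    and length[j] + 1 > length[i]):
                length[i] = length[j] + 1
                prev[i] = j
    k = max(range(n), key=length.__getitem__)
    chain = []
    while k != -1:
        chain.append(cands[k])
        k = prev[k]
    return chain[::-1]
```

Correspondences rejected in one iteration may re-enter in the next, once neighbouring anchors make them consistent.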
Word-to-Word Models of Translational Equivalence
Parallel texts (bitexts) have properties that distinguish them from other
kinds of parallel data. First, most words translate to only one other word.
Second, bitext correspondence is noisy. This article presents methods for
biasing statistical translation models to reflect these properties. Analysis of
the expected behavior of these biases in the presence of sparse data predicts
that they will result in more accurate models. The prediction is confirmed by
evaluation with respect to a gold standard -- translation models that are
biased in this fashion are significantly more accurate than a baseline
knowledge-poor model. This article also shows how a statistical translation
model can take advantage of various kinds of pre-existing knowledge that might
be available about particular language pairs. Even the simplest kinds of
language-specific knowledge, such as the distinction between content words and
function words, are shown to reliably boost translation model performance on
some tasks. Statistical models that are informed by pre-existing knowledge
about the model domain combine the best of both the rationalist and empiricist
traditions.
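The one-to-one bias described above can be illustrated with greedy competitive linking over association scores: repeatedly take the highest-scoring pair whose words are both still unlinked. The scores below are made-up toy values:

```python
def competitive_linking(scores):
    """Greedy one-to-one linking: each source and target word may
    participate in at most one link, reflecting the one-to-one bias."""
    links = []
    used_src, used_tgt = set(), set()
    for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in used_src and t not in used_tgt:
            links.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return links

# Hypothetical association scores for an English-French toy bitext.
scores = {("house", "maison"): 0.9, ("house", "bleue"): 0.2,
          ("blue", "bleue"): 0.8, ("blue", "maison"): 0.1}
links = competitive_linking(scores)
```

Because "house" is consumed by its strongest link, the weaker ("house", "bleue") pair is suppressed even though its score is positive.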
Multilingual sentiment analysis in social media.
252 p.
This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing analysis only for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish. The thesis addresses the following challenges in building such a system:
- Analysing methods for creating sentiment lexicons suitable for less-resourced languages.
- Analysis of social media (specifically Twitter): tweets pose several challenges for understanding and extracting opinions from such messages. Language identification and microtext normalization are addressed.
- Researching the state of the art in polarity classification, and developing a supervised classifier that is tested against well-known social media benchmarks.
- Developing a social media monitor capable of analysing sentiment with respect to specific events, products or organizations.
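A minimal lexicon-based polarity classifier of the kind such a system might build on can be sketched as follows; the lexicon, negator list and scoring scheme are illustrative assumptions, not the thesis's actual design:

```python
def polarity(tokens, lexicon, negators=frozenset({"not", "no", "never"})):
    """Sum lexicon scores over tokens, flipping the sign of the word
    that immediately follows a negator."""
    score, flip = 0.0, False
    for tok in tokens:
        w = tok.lower()
        if w in negators:
            flip = True
            continue
        s = lexicon.get(w, 0.0)
        score += -s if flip else s
        flip = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Hypothetical polarity lexicon.
lex = {"good": 1.0, "great": 2.0, "bad": -1.0}
```

Real tweets would first need the language identification and microtext normalization steps mentioned above.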
Identifying Semantic Divergences in Parallel Text without Annotations
Recognizing that even correct translations are not always semantically
equivalent, we automatically detect meaning divergences in parallel sentence
pairs with a deep neural model of bilingual semantic similarity which can be
trained for any parallel corpus without any manual annotation. We show that our
semantic model detects divergences more accurately than models based on surface
features derived from word alignments, and that these divergences matter for
neural machine translation.
Comment: Accepted as a full paper to NAACL 201
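A crude stand-in for such a divergence detector can be built by thresholding the cosine similarity of mean-pooled sentence embeddings; this replaces the paper's trained deep bilingual similarity model, and the threshold and pooling are assumptions:

```python
import numpy as np

def divergent(src_vecs, tgt_vecs, threshold=0.5):
    """Flag a sentence pair as semantically divergent when the cosine
    similarity of its mean-pooled embeddings falls below the threshold."""
    a = np.mean(src_vecs, axis=0)
    b = np.mean(tgt_vecs, axis=0)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos < threshold
```

The embeddings would need to live in a shared bilingual space for the comparison to be meaningful.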
- …