5 research outputs found

    Building a bilingual lexicon using phrase-based statistical machine translation via a pivot language

    Abstract: This paper proposes a novel method for building a bilingual lexicon through a pivot language by using phrase-based statistical machine translation (SMT). Given two bilingual lexicons between the language pairs Lf-Lp and Lp-Le, we treat these lexicons as parallel corpora, extract a phrase table from each, and merge the two phrase tables into a single phrase table between Lf and Le. Finally, we construct a phrase-based SMT system that translates the terms of the lexicon Lf-Lp into terms of Le, obtaining a new lexicon Lf-Le. In our experiments with Chinese-English and Japanese-English lexicons, our system could cover 72.8% of the Chinese terms and drastically improve the utilization ratio
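    The merging step can be pictured with a small sketch that triangulates two toy phrase tables through the pivot language, multiplying and summing translation probabilities over shared pivot phrases. The table entries, names, and probabilities below are illustrative assumptions, not the paper's actual data or SMT pipeline.

        from collections import defaultdict

        # Toy phrase table Lf -> Lp: {source phrase: {pivot phrase: P(p|f)}}
        fp_table = {
            "source-term": {"pivot-A": 0.9, "pivot-B": 0.1},
        }
        # Toy phrase table Lp -> Le: {pivot phrase: {target phrase: P(e|p)}}
        pe_table = {
            "pivot-A": {"target-X": 0.95, "target-Y": 0.05},
            "pivot-B": {"target-X": 1.0},
        }

        def triangulate(fp, pe):
            """Merge two phrase tables through the shared pivot language.

            P(e|f) is approximated by summing P(e|p) * P(p|f) over every pivot
            phrase p that appears in both tables.
            """
            fe = defaultdict(lambda: defaultdict(float))
            for f, pivots in fp.items():
                for p, p_given_f in pivots.items():
                    for e, e_given_p in pe.get(p, {}).items():
                        fe[f][e] += e_given_p * p_given_f
            return fe

        for f, targets in triangulate(fp_table, pe_table).items():
            best = max(targets, key=targets.get)
            print(f, "->", best, round(targets[best], 3))   # source-term -> target-X 0.955

    In the paper's setting, the merged Lf-Le phrase table then drives a phrase-based SMT system that translates the Lf-Lp lexicon entries into Le terms; the sketch stops at printing the highest-scoring target phrase per source term.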

    Електронний словник підвищеної швидкодії на основі хеш-адресації без колізій (A high-speed electronic dictionary based on collision-free hash addressing)

    The aim of the research presented in this diploma project is to increase the speed of the electronic dictionaries used in intelligent computer translation systems by applying the fastest kind of lookup: hash addressing. To raise lookup speed, collision-free hash addressing is proposed. A hash transformation that produces no collisions can be found quickly by making the memory address space sparse, and the contextual information for the keywords is placed in the free slots between the occupied hash addresses. The project develops a procedure for selecting a collision-free hash transformation for a given array of keywords, a scheme for placing the words and their accompanying information at the hash addresses, and a procedure for looking up contextual information by key. The results can be used to improve the efficiency of intelligent computer translation systems.
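    A minimal sketch of the collision-free idea, under simplifying assumptions: a multiplier for an ordinary multiplicative hash is searched until no two keywords of a fixed set collide in a deliberately oversized (sparse) table, and each keyword's contextual information is stored with it at its address. The hash function, parameters, and sample entries are illustrative, not the procedure developed in the project.

        def str_hash(s):
            """Deterministic polynomial string hash."""
            h = 0
            for ch in s:
                h = h * 31 + ord(ch)
            return h

        def find_collision_free_multiplier(keys, load_factor=0.25, max_tries=100000):
            """Search for a multiplier so that no two keys share a hash address.

            The table is made several times larger than the key set (a sparse
            address space), which makes a collision-free mapping easy to find
            and leaves free slots between occupied addresses for extra data.
            """
            size = max(int(len(keys) / load_factor), 1)
            for a in range(1, max_tries):
                if len({(a * str_hash(k)) % size for k in keys}) == len(keys):
                    return a, size
            raise RuntimeError("no collision-free multiplier found; enlarge the table")

        def build_dictionary(entries):
            """Place each keyword and its context information at its hash address."""
            a, size = find_collision_free_multiplier(list(entries))
            table = [None] * size
            for word, context in entries.items():
                table[(a * str_hash(word)) % size] = (word, context)
            return a, size, table

        def lookup(word, a, size, table):
            """Single-probe lookup: one hash computation, no collision handling."""
            slot = table[(a * str_hash(word)) % size]
            return slot[1] if slot and slot[0] == word else None

        entries = {"lexicon": "noun; the vocabulary of a language",
                   "translate": "verb; to render text in another language"}
        a, size, table = build_dictionary(entries)
        print(lookup("lexicon", a, size, table))

    Because the transformation is chosen for the fixed keyword set, every lookup resolves in a single probe with no collision resolution, which is where the claimed speed gain comes from.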

    Multilingual Lexicon Extraction under Resource-Poor Language Pairs

    In general, bilingual and multilingual lexicons are important resources in many natural language processing fields such as information retrieval and machine translation. Such lexicons are usually extracted from bilingual (e.g., parallel or comparable) corpora with external seed dictionaries. However, few such corpora and bilingual seed dictionaries are publicly available for many language pairs, such as Korean–French, so it is important that such resources for these language pairs be publicly available or easily accessible, even when only monolingual resources are considered. This thesis presents efficient approaches for extracting bilingual single-word and multi-word lexicons for resource-poor language pairs such as Korean–French and Korean–Spanish.

    The goal of this thesis is to present several efficient, statistically based methods for extracting translated single words and multi-words from bilingual corpora. Three approaches for single words and one approach for multi-words are proposed.

    The first approach is the pivot context-based approach (PCA). The PCA uses a pivot language to connect the source and target languages: it builds context vectors from two parallel corpora that share the pivot language and calculates similarity scores between them to choose the best translation equivalents. By relying on parallel rather than comparable corpora, the approach reduces the effort of using a seed dictionary for translation. The second approach is the extended pivot context-based approach (EPCA), which gathers similar context vectors for each source word to augment its context, on the assumption that similar vectors can enrich contexts; for example, young and youth can augment the context of baby. In this work, such similar vectors were collected with similarity measures such as cosine similarity.

    The third approach for single words uses a competitive neural network algorithm, namely self-organizing maps (SOMs). The SOM-based approach (SA) uses synonym vectors rather than context vectors to train two different SOMs (a source SOM and a target SOM) in different ways: the source SOM is trained in an unsupervised way, while the target SOM is trained in a supervised way.

    The fourth approach is the constituent-based approach (CTA), which deals with multi-word expressions (MWEs) and reinforces the PCA for multi-words (PCAM). It extracts bilingual MWEs while taking all constituents of the source MWEs into consideration. The PCAM first identifies MWE candidates by pointwise mutual information and then adds them to the input data as single units so that the PCA can be applied directly.

    The experimental results show that the proposed approaches generally perform well for resource-poor language pairs, particularly Korean–French and Korean–Spanish. The PCA and SA demonstrated good performance for such language pairs, whereas the EPCA did not perform as strongly as expected. The CTA performs well even when word contexts are insufficient, and it significantly outperforms the PCAM.

    In future work, homonyms (i.e., homographs such as lead or tear) should be considered, and in particular the domains of the bilingual corpora should be identified. More parts of speech, such as verbs, adjectives, or adverbs, could also be tested; in this thesis, only nouns are considered, for simplicity. Finally, thorough error analysis should also be conducted.
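    As an illustration of the pivot context-based ranking at the heart of the PCA, the sketch below represents a source word and candidate target words as context vectors over pivot-language context words and ranks the candidates by cosine similarity. The toy vectors, words, and counts are assumptions for illustration, not the thesis's corpora, preprocessing, or results.

        import math

        def cosine(u, v):
            """Cosine similarity of two sparse vectors given as dicts."""
            dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
            norm_u = math.sqrt(sum(x * x for x in u.values()))
            norm_v = math.sqrt(sum(x * x for x in v.values()))
            return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

        # Context vectors over pivot-language (here: English) context words,
        # built from two parallel corpora that share the pivot side.
        source_vectors = {                  # source-language word (romanized)
            "agi": {"baby": 4, "cry": 2, "milk": 3},
        }
        target_vectors = {                  # candidate target-language words
            "bébé": {"baby": 5, "cry": 1, "milk": 2},
            "lait": {"milk": 6, "drink": 3},
        }

        def rank_candidates(src_vec, candidates):
            """Rank target candidates by cosine similarity to the source vector."""
            scored = ((w, round(cosine(src_vec, v), 3)) for w, v in candidates.items())
            return sorted(scored, key=lambda pair: pair[1], reverse=True)

        print(rank_candidates(source_vectors["agi"], target_vectors))
        # [('bébé', 0.949), ('lait', 0.498)]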
    Table of contents: Abstract; List of Abbreviations; List of Tables; List of Figures; Acknowledgement; Chapter 1 Introduction (1.1 Multilingual Lexicon Extraction, 1.2 Motivations and Goals, 1.3 Organization); Chapter 2 Background and Literature Review (2.1 Extraction of Bilingual Translations of Single-Words: 2.1.1 Context-based approach, 2.1.2 Extended approach, 2.1.3 Pivot-based approach; 2.2 Extraction of Bilingual Translations of Multi-Word Expressions: 2.2.1 MWE identification, 2.2.2 MWE alignment; 2.3 Self-Organizing Maps; 2.4 Evaluation Measures); Chapter 3 Pivot Context-Based Approach (3.1 Concept of Pivot-Based Approach, 3.2 Experiments: 3.2.1 Resources, 3.2.2 Results, 3.3 Summary); Chapter 4 Extended Pivot Context-Based Approach (4.1 Concept of Extended Pivot Context-Based Approach, 4.2 Experiments: 4.2.1 Resources, 4.2.2 Results, 4.3 Summary); Chapter 5 SOM-Based Approach (5.1 Concept of SOM-Based Approach, 5.2 Experiments: 5.2.1 Resources, 5.2.2 Results, 5.3 Summary); Chapter 6 Constituent-Based Approach (6.1 Concept of Constituent-Based Approach, 6.2 Experiments: 6.2.1 Resources, 6.2.2 Results, 6.3 Summary); Chapter 7 Conclusions and Future Work (7.1 Conclusions, 7.2 Future Work); References
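    The MWE-candidate identification that precedes the PCAM can be sketched in the same spirit with pointwise mutual information (PMI) over a tiny example corpus; the corpus and the threshold below are illustrative assumptions.

        import math
        from collections import Counter

        # Tiny example corpus; real input would be the source-side corpus text.
        tokens = ("machine translation is hard machine translation needs data "
                  "the machine broke down").split()

        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

        def pmi(w1, w2):
            """PMI(w1, w2) = log2( P(w1 w2) / (P(w1) * P(w2)) )."""
            p_joint = bigrams[(w1, w2)] / n_bi
            p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
            return math.log2(p_joint / (p1 * p2)) if p_joint else float("-inf")

        # Keep frequent bigrams whose PMI exceeds a (hypothetical) threshold.
        candidates = [(w1, w2, round(pmi(w1, w2), 2))
                      for (w1, w2), count in bigrams.items()
                      if count > 1 and pmi(w1, w2) > 1.0]
        print(candidates)   # [('machine', 'translation', 2.13)]

    Candidate bigrams passing the threshold would then be treated as single units and fed to the PCA, as the abstract describes.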

    Sentiment analysis and resources for informal Arabic text on social media

    Online content posted by Arab users on social networks does not generally abide by standard grammatical and spelling rules. These posts, or comments, are valuable because they contain users' opinions about objects such as products, policies, institutions, and people, and these opinions are important material for commercial and governmental institutions. Commercial institutions can use them to steer marketing campaigns, optimize their products, and learn the weaknesses and/or strengths of their products. Governmental institutions can use social network posts to gauge public opinion before or after legislating a new policy or law and to learn about the main issues that concern citizens. However, the huge volume of online data and its noisy nature hinder the manual extraction and classification of the opinions present in online comments. Given the irregularity of dialectal (informal) Arabic, tools developed for formally correct Arabic are of limited use, specifically when sentiment analysis (SA) is applied to social media content. This research implemented a system that addresses this challenge. The work can be roughly divided into three blocks: building a corpus for SA and manually tagging it to check the performance of the constructed lexicon-based (LB) classifier; building a sentiment lexicon consisting of three sets of patterns (negative, positive, and spam); and implementing a classifier that employs the lexicon to classify Facebook comments. In addition to providing resources for dialectal Arabic SA and classifying Facebook comments, this work categorises the reasons behind incorrect classification, provides preliminary solutions for some of them with a focus on negation, and uses regular expressions to detect the presence of lexemes. It also illustrates how the constructed classifier works, along with its different levels of reporting, compares the performance of the LB classifier against a Naïve Bayes classifier, and addresses how NLP tools such as POS tagging and named entity recognition can be employed in SA. In addition, the work studies the performance of the implemented LB classifier and the developed sentiment lexicon on other corpora used in the literature, and the performance of lexicons from the literature on the corpora constructed in this research. With minor changes, the classifier can also be used for domain classification of documents (sports, science, news, etc.). The work ends with a discussion of research questions arising from the reported research.
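    The lexicon-based classification and negation handling described above can be pictured with a small sketch. The English regular-expression patterns, negation words, and scoring rule are placeholders assumed for illustration; they stand in for the Arabic sentiment lexicon, spam patterns, and rules built in this work.

        import re

        # Hypothetical pattern lexicon; the actual work builds Arabic patterns.
        LEXICON = {
            "positive": [r"\bgood\b", r"\bexcellent\b", r"\blove\w*\b"],
            "negative": [r"\bbad\b", r"\bterrible\b", r"\bhate\w*\b"],
            "spam":     [r"https?://\S+", r"\bfollow me\b"],
        }
        NEGATORS = re.compile(r"\b(?:not|no|never)\s+(\w+)")

        def classify(comment):
            """Return 'spam', 'positive', 'negative', or 'neutral' for one comment."""
            text = comment.lower()
            # Spam patterns take priority over sentiment.
            if any(re.search(p, text) for p in LEXICON["spam"]):
                return "spam"
            # Words that directly follow a negator get their polarity flipped.
            negated = {m.group(1) for m in NEGATORS.finditer(text)}
            scores = {"positive": 0, "negative": 0}
            for label in ("positive", "negative"):
                opposite = "negative" if label == "positive" else "positive"
                for pattern in LEXICON[label]:
                    for m in re.finditer(pattern, text):
                        scores[opposite if m.group(0) in negated else label] += 1
            if scores["positive"] == scores["negative"]:
                return "neutral"
            return max(scores, key=scores.get)

        print(classify("I do not love this product, it is terrible"))   # negative

    A Naïve Bayes baseline, such as the one the LB classifier is compared against, would instead learn word-class probabilities from the manually tagged corpus rather than rely on hand-built patterns.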