Multilingual Lexicon Extraction under Resource-Poor Language Pairs
In general, bilingual and multilingual lexicons are important resources in many natural language processing fields such as information retrieval and machine translation. Such lexicons are usually extracted from bilingual (e.g., parallel or comparable) corpora with external seed dictionaries. However, few such corpora and bilingual seed dictionaries are publicly available for many language pairs, such as Korean–French. It is therefore important that such resources for these language pairs be made publicly available or easily accessible, even when only monolingual resources are at hand.
This thesis presents efficient approaches for extracting bilingual single-/multi-word lexicons for resource-poor language pairs such as Korean–French and Korean–Spanish. The goal of this thesis is to present several efficient methods of extracting translated single-/multi-words from bilingual corpora based on a statistical method.
Three approaches for single words and one approach for multi-words are proposed. The first approach is the pivot context-based approach (PCA). The PCA uses a pivot language to connect the source and target languages. It builds context vectors from two parallel corpora that share one pivot language and calculates their similarity scores to choose the best translation equivalents. By using parallel corpora rather than comparable corpora, this approach reduces the effort of preparing a seed dictionary for translation. The second approach is the extended pivot context-based approach (EPCA). This approach gathers similar context vectors for each source word to augment its context, on the assumption that similar vectors can enrich contexts. For example, young and youth can augment the context of baby. In the investigation described here, such similar vectors were collected with similarity measures such as cosine similarity. The third approach for single words uses a competitive neural network algorithm, namely self-organizing maps (SOMs). The SOM-based approach (SA) uses synonym vectors rather than context vectors to train two different SOMs (a source SOM and a target SOM) in different ways: the source SOM is trained in an unsupervised way, while the target SOM is trained in a supervised way.
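The ranking step of such a pivot context-based comparison can be sketched as follows. This is a minimal, hypothetical illustration with invented toy data standing in for the pivot-language (e.g., English) sides of two parallel corpora, not the thesis's actual implementation:

```python
from collections import Counter
from math import sqrt

def context_vector(word, corpus, window=2):
    """Count pivot-language words co-occurring with `word` within a window."""
    vec = Counter()
    for sentence in corpus:
        for i, w in enumerate(sentence):
            if w == word:
                lo, hi = max(0, i - window), i + window + 1
                for c in sentence[lo:i] + sentence[i + 1:hi]:
                    vec[c] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy pivot-language contexts standing in for two aligned corpora:
source_corpus = [["baby", "sleeps", "in", "the", "crib"],
                 ["the", "baby", "cries", "at", "night"]]
target_corpus = [["bebe", "sleeps", "in", "the", "crib"],
                 ["the", "bebe", "cries", "at", "night"],
                 ["the", "voiture", "drives", "fast"]]

# Rank target candidates for one source word by context similarity:
src_vec = context_vector("baby", source_corpus)
candidates = {w: cosine(src_vec, context_vector(w, target_corpus))
              for w in ["bebe", "voiture"]}
best = max(candidates, key=candidates.get)  # highest-scoring translation
```

In the real setting the source and target vectors come from the pivot sides of two separate parallel corpora, and candidates are ranked for every source word rather than a hand-picked pair.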
The fourth approach is the constituent-based approach (CTA), which deals with multi-word expressions (MWEs). This approach builds on the PCA for multi-words (PCAM) and extracts bilingual MWEs by taking all constituents of the source MWEs into consideration. The PCAM first identifies MWE candidates by pointwise mutual information and then adds them to the input data as single units so that the PCA can be applied directly.
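The pointwise-mutual-information candidate step can be sketched roughly like this; a hypothetical minimal version restricted to adjacent bigrams, with an invented corpus and threshold:

```python
from collections import Counter
from math import log2

def pmi_bigrams(tokens, threshold=1.0):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = log2(p_xy / (p_x * p_y))
    # Pairs above the threshold become MWE candidates; they can then be
    # re-tokenized as single units (e.g. "machine_translation") so the
    # single-word pipeline can process them directly.
    return {bg: s for bg, s in scores.items() if s >= threshold}

tokens = ("machine translation is hard but machine translation helps "
          "the machine works").split()
candidates = pmi_bigrams(tokens)
```

On realistic corpora a minimum-frequency cutoff is normally added as well, since PMI overrates rare pairs.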
The experimental results show that the proposed approaches generally perform well for resource-poor language pairs, particularly Korean–French and Korean–Spanish. The PCA and SA demonstrated good performance for such language pairs, whereas the EPCA did not perform as well as expected. The CTA performs well even when word contexts are insufficient; overall, the results show that the CTA significantly outperforms the PCAM.
In the future, homonyms (i.e., homographs such as lead or tear) should be considered. In particular, the domains of the bilingual corpora should be identified. In addition, more parts of speech, such as verbs, adjectives, or adverbs, could be tested; in this thesis, only nouns are discussed for simplicity. Finally, a thorough error analysis should also be conducted.
Abstract
List of Abbreviations
List of Tables
List of Figures
Acknowledgement
Chapter 1 Introduction
1.1 Multilingual Lexicon Extraction
1.2 Motivations and Goals
1.3 Organization
Chapter 2 Background and Literature Review
2.1 Extraction of Bilingual Translations of Single-words
2.1.1 Context-based approach
2.1.2 Extended approach
2.1.3 Pivot-based approach
2.2 Extraction of Bilingual Translations of Multi-Word Expressions
2.2.1 MWE identification
2.2.2 MWE alignment
2.3 Self-Organizing Maps
2.4 Evaluation Measures
Chapter 3 Pivot Context-Based Approach
3.1 Concept of Pivot-Based Approach
3.2 Experiments
3.2.1 Resources
3.2.2 Results
3.3 Summary
Chapter 4 Extended Pivot Context-Based Approach
4.1 Concept of Extended Pivot Context-Based Approach
4.2 Experiments
4.2.1 Resources
4.2.2 Results
4.3 Summary
Chapter 5 SOM-Based Approach
5.1 Concept of SOM-Based Approach
5.2 Experiments
5.2.1 Resources
5.2.2 Results
5.3 Summary
Chapter 6 Constituent-Based Approach
6.1 Concept of Constituent-Based Approach
6.2 Experiments
6.2.1 Resources
6.2.2 Results
6.3 Summary
Chapter 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
References
Towards a Universal Wordnet by Learning from Combined Evidence
Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources, including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization
Translation alignment is an essential task in Digital Humanities and Natural
Language Processing, and it aims to link words/phrases in the source
text with their translation equivalents in the translation. In addition to
its importance in teaching and learning historical languages, translation
alignment builds bridges between ancient and modern languages through
which various linguistics annotations can be transferred. This thesis focuses
on word-level translation alignment applied to historical languages in general
and Ancient Greek and Latin in particular. As the title indicates, the thesis
addresses four interdisciplinary aspects of translation alignment.
The starting point was developing Ugarit, an interactive annotation tool
to perform manual alignment aiming to gather training data to train an
automatic alignment model. This effort resulted in more than 190k accurate
translation pairs that I later used for supervised training. Ugarit has also
been used by many researchers and scholars, including in classrooms at
several institutions for teaching and learning ancient languages. This use
produced a large, diverse, crowd-sourced aligned parallel corpus that allowed
us to conduct experiments and qualitative analyses to detect recurring
patterns in the annotators' alignment practice and the resulting translation pairs.
Further, I employed the recent advances in NLP and language modeling to
develop an automatic alignment model for historical low-resourced languages,
experimenting with various training objectives and proposing a training
strategy for historical languages that combines supervised and unsupervised
training with mono- and multilingual texts. Then, I integrated this alignment
model into other development workflows to project cross-lingual annotations
and induce bilingual dictionaries from parallel corpora.
Evaluation is essential to assess the quality of any model. To ensure best
practice, I reviewed the current evaluation procedure, identified its
limitations, and proposed two new evaluation metrics. Moreover, I introduced
a visual analytics framework to explore and inspect alignment gold-standard
datasets and to support quantitative and qualitative evaluation of
translation alignment models. In addition, I designed and implemented visual
analytics tools and reading environments for parallel texts, and proposed
various visualization approaches that support different alignment-related
tasks, drawing on the latest advances in information visualization and best practices.
Overall, this thesis presents a comprehensive study that includes manual and
automatic alignment techniques, evaluation methods and visual analytics
tools that aim to advance the field of translation alignment for historical
languages.
Projecting named entity tags from a resource rich language to a resource poor language
Named Entities (NEs) are the prominent entities appearing in textual documents. Automatic classification of NEs in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, or time. This article focuses on the person (PER), organization (ORG), and location (LOC) entities in a Malay journalistic corpus on terrorism. A projection algorithm, using the Dice coefficient and a bigram scoring method with domain-specific rules, is proposed to map NE information from the English corpus to the Malay corpus. The English corpus is a translated version of the Malay corpus; hence, the two corpora are treated as parallel corpora. The method computes the string similarity between English words and the list of available lexemes in a pre-built lexicon to approximate the best NE mapping. The algorithm has been evaluated on our own terrorism-tagged corpus, achieving satisfactory results in terms of precision, recall, and F-measure. An evaluation of a selected open-source NER tool for English is also presented.
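The string-similarity core of such a projection, the Dice coefficient over character bigrams, can be sketched as follows. The lexicon and words here are invented illustrations, not the article's data:

```python
def char_bigrams(word):
    """Character bigrams of a word, e.g. 'kuala' -> {'ku','ua','al','la'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient over character bigrams: 2*|X & Y| / (|X| + |Y|)."""
    x, y = char_bigrams(a.lower()), char_bigrams(b.lower())
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

# Map an English word to its closest lexeme in a toy pre-built lexicon
# (hypothetical stand-in for the Malay lexicon used in the article):
lexicon = ["polis", "menteri", "kementerian", "jakarta"]
best = max(lexicon, key=lambda w: dice("police", w))
```

The real algorithm combines this score with domain-specific rules before accepting a mapping; a bare nearest-match, as above, would over-project on short or unrelated words.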
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence
representations for 93 languages, belonging to more than 30 different language
families and written in 28 different scripts. Our system uses a single BiLSTM
encoder with a shared BPE vocabulary for all languages, which is coupled with
an auxiliary decoder and trained on publicly available parallel corpora. This
enables us to learn a classifier on top of the resulting sentence embeddings
using English annotated data only, and transfer it to any of the 93 languages
without any modification. Our approach sets a new state-of-the-art on zero-shot
cross-lingual natural language inference for all the 14 languages in the XNLI
dataset but one. We also achieve very competitive results in cross-lingual
document classification (MLDoc dataset). Our sentence embeddings are also
strong at parallel corpus mining, establishing a new state-of-the-art in the
BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new
test set of aligned sentences in 122 languages based on the Tatoeba corpus, and
show that our sentence embeddings obtain strong results in multilingual
similarity search even for low-resource languages. Our PyTorch implementation,
pre-trained encoder and the multilingual test set will be freely available.
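Parallel-corpus mining over such sentence embeddings reduces, in its simplest form, to nearest-neighbour search by cosine similarity. A toy sketch with made-up 3-dimensional vectors in place of real encoder outputs; a plain threshold stands in for the more refined scoring used in practice:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_parallel(src_embs, tgt_embs, threshold=0.8):
    """Pair each source sentence with its nearest target neighbour,
    keeping the pair only if the cosine score clears the threshold."""
    pairs = []
    for i, u in enumerate(src_embs):
        j, score = max(((j, cosine(u, v)) for j, v in enumerate(tgt_embs)),
                       key=lambda t: t[1])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

# Invented 3-d "embeddings" standing in for encoder outputs:
src = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
tgt = [[0.1, 0.9, 0.1], [1.0, 0.2, 0.0], [0.0, 0.1, 1.0]]
pairs = mine_parallel(src, tgt)
```

Because the encoder maps all languages into one space, the same search works for any source–target pair without retraining.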
Word sense discovery and disambiguation
The work is based on the assumption that words with similar syntactic usage have similar meaning, as proposed by Zellig S. Harris (1954, 1968). We study his assumption from two aspects: first, different meanings (word senses) of a word should manifest themselves in different usages (contexts); second, similar usages (contexts) should lead to similar meanings (word senses).
If we start with the different meanings of a word, we should be able to find distinct contexts for the meanings in text corpora. We separate the meanings by grouping and labeling contexts in an unsupervised or weakly supervised manner (Publication 1, 2 and 3). We are confronted with the question of how best to represent contexts in order to induce effective classifiers of contexts, because differences in context are the only means we have to separate word senses.
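One simple way to realize this grouping of contexts, where each resulting group approximates one sense, is greedy single-link clustering by context overlap. A toy sketch with an invented threshold and data, not the publications' actual classifiers:

```python
def context_overlap(a, b):
    """Jaccard overlap between two bag-of-words contexts."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def group_contexts(contexts, threshold=0.2):
    """Greedy single-link grouping: a context joins the first group it
    sufficiently overlaps with, otherwise it starts a new group (sense)."""
    groups = []
    for ctx in contexts:
        for g in groups:
            if any(context_overlap(ctx, other) >= threshold for other in g):
                g.append(ctx)
                break
        else:
            groups.append([ctx])
    return groups

# Contexts of the ambiguous word "bank" (the word itself removed):
contexts = [["river", "water", "flow"],
            ["money", "loan", "interest"],
            ["water", "river", "fish"],
            ["loan", "money", "account"]]
senses = group_contexts(contexts)
```

The choice of context representation matters more than the clustering rule here, which is precisely the question the paragraph above raises.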
If we start with words in similar contexts, we should be able to discover similarities in meaning. We can do this monolingually or multilingually. In the monolingual material, we find synonyms and other related words in an unsupervised way (Publication 4). In the multilingual material, we find translations by supervised learning of transliterations (Publication 5). In both the monolingual and the multilingual case, we first discover words with similar contexts, i.e., synonym or translation lists. In the monolingual case, we also aim at finding structure in the lists by discovering groups of similar words, e.g., synonym sets.
In this introduction to the publications of the thesis, we consider the larger background issues of how meaning arises, how it is quantized into word senses, and how it is modeled. We also consider how to define, collect and represent contexts. We discuss how to evaluate the trained context classifiers and discovered word sense classifications, and finally we present the word sense discovery and disambiguation methods of the publications.
This work supports Harris' hypothesis by implementing three new methods modeled on his hypothesis. The methods have practical consequences for creating thesauruses and translation dictionaries, e.g., for information retrieval and machine translation purposes.
Keywords: Word senses, Context, Evaluation, Word sense disambiguation, Word sense discovery
Semi-automatic acquisition of domain-specific semantic structures.
Siu, Kai-Chung. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 99-106). Abstracts in English and Chinese.
Chapter 1 Introduction
1.1 Thesis Outline
Chapter 2 Background
2.1 Natural Language Understanding
2.1.1 Rule-based Approaches
2.1.2 Stochastic Approaches
2.1.3 Phrase-Spotting Approaches
2.2 Grammar Induction
2.2.1 Semantic Classification Trees
2.2.2 Simulated Annealing
2.2.3 Bayesian Grammar Induction
2.2.4 Statistical Grammar Induction
2.3 Machine Translation
2.3.1 Rule-based Approach
2.3.2 Statistical Approach
2.3.3 Example-based Approach
2.3.4 Knowledge-based Approach
2.3.5 Evaluation Method
Chapter 3 Semi-Automatic Grammar Induction
3.1 Agglomerative Clustering
3.1.1 Spatial Clustering
3.1.2 Temporal Clustering
3.1.3 Free Parameters
3.2 Post-processing
3.3 Chapter Summary
Chapter 4 Application to the ATIS Domain
4.1 The ATIS Domain
4.2 Parameters Selection
4.3 Unsupervised Grammar Induction
4.4 Prior Knowledge Injection
4.5 Evaluation
4.5.1 Parse Coverage in Understanding
4.5.2 Parse Errors
4.5.3 Analysis
4.6 Chapter Summary
Chapter 5 Portability to Chinese
5.1 Corpus Preparation
5.1.1 Tokenization
5.2 Experiments
5.2.1 Unsupervised Grammar Induction
5.2.2 Prior Knowledge Injection
5.3 Evaluation
5.3.1 Parse Coverage in Understanding
5.3.2 Parse Errors
5.4 Grammar Comparison Across Languages
5.5 Chapter Summary
Chapter 6 Bi-directional Machine Translation
6.1 Bilingual Dictionary
6.2 Concept Alignments
6.3 Translation Procedures
6.3.1 The Matching Process
6.3.2 The Searching Process
6.3.3 Heuristics to Aid Translation
6.4 Evaluation
6.4.1 Coverage
6.4.2 Performance
6.5 Chapter Summary
Chapter 7 Conclusions
7.1 Summary
7.2 Future Work
7.2.1 Suggested Improvements on Grammar Induction Process
7.2.2 Suggested Improvements on Bi-directional Machine Translation
7.2.3 Domain Portability
7.3 Contributions
Bibliography
Appendix A Original SQL Queries
Appendix B Induced Grammar
Appendix C Seeded Categories