98 research outputs found
Korean-to-Chinese Machine Translation using Chinese Character as Pivot Clue
Korean-Chinese is a low-resource language pair, but Korean and Chinese have a
lot in common in terms of vocabulary. Sino-Korean words, which can be converted
into corresponding Chinese characters, account for more than fifty percent of the
entire Korean vocabulary. Motivated by this, we propose a simple, linguistically
motivated solution that improves the performance of Korean-to-Chinese neural
machine translation by exploiting this common vocabulary. We adopt Chinese
characters as a translation pivot: we convert the Sino-Korean words in Korean
sentences to Chinese characters and then train the machine translation model
with the converted Korean sentences as source sentences. Experimental
results on Korean-to-Chinese translation demonstrate that models using the
proposed method improve translation quality by up to 1.5 BLEU points over
the baseline models. Comment: 9 pages
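The pivot idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Sino-Korean-to-Hanja dictionary below is a toy placeholder for whatever conversion resource the authors actually used, and the greedy longest-match replacement is an assumption.

```python
# Hedged sketch of the pivot preprocessing: replace Sino-Korean words in a
# Korean source sentence with their corresponding Chinese characters before
# training the NMT model. The dictionary is a toy example, not the paper's.
SINO_KOREAN_TO_HANJA = {
    "학교": "學校",    # school
    "도서관": "圖書館",  # library
    "학생": "學生",    # student
}

def convert_to_pivot(sentence):
    """Greedily replace known Sino-Korean words with Chinese characters."""
    # Longest entries first, so longer words win over any shorter prefix.
    for word in sorted(SINO_KOREAN_TO_HANJA, key=len, reverse=True):
        sentence = sentence.replace(word, SINO_KOREAN_TO_HANJA[word])
    return sentence

print(convert_to_pivot("학생이 도서관에 갔다"))  # 學生이 圖書館에 갔다
```

The converted sentences would then serve as the source side of the training corpus, with the original Chinese sentences as targets.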
A Study on Context-Aware Document-Level Neural Machine Translation
Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2022.2. Advisor: Kyomin Jung.
Neural machine translation (NMT) has attracted great attention in recent years, as it has yielded state-of-the-art translation quality. Despite their promising results, many current NMT systems are sentence-level, translating each sentence independently. This ignores textual context and thus produces inadequate and inconsistent translations at the document level. To overcome these shortcomings, context-aware NMT (CNMT), which takes contextual sentences as input, has been proposed. This dissertation proposes novel methods for improving CNMT systems and an application of CNMT. We first tackle the efficient modeling of multiple contextual sentences in the CNMT encoder. For this purpose, we propose a hierarchical context encoder that encodes contextual sentences from the token level up to the sentence level. This novel architecture enables the model to achieve state-of-the-art translation quality while requiring less computation time for training and translation than existing methods.
Secondly, we investigate the training of CNMT models, most of which rely on a negative log-likelihood (NLL) objective that does not fully exploit contextual dependencies. To overcome this insufficiency, we introduce coreference-based contrastive learning for CNMT, which generates contrastive examples from coreference chains between the source and target sentences. The proposed method improves the pronoun resolution accuracy of CNMT models as well as overall translation quality.
Finally, we investigate an application of CNMT to Korean honorifics, which depend on contextual information for adequate translation. For the English-Korean translation task, we propose using CNMT models that capture crucial contextual information in the English source document, and we adopt a context-aware post-editing system to exploit context in the Korean target sentences, resulting in more consistent Korean honorific translations.Abstract i
Contents ii
List of Tables vi
List of Figures viii
1 Introduction 1
2 Background: Neural Machine Translation 7
2.1 A Brief History 7
2.2 Problem Setup 9
2.3 Encoder-Decoder architectures 10
2.3.1 RNN-based Architecture 11
2.3.2 SAN-based Architecture 13
2.4 Training 16
2.5 Decoding 16
2.6 Evaluation 17
3 Efficient Hierarchical Architecture for Modeling Contextual Sentences 18
3.1 Related works 20
3.1.1 Modeling Context in NMT 20
3.1.2 Hierarchical Context Modeling 21
3.1.3 Evaluation of Context-aware NMT 21
3.2 Model description 22
3.2.1 Context-aware NMT encoders 22
3.2.2 Hierarchical context encoder 27
3.3 Data 28
3.3.1 English-German IWSLT 2017 corpus 29
3.3.2 OpenSubtitles corpus 29
3.3.3 English-Korean subtitle corpus 31
3.4 Experiments 31
3.4.1 Hyperparameters and Training details 31
3.4.2 Overall BLEU evaluation 32
3.4.3 Model complexity analysis 32
3.4.4 BLEU evaluation on helpful/unhelpful context 34
3.4.5 EnKo pronoun resolution test suite 35
3.4.6 Qualitative Analysis 37
3.5 Summary of Efficient Hierarchical Architecture for Modeling Contextual Sentences 43
4 Contrastive Learning for Context-aware Neural Machine Translation 44
4.1 Related Works 46
4.1.1 Context-aware NMT Architectures 46
4.1.2 Coreference and NMT 47
4.1.3 Data augmentation for NMT 47
4.1.4 Contrastive Learning 47
4.2 Context-aware NMT models 48
4.3 Our Method: CorefCL 50
4.3.1 Data Augmentation Using Coreference 50
4.3.2 Contrastive Learning for Context-aware NMT 52
4.4 Experiments 53
4.4.1 Datasets 53
4.4.2 Settings 54
4.4.3 Overall BLEU Evaluation 55
4.4.4 Results on English-German Contrastive Evaluation Set 57
4.4.5 Analysis 58
4.5 Summary of Contrastive Learning for Context-aware Neural Machine Translation 59
5 Improving English-Korean Honorific Translation Using Contextual Information 60
5.1 Related Works 63
5.1.1 Neural Machine Translation dealing with Korean 63
5.1.2 Controlling the Styles in NMT 63
5.1.3 Context-Aware NMT Framework and Application 64
5.2 Addressing Korean Honorifics in Context 65
5.2.1 Overview of Korean Honorifics System 65
5.2.2 The Role of Context on Choosing Honorifics 68
5.3 Context-Aware NMT Frameworks 69
5.3.1 NMT Model with Contextual Encoders 71
5.3.2 Context-Aware Post Editing (CAPE) 71
5.4 Our Proposed Method - Context-Aware NMT for Korean Honorifics 73
5.4.1 Using CNMT methods for Honorific-Aware Translation 74
5.4.2 Scope of Honorific Expressions 75
5.4.3 Automatic Honorific Labeling 76
5.5 Experiments 77
5.5.1 Dataset and Preprocessing 77
5.5.2 Model Implementation and Training Details 80
5.5.3 Metrics 80
5.5.4 Results 81
5.5.5 Translation Examples and Analysis 86
5.6 Summary of Improving English-Korean Honorific Translation Using Contextual Information 89
6 Future Directions 91
6.1 Document-level Datasets 91
6.2 Document-level Evaluation 92
6.3 Bias and Fairness of Document-level NMT 93
6.4 Towards Practical Applications 94
7 Conclusions 96
Abstract (In Korean) 117
Acknowledgment 119
Recommended from our members
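The hierarchical context encoder proposed in the dissertation above, which encodes contextual sentences from the token level up to the sentence level, can be sketched as follows. This is a minimal illustration under stated assumptions: mean pooling stands in for the Transformer encoder layers used in the actual model, and the shapes are invented for the example.

```python
import numpy as np

# Minimal sketch of hierarchical context encoding: each contextual sentence
# is first encoded at the token level into one sentence vector, then a
# sentence-level encoder combines those vectors into a single context
# representation. Mean pooling is a stand-in for the real Transformer layers.

def encode_tokens(token_embeddings):
    """Token-level encoding of one sentence -> one sentence vector."""
    return token_embeddings.mean(axis=0)

def encode_context(context_sentences):
    """Sentence-level encoding over the per-sentence vectors."""
    sentence_vecs = np.stack([encode_tokens(s) for s in context_sentences])
    return sentence_vecs.mean(axis=0)  # one fixed-size context vector

rng = np.random.default_rng(0)
# Two contextual sentences with 5 and 3 tokens, embedding dimension 4.
ctx = [rng.normal(size=(5, 4)), rng.normal(size=(3, 4))]
context_vector = encode_context(ctx)
print(context_vector.shape)  # (4,)
```

The point of the hierarchy is that the sentence-level stage operates on far fewer units than a flat encoder over the concatenated context, which is where the reported savings in training and translation time would come from.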
Representation Learning beyond Semantic Similarity: Character-aware and Function-specific Approaches
Representation learning is a research area within machine learning and natural language processing (NLP) concerned with building machine-understandable representations of discrete units of text. Continuous representations are at the core of modern machine learning applications, and representation learning has thereby become one of the central research areas in NLP. The induction of text representations is typically based on the distributional hypothesis, and consequently encodes general information about word similarity. Words or phrases with similar meaning obtain similar representations in a vector space constructed for this purpose. This established methodology excels for morphologically-simple languages such as English, and in data-rich settings. However, several useful lexical relations such as entailment or selectional preference, are not captured or get conflated with other relations. Another challenge is dealing with low-data regimes for morphologically-complex and under-resourced languages.
In this thesis we construct novel representation learning methods that go beyond the limitations of the distributional hypothesis and investigate solutions that induce vector spaces with diverse properties. In particular, we look at how the vector space induction process influences the contained information, and how the information manifests in a number of core NLP tasks: semantic similarity, lexical entailment, selectional preference, and language modeling. We contribute novel evaluations of state-of-the-art models highlighting their current capabilities and limitations. An analysis of language modeling in 50 typologically-diverse languages demonstrates that representations can indeed pose a performance bottleneck. We introduce a novel approach to leveraging subword-level information in word representations: our solution lifts this bottleneck in low-resource scenarios. Finally, we introduce a novel paradigm of function-specific representation learning that aims to integrate fine-grained semantic relations and real-world knowledge into the word vector spaces. We hope this thesis can serve as a valuable overview on word representations, and inspire future work in modeling semantic similarity and beyond. ERC Consolidator Grant LEXICAL (648909)
Multilingual CheckList: Generation and Evaluation
The recently proposed CheckList (Ribeiro et al., 2020) approach to evaluating
NLP systems has revealed high failure rates on basic capabilities for
multiple state-of-the-art and commercial models. However, the CheckList
creation process is manual, which creates a bottleneck for building
multilingual CheckLists catering to hundreds of languages. In this work, we explore
multiple approaches to generating and evaluating the quality of multilingual
CheckLists. We devise an algorithm -- Automated Multilingual Checklist
Generation (AMCG) -- for automatically transferring a CheckList from a source to a
target language, relying on a reasonable machine translation system. We then
compare the CheckList generated by AMCG with CheckLists generated with
different levels of human intervention. Through in-depth cross-lingual
experiments between English and Hindi, and broad multilingual experiments
spanning 11 languages, we show that the automatic approach can estimate the
failure rates of a model across capabilities as accurately as a
human-verified CheckList, and better than CheckLists generated by humans from
scratch.
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in the data-driven
induction of typological knowledge.
General methods for fine-grained morphological and syntactic disambiguation
We present methods for improved handling of morphologically
rich languages (MRLs), which we define as languages that
are morphologically more complex than English. Standard
algorithms for language modeling, tagging and parsing have
problems with the productive nature of such
languages. Consider for example the possible forms of a
typical English verb like work, which generally has four
different forms: work, works, working
and worked. Its Spanish counterpart trabajar
has 6 different forms in the present
tense: trabajo, trabajas, trabaja, trabajamos, trabajáis
and trabajan, and more than 50 different forms when
including the different tenses, moods (indicative,
subjunctive and imperative) and participles. Such a high
number of forms leads to sparsity issues: in a recent
Wikipedia dump of more than 400 million tokens we find that
20 of these forms occur only twice or less and that 10 forms
do not occur at all. This means that even if we only need
unlabeled data to estimate a model, and even when looking at
a relatively common and frequent verb, we do not have enough
data to make reasonable estimates for some of its
forms. However, if we decompose an unseen form such
as trabajaréis 'you will work', we find that it
is trabajar in the future tense and second person
plural. This allows us to make the predictions needed
to decide on the grammaticality (language modeling)
or syntax (tagging and parsing) of a sentence.
In the first part of this thesis, we develop
a morphological language model. A language model
estimates the grammaticality and coherence of a
sentence. Most language models used today are word-based
n-gram models, which means that they estimate the
transitional probability of a word following a history, the
sequence of the (n - 1) preceding words. The probabilities
are estimated from the frequencies of the history and the
history followed by the target word in a huge text
corpus. If either of the sequences is unseen, the length of
the history has to be reduced. This leads to a less accurate
estimate as less context is taken into account.
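The back-off behaviour described above can be made concrete with a small sketch. This is a simplified illustration of unsmoothed maximum-likelihood n-gram estimation with history shortening, not the thesis's model; the toy corpus is invented.

```python
from collections import defaultdict

# Sketch of word-based n-gram estimation: transitional probabilities come
# from history counts, and an unseen history forces a back-off to a shorter
# (and therefore less informative) history.
def train_counts(tokens, n=3):
    counts = defaultdict(int)
    for order in range(1, n + 1):
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

def prob(counts, history, word):
    """P(word | history), shortening the history until it has been seen."""
    while history and counts[tuple(history)] == 0:
        history = history[1:]  # back off: drop the oldest context word
    if not history:
        total = sum(v for k, v in counts.items() if len(k) == 1)
        return counts[(word,)] / total  # fall back to the unigram estimate
    return counts[tuple(history) + (word,)] / counts[tuple(history)]

tokens = "we work , they work , you worked".split()
counts = train_counts(tokens)
print(prob(counts, ["they"], "work"))    # seen bigram history -> 1.0
print(prob(counts, ["nobody"], "work"))  # unseen history -> unigram estimate
```

Every back-off step discards context, which is exactly the accuracy loss the morphological class-based component described next is designed to mitigate.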
Our morphological language model estimates an additional
probability from the morphological classes of the
words. These classes are built automatically by extracting
morphological features from the word forms. To this end, we
use unsupervised segmentation algorithms to find the
suffixes of word forms. Such an algorithm might for example
segment trabajaréis into trabaja
and réis, and we can then estimate the properties
of trabajaréis from other word forms with the same or
similar morphological properties. The data-driven nature of
the segmentation algorithms allows them to find not only
inflectional suffixes (such as -réis), but also more
derivational phenomena such as the head nouns of compounds,
or even endings such as -tec, which identify
technology-oriented companies such
as Vortec, Memotec and Portec and would
not be regarded as a morphological suffix by traditional
linguistics. Additionally, we extract shape features such as
whether a form contains digits or capital characters. This is
important because many rare or unseen forms are proper
names or numbers and often do not have meaningful
suffixes. Our class-based morphological model is then
interpolated with a word-based model to combine the
generalization capabilities of the former and the high
accuracy (given sufficient data) of the latter.
We evaluate our model across 21 European languages and find
improvements between 3% and 11% in perplexity, a standard
language modeling evaluation measure. Improvements are
highest for languages with more productive and complex
morphology such as Finnish and Estonian, but also visible
for languages with a relatively simple morphology such as
English and Dutch. We conclude that a morphological
component yields consistent improvements for all the tested
languages and argue that it should be part of every language
model.
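The interpolation and the perplexity measure used in the evaluation above can be sketched as follows. This is a hedged illustration: the interpolation weight and the toy probabilities are invented, and in the thesis the class-based estimate comes from automatically induced morphological classes rather than the placeholder value used here.

```python
import math

# Sketch of interpolating a word-based estimate with a class-based estimate
# derived from morphological (e.g. suffix) classes. The lambda weight is
# illustrative, not a value from the thesis.
def interpolate(p_word, p_class, lam=0.7):
    return lam * p_word + (1.0 - lam) * p_class

def perplexity(probs):
    """Standard LM evaluation: exp of the average negative log-probability."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# For a rare inflected form, the word model is nearly uninformative, but the
# class model, estimated from forms sharing the same suffix class, is not.
p_rare = interpolate(p_word=1e-6, p_class=0.05)
print(p_rare)  # dominated by the class-based component
print(perplexity([p_rare, 0.2, 0.1]))
```

Lower perplexity means the model finds the text less surprising, which is why the reported 3-11% reductions indicate a consistent gain from the morphological component.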
Dependency trees represent the syntactic structure of a
sentence by attaching each word to its syntactic head, the
word it is directly modifying. Dependency parsing
is usually tackled using heavily lexicalized (word-based)
models and a thorough morphological preprocessing is
important for optimal performance, especially for MRLs. We
investigate if the lack of morphological features can be
compensated by features induced using hidden Markov
models with latent annotations (HMM-LAs)
and find this to be the case for German. HMM-LAs were
proposed as a method to increase part-of-speech tagging
accuracy. The model splits the observed part-of-speech tags
(such as verb and noun) into subtags. An expectation
maximization algorithm is then used to fit the subtags to
different roles. A verb tag for example might be split into
an auxiliary verb and a full verb subtag. Such a split is
usually beneficial because these two verb classes have
different contexts. That is, a full verb might follow an
auxiliary verb, but usually not another full verb.
For German and English, we find that our model leads to
consistent improvements over a parser
not using subtag features. Looking at the labeled attachment
score (LAS), the percentage of words correctly attached to their head
with the correct label, we observe an improvement from 90.34 to 90.75 for English
and from 87.92 to 88.24 for German. For German, we
additionally find that our model achieves almost the same
performance (88.24) as a model using tags annotated by a
supervised morphological tagger (LAS of 88.35). We also find
that the German latent tags correlate with
morphology. Articles for example are split by their
grammatical case.
We also investigate the part-of-speech tagging accuracies of
models using the traditional treebank tagset and models
using induced tagsets of the same size and find that the
latter outperform the former, but are in turn outperformed
by a discriminative tagger.
Furthermore, we present a method for fast and
accurate morphological tagging. While
part-of-speech tagging annotates tokens in context with
their respective word categories, morphological tagging
produces a complete annotation containing all the relevant
inflectional features such as case, gender and tense. A
complete reading is represented as a single tag. As a
reading might consist of several morphological features the
resulting tagset usually contains hundreds or even thousands
of tags. This is an issue for many decoding algorithms such
as Viterbi which have runtimes depending quadratically on
the number of tags. In the case of morphological tagging,
the problem can be avoided by using a morphological
analyzer. A morphological analyzer is a manually created
finite-state transducer that produces the possible
morphological readings of a word form. This analyzer can be
used to prune the tagging lattice and to allow for the
application of standard sequence labeling algorithms. The
downside of this approach is that such an analyzer is not
available for every language or might not have the coverage
required for the task. Additionally, the output tags of some
analyzers are not compatible with the annotations of the
treebanks, which might require some manual mapping of the
different annotations or even to reduce the complexity of
the annotation.
To avoid this problem we propose to use the posterior
probabilities of a conditional random field (CRF)
lattice to prune the space of possible
taggings. At the zero-order level the posterior
probabilities of a token can be calculated independently
from the other tokens of a sentence. The necessary
computations can thus be performed in linear time. The
features available to the model at this time are similar to
the features used by a morphological analyzer (essentially
the word form and features based on it), but also include
the immediate lexical context. As the ambiguity of word
types varies substantially, we just fix the average number of
readings after pruning by dynamically estimating a
probability threshold. Once we obtain the pruned lattice, we
can add tag transitions and convert it into a first-order
lattice. The quadratic forward-backward computations are now
executed on the remaining plausible readings and thus
efficient. We can now continue pruning and extending the
lattice order at a relatively low additional runtime cost
(depending on the pruning thresholds). The training of the
model can be implemented efficiently by applying stochastic
gradient descent (SGD). The CRF gradient can be calculated
from a lattice of any order as long as the correct reading
is still in the lattice. During training, we thus run the
lattice pruning until we either reach the maximal order or
until the correct reading is pruned. If the reading is
pruned we perform the gradient update with the highest order
lattice still containing the reading. This approach is
similar to early updating in the structured perceptron
literature and forces the model to learn how to keep the
correct readings in the lower-order lattices. In practice,
we observe a high number of lower-order updates during the first
training epoch and almost exclusively higher-order updates
during later epochs.
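The zero-order pruning step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the per-token scores are invented, softmax posteriors stand in for the CRF's forward-backward posteriors, and the dynamic threshold is chosen here by simple sorting rather than the thesis's estimation procedure.

```python
import numpy as np

# Sketch of zero-order lattice pruning: per-token posteriors are computed
# independently (linear time), and only tags whose posterior clears a
# dynamically chosen threshold survive into the quadratic higher-order pass.
def prune_lattice(scores, avg_readings=2.0):
    """scores: (tokens, tags) zero-order scores. Returns kept tag indices."""
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    posteriors = exp / exp.sum(axis=1, keepdims=True)
    # Pick a threshold so that about avg_readings tags per token survive.
    flat = np.sort(posteriors.ravel())[::-1]
    k = int(avg_readings * scores.shape[0])
    threshold = flat[k - 1]
    return [np.where(row >= threshold)[0].tolist() for row in posteriors]

scores = np.array([[5.0, 4.9, 0.1, -2.0],   # ambiguous token
                   [6.0, 0.0, -1.0, -1.0]]) # unambiguous token
print(prune_lattice(scores))  # more readings kept for the ambiguous token
```

Because the threshold adapts to the posterior distribution, highly ambiguous tokens keep several readings while unambiguous ones keep one, which is what makes the subsequent higher-order forward-backward computations affordable.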
We evaluate our CRF tagger on six languages with different
morphological properties. We find that for languages with a
high word form ambiguity such as German, the pruning results
in a moderate drop in tagging accuracy while for languages
with less ambiguity such as Spanish and Hungarian the loss
due to pruning is negligible. However, our pruning strategy
allows us to train higher order models (order > 1), which give
substantial improvements for all languages and also
outperform unpruned first-order models. That is, the model
might lose some of the correct readings during pruning, but
is also able to solve more of the harder cases that require
more context. We also find our model to substantially and
significantly outperform a number of frequently used taggers
such as Morfette and SVMTool.
Based on our morphological tagger we develop a simple method
to increase the performance of a state-of-the-art
constituency parser. A constituency tree
describes the syntactic properties of a sentence by
assigning spans of text to a hierarchical bracket
structure. A language-independent approach for the automatic annotation
of accurate and compact grammars was developed; its implementation --
known as the Berkeley parser -- gives state-of-the-art results
for many languages such as English and German. For some MRLs
such as Basque and Korean, however, the parser gives
such as Basque and Korean, however, the parser gives
unsatisfactory results because of its simple unknown word
model. This model maps unknown words to a small number of
signatures (similar to our morphological classes). These
signatures do not seem expressive enough for many of the
subtle distinctions made during parsing. We propose to
replace rare words by the morphological reading generated by
our tagger instead. The motivation is twofold. First, our
tagger has access to a number of lexical and sublexical
features not available during parsing. Second, we expect
the morphological readings to contain most of the
information required to make the correct parsing decision
even though we know that things such as the correct
attachment of prepositional phrases might require some
notion of lexical semantics.
In experiments on the SPMRL 2013 dataset
of nine MRLs, we find our method to give improvements for all
languages except French for which we observe a minor drop in
the Parseval score of 0.06. For Hebrew, Hungarian and
Basque we find substantial absolute improvements of 5.65,
11.87 and 15.16, respectively.
We also performed an extensive evaluation on the utility of
word representations for morphological tagging. Our goal was
to reduce the drop in performance that is caused when a
model trained on a specific domain is applied to some other
domain. This problem is usually addressed by domain adaption
(DA). DA adapts a model towards a specific domain using a
small amount of labeled or a huge amount of unlabeled data
from that domain. However, this procedure requires us to
train a model for every target domain. Instead we are trying
to build a robust system that is trained on domain-specific
labeled and domain-independent or general unlabeled data. We
believe word representations to be key in the development of
such models because they allow us to leverage unlabeled
data efficiently. We compare data-driven representations to
manually created morphological analyzers. We understand
data-driven representations as models that cluster word
forms or map them to a vectorial representation. Examples
heavily used in the literature include Brown clusters,
Singular Value Decompositions of count
vectors and neural-network-based
embeddings. We create a test suite of
six languages consisting of in-domain and out-of-domain test
sets. To this end we converted annotations for Spanish and
Czech and annotated the German part of the Smultron
treebank with a morphological layer. In
our experiments on these data sets we find Brown clusters to
outperform the other data-driven representations. Regarding
the comparison with morphological analyzers, we find Brown
clusters to give slightly better performance in
part-of-speech tagging, but to be substantially outperformed
in morphological tagging.
Understanding Word Embedding Stability Across Languages and Applications
Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this thesis, we consider several aspects of embedding spaces, including their stability. First, we propose a definition of stability, and show that common English word embeddings are surprisingly unstable. We explore how properties of data, words, and algorithms relate to instability. We extend this work to approximately 100 world languages, considering how linguistic typology relates to stability. Additionally, we consider contextualized output embedding spaces. Using paraphrases, we explore properties and assumptions of BERT, a popular embedding algorithm.
Second, we consider how stability and other word embedding properties affect tasks where embeddings are commonly used. We consider both word embeddings used as features in downstream applications and corpus-centered applications, where embeddings are used to study characteristics of language and individual writers. In addition to stability, we also consider other word embedding properties, specifically batching and curriculum learning, and how methodological choices made for these properties affect downstream tasks.
Finally, we consider how knowledge of stability affects how we use word embeddings. Throughout this thesis, we discuss strategies to mitigate instability and provide analyses highlighting the strengths and weaknesses of word embeddings in different scenarios and languages. We show areas where more work is needed to improve embeddings, and we show where embeddings are already a strong tool.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162917/1/lburdick_1.pd
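One reasonable reading of the stability notion studied in this thesis is nearest-neighbour overlap across embedding spaces, which can be sketched as follows. This is a hedged illustration: the toy vectors, vocabulary, and the exact overlap formula are assumptions for the example, not the thesis's definition verbatim.

```python
import numpy as np

# Hedged sketch: a word is "stable" to the extent that its nearest
# neighbours agree across two embedding spaces trained under different
# conditions. All vectors below are invented toy data.
def nearest(vocab, vectors, word, k):
    idx = vocab.index(word)
    norms = np.linalg.norm(vectors, axis=1)
    sims = vectors @ vectors[idx] / (norms * norms[idx])  # cosine similarity
    ranked = [vocab[i] for i in np.argsort(-sims) if i != idx]
    return set(ranked[:k])

def stability(vocab, emb_a, emb_b, word, k=2):
    """Fraction of the k nearest neighbours shared by the two spaces."""
    shared = nearest(vocab, emb_a, word, k) & nearest(vocab, emb_b, word, k)
    return len(shared) / k

vocab = ["cat", "dog", "car", "truck", "apple"]
emb_a = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]])
emb_b = np.array([[1.0, 0.05], [0.1, 0.9], [0.0, 1.0], [0.0, 0.95], [0.6, 0.8]])

print(stability(vocab, emb_a, emb_b, "car", k=1))  # neighbourhood preserved
print(stability(vocab, emb_a, emb_b, "cat", k=1))  # "dog" drifted away
```

Averaging such a score over many words and many retraining runs gives one concrete way to quantify the instability of an embedding algorithm on a corpus.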
An automatic morphological analysis system for Indonesian
This thesis reports the creation of SANTI-morf (Sistem Analisis Teks Indonesia – morfologi), a rule-based system that performs morphological annotation for Indonesian. The system was built in three stages, namely preliminaries, annotation scheme creation (the linguistic aspect of the project), and system implementation (the computational aspect of the project). The preliminary matters covered include the necessary key concepts in morphology and Natural Language Processing (NLP), as well as a concise description of Indonesian morphology, largely based on the two primary reference grammars of Indonesian (Alwi et al. 1998; Sneddon et al. 2010) together with work in the linguistic literature on Indonesian morphology (e.g. Kridalaksana 1989; Chaer 2008). As part of this preliminary stage, I created a testbed corpus for evaluation purposes. The design of the testbed is justified by considering the design of existing evaluation corpora, such as the testbed used by the English Constraint Grammar or EngCG system (Voutilainen 1992), the British National Corpus (BNC) 1994 evaluation data, and the training data used by MorphInd (Larasati et al. 2011), a morphological analyser (MA) for Indonesian. The dataset for this testbed was created by narrowing down an existing very large but unbalanced collection of texts (drawn from the Leipzig corpora; see Goldhahn et al. 2012). The initial collection was reduced to a corpus composed of nine domains following the domain categorisation of the BNC. A set of texts from each domain, proportional in size, was extracted and combined to form a testbed that complies with the design principles informed by the prior literature. The second stage, scheme creation, involved the creation of a new Morphological Annotation Scheme (MAS) for Indonesian, for use in the SANTI-morf system.
First, a review of MASs for different languages (Finnish, Turkish, Arabic, Indonesian) as well as the Universal Dependencies MAS identifies the best practices in the field. From these, 15 design principles for the novel MAS were devised. This MAS consists of a morphological tagset together with comprehensive justification of the morphological analyses used in the system. It achieves full morpheme-level annotation, presenting each morpheme's orthographic and citation forms in the defined output, accompanied by robust morphological analyses, both formal and functional; to my knowledge, this is the first MAS of its kind for Indonesian. The MAS's design is based not only on reference grammars of Indonesian and other linguistic sources, but also on the anticipated needs of researchers and other users of texts and corpora annotated using this scheme of analysis. The third stage of the project, implementation, consisted of three parts: a benchmarking evaluation exercise, a survey of frameworks and tools, and finally the actual implementation and evaluation of SANTI-morf. MorphInd (Larasati et al. 2012) is the prior state-of-the-art MA for Indonesian. That being the case, I evaluated MorphInd's performance against the aforementioned testbed, both as justification of the need for an improved system and to serve as a benchmark for SANTI-morf. MorphInd scored 93% on lexical coverage and 89% on tagging accuracy. Next, I surveyed existing MA frameworks and tools. This survey justifies my choice of the rule-based approach (inspired by Koskenniemi's 1983 Two-Level Morphology) as the framework, and of NooJ (Silberztein 2003) as the software tool, for SANTI-morf. After selection of this approach and tool, the language resources that constitute the SANTI-morf system were created. These are, primarily, a number of lexicons and sets of analysis rules, as well as the necessary NooJ system configuration files.
SANTI-morf's 3 lexicon files (in total 86,590 entries) and 15 rule files (in total 659 rules) are organised into four modules, namely the Annotator, the Guesser, the Improver and the Disambiguator, which are applied one after another in a pipeline. The Annotator provides the initial morpheme-level annotation for Indonesian words by identifying how they have been built according to various morphological processes (affixation, reduplication, compounding, and cliticisation). The Guesser ensures that words not covered by the Annotator, because they are absent from its lexicons, receive best guesses as to the correct analysis from the application of a set of probable but not exceptionless rules. The Improver improves the existing annotation by adding probable analyses that the Annotator might have missed. Finally, the Disambiguator resolves ambiguities, that is, words for which the earlier elements of the pipeline have generated two or more possible analyses in terms of the morphemes identified or their annotation. NooJ annotations are saved in a binary file, but for evaluation purposes plain-text output is required. I thus developed a system for data export using an in-NooJ mapping to and from a modified, exportable expression of the MAS, and wrote a small program to enable re-conversion of the output into plain-text format. For purposes of the evaluation, I created a 10,000-word gold-standard dataset, manually annotated with SANTI-morf analyses. The outcome of the evaluation is that SANTI-morf has 100% coverage (because a best-guess analysis is always provided for unrecognised word forms), and 99% precision and recall for the morphological annotations, with a 1% rate of remaining ambiguity in the final output. SANTI-morf thus presents a number of advancements over MorphInd, the state-of-the-art MA for Indonesian, exhibiting more robust annotation and better coverage.
Other performance indicators, namely the high precision and recall, make SANTI-morf a concrete advance in the field of automated morphological annotation for Indonesian and, in consequence, a substantive contribution to the field of Indonesian linguistics overall.
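The four-module pipeline described above can be sketched schematically. This is a toy illustration of the Annotator → Guesser → Improver → Disambiguator flow only: the lexicon entry, the rules, and the analysis notation below are all invented placeholders, not SANTI-morf's actual resources.

```python
# Toy sketch of the SANTI-morf pipeline: four modules applied in sequence.
# Every rule and analysis string here is an invented placeholder.
def annotator(word):
    lexicon = {"makan": ["makan<root:eat>"]}           # initial annotation
    return lexicon.get(word, [])

def guesser(word, analyses):
    if not analyses:                                    # word not in lexicons:
        return [f"{word}<guess:root>"]                  # always produce a guess
    return analyses

def improver(word, analyses):
    if word.startswith("me") and len(analyses) == 1:    # add a probable reading
        return analyses + [f"{word}<guess:me-+root>"]
    return analyses

def disambiguator(word, analyses):
    return [analyses[0]]                                # keep one best reading

def analyse(word):
    analyses = annotator(word)
    for module in (guesser, improver, disambiguator):
        analyses = module(word, analyses)
    return analyses

print(analyse("makan"))   # known word: lexicon analysis survives the pipeline
print(analyse("xyzzy"))   # unknown word: the Guesser supplies an analysis
```

Because the Guesser always emits an analysis for unrecognised forms, the pipeline structurally guarantees the 100% coverage reported in the evaluation.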
Empirical studies on word representations
One of the most fundamental tasks in natural language processing is representing words with mathematical objects (such as vectors). These word representations, which are most often estimated from data, capture the meaning of words: they enable comparing words according to their semantic similarity, and have been shown to work extremely well when included in complex real-world applications. A large part of our work deals with ways of estimating word representations directly from large quantities of text. Our methods exploit the idea that words which occur in similar contexts have a similar meaning. How we define the context is an important focus of this thesis. The context can consist of a number of words to the left and to the right of the word in question, but, as we show, obtaining context words via syntactic links (such as the link between a verb and its subject) often works better. We furthermore investigate word representations that accurately capture multiple meanings of a single word. We show that the translation of a word in context contains information that can be used to disambiguate the meaning of that word.
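The two context definitions contrasted above, a linear window versus syntactic links, can be sketched as follows. This is an illustrative sketch only: the dependency arcs are supplied by hand rather than by a parser, and the context-labelling convention is an assumption.

```python
from collections import Counter

# Sketch contrasting two context definitions for estimating word
# representations: a linear window of neighbouring words versus contexts
# obtained via (hand-supplied) syntactic links.
def window_contexts(tokens, size=2):
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - size), min(len(tokens), i + size + 1)):
            if j != i:
                pairs[(w, tokens[j])] += 1
    return pairs

def dependency_contexts(arcs):
    """arcs: (head, label, dependent) triples, e.g. from a parser."""
    pairs = Counter()
    for head, label, dep in arcs:
        pairs[(head, f"{label}:{dep}")] += 1       # head sees labelled dep
        pairs[(dep, f"inv-{label}:{head}")] += 1   # dep sees inverse link
    return pairs

tokens = "the old dog chased a cat".split()
arcs = [("chased", "subj", "dog"), ("chased", "obj", "cat"),
        ("dog", "mod", "old")]
# In the window view, "chased" co-occurs with everything nearby, including
# "old"; in the syntactic view its contexts are exactly subject and object.
print(window_contexts(tokens)[("chased", "old")])
print(dependency_contexts(arcs)[("chased", "subj:dog")])
```

Counts like these would then feed a standard induction method (e.g. PMI weighting plus dimensionality reduction), and the choice of context definition shapes what kind of similarity the resulting vector space encodes.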
- …