2,510 research outputs found
Tuning syntactically enhanced word alignment for statistical machine translation
We introduce a syntactically enhanced word alignment model that is more flexible than state-of-the-art generative word
alignment models and can be tuned according to different end tasks. First of all, this model takes the advantages of
both unsupervised and supervised word alignment approaches by obtaining anchor alignments from unsupervised generative
models and seeding the anchor alignments into a supervised discriminative model. Second, this model offers the flexibility of tuning the alignment according to different
optimisation criteria. Our experiments show that using our word alignment in a Phrase-Based Statistical Machine Translation system yields a 5.38% relative increase
on IWSLT 2007 task in terms of BLEU score
HMM word-to-phrase alignment with dependency constraints
In this paper, we extend the HMMwordto-phrase alignment model with syntactic dependency constraints. The syntactic
dependencies between multiple words in one language are introduced into the model in a bid to produce coherent
alignments. Our experimental results on a variety of ChineseāEnglish data show that our syntactically constrained
model can lead to as much as a 3.24% relative improvement in BLEU score over current HMM word-to-phrase alignment models on a Phrase-Based Statistical Machine Translation system when the training data is small, and a comparable performance compared to IBM model 4 on a Hiero-style system
with larger training data. An intrinsic alignment quality evaluation shows that our alignment model with dependency
constraints leads to improvements in both precision (by 1.74% relative) and recall (by 1.75% relative) over the model without dependency information
Multilingual unsupervised word alignment models and their application
Word alignment is an essential task in natural language processing because of its critical role in training statistical machine translation (SMT) models, error analysis for neural machine translation (NMT), building bilingual lexicon, and annotation transfer. In this thesis, we explore models for word alignment, how they can be extended to incorporate linguistically-motivated alignment types, and how they can be neuralized in an end-to-end fashion. In addition to these methodological developments, we apply our word alignment models to cross-lingual part-of-speech projection. First, we present a new probabilistic model for word alignment where word alignments are associated with linguistically-motivated alignment types. We propose a novel task of joint prediction of word alignment and alignment types and propose novel semi-supervised learning algorithms for this task. We also solve a sub-task of predicting the alignment type given an aligned word pair. The proposed joint generative models (alignment-type-enhanced models) significantly outperform the models without alignment types in terms of word alignment and translation quality. Next, we present an unsupervised neural Hidden Markov Model for word alignment, where emission and transition probabilities are modeled using neural networks. The model is simpler in structure, allows for seamless integration of additional context, and can be used in an end-to-end neural network. Finally, we tackle the part-of-speech tagging task for the zero-resource scenario where no part-of-speech (POS) annotated training data is available. We present a cross-lingual projection approach where neural HMM aligners are used to obtain high quality word alignments between resource-poor and resource-rich languages. Moreover, high quality neural POS taggers are used to provide annotations for the resource-rich language side of the parallel data, as well as to train a tagger on the projected data. Our experimental results on truly low-resource languages show that our methods outperform their corresponding baselines
Recommended from our members
Minimally supervised induction of morphology through bitexts
textA knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. There have been consequently many attempts to reduce this cost in the development of morphological systems through the development of unsupervised or minimally supervised algorithms and learning methods for acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner but one that will be more linguistically informed than previous unsupervised approaches. That is, this study will attempt to induce clusters of words from an unannotated text that are inflectional variants of each other. Then a set of inflectional suffixes by part-of-speech will be induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one languageāthe source languageāto another languageāthe target. This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typlogical properties of German. The two main tasks, that of clustering and segmentation, are approached as sequential tasks with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, it attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.Linguistic
Constrained word alignment models for statistical machine translation
Word alignment is a fundamental and crucial component in Statistical Machine Translation (SMT) systems. Despite the enormous progress made in the past two decades, this task remains an active research topic simply because the quality of word alignment is still far from optimal. Most state-of-the-art word alignment models are grounded on statistical learning theory treating word alignment as a general sequence alignment problem, where many linguistically motivated insights are not incorporated. In this thesis, we propose new word alignment models with linguistically motivated constraints in a bid to improve the quality of word alignment for Phrase-Based SMT systems (PB-SMT). We start the exploration with an investigation
into segmentation constraints for word alignment by proposing a novel algorithm, namely word packing, which is motivated by the fact that one concept expressed by one word in one language can frequently surface as a compound or
collocation in another language. Our algorithm takes advantage of the interaction between segmentation and alignment, starting with some segmentation for both the
source and target language and updating the segmentation with respect to the word alignment results using state-of-the-art word alignment models; thereafter a refined
word alignment can be obtained based on the updated segmentation. In this process, the updated segmentation acts as a hard constraint on the word alignment
models and reduces the complexity of the alignment models by generating more 1-to-1 correspondences through word packing. Experimental results show that this algorithm can lead to statistically significant improvements over the state-of-the-art word alignment models. Given that word packing imposes "hard" segmentation constraints on the word aligner, which is prone to introducing noise, we propose two
new word alignment models using syntactic dependencies as soft constraints. The first model is a syntactically enhanced discriminative word alignment model, where
we use a set of feature functions to express the syntactic dependency information encoded in both source and target languages. One the one hand, this model enjoys
great flexibility in its capacity to incorporate multiple features; on the other hand, this model is designed to facilitate model tuning for different objective functions.
Experimental results show that using syntactic constraints can improve the performance of the discriminative word alignment model, which also leads to better PB-SMT performance compared to using state-of-the-art word alignment models.
The second model is a syntactically constrained generative word alignment model, where we add in a syntactic coherence model over the target phrases in the context of HMM word-to-phrase alignment. The advantages of our model are that (i) the addition of the syntactic coherence model preserves the efficient parameter estimation procedures; and (ii) the flexibility of the model can be increased so that it can
be tuned according to different objective functions. Experimental results show that tuning this model properly leads to a significant gain in MT performance over the
state-of-the-art
- ā¦