Multi-word expression-sensitive word alignment
This paper presents a new word alignment method which incorporates knowledge about Bilingual Multi-Word Expressions (BMWEs). Our method of word alignment first extracts such BMWEs in a bidirectional way for a given corpus and then starts conventional word alignment, considering the properties of BMWEs in their grouping as well as their alignment links. We give partial annotation of alignment links as prior knowledge to the word alignment process; by replacing the maximum likelihood estimate in the M-step of the IBM Models with the Maximum A Posteriori (MAP) estimate, prior knowledge about BMWEs is embedded in the prior in this MAP estimate. In our experiments, we saw an improvement of 0.77 BLEU points absolute in JP–EN. Except for one case, our method gave better results than the method using only BMWEs grouping. Even though this paper does not directly address the issues in Cross-Lingual Information Retrieval (CLIR), it discusses an approach of direct relevance to the field. This approach could be viewed as the opposite of current trends in CLIR on semantic space that incorporate a notion of order in the bag-of-words model (e.g. co-occurrences).
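As a rough illustration of replacing the maximum likelihood estimate with a MAP estimate in the M-step, the sketch below re-estimates an IBM Model 1-style translation table t(f|e) from expected counts. It is not the paper's formulation: the function name `m_step`, the representation of the prior as pseudo-counts (read as Dirichlet parameters minus one), and the toy word pairs are all assumptions made for this example.

```python
from collections import defaultdict

def m_step(expected_counts, prior_counts=None):
    """Re-estimate t(f|e) from E-step expected counts.

    expected_counts: dict mapping (f, e) -> expected count.
    prior_counts: optional dict mapping (f, e) -> pseudo-count.
        Assumption: pseudo-counts stand in for a Dirichlet prior
        (alpha - 1) that encodes known links, e.g. BMWE pairs.
    With prior_counts=None this is the maximum likelihood estimate.
    """
    prior_counts = prior_counts or {}
    combined = {}
    totals = defaultdict(float)
    for f, e in set(expected_counts) | set(prior_counts):
        c = expected_counts.get((f, e), 0.0) + prior_counts.get((f, e), 0.0)
        combined[(f, e)] = c
        totals[e] += c  # normalizer for each conditioning word e
    return {(f, e): c / totals[e]
            for (f, e), c in combined.items() if totals[e] > 0}

# Toy example (hypothetical counts): the prior favors the link chat-cat.
counts = {("chat", "cat"): 1.0, ("le", "cat"): 1.0}
prior = {("chat", "cat"): 2.0}
ml_table = m_step(counts)          # both links get probability 0.5
map_table = m_step(counts, prior)  # chat-cat rises to 0.75
```

The prior only shifts probability mass toward links that are already believed to be correct; with no prior entries the MAP estimate reduces to the ML estimate.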
Consensus Versus Expertise: A Case Study of Word Alignment with Mechanical Turk
Word alignment is an important preprocessing step for machine translation. The project aims at incorporating manual alignments from Amazon Mechanical Turk (MTurk) to help improve word alignment quality. As a global crowdsourcing service, MTurk can provide a flexible and abundant labor force and therefore reduce the cost of obtaining labels. An easy-to-use interface is developed to simplify the labeling process. We compare the alignment results by Turkers to those by experts, and incorporate the alignments in a semi-supervised word alignment tool to improve the quality of the labels. We also compare two pricing strategies for the word alignment task. Experimental results show high precision of the alignments provided by Turkers, and the semi-supervised approach achieved a 0.5% absolute reduction in alignment error rate.
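For readers unfamiliar with the metric, alignment error rate (AER) is commonly computed from a predicted link set A, gold "sure" links S, and gold "possible" links P (with S contained in P) as 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below assumes that standard definition; the abstract does not say which variant the authors used, and the function name and toy links are illustrative only.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Links are (source_index, target_index) pairs.
    Sure links are treated as a subset of possible links.
    """
    predicted, sure = set(predicted), set(sure)
    possible = set(possible) | sure
    if not predicted and not sure:
        return 0.0  # nothing to align and nothing predicted
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))

# Toy example (hypothetical links): one of three predicted links is wrong.
aer = alignment_error_rate(
    predicted={(0, 0), (1, 1), (2, 3)},
    sure={(0, 0), (1, 1)},
    possible={(2, 2)},
)
```

Lower is better: a prediction that contains every sure link and nothing outside the possible set scores 0.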
Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment
Word translation without parallel corpora has become feasible, rivaling the
performance of supervised methods. Recent findings have shown that the accuracy
and robustness of unsupervised word translation (UWT) can be improved by making
use of visual observations, which are universal representations across
languages. In this work, we investigate the potential of using not only visual
observations but also pretrained language-image models for enabling a more
efficient and robust UWT. Specifically, we develop a novel UWT method dubbed
Word Alignment using Language-Image Pretraining (WALIP), which leverages visual
observations via the shared embedding space of images and texts provided by
CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we
retrieve word pairs with high confidences of similarity, computed using our
proposed image-based fingerprints, which define the initial pivot for the word
alignment. Second, we apply our robust Procrustes algorithm to estimate the
linear mapping between two embedding spaces, which iteratively corrects and
refines the estimated alignment. Our extensive experiments show that WALIP
improves upon the state-of-the-art performance of bilingual word alignment for
a few language pairs across different word embeddings and displays great
robustness to the dissimilarity of language pairs or training corpora for two
word embeddings.
Comment: In Proceedings of the 2022 Conference on Empirical Methods in Natural
Language Processing (EMNLP Findings)
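The second step mentioned in the abstract estimates a linear mapping between two embedding spaces. As background, the sketch below shows the classical orthogonal Procrustes solution using NumPy, which is the usual starting point for such mappings. It is not WALIP's robust, iterative variant, and the function name and random test data are assumptions made for illustration.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    X, Y: arrays of shape (n_pairs, dim); row i of X and row i of Y
    are the embeddings of a word pair assumed to be translations.
    Closed form: W = U @ Vt, where U, S, Vt = svd(X.T @ Y).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Synthetic check: rotate random "source" embeddings by a known
# orthogonal matrix Q and recover it from the paired rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
W = orthogonal_procrustes(X, X @ Q)
```

In practice the seed pairs are noisy, which is presumably why the abstract describes an iterative procedure that corrects and refines the alignment rather than a single closed-form solve.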
Addressing the Rare Word Problem in Neural Machine Translation
Neural Machine Translation (NMT) is a new approach to machine translation
that has shown promising results that are comparable to traditional approaches.
A significant weakness in conventional NMT systems is their inability to
correctly translate very rare words: end-to-end NMTs tend to have relatively
small vocabularies with a single unk symbol that represents every possible
out-of-vocabulary (OOV) word. In this paper, we propose and implement an
effective technique to address this problem. We train an NMT system on data
that is augmented by the output of a word alignment algorithm, allowing the NMT
system to emit, for each OOV word in the target sentence, the position of its
corresponding word in the source sentence. This information is later utilized
in a post-processing step that translates every OOV word using a dictionary.
Our experiments on the WMT14 English to French translation task show that this
method provides a substantial improvement of up to 2.8 BLEU points over an
equivalent NMT system that does not use this technique. With 37.5 BLEU points,
our NMT system is the first to surpass the best result achieved on a WMT14
contest task.
Comment: ACL 2015 camera-ready version
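The post-processing step described in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the `<unk>` token spelling, the `unk_positions` map from target index to source index, the toy dictionary, and the fallback of copying the source word when the dictionary has no entry are all hypothetical choices for this example, not details taken from the abstract.

```python
def replace_unknowns(source_tokens, target_tokens, unk_positions,
                     dictionary, unk="<unk>"):
    """Replace each unknown target token using its aligned source word.

    unk_positions: dict mapping target index -> source index, standing in
        for the positions the NMT system emits for OOV words.
    dictionary: dict mapping source word -> target-language translation.
    """
    output = []
    for i, token in enumerate(target_tokens):
        if token == unk and i in unk_positions:
            src_word = source_tokens[unk_positions[i]]
            # Assumption: copy the source word if no translation is known.
            output.append(dictionary.get(src_word, src_word))
        else:
            output.append(token)
    return output

# Toy example with hypothetical tokens and dictionary entries.
source = ["the", "ecotax", "portico", "in", "Pont-de-Buis"]
target = ["le", "<unk>", "<unk>", "de", "<unk>"]
positions = {1: 2, 2: 1, 4: 4}
lexicon = {"ecotax": "écotaxe", "portico": "portique"}
result = replace_unknowns(source, target, positions, lexicon)
```

Copying the source word is a reasonable default for names and numbers, which often stay unchanged across languages.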
Attention Focusing for Neural Machine Translation by Bridging Source and Target Embeddings
In neural machine translation, a source sequence of words is encoded into a
vector from which a target sequence is generated in the decoding phase.
Differently from statistical machine translation, the associations between
source words and their possible target counterparts are not explicitly stored.
Source and target words are at the two ends of a long information processing
procedure, mediated by hidden states at both the source encoding and the target
decoding phases. This makes it possible that a source word is incorrectly
translated into a target word that is not any of its admissible equivalent
counterparts in the target language.
In this paper, we seek to somewhat shorten the distance between source and
target words in that procedure, and thus strengthen their association, by means
of a method we term bridging source and target word embeddings. We experiment
with three strategies: (1) a source-side bridging model, where source word
embeddings are moved one step closer to the output target sequence; (2) a
target-side bridging model, which explores the more relevant source word
embeddings for the prediction of the target sequence; and (3) a direct bridging
model, which directly connects source and target word embeddings seeking to
minimize errors in the translation of ones by the others.
Experiments and analysis presented in this paper demonstrate that the
proposed bridging models are able to significantly improve quality of both
sentence translation, in general, and alignment and translation of individual
source words with target words, in particular.
Comment: 9 pages, 6 figures. Accepted by ACL201
Multi-Word Expression-Sensitive Word Alignment
This paper presents a new word alignment method which incorporates knowledge about Bilingual Multi-Word Expressions (BMWEs). Our method of word alignment first extracts such BMWEs in a bidirectional way for a given corpus and then starts conventional word alignment, considering the properties of BMWEs in their grouping as well as their alignment links. We give partial annotation of alignment links as prior knowledge to the word alignment process; by replacing the maximum likelihood estimate in the M-step of the IBM Models with the Maximum A Posteriori (MAP) estimate, prior knowledge about BMWEs is embedded in the prior in this MAP estimate. In our experiments, we saw an improvement of 0.77 BLEU points absolute in JP–EN. Except for one case, our method gave better results than the method using only BMWEs grouping. Even though this paper does not directly address the issues in Cross-Lingual Information Retrieval (CLIR), it discusses an approach of direct relevance to the field. This approach could be viewed as the opposite of current trends in CLIR on semantic space that incorporate a notion of order in the bag-of-words model (e.g. co-occurrences).
4th Workshop on Cross Lingual Information Access, 28 August 2010, Beijing, China
N-gram-based statistical machine translation versus syntax augmented machine translation: comparison and system combination
In this paper we compare and contrast two approaches to Machine Translation
(MT): the CMU-UKA Syntax Augmented Machine Translation system (SAMT) and
UPC-TALP N-gram-based Statistical Machine Translation (SMT). SAMT is a
hierarchical syntax-driven translation system underlain by a phrase-based model
and a target part parse tree. In N-gram-based SMT, the translation process is
based on bilingual units related to word-to-word alignment and statistical
modeling of the bilingual context following a maximum-entropy framework. We
provide a step-by-step comparison of the systems and report results in terms of
automatic evaluation metrics and required computational resources for a smaller
Arabic-to-English translation task (1.5M tokens in the training corpus). Human
error analysis clarifies advantages and disadvantages of the systems under
consideration. Finally, we combine the output of both systems to yield
significant improvements in translation quality.
Postprint (published version)