Handling Homographs in Neural Machine Translation
Homographs, words with different meanings but the same surface form, have
long caused difficulty for machine translation systems, as it is difficult to
select the correct translation based on the context. However, with the advent
of neural machine translation (NMT) systems, which can theoretically take into
account global sentential context, one may hypothesize that this problem has
been alleviated. In this paper, we first provide empirical evidence that
existing NMT systems in fact still have significant problems in properly
translating ambiguous words. We then proceed to describe methods, inspired by
the word sense disambiguation literature, that model the context of the input
word with context-aware word embeddings that help to differentiate the word
sense before feeding it into the encoder. Experiments on three language pairs
demonstrate that such models improve the performance of NMT systems both in
terms of BLEU score and in the accuracy of translating homographs. Comment: NAACL201
Integrating Weakly Supervised Word Sense Disambiguation into Neural Machine Translation
This paper demonstrates that word sense disambiguation (WSD) can improve
neural machine translation (NMT) by widening the source context considered when
modeling the senses of potentially ambiguous words. We first introduce three
adaptive clustering algorithms for WSD, based on k-means, Chinese restaurant
processes, and random walks, which are then applied to large word contexts
represented in a low-rank space and evaluated on SemEval shared-task data. We
then learn word vectors jointly with sense vectors defined by our best WSD
method, within a state-of-the-art NMT system. We show that the concatenation of
these vectors, and the use of a sense selection mechanism based on the weighted
average of sense vectors, outperforms several baselines including sense-aware
ones. This is demonstrated by translation on five language pairs. The
improvements are above one BLEU point over strong NMT baselines, +4% accuracy
over all ambiguous nouns and verbs, or +20% when scored manually over several
challenging words. Comment: To appear in TAC
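The sense selection step mentioned in this abstract (a weighted average of sense vectors, concatenated with the word vector) can be illustrated with a minimal sketch. The abstract does not say how the weights are computed, so the softmax over dot products with a context vector used here is an assumption, as are the function names and the toy vectors.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def sense_aware_embedding(word_vec, sense_vecs, context_vec):
    """Concatenate a word vector with a weighted average of its sense vectors.

    ASSUMPTION: weights come from a softmax over dot products between each
    sense vector and a context vector; the paper's mechanism may differ.
    """
    scores = [sum(s_i * c_i for s_i, c_i in zip(s, context_vec)) for s in sense_vecs]
    weights = softmax(scores)
    dim = len(sense_vecs[0])
    mixed = [sum(w * s[i] for w, s in zip(weights, sense_vecs)) for i in range(dim)]
    return list(word_vec) + mixed, weights

# Hypothetical two-sense word; the context points towards the second sense.
embedding, weights = sense_aware_embedding(
    word_vec=[0.5, 0.5],
    sense_vecs=[[1.0, 0.0], [0.0, 1.0]],
    context_vec=[0.0, 2.0],
)
print(embedding, weights)
```

A soft average keeps the selection differentiable, which is one plausible reason to prefer it over picking a single sense outright.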
Developing Word-aligned Myanmar-English Parallel Corpus based on the IBM Models
Word alignment in bilingual corpora has been an active research
topic in the machine translation community. A corpus is a
collection of texts that is useful for Natural Language
Processing (NLP). Parallel text alignment is the identification of
corresponding sentences in a parallel text. Large
collections of parallel text are a prerequisite for many areas of
linguistic research. A parallel corpus helps in building statistical
bilingual dictionaries, in supporting statistical machine translation,
and in providing training data for word sense disambiguation
and translation disambiguation. Today the world is a global
network, and many people learn more than one language,
so multilingual corpora are increasingly in demand. The main
purpose of this system is therefore to construct a word-aligned parallel
corpus for use in Myanmar-English machine translation. A
key step is to identify correspondences between words in
one language and words in the other. The proposed approach is
based on the first three IBM models and the EM algorithm. The paper also
shows that the approach can be improved by using a list of
cognates and morphological analysis.
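To make the IBM-model and EM idea concrete, here is a minimal sketch of EM training for IBM Model 1 only, the simplest of the three models the abstract refers to. It omits the NULL word and the fertility and distortion components of Models 2 and 3. The function names and the German-English toy corpus are illustrative assumptions, not data or code from the paper.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities P(e | f) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    Simplification: no NULL source word is used.
    """
    e_vocab = {e for _, es in pairs for e in es}
    uniform = 1.0 / len(e_vocab)
    t = defaultdict(lambda: uniform)  # t[(e, f)] = P(e | f), uniform at start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts under the current t.
        for fs, es in pairs:
            for e in es:
                norm = sum(t[(e, f)] for f in fs)
                for f in fs:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: re-normalise the expected counts into probabilities.
        t = defaultdict(float)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

def align(fs, es, t):
    """Link each target word to its most probable source word."""
    return [(e, max(fs, key=lambda f: t[(e, f)])) for e in es]

# Hypothetical toy corpus (German-English), used only for illustration.
TOY_CORPUS = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm_model1(TOY_CORPUS)
print(align(["das", "Haus"], ["the", "house"], t))
```

Even on three sentence pairs, co-occurrence statistics are enough for EM to separate "das" from "Haus". A cognate list or morphological analysis, as the abstract suggests, would supply extra evidence where such counts are sparse.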
Contrastive Conditioning for Assessing Disambiguation in MT: A Case Study of Distilled Bias
Lexical disambiguation is a major challenge for machine translation systems, especially if some senses of a word are trained less often than others. Identifying patterns of overgeneralization requires evaluation methods that are both reliable and scalable. We propose contrastive conditioning as a reference-free black-box method for detecting disambiguation errors. Specifically, we score the quality of a translation by conditioning on variants of the source that provide contrastive disambiguation cues. After validating our method, we apply it in a case study to perform a targeted evaluation of sequence-level knowledge distillation. By probing word sense disambiguation and translation of gendered occupation names, we show that distillation-trained models tend to overgeneralize more than other models with a comparable BLEU score. Contrastive conditioning thus highlights a side effect of distillation that is not fully captured by standard evaluation metrics. Code and data to reproduce our findings are publicly available
Code-Switching with Word Senses for Pretraining in Neural Machine Translation
Lexical ambiguity is a significant and pervasive challenge in Neural Machine
Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to
handle polysemous words (Campolungo et al., 2022). The same holds for the NMT
pretraining paradigm of denoising synthetic "code-switched" text (Pan et al.,
2021; Iyer et al., 2023), where word senses are ignored in the noising stage --
leading to harmful sense biases in the pretraining data that are subsequently
inherited by the resulting models. In this work, we introduce Word Sense
Pretraining for Neural Machine Translation (WSP-NMT) - an end-to-end approach
for pretraining multilingual NMT models leveraging word sense-specific
information from Knowledge Bases. Our experiments show significant improvements
in overall translation quality. Then, we show the robustness of our approach to
scale to various challenging data and resource-scarce scenarios and, finally,
report fine-grained accuracy improvements on the DiBiMT disambiguation
benchmark. Our studies yield interesting and novel insights into the merits and
challenges of integrating word sense information and structured knowledge in
multilingual pretraining for NMT. Comment: EMNLP (Findings) 2023 Long Pape
FASTSUBS: An Efficient and Exact Procedure for Finding the Most Likely Lexical Substitutes Based on an N-gram Language Model
Lexical substitutes have found use in areas such as paraphrasing, text
simplification, machine translation, word sense disambiguation, and part of
speech induction. However, the computational complexity of accurately
identifying the most likely substitutes for a word has made large scale
experiments difficult. In this paper I introduce a new search algorithm,
FASTSUBS, that is guaranteed to find the K most likely lexical substitutes for
a given word in a sentence based on an n-gram language model. The computation
is sub-linear in both K and the vocabulary size V. An implementation of the
algorithm and a dataset with the top 100 substitutes of each token in the WSJ
section of the Penn Treebank are available at http://goo.gl/jzKH0. Comment: 4 pages, 1 figure, to appear in IEEE Signal Processing Letter
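For orientation, the task that FASTSUBS addresses can be sketched with an exhaustive baseline: score every vocabulary word in the gap and keep the K best. This is not the FASTSUBS algorithm. It is the linear-in-V search that the abstract says FASTSUBS improves on, restricted here to a bigram model. The probability table and function name are hypothetical.

```python
import heapq

def top_k_substitutes(prev_word, next_word, bigram_prob, vocab, k):
    """Return the k words w maximising P(w | prev_word) * P(next_word | w).

    Brute-force baseline: every word in `vocab` is scored once.
    `bigram_prob` maps (history, word) to P(word | history); missing pairs count as 0.
    """
    def score(w):
        return bigram_prob.get((prev_word, w), 0.0) * bigram_prob.get((w, next_word), 0.0)
    return heapq.nlargest(k, vocab, key=score)

# Hypothetical bigram probabilities for the context "the ___ sat".
BIGRAMS = {
    ("the", "cat"): 0.4, ("cat", "sat"): 0.5,
    ("the", "dog"): 0.3, ("dog", "sat"): 0.4,
    ("the", "car"): 0.2, ("car", "sat"): 0.01,
}
substitutes = top_k_substitutes("the", "sat", BIGRAMS, ["cat", "dog", "car"], k=2)
print(substitutes)
```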
Capturing lexical variation in MT evaluation using automatically built sense-cluster inventories
The strict character of most of the existing Machine Translation (MT) evaluation metrics does not permit them to capture lexical variation in translation. However, a central issue in MT evaluation is the high correlation that the metrics should have with human judgments of translation quality. In order to achieve a higher correlation, the identification of sense correspondences between the compared translations becomes really important. Given that most metrics are looking for exact correspondences, the evaluation results are often misleading concerning translation quality. Apart from that, existing metrics do not permit one to make a conclusive estimation of the impact of integrating Word Sense Disambiguation techniques into MT systems. In this paper, we show how information acquired by an unsupervised semantic analysis method can be used to render MT evaluation more sensitive to lexical semantics. The sense inventories built by this data-driven method are incorporated into METEOR: they replace WordNet for evaluation in English and render METEOR’s synonymy module operable in French. The evaluation results demonstrate that the use of these inventories gives rise to an increase in the number of matches and the correlation with human judgments of translation quality, compared to precision-based metrics.