Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation
Neural machine translation (NMT) systems require large amounts of high-quality,
in-domain parallel corpora for training. State-of-the-art NMT systems still
struggle with out-of-vocabulary words and low-resource language pairs. In this
paper, we propose and compare several
models for fusion of bilingual lexicons with an end-to-end trained
sequence-to-sequence model for machine translation. The result is a fusion
model with two information sources for the decoder: a neural conditional
language model and a bilingual lexicon. This fusion model learns how to combine
both sources of information in order to produce higher quality translation
output. Our experiments show that our proposed models work well in relatively
low-resource scenarios, and also effectively reduce the parameter size and
training cost for NMT without sacrificing performance.
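
The abstract does not spell out the fusion architecture, so the following is only a minimal sketch, in PyTorch, of the general idea: a learned per-step gate mixes the decoder's softmax distribution with a lexicon-derived distribution. All names here (LexiconFusion, lexicon_probs, gate) are illustrative, and the lexicon distribution is assumed to be supplied externally, e.g. as attention-weighted rows of a source-to-target probability table.

```python
# Minimal sketch (not the paper's exact model) of gating between a neural
# conditional language model and a bilingual-lexicon distribution.
import torch
import torch.nn as nn

class LexiconFusion(nn.Module):
    """Mixes the NMT softmax with a lexicon-derived distribution per step."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)
        self.gate = nn.Linear(hidden_size, 1)  # scalar mixing weight per step

    def forward(self, decoder_state: torch.Tensor,
                lexicon_probs: torch.Tensor) -> torch.Tensor:
        # decoder_state: (batch, hidden); lexicon_probs: (batch, vocab),
        # assumed precomputed, e.g. attention-weighted lexicon-table rows.
        nmt_probs = torch.softmax(self.proj(decoder_state), dim=-1)
        g = torch.sigmoid(self.gate(decoder_state))       # (batch, 1)
        return g * nmt_probs + (1.0 - g) * lexicon_probs  # convex mixture

# Toy usage: batch of 2, hidden size 8, vocabulary of 10 target words.
fusion = LexiconFusion(hidden_size=8, vocab_size=10)
state = torch.randn(2, 8)
lex = torch.full((2, 10), 0.1)          # uniform stand-in lexicon distribution
print(fusion(state, lex).sum(dim=-1))   # each row sums to 1
```

Because both inputs are valid distributions and the gate is a convex weight, the fused output remains a valid distribution over the target vocabulary.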
Lexicon Learning for Few-Shot Neural Sequence Modeling
Sequence-to-sequence transduction is the core problem in language processing
applications as diverse as semantic parsing, machine translation, and
instruction following. The neural network models that provide the dominant
solution to these problems are brittle, especially in low-resource settings:
they fail to generalize correctly or systematically from small datasets. Past
work has shown that many failures of systematic generalization arise from
neural models' inability to disentangle lexical phenomena from syntactic ones.
To address this, we augment neural decoders with a lexical translation
mechanism that generalizes existing copy mechanisms to incorporate learned,
decontextualized, token-level translation rules. We describe how to initialize
this mechanism using a variety of lexicon learning algorithms, and show that it
improves systematic generalization on a diverse set of sequence modeling tasks
drawn from cognitive science, formal semantics, and machine translation.
Comment: ACL 2021
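
To make the idea concrete, here is a rough sketch, under assumed shapes, of a copy mechanism generalized to token-level translation rules: a copy-style attention distribution over source tokens is routed through a learned, decontextualized source-to-target translation table rather than copying the attended tokens verbatim. The table (lex_logits below) is the component a lexicon learning algorithm would initialize; all names and signatures are illustrative, not the authors' code.

```python
# Sketch of a lexical translation mechanism generalizing a copy mechanism.
import torch
import torch.nn as nn

class LexicalTranslator(nn.Module):
    def __init__(self, src_vocab: int, tgt_vocab: int, hidden: int):
        super().__init__()
        # Row-stochastic token-level translation rules p(tgt | src); could be
        # initialized from a lexicon learning algorithm, then fine-tuned.
        self.lex_logits = nn.Parameter(torch.zeros(src_vocab, tgt_vocab))
        self.gate = nn.Linear(hidden, 1)  # generate-vs-translate gate

    def forward(self, dec_state, attn, src_ids, vocab_probs):
        # attn: (batch, src_len) copy attention; src_ids: (batch, src_len)
        lex_table = torch.softmax(self.lex_logits, dim=-1)
        # Attention-weighted lexicon rows give a translation distribution.
        lex_probs = torch.einsum('bs,bst->bt', attn, lex_table[src_ids])
        p_gen = torch.sigmoid(self.gate(dec_state))  # (batch, 1)
        return p_gen * vocab_probs + (1 - p_gen) * lex_probs

# Toy usage: 2 examples, 5 source tokens, 20/15 source/target vocab sizes.
m = LexicalTranslator(src_vocab=20, tgt_vocab=15, hidden=8)
out = m(torch.randn(2, 8),
        torch.softmax(torch.randn(2, 5), dim=-1),
        torch.randint(0, 20, (2, 5)),
        torch.full((2, 15), 1 / 15))
print(out.sum(dim=-1))  # rows sum to 1
```

Since the attention weights and each lexicon row are normalized, the mixture is itself a valid distribution; decontextualizing the table is what lets the same rule apply systematically across contexts.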
word2word: A Collection of Bilingual Lexicons for 3,564 Language Pairs
We present word2word, a publicly available dataset and an open-source Python
package for cross-lingual word translations extracted from sentence-level
parallel corpora. Our dataset provides top-k word translations in 3,564
(directed) language pairs across 62 languages in OpenSubtitles2018 (Lison et
al., 2018). To obtain this dataset, we use a count-based bilingual lexicon
extraction model based on the observation that not only source and target words
but also source words themselves can be highly correlated. We illustrate that
the resulting bilingual lexicons have high coverage and attain competitive
translation quality for several language pairs. We wrap our dataset and model
in an easy-to-use Python library, which supports downloading and retrieving
top-k word translations in any of the supported language pairs as well as
computing top-k word translations for custom parallel corpora.
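
For orientation, here is a minimal usage example of the word2word Python package, following its public README (install with pip install word2word); exact outputs depend on the downloaded lexicon, and the n_best keyword is taken from the README example rather than verified against every release. The package's interface for building lexicons from custom parallel corpora is not shown here.

```python
# Minimal word2word usage per the project README.
from word2word import Word2word

# Downloads (or loads from cache) the precomputed en->fr lexicon on first use.
en2fr = Word2word("en", "fr")

# Top-k word translations for a query word.
print(en2fr("apple"))            # e.g. ['pomme', 'pommes', ...]
print(en2fr("apple", n_best=2))  # keep only the top-2 candidates
```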