The TALP–UPC Spanish–English WMT biomedical task: bilingual embeddings and char-based neural language model rescoring in a phrase-based system
This paper describes the TALP–UPC system in the Spanish–English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules progressively improve the final translation as measured by a combination of several lexical metrics.
Word-to-Word Models of Translational Equivalence
Parallel texts (bitexts) have properties that distinguish them from other
kinds of parallel data. First, most words translate to only one other word.
Second, bitext correspondence is noisy. This article presents methods for
biasing statistical translation models to reflect these properties. Analysis of
the expected behavior of these biases in the presence of sparse data predicts
that they will result in more accurate models. The prediction is confirmed by
evaluation with respect to a gold standard -- translation models that are
biased in this fashion are significantly more accurate than a baseline
knowledge-poor model. This article also shows how a statistical translation
model can take advantage of various kinds of pre-existing knowledge that might
be available about particular language pairs. Even the simplest kinds of
language-specific knowledge, such as the distinction between content words and
function words, are shown to reliably boost translation model performance on
some tasks. Statistical models that are informed by pre-existing knowledge
about the model domain combine the best of both the rationalist and empiricist
traditions.