The TALP & I2R SMT Systems for IWSLT 2008
This paper describes the statistical machine translation (SMT) systems developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) for our participation in the IWSLT'08 evaluation campaign. We present N-gram-based (TALPtuples) and phrase-based (TALPphrases) SMT systems. The paper explains the 2008 systems' architecture and outlines the translation schemes we used, focusing mainly on the new techniques aimed at improving speech-to-speech translation quality. The novelties we introduce are an improved reordering method, a linear combination of translation and reordering models, and a new technique for punctuation-mark insertion in a phrase-based SMT system.
This year we focus on the Arabic-English, Chinese-Spanish, and pivot Chinese-(English)-Spanish translation tasks.
Postprint (published version)
The TALP–UPC Spanish–English WMT biomedical task: bilingual embeddings and char-based neural language model rescoring in a phrase-based system
This paper describes the TALP–UPC system in the Spanish–English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules progressively improve the final translation as measured by a combination of several lexical metrics.
Postprint (published version)
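The vocabulary-expansion idea can be sketched as a nearest-neighbour lookup in a shared bilingual embedding space. This is a minimal illustration, not the paper's implementation: the function name, the two-dimensional toy vectors, and the word lists are hypothetical.

```python
import numpy as np

# Toy bilingual embedding space (hypothetical hand-made vectors): source
# and target words are assumed to live in one shared space, as in the
# vocabulary-expansion module described above.
src_emb = {"perro": np.array([0.9, 0.1]), "gato": np.array([0.1, 0.9])}
tgt_emb = {"dog": np.array([0.88, 0.12]), "cat": np.array([0.12, 0.88])}

def expand_oov(oov_word, src_emb, tgt_emb):
    """Map an out-of-vocabulary source word to the target word whose
    embedding is closest by cosine similarity."""
    v = src_emb[oov_word]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(tgt_emb, key=lambda w: cos(v, tgt_emb[w]))
```

In a full system the chosen target word (or its translation candidates) would then be injected into the phrase table so that the decoder no longer treats the source word as unknown.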
Neural network language models to select the best translation
The quality of translations produced by statistical machine translation (SMT) systems crucially depends on the generalization ability of the statistical models involved in the process. While most modern SMT systems use n-gram models to predict the next element in a sequence of tokens, our system uses a continuous-space language model (LM) based on neural networks (NN). In contrast to works in which the NN LM is used only to estimate the probabilities of shortlist words (Schwenk 2010), we calculate the posterior probabilities of out-of-shortlist words using an additional neuron and unigram probabilities. Experimental results on a small Italian-to-English and a large Arabic-to-English translation task, which take into account different word-history lengths (n-gram order), show that NN LMs scale to small and large data and can improve an n-gram-based SMT system. For the most part, this approach aims to improve translation quality for tasks that lack translation data, but we also demonstrate its scalability to large-vocabulary tasks.
Khalilov, M.; Fonollosa, J. A.; Zamora-Martínez, F.; Castro Bleda, M. J.; España Boquera, S. (2013). Neural network language models to select the best translation. Computational Linguistics in the Netherlands Journal, (3):217-233. http://hdl.handle.net/10251/46629
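The out-of-shortlist mechanism sketched in the abstract can be illustrated as follows. This is a toy sketch under assumed details: the shortlist, the unigram table, and the function are hypothetical stand-ins, not the paper's actual network.

```python
import numpy as np

shortlist = ["the", "cat", "sat"]        # words with dedicated output units
oos_unigram = {"mat": 0.6, "hat": 0.4}   # renormalised unigram probabilities
                                         # of out-of-shortlist (OOS) words

def word_posterior(logits, word):
    """Posterior P(word | history) from an output layer that has one
    extra neuron collecting the probability mass of all OOS words."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    if word in shortlist:
        return float(probs[shortlist.index(word)])
    # split the extra neuron's mass over OOS words by unigram probability
    return float(probs[-1] * oos_unigram[word])
```

Because the renormalised unigram probabilities of the out-of-shortlist words sum to one, the posteriors over the full vocabulary still sum to one.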
N-gram-based statistical machine translation versus syntax augmented machine translation: comparison and system combination
In this paper we compare and contrast two approaches to Machine Translation (MT): the CMU-UKA Syntax Augmented Machine Translation system (SAMT) and UPC-TALP N-gram-based Statistical Machine Translation (SMT). SAMT is a hierarchical syntax-driven translation system underlain by a phrase-based model and a target-side parse tree. In N-gram-based SMT, the translation process is based on bilingual units derived from word-to-word alignment and on statistical modeling of the bilingual context within a maximum-entropy framework. We provide a step-by-step comparison of the systems and report results in terms of automatic evaluation metrics and the computational resources required for a smaller Arabic-to-English translation task (1.5M tokens in the training corpus). Human error analysis clarifies the advantages and disadvantages of the systems under consideration. Finally, we combine the output of both systems to yield significant improvements in translation quality.
Postprint (published version)
Domain adaptation strategies in statistical machine translation: a brief overview
© Cambridge University Press, 2015. Statistical machine translation (SMT) is gaining interest given that it can easily be adapted to any pair of languages. One of the main challenges in SMT is domain adaptation, because translation performance drops when testing conditions deviate from training conditions. A growing body of research addresses this challenge, focused on trying to exploit all kinds of material, if available. This paper provides an overview of the research that copes with the domain adaptation challenge in SMT.
Peer Reviewed. Postprint (author's final draft)
Coupling hierarchical word reordering and decoding in phrase-based statistical machine translation
In this paper, we start from the existing idea of taking reordering rules automatically derived from syntactic representations and applying them in a preprocessing step before translation, to make the source sentence structurally more like the target; we then propose a new approach to hierarchically extracting these rules. We evaluate this approach, combined with lattice-based decoding, and show improvements over state-of-the-art distortion models.
Postprint (published version)
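The preprocessing idea can be illustrated with a single hand-written rule. The rule, the POS tags, and the function below are hypothetical toys; the paper derives its rules automatically from syntactic representations and applies them hierarchically.

```python
# A reordering rule rewrites the source word order to resemble the
# target language before translation.
def apply_rule(tagged, pattern, order):
    """Permute any window of POS tags matching `pattern` into `order`.
    tagged: list of (word, tag) pairs."""
    out = list(tagged)
    i = 0
    while i + len(pattern) <= len(out):
        window = out[i:i + len(pattern)]
        if [tag for _, tag in window] == pattern:
            out[i:i + len(pattern)] = [window[j] for j in order]
            i += len(pattern)
        else:
            i += 1
    return out

# Spanish-like "noun adjective" reordered to English-like "adjective noun"
sent = [("casa", "NOUN"), ("blanca", "ADJ")]
reordered = apply_rule(sent, ["NOUN", "ADJ"], [1, 0])
```

In the lattice-based setting of the paper, several alternative reorderings would be encoded as paths in an input lattice rather than committing to a single permutation.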
Towards improving English-Latvian translation: a system comparison and a new rescoring feature
This paper presents a comparative study of two alternative approaches to statistical machine translation (SMT) and their application to a task of English-to-Latvian translation. Furthermore, a novel feature intended to reflect the relatively free word order of the Latvian language is proposed and successfully applied at the n-best-list rescoring step. Moving beyond the automatic scores of translation quality classically presented in MT research papers, we contribute a manual error analysis of the MT systems' output that helps shed light on the advantages and disadvantages of the SMT systems under consideration.
Postprint (published version)
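The rescoring step can be sketched as a weighted re-ranking of an n-best list. This is a minimal illustration: the feature, the weight, the scores, and the Latvian examples are hypothetical stand-ins for the word-order feature proposed in the paper.

```python
def rescore(nbest, feature, weight):
    """nbest: list of (hypothesis, model_score) pairs.
    Return the pair maximising model_score + weight * feature(hyp)."""
    return max(nbest, key=lambda h: h[1] + weight * feature(h[0]))

# Two candidate translations with toy baseline model scores
nbest = [("viņš mājās gāja", -10.2),
         ("viņš gāja mājās", -10.5)]

# Toy stand-in for the free-word-order feature described in the paper
feature = lambda hyp: 1.0 if "gāja mājās" in hyp else 0.0

best, score = rescore(nbest, feature, weight=0.5)
```

With weight 0 the baseline ranking is recovered, so the feature weight can be tuned on a development set alongside the other log-linear model weights.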
English-Latvian SMT: the challenge of translating into a free word order language
This paper presents a comparative study of two approaches to statistical machine translation (SMT) and their application to a task of English-to-Latvian translation, which is still an open research line in the field of automatic translation. We consider a state-of-the-art phrase-based SMT system and an alternative N-gram-based SMT system. The major differences between these two approaches lie in the distinct representations of the bilingual units that compose the bilingual model driving translation, and in the statistical modeling of the translation context. Latvian, being a rather free word order language, poses additional difficulties for the translation process. We contrast different reordering models and investigate how well they deal with the word-ordering issue. Moving beyond the automatic scores of translation quality classically presented in MT research papers, we contribute a manual error analysis of the MT systems' output that helps shed light on the advantages and disadvantages of the SMT systems under consideration and identifies the most prominent sources of errors typical of both SMT systems.
Postprint (published version)