Modeling Target-Side Inflection in Neural Machine Translation
NMT systems have problems with large vocabulary sizes. Byte-pair encoding
(BPE) is a popular approach to solving this problem, but while BPE allows the
system to generate any target-side word, it does not enable effective
generalization over the rich vocabulary in morphologically rich languages with
strong inflectional phenomena. We introduce a simple approach to overcome this
problem by training a system to produce the lemma of a word and its
morphologically rich POS tag, which is then followed by a deterministic
generation step. We apply this strategy for English-Czech and English-German
translation scenarios, obtaining improvements in both settings. We furthermore
show that the improvement is not due to only adding explicit morphological
information. Comment: Accepted as a research paper at WMT17. (Updated version with
corrected references.)
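The two-step idea in this abstract (predict a lemma plus a morphological tag, then generate the surface form deterministically) can be illustrated with a minimal sketch. The lexicon, the tag strings, and the fallback to the bare lemma are all hypothetical choices made for illustration; the abstract does not specify the tagset or the generation tool.

```python
# Hypothetical toy lexicon: (lemma, tag) -> inflected surface form.
# The tag strings are invented for this sketch, not taken from the paper.
LEXICON = {
    ("Haus", "NN.Dat.Pl"): "Häusern",
    ("Haus", "NN.Nom.Sg"): "Haus",
    ("gehen", "V.3.Sg.Pres"): "geht",
}


def generate(pairs):
    """Deterministically map (lemma, tag) pairs to surface forms.

    Unknown pairs fall back to the bare lemma (an assumption of this sketch).
    """
    return [LEXICON.get((lemma, tag), lemma) for lemma, tag in pairs]


# Pretend this sequence came from the translation system's output.
predicted = [
    ("gehen", "V.3.Sg.Pres"),
    ("Haus", "NN.Dat.Pl"),
    ("Baum", "NN.Nom.Sg"),  # not in the toy lexicon
]
print(" ".join(generate(predicted)))
```

Because the generation step is a plain lookup, the translation system only has to choose among lemmas and tags rather than among every inflected form.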
Target-Side Context for Discriminative Models in Statistical Machine Translation
Discriminative translation models utilizing source context have been shown to
help statistical machine translation performance. We propose a novel extension
of this work using target context information. Surprisingly, we show that this
model can be efficiently integrated directly in the decoding process. Our
approach scales to large training data sizes and results in consistent
improvements in translation quality on four language pairs. We also provide an
analysis comparing the strengths of the baseline source-context model with our
extended source-context and target-context model and we show that our extension
allows us to better capture morphological coherence. Our work is freely
available as part of Moses. Comment: Accepted as a long paper for ACL 201
The Design of Eman, an Experiment Manager
We present eman, a tool for managing large numbers of computational experiments. Over
the years of our research in machine translation (MT), we have collected a couple of ideas for
efficient experimenting. We believe these ideas are generally applicable in (computational)
research of any field. We incorporated them into eman in order to make them available in a
command-line Unix environment.
The aim of this article is to highlight the core of the many ideas. We hope the text can serve
as a collection of experiment management tips and tricks for anyone, regardless of their field of
study or the computer platform they use. The specific examples we provide in eman’s current syntax
are less important, but they allow us to use concrete terms. The article thus also fills a gap in eman’s documentation by providing a high-level overview.
Improving Evaluation of English-Czech MT through Paraphrasing
In this paper, we present a method of improving the accuracy of machine translation
evaluation of Czech sentences. Given a reference sentence, our algorithm transforms it
by targeted paraphrasing into a new synthetic reference sentence that is closer in
wording to the machine translation output, but at the same time preserves the meaning of
the original reference sentence.
Grammatical correctness of the new reference sentence is ensured by applying Depfix to the
newly created paraphrases. Depfix is a system for post-editing English-to-Czech machine
translation outputs. We adjusted it to fix the errors in paraphrased sentences.
Because our source of paraphrases is noisy, we experiment with adding word alignment. However,
the alignment reduces the number of paraphrases found, and the best results were achieved
by a simple greedy method that uses only one-word paraphrases, thanks to their intensive filtering.
BLEU scores computed using these new reference sentences show significantly higher correlation
with human judgment than scores computed on the original reference sentences.
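A rough sketch of what a greedy one-word paraphrasing pass could look like is given below. It is an illustration under stated assumptions, not the authors' algorithm: the paraphrase table, the "first candidate found in the MT output wins" rule, and the English toy sentences are all hypothetical, and the Depfix correction step is omitted.

```python
def greedy_paraphrase(reference, hypothesis, paraphrases):
    """Move a reference closer to the MT output using one-word paraphrases.

    For each reference word that the hypothesis does not contain, substitute
    the first listed paraphrase that the hypothesis does contain; otherwise
    keep the original word.
    """
    hyp_words = set(hypothesis)
    result = []
    for word in reference:
        if word in hyp_words:
            result.append(word)
            continue
        candidates = paraphrases.get(word, [])
        match = next((c for c in candidates if c in hyp_words), None)
        result.append(match if match is not None else word)
    return result


# Hypothetical paraphrase table and toy sentences.
PARAPHRASES = {"automobile": ["car", "vehicle"], "quick": ["fast"]}
reference = "the automobile was quick".split()
hypothesis = "the car was very fast".split()
print(" ".join(greedy_paraphrase(reference, hypothesis, PARAPHRASES)))
```

The synthetic reference now shares more words with the MT output, so an n-gram metric such as BLEU no longer penalises wording differences that the paraphrase table covers.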
Federalism and its manifestations in the European Union
The thesis focuses on the concept of federalism and its manifestations in the European Union. The European Union, as an entity uniting European member states in a supranational community, shows some features of a future federal state, although the current European Union does not claim to be heading in a federalist direction. The thesis is divided into three main parts. The introduction gives a general discussion of federalism, federations and their features. In the second, central part, the European Union is analysed through the features of federalism and their presence in the European Union is assessed. In the final part, I develop my own analytical perspective on the specificities of European federalism. In assessing these specificities, I focus on the combination of the subtle, gradual expansion of the European Union's powers and the European Union's specific approach to the formal features of federalism and statehood. In particular, I look at examples of active avoidance of any federalist or state terminology by the European Union and the Member States. On the whole, the European Union has an appropriate arrangement of bodies and a division of power among the executive, legislative and judicial branches, as well as between the bodies of the Union and those of the Member States...
Department of Constitutional Law, Faculty of Law
MTMonkey: A Scalable Infrastructure for a Machine Translation Web Service
We present a web service which handles and distributes JSON-encoded HTTP
requests for machine translation (MT) among multiple machines running
an MT system, including text pre- and post-processing.
It is currently used to provide MT between several languages
for cross-lingual information retrieval in the Khresmoi project.
The software consists of an application server and remote workers which handle
text processing and communicate translation requests to MT
systems. The communication between the application server and the workers is
based on the XML-RPC protocol. We present
the overall design of the software and test results which document
speed and scalability of our solution.
Our software is licensed under the Apache 2.0 licence and is available for
download from the Lindat-Clarin repository and GitHub.
CUNI in WMT14: Chimera Still Awaits Bellerophon
We present our English→Czech and
English→Hindi submissions for this
year’s WMT translation task. For
English→Czech, we build upon last year’s
CHIMERA and evaluate several setups.
English→Hindi is a new language pair for
this year. We experimented with reverse
self-training to acquire more (synthetic)
parallel data and with modeling target-side
morphology.
HindEnCorp – Hindi-English and Hindi-only Corpus for Machine Translation
We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.
Integrating a Discriminative Classifier into Phrase-based and Hierarchical Decoding
Current state-of-the-art statistical machine translation (SMT) relies
on simple feature functions which make independence assumptions at the
level of phrases or CFG rules. However, it is well known that
discriminative models can benefit from rich features extracted from
the source sentence context outside of the applied phrase or CFG rule,
which is available at decoding time. We present a framework for the
open-source decoder Moses that allows discriminative models over
source context to easily be trained on a large number of examples and
then be included as feature functions in decoding.
Machine Translation of Medical Texts in the Khresmoi Project
The WMT 2014 Medical Translation Task poses an interesting challenge for Machine Translation
(MT). In the standard translation task, the end application is the translation itself. In this task, the MT system is considered a part of a larger system for cross-lingual information retrieval (IR)
- …