Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords
We present a new release of the Czech-English parallel corpus CzEng 2.0
consisting of over 2 billion words (2 "gigawords") in each language. The corpus
contains document-level information and is filtered with several techniques to
reduce noise. In addition to the data from the previous version of CzEng, it
contains new authentic parallel data as well as high-quality synthetic parallel
data. CzEng is freely available for research and educational purposes.
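The abstract does not detail the filtering techniques, but noise filtering for parallel corpora is commonly done with simple sentence-pair heuristics. The sketch below illustrates the general idea with a length-ratio filter and empty/overlong checks; it is an illustrative assumption, not the actual CzEng 2.0 pipeline.

```python
# Illustrative noise filter for parallel sentence pairs. The specific
# heuristics (length ratio, maximum length) are common in corpus cleaning
# but are assumptions here, not the CzEng 2.0 procedure.

def keep_pair(src: str, tgt: str, max_ratio: float = 2.0, max_len: int = 200) -> bool:
    """Return True if the sentence pair passes simple noise heuristics."""
    src_tokens = src.split()
    tgt_tokens = tgt.split()
    # Drop empty or overly long sentences.
    if not src_tokens or not tgt_tokens:
        return False
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    # Drop pairs with wildly mismatched lengths (likely misalignments).
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1.0 / max_ratio <= ratio <= max_ratio

pairs = [
    ("Dobrý den .", "Good day ."),   # kept
    ("Ahoj", ""),                    # empty target: dropped
    ("a b c d e f g h", "x"),        # length mismatch: dropped
]
filtered = [p for p in pairs if keep_pair(*p)]
```

In practice such heuristics are combined with model-based scores (e.g. alignment or language-model scores), but even the cheap checks above remove a large share of misaligned pairs.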
Improving Paraphrase Detection with the Adversarial Paraphrasing Task
If two sentences have the same meaning, it should follow that they are
equivalent in their inferential properties, i.e., each sentence should
textually entail the other. However, many paraphrase datasets currently in
widespread use rely on a sense of paraphrase based on word overlap and syntax.
Can we instead teach models to identify paraphrases in a way that draws on the
inferential properties of the sentences and is not over-reliant on lexical and
syntactic similarities of a sentence pair? We apply the adversarial paradigm to
this question, and introduce a new adversarial method of dataset creation for
paraphrase identification: the Adversarial Paraphrasing Task (APT), which asks
participants to generate semantically equivalent (in the sense of mutually
implicative) but lexically and syntactically disparate paraphrases. These
sentence pairs can then be used both to test paraphrase identification models
(which achieve near-random accuracy on them) and to improve their performance.
To
accelerate dataset generation, we explore automation of APT using T5, and show
that the resulting dataset also improves accuracy. We discuss implications for
paraphrase detection and release our dataset in the hope of making paraphrase
detection models better able to detect sentence-level meaning equivalence.
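The surface signal that APT pairs are designed to defeat can be made concrete with a word-overlap score such as Jaccard similarity over token sets. The example sentences below are illustrative, not drawn from the released dataset: a mutually entailing but lexically disparate pair scores low, while a high-overlap pair with a different meaning scores at the maximum.

```python
# Jaccard word overlap: the kind of lexical-similarity signal that an
# overlap-based sense of paraphrase relies on, and that APT pairs defeat.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Mutually entailing but lexically disparate (an APT-style pair): low score.
low = jaccard("the meeting was postponed", "they delayed the gathering")

# Identical word sets but reversed meaning: maximal score.
high = jaccard("the cat chased the dog", "the dog chased the cat")
```

A model trained to track this score would reject the first pair and accept the second, which is exactly backwards with respect to meaning equivalence; this is the failure mode the adversarially collected pairs target.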
End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages
Lexically constrained machine translation allows the user to manipulate the
output sentence by enforcing the presence or absence of certain words and
phrases. Although current approaches can enforce terms to appear in the
translation, they often struggle to make the constraint word form agree with
the rest of the generated output. Our manual analysis shows that 46% of the
errors in the output of a baseline constrained model for English to Czech
translation are related to agreement. We investigate mechanisms to allow neural
machine translation to infer the correct word inflection given lemmatized
constraints. In particular, we focus on methods based on training the model
with constraints provided as part of the input sequence. Our experiments on the
English-Czech language pair show that this approach improves the translation of
constrained terms in both automatic and manual evaluation by reducing errors in
agreement. Our approach thus reduces inflection errors without introducing
new errors or decreasing the overall quality of the translation.
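Training with constraints provided as part of the input sequence typically means concatenating the lemmatized constraint terms to the source sentence with special separator tokens. The sketch below shows one plausible input encoding; the separator tokens and exact format are assumptions for illustration, and the paper's actual encoding may differ.

```python
# Sketch: appending lemmatized target-side constraints to the source
# sentence so the model sees them as part of its input sequence.
# The <sep> and <c> markers are hypothetical, chosen for illustration.

def build_input(source: str, constraint_lemmas: list[str],
                sep: str = "<sep>", csep: str = "<c>") -> str:
    """Build a training/inference input with inline lemmatized constraints."""
    if not constraint_lemmas:
        return source
    return f"{source} {sep} " + f" {csep} ".join(constraint_lemmas)

inp = build_input("The agreement was signed yesterday.", ["smlouva", "podepsat"])
# -> "The agreement was signed yesterday. <sep> smlouva <c> podepsat"
```

Because the constraints are lemmas rather than surface forms, the model is free to generate whatever inflected form (e.g. "smlouvu", "podepsána") agrees with the rest of its output, which is the agreement behavior the abstract describes.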