Evaluating prose style transfer with the Bible
In the prose style transfer task a system, provided with text input and a
target prose style, produces output which preserves the meaning of the input
text but alters the style. These systems require parallel data for evaluation
of results and usually make use of parallel data for training. Currently, there
are few publicly available corpora for this task. In this work, we identify a
high-quality source of aligned, stylistically distinct text in different
versions of the Bible. We provide a standardized split of the public domain
versions in our corpus into training, development, and testing data. This
corpus is highly parallel since many Bible versions are included. Sentences are
aligned due to the presence of chapter and verse numbers within all versions of
the text. In addition to the corpus, we present the results, as measured by the
BLEU and PINC metrics, of several models trained on our data which can serve as
baselines for future research. While we present this dataset as a style transfer
corpus, we believe that it is of unmatched quality and may be useful for other
natural language tasks as well.
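As an illustration of the PINC metric mentioned in the abstract above, here is a minimal sketch following the commonly cited definition (Chen and Dolan, 2011): the average, over n-gram orders 1 to N, of the fraction of candidate n-grams that do not appear in the source. The function names, whitespace tokenization, lowercasing, and the choice to skip orders for which the candidate has no n-grams are assumptions made for this sketch, not details taken from the paper.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """PINC score: higher means the candidate shares fewer n-grams with the source.

    Assumptions of this sketch: whitespace tokenization, lowercasing, and
    n-gram orders for which the candidate has no n-grams are skipped.
    """
    src = source.lower().split()
    cand = candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # candidate is shorter than n tokens
        overlap = len(cand_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy sentences, not drawn from the corpus.
    print(pinc("the cat sat", "the dog sat"))
```

A score of 0 means the candidate copies every n-gram from the source, and a score of 1 means it shares none. PINC rewards lexical change only, which is why it is usually reported alongside BLEU, which measures closeness to a reference.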
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
We describe ParaNMT-50M, a dataset of more than 50 million English-English
sentential paraphrase pairs. We generated the pairs automatically by using
neural machine translation to translate the non-English side of a large
parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M
can be a valuable resource for paraphrase generation and can provide a rich
source of semantic knowledge to improve downstream natural language
understanding tasks. To show its utility, we use ParaNMT-50M to train
paraphrastic sentence embeddings that outperform all supervised systems on
every SemEval semantic textual similarity competition, in addition to showing
how it can be used for paraphrase generation.