424 research outputs found
Evaluating prose style transfer with the Bible
In the prose style transfer task a system, provided with text input and a
target prose style, produces output which preserves the meaning of the input
text but alters the style. These systems require parallel data for evaluation
of results and usually make use of parallel data for training. Currently, there
are few publicly available corpora for this task. In this work, we identify a
high-quality source of aligned, stylistically distinct text in different
versions of the Bible. We provide a standardized split, into training,
development and testing data, of the public domain versions in our corpus. This
corpus is highly parallel since many Bible versions are included. Sentences are
aligned due to the presence of chapter and verse numbers within all versions of
the text. In addition to the corpus, we present the results, as measured by the
BLEU and PINC metrics, of several models trained on our data which can serve as
baselines for future research. While we present these data as a style transfer
corpus, we believe that it is of unmatched quality and may be useful for other
natural language tasks as well
A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In
this paper, we present a new method to collect large-scale sentential
paraphrases from Twitter by linking tweets through shared URLs. The main
advantage of our method is its simplicity, as it gets rid of the classifier or
human in the loop needed to select data before annotation and subsequent
application of paraphrase identification algorithms in the previous work. We
present the largest human-labeled paraphrase corpus to date of 51,524 sentence
pairs and the first cross-domain benchmarking for automatic paraphrase
identification. In addition, we show that more than 30,000 new sentential
paraphrases can be easily and continuously captured every month at ~70%
precision, and demonstrate their utility for downstream NLP tasks through
phrasal paraphrase extraction. We make our code and data freely available.Comment: 11 pages, accepted to EMNLP 201
Investigation of text data augmentation for transformer training via translation technique
Data augmentation can improve model’s final accuracy by introducing new data samples to the dataset. In this paper, text data augmentation using translation technique is investigated. Synthetic translations, generated by Opus-MT model are compared to the unique foreign data samples in terms of an impact to the trans- former network-based models’ performance. The experimental results showed that multilingual models like DistilBERT in some cases benefit from the introduction of the addition artificially created data samples presented in a foreign language
- …