The recent success of large language models for text generation poses a
severe threat to academic integrity, as plagiarists can generate realistic
paraphrases indistinguishable from original work. However, the role of large
autoregressive transformers in generating machine-paraphrased plagiarism and
their detection is still developing in the literature. This work explores T5
and GPT-3 for machine-paraphrase generation on scientific articles from arXiv,
student theses, and Wikipedia. We evaluate the detection performance of six
automated solutions and one commercial plagiarism detection software and
perform a human study with 105 participants regarding their detection
performance and the quality of generated examples. Our results suggest that
large models can rewrite text humans have difficulty identifying as
machine-paraphrased (53% mean acc.). Human experts rate the quality of
paraphrases generated by GPT-3 as high as original texts (clarity 4.0/5,
fluency 4.2/5, coherence 3.8/5). The best-performing detection model (GPT-3)
achieves a 66% F1-score in detecting paraphrases