We consider the problem of automatically generating textual paraphrases with
modified attributes or stylistic properties, focusing on the setting without
parallel data (Hu et al., 2017; Shen et al., 2017). This setting poses
challenges for learning and evaluation. We show that the metric of
post-transfer classification accuracy is insufficient on its own, and propose
additional metrics based on semantic content preservation and fluency. For
reliable evaluation, all three metric categories (transfer accuracy, semantic
preservation, and fluency) must be taken into account.
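As one plausible instantiation of these categories (our notation, not the
paper's), let $c$ be a pretrained style classifier, $t_i$ the target style for
source $x_i$ and transferred output $\hat{y}_i$, $\mathrm{sim}$ a semantic
similarity function, and $p_{\mathrm{LM}}$ a style-specific language model over
$N$ total output tokens:
\[
\mathrm{Acc} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big[c(\hat{y}_i) = t_i\big],
\quad
\mathrm{Sim} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{sim}(x_i, \hat{y}_i),
\quad
\mathrm{PPL} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{n} \log p_{\mathrm{LM}}(\hat{y}_i)\Big)
\]
Higher $\mathrm{Acc}$ and $\mathrm{Sim}$ and lower $\mathrm{PPL}$ are better; a
system can trivially excel at any one of these alone (e.g., copying the input
maximizes $\mathrm{Sim}$ but fails on $\mathrm{Acc}$), which is why all three
are needed jointly.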
We contribute new loss functions and training strategies that target these
metrics. Semantic preservation is addressed by adding a cyclic consistency loss
and a loss based on paraphrase pairs, while fluency is improved by integrating
losses based on style-specific language models.
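As a minimal sketch of how these terms combine (the additive form and the
$\lambda$ weights are our assumptions, not notation from the paper), the full
objective can be read as the base transfer objective of Shen et al. (2017)
plus the three new terms:
\[
\mathcal{L} = \mathcal{L}_{\mathrm{base}}
+ \lambda_{\mathrm{cyc}}\,\mathcal{L}_{\mathrm{cyc}}
+ \lambda_{\mathrm{para}}\,\mathcal{L}_{\mathrm{para}}
+ \lambda_{\mathrm{lm}}\,\mathcal{L}_{\mathrm{lm}}
\]
where $\mathcal{L}_{\mathrm{cyc}}$ enforces round-trip reconstruction of the
input, $\mathcal{L}_{\mathrm{para}}$ is computed on paraphrase pairs, and
$\mathcal{L}_{\mathrm{lm}}$ penalizes outputs that the target style's language
model finds unlikely.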
Both automatic and manual evaluation show large improvements over the baseline
method of Shen et al. (2017). We hope that these losses and metrics can serve
as general and useful tools for a range of textual transfer settings without
parallel corpora.