Text Style Transfer (TST) evaluation is, in practice, inconsistent.
Therefore, we conduct a meta-analysis on human and automated TST evaluation and
experimentation that thoroughly examines existing literature in the field. The
meta-analysis reveals a substantial standardization gap in human and automated
evaluation. In addition, we also find a validation gap: only few automated
metrics have been validated using human experiments. To this end, we thoroughly
scrutinize both the standardization and validation gap and reveal the resulting
pitfalls. This work also paves the way to close the standardization and
validation gap in TST evaluation by calling out requirements to be met by
future research.Comment: Accepted to Findings of ACL 202