The problem of detecting scientific fraud using machine learning was recently
introduced, with initial, positive results from a model taking into account
various general indicators. The results seem to suggest that writing style is
predictive of scientific fraud. We revisit these initial experiments, and show
that the leave-one-out testing procedure they used likely leads to a slight
over-estimate of the predictability, but also that simple models can outperform
their proposed model by some margin. We go on to explore more abstract
linguistic features, such as linguistic complexity and discourse structure,
only to obtain negative results. Upon analyzing our models, we do see some
interesting patterns, though: Scientific fraud, for example, contains less
comparison, as well as different types of hedging and ways of presenting
logical reasoning.

Comment: To appear in the Proceedings of the Workshop on Stylistic Variation
2017 (EMNLP), 6 pages