Developing state-of-the-art approaches for specific tasks is a major driving
force in our research community. Depending on the prestige of the task,
publishing a new state-of-the-art result can bring considerable visibility.
This raises the question: how reliable are our evaluation methodologies for
comparing approaches?
One common methodology for identifying the state of the art is to partition the
data into a training, a development, and a test set. Researchers train and tune
their approach on the training and development data and then select the model
that performed best on the development set for a final evaluation on the unseen
test data. The test scores of different approaches are compared, and
performance differences are tested for statistical significance.
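The following sketch illustrates this protocol for two already-trained models. It assumes per-example test scores (e.g. per-sentence F1 or accuracy) are available as arrays; the paired bootstrap test, the 10,000 resamples, and all names are illustrative choices for the sketch, not part of the paper's setup.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap test on per-example test scores."""
    rng = np.random.default_rng(seed)
    # Per-example score differences between the two selected models.
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diffs.mean()
    extreme = 0
    for _ in range(n_resamples):
        sample = rng.choice(diffs, size=len(diffs), replace=True)
        # Centre the resampled mean to simulate the null hypothesis of
        # no true difference between the two approaches.
        if abs(sample.mean() - observed) >= abs(observed):
            extreme += 1
    return extreme / n_resamples
```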
In this publication, we show that a statistically significant difference in
this type of evaluation carries a high risk of not being due to a superior
learning approach but rather to chance. For the CoNLL 2003 NER dataset, for
example, we observed type I errors (false positives) in up to 26% of cases at a
threshold of p < 0.05, i.e., we falsely concluded a statistically significant
difference between two identical approaches.
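A hedged sketch of how such a type I error rate can be estimated, reusing the paired_bootstrap_pvalue helper from the previous sketch: train the same architecture repeatedly with different random seeds and count how often a pair of runs is (falsely) reported as significantly different. The pairwise procedure and function names are assumptions for illustration, not the paper's exact experimental protocol.

```python
import itertools

def type_one_error_rate(runs, alpha=0.05):
    """Fraction of run pairs flagged as significantly different.

    `runs` holds per-example test scores from several trainings of the *same*
    approach, differing only in random seed; any "significant" difference
    detected between two such runs is a false positive.
    """
    pairs = list(itertools.combinations(range(len(runs)), 2))
    false_positives = sum(
        paired_bootstrap_pvalue(runs[a], runs[b]) < alpha for a, b in pairs
    )
    return false_positives / len(pairs)
```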
We prove that this evaluation setup is unsuitable for comparing learning
approaches. We formalize alternative evaluation setups based on score
distributions.
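As a rough illustration of a distribution-based comparison (not the paper's formal setups): each approach is trained several times, and the resulting samples of test scores are compared as distributions, for instance with Welch's t-test. The scores below and the choice of test are assumptions made for the sketch.

```python
from scipy import stats

# Test-set F1 scores from several training runs (e.g. different seeds)
# of two approaches; the numbers are placeholders.
f1_approach_a = [90.1, 90.4, 89.8, 90.6, 90.2]
f1_approach_b = [90.3, 90.9, 90.5, 91.0, 90.7]

# Welch's t-test compares the two score distributions without assuming
# equal variances, so the conclusion no longer hinges on a single
# lucky or unlucky training run.
t_stat, p_value = stats.ttest_ind(f1_approach_a, f1_approach_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```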