Developing state-of-the-art approaches for specific tasks is a major driving
force in our research community. Depending on the prestige of the task,
publishing a new state-of-the-art result can bring considerable visibility.
This raises the question: how reliable are our evaluation methodologies for
comparing approaches?
One common methodology for identifying the state of the art is to partition the
data into a training, a development, and a test set. Researchers train and tune
their approach on the training and development data and then select the model
that performed best on the development set for a final evaluation on the unseen
test data. The test scores of different approaches are compared, and
performance differences are tested for statistical significance.
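The following sketch illustrates this protocol for two already-trained models. It assumes per-example test scores (e.g. per-sentence F1 or accuracy) are available as arrays; the paired bootstrap test, the 10,000 resamples, and all names are illustrative choices for the sketch, not part of the paper's setup.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap test on per-example test scores."""
    rng = np.random.default_rng(seed)
    # Per-example score differences between the two selected models.
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diffs.mean()
    extreme = 0
    for _ in range(n_resamples):
        sample = rng.choice(diffs, size=len(diffs), replace=True)
        # Centre the resampled mean to simulate the null hypothesis of
        # no true difference between the two approaches.
        if abs(sample.mean() - observed) >= abs(observed):
            extreme += 1
    return extreme / n_resamples
```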
In this publication, we show that a statistically significant difference in
this type of evaluation carries a high risk of not being due to a superior
learning approach but rather to chance. For the CoNLL 2003 NER dataset, for
example, we observed type I errors (false positives) in up to 26% of cases at a
threshold of p < 0.05, i.e., we falsely concluded a statistically significant
difference between two identical approaches.
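A hedged sketch of how such a type I error rate can be estimated, reusing the paired_bootstrap_pvalue helper from the previous sketch: train the same architecture repeatedly with different random seeds and count how often a pair of runs is (falsely) reported as significantly different. The pairwise procedure and function names are assumptions for illustration, not the paper's exact experimental protocol.

```python
import itertools

def type_one_error_rate(runs, alpha=0.05):
    """Fraction of run pairs flagged as significantly different.

    `runs` holds per-example test scores from several trainings of the *same*
    approach, differing only in random seed; any "significant" difference
    detected between two such runs is a false positive.
    """
    pairs = list(itertools.combinations(range(len(runs)), 2))
    false_positives = sum(
        paired_bootstrap_pvalue(runs[a], runs[b]) < alpha for a, b in pairs
    )
    return false_positives / len(pairs)
```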
We prove that this evaluation setup is unsuitable for comparing learning
approaches. We formalize alternative evaluation setups based on score
distributions.
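As a rough illustration of a distribution-based comparison (not the paper's formal setups): each approach is trained several times, and the resulting samples of test scores are compared as distributions, for instance with Welch's t-test. The scores below and the choice of test are assumptions made for the sketch.

```python
from scipy import stats

# Test-set F1 scores from several training runs (e.g. different seeds)
# of two approaches; the numbers are placeholders.
f1_approach_a = [90.1, 90.4, 89.8, 90.6, 90.2]
f1_approach_b = [90.3, 90.9, 90.5, 91.0, 90.7]

# Welch's t-test compares the two score distributions without assuming
# equal variances, so the conclusion no longer hinges on a single
# lucky or unlucky training run.
t_stat, p_value = stats.ttest_ind(f1_approach_a, f1_approach_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```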