Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches
Developing state-of-the-art approaches for specific tasks is a major driving
force in our research community. Depending on the prestige of the task, publishing such an approach can bring a lot of visibility. This raises the question: how reliable are our evaluation methodologies for comparing approaches?
One common methodology to identify the state-of-the-art is to partition data
into a train, a development and a test set. Researchers can train and tune
their approach on some part of the dataset and then select the model that
worked best on the development set for a final evaluation on unseen test data.
Test scores from different approaches are compared, and performance differences
are tested for statistical significance.
In this publication, we show that there is a high risk that a statistically significant difference in this type of evaluation is not due to a superior learning approach but rather to chance. For example, for the CoNLL 2003 NER dataset we observed type I errors (false positives) in up to 26% of cases at a threshold of p < 0.05, i.e., we falsely concluded a statistically significant difference between two identical approaches.
We prove that this evaluation setup is unsuitable for comparing learning approaches and formalize alternative evaluation setups based on score distributions.
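To make this failure mode concrete, the following minimal simulation (not the paper's code; the noise model and effect size are assumptions for illustration) compares two runs of an identical approach with a standard significance test on per-item outcomes and counts how often the test reports a "significant" difference at p < 0.05.

```python
# Minimal simulation sketch (not the paper's code): two runs of an *identical*
# approach are compared with a standard significance test on per-item outcomes.
# The noise model -- each training run lands at a slightly different true
# accuracy because of random initialisation -- is an assumption for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def single_run_outcomes(n_items=3000, base_accuracy=0.90):
    """Per-item correctness (0/1) of one trained model on the test set."""
    run_accuracy = base_accuracy + rng.normal(0.0, 0.005)  # assumed run-to-run spread
    return (rng.random(n_items) < run_accuracy).astype(float)

n_comparisons, false_positives = 1000, 0
for _ in range(n_comparisons):
    run_a = single_run_outcomes()  # "approach A": one training run
    run_b = single_run_outcomes()  # "approach B": the same approach, different seed
    result = stats.ttest_ind(run_a, run_b)
    false_positives += result.pvalue < 0.05

print(f"False positives at p < 0.05: {false_positives / n_comparisons:.1%}")
```

Because the test only models per-item variance and ignores the run-to-run variance introduced by training, the false-positive rate ends up well above the nominal 5%, mirroring the effect reported above for CoNLL 2003 NER.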
Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging
In this paper we show that reporting a single performance score is
insufficient to compare non-deterministic approaches. We demonstrate for common
sequence tagging tasks that the seed value for the random number generator can
result in statistically significant (p < 10^-4) differences for
state-of-the-art systems. For two recent systems for NER, we observe an
absolute difference of one percentage point F1-score depending on the selected
seed value, which can make these systems appear either state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we present network architectures that produce superior performance and are more stable with respect to the remaining hyperparameters.
Comment: Accepted at EMNLP 2017
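The sketch below illustrates the proposed remedy of comparing score distributions from multiple executions; `train_and_eval` is a hypothetical stand-in for a seed-dependent training routine, and the F1 values it returns are invented placeholders.

```python
# Hedged sketch of the proposed remedy: compare score *distributions* from many
# executions instead of two single scores. `train_and_eval` is a hypothetical
# stand-in for a seed-dependent training routine; the F1 values are invented.
import numpy as np
from scipy import stats

def train_and_eval(seed: int) -> float:
    """Placeholder: train a tagger with this seed and return its test F1."""
    rng = np.random.default_rng(seed)
    return 0.905 + rng.normal(0.0, 0.004)

scores_a = np.array([train_and_eval(seed) for seed in range(20)])        # architecture A
scores_b = np.array([train_and_eval(seed + 100) for seed in range(20)])  # architecture B

print(f"A: mean F1 = {scores_a.mean():.4f} +/- {scores_a.std():.4f}")
print(f"B: mean F1 = {scores_b.mean():.4f} +/- {scores_b.std():.4f}")

# Non-parametric test on the two score distributions
result = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"Mann-Whitney U = {result.statistic:.1f}, p = {result.pvalue:.3f}")
```

A non-parametric test such as Mann-Whitney U makes no normality assumption about the scores across runs, which is convenient when only a few dozen training runs are affordable.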
LIMEtree: Interactively Customisable Explanations Based on Local Surrogate Multi-output Regression Trees
Systems based on artificial intelligence and machine learning models should
be transparent, in the sense of being capable of explaining their decisions to
gain humans' approval and trust. While there are a number of explainability
techniques that can be used to this end, many of them are only capable of
outputting a single one-size-fits-all explanation that simply cannot address
all of the explainees' diverse needs. In this work we introduce a
model-agnostic and post-hoc local explainability technique for black-box
predictions called LIMEtree, which employs surrogate multi-output regression
trees. We validate our algorithm on a deep neural network trained for object
detection in images and compare it against Local Interpretable Model-agnostic
Explanations (LIME). Our method comes with local fidelity guarantees and can
produce a range of diverse explanation types, including contrastive and
counterfactual explanations praised in the literature. Some of these
explanations can be interactively personalised to create bespoke, meaningful
and actionable insights into the model's behaviour. While other methods may give an illusion of customisability by wrapping otherwise static explanations in an interactive interface, our explanations are truly interactive, in the sense of allowing the user to "interrogate" a black-box model. LIMEtree can therefore produce consistent explanations on which an interactive exploratory process can be built.
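As an illustration of the general idea only (not the authors' implementation), the sketch below fits a multi-output regression tree as a local surrogate of a black-box classifier; the black box, the dataset, and the Gaussian perturbation scheme are all assumptions chosen to keep the example self-contained.

```python
# Illustrative sketch only (not the authors' implementation): a multi-output
# regression tree is fitted as a *local* surrogate of a black-box classifier.
# The black box, the dataset, and the Gaussian perturbation scheme are all
# assumptions chosen to keep the example self-contained.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor

X, y = load_iris(return_X_y=True)
black_box = RandomForestClassifier(random_state=0).fit(X, y)  # stand-in black box

instance = X[0]  # the prediction we want to explain
rng = np.random.default_rng(0)
# Sample a local neighbourhood around the instance (assumed sampling scheme)
neighbourhood = instance + rng.normal(0.0, 0.5, size=(500, X.shape[1]))
targets = black_box.predict_proba(neighbourhood)  # one regression target per class

# A single multi-output tree models all class probabilities jointly
surrogate = DecisionTreeRegressor(max_depth=3).fit(neighbourhood, targets)
print("surrogate:", surrogate.predict(instance.reshape(1, -1)))
print("black box:", black_box.predict_proba(instance.reshape(1, -1)))
```

Because one tree predicts all class probabilities jointly, questions such as "why class A rather than class B?" can be answered from a single surrogate, which is what enables the contrastive and counterfactual explanation types mentioned above.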
Unmasking Clever Hans Predictors and Assessing What Machines Really Learn
Current learning machines have successfully solved hard application problems,
reaching high accuracy and displaying seemingly "intelligent" behavior. Here we
apply recent techniques for explaining decisions of state-of-the-art learning
machines and analyze various tasks from computer vision and arcade games. This
showcases a spectrum of problem-solving behaviors ranging from naive and
short-sighted, to well-informed and strategic. We observe that standard performance evaluation metrics can fail to distinguish between these diverse problem-solving behaviors. Furthermore, we propose our semi-automated Spectral
Relevance Analysis that provides a practically effective way of characterizing
and validating the behavior of nonlinear learning machines. This helps to
assess whether a learned model indeed delivers reliably for the problem that it
was conceived for. Furthermore, our work intends to add a voice of caution to
the ongoing excitement about machine intelligence and pledges to evaluate and
judge some of these recent successes in a more nuanced manner.
Comment: Accepted for publication in Nature Communications
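The rough sketch below conveys the flavour of such an analysis (it is not the published SpRAy pipeline): per-sample relevance maps are grouped with spectral clustering to surface distinct prediction strategies; the relevance maps here are random placeholders standing in for attributions from a method such as Layer-wise Relevance Propagation.

```python
# Rough sketch, not the published SpRAy pipeline: per-sample relevance maps are
# clustered with spectral clustering to surface groups of prediction strategies.
# The relevance maps below are random placeholders standing in for attributions
# from a method such as Layer-wise Relevance Propagation.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
relevance_maps = rng.random((200, 28 * 28))  # placeholder: 200 flattened 28x28 heatmaps

clustering = SpectralClustering(
    n_clusters=4,                  # number of candidate strategies (assumed)
    affinity="nearest_neighbors",  # similarity graph over relevance maps
    n_neighbors=10,
    random_state=0,
)
labels = clustering.fit_predict(relevance_maps)

# Inspect cluster sizes; a coherent cluster whose relevance concentrates on an
# artefact (e.g. a watermark) rather than the object of interest would point to
# a "Clever Hans" strategy.
print(np.bincount(labels))
```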