A tool for subjective and interactive visual data exploration
We present SIDE, a tool for Subjective and Interactive Visual Data Exploration, which lets users explore high-dimensional data via subjectively informative 2D data visualizations. Many existing visual analytics tools are either restricted to specific problems and domains, or they aim to find visualizations that align with the user's beliefs about the data. In contrast, our generic tool computes data visualizations that are surprising given a user's current understanding of the data, where the user's belief state is represented as a set of projection tiles. This user-awareness offers users an efficient way to interactively explore yet-unknown features of complex high-dimensional datasets.
Statistical challenges of administrative and transaction data
Administrative data are becoming increasingly important. They are typically the side effect of some operational exercise and are often seen as having significant advantages over alternative sources of data. Although it is true that such data have merits, statisticians should approach their analysis with the same cautious and critical eye as they approach data from any other source. The paper identifies some statistical challenges, with the aim of stimulating debate about, and improving, the analysis of administrative data, and of encouraging methodology researchers to explore some of the important statistical problems which arise with such data.
Efficient estimation of AUC in a sliding window
In many applications, monitoring the area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window. More specifically, we propose an algorithm that, given an approximation parameter, estimates AUC within that tolerance and can maintain this estimate efficiently, per update, as the window slides. This provides a speed-up over the exact computation of AUC, whose per-update cost grows with the window size; the speed-up becomes more significant as the window grows. Our estimate is based on grouping the data points together and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough that the error stays small, (ii) the number of groups is small enough that enumerating them is not expensive, and (iii) the definition is flexible enough that we can maintain the groups efficiently. Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
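The exact baseline the speed-up is measured against can be made concrete. Below is a minimal Python sketch (not the paper's grouping algorithm; function names are illustrative) of exact AUC over a sliding window, recomputed from scratch at every update via the Mann-Whitney pair-counting formulation.

```python
from collections import deque

def auc(labels, scores):
    """Exact AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sliding_auc(stream, k):
    """Naive baseline: recompute AUC over the last k (label, score)
    pairs at every update -- per-update cost grows with k."""
    window = deque(maxlen=k)
    out = []
    for item in stream:
        window.append(item)
        out.append(auc([y for y, _ in window], [s for _, s in window]))
    return out
```

Each update scans all positive/negative pairs in the window, so the cost grows quickly with the window length — exactly the expense the grouping scheme is designed to avoid.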
A review of the F-measure: its history, properties, criticism, and alternatives
Methods to classify objects into two or more classes are at the core of various disciplines. When a set of objects with their true classes is available, a supervised classifier can be trained and employed to decide whether, for example, a new patient has cancer. The choice of performance measure is critical in deciding which supervised method to use in any particular classification problem. Different measures can lead to very different choices, so the measure should match the objectives. Many performance measures have been developed, and one of them is the F-measure, the harmonic mean of precision and recall. Originally proposed in information retrieval, the F-measure has gained increasing interest in the context of classification. However, the rationale underlying this measure appears weak, and unlike other measures it does not have a representational meaning. The use of the harmonic mean also has little theoretical justification. The F-measure also stresses one class, which seems inappropriate for general classification problems. We provide a history of the F-measure and its use in computational disciplines, describe its properties, and discuss criticism of the F-measure. We conclude with alternatives to the F-measure, and recommendations on how to use it effectively.
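Concretely, the F-measure and its standard F-beta generalisation can be computed from confusion-matrix counts. This sketch uses the textbook definition; the function name and count arguments are illustrative.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), the weighted
    harmonic mean of precision P and recall R. beta > 1 weights recall
    more heavily; beta = 1 gives the usual F-measure (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0
```

Note that the measure is defined from positive-class counts only (tp, fp, fn; true negatives never enter), which is exactly the class asymmetry the review criticises for general classification problems.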
Estimating bank default with generalised extreme value regression models
The paper proposes a novel model for the prediction of bank failures, on the basis of both macroeconomic and bank-specific microeconomic factors. As bank failures are rare, we apply a regression method for binary data based on extreme value theory, which turns out to be more effective than classical logistic regression models, as it better leverages the information in the tail of the default distribution. The application of this model to the occurrence of bank defaults in a highly bank-dependent economy (Italy) shows that, while microeconomic factors as well as regulatory capital are significant in explaining failures proper, macroeconomic factors become relevant only when failures are defined not only in terms of actual defaults but also in terms of mergers and acquisitions. In terms of predictive accuracy, the model based on extreme value theory outperforms classical logistic regression models.
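The extreme-value ingredient can be illustrated through the inverse link. A common choice in GEV regression for rare binary events is to use the GEV cdf itself as the inverse link, pi(eta) = exp(-(1 + xi*eta)^(-1/xi)); whether this exact parameterisation matches the paper's is an assumption, and the shape value below is purely illustrative, not an estimate from the paper.

```python
import math

def gev_inverse_link(eta, xi=0.25):
    """GEV cdf as inverse link: maps the linear predictor eta to a
    default probability. xi is an illustrative shape parameter."""
    t = 1.0 + xi * eta
    if t <= 0.0:
        # outside the GEV support: probability hits its endpoint
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def logit_inverse_link(eta):
    """Classical logistic inverse link, for comparison."""
    return 1.0 / (1.0 + math.exp(-eta))
```

Unlike the logit, which is symmetric around probability 0.5, the GEV curve approaches 0 and 1 at different rates, which is what lets it track the tail behaviour of a rare-event (default) distribution.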
Randomized Reference Classifier with Gaussian Distribution and Soft Confusion Matrix Applied to the Improving Weak Classifiers
In this paper, the issue of building the randomized reference classifier (RRC) model using probability distributions other than the beta distribution is addressed. More precisely, we propose to build the RRC model using the truncated normal distribution. Heuristic procedures for the expected value and variance of the truncated normal distribution are also proposed. The proposed approach is tested using a soft-confusion-matrix (SCM) based model to assess the consequences of applying the truncated normal distribution in the RRC model. The experimental evaluation is performed using four different base classifiers and seven quality measures. The results show that the proposed approach is comparable to the RRC model built using the beta distribution. What is more, for some base classifiers, the truncated-normal-based SCM algorithm turned out to be better at discovering objects coming from minority classes.
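For the truncated-normal ingredient, the exact first two moments have well-known closed forms, which the proposed heuristic procedures presumably approximate. A stdlib-only sketch of those reference formulas (function names are illustrative; this does not reproduce the paper's heuristics):

```python
import math

def phi(x):
    """Standard normal pdf (taken as 0 at +/-inf)."""
    return 0.0 if math.isinf(x) else math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _xphi(x):
    """x * phi(x), taking the limit 0 at +/-inf."""
    return 0.0 if math.isinf(x) else x * phi(x)

def truncnorm_moments(mu, sigma, a, b):
    """Exact mean and variance of N(mu, sigma^2) truncated to [a, b],
    via the standard closed-form expressions."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    z = Phi(beta) - Phi(alpha)                 # probability mass kept
    shift = (phi(alpha) - phi(beta)) / z
    mean = mu + sigma * shift
    var = sigma**2 * (1.0 + (_xphi(alpha) - _xphi(beta)) / z - shift**2)
    return mean, var
```

For example, a standard normal truncated to [0, inf) has mean sqrt(2/pi), and any symmetric truncation shrinks the variance below that of the untruncated distribution.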