A tool for subjective and interactive visual data exploration
We present SIDE, a tool for Subjective and Interactive Visual Data Exploration, which lets users explore high-dimensional data via subjectively informative 2D data visualizations. Many existing visual analytics tools are either restricted to specific problems and domains, or they aim to find visualizations that align with the user's beliefs about the data. In contrast, our generic tool computes data visualizations that are surprising given a user's current understanding of the data, where the user's belief state is represented as a set of projection tiles. This user-awareness offers users an efficient way to interactively explore yet-unknown features of complex high-dimensional datasets.
Statistical challenges of administrative and transaction data
Administrative data are becoming increasingly important. They are typically the side effect of some operational exercise and are often seen as having significant advantages over alternative sources of data. Although it is true that such data have merits, statisticians should approach their analysis with the same cautious and critical eye as they approach data from any other source. The paper identifies some statistical challenges, with the aim of stimulating debate about, and improving, the analysis of administrative data, and of encouraging methodology researchers to explore some of the important statistical problems which arise with such data.
Efficient estimation of AUC in a sliding window
In many applications, monitoring the area under the ROC curve (AUC) in a sliding window over a data stream is a natural way of detecting changes in the system. The drawback is that computing AUC in a sliding window is expensive, especially if the window size is large and the data flow is significant. In this paper we propose a scheme for maintaining an approximate AUC in a sliding window. More specifically, we propose an algorithm that, given an approximation parameter, estimates AUC within that tolerance and can maintain this estimate efficiently, per update, as the window slides. This provides a speed-up over the exact computation of AUC, whose per-update cost grows with the window size; the speed-up becomes more significant as the window grows. Our estimate is based on grouping the data points together and using these groups to calculate AUC. The grouping is designed carefully such that (i) the groups are small enough that the error stays small, (ii) the number of groups is small enough that enumerating them is not expensive, and (iii) the definition is flexible enough that we can maintain the groups efficiently. Our experimental evaluation demonstrates that the average approximation error in practice is much smaller than the approximation guarantee, and that we can achieve significant speed-ups with only a modest sacrifice in accuracy.
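The exact baseline the speed-up is measured against can be made concrete. Below is a minimal Python sketch (not the paper's grouping algorithm; function names are illustrative) of exact AUC over a sliding window, recomputed from scratch at every update via the Mann-Whitney pair-counting formulation.

```python
from collections import deque

def auc(labels, scores):
    """Exact AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs ranked correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return float("nan")  # AUC undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sliding_auc(stream, k):
    """Naive baseline: recompute AUC over the last k (label, score)
    pairs at every update -- per-update cost grows with k."""
    window = deque(maxlen=k)
    out = []
    for item in stream:
        window.append(item)
        out.append(auc([y for y, _ in window], [s for _, s in window]))
    return out
```

Each update scans all positive/negative pairs in the window, so the cost grows quickly with the window length — exactly the expense the grouping scheme is designed to avoid.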
A review of the F-measure: its history, properties, criticism, and alternatives
Methods to classify objects into two or more classes are at the core of various disciplines. When a set of objects with their true classes is available, a supervised classifier can be trained and employed to decide whether, for example, a new patient has cancer. The choice of performance measure is critical in deciding which supervised method to use in any particular classification problem. Different measures can lead to very different choices, so the measure should match the objectives. Many performance measures have been developed, and one of them is the F-measure, the harmonic mean of precision and recall. Originally proposed in information retrieval, the F-measure has gained increasing interest in the context of classification. However, the rationale underlying this measure appears weak, and unlike other measures it does not have a representational meaning. The use of the harmonic mean also has little theoretical justification. The F-measure also stresses one class, which seems inappropriate for general classification problems. We provide a history of the F-measure and its use in computational disciplines, describe its properties, and discuss criticism of the F-measure. We conclude with alternatives to the F-measure, and recommendations on how to use it effectively.
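Concretely, the F-measure and its standard F-beta generalisation can be computed from confusion-matrix counts. This sketch uses the textbook definition; the function name and count arguments are illustrative.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), the weighted
    harmonic mean of precision P and recall R. beta > 1 weights recall
    more heavily; beta = 1 gives the usual F-measure (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0
```

Note that the measure is defined from positive-class counts only (tp, fp, fn; true negatives never enter), which is exactly the class asymmetry the review criticises for general classification problems.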
Estimating bank default with generalised extreme value regression models
The paper proposes a novel model for the prediction of bank failures, on the basis of both macroeconomic and bank-specific microeconomic factors. As bank failures are rare, we apply a regression method for binary data based on extreme value theory, which turns out to be more effective than classical logistic regression models, as it better leverages the information in the tail of the default distribution. The application of this model to the occurrence of bank defaults in a highly bank-dependent economy (Italy) shows that, while microeconomic factors as well as regulatory capital are significant in explaining failures proper, macroeconomic factors become relevant only when failures are defined not only in terms of actual defaults but also in terms of mergers and acquisitions. In terms of predictive accuracy, the model based on extreme value theory outperforms classical logistic regression models.
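The extreme-value ingredient can be illustrated through the inverse link. A common choice in GEV regression for rare binary events is to use the GEV cdf itself as the inverse link, pi(eta) = exp(-(1 + xi*eta)^(-1/xi)); whether this exact parameterisation matches the paper's is an assumption, and the shape value below is purely illustrative, not an estimate from the paper.

```python
import math

def gev_inverse_link(eta, xi=0.25):
    """GEV cdf as inverse link: maps the linear predictor eta to a
    default probability. xi is an illustrative shape parameter."""
    t = 1.0 + xi * eta
    if t <= 0.0:
        # outside the GEV support: probability hits its endpoint
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def logit_inverse_link(eta):
    """Classical logistic inverse link, for comparison."""
    return 1.0 / (1.0 + math.exp(-eta))
```

Unlike the logit, which is symmetric around probability 0.5, the GEV curve approaches 0 and 1 at different rates, which is what lets it track the tail behaviour of a rare-event (default) distribution.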
Randomized Reference Classifier with Gaussian Distribution and Soft Confusion Matrix Applied to the Improving Weak Classifiers
In this paper, the issue of building the randomized reference classifier (RRC) model using probability distributions other than the beta distribution is addressed. More precisely, we propose to build the RRC model using the truncated normal distribution. Heuristic procedures for the expected value and variance of the truncated normal distribution are also proposed. The proposed approach is tested using a soft-confusion-matrix (SCM) based model to assess the consequences of applying the truncated normal distribution in the RRC model. The experimental evaluation is performed using four different base classifiers and seven quality measures. The results show that the proposed approach is comparable to the RRC model built using the beta distribution. What is more, for some base classifiers, the truncated-normal-based SCM algorithm turned out to be better at discovering objects coming from minority classes.
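For the truncated-normal ingredient, the exact first two moments have well-known closed forms, which the proposed heuristic procedures presumably approximate. A stdlib-only sketch of those reference formulas (function names are illustrative; this does not reproduce the paper's heuristics):

```python
import math

def phi(x):
    """Standard normal pdf (taken as 0 at +/-inf)."""
    return 0.0 if math.isinf(x) else math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _xphi(x):
    """x * phi(x), taking the limit 0 at +/-inf."""
    return 0.0 if math.isinf(x) else x * phi(x)

def truncnorm_moments(mu, sigma, a, b):
    """Exact mean and variance of N(mu, sigma^2) truncated to [a, b],
    via the standard closed-form expressions."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    z = Phi(beta) - Phi(alpha)                 # probability mass kept
    shift = (phi(alpha) - phi(beta)) / z
    mean = mu + sigma * shift
    var = sigma**2 * (1.0 + (_xphi(alpha) - _xphi(beta)) / z - shift**2)
    return mean, var
```

For example, a standard normal truncated to [0, inf) has mean sqrt(2/pi), and any symmetric truncation shrinks the variance below that of the untruncated distribution.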