Binary Classifier Calibration using an Ensemble of Near Isotonic Regression Models
Learning accurate probabilistic models from data is crucial in many practical
tasks in data mining. In this paper we present a new non-parametric calibration
method called \textit{ensemble of near isotonic regression} (ENIR). The method
can be considered as an extension of BBQ, a recently proposed calibration
method, as well as the commonly used calibration method based on isotonic
regression. ENIR is designed to address the key limitation of isotonic
regression, namely the monotonicity assumption it imposes on the predictions. Similar to
BBQ, the method post-processes the output of a binary classifier to obtain
calibrated probabilities. Thus it can be combined with many existing
classification models. We demonstrate the performance of ENIR on synthetic and
real datasets for commonly used binary classification models. Experimental
results show that the method outperforms several common binary classifier
calibration methods. In particular, on the real data ENIR frequently performs
statistically significantly better than the other methods and never performs
worse. It improves the calibration of classifiers while retaining their
discrimination power. The method is also computationally tractable for
large-scale datasets, as it runs in $O(N \log N)$ time, where $N$ is the
number of samples.
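For illustration, here is a minimal sketch of the isotonic-regression calibration step that ENIR generalizes, using scikit-learn on synthetic data. This is not the ENIR algorithm itself; ENIR averages a path of near-isotonic fits rather than a single isotonic one.

```python
# Minimal sketch: post-processing a classifier's scores with isotonic
# regression -- the baseline calibration method that ENIR extends.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
raw = clf.predict_proba(X_cal)[:, 1]  # uncalibrated scores

# Fit a monotone map from raw scores to empirical probabilities on a
# held-out calibration set, then apply it to obtain calibrated outputs.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw, y_cal)
calibrated = iso.predict(raw)
```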
Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm
The Markov Blanket Bayesian Classifier is a recently-proposed algorithm for
construction of probabilistic classifiers. This paper presents an empirical
comparison of the MBBC algorithm with three other Bayesian classifiers: Naive
Bayes, Tree-Augmented Naive Bayes and a general Bayesian network. All of these
are implemented using the K2 framework of Cooper and Herskovits. The
classifiers are compared in terms of their performance (using simple accuracy
measures and ROC curves) and speed, on a range of standard benchmark data sets.
It is concluded that MBBC is competitive in terms of speed and accuracy with
the other algorithms considered. Comment: 9 pages. Technical Report No. NUIG-IT-011002, Department of Information Technology, National University of Ireland, Galway (2002).
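As a rough sketch of the evaluation protocol described above (accuracy and ROC on standard benchmark data), the following uses scikit-learn's Gaussian Naive Bayes; MBBC, TAN, and K2-learned Bayesian networks have no standard scikit-learn implementation, so only the Naive Bayes baseline is shown.

```python
# Sketch of the comparison protocol: cross-validated accuracy and ROC AUC
# for a Naive Bayes baseline on a standard benchmark dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
nb = GaussianNB()

acc = cross_val_score(nb, X, y, cv=10, scoring="accuracy")
auc = cross_val_score(nb, X, y, cv=10, scoring="roc_auc")
print(f"Naive Bayes: accuracy {acc.mean():.3f}, ROC AUC {auc.mean():.3f}")
```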
Reliable ABC model choice via random forests
Approximate Bayesian computation (ABC) methods provide an elaborate approach
to Bayesian inference on complex models, including model choice. Both
theoretical arguments and simulation experiments indicate, however, that model
posterior probabilities may be poorly evaluated by standard ABC techniques. We
propose a novel approach based on a machine learning tool named random forests
to conduct selection among the highly complex models covered by ABC algorithms.
We thus modify the way Bayesian model selection is both understood and
operated, in that we rephrase the inferential goal as a classification problem,
first predicting the model that best fits the data with random forests and
postponing the approximation of the posterior probability of the predicted MAP
for a second stage also relying on random forests. Compared with earlier
implementations of ABC model choice, the ABC random forest approach offers
several potential improvements: (i) it often has a larger discriminative power
among the competing models, (ii) it is more robust against the number and
choice of statistics summarizing the data, (iii) the computing effort is
drastically reduced (with a gain in computational efficiency of at least a factor of fifty),
and (iv) it includes an approximation of the posterior probability of the
selected model. The call to random forests will undoubtedly extend the range of
dataset sizes and model complexities that ABC can handle. We illustrate
the power of this novel methodology by analyzing controlled experiments as well
as genuine population genetics datasets. The proposed methodologies are
implemented in the R package abcrf, available on CRAN. Comment: 39 pages, 15 figures, 6 tables.
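A toy Python stand-in for the ABC random forest idea follows (the reference implementation is the R package abcrf). The two candidate models, summary statistics, and sample sizes below are illustrative assumptions, not the paper's setup.

```python
# Toy ABC model choice via a random forest: simulate from each candidate
# model, summarize the data, and classify the model index from summaries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def simulate(model, n=100):
    """Draw one dataset from model 0 (Gaussian) or model 1 (Laplace)."""
    x = rng.normal(size=n) if model == 0 else rng.laplace(size=n)
    # Hypothetical summary statistics standing in for problem-specific ones.
    return [x.mean(), x.std(), np.abs(x).mean(), np.percentile(x, 90)]

# Reference table: model indices drawn from the prior, then simulated data.
labels = rng.integers(0, 2, size=5000)
table = np.array([simulate(m) for m in labels])

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(table, labels)

# Classify the observed summaries. Vote proportions are only a crude proxy;
# the paper estimates the posterior probability of the selected model with a
# second, separate regression forest.
observed = np.array([simulate(1)])  # pretend these are the real data
print("MAP model:", rf.predict(observed)[0])
print("vote proportions:", rf.predict_proba(observed)[0])
```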
Dealing with Label Switching in Mixture Models Under Genuine Multimodality
The fitting of finite mixture models is an ill-defined estimation problem, as completely different parameterizations can induce similar mixture distributions. This leads to multiple modes in the likelihood, which is a problem for frequentist maximum likelihood estimation and complicates the statistical inference of Markov chain Monte Carlo draws in Bayesian estimation. For the analysis of the posterior density of these draws, a suitable separation into different modes is desirable. In addition, a unique labelling of the component-specific estimates is necessary to solve the label switching problem. This paper presents and compares two approaches to achieve these goals: relabelling under multimodality and constrained clustering. The algorithmic details are discussed, and their application is demonstrated on artificial and real-world data.
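As a minimal sketch of the label switching problem and one crude remedy, the following relabels synthetic MCMC draws with an ordering constraint on the component means. The paper's relabelling and constrained-clustering algorithms are more robust under genuine multimodality and are not reproduced here.

```python
# Undoing label switching by imposing an identifiability constraint
# (mu_1 < mu_2 < mu_3) on each posterior draw of a 3-component mixture.
import numpy as np

rng = np.random.default_rng(0)

# Fake posterior draws: (n_draws, n_components) arrays of means and
# weights, with component labels randomly permuted within each draw.
means = rng.normal(loc=[0.0, 3.0, 6.0], scale=0.1, size=(1000, 3))
weights = np.full((1000, 3), 1 / 3)
perm = rng.permuted(np.tile(np.arange(3), (1000, 1)), axis=1)
means = np.take_along_axis(means, perm, axis=1)
weights = np.take_along_axis(weights, perm, axis=1)

# Relabel each draw by sorting on the component means.
order = np.argsort(means, axis=1)
means = np.take_along_axis(means, order, axis=1)
weights = np.take_along_axis(weights, order, axis=1)

print("component mean estimates:", means.mean(axis=0))
```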
Automatic discovery of optimal classes
A criterion, based on Bayes' theorem, is described that defines the optimal set of classes (a classification) for a given set of examples. This criterion is transformed into an equivalent minimum message length criterion with an intuitive information-theoretic interpretation. The criterion does not require that the number of classes be specified in advance; this is determined by the data. The minimum message length criterion includes the message length required to describe the classes, so there is a built-in bias against adding new classes unless they lead to a reduction in the message length required to describe the data. Unfortunately, the search space of possible classifications is too large to search exhaustively, so heuristic search methods, such as simulated annealing, are applied. Tutored learning and probabilistic prediction in particular cases are an important indirect result of optimal class discovery. Extensions to the basic class induction program include the ability to combine categorical and real-valued data, hierarchical classes, independent classifications, and deciding for each class which attributes are relevant.
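A toy illustration of the message-length trade-off follows, using scikit-learn's GaussianMixture with BIC as a crude stand-in for the minimum message length criterion (the two criteria differ, but both weigh the cost of describing extra classes against the improved fit to the data).

```python
# Two-part trade-off: more classes shorten the data description but cost
# more to state. BIC is used here as a rough proxy for message length.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-4, 1, (300, 1)),
                    rng.normal(0, 1, (300, 1)),
                    rng.normal(4, 1, (300, 1))])

# The number of classes is not fixed in advance: score each candidate and
# let the penalized criterion pick it.
scores = {k: GaussianMixture(k, random_state=0).fit(X).bic(X)
          for k in range(1, 8)}
best = min(scores, key=scores.get)
print("chosen number of classes:", best)
```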