PDE-Foam - a probability-density estimation method using self-adapting phase-space binning
Probability Density Estimation (PDE) is a multivariate discrimination
technique based on sampling signal and background densities defined by event
samples from data or Monte-Carlo (MC) simulations in a multi-dimensional phase
space. In this paper, we present a modification of the PDE method that uses a
self-adapting binning method to divide the multi-dimensional phase space into a
finite number of hyper-rectangles (cells). The binning algorithm adjusts the
size and position of a predefined number of cells inside the multi-dimensional
phase space, minimising the variance of the signal and background densities
inside the cells. The implementation of the binning algorithm PDE-Foam is based
on the MC event-generation package Foam. We present performance results for
representative examples (toy models) and discuss the dependence of the obtained
results on the choice of parameters. The new PDE-Foam shows improved
classification capability for small training samples and reduced classification
time compared to the original PDE method based on range searching.

Comment: 19 pages, 11 figures; replaced with revised version accepted for publication in NIM A; corrected typos in the description of Fig. 7
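To make the idea concrete, here is a minimal Python sketch of density estimation with self-adapting rectangular cells. It is not the actual PDE-Foam/TMVA implementation: the split heuristic below (bisecting the most populated cell along its widest axis) is a simplified stand-in for Foam's variance-minimising cell optimisation, and all names are illustrative.

```python
import numpy as np

def build_foam(sig, bkg, lo, hi, n_cells=64):
    """Grow a 'foam' of hyper-rectangular cells over the phase space.
    Repeatedly splits one cell in two until n_cells is reached; here the
    most populated cell is bisected along its widest axis, a simplified
    stand-in for Foam's variance-minimising split search."""
    cells = [(lo, hi, sig, bkg)]
    while len(cells) < n_cells:
        i = max(range(len(cells)),
                key=lambda j: len(cells[j][2]) + len(cells[j][3]))
        clo, chi, s, b = cells.pop(i)
        d = int(np.argmax(chi - clo))          # widest dimension
        cut = 0.5 * (clo[d] + chi[d])          # bisect it
        lhi, rlo = chi.copy(), clo.copy()
        lhi[d] = rlo[d] = cut
        cells.append((clo, lhi, s[s[:, d] <= cut], b[b[:, d] <= cut]))
        cells.append((rlo, chi, s[s[:, d] > cut], b[b[:, d] > cut]))
    return cells

def discriminant(cells, x):
    """PDE discriminant D(x) = n_sig / (n_sig + n_bkg) in the cell
    containing x; 0.5 means 'no information'."""
    for clo, chi, s, b in cells:
        if np.all(x >= clo) and np.all(x <= chi):
            n = len(s) + len(b)
            return len(s) / n if n else 0.5
    return 0.5

# Toy model: two overlapping 2-D Gaussians for signal and background.
rng = np.random.default_rng(0)
sig = rng.normal(1.0, 1.0, size=(5000, 2))
bkg = rng.normal(-1.0, 1.0, size=(5000, 2))
cells = build_foam(sig, bkg, np.full(2, -6.0), np.full(2, 6.0))
print(discriminant(cells, np.array([0.8, 0.8])))   # leans signal-like
```

A query costs one cell lookup over a fixed number of cells, whereas the original range-searching PDE must scan the training events around each query point, which is the classification-time saving the abstract refers to.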
Classification with the nearest neighbor rule in general finite dimensional spaces: necessary and sufficient conditions
Given an $n$-sample of random vectors $(X_i, Y_i)_{1 \le i \le n}$ whose
joint law is unknown, the long-standing problem of supervised classification
aims to \textit{optimally} predict the label $Y$ of a given new observation
$X$. In this context, the $k$-nearest neighbor rule is a popular, flexible and
intuitive method in non-parametric situations.
Even though this algorithm is commonly used in the machine learning and
statistics communities, less is known about its prediction ability in general
finite dimensional spaces, especially when the support of the density of the
observations is $\mathbb{R}^d$. This paper is devoted to the study of the
statistical properties of the $k$-nearest neighbor rule in various situations. In
particular, attention is paid to the marginal law of $X$, as well as the
smoothness and margin properties of the \textit{regression function}
$\eta(x) = \mathbb{E}[Y \mid X = x]$. We identify two necessary and sufficient conditions to
obtain uniform consistency rates of classification and to derive sharp
estimates in the case of the $k$-nearest neighbor rule. Some numerical experiments
are proposed at the end of the paper to help illustrate the discussion.

Comment: 53 pages, 3 figures
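As a concrete reference point, here is a minimal sketch of the plug-in $k$-nearest neighbor classifier discussed above, assuming binary labels $Y \in \{0, 1\}$ and Euclidean distance (function and variable names are illustrative):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=5):
    """k-nearest neighbor rule: estimate the regression function
    eta(x) = E[Y | X = x] by averaging the labels of the k training
    points closest to x, then threshold at 1/2 (plug-in classifier)."""
    dist = np.linalg.norm(X_train - x, axis=1)   # distances to all samples
    nearest = np.argsort(dist)[:k]               # indices of the k nearest
    eta_hat = y_train[nearest].mean()            # local label average
    return int(eta_hat > 0.5)

# Toy example: the label depends on the sign of the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
print(knn_classify(X, y, np.array([1.0, 0.0, 0.0]), k=7))   # -> 1
```

The consistency rates studied in the paper concern exactly this estimator: how fast the local average concentrates around $\eta(x)$ depends on the smoothness of $\eta$, the margin behaviour near $\eta = 1/2$, and the marginal law of $X$.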
Asymptotic Generalization Bound of Fisher's Linear Discriminant Analysis
Fisher's linear discriminant analysis (FLDA) is an important dimension
reduction method in statistical pattern recognition. It has been shown that
FLDA is asymptotically Bayes optimal under the homoscedastic Gaussian
assumption. However, this classical result has the following two major
limitations: 1) it holds only for a fixed dimensionality $D$, and thus does not
apply when $D$ and the training sample size $N$ are proportionally large; 2) it
does not provide a quantitative description of how the generalization ability
of FLDA is affected by $D$ and $N$. In this paper, we present an asymptotic
generalization analysis of FLDA based on random matrix theory, in a setting
where both $D$ and $N$ increase and $D/N \longrightarrow \gamma \in [0, 1)$. The
obtained lower bound of the generalization discrimination power overcomes both
limitations of the classical result, i.e., it is applicable when $D$ and $N$
are proportionally large and provides a quantitative description of the
generalization ability of FLDA in terms of the ratio $\gamma = D/N$ and the
population discrimination power. Besides, the discrimination power bound also
leads to an upper bound on the generalization error of binary classification
with FLDA.
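For reference, here is a minimal sketch of the two-class FLDA rule under the homoscedastic Gaussian assumption, with equal priors assumed for simplicity (names are illustrative):

```python
import numpy as np

def fit_flda(X0, X1):
    """Fisher's linear discriminant: project onto w = S^{-1} (mu1 - mu0),
    where S is the pooled within-class covariance; with equal priors the
    threshold sits halfway between the projected class means."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    S = ((n0 - 1) * np.cov(X0, rowvar=False)
         + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    w = np.linalg.solve(S, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)
    return w, b

def flda_predict(w, b, X):
    return (X @ w + b > 0).astype(int)   # 1 = class of X1

# Toy data: D = 10 dimensions, shared covariance, shifted means.
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(300, 10))
X1 = rng.normal(0.5, 1.0, size=(300, 10))
w, b = fit_flda(X0, X1)
print(flda_predict(w, b, X1[:5]))   # mostly 1s
```

The regime the abstract studies is visible here: when $D$ grows in proportion to $N$, the pooled covariance estimate $S$ becomes ill-conditioned and the solve step amplifies estimation noise, which is why the classical fixed-$D$ optimality result stops being informative.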
Reliable ABC model choice via random forests
Approximate Bayesian computation (ABC) methods provide an elaborate approach
to Bayesian inference on complex models, including model choice. Both
theoretical arguments and simulation experiments indicate, however, that model
posterior probabilities may be poorly evaluated by standard ABC techniques. We
propose a novel approach based on a machine learning tool named random forests
to conduct selection among the highly complex models covered by ABC algorithms.
We thus modify the way Bayesian model selection is both understood and
operated, in that we rephrase the inferential goal as a classification problem,
first predicting the model that best fits the data with random forests and
postponing the approximation of the posterior probability of the predicted MAP
for a second stage also relying on random forests. Compared with earlier
implementations of ABC model choice, the ABC random forest approach offers
several potential improvements: (i) it often has a larger discriminative power
among the competing models, (ii) it is more robust against the number and
choice of statistics summarizing the data, (iii) the computing effort is
drastically reduced (with a gain in computational efficiency of a factor of at least fifty),
and (iv) it includes an approximation of the posterior probability of the
selected model. The call to random forests will undoubtedly extend the range of
size of datasets and complexity of models that ABC can handle. We illustrate
the power of this novel methodology by analyzing controlled experiments as well
as genuine population genetics datasets. The proposed methodologies are
implemented in the R package abcrf, available on CRAN.

Comment: 39 pages, 15 figures, 6 tables
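Below is a minimal sketch of the classification stage of ABC model choice via random forests, using scikit-learn in place of the abcrf package. The candidate models, prior, and summary statistics are toy assumptions, and the second stage (approximating the posterior probability of the selected model with a regression forest) is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def summaries(x):
    """Summary statistics fed to the forest (mean, sd, third moment)."""
    return np.array([x.mean(), x.std(), np.mean((x - x.mean()) ** 3)])

# Reference table: simulate datasets under each candidate model.
# Toy stand-ins: model 0 = Gaussian noise, model 1 = Laplace noise.
n_ref, n_obs = 2000, 200
S, labels = [], []
for _ in range(n_ref):
    m = int(rng.integers(2))                      # model index from its prior
    theta = rng.normal(0.0, 1.0)                  # parameter from its prior
    x = (rng.normal(theta, 1.0, n_obs) if m == 0
         else rng.laplace(theta, 1.0, n_obs))
    S.append(summaries(x))
    labels.append(m)

# First stage: predict the best-fitting model from the summaries.
rf = RandomForestClassifier(n_estimators=500).fit(np.array(S), labels)
x_obs = rng.laplace(0.3, 1.0, n_obs)              # pretend observed data
print("selected model:", rf.predict([summaries(x_obs)])[0])
```

Note that the forest's vote fractions are not the model posterior probabilities; that is precisely why the paper defers the posterior probability of the predicted MAP to a second, separate random-forest stage.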