PDE-Foam - a probability-density estimation method using self-adapting phase-space binning
Probability Density Estimation (PDE) is a multivariate discrimination
technique based on sampling signal and background densities defined by event
samples from data or Monte-Carlo (MC) simulations in a multi-dimensional phase
space. In this paper, we present a modification of the PDE method that uses a
self-adapting binning method to divide the multi-dimensional phase space into a
finite number of hyper-rectangles (cells). The binning algorithm adjusts the
size and position of a predefined number of cells inside the multi-dimensional
phase space, minimising the variance of the signal and background densities
inside the cells. The implementation of the binning algorithm PDE-Foam is based
on the MC event-generation package Foam. We present performance results for
representative examples (toy models) and discuss the dependence of the obtained
results on the choice of parameters. The new PDE-Foam shows improved
classification capability for small training samples and reduced classification
time compared to the original PDE method based on range searching.

Comment: 19 pages, 11 figures; replaced with revised version accepted for publication in NIM A; corrected typos in the description of Fig. 7
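To make the idea concrete, here is a minimal Python sketch of density estimation with self-adapting rectangular cells. It is not the actual PDE-Foam/TMVA implementation: the split heuristic below (bisecting the most populated cell along its widest axis) is a simplified stand-in for Foam's variance-minimising cell optimisation, and all names are illustrative.

```python
import numpy as np

def build_foam(sig, bkg, lo, hi, n_cells=64):
    """Grow a 'foam' of hyper-rectangular cells over the phase space.
    Repeatedly splits one cell in two until n_cells is reached; here the
    most populated cell is bisected along its widest axis, a simplified
    stand-in for Foam's variance-minimising split search."""
    cells = [(lo, hi, sig, bkg)]
    while len(cells) < n_cells:
        i = max(range(len(cells)),
                key=lambda j: len(cells[j][2]) + len(cells[j][3]))
        clo, chi, s, b = cells.pop(i)
        d = int(np.argmax(chi - clo))          # widest dimension
        cut = 0.5 * (clo[d] + chi[d])          # bisect it
        lhi, rlo = chi.copy(), clo.copy()
        lhi[d] = rlo[d] = cut
        cells.append((clo, lhi, s[s[:, d] <= cut], b[b[:, d] <= cut]))
        cells.append((rlo, chi, s[s[:, d] > cut], b[b[:, d] > cut]))
    return cells

def discriminant(cells, x):
    """PDE discriminant D(x) = n_sig / (n_sig + n_bkg) in the cell
    containing x; 0.5 means 'no information'."""
    for clo, chi, s, b in cells:
        if np.all(x >= clo) and np.all(x <= chi):
            n = len(s) + len(b)
            return len(s) / n if n else 0.5
    return 0.5

# Toy model: two overlapping 2-D Gaussians for signal and background.
rng = np.random.default_rng(0)
sig = rng.normal(1.0, 1.0, size=(5000, 2))
bkg = rng.normal(-1.0, 1.0, size=(5000, 2))
cells = build_foam(sig, bkg, np.full(2, -6.0), np.full(2, 6.0))
print(discriminant(cells, np.array([0.8, 0.8])))   # leans signal-like
```

A query costs one cell lookup over a fixed number of cells, whereas the original range-searching PDE must scan the training events around each query point, which is the classification-time saving the abstract refers to.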
Classification with the nearest neighbor rule in general finite dimensional spaces: necessary and sufficient conditions
Given an $n$-sample of random vectors $(X_i, Y_i)_{1 \le i \le n}$ whose
joint law is unknown, the long-standing problem of supervised classification
aims to \textit{optimally} predict the label $Y$ of a given new observation
$X$. In this context, the $k$-nearest neighbor rule is a popular, flexible and
intuitive method in non-parametric situations.
Even though this algorithm is commonly used in the machine learning and
statistics communities, less is known about its prediction ability in general
finite dimensional spaces, especially when the support of the density of the
observations is $\mathbb{R}^d$. This paper is devoted to the study of the
statistical properties of the $k$-nearest neighbor rule in various situations. In
particular, attention is paid to the marginal law of $X$, as well as the
smoothness and margin properties of the \textit{regression function}
$\eta(x) = \mathbb{E}[Y \mid X = x]$. We identify two necessary and sufficient conditions to
obtain uniform consistency rates of classification and to derive sharp
estimates in the case of the $k$-nearest neighbor rule. Some numerical experiments
are proposed at the end of the paper to help illustrate the discussion.

Comment: 53 pages, 3 figures
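As a concrete reference point, here is a minimal sketch of the plug-in $k$-nearest neighbor classifier discussed above, assuming binary labels $Y \in \{0, 1\}$ and Euclidean distance (function and variable names are illustrative):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=5):
    """k-nearest neighbor rule: estimate the regression function
    eta(x) = E[Y | X = x] by averaging the labels of the k training
    points closest to x, then threshold at 1/2 (plug-in classifier)."""
    dist = np.linalg.norm(X_train - x, axis=1)   # distances to all samples
    nearest = np.argsort(dist)[:k]               # indices of the k nearest
    eta_hat = y_train[nearest].mean()            # local label average
    return int(eta_hat > 0.5)

# Toy example: the label depends on the sign of the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
print(knn_classify(X, y, np.array([1.0, 0.0, 0.0]), k=7))   # -> 1
```

The consistency rates studied in the paper concern exactly this estimator: how fast the local average concentrates around $\eta(x)$ depends on the smoothness of $\eta$, the margin behaviour near $\eta = 1/2$, and the marginal law of $X$.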
Asymptotic Generalization Bound of Fisher's Linear Discriminant Analysis
Fisher's linear discriminant analysis (FLDA) is an important dimension
reduction method in statistical pattern recognition. It has been shown that
FLDA is asymptotically Bayes optimal under the homoscedastic Gaussian
assumption. However, this classical result has the following two major
limitations: 1) it holds only for a fixed dimensionality $D$, and thus does not
apply when $D$ and the training sample size $N$ are proportionally large; 2) it
does not provide a quantitative description of how the generalization ability
of FLDA is affected by $D$ and $N$. In this paper, we present an asymptotic
generalization analysis of FLDA based on random matrix theory, in a setting
where both $D$ and $N$ increase and $D/N \longrightarrow \gamma \in [0, 1)$. The
obtained lower bound of the generalization discrimination power overcomes both
limitations of the classical result, i.e., it is applicable when $D$ and $N$
are proportionally large and provides a quantitative description of the
generalization ability of FLDA in terms of the ratio $\gamma = D/N$ and the
population discrimination power. Besides, the discrimination power bound also
leads to an upper bound on the generalization error of binary classification
with FLDA.
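For reference, here is a minimal sketch of the two-class FLDA rule under the homoscedastic Gaussian assumption, with equal priors assumed for simplicity (names are illustrative):

```python
import numpy as np

def fit_flda(X0, X1):
    """Fisher's linear discriminant: project onto w = S^{-1} (mu1 - mu0),
    where S is the pooled within-class covariance; with equal priors the
    threshold sits halfway between the projected class means."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    n0, n1 = len(X0), len(X1)
    S = ((n0 - 1) * np.cov(X0, rowvar=False)
         + (n1 - 1) * np.cov(X1, rowvar=False)) / (n0 + n1 - 2)
    w = np.linalg.solve(S, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1)
    return w, b

def flda_predict(w, b, X):
    return (X @ w + b > 0).astype(int)   # 1 = class of X1

# Toy data: D = 10 dimensions, shared covariance, shifted means.
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(300, 10))
X1 = rng.normal(0.5, 1.0, size=(300, 10))
w, b = fit_flda(X0, X1)
print(flda_predict(w, b, X1[:5]))   # mostly 1s
```

The regime the abstract studies is visible here: when $D$ grows in proportion to $N$, the pooled covariance estimate $S$ becomes ill-conditioned and the solve step amplifies estimation noise, which is why the classical fixed-$D$ optimality result stops being informative.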
Reliable ABC model choice via random forests
Approximate Bayesian computation (ABC) methods provide an elaborate approach
to Bayesian inference on complex models, including model choice. Both
theoretical arguments and simulation experiments indicate, however, that model
posterior probabilities may be poorly evaluated by standard ABC techniques. We
propose a novel approach based on a machine learning tool named random forests
to conduct selection among the highly complex models covered by ABC algorithms.
We thus modify the way Bayesian model selection is both understood and
operated, in that we rephrase the inferential goal as a classification problem,
first predicting the model that best fits the data with random forests and
postponing the approximation of the posterior probability of the predicted MAP
for a second stage also relying on random forests. Compared with earlier
implementations of ABC model choice, the ABC random forest approach offers
several potential improvements: (i) it often has a larger discriminative power
among the competing models, (ii) it is more robust against the number and
choice of statistics summarizing the data, (iii) the computing effort is
drastically reduced (with a gain in computational efficiency of a factor of at least fifty),
and (iv) it includes an approximation of the posterior probability of the
selected model. The call to random forests will undoubtedly extend the range of
size of datasets and complexity of models that ABC can handle. We illustrate
the power of this novel methodology by analyzing controlled experiments as well
as genuine population genetics datasets. The proposed methodologies are
implemented in the R package abcrf, available on CRAN.

Comment: 39 pages, 15 figures, 6 tables
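Below is a minimal sketch of the classification stage of ABC model choice via random forests, using scikit-learn in place of the abcrf package. The candidate models, prior, and summary statistics are toy assumptions, and the second stage (approximating the posterior probability of the selected model with a regression forest) is omitted:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

def summaries(x):
    """Summary statistics fed to the forest (mean, sd, third moment)."""
    return np.array([x.mean(), x.std(), np.mean((x - x.mean()) ** 3)])

# Reference table: simulate datasets under each candidate model.
# Toy stand-ins: model 0 = Gaussian noise, model 1 = Laplace noise.
n_ref, n_obs = 2000, 200
S, labels = [], []
for _ in range(n_ref):
    m = int(rng.integers(2))                      # model index from its prior
    theta = rng.normal(0.0, 1.0)                  # parameter from its prior
    x = (rng.normal(theta, 1.0, n_obs) if m == 0
         else rng.laplace(theta, 1.0, n_obs))
    S.append(summaries(x))
    labels.append(m)

# First stage: predict the best-fitting model from the summaries.
rf = RandomForestClassifier(n_estimators=500).fit(np.array(S), labels)
x_obs = rng.laplace(0.3, 1.0, n_obs)              # pretend observed data
print("selected model:", rf.predict([summaries(x_obs)])[0])
```

Note that the forest's vote fractions are not the model posterior probabilities; that is precisely why the paper defers the posterior probability of the predicted MAP to a second, separate random-forest stage.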