
    Feature Selection via Binary Simultaneous Perturbation Stochastic Approximation

    Feature selection (FS) has become an indispensable task in dealing with today's highly complex pattern recognition problems with massive numbers of features. In this study, we propose a new wrapper approach for FS based on binary simultaneous perturbation stochastic approximation (BSPSA). This pseudo-gradient descent stochastic algorithm starts with an initial feature vector and moves toward the optimal feature vector via successive iterations. In each iteration, the current feature vector's individual components are perturbed simultaneously by random offsets from a qualified probability distribution. We present computational experiments on datasets with numbers of features ranging from a few dozen to thousands using three widely used classifiers as wrappers: nearest neighbor, decision tree, and linear support vector machine. We compare our methodology against the full set of features as well as a binary genetic algorithm and sequential FS methods, using cross-validated classification error rate and AUC as the performance criteria. Our results indicate that features selected by BSPSA compare favorably to alternative methods in general, and that BSPSA can yield superior feature sets for datasets with tens of thousands of features by examining an extremely small fraction of the solution space. We are not aware of any other wrapper FS methods that are computationally feasible with good convergence properties for such large datasets. Comment: This is the Istanbul Sehir University Technical Report #SHR-ISE-2016.01. A short version of this report has been accepted for publication at Pattern Recognition Letters.
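The iteration described in the abstract above (perturb all components at once, evaluate two candidate feature sets, step along the estimated gradient) can be sketched roughly as follows. This is a toy illustration under assumptions, not the authors' implementation: the gain schedule, the parameter values, and the `toy_loss` function (a stand-in for a cross-validated wrapper error) are all hypothetical.

```python
import random

def bspsa(loss, p, iters=300, a=0.2, c=0.1, seed=0):
    """Toy binary SPSA: search for x in {0,1}^p with a small loss(x)."""
    rng = random.Random(seed)
    clip = lambda v: min(1.0, max(0.0, v))
    to_bits = lambda w: [1 if v >= 0.5 else 0 for v in w]

    w = [0.5] * p                      # relaxed feature-inclusion weights in [0, 1]
    best_x = to_bits(w)
    best_y = loss(best_x)
    for k in range(iters):
        a_k = a / (k + 1) ** 0.6       # decaying step size (assumed schedule)
        # Perturb every component simultaneously with a random +/-1 offset.
        delta = [rng.choice((-1, 1)) for _ in range(p)]
        x_plus = to_bits([clip(wi + c * d) for wi, d in zip(w, delta)])
        x_minus = to_bits([clip(wi - c * d) for wi, d in zip(w, delta)])
        y_plus, y_minus = loss(x_plus), loss(x_minus)
        # Remember the best binary vector evaluated so far.
        for x, y in ((x_plus, y_plus), (x_minus, y_minus)):
            if y < best_y:
                best_x, best_y = x, y
        # Pseudo-gradient from two loss evaluations; 1/d equals d for d = +/-1.
        scale = (y_plus - y_minus) / (2 * c)
        w = [clip(wi - a_k * scale * d) for wi, d in zip(w, delta)]
    return best_x, best_y

# Hypothetical stand-in for a wrapper's error: fraction of features whose
# inclusion flag disagrees with a hidden "ideal" mask.
target = [1, 0, 1, 1, 0, 0, 1, 0]

def toy_loss(x):
    return sum(xi != ti for xi, ti in zip(x, target)) / len(target)

best_x, best_y = bspsa(toy_loss, p=len(target))
print(best_x, best_y)
```

Each iteration costs only two loss evaluations regardless of the number of features, which is what makes this style of search attractive for wrappers whose loss is an expensive cross-validation run.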

    Construction and evaluation of classifiers for forensic document analysis

    In this study we illustrate a statistical approach to questioned document examination. Specifically, we consider the construction of three classifiers that predict the writer of a sample document based on categorical data. To evaluate these classifiers, we use a data set with a large number of writers and a small number of writing samples per writer. Since the resulting classifiers were found to have near-perfect accuracy using leave-one-out cross-validation, we propose a novel Bayesian-based cross-validation method for evaluating the classifiers. Comment: Published at http://dx.doi.org/10.1214/10-AOAS379 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Boosting Nearest Neighbor Classifiers for Multiclass Recognition

    This paper introduces an algorithm that uses boosting to learn a distance measure for multiclass k-nearest neighbor classification. Given a family of distance measures as input, AdaBoost is used to learn a weighted distance measure that is a linear combination of the input measures. The proposed method can be seen both as a novel way to learn a distance measure from data and as a novel way to apply boosting to multiclass recognition problems that does not require output codes. In our approach, multiclass recognition of objects is reduced to a single binary recognition task, defined on triples of objects. Preliminary experiments with eight UCI datasets yield no clear winner among our method, boosting using output codes, and k-nn classification using an unoptimized distance measure. Our algorithm did achieve lower error rates on some of the datasets, which indicates that, in some domains, it may lead to better results than existing methods.
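To make the idea of a weighted distance measure concrete, the sketch below runs k-nearest neighbor classification with a linear combination of simple input measures. It is only an illustration: the weights are set by hand rather than learned with AdaBoost on triples as in the paper, and the helper names (`abs_diff`, `combined_distance`, `knn_predict`) and the data are hypothetical.

```python
from collections import Counter

def abs_diff(i):
    """Return a distance measure that compares only coordinate i."""
    return lambda u, v: abs(u[i] - v[i])

def combined_distance(measures, weights):
    """Weighted linear combination of the given distance measures."""
    return lambda u, v: sum(w * d(u, v) for w, d in zip(weights, measures))

def knn_predict(query, data, labels, dist, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(range(len(data)), key=lambda j: dist(query, data[j]))[:k]
    return Counter(labels[j] for j in nearest).most_common(1)[0][0]

# Toy data: coordinate 0 separates the classes, coordinate 1 does not.
data = [(0.0, 9.0), (0.2, 0.0), (0.1, 8.0), (1.0, 5.0), (1.2, 5.2), (0.9, 4.9)]
labels = ["a", "a", "a", "b", "b", "b"]
measures = [abs_diff(0), abs_diff(1)]
query = (0.3, 5.0)

# Hand-picked weights standing in for learned ones (assumption).
pred_weighted = knn_predict(query, data, labels, combined_distance(measures, [1.0, 0.0]))
pred_uniform = knn_predict(query, data, labels, combined_distance(measures, [1.0, 1.0]))
print(pred_weighted, pred_uniform)  # prints "a b"
```

With all weight on coordinate 0 the three nearest neighbors are the class "a" points; with equal weights the uninformative coordinate dominates and the vote flips to "b", which shows why the choice of weights matters.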

    Population Synthesis via k-Nearest Neighbor Crossover Kernel

    The recent development of multi-agent simulations brings about a need for population synthesis. This is the task of reconstructing the entire population from a sampling survey of limited size (1% or so), supplying the initial conditions from which simulations begin. This paper presents a new kernel density estimator for this task. Our method is an analogue of the classical Breiman-Meisel-Purcell estimator, but employs novel techniques that harness the many degrees of freedom required to model high-dimensional, nonlinearly correlated datasets: the crossover kernel, the k-nearest neighbor restriction of the kernel construction set, and the bagging of kernels. Its performance as a statistical estimator is examined through real and synthetic datasets. We provide an "optimization-free" parameter selection rule for our method, a theory of how our method works, and a computational cost analysis. To demonstrate its usefulness as a population synthesizer, our method is applied to a household synthesis task for an urban micro-simulator. Comment: 10 pages, 4 figures, IEEE International Conference on Data Mining (ICDM) 201