5 research outputs found

    Boosting en el modelo de aprendizaje PAC

    This paper reviews the idea of boosting in the PAC learning model. It also reviews the first practical boosting method, adaptive boosting (AdaBoost), detailing the theoretical guarantees on error convergence and exploring the important concept of margin.
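
To make the reweighting idea concrete, here is a minimal sketch of AdaBoost in plain Python. It is not taken from the reviewed paper: the decision-stump weak learner, the toy data set, and all function names (`best_stump`, `adaboost`, `margins`) are assumptions chosen for illustration.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Decision stump: returns polarity if x[feature] > threshold, else -polarity."""
    return polarity if x[feature] > threshold else -polarity

def best_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        values = sorted({x[f] for x in X})
        # Candidate thresholds: one below the minimum, then midpoints between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in thresholds:
            for p in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, f, t, p) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, p)
    return best

def adaboost(X, y, rounds):
    """Labels must be +1 or -1. Returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, f, t, p = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)   # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, p))
        # Increase the weight of misclassified examples, decrease the rest, renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, t, p))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def score(ensemble, x):
    return sum(a * stump_predict(x, f, t, p) for a, f, t, p in ensemble)

def predict(ensemble, x):
    return 1 if score(ensemble, x) >= 0 else -1

def margins(ensemble, X, y):
    """Normalised margins y_i * f(x_i) / sum(alpha), each in [-1, 1]."""
    total = sum(a for a, _, _, _ in ensemble)
    return [yi * score(ensemble, xi) / total for xi, yi in zip(X, y)]

# Toy 1-D problem (assumed data) that no single stump can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
```

On this toy problem a single stump misclassifies two points, while three rounds are enough to fit all six; `margins(ensemble, X, y)` then reports how confidently each training point is classified.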

    Investigating Randomised Sphere Covers in Supervised Learning

    In this thesis, we thoroughly investigate a simple Instance Based Learning (IBL) classifier known as Sphere Cover. We propose a simple Randomized Sphere Cover Classifier (αRSC) and evaluate its classification performance on several datasets. In addition, we analyse the generalization error of the proposed classifier using bias/variance decomposition. A Sphere Cover Classifier may be described in terms of a compression scheme, a framework that attributes high generalization performance to data compression. We investigate the compression capacity of αRSC using a sample compression bound. The compression scheme prompted us to search for new compressibility methods for αRSC; to that end, we used a Gaussian kernel to investigate further data compression.
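
The abstract does not spell out how the cover is built, so the following is only a sketch of a generic randomised sphere cover, not the thesis's αRSC. It assumes that each sphere is centred on a randomly chosen uncovered training point, that its radius is the distance to the nearest training point of another class, and that a query takes the class of the sphere whose boundary is closest. All names are hypothetical.

```python
import math
import random

def build_cover(X, y, seed=0):
    """Return a list of spheres (center, radius, label) covering the training set."""
    rng = random.Random(seed)
    uncovered = list(range(len(X)))
    spheres = []
    while uncovered:
        c = rng.choice(uncovered)          # random uncovered point becomes a center
        # Radius: distance to the nearest training point with a different label.
        radius = min((math.dist(X[c], X[j]) for j in range(len(X)) if y[j] != y[c]),
                     default=math.inf)
        spheres.append((X[c], radius, y[c]))
        # Drop the center and every same-class point strictly inside the sphere.
        uncovered = [i for i in uncovered
                     if i != c and not (y[i] == y[c] and math.dist(X[c], X[i]) < radius)]
    return spheres

def predict(spheres, x):
    """Label of the sphere whose boundary is nearest (negative value = inside)."""
    return min(spheres, key=lambda s: math.dist(x, s[0]) - s[1])[2]

# Two well-separated clusters (assumed data).
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y = [0, 0, 0, 1, 1, 1]
spheres = build_cover(X, y)
```

Here six training points are summarised by two spheres, which is the sense in which such a cover acts as a compressed representation of the data.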

    Genetic Algorithms for Feature Selection and Classification of Complex Chromatographic and Spectroscopic Data

    A basic methodology for analyzing large multivariate chemical data sets based on feature selection is proposed. Each chromatogram or spectrum is represented as a point in a high-dimensional measurement space. A genetic algorithm for feature selection and classification is applied to the data to identify features that optimize the separation of the classes in a plot of the two or three largest principal components of the data. A good principal component plot can only be generated using features whose variance or information is primarily about differences between classes in the data. Hence, feature subsets that maximize the ratio of between-class to within-class variance are selected by the pattern recognition genetic algorithm. Furthermore, the structure of the data set can be explored; for example, new classes can be discovered simply by tuning various parameters of the fitness function of the pattern recognition genetic algorithm. The proposed method has been validated on a wide range of data. A two-step procedure for pattern recognition analysis of spectral data has been developed. First, wavelets are used to denoise and deconvolute spectral bands by decomposing each spectrum into wavelet coefficients, which represent the sample's constituent frequencies. Second, the pattern recognition genetic algorithm is used to identify wavelet coefficients characteristic of the class. This method was employed in several studies involving spectral library searching. In one study, a search pre-filter to detect the presence of carboxylic acids from vapor phase infrared spectra, a problem that had previously eluded prominent researchers, was successfully formulated and validated. In another study, the same approach was used to develop a pattern recognition assisted infrared library searching technique to determine the model, manufacturer, and year of the vehicle from which a clear coat paint smear originated.
The pattern recognition genetic algorithm has also been used to develop a potential method to identify molds in indoor environments using volatile organic compounds. A distinct profile indicative of microbial volatile organic compounds was developed from air sampling data that could be readily differentiated from the blank for both high mold count and moderate mold count exposure samples. The utility of the pattern recognition genetic algorithm for discovery of biomarker candidates from genomic and proteomic data sets has also been shown.
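
The selection criterion mentioned above, the ratio of between-class to within-class variance, can be sketched as a scoring function over feature subsets. This is an illustration only: the abstract does not give the fitness function of the pattern recognition genetic algorithm, and the sketch swaps the genetic search for an exhaustive search over small subsets. The data and the names `scatter_ratio` and `best_subset` are assumptions.

```python
from itertools import combinations

def scatter_ratio(X, y, features):
    """Between-class scatter divided by within-class scatter, summed over the chosen features."""
    classes = sorted(set(y))
    between = within = 0.0
    for f in features:
        column = [row[f] for row in X]
        grand_mean = sum(column) / len(column)
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            class_mean = sum(values) / len(values)
            between += len(values) * (class_mean - grand_mean) ** 2
            within += sum((v - class_mean) ** 2 for v in values)
    return between / within if within > 0 else float("inf")

def best_subset(X, y, size):
    """Exhaustive stand-in for the genetic search: try every subset of the given size."""
    n_features = len(X[0])
    return max(combinations(range(n_features), size),
               key=lambda subset: scatter_ratio(X, y, subset))

# Assumed toy data: feature 0 separates the classes, features 1 and 2 are mostly noise.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 3.0, 7.0],
    [0.9, 4.0, 4.0],
    [5.0, 4.5, 3.0],
    [5.1, 3.5, 6.0],
    [4.8, 4.0, 5.0],
]
y = [0, 0, 0, 1, 1, 1]
```

Exhaustive search is only feasible for a handful of features; with thousands of spectral variables a stochastic search such as a genetic algorithm becomes the practical choice, which is the setting the abstract describes.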

    Variable Selection to Improve Classification in Structure-activity Studies and Spectroscopic Analysis

    A genetic algorithm for variable selection to improve classification is explored and validated on a wide range of data. In one study, 147 tetralin and indan musks and nonmusks, compiled from the literature to investigate the relationship between molecular structure and musk odor quality, were correctly classified by 45 molecular descriptors identified by the pattern recognition GA, which revealed an asymmetric data structure. A 3-layer feed-forward neural network trained by back propagation was used to develop a discriminant that correctly classified all of the compounds in the training set as musk or nonmusk. The neural network was successfully validated using an external prediction set of 37 compounds. In another study, 172 tetralin-, indan- and isochroman-like compounds were combed from the published literature to investigate the relationship between chemical structure and musk odor quality. The 20 molecular structural descriptors selected by the pattern recognition GA yielded a discriminant that was successfully validated using an external validation set consisting of 19 compounds. A third study describes the development of a prototype pattern recognition library search system for the infrared spectral libraries of the paint data query database. The system is intended to improve discrimination capability and to permit quantification of discriminant power for automotive paint comparisons involving the original equipment manufacturer. It consists of two separate but interrelated components: search prefilters that cull the library spectra to a specific assembly plant, and a cross correlation library search algorithm that uses both forward and backward searching to identify the year, line and model of the unknown in the spectral set identified by the search prefilters.
The genetic algorithm was able to identify spectral variables from the clear coat, surfacer-primer and e-coat layers of the original manufacturer's automotive paint that were characteristic of the assembly plant of the vehicle.

    On the dynamics of boosting

    In order to understand AdaBoost's dynamics, especially its ability to maximize margins, we derive an associated simplified nonlinear iterated map and analyze its behavior in low-dimensional cases. We find stable cycles for these cases, which can explicitly be used to solve for AdaBoost's output. By considering AdaBoost as a dynamical system, we are able to prove Rätsch and Warmuth's conjecture that AdaBoost may fail to converge to a maximal-margin combined classifier when given a 'nonoptimal' weak learning algorithm. AdaBoost is known to be a coordinate descent method, but other known algorithms that explicitly aim to maximize the margin (such as AdaBoost* and arc-gv) are not. We consider a differentiable function for which coordinate ascent will yield a maximum margin solution. We then make a simple approximation to derive a new boosting algorithm whose updates are slightly more aggressive than those of arc-gv.
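
For readers unfamiliar with the term, the margin of a combined classifier is usually defined as follows. The notation is an assumption, since the abstract introduces none: weak classifiers $h_t$ take values in $\{-1,+1\}$, they carry nonnegative coefficients $\alpha_t$, and the training examples are $(x_i, y_i)$ with $y_i \in \{-1,+1\}$.

```latex
% Combined classifier and normalised margin of one training example
f(x) = \sum_{t=1}^{T} \alpha_t h_t(x),
\qquad
\operatorname{margin}(x_i, y_i) = \frac{y_i \, f(x_i)}{\sum_{t=1}^{T} \alpha_t} \in [-1, 1].

% Margin of the combined classifier: its worst-case example
\mu(\alpha) = \min_{i} \frac{y_i \sum_{t} \alpha_t h_t(x_i)}{\sum_{t} \alpha_t}.

% A maximal-margin combined classifier attains
\rho = \max_{\alpha \ge 0,\ \alpha \ne 0} \mu(\alpha).
```

Under this definition, the conjecture proved in the paper says that, for some weak learning algorithms, the margin $\mu$ reached by AdaBoost can stay strictly below $\rho$.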