
    Determination of the total acid number (TAN) of used mineral oils in aviation engines by FTIR using regression models

    The total acid number (TAN) is considered an important indicator of the quality of used oils. TAN is conventionally determined by potentiometric titration, which is time-consuming and requires solvents. A more convenient approach determines TAN from infrared (IR) spectral data combined with multivariate regression models. Predictive models for TAN were developed using IR data measured from ashless dispersant oils formulated for aviation piston engines (SAE 50). Several techniques were compared, including Projection Pursuit Regression (PPR), Partial Least Squares, Support Vector Machines, linear models and Random Forests (RF). The methodology used five-fold cross-validation to select the best model, after which a full error measure was taken over the whole dataset. Backward variable selection was used, and 25 highly relevant variables were extracted. RF provided an acceptable modelling technique, with grouped-dataset predictions that allowed transformations to be performed that fitted the measured values. A hybrid method treating groups of bands as features was used for modelling, and an innovative mechanism for wider feature selection based on a genetic algorithm was implemented; this method showed better performance than the other methodologies. The RMSE and MAE values obtained in validation for the PPR model were 0.759 and 0.359, respectively.
    The authors would like to thank Roland Tones of the Universidad Metropolitana for his collaboration in oil sample processing. BLDR acknowledges financial support from the Venoco Company. The authors also thank the Universidad Politecnica de Madrid for granting access to the CESVIMA (http://www.cesvima.upm.es/) HPC infrastructure. We would also like to thank the author Beatriz Leal de Rivas (in memoriam) for her efforts to form this team of researchers from different areas of expertise; we dedicate this work to her loving memory.
    Leal De-Rivas, BC.; Vivancos, J.; Ordieres Meré, J.; Capuz-Rizo, SF. (2017). Determination of the total acid number (TAN) of used mineral oils in aviation engines by FTIR using regression models. Chemometrics and Intelligent Laboratory Systems. 160:32-39. doi:10.1016/j.chemolab.2016.10.015
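    The modelling pipeline described above (spectral features in, five-fold cross-validated regression out) can be sketched as follows. The synthetic "spectra", the grid size, and all hyperparameters here are illustrative assumptions standing in for the paper's FTIR dataset and tuned settings.

```python
# Sketch: five-fold cross-validation of a Random Forest TAN predictor
# on synthetic "spectral" data. Everything here is a stand-in for the
# paper's real FTIR measurements and model settings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_wavenumbers = 120, 200            # hypothetical FTIR grid
X = rng.normal(size=(n_samples, n_wavenumbers))
# Hypothetical target: TAN driven by 25 "relevant" spectral variables
tan = X[:, :25].sum(axis=1) * 0.05 + rng.normal(scale=0.1, size=n_samples)

model = RandomForestRegressor(n_estimators=300, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, tan, cv=cv,
                         scoring="neg_root_mean_squared_error")
print("CV RMSE: %.3f +/- %.3f" % (-scores.mean(), scores.std()))
```

    The same scaffold works for the other model families the paper compares (PLS, SVM, linear models): swap the estimator and reuse the cross-validation split so the error measures are comparable.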

    PETER HALL'S WORK ON HIGH-DIMENSIONAL DATA AND CLASSIFICATION

    In this article, I summarise Peter Hall’s contributions to high-dimensional data, including their geometric representations and variable selection methods based on ranking. I also discuss his work on classification problems, concluding with some personal reflections on my own interactions with him. This article complements [Ann. Statist. 44 (2016) 1821–1836; Ann. Statist. 44 (2016) 1837–1853; Ann. Statist. 44 (2016) 1854–1866 and Ann. Statist. 44 (2016) 1867–1887], which focus on other aspects of Peter’s research. Supported by an EPSRC Early Career Fellowship and a Philip Leverhulme prize.

    Random projections: data perturbation for classification problems

    Random projections offer an appealing and flexible approach to a wide range of large-scale statistical problems. They are particularly useful in high-dimensional settings, where we have many covariates recorded for each observation. In classification problems there are two general techniques using random projections. The first involves many projections in an ensemble: the idea is to aggregate the results of applying different random projections, with the aim of achieving superior statistical accuracy. The second class of methods includes hashing and sketching techniques, which are straightforward ways to reduce the complexity of a problem, often with a huge computational saving, while approximately preserving statistical efficiency. Comment: 24 pages, 4 figures.
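    The first technique mentioned above, an ensemble of random projections, can be sketched in a few lines: project the data many times, fit a simple classifier on each low-dimensional projection, and aggregate by majority vote. The dataset, the base classifier, and the ensemble size are illustrative choices, not the paper's.

```python
# Sketch: ensemble of random projections for classification.
# Project 100-dimensional data down to 5 dimensions many times,
# fit a logistic regression on each projection, and majority-vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.random_projection import GaussianRandomProjection

X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=10, random_state=0)

votes = np.zeros((len(y), 2))
for seed in range(20):                        # 20 independent projections
    proj = GaussianRandomProjection(n_components=5, random_state=seed)
    Xp = proj.fit_transform(X)
    clf = LogisticRegression().fit(Xp, y)
    votes[np.arange(len(y)), clf.predict(Xp)] += 1

y_hat = votes.argmax(axis=1)                  # majority vote over projections
print("training accuracy:", (y_hat == y).mean())
```

    Each projected classifier is individually weak, since five random directions capture only part of the signal, but aggregating over projections recovers much of the lost accuracy.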

    On the rate of convergence of the bagged nearest neighbour estimator

    In this communication, we are interested in the estimation of the …

    Consistency of random forests

    Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45 (2001) 5--32] that combines several randomized decision trees and aggregates their predictions by averaging. Despite its wide usage and outstanding practical performance, little is known about the mathematical properties of the procedure. This disparity between theory and practice originates in the difficulty of simultaneously analyzing both the randomization process and the highly data-dependent tree structure. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's [Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity.
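    The setting of the consistency result is additive regression, where the target is a sum of univariate component functions plus noise. A toy check of that setting, with made-up component functions and hyperparameters, looks like this:

```python
# Sketch: a random forest fit to an additive regression model
# y = m1(x1) + m2(x2) + m3(x3) + noise, with illustrative components.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))
y = (np.sin(X[:, 0]) + X[:, 1] ** 2 + X[:, 2]
     + rng.normal(scale=0.05, size=1000))

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Evaluate against the noiseless additive target on fresh points
X_test = rng.uniform(size=(200, 3))
y_test = np.sin(X_test[:, 0]) + X_test[:, 1] ** 2 + X_test[:, 2]
mse = np.mean((forest.predict(X_test) - y_test) ** 2)
print("test MSE:", mse)
```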

    Analysis of a Random Forests Model

    Random forests are a scheme proposed by Leo Breiman in the 2000s for building a predictor ensemble with a set of decision trees that grow in randomly selected subspaces of data. Despite growing interest and practical use, there has been little exploration of the statistical properties of random forests, and little is known about the mathematical forces driving the algorithm. In this paper, we offer an in-depth analysis of a random forests model suggested by Breiman in [Bre04], which is very close to the original algorithm. We show in particular that the procedure is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.
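    The sparsity claim, that behaviour is driven by the strong features rather than the noise variables, can be illustrated empirically: add many pure-noise covariates and check that the forest's feature importances concentrate on the strong ones. This is only a sketch of the intuition; the paper's analysis concerns a simplified forest model, not scikit-learn's implementation.

```python
# Sketch: feature importances concentrate on strong features even when
# most covariates are pure noise. Data and coefficients are made up.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # 2 strong + 18 noise features
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
imp = forest.feature_importances_
print("share of importance on the 2 strong features:", imp[:2].sum())
```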

    Risk estimation and risk prediction using machine-learning methods

    After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s00439-012-1194-y) contains supplementary material, which is available to authorized users.
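    The distinction drawn above, a probability estimation rule as opposed to a hard classification rule, can be sketched by fitting a classifier and scoring its class-probability estimates with the Brier score. Synthetic data stands in for a genotype matrix; nothing here reproduces the paper's rheumatoid arthritis analysis or its chosen algorithms.

```python
# Sketch: a probability-estimation rule evaluated with the Brier score.
# make_classification stands in for a (samples x variants) genotype matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=50,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]     # estimated "disease" probabilities
print("Brier score:", brier_score_loss(y_te, p))
```

    A hard classifier is evaluated by misclassification rate; a probability estimation rule needs a proper scoring rule such as the Brier score, which rewards well-calibrated probabilities rather than just correct labels.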