151 research outputs found
Determination of the total acid number (TAN) of used mineral oils in aviation engines by FTIR using regression models
[EN] The total acid number (TAN) is considered an important indicator of the quality of used oils. TAN is conventionally determined by potentiometric titration, which is time-consuming and requires solvent. A more convenient approach to determining TAN is based on infrared (IR) spectral data and multivariate regression models. Predictive models for the determination of TAN have been developed using IR data measured from ashless dispersant oils formulated for aviation piston engines (SAE 50). Different techniques have been used, including Projection Pursuit Regression (PPR), Partial Least Squares, Support Vector Machines, Linear Models and Random Forest (RF). The methodology involved five-fold cross-validation to derive the best model; a full error measure was then taken over the whole dataset. Backward variable selection was used and 25 highly relevant variables were extracted. RF provided an acceptable modelling approach, with grouped dataset predictions that allowed transformations to be performed that fitted the measured values. A hybrid method considering groups of bands as features was used for modelling, and an innovative mechanism for wider feature selection based on a genetic algorithm has been implemented. This method showed better performance than the other methodologies. The RMSE and MAE values obtained in validation for the PPR model were 0.759 and 0.359, respectively.

The authors would like to thank Roland Tones of the Universidad Metropolitana for his collaboration in oil sample processing. BLDR acknowledges financial support from the Venoco Company. The authors also thank the Universidad Politecnica de Madrid for granting access to the CESVIMA (http://www.cesvima.upm.es/) HPC infrastructure.
We would also like to thank the author Beatriz Leal de Rivas (in memoriam) for her efforts to form this team of researchers from different areas of expertise, and we dedicate this work to her loving memory.

Leal De-Rivas, BC.; Vivancos, J.; Ordieres Meré, J.; Capuz-Rizo, SF. (2017). Determination of the total acid number (TAN) of used mineral oils in aviation engines by FTIR using regression models. Chemometrics and Intelligent Laboratory Systems. 160:32-39. doi:10.1016/j.chemolab.2016.10.015
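The pipeline described in the abstract (a multivariate regression model selected by five-fold cross-validation and scored by RMSE and MAE over the held-out folds) can be sketched as follows. The synthetic "spectra", the informative band positions, and the ridge regressor are illustrative assumptions standing in for the paper's actual IR data and PPR/PLS/RF models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 150 "spectra" with 40 absorbance channels,
# where TAN depends linearly on a few informative bands plus noise.
n, p = 150, 40
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[[5, 12, 25]] = [0.8, -0.5, 0.3]   # a few highly relevant "bands"
y = X @ true_coef + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X'X + lam*I)^(-1) X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Five-fold cross-validation, mirroring the model-selection protocol above.
folds = np.array_split(rng.permutation(n), 5)
residuals = []
for k in range(5):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    w = ridge_fit(X[train_idx], y[train_idx])
    residuals.append(X[test_idx] @ w - y[test_idx])

residuals = np.concatenate(residuals)
rmse = float(np.sqrt(np.mean(residuals ** 2)))
mae = float(np.mean(np.abs(residuals)))
```

Swapping the closed-form ridge step for PLS, SVM, or RF models (and wrapping the feature choice in a genetic-algorithm search) recovers the structure of the comparison reported in the paper.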
PETER HALL'S WORK ON HIGH-DIMENSIONAL DATA AND CLASSIFICATION
In this article, I summarise Peter Hall's contributions to high-dimensional data, including their geometric representations and variable selection methods based on ranking. I also discuss his work on classification problems, concluding with some personal reflections on my own interactions with him. This article complements [Ann. Statist. 44 (2016) 1821–1836; Ann. Statist. 44 (2016) 1837–1853; Ann. Statist. 44 (2016) 1854–1866 and Ann. Statist. 44 (2016) 1867–1887], which focus on other aspects of Peter's research.

Supported by an EPSRC Early Career Fellowship and a Philip Leverhulme Prize.
Random projections: data perturbation for classification problems
Random projections offer an appealing and flexible approach to a wide range
of large-scale statistical problems. They are particularly useful in
high-dimensional settings, where we have many covariates recorded for each
observation. In classification problems there are two general techniques using
random projections. The first involves many projections in an ensemble -- the
idea here is to aggregate the results after applying different random
projections, with the aim of achieving superior statistical accuracy. The
second class of methods includes hashing and sketching techniques, which are
straightforward ways to reduce the complexity of a problem, perhaps therefore
with a huge computational saving, while approximately preserving the
statistical efficiency.

Comment: 24 pages, 4 figures
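The first, ensemble, technique described above can be sketched in a few lines: each member draws an independent random projection, classifies in the low-dimensional image, and the members' decisions are aggregated by majority vote. The nearest-centroid base classifier and the toy Gaussian data below are illustrative assumptions, not the specific estimators analysed in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy high-dimensional two-class problem: class means differ in every coordinate.
p, n_per_class = 100, 50
X = np.vstack([rng.normal(loc=-0.5, size=(n_per_class, p)),
               rng.normal(loc=+0.5, size=(n_per_class, p))])
y = np.array([0] * n_per_class + [1] * n_per_class)

def project_and_classify(X_train, y_train, X_test, rng, d=5):
    """One ensemble member: random Gaussian projection to d dimensions,
    then nearest-class-centroid classification in the projected space."""
    P = rng.normal(size=(X_train.shape[1], d)) / np.sqrt(d)
    Z_train, Z_test = X_train @ P, X_test @ P
    c0 = Z_train[y_train == 0].mean(axis=0)
    c1 = Z_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(Z_test - c0, axis=1)
    d1 = np.linalg.norm(Z_test - c1, axis=1)
    return (d1 < d0).astype(int)

# Aggregate B independent random projections by majority vote.
X_test = np.vstack([rng.normal(loc=-0.5, size=(20, p)),
                    rng.normal(loc=+0.5, size=(20, p))])
y_test = np.array([0] * 20 + [1] * 20)
B = 51
votes = sum(project_and_classify(X, y, X_test, rng) for _ in range(B))
y_pred = (votes > B / 2).astype(int)
accuracy = float((y_pred == y_test).mean())
```

Averaging over many projections smooths out the variability introduced by any single unlucky projection, which is the statistical rationale for the ensemble approach.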
On the convergence rate of the bagged nearest neighbour estimator
In this communication, we are interested in the estimation of the …
Consistency of random forests
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45
(2001) 5--32] that combines several randomized decision trees and aggregates
their predictions by averaging. Despite its wide usage and outstanding
practical performance, little is known about the mathematical properties of the
procedure. This disparity between theory and practice originates in the
difficulty to simultaneously analyze both the randomization process and the
highly data-dependent tree structure. In the present paper, we take a step
forward in forest exploration by proving a consistency result for Breiman's
[Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive
regression models. Our analysis also sheds an interesting light on how random
forests can nicely adapt to sparsity.

1. Introduction. Random forests are an
ensemble learning method for classification and regression that constructs a
number of randomized decision trees during the training phase and predicts by
averaging the results. Since its publication in the seminal paper of Breiman
(2001), the procedure has become a major data analysis tool, that performs well
in practice in comparison with many standard methods. What has greatly
contributed to the popularity of forests is the fact that they can be applied
to a wide range of prediction problems and have few parameters to tune. Aside
from being simple to use, the method is generally recognized for its accuracy
and its ability to deal with small sample sizes, high-dimensional feature
spaces and complex data structures. The random forest methodology has been
successfully involved in many practical problems, including air quality
prediction (winning code of the EMC data science global hackathon in 2012, see
http://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al.
(2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], …
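The basic mechanism described in the abstract (randomized trees fit on bootstrap samples, with predictions aggregated by averaging) can be illustrated with a deliberately minimal variant. Depth-one trees (stumps) on a single randomly drawn feature stand in for Breiman's full CART trees, and the synthetic target mirrors the additive regression setting studied in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Additive regression data: y = x0 + sin(pi * x1) + noise, plus noise features.
n, p = 300, 5
X = rng.uniform(-1, 1, size=(n, p))
y = X[:, 0] + np.sin(np.pi * X[:, 1]) + 0.1 * rng.normal(size=n)

def fit_stump(X, y, feature):
    """Best single split on one feature, minimizing squared error."""
    xs = X[:, feature]
    best = (np.inf, 0.0, y.mean(), y.mean())
    for t in np.quantile(xs, np.linspace(0.1, 0.9, 9)):
        left, right = y[xs <= t], y[xs > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return feature, best[1], best[2], best[3]

def forest_predict(X_train, y_train, X_test, n_trees=100):
    """Average of randomized stumps, each fit on a bootstrap sample with a
    randomly chosen candidate feature (a toy version of the 'mtry' mechanism)."""
    preds = np.zeros(len(X_test))
    for _ in range(n_trees):
        idx = rng.integers(0, len(X_train), len(X_train))   # bootstrap resample
        feat = rng.integers(0, X_train.shape[1])            # random feature
        f, t, lmean, rmean = fit_stump(X_train[idx], y_train[idx], feat)
        preds += np.where(X_test[:, f] <= t, lmean, rmean)
    return preds / n_trees

pred = forest_predict(X, y, X)
mse = float(np.mean((pred - y) ** 2))
baseline = float(np.mean((y - y.mean()) ** 2))   # variance of y
```

Even this crude ensemble improves on the constant baseline, while the data-dependent splits and the bootstrap randomization together illustrate why the full procedure is hard to analyze: the two sources of randomness interact through the fitted tree structure.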
Analysis of a Random Forests Model
Random forests are a scheme proposed by Leo Breiman in the 2000s for
building a predictor ensemble with a set of decision trees that grow in
randomly selected subspaces of data. Despite growing interest and practical
use, there has been little exploration of the statistical properties of random
forests, and little is known about the mathematical forces driving the
algorithm. In this paper, we offer an in-depth analysis of a random forests
model suggested by Breiman in \cite{Bre04}, which is very close to the original
algorithm. We show in particular that the procedure is consistent and adapts to
sparsity, in the sense that its rate of convergence depends only on the number
of strong features and not on how many noise variables are present.
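A simplified forest of this kind can be sketched as a "centered" forest: each tree repeatedly splits a cell at the midpoint of a randomly chosen coordinate, and the forest averages the resulting cell means. The version below, which restricts the splits to the known strong features and uses a synthetic additive target, is only an illustration of why the rate can depend on the number of strong features rather than on the ambient dimension; it is not the exact model of the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

def centered_tree_predict(X_train, y_train, x, depth, strong, rng):
    """One centered tree: the cell containing x is refined 'depth' times by
    splitting at the midpoint of a coordinate drawn from the strong features,
    so the partition shrinks along informative directions only."""
    lo = np.zeros(X_train.shape[1])
    hi = np.ones(X_train.shape[1])
    mask = np.ones(len(X_train), dtype=bool)
    for _ in range(depth):
        j = rng.choice(strong)            # random split coordinate
        mid = (lo[j] + hi[j]) / 2         # midpoint split
        if x[j] <= mid:
            hi[j] = mid
            mask &= X_train[:, j] <= mid
        else:
            lo[j] = mid
            mask &= X_train[:, j] > mid
    cell_y = y_train[mask]
    return cell_y.mean() if len(cell_y) else y_train.mean()

# y depends only on the first two ("strong") coordinates of a 10-dim input;
# the remaining 8 coordinates are pure noise variables.
n, p = 2000, 10
X = rng.uniform(size=(n, p))
y = X[:, 0] + X[:, 1] + 0.05 * rng.normal(size=n)

x0 = np.full(p, 0.3)
forest_est = float(np.mean([
    centered_tree_predict(X, y, x0, depth=4, strong=[0, 1], rng=rng)
    for _ in range(50)
]))
truth = 0.6   # m(x0) = 0.3 + 0.3
```

Because the noise coordinates are never split, adding more of them leaves the cells, and hence the estimation error, essentially unchanged, which is the adaptivity-to-sparsity phenomenon described above.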
Risk estimation and risk prediction using machine-learning methods
After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s00439-012-1194-y) contains supplementary material, which is available to authorized users.
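A probability estimation rule of the kind discussed above can be illustrated with logistic regression fit by gradient descent and evaluated with the Brier score. The simulated 0/1/2 genotype matrix and the effect sizes below are invented stand-ins for real genome-wide association data, and the evaluation against a prevalence-only baseline is one simple choice among the metrics the paper reviews:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical genotype data: 0/1/2 minor-allele counts at 20 variants, with
# disease probability driven by the first three variants only.
n, p = 500, 20
X = rng.integers(0, 3, size=(n, p)).astype(float)
logit = -1.0 + 0.6 * X[:, 0] + 0.4 * X[:, 1] - 0.5 * X[:, 2]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

def fit_logistic(X, y, lr=0.01, steps=2000):
    """Probability estimation rule: logistic regression via gradient descent."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        prob = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (prob - y) / len(y)
    return w

w = fit_logistic(X, y)
probs = 1 / (1 + np.exp(-np.hstack([np.ones((n, 1)), X]) @ w))

# Brier score (mean squared error of the probabilities; lower is better),
# compared with the trivial rule that always predicts the disease prevalence.
brier = float(np.mean((probs - y) ** 2))
brier_baseline = float(np.mean((y.mean() - y) ** 2))
```

Thresholding `probs` turns the probability estimation rule into a classification rule, which is exactly the distinction between the two study goals drawn in the abstract.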