1,795 research outputs found

    Optimal classifier selection and negative bias in error rate estimation: An empirical study on high-dimensional prediction

    Get PDF
    In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias. In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. We then assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly. We conclude that the strategy to present only the optimal result is not acceptable, and suggest alternative approaches for properly reporting classification accuracy

    Regression on manifolds: Estimation of the exterior derivative

    Full text link
    Collinearity and near-collinearity of predictors cause difficulties when doing regression. In these cases, variable selection becomes untenable because of mathematical issues concerning the existence and numerical stability of the regression coefficients, and interpretation of the coefficients is ambiguous because gradients are not defined. Using a differential geometric interpretation, in which the regression coefficients are interpreted as estimates of the exterior derivative of a function, we develop a new method to do regression in the presence of collinearities. Our regularization scheme can improve estimation error, and it can be easily modified to include lasso-type regularization. These estimators also have simple extensions to the "large pp, small nn" context.Comment: Published in at http://dx.doi.org/10.1214/10-AOS823 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Partial Least Squares: A Versatile Tool for the Analysis of High-Dimensional Genomic Data

    Get PDF
    Partial Least Squares (PLS) is a highly efficient statistical regression technique that is well suited for the analysis of high-dimensional genomic data. In this paper we review the theory and applications of PLS both under methodological and biological points of view. Focusing on microarray expression data we provide a systematic comparison of the PLS approaches currently employed, and discuss problems as different as tumor classification, identification of relevant genes, survival analysis and modeling of gene networks

    Comparing Out-of-Sample Predictive Ability of PLS, Covariance, and Regression Models

    Get PDF
    Partial Least Squares Path Modelling (PLSPM) is a popular technique for estimating structural equation models in the social sciences, and is frequently presented as an alternative to covariance-based analysis as being especially suited for predictive modeling. While existing research on PLSPM has focused on its use in causal-explanatory modeling, this paper follows two recent papers at ICIS 2012 and 2013 in examining how PLSPM performs when used for predictive purposes. Additionally, as a predictive technique, we compare PLSPM to traditional regression methods that are widely used for predictive modelling in other disciplines. Specifically, we employ out-of-sample k-fold cross-validation to compare PLSPM to covariance-SEM and a range of a-theoretical regression techniques in a simulation study. Our results show that PLSPM offers advantages over covariance-SEM and other prediction methods

    Predicting Skin Permeability by means of Computational Approaches : Reliability and Caveats in Pharmaceutical Studies

    Get PDF
    © 2019 American Chemical Society.The skin is the main barrier between the internal body environment and the external one. The characteristics of this barrier and its properties are able to modify and affect drug delivery and chemical toxicity parameters. Therefore, it is not surprising that permeability of many different compounds has been measured through several in vitro and in vivo techniques. Moreover, many different in silico approaches have been used to identify the correlation between the structure of the permeants and their permeability, to reproduce the skin behavior, and to predict the ability of specific chemicals to permeate this barrier. A significant number of issues, like interlaboratory variability, experimental conditions, data set building rationales, and skin site of origin and hydration, still prevent us from obtaining a definitive predictive skin permeability model. This review wants to show the main advances and the principal approaches in computational methods used to predict this property, to enlighten the main issues that have arisen, and to address the challenges to develop in future research.Peer reviewedFinal Accepted Versio
    • …
    corecore