81,046 research outputs found

    No Unbiased Estimator of the Variance of K-Fold Cross-Validation

    In statistical machine learning, the standard measure of a model's accuracy is the prediction error, i.e. the expected loss on future examples. When the data distribution is unknown, this error cannot be computed directly, but several resampling methods, such as K-fold cross-validation, can be used to obtain an unbiased estimator of the prediction error. However, to compare learning algorithms one also needs to estimate the uncertainty around the cross-validation estimator, which matters because this uncertainty can be very large. The usual variance estimates for means of independent samples cannot be used, because the data are reused in forming the cross-validation estimator. The main result of this paper is that there is no universal (distribution-independent) unbiased estimator of the variance of the K-fold cross-validation estimator based only on the error measurements obtained through the cross-validation procedure. The analysis provides a theoretical understanding of why this estimation is difficult. These results generalize to other resampling methods in which data are reused for training or testing.
    Keywords: prediction error, cross-validation, multivariate variance estimators, statistical comparison of algorithms
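    The difficulty shows up already in the naive computation: the K per-fold error estimates share training data, so the usual variance formula for a mean of independent measurements does not apply to them. The sketch below is a hypothetical illustration, with arbitrary choices of data-generating process, a ridge learner from scikit-learn, and K = 5 (none of which come from the paper); it computes the K-fold estimator and the naive variance term that wrongly treats the correlated fold errors as independent.

        # Naive (biased) variance of the K-fold cross-validation estimator -- illustration only.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import KFold

        rng = np.random.default_rng(0)
        n, p, K = 200, 10, 5
        X = rng.normal(size=(n, p))
        y = X @ rng.normal(size=p) + rng.normal(size=n)

        fold_errors = []
        for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
            model = Ridge(alpha=1.0).fit(X[train], y[train])
            fold_errors.append(np.mean((y[test] - model.predict(X[test])) ** 2))
        fold_errors = np.asarray(fold_errors)

        cv_estimate = fold_errors.mean()          # unbiased estimate of prediction error
        naive_var = fold_errors.var(ddof=1) / K   # treats fold errors as independent;
                                                  # biased, because folds reuse training data
        print(cv_estimate, naive_var)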

    lassopack: Model selection and prediction with regularized regression in Stata

    This article introduces lassopack, a suite of programs for regularized regression in Stata. lassopack implements the lasso, square-root lasso, elastic net, ridge regression, adaptive lasso and post-estimation OLS. The methods are suitable for the high-dimensional setting where the number of predictors p may be large and possibly greater than the number of observations, n. We offer three different approaches for selecting the penalization ('tuning') parameters: information criteria (implemented in lasso2); K-fold cross-validation and h-step-ahead rolling cross-validation for cross-section, panel and time-series data (cvlasso); and theory-driven ('rigorous') penalization for the lasso and square-root lasso for cross-section and panel data (rlasso). We discuss the theoretical framework and practical considerations for each approach. We also present Monte Carlo results to compare the performance of the penalization approaches.
    Comment: 52 pages, 6 figures, 6 tables; submitted to Stata Journal; for more information see https://statalasso.github.io
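    The Stata syntax itself is documented at the link above; as a rough, hypothetical Python analogue of the first two tuning approaches (information-criterion selection and K-fold cross-validation), scikit-learn's lasso estimators can be used in the same spirit. The data-generating process and dimensions below are arbitrary choices for illustration, not examples from the article.

        # Penalty selection by information criterion vs. K-fold CV -- a Python analogue,
        # not the lassopack commands themselves (those are lasso2, cvlasso, rlasso in Stata).
        import numpy as np
        from sklearn.linear_model import LassoCV, LassoLarsIC

        rng = np.random.default_rng(1)
        n, p = 200, 100                      # many predictors relative to observations
        X = rng.normal(size=(n, p))
        beta = np.zeros(p)
        beta[:5] = 2.0                       # sparse true coefficient vector
        y = X @ beta + rng.normal(size=n)

        ic_fit = LassoLarsIC(criterion="bic").fit(X, y)   # lambda chosen by BIC
        cv_fit = LassoCV(cv=10).fit(X, y)                 # lambda chosen by 10-fold CV
        print("BIC lambda:", ic_fit.alpha_, "CV lambda:", cv_fit.alpha_)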

    Reliability and validity in comparative studies of software prediction models

    Empirical studies of software prediction models do not converge on the question "which prediction model is best?", and the reason for this lack of convergence is poorly understood. In this simulation study, we examine a frequently used research procedure comprising three main ingredients: a single data sample, an accuracy indicator, and cross-validation. Typically, such studies compare a machine learning model with a regression model, and our simulations do the same. The results suggest that it is the research procedure itself that is unreliable, and this lack of reliability may strongly contribute to the lack of convergence. Our findings thus cast doubt on the conclusions of any study of competing software prediction models that used this research procedure as its basis of model comparison. More reliable research procedures need to be developed before we can have confidence in the conclusions of comparative studies of software prediction models.
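    A toy version of such a simulation (with arbitrary models, sample size, and accuracy indicator chosen here purely for illustration, not taken from the study) makes the instability concrete: repeating the single-sample-plus-cross-validation comparison on fresh samples from the same population, the "winning" model changes from replication to replication.

        # Toy illustration: repeat the "single sample + CV + accuracy indicator" comparison
        # on fresh samples from one population and count which model wins each time.
        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(2)
        wins = {"tree": 0, "linear": 0}
        for rep in range(30):
            X = rng.uniform(0, 10, size=(60, 3))
            y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=2.0, size=60)
            def mae(model):
                return -cross_val_score(model, X, y, cv=10,
                                        scoring="neg_mean_absolute_error").mean()
            if mae(DecisionTreeRegressor(random_state=0)) < mae(LinearRegression()):
                wins["tree"] += 1
            else:
                wins["linear"] += 1
        print(wins)   # the preferred model varies across equally valid samples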

    Controlling the Overfitting of Heritability in Genomic Selection through Cross Validation.

    In genomic selection (GS), all the markers across the entire genome are used to conduct marker-assisted selection, such that each quantitative trait locus of a complex trait is in linkage disequilibrium with at least one marker. Although GS improves estimated breeding values and genetic gain, in most GS models the genetic variance is estimated from training samples containing many trait-irrelevant markers, which leads to severe overfitting in the calculation of trait heritability. In this study, we used a series of simulations to demonstrate that heritability is overfitted when trait-irrelevant markers are included, and that such overfitting can be effectively controlled by a cross-validation experiment. In the proposed method, the genetic variance is simply the variance of the genetic values predicted through cross-validation, the residual variance is the variance of the differences between the observed phenotypic values and the predicted genetic values, and these two variance components are used to calculate the unbiased heritability. We also demonstrate that the heritability calculated through cross-validation is equivalent to trait predictability, which objectively reflects the applicability of the GS models. The proposed method can be implemented with the MIXED procedure in SAS or with our R package "GSMX", which is publicly available at https://cran.r-project.org/web/packages/GSMX/index.html
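    The variance decomposition described above is simple to write down. The sketch below is a hypothetical illustration with simulated marker data and ridge regression standing in for a genomic selection model; the authors' own implementations are the SAS MIXED procedure and the GSMX R package linked above.

        # Cross-validation heritability: genetic variance from CV-predicted genetic values,
        # residual variance from the prediction errors. Ridge regression is a stand-in GS model.
        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_predict

        rng = np.random.default_rng(3)
        n, m = 300, 1000                                  # individuals, markers
        X = rng.binomial(2, 0.3, size=(n, m)).astype(float)
        effects = np.zeros(m)
        effects[:20] = rng.normal(size=20)                # only a few trait-relevant markers
        y = X @ effects + rng.normal(scale=2.0, size=n)   # simulated phenotypes

        # Genetic values predicted through cross-validation (never fitted on their own fold)
        g_hat = cross_val_predict(Ridge(alpha=100.0), X, y, cv=5)

        var_g = np.var(g_hat)             # genetic variance: variance of predicted genetic values
        var_e = np.var(y - g_hat)         # residual variance: variance of observed minus predicted
        h2_cv = var_g / (var_g + var_e)   # cross-validated heritability (equals trait
                                          # predictability in the sense described above)
        print(h2_cv)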