    Tenfold Bootstrap Procedure for Support Vector Machines

    Cross validation is often used to split input data into training and test sets for support vector machines. The two most commonly used versions are tenfold and leave-one-out cross validation. Another common resampling method is the random test/train split. The advantage of these methods is that they avoid overfitting and support model selection; however, they can substantially increase the time needed to fit support vector machines as the size of the dataset grows. In this research, we propose an alternative for fitting SVMs, which we call the tenfold bootstrap for support vector machines. This resampling procedure can significantly reduce execution time even for large numbers of observations while preserving the model's accuracy. With this finding, we propose a solution to the problem of slow execution time when fitting support vector machines on big datasets.
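
    The abstract does not spell out the exact resampling steps; the R sketch below (using the e1071 package) assumes one plausible reading: draw ten bootstrap samples, fit an SVM on each, and average the out-of-bag misclassification rates. The function name tenfold_bootstrap_svm and the use of the iris data are illustrative, not taken from the paper.

```r
# Hedged sketch of a tenfold bootstrap evaluation for an SVM.
# Assumption: 10 bootstrap resamples with out-of-bag error estimation.
library(e1071)

tenfold_bootstrap_svm <- function(x, y, B = 10) {
  n <- nrow(x)
  errors <- numeric(B)
  for (b in seq_len(B)) {
    idx <- sample(n, n, replace = TRUE)        # bootstrap sample
    oob <- setdiff(seq_len(n), unique(idx))    # out-of-bag observations
    fit <- svm(x[idx, , drop = FALSE], y[idx])
    pred <- predict(fit, x[oob, , drop = FALSE])
    errors[b] <- mean(pred != y[oob])          # OOB misclassification rate
  }
  mean(errors)                                 # averaged bootstrap error
}

# Placeholder data: iris stands in for a large dataset
data(iris)
tenfold_bootstrap_svm(iris[, 1:4], iris$Species)
```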

    Comparison of Classifiers Applied to Confocal Scanning Laser Ophthalmoscopy Data

    Objectives: Comparison of classification methods using data from one clinical study. The tuning of hyperparameters is assessed as part of the methods by nested-loop cross-validation. Methods: We assess the ability of 18 statistical and machine learning classifiers to detect glaucoma. The training data set is one case-control study consisting of confocal scanning laser ophthalmoscopy measurement values from 98 glaucoma patients and 98 healthy controls. We compare bootstrap estimates of the classification error by the Wilcoxon signed rank test and box-plots of a bootstrap distribution of the estimate. Results: The comparison of out-of-bag bootstrap estimators of classification errors is assessed by Spearman's rank correlation, Wilcoxon signed rank tests, and box-plots of a bootstrap distribution of the estimate. The classification methods random forests (15.4%), support vector machines (15.9%), bundling (16.3% to 17.8%), and penalized discriminant analysis (16.8%) show the best results. Conclusions: Using nested-loop cross-validation, we account for the tuning of hyperparameters and demonstrate the assessment of different classifiers. We recommend a block design of the bootstrap simulation to allow a statistical assessment of the bootstrap estimates of the misclassification error. The results depend on the data of the clinical study and the given size of the bootstrap sample.
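
    As a minimal illustration of the comparison described above, the R sketch below applies a paired Wilcoxon signed rank test and a box-plot to two vectors of per-replicate bootstrap error rates. The error vectors are simulated placeholders, not the glaucoma study data.

```r
# Hedged sketch: comparing bootstrap error estimates of two classifiers
# with a paired Wilcoxon signed rank test and box-plots.
set.seed(1)
err_rf  <- rbeta(100, 15, 85)   # hypothetical per-replicate error rates, ~15%
err_svm <- rbeta(100, 16, 84)   # hypothetical per-replicate error rates, ~16%

wilcox.test(err_rf, err_svm, paired = TRUE)   # paired signed rank test

boxplot(list(`random forest` = err_rf, SVM = err_svm),
        ylab = "out-of-bag bootstrap error")
```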

    Bringing Statistical Learning Machines Together for Hydro-Climatological Predictions - Case Study for Sacramento San Joaquin River Basin, California

    Study region: Sacramento San Joaquin River Basin, California. Study focus: The study forecasts streamflow at a regional scale within the SSJ river basin using large-scale climate variables. The proposed approach eliminates the bias resulting from predefined indices at the regional scale. The study was performed for eight unimpaired streamflow stations from 1962–2016. First, the Singular Value Decomposition (SVD) teleconnections of the streamflow corresponding to 500 mbar geopotential height, sea surface temperature, 500 mbar specific humidity (SHUM500), and 500 mbar U-wind (U500) were obtained. Second, the skillful SVD teleconnections were screened non-parametrically. Finally, the screened teleconnections were used as streamflow predictors in non-linear regression models (K-nearest neighbor regression and data-driven support vector machine). New hydrological insights: The SVD results identified new spatial regions that are not captured by existing predefined indices. The nonparametric model indicated that the teleconnections of SHUM500 and U500 are better streamflow predictors than the other climate variables. The regression models were able to capture most of the sustained low flows, showing the approach to be effective for drought-affected regions. It was also observed that the proposed approach gave better forecasting skill with preprocessed large-scale climate variables than with the predefined indices. The proposed study is simple, yet robust in providing qualitative streamflow forecasts that may assist water managers in making policy-related decisions when planning and managing watersheds.
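
    A minimal R sketch of the two modeling steps named above, under stated assumptions: the cross-covariance between standardized streamflow and a climate field is decomposed by SVD, and the expansion coefficients of the leading pattern feed a support vector regression. All matrices here are random placeholders standing in for the hydro-climatological fields, and the screening step is omitted.

```r
# Hedged sketch: SVD teleconnections followed by support vector regression.
library(e1071)
set.seed(1)
streamflow <- matrix(rnorm(55 * 8),   nrow = 55)  # years x stations (placeholder)
climate    <- matrix(rnorm(55 * 200), nrow = 55)  # years x grid points (placeholder)

# SVD of the cross-covariance between standardized streamflow and climate field
cc  <- crossprod(scale(streamflow), scale(climate)) / (nrow(streamflow) - 1)
dec <- svd(cc)

# Expansion coefficients of the leading climate pattern as predictor
predictor <- scale(climate) %*% dec$v[, 1]

# Support vector regression for one station's streamflow
fit <- svm(x = predictor, y = streamflow[, 1])
head(predict(fit, predictor))
```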

    Stratification bias in low signal microarray studies

    BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation, namely balanced, stratified cross-validation and balanced leave-one-out cross-validation, avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
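
    The R sketch below illustrates the recommended strategy for AUC under stratified 10-fold cross-validation: average per-fold AUC estimates rather than pool test predictions across folds (the pooled value is computed only for contrast). The simulated data, the logistic-regression classifier, and the rank-based AUC helper are illustrative assumptions, not the paper's pipeline.

```r
# Hedged sketch: per-fold AUC averaging vs pooled AUC under stratified CV.
auc <- function(score, label) {            # Wilcoxon/Mann-Whitney AUC
  r  <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n <- 60
x <- matrix(rnorm(n * 5), nrow = n)        # low-signal (pure noise) features
y <- rep(c(0, 1), length.out = n)
df <- data.frame(y = y, x)

# stratified fold assignment: class proportions preserved in every fold
folds <- numeric(n)
folds[y == 0] <- sample(rep(1:10, length.out = sum(y == 0)))
folds[y == 1] <- sample(rep(1:10, length.out = sum(y == 1)))

pooled <- numeric(n); per_fold <- numeric(10)
for (k in 1:10) {
  fit  <- glm(y ~ ., data = df[folds != k, ], family = binomial)
  pred <- predict(fit, newdata = df[folds == k, ], type = "response")
  pooled[folds == k] <- pred
  per_fold[k] <- auc(pred, y[folds == k])
}
mean(per_fold)   # recommended: average of per-fold AUC estimates
auc(pooled, y)   # pooling across folds can be pessimistically biased
```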

    Building Predictive Models in R Using the caret Package

    The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualizations. An example from computational chemistry is used to illustrate the functionality on a real data set and to benchmark the benefits of parallel processing with several types of models.
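
    A minimal sketch of a typical caret workflow of the kind the article describes: resampled tuning of a radial-kernel SVM with centering/scaling and variable importance. The computational chemistry data set is not reproduced here; iris stands in as a placeholder and the tuning settings are arbitrary.

```r
# Hedged sketch of model training and tuning with the caret package.
library(caret)
data(iris)

ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
fit  <- train(Species ~ ., data = iris,
              method = "svmRadial",                # radial-kernel SVM (kernlab)
              preProcess = c("center", "scale"),   # pre-processing of predictors
              tuneLength = 5,                      # number of candidate tuning values
              trControl = ctrl)

fit$bestTune       # selected hyperparameters
varImp(fit)        # variable importance
```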

    A bias correction for the minimum error rate in cross-validation

    Tuning parameters in supervised learning problems are often estimated by cross-validation. The minimum value of the cross-validation error can be biased downward as an estimate of the test error at that same value of the tuning parameter. We propose a simple method for the estimation of this bias that uses information from the cross-validation process. As a result, it requires essentially no additional computation. We apply our bias estimate to a number of popular classifiers in various settings, and examine its performance. Comment: Published at http://dx.doi.org/10.1214/08-AOAS224 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
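
    The abstract does not state the estimator explicitly; the R sketch below implements one natural fold-based bias estimate of this flavor, averaging over folds the gap between each fold's error at the overall minimizing tuning value and at that fold's own minimizer. Whether this matches the paper's exact proposal is an assumption, and the error matrix is a toy placeholder.

```r
# Hedged sketch of a fold-based bias estimate for the minimum CV error.
min_cv_bias <- function(err) {            # err: folds x tuning-values error matrix
  cv_curve <- colMeans(err)               # usual CV error curve
  j_hat    <- which.min(cv_curve)         # overall minimizing tuning value
  gaps     <- err[, j_hat] - apply(err, 1, min)  # per-fold optimism
  list(min_cv    = cv_curve[j_hat],
       bias      = mean(gaps),
       corrected = cv_curve[j_hat] + mean(gaps))
}

# toy example: 10 folds x 8 tuning values of per-fold error rates (hypothetical)
set.seed(1)
err <- matrix(runif(80, 0.2, 0.4), nrow = 10)
min_cv_bias(err)
```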