
    Concentration inequalities of the cross-validation estimate for stable predictors

    In this article, we derive concentration inequalities for the cross-validation estimate of the generalization error for stable predictors in the context of risk assessment. The notion of stability was first introduced by [DEWA79] and extended by [KEA95], [BE01], and [KUNIY02] to characterize classes of predictors with infinite VC dimension; in particular, this covers k-nearest-neighbor rules, the Bayesian algorithm ([KEA95]), boosting, and more. General loss functions and classes of predictors are considered. We use the formalism introduced by [DUD03] to cover a large variety of cross-validation procedures, including leave-one-out cross-validation, k-fold cross-validation, hold-out cross-validation (or split sample), and leave-υ-out cross-validation. In particular, we give a simple rule for choosing the cross-validation procedure depending on the stability of the class of predictors. In the special case of uniform stability, an interesting consequence is that the number of elements in the test set is not required to grow to infinity for the cross-validation procedure to be consistent. In this special case, the particular interest of leave-one-out cross-validation is emphasized.
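    The cross-validation procedures the abstract compares share one skeleton: average a loss over held-out blocks. A minimal sketch, using a 1-nearest-neighbor rule as an example of a stable predictor and 0-1 loss standing in for the general losses the paper covers (the data here are synthetic):

```python
import numpy as np

def kfold_cv_error(X, y, predict_fn, k, rng):
    """Generic k-fold CV estimate of the generalization error
    (0-1 loss here; the paper covers general loss functions)."""
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)           # complement of the held-out fold
        preds = predict_fn(X[train], y[train], X[f])
        errs.append(np.mean(preds != y[f]))    # empirical 0-1 loss on the fold
    return float(np.mean(errs))

def one_nn(Xtr, ytr, Xte):
    """1-nearest-neighbor rule on 1-D inputs -- a stable predictor
    with infinite VC dimension, as mentioned in the abstract."""
    d = np.abs(Xte[:, None] - Xtr[None, :])
    return ytr[np.argmin(d, axis=1)]

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = (X + 0.3 * rng.normal(size=200) > 0).astype(int)

err_loo = kfold_cv_error(X, y, one_nn, k=len(y), rng=rng)  # leave-one-out: k = n
err_10  = kfold_cv_error(X, y, one_nn, k=10, rng=rng)      # 10-fold
```

    Leave-one-out is just the k = n extreme of the same loop; hold-out corresponds to a single fold.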

    Improving generalisation of AutoML systems with dynamic fitness evaluations

    A common problem machine learning developers face is overfitting: fitting a pipeline so closely to the training data that performance degrades on unseen data. Automated machine learning aims to free (or at least ease) the developer from the burden of pipeline creation, but this overfitting problem can persist. In fact, it can become more of a problem as we iteratively optimise performance against an internal cross-validation (most often k-fold). While this internal cross-validation is meant to reduce overfitting, we show that we can still overfit to the particular folds used. In this work, we remedy this problem by introducing dynamic fitness evaluations which approximate repeated k-fold cross-validation at little extra cost over a single k-fold, and at far lower cost than typical repeated k-fold. The results show that, when time-equated, the proposed fitness function yields a significant improvement over the current state-of-the-art baseline method, which uses an internal single k-fold. Furthermore, the proposed extension is very simple to implement on top of existing evolutionary computation methods and can provide an essentially free boost in generalisation/testing performance.
    Comment: 19 pages, 4 figures
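    One plausible reading of "dynamic fitness evaluations" is to reseed the fold split each generation, so that over the course of the evolutionary search many different splits are seen at the cost of a single k-fold per evaluation. A hedged sketch of that idea (the `score` function and the data are toy stand-ins for a real AutoML pipeline, not the paper's implementation):

```python
import numpy as np

def fold_split(n, k, seed):
    """k-fold index split controlled by a seed; a fresh seed per
    generation gives the 'dynamic' part."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def dynamic_fitness(score, X, y, generation, k=5):
    """Each generation evaluates candidates on a *different* k-fold
    split, approximating repeated k-fold at single k-fold cost.
    `score` is a hypothetical scorer (Xtr, ytr, Xte, yte) -> accuracy."""
    accs = []
    for test in fold_split(len(y), k, seed=generation):
        train = np.setdiff1d(np.arange(len(y)), test)
        accs.append(score(X[train], y[train], X[test], y[test]))
    return float(np.mean(accs))

# Toy scorer: mean-threshold classifier standing in for a pipeline.
def score(Xtr, ytr, Xte, yte):
    thr = Xtr.mean()
    return float(np.mean((Xte > thr).astype(int) == yte))

rng = np.random.default_rng(1)
X = rng.normal(size=100)
y = (X > 0).astype(int)
f_gen0 = dynamic_fitness(score, X, y, generation=0)
f_gen1 = dynamic_fitness(score, X, y, generation=1)  # different folds, same cost
```

    A candidate that overfits one particular split is penalised in later generations, which is the effect the paper exploits.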

    Efficient Cross-Validation of Echo State Networks

    Echo State Networks (ESNs) are known for their fast and precise one-shot learning of time series, but they often need good hyper-parameter tuning for best performance. Good validation is key to this, yet usually a single validation split is used. In this rather practical contribution, we suggest several schemes for cross-validating ESNs and introduce an efficient algorithm for implementing them. In our proposed method of k-fold cross-validation, the component that dominates the time complexity of the already quite fast ESN training remains constant (does not scale up with k). The component that does scale linearly with k starts dominating only in some not very common situations. Thus, in many situations, k-fold cross-validation of ESNs can be done for virtually the same time complexity as a simple single-split validation. Space complexity can also remain the same. We also discuss when the proposed validation schemes for ESNs could be beneficial and empirically investigate them on several different real-world datasets.
    Comment: Accepted at the ICANN'19 Workshop on Reservoir Computing
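    A likely mechanism behind the constant-in-k cost (an assumption on my part, not taken from the paper): the ESN readout is a ridge regression on collected states, so the expensive part is accumulating the Gram matrices XᵀX and XᵀY over the whole series. Accumulate them once per fold, and each training split's matrices are the totals minus the held-out fold, at a per-fold cost independent of the series length:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_state = 600, 30
X = rng.normal(size=(T, n_state))   # stand-in for collected ESN states
Y = X @ rng.normal(size=(n_state, 1)) + 0.01 * rng.normal(size=(T, 1))

k, ridge = 5, 1e-6
folds = np.array_split(np.arange(T), k)

# Expensive accumulation done ONCE: per-fold pieces of X^T X and X^T Y.
A = [X[f].T @ X[f] for f in folds]
B = [X[f].T @ Y[f] for f in folds]
A_tot, B_tot = sum(A), sum(B)

errs = []
for i, f in enumerate(folds):
    # Training Gram = total minus held-out fold: an O(n_state^2) update
    # per fold, independent of the series length T.
    W = np.linalg.solve(A_tot - A[i] + ridge * np.eye(n_state), B_tot - B[i])
    errs.append(float(np.mean((X[f] @ W - Y[f]) ** 2)))
cv_mse = float(np.mean(errs))
```

    Only the k small linear solves scale with k, which matches the abstract's claim about which component dominates.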

    Performance Analysis of Classification and Regression Tree (CART) Algorithm in Classifying Male Fertility Levels with Mobile-Based

    Fertility is the ability of a man to produce offspring, that is, the ability of the reproductive organs to work optimally in fertilization. Fertility rates have declined drastically in the last fifty years. Machine learning is a field devoted to understanding and building learning methods. This study uses machine learning to classify male fertility levels, namely the Classification and Regression Tree (CART) algorithm together with K-fold cross-validation. The fertility dataset used in this study was obtained from the UCI Machine Learning repository, with a total of 100 records; the variables used are Age, Childish diseases, Accident or serious trauma, Surgical intervention, High fevers in the last year, Frequency of alcohol consumption, Smoking habit, Number of hours spent sitting per day, and Diagnosis. K-fold cross-validation can be used together with CART to measure the performance of the CART model on different data, so as to avoid overfitting or underfitting the CART model. Based on the CART algorithm and K-fold cross-validation (K = 1 to K = 9), the average accuracy is 98.70% on training data and 81.16% on testing data. These results show that the CART algorithm can classify male fertility levels well. In addition, the resulting classification model can be implemented as a mobile (Android) application, making it easy to use and understand.
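    CART grows a tree by repeatedly choosing the split that minimises Gini impurity. A simplified sketch of that criterion, using a depth-1 tree (a stump) on synthetic data rather than the paper's full CART model or the UCI fertility dataset, evaluated with the same K-fold protocol:

```python
import numpy as np

def gini(y):
    """Gini impurity, the splitting criterion CART uses."""
    p = np.bincount(y, minlength=2) / max(len(y), 1)
    return 1.0 - float(np.sum(p ** 2))

def fit_stump(X, y):
    """One CART-style split (a depth-1 tree): scan every feature and
    threshold, keep the split with lowest weighted Gini impurity."""
    best = (0, 0.0, np.inf)  # (feature, threshold, impurity)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if imp < best[2]:
                best = (j, t, imp)
    j, t, _ = best
    left, right = y[X[:, j] <= t], y[X[:, j] > t]
    return j, t, int(np.bincount(left).argmax()), int(np.bincount(right).argmax())

def predict_stump(model, X):
    j, t, cl, cr = model
    return np.where(X[:, j] <= t, cl, cr)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

folds = np.array_split(rng.permutation(100), 5)
accs = []
for f in folds:
    tr = np.setdiff1d(np.arange(100), f)
    model = fit_stump(X[tr], y[tr])
    accs.append(float(np.mean(predict_stump(model, X[f]) == y[f])))
mean_acc = float(np.mean(accs))
```

    The gap the paper reports between training accuracy (98.70%) and testing accuracy (81.16%) is exactly what this per-fold held-out evaluation is designed to expose.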

    Impression Classification of Endek (Balinese Fabric) Image Using K-Nearest Neighbors Method

    An impression can be interpreted as a psychological feeling toward a product, and it plays an important role in decision making; understanding data in the domain of impressions is therefore very useful. This research aimed to evaluate the performance of the K-Nearest Neighbors method in classifying endek image impressions, using K-fold cross-validation. The images were taken from three locations: CV. Artha Dharma, Agung Bali Collection, and Pengrajin Sri Rejeki. The image impressions were obtained by consulting an endek expert, Dr. D.A Tirta Ray, M.Si. Data mining was performed with the K-Nearest Neighbors method, a classification method that labels new objects based on their attributes and previously classified training samples. K-fold cross-validation testing obtained an accuracy of 91% with K values in K-Nearest Neighbors of 3, 4, 7, and 8.
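    The study's protocol, K-nearest-neighbors scored by K-fold cross-validation across several neighbor counts, can be sketched as follows. The features here are synthetic stand-ins for whatever image descriptors the authors extracted from the fabric photos (the paper does not specify them in this abstract):

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, K):
    """Majority vote among the K nearest training samples
    (Euclidean distance on feature vectors)."""
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :K]
    return np.array([np.bincount(v).argmax() for v in ytr[nearest]])

def kfold_accuracy(X, y, K, k=5, seed=0):
    """Mean held-out accuracy over k folds for a given neighbor count K."""
    idx = np.random.default_rng(seed).permutation(len(y))
    accs = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        accs.append(np.mean(knn_predict(X[train], y[train], X[test], K) == y[test]))
    return float(np.mean(accs))

# Toy two-class features standing in for image descriptors.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(2, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

# The K values the paper reports as reaching its best accuracy.
scores = {K: kfold_accuracy(X, y, K) for K in (3, 4, 7, 8)}
```

    Sweeping K this way and reading off the cross-validated accuracy is how the paper arrives at its 91% figure for K ∈ {3, 4, 7, 8}.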

    CVThresh: R Package for Level-Dependent Cross-Validation Thresholding

    The core of the wavelet approach to nonparametric regression is the thresholding of wavelet coefficients. This paper reviews the cross-validation method of Oh, Kim, and Lee (2006) for selecting the thresholding value in wavelet shrinkage, and introduces the R package CVThresh, which implements the details of the calculations for these procedures. The procedure couples conventional cross-validation with a fast imputation method, so that it overcomes the limitation that the data length be a power of 2. It can easily be applied to classical leave-one-out and K-fold cross-validation. Since the procedure is computationally fast, a level-dependent cross-validation can be developed for wavelet shrinkage of data whose sparseness varies across levels.
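    The imputation-based idea can be illustrated in miniature: hold out every other sample, impute it by interpolating the kept samples, denoise by soft-thresholding wavelet detail coefficients, and pick the threshold that best predicts the held-out values. This is a heavily simplified two-fold, one-level Haar sketch of the concept, not the CVThresh package's actual algorithm (which handles multiple levels and arbitrary data lengths in R):

```python
import numpy as np

def haar(x):
    """One-level Haar transform: approximation and detail coefficients."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def ihaar(a, d):
    """Inverse of the one-level Haar transform above."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def soft(d, t):
    """Soft-thresholding of wavelet coefficients."""
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

def cv_threshold(y, thresholds):
    """Even/odd two-fold CV for the threshold: hold out alternate samples,
    impute them by linear interpolation (the 'fast imputation' idea),
    denoise, and score against the held-out values."""
    pos = np.arange(len(y))
    best_t, best_err = thresholds[0], np.inf
    for t in thresholds:
        err = 0.0
        for held in (0, 1):
            hp = pos[held::2]             # held-out positions
            kp = pos[1 - held::2]         # kept positions
            z = np.interp(pos, kp, y[kp])  # impute held-out samples
            a, d = haar(z)
            rec = ihaar(a, soft(d, t))
            err += float(np.mean((rec[hp] - y[hp]) ** 2))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

rng = np.random.default_rng(6)
x = np.linspace(0, 2 * np.pi, 128)
y = np.sin(3 * x) + 0.3 * rng.normal(size=128)
t_star, err_star = cv_threshold(y, np.linspace(0.0, 2.0, 21))
```

    A level-dependent variant would simply run this selection once per resolution level instead of once globally.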

    Using K-fold cross-validation with proposed models for SpikeProp learning enhancements

    A Spiking Neural Network (SNN) uses individual spikes in the time domain to perform and communicate computation, much in the way actual neurons do. SNNs were not studied earlier because they were considered too complicated and too hard to analyse. Several limitations concerning the characteristics of SNNs that had not been researched earlier have been resolved since the introduction of SpikeProp in 2000 by Sander Bohte as a supervised SNN learning model. This paper describes research developments that enhance SpikeProp learning, using K-fold cross-validation for dataset classification. It introduces acceleration factors for SpikeProp using Radius Initial Weight and Differential Evolution (DE) weight initialization as the proposed methods. In addition, training and testing of the new proposed methods were investigated by K-fold cross-validation on datasets obtained from the Machine Learning Benchmark Repository, as improvements over Bohte's algorithm. Performance was compared between the proposed methods, Backpropagation (BP), and standard SpikeProp. The findings reveal that the proposed methods outperform both standard SpikeProp and BP on all datasets under K-fold cross-validation.
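    The combination described, DE searching over weight initializations with each candidate scored by K-fold cross-validation, can be sketched without the SpikeProp machinery itself. Here a short perceptron training run stands in for SpikeProp training from a given initialization; everything except the DE-plus-K-fold structure is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(80, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(int)

def train_eval(w0, Xtr, ytr, Xte, yte, epochs=5, lr=0.1):
    """Short perceptron training from initial weights w0 -- a cheap
    stand-in for SpikeProp training from a candidate initialization."""
    w = w0.copy()
    for _ in range(epochs):
        for xi, yi in zip(Xtr, ytr):
            w += lr * (yi - (xi @ w > 0)) * xi
    return float(np.mean((Xte @ w > 0).astype(int) == yte))

def kfold_fitness(w0, k=4):
    """Fitness of an initialization = mean K-fold held-out accuracy."""
    idx = np.random.default_rng(0).permutation(len(y))
    accs = []
    for te in np.array_split(idx, k):
        tr = np.setdiff1d(idx, te)
        accs.append(train_eval(w0, X[tr], y[tr], X[te], y[te]))
    return float(np.mean(accs))

# Minimal differential evolution over initial-weight vectors.
pop = rng.normal(size=(10, 3))
fit = np.array([kfold_fitness(w) for w in pop])
F, CR = 0.5, 0.9
for _ in range(15):
    for i in range(len(pop)):
        a, b, c = pop[rng.choice(len(pop), 3, replace=False)]
        trial = np.where(rng.random(3) < CR, a + F * (b - c), pop[i])
        tf = kfold_fitness(trial)
        if tf >= fit[i]:               # greedy DE selection
            pop[i], fit[i] = trial, tf
best_acc = float(fit.max())
```

    Scoring each candidate by K-fold accuracy, as here, is what makes the reported comparisons against BP and standard SpikeProp fold-robust rather than tied to one split.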