182 research outputs found

    No Unbiased Estimator of the Variance of K-Fold Cross-Validation

    In statistical machine learning, the standard measure of accuracy for models is the prediction error, i.e. the expected loss on future examples. When the data distribution is unknown, it cannot be computed, but several resampling methods, such as K-fold cross-validation, can be used to obtain an unbiased estimator of prediction error. However, to compare learning algorithms one also needs to estimate the uncertainty around the cross-validation estimator, which is important because it can be very large. The usual variance estimates for means of independent samples cannot be used because of the reuse of the data used to form the cross-validation estimator. The main result of this paper is that there is no universal (distribution-independent) unbiased estimator of the variance of the K-fold cross-validation estimator, based only on the empirical results of the error measurements obtained through the cross-validation procedure. The analysis provides a theoretical understanding showing the difficulty of this estimation. These results generalize to other resampling methods, as long as data are reused for training or testing.
    Keywords: prediction error, cross-validation, multivariate variance estimators, statistical comparison of algorithms
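    As a rough illustration of the quantities this abstract discusses, the sketch below computes a K-fold cross-validation estimate of prediction error and the naive variance of the mean that the abstract says cannot be relied on. It is not taken from the paper: the helper names (`k_fold_cv_errors`, `fit_mean`, `squared_loss`), the mean predictor, and the synthetic Gaussian sample are all assumptions chosen to keep the example small.

```python
import random
import statistics

def k_fold_cv_errors(data, k, fit, loss):
    """Return per-example losses from K-fold cross-validation, grouped by fold."""
    n = len(data)
    # Split indices into k contiguous, nearly equal folds.
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    fold_losses = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Train on everything outside the current fold.
        train = [data[i] for i in range(n) if i not in held_out]
        model = fit(train)
        fold_losses.append([loss(model, data[i]) for i in test_idx])
    return fold_losses

def fit_mean(train):
    # Hypothetical "learning algorithm": predict the training mean.
    return statistics.fmean(train)

def squared_loss(model, y):
    return (model - y) ** 2

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(50)]  # synthetic data (assumption)

fold_losses = k_fold_cv_errors(sample, k=5, fit=fit_mean, loss=squared_loss)
all_losses = [e for fold in fold_losses for e in fold]

# K-fold cross-validation estimate of prediction error.
cv_estimate = statistics.fmean(all_losses)

# Naive variance of that mean. It treats the losses as independent, which they
# are not: training sets overlap across folds and every example is reused.
naive_variance = statistics.variance(all_losses) / len(all_losses)
```

    The naive figure ignores the correlations between losses within a fold and across folds, which is the dependence the abstract identifies as the obstacle to a distribution-independent unbiased variance estimator.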

    Regularizing Portfolio Optimization

    The optimization of large portfolios displays an inherent instability to estimation error. This poses a fundamental problem, because solutions that are not stable under sample fluctuations may look optimal for a given sample, but are, in effect, very far from optimal with respect to the average risk. In this paper, we approach the problem from the point of view of statistical learning theory. The occurrence of the instability is intimately related to over-fitting, which can be avoided using known regularization methods. We show how regularized portfolio optimization with the expected shortfall as a risk measure is related to support vector regression. The budget constraint dictates a modification. We present the resulting optimization problem and discuss the solution. The L2 norm of the weight vector is used as a regularizer, which corresponds to a diversification "pressure". This means that diversification, besides counteracting downward fluctuations in some assets by upward fluctuations in others, is also crucial because it improves the stability of the solution. The approach we provide here allows for the simultaneous treatment of optimization and diversification in one framework that enables the investor to trade off between the two, depending on the size of the available data set.
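    The diversification "pressure" of an L2 regularizer can be sketched on a simpler problem than the one in the abstract. The example below is an assumption-laden stand-in: it uses portfolio variance as the risk measure instead of expected shortfall, so it is not the paper's support-vector formulation. Minimizing w'Σw + λ‖w‖² subject to the budget constraint Σᵢ wᵢ = 1 has the closed-form solution w ∝ (Σ + λI)⁻¹1. The covariance matrix and the function names (`solve`, `regularized_weights`) are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def regularized_weights(cov, lam):
    """Weights minimizing w' cov w + lam * ||w||^2 subject to sum(w) == 1."""
    n = len(cov)
    a = [[cov[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    z = solve(a, [1.0] * n)          # z = (cov + lam * I)^-1 * 1
    total = sum(z)
    return [v / total for v in z]    # rescale to satisfy the budget constraint

# Hypothetical covariance matrix for three assets.
cov = [[0.04, 0.01, 0.00],
       [0.01, 0.09, 0.02],
       [0.00, 0.02, 0.16]]

weights_plain = regularized_weights(cov, lam=0.0)   # unregularized minimum variance
weights_reg = regularized_weights(cov, lam=10.0)    # strong L2 penalty
```

    With a large penalty the weights move toward the equal-weight portfolio, which is one way to read the abstract's remark that the regularizer acts as a pressure toward diversification.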

    Classification of Child Marriage Data in North Maluku Using the Kernel Regression and Support Vector Machine Methods

    One of the Sustainable Development Goals (SDGs) to be achieved by 2030 is the elimination of all child marriage practices. In North Maluku, child marriage among girls remains fairly high: according to a report by Badan Pusat Statistik (BPS), the rate was 14.36 percent in 2019, above the national figure of 10.82 percent. The Government therefore needs information on the determinants of child marriage in order to run programs aimed at reducing the number of cases. To that end, the Kernel Regression and Support Vector Machine (SVM) classification methods can be used to identify these determinants. The results show that SVM gives higher classification accuracy than Kernel Regression: 99.17 percent on the 70% training data and 100 percent on the 30% test data. Using SVM, the determinants of child marriage in North Maluku were found to be years of schooling, employment status, internet access, area of residence, number of household members, years of schooling completed by the household head, and household per capita expenditure.

    Exploring a combined biomarker for tuberculosis treatment response: protocol for a prospective observational cohort study.

    INTRODUCTION: An improved understanding of factors explaining tuberculosis (TB) treatment response is urgently needed to help clinicians optimise and personalise treatment and assist scientists undertaking novel treatment regimen trials. Promising outcome proxy measures, including sputum bacillary load and host immune response, are widely reported with variable results. However, they have not been studied together in combination with antibiotic exposure. The aim of this observational cohort study is to investigate which antibiotic exposures correlate with sputum bacillary load and which with the host immune response. Subsequently, we will explore if these correlations can be used to inform a candidate combined biomarker predicting cure. METHODS AND ANALYSIS: All patients aged ≥ 18, diagnosed with drug-sensitive pulmonary TB (culture or molecular test), eligible for standard anti-TB treatment, at selected London, UK TB Services, will be invited to participate in this observational cohort study (target sample size=210). Patients will be asked to give blood for host transcriptomics and antibiotic plasma exposure, in addition to standard of care sputum samples for bacillary load. Antibiotic plasma concentrations will be quantified using a validated liquid chromatograph triple quadrupole mass spectrometer (LC-MS/MS) assay and sputum bacillary load by mycobacterial growth indicator tube time to positivity. Expression from a total of 35 prespecified host blood genes will be quantified using NanoString®. Antibiotic exposure, sputum bacillary load and host blood transcriptomic time series data will be analysed using nonlinear mixed-effects models. Correlations between combinations of longitudinal biomarkers and microbiological cure at the end of treatment and remaining relapse free for 1 year thereafter will be analysed using logistic regression and Cox proportional hazard models.
ETHICS AND DISSEMINATION: The observational cohort study has been approved by the UK's HRA REC (20/SW/0007). Written informed consent will be obtained. Results will be disseminated via publication, presentation and through engagement with institutes/companies developing novel anti-TB treatment combinations.

    Deep Learning to Analyze RNA-Seq Gene Expression Data

    Deep learning models are currently being applied in several areas with great success. However, their application to the analysis of high-throughput sequencing data remains a challenge for the research community, because this family of models is known to work very well on big datasets with many samples available, which is the opposite of the scenario typically found in biomedical areas. In this work, a first approach to the use of deep learning for the analysis of RNA-Seq gene expression profile data is provided. Three public cancer-related databases are analyzed using a regularized linear model (standard LASSO) as the baseline model, and two deep learning models that differ in the feature selection technique used prior to the application of a deep neural net model. The results indicate that a straightforward application of the deep net implementations available in public scientific tools, under the conditions described within this work, is not enough to outperform simpler models like LASSO. Therefore, smarter and more complex ways of incorporating prior biological knowledge into the estimation procedure of deep learning models may be necessary in order to obtain better predictive performance.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech