5 research outputs found
On feature selection protocols for very low-sample-size data
High-dimensional data with very few instances are typical in many application domains. Selecting a highly discriminative subset of the original features is often the main interest of the end user. The widely-used feature selection protocol for such type of data consists of two steps. First, features are selected from the data (possibly through cross-validation), and, second, a cross-validation protocol is applied to test a classifier using the selected features. The selected feature set and the testing accuracy are then returned to the user. For the lack of a better option, the same low-sample-size dataset is used in both steps. Questioning the validity of this protocol, we carried out an experiment using 24 high-dimensional datasets, three feature selection methods and five classifier models. We found that the accuracy returned by the above protocol is heavily biased, and therefore propose an alternative protocol which avoids the contamination by including both steps in a single cross-validation loop. Statistical tests verify that the classification accuracy returned by the proper protocol is significantly closer to the true accuracy (estimated from an independent testing set) compared to that returned by the currently favoured protocol.project RPG-2015-188 funded by The Leverhulme Trust, UK and by project TIN2015-67534-P (MINECO/FEDER, UE) funded by the Ministerio de EconomÃa y Competitividad of the Spanish Government and European Union FEDER fund
Recommended from our members
Analysis for warning factors of type 2 diabetes mellitus complications with Markov blanket based on a Bayesian network model
Background and objective
Type 2 diabetes mellitus (T2DM) complications seriously affect the quality of life and could not be cured completely. Actions should be taken for prevention and self-management. Analysis of warning factors is beneficial for patients, on which some previous studies focused. They generally used the professional medical test factors or complete factors to predict and prevent, but it was inconvenient and impractical for patients to self-manage. With this in mind, this study built a Bayesian network (BN) model, from the perspective of diabetic patients’ self-management and prevention, to predict six complications of T2DM using the selected warning factors which patients could have access from medical examination. Furthermore, the model was analyzed to explore the relationships between physiological variables and T2DM complications, as well as the complications themselves. The model aims to help patients with T2DM self-manage and prevent themselves from complications.
Methods
The dataset was collected from a well-known data center called the National Health Clinical Center between 1st January 2009 and 31st December 2009. After preprocess and impute the data, a BN model merging expert knowledge was built with Bootstrap and Tabu search algorithm. Markov Blanket (MB) was used to select the warning factors and predict T2DM complications. Moreover, a Bayesian network without prior information (BN-wopi) model learned using 10-fold cross-validation both in structure and in parameters was added to compare with other classifiers learned using 10-fold cross-validation fairly. The warning factors were selected according the structure learned in each fold and were used to predict. Finally, the performance of two BN models using warning features were compared with Naïve Bayes model, Random Forest model, and C5.0 Decision Tree model, which used all features to predict. Besides, the validation parameters of the proposed model were also compared with those in existing studies using some other variables in clinical data or biomedical data to predict T2DM complications.
Results
Experimental results indicated that the BN models using warning factors performed statistically better than their counterparts using all other variables in predicting T2DM complications. In addition, the proposed BN model were effective and significant in predicting diabetic nephropathy (DN) (AUC: 0.831), diabetic foot (DF) (AUC: 0.905), diabetic macrovascular complications (DMV) (AUC: 0.753) and diabetic ketoacidosis (DK) (AUC: 0.877) with the selected warning factors compared with other experiments.
Conclusions
The warning factors of DN, DF, DMV, and DK selected by MB in this research might be able to help predict certain T2DM complications effectively, and the proposed BN model might be used as a general tool for prevention, monitoring, and self-management