
    Variable Selection and Parameter Tuning in High-Dimensional Prediction

    In the context of classification using high-dimensional data such as microarray gene expression data, it is often useful to perform preliminary variable selection. For example, the k-nearest-neighbors classification procedure yields a much higher accuracy when applied to variables with high discriminatory power. Typical (univariate) variable selection methods for binary classification are, e.g., the two-sample t-statistic or the Mann-Whitney test. In small-sample settings, the classification error rate is often estimated using cross-validation (CV) or related approaches. The variable selection procedure then has to be applied anew to each training set under consideration, i.e., successively for each CV iteration. Performing variable selection based on the whole sample before the CV procedure would yield a downwardly biased error rate estimate. CV may also be used to tune parameters involved in a classification method. For instance, the penalty parameter in penalized regression or the cost parameter in support vector machines is most often selected using CV. This type of CV is usually denoted as "internal CV", in contrast to the "external CV" performed to estimate the error rate, while the term "nested CV" refers to the whole procedure embedding the two CV loops. While variable selection and parameter tuning have been widely investigated in the context of high-dimensional classification, it is still unclear how they should be combined when a classification method involves both variable selection and parameter tuning. For example, the k-nearest-neighbors method usually requires variable selection and involves a tuning parameter: the number k of neighbors. It is well known that variable selection should be repeated for each external CV iteration. But should we also repeat variable selection for each internal CV iteration, or rather perform tuning based on a fixed subset of variables? While the first variant seems more natural, it implies a huge computational expense, and its benefit in terms of error rate remains unknown. In this paper, we assess both variants quantitatively using real microarray data sets. We focus on two representative examples: k-nearest-neighbors (with k as tuning parameter) and Partial Least Squares dimension reduction followed by linear discriminant analysis (with the number of components as tuning parameter). We conclude that the more natural but computationally expensive variant with repeated variable selection does not necessarily lead to better accuracy, and we point out the potential pitfalls of both variants.
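
    As a concrete illustration of the two variants compared in this abstract, the sketch below sets up nested CV for k-nearest-neighbors in scikit-learn. It is a minimal illustration, not a reproduction of the authors' setup: the data are synthetic, SelectKBest with the ANOVA F-statistic stands in for the univariate filter, and the fold counts and grid of k values are chosen arbitrarily.

```python
# Minimal nested-CV sketch (assumptions: synthetic data via make_classification,
# SelectKBest with the ANOVA F-statistic as the univariate filter, 5 external
# and 3 internal folds, k in {1, 3, 5, 7}).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           random_state=0)
k_grid = [1, 3, 5, 7]

# Variant 1: variable selection is repeated in every internal CV split,
# because it is part of the pipeline that GridSearchCV refits on each split.
variant1 = GridSearchCV(
    Pipeline([("select", SelectKBest(f_classif, k=50)),
              ("knn", KNeighborsClassifier())]),
    {"knn__n_neighbors": k_grid}, cv=3)
err1 = 1 - cross_val_score(variant1, X, y, cv=5).mean()

# Variant 2: variables are selected once per external training set; the
# internal CV only tunes k on that fixed subset of variables.
errs2 = []
for train, test in StratifiedKFold(n_splits=5).split(X, y):
    selector = SelectKBest(f_classif, k=50).fit(X[train], y[train])
    tuner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": k_grid}, cv=3)
    tuner.fit(selector.transform(X[train]), y[train])
    errs2.append(1 - tuner.score(selector.transform(X[test]), y[test]))

print(f"repeated selection in internal CV: error = {err1:.3f}")
print(f"fixed selection per external fold: error = {np.mean(errs2):.3f}")
```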

    A model-free feature selection technique of feature screening and random forest based recursive feature elimination

    In this paper, we propose a model-free feature selection method for ultrahigh-dimensional data with massive numbers of features. It is a two-phase procedure that combines the fused Kolmogorov filter with random-forest-based recursive feature elimination (RFE) to remove model limitations and reduce computational complexity. The method is fully nonparametric and can work with various types of datasets. It has several appealing characteristics, namely accuracy, freedom from model assumptions, and computational efficiency, and it can be widely used in practical problems such as multiclass classification, nonparametric regression, and Poisson regression, among others. We show that the proposed method is selection consistent and L_2 consistent under weak regularity conditions. We further demonstrate the superior performance of the proposed method over other existing methods by simulations and real data examples.
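
    A minimal sketch of the two-phase idea under simplifying assumptions: phase one screens features with a two-sample Kolmogorov-Smirnov statistic (a simplified stand-in for the fused Kolmogorov filter, here for a binary response), and phase two runs random-forest-based RFE on the survivors. The dataset, cut-off d, and forest size are illustrative, not the authors' choices.

```python
# Two-phase sketch (assumptions: binary response, two-sample Kolmogorov-Smirnov
# screening as a stand-in for the fused Kolmogorov filter, top d = 100 features
# kept, then RF-based RFE down to 10 features).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=5000, n_informative=10,
                           random_state=0)

# Phase 1: model-free screening -- keep the d features whose class-conditional
# distributions differ most in Kolmogorov-Smirnov distance.
d = 100
ks_stat = np.array([ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
                    for j in range(X.shape[1])])
screened = np.argsort(ks_stat)[-d:]

# Phase 2: random-forest-based recursive feature elimination on the screened
# set, dropping 20% of the remaining features at each step.
rfe = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
          n_features_to_select=10, step=0.2)
rfe.fit(X[:, screened], y)
selected = screened[rfe.support_]
print("selected feature indices:", np.sort(selected))
```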

    Practical Methods Validation For Variables Selection In The High Dimension Data: Application For Three Metabolomics Datasets

    Background: Variable selection on high-throughput metabolomics data is becoming inevitable for extracting relevant information, since such data often exhibit a high degree of multicollinearity and, as a result, lead to severely ill-conditioned problems. In both the supervised classification framework and machine learning algorithms, one solution is to reduce the data dimensionality, either by performing feature selection or by introducing artificial variables, in order to enhance the generalization performance of a given algorithm and to gain some insight into the concept to be learned. Objective: The main objective of this study is to select a set of features from the thousands of variables in a dataset. We divide this objective into two parts: (1) to identify small sets of features (fewer than 15 features) that could be used for diagnostic purposes in clinical practice, called low-level analysis, and (2) to identify a larger set of features (around 50-100 features), called middle-level analysis, which involves obtaining a set of variables that are related to the outcome of interest. In addition, we compare the performance of several proposed feature selection techniques for metabolomics studies. Method: This study uses four techniques, namely two machine learning techniques (RSVM and RFFS) and two supervised classification techniques (PLS-DA VIP and sPLS-DA), to classify three datasets, i.e., human urine, rat urine, and rat plasma, each containing samples from two classes. Results: RSVM-LOO always leads in accuracy compared with the other two cross-validation methods, i.e., bootstrap and N-fold. However, this RSVM result is not much better overall, since RFFS achieves higher accuracy. On the other hand, PLS-DA and sPLS-DA reach good performance for both variability explanation and predictive ability. In the biological sense, RFFS and PLS-DA VIP perform well by finding more selected features in common with a previous metabolomics study than RSVM and sPLS-DA do. This is also confirmed by the statistical comparison, in which RFFS and PLS-DA lead in the similarity percentage of selected features. Furthermore, RFFS and PLS-DA VIP perform better in that they select three of the five metabolites confirmed by the previous metabolomics study, which RSVM and sPLS-DA could not achieve. Conclusion: RFFS appears to be the most appropriate technique for feature selection, particularly in low-level analysis, where a small set of features is often desirable. Both PLS-DA VIP and sPLS-DA lead to good performance for both variability explanation and predictive ability, but PLS-DA VIP is slightly better in terms of biological insight. Besides being limited to two-class problems, RSVM unfortunately could not achieve good performance in either statistical or biological interpretation.
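
    Of the four techniques compared, PLS-DA with VIP scores is the most straightforward to sketch. The snippet below uses scikit-learn's PLSRegression on a 0/1 class label as a stand-in for a dedicated PLS-DA implementation and computes VIP from the standard formula; the data and the low-level cut-off of 15 features are illustrative assumptions, not the study's datasets.

```python
# PLS-DA VIP sketch (assumptions: binary labels coded 0/1, PLSRegression with
# 3 components, standard VIP formula, keep the 15 variables with largest VIP).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=60, n_features=500, n_informative=15,
                           random_state=0)
pls = PLSRegression(n_components=3).fit(X, y)

def vip_scores(model):
    # VIP_j = sqrt(p * sum_a SS_a * (w_ja / ||w_a||)^2 / sum_a SS_a), where
    # SS_a is the amount of y-variance explained by component a.
    t, w, q = model.x_scores_, model.x_weights_, model.y_loadings_
    p = w.shape[0]
    ss = np.sum(t ** 2, axis=0) * q.ravel() ** 2   # explained SS per component
    w_norm = w / np.linalg.norm(w, axis=0)         # normalised weight vectors
    return np.sqrt(p * (w_norm ** 2 @ ss) / ss.sum())

vip = vip_scores(pls)
# "Low-level" selection: a handful of variables with the largest VIP scores.
top_features = np.argsort(vip)[::-1][:15]
print("top VIP-ranked features:", top_features)
```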

    (Non) Linear Regression Modeling

    We will study causal relationships of a known form between random variables. Given a model, we distinguish one or more dependent (endogenous) variables Y = (Y1, ..., Yl), l ∈ N, which are explained by a model, and independent (exogenous, explanatory) variables X = (X1, ..., Xp), p ∈ N, which explain or predict the dependent variables by means of the model. Such relationships and models are commonly referred to as regression models.
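
    As a tiny worked example of this setup in its linear special case, the sketch below generates one dependent variable from p = 3 explanatory variables via Y = Xβ + ε and recovers β by least squares; the dimensions and coefficients are illustrative assumptions.

```python
# Linear regression sketch (assumptions: one dependent variable, p = 3
# explanatory variables, Y = X @ beta + noise, least-squares fit).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))                    # independent (explanatory) variables
beta = np.array([1.5, -2.0, 0.5])              # true model coefficients
Y = X @ beta + rng.normal(scale=0.1, size=n)   # dependent variable

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("estimated coefficients:", beta_hat)
```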

    A New Method for Preliminary Identification of Gene Regulatory Networks from Gene Microarray Cancer Data Using Ridge Partial Least Squares with Recursive Feature Elimination and Novel Brier and Occurrence Probability Measures
