Variable Selection and Parameter Tuning in High-Dimensional Prediction
In the context of classification using high-dimensional data such as microarray gene expression data, it is often useful to perform preliminary variable selection. For example, the k-nearest-neighbors classification procedure yields a much higher accuracy when applied to variables with high discriminatory power. Typical (univariate) variable selection methods for binary classification are, e.g., the two-sample t-statistic or the Mann-Whitney test.
In small sample settings, the classification error rate is often estimated using cross-validation (CV) or related approaches. The variable selection procedure then has to be applied anew for each considered training set, i.e., successively for each CV iteration. Performing variable selection based on the whole sample before the CV procedure would yield a downwardly biased error rate estimate. CV may also be used to tune parameters involved in a classification method. For instance, the penalty parameter in penalized regression or the cost parameter in support vector machines is most often selected using CV. This type of CV is usually denoted as "internal CV" in contrast to the "external CV" performed to estimate the error rate, while the term "nested CV" refers to the whole procedure embedding two CV loops.
While variable selection and parameter tuning have been widely investigated in the context of high-dimensional classification, it is still unclear how they should be combined if a classification method involves both variable selection and parameter tuning. For example, the k-nearest-neighbors method usually requires variable selection and involves a tuning parameter: the number k of neighbors. It is well-known that variable selection should be repeated for each external CV iteration. But should we also repeat variable selection for each internal CV iteration, or rather perform tuning based on a fixed subset of variables? While the first variant seems more natural, it implies a huge computational expense, and its benefit in terms of error rate remains unknown.
In this paper, we assess both variants quantitatively using real microarray data sets. We focus on two representative examples: k-nearest-neighbors (with k as tuning parameter) and Partial Least Squares dimension reduction followed by linear discriminant analysis (with the number of components as tuning parameter). We conclude that the more natural but computationally expensive variant with repeated variable selection does not necessarily lead to better accuracy, and we point out the potential pitfalls of both variants.
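The distinction between the two variants can be sketched with scikit-learn on synthetic data (a minimal illustration, not the authors' code): placing the variable filter inside the pipeline repeats selection in every internal CV fold, whereas selecting once before the grid search would correspond to tuning on a fixed subset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a microarray data set: many noise features, few informative.
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)

# Variant with repeated selection: the t-test-like filter is a pipeline step,
# so it is refit on every internal training split during tuning of k.
# (The fixed-subset variant would instead apply SelectKBest once, before
# GridSearchCV, and tune k on the reduced matrix.)
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("knn", KNeighborsClassifier())])
inner = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]},
                     cv=StratifiedKFold(5))

# External CV around the internal tuning loop gives the nested CV error rate.
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(5))
print("nested CV error rate: %.3f" % (1 - outer_scores.mean()))
```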
A model-free feature selection technique of feature screening and random forest based recursive feature elimination
In this paper, we propose a model-free feature selection method for ultra-high-dimensional data with massive numbers of features. It is a two-phase procedure that combines the fused Kolmogorov filter with random-forest-based RFE to remove model limitations and reduce computational complexity. The method is fully nonparametric and can work with various types of datasets. It has several appealing characteristics, namely accuracy, model-freeness, and computational efficiency, and can be widely used in practical problems such as multiclass classification, nonparametric regression, and Poisson regression, among others. We show that the proposed method is selection consistent under weak regularity conditions. We further demonstrate the superior performance of the proposed method over other existing methods through simulations and real data examples.
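The two-phase idea can be sketched as follows (our own simplification with scikit-learn and SciPy, not the paper's implementation; for a binary response, Kolmogorov-filter screening amounts to ranking features by the two-sample KS statistic):

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=1000, n_informative=8,
                           random_state=0)
n = len(y)

# Phase 1: model-free marginal screening. Rank each feature by the KS
# statistic between its two class-conditional distributions and keep the
# top n / log(n) features (a common screening set size).
ks = np.array([ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
               for j in range(X.shape[1])])
keep = np.argsort(ks)[::-1][:int(n / np.log(n))]

# Phase 2: random-forest-based recursive feature elimination on the
# screened features only, which keeps the RFE loop computationally cheap.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=10).fit(X[:, keep], y)
selected = keep[rfe.support_]
print("selected features:", sorted(selected))
```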
Modelling prognostic trajectories in Alzheimer's disease
Progression to dementia due to Alzheimer's disease (AD) is a long and protracted process that involves multiple pathways of disease pathophysiology. Predicting these dynamic changes has major implications for timely and effective clinical management in AD. There are two reasons why at present we lack appropriate tools to make such predictions. First, a key feature of AD is the interactive nature of the relationships between biomarkers, such as accumulation of β-amyloid (a peptide that builds plaques between nerve cells), tau (a protein found in the axons of nerve cells) and widespread neurodegeneration. Current models fail to capture these relationships because they are unable to successfully reduce the high dimensionality of biomarkers while exploiting informative multivariate relationships. Second, current models focus on simply predicting in a binary manner whether an individual will develop dementia due to AD or not, without informing clinicians about the predicted disease trajectory. This can result in administering inefficient treatment plans and hindering appropriate stratification for clinical trials. In this thesis, we overcome these challenges by using applied machine learning to build predictive models of patient disease trajectories in the earliest stages of AD. Specifically, to exploit the multi-dimensionality of biomarker data, we used a novel feature-generation methodology, Partial Least Squares regression with recursive feature elimination (PLSr-RFE). This hybrid feature selection and feature construction method captures co-morbidities in cognition and pathophysiology, yielding an index of Alzheimer's disease atrophy from structural MRI. We validated our choice of biomarker and the efficacy of our methodology by showing that the learnt pattern of grey matter atrophy is highly predictive of tau accumulation in an independent sample.
Next, to go beyond predicting binary outcomes and derive individualised prognostic scores of cognitive decline due to AD, we used a novel trajectory modelling approach (Generalised Metric Learning Vector Quantization with scalar projection) that mines multimodal data from large AD research cohorts. Using this approach, we derive individualised prognostic scores of cognitive decline due to AD, revealing interactive cognitive and biological factors that improve prediction accuracy. Next, we extended our machine learning framework to classify and stage early AD individuals based on future pathological tau accumulation. Our results show that the characteristic spreading pattern of tau in early AD can be predicted by baseline biomarkers, particularly when stratifying groups using multimodal data. Further, we showed that our prognostic index predicts individualised rates of future tau accumulation with high accuracy and regional specificity in an independent sample of cognitively unimpaired individuals. Overall, our work used machine learning to combine continuous information from AD biomarkers, predicting pathophysiological changes at different stages of the AD cascade. The approaches presented in this thesis provide an excellent framework to support personalised clinical interventions and guide effective drug discovery trials.
Practical Methods Validation For Variables Selection In The High Dimension Data: Application For Three Metabolomics Datasets
Background: Variable selection on high-throughput metabolomics data has become inevitable for selecting relevant information, since such data often exhibit a high degree of multicollinearity and, as a result, lead to severely ill-conditioned problems. In both the supervised classification framework and machine learning algorithms, one solution is to reduce the data dimensionality, either by performing feature selection or by introducing artificial variables, in order to enhance the generalization performance of a given algorithm and to gain some insight into the concept to be learned.
Objective: The main objective of this study is to select a set of features from the thousands of variables in a dataset. We divide this objective into two parts: (1) identifying small sets of features (fewer than 15) that could be used for diagnostic purposes in clinical practice, called low-level analysis; and (2) identifying a larger set of features (around 50-100) that are related to the outcome of interest, called middle-level analysis. In addition, we compare the performance of several proposed feature selection techniques for metabolomics studies.
Method: This study evaluates four proposed techniques, namely two machine learning techniques (RSVM and RFFS) and two supervised classification techniques (PLS-DA VIP and sPLS-DA), to classify our three datasets (human urine, rat urine, and rat plasma), each of which contains two sample classes.
Results: RSVM-LOO consistently achieves higher accuracy than the other two cross-validation methods, bootstrap and N-fold. However, the RSVM results are not the best overall, since RFFS achieves higher accuracy. On the other hand, PLS-DA and sPLS-DA reach good performance in both variability explanation and predictive ability. In the biological sense, RFFS and PLS-DA VIP outperform RSVM and sPLS-DA by recovering more of the features commonly selected in previous metabolomics studies. This is also confirmed by the statistical comparison, in which RFFS and PLS-DA lead in the percentage of similarly selected features. Furthermore, RFFS and PLS-DA VIP perform better in that they select three of the five metabolites confirmed in a previous metabolomics study, which RSVM and sPLS-DA could not achieve.
Conclusion: RFFS appears to be the most appropriate technique for feature selection, particularly in low-level analysis, where small feature sets are often desirable. Both PLS-DA VIP and sPLS-DA achieve good performance in terms of variability explanation and predictive ability, but PLS-DA VIP is slightly better in terms of biological insight. RSVM, besides being limited to two-class problems, unfortunately could not achieve good performance in either statistical or biological interpretation.
(Non) Linear Regression Modeling
We will study causal relationships of a known form between random variables. Given a model, we distinguish one or more dependent (endogenous) variables Y = (Y1, ..., Yl), l ∈ N, which are explained by the model, and independent (exogenous, explanatory) variables X = (X1, ..., Xp), p ∈ N, which explain or predict the dependent variables by means of the model. Such relationships and models are commonly referred to as regression models.
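As a toy illustration of such a model of known form (our own example, assuming NumPy and SciPy), consider a single explanatory variable X and the nonlinear regression model Y = a·exp(b·X) + noise, with parameters (a, b) estimated by least squares:

```python
import numpy as np
from scipy.optimize import curve_fit

# Generate data from the known model y = a * exp(b * x) with a=2.0, b=0.5.
rng = np.random.default_rng(0)
x = np.linspace(0, 2, 50)
y = 2.0 * np.exp(0.5 * x) + rng.normal(0, 0.05, x.size)

# Nonlinear least squares recovers the parameters of the assumed form.
popt, _ = curve_fit(lambda x, a, b: a * np.exp(b * x), x, y, p0=(1.0, 1.0))
print("estimated (a, b):", popt.round(2))
```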
A New Method for Preliminary Identification of Gene Regulatory Networks from Gene Microarray Cancer Data Using Ridge Partial Least Squares with Recursive Feature Elimination and Novel Brier and Occurrence Probability Measures