An Information Theoretic Approach to Quantify the Stability of Feature Selection and Ranking Algorithms
Feature selection is a key step when dealing with high-dimensional data. In particular, these techniques simplify the process of knowledge discovery from data in fields like biomedicine, bioinformatics, genetics or chemometrics by selecting the most relevant features out of the noisy, redundant and irrelevant ones. A problem that arises in many of these applications is that the outcome of the feature selection algorithm is not stable: small variations in the data may yield very different feature rankings. Assessing the stability of these methods becomes an important issue in the previously mentioned situations, but it has long been overlooked in the literature. We propose an information-theoretic approach based on the Jensen-Shannon divergence to quantify this robustness. Unlike other stability measures, this metric is suitable for different algorithm outcomes: full ranked lists, top-k lists (feature subsets), as well as the lesser-studied partial ranked lists that keep the k best-ranked elements. This generalized metric quantifies the difference among a whole set of lists of the same size, following a probabilistic approach and being able to give more importance to disagreements that appear at the top of the list. Moreover, it possesses desirable properties for a stability metric, including correction for chance, upper/lower bounds, and conditions for a deterministic selection. We illustrate the use of this stability metric with data generated in a fully controlled way, and compare it with popular metrics including Spearman's rank correlation and Kuncheva's index on feature ranking and selection outcomes, respectively.
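The abstract does not give the authors' exact construction, but the core idea can be sketched: map each ranked list to a probability distribution that puts more mass on top positions, then measure disagreement with the Jensen-Shannon divergence. The weighting scheme and function names below (`rank_to_distribution`, `js_divergence`) are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_to_distribution(ranking, n_features):
    """Turn a ranking (best feature first) into a distribution that gives
    position i mass proportional to 1/i, so top-of-list disagreements
    weigh more than disagreements near the bottom."""
    probs = np.zeros(n_features)
    for position, feature in enumerate(ranking, start=1):
        probs[feature] = 1.0 / position
    return probs / probs.sum()

# Two near-identical rankings disagree less than a ranking and its reversal.
r1 = rank_to_distribution([0, 1, 2, 3], 4)
r2 = rank_to_distribution([1, 0, 2, 3], 4)
r3 = rank_to_distribution([3, 2, 1, 0], 4)
assert js_divergence(r1, r2) < js_divergence(r1, r3)
```

The top-weighted mapping is one common choice; any monotonically decreasing position weighting would preserve the property that swaps near the top cost more than swaps near the bottom.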
Statistical model for reproducibility in ranking-based feature selection
The stability of feature subset selection algorithms has become crucial in real-world problems due to the need for consistent experimental results across different replicates. Specifically, in this paper, we analyze the reproducibility of ranking-based feature subset selection algorithms. When applied to data, this family of algorithms builds an ordering of variables in terms of a measure of relevance. In order to quantify the reproducibility of ranking-based feature subset selection algorithms, we propose a model that takes into account all the different-sized subsets of top-ranked features. The model is fitted to data through the minimization of an error function related to the expected values of Kuncheva's consistency index for those subsets. Once fitted, the model provides practical information about the feature subset selection algorithm analyzed, such as a measure of its expected reproducibility or its estimated area under the receiver operating characteristic curve regarding the identification of relevant features. We test our model empirically using both synthetic data and a wide range of real data. The results show that our proposal can be used to analyze ranking-based feature subset selection algorithms in terms of their reproducibility and their performance.
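Kuncheva's consistency index, around which the model's error function is built, has a standard closed form. A minimal sketch follows (the paper's model-fitting procedure itself is not reproduced here):

```python
def kuncheva_index(a, b, n):
    """Kuncheva's consistency index between two feature subsets of equal
    size k drawn from n features: (r*n - k**2) / (k*(n - k)), where
    r = |a & b|. It corrects for the overlap expected by chance alone,
    so a chance-level overlap scores near 0 and identical subsets score 1."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k or not 0 < k < n:
        raise ValueError("subsets must share size k with 0 < k < n")
    r = len(a & b)
    return (r * n - k ** 2) / (k * (n - k))

print(kuncheva_index({0, 1, 2}, {0, 1, 2}, n=10))  # identical subsets -> 1.0
print(kuncheva_index({0, 1, 2}, {3, 4, 5}, n=10))  # disjoint -> negative
```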
Optimising use of 4D-CT phase information for radiomics analysis in lung cancer patients treated with stereotactic body radiotherapy
From IOP Publishing via Jisc Publications Router. History: received 2021-03-17, oa-requested 2021-04-07, accepted 2021-04-21, epub 2021-05-24, open-access 2021-05-24, ppub 2021-06-07. Publication status: Published. Funder: Cancer Research UK; doi: https://doi.org/10.13039/501100000289; Grant(s): C147/A25254. Abstract: Purpose. 4D-CT is routine imaging for lung cancer patients treated with stereotactic body radiotherapy. No studies have investigated optimal 4D phase selection for radiomics. We aim to determine how phase data should be used to identify prognostic biomarkers for distant failure, and to test whether stability assessment is required. A phase selection approach will be developed to aid studies with different 4D protocols and to account for patient differences. Methods. 186 features were extracted from the tumour and peritumour on all phases for 258 patients. Feature values were selected from phase features using four methods: (A) mean across phases, (B) median across phases, (C) 50% phase, and (D) the most stable phase (closest in value to its two neighbours), coined personalised selection. Four levels of stability assessment were also analysed, with inclusion of: (1) all features, (2) features stable across all phases, (3) features stable across a phase and its neighbour phases, and (4) features averaged over neighbour phases. Clinical-radiomics models were built for twelve combinations of feature type and assessment method. Model performance was assessed by the concordance index (c-index) and the fraction of new information from radiomic features. Results. The most stable phase spanned the whole range but was most often near exhale. All radiomic signatures provided new information for distant failure prediction. The personalised model had the highest c-index (0.77), and 58% of the new information was provided by radiomic features when no stability assessment was performed. Conclusion.
The most stable phase varies per patient, and selecting it improves model performance compared with standard methods. We advise that the single most stable phase be determined by minimising feature differences to neighbour phases. Stability assessment over all phases decreases performance by excessively removing features; instead, averaging of neighbour phases should be used when stability is of concern. The models suggest that higher peritumoural intensity predicts distant failure.
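Method (D), the personalised most-stable-phase selection, can be sketched as follows for a single feature. The circular wrap-around of the breathing cycle is an assumption on our part; the abstract only states that the chosen phase is the one closest in value to its two neighbours.

```python
import numpy as np

def most_stable_phase(values):
    """Given one feature's value at each breathing phase (assumed circular
    order, as in a 4D-CT respiratory cycle), return the index of the phase
    whose value is closest to its two neighbours, i.e. the phase with the
    minimal summed absolute difference to the neighbouring phases."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    diffs = [abs(values[i] - values[(i - 1) % n]) +
             abs(values[i] - values[(i + 1) % n]) for i in range(n)]
    return int(np.argmin(diffs))

# Phase 1 sits between two similar values; phase 3 is an outlier.
print(most_stable_phase([1.0, 1.1, 1.05, 3.0, 1.2]))  # -> 1
```

In a full pipeline this selection would be repeated per feature and per patient, which is what makes the method "personalised".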
Novel analysis–forecast system based on multi-objective optimization for air quality index
© 2018 Elsevier Ltd. The air quality index (AQI) is an important indicator of air quality. Owing to the randomness and non-stationarity inherent in AQI, it is still a challenging task to establish a reasonable analysis–forecast system for AQI. Previous studies primarily focused on enhancing either forecasting accuracy or stability and failed to improve both aspects simultaneously, leading to unsatisfactory results. In this study, a novel analysis–forecast system is proposed that consists of complexity analysis, data preprocessing, and optimize–forecast modules and addresses the problems of air quality monitoring. The proposed system performs a complexity analysis of the original series based on sample entropy; preprocesses the data using a novel feature selection model that integrates a decomposition technique and an optimization algorithm to remove noise and select the optimal input structure; and then forecasts hourly AQI series using a modified least squares support vector machine optimized by a multi-objective multi-verse optimization algorithm. Experiments based on datasets from eight major cities in China demonstrated that the proposed system can simultaneously obtain high accuracy and strong stability and is thus efficient and reliable for air quality monitoring.
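Sample entropy, used here for the complexity-analysis module, is a standard measure; a minimal (unoptimised) sketch follows. The parameters m=2 and r=0.2 are common defaults, not values taken from the paper.

```python
import numpy as np

def sample_entropy(series, m=2, r=0.2):
    """Sample entropy of a 1-D series: -ln(A/B), where B is the number of
    pairs of length-m templates within tolerance r * std(series) (Chebyshev
    distance, self-matches excluded) and A is the same count for length m+1.
    Higher values indicate a less predictable, more complex series."""
    x = np.asarray(series, dtype=float)
    tol = r * x.std()

    def count_matches(length):
        templates = np.array([x[i:i + length]
                              for i in range(len(x) - length + 1)])
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if np.max(np.abs(templates[i] - templates[j])) <= tol:
                    matches += 1
        return matches

    a, b = count_matches(m + 1), count_matches(m)
    return -np.log(a / b) if a > 0 and b > 0 else float("inf")

# A smooth periodic series is more predictable than white noise,
# so its sample entropy should be lower.
periodic = np.sin(np.linspace(0, 8 * np.pi, 200))
noise = np.random.default_rng(0).standard_normal(200)
assert sample_entropy(periodic) < sample_entropy(noise)
```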
Stable Feature Selection for Biomarker Discovery
Feature selection techniques have long been used as the workhorse in biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered. It is only recently that this issue has received more and more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic for convenient reference; (2) categorizing existing methods under an expandable framework for future research and development.
Evolutionary computation-based feature selection for finding a stable set of features in high-dimensional data
Evolutionary Computation (EC) algorithms have proved to work well for feature selection because they are powerful search techniques and can produce multiple good solutions. However, they suffer from some limitations in real-world applications. Firstly, ECs require high computation time as they evaluate many solutions at each iteration. Secondly, a classifier is usually used as their fitness function, which causes the selected subset to perform well only on the utilised classifier (i.e. classifier bias). Lastly, ECs, as stochastic search methods, return a different final subset in different runs, which poses a problem for finding a stable set of features (i.e. the stability issue). To address the computation-time and classifier-bias limitations, this thesis proposes a new two-stage selection approach called filter/filter, in which two filter feature selection algorithms are combined. In the first stage, a ranking algorithm forms a reduced dataset by selecting the most informative features from the original dataset. In the second stage, the reduced dataset is fed to a novel EC algorithm to select the final feature subset. This new EC algorithm is a Tabu search hybridised with an Asexual Genetic Algorithm, called TAGA. TAGA benefits from new search components and a solution representation that can effectively reduce computation time. To select a classifier-unbiased final subset, a statistical criterion is used as the fitness function, which evaluates the subset independently of any classifier. Experiments show that the proposed filter/filter approach requires an acceptable computation time and selects more classifier-unbiased features compared to state-of-the-art methods. To find a stable set of features, a novel Generalisation Power Index (GPI) is proposed to analyse the generalisation power of the final subsets of an EC over several runs. Generalisation power refers to the performance capability of a subset over a wide range of classifiers.
Computation results confirm that GPI is able to find a stable set of features which achieves near-optimal accuracy when used to train various classifiers. To examine the suitability of the proposed methods for real-world applications, the filter/filter approach and GPI are integrated to select a stable set of features for the METABRIC breast cancer subtype classification problem. Experimental results show that this integration not only addresses the limitations of ECs for a real-world biomedical feature selection problem but also performs better than alternative methods.
Determining appropriate approaches for using data in feature selection
Feature selection is increasingly important in data analysis and machine learning in the big-data era. However, how to use the data in feature selection, i.e. using either ALL or PART of a dataset, has become a serious and tricky issue. Whilst the conventional practice of using all the data in feature selection may lead to selection bias, using part of the data may, on the other hand, lead to underestimating the relevant features under some conditions. This paper investigates these two strategies systematically in terms of reliability and effectiveness, and then determines their suitability for datasets with different characteristics. The reliability is measured by the Average Tanimoto Index and the Inter-method Average Tanimoto Index, and the effectiveness is measured by the mean generalisation accuracy of classification. The computational experiments are carried out on ten real-world benchmark datasets and fourteen synthetic datasets. The synthetic datasets are generated with a pre-set number of relevant features, varied numbers of irrelevant features and instances, and different levels of added noise. The results indicate that the PART approach is more effective in reducing the bias when the size of a dataset is small but starts to lose its advantage as the dataset size increases.
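The Tanimoto index between two selected feature subsets is the standard Jaccard similarity, and the Average Tanimoto Index (ATI) averages it over all pairs of selections; a minimal sketch follows (the paper's exact averaging protocol may differ):

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def average_tanimoto_index(subsets):
    """Average pairwise Tanimoto similarity over the subsets selected on
    different data samples; 1.0 means perfectly stable selection."""
    pairs = list(combinations(subsets, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Three selection runs, two identical and one differing in a single feature:
print(average_tanimoto_index([{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]))  # -> 0.666...
```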