
    Feature subset selection and ranking for data dimensionality reduction

    A new unsupervised forward orthogonal search (FOS) algorithm is introduced for feature selection and ranking. In the new algorithm, features are selected in a stepwise way, one at a time, by estimating the capability of each specified candidate feature subset to represent the overall features in the measurement space. A squared correlation function is employed as the criterion to measure the dependency between features, which makes the new algorithm easy to implement. The forward orthogonalization strategy, which combines good effectiveness with high efficiency, enables the new algorithm to produce efficient feature subsets with a clear physical interpretation.
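The stepwise, squared-correlation-driven selection described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and every implementation detail (thresholds, deflation order) are assumptions.

```python
import numpy as np

def fos_feature_ranking(X, k):
    """Rank k features by unsupervised forward orthogonal search (sketch).

    At each step, pick the candidate whose summed squared correlation
    with all (orthogonalized) features is largest, then orthogonalize
    every remaining feature against the chosen one."""
    X = np.asarray(X, dtype=float)
    R = X - X.mean(axis=0)        # centre: correlation = normalized inner product
    selected = []
    for _ in range(k):
        norms = np.sum(R ** 2, axis=0)
        norms[norms < 1e-12] = np.inf      # ignore numerically exhausted columns
        corr2 = (R.T @ R) ** 2 / np.outer(norms, norms)
        score = corr2.sum(axis=1)          # capability to represent all features
        score[selected] = -np.inf          # never reselect a feature
        j = int(np.argmax(score))
        selected.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)         # Gram-Schmidt deflation
    return selected
```

The deflation step is what gives the selected subset its claimed efficiency: once a feature is chosen, only the variance it does not explain remains in play for later steps.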


    A multiple sequential orthogonal least squares algorithm for feature ranking and subset selection

    High-dimensional data analysis involving a large number of variables or features is commonly encountered in multiple regression and multivariate pattern recognition. It has been noted that in many cases not all the original variables are necessary for characterizing the overall features; more often, only a subset of a small number of significant variables is required. The detection of significant variables from a library consisting of all the original variables is therefore a key and challenging step for dimensionality reduction. Principal component analysis is a useful tool for dimensionality reduction. Principal components, however, suffer from two main deficiencies: they always involve all the original variables, and they are usually difficult to interpret physically. This study introduces a new multiple sequential orthogonal least squares algorithm for feature ranking and subset selection. The new method detects in a stepwise way the capability of each candidate feature to recover the first few principal components. At each step, only the significant variable with the strongest capability to represent the first few principal components is selected. Unlike principal components, which carry no clear physical meaning, features selected by the new method preserve the original measurement meanings.
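The key idea, greedily picking original variables by how well each recovers the leading principal components, can be sketched as below. All names and details are assumed for illustration; the paper's orthogonal least squares formulation is more elaborate than this plain greedy deflation.

```python
import numpy as np

def select_by_pc_recovery(X, n_components=2, n_select=2):
    """Greedy pick of original variables that best recover the
    leading principal components (illustrative sketch)."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    R = U[:, :n_components] * s[:n_components]   # PC scores, deflated as we go
    selected = []
    for _ in range(n_select):
        score = np.full(Xc.shape[1], -np.inf)
        for j in range(Xc.shape[1]):
            if j in selected:
                continue
            x = Xc[:, j]
            denom = x @ x
            if denom > 1e-12:
                # energy of the remaining PC scores explained by variable j
                score[j] = np.sum((x @ R) ** 2) / denom
        j = int(np.argmax(score))
        selected.append(j)
        q = Xc[:, j] / np.linalg.norm(Xc[:, j])
        R = R - np.outer(q, q @ R)               # remove the explained part
    return selected
```

Because the output is a list of column indices into the original data matrix, the selected features keep their measurement meanings, which is exactly the advantage over the principal components themselves.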

    Efficient and Scalable Techniques for Multivariate Time Series Analysis and Search

    Innovation and advances in technology have led to the growth of time series data at a phenomenal rate in many applications. Query processing and the analysis of time series data have been studied extensively, and numerous solutions have been proposed. In this research, we focus on multivariate time series (MTS) and devise techniques for high-dimensional and voluminous MTS data. The success of such techniques relies on effective dimensionality reduction in a preprocessing step. Feature selection has often been used as a dimensionality reduction technique; it helps identify a subset of features that captures most characteristics of the data. We propose a more effective feature subset selection technique, termed Weighted Scores (WS), based on statistics drawn from the Principal Component Analysis (PCA) of the input MTS data matrix. The technique reduces the dimensionality of the data while retaining and ranking its most influential features. We then consider feature grouping and develop a technique termed FRG (Feature Ranking and Grouping) to improve the effectiveness of our technique in sparse vector frameworks. We also developed a PCA-based MTS representation technique, M2U (Multivariate to Univariate transformation), which transforms an MTS with a large number of variables into a univariate signal prior to performing downstream pattern recognition tasks such as seeking correlations within the set. In related research, we study the similarity search problem for MTS and develop a novel correlation-based method for standard MTS, ESTMSS (Efficient and Scalable Technique for MTS Similarity Search), which uses randomized dimensionality reduction and a threshold-based correlation computation. The results of our numerous experiments on real benchmark data indicate the effectiveness of our methods. The technique improves computation time by at least an order of magnitude compared to other techniques, and affords a large reduction in memory requirement while providing comparable accuracy and precision results in large-scale frameworks.
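A PCA-statistics-based ranking in the spirit of Weighted Scores can be sketched as follows: weight each feature's loading magnitudes by the variance each component explains, then rank. This is an assumption-labeled illustration; the exact WS weighting used in the thesis may differ.

```python
import numpy as np

def weighted_scores(X):
    """Rank features by PCA loadings weighted by explained variance
    (a sketch of a Weighted Scores-style criterion, details assumed)."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = s ** 2 / np.sum(s ** 2)     # variance explained per component
    ws = np.abs(Vt).T @ var_ratio           # weighted loading magnitude per feature
    return np.argsort(ws)[::-1]             # most influential feature first
```

Weighting by explained variance means a feature loading heavily on a dominant component outranks one loading on a minor component, which matches the goal of retaining the most influential features.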

    Automatic Selection of Molecular Descriptors using Random Forest: Application to Drug Discovery

    The optimal selection of chemical features (molecular descriptors) is an essential pre-processing step for the efficient application of computational intelligence techniques in virtual screening for identification of bioactive molecules in drug discovery. The selection of molecular descriptors has a key influence on the accuracy of affinity prediction. In order to improve this prediction, we examined a Random Forest (RF)-based approach to automatically select molecular descriptors of training data for ligands of kinases, nuclear hormone receptors, and other enzymes. The reduction of features used during prediction dramatically reduces the computing time over existing approaches and consequently permits the exploration of much larger sets of experimental data. To test the validity of the method, we compared the results of our approach with those obtained using manual feature selection in our previous study (Perez-Sanchez et al., 2014). The main novelty of this work in the field of drug discovery is the use of RF in two different ways: feature ranking and dimensionality reduction, and classification using the automatically selected feature subset. Our RF-based method outperforms classification results provided by Support Vector Machine (SVM) and Neural Networks (NN) approaches.
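The dual use of Random Forest described above, first for descriptor ranking, then for classification on the reduced subset, can be sketched with scikit-learn. The function name, `top_k` parameter, and hyperparameters are assumptions; the abstract does not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_select_and_classify(X, y, top_k):
    """Rank descriptors by Random Forest impurity importance, keep the
    top_k, and refit on the reduced descriptor set (illustrative)."""
    ranker = RandomForestClassifier(n_estimators=100, random_state=0)
    ranker.fit(X, y)
    keep = np.argsort(ranker.feature_importances_)[::-1][:top_k]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[:, keep], y)                 # classify on the reduced subset
    return keep, clf
```

Training the final classifier only on the retained descriptors is what delivers the reported computing-time reduction: downstream prediction cost scales with the number of features kept, not with the full descriptor library.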