
    New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics

    The very large dimensionality of real-world datasets is a challenging problem for classification algorithms, since many features are often redundant or irrelevant for classification. In addition, a very large number of features leads to a high computational time for classification algorithms. Feature selection methods deal with the large dimensionality of data by selecting a relevant feature subset according to an evaluation criterion. The vast majority of research on feature selection involves conventional single-label classification problems, where each instance is assigned a single class label; but there has been growing research on more complex multi-label classification problems, where each instance can be assigned multiple class labels. This thesis proposes three types of new Multi-Label Correlation-based Feature Selection (ML-CFS) methods, namely: (a) methods based on hill-climbing search, (b) methods that exploit biological knowledge (still using hill-climbing search), and (c) methods based on genetic algorithms as the search method. Firstly, we proposed three versions of ML-CFS methods based on hill-climbing search. In essence, these ML-CFS versions extend the original CFS method by extending the merit function (which evaluates candidate feature subsets) to the multi-label classification scenario, as well as modifying the merit function in other ways. A conventional search strategy, hill-climbing, was used to explore the space of candidate solutions (candidate feature subsets) for those three versions of ML-CFS. These ML-CFS versions are described in detail in Chapter 4. Secondly, in order to try to improve the performance of ML-CFS on cancer-related microarray gene expression datasets, we proposed three versions of the ML-CFS method that exploit biological knowledge. 
These ML-CFS versions are also based on hill-climbing search, but the merit function was modified in a way that favours the selection of genes (features) involved in pre-defined cancer-related pathways, as discussed in detail in Chapter 5. Lastly, we proposed two more sophisticated versions of ML-CFS based on Genetic Algorithms (rather than hill-climbing) as the search method. The first version of GA-based ML-CFS is based on a conventional single-objective GA, where there is only one objective to be optimized; while the second version of GA-based ML-CFS performs lexicographic multi-objective optimization, where there are two objectives to be optimized, as discussed in detail in Chapter 6. In this thesis, all proposed ML-CFS methods for multi-label classification problems were evaluated by measuring the predictive accuracies obtained by two well-known multi-label classification algorithms when using the selected features, namely the Multi-Label k-Nearest Neighbours (ML-kNN) algorithm and the Back-Propagation Multi-Label Learning (BPMLL) neural network algorithm. In general, the results obtained by the best version of the proposed ML-CFS methods, namely a GA-based ML-CFS method, were competitive with the results of other multi-label feature selection methods and baseline approaches. More precisely, one of our GA-based methods achieved the second best predictive accuracy out of all methods being compared (both with ML-kNN and BPMLL used as classifiers), but there was no statistically significant difference between that GA-based ML-CFS and the best method in terms of predictive accuracy. 
In addition, in the experiments with ML-kNN, the most accurate method selects about twice as many features as our GA-based ML-CFS; whilst in the experiments with BPMLL the most accurate method was a baseline method that does not perform any feature selection and runs the classifier once (with all original features) for each of the many class labels, which is a very computationally expensive baseline approach. In summary, one of the proposed GA-based ML-CFS methods managed to achieve substantial data reduction (selecting a smaller subset of relevant features) without a significant decrease in predictive accuracy with respect to the most accurate method.
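
The abstract above describes a merit function that scores candidate feature subsets and a hill-climbing search over those subsets, without giving formulas. The following is a minimal sketch under stated assumptions, not the thesis's implementation: it uses the standard CFS merit form k * r_fl / sqrt(k + k(k-1) * r_ff), assumes the multi-label extension simply averages the feature-label correlation over all labels, and uses absolute Pearson correlation as the correlation measure. All function names and the toy data are hypothetical.

```python
import math
from statistics import mean


def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def ml_cfs_merit(subset, features, labels):
    """CFS-style merit of a feature subset (list of column indices).

    Assumption: the feature-label term is the mean correlation between each
    selected feature and every label, averaged over labels and features.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_fl = mean(
        mean(abs_pearson(features[f], lab) for lab in labels) for f in subset
    )
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = mean(abs_pearson(features[a], features[b]) for a, b in pairs) if pairs else 0.0
    return k * r_fl / math.sqrt(k + k * (k - 1) * r_ff)


def hill_climb(features, labels):
    """Greedy forward search: add the best feature while merit strictly improves."""
    selected, best = [], 0.0
    remaining = set(range(len(features)))
    while remaining:
        scored = [
            (f, ml_cfs_merit(selected + [f], features, labels))
            for f in sorted(remaining)
        ]
        cand, cand_merit = max(scored, key=lambda t: t[1])
        if cand_merit <= best:
            break  # local optimum: no single addition improves the merit
        selected.append(cand)
        remaining.remove(cand)
        best = cand_merit
    return selected, best


# Toy data (hypothetical): column 1 duplicates column 0, column 2 is constant.
FEATURES = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], [7, 7, 7, 7, 7, 7]]
LABELS = [[0, 0, 0, 1, 1, 1], [0, 1, 0, 1, 1, 1]]
selected, merit = hill_climb(FEATURES, LABELS)
```

The denominator is what penalises redundancy: adding a feature that is highly correlated with an already selected one raises r_ff and so does not raise the merit, while an irrelevant feature lowers r_fl.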

    Support Vector Machine—Recursive Feature Elimination for localized feature selection

    Support Vector Machine Recursive Feature Elimination (SVM-RFE) is one of the state-of-the-art methods for gene selection. SVM-RFE was originally designed for binary classification problems and has been extended to multiclass problems in several recent studies. Starting from the concept of Pareto optimum, this paper illustrates the limitations of current gene selection methods on multiclass problems, describes a new class-based procedure for selecting significant genes for each class, and proposes a new implementation of SVM-RFE on that basis. Experiments on 8 cancer and tumor gene expression datasets demonstrate its superiority over two other RFE methods. By searching for the best genes for each class separately during the gene selection stages, the new method can identify genes that lead to more accurate classification. Supported by the Natural Science Foundation of Fujian Province (No. 2009J05153).

    Integration of feature subset selection methods for sentiment analysis

    Finding an optimal feature subset from a real-world domain is one of the main challenges in sentiment analysis. The complexity of optimal feature subset selection grows exponentially with the number of features, which leads to the well-known problems of analysing and organizing data in high-dimensional spaces. To overcome the problem, this study attempted to enhance feature subset selection in high-dimensional data by removing irrelevant and redundant features using filter and wrapper approaches. Initially, a filter method based on the dispersion of samples in feature space, known as the mutual standard deviation method, was developed to minimize intra-class and maximize inter-class distances. Filter-based methods have some advantages: they scale easily to high-dimensional datasets and are computationally simple and fast. However, they depend only on the feature selection space and ignore the hypothesis model space. Hence, the next step of this study developed a new feature ranking approach by integrating various filter methods. Ordinal-based and frequency-based integrations of different filter methods were developed. Finally, a hybrid search strategy based on harmony search was developed and used to enhance feature subset selection and to overcome the problem of ignoring the dependency of feature selection on the classifier. Therefore, a search strategy on feature space using the integration of filter and wrapper approaches was introduced to find a semantic relationship among the model selections and subsets of the search features. Comparative experiments were performed on five sentiment datasets, namely movie, music, book, electronics, and kitchen review datasets. A sizeable performance improvement was noted, whereby the proposed integration-based feature subset selection method yielded 98.32% accuracy in sentiment classification using POS-based features on movie reviews. 
Finally, a statistical test based on accuracy showed significant differences between the proposed methods and the baseline methods in almost all the comparisons in k-fold cross-validation. The findings of the study show that the mutual standard deviation and integration-based feature subset selection methods outperformed the baseline methods in terms of accuracy.

    Feature Space Modeling for Accurate and Efficient Learning From Non-Stationary Data

    A non-stationary dataset is one whose statistical properties such as the mean, variance, correlation, probability distribution, etc. change over a specific interval of time. In contrast, a stationary dataset is one whose statistical properties remain constant over time. Apart from the volatile statistical properties, non-stationary data poses other challenges such as time and memory management, because computational resources are limited while recent advancements in data collection technologies generate a variety of data at an alarming pace and volume. Additionally, when the collected data is complex, managing data complexity, emerging from its dimensionality and heterogeneity, can pose another challenge for effective computational learning. The problem is to enable accurate and efficient learning from non-stationary data in a continuous fashion over time while facing and managing the critical challenges of time, memory, concept change, and complexity simultaneously. Feature space modeling is one of the most effective solutions to address this problem. For non-stationary data, selecting relevant features is even more critical than for stationary data, because reducing the feature dimension makes the best use of computational resources and lets data mining algorithms achieve higher accuracy and efficiency. In this dissertation, we investigated a variety of feature space modeling techniques to improve the overall performance of data mining algorithms. In particular, we built a Relief-based feature subset selection method in combination with data complexity analysis to improve the classification performance using ovarian cancer image data collected in a non-stationary batch mode. We also collected time series health sensor data in a streaming environment and deployed feature space transformation using Singular Value Decomposition (SVD). 
This led to reduced dimensionality of the feature space, resulting in better accuracy and efficiency of the Density Ratio Estimation method in identifying potential change points in data over time. We have also built an unsupervised feature space model using matrix factorization and Lasso Regression, which was successfully deployed in conjunction with Relative Density Ratio Estimation to address botnet attacks in a non-stationary environment. The Relief-based feature model improved the accuracy of the Fuzzy Forest classifier by 16%. For the change detection framework, we observed a 9% improvement in accuracy with PCA feature transformation. Due to the unsupervised feature selection model, for 2% and 5% malicious traffic ratios, the proposed botnet detection framework exhibited on average 20% better accuracy than One Class Support Vector Machine (OSVM) and on average 25% better accuracy than Autoencoder. All these results demonstrate the effectiveness of these feature space models. The fundamental theme that repeats itself in this dissertation is modeling an efficient feature space to improve both the accuracy and efficiency of selected data mining models. Every contribution in this dissertation has been subsequently and successfully employed to capitalize on those advantages to solve real-world problems. Our work bridges concepts from multiple disciplines in effective and surprising ways, leading to new insights, new frameworks, and ultimately to a cross-production of diverse fields like mathematics, statistics, and data mining.

    Convex and non-convex optimization using centroid-encoding for visualization, classification, and feature selection

    Includes bibliographical references. 2022 Fall. Classification, visualization, and feature selection are three essential tasks of machine learning. This Ph.D. dissertation presents convex and non-convex models suitable for these three tasks. We propose Centroid-Encoder (CE), an autoencoder-based supervised tool for visualizing complex datasets that are potentially large (e.g., SUSY, with 5 million samples) and high-dimensional (e.g., GSE73072 clinical challenge data). Unlike an autoencoder, which maps a point to itself, a centroid-encoder has a modified target, i.e., the class centroid in the ambient space. We present a detailed comparative analysis of the method using various data sets and state-of-the-art techniques. We have proposed a variation of the centroid-encoder, Bottleneck Centroid-Encoder (BCE), where additional constraints are imposed at the bottleneck layer to improve generalization performance in the reduced space. We further developed a sparse optimization problem for the non-linear mapping of the centroid-encoder, called Sparse Centroid-Encoder (SCE), to determine the set of discriminative features between two or more classes. The sparse model selects variables using the 1-norm applied to the input feature space. SCE extracts discriminative features from multi-modal data sets, i.e., data whose classes appear to have multiple clusters, by using several centers per class. This approach seems to have advantages over models which use a one-hot-encoding vector. We also provide a feature selection framework that first ranks each feature by its occurrence, and the optimal number of features is chosen using a validation set. CE and SCE are models based on neural network architectures and require the solution of non-convex optimization problems. Motivated by the CE algorithm, we have developed a convex optimization for the supervised dimensionality reduction technique called Centroid Component Retrieval (CCR). 
The CCR model optimizes a multi-objective cost by balancing two complementary terms. The first term pulls the samples of a class towards its centroid by minimizing each sample's distance from its class centroid in the low-dimensional space. The second term pushes the classes apart by maximizing the scattering volume of the ellipsoid formed by the class centroids in the embedded space. Although the design principle of CCR is similar to LDA, our experimental results show that CCR exhibits performance advantages over LDA, especially on high-dimensional data sets, e.g., Yale Faces, ORL, and COIL20. Finally, we present a linear formulation of Centroid-Encoder with orthogonality constraints, called Principal Centroid Component Analysis (PCCA). This formulation is similar to PCA, except that the class labels are used to formulate the objective, resulting in a form of supervised PCA. We show the results of classification and visualization experiments with this new linear tool.
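
The contrast the abstract draws between an autoencoder (target: the input point itself) and a centroid-encoder (target: the class centroid in the ambient space) can be written as a pair of objectives. This is a sketch under assumed notation, not taken from the dissertation: f(.; theta) denotes the network, C_j the set of samples in class j, M the number of classes, and squared Euclidean loss is assumed.

```latex
% Autoencoder: each sample is reconstructed as itself.
\mathcal{L}_{\mathrm{AE}}(\theta) = \sum_{i=1}^{N} \left\| x_i - f(x_i;\theta) \right\|_2^2

% Centroid-encoder: each sample is mapped to the centroid of its own class.
\mathcal{L}_{\mathrm{CE}}(\theta) = \sum_{j=1}^{M} \sum_{x_i \in C_j} \left\| c_j - f(x_i;\theta) \right\|_2^2,
\qquad c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Under this reading, the sparse variant described above would add a 1-norm penalty on weights attached to the input features; the exact form of that penalty is not stated in the abstract.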

    A lexicographic multi-objective genetic algorithm for multi-label correlation-based feature selection

    This paper proposes a new Lexicographic multi-objective Genetic Algorithm for Multi-Label Correlation-based Feature Selection (LexGA-ML-CFS), which is an extension of the previous single-objective Genetic Algorithm for Multi-Label Correlation-based Feature Selection (GA-ML-CFS). This extension uses a LexGA as a global search method for generating candidate feature subsets. In our experiments, we compare the results obtained by LexGA-ML-CFS with the results obtained by the original hill-climbing-based ML-CFS, the single-objective GA-ML-CFS and a baseline Binary Relevance method, using ML-kNN as the multi-label classifier. The results show that LexGA-ML-CFS improved predictive accuracy over the other methods in some cases, but in general there was no statistically significant difference between the results of LexGA-ML-CFS and those of the other methods.
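
Lexicographic multi-objective optimization ranks objectives by priority: a lower-priority objective is consulted only when two candidates are effectively tied on every higher-priority one. The sketch below illustrates that comparison in general terms; it is not the paper's implementation. The choice of objectives (subset merit first, then fewer features), the tolerance values, and all names are assumptions made for illustration.

```python
def lex_better(a, b, tolerances):
    """Return True if objective vector `a` beats `b` lexicographically.

    Objectives are listed in priority order and are all maximised. Two values
    are treated as tied when they differ by no more than that objective's
    tolerance, in which case the next objective decides.
    """
    for x, y, tol in zip(a, b, tolerances):
        if abs(x - y) > tol:
            return x > y
    return False  # tied on every objective: neither candidate is better


def select_best(population, tolerances):
    """Pick the lexicographically best (name, objectives) pair from a list."""
    best = population[0]
    for cand in population[1:]:
        if lex_better(cand[1], best[1], tolerances):
            best = cand
    return best


# Hypothetical candidates: (merit, -number_of_features), both maximised.
A = ("A", (0.800, -10))
B = ("B", (0.805, -20))
C = ("C", (0.900, -30))
TOL = (0.01, 0)  # merits within 0.01 count as a tie; subset sizes must match exactly

winner_ab = select_best([A, B], TOL)
winner_abc = select_best([A, B, C], TOL)
```

Here A and B are tied on merit within the tolerance, so the smaller subset decides between them, whereas C's merit advantage exceeds the tolerance and the second objective is never consulted.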