
    Conditional-Entropy Metrics for Feature Selection

    Institute for Communicating and Collaborative Systems
    We examine the task of feature selection, which is a method of forming simplified descriptions of complex data for use in probabilistic classifiers. Feature selection typically requires a numerical measure, or metric, of the desirability of a given set of features. The thesis considers a number of existing metrics, with particular attention to those based on entropy and other quantities derived from information theory. A useful new perspective on feature selection is provided by the concepts of partitioning and encoding of data by a feature set. The ideas of partitioning and encoding, together with the theoretical shortcomings of existing metrics, motivate a new class of feature selection metrics based on conditional entropy. The simplest of the new metrics is referred to as expected partition entropy (EPE). The performance of the new and existing metrics is compared in experiments with a simplified form of part-of-speech tagging and with classification of Reuters news stories by topic. In order to conduct the experiments, a new class of accelerated feature selection search algorithms is introduced; a member of this class is found to provide significantly increased speed with minimal loss in performance, as measured by feature selection metrics and accuracy on test data. The comparative performance of existing metrics is also analysed, giving rise to a new general conjecture regarding the wrapper class of metrics: each wrapper is inherently tied to a specific type of classifier. The experimental results support the idea that a wrapper selects feature sets which perform well in conjunction with its own particular classifier, but this good performance cannot be expected to carry over to other types of model. The new metrics introduced in this thesis prove to have substantial advantages over a representative selection of other feature selection mechanisms: mutual information, frequency-based cutoff, the Koller-Sahami information loss measure, and two different types of wrapper method. Feature selection using the new metrics easily outperforms other filter-based methods such as mutual information; additionally, our approach attains performance comparable to a wrapper method, but at a fraction of the computational expense. Finally, members of the new class of metrics succeed in a case where the Koller-Sahami metric fails to provide a meaningful criterion for feature selection.
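
    As a rough illustration of the quantity these metrics build on (a minimal sketch of the class entropy conditioned on the partition a feature set induces, not the thesis's exact EPE definition; the function name and toy data are ours):

        from collections import Counter, defaultdict
        import math

        def conditional_entropy(samples, labels):
            """H(class | feature set): group samples into partition cells by
            their joint feature values, then average the label entropy of
            each cell, weighted by cell probability."""
            cells = defaultdict(list)
            for x, y in zip(samples, labels):
                cells[tuple(x)].append(y)
            n = len(labels)
            h = 0.0
            for cell in cells.values():
                w = len(cell) / n                                  # P(cell)
                h += w * -sum((c / len(cell)) * math.log2(c / len(cell))
                              for c in Counter(cell).values())     # H(class | cell)
            return h

        # Toy usage: adding the second feature sharpens the partition.
        X = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1)]
        y = [0, 0, 1, 1, 1, 1]
        print(conditional_entropy([x[:1] for x in X], y))  # first feature only
        print(conditional_entropy(X, y))                   # both features: lower

    A feature set with lower conditional entropy leaves less uncertainty about the class once a sample's partition cell is known, which is the sense in which the new metrics prefer it.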

    Feature selection using mutual information in network intrusion detection system

    University of Technology Sydney, Faculty of Engineering and Information Technology
    Network technologies have made significant progress, while the security issues alongside these technologies have not been well addressed. Current research on network security mainly focuses on developing preventative measures, such as security policies and secure communication protocols. Meanwhile, attempts have been made to protect computer systems and networks against malicious behaviours by deploying Intrusion Detection Systems (IDSs). The collaboration of IDSs and preventative measures can provide a safe and secure communication environment. Intrusion detection systems are now an essential complement to the security infrastructure of most organisations. However, current IDSs suffer from three significant issues that severely restrict their utility and performance: a large number of false alarms, a very high volume of network traffic, and the classification problem that arises when class labels are not available. In this thesis, these three issues are addressed and efficient intrusion detection systems are developed which are effective in detecting a wide variety of attacks while producing very few false alarms at low computational cost. The principal contribution is the efficient and effective use of mutual information, which offers a solid theoretical framework for quantifying the amount of information that two random variables share with each other. The goal of this thesis is to develop an IDS that is accurate in detecting attacks and fast enough to make real-time decisions. First, a nonlinear correlation coefficient-based similarity measure, based on mutual information, is used to help extract both linear and nonlinear correlations between network traffic records. The extracted information is used to develop an IDS to detect malicious network behaviours. However, current network traffic data, which consist of a great number of traffic patterns, pose a serious challenge to IDSs. To address this issue, two feature selection methods are proposed: a filter-based and a hybrid feature selection algorithm, added to our IDS for supervised classification. These methods select a subset of features from the original feature set and use the selected subset to build the IDS and enhance its detection performance. The filter-based algorithm, named Flexible Mutual Information Feature Selection (FMIFS), uses theoretical analyses of mutual information as the evaluation criterion to measure the relevance between the input features and the output classes. To eliminate redundancy among selected features, FMIFS introduces a new criterion to estimate the redundancy of the currently selected features with respect to the previously selected subset of features. The hybrid feature selection algorithm is a combination of filter and wrapper algorithms: the filter method searches for the best subset of features using mutual information as a measure of relevance between the input features and the output class, and the wrapper method further refines the selected subset, choosing the optimal subset of features that can produce better accuracy.
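
    FMIFS's exact redundancy criterion is not given here, but the general shape of such MI-based filters is a greedy relevance-minus-redundancy loop. A hedged mRMR-style sketch, assuming discrete-valued features and scikit-learn's MI estimators (the criterion below is illustrative, not FMIFS's own):

        import numpy as np
        from sklearn.feature_selection import mutual_info_classif
        from sklearn.metrics import mutual_info_score

        def greedy_mi_selection(X, y, k):
            """Pick k features: at each step, add the candidate with the best
            relevance-to-class minus mean-redundancy-to-selected score."""
            relevance = mutual_info_classif(X, y, discrete_features=True)
            selected, remaining = [], list(range(X.shape[1]))
            while len(selected) < k and remaining:
                def score(f):
                    red = (np.mean([mutual_info_score(X[:, f], X[:, s])
                                    for s in selected]) if selected else 0.0)
                    return relevance[f] - red
                best = max(remaining, key=score)
                selected.append(best)
                remaining.remove(best)
            return selected

    The hybrid variant described above would pass a subset chosen this way to the wrapper stage for further refinement.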
    In addition to the supervised feature selection methods, the research is extended to unsupervised feature selection, and two methods, an Extended Laplacian score (EL) and a Modified Laplacian score (ML), are proposed which can select features in unsupervised scenarios. More specifically, each of EL and ML consists of two main phases. In the first phase, the Laplacian score algorithm is applied to rank the features by evaluating the power of locality preservation for each feature in the initial data. In the second phase, a new redundancy penalization technique uses mutual information to remove redundancy among the selected features. The final output of these algorithms is then used to build the detection model. The proposed IDSs are tested on three publicly available datasets: the KDD Cup 99, NSL-KDD and Kyoto datasets. Experimental results confirm the effectiveness and feasibility of the proposed solutions in terms of detection accuracy, false alarm rate, computational complexity and the capability of utilising unlabelled data. The unsupervised feature selection methods have been further tested on five more well-known datasets from the UCI Machine Learning Repository. These additional datasets are frequently used in the literature to evaluate the performance of feature selection methods, and their varied sample sizes and numbers of features make them considerably more challenging for comprehensively testing feature selection algorithms. The experimental results show that ML performs better than EL and four other state-of-the-art methods (including the Variance score and Laplacian score algorithms) in terms of classification accuracy.
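
    The EL and ML modifications themselves are not spelled out here, but both build on the standard Laplacian score (He, Cai and Niyogi, 2005) in their first phase. A sketch of that phase, assuming a simple 0/1 k-nearest-neighbour affinity graph in place of the usual heat-kernel weights:

        import numpy as np
        from sklearn.neighbors import kneighbors_graph

        def laplacian_scores(X, n_neighbors=5):
            """Laplacian score per feature: features that best preserve local
            neighbourhood structure receive LOWER scores."""
            W = kneighbors_graph(X, n_neighbors, mode='connectivity').toarray()
            W = np.maximum(W, W.T)             # symmetrise the kNN graph
            d = W.sum(axis=1)                  # degrees
            D = np.diag(d)
            L = D - W                          # graph Laplacian
            scores = []
            for r in range(X.shape[1]):
                f = X[:, r]
                f_t = f - (f @ d) / d.sum()    # centre w.r.t. the degrees
                denom = f_t @ D @ f_t
                scores.append((f_t @ L @ f_t) / denom if denom > 0 else np.inf)
            return np.array(scores)

    The second phase would then walk down this ranking and use mutual information to drop features redundant with those already kept, in the spirit of the penalization technique described above.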

    Breast Cancer Classification by Gene Expression Analysis using Hybrid Feature Selection and Hyper-heuristic Adaptive Universum Support Vector Machine

    Comprehensive assessment of the molecular characteristics of breast cancer from gene expression patterns can aid in the early identification and treatment of tumor patients. The enormous scale of gene expression data obtained through microarray sequencing increases the difficulty of training a classifier over such large-scale feature sets. Selecting pivotal gene features can reduce dimensionality and classifier complexity while improving breast cancer detection accuracy. However, traditional filter- and wrapper-based selection methods have scalability and adaptability issues in handling complex gene features. This paper presents a hybrid feature selection method, Mutual Information Maximization - Improved Moth Flame Optimization (MIM-IMFO), for gene selection, along with an advanced Hyper-heuristic Adaptive Universum Support Vector Machine (HH-AUSVM) classification model to improve cancer detection rates. The hybrid gene selection method performs filter-based selection using MIM in the first stage, followed by the wrapper method in the second stage, to obtain the pivotal features and remove the inappropriate ones. IMFO improves standard MFO with a hybrid phase that achieves a better trade-off between exploration and exploitation. The HH-AUSVM classifier is formulated by integrating the Adaptive Universum learning approach into a hyper-heuristics-based parameter-optimized SVM to tackle the class imbalance problem. Evaluated on breast cancer gene expression datasets from the Mendeley Data Repository, the proposed MIM-IMFO gene selection-based HH-AUSVM classification approach provided better breast cancer detection, with high accuracies of 95.67%, 96.52%, 97.97% and 95.5% and processing times of 4.28, 3.17, 9.45 and 6.31 seconds on the respective datasets.
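
    As a hedged sketch of the two-stage filter-then-wrapper pattern (the paper's wrapper is a moth-flame metaheuristic; a plain greedy forward search with a cross-validated SVM stands in for it here, and the function name and parameter values are placeholders):

        import numpy as np
        from sklearn.feature_selection import mutual_info_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        def hybrid_select(X, y, n_filter=50, n_final=10):
            # Stage 1: Mutual Information Maximization keeps the top-ranked
            # features, shrinking the pool the wrapper must search.
            mi = mutual_info_classif(X, y)
            pool = list(np.argsort(mi)[::-1][:n_filter])
            # Stage 2: wrapper refinement, scoring subsets with the classifier.
            selected = []
            while len(selected) < n_final and pool:
                best = max(pool, key=lambda f: cross_val_score(
                    SVC(), X[:, selected + [f]], y, cv=3).mean())
                selected.append(best)
                pool.remove(best)
            return selected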

    Feature selection for chemical sensor arrays using mutual information

    We address the problem of feature selection for classifying a diverse set of chemicals using an array of metal oxide sensors. Our aim is to evaluate a filter approach to feature selection with reference to previous work, which used a wrapper approach on the same data set and established the best features and upper bounds on classification performance. We selected feature sets that exhibit maximal mutual information with the identity of the chemicals. The selected features closely match those found to perform well in the previous study, which used a wrapper approach to conduct an exhaustive search of all permitted feature combinations. By comparing the classification performance of support vector machines (using features selected by mutual information) with the performance observed in the previous study, we found that while our approach does not always give the maximum possible classification performance, it always selects features whose classification performance approaches the optimum obtained by exhaustive search. We performed further classification using the selected feature set with some common classifiers and found that, for the selected features, Bayesian networks gave the best performance. Finally, we compared the observed classification performances with those of classifiers using randomly selected features, and found that the selected features consistently outperformed randomly selected ones for all tested classifiers. The mutual information filter approach is therefore a computationally efficient method for selecting near-optimal features for chemical sensor arrays.
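
    A minimal sketch of the set-wise criterion, assuming equal-width binning as the discretisation (the paper's actual preprocessing of the sensor responses may differ):

        import math
        from collections import Counter
        import numpy as np

        def joint_mutual_information(X_subset, y, bins=8):
            """Plug-in estimate of I(F_S; C): discretise each sensor feature,
            then compute MI between the joint feature cell and the class."""
            Xd = np.stack([np.digitize(col, np.histogram_bin_edges(col, bins)[1:-1])
                           for col in X_subset.T], axis=1)
            n = len(y)
            cells = Counter(map(tuple, Xd))           # P(cell)
            classes = Counter(y)                      # P(class)
            joint = Counter(zip(map(tuple, Xd), y))   # P(cell, class)
            mi = 0.0
            for (cell, c), n_jc in joint.items():
                p = n_jc / n
                mi += p * math.log2(p / ((cells[cell] / n) * (classes[c] / n)))
            return mi

    An exhaustive search, as in the earlier wrapper study, would evaluate this over all permitted feature combinations and keep the maximiser; the filter approach here instead uses the MI score directly to rank candidate sets.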

    Search Strategies for Binary Feature Selection for a Naive Bayes Classifier

    In this paper we compare several feature selection methods for the Naive Bayes Classifier (NBC) when the data under study are described by a large number of redundant binary indicators. Wrapper approaches, guided by the NBC's estimate of the classification error probability, outperform filter approaches while retaining a reasonable computational cost.
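
    A sketch of such a wrapper loop, assuming binary indicator features and using cross-validated accuracy as a stand-in for the paper's NBC-based estimate of the classification error probability:

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import BernoulliNB

        def nbc_forward_selection(X, y, max_features=20):
            """Greedy forward wrapper: keep adding the binary indicator that
            most improves the NBC's estimated accuracy; stop when none helps."""
            selected, remaining, best_acc = [], list(range(X.shape[1])), 0.0
            while remaining and len(selected) < max_features:
                f_best, acc_best = None, best_acc
                for f in remaining:
                    acc = cross_val_score(BernoulliNB(), X[:, selected + [f]],
                                          y, cv=5).mean()
                    if acc > acc_best:
                        f_best, acc_best = f, acc
                if f_best is None:      # no candidate improved the estimate
                    break
                selected.append(f_best)
                remaining.remove(f_best)
                best_acc = acc_best
            return selected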

    Feature Selection via Coalitional Game Theory

    We present and study the contribution-selection algorithm (CSA), a novel algorithm for feature selection. The algorithm is based on multi-perturbation Shapley analysis (MSA), a framework that relies on game theory to estimate the usefulness of features. The algorithm iteratively estimates the usefulness of features and selects them accordingly, using either forward selection or backward elimination. It can optimize various performance measures over unseen data, such as accuracy, balanced error rate, and area under the receiver operating characteristic (ROC) curve. Empirical comparison with several other feature selection methods shows that the backward elimination variant of CSA leads to the most accurate classification results on an array of data sets.
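
    A hedged Monte-Carlo sketch of the contribution-estimation step (MSA's actual estimator and CSA's selection schedule are more involved; the classifier, permutation count, and majority-class baseline below are placeholders, and y is assumed to hold non-negative integer labels):

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        def shapley_contributions(X, y, n_perms=30, seed=0):
            """Estimate each feature's Shapley-style contribution: average its
            marginal CV-accuracy gain over random feature orderings."""
            y = np.asarray(y)
            rng = np.random.default_rng(seed)
            contrib = np.zeros(X.shape[1])

            def score(feats):
                if not feats:   # empty set: majority-class baseline accuracy
                    return np.mean(y == np.bincount(y).argmax())
                return cross_val_score(DecisionTreeClassifier(), X[:, feats],
                                       y, cv=3).mean()

            for _ in range(n_perms):
                prefix, prev = [], score([])
                for f in rng.permutation(X.shape[1]):
                    prefix.append(f)
                    cur = score(prefix)
                    contrib[f] += cur - prev
                    prev = cur
            return contrib / n_perms

    Forward selection would then repeatedly add the features with the highest estimated contributions; backward elimination, the variant found most accurate, would instead drop those with the lowest.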