
    Conditional Dynamic Mutual Information-Based Feature Selection

    With the emergence of new techniques, data in many fields are growing ever larger, especially in dimensionality. High-dimensional data can pose great challenges to traditional learning algorithms. In fact, many features in large volumes of data are redundant and noisy. Their presence not only degrades the performance of learning algorithms, but also confuses end-users in the post-analysis process. Thus, it is necessary to eliminate irrelevant features from data before they are fed into learning algorithms. Many endeavors have been made in this field and many outstanding feature selection methods have been developed. Among the various evaluation criteria, mutual information has been widely used in feature selection because of its good capability of quantifying the uncertainty of features in classification tasks. However, mutual information estimated on the whole dataset cannot exactly represent the correlation between features. To cope with this issue, in this paper we first re-estimate mutual information dynamically on identified instances, and then introduce a new feature selection method based on conditional mutual information. Performance evaluations on sixteen UCI datasets show that our proposed method achieves performance comparable to other well-established feature selection algorithms in most cases.
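The criterion above rests on plug-in estimates of (conditional) mutual information for discrete variables. A minimal sketch of such estimates follows; the function names are illustrative and this is not the paper's dynamic re-estimation procedure itself, only the standard quantities it builds on.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = sum_z p(z) * I(X;Y | Z=z), estimated per stratum of z."""
    n = len(z)
    cmi = 0.0
    for v in set(z):
        idx = [i for i in range(n) if z[i] == v]
        cmi += len(idx) / n * mutual_information(
            [x[i] for i in idx], [y[i] for i in idx])
    return cmi
```

For identical sequences the estimate equals the entropy (ln 2 for a balanced binary variable), and it is zero for independent ones.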

    A New Approach Based on Quantum Clustering and Wavelet Transform for Breast Cancer Classification: Comparative Study

    Feature selection involves identifying a subset of the most useful features that produces the same results as the original set of features. In this paper, we present a new approach for improving classification accuracy, based on quantum clustering for feature subset selection and the wavelet transform for feature extraction. The feature selection is performed in three steps. First, the mammographic image undergoes a wavelet transform and some features are extracted. In the second step, the original feature space is partitioned into clusters in order to group similar features. This operation is performed using the Quantum Clustering algorithm. The third step deals with the selection of a representative feature for each cluster. This selection is based on similarity measures such as the correlation coefficient (CC) and mutual information (MI): the feature that maximizes this measure (CC or MI) is chosen by the algorithm. The approach is applied to breast cancer classification, and the K-nearest neighbors (KNN) classifier is used to achieve the classification. We report classification accuracy versus feature type, wavelet transform and the number of neighbors K in the KNN classifier. An accuracy of 100% was reached in some cases.
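The third step above (keeping one representative feature per cluster) can be sketched as follows, assuming numeric features, a numeric class label, and Pearson correlation as the CC measure; the function name and data layout are illustrative, not the paper's exact procedure.

```python
import numpy as np

def representative_features(X, y, clusters):
    """For each cluster of feature indices, keep the feature whose
    absolute Pearson correlation with the class label is largest."""
    selected = []
    for cluster in clusters:
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in cluster]
        selected.append(cluster[int(np.argmax(scores))])
    return selected
```

With the MI variant, the correlation score would simply be replaced by a mutual information estimate between each feature and the label.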

    Various Feature Selection Techniques in Type 2 Diabetic Patients for the Prediction of Cardiovascular Disease

    Cardiovascular disease (CVD) is a serious but preventable complication of type 2 diabetes mellitus (T2DM) that results in substantial disease burden, increased health services use, and higher risk of premature mortality [10]. People with diabetes are also at a greatly increased risk of cardiovascular events that can result in sudden death, a risk that increases year by year. Data mining is the search for relationships and global patterns that exist in large databases but are `hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis. Medical databases of type 2 diabetic patients are usually high dimensional in nature. If a training dataset contains irrelevant and redundant features (i.e., attributes), classification analysis may produce less accurate results. For data mining algorithms to perform efficiently and effectively on high-dimensional data, it is imperative to remove irrelevant and redundant features. Feature selection is one of the most important and frequently used data preprocessing techniques for data mining applications in medicine, and much research in data mining has improved the predictive accuracy of classifiers by applying various feature selection techniques. This paper illustrates that applying feature selection techniques to medical databases can identify a small number of informative features, leading to potential improvement in medical diagnosis. It is proposed to find an optimal feature subset of the PIMA Indian Diabetes Dataset using an Artificial Bee Colony technique with Differential Evolution, the Symmetrical Uncertainty Attribute Set Evaluator and the Fast Correlation-Based Filter (FCBF). Mutual-information-based feature selection is then performed by introducing normalized mutual information feature selection (NMIFS), and valid classes of input features are selected by applying the Hybrid Fuzzy C-Means algorithm (HFCM).
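The NMIFS criterion mentioned above scores a candidate feature by its relevance to the class minus its average normalized redundancy against already-selected features. A rough sketch for discrete, non-constant features follows (helper names and the greedy loop are illustrative, not the paper's implementation):

```python
import numpy as np
from collections import Counter

def entropy(seq):
    """Plug-in entropy in nats of a discrete sequence."""
    n = len(seq)
    return -sum(c / n * np.log(c / n) for c in Counter(seq).values())

def mi(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def nmifs(F, y, k):
    """Greedy NMIFS: maximise I(f;C) minus the mean normalised
    redundancy I(f;s)/min(H(f),H(s)) over selected features s.
    F: list of discrete feature columns; assumes H(f) > 0 for all f."""
    remaining = list(range(len(F)))
    selected = []
    while remaining and len(selected) < k:
        def score(j):
            rel = mi(F[j], y)
            if not selected:
                return rel
            red = sum(mi(F[j], F[s]) / min(entropy(F[j]), entropy(F[s]))
                      for s in selected) / len(selected)
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note how an exact duplicate of an already-selected feature is penalised by a full unit of normalised redundancy, so it loses to a less redundant candidate.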

    Feature selection using mutual information in network intrusion detection system

    University of Technology Sydney, Faculty of Engineering and Information Technology. Network technologies have made significant progress, while the security issues alongside these technologies have not been well addressed. Current research on network security mainly focuses on developing preventative measures, such as security policies and secure communication protocols. Meanwhile, attempts have been made to protect computer systems and networks against malicious behaviours by deploying Intrusion Detection Systems (IDSs). The collaboration of IDSs and preventative measures can provide a safe and secure communication environment. Intrusion detection systems are now an essential complement to the security infrastructure of most organisations. However, current IDSs suffer from three significant issues that severely restrict their utility and performance: a large number of false alarms, the very high volume of network traffic, and the classification problem that arises when class labels are not available. In this thesis, these three issues are addressed and efficient intrusion detection systems are developed which are effective in detecting a wide variety of attacks, result in very few false alarms and have low computational cost. The principal contribution is the efficient and effective use of mutual information, which offers a solid theoretical framework for quantifying the amount of information that two random variables share with each other. The goal of this thesis is to develop an IDS that is accurate in detecting attacks and fast enough to make real-time decisions. First, a nonlinear correlation coefficient-based similarity measure, based on mutual information, is used to extract both linear and nonlinear correlations between network traffic records. The extracted information is used to develop an IDS to detect malicious network behaviours.
However, current network traffic data, which consist of a great number of traffic patterns, pose a serious challenge to IDSs. To address this issue, two feature selection methods are proposed: a filter-based and a hybrid feature selection algorithm, which are added to the IDS for supervised classification. These methods select a subset of features from the original feature set and use the selected subset to build the IDS and enhance detection performance. The filter-based feature selection algorithm, named Flexible Mutual Information Feature Selection (FMIFS), uses theoretical analyses of mutual information as evaluation criteria to measure the relevance between the input features and the output classes. To eliminate redundancy among selected features, FMIFS introduces a new criterion to estimate the redundancy of the currently selected feature with respect to the previously selected subset of features. The hybrid feature selection algorithm is a combination of filter and wrapper algorithms: the filter method searches for the best subset of features using mutual information as a measure of relevance between the input features and the output class, and the wrapper method further refines the subset selected in the previous phase to find the optimal subset of features that produces better accuracy. In addition to the supervised feature selection methods, the research is extended to unsupervised feature selection, and an Extended Laplacian score (EL) method and a Modified Laplacian score (ML) method are proposed which can select features in unsupervised scenarios. More specifically, each of EL and ML consists of two main phases. In the first phase, the Laplacian score algorithm is applied to rank the features by evaluating the power of locality preservation for each feature in the initial data.
In the second phase, a new redundancy penalization technique uses mutual information to remove redundancy among the selected features. The final output of these algorithms is then used to build the detection model. The proposed IDSs are tested on three publicly available datasets: KDD Cup 99, NSL-KDD and Kyoto. Experimental results confirm the effectiveness and feasibility of the proposed solutions in terms of detection accuracy, false alarm rate, computational complexity and the capability of utilising unlabelled data. The unsupervised feature selection methods have been further tested on five more well-known datasets from the UCI Machine Learning Repository. These datasets are frequently used in the literature to evaluate the performance of feature selection methods, and they have different sample sizes and various numbers of features, so they are considerably more challenging for comprehensively testing feature selection algorithms. The experimental results show that ML performs better than EL and four other state-of-the-art methods (including the Variance score algorithm and the Laplacian score algorithm) in terms of classification accuracy.
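The Laplacian score ranking used in the first phase of EL and ML can be sketched as follows. For brevity a dense heat-kernel similarity graph is assumed (the thesis's exact graph construction and parameters may differ); lower scores indicate better locality preservation.

```python
import numpy as np

def laplacian_score(X, t=1.0):
    """Laplacian score per feature on a dense heat-kernel graph.
    Lower score = the feature better preserves local structure."""
    n, d = X.shape
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-dist2 / t)          # similarity matrix
    D = np.diag(S.sum(1))           # degree matrix
    L = D - S                       # graph Laplacian
    scores = np.empty(d)
    for r in range(d):
        f = X[:, r]
        f_t = f - (f @ S.sum(1)) / S.sum()   # remove D-weighted mean
        scores[r] = (f_t @ L @ f_t) / (f_t @ D @ f_t)
    return scores
```

On data with two well-separated clusters, a feature aligned with the cluster structure scores much lower than a feature that varies arbitrarily inside each cluster, which is exactly the ranking the first phase exploits.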

    Model-Based Feature Selection Based on Radial Basis Functions and Information Measures

    In this paper the development of a new embedded feature selection method is presented, based on a Radial Basis Function neural-fuzzy modelling structure. The proposed method finds the relative importance of the features in a given dataset (or process in general), with special focus on manufacturing processes. It evaluates the impact/importance of process features by using information-theoretic measures to quantify the correlation between the process features and the modelling performance. Crucially, the proposed method acts during the training of the process model; hence it is an embedded method, achieving the modelling/classification task in parallel with the feature selection task. The latter is achieved by taking advantage of the information in the output layer of the neural-fuzzy structure; in the presented case this is a TSK-type polynomial function. Two information measures are evaluated in this work, both based on information entropy: mutual information and cross-sample entropy. The proposed methodology is tested on two popular datasets from the literature (IRIS - plant data; AirFoil - manufacturing/design data) and one further case study relevant to manufacturing - the heat treatment of steel. Results show the good and reliable performance of the developed modelling structure, on par with existing published work, as well as the good performance of the feature selection task in terms of correctly identifying important process features.

    Feature selection and hierarchical classifier design with applications to human motion recognition

    The performance of a classifier is affected by a number of factors including classifier type, the input features and the desired output. This thesis examines the impact of feature selection and classification problem division on classification accuracy and complexity. Proper feature selection can reduce classifier size and improve classifier performance by minimizing the impact of noisy, redundant and correlated features. Noisy features can cause false association between the features and the classifier output. Redundant and correlated features increase classifier complexity without adding additional information. Output selection or classification problem division describes the division of a large classification problem into a set of smaller problems. Problem division can improve accuracy by allocating more resources to more difficult class divisions and enabling the use of more specific feature sets for each sub-problem. The first part of this thesis presents two methods for creating feature-selected hierarchical classifiers. The feature-selected hierarchical classification method jointly optimizes the features and classification tree-design using genetic algorithms. The multi-modal binary tree (MBT) method performs the class division and feature selection sequentially and tolerates misclassifications in the higher nodes of the tree. This yields a piecewise separation for classes that cannot be fully separated with a single classifier. Experiments show that the accuracy of MBT is comparable to other multi-class extensions, but with lower test time. Furthermore, the accuracy of MBT is significantly higher on multi-modal data sets. The second part of this thesis focuses on input feature selection measures. A number of filter-based feature subset evaluation measures are evaluated with the goal of assessing their performance with respect to specific classifiers. 
Although many feature selection measures have been proposed in the literature, it is unclear which of them are appropriate for use with different classifiers. Sixteen common filter-based measures are tested on 20 real and 20 artificial data sets, the latter designed to probe for specific feature selection challenges. The strengths and weaknesses of each measure are discussed with respect to the specific feature selection challenges in the artificial data sets, correlation with classifier accuracy and the ability to identify known informative features. The results indicate that the best filter measure is classifier-specific. K-nearest neighbours classifiers work well with subset-based RELIEF, correlation feature selection or conditional mutual information maximization, whereas Fisher's interclass separability criterion and conditional mutual information maximization work better for support vector machines. Based on the results of the feature selection experiments, two new filter-based measures are proposed, building on conditional mutual information maximization, which performs well but cannot identify dependent features in a set and does not include a check for correlated features. Both new measures explicitly check for dependent features, and the second measure also includes a term to discount correlated features. Both measures correctly identify known informative features in the artificial data sets and correlate well with classifier accuracy. The final part of this thesis examines the use of feature selection for time-series data, using feature selection to determine important individual time windows or key frames in the series. Time-series feature selection is combined with the MBT algorithm to create classification trees for time-series data. The feature-selected MBT algorithm is tested on two human motion recognition tasks: full-body human motion recognition from joint angle data and hand gesture recognition from electromyography data.
Results indicate that the feature-selected MBT is able to achieve high classification accuracy on time-series data while maintaining a short test time.
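The conditional mutual information maximization (CMIM) criterion discussed in this abstract selects, at each step, the feature whose worst-case conditional relevance min over selected s of I(f; C | s) is largest. A sketch for discrete data follows; helper names are illustrative and the entropy-based estimators are the standard plug-in ones, not the thesis's implementation.

```python
import numpy as np
from collections import Counter

def entropy(seq):
    n = len(seq)
    return -sum(c / n * np.log(c / n) for c in Counter(seq).values())

def mi(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def cmi(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) for discrete data."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))

def cmim(F, y, k):
    """Greedy CMIM: maximise the worst-case conditional relevance
    of each candidate given every already-selected feature."""
    remaining = list(range(len(F)))
    selected = []
    while remaining and len(selected) < k:
        def score(j):
            if not selected:
                return mi(F[j], y)
            return min(cmi(F[j], y, F[s]) for s in selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The min over selected features is exactly why plain CMIM cannot reward a feature that is only informative jointly with another candidate, the weakness the two new measures address.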

    Feature Selection Based on Sequential Orthogonal Search Strategy

    This thesis introduces three new feature selection methods based on a sequential orthogonal search strategy, each addressing a different context of the feature selection problem. The first method is a supervised feature selection method called maximum relevance-minimum multicollinearity (MRmMC), which can overcome some shortcomings associated with existing methods that apply the same form of feature selection criterion, especially those based on mutual information. In the proposed method, relevant features are measured by correlation characteristics based on conditional variance, while redundancy elimination is achieved through multiple correlation assessment using an orthogonal projection scheme. The second method is an unsupervised feature selection method based on Locality Preserving Projection (LPP), incorporated in a sequential orthogonal search (SOS) strategy. The locality preserving criterion has proved a successful measure of feature importance in many feature selection methods, but most of these ignore feature correlation and therefore retain redundant features. This problem motivated the second method, which evaluates feature importance jointly rather than individually. In this method, the first LPP component, which contains the information of the local largest structure (LLS), is utilized as a reference variable to guide the search for significant features. This method is referred to as sequential orthogonal search for local largest structure (SOS-LLS). The third method is also an unsupervised feature selection method with essentially the same SOS strategy, but it is specifically designed to be robust on noisy data. As limited work has been reported on feature selection in the presence of attribute noise, the third method attempts to address this gap by further developing the second proposed method.
The third method is designed to deal with attribute noise in the search for significant features, and kernel pre-images (KPI) based on kernel PCA are used to replace the first LPP component as the reference variable used in the second method. This feature selection scheme is referred to as the sequential orthogonal search for kernel pre-images (SOS-KPI) method. The performance of the three feature selection methods is demonstrated through comprehensive analysis of public real-world datasets with different characteristics and through comparative studies with a number of state-of-the-art methods. Results show that each of the proposed methods selects more efficient feature subsets than the other feature selection methods in the comparative studies.
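The shared ingredient of all three methods, sequential orthogonal search with projection-based redundancy elimination, can be illustrated by a simple forward selection that deflates the remaining candidates by Gram-Schmidt projection after each pick. This is a generic sketch under a linear correlation criterion against a reference variable, not the exact MRmMC, SOS-LLS or SOS-KPI algorithm.

```python
import numpy as np

def sequential_orthogonal_select(X, y, k):
    """Greedy orthogonal forward selection: pick the (deflated) feature
    most correlated with the reference variable y, then project its
    direction out of every candidate so redundant copies score zero."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    R = Xc.copy()
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        norms[norms < 1e-12] = np.inf      # ignore deflated/constant columns
        scores = np.abs(R.T @ yc) / norms
        if selected:
            scores[selected] = -1.0        # never re-pick a feature
        j = int(np.argmax(scores))
        selected.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)         # Gram-Schmidt deflation
    return selected
```

After the first pick, an exact duplicate of the chosen feature deflates to the zero vector and can never be selected, which is the multicollinearity-elimination effect the orthogonal projection scheme provides.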