    A New Maximum Relevance-Minimum Multicollinearity (MRmMC) Method for Feature Selection and Ranking

    A substantial number of datasets stored for various applications are high dimensional, with redundant and irrelevant features. Processing and analysing data under such circumstances is time consuming and makes it difficult to obtain efficient predictive models. There is a strong need to carry out analyses of high dimensional data in some lower dimension, and one approach to achieving this is feature selection. This paper presents a new relevancy-redundancy approach, called the maximum relevance–minimum multicollinearity (MRmMC) method, for feature selection and ranking, which can overcome some shortcomings of existing criteria. In the proposed method, feature relevance is measured by correlation characteristics based on conditional variance, while redundancy elimination is achieved through a multiple correlation assessment using an orthogonal projection scheme. A series of experiments was conducted on eight datasets from the UCI Machine Learning Repository, and the results show that the proposed method performed reasonably well for feature subset selection.
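    As an illustration of the general relevance-redundancy scheme the abstract describes (not the paper's exact conditional-variance and orthogonal-projection criteria), a greedy selector might score each candidate feature by its correlation with the target minus its average correlation with the features already chosen; plain Pearson correlation is used here as a simplified stand-in:

```python
# Illustrative greedy relevance-redundancy feature selection.
# Pearson correlation stands in for MRmMC's conditional-variance
# relevance and orthogonal-projection redundancy measures.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def select_features(X, y, k):
    """X: list of feature columns; y: target column. Returns up to k
    feature indices, each picked to maximise relevance to y minus the
    mean redundancy with the features selected so far."""
    selected = []
    remaining = list(range(len(X)))
    while remaining and len(selected) < k:
        def score(j):
            rel = abs(pearson(X[j], y))
            red = (sum(abs(pearson(X[j], X[s])) for s in selected)
                   / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

    With a duplicated feature in the pool, the redundancy term steers the second pick away from the copy and toward a less correlated feature.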

    An examination of thermal features' relevance in the task of battery-fault detection

    Uninterruptible power supplies (UPS), represented here by lead-acid batteries, play an important role in many industries. They protect industrial technologies from being damaged by dangerous interruptions of the electric power supply. Advanced UPS monitoring performed by a complex battery management system (BMS) prevents the UPS from sustaining more serious damage thanks to timely and accurate battery-fault detection based on voltage metering. This technique is very advanced and precise, but also very expensive in the long term. This article describes an experiment applying infrared thermographic measurements to long-term monitoring and fault detection in UPS. The assumption that battery overheating implies a damaged state is the leading factor of our experiments. They are based on real data measured on various UPS battery sets and on several statistical examinations confirming the high relevancy of the thermal features, with mostly over 90% detection accuracy. Such a model can be used as a supplement to lead-acid battery based UPS monitoring to ensure higher reliability at significantly lower maintenance costs.

    Effect of Feature Selection on Gene Expression Datasets Classification Accuracy

    Feature selection attracts researchers who deal with machine learning and data mining. It consists of selecting the variables that have the greatest impact on dataset classification and discarding the rest. This dimensionality reduction allows classifiers to be faster and more accurate. This paper examines the effect of feature selection on the accuracy of classifiers widely used in the literature. These classifiers are compared on three real datasets that are pre-processed with feature selection methods. An improvement of more than 9% in classification accuracy is observed, and k-means appears to be the classifier most sensitive to feature selection.

    Software defect prediction using maximal information coefficient and fast correlation-based filter feature selection

    Software quality ensures that developed applications are failure free. Some modern systems are intricate due to the complexity of their information processes. Software fault prediction is an important quality assurance activity, since it is a mechanism that correctly predicts the defect proneness of modules and classifies modules accordingly, which saves resources, time and developers' efforts. In this study, a model that selects relevant features for use in defect prediction was proposed. The literature was reviewed, and it revealed that process metrics are better predictors of defects in version systems and are based on historical source code over time. These metrics are extracted from the source-code module and include, for example, the number of additions and deletions in the source code, the number of distinct committers and the number of modified lines. In this research, defect prediction was conducted using open source software (OSS) for software product lines (SPL), hence process metrics were chosen. Datasets used in defect prediction may contain non-significant and redundant attributes that can affect the accuracy of machine-learning algorithms. To improve the prediction accuracy of classification models, features that are significant to the defect prediction process are utilised. In machine learning, feature selection techniques are applied to identify the relevant data. Feature selection is a pre-processing step that helps reduce the dimensionality of data in machine learning. Feature selection techniques include information-theoretic methods based on the entropy concept. This study experimented with the efficiency of these feature selection techniques. It was found that software defect prediction using significant attributes improves prediction accuracy.
A novel MICFastCR model was developed, based on the Maximal Information Coefficient (MIC) to select significant attributes and the Fast Correlation Based Filter (FCBF) to eliminate redundant attributes. Machine learning algorithms were then run to predict software defects. The MICFastCR model achieved the highest prediction accuracy as reported by various performance measures. School of Computing, Ph. D. (Computer Science)
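    The FCBF redundancy-elimination step mentioned above can be sketched with symmetrical uncertainty, SU(X, Y) = 2·I(X;Y) / (H(X) + H(Y)), on discrete features. This shows the filter step only; the thesis couples it with MIC-based relevance ranking, which is not reproduced here:

```python
# Sketch of FCBF-style redundancy elimination on discrete features,
# using symmetrical uncertainty as the correlation measure.

from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(x, y):
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - entropy(list(zip(x, y)))  # I(X;Y)
    return 2 * mi / (hx + hy) if hx + hy else 0.0

def fcbf(features, target, threshold=0.1):
    """Keep features whose SU with the target exceeds the threshold,
    then drop any feature that a stronger, already-kept feature
    predicts at least as well as the target does."""
    ranked = sorted(
        (j for j in range(len(features))
         if symmetrical_uncertainty(features[j], target) > threshold),
        key=lambda j: -symmetrical_uncertainty(features[j], target))
    kept = []
    for j in ranked:
        su_jy = symmetrical_uncertainty(features[j], target)
        if all(symmetrical_uncertainty(features[j], features[k]) < su_jy
               for k in kept):
            kept.append(j)
    return kept
```

    A duplicated predictive feature is discarded (its SU with the kept copy equals its SU with the target), and an irrelevant feature never passes the threshold.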

    Gene selection for classification of microarray data based on the Bayes error

    Background: With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes, e.g. for disease diagnosis. Several widely used gene selection methods select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these methods is that they may produce gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analyses. Some recent studies show that incorporating gene-to-gene correlations into gene selection can remove redundant genes and improve classification accuracy. Results: In this study, we propose a new method, the Based Bayes error Filter (BBF), to select relevant genes and remove redundant genes in classification analyses of microarray data. The effectiveness and accuracy of the method are demonstrated through analyses of five publicly available microarray datasets. The results show that our gene selection method achieves better accuracies than previous studies, while effectively selecting relevant genes, removing redundant genes and obtaining efficient, small gene sets for sample classification. Conclusion: The proposed method can effectively identify a compact set of genes with high classification accuracy. This study also indicates that applying the Bayes error is a feasible and effective way to remove redundant genes in gene selection.
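    To make the Bayes-error idea concrete, a per-gene score can be computed under a simplified two-class, equal-variance Gaussian model: with equal priors, the optimal threshold is the midpoint of the class means and the error is Φ(-|μ₁ − μ₂| / (2σ)). This is an illustrative stand-in for the paper's BBF criterion, not its implementation:

```python
# Per-gene Bayes error under a two-class equal-variance Gaussian model.
# Genes with lower Bayes error are more discriminative.

from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gaussian_bayes_error(a, b):
    """a, b: expression values of one gene in the two classes."""
    mu_a, mu_b = sum(a) / len(a), sum(b) / len(b)
    # pooled variance estimate
    var = (sum((x - mu_a) ** 2 for x in a) +
           sum((x - mu_b) ** 2 for x in b)) / (len(a) + len(b) - 2)
    sigma = sqrt(var) or 1e-12
    # equal priors: error = Phi(-|mu_a - mu_b| / (2 sigma))
    return normal_cdf(-abs(mu_a - mu_b) / (2.0 * sigma))

def rank_genes(genes_a, genes_b):
    """Rank gene indices by ascending Bayes error (best first)."""
    errs = [gaussian_bayes_error(a, b) for a, b in zip(genes_a, genes_b)]
    return sorted(range(len(errs)), key=lambda i: errs[i])
```

    A well-separated gene gets an error near zero, while heavily overlapping class distributions push the score toward 0.5 (chance level).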

    Examining Swarm Intelligence-based Feature Selection for Multi-Label Classification

    Multi-label classification addresses tasks in which more than one class label is assigned to each instance. Many real-world multi-label classification tasks are high-dimensional due to digital technologies, leading to reduced performance of traditional multi-label classifiers. Feature selection is a common and successful approach to tackling this problem: it retains relevant features and eliminates redundant ones to reduce dimensionality. Several feature selection methods have been successfully applied in multi-label learning. Most of them are wrapper methods that employ a multi-label classifier in their processes. They run a classifier at each step, which incurs a high computational cost, and thus they suffer from scalability issues. To deal with this issue, filter methods evaluate feature subsets using information-theoretic mechanisms instead of running classifiers. This paper aims to provide a comprehensive review of the feature selection methods presented for multi-label classification tasks. To this end, we have investigated most of the well-known and state-of-the-art methods. We then present the main characteristics of the existing multi-label feature selection techniques and compare them analytically.

    Selection of compressible signals from telemetry data

    Sensors are deployed in all aspects of modern city infrastructure and generate vast amounts of data. Only subsets of this data, however, are relevant to individual organisations. For example, a local council may collect suspension movement from vehicles to detect pot-holes, but this data is not relevant when assessing traffic flow. Supervised feature selection aims to find the set of signals that best predict a target variable. Typical approaches use either measures of correlation or similarity, as in filter methods, or predictive power in a learned model, as in wrapper methods. In both approaches the selected features often have high entropies and are not suitable for compression. This is a particular issue in the automotive domain, where fast communication and archival of vehicle telemetry data is likely to be prevalent in the near future, especially with technologies such as V2V and V2X. In this paper, we adapt a popular feature selection filter method to consider the compressibility of the signals being selected for use in a predictive model. In particular, we add a compression term to the Minimal Redundancy Maximal Relevance (MRMR) filter and introduce Minimal Redundancy Maximal Relevance And Compression (MRMRAC). Using MRMRAC, we select features from the Controller Area Network (CAN) and predict each of instantaneous fuel consumption, engine torque, vehicle speed and gear position using a Support Vector Machine (SVM). We show that while performance is slightly lower when compression is considered, the compressibility of the selected features is significantly improved.

    Computer-aided weld inspection by fuzzy modeling with selected features

    This thesis develops a computer-aided weld inspection methodology based on fuzzy modeling with selected features. The proposed methodology employs several filter feature selection methods for selecting input variables and then builds fuzzy models with the selected features. Our fuzzy modeling method is based on a fuzzy c-means (FCM) variant for the generation of fuzzy term sets. The implemented FCM variant differs from the original FCM method in two respects: (1) the two end terms take the maximum and minimum domain values as their centers, and (2) all fuzzy terms are forced to be convex. The optimal number of terms and the optimal shape of the membership function associated with each term are determined based on the mean squared error criterion. The fuzzy model serves as the rule base of an implemented fuzzy-reasoning-based expert system. In this implementation, the fuzzy rules are first extracted from feature data one feature at a time based on the FCM variant. The total number of fuzzy rules is the product of the numbers of fuzzy terms across features. The performance of these fuzzy sets is then tested on unseen data in terms of accuracy rates and computational time. To evaluate the goodness of each selected feature subset, the selected combination is used as an input to the proposed fuzzy model. The accuracy of each selected feature subset, along with the average error of the selected filter technique, is reported. For comparison, the results of all possible combinations of the specified set of feature subsets are also obtained.
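    A minimal 1-D sketch of the first modification, pinning the two end cluster centres to the domain minimum and maximum after each standard FCM update, might look like the following; the convexity enforcement (the second modification) is omitted for brevity:

```python
# 1-D fuzzy c-means with end centres pinned to the domain extremes.

def fcm_1d(data, c=3, m=2.0, iters=50):
    lo, hi = min(data), max(data)
    # initialise centres evenly across the domain
    centres = [lo + (hi - lo) * i / (c - 1) for i in range(c)]
    for _ in range(iters):
        # standard FCM membership update: u_i = 1 / sum_j (d_i/d_j)^(2/(m-1))
        U = []
        for x in data:
            d = [abs(x - v) or 1e-9 for v in centres]
            U.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                                for j in range(c)) for i in range(c)])
        # update centres as membership-weighted means
        centres = [
            sum(U[k][i] ** m * data[k] for k in range(len(data))) /
            sum(U[k][i] ** m for k in range(len(data)))
            for i in range(c)]
        # the variant: end terms keep the domain min/max as centres
        centres[0], centres[-1] = lo, hi
        centres.sort()
    return centres
```

    With three well-separated value groups, the two end terms sit exactly at the domain extremes while the interior centre converges to the middle group, matching the variant's intent of anchoring the outermost membership functions.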