    Multistage feature selection methods for data classification

    In the data analysis process, good decisions are supported by several sub-processes and methods, the most common of which are feature selection and classification. Various methods have been proposed to address issues faced by decision-makers, such as low classification accuracy and long processing times. The analysis becomes more complicated when dealing with large and problematic datasets. One solution is to employ an effective feature selection method to reduce data processing time, decrease memory usage, and increase the accuracy of decisions. However, not all existing methods are capable of dealing with these issues. The aim of this research was to help the classifier perform better on problematic datasets by generating an optimised attribute set. The proposed method comprises two stages of feature selection: a correlation-based feature selection method using a best-first search algorithm (CFS-BFS), followed by a soft-set and rough-set parameter selection method (SSRS). CFS-BFS eliminates uncorrelated attributes in a dataset, while SSRS manages problematic values such as uncertainty. Several benchmark feature selection methods, such as classifier subset evaluation (CSE) and principal component analysis (PCA), and different classifiers, such as support vector machine (SVM) and neural network (NN), were used to validate the obtained results, and ANOVA and t-tests were conducted to verify them. The averages obtained in two experimental works show that the proposed method matched the performance of the benchmark methods in helping the classifier achieve high classification performance on complex datasets. The average obtained in another experimental work shows that the proposed method outperformed the benchmark methods. In conclusion, the proposed method is a viable alternative feature selection method that helps classifiers achieve better accuracy in the classification process, especially when dealing with problematic datasets.
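The two-stage idea described in the abstract can be sketched in Python. This is a minimal illustration, not the thesis's actual method: the first stage stands in for CFS-BFS with a plain per-feature correlation filter, and the second stage stands in for the soft-set/rough-set step with a simple redundancy filter. All function names and thresholds are illustrative assumptions.

```python
import numpy as np

def correlation_filter(X, y, threshold=0.2):
    """Stage 1 (CFS-like sketch): keep features whose absolute Pearson
    correlation with the target exceeds a threshold. A simplification of
    the CFS-BFS stage described in the abstract."""
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) > threshold:
            keep.append(j)
    return keep

def redundancy_filter(X, keep, threshold=0.9):
    """Stage 2 sketch: drop features highly correlated with an already
    selected feature (a stand-in for the SSRS stage, which the abstract
    describes only at a high level)."""
    selected = []
    for j in keep:
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < threshold
               for k in selected):
            selected.append(j)
    return selected

# Toy data: feature 0 predicts y, feature 1 duplicates it, feature 2 is noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),
                     y + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])
stage1 = correlation_filter(X, y)
final = redundancy_filter(X, stage1)
print(final)
```

On this toy data the redundant duplicate (feature 1) survives the correlation stage but is removed by the redundancy stage, illustrating why a single-stage filter is not enough.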

    From fuzzy-rough to crisp feature selection

    A central problem in machine learning and pattern recognition is identifying the most important features in a dataset. This process plays a decisive role in big data processing by reducing the size of datasets. One major drawback of existing feature selection methods is the high chance of redundant features appearing in the final subset; in most cases, finding and removing them can greatly improve the resulting classification accuracy. To tackle this problem on two different fronts, we employed fuzzy-rough sets and perturbation theory. On one side, we used three strategies to improve the performance of fuzzy-rough set-based feature selection methods. The first strategy encodes both features and samples in one binary vector and uses a shuffled frog leaping algorithm to choose the best combination, with the fuzzy dependency degree as the fitness function. In the second strategy, we designed a measure that evaluates features based on the fuzzy-rough dependency degree in a way that gives redundant features less priority for selection. In the last strategy, we designed a new binary version of the shuffled frog leaping algorithm that employs the fuzzy positive region as its similarity measure, so that it works in complete harmony with the fitness function (i.e. the fuzzy-rough dependency degree). To extend the applicability of fuzzy-rough set-based feature selection to multi-party medical datasets, we designed a privacy-preserving version of the original method. In addition, we studied the feasibility and applicability of perturbation theory for feature selection, which to the best of our knowledge had not previously been researched. We introduced a new feature selection method based on perturbation theory that not only detects and discards redundant features but is also fast and flexible in accommodating the special needs of the application.
It employs a clustering algorithm to group similarly behaving features based on each feature's sensitivity to perturbation, its angle to the outcome, and the effect of removing it on the outcome; it then chooses the feature closest to the centre of each cluster and returns those features as the final subset. To assess the effectiveness of the proposed methods, we compared each method with well-known feature selection methods on a series of artificially generated datasets as well as biological, medical and cancer datasets adopted from the University of California Irvine machine learning repository, the Arizona State University repository and the Gene Expression Omnibus repository.
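The perturbation-and-clustering procedure above can be sketched as follows. The sensitivity and angle measures, the least-squares surrogate model, and the tiny k-means are illustrative assumptions standing in for the thesis's actual measures, which the abstract describes only at a high level.

```python
import numpy as np

def feature_scores(X, y, eps=1e-2, rng=None):
    """Score each feature by (a) sensitivity of a least-squares fit to a
    small perturbation of that feature and (b) the cosine angle between
    the feature and the outcome. Both measures are simplified stand-ins
    for the criteria named in the abstract."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    base = X @ w
    scores = np.empty((d, 2))
    for j in range(d):
        Xp = X.copy()
        Xp[:, j] += eps * rng.normal(size=n)          # perturb feature j
        wp, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        scores[j, 0] = np.linalg.norm(Xp @ wp - base)  # sensitivity to perturbation
        scores[j, 1] = abs(X[:, j] @ y) / (np.linalg.norm(X[:, j]) * np.linalg.norm(y))
    return scores

def select_by_clustering(scores, k=2, iters=50):
    """Tiny k-means over the per-feature score vectors; return the feature
    closest to each cluster centre, mirroring the selection rule in the
    abstract (redundant features land in the same cluster)."""
    centres = scores[:k].copy()
    for _ in range(iters):
        d2 = ((scores[:, None, :] - centres[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centres[c] = scores[labels == c].mean(0)
    d2 = ((scores[:, None, :] - centres[None]) ** 2).sum(-1)
    return sorted({int(d2[:, c].argmin()) for c in range(k)})

# Toy data: features 0 and 3 track y; features 1 and 2 are noise.
rng = np.random.default_rng(1)
y = rng.normal(size=100)
X = np.column_stack([y + 0.05 * rng.normal(size=100),
                     rng.normal(size=100),
                     rng.normal(size=100),
                     2 * y + 0.05 * rng.normal(size=100)])
chosen = select_by_clustering(feature_scores(X, y), k=2)
print(chosen)
```

Because only one representative per cluster is returned, features that respond to perturbation in the same way (such as the two noise features here) collapse to a single pick, which is how this scheme discards redundancy without an explicit pairwise redundancy test.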