20,935 research outputs found
Rough sets approach to symbolic value partition
AbstractIn data mining, searching for simple representations of knowledge is a very important issue. Attribute reduction, continuous attribute discretization and symbolic value partition are three preprocessing techniques which are used in this regard. This paper investigates the symbolic value partition technique, which divides each attribute domain of a data table into a family for disjoint subsets, and constructs a new data table with fewer attributes and smaller attribute domains. Specifically, we investigates the optimal symbolic value partition (OSVP) problem of supervised data, where the optimal metric is defined by the cardinality sum of new attribute domains. We propose the concept of partition reducts for this problem. An optimal partition reduct is the solution to the OSVP-problem. We develop a greedy algorithm to search for a suboptimal partition reduct, and analyze major properties of the proposed algorithm. Empirical studies on various datasets from the UCI library show that our algorithm effectively reduces the size of attribute domains. Furthermore, it assists in computing smaller rule sets with better coverage compared with the attribute reduction approach
Attribute Equilibrium Dominance Reduction Accelerator (DCCAEDR) Based on Distributed Coevolutionary Cloud and Its Application in Medical Records
© 2013 IEEE. Aimed at the tremendous challenge of attribute reduction for big data mining and knowledge discovery, we propose a new attribute equilibrium dominance reduction accelerator (DCCAEDR) based on the distributed coevolutionary cloud model. First, the framework of N-populations distributed coevolutionary MapReduce model is designed to divide the entire population into N subpopulations, sharing the reward of different subpopulations' solutions under a MapReduce cloud mechanism. Because the adaptive balancing between exploration and exploitation can be achieved in a better way, the reduction performance is guaranteed to be the same as those using the whole independent data set. Second, a novel Nash equilibrium dominance strategy of elitists under the N bounded rationality regions is adopted to assist the subpopulations necessary to attain the stable status of Nash equilibrium dominance. This further enhances the accelerator's robustness against complex noise on big data. Third, the approximation parallelism mechanism based on MapReduce is constructed to implement rule reduction by accelerating the computation of attribute equivalence classes. Consequently, the entire attribute reduction set with the equilibrium dominance solution can be achieved. Extensive simulation results have been used to illustrate the effectiveness and robustness of the proposed DCCAEDR accelerator for attribute reduction on big data. Furthermore, the DCCAEDR is applied to solve attribute reduction for traditional Chinese medical records and to segment cortical surfaces of the neonatal brain 3-D-MRI records, and the DCCAEDR shows the superior competitive results, when compared with the representative algorithms
Identifying Mislabeled Training Data
This paper presents a new approach to identifying and eliminating mislabeled
training instances for supervised learning. The goal of this approach is to
improve classification accuracies produced by learning algorithms by improving
the quality of the training data. Our approach uses a set of learning
algorithms to create classifiers that serve as noise filters for the training
data. We evaluate single algorithm, majority vote and consensus filters on five
datasets that are prone to labeling errors. Our experiments illustrate that
filtering significantly improves classification accuracy for noise levels up to
30 percent. An analytical and empirical evaluation of the precision of our
approach shows that consensus filters are conservative at throwing away good
data at the expense of retaining bad data and that majority filters are better
at detecting bad data at the expense of throwing away good data. This suggests
that for situations in which there is a paucity of data, consensus filters are
preferable, whereas majority vote filters are preferable for situations with an
abundance of data
Enhancing Big Data Feature Selection Using a Hybrid Correlation-Based Feature Selection
This study proposes an alternate data extraction method that combines three well-known
feature selection methods for handling large and problematic datasets: the correlation-based feature
selection (CFS), best first search (BFS), and dominance-based rough set approach (DRSA) methods.
This study aims to enhance the classifier’s performance in decision analysis by eliminating uncorrelated and inconsistent data values. The proposed method, named CFS-DRSA, comprises several
phases executed in sequence, with the main phases incorporating two crucial feature extraction tasks.
Data reduction is first, which implements a CFS method with a BFS algorithm. Secondly, a data selection process applies a DRSA to generate the optimized dataset. Therefore, this study aims to solve
the computational time complexity and increase the classification accuracy. Several datasets with
various characteristics and volumes were used in the experimental process to evaluate the proposed
method’s credibility. The method’s performance was validated using standard evaluation measures
and benchmarked with other established methods such as deep learning (DL). Overall, the proposed
work proved that it could assist the classifier in returning a significant result, with an accuracy rate
of 82.1% for the neural network (NN) classifier, compared to the support vector machine (SVM),
which returned 66.5% and 49.96% for DL. The one-way analysis of variance (ANOVA) statistical
result indicates that the proposed method is an alternative extraction tool for those with difficulties
acquiring expensive big data analysis tools and those who are new to the data analysis field.Ministry of Higher Education under the Fundamental Research Grant Scheme (FRGS/1/2018/ICT04/UTM/01/1)Universiti Teknologi Malaysia (UTM) under Research University Grant Vot-20H04, Malaysia Research University Network (MRUN) Vot 4L876SPEV project, University of Hradec Kralove, Faculty
of Informatics and Management, Czech Republic (ID: 2102–2021), “Smart Solutions in Ubiquitous
Computing Environments
- …