Granular Support Vector Machines Based on Granular Computing, Soft Computing and Statistical Learning
With the emergence of biomedical informatics, Web intelligence, and e-business, new challenges are arising for knowledge discovery and data mining. In this dissertation, a framework named Granular Support Vector Machines (GSVM) is proposed to systematically and formally combine statistical learning theory, granular computing theory, and soft computing theory to address challenging predictive data modeling problems effectively and/or efficiently, with a specific focus on binary classification. In general, GSVM works in three steps. Step 1 is granulation: building a sequence of information granules from the original dataset or the original feature space. Step 2 is modeling: training Support Vector Machines (SVM) in some of these information granules when necessary. Finally, step 3 is aggregation: consolidating the information in these granules at a suitable level of abstraction. A good granulation method for finding suitable granules is crucial for building a good GSVM. Under this framework, several granulation algorithms, including the GSVM-CMW (cumulative margin width) algorithm, the GSVM-AR (association rule mining) algorithm, a family of GSVM-RFE (recursive feature elimination) algorithms, the GSVM-DC (data cleaning) algorithm, and the GSVM-RU (repetitive undersampling) algorithm, are designed for binary classification problems with different characteristics. Empirical studies in the biomedical domain and many other application domains demonstrate that the framework is promising. As a preliminary step, this dissertation work will be extended in the future into a Granular Computing based Predictive Data Modeling framework (GrC-PDM), with which hybrid adaptive intelligent data mining systems for high-quality prediction can be built.
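The three GSVM steps (granulate, model, aggregate) can be sketched in code. The sketch below uses k-means clustering as a stand-in granulation method; the dissertation's own granulation algorithms (GSVM-CMW, GSVM-AR, GSVM-RFE, GSVM-DC, GSVM-RU) are more specialised, so this is only an illustration of the framework's shape, not of any of those algorithms.

```python
# Sketch of the GSVM framework: granulation -> per-granule SVM -> aggregation.
# k-means is an assumed, illustrative granulation method, not the dissertation's.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Step 1: granulation -- partition the input space into information granules.
granulator = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = granulator.labels_

# Step 2: model an SVM inside each granule (only where both classes appear).
models = {}
for g in np.unique(labels):
    idx = labels == g
    if len(np.unique(y[idx])) == 2:
        models[g] = SVC(kernel="rbf").fit(X[idx], y[idx])

# Step 3: aggregation -- route each point to its granule's SVM, falling back
# to the global majority class for granules that have no model.
def predict(X_new):
    fallback = int(np.bincount(y).argmax())
    return np.array([
        models[g].predict(x.reshape(1, -1))[0] if g in models else fallback
        for g, x in zip(granulator.predict(X_new), X_new)
    ])

print(predict(X[:5]))
```

Routing each test point to a single granule's model is one simple aggregation choice; soft aggregation over several granules is equally compatible with the framework.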
Three-way Imbalanced Learning based on Fuzzy Twin SVM
Three-way decision (3WD) is a powerful tool in granular computing for dealing with uncertain data, and is commonly used in information systems, decision-making, and medical care. Three-way decision has been studied extensively in traditional rough set models; however, it is rarely combined with the currently popular field of machine learning. In this paper, three-way decision is connected with SVM, a standard binary classification model in machine learning, to address imbalanced classification problems, a setting in which standard SVM performs poorly. A new three-way fuzzy membership function and a new fuzzy twin support vector machine with three-way membership (TWFTSVM) are proposed. The three-way fuzzy membership function is defined to increase the certainty of uncertain data in both the input space and the feature space, assigning higher fuzzy membership to minority samples than to majority samples. To evaluate the effectiveness of the proposed model, comparative experiments are conducted on forty-seven datasets with varying imbalance ratios. In addition, datasets with different imbalance ratios are derived from the same dataset to further assess the proposed model's performance. The results show that the proposed model significantly outperforms other traditional SVM-based methods.
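The core idea, giving minority samples higher fuzzy membership so the classifier is not dominated by the majority class, can be illustrated with a simplified input-space membership scheme. This is an assumed analogue, not the paper's TWFTSVM membership, which is also defined in the feature space and built on a twin SVM.

```python
# Simplified, assumed fuzzy-membership scheme for imbalanced SVM:
# rarer classes get proportionally more weight, and within each class
# membership decays with distance from the class centre.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=1)

def fuzzy_membership(X, y):
    n = len(y)
    m = np.empty(n)
    for c in np.unique(y):
        idx = y == c
        centre = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - centre, axis=1)
        # Distance term in (0, 1]: closer to the class centre -> higher membership.
        dist_term = 1.0 - d / (d.max() + 1e-9)
        # Imbalance term: inverse class frequency, so minority samples weigh more.
        class_term = n / (2.0 * idx.sum())
        m[idx] = class_term * (0.1 + 0.9 * dist_term)
    return m

w = fuzzy_membership(X, y)
clf = SVC(kernel="rbf").fit(X, y, sample_weight=w)
# Minority-class samples carry a larger average weight than majority ones.
print(w[y == 1].mean() > w[y == 0].mean())
```

Passing the memberships as `sample_weight` to a standard SVM is the simplest way to use them; a fuzzy twin SVM instead builds the weights into its two non-parallel hyperplane problems.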
Semi-random partitioning of data into training and test sets in granular computing context
Due to the vast and rapid increase in the size of data, machine learning has become an increasingly popular approach to knowledge discovery and predictive modelling. For both purposes, it is essential to partition a data set into a training set and a test set: the training set is used to learn a model, and the test set is then used to evaluate the performance of the model learned from the training set. The split of the data into the two sets, and its influence on model performance, has so far only been investigated with respect to the optimal proportion for the two sets, with no attention paid to the characteristics of the data within the training and test sets. Thus, the current practice is to randomly split the data into approximately 70% for training and 30% for testing. In this paper, we show that this way of partitioning the data leads to two major issues: (a) class imbalance and (b) sample representativeness. Class imbalance is known to affect the performance of many classifiers by introducing a bias towards the majority class; the representativeness of the training set affects a model's performance through the lack of opportunity for the algorithm to learn, by not presenting it with relevant examples, much like testing a student on material that was not taught. To address these two issues, we propose a semi-random data partitioning framework in the setting of granular computing. While we discuss how the framework can address both issues, in this paper we focus on avoiding class imbalance when partitioning the data. The results show that avoiding class imbalance leads to better model performance.
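The class-imbalance issue described above can be avoided in its simplest form by stratified partitioning, which keeps each class's proportion approximately equal in the training and test sets. This sketch shows only that baseline remedy; the paper's semi-random framework goes further, also addressing sample representativeness within a granular computing setting.

```python
# Stratified 70/30 split as a minimal remedy for class imbalance introduced
# by purely random partitioning.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# Plain random split: class proportions can drift between the two sets.
_, _, _, y_te_rand = train_test_split(X, y, test_size=0.3, random_state=0)

# Stratified split: class proportions are preserved by construction.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(Counter(y_tr), Counter(y_te))
```

With `stratify=y`, the minority-class fraction in the training and test sets matches the overall data to within rounding, so the 70/30 convention can be kept without introducing a sampling bias towards the majority class.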