1,585 research outputs found
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well known strategies to deal with binary
imbalanced data on 82 different real life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with RBF kernel,
random forests, and gradient boosting machines and we measured the quality of
the resulting classifier using 6 different metrics (Area under the curve,
Accuracy, F-measure, G-mean, Matthew's correlation coefficient and Balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier. For AUC and accuracy class weight and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, underbagging
SMOTE: Synthetic Minority Over-sampling Technique
An approach to the construction of classifiers from imbalanced datasets is
described. A dataset is imbalanced if the classification categories are not
approximately equally represented. Often real-world data sets are predominately
composed of "normal" examples with only a small percentage of "abnormal" or
"interesting" examples. It is also the case that the cost of misclassifying an
abnormal (interesting) example as a normal example is often much higher than
the cost of the reverse error. Under-sampling of the majority (normal) class
has been proposed as a good means of increasing the sensitivity of a classifier
to the minority class. This paper shows that a combination of our method of
over-sampling the minority (abnormal) class and under-sampling the majority
(normal) class can achieve better classifier performance (in ROC space) than
only under-sampling the majority class. This paper also shows that a
combination of our method of over-sampling the minority class and
under-sampling the majority class can achieve better classifier performance (in
ROC space) than varying the loss ratios in Ripper or class priors in Naive
Bayes. Our method of over-sampling the minority class involves creating
synthetic minority class examples. Experiments are performed using C4.5, Ripper
and a Naive Bayes classifier. The method is evaluated using the area under the
Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy
Learning from Multi-Class Imbalanced Big Data with Apache Spark
With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when with multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the number of examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or parallelism, and so their effectiveness has been hindered.
This dissertation addresses some of these challenges with in-depth experimentation using novel implementations of machine learning algorithms using Apache Spark, a distributed computing framework based on the MapReduce model designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary class datasets, have been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study on how instance level difficulty affects the learning from large datasets was also performed
Cost-sensitive Bayesian network learning using sampling
A significant advance in recent years has been the development of cost-sensitive decision tree learners, recognising that real world classification problems need to take account of costs of misclassification and not just focus on accuracy. The literature contains well over 50 cost-sensitive decision tree induction algorithms, each with varying performance profiles. Obtaining good Bayesian networks can be challenging and hence several algorithms have been proposed for learning their structure and parameters from data. However, most of these algorithms focus on learning Bayesian networks that aim to maximise the accuracy of classifications. Hence an obvious question that arises is whether it is possible to develop cost-sensitive Bayesian networks and whether they would perform better than cost-sensitive decision trees for minimising classification cost? This paper explores this question by developing a new Bayesian network learning algorithm based on changing the data distribution to reflect the costs of misclassification. The proposed method is explored by conducting experiments on over 20 data sets. The results show that this approach produces good results in comparison to more complex cost-sensitive decision tree algorithms
Analysis of Data mining based Software Defect Prediction Techniques
Software bug repository is the main resource for fault prone modules. Different data mining algorithms are used to extract fault prone modules from these repositories. Software development team tries to increase the software quality by decreasing the number of defects as much as possible. In this paper different data mining techniques are discussed for identifying fault prone modules as well as compare the data mining algorithms to find out the best algorithm for defect prediction
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expected maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, and more accurate and robust classification results.Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
Recommended from our members
Ensemble methods for instance-based Arabic language authorship attribution
The Authorship Attribution (AA) is considered as a subfield of authorship analysis and it is an important problem as the range of anonymous information increased with fast growing of internet usage worldwide. In other languages such as English, Spanish and Chinese, such issue is quite well studied. However, in Arabic language, the AA problem has received less attention from the research community due to complexity and nature of Arabic sentences. The paper presented an intensive review on previous studies for Arabic language. Based on that, this study has employed the Technique for Order Preferences by Similarity to Ideal Solution (TOPSIS) method to choose the base classifier of the ensemble methods. In terms of attribution features, hundreds of stylometric features and distinct words using several tools have been extracted. Then, Adaboost and Bagging ensemble methods have been applied on Arabic enquires (Fatwa) dataset. The findings showed an improvement of the effectiveness of the authorship attribution task in the Arabic language
- …