Multi-class Boosting for imbalanced data.
We consider the problem of multi-class classification with imbalanced data-sets. To this end, we introduce a cost-sensitive multi-class Boosting algorithm (BAdaCost) based on a generalization of the Boosting margin, termed multi-class cost-sensitive margin. To address the class imbalance we introduce a cost matrix that weighs the costs of confused classes more heavily and a procedure to estimate these costs from the confusion matrix of a standard 0|1-loss classifier. Finally, we evaluate the performance of the approach with synthetic and real data-sets and compare our results with the AdaC2.M1 algorithm.
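As a rough illustration of deriving costs from a confusion matrix, the sketch below row-normalises the matrix and uses the off-diagonal error rates as costs. This is an assumption made for illustration only: the abstract does not specify the estimation procedure, and the function name `cost_matrix_from_confusion` and the example counts are hypothetical.

```python
def cost_matrix_from_confusion(confusion):
    """Build a cost matrix from a confusion matrix (rows = true class).

    Assumed rule (not taken from the paper): the cost of predicting class j
    for a sample of true class i is the fraction of class-i samples that a
    baseline classifier confused with class j. Correct predictions cost 0.
    """
    costs = []
    for i, row in enumerate(confusion):
        total = sum(row)
        # Row-normalising expresses errors relative to each true class size,
        # so errors on a rare class are not drowned out by a frequent class.
        costs.append([
            0.0 if i == j or total == 0 else count / total
            for j, count in enumerate(row)
        ])
    return costs


# Hypothetical confusion matrix of a baseline 0|1-loss classifier.
# Class 2 is rare (10 samples) and is mostly confused with classes 0 and 1.
confusion = [
    [90, 10, 0],
    [5, 40, 5],
    [4, 4, 2],
]
costs = cost_matrix_from_confusion(confusion)
for row in costs:
    print(row)
```

Under this assumed rule the rare class receives the largest off-diagonal costs, which is the kind of bias toward frequently confused classes that the abstract describes.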
CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
Class imbalance classification is a challenging research problem in data
mining and machine learning, as most of the real-life datasets are often
imbalanced in nature. Existing learning algorithms maximise the classification
accuracy by correctly classifying the majority class, but misclassify the
minority class. However, the minority class instances usually represent the
concept of greater interest than the majority class instances in real-life
applications. Recently, several techniques based on sampling methods
(under-sampling of the majority class and over-sampling the minority class),
cost-sensitive learning methods, and ensemble learning have been used in the
literature for classifying imbalanced datasets. In this paper, we introduce a
new clustering-based under-sampling approach with boosting (AdaBoost)
algorithm, called CUSBoost, for effective imbalanced classification. The
proposed algorithm provides an alternative to RUSBoost (random under-sampling
with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost)
algorithms. We evaluated the performance of the CUSBoost algorithm against
state-of-the-art ensemble learning methods such as AdaBoost, RUSBoost, and
SMOTEBoost on 13 imbalanced binary and multi-class datasets with various
imbalance ratios. The experimental results show that CUSBoost is a
promising and effective approach for dealing with highly imbalanced datasets.
Comment: CSITSS-201
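The cluster-based under-sampling idea can be sketched as follows: group the majority-class instances into clusters, then keep a random subset from every cluster so that each region of the majority class stays represented. The abstract does not state the clustering algorithm or the sampling rate, so the precomputed `cluster_ids`, the `fraction=0.5` default, and the function name are assumptions for illustration.

```python
import random
from collections import defaultdict


def cluster_undersample(majority, cluster_ids, fraction=0.5, seed=0):
    """Keep roughly `fraction` of the majority instances from each cluster.

    `cluster_ids[i]` is the cluster label of `majority[i]`; it is assumed to
    come from any clustering algorithm run on the majority class beforehand.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for item, cid in zip(majority, cluster_ids):
        by_cluster[cid].append(item)

    kept = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        # Keep at least one instance so no cluster disappears entirely.
        n_keep = max(1, round(len(members) * fraction))
        kept.extend(rng.sample(members, n_keep))
    return kept


# Hypothetical majority class of 10 instances split into two clusters.
majority = list(range(10))
cluster_ids = [0] * 6 + [1] * 4
kept = cluster_undersample(majority, cluster_ids)
print(len(kept), sorted(kept))
```

In a boosting loop, the reduced majority set would be combined with the untouched minority class before each weak learner is trained. Random under-sampling, by contrast, draws from the majority class without regard to cluster membership.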
Cost-Sensitive Boosting for Classification of Imbalanced Data
The classification of data with imbalanced class distributions
significantly limits the performance attainable by most
well-developed classification systems, which assume relatively
balanced class distributions. This problem is especially crucial
in many application domains, such as medical diagnosis, fraud
detection, network intrusion, etc., which are of great importance
in machine learning and data mining.
This thesis explores meta-techniques which are applicable to most
classifier learning algorithms, with the aim to advance the
classification of imbalanced data. Boosting is a powerful
meta-technique to learn an ensemble of weak models with a promise
of improving the classification accuracy. AdaBoost is widely regarded
as the most successful boosting algorithm. This thesis starts with
applying AdaBoost to an associative classifier for both learning
time reduction and accuracy improvement. However, the promise of
accuracy improvement is of little value in the context of the class
imbalance problem, where accuracy is less meaningful. The insight
gained from a comprehensive analysis on the boosting strategy of
AdaBoost leads to the investigation of cost-sensitive boosting
algorithms, which are developed by introducing cost items into the
learning framework of AdaBoost. The cost items are used to denote
the uneven identification importance among classes, such that the
boosting strategies can intentionally bias the learning towards
classes associated with higher identification importance and
eventually improve the identification performance on them. Given
an application domain, cost values with respect to different types
of samples are usually unavailable for applying the proposed
cost-sensitive boosting algorithms. To set effective cost
values, empirical methods are used for bi-class applications and
heuristic search with a Genetic Algorithm is employed for
multi-class applications.
This thesis also covers the implementation of the proposed
cost-sensitive boosting algorithms. It ends with a discussion on
the experimental results of classification of real-world
imbalanced data. Compared with existing algorithms, the new
algorithms presented in this thesis achieve better
measurements on the learning objectives.
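To make the idea of introducing cost items into AdaBoost's learning framework concrete, here is a minimal sketch of a single weight-update round for binary labels in {-1, +1}. It multiplies each sample's weight by its cost outside the exponent, which is only one possible placement. The abstract does not give the update rule, so this form, the function name, and the example values are assumptions.

```python
import math


def cost_sensitive_update(weights, costs, labels, predictions):
    """One boosting round with per-sample costs (labels in {-1, +1}).

    Assumed update: w_i <- c_i * w_i * exp(-alpha * y_i * h_i), normalised,
    with alpha computed from cost-weighted correct and incorrect mass.
    """
    rows = list(zip(weights, costs, labels, predictions))
    correct = sum(c * w for w, c, y, h in rows if y == h)
    wrong = sum(c * w for w, c, y, h in rows if y != h)
    if correct == 0 or wrong == 0:
        raise ValueError("alpha is undefined when all samples agree or all disagree")

    alpha = 0.5 * math.log(correct / wrong)
    raw = [c * w * math.exp(-alpha * y * h) for w, c, y, h in rows]
    norm = sum(raw)
    return alpha, [r / norm for r in raw]


# Hypothetical round: three majority samples (-1) and one minority sample (+1)
# with cost 2; the weak learner predicts the majority class for everything.
weights = [0.25, 0.25, 0.25, 0.25]
costs = [1.0, 1.0, 1.0, 2.0]
labels = [-1, -1, -1, 1]
predictions = [-1, -1, -1, -1]
alpha, new_weights = cost_sensitive_update(weights, costs, labels, predictions)
print(alpha, new_weights)
```

The misclassified high-cost minority sample ends the round with the largest weight, so the next weak learner is pushed toward the class with higher identification importance.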
- …