5,090 research outputs found

    Misclassification analysis for the class imbalance problem

    In classification, class imbalance typically causes the learning algorithm to be dominated by the majority classes, so that the features of the minority classes are sometimes ignored. This indirectly affects how humans visualise the data. Special care is therefore needed in the learning algorithm to improve accuracy on the minority classes. In this study, the use of misclassification analysis is investigated for data re-distribution. Several under-sampling techniques and hybrid techniques using misclassification analysis are proposed. Benchmark data sets obtained from the University of California Irvine (UCI) machine learning repository are used to investigate the performance of the proposed techniques. The results show that the proposed hybrid technique performs best in the experiments.

    Coupling different methods for overcoming the class imbalance problem

    Many classification problems must deal with imbalanced datasets where one class, the majority class, outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

    A balanced approach to the multi-class imbalance problem

    The multi-class class-imbalance problem is a subset of supervised machine learning tasks where the classification variable of interest consists of three or more categories with unequal sample sizes. In the fields of manufacturing and business, common machine learning classification tasks such as failure mode, fraud, and threat detection often exhibit class imbalance due to the infrequent occurrence of one or more event states. Though machine learning as a discipline is well established, the study of class imbalance with respect to multi-class learning does not yet have the same deep, rich history. In its current state, the class imbalance literature relies on biased sampling and increasing model complexity to improve predictive performance, and while some have made advances, there is still no standard model evaluation criterion by which to compare their performance. In the presence of substantial multi-class distributional skew, many of the model evaluation criteria that can scale beyond the binary case become invalid due to their over-emphasis on the majority class observations. Going a step further, the evaluation criteria used in practice vary significantly across the class imbalance literature, and so far no single measure has been able to galvanize consensus, owing not only to implementation complexity but also to the existence of undesirable properties. Therefore, the focus of this research is to introduce a new performance measure, Class Balance Accuracy, designed specifically for model validation in the presence of multi-class imbalance. This paper begins with the definition of Class Balance Accuracy and provides an intuitive proof for its interpretation as a simultaneous lower bound for the average per class recall and average per class precision. Results from comparison studies show that models chosen by maximizing the training class balance accuracy consistently yield both high overall accuracy and per class recall on the test sets compared to the models chosen by other criteria. Simulation studies were then conducted to highlight specific scenarios where the use of class balance accuracy outperforms model selection based on regular accuracy. The measure is then invoked in two novel applications, one as the maximization criterion in the instance selection biased sampling technique and the other as a model selection tool in a multiple classifier system prediction algorithm. In the case of instance selection, the use of class balance accuracy shows improvement over traditional accuracy in scenarios of multi-class class-imbalance data sets with low separability between the majority and minority classes. Likewise, the use of CBA in the multiple classifier system resulted in improved predictions over state-of-the-art methods such as AdaBoost for some of the UCI machine learning repository test data sets. The paper then concludes with a discussion of the climbR package, a repository of functions designed to aid in the model evaluation and prediction of class imbalance machine learning problems.
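
The abstract describes Class Balance Accuracy (CBA) as a simultaneous lower bound on average per class recall and average per class precision, but does not give the formula. The sketch below assumes one form that has this property: for each class, the diagonal count of the confusion matrix is divided by the larger of its row total and column total, and the results are averaged over the classes. The function names and the example matrix are illustrative only and are not taken from the climbR package.

```python
def class_balance_accuracy(confusion):
    """Assumed CBA form: mean over classes of c_ii / max(row_total_i, col_total_i).

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    total = 0.0
    for i in range(k):
        row_total = sum(confusion[i])                       # samples truly in class i
        col_total = sum(confusion[r][i] for r in range(k))  # samples predicted as class i
        denom = max(row_total, col_total)
        if denom > 0:
            total += confusion[i][i] / denom
    return total / k


def mean_recall(confusion):
    """Average per class recall (classes with no samples contribute 0)."""
    k = len(confusion)
    parts = [confusion[i][i] / sum(confusion[i]) if sum(confusion[i]) else 0.0
             for i in range(k)]
    return sum(parts) / k


def mean_precision(confusion):
    """Average per class precision (classes never predicted contribute 0)."""
    k = len(confusion)
    parts = []
    for i in range(k):
        col_total = sum(confusion[r][i] for r in range(k))
        parts.append(confusion[i][i] / col_total if col_total else 0.0)
    return sum(parts) / k


# Hypothetical 3-class confusion matrix with one dominant class.
example = [
    [90, 5, 5],
    [10, 8, 2],
    [6, 1, 3],
]
cba = class_balance_accuracy(example)
```

Because each term divides by the larger of the two totals, it can never exceed that class's recall or its precision, which is why the average sits below both averaged quantities. On the example matrix, plain accuracy is 101/130 while the assumed CBA is noticeably lower, reflecting the weak performance on the two minority classes.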

    Cost-sensitive deep neural network ensemble for class imbalance problem

    In data mining, classification is the task of building a model which classifies data into a given set of categories. Most classification algorithms assume the class distribution of data to be roughly balanced. In real-life applications such as direct marketing, fraud detection and churn prediction, the class imbalance problem commonly occurs. The class imbalance problem refers to the situation in which the number of examples belonging to one class is significantly greater than that of the others. When training a standard classifier with class imbalance data, the classifier is usually biased toward the majority class. However, the minority class is the class of interest and more significant than the majority class. In the literature, existing methods such as data-level, algorithmic-level and cost-sensitive learning have been proposed to address this problem. The experiments discussed in these studies were usually conducted on relatively small data sets or even on artificial data. The performance of the methods on modern real-life data sets, which are more complicated, is unclear. In this research, we study the background and some of the state-of-the-art approaches which handle the class imbalance problem. We also propose two cost-sensitive methods to address the class imbalance problem, namely Cost-Sensitive Deep Neural Network (CSDNN) and Cost-Sensitive Deep Neural Network Ensemble (CSDE). CSDNN is a deep neural network based on Stacked Denoising Autoencoders (SDAE). We propose CSDNN by incorporating cost information of the majority and minority classes into the cost function of SDAE to make it cost-sensitive. Another proposed method, CSDE, is an ensemble learning version of CSDNN which is proposed to improve the generalization performance on the class imbalance problem. In the first step, a deep neural network based on SDAE is created for layer-wise feature extraction. Next, we perform Bagging’s resampling procedure with undersampling to split training data into a number of bootstrap samples. In the third step, we apply a layer-wise feature extraction method to extract new feature samples from each of the hidden layer(s) of the SDAE. Lastly, the ensemble learning is performed by using each of the new feature samples to train a CSDNN classifier with a random cost vector. Experiments are conducted to compare the proposed methods with the existing methods. We examine their performance on real-life data sets in business domains. The results show that the proposed methods obtain promising results in handling the class imbalance problem and also outperform all the other compared methods. There are three major contributions to this work. First, we propose the CSDNN method, in which misclassification costs are considered in the training process. Second, we incorporate random undersampling with layer-wise feature extraction to perform ensemble learning. Third, this is the first work that conducts experiments on the class imbalance problem using large real-life data sets in different business domains ranging from direct marketing, churn prediction, credit scoring, fraud detection to fake review detection.
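
Two of the ingredients named in this abstract can be illustrated in a few lines: a cost function that weights errors by class, and a bootstrap sample in which the majority class is undersampled. The sketch below is not the authors' CSDNN or CSDE; it assumes a plain cross-entropy scaled by a per-class cost, and a bootstrap that draws the minority class with replacement and an equal number of majority examples without replacement. All function names and the toy data are hypothetical.

```python
import math
import random


def cost_sensitive_cross_entropy(probs, labels, costs):
    """Mean cross-entropy where each sample's loss is scaled by the cost of its true class.

    probs:  list of per-class probability lists, one per sample
    labels: list of true class indices
    costs:  list of misclassification costs, one per class (assumed form)
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total += -costs[y] * math.log(max(p[y], eps))
    return total / len(labels)


def undersampled_bootstrap(labels, minority, rng):
    """Return sample indices: minority drawn with replacement, majority undersampled to match."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n = len(minority_idx)
    sample = [rng.choice(minority_idx) for _ in range(n)]
    sample += rng.sample(majority_idx, min(n, len(majority_idx)))
    return sample


# Toy predictions for a 2-class problem; class 1 is the minority.
probs = [[0.9, 0.1], [0.4, 0.6]]
labels = [0, 1]
plain = cost_sensitive_cross_entropy(probs, labels, [1.0, 1.0])
weighted = cost_sensitive_cross_entropy(probs, labels, [1.0, 5.0])

# One balanced bootstrap sample from an 8:2 imbalanced label list.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
bag = undersampled_bootstrap(toy_labels, minority=1, rng=random.Random(0))
```

With unit costs the loss reduces to ordinary cross-entropy; raising the minority cost increases the penalty for the second sample only. In an ensemble of the kind the abstract outlines, each bootstrap sample would train one classifier, possibly with a different cost vector per member.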