552 research outputs found

    An under-Sampled Approach for Handling Skewed Data Distribution using Cluster Disjuncts

    Get PDF
    In Data mining and Knowledge Discovery hidden and valuable knowledge from the data sources is discovered. The traditional algorithms used for knowledge discovery are bottle necked due to wide range of data sources availability. Class imbalance is a one of the problem arises due to data source which provide unequal class i.e. examples of one class in a training data set vastly outnumber examples of the other class(es). Researchers have rigorously studied several techniques to alleviate the problem of class imbalance, including resampling algorithms, and feature selection approaches to this problem. In this paper, we present a new hybrid frame work dubbed as Majority Under-sampling based on Cluster Disjunct (MAJOR_CD) for learning from skewed training data. This algorithm provides a simpler and faster alternative by using cluster disjunct concept. We conduct experiments using twelve UCI data sets from various application domains using five algorithms for comparison on six evaluation metrics. The empirical study suggests that MAJOR_CD have been believed to be effective in addressing the class imbalance problem

    An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis

    Get PDF
    Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics have a significant impact on the performance of imbalanced classifiers, which are generally neglected by existing evaluation methods. The objective of this study is to introduce a new criterion to comprehensively evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve that is established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefits of improved minority class accuracy and the cost of reduced majority class accuracy. In sequence, we analyze the impact of the imbalanced ratio and typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced data reveal that traditional classifiers such as C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective for overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically when the imbalanced ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps to select appropriate classifiers for imbalanced data based on data characteristics.National Natural Science Foundation of China (NSFC) 71874023 71725001 71771037 7197104

    Coupling different methods for overcoming the class imbalance problem

    Get PDF
    Many classification problems must deal with imbalanced datasets where one class \u2013 the majority class \u2013 outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

    On the class overlap problem in imbalanced data classification.

    Get PDF
    Class imbalance is an active research area in the machine learning community. However, existing and recent literature showed that class overlap had a higher negative impact on the performance of learning algorithms. This paper provides detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handle imbalanced datasets. Existing solutions from selective literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest development in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance

    Radar-based Feature Design and Multiclass Classification for Road User Recognition

    Full text link
    The classification of individual traffic participants is a complex task, especially for challenging scenarios with multiple road users or under bad weather conditions. Radar sensors provide an - with respect to well established camera systems - orthogonal way of measuring such scenes. In order to gain accurate classification results, 50 different features are extracted from the measurement data and tested on their performance. From these features a suitable subset is chosen and passed to random forest and long short-term memory (LSTM) classifiers to obtain class predictions for the radar input. Moreover, it is shown why data imbalance is an inherent problem in automotive radar classification when the dataset is not sufficiently large. To overcome this issue, classifier binarization is used among other techniques in order to better account for underrepresented classes. A new method to couple the resulting probabilities is proposed and compared to others with great success. Final results show substantial improvements when compared to ordinary multiclass classificationComment: 8 pages, 6 figure

    Learning from Multi-Class Imbalanced Big Data with Apache Spark

    Get PDF
    With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when with multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the number of examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges with in-depth experimentation using novel implementations of machine learning algorithms using Apache Spark, a distributed computing framework based on the MapReduce model designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary class datasets, have been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study on how instance level difficulty affects the learning from large datasets was also performed
    • …
    corecore