
    Coupling different methods for overcoming the class imbalance problem

    Many classification problems must deal with imbalanced datasets where one class – the majority class – outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
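    As a minimal illustrative sketch (not the authors' MATLAB code, which the abstract points to at the URL above), one of the simplest strategies in this family is an ensemble whose members are each trained on a class-balanced resample of the data; the classifier choice and parameter values below are assumptions, not the paper's configuration.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def balanced_bagging(X, y, n_estimators=25, random_state=0):
        # Train each member on a resample that draws the same number of
        # examples from every class, so no member is biased toward the majority.
        rng = np.random.default_rng(random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        ensemble = []
        for _ in range(n_estimators):
            idx = np.concatenate([
                rng.choice(np.where(y == c)[0], size=n_min, replace=True)
                for c in classes
            ])
            ensemble.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))
        return classes, ensemble

    def predict(classes, ensemble, X):
        # Fuse the members by averaging their per-class probabilities.
        proba = np.mean([clf.predict_proba(X) for clf in ensemble], axis=0)
        return classes[np.argmax(proba, axis=1)]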

    Double Committee AdaBoost

    In this paper we make an extensive study of different combinations of ensemble techniques for improving the performance of AdaBoost, considering the following strategies: reducing the correlation among the features, reducing the effect of outliers in AdaBoost training, and proposing an efficient way of selecting/weighting the weak learners. First, we show that the random subspace method works well when coupled with several AdaBoost techniques. Second, we show that an ensemble based on training-set perturbation using editing methods (to reduce the importance of outliers) further improves performance. We examine the robustness of the new approach by applying it to a number of benchmark datasets representing a range of different problems, and find that, compared with other state-of-the-art classifiers, the proposed method performs consistently well across all the tested datasets. One useful finding is that this approach obtains performance similar to a support vector machine (SVM), using the well-known LibSVM implementation, even when both the kernel and the various SVM parameters are carefully tuned for each dataset. The main drawback of the proposed approach is its computation time, which is high as a result of combining the different ensemble techniques. We have also tested the fusion of our selected committee of AdaBoost with SVM (again using the widely tested LibSVM tool), where the SVM parameters are tuned for each dataset. We find that this fusion of SVM and a committee of AdaBoost (i.e., a heterogeneous ensemble) statistically outperforms the widely used SVM tool with parameters tuned for each dataset. The MATLAB code of our best approach is available at bias.csr.unibo.it/nanni/ADA.rar
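    A rough sketch, assuming scikit-learn, of the first idea above: coupling the random subspace method with AdaBoost, where each AdaBoost ensemble sees only a random subset of the features and the final score is averaged over subspaces. The function names and parameter values are illustrative, not the paper's implementation.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def random_subspace_adaboost(X, y, n_subspaces=10, subspace_frac=0.5, seed=0):
        # Fit one AdaBoost ensemble per random feature subset.
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        k = max(1, int(subspace_frac * n_features))
        models = []
        for _ in range(n_subspaces):
            feats = rng.choice(n_features, size=k, replace=False)
            ada = AdaBoostClassifier(n_estimators=50).fit(X[:, feats], y)
            models.append((feats, ada))
        return models

    def decision(models, X):
        # Average class-membership probabilities over all feature subspaces.
        return np.mean([ada.predict_proba(X[:, feats]) for feats, ada in models], axis=0)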

    Experiments to Parameters and Base Classifiers in the Fitness Function for GA-Ensemble

    GA-Ensemble is found to be more resistant to outliers and to produce simpler predictive models than other ensemble methods. Its fitness function has three parameters (a, b, and p): b limits the number of base classifiers, a controls the effect of outliers, and the function maximizes an appropriately chosen p-th percentile of the margins. We study the effect of these parameters, as well as of increasing the complexity of the base classifiers, on predictive accuracy. Using several artificial and real data sets, we examine GA-Ensemble's performance across 16 different treatment levels with three base-classifier options and compare it to AdaBoost.
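    The abstract does not give the exact functional form, so the following is only a hypothetical reading of a fitness function in this spirit: the p-th percentile of the ensemble margins is rewarded, b caps the number of selected base classifiers, and a damps the influence of low-margin (outlier-like) examples.

    import numpy as np

    def fitness(chromosome, margins_per_clf, a=0.1, b=20, p=10):
        # chromosome: 0/1 vector selecting base classifiers (one GA individual).
        # margins_per_clf: (n_classifiers, n_samples) array of per-sample margins.
        selected = np.flatnonzero(chromosome)
        if len(selected) == 0 or len(selected) > b:          # enforce the size limit b
            return -np.inf
        margins = margins_per_clf[selected].mean(axis=0)     # ensemble margin per sample
        weights = 1.0 / (1.0 + a * np.maximum(0.0, -margins))  # shrink outlier influence
        return np.percentile(margins * weights, p)           # reward the p-th percentile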

    Large dataset complexity reduction for classification: An optimization perspective

    Doctor of Philosophy
    Computational complexity in data mining is attributed to algorithms but lies largely with the data. Different algorithms may exist to solve the same problem, but the simplest is not always the best. At the same time, data of astronomical proportions is now common, boosted by automation, and the fuller the data, the better the resolution of the concept it projects. Paradoxically, it is computing power that is lacking: a fast algorithm may be runnable on the data, but not the optimal one, and even then any modeling is heavily constrained, involving the serial application of many algorithms. The only other way to relieve the computational load is to make the data lighter. Any representative subset has to preserve the essence of the data and, ideally, suit any algorithm; the reduction should minimize the approximation error while trading precision for performance. Data mining is a wide field; we concentrate on classification. In the literature review we present a variety of methods, emphasizing the efforts of the past decade. The two major objects of reduction are instances and attributes, and the data can also be recast into a more economical format. We address sampling, noise reduction, class domain binarization, feature ranking, feature subset selection, feature extraction, and discretization of continuous features. Achievements are tremendous, but so are the remaining possibilities. We improve an existing data-cleansing technique and suggest a data-condensing method as its extension; we also touch on noise reduction. Instance similarity, excepting the class mix, prompts a feature selection technique. Additionally, we consider multivariate discretization, enabling a compact data representation without changing the dataset's size. We compare the proposed methods with alternative techniques, which we either introduce, implement, or use as available.
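    A small sketch, assuming scikit-learn, of two of the reduction steps surveyed above (feature ranking/selection and discretization of continuous features); it does not reproduce the thesis's own cleansing and condensing methods, and the parameter values are placeholders.

    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.pipeline import make_pipeline

    def reduce_dataset(X, y, n_features=10, n_bins=5):
        # Keep the top-ranked features, then bin them into equal-frequency intervals.
        reducer = make_pipeline(
            SelectKBest(mutual_info_classif, k=n_features),
            KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile"),
        )
        return reducer.fit_transform(X, y)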