    Classification on imbalanced data sets, taking advantage of errors to improve performance

    Classification methods usually perform poorly when applied to imbalanced data sets. Several algorithms have been proposed in the last decade to overcome this problem. Most of them generate synthetic instances to balance the data set, regardless of the classification algorithm. These methods work reasonably well in most cases; however, they tend to cause over-fitting. In this paper, we propose a method to address the imbalance problem. Our approach, which is very simple to implement, works in two phases. The first phase detects instances that classification methods find difficult to predict correctly. These instances are then categorized as “noisy” or “secure”, where the former refers to instances most of whose nearest neighbors belong to the opposite class. The second phase generates a number of synthetic instances for each instance that is difficult to predict correctly. After applying our method to data sets, the AUC of the classifiers improves dramatically. We compare our method with state-of-the-art alternatives on more than 10 data sets.
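    The two phases described in this abstract can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names, the choice of k, and the Euclidean distance are assumptions, and the abstract does not say how synthetic instances are generated, so SMOTE-style interpolation toward same-class neighbors is swapped in here. The hard-to-predict instances are assumed to have been identified already (for example, as the misclassified instances of some classifier).

```python
import math
import random


def k_nearest(index, points, k):
    """Return indices of the k points closest to points[index], excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(points[index], points[i]))
    return others[:k]


def categorize(index, points, labels, k=3):
    """Return 'noisy' if most of the k nearest neighbors have another label, else 'secure'."""
    neighbors = k_nearest(index, points, k)
    opposite = sum(1 for i in neighbors if labels[i] != labels[index])
    return "noisy" if opposite > len(neighbors) / 2 else "secure"


def synthesize(index, points, labels, n_new, k=3, rng=None):
    """Create n_new synthetic points for a hard instance.

    ASSUMPTION: interpolation toward same-class neighbors (SMOTE-style);
    the abstract does not specify the generation rule.
    """
    if rng is None:
        rng = random.Random(0)
    same = [i for i in range(len(points))
            if i != index and labels[i] == labels[index]]
    same.sort(key=lambda i: math.dist(points[index], points[i]))
    candidates = same[:k]
    if not candidates:
        return []
    new_points = []
    for _ in range(n_new):
        j = rng.choice(candidates)   # pick one same-class neighbor
        t = rng.random()             # position along the connecting segment
        new_points.append(tuple(a + t * (b - a)
                                for a, b in zip(points[index], points[j])))
    return new_points


# Toy data: index 5 carries label 1 but sits inside the label-0 cluster.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (0.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1]
hard_instances = [3, 5]  # assumed output of phase one
categories = {i: categorize(i, points, labels) for i in hard_instances}
synthetic = synthesize(3, points, labels, n_new=4)
```

    In this toy example every nearest neighbor of index 5 belongs to the other class, so it falls into the "noisy" category, while index 3 is "secure".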

    Deep Over-sampling Framework for Classifying Imbalanced Data

    Class imbalance is a challenging issue in practical classification problems for deep learning models as well as traditional models. Traditionally successful countermeasures such as synthetic over-sampling have had limited success with complex, structured data handled by deep learning models. In this paper, we propose Deep Over-sampling (DOS), a framework for extending the synthetic over-sampling method to exploit the deep feature space acquired by a convolutional neural network (CNN). Its key feature is explicit, supervised representation learning, in which the training data pairs each raw input sample with a synthetic embedding target in the deep feature space, sampled from the linear subspace of in-class neighbors. We implement an iterative process of training the CNN and updating the targets, which induces smaller in-class variance among the embeddings and so increases the discriminative power of the deep representation. We present an empirical study using public benchmarks, which shows that the DOS framework not only counteracts class imbalance better than the existing method, but also improves the performance of the CNN in standard, balanced settings.
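    The idea of sampling an embedding target from the linear subspace of in-class neighbors might look like the sketch below. It is a hedged illustration rather than the DOS implementation: the function names, the value of k, and the use of random non-negative weights normalized to sum to one (a convex combination) are assumptions, and the embeddings are plain tuples standing in for CNN features.

```python
import math
import random


def in_class_neighbors(index, embeddings, labels, k):
    """Return the k nearest same-class embeddings to embeddings[index]."""
    same = [i for i in range(len(embeddings))
            if i != index and labels[i] == labels[index]]
    same.sort(key=lambda i: math.dist(embeddings[index], embeddings[i]))
    return same[:k]


def sample_target(index, embeddings, labels, k=2, rng=None):
    """Sample a synthetic embedding target for one training sample.

    ASSUMPTION: the target is a random convex combination of the
    in-class neighbors; the abstract only says it is sampled from
    their linear subspace.
    """
    if rng is None:
        rng = random.Random(0)
    neighbors = in_class_neighbors(index, embeddings, labels, k)
    if not neighbors:
        # No other sample of this class: fall back to the sample's own embedding.
        return tuple(embeddings[index])
    raw = [rng.random() for _ in neighbors]
    total = sum(raw)
    if total == 0:
        weights = [1 / len(neighbors)] * len(neighbors)
    else:
        weights = [w / total for w in raw]
    dim = len(embeddings[index])
    return tuple(sum(w * embeddings[n][d] for w, n in zip(weights, neighbors))
                 for d in range(dim))


# Toy 2-D "embeddings": three samples of class 0, two of class 1.
embeddings = [(0, 0), (2, 0), (0, 2), (10, 10), (11, 10)]
labels = [0, 0, 0, 1, 1]
target_for_first = sample_target(0, embeddings, labels, k=2)
target_for_fourth = sample_target(3, embeddings, labels, k=2)
```

    In the iterative scheme the abstract describes, targets like these would be recomputed after each round of CNN training, since the embeddings themselves change.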

    On the role of pre and post-processing in environmental data mining

    The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions based on KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details about how this can be achieved and discusses the role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems.