
    Oversampling for Imbalanced Learning Based on K-Means and SMOTE

    Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods that generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language. Comment: 19 pages, 8 figures
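    The idea the abstract describes can be sketched in a few lines (a minimal illustration, not the authors' reference implementation): cluster the input space with k-means, keep clusters in which the minority class is well represented, and generate SMOTE-style interpolated samples within those clusters. All names and the 0.5 filter threshold below are illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_smote(X, y, minority_label, n_clusters=10, n_new=100,
                     min_minority_ratio=0.5, random_state=0):
        rng = np.random.default_rng(random_state)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=random_state).fit_predict(X)
        # Keep only clusters dominated by the minority class, so interpolation
        # happens in "safe" regions and avoids generating noisy samples.
        safe = [c for c in range(n_clusters)
                if (y[labels == c] == minority_label).mean() >= min_minority_ratio
                and ((labels == c) & (y == minority_label)).sum() >= 2]
        synthetic = []
        for c in safe:
            pts = X[(labels == c) & (y == minority_label)]
            for _ in range(n_new // len(safe)):
                i, j = rng.choice(len(pts), size=2, replace=False)
                # SMOTE-style step: a random point on the segment between two
                # minority samples drawn from the same cluster.
                synthetic.append(pts[i] + rng.random() * (pts[j] - pts[i]))
        if not synthetic:
            return X, y
        return (np.vstack([X, np.asarray(synthetic)]),
                np.concatenate([y, np.full(len(synthetic), minority_label)]))
    ```

    A production-ready variant of the same idea is available off the shelf as KMeansSMOTE in the imbalanced-learn (imblearn) package.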

    A Review of Codebook Models in Patch-Based Visual Object Recognition

    The codebook model-based approach, while ignoring any structural aspect in vision, nonetheless provides state-of-the-art performance on current datasets. The key role of a visual codebook is to map low-level features into a fixed-length vector in histogram space, to which standard classifiers can be directly applied. The discriminative power of such a visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. The construction of a codebook is thus an important step, usually performed by cluster analysis. However, clustering is a process that retains regions of high density in a distribution, so the resulting codebook need not have discriminant properties; codebook construction is also recognised as a computational bottleneck of such systems. In our recent work, we proposed a resource-allocating codebook that constructs a discriminant codebook in a one-pass design procedure and slightly outperforms more traditional approaches at drastically reduced computing times. In this review we survey several approaches proposed over the last decade, covering their feature detectors, descriptors, codebook construction schemes, choice of classifiers for recognising objects, and the datasets used to evaluate the proposed methods.
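    The standard clustering-based pipeline the review discusses (bag of visual words) can be sketched as follows, assuming local descriptors (e.g. SIFT-like vectors) have already been extracted; function and variable names are illustrative.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptor_stack, codebook_size=256, random_state=0):
        """Cluster pooled local descriptors; centroids act as visual words."""
        km = KMeans(n_clusters=codebook_size, n_init=10,
                    random_state=random_state)
        km.fit(descriptor_stack)  # descriptor_stack: (n_descriptors, dim)
        return km

    def encode(image_descriptors, codebook):
        """Map one image's descriptors to a fixed-length codeword histogram."""
        words = codebook.predict(image_descriptors)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / hist.sum()  # L1-normalise so images are comparable
    ```

    The resulting fixed-length histograms are exactly the vectors to which standard classifiers, such as an SVM, can then be directly applied.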

    EC3: Combining Clustering and Classification for Ensemble Learning

    Classification and clustering algorithms have each proved successful in different contexts, and both have their own advantages and limitations. For instance, although classification algorithms are more powerful than clustering methods in predicting class labels of objects, they do not perform well when sufficient, reliably labeled data is lacking. Conversely, although clustering algorithms do not produce label information for objects, they provide supplementary constraints (e.g., if two objects are clustered together, it is more likely that the same label is assigned to both) that one can leverage for label prediction on a set of unknown objects. Systematic utilization of both types of algorithms together can therefore lead to better prediction performance. In this paper, we propose a novel algorithm, called EC3, that merges classification and clustering to support both binary and multi-class classification. EC3 is based on a principled combination of multiple classification and multiple clustering methods using an optimization function. We theoretically show the convexity and optimality of the problem and solve it with a block coordinate descent method. We additionally propose iEC3, a variant of EC3 that handles imbalanced training data. We perform an extensive experimental analysis comparing EC3 and iEC3 with 14 baseline methods (7 well-known standalone classifiers, 5 ensemble classifiers, and 2 existing methods that merge classification and clustering) on 13 standard benchmark datasets. Our methods outperform all baselines on every dataset, achieving up to 10% higher AUC. Moreover, our methods are faster (1.21 times faster than the best baseline) and more resilient to noise and class imbalance than the best baseline method. Comment: 14 pages, 7 figures, 11 tables
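    As a rough illustration of the underlying intuition (a toy sketch, not the EC3 objective or its block coordinate descent solver), one can average base-classifier probabilities and then smooth them along the co-association structure induced by the base clusterings, so that co-clustered points are nudged toward the same label. The mixing weight alpha and all names are assumptions for illustration.

    ```python
    import numpy as np

    def combine(prob_list, cluster_label_list, alpha=0.5, n_iter=10):
        # prob_list: list of (n_samples, n_classes) probability matrices,
        # one per base classifier; cluster_label_list: list of (n_samples,)
        # cluster assignments, one per base clustering.
        P = np.mean(prob_list, axis=0)  # classifier consensus
        # Co-association: fraction of clusterings in which points i, j
        # fall in the same cluster.
        n = P.shape[0]
        C = np.zeros((n, n))
        for labels in cluster_label_list:
            C += (labels[:, None] == labels[None, :])
        C /= len(cluster_label_list)
        W = C / C.sum(axis=1, keepdims=True)  # row-normalised weights
        Q = P.copy()
        for _ in range(n_iter):
            # Propagate label mass along cluster structure, anchored to the
            # classifier consensus.
            Q = alpha * (W @ Q) + (1 - alpha) * P
        return Q.argmax(axis=1)
    ```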