3,045 research outputs found

    Coupling different methods for overcoming the class imbalance problem

    Get PDF
    Many classification problems must deal with imbalanced datasets where one class \u2013 the majority class \u2013 outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

    Bag-Level Aggregation for Multiple Instance Active Learning in Instance Classification Problems

    Full text link
    A growing number of applications, e.g. video surveillance and medical image analysis, require training recognition systems from large amounts of weakly annotated data while some targeted interactions with a domain expert are allowed to improve the training process. In such cases, active learning (AL) can reduce labeling costs for training a classifier by querying the expert to provide the labels of most informative instances. This paper focuses on AL methods for instance classification problems in multiple instance learning (MIL), where data is arranged into sets, called bags, that are weakly labeled. Most AL methods focus on single instance learning problems. These methods are not suitable for MIL problems because they cannot account for the bag structure of data. In this paper, new methods for bag-level aggregation of instance informativeness are proposed for multiple instance active learning (MIAL). The \textit{aggregated informativeness} method identifies the most informative instances based on classifier uncertainty, and queries bags incorporating the most information. The other proposed method, called \textit{cluster-based aggregative sampling}, clusters data hierarchically in the instance space. The informativeness of instances is assessed by considering bag labels, inferred instance labels, and the proportion of labels that remain to be discovered in clusters. Both proposed methods significantly outperform reference methods in extensive experiments using benchmark data from several application domains. Results indicate that using an appropriate strategy to address MIAL problems yields a significant reduction in the number of queries needed to achieve the same level of performance as single instance AL methods

    Robust Framework to Combine Diverse Classifiers Assigning Distributed Confidence to Individual Classifiers at Class Level

    Get PDF
    We have presented a classification framework that combines multiple heterogeneous classifiers in the presence of class label noise. An extension of m-Mediods based modeling is presented that generates model of various classes whilst identifying and filtering noisy training data. This noise free data is further used to learn model for other classifiers such as GMM and SVM. A weight learning method is then introduced to learn weights on each class for different classifiers to construct an ensemble. For this purpose, we applied genetic algorithm to search for an optimal weight vector on which classifier ensemble is expected to give the best accuracy. The proposed approach is evaluated on variety of real life datasets. It is also compared with existing standard ensemble techniques such as Adaboost, Bagging, and Random Subspace Methods. Experimental results show the superiority of proposed ensemble method as compared to its competitors, especially in the presence of class label noise and imbalance classes

    BagStack Classification for Data Imbalance Problems with Application to Defect Detection and Labeling in Semiconductor Units

    Get PDF
    abstract: Despite the fact that machine learning supports the development of computer vision applications by shortening the development cycle, finding a general learning algorithm that solves a wide range of applications is still bounded by the ”no free lunch theorem”. The search for the right algorithm to solve a specific problem is driven by the problem itself, the data availability and many other requirements. Automated visual inspection (AVI) systems represent a major part of these challenging computer vision applications. They are gaining growing interest in the manufacturing industry to detect defective products and keep these from reaching customers. The process of defect detection and classification in semiconductor units is challenging due to different acceptable variations that the manufacturing process introduces. Other variations are also typically introduced when using optical inspection systems due to changes in lighting conditions and misalignment of the imaged units, which makes the defect detection process more challenging. In this thesis, a BagStack classification framework is proposed, which makes use of stacking and bagging concepts to handle both variance and bias errors. The classifier is designed to handle the data imbalance and overfitting problems by adaptively transforming the multi-class classification problem into multiple binary classification problems, applying a bagging approach to train a set of base learners for each specific problem, adaptively specifying the number of base learners assigned to each problem, adaptively specifying the number of samples to use from each class, applying a novel data-imbalance aware cross-validation technique to generate the meta-data while taking into account the data imbalance problem at the meta-data level and, finally, using a multi-response random forest regression classifier as a meta-classifier. The BagStack classifier makes use of multiple features to solve the defect classification problem. In order to detect defects, a locally adaptive statistical background modeling is proposed. The proposed BagStack classifier outperforms state-of-the-art image classification techniques on our dataset in terms of overall classification accuracy and average per-class classification accuracy. The proposed detection method achieves high performance on the considered dataset in terms of recall and precision.Dissertation/ThesisDoctoral Dissertation Computer Engineering 201

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    Get PDF
    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered \de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to di erent type of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several di erent domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has also signi cantly contributed to new supervised learning paradigms, including multilabel classi cation, incremental learning, semi-supervised learning, multi-instance learning, among others. It is standard benchmark for learning from imbalanced data. It is also featured in a number of di erent software packages | from open source to commercial. In this paper, marking the fteen year anniversary of SMOTE, we re ect on the SMOTE journey, discuss the current state of a airs with SMOTE, its applications, and also identify the next set of challenges to extend SMOTE for Big Data problems.This work have been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundaci on BBVA a Equipos de Investigaci on Cient ca 2016; and the National Science Foundation (NSF) Grant IIS-1447795

    Principled Diverse Counterfactuals in Multilinear Models

    Get PDF
    Machine learning (ML) applications have automated numerous real-life tasks,improving both private and public life. However, the black-box nature of manystate-of-the-art models poses the challenge of model verification; how can onebe sure that the algorithm bases its decisions on the proper criteria, or that itdoes not discriminate against certain minority groups? In this paper we proposea way to generate diverse counterfactual explanations from multilinear models,a broad class which includes Random Forests, as well as Bayesian Networks.<br/

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Full text link
    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas are described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research

    Vote-boosting ensembles

    Full text link
    Vote-boosting is a sequential ensemble learning method in which the individual classifiers are built on different weighted versions of the training data. To build a new classifier, the weight of each training instance is determined in terms of the degree of disagreement among the current ensemble predictions for that instance. For low class-label noise levels, especially when simple base learners are used, emphasis should be made on instances for which the disagreement rate is high. When more flexible classifiers are used and as the noise level increases, the emphasis on these uncertain instances should be reduced. In fact, at sufficiently high levels of class-label noise, the focus should be on instances on which the ensemble classifiers agree. The optimal type of emphasis can be automatically determined using cross-validation. An extensive empirical analysis using the beta distribution as emphasis function illustrates that vote-boosting is an effective method to generate ensembles that are both accurate and robust
    • 

    corecore