303 research outputs found

    Minimax Classifier with Box Constraint on the Priors

    Learning a classifier in safety-critical applications like medicine raises several issues. Firstly, the class proportions, also called priors, are in general imbalanced or uncertain. Experts can sometimes provide bounds on the priors, and taking this knowledge into account can improve the predictions. Secondly, the classifier must accommodate any loss function that experts specify to evaluate the classification decision. Finally, the dataset may contain both categorical and numeric features. In this paper, we propose a box-constrained minimax classifier which addresses all of these issues. To deal with both categorical and numeric features, many works have shown that discretizing the numeric attributes can lead to interesting results. Here, we thus consider that numeric features are discretized. To address the class-proportion issues, we compute the priors which maximize the empirical Bayes risk over a box-constrained probabilistic simplex. This constraint set is defined as the intersection between the simplex and a box constraint provided by experts, which bounds each class proportion independently. Our approach yields a compromise between the empirical Bayes classifier and the standard minimax classifier, which may appear too pessimistic. The standard minimax classifier, which has not yet been studied for discrete features, remains accessible through our approach. When considering only discrete features, we show that, for any loss function, the empirical Bayes risk, considered as a function of the priors, is a concave, non-differentiable, multivariate piecewise affine function. To compute the box-constrained least favorable priors, we derive a projected subgradient algorithm. The convergence of our algorithm is established. The performance of our algorithm is illustrated with experiments on the Framingham study database to predict the risk of Coronary Heart Disease (CHD).
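
    The idea of maximizing a concave piecewise affine Bayes risk over box-constrained priors can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 1/n step size, the bisection-based projection, and the toy numbers are all assumptions chosen for illustration. Here `cond[k][t]` stands for the estimated probability of discrete profile `t` given class `k`, and `loss[k][l]` for the cost of predicting `l` when the true class is `k`.

```python
def bayes_risk_and_supergradient(priors, cond, loss):
    """Empirical Bayes risk V(priors) and one supergradient of V.

    The supergradient's k-th entry is the class-conditional risk of the
    Bayes rule fitted to the current priors, so V = sum_k priors[k] * grad[k].
    """
    n_classes, n_profiles = len(priors), len(cond[0])
    grad = [0.0] * n_classes
    for t in range(n_profiles):
        # Expected cost of each decision l on profile t under current priors.
        costs = [sum(loss[k][l] * priors[k] * cond[k][t] for k in range(n_classes))
                 for l in range(n_classes)]
        best = min(range(n_classes), key=costs.__getitem__)
        for k in range(n_classes):
            grad[k] += loss[k][best] * cond[k][t]
    risk = sum(p * g for p, g in zip(priors, grad))
    return risk, grad


def project_box_simplex(v, lower, upper, iters=100):
    """Euclidean projection of v onto {p : sum(p) = 1, lower <= p <= upper}.

    Assumes sum(lower) <= 1 <= sum(upper). The projection has the form
    clip(v - tau, lower, upper); tau is found by bisection (an assumed,
    generic choice, not necessarily the paper's projection step).
    """
    def clipped(tau):
        return [min(max(x - tau, lo), hi) for x, lo, hi in zip(v, lower, upper)]

    lo_tau = min(x - hi for x, hi in zip(v, upper))  # every entry at its upper bound
    hi_tau = max(x - lo for x, lo in zip(v, lower))  # every entry at its lower bound
    for _ in range(iters):
        tau = 0.5 * (lo_tau + hi_tau)
        if sum(clipped(tau)) > 1.0:
            lo_tau = tau
        else:
            hi_tau = tau
    return clipped(0.5 * (lo_tau + hi_tau))


def box_minimax_priors(cond, loss, lower, upper, steps=200):
    """Projected supergradient ascent; returns the best iterate and its risk."""
    n_classes = len(cond)
    priors = project_box_simplex([1.0 / n_classes] * n_classes, lower, upper)
    best_priors, best_risk = priors, float("-inf")
    for n in range(1, steps + 1):
        risk, grad = bayes_risk_and_supergradient(priors, cond, loss)
        if risk > best_risk:
            best_priors, best_risk = priors, risk
        step = 1.0 / n  # diminishing step size (assumed schedule)
        moved = [p + step * g for p, g in zip(priors, grad)]
        priors = project_box_simplex(moved, lower, upper)
    return best_priors, best_risk


# Toy problem (made-up numbers): 2 classes, 3 discrete profiles, 0-1 loss.
COND = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
ZERO_ONE = [[0.0, 1.0], [1.0, 0.0]]
free_priors, free_risk = box_minimax_priors(COND, ZERO_ONE, [0.0, 0.0], [1.0, 1.0])
boxed_priors, boxed_risk = box_minimax_priors(COND, ZERO_ONE, [0.1, 0.0], [0.4, 1.0])
```

    In this toy setting the box limits the first prior to [0.1, 0.4], so the constrained worst-case risk is lower than the unconstrained minimax risk, which mirrors the compromise between the empirical Bayes and standard minimax classifiers described in the abstract.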

    Semantic concept detection in imbalanced datasets based on different under-sampling strategies

    Semantic concept detection is a very useful technique for developing powerful retrieval or filtering systems for multimedia data. To date, the methods for concept detection have been converging on generic classification schemes. However, classification often faces imbalanced-dataset or rare-class problems, which degrade the performance of many classifiers. In this paper, we adopt three “under-sampling” strategies to handle this imbalanced dataset issue in an SVM classification framework and evaluate their performance on the TRECVid 2007 dataset and additional positive samples from the TRECVid 2010 development set. Experimental results show that our well-designed “under-sampling” methods (method SAK) increase the performance of concept detection by about 9.6% overall. In cases of extreme imbalance in the collection, the proposed methods perform worse than a baseline sampling method (method SI); in the majority of cases, however, they increase the performance of concept detection substantially. We also conclude that method SAK is a promising solution for SVM classification on datasets that are not extremely imbalanced.
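
    The abstract does not describe how methods SAK and SI select samples, so the following is only a generic sketch of random under-sampling, the simplest member of this family, and not a reconstruction of either method. The function name, the `ratio` parameter, and the convention that label 1 marks the rare positive class are assumptions for illustration.

```python
import random


def random_undersample(samples, labels, ratio=1.0, seed=0):
    """Keep every positive (label 1) and a random subset of the negatives.

    At most round(ratio * n_positives) negatives are kept, so ratio=1.0
    yields a balanced training set when enough negatives are available.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y != 1]
    n_keep = min(len(neg_idx), int(round(ratio * len(pos_idx))))
    kept = sorted(pos_idx + rng.sample(neg_idx, n_keep))  # preserve original order
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 5 positives among 100 samples.
toy_labels = [1] * 5 + [0] * 95
toy_samples = list(range(100))
sub_samples, sub_labels = random_undersample(toy_samples, toy_labels, ratio=2.0)
```

    The reduced set would then be passed to the classifier's training routine in place of the full, imbalanced one.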

    Generative Adversarial Networks for Mitigating Biases in Machine Learning Systems

    In this paper, we propose a new framework for mitigating biases in machine learning systems. The problem with existing mitigation approaches is that they are model-oriented, in the sense that they focus on tuning the training algorithms to produce fair results while overlooking the fact that the training data can itself be the main reason for biased outcomes. Technically speaking, two essential limitations can be found in such model-based approaches: 1) the mitigation cannot be achieved without degrading the accuracy of the machine learning models, and 2) when the data used for training are largely biased, the training time automatically increases so as to find suitable learning parameters that help produce fair results. To address these shortcomings, we propose in this work a new framework that can largely mitigate the biases and discriminations in machine learning systems while at the same time enhancing the prediction accuracy of these systems. The proposed framework is based on conditional Generative Adversarial Networks (cGANs), which are used to generate new synthetic fair data with selective properties from the original data. We also propose a framework for analyzing data biases, which is important for understanding the amount and type of data that need to be synthetically sampled and labeled for each population group. Experimental results show that the proposed solution can efficiently mitigate different types of biases, while at the same time enhancing the prediction accuracy of the underlying machine learning model.

    Biased classification for relevance feedback in content-based image retrieval.

    Peng, Xiang. Thesis (M.Phil.)--Chinese University of Hong Kong, 2007. Includes bibliographical references (leaves 98-115). Abstracts in English and Chinese.

    Contents:
    Abstract
    Acknowledgement
    Chapter 1: Introduction
        1.1 Problem Statement
        1.2 Major Contributions
        1.3 Thesis Outline
    Chapter 2: Background Study
        2.1 Content-based Image Retrieval
            2.1.1 Image Representation
            2.1.2 High Dimensional Indexing
            2.1.3 Image Retrieval Systems Design
        2.2 Relevance Feedback
            2.2.1 Self-Organizing Map in Relevance Feedback
            2.2.2 Decision Tree in Relevance Feedback
            2.2.3 Bayesian Classifier in Relevance Feedback
            2.2.4 Nearest Neighbor Search in Relevance Feedback
            2.2.5 Support Vector Machines in Relevance Feedback
        2.3 Imbalanced Classification
        2.4 Active Learning
            2.4.1 Uncertainty-based Sampling
            2.4.2 Error Reduction
            2.4.3 Batch Selection
        2.5 Convex Optimization
            2.5.1 Overview of Convex Optimization
            2.5.2 Linear Program
            2.5.3 Quadratic Program
            2.5.4 Quadratically Constrained Quadratic Program
            2.5.5 Cone Program
            2.5.6 Semi-definite Program
    Chapter 3: Imbalanced Learning with BMPM for CBIR
        3.1 Research Motivation
        3.2 Background Review
            3.2.1 Relevance Feedback for CBIR
            3.2.2 Minimax Probability Machine
            3.2.3 Extensions of Minimax Probability Machine
        3.3 Relevance Feedback using BMPM
            3.3.1 Model Definition
            3.3.2 Advantages of BMPM in Relevance Feedback
            3.3.3 Relevance Feedback Framework by BMPM
        3.4 Experimental Results
            3.4.1 Experiment Datasets
            3.4.2 Performance Evaluation
            3.4.3 Discussions
        3.5 Summary
    Chapter 4: BMPM Active Learning for CBIR
        4.1 Problem Statement and Motivation
        4.2 Background Review
        4.3 Relevance Feedback by BMPM Active Learning
            4.3.1 Active Learning Concept
            4.3.2 General Approaches for Active Learning
            4.3.3 Biased Minimax Probability Machine
            4.3.4 Proposed Framework
        4.4 Experimental Results
            4.4.1 Experiment Setup
            4.4.2 Performance Evaluation
        4.5 Summary
    Chapter 5: Large Scale Learning with BMPM
        5.1 Introduction
            5.1.1 Motivation
            5.1.2 Contribution
        5.2 Background Review
            5.2.1 Second Order Cone Program
            5.2.2 General Methods for Large Scale Problems
            5.2.3 Biased Minimax Probability Machine
        5.3 Efficient BMPM Training
            5.3.1 Proposed Strategy
            5.3.2 Kernelized BMPM and Its Solution
        5.4 Experimental Results
            5.4.1 Experimental Testbeds
            5.4.2 Experimental Settings
            5.4.3 Performance Evaluation
        5.5 Summary
    Chapter 6: Conclusion and Future Work
        6.1 Conclusion
        6.2 Future Work
    Chapter A: List of Symbols and Notations
    Chapter B: List of Publications
    Bibliography

    Learning With Imbalanced Data in Smart Manufacturing: A Comparative Analysis

    The Internet of Things (IoT) paradigm is revolutionising the world of manufacturing into what is known as Smart Manufacturing or Industry 4.0. The main pillar of smart manufacturing is harnessing IoT data and leveraging machine learning (ML) to automate the prediction of faults, thus cutting maintenance time and cost and improving product quality. However, faults in real industries are overwhelmingly outnumbered by instances of good performance (faultless samples); this bias is reflected in the data captured by IoT devices. Imbalanced data limits the success of ML in predicting faults and thus presents a significant hindrance to the progress of smart manufacturing. Although various techniques have been proposed to tackle this challenge in general, this work is the first to present a framework for evaluating the effectiveness of these remedies in the context of manufacturing. We present a comprehensive comparative analysis in which we apply our proposed framework to benchmark the performance of different combinations of algorithm components using a real-world manufacturing dataset. We draw key insights into the effectiveness of each component and the inter-relatedness between the dataset, the application context, and the design of the ML algorithm.