
    Novel Algorithm-Level Approaches for Class-Imbalanced Machine Learning

    Machine learning classifiers are designed with the underlying assumption of a roughly balanced number of instances per class. However, in many real-world applications this is far from true. This thesis explores adaptations of neural networks which are robust to class-imbalanced datasets, do not involve data manipulation, and are flexible enough to be used with any model architecture or framework. The thesis explores two complementary approaches to the problem of class imbalance. The first exchanges conventional choices of classification loss function, which are fundamentally measures of how far network outputs are from desired ones, for ones that instead primarily register whether outputs are right or wrong. The construction of these novel loss functions involves the concept of an approximated confusion matrix, another use of which is to generate new performance metrics that are especially useful for monitoring validation behaviour on imbalanced datasets. The second approach changes the form of the output-layer activation function to one with a threshold which can be learned so as to more easily classify the more difficult minority class. These two approaches can be used together or separately, with the combined technique being a promising approach for cases of extreme class imbalance. While the methods are developed primarily for binary classification scenarios, as these are the most numerous in the applications literature, the novel loss functions introduced here are also demonstrated to be extensible to a multi-class scenario.
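
    The abstract does not give the exact loss functions, but the general construction can be sketched. Below is a minimal, hypothetical PyTorch sketch: it builds a differentiable "soft" confusion matrix from sigmoid outputs, derives a loss from it (here 1 minus a soft F1 score, one common choice), and shows a sigmoid activation with a learnable threshold. The names and the specific loss are illustrative assumptions, not the thesis's definitions.

    import torch

    def soft_confusion_matrix(probs, targets):
        # probs: sigmoid outputs in [0, 1]; targets: 0/1 labels, same shape.
        # Each entry is a differentiable "soft count" rather than a hard tally.
        tp = (probs * targets).sum()
        fp = (probs * (1 - targets)).sum()
        fn = ((1 - probs) * targets).sum()
        tn = ((1 - probs) * (1 - targets)).sum()
        return tp, fp, fn, tn

    def soft_f1_loss(logits, targets, eps=1e-7):
        # One possible confusion-matrix-based loss: 1 - soft F1. Minimizing it
        # rewards outputs for being on the right side of the decision boundary,
        # not merely for being numerically close to the targets.
        probs = torch.sigmoid(logits)
        tp, fp, fn, _ = soft_confusion_matrix(probs, targets)
        return 1 - 2 * tp / (2 * tp + fp + fn + eps)

    class LearnableThresholdSigmoid(torch.nn.Module):
        # Output activation with a learnable threshold t, shifting the sigmoid
        # so the decision boundary can move toward the minority class. The
        # thesis's actual activation may differ; sigmoid(x - t) is one simple form.
        def __init__(self):
            super().__init__()
            self.t = torch.nn.Parameter(torch.zeros(1))

        def forward(self, x):
            return torch.sigmoid(x - self.t)

    The same soft counts can also be renormalized into validation metrics, matching the abstract's second use of the approximated confusion matrix.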

    Box Drawings for Learning with Imbalanced Data

    The vast majority of real-world classification problems are imbalanced, meaning there are far fewer data from the class of interest (the positive class) than from other classes. We propose two machine learning algorithms to handle highly imbalanced classification problems. The classifiers constructed by both methods are unions of axis-parallel rectangles around the positive examples, and thus have the benefit of being interpretable. The first algorithm uses mixed integer programming to optimize a weighted balance between positive and negative class accuracies, with regularization introduced to improve generalization performance. The second method uses an approximation in order to assist with scalability. Specifically, it follows a "characterize then discriminate" approach, in which the positive class is first characterized by boxes, and each box boundary then becomes a separate discriminative classifier. This method has the computational advantages of being easily parallelizable and of considering only the relevant regions of feature space.
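
    To make the interpretability claim concrete, here is an illustrative NumPy sketch of the decision rule only: a point is labelled positive when it falls inside at least one axis-parallel box. The boxes themselves would be learned by the paper's mixed integer program (or its scalable approximation); here they are assumed given, and all names are hypothetical.

    import numpy as np

    def predict_union_of_boxes(X, lowers, uppers):
        # X: (n, d) points; lowers/uppers: (k, d) bounds of k learned boxes.
        # Returns 1 where a point lies inside at least one box, else 0.
        inside = (X[:, None, :] >= lowers) & (X[:, None, :] <= uppers)  # (n, k, d)
        return inside.all(axis=2).any(axis=1).astype(int)

    # Two boxes in 2-D: points in either box are classified as positive.
    lowers = np.array([[0.0, 0.0], [2.0, 2.0]])
    uppers = np.array([[1.0, 1.0], [3.0, 3.0]])
    X = np.array([[0.5, 0.5], [2.5, 2.5], [1.5, 1.5]])
    print(predict_union_of_boxes(X, lowers, uppers))  # -> [1 1 0]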

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
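
    As a deliberately naive illustration of the setting the review addresses, the scikit-learn sketch below performs "early integration" by concatenating features from two synthetic omics layers, imputes missing values, and fits a class-weighted classifier. This baseline exhibits several of the named challenges (high dimensionality, missing data, class imbalance); it is an assumption for illustration, not one of the review's methods.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    genome = rng.normal(size=(100, 500))          # e.g. SNP-derived features
    transcriptome = rng.normal(size=(100, 2000))  # e.g. expression levels
    transcriptome[rng.random(transcriptome.shape) < 0.1] = np.nan  # missing data
    y = np.zeros(100, dtype=int)
    y[:10] = 1                                    # imbalanced labels: 10% positive

    X = np.hstack([genome, transcriptome])        # early integration: n << p
    X = SimpleImputer(strategy="mean").fit_transform(X)
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)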