62 research outputs found

    Improved imbalanced classification through convex space learning

    Imbalanced datasets for classification problems, characterised by an unequal distribution of samples across classes, are abundant in practical scenarios. Oversampling algorithms generate synthetic data to improve classification performance on such datasets. In this thesis, I discuss two algorithms, LoRAS and ProWRAS, which improve on the state of the art, as shown through rigorous benchmarking on publicly available datasets. A biological application, the detection of rare cell types from single-cell transcriptomics data, is also discussed. The thesis also provides a better theoretical understanding of oversampling.
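As a rough illustration of what "generating synthetic data" can mean in a convex-space setting, the sketch below builds each synthetic minority sample as a random convex combination of a few existing minority samples. This is a generic sketch, not LoRAS or ProWRAS; the function name `convex_oversample` and the parameters `k` and `seed` are assumptions made for the example.

```python
import random


def convex_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each a convex combination of k minority samples.

    Generic illustration only (hypothetical helper, not LoRAS/ProWRAS).
    `minority` is a list of equal-length lists of floats.
    """
    if not 1 <= k <= len(minority):
        raise ValueError("k must be between 1 and the number of minority samples")
    rng = random.Random(seed)
    n_features = len(minority[0])
    synthetic = []
    for _ in range(n_new):
        # Pick k distinct minority samples to combine.
        group = rng.sample(minority, k)
        # Strictly positive random weights, normalised to sum to 1.
        raw = [1.0 - rng.random() for _ in group]
        total = sum(raw)
        weights = [r / total for r in raw]
        # The weighted average lies inside the convex hull of the chosen samples.
        point = [
            sum(w * x[d] for w, x in zip(weights, group))
            for d in range(n_features)
        ]
        synthetic.append(point)
    return synthetic
```

Because the weights are non-negative and sum to one, every synthetic point stays inside the convex hull of the minority samples it was built from, so no coordinate can leave the range spanned by the original data.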

    Learning from Multi-Class Imbalanced Big Data with Apache Spark

    With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results, as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or for parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges through in-depth experimentation with novel implementations of machine learning algorithms on Apache Spark, a distributed computing framework based on the MapReduce model and designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary-class datasets, has been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study of how instance-level difficulty affects learning from large datasets was also performed.

    Iterative Training Sample Expansion to Increase and Balance the Accuracy of Land Classification from VHR Imagery

    © 1980-2012 IEEE. Imbalanced training sets are known to produce suboptimal maps in supervised classification. Therefore, one challenge in mapping land cover is acquiring training data that allow classification with high overall accuracy (OA) while each class is also mapped with a similar user's accuracy. To solve this problem, this article integrates local adaptive region and box-and-whisker plot (BP) techniques into an iterative algorithm that expands the size of the training sample for selected classes. The major steps of the proposed algorithm are as follows. First, a very small initial training sample (ITS) for each class is labeled manually. Second, potential new training samples are found within an adaptive region by conducting local spectral variation analysis. Lastly, three new training samples are acquired to capture information regarding intraclass variation; these samples lie in the lower, median, and upper quartiles of the BP. After these new training samples are added to the ITS, the classifier is retrained, and the process continues iteratively until termination. The proposed approach was applied to three very high-resolution (VHR) remote-sensing images and compared with a set of cognate methods. The comparison demonstrated that the proposed approach produced the best result in terms of OA and was superior in balancing user's accuracy. For example, the proposed approach was typically 2%-10% more accurate than the compared methods in terms of OA, and it generally yielded the most balanced classification.
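The third step, picking three new samples at the lower quartile, median, and upper quartile of a box plot, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `pick_quartile_samples` is hypothetical, and each candidate is assumed to carry a single scalar value standing in for whatever spectral statistic the box plot is built on.

```python
from statistics import quantiles


def pick_quartile_samples(candidates):
    """Return the three candidates closest to Q1, the median, and Q3.

    `candidates` is a list of (sample_id, value) pairs. The scalar `value`
    is an assumed stand-in for the statistic summarised by the box plot.
    """
    if len(candidates) < 3:
        raise ValueError("need at least three candidates")
    values = [value for _, value in candidates]
    # Quartile cut points of the candidate values (Q1, median, Q3).
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    picked = []
    remaining = list(candidates)
    for target in (q1, q2, q3):
        # Take the not-yet-picked candidate whose value is nearest the target.
        best = min(remaining, key=lambda c: abs(c[1] - target))
        picked.append(best)
        remaining.remove(best)
    return picked
```

In an iterative loop, the three returned samples would be appended to the training set for that class before the classifier is retrained; the stopping rule is not specified in the abstract and is left out here.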

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold evolves through time, joint spatio-temporal modelling is needed to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernov

    Approach to identify product and process state drivers in manufacturing systems using supervised machine learning

    The developed concept allows relevant state drivers of complex, multi-stage manufacturing systems to be identified holistically. It can utilize the complex, diverse, and high-dimensional data sets that often occur in manufacturing applications and integrate the important process intra- and inter-relations. The evaluation was conducted using three different scenarios from distinct manufacturing domains (aviation, chemical, and semiconductor). The evaluation confirmed that it is possible to incorporate implicit process intra- and inter-relations at both process and programme level by applying SVM-based feature ranking. The analysis outcome presents a direct benefit for practitioners in the form of the most important process parameters and state characteristics, so-called state drivers, of a manufacturing system. Given the increasing availability of data and information, this selection support can be directly utilized in, e.g., quality monitoring and advanced process control.

    Dealing with imbalanced and weakly labelled data in machine learning using fuzzy and rough set methods
