1,542 research outputs found

    Learning from Multi-Class Imbalanced Big Data with Apache Spark

    Get PDF
    With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when with multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the number of examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges with in-depth experimentation using novel implementations of machine learning algorithms using Apache Spark, a distributed computing framework based on the MapReduce model designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary class datasets, have been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study on how instance level difficulty affects the learning from large datasets was also performed

    A New Under-Sampling Method to Face Class Overlap and Imbalance

    Get PDF
    Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm to remove noisy samples and clean the decision boundary with a minimum spanning tree algorithm to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving the performance of classifiers. An extensive experimental study shows a significantly better behavior of the new algorithm as compared to 12 state-of-the-art under-sampling methods using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases

    A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

    Full text link
    Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures on how to evaluate these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data streams algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. This way we propose the first standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create trustworthy and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams

    On the class overlap problem in imbalanced data classification.

    Get PDF
    Class imbalance is an active research area in the machine learning community. However, existing and recent literature showed that class overlap had a higher negative impact on the performance of learning algorithms. This paper provides detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handle imbalanced datasets. Existing solutions from selective literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest development in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance

    Learning With An Insufficient Supply Of Data Via Knowledge Transfer And Sharing

    Get PDF
    As machine learning methods extend to more complex and diverse set of problems, situations arise where the complexity and availability of data presents a situation where the information source is not adequate to generate a representative hypothesis. Learning from multiple sources of data is a promising research direction as researchers leverage ever more diverse sources of information. Since data is not readily available, knowledge has to be transferred from other sources and new methods (both supervised and un-supervised) have to be developed to selectively share and transfer knowledge. In this dissertation, we present both supervised and un-supervised techniques to tackle a problem where learning algorithms cannot generalize and require an extension to leverage knowledge from different sources of data. Knowledge transfer is a difficult problem as diverse sources of data can overwhelm each individual dataset\u27s distribution and a careful set of transformations has to be applied to increase the relevant knowledge at the risk of biasing a dataset\u27s distribution and inducing negative transfer that can degrade a learner\u27s performance. We give an overview of the issues encountered when the learning dataset does not have a sufficient supply of training examples. We categorize the structure of small datasets and highlight the need for further research. We present an instance-transfer supervised classification algorithm to improve classification performance in a target dataset via knowledge transfer from an auxiliary dataset. The improved classification performance of our algorithm is demonstrated with several real-world experiments. We extend the instance-transfer paradigm to supervised classification with Absolute Rarity\u27 , where a dataset has an insufficient supply of training examples and a skewed class distribution. We demonstrate one solution with a transfer learning approach and another with an imbalanced learning approach and demonstrate the effectiveness of our algorithms with several real world text and demographics classification problems (among others). We present an unsupervised multi-task clustering algorithm where several small datasets are simultaneously clustered and knowledge is transferred between the datasets to improve clustering performance on each individual dataset and we demonstrate the improved clustering performance with an extensive set of experiments

    Fraud: and anomaly detection in healthcare: an unsupervised machine learning approach

    Get PDF
    Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business AnalyticsFraud and abuse in healthcare are critical and cause significant damage. However, the auditing of healthcare encounters is cumbersome, and the detection of fraud and abuse is challenging and binds capacity. Data-driven fraud and anomaly detection models can help to overcome these issues. This work proposes several unsupervised learning methods to understand patterns and detect abnormal healthcare encounters which might be fraudulent or abusive. The ensemble of models is split into sub-processes and applied on a healthcare data set belonging to Future Healthcare group, a Portuguese group acting in health insurance. One major part of the ensemble is the implementation of the Isolation Forest algorithm, which achieves good results in precision and recall and detect new potential fraudulent abnormal behaviour. Due to unlabelled data and the application of unsupervised learning methods, the proposed model detects new fraudulent patterns instead of learning from existing patterns. Besides the model to predict whether new incoming medical encounters are fraudulent or abusive, this work illustrates a visual method to detect suspicious networks among medical providers. In addition, this work contains an approach to predict whether a customer will cancel the insurance based on anomalous behaviour. This internship report aims to contribute to science and be public, even though some parts could not be explained in detail due to confidentiality

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Get PDF
    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. The real data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning method (CSL) to deal with the classification of imperfect data. Typically, most traditional approaches for classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with Support Vector Machine, which is a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures to tackle imperfect data along with addressing real problems in quality control and business analytics

    Exploring and Evaluating the Scalability and Efficiency of Apache Spark using Educational Datasets

    Get PDF
    Research into the combination of data mining and machine learning technology with web-based education systems (known as education data mining, or EDM) is becoming imperative in order to enhance the quality of education by moving beyond traditional methods. With the worldwide growth of the Information Communication Technology (ICT), data are becoming available at a significantly large volume, with high velocity and extensive variety. In this thesis, four popular data mining methods are applied to Apache Spark, using large volumes of datasets from Online Cognitive Learning Systems to explore the scalability and efficiency of Spark. Various volumes of datasets are tested on Spark MLlib with different running configurations and parameter tunings. The thesis convincingly presents useful strategies for allocating computing resources and tuning to take full advantage of the in-memory system of Apache Spark to conduct the tasks of data mining and machine learning. Moreover, it offers insights that education experts and data scientists can use to manage and improve the quality of education, as well as to analyze and discover hidden knowledge in the era of big data

    Imbalanced data classification and its application in cyber security

    Get PDF
    Cyber security, also known as information technology security or simply as information security, aims to protect government organizations, companies and individuals by defending their computers, servers, electronic systems, networks, and data from malicious attacks. With the advancement of client-side on the fly web content generation techniques, it becomes easier for attackers to modify the content of a website dynamically and gain access to valuable information. The impact of cybercrime to the global economy is now more than ever, and it is growing day by day. Among various types of cybercrimes, financial attacks are widely spread and the financial sector is among most targeted. Both corporations and individuals are losing a huge amount of money each year. The majority portion of financial attacks is carried out by banking malware and web-based attacks. The end users are not always skilled enough to differentiate between injected content and actual contents of a webpage. Designing a real-time security system for ensuring a safe browsing experience is a challenging task. Some of the existing solutions are designed for client side and all the users have to install it in their system, which is very difficult to implement. In addition, various platforms and tools are used by organizations and individuals, therefore, different solutions are needed to be designed. The existing server-side solution often focuses on sanitizing and filtering the inputs. It will fail to detect obfuscated and hidden scripts. This is a realtime security system and any significant delay will hamper user experience. Therefore, finding the most optimized and efficient solution is very important. To ensure an easy installation and integration capabilities of any solution with the existing system is also a critical factor to consider. If the solution is efficient but difficult to integrate, then it may not be a feasible solution for practical use. Unsupervised and supervised data classification techniques have been widely applied to design algorithms for solving cyber security problems. The performance of these algorithms varies depending on types of cyber security problems and size of datasets. To date, existing algorithms do not achieve high accuracy in detecting malware activities. Datasets in cyber security and, especially those from financial sectors, are predominantly imbalanced datasets as the number of malware activities is significantly less than the number of normal activities. This means that classifiers for imbalanced datasets can be used to develop supervised data classification algorithms to detect malware activities. Development of classifiers for imbalanced data sets has been subject of research over the last decade. Most of these classifiers are based on oversampling and undersampling techniques and are not efficient in many situations as such techniques are applied globally. In this thesis, we develop two new algorithms for solving supervised data classification problems in imbalanced datasets and then apply them to solve malware detection problems. The first algorithm is designed using the piecewise linear classifiers by formulating this problem as an optimization problem and by applying the penalty function method. More specifically, we add more penalty to the objective function for misclassified points from minority classes. The second method is based on the combination of the supervised and unsupervised (clustering) algorithms. Such an approach allows one to identify areas in the input space where minority classes are located and to apply local oversampling or undersampling. This approach leads to the design of more efficient and accurate classifiers. The proposed algorithms are tested using real-world datasets. Results clearly demonstrate superiority of newly introduced algorithms. Then we apply these algorithms to design classifiers to detect malwares.Doctor of Philosoph
    corecore