302 research outputs found

    An insight into imbalanced Big Data classification: outcomes and challenges

    Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be applied directly because of scalability issues. To overcome this, the MapReduce framework has arisen as a “de facto” solution. In essence, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way that suits commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data, mainly because of the difficulty of adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data is partitioned to fit the MapReduce programming style. This paper rests on three main pillars. First, it presents the first outcomes for imbalanced classification in Big Data problems, introducing the current state of research in this area. Second, it analyzes the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout the work, it discusses the challenges and future directions for the topic. This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
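
    The lack-of-data effect mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `minority_per_partition` and the figures (1,000 minority examples, 500 partitions) are hypothetical, and uniform random assignment only stands in for how a MapReduce job might split its input.

```python
import random


def minority_per_partition(n_minority, n_partitions, seed=0):
    """Assign each minority example to a random partition and count them.

    Hypothetical helper: uniform random assignment is an assumption that
    stands in for the input splitting done before the map phase.
    """
    rng = random.Random(seed)
    counts = [0] * n_partitions
    for _ in range(n_minority):
        counts[rng.randrange(n_partitions)] += 1
    return counts


# 1,000 minority examples spread over 500 partitions: each mapper sees
# only two of them on average, and some mappers may see none at all.
counts = minority_per_partition(n_minority=1000, n_partitions=500)
print("mean minority examples per partition:", sum(counts) / len(counts))
print("partitions with fewer than 2:", sum(c < 2 for c in counts))
```

    Under these assumptions, a learner or pre-processing step that runs independently inside each partition has far fewer minority examples to work with than the full dataset contains.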

    A Survey of Methods for Handling Disk Data Imbalance

    Class imbalance exists in many classification problems. Because classifiers are typically optimized for overall accuracy, imbalance among data classes can lead to classification challenges in which a few classes carry higher misclassification costs. The Backblaze dataset, a widely used dataset on hard disks, has a small amount of failure data and a large amount of healthy-disk data, and therefore exhibits a serious class imbalance. This paper provides a comprehensive overview of research in the field of imbalanced data classification. The discussion is organized into three main aspects: data-level methods, algorithm-level methods, and hybrid methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas, strengths, and weaknesses. Additionally, the challenges of imbalanced data classification are discussed, along with strategies to address them. This overview should help researchers choose the appropriate method for their needs.

    Data Mining

    Data mining is a branch of computer science concerned with automatically extracting meaningful, useful knowledge and previously unknown, hidden, interesting patterns from large amounts of data to support the decision-making process. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining. It brings together many successful data mining studies in areas such as health, banking, education, software engineering, animal science, and the environment.

    Learning from Multi-Class Imbalanced Big Data with Apache Spark

    With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results, since learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or for parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges through in-depth experimentation with novel implementations of machine learning algorithms on Apache Spark, a distributed computing framework based on the MapReduce model and designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary-class datasets, has been implemented on Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study of how instance-level difficulty affects learning from large datasets was also performed.

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the “de facto” standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges in extending SMOTE to Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
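
    The core idea behind SMOTE, creating synthetic minority examples by interpolating between a minority example and one of its nearest minority neighbours, can be sketched roughly as follows. This is a simplified illustration, not the reference implementation: the function name `smote_sketch` is hypothetical, base examples are drawn at random rather than processed one by one, and plain Euclidean distance on numeric tuples is assumed.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority points by neighbour interpolation.

    Simplified sketch (assumption): each synthetic point starts from a
    randomly chosen minority example, not a fixed per-example quota.
    Requires at least two minority examples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
print(new_points)
```

    Because every synthetic point lies on a segment between two existing minority examples, the new points stay inside the region the minority class already occupies.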

    MapReduce-iterative support vector machine classifier: novel fraud detection systems in healthcare insurance industry

    Fraud in healthcare insurance claims is a significant research challenge that affects the growth of healthcare services. Healthcare fraud is committed by subscribers, companies, and providers. A decision support system is developed to automate the handling of claim data from service providers and to offset the challenges patients face. This paper proposes a novel hybridized big data and statistical machine learning technique, named MapReduce-based iterative support vector machine (MR-ISVM), that provides a set of sophisticated steps for the automatic detection of fraudulent claims in health insurance databases. The experimental results show that the MR-ISVM classifier outperforms other support vector machine (SVM) kernel classifiers in classification and detection. The results also show a reduction in the computational time needed to process healthcare insurance claims without compromising classification accuracy. The proposed MR-ISVM classifier achieves 87.73% accuracy, compared with 75.3% for the linear kernel and 79.98% for the radial basis function kernel.

    Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data

    Managing, processing and understanding big healthcare data is challenging, costly and demanding. Without a robust fundamental theory for representation, analysis and inference, a roadmap for uniform handling and analyzing of such complex data remains elusive. In this article, we outline various big data challenges, opportunities, modeling methods and software techniques for blending complex healthcare data, advanced analytic tools, and distributed scientific computing. Using imaging, genetic and healthcare data we provide examples of processing heterogeneous datasets using distributed cloud services, automated and semi-automated classification techniques, and open-science protocols. Despite substantial advances, new innovative technologies need to be developed that enhance, scale and optimize the management and processing of large, complex and heterogeneous data. Stakeholder investments in data acquisition, research and development, computational infrastructure and education will be critical to realize the huge potential of big data, to reap the expected information benefits and to build lasting knowledge assets. Multi-faceted proprietary, open-source, and community developments will be essential to enable broad, reliable, sustainable and efficient data-driven discovery and analytics. Big data will affect every sector of the economy and their hallmark will be ‘team science’. http://deepblue.lib.umich.edu/bitstream/2027.42/134522/1/13742_2016_Article_117.pd