51 research outputs found

    Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark

    Get PDF
    The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to tackle this issue. Such a challenging task may become even more difficult when the number of majority class examples is very large. In this scenario, the use of the evolutionary model becomes impractical due to memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a lack of density of the minority class in the subsets considered. To address this problem, in this contribution we provide a new big data scheme based on the emerging technology Apache Spark to tackle highly imbalanced datasets. We take advantage of its in-memory operations to diminish the effect of the small sample size. The key point of this proposal lies in the independent management of majority and minority class examples, allowing us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model with several datasets containing up to 17 million instances. The results show the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.
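
    The abstract describes its key design point in prose only; a minimal PySpark sketch of that idea is given below, assuming a hypothetical helper load_records(). It only illustrates the independent management of the two classes (partitioning the majority class while broadcasting the full minority class to every subset); the actual system applies evolutionary undersampling inside each task, which is simplified here to random undersampling.

```python
# Minimal sketch (not the authors' code): split the majority class into chunks
# and broadcast the small minority class to every chunk, so each subset keeps
# enough minority density. Evolutionary undersampling inside each map task is
# simplified here to random undersampling for brevity.
import random
from pyspark import SparkContext

sc = SparkContext(appName="eus-imbalanced-sketch")

# Each record is (features, label); label 1 marks the rare minority class.
data = sc.parallelize(load_records())            # load_records() is hypothetical
minority = data.filter(lambda r: r[1] == 1).collect()
majority = data.filter(lambda r: r[1] == 0)

minority_bc = sc.broadcast(minority)             # replicate minority everywhere

def balance_partition(records):
    """Undersample one majority chunk against the broadcast minority class."""
    chunk = list(records)
    target = len(minority_bc.value)              # aim for a roughly 1:1 ratio
    sampled = random.sample(chunk, min(target, len(chunk)))
    return sampled + minority_bc.value

balanced = majority.repartition(64).mapPartitions(balance_partition)
```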

    An insight into imbalanced Big Data classification: outcomes and challenges

    Get PDF
    Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the significant advantages of knowledge extraction from this type of problem. However, traditional learning approaches cannot be applied directly due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. As it is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, inherent problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data is partitioned to fit the MapReduce programming style. This paper is designed around three main pillars: first, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area; second, to analyze the behavior of standard pre-processing techniques in this particular framework; and finally, taking into account the experimental results obtained throughout this work, to carry out a discussion on the challenges and future directions for the topic. This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
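
    As a rough illustration of the second pillar, the sketch below mimics how a standard pre-processing technique (random oversampling) would be applied independently inside each map task of a MapReduce-style partitioning, which is exactly where the lack-of-data and small-disjunct issues mentioned above become accentuated. The partitioning and helper names are hypothetical and not taken from the paper.

```python
# Illustrative, non-distributed sketch: random oversampling applied inside each
# data partition, as a MapReduce map phase would. When a partition holds very
# few minority examples, oversampling merely duplicates them, which is the
# small-sample-size problem discussed in the paper.
import numpy as np

def map_phase_oversample(X_part, y_part, rng):
    """Randomly oversample the minority class within one partition."""
    classes, counts = np.unique(y_part, return_counts=True)
    if len(classes) < 2:                       # partition lost one class entirely
        return X_part, y_part
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y_part == minority)
    n_extra = counts.max() - counts.min()
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    return (np.vstack([X_part, X_part[extra]]),
            np.concatenate([y_part, y_part[extra]]))

rng = np.random.default_rng(0)
X = np.random.rand(10_000, 20)
y = (np.random.rand(10_000) < 0.01).astype(int)   # ~1% minority class
parts = np.array_split(np.arange(len(y)), 8)      # simulate 8 map tasks
balanced = [map_phase_oversample(X[p], y[p], rng) for p in parts]
```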

    Novel Strategies to Accelerate Search Algorithms in Data Reduction

    Get PDF
    In our current hyper-connected digital world where data is growing enormously, instance reduction is an essential pre-processing phase to obtain cleaner and smaller datasets that are free from noise and from redundant or irrelevant samples (so-called Smart Data). The data after pre-processing may become more reliable, accurate and useful for subsequent data mining tasks. Instance reduction consists of two types: instance selection and instance generation; each can be formulated as a combinatorial or continuous optimisation problem depending on whether its decision variables are discrete or continuous, respectively. It is an emerging challenge characterised by multimodality and a large number of decision variables. Given such difficulties, derivative-free methods are promising approaches to address the problem. They are powerful search algorithms that seek a local optimum without requiring the gradient of the objective function, unlike derivative-based methods. Solutions for instance reduction lie at the intersection of machine learning, data mining and optimisation, where a process from one domain can take part in the execution of another. Thus, the synergy between domains is important to solve the problem more effectively, and this has attracted significant interest from researchers. Among the many derivative-free search approaches, the family of direct search methods has introduced various strategies to tackle modern numerical optimisation problems, of which population-based meta-heuristics and pattern search can be considered two of the most prevalent in the literature. Population-based meta-heuristics are an iterative search framework composed of several subordinate low-level heuristics that control exploration and exploitation over a pool of solution candidates. This set of methods searches for high-quality solutions from multiple points, and thus is usually associated with high computational expense. Pattern search methods seek an improved solution from candidates generated from different directions, examining trial solutions sequentially by comparing each with the best solution found so far. In this dissertation, we investigate these derivative-free search strategies to address instance reduction, a critical optimisation problem in the field of data science. Although many derivative-free methods have proved effective in addressing instance reduction, they are usually time-consuming, especially when handling relatively large datasets. This impediment limits their practicality in many data mining systems and thus necessitates a solution to accelerate the search process. The need for a fast and effective search framework for instance reduction has motivated us to develop novel search strategies in the family of direct search approaches, aiming to obtain solutions of quality comparable to those of state-of-the-art techniques in the domain while significantly reducing the runtime of the search process. Three major work packages presented in this thesis cover two direct search approaches for two types of instance reduction, arranged in a progressive order in which findings from earlier stages contribute to the understanding of later outcomes. Firstly, a novel evolutionary search framework for instance selection is proposed to balance the number of samples between classes in a case study of imbalanced classification.
    Secondly, we develop another search framework for instance generation based on single-point search and memetic computing, namely the Single-Point Memetic Structure. An accelerated mechanism for computing the objective function is embedded into the proposed search design, thus significantly reducing the runtime. Finally, a novel search framework for simultaneous instance selection and generation is designed to handle the instance reduction problem in both combinatorial and continuous search spaces. In summary, the research conducted here introduces a set of novel derivative-free search strategies to tackle instance reduction problems. They are different search frameworks which aim to produce a high-quality reduced set from a relatively large original source within a reasonable amount of time. This is accomplished either by taking advantage of machine learning integration or by using the Single-Point Memetic Structure with an accelerated mechanism. The use of machine learning in a meta-heuristic search framework greatly speeds up the computation of the objective function, while the Single-Point Memetic Search allows us to reuse virtually all prior calculations when computing the fitness value of newly evolved individuals. Hence, these novel search strategies can save substantial computational cost. Finally, we leverage the insights previously found to propose another novel search framework that handles both instance selection and instance generation simultaneously, and operates in both combinatorial and continuous search spaces. These novel search strategies are examined on a large number of datasets under different hyper-parameter settings. The obtained numerical results are comprehensively analysed and verified by different statistical tests to prove the robustness of the proposed search strategies with respect to other state-of-the-art techniques in the domain.
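
    To make the combinatorial formulation of instance selection concrete, the sketch below shows a single-point (hill-climbing style) search over a binary inclusion mask whose fitness combines 1-NN accuracy with the reduction rate. It is only a minimal illustration under these assumptions; the frameworks developed in the thesis are considerably more elaborate (memetic operators, accelerated fitness updates, balanced selection for imbalanced data).

```python
# Minimal sketch of single-point search for instance selection: a boolean mask
# selects which training instances are kept; fitness mixes 1-NN accuracy with
# the reduction rate. Not the thesis algorithms, just the underlying formulation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.5):
    """Weighted sum of 1-NN accuracy (evaluated on the full set) and reduction."""
    if mask.sum() < 2:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[mask], y[mask])
    accuracy = clf.score(X, y)
    reduction = 1.0 - mask.mean()
    return alpha * accuracy + (1.0 - alpha) * reduction

def single_point_search(X, y, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(len(y)) < 0.5              # random initial subset
    best = fitness(mask, X, y)
    for _ in range(iters):
        trial = mask.copy()
        trial[rng.integers(len(y))] ^= True      # flip one bit: single-point move
        f = fitness(trial, X, y)
        if f >= best:                            # accept non-worsening trials
            mask, best = trial, f
    return mask, best
```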

    A comparison of statistical machine learning methods in heartbeat detection and classification

    Get PDF
    In health care, patients with heart problems require quick responsiveness in a clinical setting or in the operating theatre. Towards that end, automated classification of heartbeats is vital, as some heartbeat irregularities are time consuming to detect. Therefore, analysis of electrocardiogram (ECG) signals is an active area of research. The methods proposed in the literature depend on the structure of a heartbeat cycle. In this paper, we use interval and amplitude based features together with a few samples from the ECG signal as a feature vector. We studied a variety of classification algorithms, focused especially on a type of arrhythmia known as the ventricular ectopic beat (VEB). We compare the performance of the classifiers against algorithms proposed in the literature and make recommendations regarding features, sampling rate, and choice of the classifier to apply in a real-time clinical setting. The extensive study is based on the MIT-BIH arrhythmia database. Our main contributions are the evaluation of existing classifiers over a range of sampling rates, the recommendation of a detection methodology to employ in a practical setting, and the extension of the notion of a mixture of experts to a larger class of algorithms.
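
    The classifier comparison described above can be outlined with a short, hedged sketch: per-beat interval and amplitude features plus a few raw ECG samples form the feature vector, and several standard classifiers are compared by cross-validation. The feature extraction shown here is a hypothetical simplification; the paper works with annotated MIT-BIH recordings and its own feature set and evaluation protocol.

```python
# Rough sketch (not the paper's pipeline): interval/amplitude features plus a
# few raw samples per beat, compared across several off-the-shelf classifiers.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def beat_features(signal, r_peaks, fs, n_samples=8):
    """Per-beat features: RR intervals, R amplitude, and a few raw samples."""
    feats, half = [], n_samples // 2
    for i in range(1, len(r_peaks) - 1):
        r = r_peaks[i]
        if r < half or r + half > len(signal):
            continue                              # skip beats too close to the edge
        rr_prev = (r - r_peaks[i - 1]) / fs
        rr_next = (r_peaks[i + 1] - r) / fs
        feats.append([rr_prev, rr_next, signal[r], *signal[r - half:r + half]])
    return np.asarray(feats)

# X, y would come from annotated MIT-BIH beats (y = 1 for VEB, 0 otherwise);
# random placeholders are used here so the snippet runs standalone.
X, y = np.random.rand(2000, 11), np.random.randint(0, 2, 2000)
for name, clf in [("RF", RandomForestClassifier(n_estimators=200)),
                  ("SVM", SVC()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```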

    Advances in Data Mining Knowledge Discovery and Applications

    Get PDF
    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is to highlight frontier fields and implementations of knowledge discovery and data mining. Some approaches may appear to be repeated, but in general the same techniques can help in different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two different sections. As is well known, data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other fields. In this book, most of these areas are covered by different data mining applications. The eighteen chapters have been classified into two parts: Knowledge Discovery and Data Mining Applications.

    Emergent relational schemas for RDF

    Get PDF

    Principled design of evolutionary learning systems for large scale data mining

    Get PDF
    Currently, the data mining and machine learning fields are facing new challenges because of the amount of information that is collected and needs processing. Many sophisticated learning approaches simply cannot cope with large and complex domains, because of unmanageable execution times or the loss of predictive and generalisation capacity that occurs as domains become more complex. Therefore, to cope with the volumes of information in current real-world problems, there is a need to push forward the boundaries of sophisticated data mining techniques. This thesis is focused on improving the efficiency of Evolutionary Learning systems in large scale domains. Specifically, the objective of this thesis is to improve the efficiency of the Bioinformatic Hierarchical Evolutionary Learning (BioHEL) system, a system designed with the purpose of handling large domains. This is a classifier system that uses an Iterative Rule Learning approach to generate a set of rules one by one using consecutive Genetic Algorithms. This system has proved very competitive so far in large and complex domains. In particular, BioHEL has obtained very important results when solving protein structure prediction problems and has won related merits, such as being placed among the best algorithms for this purpose at the Critical Assessment of Techniques for Protein Structure Prediction (CASP) in 2008 and 2010, and winning the bronze medal at the HUMIES Awards for Human-competitive results in 2007. However, there is still a need to analyse this system in a principled way to determine how the current mechanisms work together to solve larger domains and to determine the aspects of the system that can be improved towards this aim. To fulfil the objective of this thesis, the work is divided into two parts. In the first part of the thesis, exhaustive experimentation was carried out to determine ways in which the system could be improved. From this analysis three main weaknesses are pointed out: a) the problem-dependency of the parameters in BioHEL's fitness function, which results in a system that is difficult to set up and requires extensive preliminary experimentation to determine adequate values for these parameters; b) the execution time of the learning process, which at the moment does not use any parallelisation techniques and depends on the size of the training sets; and c) the lack of global supervision over the generated solutions, which stems from the use of the Iterative Rule Learning paradigm and produces larger rule sets in which there is no guarantee of minimality or maximal generality. The second part of the thesis is focused on tackling each of the weaknesses mentioned above to obtain a system capable of handling larger domains. First, a heuristic approach to setting the parameters of BioHEL's fitness function is developed. Second, a new parallel evaluation process that runs on General Purpose Graphics Processing Units is developed. Finally, post-processing operators to tackle the generality and cardinality of the generated solutions are proposed. By means of these enhancements we managed to improve the BioHEL system to reduce both the learning and the preliminary experimentation time, increase the generality of the final solutions, and make the system more accessible for end users.
    Moreover, as the techniques discussed in this thesis can easily be extended to other Evolutionary Learning systems, we consider them important additions to the research in this field towards tackling large scale domains.
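
    The Iterative Rule Learning paradigm mentioned above can be summarised with a small schematic: rules are learned one at a time and the examples covered by each accepted rule are removed before the next iteration. This is a sketch under simplifying assumptions, not BioHEL itself: the per-rule Genetic Algorithm is replaced by naive random sampling of interval rules, purely to show the control flow.

```python
# Schematic of Iterative Rule Learning: learn one rule, remove the examples it
# covers, repeat. BioHEL evolves each rule with a full Genetic Algorithm and a
# richer representation; random interval rules stand in for that search here.
import numpy as np

def random_rule(X_remaining, rng, width=0.5):
    """Hypothetical rule: one interval per attribute, centred on a random example."""
    centre = X_remaining[rng.integers(len(X_remaining))]
    return centre - width, centre + width

def covers(rule, X):
    lo, hi = rule
    return np.all((X >= lo) & (X <= hi), axis=1)

def iterative_rule_learning(X, y, rng, candidates=200, min_cover=5):
    rules = []
    remaining = np.ones(len(y), dtype=bool)
    while remaining.sum() >= min_cover:
        best, best_score = None, -1.0
        for _ in range(candidates):               # stand-in for the GA search
            rule = random_rule(X[remaining], rng)
            mask = covers(rule, X) & remaining
            if mask.sum() < min_cover:
                continue
            cls = np.bincount(y[mask]).argmax()   # majority class of covered examples
            score = (y[mask] == cls).mean() * mask.sum()
            if score > best_score:
                best, best_score, best_mask, best_cls = rule, score, mask, cls
        if best is None:                          # no acceptable rule found: stop
            break
        rules.append((best, best_cls))
        remaining &= ~best_mask                   # drop covered examples
    return rules

rng = np.random.default_rng(0)
X, y = np.random.rand(500, 4), np.random.randint(0, 2, 500)
print(len(iterative_rule_learning(X, y, rng)), "rules learned")
```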