
    EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data

    Classification problems with an imbalanced class distribution have received an increased amount of attention within the machine learning community over the last decade. They are encountered in a growing number of real-world situations and pose a challenge to standard machine learning techniques. We propose a new hybrid method specifically tailored to handle class imbalance, called EPRENNID. It performs an evolutionary prototype reduction focused on providing diverse solutions, to prevent the method from overfitting the training set. It also allows us to explicitly reduce the underrepresented class, which the most common preprocessing solutions for class imbalance usually protect. As part of the experimental study, we show that the proposed prototype reduction method outperforms state-of-the-art preprocessing techniques. The preprocessing step yields multiple prototype sets that are later used in an ensemble, which performs a weighted voting scheme with the nearest neighbor classifier. EPRENNID is experimentally shown to significantly outperform previous proposals.
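
    The ensemble step described above can be made concrete with a short sketch: several prototype sets, each paired with a weight, back individual 1-NN classifiers whose predictions are combined by weighted voting. The data layout, function names, and weighting scheme here are illustrative assumptions, not EPRENNID's exact implementation.

```python
import numpy as np
from collections import Counter

def nn_predict(prototypes, labels, x):
    """Predict the label of x with a 1-nearest-neighbor rule."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return labels[int(np.argmin(dists))]

def weighted_vote_predict(prototype_sets, x):
    """Combine 1-NN predictions from several prototype sets by weighted vote.

    prototype_sets: list of (prototypes, labels, weight) triples, assumed
    to come from the evolutionary prototype reduction step.
    """
    votes = Counter()
    for prototypes, labels, weight in prototype_sets:
        votes[nn_predict(np.asarray(prototypes), labels, x)] += weight
    return votes.most_common(1)[0][0]
```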

    Hybridizing and applying computational intelligence techniques

    As computers are increasingly relied upon to perform tasks of growing complexity affecting many aspects of society, it is imperative that the underlying computational methods performing these tasks offer high performance in terms of effectiveness and scalability. A common means of performing such complex tasks is the use of computational intelligence (CI) techniques. CI techniques use approaches influenced by nature to solve problems in which traditional modeling approaches fail due to impracticality, intractability, or mathematical ill-posedness. While CI techniques can perform considerably better than traditional modeling approaches when solving complex problems, the scalability of a given CI technique alone is not always optimal. Hybridization is a popular process by which a better-performing CI technique is created from the combination of multiple existing techniques in a logical manner. The first paper in this thesis presents a novel hybridization of two CI techniques, accuracy-based learning classifier systems (XCS) and cluster analysis, which improves upon the efficiency and, in some cases, the effectiveness of XCS. A number of tasks in software engineering are performed manually, such as defining the expected output in model transformation testing. Since the number and size of projects that rely on such manual tasks continue to grow, it is critical that automated approaches are employed to reduce or eliminate the manual effort in order to scale efficiently. The second paper in this thesis details a novel application of a CI technique, multi-objective simulated annealing, to the task of test case model generation, reducing the effort required to manually update expected transformation output.
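
    As a rough illustration of the kind of search loop simulated annealing performs, the sketch below minimizes a weighted sum of objective functions with the Metropolis acceptance rule. The scalarization, cooling schedule, and parameter names are assumptions made for illustration; the paper's actual multi-objective acceptance criterion may differ.

```python
import math
import random

def simulated_annealing(init, neighbor, objectives, weights,
                        t0=1.0, cooling=0.995, steps=10_000):
    """Minimize a weighted sum of objective functions by annealing."""
    def cost(state):
        return sum(w * f(state) for w, f in zip(weights, objectives))

    current, t = init, t0
    current_cost = cost(current)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = neighbor(current)
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t) (Metropolis criterion).
        if delta <= 0 or random.random() < math.exp(-delta / t):
            current, current_cost = candidate, cand_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
        t *= cooling  # geometric cooling schedule
    return best
```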

    The diversity-accuracy duality in ensembles of classifiers

    Horizontal scaling of machine learning algorithms has the potential to tackle concerns over the scalability and sustainability of deep learning methods, viz. their consumption of energy and computational resources, as well as their increasing inaccessibility to researchers. One way to enact horizontal scaling is by employing ensemble learning methods, since they enable distribution. There is consensus that diversity between individual learners leads to better performance, which is why we have focused on it as the criterion for distributing the base models of an ensemble. However, there is no standard agreement on how diversity should be defined, and thus on how to exploit it to construct a high-performing classifier. Therefore, we have proposed different definitions of diversity and innovative algorithms which promote it in a systematic way. We first considered architectural diversity with an algorithm called WILDA: Wide Learning of Diverse Architectures. In a distributed fashion, this algorithm evolves a set of neural networks that are pretrained on the target task and diverse w.r.t. architectural feature descriptors. We then generalised this notion by defining behavioural diversity on the basis of the divergence between the errors made by different models on a dataset. We defined several diversity metrics and used them to guide a novelty search algorithm which builds an ensemble of behaviourally diverse classifiers. The algorithm promotes diversity in ensembles by explicitly searching for it, without selecting for accuracy. We then extended this approach with a surrogate diversity model, which reduces the computational burden of the search by eliminating the need to train each network in the population with stochastic gradient descent at each step. These methods enabled us to investigate the role that both architectural and behavioural diversity play in the performance of an ensemble. To study the relationship between diversity and accuracy in classifier ensembles, we then proposed several methods that extend the novelty search with accuracy objectives. Surprisingly, we observed that, with the highest-performing diversity metrics, there is an equivalence between searching for diversity objectives and searching for accuracy objectives. This contradicts the widespread assumption that a trade-off must be found by balancing diversity and accuracy objectives. We therefore posit the existence of a diversity-accuracy duality in ensembles of classifiers. An implication of this is the possibility of evolving diverse ensembles without detriment to their accuracy, since accuracy is implicitly ensured.
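
    One concrete way to measure behavioural diversity from the divergence between models' predictions, in the spirit of the description above, is mean pairwise disagreement over a dataset. This generic metric is a sketch, not necessarily one of the thesis's exact metrics.

```python
import numpy as np
from itertools import combinations

def pairwise_disagreement(predictions):
    """Mean fraction of examples on which two ensemble members disagree.

    predictions: array of shape (n_members, n_examples) holding each
    member's predicted class labels on a common dataset.
    """
    predictions = np.asarray(predictions)
    pairs = combinations(range(len(predictions)), 2)
    return float(np.mean([
        np.mean(predictions[i] != predictions[j]) for i, j in pairs
    ]))

# Example: three members classifying four instances.
print(pairwise_disagreement([[0, 1, 1, 0],
                             [0, 1, 0, 0],
                             [1, 1, 1, 0]]))
```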

    Prediction of Sudden Cardiac Death Using Ensemble Classifiers

    Sudden Cardiac Death (SCD) is a medical problem that is responsible for over 300,000 deaths per year in the United States and millions worldwide. SCD is defined as death occurring within one hour of the onset of acute symptoms, an unwitnessed death in the absence of pre-existing progressive circulatory failure or other causes of death, or death during attempted resuscitation. Sudden death due to cardiac causes is a leading cause of death among Congestive Heart Failure (CHF) patients. The use of Electronic Medical Records (EMR) systems has made a wealth of medical data available for research and analysis. Supervised machine learning methods have been successfully used for medical diagnosis, and ensemble classifiers are known to achieve better prediction accuracy than their constituent base classifiers. In an effort to understand the factors contributing to SCD, data on 2,521 patients were collected for the Sudden Cardiac Death in Heart Failure Trial (SCD-HeFT). The data included 96 features gathered over a period of 5 years. The goal of this dissertation was to develop a model that could accurately predict SCD based on the available features. The prediction model used the Cox proportional hazards model as a score and the ExtraTreesClassifier algorithm as a boosting mechanism to create the ensemble. We tested the system at prediction points of 180 days and 365 days. Our best results were at 180 days, with an accuracy of 0.9624, a specificity of 0.9915, and an F1 score of 0.9607.
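
    A minimal sketch of the two-stage design described above: a Cox proportional hazards model produces a risk score that is appended to the feature matrix before fitting scikit-learn's ExtraTreesClassifier. The lifelines wiring, column names, and hyperparameters are assumptions for illustration, not the study's code.

```python
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.ensemble import ExtraTreesClassifier

def fit_scd_model(df, duration_col="days", event_col="scd"):
    """Fit a Cox score, then an extra-trees ensemble on features + score."""
    # Stage 1: Cox proportional hazards model over all covariates.
    cox = CoxPHFitter()
    cox.fit(df, duration_col=duration_col, event_col=event_col)

    # Stage 2: tree ensemble over the raw features plus the Cox risk score.
    X = df.drop(columns=[duration_col, event_col])
    X = X.assign(cox_risk=cox.predict_partial_hazard(df).values)
    clf = ExtraTreesClassifier(n_estimators=500, random_state=0)
    clf.fit(X, df[event_col])
    return cox, clf
```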

    An academic review: applications of data mining techniques in finance industry

    With the development of Internet technologies, data volumes are doubling every two years, faster than predicted by Moore's Law. Big Data analytics has therefore become particularly important for enterprise business, and modern computational technologies provide effective tools to help understand hugely accumulated data and leverage this information for insight into the finance industry. Since there are no physical products to manufacture in the finance industry, data has become the most valuable asset of financial organisations, and data mining techniques come to the rescue by allowing access to the right information at the right time. These techniques are used by the finance industry in areas such as fraud detection, intelligent forecasting, credit rating, loan management, customer profiling, money laundering, marketing, and prediction of price movements, to name a few. This work surveys the research on data mining techniques applied to the finance industry from 2010 to 2015. The review finds that stock prediction and credit rating have received the most attention from researchers, compared to loan prediction, money laundering, and time series prediction. Due to the dynamics, uncertainty, and variety of the data, nonlinear mapping techniques have been studied more deeply than linear techniques, and hybrid methods have proven more accurate in prediction, closely followed by neural network techniques. This survey offers an overview of applications of data mining techniques in the finance industry and a summary of methodologies for researchers in this area; in particular, it provides a good starting point for beginners who want to work in the field of computational finance.

    New perspectives and methods for stream learning in the presence of concept drift.

    Applications that generate data in the form of fast streams from non-stationary environments, that is, those where the underlying phenomena change over time, are becoming increasingly prevalent. In this kind of environment, the probability density function of the data-generating process may change over time, producing a drift. This causes predictive models trained on these stream data to become obsolete and fail to adapt to the new distribution. Especially in online learning scenarios, there is a pressing need for new algorithms that adapt to this change as fast as possible while maintaining good performance scores. Examples of these applications include making inferences or predictions based on financial data, energy demand and climate data analysis, web usage or sensor network monitoring, and malware/spam detection, among many others. Online learning and concept drift are two of the hottest topics in the recent literature due to their relevance for the so-called Big Data paradigm, in which a growing number of applications are based on continuously available training data, known as data streams. Learning in non-stationary environments thus requires adaptive or evolving approaches that can monitor and track the underlying changes and adapt a model to accommodate those changes accordingly. In this thesis I provide a comprehensive review of state-of-the-art approaches, identify the most relevant open challenges in the literature, and address three of them by providing innovative perspectives and methods. The thesis gives a complete overview of several related fields and tackles several open challenges identified in the very recent state of the art. Concretely, it presents an innovative way to generate artificial diversity in ensembles, a set of necessary adaptations and improvements that allow spiking neural networks to be used in online learning scenarios, and finally a drift detector based on the latter algorithm. Together, these approaches constitute an innovative body of work presenting new perspectives and methods for the field.
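
    To make the notion of a drift detector concrete, the sketch below monitors a classifier's running error rate in the spirit of the classic DDM detector of Gama et al. (2004), flagging a warning or a drift when the error's mean plus standard deviation exceeds thresholds over its historical minimum. It is only an illustration; the thesis builds its detector on spiking neural networks instead.

```python
import math

class SimpleDriftDetector:
    """Error-rate drift detector in the style of DDM."""

    def __init__(self, warn=2.0, drift=3.0):
        self.warn, self.drift = warn, drift
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0          # running error rate
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model misclassified the instance, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n           # incremental mean
        s = math.sqrt(self.p * (1 - self.p) / self.n)  # binomial std. dev.
        if self.p + s < self.p_min + self.s_min:       # new best level
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift * self.s_min:
            self.reset()                               # drift confirmed
            return "drift"
        if self.p + s > self.p_min + self.warn * self.s_min:
            return "warning"
        return "stable"
```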

    Data classification using genetic programming.

    Genetic programming (GP), a field of artificial intelligence, is an evolutionary algorithm which evolves a population of trees that represent programs. These programs are used to solve problems. This dissertation investigates the use of genetic programming for data classification. In machine learning, data classification is the process of allocating a class label to an instance of data, and a classifier is created to perform these allocations. Several studies have investigated the use of GP to solve data classification problems and have shown that GP is able to create classifiers with high classification accuracy. However, certain aspects have not previously been investigated, and five such areas were examined in this dissertation. The first was how discretisation could be incorporated into a GP algorithm; an adaptive discretisation algorithm was proposed and outperformed certain existing methods. The second was a comparison of GP representations for binary data classification; the findings indicated that of the representations examined (arithmetic trees, decision trees, and logical trees), decision trees performed best. The third was the use of the encapsulation genetic operator and its effect on data classification; the findings revealed an improvement in both training and test results when encapsulation was incorporated. The fourth was an investigative analysis of several hybridisations of a GP algorithm with a genetic algorithm in order to evolve a population of ensembles; four methods were proposed, and these outperformed certain existing GP and ensemble methods. Finally, the fifth area was an ensemble construction method for classification in which GP evolved a single ensemble; the proposed method improved training and test accuracy compared to the standard GP algorithm. The methods proposed in this dissertation were tested on publicly available data sets, and the results were statistically tested to determine the effectiveness of the proposed approaches.
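
    As a toy illustration of the decision-tree representation that performed best in the comparison above, an individual can be encoded as nested (feature, threshold, left, right) splits with class labels at the leaves, and its fitness taken as classification accuracy. The encoding and fitness function are illustrative assumptions, not the dissertation's exact design.

```python
def evaluate_tree(tree, x):
    """Classify instance x (a sequence of numeric features) with a GP tree."""
    if not isinstance(tree, tuple):        # leaf: a class label
        return tree
    feature, threshold, left, right = tree
    branch = left if x[feature] <= threshold else right
    return evaluate_tree(branch, x)

def fitness(tree, X, y):
    """Classification accuracy of the tree on a labeled dataset."""
    correct = sum(evaluate_tree(tree, x) == label for x, label in zip(X, y))
    return correct / len(y)

# Example evolved individual: split on feature 0 at 0.5, then on feature 1.
tree = (0, 0.5, "negative", (1, 2.0, "negative", "positive"))
print(fitness(tree, [[0.2, 0.0], [0.9, 3.1]], ["negative", "positive"]))
```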