5 research outputs found

    Instance selection for model-based classifiers

    Aspects of a classifier's training dataset can often make building a helpful, high-accuracy classifier difficult. Instance selection addresses some of these issues by selecting a subset of the data such that learning from the reduced dataset leads to a better classifier. This work introduces an integer programming formulation of instance selection that relies on column generation techniques to obtain a good solution to the problem. Experimental results show that instance selection improves the usefulness of some classifiers by optimizing the training data so that the training dataset has easier-to-learn boundaries between class values. Also included in this paper are two case studies from the Surveillance, Epidemiology, and End Results (SEER) database that further confirm the benefit of instance selection. Overall, results indicate that performing instance selection for a classifier is a competitive classification approach. However, it should be noted that instance selection might overfit classifiers that have already achieved a good fit to the dataset.
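
    The abstract does not describe the integer programming formulation in detail, so the following is only a rough sketch of instance selection as a wrapper-style subset search; the greedy removal heuristic, the k-NN base learner, and the validation split are assumptions made for illustration, not the paper's column-generation method.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split

    def greedy_instance_selection(X, y, random_state=0):
        """Illustrative greedy wrapper: drop training instances whose removal
        does not hurt validation accuracy. The paper instead formulates the
        selection as an integer program solved via column generation."""
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=random_state)
        keep = np.ones(len(X_tr), dtype=bool)

        def score(mask):
            clf = KNeighborsClassifier(n_neighbors=3)
            clf.fit(X_tr[mask], y_tr[mask])
            return clf.score(X_val, y_val)

        best = score(keep)
        for i in np.random.RandomState(random_state).permutation(len(X_tr)):
            trial = keep.copy()
            trial[i] = False
            if trial.sum() < 10:      # always retain a minimal training set
                continue
            s = score(trial)
            if s >= best:             # removal did not hurt: keep the smaller set
                keep, best = trial, s
        return X_tr[keep], y_tr[keep], best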

    A survey on pre-processing techniques: relevant issues in the context of environmental data mining

    One of the important issues in all forms of data analysis, whether statistical analysis, machine learning, data mining, data science or any other form of data-driven modeling, is data quality. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. Models built over incorrect or incomplete data will be useless, and as a consequence the quality of decisions made with those models also depends on data quality. This is why pre-processing is one of the most critical steps of data analysis in any of its forms. However, pre-processing has not yet been properly systematized, and little research focuses on it. This paper presents a survey of the most popular pre-processing steps required in environmental data analysis, together with a proposal to systematize them. Rather than providing technical details on specific pre-processing techniques, the paper focuses on giving general ideas to a non-expert user, who can then decide which technique is most suitable for his/her problem.
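
    As a quick, non-authoritative illustration of the kind of pre-processing chain the survey discusses, the sketch below wires a few common steps together; the specific choices of median imputation, constant-column removal, and standardisation are assumptions made for the example, not the paper's recommendations.

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import VarianceThreshold

    def build_preprocessor():
        # Each step addresses one data-quality issue of the kind the survey
        # mentions: missing values, redundant (constant) columns, and
        # heterogeneous scales.
        return Pipeline(steps=[
            ("impute", SimpleImputer(strategy="median")),
            ("drop_constant", VarianceThreshold(threshold=0.0)),
            ("scale", StandardScaler()),
        ])

    # Usage on a numeric feature matrix X:
    # X_clean = build_preprocessor().fit_transform(X)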

    A balanced approach to the multi-class imbalance problem

    The multi-class class-imbalance problem is a subset of supervised machine learning tasks in which the classification variable of interest consists of three or more categories with unequal sample sizes. In the fields of manufacturing and business, common machine learning classification tasks such as failure mode, fraud, and threat detection often exhibit class imbalance due to the infrequent occurrence of one or more event states. Though machine learning as a discipline is well established, the study of class imbalance with respect to multi-class learning does not yet have the same deep, rich history. In its current state, the class imbalance literature leverages biased sampling and increased model complexity to improve predictive performance, and while some approaches have made advances, there is still no standard model evaluation criterion against which to compare their performance. In the presence of substantial multi-class distributional skew, many of the evaluation criteria that scale beyond the binary case become invalid due to their over-emphasis on the majority-class observations. Going a step further, the evaluation criteria used in practice vary significantly across the class imbalance literature, and so far no single measure has been able to galvanize consensus, due not only to implementation complexity but also to the existence of undesirable properties. Therefore, the focus of this research is to introduce a new performance measure, Class Balance Accuracy, designed specifically for model validation in the presence of multi-class imbalance. This paper begins with the definition of Class Balance Accuracy and provides an intuitive proof of its interpretation as a simultaneous lower bound for the average per-class recall and average per-class precision. Results from comparison studies show that models chosen by maximizing the training class balance accuracy consistently yield both high overall accuracy and high per-class recall on the test sets compared to models chosen by other criteria. Simulation studies were then conducted to highlight specific scenarios where the use of class balance accuracy outperforms model selection based on regular accuracy. The measure is then invoked in two novel applications: as the maximization criterion in the instance selection biased sampling technique, and as a model selection tool in a multiple classifier system prediction algorithm. In the case of instance selection, the use of class balance accuracy shows improvement over traditional accuracy in scenarios of multi-class class-imbalance data sets with low separability between the majority and minority classes. Likewise, the use of CBA in the multiple classifier system resulted in improved predictions over state-of-the-art methods such as AdaBoost on some of the UCI Machine Learning Repository test data sets. The paper concludes with a discussion of the climbR package, a repository of functions designed to aid in model evaluation and prediction for class-imbalance machine learning problems.
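
    The abstract states only the defining property of Class Balance Accuracy (a simultaneous lower bound on average per-class recall and average per-class precision). The sketch below computes a measure with exactly that property from a confusion matrix by dividing each diagonal entry by the larger of its row and column totals; treat this formula as an illustrative assumption and consult the paper or the climbR package for the authoritative definition.

    import numpy as np

    def class_balance_accuracy(conf_matrix):
        """Mean over classes of C[i, i] / max(row_i total, column_i total).
        Each per-class term is <= both that class's recall (C[i, i] / row total)
        and its precision (C[i, i] / column total), so the mean lower-bounds
        both macro recall and macro precision. Illustrative only."""
        C = np.asarray(conf_matrix, dtype=float)
        denom = np.maximum(C.sum(axis=1), C.sum(axis=0))
        return float(np.mean(np.diag(C) / denom))

    # Toy example: a heavily imbalanced three-class confusion matrix
    C = np.array([[90, 5, 5],
                  [ 4, 3, 3],
                  [ 2, 2, 6]])
    print(round(class_balance_accuracy(C), 3))   # roughly 0.54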

    Towards a more efficient use of computational budget in large-scale black-box optimization

    Evolutionary algorithms are general-purpose optimizers that have been shown to be effective in solving a variety of challenging optimization problems. In contrast to mathematical programming models, evolutionary algorithms do not require derivative information and remain effective when the algebraic formula of the given problem is unavailable. Nevertheless, the rapid advances in science and technology have witnessed the emergence of more complex optimization problems than ever, which pose significant challenges to traditional optimization methods. When the available computational budget is limited, the dimensionality of the search space is one of the main contributors to a problem's difficulty and complexity. This so-called curse of dimensionality can significantly affect the efficiency and effectiveness of optimization methods, including evolutionary algorithms. This research studies two topics related to a more efficient use of the computational budget in evolutionary algorithms when solving large-scale black-box optimization problems: the role of population initializers in saving computational resources, and computational budget allocation in cooperative coevolutionary algorithms. Consequently, this dissertation consists of two major parts, each of which relates to one of these research directions. In the first part, we review several population initialization techniques that have been used in evolutionary algorithms and categorize them from different perspectives. The contribution of each category to improving evolutionary algorithms in solving large-scale problems is measured. We also study the mutual effect of population size and initialization technique on the performance of evolutionary techniques when dealing with large-scale problems. Finally, assuming uniformity of the initial population to be a key contributor to saving a significant part of the computational budget, we investigate whether achieving a high level of uniformity in high-dimensional spaces is feasible given practical restrictions on computational resources. In the second part of the thesis, we study large-scale imbalanced problems. In many real-world applications, a large problem may consist of subproblems with different degrees of difficulty and importance. In addition, the solution to each subproblem may contribute differently to the overall objective value of the final solution. When the computational budget is restricted, as is the case in many practical problems, investing the same portion of resources in optimizing each of these imbalanced subproblems is not the most efficient strategy. Therefore, we examine several ways to learn the contribution of each subproblem and then dynamically allocate the limited computational resources to each of them according to its contribution to the overall objective value of the final solution. To demonstrate the effectiveness of the proposed framework, we design a new set of 40 large-scale imbalanced problems and study the performance of some possible instances of the framework.
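
    The abstract does not spell out the allocation rule, so the loop below is only a rough sketch of the general idea: estimate each subproblem's recent contribution to the overall objective and spend more of the remaining evaluations on the larger contributors. The toy sphere subcomponents, the random-search optimizer, and the contribution-weighted batching are assumptions for illustration, not the framework proposed in the thesis.

    import numpy as np

    class SphereSubproblem:
        """Toy subcomponent: minimise a scaled sphere function on its own block
        of variables; the scale factor induces imbalance between subproblems."""
        def __init__(self, dim, scale, rng):
            self.x = rng.normal(size=dim)
            self.scale = scale
            self.rng = rng

        def objective(self):
            return self.scale * np.sum(self.x ** 2)

        def optimize(self, evals):
            """Spend `evals` random-search steps; return the improvement achieved."""
            before = self.objective()
            for _ in range(evals):
                cand = self.x + self.rng.normal(scale=0.1, size=self.x.shape)
                if self.scale * np.sum(cand ** 2) < self.objective():
                    self.x = cand
            return before - self.objective()

    def allocate_budget(subproblems, total_evals, batch=200):
        """Spend more of each batch on subproblems whose recent optimization
        reduced the overall objective (sum of subproblem objectives) the most."""
        contrib = np.ones(len(subproblems))           # optimistic start
        spent = 0
        while spent < total_evals:
            weights = contrib / contrib.sum()
            for i, sub in enumerate(subproblems):
                evals = max(1, int(weights[i] * batch))
                improvement = sub.optimize(evals)
                contrib[i] = 0.5 * contrib[i] + 0.5 * max(improvement, 1e-9)
                spent += evals
        return [s.objective() for s in subproblems]

    rng = np.random.default_rng(0)
    subs = [SphereSubproblem(20, s, rng) for s in (100.0, 1.0, 0.01)]  # imbalanced
    print(allocate_budget(subs, total_evals=5000))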