254 research outputs found

    A hybrid decision tree/genetic algorithm method for data mining

    Get PDF

    An artificial immune system for fuzzy-rule induction in data mining

    Get PDF
    This work proposes a classification-rule discovery algorithm integrating artificial immune systems and fuzzy systems. The algorithm consists of two parts: a sequential covering procedure and a rule evolution procedure. Each antibody (candidate solution) corresponds to a classification rule. The classification of new examples (antigens) considers not only the fitness of a fuzzy rule based on the entire training set, but also the affinity between the rule and the new example. This affinity must be greater than a threshold in order for the fuzzy rule to be activated, and it is proposed an adaptive procedure for computing this threshold for each rule. This paper reports results for the proposed algorithm in several data sets. Results are analyzed with respect to both predictive accuracy and rule set simplicity, and are compared with C4.5rules, a very popular data mining algorithm

    Interpretable Categorization of Heterogeneous Time Series Data

    Get PDF
    Understanding heterogeneous multivariate time series data is important in many applications ranging from smart homes to aviation. Learning models of heterogeneous multivariate time series that are also human-interpretable is challenging and not adequately addressed by the existing literature. We propose grammar-based decision trees (GBDTs) and an algorithm for learning them. GBDTs extend decision trees with a grammar framework. Logical expressions derived from a context-free grammar are used for branching in place of simple thresholds on attributes. The added expressivity enables support for a wide range of data types while retaining the interpretability of decision trees. In particular, when a grammar based on temporal logic is used, we show that GBDTs can be used for the interpretable classi cation of high-dimensional and heterogeneous time series data. Furthermore, we show how GBDTs can also be used for categorization, which is a combination of clustering and generating interpretable explanations for each cluster. We apply GBDTs to analyze the classic Australian Sign Language dataset as well as data on near mid-air collisions (NMACs). The NMAC data comes from aircraft simulations used in the development of the next-generation Airborne Collision Avoidance System (ACAS X).Comment: 9 pages, 5 figures, 2 tables, SIAM International Conference on Data Mining (SDM) 201

    An under-Sampled Approach for Handling Skewed Data Distribution using Cluster Disjuncts

    Get PDF
    In Data mining and Knowledge Discovery hidden and valuable knowledge from the data sources is discovered. The traditional algorithms used for knowledge discovery are bottle necked due to wide range of data sources availability. Class imbalance is a one of the problem arises due to data source which provide unequal class i.e. examples of one class in a training data set vastly outnumber examples of the other class(es). Researchers have rigorously studied several techniques to alleviate the problem of class imbalance, including resampling algorithms, and feature selection approaches to this problem. In this paper, we present a new hybrid frame work dubbed as Majority Under-sampling based on Cluster Disjunct (MAJOR_CD) for learning from skewed training data. This algorithm provides a simpler and faster alternative by using cluster disjunct concept. We conduct experiments using twelve UCI data sets from various application domains using five algorithms for comparison on six evaluation metrics. The empirical study suggests that MAJOR_CD have been believed to be effective in addressing the class imbalance problem

    Self learning neuro-fuzzy modeling using hybrid genetic probabilistic approach for engine air/fuel ratio prediction

    Get PDF
    Machine Learning is concerned in constructing models which can learn and make predictions based on data. Rule extraction from real world data that are usually tainted with noise, ambiguity, and uncertainty, automatically requires feature selection. Neuro-Fuzzy system (NFS) which is known with its prediction performance has the difficulty in determining the proper number of rules and the number of membership functions for each rule. An enhanced hybrid Genetic Algorithm based Fuzzy Bayesian classifier (GA-FBC) was proposed to help the NFS in the rule extraction. Feature selection was performed in the rule level overcoming the problems of the FBC which depends on the frequency of the features leading to ignore the patterns of small classes. As dealing with a real world problem such as the Air/Fuel Ratio (AFR) prediction, a multi-objective problem is adopted. The GA-FBC uses mutual information entropy, which considers the relevance between feature attributes and class attributes. A fitness function is proposed to deal with multi-objective problem without weight using a new composition method. The model was compared to other learning algorithms for NFS such as Fuzzy c-means (FCM) and grid partition algorithm. Predictive accuracy and the complexity of the Fuzzy Rule Base System (FRBS) including number of rules and number of terms in each rule were taken as terms of evaluation. It was also compared to the original GA-FBC depending on the frequency not on Mutual Information (MI). Experimental results using Air/Fuel Ratio (AFR) data sets show that the new model participates in decreasing the average number of attributes in the rule and sometimes in increasing the average performance compared to other models. This work facilitates in achieving a self-generating FRBS from real data. The GA-FBC can be used as a new direction in machine learning research. This research contributes in controlling automobile emissions in helping the reduction of one of the most causes of pollution to produce greener environment

    Prediction in Financial Markets: The Case for Small Disjuncts

    Get PDF
    Predictive models in regression and classification problems typically have a single model that covers most, if not all, cases in the data. At the opposite end of the spectrum is a collection of models each of which covers a very small subset of the decision space. These are referred to as “small disjuncts.” The tradeoffs between the two types of models have been well documented. Single models, especially linear ones, are easy to interpret and explain. In contrast, small disjuncts do not provide as clean or as simple an interpretation of the data, and have been shown by several researchers to be responsible for a disproportionately large number of errors when applied to out of sample data. This research provides a counterpoint, demonstrating that “simple” small disjuncts provide a credible model for financial market prediction, a problem with a high degree of noise. A related novel contribution of this paper is a simple method for measuring the “yield” of a learning system, which is the percentage of in sample performance that the learned model can be expected to realize on out-of-sample data. Curiously, such a measure is missing from the literature on regression learning algorithms.NYU Stern School of Busines

    Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case Studies

    Full text link
    In this paper, we study sparsity-exploiting Mastermind algorithms for attacking the privacy of an entire database of character strings or vectors, such as DNA strings, movie ratings, or social network friendship data. Based on reductions to nonadaptive group testing, our methods are able to take advantage of minimal amounts of privacy leakage, such as contained in a single bit that indicates if two people in a medical database have any common genetic mutations, or if two people have any common friends in an online social network. We analyze our Mastermind attack algorithms using theoretical characterizations that provide sublinear bounds on the number of queries needed to clone the database, as well as experimental tests on genomic information, collaborative filtering data, and online social networks. By taking advantage of the generally sparse nature of these real-world databases and modulating a parameter that controls query sparsity, we demonstrate that relatively few nonadaptive queries are needed to recover a large majority of each database

    LOGIC AND CONSTRAINT PROGRAMMING FOR COMPUTATIONAL SUSTAINABILITY

    Get PDF
    Computational Sustainability is an interdisciplinary field that aims to develop computational and mathematical models and methods for decision making concerning the management and allocation of resources in order to help solve environmental problems. This thesis deals with a broad spectrum of such problems (energy efficiency, water management, limiting greenhouse gas emissions and fuel consumption) giving a contribution towards their solution by means of Logic Programming (LP) and Constraint Programming (CP), declarative paradigms from Artificial Intelligence of proven solidity. The problems described in this thesis were proposed by experts of the respective domains and tested on the real data instances they provided. The results are encouraging and show the aptness of the chosen methodologies and approaches. The overall aim of this work is twofold: both to address real world problems in order to achieve practical results and to get, from the application of LP and CP technologies to complex scenarios, feedback and directions useful for their improvement

    DISCOVERING INTERESTING PATTERNS FOR INVESTMENT DECISION MAKING WITH GLOWER C - A GENETIC LEARNER OVERLAID WITH ENTROPY REDUCTION

    Get PDF
    Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be weak or non-existent, which makes problem formulation open-ended by forcing us to consider a large number of independent variables and thereby increasing the dimensionality of the search space. Second, the weak relationships among variables tend to be nonlinear, and may hold only in limited areas of the search space. Third, in financial practice, where analysts conduct extensive manual analysis of historically well performing indicators, a key is to find the hidden interactions among variables that perform well in combination. Unfortunately, these are exactly the patterns that the greedy search biases incorporated by many standard rule algorithms will miss. In this paper, we describe and evaluate several variations of a new genetic learning algorithm (GLOWER) on a variety of data sets. The design of GLOWER has been motivated by financial prediction problems, but incorporates successful ideas from tree induction and rule learning. We examine the performance of several GLOWER variants on two UCI data sets as well as on a standard financial prediction problem (S&P500 stock returns), using the results to identify and use one of the better variants for further comparisons. We introduce a new (to KDD) financial prediction problem (predicting positive and negative earnings surprises), and experiment withGLOWER, contrasting it with tree- and rule-induction approaches. Our results are encouraging, showing that GLOWER has the ability to uncover effective patterns for difficult problems that have weak structure and significant nonlinearities.Information Systems Working Papers Serie
    • …
    corecore