254 research outputs found
An artificial immune system for fuzzy-rule induction in data mining
This work proposes a classification-rule discovery algorithm integrating artificial immune systems and fuzzy systems. The algorithm consists of two parts: a sequential covering procedure and a rule evolution procedure. Each antibody (candidate solution) corresponds to a classification rule. The classification of new examples (antigens) considers not only the fitness of a fuzzy rule based on the entire training set, but also the affinity between the rule and the new example. This affinity must be greater than a threshold in order for the fuzzy rule to be activated, and an adaptive procedure is proposed for computing this threshold for each rule. This paper reports results for the proposed algorithm on several data sets. Results are analyzed with respect to both predictive accuracy and rule set simplicity, and are compared with C4.5rules, a very popular data mining algorithm.
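The activation mechanism described above can be illustrated with a short sketch: a fuzzy rule fires on a new example only when its affinity (degree of match) exceeds a per-rule threshold, and activated rules vote with the product of their fitness and affinity. The class names, triangular membership functions, and scoring below are illustrative assumptions, not the paper's implementation (which also adapts the threshold during training).

```python
def tri_membership(x, a, b, c):
    """Triangular fuzzy membership function with peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

class FuzzyRule:
    """One antibody/candidate rule: fuzzy conditions on some attributes plus a class label."""
    def __init__(self, conditions, label, fitness, threshold):
        self.conditions = conditions  # {attr_index: (a, b, c)} triangular fuzzy sets
        self.label = label
        self.fitness = fitness        # estimated on the full training set
        self.threshold = threshold    # per-rule activation threshold (adapted during training)

    def affinity(self, x):
        # Affinity between rule and example: minimum membership over the rule's conditions.
        return min(tri_membership(x[i], *abc) for i, abc in self.conditions.items())

def classify(rules, x, default_label=0):
    """Activate only rules whose affinity exceeds their threshold;
    score each class by fitness * affinity of its activated rules."""
    scores = {}
    for r in rules:
        aff = r.affinity(x)
        if aff > r.threshold:
            scores[r.label] = scores.get(r.label, 0.0) + r.fitness * aff
    return max(scores, key=scores.get) if scores else default_label

# Illustrative rule: "IF x0 is around 1.0 AND x1 is around 5.0 THEN class 1"
rule = FuzzyRule({0: (0.0, 1.0, 2.0), 1: (3.0, 5.0, 7.0)}, label=1, fitness=0.8, threshold=0.3)
print(classify([rule], [1.2, 4.5]))  # affinity 0.75 > 0.3 -> predicts 1
```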
Interpretable Categorization of Heterogeneous Time Series Data
Understanding heterogeneous multivariate time series data is important in
many applications ranging from smart homes to aviation. Learning models of
heterogeneous multivariate time series that are also human-interpretable is
challenging and not adequately addressed by the existing literature. We propose
grammar-based decision trees (GBDTs) and an algorithm for learning them. GBDTs
extend decision trees with a grammar framework. Logical expressions derived
from a context-free grammar are used for branching in place of simple
thresholds on attributes. The added expressivity enables support for a wide
range of data types while retaining the interpretability of decision trees. In
particular, when a grammar based on temporal logic is used, we show that GBDTs
can be used for the interpretable classification of high-dimensional and
heterogeneous time series data. Furthermore, we show how GBDTs can also be used
for categorization, which is a combination of clustering and generating
interpretable explanations for each cluster. We apply GBDTs to analyze the
classic Australian Sign Language dataset as well as data on near mid-air
collisions (NMACs). The NMAC data comes from aircraft simulations used in the
development of the next-generation Airborne Collision Avoidance System (ACAS
X).
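A minimal sketch of the branching idea: a grammar-based decision tree node evaluates a logical (here, temporal-logic-style) predicate over an entire multivariate time series instead of comparing a single attribute to a threshold. The predicate constructor, feature indices, and the hand-built tree below are assumptions for illustration, not the GBDT learning algorithm itself.

```python
import numpy as np

# A branching predicate is any boolean function of a full multivariate series
# (shape: timesteps x features), e.g. one derived from a temporal-logic grammar.
def eventually_exceeds(feature, level):
    """Grammar production 'F (x_feature > level)': true if the feature ever exceeds level."""
    return lambda series: bool(np.any(series[:, feature] > level))

class GBDTNode:
    """Decision-tree node that branches on a grammar-derived predicate instead of a threshold."""
    def __init__(self, predicate=None, left=None, right=None, label=None):
        self.predicate, self.left, self.right, self.label = predicate, left, right, label

    def predict(self, series):
        if self.label is not None:          # leaf
            return self.label
        branch = self.left if self.predicate(series) else self.right
        return branch.predict(series)

# Tiny hand-built tree: "if feature 0 ever exceeds 500 and feature 1 ever exceeds 30,
# predict class 1" -- purely illustrative structure, not a learned model.
tree = GBDTNode(
    predicate=eventually_exceeds(0, 500.0),
    left=GBDTNode(predicate=eventually_exceeds(1, 30.0),
                  left=GBDTNode(label=1), right=GBDTNode(label=0)),
    right=GBDTNode(label=0),
)
series = np.column_stack([np.linspace(0, 600, 50), np.linspace(0, 40, 50)])
print(tree.predict(series))  # -> 1
```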
An Under-Sampled Approach for Handling Skewed Data Distribution using Cluster Disjuncts
Data mining and knowledge discovery aim to uncover hidden and valuable knowledge from data sources. Traditional knowledge-discovery algorithms are bottlenecked by the wide range of available data sources. Class imbalance is one such problem, arising when examples of one class in a training data set vastly outnumber examples of the other class(es). Researchers have rigorously studied several techniques to alleviate the class imbalance problem, including resampling algorithms and feature selection approaches. In this paper, we present a new hybrid framework, dubbed Majority Under-sampling based on Cluster Disjuncts (MAJOR_CD), for learning from skewed training data. The algorithm provides a simpler and faster alternative by using the cluster disjunct concept. We conduct experiments on twelve UCI data sets from various application domains, comparing against five algorithms on six evaluation metrics. The empirical study suggests that MAJOR_CD is effective in addressing the class imbalance problem.
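In the spirit of the cluster-disjunct idea, a hedged sketch of cluster-based majority under-sampling is given below: the majority class is clustered and a share of examples is kept from each cluster, so small disjuncts are not discarded wholesale. This is a generic sketch using scikit-learn's KMeans, not the exact MAJOR_CD procedure; the function name and parameters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X, y, majority_label, n_clusters=10, random_state=0):
    """Under-sample the majority class by clustering it and keeping an equal share
    of examples from each cluster (generic sketch, not the exact MAJOR_CD procedure)."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]

    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    clusters = km.fit_predict(X[maj_idx])

    per_cluster = max(1, len(min_idx) // n_clusters)  # target majority size ~ minority size
    keep = []
    for c in range(n_clusters):
        members = maj_idx[clusters == c]
        if len(members) == 0:
            continue
        take = min(per_cluster, len(members))
        keep.extend(rng.choice(members, size=take, replace=False))

    sel = np.concatenate([np.array(keep, dtype=int), min_idx])
    return X[sel], y[sel]
```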
Self learning neuro-fuzzy modeling using hybrid genetic probabilistic approach for engine air/fuel ratio prediction
Machine learning is concerned with constructing models that can learn and make predictions from data. Rule extraction from real-world data, which is usually tainted with noise, ambiguity, and uncertainty, requires automatic feature selection. The Neuro-Fuzzy System (NFS), known for its prediction performance, has difficulty determining the proper number of rules and the number of membership functions for each rule. An enhanced hybrid Genetic Algorithm based Fuzzy Bayesian classifier (GA-FBC) is proposed to help the NFS with rule extraction. Feature selection is performed at the rule level, overcoming the problem of the FBC, which depends on feature frequency and therefore tends to ignore the patterns of small classes. Because a real-world problem such as Air/Fuel Ratio (AFR) prediction is addressed, a multi-objective formulation is adopted. The GA-FBC uses mutual information entropy, which considers the relevance between feature attributes and class attributes. A fitness function is proposed that handles the multi-objective problem without weights, using a new composition method. The model is compared with other learning algorithms for the NFS, such as Fuzzy c-means (FCM) and the grid partition algorithm. Predictive accuracy and the complexity of the Fuzzy Rule Base System (FRBS), including the number of rules and the number of terms in each rule, are used as evaluation criteria. It is also compared with the original GA-FBC, which relies on feature frequency rather than Mutual Information (MI). Experimental results on Air/Fuel Ratio (AFR) data sets show that the new model reduces the average number of attributes per rule and sometimes increases average performance compared with the other models. This work facilitates building a self-generating FRBS from real data, and the GA-FBC can be used as a new direction in machine learning research. By helping control automobile emissions, this research contributes to reducing one of the major causes of pollution and to a greener environment.
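The mutual-information idea can be sketched as follows: candidate attributes are scored by their mutual information with the class, rather than by raw frequency, so features that are informative about small classes are not overlooked. The ranking function and the synthetic data are illustrative assumptions built on scikit-learn's mutual_info_classif; this is not the GA-FBC itself.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features_by_mi(X, y, top_k=5, random_state=0):
    """Score feature relevance to the class by mutual information and return the
    indices of the top_k features. A frequency-based scheme can be biased toward
    patterns of large classes; MI also rewards features informative about rare classes."""
    mi = mutual_info_classif(X, y, random_state=random_state)
    order = np.argsort(mi)[::-1]
    return order[:top_k], mi

# Illustrative use on synthetic data (names and shapes are assumptions):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                    # e.g. throttle, RPM, MAP, temperatures ...
y = (X[:, 2] + 0.5 * X[:, 5] > 0).astype(int)    # class driven by features 2 and 5
top, scores = rank_features_by_mi(X, y)
print(top)                                       # features 2 and 5 should rank highly
```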
Prediction in Financial Markets: The Case for Small Disjuncts
Predictive models in regression and classification problems typically
have a single model that covers most, if not all, cases in the data. At
the opposite end of the spectrum is a collection of models each of which
covers a very small subset of the decision space. These are referred to
as “small disjuncts.” The tradeoffs between the two types of
models have been well documented. Single models, especially linear ones,
are easy to interpret and explain. In contrast, small disjuncts do not
provide as clean or as simple an interpretation of the data, and have
been shown by several researchers to be responsible for a
disproportionately large number of errors when applied to out-of-sample
data. This research provides a counterpoint, demonstrating that
“simple” small disjuncts provide a credible model for
financial market prediction, a problem with a high degree of noise. A
related novel contribution of this paper is a simple method for
measuring the “yield” of a learning system, which is the
percentage of in-sample performance that the learned model can be
expected to realize on out-of-sample data. Curiously, such a measure is
missing from the literature on regression learning algorithms.
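A minimal sketch of the yield measure as the abstract defines it, i.e. the percentage of in-sample performance that the learned model realizes out-of-sample (the function name and the example scores are assumptions):

```python
def learning_yield(in_sample_score, out_of_sample_score):
    """'Yield': the percentage of in-sample performance that the learned model
    realizes on out-of-sample data."""
    if in_sample_score == 0:
        raise ValueError("in-sample score must be nonzero")
    return 100.0 * out_of_sample_score / in_sample_score

# e.g. a model with R^2 = 0.20 in sample and 0.12 out of sample has a 60% yield
print(learning_yield(0.20, 0.12))  # 60.0
```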
Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case Studies
In this paper, we study sparsity-exploiting Mastermind algorithms for
attacking the privacy of an entire database of character strings or vectors,
such as DNA strings, movie ratings, or social network friendship data. Based on
reductions to nonadaptive group testing, our methods are able to take advantage
of minimal amounts of privacy leakage, such as contained in a single bit that
indicates if two people in a medical database have any common genetic
mutations, or if two people have any common friends in an online social
network. We analyze our Mastermind attack algorithms using theoretical
characterizations that provide sublinear bounds on the number of queries needed
to clone the database, as well as experimental tests on genomic information,
collaborative filtering data, and online social networks. By taking advantage
of the generally sparse nature of these real-world databases and modulating a
parameter that controls query sparsity, we demonstrate that relatively few
nonadaptive queries are needed to recover a large majority of each database.
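A hedged sketch of the group-testing flavor of such an attack: nonadaptive OR-style pool queries against a hidden sparse binary vector, decoded with the standard COMP rule (any item appearing in a negative pool must be 0; everything else is guessed 1). The pool density and sizes are illustrative assumptions, not the paper's Mastermind algorithms.

```python
import numpy as np

def comp_recover(hidden, n_tests=60, pool_prob=0.05, seed=0):
    """Nonadaptive group testing against a sparse binary vector.
    Each test asks one OR-style question: 'does the hidden vector share any 1
    with this random pool?' (like 'any common friends / common mutations?').
    The COMP decoder marks every item that appears in a negative test as 0."""
    rng = np.random.default_rng(seed)
    n = hidden.size
    pools = rng.random((n_tests, n)) < pool_prob      # fixed in advance: nonadaptive
    answers = (pools.astype(int) @ hidden) > 0        # one leaked bit per test
    estimate = np.ones(n, dtype=int)
    for pool, ans in zip(pools, answers):
        if not ans:
            estimate[pool] = 0                        # members of negative pools are 0
    return estimate

hidden = np.zeros(500, dtype=int)
hidden[np.random.default_rng(1).choice(500, size=10, replace=False)] = 1
est = comp_recover(hidden)
print((est >= hidden).all(), est.sum())  # COMP never misses a 1; extra 1s are false positives
```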
LOGIC AND CONSTRAINT PROGRAMMING FOR COMPUTATIONAL SUSTAINABILITY
Computational Sustainability is an interdisciplinary field that aims to develop computational
and mathematical models and methods for decision making concerning
the management and allocation of resources in order to help solve environmental
problems.
This thesis deals with a broad spectrum of such problems (energy efficiency, water
management, limiting greenhouse gas emissions and fuel consumption) giving
a contribution towards their solution by means of Logic Programming (LP) and
Constraint Programming (CP), well-established declarative paradigms from
Artificial Intelligence.
The problems described in this thesis were proposed by experts of the respective
domains and tested on the real data instances they provided. The results are encouraging
and show the aptness of the chosen methodologies and approaches.
The overall aim of this work is twofold: to address real-world problems
in order to achieve practical results, and to obtain, from the application of LP and
CP technologies to complex scenarios, feedback and directions useful for their
improvement.
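As a flavor of the constraint-based modeling used for such resource-allocation problems, here is a toy constraint program using Google OR-Tools CP-SAT; the crops, values, and limits are made-up assumptions, not a model from the thesis.

```python
from ortools.sat.python import cp_model

# Toy allocation: assign limited water units to three crops to maximize value,
# subject to a total-supply constraint and a per-crop cap.
model = cp_model.CpModel()
supply, cap = 10, 6
values = {"wheat": 3, "corn": 4, "rice": 5}           # value per water unit (made-up numbers)
alloc = {c: model.NewIntVar(0, cap, c) for c in values}

model.Add(sum(alloc.values()) <= supply)              # shared resource constraint
model.Maximize(sum(values[c] * alloc[c] for c in values))

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print({c: solver.Value(v) for c, v in alloc.items()})
```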
DISCOVERING INTERESTING PATTERNS FOR INVESTMENT DECISION MAKING WITH GLOWER - A GENETIC LEARNER OVERLAID WITH ENTROPY REDUCTION
Prediction in financial domains is notoriously difficult for a number of reasons. First, theories tend to be
weak or non-existent, which makes problem formulation open-ended by forcing us to consider a large
number of independent variables and thereby increasing the dimensionality of the search space. Second, the
weak relationships among variables tend to be nonlinear, and may hold only in limited areas of the search
space. Third, in financial practice, where analysts conduct extensive manual analysis of historically
well-performing indicators, a key is to find the hidden interactions among variables that perform well in
combination. Unfortunately, these are exactly the patterns that the greedy search biases incorporated by
many standard rule algorithms will miss. In this paper, we describe and evaluate several variations of a new
genetic learning algorithm (GLOWER) on a variety of data sets. The design of GLOWER has been motivated
by financial prediction problems, but incorporates successful ideas from tree induction and rule learning.
We examine the performance of several GLOWER variants on two UCI data sets as well as on a standard
financial prediction problem (S&P500 stock returns), using the results to identify and use one of the better
variants for further comparisons. We introduce a new (to KDD) financial prediction problem (predicting
positive and negative earnings surprises), and experiment with GLOWER, contrasting it with tree- and rule-induction
approaches. Our results are encouraging, showing that GLOWER has the ability to uncover
effective patterns for difficult problems that have weak structure and significant nonlinearities.
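A hedged sketch of the kind of non-greedy search GLOWER performs: a tiny genetic algorithm evolves conjunctive threshold rules and scores them by the entropy reduction (information gain) of the split they induce. All names, operators, and settings below are illustrative assumptions, not the published GLOWER system.

```python
import numpy as np

def entropy(y):
    p = np.bincount(y, minlength=2) / max(len(y), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def rule_fitness(rule, X, y):
    """Fitness = entropy reduction achieved by splitting on the conjunctive rule."""
    mask = np.all(X >= rule, axis=1)          # rule: per-feature lower bounds (a conjunction)
    if mask.sum() == 0 or mask.sum() == len(y):
        return 0.0
    w = mask.mean()
    return entropy(y) - (w * entropy(y[mask]) + (1 - w) * entropy(y[~mask]))

def evolve_rule(X, y, pop_size=40, generations=50, seed=0):
    """Tiny GA: tournament selection plus Gaussian mutation over threshold vectors."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    pop = rng.uniform(lo, hi, size=(pop_size, X.shape[1]))
    for _ in range(generations):
        fit = np.array([rule_fitness(r, X, y) for r in pop])
        parents = pop[[max(rng.integers(pop_size, size=3), key=lambda i: fit[i])
                       for _ in range(pop_size)]]
        pop = parents + rng.normal(scale=0.1 * (hi - lo + 1e-9), size=parents.shape)
    fit = np.array([rule_fitness(r, X, y) for r in pop])
    return pop[fit.argmax()], fit.max()

# Illustrative run on synthetic data (all settings are assumptions):
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.3)).astype(int)
rule, gain = evolve_rule(X, y)
print(rule, round(gain, 3))
```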
- …