A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes
In many classification models, data is discretized to better estimate its
distribution. Existing discretization methods often aim to maximize the
discriminant power of the discretized data, while overlooking the fact that the
primary goal of discretization in classification is to improve
generalization performance. As a result, the data tend to be over-split into
many small bins since the data without discretization retain the maximal
discriminant information. Thus, we propose a Max-Dependency-Min-Divergence
(MDmD) criterion that maximizes both the discriminant information and
generalization ability of the discretized data. More specifically, the
Max-Dependency criterion maximizes the statistical dependency between the
discretized data and the classification variable while the Min-Divergence
criterion explicitly minimizes the JS-divergence between the training data and
the validation data for a given discretization scheme. The proposed MDmD
criterion is technically appealing, but it is difficult to reliably estimate
the high-order joint distributions of attributes and the classification
variable. We hence further propose a more practical solution,
Max-Relevance-Min-Divergence (MRmD) discretization scheme, where each attribute
is discretized separately, by simultaneously maximizing the discriminant
information and the generalization ability of the discretized data. The
proposed MRmD is compared with the state-of-the-art discretization algorithms
under the naive Bayes classification framework on 45 machine-learning benchmark
datasets. It significantly outperforms all the compared methods on most of the
datasets.
Comment: Under major revision of Pattern Recognition
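The abstract gives no formulas, but the per-attribute MRmD trade-off can be sketched roughly as follows, assuming mutual information as the relevance term and the Jensen-Shannon divergence between the training and validation bin distributions as the penalty. The function names, the weighting `lam`, and the candidate-cut interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def mutual_information(x_bins, y):
    """Empirical mutual information between a discretized attribute and the class."""
    joint = np.zeros((x_bins.max() + 1, y.max() + 1))
    for xb, yb in zip(x_bins, y):
        joint[xb, yb] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mrmd_score(cuts, x_train, y_train, x_val, lam=1.0):
    """MRmD-style score for one attribute: relevance minus a divergence penalty.

    `cuts` is a candidate list of cut points; `lam` is a hypothetical
    trade-off weight not specified in the abstract.
    """
    bins_train = np.digitize(x_train, cuts)
    bins_val = np.digitize(x_val, cuts)
    n_bins = len(cuts) + 1
    p = np.bincount(bins_train, minlength=n_bins) / len(bins_train)
    q = np.bincount(bins_val, minlength=n_bins) / len(bins_val)
    return mutual_information(bins_train, y_train) - lam * js_divergence(p, q)


# toy check: a cut at 5 separates the two classes perfectly
x = np.array([0.0, 1, 2, 3, 10, 11, 12, 13])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
score_good = mrmd_score([5.0], x, y, x)   # discriminative cut
score_bad = mrmd_score([100.0], x, y, x)  # single bin, no information
```

A greedy discretizer could add the cut that most increases this score and stop when no candidate improves it, which matches the paper's stated aim of avoiding over-splitting.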
Assessing similarity of feature selection techniques in high-dimensional domains
Recent research efforts attempt to combine multiple feature selection techniques instead of using a single one. However, this combination is often made on an "ad hoc" basis, depending on the specific problem at hand, without considering the degree of diversity/similarity of the involved methods. Moreover, though it is recognized that different techniques may return quite dissimilar outputs, especially in high-dimensional/small-sample-size domains, few direct comparisons exist that quantify these differences and their implications on classification performance. This paper aims to provide a contribution in this direction by proposing a general methodology for assessing the similarity between the outputs of different feature selection methods in high-dimensional classification problems. Using the genomics domain as a benchmark, an empirical study has been conducted to compare some of the most popular feature selection methods, and useful insights have been obtained about their patterns of agreement.
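The paper does not specify its similarity measure here, but two common choices for comparing selected-feature subsets can be sketched as follows; both functions are illustrative, with Kuncheva's consistency index correcting the raw overlap for the agreement expected by chance.

```python
def jaccard_similarity(sel_a, sel_b):
    """Jaccard index between two selected-feature sets (1.0 = identical)."""
    a, b = set(sel_a), set(sel_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def kuncheva_index(sel_a, sel_b, n_features):
    """Kuncheva's consistency index for two equal-size subsets of k features
    drawn from n_features, corrected for chance overlap (1.0 = identical)."""
    k = len(sel_a)
    r = len(set(sel_a) & set(sel_b))
    return (r * n_features - k * k) / (k * (n_features - k))


# two rankers agreeing on 2 of their top-3 features out of 10
j = jaccard_similarity([1, 2, 3], [2, 3, 4])       # 0.5
c = kuncheva_index([1, 2, 3], [2, 3, 4], 10)       # 11/21
```

The chance correction matters in high-dimensional/small-sample settings: with thousands of genes, even a small raw overlap between two top-k lists can be far above what random selection would produce.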
Discretization of Continuous Attributes
7 pages
In the data mining field, many learning methods - like association rules, Bayesian networks, and induction rules (Grzymala-Busse & Stefanowski, 2001) - can handle only discrete attributes. Therefore, before the machine learning process, it is necessary to re-encode each continuous attribute as a discrete attribute constituted by a set of intervals; for example, the age attribute can be transformed into two discrete values representing two intervals: less than 18 (a minor) and 18 and more (of age). This process, known as discretization, is an essential task of data preprocessing, not only because some learning methods do not handle continuous attributes, but also for other important reasons: the data transformed into a set of intervals are more cognitively relevant for human interpretation (Liu, Hussain, Tan & Dash, 2002); the computation process goes faster with a reduced level of data, particularly when some attributes are suppressed from the representation space of the learning problem if it is impossible to find a relevant cut (Mittal & Cheong, 2002); and the discretization can capture non-linear relations - e.g., infants and the elderly are more sensitive to illness.
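The age example above can be sketched as a minimal re-encoding step; the function name and the label interface are illustrative, not from any of the cited works.

```python
import bisect


def discretize(values, cut_points, labels):
    """Re-encode a continuous attribute as interval labels.

    cut_points must be sorted; a value equal to a cut falls in the
    upper interval (18 -> "of age"), matching the example in the text.
    """
    return [labels[bisect.bisect_right(cut_points, v)] for v in values]


ages = [15, 18, 42]
encoded = discretize(ages, [18], ["minor", "of age"])
```

With `bisect_right`, the boundary value 18 lands in the "18 and more" interval, as the abstract's example requires; `bisect_left` would put it in "less than 18".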
Multivariate discretization of continuous valued attributes.
The area of knowledge discovery and data mining is growing rapidly. Feature discretization is a crucial issue in Knowledge Discovery in Databases (KDD), or data mining, because most data sets used in real-world applications have features with continuous values. Discretization is performed as a preprocessing step of data mining to make data mining techniques useful for these data sets. This thesis addresses the discretization issue by proposing a multivariate discretization (MVD) algorithm. It begins with a number of common discretization algorithms, such as equal-width discretization, equal-frequency discretization, naïve discretization, entropy-based discretization, chi-square discretization, and orthogonal hyperplanes. It then compares the results achieved by the multivariate discretization (MVD) algorithm with the accuracy results of the other algorithms. This thesis is divided into six chapters, covering a few common discretization algorithms, testing these algorithms on real-world datasets varying in size and complexity, and showing how data visualization techniques can be effective in determining the degree of complexity of a given data set. We have examined the multivariate discretization (MVD) algorithm on the same data sets. After that, we have classified the discrete data using artificial neural networks: a single-layer perceptron and a multilayer perceptron with the back-propagation algorithm. We have trained the classifier using the training data set and tested its accuracy using the testing data set. Our experiments lead to better accuracy results with some data sets and lower accuracy results with others, subject to the degree of data complexity. We have then compared the accuracy results of the multivariate discretization (MVD) algorithm with the results achieved by the other discretization algorithms. We have found that the multivariate discretization (MVD) algorithm produces good accuracy results compared with the other discretization algorithms.
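The two simplest baselines named above, equal-width and equal-frequency discretization, can be sketched as follows; the function names and the cut-point convention are illustrative, not the thesis's code.

```python
def equal_width_cuts(x, k):
    """Cut points for k bins of equal range over the observed values."""
    lo, hi = min(x), max(x)
    return [lo + (hi - lo) * i / k for i in range(1, k)]


def equal_frequency_cuts(x, k):
    """Cut points for k bins each holding roughly the same number of points."""
    xs = sorted(x)
    n = len(xs)
    return [xs[(n * i) // k] for i in range(1, k)]


# a skewed attribute: one outlier dominates the equal-width split
x = [1, 2, 3, 4, 100]
w = equal_width_cuts(x, 2)       # [50.5]  (midpoint of the range)
f = equal_frequency_cuts(x, 2)   # [3]     (median-ish split)
```

The example illustrates why the two baselines can behave very differently on skewed data: equal-width leaves four of the five points in one bin, while equal-frequency balances the bins.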
A Fully Nonparametric Modelling Approach to Binary Regression
We propose a general nonparametric Bayesian framework for binary regression,
which is built from a model for the joint response-covariate distribution. The
observed binary responses are assumed to arise from underlying continuous
random variables through discretization, and we model the joint distribution of
these latent responses and the covariates using a Dirichlet process mixture of
multivariate normals. We show that the kernel of the induced mixture model for
the observed data is identifiable upon a restriction on the latent variables.
To allow for appropriate dependence structure while facilitating
identifiability, we use a square-root-free Cholesky decomposition of the
covariance matrix in the normal mixture kernel. In addition to allowing for the
necessary restriction, this modeling strategy provides substantial
simplifications in implementation of Markov chain Monte Carlo posterior
simulation. We present two data examples taken from areas for which the
methodology is especially well suited. In particular, the first example
involves estimation of relationships between environmental variables, and the
second develops inference for natural selection surfaces in evolutionary
biology. Finally, we discuss extensions to regression settings with
multivariate ordinal responses.
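A generic sketch of the latent-response construction described above, in a hedged notation (the zero threshold and the symbols are illustrative conventions, not necessarily the paper's):

```latex
y_i = \mathbf{1}\{z_i > 0\}, \qquad
(z_i, \mathbf{x}_i) \mid G \,\sim\, \int \mathrm{N}(\cdot \mid \boldsymbol{\mu}, \Sigma)\, \mathrm{d}G(\boldsymbol{\mu}, \Sigma), \qquad
G \sim \mathrm{DP}(\alpha, G_0),
```

where the observed binary response $y_i$ arises by discretizing the latent continuous $z_i$, and the Dirichlet process mixture of multivariate normals models the joint distribution of $(z_i, \mathbf{x}_i)$; the square-root-free Cholesky decomposition mentioned in the abstract is applied to $\Sigma$ to impose the identifiability restriction on the latent scale.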
Investigating hybrids of evolution and learning for real-parameter optimization
In recent years, increasingly advanced techniques have been developed in the field
of hybridizing evolution and learning, which means that more applications
can benefit from this progress. One example of these advanced techniques is the
Learnable Evolution Model (LEM), which adopts learning as a guide for the general evolutionary
search. Despite this trend and the progress in LEM, there are still many ideas and
approaches that deserve further investigation and testing. To this end, this thesis has
developed a number of new algorithms attempting to combine more learning algorithms
with evolution in different ways. With these developments, we expect to understand the
effects of and relations between evolution and learning, and to achieve better performance
in solving complex problems.
The machine learning algorithms combined into the standard Genetic Algorithm (GA)
are the supervised learning method k-nearest-neighbors (KNN), the Entropy-Based Discretization
(ED) method, and the decision tree learning algorithm ID3. We test these algorithms
on various real-parameter function optimization problems, especially the functions
in the special session on CEC 2005 real-parameter function optimization. Additionally, a
medical cancer chemotherapy treatment problem is solved in this thesis by some of our
hybrid algorithms.
The performances of these algorithms are compared with standard genetic algorithms
and other well-known contemporary evolution and learning hybrid algorithms, among
them the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and variants of
the Estimation of Distribution Algorithms (EDA).
Some important results have been derived from our experiments on these algorithms.
Among them, we found that even very simple learning methods, hybridized
properly with the evolutionary procedure, can provide significant performance improvements;
and when more complex learning algorithms are incorporated into evolution, the resulting
algorithms are very promising and compete very well against state-of-the-art hybrid
algorithms, both on well-defined real-parameter function optimization problems and on a
practical, evaluation-expensive problem.
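The evolution-and-learning loop described above can be sketched on a toy minimization problem; the per-dimension-bounds "learner" here is a deliberately simplified stand-in for the actual learners used in the thesis (KNN, entropy-based discretization, ID3), and all names and parameters are illustrative.

```python
import random


def lem_sketch(fitness, dim=2, pop_size=20, gens=30, seed=0):
    """Toy LEM-style loop for minimization: learn a description of the
    high-performing group and sample new candidates inside it."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness)
        elite = pop[: pop_size // 4]  # high-performance group (lowest fitness)
        # "Learning" step: describe the elite region by per-dimension bounds,
        # a crude substitute for an AQ/decision-tree rule learner.
        bounds = [(min(ind[d] for ind in elite), max(ind[d] for ind in elite))
                  for d in range(dim)]
        # "Instantiation" step: resample the rest of the population inside
        # the learned region, slightly widened to keep some exploration.
        pop = elite + [[rng.uniform(lo - 0.1, hi + 0.1) for lo, hi in bounds]
                       for _ in range(pop_size - len(elite))]
    return min(pop, key=fitness)


# sphere function: optimum at the origin
best = lem_sketch(lambda v: sum(x * x for x in v))
```

Because the elite is retained each generation, the best fitness is monotone non-increasing; the learned bounds contract toward the optimum, which is the core idea LEM shares with the thesis's more sophisticated hybrids.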
Identity Disclosure Protection: A Data Reconstruction Approach for Preserving Privacy in Data Mining