184 research outputs found
A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes
In many classification models, data is discretized to better estimate its
distribution. Existing discretization methods often aim to maximize the
discriminant power of discretized data, while overlooking the fact that the
primary target of data discretization in classification is to improve the
generalization performance. As a result, the data tend to be over-split into
many small bins since the data without discretization retain the maximal
discriminant information. Thus, we propose a Max-Dependency-Min-Divergence
(MDmD) criterion that maximizes both the discriminant information and
generalization ability of the discretized data. More specifically, the
Max-Dependency criterion maximizes the statistical dependency between the
discretized data and the classification variable while the Min-Divergence
criterion explicitly minimizes the JS-divergence between the training data and
the validation data for a given discretization scheme. The proposed MDmD
criterion is technically appealing, but it is difficult to reliably estimate
the high-order joint distributions of attributes and the classification
variable. We hence further propose a more practical solution,
Max-Relevance-Min-Divergence (MRmD) discretization scheme, where each attribute
is discretized separately, by simultaneously maximizing the discriminant
information and the generalization ability of the discretized data. The
proposed MRmD is compared with the state-of-the-art discretization algorithms
under the naive Bayes classification framework on 45 machine-learning benchmark
datasets. It significantly outperforms all the compared methods on most of the
datasets. Comment: Under major revision of Pattern Recognitio
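The Min-Divergence idea in the abstract above can be illustrated with a minimal sketch. It assumes a single attribute, ignores class labels, and compares only the bin histograms of a training and a validation sample; the function names, the sample values, and the cut points are hypothetical and are not taken from the paper.

```python
import math
from bisect import bisect_right
from collections import Counter


def bin_distribution(values, cut_points):
    """Empirical probability of each bin, given sorted cut points."""
    counts = Counter(bisect_right(cut_points, v) for v in values)
    n_bins = len(cut_points) + 1
    return [counts.get(b, 0) / len(values) for b in range(n_bins)]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical training and validation samples of one attribute.
train = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
valid = [0.15, 0.35, 0.55, 0.95]

coarse = [0.5]                 # two bins
fine = [0.2, 0.4, 0.6, 0.8]    # five bins

js_coarse = js_divergence(bin_distribution(train, coarse),
                          bin_distribution(valid, coarse))
js_fine = js_divergence(bin_distribution(train, fine),
                        bin_distribution(valid, fine))
print(js_coarse, js_fine)
```

In this toy data the coarse scheme gives identical training and validation histograms, while the finer scheme leaves one bin empty in the validation sample and so yields a positive divergence, which is the over-splitting effect the abstract describes.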
Multivariate discretization of continuous valued attributes.
The area of knowledge discovery and data mining is growing rapidly. Feature discretization is a crucial issue in Knowledge Discovery in Databases (KDD), or data mining, because most data sets used in real-world applications have features with continuous values. Discretization is performed as a preprocessing step to make data mining techniques usable on these data sets. This thesis addresses the discretization issue by proposing a multivariate discretization (MVD) algorithm. It begins with a number of common discretization algorithms, such as equal-width discretization, equal-frequency discretization, naïve discretization, entropy-based discretization, chi-square discretization, and orthogonal hyperplanes, and then compares the results achieved by the MVD algorithm with the accuracy results of these other algorithms. The thesis is divided into six chapters. It covers a few common discretization algorithms, tests them on real-world data sets that vary in size and complexity, and shows how data visualization techniques can be effective in determining the degree of complexity of a given data set. We examined the MVD algorithm on the same data sets. We then classified the discretized data using artificial neural networks, namely a single-layer perceptron and a multilayer perceptron with the back-propagation algorithm. We trained the classifier on the training data set and tested its accuracy on the testing data set. Our experiments led to better accuracy on some data sets and lower accuracy on others, depending on the degree of data complexity. We then compared the accuracy of the MVD algorithm with the results achieved by the other discretization algorithms and found that the MVD algorithm produces good accuracy results in comparison with the other discretization algorithm
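Two of the baseline methods named in the abstract above, equal-width and equal-frequency discretization, can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data; it is not the thesis's MVD algorithm, and the tie-handling and quantile conventions are assumptions.

```python
from bisect import bisect_right
from collections import Counter


def equal_width_cuts(values, k):
    """Cut points that split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_cuts(values, k):
    """Cut points chosen so each of the k bins holds roughly the same count."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]


def bin_counts(values, cuts):
    """Number of values per bin; a value equal to a cut goes to the upper bin."""
    counts = Counter(bisect_right(cuts, v) for v in values)
    return [counts.get(b, 0) for b in range(len(cuts) + 1)]


# Toy attribute with one outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_cuts = equal_width_cuts(values, 3)
freq_cuts = equal_frequency_cuts(values, 3)
print(width_cuts, bin_counts(values, width_cuts))
print(freq_cuts, bin_counts(values, freq_cuts))
```

With the outlier present, equal-width binning puts almost every value into the first bin, while equal-frequency binning keeps the bins balanced; this sensitivity is one reason the two methods can behave very differently on the same data set.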
A hybrid wind speed forecasting system based on a 'decomposition and ensemble' strategy and fuzzy time series
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. Accurate and stable wind speed forecasting is of critical importance in the wind power industry and has measurable influence on power-system management and the stability of market economics. However, most traditional wind speed forecasting models require a large amount of historical data and face restrictions due to assumptions, such as normality postulates. Additionally, any data volatility leads to increased forecasting instability. Therefore, this paper proposes a hybrid forecasting system that combines the 'decomposition and ensemble' strategy with a fuzzy time series forecasting algorithm and comprises two modules: data pre-processing and forecasting. Moreover, a statistical model, an artificial neural network, and a Support Vector Regression model are employed for comparison with the proposed hybrid system, which is proven to be very effective in forecasting wind speed data affected by noise and instability. The results of these comparisons demonstrate that the hybrid forecasting system can improve the forecasting accuracy and stability significantly, and supervised discretization methods outperform the unsupervised methods for fuzzy time series in most cases
Discovering correlated parameters in Semiconductor Manufacturing processes: a Data Mining approach
Data mining tools are nowadays becoming more and more popular in the semiconductor manufacturing industry, and especially in yield-oriented enhancement techniques. This is because conventional approaches fail to extract hidden relationships between numerous complex process control parameters. In order to highlight correlations between such parameters, we propose in this paper a complete knowledge discovery in databases (KDD) model. The mining heart of the model uses a new method derived from association rules programming, and is based on two concepts: decision correlation rules and contingency vectors. The first concept results from a cross-fertilization between correlation and decision rules. It enables relevant links to be highlighted between sets of values of a relation and the values of sets of targets belonging to the same relation. Decision correlation rules are built on the twofold basis of the chi-squared measure and of the support of the extracted values. Due to the very nature of the problem, levelwise algorithms only allow extraction of results with long execution times and huge memory occupation. To offset these two problems, we propose an algorithm based both on the lectic order and contingency vectors, an alternate representation of contingency tables. This algorithm is the basis of our KDD model software, called MineCor. An overall presentation of its other functions, of some significant experimental results, and of the associated performance is provided and discussed
An algorithm for discretization of real value attributes based on interval similarity
Extent: 8p. Discretization algorithms for real-valued attributes have important uses in many areas such as intelligence and machine learning. The algorithms related to the Chi2 algorithm (including the modified Chi2 algorithm and the extended Chi2 algorithm) are well-known discretization algorithms that exploit techniques from probability and statistics. In this paper these algorithms are analyzed and their drawbacks are pointed out. Based on the analysis, a new modified algorithm based on interval similarity is proposed. The new algorithm defines an interval similarity function, which is regarded as a new merging standard in the process of discretization. At the same time, two important parameters in the process of discretization (condition parameter α and tiny move parameter c) and the discrepancy extent of a number of adjacent two intervals are given in the form of functions. The related theoretical analysis and the experimental results show that the presented algorithm is effective. Li Zou, Deqin Yan, Hamid Reza Karimi, and Peng Sh
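For context on the Chi2 family mentioned in the abstract above, the following minimal sketch computes the chi-square statistic that such merging methods use to compare two adjacent intervals. It is not the interval similarity function proposed in the paper, which the abstract does not define; the function name is hypothetical, and skipping cells with zero expected count is a simplifying assumption.

```python
def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.

    counts_a and counts_b hold the per-class example counts in each interval.
    A low value suggests similar class distributions, so the intervals are
    candidates for merging.
    """
    rows = [counts_a, counts_b]
    total = sum(counts_a) + sum(counts_b)
    class_totals = [a + b for a, b in zip(counts_a, counts_b)]
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, class_total in zip(row, class_totals):
            expected = row_total * class_total / total
            if expected > 0:  # assumption: cells with zero expectation are skipped
                chi2 += (observed - expected) ** 2 / expected
    return chi2


# Identical class distributions: statistic is 0, a strong merge candidate.
print(chi2_adjacent([5, 5], [5, 5]))
# Perfectly separated classes: large statistic, the boundary should be kept.
print(chi2_adjacent([10, 0], [0, 10]))
```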
Analysis of Feature Rankings for Classification
Different ways of contrasting the rankings generated by feature selection algorithms are presented in this paper, showing several possible interpretations, depending on the approach taken in each study. We begin from the premise that no single ideal subset exists for all cases. The purpose of these kinds of algorithms is to reduce the data set to the first attributes of each ranking without losing predictive performance relative to the original data set. In this paper we propose a method, feature-ranking performance, to compare different feature-ranking methods, based on the Area Under the Feature Ranking Classification Performance Curve (AURC). The conclusions and trends drawn from this paper support the performance of learning tasks in which some of the ranking algorithms studied here operate
A discretization method based on maximizing the area under receiver operating characteristic curve
Many machine learning algorithms require the features to be categorical. Hence, they require all numeric-valued data to be discretized into intervals. In this paper, we present a new discretization method based on the area under the receiver operating characteristic (ROC) curve (AUC) measure. Maximum area under ROC curve-based discretization (MAD) is a global, static and supervised discretization method. MAD uses the sorted order of the continuous values of a feature and discretizes the feature in such a way that the AUC based on that feature is maximized. The proposed method is compared with alternative discretization methods such as ChiMerge, Entropy-Minimum Description Length Principle (MDLP), Fixed Frequency Discretization (FFD), and Proportional Discretization (PD). FFD and PD have been recently proposed and are designed for Naive Bayes learning. ChiMerge, like MAD, is a merging discretization method. Evaluations are performed in terms of M-Measure, an AUC-based metric for multi-class classification, and accuracy values obtained from Naive Bayes and Aggregating One-Dependence Estimators (AODE) algorithms by using real-world datasets. Empirical results show that MAD is a strong candidate to be a good alternative to other discretization methods
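The quantity that the abstract above says MAD maximizes, the AUC obtained from a single feature, can be sketched for the two-class case as below. This is only an illustration of the measure, computed by pairwise comparison with ties counted as one half; it is not the MAD discretization procedure itself, and the function name, data, and cut point are hypothetical.

```python
def feature_auc(values, labels):
    """AUC of one feature used directly as a score for the positive class (label 1).

    Equals the fraction of (positive, negative) pairs in which the positive
    example has the larger value, with ties counted as 0.5.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy feature with binary labels.
values = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

# The same feature discretized at a hypothetical cut point of 0.5.
discretized = [0 if v < 0.5 else 1 for v in values]

auc_raw = feature_auc(values, labels)
auc_binned = feature_auc(discretized, labels)
print(auc_raw, auc_binned)
```

In this toy example the discretized feature keeps the same AUC as the raw one, which shows that a well-placed cut point need not lose ranking information.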