
    Merging of Numerical Intervals in Entropy-Based Discretization

    As previous research indicates, a multiple-scanning methodology for discretization of numerical datasets, based on entropy, is very competitive. Discretization is the process of converting the numerical values of data records into discrete values associated with intervals defined over the domains of the corresponding attributes. In multiple-scanning discretization, the last step is the merging of neighboring intervals in the discretized datasets, a kind of postprocessing. Our objective is to check how the error rate, measured by ten-fold cross-validation within the C4.5 system, is affected by such merging. We conducted experiments on 17 numerical datasets, using the same setup of multiple scanning, with three different options for merging: no merging at all, merging based on the smallest entropy, and merging based on the largest entropy. Based on the Friedman rank sum test (5% significance level), we concluded that the differences between all three approaches are statistically insignificant; there is no universally best approach. We then repeated all experiments 30 times, recording averages and standard deviations. A test of the difference between averages shows that, when comparing no merging with merging based on the smallest entropy, there are statistically highly significant differences (1% significance level). In some cases the smaller error rate is associated with no merging; in other cases, with merging based on the smallest entropy. A comparison of no merging with merging based on the largest entropy showed similar results. Our final conclusion is that there are highly significant differences between no merging and merging, depending on the dataset, so the best approach should be chosen by trying all three.
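    The abstract does not spell out the merging criterion in detail; as a rough, purely illustrative sketch (not the authors' implementation), the Python fragment below merges the pair of neighboring intervals whose combined class entropy is smallest (or largest), assuming each interval is represented by the class labels of the records it covers. The names `entropy` and `merge_once` are invented for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def merge_once(intervals, smallest=True):
    """Merge the pair of adjacent intervals whose combined class entropy
    is smallest (or largest, if smallest=False).

    `intervals` is a list of lists of class labels, one list per interval,
    ordered by the interval boundaries."""
    best_i, best_h = None, None
    for i in range(len(intervals) - 1):
        h = entropy(intervals[i] + intervals[i + 1])
        if best_h is None or (h < best_h if smallest else h > best_h):
            best_i, best_h = i, h
    return (intervals[:best_i]
            + [intervals[best_i] + intervals[best_i + 1]]
            + intervals[best_i + 2:])

# Example: three intervals, each holding the class labels of its records
intervals = [["yes", "yes"], ["yes", "no"], ["no", "no"]]
print(merge_once(intervals, smallest=True))
```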

    A Comparison of Four Approaches to Discretization Based on Entropy †

    We compare four discretization methods, all based on entropy: the original C4.5 approach to discretization, two globalized methods known as equal interval width and equal frequency per interval, and a relatively new method called multiple scanning, used together with the C4.5 decision tree generation system. The main objective of our research is to compare the quality of these four methods using two criteria: the error rate evaluated by ten-fold cross-validation and the size of the decision tree generated by C4.5. Our results show that multiple scanning is the best discretization method in terms of the error rate, and that decision trees generated from datasets discretized by multiple scanning are simpler than decision trees generated directly by C4.5 or from datasets discretized by either of the two globalized methods.
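    As a reminder of what the two globalized baselines compute at the level of a single attribute, here is a minimal sketch of equal interval width and equal frequency per interval cut points; the entropy-driven choice of how many intervals each attribute receives, which is what makes these methods "globalized" in the paper, is not reproduced here.

```python
def equal_width_cutpoints(values, k):
    """Cut points for k equal-width intervals over the value range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]

def equal_frequency_cutpoints(values, k):
    """Cut points placed so that each of the k intervals holds roughly
    the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

values = [4.3, 4.9, 5.0, 5.6, 6.1, 6.7, 7.0, 7.9]
print(equal_width_cutpoints(values, 3))
print(equal_frequency_cutpoints(values, 3))
```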

    Dominant Attribute and Multiple Scanning Approaches for Discretization of Numerical Attributes

    Rapid development of high-throughput technologies and database management systems has made it possible to produce and store large amounts of data. However, making sense of big data and discovering knowledge from it is a considerable challenge. Generally, data mining techniques search for information in datasets and express the gained knowledge in the form of trends, regularities, patterns, or rules. Rules are frequently identified automatically by a technique called rule induction, one of the most important techniques in data mining and machine learning, which was developed primarily to handle symbolic data. However, real-life data often contain numerical attributes, so in order to fully utilize the power of rule induction, an essential preprocessing step called discretization is employed to convert numeric data into symbolic data. Here we present two entropy-based discretization techniques, known as the dominant attribute approach and the multiple scanning approach. Both approaches were implemented as explicit algorithms in the Java programming language, and experiments were conducted by applying each algorithm separately to seventeen well-known numerical datasets. The resulting discretized datasets were used for rule induction with the LEM2 (Learning from Examples Module 2) algorithm. For the multiple scanning approach, experiments on each dataset were repeated with an increasing number of scans until the interval counts stabilized. Preliminary results indicate that the multiple scanning approach performed better than the dominant attribute approach, producing comparatively smaller and simpler rule sets.
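    Both the dominant attribute and the multiple scanning approach rely on repeatedly choosing entropy-minimizing cut points. The sketch below, an illustration rather than the authors' Java implementation, shows the core step for one attribute: picking the cut point that minimizes the weighted (conditional) entropy of the resulting split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cutpoint(values, labels):
    """Pick the cut point (midpoint between consecutive distinct values)
    that minimizes the weighted entropy of the two resulting blocks."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_h = None, None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point between equal values
        cut = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [l for v, l in pairs if v < cut]
        right = [l for v, l in pairs if v >= cut]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best_h is None or h < best_h:
            best_cut, best_h = cut, h
    return best_cut

values = [1.0, 1.2, 3.5, 3.7, 6.0, 6.2]
labels = ["a", "a", "b", "b", "a", "a"]
print(best_cutpoint(values, labels))
```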

    Real-time market microstructure analysis: online Transaction Cost Analysis

    Motivated by the practical challenge of monitoring the performance of a large number of algorithmic trading orders, this paper provides a methodology that leads to automatic discovery of the causes behind poor trading performance. It also gives theoretical foundations to a generic framework for real-time trading analysis. The academic literature provides different ways to formalize these algorithms and shows how optimal they can be from a mean-variance, stochastic control, impulse control, or statistical learning viewpoint. This paper is agnostic about the way the algorithm has been built and provides a theoretical formalism to identify, in real time, the market conditions that influenced its efficiency or inefficiency. For a given set of characteristics describing the market context, selected by a practitioner, we first show how a set of additional derived explanatory factors, called anomaly detectors, can be created for each market order. We then present an online methodology to quantify how this extended set of factors, at any given time, predicts which of the orders are underperforming, while calculating the predictive power of this explanatory factor set. Armed with this information, which we call influence analysis, we intend to empower the order-monitoring user to take appropriate action on any affected orders by re-calibrating the trading algorithms working the order with new parameters, pausing their execution, or taking over more direct trading control. We also intend for this method to be used in post-trade analysis of algorithms, so that their trading actions can be adjusted automatically.
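    The abstract does not specify how the anomaly detectors are constructed; as a purely hypothetical stand-in, the sketch below flags unusual values of a single market feature (for example, the bid-ask spread) with a rolling z-score, the kind of derived explanatory factor such a framework could consume. The class name `RollingZScore` and the window length are assumptions, not taken from the paper.

```python
from collections import deque
import statistics

class RollingZScore:
    """Online anomaly score: z-score of the newest observation against a
    rolling window of past values of one market feature."""
    def __init__(self, window=100):
        self.window = deque(maxlen=window)

    def update(self, x):
        if len(self.window) >= 2:
            mean = statistics.fmean(self.window)
            sd = statistics.pstdev(self.window) or 1.0  # guard against zero spread
            z = (x - mean) / sd
        else:
            z = 0.0  # not enough history yet
        self.window.append(x)
        return z

detector = RollingZScore(window=50)
for spread in [1.0, 1.1, 0.9, 1.0, 3.5]:  # last tick is unusually wide
    print(detector.update(spread))
```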

    ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data

    Supervised discretization is one of the basic data preprocessing techniques used in data mining. CAIM (Class-Attribute Interdependence Maximization) is a discretization algorithm for data whose classes are known. However, newly arising challenges, such as the presence of unbalanced data sets, call for new algorithms capable of handling them in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the distribution of the data classes, which leads to better classification performance on balanced and, especially, unbalanced data. Third, its runtime is lower than CAIM's. The algorithm is parameter-free and self-adapts to the problem complexity and the class distribution of the data. ur-CAIM was compared with 9 well-known discretization methods on 28 balanced and 70 unbalanced data sets. The results were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data, and especially on unbalanced data, which is its most significant advantage.
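    For readers unfamiliar with the criterion that ur-CAIM builds on, the sketch below computes the classic CAIM score of a candidate discretization scheme from its class-attribute quanta matrix; ur-CAIM's class-distribution weighting and unbalanced-data handling are not reproduced here, and the function name `caim` is illustrative.

```python
from collections import Counter

def caim(intervals_labels):
    """Classic CAIM criterion for a discretization scheme.

    `intervals_labels` holds one list of class labels per interval (one
    column of the quanta matrix each).  CAIM = (1/n) * sum over intervals
    of max_r**2 / M_r, where max_r is the count of the dominant class in
    interval r and M_r the total count in interval r."""
    n = len(intervals_labels)
    total = 0.0
    for labels in intervals_labels:
        counts = Counter(labels)
        max_r = max(counts.values())
        m_r = sum(counts.values())
        total += max_r ** 2 / m_r
    return total / n

scheme = [["a", "a", "b"], ["b", "b"], ["a"]]
print(caim(scheme))  # higher is better under the CAIM heuristic
```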

    Reduced Data Sets and Entropy-Based Discretization

    Results of experiments on numerical data sets, discretized using global versions of two methods, Equal Frequency per Interval and Equal Interval Width, are presented. Globalization of both methods is based on entropy. For the discretized data sets, left and right reducts were computed. For each discretized data set, and for the two data sets based, respectively, on the left and right reducts, we applied ten-fold cross-validation using the C4.5 decision tree generation system. Our main objective was to compare the quality of all three types of data sets in terms of the error rate. Additionally, we compared the complexity of the generated decision trees. We show that reduction of data sets may only increase the error rate and that the decision trees generated from reduced data sets are not simpler than the decision trees generated from non-reduced data sets.
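    The left and right reducts used in the paper come from rough set theory, and their exact construction is not given in the abstract. As a loose illustration of the general idea of a reduct, the sketch below greedily drops attributes while preserving the consistency of the decision; it does not distinguish left from right reducts, and the attribute names are invented for the example.

```python
def consistent(rows, attrs):
    """True if rows that agree on `attrs` always share the same decision."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen and seen[key] != row["decision"]:
            return False
        seen[key] = row["decision"]
    return True

def greedy_reduct(rows, attrs):
    """Drop attributes one by one, keeping each drop only if the reduced
    attribute set still discerns all decisions."""
    reduct = list(attrs)
    for a in list(attrs):
        trial = [x for x in reduct if x != a]
        if trial and consistent(rows, trial):
            reduct = trial
    return reduct

rows = [
    {"temp": "low", "wind": "yes", "decision": "stay"},
    {"temp": "low", "wind": "no", "decision": "stay"},
    {"temp": "high", "wind": "no", "decision": "go"},
]
print(greedy_reduct(rows, ["temp", "wind"]))
```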