
    A Comparison of the Quality of Rule Induction from Inconsistent Data Sets and Incomplete Data Sets

    In data mining, decision rules induced from known examples are used to classify unseen cases. There are various rule induction algorithms, such as LEM1 (Learning from Examples Module version 1), LEM2 (Learning from Examples Module version 2), and MLEM2 (Modified Learning from Examples Module version 2). In the real world, many data sets are imperfect: either inconsistent or incomplete. The idea of lower and upper approximations or, more generally, probabilistic approximations, provides an effective way to induce rules from both kinds of imperfect data sets, but the accuracy of the resulting rule sets is expected to be lower. The objective of this project is to investigate which kind of imperfect data set (inconsistent or incomplete) is worse in terms of the quality of rule induction. Experiments were conducted on eight inconsistent data sets and eight incomplete data sets with lost values. We implemented the MLEM2 algorithm to induce certain and possible rules from the inconsistent data sets, and the local probabilistic version of the MLEM2 algorithm to induce certain and possible rules from the incomplete data sets. A program called Rule Checker was also developed to classify unseen cases with the induced rules and measure the classification error rate. Ten-fold cross-validation was carried out, and the average error rate was used as the criterion for comparison. Mann-Whitney nonparametric tests were performed to compare incompleteness with inconsistency, separately for certain and possible rules. The results show no significant difference between inconsistent and incomplete data sets in terms of the quality of rule induction.
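    In rough set terms, certain rules come from the lower approximation of a concept and possible rules from the upper approximation. Below is a minimal Python sketch of these two approximations on a toy inconsistent decision table; the data layout and names are illustrative assumptions, and the MLEM2 rule-growing step itself is not shown.

```python
# A minimal sketch of lower/upper approximations, from which certain
# and possible rules are induced. Illustrative only, not MLEM2 itself.
from collections import defaultdict

def approximations(rows, concept):
    """rows: list of (attribute_tuple, decision); concept: a decision value.
    Returns (lower, upper) as sets of row indices."""
    # Group cases into indiscernibility classes: identical attribute values.
    blocks = defaultdict(set)
    for i, (attrs, _) in enumerate(rows):
        blocks[attrs].add(i)
    X = {i for i, (_, d) in enumerate(rows) if d == concept}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= X:           # block entirely inside the concept
            lower |= block       # -> supports certain rules
        if block & X:            # block overlaps the concept
            upper |= block       # -> supports possible rules
    return lower, upper

# Two cases with identical attributes but different decisions make the
# table inconsistent: their shared block enters only the upper approximation.
rows = [(("high", "yes"), "flu"), (("high", "yes"), "cold"),
        (("low", "no"), "cold")]
print(approximations(rows, "flu"))   # (set(), {0, 1})
```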

    Distribution of Mutual Information from Complete and Incomplete Data

    Mutual information is widely used, in a descriptive way, to measure the stochastic dependence of categorical random variables. In order to address questions such as the reliability of the descriptive value, one must consider sample-to-population inferential approaches. This paper deals with the posterior distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean, and analytical approximations for the variance, skewness, and kurtosis, are derived. These approximations have a guaranteed accuracy level of the order O(1/n^3), where n is the sample size. Leading-order approximations for the mean and the variance are derived in the case of incomplete samples. The derived analytical expressions allow the distribution of mutual information to be approximated reliably and quickly; in fact, they can be computed with the same order of complexity needed for descriptive mutual information. This makes the distribution of mutual information a concrete alternative to descriptive mutual information in many applications that would benefit from moving to the inductive side. Some of these prospective applications are discussed, and one of them, namely feature selection, is shown to perform significantly better when inductive mutual information is used.
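    For reference, the exact posterior mean under the Dirichlet prior is commonly written in digamma form. The Python sketch below computes it for a toy count matrix; the counts and the symmetric prior strength are illustrative assumptions, and only the mean (not the variance, skewness, or kurtosis approximations) is shown.

```python
# A hedged sketch of the exact posterior mean of mutual information
# under a Dirichlet prior, in its usual digamma form. Counts and prior
# strength below are illustrative.
import numpy as np
from scipy.special import digamma

def mi_posterior_mean(counts, prior=1.0):
    """counts: 2-D array of co-occurrence counts.
    prior: symmetric Dirichlet pseudo-count added to every cell."""
    n_ij = np.asarray(counts, dtype=float) + prior   # posterior parameters
    n_i = n_ij.sum(axis=1, keepdims=True)            # row totals
    n_j = n_ij.sum(axis=0, keepdims=True)            # column totals
    n = n_ij.sum()
    # E[I] = (1/n) * sum_ij n_ij [psi(n_ij+1) - psi(n_i+1) - psi(n_j+1) + psi(n+1)]
    return float((n_ij / n * (digamma(n_ij + 1) - digamma(n_i + 1)
                              - digamma(n_j + 1) + digamma(n + 1))).sum())

counts = [[30, 5], [4, 21]]
print(mi_posterior_mean(counts))  # posterior mean of MI, in nats
```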

    HANDLING MISSING ATTRIBUTE VALUES IN DECISION TABLES USING VALUED TOLERANCE APPROACH

    Rule induction is one of the key areas in data mining, as it is applied to a large amount of real-life data. However, in such real-life data, the information is often incompletely specified. To induce rules from these incomplete data, more powerful algorithms are necessary. This research work focuses on a probabilistic approach based on the valued tolerance relation. The thesis is divided into two parts. The first part describes the implementation of the valued tolerance relation; the induced rules are then evaluated based on the error rate due to incorrectly classified and unclassified examples. The second part compares the rules induced by a previously implemented MLEM2 algorithm with the rules induced by the valued-tolerance-based approach implemented as part of this research. Through this comparison, the error rates of the MLEM2 algorithm and the valued-tolerance-based approach are measured and the results documented.
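    The valued tolerance relation (in the style of Stefanowski and Tsoukiàs) relates two cases with the probability that they agree on every attribute, treating each missing value as uniformly distributed over the attribute's domain. A minimal Python sketch, with an illustrative table and domain sizes:

```python
# A minimal sketch of a valued tolerance relation: the probability that
# two cases agree on every attribute, with '?' a missing value assumed
# uniform over the attribute's domain. Data and domains are illustrative.

def valued_tolerance(x, y, domains):
    """x, y: attribute-value tuples ('?' marks a missing value).
    domains: per-attribute domain sizes |V_a|."""
    r = 1.0
    for xa, ya, size in zip(x, y, domains):
        if xa == '?' or ya == '?':
            r *= 1.0 / size          # agreement probability under uniformity
        elif xa != ya:
            return 0.0               # a known disagreement kills tolerance
    return r

domains = (2, 3)                      # |V_a| for each attribute
print(valued_tolerance(('high', '?'), ('high', 'yes'), domains))  # 1/3
print(valued_tolerance(('?', 'no'), ('low', 'no'), domains))      # 1/2
```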

    A comparison of sixteen classification strategies of rule induction from incomplete data using the MLEM2 algorithm

    In data mining, rule induction is a process of extracting formal rules from decision tables, where the latter are tabulated observations that typically consist of a few attributes (independent variables) and a decision (a dependent variable). Each tuple in the table is considered a case, and a table may contain any number of cases, one per observation. The efficiency of rule induction depends on how many cases are successfully characterized by the generated set of rules, i.e., the rule set. There are different rule induction algorithms, such as LEM1, LEM2, and MLEM2. In the real world, data sets are often imperfect: inconsistent or incomplete. MLEM2 is an efficient algorithm for dealing with such data, but the quality of rule induction largely depends on the chosen classification strategy. We compared sixteen classification strategies of rule induction using MLEM2 on incomplete data. For this, we implemented MLEM2 to induce rule sets based on the selected type of approximation (singleton, subset, or concept) and the value of the parameter alpha used to calculate probabilistic approximations. A program called rule checker is used to calculate the error rate for the specified classification strategy. To reduce variance, we used ten-fold cross-validation to measure the error rate for each classification strategy. Error rates for the above strategies were calculated on different data sets, compared, and are presented.
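    The parameter alpha steers the probabilistic approximation between its upper-like (small alpha) and lower-like (alpha = 1) extremes. A hedged Python sketch with illustrative characteristic sets follows; how those sets are derived from the incomplete table under the singleton, subset, or concept variant is not shown.

```python
# A hedged sketch of a probabilistic approximation steered by alpha.
# K maps each case to its characteristic set (the cases it cannot be
# distinguished from). The sets below are illustrative assumptions.

def probabilistic_approximation(K, X, alpha):
    """K: dict case -> characteristic set; X: concept (set of cases)."""
    result = set()
    for x in X:
        block = K[x]
        if len(X & block) / len(block) >= alpha:   # Pr(X | K(x))
            result |= block
    return result

K = {1: {1, 2}, 2: {1, 2}, 3: {3}, 4: {2, 4}}
X = {1, 4}                                         # the concept
print(probabilistic_approximation(K, X, alpha=1.0))  # set(): no pure block
print(probabilistic_approximation(K, X, alpha=0.5))  # {1, 2, 4}
```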

    Automatic Bayesian Density Analysis

    Making sense of a dataset in an automatic and unsupervised fashion is a challenging problem in statistics and AI. Classical approaches for exploratory data analysis are usually not flexible enough to deal with the uncertainty inherent to real-world data: they are often restricted to fixed latent interaction models and homogeneous likelihoods; they are sensitive to missing, corrupt, and anomalous data; moreover, their expressiveness generally comes at the price of intractable inference. As a result, supervision from statisticians is usually needed to find the right model for the data. However, since domain experts are not necessarily also experts in statistics, we propose Automatic Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible at large. Specifically, ABDA allows for automatic and efficient missing value estimation, statistical data type and likelihood discovery, anomaly detection, and dependency structure mining, on top of providing accurate density estimation. Extensive empirical evidence shows that ABDA is a suitable tool for automatic exploratory analysis of mixed continuous and discrete tabular data. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).

    Dominance-based Rough Set Approach, basic ideas and main trends

    The Dominance-based Rough Set Approach (DRSA) has been proposed as a machine learning and knowledge discovery methodology for handling Multiple Criteria Decision Aiding (MCDA). Due to its capacity to ask the decision maker (DM) for simple preference information and to supply easily understandable and explainable recommendations, DRSA has gained much interest over the years, and it is now one of the most appreciated MCDA approaches. In fact, it has also been applied beyond the MCDA domain, as a general knowledge discovery and data mining methodology for the analysis of monotonic (and also non-monotonic) data. In this contribution, we recall the basic principles and the main concepts of DRSA, with a general overview of its developments and software. We also present a historical reconstruction of the genesis of the methodology, with a specific focus on the contribution of Roman Słowiński. (This research was partially supported by TAILOR, a project funded by the European Union Horizon 2020 research and innovation programme under GA No 952215; this submission is a preprint of a book chapter accepted by Springer, with very few minor differences of a purely technical nature.)
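    At the core of DRSA is the dominance relation: case x dominates case y when x is at least as good on every criterion, and the lower approximation of an upward union of classes keeps only the cases whose whole dominating set stays inside that union. A minimal Python sketch with illustrative gain-type criteria:

```python
# A minimal sketch of DRSA's dominance relation and the lower
# approximation of an upward union of decision classes. All criteria
# are assumed gain-type (higher is better); the data is illustrative.

def dominates(x, y):
    """x, y: tuples of criterion evaluations."""
    return all(xi >= yi for xi, yi in zip(x, y))

def lower_upward_union(cases, classes, t):
    """Cases certainly in 'class >= t': their whole dominating set
    lies inside the upward union."""
    union = {i for i, c in enumerate(classes) if c >= t}
    return {i for i in union
            if {j for j in range(len(cases))
                if dominates(cases[j], cases[i])} <= union}

cases = [(3, 2), (3, 3), (1, 1)]
classes = [2, 1, 1]                    # ordered decision classes
# Case 1 dominates case 0 yet has a worse class, so assigning case 0
# to "class >= 2" is inconsistent: the lower approximation is empty.
print(lower_upward_union(cases, classes, t=2))   # set()
```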

    Uncertainty Management of Intelligent Feature Selection in Wireless Sensor Networks

    Wireless sensor networks (WSN) are envisioned to revolutionize the paradigm of monitoring complex real-world systems at very high resolution. However, the deployment of large numbers of unattended sensor nodes in hostile environments, frequent changes in environment dynamics, and severe resource constraints pose uncertainties and limit the potential use of WSN in complex real-world applications. Although uncertainty management in Artificial Intelligence (AI) is well developed and well investigated, its implications in wireless sensor environments are inadequately addressed. This dissertation addresses uncertainty management issues for spatio-temporal patterns generated from sensor data. It provides a framework for characterizing spatio-temporal patterns in WSN. Using rough set theory and temporal reasoning, a novel formalism has been developed to characterize and quantify the uncertainties in predicting spatio-temporal patterns from sensor data. This research also uncovers the trade-offs among the uncertainty measures, which can be used to develop a multi-objective optimization model for real-time decision making in sensor data aggregation and sampling.