2,011 research outputs found

    A Comparison of the Quality of Rule Induction from Inconsistent Data Sets and Incomplete Data Sets

    In data mining, decision rules induced from known examples are used to classify unseen cases. There are various rule induction algorithms, such as LEM1 (Learning from Examples Module version 1), LEM2 (Learning from Examples Module version 2) and MLEM2 (Modified Learning from Examples Module version 2). In the real world, many data sets are imperfect, being either inconsistent or incomplete. The idea of lower and upper approximations, or more generally, the probabilistic approximation, provides an effective way to induce rules from inconsistent and incomplete data sets. However, the accuracies of rule sets induced from imperfect data sets are expected to be lower. The objective of this project is to investigate which kind of imperfect data set (inconsistent or incomplete) is worse in terms of the quality of rule induction. In this project, experiments were conducted on eight inconsistent data sets and eight incomplete data sets with lost values. We implemented the MLEM2 algorithm to induce certain and possible rules from inconsistent data sets, and implemented the local probabilistic version of the MLEM2 algorithm to induce certain and possible rules from incomplete data sets. A program called Rule Checker was also developed to classify unseen cases with induced rules and measure the classification error rate. Ten-fold cross validation was carried out and the average error rate was used as the criterion for comparison. Mann-Whitney nonparametric tests were performed to compare incompleteness with inconsistency, separately for certain and possible rules. The results show that there is no significant difference between inconsistent and incomplete data sets in terms of the quality of rule induction.
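The lower and upper approximations mentioned in this abstract can be illustrated with a minimal sketch. It follows the standard rough-set definitions over an indiscernibility relation and is not the MLEM2 implementation described above; the function name `approximations` and the toy data are hypothetical.

```python
from collections import defaultdict


def approximations(cases, decisions, concept):
    """Return (lower, upper) approximations of `concept` as sets of case indices.

    cases     -- list of attribute-value tuples, one per case (hypothetical layout)
    decisions -- list of decision values, aligned with `cases`
    concept   -- the decision value whose approximations are computed
    """
    # Group cases that are indiscernible, i.e. share all attribute values.
    blocks = defaultdict(set)
    for index, attribute_values in enumerate(cases):
        blocks[attribute_values].add(index)

    target = {i for i, d in enumerate(decisions) if d == concept}

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:   # block lies entirely inside the concept
            lower |= block
        if block & target:    # block overlaps the concept
            upper |= block
    return lower, upper


# Toy inconsistent data set: cases 0 and 1 agree on every attribute
# but carry different decisions.
cases = [("high", "yes"), ("high", "yes"), ("low", "no"), ("low", "yes")]
decisions = ["flu", "no-flu", "no-flu", "flu"]
lower, upper = approximations(cases, decisions, "flu")
```

In this setting, certain rules would be induced from the lower approximation and possible rules from the upper approximation; the gap between the two sets is what makes the data inconsistent.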

    Dominance-based Rough Set Approach, basic ideas and main trends

    The Dominance-based Rough Set Approach (DRSA) has been proposed as a machine learning and knowledge discovery methodology to handle Multiple Criteria Decision Aiding (MCDA). Due to its capacity for asking the decision maker (DM) for simple preference information and supplying easily understandable and explainable recommendations, DRSA has gained much interest over the years and is now one of the most appreciated MCDA approaches. In fact, it has also been applied beyond the MCDA domain, as a general knowledge discovery and data mining methodology for the analysis of monotonic (and also non-monotonic) data. In this contribution, we recall the basic principles and the main concepts of DRSA, with a general overview of its developments and software. We also present a historical reconstruction of the genesis of the methodology, with a specific focus on the contribution of Roman Słowiński. Comment: This research was partially supported by TAILOR, a project funded by the European Union (EU) Horizon 2020 research and innovation programme under GA No 952215. This submission is a preprint of a book chapter accepted by Springer, with very few minor differences of just technical nature.

    Interesting Rule Induction Module: Adding Support for Unknown Attribute Values

    IRIM (Interesting Rule Induction Module) is a rule induction system designed to induce particularly strong, simple rule sets. Additionally, IRIM does not require prior discretization of numerical attribute values. IRIM does not necessarily produce consistent rules that fully describe the target concepts; however, the rules induced by IRIM often lead to novel revelations of hidden relationships in a dataset. In this paper, we attempt to extend the IRIM system to handle missing attribute values (in particular, lost and do-not-care attribute values) more thoroughly than by ignoring the cases that they belong to. Further, we include an implementation of IRIM in the modern programming language Python that has been written for easy inclusion within a Python data mining package or library. The provided implementation makes use of the Pandas module, which is built on top of a C back end for quick performance relative to the performance normally found with Python.
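The difference between lost and do-not-care values can be sketched with attribute-value blocks. This assumes the convention common in the rough-set literature, where a lost value (written `?` here) keeps a case out of every block of that attribute and a do-not-care value (written `*` here) puts it into all of them. The abstract does not state how IRIM itself treats these values, so the symbols, the function name `attribute_value_blocks`, and the toy data are assumptions, and plain dictionaries are used instead of Pandas.

```python
def attribute_value_blocks(cases, attribute):
    """Map each known value of `attribute` to the set of case indices it covers.

    Assumed convention (not taken from the IRIM paper):
      "?" -- lost value: the case joins no block of this attribute
      "*" -- do-not-care value: the case joins every block of this attribute
    """
    known_values = {case[attribute] for case in cases} - {"?", "*"}
    blocks = {value: set() for value in known_values}
    for index, case in enumerate(cases):
        value = case[attribute]
        if value == "?":
            continue                      # lost: excluded everywhere
        if value == "*":
            for block in blocks.values():
                block.add(index)          # do-not-care: included everywhere
        else:
            blocks[value].add(index)
    return blocks


# Hypothetical four-case data set with one lost and one do-not-care value.
cases = [{"temp": "high"}, {"temp": "?"}, {"temp": "*"}, {"temp": "low"}]
blocks = attribute_value_blocks(cases, "temp")
```

Under this convention no case is discarded outright, in contrast to the simpler strategy of ignoring every case that has a missing value.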

    The usefulness of a machine learning approach to knowledge acquisition

    This paper presents results of experiments showing how machine learning methods are useful for rule induction in the process of knowledge acquisition for expert systems. Four machine learning methods were used: ID3, ID3 with dropping conditions, and two options of the system LERS (Learning from Examples based on Rough Sets): LEM1 and LEM2. Two knowledge acquisition options of LERS were used as well. All six methods were used for rule induction from six real-life data sets. The main objective was to test how an expert system, supplied with these rule sets, performs without information on a few attributes. Thus an expert system attempts to classify examples in which all values of some attributes are missing. The experiments show that all machine learning methods performed much worse than the knowledge acquisition options of LERS. Thus, machine learning methods used for knowledge acquisition should be replaced by other methods of rule induction that will generate complete sets of rules. The knowledge acquisition options of LERS are examples of such appropriate ways of inducing rules for building knowledge bases.

    A semantical and computational approach to covering-based rough sets

    Implementation of Web Query Languages Reconsidered

    Visions of the next generation Web such as the "Semantic Web" or the "Web 2.0" have triggered the emergence of a multitude of data formats. These formats have different characteristics as far as the shape of data is concerned (for example tree- vs. graph-shaped). They are accompanied by a puzzlingly large number of query languages, each limited to one data format. Thus, a key feature of the Web, namely to make it possible to access anything published by anyone, is compromised. This thesis is devoted to versatile query languages capable of accessing data in a variety of Web formats. The issue is addressed from three angles: language design, common, yet uniform semantics, and common, yet uniform evaluation. Thus it is divided into three parts: First, we consider the query language Xcerpt as an example of the advocated class of versatile Web query languages. Using this concrete exemplar allows us to clarify and discuss the vision of versatility in detail. Second, a number of query languages, XPath, XQuery, SPARQL, and Xcerpt, are translated into a common intermediary language, CIQLog. This language has a purely logical semantics, which makes it easily amenable to optimizations. As a side effect, this provides the, to the best of our knowledge, first logical semantics for XQuery and SPARQL. It is a very useful tool for understanding the commonalities and differences of the considered languages. Third, the intermediate logical language is translated into a query algebra, CIQCAG. The core feature of CIQCAG is that it scales from tree- to graph-shaped data and queries without efficiency losses when tree-data and -queries are considered: it is shown that, in these cases, optimal complexities are achieved. CIQCAG is also shown to evaluate each of the aforementioned query languages with a complexity at least as good as the best known evaluation methods so far. For example, navigational XPath is evaluated with space complexity O(q d) and time complexity O(q n), where q is the query size, n the data size, and d the depth of the (tree-shaped) data. CIQCAG is further shown to provide linear time and space evaluation of tree-shaped queries for a larger class of graph-shaped data than any method previously proposed. This larger class of graph-shaped data, called continuous-image graphs, or CIGs for short, is introduced for the first time in this thesis. A (directed) graph is a CIG if its nodes can be totally ordered in such a manner that, for this order, the children of any node form a continuous interval. CIQCAG achieves these properties by employing a novel data structure, called sequence map, that allows an efficient evaluation of tree-shaped queries, or of tree-shaped cores of graph-shaped queries, on any graph-shaped data. While being ideally suited to trees and CIGs, the data structure gracefully degrades to unrestricted graphs. It yields a remarkably efficient evaluation on graph-shaped data that only a few edges prevent from being trees or CIGs.
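The CIG definition in this abstract can be made concrete with a small check. The sketch below only tests whether one given total order satisfies the interval condition; deciding whether a graph is a CIG means asking whether some such order exists, which this code does not attempt. The function name and the example graph are hypothetical and not taken from the thesis.

```python
def is_continuous_image_order(children, order):
    """Check whether, under `order`, the children of every node form one interval.

    children -- dict mapping each node to a list of its child nodes
    order    -- list of all nodes, giving the candidate total order
    """
    position = {node: index for index, node in enumerate(order)}
    for node in order:
        child_positions = {position[child] for child in children.get(node, ())}
        if not child_positions:
            continue
        # A set of positions is a contiguous interval exactly when its span
        # equals its size.
        span = max(child_positions) - min(child_positions) + 1
        if span != len(child_positions):
            return False
    return True


# Hypothetical directed graph: node "c" has two parents, so it is not a tree.
children = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": []}

good_order = is_continuous_image_order(children, ["a", "b", "c", "d"])
bad_order = is_continuous_image_order(children, ["b", "a", "c", "d"])
```

With the first order the children of `a` occupy positions 1 and 2 and those of `b` positions 2 and 3, so both are intervals; with the second order the children of `a` sit at positions 0 and 2, which leaves a gap.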

    Rough Set Models and Knowledge Acquisition for Incomplete Information Systems

    National University Corporation Nagaoka University of Technology