
    Validating Causal Effect Rules Using Cluster-Based Cohort Study

    Mining association rules from large databases is of interest to many industries, especially for root cause analysis. Many techniques have been introduced to identify causal effect root causes using association rule mining frameworks, such as the support-confidence and support-lift frameworks. However, verifying and validating causal effect root causes usually involves an expert from the business domain, which increases the complexity and time taken in the rule mining process. Hence, this study proposes a cohort study approach to statistically verify the causal effect root causes generated by the Apriori association rule mining technique. The study follows an experimental methodology in testing and validating the proposed cohort study approach. The project also studied the partitioning technique used in the cohort study approach. The proposed cluster-based partitioning technique using k-means clustering was compared with the manual partitioning technique through analysis of the experimental results. The data used in the experiments were taken from a semiconductor manufacturer in Melaka. The data involve true alarms of failure detection collected from the business intelligence reporting unit. The results were positive for root cause validation using the k-means partitioned cohort study. The manual partitioning cohort study generated 107 rules, while the k-means partitioned cohort study produced 49 rules. Only 8 rules appeared in both approaches. Thus, we can conclude that the 8 rules generated by both approaches are definite causal effect rules, while the others are to be further confirmed by a domain expert. In summary, the cohort study approach can be used for validating causal effect rules to a certain extent. Manual partitioning to create different cohort data can be done only if there is sufficient knowledge about the data.
On the other hand, the k-means clustering technique can be used to partition the raw data into different cohorts for further validation. The limitation of this work lies in the validation of the generated root causes with the domain expert, which was limited by time constraints. Future work should therefore focus on domain expert validation. In addition, lift standardization and thresholding should be considered, as they are believed to improve the generated causal effect rules.
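
The support-confidence and support-lift frameworks named in this abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the function name `rule_metrics` and the toy transactions are assumptions made here for illustration, and only the standard definitions of support, confidence, and lift are used.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    transactions: list of sets of items (hypothetical toy data).
    """
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)
    # Count transactions containing the antecedent, the consequent, and both.
    count_a = sum(1 for t in transactions if a <= t)
    count_c = sum(1 for t in transactions if c <= t)
    count_ac = sum(1 for t in transactions if (a | c) <= t)

    support = count_ac / n if n else 0.0
    confidence = count_ac / count_a if count_a else 0.0
    # Lift compares the rule's confidence with the consequent's base rate.
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Hypothetical alarm records: each set holds the conditions seen in one record.
records = [
    {"tool_A", "fail"},
    {"tool_A", "fail"},
    {"tool_A"},
    {"fail"},
]
support, confidence, lift = rule_metrics(records, {"tool_A"}, {"fail"})
```

A lift above 1 suggests the antecedent and consequent co-occur more often than independence would predict. On its own it does not establish causation, which is the gap the cohort study validation is meant to address.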

    Autonomous Hypothesis Generation for Knowledge Discovery in Continuous Domains

    Advances in computational power, data collection, and storage techniques are making new data available every day. This situation has given rise to hypothesis generation research, which complements conventional hypothesis testing research. Hypothesis generation research adopts techniques from machine learning and data mining to autonomously uncover causal relations among variables in the form of previously unknown hidden patterns and models from data. Those patterns and models can come in different forms (e.g. rules, classifiers, clusters, causal relations). In some situations, data are collected without prior supposition or imposition of a specific research goal or hypothesis. Sometimes domain knowledge for this type of problem is also limited. For example, in sensor networks, sensors constantly record data. In these data, not all forms of relationships can be described in advance. Moreover, the environment may change without prior knowledge. In such a situation, hypothesis generation techniques can potentially provide a paradigm to gain new insights about the data and the underlying system. This thesis proposes a general hypothesis generation framework in which assumptions about the observational data and the system are not predefined. The problem is decomposed into two interrelated sub-problems: (1) the associative hypothesis generation problem and (2) the causal hypothesis generation problem. The former defines a task of finding evidence of potential causal relations in data. The latter defines a refined task of identifying causal relations. A novel association rule algorithm for continuous domains, called functional association rule mining, is proposed to address the first problem. An agent-based causal search algorithm is then designed for the second problem. It systematically tests the potential causal relations by querying the system to generate specific data, thus allowing causality to be asserted.
Empirical experiments show that the functional association rule mining algorithm can uncover associative relations from data. If the underlying relationships in the data overlap, the algorithm decomposes these relationships into their constituent non-overlapping parts. Experiments with the causal search algorithm show a relatively low error rate on the retrieved hidden causal structures. In summary, the contributions of this thesis are: (1) a general framework for hypothesis generation in continuous domains, which relaxes a number of conditions assumed in existing automatic causal modelling algorithms and defines a more general hypothesis generation problem; (2) a new functional association rule mining algorithm, which serves as a probing step to identify associative relations in a given dataset and provides a novel functional association rule definition and algorithms to the literature of association rule mining; (3) a new causal search algorithm, which identifies the hidden causal relations of an unknown system on the basis of functional association rule mining and relaxes a number of assumptions commonly used in automatic causal modelling.

    Categorization of interestingness measures for knowledge extraction

    Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules, most of which are redundant and irrelevant. It is therefore necessary to use further measures that filter out uninteresting rules. Many survey studies have since been carried out on interestingness measures from several points of view. Various studies have sought to identify "good" properties of rule extraction measures, and these properties have been assessed on 61 measures. The purpose of this paper is twofold: first, to extend the number of measures and properties to be studied, and to formalize the properties proposed in the literature; second, in the light of this formal study, to categorize the studied measures. This paper thus identifies categories of measures in order to help users efficiently select one or more appropriate measures during the knowledge extraction process. Evaluating the properties on the 61 measures has enabled us to identify 7 classes of measures, which we obtained using two different clustering techniques. Comment: 34 pages, 4 figures

    Combining Clustering techniques and Formal Concept Analysis to characterize Interestingness Measures

    Formal Concept Analysis (FCA) is a data analysis method that makes it possible to discover hidden knowledge in data. One kind of hidden knowledge extracted from data is association rules. Different quality measures have been reported in the literature to extract only relevant association rules. Given a dataset, the choice of a good quality measure remains a challenging task for a user. Given an evaluation matrix of quality measures according to semantic properties, this paper describes how FCA can highlight quality measures with similar behavior in order to help users make their choice. The aim of this article is to discover clusters of Interestingness Measures (IM) that can validate those found by the hierarchical and partitioning clustering methods, agglomerative hierarchical clustering (AHC) and k-means. Then, based on the theoretical study of sixty-one interestingness measures according to nineteen properties proposed in a recent study, FCA describes several groups of measures. Comment: 13 pages, 2 figures
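
As a rough illustration of how FCA groups measures that share properties, the following Python sketch enumerates the formal concepts of a tiny measures-by-properties context. It is a brute-force sketch under stated assumptions, not the paper's method: the function name `formal_concepts`, the measure labels `m1` to `m3`, and the property labels `p1` and `p2` are all hypothetical, and the enumeration is exponential in the number of objects, so it only suits very small contexts.

```python
from itertools import combinations


def formal_concepts(context):
    """Enumerate formal concepts (extent, intent) of a binary context.

    context: dict mapping each object to the set of attributes it has.
    Brute force over all object subsets; for tiny illustrative contexts only.
    """
    objects = list(context)
    all_attrs = set()
    for attrs in context.values():
        all_attrs |= set(attrs)

    def intent(objs):
        # Attributes shared by every object in objs (all attributes if empty).
        common = set(all_attrs)
        for o in objs:
            common &= set(context[o])
        return frozenset(common)

    def extent(attrs):
        # Objects that have every attribute in attrs.
        return frozenset(o for o in objects if attrs <= set(context[o]))

    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            shared = intent(objs)
            # Closing the object set gives a maximal (extent, intent) pair.
            concepts.add((extent(shared), shared))
    return concepts


# Hypothetical evaluation matrix: which measure satisfies which property.
toy_context = {
    "m1": {"p1", "p2"},
    "m2": {"p1"},
    "m3": {"p2"},
}
concepts = formal_concepts(toy_context)
```

Each returned pair couples a maximal set of measures with the maximal set of properties they all satisfy, which is the sense in which a concept lattice can expose groups of measures with similar behavior.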