Towards a semantic and statistical selection of association rules
The increasing growth of databases raises an urgent need for more accurate
methods to better understand the stored data. In this scope, association rules
were extensively used for the analysis and the comprehension of huge amounts of
data. However, the number of generated rules is too large to be efficiently
analyzed and explored in any further process. Association rule selection is a
classical way to address this issue, yet new, innovative approaches are
required in order to help decision makers. Hence, many interestingness
measures have been defined to statistically evaluate and filter
association rules. However, these measures present two major problems. On the
one hand, they do not allow irrelevant rules to be eliminated; on the other hand,
their abundance leads to heterogeneous evaluation results, which
causes confusion in decision making. In this paper, we propose a two-winged
approach to select statistically interesting and semantically incomparable
rules. Our statistical selection helps discover interesting association
rules without favoring or excluding any measure. The semantic comparability
helps to decide whether the considered association rules are semantically related,
i.e., comparable. The outcomes of our experiments on real datasets show promising
results in terms of reduction in the number of rules.
Testing Interestingness Measures in Practice: A Large-Scale Analysis of Buying Patterns
Understanding customer buying patterns is of great interest to the retail
industry and has been shown to benefit a wide variety of goals ranging from managing
stocks to implementing loyalty programs. Association rule mining is a common
technique for extracting correlations such as "people in the South of France
buy rosé wine" or "customers who buy pâté also buy salted butter and sour
bread." Unfortunately, sifting through a high number of buying patterns is not
useful in practice, because of the predominance of popular products in the top
rules. As a result, a number of "interestingness" measures (over 30) have been
proposed to rank rules. However, there is no agreement on which measures are
more appropriate for retail data. Moreover, since pattern mining algorithms
output thousands of association rules for each product, the ability for an
analyst to rely on ranking measures to identify the most interesting ones is
crucial. In this paper, we develop CAPA (Comparative Analysis of PAtterns), a
framework that provides analysts with the ability to compare the outcome of
interestingness measures applied to buying patterns in the retail industry. We
report on how we used CAPA to compare 34 measures applied to over 1,800 stores
of Intermarché, one of the largest food retailers in France.
New probabilistic interest measures for association rules
Mining association rules is an important technique for discovering meaningful
patterns in transaction databases. Many different measures of interestingness
have been proposed for association rules. However, these measures fail to take
the probabilistic properties of the mined data into account. In this paper, we
start by presenting a simple probabilistic framework for transaction data
which can be used to simulate transaction data when no associations are
present. We use such data and a real-world database from a grocery outlet to
explore the behavior of confidence and lift, two popular interest measures used
for rule mining. The results show that confidence is systematically influenced
by the frequency of the items in the left-hand side of rules and that lift
performs poorly at filtering random noise in transaction data. Based on the
probabilistic framework we develop two new interest measures, hyper-lift and
hyper-confidence, which can be used to filter or order mined association rules.
The new measures show significantly better performance than lift for
applications where spurious rules are problematic.
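As a point of reference for the two measures this abstract examines, the following is a minimal sketch of the textbook definitions of support, confidence and lift for a rule X => Y. It does not implement hyper-lift or hyper-confidence, whose definitions the abstract does not give. The function names and the toy transactions are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set

# Hypothetical toy data: each transaction is a set of item names.
TOY_TRANSACTIONS: List[Set[str]] = [
    {"bread", "butter"},
    {"bread", "butter", "wine"},
    {"bread"},
    {"wine"},
    {"butter", "wine"},
]


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    return support(transactions, lhs | rhs) / support(transactions, lhs)


def lift(transactions: List[Set[str]],
         lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Confidence divided by support(rhs); 1.0 means lhs and rhs look independent."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


if __name__ == "__main__":
    print(confidence(TOY_TRANSACTIONS, {"bread"}, {"butter"}))
    print(lift(TOY_TRANSACTIONS, {"bread"}, {"butter"}))
```

Because confidence is normalized only by the support of the left-hand side, its value depends directly on how frequent those items are, which is the sensitivity the abstract reports.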
Evaluation and optimization of frequent association rule based classification
Deriving useful and interesting rules from a data mining system is an essential and important task. Problems
such as the discovery of random and coincidental patterns or patterns with no significant values, and the
generation of a large volume of rules from a database commonly occur. Work on sustaining the interestingness
of rules generated by data mining algorithms is being actively and constantly developed. In this
paper, we present a systematic way to evaluate the association rules discovered by frequent itemset mining
algorithms, combining common data mining and statistical interestingness measures, and outline an appropriate
sequence of usage. The experiments are performed using a number of real-world datasets that represent diverse
characteristics of data/items, and a detailed evaluation of the rule sets is provided. Empirical results show
that with a proper combination of data mining and statistical analysis, the framework is capable of
eliminating a large number of non-significant, redundant and contradictory rules while preserving relatively
valuable high-accuracy and high-coverage rules when used in the classification problem. Moreover, the results
reveal the important characteristics of mining frequent itemsets, and the impact of the confidence measure on
the classification task.
Statistical strategies for pruning all the uninteresting association rules
We propose a general framework to describe formally the
problem of capturing the intensity of implication for
association rules through statistical metrics.
In this framework we present properties that influence the
interestingness of a rule, analyze the conditions that
lead a measure to perform a perfect prune at a time,
and define a final proper order to sort the surviving
rules. We discuss why none of the currently employed
measures can capture objective interestingness, and why
only a combination of some of them, applied in a multi-step fashion,
can be reliable. In contrast, we propose a new simple modification
of the Pearson coefficient that will meet all the necessary
requirements. We statistically infer the convenient cut-off
threshold for this new metric by empirically describing its
distribution function through simulation. Final experiments
serve to show the ability of our proposal.
Postprint (published version)
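For context on the Pearson coefficient this abstract builds on, here is a minimal sketch of the standard, unmodified Pearson correlation between the binary indicators "transaction contains X" and "transaction contains Y", often called the phi coefficient. The authors' modification is not described in the abstract and is not reproduced here. The function name, its parameters and the zero-variance convention are assumptions made for illustration.

```python
import math


def phi_coefficient(n: int, n_x: int, n_y: int, n_xy: int) -> float:
    """Standard Pearson (phi) coefficient for a rule X => Y.

    n    : total number of transactions
    n_x  : transactions containing X
    n_y  : transactions containing Y
    n_xy : transactions containing both X and Y
    """
    denominator = math.sqrt(n_x * n_y * (n - n_x) * (n - n_y))
    if denominator == 0:
        # Assumed convention: if X or Y occurs in none or all of the
        # transactions, the correlation is undefined; report 0.0.
        return 0.0
    return (n * n_xy - n_x * n_y) / denominator


if __name__ == "__main__":
    # Hypothetical counts: 5 transactions, X in 3, Y in 3, both in 2.
    print(phi_coefficient(5, 3, 3, 2))
```

Values near 0 indicate that X and Y co-occur about as often as independence would predict, while values near 1 or -1 indicate strong positive or negative association.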
Combining Clustering techniques and Formal Concept Analysis to characterize Interestingness Measures
Formal Concept Analysis "FCA" is a data analysis method that enables the
discovery of hidden knowledge existing in data. Association rules are one kind
of hidden knowledge extracted from data. Different quality measures were
reported in the literature to extract only relevant association rules. Given a
dataset, the choice of a good quality measure remains a challenging task for a
user. Given a quality measures evaluation matrix according to semantic
properties, this paper describes how FCA can highlight quality measures with
similar behavior in order to help the user make this choice. The aim of this
article is the discovery of Interestingness Measures "IM" clusters, able to
validate those found by the hierarchical and partitioning clustering
methods "AHC" and "k-means". Then, based on the theoretical study of sixty-one
interestingness measures according to nineteen properties, proposed in a recent
study, "FCA" describes several groups of measures.
Comment: 13 pages, 2 figures