856 research outputs found
Testing Interestingness Measures in Practice: A Large-Scale Analysis of Buying Patterns
Understanding customer buying patterns is of great interest to the retail
industry and has shown to benefit a wide variety of goals ranging from managing
stocks to implementing loyalty programs. Association rule mining is a common
technique for extracting correlations such as "people in the South of France
buy ros\'e wine" or "customers who buy pat\'e also buy salted butter and sour
bread." Unfortunately, sifting through a high number of buying patterns is not
useful in practice, because of the predominance of popular products in the top
rules. As a result, a number of "interestingness" measures (over 30) have been
proposed to rank rules. However, there is no agreement on which measures are
more appropriate for retail data. Moreover, since pattern mining algorithms
output thousands of association rules for each product, the ability for an
analyst to rely on ranking measures to identify the most interesting ones is
crucial. In this paper, we develop CAPA (Comparative Analysis of PAtterns), a
framework that provides analysts with the ability to compare the outcome of
interestingness measures applied to buying patterns in the retail industry. We
report on how we used CAPA to compare 34 measures applied to over 1,800 stores
of Intermarch\'e, one of the largest food retailers in France
New probabilistic interest measures for association rules
Mining association rules is an important technique for discovering meaningful
patterns in transaction databases. Many different measures of interestingness
have been proposed for association rules. However, these measures fail to take
the probabilistic properties of the mined data into account. In this paper, we
start with presenting a simple probabilistic framework for transaction data
which can be used to simulate transaction data when no associations are
present. We use such data and a real-world database from a grocery outlet to
explore the behavior of confidence and lift, two popular interest measures used
for rule mining. The results show that confidence is systematically influenced
by the frequency of the items in the left hand side of rules and that lift
performs poorly to filter random noise in transaction data. Based on the
probabilistic framework we develop two new interest measures, hyper-lift and
hyper-confidence, which can be used to filter or order mined association rules.
The new measures show significantly better performance than lift for
applications where spurious rules are problematic
Guided Interaction Exploration in Artifact-centric Process Models
Artifact-centric process models aim to describe complex processes as a
collection of interacting artifacts. Recent development in process mining allow
for the discovery of such models. However, the focus is often on the
representation of the individual artifacts rather than their interactions.
Based on event data we can automatically discover composite state machines
representing artifact-centric processes. Moreover, we provide ways of
visualizing and quantifying interactions among different artifacts. For
example, we are able to highlight strongly correlated behaviours in different
artifacts. The approach has been fully implemented as a ProM plug-in; the CSM
Miner provides an interactive artifact-centric process discovery tool focussing
on interactions. The approach has been evaluated using real life data sets,
including the personal loan and overdraft process of a Dutch financial
institution.Comment: 10 pages, 4 figures, to be published in proceedings of the 19th IEEE
Conference on Business Informatics, CBI 201
Knowledge-based Systems and Interestingness Measures: Analysis with Clinical Datasets
Knowledge mined from clinical data can be used for medical diagnosis and prognosis. By improving the quality of knowledge base, the efficiency of prediction of a knowledge-based system can be enhanced. Designing accurate and precise clinical decision support systems, which use the mined knowledge, is still a broad area of research. This work analyses the variation in classification accuracy for such knowledge-based systems using different rule lists. The purpose of this work is not to improve the prediction accuracy of a decision support system, but analyze the factors that influence the efficiency and design of the knowledge base in a rule-based decision support system. Three benchmark medical datasets are used. Rules are extracted using a supervised machine learning algorithm (PART). Each rule in the ruleset is validated using nine frequently used rule interestingness measures. After calculating the measure values, the rule lists are used for performance evaluation. Experimental results show variation in classification accuracy for different rule lists. Confidence and Laplace measures yield relatively superior accuracy: 81.188% for heart disease dataset and 78.255% for diabetes dataset. The accuracy of the knowledge-based prediction system is predominantly dependent on the organization of the ruleset. Rule length needs to be considered when deciding the rule ordering. Subset of a rule, or combination of rule elements, may form new rules and sometimes be a member of the rule list. Redundant rules should be eliminated. Prior knowledge about the domain will enable knowledge engineers to design a better knowledge base
Visual grouping of association rules by clustering conditional probabilities for categorical data
We demonstrate the use of a visual data-mining tool for non-technical domain experts within organizations to facilitate the extraction of meaningful information and knowledge from in-house databases. The tool is mainly based on the basic notion of grouping association rules. Association rules are useful in discovering items that are frequently found together. However in many applications, rules with lower frequencies are often interesting for the user. Grouping of association rules is one way to overcome the rare item problem. However some groups of association rules are too large for ease of understanding. In this chapter we propose a method for clustering categorical data based on the conditional probabilities of association rules for data sets with large numbers of attributes. We argue that the proposed method provides non-technical users with a better understanding of discovered patterns in the data set
A survey of temporal knowledge discovery paradigms and methods
With the increase in the size of data sets, data mining has recently become an important research topic and is receiving substantial interest from both academia and industry. At the same time, interest in temporal databases has been increasing and a growing number of both prototype and implemented systems are using an enhanced temporal understanding to explain aspects of behavior associated with the implicit time-varying nature of the universe. This paper investigates the confluence of these two areas, surveys the work to date, and explores the issues involved and the outstanding problems in temporal data mining
- …