New probabilistic interest measures for association rules
Mining association rules is an important technique for discovering meaningful
patterns in transaction databases. Many different measures of interestingness
have been proposed for association rules. However, these measures fail to take
the probabilistic properties of the mined data into account. In this paper, we
start by presenting a simple probabilistic framework for transaction data
which can be used to simulate transaction data when no associations are
present. We use such data and a real-world database from a grocery outlet to
explore the behavior of confidence and lift, two popular interest measures used
for rule mining. The results show that confidence is systematically influenced
by the frequency of the items in the left-hand side of rules and that lift
performs poorly at filtering random noise in transaction data. Based on the
probabilistic framework we develop two new interest measures, hyper-lift and
hyper-confidence, which can be used to filter or order mined association rules.
The new measures show significantly better performance than lift for
applications where spurious rules are problematic.
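
As an illustration of the two measures the abstract examines, here is a minimal Python sketch of the standard textbook definitions of confidence and lift for a rule X -> Y, computed from raw counts. The function names and the toy grocery transactions are hypothetical and are not taken from the paper; hyper-lift and hyper-confidence are not sketched because the abstract does not define them.

```python
from typing import Iterable, List, Set


def count(transactions: List[Set[str]], itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= t)


def confidence(transactions: List[Set[str]], lhs: Set[str], rhs: Set[str]) -> float:
    """conf(X -> Y) = count(X and Y) / count(X)."""
    c_x = count(transactions, lhs)
    if c_x == 0:
        raise ValueError("left-hand side never occurs")
    return count(transactions, lhs | rhs) / c_x


def lift(transactions: List[Set[str]], lhs: Set[str], rhs: Set[str]) -> float:
    """lift(X -> Y) = n * count(X and Y) / (count(X) * count(Y))."""
    c_x = count(transactions, lhs)
    c_y = count(transactions, rhs)
    if c_x == 0 or c_y == 0:
        raise ValueError("rule side never occurs")
    n = len(transactions)
    return n * count(transactions, lhs | rhs) / (c_x * c_y)


# Hypothetical toy data: five shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(confidence(baskets, {"bread"}, {"milk"}))  # 3 / 4 = 0.75
print(lift(baskets, {"bread"}, {"milk"}))        # 5 * 3 / (4 * 4) = 0.9375
```

A lift of 1 corresponds to statistical independence of the two sides of the rule; values above 1 indicate that they co-occur more often than expected under independence.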
Implications of probabilistic data modeling for rule mining
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this paper we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore how well confidence and lift, two popular interest measures used for rule mining, filter noise. Based on the framework we develop the measure hyperlift and we compare this new measure to lift using simulated data and a real-world grocery database.
Series: Research Report Series / Department of Statistics and Mathematics
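
The idea of simulating transaction data with no associations can be sketched in a few lines of Python. This sketch assumes that each item enters a transaction independently with its own fixed probability; that independence model, the function name, and the item probabilities are illustrative assumptions, not the report's R implementation.

```python
import random
from typing import Dict, FrozenSet, List


def simulate_independent_transactions(
    item_probs: Dict[str, float], n_transactions: int, seed: int = 0
) -> List[FrozenSet[str]]:
    """Draw transactions in which every item occurs independently.

    Assumption: item `i` is included in each transaction with
    probability item_probs[i], regardless of the other items.
    """
    rng = random.Random(seed)
    return [
        frozenset(item for item, p in item_probs.items() if rng.random() < p)
        for _ in range(n_transactions)
    ]


# Hypothetical item frequencies: one frequent and two rare items.
probs = {"milk": 0.5, "caviar": 0.01, "truffles": 0.01}
data = simulate_independent_transactions(probs, n_transactions=10_000, seed=42)

# Under independence the expected number of joint occurrences of the
# two rare items is n * p1 * p2 = 1, so any observed co-occurrence
# count is pure chance.
both = sum(1 for t in data if {"caviar", "truffles"} <= t)
expected = len(data) * probs["caviar"] * probs["truffles"]
print("observed co-occurrences:", both, "expected:", expected)
```

Because rare items co-occur only a handful of times, a ratio-based measure computed on such data can deviate strongly from its independence value by chance alone, which is the kind of noise the abstract refers to.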
Categorization of interestingness measures for knowledge extraction
Finding interesting association rules is an important and active research
field in data mining. The algorithms of the Apriori family are based on two
rule extraction measures, support and confidence. Although these two measures
have the virtue of being algorithmically fast, they generate a prohibitive
number of rules most of which are redundant and irrelevant. It is therefore
necessary to use further measures which filter uninteresting rules. Many
survey studies have since examined interestingness measures from
several points of view. Different reported studies have been carried out to
identify "good" properties of rule extraction measures and these properties
have been assessed on 61 measures. The purpose of this paper is twofold: first,
to extend the number of measures and properties studied and to formalize
the properties proposed in the literature; second, in
the light of this formal study, to categorize the studied measures. The paper
thus identifies categories of measures in order to help users
efficiently select one or more appropriate measures
during the knowledge extraction process. The evaluation of the properties on the 61
measures has enabled us to identify 7 classes of measures, classes that we
obtained using two different clustering techniques.
Comment: 34 pages, 4 figures
Combining Clustering techniques and Formal Concept Analysis to characterize Interestingness Measures
Formal Concept Analysis "FCA" is a data analysis method which enables the
discovery of hidden knowledge existing in data. One kind of hidden knowledge
extracted from data is association rules. Different quality measures were
reported in the literature to extract only relevant association rules. Given a
dataset, the choice of a good quality measure remains a challenging task for a
user. Given a quality measures evaluation matrix according to semantic
properties, this paper describes how FCA can highlight quality measures with
similar behavior in order to help the user make this choice. The aim of this
article is the discovery of Interestingness Measures "IM" clusters, able to
validate those found by the hierarchical and partitioning clustering
methods "AHC" and "k-means". Then, based on the theoretical study of sixty-one
interestingness measures according to nineteen properties, proposed in a recent
study, "FCA" describes several groups of measures.
Comment: 13 pages, 2 figures
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
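
The inductive process described in the abstract, building a classifier from preclassified documents, can be illustrated with a small example. The sketch below uses multinomial Naive Bayes with add-one smoothing, chosen here only because it is compact; the survey covers many learners and the abstract does not single this one out. The function names and the four training documents are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train_naive_bayes(docs: List[Tuple[List[str], str]]) -> Model:
    """Learn per-category document and word counts from labelled documents."""
    class_docs: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for tokens, category in docs:
        class_docs[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(model: Model, tokens: List[str]) -> str:
    """Return the category with the highest log posterior score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_category, best_score = "", -math.inf
    for category in sorted(class_docs):
        total_words = sum(word_counts[category].values())
        score = math.log(class_docs[category] / total_docs)
        for token in tokens:
            if token in vocab:  # words never seen in training are ignored
                smoothed = (word_counts[category][token] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical preclassified documents (document representation: bag of words).
training = [
    ("wheat harvest grain".split(), "agriculture"),
    ("grain export wheat".split(), "agriculture"),
    ("stock market shares".split(), "finance"),
    ("shares dividend market".split(), "finance"),
]
model = train_naive_bayes(training)
print(classify(model, "wheat grain prices".split()))  # agriculture
print(classify(model, "market shares fall".split()))  # finance
```

The three steps in the code mirror the three problems the survey names: the token lists are the document representation, `train_naive_bayes` is classifier construction, and comparing `classify` output against held-out labels would be classifier evaluation.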
Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation
We describe an implemented system for robust domain-independent syntactic
parsing of English, using a unification-based grammar of part-of-speech and
punctuation labels coupled with a probabilistic LR parser. We present
evaluations of the system's performance along several different dimensions;
these enable us to assess the contribution that each individual part is making
to the success of the system as a whole, and thus prioritise the effort to be
devoted to its further enhancement. Currently, the system is able to parse
around 80% of sentences in a substantial corpus of general text containing a
number of distinct genres. On a random sample of 250 such sentences the system
has a mean crossing bracket rate of 0.71 and recall and precision of 83% and
84% respectively when evaluated against manually-disambiguated analyses.
Comment: 10 pages, 1 Postscript figure. To appear in Proceedings of the
Conference on Empirical Methods in Natural Language Processing, University of
Pennsylvania, May 199
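
The evaluation figures quoted in the abstract (crossing brackets, recall, precision) are bracketing measures. The Python sketch below shows how such scores are commonly computed for one sentence, assuming unlabelled constituents represented as (start, end) word spans; the representation, the function names, and the example spans are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (start, end) word offsets, end exclusive


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def bracket_scores(test: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float, int]:
    """Return (precision, recall, crossing bracket count) for one sentence."""
    test_set: Set[Span] = set(test)
    gold_set: Set[Span] = set(gold)
    matched = len(test_set & gold_set)
    precision = matched / len(test_set)
    recall = matched / len(gold_set)
    crossing = sum(1 for t in test_set if any(crosses(t, g) for g in gold_set))
    return precision, recall, crossing


# Hypothetical five-word sentence.
gold_spans = [(0, 5), (0, 2), (2, 5), (3, 5)]
test_spans = [(0, 5), (0, 3), (3, 5)]

precision, recall, crossing = bracket_scores(test_spans, gold_spans)
print(precision, recall, crossing)
```

In this example two of the three proposed brackets appear in the gold analysis, two of the four gold brackets are recovered, and the span (0, 3) crosses the gold span (2, 5). A mean crossing bracket rate such as the one reported is the average of the crossing count over all evaluated sentences.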