Computing Multi-Relational Sufficient Statistics for Large Databases
Databases contain information about which relationships do and do not hold
among entities. To make this information accessible for statistical analysis
requires computing sufficient statistics that combine information from
different database tables. Such statistics may involve any number of
positive and negative relationships. With a naive enumeration approach,
computing sufficient statistics for negative relationships is feasible only for
small databases. We solve this problem with a new dynamic programming algorithm
that performs a virtual join, where the requisite counts are computed without
materializing join tables. Contingency table algebra is a new extension of
relational algebra that facilitates the efficient implementation of this
Möbius virtual join operation. The Möbius Join scales to large datasets
(over 1M tuples) with complex schemas. Empirical evaluation with seven
benchmark datasets showed that information about the presence and absence of
links can be exploited in feature selection, association rule mining, and
Bayesian network learning.
Comment: 11 pages, 8 figures, 8 tables, CIKM'14, November 3-7, 2014, Shanghai, China
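The counting idea behind the virtual join can be illustrated on a toy schema. The sketch below is not the paper's Möbius Join algorithm; it only shows, for a single relationship, how counts for the negative case (relationship absent) follow by subtraction from counts that ignore the relationship, so that non-linked pairs are never enumerated. The tables, attribute names, and the function `contingency_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical toy database (not from the paper).
students = {"ann": "cs", "bob": "cs", "cat": "math"}    # student -> major
courses = {"db": "hard", "ml": "hard", "art": "easy"}   # course -> difficulty
registered = {("ann", "db"), ("ann", "ml"), ("bob", "db"), ("cat", "art")}


def contingency_counts(students, courses, registered):
    """Count (major, difficulty, linked?) combinations without enumerating
    the student-course pairs that are NOT in the relationship table."""
    major_counts = Counter(students.values())
    difficulty_counts = Counter(courses.values())

    # Counts that ignore the relationship: product of the entity marginals.
    all_pairs = {
        (m, d): major_counts[m] * difficulty_counts[d]
        for m in major_counts
        for d in difficulty_counts
    }

    # Counts for the positive relationship: one pass over the link table.
    positive = Counter((students[s], courses[c]) for s, c in registered)

    # Negative counts follow by subtraction.
    table = {}
    for key, total in all_pairs.items():
        table[key + (True,)] = positive[key]
        table[key + (False,)] = total - positive[key]
    return table


table = contingency_counts(students, courses, registered)
```

Here the work is proportional to the size of the link table plus the number of attribute combinations, rather than to the number of all student-course pairs. Handling several relationships at once requires the more general scheme the abstract describes.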
Re-mining item associations: methodology and a case study in apparel retailing
Association mining is the conventional data mining technique for analyzing market basket data, and it reveals the positive and negative associations between items. Although pricing and time information are an integral part of transaction data, they have not been integrated into market basket analysis in earlier studies. This paper proposes a new approach to mine price, time and domain related attributes through re-mining of association mining results. The underlying factors behind positive and negative relationships can be characterized and described through this second data mining stage. The applicability of the methodology is demonstrated through the analysis of data coming from a large apparel retail chain, and its algorithmic complexity is analyzed in comparison to the existing techniques.
A review of associative classification mining
Associative classification mining is a promising approach in data mining that utilizes the
association rule discovery techniques to construct classification systems, also known as
associative classifiers. In the last few years, a number of associative classification algorithms
have been proposed, e.g., CPAR, CMAR, MCAR, MMAC and others. These algorithms
employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule
evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative
classification techniques with regard to the above criteria. Finally, future directions in associative
classification, such as incremental learning and mining low-quality data sets, are also
highlighted in this paper.
New probabilistic interest measures for association rules
Mining association rules is an important technique for discovering meaningful
patterns in transaction databases. Many different measures of interestingness
have been proposed for association rules. However, these measures fail to take
the probabilistic properties of the mined data into account. In this paper, we
start by presenting a simple probabilistic framework for transaction data
which can be used to simulate transaction data when no associations are
present. We use such data and a real-world database from a grocery outlet to
explore the behavior of confidence and lift, two popular interest measures used
for rule mining. The results show that confidence is systematically influenced
by the frequency of the items in the left-hand side of rules and that lift
performs poorly at filtering random noise in transaction data. Based on the
probabilistic framework we develop two new interest measures, hyper-lift and
hyper-confidence, which can be used to filter or order mined association rules.
The new measures show significantly better performance than lift for
applications where spurious rules are problematic.
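For reference, the two standard measures discussed in this abstract can be computed as follows. This is a minimal sketch using the usual textbook definitions of support, confidence, and lift on a made-up transaction list; the function names and the data are assumptions for illustration, and the paper's hyper-lift and hyper-confidence measures are not reproduced here.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs): supp(lhs and rhs) / supp(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence divided by the support of the right-hand side.
    A value of 1 corresponds to statistical independence."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Hypothetical toy market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread"},
]

conf_bm = confidence(transactions, {"bread"}, {"milk"})  # (2/5) / (4/5)
lift_bm = lift(transactions, {"bread"}, {"milk"})        # conf / (3/5)
```

In this toy data the rule bread -> milk has a lift below 1, meaning the two items co-occur slightly less often than independence would predict.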