948 research outputs found
A review of associative classification mining
Associative classification mining is a promising approach in data mining that utilizes the
association rule discovery techniques to construct classification systems, also known as
associative classifiers. In the last few years, a number of associative classification algorithms
have been proposed, i.e. CPAR, CMAR, MCAR, MMAC and others. These algorithms
employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule
evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative
classification techniques with regards to the above criteria. Finally, future directions in associative
classification, such as incremental learning and mining low-quality data sets, are also
highlighted in this paper
arules - A Computational Environment for Mining Association Rules and Frequent Item Sets
Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
Finding the True Frequent Itemsets
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It
requires to identify all itemsets appearing in at least a fraction of
a transactional dataset . Often though, the ultimate goal of
mining is not an analysis of the dataset \emph{per se}, but the
understanding of the underlying process that generated it. Specifically, in
many applications is a collection of samples obtained from an
unknown probability distribution on transactions, and by extracting the
FIs in one attempts to infer itemsets that are frequently (i.e.,
with probability at least ) generated by , which we call the True
Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the
generative process, the set of FIs is only a rough approximation of the set of
TFIs, as it often contains a huge number of \emph{false positives}, i.e.,
spurious itemsets that are not among the TFIs. In this work we design and
analyze an algorithm to identify a threshold such that the
collection of itemsets with frequency at least in
contains only TFIs with probability at least , for some
user-specified . Our method uses results from statistical learning
theory involving the (empirical) VC-dimension of the problem at hand. This
allows us to identify almost all the TFIs without including any false positive.
We also experimentally compare our method with the direct mining of
at frequency and with techniques based on widely-used
standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and
show that our algorithm outperforms these methods and achieves even better
results than what is guaranteed by the theoretical analysis.Comment: 13 pages, Extended version of work appeared in SIAM International
Conference on Data Mining, 201
Using association rule mining to enrich semantic concepts for video retrieval
In order to achieve true content-based information retrieval on video we should analyse and index video with
high-level semantic concepts in addition to using user-generated tags and structured metadata like title, date,
etc. However the range of such high-level semantic concepts, detected either manually or automatically,
usually limited compared to the richness of information content in video and the potential vocabulary of
available concepts for indexing. Even though there is work to improve the performance of individual concept
classifiers, we should strive to make the best use of whatever partial sets of semantic concept occurrences
are available to us. We describe in this paper our method for using association rule mining to automatically
enrich the representation of video content through a set of semantic concepts based on concept co-occurrence
patterns. We describe our experiments on the TRECVid 2005 video corpus annotated with the 449 concepts
of the LSCOM ontology. The evaluation of our results shows the usefulness of our approach
Modern Approaches to Uncertain Database Exploration from Categorizing Data to Advanced Mining Solutions
In today's digitized era, the ubiquity of data from diverse sources has introduced unique challenges in database management, notably the issue of data uncertainty. Uncertainty in databases can arise from various factors – sensor inaccuracies, human input errors, or inherent vagueness in data interpretation. Addressing these challenges, this research delves into modern approaches to uncertain database exploration. The paper begins by exploring methods for categorizing data based on certainty levels, emphasizing the importance and mechanisms to distinguish between certain and uncertain data. The discussion then transitions to highlight pioneering mining solutions that enhance the utility of uncertain databases. By integrating state-of-the-art techniques with traditional database management principles, this study aims to bolster the reliability, efficiency, and versatility of data mining in uncertain contexts. The implications of these methods, both theoretically and in real-world applications, hold the potential to redefine how uncertain data is perceived and utilized in diverse sectors, from healthcare to finance
Constraint-based sequence mining using constraint programming
The goal of constraint-based sequence mining is to find sequences of symbols
that are included in a large number of input sequences and that satisfy some
constraints specified by the user. Many constraints have been proposed in the
literature, but a general framework is still missing. We investigate the use of
constraint programming as general framework for this task. We first identify
four categories of constraints that are applicable to sequence mining. We then
propose two constraint programming formulations. The first formulation
introduces a new global constraint called exists-embedding. This formulation is
the most efficient but does not support one type of constraint. To support such
constraints, we develop a second formulation that is more general but incurs
more overhead. Both formulations can use the projected database technique used
in specialised algorithms. Experiments demonstrate the flexibility towards
constraint-based settings and compare the approach to existing methods.Comment: In Integration of AI and OR Techniques in Constraint Programming
(CPAIOR), 201
- …