948 research outputs found

    A review of associative classification mining

    Get PDF
    Associative classification mining is a promising approach in data mining that utilizes the association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, i.e. CPAR, CMAR, MCAR, MMAC and others. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative classification techniques with regards to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted in this paper

    arules - A Computational Environment for Mining Association Rules and Frequent Item Sets

    Get PDF
    Mining frequent itemsets and association rules is a popular and well researched approach for discovering interesting relationships between variables in large databases. The R package arules presented in this paper provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules. The package also includes interfaces to two fast mining algorithms, the popular C implementations of Apriori and Eclat by Christian Borgelt. These algorithms can be used to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.

    Finding the True Frequent Itemsets

    Full text link
    Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires to identify all itemsets appearing in at least a fraction θ\theta of a transactional dataset D\mathcal{D}. Often though, the ultimate goal of mining D\mathcal{D} is not an analysis of the dataset \emph{per se}, but the understanding of the underlying process that generated it. Specifically, in many applications D\mathcal{D} is a collection of samples obtained from an unknown probability distribution π\pi on transactions, and by extracting the FIs in D\mathcal{D} one attempts to infer itemsets that are frequently (i.e., with probability at least θ\theta) generated by π\pi, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of \emph{false positives}, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold θ^\hat{\theta} such that the collection of itemsets with frequency at least θ^\hat{\theta} in D\mathcal{D} contains only TFIs with probability at least 1δ1-\delta, for some user-specified δ\delta. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of D\mathcal{D} at frequency θ\theta and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.Comment: 13 pages, Extended version of work appeared in SIAM International Conference on Data Mining, 201

    Using association rule mining to enrich semantic concepts for video retrieval

    Get PDF
    In order to achieve true content-based information retrieval on video we should analyse and index video with high-level semantic concepts in addition to using user-generated tags and structured metadata like title, date, etc. However the range of such high-level semantic concepts, detected either manually or automatically, usually limited compared to the richness of information content in video and the potential vocabulary of available concepts for indexing. Even though there is work to improve the performance of individual concept classifiers, we should strive to make the best use of whatever partial sets of semantic concept occurrences are available to us. We describe in this paper our method for using association rule mining to automatically enrich the representation of video content through a set of semantic concepts based on concept co-occurrence patterns. We describe our experiments on the TRECVid 2005 video corpus annotated with the 449 concepts of the LSCOM ontology. The evaluation of our results shows the usefulness of our approach

    Modern Approaches to Uncertain Database Exploration from Categorizing Data to Advanced Mining Solutions

    Get PDF
    In today's digitized era, the ubiquity of data from diverse sources has introduced unique challenges in database management, notably the issue of data uncertainty. Uncertainty in databases can arise from various factors – sensor inaccuracies, human input errors, or inherent vagueness in data interpretation. Addressing these challenges, this research delves into modern approaches to uncertain database exploration. The paper begins by exploring methods for categorizing data based on certainty levels, emphasizing the importance and mechanisms to distinguish between certain and uncertain data. The discussion then transitions to highlight pioneering mining solutions that enhance the utility of uncertain databases. By integrating state-of-the-art techniques with traditional database management principles, this study aims to bolster the reliability, efficiency, and versatility of data mining in uncertain contexts. The implications of these methods, both theoretically and in real-world applications, hold the potential to redefine how uncertain data is perceived and utilized in diverse sectors, from healthcare to finance

    Constraint-based sequence mining using constraint programming

    Full text link
    The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy some constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility towards constraint-based settings and compare the approach to existing methods.Comment: In Integration of AI and OR Techniques in Constraint Programming (CPAIOR), 201
    corecore