3,212 research outputs found

    DiffNodesets: An Efficient Structure for Fast Mining Frequent Itemsets

    Full text link
    Mining frequent itemsets is an essential problem in data mining and plays an important role in many data mining applications. In recent years, some itemset representations based on node sets have been proposed, which have shown to be very efficient for mining frequent itemsets. In this paper, we propose DiffNodeset, a novel and more efficient itemset representation, for mining frequent itemsets. Based on the DiffNodeset structure, we present an efficient algorithm, named dFIN, to mining frequent itemsets. To achieve high efficiency, dFIN finds frequent itemsets using a set-enumeration tree with a hybrid search strategy and directly enumerates frequent itemsets without candidate generation under some case. For evaluating the performance of dFIN, we have conduct extensive experiments to compare it against with existing leading algorithms on a variety of real and synthetic datasets. The experimental results show that dFIN is significantly faster than these leading algorithms.Comment: 22 pages, 13 figure

    Efficiently mining frequent itemsets from very large databases

    Get PDF
    Efficient algorithms for mining frequent itemsets are crucial for mining association rules and for other data mining tasks. Methods for mining frequent itemsets and for iceberg data cube computation have been implemented using a prefix-tree structure, known as a FP-tree, for storing compressed frequency information. Numerous experimental results have demonstrated that these algorithms perform extremely well. In this thesis we present a novel FP-array technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree based algorithms. The technique works especially well for sparse datasets. We then present new algorithms for mining all frequent itemsets, maximal frequent itemsets, and closed frequent item-sets. The algorithms use the FP-tree data structure in combination with the FP-array technique efficiently, and incorporate various optimization techniques. In the algorithm for mining maximal frequent itemsets, a variant FP-tree data structure, called a MFI-tree, and an efficient maximality-checking approach are used. Another variant FP-tree data structure, called a CFI-tree, and an efficient closedness-testing approach are also given in the algorithm for mining closed frequent itemsets. Experimental results show that our methods outperform the existing methods in not only the speed of the algorithms, but also their memory consumption and their scalability. We also notice that most algorithms for mining frequent itemsets assume that the main memory is large enough for the data structures used in the mining, and very few efficient algorithms deal with the cases when the database is very large or the minimum support is very low. We thus investigate approaches to mining frequent itemsets when data structures are too large to fit in main memory. Several divide-and-conquer algorithms are presented for mining from disks. Many novel techniques are introduced. Experimental results show that the techniques reduce the required disk accesses by orders of magnitude, and enable truly scalable data mining

    Comparison of different algorithms for exploting the hidden trends in data sources

    Get PDF
    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2003Includes bibliographical references (leaves: 92-97)Text in English; Abstract: Turkish and English97 leavesThe growth of large-scale transactional databases, time-series databases and other kinds of databases has been giving rise to the development of several efficient algorithms that cope with the computationally expensive task of association rule mining.In this study, different algorithms, Apriori, FP-tree and CHARM, for exploiting the hidden trends such as frequent itemsets, frequent patterns, closed frequent itemsets respectively, were discussed and their performances were evaluated. The perfomances of the algorithms were measured at different support levels, and the algorithms were tested on different data sets (on both synthetic and real data sets). The algorihms were compared according to their, data preparation performances, mining performance, run time performances and knowledge extraction capabilities.The Apriori algorithm is the most prevalent algorithm of association rule mining which makes multiple passes over the database aiming at finding the set of frequent itemsets for each level. The FP-Tree algorithm is a scalable algorithm which finds the crucial information as regards the complete set of prefix paths, conditional pattern bases and frequent patterns by using a compact FP-Tree based mining method. The CHARM is a novel algorithm which brings remarkable improvements over existing association rule mining algorithms by proving the fact that mining the set of closed frequent itemsets is adequate instead of mining the set of all frequent itemsets.Related to our experimental results, we conclude that the Apriori algorithm demonstrates a good performance on sparse data sets. The Fp-tree algorithm extracts less association in comparison to Apriori, however it is completelty a feasable solution that facilitates mining dense data sets at low support levels. On the other hand, the CHARM algorithm is an appropriate algorithm for mining closed frequent itemsets (a substantial portion of frequent itemsets) on both sparse and dense data sets even at low levels of support

    An overview of data structures and algorithms: case study of us in the vector-space model and mining off requentitem sets using the apriori algorithm

    Get PDF
    In this paper, we review some commonly used data structures and algorithms. We then review two important problems: the creation of the vector-space model that is widely used in the design of information retrieval systems, and the mining of frequent itemsets using the apriori algorithm. We consider two variations of the apriori algorithm: the first is the classical algorithm which computes candidate k-itemsets by first joining frequent (k-1)-itemsets to themselves, and applying the apriori property to prune the generated candidate k-itemsets; the second avoids the join stage in the classical algorithm, and instead, generates candidate k-itemsets directly from rows of the transactions database, followed by application of the apriori property to prune each itemset so determined. Finally, we illustrate appropriate data structures and algorithms that when put together, provide efficient implementations of our solution to the problems mentioned.Keywords: data structures, algorithms, vector-space model, frequent itemsets mining, apriori algorith

    CLOLINK: An Adapted Algorithm for Mining Closed Frequent Itemsets

    Get PDF
    Mining of the complete set of frequent itemsets will lead to a huge number of itemsets. Fortunately, this problem can be reduced to the mining of closed frequent itemsets, which results in a much smaller number of itemsets. Methods for efficient mining of closed frequent itemsets have been studied extensively by many researchers using various strategies to prove their efficiencies such as Apriori-likemethods, FP growth algorithms, Tree projection and so on. However, when mining databases, these methods still encounter some performance bottlenecks like processing time, storage space and so on. This paper integrates the advantages of the strategies of H-Mine, a memory efficient algorithmfor mining frequent itemsets. The study proposes an algorithm named CLOLINK, which makes use of a compact data structure named L struct that links the items in the database dynamically during the mining process. An extensive experimental evaluation of the approach on real databases shows a better performance over the previous methods in mining closed frequent itemsets

    Knowledge, false beliefs and fact-driven perceptions of Muslims in Australia: a national survey

    Full text link
    Mining frequent itemsets is one of the main problems in data mining. Much effort went into developing efficient and scalable algorithms for this problem. When the support threshold is set too low, however, or the data is highly correlated, the number of frequent itemsets can become too large, independently of the algorithm used. Therefore, it is often more interesting to mine a reduced collection of interesting itemsets, i.e., a condensed representation. Recently, in this context, the non-derivable itemsets were proposed as an important class of itemsets. An itemset is called derivable when its support is completely determined by the support of its subsets. As such, derivable itemsets represent redundant information and can be pruned from the collection of frequent itemsets. It was shown both theoretically and experimentally that the collection of non-derivable frequent itemsets is in general much smaller than the complete set of frequent itemsets. A breadth-first, Apriori-based algorithm, called NDI, to find all non-derivable itemsets was proposed. In this paper we present a depth-first algorithm, dfNDI, that is based on Eclat for mining the non-derivable itemsets. dfNDI is evaluated on real-life datasets, and experiments show that dfNDI outperforms NDI with an order of magnitude.
    corecore