
    An efficient closed frequent itemset miner for the MOA stream mining system

    Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm, on both synthetic and real data, the excellent performance of the algorithm reported in the original paper, and its ability to handle concept drift.
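    The abstract above centers on frequent closed itemsets over a sliding window. As a rough, self-contained sketch of that core notion (not the IncMine algorithm itself, whose implementation lives inside MOA), the Python below brute-forces the frequent closed itemsets of a single window; an itemset is closed when no proper superset has the same support. The function names, the windowing, and the toy data are illustrative assumptions.

        from itertools import combinations
        from collections import Counter

        def frequent_closed_itemsets(window, min_rel_support):
            """Brute-force the frequent closed itemsets of one sliding window.

            window          -- list of transactions, each a frozenset of items
            min_rel_support -- minimum support as a fraction of the window size
            """
            n = len(window)
            min_count = max(1, int(min_rel_support * n))

            # Count every itemset that occurs as a subset of some transaction.
            counts = Counter()
            for transaction in window:
                items = sorted(transaction)
                for k in range(1, len(items) + 1):
                    for combo in combinations(items, k):
                        counts[frozenset(combo)] += 1

            frequent = {s: c for s, c in counts.items() if c >= min_count}

            # Closed: no proper superset has the same support.
            return {s: c for s, c in frequent.items()
                    if not any(s < t and frequent[t] == c for t in frequent)}

        # Re-running (or, as IncMine does, incrementally updating) this on each
        # new window is what exposes concept drift in the stream.
        window = [frozenset("ab"), frozenset("abc"), frozenset("bc"), frozenset("ab")]
        print(frequent_closed_itemsets(window, min_rel_support=0.5))
        # closed frequent itemsets: {b}: 4, {a, b}: 3, {b, c}: 2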

    New Approaches to Frequent and Incremental Frequent Pattern Mining

    Data Mining (DM) is a process for extracting interesting patterns from large volumes of data and is one of the crucial steps in Knowledge Discovery in Databases (KDD). Its methods mainly fall into predictive and descriptive models; descriptive models look for patterns, rules, relationships, and associations within data. One of the descriptive methods is association rule analysis, which captures the co-occurrence of items or events and is commonly used in market basket analysis. An association rule has the form X → Y and states that X and Y co-occur with a given level of support and confidence. Association rule mining is a common technique for discovering interesting frequent patterns in large datasets acquired in various application domains. With petabytes of data finding their way into data storage every day, many researchers have sought efficient methods for analyzing these large datasets, and many algorithms have been proposed for finding frequent patterns. The search space explodes combinatorially as the size of the source data increases, and simply using more powerful computers, or even supercomputers, to handle ever-increasing datasets is not sufficient. Hence, incremental algorithms have been developed to improve the efficiency of frequent pattern mining. One of the challenges of frequent itemset mining is the long running time of the algorithms, whose two major costs are the number of database scans and the number of candidates generated; the latter requires memory, and the more candidates there are, the more memory is needed, and when the candidates do not fit in memory, page swapping occurs and further increases the running time. In this dissertation we propose a new implementation of the Apriori algorithm, NCLAT (Near Candidate-less Apriori with Tidlists), which scans the database only once and creates candidates only for level one (1-itemsets), i.e., one candidate per unique item in the database. In addition, we show how the results depend on the choice of data structures (probabilistic or not), whether the datasets are represented horizontally or vertically, how counting is done, and whether the algorithms run sequentially or in parallel. We implement and explore the incremental algorithm UWEP with both single-threaded and parallel computation, fix a minor bug in UWEP, and create a more efficient version, UWEP2, which reduces the number of candidates created and the number of database scans. We run all of our tests against three datasets with different characteristics at different minimum support levels, and we present results for both frequent and incremental frequent itemset mining and compare them to each other. While a great deal of work has been done on frequent itemset mining over structured data, very little has been done for unstructured data. We therefore created a new hybrid pattern search algorithm, Double-Hash, which performed better in all of our test scenarios than the known pattern search algorithms and can potentially be used for frequent itemset mining on unstructured data in the future; we present our work and test results on this as well.
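    The dissertation above leans on tidlists and vertical data layouts. The Python fragment below is a minimal illustration of that representation, not the NCLAT implementation: a single scan of the database builds a tidlist per item, after which the support of any itemset, and the confidence of any rule X → Y, is obtained by tidlist intersection with no further database scans. The helper names and the toy data are assumptions made for the example.

        from functools import reduce

        def build_tidlists(transactions):
            """One pass over the database: item -> set of transaction ids."""
            tidlists = {}
            for tid, items in enumerate(transactions):
                for item in items:
                    tidlists.setdefault(item, set()).add(tid)
            return tidlists

        def support(itemset, tidlists):
            """Support of an itemset = size of the intersection of its tidlists."""
            return len(reduce(set.intersection, (tidlists[i] for i in itemset)))

        def confidence(antecedent, consequent, tidlists):
            """Confidence of the association rule antecedent -> consequent."""
            return support(antecedent | consequent, tidlists) / support(antecedent, tidlists)

        transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}]
        tidlists = build_tidlists(transactions)
        print(support({"bread", "milk"}, tidlists))       # 2
        print(confidence({"milk"}, {"bread"}, tidlists))  # 1.0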

    An Algorithm for Generating Non-Redundant Sequential Rules for Medical Time Series Data

    In this paper, an algorithm for generating non-redundant sequential rules for medical time series data is designed. This study is the continuation of my previous study, "An Algorithm for Mining Closed Weighted Sequential Patterns with Flexing Time Interval for Medical Time Series Data" [25]. In that work, the sequence weight for each sequence was calculated from the time intervals between the itemsets. Candidate sequences were then generated with flexible time intervals, frequent sequential patterns were computed with the aid of the proposed support measure, and those patterns were subjected to a closure-checking process to filter out the closed sequential patterns with flexible time intervals. The methodology was shown to produce the necessary sequential patterns, and it constructed 23.2% fewer closed sequential patterns than sequential patterns. In the present study, sequential rules are generated by computing the confidence of each rule derived from the closed sequential patterns. The resulting closed sequential rules are then subjected to a non-redundancy check, which yields the final set of non-redundant weighted closed sequential rules with flexible time intervals. This study produces 172.37% fewer non-redundant sequential rules than sequential rules.
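    The rule generation step described above (split a closed sequential pattern into a prefix and a suffix, compute the confidence of the resulting rule, and keep it if it clears a threshold) can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: sequence weights, flexible time intervals, and the non-redundancy check are omitted, and all names are hypothetical.

        def generate_sequential_rules(pattern_supports, min_conf):
            """pattern_supports maps a sequence (a tuple of itemsets) to its support.

            For a pattern <s1 ... sn> and split point k, the rule
            <s1 ... sk> => <s(k+1) ... sn> has confidence
            support(whole pattern) / support(prefix), computed here only when
            the prefix itself appears in pattern_supports.
            """
            rules = []
            for pattern, sup in pattern_supports.items():
                for k in range(1, len(pattern)):
                    prefix, suffix = pattern[:k], pattern[k:]
                    if prefix in pattern_supports:
                        conf = sup / pattern_supports[prefix]
                        if conf >= min_conf:
                            rules.append((prefix, suffix, conf))
            return rules

        patterns = {
            (frozenset({"fever"}),): 10,
            (frozenset({"fever"}), frozenset({"rash"})): 6,
        }
        print(generate_sequential_rules(patterns, min_conf=0.5))
        # one rule: <{fever}> => <{rash}> with confidence 0.6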

    Frequent Pattern Mining with Closeness Considerations: Current State of the Art

    Owing to the rising importance of frequent pattern mining in data mining research, tremendous progress has been made in areas ranging from frequent itemset mining in transaction databases to numerous newer research frontiers. This article discusses the current state of frequent pattern mining and potential research directions. It is our strong belief that, with research in frequent pattern mining for data analysis steadily increasing, the field will provide a strong foundation for data mining methodologies and their applications, and may prove a milestone for data mining applications in the near future.

    Mining Closed Itemsets for Coherent Rules: An Inference Analysis Approach

    Past observations have shown that frequent itemset mining algorithms should mine the closed itemsets, since the result is a compact yet complete result set obtained with higher efficiency. However, most recent closed itemset mining algorithms rely on a candidate maintenance-and-test paradigm, which is expensive in both runtime and space usage when the support threshold is low or the itemsets grow long. Here we present PEPP with inference analysis, an approach for mining closed sequences for coherent rules without candidate maintenance. It implements a new sequence closure-checking scheme with inference analysis based on sequence graph projection, using an approach called Parallel Edge Projection and Pruning, abbreviated PEPP. We describe a novel inference analysis approach to prune patterns that tend to derive coherent rules. A thorough evaluation on sparse and dense real-life data sets shows that PEPP with inference analysis outperforms older algorithms, using less memory and running faster than the algorithms most frequently cited in the literature.

    Efficient Closed Pattern Mining in the Presence of Tough Block Constraints

    In recent years, various constrained frequent pattern mining problem formulations and associated algorithms have been developed that enable the user to specify itemset-based constraints that better capture the underlying application requirements and characteristics. In this paper we introduce a new class of block constraints that determine the significance of an itemset pattern by considering the dense block formed by the pattern's items and its associated set of transactions. Block constraints provide a natural framework in which a number of important problems can be specified, and they make it possible to solve numerous problems on binary and real-valued datasets. However, developing computationally efficient algorithms to find patterns satisfying block constraints poses a number of challenges: unlike the different itemset-based constraints studied earlier, block constraints are tough, as they are neither anti-monotone, monotone, nor convertible. To overcome this problem, we introduce a new class of pruning methods that significantly reduce the overall search space and make it possible to develop computationally efficient block constraint mining algorithms. We present an algorithm called CBMiner that takes advantage of these pruning methods to find the closed itemsets that satisfy the block constraints. Our extensive performance study shows that CBMiner generates a more concise result set and can be order(s) of magnitude faster than traditional frequent closed itemset mining algorithms.
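    As a reading aid for the block-constraint idea above, the Python sketch below (my own illustration, not CBMiner) evaluates one plausible block constraint: the itemset together with its supporting transactions forms a block in a real-valued transaction-by-item matrix, and the constraint is checked on the total value of that block. With non-negative values such a measure is neither anti-monotone nor monotone, which is what makes the pruning methods in the paper non-trivial.

        def block_value(itemset, database):
            """Sum of cell values over the block (supporting transactions x items).

            database: list of dicts, each mapping item -> real value for one transaction.
            """
            supporting = [t for t in database if all(i in t for i in itemset)]
            return sum(t[i] for t in supporting for i in itemset)

        def satisfies_block_constraint(itemset, database, min_block_value):
            return block_value(itemset, database) >= min_block_value

        db = [
            {"a": 2.0, "b": 1.0},
            {"a": 1.5, "b": 0.5, "c": 3.0},
            {"b": 2.0},
        ]
        print(block_value({"a", "b"}, db))                      # 5.0
        print(satisfies_block_constraint({"a", "b"}, db, 4.0))  # True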

    Efficiently Using Prime-Encoding for Mining Frequent Itemsets in Sparse Data

    In the data mining field, data representation is one of the major factors affecting mining algorithm scalability, and Mining Frequent Itemsets (MFI) is a problem heavily affected by it. The vertical approach is one of the successful data representations adopted for the MFI problem; its main advantage is support for fast frequency counting via joining operations. Recently, an encoding method called prime-encoding was proposed as an enhancement of the vertical approach [10]. The performance study in [10] confirmed the advantage of prime-encoding-based vertical mining of frequent sequences over other vertical and horizontal approaches in terms of space and time. Although sequence mining is more general than itemset mining, this paper presents prime-encoding-based vertical mining of frequent itemsets with new optimizations and a new re-encoding method that further improve memory use and speed. The experimental results show that prime-encoding-based vertical itemset mining is well suited to high-dimensional sparse data.
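    The prime-encoding idea referenced above admits a very small illustration (this sketch is my own and omits the paper's re-encoding and further optimizations): assign a distinct prime to each item, encode a transaction as the product of its items' primes, and the test "itemset X occurs in transaction T" becomes a single divisibility check on the codes.

        from math import prod

        PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

        def encode(transactions):
            """Assign primes to items and encode each transaction as a product."""
            items = sorted({i for t in transactions for i in t})
            prime_of = dict(zip(items, PRIMES))  # assumes no more items than primes
            codes = [prod(prime_of[i] for i in t) for t in transactions]
            return prime_of, codes

        def support(itemset, prime_of, codes):
            """Count transactions whose code is divisible by the itemset's code."""
            key = prod(prime_of[i] for i in itemset)
            return sum(code % key == 0 for code in codes)

        transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
        prime_of, codes = encode(transactions)
        print(support({"a", "b"}, prime_of, codes))  # 2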