157 research outputs found

    An efficient closed frequent itemset miner for the MOA stream mining system

    Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.
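
For readers unfamiliar with the term, the following is a minimal batch sketch of what "frequent closed itemset" means. It is not IncMine and does not use the MOA API; the function names `frequent_itemsets` and `closed_itemsets` and the toy transactions are assumptions made for illustration, and the brute-force enumeration is only practical for tiny inputs.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Count every non-empty itemset by brute force; keep those seen at least min_support times."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}

def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with the same support."""
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < other and c == oc for other, oc in frequent.items())
    }

# Hypothetical toy data: four transactions over the items a, b, c, d.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
```

Here `{a}` and `{b}` are frequent but not closed, because `{a, b}` occurs in exactly the same three transactions; the closed set is a smaller summary from which the supports of all frequent itemsets can be recovered.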

    Mining Recent Frequent Itemsets in Sliding Windows over Data Streams

    This paper considers the problem of mining recent frequent itemsets over data streams. As the data grows without limit at a rapid rate, it is hard to track changes in the frequent itemsets over data streams. We propose an efficient one-pass algorithm in sliding windows over data streams with an error bound guarantee. This algorithm does not need to refer to obsolete transactions when they are removed from the sliding window. It exploits a compact data structure to maintain potentially frequent itemsets so that it can output recent frequent itemsets at any time. Flexible queries for continuous transactions in the sliding window can be answered with an error bound guarantee.
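
The sliding-window setting can be illustrated with a naive exact baseline. This is a sketch under stated assumptions, not the algorithm of the paper: unlike the method described above, it keeps every transaction of the window in memory and revisits it on eviction, and it gives exact counts rather than an error-bounded approximation. The class name `SlidingWindowCounter` and the parameters `window_size` and `max_len` are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations

class SlidingWindowCounter:
    """Exact itemset counts over the last `window_size` transactions (naive baseline)."""

    def __init__(self, window_size, max_len=2):
        self.window_size = window_size
        self.max_len = max_len      # cap on itemset size, to limit enumeration
        self.window = deque()       # stored transactions, oldest first
        self.counts = Counter()     # itemset -> occurrences inside the window

    def _subsets(self, transaction):
        items = sorted(set(transaction))
        for size in range(1, min(self.max_len, len(items)) + 1):
            for subset in combinations(items, size):
                yield frozenset(subset)

    def add(self, transaction):
        """Insert a new transaction and evict the oldest one if the window is full."""
        self.window.append(transaction)
        self.counts.update(self._subsets(transaction))
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            for s in self._subsets(old):
                self.counts[s] -= 1
                if self.counts[s] == 0:
                    del self.counts[s]

    def frequent(self, min_support):
        """Return itemsets whose relative support in the current window is at least min_support."""
        threshold = min_support * len(self.window)
        return {s: c for s, c in self.counts.items() if c >= threshold}

# Hypothetical stream of four transactions with a window of three.
swc = SlidingWindowCounter(window_size=3)
for t in (["a", "b"], ["a", "c"], ["a", "b"], ["b", "c"]):
    swc.add(t)
result = swc.frequent(min_support=0.6)
```

The need to store and replay evicted transactions is exactly the cost that one-pass, compact-summary algorithms such as the one in this abstract aim to avoid.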

    Max-FISM: Mining (recently) maximal frequent itemsets over data streams using the sliding window model

    Frequent itemset mining from data streams is an important data mining problem with broad applications such as retail market data analysis, network monitoring, web usage mining, and stock market prediction. However, it is also a difficult problem due to the unbounded, high-speed and continuous characteristics of streaming data. Therefore, extracting frequent itemsets from more recent data can enhance the analysis of stream data. In this paper, we propose an efficient algorithm, called Max-FISM (Maximal-Frequent Itemsets Mining), for mining recent maximal frequent itemsets from a high-speed stream of transactions within a sliding window. According to our algorithm, whenever a new transaction is inserted in the current window, only its maximum itemset should be inserted into a prefix tree-based summary data structure called Max-Set for maintaining the number of independent appearances of each transaction in the current window. Finally, the set of recent maximal frequent itemsets is obtained from the current Max-Set. Experimental studies show that the proposed Max-FISM algorithm is highly efficient in terms of memory and time complexity for mining recent maximal frequent itemsets over high-speed data streams.
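
As a reminder of the definition only, a frequent itemset is maximal when none of its proper supersets is frequent. The sketch below filters an already computed collection of frequent itemsets; it does not implement Max-FISM or the Max-Set prefix tree, and the function name `maximal_itemsets` and the toy input are assumptions.

```python
def maximal_itemsets(frequent):
    """Return the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(s) for s in frequent]
    return {s for s in frequent if not any(s < other for other in frequent)}

# Hypothetical input: itemsets already known to be frequent in the current window.
frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}]
maximal = maximal_itemsets(frequent)
```

In this toy case the maximal itemsets are `{a, b}` and `{c}`; every other frequent itemset is a subset of one of them, which is why the maximal set is a compact (though support-lossy) summary.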

    Mining High Utility Patterns Over Data Streams

    Mining useful patterns from sequential data is a challenging topic in data mining. An important task for mining sequential data is sequential pattern mining, which discovers sequences of itemsets that frequently appear in a sequence database. In sequential pattern mining, the selection of sequences is generally based on the frequency/support framework. However, most of the patterns returned by sequential pattern mining may not be informative enough to business people and are not particularly related to a business objective. In view of this, high utility sequential pattern (HUSP) mining has recently emerged as a novel research topic in data mining. The main objective of HUSP mining is to extract valuable and useful sequential patterns from data by considering the utility of a pattern that captures a business objective (e.g., profit, user interest). In HUSP mining, the goal is to find sequences whose utility in the database is no less than a user-specified minimum utility threshold. Nowadays, many applications generate a huge volume of data in the form of data streams. A number of studies have been conducted on mining HUSPs, but they are mainly intended for non-streaming data and thus do not take data stream characteristics into consideration. Mining HUSPs from such data poses many challenges. First, it is infeasible to keep all streaming data in memory due to the high volume of data accumulated over time. Second, mining algorithms need to process the arriving data in real time with one scan of the data. Third, depending on the minimum utility threshold value, the number of patterns returned by a HUSP mining algorithm can be large and overwhelm the user. In general, it is hard for the user to determine the value for the threshold. Thus, algorithms that can find the most valuable patterns (i.e., top-k high utility patterns) are more desirable. Mining the most valuable patterns is interesting in both static data and data streams.
To address these research limitations and challenges, this dissertation proposes techniques and algorithms for mining high utility sequential patterns over data streams. We work on mining HUSPs over both a long portion of a data stream and a short period of time. We also work on how to efficiently identify the most significant high utility patterns (namely, the top-k high utility patterns) over data streams. In the first part, we explore a fundamental problem: how the limited memory space can be well utilized to produce high quality HUSPs over the entire data stream. An approximation algorithm, called MAHUSP, is designed which employs memory adaptive mechanisms to use a bounded portion of memory to efficiently discover HUSPs over the entire data stream. The second part of the dissertation presents a new sliding window-based algorithm to discover recent high utility sequential patterns over data streams. A novel data structure named HUSP-Tree is proposed to maintain the essential information for mining recent HUSPs. An efficient single-pass algorithm named HUSP-Stream is proposed to generate recent HUSPs from HUSP-Tree. The third part addresses the problem of top-k high utility pattern mining over data streams. Two novel methods, named T-HUDS and T-HUSP, for finding top-k high utility patterns over a data stream are proposed. T-HUDS discovers top-k high utility itemsets and T-HUSP discovers top-k high utility sequential patterns over a data stream. T-HUDS is based on a compressed tree structure, called HUDS-Tree, that can be used to efficiently find potential top-k high utility itemsets over data streams. T-HUSP incrementally maintains the content of top-k HUSPs in a data stream in a summary data structure, named TKList, and discovers top-k HUSPs efficiently. All of the algorithms are evaluated using both synthetic and real datasets. Their performance, including running time, memory consumption, precision, recall and F-measure, is compared.
In order to show the effectiveness and efficiency of the proposed methods in real-life applications, the fourth part of this dissertation presents applications of one of the proposed methods (i.e., MAHUSP) to extract meaningful patterns from a real web clickstream dataset and a real biosequence dataset. The utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential pattern mining provides meaningful patterns in real-life applications.

    Distributed context discovering for predictive modeling

    Click prediction has applications in various areas such as advertising, search and online sales. Usually user-intent information such as query terms and previous click history is used in click prediction. However, this information is not always available. For example, there are no queries from users on the webpages of content publishers, such as personal blogs. The available information for click prediction in this scenario is implicitly derived from users, such as visiting time and IP address. Thus, the existing approaches utilizing user-intent information may be inapplicable in this scenario, and the click prediction problem in this scenario remains unexplored to our knowledge. In addition, the challenges in handling skewed data streams also exist in prediction, since there is often heavy traffic on webpages and few visitors click on them. In this thesis, we propose to use the pattern-based classification approach to tackle the click prediction problem. Attributes in webpage visits are combined by a pattern mining algorithm to enhance their power in prediction. To make the pattern-based classification handle skewed data streams, we adopt a sliding window to capture recent data, and an undersampling technique to handle the skewness. As a side problem raised by the pattern-based approach, mining patterns from large datasets is addressed by a distributed pattern sampling algorithm that we propose. This algorithm shows its scalability in experiments. We validate our pattern-based approach in click prediction on a real-world dataset from a Dutch portal website. The experiments show our pattern-based approach can achieve an average AUC of 0.675 over a period of 36 days with a 5-day sized sliding window, which surpasses the baseline, a statically trained classification model without patterns, by 0.002. Besides, the average weighted F-measure of our approach is 0.009 higher than the baseline.
Therefore, our proposed approach can slightly improve classification performance; yet whether this improvement is worth deployment in real scenarios remains a question.

    A new data stream mining algorithm for interestingness-rich association rules

    Frequent itemset mining and association rule generation are challenging tasks on data streams. Even though various algorithms have been proposed to solve the issue, it has been found that frequency alone does not determine the significance or interestingness of the mined itemsets, and hence of the association rules. This has motivated algorithms that mine association rules based on utility, i.e., the proficiency of the mined rules. However, few algorithms in the literature deal with utility, as most of them focus on reducing the complexity of frequent itemset and association rule mining. Also, those few algorithms consider only the overall utility of the association rules and not the consistency of the rules throughout a defined number of periods. To solve this issue, this paper proposes an enhanced association rule mining algorithm. The algorithm introduces a new weightage validation into conventional association rule mining algorithms to validate the utility and its consistency in the mined association rules. The utility is validated by an integrated calculation of the cost/price efficiency of the itemsets and their frequency. The consistency validation is performed at every defined number of windows using the probability distribution function, assuming that the weights are normally distributed. Hence, the validated rules are frequent and utility-efficient, and their interestingness is distributed throughout the entire time period. The algorithm is implemented and the resulting rules are compared against the rules that can be obtained from conventional mining algorithms.