
    Mining High Utility Patterns Over Data Streams

    Mining useful patterns from sequential data is a challenging topic in data mining. An important task for mining sequential data is sequential pattern mining, which discovers sequences of itemsets that frequently appear in a sequence database. In sequential pattern mining, the selection of sequences is generally based on the frequency/support framework. However, most of the patterns returned by sequential pattern mining may not be informative enough to business people and are not particularly related to a business objective. In view of this, high utility sequential pattern (HUSP) mining has recently emerged as a novel research topic in data mining. The main objective of HUSP mining is to extract valuable and useful sequential patterns from data by considering the utility of a pattern that captures a business objective (e.g., profit, user interest). In HUSP mining, the goal is to find sequences whose utility in the database is no less than a user-specified minimum utility threshold. Nowadays, many applications generate a huge volume of data in the form of data streams. A number of studies have been conducted on mining HUSPs, but they are mainly intended for non-streaming data and thus do not take data stream characteristics into consideration. Mining HUSPs from such data poses many challenges. First, it is infeasible to keep all streaming data in memory due to the high volume of data accumulated over time. Second, mining algorithms need to process the arriving data in real time with one scan of the data. Third, depending on the minimum utility threshold value, the number of patterns returned by a HUSP mining algorithm can be large and overwhelm the user. In general, it is hard for the user to determine the value for the threshold. Thus, algorithms that can find the most valuable patterns (i.e., top-k high utility patterns) are more desirable. Mining the most valuable patterns is interesting in both static data and data streams.
To address these research limitations and challenges, this dissertation proposes techniques and algorithms for mining high utility sequential patterns over data streams. We work on mining HUSPs over both a long portion of a data stream and a short period of time. We also work on how to efficiently identify the most significant high utility patterns (namely, the top-k high utility patterns) over data streams. In the first part, we explore a fundamental problem: how limited memory space can be used well to produce high-quality HUSPs over the entire data stream. An approximation algorithm, called MAHUSP, is designed; it employs memory-adaptive mechanisms that use a bounded portion of memory to efficiently discover HUSPs over the entire data stream. The second part of the dissertation presents a new sliding window-based algorithm to discover recent high utility sequential patterns over data streams. A novel data structure named HUSP-Tree is proposed to maintain the essential information for mining recent HUSPs. An efficient single-pass algorithm named HUSP-Stream is proposed to generate recent HUSPs from HUSP-Tree. The third part addresses the problem of top-k high utility pattern mining over data streams. Two novel methods, named T-HUDS and T-HUSP, for finding top-k high utility patterns over a data stream are proposed. T-HUDS discovers top-k high utility itemsets and T-HUSP discovers top-k high utility sequential patterns over a data stream. T-HUDS is based on a compressed tree structure, called HUDS-Tree, that can be used to efficiently find potential top-k high utility itemsets over data streams. T-HUSP incrementally maintains the content of top-k HUSPs in a data stream in a summary data structure, named TKList, and discovers top-k HUSPs efficiently. All of the algorithms are evaluated using both synthetic and real datasets. Performance measures, including running time, memory consumption, precision, recall and F-measure, are compared.
To show the effectiveness and efficiency of the proposed methods in real-life applications, the fourth part of this dissertation presents applications of one of the proposed methods (i.e., MAHUSP) to extract meaningful patterns from a real web clickstream dataset and a real biosequence dataset. The utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential pattern mining provides meaningful patterns in real-life applications.
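
The two selection criteria named in this abstract, a minimum utility threshold and a top-k cut, can be illustrated with a small brute-force sketch. It is not MAHUSP, T-HUDS, or T-HUSP: it works on itemsets rather than sequences, on a static toy database rather than a stream, and assumes the common definition of utility as quantity times unit profit. The data and all function names are hypothetical.

```python
from itertools import combinations
import heapq

# Hypothetical toy database: each transaction maps an item to its quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
# Hypothetical unit profit of each item (assumed positive).
UNIT_PROFIT = {"a": 5, "b": 2, "c": 1}


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over every transaction containing the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


def all_utilities(transactions, unit_profit):
    """Brute-force enumeration; keeps itemsets with positive utility."""
    items = sorted({item for t in transactions for item in t})
    utilities = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = itemset_utility(itemset, transactions, unit_profit)
            if u > 0:
                utilities[itemset] = u
    return utilities


def high_utility_itemsets(utilities, min_utility):
    """Keep itemsets whose utility is no less than the threshold."""
    return {s: u for s, u in utilities.items() if u >= min_utility}


def top_k_itemsets(utilities, k):
    """Return the k itemsets with the highest utility, no threshold needed."""
    return heapq.nlargest(k, utilities.items(), key=lambda kv: kv[1])


utilities = all_utilities(TRANSACTIONS, UNIT_PROFIT)
print(high_utility_itemsets(utilities, 16))
print(top_k_itemsets(utilities, 2))
```

The top-k variant spares the user from guessing a threshold, which is the motivation the abstract gives for it; the exhaustive enumeration here is only workable for tiny inputs.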

    Mining Frequent Item Sets Data Streams using "Éclat Algorithm"

    Frequent pattern mining is the process of mining sets of items or other patterns from a large database. The resulting frequent sets satisfy the minimum support threshold. A frequent pattern is a pattern that occurs frequently in a dataset. Association rule mining is defined as finding association rules that satisfy the predefined minimum support and confidence in a given database. If an item set is said to be frequent, that item set satisfies the minimum support threshold: it should appear in at least the minimum-support share of the transactions of that database. Discovering frequent item sets plays a very important role in mining association rules, sequence rules, web logs and many other interesting patterns among complex data. A data stream is a real-time, continuous, ordered sequence of items. It is an uninterrupted flow of a long sequence of data. Some real-time examples of data streams are sensor network data, telecommunication data, transactional data and scientific surveillance systems. These sources produce trillions of updates every day, so it is very difficult to store all of the data. In that case, some mining process is required. Data mining is the non-trivial process of identifying valid, original, potentially useful and ultimately understandable patterns in data. It is an extraction of hidden predictive information from a large database. There are many algorithms for finding frequent item sets. Among them, the Apriori algorithm is the first classical algorithm used to find frequent item sets. Apart from Apriori, many algorithms have been developed, but they are similar to Apriori: they are based on pruning and candidate generation, which takes more memory and time to find the frequent item sets. In this paper, we study how the Éclat algorithm is used in data streams to find frequent item sets. The Éclat algorithm does not require candidate generation.
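
As a rough illustration of the idea behind Éclat, the sketch below mines a small static transaction list using a vertical layout: each item maps to the set of transaction ids that contain it, and the support of a larger item set is obtained by intersecting those id sets instead of rescanning the database. This is a minimal batch version, not the streaming adaptation studied in the paper; the function name `eclat` and the sample data are assumptions made for the example.

```python
def eclat(transactions, min_support):
    """Return {itemset tuple: support count} for all frequent item sets."""
    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: list of (item, tidset of prefix + item), all frequent.
        for idx, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            suffix = []
            for other, other_tids in candidates[idx + 1:]:
                common = tids & other_tids  # support via set intersection
                if len(common) >= min_support:
                    suffix.append((other, common))
            if suffix:
                extend(itemset, suffix)

    initial = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), initial)
    return frequent


# Hypothetical sample data; min_support is an absolute count here.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = eclat(sample, min_support=2)
print(result)
```

With a minimum support of 2, the three single items and the three pairs are frequent, while {a, b, c} occurs in only one transaction and is dropped.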

    DRSP : Dimension Reduction For Similarity Matching And Pruning Of Time Series Data Streams

    Similarity matching and join of time series data streams have gained considerable relevance now that large volumes of streaming data are common. This process is widely applied in areas such as location tracking, sensor networks, and object positioning and monitoring. However, as the size of the data stream increases, the cost of retaining all the data to support similarity matching also increases. We develop a novel framework to address the following objectives. First, dimension reduction is performed in the preprocessing stage, where large stream data is segmented and reduced into a compact representation that retains all the crucial information, using a technique called Multi-level Segment Means (MSM). This reduces the space complexity associated with the storage of large time-series data streams. Second, the framework incorporates an effective similarity matching technique to analyze whether new data objects are symmetric to the existing data stream. Finally, a pruning technique filters out the pseudo data object pairs and joins only the relevant pairs. The computational cost for MSM is O(l*ni) and the cost for pruning is O(DRF*wsize*d), where DRF is the Dimension Reduction Factor. We have performed exhaustive experimental trials to show that the proposed framework is both efficient and competent in comparison with earlier works. Comment: 20 pages, 8 figures, 6 tables
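
The abstract does not spell out how MSM is computed, so the following is only one plausible reading, stated as an assumption: at level j the series is split into 2^(j-1) equal-width segments and each segment is replaced by its mean, giving a coarse-to-fine summary. The function names are hypothetical.

```python
def segment_means(series, n_segments):
    """Replace each of n_segments equal-width segments by its mean."""
    if n_segments <= 0 or len(series) % n_segments != 0:
        raise ValueError("series length must be a positive multiple of n_segments")
    width = len(series) // n_segments
    return [
        sum(series[i * width:(i + 1) * width]) / width
        for i in range(n_segments)
    ]


def multi_level_segment_means(series, levels):
    """Assumed layout: level j (1-based) holds 2**(j - 1) segment means."""
    return [segment_means(series, 2 ** (level - 1)) for level in range(1, levels + 1)]


# Hypothetical window of eight readings reduced to three levels.
window = [1, 2, 3, 4, 5, 6, 7, 8]
summary = multi_level_segment_means(window, levels=3)
print(summary)
```

Coarse levels are cheap to compare and can be checked first, with finer levels consulted only for pairs that survive, which is one way such a representation could support pruning.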

    An efficient closed frequent itemset miner for the MOA stream mining system

    Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a PAC-style rigorous analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare in pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm both on synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift. Postprint (published version)
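
For readers unfamiliar with the term, a frequent itemset is closed when no proper superset has the same support. The sketch below is not IncMine and does not use the MOA API; it is a brute-force filter over an already computed table of supports, with hypothetical names and data.

```python
def closed_itemsets(supports):
    """Keep itemsets for which no proper superset has equal support.

    supports: dict mapping frozenset itemsets to their support counts.
    """
    closed = {}
    for itemset, support in supports.items():
        has_equal_superset = any(
            itemset < other and support == other_support
            for other, other_support in supports.items()
        )
        if not has_equal_superset:
            closed[itemset] = support
    return closed


# Hypothetical supports: "a" never occurs without "b", so {a} is not closed.
supports = {
    frozenset("a"): 2,
    frozenset("b"): 3,
    frozenset("ab"): 2,
}
closed = closed_itemsets(supports)
print(closed)
```

Closed itemsets form a smaller set from which the support of every frequent itemset can still be recovered, which makes them attractive when memory is limited, as in the streaming setting.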

    Community Graph Sequence with Sequence Data of Network Structured Data
