
    Mining high utility sequential patterns

    University of Technology Sydney. Faculty of Engineering and Information Technology.
    Sequential pattern mining refers to the identification of frequent subsequences in sequence databases as patterns. It provides an effective way to analyze sequential data. The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant. In the last two decades, researchers have proposed many techniques and algorithms for extracting frequent sequential patterns, in which the downward closure property (also known as the Apriori property) plays a fundamental role. At the same time, the relative importance of each item has been introduced into frequent pattern mining, and “high utility itemset mining” has been proposed. Instead of selecting high-frequency patterns, utility-based methods extract itemsets with high utilities, and many algorithms and strategies have been proposed. These methods can only process itemsets in the utility framework.
    However, all the above methods suffer from the following common problems to varying extents: 1) Sometimes, most of the frequent patterns may not be informative for business decision-making, since they do not show business value and impact. 2) Even if an algorithm considers business impact (namely utility), it can only obtain high utility sequences based on a given minimum utility threshold, so it is very difficult for users to specify an appropriate minimum utility and to directly obtain the most valuable patterns. 3) An algorithm in the utility framework may generate a large number of patterns, many of which may be redundant.
    Although high utility sequential pattern mining is essential, discovering the patterns is challenging for the following reasons: 1) The downward closure property does not hold in utility-based sequence mining. This means that most existing algorithms cannot be directly transferred, e.g. from frequent sequential pattern mining to high utility sequential pattern mining. Furthermore, compared to high utility itemset mining, utility-based sequence analysis faces a critical combinatorial explosion and computational complexity caused by the ordering of sequential elements (itemsets). 2) In the top-k setting, since the minimum utility is not given in advance, the algorithm essentially starts searching from 0 minimum support. This not only incurs very high computational costs, but also raises the challenge of how to raise the minimum threshold without missing any top-k high utility sequences. 3) Due to the fundamental difference, incorporating the traditional closure concept into high utility sequential pattern mining makes the outcome patterns irreversibly lossy and no longer recoverable, as explained in the following chapters. Therefore, it is exceedingly challenging to address the above issues by designing a novel representation for high utility sequential patterns.
    To address these research limitations and challenges, this thesis proposes a high utility sequential pattern mining framework, together with both a threshold-based and a top-k-based mining algorithm. Furthermore, a compact and lossless representation of utility-based sequences is presented, and an efficient algorithm is provided to mine such patterns. Chapter 2 thoroughly reviews related work in frequent sequential pattern mining and high utility itemset/sequence mining. Chapter 3 incorporates utility into sequential pattern mining and defines a generic framework for high utility sequence mining. Two efficient algorithms, namely USpan and USpan+, are presented to mine high utility sequential patterns. In USpan and USpan+, we introduce the lexicographic quantitative sequence tree to extract the complete set of high utility sequences, and design concatenation mechanisms for calculating the utility of a node and its children, with three effective pruning strategies. Chapter 4 proposes a novel framework called top-k high utility sequential pattern mining to tackle the problem of specifying a minimum utility threshold. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify top-k high utility sequential patterns without a minimum utility. In addition, three effective features are introduced to handle the efficiency problem, including two strategies for raising the threshold and one pruning strategy for filtering unpromising items. Chapter 5 proposes a novel concise framework to discover US-closed (Utility Sequence closed) high utility sequential patterns, with a theoretical proof that it is a lossless representation of high-utility patterns. An efficient algorithm named CloUSpan is introduced to extract the US-closed patterns. Two effective strategies are used to enhance the performance of CloUSpan.
    All of the algorithms are evaluated on both synthetic and real datasets, and their performance, including running time and memory consumption, is compared. Furthermore, the utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential patterns provide insightful knowledge for users.
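
    The claim that the downward closure property fails for utility can be illustrated on a toy example. The sketch below is not USpan or any algorithm from the thesis. It assumes single-item sequence elements, hypothetical unit profits and quantities, and takes the utility of a pattern in a sequence to be the maximum over its order-preserving occurrences.

```python
from itertools import combinations

# Hypothetical unit profits (external utilities); not taken from the thesis.
PROFIT = {"a": 1, "b": 5, "c": 2}

# Hypothetical toy database: each sequence is an ordered list of (item, quantity).
DATABASE = [
    [("a", 1), ("b", 2), ("c", 1)],
    [("a", 2), ("c", 3), ("b", 1)],
    [("b", 1), ("a", 1)],
]


def utility_in_sequence(pattern, sequence):
    """Maximum utility over all order-preserving occurrences; 0 if there is none."""
    best = 0
    for positions in combinations(range(len(sequence)), len(pattern)):
        if all(sequence[p][0] == item for p, item in zip(positions, pattern)):
            total = sum(sequence[p][1] * PROFIT[sequence[p][0]] for p in positions)
            best = max(best, total)
    return best


def utility(pattern, database=DATABASE):
    """Utility of a pattern in the database: sum of its per-sequence utilities."""
    return sum(utility_in_sequence(pattern, seq) for seq in database)


def support(pattern, database=DATABASE):
    """Number of sequences containing the pattern (utilities here are positive)."""
    return sum(1 for seq in database if utility_in_sequence(pattern, seq) > 0)


# Support can only shrink when the pattern grows ...
print(support(("a",)), support(("a", "b")))   # 3 2
# ... but utility can grow, so a low-utility prefix may have a high-utility extension.
print(utility(("a",)), utility(("a", "b")))   # 4 18
```

    With a minimum utility of 10, the pattern (a) would be discarded while its extension (a, b) qualifies, which is why support-style pruning cannot be reused unchanged.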

    An efficient closed frequent itemset miner for the MOA stream mining system

    Mining itemsets is a central task in data mining, both in the batch and the streaming paradigms. While robust, efficient, and well-tested implementations exist for batch mining, hardly any publicly available equivalent exists for the streaming scenario. The lack of an efficient, usable tool for the task hinders its use by practitioners and makes it difficult to assess new research in the area. To alleviate this situation, we review the algorithms described in the literature, and implement and evaluate the IncMine algorithm by Cheng, Ke, and Ng (2008) for mining frequent closed itemsets from data streams. Our implementation works on top of the MOA (Massive Online Analysis) stream mining framework to ease its use and integration with other stream mining tasks. We provide a rigorous PAC-style analysis of the quality of the output of IncMine as a function of its parameters; this type of analysis is rare for pattern mining algorithms. As a by-product, the analysis shows how one of the user-provided parameters in the original description can be removed entirely while retaining the performance guarantees. Finally, we experimentally confirm on both synthetic and real data the excellent performance of the algorithm, as reported in the original paper, and its ability to handle concept drift.
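
    For readers unfamiliar with the term, a frequent itemset is closed when no proper superset has the same support. The brute-force batch sketch below only illustrates that definition on hypothetical transactions. It is not IncMine, does not handle streams, and does not use the MOA API.

```python
from itertools import combinations

# Hypothetical toy transactions, for illustration only.
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def closed_frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets, keep the frequent ones, then keep the closed ones."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                frequent[candidate] = count
    # An itemset is closed if no proper superset has the same support.
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(itemset < other and count == other_count
                   for other, other_count in frequent.items())
    }


closed = closed_frequent_itemsets(TRANSACTIONS, min_support=2)
for itemset, count in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

    On this toy data, {b}, {c}, and {b, c} are frequent but not closed, because {a, b}, {a, c}, and {a, b, c} occur in exactly the same transactions.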

    A pattern mining approach for information filtering systems

    It is a big challenge to clearly identify the boundary between positive and negative streams for information filtering systems. Several attempts have used negative feedback to address this challenge; however, there are two issues in using negative relevance feedback to improve the effectiveness of information filtering. The first is how to select constructive negative samples in order to reduce the space of negative documents. The second is how to decide which noisy extracted features should be updated based on the selected negative samples. This paper proposes a pattern mining based approach to select some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on the RCV1 data collection, and substantial experiments show that the proposed approach achieves encouraging performance, which is also consistent for adaptive filtering.