61,717 research outputs found

    Mining high utility sequential patterns

    University of Technology Sydney. Faculty of Engineering and Information Technology.
    Sequential pattern mining refers to the identification of frequent subsequences in sequence databases as patterns. It provides an effective way to analyze sequential data. The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant. In the last two decades, researchers have proposed many techniques and algorithms for extracting frequent sequential patterns, in which the downward closure property (also known as the Apriori property) plays a fundamental role. At the same time, the relative importance of each item has been introduced into frequent pattern mining, and “high utility itemset mining” has been proposed. Instead of selecting high-frequency patterns, utility-based methods extract itemsets with high utilities, and many algorithms and strategies have been proposed. These methods can only process itemsets in the utility framework. However, all the above methods suffer from the following common issues to varying extents: 1) Most frequent patterns may not be informative for business decision-making, since they do not show business value and impact. 2) Even if an algorithm considers the business impact (namely utility), it can only obtain high utility sequences based on a given minimum utility threshold; it is very difficult for users to specify an appropriate minimum utility and to directly obtain the most valuable patterns. 3) An algorithm in the utility framework may generate a large number of patterns, many of which may be redundant.
    Although high utility sequential pattern mining is essential, discovering the patterns is challenging for the following reasons: 1) The downward closure property does not hold in utility-based sequence mining. This means that most of the existing algorithms cannot be directly transferred, e.g. from frequent sequential pattern mining to high utility sequential pattern mining. Furthermore, compared to high utility itemset mining, utility-based sequence analysis faces critical combinatorial explosion and computational complexity caused by the ordering of sequential elements (itemsets). 2) Since the minimum utility is not given in advance, the algorithm essentially starts searching from a minimum utility of 0. This not only incurs very high computational costs, but also raises the challenge of how to raise the minimum threshold without missing any top-k high utility sequences. 3) Due to this fundamental difference, incorporating the traditional closure concept into high utility sequential pattern mining makes the resulting patterns irreversibly lossy and no longer recoverable, as argued in the following chapters. Therefore, it is exceedingly challenging to address the above issues by designing a novel representation for high utility sequential patterns.
    To address these research limitations and challenges, this thesis proposes a high utility sequential pattern mining framework, together with both a threshold-based and a top-k-based mining algorithm. Furthermore, a compact and lossless representation of utility-based sequences is presented, and an efficient algorithm is provided to mine such patterns. Chapter 2 thoroughly reviews related work in frequent sequential pattern mining and high utility itemset/sequence mining. Chapter 3 incorporates utility into sequential pattern mining and defines a generic framework for high utility sequence mining. Two efficient algorithms, namely USpan and USpan+, are presented to mine high utility sequential patterns. In USpan and USpan+, we introduce the lexicographic quantitative sequence tree to extract the complete set of high utility sequences and design concatenation mechanisms for calculating the utility of a node and its children, with three effective pruning strategies. Chapter 4 proposes a novel framework called top-k high utility sequential pattern mining to tackle the threshold-selection problem. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify top-k high utility sequential patterns without a minimum utility threshold. In addition, three effective features are introduced to handle the efficiency problem: two strategies for raising the threshold and one pruning strategy for filtering unpromising items. Chapter 5 proposes a novel concise framework to discover US-closed (Utility Sequence closed) high utility sequential patterns, with a theoretical proof that it is a lossless representation of high utility patterns. An efficient algorithm named CloUSpan is introduced to extract the US-closed patterns. Two effective strategies are used to enhance the performance of CloUSpan.
    All of the algorithms are evaluated on both synthetic and real datasets. Their performance, including running time and memory consumption, is compared. Furthermore, the utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential patterns provide insightful knowledge for users.
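
The abstract's point that the downward closure property fails under utility can be illustrated with a small sketch. It uses itemsets rather than sequences to stay short, and it assumes the common definition of itemset utility (purchased quantity times unit profit, summed over the transactions that contain the whole itemset). The function name, the toy data, and the threshold are hypothetical and are not taken from the thesis.

```python
def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over every transaction containing the itemset.

    Assumed definition for illustration; each transaction maps item -> quantity.
    """
    items = set(itemset)
    total = 0
    for quantities in transactions:
        if items.issubset(quantities):  # iterating a dict yields its keys
            total += sum(quantities[i] * unit_profit[i] for i in items)
    return total


# Hypothetical toy data: item -> quantity per transaction, item -> unit profit.
transactions = [{"a": 1, "b": 2}, {"a": 2}, {"a": 1, "b": 1}]
unit_profit = {"a": 1, "b": 5}
min_utility = 10  # hypothetical threshold

u_a = itemset_utility({"a"}, transactions, unit_profit)
u_ab = itemset_utility({"a", "b"}, transactions, unit_profit)

# {"a"} falls below the threshold while its superset {"a", "b"} exceeds it,
# so pruning by "every subset must qualify" would discard a valid pattern.
print(u_a, u_ab, u_a < min_utility <= u_ab)
```

Support, by contrast, can only shrink as items are added, which is why Apriori-style pruning works for frequency but not for utility.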

    Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

    The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call the d-index: the maximum integer d such that the dataset contains at least d transactions of length at least d, none of which is a superset of or equal to another. We show that this bound is strict for a large class of datasets. Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the proceedings of ECML PKDD 201
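
The d-index definition in this abstract can be made concrete with a brute-force sketch. It follows the definition literally and tries every group of d candidate transactions, so it is exponential in the worst case and only suitable for tiny examples. It is not the computation method used in the paper, and the function names are hypothetical.

```python
from itertools import combinations


def is_antichain(sets):
    """True if no set in the group is a subset of (or equal to) another."""
    return all(not (a <= b or b <= a) for a, b in combinations(sets, 2))


def d_index(transactions):
    """Largest d such that at least d transactions of length >= d exist,
    none of which is a superset of or equal to another (brute force)."""
    # Equal transactions can never appear together, so keep one copy of each.
    distinct = {frozenset(t) for t in transactions}
    max_len = max((len(t) for t in distinct), default=0)
    for d in range(min(max_len, len(distinct)), 0, -1):
        candidates = [t for t in distinct if len(t) >= d]
        if len(candidates) < d:
            continue
        if any(is_antichain(group) for group in combinations(candidates, d)):
            return d
    return 0


# Hypothetical toy dataset: the three 3-item transactions are pairwise
# incomparable, so d = 3 is achievable.
dataset = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"a", "b"}, {"a"}]
print(d_index(dataset))
```

Dropping the antichain check gives a looser quantity that can be computed in the same way as an h-index, which may be preferable when the dataset is large.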

    New probabilistic interest measures for association rules

    Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start by presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
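
For readers unfamiliar with the two baseline measures discussed in this abstract, the sketch below computes confidence and lift using their standard textbook definitions. The helper names and the toy baskets are hypothetical. The paper's new measures, hyper-lift and hyper-confidence, are not implemented here.

```python
from fractions import Fraction


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items.issubset(t))
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs, transactions):
    """supp(lhs and rhs together) / supp(lhs); assumes supp(lhs) > 0."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """confidence(lhs -> rhs) / supp(rhs); assumes supp(rhs) > 0."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


# Hypothetical toy baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(confidence({"bread"}, {"milk"}, baskets))  # 2/3
print(lift({"bread"}, {"milk"}, baskets))        # 8/9
```

Exact fractions are used so that the values can be checked by hand. A lift below 1, as here, means the two items co-occur less often than independence would predict.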