925 research outputs found

    Empirical Evaluations On Real And Synthetic Datasets State Of The Art Utility Mining Algorithms

    We consider the problem of top-k high utility itemset mining, where k is the desired number of high utility itemsets to be mined. Two efficient algorithms, TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase), are proposed for mining such itemsets without setting a minimum utility threshold. TKU is the first two-phase algorithm for mining top-k high utility itemsets; it combines five strategies (PE, NU, MD, MC and SE) to effectively raise the border minimum utility threshold and further prune the search space. TKO, in turn, is the first one-phase algorithm developed for top-k HUI mining; it incorporates the novel strategies RUC, RUZ and EPB to greatly improve its performance. The proposed algorithms scale well on large datasets, and their performance is close to the optimal case of the state-of-the-art two-phase and one-phase utility mining algorithms
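The idea of raising an internal threshold instead of asking the user for one can be illustrated with a minimal sketch. It is not TKU or TKO: the PE, NU, MD, MC, SE, RUC, RUZ and EPB strategies are not reproduced, candidates are enumerated exhaustively, and the toy transactions, profits and function names are all hypothetical. The sketch only shows how a size-k min-heap yields a threshold that starts at 0 and rises as better itemsets are found.

```python
import heapq
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
unit_profit = {"a": 5, "b": 2, "c": 1}


def utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over the transactions containing every item."""
    total = 0
    for t in transactions:
        if all(i in t for i in itemset):
            total += sum(t[i] * unit_profit[i] for i in itemset)
    return total


def top_k_itemsets(transactions, unit_profit, k):
    """Exhaustive top-k search that raises an internal threshold as it goes."""
    items = sorted(unit_profit)
    heap = []      # min-heap of (utility, itemset), never more than k entries
    threshold = 0  # internal minimum utility; no user-supplied value needed
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset, transactions, unit_profit)
            if len(heap) < k:
                heapq.heappush(heap, (u, itemset))
            elif u > heap[0][0]:
                heapq.heapreplace(heap, (u, itemset))
            if len(heap) == k:
                # The k-th best utility seen so far becomes the new threshold.
                threshold = heap[0][0]
    return sorted(heap, reverse=True), threshold


best, final_threshold = top_k_itemsets(transactions, unit_profit, k=2)
print(best, final_threshold)
```

A real top-k miner would use the rising threshold to prune the search space; here it is only tracked, since the enumeration is exhaustive.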

    An Approach for Mining Top-k High Utility Item Sets (HUI)

    High utility itemset (HUI) mining extracts itemsets that provide more benefit to the consumer. It is a significant domain in data mining and is useful for several real-time applications. Modern HUI mining algorithms can identify itemsets that meet the minimum utility threshold, but fixing that threshold is not a simple task and is often difficult for consumers. When the minimum utility value is set low, a massive number of itemsets may be generated; when it is set high, only a small number of itemsets may be returned. To avoid these issues, top-k HUI mining, where k represents the number of HUIs to be identified, has been proposed. In this manuscript, the authors propose the top-k exact utility (TKEU) algorithm, which works without computing and comparing transaction weighted utilisation (TWU) values and instead considers the individual utility values of items to derive the top-k HUIs. The proposed algorithm pre-processes the datasets to reduce memory usage and to provide optimal outcomes for condensed datasets
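For readers unfamiliar with TWU, the quantity that TKEU is said to avoid, the sketch below computes it under the usual definition: the TWU of an itemset is the sum of the total utilities of all transactions that contain it. This is not the TKEU algorithm, and the toy data and function names are hypothetical. It only shows that TWU is an upper bound on the exact utility, which is why other algorithms use it for pruning.

```python
# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
unit_profit = {"a": 5, "b": 2, "c": 1}


def contains(transaction, itemset):
    """True when the transaction holds every item of the itemset."""
    return all(i in transaction for i in itemset)


def transaction_utility(transaction, unit_profit):
    """Total utility of one transaction: all of its items count."""
    return sum(q * unit_profit[i] for i, q in transaction.items())


def exact_utility(itemset, transactions, unit_profit):
    """Utility of the itemset itself: only its own items count."""
    return sum(
        sum(t[i] * unit_profit[i] for i in itemset)
        for t in transactions
        if contains(t, itemset)
    )


def twu(itemset, transactions, unit_profit):
    """Transaction-weighted utilisation: an upper bound on exact utility."""
    return sum(
        transaction_utility(t, unit_profit)
        for t in transactions
        if contains(t, itemset)
    )


for itemset in [("a",), ("b",), ("a", "b")]:
    print(itemset,
          exact_utility(itemset, transactions, unit_profit),
          twu(itemset, transactions, unit_profit))
```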

    High Utility Itemsets Mining for Transactional Databases

    A mainstream problem in data mining is "high-utility itemset mining", or more generally utility mining. High utility itemsets (HUIs) are itemsets whose utility meets a user-specified minimum utility threshold, min_util. The main objective of utility mining is to discover itemsets with the highest utilities by considering profit, quantity, cost or other user preferences. Research has been done in the area of mining HUIs, and various techniques have been applied. The fundamental issue is that the threshold value, which is mostly user-specific, must be appropriate. To set the most appropriate threshold for mining HUIs, the user has to proceed by trial and error, which is a tedious and time-consuming process: if min_util is set too low, the system returns a very large number of HUIs, which makes it ineffective for its purpose; if min_util is set too high, it returns few or no HUIs. Setting the minimum threshold is therefore difficult. The proposed system follows the top-k framework for mining top-k HUIs, using two algorithms, TKU (mining top-k utility itemsets) and TKO (mining top-k utility itemsets in one phase), without setting a min_util threshold
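The sensitivity to min_util described above can be seen on a very small example. The following is a brute-force sketch, assuming the common definition of itemset utility as quantity times unit profit summed over the transactions that contain the itemset. The data and function names are hypothetical, and the enumeration is exhaustive rather than an actual mining algorithm.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
unit_profit = {"a": 5, "b": 2, "c": 1}


def utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over the transactions containing every item."""
    return sum(
        sum(t[i] * unit_profit[i] for i in itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


def high_utility_itemsets(transactions, unit_profit, min_util):
    """Return every itemset whose utility is at least min_util."""
    items = sorted(unit_profit)
    found = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            u = utility(itemset, transactions, unit_profit)
            if u >= min_util:
                found[itemset] = u
    return found


# The same database gives very different result sizes as min_util changes.
for min_util in (4, 10, 20):
    result = high_utility_itemsets(transactions, unit_profit, min_util)
    print(min_util, len(result), result)
```

Even with three items, the result shrinks from six itemsets to none as the threshold grows, which is the trial-and-error problem that a top-k formulation avoids.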

    Mining High Utility Patterns Over Data Streams

    Mining useful patterns from sequential data is a challenging topic in data mining. An important task for mining sequential data is sequential pattern mining, which discovers sequences of itemsets that frequently appear in a sequence database. In sequential pattern mining, the selection of sequences is generally based on the frequency/support framework. However, most of the patterns returned by sequential pattern mining may not be informative enough to business people and are not particularly related to a business objective. In view of this, high utility sequential pattern (HUSP) mining has emerged as a novel research topic in data mining recently. The main objective of HUSP mining is to extract valuable and useful sequential patterns from data by considering the utility of a pattern that captures a business objective (e.g., profit, users interest). In HUSP mining, the goal is to find sequences whose utility in the database is no less than a user-specified minimum utility threshold. Nowadays, many applications generate a huge volume of data in the form of data streams. A number of studies have been conducted on mining HUSPs, but they are mainly intended for non-streaming data and thus do not take data stream characteristics into consideration. Mining HUSP from such data poses many challenges. First, it is infeasible to keep all streaming data in the memory due to the high volume of data accumulated over time. Second, mining algorithms need to process the arriving data in real time with one scan of data. Third, depending on the minimum utility threshold value, the number of patterns returned by a HUSP mining algorithm can be large and overwhelms the user. In general, it is hard for the user to determine the value for the threshold. Thus, algorithms that can find the most valuable patterns (i.e., top-k high utility patterns) are more desirable. Mining the most valuable patterns is interesting in both static data and data streams. 
To address these research limitations and challenges, this dissertation proposes techniques and algorithms for mining high utility sequential patterns over data streams. We work on mining HUSPs over both a long portion of a data stream and a short period of time. We also work on how to efficiently identify the most significant high utility patterns (namely, the top-k high utility patterns) over data streams. In the first part, we explore a fundamental problem: how the limited memory space can be well utilized to produce high quality HUSPs over the entire data stream. An approximation algorithm, called MAHUSP, is designed; it employs memory-adaptive mechanisms that use a bounded portion of memory to efficiently discover HUSPs over the entire data stream. The second part of the dissertation presents a new sliding window-based algorithm to discover recent high utility sequential patterns over data streams. A novel data structure named HUSP-Tree is proposed to maintain the essential information for mining recent HUSPs. An efficient single-pass algorithm named HUSP-Stream is proposed to generate recent HUSPs from HUSP-Tree. The third part addresses the problem of top-k high utility pattern mining over data streams. Two novel methods, named T-HUDS and T-HUSP, for finding top-k high utility patterns over a data stream are proposed. T-HUDS discovers top-k high utility itemsets and T-HUSP discovers top-k high utility sequential patterns over a data stream. T-HUDS is based on a compressed tree structure, called HUDS-Tree, that can be used to efficiently find potential top-k high utility itemsets over data streams. T-HUSP incrementally maintains the content of top-k HUSPs in a data stream in a summary data structure, named TKList, and discovers top-k HUSPs efficiently. All of the algorithms are evaluated using both synthetic and real datasets. Their performance, including running time, memory consumption, precision, recall and F-measure, is compared.
In order to show the effectiveness and efficiency of the proposed methods in real-life applications, the fourth part of this dissertation presents applications of one of the proposed methods (i.e., MAHUSP) to extract meaningful patterns from a real web clickstream dataset and a real biosequence dataset. The utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential pattern mining provides meaningful patterns in real-life applications

    MINING TOP-K HIGH UTILITY ITEM SETS BY USING EFFICIENT DATA STRUCTURE TO IMPROVE THE PERFORMANCE

    Association rules show strong relationships between attribute-value pairs (or items) that occur frequently in a given data set. Association rules are commonly used to determine the purchasing patterns of customers in a store. Such analysis is used in many decision-making processes, such as product placement, catalogue design, and cross-marketing. The discovery of association rules is based on frequent itemset mining. Frequent itemset mining algorithms mainly suffer from generating a large number of candidate itemsets and requiring a large number of database scans. These issues are addressed by two algorithms, namely TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in one phase), which are recommended for mining top-k high utility itemsets in two scans of the entire database. Although the scans are reduced to two, the processing time is high because of traversals of the UP-Tree, the data structure used by the TKU and TKO algorithms. The proposed algorithm uses a B+-Tree data structure instead of the UP-Tree to reduce the time. Experimental analysis shows that the processing time is improved, and hence the limitations of existing work are overcome by the proposed methodology using the B+-Tree

    Mining high utility sequential patterns

    University of Technology Sydney. Faculty of Engineering and Information Technology. Sequential pattern mining refers to the identification of frequent subsequences in sequence databases as patterns. It provides an effective way to analyze sequential data. The selection of interesting sequences is generally based on the frequency/support framework: sequences of high frequency are treated as significant. In the last two decades, researchers have proposed many techniques and algorithms for extracting frequent sequential patterns, in which the downward closure property (also known as the Apriori property) plays a fundamental role. At the same time, the relative importance of each item has been introduced in frequent pattern mining, and “high utility itemset mining” has been proposed. Instead of selecting high frequency patterns, the utility-based methods extract itemsets with high utilities, and many algorithms and strategies have been proposed. These methods can only process itemsets in the utility framework. However, all the above methods suffer from the following common issues and problems to varying extents: 1) Sometimes, most of the frequent patterns may not be informative for business decision-making, since they do not show the business value and impact. 2) Even if there is an algorithm that considers the business impact (namely utility), it can only obtain high utility sequences based on a given minimum utility threshold, so it is very difficult for users to specify an appropriate minimum utility and to directly obtain the most valuable patterns. 3) The algorithm in the utility framework may generate a large number of patterns, many of which may be redundant. Although high utility sequential pattern mining is essential, discovering the patterns is challenging for the following reasons: 1) The downward closure property does not hold in utility-based sequence mining. This means that most of the existing algorithms cannot be directly transferred, e.g. 
from frequent sequential pattern mining to high utility sequential pattern mining. Furthermore, compared to high utility itemset mining, utility-based sequence analysis faces the critical combinational explosion and computational complexity caused by sequencing between sequential elements (itemsets). 2) Since the minimum utility is not given in advance, the algorithm essentially starts searching from 0 minimum support. This not only incurs very high computational costs, but also the challenge of how to raise the minimum threshold without missing any top-k high utility sequences. 3) Due to the fundamental difference, incorporating the traditional closure concept into high utility sequential pattern mining makes the outcome patterns irreversibly lossy and no longer recoverable, which will be reasoned in the following chapters. Therefore, it is exceedingly challenging to address the above issues by designing a novel representation for high utility sequential patterns. To address these research limitations and challenges, this thesis proposes a high utility sequential pattern mining framework, and proposes both a threshold-based and top-k-based mining algorithm. Furthermore, a compact and lossless representation of utility-based sequence is presented, and an efficient algorithm is provided to mine such kind of patterns. Chapter 2 thoroughly reviews the related works in the frequent sequential pattern mining and high utility itemset/sequence mining. Chapter 3 incorporates utility into sequential pattern mining, and a generic framework for high utility sequence mining is defined. Two efficient algorithms, namely USpan and USpan+, are presented to mine for high utility sequential patterns. In USpan and USpan+, we introduce the lexicographic quantitative sequence tree to extract the complete set of high utility sequences and design concatenation mechanisms for calculating the utility of a node and its children with three effective pruning strategies. 
Chapter 4 proposes a novel framework called top-k high utility sequential pattern mining to tackle this critical problem. Accordingly, an efficient algorithm, Top-k high Utility Sequence (TUS for short) mining, is designed to identify top-k high utility sequential patterns without minimum utility. In addition, three effective features are introduced to handle the efficiency problem, including two strategies for raising the threshold and one pruning for filtering unpromising items. Chapter 5 proposes a novel concise framework to discover US-closed (Utility Sequence closed) high utility sequential patterns, with theoretical proof that it expresses the lossless representation of high-utility patterns. An efficient algorithm named CloUSpan is introduced to extract the US-closed patterns. Two effective strategies are used to enhance the performance of CloUSpan. All of the algorithms are examined in both synthetic and real datasets. The performances, including the running time and memory consumption, are compared. Furthermore, the utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential patterns provide insightful knowledge for users
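The claim that the downward closure property does not hold for utility can be checked on a small example. The sketch below uses plain itemsets rather than sequences, which is a simplification of the thesis setting. The toy data and function names are hypothetical, and none of USpan, TUS or CloUSpan is reproduced. Support never increases when an itemset is extended, while utility can move in either direction.

```python
# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
unit_profit = {"a": 5, "b": 2, "c": 1}


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if all(i in t for i in itemset))


def utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over the transactions containing every item."""
    return sum(
        sum(t[i] * unit_profit[i] for i in itemset)
        for t in transactions
        if all(i in t for i in itemset)
    )


subset_b, subset_a, superset = ("b",), ("a",), ("a", "b")

# Support is anti-monotone: extending an itemset cannot raise it.
print(support(subset_b, transactions), support(superset, transactions))

# Utility is not: the superset beats one subset and loses to the other.
print(utility(subset_b, transactions, unit_profit),
      utility(superset, transactions, unit_profit),
      utility(subset_a, transactions, unit_profit))
```

Because a low-utility pattern can have a high-utility extension, a miner cannot discard a pattern just because its own utility is below the threshold, which is why utility-based methods rely on upper bounds for pruning.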

    Efficient chain structure for high-utility sequential pattern mining

    High-utility sequential pattern mining (HUSPM) is an emerging topic in data mining, which considers both utility and sequence factors to derive the set of high-utility sequential patterns (HUSPs) from quantitative databases. Several works have been presented to reduce the computational cost through variants of pruning strategies. In this paper, we present an efficient sequence-utility (SU)-Chain structure, which can be used to store more relevant information to improve mining performance. Based on the SU-Chain structure, the existing pruning strategies can also be used to prune unpromising candidates early and obtain the HUSPs that satisfy the threshold. Experiments compare the approach with state-of-the-art HUSPM algorithms, and the results show that the SU-Chain-based model outperforms the existing HUSPM algorithms in terms of runtime and number of candidates examined

    Privacy Preserving Utility Mining: A Survey

    In the big data era, collected data usually contains rich information and hidden knowledge. Utility-oriented pattern mining and analytics have shown a powerful ability to explore these ubiquitous data, which may be collected from various fields and applications, such as market basket analysis, retail, click-stream analysis, medical analysis, and bioinformatics. However, analysis of data containing sensitive private information raises privacy concerns. To achieve a better trade-off between utility maximization and privacy preservation, Privacy-Preserving Utility Mining (PPUM) has become a critical issue in recent years. In this paper, we provide a comprehensive overview of PPUM. We first present the background of utility mining, privacy-preserving data mining and PPUM, then introduce the related preliminaries and problem formulation of PPUM, as well as some key evaluation criteria for PPUM. In particular, we present and discuss the current state-of-the-art PPUM algorithms, as well as their advantages and deficiencies in detail. Finally, we highlight and discuss some technical challenges and open directions for future research on PPUM. Comment: 2018 IEEE International Conference on Big Data, 10 pages