83 research outputs found

    Incrementally updating the high average-utility patterns with pre-large concept

    High-utility itemset mining (HUIM) is an emerging approach for detecting high-utility patterns in databases. Most existing HUIM algorithms consider only the utility of an itemset, regardless of its length, so utility tends to rise simply because an itemset grows larger. High average-utility itemset mining (HAUIM) takes the size of the itemset into account, thus providing a more balanced measure, the average utility, for decision-making. Several algorithms have been presented to efficiently mine the set of high average-utility itemsets (HAUIs), but most of them handle only static databases. An earlier fast-updated (FUP)-based algorithm handles the incremental problem efficiently, but it still has to re-scan the database when an itemset is small in the original database yet is a high average-utility upper-bound itemset (HAUUBI) in the newly inserted transactions. This paper develops PRE-HAUIMI, an efficient framework for transaction insertion in dynamic databases that relies on average-utility-list (AUL) structures and applies the pre-large concept to HAUIM. The pre-large concept speeds up mining by ensuring that, if the total utility of the newly inserted transactions is within the safety bound, itemsets that were small in the original database cannot become large after the database is updated. This, in turn, reduces repeated database scans while still obtaining the correct HAUIs. Experiments demonstrate that PRE-HAUIMI outperforms the state-of-the-art batch-mode HAUI-Miner and the state-of-the-art incremental IHAUPM and FUP-based algorithms in terms of runtime, memory, number of assessed patterns, and scalability.
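
    The difference between plain utility and average utility can be illustrated with a minimal sketch. It is not the PRE-HAUIMI algorithm and uses no AUL structures or safety bound; it only shows the standard definition, in which the utility of an itemset is divided by its length. The function names, the toy database, and the unit profits are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# A transaction maps each purchased item to its quantity (toy representation).
Transaction = Dict[str, int]


def itemset_utility(itemset: FrozenSet[str], transaction: Transaction,
                    unit_profit: Dict[str, int]) -> int:
    """Utility of an itemset in one transaction; 0 if any item is missing."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(transaction[item] * unit_profit[item] for item in itemset)


def total_utility(itemset: FrozenSet[str], database: List[Transaction],
                  unit_profit: Dict[str, int]) -> int:
    """Plain utility (as in HUIM): summed over all supporting transactions."""
    return sum(itemset_utility(itemset, t, unit_profit) for t in database)


def average_utility(itemset: FrozenSet[str], database: List[Transaction],
                    unit_profit: Dict[str, int]) -> float:
    """Average utility (as in HAUIM): total utility divided by itemset length."""
    return total_utility(itemset, database, unit_profit) / len(itemset)


# Hypothetical data, chosen only to make the arithmetic easy to follow.
unit_profit = {"a": 5, "b": 2, "c": 1}
database = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 3},
    {"b": 1, "c": 4},
]

ab = frozenset({"a", "b"})
print(total_utility(ab, database, unit_profit))    # utility of {a, b}
print(average_utility(ab, database, unit_profit))  # same value divided by 2
```

    Dividing by the itemset length is what keeps longer itemsets from looking valuable merely because they contain more items.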

    Optimized High-Utility Itemsets Mining for Effective Association Mining Paper

    Association rule mining is widely used to determine the frequent itemsets of a transactional database; however, market-behavior applications also need to consider the utility of itemsets. Apriori and FP-growth generate association rules without taking the utility of items into account. High-utility itemset mining (HUIM) is a well-known method that determines itemsets based on their utility value; the resulting itemsets are known as high-utility itemsets. Fastest high-utility mining method (FHM) is an enhanced version of HUIM. FHM reduces the number of join operations during itemset generation, so it is faster than HUIM. For large datasets, however, both methods are very expensive. The proposed method addresses this issue by building a pruning-based utility co-occurrence structure (PEUCS) to eliminate low-profit itemsets, so that it processes only an optimal number of high-utility itemsets; it is therefore called optimal FHM (OFHM). Experimental results show that OFHM requires less computational runtime and is therefore more efficient than other existing methods on large benchmark datasets.

    Investigation of discovering rules from data.

    by Ng, King Kwok. Thesis submitted in December 1999. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 99-104). Abstracts in English and Chinese.
    Contents:
    Acknowledgments; Abstract
    Chapter 1, Introduction: 1.1 Data Mining and Rule Discovery (1.1.1 Association Rule; 1.1.2 Sequential Pattern; 1.1.3 Dependence Rule); 1.2 Association Rule Mining; 1.3 Contributions; 1.4 Outline of the Thesis
    Chapter 2, Related Work on Association Rule Mining: 2.1 Batch Algorithms (2.1.1 The Apriori Algorithm; 2.1.2 The DIC Algorithm; 2.1.3 The Partition Algorithm; 2.1.4 The Sampling Algorithm); 2.2 Incremental Association Rule Mining (2.2.1 The FUP Algorithm; 2.2.2 The FUP2 Algorithm; 2.2.3 The FUP* Algorithm; 2.2.4 The Negative Border Method; 2.2.5 Limitations of Existing Incremental Association Rule Mining Algorithms)
    Chapter 3, A New Incremental Association Rule Mining Approach: 3.1 Outline for the Proposed Approach; 3.2 Our New Approach (3.2.1 The IDIC_M Algorithm; 3.2.2 A Variant Algorithm: The IDIC_S Algorithm); 3.3 Performance Evaluation of Our Approach (3.3.1 Experimental Results for Algorithm IDIC_M; 3.3.2 Experimental Results for Algorithm IDIC_S); 3.4 Discussion
    Chapter 4, Related Work on Multiple-Level AR and Belief-Driven Mining: 4.1 Background on Multiple-Level Association Rules; 4.2 Related Work on Multiple-Level Association Rules (4.2.1 The Basic Algorithm; 4.2.2 The Cumulate Algorithm; 4.2.3 The EstMerge Algorithm; 4.2.4 Using Hierarchy-Information Encoded Transaction Table); 4.3 Background on Rule Mining in the Presence of User Belief; 4.4 Related Work on Rule Mining in the Presence of User Belief (4.4.1 Post-Analysis of Learned Rules; 4.4.2 Using General Impressions to Analyze Discovered Classification Rules; 4.4.3 A Belief-Driven Method for Discovering Unexpected Patterns; 4.4.4 Constraint-Based Rule Mining); 4.5 Limitations of Existing Approaches
    Chapter 5, Multiple-Level Association Rules Mining in the Presence of User Belief: 5.1 User Belief Under Taxonomy; 5.2 Formal Definition of Rule Interestingness; 5.3 The MARUB_E Mining Algorithm
    Chapter 6, Experiments on MARUB_E: 6.1 Preliminary Experiments; 6.2 Experiments on Synthetic Data; 6.3 Experiments on Real Data
    Chapter 7, Dealing with Vague Belief of User: 7.1 User Belief Under Taxonomy; 7.2 Relationship with Constraint-Based Rule Mining; 7.3 Formal Definition of Rule Interestingness; 7.4 The MARUB_V Mining Algorithm
    Chapter 8, Experiments on MARUB_V: 8.1 Preliminary Experiments (8.1.1 Experiments on Synthetic Data; 8.1.2 Experiments on Real Data)
    Chapter 9, Conclusions and Future Work: 9.1 Conclusions; 9.2 Future Work

    Mining frequent sequential patterns in data streams using SSM-algorithm.

    Frequent sequential mining is the process of discovering frequent sequential patterns in data sequences, such as those found in web log access sequences. In data stream applications, data arrive at high speed in a continuous flow. Data stream mining is an online process that differs from traditional mining: traditional mining algorithms work on an entire static dataset to obtain results, while data stream mining algorithms work with continuously arriving data streams. With rapid changes in technology, many applications now take data as continuous streams. Examples include stock tickers, network traffic measurements, click stream data, data feeds from sensor networks, and telecom call records. Mining frequent sequential patterns in data stream applications must contend with many challenges, such as limited memory for unlimited data, the inability of algorithms to scan the infinitely flowing original dataset more than once, and the need to deliver current and accurate results on demand. This thesis proposes the SSM-Algorithm (sequential stream mining algorithm), which delivers frequent sequential patterns in data streams. The concept of this work came from the FP-Stream algorithm, which delivers time-sensitive frequent patterns. The proposed SSM-Algorithm outperforms the FP-Stream algorithm through the use of a hash-based data structure and two efficient tree-based data structures. All incoming streams are handled dynamically to improve memory usage. The SSM-Algorithm maintains frequent sequences incrementally and delivers the most current result on demand. The introduced algorithm can be deployed to analyze e-commerce data where the primary source of the data is click stream data. (Abstract shortened by UMI.) Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .M668. Source: Masters Abstracts International, Volume: 44-03, page: 1409. Thesis (M.Sc.)--University of Windsor (Canada), 2005.

    Mining Association Rules Events over Data Streams

    Data streams have gained considerable attention in the data analysis and data mining communities because of the emergence of new classes of applications, such as monitoring, supply chain execution, sensor networks, oilfield and pipeline operations, financial marketing, and the health data industries. Telecommunication advancements have provided easy access to stream data produced by various applications. Data in streams differ from static data stored in data warehouses or databases: data streams are continuous, arrive at high speed, and change over time. Traditional data mining algorithms assume that data reside in conventional storage, where mining is performed centrally with the luxury of accessing the data multiple times, using powerful processors, and providing offline output with no time constraints. Such algorithms are not suitable for dynamic data streams. Stream data need to be mined promptly, as it might not be feasible to store such a volume of data. In addition, streams reflect the live status of the environment generating them, so prompt analysis may provide early detection of faults, delays, performance measurements, trend analysis, and other diagnostics. This thesis focuses on developing a data stream association rule mining algorithm among co-occurring events. The proposed algorithm mines association rules over data streams incrementally in a centralized setting. We are interested in association rules that meet a provided minimum confidence threshold and have a lift value greater than 1. We refer to such association rules as strong rules. Experiments on several datasets demonstrate that the proposed algorithm is efficient and effective in extracting association rules from data streams, with faster processing time and better memory management.
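
    The "strong rule" criterion above (confidence at or above a minimum threshold, and lift greater than 1) can be sketched as follows. This is a batch illustration over a small in-memory list of event sets, not the thesis's incremental stream algorithm; the function names, the event data, and the 0.6 threshold are assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent: FrozenSet[str], consequent: FrozenSet[str],
                 transactions: List[FrozenSet[str]]) -> Tuple[float, float]:
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    if s_a == 0 or s_c == 0:
        return 0.0, 0.0  # rule is undefined on this data; treat as not strong
    confidence = support(antecedent | consequent, transactions) / s_a
    lift = confidence / s_c
    return confidence, lift


def is_strong(antecedent: FrozenSet[str], consequent: FrozenSet[str],
              transactions: List[FrozenSet[str]], min_confidence: float) -> bool:
    """Strong rule: confidence >= min_confidence and lift > 1."""
    confidence, lift = rule_metrics(antecedent, consequent, transactions)
    return confidence >= min_confidence and lift > 1.0


# Hypothetical co-occurring events, one set per time window.
events = [
    frozenset({"overheat", "shutdown"}),
    frozenset({"overheat", "shutdown"}),
    frozenset({"overheat"}),
    frozenset({"login"}),
    frozenset({"login", "shutdown"}),
]

print(is_strong(frozenset({"overheat"}), frozenset({"shutdown"}), events, 0.6))
print(is_strong(frozenset({"login"}), frozenset({"shutdown"}), events, 0.6))
```

    A lift above 1 means the consequent occurs more often alongside the antecedent than it does overall, which is why the criterion filters out rules that are confident only because the consequent is common.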

    A study of two problems in data mining: projective clustering and multiple tables association rules mining.

    Ng Ka Ka. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 114-120). Abstracts in English and Chinese.
    Contents:
    Abstract; Acknowledgement
    Chapter I, Projective Clustering
    Chapter 1, Introduction to Projective Clustering
    Chapter 2, Related Work to Projective Clustering: 2.1 CLARANS - Graph Abstraction and Bounded Optimization (2.1.1 Graph Abstraction; 2.1.2 Bounded Optimized Random Search); 2.2 OptiGrid - Grid Partitioning Approach and Density Estimation Function (2.2.1 Empty Space Phenomenon; 2.2.2 Density Estimation Function; 2.2.3 Upper Bound Property); 2.3 CLIQUE and ENCLUS - Subspace Clustering (2.3.1 Monotonicity Property of Subspaces); 2.4 PROCLUS Projective Clustering; 2.5 ORCLUS - Generalized Projective Clustering (2.5.1 Singular Value Decomposition SVD); 2.6 An "Optimal" Projective Clustering
    Chapter 3, EPC: Efficient Projective Clustering: 3.1 Motivation; 3.2 Notations and Definitions (3.2.1 Density Estimation Function; 3.2.2 1-d Histogram; 3.2.3 1-d Dense Region; 3.2.4 Signature Q); 3.3 The overall framework; 3.4 Major Steps (3.4.1 Histogram Generation; 3.4.2 Adaptive discovery of dense regions; 3.4.3 Count the occurrences of signatures; 3.4.4 Find the most frequent signatures; 3.4.5 Refine the top 3m signatures); 3.5 Time and Space Complexity
    Chapter 4, EPCH: An extension and generalization of EPC: 4.1 Motivation of the extension; 4.2 Distinguish clusters by their projections in different subspaces; 4.3 EPCH: a generalization of EPC by building histogram with higher dimensionality (4.3.1 Multidimensional histograms construction and dense regions detection; 4.3.2 Compressing data objects to signatures; 4.3.3 Merging Similar Signature Entries; 4.3.4 Associating membership degree; 4.3.5 The choice of Dimensionality d of the Histogram); 4.4 Implementation of EPC2; 4.5 Time and Space Complexity of EPCH
    Chapter 5, Experimental Results: 5.1 Clustering Quality Measurement; 5.2 Synthetic Data Generation; 5.3 Experimental setup; 5.4 Comparison between EPC and PROCLUS; 5.5 Comparison between EPCH and ORCLUS (5.5.1 Dimensionality of the original space and the associated subspace; 5.5.2 Projection not parallel to original axes; 5.5.3 Data objects belong to more than one cluster under fuzzy clustering); 5.6 Scalability of EPC; 5.7 Scalability of EPC2
    Chapter 6, Conclusion
    Chapter II, Multiple Tables Association Rules Mining
    Chapter 7, Introduction to Multiple Tables Association Rule Mining: 7.1 Problem Statement
    Chapter 8, Related Work to Multiple Tables Association Rules Mining: 8.1 Apriori - A Bottom-up approach to generate candidate sets; 8.2 VIPER - Vertical Mining with various optimization techniques (8.2.1 Vertical TID Representation and Mining; 8.2.2 FORC); 8.3 Frequent Itemset Counting across Multiple Tables
    Chapter 9, The Proposed Method: 9.1 Notations; 9.2 Converting Dimension Tables to internal representation; 9.3 The idea of discovering frequent itemsets without joining; 9.4 Overall Steps; 9.5 Binding multiple Dimension Tables; 9.6 Prefix Tree for FT; 9.7 Maintaining frequent itemsets in FI-trees; 9.8 Frequency Counting
    Chapter 10, Experiments: 10.1 Synthetic Data Generation; 10.2 Experimental Findings
    Chapter 11, Conclusion and Future Works
    Bibliography

    LC an effective classification based association rule mining algorithm

    Classification using association rules is a research field in data mining that primarily applies association rule discovery techniques to classification benchmarks. Many research studies in the literature have confirmed that classification using association tends to generate more predictive classification systems than traditional classification techniques such as probabilistic, statistical, and decision tree approaches. In this thesis, we introduce a novel data mining algorithm based on classification using association, called "Looking at the Class" (LC), which can be used for mining a range of classification data sets. Known algorithms in this approach, such as the Classification based on Association rule (CBA) system and the Classification based on Predictive Association (CPAR) system, merge disjoint items in the rule learning step without anticipating class label similarity; the proposed algorithm instead merges only items with identical class labels. This avoids many unnecessary item combinations during the rule learning step and consequently yields large savings in computational time and memory. Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules, instead of a single rule, to make the prediction decision. The proposed algorithm has been evaluated thoroughly on real-world security data sets collected using an automated tool developed at Huddersfield University. The security application considered in this thesis is categorizing websites as legitimate or fake based on their features, which is a typical binary classification problem. Experiments have also been conducted on a number of UCI data sets, and the measures used for evaluation are classification accuracy, memory usage, and others. The results show that the LC algorithm outperformed traditional classification algorithms such as C4.5, PART, and Naïve Bayes, as well as known classification-based association algorithms like CBA, with respect to classification accuracy, memory usage, and execution time on most of the data sets we consider.

    Distributed frequent hierarchical pattern mining for robust and efficient large-scale association discovery

    Field of study: Computer science. Dr. Chi-Ren Shyu, Dissertation Supervisor. Includes vita. "May 2017." Frequent pattern mining is a classic data mining technique, generally applicable to a wide range of application domains, and a mature area of research. The fundamental challenge arises from the combinatorial nature of frequent itemsets, which scale exponentially with respect to the number of unique items. Apriori-based and FPTree-based algorithms have dominated the space thus far. Initial phases of this research relied on the Apriori algorithm and utilized a distributed computing environment; we proposed the Cartesian Scheduler to manage Apriori's candidate generation process. To address the limitations of bottom-up frequent pattern mining algorithms such as Apriori and FPGrowth, we propose the Frequent Hierarchical Pattern Tree (FHPTree): a tree structure and new frequent pattern mining paradigm. The classic problem is redefined as frequent hierarchical pattern mining, where the goal is to detect frequent maximal pattern covers. Under the proposed paradigm, compressed representations of maximal patterns are mined using a top-down FHPTree traversal, FHPGrowth, which detects large patterns before their subsets, thus yielding significant reductions in computation time. The FHPTree memory footprint is small; the number of nodes in the structure scales linearly with respect to the number of unique items. Additionally, the FHPTree serves as a persistent, dynamic data structure to index frequent patterns and enable efficient searches. When the search space is exponential, efficient targeted mining capabilities are paramount; this is one of the key contributions of the FHPTree. This dissertation demonstrates the performance of FHPGrowth, which achieves a 300x speedup over state-of-the-art maximal pattern mining algorithms and approximately a 2400x speedup when FHPGrowth is used in a distributed computing environment. In addition, we allude to future research opportunities and suggest various modifications to further optimize the FHPTree and FHPGrowth. Moreover, the methods we offer will have an impact on other data mining research areas, including contrast set mining as well as spatial and temporal mining. Includes bibliographical references (pages 121-133).