8 research outputs found

    Mining Class Association Rules on Distributed Databases

    This paper proposes a method for mining class association rules in distributed datasets using a peer-to-peer network. The method exploits the computing power of the PCs in the network to process local information at each site, and transfers to the mining site only the information about itemsets whose supports satisfy the minimum support threshold. The proposed method therefore requires less memory than transferring the entire dataset's information to the mining site.
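    As a rough illustration of the transfer-reduction idea described above, the sketch below has each site keep only its locally frequent itemsets before sending a compact summary to the mining site. The function names, the absolute support threshold, and the restriction to itemsets of size 1 and 2 are simplifying assumptions, not the paper's actual protocol.

```python
from collections import Counter
from itertools import combinations

MIN_SUP = 2  # absolute minimum-support threshold (assumed for illustration)

def local_frequent_items(transactions, min_sup):
    """Count itemsets at one site and keep only those meeting min_sup,
    so only a small summary is transferred to the mining site."""
    counts = Counter()
    for items in transactions:
        for k in (1, 2):  # size-1 and size-2 itemsets only, to keep the sketch small
            for subset in combinations(sorted(items), k):
                counts[subset] += 1
    return {iset: c for iset, c in counts.items() if c >= min_sup}

def merge_at_mining_site(site_summaries, min_sup):
    """Aggregate the per-site summaries and keep globally frequent itemsets."""
    total = Counter()
    for summary in site_summaries:
        total.update(summary)
    return {iset: c for iset, c in total.items() if c >= min_sup}

site_a = [{"a", "b"}, {"a", "b", "c"}]
site_b = [{"a", "b"}, {"b", "c"}]
summaries = [local_frequent_items(t, MIN_SUP) for t in (site_a, site_b)]
frequent = merge_at_mining_site(summaries, MIN_SUP)
```

    Only the filtered summaries cross the network; the full transactions never leave their sites, which is the source of the memory saving the abstract describes.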

    Parallel Mining of Association Rules Using a Lattice Based Approach

    The discovery of interesting patterns from database transactions is one of the major problems in knowledge discovery in databases. One such interesting pattern is the set of association rules extracted from these transactions. Parallel algorithms are required for mining association rules because of the very large databases used to store the transactions. In this paper we present a parallel, lattice-based algorithm for mining association rules. Dynamic Distributed Rule Mining (DDRM) partitions the lattice into sublattices that are assigned to processors for processing and identification of frequent itemsets. Experimental results show that DDRM utilizes the processors efficiently and performs better than the prefix-based and partition algorithms that use a static approach to assign classes to the processors. The DDRM algorithm scales well and shows good speedup.
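    The sublattice idea can be sketched with a generic vertical (tid-list) miner. This is not DDRM itself, just a minimal illustration of how prefix-based equivalence classes split the itemset lattice into independent units of work that a scheduler could hand to processors; all names and data are made up.

```python
def tidlists(transactions):
    """Map each item to the set of transaction ids containing it."""
    tl = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tl.setdefault(item, set()).add(tid)
    return tl

def mine_sublattice(prefix, prefix_tids, items, tl, min_sup, out):
    """Depth-first walk of one sublattice (a prefix-based equivalence
    class). A DDRM-style scheduler would hand such sublattices to
    processors dynamically rather than assigning them statically."""
    for i, item in enumerate(items):
        tids = prefix_tids & tl[item]  # support via tid-list intersection
        if len(tids) >= min_sup:
            itemset = prefix + (item,)
            out.append((itemset, len(tids)))
            mine_sublattice(itemset, tids, items[i + 1:], tl, min_sup, out)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
min_sup = 2
tl = tidlists(transactions)
items = sorted(i for i in tl if len(tl[i]) >= min_sup)
results = []
# each first-level branch below is one sublattice that could be
# distributed among processors as an independent work unit
mine_sublattice((), set(range(len(transactions))), items, tl, min_sup, results)
```

    Because sublattices rooted at different prefixes share no state, they can be mined in parallel; the load-balancing question DDRM addresses is how to hand them out dynamically when their sizes differ.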

    Investigation of discovering rules from data.

    by Ng, King Kwok. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000; submitted December 1999. Includes bibliographical references (leaves 99-104). Abstracts in English and Chinese.

    Contents:
    Acknowledgments
    Abstract
    Chapter 1 -- Introduction
        1.1 Data Mining and Rule Discovery
            1.1.1 Association Rule
            1.1.2 Sequential Pattern
            1.1.3 Dependence Rule
        1.2 Association Rule Mining
        1.3 Contributions
        1.4 Outline of the Thesis
    Chapter 2 -- Related Work on Association Rule Mining
        2.1 Batch Algorithms
            2.1.1 The Apriori Algorithm
            2.1.2 The DIC Algorithm
            2.1.3 The Partition Algorithm
            2.1.4 The Sampling Algorithm
        2.2 Incremental Association Rule Mining
            2.2.1 The FUP Algorithm
            2.2.2 The FUP2 Algorithm
            2.2.3 The FUP* Algorithm
            2.2.4 The Negative Border Method
            2.2.5 Limitations of Existing Incremental Association Rule Mining Algorithms
    Chapter 3 -- A New Incremental Association Rule Mining Approach
        3.1 Outline for the Proposed Approach
        3.2 Our New Approach
            3.2.1 The IDIC_M Algorithm
            3.2.2 A Variant Algorithm: The IDIC_S Algorithm
        3.3 Performance Evaluation of Our Approach
            3.3.1 Experimental Results for Algorithm IDIC_M
            3.3.2 Experimental Results for Algorithm IDIC_S
        3.4 Discussion
    Chapter 4 -- Related Work on Multiple-Level AR and Belief-Driven Mining
        4.1 Background on Multiple-Level Association Rules
        4.2 Related Work on Multiple-Level Association Rules
            4.2.1 The Basic Algorithm
            4.2.2 The Cumulate Algorithm
            4.2.3 The EstMerge Algorithm
            4.2.4 Using Hierarchy-Information Encoded Transaction Table
        4.3 Background on Rule Mining in the Presence of User Belief
        4.4 Related Work on Rule Mining in the Presence of User Belief
            4.4.1 Post-Analysis of Learned Rules
            4.4.2 Using General Impressions to Analyze Discovered Classification Rules
            4.4.3 A Belief-Driven Method for Discovering Unexpected Patterns
            4.4.4 Constraint-Based Rule Mining
        4.5 Limitations of Existing Approaches
    Chapter 5 -- Multiple-Level Association Rules Mining in the Presence of User Belief
        5.1 User Belief Under Taxonomy
        5.2 Formal Definition of Rule Interestingness
        5.3 The MARUB_E Mining Algorithm
    Chapter 6 -- Experiments on MARUB_E
        6.1 Preliminary Experiments
        6.2 Experiments on Synthetic Data
        6.3 Experiments on Real Data
    Chapter 7 -- Dealing with Vague Belief of User
        7.1 User Belief Under Taxonomy
        7.2 Relationship with Constraint-Based Rule Mining
        7.3 Formal Definition of Rule Interestingness
        7.4 The MARUB_V Mining Algorithm
    Chapter 8 -- Experiments on MARUB_V
        8.1 Preliminary Experiments
            8.1.1 Experiments on Synthetic Data
            8.1.2 Experiments on Real Data
    Chapter 9 -- Conclusions and Future Work
        9.1 Conclusions
        9.2 Future Work

    A COMPREHENSIVE GEOSPATIAL KNOWLEDGE DISCOVERY FRAMEWORK FOR SPATIAL ASSOCIATION RULE MINING

    Continuous advances in modern data collection techniques help spatial scientists gain access to massive, high-resolution spatial and spatio-temporal data. There is thus an urgent need to develop effective and efficient methods for finding unknown, useful information embedded in datasets of unprecedented size (e.g., millions of observations), dimensionality (e.g., hundreds of variables), and complexity (e.g., heterogeneous data sources, space-time dynamics, multivariate connections, explicit and implicit spatial relations and interactions). Responding to this line of development, this research focuses on the utilization of association rule (AR) mining for a geospatial knowledge discovery process. Prior attempts have sidestepped the complexity of the spatial dependence structure embedded in the studied phenomenon, making the adoption of association rule mining in spatial analysis rather problematic. Interestingly, a very similar predicament afflicts spatial regression analysis, where a spatial weight matrix is assigned a priori, without validation on the specific domain of application. Moreover, a dependable geospatial knowledge discovery process requires algorithms supporting automatic, robust, and accurate procedures for evaluating the mined results; surprisingly, this has received little attention in the context of spatial association rule mining. To remedy these deficiencies, the foremost goal of this research is to construct a comprehensive geospatial knowledge discovery framework using spatial association rule mining for the detection of spatial patterns embedded in geospatial databases, and to demonstrate its application within the domain of crime analysis. It is the first attempt at delivering a complete geospatial knowledge discovery framework using spatial association rule mining.
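    One common way to feed spatial data to an AR miner, sketched below purely for illustration, is to encode each spatial object as a "transaction" of predicate items that a standard algorithm can then mine. The predicate names, the crime-analysis attributes, and the fixed-radius proximity test are all assumptions, not the framework's actual design.

```python
import math

def to_transaction(event, pois, radius):
    """Encode one crime event as a set of predicate items: its own
    attributes plus close_to(...) predicates derived from spatial
    proximity to points of interest (POIs)."""
    items = {f"type={event['type']}"}
    for name, (px, py) in pois.items():
        # toy proximity predicate: straight-line distance within `radius`
        if math.dist((event["x"], event["y"]), (px, py)) <= radius:
            items.add(f"close_to({name})")
    return items

pois = {"bar": (0.0, 0.0), "park": (5.0, 5.0)}
events = [
    {"type": "assault", "x": 0.5, "y": 0.5},
    {"type": "theft", "x": 4.8, "y": 5.1},
]
transactions = [to_transaction(e, pois, 1.0) for e in events]
```

    Rules such as close_to(bar) -> type=assault could then be mined from these predicate sets; the framework's concern, per the abstract, is that the spatial relations (here a naive fixed radius) should be validated rather than assigned a priori.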

    Frequent itemset mining on multiprocessor systems

    Frequent itemset mining is an important building block in many data mining applications such as market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow to hundreds of gigabytes or even terabytes of data, so efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed which, however, (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., to access secondary memory, which leads to serious performance degradation. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, reducing dataset size by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their costs by providing parallel encodings that achieve high throughput for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Together, both methods reduce the size of the intermediate data by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms used for mining other types of itemsets.
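    Two of the building blocks named above, intersecting sorted tid-lists and compressing integer values, can be sketched as follows. These are plain scalar versions: the thesis's parallel, SIMD-friendly variants are considerably more involved, and the delta-plus-varint scheme shown is a generic integer encoding, not necessarily the one the thesis uses.

```python
def intersect_sorted(a, b):
    """Merge-based intersection of two sorted tid-lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def varint_encode(values):
    """Delta + variable-byte encode a sorted tid-list: store gaps
    between consecutive tids, 7 bits per byte, high bit = 'more bytes'."""
    out, prev = bytearray(), 0
    for v in values:
        gap, prev = v - prev, v
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def varint_decode(data):
    """Inverse of varint_encode: rebuild gaps, then prefix-sum them."""
    values, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur
            values.append(prev)
            cur, shift = 0, 0
    return values
```

    Gap encoding works well here because tid-lists are sorted, so most gaps are small and fit in a single byte.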

    Effect of Data Skewness in Parallel Mining of Association Rules

    An efficient parallel algorithm, FPM (Fast Parallel Mining), for mining association rules on a shared-nothing parallel system is proposed. It adopts the count distribution approach and incorporates two powerful candidate-pruning techniques: distributed pruning and global pruning. It has a simple communication scheme that performs only one round of message exchange in each iteration. We found that the two pruning techniques are very sensitive to data skewness, which describes the degree of non-uniformity of the itemset distribution among the database partitions. Distributed pruning is very effective when data skewness is high. Global pruning is more effective than distributed pruning, even in the mild-skewness case. We have implemented the algorithm on an IBM SP2 parallel machine. The performance studies confirm our observation on the relationship between the effectiveness of the two pruning techniques and data skewness. They also show that FPM consistently outperforms CD (Count Distribution), a parallel version of the popular Apriori algorithm [2, 3]. Furthermore, FPM exhibits good speedup, scaleup, and sizeup.
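    A simplified, hypothetical sketch of the distributed-pruning idea: a globally frequent itemset must be locally frequent in at least one partition, so size-(k+1) candidates need only be generated from k-itemsets that are frequent together at a common site. This omits FPM's global pruning and its count-exchange details entirely.

```python
from itertools import combinations

def distributed_prune(local_frequent_k):
    """Generate pruned (k+1)-candidates from per-site sets of locally
    frequent k-itemsets (tuples of sorted items). A candidate survives
    only if all its k-subsets are locally frequent at one common site."""
    candidates = set()
    for site_sets in local_frequent_k:  # one set of frequent k-itemsets per site
        for a, b in combinations(sorted(site_sets), 2):
            union = tuple(sorted(set(a) | set(b)))
            if len(union) != len(a) + 1:
                continue  # a and b must differ in exactly one item
            if all(s in site_sets for s in combinations(union, len(a))):
                candidates.add(union)
    return candidates

# site 1 finds {a,b}, {a,c}, {b,c} frequent; site 2 finds {a,b}, {b,d}
local_frequent = [
    {("a", "b"), ("a", "c"), ("b", "c")},
    {("a", "b"), ("b", "d")},
]
candidates = distributed_prune(local_frequent)
```

    Here {a,b,d} is pruned because {a,d} is not locally frequent at site 2, while {a,b,c} survives via site 1; with high skew, few itemsets are frequent at multiple sites and the candidate set shrinks sharply, matching the sensitivity to skewness the abstract reports.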
