6,342 research outputs found

    An efficient parallel method for mining frequent closed sequential patterns

    Get PDF
    Mining frequent closed sequential pattern (FCSPs) has attracted a great deal of research attention, because it is an important task in sequences mining. In recently, many studies have focused on mining frequent closed sequential patterns because, such patterns have proved to be more efficient and compact than frequent sequential patterns. Information can be fully extracted from frequent closed sequential patterns. In this paper, we propose an efficient parallel approach called parallel dynamic bit vector frequent closed sequential patterns (pDBV-FCSP) using multi-core processor architecture for mining FCSPs from large databases. The pDBV-FCSP divides the search space to reduce the required storage space and performs closure checking of prefix sequences early to reduce execution time for mining frequent closed sequential patterns. This approach overcomes the problems of parallel mining such as overhead of communication, synchronization, and data replication. It also solves the load balance issues of the workload between the processors with a dynamic mechanism that re-distributes the work, when some processes are out of work to minimize the idle CPU time.Web of Science5174021739

    Algorithms for Extracting Frequent Episodes in the Process of Temporal Data Mining

    Get PDF
    An important aspect in the data mining process is the discovery of patterns having a great influence on the studied problem. The purpose of this paper is to study the frequent episodes data mining through the use of parallel pattern discovery algorithms. Parallel pattern discovery algorithms offer better performance and scalability, so they are of a great interest for the data mining research community. In the following, there will be highlighted some parallel and distributed frequent pattern mining algorithms on various platforms and it will also be presented a comparative study of their main features. The study takes into account the new possibilities that arise along with the emerging novel Compute Unified Device Architecture from the latest generation of graphics processing units. Based on their high performance, low cost and the increasing number of features offered, GPU processors are viable solutions for an optimal implementation of frequent pattern mining algorithmsFrequent Pattern Mining, Parallel Computing, Dynamic Load Balancing, Temporal Data Mining, CUDA, GPU, Fermi, Thread

    Dynamic load balancing for the distributed mining of molecular structures

    Get PDF
    In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiverinitiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids

    Porting Decision Tree Algorithms to Multicore using FastFlow

    Full text link
    The whole computer hardware industry embraced multicores. For these machines, the extreme optimisation of sequential algorithms is no longer sufficient to squeeze the real machine power, which can be only exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable to be parallelised. This paper presents an approach for easy-yet-efficient porting of an implementation of the C4.5 algorithm on multicores. The parallel porting requires minimal changes to the original sequential code, and it is able to exploit up to 7X speedup on an Intel dual-quad core machine.Comment: 18 pages + cove

    Multi-threaded Implementation of Association Rule Mining with Visualization of the Pattern Tree

    Get PDF
    Motor Vehicle fatalities per 100,000 population in the United States has been reported to be 10.69% in the year 2012 as per NHTSA (National Highway Traffic Safety Administration). The fatality rate has increased by 0.27% in 2012 compared to the rate in the year 2011. As per the reports, there are many factors involved in increasing the fatality rate drastically such as driving under influence, testing while driving, and various other weather phenomena. Decision makers need to analyze the factors attributing to the increase in an accident rate to take implied measures. Current methods used to perform the data analysis process has to be reformed and optimized to make policies for controlling the high traffic accident rates. This research work is an extension to the data-mining algorithm implementation Most Associated Sequential Pattern (MASP). MASP uses association rule mining approach to mine interesting traffic accident data using a modified version of FP-growth algorithm. Owing to the huge amounts of available traffic accident data, MASP algorithm needs to be further modified to make it more efficient with respect to both space and time. Therefore, we present a parallel implementation to the MASP algorithm. In addition to this, pattern tree and apriori-tid algorithm implementation has been done. The application is designed in C# using .NET Framework and C# Task Parallel Library

    A multithreaded hybrid framework for mining frequent itemsets

    Get PDF
    Mining frequent itemsets is an area of data mining that has beguiled several researchers in recent years. Varied data structures such as Nodesets, DiffNodesets, NegNodesets, N-lists, and Diffsets are among a few that were employed to extract frequent items. However, most of these approaches fell short either in respect of run time or memory. Hybrid frameworks were formulated to repress these issues that encompass the deployment of two or more data structures to facilitate effective mining of frequent itemsets. Such an approach aims to exploit the advantages of either of the data structures while mitigating the problems of relying on either of them alone. However, limited efforts have been made to reinforce the efficiency of such frameworks. To address these issues this paper proposes a novel multithreaded hybrid framework comprising of NegNodesets and N-list structure that uses the multicore feature of today’s processors. While NegNodesets offer a concise representation of itemsets, N-lists rely on List intersection thereby speeding up the mining process. To optimize the extraction of frequent items a hash-based algorithm has been designed here to extract the resultant set of frequent items which further enhances the novelty of the framework

    Parallel Methods for Mining Frequent Sequential patterns

    Get PDF
    The explosive growth of data and the rapid progress of technology have led to a huge amount of data that is collected every day. In that data volume contains much valuable information. Data mining is the emerging field of applying statistical and artificial intelligence techniques to the problem of finding novel, useful and non-trivial patterns from large databases. It is the task of discovering interesting patterns from large amounts of data. This is achieved by determining both implicit and explicit unidentified patterns in data that can direct the process of decision making. There are many data mining tasks, such as classification, clustering, association rule mining and sequential pattern mining. In that, sequential pattern mining is an important problem in data mining. It provides an effective way to analyze the sequence data. The goal of sequential pattern mining is to discover interesting, unexpected and useful patterns from sequence databases. This task is used in many wide applications such as financial data analysis of banks, retail industry, customer shopping history, goods transportation, consumption and services, telecommunication industry, biological data analysis, scientific applications, network intrusion detection, scientific research, etc. Different types of sequential pattern mining can be performed, they are sequential patterns, maximal sequential patterns, closed sequences, constraint based and time interval based sequential patterns. Sequential pattern mining refers to the identification of frequent subsequences in sequence databases as patterns. In the last two decades, researchers have proposed many techniques and algorithms for extracting the frequent sequential patterns, in which the downward closure property plays a fundamental role. Sequential pattern is a sequence of itemsets that frequently occur in a specific order, where all items in the same itemsets are supposed to have the same transaction time value. One of the challenges for sequential pattern mining is the computational costs beside that is the potentially huge number of extracted patterns. In this thesis, we present an overview of the work done for sequential pattern mining and develop parallel methods for mining frequent sequential patterns in sequence databases that can tackle emerging data processing workloads while coping with larger and larger scales.The explosive growth of data and the rapid progress of technology have led to a huge amount of data that is collected every day. In that data volume contains much valuable information. Data mining is the emerging field of applying statistical and artificial intelligence techniques to the problem of finding novel, useful and non-trivial patterns from large databases. It is the task of discovering interesting patterns from large amounts of data. This is achieved by determining both implicit and explicit unidentified patterns in data that can direct the process of decision making. There are many data mining tasks, such as classification, clustering, association rule mining and sequential pattern mining. In that, sequential pattern mining is an important problem in data mining. It provides an effective way to analyze the sequence data. The goal of sequential pattern mining is to discover interesting, unexpected and useful patterns from sequence databases. This task is used in many wide applications such as financial data analysis of banks, retail industry, customer shopping history, goods transportation, consumption and services, telecommunication industry, biological data analysis, scientific applications, network intrusion detection, scientific research, etc. Different types of sequential pattern mining can be performed, they are sequential patterns, maximal sequential patterns, closed sequences, constraint based and time interval based sequential patterns. Sequential pattern mining refers to the identification of frequent subsequences in sequence databases as patterns. In the last two decades, researchers have proposed many techniques and algorithms for extracting the frequent sequential patterns, in which the downward closure property plays a fundamental role. Sequential pattern is a sequence of itemsets that frequently occur in a specific order, where all items in the same itemsets are supposed to have the same transaction time value. One of the challenges for sequential pattern mining is the computational costs beside that is the potentially huge number of extracted patterns. In this thesis, we present an overview of the work done for sequential pattern mining and develop parallel methods for mining frequent sequential patterns in sequence databases that can tackle emerging data processing workloads while coping with larger and larger scales.460 - Katedra informatikyvyhově

    Frequent itemset mining on multiprocessor systems

    Get PDF
    Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web-mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, there have been many frequent-itemset mining algorithms proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures force the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradations. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also the other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining twofold: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show a good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding is repeatedly required for loading and mining the datasets, we reduce its costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms’ intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data’s size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. For coping with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Our algorithms are already single-threaded often up an order of magnitude faster than existing highly optimized algorithms and further scale almost linear on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining of other types of itemsets
    corecore