37 research outputs found

    Efficient mining of discriminative molecular fragments

    Frequent pattern discovery in structured data is receiving increasing attention in many areas of science. However, the computational complexity and the large amount of data to be explored often make sequential algorithms unsuitable. In this context, high-performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimate is available for task workloads, which follow a power-law distribution over a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well-known National Cancer Institute’s HIV-screening dataset.

    An improved approach of FP-Growth tree for Frequent Itemset Mining using Partition Projection and Parallel Projection Techniques

    Data mining is about analyzing data and extracting information from it. It is a timely and interesting topic, as more and more data is stored in databases. Its most important uses include analysis of customer purchasing behavior, market-basket analysis, campaign management, customer relationship management, web usage mining (web mining), and text mining. Current technology makes it possible to store data about almost anything, such as a person, a place, a shop, or an organization. Analysis shows that FP-growth is more efficient in terms of tree construction than Apriori and Tree Projection, and that Tree Projection is faster and more scalable than Apriori. The parallel projection technique is reported to be more scalable than partition projection, while partition projection saves memory and works well for dispersed datasets. When FP-growth and Tree Projection are compared on their benefits, Apriori turns out to be the least convenient of the three. The advantage of FP-growth over Apriori becomes clear when the dataset contains an enormous number of combinations of short frequent patterns. An implementation of the FP-growth tree combined with projection techniques, in particular partition projection, is to be carried out in order to reduce the execution time of FP-growth tree construction.
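
    The tree construction that the abstract credits for FP-growth's efficiency can be illustrated with a minimal sketch. It is not the authors' implementation and does not cover partition or parallel projection; the names `FPNode`, `build_fp_tree`, and `count_nodes` are hypothetical, and ties in item frequency are broken alphabetically as an assumption.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item plus how many transactions pass through it."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree in two passes over the transactions."""
    # Pass 1: count in how many transactions each item occurs.
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, count in freq.items() if count >= min_support}
    root = FPNode(None)
    # Pass 2: insert each transaction with its frequent items ordered by
    # descending frequency, so that common prefixes share nodes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in keep),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())
```

    Comparing `count_nodes(root)` with the total number of item occurrences in the input gives a rough sense of how much the shared prefixes compress the dataset.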

    A Survey of Parallel Data Mining

    With the fast, continuous increase in the number and size of databases, parallel data mining is a natural and cost-effective approach to tackle the problem of scalability in data mining. Recently there has been considerable research on parallel data mining. However, most projects focus on the parallelization of a single kind of data mining algorithm or paradigm. This paper surveys parallel data mining from a broader perspective. More precisely, we discuss the parallelization of data mining algorithms from four knowledge discovery paradigms, namely rule induction, instance-based learning, genetic algorithms, and neural networks. Using the lessons learned from this discussion, we also derive a set of heuristic principles for designing efficient parallel data mining algorithms.

    Efficient parallel mining of association rules on shared-memory multiple-processor machine

    In this paper we consider the problem of parallel mining of association rules on a shared-memory multiprocessor system. Two efficient algorithms, PSM and HSM, are proposed. PSM adopts two powerful candidate-set pruning techniques, distributed pruning and global pruning, to reduce the number of candidates. HSM further uses an I/O reduction strategy to enhance its performance. We have implemented PSM and HSM on an SGI Power Challenge parallel machine. The performance studies show that PSM and HSM outperform CD-SM, a shared-memory parallel version of the popular Apriori algorithm.
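
    For orientation, the baseline idea behind CD-SM (Apriori with support counts accumulated across data partitions) can be sketched as follows. This is only an illustrative, sequential sketch: it is not PSM or HSM, it does not implement distributed or global pruning, and the function names are hypothetical. Each partition stands in for the data handled by one processor.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, candidates):
    """Count how many transactions in one partition contain each candidate."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets and drop any candidate
    that has an infrequent (k-1)-subset (the classic Apriori pruning step)."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori_count_distribution(partitions, min_support):
    """Return {itemset: global support} for all frequent itemsets."""
    items = {item for part in partitions for t in part for item in t}
    candidates = {frozenset([item]) for item in items}
    result = {}
    k = 1
    while candidates:
        total = Counter()
        for part in partitions:  # each iteration stands in for one processor
            total.update(local_counts(part, candidates))
        frequent = {c for c in candidates if total[c] >= min_support}
        result.update({c: total[c] for c in frequent})
        k += 1
        candidates = generate_candidates(frequent, k)
    return result
```

    In a real shared-memory version the per-partition counting would run concurrently and the global sum would need synchronization; the sketch keeps only the data flow.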

    Dynamic load balancing for the distributed mining of molecular structures

    In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
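
    The general idea of receiver-initiated load balancing on an irregular search tree can be illustrated with a small round-based simulation. This is a sketch under stated assumptions, not the paper's algorithm: the toy tree in `expand`, the choice of the most loaded worker as donor, and the rule of handing over half of the donor's queue are all hypothetical.

```python
from collections import deque


def expand(node):
    """Hypothetical irregular search tree: a node (depth, fanout) has `fanout`
    children with fanouts 0 .. fanout-1, so subtree sizes differ widely."""
    depth, fanout = node
    return [(depth + 1, child) for child in range(fanout)]


def run(num_workers, root):
    """Simulate the workers in rounds; return (nodes processed per worker, steals)."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].append(root)  # all work starts on worker 0
    processed = [0] * num_workers
    steals = 0
    while any(queues):
        for w in range(num_workers):
            if not queues[w]:
                # Receiver-initiated: the idle worker is the one that asks for work.
                donor = max(range(num_workers), key=lambda i: len(queues[i]))
                share = len(queues[donor]) // 2
                if share == 0:
                    continue  # nobody has work to spare in this round
                for _ in range(share):
                    # Take the donor's oldest nodes, leaving its newest ones in place.
                    queues[w].append(queues[donor].popleft())
                steals += 1
            node = queues[w].pop()  # depth-first on the worker's own queue
            processed[w] += 1
            queues[w].extend(expand(node))
    return processed, steals
```

    Because no workload estimate is used, the partitioning of the search space emerges only from the requests of idle workers, which is the property the abstract highlights.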