9 research outputs found

    Parallel and Distributed Data Mining

    A COMPREHENSIVE SURVEY ON PRIVACY PRESERVING DISTRIBUTED DATA MINING WITH EVOLUTIONARY COMPUTING

    Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases. The term "knowledge discovery" is more general than "data mining": data mining is usually regarded as one step in the overall knowledge discovery process, although the two terms are often treated as synonyms in the computer literature. Publishing data about individuals without revealing sensitive information about them is an important problem. In many applications, data mining has to be done in distributed data scenarios. Distributed data mining (DDM) techniques have become necessary for large and multi-scenario datasets that require heterogeneous, distributed resources. In such situations, data owners may be concerned about misuse of their data and hence may not want it to be mined, especially when it contains sensitive information. Distributed data mining techniques use sensitive data from distributed databases held by different parties, which conflicts directly with an individual's need for, and right to, privacy. It is thus crucial to develop adequate security techniques for protecting the privacy of the individual values used for data mining. In this paper, we consider a privacy-preserving naïve Bayes classifier for horizontally partitioned distributed data and propose a data mining privacy by decomposition (DMPD) method that uses a genetic algorithm to search for an optimal feature set partitioning under classification accuracy and k-anonymity constraints. We study how to maintain privacy in distributed data mining and how two or more parties can find frequent itemsets in distributed databases without revealing each party's portion of the data to the others. The paper also examines the privacy issues of privacy-preserving distributed data mining (PPDDM) from a wider perspective and investigates various approaches that can help provide privacy for sensitive information in distributed databases (DDBs).
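
The k-anonymity constraint mentioned in this abstract can be illustrated with a minimal sketch. It is not the DMPD method itself. It only checks the standard definition: every combination of quasi-identifier values must occur in at least k records. The function name `is_k_anonymous`, the column names, and the sample rows are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def is_k_anonymous(rows: List[Dict[str, str]],
                   quasi_identifiers: Sequence[str],
                   k: int) -> bool:
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Group records by their quasi-identifier values and count each group.
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized records (illustration only).
rows = [
    {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "130**", "diagnosis": "cold"},
    {"age": "40-49", "zip": "148**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "148**", "diagnosis": "asthma"},
]

print(is_k_anonymous(rows, ["age", "zip"], k=2))  # each group has 2 records
print(is_k_anonymous(rows, ["age", "zip"], k=3))  # no group reaches 3 records
```

In a feature-partitioning search such as the one the abstract describes, a check of this kind would plausibly act as a feasibility test on each candidate partition, with classification accuracy as the objective.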

    Knowledge Discovery on the Grid

    Frequent itemset mining on multiprocessor systems

    Frequent itemset mining is an important building block in many data mining applications such as market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network.
    Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their costs by providing parallel encodings that achieve high throughput for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data's size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining other types of itemsets.
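
Hot spot (4), intersecting lists of sorted integers, can be illustrated with a minimal serial sketch. It is not the thesis's parallel implementation. It assumes a vertical layout in which each item maps to an ascending list of transaction ids, so that the support of an itemset is the length of the intersection of its items' lists. The helper names (`intersect_sorted`, `build_tid_lists`, `support`) and the sample transactions are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def intersect_sorted(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Merge-style intersection of two ascending integer lists."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def build_tid_lists(transactions: Sequence[Iterable[str]]) -> Dict[str, List[int]]:
    """Map each item to the ascending list of ids of transactions containing it."""
    tid_lists: Dict[str, List[int]] = {}
    for tid, items in enumerate(transactions):
        for item in set(items):  # ignore duplicate items within one transaction
            tid_lists.setdefault(item, []).append(tid)
    return tid_lists


def support(itemset: Iterable[str], tid_lists: Dict[str, List[int]]) -> int:
    """Count the transactions that contain every item of a non-empty itemset."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((tid_lists.get(item, []) for item in itemset), key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
    return len(result)


# Hypothetical market-basket data (illustration only).
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
tid_lists = build_tid_lists(transactions)
print(support(["bread", "milk"], tid_lists))            # transactions 0 and 2
print(support(["bread", "milk", "butter"], tid_lists))  # transaction 2 only
```

The merge loop touches each list element at most once, so its cost is linear in the combined length of the two lists. The abstract names this operation as one of the building blocks it parallelizes.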

    Data mining in low-voltage networks using genetic algorithms (Mineração de dados em redes de baixa tensão usando algoritmos genéticos)

    Master's dissertation - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação. Several problems affect electric power distribution networks…

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its wide public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining.

    A high-performance distributed algorithm for mining association rules
