
    High Utility Itemsets Identification in Big Data

    High utility itemset mining is an important data mining problem that considers profit as well as quantity in a transactional database. It helps identify the most valuable products/items, which are difficult to track using frequent itemset mining alone: an item with a high profit value might be rare in the transactional database despite its importance. Many existing algorithms generate comparatively large candidate sets while finding high utility itemsets, so the major focus is to reduce the computational time significantly by introducing pruning strategies. Another challenge in high utility itemset mining is handling large datasets: very few algorithms can mine high utility itemsets from a large dataset on a parallel (distributed) system. This thesis proposes two methods: 1) high utility itemset mining using a pruning-strategies approach (HUI-PR) and 2) parallel EFIM (EFIM-Par). In method I, the proposed algorithm constructs the candidate sets as a tree structure that traverses the itemsets with high transaction-weighted utilization (HTWUIs). It uses pruning strategies to reduce the computational time, refraining from visiting unnecessary nodes so as to shrink the search space, and it significantly reduces the transaction database generated at each node. In method II, a distributed approach is proposed that divides the search space among different worker nodes to compute high utility itemsets, which are then aggregated to form the result. The experimental results for both methods show that they significantly improve the execution time for computing high utility itemsets.
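
    Below is a minimal, self-contained Python sketch of the transaction-weighted utilization (TWU) pruning property that algorithms in this family rely on. It is an illustration only, not the thesis's HUI-PR algorithm; the transactions, the `minutil` threshold, and all variable names are invented for the example.

```python
# Sketch of TWU-based pruning, assuming each transaction maps items to
# their utility (quantity * unit profit). Not the HUI-PR implementation.
from collections import defaultdict

transactions = [
    {"a": 5, "b": 2, "c": 1},
    {"a": 10, "c": 6},
    {"b": 4, "c": 3, "d": 12},
]
minutil = 25  # hypothetical minimum utility threshold

# TWU(item) = sum of the utilities of every transaction containing the item.
twu = defaultdict(int)
for t in transactions:
    t_util = sum(t.values())
    for item in t:
        twu[item] += t_util

# Pruning property: if TWU(x) < minutil, no itemset containing x can be a
# high utility itemset, so the search subtree rooted at x is never visited.
promising = {item for item, u in twu.items() if u >= minutil}
print(sorted(promising))  # ['b', 'c']: 'a' (TWU 24) and 'd' (TWU 19) are pruned
```

    Because TWU is an upper bound on the utility of any superset, discarding an item here safely discards every itemset that would have contained it, which is what keeps the candidate tree small.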

    MR-AT: Map Reduce based Apriori Technique for Sequential Pattern Mining using Big Data in Hadoop

    One of the most well-known and widely implemented data mining methods is the Apriori algorithm, which mines frequent itemsets. The effectiveness of the Apriori algorithm has been improved by a number of algorithms introduced on both parallel and distributed platforms in recent years. They are distinct from one another in the method of load balancing, the memory system, the method of data decomposition, and the data layout used in their implementation. The majority of the issues that arise with distributed frameworks are associated with the operating costs of managing distributed systems and the absence of high-level parallel programming languages. In addition, when using grid computing, there is always a possibility that a node will fail, which results in the task being re-executed multiple times. The MapReduce approach developed by Google can be used to solve these kinds of issues. MapReduce is a programming model for large-scale distributed processing of data on large clusters of commodity computers; it is effective, scalable, and easy to use, and it is also utilised in cloud computing. This research paper presents an enhanced version of the Apriori algorithm, referred to as Improved Parallel and Distributed Apriori (IPDA), based on the scalable Hadoop MapReduce environment and used to analyse Big Data. By generating split-frequent data regionally and eliminating infrequent data early, the proposed work aims primarily to reduce the enormous demands placed on available resources as well as the communication overhead incurred whenever frequent data are retrieved. The paper presents test results demonstrating that IPDA outperforms the traditional Apriori and parallel and distributed Apriori in terms of the time required, the number of rules created, and various minimum support values.
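
    The following is a hedged, single-process Python sketch of one candidate-counting pass in MapReduce-style Apriori, the general idea behind algorithms such as IPDA. It simulates the map and reduce phases locally; the transactions, the `min_support` threshold, and the function names are invented for the example, not taken from the paper.

```python
# One simulated MapReduce pass of Apriori candidate counting (size-k pass).
from itertools import combinations
from collections import defaultdict

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
min_support = 2
k = 2  # candidate itemset size for this pass

def map_phase(transaction):
    # Emit (candidate, 1) for every size-k itemset in the transaction.
    for candidate in combinations(sorted(transaction), k):
        yield candidate, 1

def reduce_phase(pairs):
    # Sum the counts per candidate, then apply the support threshold.
    counts = defaultdict(int)
    for candidate, one in pairs:
        counts[candidate] += one
    return {c: n for c, n in counts.items() if n >= min_support}

pairs = (kv for t in transactions for kv in map_phase(t))
print(reduce_phase(pairs))
# {('bread', 'butter'): 2, ('bread', 'milk'): 2, ('butter', 'milk'): 2}
```

    In a real Hadoop job the mappers run on data splits across the cluster and the framework's shuffle delivers each candidate's counts to one reducer; local pruning of infrequent candidates before the shuffle is what cuts the communication overhead the abstract mentions.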

    High Performance Frequent Subgraph Mining on Transactional Datasets

    Graph data mining has become a crucial and unavoidable area of research. Large amounts of graph data are produced in many areas, such as bioinformatics, cheminformatics, social networks, and the Web. Scalable graph data mining methods are becoming increasingly popular and necessary as graph complexity grows. Frequent subgraph mining is one such area, where the task is to find overly recurring patterns/subgraphs. To tackle this problem, many main-memory-based methods were proposed, which proved to be inefficient as data sizes grew exponentially over time. In the past few years, several research groups have attempted to handle the frequent subgraph mining (FSM) problem in multiple ways. Many authors have tried to achieve better performance using Graphics Processing Units (GPUs), which offer multi-fold improvements over in-memory methods when dealing with large datasets. Later, Google's MapReduce model with the Hadoop framework proved to be a major breakthrough in high-performance large batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative model could not help much in the FSM domain, since subgraph mining is an iterative process. In recent years, Spark has emerged as the de facto industry standard with its distributed in-memory computing capability, which is also the right fit for an iterative style of programming. In this work, we cover how high-performance computing has improved performance tremendously for transactional datasets of both directed and undirected graphs, and we compare various FSM techniques based on experimental results.
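
    As a rough illustration of why Spark's cached in-memory RDDs suit iterative FSM better than disk-bound MapReduce, here is a hedged PySpark sketch of the support-counting core over transactional graphs. It simplifies candidate patterns to single labeled edges (real miners grow patterns iteratively with canonical-form checks, e.g. gSpan's DFS codes), and all names and data are invented for the example.

```python
# Hedged sketch: first support-counting round of transactional FSM on Spark.
from pyspark import SparkContext

sc = SparkContext("local[*]", "fsm-sketch")

# Each transaction is a small labeled graph, represented as a set of edges.
graphs = sc.parallelize([
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("A", "C")},
    {("B", "C"), ("A", "C")},
]).cache()  # cached once in memory, reusable on every mining iteration

min_support = 2

# Round 1: count single-edge patterns; a real miner would extend the
# frequent ones and rescan the cached RDD each round instead of
# rereading the dataset from disk as a MapReduce job would.
frequent_edges = (
    graphs.flatMap(lambda g: [(edge, 1) for edge in g])
          .reduceByKey(lambda a, b: a + b)
          .filter(lambda kv: kv[1] >= min_support)
          .collect()
)
print(frequent_edges)
sc.stop()
```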

    Efficient Discovery of Interesting Patterns Based on Strong Closedness


    Mining Interesting Patterns in Multi-Relational Data


    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
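
    To make the programming model concrete, here is a minimal single-process Python sketch of the map/reduce contract the survey describes. In a real system the framework, not the user, performs the grouping ("shuffle") step along with scheduling and fault tolerance; everything here is illustrative rather than any particular system's API.

```python
# The user supplies only map_fn and reduce_fn; the framework does the rest.
from collections import defaultdict

def map_fn(_, line):
    # map: (key, value) -> intermediate (key, value) pairs
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: (key, [values]) -> final (key, value)
    return word, sum(counts)

documents = ["map reduce map", "reduce shuffle sort reduce"]

# "Shuffle" phase, normally done by the framework across the cluster:
# group all intermediate pairs by key.
grouped = defaultdict(list)
for i, doc in enumerate(documents):
    for word, one in map_fn(i, doc):
        grouped[word].append(one)

print(dict(reduce_fn(w, cs) for w, cs in grouped.items()))
# {'map': 2, 'reduce': 3, 'shuffle': 1, 'sort': 1}
```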