1,124 research outputs found

    Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster

    Designing fast and scalable algorithms for mining frequent itemsets has always been a prominent and promising problem in data mining. Apriori is one of the most widely used algorithms for frequent itemset mining. Designing efficient algorithms on the MapReduce framework to process and analyze big datasets is an active research topic. In this paper, we focus on the performance of MapReduce-based Apriori on both homogeneous and heterogeneous Hadoop clusters. We investigate a number of factors that significantly affect the execution time of MapReduce-based Apriori on such clusters. The factors cover both algorithmic and non-algorithmic improvements. The algorithmic factors considered are filtered transactions and data structures. Experimental results show that an appropriate data structure and the filtered-transactions technique drastically reduce the execution time. The non-algorithmic factors include speculative execution, nodes with poor performance, data locality and distribution of data blocks, and parallelism control with input split size. We applied strategies against these factors and fine-tuned the relevant parameters for our particular application. Experimental results show that taking care of cluster-specific parameters yields a significant reduction in execution time. We also discuss issues in the MapReduce implementation of Apriori that may significantly influence its performance. Comment: 8 pages, 8 figures, International Conference on Computing, Communication and Automation (ICCCA2016)
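To make the level-wise structure of MapReduce-based Apriori concrete, here is a minimal single-process sketch in Python. It is not the authors' implementation: the function names (`map_count`, `reduce_count`, `apriori`) are hypothetical, plain Python sets stand in for the data structures compared in the paper, and "filtered transactions" is interpreted here, as an assumption, as dropping items that belong to no frequent (k-1)-itemset and skipping transactions that become too short.

```python
from collections import Counter
from itertools import combinations


def map_count(transactions, k, frequent_prev):
    """Mapper for pass k: emit (candidate k-itemset, 1) for one input split."""
    # Assumed filtering step: keep only items seen in a frequent (k-1)-itemset.
    allowed = set().union(*frequent_prev) if frequent_prev else None
    for t in transactions:
        items = sorted(set(t) if allowed is None else set(t) & allowed)
        if len(items) < k:
            continue  # a filtered transaction cannot contain any k-itemset
        for cand in combinations(items, k):
            # Apriori pruning: every (k-1)-subset must itself be frequent.
            if k == 1 or all(
                frozenset(s) in frequent_prev for s in combinations(cand, k - 1)
            ):
                yield frozenset(cand), 1


def reduce_count(pairs, min_support):
    """Reducer: sum the counts per itemset and keep those meeting min_support."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}


def apriori(splits, min_support):
    """Driver: one map/reduce round per itemset size until no itemset survives."""
    frequent, prev, k = {}, set(), 1
    while True:
        pairs = (p for split in splits for p in map_count(split, k, prev))
        level = reduce_count(pairs, min_support)
        if not level:
            return frequent
        frequent.update(level)
        prev = set(level)
        k += 1


# Two "input splits" of a toy transaction database.
splits = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "c"], ["a", "b", "c"]],
]
result = apriori(splits, min_support=3)
```

On a real cluster each pass would be a separate Hadoop job, with the frequent (k-1)-itemsets shipped to every mapper; the loop above only mimics that control flow.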

    Monitorable network and CPU load statistics and their application to scheduling

    Recent trends in high-speed computing have moved towards the use of networks of workstations as a cost-effective approach to parallel computing. One recently proposed solution involves the use of an existing network of workstation-class computers as a single multiprocessor, and much research is ongoing in this area. This dissertation describes work in the area of process scheduling on networks of workstations, specifically in the area of load analysis. After extensive background in the field is presented, measures of CPU and network load are defined, and a test parallel application program is presented, written for a network-multiprocessing software package called PVM. A series of experiments is then detailed, whose goal was to discover the relationship between the run time of the test application and the loads on the participating workstations and networks. The experiments include measurement of CPU loading and network loading during test application runs, during artificially elevated loads, and during quiet conditions. Results of the experiments are presented, and the applications of the results to the problem of task scheduling are examined. It is then claimed that several easily obtained load measures are useful for task scheduling, because they allow run time to be predicted within a margin of error and allow limiting network segments to be detected and avoided.

    Viper : a visualisation tool for parallel program construction

    +133 pp.; 24 cm

    Distributed Computing in a Cloud of Mobile Phones

    Over the past few years we have seen exponential growth in the number of mobile devices and in their computation, storage and communication capabilities. We have also seen an increase in the amount of data generated by mobile devices while performing common tasks. Additionally, the ubiquity of these devices makes it reasonable to consider a different use for them, in which they take an important part in the computation of more demanding applications rather than relying exclusively on external servers. The number of resource-demanding applications is also increasing, and these resort to services offered by infrastructure Clouds. However, the use of these Cloud services raises many problems, such as considerable energy and bandwidth consumption, high latency, and unavailability of connectivity infrastructure due to congestion or to its non-existence. Considering all of the above, for some applications it starts to make sense to do part or all of the computation locally on the mobile devices themselves. We propose a distributed computing framework, able to process a batch or a stream of data generated by a cloud composed of mobile devices, that does not require Internet services. Unlike the current state of the art, where both computation and data are offloaded to mobile devices, our system moves the computation to where the data is, significantly reducing the amount of data exchanged between mobile devices. Based on the evaluation performed, in both a real and a simulated environment, our framework proved to be scalable: it benefits significantly from using several devices to handle computation, and it supports multiple devices submitting computation requests without a significant increase in request latency. It also proved able to deal with churn without being heavily penalized by it.

    MapReduce analysis for cloud-archived data

    Public storage clouds have become a popular choice for archiving certain classes of enterprise data - for example, application and infrastructure logs. These logs contain sensitive information such as IP addresses or user logins, so regulatory and security requirements often require the data to be encrypted before it is moved to the cloud. To derive any business value from such data, analytics systems (e.g. Hadoop/MapReduce) first download the data from these public clouds, decrypt it, and then process it at the secure enterprise site. We propose VNCache: an efficient solution for MapReduce analysis of such cloud-archived log data without requiring an a priori data transfer and loading into the local Hadoop cluster. VNCache dynamically integrates cloud-archived data into a virtual namespace at the enterprise Hadoop cluster. Through a seamless data streaming and prefetching model, Hadoop jobs can begin execution as soon as they are launched, without requiring any a priori downloading. With VNCache's accurate prefetching and caching, jobs often run on a locally cached copy of the data block, significantly improving performance. When no longer needed, data is safely evicted from the enterprise cluster, reducing the total storage footprint. Uniquely, VNCache is implemented with no changes to the Hadoop application stack. © 2014 IEEE

    Parallel processing of streaming media on heterogeneous hosts using work stealing

    Master's; MASTER OF SCIENCE

    Design and Implementation of Parallel Computing Models for Solar Radiation Simulation

    To simulate geographical phenomena, scientists have developed many complex, high-precision models. Most of the time, however, common hardware and implementations of those computation models cannot process large amounts of data, and the time performance may be unacceptable. The speed of modern graphics processing units is growing rapidly, and the flops/dollar ratio provided by GPUs is also growing very fast, which is making large-scale GPU clusters popular in the scientific computing community. However, GPU programming and the deployment and development of cluster software are associated with a number of challenges. In this thesis, the geoscience model developed by I. D. Dobreva and M. P. Bishop and proposed in A Spatial Temporal, Topographic and Spectral GIS based Solar Radiation Model (SRM) was analyzed. I built a heterogeneous cluster and developed a software framework for it that can provide a powerful computation service for complex geographic models. Time performance and computation accuracy have been analyzed. Issues and challenges such as GPU programming, job balancing and scheduling are addressed. The SRM application running on this framework processes data fast enough to give researchers rendered images as feedback in a short time. This improves performance by hundreds of times compared with the current performance on our available hardware, and the speedup can easily be scaled by adding new machines.

    A study of distributed clustering of vector time series on the grid by task farming

    Traditional data mining methods were limited by the availability of computing resources such as network bandwidth, storage space and processing power. These algorithms were developed to work around this problem by looking at a small cross-section of the whole data available. However, since a major chunk of the data was left out, the predictions were generally inaccurate and missed significant features that were part of the data. Today, with resources growing at almost the same pace as data, it is possible to rethink mining algorithms to work on distributed resources and, essentially, distributed data. Distributed data mining thus holds great promise. Using grid technologies, data mining can be extended to areas that were not previously explored because of the volume of data being generated, such as climate modeling and web usage. An important characteristic of data today is that it is highly decentralized and mostly redundant. Data mining algorithms that can make efficient use of distributed data have to be devised. Though it is possible to bring all the data together and run traditional algorithms, this has a high overhead in terms of bandwidth usage for transmission and of preprocessing steps that must handle every format of the received data. By processing the data locally, the preprocessing stage can be made less bulky, and the traditional data mining techniques can work on the data efficiently. The focus of this project is to apply an existing data mining technique, fuzzy c-means clustering, to distributed data in a simulated grid environment and to review the performance of this approach relative to the traditional approach.
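As an illustration of how fuzzy c-means can run over data that stays at separate sites, the following Python sketch has each site compute partial sums for the current cluster centers and a coordinator merge them. The abstract does not say how the project partitions the work, so this split, and the names `memberships`, `local_sums`, `merge` and `distributed_fcm`, are assumptions made for the example; only the standard membership and center update formulas are used.

```python
import math


def memberships(x, centers, m=2.0):
    """Fuzzy membership of point x in each cluster; the values sum to 1."""
    d = [math.dist(x, c) for c in centers]
    if any(di == 0.0 for di in d):
        # The point coincides with a center: give it full membership there.
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        return [h / sum(hits) for h in hits]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exp for dk in d) for di in d]


def local_sums(points, centers, m=2.0):
    """Worker task: per-center weighted sums computed on one site's data."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]
    den = [0.0] * len(centers)
    for x in points:
        for i, u in enumerate(memberships(x, centers, m)):
            w = u ** m
            den[i] += w
            for a in range(dim):
                num[i][a] += w * x[a]
    return num, den


def merge(partials, old_centers):
    """Coordinator: combine the sites' partial sums into updated centers."""
    dim = len(old_centers[0])
    new_centers = []
    for i, old in enumerate(old_centers):
        den = sum(p[1][i] for p in partials)
        if den == 0.0:
            new_centers.append(list(old))  # no weight at all: keep the center
            continue
        new_centers.append(
            [sum(p[0][i][a] for p in partials) / den for a in range(dim)]
        )
    return new_centers


def distributed_fcm(sites, centers, m=2.0, iters=50):
    """Run a fixed number of rounds; only sums and centers cross site borders."""
    for _ in range(iters):
        centers = merge([local_sums(pts, centers, m) for pts in sites], centers)
    return centers


# Two sites, each holding part of a toy 2-D data set.
sites = [[(0.0, 0.0), (0.1, 0.0)], [(5.0, 5.0), (5.1, 5.0)]]
start = [(1.0, 1.0), (4.0, 4.0)]
centers = distributed_fcm(sites, start)
pooled = distributed_fcm([sites[0] + sites[1]], start)
```

Because the center update is a ratio of sums over points, merging per-site sums gives the same centers as pooling all the data, while only a few numbers per cluster are transmitted in each round.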