1,124 research outputs found

Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster

Author: Garg Rakhi
Mishra P. K.
Singh Sudhakar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 21/01/2017

Designing fast and scalable algorithms for mining frequent itemsets has always been a prominent and promising problem in data mining. Apriori is one of the most widely used algorithms for frequent itemset mining. Designing efficient algorithms on the MapReduce framework to process and analyze big datasets is an active research topic. In this paper, we focus on the performance of MapReduce-based Apriori on homogeneous as well as heterogeneous Hadoop clusters. We investigate a number of factors that significantly affect the execution time of MapReduce-based Apriori on both kinds of cluster. The factors relate to both algorithmic and non-algorithmic improvements. The algorithmic factors considered are filtered transactions and data structures. Experimental results show that an appropriate data structure and the filtered transactions technique drastically reduce the execution time. The non-algorithmic factors include speculative execution, nodes with poor performance, data locality and distribution of data blocks, and parallelism control through input split size. We applied strategies against these factors and fine-tuned the relevant parameters in our particular application. Experimental results show that taking care of cluster-specific parameters yields a significant reduction in execution time. We also discuss issues in the MapReduce implementation of Apriori that may significantly influence performance.

Comment: 8 pages, 8 figures, International Conference on Computing, Communication and Automation (ICCCA2016)
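The abstract above names candidate counting in MapReduce and a "filtered transactions" technique without giving details. The following is a minimal single-process sketch of that idea, not the authors' implementation. It assumes that "filtered transactions" means dropping infrequent items (and emptied transactions) after the first pass, and it uses a plain hash set as the candidate data structure. All function names are hypothetical, and the map and reduce phases are ordinary Python functions rather than Hadoop jobs.

```python
from collections import Counter
from itertools import combinations


def map_count(split, candidates, k):
    """Map phase: count candidate k-itemsets within one input split."""
    counts = Counter()
    for transaction in split:
        items = sorted(set(transaction))
        for subset in combinations(items, k):
            if subset in candidates:
                counts[subset] += 1
    return counts


def reduce_counts(partials, min_support):
    """Reduce phase: sum the per-split counts and keep frequent itemsets."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {itemset: n for itemset, n in total.items() if n >= min_support}


def filter_transactions(split, frequent_items):
    """Assumed filtering step: drop infrequent items, then empty transactions."""
    filtered = ([i for i in t if i in frequent_items] for t in split)
    return [t for t in filtered if t]


def generate_candidates(frequent, k):
    """Build k-itemsets whose (k-1)-subsets are all frequent (Apriori property)."""
    items = sorted({i for itemset in frequent for i in itemset})
    return {c for c in combinations(items, k)
            if all(s in frequent for s in combinations(c, k - 1))}


def apriori(splits, min_support):
    """Run one map/reduce round per itemset size until no candidates remain."""
    candidates = {(i,) for split in splits for t in split for i in t}
    k, result = 1, {}
    while candidates:
        partials = [map_count(s, candidates, k) for s in splits]
        frequent = reduce_counts(partials, min_support)
        if not frequent:
            break
        result.update(frequent)
        if k == 1:
            # Shrink the data once, so later passes scan fewer items.
            frequent_items = {i for (i,) in frequent}
            splits = [filter_transactions(s, frequent_items) for s in splits]
        k += 1
        candidates = generate_candidates(frequent, k)
    return result
```

Calling `apriori(splits, min_support)` with `splits` as a list of transaction lists returns a dictionary from sorted item tuples to support counts. On a real cluster the choice of candidate structure and the split size would matter far more than in this toy version.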

arXiv.org e-Print Archive

Crossref

Monitorable network and CPU load statistics and their application to scheduling

Author: Meyer Trevor Ethan
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/1995

Recent trends in high-speed computing have moved towards the use of networks of workstations as a cost-effective approach to parallel computing. One recently proposed solution involves the use of an existing network of workstation-class computers as a single multiprocessor, and much research is ongoing in this area. This dissertation describes work on process scheduling on networks of workstations, specifically on load analysis. After extensive background in the field is presented, measures of CPU and network load are defined, and a test parallel application program is presented, written for a network-multiprocessing software package called PVM. A series of experiments is then detailed, whose goal was to discover the relationship between the run time of the test application and the loads on the participating workstations and networks. The experiments include measurement of CPU loading and network loading during test application runs, during artificially elevated loads, and during quiet conditions. Results of the experiments are presented, and the application of the results to the problem of task scheduling is examined. It is then claimed that several easily measured load measures are useful for task scheduling, because they allow run time to be predicted within a margin of error and allow limiting network segments to be detected and avoided.

Digital Repository @ Iowa State University (ISU)

Viper : a visualisation tool for parallel program construction

Author: Schiefer R.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/1999

+133 pp.; 24 cm

Repository TU/e

Pure OAI Repository

uilis.unsyiah.ac.id

Distributed Computing in a Cloud of Mobile Phones

Author: Sanches Pedro Miguel Castanheira
Publication date: 01/01/2017

For the past few years we have seen exponential growth in the number of mobile devices and in their computation, storage and communication capabilities. We have also seen an increase in the amount of data generated by mobile devices while performing common tasks. Additionally, the ubiquity of these mobile devices makes it reasonable to consider a different use for them, in which they play an important part in the computation of more demanding applications rather than relying exclusively on external servers. The number of resource-demanding applications is also increasing, and these applications resort to services offered by infrastructure Clouds. However, the use of these Cloud services raises many problems, such as considerable use of energy and bandwidth, high latency, and unavailability of connectivity infrastructure, whether because it is congested or because it does not exist. Considering all the above, for some applications it starts to make sense to do part or all of the computation locally on the mobile devices themselves. We propose a distributed computing framework that can process a batch or a stream of data generated by a cloud composed of mobile devices and that does not require Internet services. Unlike the current state of the art, where both computation and data are offloaded to mobile devices, our system moves the computation to where the data is, significantly reducing the amount of data exchanged between mobile devices. Based on the evaluation performed, in both a real and a simulated environment, our framework proved to be scalable: it benefits significantly from using several devices to handle computation, and it supports multiple devices submitting computation requests without a significant increase in request latency. It also proved able to deal with churn without being highly penalized by it.

Repositório da Universidade Nova de Lisboa

MapReduce analysis for cloud-archived data

Author: Alatorre G
Liu L
Mandagere N
Palanisamy B
Singh A
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014

Public storage clouds have become a popular choice for archiving certain classes of enterprise data, for example application and infrastructure logs. These logs contain sensitive information such as IP addresses or user logins, so regulatory and security requirements often require the data to be encrypted before it is moved to the cloud. In order to leverage such data for any business value, analytics systems (e.g. Hadoop/MapReduce) first download data from these public clouds, decrypt it and then process it at the secure enterprise site. We propose VNCache: an efficient solution for MapReduce analysis of such cloud-archived log data without requiring an a priori data transfer and loading into the local Hadoop cluster. VNCache dynamically integrates cloud-archived data into a virtual namespace at the enterprise Hadoop cluster. Through a seamless data streaming and prefetching model, Hadoop jobs can begin execution as soon as they are launched without requiring any a priori downloading. With VNCache's accurate prefetching and caching, jobs often run on a locally cached copy of the data block, significantly improving performance. When no longer needed, data is safely evicted from the enterprise cluster, reducing the total storage footprint. Uniquely, VNCache is implemented with no changes to the Hadoop application stack. © 2014 IEEE

CiteSeerX

Crossref

D-Scholarship@Pitt

Towards distributed heterogeneous simulation using internet of things

Author: Hamayun Mian Muhammad
Haseeb Muhammad
Malik Asad Waqar
Rahman Anis ur

University of Birmingham Research Portal

Parallel processing of streaming media on heterogeneous hosts using work stealing

Author: LI QINGRUI
Publication date: 29/08/2004

Master's, Master of Science

ScholarBank@NUS

Design and Implementation of Parallel Computing Models for Solar Radiation Simulation

Author: Liang Da
Publication date: 04/05/2016

In order to simulate geographical phenomena, scientists have developed many complex, high-precision models. Most of the time, however, common hardware and implementations of those computation models cannot process large amounts of data, and the time performance may be unacceptable. The speed of modern graphics processing units is growing rapidly, and the flops/dollar ratio provided by GPUs is also growing very fast, which is making large-scale GPU clusters popular in the scientific computing community. However, GPU programming and the deployment and development of cluster software are associated with a number of challenges. In this thesis, the geo-science model proposed by I. D. Dobreva and M. P. Bishop in A Spatial Temporal, Topographic and Spectral GIS based Solar Radiation Model (SRM) was analyzed. I built a heterogeneous cluster and developed its software framework, which can provide a powerful computation service for complex geographic models. Time performance and computation accuracy have been analyzed. Issues and challenges such as GPU programming, job balancing and scheduling are addressed. The SRM application running on this framework processes data fast enough to give researchers rendered images as feedback in a short time. This improves performance by hundreds of times compared to the current performance on our available hardware, and the speedup can easily be scaled by adding new machines.

Texas A&M Repository

A study of distributed clustering of vector time series on the grid by task farming

Author: Nayar Arun B
Publication venue: LSU Digital Commons
Publication date: 01/01/2005

Traditional data mining methods were limited by the availability of computing resources such as network bandwidth, storage space and processing power. These algorithms were developed to work around this problem by looking at a small cross-section of the whole data available. However, since a major chunk of the data was left out, the predictions were generally inaccurate and missed significant features of the data. Today, with resources growing at almost the same pace as data, it is possible to rethink mining algorithms to work on distributed resources and essentially distributed data. Distributed data mining thus holds great promise. Using grid technologies, data mining can be extended to areas that were not previously examined because of the volume of data being generated, such as climate modeling and web usage. An important characteristic of data today is that it is highly decentralized and mostly redundant. Data mining algorithms that make efficient use of distributed data are therefore needed. Though it is possible to bring all the data together and run traditional algorithms, this has a high overhead in terms of bandwidth used for transmission and preprocessing steps that must handle every format of the received data. By processing the data locally, the preprocessing stage can be made less bulky and the traditional data mining techniques can work on the data efficiently. The focus of this project is to apply an existing data mining technique, fuzzy c-means clustering, to distributed data in a simulated grid environment and to review the performance of this approach relative to the traditional approach.
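The abstract above does not say how fuzzy c-means is split across sites. As a hedged illustration only, the sketch below shows one common way to do it: each site computes partial sums for the standard fuzzy c-means center update on its local data, and a coordinator combines them, so only small summaries travel over the network. The function names, the fuzzifier default `m=2.0`, and the fixed iteration count are assumptions, not details from the thesis.

```python
import math


def memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in every cluster."""
    dists = [math.dist(point, c) for c in centers]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # The point sits exactly on one or more centers: share membership there.
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


def local_partial_sums(points, centers, m=2.0):
    """Run at each site: weighted coordinate sums and weight totals per cluster."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]
    den = [0.0] * len(centers)
    for p in points:
        for j, u in enumerate(memberships(p, centers, m)):
            w = u ** m
            den[j] += w
            for d in range(dim):
                num[j][d] += w * p[d]
    return num, den


def combine(partials, centers):
    """Run at the coordinator: merge site summaries into new centers."""
    new_centers = []
    for j, old in enumerate(centers):
        den = sum(p[1][j] for p in partials)
        if den == 0.0:
            new_centers.append(list(old))  # no weight at all: keep the old center
        else:
            new_centers.append([sum(p[0][j][d] for p in partials) / den
                                for d in range(len(old))])
    return new_centers


def distributed_fcm(sites, centers, m=2.0, iterations=20):
    """Alternate local summarization and central combination."""
    for _ in range(iterations):
        partials = [local_partial_sums(points, centers, m) for points in sites]
        centers = combine(partials, centers)
    return centers
```

Because the center update is a ratio of sums, combining per-site sums gives the same centers as running the update on pooled data, up to floating-point rounding. Each site sends only one vector and one scalar per cluster in each iteration.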

Louisiana State University