Search CORE

11 research outputs found

Experimental Performance Evaluation of Cloud-Based Analytics-as-a-Service

Author: Carra Damiano, Michiardi Pietro, Milanesio Marco, Pace Francesco, Venzano Daniele
Publication date: 01/01/2016

An increasing number of Analytics-as-a-Service solutions have recently emerged in the landscape of cloud-based services. These services allow flexible composition of compute and storage components that create powerful data ingestion and processing pipelines. This work is a first attempt at an experimental evaluation of analytic application performance executed using a wide range of storage service configurations. We present an intuitive notion of data locality, which we use as a proxy to rank different service compositions in terms of expected performance. Through an empirical analysis, we dissect the performance achieved by analytic workloads and unveil problems due to the impedance mismatch that arises in some configurations. Our work paves the way to a better understanding of modern cloud-based analytic services and their performance, both for their end users and their providers.

Comment: Longer version of the paper in Submission at IEEE CLOUD'1

arXiv.org e-Print Archive

Crossref

Catalogo dei prodotti della ricerca

Scipedia

Optimised Method of Resource Allocation for Hadoop on Cloud

Author: Shilpitha Swarna, Amogh Pramod Kulkarni
Publication venue: Auricle Technologies, Pvt., Ltd.
Publication date: 30/04/2016

Many case studies have shown that the data generated in industry and academia are growing rapidly and are difficult to store using existing database systems. Widespread internet use has produced many applications that support industries such as finance and health care, which are themselves sources of massive data. The smart grid is a technology that delivers energy in an optimal manner; phasor measurement units (PMUs) installed in the smart grid are used to check critical power paths and also generate massive sample data. A parallel detrended fluctuation analysis (PDFA) algorithm enables fast detection of events from PMU samples. Storing and analyzing the events is made easier by the MapReduce model; Hadoop is an open source implementation of the MapReduce framework. Many cloud service providers (CSPs) are extending their services to Hadoop, which makes it easy for users to run their Hadoop applications in the cloud. The major difficulty is that it is the user's responsibility to estimate the time and resources required to complete a job within its deadline. In this paper, machine learning techniques such as locally weighted linear regression and the parallel glowworm swarm optimization (GSO) algorithm are used to estimate the resources and job completion time.

International Journal on Recent and Innovation Trends in Computing and Communication

Modeling performance of Hadoop applications: A journey from queueing networks to stochastic well formed nets

Author: A Castiglione, D Ardagna, DJ Dubois, DR Liang, E Vianna, ED Lazowska, HV Jagadish, J Polo, JE Marynowski, K Jensen, K Kambatla, L Aguilera-Mendoza, M Bertoli, M Lin, RD Nelson, S Baarir, VW Mak, WW Chu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Nowadays, many enterprises commit to the extraction of actionable knowledge from huge datasets as part of their core business activities. Applications belong to very different domains, such as fraud detection or one-to-one marketing, and encompass business analytics and decision support in both the private and public sectors. In these scenarios, a central place is held by the MapReduce framework and in particular its open source implementation, Apache Hadoop. In such environments, new challenges arise in the area of job performance prediction, with the need to provide Service Level Agreement guarantees to the end user and to avoid wasting computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, able to estimate job performance under a number of scenarios of interest, including unreliable resources. The accuracy of our models is evaluated by running experiments with the TPC-DS industry benchmark on Amazon EC2 and at the CINECA Italian supercomputing center. The results have shown that the average accuracy we can achieve is in the range 9–14%.

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

A Game-Theoretic Approach for Runtime Capacity Allocation in MapReduce

Author: Ardagna Danilo, Ciavotta Michele, Gianniti Eugenio, Passacantando Mauro
Publication date: 01/01/2017

Nowadays many companies have large amounts of raw, unstructured data available. Among Big Data enabling technologies, a central place is held by the MapReduce framework and, in particular, by its open source implementation, Apache Hadoop. For cost-effectiveness reasons, a common approach entails sharing server clusters among multiple users. The underlying infrastructure should provide every user with a fair share of computational resources, ensuring that Service Level Agreements (SLAs) are met and avoiding waste. In this paper we consider two mathematical programming problems that model the optimal allocation of computational resources in a Hadoop 2.x cluster, with the aim of developing new capacity allocation techniques that guarantee better performance in shared data centers. Our goal is to obtain a substantial reduction in power consumption while respecting the deadlines stated in the SLAs and avoiding penalties associated with job rejections. The core of this approach is a distributed algorithm for runtime capacity allocation, based on Game Theory models and techniques, that mimics the MapReduce dynamics by means of interacting players, namely the central Resource Manager and Class Managers.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

Archivio della Ricerca - Università di Pisa

Modelling of Map Reduce Performance for Work Control and Supply Provisioning

Author: Palson Kennedy.R, Kalyana Sundaram.N, Karthik.V
Publication venue: Auricle Global Society of Education and Research
Publication date: 31/10/2017

Data-intensive applications adopt MapReduce as a major computing model. Hadoop, an open source implementation of MapReduce, has been adopted by a steadily growing user community. Many cloud computing service providers offer Hadoop operators the chance to contract a certain amount of supplies and pay for their usage. Nevertheless, a key challenge is that cloud service providers do not have a supply provisioning mechanism to fulfil user works with target requirements. At present, it is solely the user's responsibility to evaluate the amount of supply required to run a work in the cloud. This paper presents a Hadoop performance model that accurately estimates completion time and further provisions the amount of supplies required for a work to be completed within a deadline. The proposed model builds on past work execution records and uses the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a work. Moreover, it uses the Lagrange Multipliers technique for supply provisioning to satisfy works with deadline requirements. The proposed method is first assessed on an in-house Hadoop cluster and then evaluated in the Amazon EC2 Cloud. Experimental results show that the accuracy of the proposed method in work execution estimation is in the range of 90.37% to 91.28%, and works are completed within the required limits under the supply provisioning scheme of the proposed model.

International Journal on Future Revolution in Computer Science & Communication Engineering

Feedback Autonomic Provisioning for Guaranteeing Performance in MapReduce Systems

Author: Berekmeri Mihaly, Bouchenak Sara, Marchand Nicolas, Robu Bogdan, Serrano Damián
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

Companies have fast-growing amounts of data to process and store; a data explosion is happening next to us. Currently, one of the most common approaches to treating these vast data quantities is based on the MapReduce parallel programming paradigm. While its use is widespread in the industry, ensuring performance constraints while at the same time minimizing costs still poses considerable challenges. We propose a coarse-grained control-theoretical approach, based on techniques that have already proved their usefulness in the control community. We introduce the first algorithm to create dynamic models for Big Data MapReduce systems running a concurrent workload. Furthermore, we identify two important control use cases: relaxed performance with minimal resources, and strict performance. For the first case we develop two feedback control mechanisms: a classical feedback controller and an event-based feedback controller that also minimises the number of cluster reconfigurations. Moreover, to address strict performance requirements, a feedforward predictive controller that efficiently suppresses the effects of large workload size variations is developed. All the controllers are validated online in a benchmark running in a real 60-node MapReduce cluster, using a data-intensive Business Intelligence workload. Our experiments demonstrate the success of the control strategies employed in assuring service time constraints.

Crossref

Hal - Université Grenoble Alpes

HAL Descartes

HAL

Hal-Diderot

Hadoop performance modeling for job estimation and resource provisioning

Author: Jiang C, Jin Y, Khan M, Li M, Xiang Y
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/02/2016

MapReduce has become a major computing model for data-intensive applications. Hadoop, an open source implementation of MapReduce, has been adopted by a growing user community. Cloud computing service providers such as Amazon EC2 Cloud offer Hadoop users the opportunity to lease a certain amount of resources and pay for their use. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for running a job in the cloud. This paper presents a Hadoop job performance model that accurately estimates job completion time and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model builds on historical job execution records and employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a job. Furthermore, it employs the Lagrange Multipliers technique for resource provisioning to satisfy jobs with deadline requirements. The proposed model is initially evaluated on an in-house Hadoop cluster and subsequently evaluated in the Amazon EC2 Cloud. Experimental results show that the accuracy of the proposed model in job execution estimation is in the range of 94.97% to 95.51%, and jobs are completed within the required deadlines under the resource provisioning scheme of the proposed model.

This research is partially supported by the 973 project on Network Big Data Analytics funded by the Ministry of Science and Technology, China. No. 2014CB340404

Crossref

Brunel University Research Archive

Layered performance modelling and evaluation for cloud topic detection and tracking based big data applications

Author: Wang Meisong

“Big Data”, best characterized by its three features, namely “Variety”, “Volume” and “Velocity”, is revolutionizing nearly every aspect of our lives, ranging from enterprises to consumers and from science to government. A fourth characteristic, namely “value”, is delivered via the use of smart data analytics over Big Data. One such Big Data analytics application considered in this thesis is Topic Detection and Tracking (TDT). The characteristics of Big Data bring with them unprecedented challenges: data that are too large for traditional devices to process and store (volume), too fast for traditional methods to scale (velocity), and heterogeneous (variety). In recent times, cloud computing has emerged as a practical and technical solution for processing big data. However, when deploying Big Data analytics applications such as TDT in the cloud (called cloud-based TDT), the challenge is to cost-effectively orchestrate and provision cloud resources to meet performance Service Level Agreements (SLAs). Although there exists limited work on performance modeling of cloud-based TDT applications, none of these methods can be directly applied to guarantee the performance SLA of cloud-based TDT applications. For instance, current literature lacks a systematic, reliable and accurate methodology to measure, predict and finally guarantee the performance of TDT applications. Furthermore, existing performance models fail to consider the end-to-end complexity of TDT applications and focus only on the individual processing components (e.g. MapReduce). To tackle this challenge, in this thesis, we develop a layered performance model of cloud-based TDT applications that takes into account big data characteristics, the data and event flow across myriad cloud software and hardware resources, and diverse SLA considerations.
In particular, we propose and develop models to capture, in detail and with great accuracy, the factors that play a pivotal role in the performance of cloud-based TDT applications, identify the ways in which these factors affect performance, and determine the dependencies between the factors. Further, we have developed models to predict the performance of cloud-based TDT applications under the uncertainty conditions imposed by Big Data characteristics. The model developed in this thesis is intended to be generic, allowing its application to other cloud-based data analytics applications. We have demonstrated the feasibility, efficiency, validity and prediction accuracy of the proposed models via experimental evaluations using a real-world flu detection use case on the Apache Hadoop MapReduce, HDFS and Mahout frameworks.

The Australian National University


Hadoop performance modeling and job optimization for big data analytics

Author: Khan Mukhtaj
Publication venue: Brunel University London
Publication date: 01/01/2015

This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.

Big data has gained momentum in both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics. Hadoop, which is an open source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as Amazon EC2 cloud now support Hadoop user applications. However, a key challenge is that the cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for a job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model employs a Locally Weighted Linear Regression (LWLR) model to estimate the execution time of a job and the Lagrange Multiplier technique for resource provisioning to satisfy a user job with a given deadline. The performance of the proposed model is extensively evaluated on both an in-house Hadoop cluster and the Amazon EC2 Cloud. Experimental results show that the proposed model is highly accurate in job execution estimation and that jobs are completed within the required deadlines under the resource provisioning scheme of the proposed model. In addition, the Hadoop framework has over 190 configuration parameters, and some of them have significant effects on the performance of a Hadoop job. Manually setting the optimum values for these parameters is a challenging and time-consuming task. This thesis presents optimization work that enhances the performance of Hadoop by automatically tuning its parameter values.
It employs the Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlation among the configuration parameters. For the purpose of optimization, Particle Swarm Optimization (PSO) is employed to automatically find optimal or near-optimal configuration settings. The performance of the proposed work is intensively evaluated on a Hadoop cluster, and the experimental results show that the proposed work enhances the performance of Hadoop significantly compared with the default settings.

Abdul Wali Khan University Marda

Brunel University Research Archive


High performance latent dirichlet allocation for text mining

Author: Liu Zelong
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2013

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.

Latent Dirichlet Allocation (LDA), a total probability generative model, is a three-tier Bayesian model. LDA computes the latent topic structure of the data and obtains the significant information of documents. However, traditional LDA has several limitations in practical applications. LDA cannot be directly used in classification because it is an unsupervised learning model; it needs to be embedded into appropriate classification algorithms. As a generative model, LDA normally generates latent topics in categories to which the target documents do not belong, producing deviation in computation and reducing classification accuracy. The number of topics in LDA greatly influences the learning process of the model parameters. Noise samples in the training data also affect the final text classification result, and the quality of LDA-based classifiers depends to a great extent on the quality of the training samples. Although parallel LDA algorithms have been proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge. This thesis presents a text classification method that combines the LDA model and the Support Vector Machine (SVM) classification algorithm for improved classification accuracy when reducing the dimension of datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected, which reduces the number of iterations in computation. Furthermore, this thesis presents a noise data reduction scheme to process noise data. When the noise ratio in the training data set is large, the noise reduction scheme can always produce a high level of accuracy in classification.
Finally, the thesis parallelizes LDA using the MapReduce model which is the de facto computing standard in supporting data intensive applications. A genetic algorithm based load balancing algorithm is designed to balance the workloads among computers in a heterogeneous MapReduce cluster where the computers have a variety of computing resources in terms of CPU speed, memory space and hard disk space

Brunel University Research Archive