334 research outputs found
Mapreduce performance model for Hadoop 2.x
MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of such paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem, at the same time, that may provide reasonably accurate job response time estimation at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining MapReduce performance model for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, they could not be applied to Hadoop 2.x due to fundamental architectural changes and dynamic resource allocation in Hadoop 2.x. Thus, the proposed solution is based on an existing performance model for Hadoop 1.x, but taking into consideration architectural changes and capturing the execution flow of a MapReduce job by using queuing network model. This way, the cost model reflects the intra-job synchronization constraints that occur due the contention at shared resources. The accuracy of our solution is validated via comparison of our model estimates against measurements in a real Hadoop 2.x setup.Peer ReviewedPostprint (author's final draft
D-SPACE4Cloud: A Design Tool for Big Data Applications
The last years have seen a steep rise in data generation worldwide, with the
development and widespread adoption of several software projects targeting the
Big Data paradigm. Many companies currently engage in Big Data analytics as
part of their core business activities, nonetheless there are no tools and
techniques to support the design of the underlying hardware configuration
backing such systems. In particular, the focus in this report is set on Cloud
deployed clusters, which represent a cost-effective alternative to on premises
installations. We propose a novel tool implementing a battery of optimization
and prediction techniques integrated so as to efficiently assess several
alternative resource configurations, in order to determine the minimum cost
cluster deployment satisfying QoS constraints. Further, the experimental
campaign conducted on real systems shows the validity and relevance of the
proposed method
A Combined Analytical Modeling Machine Learning Approach for Performance Prediction of MapReduce Jobs in Hadoop Clusters
Nowadays MapReduce and its open source implementation, Apache Hadoop, are the most widespread solutions for handling massive dataset on clusters of commodity hardware. At the expense of a somewhat reduced performance in comparison to HPC technologies, the MapReduce framework provides fault tolerance and automatic parallelization without any efforts by developers. Since in many cases Hadoop is adopted to support business critical activities, it is often important to predict with fair confidence the execution time of submitted jobs, for instance when SLAs are established with end-users. In this work, we propose and validate a hybrid approach exploiting both queuing networks and support vector regression, in order to achieve a good accuracy without too many costly experiments on a real setup. The experimental results show how the proposed approach attains a 21% improvement in accuracy over applying machine learning techniques without any support from analytical models
CloudScope: diagnosing and managing performance interference in multi-tenant clouds
© 2015 IEEE.Virtual machine consolidation is attractive in cloud computing platforms for several reasons including reduced infrastructure costs, lower energy consumption and ease of management. However, the interference between co-resident workloads caused by virtualization can violate the service level objectives (SLOs) that the cloud platform guarantees. Existing solutions to minimize interference between virtual machines (VMs) are mostly based on comprehensive micro-benchmarks or online training which makes them computationally intensive. In this paper, we present CloudScope, a system for diagnosing interference for multi-tenant cloud systems in a lightweight way. CloudScope employs a discrete-time Markov Chain model for the online prediction of performance interference of co-resident VMs. It uses the results to optimally (re)assign VMs to physical machines and to optimize the hypervisor configuration, e.g. the CPU share it can use, for different workloads. We have implemented CloudScope on top of the Xen hypervisor and conducted experiments using a set of CPU, disk, and network intensive workloads and a real system (MapReduce). Our results show that CloudScope interference prediction achieves an average error of 9%. The interference-aware scheduler improves VM performance by up to 10% compared to the default scheduler. In addition, the hypervisor reconfiguration can improve network throughput by up to 30%
A Game-Theoretic Approach for Runtime Capacity Allocation in MapReduce
Nowadays many companies have available large amounts of raw, unstructured
data. Among Big Data enabling technologies, a central place is held by the
MapReduce framework and, in particular, by its open source implementation,
Apache Hadoop. For cost effectiveness considerations, a common approach entails
sharing server clusters among multiple users. The underlying infrastructure
should provide every user with a fair share of computational resources,
ensuring that Service Level Agreements (SLAs) are met and avoiding wastes. In
this paper we consider two mathematical programming problems that model the
optimal allocation of computational resources in a Hadoop 2.x cluster with the
aim to develop new capacity allocation techniques that guarantee better
performance in shared data centers. Our goal is to get a substantial reduction
of power consumption while respecting the deadlines stated in the SLAs and
avoiding penalties associated with job rejections. The core of this approach is
a distributed algorithm for runtime capacity allocation, based on Game Theory
models and techniques, that mimics the MapReduce dynamics by means of
interacting players, namely the central Resource Manager and Class Managers
Optimal Map Reduce Job Capacity Allocation in Cloud Systems.
We are entering a Big Data world. Many sectors of our economy are now guided by data-driven decision processes. Big Data and business intelligence applications are facilitated by the MapReduce programming model while, at infrastructural layer, cloud computing provides flexible and cost effective solutions for allocating on demand large clusters. Capacity allocation in such systems is a key challenge to provide performance for MapReduce jobs and minimize cloud resource costs. The contribution of this paper is twofold: (i) we provide new upper and lower bounds for MapReduce job execution time in shared Hadoop clusters, (ii) we formulate a linear programming model able to minimize cloud resources costs and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees. Simulation results show how the execution time of MapReduce jobs falls within 14% of our upper bound on average.
Moreover, numerical analyses demonstrate that our method is able to determine the global optimal solution of the linear problem for systems including up to 1,000 user classes in less than 0.5 seconds
MapReduce performance models for Hadoop 2.x
MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of such paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem, at the same time, that it may provide reasonably accurate job response time at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining MapReduce
performance models for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, the fundamental architectural changes of Hadoop 2.x require that the cost models are also reconsidered. The proposed solution is based on
an existing performance model for Hadoop 1.x, but it takes into consideration the architectural changes of Hadoop 2.x and captures the execution flow of a MapReduce job by using queuing network model. This way the cost model adheres to the intra-job synchronization constraints that occur due the contention at shared resources. The accuracy of our solution is validated via comparison of our model estimates against measurements in a real Hadoop 2.x setup. According to our evaluation results, the proposed model produces estimates of average job response time with
error within the range of 11% - 13.5%.Peer ReviewedPostprint (published version
Towards auto-scaling in the cloud: online resource allocation techniques
Cloud computing provides an easy access to computing resources. Customers can acquire and release resources any time. However, it is not trivial to determine when and how many resources to allocate. Many applications running in the cloud face workload changes that affect their resource demand. The first thought is to plan capacity either for the average load or for the peak load. In the first case there is less cost incurred, but performance will be affected if the peak load occurs. The second case leads to money wastage, since resources will remain underutilized most of the time. Therefore there is a need for a more sophisticated resource provisioning techniques that can automatically scale the application resources according to workload demand and performance constrains.
Large cloud providers such as Amazon, Microsoft, RightScale provide auto-scaling services. However, without the proper configuration and testing such services can do more harm than good. In this work I investigate application specific online resource allocation techniques that allow to dynamically adapt to incoming workload, minimize the cost of virtual resources and meet user-specified performance objectives
- …