441 research outputs found

    D-SPACE4Cloud: A Design Tool for Big Data Applications

    Get PDF
    The last years have seen a steep rise in data generation worldwide, with the development and widespread adoption of several software projects targeting the Big Data paradigm. Many companies currently engage in Big Data analytics as part of their core business activities, nonetheless there are no tools and techniques to support the design of the underlying hardware configuration backing such systems. In particular, the focus in this report is set on Cloud deployed clusters, which represent a cost-effective alternative to on premises installations. We propose a novel tool implementing a battery of optimization and prediction techniques integrated so as to efficiently assess several alternative resource configurations, in order to determine the minimum cost cluster deployment satisfying QoS constraints. Further, the experimental campaign conducted on real systems shows the validity and relevance of the proposed method

    Performance Modeling and Resource Management for Mapreduce Applications

    Get PDF
    Big Data analytics is increasingly performed using the MapReduce paradigm and its open-source implementation Hadoop as a platform choice. Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs of MapReduce jobs. An increasing number of these applications have additional requirements for completion time guarantees. The advent of cloud computing brings a competitive alternative solution for data analytic problems while it also introduces new challenges in provisioning clusters that provide best cost-performance trade-offs. In this dissertation, we aim to develop a performance evaluation framework that enables automatic resource management for MapReduce applications in achieving different optimization goals. It consists of the following components: (1) a performance modeling framework that estimates the completion time of a given MapReduce application when executed on a Hadoop cluster according to its input data sets, the job settings and the amount of allocated resources for processing it; (2) a resource allocation strategy for deadline-driven MapReduce applications that automatically tailors and controls the resource allocation on a shared Hadoop cluster to different applications to achieve their (soft) deadlines; (3) a simulator-based solution to the resource provision problem in public cloud environment that guides the users to determine the types and amount of resources that should lease from the service provider for achieving different goals; (4) an optimization strategy to automatically determine the optimal job settings within a MapReduce application for efficient execution and resource usage. We validate the accuracy, efficiency, and performance benefits of the proposed framework using a set of realistic MapReduce applications on both private cluster and public cloud environment

    Research Statement

    Get PDF
    My research centers around performance modeling, optimization and resource management for MapReduce workflows with completion time constrains. My work is motivated by (1) the popularity of MapReduce framework and its open source implementation Hadoop that provides an economically compelling alternative for efficient analytics over ”Big Data ” in the enterprise; and (2) the recent technological trend shift toward

    Resilin: Elastic MapReduce over Multiple Clouds

    Get PDF
    The MapReduce programming model, introduced by Google, offers a simple and efficient way of performing distributed computation over large data sets. Although Google's implementation is proprietary, MapReduce can be leveraged by anyone using the free and open-source Apache Hadoop framework. To simplify the usage of Hadoop in the cloud, Amazon Web Services offers Elastic MapReduce, a web service enabling users to run MapReduce jobs. Elastic MapReduce takes care of resource provisioning, Hadoop configuration and performance tuning, data staging, fault tolerance, etc. This service drastically reduces the entry barrier to perform MapReduce computations in the cloud, allowing users to concentrate on the problem to solve. However, Elastic MapReduce is restricted to Amazon EC2 resources, and is provided at an additional cost. In this paper, we present Resilin, a system implementing the Elastic MapReduce API with resources from clouds other than Amazon EC2, such as private and scientific clouds. Furthermore, we explore a feature going beyond the current Amazon Elastic MapReduce offering: performing MapReduce computations over multiple distributed clouds. The evaluation of Resilin shows the benefits of running computations on more than one cloud. While not being the most efficient way to perform Hadoop computations, it solves the problem of resource availability and adds more flexibility regarding the type/price of resource.Le modĂšle de programmation MapReduce, introduit par Google, offre un moyen simple et efficace de rĂ©aliser des calculs distribuĂ©s sur de grandes quantitĂ©s de donnĂ©es. Bien que la mise en oeuvre de Google soit propriĂ©taire, MapReduce peut ĂȘtre utilisĂ© librement avec l'environnement Hadoop. Pour simplifier l'utilisation de Hadoop dans les nuages informatiques, Amazon Web Services offre Elastic MapReduce, un service web qui permet aux utilisateurs d'exĂ©cuter des applications MapReduce. Il prend en charge l'allocation de ressources, la configuration et l'optimisation de Hadoop, la copie des donnĂ©es, la tolĂ©rance aux fautes, etc. Ce service facilite l'exĂ©cution d'applications MapReduce dans les nuages informatiques, permettant ainsi aux utilisateurs de se concentrer sur la rĂ©solution de leur problĂšme plutĂŽt que sur la gestion de la plate-forme d'exĂ©cution. Elastic MapReduce est limitĂ© ĂĄ l'utilisation de ressources fournies par Amazon EC2 et est proposĂ© Ă  un coĂ»t additionnel. Dans cet article, nous prĂ©sentons Resilin, un systĂšme mettant en oeuvre l'API Elastic MapReduce avec des ressources provenant d'autres nuages informatiques que Amazon EC2, tels que les nuages privĂ©s ou communautaires. De plus, nous explorons une fonctionnalitĂ© nouvelle par rapport au service offert par Amazon Elastic MapReduce: l'exĂ©cution d'applications MapReduce sur plusieurs nuages gĂ©ographiquement distribuĂ©s. L'Ă©valuation de Resilin montre les avantages liĂ©s Ă  l'utilisation de plus d'un nuage pour l'exĂ©cution d'applications MapReduce. Bien qu'il ne fournisse pas la solution la plus efficace pour l'exĂ©cution d'applications MapReduce, Resilin rĂ©sout le problĂšme de la disponibilitĂ© des ressources et ajoute une plus grande flexibilitĂ© en ce qui concerne le type et le prix des ressources

    Cost-Effective Resource Provisioning for MapReduce in a Cloud

    Get PDF
    This paper presents a new MapReduce cloud service model, Cura, for provisioning cost-effective MapReduce services in a cloud. In contrast to existing MapReduce cloud services such as a generic compute cloud or a dedicated MapReduce cloud, Cura has a number of unique benefits. First, Cura is designed to provide a cost-effective solution to efficiently handle MapReduce production workloads that have a significant amount of interactive jobs. Second, unlike existing services that require customers to decide the resources to be used for the jobs, Cura leverages MapReduce profiling to automatically create the best cluster configuration for the jobs. While the existing models allow only a per-job resource optimization for the jobs, Cura implements a globally efficient resource allocation scheme that significantly reduces the resource usage cost in the cloud. Third, Cura leverages unique optimization opportunities when dealing with workloads that can withstand some slack. By effectively multiplexing the available cloud resources among the jobs based on the job requirements, Cura achieves significantly lower resource usage costs for the jobs. Cura's core resource management schemes include cost-aware resource provisioning, VM-aware scheduling and online virtual machine reconfiguration. Our experimental results using Facebook-like workload traces show that our techniques lead to more than 80 percent reduction in the cloud compute infrastructure cost with upto 65 percent reduction in job response times

    Towards efficient resource provisioning in MapReduce

    Get PDF
    The paper presents a novel approach and algorithm with mathematical formula for obtaining the exact optimal number of task resources for any workload running on HadoopMapReduce. In the era of Big Data, energy efficiency has become an important issue for the ubiquitous Hadoop MapReduce framework. However, the question of what is the optimal number of tasks required for a job to get the most efficient performance from MapReduce still has no definite answer. Our algorithm for optimal resource provisioning allows users to identify the best trade-off point between performance and energy efficiency on the runtime elbow curve fitted from sampled executions on the target cluster for subsequent behavioral replication. Our verification and comparison show that the currently well-known rules of thumb for calculating the required number of reduce tasks are inaccurate and could lead to significant waste of computing resources and energy with no further improvement in execution time
    • 

    corecore