14 research outputs found

    D-SPACE4Cloud: A Design Tool for Big Data Applications

    Recent years have seen a steep rise in data generation worldwide, with the development and widespread adoption of several software projects targeting the Big Data paradigm. Many companies currently engage in Big Data analytics as part of their core business activities; nonetheless, there are no tools and techniques to support the design of the underlying hardware configuration backing such systems. In particular, this report focuses on Cloud-deployed clusters, which represent a cost-effective alternative to on-premises installations. We propose a novel tool implementing a battery of optimization and prediction techniques, integrated so as to efficiently assess several alternative resource configurations and determine the minimum-cost cluster deployment satisfying QoS constraints. Further, the experimental campaign conducted on real systems shows the validity and relevance of the proposed method.

    Fluid Petri Nets for the Performance Evaluation of MapReduce Applications

    Big Data applications make it possible to analyze large amounts of data, not necessarily structured, though at the same time they present new challenges. For example, predicting the performance of frameworks such as Hadoop can be a costly task, hence the need for models that can provide valuable support to designers and developers. This paper contributes a novel modeling approach based on fluid Petri nets to predict the execution time of MapReduce jobs. The experiments we performed at CINECA, the Italian supercomputing center, have shown that the achieved accuracy is within 16% of the actual measurements on average.

    Modeling Big Data Systems by Extending the Palladio Component Model

    The growing availability of big data has induced new storing and processing techniques implemented in big data systems such as Apache Hadoop or Apache Spark. As more organizations deploy these systems, the requirements on performance qualities such as response time, throughput, and resource utilization increase as well, so that the systems create added value. Guaranteeing these performance requirements, as well as efficiently planning the needed capacities in advance, is an enormous challenge. Performance models such as the Palladio component model (PCM) allow for addressing such problems. Therefore, we propose a meta-model extension for PCM to be able to model typical characteristics of big data systems. The extension consists of two parts. First, the meta-model is extended to support parallel computing by forking an operation multiple times on a computer cluster, as intended by the single instruction, multiple data (SIMD) architecture. Second, modeling of computer clusters is integrated into the meta-model so that operations can be properly scheduled on the contained computing nodes.

    Modeling performance of Hadoop applications: A journey from queueing networks to stochastic well formed nets

    Nowadays, many enterprises commit to the extraction of actionable knowledge from huge datasets as part of their core business activities. Applications belong to very different domains, such as fraud detection or one-to-one marketing, and encompass business analytics and decision support in both the private and public sectors. In these scenarios, a central place is held by the MapReduce framework and, in particular, its open source implementation, Apache Hadoop. In such environments, new challenges arise in the area of job performance prediction, with the need to provide Service Level Agreement guarantees to the end user and to avoid wasting computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, able to estimate job performance under a number of scenarios of interest, including unreliable resources. The accuracy of our models is evaluated by considering the TPC-DS industry benchmark, running experiments on Amazon EC2 and at the CINECA Italian supercomputing center. The results have shown that the average accuracy we can achieve is in the range 9–14%.

    A Game-Theoretic Approach for Runtime Capacity Allocation in MapReduce

    Nowadays many companies have large amounts of raw, unstructured data available. Among Big Data enabling technologies, a central place is held by the MapReduce framework and, in particular, by its open source implementation, Apache Hadoop. For cost-effectiveness considerations, a common approach entails sharing server clusters among multiple users. The underlying infrastructure should provide every user with a fair share of computational resources, ensuring that Service Level Agreements (SLAs) are met and avoiding waste. In this paper we consider two mathematical programming problems that model the optimal allocation of computational resources in a Hadoop 2.x cluster, with the aim of developing new capacity allocation techniques that guarantee better performance in shared data centers. Our goal is to obtain a substantial reduction in power consumption while respecting the deadlines stated in the SLAs and avoiding penalties associated with job rejections. The core of this approach is a distributed algorithm for runtime capacity allocation, based on Game Theory models and techniques, that mimics the MapReduce dynamics by means of interacting players, namely the central Resource Manager and Class Managers.

    Optimal Map Reduce Job Capacity Allocation in Cloud Systems

    We are entering a Big Data world. Many sectors of our economy are now guided by data-driven decision processes. Big Data and business intelligence applications are facilitated by the MapReduce programming model while, at the infrastructural layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. Capacity allocation in such systems is a key challenge in providing performance for MapReduce jobs and minimizing cloud resource costs. The contribution of this paper is twofold: (i) we provide new upper and lower bounds for MapReduce job execution time in shared Hadoop clusters; (ii) we formulate a linear programming model able to minimize cloud resource costs and job rejection penalties for the execution of jobs of multiple classes with (soft) deadline guarantees. Simulation results show that the execution time of MapReduce jobs falls within 14% of our upper bound on average. Moreover, numerical analyses demonstrate that our method is able to determine the global optimal solution of the linear problem for systems including up to 1,000 user classes in less than 0.5 seconds.

    MapReduce performance model for Hadoop 2.x

    MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem; at the same time, it may provide reasonably accurate job response time estimation at significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining a MapReduce performance model for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, they cannot be applied to Hadoop 2.x due to fundamental architectural changes and dynamic resource allocation in Hadoop 2.x. Thus, the proposed solution is based on an existing performance model for Hadoop 1.x, but takes the architectural changes into consideration and captures the execution flow of a MapReduce job by using a queuing network model. This way, the cost model reflects the intra-job synchronization constraints that occur due to contention at shared resources. The accuracy of our solution is validated by comparing our model estimates against measurements in a real Hadoop 2.x setup. Peer reviewed. Postprint (author's final draft).

    Performance Prediction of Cloud-Based Big Data Applications

    Big data analytics have become widespread as a means to extract knowledge from large datasets. Yet, the heterogeneity and irregularity usually associated with big data applications often overwhelm the existing software and hardware infrastructures. In such context, the flexibility and elasticity provided by the cloud computing paradigm offer a natural approach to cost-effectively adapting the allocated resources to the application’s current needs. However, these same characteristics impose extra challenges to predicting the performance of cloud-based big data applications, a key step to proper management and planning. This paper explores three modeling approaches for performance prediction of cloud-based big data applications. We evaluate two queuing-based analytical models and a novel fast ad hoc simulator in various scenarios based on different applications and infrastructure setups. The three approaches are compared in terms of prediction accuracy, finding that our best approaches can predict average application execution times with 26% relative error in the very worst case and about 7% on average.

    Feedback Autonomic Provisioning for Guaranteeing Performance in MapReduce Systems

    Companies have fast-growing amounts of data to process and store; a data explosion is happening next to us. Currently, one of the most common approaches to treating these vast data quantities is based on the MapReduce parallel programming paradigm. While its use is widespread in the industry, ensuring performance constraints while at the same time minimizing costs still poses considerable challenges. We propose a coarse-grained control-theoretical approach, based on techniques that have already proved their usefulness in the control community. We introduce the first algorithm to create dynamic models for Big Data MapReduce systems running a concurrent workload. Furthermore, we identify two important control use cases: relaxed performance with minimal resources, and strict performance. For the first case we develop two feedback control mechanisms: a classical feedback controller and an event-based feedback controller that also minimises the number of cluster reconfigurations. Moreover, to address strict performance requirements, a feedforward predictive controller that efficiently suppresses the effects of large workload size variations is developed. All the controllers are validated online in a benchmark running on a real 60-node MapReduce cluster, using a data-intensive Business Intelligence workload. Our experiments demonstrate the success of the control strategies employed in assuring service time constraints.

    Stochastic bounds in fork-join queueing systems under full and partial mapping

    In a Fork-Join (FJ) queueing system, an upstream fork station splits incoming jobs into N tasks to be further processed by N parallel servers, each with its own queue; the response time of one job is determined, at a downstream join station, by the maximum of the corresponding tasks’ response times. This queueing system is useful for modelling multi-service systems subject to synchronization constraints, such as MapReduce clusters or multipath routing. Despite their apparent simplicity, FJ systems are hard to analyze. This paper provides the first computable stochastic bounds on the waiting and response time distributions in FJ systems under full (bijective) and partial (injective) mapping of tasks to servers. We consider four practical scenarios by combining 1a) renewal and 1b) non-renewal arrivals, and 2a) non-blocking and 2b) blocking servers. In the case of non-blocking servers we prove that delays scale as O(log N), a law which is known for first moments under renewal input only. In the case of blocking servers, we prove that the same factor of log N dictates the stability region of the system. Simulation results indicate that our bounds are tight, especially at high utilizations, in all four scenarios. A remarkable insight gained from our results is that, at moderate to high utilizations, multipath routing “makes sense” from a queueing perspective for two paths only, i.e., response times drop the most when N = 2; the technical explanation is that the resequencing (delay) price starts to quickly dominate the tempting gain due to multipath transmissions.
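
The role of the maximum in the fork-join abstract above can be illustrated with a small sketch. It is not the paper's method or its bounds: it ignores queueing entirely and assumes, purely for illustration, that each of the N tasks takes an independent Exp(1) service time. Under that assumption the mean of the maximum is exactly the harmonic number H_N, which grows like log N, the same order of growth the abstract reports for delays. The function names `harmonic` and `simulated_mean_max` are hypothetical helpers introduced here.

```python
import random


def harmonic(n: int) -> float:
    """Exact mean of the maximum of n i.i.d. Exp(1) variables: H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def simulated_mean_max(n: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean join time of n parallel Exp(1) tasks.

    Assumption: no queueing and no arrivals; a job finishes when its slowest task does.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The join station waits for the slowest of the n forked tasks.
        total += max(rng.expovariate(1.0) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    # Compare the exact harmonic number with the simulated mean for growing N.
    for n in (1, 2, 4, 8, 16, 32):
        print(n, round(harmonic(n), 3), round(simulated_mean_max(n), 3))
```

Doubling N adds roughly log 2 to the mean join time in this toy setting, which is why the synchronization cost grows slowly but never levels off.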