
    Performance Modeling and Resource Management for Mapreduce Applications

    Big Data analytics is increasingly performed using the MapReduce paradigm and its open-source implementation, Hadoop, as the platform of choice. Many applications associated with live business intelligence are written as complex data analysis programs defined by directed acyclic graphs of MapReduce jobs. An increasing number of these applications have additional requirements for completion time guarantees. The advent of cloud computing offers a competitive alternative for data analytics problems, but it also introduces new challenges in provisioning clusters that provide the best cost-performance trade-offs. In this dissertation, we aim to develop a performance evaluation framework that enables automatic resource management for MapReduce applications in achieving different optimization goals. It consists of the following components: (1) a performance modeling framework that estimates the completion time of a given MapReduce application when executed on a Hadoop cluster, according to its input data sets, the job settings, and the amount of resources allocated for processing it; (2) a resource allocation strategy for deadline-driven MapReduce applications that automatically tailors and controls the resource allocation on a shared Hadoop cluster so that different applications achieve their (soft) deadlines; (3) a simulator-based solution to the resource provisioning problem in public cloud environments that guides users in determining the types and amount of resources they should lease from the service provider to achieve different goals; (4) an optimization strategy that automatically determines the optimal job settings within a MapReduce application for efficient execution and resource usage. We validate the accuracy, efficiency, and performance benefits of the proposed framework using a set of realistic MapReduce applications on both a private cluster and a public cloud environment.
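As a rough illustration of what components (1) and (2) involve, the sketch below bounds the completion time of a single stage of independent tasks that are assigned greedily to a fixed number of slots, then searches for the smallest slot count whose worst-case bound fits a deadline. This is a generic makespan bound and is not taken from the dissertation; the function names, the `max_slots` parameter, and the task durations are all hypothetical.

```python
def makespan_bounds(task_durations, slots):
    """Return (lower, upper) bounds on the time to finish all tasks
    when they are assigned greedily to `slots` identical slots."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    n = len(task_durations)
    if n == 0:
        return 0.0, 0.0
    avg = sum(task_durations) / n
    longest = max(task_durations)
    lower = n * avg / slots                   # perfectly balanced work
    upper = (n - 1) * avg / slots + longest   # longest task starts last
    return lower, upper


def min_slots_for_deadline(task_durations, deadline, max_slots):
    """Smallest slot count whose upper bound meets the deadline, or None
    if no count up to `max_slots` (a hypothetical cluster limit) does."""
    for slots in range(1, max_slots + 1):
        _, upper = makespan_bounds(task_durations, slots)
        if upper <= deadline:
            return slots
    return None


if __name__ == "__main__":
    durations = [10, 20, 30, 40]  # made-up task durations in seconds
    print(makespan_bounds(durations, 2))
    print(min_slots_for_deadline(durations, deadline=80, max_slots=4))
```

Using the upper bound for the deadline check is a conservative choice: it may request more slots than strictly needed, but the stage cannot overrun the deadline under greedy assignment, provided the measured durations hold.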

    Resource Management In Cloud And Big Data Systems

    Cloud computing is a paradigm shift in computing, where services are offered and acquired on demand in a cost-effective way. These services are often virtualized, and they can handle the computing needs of big data analytics. The ever-growing demand for cloud services arises in many areas including healthcare, transportation, energy systems, and manufacturing. However, cloud resources such as computing power, storage, energy, dollars for infrastructure, and dollars for operations are limited. Effective use of the existing resources raises several fundamental challenges that place cloud resource management at the heart of the cloud providers' decision-making process. One of the challenges faced by cloud providers is to provision, allocate, and price the resources such that their profit is maximized and the resources are utilized efficiently. In addition, executing large-scale applications in clouds may require resources from several cloud providers. Another challenge when processing data-intensive applications is minimizing their energy costs. Electricity used in US data centers in 2010 accounted for about 2% of total electricity used nationwide. In addition, the energy consumed by data centers is growing at over 15% annually, and energy costs make up about 42% of the data centers' operating costs. Therefore, it is critical for data centers to minimize their energy consumption when offering services to customers. In this Ph.D. dissertation, we address these challenges by designing, developing, and analyzing mechanisms for resource management in cloud computing systems and data centers. The goal is to allocate resources efficiently while optimizing a global performance objective of the system (e.g., maximizing revenue, maximizing social welfare, or minimizing energy). We improve the state of the art in both methodologies and applications.
As for methodologies, we introduce novel resource management mechanisms based on mechanism design, approximation algorithms, cooperative game theory, and hedonic games. These mechanisms can be applied in cloud virtual machine (VM) allocation and pricing, cloud federation formation, and energy-efficient computing. In this dissertation, we outline our contributions and possible directions for future research in this field.

    Task Scheduling in Big Data Platforms: A Systematic Literature Review

    Context: Hadoop, Spark, Storm, and Mesos are well-known frameworks in both the research and industrial communities that allow expressing and processing distributed computations on massive amounts of data. Multiple scheduling algorithms have been proposed to ensure that short interactive jobs, large batch jobs, and guaranteed-capacity production jobs running on these frameworks can deliver results quickly while maintaining a high throughput. However, only a few works have examined the effectiveness of these algorithms. Objective: The Evidence-based Software Engineering (EBSE) paradigm and its core tool, the Systematic Literature Review (SLR), were introduced to the Software Engineering community in 2004 to help researchers systematically and objectively gather and aggregate research evidence about different topics. In this paper, we conduct an SLR of task scheduling algorithms that have been proposed for big data platforms. Method: We analyse the design decisions of different scheduling models proposed in the literature for Hadoop, Spark, Storm, and Mesos between 2005 and 2016. We provide a research taxonomy for succinct classification of these scheduling models. We also compare the algorithms in terms of performance, resource utilization, and failure recovery mechanisms. Results: Our searches identify 586 studies from the highest-quality journals, conferences, and workshops in this field. This SLR reports on different types of scheduling models (dynamic, constrained, and adaptive) and the main motivations behind them (including data locality, workload balancing, resource utilization, and energy efficiency). A discussion of some open issues and future challenges pertaining to improving the current studies is provided.

    Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

    The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources: web applications, mobile phones, sensors, and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

    Multi-constraint scheduling of MapReduce workloads

    In recent years there has been extraordinary growth in large-scale data processing and related technologies in both industry and academia. This trend is mostly driven by the need to explore the increasingly large amounts of information that global companies and communities are able to gather, and it has led to the introduction of new tools and models, most of which are designed around the idea of handling huge amounts of data. A good example of this trend towards improved large-scale data processing is MapReduce, a programming model intended to ease the development of massively parallel applications, which has been widely adopted to process large datasets thanks to its simplicity. While the MapReduce model was originally used primarily for batch data processing in large static clusters, nowadays it is mostly deployed along with other kinds of workloads in shared environments in which multiple users may be submitting concurrent jobs with completely different priorities and needs: from small, almost interactive executions to very long applications that take hours to complete. Scheduling and selecting tasks for execution is extremely relevant in MapReduce environments, since it governs a job's opportunity to make progress and determines its performance. However, only basic primitives to prioritize between jobs are available at the moment, constantly causing either under- or over-provisioning, as the amount of resources needed to complete a particular job is not obvious a priori. This thesis aims to address both the lack of management capabilities and the increased complexity of the environments in which MapReduce is executed. To that end, new models and techniques are introduced to improve the scheduling of MapReduce in the presence of different constraints found in real-world scenarios, such as completion time goals, data locality, hardware heterogeneity, or availability of resources.
The focus is on improving the integration of MapReduce with the computing infrastructures in which it usually runs, allowing alternative techniques for dynamic management and provisioning of resources. More specifically, the thesis focuses on three scenarios that are incremental in scope. First, it studies the prospects of using high-level performance criteria to manage and drive the performance of MapReduce applications, taking advantage of the fact that MapReduce is executed in controlled environments in which the status of the cluster is known. Second, it examines the feasibility and benefits of making the MapReduce runtime more aware of the underlying hardware and the characteristics of applications. Finally, it considers the interaction between MapReduce and other kinds of workloads, proposing new techniques to handle these increasingly complex environments. Following these three scenarios, this thesis contributes to the management of MapReduce workloads by 1) proposing a performance model for MapReduce workloads and a scheduling algorithm that leverages the proposed model and is able to adapt to the various needs of its users in the presence of completion time constraints; 2) proposing a new resource model for MapReduce and a placement algorithm aware of the underlying hardware as well as the characteristics of the applications, capable of improving cluster utilization while still being guided by job performance metrics; and 3) proposing a model for shared environments in which MapReduce is executed along with other kinds of workloads such as transactional applications, and a scheduler aware of these workloads and their expected resource demand, capable of improving resource utilization across machines while observing completion time goals.

    Bridging a Gap Between Research and Production: Contributions to Scheduling and Simulation

    Large-scale distributed computing infrastructures (e.g., data centers, grids, or clouds) are used by scientists from various domains to produce outstanding research results, such as the discovery of the Higgs boson in High Energy Physics. These infrastructures are also studied by Computer Scientists to produce their own set of scientific results. Ideally, a virtuous circle should exist between Domain and Computer Scientists, with the former raising challenges that could be addressed by the latter. Unfortunately, on many occasions, a gap exists that prevents such an ideal collaboration. This habilitation covers research conducted in the fields of scheduling and simulation that contributes to filling this gap. It discusses the conditions necessary to achieve this goal and details concrete initiatives in this endeavor.