Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments
This chapter presents the software architectures of big data processing platforms. It provides in-depth knowledge of the resource management techniques involved in deploying big data processing systems in cloud environments. It starts from the very basics and gradually introduces the core components of resource management, which we have divided into multiple layers. It covers state-of-the-art practices and research on SLA-based resource management, with a specific focus on job scheduling mechanisms.
Comment: 27 pages, 9 figures
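As an illustration of the kind of SLA-based scheduling decision discussed in the chapter, the sketch below admits a job only if its estimated completion time fits within the deadline stated in its SLA. All names, the linear runtime estimate, and the constants are hypothetical placeholders, not taken from the chapter.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    input_gb: float        # size of the input data
    sla_deadline_s: float  # completion-time guarantee from the SLA


def estimate_runtime_s(job: Job, slots: int, gb_per_slot_s: float = 0.05) -> float:
    """Hypothetical linear estimate: runtime shrinks with the number of slots."""
    return job.input_gb / (slots * gb_per_slot_s)


def admit(job: Job, free_slots: int) -> bool:
    """Admit the job only if the SLA deadline can be met with the free slots."""
    return estimate_runtime_s(job, max(free_slots, 1)) <= job.sla_deadline_s


if __name__ == "__main__":
    job = Job("nightly-etl", input_gb=500, sla_deadline_s=3600)
    print(admit(job, free_slots=4))  # True: 500 / (4 * 0.05) = 2500 s <= 3600 s
```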
D-SPACE4Cloud: A Design Tool for Big Data Applications
Recent years have seen a steep rise in data generation worldwide, with the development and widespread adoption of several software projects targeting the Big Data paradigm. Many companies currently engage in Big Data analytics as part of their core business activities; nonetheless, there are no tools and techniques to support the design of the underlying hardware configuration backing such systems. In particular, the focus of this report is on Cloud-deployed clusters, which represent a cost-effective alternative to on-premises installations. We propose a novel tool implementing a battery of optimization and prediction techniques, integrated so as to efficiently assess several alternative resource configurations and determine the minimum-cost cluster deployment satisfying QoS constraints. Further, an experimental campaign conducted on real systems shows the validity and relevance of the proposed method.
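A minimal sketch of the core idea, assuming a hypothetical performance predictor: enumerate candidate cluster configurations, keep only those whose predicted execution time satisfies the QoS (deadline) constraint, and return the cheapest one. The VM catalogue, prices, and the simple 1/n scaling model are placeholders, not the tool's actual prediction techniques.

```python
from itertools import product

# Hypothetical VM catalogue: hourly price and a relative speed factor.
VM_TYPES = {"small": (0.10, 1.0), "medium": (0.20, 2.1), "large": (0.40, 4.3)}


def predict_exec_time_h(base_time_h: float, vm_type: str, n_vms: int) -> float:
    """Placeholder predictor: work divides across VMs, scaled by VM speed."""
    _, speed = VM_TYPES[vm_type]
    return base_time_h / (speed * n_vms)


def cheapest_config(base_time_h: float, deadline_h: float, max_vms: int = 32):
    """Return the minimum-cost (vm_type, n_vms) meeting the deadline, or None."""
    best, best_cost = None, float("inf")
    for vm_type, n_vms in product(VM_TYPES, range(1, max_vms + 1)):
        t = predict_exec_time_h(base_time_h, vm_type, n_vms)
        if t > deadline_h:
            continue  # violates the QoS constraint
        price, _ = VM_TYPES[vm_type]
        cost = price * n_vms * t  # pay-per-use cost of the run
        if cost < best_cost:
            best, best_cost = (vm_type, n_vms), cost
    return best, best_cost


if __name__ == "__main__":
    print(cheapest_config(base_time_h=100.0, deadline_h=2.0))
```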
Performance optimization and energy efficiency of big-data computing workflows
Next-generation e-science is producing colossal amounts of data, now frequently termed Big Data, on the order of terabytes at present and petabytes or even exabytes in the foreseeable future. These scientific applications typically feature data-intensive workflows composed of moldable parallel computing jobs, such as MapReduce, with intricate inter-job dependencies. The granularity of task partitioning in each moldable job of such big data workflows has a significant impact on workflow completion time, energy consumption, and financial cost if executed in clouds, which remains largely unexplored. This dissertation conducts an in-depth investigation into the properties of moldable jobs and provides an experiment-based validation of the performance model, in which the total workload of a moldable job increases with the degree of parallelism. Furthermore, this dissertation conducts rigorous research on workflow execution dynamics in resource sharing environments and explores the interactions between workflow mapping and task scheduling on various computing platforms. A workflow optimization architecture is developed to seamlessly integrate three interrelated technical components, i.e., resource allocation, job mapping, and task scheduling.
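A small illustrative sketch of the moldable-job performance model described above, under the simplifying assumption (not from the dissertation) that the total workload grows linearly with the degree of parallelism because of per-task overhead; the constants are made up for illustration.

```python
def total_workload(w_serial: float, p: int, overhead_per_task: float = 2.0) -> float:
    """Assumed model: total workload grows with the degree of parallelism p."""
    return w_serial + overhead_per_task * p


def exec_time(w_serial: float, p: int) -> float:
    """Execution time of a moldable job run with p parallel tasks."""
    return total_workload(w_serial, p) / p


if __name__ == "__main__":
    # Speedup is sublinear: doubling p less than halves the execution time.
    for p in (1, 2, 4, 8, 16):
        print(p, round(exec_time(1000.0, p), 1))
```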
Cloud computing provides a cost-effective computing platform for big data workflows where moldable parallel computing models are widely applied to meet stringent performance requirements. Based on the moldable parallel computing performance model, a big-data workflow mapping model is constructed and a workflow mapping problem is formulated to minimize workflow makespan under a budget constraint in public clouds. This dissertation shows this problem to be strongly NP-complete and designs i) a fully polynomial-time approximation scheme for a special case with a pipeline-structured workflow executed on virtual machines of a single class, and ii) a heuristic for a generalized problem with an arbitrary directed acyclic graph-structured workflow executed on virtual machines of multiple classes. The performance superiority of the proposed solution is illustrated by extensive simulation-based results in Hadoop/YARN in comparison with existing workflow mapping models and algorithms.
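For the pipeline-structured special case, a dynamic program over (stage, remaining budget) illustrates how a makespan-minimizing mapping under a budget constraint can be computed once costs are discretized. This is only a sketch of the idea behind such a scheme, with hypothetical per-stage options, not the dissertation's approximation algorithm.

```python
from functools import lru_cache

# Hypothetical options per pipeline stage: (cost, exec_time) for each
# candidate VM allocation, with integer costs for the DP table.
STAGE_OPTIONS = [
    [(1, 30.0), (2, 18.0), (4, 11.0)],  # stage 0
    [(1, 50.0), (3, 22.0), (5, 15.0)],  # stage 1
    [(2, 40.0), (4, 24.0), (6, 17.0)],  # stage 2
]


def min_makespan(budget: int) -> float:
    """Minimum total pipeline time spending at most `budget` cost units."""

    @lru_cache(maxsize=None)
    def dp(stage: int, remaining: int) -> float:
        if stage == len(STAGE_OPTIONS):
            return 0.0
        best = float("inf")
        for cost, time in STAGE_OPTIONS[stage]:
            if cost <= remaining:
                best = min(best, time + dp(stage + 1, remaining - cost))
        return best

    return dp(0, budget)


if __name__ == "__main__":
    print(min_makespan(budget=8))    # tighter budget -> longer makespan
    print(min_makespan(budget=15))
```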
Considering that large-scale workflows for big data analytics have become a main consumer of energy in data centers, this dissertation also delves into the problem of static workflow mapping to minimize the dynamic energy consumption of a workflow request under a deadline constraint in Hadoop clusters, which is shown to be strongly NP-hard. A fully polynomial-time approximation scheme is designed for a special case with a pipeline-structured workflow on a homogeneous cluster, and a heuristic is designed for the generalized problem with an arbitrary directed acyclic graph-structured workflow on a heterogeneous cluster. This problem is further extended to a dynamic version with deadline-constrained MapReduce workflows to minimize dynamic energy consumption in Hadoop clusters. This dissertation proposes a semi-dynamic online scheduling algorithm based on adaptive task partitioning to reduce dynamic energy consumption while meeting performance requirements from a global perspective, and also develops corresponding system modules for algorithm implementation in the Hadoop ecosystem. The performance superiority of the proposed solutions in terms of dynamic energy saving and deadline miss rate is illustrated by extensive simulation results in comparison with existing algorithms, and further validated through real-life workflow implementation and experiments using the Oozie workflow engine in Hadoop/YARN systems.
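As an illustration of the adaptive task-partitioning idea, the sketch below picks, for each job, the smallest number of slots that still meets the job's deadline; under the moldable-job assumption that total work (and hence dynamic energy) grows with the slot count, the smallest feasible partitioning is also the most energy-frugal. The rates, overheads, and power figures are hypothetical, and this is not the dissertation's scheduling algorithm.

```python
GB_PER_SLOT_S = 0.02         # assumed per-slot processing rate
OVERHEAD_GB_PER_SLOT = 5.0   # assumed extra work added by each parallel task


def runtime_s(input_gb: float, slots: int) -> float:
    """Moldable-job model: total work grows with the slot count."""
    total_work = input_gb + OVERHEAD_GB_PER_SLOT * slots
    return total_work / (GB_PER_SLOT_S * slots)


def min_slots_for_deadline(input_gb: float, deadline_s: float, max_slots: int = 256) -> int:
    """Smallest slot count meeting the deadline; fewer slots => less dynamic energy."""
    for s in range(1, max_slots + 1):
        if runtime_s(input_gb, s) <= deadline_s:
            return s
    return max_slots  # best effort if the deadline is infeasible


def dynamic_energy_j(input_gb: float, slots: int, slot_power_w: float = 40.0) -> float:
    """Dynamic energy = per-slot power * runtime * number of active slots."""
    return slot_power_w * runtime_s(input_gb, slots) * slots


if __name__ == "__main__":
    for name, gb, deadline in [("sort", 200.0, 1800.0), ("join", 80.0, 600.0)]:
        s = min_slots_for_deadline(gb, deadline)
        print(name, s, round(runtime_s(gb, s)), round(dynamic_energy_j(gb, s)))
```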
Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds
MapReduce has become a popular programming model for running data-intensive applications on the cloud. Completion-time goals, or deadlines, of MapReduce jobs set by users are becoming crucial in existing cloud-based data processing environments such as Hadoop. There is a conflict between scheduling MapReduce jobs to meet deadlines and preserving "data locality" (assigning tasks to nodes that hold their input data): to meet a deadline, a task may be scheduled on a node without local input data, causing an expensive data transfer from a remote node. In this paper, a novel scheduler based on dynamic resource reconfiguration is proposed to address this problem. It has two components: 1) a Resource Predictor, which dynamically determines the number of Map/Reduce slots each job needs to meet its completion-time guarantee; and 2) a Resource Reconfigurator, which adjusts CPU resources by dynamically scaling individual VMs up or down, maximizing data locality and resource utilization among the active jobs without violating users' completion-time goals. The proposed scheduler has been evaluated against the Fair Scheduler on a virtual cluster built on a physical cluster of 20 machines. The results demonstrate an increase of about 12% in job throughput.
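A minimal sketch of the two components described above, with hypothetical names and a simple linear performance model (the paper's actual predictor and VM-reconfiguration mechanism are not reproduced here): the predictor derives the slot count needed to meet a job's deadline, and the reconfigurator scales a VM's CPU share toward that target.

```python
import math
from dataclasses import dataclass


@dataclass
class MRJob:
    name: str
    input_gb: float
    deadline_s: float


class ResourcePredictor:
    """Estimates how many Map/Reduce slots a job needs to meet its deadline."""

    def __init__(self, gb_per_slot_s: float = 0.03):
        self.gb_per_slot_s = gb_per_slot_s

    def required_slots(self, job: MRJob) -> int:
        return max(1, math.ceil(job.input_gb / (self.gb_per_slot_s * job.deadline_s)))


class ResourceReconfigurator:
    """Adjusts a VM's CPU share so it can host the required number of slots."""

    def __init__(self, cpu_share_per_slot: float = 0.5):
        self.cpu_share_per_slot = cpu_share_per_slot

    def target_cpu_share(self, required_slots: int) -> float:
        return required_slots * self.cpu_share_per_slot


if __name__ == "__main__":
    job = MRJob("wordcount", input_gb=120.0, deadline_s=900.0)
    predictor, reconfigurator = ResourcePredictor(), ResourceReconfigurator()
    need = predictor.required_slots(job)
    print(need, reconfigurator.target_cpu_share(need))
```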