Scheduling MapReduce Jobs under Multi-Round Precedences
We consider non-preemptive scheduling of MapReduce jobs with multiple tasks
in the practical scenario where each job requires several map-reduce rounds. We
seek to minimize the average weighted completion time and consider scheduling
on identical and unrelated parallel processors. For identical processors, we
present LP-based O(1)-approximation algorithms. For unrelated processors, the
approximation ratio naturally depends on the maximum number of rounds of any
job. Since the number of rounds per job in typical MapReduce algorithms is a
small constant, our scheduling algorithms achieve a small approximation ratio
in practice. For the single-round case, we substantially improve on previously
best known approximation guarantees for both identical and unrelated
processors. Moreover, we conduct an experimental analysis and compare the
performance of our algorithms against a fast heuristic and a lower bound on the
optimal solution, thus demonstrating their promising practical performance.
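To make the objective concrete, the following is a minimal sketch of a simple list-scheduling baseline on identical processors. It is not the LP-based algorithm from the abstract, and it ignores map-reduce rounds and precedences. It orders jobs by Smith's ratio (processing time divided by weight) and reports the weighted sum of completion times. The function name `wspt_list_schedule` and the job tuple layout are assumptions made for illustration.

```python
import heapq


def wspt_list_schedule(jobs, num_machines):
    """Greedy list scheduling on identical machines (illustrative baseline only).

    jobs: list of (name, processing_time, weight) with positive weights.
    Returns (completion_times, total_weighted_completion_time).
    """
    # Smith's rule: smaller processing_time / weight goes first.
    order = sorted(jobs, key=lambda job: job[1] / job[2])

    # Min-heap holding the time at which each machine becomes free.
    free_at = [0.0] * num_machines
    heapq.heapify(free_at)

    completion = {}
    total = 0.0
    for name, processing_time, weight in order:
        start = heapq.heappop(free_at)      # earliest available machine
        finish = start + processing_time    # non-preemptive execution
        heapq.heappush(free_at, finish)
        completion[name] = finish
        total += weight * finish
    return completion, total


example_jobs = [("a", 3, 1), ("b", 1, 2), ("c", 2, 2)]
completion, total = wspt_list_schedule(example_jobs, num_machines=2)
```

Dividing `total` by the number of jobs gives the average weighted completion time that the abstract seeks to minimize.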
Performance Analysis of Modified SRPT in Multiple-Processor Multitask Scheduling
In this paper we study the multiple-processor multitask scheduling problem in
both deterministic and stochastic models. We consider and analyze the Modified
Shortest Remaining Processing Time (M-SRPT) scheduling algorithm, a simple
modification of SRPT, which always schedules jobs according to SRPT whenever
possible, while processing tasks in an arbitrary order. The M-SRPT algorithm is
proved to achieve a competitive ratio of $\Theta(\log \alpha + \beta)$ for
minimizing response time, where $\alpha$ denotes the ratio between maximum job
workload and minimum job workload, and $\beta$ represents the ratio between maximum
non-preemptive task workload and minimum job workload. In addition, the
competitive ratio achieved is shown to be optimal (up to a constant factor),
when there is a constant number of machines. We further consider the problem
under Poisson arrival and general workload distribution (i.e., the $M/GI/N$ system),
and show that M-SRPT achieves asymptotically optimal mean response time when the
traffic intensity $\rho$ approaches $1$, if the job size distribution has finite
support. Beyond finite job workload, the asymptotic optimality of M-SRPT also
holds for infinite job size distributions with certain probabilistic
assumptions, for example, the $M/M/N$ system with finite task workload.
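For readers unfamiliar with the underlying policy, here is a minimal sketch of plain preemptive SRPT on a single machine. It is not M-SRPT: it has no non-preemptive tasks and only one processor. It computes each job's response time (completion time minus arrival time). The function name `srpt_response_times` and the `(arrival_time, size)` input layout are assumptions made for illustration.

```python
import heapq


def srpt_response_times(jobs):
    """Simulate preemptive SRPT on one machine (illustrative sketch only).

    jobs: list of (arrival_time, size). Returns response times in input order.
    """
    by_arrival = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    ready = []                      # min-heap of (remaining_work, job_index)
    response = [0.0] * len(jobs)
    now = 0.0
    nxt = 0                         # next position in by_arrival to admit

    while nxt < len(jobs) or ready:
        if not ready:
            # Machine is idle: jump ahead to the next arrival.
            now = max(now, jobs[by_arrival[nxt]][0])
        # Admit every job that has arrived by the current time.
        while nxt < len(jobs) and jobs[by_arrival[nxt]][0] <= now:
            i = by_arrival[nxt]
            heapq.heappush(ready, (jobs[i][1], i))
            nxt += 1

        remaining, i = heapq.heappop(ready)   # shortest remaining work
        next_arrival = jobs[by_arrival[nxt]][0] if nxt < len(jobs) else float("inf")
        run = min(remaining, next_arrival - now)
        now += run
        if run < remaining:
            # Interrupted by an arrival: put the job back and re-evaluate.
            heapq.heappush(ready, (remaining - run, i))
        else:
            response[i] = now - jobs[i][0]
    return response


times = srpt_response_times([(0, 5), (1, 2), (2, 1)])
mean_response = sum(times) / len(times)
```

In this small instance the long job that arrives first is preempted by the two shorter jobs, which is the behavior that keeps mean response time low under SRPT.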
Energy Efficient Scheduling of MapReduce Jobs
MapReduce has emerged as a prominent programming model for data-intensive
computation. In this work, we study power-aware MapReduce scheduling in the
speed scaling setting first introduced by Yao et al. [FOCS 1995]. We focus on
the minimization of the total weighted completion time of a set of MapReduce
jobs under a given budget of energy. Using a linear programming relaxation of
our problem, we derive a polynomial time constant-factor approximation
algorithm. We also propose a convex programming formulation that we combine
with standard list scheduling policies, and we evaluate their performance using
simulations.
Comment: 22 pages
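As a rough illustration of the speed-scaling setting, the sketch below assumes the standard power model in which running at speed s draws power s^alpha, so a job with work w costs w * s^(alpha - 1) energy. It runs all jobs on one processor at a single common speed that spends the budget exactly. This is a deliberate simplification, not the LP or convex-programming method from the abstract. The name `uniform_speed_schedule`, the `(work, weight)` layout, and the default `alpha=3.0` are assumptions.

```python
def uniform_speed_schedule(jobs, energy_budget, alpha=3.0):
    """Single-processor, common-speed sketch of scheduling under an energy budget.

    jobs: list of (work, weight) with positive entries.
    Returns (speed, total_weighted_completion_time).
    """
    total_work = sum(work for work, _ in jobs)

    # Energy at common speed s is total_work * s**(alpha - 1); solve for s.
    speed = (energy_budget / total_work) ** (1.0 / (alpha - 1.0))

    # With one common speed, ordering by work / weight (Smith's rule)
    # minimizes the weighted sum of completion times on one processor.
    order = sorted(jobs, key=lambda job: job[0] / job[1])

    now = 0.0
    objective = 0.0
    for work, weight in order:
        now += work / speed         # execution time shrinks as speed grows
        objective += weight * now
    return speed, objective


speed, objective = uniform_speed_schedule([(4, 1), (2, 2)], energy_budget=24.0)
```

Letting each job run at its own speed can only improve the objective for the same budget, which is where an optimization-based formulation becomes useful.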
Performance optimization and energy efficiency of big-data computing workflows
Next-generation e-science is producing colossal amounts of data, now frequently termed Big Data, on the order of terabytes at present and petabytes or even exabytes in the foreseeable future. These scientific applications typically feature data-intensive workflows composed of moldable parallel computing jobs, such as MapReduce, with intricate inter-job dependencies. The granularity of task partitioning in each moldable job of such big data workflows has a significant impact on workflow completion time, energy consumption, and, if executed in clouds, financial cost, yet this impact remains largely unexplored. This dissertation conducts an in-depth investigation into the properties of moldable jobs and provides an experiment-based validation of the performance model where the total workload of a moldable job increases along with the degree of parallelism. Furthermore, this dissertation conducts rigorous research on workflow execution dynamics in resource sharing environments and explores the interactions between workflow mapping and task scheduling on various computing platforms. A workflow optimization architecture is developed to seamlessly integrate three interrelated technical components, i.e., resource allocation, job mapping, and task scheduling.
Cloud computing provides a cost-effective computing platform for big data workflows where moldable parallel computing models are widely applied to meet stringent performance requirements. Based on the moldable parallel computing performance model, a big-data workflow mapping model is constructed and a workflow mapping problem is formulated to minimize workflow makespan under a budget constraint in public clouds. This dissertation shows this problem to be strongly NP-complete and designs i) a fully polynomial-time approximation scheme for a special case with a pipeline-structured workflow executed on virtual machines of a single class, and ii) a heuristic for a generalized problem with an arbitrary directed acyclic graph-structured workflow executed on virtual machines of multiple classes. The performance superiority of the proposed solution is illustrated by extensive simulation-based results in Hadoop/YARN in comparison with existing workflow mapping models and algorithms.
Considering that large-scale workflows for big data analytics have become a main consumer of energy in data centers, this dissertation also delves into the problem of static workflow mapping to minimize the dynamic energy consumption of a workflow request under a deadline constraint in Hadoop clusters, which is shown to be strongly NP-hard. A fully polynomial-time approximation scheme is designed for a special case with a pipeline-structured workflow on a homogeneous cluster and a heuristic is designed for the generalized problem with an arbitrary directed acyclic graph-structured workflow on a heterogeneous cluster. This problem is further extended to a dynamic version with deadline-constrained MapReduce workflows to minimize dynamic energy consumption in Hadoop clusters. This dissertation proposes a semi-dynamic online scheduling algorithm based on adaptive task partitioning to reduce dynamic energy consumption while meeting performance requirements from a global perspective, and also develops corresponding system modules for algorithm implementation in the Hadoop ecosystem. The performance superiority of the proposed solutions in terms of dynamic energy saving and deadline miss rate is illustrated by extensive simulation results in comparison with existing algorithms, and further validated through real-life workflow implementation and experiments using the Oozie workflow engine in Hadoop/YARN systems.