OS4M: Achieving Global Load Balance of MapReduce Workload by Scheduling at the Operation Level
The efficiency of MapReduce is closely related to its load balance. Existing
work on MapReduce load balancing focuses on coarse-grained scheduling. This study
concerns fine-grained scheduling of MapReduce operations, with each operation
representing one invocation of the Map or Reduce function. By default,
MapReduce adopts the hash-based method to schedule Reduce operations, which
often leads to poor load balance. In addition, the copy phase of Reduce tasks
overlaps with Map tasks, which significantly hinders the progress of Map tasks
due to I/O contention. Moreover, the three phases of a Reduce task run in
sequence even though they consume different resources, which leaves resources
under-utilized. To overcome these problems, we introduce a set of mechanisms named
OS4M (Operation Scheduling for MapReduce) to improve MapReduce's performance.
OS4M achieves load balance by collecting statistics on all Map operations and
calculating a globally optimal schedule for distributing Reduce operations. With
OS4M, the copy phase of Reduce tasks no longer overlaps with Map tasks, and the
three phases of Reduce tasks are pipelined based on their operation loads. OS4M
has been transparently incorporated into MapReduce. Evaluations on standard
benchmarks show that OS4M shortens job duration by up to 42% compared with
a Hadoop baseline.
Comment: arXiv admin note: substantial text overlap with arXiv:1401.035
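
The contrast the abstract draws between hash-based scheduling and a load-aware global schedule can be illustrated with a small sketch. It is not the OS4M algorithm: the greedy "heaviest operation first" heuristic is a stand-in chosen for brevity, the per-key loads are assumed to be known from statistics gathered during the Map phase, and the names `hash_schedule`, `greedy_schedule`, and `key_loads` are hypothetical. CRC32 is used in place of Hadoop's key hash only so the result is deterministic.

```python
import heapq
import zlib


def hash_schedule(key_loads, num_reducers):
    """Assign each Reduce operation (one per key) by hashing its key.

    The load of an operation plays no role in the placement, so heavy
    keys can pile up on the same reducer.
    """
    loads = [0] * num_reducers
    for key, load in key_loads.items():
        # Stand-in for a hash partitioner: hash(key) mod number of reducers.
        reducer = zlib.crc32(key.encode("utf-8")) % num_reducers
        loads[reducer] += load
    return loads


def greedy_schedule(key_loads, num_reducers):
    """Assign operations using known loads (illustrative heuristic only).

    Operations are taken from heaviest to lightest, and each one goes to
    the reducer that currently has the smallest total load.
    """
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    loads = [0] * num_reducers
    for _key, load in sorted(key_loads.items(), key=lambda kv: -kv[1]):
        current, reducer = heapq.heappop(heap)
        loads[reducer] = current + load
        heapq.heappush(heap, (loads[reducer], reducer))
    return loads


if __name__ == "__main__":
    # Hypothetical per-key loads, e.g. number of values emitted per key.
    key_loads = {"a": 10, "b": 9, "c": 8, "d": 7, "e": 6, "f": 5}
    print("hash-based :", hash_schedule(key_loads, 3))
    print("load-aware :", greedy_schedule(key_loads, 3))
```

The job finishes only when the most heavily loaded reducer finishes, so the quantity to compare between the two schedules is the maximum per-reducer load, not the total.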