Optimizing MapReduce for Highly Distributed Environments
MapReduce, the popular programming paradigm for large-scale data processing,
has traditionally been deployed over tightly-coupled clusters where the data is
already locally available. The assumption that the data and compute resources
are available in a single central location, however, no longer holds for many
emerging applications in commercial, scientific and social networking domains,
where the data is generated in a geographically distributed manner. Further,
the computational resources needed for carrying out the data analysis may be
distributed across multiple data centers or community resources such as Grids.
In this paper, we develop a modeling framework to capture MapReduce execution
in a highly distributed environment comprising distributed data sources and
distributed computational resources. This framework is flexible enough to
capture several design choices and performance optimizations for MapReduce
execution. We propose a model-driven optimization that has two key features:
(i) it is end-to-end as opposed to myopic optimizations that may only make
locally optimal but globally suboptimal decisions, and (ii) it can control
multiple MapReduce phases to achieve low runtime, as opposed to single-phase
optimizations that may control only individual phases. Our model results show
that our optimization can provide nearly 82% and 64% reduction in execution
time over myopic and single-phase optimizations, respectively. We have modified
Hadoop to implement our model outputs, and using three different MapReduce
applications over an 8-node emulated PlanetLab testbed, we show that our
optimized Hadoop execution plan achieves 31-41% reduction in runtime over a
vanilla Hadoop execution. Our model-driven optimization also provides several
insights into the choice of techniques and execution parameters based on
application and platform characteristics.
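
The contrast between myopic and end-to-end optimization can be illustrated with a toy calculation. The sketch below is not the paper's model. It assumes only two phases (pushing data to mappers, then mapping), a global barrier between them, independent network links, and all-or-nothing assignment of each source to one mapper. Every name and number in it (DATA_MB, BANDWIDTH_MBPS, MAP_RATE_MBPS, and the three helper functions) is hypothetical and chosen purely for illustration.

```python
from itertools import product

# Hypothetical inputs (not taken from the paper).
DATA_MB = [100.0, 100.0]                     # data held at each source
BANDWIDTH_MBPS = [[10.0, 5.0], [5.0, 10.0]]  # [source][mapper] link speed, MB/s
MAP_RATE_MBPS = [2.0, 20.0]                  # processing rate of each mapper, MB/s


def push_time(plan):
    """Time to move all data, assuming every source uses its own link in parallel."""
    return max(DATA_MB[s] / BANDWIDTH_MBPS[s][m] for s, m in enumerate(plan))


def map_time(plan):
    """Time for the slowest mapper to process everything assigned to it."""
    load = [0.0] * len(MAP_RATE_MBPS)
    for s, m in enumerate(plan):
        load[m] += DATA_MB[s]
    return max(l / r for l, r in zip(load, MAP_RATE_MBPS))


def makespan(plan):
    """Total runtime, assuming a barrier between the push and map phases."""
    return push_time(plan) + map_time(plan)


# A plan is a tuple: plan[s] is the mapper that receives all data of source s.
plans = list(product(range(len(MAP_RATE_MBPS)), repeat=len(DATA_MB)))

# Myopic: minimize the push phase alone, ignoring what happens afterwards.
myopic_plan = min(plans, key=push_time)

# End-to-end: minimize the total runtime across both phases.
end_to_end_plan = min(plans, key=makespan)

print("myopic:    ", myopic_plan, makespan(myopic_plan), "s")
print("end-to-end:", end_to_end_plan, makespan(end_to_end_plan), "s")
```

With these made-up numbers, the myopic plan sends each source to its nearest mapper and finishes in 60 s, because the slow mapper becomes the bottleneck. The end-to-end plan accepts a slower push so that all data lands on the fast mapper, and finishes in 30 s. A fuller treatment would also have to cover the shuffle and reduce phases and allow fractional data placement, which this sketch leaves out.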