There is a growing trend of performing analysis on large datasets using
workflows composed of MapReduce jobs connected through producer-consumer
relationships based on data. This trend has spurred the development of a number
of interfaces--ranging from program-based to query-based interfaces--for
generating MapReduce workflows. Studies have shown that the gap in performance
can be quite large between optimized and unoptimized workflows. However,
automatic cost-based optimization of MapReduce workflows remains a challenge
due to the multitude of interfaces, large size of the execution plan space, and
the frequent unavailability of all types of information needed for
optimization. We introduce a comprehensive plan space for MapReduce workflows
generated by popular workflow generators. We then propose Stubby, a cost-based
optimizer that searches selectively through the subspace of the full plan space
that can be enumerated correctly and costed based on the information available
in any given setting. Stubby enumerates the plan space based on plan-to-plan
transformations and an efficient search algorithm. Stubby is designed to be
extensible to new interfaces and new types of optimizations, which is a
desirable feature given how rapidly MapReduce systems are evolving. Stubby's
efficiency and effectiveness have been evaluated using representative workflows
from many domains.Comment: VLDB201