In this paper, we propose the evaluation of MapReduce on the Cell processor by way of the Marching Cubes application. We argue that the Cell architecture and the MapReduce parallel programming model complement each other well, and that Marching Cubes is a good application through which to evaluate this potential synergy. A preliminary design and a plan of evaluation are both presented.
Introduction
The Cell processor is capable of an order of magnitude performance improvement over conventional processor architectures [5, 7]. However, realizing this potential is contingent on efficiently utilizing each of the eight Synergistic Processing Elements (SPEs) in addition to the main Power Processing Element (PPE). The SPEs have unconventional execution characteristics that make them difficult to program directly, and compilers are unable to optimize effectively without significant guidance and/or programmer awareness [3, 2]. As a result, it is virtually impossible for a programmer to port optimally performing code to the Cell processor without first acquiring a considerable understanding of the underlying architecture.
MapReduce is a simple and flexible parallel programming model recently proposed by Google for application development in a distributed computing environment [1]. It has since been adapted to the symmetric- and chip-multiprocessor space as well, showing great promise [13]. MapReduce is attractive for its simple, high-level interface and its applicability to a wide range of data-parallel applications. It is worth noting that the types of applications that perform well using MapReduce are the types of applications to which the Cell's SPEs are especially well suited: applications that are easily divisible into streams of computation. The MapReduce model moreover specifies a master processor that is responsible for work coordination and scheduling, a task ideally suited for the Cell's PPE, which excels at such control-intensive tasks. These two observations lead us to believe that an exploration of the MapReduce model mapped to the Cell processor has the potential to produce compelling results.
As our sample application, we choose the Marching Cubes problem [8] for three reasons. Firstly, it is a highly data-parallel application of the type for which the Cell processor was initially designed, and will therefore accurately attest to our ability (or inability) to achieve near-peak performance using MapReduce. Secondly, to our knowledge the Marching Cubes problem has not previously been applied to MapReduce in the literature, making this work original in two respects. Thirdly, the Marching Cubes algorithm has both simple and complex variants. On this last point, we elaborate: in its most basic form, the Marching Cubes algorithm is simple and easy to understand, but it also contains substantial redundancy. Eliminating that redundancy significantly complicates the implementation due to dependence and ordering considerations. Thus, we have two possible implementations to which we can compare our MapReduce implementation: an easy-to-program implementation with suboptimal performance, and a difficult-to-program implementation with optimal performance. We hope to find that MapReduce enables us to achieve the best of both worlds.
Design
A complete design discussion is premature at this time. What follows are some design highlights.
The MapReduce formulation of Marching Cubes proceeds conceptually in two stages. In the first stage, (voxel, (coordinate, isovalue)) maps to (edge, list(coordinate, isovalue, in/out)), which reduces to (edge, coordinate). In the second stage, (edge, coordinate) maps to (cube, list(cube-edge, coordinate)), which reduces to (cube, triangles). We do not expound on this further here; the final paper will present it more fully.
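The key/value flow of the two stages can be illustrated with a minimal in-memory sketch in Python. This is not the planned Cell implementation, and several simplifications are assumptions made for illustration only: the voxel key is taken to be the integer grid coordinate itself, a sample counts as "in" when its value lies below the isovalue, and the final reduce returns the sorted edge crossings of each cube instead of triangles, because the standard Marching Cubes triangle table is omitted for brevity. All function names (map_reduce, make_stage1, make_stage2, run_pipeline) are hypothetical.

```python
from collections import defaultdict
from itertools import product

AXES = ((1, 0, 0), (0, 1, 0), (0, 0, 1))


def map_reduce(records, mapper, reducer):
    """Run one in-memory map/reduce pass; a reducer returning None emits nothing."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):  # sorting stands in for the sorted key list
        reduced = reducer(out_key, groups[out_key])
        if reduced is not None:
            results.append((out_key, reduced))
    return results


def make_stage1(shape, iso):
    """Stage 1: (voxel, value) -> (edge, endpoint data) -> (edge, crossing point)."""
    def mapper(voxel, value):
        inside = value < iso  # assumed in/out convention
        for axis in AXES:
            for sign in (1, -1):
                other = tuple(v + sign * a for v, a in zip(voxel, axis))
                if all(0 <= c < n for c, n in zip(other, shape)):
                    edge = (min(voxel, other), max(voxel, other))
                    yield edge, (voxel, value, inside)

    def reducer(edge, endpoints):
        (p0, v0, in0), (p1, v1, in1) = endpoints  # both endpoints emit the edge
        if in0 == in1:
            return None  # the surface does not cross this edge
        t = (iso - v0) / (v1 - v0)  # linear interpolation parameter
        return tuple(a + t * (b - a) for a, b in zip(p0, p1))

    return mapper, reducer


def make_stage2(shape):
    """Stage 2: (edge, point) -> (cube, (cube-edge, point)) -> (cube, crossings)."""
    def mapper(edge, point):
        p0, p1 = edge
        axis = next(i for i in range(3) if p0[i] != p1[i])
        others = [i for i in range(3) if i != axis]
        # An edge belongs to up to four cubes, identified by their minimum corner.
        for du, dv in product((0, 1), repeat=2):
            origin = list(p0)
            origin[others[0]] -= du
            origin[others[1]] -= dv
            if all(0 <= c <= n - 2 for c, n in zip(origin, shape)):
                local = tuple(a - b for a, b in zip(p0, origin))
                yield tuple(origin), ((local, axis), point)

    def reducer(cube, crossings):
        # Placeholder for the triangle table lookup: return the sorted crossings.
        return sorted(crossings)

    return mapper, reducer


def run_pipeline(samples, shape, iso):
    """samples maps integer grid coordinates to scalar values."""
    m1, r1 = make_stage1(shape, iso)
    edges = map_reduce(sorted(samples.items()), m1, r1)
    m2, r2 = make_stage2(shape)
    cubes = map_reduce(edges, m2, r2)
    return edges, cubes
```

In this sketch each crossing point is interpolated exactly once, in the stage 1 reduce, and is then distributed to every cube that shares the edge, which is the redundancy elimination the two-stage formulation is meant to provide.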
The PPE will have at least two threads of control to take advantage of its SMT capabilities. One thread will perform scheduling work, and the other will maintain a sorted key list. Any slack, if it exists, might be used to assist in the actual work, but for now we assume that there is none. Each SPE will have one thread of control. The PPE scheduler thread will stream data for mapping to the SPEs. The SPEs will double-buffer to minimize idle time, writing intermediate values into separate buffers, which are streamed back to main memory. Keys are stored separately from their values, as nodes within a sorted key list maintained by the PPE (this needs more explanation, but we refrain from going into implementation specifics for now). This sorted list is required to group common items together for the reduction phase. The reduction phase proceeds virtually identically to the mapping phase, except that the values are streamed to the SPEs using a DMA gather to assemble the map output buffers that share a common key. When each reduce task is complete, its output is flushed back to main memory. Finally, the reduction output buffers are logically sorted using a sorted buffer list. Since the key is unchanged, this sorting is significantly simpler than the sorting between the mapping and reduction phases; it is simply an ordering problem.
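The role of the sorted key list can be sketched as follows, assuming a plain Python model in which map output buffers are ordinary lists and a value is referenced by a (buffer id, offset) pair. The class name SortedKeyList and its methods add and gather are hypothetical, and the gather here only imitates logically what a DMA gather would do on the Cell; no Cell SDK calls are shown.

```python
import bisect


class SortedKeyList:
    """Keeps keys in sorted order, separate from the values they refer to."""

    def __init__(self):
        self._keys = []  # keys in sorted order
        self._refs = {}  # key -> list of (buffer_id, offset) value references

    def add(self, key, buffer_id, offset):
        """Record that the value for `key` lives at buffers[buffer_id][offset]."""
        if key not in self._refs:
            bisect.insort(self._keys, key)  # insert while preserving sort order
            self._refs[key] = []
        self._refs[key].append((buffer_id, offset))

    def gather(self, buffers):
        """Yield (key, values) in key order, collecting values across buffers."""
        for key in self._keys:
            yield key, [buffers[b][o] for b, o in self._refs[key]]
```

For example, if two map output buffers each hold one value for key "a" and one for key "b", gather yields key "a" with both of its values, followed by key "b" with both of its values, regardless of the order in which the entries were added.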
Methodology of Evaluation
The primary purpose of this project is to evaluate MapReduce on the Cell processor. Therefore, to limit project scope, only one platform (Cell) and only one application (Marching Cubes) will be evaluated. The Marching Cubes application will be evaluated in three forms, as alluded to in the Introduction: 1) a simple Marching Cubes implementation using the traditional algorithm, where cubes are assembled from vertices and processed entirely independently of one another; 2) a more complex Marching Cubes implementation where cubes are processed in groups, in a manner that minimizes computational redundancy and memory transfers, as described in the work implementing Marching Cubes on the Cell by O'Conor [9]; and 3) a MapReduce Marching Cubes implementation that eliminates computational redundancy by executing in two phases and transforming the data (through reduction) between phases, at the expense of more memory transfers and more complex runtime execution.
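To give a rough sense of the redundancy that separates the first implementation from the other two, the following sketch counts edge evaluations on a regular grid of n x n x n cubes. It assumes, as an upper bound, that the simple implementation evaluates all 12 edges of every cube independently; in practice only edges crossed by the isosurface require interpolation. The function name edge_counts is hypothetical.

```python
def edge_counts(n):
    """Return (naive, unique, ratio) edge evaluations for an n x n x n cube grid."""
    naive = 12 * n**3             # every cube evaluates its 12 edges on its own
    unique = 3 * n * (n + 1)**2   # distinct edges in a grid of (n + 1)^3 vertices
    return naive, unique, naive / unique
```

The ratio is 1 for a single cube and approaches 4 as n grows, because each interior edge is shared by four cubes.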
The inputs to the Marching Cubes implementations have not yet been determined. The goal is to evaluate inputs of widely varying sizes and distributions, from a data set that easily fits into a single 256 KB SPE local store to a data set that is many hundreds of times larger.
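As a back-of-the-envelope aid for choosing the smallest input, the sketch below computes the largest cubic volume whose samples fit within a given byte budget. The 4-byte sample size is an assumption, as is the fraction of the local store available for data; the local store must also hold code and working buffers. The function name max_cubic_side is hypothetical.

```python
def max_cubic_side(budget_bytes, bytes_per_sample=4):
    """Largest side s such that s**3 samples fit within budget_bytes."""
    samples = budget_bytes // bytes_per_sample
    side = round(samples ** (1 / 3))
    # Correct any floating-point error in the cube root estimate.
    while side**3 > samples:
        side -= 1
    while (side + 1)**3 <= samples:
        side += 1
    return side
```

Under these assumptions, a full 256 KB budget holds at most a 40 x 40 x 40 volume, and half of that budget holds a 32 x 32 x 32 volume.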
The three implementations will be executed and evaluated on the IBM Full-System Simulator for the Cell Broadband Engine Processor. If time allows, they will also be evaluated on an actual Cell BladeCenter QS20. Using the simulator, we also wish to evaluate each implementation with a varying number of available SPEs, for example from 2 to 64, on the assumption that the simulator supports such configurations.
Metrics of Evaluation
The three Marching Cubes implementations will be compared based on relative execution time only. For the simulator, execution time will most likely be measured in cycles. For the Cell blade (assuming it is used in the evaluation), execution time will be measured in microseconds.
Expected Conclusions
It is difficult to predict the results at this stage. We are fairly confident that the MapReduce implementation will outpace the simple Marching Cubes implementation. However, it is likely that the other, more complex implementation will have a performance edge. This is primarily because the Marching Cubes problem has very fine data granularity, with a small computation component relative to the expected MapReduce data management overhead. The MapReduce implementation will need to maintain a sorted list of keys and their associated values between the map and reduce phases, and there will be some unavoidable SPE idle time between the two phases due to their necessarily serial execution. On the other hand, the MapReduce implementation has the advantage that it will eliminate all redundant computations. We expect the performance of the MapReduce implementation to vary quite drastically with the input data set, relative to the other two more traditional implementations. We believe this for design reasons not yet fully explained.
Related Work
Other efforts to enhance the programmability of the Cell have been underway for some time. The Accelerated Framework Library from IBM, distributed with the Cell SDK, provides an architecture-specific API to facilitate data-parallel programming. IBM Research has also proposed message passing microtasks [10] based on the standard Message Passing Interface (MPI) [4]. OpenMP directives are supported and heavily utilized by the Cell compiler [3], and are a popular and easily applied tool to express parallelism at all levels [11, 12]. OpenMP is, however, not as elegant as MapReduce in expressing data-level parallel workflow, and it does not provide assistance in other Cell problem areas such as processor scheduling, SIMDization, alignment handling, static branch prediction, software caching, and other SPE pitfall areas.
Marching Cubes on the Cell processor has been implemented before [9]. We will adapt our more complex implementation of Marching Cubes from the implementation described in that prior work. The prior work actually implements a variation on the Marching Cubes algorithm, known as Marching Tetrahedra [6], which has both advantages and disadvantages over Marching Cubes. The two variations are easily interchanged in implementation; we implement Marching Cubes for simplicity.
