Abstract-Array computers can be useful in the solution of numerical spatiotemporal problems such as partial differential equations (PDEs). IBM has recently introduced the Cell Broadband Engine (Cell BE) Architecture, which contains 8 identical vector processors in an array structure. In the paper the implementation of the 3-D Princeton Ocean Model on the Cell BE is discussed. The area/speed/power tradeoffs of our solution and different hardware implementations are also compared.
I. INTRODUCTION
Performance of the general purpose computing systems is usually improved by increasing the clock frequency and adding more processor cores. However, to achieve very high operating frequency very deep pipeline is required, which cannot be utilized in every clock cycle due to data and control dependencies. If an array of processor cores is used, the memory system should handle several concurrent memory accesses, which requires large cache memory and complex control logic. In addition, applications rarely occupy all of the available integer and floating point execution units fully. 1 Array processing to increase the computing power by using parallel computation can be a good candidate to solve architectural problems (distribution of control signals on a chip). Huge computing power is a requirement if we want to solve complex tasks and optimize to dissipated power and area at the same time. There are a number of different implementations of array processors commercially available. The CSX600 accelerator chip from Clearspeed Inc. [1] contains two main processor elements, the Mono and the Poly execution units. The Mono execution unit is a conventional RISC processor responsible for program flow control and thread switching. The Poly execution unit is a 1-D array of 96 execution units, which work on a SIMD fashion. Each execution unit contains a 64bit floating point unit, integer ALU, 16bit MAC (Multiply Accumulate) unit, an I/O unit, a small register file and local SRAM memory. Although the architecture runs only on 250MHz clock frequency the computing performance of the array may reach 25GFlops. The Mathstar FPOA (Field Programmable Object Array) architecture [2] contains different types of 16bit execution units, called Silicon Objects, which are arranged on a 2-D grid. The connection between the Silicon Objects is provided by a programmable routing architecture. The three main object types are the 16bit integer ALU, 16bit MAC and 64 word register file. Additionally, the architecture contains 19Kb on-chip SRAM memories. The Silicon objects work independently on a MIMD (Multiple Instruction Multiple Data) fashion. FPOA designs are created in a graphical design environment or by using MathStars Silicon Object Assembly Language. The Tilera Tile64 architecture [3] is a regular array of general purpose processors, called Tile Processors, arranged on an 8×8 grid. Each Tile Processor is 3-way VLIW (Very Long Instruction Word) architecture and has a local L1, L2 cache and a switch for the on-chip network. The L2 cache is visible for all processors forming a large coherent shared L3 cache. The clock frequency of the architecture is in the 600-900MHz range providing 192GOps peak computing power. The processors work with 32bit data words but floating point support is not described in the datasheets.
In this work we have concentrated on topographic IBM Cell heterogeneous array processor architecture mainly because its development system is open source. It is exploited here in solving complex, time consuming problems.
II. CELL PROCESSOR ARCHITECTURE

A. Cell Processor Chip
The Cell Broadband Engine Architecture (CBEA) [4] is designed to achieve high computing performance with better area/performance and power/performance ratios than the conventional multi-core architectures. The CBEA defines a heterogeneous multi-processor architecture where general purpose processors called Power Processor Elements (PPE) Figure 1 . The PPE is a conventional dual-threaded 64bit PowerPC processor which can run existing operating systems without modification and can control the operation of the SPEs. To simplify processor design and achieve higher clock speed instruction reordering is not supported by the PPE. The EIB is not a bus as suggested by its name but a ring network which contains 4 unidirectional rings where two rings run counter to the direction of the other two. The dual-channel Rambus XDR memory interface provides very high 25.6GB/s memory bandwidth. I/O devices can be accessed via two Rambus FlexIO interfaces where one of them (the Broadband Interface (BIF)) is coherent and makes it possible to connect two Cell processors directly.
The SPEs are SIMD only processors which are designed to handle streaming data. Therefore they do not perform well in general purpose applications and cannot run operating systems. Block diagram of the SPE is shown in Figure 2 .
The SPE has two execution pipelines: the even pipeline is used to execute floating point and integer instructions while the odd pipeline is responsible for the execution of branch, memory and permute instructions. Instructions for the even and odd pipeline can be issued in parallel. Similarly to the PPE the SPEs are also in-order processors. Data for the instructions are provided by the very large 128 element register file where each register is 16byte wide. Therefore SIMD instructions of the SPE work on 16byte-wide vectors, for example, four single precision floating point numbers or eight 16bit integers. The register file has 6 read and 2 write ports to provide data for the two pipelines. The SPEs can only address their local 256KB SRAM memory but they can access the main memory of the system by DMA instructions. The Local Store is 128byte wide for the DMA and instruction fetch unit, while the Memory unit can address data on 16byte boundaries by using a buffer register. 16byte data words arriving from the EIB are collected by the DMA engine and written to the memory in one cycle. The DMA engines can handle up to 16 concurrent DMA operations where the size of each DMA operation can be 16KB. The DMA engine is part of the globally coherent memory address space but we must note that the local store of the SPE is not coherent.
ATA
Rambus
B. Cell Blade Systems
Cell blade systems are built up from two Cell processor chips interconnected with a broadband interface. They offer extreme performance to accelerate compute-intensive tasks. The IBM Blade Center QS20 (see Figure 3) is equipped with two Cell processor chips, Gigabit Ethernet, and 4x InfiniBand I/O capability. Its computing power is 400GFLOPS peak. Further technical details are as follows: The second generation blade system is the IBM Blade Center QS21 providing extraordinary computing density up to 6.4 TFLOPS in a single Blade Center house.
III. OCEAN MODEL AND ITS IMPLEMENTATION
Several studies proved the effectiveness of the CNN-UM solution of different PDEs [5] , [6] . But the results cannot be used in real life implementations because of the limitations of the analog CNN-UM chips such as low precision, temperature sensitivity or the application of non-linear templates. Some previous results show that emulated digital architectures can be very efficiently used in the computation of the CNN dynamics [7] , [8] and in the solution of PDEs [9] , [10] , [11] . Using the CNN simulation kernel described in [8] helped to solve Navier-Stokes PDE on the Cell architecture. The details will be presented here.
Simulation of compressible and incompressible fluids is one of the most exciting areas of the solution of PDEs because these equations appear in many important applications in aerodynamics, meteorology, and oceanography. Modeling ocean currents plays a very important role both in mediumterm weather forecasting and global climate simulations. In general, ocean models describe the response of the variable density ocean to atmospheric momentum and heat forcing. In the simplest barotropic ocean model a region of the oceans water column is vertically integrated to obtain one value for the vertically different horizontal currents. The more accurate models use several horizontal layers to describe the motion in the deeper regions of the ocean. Such a model is the Princeton Ocean Model (POM) [12] , [13] being a sigma coordinate model in which the vertical coordinate is scaled on the water column depth. (1)- (7) is based on the same techniques available in [13] . The discretization in space is done according to the Arakawa-C differencing scheme where the variables are located on a staggered mesh. The mass transports U and V are located at the center of the box boundaries facing the x and y directions, respectively. All other parameters are located at the center of mesh boxes. The horizontal grid uses curvilinear orthogonal coordinates. The equations, governing the dynamics of coastal circulation, contain fast moving external gravity waves and slow moving internal gravity waves. It is desirable in terms of computer economy to separate the vertically integrated equations (external mode) from the vertical structure equations (internal mode). This technique, known as mode splitting permits the calculation of the free surface elevation with little sacrifice in computational time by solving the velocity transport separately from the three-dimensional calculation of the velocity and the thermodynamic properties.
The external mode calculation is responsible for computing surface elevation and the vertically averaged velocities. The internal mode computes horizontal and vertical velocities (U, V ), temperature (T ) and salinity (S). During the calculation the former uses short time step, whereas the latter uses longer time step. Using this method many external steps are evaluated for every long internal time step. These results are used for the internal mode computation.
The velocity external mode equations are obtained by integrating the internal mode equations over the depth, thereby eliminating all vertical structure. Thus, by integrating Equations (1)-(3) from σ = −1 to σ = 0 the following equations can be obtained:
The overbars denote vertically integrated velocities such as
Udσ. The wind stress components are -< wu(0) > and -< wv(0) >, and the bottom stress components are -< wu(−1) > and -< wv(−1) >. U , V are the horizontal velocities, f is the Coriolis parameter, g is gravitational acceleration, ρ 0 and ρ are the reference and in situ density, respectively.
IV. IMPLEMENTATION ON CELL PROCESSOR ARRAY
Solution of equations (1)- (7) on a CNN-UM architecture requires finite difference approximation on a uniform square grid. The spatial derivatives can be approximated by the following well known finite difference schemes and CNN templates e.g.:
The resulting CNN templates are usually sparse and symmetrical which provides a good starting point for further optimization. Therefore a new C based solution is developed which is optimized for the SPEs of the Cell architecture.
The large (128-entry) register file of the SPE makes it possible to store the neighborhood of the currently processed cell during the solution of the governing equations. The number of load instructions can be decreased significantly.
Since the SPEs cannot address the global memory directly, the users application running on the SPE is responsible to carry out data transfer between the local memory of the SPE and the global memory via DMA transactions. Since the relatively small local memory of the SPEs does not allow to store all the required data, an efficient buffering method is required. In our solution a belt of 5 rows is stored in the local memory from the array: 3 rows are required to form the local neighborhood of the currently processed row, one line is required for data s1 s2 s3 s4 s5 s6 s7 s8 s4 s5 s6 s7 s1 s2 s3 s4 s5 s6 s7 s8 It seems obvious to pack 4 neighboring cells into one vector {s5, s6, s7, s8}. However, constructing the vector which contains the left {s4, s5, s6, s7} and right {s6, s7, s8, s9} neighbors of the cells is somewhat complicated because 2 "rotate" and 1 "select" instructions are needed to generate the required vector (see Figure 4 ) . This limits the utilization of the floating-point pipeline because 3 integer instructions (rotate and select) must be carried out to generate the left and right neighborhood of the cell, before a floating point instruction can be issued.
This limitation can be removed by slicing the cell array into 4 vertical stripes and rearranging the cell values. In the above case, the 4-element vector contains data from the 4 different slices as shown in Figure 5 . This makes it possible to eliminate the shift and shuffle operations to create the neighborhood of the cells in the vector. The rearrangement should be carried out only once, at the beginning of the computation and can be carried out by the PPE. Though, this solution improves the performance of the simulation data, data dependency between the floating-point instructions may still cause pipeline stalls. In order to eliminate this dependency the inner loop of the computation must be rolled out. Instead of waiting for the result of the first floating-point instruction, the computation of the next group of cells is started. The level of unrolling is limited by the size of the register file.
To achieve even faster computation multiple SPEs can be used. The data can be partitioned between the SPEs by horizontally striping the cell array. 
V. PERFORMANCE COMPARISONS
For testing and performance evaluation purposes a simple initial setup was used which is included in the source code of the POM. This solves the problem of the flow through a channel which includes an island or a seamount at the center of the domain. The size of the modeled ocean is 1024km, the north and south boundaries are closed, the east and west boundaries are open, the grid size is 128×128×32 and the horizontal grid resolution is 8km. The simulation was run using 6s internal timestep and 180s external timestep, the simulated time interval was 72 hours. Experimental results were obtained on a Sony PlayStation 3 console and average runtimes are summarized in Table I .
The achievable performance of the Cell using different number of SPEs is compared to the performance of the Intel Core 2 Duo T7200 2GHz scalar processor. Comparison of the required computation time of one iteration in external (2D) and internal (3D) mode show that the external mode computations can be carried out 126 times faster. The result is significant saving on computation time. Performance of the six-SPE solution is compared to the performance of a high performance microprocessor. The external mode calculations on the Cell processor are 79-time faster than on the Core 2 Duo microprocessor, while in the internal mode 86-time speedup can be achived. During a 72 hours simulation using both internal and external mode calculations 85-time speedup was measured.
VI. CONCLUSION
Complex spatio-temporal dynamical problems are analyzed by a topographic array processor. The CNN paradigm was used to solve the 3-D Princeton Ocean Model and significant performance improvement was achieved. Our solution was optimized according to the special requirements of the Cell architecture. Performance comparison showed that about 17-time speedup can be achieved with respect to a high performance microprocessor in the single SPE solution, while the speedup is 85-time higher when all the 6 SPEs are utilized. Further speedup might be achieved by using the full power of the Cell architecture on Cell processor based IBM blades.
