Cellular automata, represented by a discrete set of elements, are ideal candidates for parallelisation, particularly on graphics cards using GPGPU technology. This paper shows that speed-ups of 50 times over a CPU are possible, but that the hardware is only partially responsible: the memory model is vital to exploiting the additional computational power of the GPU.
INTRODUCTION
Modern hardware is becoming increasingly parallel: off-the-shelf CPUs are now equipped with 2 to 8 processing cores, and modern graphics processing units (GPUs) have thousands of cores. Through the recent introduction of General-Purpose Graphics Processing Units (GPGPUs), programmers may take advantage of the expanding hardware parallelism of GPUs for general computation. The latest Nvidia GPGPUs have up to 3072 processing cores [1], and as such demonstrate even greater parallelism than their CPU counterparts. The multi-core nature of most modern CPUs is allowing Moore's law to continue to hold true despite stagnation in individual core speed [2]. Several sources suggest that co-processors like GPGPUs, with their inherently Massively Multi-Core (MMC) nature, are increasing in performance faster than their CPU counterparts [3].
CELLULAR AUTOMATA
Cellular Automata (CA) were first conceived by John von Neumann in the 1940s [4] as a model of the universe, following a set of natural laws. An automaton consists of a lattice (grid) of cells, commonly arranged in a regular 2D grid, with each cell being in one of a finite number of states. Each cell is updated using state transition rules (a rule set) that take the current state of that cell and of its adjacent neighbours, in a particular pattern, to determine the state of that cell in the next iteration of the algorithm. All cells are updated simultaneously, in an idealised parallel sense, to form a new lattice of cell states, and the process is repeated for a number of iterations.
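The update described above can be sketched serially in C. This is a minimal illustration, not the implementation used in this paper: the names `ca_step`, `ca_rule_fn`, `wrap_index` and `parity_rule` are hypothetical, the periodic (wrap-around) border is an assumed policy, and the parity rule is only a placeholder rule set. The important point is that each generation reads only from the current lattice and writes only to a separate next lattice, which is what makes every cell update independent and therefore parallelisable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule signature: the cell's own state plus its eight
 * Moore neighbours, returning the cell's state in the next generation. */
typedef uint8_t (*ca_rule_fn)(uint8_t self, const uint8_t neighbours[8]);

/* Wrap a coordinate onto a periodic lattice (assumed border policy). */
static size_t wrap_index(long i, size_t n) {
    long m = (long)n;
    return (size_t)(((i % m) + m) % m);
}

/* Placeholder two-state rule: a cell becomes live when an odd number
 * of its neighbours are live. Used only to exercise ca_step. */
uint8_t parity_rule(uint8_t self, const uint8_t neighbours[8]) {
    (void)self; /* this particular rule ignores the cell's own state */
    int sum = 0;
    for (int k = 0; k < 8; ++k) {
        sum += neighbours[k];
    }
    return (uint8_t)(sum % 2);
}

/* One synchronous generation. Reads come only from `cur` and writes go
 * only to `next`, so no cell ever sees a partially updated lattice. */
void ca_step(const uint8_t *cur, uint8_t *next,
             size_t width, size_t height, ca_rule_fn rule) {
    static const int dx[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    static const int dy[8] = {-1, -1, -1, 0, 0, 1, 1, 1};

    assert(cur != next); /* double buffering: the lattices must be distinct */

    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            uint8_t nb[8];
            for (int k = 0; k < 8; ++k) {
                size_t nx = wrap_index((long)x + dx[k], width);
                size_t ny = wrap_index((long)y + dy[k], height);
                nb[k] = cur[ny * width + nx];
            }
            next[y * width + x] = rule(cur[y * width + x], nb);
        }
    }
}
```

On parallel hardware, the body of the two outer loops is the natural unit of work for one thread, and the two lattices are swapped between generations.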
RELATED WORKS
There appear to be few attempts in the literature to develop parallel CA algorithms and to investigate exactly how they interact with MMC technologies. A number of implementations have been presented; however, few of these investigate the spread of possible speed-ups, or how varying the CA's base parameters (e.g. lattice size, number of generations, number of states/data types, neighbourhood size, rule complexity) affects these speed-ups. A representative approach to the use of CPU and GPU computing to speed up CA execution is described in [5]; indeed, the literature presents a wide variety of speed-up factors, with little explanation of the features that affect them other than the level of hardware parallelism.
METHOD
Tests are performed using the well-studied 'Game of Life' rule set developed by John Conway in 1970 [4], which has 2 states and a Moore neighbourhood. Since memory look-up is relatively expensive on the GPGPU, the rule is implemented as a programmatic function (a series of C if-statements) that forms a decision tree, rather than as a look-up table. The GPGPU presents three memory types: 'global memory', which is essentially the RAM on the GPU card; 'local memory', which is on-chip and therefore much faster, but limited in space and in scope to a single compute core; and 'image memory' (sometimes referred to as texture memory), which is specific to the native task of the GPGPU as a graphics processor, is cache-lined even in older models, and has special hardware for dealing with border conditions. As image data commonly has four channels (Red, Green, Blue, Alpha), the GPGPU hardware is specially set up to handle and transfer vectors of values, most commonly of length four. This is used to create a novel hybrid of texture memory and vectorisation, in which the lattice is folded into four quarters, much as one would imagine folding a square of paper. A combination of the texture memory's ability to deal with border conditions and swizzling operations (which re-order vectors efficiently in hardware) is used to make near-optimal use of the hardware features present in the GPU. In this way the wider memory lanes of the GPGPU allow four values to be transferred in parallel, reducing memory access time and also reducing the number of threads required. Finally, tests are also performed with vectorisation (folding) and global memory. The new open-standard language/API OpenCL is used, which is designed specifically for parallel hardware such as the GPGPU, as well as multi-core CPUs. For detailed information the reader is referred to the OpenCL specification [8].
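Two of the ideas above can be sketched in plain C. The sketch is illustrative only and is not the OpenCL kernel used in the experiments. `life_rule` shows the standard Game of Life transition written as branches instead of a table look-up. `fold_lattice` and `unfold_lattice` show one possible reading of the folding: it is an assumption that the quarters are mirrored onto the top-left quarter, as in the paper-folding analogy, since the exact layout is not specified here. All function names and the `cell4` struct, a stand-in for a four-component OpenCL vector such as `uchar4`, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Game of Life transition as a small decision tree (no look-up table).
 * `alive` is the current state, `live_neighbours` the Moore-neighbour count. */
uint8_t life_rule(uint8_t alive, int live_neighbours) {
    if (live_neighbours == 3) {
        return 1;       /* birth, or survival with three neighbours */
    }
    if (live_neighbours == 2) {
        return alive;   /* survival only: a dead cell stays dead */
    }
    return 0;           /* under- or over-population */
}

/* Stand-in for a four-component vector: one folded cell carries the
 * states of four lattice cells, one from each quarter. */
typedef struct {
    uint8_t s[4];
} cell4;

/* Fold an n x n lattice (n even) onto its top-left quarter.
 * ASSUMPTION: quarters are mirrored, like folding a sheet of paper. */
void fold_lattice(const uint8_t *grid, cell4 *folded, size_t n) {
    assert(n % 2 == 0);
    size_t h = n / 2;
    for (size_t y = 0; y < h; ++y) {
        for (size_t x = 0; x < h; ++x) {
            size_t mx = n - 1 - x;  /* mirrored column */
            size_t my = n - 1 - y;  /* mirrored row */
            cell4 *c = &folded[y * h + x];
            c->s[0] = grid[y * n + x];    /* top-left quarter */
            c->s[1] = grid[y * n + mx];   /* top-right quarter */
            c->s[2] = grid[my * n + x];   /* bottom-left quarter */
            c->s[3] = grid[my * n + mx];  /* bottom-right quarter */
        }
    }
}

/* Inverse of fold_lattice: scatter the four components back out. */
void unfold_lattice(const cell4 *folded, uint8_t *grid, size_t n) {
    assert(n % 2 == 0);
    size_t h = n / 2;
    for (size_t y = 0; y < h; ++y) {
        for (size_t x = 0; x < h; ++x) {
            size_t mx = n - 1 - x;
            size_t my = n - 1 - y;
            const cell4 *c = &folded[y * h + x];
            grid[y * n + x] = c->s[0];
            grid[y * n + mx] = c->s[1];
            grid[my * n + x] = c->s[2];
            grid[my * n + mx] = c->s[3];
        }
    }
}
```

Under this assumed layout, one thread per folded cell would update four lattice cells at once, so a quarter of the threads are needed and each memory transaction carries four states.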
EXPERIMENTAL SETUP

EXPERIMENTAL RESULTS
CONCLUSION
Firstly, it can be seen that a sufficiently large grid is required to overcome the overheads of parallelisation, which on the GPGPU include a small amount of compilation time and transfer across the known bottleneck of the PCI connection. Also visible are periodic reductions in performance, which are due to load balancing by the hardware between the number of available cores and the specific number and processing times of individual threads/cells. Figures 1 and 2 show a marked difference in performance between the two tested machines, due to the introduction of cache-lined global memory in the Fermi generation of GPGPUs (Machine B). On Machine A, local and image memory gain greater speed-ups than global memory alone, whereas on Machine B local and image memory are less efficient than global memory, owing to the more efficient caching and the need to explicitly copy data to the local and image memories. When vectorisation (folding) is applied, both global and image memory show an increase in performance. For Machine A, vectorised image/texture memory performs best, but for Machine B it is vectorised global memory that is the top performer. This is due to the more efficient cache-lined global memory of the Fermi chip in Machine B.
