Abstract. The memory wall [15] may be overcome by increasing the bandwidth and reducing the latency of the processor-to-memory connection, for example by implementing cellular architectures such as the IBM Cyclops. Such massively parallel architectures have sophisticated memory models. In this paper we used DIMES (the Delaware Iterative Multiprocessor Emulation System), developed by CAPSL at the University of Delaware, as a hardware evaluation tool for cellular architectures. The authors contend that the ideal approach to parallelism from the programmer's perspective remains an open question: it could be at the language level, such as UPC or HPF, within the compiler using trace-scheduling, or at the library level, for example OpenMP or POSIX-threads. To investigate this, we chose a threaded Mandelbrot-set generator with a work-stealing algorithm to evaluate the DIMES cthread programming model for writing a simple multi-threaded program.
Introduction
Integrating the processing logic and memory [2], termed PIM, is an approach to overcoming the memory wall [15]. PIM architectures may improve both data-processing and data-access times, but the combined processor speed and the amount of memory may be reduced [2]. This may be overcome by connecting multiple, independent PIM cells, giving a cellular architecture. In this organisation, every thread unit is an independent single-issue, in-order processor, and is thus potentially able to access memory independently. Moreover, the different memory hierarchies may have different access timings and consistency models, such as location consistency [7]. This gives rise to a number of code-generation problems, centred around the fact that, to provide computational power, these systems are not only massively parallel but also have complex memory hierarchies.
Research also proceeded towards thread-generating compilers, for example, HPF and UPC [9], IBM XL Fortran and Visual Age C/C++, largely based upon OpenMP, all of which have their compromises. Some of these also have support for the various memory models.
Unfortunately, general-purpose languages have been slow to adopt a sophisticated abstraction of the machine model, so library-based approaches have developed, for example the various implementations of OpenMP. But the authors contend that library-based solutions to threading depend too heavily upon the programmer to be used effectively. For example, the explicit use of locks in programs is prone to error: deadlocks and race-conditions that are hard to track down are easily introduced, even on systems with only a few processors. The development of suitable tools to debug multi-threaded applications has also been slow. Debuggers are in development, for example for Cyclops [8], but there have been too few, with limited functionality.
As identifying parallelism both correctly and efficiently is very hard for programmers, the authors contend that they should not have to do it. The compiler, equipped via these libraries with a detailed machine-model, could use the programmer-identified parallelisable variables and functions to generate more efficient code. The authors identified little work investigating the software aspect of the code-generation problem for massively-parallel architectures. Unfortunately, if this continues, the shortcoming could adversely affect the popularity of such systems and maintain the perception that massively parallel architectures are too specialised, and thus too expensive, to be of more general use. Given the growing popularity of multi-core processors, this position is set to become even more untenable.
Related Work

The Programming Models: From Compiler to Libraries
Such compute bandwidth and parallelism raise a number of problems for the programmer, primarily concerning memory reads and writes. Super-scalar chips have had mechanisms to hide these problems from the programmer, but the cellular architectures of such chips as picoChip [6] and IBM BlueGene/C [1] do not. Thus the programmer needs to know how memory reads and writes interact with:
-the software-controlled data-cache attached to that pipeline,
-the software-controlled data-cache of other on-chip pipelines,
-any global on-chip memory,
-the software-controlled data-caches of other off-chip pipelines,
-the global on-chip memory that is on any other chips,
-any global memory that is not on any chip,
-and finally, given the massive parallelism available, how to make efficient use of it.
The programmer must either understand the memory access models or have a library or compiler that hides their details. In the remainder of the paper the authors will focus on the IBM BlueGene/C architecture, and a prototype implementation of it called Cyclops [2, 4], that was implemented at CAPSL at the University of Delaware in collaboration with the University of Hertfordshire. The Cyclops architecture was prototyped in hardware, called DIMES/P [14], which was used as the platform for executing the programming example described later in this paper. In the following sections the memory access models will be discussed, leading on to a presentation of the authors' experience in developing a program for such an architecture. The experience gained from this will allow the authors to discuss the major problems that were faced, how, if at all, they were overcome, and the outstanding problem domains that, in the authors' experience, would hinder the acceptance of multi-core chips and, moreover, of such massively parallel designs as IBM BlueGene/C.
Programming Models on Cellular Architectures
The hardware differences between cellular and super-scalar architectures indicate that programming models different from those used for super-scalar architectures are required to make effective use of cellular architectures [7, 8]. In the first two of those three papers, their authors propose the use of a combination of execution models and memory models, as already noted in this paper.
The primary concerns when programming DIMES/P, and thus any Cyclops-based architecture, were:
-How to manage the potentially large numbers of threads.
-How to easily express any parallelism within the input source-code.
-How to make correct, and the most effective, use of the memory consistency models.
Some research has already been done regarding programming models for the threading, such as using thread percolation as a technique to perform dynamic load-balancing [10]. Another piece of research [3] investigated using multilevel scheduling-schemes: a work-stealing algorithm at the higher level and a multi-threading technique at the lower level to hide communication latencies. Alternatively, there is research [13] into how to implement OpenMP efficiently on cellular architectures such as IBM BlueGene/C.
Programming for Cyclops - cthreads
This section will very briefly describe the cthread programming model, which is an early version of TNT [5, 8], then how it was used to implement the programming example, followed by a discussion of the implementation. The implementation of the memory consistency models was relatively simple. Earlier, unpublished work on the GCC-based compiler had implemented a simple algorithm: all static variables were stored in on-chip memory, and the function call stack, including all automatic variables, was placed in the scratchpad memory.
As there was no language-level support for thread management, a library had to be implemented to support the thread management instructions in the Cyclops ISA, and this was used as the basis for creating a higher-level C++ abstraction. The abstraction was needed because the cthread implementation, which closely followed a POSIX-Threads API, was considered by the authors to be far too primitive to be used effectively for programming Cyclops. This C++ API also included critical-section, mutex and event objects to allow for easier management of the lower-level objects.
To test these ideas, and the Cyclops architecture, a small, simple and embarrassingly parallel program to generate Mandelbrot sets [12] was created. The following sections give a brief overview of how this program may be implemented for DIMES/P.
Threading and the Mandelbrot Set
Due to the properties of DIMES/P, alternative techniques were not possible, as there are only 8 thread units between two processors. In this implementation, the complex plane was divided into a series of horizontal strips. Those strips may be calculated independently of each other, using separate threads, implemented as algorithm 1. 
7. Signal work completed, set t = 0 (thus this thread is guaranteed not to be selected by the work-stealing algorithm 2).
8. Suspend.
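The strip decomposition described above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reconstruction of algorithm 1: the names `classify`, `strip_bounds` and `render_strips`, the iteration limit and the rendered region are all hypothetical, Python threads stand in for Cyclops thread units, and the completion signal and the estimate t are not modelled.

```python
import threading

def classify(c, max_iter=50):
    """Escape-time count: the iteration at which |z| exceeds 2, else max_iter."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def strip_bounds(height, n_strips):
    """Split rows 0..height-1 into n_strips contiguous (start, stop) ranges."""
    base, extra = divmod(height, n_strips)
    bounds, start = [], 0
    for s in range(n_strips):
        stop = start + base + (1 if s < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def render_strips(width, height, n_threads, max_iter=50):
    """Render [-2, 1] x [-1.5, 1.5] (an assumed region), one strip per thread.

    Assumes width > 1 and height > 1.
    """
    image = [[0] * width for _ in range(height)]

    def render(start, stop):
        # Each thread writes only to its own rows, so no locking is needed.
        for y in range(start, stop):
            im = -1.5 + 3.0 * y / (height - 1)
            for x in range(width):
                re = -2.0 + 3.0 * x / (width - 1)
                image[y][x] = classify(complex(re, im), max_iter)

    threads = [threading.Thread(target=render, args=bounds)
               for bounds in strip_bounds(height, n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image
```

Because the strips cover disjoint rows, the result does not depend on how many threads are used, only the time taken does.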
However, each strip will, in general, take a different amount of time to complete, so the threads would complete their assigned portions of work at different times. Thus a work-stealing algorithm, algorithm 2, performed the load-balancing between the threads.
The bandwidth of the work-stealing thread, algorithm 2, limited scaling to more worker threads, algorithm 1. But algorithm 2 would be able to tolerate failures: if a worker thread stopped responding, its work would eventually have been stolen.
Algorithm 2. The work-stealing algorithm.
1. Monitor render threads for a work-completed signal. That thread that completes we shall denote as Tc.
2. Find that render thread with the longest estimated completion-time, t, note that each render thread updates this time upon completion of a line. Call this thread Tl.
3. Stop Tl when it completes the current line it is rendering.
4. Split the remaining work to be done in the strip equally between the two render threads Tc and Tl.
5. Restart the render threads Tc and Tl.
6. Go to 1.
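Steps 2 to 4 of algorithm 2 could be sketched as the following Python function. The `RenderThread` record, its field names, the rule that the busy thread keeps the first half of the remaining lines, and the proportional update of t are assumptions made for illustration. The signalling, stopping and restarting of threads (steps 1, 3 and 5) are not modelled.

```python
from dataclasses import dataclass

@dataclass
class RenderThread:
    name: str
    next_line: int   # first line not yet rendered
    end_line: int    # one past the last line of the assigned strip
    t: float = 0.0   # estimated completion time; 0 means the thread has finished

def steal_work(threads, completed):
    """Give `completed` half of the remaining lines of the slowest thread.

    Returns the thread that was stolen from, or None if nothing could be split.
    """
    # Threads with t == 0 have signalled completion and are never selected.
    candidates = [th for th in threads if th is not completed and th.t > 0]
    if not candidates:
        return None

    # Step 2: the thread with the longest estimated completion time.
    victim = max(candidates, key=lambda th: th.t)
    remaining = victim.end_line - victim.next_line
    if remaining < 2:
        return None

    # Step 4: split the remaining lines; the victim keeps the extra line
    # when the count is odd (an assumed tie-breaking rule).
    mid = victim.next_line + (remaining + 1) // 2
    completed.next_line, completed.end_line = mid, victim.end_line
    victim.end_line = mid

    # Assumed estimate update: divide t in proportion to the lines held.
    per_line = victim.t / remaining
    completed.t = per_line * (completed.end_line - completed.next_line)
    victim.t = per_line * (victim.end_line - victim.next_line)
    return victim
```

Keeping the split logic in a function that knows nothing about Mandelbrot sets is one way the algorithm might be separated from the program that uses it.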
If robustness is not required, then the image generated may be viewed as an array of values. Each of these values would be the classification of a point c. Thus if one has q threads, p_0 ... p_(q-1), each thread p_n initially classifies the point in the array at offset n and, once completed, moves along the array using a stride of q. This would allow the use of a number of threads that is bounded by the number of points within the image.
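The strided scheme could look as follows in Python. This is a sketch under assumptions: the names `classify_point` and `render_strided` are hypothetical, the threads are numbered 0 to q-1, and the classification is a plain escape-time count with an arbitrary iteration limit.

```python
import threading

def classify_point(c, max_iter=50):
    """Escape-time count: the iteration at which |z| exceeds 2, else max_iter."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

def render_strided(points, q, max_iter=50):
    """Classify every point in `points` using q threads and a stride of q."""
    result = [None] * len(points)

    def worker(n):
        # Thread n handles indices n, n + q, n + 2q, ... so no two threads
        # ever write to the same element.
        for i in range(n, len(points), q):
            result[i] = classify_point(points[i], max_iter)

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(q)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

With q equal to the number of points, each thread classifies exactly one point, which is the upper bound on useful threads noted above.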
DIMES/P Implementation of the Mandelbrot-Set Application
In cthreads, each software thread was statically allocated to one of the 8 hardware thread-units in DIMES/P at program start-up. The software threads were:
1. A thread required for cthreads support and the debugger [8], if it were to be run.
2. The main loop of the Mandelbrot-set application.
3. The thread that executed the work-stealing algorithm 2. In principle, a worker thread could also run on this thread unit, but cthreads did not support virtual threads.
4. The remaining 5 threads were worker threads that executed algorithm 1.
Further details regarding the implementation may be found in [11] .
Discussion
The limitations of DIMES/P prevented further study of the properties of this program: scalability and timing studies were not carried out because of the limited number of thread units (8) and the limited memory capacity.
The compiler's memory model support, based on the C/C++ keyword static, made natural use of language-level syntax to map data into scratch-pad and on-chip memory, and thus to use these different memory models. The atomic, word-sized memory operations on Cyclops were not used for this problem, because multiple read-modify-write operations had to be maintained as an atomic unit. If the manual locking had been implemented within the compiler, then it might have been possible for the compiler to optimise the locking of access to the data.
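To illustrate why a single word-sized atomic operation does not suffice when several fields must change together, the following Python sketch guards two related fields with one lock. The `StripState` class and its fields are hypothetical, and Python's `threading.Lock` stands in for the mutex objects of the C++ API; the sketch does not reproduce the cthreads code.

```python
import threading

class StripState:
    """Two fields that must always be read and updated as one unit."""

    def __init__(self, next_line, end_line):
        self._lock = threading.Lock()
        self.next_line = next_line   # first line not yet claimed
        self.end_line = end_line     # one past the last line of the strip

    def take_line(self):
        """Claim the next line, or return None when the strip is finished."""
        with self._lock:
            if self.next_line >= self.end_line:
                return None
            line = self.next_line
            self.next_line += 1
            return line

    def split(self):
        """Give away the second half of the remaining lines as (start, stop)."""
        with self._lock:
            remaining = self.end_line - self.next_line
            if remaining < 2:
                return None
            mid = self.next_line + (remaining + 1) // 2
            stolen = (mid, self.end_line)
            self.end_line = mid
            return stolen
```

Both methods compare and modify more than one word, so making each individual access atomic would still allow a line to be rendered twice or skipped. This is the kind of manual locking that compiler support might generate and optimise.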
With regard to the thread library: in the opinion of the authors, the complexity of POSIX-Threads has been a hindrance to successful multi-threaded program creation. Abstracting the algorithms that expressed the parallelism within the Mandelbrot program, for example the work-stealing algorithm, was not attempted for this paper, as this was considered to be potentially too closely coupled to the actual program in question. Ultimately this decision, in the authors' opinion, was flawed: extracting and abstracting the work-stealing algorithm from both the program and Cyclops would have allowed a programmer to reuse that algorithm with other programs, thus separating the design of the parallelism from the details of the program that uses it.
The ideal approach to parallelism is still an open question: it could be language-level support such as UPC, HPF or other language extensions, it could be within the compiler using trace-scheduling, it could be at the library level using, for example, OpenMP or POSIX-Threads, or it could be within the architecture, such as the data-flow design. If programs more sophisticated than the one described in this paper are to be successfully written for these cellular architectures, then based upon this brief examination, it is the authors' contention that it would be highly advantageous to have:
-Compiler support for making use of any available memory model of the architecture.
-Compiler support for locking, which would aid the programmer with writing code that avoids race-conditions.
-Reusable abstractions of techniques for implementing parallelism, such as work-stealing or master-slave models. These abstractions could make use of both data and code locality to ensure that a thread unit re-executes the same code, if desirable.
