Abstract. Parallel graph reduction is a model for parallel program execution in which shared-memory is used under a strict access regime with single assignment and blocking reads. We outline the design of an efficient and accurate multiprocessor simulation scheme and the results of a simulation study of the performance of a suite of benchmark programs operating under a cache coherency protocol that is representative of protocols used in commercial shared-memory machines and in more scalable distributed shared-memory systems. We analyse the influence of cache line size on performance and expose the relative contributions of spatial, temporal and processor locality and false sharing to overall performance.
Background
Parallel graph reduction (PGR) is a well-known technique which uses a shared graph structure to manage and synchronise the parallel execution of a functional program. It provides a simple model of communication and synchronisation between processors with distinctive properties of importance in the design and optimisation of multicache implementations of shared-memory. The performance of parallel programs under this regime depends critically on the provision of access to the shared heap with very high average performance.
The most successful implementations of PGR (e.g. Goldberg's Buckwheat system [8] and Augustsson and Johnsson's ⟨ν,G⟩-machine [3]) have used general-purpose shared-memory multiprocessors based on well-known bus-based snooping cache coherency protocols. For modest numbers of processors these systems implement a shared heap with near-ideal performance, but their size is limited by contention for the snooping bus. Shared-bus multiprocessors now have relatively high-latency communication (i.e. the latency of a memory reference that requires use of the bus is now many times greater than one that does not), and the number of processors that can be supported by the bus is very small. A solution is to use a more complex network, but such networks favour the transmission of larger messages, and require directories to store copy sets because broadcast is no longer available.
There are three important issues to be addressed:
1. How does cache line size affect the performance of parallel graph reduction operating with invalidation-based protocols?
2. How much of a performance impact does false sharing (i.e. contention for write access to a cache line) have?
3. How significant is the effect of invalidation traffic on performance, particularly when broadcast is not available?
This paper describes simulation work aimed at determining the performance of PGR operating with a multicache implementation of shared-memory, and builds on our earlier work in which performance under an ideal shared-memory was studied [5].
Parallel Graph Reduction
The essence of PGR is that in a function application f e1 e2 we may mark one or more of the parameters, say e1, for possible parallel execution.
At runtime a heap-based representation of how e1 is to be computed is built and added to a pool of available tasks for distribution to other processing elements (PEs); we say that the node is sparked. In the body of the function f where the parameter is used, code is emitted to check whether the parameter has yet been evaluated; we say that the node is demanded. If another PE has accepted the task, this PE will block; otherwise it will evaluate e1 itself.
Once a parameter has been evaluated, it is not changed. Without going into a great deal more detail, it should be clear that this mechanism requires shared-memory, but that it uses shared-memory in a disciplined fashion: each cell is marked with a tag indicating whether it contains an evaluated result, or if not whether it is being computed by some PE. A cell is not written to without first gaining exclusive access to the cell, and setting the tag appropriately.
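The spark and demand discipline described above can be sketched as follows. This is a minimal single-threaded illustration, not the implementation used in the experiments: the names `Tag`, `Cell`, `spark`, `claim`, `finish` and `demand` are hypothetical, and a real system would gain exclusive access to the cell atomically rather than with a plain test.

```python
from enum import Enum, auto


class Tag(Enum):
    UNEVALUATED = auto()  # suspended computation, not yet taken by any PE
    LOCKED = auto()       # some PE is currently computing the value
    EVALUATED = auto()    # holds a result; never changed again


class Cell:
    """A heap cell carrying a tag, as in the text (field names are assumptions)."""

    def __init__(self, thunk):
        self.tag = Tag.UNEVALUATED
        self.thunk = thunk
        self.value = None
        self.owner = None


def spark(cell, task_pool):
    """Advertise the cell as a task that another PE may pick up."""
    task_pool.append(cell)


def claim(cell, pe):
    """Try to gain exclusive access; the tag is set before any other write."""
    if cell.tag is not Tag.UNEVALUATED:
        return False
    cell.tag, cell.owner = Tag.LOCKED, pe
    return True


def finish(cell):
    """Run the suspended computation and store its result (single assignment)."""
    cell.value = cell.thunk()
    cell.thunk = None
    cell.tag = Tag.EVALUATED


def demand(cell, pe):
    """Return ("value", v), or ("blocked", owner) if another PE holds the cell."""
    if cell.tag is Tag.EVALUATED:
        return ("value", cell.value)
    if claim(cell, pe):
        finish(cell)  # nobody accepted the task, so evaluate it ourselves
        return ("value", cell.value)
    return ("blocked", cell.owner)
```

In this sketch a blocked demand simply reports which PE holds the cell; the blocking read of the real system would suspend the demanding process until the tag becomes EVALUATED.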
Cache Coherency Protocols
We simulate the Berkeley Ownership [12] protocol, similar to that used in commercial bus-based shared-memory machines such as the Sequent Symmetry.
The shared-memory region is divided into a number of cache lines of some constant size. Each line has an owner (either a PE or main memory); ownership changes dynamically according to coherency transactions. When a PE attempts to read a line which it does not have a copy of, a read request is broadcast, and the owner of the line responds by broadcasting a copy of the line. The requesting PE adds the line to its cache (expelling another line if the cache is full) and proceeds to use it. In this way, multiple copies of lines come into existence in the system. A write to a line that is cached locally but not remotely can take place without using the network.
The difficulty occurs when a write to a line that has multiple copies occurs: either the new data can be broadcast, enabling other caches to update their copies, or the remote copies can be invalidated before the write is allowed to proceed. The relative performance of update and invalidation protocols is determined by the data sharing behaviour and by the network characteristics: update protocols offer better performance when significant contention for shared data exists and the network has low latency [2], but we adopt an invalidation protocol since we are interested in higher latency networks. The requesting PE becomes the new owner of the line when the write has completed.
The same protocol can be used on non-broadcast networks provided directories are used to record which caches have copies of each line and a mechanism to locate the owner of lines is adopted; see for example Li's fixed distributed manager algorithm [14], a basic distributed shared-memory (DSM) protocol.
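The invalidation scheme described in this section can be illustrated with a small model that tracks an owner and a copy set per line. It is a simplified sketch under stated assumptions, not the Berkeley Ownership protocol in full: the class and counter names are hypothetical, caches are unbounded, and only the events named in the text (read misses, invalidations, ownership change) are modelled.

```python
class Line:
    """Coherency state of one cache line: its owner and its copy set."""

    def __init__(self):
        self.owner = "memory"  # main memory owns a line until a PE writes it
        self.copies = set()    # PEs currently holding a valid copy


class CoherentMemory:
    """Toy invalidation protocol over lines of `cells_per_line` heap cells."""

    def __init__(self, cells_per_line):
        self.cells_per_line = cells_per_line
        self.lines = {}
        self.read_misses = 0
        self.invalidations = 0

    def line_of(self, cell_addr):
        return self.lines.setdefault(cell_addr // self.cells_per_line, Line())

    def read(self, pe, cell_addr):
        line = self.line_of(cell_addr)
        if pe in line.copies:
            return "hit"
        # The owner supplies a copy; the requester adds it to its cache.
        self.read_misses += 1
        line.copies.add(pe)
        return "miss"

    def write(self, pe, cell_addr):
        line = self.line_of(cell_addr)
        remote = line.copies - {pe}
        if pe in line.copies and not remote:
            line.owner = pe
            return "local"  # cached locally but not remotely: no network use
        # Remote copies are invalidated before the write proceeds.
        self.invalidations += len(remote)
        line.copies = {pe}
        line.owner = pe     # the writer becomes the new owner
        return "network"
```

On a non-broadcast network the `copies` set plays the role of the directory entry for the line, which is why the number of entries it holds at the time of a write matters for directory size.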
False Sharing
Networks with large latencies tend to favour the transmission of relatively large messages, which in turn tend to favour the use of large line sizes. However, a large line size will cause false sharing [6], in which several data objects located on the same line are used independently by different PEs but any write will cause copies of the line to be invalidated. This effect is referred to as line stealing. The additional communication and serialisation is due to contention for ownership of the cache line.
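A small worked example may make the effect concrete. The sketch below counts invalidations for a reference trace under a given line size; the function name and the trace format `(pe, op, cell)` are assumptions made for illustration, and the model ignores everything except copy sets.

```python
def count_invalidations(trace, cells_per_line):
    """Count remote copies invalidated by the writes in `trace`.

    `trace` is a list of (pe, op, cell) tuples with op in {"read", "write"}.
    """
    copies = {}  # line index -> set of PEs holding a valid copy
    count = 0
    for pe, op, cell in trace:
        holders = copies.setdefault(cell // cells_per_line, set())
        if op == "write":
            count += len(holders - {pe})  # every remote copy is invalidated
            holders.clear()
        holders.add(pe)  # the accessing PE now holds a valid copy
    return count


# PE 0 only ever touches cell 0 and PE 1 only ever touches cell 1,
# so the two PEs share no data at all.
trace = [(0, "write", 0), (1, "write", 1)] * 4

no_sharing = count_invalidations(trace, cells_per_line=1)
false_sharing = count_invalidations(trace, cells_per_line=2)
```

With one cell per line the two PEs never interfere. With two cells per line both cells fall on line 0, and every write after the first steals the line from the other PE, even though no cell is actually shared.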
Experimental Design
Our objective was to produce repeatable and reusable results, and so at several points we have chosen a simplified approach rather than a more optimised implementation in which the issues might be obscured. Our simplifications are described in Sects. 4.2 and 4.3.
Source Language and Compiler
The source language is a lazy, higher-order functional language in the tradition of SASL, Miranda and Haskell. The primary objective in building an optimising compiler for a lazy functional language is to reduce the frequency at which claims and references are made to the heap. It is therefore of great importance that the compiler used in our experiments should perform reasonably well. Although comparing compilers is difficult, we have some confidence that the system is competitive with the state of the art [11]. It also, conveniently, generates C, making generated code very easy to instrument and modify.
Process Management
Parallel computation is coordinated via the heap where graph nodes are allocated. Work is allocated to processors via a work pool, which in the current system is a simple queue. The heap is accessed by the processors in order to determine what needs to be evaluated, and processes synchronise using a blocking read operation.
Garbage Collection
When storage allocated from the shared heap becomes free, it should be recycled for reuse. In a parallel system a parallel garbage collector is needed, and the area is the subject of intensive research. The behaviour of the garbage collector may interfere with normal program execution in two ways. Firstly, it may change the relative timing of processes, depending on when it is activated, and whether all processors collect in synchrony. Secondly, garbage collection may substantially change the pattern in which store is allocated.
We have made an important simplification here: we have no garbage collection at all. Instead, each processor is allocated a large contiguous segment of the shared address space, and it allocates space from it starting from the base.
The motivation for this decision is as follows: to introduce a garbage collector into the simulation we would have to choose one of the available algorithms, and our results would then be applicable only when a similar collector is in use. Unfortunately, there is no obvious candidate: there is no consensus on how garbage collection should be done in large shared-memory systems. However, any copying collector would finish by handing the application program a contiguous segment of free store. Thus, there is a significant period during which heap cell allocation proceeds in precisely the simple pattern we assume. We simply assume this is the case all the time.
This assumption illuminates an important point: our objective is to learn general lessons about a large class of systems. We are less concerned that the experiments predict the actual performance of some production system, as to do so we would have to introduce many factors which are orthogonal to the issues we intend to study.
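The allocation scheme assumed in this section amounts to a per-processor bump allocator. The sketch below is illustrative only: the class name, the segment size and the use of cell-granularity addresses are assumptions, and segment bases are taken to be aligned to the cache line size so that fresh lines can be recognised from the address alone.

```python
class SegmentAllocator:
    """Incremental allocation from one PE's contiguous heap segment."""

    def __init__(self, pe, segment_cells):
        self.base = pe * segment_cells        # segments are laid out end to end
        self.limit = self.base + segment_cells
        self.next_free = self.base

    def alloc(self):
        """Return the address (in cells) of a freshly allocated heap cell."""
        if self.next_free >= self.limit:
            # No garbage collection is modelled, so exhaustion is fatal here.
            raise MemoryError("heap segment exhausted")
        addr = self.next_free
        self.next_free += 1
        return addr


def starts_fresh_line(addr, cells_per_line):
    """True when an incrementally allocated cell is the first on its line.

    Assumes segment bases are multiples of `cells_per_line`.
    """
    return addr % cells_per_line == 0
```

Because allocation is strictly incremental, one allocation in every `cells_per_line` lands on a line that has never been used, and at a line size of one cell every allocation does.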
Example Functional Programs
The complete suite of programs we have been using is as follows:
pfib: compute the nth Fibonacci number
nqueens: compute a safe arrangement of queens on an n × n chess board
matmult: multiply two n × n matrices
quad: find the integral of a cubic function using adaptive quadrature
Less trivial example programs are available, but the programs in this suite are simple enough to offer the possibility that their behaviour might be understood, while covering a variety of parallel program structures. The programs were used in Goldberg's Alfalfa and Buckwheat experiments, and are fully described in his thesis [8]. Marking of expressions for parallel evaluation is introduced manually.
Trace-driven simulation has been employed in simulating shared-memory systems. However, there are problems with the validity of such studies since the relative timing of processes depends on the behaviour of the memory system. For this reason, we have chosen execution-driven simulation. It is slower than trace-driven simulation, since the object code is reexecuted for each experiment, but it eliminates the validity problems of trace-driven simulation since the simulated processors read the data as it is at the simulated time at which the reference is made.
For these experiments, the simulator models the state and copy set of each cache line in the shared-memory in order to determine the latency of each heap reference. Further information is also associated with each cell and cache line to enable the performance of the cache system to be closely monitored. The simulation scheme is more fully described in [4].
Architectural Model
We assume a 32-bit non-pipelined load-store architecture, with the following simplifications:
- The stack, private data and code regions of each process are served by a separate, perfect cache system; each read or write to these regions has a latency of one cycle
- The heap cell allocation method used is incremental; consequently the allocation of a cell on a cache line that has not been used before, although a write-miss, can be treated as a write-hit using a "direct write" instruction [9], since the old contents of the line will not be used
- We assume that infinite-sized, fully associative caches are used, so cache capacity effects do not occur. The extra overheads caused by using non-fully associative caches are unlikely to be significant [13]. Adopting infinite-sized caches allows the main memory unit to be removed entirely, so that all shared data resides in a cache
Although each assumption is invalid on a real machine, together they prevent the results from being obscured by other effects. In addition, these assumptions reduce the number of global events produced, improving simulation time.
The network model determines the latency of coherency transactions. For ease of comparison, a model very similar to those used in several related cache performance studies is used (for example [1]). Note that these latencies remain fixed when the cache line size is altered. We make no claim that the figures are accurate or realistic; essentially they apply an arbitrary weighting to network activity. In analysing our results we count the events themselves, thus avoiding this bias. Network contention is not directly modelled since we are primarily interested in a network in which contention effects are more complicated than a bus.
Many simulation runs were made of the benchmark set, using the simplified invalidation protocol and the architectural model described above. Simulation parameters were the number of PEs (1, 2, 4, 8, 16 and 32) and cache line size (1, 2, 4, 8, 16, 32, 64, 128 and 256 cells). Line sizes are specified in units of heap cells (a cell is 24 bytes in the current implementation).
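For orientation, the parameter space just described can be enumerated directly. The snippet is only a bookkeeping aid; the variable names are assumptions, and the values are those quoted in the text.

```python
from itertools import product

PE_COUNTS = [1, 2, 4, 8, 16, 32]
LINE_SIZES_CELLS = [1, 2, 4, 8, 16, 32, 64, 128, 256]
CELL_BYTES = 24  # size of one heap cell in the current implementation

# One simulation configuration per (number of PEs, line size) pair.
configs = list(product(PE_COUNTS, LINE_SIZES_CELLS))

# Line sizes converted from heap cells to bytes.
line_bytes = {cells: cells * CELL_BYTES for cells in LINE_SIZES_CELLS}
```

This gives 54 configurations per benchmark program, with line sizes ranging from 24 bytes (1 cell) to 6144 bytes (256 cells).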
Simulated execution times indicate that each program is highly parallel. Due to the architectural assumptions made, the simulated execution time of each program for a single processor is independent of cache line size. However, when more than a single PE is simulated, cache line size does have an influence, but the magnitude is dependent upon the details of the architectural model.
Spatial Locality and Line Stealing
It is apparent that line size does affect performance, but the contributions of spatial locality and line stealing are not clear. The simulator uses monitoring information to enable these effects to be measured separately. A classification of heap references is produced, allowing any gains from spatial locality and losses due to line stealing to be quantified. Monitoring information associated with each heap node records the elapsed simulated time when the cell was last written, a list of PEs that have accessed the cell since that time and the last PE to write to the cell. Information stored with each cache line describes the elapsed simulated time at which each PE last took a copy of that line.
Heap read references are classified as follows:
Simple-read: the PE has accessed an up-to-date copy of the cell before, and has a coherent copy of the associated cache line
Mandatory-read: the PE has never had an up-to-date copy of the cell in its cache, and communication must take place. Either the PE has not accessed the cell before, or the cell has been updated by a remote PE since it did
Gain-read: the PE has not accessed an up-to-date copy of the cell before, but has a coherent copy of the cache line, i.e. a gain from spatial locality
Loss-read: the PE had an up-to-date copy of the cell in its cache at some time in the past, but the line has been invalidated, i.e. a loss from line stealing caused by false sharing
The classification scheme for write references concentrates on relating line sharing to cell sharing, and the effect this has on whether an invalidation needs to be issued on a write, and is described in terms of "active sharing". An object (heap cell or cache line) is said to be actively shared if an up-to-date copy of it has been accessed by more than one PE. Therefore multiple copies of an actively shared cache line exist in the system.
Allocation-write: a write-miss caused by allocating a new cell on an as yet unused cache line. It can be treated as a write-hit using a direct write instruction
The graph of read references for nqueens shows some interesting trends. The solid line separates read references not requiring network use (simple-reads and gain-reads below) from those that do (mandatory-reads and loss-reads above):
- When the line size is 1 cell, clearly there is no opportunity to exploit spatial locality, but also false sharing cannot occur. Therefore all reads are simple-reads or mandatory-reads
- [...] after which it drops away rapidly. So it is clear that spatial locality is being exploited. However, losses due to false sharing outweigh gains when the line size exceeds 16 cells
The corresponding graph for writes shows that, at line size 1, more than 35% of writes cause fresh lines to be allocated. This demonstrates the importance of a "direct write" instruction which allows previously unused cache lines to be used immediately. The majority of writes are classified as simple-writes. Mandatory-write references are roughly constant across line sizes. The loss effect steadily increases until it is about 13%. The percentage of writes requiring use of the network (above the solid line) increases from the smallest line size.
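The four read classes defined earlier in this section can be derived from the monitoring information kept with each cell and line. The sketch below is one possible reading of those definitions rather than the simulator's actual code: the record types, field names and the explicit `copy_set` are assumptions, and a PE is taken to have held an up-to-date copy of a cell if it last wrote the cell or took a copy of the line no earlier than the cell's last write.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class CellInfo:
    last_written: float = 0.0                            # simulated time of last write
    readers_since_write: Set[int] = field(default_factory=set)
    last_writer: Optional[int] = None


@dataclass
class LineInfo:
    copy_time: Dict[int, float] = field(default_factory=dict)  # PE -> time of last copy
    copy_set: Set[int] = field(default_factory=set)            # PEs with a coherent copy


def classify_read(pe: int, cell: CellInfo, line: LineInfo) -> str:
    """Classify a heap read by `pe` before the read itself is recorded."""
    if pe in line.copy_set:
        # No network use is needed: the line copy is coherent.
        if pe in cell.readers_since_write or cell.last_writer == pe:
            return "simple-read"
        return "gain-read"       # cell arrived with the line: spatial locality
    took = line.copy_time.get(pe)
    held_current = cell.last_writer == pe or (
        took is not None and took >= cell.last_written
    )
    if held_current:
        return "loss-read"       # the line was stolen by a write to another cell
    return "mandatory-read"      # communication is unavoidable
```

For example, a PE whose copy of the line was taken after the cell was last written, but which no longer appears in the copy set, can only have lost the line to a write elsewhere on it, so its next read of the cell is counted as a loss-read.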
Read and write graphs for matmult, pfib and quad are broadly similar, although the magnitude of each region varies greatly. Pfib reads are almost entirely simple-reads for line sizes less than 64 cells, and so the potential gain is correspondingly small. The contention for lines when line size exceeds 32 cells for pfib can be clearly seen to be due to line stealing. For write references, the simple-write versus allocation-write variation is seen again, and mandatory-writes are also largely insensitive to line size. Mandatory-writes are particularly high because such a high percentage of cells are actively shared [5].
Heap read references for matmult are approximately 5.5% mandatory-reads when the line size is 1 cell, larger than for any other program, and therefore matmult offers the best opportunity to benefit from spatial locality. This is indeed achieved: reads not requiring the network peak at a line size of 16 cells, after which false sharing reverses the effect. Write references show similar trends to the other programs.
Although the programs show different behaviour with varying line size, significant similarities in the graphs are apparent. The graphs for write accesses are particularly similar in form: each shows that the vast majority of writes are simple-writes or allocation-writes (and therefore of low latency). For small line sizes, many fresh lines can be allocated cheaply using a direct write instruction. It might be expected that many writes to large lines would require invalidations, but the burstiness of writes is such that this is not a significant problem. Burstiness is particularly apparent when an unevaluated node is demanded, causing the node to be locked, several fields to be updated, and then unlocked. Although the first write of a burst may require an invalidation, there is a high probability that the line will not be accessed by other PEs during the rest of the burst, and therefore no extra invalidation traffic will be required. Consequently there is a high degree of processor locality (at least for relatively small line sizes). The solid line on each write graph separates low latency from high latency actions: in all cases this line falls with increasing line size, and so (for writes alone) the minimum line size is desirable.
The read graphs are also fairly similar in general form, but differences in the heap cell reference characteristics of the programs are more apparent here. In each case, the mandatory-reads at the minimum line size are virtually eliminated at large line sizes, and a steadily increasing proportion of reads are loss-reads due to false sharing. Mandatory-reads can never be eliminated since some represent the minimum communication that is required for parallel execution. Nqueens and matmult, which both use constructed data, have optimum line sizes of 8 and 16 cells respectively, whereas for pfib and quad it is 1, although little is lost by using a line size as large as 16 or 32 cells.
More Realistic Networks
The network model adopted is simple enough that the influence of varying line size on performance can be accurately measured and attributed to the simulation parameters of line size and number of PEs. A more realistic network model would differ in two significant ways: contention would increase latencies under heavy loads, and broadcast may not be available, requiring the protocol to be modified so that the copy set of each cache line is explicitly stored. Graphs showing the distribution of invalidations required for invalidating writes (not shown here) indicate that, under reasonable conditions, fewer than two invalidations are required, enabling small directories to be used.
Discussion and Related Work
There are many studies of the cache behaviour of parallel imperative programs in the literature, but few related to declarative languages.
A short study of the cache behaviour of parallel functional programs on shared bus architectures is given as an example of the use of the MiG execution-driven simulator in [15]. The parallel evaluation strategy is a modified form of graph reduction in which shared nodes are only ever read, greatly simplifying the coherency problem, and making direct comparisons with our work difficult.
A related study of caching effects in a sequential functional language implementation [13] found significant locality which was attributed to the misses caused by the incremental allocation strategy, the LIFO behaviour of the graph reducer stack and the evaluation of suspensions (in which multiple fields are read in a burst). Our stacks are not shared and we assume they are dealt with by a separate perfect cache, but this work does support the adoption of a "direct write" instruction. Cache associativity was also studied, with the conclusion that little is to be gained by increasing from 2 to 4-way set associativity since little interference occurs. This supports our decision not to include associativity effects in our cache model.
A simulation study of a coherent cache shared-bus architecture to support parallel logic languages is reported in [9]. The design uses the Illinois protocol as a base (essentially the same as our protocol) with several instructions added to support various communication patterns that commonly occur. Examples of new instructions are "read-purge", which removes cache lines from the cache, and direct-write. The latter is the only instruction that is used in heap references and yielded simulated execution time improvements of between 38% and 45%. These figures are high since the cache line size is only four words (approximately our minimum line size). The language does not have arrays and little spatial locality was found; this corresponds with our results showing that programs without constructed data did not show gains from spatial locality when large cache lines were used.
Conclusions
This paper has presented the design of an efficient and accurate event-driven multiprocessor simulator based on an optimising functional language compiler. A simplified architectural model is used in which processor pipeline, cache capacity and network contention effects are ignored. This loses low-level accuracy but allows us to identify object reference patterns which affect performance and to separate the conflicting effects of spatial locality and false sharing. The simulator has been used to study the performance of a benchmark set of parallel functional programs operating under an invalidation-based cache coherency protocol. Our results include the following:
- Cache line size can have a significant effect on performance, limited only by the number of PEs
- Each program displays significant processor locality, which is vital for an invalidation protocol, but contention for lines reduces this and greatly harms performance with particularly large line sizes
- Programs which did not use constructed data did not benefit from large line sizes at all, but were not greatly harmed by line sizes up to at least 8 cells. Programs which use constructed data (nqueens and matmult) can benefit from large line sizes due to spatial locality: matmult shows peak performance for line sizes of 8 or 16 cells
- Write behaviour is broadly similar for all the programs due to the burstiness of writes. Contention for lines is so great when large sizes are used with many PEs that each burst may produce several invalidating writes. Delayed consistency [7] could solve this problem
- Not surprisingly, we found that our results are very different from those derived from shared-memory programs written in conventional languages. We have observed greater processor locality and less invalidation traffic for large lines when compared to simulations of the Splash suite, for example [10]
- The use of a fast cache line allocation instruction (such as direct-write) is important when using small cache lines since a high proportion of write references are used to allocate new heap cells
- The use of a single global task pool has not appeared as a bottleneck, but it does mean that we fail to make full use of locality and this deserves further study
Future work will include more PEs, a wider range of example programs and a more thorough study of network contention. Directions for further study include affinity scheduling (in which scheduling decisions are influenced by the location of data), automatic compile-time parallelisation (following [8]) and the effect of garbage collection and compaction. Other ideas include the development of performance debugging tools based on data sharing measurements and the use of high-level data distribution annotations.
