The cost of hardware cache-coherence, both in terms of execution delay and operational cost, is substantial for scalable systems. Fortunately, compiler-generated cache management can reduce program serialization due to cache contention and increase execution performance. It can also reduce the cost of parallel systems by eliminating the need for more expensive hardware support. In this paper, we use the Sisal functional language system as a vehicle to implement and investigate automatic, compiler-based cache management. We describe our implementation of Sisal for the IBM Power/4. The Power/4, briefly available as a product, represents an early attempt to build a shared-memory machine that relies strictly on the language system for cache-coherence. We discuss the issues associated with deterministic execution and program correctness on a system without hardware coherence, and demonstrate how Sisal (as a functional language) is able to address those issues.
Introduction
The cost of hardware cache-coherence, both in terms of execution delay and operational cost, is substantial for scalable systems [4]. Parallel work must stop while the caches are adjusted [8]. Furthermore, as cache-coherent systems scale in size, the time associated with each consistency operation also increases. Small, bus-based systems can typically resolve a cache miss in 5 to 50 processor cycles. Larger systems using distributed memories and directory structures can require up to 500 cycles to resolve a miss, especially if the systems use a local area network as a processor interconnect. In addition, the trend in processor design is toward wider instruction issue on each cycle. For example, both the IBM RIOS 2 and the SGI TFP processing units can issue up to 4 instructions per cycle. A 100 cycle delay due to a cache miss could imply a relative cost of 400 instructions. Out-of-order instruction issue and hardware write-buffering can reduce this cost, but in general, the cache-synchronization delay can seriously impair performance. Fortunately, compiler-generated cache management can reduce the amount of serialization resulting from hardware-based cache-coherence. Further, because the dollar cost of scalable systems is high, compiler optimizations for cache coherence can also reduce the need for more expensive hardware support and thereby improve price-performance.
This work was supported in part by NSF grant ASC-9308900 and by DOE Contract W-7405-Eng-48. E-mail addresses of the authors are: rich@cs.ucsd.edu and cann@craycos.com.
In this paper, we investigate the use of a functional language as a vehicle for implementing automatic, compiler-based cache management. We describe an implementation of Sisal (Streams and Iterations in a Single Assignment Language) [5] for the IBM Power/4. The Power/4 supports shared memory, but relies strictly on the language system to enforce cache-coherence. Functional languages are attractive for such a system as they are easily analyzable for parallelism and data dependence. Moreover, the compiler exclusively controls how data is mapped with respect to cache alignment, so it can ensure that the caches are managed correctly. If machines are to be built in the future without hardware coherence, functional programming can drastically reduce the cost of programming them. Sisal is a good choice for such an implementation as it has been shown to achieve excellent shared-memory execution performance for scientific programs on other systems [2].
In the next section, we briefly describe the IBM Power/4. Section 3 details some of the problems associated with software cache management and how we address them using OSC (the Optimizing Sisal Compiler) [1]. In Section 4, we detail and analyze our results in terms of two scientific programs: RICARD and SIMPLE. We discuss both the relative performance (speedup) and the execution time of each program, and identify sources of execution overhead. Finally, in Section 5 we summarize our work and outline future research directions.
The IBM Power/4
The Power/4 architecture, available only briefly from IBM as a product, consists of 4 RIOS 1 processors connected to a set of 7 globally addressable memory banks via a crossbar switch. System software partitions each processor's address space into a privately accessible region and a globally shared region. A different memory module for each processor services private accesses, so in the absence of sharing there is no contention for memory. The system maps contiguous shared memory locations across all memory banks to minimize hot-spot contention. The RIOS 1 processors implement no support for cache-coherence. There is no way to externally signal a cache post or invalidate, and no way to bypass the on-board cache and access memory directly. The machine we used was an early prototype supporting 32K bytes of data cache and 8K bytes of instruction cache per processor, both managed using a copy-back policy. In that machine, each cache line is 64 bytes wide and a cache miss causes the processor to stall; there is no support for pre-fetch and no write buffering. Since the cache cannot be bypassed, every read of a memory location, either shared or private, causes a copy of the memory to be cached locally. Similarly, every processor write to memory results in a write to cache only. Data is subsequently moved from cache to memory either when it is evicted from the cache so that the cache line can be reused, or when it is explicitly flushed by the processor. While the Power/4 is no longer commercially available, it represents an early example of a shared-memory, cache-based architecture without hardware coherence.
Software Cache Coherence
Previous work in software cache management proposes to reduce or eliminate entirely the need for hardware coherence mechanisms [4, 6, 3]. A purely software-based approach requires the compiler and runtime system to explicitly address the problems of stale data and false sharing in order to generate deterministic programs. We describe these problems in greater detail, as well as the way in which our implementation of OSC addresses them, in the following subsections.
Stale Data
Stale data is a copy of a data item that does not reflect its most current value. If a computation inadvertently uses a stale data item, all descendant computations are potentially invalid. To ensure data "freshness" without hardware support, a parallel program must execute cache invalidate, post, and flush operations to explicitly control the interaction of each processor's local cache with shared memory.
post: Data associated with an address is copied back to global memory. The processor's cache retains its copy of the data. A processor writing a shared data element into its cache must post it to memory some time before another processor attempts to access it.
invalidate: Data associated with an address is marked invalid and the data is not copied back to global memory. The next reference for this address by the processor will be to global memory. Any processor reading a memory location that has been updated by another processor must invalidate its own copy before reading. Otherwise, it may read a stale copy from its own cache and not the valid copy from shared memory.
flush: The atomic combination of invalidate and post.
In Figure 1 we show the communication of a shared variable from one processor to another. The circled numbers show the order of execution and associate data movement with each instruction. Processor P0 assigns the value 5 into shared variable A. The value is written into P0's local cached version of A (labeled A' in the figure) as a result of the store. P0 then posts A to memory, causing the data value cached therein to be copied to shared memory. P0 and P1 synchronize using a barrier so that P1 does not attempt to read the value of A before P0 posts it. Before P1 reads the value for A, it must invalidate its cached copy (labeled A'') so that the read will come from memory and not its local cache. Note that this invalidate can take place any time before P1 attempts its read, although we show it co-located with the read itself. Finally, P1 stores the value fetched from A into its local cached copy of B.
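The sequence described for Figure 1 can be sketched with a small Python model in which each processor's cache is a dictionary in front of a shared memory. This is only an illustration under simplifying assumptions: the names Cache and communicate are hypothetical and not part of OSC or the Power/4 system software, the model tracks single variables rather than 64 byte lines, and the barrier is represented only by program order.

```python
class Cache:
    """A private copy-back cache in front of a shared memory (a dict)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> locally cached value

    def read(self, addr):
        # A miss fetches from shared memory; a hit returns the local copy,
        # even if that copy is stale.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Writes go to the cache only (copy-back policy).
        self.lines[addr] = value

    def post(self, addr):
        # Copy the cached value back to memory; keep the local copy.
        if addr in self.lines:
            self.memory[addr] = self.lines[addr]

    def invalidate(self, addr):
        # Drop the local copy without writing it back.
        self.lines.pop(addr, None)

    def flush(self, addr):
        # Post followed by invalidate, modeled here as one call.
        self.post(addr)
        self.invalidate(addr)


def communicate(use_invalidate):
    """Replay the Figure 1 sequence; return the value P1 ends up with in B."""
    memory = {"A": 0, "B": 0}
    p0, p1 = Cache(memory), Cache(memory)
    p1.read("A")              # assumption: P1 already holds an old copy of A
    p0.write("A", 5)          # P0 stores 5 into its cached copy of A
    p0.post("A")              # P0 posts A to shared memory
    # (the barrier between P0 and P1 would go here)
    if use_invalidate:
        p1.invalidate("A")    # P1 discards its stale copy before reading
    p1.write("B", p1.read("A"))
    return p1.read("B")
```

With the invalidate in place P1 observes the value 5; without it, P1 reads the stale copy left in its own cache.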
OSC and Stale Data
The current version of OSC implements a master/slave model of parallelism. All code except that implementing parallel loops is executed sequentially by the master thread. When the master reaches a parallel loop, it spawns slave tasks by writing an activation record (AR) into a pre-defined shared memory location for each slave. The activation record describes all of the loop inputs, a loop body entry point, and an index range over which the slave is to execute. Upon receipt of an AR, a slave executes the loop body over the specified range and then enters a barrier waiting for the other slaves participating in the computation to complete. Once all slaves spawned by the master have entered the barrier, the master is free to proceed.
Sisal's strict functional semantics ensure that no communication will occur between slaves once they are activated. All loop inputs must be completely available before the slaves are spawned, and no loop output will be consumed until the master and the slaves synchronize at the end of the loop. On the IBM Power/4, both the post and invalidate instructions are combined into a single flush operation (implemented as an operating-system call). Therefore, to avoid stale data accesses, the master must flush its cache before it spawns any parallel work, and each slave must flush its own cache before it enters the barrier at the end of a loop (see Figure 2). The master's flush both posts any data it has written, and invalidates any cache entries for the memory that the slaves will write with
False Sharing
The unit of caching on the IBM Power/4 is a 64 byte cache line. When a processor accesses a memory location, the entire 64 byte cache line in which it resides is fetched into the processor's cache. If data items written by different processors are mapped to the same cache line, their accesses must be sequentialized. Otherwise, processors will update different copies of the same cache line. Since they are not updating the same memory location within the line (each memory location has a single writer in a correct parallel section), each processor will contain the cache line's original contents and its own updates, but not the updates made by the other processors. All of the cache line copies map to the same set of memory locations, so when the copies are flushed back to memory, only the last write prevails. We refer to this condition as false sharing.
For example, consider a parallel program executing on a system that uses 32 byte cache lines (see Footnote 1). Assume that the program forms a contiguous vector of 15 double-precision floating point numbers in parallel, using two processors, and that the first element of the memory allocated to hold the vector is cache-aligned (Figure 3). Note that a cache line size of 32 bytes is large enough to hold four double-precision vector elements. Since element 1 is cache-aligned and the vector occupies contiguous memory, elements 5, 9, and 13 are also cache-aligned.
Footnote 1: We use 32 byte cache lines in this example to make the explanations and the subsequent figures less complex.
Note that the values in the cache line containing elements 5 through 8 are falsely shared between processors P0 and P1. When P0 produces elements 5 through 7 into its cache, the space for element 8 in the cache line will be left untouched. Similarly, P1 will produce element 8 into the rightmost slot of the cache line, leaving the slots for elements 5 through 7 untouched. The values of the untouched elements are undefined. In practice, however, they will contain whatever random data happened to be in the memory locations corresponding to the cache line before the first element is produced. After each processor produces the elements it has been assigned, it must post the values to memory. However, the hardware will post the entire cache line as a unit, thereby writing undefined values into the vector. If P0 posts first, P1's subsequent post overwrites elements 5 through 7 with the undefined values held in P1's copy of the cache line. Otherwise, P1 posts first, and the undefined value for element 8 in P0's cache line will be written into the vector. The short cache line containing elements 13 through 15 in Figure 4b may also create the possibility for false sharing. When the cache line is written to memory, the slot corresponding to what would be the 16th element will also be written. Since there is no 16th element in the vector produced by the computation, this slot will contain an undefined value with respect to the program. If another unrelated data structure happens to be contiguous with the vector, its first 8 bytes will be overwritten when the cache line is posted to memory. Again, in practice, the cache line will be fetched from memory before P1 fills in element 13 so that whatever bit pattern is present in the 16th element will be written back. If there are no processors updating the memory corresponding to the last slot in the cache line, the correct value will be posted back to the memory, and the program is correct. In general, however, that memory may also be updated in parallel since it is potentially used by an independent data structure.
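The lost update described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming one 32 byte line that holds elements 5 through 8, with None standing in for an undefined slot; the function name run and the list-based model of a cache line are illustrative choices, not part of the original system.

```python
LINE_ELEMS = 4  # a 32 byte line holds four 8 byte doubles


def run(post_order):
    """Return the memory contents of the line holding elements 5..8."""
    # Shared memory for one cache line; None marks an undefined slot.
    memory = [None] * LINE_ELEMS
    # Each processor fetches its own private copy of the whole line.
    p0_line = list(memory)
    p1_line = list(memory)
    # P0 produces elements 5..7, P1 produces element 8.
    p0_line[0:3] = [5.0, 6.0, 7.0]
    p1_line[3] = 8.0
    # The hardware posts whole lines, so the last post wins.
    for who in post_order:
        memory[:] = p0_line if who == "P0" else p1_line
    return memory
```

Calling run(["P0", "P1"]) leaves elements 5 through 7 undefined, and run(["P1", "P0"]) leaves element 8 undefined; in neither order does memory end up holding all four produced values.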
Padding and Tessellation
The general solution to the problem of false sharing within a parallel section of code requires that 1. the memory used to implement all data structures is an integral number of cache lines, and 2. no two processors share a cache line in parallel.
When implemented by a compiler and runtime system, the first requirement translates to padding in any memory allocation. For both imperative and functional languages, statically defined data structures can be easily padded. However, if the programmer is allowed to allocate memory dynamically, padding cannot be ensured by the language system. The advantage of using a functional language is that all memory allocations are strictly under the control of the compiler and runtime system; the programmer cannot directly allocate memory. Further, the need for padding can be reduced if memory is allocated in cache-aligned blocks. Again, there is no way to ensure such alignment for imperative languages that allow dynamic allocation.
To satisfy the second requirement, the program partitioner must understand the mapping between logical data structures and the memory that implements them. In particular, the partitioner must tessellate each shared data structure with an integral number of cache lines. If the programmer is allowed to specify a partition that does not tessellate, the compiler and runtime must sequentialize all or part of the computation. However, the functional language compiler and runtime are free to coordinate memory alignment and partitioning to ensure tessellation.
OSC and False Sharing
We modified OSC to pad all data structures to an integral number of cache lines. We then changed the dynamic memory allocation system used by the runtime to allocate cache-aligned regions, and to round all allocation requests up to the nearest cache-line size. The result is that all statically and dynamically defined data structures are cache-aligned and padded in the modified compiler.
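The rounding and alignment rule can be illustrated with a short Python sketch. It is not the OSC runtime allocator: padded_size and AlignedBumpAllocator are hypothetical names, and the bump-pointer policy is an assumption chosen only because it makes the alignment property easy to see.

```python
CACHE_LINE = 64  # bytes; the Power/4 line size given in the text


def padded_size(nbytes, line=CACHE_LINE):
    """Round a request up to a whole number of cache lines."""
    return ((nbytes + line - 1) // line) * line


class AlignedBumpAllocator:
    """Hypothetical allocator that hands out cache-aligned offsets."""

    def __init__(self, line=CACHE_LINE):
        self.line = line
        self.next_free = 0  # always a multiple of the line size

    def alloc(self, nbytes):
        # Every block starts on a line boundary and occupies whole lines,
        # so no two blocks can ever share a cache line.
        start = self.next_free
        self.next_free += padded_size(nbytes, self.line)
        return start
```

For instance, a vector of 15 doubles (120 bytes) is padded to 128 bytes, so the next allocation begins on a fresh cache line.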
As mentioned previously, the compiler crafts a set of activation records (one per active processor) in shared memory for each parallel loop. An activation record specifies a loop-body entry point, a list of inputs, and an index range. It is the index range that controls partitioning under OSC, as the functional semantics of the loop dictate that the computation associated with each index is independent.
To effect tessellation, we needed to change the AR generator to take into account cache-alignment. Sisal is statically typed so the elemental data types within any aggregate (such as an array) are known at compile time. Using the example in Figure 3, the compiler knows a priori that the parallel loop will produce a vector of double-precision elements. The actual size of the array may not be known until run time, hence the activation record is not crafted until the program actually executes. However, by knowing the cache-line size and the size of each element produced by the loop, the AR generator can calculate how many indices correspond to a single cache line. Once the number of processors and the total index range for the loop is known, each processor can be assigned an integral number of cache lines to produce.
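The calculation can be sketched as follows. This is an illustrative Python version, not the OSC AR generator: assign_ranges is a hypothetical name, the first element is assumed to be cache-aligned, and the choice to give any leftover cache lines to the last processors is an assumption.

```python
def assign_ranges(lo, hi, elem_size, line_size, nprocs):
    """Split indices lo..hi (inclusive) into per-processor ranges that
    each cover a whole number of cache lines."""
    per_line = line_size // elem_size      # indices per cache line
    count = hi - lo + 1
    nlines = -(-count // per_line)         # ceiling division
    base, extra = divmod(nlines, nprocs)
    ranges = []
    line = 0
    for p in range(nprocs):
        # Assumption: leftover lines go to the last `extra` processors.
        share = base + (1 if p >= nprocs - extra else 0)
        if share == 0:
            continue                       # fewer lines than processors
        start = lo + line * per_line
        end = min(hi, start + share * per_line - 1)
        ranges.append((start, end))
        line += share
    return ranges
```

For the vector of Figure 3 (indices 1 through 15, 8 byte elements, 32 byte lines, two processors) this yields the ranges 1 through 8 and 9 through 15. A 15 element vector of two-byte integers fits in one line, so only a single range is produced and the loop runs on one processor.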
Returning to the example shown in Figure 3, the compiler knows that four elements will fit in each cache line. When the AR generator is called at run time, it is parameterized with this information, the total index range (1 through 15), and the processor ids (P0 and P1). It calculates that four cache lines are required to hold the fifteen elements produced by the loop and splits the work evenly, two per processor. It then assigns indices 1 through 8 to processor P0, and 9 through 15 to processor P1. We show the resulting partition in Figure 6. In the figure, each cache line
Interference with Other Optimizations
While the changes to the memory manager and AR generator were all that were necessary to effect cache line tessellation in the general sense, OSC includes several other optimizations that potentially interfere with tessellation. In particular, loop fusion and storage pre-allocation cause difficulties that we are forced to address.
OSC attempts to fuse loops whenever possible, both to reduce the need for intermediate storage variables and to reduce overall loop overhead (see Footnote 2). The result is that a single loop range generates multiple output variables, each with a potentially different elemental type. For example, consider the fusion of the loop producing the vector shown in Figure 6 with one that produces a 15 element vector of two-byte integers. Since both loops produce 15 elements, they can be fused into a single loop to save loop overhead. In Figure 7 we show both vectors with their respective data and cache partitions. Notice that all 15 two-byte integers will fit into a single cache line. Therefore, the loop that produces this vector cannot be parallelized if false sharing is to be avoided. In general, each loop producing more than one output must be partitioned according to the least common multiple among the elemental data types of its outputs.
Since there is no possibility for memory aliasing and no implicit state in a functional language, each loop's outputs are unambiguous. Further, Sisal's strong typing makes the elemental data type known at compile time. The size, however, may not be known. For example, if a parallel loop produces a vector of vectors (which is the way two-dimensional arrays are represented in Sisal 1.2), the size of each inner vector may not be known until run time. OSC implements such aggregates using pointers to non-contiguous storage. That is, the outer vector contains memory pointers each referring to a different inner vector. If the production of the outer vector is parallelized, each loop body produces some number of inner vectors and returns pointers to them. The elemental data type for the outer vector is therefore a memory pointer, the size of which is known at compile time.
Footnote 2: Parallel Sisal loops may return multiple values of different types. OSC will compute these values using a single loop implementation (thereby fusing their production) by default. In version 12.9.1 of OSC, this default could not be overridden, although the functionality should be part of future versions.
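One way to read the least-common-multiple rule for fused loops is in terms of indices per cache line: each output has its own number of indices per line, and a partition boundary is safe for every output only at a multiple of all of them. The Python sketch below follows that reading. It is an interpretation rather than OSC's code, fused_granularity is a hypothetical name, and it assumes every output is cache-aligned and every element size divides the line size.

```python
from functools import reduce
from math import gcd


def fused_granularity(elem_sizes, line_size):
    """Smallest number of consecutive indices that covers a whole number
    of cache lines in every output of a fused loop."""
    per_line = [line_size // size for size in elem_sizes]
    # Least common multiple of the per-output indices-per-line counts.
    return reduce(lambda a, b: a * b // gcd(a, b), per_line)
```

With 32 byte lines, fusing an 8 byte double output (4 indices per line) with a two-byte integer output (16 indices per line) gives a granularity of 16 indices. Since the example loop has only 15 indices, it cannot be split without false sharing, which agrees with the observation about Figure 7.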
The other form of interference comes from the build-in-place optimizations specified in [7]. These optimizations will cause contiguous memory to be pre-allocated for data structures that are built separately and then concatenated. For example, if a vector is produced, and a border or ghost region is then concatenated with either end of the vector, OSC will perform a single memory allocation for both the vector and the borders. The loop producing the vector is then passed a memory pointer referring to the location within the allocated memory where the vector is to be stored. In Figure 8 we depict the 15 element vector from Figure 6 with a single border element on each end. Notice that the memory manager has aligned the storage for the vector, which in effect aligns the first border element and not the first data value. The AR generator, therefore, must consider the offset from the beginning of a cache line to the location where the first data value is produced when it assigns an index range to each loop body. In Figure 6 where there is no border, P0 produces elements 1 through 8 since a cache line boundary falls between elements 8 and 9. With a cache-aligned border, however, the same cache-line boundary is shifted and subsequently falls between elements 7 and 8. The result is that P0 must produce elements 1 through 7, and P1 is assigned 8 through 15.
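The effect of the border offset on the partition can be sketched by grouping indices according to the cache line they land in and then dealing whole lines out to processors. This Python sketch is illustrative only: partition_with_offset is a hypothetical name, the offset is measured in element slots, and handing leftover lines to the last processors is an assumption chosen to agree with the examples in the text.

```python
def partition_with_offset(n_elems, per_line, offset, nprocs):
    """Assign indices 1..n_elems to processors in whole cache lines.

    `offset` is the number of element slots between the cache-line
    boundary and element 1 (the number of leading border elements).
    """
    # Group element indices by the cache line they fall in.
    lines = {}
    for i in range(1, n_elems + 1):
        lines.setdefault((offset + i - 1) // per_line, []).append(i)
    groups = [lines[k] for k in sorted(lines)]
    base, extra = divmod(len(groups), nprocs)
    ranges = []
    pos = 0
    for p in range(nprocs):
        # Assumption: leftover lines go to the last `extra` processors.
        share = base + (1 if p >= nprocs - extra else 0)
        if share == 0:
            continue
        chunk = groups[pos:pos + share]
        ranges.append((chunk[0][0], chunk[-1][-1]))
        pos += share
    return ranges
```

With four elements per line and two processors, an offset of 0 reproduces the partition of Figure 6 (1 through 8, 9 through 15), and an offset of 1 reproduces the shifted partition described above (1 through 7, 8 through 15).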
In the examples, the load balance remains the same even though the partition changes. In general, however, cache-aligning the border elements can cause a load imbalance. If, for example, two border elements were appended to the vector, P0 would produce 6 values, and P1 would produce 9, rather than the 7 and
Results
We have limited results to report as our access to a Power/4 proved somewhat problematic. Further, the machine on which we were able to develop our compiler was an early prototype used to test software upgrades and new products. However, we were able to validate the correctness of our compiler and evaluate its effectiveness. Achieving correctness demonstrates that software coherence can indeed be implemented by the language system without programmer intervention. To test the viability of the implementation, we present execution times for RICARD and SIMPLE, two scientific codes written in Sisal 1.2. We wish to determine both the degree to which the compiler can exploit the parallelism of the machine, and the absolute performance. RICARD is a production code developed at the University of Colorado Medical Center to simulate the elution patterns of proteins in a gel. SIMPLE is a Lagrangian hydrodynamics benchmark developed at the Lawrence Livermore National Laboratory which simulates the behavior of fluid in a sphere. The code is reasonably complex, containing multiply fused loops, both with and without border constructions.
Figure 9 details the number of lines and the floating point operation counts for both codes. Both codes perform double-precision floating point arithmetic and integer calculations; however, the integer operations are primarily for array indexing.
In Figure 10 and Figure 11 we show the execution time and speedup profile for RICARD and SIMPLE respectively. Time, measured in seconds, is the CPU time returned by the system. To measure speedup, we compile the code for parallel execution and time its execution on one, two, three, and four processors respectively. Each speedup value is calculated as the ratio of the one-processor execution time to the time for a parallel execution.
RICARD
The lack of a good speedup in Figure 10 (1.85 on 4 processors) comes from a poor load balance due to cache partitioning. In particular, OSC partitions only the outer loop of nested parallel constructs by default. One such loop produces a two-dimensional array having only four rows. Since the output from the loop is four memory pointers (one per row), and all four memory pointers will fit into a single cache line, that loop cannot be parallelized. We reran the program in-
SIMPLE
The speedup for SIMPLE is better than that for RICARD despite the load imbalance caused by fused loops and array border constructions. The reason is that in SIMPLE parallel loops return essentially two different sized elements: pointers and double-precision floating point numbers. On the Power/4, pointer values occupy 4 bytes and double-precision reals each fit in 8 bytes. If a parallel loop returns both, it must be partitioned for the least common multiple between the two (4 bytes), yielding a poor load balance for the other. The pointer values, however, correspond to rows and the double-precision values to individual elements. On the average, more computation is expended to produce a row than an element, so by partitioning each loop for pointers (rows), the load balance favors the production of the more expensive elements.
Neither SIMPLE nor RICARD achieves good absolute performance on the Power/4. SIMPLE executes in 17.52 seconds on four processors, yielding 6.8 MFLOPS, and RICARD's four-processor performance (3.54 seconds) is 23.6 MFLOPS. We believe that there are essentially two reasons, unrelated to the functional nature of Sisal, for these less than sterling performance numbers. First, the implementation does not take advantage of caching between parallel constructions. Like many scientific codes, SIMPLE and RICARD contain parallel loops that are repeatedly executed across time steps. Frequently, the arrays produced during one time step serve as the inputs to the next. However, the caches are flushed at the end of every parallel loop, so no cached values are carried over. It is possible to identify values that can remain cached across iterations, but the overhead associated with the flush operation for selected cache elements is much greater than that for flushing the entire cache. Therefore, unless all values could remain cached between iterations, the cost of selectively flushing a few would overshadow the benefit gained from caching.
We conducted a few preliminary experiments and verified this hypothesis on the prototype machine. Notice that it is not the functional language but the hardware implementation of cache synchronization instructions that impairs performance. If the hardware implemented separate invalidate and flush operations as instructions (and not a single flush system call), we believe we could modify OSC to take advantage of caching. Slaves would invalidate inputs, and post (without invalidating) outputs at the end of each loop. Subsequent parallel sections would then access valid data from their local caches if it were present.
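The potential benefit of posting without invalidating can be illustrated by counting cache misses in a toy model. This Python sketch is only a thought experiment under strong assumptions: a single processor repeatedly consumes and reproduces one value across time steps, CountingCache and time_steps are hypothetical names, and no real Power/4 timing is modeled.

```python
class CountingCache:
    """Copy-back cache model that counts read misses."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.misses = 0

    def read(self, addr):
        if addr not in self.lines:
            self.misses += 1               # must fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value

    def post(self, addr):
        # Write back but keep the line valid in the cache.
        self.memory[addr] = self.lines[addr]

    def flush_all(self):
        # Write everything back and empty the cache.
        self.memory.update(self.lines)
        self.lines.clear()


def time_steps(steps, post_only):
    """Run `steps` iterations; return (read misses, final memory value)."""
    memory = {"x": 0}
    cache = CountingCache(memory)
    for _ in range(steps):
        value = cache.read("x")            # input from the previous step
        cache.write("x", value + 1)        # output of this step
        if post_only:
            cache.post("x")
        else:
            cache.flush_all()
    return cache.misses, memory["x"]
```

In this model both policies leave the same result in shared memory, but flushing the whole cache costs one miss per time step while post-only incurs a single initial miss.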
SIMPLE Execution Times
The second performance problem is due to the over-
We therefore conclude that functional languages provide a good vehicle for software cache coherence. They shield the programmer from the problems of stale data and false sharing while exploiting parallelism automatically. Absolute performance, however, hinges on an efficient underlying hardware implementation. In particular, post and invalidate operations should be implemented as efficiently as possible, and flush by itself is not sufficient.
As part of our future work, we plan to investigate the tradeoff between cache tessellation and load balance on hardware cache-coherent systems. We hope to develop a parameterized model to determine when the reduced contention due to cache partitioning will overshadow any subsequent loss of performance due to load imbalance. Also, we hope to study the efficacy of compiler-controlled cache invalidate, post, and flush operations as optimizations for hardware coherent systems.
This document was prepared as an account of work partially sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes. Have a nice day.
