Frank Tränkle, and Guhan Viswanathan.
Overview
Consensus is emerging about the lowest and highest levels of massively-parallel computers: these machines will be built from workstation-like nodes and programmed in high-level parallel languages--like HPF--that support a shared address space in which processes uniformly reference data. Unfortunately, no consensus has emerged on the communication model--shared memory or message passing--to support parallel languages.
The Wisconsin Wind Tunnel project is currently developing a consensus about the middle-level interface--below languages and compilers and above system software and hardware.
Our first proposed interface was Cooperative Shared Memory, which is an evolutionary extension to conventional shared-memory software and hardware [see tocs93_csm.ps.Z]. Cooperative Shared Memory asks programmers to identify the expected data sharing behavior with Check-In/Check-Out (CICO) performance annotations so that the system can handle references efficiently without complex hardware (e.g., Dir1SW).
Recently, we have developed a new interface--called Tempest--between user-level protocol handlers and system-supplied mechanisms [see isca94_typhoon.ps.Z]. Tempest provides the mechanisms that allow programmers, compilers, and program libraries to implement and use message passing, transparent shared memory, and hybrid combinations of the two. Tempest mechanisms are low-overhead messages, bulk data transfer, virtual memory management, and fine-grain access control. The most novel mechanism--fine-grain access control--allows user software to tag blocks (e.g., 32 bytes) as read-write, read-only, or invalid, so the local memory can be used to transparently cache remote data.
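The idea of tagging blocks so that local memory can cache remote data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Tag, AccessFault, TaggedMemory, and cached_load, and the fetch_remote_block callback, are hypothetical and are not part of the Tempest interface; only the 32-byte block size and the three tag states come from the text above.

```python
from enum import Enum

BLOCK_SIZE = 32  # bytes per block, the example size given in the text


class Tag(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2


class AccessFault(Exception):
    """Raised when an access is not permitted by the block's tag."""

    def __init__(self, block, is_write):
        kind = "write" if is_write else "read"
        super().__init__(f"{kind} fault on block {block}")
        self.block = block
        self.is_write = is_write


class TaggedMemory:
    """Hypothetical local memory whose blocks carry access tags."""

    def __init__(self, num_blocks):
        self.tags = [Tag.INVALID] * num_blocks
        self.data = bytearray(num_blocks * BLOCK_SIZE)

    def set_tag(self, block, tag):
        self.tags[block] = tag

    def load(self, addr):
        block = addr // BLOCK_SIZE
        if self.tags[block] is Tag.INVALID:
            raise AccessFault(block, is_write=False)
        return self.data[addr]

    def store(self, addr, value):
        block = addr // BLOCK_SIZE
        if self.tags[block] is not Tag.READ_WRITE:
            raise AccessFault(block, is_write=True)
        self.data[addr] = value


def cached_load(mem, addr, fetch_remote_block):
    """Load addr; on a fault, a user-level handler fetches the block."""
    try:
        return mem.load(addr)
    except AccessFault as fault:
        start = fault.block * BLOCK_SIZE
        # Assumed handler: copy the remote block in and tag it read-only.
        mem.data[start:start + BLOCK_SIZE] = fetch_remote_block(fault.block)
        mem.set_tag(fault.block, Tag.READ_ONLY)
        return mem.load(addr)
```

In this sketch the first load of an invalid block invokes the handler once; later loads of the same block proceed without faulting, which is the sense in which local memory acts as a cache for remote data.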
We are developing implementations of Tempest on a Thinking Machines CM-5, a cluster of dual-processor Sun workstations, and a hypothetical hardware platform. Each will use a different method to implement fine-grain access control. To refine our design ideas, we have developed and implemented an execution-driven simulation system called the Wisconsin Wind Tunnel (WWT) [sigmetrics93_wwt.ps.Z]. The released version of WWT runs a parallel shared-memory program on a parallel computer (Thinking Machines CM-5) and uses execution-driven, distributed, discrete-event simulation to accurately calculate program execution time. WWT directly executes all shared-memory program instructions and memory references that hit in the hypothetical machine's cache. WWT's speed and the CM-5's memory capacity permit evaluations to use more realistic workloads than are feasible with other simulation techniques. Unreleased versions of WWT model shared-memory, message-passing, and Tempest target programs and allow each CM-5 node to host multiple target nodes.
The Wisconsin Wind Tunnel Project is so named because we use our tools to cull the design space of parallel supercomputers in a manner similar to how aeronautical engineers use conventional wind tunnels to design airplanes. Needless to say, we neither design airplanes nor blow air.
On-Line Access
On-line information on the Wisconsin Wind Tunnel Project can be obtained through World Wide Web/Mosaic, anonymous ftp, and gopher. If these fail or you can't print the compressed PostScript, e-mail your postal mail address to wwt@cs.wisc.edu and we will send hardcopies.
Anonymous FTP
Anonymous ftp to ftp.cs.wisc.edu and cd wwt. We recommend that you get README.
Gopher
Our gopher server is experimental.
We believe the paucity of massively-parallel, shared-memory machines follows from the lack of a shared-memory programming performance model that can inform programmers of the cost of operations (so they can avoid expensive ones) and can tell hardware designers which cases are common (so they can build simple hardware to optimize them). Cooperative shared memory, our approach to shared-memory design, addresses this problem.
We also compare protocol performance by running eight benchmarks on 32-processor systems. Simulations show that Dir1SW+'s performance is comparable to more complex directory protocols. The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources. The small performance difference is attributable to two factors: the low degree of sharing in the benchmarks and Check-In/Check-Out (CICO) directives.
Keywords: Shared-memory multiprocessors, memory systems, cache coherence, directory protocols, and hardware mechanisms.
A programming performance model provides a programmer with feedback on the cost of program operations and is a necessary basis to write efficient programs. Many shared-memory performance models do not accurately capture the cost of interprocessor communication caused by non-local memory references, particularly in computers with caches. This paper describes a simple and practical programming performance model--called check-in, check-out (CICO)--for cache-coherent, shared-memory parallel computers. CICO consists of two components. The first is a collection of annotations that a programmer adds to a program to elucidate the communication arising from shared-memory references. The second is a model that calculates the communication cost of these annotations. An annotation's cost models the cost of the memory references that it summarizes and serves as a metric to compare alternative implementations. Several examples demonstrate that CICO accurately predicts cache misses and identifies changes that improve program performance.
This paper discusses solving microstructure electrostatics with cooperative shared memory (cce_electrostatics.ps.Z).
The programming models presented by parallel computers are diverse and changing. We study a new parallel programming model--cooperative shared memory (CSM)--with a collaborative effort between chemical engineers and computer scientists. Since CSM machines do not (yet) exist we evaluate our applications and machine designs with the Wisconsin Wind Tunnel (WWT), which runs CSM programs and calculates the performance of hypothetical parallel computers.
The application considered is the class of three-dimensional elliptic partial differential equations (Laplace, Stokes, Navier) with solutions represented by boundary integral equations. The parallel algorithm follows naturally from our use of the Completed Double Layer Boundary Integral Equation Method (CDL-BIEM).
A major result is the demonstration that coding CDL-BIEM is much simpler under CSM than with the message-passing model, and yet performance (computational times and speedups) is comparable, a fact that may be of great interest to designers of future machines. With WWT, we can also examine performance as a function of machine parameters such as cache size and network bandwidth and latency. The possibility of tweaking simultaneously the algorithm and architecture to outline pathways of evolution for future parallel machines is an important concept explored in this work.
Master's thesis that expands on the contents of the above paper (traenkle_ms.ps.Z).
The programming models presented by parallel computers are diverse and changing. We study the implementation of our application in different parallel programming models with a collaborative effort between chemical engineers and computer scientists.
The application considered is the class of three-dimensional elliptic partial differential equations (Laplace, Stokes, Navier) with solutions represented by boundary integral equations. These partial differential equations appear in basic microscopic descriptions of heterogeneous structured continua. As an example, we present results for the macroscopic dielectric constants and thermal conductivities of two-phase materials. The parallel algorithm follows naturally from our use of the Completed Double Layer Boundary Integral Equation Method (CDL-BIEM).
The application is implemented in the message-passing programming model using the standard send-receive message-passing primitives in the CMMD library and the static shared-memory model in the form of Split-C, both running on the Thinking Machines CM-5 parallel computer. Furthermore, we study its implementation in a new parallel programming model--cooperative shared memory (CSM). Since CSM machines do not (yet) exist we evaluate our application and machine designs with the Wisconsin Wind Tunnel (WWT), which runs CSM programs and calculates the performance of hypothetical parallel computers.
A major result is the demonstration that coding CDL-BIEM is much simpler under CSM than with the message-passing model or Split-C, and yet performance (computational times and scaleup) is comparable, a fact that may be of great interest to designers of future machines.
This paper describes Cachier, a tool for automatically inserting CICO annotations in programs (icpp94_cachier.ps.Z).
Shared memory in a parallel computer provides programmers with the valuable abstraction of a shared address space through which any part of a computation can access any datum. Although uniform access simplifies programming, it also hides communication, which can lead to inefficient programs. The check-in, check-out (CICO) performance model for cache-coherent, shared-memory parallel computers helps a programmer identify the communication underlying memory references and account for its cost. CICO consists of annotations that a programmer can use to elucidate communication and a model that attributes costs to these annotations. The annotations can also serve as directives to a memory system to improve program performance. Inserting CICO annotations requires reasoning about the dynamic cache behavior of a program, which is not always easy. This paper describes Cachier, a tool that automatically inserts CICO annotations into shared-memory programs. A novel feature of this tool is its use of both dynamic information, obtained from a program execution trace, and static information, obtained from program analysis. We measured several benchmarks annotated by Cachier by running them on a simulation of the Dir1SW cache coherence protocol, which supports these directives. The results show that programs annotated by Cachier perform significantly better than both programs without CICO annotations and programs that were annotated by hand.
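As a rough illustration of the two CICO components described above (annotations plus a model that attributes costs to them), the following Python sketch records check-out and check-in annotations for one processor and sums an assumed per-block cost. The class name CicoTrace, the annotation labels, the 32-byte block size, and the cost values are all assumptions made for this example; they are not taken from the CICO or Cachier papers, whose actual cost model may differ.

```python
from collections import Counter

BLOCK_SIZE = 32  # bytes; an assumed cache-block size


class CicoTrace:
    """Records hypothetical check-out/check-in annotations for one processor."""

    def __init__(self):
        self.held = set()        # blocks currently checked out
        self.counts = Counter()  # annotation label -> number of blocks moved

    def _blocks(self, addr, nbytes):
        first = addr // BLOCK_SIZE
        last = (addr + nbytes - 1) // BLOCK_SIZE
        return range(first, last + 1)

    def check_out(self, addr, nbytes, exclusive):
        # Simplification: a shared-to-exclusive upgrade is not modeled.
        label = "check_out_X" if exclusive else "check_out_S"
        for block in self._blocks(addr, nbytes):
            if block not in self.held:  # only blocks not yet held are charged
                self.held.add(block)
                self.counts[label] += 1

    def check_in(self, addr, nbytes):
        for block in self._blocks(addr, nbytes):
            if block in self.held:
                self.held.discard(block)
                self.counts["check_in"] += 1

    def cost(self, per_block):
        """Sum assumed per-block costs (a dict keyed by annotation label)."""
        return sum(per_block[label] * n for label, n in self.counts.items())


trace = CicoTrace()
trace.check_out(0, 128, exclusive=True)  # four 32-byte blocks
trace.check_out(64, 32, exclusive=True)  # already held, so not charged again
trace.check_in(0, 128)
total = trace.cost({"check_out_X": 100, "check_out_S": 100, "check_in": 10})
```

Comparing such totals for two versions of a loop is the kind of use the text describes for an annotation's cost: a metric for choosing between alternative implementations.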
Keywords: Shared-memory, parallel programming performance models, parallel programming tools, cache-coherence, directory protocols.
We have developed a new technique for evaluating cache-coherent, shared-memory computers. The Wisconsin Wind Tunnel (WWT) runs a parallel shared-memory program on a parallel computer (CM-5) and uses execution-driven, distributed, discrete-event simulation to accurately calculate program execution time. WWT is a virtual prototype that exploits similarities between the system under design (the target) and an existing evaluation platform (the host). The host directly executes all target program instructions and memory references that hit in the target cache. WWT's shared memory uses the CM-5 memory's error-correcting code (ECC) as valid bits for a fine-grained extension of shared virtual memory. Only memory references that miss in the target cache trap to WWT, which simulates a cache-coherence protocol. WWT correctly interleaves target machine events and calculates target program execution time. WWT runs on parallel computers with greater speed and memory capacity than uniprocessors. WWT's simulation time decreases as target system size increases for fixed-size problems and holds roughly constant as the target system and problem scale.
Wisconsin Wind Tunnel
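The split between references that hit in the target cache (executed directly) and references that miss (handled by the simulator) can be pictured with a small Python sketch. It is a toy model under stated assumptions, not WWT's implementation: the TargetNode class, the direct-mapped cache organization, and the cycle counts are hypothetical, and the ECC-based trapping and the distributed event interleaving are not modeled.

```python
BLOCK_SIZE = 32    # bytes per target cache block (assumed)
HIT_CYCLES = 1     # assumed target cost of a reference that hits
MISS_CYCLES = 100  # assumed latency charged by the simulated protocol


class TargetNode:
    """One simulated target node with a direct-mapped cache (hypothetical)."""

    def __init__(self, num_lines):
        self.lines = [None] * num_lines  # block number held by each cache line
        self.cycles = 0                  # target execution time so far
        self.misses = 0

    def reference(self, addr):
        block = addr // BLOCK_SIZE
        index = block % len(self.lines)
        if self.lines[index] == block:
            # Hit: the host would run this directly; here we only count it.
            self.cycles += HIT_CYCLES
        else:
            # Miss: stands in for the trap to the simulator, which would
            # run the cache-coherence protocol and charge its latency.
            self.misses += 1
            self.cycles += MISS_CYCLES
            self.lines[index] = block


node = TargetNode(num_lines=4)
for addr in [0, 4, 8, 128, 0]:
    node.reference(addr)
```

Because only the misses reach the simulator in this picture, the simulation work grows with the miss count rather than with the total number of references.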
Describes operating system support for the Wisconsin Wind Tunnel (usenix93_kernel.ps.Z).
This paper describes a kernel interface that provides an untrusted user-level process (an executive) with protected access to memory management functions, including the ability to create, manipulate, and execute within subservient contexts (address spaces). Page motion callbacks not only give the executive limited control over physical memory management, but also shift certain responsibilities out of the kernel, greatly reducing kernel state and complexity.
The executive interface was motivated by the requirements of the Wisconsin Wind Tunnel (WWT), a system for evaluating cache-coherent shared-memory parallel architectures. WWT uses the executive interface to implement a fine-grain user-level extension of Li's shared virtual memory on a Thinking Machines CM-5, a message-passing multicomputer. However, the interface is sufficiently general that an executive could act as a multiprogrammed operating system, exporting an alternative interface to the threads running in its subservient contexts.
The executive interface is currently implemented as an extension to CMOST, the standard operating system for the CM-5. In CMOST, policy decisions are made on a central, distinct control processor (CP) and broadcast to the processing nodes (PNs). The PNs execute a minimal kernel sufficient only to implement the CP's policy. While this structure efficiently supports some parallel application models, the lack of autonomy on the PNs restricts its generality. Adding the executive interface provides limited autonomy to the PNs, creating a structure that supports multiple models of application parallelism. This structure, with autonomy on top of centralization, is in stark contrast to most microkernel-based parallel operating systems in which the nodes are fundamentally autonomous.
This tutorial gives a brief introduction to programming, compiling, and executing parallel shared-memory applications on the Wisconsin Wind Tunnel (WWT), a virtual prototyping system. The WWT currently runs only on a Thinking Machines CM-5, so we assume that the reader has access to one, knows how to log in and run programs, and is familiar with basic Unix(TM) functionality.
The tutorial illustrates how to parallelize a simple sequential application; how to use the Cooperative Shared Memory (CSM) model and different cache coherence protocols; and how to execute, debug and profile parallel applications on the WWT. The tutorial should give you enough information to get started writing your own programs for the WWT.
Shows that parallel simulation can have better cost/performance than sequential simulation (pads94_costperf.ps.Z).
Babak Falsafi and David A. Wood. Cost/Performance of a Parallel Computer Simulator. In Proceedings of PADS '94, July 1994.
This paper examines the cost/performance of simulating a hypothetical target parallel computer using a commercial host parallel computer. We address the question of whether parallel simulation is simply faster than sequential simulation, or if it is also more cost-effective. To answer this, we develop a performance model of the Wisconsin Wind Tunnel (WWT), a system that simulates cache-coherent shared-memory machines on a message-passing Thinking Machines CM-5. The performance model uses Kruskal and Weiss's fork-join model to account for the effect of event processing time variability on WWT's conservative fixed-window simulation algorithm. A generalization of Thiebaut and Stone's footprint model accurately predicts the effect of cache interference on the CM-5. The model is calibrated using parameters extracted from a fully-parallel simulation (p = N), and validated by measuring the speedup as the number of processors (p) ranges from one to the number of target nodes (N). Together with simple cost models, the performance model indicates that for target system sizes of 32 nodes and larger, parallel simulation is more cost-effective than sequential simulation. The key intuition behind this result is that large simulations require large memories, which dominate the cost of a uniprocessor; parallel computers allow multiple processors to simultaneously access this large memory.
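The intuition in the last sentence can be made concrete with a small Python sketch. It assumes a deliberately simple cost model in which a host's cost is the sum of its processors and one fixed pool of memory, and it calls a parallel run cost-effective when its speedup exceeds the relative increase in cost. The function names and all numbers are hypothetical; the sketch does not reproduce the paper's fork-join or footprint models.

```python
def costup(processor_cost, memory_cost, p):
    """Cost of a p-processor host relative to a uniprocessor.

    Assumes both hosts need the same total memory, so only the
    processor term grows with p.
    """
    parallel = p * processor_cost + memory_cost
    sequential = processor_cost + memory_cost
    return parallel / sequential


def is_cost_effective(speedup, processor_cost, memory_cost, p):
    """True when speedup outpaces the relative increase in cost."""
    return speedup > costup(processor_cost, memory_cost, p)


# Hypothetical prices in arbitrary units: memory dominates the uniprocessor.
memory_heavy = costup(processor_cost=1.0, memory_cost=31.0, p=32)
# Same processors with no memory cost at all, for contrast.
memory_free = costup(processor_cost=1.0, memory_cost=0.0, p=32)
```

With these made-up prices, 32 processors cost less than twice as much as one when memory dominates, so even a modest speedup is cost-effective; with no memory cost the same host costs 32 times as much and would need a speedup above 32.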
4.3 Tempest, Typhoon, Blizzard, etc.
Many parallel languages presume a shared address space in which any portion of a computation can access any datum. Some parallel computers directly support this abstraction with hardware shared memory. Other computers provide distinct (per-processor) address spaces and communication mechanisms on which software can construct a shared address space. Since programmers have difficulty explicitly managing address spaces, there is considerable interest in compiler support for shared address spaces on the widely available message-passing computers.
At first glance, it might appear that hardware-implemented shared memory is unquestionably a better base on which to implement a language. This paper argues, however, that compiler-implemented shared memory, despite its shortcomings, has the potential to exploit more effectively the resources in a parallel computer. Hardware designers need to find mechanisms to combine the advantages of both approaches in a single system.
Future parallel computers must efficiently execute not only hand-coded applications but also programs written in high-level, parallel programming languages. Today's machines limit these programs to a single communication paradigm, either message-passing or shared-memory, which results in uneven performance. This paper addresses this problem by defining an interface, Tempest, that exposes low-level communication and memory-system mechanisms so programmers and compilers can customize policies for a given application. Typhoon is a proposed hardware platform that implements these mechanisms with a fully-programmable, user-level processor in the network interface. We demonstrate the utility of Tempest with two examples. First, the Stache protocol uses Tempest's fine-grain access control mechanisms to manage part of a processor's local memory as a large, fully-associative cache for remote data. We simulated Typhoon on the Wisconsin Wind Tunnel and found that Stache running on Typhoon performs comparably (±30%) to an all-hardware DirNNB cache-coherence protocol for five shared-memory programs. Second, we illustrate how programmers or compilers can use Tempest's flexibility to exploit an application's sharing patterns with a custom protocol. For the EM3D application, the custom protocol improves performance up to 35% over the all-hardware protocol.
This paper discusses various techniques for fine-grain access control and three implementations of them in Blizzard (asplos6_fine_grain.ps.Z).
This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost implementations that require little or no additional hardware. These techniques permit efficient implementation of shared memory on a wide range of parallel systems, thereby providing shared-memory codes with a portability previously limited to message passing. This paper categorizes techniques based on where access control is enforced and where access conflicts are handled.
We incorporated three techniques that require no additional hardware into Blizzard, a system that supports distributed shared memory on the CM-5. The first adds a software lookup before each shared-memory reference by modifying the program's executable. The second uses the memory's error correcting code (ECC) as cache-block valid bits. The third is a hybrid. The software technique ranged from slightly faster to two times slower than the ECC approach. Blizzard's performance is roughly comparable to a hardware shared-memory machine. These results argue that clusters of workstations or personal computers with networks comparable to the CM-5's will be able to support the same shared-memory interfaces as supercomputers.
This paper compares four shared-memory and message-passing programs running on detailed architectural simulators of comparable machines (asplos6_sm~p.ps.Z).
Satish Chandra, James R. Larus, and Anne Rogers.
Where
Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-memory programs running on similar hardware. To ensure that our measurements are comparable, we produced two carefully tuned versions of each program and measured them on closely-related simulators of a message-passing and a shared-memory machine, both of which are based on the same underlying hardware assumptions.
We examined the behavior and performance of each program carefully. Although the cost of computation in each pair of programs was similar, synchronization and communication differed greatly. We found that message-passing's advantage over shared-memory is not clear-cut. Three of the four shared-memory programs ran at roughly the same speed as their message-passing equivalent, even though their communication patterns were different.
This paper shows how a custom memory system, built on Blizzard, can help support C**, a high-level parallel language (asplos6_lcm.ps.Z).
Higher-level parallel programming languages can be difficult to implement efficiently on parallel machines. This paper shows how a flexible, compiler-controlled memory system can help achieve good performance for language constructs that previously appeared too costly to be practical.
Our compiler-controlled memory system is called Loosely Coherent Memory (LCM). It is an example of a larger class of Reconcilable Shared Memory (RSM) systems, which generalize the replication and merge policies of cache-coherent shared-memory. RSM protocols differ in the action taken by a processor in response to a request for a location and the way in which a processor reconciles multiple outstanding copies of a location. LCM memory becomes temporarily inconsistent to implement the semantics of C** parallel functions efficiently. RSM provides a compiler with control over memory-system policies, which it can use to implement a language's semantics, improve performance, or detect errors. We illustrate the first two points with LCM and our compiler for the data-parallel language C**.
This paper examines customizing protocols to applications using the Tempest interface running on Blizzard (sc94_protocols.ps.Z).
Babak Falsafi, Alvin Lebeck, Steven Reinhardt, Ioannis Schoinas, Mark D. Hill, James Larus,
