Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery.
[17], which are not sequentially consistent and which can perform poorly in the presence of fine-grain data sharing [11].
These systems lack fine-grain access control, a key feature of hardware shared-memory machines. Access control is the ability to selectively restrict reads and writes to memory regions. At each memory reference, the system must perform a lookup to determine whether the referenced data is in local memory, in an appropriate state. If local data does not satisfy the reference, the system must invoke a protocol action to bring the desired data to the local node. We refer to the combination of performing a lookup on a memory reference and conditionally invoking an action as access control.
Access control granularity is the smallest amount of data that can be independently controlled (also referred to as the block size).
Access control is fine-grain if its granularity is similar to a hardware cache block (32-128 bytes).
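As a rough illustration of the definitions above, the following Python sketch models access control as a per-block state table that is consulted on every reference, with a protocol action invoked only when the lookup fails. The 32-byte block size, the three-state encoding, and the names `AccessControl` and `fetch_block` are assumptions made for this example; they are not taken from any system discussed in this paper.

```python
from enum import Enum

BLOCK_SIZE = 32  # bytes; assumed value at the low end of the 32-128 byte range


class State(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2


class AccessControl:
    """Toy model: a lookup on every reference, an action only on a miss."""

    def __init__(self, fetch_block):
        self.state = {}                 # block number -> State (absent means INVALID)
        self.fetch_block = fetch_block  # hypothetical protocol action callback
        self.actions = 0                # number of protocol actions invoked

    def block_of(self, addr):
        # Granularity: every address within one block shares a single state.
        return addr // BLOCK_SIZE

    def lookup(self, addr, is_write):
        # Return True if the local copy satisfies this reference.
        s = self.state.get(self.block_of(addr), State.INVALID)
        if is_write:
            return s is State.READ_WRITE
        return s is not State.INVALID

    def reference(self, addr, is_write):
        # Access control = lookup + conditionally invoked action.
        if not self.lookup(addr, is_write):
            block = self.block_of(addr)
            self.fetch_block(block, is_write)
            self.state[block] = State.READ_WRITE if is_write else State.READ_ONLY
            self.actions += 1
```

In this model, two reads to addresses 0 and 8 trigger a single action because both fall in block 0, while a later write to the same block triggers a second action to upgrade the block from read-only to read-write.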
Current shared-memory machines achieve high performance by using hardware-intensive implementations of fine-grain access control. However, this additional hardware would impose an impossible burden in the cost-conscious workstation and personal computer market.
Efficient shared memory on clusters of these machines requires low- or no-cost methods of fine-grain access control. This paper explores this design space by identifying where the lookup and action can be performed, fitting existing and proposed systems into this space, and illustrating performance trade-offs with a simulation model. The paper then focuses on three techniques suitable for existing hardware. We used these techniques to implement three variants of Blizzard, a system that uses the Tempest
Where is the Action Taken?
When a lookup detects a conflict, it must invoke an action dictated by a coherence protocol to obtain an accessible copy of a block. As with the lookup itself, hardware, software, or a combination of the two can perform this action. The protocol action software can execute either on the same CPU as the application (the "primary" processor) or on a separate, auxiliary processor.
Hardware. The DASH, KSR-1, and S3.mp systems implement actions in dedicated hardware, which provides high performance for a single protocol.
While custom hardware performs an action quickly, research has shown that no single protocol is optimal for all applications [16].
In Figure 1, we varied the overhead of lookups and the overhead of invoking an action handler.
The "ideal" case is an upper bound on performance.
It models a system in which access fault handlers and message processing run on a separate, infinitely fast processor. In particular, the protocol software runs in zero time without polluting the processor's cache. However, to make the simulation repeatable, message sends are charged one cycle. The ideal case is 2.2-2.8x faster than a realistic system running protocol software on the application processor with hardware access control that reduces lookup overhead to zero and invocation overhead near zero. The simulations show that lookup overhead has a far larger effect on system performance than invocation overhead. For example, in
Barnes also has frequent, irregular communication that incurs a high penalty on Blizzard.
Summary and Conclusions
This paper examines implementations of fine-grain memory access control, a crucial mechanism for efficient shared memory. It presents a taxonomy of alternatives for fine-grain access control. Previous shared-memory systems used or proposed hardware-intensive techniques for access control. Although these techniques provide high performance, the cost of additional hardware precludes shared memory from low-cost clusters of workstations and personal computers.
This paper describes several alternatives for fine-grain access control that require no additional hardware, but provide good performance. We implemented three in Blizzard, our system that supports fine-grain distributed shared memory on the Thinking Machines CM-5. Blizzard-S relies entirely on software and modifies an application's executable to insert a fast (15-instruction) access check before each load or store.
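The rewriting step that Blizzard-S performs can be pictured with a toy model. The Python sketch below inserts a check before every load and store in a list of instructions. The tuple-based instruction format and the name `insert_access_checks` are hypothetical and exist only for this illustration; the actual system rewrites machine code, and its check expands to roughly 15 instructions rather than a single pseudo-instruction.

```python
def insert_access_checks(instructions):
    """Return a copy of the program with a check before each load or store.

    Each instruction is a tuple whose first element is the opcode; loads and
    stores carry their address as the second element (assumed toy format).
    """
    rewritten = []
    for instr in instructions:
        op = instr[0]
        if op in ("load", "store"):
            # The check records the address and whether the access is a write.
            rewritten.append(("check", instr[1], op == "store"))
        rewritten.append(instr)
    return rewritten


program = [("load", 64), ("add",), ("store", 96)]
checked_program = insert_access_checks(program)
```

Instructions that do not touch memory, such as `("add",)` here, pass through unchanged, so the overhead of this approach grows with the number of loads and stores rather than with total instruction count.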
Blizzard-E uses the CM-5's memory error correcting code (ECC) to mark invalid cache-block-sized regions of memory.
Blizzard-ES is a hybrid that combines both techniques.
The relative performance of these techniques depends on an application's shared-memory communication, but on six programs, Blizzard-S ran from 2% faster to 108% slower than Blizzard-E.
We believe that the CM-5's network interface and network performance are similar to facilities that will
