Shared-memory provides a uniform and attractive mechanism for communication.
Introduction
Given current trends in network and systems design, it should come as no surprise that most distributed shared-memory machines are built on top of an underlying message-passing substrate [1, 2, 3].
Architects of shared-memory machines often obscure this topology with a layer of hardware that implements their favorite memory coherence protocol and that insulates the processor entirely from the interconnection network.
In such machines, communication between processing elements can occur only through the shared-memory abstraction. It seems natural, however, to expose the network directly to the processor, as shown in Figure 1, thereby gaining performance in situations for which the shared-memory paradigm is either unnecessary or inappropriate.
For example, consider a thread dispatch operation to a remote node. This operation requires a pointer to the thread's code and any arguments to be placed atomically on the task queue of another processor. The task queue resides in the portion of distributed memory associated with the remote processor. To do so via shared memory, the invoking processor must first acquire the remote task queue lock, and then modify and unlock the queue using shared-memory reads and writes, each of which can require multiple network messages. As depicted in Figure 2, a message-based implementation is substantially simpler: all the information necessary to invoke the thread is marshaled into a single message, which is unmarshaled and queued atomically by the receiving processor. In this manner, synchronization and data transfer are combined in a single message.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ICS-7/93 Tokyo, Japan
The message-passing implementation yields substantial performance gains over a pure shared-memory implementation. We characterize the performance of these two implementation schemes by measuring two intervals: T_invoker, the time from when the invoking processor begins the operation until it is free to proceed with other work, and T_invokee, the time from when the invoking processor begins the operation until the invoked thread begins running. With our best shared-memory implementation, these times are 10.7 and 24.4 µsec, respectively. With the message-based implementation, both times are reduced drastically, to 0.5 and 7.4 µsec, respectively.
These numbers were derived from a cycle-by-cycle simulation of the Alewife machine, assuming a 33 MHz clock.
While supporting an efficient message interface is advantageous, we believe it is important to provide support for the shared-memory abstraction because of the simplicity it affords those programs where communication
The challenge in implementing such a streamlined interface to the interconnection network is to achieve a performance that rivals that of purely message passing machines, while at the same time coexisting with the shared memory hardware.
The integration of shared memory and message passing in Alewife is kept simple through a design discipline that provides a single, uniform interface to the interconnection network. This means that the shared-memory protocol packets and the packets produced by the message-passing facilities use the same format and the same network queues and hardware.
The message interface itself follows a similar design discipline: provide a single, uniform communications interface between the processor and the interconnection network. This uniformity is achieved by using a single packet format and by treating all message packets destined to the processor in a uniform way. Specifically, all (non-protocol) messages interrupt the processor. The processor looks at the packet header and initiates an action based on the header. Possible actions include consuming the data into its registers directly, or issuing a command to store back data via DMA.
The major contributions of this work include: (1) the design of the streamlined and uniform packet interface into the interconnection network, and (2) the mechanisms used to support its integration with the shared-memory hardware. The mechanisms required to integrate the message-passing interface with the shared-memory hardware include support for coherence on message data, high-availability interrupts, and restrictions placed on message handlers.
This paper describes the design and implementation of the message-passing interface, focusing on the issues related to its integration with the shared-memory layer. Section 3 provides an overview of the message interface. Section 4 outlines the mechanisms provided and presents the rationale behind our design decisions. Section 5 focuses on the mechanisms needed to support the integration with shared memory, and Section 6 describes the opportunities afforded by an integrated interface. Section 7 highlights issues encountered during the implementation of the Alewife machine and discusses the status of this implementation. tegrated interface. Section 9 discusses related work and Section 10 presents the status of the design and summarizes the major points in this paper.
The Alewife Machine
Alewife is a large-scale multiprocessor with distributed shared memory. The machine, organized as shown in Figure 3, uses a cost-effective mesh network for communication. A unique feature of Alewife is its LimitLESS directory coherence protocol [7]. This scheme implements a full-map directory protocol [8] by trapping into software for widely shared data items.
As discussed in Section 6.1, the Alewife message interface is necessary for implementing LimitLESS.
• Sending a message is an atomic, user-level, two-phase action: describe the message, then launch it. The sending processor describes a message by writing into coprocessor registers over the cache bus. The resulting descriptor contains either explicit data from registers, or address-length pairs for DMA-style transfers. Because multiple address-length pairs can be specified, the send can gather data from multiple memory regions.
• Message receipt is signalled with an interrupt to the receiving processor. Alternatively, the processor can mask interrupts and poll for message arrival. On entering a message reception handler, the processor examines the packet header and can take one of several actions depending on the header. Actions include discarding the message, transferring the message contents into processor registers, or instructing the CMMU to initiate a storeback of the data into one or more regions of memory (scatter).
• Mechanisms for atomicity and protection are provided to permit user and operating system functions to use the same network interface.
However, allowing the processor direct access to the interconnection network bypasses the shared-memory hardware, and permits the processor to transmit the contents of its registers or regions of memory to other processors. The direct transmission of memory data through messages interacts with the implicit transmission of (potentially the same) data through loads and stores. The interactions must be designed and specified in such a way that compilers and runtime systems can make use of the two classes of mechanism within the same application, while minimizing implementation complexity.
The following are the important issues that arise when integrating shared memory with message passing:
The interaction between messages using DMA transfer and cache-coherence. Our solution (as discussed in detail in Section 5.3) guarantees local and remote coherence. This means that data at the source and destination are coherent with respect to local processors. If global coherence is desired, it can be achieved through a two-phase software process.
Guaranteeing only local and remote coherence significantly reduces implementation complexity, while still optimizing for common-case operations.
The need for high-availability interrupts. Consider a processor that has issued multiple shared-memory requests and is currently blocked waiting for the return of data. If nonshared-memory messages precede the arrival of the requested data and occupy the head of the message queue, then the processor must trap and dispose of these messages before it can make forward progress.
To address the above issues, Alewife supports high-availability interrupts. This support allows the Alewife processor to service external messages in the middle of a pending load or store operation.
Special restrictions on global accesses by message handlers.
Output Interface
Messages in the Alewife machine are sent through a two-phase process: first describe, then launch. A message is described by writing directly to an array of registers in the CMMU, called the output descriptor array. Although this array is memory-mapped, the addresses fit into the offset field of a special store instruction called stio, as described in Table 1. Consequently, the compiler can generate instructions which perform direct register-to-register moves from the processor into this array. These moves proceed at the speed of the cache bus.
Alewife packet descriptors for register-to-register or memory-to-memory transfers have the common structure shown in Figure 5 and consist of one or more 64-bit double-words.
The descriptor consists of zero or more pairs of explicit operands, followed by zero or more address-length pairs. The address-length pairs describe blocks of data which will be fetched from memory via DMA; thus, are present) as a packet header. The packet descriptor can be up to eight (8) double-words long.
Once a packet has been described, it can be launched via an atomic, single-cycle launch instruction, called ipilaunch (IPI stands for interprocessor interrupt).
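As a rough illustration of the describe-then-launch sequence, the following Python sketch models the output descriptor array in software. The class `OutputDescriptor`, its method names, and the list standing in for local memory are hypothetical. Only the eight double-word limit and the ordering of operand pairs before address-length pairs come from the text, and the sketch assumes that each pair occupies one double-word.

```python
MAX_DOUBLE_WORDS = 8  # descriptor limit stated in the text


class OutputDescriptor:
    """Software model of the output descriptor array (hypothetical API)."""

    def __init__(self, memory):
        self.memory = memory        # assumed: a list of words standing in for local memory
        self.operand_pairs = []     # explicit operands, one pair per double-word
        self.addr_len_pairs = []    # (address, length) blocks to gather via DMA

    def _used(self):
        return len(self.operand_pairs) + len(self.addr_len_pairs)

    def add_operands(self, first, second):
        if self.addr_len_pairs:
            raise ValueError("operands must precede address-length pairs")
        if self._used() >= MAX_DOUBLE_WORDS:
            raise ValueError("descriptor full")
        self.operand_pairs.append((first, second))

    def add_block(self, address, length):
        if self._used() >= MAX_DOUBLE_WORDS:
            raise ValueError("descriptor full")
        self.addr_len_pairs.append((address, length))

    def launch(self):
        """Build the packet in one step and clear the descriptor."""
        packet = [word for pair in self.operand_pairs for word in pair]
        for address, length in self.addr_len_pairs:
            packet.extend(self.memory[address:address + length])  # gather
        self.operand_pairs, self.addr_len_pairs = [], []
        return packet


memory = list(range(100, 132))
descriptor = OutputDescriptor(memory)
descriptor.add_operands(0xA, 0xB)   # explicit operands from registers
descriptor.add_block(4, 2)          # two words starting at address 4
descriptor.add_block(16, 3)         # three words starting at address 16
packet = descriptor.launch()
```

Because nothing leaves the node until `launch` is called, the description phase can be interrupted without a partial packet entering the network, which is the property the atomic launch instruction is meant to provide.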
As shown in Table 1
If the output network is blocked due to congestion, then it is possible that the CMMU has insufficient resources to launch the next message. This information is handed to the processor in one of two ways. First, the space-avail register in the CMMU indicates the maximum packet descriptor which can be generated at the time it is read. Second, if the processor attempts to store beyond this point in the descriptor array, then the offending stio instruction is blocked until resources are available. Since the availability of resources is verified during the description process, launch instructions always complete.
Rather than blocking on insufficient descriptor resources, the software can optionally request that the CMMU generate a space-request interrupt when a specified number of double-words of descriptor space are available.
1 Note that this store instruction differs from normal store instructions only in the value that it produces for the SPARC alternate space indicator (ASI).
Table 2 lists the interrupts associated with the message interface. These interrupts are maskable and may be individually enabled or disabled. The processor can poll for disabled interrupts by examining a CMMU status register. Table 3 lists the CMMU registers that participate in message sending and receiving.
Input Interface
Efficient receipt of messages and dispatching to appropriate handlers is facilitated by an efficient interrupt interface.
Upon reception of the first double-word of a packet, the CMMU generates one of two reception interrupts, depending upon whether the message is a user message or a system/coherence message.
The processor can begin flushing its pipeline and vectoring to the interrupt handler in parallel with reception of the remainder of the message. Upon entering the interrupt handler, the processor can examine the first 8 double-words of the packet through the packet input window. As with the descriptor array, the packet input window is memory-mapped with short addresses and accessed through a special load instruction, ldio. Consequently, the compiler can generate register-to-register moves from the input window to the processor registers that proceed at the speed of the cache bus. If the processor attempts to access data that is not yet present, then the CMMU will block the processor until this data arrives. Portions of the message that are outside the packet input window are invisible to the processor. If a packet is longer than eight (8) double-words, then only the first eight double-words appear in the window. The remainder of the packet is invisible to the processor, possibly stretching into the network.
Once the processor has examined the head of the packet, it invokes a single-cycle storeback instruction, called ipicst (for IPI coherent storeback). As shown in Table 1, this instruction has two opcode fields, skip and length. The skip field specifies the number of double-words that are discarded from the head of the packet, while the length field specifies the number of double-words (following those discarded) that should be stored to memory via DMA. Either of these fields can contain a reserved "infinity" value that denotes "until the end of the packet". When invoking DMA, the processor must write the starting address for DMA to the storeback-address register before issuing the storeback instruction.
If the sum of the skip and length fields is shorter than the length of the packet, then the remainder of the packet will appear at the head of the packet input window and another reception interrupt will be generated. Multiple storeback instructions can be issued for a single input message to scatter its data to memory (the Alewife CMMU can permit two ipicst instructions to issue without blocking).
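The skip and length semantics described above can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions: the function `storeback`, the `INFINITY` sentinel, and the use of a list of double-words for the packet are all hypothetical and say nothing about the actual instruction encoding.

```python
INFINITY = None  # hypothetical stand-in for the reserved "infinity" field value


def storeback(packet, skip, length):
    """Model one storeback over a packet given as a list of double-words.

    Returns (stored, remainder): the double-words written to memory and
    whatever is left at the head of the packet input window afterwards.
    """
    if skip is INFINITY:
        return [], []                     # discard until the end of the packet
    body = packet[skip:]                  # drop `skip` double-words from the head
    if length is INFINITY:
        return body, []                   # store until the end of the packet
    return body[:length], body[length:]


packet = list(range(10))
first, rest = storeback(packet, 2, 3)         # discard 2, store 3
second, rest2 = storeback(rest, 0, INFINITY)  # scatter the remainder elsewhere
```

A non-empty remainder corresponds to the case in which another reception interrupt is generated, so a handler can scatter one message to several regions with successive storebacks.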
A second version of the storeback instruction, called ipicsti, requests a storeback completion interrupt upon completion of the storeback operation. This signals the completion of input DMA, and can be used to export blocks of data to higher levels of software.
User-Level Messaging
There are numerous advantages to exporting a fast message interface to user code. The Alewife messaging interface has many aspects that can be directly exploited by the compiler, including direct construction of the descriptor and the format of packets themselves.
This suggests that unique send and receive code might be generated for each type of communication, much in the flavor of active messages.
Consequently, the Alewife machine adopts a more general mechanism. It is designed with the assumption that collisions are rare, but that the highest priority interrupt should always have access to the network. To accomplish this, we start with an atomic message send, as described above. Then, since message launching is atomic, interrupts are free to use the network provided that they restore any partially constructed message descriptors before returning. Thus, the mechanism is the familiar "callee-saves" mechanism applied to interrupts.
Since the implementation described in Section 7 does not guarantee the contents of the descriptor array after launch, one additional mechanism is provided. This is the desc-length register of the CMMU. Whenever the output descriptor array is written, desc-length is set to the maximum of its current value and the array index that is being written. It is zeroed whenever a packet is launched. Consequently, this register indicates the number of entries in the descriptor array that must be preserved. It is non-zero only during periods in which packets are being described.
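The callee-saves discipline and the role of desc-length can be illustrated with a small Python model. Everything named here is hypothetical: `DescriptorArray`, `send_from_interrupt`, and in particular the explicit `length` argument to `launch` are assumptions made so the sketch is self-contained. Slots are indexed from zero, so the model records `index + 1` as the number of entries to preserve.

```python
class DescriptorArray:
    """Toy model of the output descriptor array plus a desc-length register."""

    def __init__(self, size=8):
        self.slots = [0] * size
        self.desc_length = 0

    def write(self, index, value):
        self.slots[index] = value
        # Assumption: zero-based indices, so entry `index` means index + 1 live entries.
        self.desc_length = max(self.desc_length, index + 1)

    def launch(self, length):
        """Send the first `length` entries (hypothetical signature) and zero desc-length."""
        packet = self.slots[:length]
        self.desc_length = 0
        return packet


def send_from_interrupt(array, words):
    """Interrupt-level send that preserves a partially described user message."""
    saved = array.slots[:array.desc_length]   # entries the interrupted code had written
    for index, word in enumerate(words):
        array.write(index, word)
    packet = array.launch(len(words))
    for index, word in enumerate(saved):      # restore; this also rebuilds desc-length
        array.write(index, word)
    return packet


array = DescriptorArray()
array.write(0, 11)
array.write(1, 22)                            # user code is interrupted here
interrupt_packet = send_from_interrupt(array, [99])
array.write(2, 33)                            # user code resumes its description
user_packet = array.launch(3)
```

The handler never needs to know what the interrupted code was sending; reading the register tells it exactly how many entries to save and restore.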
Protection
Historically, there has been tension between protection mechanisms and rapid access to hardware facilities. The Alewife network interface is no different. Protection in the Alewife machine is not intended to hide information, but rather to protect the machine from errant user code. Such protection is as follows:
• The user is not allowed to send system or coherence messages. To enforce this restriction, we require the user to construct messages with explicit headers (i.e. one or more operands). In this fashion, the opcode can be checked at the time of launch. If a violation occurs, then the ipilaunch instruction is faulted.
• The user is not allowed to issue storeback instructions if the message at the head of the queue is a system or coherence message.
• The user is not allowed to store data into kernel space. This rule is enforced by checking the storeback address register at the time that an ipicst is issued.
These protection mechanisms are transparent to both user and operating system under normal circumstances.
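The three rules above amount to two checks, one at launch and one at storeback. The Python sketch below shows one way to express them. The message class names, the `KERNEL_BASE` boundary, and the function names are hypothetical placeholders, since the text does not give the header encoding or the address-space layout.

```python
PRIVILEGED_CLASSES = {"system", "coherence"}  # hypothetical header classes
KERNEL_BASE = 0x8000_0000                     # hypothetical start of kernel space


class ProtectionFault(Exception):
    """Raised where the hardware would fault the offending instruction."""


def check_launch(operands, user_mode):
    """Check performed when a launch is attempted; operands[0] is the header."""
    if not user_mode:
        return
    if not operands:
        raise ProtectionFault("user messages need an explicit header operand")
    if operands[0] in PRIVILEGED_CLASSES:
        raise ProtectionFault("user code may not send system or coherence messages")


def check_storeback(head_class, storeback_address, user_mode):
    """Check performed when a storeback is issued."""
    if not user_mode:
        return
    if head_class in PRIVILEGED_CLASSES:
        raise ProtectionFault("message at head of queue is not a user message")
    if storeback_address >= KERNEL_BASE:
        raise ProtectionFault("storeback address lies in kernel space")


def faults(check, *args):
    """Small helper: report whether a check would fault."""
    try:
        check(*args)
    except ProtectionFault:
        return True
    return False
```

Well-behaved user code never trips either check, which is the sense in which the mechanisms are transparent under normal circumstances.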
5 Along the same lines, message hardware in a multiuser system could automatically append the current process identifier (PID) to outgoing packets. At the destination, either hardware or interrupt software could then check this PID and deliver messages to an appropriate user message handler. This is beyond the scope of this paper, however.
To some extent, these interactions arise from the fact that the network provides a single logical input and output port to the memory controller. While networks with multiple channels are possible to implement, they are invariably more expensive.
High-Availability Interrupts
The need for high-availability interrupts [18] arises because shared memory introduces a dependence between instruction execution and the interconnection network. "Normal" asynchronous interrupts, which occur only at instruction boundaries, are effectively disabled when the processor pipeline is frozen for a remote read or write request. Unfortunately, as shown in Figure 6, the requested data may never arrive if it is blocked behind other messages. This figure illustrates a situation in which the processor has issued multiple shared-memory requests and is currently blocked waiting for the return of data. Unfortunately, several non-shared-memory messages have entered the network input queue ahead of the desired response. Unless the processor traps and disposes of these messages, it will never receive its desired data. Thus, the successful completion of a spinning load or store to memory may require faulting the access in progress so that a network interrupt handler can dispose of the offending messages. The term high-availability interrupt is applied to such externally initiated pipeline interruptions.
High-availability interrupts introduce an associated problem: when a load or store is interrupted by a high-availability interrupt, it is possible for its data to arrive and to be invalidated while the interrupt handler is still executing. The original request must then be reissued when the interrupt finishes. In unfortunate situations, systematic thrashing can occur. This is part of a larger issue, namely the window of vulnerability, discussed in [18]. For a single-threaded processor, the simplest solution is to defer the invalidation until after the original load or store commits.
Restrictions on Message Handlers
A second issue is the interaction between message handlers and shared memory. When an interrupt handler is called in response to an incoming message, the interrupt code must be careful to ensure the following before accessing global-shared memory:
The network overflow interrupt must be enabled. (The network overflow handling mechanism is discussed in the next section.)
The input packet must be completely freed and network interrupts must be reenabled.
There must be no active low-level hardware locks that will defer invalidations in the interrupted code. for details.
Local Coherence for DMA
Since Alewife is a cache-coherent, shared-memory multiprocessor, it is natural to ask which form of data coherence should be supported by the DMA mechanism. Three possibilities present themselves:
1. Non-Coherent DMA: Data is taken directly from memory at the source and deposited directly to memory at the destination, regardless of the state of local or remote caches.
Second, a machine with a single network port cannot fetch dirty source data while in the middle of transmitting a larger packet since this requires the sending of messages. Even in a machine with multiple logical network ports, it is undesirable to retrieve dirty data in the middle of message transmission because the network resources associated with the message can be idle for multiple network round-trip times. Thus, a monolithic DMA mechanism would have to scan through the packet descriptor twice: once to collect data, and once to send data. This adds unnecessary complexity.
Third, globally-coherent DMA complicates network overflow recovery. While hardware can be designed to invalidate or to update remote caches during data arrival (using both input and output ports of the network simultaneously), this introduces a dependence between input and output queues that may prevent the simple "divert and relaunch" scheme described in Section 6.2 for network overflow recovery: input packets that are in the middle of a globally-coherent storeback block the input queue when the output queue is clogged.
In the light of these discussions, the Alewife machine supports a locally-coherent DMA mechanism.
Synthesizing Global Coherence
We have argued above against a monolithic, globally-coherent DMA mechanism. However, globally-coherent DMA can be accomplished in other ways. The key is to note that software desiring such semantics can employ a two-phase "collect" and "send" operation at the source and a "clean" and "receive" operation at the destination.
Thus, a globally-coherent send can be accomplished by first scanning through the source data to collect values of outstanding dirty copies. Then a subsequent DMA send operation only needs to access local copies. With the send mechanism broken into these two pieces, we see that the collection operation can potentially occur in parallel: by quickly scanning through the data and sending invalidations to all caches which have dirty copies.
At the destination, the cleaning operation is similar in flavor to collection. Here the goal of scanning through destination memory blocks is to invalidate all outstanding copies of memory lines before using them for DMA storeback. To this end, some method of marking blocks as "busy" until invalidation acknowledgments have returned is advantageous (and provided by Alewife); then, data can be stored to memory in parallel with invalidations.
It is an open question whether the collection and cleaning operations should be assisted by hardware, accomplished by performing multiple non-binding prefetch operations, or accomplished by scanning the coherence directories and manually sending invalidations.
If globally-coherent DMA operations are frequent, then a hardware assist is probably desirable. At this time, however, the Alewife machine provides no hardware assistance for these operations.
Opportunities From Integration
In this section, we touch on two unique opportunities, over and above the software advantages mentioned earlier, which arise from the inclusion of a fast message interface in a shared-memory multiprocessor. These are the LimitLESS cache-coherence protocol, and network overflow recovery.
The LimitLESS Cache Coherence Mechanism
One opportunity that arises from integrating message passing and shared memory is the ability to extend the hardware cache-coherence protocol in software. Permitting software to send and receive coherence-protocol packets requires no additional mechanism over and above the basic messaging facilities of Section 4. In Alewife, the memory system implements a set of pointers, called directories. Each directory keeps track of the cached copies of a corresponding memory line. In our current implementation, the size of the directory can be varied from zero to five pointers. The novel feature of the LimitLESS scheme [7] is that when there are more cached copies than there are pointers, the system traps the processor for software extension of the directory into main memory. The processor can then implement an algorithm of its choice in software to handle this situation.
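The limited-pointer behavior can be pictured with a short Python model. The class `LimitLESSDirectory` and its overflow policy (keeping the extra sharers in a software-managed set) are illustrative assumptions only; the scheme leaves the software free to implement an algorithm of its choice, and this is just one simple possibility.

```python
class LimitLESSDirectory:
    """Toy directory for one memory line with a bounded number of hardware pointers."""

    def __init__(self, hw_pointers=5):
        self.hw_pointers = hw_pointers   # the text allows zero to five pointers
        self.hardware = set()            # sharers tracked by the hardware pointers
        self.software = set()            # sharers tracked by the software extension
        self.traps = 0                   # how many times software was invoked

    def add_sharer(self, node):
        if node in self.hardware or node in self.software:
            return                       # already recorded
        if len(self.hardware) < self.hw_pointers:
            self.hardware.add(node)      # common case: handled entirely in hardware
        else:
            self.traps += 1              # pointer overflow: trap to software
            self.software.add(node)      # assumed policy: extend into main memory

    def sharers(self):
        """All nodes that must be invalidated on a write."""
        return self.hardware | self.software


directory = LimitLESSDirectory(hw_pointers=2)
for node in [1, 2, 3, 3, 4]:
    directory.add_sharer(node)
```

With few sharers the software path is never taken, which is why the scheme behaves like a full-map directory at close to the cost of a limited one.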
The LimitLESS scheme leaves ample opportunity for designing custom protocols which are invoked on a per-memory-line basis.
Individual directories can be set to interrupt on all references. Then, all protocol messages which arrive for these memory lines are automatically forwarded to the message input queue for software handling.
Note that this technique does not require multiple logical networks.
The heuristic that we use to detect protocol deadlock is to initialize a hardware timer with a preset value. Then, whenever the network output queue is full and blocked, the timer begins counting down from the preset value, generating a network-overflow trap if it ever reaches zero. This counter affords some hysteresis for overflow detection, since protocol deadlock is a rare event and some queue blockage is expected.
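The hysteresis provided by the countdown timer can be seen in a minimal Python model. The class `OverflowTimer` and its per-cycle `tick` interface are hypothetical, and the sketch assumes that the timer reloads its preset value whenever the output queue is no longer blocked, which the text implies but does not state outright.

```python
class OverflowTimer:
    """Toy countdown timer for detecting a clogged network output queue."""

    def __init__(self, preset):
        self.preset = preset
        self.count = preset

    def tick(self, output_blocked):
        """Advance one cycle; return True when a network-overflow trap should fire."""
        if not output_blocked:
            self.count = self.preset     # assumed: any progress reloads the timer
            return False
        self.count -= 1
        if self.count <= 0:
            self.count = self.preset     # assumed: reload after raising the trap
            return True
        return False


timer = OverflowTimer(preset=3)
blocked_trace = [True, True, False, True, True, True]
trap_trace = [timer.tick(blocked) for blocked in blocked_trace]
```

In this trace the two-cycle blockage is ignored and only the sustained three-cycle blockage raises a trap, which is the intended effect of choosing a preset value larger than ordinary congestion episodes.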
The network overflow handler places the network in "divert mode," diverting all packets from the network input queue to the IPI input queue. It then uses DMA to store all incoming packets into a special queue-overflow region of local memory. This process continues until the network output queue has drained sufficiently (a controller status bit indicates that the output queue is half full).
As a final phase of recovery, the diverted packets are relaunched with the IPI output interface. A low interrupt priority is used by the relaunch code, to permit normal message processing and network interrupts on relaunched packets.
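The divert and relaunch phases described above can be sketched as two small Python functions. This is a simplified software model, not the handler itself: the function names, the `drain_one` callback that stands in for the network accepting a packet, and the use of `deque` objects for the hardware queues are all assumptions.

```python
from collections import deque


def divert_until_drained(input_queue, output_queue, capacity, drain_one):
    """Divert incoming packets to local memory until the output queue is half full."""
    overflow_region = []                     # stands in for the queue-overflow region
    while len(output_queue) > capacity // 2:
        while input_queue:                   # divert mode: store every arriving packet
            overflow_region.append(input_queue.popleft())
        drain_one(output_queue)              # the network accepts one queued packet
    return overflow_region


def relaunch(overflow_region, deliver):
    """Final phase: hand each diverted packet back for normal processing."""
    for packet in overflow_region:
        deliver(packet)


incoming = deque(["req-a", "req-b", "req-c"])
outgoing = deque(range(8))                   # a full output queue of capacity 8
diverted = divert_until_drained(incoming, outgoing, 8, lambda queue: queue.popleft())
delivered = []
relaunch(diverted, delivered.append)
```

Emptying the input queue is what lets the output queue make progress again, since the packets that were blocking it are no longer waiting on output resources.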
Consequently, to permit network overflow recovery, we supplement the mechanisms of Section 4 with four additional mechanisms:
1. A countdown timer which can be used to detect that the network output queue has been clogged for a "long" time.
2. The ability to force all incoming packets to be diverted to the IPI input queue, rather than being processed by the shared-memory controller. Note that we have only added the ability to force this switch. The data path must already be present to permit both shared memory and message passing to coexist. See Section 7.
3. A flag which indicates that the hardware output queue is empty or half full.
4. An internal loopback path from the IPI output mechanism back to the controller input, which permits packets to be relaunched to the hardware during recovery without routing through the network hardware.
Note that the fourth mechanism is not strictly necessary, but desirable since the network is backed up during network overflow processing. Section 7 shows a diagram of the network queue structure. A final requirement is more a design philosophy than anything else:
• All controller state machines must be designed such that they never attempt to start operations which have insufficient queue resources to complete. In this context, DMA requests are broken into a series of short atomic memory operations.
Adherence to this philosophy is simpler than attempting to abort operations during network overflow.
The Alewife message interface is implemented in two components,~p to move this head pointer and to initiate DMA actions on the data which has been passed. A separate queue (not shown) holds issued ipicst instructions until they can be processed.
Local Coherence
As mentioned in the discussion on DMA coherence (Section 5.3), supporting locally-coherent DMA is straightforward in a machine with an invalidation-style cache-coherence protocol. In the CMMU, we coordinated the invalidation processes by using double-headed invalidation queues. The DMA controllers generate addresses and place requests on these queues as fast as possible (moving the tail of the queue). As soon as requests are written, the cache controller sees them, causes appropriate invalidations, and moves its head pointer.
The memory has a second head pointer which lags behind the head pointer of the cache controller. Whenever the two pointers differ, the memory machine knows that it can satisfy the DMA request at its head, since the corresponding invalidation has already occurred. When the memory machine moves its head pointer beyond an entry, that entry is freed. Each of the invalidation queues has two double-cache-line entries.
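The double-headed queue can be modeled compactly in Python. The class and method names below are hypothetical, and the two head pointers are represented implicitly: the memory head is the front of the list, and `cache_done` counts how far the cache controller's head has advanced beyond it.

```python
class InvalidationQueue:
    """Toy double-headed queue shared by a DMA controller, the cache, and memory."""

    def __init__(self, capacity=2):       # the text gives two entries per queue
        self.capacity = capacity
        self.entries = []                 # requests between the memory head and the tail
        self.cache_done = 0               # leading entries already invalidated by the cache

    def dma_enqueue(self, address):
        """DMA controller moves the tail; returns False when the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(address)
        return True

    def cache_step(self, invalidate):
        """Cache controller invalidates the entry at its head and advances."""
        if self.cache_done < len(self.entries):
            invalidate(self.entries[self.cache_done])
            self.cache_done += 1
            return True
        return False

    def memory_step(self, transfer):
        """Memory proceeds only when its head lags the cache head, then frees the entry."""
        if self.cache_done > 0:
            transfer(self.entries.pop(0))
            self.cache_done -= 1
            return True
        return False


log = []
queue = InvalidationQueue()
accepted = [queue.dma_enqueue(a) for a in (0x40, 0x80, 0xC0)]
early = queue.memory_step(lambda a: log.append(("mem", a)))   # nothing invalidated yet
queue.cache_step(lambda a: log.append(("inv", a)))
queue.memory_step(lambda a: log.append(("mem", a)))
```

The ordering in `log` shows the invariant that matters for local coherence: a memory transfer for an address is never performed before the matching cache invalidation.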
Care must be taken with the input interface so that the processor cannot re-request a memory line after it has been invalidated but before the data has been written to memory. This situation can arise from the pipelining of DMA requests and would represent a violation of local coherence. The difference in area between the input and output invalidation queues (see Table 4) results from address-matching circuitry that serves as an interlock to prevent this "local coherence hole".
Network Overflow
The memory protocol output queue and the cache protocol output queue handle protocol traffic from the memory and cache, respectively.
and iPSC/860, Kendall Square KSR1) take well over 400 µsec [2]. These numbers can also be compared with hardware-supported synchronization mechanisms, such as on the CM5, that take only 2 or 3 µsec but that require separate, log-structured (and potentially less-scalable) synchronization networks.
As described in Section 1, a remote thread invocation using messages reduces the invoker's overhead over a purely shared-memory implementation by a factor of 20 and that of the invokee by a factor of three. Memory-to-memory copy of data for 256-byte blocks is faster than shared-memory copy without prefetching by 2.4, and faster than shared-memory copy with prefetching by 1.5.
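The thread-invocation factors quoted above can be checked directly against the timings reported in Section 1. The short Python calculation below uses only those four numbers; the dictionary names are arbitrary.

```python
# Timings in microseconds, taken from Section 1.
shared_memory = {"invoker": 10.7, "invokee": 24.4}
message_based = {"invoker": 0.5, "invokee": 7.4}

# Ratio of shared-memory time to message-based time for each interval.
speedup = {name: shared_memory[name] / message_based[name] for name in shared_memory}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
```

The ratios come out at roughly 21 for the invoker and 3.3 for the invokee, consistent with the rounded factors of 20 and three.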
These results should not suggest that shared memory is unnecessary or expensive. For programs that have unpredictable, highly data-dependent access patterns, message passing implementations resort to implementing much of the shared-memory interpretive layer in software (for data location and data movement), with a corresponding loss in performance. In other cases, such as SOR, a simple block-partitioned Jacobi SOR solver, we observe little difference between well coded shared-memory and message-passing implementations.
Related Work
The CM5 provides a message-passing interface and uses SPARCs as its processing nodes. The message interface is implemented using register reads and writes into the network interface (NI) chip. Because the reads and writes are implemented over the main memory bus, they are slower than network register reads and writes in Alewife, which are implemented over the processor cache bus.
The CM5 interface does not provide support for DMA or shared memory and requires the processor to be involved in emptying out the message queue. The processor in the CM5 can be notified on message arrival either through an interrupt or by polling [22].
Our interface is different from that provided by the message passing J-machine [10] in that our processor is always interrupted on the arrival of a message, allowing the processor to examine the packet header, and to decide how to deal with the message. Messages in the J-machine are queued and handled in sequence. (The J-machine, however, allows a higher priority message to interrupt the processor.) The J-machine does not provide DMA transfer of data. Finally, message sends in Alewife are atomic in that correct execution is supported even if the processor is interrupted while writing into the network queue.
Somewhat in the flavor of the Alewife machine, the J-machine generates a send fault when the network output queue overflows. In addition, a queue overflow fault is generated when the input queue overflows. These faults can be used to trigger network overflow recovery similar to that of Section 6.2. Additionally, the J-machine network includes a second level of network priority which can be used to shuffle excess data to other nodes, should local memory for supplementary queue space be unavailable. Unfortunately, the J-machine mechanisms are extremely pessimistic, trapping as soon as local queue space is exhausted. In contrast, Alewife's network overflow mechanism provides hysteresis to ignore temporary network blockages. Further, the lack of message atomicity in the J-machine complicates the functionality of network overflow handlers.
Support for multiple models of computation has been identified as a promising direction for future research. For example, the iWarp [9] integrates systolic and message-passing styles of communication. Its interface supports DMA-style communication for long packets typical in message-passing systems, while at the same time supporting systolic processor-to-processor communication. In the latter style, a processor could be producing data and streaming it to another, while the receiving processor could be consuming the data using an interface that maps the network queue into a processor register.
To our knowledge, there are no existing machines that support both a shared-address space and a general fine-grain messaging interface in hardware. In some cases where we argue messages are better than shared memory, such as the barrier in Section 8, a similar effect could be achieved by using shared memory with a weaker consistency model. For example, the Dash multiprocessor [3, 23] has a mechanism to deposit a value from one processor's cache directly into the cache of another processor, avoiding cache coherence overhead. This mechanism might actually be faster than using a message because no interrupt occurs, but a message is much more general. Another example of a shared-memory machine that also supports a message-like primitive is the BBN Butterfly. This machine provides both hardware support for block transfers and the ability to send remote "interrupt requests." Nodes in the Butterfly are able to initiate DMA operations for blocks of data which reside in remote nodes. In an implementation of distributed shared memory on this machine, Cox and Fowler [25] conclude that an effective block transfer mechanism was critical to performance. They argue that a mechanism that allows more concurrency between processing and block transfer would make a bigger impact. It turns out that Alewife's messages are implemented in a way that allows such concurrency when transferring large blocks of data. Furthermore, the Butterfly's block transfer mechanism is not suited for more general uses of fine-grain messaging because there is no support in the processor for fast message handling.
Conclusion
This paper discussed the design of a streamlined message interface that is integrated within a shared-memory substrate. The integration of message passing with shared memory introduces many interesting issues including the need for high-availability interrupts, the need for special restrictions on message handlers, and data coherence requirements for the DMA mechanism. An interface that addresses these needs has been implemented in the Alewife machine's CMMU.
The integration of message-passing mechanisms with shared memory affords higher application performance than either a pure message-passing interface or a shared-memory interface. In addition, it provides unique opportunities, over and above the software advantages of multimodal support, including the LimitLESS cache-coherence protocol and network overflow recovery.
