Introduction
Given current trends in network and systems design, it should come as no surprise that most distributed shared-memory machines are built on top of an underlying message-passing substrate [1, 2, 3].
Architects of shared-memory machines often obscure this topology with a layer of hardware that implements their favorite memory coherence protocol and that insulates the processor entirely from the interconnection network. In such machines, communication between processing elements can occur only through the shared-memory abstraction. It seems natural, however, to expose the network directly to the processor, as shown in Figure 1, thereby gaining performance in situations for which the shared-memory paradigm is either unnecessary or inappropriate.
For example, consider a thread dispatch operation to a remote node. This operation requires a pointer to the thread's code and any arguments to be placed atomically on the task queue of another processor. The task queue resides in the portion of distributed memory associated with the remote processor. To do so via shared memory, the invoking processor must first acquire the remote task queue lock, and then modify and unlock the queue using shared-memory reads and writes, each of which can require multiple network messages. As depicted in Figure 2, a message-based implementation is substantially simpler: all the information necessary to invoke the thread is marshalled into a single message, which is unmarshalled and queued atomically by the receiving processor. In this manner, synchronization and data transfer are combined in a single message.
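As a rough illustration of the message-based dispatch described above, the following Python sketch marshals a code pointer and its arguments into one message and queues it at the receiver in a single handler step. The names `ThreadInvoke`, `Node`, `on_message`, and `remote_dispatch` are hypothetical and are not part of Alewife's interface; Python simply stands in for the hardware and runtime system.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Tuple


@dataclass(frozen=True)
class ThreadInvoke:
    """One message carrying everything needed to start a thread (hypothetical format)."""
    code: Callable[..., object]   # stands in for the pointer to the thread's code
    args: Tuple[object, ...]      # arguments marshalled alongside the code pointer


@dataclass
class Node:
    """Toy model of a processing node that owns its local task queue."""
    task_queue: deque = field(default_factory=deque)

    def on_message(self, msg: ThreadInvoke) -> None:
        # The handler runs on the receiving processor, so queuing needs no
        # remote lock: synchronization and data transfer arrive together.
        self.task_queue.append((msg.code, msg.args))

    def run_next(self) -> object:
        code, args = self.task_queue.popleft()
        return code(*args)


def remote_dispatch(dest: Node, code: Callable[..., object], *args: object) -> None:
    """Sender side: marshal the invocation into a single message and deliver it."""
    dest.on_message(ThreadInvoke(code, args))


remote = Node()
remote_dispatch(remote, lambda a, b: a + b, 2, 3)
result = remote.run_next()
```

In this model the sender performs one operation, whereas the shared-memory version would need a lock acquire, queue update, and unlock, each potentially costing several network messages.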
Alewife integrates direct network access with the shared-memory framework. The message-passing mechanisms include direct processor access via loads and stores to the input and output queues of the network, a DMA mechanism, and a fast trap mechanism for message handling. With the integrated interface in Alewife, a message can be sent with just a few user-level instructions. A processor receiving such a message will trap and respond either by rapidly executing a message handler or by queuing the message for later consideration. Scheduling and queuing decisions are made entirely in software.
The challenge in implementing such a streamlined interface to the interconnection network is to achieve a performance that rivals that of purely message-passing machines, while at the same time coexisting with the shared-memory hardware.
The integration of shared memory and message passing in Alewife is kept simple through a design discipline that provides a single, uniform interface to the interconnection network. This means that the shared-memory protocol packets and the packets produced by the message-passing facilities use the same format and the same network queues and hardware.
The message interface itself follows a similar design discipline: provide a single, uniform communications interface between the processor and the interconnection network. This uniformity is achieved by using a single packet format and by treating all message packets destined to the processor in a uniform way. Specifically, all (non-protocol) messages interrupt the processor. The processor looks at the packet header and initiates an action based on the header. Possible actions include consuming the data into its registers directly, or issuing a command to store back data via DMA.
The major contributions of this work include: (1) the design of the streamlined and uniform packet interface into the interconnection network, and (2) the mechanisms used to support its integration with the shared-memory hardware. The mechanisms required to integrate the message-passing interface with the shared-memory hardware include support for coherence on message data, high-availability interrupts, and restrictions placed on message handlers. This paper describes the design and implementation of the message-passing interface, focusing on the issues related to its integration with the shared-memory layer. Section 3 provides an overview of the message interface. Section 4 outlines the mechanisms provided and presents the rationale behind our design decisions. Section 5 focuses on the mechanisms needed to support the integration with shared memory, and Section 6 describes the opportunities afforded by an integrated interface. Section 7 highlights issues encountered during the implementation of the Alewife machine and discusses the status of this implementation. Section 8 presents empirical evidence of the benefits of implementing an integrated interface. Section 9 discusses related work and Section 10 presents the status of the design and summarizes the major points in this paper.
The Alewife Machine
Alewife is a large-scale multiprocessor with distributed shared memory. The machine is organized as shown in Figure 3.
Alewife's communications interface is unique for two reasons: first, it integrates message passing with a shared-memory interface, and second, the interface is highly efficient and uses a uniform packet structure. This section provides an overview of this interface and discusses the rationale behind the design.
A Uniform Message Interface
The message-passing interface in the Alewife machine is designed around four primary observations:
1. Header information for messages is often derived directly from processor registers at the source and, ideally, delivered directly to processor registers at the destination.
• Sending a message is an atomic, user-level, two-phase action: describe the message, then launch it. The sending processor describes a message by writing into coprocessor registers over the cache bus. The resulting descriptor contains either explicit data from registers, or address-length pairs for DMA-style transfers. Because multiple address-length pairs can be specified, the send can gather data from multiple memory regions.
• Message receipt is signaled with an interrupt to the receiving processor. Alternatively, the processor can mask interrupts and poll for message arrival. On entering a message reception handler, the processor examines the packet header and can take one of several actions depending on the header. Actions include discarding the message, transferring the message contents into processor registers, or instructing the CMMU to initiate a storeback of the data into one or more regions of memory (scatter).
• Mechanisms for atomicity and protection are provided to permit user and operating system functions to use the same network interface.
Integration of Shared Memory and Message Passing
Integration of message passing with shared memory is challenging because of their different semantics. The shared-memory interface (as depicted in Figure 1) accepts read and write requests from the processor and converts them into messages to other nodes if the desired data is present neither in the cache nor in the local memory of the requesting node. The Alewife memory system is sequentially consistent.
However, allowing the processor direct access to the interconnection network bypasses the shared-memory hardware, and permits the processor to transmit the contents of its registers or regions of memory to other processors. The direct transmission of memory data through messages interacts with the implicit transmission of (potentially the same) data through loads and stores. The interactions must be designed and specified in such a way that compilers and runtime systems can make use of the two classes of mechanism within the same application, while minimizing implementation complexity.
The following are the important issues that arise when integrating shared memory with message passing:
[...] to-register moves from the processor into this array. These moves proceed at the speed of the cache bus. Alewife packet descriptors for register-to-register or memory-to-memory transfers have the common structure shown in Figure 5 and consist of one or more [...] are present) as a packet header. The packet descriptor can be up to eight (8) double-words long.
Once a packet has been described, it can be launched via an atomic, single-cycle launch instruction called ipilaunch. (IPI stands for interprocessor interrupt.) As shown in Table 1, the opcode fields of an ipilaunch specify the number of explicit operands (in double-words) and the total descriptor length (also in double-words). Consequently, the format of a packet must be known at compile time. The execution of a launch instruction atomically commits the message to the network. Until the time of the launch, the description process can be aborted, or aborted and restarted, without leaving partial packets in the network. After a launch, the descriptor array may be modified without affecting previous messages. The ipilaunch and other instructions in Sparcle provide a tight coupling between the processor and the network.
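The describe-then-launch sequence can be sketched in Python as follows. This is a minimal model under stated assumptions, not the CMMU's behavior: the class `OutputInterface` and its methods are hypothetical, each descriptor entry (an explicit operand or an address-length pair) is assumed to occupy one double-word, and memory is a flat byte array.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

MAX_DESCRIPTOR_DWORDS = 8   # the text states descriptors hold up to eight double-words

Operand = int               # explicit double-word taken from a register
AddrLen = Tuple[int, int]   # (address, length) pair for a DMA-style transfer


@dataclass
class OutputInterface:
    """Toy model of the two-phase send: describe the packet, then launch it."""
    memory: bytearray
    descriptor: List[Union[Operand, AddrLen]] = field(default_factory=list)
    network: List[Tuple[List[int], bytes]] = field(default_factory=list)

    def describe(self, entry: Union[Operand, AddrLen]) -> None:
        if len(self.descriptor) >= MAX_DESCRIPTOR_DWORDS:
            raise OverflowError("descriptor is limited to eight double-words")
        self.descriptor.append(entry)

    def abort(self) -> None:
        # Before the launch, nothing has reached the network, so aborting
        # leaves no partial packet behind.
        self.descriptor.clear()

    def launch(self, n_operands: int) -> None:
        # The first n_operands entries are explicit data; the rest are
        # address-length pairs whose memory regions are gathered into the payload.
        operands = list(self.descriptor[:n_operands])
        pairs = self.descriptor[n_operands:]
        payload = b"".join(bytes(self.memory[a:a + n]) for a, n in pairs)
        self.network.append((operands, payload))   # the single commit point
        self.descriptor.clear()


out = OutputInterface(memory=bytearray(b"ABCDEFGHIJKLMNOP"))
out.describe(0x1234)
out.describe(7)
out.abort()                 # description abandoned; the network is untouched
out.describe(0x1234)        # header operand
out.describe((0, 4))        # gather bytes 0..3
out.describe((8, 4))        # gather bytes 8..11
out.launch(n_operands=1)
```

The only state change visible to the network happens inside `launch`, which mirrors the atomic commit described in the text.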
Since requested DMA operations occur in parallel with processor execution, data blocks which are part of outgoing messages should not be modified until after the DMA mechanism has finished with them. Consequently, we provide a second flavor of launch instruction, ipilaunchi, which requests the generation of an interrupt as soon as all data has been retrieved from memory and committed to network queues. This transmission completion interrupt can be used to free outgoing data blocks or perform other post-transmission actions.
If the output network is blocked due to congestion, then it is possible that the CMMU has insufficient resources to launch the next message. This information is handed to the processor in one of two ways. First, the space-avail register in the CMMU indicates the maximum packet descriptor which can be generated at the time it is read. Second, if the processor attempts to store beyond this point in the descriptor array, then the offending stio instruction is blocked until resources are available. Since the availability of resources is verified during the description process, launch instructions always complete.
Rather than blocking on insufficient descriptor resources, the software can optionally request that the CMMU generate a space-request interrupt when a specified number of double-words of descriptor space are available.
Input Interface
Efficient receipt of messages and dispatching to appropriate handlers is facilitated by an efficient interrupt interface.
Upon reception of the first double-word of a packet, the CMMU generates one of two reception interrupts, depending upon whether the message is a user message or a system/coherence message. The processor can begin flushing its pipeline and vectoring to the interrupt handler in parallel with reception of the remainder of the message. Upon entering the interrupt handler, the processor can examine the first 8 double-words of the packet through the packet input window. As with the descriptor array, the packet input window is memory mapped with short addresses and accessed through a special load instruction, ldio. Consequently, the compiler can generate register-to-register moves from the input window to the processor registers that proceed at the speed of the cache bus. If the processor attempts to access data that is not yet present, then the CMMU will block the processor until this data arrives. Portions of the message that are outside the packet input window are invisible to the processor. If a packet is longer than eight (8) double-words, then only the first eight double-words appear in the window.
The remainder of the packet is invisible to the processor, possibly stretching into the network.
Once the processor has examined the head of the packet, it invokes a single-cycle storeback instruction, called ipicst (for IPI coherent storeback). As shown in Table 1 [...]. If the sum of the skip and length fields is shorter than the length of the packet, then the remainder of the packet will appear at the head of the packet input window and another reception interrupt will be generated. Multiple storeback instructions can be issued for a single input message to scatter its data to memory (the Alewife CMMU can permit two ipicst instructions to issue without blocking).
A second version of the storeback instruction, called ipicsti, requests a storeback completion interrupt upon completion of the storeback operation. This signals the completion of input DMA, and can be used to export blocks of data to higher levels of software.
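The receive path (input window, then one or more storebacks with skip and length fields) can be sketched in Python as follows. The class `InputInterface` and its method names are hypothetical, memory is modeled as a list of double-words, and the exact encoding of the skip and length fields is an assumption; only the eight-double-word window and the "remainder reappears at the head of the window" rule come from the text.

```python
from typing import List

WINDOW_DWORDS = 8   # the text states the input window exposes eight double-words


class InputInterface:
    """Toy model of message reception with scatter via repeated storebacks."""

    def __init__(self, memory_dwords: int) -> None:
        self.memory: List[int] = [0] * memory_dwords
        self.packet: List[int] = []
        self.interrupt_pending = False

    def receive(self, packet: List[int]) -> None:
        self.packet = list(packet)
        self.interrupt_pending = True      # reception interrupt

    def window(self) -> List[int]:
        # Only the head of the packet is visible to the processor.
        return self.packet[:WINDOW_DWORDS]

    def storeback(self, skip: int, length: int, addr: int) -> None:
        # Discard `skip` double-words, store the next `length` to memory,
        # and leave any remainder at the head of the window.
        body = self.packet[skip:skip + length]
        self.memory[addr:addr + len(body)] = body
        self.packet = self.packet[skip + length:]
        # A non-empty remainder raises another reception interrupt.
        self.interrupt_pending = bool(self.packet)


rx = InputInterface(memory_dwords=16)
rx.receive([0xA0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])   # header plus ten data double-words
first_window = rx.window()
rx.storeback(skip=1, length=4, addr=0)     # drop the header, store 1..4 at address 0
pending_after_first = rx.interrupt_pending
rx.storeback(skip=0, length=6, addr=10)    # scatter the rest to a second region
```

Two storebacks scatter one packet into two memory regions, matching the scatter behavior described above.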
User-Level Messaging
There are numerous advantages to exporting a fast message interface to user code. The Alewife messaging interface has many aspects that can be directly exploited by the compiler, including direct construction of the descriptor and the format of packets themselves. One solution would be to allow the user to enable and disable interrupts. This is undesirable, however, since user code should not, in general, be allowed to perform actions that may crash or compromise the integrity of the machine. Alternately, we could provide separate output interfaces for the user and supervisor. This solution is also undesirable: on the one hand, it is overkill, since the chance that both the user and supervisor will attempt to send messages simultaneously is very low. On the other hand, the division between user and supervisor is somewhat arbitrary; we may have multiple levels of interrupts.
Consequently, the Alewife machine adopts a more general mechanism. It is designed with the assumption that collisions are rare, but that the highest priority interrupt should always have access to the network. To accomplish this, we start with an atomic message send, as described above. Then, since message launching is atomic, interrupts are free to use the network provided that they restore any partially constructed message descriptors before returning. Thus, the mechanism is the familiar "callee-saves" mechanism applied to interrupts.
Since the implementation described in Section 7 does not guarantee the contents of the descriptor array after launch, one additional mechanism is provided. This is the desc-length register of the CMMU. Whenever the output descriptor array is written, desc-length is set to the maximum of its current value and the array index that is being written. It is zeroed whenever a packet is launched. Consequently, this register indicates the number of entries in the descriptor array that must be preserved. It is non-zero only during periods in which packets are being described.
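The callee-saves discipline can be sketched as follows. This is a minimal Python model under stated assumptions: `DescriptorArray` and `interrupt_handler` are hypothetical names, descriptor slots are assumed to be numbered from 1 so that desc-length equals the number of entries written, and a launch is modeled as clobbering the array because the text says its contents are not guaranteed afterwards.

```python
from typing import List


class DescriptorArray:
    """Toy descriptor array with a desc-length register (slots numbered from 1)."""
    SIZE = 8

    def __init__(self) -> None:
        self.slots: List[int] = [0] * self.SIZE
        self.desc_length = 0
        self.sent: List[List[int]] = []

    def write(self, index: int, value: int) -> None:
        self.slots[index - 1] = value
        # desc-length tracks the highest slot written since the last launch.
        self.desc_length = max(self.desc_length, index)

    def launch(self) -> None:
        self.sent.append(self.slots[:self.desc_length])
        self.slots = [0] * self.SIZE   # contents are not preserved after a launch
        self.desc_length = 0           # zeroed whenever a packet is launched


def interrupt_handler(desc: DescriptorArray, message: List[int]) -> None:
    """Send a message from an interrupt, then restore the interrupted description."""
    saved = desc.slots[:desc.desc_length]          # save only what must be preserved
    for i, word in enumerate(message, start=1):
        desc.write(i, word)
    desc.launch()
    for i, word in enumerate(saved, start=1):      # callee-saves: put it back
        desc.write(i, word)


desc = DescriptorArray()
desc.write(1, 0x11)                                # user code begins describing a packet
desc.write(2, 0x22)
interrupt_handler(desc, [0xAA, 0xBB, 0xCC])        # interrupt arrives mid-description
length_after_interrupt = desc.desc_length
desc.write(3, 0x33)                                # user code resumes, unaware
desc.launch()
```

Because the handler reads desc-length first, it saves nothing at all when no description is in progress, which is the common case the design assumes.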
Protection
Historically, there has been tension between protection mechanisms and rapid access to hardware facilities. The Alewife network interface is no different. Protection in the Alewife machine is not intended to hide information, but rather to protect the machine from errant user code. Such protection is as follows:
In this section we present three issues that arise when integrating message passing with cache-coherent shared memory. These issues are the need for high-availability interrupts, special restrictions on message handlers, and data coherence for the DMA mechanism. To some extent, these interactions arise from the fact that the network provides a single logical input and output port to the memory controller. While networks with multiple channels are possible to implement, they are invariably more expensive.
High-Availability Interrupts
The need for high-availability interrupts [18] arises because shared memory introduces a dependence between instruction execution and the interconnection network. "Normal" asynchronous interrupts, which occur only at instruction boundaries, are effectively disabled when the processor pipeline is frozen for a remote read or write request. Unfortunately, as shown in Figure 6, the requested data may never arrive if it is blocked behind other messages. This figure illustrates a situation in which the processor has issued multiple shared-memory requests and is currently blocked waiting for the return of data. Unfortunately, several non-shared-memory messages have entered the network input queue ahead of the desired response. Unless the processor traps and disposes of these messages, it will never receive its desired data. Thus, the successful completion of a spinning load or store to memory may require faulting the access in progress so that a network interrupt handler can dispose of the offending messages. The term high-availability interrupt is applied to such externally initiated pipeline interruptions.
High-availability interrupts introduce an associated problem: when a load or store is interrupted by a high-availability interrupt, it is possible for its data to arrive and to be invalidated while the interrupt handler is still executing. The original request must then be reissued when the interrupt finishes. In unfortunate situations, systematic thrashing can occur. This is part of a larger issue, namely the window of vulnerability, discussed in [18]. For a single-threaded processor, the simplest solution is to defer the invalidation until after the original load or store commits.
Restrictions on Message Handlers
A second issue is the interaction between message handlers and shared memory. When an interrupt handler is called in response to an incoming message, the interrupt code must be careful to ensure the following before accessing global shared memory:
• The network overflow interrupt must be enabled. (The network overflow handling mechanism is discussed in the next section.)
• The input packet must be completely freed and network interrupts must be reenabled.
• There must be no active low-level hardware locks that will defer invalidations in the interrupted code.
The first of these conditions arises because all global accesses potentially require use of the network. Consequently, they can be blocked indefinitely if the network should overflow. The second arises for the same reason that high-availability interrupts were introduced into the picture: any global data that is accessed may be stuck behind other messages in the input queue. The last condition prevents deadlocks in the thrash elimination mechanism. See [18] for details.
[...] it is an open question whether the collection and cleaning operations should be assisted by hardware, accomplished by performing multiple non-binding prefetch operations, or accomplished by scanning the coherence directories and manually sending invalidations.
If globally coherent DMA operations are frequent, then a hardware assist is probably desirable. At this time, however, the Alewife machine provides no hardware assistance for these operations.
Opportunities From Integration
In this section, we touch on two unique opportunities, over and above the software advantages mentioned earlier, which arise from the inclusion of a fast message interface in a shared-memory multiprocessor. These are the LimitLESS cache-coherence protocol and network overflow recovery.
The LimitLESS Cache Coherence Mechanism
One opportunity that arises from integrating message passing and shared memory is the ability to extend the hardware cache-coherence protocol in software. Permitting software to send and receive coherence-protocol packets requires no additional mechanism over and above the basic messaging facilities of Section 4. In Alewife, the memory system implements a set of pointers, called directories.
Each directory keeps track of the cached copies of a corresponding memory line. In our current implementation, the size of the directory can be varied from zero to five pointers. The novel feature of the LimitLESS scheme [7] is that when there are more cached copies than there are pointers, the system traps the processor for software extension of the directory into main memory. The processor can then implement an algorithm of its choice in software to handle this situation.
The LimitLESS scheme leaves ample opportunity for designing custom protocols which are invoked on a per-memory-line basis. Individual directories can be set to interrupt on all references. Then, all protocol messages which arrive for these memory lines are automatically forwarded to the message input queue for software handling. In fact, our runtime system makes use of several extended applications of the LimitLESS interface, such as FIFO queue locks, fetch-and-op style synchronizations, and fast barriers.
Recovery from Network Overflow
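The directory-overflow idea can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `LimitLESSDirectory` is a hypothetical class, the software handler is modeled as simply recording extra sharers in a set that stands for the extension in main memory, and the actual handler policy in Alewife may differ.

```python
from typing import List, Set


class LimitLESSDirectory:
    """Toy directory for one memory line: a few hardware pointers, software beyond."""

    def __init__(self, hw_pointers: int = 5) -> None:
        self.hw_pointers = hw_pointers        # the text allows zero to five pointers
        self.hw: Set[int] = set()             # sharers tracked in hardware
        self.sw: Set[int] = set()             # software extension in main memory
        self.traps = 0                        # number of software traps taken

    def add_sharer(self, node: int) -> None:
        if node in self.hw or node in self.sw:
            return
        if len(self.hw) < self.hw_pointers:
            self.hw.add(node)                 # common case: handled in hardware
        else:
            self.traps += 1                   # pointer overflow: trap to software
            self.sw.add(node)

    def invalidate_all(self) -> List[int]:
        """On a write, every recorded sharer must receive an invalidation."""
        sharers = sorted(self.hw | self.sw)
        self.hw.clear()
        self.sw.clear()
        return sharers


directory = LimitLESSDirectory(hw_pointers=2)
for reader in [0, 1, 2, 3, 1]:
    directory.add_sharer(reader)
invalidated = directory.invalidate_all()
```

Setting `hw_pointers=0` in this model sends every reference to software, which corresponds to the "interrupt on all references" mode the text mentions for custom per-line protocols.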
Cache-coherence protocols introduce a dependence between the input and output queues of a memory controller, since they process read and write requests by returning data. This leads to a possibility for protocol deadlock, since it introduces a circular dependence between the network queues of two or more nodes. Architectures [...] Note that this technique does not require multiple logical networks.
The heuristic that we use to detect protocol deadlock is to initialize a hardware timer with a preset value. Then, whenever the network output queue is full and blocked, the timer begins counting down from the preset value, generating a network-overflow trap if it ever reaches zero. This counter affords some hysteresis for overflow detection, since protocol deadlock is a rare event and some queue blockage is expected.
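The timer heuristic can be sketched in a few lines of Python. The class `OverflowTimer` and its per-cycle `tick` method are hypothetical, and the sketch assumes the counter is reloaded with the preset value whenever the output queue is not blocked, which the text implies but does not state outright.

```python
class OverflowTimer:
    """Toy countdown timer giving hysteresis to network-overflow detection."""

    def __init__(self, preset: int) -> None:
        self.preset = preset
        self.count = preset

    def tick(self, output_blocked: bool) -> bool:
        """Advance one cycle; return True if a network-overflow trap fires."""
        if not output_blocked:
            self.count = self.preset      # blockage cleared: reload the timer
            return False
        self.count -= 1
        if self.count <= 0:
            self.count = self.preset
            return True                   # blocked for `preset` consecutive cycles
        return False


timer = OverflowTimer(preset=3)
blocked_pattern = [True, True, False, True, True, True]
traps = [timer.tick(blocked) for blocked in blocked_pattern]
```

The short blockage in the first two cycles is ignored; only the sustained blockage at the end raises the trap.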
The network overflow handler places the network in "divert mode," diverting all packets from the network input queue to the IPI input queue. It then uses DMA to store all incoming packets into a special queue-overflow region of local memory. This process continues until the network output queue has drained sufficiently (a controller status bit indicates that the output queue is half full). As a final phase of recovery, the diverted packets are relaunched with the IPI output interface. A low interrupt priority is used by the relaunch code, to permit normal message processing and network interrupts on relaunched packets.
Consequently, to permit network overflow recovery, we supplement the mechanisms of Section 4 with four additional mechanisms:
• A countdown timer which can be used to detect that the network output queue has been clogged for a "long" time.
• The ability to force all incoming packets to be diverted to the IPI input queue, rather than being processed by the shared-memory controller. Note that we have only added the ability to force this switch. The data path must already be present to permit both shared memory and message passing to coexist. See Section 7.
• A flag which indicates that the hardware output queue is empty or half full.
• An internal loopback path from the IPI output mechanism back to the controller input, which permits packets to be relaunched to the hardware during recovery without routing through the network hardware.
Note that the fourth mechanism is not strictly necessary, but desirable since the network is backed up during network overflow processing. Section 7 shows a diagram of the network queue structure. A final requirement is more a design philosophy than anything else:
• All controller state machines must be designed such that they never attempt to start operations which have insufficient queue resources to complete. In this context, DMA requests are broken into a series of short atomic memory operations.
Adherence to this philosophy is simpler than attempting to abort operations during network overflow.
[...] in a high-level synthesis language provided by LSI and optimized with the Berkeley SIS tool set. By the time of publication, this chip will have gone to LSI for fabrication. Table 4 gives a rough breakdown of the sizes of major components of the network interface. All data paths are 64 bits. Note that edge-triggered flip-flops are relatively expensive in this logic, at nine (9) gates per bit (this includes scan).
The Coprocessor Pipeline supplements the basic Sparcle instruction set with a number of instructions, including ipilaunch and ipicst. The IPI Input Interface and IPI Output Interface include all the launching, storeback, and DMA logic. The Invalidation queues permit locally coherent DMA and include the double-queue structures described below. Together, the logical components of the IPI interface (invalidation queues and interface logic) consume roughly 12% of the controller area. What these numbers do not indicate is the additional logic within the cache and memory controllers to support requests for invalidations and data from the DMA control mechanisms. The size of this logic is difficult to estimate, but appears to be a small fraction of the total logic for cache and memory control.
Figure 7 shows the network queue structure of the Alewife CMMU. All datapaths are 64 bits. The IPI output descriptor queue and IPI [...] The remaining queues are internal queues and not visible to the programmer.
Network Queue Structure

Programmer Visible Queues
Stio operations to the descriptor array write directly into the circular output descriptor queue, using a hardware tail pointer as the base. The atomic action of an ipilaunch instruction moves this tail pointer and records information about the size and composition of the descriptor. Consequently, description of subsequent packets is allowed to begin immediately after an ipilaunch, provided that data from pending launches is not overwritten. Queue space becomes available again as descriptors are consumed by the IPI output mechanism. This implementation, while reasonably simple, does not preserve the contents of message descriptors after a launch has occurred. [...] to move this head pointer and to initiate DMA actions on the data which has been passed. A separate queue (not shown) holds issued ipicst instructions until they can be processed.
Local Coherence
As mentioned in the discussion on DMA coherence (Section 5.3), supporting locally coherent DMA is straightforward in a machine with an invalidation-style cache-coherence protocol. In the CMMU, we coordinated the invalidation processes by using double-headed invalidation queues. The DMA controllers generate addresses and place requests on these queues as fast as possible (moving the tail of the queue). As soon as requests are written, the cache controller sees them, causes appropriate invalidations, and moves its head pointer. The memory has a second head pointer which lags behind the head pointer of the cache controller. Whenever the two pointers differ, the memory machine knows that it can satisfy the DMA request at its head, since the corresponding invalidation has already occurred.
When the memory machine moves its head pointer beyond an entry, that entry is freed. Each of the invalidation queues has two double-cache-line entries.
Care must be taken with the input interface so that the processor cannot re-request a memory line after it has been invalidated but before the data has been written to memory. This situation can arise from the pipelining of DMA requests and would represent a violation of local coherence. The difference in area between the input and output invalidation queues (see Table 4) results from address-matching circuitry that serves as an interlock to prevent this "local coherence hole".
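The double-headed queue can be modeled in Python as a rough sketch. The class `DoubleHeadedQueue` and its methods are hypothetical; the cache is modeled as a set of resident addresses, and the two-entry capacity follows the text. The memory head is always the oldest entry, while `cache_done` counts how far the cache controller's head has advanced past it.

```python
from typing import List, Optional, Set


class DoubleHeadedQueue:
    """Toy invalidation queue with one tail (DMA) and two heads (cache, memory)."""

    def __init__(self, capacity: int = 2) -> None:
        self.capacity = capacity
        self.entries: List[int] = []   # oldest first; the memory head is entries[0]
        self.cache_done = 0            # entries already invalidated by the cache head

    def dma_request(self, addr: int) -> bool:
        """DMA controller enqueues an address at the tail; False if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(addr)
        return True

    def cache_step(self, cache: Set[int]) -> None:
        """Cache controller invalidates the line at its head and advances."""
        if self.cache_done < len(self.entries):
            cache.discard(self.entries[self.cache_done])
            self.cache_done += 1

    def memory_step(self) -> Optional[int]:
        """Memory serves the request at its head only if the cache head is ahead."""
        if self.cache_done == 0:
            return None                # heads coincide: invalidation not yet done
        self.cache_done -= 1
        return self.entries.pop(0)     # moving past the entry frees it


cache = {0x100, 0x200}
queue = DoubleHeadedQueue()
accepted = [queue.dma_request(a) for a in (0x100, 0x200, 0x300)]
too_early = queue.memory_step()        # cache has not invalidated anything yet
queue.cache_step(cache)
served = queue.memory_step()
accepted_later = queue.dma_request(0x300)
```

The memory side never touches a line before the cache side has invalidated it, which is the ordering that local coherence requires.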
Network Overflow
The memory protocol output queue and the cache protocol output queue handle protocol traffic from the memory and cache, respectively. While such queues are important for performance reasons, they are also important for a more subtle reason: they simplify the checking of network resources. To allow network overflow recovery, controller state machines must check that all operations have sufficient resources to complete before they are initiated (Section 6.2). These two queues simplify this task, since resource checking can be done locally, without arbitrating for the output queue.
Note also that the style of network overflow recovery described in this paper requires the input and output DMA controllers to be independent of each other. This independence is necessary since we relaunch packets from memory (output DMA) which may require storeback operations during processing (input DMA).
The applications include a thread scheduler with a synthetic fine-grain tree-structured application, barrier synchronization using combining trees, remote thread invocation, bulk data transfer, and Successive Over-Relaxation (SOR). Each application is implemented both using pure shared memory and using a hybrid system. As described in Section 1, a remote thread invocation using messages reduces the invoker's overhead over a purely shared-memory implementation by a factor of 20 and that of the invokee by a factor of three. Memory-to-memory copy of data for 256-byte blocks is faster than shared-memory copy without prefetching by a factor of 2.4, and faster than shared-memory copy with prefetching by a factor of 1.5.
These results should not suggest that shared memory is unnecessary or expensive. For programs that have unpredictable, highly data-dependent access patterns, message-passing implementations resort to implementing much of the shared-memory interpretive layer in software (for data location and data movement), with a corresponding loss in performance. In other cases, such as SOR, a simple block-partitioned Jacobi SOR solver, we observe little difference between well-coded shared-memory and message-passing implementations.
Related Work
The CM5 provides a message-passing interface and uses SPARCs as its processing nodes. The message interface is implemented using register reads and writes into the network interface (NI) chip. Because the reads and writes are implemented over the main memory bus, they are slower than network register reads and writes in Alewife, which are implemented over the processor cache bus. The CM5 interface does not provide support for DMA or shared memory and requires the processor to be involved in emptying out the message queue. The processor in the CM5 can be notified on message arrival either through an interrupt or by polling [22]. Our interface is different from that provided by the message-passing J-machine [10] in that our processor is always interrupted on the arrival of a message, allowing the processor to examine the packet header, and to decide how to deal with the message. Messages in the J-machine are queued and handled in sequence. (The J-machine, however, allows a higher priority message to interrupt the processor.) The J-machine does not provide DMA transfer of data. Finally, message sends in Alewife are atomic in that correct execution is supported even if the processor is interrupted while writing into the network queue.
Somewhat in the flavor of the Alewife machine, the J-machine generates a send fault when the network output queue overflows. In addition, a queue overflow fault is generated when the input queue overflows. These faults can be used to trigger network overflow recovery similar to that of Section 6.2. Additionally, the J-machine network includes a second level of network priority which can be used to shuffle excess data to other nodes, should local memory for supplementary queue space be unavailable. Unfortunately, the J-machine mechanisms are extremely pessimistic, trapping as soon as local queue space is exhausted. In contrast, Alewife's network overflow mechanism provides hysteresis to ignore temporary network blockages. Further, the lack of message atomicity in the J-machine complicates the functionality of network overflow handlers.
Support for multiple models of computation has been identified as a promising direction for future research. For example, the iWarp [9] integrates systolic and message-passing styles of communications. Their interface supports DMA-style communication for long packets typical in message-passing systems, while at the same time supporting systolic processor-to-processor communication. In the latter style, a processor could be producing data and streaming it to another, while the receiving processor could be consuming the data using an interface that maps the network queue into a processor register.
To our knowledge, there are no existing machines that support both a shared-address space and a general fine-grain messaging interface in hardware. In some cases where we argue messages are better than shared memory, such as the barrier in Section 8, a similar effect could be achieved by using shared memory with a weaker consistency model. For example, the Dash multiprocessor [3, 23] has a mechanism to deposit a value from one processor's cache directly into the cache of another processor, avoiding cache coherence overhead. This mechanism might actually be faster than using a message because no interrupt occurs, but a message is much more general. Some shared-memory machines have implemented message-like primitives in hardware. For example, Beck, Kasten, and Thakkar [24] describe the implementation of SLIC, a system link and interrupt controller chip, for use with the Sequent Balance system. Each SLIC chip is coupled with a processing node and communicates with the other SLIC chips on a special SLIC bus that is separate from the memory system bus. The SLIC chips help distribute signals such as interrupts and synchronization information to all processors in the system. Although similar in flavor to this kind of interface, the Alewife messaging interface is built to allow direct access to the same scalable interconnection network used by the shared-memory operations.
Another example of a shared-memory machine that also supports a message-like primitive is the BBN Butterfly. This machine provides both hardware support for block transfers and the ability to send remote "interrupt requests." Nodes in the Butterfly are able to initiate DMA operations for blocks of data which reside in remote nodes. In an implementation of distributed shared memory on this machine, Cox and Fowler [25] conclude that an effective block transfer mechanism was critical to performance. They argue that a mechanism that allows more concurrency between processing and block transfer would make a bigger impact. It turns out that Alewife's messages are implemented in a way that allows such concurrency when transferring large blocks of data. Furthermore, the Butterfly's block transfer mechanism is not suited for more general uses of fine-grain messaging because there is no support in the processor for fast message handling.
Conclusion
This paper discussed the design of a streamlined message interface that is integrated within a shared-memory substrate. The integration of message passing with shared memory introduces many interesting issues including the need for high-availability interrupts, the need for special restrictions on message handlers, and data coherence requirements for the DMA mechanism. An interface that addresses these needs has been implemented in the Alewife machine's CMMU.
The integration of message-passing mechanisms with shared memory affords higher application performance than either a pure message-passing interface or a shared-memory interface. In addition, it provides unique opportunities, over and above the software advantages of multimodel support, including the LimitLESS cache-coherence protocol and network overflow recovery.
Acknowledgments
[...] Chaiken, Jonathan Babb, and Donald Yeung. Many people read earlier drafts of this paper; their feedback and criticism were much appreciated.
