Historically, processor accesses to memory-mapped device registers have been marked uncachable to ensure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).
1 Introduction
Most current computer systems do not efficiently support fine-grain communication.
Processors receive data from external devices, such as high-speed networks, through DMA and uncachable device registers. A processor becomes aware of an external event (e.g., a message arrival) via interrupts or by polling uncached status registers. Both notification mechanisms are costly: interrupts have high latency and polling wastes processor cycles and other system resources. A processor sends data with an uncachable store, a mechanism that is rarely given first-class support. Both uncachable loads and stores incur high overhead because they carry small amounts of data (e.g., 4-16 bytes), which fails to use the full transfer bandwidth between a processor and a device. Optimizations such as block copy [42] or special store buffers [42, 23] can help improve the performance of uncachable accesses by transferring data in chunks. However, these optimizations are processor-specific, may require new instructions [42, 23], and may be restricted in their use [42].
Snooping cache coherence mechanisms, on the other hand, are supported by almost all current processors and memory buses. These mechanisms allow a processor to quickly and efficiently obtain a cache block's worth of data (e.g., 32-128 bytes) from another processor or memory. This paper explores leveraging the first-class support given to snooping cache coherence to improve communication between processors and network interfaces (NIs). NIs need attention because progress in high-bandwidth, low-latency networks is rapidly making NIs a bottleneck. Rather than try to explore the entire NI design space here, we focus our efforts in three ways:
• First, we concentrate on NIs that reside on memory or I/O buses.
In contrast, other research has examined placing NIs in processor registers [5, 15, 21], in the level-one cache controller [1], and on the level-two cache bus [10]. Our NIs promise lower cost than these alternatives, given the economics of current microprocessors and the higher integration levels we expect in the future. Nevertheless, closer integration is desirable if it can be made economically viable.
• Second, we limit ourselves to relatively simple NIs, similar in complexity to the Thinking Machines CM-5 NI [29] or a DMA engine. In contrast, other research has examined complex, powerful NIs that integrate an integer processor core [28, 38] to offer higher performance at higher cost. While both simple and complex NIs are interesting, we concentrate on simple NIs where coherence has not yet been fully exploited.
• Third, we focus on program-controlled fine-grain communication between peer user processes, as required by demanding parallel computing applications. This includes notifying the receiving process that data is available without requiring an interrupt. In contrast, DMA devices send larger messages to remote memory, and only optionally notify the receiving process with a relatively heavy-weight interrupt.
We explore a class of coherent network interfaces (CNIs) that reside on a processor node's memory or coherent I/O bus and participate in the cache coherence protocol. CNIs interact with a coherent bus like Stanford DASH's RC/PCPU [30], but support messaging rather than distributed shared memory. CNIs communicate with the processor through two mechanisms: cachable device registers (CDRs) and cachable queues (CQs). CQs are a new mechanism that generalizes CDRs from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue to amortize control overheads. To maximize performance we exploit several critical optimizations: lazy pointers, message valid bits, and sense-reverse. Because CQs look, smell, and act like normal cachable memory, message send and receive overheads are extremely low: a cache miss plus several cache hits. Furthermore, if the system supports prefetching or an update-based coherence protocol, even the cache miss may be eliminated.
Because CNIs transfer messages a cache block at a time, the sustainable bandwidth is much greater than that of conventional program-controlled NIs (such as the CM-5 NI [44]) that rely on slower uncachable loads and stores. For symmetric multiprocessors (SMPs), which are often limited by memory bus bandwidth, the reduced bus occupancy for accessing the network interface translates into better overall system performance.
An important advantage of CNIs is that they allow main memory to be the home for CQ entries. The home of a physical address is the I/O device or memory module that services requests to that address (when the address is not cached) and accepts the data on writebacks (e.g., due to cache replacements). Using main memory as a home for CQ entries offers several potential advantages. First, it decouples the logical and physical locations of network interface buffers. Logically, these buffers reside in main memory, a relatively plentiful resource that eases problems of naming, allocation, and deadlock. Physically, they can be located in processor or device caches to allow access at maximum speed. Second, it provides the same interface abstraction for local and remote communication.
The sender cannot tell whether the receiver is local or remote, and neither can the receiver. Third, it can exploit future processor and system optimizations, such as prefetching, replacement hints, or update protocols, that can further reduce the overheads of accessing NI registers or data buffers.
To expose the CNI design space, we develop a taxonomy reminiscent of DiriX [2]. We denote traditional network interface devices as NIiX and coherent network interface devices as CNIiX. The subscript i specifies the portion of an NI queue visible to the processor. The default unit of i is memory/cache blocks, but it can also be specified in 4-byte words by adding the suffix 'w'. The placeholder X is either empty, Q, or Qm. An empty X represents the simple case where the NI exposes only part or all of one message to the processor; as a result there are no explicit head or tail pointers to manage the NI queue. X = Q represents the more complex case where the exposed part of the NI queue is actually managed as a memory-based queue with explicit head and tail pointers. X = Qm denotes that the home of the explicit memory-based NI queues is main memory.
We then evaluate four CNIs (CNI4, CNI16Q, CNI512Q, and CNI16Qm) and compare them with NI2w, an NI that uses uncached accesses to its data buffers and device registers, derived from the Thinking Machines CM-5 NI. We consider placing the NIs on both a coherent memory bus and a slower coherent I/O bus. Microbenchmark results show that compared to NI2w, CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on a memory bus, and by 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that a CNI can improve performance by 17-53% on the memory bus and 30-88% on the I/O bus. We see our paper as having two main contributions. First, we develop cachable queues, including the use of lazy pointers, message valid bits, and sense-reverse. Second, we present the first micro- and macrobenchmark comparison of alternative CNIs, exposed by our taxonomy, with a conventional NI.
A weakness of this paper, however, is that we do not do an in-depth comparison of our proposals with DMA. The magnitude of this deficiency depends on how important one expects DMA to be compared to fine-grain communication in future systems. Some argue that DMA will become more important as techniques like User-level DMA [3] reduce DMA initiation overheads. Others argue DMA will become less important as processors add block copy instructions [42] (making the breakeven size for DMA larger) and as the marginal cost of adding another processor diminishes [48] (making it less expensive to temporarily waste a processor).
The rest of this paper describes CDRs and CQs in detail (Section 2), presents a CNI taxonomy and implementations (Section 3), describes our evaluation methodology (Section 4), analyzes results (Section 5), reviews related work (Section 6), and concludes (Section 7).
2 Coherent Network Interface Techniques
In this section, we describe two techniques for implementing CNIs: Cachable Device Registers (CDRs) and Cachable Queues (CQs). A CDR is a single coherent cache block used by a processor to communicate information to or from a CNI device. A CQ generalizes this concept into a contiguous region of coherent cache blocks. We describe the major issues in successfully exploiting CDRs and CQs. We describe their operation assuming write-allocate caches kept consistent by a MOESI write-invalidate coherence protocol [43].
2.1 Cachable Device Registers
Cachable Device Registers (CDRs) combine the traditional notion of memory-mapped device registers with the now-ubiquitous bus-based cache-coherence protocols supported by all major microprocessors. Reinhardt et al. [39, 40] first proposed CDRs to communicate status information from a special-purpose hardware device to a processor. We extend their work to use coherence to efficiently communicate control information and data both to and from a network interface.
A CDR is a coherent, cachable memory block shared between a processor and a coherent network interface (CNI) device. The CNI sends information to the processor (i.e., to initiate a request or update status) by writing to the block. The CNI must first obtain write permission to the block in accordance with the underlying coherence protocol. The processor receives the information by polling the CDR. The CNI generates an invalidation to obtain write permission (arc 1), and the processor incurs a cache miss to fetch the CDR on its next poll attempt (arcs 2-5). Because a CDR consists of a whole cache block, an entire small message can be communicated between processor and CNI in a single bus transaction, amortizing the fixed overheads across multiple words.
A CDR can also transfer information from the processor to the device, in a logically symmetric way. Processor writes to the CDR are treated just like writes to a normal coherent cache block, with write permission obtained using the standard coherence mechanisms.
The CNI device receives the information by reading the block, in a manner equivalent to polling. However, because the device observes the coherence protocol directly, it knows when the processor requests write permission to the block. Thus it need not poll periodically, but can read the block back immediately after the processor requests write permission. The device can provide a system-programmable backoff interval to reduce the likelihood of "stealing" the block back before the processor completes its writes to the CDR. This technique, called virtual polling, is necessary because few processors can efficiently "push" data out of their caches. For processors (e.g., PowerPC [47]) that do support user-level cache flush instructions, the CDR can be directly flushed out of the cache.
CDRs allow a processor to efficiently transfer a full cache block (e.g., 32-128 bytes) of information to or from a CNI. For smaller amounts of data, e.g., a 4-byte word, CDRs are less efficient. For most processors, fetching a single word from an uncached device register takes roughly the same time as fetching it from a CDR; this is because the CNI responds with the requested word first, which is then bypassed to the processor. However, the CDR still has higher overhead since it will displace another block from the cache, potentially causing a later miss. CDRs do even less well for small transfers to a device. Because most modern processors have store buffers, a single uncached store is more efficient than transferring that word via a CDR. For most processors and buses the breakeven point typically occurs at two or three double words. Hence, our CNI designs still use uncached stores to transfer single words of control information from the processor to the device.
For messages larger than a cache block, we require some method to reuse the CDR. For example, after the processor has read the first block of a message, it may want to read the second block using the same CDR. Conventional device registers often solve this problem using implicit clear-on-read semantics, where the register is cleared after an uncached read. For example, the CM-5 network interface treats the read of the hardware receive queue as an implicit "pop" operation. Clear-on-read works because processors guarantee the atomicity of individual load instructions; that is, the value returned by the device is guaranteed to be written to a register.
Clear-on-read does not work well for CDRs, since most processors do not provide the same atomicity guarantees for cache blocks. The load that causes the cache miss should be atomic (to close the "window of vulnerability" [27]); however, there are no guarantees for the remaining words in the block. Before subsequent loads complete, a cache conflict (e.g., resulting from an interrupt) could replace the block. With clear-on-read semantics, the remainder of the data in the CDR would be lost forever.
Instead, CDRs require an explicit clear operation by the receiver to enable reuse of the block. Under a MOESI protocol even this clear operation requires a slow three-cycle handshake between the processor and CNI to make sure that the processor sees new data when it re-reads the CDR. In the first step of this handshake, the processor issues an explicit clear operation by performing an uncached store to a traditional device register. In the second step, the processor must ensure that the CNI has seen the clear request. Since most modern processors employ store buffers, this step may incur additional stalls while a memory barrier instruction flushes the store out to the bus. When the CNI observes the explicit clear operation, it invalidates the CDR by arbitrating for and acquiring the memory bus. The third step of the handshake is for the processor to ensure that the invalidation has completed. It does this by reading, potentially repeatedly, a traditional uncached device status register.¹ Consequently, while CDRs efficiently transfer a single block of information, they perform much less well for multiple blocks. We address this problem by introducing cachable queues.
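Before moving on, here is a minimal C sketch of the receiver's side of the three-step handshake just described. It assumes memory-mapped, uncached device registers; the register offsets, status bit, and accessor names are hypothetical, not part of any real CNI.

    #include <stdint.h>

    /* Illustrative register layout and status bit, for this sketch only. */
    #define CNI_REG_CLEAR   0          /* write: request a CDR clear        */
    #define CNI_REG_STATUS  1          /* read: handshake status            */
    #define CDR_CLEAR_DONE  0x1u       /* set once the CDR is invalidated   */

    #define memory_barrier() __sync_synchronize()

    /* Three-cycle CDR clear handshake under a MOESI protocol. */
    static void cdr_clear(volatile uint32_t *cni_regs, uint32_t cdr_index)
    {
        /* Step 1: uncached store to a traditional device register. */
        cni_regs[CNI_REG_CLEAR] = cdr_index;

        /* Step 2: flush the store buffer so the CNI sees the request. */
        memory_barrier();

        /* Step 3: spin on an uncached status register until the CNI has
         * invalidated the CDR; only then will a re-read miss and fetch
         * new data instead of hitting on the stale cached copy. */
        while (!(cni_regs[CNI_REG_STATUS] & CDR_CLEAR_DONE))
            ;
    }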
2.2 Cachable Queues
Cachable Queues (CQs) generalize the concept of CDRs from one coherent memory block to a contiguous region of coherent memory blocks managed as a queue. CQs are a general mechanism that can be used to communicate messages between two processor caches or between a processor cache and a device cache. A key advantage of CQs is that they simplify the reuse handshake and amortize its overhead over the entire queue of blocks. Liu and Culler [31] used cachable queues to communicate small messages and control information between the compute processor and message processor in the Intel Paragon. We show how CQs can be used to communicate directly between a processor and a network interface device. We first describe the basic queue operation, and then introduce three important performance optimizations.
Cachable queues follow the familiar enqueue-dequeue abstraction and employ the usual array implementation, illustrated in Figure 2. The head pointer (head) identifies the next item to be dequeued, and the tail pointer (tail) identifies the next free entry. The queue is empty if head and tail are equal, and full if tail is one item² less than head (modulo queue size). If there is a single sender and single receiver for this queue, the case we consider in this paper, then no locking is required, since only the sender updates tail and only the receiver updates head.³
A processor sends a message by simply enqueuing it in the appropriate outbound message queue. That is, it first checks for overflow, then writes the message to the next free queue location and increments tail, relying on the underlying coherence protocol to bring the block(s) local to the cache. A processor receives a message by first checking for an empty queue, then reading the queue entry pointed to by head. The message remains in the queue until the receiver explicitly increments head. The head and tail pointers reside in separate cache blocks.

¹ A somewhat more efficient handshake is possible if the processor provides a user-accessible cache-invalidate operation: issue the clear operation, flush the store buffer, and invalidate the cache entry.
² We assume fixed-size network messages in this paper, but CQs can be generalized to variable-length messages in a straightforward manner.
³ Memory barrier operations may be necessary to preserve ordering under weaker memory models.
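Returning to the basic operation: the C sketch below captures the single-sender, single-receiver queue just described, assuming one fixed-size message per cache-block-sized entry. All names and sizes are illustrative.

    #include <stdint.h>
    #include <string.h>

    #define CQ_ENTRIES 512             /* illustrative queue size           */
    #define MSG_BYTES  64              /* one message per 64-byte block     */

    typedef struct { uint8_t data[MSG_BYTES]; } cq_entry_t;

    typedef struct {
        cq_entry_t entry[CQ_ENTRIES];
        /* head and tail reside in separate cache blocks. */
        volatile uint32_t head __attribute__((aligned(64))); /* receiver-written */
        volatile uint32_t tail __attribute__((aligned(64))); /* sender-written   */
    } cq_t;

    /* Sender: returns 0 on overflow. Coherence brings the entry's block
     * into the sender's cache on the write. */
    int cq_send(cq_t *q, const void *msg)
    {
        uint32_t next = (q->tail + 1) % CQ_ENTRIES;
        if (next == q->head)           /* full: tail one entry behind head  */
            return 0;
        memcpy(q->entry[q->tail].data, msg, MSG_BYTES);
        q->tail = next;                /* may need a barrier on weak models */
        return 1;
    }

    /* Receiver: returns 0 if empty; the entry is reusable only after
     * head is explicitly incremented. */
    int cq_recv(cq_t *q, void *msg)
    {
        if (q->head == q->tail)        /* empty: head equals tail           */
            return 0;
        memcpy(msg, q->entry[q->head].data, MSG_BYTES);
        q->head = (q->head + 1) % CQ_ENTRIES;
        return 1;
    }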
Because CQs are simply memory, they have the property that the message sender and receiver have the same interface abstraction whether the other end is local or remote. A local CQ, illustrated in Figure 2, is simply a conventional circular queue between two processors. A remote CQ consists of two local CQs, each between a processor and a CNI device, as illustrated in Figure 3. The head and tail pointers are also managed as cachable memory. A CNI that uses CQs simply acts like another processor manipulating the queue.
The head and tail pointers of the CQs provide a much simpler way to manage reuse than the complex handshake required by CDRs. If there is room in the CQ, then the tail entry can be reused; if the CQ is non-empty, then the head entry is valid. However, even though no locking is required to access the head and tail pointers, a straightforward implementation induces significant communication between sender and receiver. This occurs because the sender must check (i.e., read) the head pointer to detect a full queue, and the receiver must check the tail pointer to detect an empty queue. Because the queue pointers are kept in coherent memory, cache blocks may ping-pong with each check.
We can greatly reduce this overhead using three techniques: lazy pointers, message valid bits, and sense reverse. Lazy pointers exploit the observation that the sender need not know exactly how much room is left in the queue, but only whether there is enough room. The sender maintains a (potentially stale) copy of the head pointer, shadow_head, which it checks before each send. Shadow_head is conservative, so if it indicates there is enough room, then there is. Only when shadow_head indicates a full queue does the sender read head and update shadow_head. If the queue is no more than half full on average, then the sender needs to check head, and incur a cache miss, only twice each time around the array.
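A minimal sketch of the sender's overflow check with a lazy pointer, reusing the cq_t structure above; since shadow_head is sender-private, reading it generates no bus traffic.

    /* Sender-side state: a possibly stale, private copy of head. */
    typedef struct {
        cq_t    *q;
        uint32_t shadow_head;          /* conservative: only underestimates room */
    } cq_sender_t;

    static int cq_has_room(cq_sender_t *s)
    {
        uint32_t next = (s->q->tail + 1) % CQ_ENTRIES;
        if (next != s->shadow_head)
            return 1;                  /* enough room per the stale copy    */
        s->shadow_head = s->q->head;   /* looks full: refresh (cache miss)  */
        return next != s->shadow_head;
    }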
Lazy pointers work much less well for the tail pointer. The receiver must check tail on every poll attempt, to see if the queue is empty. Whenever a message arrives, the receiver's cached copy of tail gets invalidated. Thus in the worst case, each message arrival causes a cache miss on tail. Instead, we use message valid bits, stored either as a single bit in the message header or in a separate word, to allow the receiver to detect message arrivals without ever checking the tail pointer [10, 31]. The valid bits indicate whether or not a cache block contains a valid message. On a poll attempt, the receiver simply examines the first message in the queue (i.e., the one pointed to by head); if it is invalid, the queue is empty. Thus no bus traffic normally occurs in this case. When a valid message is written to the queue, the sender will invalidate the receiver's cached copy, causing a cache miss when the receiver polls again. To complete the handshake, the receiver must clear the message valid bit when it advances head.
Clearing the message valid bit requires the receiver to write the queue entry; thus under a MOESI protocol, the receiver becomes owner of the queue entry's cache block, rather than simply having a shared copy. This normally requires an additional bus transaction. Sense reverse eliminates the explicit clear: sender and receiver each maintain a current sense bit, toggled on each pass around the circular queue, so entries written on the previous pass automatically appear invalid. The sender first checks if the CQ has space and then writes the message followed by its current sense as the message valid bit. The receiver compares its current sense to the valid bit in the message, with a match indicating a valid message. Sense reverse has been previously used for barriers [34] and asynchronous logic, but to our knowledge has never been used for messaging.
Combining all three optimizations minimizes the bus traffic required by CQs. Under a write-invalidation-based MOESI protocol, each block of a message requires one invalidation, to obtain write permission for the sender, and one read miss, to fetch the block for the receiver. The head pointer requires only two invalidation-miss pairs for each pass around the circular queue, assuming the queues are no more than half full on average. The tail pointer is private to the sender and generates no coherence actions.
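The receive-side poll with message valid bits and sense reverse can be sketched as follows, extending the structures above. The essential points are that the entry's sense word doubles as the valid bit (written last by the sender) and that the expected sense flips on wrap-around; the names and layout are illustrative, as is the convention of initializing entries to 0 and both senses to 1.

    /* Each entry carries the sender's current sense as its valid bit. */
    typedef struct {
        uint8_t  data[MSG_BYTES - sizeof(uint32_t)];
        volatile uint32_t sense;       /* written last by the sender        */
    } cq_sr_entry_t;

    typedef struct {
        cq_sr_entry_t entry[CQ_ENTRIES];
        uint32_t head;                 /* read by the sender only via its
                                          lazy shadow_head copy             */
        uint32_t recv_sense;           /* toggled on each wrap-around       */
    } cq_sr_recv_t;

    int cq_sr_poll(cq_sr_recv_t *q, void *msg)
    {
        cq_sr_entry_t *e = &q->entry[q->head];
        if (e->sense != q->recv_sense) /* mismatch: empty; usually a cache hit */
            return 0;
        memcpy(msg, e->data, sizeof e->data);
        if (++q->head == CQ_ENTRIES) { /* wrap: reverse the expected sense  */
            q->head = 0;
            q->recv_sense ^= 1u;
        }
        return 1;                      /* no explicit clear, so the receiver
                                          never takes ownership of the block */
    }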
2.3 Home for CQ entries
In most computer systems, all legal physical addresses map to a home device or memory module. If a block is cachable, for example, then the home is where data are written on cache replacement. Should the home for CDRs or CQ entries be at the CNI, as with a regular device register, or in main memory?
Since CDRs are each a single block and most devices will employ only a few, the logical choice is to provide the home within the device itself. This can also simplify the implementation for some memory buses, because the device may not have to implement all cases in the coherence protocol [36].
CQs, on the other hand, benefit from being large. For example, Brewer et al. have demonstrated that remote queues can significantly improve performance by preventing contention on the network fabric [6]. If the CQ's home is main memory, a less precious resource than hardware FIFOs, then its capacity is essentially infinite. Large queues help simplify protocol deadlock avoidance, at least for moderate-scale parallel machines. Having the CQ home in memory also helps tolerate unreliable network fabrics, since messages need not be removed from the send queue until delivery is confirmed.
To place the CQ home in main memory, we must address three operating system issues. First, a CNI needs a translation scheme to translate CQ virtual addresses to physical addresses in main memory. In this paper, we assume that the operating system allocates CQ pages contiguously, allowing CNIs to use a simple base-and-bounds virtual-to-physical address translation. If the operating system cannot guarantee this, then a more complicated translation mechanism may be necessary [19]. Second, a CNI must ensure that CQ pages always reside in main memory, or be prepared to fetch them from the swap device. For our implementations we assume that CQ pages are "pinned," so that the operating system does not attempt to page them out. Alternatively, we could adopt a more flexible scheme [3, 19] at the expense of a more general address translation mechanism (e.g., a TLB). Finally, there must be some mechanism for the rare case in which even the large amount of memory allocated for a CQ fills up. The simplest option is to block the sender; however, this may lead to deadlock. Alternatively, as proposed for MIT Fugu [32], the CNI device can interrupt the processor, causing it to allocate free virtual memory frames and drain the CQ.
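A base-and-bounds translation of the kind we assume is simple enough to sketch; the structure and field names below are illustrative, not a specification of our hardware.

    #include <stdbool.h>
    #include <stdint.h>

    /* Translation state for one contiguously allocated, pinned CQ region. */
    typedef struct {
        uint64_t vbase;                /* virtual base of the CQ pages      */
        uint64_t pbase;                /* corresponding physical base       */
        uint64_t bytes;                /* bound: size of the region         */
    } cq_xlate_t;

    static bool cq_translate(const cq_xlate_t *x, uint64_t vaddr,
                             uint64_t *paddr)
    {
        if (vaddr < x->vbase || vaddr - x->vbase >= x->bytes)
            return false;              /* outside the CQ region: fault      */
        *paddr = x->pbase + (vaddr - x->vbase);
        return true;
    }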
Making main memory the home is generally infeasible with a coherent I/O bus. Coherent I/O buses [18] allow memory residing on the I/O bus to be cached by the processor. However, they do not allow an I/O device to coherently cache data from the regular processor memory space. It is difficult to change this in the near future because the speed mismatch makes it hard for I/O devices to respond to memory bus snoop requests in a timely fashion.
2.4 Multiprogramming
The demands of multiprogramming require additional support from a network interface. In traditional networking, an NI is a single shared resource virtualized by the operating system. For example, in TCP/IP the operating system multiplexes the hardware device to send and receive network messages to and from many processes. Unfortunately, the operating system's overheads severely limit performance, especially for small messages.
Many multicomputers reduce or eliminate this overhead by mapping the NI directly into the user's address space [1, 15, 29]. Thus the operating system normally need not get involved when messages are sent and received. However, user-mapped NIs significantly complicate support for multiprogramming.
Possible solutions range from disallowing multiprogramming [15], to taking special actions at context switch time (to context switch the NI and network state) [44], to optimistically assuming a message is destined for the current process (reverting to operating system buffering if it is not) [32]. CNIs admit at least two additional options for managing the state of many user-level queues. First, the operating system can unmap additional queues and accept faults when they are accessed. Second, the operating system can allocate a memory-based data structure that CNI hardware can use to find the state for all active queues (like page tables for a TLB fill).
3 CNI Taxonomy and Implementations
This section proposes a taxonomy of network interfaces (NIs) and describes the implementation of the five NIs that we evaluate in this paper. We use the NI queue structure as the main component to enumerate a taxonomy of network interfaces. NI queues are the primary carriers of messages between a processor and its NI. A processor sends messages to the NI through the send queue and receives messages from the NI through the receive queue. For our taxonomy of CNIs we assume that both NI queues have the same structure.
Our taxonomy is modelled after Agarwal et al.'s classification of directory protocols [2]. We use the notation NIiX for traditional NIs and CNIiX for coherent network interfaces that cache the NI queues. The subscript i denotes the size of the NI queue exposed to the processor. The default unit of i is memory/cache blocks, but it can also be specified in 4-byte words by adding the suffix 'w'. The placeholder X can either be empty, Q, or Qm. An empty X represents the simple case where a network interface exposes only part or all of one network message. For CNIs a network message is exposed via CDRs, whose reuse is managed by the explicit handshake described in Section 2.1. X = Q represents the more complex case where the exposed portion of the NI queue is managed as a memory queue with explicit head and tail pointers. X = Qm denotes that the home of the explicit memory-based NI queues is in main memory; the absence of an 'm' implies that the device serves as the home for the NI queues. The first device is our baseline: a traditional NI, derived from the Thinking Machines CM-5 NI, that uses uncached accesses to its data buffers and device registers. Since its accesses are uncached and two four-byte words of the message are exposed, we classify this device as NI2w.
The second NI extends this baseline device by using four CDRs to expose a 256-byte network message. This device, denoted CNI4, exploits the memory bus's block transfer capability to move a message between the processor and the device. However, the status and control registers are uncached. After receiving a message, the processor issues an uncached store to explicitly pop the message off the queue. By always checking the uncached status register, which does not indicate message ready again until the cached copy has been invalidated, the processor and the CNI4 device perform a three-cycle handshake that prevents false hits.
The three-cycle handshake limits the bandwidth achievable by CNI4. The third and fourth NIs solve this problem by employing CQs for message data and regular memory for control and status information (head and tail pointers). CNI16Q and CNI512Q cache up to 16 and 512 blocks, respectively. The memory that backs up the caches resides on the devices themselves. The larger capacity of CNI512Q reduces the number of flow control stalls, increasing performance for applications with many messages in flight.
Sending messages to a CNIiQ device involves three steps: checking for space in the CQ, writing the message, and incrementing the tail pointer. The send is further optimized by sending a message ready signal to the CNI device through an uncached store. As discussed in Section 2.1, uncached stores are more efficient than cache block operations for small control operations. Hence for the send queue, the CNI device does not use virtual polling. Instead, the CNIiQ uses the message ready signal to keep a count of pending messages. This count is incremented on each message ready signal and decremented when the device injects a message into the network. As long as this counter is greater than zero, the CNIiQ device pulls messages out of the processor cache (unless the blocks have already been flushed to their home in the device) and increments the head pointer. On the receive side, the processor polls the head of the queue, reads the message when valid, then increments the head pointer. Both sender and receiver toggle their sense bits when they wrap around the end of the CQ.
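Putting the pieces together, a send to a CNIiQ device might look like the sketch below, which reuses the earlier cq_has_room and cq_send sketches; msg_ready_reg is a hypothetical name for the uncached message-ready register. (A production version would fold the shadow-head check into the send itself rather than re-checking the real head pointer.)

    /* Send path: space check, message write plus tail increment, then a
     * single uncached store as the message ready signal. */
    int cniq_send(cq_sender_t *s, volatile uint32_t *msg_ready_reg,
                  const void *msg)
    {
        if (!cq_has_room(s))           /* step 1: overflow check            */
            return 0;
        cq_send(s->q, msg);            /* step 2: write message, bump tail  */
        *msg_ready_reg = 1;            /* step 3: doorbell; the device
                                          increments its pending count and
                                          pulls the block(s) out of the
                                          processor cache                   */
        return 1;
    }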
The last device, CNI16Qm, caches up to 16 cache blocks on the network interface device, and overflows to main memory as necessary. The total size of the memory-based queue is 512 cache/memory blocks. Having main memory as the home for the CQ simplifies software flow control. Specifically, for the other NIs, whenever the sender cannot inject a message it must explicitly extract any incoming messages and buffer them in memory [6]. Conversely, CNI16Qm does this buffering automatically when the CNI cache cannot contain all the messages. The taxonomy allows memory overflow to occur on both the sending and receiving CNIs. However, for simplicity, this paper only examines memory buffering at the receiver.
[Table 2: Bus occupancy of network interface and memory accesses, in processor cycles.]
The CNI devices implement a variant of virtual polling to minimize the number of bus transactions on the critical path. Specifically, under the bus's write-invalidation-based MOESI protocol, the processor must generate an invalidation signal to acquire ownership of a cache block before it can write to it. Since our CQs are filled in FIFO order, an invalidation signal for any block other than the first block of a multi-block message implies that the processor is done writing the previous cache block. When the CNI device detects an invalidation signal, it issues a coherent read on the previous cache block of the same message. Thus part of the message is transferred to the CNI cache before the processor has completed writing all the cache blocks of the message.
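The device-side logic is hardware, but its effect can be expressed as a short sketch, assuming 64-byte blocks and four-block (256-byte) messages as in our simulations; every name here is illustrative.

    #include <stdint.h>

    #define BLOCK_BYTES    64
    #define BLOCKS_PER_MSG 4           /* a 256-byte network message        */

    typedef struct { uint64_t cq_pbase; } cni_dev_t;

    /* Provided by the device: issue a coherent bus read for a block. */
    void cni_coherent_read(cni_dev_t *dev, uint64_t paddr);

    /* Invoked when the device snoops an invalidation for a send-CQ block.
     * FIFO fill order guarantees the previous block of the same message
     * is complete, so it can be fetched early. */
    void cni_on_snooped_invalidation(cni_dev_t *dev, uint64_t paddr)
    {
        uint64_t idx = (paddr - dev->cq_pbase) / BLOCK_BYTES;
        if (idx % BLOCKS_PER_MSG != 0) /* not the first block of a message  */
            cni_coherent_read(dev, paddr - BLOCK_BYTES);
    }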
4 Methodology
This section describes the system assumptions and benchmarks used to evaluate the five network interface designs. Section 5 presents results from the evaluation.
4.1 System Assumptions
Our simulations model a parallel machine with 16 nodes, each with a 200 MHz dual-issue SPARC processor modelled after the ROSS HyperSPARC, a 100 MHz multiplexed, coherent memory bus, a 50 MHz multiplexed, coherent I/O bus, and a network interface (NI2w or one of the four CNIiXs). Both buses support only one outstanding transaction. The memory bus's coherence protocol is modelled after the MBus Level-2 coherence protocol [24]. Coherence on the I/O bus resembles that of the coherent extension to PCI [18]. An I/O bridge connects the memory and I/O buses. The bridge buffers writes and coherent invalidations, but blocks on reads. When transactions are simultaneously initiated on the two buses, the I/O bridge NACKs the I/O bus transaction to prevent deadlock. Fairness is preserved by ensuring that the next I/O bus transaction succeeds.
The single-level processor cache is 256 KB direct-mapped, with duplicated tags to facilitate snooping and 64-byte address and transfer blocks. The CNI caches are also direct-mapped with 64-byte address and transfer blocks. The CNI cache sizes vary according to the subscript i in the CNIiX nomenclature. Table 2 shows the bus occupancy of our network interface and memory accesses in processor cycles. Since the I/O bus is connected to the processor via the memory bus, the bus occupancy numbers for the I/O bus include the corresponding memory bus occupancy cycles.
Network topology is ignored and network message size is fixed at 256 bytes. All messages take 100 processor cycles to traverse the network from injection of the last byte at the source to arrival of the first byte at the destination. We model hardware flow control at the end points using a sliding window protocol. A processor can send up to four network messages per destination before it blocks waiting for acknowledgments.
To avoid deadlock, the messaging layer extracts incoming messages and buffers them in memory whenever a send blocks [6].

4.2 Macrobenchmarks

Table 3 depicts the five macrobenchmarks used in this study. Spsolve [12] is a very fine-grained iterative sparse-matrix solver in which active messages propagate down the edges of a directed acyclic graph (DAG). All computation happens at nodes of the DAG within active message handlers. The messaging overhead is critical because each active message carries only a 12-byte payload and the total computation per message is only one double-word addition. Several active messages can be in flight, which can create bursty traffic patterns.
Gauss is a message-passing benchmark that solves a linear system of equations using Gaussian elimination [9] . The key communication pattern is a one-to-all broadcast of a pivot row (two kilobytes for our matrix size).
Em3d models three-dimensional electromagnetic wave propagation [13]. It iterates over a bipartite graph consisting of directed edges between nodes. Each node sends two integers to its neighboring nodes through a custom update protocol [16]. Several update messages (with 12-byte payloads) can be in flight, which, like spsolve, can create bursty traffic patterns. Moldyn is a molecular dynamics application whose computational structure resembles the non-bonded force calculation in CHARMM [7]. The main communication occurs in a custom bulk reduction protocol [35], which constitutes roughly 40% of the application's total time with NI2w as the network interface. One execution of the reduction protocol iterates as many times as there are processors. In each of these iterations, a processor sends 1.5 kilobytes of data to the same neighboring processor through Tempest's virtual channels.
Appbt is a parallel three-dimensional computational fluid dynamics application [8] from the NAS benchmark suite. It consists of a cube divided into subcubes among processors. Communication occurs between neighboring processors along the boundaries of the subcubes through Tempest's default invalidation-based shared memory protocol [38].
5 Results
This section examines the network interfaces' performance with two microbenchmarks and five macrobenchmarks.
On the memory bus we simulated all four CNIs plus NI2w. On the I/O bus we simulated all but CNI16Qm, since CNI16Qm cannot be implemented with current coherent I/O buses (Section 2.3). Since coherence is usually not an option on cache buses, we did not simulate CNIs there. For each microbenchmark and macrobenchmark we compare the performance of NI2w on the cache bus with the best of the CNI alternatives: CNI16Qm on the memory bus and CNI512Q on the I/O bus. Since NI2w on the cache bus is closest to the processor, it provides a rough upper bound on the performance achievable with different coherent network interfaces.
5.1 Microbenchmarks
This section examines process-to-process round-trip message latency (Figure 6) and bandwidth (Figure 7) for our five network interface implementations. These numbers include the messaging layer overhead for copying a message from the network interface to a user-level buffer, and vice versa. Thus data begins in the sending processor's cache and ends in the receiving processor's cache, rather than simply moving from memory to memory.

[Figure 6: Round-trip message latency. (a) shows the round-trip message latency for NI2w, CNI4, CNI16Q, CNI512Q, and CNI16Qm (with and without snarfing) on the memory bus. (b) shows the same (except CNI16Qm) on the I/O bus. (c) compares CNI512Q, CNI16Qm, and NI2w on the I/O, memory, and cache buses respectively.]

[Figure 7: Process-to-process message bandwidth (vertical axis) for different message sizes (horizontal axis), expressed as a fraction of the maximum bandwidth two processors on the same coherent memory bus can sustain using a local memory queue (Figure 2).]

Round-Trip Latency

Figure 6 shows the round-trip latency of a message for each of NI2w and the four CNIiXs. It shows two important results. First, CNIs reduce messaging overheads significantly. For small messages, between 8 and 256 bytes, CNI16Qm is 20-84% better than NI2w on the memory bus (Figure 6a) and CNI512Q is 29-141% better than NI2w on the I/O bus (Figure 6b). Second, CNI16Qm on the memory bus increases the latency over NI2w on the cache bus by only 43% (Figure 6c). This is significant, because the CNIs do not require modifications to the processor or processor board.
The four CNIs have similar latencies with minor variations among them. CNI4 performs worst because it polls an uncached status register and must use the expensive three-cycle handshake to invalidate the previous message from the processor cache. CNI16Q and CNI512Q consistently have the lowest latency, due to efficiently polling the cached message valid bit and to using explicit head and tail pointers to amortize the reuse overhead across the entire queue of messages. CNI16Qm's latency is slightly worse (on the memory bus) because when its cache overflows, it must flush (i.e., write back) messages to main memory. A better replacement policy and/or a writeback buffer could help take these flushes off the critical path. However, as we will see later with the macrobenchmarks, CNI16Qm consistently outperforms the other three CNIs on the memory bus due to its ability to overflow messages to main memory instead of backing up the network.
Bandwidth
Figure 7 graphs the bandwidth provided by the five network interfaces. The vertical axes are normalized to the maximum bandwidth two processors on the same coherent memory bus can sustain while transferring data from one cache to the other. For our simulation parameters (Table 2), this bandwidth is 144 MB/s. This is the maximum bandwidth the four CNIs can hope to achieve with our simulation parameters. Figure 7 shows two interesting results. First, CNIs improve the bandwidth over NI2w significantly, even for very small messages. On the memory bus, CNI16Qm is 59-169% better than NI2w for 8-4096 byte messages (Figure 7a). For the same message sizes, CNI512Q is 51-287% better than NI2w on the I/O bus (Figure 7b). Second, NI2w's bandwidth on the cache bus is only 50% more than CNI16Qm's on the memory bus (Figure 7c).
As in the round-trip microbenchmark, all four CNIs have similar bandwidth with minor variations among them. CNI4 performs worst of the four CNIs because of its high overhead for polling uncached registers and the three-cycle handshake in the critical path of message reception. CNI4 shows two different knees on the memory and I/O buses respectively. The knee on the memory bus appears when a message crosses the first cache block boundary and writes to the second cache block. The second cache block is partially empty, resulting in wasted work by CNI4, which must still read the entire block. Since CNI4 reuses CDRs, the processor must wait for CNI4 to complete the entire read (and the three-cycle handshake) before it can write another message. The CQ-based CNIs do not have this problem because instead of blocking, a processor simply writes to the next queue location. This same knee does not show up on the I/O bus because the higher I/O bus access latencies dominate the pipelined transfer time for the cache block. On the I/O bus a different knee appears when CNI4 saturates the I/O bus. CNI16Q and CNI512Q perform the best due to their low poll overhead and ability to cache multiple messages (a network message fits in four cache blocks). However, when the message size reaches two kilobytes, CNI16Q's performance on the I/O bus dips slightly. This is because the small queue size forces frequent updates to the shadow head pointer on the receive queue, which in turn creates contention at the I/O bridge. CNI512Q does not exhibit this problem because its larger queue requires less frequent updates to the shadow head.
CNI16Qm achieves slightly lower bandwidth than CNI512Q. This is because the message send rate is significantly higher than the message reception rate, causing the receiving CNI16Qm's cache to overflow. The resulting writebacks to main memory induce moderate bus contention, which decreases the maximum communication bandwidth. Unfortunately, because the problem is bandwidth, not latency, a writeback buffer will not help with this microbenchmark as it would for the round-trip microbenchmark.
However, an alternative technique, called data snarfing [17, 14], can potentially improve both latency and bandwidth. In data snarfing, a cache controller reads data in from the bus whenever it has a tag match (i.e., space allocated) for a block in the invalid state. Thus in our microbenchmark, the processor cache on the receive side simply snarfs in the cache blocks that CNI16Qm writes back to memory. This eliminates many of the invalidation misses on the receive cachable queue and improves the bandwidth by as much as 45% (Figure 7a). We also expect that an update-based coherence protocol would have similar behavior. However, while data snarfing significantly improves microbenchmark performance, we found it had little effect on macrobenchmark performance and do not examine it further.
The absolute bandwidth offered by CNIs can improve significantly with a more aggressive system. With our simulation parameters (200 MHz processor, 100 MHz memory bus, 64-byte cache blocks, and 230 ns cache-to-cache transfer) the maximum bandwidth achieved by CNI512Q on the memory bus is 107 MB/s. This represents over 70% of the bandwidth achievable between two processors on the same coherent memory bus. More aggressive system assumptions, such as non-blocking caches, bigger cache blocks, prefetch instructions, support for update protocols, and a pipelined or packet-switched bus, would significantly improve this absolute performance.

5.2 Macrobenchmarks

Figure 8 shows the performance gains from CNIs for the five macrobenchmarks described in Section 4.2.
CNI4, CNI16Q, CNI512Q, and CNI16Qm offer a progression of incremental benefits over NI2w. Unlike NI2w, which can only be accessed through uncached loads and stores, CNI4 effectively utilizes the memory bus's high-bandwidth block transfer mechanism by transferring messages in full cache block units. CNI16Q and CNI512Q further reduce overhead by amortizing the three-cycle handshake over an entire queue of messages. The larger capacity of the CQs also helps prevent bursty traffic from backing up into the network. CNI16Qm further simplifies software flow control in the messaging layer by allowing messages to smoothly overflow to main memory when the device cache fills. This avoids processor intervention for message buffering, which could otherwise significantly degrade performance [25].
Block Transfer. The increase in bandwidth obtained by transferring messages in whole cache block units has a major impact on performance. Gauss and moldyn do bulk transfers, and appbt communicates with moderately large (128-byte) shared-memory blocks. Gauss performs a one-to-all broadcast of a 2 KB row, while moldyn's reduction protocol transfers 1.5 KB of data between neighboring processors. CNI4 improves gauss's performance by 39% and 46%, moldyn's performance by 42% and 20%, and appbt's performance by 10% and 11% on the memory and I/O buses respectively. Even for spsolve and em3d, which send small messages (12-byte payload), CNI4's performance improvement over NI2w is significant (between 13% and 21%).
CNI4's performance improvement for moldyn on the I/O bus is not as high as on the memory bus because of contention at the I/O bridge. The NI2w device never tries to acquire the memory or I/O bus because it is always a bus slave. However, the CNI4 cache competes with the processor cache to acquire the memory and I/O buses. Simultaneous bus acquisition requests at the I/O bridge from the processor cache and the CNI4 cache create contention. This effect is severe in moldyn because message sends, message receives, and polls on uncached device registers are partially overlapped in moldyn's bulk reduction phase. Thus, the memory bus occupancy for a system with CNI4 on the I/O bus compared to a system with NI2w on the I/O bus decreases by 41% in gauss, but by only 15% in moldyn.
Overall, for the five macrobenchmarks, CNI4 improves performance over NI2w by 10-42% on the memory bus and 11-46% on the I/O bus. This amounts to 28-92% of the total gain achieved by our CNIs on the memory bus and 25-52% of that on the I/O bus.
Extra Buffering. The CQ-based CNIs provide extra buffering that helps smooth out bursts in message traffic. However, CNI16Q and CNI512Q cannot always take advantage of this feature. The problem is that once a sender blocks, the flow control software aggressively buffers received messages in memory. This results in messages being pulled out of the CNI's cache, even when there is still room for additional messages. Further, because of its small queue size, CNI16Q frequently updates its shadow head by reading the processor's head pointer, which creates bus contention. Because of these effects, CNI16Q and CNI4 achieve roughly the same performance on the memory bus. CNI512Q's larger queue reduces the frequency of shadow head updates. For em3d this improves CNI512Q's performance over CNI16Q by 29% on the memory bus.
On the I/O bus, the higher latencies mitigate the effects of overly aggressive buffering by slowing down the rate at which messages are extracted and buffered. This allows the CQ-based CNIs to exploit their buffering and smooth out the bursty traffic of all five macrobenchmarks. In spsolve and em3d, several small active messages (with 12-byte payloads) can be in flight simultaneously, causing bursts in message arrival. In gauss and moldyn, periodic bulk transfers cause the bursts. Request-response protocols normally do not cause bursts; however, appbt exhibits a hot spot in which one processor receives twice as many messages as other processors. Thus, CNI16Q improves the performance of spsolve, gauss, em3d, and appbt over CNI4 on the I/O bus by 15%, 26%, 11%, and 16% respectively. For moldyn, frequent updates of the shadow head cause contention at the I/O bridge and actually slightly reduce CNI16Q's performance.
But the extra buffering and infrequent updates of the shadow head result in CNI512Q improving performance by 13%, 31%, and 51%, respectively, over CNI16Q for spsolve, em3d, and moldyn.
Overflow to Memory. CNI16Qm allows messages to smoothly overflow to memory when the device cache fills up. This eliminates the overly aggressive message buffering that was a problem for CNI16Q and CNI512Q. This automatic buffering improves spsolve's performance over CNI512Q by 20%. For the other four macrobenchmarks, CNI16Qm is slightly better than CNI512Q. Thus, CNI16Qm consistently outperforms CNI512Q on the memory bus even with significantly less memory (i.e., cache) on the device.
On the memory bus, CNI16Qm shows the best overall performance improvement (between 17% and 53%), while CNI512Q shows the best improvement (between 30% and 88%) on the I/O bus. Also, CNI16Qm on the memory bus comes within 4% of NI2w's performance on the cache bus for spsolve, gauss, and moldyn, and within 17% for appbt (Figure 8). For em3d, CNI16Qm on the memory bus slightly outperforms the cache-bus NI2w, because NI2w has limited buffering in the device and the processor must explicitly buffer messages in memory. These results indicate that CNI16Qm is an attractive alternative because it is feasible with most commodity processors and requires no change to the processor core or board.
Finally, CNIs significantly reduce memory bus occupancy. By polling on cached registers and transferring messages in full cache block units, CQ-based CNIs on the memory bus reduce memory bus occupancy by as much as 66% (averaged over the five macrobenchmarks) compared to NI2w. In comparison, CNI4 reduces memory bus occupancy by only 23% because it still requires the processor to poll across the memory bus.
6 Related Work
Coherent network interfaces differ from most previous work on program-controlled network I/O in three important respects. First, unlike other NIs, CNIs interact with processor caches and main memory primarily through the node's coherence protocol. Second, CNIs separate the logical and physical locations of NI registers and queues, allowing processors to cache them like memory. Third, CNIs provide a uniform memory-based interface for both local and remote communication.

[...] neither supports coherence nor allows its registers or queues to be cached. The processor chip interfaces with the rest of the system through the NI. Unlike other machines, the DI-multicomputer supports a uniform message-based interface for both memory and the network, whereas CNI uses the same memory interface for both memory and network.
Unlike many other NIs, our implementation of CNIs does not require changes to an SMP board or other standard components. Yet CNIs enable processors and network interfaces to communicate through the cachable memory accesses for which most processors and buses are optimized. Henry and Joerg [21] [...]
7 Conclusions
This paper explored using snooping cache coherence to improve communication performance between processors and network interfaces (NIs). We call NIs that use coherence coherent network interfaces (CNIs). We restricted our study to NIs/CNIs that reside on memory or I/O buses, to NIs/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.
We developed two mechanisms that CNIs use to communicate with processors. A cachable device register allows information to be exchanged in whole cache blocks and permits efficient polling, where cache misses (and bus transfers) occur only when status changes. Cachable queues reduce reuse overhead by using an array of cachable, coherent blocks managed as a circular queue and (optionally) optimized with lazy pointers, message valid bits, and sense-reverse.
We then compared four alternative CNIs (CNI4, CNI16Q, CNI512Q, and CNI16Qm) with a CM-5-like NI. Microbenchmark results showed that CNIs significantly improved the round-trip latency and bandwidth of small and moderately large messages. For small message sizes, between 8 and 256 bytes, CNIs improved the round-trip latency by 20-84% compared to NI2w on a coherent memory bus and by 29-141% on a coherent I/O bus. For moderately large messages, between 8 and 4096 bytes, CNIs improved bandwidth by 59-169% over NI2w on a coherent memory bus and by 51-287% on a coherent I/O bus. Macrobenchmark results showed that CNI16Qm performed the best on the coherent memory bus and CNI512Q on the coherent I/O bus. CNI16Qm was 17-53% better than NI2w on the memory bus, while CNI512Q was better than NI2w by 30-88% on the I/O bus. Also, CNI16Qm on the memory bus came within 17% of NI2w's performance on the cache bus. This indicates that CNI16Qm is an attractive alternative because it is feasible with most current commodity microprocessors and requires no change to the processor core or board. Our experiments use assumptions that are reasonable for commodity parts in the present and near future. In the medium term, our quantitative results will likely be obviated by better memory interconnects that pipeline requests, allow out-of-order responses, or even abandon physical buses. Nevertheless, we expect our qualitative results in favor of CNIs to continue to hold, as CNIs exercise memory interconnects with the operations the interconnects are optimized for, namely, coherent block transfers. In the longer term, caches, memory buses, NIs, and memory may move onto processor chips (or, in another view, everything moves onto memory chips). To manage complexity, however, these super chips may resemble the boards of old systems, with die area devoted to a custom mix of relatively standard, optimized components (e.g., processors and DRAM) interconnected through well-defined interfaces. While integrating an NI into a processor is possible, CNIs will be interesting as a less expensive (in terms of design and verification costs) way to deliver competitive performance.
