This work quantifies how persistent increases in processor speed compared to I/O speed reduce the performance gap between specialized, high performance messaging layers and general purpose protocols such as TCP/IP and UDP/IP. The comparison is important because specialized layers sacrifice considerable system connectivity and robustness to obtain increased performance. We first quantify the scaling effects on small messages by measuring the LogP performance of two Active Message II layers, one running over a specialized VIA layer and the other over stock UDP, as we scale the CPU and I/O components. We then predict future LogP performance by mapping the LogP model's network parameters, particularly overhead, into architectural components. Our projections show that the performance benefit afforded by specialized messaging for small messages will erode to a factor of 2 in the next 5 years. Our models further show that the performance differential between the two approaches will continue to erode without a radical restructuring of the I/O system. For long messages, we quantify the variable per-page instruction budget that a zero-copy messaging approach has for page table manipulations if it is to outperform a single-copy approach. Finally, we conclude with an examination of future I/O advances that would result in substantial improvements to messaging performance.
Introduction
This work considers the impact of expected architectural trends on messaging performance. In recent years, much research has focused on the design and implementation of specialized messaging systems [6, 36, 42, 45], reducing the fixed cost of sending and receiving messages to tens of instructions and a handful of bus operations. Increased performance, however, is typically gained only by trading off connectivity (the variety and number of potential entities a given messaging system can send to and receive from) and robustness. This trade-off is easy to understand in the context of parallel programming: connectivity and protocol robustness are secondary to performance because (i) the performance of many fine to medium-grained parallel programs is highly sensitive to communication performance [43], (ii) high connectivity is superfluous because processes of a parallel program only need to communicate with their peers running on the same system, and (iii) parallel programs and machines are carefully designed so that messages are lost only very rarely [7, 22]; a message loss is typically treated as a catastrophic event: either the program or the system will crash and need to be restarted. The need for a transport layer is thus eliminated by the programmer's fault model and the application requirements.
For the emerging class of large-scale distributed servers (e.g., web services), however, robustness and high connectivity are at least as important as performance. Three factors make protocol robustness critical: (i) these servers have very high availability requirements (e.g., minutes of down-time per year), implying that even occasional message loss cannot be catastrophic; (ii) intraserver communication depends on external client service demands, making it extremely difficult to exert enough control over the system "by design" to avoid message loss; and (iii) many commodity LANs do not implement sufficient hardware flow control to always prevent loss inside the network under arbitrarily adverse communication patterns.
High connectivity is also important because it allows system architects to select from a wide variety of hardware and software components, especially as such components evolve over time. It also allows designers to hedge technology risk, which is often ignored in a research setting, but is critical in an industrial context. For example, a messaging layer employing TCP/IP over Ethernet has extremely high connectivity, allowing the underlying cluster to be built using operating systems, network interfaces (NI), and switches from multiple vendors. On the other hand, a specialized messaging system that depends on a particular LAN with custom cards, switches, and protocols does not provide such high connectivity, so a site using this technology depends on these technologies lasting a long time, and perhaps on a single vendor being successful in the long run. While recent standards such as VIA [18] provide connectivity at the source code level, this is a far cry from the massive connectivity provided by the drivers, busses, media access control, and physical layers of general purpose networking. An example of the low connectivity of specialized messaging appears when trying to use VIA for a distributed Java-based server. The out-of-the-box Java runtime cannot take advantage of this SAN layer without either building special software connectors or running general purpose messaging over it.
In this paper, we argue that current technology trends are bringing the performance of general purpose communication systems close enough to that of specialized systems that designers of large-scale servers should seriously consider whether the added performance justifies the loss of connectivity and robustness. Note that we are not advocating the abandonment of specialized messaging systems. Rather, we believe that the choice between specialized and general purpose messaging needs to be considered carefully on a case-by-case basis. We break our analysis for this argument into two parts: one for small messages and the second for large messages.
For small messages, the performance of general purpose messaging systems has been steadily increasing while that of specialized layers has not. Figure 1 shows the LogP performance for small messages of four Active Message layers from 1992 to 2000, spanning four processor generations [11, 41, 43]. All these layers provide roughly the same interface, and each one provides isolation of messages between users. CMAM [43] ran on an MPP, while the others ran on workstation or PC hardware. Note that absolute performance has changed little since 1992 even though processor speed has roughly doubled each generation, going from 33, to 60, 167, and finally 400 MHz.
Intuitively, the reduction in the performance gap between general purpose and specialized messaging systems is easy to understand: the high cost of general purpose messaging arises from the large number of protocol instructions [28]. As processor speed increases, the cost of executing instructions drops correspondingly. Critically, however, I/O devices have not kept pace with processor speed! Consider that in 1992, a typical processor and I/O bus (e.g., a SPARC 2) both ran at 33 MHz. At these speeds, the cost of executing 2500 instructions dwarfs the cost of a few accesses over the I/O bus. Currently, on a 550 MHz Pentium III machine, these same 2500 instructions can be executed in less than twice the time required for the same few I/O accesses. Architectural trends imply that we can expect this speed differential to grow with time: processor performance doubles roughly every 2 years while I/O performance doubles every 4-7 years. Thus, Amdahl's law implies that the relative improvement gained by messaging systems which simply reduce instructions, as opposed to using more radical architectural innovations, will diminish with time.
In Section 2.2, we characterize the costs of sending and receiving small messages for two messaging systems using the LogGP model [3, 12]. Both systems export the AM-II interface [30]; one implements AM over a VIA LAN, the second implements AM on UDP/IP over Ethernet, giving the two layers significantly different levels of connectivity. We restrict ourselves to the AM-II interface because it contains many features needed by applications demanding more than just performance: thread support, blocking primitives, and error handling. We derive a simple cost model for mapping the LogGP parameters to two stock architectural components, the processor and the I/O interconnect; we also derive scaling rules for these components. We then use our model to study the effects of scaling on messaging costs over four speeds of Pentium-based PCs (233 MHz Pentium II to 550 MHz Pentium III). Next, we use the model to predict future communication costs.
Our results show that the performance differential between specialized and general purpose messaging systems has been decreasing steadily and will continue to decrease because of mismatching trends in processor and I/O speeds. Our projections show that a specialized messaging system will maintain a 2x performance advantage over general-purpose messaging. However, this is significantly less than the performance differential of 10x only six years ago [31].
In Section 4, we consider the cost of sending and receiving large messages. In particular, we consider architectural scaling effects on three messaging architectures: single-copy, zero-copy, and shared-memory communication. Single-copy is the simplest strategy and most current implementations of TCP and UDP use this approach. The second approach, zero-copy, manipulates the kernel page tables to allow sharing of user pages with the NI while maintaining normal copy semantics via copy-on-write. In shared-memory communication, the sender and receiver agree to transparently share a block of memory, whose mapping is supported by reverse page table manipulations on the NI.
Zero-copy is attractive compared to single-copy when the variable overhead of page-table manipulation is less than the per-page copy cost [40]. We refer to this cross-over point as the OS page budget. To examine the effects of architectural scaling, we estimate the number of instructions needed to set up a zero-copy transmission and compare this cost against copy costs as CPU speed and memory bandwidth scale through time. Our results show that copy reduction via software and architectural enhancements (e.g., checksum registers) remains important to obtaining high bandwidth. Interestingly, however, our model shows that the OS page budget will scale only slowly with time.
Methodology
In this section, we first describe the LogGP model we use to characterize network performance. We then document our experimental apparatus and describe how we map LogGP parameters into architectural events, such as instruction count. We also document how we had to modify the interpretation of our microbenchmark results to fit our protocols into the LogGP framework.
The LogGP Model
When investigating communication architectures, it is important to recognize that the cost of each operation breaks down into portions that involve different resources: the processor, the memory, I/O busses, the network interface, and the actual network switches. However, it is also important that the communication cost model not be specific to particular hardware/software implementations. We use the LogGP model [3, 12] because it provides a middle ground by characterizing the performance of the key resources but not their structure.
LogGP characterizes a communication system using five parameters (Figure 2):
L: the latency, or delay, incurred in communicating a message containing a small number of words from its source processor/memory module to its target. The latency includes the time spent in the network interfaces and fabric, but not in the processor.
o: the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message; during this time, the processor cannot perform other operations.
g: the gap, defined as the minimum time interval between consecutive message transmissions or consecutive message receptions at a module; this is the time it takes for a message to cross through the bandwidth bottleneck in the system.
G: the Gap, or time-per-byte for long messages. The inverse of G is the peak bandwidth. This parameter was added [3] because many platforms have special acceleration for long messages (e.g., DMA).
P: the number of processor modules.
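To make the roles of the parameters concrete, the following is a minimal Python sketch of how LogGP values combine into message times. It assumes the standard LogGP estimates (one small message costs the send overhead plus latency plus the receive overhead, and a long message adds (k - 1)G for k bytes), splits o into separate send and receive overheads, and ignores any protocol-level acknowledgements. The class name, method names, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGP:
    """LogGP parameters; times in microseconds, G in microseconds per byte."""
    L: float    # network latency
    o_s: float  # send overhead
    o_r: float  # receive overhead
    g: float    # gap between consecutive small messages
    G: float    # Gap (time per byte) for long messages
    P: int      # number of processor modules

    def one_way_small(self) -> float:
        """End-to-end time for one small message: o_s + L + o_r."""
        return self.o_s + self.L + self.o_r

    def one_way_long(self, nbytes: int) -> float:
        """End-to-end time for one long message of nbytes bytes."""
        return self.o_s + (nbytes - 1) * self.G + self.L + self.o_r

    def peak_bandwidth(self) -> float:
        """Peak bandwidth in bytes per microsecond (the inverse of G)."""
        return 1.0 / self.G

    def small_message_rate(self) -> float:
        """Maximum small messages per microsecond (the inverse of g)."""
        return 1.0 / self.g


# Hypothetical values, not measurements from this work.
net = LogGP(L=10.0, o_s=5.0, o_r=8.0, g=21.0, G=0.01, P=2)
print(net.one_way_small())
print(net.one_way_long(4096))
print(net.peak_bandwidth())
```

The sketch keeps the parameters separate from any hardware description, which mirrors the reason the model is used here: it characterizes the performance of the key resources without fixing their structure.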
Experimental Setup
We study two communication systems, AM-VIA and AM-UDP. Both systems export the AM-II API [30]. Figure 3 shows the LogGP parameters of the two systems. Both AM layers implement a three-way request/reply primitive for robustness, which AM-UDP also uses for correct handling of lost messages without requiring per-node buffering that grows with the number of processors [11, 31, 41]. Both AM layers implement flow control to avoid buffer overflows.
Characterizing Performance
After casting our communication systems into the LogGP model, we break down the LogGP parameters into their architectural costs. Our approach is to use the processor's hardware event counters [24] to charge various hardware events to each parameter of the LogGP model. Specifically, we measure the following events:
• The number of instructions decoded.
• The number of external bus accesses to memory space; all memory bus transactions are considered to have moved across the I/O bus as opposed to being memory accesses (see footnote 1).
• The number of external bus accesses to I/O space. The x86 instruction set defines a 64K I/O space that is separate from memory space. Unfortunately, not all I/O operations are observable using this count, as operations to memory-mapped devices appear in memory space, not I/O space.
• The number of non-halted cycles.
• The number of interrupts.
Footnote 1: The memory bus is only critical when the cache is not large enough to hold the messages being transferred. For small messages, the cache was sufficient.
We use the ubench benchmark [13] to measure the LogGP values for our communication systems. This benchmark first measures the send overhead, o_s, by direct measurement. Next, it measures the gap, g. From these two parameters and knowledge of the protocol, we then infer the receive overhead, o_r. Finally, we derive the latency, L, by measuring the round-trip time (RTT) and subtracting out the overheads.
send overhead: o_s is measured by sending a small burst of messages and computing the average cost per message. We charge events based on the average number of events per message measured during the burst. For example, if we send a burst of 4 messages and observe 3,200 instructions, the instruction count for o_s would be 800. A number of complications arose when we took this approach, however.
The first complication arose during the measurement of o_s for AM-UDP on the 550 MHz machine. When the tulip driver initiates a send, it quickly writes the send across the I/O bus and then immediately returns control to the user process. When the card receives this request and processes it, it responds to the processor with an interrupt, at which time the remainder of the driver's work to send the message is performed (i.e., cleaning up the send queue).
A 550 MHz processor is fast enough that it often completes a small burst without ever getting an interrupt. This implies that the Ethernet chipset used, the DEC/Intel 21140, overlaps L with part of o_s, reducing the accuracy of our LogGP model; a more accurate model would account separately for the initial overhead, o_s^i, and the cost of the interrupt processing, o_s^f. The effect of this architecture on our probing benchmark is that it lowers the observed o_s to o_s^i, yet we must also count o_s^f to arrive at an accurate cost for o_s. In order to model o_s with reasonable precision without expanding the model, we charge one interrupt for each o_s. To compute the interrupt cost, we measured the number of interrupts received in a burst and found the cost of sending when one interrupt was received per send and no messages had been received by the end of the burst.
A complication also arose in estimating o_s for AM-VIA. On a slow machine, the acknowledgement shown in Figure 3 arrives before the burst is over. Thus, the ack processing time adds to the observed o_s. For example, Table 1 shows the measured number of I/O operations dropping with processor speed. However, the sum of o_s and o_r remains constant. In order to create a model of o_s independent of these effects, we use the architectural event counts for o_s based on the measured 400 MHz machine results.
gap: For a burst with only a small number of sends, the average messaging cost defines the send overhead, o_s. In bursts with large numbers of messages, the cost of each send approaches the steady-state initiation interval g. For both our communication systems, the bottleneck is the send and receive overheads. Figure 3 shows that in both our layers, an "extra" message is required in the three-way protocol. Taking into account this extra message, g is always equal to the steady-state cost of the slower side, which, assuming that o_r > o_s, is o_s + 2o_r.
receive overhead: We can compute o_r in units of time by using measured values of o_s and g, since g = o_s + 2o_r. To measure hardware events such as instructions executed, however, we only count events that are occurring on the sending side. Thus, for an event E, for AM-VIA, E_measured = E_os + 2E_or, while for AM-UDP, E_measured = 2E_os + E_or (see Figure 3). Since we can measure E_os, we simply solve for E_or.
Table 1. Measured messaging costs in AM-VIA. Measured number of user instructions, kernel instructions, cycles, and I/O operations for AM-VIA. The average number of operations is shown for a short burst size of 4 messages and a long burst size of 128 messages. The reported event counts do not include any of the adjustments described in Section 2. The 300 and 400 MHz machine results have been omitted due to space constraints.
Table 2. Measured messaging costs in AM-UDP. Measured number of user instructions, kernel instructions, cycles, and I/O operations for AM-UDP. The average number of operations is shown for a short burst size of 4 messages and a long burst size of 128 messages. The reported event counts do not include any of the adjustments described in Section 2. As above, the 300 and 400 MHz machine results have been omitted due to space constraints.
Again, we had to account for the tulip driver's use of interrupt processing to reduce overhead on the 550 MHz machine. Furthermore, we also had to consider that the driver is able to consolidate work by servicing multiple sends in a single interrupt. For example, in steady state operation the driver is able to service two sends per interrupt on average. Of the two measured interrupts, we charge one to the receive and one to both sends.
Latency: Measuring L with a 3-way protocol again requires care. We can see from Figure 3 that there is a critical path for a round-trip message. Assuming that other work is perfectly overlapped, we know the critical path of an RTT is composed of only o_s, o_r, and L. Having measured the RTT and o_s, and derived o_r, we can compute L for both protocols from: RTT = 3o_s + 2o_r + 2L.
We only measure L in units of time since all events occur on the NIC and are unobservable from the CPU.
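The derivations above reduce to solving three linear relations. The following Python sketch shows that arithmetic under the relations as stated in the text: g = o_s + 2o_r, RTT = 3o_s + 2o_r + 2L, and the sender-side event equations E_measured = E_os + 2E_or for AM-VIA and E_measured = 2E_os + E_or for AM-UDP. The function names and the sample values are hypothetical; they are not ubench code or measured results.

```python
def receive_overhead(g: float, o_s: float) -> float:
    """Solve g = o_s + 2*o_r for o_r (all values in the same time unit)."""
    return (g - o_s) / 2.0


def latency(rtt: float, o_s: float, o_r: float) -> float:
    """Solve RTT = 3*o_s + 2*o_r + 2*L for L."""
    return (rtt - 3.0 * o_s - 2.0 * o_r) / 2.0


def receive_events(e_measured: float, e_os: float, layer: str) -> float:
    """Solve the sender-side event equation for E_or.

    AM-VIA: E_measured = E_os + 2*E_or
    AM-UDP: E_measured = 2*E_os + E_or
    """
    if layer == "AM-VIA":
        return (e_measured - e_os) / 2.0
    if layer == "AM-UDP":
        return e_measured - 2.0 * e_os
    raise ValueError(f"unknown layer: {layer}")


# Hypothetical timings in microseconds and hypothetical event counts.
o_s = 5.0
o_r = receive_overhead(g=21.0, o_s=o_s)
L = latency(rtt=51.0, o_s=o_s, o_r=o_r)
print(o_r, L)
print(receive_events(2600.0, 800.0, "AM-VIA"))
print(receive_events(2600.0, 800.0, "AM-UDP"))
```

Because o_r and L are obtained by subtraction, any error in the measured o_s propagates into both, which is why the interrupt and acknowledgement adjustments to o_s described earlier matter.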
Gap: To measure G, we send bursts of large messages, each with a fixed size. We then derive the bandwidth from the steady-state initiation interval and message size.
Small Message Scaling
In this section, we characterize the performance of our two messaging systems for small messages using the LogP model. We do not consider G as this parameter deals only with sending and receiving large messages.
Measured LogP Scaling
Figure 4 plots the measured LogP parameters of our two messaging systems against processor speed. Clearly, there are substantial performance advantages to using a specialized messaging system with hardware support, even when using a fairly complex protocol such as AM-II as the transport layer. When we examined the combined overhead o_s + o_r, we found that the ratio of AM-UDP overhead to AM-VIA overhead dropped from about 4.0 on the 233 MHz machine to 3.3 on the 400 MHz machine. When we look at this same ratio for the 550 MHz machine, it jumps back up to 3.7, due to the increase of the memory bus speed.
To better understand why the performance differential between AM-VIA and AM-UDP is decreasing, we break o_s and o_r into their component costs, including instructions and I/O operations for both user and kernel space. We ignore L because previous work has shown that most applications can effectively overlap L with other computation [13]. Furthermore, Figure 4 shows that L is relatively constant for both AM-VIA and AM-UDP and so cannot be the cause of the decreasing performance differential. The measured L also serves as a check on our methodology: as expected, it stays relatively constant because the NIs and switches are the same throughout our experiments. We characterize o_s and o_r in architectural terms using a cost model (Equation 1) in which I_CPU is the number of instructions executed and CPI_CPU is the average number of processor cycles required to execute each instruction in I_CPU.
We measure the components in Equation 1 as follows. We use the Pentium performance counters to measure the number of cycles, instructions, memory accesses, and I/O operations as explained in Section 2. We assume that each I/O operation is generated by 1 instruction and that CPI_I/O equals 9 I/O bus cycles (based on the PCI specifications).
We note that our assumptions imply that multiple I/O operations are never overlapped or combined, making the actual cost of I/O potentially less than the model predicts. We believe that this error is not a significant component compared to the divergence of CPU and I/O speeds.
Tables 1 and 2 give the measured instruction, cycle, and I/O counts for AM-VIA and AM-UDP, respectively. Figure 5 plots the percentage breakdown of o_s and o_r. This figure shows that the fraction of time spent performing I/O operations is increasing. In the case of the 233 MHz processor, we see that AM-UDP spends approximately 11% of its time on these operations, and AM-VIA spends 21% of its time on them. When we move to the 550 MHz platform, we see that the fraction of time spent on I/O operations has gone to 25% for AM-UDP and 34% for AM-VIA. Correspondingly, the relative cost of executing protocol instructions vs. I/O operations is decreasing.
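The breakdown above can be illustrated with a small Python sketch. It assumes that the cost model has the additive form: overhead = I_CPU x CPI_CPU / f_CPU + N_I/O x CPI_I/O / f_bus, with I/O operations never overlapped, which is our reading of the definitions in the text rather than a formula quoted from it. The instruction count, CPU CPI, and I/O operation count are hypothetical placeholders, not values from Tables 1 and 2; only the 9 bus cycles per I/O operation and the 33 MHz bus come from the text.

```python
def overhead_components_us(instr, cpi_cpu, cpu_mhz, io_ops,
                           cpi_io=9, bus_mhz=33.0):
    """Return (cpu_time, io_time) in microseconds for one message.

    Assumed model: cycles divided by a clock rate in MHz gives
    microseconds, and I/O operations are never overlapped or combined.
    """
    cpu_time = instr * cpi_cpu / cpu_mhz
    io_time = io_ops * cpi_io / bus_mhz
    return cpu_time, io_time


def io_fraction(instr, cpi_cpu, cpu_mhz, io_ops, cpi_io=9, bus_mhz=33.0):
    """Fraction of the modeled overhead spent on I/O bus operations."""
    cpu_time, io_time = overhead_components_us(
        instr, cpi_cpu, cpu_mhz, io_ops, cpi_io, bus_mhz)
    return io_time / (cpu_time + io_time)


# Hypothetical counts for one send; not taken from Tables 1 and 2.
INSTR, CPI_CPU, IO_OPS = 2000, 2.0, 10

for mhz in (233, 300, 400, 550):
    print(mhz, round(io_fraction(INSTR, CPI_CPU, mhz, IO_OPS), 3))
```

With the bus term fixed and the CPU term shrinking as the clock rate rises, the I/O fraction grows monotonically, which is the qualitative trend reported for both layers.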
Predicted LogP Scaling
While protocol execution still accounts for a significant percentage of o_s and o_r on our fastest machine, the trend seems to suggest that, unless other architectural changes are made, I/O operations will eventually become the dominant factor in the performance of small messages. To follow this trend into the future, we derive scaling rules for the processor and I/O bus for the next five years as follows:
processor speed: To predict processor speed, we extrapolate based on the past 7 years of performance data on x86 processors. Examining the clock rate and performance data from [32, 38], we can observe a rough rule of thumb of a 40% increase per year in clock rate and SPECint ratings. This level of improvement roughly corresponds to the "aggressive" predictions in [2]. At the time of this writing, 1 GHz processors are available. If processor speed doubles every 2 years (i.e., compounding at 40% per year), then we would expect 3 GHz processors to be available by summer 2003 and 5 GHz by 2005. We assume that architectural enhancements, such as larger caches and microarchitecture improvements, will keep CPI for messaging at roughly the same levels as today.
protocol instructions: Barring a revolution in OS design or messaging software, the number of instructions should remain relatively constant. We do not model any drop in the number of instructions.
I/O bus speed: The growth trend of I/O bus speed has been considerably different from that of CPU speed, demonstrating increases every 4 to 7 years (e.g., ISA in 1984, EISA in 1988, 33 MHz PCI in 1993, 66 MHz PCI in 2000). One factor that dampens the growth of bus speed is the fact that the number of slots available per I/O bridge tends to decrease as bus speed increases, increasing the cost of a fixed number of I/O slots. For example, at 133 MHz, each PCI bus can support at most one PCI slot, requiring multiple PCI busses for more slots. We thus believe that the standard bus speed for PCs over the next five years will remain at 66 MHz.
I/O operations: In addition to scaling I/O bus speed, we must also consider the CPI of I/O operations as the bus width increases. The current 33 MHz PCI bus on our machines is 32 bits wide and requires 9 cycles per operation between the CPU and a PCI card. In the future, we expect the bus width to double to 64 bits, but the number of cycles to transmit a given instruction will remain the same. However, a driver may batch multiple I/O operations to take advantage of the wider bus. Such batching will vary with card architecture as well as driver implementation. To account for this potential optimization, we adjust the number of bus clock cycles per I/O operation from 9 to 7.
Figures 6 and 7 give the predicted scaling of o_s and o_r and their component costs. Figure 6 shows that the performance gap between specialized and general purpose messaging systems will continue to decrease, dropping from a ratio of 3.30 (for the sum of o_s and o_r) at 233 MHz to 2.39 at 550 MHz to 2.12 at 5 GHz (see footnote 2). Figure 7 shows that this decrease in the performance gap is due to the increasing importance of I/O operations. At a processor speed of 5 GHz, I/O operations account for 53% of the overhead for AM-UDP and for 65% of the overhead for AM-VIA.
If this trend continues beyond our projection period of 5 years, then only a difference in the number of I/O instructions (and the corresponding CPI) will produce a significant difference in performance; the amount of time necessary to perform the user and kernel instructions will become insignificant.
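The scaling rules can be combined into a short projection. The Python sketch below compounds clock rate at 40% per year from 1 GHz, holds the bus at 66 MHz with 7 bus cycles per I/O operation, and evaluates the overhead ratio between a general purpose and a specialized layer. It assumes an additive cost model (CPU time plus non-overlapped I/O bus time), a base year of 2000 for the 1 GHz processor, a CPU CPI of 1.5, and per-message instruction and I/O counts that are all hypothetical. It is meant to show the shape of the trend, not to reproduce Figures 6 and 7.

```python
def projected_clock_mhz(year, base_year=2000, base_mhz=1000.0, growth=0.40):
    """Clock rate compounding at `growth` per year from an assumed base."""
    return base_mhz * (1.0 + growth) ** (year - base_year)


def overhead_us(instr, io_ops, cpu_mhz, cpi_cpu=1.5, cpi_io=7, bus_mhz=66.0):
    """Assumed cost model: CPU time plus non-overlapped I/O bus time."""
    return instr * cpi_cpu / cpu_mhz + io_ops * cpi_io / bus_mhz


# Hypothetical per-message counts: (instructions, I/O operations).
GENERAL = (2500, 12)
SPECIAL = (500, 6)


def overhead_ratio(cpu_mhz):
    """General purpose overhead divided by specialized overhead."""
    return overhead_us(*GENERAL, cpu_mhz) / overhead_us(*SPECIAL, cpu_mhz)


for year in (2000, 2003, 2005):
    mhz = projected_clock_mhz(year)
    print(year, round(mhz), round(overhead_ratio(mhz), 2))
```

Under these assumptions the ratio falls toward the ratio of the two layers' I/O costs (here 12 / 6 = 2) as the clock rate grows, which is the sense in which only a difference in I/O operations continues to matter.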
Large Messages
In this section we examine the effects of architectural scaling on large messages, focusing on the per-byte overhead. In particular, we investigate the architectural scaling of the cross-over point between the cost of copying a page vs. using memory management techniques to share the page between the user process and the NI.
Broadly speaking, three approaches to high bandwidth messaging have evolved, two maintaining the classical messaging API, where the user process is free to modify the message buffer once the send primitive returns (e.g., as in sockets and MPI), while the third exports a significantly different API. The first, and simplest, is to use a staging area for both in-bound and out-bound data, requiring each message to be copied at least once. We call this the single-copy approach because most implementations of general purpose messaging layers have been able to achieve this lower limit. Although single-copy approaches can deliver good performance [15], they increase the load on the memory bus and can also have a higher per-byte overhead than the next two approaches.
Footnote 2: If we take the optimistic assumption that bus speed will be at 133 MHz in five years, the ratio will spike up from 2.25 to 2.30 when the processor reaches 3 GHz, and then reach 2.12 when processor speed reaches 5 GHz. After the spike, however, the ratio will continue to decrease once again.
The second approach, called zero-copy messaging, manipulates the kernel page tables to allow sharing of user pages with the NI while maintaining the same semantics as single-copy through copy-on-write. Although this approach has been studied extensively [8, 9, 29], it has not been mapped into architectural parameters. In particular, page table manipulation, the fundamental mechanism used in zero-copy, is complex. Thus, zero-copy only becomes attractive compared to single-copy when the per-page overhead of page-table manipulations is less than the per-page copy cost [40]. We call this cross-over point the OS page budget.
The final approach is shared-memory communication [6, 11, 20, 34]. The basic idea behind this approach is that the sender and the receiver agree to transparently share a block of memory. Communication then occurs via writes, DMA, or sometimes reads, into the shared region. While shared-memory approaches deliver excellent performance, they have several drawbacks. The primary drawback is that classical messaging (e.g., sockets and MPI) does not map well into a pure shared-memory approach; constructing a scalable, N-to-1 queue is difficult without hardware memory coherence or specialized support such as fetch-and-add registers. A more subtle problem with shared-memory communication is the considerable loss of connectivity because of the "commonality" required between the OS and NI of both the sender and receiver.
A second cost of shared-memory communication over zero-copy is the cost of mapping shared page tables between the NI and CPU. These page tables require additional communication between the NI and CPU that is not present in a zero-copy system. While many studies have documented messaging performance once all the shared mappings are in place, few have reported these set-up costs. The amortization of the mappings depends on the application and higher-level libraries, which we do not pursue in this work.
Implementing Single-Copy
Figure 8 shows the per-byte cost of our two messaging systems as measured on the 400 MHz Celeron machines. From this data, we compute that AM-VIA requires approximately 1.65 cycles per byte to send large messages. Interestingly, the per-byte cost for AM-UDP is much higher than that for AM-VIA. To understand why, we break down the AM-UDP per-byte cost into its user-level and kernel-level components. This breakdown shows that AM-UDP is making two copies, one at the user level (in the AM layer) and one in the kernel (in the UDP/IP protocol stack). We attribute the slightly higher kernel-level per-byte cost to checksumming.
The fact that two copies are required in AM-UDP points to a deficiency in its architecture. Because the AM and UDP layers are not "aware" of each other, each makes a copy independently. This is a well-known problem with layering (e.g., [17]) but, in this context, points to the potential performance disadvantage of splitting a messaging layer across the user-kernel boundary where there is significant functionality on both sides.
Page Table Operations
While zero-copy and shared-memory communication can avoid data copying, a known performance bottleneck, the required page table manipulations are complex. For example, on the send side, the OS must typically:
1. Check that the region to send is valid.
2. Fault in any paged-out pages.
3. Pin the pages, i.e., mark them as un-swappable, if the region is swappable.
4. Change the access permission of each page to read-only.
5. Translate the page addresses into the correct physical or bus addresses needed by the NI.
6. Un-pin the pages once I/O is complete.
7. Change the access permission of each page back to the original permission.
Note that this sequence does not implement copy-on-write, but allows for it on a write-fault. To actually implement this would require even more complexity.
The receive side is somewhat easier than the send side. Here the OS must:
1. Un-pin the pages, i.e., mark them as swappable, if the user's region is not pinned.
2. Change the access permission of each page to writable if required.
3. Change the user's page-table entries for the receive region to point to the received pages.
4. Return the old pages back to the kernel or device.
A careful study of the true cost of these operations crosses far into the operating systems field and is beyond the scope of this work. Our contribution instead is to quantify how the OS page budget will scale with time. To give the reader some context for the cost of the above OS operations, we measured the cost of the VIA call to register memory for the Giganet driver, the approximate cost of user-level page-table manipulations via a pair of mlock()/munlock() calls and two mprotect() calls, and the copy cost via bcopy(). Figure 9 shows the results on a 400 MHz Celeron machine.
We make two observations. First, it may be less costly to use a single-copy approach when the message size is one page or less (see footnote 3). Second, the cost of VIA memory registration is very close to basic page table manipulation costs for small buffer sizes (less than 64 KB), about 10 us for a 4 KB page size, but degrades to about 3 cycles per byte, or 133 MB/s, for buffer sizes above 64 KB. These results imply that on-demand pinning and un-pinning of small regions has favorable costs, but for large messages, applications should resort to managing communication with a fixed-cost pinned buffer. We are currently investigating the Giganet driver to fully understand why registering 64 KB regions is as fast as basic page table manipulations but approaches copying cost for larger regions.
3Since we're only estimating the cost of zero-copy, we cannot say definitively where this cross-over point currently is.
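As a quick check of the figures quoted above, the conversion from cycles per byte to bandwidth can be sketched as follows. The function names and the decimal-megabyte convention are our assumptions for illustration; only the 400 MHz clock and the 3 cycles per byte come from the measurements.

```python
def bandwidth_mb_per_s(clock_hz, cycles_per_byte):
    """Bytes handled per second, in decimal MB/s (assumed convention)."""
    return clock_hz / cycles_per_byte / 1e6


def processing_time_us(clock_hz, cycles_per_byte, nbytes):
    """Time in microseconds to handle nbytes at a fixed per-byte cycle cost."""
    return nbytes * cycles_per_byte / clock_hz * 1e6


# 400 MHz Celeron, 3 cycles per byte (large-buffer registration cost)
print(round(bandwidth_mb_per_s(400e6, 3), 1))  # 133.3
```

At this per-byte rate the cost grows linearly with buffer size, which is why registration of large regions behaves like a copy rather than like a fixed-cost page-table operation.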
4.3 Predicting the OS page budget
We use the following simple model to examine how memory and processor scaling affects the cost of sending/receiving large messages via a single-copy vs. a zero-copy architecture:
We use the same processor speed vs. time predictions that we derived in Section 2.2. To derive predictions for memory bandwidth, we use data available from the STREAM benchmark site [32], which shows that memory bandwidth has been increasing at a rate of roughly 35% per year. Figure 10(a) shows the relationship of clock speed to memory bandwidth. The linearity of the curve may be surprising but is consistent with the fact that both processor speed and memory bandwidth are growing exponentially.
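The near-linear relationship can be made concrete with a short calculation. If clock speed grows as (1+a)^t and memory bandwidth as (1+b)^t, eliminating t gives bandwidth proportional to clock^k with k = ln(1+b)/ln(1+a), so the curve is exactly linear only when the two rates are equal. The sketch below is illustrative only: the function name is hypothetical, and the processor rate assumes a doubling every two years, which may differ from the prediction derived in Section 2.2.

```python
import math


def bandwidth_vs_clock_exponent(cpu_rate, mem_rate):
    """Exponent k in bandwidth ~ clock**k when both grow exponentially in time.

    cpu_rate and mem_rate are fractional annual growth rates (0.35 = 35%/year).
    """
    return math.log(1.0 + mem_rate) / math.log(1.0 + cpu_rate)


CPU_RATE = math.sqrt(2.0) - 1.0  # assumption: doubling every 2 years (~41%/year)
MEM_RATE = 0.35                  # STREAM-based estimate quoted in the text

k = bandwidth_vs_clock_exponent(CPU_RATE, MEM_RATE)
print(f"bandwidth ~ clock^{k:.2f}")
```

Under these assumed rates the exponent is slightly below one, so the curve bends gently rather than being a perfect line.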
Figure 10(b) shows the OS page budget in terms of number of instructions as the processor scales to 5 GHz. An interesting effect of our scaling rule is that, if the page size remains fixed at 4 KB as the processor and memory get faster, the OS page budget approaches an asymptotic limit of 10240 instructions. Intuitively, because the copy cost for a fixed-size page decreases as an exponential function, the limit of the exponential decay results in a fixed page budget. This asymptotic limit is directly related to the page size; doubling the page size doubles this limit. Thus, as page size increases, it becomes ever more advantageous to use a zero-copy approach.
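The arithmetic behind the budget can be sketched as follows. This is a minimal sketch under our own assumptions, not necessarily the exact model behind Figure 10(b): we take the budget to be the number of instructions the processor can issue in the time a single copy of one page would take, with a hypothetical `ipc` parameter for instructions per cycle. The 2.5 instructions-per-byte ratio is simply the stated 10240-instruction limit divided by the 4096-byte page.

```python
def os_page_budget(page_bytes, clock_hz, copy_bw_bytes_per_s, ipc=1.0):
    """Instructions available per page before zero-copy loses to one copy.

    Assumed model: budget = (time to copy one page) * (instruction rate).
    """
    copy_time_s = page_bytes / copy_bw_bytes_per_s
    return copy_time_s * clock_hz * ipc


def asymptotic_budget(page_bytes, instr_per_byte=10240 / 4096):
    """Limit of the budget once the instruction-rate/bandwidth ratio settles."""
    return page_bytes * instr_per_byte


print(asymptotic_budget(4096))  # 10240.0
print(asymptotic_budget(8192))  # 20480.0
```

Because the budget is linear in the page size, any fixed per-page cost for page-table manipulation becomes easier to amortize as pages grow.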
Zero-copy and Checksumming
Most general-purpose protocols require or are run with higher-level checksums, whereas specialized protocols run with only a layer-2 checksum. Our OS page budget computed above does not include a checksum cost. If checksumming is to be supported, a single-copy approach becomes more attractive because the checksum can be computed during the copy, so each byte is touched only once. An alternative for zero-copy approaches is to provide support on the NI for computing checksums.
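To illustrate why the copy and checksum combine naturally, the sketch below folds a checksum into the copy loop so that each byte is read once. We use the 16-bit one's-complement Internet checksum (RFC 1071) as an assumed example of a higher-level checksum; the function name is hypothetical, and a production version would live in the kernel's copy routine rather than in Python.

```python
def copy_with_checksum(src, dst):
    """Copy src into dst and return the 16-bit Internet checksum of src.

    src is a bytes-like object; dst is a writable buffer at least as long.
    """
    n = len(src)
    if len(dst) < n:
        raise ValueError("destination buffer too small")
    total = 0
    # One pass: copy each 16-bit big-endian word and add it to the running sum.
    for i in range(0, n - 1, 2):
        hi, lo = src[i], src[i + 1]
        dst[i] = hi
        dst[i + 1] = lo
        total += (hi << 8) | lo
    if n % 2:
        # Odd trailing byte is padded with a zero low byte.
        dst[n - 1] = src[n - 1]
        total += src[n - 1] << 8
    # Fold carries back into the low 16 bits (one's-complement addition).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")  # example words from RFC 1071
buf = bytearray(len(payload))
print(hex(copy_with_checksum(payload, buf)))  # 0x220d
```

A zero-copy path has no such loop to piggyback on, which is why checksum support on the NI is the natural alternative.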
Related Work
The last 10 years have seen tremendous efforts investigating high performance messaging layers [6, 36, 42, 43]. In addition, several projects describe methods of providing a measure of classic abstractions on top of these layers [16, 37]. Many of these projects have provided detailed analysis or performance models [4, 13, 26]. However, these models were always in the context of specialized communication systems, thus making comparisons to general purpose systems difficult.
In a more general networking context, much work has been done to quantify the performance of existing IP protocol stacks [27, 28]. Unlike the analysis of specialized messaging, the performance of these general purpose systems is rarely defined in terms of architectural models. It is difficult to determine from these studies the impact of the processor, memory, or I/O systems on communication performance. The same situation exists for popular micro-benchmarks for measuring the performance of these layers [25, 33, 10]. The effect of different network card organizations was investigated in [39]. However, that work did not extend projections as to which organization would best track architectural trends.
Perhaps the closest work to our own can be found in [8]. The linear models provided in the paper were scaled in a qualitative way with processor and memory speeds. However, architectural characteristics, such as the number of instructions or cycles, were not given.
Future I/O Architectures
Current industry efforts to improve I/O architecture focus on delivering increased bandwidth. For example, Infiniband [23] is a complex specification that will likely increase latency over a stock PCI bus. However, Infiniband's point-to-point switching will increase total deliverable bandwidth. In addition, its complex packet format is designed to increase connectivity by allowing disks, computers and network interfaces to share a common fabric.
On the research side, many I/O enhancements have been proposed or prototyped. Most of these take the approach of integrating the messaging unit closer to the processor. While all these approaches reduce latency and increase bandwidth, they decrease the overall connectivity, sometimes by substantial amounts. These approaches can be summarized as:
• Integrate messaging into the processor [14].
• Integrate messaging into the cache-controller [1, 21].
• Integrate the messaging unit into the memory system [35]. A more radical approach integrates it into the DRAM slots [34].
The key challenge for future high-performance networking will be to maintain a high degree of connectivity while increasing performance. This means either working within the confines of the kernel and I/O busses or completely replacing the standards which form the underlying communication substructure. Recent work in operating systems [5, 19] has been able to achieve high-performance networking in a kernel context by using experimental operating systems. Such systems, however, achieve their performance by reducing software connectivity.
Perhaps the best example of a system which provides both high connectivity and performance is [9]. In that work, a few simple hardware and software modifications resulted in a large increase in the deliverable bandwidth in FreeBSD. The hardware techniques included large MTUs and checksum offload. The primary modification to the OS was to provide zero-copy send and receive. The results showed a 30% bandwidth improvement without an appreciable increase in latency.
In the near future the best way to achieve both high performance and connectivity will be for network standards bodies to adopt large MTUs, for card vendors to provide checksum offloading and implement large MTUs, and for OS vendors to incorporate driver APIs that allow checksum offload and zero-copy techniques. These techniques alone will allow applications to use a substantial portion of the hardware bandwidth in the context of the operating system. Latency reduction will come as processor performance outstrips the rest of the I/O system. These approaches, unlike specialized messaging, however, require a common set of standards to be accepted by four distinct communities: the operating system, network interface, switch and motherboard vendors. The speed of adoption of new standards, rather than any technological hurdle, may become the limiting factor for I/O performance.
In the longer term, the real challenge will be to reduce general purpose messaging overheads. On the hardware side, better integration into the memory system and techniques such as cacheable
Reducing the LogGP latency, as distinct from overhead, will remain difficult, however. As technology scales, the cost of moving data through each chip will continue to drop, but as long as systems are composed of discrete components, there will be significant latency costs. Programmers and OS designers should thus continue to focus on algorithms which tolerate latency.
Conclusions
For small messages, we have shown how the relative improvements in the processor and I/O bus result in an erosion in the performance differential between specialized and general purpose messaging systems. We have observed a reduction of this difference from a factor of 10 four years ago to a factor of 4 today. The reason for this reduction is that I/O bus speeds have not kept up with processor improvements. While processor performance doubles every 2 years, I/O busses only increase in performance every 4-7 years.
Extending our models out five years, we predict that specialized layers will have software overheads 2-3 times better than those of general purpose systems. This should give system designers pause as to whether such performance increases are worth the loss of connectivity with its attendant risks of technology abandonment.
For long messages, we quantified that the breakpoint between copying and zero-copy page pinning will be approximately 10,000 instructions per 4KB page and double that for 8KB pages. This will give OS designers a clear target for implementing zero-copy protocols in the future. Measuring the cost of basic page manipulations in the Linux kernel showed that these costs are not unreasonable. However, sufficient hardware support, such as checksum registers, will be required to make zero-copy protocol stacks viable.
