Since the introduction of the 10GbE standard in 2002, the ability of general purpose processors to efficiently process network traffic with common protocols such as TCP/IP has been revisited and critically evaluated. However, recent commercially available processors such as the Intel® Core™ 2 Duo Processor introduce microarchitectural enhancements that could significantly influence the approach to accelerating network processing. We examine the network performance of a real platform containing Intel® Core™ micro-architecture based processors, the role of coherency, and a prototype implementation of direct cache placement (Direct Cache Access or DCA) of inbound network traffic. We observe that a substantial portion of the processing time relates to the inefficiency of I/O specific coherence protocols in the platform. We demonstrate that a relatively low-complexity implementation of DCA called 'Prefetch Hint' provides a 15% to 43% speed-up to receive-side processing across a range of I/O sizes, and we present a detailed characterization of the benefits.
Introduction
Even as nearly all server platforms completed the transition to GbE (1 Gb/s Ethernet standard), the adoption of 10GbE has been limited to a few niche applications [1]. Widespread adoption of 10GbE, despite its introduction in 2002, has been constrained by the power and cost of PHY technology and the processing capability of general purpose platforms [1]. There has been significant debate regarding various types of solutions to accelerate the processing of TCP/IP network traffic [2, 3, 4, 5]. The variety of solutions, and yet the lack of a clear resolution, indicates the complexity of the problem. TCP/IP processing has dependencies on interactions across platform-wide hardware components, hardware-software interfaces and software interfaces such as those between the driver, the operating system and applications. A 10X increase in bandwidth can be expected to demand a comprehensive solution across the platform and the software stack. A comprehensive solution for a problem of this nature requires a system-wide, multiple-ingredient approach as well as an approach where only one or a few key ingredients are introduced at a time for broad commercial adoption. With Moore's law based improvements in microprocessor capability, including lower memory and I/O latencies via integration, the base platform can be expected to steadily improve and accommodate 10GbE. In addition, if fundamental hardware architectural problems such as memory latency or software architectural problems such as interrupt processing or context switching are addressed, then a practical, comprehensive solution can be envisioned. This type of combined approach, which builds on Moore's law with certain key architectural adjustments, was outlined in [5] and termed 'CPU onloading' in contrast to approaches that attempted to solve the problem with off-CPU accelerators.
The improvements offered by the recent Intel® Core™ micro-architecture have been considered significant both in absolute performance and in performance per watt [12]. In our current research work, we were motivated by the observation that, at a typical I/O message size of 4KB, the new micro-architecture was 66% faster than the earlier Intel® Pentium® 4 processor at processing inbound network traffic. Such gains are entirely complementary to improvements possible by more efficient platform data movement protocols. In an earlier paper [13], an approach called Direct Cache Access (DCA) to directly deliver network traffic into processor caches was introduced in detail. This concept has also been known as cache injection [15]. DCA targets multiple sections of TCP/IP processing that depend on the memory latency of accessing data structures updated by the Network Interface Controller (NIC). By providing a coherence protocol to directly place new data in the CPU's caches, memory accesses can be eliminated and processing time is proportionately reduced. In this paper, we provide measurements based on a silicon prototype of DCA and a sensitivity analysis to demonstrate the circumstances under which processing efficiency is improved with DCA. We measured that DCA can speed up receive-side TCP/IP processing by 15.6% to 43.4% across 64B to 128KB I/O sizes. CPU utilization is often the primary metric to measure platform efficiency when performance is limited by the total network traffic received or transmitted on the platform. We demonstrate that the CPU utilization reduction provided by DCA is realizable as an overall benefit to other simultaneously running applications. This is achieved by running the well characterized SPEC CPU2000 (SPECint_rate and SPECfp_rate) benchmark simultaneously with receive-side network processing to mimic a multi-core heterogeneous workload environment.
The utilization reduction with DCA on network processing is measured in terms of the increase in benchmark scores.
Since the current paper builds on the foundation laid by Huggahalli et al. in [13], we summarize the contributions of that work as follows:
- It introduced network I/O data structures and TCP/IP processing, and described CPU-NIC interactions via coherent memory.
- It established that a tight producer-consumer relationship exists between the CPU and the NIC by demonstrating it in profiles of benchmarks such as SPECWeb99 and TPC-W.
- It proposed DCA and discussed implementation considerations such as selection of cache in the hierarchy, system interconnect protocol, write stream and target processor selection.
- It mentioned briefly that, among the system interconnect protocols, a 'Prefetch Hint' approach is potentially a good trade-off.
A limitation of [13] is that the impact of any specific implementation was neither measured nor projected to demonstrate the system level benefit and the sensitivity of DCA to application level parameters such as I/O (or message) sizes or the number of networking streams. In this paper, we provide a completely new range of measured system level sensitivities for the concept. DCA is implemented in a prototype platform using the Prefetch Hint approach, where a CPU hardware prefetch is triggered by the NIC. This approach completely leverages existing coherence flows and should be relatively simple to implement in real systems as well. While this alternative was briefly mentioned in [13], there was no indication of its effectiveness compared to existing prefetching techniques. Using this prototype, system level metrics such as throughput and CPU utilization are measured and the performance benefit of DCA is expressed in relative and absolute terms. While the Prefetch Hint approach is not intended to save memory bandwidth, we show that its impact in reducing network I/O processing overhead is significant and can be considered a good trade-off in commercial systems.
In the following section, we review various approaches towards accelerating the processing of TCP/IP network traffic. We then describe the experimental platform and baseline performance characterization. The network performance capability of current platforms is introduced and a profile of execution is presented to indicate the opportunity for DCA. In Section 4, we review the protocol used by DCA and its implementation in a silicon prototype. The remaining sections provide a detailed, measurement-based characterization of the performance attained by DCA.
Background and Related Work
Previous evaluations of TCP/IP processing overheads have tended to separate non-data touching and data touching operations [6, 7]. Certain overheads, such as copies from a NIC DMA buffer into the application buffer, are 'per-byte' data touching overheads. Non-data touching overheads are primarily system software overheads that include the processing cost of interrupts, system calls and context switches. Many of the overheads exist because of the modularity of standard operating systems and may not easily be overcome without significant changes to the operating system or to applications.
The solutions proposed for minimizing TCP/IP processing time in a typical server platform can be broadly classified into three categories: (1) base platform improvements or assists that promote 'CPU onloading', (2) TCP Offloading Engines (TOEs), and (3) new interconnects or protocols such as Infiniband, Myrinet or RDMA. To address just the problem of copies, network stack architecture changes, new protocols and new techniques have been proposed. For example, user level TCP/IP stacks [8, 9] and RDMA [4, 10] define new hardware-software interfaces that permit data to be delivered to an application buffer. Both of these methods require specific application support, while RDMA requires a more expensive NIC solution to maintain the relationship between network flows and application contexts. Page flipping [11] requires OS intervention to avoid copying by remapping a kernel buffer into a user buffer. It incurs the overhead of TLB invalidations and is applicable only when the application buffer is aligned to page boundaries. Offloading approaches have promoted network protocol execution on specialized hardware accelerators instead of host processors to reduce the utilization of the host processors due to networking. Such approaches have been shown to be effective for a small portion of the spectrum of networking applications, particularly for storage transfers with large I/O sizes and long-lived network connections [3]. They have not been applicable for small messages, web servers or cases where application buffers are not pre-posted for the accelerator to directly copy data into them without involving the host [5]. Overall, we assert that these evaluations are worth revisiting in light of the new levels of micro-architectural efficiency offered by processors such as the Intel® Core™ 2 Duo Processor [12]. These developments give 'CPU onloading' techniques [5] greater impetus and diminish the need for specialized solutions.
Base Network Performance Characterization
In this section, the base problem statements and the opportunity for hardware improvements are characterized from an experimental test bed. Key metrics that we focus on are throughput measured in Gb/s (Gigabits per second) and CPU utilization measured as the percentage utilization across all the cores in the system. We connected a prototype server system that supported dual-core CPUs based on the Intel® Core™ micro-architecture back to back with another similar system using two, 1 Gb/s Ethernet ports. Each system contained two CPUs with each CPU containing two cores operating at 2.67GHz and connected via a 1333 MHz Front-Side Bus to memory and I/O subsystems. The cores within each CPU share a 4MB, 16-way L2 cache. Memory used on this system is FBD (Fully-Buffered DIMM) operating at 667MHz. Figure 1 contains a block diagram of the prototype server system. The prototype provides the capability to enable a DCA prefetch hint mechanism. The implementation of DCA within this prototype will be explained in Section 4.
The operating system running on both systems was Linux with the 2.6.15.4 kernel obtained from kernel.org. There are two protocol processing specific parameters that should be noted. By default, the ability to coalesce back to back small I/O transmissions into large packets using Nagle's algorithm [17] has been turned off. Turning off Nagle's algorithm brings out all the overheads for separate I/Os. In addition, the Linux operating system by default uses NAPI [16] to proactively look for packets reducing the overhead of interrupts. This feature is turned on for all measurements. We also turned off general purpose hardware prefetching capability in our baseline and show sensitivity to hardware prefetching in Section 9.
Most modern NICs, including the ones used in our measurements, offload TCP checksum calculations. Also, while the following analysis is applicable to the receive side only, it is noteworthy that TCP LSO (Large Send Offload) was enabled for the sender. For traffic generation we used Iperf [14], a utility that can transmit and receive data using the TCP/IP protocol. The CPU utilization reported is averaged across 4 cores and is measured using the 'sar' utility, available as part of the sysstat package of Linux performance measurement utilities. Since the SUT had a total of 4 cores and only two 1 Gb/s network ports, all the processing occurred on two cores in these experiments. By default, 4 connections or streams were assigned to each port; we later show sensitivity to the total number of connections. Table 1 shows the result of measurements across various I/O sizes and also normalized metrics to indicate the effective capability of a core in the current experimental environment. The spectrum of I/O sizes is conventionally divided into two categories: small I/O sizes between 64B and 1024B, and large I/O sizes for anything beyond the maximum payload size that fits into a standard Ethernet packet. Large I/O sizes thus require multiple packets to be transferred. At an I/O size of 64 bytes, a throughput of 0.24 Gb/s was achieved at an overall CPU utilization of 25.6%. Since the two network ports were affinitized to a single core in each socket, each of the active cores was utilized 51.2% of the time. The low network utilization was due to the inability of the traffic generator to sustain a throughput greater than 0.24 Gb/s. For larger I/O sizes, a peak, wire-limited network throughput of 1.84 Gb/s was sustained. At the 4KB I/O size, we show that the CPU utilization is 16.8%. When normalized to 100% utilization of a core, a core can be considered to operate at a capability of 2.72 Gb/s while spending an average of 3938 nanoseconds (ns) in work associated with a typical packet.
These normalized metrics should not be assumed to be measured values when a core is actually 100% utilized. However, they are useful metrics in comparing data across multiple equivalent experimental configurations and in stating the operating efficiency of a core. We note that the 'under utilization' of the four cores in the system may not reveal other bottlenecks such as Front-Side Bus (FSB) bandwidth or CPU to CPU coherency overheads in the platform particularly if 10 Gb/s ports are used to stress the system. The focus of this study is to understand the capabilities from the standpoint of a single core and the impact of I/O coherence protocols. Once the capability of a single core is understood, multi-core extrapolations under full load may be done cautiously to avoid conclusions that are too specific to the current platform. We also did not have the ability to prototype DCA with 10 Gb/s NICs at the time of this study. The performance bottleneck across the spectrum of I/O sizes can change. We profiled the execution using 'oprofile' to determine that for 4KB I/O sizes, the largest single contributor to overall processing time per packet is buffer copies (Figure 2 ). This I/O size is considered to be one of the most common I/O sizes used for storage I/O. The primary reason for the substantial time spent in copying is that when the processor executes a copy instruction (typically a string move), the source of the copy by virtue of the current platform DMA protocol is always in system memory. The processor is thus always substantially exposed to memory latency and often repeatedly during the course of the copy depending on the prefetching efficiency of the processor. We demonstrate in later sections that the time spent in copying can be improved 5 times by supplying the source data in the L2 cache. 
Apart from the copy routine, we also observe that packet headers can also be placed directly in the L2 cache by the NIC to reduce the protocol processing time by nearly eliminating packet header memory access latency. The following sections characterize the opportunity with direct L2 placement. The role of copies and memory accesses in general diminishes as the I/O size becomes smaller. In the results that follow, we note that, while not as dominant, memory dependency due to packet data access still represents a significant opportunity at small I/O sizes.
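The normalized per-core figures quoted in this section can be approximated with simple arithmetic. The sketch below is illustrative only: the function names are ours, and it assumes that utilization scales linearly with throughput, which the text cautions is not a measured property. With the rounded inputs quoted above it yields roughly 2.74 Gb/s per core at 4KB, close to the 2.72 Gb/s reported; the residual difference may come from rounding in the published inputs.

```python
def per_active_core_utilization(overall_util_pct, total_cores, active_cores):
    """Utilization of each active core, given utilization averaged over all cores."""
    return overall_util_pct * total_cores / active_cores


def normalized_throughput_gbps(throughput_gbps, overall_util_pct, total_cores):
    """Throughput one core would sustain if scaled linearly to 100% utilization."""
    core_equivalents = overall_util_pct / 100.0 * total_cores
    return throughput_gbps / core_equivalents


# 64B I/O: 25.6% averaged over 4 cores, with only 2 cores doing network work.
util_64b = per_active_core_utilization(25.6, total_cores=4, active_cores=2)

# 4KB I/O: 1.84 Gb/s at 16.8% overall utilization across 4 cores.
norm_4kb = normalized_throughput_gbps(1.84, 16.8, total_cores=4)

print(f"64B: each active core is {util_64b:.1f}% utilized")
print(f"4KB: normalized capability is about {norm_4kb:.2f} Gb/s per core")
```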
Current Memory Write Coherence Protocol and Improvements
Typical interactions between I/O devices and memory as well as TCP/IP specific memory accesses on modern mainstream platforms were described in recent publications [13]. The memory type used for DMA is typically coherent since buffers are recycled by drivers and the processor is expected to read the buffers and cache them. Processors may also repeatedly access the buffers to process the data within and then to clear the buffers for later use. The coherence protocol used in current systems is indicated in Figure 3.
Figure 3. Platform data flow associated with inbound memory writes from the NIC
Each packet results in one or more coherent memory writes by the NIC. The granularity of these writes may exceed a single cache line as specified by I/O bus specifications such as PCI Express®. Multi-cache-line memory writes by the NIC are broken up into cache-line granularity transactions by the chipset. For each cache line, a snoop is generated to invalidate previously read and cached versions of the same address within all processor caches in the system. This sequence guarantees that the processor will not find the cache line in the cache on any subsequent access. Consequently, the processor incurs compulsory cache misses and is always exposed to memory latency.
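The baseline flow described above can be illustrated with a toy model. This is a minimal sketch, not the platform's implementation: the 64-byte line size, the set-based cache and memory, and all function names are assumptions made for illustration. It shows why a buffer that was cached from earlier use still misses on every line after the NIC rewrites it.

```python
CACHE_LINE_BYTES = 64  # assumed cache-line size for this illustration


def cache_lines_for_write(address, length):
    """Return the line-aligned addresses touched by a write of `length` bytes."""
    first = address // CACHE_LINE_BYTES
    last = (address + length - 1) // CACHE_LINE_BYTES
    return [line * CACHE_LINE_BYTES for line in range(first, last + 1)]


def baseline_nic_write(cpu_cache, memory, address, length):
    """Split the write per cache line; snoop-invalidate each line, then write memory."""
    for line in cache_lines_for_write(address, length):
        cpu_cache.discard(line)  # snoop invalidates any cached copy
        memory.add(line)         # data lands in system memory only


def cpu_read_misses(cpu_cache, address, length):
    """Count cache misses when the CPU reads the buffer, filling lines as it goes."""
    misses = 0
    for line in cache_lines_for_write(address, length):
        if line not in cpu_cache:
            misses += 1
            cpu_cache.add(line)
    return misses


# A recycled 256-byte buffer is already cached from its previous use.
cache = set(cache_lines_for_write(0x1000, 256))
memory = set()
baseline_nic_write(cache, memory, 0x1000, 256)
misses = cpu_read_misses(cache, 0x1000, 256)
print(f"lines missed after NIC write: {misses} of 4")
```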
In Figure 4 and Figure 5, we show two DCA alternatives to place data from the NIC in the processor's cache at a time associated with data arrival into the system. Several variants of such protocols are possible depending on feasibility in a specific platform. The alternative shown in Figure 4 can be used to update lines already present in the cache in either E or M state. This approach is useful when data structures are actively recycled and can be expected to be present in cache. Both memory latency and memory bandwidth gains can be obtained with such an approach. The Prefetch Hint protocol presents a simple alternative that focuses on the more significant latency aspect of the problem. It is also a novel form of a more general Write-Allocation approach, where data is placed in cache regardless of prior existence in the cache.
Figure 5. Prefetch Hint protocol (diagram not reproduced). Participants: network, NIC, chipset, memory and processor (L1/L2 line in M state). Labeled transactions: packet, coherent memory write, snoop-hint, writeback, memory write, prefetch read and read data. Annotation: before CPU is interrupted.
The DCA hardware prototype is a silicon prototype in which the Prefetch Hint protocol was implemented. The triggered prefetch, as shown in Figure 5, brings data into the L2 cache of the Intel® Core™ micro-architecture based processor that we mentioned in Section 3. In the implementation, a normal snoop transaction is tagged by setting one of the fields in the transaction. Apart from normal snoop activity, the snoop address is passed to an existing L2-level hardware prefetching facility, which generates a prefetch for that address. Up to 16 externally triggered prefetches (or DCA prefetches) can be queued within the processor. If the queue is full, then subsequent hints are ignored. Once issued, these prefetches are treated as any other prefetches in the system. For example, they have a lower priority for FSB access compared to demand requests. They also may be dropped if the FSB is busy. A hint-based approach allows for a simpler design since, under any race conditions, the design can choose to ignore the hints without compromising correct functionality. The implementation guarantees that the snoop itself is complete before issuing prefetches triggered by the snoop onto the FSB. By leveraging existing transactions and hardware structures, the complexity of design and validation was kept to a bare minimum. This approach significantly facilitated the testing of the concept of direct cache delivery for realistic workloads. A peculiarity of the implementation, despite leveraging existing mechanisms, is the need to delay DCA prefetches by a pre-determined amount to avoid retries across the FSB. Typical chipsets, as in our prototype platform, do not anticipate read requests to the snoop address within a very short time interval relative to the snoop. After the snoop response, data may still have to propagate to a point where it can be accessible in the coherent domain. A static, boot-time parameter was introduced to specify a safe delay between 32 and 128 FSB clocks.
This delay does not affect the pipelining of the prefetches and did not substantially affect performance unless it was set to a value below the chipset minimum specification.
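The hint-handling behavior described in this section (a tagged snoop, a 16-entry queue of externally triggered prefetches, and hints that may be ignored or dropped without affecting correctness) can be sketched as a small software model. This is only an illustration under stated assumptions: the class and method names are hypothetical, the cache is a plain set of line addresses, and the model omits the boot-time delay and FSB priority arbitration of the real prototype.

```python
from collections import deque

MAX_PENDING_HINTS = 16  # queue depth stated in the text


class PrefetchHintQueue:
    """Toy model of the externally triggered (DCA) prefetch queue."""

    def __init__(self, capacity=MAX_PENDING_HINTS):
        self.capacity = capacity
        self.pending = deque()
        self.ignored = 0

    def snoop_with_hint(self, cache, line_address):
        """Perform the normal snoop invalidation, then queue a prefetch if room."""
        cache.discard(line_address)
        if len(self.pending) < self.capacity:
            self.pending.append(line_address)
        else:
            self.ignored += 1  # a full queue means the hint is simply ignored

    def issue(self, cache, bus_busy=False):
        """Issue the oldest queued prefetch; return True if the line was filled."""
        if not self.pending:
            return False
        line_address = self.pending.popleft()
        if bus_busy:
            return False  # a prefetch may be dropped when the bus is busy
        cache.add(line_address)
        return True


cache = set()
queue = PrefetchHintQueue()
for i in range(20):  # a burst of 20 hinted snoops before any prefetch issues
    queue.snoop_with_hint(cache, i * 64)
filled = sum(queue.issue(cache) for _ in range(20))
print(f"filled {filled} lines, ignored {queue.ignored} hints")
```

Lines whose hints were ignored or dropped are still correct in memory; the processor simply takes an ordinary demand miss on them, which is why the hint-based design tolerates races.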
In the spirit of minimal changes, the replacement policy within the L2 cache is a normal replacement policy for any new cache allocation by the processor. The line state upon filling the cache is the E state like any normal read from the processor. In our experiments, DCA was applied to descriptors, headers and payload data. While a general issue with DCA is the determination of the correct target CPU in a multiprocessor system, by affinitizing each of the two 1GbE network ports to a core, no other dynamic selection scheme was necessary. Operating system prescribed affinity policies that bind connections to processors [18] are required to support DCA in realistic environments but are beyond the scope of this paper.
The prototype implementation is focused on reducing CPU utilization and is not intended to reduce memory bandwidth. Since the data is still written by the NIC to memory and later read by the CPU, the memory bandwidth is expected to be the same as without DCA. However, with DCA, the data is brought into the cache by prefetches initiated by the NIC and the processor does not have to stall (for memory access latency) during copy. The measurement based analysis presented in the following sections, thus, has only the processor read latency benefits of the overall intended scheme.
DCA Measurements
The platform and processors described in Section 3 were also used to measure performance with DCA enabled. Figure 6 shows the impact of enabling DCA at the I/O size of 4KB. An average of 3938ns per packet in the baseline drops to 2809ns when DCA is enabled. copy_user_generic is the kernel routine that copies payload from the kernel buffer to the user (application) buffer. It executes more than 5 times faster with DCA since it does not stall on memory reads. The remaining time spent in copying is in basic load/store execution, L2 access latency and loop iteration for the copy. The copy destination is the application buffer, which should be present in cache after the first I/O transaction. The second largest component, tcp_v4_rcv, also benefits from DCA. Since packet headers are also directly placed into the L2 cache, it avoids stalls on packet headers and executes nearly 1.5 times faster.
The profile shown in Figure 6 indicates an overall speed-up of 40% in processing an average packet with DCA. The performance improvement varies significantly depending on how much time is spent in copying data and reading a packet header or a descriptor compared to other functions executed during TCP/IP processing. The execution profile depends on workload parameters such as I/O size and the number of connections being simultaneously processed. Performance sensitivity to these two factors is presented in the following two sections. 
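The 40% figure follows directly from the per-packet times quoted above. The short sketch below works through the arithmetic; the function names are ours. It also checks, as an assumption-laden consistency test, that the same time saving applied to the 16.8% baseline utilization from Section 3 gives roughly the 4.8 percentage point reduction reported elsewhere in the paper, assuming throughput is unchanged by DCA.

```python
def speedup_pct(baseline_ns, dca_ns):
    """Speed-up expressed as the percentage increase in packets processed per unit time."""
    return (baseline_ns / dca_ns - 1.0) * 100.0


def time_saved_pct(baseline_ns, dca_ns):
    """Percentage of per-packet processing time removed by DCA."""
    return (1.0 - dca_ns / baseline_ns) * 100.0


overall_speedup = speedup_pct(3938, 2809)
overall_saving = time_saved_pct(3938, 2809)

# Assumes equal throughput with and without DCA, so utilization scales with time.
utilization_drop_points = 16.8 * overall_saving / 100.0

print(f"speed-up: {overall_speedup:.1f}%")
print(f"per-packet time saved: {overall_saving:.1f}%")
print(f"implied utilization drop: {utilization_drop_points:.1f} points")
```

Note that a 40% speed-up corresponds to only about a 29% reduction in time per packet; the two percentages describe the same measurement from different directions.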
Performance Sensitivity to I/O Size
The notion of I/O size and its relationship with packet sizes was explained in Section 3. In Table 2, we compare the measured receive-side network throughput and CPU utilization with and without DCA across various I/O sizes. For all I/O sizes, the throughput attained by the platform is roughly the same whether DCA is on or not. However, the difference in CPU utilization ranges from 3.1% to 5.3%, with generally a greater difference at the larger I/O sizes. This absolute difference in CPU utilization is significant when interpreted relative to the total utilizations, which are only between 15.7% and 25.6% across the I/O sizes. Large I/Os are important for storage intensive applications such as IT critical backup/restore operations, file servers, database servers, decision support servers, video servers and any form of content retrieval application. When storage subsystems are accessed across the network, large I/Os are transferred using a network protocol such as TCP/IP and incur processing overheads as reviewed earlier. TCP offload engines have specifically targeted storage applications since core-to-I/O interface latencies and system software execution time can be amortized over a greater number of packets. An average performance improvement of 41.4% with DCA for I/O sizes between 2KB and 128KB, when combined with improvements due to the Intel® Core™ micro-architecture, will clearly raise the bar for any alternative solutions. As a specific example, at 4KB I/Os the new micro-architecture outperformed an Intel® Pentium® 4 processor normalized to the same frequency by 66%. Combined with the 40% improvement with DCA, a mainstream server's base capability can be raised by 2.3X (1.4 x 1.66). In addition, DCA does provide a substantial throughput increase for small I/O sizes. Message passing applications in a cluster and network edge applications that often forward or process small packets should also benefit from DCA.
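The 2.3X figure is a compounding of two independent speed-ups rather than a sum. A minimal sketch of that calculation follows; the helper name is ours, and it assumes the two gains are multiplicative, as the text's own "1.4 x 1.66" implies.

```python
def compound_speedup(*gains_pct):
    """Multiply independent speed-ups given as percentage gains."""
    factor = 1.0
    for gain in gains_pct:
        factor *= 1.0 + gain / 100.0
    return factor


# 66% from the micro-architecture and 40% from DCA at the 4KB I/O size.
combined = compound_speedup(66.0, 40.0)
print(f"combined capability: {combined:.2f}X")
```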
Performance Sensitivity to TCP Connections
So far the performance data presented was for 4 connections per port and 8 concurrent connections in all. As the number of connections is increased, CPU overhead increases due to more context switches and more pressure on the cache because of the higher number of connection contexts or Transport Control Blocks (TCBs). Each connection context takes up 1KB and must be accessed any time there is activity on a connection. Modern web servers are easily capable of sustaining thousands of connections at typical internet client request rates. Since the copy overhead relative to total processing goes down, so does the performance benefit with DCA. Figure 9 shows that as we increase connections from 2 to 512, the speed-up with DCA reduces from 40.3% to 29.9%. The degradation is fairly graceful considering that the execution profile becomes less favorable to DCA as the number of connections increases. In applications such as web servers, the benefit may be further limited due to the additional code executed for opening and closing connections.
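The cache pressure from connection contexts can be estimated from the 1KB-per-connection figure above and the 4MB shared L2 described in Section 3. The sketch below is a rough upper bound only: it assumes, for illustration, that every connection context competes for a single L2 cache and that nothing else occupies it.

```python
TCB_BYTES = 1024              # per-connection context size stated in the text
L2_BYTES = 4 * 1024 * 1024    # shared L2 size from the platform description


def tcb_footprint_fraction(connections):
    """Fraction of one L2 cache occupied if all connection contexts were resident."""
    return connections * TCB_BYTES / L2_BYTES


for connections in (2, 8, 512):
    share = tcb_footprint_fraction(connections) * 100.0
    print(f"{connections:4d} connections: {share:.2f}% of L2")
```

At 512 connections the contexts alone would account for an eighth of the cache under these assumptions, which is consistent with the text's point that growing connection counts add cache pressure.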
Sensitivity to Memory Loading
The peak memory bandwidth of the FBD based memory subsystem used in our measurement platform is 21 GB/s for reads and approximately 10.5 GB/s of simultaneous bandwidth for writes. Each CPU is connected to the memory subsystem via a 10.4 GB/s FSB interconnect. When processing only 2 Gb/s of total receive traffic, there is very little FSB or memory bandwidth contention in the system. Thus, for the data presented so far, portions of TCP/IP execution that are dependent on memory, including the copy routine, are expected to be primarily limited by memory latency. The level of network traffic in the test also utilizes only a small fraction of the 4 cores in the platform. When all cores in the platform are exercised, or in platforms with a greater number of cores, greater contention for memory will increase memory latency. If the memory subsystem used fewer channels or a less efficient memory technology that cannot offer as much bandwidth, even greater contention arises. To test DCA performance for such platforms, we ran a synthetic program that continuously streams through a large section of memory. Two instances of this program are executed, one on the core within each socket that is not doing network processing (Figure 1). This program generates a read bandwidth of 2.4GB/s on each CPU, increasing contention for FSB and memory bandwidth. Due to the resulting increase in memory latency, network processing is expected to consume even more CPU cycles since it has to stall longer for each memory read. Figure 10 shows the absolute CPU utilizations (not normalized) with and without the memory loading program. The CPU utilizations increase, as expected, in the presence of the memory loading program. The increase due to memory loading is, however, smaller when DCA is enabled. At the 4KB I/O size, for example, DCA lowers absolute CPU utilization by 7% under memory loading instead of the 4.8% seen without it.
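To put the bandwidth figures above on a common scale, the sketch below converts the network rate to bytes and compares it with the quoted peaks. It is a back-of-the-envelope illustration only; the variable names are ours, and it ignores protocol overheads and the fact that received data crosses memory twice (written by the NIC, then read by the CPU).

```python
def gbit_to_gbyte(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0


PEAK_READ_GBYTES = 21.0       # peak memory read bandwidth from the text
FSB_GBYTES = 10.4             # per-CPU FSB bandwidth from the text
LOADER_READ_GBYTES = 2.4      # read bandwidth of the synthetic program per CPU

network_gbytes = gbit_to_gbyte(2.0)
network_share_of_read_bw = network_gbytes / PEAK_READ_GBYTES
loader_share_of_fsb = LOADER_READ_GBYTES / FSB_GBYTES

print(f"network traffic: {network_gbytes:.2f} GB/s "
      f"({network_share_of_read_bw:.1%} of peak read bandwidth)")
print(f"memory loader: {loader_share_of_fsb:.1%} of each FSB")
```

Under these simplifications the receive traffic alone uses on the order of 1% of peak read bandwidth, while each loader instance takes nearly a quarter of its FSB, which is why the loaded configuration exposes contention that the baseline does not.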
DCA thus makes network traffic processing less sensitive to memory latency. We also noted an interesting behavior when the same data is viewed in normalized or relative terms. Even though the absolute CPU utilization differences under memory loading are greater when DCA is enabled, the effectiveness of DCA is actually slightly lower (Figure 11). The reason for this slightly lower DCA speed-up is a peculiarity of our prototype implementation and the prefetch hint approach used. External triggering of a hardware prefetch using a snoop may generate a prefetch well before the chipset has had a chance to make data available in the coherent domain of the platform. This causes the prefetch to be 'retried' across the Front-Side Bus, causing additional bus bandwidth consumption. The prototype does contain a parameter that delays the prefetch by a relatively safe number of bus clocks. However, under heavy contention scenarios, it is possible to incur a slight inefficiency as shown in
Comparison to General Purpose Prefetcher
Most microprocessors today employ some form of hardware prefetching to hide memory latency. The effectiveness of the prefetchers depends on both the workload and the platform. There are three general limitations of hardware prefetching: (1) typically they do not hide the initial latency in a given batch of memory accesses, (2) they may over-fetch beyond the required number of cache lines, and (3) their effectiveness can be significantly lower under contention, as the memory loading results in this section show. In the context of memory copies, with single core execution, hardware prefetchers can be reasonably effective. Figure 12 compares the efficacy of hardware prefetchers relative to DCA in terms of normalized throughput per core under various I/O sizes. For the I/O sizes between 1KB and 16KB, the speed-up in normalized throughput with DCA enabled was 39%, while the speed-up obtained with hardware prefetching (HWP) was inferior but a reasonable 14%. For the same I/O sizes, under memory loading, it is significant to note that the hardware prefetcher results in a performance degradation of about 5% compared to a speed-up of 36% with DCA. Compared to hardware prefetchers, we see certain forms of software prefetching as much more comparable alternatives to DCA. Software prefetching has two distinct advantages over DCA: (1) no prediction of a target CPU is required in a multi-CPU system and (2) significantly lower complexity since there is no platform-wide hardware impact. The key problems with software prefetching are in finding the right prefetch distance across multiple platforms and in minimizing the time spent in issuing software prefetch instructions.
Application Interference Study
We analyzed CPU network processing performance with DCA when the memory subsystem is under pressure. However, since the program creating memory stress was not doing anything with the data streamed from memory, it is not useful for evaluating DCA's impact on it. In this section, we present the results of running the SPEC CPU2000 benchmark suite along with Iperf (our network microbenchmark). The SPEC CPU2000 suite is useful for this analysis because it consists of a mix of 12 integer and 14 floating point benchmarks. Therefore, it helps us analyze DCA's impact on a variety of applications. We have seen that bringing network data early into the cache helps network processing performance. The purpose of these experiments is twofold. First, we want to verify whether the processor cycles saved on network processing can be effectively utilized by another unrelated program. Second, we want to find out whether bringing network data early into the cache hurts the performance of another program. The setup for this experiment is the same as described in Section 3. Network receive-side processing is restricted to two 1 GbE ports, one on core 0 of each CPU. Four instances of the SPEC CPU2000 benchmarks are run, thus sharing processing cycles as well as cache and FSB/memory bandwidth with network processing. While the SPEC CPU2000 benchmarks run only for a limited time, network receive activity is constantly kept active at line rate. SPEC CPU2000 reports a score at the end of execution; the higher the score, the higher the performance. Both integer and floating point subsets of the SPECrate suite are represented by the geometric mean of the scores of individual benchmarks in the suite. Based on this overall metric, SPECint_rate and SPECfp_rate gain by 5.5% and 5.0% respectively when DCA is enabled. This gain is roughly in line with the expectation set in Table 2 that with 4KB I/Os, the CPU utilization reduction is 4.8%.
We also observed that with memory contention, the CPU utilization reduction can be as high as 7.0%.
The only application that loses performance in the presence of DCA is a highly cache-sensitive program called art. The producer-consumer model expected by DCA has the effect of keeping a larger steady footprint of payload data in cache. While no disruption of the directly placed data is expected, this data has a longer residency in cache since the LRU state of each cache line is updated when the processor reads the data. The gap between data placement in cache and processor reads is determined by the interrupt moderation policies set in the NIC. We measured an average of 8000 interrupts per second per GbE port. The gap, hence, is about 125 microseconds. This factor is responsible for increasing cache residency and could negatively affect highly cache-sensitive programs like art. Further tuning of interrupt moderation settings is possible to further minimize 
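The 125 microsecond figure follows directly from the measured interrupt rate, and the same arithmetic gives a rough upper bound on how much payload can accumulate in cache between interrupts. The sketch below is illustrative only: the function names are hypothetical, and the byte estimate assumes a fully utilized 1 Gb/s link and ignores framing and protocol header overhead, so it is a bound rather than a measured footprint.

```c
#include <assert.h>

/* Average gap between interrupts, in microseconds. */
double interrupt_gap_us(double interrupts_per_sec)
{
    assert(interrupts_per_sec > 0.0);
    return 1e6 / interrupts_per_sec;
}

/* Upper bound on the bytes arriving on one port between two interrupts,
 * assuming the link runs at its full bit rate (an assumption that
 * ignores framing and header overhead). */
double bytes_per_interrupt(double link_bits_per_sec, double interrupts_per_sec)
{
    assert(interrupts_per_sec > 0.0);
    return link_bits_per_sec / interrupts_per_sec / 8.0;
}
```

With 8000 interrupts per second, interrupt_gap_us returns 125, and bytes_per_interrupt for a 1 Gb/s port gives 15625 bytes, so on the order of 15 KB per port may be waiting in cache before the processor reads it. A higher interrupt rate shortens the gap and reduces this bound proportionally.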
Conclusions and Future Work
A combination of core micro-architectural changes and enhanced data movement protocols can substantially improve the capability of general purpose platforms to process high network traffic rates. Even if the time spent in standard networking stacks remains a glaring issue, hardware changes with minimal impact on the network stack are promising. The inefficiencies within the network stack software are continuously receiving attention, and any further improvements will further shift the focus to how cores and network interfaces interact via the memory hierarchy. DCA is an example of the type of improvement that is possible with a platform data movement protocol adjustment. The hardware complexity of the protocol suggested in this paper is quite small, since the scheme completely leverages existing coherence primitives and data flows. Yet, we see substantial improvements at 2 Gb/s of traffic and also in the presence of realistic applications. While the concept of direct cache placement or cache injection itself can be improved by write-update or write-allocating protocols, there are many other opportunities in the area of platform data movement.
For example, on the transmit side, DMA read traffic need not invalidate and evict buffers being transmitted if they are resident in cache and are being frequently recycled by applications. With DCA and an equivalent transmit-side scheme, a highly optimal producer-consumer data flow between cores and NICs can take place exclusively via caches. The next logical area of research is cache policy changes to contain the degree of pollution in shared caches due to non-temporal data. Our future work will encompass multiple 10GbE adapters driving upcoming quad-core processors in order to characterize scalability issues and continue to explore the opportunities present in data movement via the memory hierarchy. However, we anticipate that solutions focusing exclusively on 10GbE and on friendliness to legacy software and hardware may not be adequately scalable at 100GbE. With a 100GbE standard being contemplated at the IEEE, solutions proposed for future multi-core platforms may be on the verge of breaking traditional software and hardware boundaries. Opportunities with CPU integration of network I/O capabilities and new network stacks should be strongly considered.
