This paper presents both a retrospective of the development of network interface architecture and performance and conformance data from a range of contemporary devices sporting various performance-enhancing technologies. The data shows that 10Gb/s networking is now possible without stateful offload while consuming less than one CPU core on a contemporary commodity server.
INTRODUCTION
From ARPANET [5] and Ethernet [29] in the mid-70s, through Local Area Networking [40] , distributed computing [1] , and TCP/IP in the early-80s, significant maturity in the design and philosophy of Internetworking was reached by 1988. Implementations of TCP/IP on mainframe and supercomputers in the mid-80s were reported together with some of the architectural tradeoffs of the time. For example, Kline describes [25] the implementation of a library-level protocol stack and Brandriff describes [3] a host based (rather than front-end processor) implementation.
The advent of workstation class computers [35] in the late-80s led to significant developments in the design of network interface hardware, particularly following early experiences with ATM networks [26] and the intent to handle the challenging QoS requirements of multimedia applications. Early ATM network interfaces made use of Programmed IO (PIO) for data transfer. This movement of data proved difficult for the RISC CPUs of the time, and DMA techniques came to be used. For example, Davie describes [10] the host interface used in the AURORA ATM testbed, and Smith and Traw describe [38] DMA double buffering techniques and interrupt moderation. The per-byte overheads of networking were understood and mitigations such as checksum offload were in use [32, 20]. These developments to the mid-90s have come to represent the roots of today's main-stream LAN interface designs. However, while it is the case that offloads are now regarded as commodity items, the desirability and utility of even the most simple offload should remain under debate. For example, Stone and Partridge [34] describe a study of the root cause of network errors which escape the Ethernet FCS check and find many systematic errors in hardware and software.
One architecture which received attention from the late-80s was that of executing the transport protocol on the host network interface. Implementations include the XTP Protocol Engine [7], the Nectar communications processor [12], and the VMP adaptor [22]. Follow-on work [21, 32] based on the Protocol Engine architecture implemented TCP/IP on the host interface for a 622Mb/s ATM network. This TCP/IP offload architecture was rejected [8, 23] and has not subsequently been taken up to any significant degree by the academic community, except recently as a means to an end for the support of Remote Direct Memory Access (RDMA) protocols [30].
Another architectural choice which is periodically revisited is whether to perform protocol processing in user or kernel space. Druschel and Davie implemented [11] an interface which allowed user-space programs direct access to an ATM adaptor, and Thekkath describes [36] an implementation of a user level TCP/IP stack over Mach. As well as performance, this work was concerned with the issues of multi-protocol co-existence and efficient operation in a micro-kernel environment. By contrast, the Jetstream/Afterburner adaptor [14] was used to implement TCP/IP in user space [13] over a monolithic kernel, and Pratt did the same using a firmware modified Gigabit Ethernet NIC [31]. These were essentially ports of their respective kernel protocol stacks to user level over a protected hardware interface. Hybrid models have also been proposed; see [27] for a survey and perspective. However, achieving good all-round performance has proven to be elusive without a protocol stack implementation which has been expressly designed for user level operation.
Over the same late-80s to mid-90s period, there was considerable parallel activity with multicomputer network interface architecture. The ATOMIC project [9] utilised components from the Mosaic multicomputer to construct a Gb/s LAN (later commercialised [2]). The multicomputer environment was somewhat less constrained than that of the ATM or distributed systems environments. Application behaviour suited low-overhead user-level abstractions of communication, and these abstractions were used by host adaptors in conjunction with high-performance network techniques such as cut-through [24] and source based [6] routing. Portability for scientific application codes running on this architecture was largely resolved by the MPI specification [28]. Multicomputer and LAN convergence was proposed in the early-90s [19, 18] and reports of large-scale deployments [37] of multicomputer interconnects as a LAN were made by 1997. However, the availability of commodity 100Mb/s and 1Gb/s Ethernet meant that a bigger movement formed around the use of LAN interconnects in a multicomputer environment [33, 39].
Continual software, protocol, and chipset performance improvements over the 90s meant that achievable throughputs on commodity hardware grew from around 130Mb/s in 1996 to Gb/s by 2001 [16]. This impressive performance increase contributed to the resistance met by industry as it attempted to convert mid-90s work on user-accessible network interfaces [15, 4] into the Infiniband general-purpose converged interconnect. This performance trend has continued over the course of the introduction of 10Gb/s Ethernet. In 2003 Feng reported [17] unidirectional TCP/IP/Ethernet throughputs of 4Gb/s on commodity and 7Gb/s on next-generation hardware, and by 2006 10Gb/s line rate has become possible on a single core of commodity server chipsets.
The remainder of this paper offers a set of comparative micro-benchmark data for some generally available contemporary 10Gb/s Ethernet NICs. It is hoped that this data will be a useful calibration point for the community, particularly at a time when industry is again debating network interface architecture.
OFFLOAD TAXONOMY
The basic act of transmitting and receiving data from a network interface to an operating system is defined here as regular networking. A regular network interface performs no processing of the packets above the link layer protocol. For example, an Ethernet interface may process the Ethernet FCS, or perform multicast filtering, but it does not process the IP headers within the frame. Nevertheless, a regular adaptor can be a highly tuned device which is capable of efficient DMA to and from a host, tracking large numbers of transmit and receive descriptors (and their associated buffers), and balancing the tradeoffs associated with interrupting the main CPU in the system. If the adaptor is capable of providing optimisations based on the local state contained within the upper layer protocols embedded within a single frame, then the adaptor is defined to be a stateless offload adaptor. There are a number of stateless offloads which can be performed based on higher level protocols, for example TCP/IP checksum calculation and verification, and TCP Segmentation Offload (TSO). An adaptor which performs optimisations based on the state contained within upper layer protocols within a sequence of frames is a stateful offload adaptor. Where TCP is the higher level protocol, a stateful offload adaptor is also known as a TCP Offload Engine (TOE). RDMA optimisations when run over TCP are therefore also termed stateful offloads.
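To make the checksum example of a stateless offload concrete, the following is a minimal Python sketch of the 16-bit one's-complement sum that such an adaptor would compute in hardware, in the style described by RFC 1071. It is illustrative only: the function name is our own, and the TCP pseudo-header that a real TCP checksum also covers is omitted. The point is that the computation depends only on the bytes of a single frame, which is what makes the offload stateless.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data (RFC 1071 style).

    Illustrative helper; the name and interface are assumptions, not part
    of any NIC driver API.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF  # one's complement of the folded sum


# Worked example using the byte sequence from RFC 1071.
payload = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
checksum = internet_checksum(payload)

# Verification on receive: the checksum over data plus its checksum is zero.
verified = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
```

A receiver (or a receive-side offload) can verify a frame with the same routine, since summing the data together with its checksum folds to zero.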
METHODOLOGY
For the purposes of this study, the offload features as defined in the Taxonomy are grouped into two sets: NET={regular networking, stateless offload} and TOE={stateful offloads}.
A number of generally available 10Gb/s Ethernet NICs were used for benchmarking. Each NIC vendor is anonymised, and the vendors are labelled {A,B,C,D,E,F,G}. All NICs are capable of operation in NET configuration and {E,F,G} are also capable of TOE operation.
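The grouping above can be written down as a small data model. The Python sketch below is one possible encoding, with all identifiers chosen by us for illustration; it simply enumerates which (vendor, configuration) pairs the methodology allows.

```python
from enum import Enum


class Offload(Enum):
    """The three feature classes defined in the taxonomy."""
    REGULAR = "regular networking"
    STATELESS = "stateless offload"
    STATEFUL = "stateful offload"


# The two configuration sets used in this study.
NET = {Offload.REGULAR, Offload.STATELESS}
TOE = {Offload.STATEFUL}

VENDORS = "ABCDEFG"       # anonymised vendor labels
TOE_CAPABLE = set("EFG")  # vendors whose NICs can also run as a TOE


def configurations(vendor):
    """Return the configuration names in which a vendor's NIC can be tested."""
    if vendor not in VENDORS:
        raise ValueError("unknown vendor label: %r" % vendor)
    return ["NET", "TOE"] if vendor in TOE_CAPABLE else ["NET"]


# Every (vendor, configuration) pair that could appear in a results table.
EXPERIMENT_MATRIX = [(v, c) for v in VENDORS for c in configurations(v)]
```

With seven vendors in NET configuration and three of them also in TOE configuration, the matrix has ten entries.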
For each experiment, the vendor supplied tuning parameters were applied for each vendor's NIC. In practice we found that these varied little and are essentially those described by Feng [17]. Interrupt moderation was disabled (where possible) for the latency experiment. A result shown as x indicates that a measurement was not possible in the given configuration, and indicates that a measurement was not taken. All experiments were performed back-to-back, using CX4 cable, and using the latest software from each vendor in Q3 2006. Windows benchmarking was not undertaken because the available TOE drivers were too unstable for measurement.
For driver availability over all the NICs, testing was first performed on a pair of Intel E7520 dual 2.8GHz Xeon (EM64T 32bit-mode) / 2GB RAM machines, running Linux 2.6.9. The TOE measurements require the installation of a vendor supplied operating system patch which creates a fast-path from the socket interface to the driver. No operating system by-pass middleware was used. Further experiments designed to investigate the relative performance of the devices over chipset generations were then made where possible using a pair of more recent Nvidia NForce Pro2200/2500 dual AMD285 (dual core 2.8GHz 32bit-mode) / 4GB RAM machines, running Linux 2.6.17.
RESULTS

Latency
NetPIPE was used to measure the half-round trip latency (L_RTT/2); results are shown in Table 1. It was not possible to disable interrupt moderation for the NET drivers of vendors E and G.
Bandwidth and CPU Efficiency
NetPerf was used to measure the CPU efficiency (E) for a single uni-directional stream at peak bandwidth (typically around 32KB message size) and 9KB MTU. Efficiency is expressed as % of a single CPU core per Gb/s. Table 2 shows results on the E7520 platform. 
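The efficiency metric is a simple ratio, sketched below in Python. The function names are our own and the numbers in the example are hypothetical, chosen only to show the arithmetic; they are not measurements from this study. The sketch assumes that CPU utilisation has already been normalised to a percentage of one core.

```python
def cpu_efficiency(core_percent, throughput_gbps):
    """Return E: % of a single CPU core consumed per Gb/s of throughput."""
    if throughput_gbps <= 0:
        raise ValueError("throughput must be positive")
    return core_percent / throughput_gbps


def cores_at_rate(efficiency, rate_gbps):
    """Return the number of cores needed to sustain rate_gbps,
    assuming CPU cost scales linearly with throughput."""
    return efficiency * rate_gbps / 100.0


# Hypothetical figures: 80% of one core while sustaining 8 Gb/s.
example_e = cpu_efficiency(80.0, 8.0)          # 10.0 % of a core per Gb/s
example_cores = cores_at_rate(example_e, 10.0)  # 1.0 core at 10 Gb/s
```

Under the linear scaling assumption, any device with E below 10 sustains 10Gb/s line rate on less than one core, and lower values of E indicate better efficiency.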
Multi-Stream Bandwidth
The HighPerf script from the Chariot test bench was used to generate 5 simultaneous uni-directional streams on the Intel E7520 platform. Results are shown in Table 4 . Bandwidth is as reported by the Chariot test bench. MTU is 9KB. The same experiment was then performed on the Nvidia platform using the same NIC (Vendor F) in both its TOE and NET configurations. Results are shown in Table 5 . Of particular note is the relative improvement for the NET configuration compared with the TOE. 
RFC Conformance
RFC conformance expressed as a % of ANVL tests passed for each RFC section (in the tool's suite) is given in Table 6 for Linux (2.6.9) and NIC vendors F and G in their TOE configuration. Vendor F does not appear to implement SACK, therefore micro-benchmark data for this device will 
CONCLUSIONS
The data presented represents a snapshot of 10 Gb/s Ethernet NIC performance on commodity hardware and confirms that line-rate throughput and sub-10 µs latency are achievable with or without a full offload implementation. Our opinion is that this situation is very similar to that of 1 Gb/s Ethernet in 2001, and that the current crop of 10 Gb/s offload devices therefore again represents a point solution.
Of note is that the performance improvements from the more recent chipset platform were shown to significantly erode the benefits of a TOE device. It is recommended that experimenters be aware that their measurements and hence conclusions could be very different depending on the test platform.
The RFC conformance data we took should warn users that stateful offload devices may not be operating at the same level of conformance as the operating system being offloaded.
