Abstract: We report on measurements performed to test the reliability of high rate data transmission over copper Gigabit Ethernet for the LHCb online system. High reliability of such transmissions will be crucial for the functioning of the software trigger layers of the LHCb experiment at CERN's LHC accelerator. The technological challenge in the system implementation consists of handling the expected high data throughput of event fragments using, to a large extent, commodity equipment. We report on performance evaluations (throughput, error rates and frame drop) of the main components involved in data transmission: the Ethernet cable, the PCI bus and the operating system (the latest kernel versions of Linux). Three different platforms have been used.
Fig. 1. Layout of the LHCb readout network.
The SFCs absorb the data, bring the fragments into order and forward them to a worker-node: this process is called event building. For events accepted by the L1, all sources send their data to a common destination where the High Level Trigger (HLT) performs event filtering to bring the rate to tape down to approximately 2 kHz. The total data throughput through the system is GB/s. We reject the use of connection-based, reliable protocols due to latency constraints. In addition, in order to minimise the protocol overheads, and thus enhance the payload link utilisation, no transport layer protocol is used. Data is embedded directly in IP packets. The absence of any mechanism which slows down the data flow (like sliding windows or slow start in TCP [3], [4]) further improves the payload link utilisation. It also facilitates packet handling on the sender as well as on the receiver side.
The UDP protocol adds two features above IP: the notion of ports and the checksum over the data. In our application we have no use for the port numbers, while the UDP checksum is considered redundant with the Ethernet CRC (Cyclic Redundancy Check) information in a switched network. Also, the UDP checksum is performed by the CPU (at least for fragmented datagrams), as opposed to the Ethernet CRC done by the MAC hardware, and so uses up additional resources. For this reason we decided to embed data directly into IP packets.
The choice of IP is motivated by the fact that packet fragmentation is well defined by the standard, and most Gigabit Ethernet switches include Layer 3 (IP) functionality.
0018-9499/$20.00 © 2006 IEEE
It is therefore essential for us to avoid data loss anywhere on the path, be it at the transmitting side, in the switching network connecting the sources and the destinations, or in the destinations themselves. In this paper we focus on the reliability at the destination nodes [5].
Linux is adopted as the operating system in the trigger farm nodes as well as the SFCs. The latest kernel versions include features which make it very appealing for soft real-time applications, such as the low latency scheduling and the kernel pre-emptibility, both of which minimize the response time to external events.
The main issue to be addressed here is the packet loss that may result from too slow a response of the Sub-Farm Controllers to the incoming data traffic, which may occur in particular when the CPU is under heavy load. To evaluate the system components, ranging from the physical medium up to the operating system, issues such as the transmission error rate and the packet drop in the protocol stack have been addressed.
II. HARDWARE SELECTED FOR TESTING PURPOSES

A. Gigabit Ethernet Controllers
All connections were implemented using category 5e copper cables (defined in ANSI/EIA/TIA 568-A-5). For our tests, we have concentrated on the Intel 8254x GbE controller based NICs. The choice was motivated by good performance exhibited in initial tests, powerful features, good documentation and good support of the Linux driver by Intel.
The 8254x controller family implements three methods of interrupt rate moderation [6].
• An absolute timer, which starts upon the reception (or upon the transmission) of the first packet. An interrupt is issued when the timer has expired. Its purpose is to assure interrupt moderation in heavy traffic conditions. The timer is set in steps of 1 µs.
• A packet timer, which starts upon the reception (or upon the transmission) of every packet. An interrupt is generated only when this timer expires. Since it is reset at every packet received (transmitted), it is not guaranteed to generate an interrupt if the network traffic is high. Its purpose is rather to lower the latency in low traffic situations. The timer is set in steps of 1 µs.
• An interrupt throttle mechanism can be used to set an upper boundary on the interrupt rate generated by the controller. The corresponding parameter is the maximum interrupt rate. While the first two timers work independently for the reception and the transmission of frames, the throttle mechanism can be used to ensure that, despite quickly varying traffic conditions in both directions, the total interrupt rate does not increase above a given value.
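As an illustration of the throttle mechanism described in the last item, the following Python sketch estimates how many frames are served by one interrupt, and how long a frame may have to wait, for a given maximum interrupt rate. The function names, the frame rate of 80 kframes/s and the interrupt rates in the example are assumptions chosen for illustration; they are not taken from the controller documentation.

```python
def frames_per_interrupt(frame_rate_hz: float, max_irq_rate_hz: float) -> float:
    """Average number of frames served by one interrupt when the
    controller is throttled to at most max_irq_rate_hz interrupts/s."""
    if frame_rate_hz <= 0 or max_irq_rate_hz <= 0:
        raise ValueError("rates must be positive")
    # With fewer frames than allowed interrupts, each frame gets its own interrupt.
    return max(1.0, frame_rate_hz / max_irq_rate_hz)


def worst_case_wait_us(max_irq_rate_hz: float) -> float:
    """Longest time (in microseconds) between two throttled interrupts,
    i.e. an upper bound on the wait added by the throttle alone."""
    if max_irq_rate_hz <= 0:
        raise ValueError("rate must be positive")
    return 1e6 / max_irq_rate_hz


if __name__ == "__main__":
    frame_rate = 80_000.0  # assumed example value, frames per second
    for irq_rate in (5_000.0, 10_000.0, 20_000.0):  # assumed example settings
        print(irq_rate,
              frames_per_interrupt(frame_rate, irq_rate),
              worst_case_wait_us(irq_rate))
```

The number of frames per interrupt gives a rough idea of how many receive descriptors must be available between two interrupts, while the waiting time bounds the latency added by the throttle.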
The parameters allowing control of the coalescence have to be chosen according to the expected traffic shape: too high an interrupt coalescence (i.e., too low an IRQ rate) will result in buffer overflow, while too low a coalescence setting (i.e., too high an IRQ rate) will unnecessarily strain the system and lead to data losses due to insufficient CPU power being left for data processing.
Today's commodity PCs typically use motherboards with a CPU by Intel or AMD and with PCI(-X) as the peripheral bus.
Intel has introduced the Hyper-Threading technology [7] to address the question of the unused CPU cycles, typically arising when the executing thread is forced to wait for data, e.g., as a result of a page fault. The physical CPU shares its resources among two logical processing units, which can increase the overall CPU use by up to 25% depending on the application. The Linux kernel, since version 2.4, is Hyper-Threading aware, recognising the siblings (i.e., the two logical CPUs in the same physical processor), and distributing running processes in an optimised way.
AMD, among other things, increases the performance of its CPUs by integrating the Northbridge logic into the processors, which interfaces the CPU core directly to the memory controller and the HyperTransport™ [8] technology interface. The Northbridge (also known as Memory Controller Hub or MCH) is traditionally one of the two chips in the core logic chipset on a PC motherboard, responsible for communications between the CPU and the RAM in the Intel 32-bit architecture. Since the Northbridge logic acts as the interface between the processor and the system bus, and in particular the system memory, integrating this device on the same die as the CPU aims at improving data throughput to and from the CPU.
For our investigations we have chosen three server class PCs. Two of the selected PCs are based on the Intel Xeon CPU, and differ mainly by the chipset used. The third PC is based on the AMD Opteron processor. The main characteristics are listed in Table I .
C. Traffic Generators/Analysers
The traffic was generated using Network Processor (NP) devices. We have used an NP card developed by the S3 company for IBM, featuring the latter's NP4GS3 Network Processor. The main features of this NP have been published in [9] and discussed in [10].
The NPs make it possible to generate arbitrary traffic at linespeed (for all frame sizes, down to 64 Bytes) on two Gigabit Ethernet ports. Several NP cards can be synchronised to within 100 ns, corresponding to a transmission time of Bytes, to give a realistic emulation of the LHCb specific traffic pattern.
III. LINUX KERNEL NETWORK PROCESSING
To understand the changes in network performance between Linux Kernel 2.4 and 2.6 it is worthwhile to describe the recent changes in Linux core network processing. Network processing starts when a Network Interface Card, on reception of an Ethernet Frame (or a bunch of Ethernet Frames in case an interrupt moderation mechanism is active on the NIC) starts a bus-mastered DMA transfer from the NIC to the kernel space (in a circular buffer, named ) and, at the end of the transfer, raises a signal on an IRQ line, so that the Interrupt Controller issues an interrupt to the dedicated processor pin. The kernel reacts to the interrupt by executing asynchronously an interrupt handler, that is a short routine which should complete as soon as possible while interrupt reception is disabled (and therefore further frames received in the meantime could get lost if the NIC buffer fills completely).
In Linux kernel 2.4 (the mechanism is usually referred to as softnet, see Fig. 2 ) the interrupt handler pulls off the packet descriptor (which points to DMA space) from the , enqueues it in the backlog queue for the interrupted CPU, raises a soft interrupt (softirq) to schedule the deferred execution of the remaining processing at the next available opportunity (out of the hard interrupt context, with interrupts reception enabled) and finally enables interrupt reception again. The softirq handler, in turn, dequeues packets from the backlog queue and calls the relevant processing functions.
In kernel 2.6 a new mechanism has been introduced in the Linux Kernel (usually referred to as NAPI, New Application Program Interface, see Fig. 3 ) [11] which eliminates the backlog queue and converges to an interrupt-driven mechanism under light network traffic and to a poll mechanism under high network traffic. On receiving the first frame of a bunch, the interrupt handler, instead of enqueueing the packet descriptors in the backlog queue, leaves the packets in the , puts a reference to the device in a poll-list attached to the interrupted CPU and schedules a softirq, leaving the interrupt reception disabled.
The softirq handler polls all devices registered in the poll-list to get packets from the until a configurable number of packets (known as quota) is reached. If the quota is reached, and a device has still packets to offer, the device is put at the end of the poll-list; else, if the device has no more packets to offer, it is taken off the poll-list and allowed to interrupt again.
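The polling scheme described above can be sketched as a toy model in Python. This is not kernel code: the single global quota, the device names and the use of plain queues for the receive rings are simplifying assumptions made only to illustrate how a device is requeued or taken off the poll-list.

```python
from collections import deque


def napi_poll_round(rings: dict, poll_list: deque, quota: int) -> list:
    """Run one softirq pass over the poll-list (toy model).

    rings     -- per-device receive rings, device name -> deque of packets
    poll_list -- devices that have raised an interrupt and await polling
    quota     -- maximum number of packets processed in this pass
    Returns the packets handed to the protocol stack, in processing order.
    """
    processed = []
    # Visit each device currently on the list at most once in this pass.
    for _ in range(len(poll_list)):
        if quota <= 0:
            break
        dev = poll_list.popleft()
        ring = rings[dev]
        while ring and quota > 0:
            processed.append(ring.popleft())
            quota -= 1
        if ring:
            # Quota reached with packets left: requeue at the end of the list.
            poll_list.append(dev)
        # Otherwise the ring is empty: the device leaves the poll-list
        # and would be allowed to interrupt again.
    return processed


if __name__ == "__main__":
    rings = {"eth0": deque([1, 2, 3]), "eth1": deque([10])}  # hypothetical devices
    pending = deque(["eth0", "eth1"])
    print(napi_poll_round(rings, pending, quota=2), list(pending))
    print(napi_poll_round(rings, pending, quota=2), list(pending))
```

In the example, the first pass exhausts the quota on the first device, which is therefore moved to the end of the list; the second pass drains both devices and leaves the poll-list empty.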
Under low load, when the kernel has enough time to process a packet before the next one arrives, the system converges toward an interrupt driven system: the packets/interrupts ratio is low and the latency is reduced. Under heavy load, the system takes its time to poll registered devices. Interrupts are allowed as fast as the system can process them: the packets/interrupts ratio is large and the latency is increased.
Apart from the NAPI mechanism, other kernel changes affect the performance of network data transfer. First of all, the 2.6 kernel is more pre-emptive than the previous ones (up to kernel version 2.4, a process entering the kernel mode could only be pre-empted by explicit yields, sleeps and interrupts), although it is still not fully pre-emptive. Explicit pre-emption points have been introduced in blocks of kernel code that may execute for long stretches of time. Moreover, the kernel is given the chance to perform a context switch every time a spin-lock is released or the execution flow returns from an interrupt handler (in kernel 2.4, when the execution flow returned from an interrupt handler the interrupted task was always resumed).
Secondly, the default kernel internal clock frequency for the i386 architecture has been increased by a factor of 10. This leads to a finer timer granularity: the scheduler is now executed every millisecond instead of every 10 ms (default setting). In general, shorter ticks produce higher resolution timers, and therefore better performance of I/O multiplexing (polling), and improve system latency in process pre-emption, but they introduce more overhead (more frequent timer interrupts) and more context switches between processes.
Thirdly, a new O(1) scheduler has been introduced (O(1) means that the time needed to take a scheduling decision does not depend on the number of processes in the run queue). The new scheduler distinguishes between logical CPUs (Hyper-Threading) and true SMP (Symmetric Multi-Processing), and distributes the load among physically different processors. The CPU affinity has also been improved: a process will be migrated from one CPU to another only to resolve imbalances in the run queue length.
IV. TRANSMISSION ERROR RATE
The reliability of copper link cables as physical medium has been tested in a setup involving only the Network Processor based Gigabit Ethernet frame generator. A 100 m long category 5e copper cable was used to interconnect two ports of the traffic generator.
Two kinds of error conditions have to be distinguished: transmission errors and equipment malfunctioning. The first ones can occur during normal operation, e.g., due to noise pickup on a long copper wire. The rate at which these transmission errors happen should be limited in order for the data acquisition system to function as desired. Considering one frame lost in an hour of operation of the full system as acceptable, we aim at a BER of the order of . The second ones are typically a result of a breakdown of one of the components in the data path. In the current phase of the design we did not address the Mean Time Between Failure issue, leaving it for a later evaluation during the Market Survey/Tender procedure. We will refer to the first error source simply as errors, and the second as faults. In this paper we are interested only in the error rate.
Assuming a correctly formed frame at the time of sending, the only two errors of interest are receive and checksum errors. The checksum (or CRC) error indicates that at least one bit flip occurred during frame transmission, and that it was detected by the Media Access Control (MAC) device. A receive error, on the other hand, is signalled by the physical layer device (PHY) if it detects an error condition. The exact meaning of the receive error is specific to the PHY chip used. The error condition is not bound to the data carried by a frame, but can also happen between frames, i.e., in the idle state of the link. Experience has shown that the two error conditions are usually correlated: noise induced on the wire can result in a PHY detected receive error, and would corrupt a frame currently being transmitted, thus resulting in a checksum error. A detailed description of the MAC and PHY layers can be found in [12].
In a run of frames of 1518 B each, at 100% link load, no transmission errors were detected. All frames were correctly received. This number is equivalent to bits transmitted, so that the error rate is by far less than the requirements of the IEEE 802.3z standard [12]. Similar results have been obtained in BER measurements on point-to-point connections between PCs equipped with Intel 8254x NICs, as well as in a test setup involving switches with up to 48 ports used. A first order extrapolation to the full system lets us be confident in expecting less than one transmission error in hours of operation of the experiment.
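A run without any detected error only provides an upper limit on the bit error rate. As a sketch of how such a limit can be derived, the Python fragment below applies the standard zero-count Poisson limit, $-\ln(1-\mathrm{CL})/N$ for $N$ error-free bits, which is about $3/N$ at 95% confidence level. The choice of this statistical treatment is our assumption for illustration, and the run length used in the example is a placeholder, not the run length of the measurement reported here.

```python
import math


def bits_in_run(frames: int, frame_bytes: int = 1518) -> int:
    """Number of frame bits in a run of equally sized Ethernet frames."""
    return frames * frame_bytes * 8


def ber_upper_limit(bits_sent: float, confidence: float = 0.95) -> float:
    """Upper limit on the bit error rate after bits_sent error-free bits.

    Uses the Poisson zero-count limit: -ln(1 - confidence) / bits_sent.
    """
    if bits_sent <= 0:
        raise ValueError("bits_sent must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return -math.log(1.0 - confidence) / bits_sent


if __name__ == "__main__":
    hypothetical_frames = 10**9  # placeholder, NOT the run length of the paper
    n_bits = bits_in_run(hypothetical_frames)
    print(n_bits, ber_upper_limit(n_bits))
```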
V. IP DATAGRAM DROP
In these tests two PCs were connected together, either through a point-to-point link or via a Gigabit Ethernet switch. In the latter configuration the switch uplink was unplugged. No significant differences were detected between the two configurations. The tests were performed using UDP datagrams for simplicity. During the tests all non-essential processes (X11, several daemons) were stopped and the 802.3z flow control was activated on the NICs.
A. Results with Linux Kernel 2.4 (Default Setup)
A first set of results was obtained with a standard Red Hat 9A setup with kernel 2.4. Only the socket send buffer size and the socket receive buffer size were increased from the default value of 128 KiB, stored in , up to the maximum of 512 KiB, stored in (throughout this paper, following IEC 60027-2, second edition, 2000-11, to avoid ambiguities, we use the prefixes Ki, Mi and Gi to mean $2^{10}$, $2^{20}$ and $2^{30}$, respectively, preserving for the prefixes k, M and G the original SI meanings of $10^{3}$, $10^{6}$ and $10^{9}$). The Intel e1000 network interface driver version was 5.0.43-k1, as supplied by the Red Hat 9A distribution. The number of descriptors allocated by both the driver and was set to 256, the queue discipline was set to and the queue length was set to 100. The datagram used for these tests had an IP payload of 4096 B (three Ethernet frames).
Results of the benchmarks performed this way showed a rather high throughput of 999.9 Mb/s (including 8 B/datagram UDP header, 20 B/fragment IP header, 7 B/frame Ethernet preamble, 1 B/frame Ethernet Start Frame Delimiter, 14 B/frame Ethernet header, 4 B/frame Ethernet Frame Check Sequence and 12 B/frame Ethernet Inter Packet Gap), but with a corresponding datagram loss rate of approximately $5.1 \times 10^{-5}$ (one datagram lost every 19500 datagrams sent), which is certainly not acceptable for the LHCb DAQ.
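The two figures quoted for this test, three Ethernet frames per datagram and one datagram lost in 19500, can be cross-checked with a few lines of Python. The sketch assumes the standard Ethernet MTU of 1500 B (no jumbo frames), which is not stated explicitly in the text; the function names are ours.

```python
import math

MTU = 1500       # assumed standard Ethernet MTU, in bytes
IP_HEADER = 20   # bytes per fragment, as quoted in the text


def fragments_needed(ip_payload: int) -> int:
    """Number of Ethernet frames needed to carry one IP datagram."""
    # Fragment data must be a multiple of 8 B (except for the last fragment).
    per_fragment = (MTU - IP_HEADER) // 8 * 8
    return max(1, math.ceil(ip_payload / per_fragment))


def loss_rate(lost: int, sent: int) -> float:
    """Fraction of datagrams lost."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return lost / sent


if __name__ == "__main__":
    print(fragments_needed(4096))   # datagram size used in this test
    print(loss_rate(1, 19500))      # one datagram lost every 19500 sent
```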
Results of repeated benchmarks showed rather large fluctuations of the datagram loss rate. The distribution of the results is multi-modal, as shown in Fig. 4 .
A magnification of Fig. 4 reveals a finer structure, which is shown in Fig. 5. The regular pattern exhibited by the datagram loss rate distributions is probably due to a throttling policy that is actuated at some level in the network data path inside the kernel to avoid service disruptions due to kernel overload: when a queue fills completely, the empty queue state is restored by dropping a group of packets, in order to avoid a congestion collapse condition.
Further tests, performed with fine tuning of the queue parameters, led to significant improvements of the loss rate, but still not enough to satisfy the LHCb data acquisition requirements. A big step forward was achieved by changing the network driver from softnet to NAPI.
Benchmarks were repeated, still operating with the kernel 2.4.20 but enabling NAPI (back-ported from Linux kernel 2.5, but disabled by default [13] ). When operating in this way we observed a significant reduction of the datagram losses. These results are not shown here since they were superseded by those obtained with the kernel version 2.6. 
B. Results With the Linux Kernel 2.6.0 and NAPI
Setting the number of the datagram descriptors allocated by the driver and to 4096 (the maximum allowed value), the IP send buffer size to 512 KiB and the IP receive buffer size to 1 MiB, the maximum throughput achieved was 999.9 Mb/s, while the datagram loss was dramatically decreased, with respect to the previous tests, to a rate of (101 datagrams lost for datagrams sent). Transmission errors can be divided into two groups: receive errors, as reported by the NIC, and protocol handling related errors, which are due to failures in IP fragment reassembly.
The first category includes CRC errors and frame losses due to overflow in NIC internal buffers. Errors reported by the higher level (layer 3 and above) protocol handlers typically mean a failure to reassemble a datagram, usually as a result of a missing fragment, thus are usually correlated with the NIC reported errors. They have been observed in earlier kernel versions for large datagram sizes, but the improvements in 2.6.0-test11 kernel seem to have eliminated this problem. Fig. 6 shows the maximum data transfer rate measured as a function of the datagram payload size. The black line represents the payload rate, while the grey one represents the total rate, including UDP header (8 B/datagram), IP header (20 B/fragment), Ethernet preamble (7 B/frame), Ethernet Start Frame Delimiter (1 B/frame), Ethernet header (14 B/frame), Ethernet Frame Check Sequence (4 B/frame) and Ethernet Inter Packet Gap (12 B/frame).
The discontinuities in the payload rate curve are due to the increase in the overhead, occurring when an additional Ethernet frame is required by the fragmentation process. The minimum Ethernet frame size of 64 B requires padding for frames carrying less data, thus lowering the payload rate.
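The origin of these discontinuities can be reproduced with a simple overhead model. The Python sketch below sums the per-datagram, per-fragment and per-frame overheads listed above and derives an upper bound on the payload rate at full link utilisation. It assumes that the datagram payload size refers to the UDP payload and that the standard MTU of 1500 B is used; both are assumptions of this illustration, and the function names are ours.

```python
import math

LINE_RATE = 1e9         # Gigabit Ethernet, bits per second
UDP_HEADER = 8          # bytes per datagram
IP_HEADER = 20          # bytes per fragment
ETH_HEADER = 14         # bytes per frame
ETH_FCS = 4             # bytes per frame
PREAMBLE_SFD = 7 + 1    # preamble plus Start Frame Delimiter, bytes per frame
INTER_PACKET_GAP = 12   # bytes per frame
MIN_FRAME = 64          # minimum Ethernet frame size (header + data + FCS)
MTU = 1500              # assumed standard Ethernet MTU


def wire_bytes(udp_payload: int) -> int:
    """Bytes occupied on the wire by one UDP datagram, all overheads included."""
    ip_payload = udp_payload + UDP_HEADER
    per_fragment = (MTU - IP_HEADER) // 8 * 8
    n_frag = max(1, math.ceil(ip_payload / per_fragment))
    total = 0
    remaining = ip_payload
    for _ in range(n_frag):
        chunk = min(remaining, per_fragment)
        remaining -= chunk
        # Short frames are padded up to the minimum Ethernet frame size.
        frame = max(MIN_FRAME, ETH_HEADER + IP_HEADER + chunk + ETH_FCS)
        total += frame + PREAMBLE_SFD + INTER_PACKET_GAP
    return total


def payload_rate(udp_payload: int) -> float:
    """Upper bound on the payload rate (bit/s) at full link utilisation."""
    return LINE_RATE * udp_payload / wire_bytes(udp_payload)


if __name__ == "__main__":
    for size in (18, 1472, 1473, 2952, 2953):
        print(size, wire_bytes(size), round(payload_rate(size) / 1e6, 1))
```

In this model, going from 1472 B to 1473 B of payload requires a second, padded Ethernet frame, so the payload rate drops abruptly at that point.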
From Fig. 6 one can also notice that, from 500 B on, the transfer rate reaches the nominal speed of Gigabit Ethernet. The observed behaviour below 500 B indicates a limit in the Ethernet frame rate. This limit is more clearly visible in Fig. 7, which shows the maximum Ethernet frame rate as a function of the datagram payload size. The highest frame rate (280 kframes/s) corresponds to the shortest Ethernet frames (64 B). When all the frames of a datagram have the maximum size, the maximum achievable frame rate reaches its minimum of about 80 kframes/s. One can also notice that, for datagram sizes below 500 B, the maximum Ethernet frame rate becomes almost independent of the datagram size. This behaviour indicates a bottleneck independent of datagram size, such as a constant overhead in packet transmission/reception, as discussed in Section VII.
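For comparison, the frame rate that the link itself would allow can be computed from the frame size plus the preamble, Start Frame Delimiter and Inter Packet Gap. The short Python sketch below does this for the smallest and largest frames and converts the observed 280 kframes/s into a time per frame; the function names are ours and the calculation is only a cross-check, not part of the measurement.

```python
LINE_RATE = 1e9                 # bits per second
PER_FRAME_EXTRA = 7 + 1 + 12    # preamble, SFD and inter packet gap, in bytes


def line_rate_frames_per_s(frame_bytes: int) -> float:
    """Maximum frame rate allowed by the Gigabit Ethernet link itself."""
    return LINE_RATE / ((frame_bytes + PER_FRAME_EXTRA) * 8)


def per_frame_time_us(frame_rate: float) -> float:
    """Average time spent per frame, in microseconds, at a given frame rate."""
    return 1e6 / frame_rate


if __name__ == "__main__":
    print(line_rate_frames_per_s(64))     # shortest frames
    print(line_rate_frames_per_s(1518))   # longest frames
    print(per_frame_time_us(280_000))     # observed plateau for short frames
```

The link would allow about 1.49 Mframes/s for 64 B frames, well above the observed plateau, whereas for 1518 B frames the link limit of about 81 kframes/s agrees with the measured minimum. The plateau corresponds to roughly 3.6 µs spent per frame.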
As another measure of performance, we have investigated the socket buffer occupancies in the case of IP forwarding. Here, the host acts as a receiver and sender at the same time, as intended for the SFC. The socket buffer occupancy was measured by patching the relevant kernel routine in the IP stack. The average length of the queue over a period of time was logged into the kernel log, and read out after a test run.
Raw IP packets of 1548 B payload, fragmented in two Ethernet frames, have been generated such as to force IP packet reassembly. The results, shown in Fig. 8 , indicate a significantly more efficient protocol handling on the Opteron platform. The two plots concerning the Xeon based PCs show that enabling Hyper-Threading brings a performance increase of merely %. This is not surprising, given that the application is not very CPU intensive, but the load lies rather on the I/O capacity.
VI. INFLUENCE OF THE INTERRUPT RATE
At constant incoming frame rate, the interrupt rate can be modulated by means of parameters given to the network driver. The adjustment of the coalescence settings provides a means to tune the packet handling latency without overburdening the CPU with interrupts (a reduction of the coalescence settings decreases the packet handling latency but scales up the interrupt rate).
Fig. 9. Packet loss vs. interrupt rate.
For the trigger traffic, we rely on low latency in packet handling and thus we prefer, in principle, low coalescence settings. Too low coalescence settings can, however, lead to packet loss further up the network stack due to too high CPU utilisation, which is unacceptable for trigger purposes. The coalescence settings therefore have to be tuned in order to have as high an interrupt rate as possible, but without losing frames even at the highest link loads.
We have thus measured the impact of the interrupt rate, as reported by the utility, on packet drop under heavy network traffic conditions. For this test, we have programmed the data sources to fill the wire at 100%, while the frame size was set to 1000 payload Bytes, i.e., in a range where we do not expect packet loss due to the transmission rate itself (cf. Figs. 6 and 7). The interrupt rate was varied by changing the NIC driver's interrupt coalescence settings. Fig. 9 shows the packet loss as a function of the interrupt rate for both the Server-2 (Xeon) and Server-3 (Opteron). The two plots for each server reflect two cases: in the first case the only running user application was our data receiver and the CPU load was %; in the second case an additional background task was running, emulating the load induced by the event building process, and the CPU load was %. For this test we performed runs of frames each. At high CPU load, the loss-less limit (no packet loss in a run) has been found to be around 10 kHz on Server-2, while it is kHz on Server-3. We observed that the interrupt rate has to be kept low, in particular on the Xeon based system, where already 20 kHz leads to significant packet loss.
VII. PERFORMANCE OF PCI-X DMA TRANSFERS
The previous sections have shown how significantly the rate drops when a network device sends short frames. The performance decrease is probably due to the delays introduced by the operating system along the execution path followed by the data before being transferred to the I/O device (user-space process, socket library, system call, copy to kernel buffer, socket queue, device queue, driver ring and then DMA to NIC). We have thus measured the efficiency of the last step, when frames are downloaded from the host memory by the network device with a DMA.
For that purpose, we have reduced as much as possible the processing of data in the operating system by using the Linux packet generator . This kernel module first allocates an Ethernet frame and fills it with some consistent data. Then, it constantly feeds the driver transmit-ring with the corresponding packet descriptor (no other packet is allocated). Each time an entry of the ring is released, the packet is queued again. In this way the device has always something to send.
We use a NIC based on the Intel 82546EB Ethernet controller which has the interesting feature of reading DMA descriptors provided by the driver in sets of 64 instead of one by one. This means that once a set of DMA descriptors has been read, the device downloads the associated frames with DMA over the PCI bus, one after another as fast as it can without any other interaction with the operating system. Because of the way the packet generator queues packets, the controller will get several new DMA descriptors in one shot and will schedule the transfers one after the other with a minimal delay between them.
In Fig. 10 we show the bit rate seen at the output of the network device, as a function of the frame size. The packets were generated by the Linux kernel packet generator and counted by a NP-based receiver, connected to the output of the network device. Again, the rate drops below the theoretical limit for short frames.
Using a PCI bus analyzer, we have measured the delays between two successive DMAs. Knowing the frame length, one can then compute the bit rate seen on the bus (Fig. 11). Since we sampled several frames (about 100 to 150), we show both the bit rate computed from the average inter DMA delay and a peak rate computed from the overall minimum delay seen between two frames. The peak rate on the PCI bus is of course not what we get on average, and it gives an overoptimistic picture of the performance of the device. We plot it to show what we understand to be a performance level that can never be exceeded. Even this limit is still surprisingly low for short frames. For both computations, the rate seen on the bus drops below 1 Gb/s for short frames.
The theoretical PCI bandwidth should permit the download of short frames at a rate large enough to reach link speed anyway. There seems to be a non negligible extra delay per frame (time to set up a DMA, processing on the device, etc.), which we try to evaluate.
Considering only the first segment (short frames) of the peak rate, one can compare it to the theoretical PCI bandwidth. Since the bus is 64 bits wide and is clocked at 66 MHz, the peak theoretical bandwidth is 4.2 Gb/s. What we do not know is the length of the constant delay added between two frames. A fit to the data gives a value of 92 cycles for this extra delay (1.34 µs on a 66 MHz PCI-X bus). On a 64 bit wide bus, this represents 736 B. The measured and fitted transfer rates for frame lengths in the range between 64 and 300 B are shown in Fig. 12.
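A model of this kind can be sketched in Python as follows. We assume here that the time needed to transfer one frame is the number of bus cycles required to move its data plus a constant number of overhead cycles; this functional form is our assumption for illustration and is not necessarily the exact expression used in the fit. The bus width, the clock frequency and the 92 cycles are the values quoted in the text.

```python
PCI_CLOCK_HZ = 66e6     # bus clock quoted in the text
BUS_WIDTH_BYTES = 8     # 64 bit wide bus
OVERHEAD_CYCLES = 92    # fitted constant extra delay per frame, in cycles


def peak_bandwidth_bps() -> float:
    """Theoretical peak bandwidth of the bus in bits per second."""
    return PCI_CLOCK_HZ * BUS_WIDTH_BYTES * 8


def modelled_rate_bps(frame_bytes: int,
                      overhead_cycles: int = OVERHEAD_CYCLES) -> float:
    """Bit rate on the bus when each frame costs a constant extra delay."""
    data_cycles = -(-frame_bytes // BUS_WIDTH_BYTES)   # ceiling division
    total_cycles = data_cycles + overhead_cycles
    return frame_bytes * 8 * PCI_CLOCK_HZ / total_cycles


if __name__ == "__main__":
    print(peak_bandwidth_bps())                 # 64 bit x 66 MHz
    print(OVERHEAD_CYCLES * BUS_WIDTH_BYTES)    # overhead expressed in bytes
    for size in (64, 128, 200, 300):
        print(size, round(modelled_rate_bps(size) / 1e6, 1))
```

Under this assumption the modelled rate stays well below 1 Gb/s for 64 B frames and exceeds it only for frames of a few hundred bytes, which illustrates how a constant per-frame delay can limit the rate for short frames.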
We have repeated the measurements with different hardware combinations, including an AMD Opteron platform and two NICs from different manufacturers (Broadcom 5700 and Intel PRO/10 GbE). The results are in good agreement with the above measurements.
VIII. CONCLUSIONS
We demonstrate that the IP stack, as implemented in the Linux kernel version 2.6, allows processing of data traffic in a loss-less manner, for datagram above B in length using commodity server equipment based on AMD Opteron or Intel Xeon processors. This high reliability is in particular of importance for DAQ systems relying on protocols without packet retransmission.
For frame lengths below 500 B, we investigated the performance loss. A constant overhead per frame of s on a 66 MHz PCI-X bus has been measured. We attribute this overhead to the cost of setting up a DMA transfer on the PCI bus.
The transmission error rate on copper links in small setups has been measured to be very low, and we are confident that scaling up to the complete system will result in a manageable frame loss.
