Abstract-The interrupt coalescing (IC) technique has been used in general-purpose operating systems to mitigate the Receive Livelock (RL) problem in Gigabit Ethernet network hosts. Schemes for dynamically tuning the interrupt coalescing behavior of a communication interface based on traffic load or system state have been proposed. However, all existing IC schemes are designed using heuristics. In this paper we present an analytical model for the IC technique and carry out a detailed study of existing IC schemes in terms of their performance characteristics, including system goodput, CPU consumption and latency. We validate our analysis through measurement-based experiments.
I. INTRODUCTION
The Receive Livelock (RL) [1] problem, which arises under heavy incoming traffic load in hosts containing Gigabit Ethernet (GbE) network interface cards (NICs), prevents the hosts from benefiting from the higher speed supported by the interfaces [2]. Interrupt coalescing (IC) is a technique that can be used in general-purpose OSes to mitigate the RL problem. Many new GbE NICs support interrupt coalescing features, and most GbE NIC drivers provide interfaces for manually changing the interrupt generation rate. In addition, some NIC drivers implement adaptive IC schemes. However, all the IC schemes designed so far are based on heuristics. IC schemes designed in an ad hoc manner from local information encounter two problems: (i) there is no way to know whether the system performance can be improved further; (ii) very little is understood about why a scheme can or cannot improve system performance.
The authors in [3] and [4] modeled the packet reception process as a Markov process and then analyzed the interrupt disable-enable scheme and the IC technique, respectively, based on queueing theory. Other authors, such as those of [5] and [6], systematically analyzed NAPI [7] in the Linux 2.6 kernel, an implementation of the device hybrid interrupt-polling scheme proposed in [8]. These models and analyses either cannot capture the characteristics of the IC technique or cannot be directly applied to investigate IC schemes in terms of system goodput, CPU consumption and latency over a wide range of hardware and traffic conditions. Note that we use system goodput and goodput interchangeably in the subsequent sections of this paper. This paper considers the IC techniques implemented in current general-purpose OSes, in which interrupt handling has absolute priority over all other tasks, on commodity off-the-shelf PCs. We first present an analytical model and then study the performance of the IC technique in terms of system goodput, CPU consumption and latency. The analysis gives insight into the behavior of the IC techniques.
Because networking subsystems are implemented differently in different OSes and in different versions of the same OS, we describe our model and analysis in the context of Linux kernel 2.6.20. In addition, driver implementations differ among NICs. Unless otherwise specified, we describe our analysis in the context of the Intel Pro/1000 GbE NIC driver, namely the e1000 driver.
The rest of this paper is organized as follows. Section II presents background. In Section III we present the model and analysis for IC technique and the existing IC schemes. Section IV gives the results of a series of performance tests. Section V discusses the conclusions and describes future work.
II. BACKGROUND
In this section, we first describe how a purely interrupt-driven networking subsystem processes incoming packets. Then we describe interrupt coalescing.
A. Description of a Purely Interrupt-driven Networking Subsystem
From the software point of view, Fig. 1 describes how a packet travels from the DMA ring (in RAM) to the intended recipient in a purely interrupt-driven networking subsystem. An interrupt is generated right after an incoming packet is put into the DMA ring or a packet is sent out. When the CPU receives the interrupt signal due to an incoming packet, it invokes the RX interrupt service routine (ISR) to remove packets from the DMA ring and put them into a temporary queue. To improve performance, an ISR invocation can handle more than one packet in the DMA ring, but the number of handled packets, denoted by B_i, is limited in order to avoid long interrupt handling. If no interrupt signal arrives right after the RX ISR execution, the RX softirq service routine (SSR) is invoked for protocol processing: it takes packets from the temporary queue, passes them through the protocol stack and puts them into rcv_buffer.
In the rest of the paper, by interrupt we mean a hardware interrupt; by softirq we mean a software interrupt. Protocol processing typically involves IP protocol processing and transport protocol processing such as TCP, UDP and SCTP. Part of the SSR is implemented in the NIC driver. The protocol processing is non-preemptible but can be interrupted. In Linux 2.6, the RX SSR execution stops when it exceeds a given limit on execution time, when it exceeds a given limit on the number of processed frames, or when another interrupt, such as a timer interrupt, occurs. Without loss of generality, we disable the first condition in the following. We use Budget (B_p) to denote the maximum number of packets processed in an RX softirq invocation. Its value is fixed in the Linux 2.6 kernel. Note that the RX SSR may be interrupted during its execution and continue executing after the interrupt handling. This situation is not a new RX softirq invocation; it merely continues the last invocation.
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2008 proceedings.
If there is still no interrupt signal, or if no packet remains in the DMA ring for protocol processing, or if the protocol processing of B_p packets is complete, application-level processing begins. Because the ISR and SSR have priority over normal applications, fewer CPU cycles may be left over for the applications under heavy traffic load. In these circumstances a receive livelock occurs.
rcv_buffer has different meanings in different contexts. If the computer forwards packets, rcv_buffer denotes the incoming buffer of another NIC. If the computer delivers packets locally, as in packet capture, rcv_buffer denotes the socket receive buffer (or a predefined buffer). User-space applications pick up and process the packets from rcv_buffer. We study the second case in the rest of the paper.
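The receive path described in this subsection can be sketched as a toy model. The Python below is an illustration under stated assumptions, not kernel code: the function names rx_isr and rx_softirq, the deque-based queues and the drop-on-full behavior of rcv_buffer are hypothetical; only the per-invocation limits B_i and B_p come from the text.

```python
from collections import deque


def rx_isr(dma_ring, tmp_queue, b_i):
    """Move at most b_i packets from the DMA ring to the temporary queue."""
    moved = 0
    while dma_ring and moved < b_i:
        tmp_queue.append(dma_ring.popleft())
        moved += 1
    return moved


def rx_softirq(tmp_queue, rcv_buffer, b_p, rcv_capacity):
    """Protocol-process at most b_p packets from the temporary queue.

    Assumption: a packet that finds rcv_buffer full is dropped.
    Returns (packets processed, packets dropped).
    """
    processed = dropped = 0
    while tmp_queue and processed < b_p:
        pkt = tmp_queue.popleft()
        processed += 1
        if len(rcv_buffer) < rcv_capacity:
            rcv_buffer.append(pkt)
        else:
            dropped += 1
    return processed, dropped
```

For example, with ten packets in the ring, B_i = 4 and B_p = 3, one ISR invocation followed by one softirq invocation leaves six packets in the ring, one in the temporary queue and three in rcv_buffer.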
B. Description of Interrupt Coalescing
Interrupt coalescing is a hardware technique that groups multiple packets into a single interrupt to limit the rate of interrupt generation by the NIC. Thus, instead of generating an interrupt for every incoming packet, the NIC waits for a predefined number of packets to accumulate in the incoming queue before generating an interrupt. There are five kinds of timers in Intel® GbE controllers [9] for IC. The most commonly used is the interrupt throttling timer (ITT), a simple countdown timer that is supported in most GbE NICs. The ITT works independently of any interrupt source and is not affected by any network events [9]. Fig. 2 (adapted from [9]) describes how the NIC controller generates interrupts when the ITT is enabled. The GbE NIC controller blocks all interrupt sources until the ITT expires. If interrupt events are pending when the ITT expires, the GbE controller generates an interrupt. When the countdown reaches zero, the ITT resets and restarts its countdown.
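The ITT behavior described above can be illustrated with a small simulation. This is a sketch under assumptions rather than a model of any particular controller: the function name itt_interrupts is hypothetical, times are integers in microseconds, the timer is taken to start at t = 0 and expire at every multiple of the period, and a packet arriving exactly at an expiry is counted in that interrupt.

```python
def itt_interrupts(arrivals_us, period_us):
    """Return a list of (interrupt time, packets coalesced) pairs.

    arrivals_us: sorted, non-negative integer packet arrival times (us).
    period_us:   positive integer ITT period T (us).
    """
    interrupts = []
    i, n = 0, len(arrivals_us)
    while i < n:
        # Index of the first ITT expiry at or after the oldest pending
        # packet (ceiling division; the first expiry is at t = period_us).
        k = max(1, -(-arrivals_us[i] // period_us))
        expiry = k * period_us
        # Every packet that arrived by this expiry shares one interrupt.
        j = i
        while j < n and arrivals_us[j] <= expiry:
            j += 1
        interrupts.append((expiry, j - i))
        i = j
    return interrupts
```

With arrivals at 10, 20, 30, 250, 260 and 900 us and T = 100 us, the sketch raises three interrupts (at 100, 300 and 900 us) instead of six, and no interrupt is raised at expiries with nothing pending.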
III. MODELING AND ANALYSIS OF INTERRUPT COALESCING
The Receive Livelock problem occurs because of the large amount of packet processing overhead incurred before a packet arrives at the recipient. Removing the temporary queue, as in NAPI, can reduce some of this overhead. Without loss of generality, we consider a networking subsystem in which (i) the RX ISR does nothing more than post a softirq and (ii) the RX SSR takes packets from the DMA ring and puts them into rcv_buffer. This is slightly different from the networking subsystem described in Section II. In this section we first present a model for IC and then its analysis. Finally, we evaluate some existing IC schemes.
Before continuing, we list the assumptions used in the subsequent analysis: (i) all arriving packets have the same size and arrive at a constant rate; (ii) rcv_buffer size > DMA ring size (B_DMA) > B_p; (iii) there is only one logical CPU; (iv) the effect of the timer interrupt is ignored. Ignoring the timer interrupt does not affect our results but simplifies the discussion. Some variables used in this section are defined below:
Notation   Description
T          the time interval between two consecutive interrupts generated due to ITT expiration
T_p        the CPU time for protocol processing of a packet
T_a        the CPU time for user-space application processing of a packet, including the cost of the system call; when cache misses occur, T_a also includes the CPU cost of those misses
A. Modeling and Analysis of IC
The packet interarrival time, 1/λ, can lie in (T_p, +∞) or in (0, T_p]. In the second case, if B_i and B_p are infinite, no CPU cycles are left for application processing no matter how T is adjusted, owing to the priority-setting policy and the scheduling policy [10] in Linux 2.6. In this subsection we consider only the case (T_p, +∞); the second case is discussed in the next subsection.
The packet processing timeline is depicted in Fig. 3, in which a round, denoted by Γ, is the time interval between two consecutive interrupt signals that arrive while the CPU is not doing protocol processing. If the CPU is doing protocol processing when an ITT interrupt signal arrives, the protocol processing is suspended and resumes after the RX ISR is done. Application processing occurs only when there is no packet in the DMA ring. T_o is fixed for a specific environment and is determined by the hardware and software configurations. In a common PC architecture, the data to be processed by the CPU must be copied to the page cache and the L2 cache if it is not already there. We refer to the CPU cycles consumed in this copying process, during which the CPU is idle and cannot do other work in a uniprocessor system, as memory-jump consumption. The number of CPU cycles consumed for any packet from its arrival at the NIC to its delivery to the application consists of two parts: memory-jump consumption and code execution consumption. Code execution consumption in protocol processing is the same for every incoming packet, and so is memory-jump consumption, because any packet from the DMA ring is new to the CPU and must therefore be transferred to the L2 cache through a cache miss. Thus, T_p can be regarded as fixed. Although code execution consumption in application processing is the same for every packet, memory-jump consumption in application processing may vary when k_p is very large. Part of the reason is as follows. Some or all of the packet information required for application processing may already have been placed in the L2 cache. When k_p is not large, this information is still in the L2 cache at the beginning of application processing, so there are fewer cache misses. But if k_p is large, the L2 cache area occupied by this information may have been reused by other packets.
This usually occurs under heavy small-packet traffic load. Our experimental results show that the effect of the change in T_a on G is not significant. The following analysis assumes that T_a is also unchanged.
In the following, we use Γ_m to denote Γ, η_m to denote η, T_m to denote T and k_p^m to denote k_p in the m-th round. The problem of maximizing the system goodput can be formally specified as:
Proposition 2. If PCI bus is not a bottleneck and
, there is no packet dropped in the NIC. DMA ring must overflow in the protocol processing (by the assumption that PCI bus is not a bottleneck). Thus,
and then γ > Tλ, which is a contradiction. For a given traffic load, the packet arrival rate is constant. Thus the conclusion follows.
Lower CPU consumption means that (i) more power is saved and (ii) the CPU can serve more applications. When η > 1, it is difficult to analyze the relationship between the system performance and T because η is unpredictable. However, when 1/λ > 2T_p or B_p = Tλ, the system becomes predictable. In the following we investigate the characteristics of the system under 1/λ > 2T_p or B_p = Tλ.

B. Analysis under 1/λ < T_p
In this case G can be improved by adjusting B_p and T. For a given T, B_p is adjusted until rcv_buffer does not overflow. The larger T is, the larger G is, when cache-miss overhead is ignored.
C. Analysis of some Existing Interrupt Coalescing Schemes
The above analysis indicates:
(1) An effective combination of adjustments to B_p and T can produce better performance. The experimental data in [3] show that it is reasonable to assume T_p = 2 µs for UDP traffic on the C1 computer. The hardware and software configurations of C1 are given in Section IV. Thus, for traffic with Ethernet frame sizes larger than 500 bytes, 1/λ > 2T_p. That is, for such traffic, G can be improved by increasing T (by Proposition 9). The special case B_p = Tλ under 1/λ < 2T_p also has the properties stated in Proposition 9. To reduce the overhead caused by computing λ, we could set B_p = T × the incoming rate of 490-byte traffic. In the following experiments, we use this equation to set B_p whenever T is changed.
(2) Adjusting T based only on the packet arrival rate (λ) is incomplete (by Eqs. (2) and (3)) because it ignores the application workload. The dynamic adjustment scheme in the e1000 7.3.x driver is such a scheme. It defines three interrupt generation rates based on experiments and decides which rate to use according to the incoming traffic pattern in the last timeframe. To avoid receive livelock, the rate is always set for the worst-case scenario, so other performance measures such as system response are degraded.
(3) Proposition 3 shows that it is correct to adjust T based on whether rcv_buffer overflows. However, how this information is applied matters in uniprocessor systems, where the providing and consuming processes of the socket buffer run asynchronously. The authors in [11] proposed periodically adjusting the maximum interrupt rate based on the socket buffer utilization. It is possible that the buffer utilization is zero at the end of the consuming process and that this value is what is sampled each time the interrupt rate is adjusted, even though the buffer utilization is high at the end of the providing process. The adjustment is then wrong, which leads to goodput fluctuation. The authors in [6] proposed an adaptive IC scheme that adjusts T based on the maximum rcv_buffer utilization in the sampling interval and thus avoids the goodput fluctuation.
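The sampling pitfall in item (3) can be made concrete with a toy trace. The sketch below is an assumption-laden illustration, not the code of the schemes in [11] or [6]: the function names are hypothetical, and the uniprocessor is modeled as strictly alternating a providing phase (a burst of packets enters rcv_buffer) with a consuming phase (the application drains the buffer completely).

```python
def occupancy_trace(rounds, burst, capacity):
    """Buffer occupancy after every step of alternating fill/drain phases."""
    trace = []
    level = 0
    for _ in range(rounds):
        for _ in range(burst):            # providing phase (softirq)
            level = min(capacity, level + 1)
            trace.append(level)
        while level > 0:                  # consuming phase (application)
            level -= 1
            trace.append(level)
    return trace


def sample_last(trace, interval):
    """Occupancy read once, at the end of each sampling interval."""
    return [trace[i + interval - 1]
            for i in range(0, len(trace) - interval + 1, interval)]


def sample_max(trace, interval):
    """Maximum occupancy observed within each sampling interval."""
    return [max(trace[i:i + interval])
            for i in range(0, len(trace) - interval + 1, interval)]
```

With bursts of four packets and a sampling interval that happens to coincide with the end of each consuming phase, point sampling always reads an empty buffer while the per-interval maximum reads four, which is the discrepancy the text attributes to periodic sampling.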
IV. PERFORMANCE EVALUATION
Our experimental platform, shown in Fig. 4, consists of two end systems labeled C1 and C2, connected by a Gigabit switch. Their configurations are given in Table I. The PCI buses of C1 and C2 are not bottlenecks. All computers run Asianux 2.0 [12], with the kernel upgraded to 2.6.20. Unless otherwise specified, HyperThreading (HT) is disabled. To validate our analysis, we implement the networking subsystem described in Section III by enabling NAPI and removing the operation that disables the RX interrupt. C2 is used as the packet generator, sending out as many packets as possible so that full load on C1 is sustained. All traffic is UDP/IP based, so that the packet generation rate is not affected by the flow control and congestion avoidance algorithms of TCP.
Only one application runs on C1. To emulate application-level packet processing such as storing, the application on C1 performs 200 floating-point multiplications before dropping each received packet.
In one softirq invocation in Linux kernel 2.6.20, each SSR is executed a given number of times (set to 10 by default), and at most netdev_budget packets are protocol-processed in an SSR execution. In the following experiments, we set this number to 1, so that B_p = netdev_budget. We set netdev_max_backlog=50000 to avoid its overflow. In addition, the kernel by default stops protocol processing when the protocol processing time exceeds 1 ms. We remove this limitation. Other parameters are set as follows: rcv_buffer = 8000000 bytes, DMA ring count = 4096.
The performance metrics are goodput, CPU consumption and latency. Goodput is defined as the rate at which packets are successfully delivered to and processed by the intended recipients. We compute goodput according to the packet size, which is the byte count in the length field of the IP header. CPU consumption is evaluated by the CPU idle percentage. Latency is evaluated by the Round Trip Time (RTT) of ICMP packets.
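A receiver of this kind could look roughly like the following. This is a hypothetical sketch, not the application used in the experiments: the function names, the multiplier constant and the use of a Python UDP socket are assumptions, and the 8000000-byte default simply mirrors the rcv_buffer setting used in the experiments (the operating system may cap or reject the requested size).

```python
import socket


def emulate_processing(n_mults=200):
    """Perform n_mults floating-point multiplications, as a stand-in workload."""
    x = 1.0
    for _ in range(n_mults):
        x *= 1.0000001
    return x


def receive_and_drop(port, max_packets, rcvbuf_bytes=8_000_000):
    """Receive UDP packets, run the emulated workload, then drop each packet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Request a large socket receive buffer; the OS decides the actual size.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    sock.bind(("", port))
    count = 0
    try:
        while count < max_packets:
            sock.recv(65535)          # payload is discarded after processing
            emulate_processing()
            count += 1
    finally:
        sock.close()
    return count
```

Calling receive_and_drop(9000, 1_000_000), for instance, would block until one million datagrams have arrived on port 9000; the port number here is arbitrary.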
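The rule from Section III.C, B_p = T × the incoming rate of 490-byte traffic, and the condition 1/λ > 2T_p can be worked through numerically. The sketch below rests on assumptions: a 1 Gb/s line rate, T_p = 2 µs as quoted for C1, and no preamble or inter-frame gap by default (overhead_bytes=0), which matches the 500-byte threshold stated in the text. The function names are illustrative, and reading "490-byte traffic" as the on-wire size is also an assumption.

```python
def packet_rate(frame_bytes, link_bps=1_000_000_000, overhead_bytes=0):
    """Packets per second (lambda) at line rate for a given frame size."""
    return link_bps / (8 * (frame_bytes + overhead_bytes))


def interarrival_exceeds_2tp(frame_bytes, t_p=2e-6, **link):
    """True when 1/lambda > 2 * T_p for traffic of this frame size."""
    return 1.0 / packet_rate(frame_bytes, **link) > 2 * t_p


def budget(t_seconds, ref_bytes=490, **link):
    """B_p = T x incoming rate of ref_bytes traffic, rounded down."""
    return int(t_seconds * packet_rate(ref_bytes, **link))
```

Under these assumptions, 512-byte frames satisfy 1/λ > 2T_p while 256-byte frames do not, and a maximum interrupt rate of 2000 per second (T = 500 µs) gives budget(0.0005) = 127 packets as the value to write to netdev_budget.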
A. Effect of T on G and on CPU Consumption
The experiments in this subsection investigate the effect of T on G on the C1 computer for different packet sizes. We vary the maximum interrupt rate from one interrupt per incoming packet (PIPT) to 2000 per second. We experiment with six packet sizes, viz. 64, 128, 256, 512, 1024, and 1500 bytes. Fig. 5 shows the goodput versus 1/T for the different packet sizes. P64 denotes the results for a packet size of 64 bytes, and likewise for P128, P256, P512, P1024, and P1500. Fig. 6 plots the CPU idle time percentage versus 1/T for packet sizes of 1500, 1024 and 512 bytes. For packet sizes of 64, 128 and 256 bytes, the CPU is always busy, so those results are not shown. The results confirm Proposition 9. We observe that G decreases when 1/T is small for small-packet traffic, which may be due to cache misses.
B. Ping Latency
The experiments in this subsection investigate the effect of T on the system latency of C1. We measure the Round Trip Time (RTT) of ICMP packets using ping. C2 sends as many 1500-byte packets as possible to C1, and we ping C1 from another computer with a GbE NIC. Fig. 7 shows the variation of ping latency over time. We observe that the larger T is, the larger the fluctuation in ping latency and the larger the average latency. The latency is lowest when the ICMP packet arrives just before protocol processing begins.
V. CONCLUSIONS AND FUTURE WORK
In this paper we present a model to analyze the IC technique in terms of goodput, CPU consumption and latency. The analysis gives insight into the IC technique. In addition, the analysis provides guidelines for designing efficient NIC drivers and for manually adjusting networking subsystem parameters in current general-purpose OSes on commodity off-the-shelf PCs connected to high-speed networks.
Note that the analysis in this paper is only validated through realistic experiments. We plan to implement a simulator in order to validate the analysis in detail. In addition, the work in this paper only analyzes the system behavior in the case of η=1. We plan to investigate the system behavior under infinite B_p in the future.
