Abstract-For a long time, high-speed packet processing was reserved for specialized hardware devices, since software-based solutions were not able to achieve the required performance. However, off-the-shelf packet processing hardware and software have improved over the last years, which is why software-based solutions can cope with high-speed traffic nowadays. Due to the flexibility of software, there is a trend towards doing packet processing in software, e.g. using OpenFlow or virtual switches. Although packet processing in software offers many capabilities, the complexity of such software-based solutions makes it hard to evaluate, optimize, or predict the networking performance of servers, end-user hosts, or routers. We present a study that investigates the packet latency caused by packet processing in the Linux network stack. We develop a simulation model in ns-3 for packet processing via the Linux network stack that helps to understand its performance implications. We validate our simulation model based on measurements with nanosecond accuracy and software profiling.
I. INTRODUCTION AND RELATED WORK
Networking and Internet access are core features of any modern operating system (OS). Therefore, the OS needs to provide a network stack that is responsible for processing incoming and outgoing packets. The high complexity of an OS makes it hard to analyze and predict the packet processing performance and to characterize performance guarantees. Specialized networking hardware, such as routers and switches, is optimized for high-speed packet processing and meets specified performance guarantees. Nevertheless, commodity hardware can be turned into routers, switches, firewalls, and other packet processing systems using software implementations, which makes such systems both more cost-efficient and more flexible while still being able to scale up to high-speed traffic [1], [2].
As previous works have shown, the central processing unit (CPU) is the performance bottleneck for packet processing systems [1]-[5]. This bottleneck can be mitigated by efficient packet processing software [6]-[8]. For example, interrupt moderation techniques reduce the number of interrupts the CPU needs to handle [9], [10]. Other approaches use the massively parallel processing capabilities of dedicated graphics processing units [11].
In order to optimize packet processing in software, we need to identify the performance bottlenecks that slow down this process. Therefore, measurements in network testbeds are used to analyze the performance of packet processing in a controlled environment. As black box measurements [12], [13] only provide a rough understanding and white box measurements have to deal with side effects caused by measurement software, previous work has addressed the challenges of measurements using a careful combination of both approaches [1], [14]-[16]. Carlsson et al. [12] presented a latency measurement setup for black box measurements of routers according to RFC 2679. Rotsos et al. [13] utilized FPGAs for accurate software switch latency measurements. Bolla and Bruschi [14] presented a detailed study of a Linux kernel 2.6 based PC and performed RFC 2544 conformance tests by means of a special network device testing box. The dedicated device testing box allowed latency to be measured with microsecond accuracy. Dobrescu et al. [1] made analytical estimations for packet processing latency and combined them with the number of CPU cycles per packet that was measured with software profiling. Tedesco et al. [15] applied a queuing model to distribute measured latencies to different internal processing steps in a PC. Recently, Emmerich et al. [16] described practical know-how on throughput measurements of Open vSwitch (OvS), in which they dealt with the problem of results being distorted by the measurement itself.
Another approach to gain insights into the internals of complex PC-based packet processing systems is modeling and simulation. To this end, Chertov et al. [17] presented a model of forwarding devices that can be configured to simulate the behavior of different devices. Kristiansen et al. [18] proposed a model for the packet processing overhead resulting from software. Meyer et al. [5] modeled resource contention in a server to analyze parallel packet processing with multiple cores.
In this paper, we measure and simulate the packet latency which is introduced by networking software with nanosecond accuracy. For this purpose, we analyze a NIC driver and the OS mechanisms with respect to the packet processing based on commodity hardware. Within the following analysis we distinguish three categories of tasks in network packet processing in PC systems: packet reception, application-specific processing, and packet transmission. We put our focus on the packet reception and transmission that is carried out by network stacks provided in OS and the network interface card (NIC) drivers. We modeled these mechanisms and extended our previously proposed ns-3 resource management module [4] accordingly. Our extended simulation model enables us to predict the packet latency of software routers. Compared to a real software router, our simulation model can be modified with relatively little effort, which in turn facilitates the development of latency reducing software optimizations of the OS networking and NIC driver.
978-1-4799-5804-7/15/$31.00 © 2015 IEEE
The rest of this paper is organized as follows. Section II reviews the developments in packet processing on PC hardware and introduces concepts to optimize the packet processing performance. Section III gives a step-by-step explanation of the Linux network stack and its interaction with the NIC driver. In Section IV we present the setup which was used for measurements in our testbed. We introduce our simulation model in Section V. Section VI contains the calibration and validation of our simulation model based on the results of our testbed measurements and simulations. We summarize our results in Section VII.
II. PC-BASED PACKET PROCESSING
In the following we discuss hardware and software techniques that are relevant to mitigate different potential bottlenecks in PC-based packet processing systems.
A. Hardware
On the hardware side, the high network performance of PC systems can be attributed to two main developments: (1) Connections of hardware components and subcomponents in PC architectures and the related interaction processes are optimized, interactions are bundled and thus reduced. (2) The hardware underwent changes to cope with the growing number of cores and to shift workload from the general purpose CPU to dedicated components.
The first development results in a dedicated memory controller that provides direct memory access (DMA) for the NICs. A descriptor pointing to the memory region that contains the packet is stored in a queue-like structure, and the NIC informs the OS via an interrupt that a packet is ready to be processed. Even proactive copying of data to CPU caches [19] is common today. Bus systems like PCIe allow for increased maximum data rates with each new version.
As a result of the second development, the NICs (e.g. [20]) can distribute packets to different CPU cores via programmable hardware filters or static hash-based criteria. Even direct handover of specified flows to subsequent software processing steps by the NIC is common today. NICs also provide capabilities for packet segmentation, checksum calculation, and coalescing of interrupts that follow each other in quick succession (interrupt moderation) in order to offload these additional processing tasks and interrupt handling routines from the CPU.
B. Software
The actions for handling a packet after the NIC informed the OS are determined by software. While the driver coordinates the interaction with the NIC, the main functionality is provided by the OS which abstracts it via interfaces: in Linux this interface is called New API (NAPI), in Windows this is called Transport Device Interface (TDI). In the following we will focus on the Linux NAPI.
Legacy NIC drivers cause NICs to trigger an interrupt request (IRQ) for every incoming packet, which leads to IRQ storms in high load situations. As this overhead keeps the CPU from the actual packet processing, a new network interface, the NAPI, was introduced in Linux kernel 2.6. It allows compliant NICs and drivers to mitigate IRQs and thereby reduce the system load. Furthermore, the NAPI introduces a polling mechanism which enables the NIC driver to fetch multiple packets from an input queue while IRQs are disabled. Besides, it favors packet throttling in overload situations by dropping packets early, directly in the NIC.
Modified drivers that do not conform to the NAPI and take polling to the extreme (e.g. busy-wait), like those of the Click Modular Router [21], achieve higher packet rates in comparison. However, they suffer from drawbacks like a permanently high CPU load due to active polling for packets even if no packets were received. Frameworks that fully rely on polling mitigate this by providing techniques to downscale the CPU frequency [8].
Depending on the purpose of the software on top of the OS network stack the relation of received traffic to the transmitted traffic can be of any type: 1:N (e.g. game server), 1:1 (e.g. firewall), N:0 (e.g. monitoring), 0:N (e.g. actor nodes), etc. The processing costs per packet vary from a constant number of CPU cycles per packet up to totally unpredictable per packet costs [22] . Processing is determined by software until a descriptor is placed in a Tx queue of the outgoing interface and the next processing steps are again performed by the egress NIC.
The applications can either run in kernel or in user space context. User space applications require additional copying which introduces extra overhead for each context switch between user and kernel space. Kernel space applications entail the risk of system crashes due to programming bugs. They require careful development and extensive testing due to the additional challenge of running in the kernel. Besides the Linux network stack, packet processing frameworks like DPDK [8], PF_RING [7], and netmap [6] exist, which replace the default network stack for certain purposes. These alternative network stacks can also be used in monitoring systems [22], web or game servers [23], (software) switches, routers, (software) firewalls, or workstations [24].
There are currently four different approaches to optimize the performance of the network stack: (1) Avoiding the copy operation between kernel and user space processes by mapping buffer regions. (2) Preallocating packet buffers, which are not adapted afterwards and remain as initially configured, to avoid any overhead. (3) Introducing polling instead of interrupts. (4) Processing batches of packets with one API call on reception and sending to distribute the per-call overhead over a larger number of packets. Although these techniques allow high throughput rates, they introduce drawbacks: (1) The lack of a standardized API through which applications can access the network increases the implementation complexity of applications. (2) Static buffer sizes prevent adaptive reactions to filled buffers, which introduce extra latencies in overload scenarios. (3) Continuous polling prevents the CPU from sleeping and counteracts power saving features that are desired in low load scenarios. The general purpose Linux network stack incorporates these techniques that aim at high packet rates but tries to satisfy any application by making trade-offs between performance, usability, functionality, and power saving. The NAPI needs to perform well not only in providing high packet rates but also in low load scenarios, so specialized approaches outperform it in selected performance metrics, like the maximum packet rate. Nonetheless, the NAPI is widely used today in end systems and servers due to its generality.
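The benefit of batching (approach 4) can be illustrated with a short calculation. The following Python sketch amortizes a fixed per-call cost over a batch of packets; the function name and the cycle counts are hypothetical and chosen only for illustration, they are not measurements from this study.

```python
def per_packet_cost(per_call_cycles: float, per_packet_cycles: float,
                    batch_size: int) -> float:
    """Average CPU cycles per packet when one API call handles a batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The per-call overhead is shared by all packets of the batch.
    return per_call_cycles / batch_size + per_packet_cycles


# Hypothetical costs: 1000 cycles per call, 500 cycles per packet.
for batch in (1, 8, 64):
    print(batch, per_packet_cost(1000.0, 500.0, batch))
```

With larger batches the average cost approaches the pure per-packet cost, which is why batching mainly helps at high packet rates.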
III. LINUX-BASED PACKET PROCESSING
In this section, we describe the packet processing of the Linux kernel in detail. This includes the NAPI, the interaction between the NAPI and the NIC driver, and the Interrupt Throttling Rate (ITR) as an important configuration parameter for the packet processing.
A. NAPI
NAPI-based packet processing includes the steps depicted in Fig. 1 which are described in the following:
1) The DMA engine copies a packet from the receiving NIC hardware Rx Queue to a dedicated Input Queue in the main memory.
2) The NIC triggers a hardware IRQ (Input Queue specific) which is then served by the assigned CPU core. The mapping between hardware IRQs and CPU cores can be statically assigned.
3) The IRQ Handler, which processes the hardware IRQ, enqueues an entry referring to the Input Queue into the poll list of the assigned CPU core (napi_schedule()). Each CPU core maintains a dedicated poll list to manage a set of Input Queues, whereby an Input Queue can only occur in one poll list. Finally, the IRQ Handler schedules a so-called soft IRQ in order to defer the packet processing from interrupt context [25].
4) The soft IRQ Scheduler completes the soft IRQ and invokes the networking functionality (net_rx_action()).
5) net_rx_action() peeks the first entry of the poll list and initiates the poll (poll()). Since the implementation of poll() is driver-specific, we discuss its behavior in the next section. The poll() function is responsible for fetching the packets from the Input Queue and pushing them to the higher layers of the network stack.
6) The poll returns due to one of the following reasons:
(a) The corresponding Input Queue is empty.
(b) poll() yields after processing a certain quota of packets (poll size) to prevent other Input Queues from starving and the algorithm continues with step 8.
7) The respective entry is removed from the poll list (napi_complete()), the poll finishes, and the algorithm continues with step 9.
8) The current poll is suspended although the Input Queue is not empty. The respective entry is re-enqueued into the poll list in a round-robin manner.
9) If the poll list still contains entries, the NAPI continues with step 5. Otherwise, the algorithm ends.
The handling of net_rx_action() is limited to a budget of processed packets as well as to a timeout in order to share the CPU core with competing device drivers or concurrent processes. If the budget is exceeded or if a timeout occurred, then a corresponding soft IRQ is rescheduled before the CPU core is released.
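The round-robin polling of steps 5 to 9 together with the budget can be sketched as follows. This is a simplified Python illustration, not kernel code: the function names mirror the kernel functions only for readability, the budget value is a hypothetical placeholder, the poll size of 64 is the ixgbe value given in Section VI, and the timeout is omitted.

```python
from collections import deque
from typing import Tuple

POLL_SIZE = 64  # per-poll quota (ixgbe value quoted in Section VI)
BUDGET = 300    # hypothetical budget of net_rx_action(), for illustration only


def poll(queue: deque, quota: int) -> int:
    """Stand-in for the driver poll: serve up to `quota` packets."""
    served = 0
    while queue and served < quota:
        queue.popleft()
        served += 1
    return served


def net_rx_action(poll_list: deque, budget: int = BUDGET) -> Tuple[int, bool]:
    """Serve Input Queues round-robin; return (packets served, reschedule?)."""
    total = 0
    while poll_list:                    # step 9: continue while entries exist
        if total >= budget:
            return total, True          # budget used up: reschedule soft IRQ
        queue = poll_list.popleft()     # step 5: take the first entry
        total += poll(queue, POLL_SIZE)
        if queue:
            poll_list.append(queue)     # step 8: re-enqueue, queue not empty
        # otherwise step 7: the entry stays removed from the poll list
    return total, False
```

For example, with one queue holding 100 packets and one holding 10, the first queue is served in two polls (64 and 36 packets) with the second queue served in between.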
B. NIC driver
To illustrate the interaction between the NAPI and the NIC driver, we examine the open source ixgbe driver, which supports Intel's current 10 GbE NICs. The ixgbe driver implements the Input and Output Queues as Rx and Tx rings, which are contiguously allocated memory blocks of descriptors. These descriptors point to the actual packet buffers and are used by the DMA engine to copy packet data from the NIC to the main memory (Rx) and vice versa (Tx). When the NIC receives a packet, it checks whether there are free Rx descriptors available. If there is a free Rx descriptor, the packet is stored in one of the Rx Queues. The NIC's Board Logic then fetches a clean Rx descriptor from the Rx ring and transfers the packet via DMA into the associated buffer in the main memory. In case the system is overloaded and cannot process all incoming packets, the Rx Queues will fill up. When there are no more clean Rx descriptors available, arriving packets are dropped by the NIC, which prevents further system overload. A detailed view of the most important steps performed by the ixgbe driver's poll() function is provided in Fig. 2. A feature of the ixgbe driver is that IRQs can be shared by Rx and Tx rings to further reduce the number of IRQs. This means an IRQ can either indicate that a packet has been received and has to be handled, or that a packet has been transmitted and the Tx ring must be cleaned. For this reason poll() is split up into the two following phases:
• Tx clean: In this phase the driver cleans Tx descriptors from the Tx ring. Cleaning is required because packets which have already been sent by the NIC still reside in the main memory until they are freed. The driver may clean up to 256 Tx descriptors consecutively.
• Rx clean: This phase starts with the recycling of Rx descriptors, which has to be done before they are returned to the hardware. Then, a Rx descriptor is read from the Rx ring in order to fetch a packet. Afterwards, a socket buffer structure (SKB) is created, which encapsulates the corresponding packet while it is processed by the Linux kernel. After several sanity checks, the processing of the SKB is initiated. In an unmodified software router the packets are processed by the native Linux network protocol stack. However, it is possible to replace the native protocol stack by other modules (e.g. by OvS like it is done in this case study). In any case, the applied routing software has to determine the outgoing interface as well as the appropriate output queue. Finally, the SKB is scheduled for transmission. For this purpose a Tx descriptor is prepared in the Tx ring.
If there are more packets available on the Rx ring and if the poll size is still not reached, then the driver continues with the Rx clean phase and starts a further iteration beginning with the recycling of Rx descriptors. Otherwise, the poll returns to the NAPI (cf. Section III-A, step 7). However, before the poll finally returns to the NAPI it is checked if the Tx and Rx rings were cleaned. If this is the case, then the respective IRQ is re-enabled. If the dynamic ITR (cf. Section III-C) is enabled, then the NIC is set to the new ITR value in interrupts per second (ips).
Fig. 2. NAPI in conjunction with the ixgbe NIC driver
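The two phases of poll() can be summarized in a small Python sketch. It is a strongly simplified illustration and not the driver code: the function name, the ring representation as deques, and the forward callback are hypothetical, whereas the limit of 256 Tx descriptors is taken from the description above and the poll size of 64 from Section VI.

```python
from collections import deque
from typing import Callable

TX_CLEAN_LIMIT = 256  # Tx descriptors cleaned consecutively (from the text)
POLL_SIZE = 64        # ixgbe poll size quoted in Section VI


def ixgbe_like_poll(rx_ring: deque, tx_done: deque, tx_ring: deque,
                    forward: Callable) -> bool:
    """Run one poll; return True if both rings were cleaned completely."""
    # Tx clean: free descriptors of packets the NIC has already sent.
    cleaned = 0
    while tx_done and cleaned < TX_CLEAN_LIMIT:
        tx_done.popleft()
        cleaned += 1
    # Rx clean: fetch packets, process them, and schedule the transmission.
    handled = 0
    while rx_ring and handled < POLL_SIZE:
        packet = rx_ring.popleft()
        tx_ring.append(forward(packet))  # prepare a Tx descriptor
        handled += 1
    # Only if nothing is left may the IRQ be re-enabled.
    return not tx_done and not rx_ring
```

A return value of False corresponds to the case in which the poll has to be continued before the IRQ is re-enabled.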
C. Interrupt Throttling Rate
Software-based packet processing can be configured by several parameters. When using the ixgbe driver, one of the most important parameters regarding the packet latency is the ITR. The ITR defines an upper bound of IRQs per second for a set of Tx and Rx rings. The ITR relies on a timer which is set to a value of 1/ITR after an IRQ was asserted. Until the timer expires, no further IRQs can be generated. If a packet transmission or reception happened before the timer expired, the IRQ is fired when the timer expires. Otherwise the next reception or transmission event immediately causes an IRQ. The ITR can be configured as static, dynamic, or disabled.
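The timer behavior described above can be made concrete with a short Python sketch that derives the IRQ instants from a list of reception or transmission events. The function name and the example event times are hypothetical; the logic only follows the textual description of a static ITR and ignores all other driver details.

```python
from typing import List


def irq_times(event_times_us: List[float], itr_ips: float) -> List[float]:
    """Return the times (in microseconds) at which IRQs fire for a static ITR."""
    interval_us = 1e6 / itr_ips          # timer value 1/ITR in microseconds
    irqs: List[float] = []
    for t in sorted(event_times_us):
        if irqs and t <= irqs[-1]:
            continue                     # covered by an already deferred IRQ
        if not irqs or t >= irqs[-1] + interval_us:
            irqs.append(t)               # timer expired: immediate IRQ
        else:
            irqs.append(irqs[-1] + interval_us)  # deferred to timer expiry
    return irqs


# An ITR of 8000 ips corresponds to a timer of 125 microseconds.
print(irq_times([0, 10, 50, 125, 130, 400], 8000))
```

In this example the events at 10 and 50 microseconds are served by the deferred IRQ at 125 microseconds, while the event at 400 microseconds arrives after the timer has expired and causes an IRQ immediately.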
Disabling the ITR results in short packet latencies but has a negative impact on the throughput in high traffic load situations, where the CPU is occupied with IRQ handling. Using a static ITR is suitable for manually setting the upper bound of IRQs per second. The increase of the ITR lowers the latency but increases the CPU utilization and lowers the maximum throughput. Hence, the appropriate configuration of the ITR is a trade-off between latency, CPU utilization, and maximum throughput.
With a dynamic ITR, the ITR is adapted according to the current throughput Θ. When a poll finishes, the ITR is recalculated. The three ITR states lowest, low (initial state), and bulk are defined, whereby each ITR state is associated with a specific ITR value as depicted in Fig. 3.
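A possible shape of such a state machine is sketched below in Python. Only the state names, the initial state, and the pairing of a throughput above 20 MB/s with 8 kips (cf. Section VI) are taken from this paper. The lower threshold, the ITR values of the other two states, and the restriction to transitions between neighboring states are assumptions made for illustration and may differ from Fig. 3.

```python
# ITR value per state in interrupts per second. Only the bulk value of 8 kips
# appears in the text; the other two values are illustrative assumptions.
ITR_VALUES = {"lowest": 100_000, "low": 20_000, "bulk": 8_000}

LOW_THRESHOLD_MB_S = 10.0   # assumed threshold between lowest and low
BULK_THRESHOLD_MB_S = 20.0  # threshold mentioned in Section VI

INITIAL_STATE = "low"


def next_itr_state(state: str, throughput_mb_s: float) -> str:
    """Recalculate the ITR state after a poll (assumed neighbor transitions)."""
    if state == "lowest":
        return "low" if throughput_mb_s > LOW_THRESHOLD_MB_S else "lowest"
    if state == "low":
        if throughput_mb_s > BULK_THRESHOLD_MB_S:
            return "bulk"
        if throughput_mb_s <= LOW_THRESHOLD_MB_S:
            return "lowest"
        return "low"
    if state == "bulk":
        return "low" if throughput_mb_s <= BULK_THRESHOLD_MB_S else "bulk"
    raise ValueError(f"unknown ITR state: {state}")
```

Under these assumptions a sustained throughput above 20 MB/s moves the ring from the initial state low to bulk after one poll, which lowers the interrupt rate to 8 kips.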
IV. LATENCY EVALUATION WITH MEASUREMENTS
For our measurements we need a simple packet processing scenario to analyze the NAPI performance. Using a simple layer 2 forwarding scenario which requires a constant amount of CPU cycles per packet and adds a constant latency per packet, we want to minimize the effect of packet processing apart from the NAPI in our measurements. Previously, we have shown that OvS [26] - [28] has a predictable average per packet processing cost in terms of CPU cycles [16] . Therefore, we decided to use OvS as a representative NAPI-based in-kernel packet forwarding application. OvS is part of Linux and is able to operate in layer 2 of the ISO OSI stack but also in higher layers.
A. Measurement Setup
Our test setup is based on recommendations by RFC 2544 [29] . The device under test (DuT) is connected to another system which is responsible for load generation and packet capturing. On the DuT we use the profiling tool perf to gather statistics like the interrupt rate. Profiling measurements were run for five minutes per test to get accurate results. Our tests indicate that running this utility on the DuT introduces an overhead that reduces the maximum throughput by 1 %.
The DuT uses an Intel X540-T2 dual 10 GbE NIC and is equipped with a 3.3 GHz Intel Xeon E3-1230 V2 CPU. We disabled Hyper-Threading, Turbo Boost, and power saving features that scale the frequency with the CPU load because we observed measurement artifacts with these features.
The DuT runs the Debian-based live Linux distribution Grml with a 3.7 kernel and the ixgbe 3.14.5 NIC driver, with interrupts statically assigned to CPU cores. OvS is used in version 2.0.0 with manually created OpenFlow rules to match the traffic.
B. Load Generation and Measurement Accuracy
We use our packet generator MoonGen to generate traffic and to measure latency and throughput. MoonGen uses hardware features of modern NICs to generate constant bit rate traffic with precise inter-arrival times. It also features hardware time stamping with sub-microsecond precision and accuracy [30] . As we have previously shown that the packet size does not affect the throughput of OvS [16] , we use minimally sized Ethernet frames (64 B) to send as many packets as possible at the available line rate of 10 Gbps.
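The packet rate that corresponds to 64 B frames at 10 Gbps can be verified with a short calculation. The Python sketch below uses the standard Ethernet per-frame overhead on the wire (8 B preamble and start frame delimiter plus 12 B inter-frame gap); the function name is chosen for illustration.

```python
PREAMBLE_SFD_BYTES = 8     # preamble and start frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum inter-frame gap


def max_packet_rate(link_bps: float, frame_bytes: int) -> float:
    """Maximum number of frames per second on an Ethernet link."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


rate = max_packet_rate(10e9, 64)
print(f"{rate / 1e6:.2f} Mpps")  # prints "14.88 Mpps"
```

The load generator can therefore offer far more packets per second than a single CPU core of the DuT is able to forward (cf. Section VI), so overload situations can be provoked reliably.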
V. LATENCY EVALUATION WITH SIMULATIONS
Simulations are a cost-effective approach to design, validate, and analyze proposed protocols and algorithms in a controlled and reproducible manner. Our proposed simulation model is able to imitate the packet processing software by means of NAPI and NIC driver behavior in a Linux system (cf. Section III). Besides the prediction of throughput and CPU utilization, our simulation model aims for the prediction of latencies introduced by the packet processing software for any offered load.
In order to simulate the software induced packet latencies of real systems, we model the scheduling of polls defined by NAPI and the dispatching of packets according to ixgbe as described in Section III.
A. Integration into ns-3 Resource Management
We implement our simulation model using the widely used discrete event network simulator ns-3 [31]. In our previous work, we presented a modeling approach for resource-constrained network nodes [4] which we applied to show the linear scaling of multi-core software routers [5]. The OS (i.e. its process scheduling subsystem) is modeled by the resource manager and the actual packet processing is modeled by task units. We implemented this modeling approach as the ns-3 resource management module. Now we extend this modeling approach with respect to the NAPI and NIC driver behavior. Fig. 4 illustrates our simulation model derived from a real Linux-based system with NAPI behavior (cf. Figs. 1 and 4). The resource manager can be seen as the abstraction of the NAPI functionality and a task unit represents the functionality of the NIC driver.
B. Simulation Model
The Resource Manager is responsible for handling IRQs and managing the poll lists. In real systems the interrupt service routine is a high priority task that causes other processes to be suspended during IRQ handling. Our simulation model considers this behavior and consumes a specific amount of simulation time $t_{irq}$ for handling IRQs. The driver-specific behavior of the poll() function is modeled by a Task Unit. As described before, ixgbe's poll function splits up into two cyclic phases (cf. Section III-B). In real systems numerous functions are involved in each phase, but the consecutively performed steps within a phase are always the same. Profiling revealed that the performed steps take a characteristic amount of time. Thus, our model abstracts from these concrete steps and considers each of the two phases as a loop of a single step. These two phases are described in the following:
• Tx clean: The total simulation time consumed in this phase depends on the number of packets that have been transmitted using the Tx ring. Each transmitted packet causes the simulation time to advance by a specific amount of time $t_{clean}$. In order to keep track of the transmitted packets, each Task Unit provides a set of counters (dscr. counter) that are altered when a packet is scheduled for transmission, a packet has been transmitted (tx callback), or the Tx ring is cleaned.
• Rx clean: The maximum number of successively handled packets is limited by the poll size. The model does not consider real packet processing but packet transmission. Thus, the simulation model has to schedule the subsequent transmission events for up to poll size successive packets from the Input Queue. The simulation time between two consecutive transmission events is $t_{proc}$.
Additionally, each Task Unit needs to generate IRQs in order to indicate to the Resource Manager that a queue state has changed. IRQs can be generated on ITR events. An ITR event schedules the next ITR event according to the ITR (cf. Section III-C).
Our simulation model focuses on the sojourn time caused by software, thus we determine the packet latency as the difference between the time when the NIC passed the packet data to the Input Queue (cf. Fig. 4 marker (A) ) and the time when the software inserts the packet into the respective Output Queue (cf. Fig. 4 marker (B) ).
Our model is designed to simulate the latency which is introduced by software. It does not consider realistic DMA or other hardware components which may introduce additional latencies. Nevertheless, we have to consider the additional latency which is caused by hardware. Thus, we introduce a constant offset value $t_{off}$ which is added to the simulated latency. Furthermore, our model currently neglects concurrent processes which might also influence the packet processing. However, due to the strict separation of hardware, OS, and NIC driver in our model, we can easily extend the model by adequate software and hardware sub-models.
VI. MODEL CALIBRATION AND VALIDATION
In this section we discuss our measurement and simulation results. We measured the per-packet processing latency of OvS. The measurements were conducted in a testbed (cf. Section IV) and the simulations were made with our ns-3 resource management extension (cf. Sec. V).
For the calibration and validation of our model we simulate the system and the setting that we use in our real world measurements (cf. Section IV-A). This reflects the standard setup for device benchmarking [2] , [29] : the DuT is loaded with traffic that flows from a load generator to a sink. Both are directly connected to the DuT. In both our simulation and our measurement setup we configured the DuT with the default Tx and Rx ring size of 512. The poll size is hardcoded in the driver source. For the ixgbe driver the poll size is 64 packets. The ITR was set to dynamic ITR scheme. We load the system with traffic of different constant bit rates (CBRs).
A. Model Calibration
Model calibration is the procedure of setting the model parameters in the simulation model with respect to the modeled real system. We calibrate the model according to measurement and profiling results of the DuT in the testbed.
Based on our throughput measurements we obtain a maximum packet forwarding rate of 1.87 Mpps achieved by the DuT with one CPU core. With a CPU frequency of 3.3 GHz this relates to a processing time of 535 ns per packet resp. 1765 CPU cycles per packet. Profiling indicates that 99.4 % of these 535 ns are consumed by poll(), which is composed as follows: 17 % resp. 90 ns correspond to an iteration of the Tx clean phase ($t_{clean}$) and 83 % resp. 441 ns correspond to an iteration of the Rx clean phase ($t_{proc}$). Profiling also revealed that handling an IRQ from the NIC takes ca. 737 CPU cycles resp. 223 ns ($t_{irq}$).
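The calibration values above follow from simple arithmetic, which the Python sketch below retraces. All input numbers are taken from the preceding paragraph; the variable names are chosen for illustration.

```python
CPU_HZ = 3.3e9           # CPU frequency of the DuT
MAX_RATE_PPS = 1.87e6    # maximum forwarding rate with one CPU core
POLL_SHARE = 0.994       # share of the per-packet time spent in poll()
TX_CLEAN_SHARE = 0.17    # share of poll() spent in the Tx clean phase
RX_CLEAN_SHARE = 0.83    # share of poll() spent in the Rx clean phase
IRQ_CYCLES = 737         # CPU cycles for handling one IRQ

per_packet_ns = 1e9 / MAX_RATE_PPS         # approx. 535 ns
per_packet_cycles = CPU_HZ / MAX_RATE_PPS  # approx. 1765 cycles
poll_ns = POLL_SHARE * per_packet_ns       # time spent in poll() per packet
t_clean_ns = TX_CLEAN_SHARE * poll_ns      # approx. 90 ns
t_proc_ns = RX_CLEAN_SHARE * poll_ns       # approx. 441 ns
t_irq_ns = IRQ_CYCLES / CPU_HZ * 1e9       # approx. 223 ns

print(f"per packet: {per_packet_ns:.0f} ns, {per_packet_cycles:.0f} cycles")
print(f"t_clean = {t_clean_ns:.0f} ns, t_proc = {t_proc_ns:.0f} ns, "
      f"t_irq = {t_irq_ns:.0f} ns")
```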
From our latency measurement, we observe a minimum latency of approximately 8.0 μs. The measurement was performed with low offered load (i.e. 44.6 kpps), so each packet received by the ingress interface immediately generates an IRQ and the packet is processed directly (the Tx descriptor ring of the outgoing interface has already been cleaned due to a separate IRQ for the prior packet transmission). Since a single packet takes 441 ns to be processed, we have to add an additional latency ($t_{off}$) of 8000 ns - 441 ns = 7559 ns to each processed packet.
B. Model Validation
Fig. 5 shows the measured and the simulated interrupt rate against the offered load in million packets per second (Mpps). The simulated interrupt rate reveals two abnormalities:
1) The peak which is visible at ca. 0.5 Mpps is missing.
2) The interrupt rate decreases faster and less smoothly.
We assign these discrepancies to effects not considered in our simulation model such as realistic DMA or hardware buffers.
The measured packet latency is not normally distributed (cf. Fig. 6). Thus, we omitted mean values and confidence intervals. Instead, we describe the latency distributions by percentiles. The observed latency distribution as indicated by the percentiles results from randomized sampling in measurements (cf. Sec. IV-A) and simulations. Fig. 7(a) shows the measured 25th, 50th, 75th, and 99th percentile for the observed latency distribution in relation to the offered load. Fig. 7(b) illustrates the percentiles for the latency distribution predicted by our simulation model in relation to the offered load. An Xth percentile refers to the minimal value which is higher than X percent of the measured or simulated latencies.
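The percentile definition given above can be expressed in a few lines of Python. The sketch interprets "higher than X percent" as "strictly higher than at least X percent of the samples", which is an assumption about the intended reading; the function name and the fallback for samples without such a value are likewise chosen for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence


def percentile(samples: Sequence[float], x: float) -> float:
    """Smallest sample that is higher than at least x percent of all samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 <= x < 100:
        raise ValueError("x must be in the range [0, 100)")
    ordered: List[float] = sorted(samples)
    n = len(ordered)
    for value in ordered:
        below = bisect_left(ordered, value)  # samples strictly below value
        if below * 100 >= x * n:
            return value
    return ordered[-1]  # fallback, e.g. if all samples are identical


print(percentile([1, 2, 3, 4], 50))  # prints 3
```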
The comparison of Fig. 7(a) with Fig. 7(b) shows:
1) For low offered load the simulation results for the latency show no variance, whereas the measured latency shows a large variance. This error arises from the constant latency offset $t_{off}$, which we use as an estimation for the additional latency introduced by the non-software parts of the system. We are aware that the choice of a constant value for $t_{off}$ is suboptimal, but in order to provide a better estimation of the latency introduced by hardware, a more detailed knowledge about the involved components is required, which we will investigate in future work. However, this constant offset significantly influences the simulated latency only for low offered load (i.e. below 0.2 Mpps). For higher offered loads, the ITR as well as the NAPI predominantly affect the latency.
2) From 0.5 to 1.0 Mpps the measured latency slightly increases (a), whereas the simulated latency decreases (b).
(a) For offered loads from 0.5 to 1.0 Mpps we observe in the measurements an interrupt rate which decreases with a growing offered load, whereby the interrupt rate is higher than 16 kips. At the considered offered load, the throughput on the Tx ring as well as on the Rx ring is clearly above 20 MB/s and we expect 8 kips at maximum for each of both rings (16 kips in total), but we observe up to 27 kips. Therefore, we assume the ITR is oscillating. This effect occurs if a poll starts with many packets backlogged in the Rx ring while the current ITR has a high value. In this case the time between the end of the poll (all packets from the Rx ring are served) and the expiration of the ITR timer is short. In the worst case the poll finishes shortly before the ITR timer expires and causes an IRQ. Since the observed throughput was high, the ITR is decreased. For the successive poll, which starts immediately after the previous poll due to the IRQ, there are just a few packets backlogged. In this case the time between the end of the poll and the expiration of the ITR timer is long. Hence, many packets are backlogged for the successive poll. Furthermore, the ITR is increased due to the low throughput observed and the procedure starts over. An oscillating ITR potentially influences the distribution of latencies, because on the one hand the large backlogs introduce high latencies and on the other hand small backlogs introduce low latencies. In a real system it is possible that concurrent processes interrupt the packet processing and exceptionally more packets get backlogged.
(b) If the interrupt rate remains constant and the offered load increases, then the mean latency decreases because more packets arrive during an active poll. Instead of being backlogged in the Rx ring and served by the successive poll, such packets are served directly by the poll they arrived in. The time these packets are backlogged in the Rx ring before being served is therefore short. This positive effect reaches its maximum when the packets arrive approximately as fast as they are served. This is why we observe a drop in the measured as well as the simulated latencies right before entering the overload situation at 1.87 Mpps and above.
3) In overload situations the measured latency is ca. four times higher than the simulated latency. In this case the latency is predominantly defined by the service time of the bottleneck (here this is the CPU) and the accumulated queue sizes in front of the bottleneck. Therefore, we assume this effect is related to additional hardware buffers which are currently not considered in our model (e.g. the Rx buffer in the NIC).
To quantify the difference between the measured and the simulated latency, we calculate the relative error $E_{rel}$ according to $E_{rel} = |T_i^{meas} - T_i^{sim}| / T_i^{meas}$, where $T_i^{meas}$ (resp. $T_i^{sim}$) is the observed latency of the ith measurement (resp. simulation). Fig. 7 shows the relative error of the 99th percentiles. The error plot increases drastically in cases of very low offered load and overload situations due to unconsidered effects.
We also calculated the absolute mean error $\bar{E}_{abs}$ = 18.64 μs and the relative mean error $\bar{E}_{rel}$ = 17.58 % that are defined as follows: $\bar{E}_{abs} = \frac{1}{n} \sum_{i=1}^{n} |T_i^{meas} - T_i^{sim}|$ and $\bar{E}_{rel} = \frac{1}{n} \sum_{i=1}^{n} \frac{|T_i^{meas} - T_i^{sim}|}{T_i^{meas}}$, where n is the total number of sampled latencies between 0.2 Mpps and 1.87 Mpps. This means we left out the corner cases with very low and very high offered load, because in these cases our model does not make acceptable predictions.
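Such mean errors can be computed as in the following Python sketch. It assumes the usual definitions of the mean absolute error and the mean relative error with the measurement as the reference value; the function name and the example latencies are hypothetical and are not data from this study.

```python
from typing import Sequence, Tuple


def mean_errors(measured: Sequence[float],
                simulated: Sequence[float]) -> Tuple[float, float]:
    """Return (mean absolute error, mean relative error) of paired samples."""
    if len(measured) != len(simulated) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(measured)
    abs_errors = [abs(m - s) for m, s in zip(measured, simulated)]
    e_abs = sum(abs_errors) / n
    # The relative error uses the measured latency as the reference.
    e_rel = sum(a / m for a, m in zip(abs_errors, measured)) / n
    return e_abs, e_rel


# Hypothetical latencies in microseconds, for illustration only.
e_abs, e_rel = mean_errors([100.0, 200.0], [110.0, 180.0])
print(f"E_abs = {e_abs:.2f} us, E_rel = {100 * e_rel:.2f} %")
```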
It is conspicuous that the error in Fig. 7 is unsteady. On closer inspection we noticed that the peaks of the relative error coincide with disparities between the measured and simulated interrupt rates (except in the overload situation). Therefore, if we manage to eliminate these discrepancies (e.g. by implementing a proper DMA model), we expect the prediction of intra-node packet latency to become more precise. Despite the discussed deficits, the prediction of the interrupt rate as well as the prediction of the latency indicate that the approach of our simulation model is basically valid. Thus, our model is suitable to predict latency-related effects caused by the NAPI and ixgbe.
VII. CONCLUSION
In this study, we investigated the latency which a packet incurs due to the packet processing software in PC systems based on Linux. We analyzed in detail the interactions between the NIC driver and the NAPI as part of the OS. On the one hand, we carried out testbed measurements to determine the distribution of packet latency with sub-microsecond accuracy. On the other hand, we modeled and simulated the NIC driver and the NAPI mechanisms by extending our ns-3 resource management module. Based on the testbed measurements, we calibrated and validated our simulation model with respect to the packet latency. The simulation results show, in comparison with the results of the measurements, that our model can accurately predict the packet latency except for corner cases. In future work, we will use our validated model to evaluate new algorithms for low latency packet processing (e.g. real-time support). Based on that, we will recommend and implement optimizations for NIC drivers and the networking of the Linux OS kernel.
