Traffic-aware adaptive polling mechanism for high performance packet processing by Trifonov, Hristo Georgiev
  
Department of Electronic & Computer Engineering 
 
Research Report 
 
 
Traffic-Aware Adaptive Polling Mechanism 
for High Performance Packet Processing 
 
 
Hristo Trifonov 
 
University ID Number: 11060905 
Project Supervisor: Dr. Donal Heffernan 
  
 
 
 
 
Note to the reader 
 
This research report was written as a Master of Engineering thesis, but it was not 
submitted because the author progressed to a PhD programme. The thesis was successfully 
examined by Dr Ciaran MacNamee, ECE Department, UL (internal examiner) and 
Professor Andreas Grzemba, University of Deggendorf, Germany (external examiner). 
I declare that I am the sole author of this Research Project.  It is entirely my own work, 
and any aspects that have been adopted from others are clearly referenced and 
acknowledged. All information in this paper has been obtained and presented as per the 
University’s academic project guidelines. 
 
 
  
 
Abstract 
 
Traffic-Aware Adaptive Polling Mechanism for High Performance 
Packet Processing 
Hristo Trifonov 
The work described in this thesis concerns the topical subject area of software defined 
networking (SDN), in particular Data Plane architectures. High-speed network interfaces of 
10 Gb/s and more are widely used in today’s data centres. The increased performance of 
modern multi-core commodity servers allows network packet processing functionality to be 
implemented in Linux user space using specialised software frameworks, which bypass the 
Linux kernel network stack and essentially obviate the need for dedicated network 
hardware components. A drawback of a number of such frameworks is related to the 
underlying architectural design, where all network devices are accessed in polling mode 
instead of interrupt-driven mode. This leads to high CPU core utilization and an increased 
power demand by the packet processing system, which results in inherent inefficiencies. 
A specific problem is explored where such inefficiencies are known to exist in the Data 
Plane Development Kit (DPDK), due to the high polling frequency of the DPDK’s packet 
sampling scheme. This thesis describes a novel approach to reducing such inefficiencies: 
a new Adaptive Polling Mechanism (APM) is designed, implemented and evaluated within 
the receiving loop of the DPDK’s packet processing framework. The APM implementation 
focuses on reducing the polling frequency of the dedicated CPU core(s). To evaluate the 
proposed approach, a full experimental laboratory setup, comprising the hardware 
equipment and the software configurations, is developed as a ‘test bench’ for the 
research. Two experimental case studies are developed to evaluate the effectiveness of 
the new approach. The energy consumption of the computing resources that are using 
the DPDK is experimentally measured and presented. 
The experimental results show that the proposed adaptive polling mechanism reduces the 
polling frequency of the CPU core and can increase application efficiency by up to 60%. 
Furthermore, the proposed adaptive polling mechanism can reduce the energy consumption 
of the Data Plane’s forwarding core(s) per ‘processed Ethernet packet’, and it reduces 
the overall instantaneous system power demand by up to approximately 6.5 W at low line 
rates. 
 
  
 
Table of Contents 
 
Note to the reader 
Abstract 
Table of Contents 
List of Figures 
List of Tables 
List of Acronyms and Abbreviations 
Chapter 1 Introduction 
1.1 Background of the research topic 
1.2 Research Objective 
1.3 Motivation 
1.4 Novelty 
1.5 Literature Survey 
1.6 Structure of the Thesis 
Chapter 2 State-of-the-Art Technology 
2.1 Overview of SDN 
2.2 Network functions virtualisation 
2.3 The DPDK 
2.3.1 Overview of core components and features 
2.3.2 Packet reception, processing and forwarding 
2.3.3 DPDK – current status 
2.3.4 DPDK – impact on overall system power demand and core temperatures 
Chapter 3 Experimental Environment 
3.1 Equipment 
3.1.1 Packet Generator 
3.1.2 Packet Forwarder 
3.2 Network Traffic Generation and Modelling 
3.2.1 MoonGen High-Speed Packet Generator 
3.2.2 TRex Traffic Generator 
3.3 Measurements Resolution 
3.3.1 Maximum Packet Throughput 
Chapter 4 The Adaptive Polling Mechanism 
4.1 Background 
4.2 The ON/OFF Traffic Model 
4.3 APM Theoretical Design Overview 
4.3.1 Initial APM Design Tests 
4.3.2 Induced Delay Schemes 
4.3.3 Induced Delay Schemes Evaluation 
4.4 Summary 
Chapter 5 Case Study One: CBR and Bursty Traffic 
5.1 CBR Traffic Tests 
5.1.1 Testing Environment 
5.1.2 CBR Traffic Experiments Description 
5.1.3 CBR Traffic Experiments Results 
5.1.4 CBR Traffic Case Study Conclusions 
5.2 Bursty Traffic Tests 
5.2.1 Bursty Traffic Experiments Description 
5.2.2 Bursty Traffic Experiments Results 
5.2.3 Bursty Traffic Case Study Conclusions 
Chapter 6 Case Study Two: IMIX and Random Packet Size Traffic 
6.1 IMIX Traffic Tests 
6.1.1 IMIX Traffic Experiments Description 
6.1.2 IMIX Traffic Experiments Results 
6.1.3 IMIX Traffic Case Study Conclusions 
6.2 Random Packet Size Traffic Experiments 
6.2.1 Random Packet Size Traffic Experiment Description 
6.2.2 Random Packet Size Traffic Experiment Results 
6.2.3 Random Packet Size Traffic Case Study Conclusion 
Chapter 7 Energy Consumption Investigations and Improvements 
7.1 Theory Discussion 
7.1.1 Artificial Delay and Power Saving on Intel CPUs 
7.1.2 Power and Energy Measurements 
7.2 Description of Laboratory Equipment Setup 
7.2.1 Forwarding CPU Core Energy Consumption per Packet with CBR Traffic Profile 
7.2.2 DuT System Power Demand with CBR Traffic Profile 
7.2.3 Forwarding CPU Core Energy Consumption per Packet with Bursty Traffic Profile 
7.2.4 DuT System Dynamic Power Demand with Bursty Traffic Profile 
7.2.5 Forwarding CPU Core Energy Consumption per Packet with IMIX Traffic Profile 
7.2.6 DuT System Dynamic Power Demand with IMIX Traffic Profile 
7.2.7 Forwarding CPU Core Energy Consumption per Packet with Random Packet Size 
7.2.8 DuT System Dynamic Power Demand with Random Packet Size Traffic Profile 
7.3 Chapter Summary 
7.4 Chapter Conclusions 
Chapter 8 Summary, Conclusions and Future Work 
8.1 Thesis Summary 
8.2 Discussion of Results 
8.3 Conclusion 
8.3.1 Limitations 
8.4 Future Work 
Bibliography 
Appendix I General Linux Kernel Networking 
Appendix II Intel Haswell Power Management 
 
 
  
 
List of Figures 
 
Figure 1. Conventional Networking Device 
Figure 2. SDN Architecture 
Figure 3. Traditional Legacy Networking with NFs as Hardware Appliances 
Figure 4. Simplified NFV Diagram - Internal Layers are omitted 
Figure 5. DPDK 16.11.1 Architecture 
Figure 6. Simplified Packet Flow in L2FWD – DPDK 
Figure 7. Excessive Polling in DPDK 
Figure 8. CPU Core Temperature Issue in DPDK – 0% Line Utilization 
Figure 9. CPU Core Temperature Issue in DPDK – 100% Line Utilization 
Figure 10. Hardware Laboratory Equipment Setup 
Figure 11. Simple Connection Session 
Figure 12. MoonGen Architecture [1] 
Figure 13. TRex Basic Setup [44] 
Figure 14. The Simple ON/OFF Traffic Model 
Figure 15. L2FWD-DPDK Control Flow 
Figure 16. Initial APM Design 
Figure 17. Effect of Induced Delay on RX/TX Polling Frequency at Maximum Line Rate 
Figure 18. Effect of Induced Delay on the Efficiency of the Basic Forwarding Application 
Figure 19. Schemes for Delay Values Generation 
Figure 20. Delay Schemes Testing Environment 
Figure 21. Traffic Profile with 10 Streams 
Figure 22. Delay Schemes Evaluation 
Figure 23. Laboratory Testbed Environment 
Figure 24. CBR Traffic Experiments – AVG Polls/s and AVG Internal Latency Comparison 
Figure 25. CPU Headroom with CBR Traffic 
Figure 26. CBR Traffic Experiments - Application Efficiency Comparison 
Figure 27. TRex Traffic Profile - 1000 Bursts 
Figure 28. Bursty Traffic Experiments - AVG Polls per Second Comparison 
Figure 29. Bursty Traffic Experiments - Internal Latency per Packet Comparison 
Figure 30. Bursty Traffic Experiments - Application Efficiency Comparison 
Figure 31. IMIX Traffic Profile 
Figure 32. IMIX Traffic Experiments - AVG Polls per Second Comparison 
Figure 33. IMIX Traffic Experiments - Internal Latency per Packet Comparison 
Figure 34. CPU Headroom with IMIX Traffic 
Figure 35. IMIX Traffic Experiments - Application Efficiency Comparison 
Figure 36. Random Packet Size Profile 
Figure 37. Random Packet Size Experiments - AVG Polls per Second Comparison 
Figure 38. Random Packet Size Experiments – AVG RTT per Packet Comparison 
Figure 39. CPU Headroom with Random Packet Size Profile 
Figure 40. Random Packet Size Experiments - Application Efficiency Comparison 
Figure 41. Power Measurement Laboratory Setup 
Figure 42. CPU Current Measurement with Yokogawa Oscilloscope 
Figure 43. CPU Energy Consumption per Packet - CBR Traffic 
Figure 44. DuT System Power Demand - CBR Traffic 
Figure 45. CPU Energy Consumption per Packet - Bursty Traffic 
Figure 46. DuT System Dynamic Power Demand - Bursty Traffic 
Figure 47. CPU Energy Consumption per Packet - IMIX Traffic 
Figure 48. DuT System Dynamic Power Demand - IMIX Traffic 
Figure 49. CPU Energy Consumption - Random Packet Size Profile 
Figure 50. AVG DuT System Power Demand - Random Packet Size Profile 
 
  
 
List of Tables 
 
Table 1. Core Components of DPDK 
Table 2. Theoretical Polls per Second with Different Burst Sizes 
Table 3. Fibre Optic Cable Specifications 
Table 4. Theoretical Maximum Packet Rates per Second 
Table 5. Theoretical Polls/s for Different Bursts and Sizes of Packets 
Table 6. Formulae for Delay Values Generation 
Table 7. IMIX Profile TRex 2.26 
Table 8. Random Size Packet Ranges and Amounts 
Table 9. Comparison between Turbostat and Oscilloscope Power Measurements 
Table 10. Dynamic Power Saving 
 
  
 
List of Acronyms and Abbreviations 
 
  
API Application Programming Interface 
APM Adaptive Polling Mechanism 
ASLR Address space layout randomization 
BPS Bits per Second 
BSD Berkeley Software Distribution 
CAPEX Capital Expenses 
CBR Constant Bit Rate 
CLI Command line interface 
CP Control Plane 
CRC Cyclic Redundancy Check 
DE Desktop Environment 
DMA Direct Memory Access 
DP Data Plane 
DuT Device under Test 
EAL Environment Abstraction Layer 
ECE Electronic & Computer Engineering 
HPET High Precision Event Timer 
IA Intel Architecture 
IFG Inter Frame Gap 
IMIX Internet MIX 
IPG Inter-packet gap 
ISG Inter-Stream Gap 
KVM Kernel Virtual Machine 
LAN Local-Area Network 
MAC Media Access Control 
NAPI New Application Programming Interface 
NF Network Function 
NFV Network Function Virtualisation 
NIC Network Interface Card 
NUMA Non-Uniform Memory Access 
OPEX Operating Expenses 
OS Operating System 
OSI Open Systems Interconnection 
PMD Poll Mode Driver 
PPS Packets per second 
RAM Random Access Memory 
RPM Revolutions per Minute 
RTE Run-Time Environment 
RTT Round-trip time 
SFD Start Frame Delimiter 
SSD Solid State Drive 
TSC Time Stamp Counter 
TSP Telecom Service Providers 
VHA Virtual Hardware Abstraction 
VNF Virtual Network Functions 
  
Chapter 1    Introduction 
 
1.1 Background of the research topic 
Network traffic processing at 10 Gigabits per second (Gbps) and greater is a very 
demanding task and it is usually handled by custom-built hardware that is tuned for a 
number of specific tasks and features. Telecom Service Providers (TSPs) worldwide 
have realised that this custom hardware architectural approach lacks flexibility and 
extensibility for their network infrastructure and associated services. 
Over the last decade, researchers and engineers have contributed greatly to the move 
towards software implementations of classical network Input/Output (I/O) intensive 
equipment, such as routers, switches, firewalls and load balancers [2]. In searching 
for new ways to increase the agility of their infrastructure, TSPs are switching to 
software-based solutions running on commodity hardware, to realise lower maintenance 
costs, improved reliability and greater expansion capabilities. 
At the raw data plane level, the ever-increasing volume of network traffic creates the 
demand for a very high-speed network packet processing scheme. This research work 
focuses on the Data Plane Development Kit (DPDK) [3], which overcomes the 
processing limitations imposed by the general Linux kernel networking stack. 
This project is a collaboration between Intel, Shannon and the Electronic & Computer 
Engineering (ECE) Department at the University of Limerick.  
 
1.2 Research Objective 
The main objective of this research work is to design, develop, test, validate and 
implement an Adaptive Polling Mechanism (APM) within the receiving (RX) loop of 
DPDK’s applications, for performance optimisation purposes. 
This APM will attempt to control and adapt the polling frequency of the poll-mode 
DPDK driver that retrieves packets from the Network Interface Card (NIC) receiving 
(RX) queues, under varying traffic workloads, without violating the system's 
responsiveness constraints. 
 
  
1.3 Motivation 
The DPDK, a Linux Foundation Project, is a specialised high performance 
framework for fast packet processing in data plane applications. The framework 
provides a set of optimized software libraries and interrupt-free poll mode drivers for 
specific environments and architectures; including but not limited to Intel processors 
from the Atom to the Intel Xeon generation [4]. The performance achieved by the 
DPDK in terms of packet throughput comes at the cost of running a dedicated 
processing core in polling mode around the clock [5].  
In general, a polling scheme is the preferred option for activity checking in systems with 
heavy network loads. However, network traffic load in most communication 
infrastructure systems is highly dynamic. Therefore, allowing one or more Central 
Processing Unit (CPU) cores to constantly check the NIC for incoming packets makes 
this approach very effective but resource-inefficient.  
The main motivating factors for this research are: 
• To investigate polling frequency optimisation of the CPU core that is executing 
DPDK’s receiving loop according to the dynamics of the incoming network 
traffic. 
• To attempt to increase the efficiency of DPDK’s receiving and forwarding loops 
in terms of CPU cycles per packet. 
• To attempt to decrease energy usage and heat dissipation for the CPU core that 
is pinned to a DPDK thread that is receiving and forwarding network packets. 
 
1.4 Novelty  
Currently, the drivers used with DPDK libraries use periodic polling to check whether 
network packets have arrived at the NIC; this is done without using interrupts. If one 
or more packets have arrived, they are read during the poll and moved immediately to 
the main memory for further processing. If no packets are present at the NIC, the 
polling continues regardless. Within this continuous busy-wait loop scheme there is an 
opportunity to investigate novel approaches for improved optimisation. The novelty of 
the work described in this thesis is claimed on the basis of the results of such 
investigations, as will be described in later chapters. 
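
The cost of the continuous busy-wait scheme described above can be illustrated with a small simulation. The following is a minimal Python sketch, not DPDK code: the function name `busy_poll`, the tick-based arrival model and the burst size of 32 are assumptions made purely for illustration. It counts how many polls return no packets when traffic is sparse.

```python
def busy_poll(arrivals, total_ticks, burst_size=32):
    """Simulate a continuous (non-adaptive) polling receive loop.

    arrivals    -- hypothetical dict mapping a tick index to the number of
                   packets that reach the NIC queue at that tick
    total_ticks -- number of loop iterations to simulate (one poll per tick)
    burst_size  -- maximum packets retrieved by a single poll (assumed value)
    """
    queued = 0
    polls = empty_polls = received = 0
    for tick in range(total_ticks):
        queued += arrivals.get(tick, 0)   # packets arriving at this tick
        polls += 1                        # the loop polls on every tick
        got = min(queued, burst_size)     # a poll returns at most one burst
        if got == 0:
            empty_polls += 1              # poll found nothing: wasted work
        queued -= got
        received += got
    return polls, empty_polls, received


polls, empty_polls, received = busy_poll({0: 4, 50: 2}, total_ticks=100)
print(polls, empty_polls, received)  # 100 98 6
```

In this toy run, 98 of the 100 polls retrieve nothing, which is the kind of inefficiency that a reduced polling frequency aims to address at low traffic rates.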
 
  
1.5 Literature Survey 
To gain a proper understanding of how the Adaptive Polling Mechanism can be 
implemented, it was imperative to conduct a substantial literature survey at the 
beginning of this research work. 
All documentation provided with the DPDK was studied in detail, including the 
DPDK’s Programmer Guide [4] and Sample Applications User Guide [6]. The Sample 
Applications User Guide covers forty different example programs that are provided 
with the DPDK framework. It also provides instructions on how to compile, run and use 
these sample programs to showcase specific network functionality. Furthermore, a 
number of key papers published on high-speed network packet processing using 
off-the-shelf hardware were studied. In particular, the following papers were found 
to be most informative: 
• Fast Userspace Packet Processing [7] – This paper provides a review of the 
technical aspects of existing userspace Input/Output (I/O) frameworks. It 
was useful for learning which userspace frameworks are available and how they 
manage the reception, processing and forwarding of network frames at 10 
Gbps line rate. 
• Comparison of Frameworks for High-Performance Packet I/O [8] – This 
paper reviews the performance of various frameworks for high-performance 
packet I/O and analyses the trade-off between throughput and latency. This 
paper helped in the design phase of the APM. 
• Assessing Software and Hardware Bottlenecks in PC-based Packet 
Forwarding Systems [9] – This paper provides an analysis and comparison of 
performance between Linux kernel packet forwarding and DPDK’s Layer 2 
(L2) forwarding example application [10]. It was very informative for setting 
up the test environment to benchmark the mechanism proposed in this 
research work. 
To understand how ‘CPU cycles per packet’ can be optimised as a figure of merit, it 
was necessary to have a clear understanding of how the Linux OS monitors and controls 
CPU utilization under heavy load, and how an Intel Xeon processor’s performance can 
be influenced from within the DPDK framework. There are many publications and 
data sheets available on this topic, but the key processor of interest for this project 
is the Intel Xeon, which is described in the document ‘Intel Xeon E3-1285 v3’ [11]. The 
Xeon processor is based on the Intel Haswell microarchitecture, which is described in 
the document ‘4th Generation Intel® Core™ Processor, codenamed Haswell’ [12]. 
The energy saving approaches relating to the Intel x86 architecture were studied, 
including the performance and sleep states enforced by the intel_pstate driver, which is 
responsible for the power management policy on Intel CPUs. The document ‘Intel 
P-State driver’ [13] provides an in-depth technical description of this topic. 
To investigate how to test the implementation of the proposed APM, where multiple 
traffic profiles needed to be generated, a number of relevant documents were reviewed. 
A large number of available documents relating to software traffic generators capable of 
producing at least 10 Gbps line rate with 64-byte network packets were studied. In 
particular, the following sources were found to be most informative: 
• The Handbook of Computer Networks [14] - This handbook explains the essentials 
of packets, flows, sessions, traffic patterns, and continuous and discrete source 
traffic modelling. 
• MoonGen: A Scriptable High-Speed Packet Generator [1] - This paper describes 
MoonGen, one of the modern software traffic generators capable of saturating a 
10 Gigabit Ethernet (GbE) link with 64-byte network packets. MoonGen is used 
to test and evaluate latency and throughput, as will be described later in this 
thesis. 
• TRex: Realistic Traffic Generator – Stateless support [15] - This presentation 
describes the second traffic generator, TRex, which is used throughout 
the testing and evaluation stages of the APM in this research work. It provides 
great flexibility for traffic profiling with different packet rates, sizes, bursts etc. 
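
To give a sense of scale for the requirement of saturating a 10 Gbps link with 64-byte packets, the packet rate can be estimated with a short calculation. The sketch below is illustrative only: the helper name `max_packets_per_second` is hypothetical, and it assumes the standard Ethernet per-frame wire overhead of 20 bytes (7-byte preamble, 1-byte SFD and 12-byte inter-frame gap) with the 64-byte frame size including the CRC. The theoretical rates tabulated later in the thesis may follow different conventions.

```python
# Per-frame overhead on the wire that is not part of the frame itself
# (assumed standard Ethernet values): preamble + SFD + inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 7 + 1 + 12


def max_packets_per_second(link_bps, frame_bytes):
    """Theoretical maximum frame rate for a given link speed and frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


rate = max_packets_per_second(10_000_000_000, 64)
print(f"{rate / 1e6:.2f} Mpps")  # 14.88 Mpps
```

Under these assumptions a traffic generator must emit roughly 14.88 million packets per second, which leaves a forwarding core only about 67 ns per packet at full line rate.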
 
1.6 Structure of the Thesis 
Chapter 2 provides a detailed overview of the underlying concepts, which include 
Software Defined Networking (SDN) [16] and Network Function Virtualisation (NFV) 
[17]. These concepts inspired software developers and led to the creation of the DPDK, 
an open source, community-driven project with active participation of researchers from 
industry and academia. In the same chapter, DPDK’s structure is introduced along with 
its main components and a description of its basic operation. At the end of Chapter 2, 
the DPDK’s impact on CPU utilisation is discussed, including the related overall 
system power demands. 
Chapter 3 describes the full experimental setup for the research. This includes a 
description of the equipment used, the Linux OS distribution, the software traffic 
generators, and some discussion on the resolution of measurements etc. 
Chapter 4 explains the background to the design of the Adaptive Polling Mechanism, 
along with the various proposals for optimisation and their evaluations. Features and 
limitations of the selected algorithm are also presented. 
Chapter 5 presents Case Study One: CBR and Bursty traffic experiments, results and 
preliminary conclusions. 
Chapter 6 presents Case Study Two: IMIX and Random Packet Size traffic experiments, 
results and conclusions. 
Chapter 7 presents the energy consumption investigations and improvements introduced 
by using the APM. 
Chapter 8 discusses the results, conclusions and recommendations for future work. 
  
  
Chapter 2   State-of-the-Art Technology 
 
This chapter aims to provide an overview and analysis of the two key concepts that 
inspired network engineers and software developers to move packet processing from 
kernel to user-space; and from proprietary devices to widely available commodity 
hardware-based solutions. It also aims to present the shortfalls of the Linux kernel 
packet processing schemes at line rates of 10 Gbps and above, and how the user-space 
based DPDK solves some of these problems. The structure and operation of DPDK is 
reviewed along with DPDK’s impact on the overall system energy consumption.
 
2.1 Overview of SDN 
Software Defined Networking [18] is a relatively new network architecture approach, 
which is based around the decoupling of the control and data planes to allow the 
separation of packet forwarding technology from routing hardware, so as to achieve 
much greater architectural flexibility with far better centralized management. 
Every individual network device (router, switch etc.) has to perform three fundamental 
separate activities, which can be considered as three planes: 
• The Data Plane (fast path) will process ingress packets
• The Control Plane (CP) will establish and update the list of devices connected to
the network using (slow path) protocols
• The Management Plane provides an interactive connection (usually command
line interface (CLI)) with its owner
All three planes, in the context of networking, represent the integral components of a 
data communications architecture. In conventional networking, the three planes might 
reside within a network hardware device firmware as shown in figure 1. 
 
 
 
 
 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
In such a model, everything is tied up into a particular proprietary hardware platform 
which makes the implementation of any flexible development and innovation a very 
slow and time consuming process. In a network of any complexity, the network
management is inflexible and constrained, because it is handled through a large number 
of proprietary agents, many with their own specialized hardware, operating systems, and 
control programs [19]. This involves high Operating Expenses (OPEX) and Capital 
Expenses (CAPEX) as operators have to acquire and maintain different management 
solutions and the corresponding teams need to provide support for these solutions. 
Furthermore, the network policies are often enforced by configuring each device 
individually using very low-level vendor specific commands and configurations, which 
makes it virtually impossible to re-configure the full network dynamically under a 
single uniform control interface [16]. 
 
 
 
[Figure 1 (diagram): a router/switch in which the Data Plane (ports 0 to 3, forwarding
path, forwarding decision), the Control Plane (ARP, routing, MAC learning, L2/L3
forwarding tables) and the Management Plane all reside in the device firmware.]
Figure 1. Conventional Networking Device
  
SDN introduces the following features: 
• Decoupling of data and control planes. This takes away the control functions
from the network devices and they become packet forwarding units only. The
forwarding decisions are imposed by the external network controller as
explained below.
• Per-flow based forwarding. Forwarding decisions are flow based instead of
destination based. A flow, in the SDN context, is a train of packets between
source and destination carrying an identical set of packet field values (example:
source IP, destination IP). All packets associated with the same flow receive the
same service by the forwarding devices, so as to avoid the overhead of
per-packet routing decisions.
• External network controller. Control logic is a software platform that runs on
an external commodity server and provides the necessary resources and
abstractions to accommodate the programming of forwarding devices based on
the whole network state. The use of an external control logic offers the
following benefits:
- Simple modification of network policies through high-level programming
languages instead of low-level device specific configuration.
- Centralised controller reacts automatically to sudden network state changes
keeping the network policies intact.
- Keeping control logic centralised provides global knowledge of the network
state thus simplifying the development of more complex network functions and
services.
• Network management through software. Allows the network configuration to
be changed by programming software applications running on top of the
network controller.
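
The per-flow forwarding idea in the list above can be illustrated with a minimal sketch.
It is not taken from any SDN controller or switch implementation: the `FlowKey`,
`FlowTable` and `toy_controller` names, the choice of header fields and the port policy
are all hypothetical. The sketch only shows that the forwarding decision is requested
once per flow and then reused for every later packet of that flow.

```python
from typing import Callable, Dict, NamedTuple


class FlowKey(NamedTuple):
    """Header fields that identify a flow (illustrative choice of fields)."""
    src_ip: str
    dst_ip: str


class FlowTable:
    """Forwarding device state: one cached decision per flow."""

    def __init__(self, controller: Callable[[FlowKey], int]) -> None:
        self._controller = controller          # stands in for the external controller
        self._rules: Dict[FlowKey, int] = {}   # flow -> output port
        self.controller_queries = 0

    def output_port(self, key: FlowKey) -> int:
        # Only the first packet of a flow needs a decision from the controller.
        if key not in self._rules:
            self.controller_queries += 1
            self._rules[key] = self._controller(key)
        return self._rules[key]


def toy_controller(key: FlowKey) -> int:
    # Hypothetical policy: destinations starting with "10." use port 1, others port 0.
    return 1 if key.dst_ip.startswith("10.") else 0


table = FlowTable(toy_controller)
packets = [FlowKey("192.168.1.5", "10.0.0.7")] * 3 + [FlowKey("192.168.1.5", "8.8.8.8")]
ports = [table.output_port(p) for p in packets]
print(ports, table.controller_queries)
```

Four packets belonging to two flows result in only two controller decisions, which is
the overhead reduction that flow based forwarding aims for.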
It is important to point out at this stage that the separation of control plane and data 
plane is realized by well-defined programming interfaces between the switches and the 
SDN controller [19]. The centralised controller uses Application Programming 
Interfaces (APIs) to exercise direct control over the state of data plane devices. 
Each plane carries different traffic and runs independently from one another in a layered 
structure as shown in figure 2. 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
[Figure 2 (layer diagram, decoupled planes), from top to bottom: Application Layer
(business applications); Northbound APIs; Control Plane (slow path) with the network
controller software, where several controllers are linked through East and West APIs;
Southbound APIs; Data Plane (fast path) with receiving/forwarding devices (switches,
routers, ATM switches) behind an abstraction layer from the networking devices.]
Figure 2. SDN Architecture

2.2 Network functions virtualisation
This section explains what a Network Function (NF) is, why it is useful to virtualise
it, and how this is relevant to this research.
According to the European Telecommunications Standards Institute (ETSI) [20], a
Network Function (NF) represents a functional block within a network infrastructure
that has well-defined external interfaces and well-defined functional behaviour. In
practical conventional terms, an NF is a network node or a physical network appliance.
Traditionally, every service provided by a Telecommunication Service Provider (TSP)
within a network is based on physical proprietary devices being deployed as a part of
the network. Figure 3 shows a simplified representation of that situation.
 
 
 
 
 
 
 
 
 
 
 
 
This practice of using physical proprietary devices is no longer a valuable approach 
since it restricts the quality, stability and agility of the services provided, and leads to 
heavy dependency on specialised hardware. The increase in customer demand for more 
new services provided by NFs pushes TSPs to increase their spending on new hardware 
to satisfy customers’ expectations, but this leads to higher CAPEX and OPEX for 
equipment that may be needed for only very short periods of time, which yields a poor 
return on the investment made. Inevitable obsolescence of the NFs as hardware devices 
compounds the problem. 
The main principle of NFV is the separation of network functions from the proprietary 
vendor based hardware, in favour of implementing their equivalent functionalities in 
software, running on industry standard commodity hardware (i.e. servers, switches and 
storage). This is achieved using Virtual Hardware Abstraction (VHA). In this way, any 
network service can be decomposed into a group of Virtual Network Functions (VNFs), 
which, ideally, can be deployed anywhere on the network. Figure 4 shows a simplified 
representation of VNFs replacing NFs. 
[Figure 3 (diagram): switches interconnecting file servers and hardware network
functions: DPI, NAT, Firewall, QoS, VPN and Router.]
Figure 3. Traditional Legacy Networking with NFs as Hardware Appliances
  
 
 
 
 
 
 
 
 
 
 
 
 
The main advantages for using the VNF approach are: 
• Software and hardware innovations can evolve independently from each other
due to the separation achieved.
• Greater flexibility in deployment of network functions.
• Dynamic service provisioning is possible, depending on customer and network
demands.
The freedom of programming the network planes, in the SDN scheme, and the ability to 
develop and deploy a variety of NFs using NFV, has paved the way to a new framework 
development, which combines the two concepts and implements high-speed packet 
processing software schemes, which can run on standard industry hardware. Such 
schemes available today are Netmap [21] developed by Luigi Rizzo, PF_RING Direct 
NIC Access (DNA) [22], Packet Shader [23], Packet_MMAP [24] and of course DPDK. 
2.3 The DPDK 
This section aims to provide an overview of DPDK’s architecture, features and 
operation. 
[Figure 4 (diagram): a TSP customer site in which packets pass through switches to DPI,
file servers and the Virtual Network Functions vNAT, vFirewall, vQoS, vVPN and vRouter.]
Figure 4. Simplified NFV Diagram - Internal Layers are omitted
  
DPDK was designed to enhance the data plane packet processing on Intel Architecture
(IA) platforms in Linux user space. At its most recent public release in November 2016,
it contained the core components illustrated in figure 5.
 
 
[Figure 5 (block diagram), contents summarised:
USER SPACE
User Applications: CMD, Ethtool, Skeleton, RX/TX Callback, L2FWD, L3FWD etc.
Common Function APIs:
  Device: dev, ethdev, ethctrl, rte_flow, cryptodev, eventdev, metrics, bitrate,
    latency, devargs, PCI
  Memory: memseg, memzone, mempool, malloc, memcpy
  CPU Arch: branch prediction, cache prefetch, SIMD, byte order, CPU flags, CPU pause,
    I/O access
  CPU multicore: interrupts, launch, lcore, per-lcore, service cores, keepalive,
    power/freq
  Layers: Ethernet, ARP, ICMP, IP, SCTP, TCP, UDP, GRO, LPM IPv4 route, LPM IPv6 route,
    ACL, EFD, frag/reass
  Timers: cycles, timer, alarm
  Packet framework: Port (ethdev, ring, frag, reass, sched, kni, src/sink); Table (lpm
    IPv4, lpm IPv6, ACL, hash, array, stub, pipeline)
Core Libs: EAL, Ring, Mempool, Mbuf, Crypto, Timer, Hash, LPM
Drivers:
  Crypto: aesni_gcm, aesni_mb, armv8, dpaa2_sec, kasumi, null, openssl, qat, scheduler,
    snow3g, zuc
  Network: af_packet, ark, avp, bnx2x, bnxt, bonding, cxgbe, dpaa2, e1000, ena, enic,
    fm10k, i40e, ixgbe, kni, liquidio, mlx4, mlx5, nfp, null, pcap, qede, ring, sfc,
    szedata2, tap, thunderx, vhost, virtio, vmxnet3, xenvirt
Other: Config, Devtools, Doc, Usertools, Buildtools, Test
Architectures: Intel IA-32-64, IBM Power8, EZchip TILE-Gx, ARM
KERNEL SPACE: KNI, IGB_UIO, VFIO, UIO_PCI_GENERIC]
Figure 5. DPDK 16.11.1 Architecture

2.3.1 Overview of core components and features
The main goal of the DPDK creators was to address the issues of concern encountered
by the Linux Network Stack at line rates of 10 Gbps and to overcome such issues by
providing a simple and complete framework for fast packet processing in DP
applications.
DPDK achieves high I/O functionality based on the following features:
• Bypassing the Linux Kernel network stack (figure 5) and allowing the incoming
frames to be processed by the user application running on top of the packet
processing framework.
• Strictly polling the NIC to receive packets, thus avoiding interrupt overhead.
• Packet memory buffers are pre-allocated at application initialisation time, thus
saving CPU cycles on further allocation or de-allocation of memory.
• Zero copying of data between Kernel space memory and user space memory; a
packet is copied in main memory only once using the DMA engine of the NIC.
• The overhead of processing packets is amortized by processing in batches based
on a single API call.
Moreover, the Ethernet driver(s) run in user space, and a Kernel module provided with
the source package helps to map the Peripheral Component Interconnect Express (PCIe)
resources into user space, so that the DPDK application has direct access to the NIC at
the hardware level for greater control. The CPU cores have their own cache memory, which
limits their accesses to the shared ring of a memory pool and thus saves even more CPU
cycles. The actual applications are run as execution units, using threads, on logical
processing cores named lcore(s) in the scheme.
The DPDK framework supports two models for packet processing, between which the user
can choose: 1) the run-to-completion model, where a processing unit is allocated to
each packet until the packet is processed, and 2) the pipeline model, where the tasks
for processing a packet are split between different execution units along the
pipeline.
The fundamental features of the DPDK include: 
• No limitation of how many cores and processors are used
• No scheduler is employed as in NAPI and all devices are accessed using polling
• Support for 32 and 64 bit processors with or without Non-Uniform Memory
Access (NUMA)
• Optimal packet allocation across memory channels
• Scales from low-budget Intel Atom CPU to the high-performance Intel Xeon
CPU family
In table 1 the DPDK core components are listed along with a short description of each 
one. 
  
Table 1. Core Components of DPDK
COMPONENT: DESCRIPTION

ENVIRONMENT ABSTRACTION LAYER: Provides a generic interface that hides the platform
specifics from the applications and libraries [4]. Services include:
• Run-Time Environment (RTE) loading and launching
• Multi thread and multi process support
• Core affinity procedures
• System memory allocation and de-allocation
• Atomic and lock operations
• Time reference
• PCIe bus access
• Trace and debug functions
• CPU feature identification
• Alarm operations
• Memory management
• Interrupt handling
Every service is described in detail in [4].

RING MANAGER: Provides a fixed-size, lockless multi-producer and multi-consumer First
In First Out (FIFO) API in a finite size table for storing objects. The resulting ring
is used by the Memory pool manager.

MEMORY POOL MANAGER: Allocates pools of objects in main memory. A pool is identified by
name and uses a ring to store free objects. It provides some other optional services,
such as a per-core object cache and an alignment helper to ensure that objects are
padded to spread them equally on all Random Access Memory (RAM) channels.

NETWORK PACKET BUFFER MANAGER: Provides the facility to create and manipulate buffers
that may be used for packets carrying data.

TIMER MANAGER: Provides timer services to DPDK’s execution units to enable
asynchronous execution of call-back functions. It uses the High Precision Event Timer
(HPET) or the CPU’s Time Stamp Counter (TSC) to provide a reliable time reference.

POLL MODE DRIVER: A Poll Mode Driver (PMD) consists of APIs, provided through the BSD
driver running in user space, to configure the devices and their respective queues. In
addition, a PMD accesses the RX and TX descriptors directly without any interrupts
(with the exception of Link Status Change interrupts) to quickly receive, process and
deliver packets in the user’s application. DPDK includes 1 Gigabit, 10 Gigabit and 40
Gigabit Poll Mode Drivers.

DEBUG MANAGER: Provides debug helpers.
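
The relationship between the ring and the memory pool listed in table 1 can be
illustrated with a small sketch. This is not the DPDK API: the `BufferPool` class and
its methods are hypothetical, the code is single-threaded, and a Python `deque` merely
stands in for the lockless ring. It only shows the principle that all buffers are
allocated once at start-up and are then recycled through a FIFO of free objects.

```python
from collections import deque
from typing import Deque, List, Optional


class BufferPool:
    """Toy pool: every buffer is allocated once, at construction time."""

    def __init__(self, count: int, buf_size: int) -> None:
        # One-off allocation, analogous to pre-allocating packet buffers at start-up.
        self._buffers: List[bytearray] = [bytearray(buf_size) for _ in range(count)]
        # FIFO of free buffer indices; stands in for the ring of free objects.
        self._free: Deque[int] = deque(range(count))

    def get(self) -> Optional[int]:
        """Take a free buffer index, or None when the pool is exhausted."""
        return self._free.popleft() if self._free else None

    def put(self, index: int) -> None:
        """Return a buffer to the pool so that it can be reused."""
        self._free.append(index)

    def data(self, index: int) -> bytearray:
        return self._buffers[index]

    def free_count(self) -> int:
        return len(self._free)


pool = BufferPool(count=4, buf_size=2048)
first = pool.get()
pool.data(first)[:5] = b"hello"   # "receive" a packet into the buffer
pool.put(first)                   # packet handled, buffer recycled
print(pool.free_count())
```

No memory is allocated or released after construction, which is the source of the CPU
cycle savings attributed to pre-allocated packet buffers in section 2.3.1.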
  
2.3.2 Packet reception, processing and forwarding  
This section describes the network packet flow within the DPDK framework, and more
specifically the Layer 2 (L2) forwarding of network frames. More information on
L2 of the Open Systems Interconnection (OSI) reference model can be found in [25].
The operation of the L2FWD [10] sample program can be explained with the following
steps, as illustrated in figure 6:
 
[Figure 6 (diagram): the master core performs the application initialisation using the
EAL options (number of cores, DRAM channels) and the application specific options
(number of ports, queues per port). Lcore0 and Lcore1, each shown at 100% utilisation,
continuously poll the Intel NIC 82599ES for RX packets. Bursts of RX packets arrive from
Port0 RX and Port1 RX, pass through the application processing step (L2 Forward: "Source
MAC address is replaced with default destination MAC address") and leave as bursts of
TX packets through Port0 TX and Port1 TX.]
Figure 6. Simplified Packet Flow in L2FWD – DPDK
• Application initialization – Includes the creation of EAL with the provided
command line parameters, where CPU cores and use of memory channels are
specified. Memory pools and port queues are created and configured where
certain buffer sizes, cache sizes and optional NUMA parameter usage are
identified.
• Port initialization – Receive and transmit queues are set up, where the RX queue
is provided with a pointer to an appropriate memory pool to get buffers from for
packet storage. Also, a set number of packet descriptors are initialized. Both
queues are configured with a certain threshold of ring registers. Once the ports
are initialized and configured, they are started.
• Start threads on the cores provided earlier – Every slave core (lcore0, lcore1) is
assigned an infinite loop for receiving and forwarding, whereas the master
core is responsible for gathering and displaying statistics.
• Packet processing – Packets are received in bursts and then processed one by
one. As soon as packets are ready for transmission, they are put in a buffer and
sent as a batch when the buffer becomes full.
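
The receive, process and transmit steps above can be sketched as a simulation. This is
not the L2FWD source code and it uses no DPDK calls: the queues are plain Python
containers, the `rx_burst` and `forward_loop` helpers and the burst size of 32 are
assumptions made for illustration, and the infinite loop is replaced by a fixed number
of polls so that the example terminates.

```python
from collections import deque
from typing import Deque, List

MAX_BURST = 32  # assumed burst size, used for both RX bursts and the TX buffer


def rx_burst(rx_queue: Deque[bytes], max_pkts: int) -> List[bytes]:
    """Poll the RX queue once and return up to max_pkts packets (possibly none)."""
    burst: List[bytes] = []
    while rx_queue and len(burst) < max_pkts:
        burst.append(rx_queue.popleft())
    return burst


def forward_loop(rx_queue: Deque[bytes], tx_wire: List[bytes], polls: int) -> int:
    """Run a bounded polling loop; return how many polls found no packets."""
    tx_buffer: List[bytes] = []
    empty_polls = 0
    for _ in range(polls):                  # the real loop never terminates
        burst = rx_burst(rx_queue, MAX_BURST)
        if not burst:
            empty_polls += 1                # CPU time spent without any work done
            continue
        for pkt in burst:                   # packets are processed one by one
            tx_buffer.append(pkt)           # header rewriting is omitted here
            if len(tx_buffer) == MAX_BURST:
                tx_wire.extend(tx_buffer)   # buffer full: send as one batch
                tx_buffer.clear()
    tx_wire.extend(tx_buffer)               # sketch only: flush the remainder at exit
    return empty_polls


rx = deque(bytes([i]) for i in range(40))
tx: List[bytes] = []
idle = forward_loop(rx, tx, polls=5)
print(len(tx), idle)
```

With 40 queued packets and five polls, the first two polls return packets and the
remaining three return nothing. Empty polls of this kind keep the core fully utilised
without forwarding any traffic.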
2.3.3 DPDK – current status 
The DPDK community has been growing continuously since the project became an 
open source activity in April 2013. Since then, there have been contributions from 70 
different organisations and over 400 developers have been involved in 10 major releases 
of the software. Some 20 open source projects are based on the DPDK libraries. The 
DPDK project has grown from strength to strength with multiple CPU architecture 
support (x86_64, IA32, Power 7/8 [26], Tilera (EZchip)), multiple NICs support (Intel,
Cisco [27], Mellanox [28], Broadcom [29], Chelsio [30]), multiple OS distributions and 
multiple virtualization environments such as Kernel Virtual Machine (KVM) [31], 
VMware [32] and Xen [33]. 
In April 2017, the DPDK project moved under the governance of The Linux
Foundation [34]. A Governing Board will manage the marketing strategies, lab 
resources, legal and licensing issues. The Technical Board will be in charge of the 
technical developments in DPDK including the approval of new sub-projects, 
deprecating old sub-projects and resolution of technical disputes. 
2.3.4 DPDK – impact on overall system power demand and core temperatures 
In examining figure 6, it is obvious that the CPU cores (lcore0 and lcore1) are 
continuously polling the NIC ports for packets. The same cores are responsible for the 
processing and forwarding of the network packets once they become available at the 
ports. Due to the polling nature of DPDK, lcore0 and lcore1 are utilized fully (100%) at 
all times regardless of packet availability. 
 
 
 
 
  
Figure 7 illustrates the high-polling frequency issue within the receiving loop of 
L2FWD-DPDK.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
In figure 7, a gap of approximately 98 million unnecessary polls per second is
identified when the CPU is running at its normal frequency of 3.6 GHz. This situation
improves slightly if the CPU frequency is lowered by half, but the original issue remains.
The theoretical maximum number of polls per second needed to receive and process
10 Gbps of incoming traffic consisting of minimum size packets (64 bytes) is presented
in table 2.
Table 2. Theoretical Polls per Second with Different Burst Sizes
Packet size (bytes)    Burst size    Theoretical polls/s needed by RX loop
64                     1             14880952
64                     16            930059
64                     32            465029
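
The values in table 2 can be reproduced with a short calculation. The sketch below
assumes a per-frame overhead of 20 bytes on the wire (7-byte preamble, 1-byte start of
frame delimiter and 12-byte inter-frame gap) and assumes that every poll returns a full
burst; these assumptions are not stated in the table itself, but they yield the same
numbers. The function names are illustrative only. The measured polling rates are the
data labels read from figure 7.

```python
LINE_RATE_BPS = 10_000_000_000   # 10 Gbps
# Assumed overhead per frame on the wire: 7 B preamble + 1 B start of frame
# delimiter + 12 B inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def max_packets_per_second(frame_bytes: int) -> float:
    """Maximum number of frames of the given size per second at line rate."""
    return LINE_RATE_BPS / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def polls_per_second(frame_bytes: int, burst_size: int) -> int:
    """Polls/s needed if every poll returns a full burst (fraction truncated)."""
    return int(max_packets_per_second(frame_bytes) / burst_size)


for burst in (1, 16, 32):
    print(64, burst, polls_per_second(64, burst))

# Gap between the measured polling rates (figure 7) and the need at burst size 32.
needed = polls_per_second(64, 32)
gap_3_6_ghz = 98.95e6 - needed
gap_1_8_ghz = 49.40e6 - needed
print(round(gap_3_6_ghz / 1e6, 1), round(gap_1_8_ghz / 1e6, 1))
```

Under these assumptions the gaps come out at roughly 98.5 and 48.9 million polls per
second, in line with the gaps annotated in figure 7.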
 
[Figure 7 (line chart, logarithmic Y axis): "Comparison between measured polls per
second and calculated theoretical polls per second need (L2FWD-DPDK-stable-16.11.1)".
X axis: Time (s), 0 to 12. Y axis: L2FWD-RX loop polling frequency, 0.01 M to 100 M.
Series: measured polling frequency at 3.6 GHz and at 1.8 GHz; theoretical polls per
second for 64, 128, 256, 512, 1024, 1280 and 1518 byte packets. Data labels: 98.95 M,
49.40 M and 0.46 M. Annotated gaps: 98 M polls/s at 3.6 GHz and 48.9 M polls/s at
1.8 GHz.]
Figure 7. Excessive Polling in DPDK
  
Figure 8 shows the effect on CPU temperatures of L2FWD-DPDK running idle. Core
1 and core 2 are continuously polling the NIC for packets; core 0 (green line) and core 3 
(dark orange line) are idle. The red graph line represents the CPU fan speed in 
Revolutions per Minute (RPM) for a certain CPU package temperature. The graphs 
shown in figure 8 were obtained using psensor [35], which is a graphical hardware 
temperature monitoring tool for Linux that was developed by Jean-Philippe Orsini. The 
psensor tool uses the lm-sensors package, which provides user space support for monitoring
the sensor chips that are incorporated in all modern hardware such as motherboards, 
CPUs, hard drives, NICs, fan speed controllers and many more. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
[Figure 8 (psensor graph, DuT: Intel Xeon E3-1285 @3.6GHz): core temperatures and CPU
fan speed, first with the system idle and then with L2FWD-DPDK running idle for 8
minutes (no incoming network traffic). Annotated temperatures: 29°C, 55°C and 73°C.
Annotated fan speeds: 1148 RPM, 1366 RPM and 1584 RPM.]
Figure 8. CPU Core Temperature Issue in DPDK – 0% Line Utilization
Figure 8 highlights the CPU temperature rise within a short period of time (8 minutes)
without any packet processing being done by the L2FWD-DPDK. The next figure,
figure 9, shows the inability of the system’s CPU fan (red graph line) to keep the
CPU package within the manufacturer’s recommended operating temperatures (Top <
80°C) when all four cores are forwarding network traffic at maximum line rate (10
Gbps).
 
 
 
 
 
 
 
 
 
 
 
 
 
This situation can seriously shorten the lifespan of the CPU and it has a negative effect 
on the system energy efficiency.  
The combination of prolonged heat and flow of electricity through the CPU’s transistors 
causes changes within the chip at the atomic level, known as Electromigration (EM) 
[36]. EM refers to the unwanted movement of materials in a semiconductor.  If the 
current density is high enough, there can be a momentum transfer from moving 
electrons to the metal ions that make up the lattice of the interconnect material. The ions 
will drift in the direction of the electron flow. The result is the gradual displacement of 
metal atoms in a semiconductor, potentially causing open and short circuits [37]. This 
power demand issue is the main drawback identified within the DPDK scheme. The 
design and implementation of the APM scheme in this research project aims to address 
this issue, as will be described later on in chapter 7.   
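[Figure 9 (psensor graph, DuT: Intel Xeon E3-1285 v3 @3.6GHz, 4 cores): core
temperatures and CPU fan speed, first with the system idle and then with L2FWD-DPDK
forwarding traffic at 10 Gbps for 8 minutes. Annotated temperatures: 36°C, 70°C and
97°C. Annotated fan speeds: 1238 RPM, 1687 RPM and 2136 RPM.]
Figure 9. CPU Core Temperature Issue in DPDK – 100% Line Utilization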
  
Chapter 3   Experimental Environment 
 
This chapter describes the actual experimental setup used throughout this research work. 
It includes the detailed description of the hardware that is used to build the Device 
under Test (DuT) machine, the Linux OS distribution chosen, the software traffic 
generators and a discussion on the resolution of measurements that will be presented in 
later chapters. 
3.1 Equipment 
This section reveals the hardware and software used for all experiments including the 
specification of the fibre optic cables interconnecting the traffic generator and the 
forwarding machine running the DPDK sample programs. 
Figure 10 illustrates the hardware laboratory setup used. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
[Figure 10 (diagram): two machines connected by a 2 m, LC/LC – LC/LC, OM3, LSZH fibre
optic cable.
Packet Generator (Fedora Linux 22): Intel DZ77GA-70K motherboard, 2x PCIe 3.0; Intel
Core i7-3770K, 3.5 GHz, 4 cores, 8 threads; 4x4 GB DDR3-1600; 120 GB SATA SSD; Intel
82599ES NIC, 10 GbE, PCIe v2, dual-port, SFP+.
DuT (Fedora Linux 24): Asus Z87-Expert motherboard, 2x PCIe 3.0; Intel Xeon E3-1285,
3.5 GHz, 4 cores, 8 threads; 4x8 GB DDR3-1600; 120 GB SATA SSD; Intel 82599ES NIC,
10 GbE, PCIe v2, dual-port, SFP+.]
Figure 10. Hardware Laboratory Equipment Setup
  
The experimental laboratory configuration setup, as presented in figure 10, was 
carefully chosen so as to be capable of supporting all experiments attempted in this 
research work. 
The configuration consists of: 
• Packet generator machine, configured with the hardware and software
needed for network traffic generation of up to 10 Gbps with minimum size
packets of 64 bytes (more information is provided in the next sub-section).
• Packet forwarding machine (DuT), capable of running the DPDK sample
applications used for receiving, processing and forwarding the traffic generated
by the packet generator.
• Duplex fibre optic cable connection with specifications as outlined in table 3.
Table 3. Fibre Optic Cable Specifications
Cable length            2 meters
Cable rating            Fiber – 10 Gbps
Connector A             Fiber Optic LC Duplex Male
Connector B             Fiber Optic LC Duplex Male
Fiber size              50/125
Fire rating             Low Smoke Zero Halogen (LSZH)
Fiber classification    OM3
Fiber type              Multi-mode
 
3.1.1 Packet Generator 
The hardware configuration used for network traffic (packet) generation, see figure 10,
is based on an Intel Z77 chipset [38] motherboard, which is equipped with an Intel Core
i7-3770K processor [39], 16 Gigabytes (GB) of dual-channel Double Data Rate type three
(DDR3) main memory, a Solid State Drive (SSD) with 120 GB capacity, and a
PCIe based Intel 82599ES dual-port NIC with 10GbE support.
The software configuration includes the Fedora 22 Linux OS [40] running kernel 
version 4.4.14-200.fc22.x86_64 and the Gnome 3 Desktop Environment (DE). The 
system is installed and configured with the two frameworks (MoonGen and TRex) 
which are used for packet generation. The packet sizes that can be generated within a 
stream are in the range from 64 to 1518 bytes. More information about packet 
generation in software is presented in section 3.2. 
  
3.1.2 Packet Forwarder 
The DuT machine is the packet forwarder. The hardware configuration of this machine
is based on an Intel Z87 chipset [41] motherboard, which is fitted with an Intel Xeon
E3-1285 v3 processor, 32 GB of DDR3 RAM, an SSD with 120 GB capacity and a PCIe
based Intel 82599ES dual-port NIC.
The software side includes the Fedora 24 Linux distribution [42] with kernel version 
4.5.5-300.fc24.x86_64 and the Gnome 3 DE. The latest DPDK stable version (16.11.1) 
is installed and configured to forward packets at up to 10 Gbps maximum line rate. 
 
3.2 Network Traffic Generation and Modelling 
This section describes the tools and schemes used in this research to generate and model 
the network traffic. The packet generating frameworks are described in the next two 
sub-sections. 
Source traffic generation and modelling are very important features of this specific 
research work and for network communications in general; it is a starting point of the 
network packet processing evaluations. The synthetic traffic generation features include 
packet(s) generation, grouping the packets in a flow (stream) which is based on their 
destination IP or Media Access Control (MAC) address, and sending the flow to the 
transmit buffer of the NIC. 
Synthetic traffic modelling provides the researcher with a mathematical
approximation of real network traffic behaviour. Configuration decisions depend on
key features such as: the session duration, the size of the packets, the number of packets
per second (PPS), the inter-packet gap (IPG), the size of the Inter-Stream Gap (ISG) and
the line bandwidth available. A session is the time interval in which two hosts are
connected and exchanging information in terms of network packets.
Figure 11 provides a simple representation of a session between two arbitrary hosts.
 
 
 
 
[Figure 11 (diagram): a session between Host A and Host B consists of streams separated
by an Inter-Stream Gap (ISG). Each stream consists of bursts 1 to N separated by
Inter-Burst Gaps (IBG). Each burst consists of packets (examples shown: 64, 128, 256
and 1518 bytes) separated by Inter-Packet Gaps (IPG).]
Figure 11. Simple Connection Session
In order to test the implementation of the Adaptive Polling Mechanism, variable
network traffic profiles needed to be generated using two different network packet
generators. As already stated in chapter 1, MoonGen and TRex are the traffic
generators used, and both are based on DPDK. The next two sub-sections explain the
general features of these two packet generators.
3.2.1 MoonGen High-Speed Packet Generator
MoonGen is a software packet generator framework developed by Paul Emmerich from
the Technical University of Munich, Germany. It is designed to be used with widely
available hardware and it is built on top of DPDK. Figure 12 illustrates the architecture
of MoonGen.
 
 
  
     
 
 
 
 
 
 
 
 
 
 
  
  
Figure 12. MoonGen Architecture [1] 
  
The “MoonGen Core” (see figure 12) is a Lua [43] wrapper for DPDK that provides
utility functions for the packet generation process. The API comes with functions to
configure the underlying hardware features.
MoonGen offers precise and accurate timestamping for the Intel 82599 NIC, as used in
all experiments. Timestamping on the 82599ES NIC operates at 156.25 MHz, resulting in a
precision of ±6.4 ns. Rate control is an important feature for any traffic generator.
Controlling the PPS rate or Bits per Second (BPS) rate is a challenging task even for
hardware based traffic generators. MoonGen can use the hardware rate control of Intel
NICs, which is limited to the generation of Constant Bit Rate (CBR) traffic and bursty
traffic. It also provides a rate control API for software control. Traffic patterns can be
created using the software rate control by inserting a variable length network frame with
an incorrect Cyclic Redundancy Check (CRC) checksum. Because the receiving NIC
automatically discards a packet with a ‘bad’ CRC, this creates a non-uniform
Inter-Packet Gap (IPG) on the receiving end. The duration of the IPG can be strictly
controlled by the size of the invalid packet sent. Typical precision is limited by the
byte-rate (0.8 ns) and the minimum packet size of 76 bytes (64 + 12 Inter Frame Gap
(IFG)). Byte-rate is the time, in nanoseconds, needed to transmit 1 byte (8 bits) of
information at 10 Gbps, calculated as:
$$\text{byte-rate} = \frac{1\ \text{s}}{(10{,}000{,}000{,}000 \div 8)\ \text{bytes}} = 0.8\ \text{ns per byte}$$
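
The timing figures quoted above can be checked with a few lines of arithmetic. The
sketch below is not MoonGen code. The function names are illustrative, and it assumes
that the gap created by an invalid filler frame is simply its size on the wire
multiplied by the byte-rate.

```python
LINE_RATE_BPS = 10_000_000_000     # 10 Gbps
TIMESTAMP_CLOCK_HZ = 156_250_000   # timestamp clock of the 82599, as quoted in the text


def byte_time_ns(line_rate_bps: int = LINE_RATE_BPS) -> float:
    """Time to transmit one byte (8 bits), in nanoseconds."""
    return 8 * 1e9 / line_rate_bps


def gap_ns(filler_bytes: int) -> float:
    """Assumed gap produced by an invalid filler frame of the given wire size."""
    return filler_bytes * byte_time_ns()


def clock_period_ns(clock_hz: int) -> float:
    """Duration of one clock tick, in nanoseconds."""
    return 1e9 / clock_hz


print(byte_time_ns())                        # byte-rate at 10 Gbps
print(clock_period_ns(TIMESTAMP_CLOCK_HZ))   # timestamp resolution
print(gap_ns(76))                            # shortest gap from a 76-byte filler
```

Under these assumptions the gap can be adjusted in steps of 0.8 ns, and the smallest
gap that a filler frame can create is about 60.8 ns.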
MoonGen comes with many example scripts which can test different network
functionalities. The scripts used in this research test packet latencies, software
and hardware timestamping and L2 traffic generation.
3.2.2 TRex Traffic Generator 
TRex is another packet generator built on top of the DPDK framework and running on 
standard Intel processors. It operates on the basis of a client/server model where both 
client and server reside on the same machine. Figure 13 shows the basic setup of TRex. 
 
 
 
 
 
 
 25 
  
 
Figure 13. TRex Basic Setup [44] 
TRex supports both stateful and stateless traffic generation modes. The ‘Stateful’ mode 
is meant to test networking gear which saves state per flow (5 tuple) [45]. The traffic 
flows that are generated in “Stateful” mode are identified by their 5-tuple (source IP, 
source port, destination IP, destination port and transport protocol to be used). Usually, 
TRex Stateful mode uses pre-recorded network traces, which are saved as pcap files and 
can be injected on a pair of interfaces thus imitating client – server communication. The 
‘Stateless’ mode is intended to test networking equipment that makes decisions based 
on packet contents and enables basic L2/L3 network traffic testing associated with a 
switch or a router. The ‘Stateless’ mode is much more flexible enabling the user to 
define the type and contents of a packet. Most of the experiments attempted were 
carried out using the stateless mode. The following are some of the high level 
functionalities provided by TRex ‘Stateless’ mode: 
• Supports 10-22 Million packets per second (Mpps) per CPU core 
• Support for 1, 10, 25, 40 and 100 Gbps interfaces 
• Multiple traffic profiles per interface are supported 
• Multiple streams per profile are supported 
• Stream support includes: 
  o Packet template to build any packet 
  o Field engine program to change any field inside the packet 
  o Mode, which can be continuous, burst and multi-burst 
  o Rate that can be specified in Mpps (14 Mpps), link bandwidth percentage 
    (25%) and L1 or L2 bandwidth in Megabits per second (Mb/sec) 
  o Stream can trigger another stream 
• Statistics support: 
  o Per interface 
  o Per stream 
  o Latency and jitter per stream 
[Figure 13 labels: TRex Machine (Intel NIC 82599ES SFP+) with Port 0 on the TRex Client Side (emulated network of clients 16.0.0.1 – 16.0.0.255, router 10.10.10.2) and Port 1 on the TRex Server Side (emulated network of servers 48.0.0.1 – 48.0.0.240, router 12.12.12.2); DuT (Intel NIC 82599ES SFP+) with Port 0 and Port 1 and addresses 10.10.10.1 and 12.12.12.1.] 
TRex is supplied with a wide variety of examples and comprehensive documentation 
for creating custom traffic profiles. 
3.3 Measurements Resolution 
This section discusses and explains the measurements and metrics that are used to 
evaluate system parameters and system performance in the case studies presented in 
chapters 5, 6 and 7. 
 
3.3.1 Maximum Packet Throughput  
In our particular setup, the maximum packet throughput is limited by the bandwidth of 
the transmission link and by the PCIe bandwidth. The PCIe bandwidth is mentioned 
here because the NICs used in all experiments are connected to the system through a PCIe 
slot. 
Maximum Packet Rate Calculation 
The theoretical maximum bit rate on a 10 GbE line must be considered. The MAC bit rate 
of 10 GbE, as defined in IEEE standard 802.3ae [46], is 10 billion bits per second 
(10,000,000,000 bits/sec). This does not reveal much useful information unless the 
actual packet size and rate are specified. 
The maximum packet rate for 10GbE is calculated using the following formula: 
$$\text{Maximum Packet Rate} = \frac{\text{MAC Transmit Bit Rate}}{\text{Preamble} + \text{Frame Length} + \text{Inter Frame Gap}}$$
• Preamble is the starting point of an Ethernet packet and it consists of a seven-byte 
  pattern of alternating 1 and 0 bits plus 1 byte for the Start Frame Delimiter (SFD). The 
  total size is 8 bytes (64 bits) and it does not change for different sizes of packets. 
• Frame Length is made of destination MAC address (48 bits), source MAC 
  address (48 bits), EtherType/length field (16 bits), payload field (368 bits minimum) 
  and Frame Check Sequence (FCS) code (32 bits) with a minimum total size of 64 bytes 
  (512 bits). 
• Inter-Frame Gap (IFG) is the minimum idle time required to synchronise clocks 
  before sending the next packet on the line. It is a fixed size of 12 bytes (96 bits). 
Therefore, 
$$\text{Maximum Packet Rate} = \frac{10{,}000{,}000{,}000\ \text{bits/s}}{64\ \text{bits} + 512\ \text{bits} + 96\ \text{bits}} \approx 14{,}880{,}952\ \text{packets per second}$$
At the maximum packet rate of 14.88 Mpps, the time (t) it takes to transmit a minimum size 
packet of 64 bytes (84 bytes including preamble and IFG) is only 67.2 nanoseconds 
(67.2 × 10⁻⁹ s). 
This is calculated with the following formula: 
$$t = \frac{(64\ \text{bytes} + 20\ \text{bytes}) \times 8\ \text{bits/byte}}{10{,}000{,}000{,}000\ \text{bits/s}} = 67.2\ \text{ns}$$
This transmission duration is also known as inherent latency per packet. The inherent 
latency varies for different sizes of packets. 
Table 4 presents the maximum packet rate per second for the seven different packet 
sizes identified by RFC 2544 [47] methodology for benchmarking networking devices. 
 
 
 
 
 
Table 4. Theoretical Maximum Packet Rates per Second 
Packet Size   Theoretical Maximum   Time budget per packet at   CPU cycles per packet
(Bytes)       Packet Rate (Mpps)    maximum packet rate (ns)    @ 3.6 GHz
64            14.880952             67.20                       241.92
128           8.445946              118.40                      426.24
256           4.528986              220.80                      794.88
512           2.349624              425.60                      1532.16
1024          1.197318              835.20                      3006.72
1280          0.961538              1040.00                     3744.00
1518          0.812744              1230.40                     4429.44
 
The CPU cycles per packet, shown in table 4, can be estimated with the following: 
$$\text{CPU cycles per packet} = \frac{\text{CPU Core Frequency (MHz)}}{\text{Line Rate (Mpps)}}$$
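The columns of table 4 follow directly from the packet rate and CPU cycle formulae. The short Python sketch below is illustrative only: the function and constant names are chosen for this example, and the 20 bytes of overhead stand for the 8-byte preamble plus the 12-byte IFG.

```python
# Illustrative sketch: recomputes the columns of table 4 for a 10 GbE link.
LINK_RATE_BPS = 10_000_000_000   # MAC transmit bit rate of 10 GbE
OVERHEAD_BYTES = 8 + 12          # preamble (including SFD) + inter-frame gap
CPU_FREQ_HZ = 3_600_000_000      # nominal core frequency used in table 4


def max_packet_rate(frame_bytes):
    """Theoretical maximum packet rate in packets per second."""
    return LINK_RATE_BPS / ((frame_bytes + OVERHEAD_BYTES) * 8)


def time_budget_ns(frame_bytes):
    """Time on the wire per packet at maximum packet rate, in nanoseconds."""
    return (frame_bytes + OVERHEAD_BYTES) * 8 / LINK_RATE_BPS * 1e9


def cycles_per_packet(frame_bytes):
    """CPU cycles available per packet at maximum packet rate."""
    return CPU_FREQ_HZ / max_packet_rate(frame_bytes)


if __name__ == "__main__":
    for size in (64, 128, 256, 512, 1024, 1280, 1518):
        print(f"{size:5d} B  {max_packet_rate(size) / 1e6:10.6f} Mpps  "
              f"{time_budget_ns(size):8.2f} ns  "
              f"{cycles_per_packet(size):8.2f} cycles")
```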
 
 
Maximum PCIe Bandwidth Estimation 
The total available PCIe bandwidth depends on the PCIe version and the number of 
lanes used. In this research work, only version 2.0 is considered because Intel’s 
82599ES NIC uses this version. The PCI Special Interest Group (PCI-SIG) [48] 
announces the PCIe 2.0 bandwidth as 5 Gigatransfers per second (GT/s) raw data rate 
per lane, which is around 5000 Mb/s. On an 8 lane format this equates to 40 GT/s or 
approximately 40 Gbps theoretical raw rate per card.  
However, the actual data rate that can be achieved on a dual-port 10 GbE NIC is 8 Gbps 
per direction per port. The total bandwidth for two ports that are transmitting and 
receiving at the same time is a maximum of 32 Gbps. 
The difference can be explained by the encoding of data. PCIe is a serial bus with a 
clock embedded in the data and it needs to ensure that there are enough level transitions 
(1 to 0 and 0 to 1) occurring for any receiver to recover the clock from the data and to 
synchronise its own PCIe clock. PCIe 2.0 uses 10 bits to send 8 bits of data (8b/10b 
encoding), limiting the effective bit rate to 80% of the raw data rate. 
The total bandwidth of the laboratory setup in figure 10 can be calculated as: 
$$\text{Total Bandwidth} = 40\ \text{GT/s} \times 80\% = 32\ \text{Gbps}$$
  
Chapter 4   The Adaptive Polling Mechanism 
 
This chapter presents the background to the Adaptive Polling Mechanism (APM) along 
with a discussion of the various approaches which were considered at the beginning 
of this research work. The final algorithm chosen and the rationale for it are also presented. 
4.1 Background 
The problems that the APM attempts to address were briefly outlined in subsection 1.3 
‘Motivation’ and in subsection 2.3.4 ‘DPDK – impact on overall system power demand 
and core temperatures’.  
When designing the APM, a number of key conditions had to be considered: 
• It must be implemented within the RX loop of the DPDK’s sample application 
  L2FWD.  
• It needs to take into account the dynamics of the incoming network traffic and 
  adjust the polling frequency of the RX loop accordingly.  
• It must not introduce intolerable latency and computation overhead to the 
  system.  
• It must be as generic as possible.  
The leading traffic model considered when designing the APM was the ON/OFF model. 
This model describes the traffic transmitted between two individual Local-Area 
Network (LAN) hosts (Packet generator and DuT). 
4.2 The ON/OFF Traffic Model 
In the simple ON/OFF network traffic model the source host alternates 
between an “active” (ON-period) and an “idle” (OFF-period) state. During the ON-period, 
the stream of packets generated and transmitted is considered to be at a 
constant rate 1/T, where T represents the inter-arrival time between packets.  During 
the OFF-period, there are no packets generated. Figure 14 illustrates the basic scheme. 
 
Figure 14. The Simple ON/OFF Traffic Model 
[Figure 14 labels: alternating ON Period and OFF-Period; within each ON Period, Packet 1 to Packet N are separated by a constant inter-arrival time (labelled T1 and T2).] 
  
The key area of interest concerning the APM design is the OFF-period where an 
innovative approach can be devised to prevent the DPDK’s lcore(s) from exhaustive 
polling behaviour. 
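The ON/OFF model can be illustrated with a small generator of packet arrival times. This is a minimal sketch under stated assumptions: the function name, the tuple layout and the microsecond units are choices made for this example and are not taken from MoonGen or TRex.

```python
# Illustrative sketch of the simple ON/OFF traffic model.
def on_off_arrivals(periods):
    """Return packet arrival times (in µs) for alternating ON/OFF periods.

    periods: list of (on_duration_us, off_duration_us, inter_arrival_us)
    tuples, one tuple per ON/OFF cycle. During an ON-period packets are
    spaced by the constant inter-arrival time T; during an OFF-period no
    packets are generated.
    """
    arrivals = []
    start = 0.0
    for on_us, off_us, inter_arrival_us in periods:
        n_packets = int(on_us // inter_arrival_us)
        for k in range(n_packets):
            arrivals.append(start + k * inter_arrival_us)
        start += on_us + off_us  # skip the idle OFF-period
    return arrivals
```

The gaps between the last packet of one ON-period and the first packet of the next are the idle intervals in which an adaptive mechanism can reduce the polling frequency.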
4.3 APM Theoretical Design Overview 
This section describes the APM theoretical design approaches including some important 
constraints and variations. 
Figure 15 presents the control flow within the L2FWD-DPDK and indicates where the 
Adaptive Polling Mechanism should be implemented. 
 
Figure 15. L2FWD-DPDK Control Flow 
[Figure 15 flow (L2FWD-DPDK): START → EAL initialization → port (device) initialization (driver, mbuf pool, TX queue, RX queue and TX buffer initialization) → start all initialized devices → start threads on all lcore(s) (Lcore1 … Lcore N, each running RX/TX). RX/TX loop: check and retrieve a burst of packets from the RX queue with rte_eth_rx_burst(); if packets were received (Y), prefetch the packets in CPU core cache lines with rte_prefetch0() and modify, buffer and transmit them with l2fwd_simple_forward(); if ZERO packets were received (N), the Adaptive Polling Mechanism adjusts the polling frequency of the RX loop.] 
  
Inside the RX/TX loop, in figure 15, the CPU core(s) execute(s) a continuous busy-wait 
loop, which polls the NIC’s RX queue(s) for incoming packets. The duration 
of every poll depends on the time taken to call, execute and return from the function 
rte_eth_rx_burst(). It was already identified, in figure 7, that the rte_eth_rx_burst() 
function is called nearly 100 million times per second more than needed when the 
processor core is running at a nominal frequency of 3.6 GHz. In this particular case there 
are two obvious ways to reduce the polling frequency of the RX/TX loop: 
• Reduce the CPU core frequency if the number of packets returned by the 
  recent polls is zero. The polling frequency of the RX loop is directly related to 
  the frequency of the CPU core(s) executing the actual RX/TX loop. Figure 7 in 
  2.3.4 provides evidence of this claim.  
• Insert delay between two consecutive polls returning zero packets. DPDK 
  libraries provide functions to pause the executing thread for a certain amount of 
  time, usually microseconds or milliseconds. Using this approach, the interval 
  between two consecutive polls is widened, which leads to reduced polling 
  frequency of the RX/TX loop. 
The first approach is explored by the example program ‘l3fwd-power’ that is supplied 
with the DPDK package [49]. However, this sample program does not work properly 
with the current intel_pstate Linux kernel driver which is responsible for governing the 
performance and sleep states of Intel CPUs. The research work in this thesis is based on 
the second approach. The idea is to introduce an artificial delay between consecutive 
zero polls (zero packets retrieved) and bring the polling frequency as close as possible 
to the theoretical levels calculated for different bursts and sizes of packets, as illustrated 
in table 5 below. 
Table 5. Theoretical Polls/s for Different Bursts and Sizes of Packets 
Packet Size   Maximum Theoretical   Theoretical polls/s with a burst size of
(Bytes)       Line Rate (PPS)       32 packets   16 packets   4 packets    1 packet
64            14,880,952            465,030      930,060      3,720,238    14,880,952
128           8,445,946             263,936      527,872      2,111,487    8,445,946
256           4,528,986             141,531      283,062      1,132,247    4,528,986
512           2,349,624             73,426       146,852      587,406      2,349,624
1024          1,197,318             37,416       74,832       299,330      1,197,318
1280          961,538               30,048       60,096       240,385      961,538
1518          812,744               25,398       50,797       203,186      812,744
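The theoretical polls/s values in table 5 are the maximum line rate divided by the burst size, which assumes that every poll returns a completely full burst. A minimal Python sketch of that calculation follows; the function name and the 20-byte preamble-plus-IFG constant are naming choices for this example.

```python
# Illustrative sketch: theoretical polls/s = line rate (pps) / burst size.
LINK_RATE_BPS = 10_000_000_000   # 10 GbE
OVERHEAD_BYTES = 20              # preamble (8 bytes) + IFG (12 bytes)


def theoretical_polls_per_second(frame_bytes, burst_size):
    """Polls per second needed if every poll retrieves a full burst."""
    line_rate_pps = LINK_RATE_BPS / ((frame_bytes + OVERHEAD_BYTES) * 8)
    return line_rate_pps / burst_size


if __name__ == "__main__":
    for burst in (32, 16, 4, 1):
        print(burst, round(theoretical_polls_per_second(64, burst)))
```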
  
The insertion of an artificial delay between consecutive zero polls is handled by the 
rte_delay_us() function provided by the DPDK’s API. A detailed explanation of the 
inner workings of rte_delay_us() is provided in chapter 7. 
The induced delay is to be increased in a conservative manner with every consecutive 
zero poll until a certain threshold is reached. If a burst of packets is received by the 
NIC, the delay to be inserted in the next loop is reduced to a minimum because the 
probability of more packets arriving is increased greatly.  
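The back-off behaviour described above can be sketched as a small simulation. This is not the L2FWD-Adaptive source code: the minimum delay, the threshold and the step size are assumed values chosen for illustration, and the real RX loop would call rte_delay_us() where the sketch only records the delay.

```python
# Illustrative sketch of the adaptive delay logic (not the L2FWD-Adaptive code).
MIN_DELAY_US = 2      # assumed minimum delay (delay0)
MAX_DELAY_US = 300    # assumed upper threshold
STEP_US = 6           # assumed conservative increment per consecutive zero poll


def next_delay(current_delay_us, packets_received):
    """Return the delay to use if the next poll returns zero packets."""
    if packets_received > 0:
        # Packets arrived: more are likely to follow, so reset to the minimum.
        return MIN_DELAY_US
    # Zero poll: back off gradually, but never beyond the threshold.
    return min(current_delay_us + STEP_US, MAX_DELAY_US)


def simulate(poll_results):
    """Replay per-poll packet counts and return the delay inserted after each poll."""
    delay = MIN_DELAY_US
    inserted = []
    for n_packets in poll_results:
        if n_packets == 0:
            inserted.append(delay)   # the real loop would call rte_delay_us(delay)
        else:
            inserted.append(0)       # packets go straight to TX, no delay
        delay = next_delay(delay, n_packets)
    return inserted
```

For example, three zero polls followed by a burst of 32 packets and another zero poll yield inserted delays of 2, 8, 14, 0 and 2 µs under these assumed parameters.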
Figure 16 illustrates the idea behind the initial APM design. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 16. Initial APM Design 
[Figure 16 flow (L2FWD-DPDK RX/TX loop): Start RX. Poll = 1: if packets were received (Y), reset the delay to the minimum (current_delay = delay0 µs) and go to TX; otherwise (N) insert a small delay in microsecond(s) (current_delay = delay0 µs) and try RX again. Poll = 2: if packets were received (Y), reset the delay to the minimum (current_delay = delay0 µs) and go to TX; otherwise (N) increase the delay in a conservative manner and insert it (current_delay = delay1 µs, delay1 > delay0), then try RX again. Poll = N: if packets were received (Y), go to TX; otherwise (N) insert the increased delay (current_delay = delayN µs, delayN > previous_delay) and try RX again.] 
  
4.3.1 Initial APM Design Tests 
The amount of delay to be inserted between consecutive zero polls has to be carefully 
chosen based on certain latency requirements, which are imposed by the user 
application sending packets to the packet forwarding machine that is running L2FWD-
DPDK; and the length of that delay should not impair the system responsiveness to 
incoming network traffic. 
To establish a tolerable delay that can be inserted within the RX/TX loop, as shown in 
figure 16, some preliminary experiments were conducted using the basic forwarding 
sample application that is included in the DPDK framework [50]. This sample program 
is very similar to L2FWD, except that it uses single RX and TX queues. 
Assuming a dual-port NIC is installed on the forwarding machine, the ‘skeleton app’ 
receives incoming packets in bursts on the RX port and transmits them in bursts on the TX 
port using a simple mapping scheme: port 0 (RX queue) → port 1 (TX queue), port 1 (RX 
queue) → port 0 (TX queue). The results obtained are presented in figure 17. 
 
Figure 17. Effect of Induced Delay on RX/TX Polling Frequency at Maximum Line Rate 
[Figure 17: “Effect of Induced Delay on the Polling Frequency of Skeleton App”. X axis: induced delay (0.0 µs to 250.0 µs); Y axis: polls per second (0.02 M to 2.00 M); series: 64, 128, 256, 512, 1024 and 1518 bytes. Polling frequency data labels: 33.06 M, 25.25 M, 40.32 M, 76.5264 M, 103.1786 M and 112.5498 M. “Golden pair of numbers” for each packet size: 13.5 µs, 0.64 M; 27.5 µs, 0.36 M; 52.5 µs, 0.20 M; 105.5 µs, 0.0970 M; 183.5 µs, 0.0497 M; 200.0 µs, 0.0362 M.] 
  
The above figure 17 highlights the fact that introducing different delays within the 
receiving loop has the positive effect of reducing the polling frequency of the RX/TX 
loop. The ‘golden pair of numbers’ represents the maximum delay that can be injected 
between consecutive ‘zero polls’, and the corresponding polling frequency measured at 
maximum Constant Bit Rate (CBR) line rate. The maximum delay is obtained by 
manually inserting a constant delay in microseconds within the RX loop of the 
“skeleton app” and observing if a packet drop occurs. If there are no packets lost, the 
delay is increased in a conservative manner for the next test. The function that is used to 
insert delay is rte_delay_us() from DPDK’s API. This experiment was completed 
without any performance penalties on the receiving side or any packets being lost. In the 
same experiment the changes in the application’s efficiency for forwarding traffic with 
different packet sizes were recorded. The results are presented in figure 18. 
 
Figure 18. Effect of Induced Delay on the Efficiency of the Basic Forwarding Application 
[Figure 18: “Effect of induced delay on Receiving Loop Efficiency (Skeleton App, DPDK)”. Axes: packet size (64, 128, 256, 512, 1024 and 1518 bytes) and induced delay per zero poll encountered (0.0 µs to 250.0 µs); data labels: 10.95%, 82.84%, 88.93% and 90.76%.] 
This initial test with static artificial delay confirmed that the proposed theoretical 
approach can be a valuable solution for the original high polling frequency problem. 
  
The next step was to propose a scheme which will create a range of delay values that 
can be used by the adaptive mechanism.  
4.3.2 Induced Delay Schemes 
In this sub-section some of the schemes for generating a range of delay values are 
discussed, presented and evaluated. 
The first step for calculating a range of values is to establish the upper and lower 
bounds of the range. In this case, the minimum delay that can be inserted into the RX 
loop without missing any packets represents the lower bound. The upper bound is the 
maximum delay that can be inserted without incurring a packet loss, and it is usually 
based on user’s application latency requirements.  
Calculating minimum delay value 
If we consider a 10 GbE line saturated with 64-byte packets and a machine forwarding 
the same traffic using L2FWD-DPDK, it can be calculated that a burst of 32 packets 
will have a duration of 2.15 microseconds (µs). This means that, in theory, the RX loop 
can poll every 2.15 µs and will be able to process a burst of 32 packets without loss. It 
will be shown later that choosing a minimum delay value of 2 µs is a good starting 
point. 
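The 2.15 µs figure follows from the 67.2 ns time budget of a 64-byte packet (84 bytes on the wire including preamble and IFG) multiplied by the burst size of 32 packets:

$$t_{burst} = 32 \times \frac{(64 + 20)\ \text{bytes} \times 8\ \text{bits/byte}}{10{,}000{,}000{,}000\ \text{bits/s}} = 32 \times 67.2\ \text{ns} = 2150.4\ \text{ns} \approx 2.15\ \mu\text{s}$$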
Calculating maximum delay value 
The largest delay value that can be induced into the RX loop depends mainly on the 
user’s application latency requirements and can hold values that are bigger than the 
average Round-trip time (RTT) of a packet between source and destination machines 
on the network. In this context, the source is the computer sending the packet (packet 
generator) and the destination is a remote host (packet forwarder) that receives the 
packet and retransmits it back to its origin. The value of RTT can vary from microseconds 
to seconds and it depends mainly on: 
• Data transfer rate of the source 
• The physical distance between source and destination 
• The type of medium connecting the two hosts on the network 
• Network congestion, network bandwidth etc. 
For the purpose of all laboratory experiments attempted in this research work the 
maximum delay value is kept at 300 µs and it was chosen according to the network 
bandwidth of 10 GbE and the fibre optic connection latency measurements. More 
specifically, the maximum delay of 300 µs corresponds to the 512 receive buffers that 
are allocated for incoming packets multiplied by the duration of the average size (594 
bytes) TCP/IP packet at maximum line rate of 10 Gbps. The calculation is as follows: 
$$Max_{delay} = 512 \times 0.623\ \mu\text{s} \approx 300\ \mu\text{s}$$
Once the limits of the delay range were established, there were 8 different schemes 
chosen to produce delay values ranging from 2 µs to 300 µs. Three of the schemes are 
based on a linear increasing series of values, and the rest are based on exponentially 
increasing series of values. The 8 different schemes are developed to explore the 
insertion of artificial delay in the first 100 consecutive zero loops encountered by the 
L2FWD’s receive mechanism. Each scheme has a unique step of delay increase that 
provides different application responsiveness. The presumption here is that the 
application can switch to interrupt mode after the first 100 zero loops. 
A graphical representation of the proposed schemes is shown in figure 19. 
 
Figure 19. Schemes for Delay Values Generation 
[Figure 19: “Proposed Schemes for Generation of Delay Values”. X axis: consecutive zero loops (logarithmic scale, 1 to 100); Y axis: induced delay (0 µs to 350 µs); series: Linear Increase 1, Linear Increase 2, Linear Increase 3, Exponential r = 1.125, Exponential r = 1.25, Exponential r = 1.5, Exponential r = 2.0 and Exponential r = 2.5.] 
  
Calculating the N-th term 
Figure 19 provides a visual representation of how the induced delay is increased on 
every poll that does not return any packets. Table 6 provides the formulae and 
explanation of how to generate each number sequence for the different schemes. 
Table 6. Formulae for Delay Values Generation 
Scheme               Formula for calculating the n-th term            Terms
Linear Incr 1        $a_n = (a_1 + (n - 1) \cdot d) + 1$, $d = 2$       $n = 1$, $a_1 = 2$, $a_2 = 5$ …
Linear Incr 2        $a_n = (a_1 + (n - 1) \cdot d) + 1$, $d = 4$       $n = 1$, $a_1 = 2$, $a_2 = 7$ …
Linear Incr 3        $a_n = (a_1 + (n - 1) \cdot d) + 1$, $d = 6$       $n = 1$, $a_1 = 2$, $a_2 = 9$ …
Exponential Incr 1   $a_n = a_1 \cdot r^{n-1}$, $r = 1.125$             $n = 1$, $a_1 \cong 3$, $a_2 \cong 5$ …
Exponential Incr 2   $a_n = a_1 \cdot r^{n-1}$, $r = 1.25$              $n = 1$, $a_1 \cong 3$, $a_2 \cong 5$ …
Exponential Incr 3   $a_n = a_1 \cdot r^{n-1}$, $r = 1.5$               $n = 1$, $a_1 \cong 3$, $a_2 \cong 7$ …
Exponential Incr 4   $a_n = a_1 \cdot r^{n-1}$, $r = 2.0$               $n = 1$, $a_1 \cong 4$, $a_2 \cong 12$ …
Exponential Incr 5   $a_n = a_1 \cdot r^{n-1}$, $r = 2.5$               $n = 1$, $a_1 \cong 5$, $a_2 \cong 19$ …
Once the delay sequences are generated, the next logical step is to test and evaluate 
them. 
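The delay sequences can be generated programmatically. The Python sketch below implements the formulae of table 6 as printed, under two assumptions that are not stated in the text: the linear formula is applied from the second term onwards (with a_1 = 2 as the seed, which reproduces the listed a_2 values), and every term is capped at the 300 µs upper bound. The plain geometric formula does not reproduce the rounded example terms listed for the exponential schemes, so the original sequences may include rounding or offsets that are not described here.

```python
# Illustrative sketch: delay sequences in microseconds for the first 100 zero polls.
MAX_DELAY_US = 300  # upper bound of the delay range (assumed cap)


def linear_scheme(d, a1=2, terms=100):
    """a_1 is the seed; a_n = (a_1 + (n - 1) * d) + 1 for n >= 2, capped."""
    seq = [a1]
    for n in range(2, terms + 1):
        seq.append(min(a1 + (n - 1) * d + 1, MAX_DELAY_US))
    return seq


def exponential_scheme(r, a1, terms=100):
    """a_n = a_1 * r**(n - 1), capped at the upper bound."""
    return [min(a1 * r ** (n - 1), MAX_DELAY_US) for n in range(1, terms + 1)]


if __name__ == "__main__":
    print(linear_scheme(6)[:5])            # first terms of 'Linear Incr 3'
    print(exponential_scheme(2.0, 4)[:5])  # first terms with r = 2.0
```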
4.3.3 Induced Delay Schemes Evaluation 
In this section the schemes for producing delay values are tested, compared and 
evaluated. The scheme chosen for all further experiments is the one that guarantees the 
lowest RX/TX polling frequency for the L2FWD-DPDK application combined with 
minimum added latency per packet. 
A typical laboratory setup is presented in figure 20. 
 
 
 
 
 
 
 
 
 
 
Figure 20. Delay Schemes Testing Environment 
[Figure 20 labels: Pktgen (Intel NIC 82599ES SFP+, TRex 2.26 client and server) connected to the DuT (Intel NIC 82599ES SFP+, L2FWD-Adaptive); port labels: Port 0 TX, Port 1 RX, Port 0 RX and Port 1 TX.] 
  
The packet generator machine (Pktgen) uses TRex 2.26 to generate a traffic profile 
with increasing line utilization, containing ten streams of 64-byte packets with each 
stream having a different packet rate and duration. The gap between the streams is varied 
from 10 µs to 155 µs to mimic real traffic conditions.  
Every experiment for each scheme is repeated a minimum of 10 times and the results are 
recorded for further analysis. On each run, the following parameters are measured and 
recorded: 
• Average total polls per second for the experiment duration 
• Average zero polls per second 
• Average full (successful) polls per second 
• Average delay injected into the RX loop per second by the scheme 
• Average latency per packet 
Figure 21 presents the output of the described traffic profile. 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 21. Traffic Profile with 10 Streams 
[Figure 21 labels: time axis from 0 s to 15 s; Stream 1 at 1.5 Mpps, Stream 2 at 3.1 Mpps, Stream 3 at 4.4 Mpps, Stream 4 at 6.1 Mpps, Stream 5 at 7.5 Mpps, Stream 6 at 9.1 Mpps, Stream 7 at 10.4 Mpps, Stream 8 at 11.9 Mpps, Stream 9 at 13.4 Mpps and Stream 10 at 14.5 Mpps; inter-stream gap labels: 10 µs, 49 µs, 22 µs, 100 µs, 86 µs, 115 µs, 41 µs, 112 µs, 72 µs and 155 µs.] 
The packet forwarding machine (DuT) acts as a switch that forwards the traffic with 
L2FWD-Adaptive (DPDK-stable-16.11.1). The Pktgen repeats the traffic profile four 
times so that each experiment lasts around 60 seconds. To guarantee consistency of the 
results, certain settings are applied to the DuT for this experiment: 
• Hyper-Threading of the CPU is disabled to guarantee that each experiment is 
  run on a physical core instead of a virtual core. 
• Turbo boost technology of the processor is switched off to prevent the 
  CPU core(s) from running above the nominal frequency of 3.6 GHz. 
• Address space layout randomization (ASLR) is turned off to avoid a warning from 
  DPDK when running a secondary DPDK process to reset NIC counters. 
• L2FWD-Adaptive is configured with an increased size of receive and transmit 
  software buffers, 512 and 2048 respectively. This setting may slightly increase 
  the latency per packet but guarantees better performance against fluctuating 
  traffic. 
• The L2FWD-Adaptive code is modified to time stamp every packet and to 
  calculate the time taken to process each burst of packets. Average values are 
  calculated from the collected data on each run. 
The results of the evaluation are graphed in figure 22. 
 
Figure 22. Delay Schemes Evaluation 
[Figure 22: “Comparison between the effectiveness of the 8 schemes used by the adaptive mechanism (less is better)”. Axes: polling frequency in millions of polls / sec (logarithmic scale, 0.00 M to 52.43 M) and internal added latency per packet (1.6 µs to 3.2 µs); series: AVG Full polls / sec, AVG Zero polls / sec, AVG Total Polls / sec and AVG Lat / pkt. Data labels: 48.88 M polls / sec; average latency per packet of 1.668, 1.945, 1.931, 1.928, 1.951, 1.960, 1.956, 1.979 and 1.992 µs.] 
  
Based on the results displayed in figure 22, the scheme that provides the best 
combination of low polling frequency and low added latency per packet is 
“Linear Increase 3”. The “Linear Increase 3” scheme is implemented and used for all of 
the remaining experiments attempted in this research work. The exponential increase 
schemes show a lower polling frequency but their added latency is in the 
range of 230 ns per packet. If a lower polling frequency is preferred over low 
latency, an exponential scheme can be used with the adaptive algorithm. 
 
4.4 Summary 
In this chapter the design, test, implementation and evaluation of the Adaptive Polling 
Mechanism were presented step by step.  
In the next three chapters the chosen APM design that is implemented within the RX 
loop of the L2FWD program is referred to as L2FWD-Adaptive and it is tested and 
compared with the original L2FWD application. 
  
  
Chapter 5  Case Study One: CBR and Bursty Traffic 
 
This chapter presents the methodology for testing the implemented APM design within 
the L2FWD-Adaptive program versus the current L2FWD-DPDK sample application 
supplied with the DPDK-stable-16.11.1 package. 
5.1 CBR Traffic Tests 
The first tests conducted are meant to test the robustness of the APM design at various 
line rates of up to 14.88 Mpps with the minimum size packets of 64 bytes. In these 
experiments the IPG is dependent on the line rate utilization.  
5.1.1 Testing Environment 
The laboratory testbed environment is presented in figure 23.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
The traffic generator machine uses MoonGen or TRex to generate CBR or bursty traffic 
of Ethernet packets at different rates.  
The DuT machine receives incoming packets in bursts on the RX port(s) and transmits the 
received packets on the TX port(s) as shown in figure 23. 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 23. Laboratory Testbed Environment 
[Figure 23 labels: Traffic Generator (TRex 2.26 or MoonGen, DPDK TX/RX API) with RX and TX on Port 0 and Port 1, connected to the DuT (L2FWD-ADAPTIVE or L2FWD-DPDK, DPDK API) with TX and RX on Port 0 and Port 1.] 
  
5.1.2 CBR Traffic Experiments Description 
In these experiments, the MoonGen traffic generator is chosen to create a CBR flow of 
64-byte packets on port 0 TX with rates starting from 1 Mpps and going up to 14.88 
Mpps. MoonGen was preferred in this case because it can utilize the hardware rate 
control on Intel NICs when generating Constant Bit Rate (CBR) traffic. The DuT 
forwards the incoming packets with L2FWD-Adaptive and L2FWD-DPDK from the 
dpdk-stable-16.11.1 package release. Each experiment with a different rate is repeated a 
minimum of 10 times and results are recorded for analysis. Certain changes were 
applied to the source code so that the following parameters can be measured on each run: 
1) Average total polls per second per CPU core for experiment duration 
2) Average internal latency per packet added by L2FWD-Adaptive when compared 
with L2FWD-DPDK 
3) Average CPU headroom. This is the time in microseconds per second when the 
forwarding CPU core does not process any packets and it is in a “wait” state. 
CPU headroom is expressed in percentage terms. 
4) Application efficiency. This is the ratio between the full polls and total polls 
expressed in percentage terms. 
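The last two metrics are simple ratios and can be sketched as follows. The function names and the counter arguments are hypothetical and are not taken from the modified L2FWD source; the sketch only shows how the percentages are derived from the measured quantities.

```python
# Illustrative sketch of the two percentage metrics defined above.
def application_efficiency(full_polls, total_polls):
    """Full (successful) polls as a percentage of all polls."""
    if total_polls == 0:
        return 0.0
    return 100.0 * full_polls / total_polls


def cpu_headroom(wait_us_per_second):
    """Percentage of each second the forwarding core spends in the 'wait' state."""
    return 100.0 * wait_us_per_second / 1_000_000
```

For instance, a core that waits for 500,000 µs in every second has a headroom of 50%, and an application that retrieves packets on 25 out of every 100 polls has an efficiency of 25%.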
5.1.3 CBR Traffic Experiments Results 
The first two results from the described experiments are graphed in figure 24.  
 
Figure 24. CBR Traffic Experiments – AVG Polls/s and AVG Internal Latency Comparison 
[Figure 24: “CBR Traffic Experiments: AVG Total Polls and AVG Added Internal Latency per Packet”. X axis: line rate; axes: polling frequency in million polls / sec (0 M to 40 M) and internal added latency per packet (0 ns to 300 ns). Data labels:] 
Line Rate     L2FWD-DPDK   L2FWD-ADAPT   AVG Added Internal Latency per Packet
1 GB / sec    37.80 M      0.95 M        111 ns
2 GB / sec    28.91 M      1.06 M        24 ns
4 GB / sec    14.07 M      1.27 M        129 ns
8 GB / sec    12.68 M      1.23 M        246 ns
10 GB / sec   13.30 M      1.22 M        259 ns
  
The CPU headroom results are obtained only using the L2FWD-Adaptive application. 
These measurements illustrate the amount of induced delay inserted by the APM every 
second in order to prevent the CPU core from excessive polling with different line rates 
applied. 
The results are presented in figure 25. 
 
Figure 25. CPU Headroom with CBR Traffic 
[Figure 25: “CBR Traffic Experiments: CPU Core Headroom at Different Line Rates”. X axis: line rate; Y axis: CPU idle time in (%). Data labels: 90.05% at 1 GB / sec, 83.73% at 2 GB / sec, 72.31% at 4 GB / sec, 56.24% at 8 GB / sec and 50.24% at 10 GB / sec.] 
The application efficiency is a ratio of the useful work performed by the RX/TX loop to 
the total work done by the L2FWD application. The comparison between the two 
schemes is presented in figure 26. 
 
Figure 26. CBR Traffic Experiments - Application Efficiency Comparison  
[Figure 26: “CBR Traffic Experiments: Application Efficiency Comparison”. X axis: line rate (1, 2, 4, 8 and 10 GB / sec); Y axis: full polls / total polls in (%); series: L2FWD-DPDK and L2FWD-ADAPT. Recoverable data labels: 2.02%, 7.82%, 30.21%, 28.64% and 25.06%.] 
  
5.1.4 CBR Traffic Case Study Conclusions 
Figure 24 shows that the APM maintains a low polling frequency of the 
RX/TX loop both at a low line rate of 1 Gbps and at the maximum line rate of 10 Gbps. 
This is achieved with a negligible increase of the internal latency per packet ranging 
from 111 ns for a low line rate of 1 Gbps to 259 ns for the full line rate of 10 Gbps. 
Figure 25 highlights the interesting fact that even at maximum line utilization the CPU 
core only spends around 50% of the time processing incoming packets. The rest of the 
time is utilised by the APM and the core’s execution of the busy-wait RX/TX loop is 
paused, thus saving energy.  
Figure 26 illustrates best the fact that the application efficiency can be increased 
dramatically with the use of the APM design within the RX/TX loop. The improvement 
at all different line rates is over 50% compared with the original scheme. 
 
 
 
 
 
5.2 Bursty Traffic Tests 
The second set of tests is aimed at testing the APM design under ‘bursty’ network 
traffic conditions with various line rates of up to 14.88 Mpps along with a variable length 
of packet bursts. The size of the packets to be used is 64 bytes. In these experiments the 
IPG is dependent on the burst rate. The Inter-burst Gap (IBG) is increased uniformly from 
2 µs to 1001 µs. 
5.2.1 Bursty Traffic Experiments Description 
In these experiments, the TRex 2.26 traffic generator is chosen to generate a ‘bursty’ 
traffic profile, illustrated in figure 27, containing 64-byte packets on port 0 TX with 
burst rates varying from 1.5 Mpps up to 14.86 Mpps. There are 1000 different size 
bursts generated with a uniformly increasing Inter-burst gap. Each train of packets is sent 
from the TRex 2.26 packet generator at a different line rate from the previous one. The 
configuration of this traffic profile guarantees that there are 1000 unique streams of 
64-byte packets, each with a unique IBG. 
 
Figure 27. TRex Traffic Profile - 1000 Bursts 
[Figure 27: “Traffic Profile with 1000 Bursts of Ethernet Packets”. X axis: number of individual bursts (0 to 1000); logarithmic axes: packets per second and burst size (1 to 16,777,216) and inter-burst gap (1 µs to 1024 µs); series: IBG, PPS and Burst Size.] 
The DuT forwards the incoming packets with L2FWD-Adaptive and L2FWD-DPDK 
from the dpdk-stable-16.11.1 package release. Each experiment is repeated a minimum of 
10 times and results are recorded for analysis. Certain changes were applied to the 
source code so that the following parameters can be measured on each run: 
1) Average total polls per second for experiment duration 
2) Average zero polls per second 
3) Average full polls per second 
4) Average internal latency per packet 
5) Application efficiency. This is the ratio between the number of full polls and the 
total polls for the duration of the experiment, expressed in 
percentage terms. 
  
5.2.2 Bursty Traffic Experiments Results 
The results for parameters 1, 2 and 3 (average total, zero and full polls per second)
are presented in figure 28.
 
Figure 28. Bursty Traffic Experiments - AVG Polls per Second Comparison 
The internal latency per packet was measured as an average per packet for every 1.377 
million processed packets, or 400 times during a single experiment. The results are 
presented in figure 29. 
  
Figure 29. Bursty Traffic Experiments - Internal Latency per Packet Comparison 
Figure 28 data. Bursty Traffic Experiments (Polling Frequency Comparison); y-axis: AVG
polls / sec in millions (logarithmic).
                         L2FWD-DPDK    L2FWD-ADAPT
AVG Total Polls / sec    24.52 M       0.72 M
AVG Zero polls / sec     22.2793 M     0.2067 M
AVG Full polls / sec     2.2381 M      0.5155 M

[Figure 29 plot: Bursty Traffic Experiments, Internal Latency per Packet Comparison.
X-axis: experiment duration (0 s to 61.3 s). Y-axis: internal latency per packet
(0.00 µs to 7.00 µs). Series: L2FWD-DPDK, L2FWD-ADAPT.]

The overall application efficiency and corresponding packets retrieved per poll are 
presented in figure 30. 
 
Figure 30. Bursty Traffic Experiments - Application Efficiency Comparison  
 
5.2.3 Bursty Traffic Case Study Conclusions 
Figure 28 highlights the balanced performance of the L2FWD-Adaptive forwarding
application under ‘bursty’ traffic conditions. The polling frequency is kept below 1
million polls per second per core on average for the duration of the experiments. The
only performance penalty is a very small added latency per packet, shown in figure 29:
on average 251 ns per packet in this particular experiment.
The application efficiency increased from just over 9% with the standard forwarding
scheme to nearly 71.5% with the scheme employing the APM, as shown in figure 30.
The average number of packets retrieved per poll also increased from 4 to 17.
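
The efficiency percentages quoted in this section can be reproduced from the average poll counts reported in figure 28. The following is a minimal sketch, not the instrumentation used in the L2FWD applications; the function name is hypothetical, and it assumes that every poll is counted either as a zero poll or as a full poll, so that the total is their sum.

```python
def application_efficiency(full_polls, total_polls):
    """Return full polls as a percentage of total polls."""
    if total_polls <= 0:
        raise ValueError("total_polls must be positive")
    return 100.0 * full_polls / total_polls


# Average polls per second from figure 28, in millions: (zero polls, full polls).
FIGURE_28 = {
    "L2FWD-DPDK": (22.2793, 2.2381),
    "L2FWD-ADAPT": (0.2067, 0.5155),
}

for app, (zero, full) in FIGURE_28.items():
    total = zero + full  # assumption: total polls = zero polls + full polls
    print(f"{app}: {application_efficiency(full, total):.2f}%")
```

With these inputs the sketch yields 9.13% for L2FWD-DPDK and 71.38% for L2FWD-ADAPT, which agrees with the values shown in figure 30.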
 
 
  
Figure 30 data. Bursty Traffic Experiments, Application Efficiency and Packets per Poll
Comparison.
                 Application efficiency    AVG packets per poll
L2FWD-DPDK       9.13%                     4
L2FWD-ADAPT      71.38%                    17

Chapter 6  Case Study Two:  
IMIX and Random Packet Size Traffic 
 
This chapter continues the testing of the implemented APM design within the
L2FWD-DPDK-Adaptive program against the current L2FWD-DPDK sample application
supplied with the dpdk-stable-16.11.1 package. So far, all experiments have used
Ethernet packets of a constant 64-byte size. Tests with constant-size packets reveal
some of the forwarding application’s capabilities but differ from ‘real’ network
traffic conditions.
 6.1 IMIX Traffic Tests 
The Internet Mix (IMIX) is a simplified representation of typical Internet traffic
passing through network equipment, and it is usually based on statistical sampling
carried out on routers and switches. There are different IMIX traffic profiles, simple
and complete, which are used in industry to simulate real-world traffic patterns and
packet distributions. The ‘simple’ IMIX profile usually contains an aggregated flow of
up to 4 streams with different packet sizes and is used to test various aspects of
network performance and network configuration. A similar profile is used to test the
performance of the L2FWD-Adaptive application.
6.1.1 IMIX Traffic Experiments Description 
In this set of experiments, the TRex 2.26 traffic generator produces an IMIX flow of
packets containing 3 streams. Each stream contains only one predefined packet size, as
described in [51]. The mix ratio between the streams is outlined in table 7.
Table 7. IMIX Profile TRex 2.26
Packet size          Number of    Distribution    Bytes    Distribution
(incl. IP header)    packets      in packets               in bytes
64 bytes             7            53.84 %         448      11.95 %
594 bytes            5            38.46 %         2970     47.54 %
1518 bytes           1            7.69 %          1518     40.50 %
 
The size of the inter-packet gap depends on the chosen line utilization. To maintain
the same packet distribution at all line rates, the inter-stream gaps are kept
constant.
The IMIX profile is sent to the DuT at different line utilizations, starting at 10% and
increasing up to 95%, as shown in figure 31.
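
The total number of packets offered at each line utilization follows from the mix ratio in table 7. The sketch below is illustrative only: it assumes a 10 Gbps link, a 60-second run, and 20 bytes of per-frame wire overhead (8-byte preamble and start-of-frame delimiter plus the 12-byte minimum inter-frame gap). The function and constant names are hypothetical and are not taken from the TRex profile.

```python
# Frame sizes in bytes and packets per mix cycle, as listed in table 7.
IMIX_STREAMS = [(64, 7), (594, 5), (1518, 1)]

# Assumption: each frame also occupies 20 bytes on the wire
# (8-byte preamble/SFD plus 12-byte minimum inter-frame gap).
WIRE_OVERHEAD_BYTES = 20
LINK_BPS = 10e9  # assumed 10 Gbps link


def imix_packets_sent(utilization, duration_s=60.0):
    """Estimate how many IMIX packets fit on the link at a given utilization."""
    cycle_packets = sum(count for _, count in IMIX_STREAMS)
    cycle_wire_bytes = sum(
        (size + WIRE_OVERHEAD_BYTES) * count for size, count in IMIX_STREAMS
    )
    avg_wire_bits = 8 * cycle_wire_bytes / cycle_packets
    return utilization * LINK_BPS * duration_s / avg_wire_bits


for utilization in (0.10, 0.20, 0.40, 0.80, 0.95):
    millions = imix_packets_sent(utilization) / 1e6
    print(f"{utilization:.0%}: {millions:.2f} M packets")
```

Under these assumptions the estimate matches the packet totals plotted in figure 31 (18.76 M at 10% utilization up to 178.26 M at 95%).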
 
Figure 31. IMIX Traffic Profile 
 
As with previous experiments, the DuT forwards the incoming traffic with L2FWD-
Adaptive and L2FWD-DPDK from dpdk-stable-16.11.1 package. Each test is repeated a 
minimum of 10 times and results are recorded for analysis. In every run, the following 
performance parameters are measured and recorded: 
1) Average total polls per second for experiment duration 
2) Average internal latency per packet  
3) CPU headroom available  
4) Application efficiency 
 
6.1.2 IMIX Traffic Experiments Results 
The results obtained for AVG polling frequency are presented in figure 32. 
Figure 31 data. IMIX Traffic Profile (TRex 2.26); y-axis: amount of packets in millions,
stacked by packet size (64, 594 and 1518 bytes).
Line utilization of 10 Gbps    10%        20%        40%        80%         95%
Total pkts                     18.76 M    37.53 M    75.06 M    150.12 M    178.26 M

 
Figure 32. IMIX Traffic Experiments - AVG Polls per Second Comparison 
The internal packet latency was measured 400 times per experiment of 60 seconds, 
which is every 150ms. All of the results obtained during the experiments are graphed in 
figure 33. 
 
Figure 33. IMIX Traffic Experiments - Internal Latency per Packet Comparison 
Figure 32 data. IMIX Traffic Experiments, AVG Polling Frequency Comparison; y-axis: AVG
polls / sec in millions (logarithmic).
Line rate      1 Gbps     2 Gbps     4 Gbps     8 Gbps     10 Gbps
L2FWD-DPDK     44.80 M    43.00 M    39.40 M    32.05 M    29.47 M
L2FWD-ADAPT    0.08 M     0.15 M     0.33 M     0.77 M     0.95 M

[Figure 33 plot: IMIX Traffic Experiments (Internal Latency per Packet). X-axis: duration
of experiment (0 s to 60 s). Y-axis: internal latency per packet (4 µs to 64 µs,
logarithmic). Series: L2FWD-DPDK and L2FWD-ADAPT at 1, 2, 4, 8 and 9.95 Gbps.]

The Data Plane CPU headroom was measured with the L2FWD-Adaptive application. 
The results are presented in the graph of figure 34. 
 
Figure 34. CPU Headroom with IMIX Traffic 
Figure 34 data. IMIX Traffic Experiments, CPU Core Headroom at Different Line Rates.
Line rate            1 Gbps    2 Gbps    4 Gbps    8 Gbps    10 Gbps
CPU idle time (%)    98.52%    97.04%    94.06%    87.13%    84.33%

The overall application efficiency for the duration of these experiments is presented in
figure 35.

Figure 35 data. IMIX Traffic Experiments, Application Efficiency Comparison (full polls /
total polls, in %).
Line rate      1 Gbps    2 Gbps    4 Gbps    8 Gbps    10 Gbps
L2FWD-DPDK     0.42%     0.92%     2.02%     4.88%     6.05%
L2FWD-ADAPT    32.14%    42.30%    49.49%    56.38%    57.97%

Figure 35. IMIX Traffic Experiments - Application Efficiency Comparison 
6.1.3 IMIX Traffic Case Study Conclusions 
Figure 32 highlights the number of CPU core cycles that can be saved by using an
adaptive polling mechanism without any impact on the responsiveness of the DuT system.
During all IMIX traffic experiments, there was no increase in the internally added
latency per packet, as shown in figure 33.
Using streams with different packet sizes in the IMIX profile uncovers the unutilized
CPU core headroom available at different line rates. Figure 34 shows that in this
particular case the CPU core is engaged in Data Plane packet processing only about
15% of the time at the 10 Gbps line rate. At the lower line rates of 1 Gbps and 2 Gbps,
the CPU core spends no more than 3% of its run time processing packets, and the
rest of the time is spent on polling the NIC for incoming traffic.
The improvements achieved by the implementation of the APM are highlighted in
figure 35. The efficiency of the forwarding application (full polls as a share of total
polls) increases steadily with the line rate and reaches nearly 60% at high line rates.
6.2 Random Packet Size Traffic Experiments 
This is the last set of experiments. It tests the APM design with traffic profiles
containing Ethernet packets of random size, ranging from 64 bytes up to 1518 bytes.
The IPG is dependent on the line rate.
6.2.1 Random Packet Size Traffic Experiment Description 
The TRex 2.26 packet generator is used to generate a random packet size profile at
rates from 1 Gbps up to 9.5 Gbps. So that L2FWD-DPDK and L2FWD-Adaptive are tested
with the same profile, a pseudo-random number generator [52] is used to produce a
fixed number of packets with random sizes from the 5 ranges shown in table 8.
Table 8. Random Size Packet Ranges and Amounts
Packet size      Amount of packets at different line rates, in millions, for 60 seconds
range (bytes)    1 Gbps     2 Gbps      4 Gbps      8 Gbps      9.5 Gbps
65 - 127         0.39 M     0.78 M      1.56 M      3.12 M      3.71 M
128 - 255        0.83 M     1.67 M      3.33 M      6.65 M      7.90 M
256 - 511        1.66 M     3.33 M      6.66 M      13.31 M     15.81 M
512 - 1023       3.18 M     6.37 M      12.74 M     25.48 M     30.26 M
1024 - 1518      3.19 M     6.38 M      12.76 M     25.52 M     30.30 M
Total packets    9.26 M     18.52 M     37.04 M     74.09 M     87.98 M
 
Using a pseudo-random generator with the same ‘seed’ value guarantees that every run
of the traffic generator produces the same random sequence of packet sizes and the
same number of packets. The actual size of each packet within the ranges in table 8 is
unknown. The traffic profile contains only one continuous stream of packets. The
details of this profile are shown in figure 36.
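
The effect of a fixed seed can be illustrated with a short sketch. It uses Python's standard `random` module as a stand-in for the generator cited in [52], which is an assumption; the range weights are the 1 Gbps packet counts from table 8, and the function name and seed value are hypothetical.

```python
import random

# Size ranges from table 8, weighted by the 1 Gbps packet counts (in millions).
RANGES = [
    ((65, 127), 0.39),
    ((128, 255), 0.83),
    ((256, 511), 1.66),
    ((512, 1023), 3.18),
    ((1024, 1518), 3.19),
]


def packet_sizes(seed, count):
    """Return `count` pseudo-random packet sizes; same seed gives same sequence."""
    rng = random.Random(seed)  # private generator, independent of global state
    bounds = [b for b, _ in RANGES]
    weights = [w for _, w in RANGES]
    sizes = []
    for low, high in rng.choices(bounds, weights=weights, k=count):
        sizes.append(rng.randint(low, high))  # inclusive on both ends
    return sizes


first_run = packet_sizes(seed=2017, count=10)
second_run = packet_sizes(seed=2017, count=10)
print(first_run == second_run)  # identical sequences for identical seeds
```

Because both applications are driven by the same seeded sequence, any difference in the measured results can be attributed to the forwarding scheme rather than to the traffic.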
 
Figure 36. Random Packet Size Profile 
 
The DuT forwards the incoming traffic with L2FWD-DPDK and L2FWD-Adaptive 
applications. Each experiment is repeated a minimum of 10 times and results are 
gathered for further analysis. On each run certain parameters of interest are recorded: 
1) Average total polls per second for experiment duration are recorded at the DuT 
2) Average round-trip time per packet for every line rate is recorded at TRex 2.26 
3) CPU Headroom available is measured at the DuT 
4) Application efficiency 
 
6.2.2 Random Packet Size Traffic Experiment Results 
The results for AVG polling frequency comparison are presented in figure 37. 
[Figure 36 plot: Random Packet Size Traffic Profile. X-axis: line rate (1, 2, 4, 8 and
10 Gbps). Y-axis: amount of packets in millions (0 to 100 M), stacked by size range
(65 to 127, 128 to 255, 256 to 511, 512 to 1023 and 1024 to 1518 bytes). Total pkts:
9.26 M, 18.52 M, 37.04 M, 74.09 M, 87.98 M.]

 
Figure 37. Random Packet Size Experiments - AVG Polls per Second Comparison 
The average RTT was measured by adding a second stream of packets, the latency
stream, to the traffic profile in figure 36.
The latency stream was set to a constant rate of 1000 PPS with a packet size of
64 bytes. TRex 2.26 timestamps every latency stream packet as it is transmitted on port
0 TX and again as it is received back on port 1 RX. The difference between the two
timestamps is the round-trip time of a latency packet at the given line rate. The
results are presented in figure 38.
 
Figure 38. Random Packet Size Experiments – AVG RTT per Packet Comparison 
Figure 37 data. Random Packet Size Experiments, AVG Polling Frequency Comparison;
y-axis: AVG polls / sec in millions (logarithmic).
Line rate      1 Gbps     2 Gbps     4 Gbps     8 Gbps     10 Gbps
L2FWD-DPDK     88.26 M    85.71 M    81.21 M    70.39 M    66.78 M
L2FWD-ADAPT    0.05 M     0.14 M     0.30 M     0.75 M     0.91 M

Figure 38 data. Random Packet Size Experiments, AVG Round Trip Time Comparison.
Line rate      1 Gbps       2 Gbps       4 Gbps      8 Gbps      10 Gbps
L2FWD-DPDK     227.00 µs    129.50 µs    81.50 µs    57.50 µs    56.50 µs
L2FWD-ADAPT    244.50 µs    143.50 µs    86.50 µs    59.50 µs    58.00 µs

The forwarding CPU core headroom was measured using the L2FWD-Adaptive 
program and the results are shown in figure 39. 
 
Figure 39. CPU Headroom with Random Packet Size Profile 
As with previous experiments, the application’s efficiency was calculated as the
ratio between full polls and total polls per experiment. The results of these
calculations are graphed in figure 40, where the efficiency of the two schemes is
compared.
 
Figure 40. Random Packet Size Experiments - Application Efficiency Comparison 
Figure 39 data. Random Packet Size Experiments, CPU Core Headroom.
Line rate            1 Gbps    2 Gbps    4 Gbps    8 Gbps    10 Gbps
CPU idle time (%)    99.27%    98.23%    96.43%    91.82%    90.14%

Figure 40 data. Random Packet Size Experiments, Application Efficiency Comparison (full
polls / total polls, in %).
Line rate      1 Gbps    2 Gbps    4 Gbps    8 Gbps    10 Gbps
L2FWD-DPDK     0.15%     0.31%     0.66%     1.53%     1.91%
L2FWD-ADAPT    20.65%    37.32%    45.25%    51.58%    53.12%

6.2.3 Random Packet Size Traffic Case Study Conclusion 
Figure 37 highlights the vast difference between L2FWD-DPDK and L2FWD-Adaptive
in terms of AVG polls per second. The saving of CPU cycles achieved by the APM
comes at the price of an increased RTT per packet, which is noticeable at low line
rates, as shown in figure 38. The average increase in packet latency is 17.5 µs at
1 Gbps and gradually decreases to 1.5 µs when the line rate reaches 9.5 Gbps.
Figure 39 shows that a traffic profile with a constant flow of random size packets
and variable inter-packet gaps creates a situation where the forwarding CPU core
spends the majority of its runtime polling the NIC for incoming packets. The key
finding is that the APM design is able to increase application efficiency just by
reducing the polling frequency of the RX/TX loop, as figure 40 shows.
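
The latency penalty quoted above is the per-line-rate difference between the two RTT series of figure 38. The sketch below repeats that subtraction for every line rate and also expresses it relative to the L2FWD-DPDK baseline; the helper name is hypothetical and the line-rate labels are those printed on the figure axis.

```python
# Average RTT in microseconds from figure 38, ordered by the figure's
# line-rate labels (1, 2, 4, 8 and 10 on the axis).
LINE_RATE_LABELS = ["1", "2", "4", "8", "10"]
RTT_DPDK_US = [227.00, 129.50, 81.50, 57.50, 56.50]
RTT_ADAPT_US = [244.50, 143.50, 86.50, 59.50, 58.00]


def added_rtt_us(baseline, adaptive):
    """Return the extra round-trip time of the adaptive scheme per line rate."""
    return [a - b for b, a in zip(baseline, adaptive)]


penalties = added_rtt_us(RTT_DPDK_US, RTT_ADAPT_US)
for label, base, extra in zip(LINE_RATE_LABELS, RTT_DPDK_US, penalties):
    print(f"{label}: +{extra:.1f} us ({100.0 * extra / base:.1f}% of baseline)")
```

The absolute penalty falls from 17.5 µs to 1.5 µs as the offered load grows, which is consistent with the APM inserting shorter delays when traffic is heavier.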
 
  
Chapter 7  
Energy Consumption Investigations and Improvements 
 
This chapter presents some of the power demand and energy consumption 
measurements obtained during the experiments presented in chapter 5 and chapter 6. It 
also provides a glimpse of the environmental impact of the L2FWD-DPDK application 
compared with L2FWD-Adaptive program, which was developed for this research. 
7.1 Theory Discussion 
There are several factors contributing to the CPU’s energy consumption.
• Dynamic energy consumption ($E_{dyn}$). This occurs during normal operation and is
caused by switching the state of the transistors inside the logic gates.
• Energy consumed due to a short circuit ($E_{sc}$). This can occur when some
transistors within the CPU conduct simultaneously for an instant, so that a direct
path between source and ground is present.
• Energy consumed due to leakage ($E_{leak}$). This occurs at a micro-level within
the physics of the CPU transistors.
In summary, the total energy consumed by the CPU is $E_{CPU} = E_{dyn} + E_{sc} + E_{leak}$.
In chapters five and six, it was established that the APM, implemented in the
L2FWD-Adaptive program, reduces the polling frequency of the DPDK’s RX/TX loop.
In theory, when execution of a CPU thread is paused, the CPU core should consume
less energy since it is not doing any work.
The actual situation is not that simple, and the explanation requires detailed
knowledge of the CPU’s architecture and the Linux kernel driver associated with it.
Appendix II provides an insight into the power management of the Intel Haswell
architecture and takes a brief look at the intel_pstate driver.
7.1.1 Artificial Delay and Power Saving on Intel CPUs 
In this subsection, the implementation of the function that the APM design uses to
insert an artificial delay is discussed and explained.
As mentioned in chapter 4, the APM uses the rte_delay_us() function from DPDK’s API
to insert a predefined delay, in microseconds, within the RX loop of the
L2FWD-Adaptive application.
The rte_delay_us() function acts as a wrapper around the rte_pause() function and
accepts one argument, the number of microseconds to wait. The rte_pause()
description states that the function is intended for tight loops which poll a shared
resource or wait for an event, and that a short pause within a tight loop may reduce
the power demand.
Further investigation of the rte_pause() implementation shows that it is in fact the
well-known _mm_pause() intrinsic from emmintrin.h, a header shipped with the GCC
compiler. According to the Intel 64 and IA-32 Architectures Software Developer’s
Manual, Volume 2 [53], invoking _mm_pause() issues a PAUSE instruction on the Intel
x86_64 processor. The PAUSE instruction was introduced with Intel’s Streaming SIMD
Extensions 2 (Intel SSE2) [54] instruction set.
Essentially, the PAUSE instruction delays the execution of the next instruction for a
finite period of time and provides a hint to the IA-based processor that the code is a
spin-wait loop. The CPU uses this hint to avoid the memory order violation that would
otherwise be detected when the loop exits, which greatly improves its performance. An
additional benefit of using the PAUSE instruction is the reduction of the
instantaneous power demand of the processor while executing a spin loop. The energy
consumption investigation and improvements are based on the conclusion from the
theoretical discussion above.
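
The structure of a microsecond busy-wait delay can be sketched as follows. This is only a Python analogy of the loop shape, not DPDK code: `delay_us` is a hypothetical stand-in for rte_delay_us(), Python has no equivalent of the PAUSE instruction, and the place where DPDK would call rte_pause() is marked with a comment.

```python
import time


def delay_us(us):
    """Busy-wait for at least `us` microseconds and return the spin count.

    Hypothetical stand-in for a blocking microsecond delay; the real
    function spins on a hardware timer rather than on a Python clock.
    """
    deadline = time.perf_counter_ns() + us * 1000
    spins = 0
    while time.perf_counter_ns() < deadline:
        # A native implementation would execute the PAUSE instruction here
        # (rte_pause() in DPDK) to hint that this is a spin-wait loop.
        spins += 1
    return spins


start = time.perf_counter_ns()
delay_us(200)
elapsed_us = (time.perf_counter_ns() - start) / 1000
print(f"waited at least 200 us: {elapsed_us >= 200}")
```

The core never sleeps during such a delay; it keeps executing the loop, which is why any power reduction depends on how cheaply each iteration of the spin can be executed.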
 
7.1.2 Power and Energy Measurements 
In this subsection, some of the key power and energy measurements and calculations are 
presented and explained. 
CPU Core Energy Consumption per Packet 
Estimating the amount of energy consumed by the CPU core while forwarding packets
is performed by taking the average energy used by the CPU package over a certain
period of time and dividing it between the individual cores involved in data plane
packet forwarding. The standard unit of energy in electronic DC circuits is the
joule (J), which measures power dissipated over time. In a DC circuit, a voltage
source Vs that delivers a current I to the circuit components supplies the power
P = Vs * I, expressed in joules per second (J/s). The SI unit of power is the watt (W),
defined as 1 joule per second [55] and named after the steam engine developer James
Watt. The energy consumed by the ‘CPU core per packet’ is calculated with the formula:
$$E_{(\text{per pkt, core})} = \frac{\left(\dfrac{\text{Power (watts)} \times \text{Time (seconds)}}{\text{Packets}}\right)}{\text{DP CPU core(s)}} = \frac{\text{Joules}/\text{packet}}{\text{DP core}}$$
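
As a worked instance of the formula, the sketch below divides a total CPU energy figure by the number of forwarded packets and by the number of data plane cores. The function names are hypothetical. The energy totals are the bursty traffic values reported in figure 45; the packet count is an assumption derived from section 5.2.2 (400 latency samples of 1.377 million packets each), and a single data plane core is assumed.

```python
def energy_joules(avg_power_w, duration_s):
    """Energy is average power multiplied by time (1 W = 1 J/s)."""
    return avg_power_w * duration_s


def energy_per_packet_per_core(total_energy_j, packets, dp_cores=1):
    """Joules per forwarded packet, shared between the data plane cores."""
    if packets <= 0 or dp_cores <= 0:
        raise ValueError("packets and dp_cores must be positive")
    return total_energy_j / packets / dp_cores


# Assumption: packets forwarded in one bursty run = 400 samples x 1.377 M.
PACKETS = 400 * 1.377e6

# Total CPU energy per run in joules, as reported in figure 45.
TOTAL_ENERGY_J = {"L2FWD-DPDK": 2224.58, "L2FWD-ADAPT": 1976.48}

for app, joules in TOTAL_ENERGY_J.items():
    per_packet = energy_per_packet_per_core(joules, PACKETS)
    print(f"{app}: {per_packet:.2e} J per packet")
```

Under these assumptions the result is 4.04E-06 J per packet for L2FWD-DPDK and 3.59E-06 J per packet for L2FWD-ADAPT, in agreement with figure 45.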
 
Average Instantaneous System Power Demand 
This measurement is derived from the instantaneous current Iinst supplied to the
DuT system by a voltage source Vs. It represents the power, in watts, that is
dissipated by the system while performing its packet forwarding duties, and it is
calculated as Pinst = Vs * Iinst (Watts).
Multiplying the instantaneous power demand by a period of time gives the energy
consumed by the DuT system over that period.
Both CPU Core Energy Consumption per Packet and Average Instantaneous System
Power Demand are used later in this chapter to evaluate the performance and the
efficiency of the L2FWD-Adaptive application compared with the standard L2FWD-DPDK
sample program.
 
7.2 Description of Laboratory Equipment Setup 
To measure the total energy consumed by the DuT, some extra equipment was attached
to the laboratory setup shown in figure 23:
• A Watts Up Pro load meter and data logger instrument [56] was attached
between the wall socket and the DuT to measure the active energy consumption.
• A Yokogawa DL9040 oscilloscope [57] with a current probe was clamped to
the 12V rail that supplies the CPU socket with current.
 
The described laboratory setup is illustrated in figure 41.

While the power meter attached to the AC wall socket shows the total energy consumed
by the DuT, it is not able to provide a breakdown of each component’s power usage.
To monitor the power drawn by the CPU, the Yokogawa oscilloscope is used. Sampling
625 measurements per second, the oscilloscope measures with high precision the
current passing through the 8-pin 12V rail dedicated to the CPU socket. For further
verification of the results, the turbostat Linux utility [58], which is distributed with
the Linux kernel source, is used in debug mode to show the CPU package power demand.
The turbostat utility was developed by Len Brown, a principal engineer at the Intel
Open Source Technology Center.
All of the experiments presented in chapter 5 and chapter 6 were repeated with the
above laboratory setup. Only a single CPU core was used to forward network traffic.
 
Figure 41. Power Measurement Laboratory Setup 
[Figure 41 diagram: UL laboratory setup. Labelled elements: 240 VAC wall supply,
WattsUp Pro meter (AC), 550 W AC/DC PSU with 12 VDC, 5 VDC and 3.3 VDC outputs, a
12 VDC CPU-only rail, and the Yokogawa direct-current probe.]

7.2.1 Forwarding CPU Core Energy Consumption per Packet with CBR Traffic 
Profile 
In this section the energy per packet in Joules is estimated based on the measurements 
obtained by the turbostat Linux utility and the measurements monitored with the 
Yokogawa oscilloscope. The turbostat utility is invoked in debug mode to measure the 
average power demand of the CPU package for the duration of every experiment. The 
Yokogawa oscilloscope is used to perform a sanity check on the results obtained by the 
turbostat utility as displayed in figure 42 below. 
 
Figure 42. CPU Current Measurement with Yokogawa Oscilloscope
The measurements obtained with turbostat and with the oscilloscope agree closely, as
shown in table 9.
Table 9. Comparison between Turbostat and Oscilloscope Power Measurements
                  AVG Instantaneous Power in Watts
                  Turbostat    Yokogawa
L2FWD-DPDK        32.63 W      32.88 W
L2FWD-Adaptive    26.97 W      27.24 W
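
How closely the two instruments agree can be quantified with a short calculation on the values of table 9. The sketch below is illustrative; the helper name is hypothetical, and the conversion from power to current assumes that the probed CPU rail sits at exactly 12 V.

```python
RAIL_VOLTAGE_V = 12.0  # assumption: nominal voltage of the probed CPU rail

# Average instantaneous power in watts from table 9: (turbostat, oscilloscope).
TABLE_9 = {
    "L2FWD-DPDK": (32.63, 32.88),
    "L2FWD-Adaptive": (26.97, 27.24),
}


def relative_difference_pct(reference, other):
    """Difference of `other` from `reference`, as a percentage of `reference`."""
    return 100.0 * (other - reference) / reference


for app, (turbostat_w, scope_w) in TABLE_9.items():
    diff = relative_difference_pct(turbostat_w, scope_w)
    current_a = scope_w / RAIL_VOLTAGE_V  # I = P / V
    print(f"{app}: scope reads {diff:+.2f}% vs turbostat, about {current_a:.2f} A")
```

For both applications the two readings differ by roughly 1% or less, which supports using the turbostat figures for the energy-per-packet estimates.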
[Figure 42 plot: CPU Current Measurements with Yokogawa Oscilloscope at 1 Gbps CBR
Traffic. X-axis: experiment duration (0 s to 140 s). Y-axis: amount of current drawn by
the CPU (0.0 A to 4.0 A). Phases: Idle, L2FWD-DPDK, Idle_1, APM Not Active,
L2FWD-ADAPT, Idle_2.]

Figure 43 shows the comparison between L2FWD-DPDK and L2FWD-Adaptive in
terms of energy used per packet at the different line rates offered. The CPU energy
consumption per packet is calculated by multiplying the measured average power
demand by the duration of the experiment and dividing the resulting energy figure by
the number of packets that were forwarded by the DuT machine. The same procedure is
repeated for all line rates presented in figure 43.
 
 
Figure 43. CPU Energy Consumption per Packet - CBR Traffic 
 
7.2.2 DuT System Power Demand with CBR Traffic Profile 
The next measurement was obtained with the WattsUp Pro meter and concerns the
overall system power demand while forwarding packets at different line rates. Figure 44
presents the comparison between a DuT forwarding packets with L2FWD-DPDK and
with L2FWD-Adaptive. The WattsUp Pro meter is set to log the dynamic power drawn by
the system every second. The results presented in figure 44 are the overall average
power demand for the testing period of 60 seconds at different line rates.
[Figure 43 plot: CPU Energy (Joules) Consumption per Packet, less is better. X-axis: CBR
line rate with 64-byte packets (1, 2, 4, 8 and 10 Gbps). Y-axis: AVG joules per packet
(0.00E+00 to 3.00E-05). Series: L2FWD-DPDK, L2FWD-ADAPT.]

 
Figure 44. DuT System Power Demand - CBR Traffic 
 
7.2.3 Forwarding CPU Core Energy Consumption per Packet with Bursty Traffic 
Profile 
In this section the forwarding CPU core energy consumption per packet is presented.
The DuT forwards ‘bursty’ network traffic, generated with the TRex packet generator,
using L2FWD-DPDK and L2FWD-Adaptive. The comparison graph is presented in
figure 45.
 
Figure 45. CPU Energy Consumption per Packet - Bursty Traffic 
[Figure 44 plot: AVG DuT System Power Demand, less is better. X-axis: CBR line rate with
64-byte packets (1, 2, 4, 8 and 10 Gbps). Y-axis: AVG power demand (60 W to 84 W).
Series: L2FWD-DPDK, L2FWD-ADAPT.]

Figure 45 data. CPU Energy Consumption per Packet, bursty traffic (less is better).
                 Joules / pkt    Total energy consumed
L2FWD-DPDK       4.04E-06        2224.58 J
L2FWD-ADAPT      3.59E-06        1976.48 J

7.2.4 DuT System Dynamic Power Demand with Bursty Traffic Profile 
In this experiment the dynamic power drawn by the DuT is calculated as the difference
between the total active system power demand when forwarding packets and the
system’s idle power demand when there are no active processes running:
$P_{dyn} = P_{total} - P_{idle}$, where $P_{idle}$ was measured as 34.2 W.
Figure 46 shows the instantaneous power dissipated by the DuT while forwarding
‘bursty’ network traffic with L2FWD-DPDK and L2FWD-Adaptive respectively. The
arrowed area between the red and the green lines illustrates the power saving achieved
by the L2FWD-Adaptive application.
 
 
Figure 46. DuT System Dynamic Power Demand - Bursty Traffic 
 
7.2.5 Forwarding CPU Core Energy Consumption per Packet with IMIX Traffic 
Profile 
This section reveals the energy used by the CPU while forwarding IMIX network 
traffic. In figure 47 the results of this investigation are graphed. 
[Figure 46 plot: DuT Dynamic Power Demand. X-axis: experiment duration (1 s to 89 s).
Y-axis: AVG power demand (0 W to 50 W). Series: L2FWD-DPDK, L2FWD-ADAPT.]

 
Figure 47. CPU Energy Consumption per Packet - IMIX Traffic 
 
7.2.6 DuT System Dynamic Power Demand with IMIX Traffic Profile 
As in the previous section, the overall dynamic power dissipated by the DuT system is
measured with the WattsUp Pro meter and the results are displayed in figure 48.
 
Figure 48. DuT System Dynamic Power Demand - IMIX Traffic 
[Figure 47 plot: CPU Energy Consumption per Packet, less is better. X-axis: line rate with
IMIX profile (1, 2, 4, 8 and 10 Gbps). Y-axis: AVG joules per packet (0.00E+00 to
1.40E-04). Series: L2FWD-DPDK, L2FWD-ADAPT.]

Figure 48 data. AVG DuT System Dynamic Power Demand, line rate with IMIX packets.
Line rate      1 Gbps     2 Gbps     4 Gbps     8 Gbps     10 Gbps
L2FWD-DPDK     32.78 W    35.11 W    36.03 W    34.30 W    37.61 W
L2FWD-ADAPT    28.64 W    30.52 W    30.00 W    33.42 W    32.26 W

7.2.7 Forwarding CPU Core Energy Consumption per Packet with Random 
Packet Size  
In this last section of the power and energy investigations, the performance of
L2FWD-DPDK is compared with that of the L2FWD-Adaptive application when random
packet sizes are used in a traffic profile generated by the TRex 2.26 packet
generator. The average energy cost per packet is calculated by dividing the total
energy consumed by the CPU package over the duration of the experiment by the number
of packets forwarded. The results are presented in figure 49.
 
Figure 49. CPU Energy Consumption - Random Packet Size Profile 
 
 
7.2.8 DuT System Dynamic Power Demand with Random Packet Size Traffic 
Profile 
Figure 50 represents the comparison between the average dynamic power usage 
associated with L2FWD-DPDK and L2FWD-Adaptive when the DuT forwards network 
packets with random sizes.  
 
Figure 49 data. CPU Energy Consumption per Packet, random packet size profile (AVG
joules per packet, less is better).
Line rate      1 Gbps      2 Gbps      4 Gbps      8 Gbps      10 Gbps
L2FWD-DPDK     2.63E-04    1.20E-04    6.00E-05    3.13E-05    2.50E-05
L2FWD-ADAPT    2.05E-04    1.08E-04    5.21E-05    2.45E-05    2.09E-05

 
Figure 50. AVG DuT System Power Demand - Random Packet Size Profile
Line rate      1 Gbps     2 Gbps     4 Gbps     8 Gbps     10 Gbps
L2FWD-DPDK     35.78 W    35.53 W    35.95 W    36.55 W    35.99 W
L2FWD-ADAPT    29.32 W    30.86 W    30.97 W    32.33 W    32.23 W

7.3 Chapter Summary
In this chapter, the DP CPU core energy consumption measurements were presented
along with the instantaneous power usage of the DuT system.
The environmental impact of the APM within the DPDK’s RX/TX loop was measured
and compared with that of the standard RX/TX mechanism that is part of the DPDK
framework.

7.4 Chapter Conclusions
Figures 43, 45, 47 and 49 showed that the average energy consumed by the CPU core
per forwarded packet is always lower when L2FWD-Adaptive is running, evaluated
for four different types of network traffic profile: CBR, Bursty, IMIX and Random.
These four traffic scenarios were evaluated at five different offered line rates:
1 Gbps, 2 Gbps, 4 Gbps, 8 Gbps and 10 Gbps. These are encouraging results, which show
that the APM design has met one of its targets, i.e. to decrease the energy usage and
heat dissipation of the CPU core that is pinned to a DPDK thread.
Figures 44, 46, 48 and 50 provide evidence of the decreased instantaneous dynamic
power demand of the DuT system when it forwards packets using the L2FWD-Adaptive
application. Table 10 shows a summary of the average dynamic power saving that can be
achieved across five line rates and four different traffic profiles with the
implementation of the APM, showing a minimum instantaneous power saving of 0.75 watts
and a maximum of 6.51 watts in the experiments.
Table 10. Dynamic Power Saving
Traffic    AVG instantaneous power demand saving at each line rate offered    Estimated AVG energy saving per year
profile    1 Gbps    2 Gbps    4 Gbps    8 Gbps    10 Gbps                    (based on continuous usage)
CBR        3.16 W    1.90 W    0.75 W    1.37 W    4.56 W                     20.56 kWh
IMIX       4.14 W    4.59 W    6.03 W    0.88 W    5.35 W                     36.77 kWh
Random     6.46 W    4.67 W    4.98 W    4.22 W    3.76 W                     42.20 kWh
Bursty     6.51 W for the duration of the experiment                          57.02 kWh
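
The yearly figures in table 10 can be related to the measured power savings with a short calculation. The sketch below takes the random packet size power values from figure 50, derives the per-line-rate savings, averages them, and scales the result to one year of continuous operation. Two points are assumptions rather than statements from the text: a year is taken as 365 days, and the result is truncated (not rounded) to two decimals, since that convention reproduces the published values. All function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year of continuous usage

# Average system power demand in watts from figure 50 (random packet sizes).
DPDK_W = [35.78, 35.53, 35.95, 36.55, 35.99]
ADAPT_W = [29.32, 30.86, 30.97, 32.33, 32.23]


def power_savings_w(baseline_w, adaptive_w):
    """Per-line-rate saving in watts, rounded to the table's two decimals."""
    return [round(b - a, 2) for b, a in zip(baseline_w, adaptive_w)]


def annual_saving_kwh(savings_w):
    """Mean saving in watts scaled to kilowatt-hours per year."""
    mean_w = sum(savings_w) / len(savings_w)
    return mean_w * HOURS_PER_YEAR / 1000.0


def truncate(value, decimals=2):
    """Drop digits beyond `decimals` without rounding up (an assumed convention)."""
    factor = 10 ** decimals
    return math.floor(value * factor) / factor


random_savings = power_savings_w(DPDK_W, ADAPT_W)
print(random_savings)
print(f"{truncate(annual_saving_kwh(random_savings)):.2f} kWh per year")
```

The derived savings match the Random row of table 10, and the same scaling applied to the single 6.51 W bursty saving gives the 57.02 kWh reported there.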
 
  
Chapter 8       Summary, Conclusions and Future Work 
 
This chapter provides an overall summary of the completed research work. It draws 
conclusions based on the theory presented and the data results obtained during the case 
study experiments. The chapter also presents future work recommendations and 
suggestions. 
8.1 Thesis Summary 
Chapter 1 provided an introduction to the background of the underlying problem, which 
relates to the system inefficiencies due to the high polling frequency of the DPDK’s 
RX/TX loop. The main objective of the research investigations is to improve the 
efficiency of the scheme. Chapter 2 provided a detailed overview of the key underlying
concepts, such as SDN (software defined networking) and NFV (network function
virtualisation), which are at the core of the DPDK’s conceptual development. It also
explained the DPDK’s components and operations, and discussed the impact that the
strictly polling nature of the DPDK framework has on the overall system power
demand. In chapter 3 the full experimental setup
was described. This includes the hardware equipment as well as the software 
configurations of the two machines used for all experiments. The next chapter, chapter 
4, presented the background to the design of the Adaptive Polling mechanism, along 
with the various initial proposals and their evaluations; and concluded with the 
presentation of the specific adaptive algorithm that was finally chosen. Chapter 5 and 
chapter 6 described the test and evaluation of the chosen APM scheme, which was 
implemented within the L2FWD-Adaptive application, and this was compared with the 
standard L2FWD-DPDK sample application, which is distributed with the DPDK 
framework. Chapter 7 revealed the environmental impact associated with high speed 
packet processing, due to energy usage. It also showed the improvements that can be 
achieved by implementing the APM and how the overall instantaneous system power 
demand and CPU energy consumption can be reduced by using a software solution, 
based on the adaptive polling algorithm. 
8.2 Discussion of Results 
The results from Case Study One, “CBR and ‘bursty’ traffic forwarding”, showed that 
implementing the APM within the DPDK’s RX/TX loop allows a much lower polling 
frequency to be sustained, as compared with the standard RX/TX loop supplied with the 
DPDK’s distribution release. 
The L2FWD-Adaptive application efficiency, see figure 26, is over 50% higher for all 
CBR line rates, as compared to the L2FWD-DPDK. The tests with the ‘bursty’ traffic 
profile showed that the L2FWD-Adaptive is over 60% more efficient, see figure 30, as 
compared to the standard L2FWD-DPDK. 
Figure 25, “CPU Headroom with CBR Traffic” shows that even at a maximum line rate 
of 10 Gbps, the Data Plane’s CPU core spends only 50% of its runtime for actual packet 
processing. The rest of the time, the DPDK’s busy-wait loop execution is paused by the 
APM and the CPU core is prevented from excessive polling. 
The penalty for maintaining low polling frequency levels with the ‘bursty’ traffic profile 
is the internally added ‘latency per packet’ of around 250 ns, measured in the APM 
implementation. 
In Case Study Two, “IMIX and Random Packet Size Traffic”, the increase in 
application efficiency is shown to be up to 50% for the IMIX experiments, see figure 
35, and around 51% for the random packet size experiments, as shown in figure 40. 
The results obtained during the IMIX traffic tests did not show any increase of the 
internally added latency per packet. However, the tests with the random packet size 
traffic revealed an increase of the RTT per packet of 17.5µs at 1 Gbps, reducing to 
1.5µs at a 9 Gbps line rate. 
Both use cases, IMIX and Random, reveal that the Data Plane’s CPU core headroom 
remains extremely high, at over 97%, at low line rates of 1 Gbps and 2 Gbps. Even at a 
maximum line rate of 10 Gbps, the lcore forwards packets for only 16% of its runtime 
with the IMIX traffic profile, see figure 34, and 10% of its runtime with the random 
packet size traffic profile, see figure 39. 
The energy consumption investigation, which was conducted in parallel with the two 
case studies, showed that there is a difference in the average energy consumed per 
packet by the forwarding CPU core for both of the forwarding applications, L2FWD-
DPDK and L2FWD-Adaptive. From figures 43, 45, 47 and 49 it is clear that the energy 
per packet consumption is lower when the L2FWD-Adaptive is forwarding network 
packets at different line rates. The overall system power demand is also decreased when 
the APM is employed within the RX/TX loop. A summary of the average dynamic 
power savings achieved by the L2FWD-Adaptive in all experiments is shown in table 
10. The spread of results in table 10 suggests that the APM design implementation is 
able to reduce the average system energy usage across a range of line rates and 
incoming traffic dynamics. The estimated minimum average energy savings are around 
20.56 kWh per year for the CBR traffic profile, and the maximum average savings per 
year are estimated at around 57 kWh for the ‘bursty’ traffic profile. 
8.3 Conclusion 
In this research work, an Adaptive Polling Mechanism (APM) was developed, evaluated 
and implemented within the receiving (RX) loop of the DPDK’s L2FWD application. 
The main focus of the developed APM was to address the problem of the excessive 
polling nature of the DPDK’s RX/TX loop. 
There were significant advances made in the understanding of the high-speed packet 
processing behaviour in user space and in the inner workings of the DPDK framework. 
Practical case studies to test the APM design were developed and executed, and the 
results from them were collected and analysed. 
The main findings of this research work are: 
• The implemented adaptive polling mechanism, within the DPDK’s receiving 
loop, reduces the polling frequency of the CPU core that is pinned to a DPDK 
thread. The reduction of polling frequency is very significant at low line rates of 
1 Gbps, where it was shown to be in the range of 34 to 87 million polls per 
second. At peak line rates of 10 Gbps, the reduction is in the range of 12 to 65 
million polls per second. In percentage terms, the reduction at 1 Gbps is between 
97% and 99.94%, and at 10 Gbps the decrease in total polls per second is 
between 90.82% and 99.28%. 
• The widening of the polling interval achieved by the APM leads to fewer CPU 
cycles being used to process a single network packet, which yields an increased 
application efficiency of up to 60%. 
• It was observed and measured that using the APM within the DPDK’s RX/TX 
loop reduces the energy consumption of the Data Plane forwarding core(s) per 
processed Ethernet packet and also helps to cut the overall instantaneous 
system power demand by up to circa 6.5W at low line rates. This improvement 
can reduce the environmental impact of a high-speed packet processing 
framework such as the DPDK. 
The reduction in energy costs per single packet forwarding system within a Data Center 
may allow service providers to increase the number of CPU cores per server, without 
increasing the running costs.  
8.3.1 Limitations 
The implemented APM was developed on a machine equipped with an entry level Intel 
Xeon E3-1285 v3 CPU, codenamed Haswell, which is based on x86_64 architecture. 
The APM may perform differently on other generations of Intel CPUs, 
due to the different microarchitecture extensions featured in the higher-end processors. 
Further testing will be required to evaluate the performance of the developed adaptive 
algorithm on such processor architectures. The overall performance of the APM may 
also be affected if a DPDK version release other than dpdk-stable-16.1.1 is used. The 
compatibility with future DPDK releases is not guaranteed even though the structure of 
the RX loop has remained the same for the past two years. 
8.4 Future Work 
The ability of the implemented APM to reduce the DPDK’s lcore polling frequency, 
and consequently to contribute to the decrease of the overall energy usage of the whole 
packet forwarding system, may be considered as evidence to encourage further 
development of a suitable base model. 
There is room for improvement within the presented APM design. One of the ways 
forward may be adjusting the source code to make use of the newly developed DPDK 
libraries, which can provide detailed metrics of the incoming network traffic in real time 
without introducing computation overhead on the Data Plane’s CPU core. This may 
enable the existing algorithm to provide better application efficiency in terms of 
network packets retrieved per poll. 
Alternatively a separate DPDK thread can be used to monitor the incoming traffic 
dynamics and to switch to interrupt mode instead of polling mode when there are no 
packets received for a certain period of time. Another approach may explore the 
combination between an artificially induced delay and an active CPU core frequency 
scaling.  
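
The idle-timeout idea suggested above can be sketched as a small decision routine. This is only an illustration of the switching logic, not part of the implemented APM: the class name, the `observe` method and the 100 µs timeout are hypothetical, and no DPDK calls are modelled.

```python
POLLING, INTERRUPT = "polling", "interrupt"


class ModeSwitcher:
    """Chooses between polling and interrupt mode from observed traffic."""

    def __init__(self, idle_timeout_us):
        self.idle_timeout_us = idle_timeout_us  # hypothetical tuning knob
        self.mode = POLLING
        self.last_rx_us = 0  # timestamp of the last non-empty sample

    def observe(self, now_us, packets_received):
        """Record one monitoring sample and return the mode to use next."""
        if packets_received > 0:
            # Traffic is present: stay in (or return to) polling mode.
            self.last_rx_us = now_us
            self.mode = POLLING
        elif now_us - self.last_rx_us >= self.idle_timeout_us:
            # Nothing received for the whole timeout: fall back to interrupts.
            self.mode = INTERRUPT
        return self.mode


switcher = ModeSwitcher(idle_timeout_us=100)
samples = [(0, 32), (50, 0), (100, 0), (150, 0), (160, 8)]  # (time, packets)
modes = [switcher.observe(t, n) for t, n in samples]
print(modes)
```

In this sketch the monitor falls back to interrupt mode once the link has been idle for the full timeout and returns to polling as soon as a packet is seen. A real design would also need to account for the wake-up latency of interrupt mode.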
Future work could study the feasibility and advantages of the suggestions made above, 
and provide recommendations to service providers for further advances in the APM 
development. 
 
 
 
 
 
 
 
  
Bibliography 
 
[1] P. Emmerich, S. Gallenmüller, D. Raumer, F. Wohlfart, and G. Carle, 
"MoonGen: a scriptable high-speed packet generator," in Proceedings of the 
2015 ACM Conference on Internet Measurement Conference, 2015, pp. 275-
287. 
[2] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, et al., 
"RouteBricks: exploiting parallelism to scale software routers," presented at the 
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems 
principles, Big Sky, Montana, USA, 2009. 
[3] DPDK, "Data plane development kit," ed, 2014. 
[4] DPDK. (2016, 24/04/2017). DPDK Programmers Guide. Available: 
http://dpdk.org/doc/guides/prog_guide/index.html 
[5] N. Egi, M. Dobrescu, J. Du, K. Argyraki, B.-G. Chun, K. Fall, et al., 
"Understanding the Packet Processing Capability of Multi-Core Servers," Intel 
Technical Report, 2009. 
[6] DPDK. (24/04/2017). DPDK Sample Applications User Guides. Available: 
http://dpdk.org/doc/guides/sample_app_ug/index.html 
[7] T. Barbette, C. Soldani, and L. Mathy, "Fast userspace packet processing," in 
Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for 
networking and communications systems, 2015, pp. 5-16. 
[8] S. Gallenmüller, P. Emmerich, F. Wohlfart, D. Raumer, and G. Carle, 
"Comparison of frameworks for high-performance packet IO," in Architectures 
for Networking and Communications Systems (ANCS), 2015 ACM/IEEE 
Symposium on, 2015, pp. 29-38. 
[9] P. Emmerich, D. Raumer, F. Wohlfart, and G. Carle, "Assessing soft-and 
hardware bottlenecks in PC-based packet forwarding systems," ICN 2015, p. 90, 
2015. 
[10] DPDK. DPDK Sample Applications User Guide, L2FWD. Available: 
http://dpdk.org/doc/guides/sample_app_ug/l2_forward_real_virtual.html 
[11] (2017, 20/04/207). Intel Xeon E3-1285 v3. Available: 
http://www.intel.co.uk/content/www/uk/en/products/processors/xeon/e3-
processors/e3-1285-v3.html 
[12] P. Hammarlund. (20/04/2017). 4th Generation Intel® Core™ Processor, 
codenamed Haswell. Available: https://www.hotchips.org/wp-
content/uploads/hc_archives/hc25/HC25.80-Processors2-epub/HC25.27.820-
Haswell-Hammarlund-Intel.pdf 
[13] (2017). Intel P-State driver. Available: 
https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt 
[14] H. Bidgoli, The Handbook of Computer Networks: Wiley Publishing, 2007. 
[15] C. Hanoh Haim. (2017, 24/04/2017). TRex: Realistic Traffic Generator. 
Available: https://github.com/cisco-system-traffic-generator 
[16] D. Kreutz, F. M. Ramos, P. E. Verissimo, C. E. Rothenberg, S. Azodolmolky, 
and S. Uhlig, "Software-defined networking: A comprehensive survey," 
Proceedings of the IEEE, vol. 103, pp. 14-76, 2015. 
[17] ETSI, "Network Functions Virtualisation (NFV)," Dusseldorf-Germany, 2014. 
[18] O. N. Foundation. (2017, 25 April). Software-Defined Networking (SDN) 
Definition. Available: https://www.opennetworking.org/sdn-resources/sdn-
definition 
 75 
  
[19] M. Condoluci, T. Mahmoodi, and G. Araniti, "Software Defined Networking 
(SDN) and Network Function Virtualization (NFV) for C-RAN systems." 
[20] ETSI, "NFV group," GS NFV, vol. 3. 
[21] L. Rizzo, "Netmap: a novel framework for fast packet I/O," in 21st USENIX 
Security Symposium (USENIX Security 12), 2012, pp. 101-112. 
[22] NTOP. (2017, 24/04/2017). PF_RING. Available: 
http://www.ntop.org/products/packet-capture/pf_ring/ 
[23] S. Han, K. Jang, K. Park, and S. Moon, "PacketShader: a GPU-accelerated 
software router," in ACM SIGCOMM Computer Communication Review, 2010, 
pp. 195-206. 
[24] L. K. Contributors. packet_mmap. Available: 
https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt 
[25] W. Stallings and M. M. Manna, Data and computer communications vol. 6: 
Prentice hall Englewood Cliffs, NJ, 1997. 
[26] B. Sinharoy, J. Van Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. 
Nguyen, et al., "IBM POWER8 processor core microarchitecture," IBM Journal 
of Research and Development, vol. 59, pp. 2: 1-2: 21, 2015. 
[27] Cisco. (18/04/2017). Cisco. Available: 
http://www.cisco.com/c/en_uk/index.html 
[28] (18/04/2017). Mellanox. Available: 
http://www.mellanox.com/page/ethernet_cards_overview?ssn=s00spi73kko7v3d
rdjpfr6vbg6 
[29] (18/04/2017). Broadcom. Available: 
https://www.broadcom.com/products/ethernet-
connectivity/?technology%5B%5D=88 
[30] C. Communications. Chelsio. Available: http://www.chelsio.com/category/nic/ 
[31] K. contributors. KVM Hypervisor. Available: https://www.linux-
kvm.org/index.php?title=Main_Page&oldid=173792 
[32] VMware. vSphere Hypervisor. Available: 
http://www.vmware.com/products/vsphere-hypervisor.html 
[33] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, et al., "Xen and 
the art of virtualization," presented at the Proceedings of the nineteenth ACM 
symposium on Operating systems principles, Bolton Landing, NY, USA, 2003. 
[34] (10 April 2017). The Linux Foundation. Available: 
https://www.linuxfoundation.org/ 
[35] J.-P. Orsini. Psensor. Available: https://wpitchoune.net/psensor/ 
[36] J. Lienig, "Electromigration and its impact on physical design in future 
technologies," in Proceedings of the 2013 ACM international symposium on 
International symposium on physical design, 2013, pp. 33-40. 
[37] R. Goering. (2014, 27/04/2017). Electromigration – What IC Designers Need to 
Know. Available: 
https://community.cadence.com/cadence_blogs_8/b/ii/archive/2014/09/01/electr
omigration-what-ic-designers-need-to-know 
[38] Intel. Z77 Chipset. Available: https://ark.intel.com/products/64024/Intel-Z77-
Express-Chipset 
[39] Intel. Core i7-3770K Processor. Available: 
https://ark.intel.com/products/65523/Intel-Core-i7-3770K-Processor-8M-Cache-
up-to-3_90-GHz 
[40] F. Project. Fedora 22 Release Notes. Available: 
https://docs.fedoraproject.org/en-US/Fedora/22/html/Release_Notes/index.html 
[41] Intel. Z87 Chipset. Available: http://ark.intel.com/products/75013/Intel-Z87-
Chipset 
 76 
  
[42] Fedora. Fedora 24. Available: https://docs.fedoraproject.org/en-
US/Fedora/24/html/Release_Notes/index.html 
[43] R. Ierusalimschy, L. H. De Figueiredo, and W. Celes Filho, "Lua-an extensible 
extension language," Softw., Pract. Exper., vol. 26, pp. 635-652, 1996. 
[44] H. Haim. (2016). TRex Config Guide. Available: https://trex-
tgn.cisco.com/trex/doc/trex_config_guide.html#slide-2 
[45] C. Hanoh Haim. TRex FAQ. Available: https://trex-
tgn.cisco.com/trex/doc/trex_faq.html 
[46] R. M. Grow, "IEEE 802.3 Ethernet Working Group," 2007. 
[47] S. Bradner and J. McQuaid, "RFC 2544," Benchmarking methodology for 
network interconnect devices, 1999. 
[48] P. PCI-SIG, "Express Base Specification Revision 2.0," ed: December, 2006. 
[49] DPDK. DPDK Sample Application User Guide, L3FWD-POWER. Available: 
http://dpdk.org/doc/guides/sample_app_ug/l3_forward_power_man.html 
[50] DPDK. DPDK Sample Application User Guide, Skeleton. Available: 
http://dpdk.org/doc/guides/sample_app_ug/skeleton.html 
[51] A. Morton, "IMIX Genome: Specification of Variable Packet Sizes for 
Additional Testing," 2013. 
[52] M. Matsumoto and T. Nishimura, "Mersenne twister: a 623-dimensionally 
equidistributed uniform pseudo-random number generator," ACM Trans. Model. 
Comput. Simul., vol. 8, pp. 3-30, 1998. 
[53] Intel. (2016, 15/08/2017). Intel 64 and IA-32 Architectures Software 
Developer’s Manual Volume 2 2. Available: 
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-
ia-32-architectures-software-developer-instruction-set-reference-manual-
325383.pdf 
[54] Intel. (2017). Intel Streaming SIMD Extensions Technology. Available: 
https://www.intel.com/content/www/us/en/support/processors/000005779.html 
[55] B. Taylor, Guide for the Use of the International System of Units (SI): The 
metric system: DIANE Publishing, 1995. 
[56] W. P. Meters. (03/07/2017). Watts Up Pro Plug Load Meter. Available: 
https://www.wattsupmeters.com/secure/products.php?pn=0&wai=0 
[57] Ykogawa. (2012, 03/07). Ykogawa Osciloscope DL9000. Available: 
http://tmi.yokogawa.com/discontinued-products/oscilloscopes/digital-and-
mixed-signal-oscilloscopes/dl9000-dso-series/ 
[58] L. Brown. Turbostat. Available: https://access.redhat.com/documentation/en-
US/Red_Hat_Enterprise_Linux/7/html/Performance_Tuning_Guide/sect-
Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-
Performance_Monitoring_Tools-turbostat.html 
 
 1 
  
Appendix I General Linux Kernel Networking 
 
This section aims to explain the operation of the Linux kernel networking stack and its 
limitations. 
In all current distributions of the Linux OS (Operating System), running kernel 2.4.20 
and later, the processing of incoming and outgoing network packets is handled by the 
Linux kernel and more specifically by the New Application Programming Interface 
(NAPI). NAPI [1] is just an abstraction of the functionalities that the Linux OS 
provides; exported as an Application Programming Interface (API).  
Prior to NAPI’s existence, a NIC would generate an Interrupt Request (IRQ) for every 
received packet to indicate that data is available for processing by the kernel. The 
packet would then be attached to a descriptor in the NIC’s Receive (RX) queue. The RX 
queues are usually implemented as rings [2]. The packet descriptor [3] contains 
information about the address in memory to which the incoming packet will be copied, 
using Direct Memory Access (DMA) transfer. Transmitting a packet involves copying 
in the opposite direction and raising an interrupt to notify the Central Processing Unit 
(CPU) core that the transfer has been completed and new packets can be sent. However, 
this approach can quickly overwhelm system resources and stall the CPU core, thus 
preventing it from doing actual processing. 
Figure I - 1 below provides a visual representation of the scheme outlined above. 
 
 
 
 
 
 
 
 
 
[Diagram elements: Physical Link; Intel 82599 NIC; RX Ring; Main Memory Region; Kernel Packet Buffer; User Application. Labelled steps: IRQ; memcpy() (copy); netif_rx() push in buffer; pop from buffer.]
Figure I - 1. Linux Network Receive Scheme prior to NAPI.
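
The ring organisation of the RX queue mentioned above can be illustrated with a minimal circular buffer. This is a simplified sketch: the class and method names are hypothetical, and a real descriptor ring holds DMA addresses and status bits rather than the frames themselves.

```python
class RxRing:
    """Fixed-size circular queue standing in for a NIC RX descriptor ring."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_idx = 0   # next slot the (simulated) NIC fills
        self.read_idx = 0    # next slot the (simulated) driver drains
        self.count = 0       # slots currently holding a frame
        self.dropped = 0     # frames lost because the ring was full

    def nic_receive(self, frame):
        """Place a frame in the ring; drop it if no slot is free."""
        if self.count == len(self.slots):
            self.dropped += 1
            return False
        self.slots[self.write_idx] = frame
        self.write_idx = (self.write_idx + 1) % len(self.slots)
        self.count += 1
        return True

    def driver_fetch(self):
        """Remove and return the oldest frame, or None if the ring is empty."""
        if self.count == 0:
            return None
        frame = self.slots[self.read_idx]
        self.slots[self.read_idx] = None
        self.read_idx = (self.read_idx + 1) % len(self.slots)
        self.count -= 1
        return frame


ring = RxRing(size=4)
accepted = [ring.nic_receive(f"pkt{i}") for i in range(6)]  # last 2 overflow
first = ring.driver_fetch()      # frees one slot
ring.nic_receive("pkt6")         # wraps around into the freed slot
remaining = [ring.driver_fetch() for _ in range(4)]
print(accepted, ring.dropped, first, remaining)
```

The example shows the two properties that matter for the discussion: indices wrap around the fixed array, and frames are lost when the consumer does not drain the ring quickly enough.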
 
With the introduction of NAPI, the device driver can register a poll() function after 
the first hardware (HW) IRQ; this disables further interrupts and checks whether 
packets are available. The poll() function then copies and enqueues the available 
frames into the network stack, and for each packet a function is called to start the 
processing through the network stack [4]. If no more packets are available, the same 
poll() function re-schedules itself to be called and executed in the very near future, 
without an interrupt being raised. If there are no packets present in the RX queue for 
a certain period of time, the system returns to the interrupt driven model, so as to 
prevent the CPU from continuously executing the poll() function. 
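
The interrupt-then-poll pattern can be illustrated with a toy model. It is not kernel code: the class name, the `budget` parameter and its value are assumptions made for illustration, and the model re-enables interrupts as soon as a poll drains the queue, which simplifies the timeout-based return to interrupt mode described above.

```python
from collections import deque


class NapiLikeReceiver:
    """Toy model of an interrupt-then-poll receive path (not kernel code)."""

    def __init__(self, budget=4):
        self.rx_ring = deque()       # packets waiting in the RX ring
        self.irq_enabled = True      # device may raise an interrupt
        self.poll_scheduled = False  # a poll() call is pending
        self.budget = budget         # max packets handled per poll() call
        self.irq_count = 0
        self.delivered = []

    def packet_arrives(self, pkt):
        self.rx_ring.append(pkt)
        if self.irq_enabled:
            # First packet after an idle period: take one interrupt,
            # then mask further interrupts and schedule polling.
            self.irq_count += 1
            self.irq_enabled = False
            self.poll_scheduled = True

    def poll(self):
        """Process up to `budget` packets; return how many were handled."""
        if not self.poll_scheduled:
            return 0
        done = 0
        while self.rx_ring and done < self.budget:
            self.delivered.append(self.rx_ring.popleft())
            done += 1
        if not self.rx_ring:
            # Queue drained: stop polling and re-enable interrupts.
            self.poll_scheduled = False
            self.irq_enabled = True
        return done


rx = NapiLikeReceiver(budget=4)
for i in range(10):              # a burst of 10 packets
    rx.packet_arrives(i)
handled = []
while rx.poll_scheduled:
    handled.append(rx.poll())
print(rx.irq_count, handled)
```

In this model a burst of ten packets costs a single interrupt followed by three poll calls, whereas the pre-NAPI scheme would have raised one interrupt per packet.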
Another improvement that comes with NAPI-aware device drivers is the packet 
throttling [3] feature. To prevent the system from becoming overwhelmed whenever a 
high-speed traffic burst occurs, the NAPI-compliant driver will drop packets at the 
network adapter by using flow-control mechanisms, thus preventing the CPU core from 
doing unnecessary work. The NAPI operation is presented in Figure I - 2. 
 
 
 
 
 
 
 
 
 
 
 
 
 
[Diagram elements: Physical Link; Intel 82599 NIC; RX Ring; Main Memory Region; Kernel Packet Buffer; User Application. Labelled steps: memcpy() (copy) push in buffer; napi_schedule(); memcpy() (copy) push in buffer; pop from buffer.]
Figure I - 2. Linux Network New API Receive Scheme
The Linux network stack was designed for general purpose networking and handles line 
speeds of up to 1 Gbps; however, it reaches its processing limit as the line rate 
approaches 10 Gbps, at which point the OS starts dropping packets rapidly. This is a 
problem because the overall scheme also needs to satisfy any application by maintaining 
a balance between performance, power saving, functionality and usability. 
The NAPI-compliant driver needs to perform well both at high packet rates and in low 
line utilization scenarios. Nowadays, there are different high-speed processing schemes 
that use modified drivers, bypassing the Linux kernel and outperforming the standard 
Linux networking scheme in terms of maximum packet throughput rate, latency and 
system efficiency. 
 
 
 
 
 
References 
 
 
 
[1] T. Barbette, C. Soldani, and L. Mathy, "Fast userspace packet processing," in 
Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for 
networking and communications systems, 2015, pp. 5-16. 
[2] E. Biersack, C. Callegari, and M. Matijasevic, Data Traffic Monitoring and 
Analysis: Springer, 2013. 
[3] J. L. García-Dorado, F. Mata, J. Ramos, P. M. S. del Río, V. Moreno, and J. 
Aracil, "High-performance network traffic processing systems using commodity 
hardware," in Data Traffic Monitoring and Analysis, ed: Springer, 2013, pp. 3-27. 
[4] M. Rio, M. Goutelle, T. Kelly, R. Hughes-Jones, J.-P. Martin-Flatin, and Y.-T. 
Li, "A map of the networking code in Linux kernel 2.4.20," Technical Report 
DataTAG-2004-1, 2004. 
 
Appendix II      Intel Haswell Power Management 
 
The multicore CPUs used in today’s server systems are the biggest consumers of the 
overall energy supplied to a single server when they are in an active state. Saving even 
a small amount of energy inside the CPU has a multiplicative effect at a large scale, 
such as in a Data Center. Reducing the instantaneous power used by the CPU also 
reduces the cost of cooling a single system, and therefore lowers the overall cost of 
running the system itself. 
CPU Power, Performance and Sleep States 
There are a number of techniques for saving power, such as turning off logic gates 
within the CPU or lowering the frequency of the running clock. 
The Advanced Configuration and Power Interface (ACPI) [1] establishes a standard 
interface enabling the OS to discover and configure hardware components. The ACPI 
provides support for a number of processor power states, illustrated in Figure II - 1. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
G0 - Working 
    S0 – Processor Fully Powered (full on mode / connected standby mode) 
        C0 – Active Mode 
            P0 
            Pn 
        C1 – Auto Halt 
        C1E – Auto Halt, Low frequency, Low voltage 
        C3 – L1/L2 Caches flush, Clocks off 
        C6 – Save core states before shutdown 
        C7 – Similar to C6, L3 cache flush 
G1 - Sleeping 
    S3 – Cold – Sleep – Suspend to Ram (STR) 
    S4 – Hibernate – Suspend to Disk (STD), Wakeup on PCH 
    S5 – Soft Off – no power, Wakeup on PCH 
G3 – Mechanical OFF 
Figure II - 1. Processor Power and Sleep States 
The ACPI enables the OS to perform some power management functions, such as 
requesting that unused hardware be put to sleep. The areas of interest for this research 
work are the performance (P) states and the sleep (C) states. 
Table II - 1 provides a brief description of the processor’s sleep states. These states 
save power by stopping execution on the CPU core. A deeper sleep state saves more 
energy but takes longer to return to the C0 state, thus incurring latency overhead. 
Table II - 1. Processor Idle States 
STATE   DESCRIPTION 
C0      Active mode, processor executing code. 
C1      AutoHALT state. 
C1E     AutoHALT state with lowest frequency and voltage operating point. 
C3      Execution cores in C3 state flush their L1 instruction cache, L1 data cache, 
        and L2 cache to the L3 cache. Clocks are shut off to each core. 
C6      Execution cores in C6 state save their architectural state before removing 
        core voltage. 
C7      Execution cores in this state behave in a similar way to C6 state. If 
        execution cores request C7 state, L3 cache is flushed until it is cleared. If 
        the entire L3 cache is flushed, voltage is removed from L3 cache. 
 
While a processor core is in active C0 state, its power demand is determined by the 
core’s operating frequency and the voltage applied to it. Each voltage-frequency 
operating point is defined by the ACPI as a P-state. Performance states are voltage-
frequency pairs that can set the speed and power usage of the processor core. They are 
processor specific and numbered from P0, the highest voltage-frequency combination, 
through Pn, which is the lowest combination and saves the most energy. A P-state can 
also be described as a performance level that the OS requests from the hardware, in 
our case the CPU core. The intel_pstate driver provides an interface to control the 
P-state selection for all Intel processors from the Sandy Bridge generation to date. 
The driver decides what P-state to use based on the requested state and internal 
policies enforced by the OS. If the processor is capable of selecting the next P-state 
internally, the driver will offload the selection responsibility to the internal 
hardware performance manager. Where the processor is not able to select its own 
P-state, the intel_pstate driver implements algorithms to perform this selection. 
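
The dependence of active power on the voltage-frequency pair is commonly approximated by the CMOS dynamic power relation, in which power is proportional to the square of the voltage multiplied by the frequency. The sketch below applies that approximation to compare operating points. The voltage and frequency values are hypothetical placeholders, not measured figures for the Xeon E3-1285 v3, and static (leakage) power is ignored.

```python
def relative_dynamic_power(voltage_v, freq_ghz, ref_voltage_v, ref_freq_ghz):
    """Dynamic power relative to a reference point, using P ~ V^2 * f."""
    return (voltage_v ** 2 * freq_ghz) / (ref_voltage_v ** 2 * ref_freq_ghz)


# Hypothetical (voltage in V, frequency in GHz) pairs, for illustration only.
p_states = {
    "P0": (1.20, 3.6),
    "P1": (1.10, 3.0),
    "Pn": (0.80, 0.8),
}

ref_v, ref_f = p_states["P0"]
relative = {
    name: relative_dynamic_power(v, f, ref_v, ref_f)
    for name, (v, f) in p_states.items()
}

for name, value in relative.items():
    print(f"{name}: {value:.2f} x P0 dynamic power")
```

Because voltage enters the relation squared, lowering voltage and frequency together reduces dynamic power much faster than lowering frequency alone, which is why Pn is described as saving the most energy.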
 
 
 
References 
 
 
 
[1] ACPI. (07/07). Advanced Configuration & Power Interface. Available: 
http://www.acpi.info/ 
 
 
 
