An active router architecture using programmable hardware by Alexandros Fragkiadakis (7202060)

An Active Router Architecture using 
Programmable Hardware 
By 
Alexandros Fragkiadakis 
A doctoral thesis submitted in partial fulfilment of the 
requirements for the award of Doctor of Philosophy of 
Loughborough University 
June 2005 
© by Alexandros Fragkiadakis 2005 
Dedicated to my family 
Abstract 
The current generation of networks is referred to as passive or conventional
networks. Their functionality is limited to transferring packets from one point to another.
Passive nodes such as routers and switches are vertically integrated devices that allow
little or no modification of the operations they perform. Their functionality is limited
to the network and (sometimes) transport protocol layers. They perform no
computation on the payload data of the packets they handle. Due to their nature,
passive networks impose limitations such as difficulty in introducing new
protocols and services, and limited performance.
There is a different approach in the computer networks field, called Active Networks. 
Active Networks consist of routers and switches (Active Nodes) that not only forward 
packets from one point to another but also perform customised computations on them. 
The applications that execute on the Active Nodes can be user-driven. This allows 
end-users to program the network and tailor its services to their needs. Users could 
also inject their own code, which programs intermediate Active Nodes. 
This thesis presents an Active Router architecture using programmable hardware. The 
router consists of two hosts running Linux. The separation of the router functionality 
into two hosts is performed for safety reasons. A PCI-based board that hosts a Field 
Programmable Gate Array (FPGA) device comprises the programmable hardware 
element of the router. The motivation for using an FPGA device is that it is 
reprogrammable, so Active Applications can be installed on the fly, and it provides a
safer execution environment. 
The software architecture of the Active Router comprises several modules that 
implement tasks such as the safe download of applications from dedicated servers, 
resource management, fault detection and isolation. 
The performance evaluation of the Active Router reveals several bottlenecks and
limitations, such as the PCI bus and the interrupt-driven nature of the Linux operating
system.
Acknowledgements 
I wish to acknowledge my supervisor, Prof. David Parish, for his encouragement,
guidance and support during my entire PhD study at Loughborough University.
I wish also to thank Dr. Mark Sandford and Mr. Andy Larkum, for their valuable 
support during my first steps in the magic world of Linux. 
I should also mention Dr. Omar Bashir for correcting part of this thesis. 
Finally, I wish to thank my family, to whom I dedicate this text, for their support and
their patience.
Abbreviations and Acronyms 
AA Active Application 
ACTREG Active Registry 
AE Active Engine 
AF Active Filter 
AFL Active Flow 
API Application Program Interface 
APPLOD Application Loader 
AR Active Router 
ARP Address Resolution Protocol 
ASIC Application Specific Integrated Circuit 
ATM Asynchronous Transfer Mode 
CLB Configurable Logic Block 
CPU Central Processing Unit 
CPUM CPU Monitor 
CPR Core Process 
CS Code Server 
CU Core Unit 
DLL Delay Locked Loop 
DMA Direct Memory Access 
DSA Digital Signature Algorithm 
DSP Digital Signal Processing 
EEPROM Electrically Erasable Programmable Read-Only Memory 
EPROM Erasable Programmable Read-Only Memory 
FEC Forward Error Correction 
FIFO First-In First-Out 
FPGA Field Programmable Gate Array 
GMID Global Module Identifier 
GRM General Routing Mask 
HALT Higher Abstraction Level Threat 
HTB Hierarchical Token Bucket 
ICMP Internet Control Message Protocol 
IEEE Institute of Electrical and Electronics Engineers 
IOB Input/Output Block
IP Internet Protocol
ISA Industry Standard Architecture 
LAN Local Area Network 
LC Logic Cell 
LKML Loadable Kernel Module Loader 
LKM Loadable Kernel Module 
LTT Linux Trace Toolkit
LUT Look-Up Table 
MELT Malicious Electrical Level Threat 
MEMM Memory Monitor 
MJF Major Faults 
MMU Maximum Memory Utilisation 
MNF Minor Faults 
NACK Negative Acknowledgement 
NIC Network Interface Card 
OOM Out of Memory Management 
PCI Peripheral Component Interconnect 
PI Packet Injector 
PID Process Identification 
PLAN Packet Language for Active Networks 
PMC Peripheral Component Interconnect Mezzanine Card
PNAT Process Nature 
PQ Packet Queue 
PSTAT Process Status 
PTMP Process Timestamp 
QDISC Queuing Discipline 
QoS Quality of Service 
RAM Random Access Memory 
RDM Request Decision Module 
RFE Routing and Forwarding Engine 
RP Rest Period 
RQ Request Queue 
RT Request Type 
SALT Signal Alteration Logic Threat 
SCP Secure Copy 
SFM Safety Module 
SPR Safety Process 
SRAM Static Random Access Memory 
SSH Secure Shell 
TCP Transmission Control Protocol
TRS Traffic Shaper 
TSC Time-Stamp Counter 
TTY Teletype Device 
UDP User Datagram Protocol 
VHDL Very High Speed Integrated Circuit Hardware Description Language
CONTENTS 
Abstract ............................................................................................................................. I
Acknowledgments .......................................................................................................... II
Abbreviations and Acronyms ..................................................................................... III
Contents ......................................................................................................................... VI 
List of Figures ................................................................................................................ XI 
Chapter 1 
1 Introduction ................................................................................................................. 2 
1.1 Passive Networks ................................................................................................... 3 
1.2 Active Networks-a different Approach ................................................................. 3 
1.3 Realisation of the Active Networks ....................................................................... 4 
1.3.1 The Capsule Approach ..................................................................................... 5 
1.3.2 The Programmable Switch Approach .............................................................. 5 
1.4 Issues raised by activating the Networks ............................................................... 5 
1.4.1 Safety and Security .......................................................................................... 5 
1.4.1.1 Safety and Security from the Programming Point of View .............. 6 
1.4.1.2 Safety and Security from the Systems Point of View ..................... 7 
1.4.2 Resource Management ..................................................................................... 8 
1.4.3 The End-to-End Argument .............................................................................. 9 
1.5 Thesis Overview .................................................................................................. 10 
Chapter 2
2 Field Programmable Gate Arrays (FPGAs) ........................................................... 12 
2.1 Chapter Summary ................................................................................................ 13 
2.2 Introduction ......................................................................................................... 13 
2.3 Structure of the FPGAs ........................................................................................ 13 
2.4 Different Types of FPGAs ................................................................................... 14
2.5 The XCV1600-E Field Programmable Gate Array ............................................. 15 
2.5.1 Architectural Description ............................................................................... 15 
2.5.2 Control Logic Blocks ..................................................................................... 16 
2.5.3 Input/Output Blocks ....................................................................................... 16 
2.5.4 Look-Up Tables ............................................................................................. 17 
2.5.5 Configuration Modes ..................................................................................... 17 
2.6 FPGAs and Active Networks .............................................................................. 19 
2.6.1 Motivation for using FPGAs in Active Networks ......................................... 19 
2.6.2 Reconfigurable Hardware Security ................................................................ 21 
2.7 Summary .............................................................................................................. 22 
Chapter 3
3 Relevant Research in Active Networks ................................................................... 23 
3.1 Chapter Summary ................................................................................................ 24 
3.2 Active Network Projects ...................................................................................... 24 
3.2.1 The Active IP Option ..................................................................................... 24
3.2.2 Active Network Encapsulation Protocol... ..................................................... 24 
3.2.3 Smart Packets for Active Networks ............................................................... 25 
3.2.4 PLANet: An Active Internetwork .................................................................. 26
3.2.5 LARA ............................................................................................................ 26 
3.2.6 The Phoenix Framework ................................................................................ 27 
3.2.7 The Programmable Protocol Processing Pipeline .......................................... 27 
3.2.8 The Active Network Processing Element ..................................................... 28 
3.2.9 The Flexible High Performance Platform ...................................................... 28 
3.2.10 The Field Programmable Port Extender ...................................................... 29 
3.3 Applications and Services that can be applied in Active Networks .................... 30 
3.3.1 Active Reliable Multicast .............................................................................. 30 
3.3.2 Active Anycast. .............................................................................................. 30 
3.3.3 Forward Error Correction .............................................................................. 31 
3.3.4 Cryptography ................................................................................................. 31 
3.3.5 Active Firewall ............................................................................................... 31 
3.3.6 Mixing Sensor Data ....................................................................................... 31 
3.3.7 Active Networks in Telephony ..................................................................... 32 
3.3.8 Virtual Active Networks ................................................................................ 32 
3.3.9 Active Caching ............................................................................................... 32 
3.4 Summary .............................................................................................................. 34 
Chapter 4
4 Active Protocols-Active Applications and the Hardware Element of the Active 
Router ............................................................................................................................. 35 
4.1 Chapter Summary ................................................................................................ 36 
4.2 The Active Protocol ............................................................................................. 36 
4.3 The Hardware Element of the Active Router ...................................................... 38 
4.3.1 PCI Mezzanine Cards .................................................................................... 38 
4.3.2 Layout of the PCI Card .................................................................................. 38 
4.4 Active Applications ............................................................................................. 39 
4.4.1 Software Active Applications ........................................................................ 39 
4.4.2 FPGA Active Applications ............................................................................ 39 
4.4.2.1 The Software Part of an Active Application ......................... ..40 
4.4.2.2 The Hardware Part of an FPGA-Active Application ............... ..40 
4.4.3 A DES Algorithm implemented as an FPGA Application ......................... .46 
4.4.3.1 Interfacing the DES Application with the PLX 9080 ................ ..47 
4.4.3.2 Producing the Bitstream .................................................... 53 
4.4.4 Naming of the Active Applications ............................................................... 55 
4.5 Summary .............................................................................................................. 56 
Chapter 5
5 Architecture of the Active Engine ............................................................................ 57 
5.1 Chapter Summary ................................................................................................ 58 
5.2 Introduction ......................................................................................................... 58 
5.3 The Active Engine ............................................................................................... 59 
5.3.1 The Software Part of the Active Engine ........................................................ 59 
5.3.1.1 The Active Filter ............................................................ 60 
5.3.1.1.1 Loadable Kernel Modules .................................... 60 
5.3.1.1.2 Netfilter Hooks ................................................. 60 
5.3.1.2 The Core Process ............................................................ 63 
5.3.1.2.1 Loading the Active Applications ............................ 64 
5.3.1.2.2 Packet Queues .................................................. 65 
5.3.1.2.3 The Active Registry ........................................... 65 
5.3.1.2.4 The Request Mechanism for the Active Applications .. 69 
5.3.1.3 The Application Loader ...................................................... 73 
5.3.1.3.1 Communication with a Code Server ........................ 73 
5.3.1.3.2 The Proc Filesystem .......................................... 79 
5.3.1.3.3 Loading Active Applications in Memory ............ 82 
5.3.1.4 The Memory Monitor ......................................................... 88 
5.3.1.4.1 Locate the Active Applications Loaded in Memory ..... 88 
5.3.1.4.2 Check the Major and Minor Faults ........................... 88 
5.3.1.4.3 Check if the Active Applications are still "Alive" .......... 88 
5.3.1.4.4 Process Ageing ................................................... 89 
5.3.1.4.5 Memory Monitor ................................................ 89 
5.3.1.5 The CPU Monitor .............................................................. 91 
5.3.1.5.1 CPU Utilisation Measurements .................................... 91 
5.3.1.5.2 Traffic Shaping Requests ................................... 92 
5.3.1.5.3 Penalising Active Applications for High CPU 
Utilisation ....................................................................... 93 
5.3.1.6 The Packet Injector ................................................................... 95 
5.3.1.7 The Safety Process .................................................................... 96 
5.3.1.7.1 Functionality of the Safety Process .............................. 96 
5.3.1.7.2 Out of Memory Management ..................................... 97
5.3.1.7.3 Out of Memory Management Test ........................... 97 
5.3.1.8 Recovery Procedure performed by the Core Process .................... 98 
5.4 Summary ............................................................................................................ 101 
Chapter 6 
6 Architecture of the Routing and Forwarding Engine ......................................... 103 
6.1 Introduction ....................................................................................................... 104 
6.2 Redirecting Active Packets to the Active Engine .............................................. 105 
6.3 The Safety Module ............................................................................................ 107 
6.3.1 Structure of the Safety Module .................................................................... 107 
6.3.1.1 Sending ICMP Requests ........................................................... 107 
6.3.1.2 Receiving ICMP Replies .......................................................... .108 
6.4 Communication between the Routing and Forwarding Engine, and the Active 
Engine ................................................................................................................ 109 
6.5 Loadable Kernel Module Loader and Traffic Shaper ........................................ 111 
6.5.1 The Traffic Shaper ....................................................................................... 111 
6.5.1.1 Unloading the Loadable Kernel Modules ....................................... 111
6.5.1.2 Traffic Shaping ........................................................................ 112 
6.5.1.2.1 Queues and Queuing Disciplines in Linux ................. .112 
6.5.1.2.2 Classful Qdiscs .................................................. 112 
6.5.1.2.3 The Token Bucket Filter ...................................... 113 
6.5.1.2.4 The Hierarchical Token Bucket Qdisc and Shaping of the 
Active Traffic ........................................................... 113 
6.5.1.2.5 Testing the Traffic Shaping Mechanism ..................... 115 
6.5.2 The Loadable Kernel Module Loader .......................................................... 125 
6.6 Summary ............................................................................................................ 127 
Chapter 7
7 Performance Evaluation of the Active Engine ..................................................... 128 
7.1 Chapter Summary .............................................................................................. 129 
7.2 Packet Loss ........................................................................................................ 129 
7.2.1 A Generic Description of the System under Test ............................... 129
7.2.2 CPU Cycle Consumption .................................................................... 131 
7.2.3 Measuring the Packet Loss .......................................................................... 131
7.2.3.1 Reducing the Number of the Process Context Switches ............... .134 
7.2.3.1.1 Choosing the Size of the Buffers ............................. .138 
7.2.3.1.2 Packet Loss with and without Buffering .................. .138 
7.3 Packet Delay Measurements .............................................................................. 142 
7.3.1 Delay Measurements when using no Buffers .............................................. 144 
7.3.2 Delay Measurements when using Buffers of 6 Kbytes ............................... 146 
7.3.3 Comparing the Packet Delay for no Buffering and Buffering using different 
Buffer Sizes ........................................................................... 149 
7.4 Switching between Active Applications ............................................................ 152 
7.5 Performance Evaluation of a Software and a Hardware DES Application ....... 158 
7.5.1 Delay Measurements ................................................................................... 158 
7.5.2 CPU Consumption Measurements .............................................................. 160 
7.5.2.1 Performance Counters and the Time-Stamp Counter ................ 160 
7.5.2.2 CPU Cycle Consumption .............................................. .160 
7.6 Summary ............................................................................................................ 172 
Chapter 8
8 Active Secure FTP .................................................................................................. 173
8.1 Chapter Summary .............................................................................................. 174
8.2 Activating the Passive Packets .......................................................................... 174 
8.3 Description and Performance Evaluation of the Active Secure FTP
Application ........................................................................................................ 177 
8.4 Summary ............................................................................................................ 180 
Chapter 9
9 Conclusions and Future Work ............................................................................... 181
References .................................................................................................................... 189 
Appendix A des.c (DES Encryption/Decryption) ..................................................... 201
Appendix B des.vhd (DES Encryption/Decryption) ................................................ 211
Appendix C Local PCI Bus Signals ........................................................................... 220 
Appendix D Testbench File used for the Simulation of the des.vhd ...................... 222 
Appendix E Script File used for Traffic Shaping .................................................... 228 
Appendix F File produced by LTT ............................................................................ 230 
Appendix G The skbuff Structure ............................................................................. 233
List of Figures 
Figure 2.1 The FPGA Architecture ................................................................................. 14 
Figure 2.2 The Virtex-E Architecture Overview ............................................................ 15 
Figure 2.3 Two-slice Virtex-E Configurable Logic Block ............................................. 16 
Figure 2.4 Virtex-E Input/Output Block ......................................................................... 17 
Figure 2.5 Slave-Serial Configuration Mode .................................................................. 18 
Figure 2.6 Master-Serial Configuration Mode ................................................................ 18 
Figure 2.7 SelectMap Configuration Mode .................................................................... 18 
Figure 2.8 Boundary-Scan Configuration Mode ............................................................ 19 
Figure 3.1 Format of the Active IP Option Field ............................................................ 24
Figure 3.2 The ANEP Header ......................................................................................... 25 
Figure 3.3 Format of the Smart Packet ........................................................................... 26 
Figure 3.4 The P4 Architecture ....................................................................................... 27 
Figure 3.5 The ANPE Architecture ................................................................................ 28 
Figure 3.6 The FHiPPs Architecture ............................................................................... 29 
Figure 3.7 NID and RAD Configuration ........................................................................ 30 
Figure 4.1 Placing the Active Header in the IP Options Field ........................................ 36
Figure 4.2 Wrapping Passive Packets into Active Packets ............................................. 37 
Figure 4.3 Block Diagram of the PCI-based FPGA Board ............................................. 39 
Figure 4.4 The Xilinx Design Flow ................................................................................ 41 
Figure 4.5 Schematic Entry of a Comparator ................................................................. 42 
Figure 4.6 HDL Entry of the Comparator ....................................................................... 43 
Figure 4.7 Functional Simulation of the Comparator ..................................................... 44 
Figure 4.8 Design Flow ................................................................................................... 45 
Figure 4.9 Block Diagram of the DES Encryption/Decryption Algorithm .................... 46
Figure 4.10 Interfacing the DES with the Hardware Algorithm ..................................... 47 
Figure 4.11 Local PCI Bus Signals ................................................................................. 51 
Figure 4.12 Direct Slave Single Cycle Write .................................................................. 52 
Figure 4.13 Direct Slave Single Cycle Read .................................................................. 53 
Figure 4.14 Using ModelSim for Functional Simulation ............................................... 54 
Figure 4.15 Snapshot from the Simulation Process ........................................................ 55 
Figure 5.1 The Fundamental Parts of the Active Router placed into the Network ........ 58 
Figure 5.2 The Software Architecture Layout of the Active Engine .............................. 59 
Figure 5.3 The Journey of a Packet in the Linux Kernel ................................................ 61 
Figure 5.4 Netfilter Hooks in the Linux Kernel.. ............................................................ 62 
Figure 5.5 Flowchart of the Functionality of the Active Filter ....................................... 63 
Figure 5.6 Packet Queues ............................................................................................... 65 
Figure 5.7 The Request Queue ........................................................................................ 70 
Figure 5.8 Tokens and the Request Mechanism ............................................................. 71 
Figure 5.9 Flowchart of the Core Process ....................................................................... 72 
Figure 5.10 Packet sent during the Control Phase 1 ....................................................... 73 
Figure 5.11 Packet sent during the Control Phase 2 ....................................................... 74 
Figure 5.12 Control Phase with two Retransmissions .................................................... 76 
Figure 5.13 The /proc/meminfo File ................................................................................ 79
Figure 5.14 The /proc/stat File ........................................................................................ 80 
Figure 5.15 The /proc/maps File ..................................................................................... 81 
Figure 5.16 Acquiring System Information from the /proc Filesystem .......................... 82 
Figure 5.17 Utilisation of the Physical Memory and the Swap Space ............................ 84 
Figure 5.18 The search_PID Module .............................................................................. 86 
Figure 5.19 Functionality of the Memory Monitor ......................................................... 90 
Figure 5.20 Traffic Shaping Requests ............................................................................ 93 
Figure 5.21 The Recovery Procedure ........................................................................... 100 
Figure 6.1 The Software Architecture Layout of the Routing and Forwarding Engine 104
Figure 6.2 The Fundamental Parts of the Active Router placed into the Network ...... 105 
Figure 6.3 Replacing the Destination IP Address of the Active Packets by using
LKMs ............................................................................................................................ 106 
Figure 6.4 Script that loads the ip_queue Module and queues ICMP Packets from 
Kernel to User-Space ................................................................................................... 108 
Figure 6.5 ICMP Replies forwarded from Kernel to User-Space ................................ 109 
Figure 6.6 Traffic Control in the Kernel Network Stack ............................................. 112 
Figure 6.7 The Sequence of Events for Traffic Shaping ............................................. 114
Figure 6.8 Computation of the Ceil Argument for each Request for different Shaping 
Steps .......................................................................................................................... 115 
Figure 6.9 CPU Idle Time with and without Traffic Shaping ...................................... 116 
Figure 6.10 CPU Idle Time with and without Traffic Shaping for PR=30000, PL=100 
and S=0.1 ...................................................................................................................... 118 
Figure 6.11 CPU Idle Time with and without Traffic Shaping for PR=6250, PL=1000 
and S=0.1 ...................................................................................................................... 118 
Figure 6.12 CPU Idle Time with and without Traffic Shaping for PR=10000, PL=100 
and S=0.02 .................................................................................................................... 119 
Figure 6.13 CPU Idle Time with and without Traffic Shaping for PR=30000, PL=100 
and S=0.02 .................................................................................................................... 120 
Figure 6.14 CPU Idle Time with and without Traffic Shaping for PR=6250, PL=1000
and S=0.02 .................................................................................................................... 120
Figure 6.15 CPU Idle Time with and without Traffic Shaping for PR=10000, PL=100
and S=0.005 .................................................................................................................. 121
Figure 6.16 CPU Idle Time with and without Traffic Shaping for PR=30000, PL=100
and S=0.05 .................................................................................................................... 122
Figure 6.17 CPU Idle Time with and without Traffic Shaping for PR=6250, PL=1000
and S=0.005 .................................................................................................................. 122
Figure 6.18 CPU Idle Time with and without Traffic Shaping for PR=10000, PL=100
and different Shaping Steps .......................................................................................... 123
Figure 6.19 CPU Idle Time with and without Traffic Shaping for PR=30000, PL=100
and different Shaping Steps .......................................................................................... 124
Figure 6.20 CPU Idle Time with and without Traffic Shaping for PR=6250, PL=1000
and different Shaping Steps .......................................................................................... 124 
Figure 7.1 Basic Process of the Active Engine with one loaded Active Application ... 130 
Figure 7.2 Removing the Unpredictability of an Active Application by replacing it 
with a Simple Application ........................................................................................... 130 
Figure 7.3 The Test Network ........................................................................................ 132 
Figure 7.4 Counters placed in the Kernel-Space of the RFE and the AE ..................... 133 
Figure 7.5 Packet Loss for different Packet Lengths at different Packet Rates ............ 133 
Figure 7.6 Linux Trace Toolkit Architecture ................................................................ 135 
Figure 7.7 Adding Buffers in the Core Process and the Simple Application ............... 136 
Figure 7.8 Number of the Process Context Switches as a Function of increased Buffer 
Size ............................................................................................................................... 137 
Figure 7.9 Maximum Data Rate between two Processes as a Function of the 
transferred Data ............................................................................................................ 138 
Figure 7.10 Packet Loss for different Packet Lengths at different Packet Rates when 
Buffers are used ............................................................................................................ 139 
Figure 7.11 Packet Loss when using different Buffer Sizes for 100-byte Packets ....... 140 
Figure 7.12 Packet Loss when using different Buffer Sizes for 200-byte Packets ....... 140 
Figure 7.13 Packet Loss when using different Buffer Sizes for 300-byte Packets ....... 141 
Figure 7.14 Performance Improvement when Buffers are used for 100-byte Packets at 
different Packet Rates ................................................................................................... 142 
Figure 7.15 Time-Stamping of the Packets in the Input and Output Hooks ................. 143 
Figure 7.16 Time-Stamped Packet.. .............................................................................. 144 
Figure 7.17 Delay for 100-byte Packets at different Packet Rates ............................... 145 
Figure 7.18 Delay for 1000-byte Packets at different Packet Rates ............................. 145 
Figure 7.19 Delay for 100, 500, 1000 and 1500-byte Packets at a Packet Rate of 1000 
p/s ................................................................................................................................ 146 
Figure 7.20 Delay for 100-byte Packets at different Packet Rates when using Buffers 
of 6 Kbytes .................................................................................................................... 147 
Figure 7.21 Delay for 1000-byte Packets at different Packet Rates when using Buffers 
of 6 Kbytes .................................................................................................................... 147 
Figure 7.22 Delay for 100, 500, 1000 and 1500-byte Packets at a Packet Rate of 1000 
p/s when using Buffers of 6 Kbytes .............................................................................. 148 
Figure 7.23 Delay for 100-byte Packets at a Packet Rate of 1000 p/s .......................... 149 
Figure 7.24 Delay for 100-byte Packets at a Packet Rate of 15000 p/s ........................ 149 
Figure 7.25 Delay for 1000-byte Packets at a Packet Rate of 1000 p/s ........................ 150 
Figure 7.26 Delay for 1000-byte Packets at a Packet Rate of 5000 p/s ........................ 151 
Figure 7.27 The Test Network Topology ..................................................................... 154 
Figure 7.28 Configuration Times for different Methods used to Load an FPGA 
Device ........................................................................................................................... 156 
Figure 7.29 Delay for the DES Packets for two different Rest Periods ........................ 156 
Figure 7.30 DES Encryption performed in Software and Hardware ............................ 158 
Figure 7.31 Delay for the SW-DES and the HW-DES using two different Clocks for 
Packets of various Sizes ................................................................................................ 159 
Figure 7.32 Perfctr Probes placed in the two DES Applications .................................. 161 
Figure 7.33 CPU Cycle Consumption for the HW-DES and the SW-DES for Packets 
of different Sizes ........................................................................................................... 161 
Figure 7.34 Cost Cappi for Packets of different Sizes .................................................... 164 
Figure 7.35 Block Diagram of a Simple Application that performs DMA Transfers .. 166 
Figure 7.36 Predicted CPU Cycle Usage when using Memory Mapping and DMA, for 
different Data Sizes ....................................................................................................... 167 
Figure 7.37 CPU Cycle Usage when using DMA with and without Data Buffering ... 168 
Figure 7.38 Performance Improvement when using DMA with Data Buffering and 
Interrupt Mitigation ....................................................................................................... 169 
Figure 7.39 CPU Cycle Usage for the SW-DES and the different Versions of the HW-
DES .............................................................................................................................. 170 
Figure 8.1 Netfilter Hooks used for the activation and deactivation of Packets ........... 175 
Figure 8.2 The Data-Related Information of the skbuff Structure ................................ 175 
Figure 8.3 The Active Network Layers ........................................................................ 176 
Figure 8.4 Active Secure FTP performed by two Active Routers ................................ 177 
Figure 8.5 Time required to download Files of different Sizes using Passive FTP and 
Active Secure FTP ........................................................................................................ 179 
Chapter 1 
Introduction 
Chapter I Introduction 
1. Introduction 
1.1 Passive Networks 
Passive network is a term used to characterise the current generation of networks. 
Their function has been to deliver packets from one point to another. No 
computation takes place on a packet's payload other than that required for packet 
delivery. Passive router functionality is limited to the network and (sometimes) 
transport protocol layers. Processing within the network has been limited largely to 
routing, simple quality of service (QoS) schemes, and congestion control. Today, 
however, there is considerable interest in pushing other kinds of processing into the 
network [LWG98]. 
Due to their non-dynamic nature, legacy routers impose several difficulties in passive 
networks, such as: 
• The difficulty of integrating new technologies and standards into the network 
infrastructure [SFG96]. Conventional routers use specific protocols and are usually 
vertically integrated devices. If a new protocol is to be introduced into the network, 
standardisation is necessary. Standardisation is a very time-consuming procedure that 
can take years, so the network infrastructure is inflexible and evolves slowly. 
• Limited performance. As previously stated, passive router functionality is limited to 
the network and transport protocol layers. There are, however, network applications and 
services that could benefit if part of the computation took place within the network. 
Some examples of these applications and services are [WLG98]: videoconferencing, 
Internet telephony and caching. 
1.2 Active Networks - a different Approach 
The Active Network approach is very different from that of conventional networks. 
Active Networks consist of routers and switches that not only forward packets from 
one point to another but also perform customised computations on them. The 
behaviour of the Active Routers can also be modified by the packets they forward. 
This enables faster protocol innovation by making it easier to deploy new network 
protocols that can be installed on the fly in the intermediate nodes of an Active 
Network. Passive packets are replaced by active packets, that is, packets that interact with the 
Active Nodes. An important issue here is that the active packets have to be 
distinguished from the passive ones. To achieve this, several protocols for active packets have 
been implemented [FBPA03], [HMA99], [SJS99], [WETIEN96]. 
"Active architectures permit a massive increase in the sophistication of computation 
that is performed within the network, since computation takes part not only in the end-
points but in the routers as well" [TENW96]. 
The extreme flexibility of Active Networks has led to a proliferation of 
definitions for them, for example: programmable interface for the network [CBZ98], 
programmable network [WLG98], adaptive protocols [TENW96], platform for 
user-driven customisation of the infrastructure [TSS97], network as computational engine 
[BCZ96] and so on [DFF99]. 
Of course, it is not a priori obvious that a programmable network is a good idea. It 
clearly offers flexibility, but at some cost. End users are able to inject their code and 
program the intermediate nodes. This opens a "Pandora's box" of safety and 
security issues [TENW96]. Safety and security are briefly discussed in 
Section 1.4.1. 
Summarising, the features of an Active Network can be categorised as follows 
[ALML01]: 
• Router behaviour can be easily changed or enhanced, because the active packets 
contain functions and parameters that modify the behaviour of the router. 
• New functionality can be provided by loading software modules into Active Routers 
through the use of active packets. 
• Active Routers can work up to the application layer, thereby being able to handle 
new services, and be able to read the contents of all packets. 
1.3 Realisation of the Active Networks 
There are two approaches to the realisation of the Active Networks and these are 
related to the structure of the active packets. 
1.3.1 The Capsule Approach (or in-band approach) 
In this approach, the active packets carry code that is executed on the Active 
Nodes they traverse. The limitation of this approach is that the program size should 
not exceed about 1 Kbyte, since the size of an Ethernet packet is limited to 1.5 
Kbytes. There is relevant research on creating such miniature programs, such as 
PLAN (the Packet Language for Active Networks) [HMA99] and Python [BATB02]. 
A further drawback of the in-band approach is that the Active Code is replicated across 
the network, consuming valuable bandwidth, even though not all the active packets need to carry 
the Active Application. 
1.3.2 The Programmable Switch Approach (or out-of-band approach) 
In this approach, the active packets do not carry any miniature program; instead, they 
carry a unique module id number that references an Active Application. Every 
application must have a unique module id that is global to all Active Nodes, 
since it characterises that application. When the Active Router receives an active 
packet, it uses the module id number to download the Active Application from a Code 
Server or from another Active Router that has previously received the packet. Most 
of the research groups focusing on Active Networks use the programmable switch 
approach. 
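The out-of-band lookup described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the class names CodeServer and ActiveNode, the representation of an Active Application as a function on payload bytes, and the local cache that avoids repeated downloads are all assumptions made for the illustration.

```python
from typing import Callable, Dict

# Hypothetical model: an Active Application is a function on payload bytes.
ActiveApp = Callable[[bytes], bytes]


class CodeServer:
    """Stand-in for a Code Server that maps module ids to applications."""

    def __init__(self, apps: Dict[int, ActiveApp]) -> None:
        self._apps = apps
        self.downloads = 0  # number of times code has been fetched

    def fetch(self, module_id: int) -> ActiveApp:
        self.downloads += 1
        return self._apps[module_id]


class ActiveNode:
    """Resolves the module id carried by an active packet to an application."""

    def __init__(self, server: CodeServer) -> None:
        self._server = server
        self._cache: Dict[int, ActiveApp] = {}  # assumed local cache

    def handle(self, module_id: int, payload: bytes) -> bytes:
        app = self._cache.get(module_id)
        if app is None:
            # First packet for this module id: download the application.
            app = self._server.fetch(module_id)
            self._cache[module_id] = app
        return app(payload)


server = CodeServer({7: lambda data: data.upper()})
node = ActiveNode(server)
first = node.handle(7, b"abc")   # triggers a download
second = node.handle(7, b"xyz")  # served from the local cache
```

In this sketch only the first packet that carries a given module id causes a download, which is one reason the packets themselves can stay small.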
1.4 Issues raised by activating the Networks 
On the one hand, giving the end users the ability to inject their own programs into the 
network makes the network much more flexible and the creation of services and 
protocols can take place very rapidly. On the other hand, major safety and security 
issues arise for protecting the nodes from users or programs that intentionally or 
unintentionally misbehave. 
1.4.1 Safety and Security 
Conventional networks and end users suffer from different types of attacks such as 
denial-of-service (DoS) attacks, viruses etc. Intermediate routers and particularly 
servers have been the "victims" of many attacks in the past with catastrophic results. 
Active Networks, because of their programmable nature, are more sensitive to 
different kinds of attacks than conventional networks. For this reason, a safety and 
security mechanism has to exist in every Active Node. 
A distinction should be made between safety and security [PSN99]. The role of a 
security framework is to protect an Active Node from malicious attacks, while the role of a 
safety framework is to protect the Active Node from users or packets that 
unintentionally misbehave (e.g. a bug in an Active Application). 
Various research projects have been investigating the implementation of security and 
safety mechanisms to protect the integrity of an Active Node. Safety and security 
issues should be investigated from two points of view: the programming point of view 
and the systems point of view [PSN99]. 
1.4.1.1 Safety and Security from the Programming Point of View 
Active Networks are programmed by the injection of code into the Active Nodes. The 
code can be safely executed in an Active Node if a safe language has 
been used to create it. Safe languages can eliminate expensive run-time checks 
in the Active Nodes and provide safety by, for example, putting bounds on the amount 
of memory a program can use or limiting the number of threads a process can create 
[PSN99]. 
Relevant work on safe languages includes the PLANet project 
[HMA99], which created the PLAN language (Packet Language for Active 
Networks). PLAN is a small language that has resource-limit semantics and ensures 
that PLAN programs always terminate and that packets and their descendants visit only 
a fixed number of Active Nodes. 
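The idea of a resource limit that bounds the number of nodes visited by a packet and its descendants can be sketched as follows. This is an illustration of the general principle only, not PLAN itself: the Packet class, the budget field and the rule that child packets split the parent's remaining budget are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    payload: str
    budget: int  # hypothetical hop budget carried inside the packet


def forward(packet: Packet) -> Packet:
    """Spend one unit of budget for one hop; refuse when none is left."""
    if packet.budget <= 0:
        raise RuntimeError("budget exhausted: packet is dropped")
    return Packet(packet.payload, packet.budget - 1)


def spawn(parent: Packet, count: int) -> List[Packet]:
    """Create child packets that share the parent's remaining budget."""
    share = parent.budget // count
    return [Packet(parent.payload, share) for _ in range(count)]


def total_hops(packet: Packet, fanout: int) -> int:
    """Worst case: every packet takes one hop, then spawns `fanout` children."""
    if packet.budget <= 0:
        return 0
    moved = forward(packet)
    children = spawn(moved, fanout)
    return 1 + sum(total_hops(child, fanout) for child in children)


hops = total_hops(Packet("probe", 5), fanout=2)
```

Because the children can never hold more budget in total than the parent had left, the number of hops taken by a packet and all of its descendants cannot exceed the initial budget, so the activity always terminates.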
Other relevant research is the software fault isolation introduced in [WLAG93]. This 
is a safety mechanism for isolating suspicious modules running in the same address 
space. 
[NEC97] presents proof-carrying code, which permits arbitrary code to be executed as 
long as a valid proof of safety accompanies it. 
Python [BATB02] is another language for Active Network systems. It provides 
portable bytecode, which allows the execution of code on different platforms. 
1.4.1.2 Safety and Security from the Systems Point of View 
There are numerous research groups that introduce several safety and security 
architectures in the infrastructure of the Active Nodes. These frameworks perform 
tasks such as authentication and integrity control of the active packets, control and 
monitoring of the system resources etc. 
The attacks to which an Active Node is susceptible are more numerous than those in current 
conventional networks. Generally, some of these attacks are [DEN99]: 
• Misuse of an Active Node by the Active Code 
A piece of Active Code can claim the identity of trusted Active Code (masquerade) 
and gain access to data of the Active Node that it is not permitted to access. The result can be 
corruption of the Active Node through overuse of its resources and services. 
• Misuse of Active Code by an Active Node 
Masquerading can take place in this case too. An Active Node can claim the identity 
of a trusted Active Node and gain access to the Active Code that traverses the 
network. The malicious node can then monitor the data sent over the network or even 
corrupt it. 
• Misuse of Active Code by the Underlying Infrastructure 
Threats exist while the Active Code traverses the network from host to host. An 
external attacker could perform all kinds of attacks such as masquerade, denial of 
service, unauthorised access, copy and replay, alteration etc. 
Finally, a combination of the above categories is possible. 
1.4.2 Resource Management 
Current conventional network nodes (routers) examine only the header (IP 
header) of the packets they forward. This does not require a large amount of 
computation or storage; therefore resource management is not a critical issue. 
Applications that process packets can be divided into two categories: header-processing 
applications and payload-processing applications [WF00]. For header-processing 
applications the processing cost is independent of the packet size, but for 
payload-processing applications the cost grows with the amount of payload data. Active Applications are 
expected to be payload-processing applications or a mixture of these two categories. 
For this reason, Active Nodes must provide mechanisms to enforce resource limits 
associated with specific network traffic [GALT99]. The resource management 
framework is usually part of a generic safety and security framework. 
In Active Nodes, resource requirements typically fall into four categories: processing, 
memory, storage and bandwidth [DEN99]. A more detailed definition of the resources 
of an Active Node is also given in [DEN99], where the authors distinguish two types of 
resources, physical and logical. Physical resources refer to the hardware 
capabilities of the Active Node, while logical resources refer to the software 
capabilities of the Active Node for the creation of objects that handle the various data 
flows inside the node. The physical resources are CPU time, data bus bandwidth, 
input network bandwidth, output network bandwidth, memory, input network 
buffers, output network buffers and storage. The logical resources are the queues for storing 
packets, packet classifiers, filters and threads. 
There are several research groups investigating resource management issues and fair 
allocation of the resources [CGM00], [DEN99], [GALT99], [GMB01], [GMC01], 
[PAW002], [SAB03], [SAJ03], [YALA95]. Two of these projects are briefly 
described. 
In [GALT99], a benchmark is used to measure the performance (EEa, EEb etc.) of 
different execution environments (EEs) in different Active Nodes. The 
performance of these EEs is also measured in a reference Active Node (EEr1, EEr2 etc.). 
Using these values, two transformations are defined: node-to-reference (NR) and 
reference-to-node (RN). Prior to transferring an Active Application model (values 
EE1, EE2, etc.) between two nodes, A and B, the model is subjected to an NR 
transform at node A. The model, with its CPU requirements expressed in terms of the reference 
node, is then transmitted across the network. Upon arrival at node B, the model is 
subjected to an RN transform for node B. The combination of these two transforms converts 
the CPU times within an application model from a form meaningful on node A into a 
form meaningful on node B [GALT99]. 
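The two-step conversion can be sketched as below. The exact form of the transforms in [GALT99] is not reproduced here; the sketch simply assumes that each transform scales a CPU time by the ratio of benchmark times measured for the same EE on the two nodes involved. The benchmark figures, the node names and the function names are hypothetical.

```python
from typing import Dict

# Hypothetical benchmark results: seconds taken by the same benchmark in each
# execution environment (EE) on each node. A smaller value means a faster node.
BENCH: Dict[str, Dict[str, float]] = {
    "reference": {"EE1": 2.0, "EE2": 4.0},
    "A": {"EE1": 1.0, "EE2": 8.0},
    "B": {"EE1": 4.0, "EE2": 2.0},
}


def node_to_reference(model: Dict[str, float], node: str) -> Dict[str, float]:
    """NR transform (assumed linear): express node CPU times in reference terms."""
    return {ee: t * BENCH["reference"][ee] / BENCH[node][ee]
            for ee, t in model.items()}


def reference_to_node(model: Dict[str, float], node: str) -> Dict[str, float]:
    """RN transform (assumed linear): express reference CPU times in node terms."""
    return {ee: t * BENCH[node][ee] / BENCH["reference"][ee]
            for ee, t in model.items()}


# CPU times (seconds per EE) of an application model as measured on node A.
model_on_a = {"EE1": 0.5, "EE2": 1.6}
in_transit = node_to_reference(model_on_a, "A")  # form sent over the network
model_on_b = reference_to_node(in_transit, "B")  # form meaningful on node B
```

Under this assumption, only the reference form travels across the network, so node B needs to know its own benchmark results and those of the reference node, but nothing about node A.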
In [PAW002], the authors present actual execution times of various applications on 
packets of varying lengths, measured on a programmable router. They show that, for 
a restricted class of network applications, the processing times are strongly 
correlated with the size of the data being processed, i.e. the packet length. Using this 
correlation, they predict packet execution times in order to perform admission control and 
to schedule packets for processing. 
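A prediction of this kind can be sketched with a straight-line fit of processing time against packet length, followed by a simple admission test. This is an illustration of the principle, not the method of [PAW002]: the measurements below are invented, and the function names fit_line and admit, as well as the per-packet time budget, are assumptions.

```python
from typing import List, Tuple


def fit_line(lengths: List[float], times: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of time = slope * length + intercept."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, times))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def admit(packet_length: int, slope: float, intercept: float,
          budget_us: float) -> bool:
    """Admit a packet only if its predicted processing time fits the budget."""
    return slope * packet_length + intercept <= budget_us


# Invented measurements: packet length in bytes, processing time in microseconds.
lengths = [100, 500, 1000, 1500]
times_us = [12.0, 52.0, 102.0, 152.0]

slope, intercept = fit_line(lengths, times_us)
small_ok = admit(1000, slope, intercept, budget_us=150.0)
large_ok = admit(1500, slope, intercept, budget_us=150.0)
```

With these invented figures the fit gives roughly 0.1 microseconds per byte plus a fixed 2 microseconds, so a 1000-byte packet is admitted under a 150-microsecond budget while a 1500-byte packet is not.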
1.4.3 The End-to-End Argument 
Active Networking, as mentioned before, is the placement of user-controllable 
functionality in the network. The end-to-end argument states that functions should 
be placed in the network only if they can be cost-effectively implemented there [BCAZ97], or that 
functions placed at low levels of a system may be redundant or of little value when 
compared with the cost of providing them at that low level [SRC84]. This means that 
activating the network will not benefit all network applications. A network 
application could be implemented exclusively in the end-systems (i.e. current network 
applications) or as a combination of implementations at the end-systems and in 
the network (i.e. Active Applications). 
In [SRC84], a simple example of a file transfer between two end points is given. For 
this application to be reliable, a mechanism for error detection and correction is 
necessary. The question is where this mechanism is best 
implemented: in the end points, in the network, or in a combination of the two. In 
[BCAZ97], a model that quantifies the benefit to an application of network-based 
functionality is presented. The expected performance of an end-system-only implementation and of a 
combined end-system and network implementation is evaluated. The model is then applied to a 
congestion-control application to show the benefit to that application of network-
based functionality. 
1.5 Thesis Overview 
This thesis presents the architecture of a PC-based Active Router. The router consists 
of two hosts running Linux. The router functionality is separated into two 
hosts for safety reasons. The first host is responsible for forwarding 
active packets to the second host, where the Active Applications execute. The 
second host contains a PCI-based FPGA board. 
Chapter 2 of the thesis presents the Field Programmable Gate Arrays (FPGAs). Their 
basic structure is described, as well as their different types. A more detailed 
description of the XCV1600-E FPGA is given since this is the programmable device 
used to build the Active Router. The motivation behind using FPGAs for 
implementing the Active Router and hardware security issues are discussed. 
The first part of Chapter 3 presents relevant research performed in the Active 
Networks field. The second part refers to applications and services that can be applied 
in Active Networks. 
Chapter 4 describes the protocol used to build the active packets, since they have to be 
distinguished from the passive ones. The format of the Active Applications is 
presented and the description of the hardware element of the router follows. The last 
part of the chapter presents an Active Application that executes on the FPGA 
board and performs encryption/decryption using the DES (Data Encryption Standard) 
algorithm. 
Chapter 5 demonstrates the architecture of the Active Engine (AE). The AE is the host 
(part of the Active Router) where the Active Applications execute and where the 
FPGA board is placed. The AE consists of several modules that implement tasks for 
resource management, fault detection and isolation and communication with Code 
Servers. 
Chapter 6 describes the architecture of the other host that makes up the Active 
Router, called the Routing and Forwarding Engine (RFE). The RFE is responsible for 
forwarding active packets to the AE. It also contains a module that implements 
traffic shaping in order to protect the AE from Denial of Service attacks, and a second 
module that assists in the safety of the Active Router. 
Chapter 7 presents the performance evaluation of the Active Engine in terms of 
packet loss, packet delay and delay jitter within the Active Engine. A method for 
reducing the packet loss is presented and its effect on the packet delay and delay jitter 
is shown. The second part of the chapter presents the switch operation between two 
FPGA applications (applications that change on the fly) and measures the time necessary for the 
switch to take place. The last part of Chapter 7 describes the performance evaluation of two Active 
Applications in terms of CPU cycle consumption and delay within the Active Engine. 
These applications are a DES application 
running in software and a DES application running in hardware (FPGA). Several 
methods for improving the performance of the hardware DES are discussed. 
Chapter 8 presents the implementation of an Active Secure FTP application. The 
method used to activate and deactivate the packets originating from an FTP client and 
an FTP server is analysed. Finally, the performance of normal (passive) FTP and of 
Active Secure FTP is compared in terms of the time required to download files of 
different sizes. 
Chapter 9 contains the conclusions and suggestions for future work. 
Chapter 2 
Field Programmable Gate Arrays 
(FPGAs) 
Chapter 2 Field Programmable Gate Arrays (FPGAs) 
2. Field Programmable Gate Arrays (FPGAs) 
2.1 Chapter Summary 
This chapter presents Field Programmable Gate Arrays (FPGAs). Their basic 
structure is described, as well as their different types. 
A more detailed description of the XCV1600-E FPGA is given, followed by the different 
modes that can be used for programming it. 
The motivation behind using FPGAs for implementing the Active Network is 
described and issues related to hardware security are discussed. 
2.2 Introduction 
In recent years, the development of programmable logic integrated circuits has 
accelerated considerably. Field-programmable gate array technology was introduced 
in the mid-1980s as a new technology for implementing digital logic. FPGAs 
were capable of implementing significantly more logic than PLDs, especially because 
they could implement multi-level logic, while most PLDs were optimised for 
two-level logic [HAU98]. 
A general description of different types of FPGAs is given and a more detailed 
description of the XCV1600-E type of FPGA is presented because this programmable 
device has been used for the realisation and evaluation of this work. 
2.3 Structure of the FPGAs 
FPGAs comprise two major configurable elements as shown in Figure 2.1 [HAU98]: 
• Configurable logic blocks (CLBs) which provide the functional elements for 
constructing logic. 
• Input/output blocks (IOBs) that provide the interface between the package pins and 
the CLBs. 
Figure 2.1: FPGA architecture (labelled elements: Logic Block, I/O Block) 
The numbers of CLBs and IOBs vary between FPGA types and between 
vendors. The main FPGA manufacturers are Xilinx [XILWWW] and Altera 
[ALTWWW], although other manufacturers exist. 
2.4 Different Types of FPGAs 
FPGAs are completely prefabricated, and contain special features for customisation. 
These customisation points are normally either SRAM cells, EPROM, EEPROM or 
anti-fuses [HAU98]. 
Anti-fuse FPGAs are one-time programmable devices, so they cannot be used for 
Active Networking, since reprogrammability is a basic characteristic of these 
networks. 
SRAM-based FPGAs are reprogrammable and for this reason they are widely used in 
numerous applications. Active Networks are an example of their widespread use. 
SRAM-based FPGAs have to be reprogrammed every time the system (in which the 
device is hosted) is turned on. 
EPROM/EEPROM FPGAs are devices somewhere between the anti-fuse and the 
SRAM-based FPGAs. Their basic characteristic is that the programming of the 
EPROM/EEPROM is retained even when the power is turned off. 
2.5 The XCV1600-E Field Programmable Gate Array 
2.5.1 Architectural Description 
This FPGA is manufactured by Xilinx [XILWWW]. As mentioned before, an FPGA 
consists of two basic programmable elements, CLBs and IOBs. The Virtex-E 
architecture is shown in the following figure [XPS02]: 
Figure 2.2: The Virtex-E architecture overview 
A general routing matrix (GRM) is used to interconnect CLBs. The GRM comprises 
an array of routing switches located at the intersections of horizontal and vertical 
routing channels. Each CLB nests into a VersaBlock that also provides additional 
routing resources to connect the CLB to the GRM. 
The Virtex-E architecture also includes the following circuits that connect to the 
GRM [XPS02]: 
• Dedicated block memories of 4096 bits each, 
• Clock DLLs for clock distribution delay compensation and clock domain control, 
• 3-state buffers (BUFTs) associated with each CLB that drive dedicated segmentable 
horizontal routing resources. 
Values stored in static memory cells control the configurable logic elements and 
interconnect resources. These values are loaded into the memory cells on power-up and can 
be reloaded if necessary to change the function of the device. 
2.5.2 Configurable Logic Blocks 
The main component of the Virtex-E CLB is the logic cell (LC). It includes a 4-input 
function generator, carry logic and a storage element. The output from the function 
generator in each LC drives both the CLB output and the D input of the flip-flop. 
Each CLB contains four LCs, organised in two similar slices (Figure 2.3) [XPS02]. 
Figure 2.3: Two-slice Virtex-E Configurable Logic Block 
In addition to the four basic LCs, each CLB of this type of FPGA contains logic that 
combines function generators to provide functions of five or six inputs. 
2.5.3 Input/Output Blocks 
IOBs provide the interface between the package pins and the CLBs. The block 
diagram of an IOB is shown in the next figure [XPS02]: 
Figure 2.4: Virtex-E Input/Output Block 
Each IOB features SelectI/O+ inputs and outputs that support a wide variety of I/O 
signal standards. 
2.5.4 Look-Up Tables 
Virtex-E function generators are implemented as 4-input look-up tables (LUTs). Each 
LUT can provide a 16 x 1-bit synchronous RAM. Also, the two LUTs within a slice 
can be combined to create a 16 x 2-bit or 32 x 1-bit synchronous RAM, or a 16 x 1-bit 
dual-port synchronous RAM. A LUT can also provide a 16-bit shift register. 
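The dual role of a LUT, as a function generator and as a small memory, can be illustrated with a short software model. The class below is a sketch for explanation only and does not describe the Virtex-E circuitry; the name Lut4 and its methods are invented for the example.

```python
from typing import Callable


class Lut4:
    """A 4-input look-up table modelled as 16 one-bit memory cells."""

    def __init__(self, contents: int = 0) -> None:
        self.cells = contents & 0xFFFF  # bit i holds the output for address i

    def read(self, a3: int, a2: int, a1: int, a0: int) -> int:
        """The four inputs form an address that selects one stored bit."""
        address = (a3 << 3) | (a2 << 2) | (a1 << 1) | a0
        return (self.cells >> address) & 1

    def write(self, address: int, bit: int) -> None:
        """Used as a 16 x 1-bit RAM: overwrite the cell at `address`."""
        mask = 1 << address
        self.cells = (self.cells | mask) if bit else (self.cells & ~mask)

    @classmethod
    def from_function(cls, fn: Callable[[int, int, int, int], int]) -> "Lut4":
        """Fill the 16 cells with the truth table of any 4-input function."""
        contents = 0
        for address in range(16):
            bits = [(address >> i) & 1 for i in (3, 2, 1, 0)]
            if fn(*bits):
                contents |= 1 << address
        return cls(contents)


# A 4-input XOR (parity) function generator.
xor4 = Lut4.from_function(lambda a, b, c, d: a ^ b ^ c ^ d)

# The same structure used as a small RAM.
ram = Lut4()
ram.write(5, 1)
```

Any Boolean function of four inputs corresponds to one particular setting of the 16 cells, which is why the same structure can serve either as logic or as storage.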
2.5.5 Configuration Modes 
Virtex-E FPGAs are configured by loading configuration data into the internal 
configuration memory. The configuration data are usually referred to as the "bitstream". 
Each bitstream is produced using appropriate software tools; this procedure is 
described in Chapter 4. Some of the pins that are used for the configuration of the 
FPGA device are dedicated pins, while others can be re-used as general purpose 
inputs and outputs once configuration is complete. 
Virtex-E supports the following configuration modes [XPS02]: 
• Slave-serial mode 
• Master-serial mode 
• SelectMAP mode 
• Boundary-scan mode (JTAG) 
In the slave-serial mode, the FPGA receives configuration data in bit-serial form from 
a serial PROM or other source of serial configuration data. 
Figure 2.5: Slave-Serial Configuration Mode 
In the master-serial mode, the configuration data is sent in bit-serial form. Here, the 
FPGA provides the control logic and drives the configuration clock. 
Figure 2.6: Master-Serial Configuration Mode 
The SelectMAP mode is the fastest configuration option. The configuration clock is 
provided by external logic and data is sent one byte per configuration clock. 
Figure 2.7: SelectMAP Configuration Mode 
In the boundary-scan mode, external logic is required. Control signals and data are 
presented on the boundary scan pins. Data is loaded one bit per configuration clock. 
[Figure labels: Serial Data, DATA, FPGA, CCLK, Control Logic]
Figure 2.8: Boundary-Scan Configuration Mode
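The difference between the bit-serial modes and SelectMAP can be made concrete with a
rough estimate of the number of configuration clock cycles required. The sketch below
assumes only what is stated above (one bit per clock for the serial and boundary-scan
modes, one byte per clock for SelectMAP). It ignores start-up and synchronisation
overhead, and the names, bitstream size and clock frequency are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bits transferred per configuration clock, as described in the text. */
enum config_mode {
    CFG_BIT_SERIAL = 1, /* slave-serial, master-serial, boundary-scan */
    CFG_SELECTMAP  = 8  /* one byte per configuration clock */
};

/* Number of configuration clock cycles needed to load the bitstream
   (rounded up to a whole number of transfers). */
static uint64_t config_cycles(uint64_t bitstream_bits, enum config_mode mode)
{
    uint64_t width = (uint64_t)mode;
    return (bitstream_bits + width - 1u) / width;
}

/* Transfer time in milliseconds for a given configuration clock. */
static double config_time_ms(uint64_t bitstream_bits, enum config_mode mode,
                             double cclk_hz)
{
    assert(cclk_hz > 0.0);
    return 1000.0 * (double)config_cycles(bitstream_bits, mode) / cclk_hz;
}
```

For the same clock frequency, the byte-wide mode needs one eighth of the cycles of a
bit-serial mode, which is why SelectMAP is described as the fastest option.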
2.6 FPGAs and Active Networks 
2.6.1 Motivation for using FPGAs in Active Networks 
To implement an Active Network, conventional routers have to be augmented with the
capability to execute customised code, which goes beyond common header processing
and signalling. Currently, many experimental projects achieve this by using PC-based
routers. PC-based routers lack performance compared to traditional routers (e.g. Cisco
routers), due to the nature of their operating system and the characteristics of their
hardware. Traditional routers can service packets in the Gb/s range but are vertically
integrated devices and cannot be modified to support new functions or protocols.
PC-based routers used for Active Networking are, however, a relatively low-cost
solution that provides high flexibility.
The processing power of a PC-based Active Router's CPU is shared between the
operating system and the processing of the active packets [WOLF99]. The limited
computational power of the workstation restricts the achievable data rate to between
a few Mb/s [WOLF99] and 1 Gb/s [BIA05].
The authors in [BIA05] present the performance evaluation of a PC-based router
running Linux, in terms of the maximum forwarding rate. When using minimum-size
Ethernet frames (64 bytes), the maximum forwarding rate is around 600,000 packets/s
(307.2 Mb/s), but it can reach 1 Gb/s when using larger packets.
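The quoted bit rate follows directly from the packet rate and the frame size:
600,000 packets/s x 64 bytes x 8 bits = 307.2 Mb/s. The short C sketch below reproduces
this arithmetic; the function names are hypothetical, and the calculation counts frame
bytes only (preamble and inter-frame gap are ignored).

```c
#include <assert.h>
#include <stdint.h>

/* Bit rate carried by a stream of equally sized frames. */
static uint64_t throughput_bps(uint64_t packets_per_s, uint64_t frame_bytes)
{
    return packets_per_s * frame_bytes * 8u;
}

/* Packet rate needed to fill a link of the given bit rate
   with frames of the given size (rounded down). */
static uint64_t packets_per_s_for(uint64_t link_bps, uint64_t frame_bytes)
{
    assert(frame_bytes > 0u);
    return link_bps / (frame_bytes * 8u);
}
```

With maximum-size 1518-byte Ethernet frames, filling a 1 Gb/s link requires only about
82,000 packets/s, far below the 600,000 packets/s measured for minimum-size frames. This
is consistent with the observation that larger packets allow the router to reach 1 Gb/s.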
The performance of a PC-based router will drop when it acts as an Active Router,
since Active Routers perform computations on the data of the packets, as well as
routing them. In conventional routers, the cost of forwarding a single packet is usually
referred to as the per-packet cost. A new metric is introduced for Active Routers, the
per-byte cost [WF00]. The per-packet cost is independent of the packet size because
the handling and routing of a packet consist of software operations, such as
decrementing the TTL and finding the next hop, that need the same number of CPU
cycles for every packet. In contrast, the per-byte cost depends on the complexity of the
Active Application and is proportional to the size of the packets.
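The two metrics can be combined into a simple linear cost model. The C sketch below is
an illustration only: the structure, the function names and any cycle counts supplied
to it are hypothetical and are not measurements from [WF00].

```c
#include <assert.h>

/* Linear processing-cost model for one packet. */
typedef struct {
    double per_packet_cycles; /* fixed cost: TTL update, next-hop lookup, ... */
    double per_byte_cycles;   /* cost per payload byte, set by the Active Application */
} cost_model_t;

/* CPU cycles needed to handle one packet with the given payload size. */
static double packet_cycles(const cost_model_t *m, unsigned payload_bytes)
{
    return m->per_packet_cycles + m->per_byte_cycles * (double)payload_bytes;
}

/* Upper bound on the packet rate a CPU of the given clock rate can sustain. */
static double max_packets_per_s(const cost_model_t *m, unsigned payload_bytes,
                                double cpu_hz)
{
    double cycles = packet_cycles(m, payload_bytes);
    assert(cycles > 0.0);
    return cpu_hz / cycles;
}
```

With per_byte_cycles set to zero the model describes a conventional router, whose cost
does not change with packet size. Any non-zero per-byte cost makes large packets
proportionally more expensive, which is the behaviour described above for Active Routers.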
"For the evolving field of Active Networks, their performance is crucial to be 
competitive with respect to the end-to-end performance of current networks. This can 
be achieved partially by providing superior functionality" [WOLF99]. 
Superior functionality can be achieved using FPGAs. These devices have three major 
advantages: 
• They are reprogrammable, so they can be configured on the fly, providing different 
services to the packets. The reprogrammability of the FPGAs allows one to download 
algorithms onto the FPGAs, and change these algorithms just as general-purpose 
computers can change programs [HAU98]. 
• They can accelerate Active Applications by many orders of magnitude compared to
workstation CPUs. Their processing power is comparable to that of ASICs
(Application Specific Integrated Circuits). The degree of speed-up depends on the
amount of parallelism in the application.
• They provide a safer and more restricted execution environment for the Active 
Applications than the address space of a host. Also, in the case of a malicious 
application, a hardware watchdog monitor could quickly and safely isolate the 
offending application from the rest of the system. Any damage can be limited to the 
FPGA or the board that carries it. However, a malicious application could potentially 
physically destroy the FPGA device, as will be described in the next section. 
2.6.2 Reconfigurable Hardware Security 
Reconfigurable hardware gives the ability to accelerate network applications by 
orders of magnitude and provide different services to packets. Modules implemented 
in hardware that originated from a foreign source could be installed in the Active 
Router. Intuitively, such a run-time environment could also allow the implementation
of hardware equivalents of computer viruses that attack the system at a much lower
level, namely the electrical signals level [HUS99]. Attacks at this level can physically
destroy the programmable device (FPGA). The only relevant research found by the
author on hardware security is presented in [HAD99] and is briefly described here.
Three categories have been defined based on the type of threat to a reconfigurable 
system: 
• Electrical Signals Level: The attacker can create electrical conflicts either inside the 
device or at pins connecting the attached device to other components of the system. 
The goal of this attack is to physically destroy system components. This is called 
Malicious Electrical Level Threat (MELT) [HAD99]. 
• Logic Signals Level: The attacker generates signals, which are electrically correct, 
but logically make no sense to other devices. This is called Signal Alteration Logic 
Threat (SALT) [HAD99]. 
• Software-like attacks: The attacker may generate legitimate cycles, which together 
compose the execution of a malicious task. This attack level is equivalent to attacks in 
software systems. This is called Higher Abstraction Level Threat (HALT) [HAD99]. 
MELT is the most destructive attack. SALT attacks may cause unpredictable
behaviour in the system, while HALT attacks should be treated as equivalent to
malicious software code and are thus not FPGA-specific. Prevention and detection
methods for MELT and SALT attacks are described in detail in [HUS99] and
[HAD99].
2.7 Summary
FPGAs are reprogrammable devices that can be used in numerous applications. There
are three types of FPGAs: anti-fuse, SRAM-based and EPROM/EEPROM FPGAs.
The SRAM-based FPGAs are widely used because they are reprogrammable devices.
Active Networks can benefit from the use of FPGAs because these devices are
reprogrammable, so Active Applications can change on the fly; they can accelerate
Active Applications by orders of magnitude; and they provide a safer execution
environment.
Hardware security is an important issue. There are three levels of attack that a
hardware system can suffer, defined in [HAD99]: the Malicious Electrical Level
Threat (MELT), the Signal Alteration Logic Threat (SALT) and the Higher
Abstraction Level Threat (HALT).
Chapter 3 
Relevant Research in Active Networks 
3. Relevant Research in Active Networks 
3.1 Chapter Summary 
This chapter is divided into two parts: the first part presents some work done in the 
field of Active Networks by several research groups, while the second one refers to 
applications and services that can be applied in Active Networks. 
3.2 Active Network Projects 
3.2.1 The Active IP Option
In this work [WETTEN96], passive packets are replaced with active capsules, packets
that carry code fragments that are executed at each Active Node they traverse. The
options mechanism of the IP network layer is extended to carry code fragments as
shown in Figure 3.1 [WETTEN96]. Code is written using the scripting language Tcl.
[Figure labels: IP Header, IP options (IPv4/IPv6)]
Figure 3.1: Format of the Active IP Option Field
The Active IP Option field provides a mechanism for embedding a program fragment
in an IP datagram. These fragments are then executed by Active Routers along the
path taken by the datagram.
3.2.2 Active Network Encapsulation Protocol (ANEP) 
ANEP [AGG97] specifies a mechanism for encapsulating Active Network frames for 
transmission over different media. The suggested format allows use of an existing 
network infrastructure (such as IP [RFC1883]) or transmission over the link layer.
The ANEP header is shown in the figure below: 
    Version | Flags | Type ID
    ANEP Header Length | ANEP Packet Length
    Options
    Payload
Figure 3.2: The ANEP header
The Version field indicates the header format in use; its current value is 1. The Flags
field is 8 bits long and specifies what the Active Node should do if it does not recognise
the Type ID. The ANEP Header Length field specifies the length of the ANEP header in
32-bit words. The Type ID field indicates the evaluation environment of the message,
so that the Active Node can evaluate the packet in the proper environment. Type ID
numbers are issued to interested parties by the Active Network Assigned Numbers
Authority (ANANA). The ANEP Packet Length field specifies the length of the entire
packet. The Options field specifies packet characteristics that are meaningful within a
specified evaluation environment.
ANEP is used by various research projects, including [CFS99], [HMA99] and
[TSC98].
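A minimal parsing sketch for the fixed part of the header is given below. Only the 8-bit
Flags field, the version value 1 and the 32-bit-word unit of the header length are taken
from the description above. The remaining field widths (8-bit Version, 16-bit Type ID,
16-bit lengths), the network byte order and all names are assumptions made for
illustration and should be checked against [AGG97].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed part of the ANEP header (field widths partly assumed, see text). */
typedef struct {
    uint8_t  version;          /* header format in use, 1 per the text */
    uint8_t  flags;            /* 8 bits, per the text */
    uint16_t type_id;          /* selects the evaluation environment */
    uint16_t header_len_words; /* ANEP header length in 32-bit words */
    uint16_t packet_len;       /* length of the entire packet */
} anep_header_t;

/* Read a 16-bit big-endian (network order) value. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the fixed header; returns 0 on success, -1 if the buffer is
   too short or the header is inconsistent. */
static int anep_parse(const uint8_t *buf, size_t len, anep_header_t *h)
{
    if (len < 8u)
        return -1;
    h->version          = buf[0];
    h->flags            = buf[1];
    h->type_id          = rd16(buf + 2);
    h->header_len_words = rd16(buf + 4);
    h->packet_len       = rd16(buf + 6);
    if (h->version != 1u)
        return -1;
    if (h->header_len_words < 2u)                 /* fixed part is two words */
        return -1;
    if ((size_t)h->header_len_words * 4u > len)   /* options must fit */
        return -1;
    return 0;
}

/* Byte offset of the payload: the header length is given in 32-bit words. */
static size_t anep_payload_offset(const anep_header_t *h)
{
    return (size_t)h->header_len_words * 4u;
}
```

Because the header length is expressed in 32-bit words, a node that does not understand
the Options can still locate the payload by skipping header_len_words x 4 bytes.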
3.2.3 Smart Packets for Active Networks 
Smart Packets [SJS99] focuses on applying Active Network technology to network
management and monitoring. Its authors aimed to bring active technology into
network management and to make management nodes programmable. The advantages
of this approach include a reduction in the traffic sent back to the management centre
and in the amount of data requiring examination.
Smart packets are encapsulated into an ANEP header that is further encapsulated in an 
IP header as shown in Figure 3.3 [SJS99]: 
[Figure labels: IP Header, Router Alert Option, ANEP header (Ver, Flags, Type ID,
Header Length, Packet Length), Source Identifier, Destination Identifier, Sequence
Number, Smart Packet payload]
Figure 3.3: Format of the Smart Packet
Smart Packets use the Router Alert Option (placed in the IP Options header) to
differentiate themselves from passive packets. Two languages are used in this project:
Sprocket, a high-level language similar to C, and Spanner, an assembly language.
3.2.4 PLANet: An Active Internetwork 
The basic characteristic of PLANet [HMA99] is that all active packets contain programs
(active capsules). Programs are written in a special programming language called
PLAN (Packet Language for Active Networks), which is itself implemented in OCaml.
PLAN is resource-limited by design, which ensures that programs always terminate
and that packets and their descendants visit only a fixed number of nodes.
3.2.5 LARA (Lancaster Active Router Architecture) 
LARA [CFS99] is a composite hardware/software architecture. It is divided into four
parts: Cerberus, a first prototype of the LARA concept; the LARA Platform
Abstraction Layer (LARA/PAL); the LARA Management and Policy Database
(LARA/MAN); and the LARA Runtime Execution Environment (LARA/RT). LARA
provides primitives such as CPU scheduling and multithreading support, memory
management, network bandwidth management and the enforcement of policy decisions.
3.2.6 The Phoenix Framework - Intel Corporation
The Phoenix framework [PUBA00] defines a set of open, safe, Java-based interfaces. It
implements a mobile agent system to provide a flexible execution environment. It also
ensures that dynamically loaded code is only allowed access to the resources it needs
and that all access is properly authorised. The Phoenix framework uses Java
technology because Java is an object-oriented language that inherently supports
security. Some applications that can be implemented with this framework are
Scriptable Remote Network Management, Congestion Analysis and Intrusion
Detection.
3.2.7 The Programmable Protocol Processing Pipeline (P4) 
P4 [HAS97] is a run-time reconfigurable FPGA board. It allows the implementation
of protocol-processing algorithms in hardware. It is designed to operate on an OC-3
(155 Mb/s) ATM link. Its architecture comprises a set of RAM-based FPGA devices
in a pipeline, with a switching array selecting which devices are engaged in
processing a data stream.
A (non-active) FEC (Forward Error Correction) algorithm has been implemented on 
the P4 architecture. 
[Figure labels: ATM link (in and out), header fields forwarding, bypass FIFO,
Switching Array, Controller; control paths and data paths are drawn separately]
Figure 3.4: The P4 Architecture
3.2.8 The Active Network Processing Element (ANPE) 
ANPE [WOLF99] is a hardware architecture that can operate at link speeds in the 
range of Gb/s. Its hardware is based on FPGA technology and is designed to be used 
on line cards of common routers and on network interface cards of workstations. 
The layout of the ANPE is shown below [WOLF99]: 
[Figure labels: four ANPEs (A to D), each with a CPU, Cache, FPGA, Memory, APIC and
Bus Interface (BI), attached to an ATM "Backplane" and linked to other ANNs]
Figure 3.5: The ANPE Architecture
3.2.9 The Flexible High Performance Platform (FHiPPs) 
The hardware platform FHiPPs [HMP99] is an approach that combines the flexibility 
of standard microprocessor technology with the performance of FPGA designs and 
digital signal processors. FHiPPs is composed of several units that can operate 
concurrently. Its hardware architecture is shown in the figure below [HMP99]: 
[Figure labels: FPGA board (several FPGAs, glue logic, MACH4-256, FIFO), SAR
Controller, I/O processor, 32 MB DRAM, FIFO, TI DSP (TMS320C80)]
Figure 3.6: The FHiPPs Architecture
The parts that form FHiPPs are a CPU with local memory, a DSP (Digital Signal 
Processing) unit and several FPGAs. 
3.2.10 The Field Programmable Port Extender (FPX) 
The FPX platform [LOCK01] includes high-speed network interfaces, multiple banks
of memory, and Field Programmable Gate Arrays (FPGAs). It allows reprogrammable
hardware modules to be dynamically installed into a router or firewall through the use
of full or partial reprogramming of an FPGA. Applications that have been developed
for the FPX include Internet packet routing, data queuing and application-level
content modification [LOCK01]. The FPX implements all logic using two FPGA
devices: the Network Interface Device (NID) and the Reprogrammable Application
Device (RAD). The interconnection of the RAD and NID to the network and memory
components is shown in Figure 3.7 [LOCK01]. The NID controls how packet flows
are routed to and from modules. It also provides mechanisms to dynamically load
hardware modules over the network and into the router. The combination of these
features allows modules to be dynamically loaded without affecting the switching of
other traffic flows or the processing of packets by the other modules in the system
[LOCK01]. The RAD contains the modules that implement customised packet
processing functions. Each module on the RAD connects to one Static RAM (SRAM)
and to one wide Synchronous Dynamic RAM (SDRAM) [LOCK01].
Figure 3.7: NID and RAD Configuration 
3.3 Applications and Services that can be applied in Active Networks 
3.3.1 Active Reliable Multicast 
Multicast is a difficult problem in conventional networks. The main difficulty is
NACK (negative acknowledgement) implosion, which occurs when servers, or the
network in general, are overloaded by simultaneous requests from a large number of
receivers. Also, receivers in a multicast group may experience different packet loss
rates depending on their locations in the multicast tree [LESG98].
In the active multicast approach, NACK implosion can be controlled if, for example,
an Active Node caches NACKs and retransmits them only when necessary. Such an
approach is described in [LESG98].
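One way an Active Node could limit NACK implosion is to remember which sequence
numbers it has already requested and to forward only the first NACK for each of them.
The C sketch below illustrates this idea under stated assumptions: the table size, the
names and the replacement policy are hypothetical, and the code is not the scheme
specified in [LESG98].

```c
#include <assert.h>
#include <stdint.h>

#define NACK_TABLE_SIZE 64u /* hypothetical number of tracked sequence numbers */

/* Small direct-mapped table of NACKs already forwarded upstream. */
typedef struct {
    uint32_t seq[NACK_TABLE_SIZE];
    int      used[NACK_TABLE_SIZE];
} nack_table_t;

/* Returns 1 if this NACK should be forwarded towards the sender,
   0 if an identical NACK is already pending and can be suppressed. */
static int nack_should_forward(nack_table_t *t, uint32_t seq)
{
    unsigned slot = seq % NACK_TABLE_SIZE;
    if (t->used[slot] && t->seq[slot] == seq)
        return 0;          /* duplicate request from another receiver */
    t->used[slot] = 1;     /* remember it (overwrites an older entry) */
    t->seq[slot] = seq;
    return 1;
}

/* Called when the repair packet for seq passes through the node,
   so that later losses of the same packet can be reported again. */
static void nack_repair_seen(nack_table_t *t, uint32_t seq)
{
    unsigned slot = seq % NACK_TABLE_SIZE;
    if (t->used[slot] && t->seq[slot] == seq)
        t->used[slot] = 0;
}
```

With such a filter at each Active Node, the sender receives at most one NACK per lost
packet from each subtree, regardless of how many receivers below that node reported
the loss.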
3.3.2 Active Anycast 
In conventional networks, anycasting is performed between a client and the closest
server. In active anycast, a client can communicate (via an active node) with an
adequate server from the viewpoint of load balancing among candidate servers
[YAM01]. An approach to active anycasting, its problems and some possible
solutions are described in [YAM01].
3.3.3 Forward Error Correction (FEC) 
An FEC algorithm with adaptive code length and burst-error capability can be applied
to active packets where and when it is required. This requires a decision algorithm that
determines when FEC is applied and which kind of FEC algorithm (length etc.) is
used, and a monitoring tool that monitors the network bandwidth. Such an algorithm
can be applied on links with adequate bandwidth to strengthen error detection and
correction. The trade-off between latency (FEC adds an overhead to each packet)
and the bit error rate should be defined. For example, applying a strong FEC
mechanism to a stream with a low bit error rate (assuming there is enough bandwidth)
will reduce the errors in the packets but will add a large overhead to each packet, so
the latency will increase and packet losses may occur [PAW01].
3.3.4 Cryptography 
Several cryptographic algorithms can be applied to packets on demand. Active 
Routers can be placed at strategic locations in the network (e.g. between a trusted and 
a non-trusted local area network) and encrypt the packets. 
3.3.5 Active Firewall 
Firewall security policies can be injected into a stream of packets that is sent to an
Active Router. The Active Router can then protect a LAN (local area network) by
acting as a firewall. Security policies can be defined by a network administrator. A
proposal for such an application is presented in [ALML01].
In [VAN97], Active Networking is used for defence against address spoofing.
3.3.6 Mixing Sensor Data 
In this application, an Active Node uses fusion to mix data transmitted from
different sources such as microphones, antennas collecting radio signals and devices
measuring emissions of pollutants [LWG98]. If the mixed signal is smaller than
the sum of its constituents, the network traffic is reduced. Fusion also reduces the
bandwidth and processing needed at the end nodes.
3.3.7 Active Networks in Telephony 
The author in [MAX99] describes how the Active Networking approach can be 
applied in telephony. If a network were able to transmit voice data, it would be 
cheaper than the telephone network. An active switching mechanism could redirect 
data from a telephone network to a data network. If congestion occurs in the data 
network, the same data could be redirected back to the telephone network [MAX99]. 
Several codecs can be implemented in the Active Nodes depending on the desired 
quality of service. 
3.3.8 Virtual Active Networks 
The work presented in [SU00] describes the Virtual Active Network (VAN)
architecture. A VAN is a dynamically constructed virtual network that provides
application-specific services, such as web caching, multicasting and transcoding. The
goal of a VAN is to enable large-scale network applications to control and configure
network topology and resources to best support their needs.
3.3.9 Active Caching 
Intermediate Active Nodes can cache some of the data they forward. In some cases
they can act on behalf of a server and provide these data to the clients that
request them.
Active caching has several potential benefits [LWG98]: 
• It can decrease the latency observed by the clients, 
• It can decrease the traffic at routers, 
• It can decrease the load on the server. 
There are several algorithms and protocols under investigation for path selection and 
caching in Active Networks that will maximise the chances that a request for data will 
be routed along a path that is likely to quickly intersect a cache with a copy of the data 
[LWG98]. 
Active Cache, described in [CZB98], allows dynamic content to be cached at Web
proxies through the use of Java applets. Proxies act on behalf of Web servers,
providing data to clients, saving network bandwidth and increasing network
performance.
An adaptive and distributed cache protocol is presented in [BLP02]. Every Active
Node in the network can be used as a cache. Proxies translate HTTP requests sent by
clients into a defined active cache protocol. As soon as an HTTP request reaches an
Active Router with a proxy installed, it is translated into an active request and
forwarded to a proxy close to the server, where it is translated back into an HTTP
request [BLP02].
3.4 Summary 
Several Active Network projects have been described in this chapter. Some of them
use FPGA devices to increase the performance of the Active Nodes. A major
concern is to provide security as well as resource management.
In the second part of the chapter, several applications and services that can be
applied in Active Networks have been briefly described, including reliable
multicast, adaptive Forward Error Correction and anycast.
Most of the services applied in Active Networking aim to make conventional networks
more flexible and to overcome some of their limitations and problems, such as the
NACK implosion problem.
Chapter 4 
Active Protocol-Active Applications 
and the Hardware Element of the 
Active Router 
4. Active Protocol, Active Applications and the 
Hardware Element of the Active Router 
4.1 Chapter Summary 
This chapter describes the protocol used for the active packets, the format of the 
Active Applications (AAs) and the hardware element of the Active Router (AR). 
Also, an application that performs encryption/decryption using the DES (Data 
Encryption Standard) and executes on the FPGA board is presented. 
4.2 The Active Protocol 
Active packets have to be distinguished from passive packets; therefore an Active
Header is necessary. There are several proposals [HMA99], [SJS99], [WETTEN96]
regarding the format of an active packet.
In the very early days of this project, a method similar to that described in
[WETTEN96] was used. The IP options field is a convenient place to put an Active
Header (maximum size should be 20 bytes).
[Figure labels: IP | IP Options / ACT | TCP or UDP | Data; the IP Header includes the
Active Header]
Figure 4.1: Placing the Active Header in the IP Options Field
Currently, there are 24 "well-known" IP option numbers (e.g. 7 for record route),
so a new IP option number used for Active Networking should be greater than 25.
Compatibility with legacy routers is guaranteed, since packets with unrecognised IP
options are simply forwarded to the next hop.
In order to be protocol-compatible with the IFAN project [SAN03] (a project within 
the same research group), the following format for the active packets is adopted: 
    IP | UDP1 | ACT | IP | UDP2 or TCP | Payload
    (the inner IP header, the UDP2 or TCP header and the payload form the passive packet)
Figure 4.2: Wrapping Passive Packets into Active Packets
Passive packets are encapsulated in UDP packets as shown above. The Active Header 
(ACT) carries the following information: 
Active Header
    Name         Size (bits)
    Type         8
    seq_no       8
    Options      16
    GMID         32
    last_node    32
Table 4.1: The Active Header
Active packets are distinguished from passive packets using the destination UDP port 
44075 (active port) in the first transport header (Figure 4.2). The second transport 
header (UDP or TCP) is application-specific. The format of the Active Header is 
shown in Table 4.1. 
The type field specifies the type of the active packet. The types currently defined
are:
i) Type 0: UDP Active Packets (not encapsulated packets),
ii) Type 1: TCP Active Packets (wrapped TCP packets),
iii) Type 2: UDP Active Packets (wrapped UDP packets).
Type 0 packets do not encapsulate a passive packet.
The sequence number (seq_no) is used to aid the introduction of reliability schemes 
where reliability is required. The options field can be used to specify options to code 
modules. The GMID field specifies the Global Module Id that refers to the code 
module that this packet should be handled by. It is unique for an AA. The last active 
node (last_node) field specifies the last Active Node that the packet passed through 
[SAN03]. 
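The header of Table 4.1 occupies 96 bits (12 bytes). A possible C representation and
parser are sketched below. The field sizes, the three packet types and the active port
44075 come from the text above; the field order on the wire (taken to follow the table),
the network byte order and all identifiers are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACTIVE_UDP_PORT 44075u /* destination port of the first UDP header */
#define ACT_HEADER_LEN  12u    /* 8 + 8 + 16 + 32 + 32 bits */

enum act_type {
    ACT_UDP_PLAIN   = 0, /* UDP Active Packet, nothing encapsulated */
    ACT_WRAPPED_TCP = 1, /* wrapped TCP packet */
    ACT_WRAPPED_UDP = 2  /* wrapped UDP packet */
};

typedef struct {
    uint8_t  type;      /* one of enum act_type */
    uint8_t  seq_no;    /* aids reliability schemes */
    uint16_t options;   /* options for the code module */
    uint32_t gmid;      /* Global Module Id of the handling module */
    uint32_t last_node; /* last Active Node the packet passed through */
} act_header_t;

/* Read a 32-bit big-endian (network order) value. */
static uint32_t rd32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* A packet is active if the first transport header targets the active port. */
static int is_active_port(uint16_t udp_dst_port)
{
    return udp_dst_port == ACTIVE_UDP_PORT;
}

/* Parse the Active Header; returns 0 on success, -1 on error. */
static int act_parse(const uint8_t *p, size_t len, act_header_t *h)
{
    if (len < ACT_HEADER_LEN)
        return -1;
    h->type      = p[0];
    h->seq_no    = p[1];
    h->options   = (uint16_t)((p[2] << 8) | p[3]);
    h->gmid      = rd32(p + 4);
    h->last_node = rd32(p + 8);
    return (h->type <= ACT_WRAPPED_UDP) ? 0 : -1;
}

/* Types 1 and 2 carry a complete passive packet after the Active Header. */
static int act_wraps_passive_packet(const act_header_t *h)
{
    return h->type == ACT_WRAPPED_TCP || h->type == ACT_WRAPPED_UDP;
}
```

A receiving Active Router would first test the UDP destination port, then parse the
12-byte header and use the GMID to select the Active Application that handles the packet.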
4.3 The Hardware Element of the Active Router 
The hardware element of the AR is a PCI-based (Peripheral Component Interconnect)
FPGA board purchased from [ALDWWW]. It is a high-performance
reconfigurable PMC (PCI Mezzanine Card) based on the Xilinx Virtex-E 1600E
FPGA.
4.3.1 PCI Mezzanine Cards 
Mezzanine is a term used to describe the stacking of computer component cards into a
single card that then plugs into the computer bus or data path. The bus itself is
sometimes referred to as a mezzanine bus. The term derives from the Italian word
mezzano, which means middle [WTWWW]. To satisfy the need for replaceable
PCI-based modules in embedded systems, the IEEE developed the PCI Mezzanine Card
(PMC) specification. The PMC specification, while logically and electrically the
same as the PCI bus, permits placement of the PMC cards parallel to the baseboard.
This made it possible to develop baseboards with connectors that could be populated
as needed with the latest available PMC modules (typically I/O expansion) without
the need to re-spin the base system hardware [IBMWWW].
4.3.2 Layout of the PCI Card 
The block diagram of the board is shown in the following figure [PCIMWWW]: 
[Figure labels: PCI Bus, PCI Interface (PLX9080), four SSRAM banks (256K x 36 each),
Debug Resources, Clock Generator, Virtex FPGA (V400 ... V1000, V405E ... V2000E),
Select I/O, I/O Connector]
Figure 4.3: Block Diagram of the PCI-based FPGA Board
Fundamental parts of the FPGA board are: 
• The PLX 9080 device that provides the interface between the board and the PCI bus, 
• The Xilinx Virtex-E 1600E FPGA that is used to host the AAs, 
• Four independent banks of 256K x 36 bits of synchronous SRAM,
• Two programmable clock generators,
• On-board Flash memory with a capacity of 1 Mbyte.
4.4 Active Applications 
There are two types of Active Applications (AAs): software-only AAs that do not use 
the FPGA and FPGA-AAs that make use of it. 
4.4.1 Software Active Applications 
This kind of application is written using the C programming language and it can 
execute in every AR. 
4.4.2 FPGA Active Applications 
This type of application usually has two parts: a software part and a hardware part.
4.4.2.1 The Software Part of an FPGA-Active Application 
The software part of an FPGA-AA is written in C and uses a software interface
provided by [ALDWWW] to communicate with the FPGA board. The software
interface provides functions such as Open_Card() and Close_Card(). A user-space
process communicates with the FPGA board via the PLX 9080 device, with part of the
memory address space of the process mapped to the FPGA. Data can be transferred to
the FPGA using either memory mapping or DMA (Direct Memory Access); the PLX
provides two DMA channels.
4.4.2.2 The Hardware Part of an FPGA-Active Application 
The hardware part of an FPGA-AA is written in a hardware description language called
VHDL (Very High Speed Integrated Circuit Hardware Description Language).
Another language, Verilog, could also be used.
Every VHDL file is named in the format program.vhd, where program is the name of
the application and vhd is the extension, showing that it is a VHDL file.
A file in the format program.vhd cannot be loaded into the FPGA device as it is. It has
to be transformed into a so-called bitstream file, program.bit. The bitstream is then
used to program the FPGA. The process of transforming the file program.vhd into the
configuration file program.bit is shown by the design flow.
The design flow 
The following figure shows the Xilinx design flow [TXWWW]: 
[Figure labels: Design Entry, Design Verification (Functional Simulation), Design
Synthesis, Design Implementation (Optimization; FPGAs: Mapping, Placement, Routing;
CPLDs: Fitting), Static Timing Analysis, Timing Simulation (Back-Annotation),
Bitstream Generation, Download to a Xilinx Device, In-Circuit Verification]
Figure 4.4: The Xilinx Design Flow
The main parts of the design flow are: 
i) Design Entry. This can be in one of two formats: Schematic entry or HDL 
Entry. Schematic entry is the creation of designs using schematic tools. An 
example of the schematic entry of a comparator is shown below [CSWWW]: 
Figure 4.5: Schematic Entry of a Comparator
If an application is complex, schematic entry is a very time-consuming and
error-prone procedure. HDL entry can be used instead: the design is created using a
hardware description language such as VHDL. The HDL entry of the previous
comparator is shown in Figure 4.6 [CSWWW]:
-- n-bit Comparator (ESD book figure 2.5)
-- by Weijun Zhang, 04/2001
-- this simple comparator has two n-bit inputs &
-- three 1-bit outputs
library ieee;
use ieee.std_logic_1164.all;
entity Comparator is
generic(n: natural := 2);
port( A:       in std_logic_vector(n-1 downto 0);
      B:       in std_logic_vector(n-1 downto 0);
      less:    out std_logic;
      equal:   out std_logic;
      greater: out std_logic
);
end Comparator;
architecture behv of Comparator is
begin
    process(A,B)
    begin
        if (A<B) then
            less <= '1';
            equal <= '0';
            greater <= '0';
        elsif (A=B) then
            less <= '0';
            equal <= '1';
            greater <= '0';
        else
            less <= '0';
            equal <= '0';
            greater <= '1';
        end if;
    end process;
end behv;
Figure 4.6: HDL Entry of the Comparator
ii) Design Verification. In this stage, the design entry is verified using functional
simulation. A testbench is linked to the VHDL code; it generates the input signals,
and the design produces the output signals. By observing the output signals, potential
logical errors can be detected. The functional simulation of the comparator is shown
below [CSWWW]:
[Figure: simulator waveform window showing the signals A, B, equal, greater and less]
Figure 4.7: Functional Simulation of the Comparator
A and B are the inputs generated by a testbench file and equal, greater and less are the 
outputs. 
iii) Synthesis. During Synthesis the VHDL code (HDL entry) is transformed into a 
hardware design (netlist). A netlist is a list of devices and how they are 
interconnected. 
iv) Timing Analysis. Here, the delay from each input to each output for all the devices 
is calculated. Delays are added up along each path through the circuit to get the 
critical path. 
v) Map, Place and Route. The mapping tool collects the netlists into groups that fit 
into the LUTs and then the place and route tool assigns the netlist collections to 
specific CLBs while opening or closing the switches in the routing matrices to 
connect the netlists together. 
vi) Generate the bitstream. The bitstream generator extracts the state of the switches 
in the routing matrices and generates the bitstream. 
The figure below summarises the design flow (except the functional simulation) for an 
FPGA application [XSWWW]: 
[Figure: the VHDL source code is synthesized, placed and routed, and turned into a
bitstream, which is then downloaded to the XSA Board.]
Figure 4.8: Design Flow
A simple example of an FPGA-M would be an adder: the software part feeds the 
FPGA with two numbers; the application hosted in the FPGA adds these numbers and 
returns the sum to the software part. 
An M that has been used to test the AR is a DES (Data Encryption Standard) 
algorithm provided by [FREEWWW]. It is described in the next section. 
4.4.3 A DES Algorithm implemented as an FPGA Application 
This application, like most of the FPGA-Ms, has two parts: des.vhd and
des.c.
Des.c is a program written in C that receives packets from the network, splits the
payload of the packets into 64-bit words and feeds the encryption/decryption
algorithm with data.
Generally, a DES algorithm takes a 64-bit plaintext and a 64-bit key as input and
produces a 64-bit ciphertext as output. Since a 32-bit PCI bus is used, the
plaintext, the key, and the ciphertext each have to be split into two 32-bit words and
passed separately to and from the FPGA through the PCI bus. In Figure 4.9, Data_in1
is the high word (32-bit) of the plaintext, Data_in2 is the low word (32-bit) of the
plaintext, and Key_in1 and Key_in2 are the high and low words of the key respectively.
Encrypt is a flag that specifies whether the algorithm will encrypt or decrypt the data,
since this application is actually an encryption/decryption application.
[Figure: the DES encryption/decryption block with inputs Data_in1, Data_in2, Key_in1,
Key_in2 and Encrypt, and outputs Data_out1 and Data_out2.]
Figure 4.9: Block Diagram of the DES Encryption/Decryption Algorithm
Des.vhd, provided by [FREEWWW], is a DES algorithm in the form of a VHDL
package. A package in VHDL is similar to a Dynamic Link Library (DLL) as used
in Windows programming. It can be called by any VHDL function. This VHDL file
(des.vhd) can be used with more than one type of Virtex FPGA and is not
specifically designed for the board used in this project. For this reason, in its
initial form it does not interface with the PLX 9080 device.
A new VHDL file has been created, which interfaces with the PLX 9080 in order to
transfer data through the PCI bus and uses the package des.vhd to encrypt or decrypt
the data.
[Figure: block diagram showing the host, the PCI bus, the PLX 9080, and the FPGA
containing the process that interfaces to the PLX 9080, the registers RegA to RegE
and the DES block.]
Figure 4.10: Interfacing the DES Algorithm with the Hardware System
4.4.3.1 Interfacing the DES Application with the PLX 9080 
Interfacing the DES algorithm is important, since data cannot be transferred through 
the PCI bus otherwise. 
To avoid confusion during the description of the application, host-process will refer to 
the application running in the Linux user-space (des.o, the executable file of the 
des.c), fpga-application the application that "runs" in the FPGA device and AA the 
whole application (fpga-application plus the host-process application). 
Five registers have been created in order to store the incoming data (plaintext, key,
encrypt flag), as shown in Figure 4.10. The fpga-application receives 32 bits of data
at a time from the host-process, since a 32-bit PCI bus is used.
The AA is implemented as a master-slave application. The master is the host-process,
which sends the plaintext data and then reads the ciphertext data back, and the slave
is the fpga-application, which encrypts the data.
The fpga-application has to know when data are sent from the host-process and when
data have to be read back. It also has to know whether the incoming 32-bit data are
part of the plaintext (64-bit), part of the key (64-bit) or the encrypt flag.
The host-process, as described previously, splits the payload of the packet into 64-bit
words and then feeds these to the fpga-application along with the key and the encrypt
flag. If the payload is not a multiple of 64 bits, it is padded with zeros. The number
of bytes needed for the padding is saved in the Active Header, so that the decryption
module can correctly decrypt the data.
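The splitting and zero-padding step can be sketched in C as follows. This is only an illustration and is not taken from des.c: the function names are hypothetical, and the big-endian byte order used to form the high and low words is an assumption, since the byte order is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DES_BLOCK_BYTES 8u  /* one DES block is 64 bits */

/* Number of zero bytes needed to extend len to a multiple of 8 bytes.
 * This is the value that would be recorded in the Active Header. */
size_t des_pad_bytes(size_t len)
{
    return (DES_BLOCK_BYTES - (len % DES_BLOCK_BYTES)) % DES_BLOCK_BYTES;
}

/* Extract 64-bit block number `index` of the payload as a high and a low
 * 32-bit word. Bytes beyond the end of the payload are read as zero
 * (the padding). Big-endian byte order is an assumption of this sketch. */
void des_get_block(const uint8_t *payload, size_t len, size_t index,
                   uint32_t *hi, uint32_t *lo)
{
    uint8_t block[DES_BLOCK_BYTES] = {0};
    size_t offset = index * DES_BLOCK_BYTES;

    assert(payload != NULL && hi != NULL && lo != NULL);
    if (offset < len) {
        size_t n = len - offset;
        if (n > DES_BLOCK_BYTES)
            n = DES_BLOCK_BYTES;
        memcpy(block, payload + offset, n);
    }
    *hi = ((uint32_t)block[0] << 24) | ((uint32_t)block[1] << 16) |
          ((uint32_t)block[2] << 8)  |  (uint32_t)block[3];
    *lo = ((uint32_t)block[4] << 24) | ((uint32_t)block[5] << 16) |
          ((uint32_t)block[6] << 8)  |  (uint32_t)block[7];
}
```

For a 10-byte payload, des_pad_bytes returns 6, so the payload occupies two 64-bit blocks and the second block carries two payload bytes followed by six zero bytes.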
The sequence for sending data is: 
i) send the low word of the plaintext (32 bits) into the FPGA, 
ii) send the high word of the plaintext, 
iii) send the low word of the key, 
iv) send the high word of the key, 
v) send the encrypt flag. 
The same sequence is followed for every 64 bits of plaintext and for each packet. 
Memory mapping is the method used to send data. The memory address space of the 
host-process is mapped to the FPGA area through the PLX 9080. So, when the host-
process needs to send data, it just writes them to a specific virtual address. A function 
contained in the software interface provided by [ALDWWW] helps the host-process 
to find this address (Appendix A). By writing to that address, the appropriate PCI 
cycles are generated to transfer data to the fpga-application. 
If ADDRESS is the (virtual) address that data have to be written to, the host-process 
writes the input data (shown in Figure 4.9) as follows: 
Address       Data
ADDRESS       Data_in2
ADDRESS+4     Data_in1
ADDRESS+8     Key_in2
ADDRESS+12    Key_in1
ADDRESS+16    Encrypt
Table 4.2: Data and Virtual Addresses
Each item is written at a four-byte offset from the previous one. The addresses above
are virtual addresses and are part of the virtual address space of the host-process.
Each time data are written to one of the virtual addresses, the appropriate PCI cycles
are generated. Data and PCI addresses are multiplexed on the same channel.
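A minimal sketch of the five writes of Table 4.2 is given below. It assumes that `base` holds ADDRESS, i.e. the virtual address obtained through the software interface of Appendix A (not reproduced here); the function and constant names are hypothetical. The pointer is declared volatile so that the compiler does not reorder or merge the stores, which matters for a memory-mapped device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word offsets from ADDRESS, following Table 4.2 (byte offset divided by 4). */
enum {
    OFF_DATA_IN2 = 0,  /* ADDRESS      : low word of the plaintext  */
    OFF_DATA_IN1 = 1,  /* ADDRESS + 4  : high word of the plaintext */
    OFF_KEY_IN2  = 2,  /* ADDRESS + 8  : low word of the key        */
    OFF_KEY_IN1  = 3,  /* ADDRESS + 12 : high word of the key       */
    OFF_ENCRYPT  = 4   /* ADDRESS + 16 : encrypt flag               */
};

/* Write one 64-bit plaintext block, the 64-bit key and the encrypt flag
 * in the order low plaintext, high plaintext, low key, high key, flag. */
void des_write_block(volatile uint32_t *base, uint64_t plaintext,
                     uint64_t key, uint32_t encrypt)
{
    assert(base != NULL);
    base[OFF_DATA_IN2] = (uint32_t)(plaintext & 0xFFFFFFFFu);
    base[OFF_DATA_IN1] = (uint32_t)(plaintext >> 32);
    base[OFF_KEY_IN2]  = (uint32_t)(key & 0xFFFFFFFFu);
    base[OFF_KEY_IN1]  = (uint32_t)(key >> 32);
    base[OFF_ENCRYPT]  = encrypt;
}
```

Without the board, the function can be exercised by passing an ordinary five-word array in place of the mapped region.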
Data and PCI addresses are 32-bit words because a 32-bit PCI bus is used. For this
application, five "PCI writes" (data sent from the host-process to the
fpga-application) are necessary to send the 64 bits of plaintext, the 64 bits of the
key and the 32 bits of the encrypt flag (Appendix A). Data are written as Table 4.2
shows, and the value in bits 4 to 2 of the PCI address is incremented by one for each
write, as shown in Table 4.3:
PCI Address                         Data
xxxxxxxxvvvvvvvvvvvvvvvvvvv001xx    Data_in2
xxxxxxxxvvvvvvvvvvvvvvvvvvv010xx    Data_in1
xxxxxxxxvvvvvvvvvvvvvvvvvvv011xx    Key_in2
xxxxxxxxvvvvvvvvvvvvvvvvvvv100xx    Key_in1
xxxxxxxxvvvvvvvvvvvvvvvvvvv101xx    Encrypt
Table 4.3: Data and PCI Addresses
Bits denoted by "x" are reserved and bits denoted by "v" are don't-care bits (they can
be 1 or 0).
Table 4.3 shows that for every type of data, a different PCI address is used, thus the 
fpga-application is able to know the type of the incoming data by checking the bits 2, 
3, and 4 of the PCI address (Appendix B). It then stores the data into the appropriate 
register (Figure 4.10). 
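The address check is performed in VHDL (Appendix B), but its logic can be modelled in C as a sketch. The structure and function names below are hypothetical, and the mapping of data types onto the registers RegA to RegE of Figure 4.10 is not specified here, so the fields are named after the data they hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of the five input registers of the fpga-application. */
struct des_regs {
    uint32_t data_in2;  /* low word of the plaintext  */
    uint32_t data_in1;  /* high word of the plaintext */
    uint32_t key_in2;   /* low word of the key        */
    uint32_t key_in1;   /* high word of the key       */
    uint32_t encrypt;   /* encrypt flag               */
};

/* Store `data` in the register selected by bits 4..2 of the PCI address,
 * using the codes of Table 4.3. Returns 1 if a register was written and
 * 0 if the code selects none of the five registers. */
int des_store_word(struct des_regs *regs, uint32_t pci_address, uint32_t data)
{
    assert(regs != NULL);
    switch ((pci_address >> 2) & 0x7u) {
    case 1u: regs->data_in2 = data; return 1;  /* 001 */
    case 2u: regs->data_in1 = data; return 1;  /* 010 */
    case 3u: regs->key_in2  = data; return 1;  /* 011 */
    case 4u: regs->key_in1  = data; return 1;  /* 100 */
    case 5u: regs->encrypt  = data; return 1;  /* 101 */
    default: return 0;
    }
}
```

Only bits 4 to 2 take part in the decision, so the reserved and don't-care bits of the address do not affect which register is selected.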
After the fpga-application has received the data, it feeds the DES algorithm, which
produces a 64-bit ciphertext. The host-process is blocked in the meantime, waiting to
receive the data back. Since the AA is a master-slave application, the host-process
initiates the PCI cycles for reading the encrypted data. After the ciphertext has been
produced, the fpga-application "wakes up" the host-process by raising a hardware
interrupt (Appendix B). Now the host-process can read the data back. The ciphertext
is 64 bits long and is transferred using two "PCI reads" (data now flow from the
FPGA to the host-process). The low word of the ciphertext is sent first, followed by
the high word.
Another issue for the fpga-application is that it has to know when a PCI read or a PCI
write takes place. The signals of the local PCI bus (the bus between the PLX
9080 and the FPGA) are shown in Figure 4.11.
The fpga-application contains a VHDL component (Appendix B) that interfaces the 
FPGA with the PLX 9080. The fpga-application can be informed about the PCI 
transaction by checking the local PCI bus signals (Appendix B). 
A brief description of the local PCI bus signals is given in Appendix C. 
[Figure: the PLX 9080 sits between the PCI bus and the local bus to the Virtex FPGA.
The local bus signals shown include LA, LD[31:0], LWRITE, LBLAST, LADSL, LDACK, LBE,
LRESETOL, LCLKA, LBTERML, LREADYL, LEOTL, FHOLD and FHOLDA; an EEPROM is also shown.]
Figure 4.11: Local PCI Bus Signals
The following signals are generated during a PCI write [PLXWWW]: 
[Figure: timing diagram of the signals CLK, FRAME#, AD[31:0], C/BE[3:0]#, IRDY#,
DEVSEL#, TRDY#, LCLK, LHOLD, LHOLDA, ADS#, LW/R#, BLAST#, LA[31:2], LD[31:0] and
READYi#.]
Figure 4.12: Direct Slave Single Cycle Write
The following signals are generated during a PCI read [PLXWWW]: 
[Figure: timing diagram of the signals CLK, FRAME#, AD[31:0], C/BE[3:0]#, IRDY#,
DEVSEL#, TRDY#, LCLK, LHOLD, LHOLDA, ADS#, LW/R#, BLAST#, LA[31:2], LD[31:0] and
READYi#.]
Figure 4.13: Direct Slave Single Cycle Read
How the fpga-application checks these signals is shown in Appendix B. 
4.4.3.2 Producing the Bitstream 
After the VHDL part of the AA has been written, functional simulation is used to test
it. The VHDL code was written using the ModelSim SE software package [MODWWW].
During functional simulation, input signals feed the application under test and the
output signals are observed for potential errors. In VHDL programs, parts of the code
can "execute" at exactly the same time (in parallel), in contrast with programs written
in other programming languages, which execute in the user-space of an operating system.
Also, a programming error in a conventional programming language (C, Java etc.) can
crash the machine in the worst case, but such an error in a VHDL program could
physically destroy the FPGA device. For these reasons, functional simulation is a
very important step before producing the bitstream.
To implement the simulation, two files are needed: the VHDL file of the
fpga-application and a file that generates the input data. The latter will be called
the testbench file. ModelSim is used for the simulation.
[Figure: within ModelSim, the testbench file drives the INPUT of the fpga-application,
whose OUTPUT is observed.]
Figure 4.14: Using ModelSim for Functional Simulation
The dotted arrows symbolise the probes of ModelSim, while the solid arrows symbolise
the input and output data. The testbench file feeds the fpga-application with the
appropriate data. The input or output data can be signals like those shown in Figure
4.12 and Figure 4.13, registers with stored data etc.
The testbench file used to simulate the fpga-application is presented in Appendix D.
A snapshot of the simulation process is shown in Figure 4.15:
Figure 4.15: Snapshot from the Simulation Process 
4.4.4 Naming of the Active Applications 
Applications follow a specific naming convention, which makes it easier for the Active
Router to perform tasks such as loading them into memory, monitoring their status and
downloading them.
The software part of each AA is named actAppl[GMID], where GMID is the Global
Module Identification. So, if the GMID of an AA is 3, its source code is named
actAppl3.c and its executable file actAppl3.o. If it is an FPGA-AA, its bitstream is
named [GMID].bit. The bitstream of the previous AA is named 3.bit.
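As an illustration of the convention, the file names can be derived from the GMID with a small helper. This is a sketch: the function and enumeration names are hypothetical, and the prefix is spelled exactly as it is printed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The three files that can belong to an Active Application. */
enum aa_file { AA_SOURCE, AA_EXECUTABLE, AA_BITSTREAM };

/* Write the file name for the given GMID into buf.
 * Returns 0 on success and -1 if the buffer is too small. */
int aa_file_name(unsigned gmid, enum aa_file kind, char *buf, size_t size)
{
    int n;

    assert(buf != NULL);
    switch (kind) {
    case AA_SOURCE:     n = snprintf(buf, size, "actAppl%u.c", gmid); break;
    case AA_EXECUTABLE: n = snprintf(buf, size, "actAppl%u.o", gmid); break;
    case AA_BITSTREAM:  n = snprintf(buf, size, "%u.bit", gmid);      break;
    default:            return -1;
    }
    /* snprintf reports the length it needed; reject truncated names. */
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Because every name is a function of the GMID alone, the router can locate the source, the executable and the bitstream of an application from the identifier carried in a packet.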
4.5 Summary 
This chapter consists of three major parts. The first part has described the Active 
Protocol and how passive packets are wrapped into active packets. 
The second part has described the hardware component of the Active Router, used in 
this work. This is a PCI-based board that hosts a Virtex-E 1600E FPGA. The board 
interfaces to the PCI bus of the host via the PLX 9080 device. 
The format of the Active Applications is presented in part three. They have two parts: 
a software part, which is written in C and is resident in the memory address space of 
the host, and a hardware part, that is written by using a hardware language such as 
VHDL and is hosted in the FPGA board. 
The last part of the chapter has described a DES encryption/decryption application 
implemented in hardware that will be used as an Active Application. 
Chapter 5 
Architecture of the Active Engine 
5. Architecture of the Active Engine 
5.1 Chapter Summary 
The architecture of the Active Router (AR) comprises two main parts: the Active
Engine (AE) and the Routing and Forwarding Engine (RFE) (Figure 5.1).
Furthermore, the AE consists of two basic elements, the software part and the
hardware part. The software part includes the Linux operating system and the
user-space processes. The hardware part is implemented using a PCI (Peripheral
Component Interconnect) board with a Xilinx XCV1600-E Field Programmable Gate
Array (FPGA).
5.2 Introduction 
The AR comprises two fundamental parts, the RFE and the AE, as shown in Figure
5.1:
Figure 5.1: The Fundamental Parts of the Active Router placed into the Network 
AE and RFE are Linux hosts that provide discrete services to the packets. The AE 
contains the FPGA board and it is the host where Active Applications execute. RFE is 
responsible for forwarding active packets to the AE and routing them back to the 
network, after they have been processed by an Active Application. 
The separation of the Active Router functionality into two hosts is necessary, because 
the routing of the packets has to be maintained at any cost. If a malicious Active 
Application damages the host in which it executes, the damage will be limited to that 
host and will not affect the routing of the packets. If the AE crashes, as the result of a 
bogus Active Application loaded into the FPGA, the RFE will not be affected and the 
routing of the packets will continue as normal. Another benefit of this separation is 
the protection of the AE from high volume active traffic that can cause DoS (Denial of 
Service) attacks. This is achieved by a traffic shaping mechanism implemented in the 
RFE that drops packets before they reach AE. This mechanism is presented in Chapter 
6. 
5.3 The Active Engine (AE) 
The AE consists of two main parts: a software part and a hardware part. The 
hardware part was presented in section 4.3. 
5.3.1 The Software Part of the Active Engine 
Figure 5.2 shows the layout of the software part (when one Active Application is 
loaded): 
[Figure: in user-space, the Core Process with its packet queue, the CPU Monitor, the
Memory Monitor, the Application Loader, the Safety Process, the Active Application and
the Packet Injector; in kernel-space, the Active Filter.]
Figure 5.2: The Software Architecture Layout of the Active Engine
5.3.1.1 The Active Filter (AF) 
The Active Filter is a Loadable Kernel Module (LKM) and its task is to capture active 
packets in kernel-space and forward them into user-space for processing. 
5.3.1.1.1 Loadable Kernel Modules (LKMs) 
LKMs are pieces of code that can be loaded into and unloaded from the kernel on
demand. They extend the functionality of the kernel without the need to recompile it
or reboot the host.
They are mainly used as: 
• Device drivers, software that interfaces a hardware device (network cards, FPGAs 
etc), 
• Filesystem drivers. Such a driver interprets the contents of a filesystem, 
• System calls. User-space programs use system calls to get services from the kernel, 
an LKM can be used to replace a system call or create a new one, 
• TTY line disciplines. These are augmentations of device drivers for terminal devices,
• Netfilter "hooks" to mangle or filter packets that traverse the kernel. 
In this work, LKMs are used as netfilter hooks. 
5.3.1.1.2 Netfilter Hooks 
Netfilter [NETWWW] is a series of "hooks" at various points in the kernel network
stack. The hooks are well-defined points in the traversal of a packet through the
kernel. The path of a packet in the kernel when the Linux PC acts as a router is shown
in Figure 5.3 [OPWWW]. The left side of the figure shows the path of a packet received
by the Linux kernel and the right side shows the path of a packet leaving the kernel.
When a packet is received, it is first stored in the memory of the NIC (Network
Interface Card). Then, the card issues an interrupt request to the operating system,
which suspends any running process (because hardware interrupts have the highest
priority) to acknowledge and serve the request. The incoming packet is then copied
from the memory of the NIC (using the function netif_rx( )) to a network buffer by the
device driver using DMA (Direct Memory Access).
[Figure: packet handling in the Linux 2.4 kernel, from the network drivers through the
IPv4 input, forwarding and output routines, annotated with the netfilter hooks
NF_IP_PRE_ROUTING, NF_IP_LOCAL_IN, NF_IP_FORWARD, NF_IP_LOCAL_OUT and
NF_IP_POST_ROUTING.]
Figure 5.3: The Journey of a Packet in the Linux Kernel
Network buffers are necessary to successfully process the packet as it progresses
through the kernel. They are queued in a FIFO (First-In First-Out) input queue where
they wait before being processed. Linux can then return from the interrupt after
having scheduled a softirq (soft interrupt request), which is an interrupt with a low
priority. Depending on the type of packet (ARP, TCP/IP, UDP/IP, ICMP/IP etc) and
whether the packet is to be routed or delivered to a user-space process, it can follow
a different path.
On the right side of the diagram, the route a packet follows when it is transmitted to 
the network is shown. There are two different routes depending on whether the packet 
is just routed to the next hop (in case the Linux host acts as a router) or it is generated 
by a user-space process and then injected into the network. 
As previously stated, the netfilter hooks are well-defined points in the traversal of a
packet in the kernel. These hooks are named NF_IP_PRE_ROUTING, NF_IP_LOCAL_IN,
NF_IP_LOCAL_OUT, NF_IP_FORWARD and NF_IP_POST_ROUTING (Figure 5.3).
For convenience, a new diagram is plotted (Figure 5.4), omitting all kernel
functions except the netfilter hooks. The arrows in Figure 5.4 show the possible
packet routes.
[Figure: in kernel space, packets from the network enter at NF_IP_PRE_ROUTING and are
either forwarded through NF_IP_FORWARD or delivered through NF_IP_LOCAL_IN to the
user-space processes; packets generated locally pass through NF_IP_LOCAL_OUT; all
outgoing packets leave through NF_IP_POST_ROUTING. The ip_queue module (dotted arrows)
links kernel space with user space.]
Figure 5.4: Netfilter Hooks in the Linux Kernel
The RFE forwards active packets to the AE (Figure 5.1). This is achieved by replacing 
the original IP destination address by the IP address of the AE (there is a detailed 
description in Chapter 6). When a packet enters the kernel of the AE, it should follow 
the path: NF_IP_PRE_ROUTING -> NF_IP_LOCAL_IN ->User Space Process-> 
NF_IP_LOCAL_OUT -> NF_IP_POST_ROUTING and then back to the network. 
Since there is no process listening to the "active port", the packet will be dropped and 
an ICMP error will be returned to the sender of the packet. It is possible to bypass this 
default procedure by "stealing" the active packets in the NF_IP_PRE_ROUTING 
hook and queuing them to the Core Process. That is exactly what the Active Filter does
(Figure 5.2). It is an LKM that registers itself in the NF_IP_PRE_ROUTING hook and
filters the packets. If the incoming packet is an active packet, the filter queues it
to user-space (to the Core Process), as shown in Figure 5.2; passive packets are
processed as normal. The actual transfer of packets from kernel-space to user-space is
performed via another LKM called ip_queue (shown with dotted arrows in Figure 5.4)
that is part of the libipq software [NETWWW]. The ip_queue module uses a specific
type of socket, the Netlink socket, for copying the packets from kernel to user-space.
The operation of the Active Filter is summarised using the flowchart below: 
i) Register in the NF_IP_PRE_ROUTING hook.
ii) Check the incoming packet: is it an active packet?
    YES: send it to the Core Process.
    NO: pass the packet to the next hook.
Figure 5.5: Flowchart of the Functionality of the Active Filter
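The decision in the flowchart can be modelled in user-space C as a sketch. It is not the kernel module itself. The test used to recognise an active packet is an assumption made for illustration: the packet is taken to be a UDP datagram addressed to a hypothetical ACTIVE_PORT, whereas the real criterion is defined by the Active Protocol. The two verdict values mirror the netfilter constants NF_ACCEPT (1) and NF_QUEUE (3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Verdicts, numbered like netfilter's NF_ACCEPT and NF_QUEUE. */
enum verdict { VERDICT_ACCEPT = 1, VERDICT_QUEUE = 3 };

#define ACTIVE_PORT 5000u  /* hypothetical value, chosen for this sketch */
#define PROTO_UDP   17u    /* IPv4 protocol number of UDP */

/* Decide what to do with an IPv4 packet seen at the pre-routing hook:
 * queue active packets to user-space, let everything else continue. */
enum verdict active_filter_verdict(const uint8_t *pkt, size_t len)
{
    size_t ihl;
    unsigned dport;

    assert(pkt != NULL);
    if (len < 20u || (pkt[0] >> 4) != 4u)
        return VERDICT_ACCEPT;               /* not a complete IPv4 header */
    ihl = (size_t)(pkt[0] & 0x0Fu) * 4u;     /* IP header length in bytes */
    if (ihl < 20u || len < ihl + 8u || pkt[9] != PROTO_UDP)
        return VERDICT_ACCEPT;               /* passive: not UDP */
    dport = ((unsigned)pkt[ihl + 2u] << 8) | pkt[ihl + 3u];
    return dport == ACTIVE_PORT ? VERDICT_QUEUE : VERDICT_ACCEPT;
}
```

In a real netfilter module, a hook function that returns NF_QUEUE hands the packet to the queue handler (ip_queue in this design), while NF_ACCEPT lets it continue to the next hook.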
5.3.1.2 The Core Process (CPR) 
The main duties of the CPR are to host the incoming packets and invoke the Active 
Applications for servicing them. It communicates with the AF through the ip_queue 
LKM for receiving the active packets (Figure 5.4). 
The communication between the CPR, the Active Applications (AAs) and the Packet
Injector (PI) (Figure 5.2) is implemented using the Unix Domain Sockets mechanism.
This is a type of socket used for communication between processes that run on
the same host. Network sockets are characterised by an IP address and a port number,
while Unix Domain sockets are characterised by a pathname. For example, a server
can open a Unix Domain socket and bind it to the pathname "/usr/route/". Then any
client can send packets to the server if it knows the pathname used (the server and the
client are on the same host).
The usual path an active packet takes is CPR -> AA -> PI (Figure 5.2) and then back
to the network. Each AA has to have its own pathname (similar to the well-known
ports of servers), as does the PI. This is useful because a software developer
who writes code for an AA has to know where packets come from and where they
should be sent. The pathnames that are used are:
i) "/usr/active_router/reinjectp" for the PI,
ii) "/usr/active_router/pathX" for the AA with the GMID (Global Module ID) X.
For example, an AA with GMID=3 has the pathname /usr/active_router/path3.
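A minimal sketch of how an AA's address could be prepared for a Unix Domain socket is shown below. The helper name is hypothetical; struct sockaddr_un and AF_UNIX are the standard POSIX definitions. Whether the router uses stream or datagram sockets is not stated here, so the sketch stops at building the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Pathname of the Packet Injector, as listed above. */
#define PI_PATH "/usr/active_router/reinjectp"

/* Fill addr with the Unix Domain socket address of the AA with the
 * given GMID. Returns 0 on success, -1 if the pathname does not fit. */
int aa_socket_address(unsigned gmid, struct sockaddr_un *addr)
{
    int n;

    assert(addr != NULL);
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    n = snprintf(addr->sun_path, sizeof addr->sun_path,
                 "/usr/active_router/path%u", gmid);
    return (n > 0 && (size_t)n < sizeof addr->sun_path) ? 0 : -1;
}
```

The resulting structure is what an AA would pass to bind() to receive packets on its own pathname, and what the CPR would use as the destination when sending to that AA.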
5.3.1.2.1 Loading the Active Applications 
AAs are downloaded from Code Servers (CSs), as described in section
5.3.1.3.
The hardware element of the AE consists of one FPGA device and for this reason only
one FPGA-AA can execute at a time. If an AA does not require the use of the FPGA
device, it can be loaded into memory and execute in the multithreaded environment of
the AE, like any other user-space process. If more than one AA competes
for the FPGA, the AE has to switch between the AAs, giving every AA
access to the FPGA for a specific period of time. The incoming active
packets are checked by the CPR, so packets that require an AA that is already loaded
in memory and has access to the FPGA (if it is an FPGA-AA) are passed directly to
this AA. If there is no AA to receive the packets, they are temporarily stored in packet
queues.
5.3.1.2.2 Packet Queues (PQs) 
The PQs are created in the memory address space of the CPR. They are used to 
temporarily store active packets before the appropriate AA can process them. There is 
one queue for every AA (Figure 5.6). 
[Figure: the Active Filter passes packets to the Core Process, which holds the packet
queues PQ1 to PQN.]
Figure 5.6: Packet Queues
PQN is the packet queue used to store packets that carry the GMID N. PQs are
implemented as linked lists and have a limited capacity. Before storing a packet in a
PQ, the CPR checks the queue's capacity and, if the queue is full, re-injects the
incoming packet into the network (via the PI) without modifying it.
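A bounded packet queue of this kind can be sketched as a singly linked list with a packet counter. The type and function names are hypothetical, and the capacity is counted in packets, which is an assumption: the unit of the capacity limit is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One queued packet; the payload is stored directly after the header. */
struct pq_node {
    struct pq_node *next;
    size_t len;
    unsigned char data[];
};

/* FIFO queue with a fixed capacity, counted in packets. */
struct packet_queue {
    struct pq_node *head, *tail;
    size_t count, capacity;
};

void pq_init(struct packet_queue *q, size_t capacity)
{
    assert(q != NULL);
    q->head = q->tail = NULL;
    q->count = 0;
    q->capacity = capacity;
}

/* Copy a packet to the tail of the queue. Returns 0 when stored and -1
 * when the queue is full or memory is exhausted; in that case the caller
 * would re-inject the packet unmodified through the PI. */
int pq_enqueue(struct packet_queue *q, const unsigned char *pkt, size_t len)
{
    struct pq_node *n;

    assert(q != NULL && pkt != NULL);
    if (q->count >= q->capacity)
        return -1;
    n = malloc(sizeof *n + len);
    if (n == NULL)
        return -1;
    n->next = NULL;
    n->len = len;
    memcpy(n->data, pkt, len);
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    q->count++;
    return 0;
}

/* Remove and return the oldest packet, or NULL if the queue is empty.
 * The caller releases the node with free(). */
struct pq_node *pq_dequeue(struct packet_queue *q)
{
    struct pq_node *n;

    assert(q != NULL);
    n = q->head;
    if (n == NULL)
        return NULL;
    q->head = n->next;
    if (q->head == NULL)
        q->tail = NULL;
    n->next = NULL;
    q->count--;
    return n;
}
```

Keeping a tail pointer makes both enqueue and dequeue constant-time, so a full check followed by an append costs the same regardless of how many packets are waiting.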
A better approach would be to classify the packets in queues based not only on their 
GMID but also on the flow to which they belong. This could provide QoS (Quality of 
Service) to different flows. 
5.3.1.2.3 The Active Registry (ActReg) 
The second main purpose of the CPR is to invoke the appropriate AAs. For this 
reason, it has to keep a database for several "housekeeping" issues such as: the stored 
AAs, AAs that are loaded in memory, AAs that have to make use of the FPGA etc. This 
is achieved by creating the Active Registry (ActReg). This is an array stored in 
memory that maintains information about the AAs. Table 5.1 shows this information. 
There is one array entry for each AA. 
Active Registry
INFO     Description
GMID     Global Module Identification
PID      Process Identification
PSTAT    Process Status
PNAT     Process Nature
MMU      Maximum Memory Utilisation
PTMP     Process Timestamp
MJF      Major Faults
MNF      Minor Faults
Table 5.1: The Active Registry
• The GMID is the Global Module Identification number that is carried by every 
active packet and specifies the AA that has to process the specific packet. 
• The Process ID (PID) is the identification number the operating system assigns to 
every application when it is first initialised. 
• The PNAT defines if the AA specified by the GMID is a software-only application or 
an FPGA application. This is useful when loading a new application because only one 
AA can have access to the FPGA at a time. The PNAT of every AA is provided by the 
Code Servers. 
• When an AA is loaded into the memory, it is timestamped with the current time at 
that moment. The Process Timestamp (PTMP) contains the timestamp. This can be 
used to unload an AA from the memory if it has not been used for a long time (high 
process ageing). 
• MMU is the maximum memory utilisation permitted, for the AA specified by the 
GMID, and is provided by the Code Servers. It is used by the Memory Monitor 
(MEMM) for guarding and preventing the loaded AAs from consuming more memory 
than they are permitted. It is also used by the APPLOD when it takes a decision on 
whether or not there is sufficient memory to load an AA. 
• Major and Minor faults are limits used to penalise AAs that misbehave. Each time
an AA misbehaves, depending on the seriousness of the situation, the appropriate
counter (mjf_counter for major faults and mnf_counter for minor faults) is increased
by one. There is a threshold with a different value for each type of fault. If either
of these two thresholds is reached, the AA is unloaded from memory and its status is
set to PSTAT5.
• Each AA is characterised by a Process Status (PSTAT) and can be in only one state
at a time. The PSTATs that have been defined are shown in Table 5.2.
Status   Description
0        The AA is not stored in the AE.
1        The AA is stored in the AE.
2        The AA is loaded into memory.
3        The AA is loaded into memory and it is a "running" process.
4        The AA is in a transient state.
5        The AA has caused too many errors and should be loaded again only after
         time T has elapsed.
6        The AA has not been downloaded from a Code Server. An attempt to download
         it should take place only after time T.
Table 5.2: Process Status definitions
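One possible C layout for an ActReg entry, together with the fault-counting rule, is sketched below. The field types and names are assumptions, as is the decision to keep the two counters inside the entry itself; only the field list of Table 5.1, the counter names and the status values of Table 5.2 come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Process Status values of Table 5.2. */
enum pstat {
    PSTAT_NOT_STORED      = 0,
    PSTAT_STORED          = 1,
    PSTAT_LOADED          = 2,
    PSTAT_RUNNING         = 3,
    PSTAT_TRANSIENT       = 4,
    PSTAT_PENALISED       = 5,  /* too many errors */
    PSTAT_DOWNLOAD_FAILED = 6
};

/* Process Nature: software-only or FPGA application. */
enum pnat { PNAT_SOFTWARE, PNAT_FPGA };

/* One Active Registry entry (Table 5.1), one per AA. */
struct act_reg_entry {
    unsigned   gmid;         /* Global Module Identification       */
    pid_t      pid;          /* Process Identification             */
    enum pstat pstat;        /* Process Status                     */
    enum pnat  pnat;         /* Process Nature                     */
    size_t     mmu;          /* Maximum Memory Utilisation (bytes) */
    time_t     ptmp;         /* Process Timestamp                  */
    unsigned   mjf;          /* threshold for major faults         */
    unsigned   mnf;          /* threshold for minor faults         */
    unsigned   mjf_counter;  /* major faults recorded so far       */
    unsigned   mnf_counter;  /* minor faults recorded so far       */
};

/* Record one fault. Returns 1 if a threshold has been reached, in which
 * case the status is set to 5 and the caller should unload the AA. */
int actreg_record_fault(struct act_reg_entry *e, int major)
{
    assert(e != NULL);
    if (major)
        e->mjf_counter++;
    else
        e->mnf_counter++;
    if (e->mjf_counter >= e->mjf || e->mnf_counter >= e->mnf) {
        e->pstat = PSTAT_PENALISED;
        return 1;
    }
    return 0;
}
```

With thresholds of 2 major and 3 minor faults, for example, an AA that commits one minor and one major fault keeps running, and the second major fault moves it to status 5.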
The PSTAT characterises an AA at any given time. The ActReg is created by the CPR 
but can be accessed or modified by the CPU Monitor (CPUM), the Memory Monitor 
(MEMM) and the Application Loader (APPLOD). The ActReg is a mechanism that the 
core components of the AE use to share information about any AA. The usefulness of 
this registry will be highlighted throughout this chapter. 
From Table 5.2, some more definitions about the PSTATs follow: 
If PSTAT is 0, then the AA specified by the GMID does not exist on the hard disk of
the AE and, if there are any packets requiring this application, the CPR requests the
downloading of the AA from a Code Server.
If PSTAT is 1, then the AA specified by the GMID, has been stored in the AE but it is 
not loaded into memory yet, so if there are any packets requiring this AA, CPR 
requests the loading of it from the hard disk to the memory. Requests for loading or 
downloading AAs are sent from the CPR to the APPLOD. During the requests, the 
incoming packets are stored in the appropriate PQs. 
If PSTAT is 2, then the AA specified by the GMID, has already been loaded into the 
memory. If it is an FPGA application it does not own the resources of the FPGA yet. 
If PSTAT is 3, then the AA specified by the GMID, is a "running" application. That 
means that it is loaded into the memory and if it is an FPGA application (it uses the 
FPGA) it owns the resources of the FPGA board. Packets requiring this AA are 
directly forwarded from the CPR to the AA. 
If PSTAT is 4, then the AA specified by the GMID is in a "transient" state. An AA is
in a transient state when its code is being downloaded from a Code Server and it is not
yet ready to execute. Packets requiring this application are temporarily stored in the
appropriate PQs.
If PSTAT is 5, then the AA specified by the GMID, has caused too many errors. It is 
unloaded from the memory and an attempt to load it again will take place after time T 
has elapsed. 
If PSTAT is 6, then the AA specified by the GMID could not be downloaded from the
Code Server. An AA has this status if it has not been found in any of the
Code Servers, if it has been found but the connection was lost during downloading, or
if the downloading was not possible for any other reason (the Code Server was down,
etc.). An attempt to download this AA will take place only after time T has elapsed.
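The per-status handling described above can be summarised in a short sketch. It is an illustration in Python, not the thesis implementation: the names `PStat` and `handle_packet` and the action strings are hypothetical, and the actions chosen for PSTAT 2, 5 and 6 are assumptions based on the Core Process flowchart rather than statements from the definitions.

```python
from enum import IntEnum

class PStat(IntEnum):
    NOT_STORED = 0       # AA is not on the hard disk of the AE
    STORED = 1           # AA is on the hard disk but not in memory
    LOADED = 2           # AA is in memory (an FPGA-AA does not own the FPGA yet)
    RUNNING = 3          # AA is in memory and owns the FPGA if it needs it
    TRANSIENT = 4        # AA is being downloaded from a Code Server
    TOO_MANY_ERRORS = 5  # reload only after time T has elapsed
    DOWNLOAD_FAILED = 6  # retry the download only after time T has elapsed

def handle_packet(pstat: PStat) -> str:
    """Return the action the CPR takes for a packet whose AA has this status."""
    if pstat == PStat.NOT_STORED:
        return "queue packet; request download"
    if pstat in (PStat.STORED, PStat.LOADED):
        # Assumption: PSTAT 2 is handled like PSTAT 1 (a request to the APPLOD).
        return "queue packet; request load"
    if pstat == PStat.RUNNING:
        return "forward packet to AA"
    if pstat == PStat.TRANSIENT:
        return "queue packet"
    # Assumption: for PSTAT 5 and 6 the packet is returned to the network.
    return "send packet back to network"

actions = {int(s): handle_packet(s) for s in PStat}
```

The dictionary `actions` gives one entry per status, which makes it easy to compare the sketch with Table 5.2 line by line.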
There are similarities between different PSTATs. For example, PSTAT0 (PSTAT is 0)
and PSTAT6 both characterise AAs that have not been downloaded and are not stored in
the AE. The difference is that for AAs with PSTAT6 there has been an attempt to download
them but it failed. This does not mean that active packets will stop requiring this
application. If there was only one PSTAT for both cases, the CPR would continuously
request the downloading of an AA even after it fails. Separating the PSTATs saves
valuable resources (network bandwidth, CPU, memory). PSTAT2 and PSTAT3 are the
same if the AA is a software-only application (an AA that is not using the FPGA).
Software-only AAs are distinguished using the Process Nature (PNAT).
The downloading and loading of the AAs is not performed by the CPR itself. The CPR
only requests these operations from the APPLOD. The reason is that they
are time-consuming operations and the CPR should not block while they are taking
place, since blocking would lead to packet loss. While the APPLOD is servicing the
requests, the CPR stores incoming packets and prepares new requests. The APPLOD is a
POSIX thread created by the CPR when it is first loaded. Threads share the same address
space with the main process that creates them, so exchanging data between them and the
main process is not expensive in terms of CPU cycles and delay. APPLOD, MEMM
and CPUM are all threads (Figure 5.2) that are created by the CPR (main process) and
they all have access to the ActReg. Access to the ActReg is controlled using mutexes
(mutual exclusion); therefore only one thread has access to it at a time, preventing
race conditions.
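The idea of several threads sharing one registry behind a mutex can be sketched as follows. This is a minimal Python illustration, assuming a registry that maps a GMID to a status and a fault count; the class `ActReg` and its methods are hypothetical stand-ins, not the data layout of Table 5.1.

```python
import threading

class ActReg:
    """Shared registry; a single lock serialises every read and write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pstat = {}    # GMID -> PSTAT
        self._faults = {}   # GMID -> number of recorded faults

    def set_pstat(self, gmid, pstat):
        with self._lock:
            self._pstat[gmid] = pstat

    def get_pstat(self, gmid):
        with self._lock:
            return self._pstat.get(gmid, 0)   # 0: not stored in the AE

    def add_fault(self, gmid):
        with self._lock:                      # read-modify-write is atomic here
            self._faults[gmid] = self._faults.get(gmid, 0) + 1

    def faults(self, gmid):
        with self._lock:
            return self._faults.get(gmid, 0)

def worker(reg, gmid, n):
    for _ in range(n):
        reg.add_fault(gmid)

reg = ActReg()
# Three threads standing in for APPLOD, MEMM and CPUM, all created by one process.
threads = [threading.Thread(target=worker, args=(reg, 7, 1000), name=name)
           for name in ("APPLOD", "MEMM", "CPUM")]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = reg.faults(7)
```

Because every update happens while the lock is held, no increment is lost, whichever thread runs first.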
5.3.1.2.4 The Request Mechanism for the Active Applications 
The CPR sends requests to the APPLOD using a Request Queue (RQ). The Request 
Queue is different than the Packet Queue, shown in Figure 5.2, because it does not 
store packets but requests. 
The format of each request has two fields: 
i) The GMID that specifies which AA has to be loaded into memory or 
downloaded from a Code Server (CS), 
ii) The Request Type (RT) that specifies if the required AA has to be downloaded 
from a CS (if it is not stored in the hard disk of the AE). 
The CPR defines both fields of each request, which is then transferred to the APPLOD
via the RQ.
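A request with these two fields, and a queue that carries it from one thread to another, might look like the sketch below. It is only an illustration: the names `Request`, `make_request`, `RT_DOWNLOAD` and `RT_LOAD` are hypothetical, and the numeric RT values (0 for download, 1 for load from disk) follow the description given for the APPLOD in section 5.3.1.3.1.

```python
import queue
from dataclasses import dataclass

RT_DOWNLOAD = 0   # RT0: the AA must first be fetched from a Code Server
RT_LOAD = 1       # RT1: the AA is already on the hard disk of the AE

@dataclass(frozen=True)
class Request:
    gmid: int     # which AA the request concerns
    rt: int       # request type

def make_request(gmid: int, pstat: int) -> Request:
    """Pick the request type from the AA's PSTAT (0 means not stored in the AE)."""
    return Request(gmid, RT_DOWNLOAD if pstat == 0 else RT_LOAD)

rq = queue.Queue()             # FIFO queue that is safe to share between threads
rq.put(make_request(42, 0))    # CPR side: AA 42 is not stored yet
rq.put(make_request(43, 1))    # CPR side: AA 43 is on disk
first = rq.get()               # APPLOD side: requests come out in arrival order
second = rq.get()
```

`queue.Queue` does its own locking, so the producer and the consumer need no extra synchronisation for the queue itself.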
How often requests are made by the CPR affects the switch operation time between 
AAs that have to use the FPGA. For example, if there were two competing packet 
flows requiring service by two FPGA-AAs, it would be very resource consuming to 
switch an AA for every packet that has a different GMID. 
[Diagram: Core Process -> Request Queue -> Application Loader]
Figure 5.7: The Request Queue
The CPR limits the number of requests by using tokens. There is
one token available for every AA, for a specific period of time.
Active flows (AFLs) are defined as virtual packet flows that carry packets with the 
same GMID. Active packets with the same GMID belong to the same AFL but may 
belong to different packet flows at the same time. 
When the first packet of an AFL enters the CPR, the corresponding token is assigned
to it and no other packet of the same AFL can use it. With this method, requests are
issued only once per active flow. For example, if there are 500 packets in the same
AFL, only one request will be sent to the APPLOD and not 500. During the request,
and until the AA is loaded, the active packets are stored in the PQ. There is one PQ
for every AFL (Figure 5.6). If more than one AFL is competing for the FPGA,
tokens are assigned to the first packet of each flow, but requests are issued on a
first-come, first-served basis and only after some time has elapsed since the previous
request. This period is called the Rest Period (RP). The RP is necessary since there is a
time overhead for the initialisation of an AA. After the RP has elapsed, an incoming
AFL is reassigned a token and a new request is sent.
The Request Decision Module (RDM) is part of the CPR and issues requests to 
APPLOD in pre-defined intervals (RP) (Figure 5.8). Using this request mechanism, 
there is one request issued per RP. If the duration of an AFL is greater than the RP, 
then more than one request can be sent for the same AFL. 
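One way to read the token and Rest Period rules is sketched below. This is an interpretation, not the thesis code: the class `RequestDecisionModule`, its methods and the choice to return a token to the pool when its request is issued are all assumptions made for illustration.

```python
from collections import deque

class RequestDecisionModule:
    def __init__(self, rest_period):
        self.rest_period = rest_period
        self.pending = deque()     # GMIDs holding a token, in arrival order
        self.holding = set()       # GMIDs whose token is currently taken
        self.last_request = None   # time at which the previous request was issued

    def on_packet(self, gmid):
        """The first packet of an AFL takes the token; later packets find it taken."""
        if gmid not in self.holding:
            self.holding.add(gmid)
            self.pending.append(gmid)

    def tick(self, now):
        """Issue at most one request per Rest Period, first come first served."""
        if not self.pending:
            return None
        if self.last_request is not None and now - self.last_request < self.rest_period:
            return None
        gmid = self.pending.popleft()
        self.holding.discard(gmid)   # the token goes back to the pool
        self.last_request = now
        return gmid

rdm = RequestDecisionModule(rest_period=5.0)
for _ in range(500):
    rdm.on_packet(1)               # 500 packets of AFL 1 produce one pending request
rdm.on_packet(2)                   # AFL 2 arrives later
issued = [rdm.tick(0.0), rdm.tick(1.0), rdm.tick(5.0)]
rdm.on_packet(1)                   # AFL 1 is still active after its request
later = [rdm.tick(6.0), rdm.tick(10.0)]
```

In this run AFL 1 is requested at time 0, AFL 2 has to wait until one Rest Period has passed, and AFL 1 receives a second request only because it outlives the Rest Period.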
[Diagram: AFL#1 to AFL#4 each take their token (Token#1 to Token#4) from the Token
Pool; the Request Decision Module then issues Request#N.]
Figure 5.8: Tokens and the Request Mechanism
The following flowchart summarises the functionality of the CPR. 
[Flowchart. Steps: Receive a new packet; Check the PSTAT of the packet.
If PSTAT = 0 or 1 or 2: Get Current Time; decision (YES/NO); Assign a token;
Previous time = current time; Send the appropriate request to the APPLOD;
Store packet into the PQ; Send packet back to network.
If PSTAT = 3: Forward packet to the appropriate AA.
If PSTAT = 4: decision (NO); Store packet into the PQ; Send packet back to network.]
Figure 5.9: Flowchart of the Core Process
5.3.1.3 The Application Loader (APPLOD) 
The APPLOD is the most complicated part of the AE. Its main task is to receive and 
service requests sent by the CPR. 
5.3.1.3.1 Communication with a Code Server 
As described in section 5.3.1.2.4, the CPR can send two types of requests. If RT is 1
(RT1), then the AA that has to be loaded is already stored on the hard disk of the AE. If
the request type is RT0, the APPLOD has to contact a CS and download the AA specified
by the GMID.
The APPLOD maintains a list of CSs, so if an AA is not found in a CS, it communicates
with the next one in the list. The servers are ordered according to the
number of hops away that they are located, so the closest server is queried first.
A better approach would be to use active anycast as described in [YAM01], so that the
CS with the minimum load is contacted first.
The communication between the APPLOD and the CS consists of two phases: 
• The Control Phase,
It is carried out over UDP. Every CS listens on a well-known UDP
port for requests. The protocol used for the Control Phase is described
below:
The APPLOD requests a specific application by sending a packet that carries the
GMID of the AA (Control Phase 1):
| IP | UDP | GMID |
Figure 5.10: Packet sent during the Control Phase 1
GMID is the Global Module ID of the specific AA. The CS then replies with the packet
(Control Phase 2):
| IP | UDP | Server_Reply |
Figure 5.11: Packet sent during the Control Phase 2
The payload of the reply packet carries the structure Server_Reply, which contains the
following information:
Server_Reply
Field      Description
Id         The GMID of the AA requested to be downloaded.
Reply      Reply=1 if the AA was found in the CS, otherwise Reply=0.
PNAT       Process Nature: PNAT=1 if the AA is an FPGA application,
           otherwise PNAT=0.
MMU        Maximum Memory Utilisation: the maximum physical memory, in
           bytes, that this AA can utilise for proper functionality.
Bit_size   The size of the bitstream of the AA, in bytes. If PNAT=0 then
           Bit_size=0.
Exec_size  The size of the executable file of the AA, in bytes.
Table 5.3: The Server_Reply info
The two messages exchanged between the APPLOD and the CS use UDP as the
transport protocol. UDP is an unreliable, connectionless protocol, so a time-out
mechanism has been implemented (alternatively, a TCP connection could be used). An
additional problem is that the CS may not be online for various reasons, so a
retransmission mechanism that retries after some time has elapsed is also necessary.
After sending the packet of Control Phase 1, the APPLOD waits for the reply of
Control Phase 2. If no reply is received within time TIME_OUT, it restarts Control
Phase 1. This is repeated up to RE-TRY times and, if there is still no reply, the APPLOD
contacts the next CS in the list. If no reply is received from any of the CSs, the packets
that have been stored (by the CPR) in the appropriate PQ are re-injected into the
network. The values of TIME_OUT and RE-TRY are set to 20 seconds and 5, respectively.
Packets are also re-injected into the network if the communication with the CSs
was successful but the AA was not found on any of the servers (Reply=0, sent by all
servers).
In both cases (Reply=0, no contact with the CSs), the Process Status of the AA is set
to 6 (PSTAT6) (Table 5.2), so if there are any other incoming active packets requiring
the same AA, an attempt to download it again will take place only after some time has
elapsed (the value is set administratively), thus avoiding sending requests for an AA
that cannot be found.
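The time-out and retry logic of the Control Phase can be sketched independently of any real socket code. In the Python illustration below, `ask` is a hypothetical transport callback (it would send the Control Phase 1 packet and wait up to `timeout` seconds for the reply), `control_phase` is a hypothetical name, and RE_TRY is treated as the number of attempts per server, which is a simplification. The sketch takes Reply=1 to mean that the AA was found, as the description of the Data Phase does.

```python
TIME_OUT = 20.0   # seconds, value given in the text
RE_TRY = 5        # attempts per Code Server, value given in the text

def control_phase(gmid, servers, ask):
    """Query each CS in list order; return (server, reply) or None.

    ask(server, gmid, timeout) returns a reply dict, or None on time-out.
    """
    for server in servers:
        for _ in range(RE_TRY):
            reply = ask(server, gmid, TIME_OUT)
            if reply is None:
                continue              # timed out: repeat Control Phase 1
            if reply["Reply"] == 1:   # the AA was found on this CS
                return server, reply
            break                     # the CS answered "not found": next CS
    return None                       # caller sets PSTAT6 and re-injects packets

calls = []

def fake_ask(server, gmid, timeout):
    """Stand-in transport: cs-a is down, cs-b lacks the AA, cs-c has it."""
    calls.append(server)
    if server == "cs-a":
        return None
    if server == "cs-b":
        return {"Id": gmid, "Reply": 0}
    return {"Id": gmid, "Reply": 1, "PNAT": 0, "MMU": 4096,
            "Bit_size": 0, "Exec_size": 1234}

result = control_phase(7, ["cs-a", "cs-b", "cs-c"], fake_ask)
unreachable = control_phase(7, ["cs-a"], fake_ask)
```

Passing the transport in as a function keeps the retry policy testable without a network; a real version would wrap a UDP socket with a receive time-out.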
If an AA is in PSTAT6, any packets received by the AE that have to be serviced by this 
AA consume resources (bandwidth, memory, CPU). If these packets were not 
forwarded by the RFE (to the AE), valuable resources could be saved. This is 
achieved by sending a control packet to the RFE to inform it that the specific 
application is in PSTAT6. Then, the RFE stops forwarding active packets that require 
the specific AA. The packets are not dropped, but routed as normal. The 
communication between the RFE and the AE for control purposes and how the RFE 
stops forwarding specific types of active packets (to the AE), are presented in Chapter 
6. 
In the time-diagram below, the Control Phase (with two retransmissions) is shown: 
[Time diagram between the APPLOD and the CS: Phase 1 is sent; TIME_OUT elapses;
Phase 1 is sent again; TIME_OUT elapses; Phase 1 is sent again.]
Figure 5.12: Control Phase with two Retransmissions
After the Control Phase has finished, and if any of the CSs has replied with Reply=1,
the Data Phase takes place.
• The Data Phase, 
During this phase, the downloading of the AA is performed. Prior to downloading, the
APPLOD has to check if there is available space on the hard disk to store the AA.
Since it has received information from the CS regarding the size of the AA (Table 5.3), it
can compare it with the free capacity of the hard disk. If there is no available space,
the AA is set to PSTAT6 and the Data Phase exits. All the stored packets requiring this AA
are re-injected into the network.
Otherwise, if there is free space to store the application, the APPLOD is ready to start
downloading the AA.
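The free-space check can be expressed in a few lines. The sketch below is an assumption about how such a check could be written in Python; the function name `has_room` is hypothetical, and it assumes that the space needed is the sum of Bit_size and Exec_size from Table 5.3.

```python
import shutil

def has_room(path, bit_size, exec_size):
    """True if the filesystem holding `path` can store both files of the AA."""
    free = shutil.disk_usage(path).free   # free bytes on that filesystem
    return free >= bit_size + exec_size

fits = has_room(".", 0, 1234)             # a small software-only AA
too_big = has_room(".", 10**30, 1234)     # an absurdly large bitstream
```

If the check fails, the text above applies: the AA is set to PSTAT6 and the queued packets are re-injected into the network.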
Authentication and Secure Download 
For the communication with the CS, the SSH (Secure Shell) mechanism [SSHWWW] 
is used. SSH is software for logging into a remote machine and for executing 
commands on that machine. It provides secure encrypted communication between two 
hosts over an insecure network. The authentication mechanism chosen is the DSA 
algorithm (Digital Signature Algorithm). Logging in to a remote machine normally
requires typing a password. Here, the process has to be automated, since no
human interaction should take place.
The Linux utility ssh-keygen [SSKWWW] is used to create the public and the private 
keys of the AE. Then, the public key is sent to each CS and stored to a default file 
(authorized_keys2). When the APPLOD tries to authenticate, it uses its private key to 
sign the session identifier sent by the CS and then sends the result back to the CS. 
Then, the CS will: 
i) Check if the public key of the APPLOD is listed in its authorized_keys2 list, 
ii) Use the public key of the APPLOD to verify that the session identifier is the 
valid one. 
If all the above are successful, the APPLOD is granted access to the CS. The 
APPLOD uses SSH indirectly, because it actually uses it through the SCP (Secure 
Copy). SCP is a mechanism to transfer data between hosts and provides security 
because data are encrypted. The encryption algorithm used here is the 3-DES (Data 
Encryption Standard). 
Using the mechanisms above, the AAs are downloaded from the CSs to the AE. 
Compressing and Uncompressing Data 
The data (AAs) that are downloaded using SCP are compressed in the CS, in order to 
minimise the transfer delay. For the compression, the bzip2 software [ANWWW] is 
used. After the downloading of the data has finished, the bunzip2 package is used for 
decompression. 
TIME_OUT for SCP
The TIME_OUT mechanism described before (Control Phase 1) limits the
maximum delay when requesting the download of an AA. A similar mechanism is
necessary here to limit the time the AE waits to gain access to a CS. If, for
example, a CS crashes or is offline, the AE should not spend a large amount of time
trying to download the AA.
SCP uses TCP as the transport protocol. A timeout can be set using the setsockopt
call, which specifies the timeout for a TCP socket, but this requires modifying and
recompiling the source code of SCP. To avoid this, the default TCP settings in
the AE can be changed (via the /proc filesystem).
The default settings are as follows: TCP starts a keepalive timer for each connection
and, if the connection is idle for tcp_keepalive_time (=7200) seconds, it starts sending
probes to the other end. It sends a maximum of tcp_keepalive_probes (=9) probes,
tcp_keepalive_intvl (=75) seconds apart, and if the other end has not responded by
then, it drops the connection. The new values that have been set are
tcp_keepalive_probes=0 and tcp_keepalive_time=10 seconds. If a CS crashes or a
connection fails (or for any other reason), the maximum time spent trying to
download an AA is tcp_keepalive_time seconds.
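The effect of the three keepalive parameters can be checked with simple arithmetic. The sketch below assumes the usual approximation that a dead peer is detected after the idle time plus one interval per unanswered probe; the function name `keepalive_detection_time` is hypothetical. On Linux these settings are exposed as files under /proc/sys/net/ipv4/.

```python
def keepalive_detection_time(time_s, probes, intvl_s):
    """Approximate seconds before an idle, dead connection is dropped."""
    return time_s + probes * intvl_s

# Defaults quoted in the text: 7200 s idle, 9 probes, 75 s apart.
default_wait = keepalive_detection_time(7200, 9, 75)

# Values set on the AE: 10 s idle and no probes.
ae_wait = keepalive_detection_time(10, 0, 75)
```

With the defaults the wait is more than two hours, whereas with the AE settings it collapses to tcp_keepalive_time, which matches the bound stated above.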
Verifying the Integrity of an AA
After tcp_keepalive_time has elapsed, the APPLOD has to verify the integrity of the AA.
If the SCP connection was dropped in the middle of the download, only part of the AA
is stored on the hard disk. The APPLOD checks the integrity of
the AA by comparing the sizes of the downloaded files with the sizes shown in Table
5.3. If they do not match (meaning that the connection was lost during downloading), it
tries to re-establish the connection and download the AA up to RE-TRY times. If it
fails, it sets the AA to PSTAT6 and flushes the associated PQ into the network. If the AA
has been successfully downloaded, the APPLOD updates the ActReg (Table 5.1) with
the information obtained from Table 5.3. It also stores the same information in a backup
file that is used for the recovery procedure, described in section 5.3.1.8.
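A size-based integrity check of this kind could be sketched as follows. The function `download_is_complete` and its parameters are hypothetical, and the sketch assumes that Exec_size and Bit_size describe the files as they are stored on disk after decompression.

```python
import os
import tempfile

def download_is_complete(exec_path, exec_size, bit_path=None, bit_size=0):
    """Compare on-disk sizes with Exec_size and Bit_size from Server_Reply."""
    def matches(path, expected):
        return (path is not None and os.path.isfile(path)
                and os.path.getsize(path) == expected)

    if not matches(exec_path, exec_size):
        return False
    # Software-only AAs (PNAT=0) have Bit_size=0 and no bitstream to check.
    return bit_size == 0 or matches(bit_path, bit_size)

# Demonstration with a 100-byte stand-in for a downloaded executable.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 100)
    demo_path = f.name
complete = download_is_complete(demo_path, 100)    # sizes agree
truncated = download_is_complete(demo_path, 250)   # download was cut short
os.remove(demo_path)
```

A size comparison detects truncated transfers but not corrupted content; a checksum would be needed for that, which goes beyond what the text describes.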
Before continuing further with the description of the APPLOD, it is necessary to 
describe the /proc filesystem of the Linux operating system. 
5.3.1.3.2 The Proc Filesystem 
The /proc filesystem is a pseudo filesystem because its files do not correspond to 
actual files on a physical device. It is a real-time memory resident filesystem that 
tracks the processes running in the Linux host. It contains a directory entry for each 
running process and the name of this directory is the process ID (PID) of the 
corresponding process. The /proc filesystem can be regarded as a control and 
information centre for the kernel. 
Some of the files it contains are: 
• /proc/loadavg. This file provides a look at the load average on the processor over time.
• /proc/meminfo. This is one of the most commonly used files in the /proc directory, as it
gives valuable information about the current RAM usage on the system. A
/proc/meminfo file is similar to this:
total: used: free: shared: buffers: cached: 
Mem: 524943360 87416832 437526528 0 3170304 46911488 
Swap: 1069277184 0 1069277184 
MemTotal: 512640 kB 
MemFree: 427272 kB 
MemShared: 0 kB 
Buffers: 3096 kB 
Cached: 45812 kB 
SwapCached: 0 kB 
Active: 20052 kB 
Inactive: 54312 kB 
HighTotal: 0 kB 
HighFree: 0 kB 
LowTotal: 512640 kB 
LowFree: 427272 kB 
SwapTotal: 1044216 kB 
SwapFree: 1044216 kB 
Figure 5.13: The /proc/meminfo File 
Mem displays the current state of physical RAM in the system, including a
breakdown of the total, used, free, shared, buffered and cached memory, in bytes.
Swap displays the total, used and free amounts of swap space, in bytes.
MemFree is the free amount of physical memory, in kilobytes.
MemShared is unused in 2.4 and later kernels but is left in for compatibility with
earlier kernel versions.
Buffers is the amount of physical RAM, in kilobytes, used for file buffers.
Cached is the amount of physical RAM, in kilobytes, used as cache memory.
Active is the total amount of buffer or page cache memory, in kilobytes, that is in
active use.
Inactive is the total amount of buffer or cache pages, in kilobytes, that are free
and available.
HighTotal and HighFree are the total and free amounts of memory, respectively, that
are not directly mapped into kernel space.
LowTotal and LowFree are the total and free amounts of memory, respectively, that are
directly mapped into kernel space.
SwapTotal is the total amount of swap available, in kilobytes.
SwapFree is the total amount of swap free, in kilobytes.
• /proc/stat. This file keeps track of a variety of different statistics about the system 
since it was last restarted. A /proc/stat file is similar to this: 
cpu 927 0 2424 56436 
cpu0 927 0 2424 56436
page 47885 7093 
swap 1 0 
intr 76781 597 87 1092 0 0 0 21 5 0 0 0 33 0 3339 0 9042 3462 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0000000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 000000 0 0 0 0 0 0 0 0 0 
disk_io: (3,0): (5634,4115,95770,1519,14186) 
ctxt 68836 
btime 1098982599 
processes 1428 
Figure 5.14: The /proc/stat File
cpu measures the number of jiffies (1/100 of a second) that the system has been in
user mode, user mode with low priority (nice), system mode, and the idle task,
respectively. The total for all CPUs is given at the top, and each individual CPU is
listed below with its own statistics.
page is the number of memory pages the system has written in and out to disk.
swap is the number of swap pages the system has brought in and out.
intr is the number of interrupts the system has experienced.
btime is the boot time, measured in the number of seconds since January 1, 1970,
otherwise known as the epoch.
• /proc/maps. This file (found under /proc/<PID>/maps for each process) gives
information about the memory areas of a process. An example of this file is shown below:
start end perm offset major minor inode image
08048000-0804c000 r-xp 00000000 03:02 197490 /usr/alex/test/actAppl3
0804c000-0804d000 rw-p 00003000 03:02 197490 /usr/alex/test/actAppl3
40000000-40013000 r-xp 00000000 03:02 686795 /lib/ld-2.2.5.so
40013000-40014000 rw-p 00013000 03:02 686795 /lib/ld-2.2.5.so
40025000-4002a000 r-xp 00000000 03:02 344699 /usr/lib/libadmxrc2.so.2.2.3
4002a000-4002b000 rw-p 00004000 03:02 344699 /usr/lib/libadmxrc2.so.2.2.3
4002b000-4002c000 rw-p 00000000 00:00 0
4002c000-4042c000 rw-s e6000000 03:02 65410 /dev/admxrcii0
4042c000-4082c000 rw-s e6400000 03:02 65410 /dev/admxrcii0
4082c000-4092b000 rw-p 00000000 00:00 0
42000000-4212c000 r-xp 00000000 03:02 719499 /lib/i686/libc-2.2.5.so
4212c000-42131000 rw-p 0012c000 03:02 719499 /lib/i686/libc-2.2.5.so
42131000-42135000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
Figure 5.15: The /proc/maps File
start is the beginning virtual address of this memory area. 
end is the end virtual address of this memory area. 
perm is a bit mask with the memory area's read, write, and execute permissions. This
field describes what the process is allowed to do with pages belonging to the area. The
last character in the field is either p for "private" or s for "shared".
offset shows where the memory area begins in the file that it is mapped to. An offset
of zero means that the first page of the memory area corresponds to the first page of
the file.
major, minor are the numbers of the device holding the file that has been mapped.
inode is the inode number of the mapped file. Devices that are hosted in a Linux PC
have a unique inode number. In the figure above, inode=65410 corresponds to the
FPGA device, since application actAppl3 has been granted access to it. The file
/dev/admxrcii0 describes the FPGA device and the memory region 4002c000-4042c000
is the memory used to pass data to and from the FPGA. The application
actAppl3 sends data to the FPGA by writing to that memory region (as described in
Chapter 4). The APPLOD, as will be described later in this chapter, checks if an
application has opened the FPGA board by checking if any of the inode
numbers is 65410.
image is the name of the file that has been mapped.
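The inode check described above amounts to scanning the fifth column of a maps listing. The following Python sketch illustrates this on a small sample; the functions `maps_inodes` and `holds_fpga` and the paths in the sample are hypothetical, and the thesis itself reads this information through LibGTop rather than by parsing text.

```python
FPGA_INODE = 65410   # inode of the FPGA device node in Figure 5.15

def maps_inodes(maps_text):
    """Yield the inode column of every line of a maps listing."""
    for line in maps_text.splitlines():
        fields = line.split()
        # Columns: address range, permissions, offset, device, inode, [image]
        if len(fields) >= 5:
            yield int(fields[4])

def holds_fpga(maps_text, fpga_inode=FPGA_INODE):
    """True if any memory area of the process maps the FPGA device."""
    return fpga_inode in maps_inodes(maps_text)

with_fpga = """\
08048000-0804c000 r-xp 00000000 03:02 197490 /path/to/demoAA
4002b000-4002c000 rw-p 00000000 00:00 0
4002c000-4042c000 rw-s e6000000 03:02 65410 /dev/demo_fpga
"""
without_fpga = """\
08048000-0804c000 r-xp 00000000 03:02 197490 /path/to/demoAA
4002b000-4002c000 rw-p 00000000 00:00 0
"""
owner = holds_fpga(with_fpga)
not_owner = holds_fpga(without_fpga)
```

Anonymous mappings have inode 0 and no image name, which is why the parser only requires five columns.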
The /proc filesystem is a human-readable filesystem. A mechanism is necessary to
allow different processes or threads to get information from it. For this reason, the
LibGTop software [LIBGWWW] is used. This is a software interface that allows
processes to gather information from the /proc filesystem (Figure 5.16).
Process <-> LibGTop <-> /proc
Figure 5.16: Acquiring System Information from the /proc Filesystem
LibGTop provides functions such as glibtop_get_proc_mem to get memory utilisation
information about a process, glibtop_get_cpu to get the total CPU utilisation, etc.
5.3.1.3.3 Loading Active Applications in Memory 
After an AA has been downloaded from a CS, the APPLOD has to decide
whether it can execute or not. There are several scenarios:
Scenario No 1: The AA has been downloaded but it is not loaded in memory yet
(PSTAT1). In this case, prior to loading it into memory, the APPLOD has to check if
there is available memory. Managing memory efficiently and allocating it fairly
to AAs is a very complicated task.
All applications that are loaded in Linux are placed in virtual memory first. The
amount of virtual memory is larger than that of the physical memory. Linux also
performs demand paging, meaning that memory pages are loaded into physical memory
only if they are used by an application. With demand paging, physical memory is
used more efficiently.
If, for example, there are two AAs loaded in memory and the maximum physical
memory utilisation of AA1 is 1000 bytes and that of AA2 is 2000 bytes, then
3000 bytes of physical memory are needed in the worst-case scenario. This is the
worst case because it assumes that both applications utilise this amount of
memory at the same time. Linux usually reallocates memory dynamically by "stealing"
pages temporarily from one application to make them available to another application.
The worst-case scenario can happen if both applications have locked the memory
pages they are using.
Memory Management 
APPLOD implements a simple memory management framework. It gathers
information about maximum memory utilisation from the CSs (Table 5.3) for every
downloaded AA. Next, it gathers information about the utilisation of the total physical
memory through LibGTop (Figure 5.16) and checks whether the following condition is true:
$M_{tot} > M_{used} + MMU_{[GMID]}$, where $M_{tot}$ is the total amount of physical memory on the
AE, $M_{used}$ is the total amount of memory used and $MMU_{[GMID]}$ is the maximum
memory required by the AA. If the condition is true, the AA can be loaded. The
$MMU_{[GMID]}$ value is provided by the CS and is stored in the ActReg (Table 5.1).
A better approach is to avoid swapping to disk as the result of high
physical memory utilisation. When swapping takes place, the performance of the host
drops dramatically because writing data to and reading data from the hard disk are
expensive and slow operations. Swapping takes place when there is high physical memory
utilisation. To demonstrate this, sar [SEBWWW], a Linux utility for monitoring memory
utilisation, is used. Performing the experiment presented in
section 5.3.1.7.3, the graph in Figure 5.17 is produced.
[Plot: utilisation (vertical axis, 0 to 100) against Time (seconds).]
Figure 5.17: Utilisation of the Physical Memory and the Swap Space
Figure 5.17 shows that, when the utilisation of the physical memory becomes too
high, memory pages are swapped out to the hard disk.
In Linux, there is a kernel thread called kswapd (swap daemon) that is initialised at 
system start and its task is to free memory when there is memory pressure. 
Physical memory in Linux is divided into three zones:
i) ZONE_DMA (first 16 MB of memory) is memory in the lower physical
memory ranges, which certain ISA devices require,
ii) ZONE_NORMAL (16 MB-896 MB) is the memory directly mapped by the
kernel,
iii) ZONE_HIGHMEM (896 MB-End) is the memory over 896 MB.
For each of the memory zones described above, there are three thresholds that are
used by kswapd in order to start freeing memory. These thresholds are [GOR04]:
pages_min: when this threshold is reached, the memory allocator will do the kswapd
work and free pages,
pages_low: when it is reached, kswapd is woken up to start freeing pages,
pages_high: once kswapd has been woken, it keeps freeing pages until
pages_high pages are free.
The above values depend on the total RAM of the host. The amount of physical
memory in the AE is 512 MB, so there are two memory zones (ZONE_DMA,
ZONE_NORMAL) and all AAs use ZONE_NORMAL (16 MB-512 MB).
The threshold values are:
Memory Parameter    Size (bytes)
1 page              4096
pages_min           1044480
pages_low           2088960
pages_high          3133440
Table 5.4: Memory size-dependent parameters
As long as the free physical memory is at least pages_high bytes, kswapd will not 
start swapping pages into the hard disk. 
Rewriting the condition shown previously: $M_{tot} - (M_{used} + MMU_{[GMID]}) > M_{tres}$, where
$M_{tres}$ = pages_high = 3133440 bytes and $M_{tot}$ is the total amount of physical memory.
The APPLOD checks if this condition is true before loading an AA into memory.
If the condition is not true, the AA will not be loaded and all the incoming packets will
be sent back to the network.
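The admission condition is easy to evaluate with the figures already given. The Python sketch below uses the thresholds of Table 5.4 and the Mem line of Figure 5.13 as inputs; the function name `can_load` and the two example MMU values are hypothetical.

```python
PAGE = 4096              # bytes per page (Table 5.4)
PAGES_MIN = 1044480      # bytes
PAGES_LOW = 2088960      # bytes
PAGES_HIGH = 3133440     # bytes, used as the threshold M_tres

def can_load(m_tot, m_used, mmu, m_tres=PAGES_HIGH):
    """Admission test: M_tot - (M_used + MMU[GMID]) > M_tres."""
    return m_tot - (m_used + mmu) > m_tres

# Total and used physical memory from the Mem line of Figure 5.13, in bytes.
M_TOT, M_USED = 524943360, 87416832

small_aa = can_load(M_TOT, M_USED, 400 * 2**20)   # AA declaring 400 MB
large_aa = can_load(M_TOT, M_USED, 435_000_000)   # would eat into the reserve

# The thresholds expressed in pages instead of bytes.
threshold_pages = (PAGES_MIN // PAGE, PAGES_LOW // PAGE, PAGES_HIGH // PAGE)
```

The first AA leaves about 18 MB free and is admitted; the second would leave about 2.5 MB, below pages_high, and is refused. In pages the three thresholds are 255, 510 and 765.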
The memory utilisation of each loaded AA is monitored by the MEMM, as will be
described in section 5.3.1.4. If there is available memory, the AA is finally invoked.
The APPLOD then has to make sure that the AA is loaded into memory. In the case of a
software bug, the AA will be unable to initialise its data and start executing.
Every application that is loaded into memory in Linux is assigned a PID
(process ID) that is recorded in the /proc filesystem. There is a mapping between the
process name and the PID. The APPLOD uses a software module called search_PID
(shown in Figure 5.18) that takes the process name as input and searches the /proc
filesystem to find the corresponding PID. If zero is returned, the process has not been
loaded into memory.
[Diagram: search_PID takes a Process Name as input, searches /proc and returns the
Process ID (PID).]
Figure 5.18: The search_PID Module
If search_PID returns zero (meaning that the AA has terminated), the
mjf_counter of the AA is increased by one and the incoming packets are sent back to
the network.
At this point the APPLOD also checks the total number of major and minor faults
of the AA. If either threshold has been reached, the status of the AA is set to PSTAT5
and the RFE is notified not to forward packets for this application.
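A name-to-PID lookup of the kind performed by search_PID can be sketched by walking the numeric directories of /proc. The Python version below is an assumption about one possible implementation, not the module used in the thesis: it reads the Name line of each /proc/<PID>/status file, and the `proc_root` parameter is added only so that the example can run against a fake directory tree.

```python
import os
import tempfile

def search_pid(name, proc_root="/proc"):
    """Return the PID of the first process called `name`, or 0 if none is found."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():          # only numeric entries are processes
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                first = f.readline()     # e.g. "Name:\tdemoAA\n"
        except OSError:
            continue                     # the process exited during the scan
        if first.startswith("Name:") and first.split(":", 1)[1].strip() == name:
            return int(entry)
    return 0

# Demonstration on a fake /proc tree so the example runs on any system.
fake_proc = tempfile.mkdtemp()
os.mkdir(os.path.join(fake_proc, "1234"))
with open(os.path.join(fake_proc, "1234", "status"), "w") as f:
    f.write("Name:\tdemoAA\nState:\tS (sleeping)\n")
found = search_pid("demoAA", fake_proc)
missing = search_pid("otherAA", fake_proc)
```

Returning zero for "not found" mirrors the convention in the text; zero is never the PID of a user process.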
Scenario No 2: The AA has been downloaded, and it is an FPGA-AA. In this case, 
the APPLOD has to check if there is any other FPGA-AA loaded into memory that has 
access to the FPGA. If there is no other FPGA-AA, the new AA is loaded after the 
available memory is checked. 
If there is an FPGA-AA that has reserved the FPGA resources (it has opened the card, its
memory address space is mapped to the FPGA space, etc.), the APPLOD has to take
extra care, since only one AA can have access to the FPGA board at a time.
If an AA tries to use the FPGA while another AA has access to it, the former will be
killed by the kernel (this is the default behaviour), because a memory violation occurs.
The APPLOD prevents this by performing the following operations:
i) It sends a control packet to the previous AA asking it to release the FPGA
resources. The control packet is one byte long and has the value 0x01. It is sent
to the AA via the same Unix domain socket path that is used for normal packets.
On receiving it, the AA has to release the FPGA resources.
ii) It checks if the AA has released the FPGA resources. This is achieved through
a LibGTop function accessing the /proc/maps file (Figure 5.15). If
inode=65410 is found, the FPGA is still reserved by the previous
AA; otherwise the FPGA is free to be used by the new AA and the status of the
previous AA is set to PSTAT2. The APPLOD penalises the previous AA if it
does not release the FPGA resources by killing it. In that case its mjf_counter is
also increased by one and its status is set to PSTAT1.
iii) It sends a control packet to the new AA to signal that the FPGA is free for use.
The value of the control packet in this case is 0x02. It then checks, via the
/proc/maps file, if the new AA has been granted access to the FPGA. If not, it
checks the inode again up to APPLIC_TRY times (mnf_counter is increased by one
each time) and, if the AA is ultimately unable to open the FPGA card (in
which case there is a bug in the code of the AA), all the incoming packets
stored in the PQ are sent back to the network.
If the new AA successfully opens the FPGA card, its status is set to PSTAT3.
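The three hand-over steps can be condensed into one function. The sketch below is a Python illustration under stated assumptions: `send`, `holds_fpga` and `kill` are hypothetical callbacks standing in for the Unix domain socket, the inode check and process termination; the value of APPLIC_TRY is invented because the text does not give one; and the penalty for not releasing the FPGA is counted as a major fault.

```python
RELEASE_FPGA = b"\x01"   # control byte: the current owner must release the FPGA
FPGA_FREE = b"\x02"      # control byte: the FPGA is free for the new AA
APPLIC_TRY = 3           # hypothetical value; the text gives no number

def hand_over(old, new, send, holds_fpga, kill):
    """Move FPGA ownership from `old` to `new`; return PSTATs and fault counts."""
    out = {"pstat": {}, "major": {old: 0}, "minor": {new: 0}}
    send(old, RELEASE_FPGA)                  # step i
    if holds_fpga(old):                      # step ii: device inode still mapped
        kill(old)
        out["major"][old] += 1
        out["pstat"][old] = 1                # killed: stored on disk only
    else:
        out["pstat"][old] = 2                # still loaded, FPGA released
    send(new, FPGA_FREE)                     # step iii
    for _ in range(APPLIC_TRY):
        if holds_fpga(new):
            out["pstat"][new] = 3            # running and owning the FPGA
            return out
        out["minor"][new] += 1
    return out                               # caller sends the PQ back to the network

# Well-behaved case: both AAs react to their control bytes.
owners = {"old-AA"}

def send(aa, byte):
    if byte == RELEASE_FPGA:
        owners.discard(aa)
    elif byte == FPGA_FREE:
        owners.add(aa)

good = hand_over("old-AA", "new-AA", send, lambda aa: aa in owners, owners.discard)

# Misbehaving case: neither AA reacts, so the old one is killed.
stubborn = {"old-AA"}
bad = hand_over("old-AA", "new-AA", lambda aa, b: None,
                lambda aa: aa in stubborn, stubborn.discard)
```

In the second run the new AA never opens the card, so it collects one minor fault per check and receives no PSTAT3.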
5.3.1.4 The Memory Monitor (MEMM) 
The Memory Monitor (MEMM) is a thread created by the CPR during system start-up. 
It "wakes up" every second and its main task is to monitor the physical memory 
utilised by each AA. MEMM performs several housekeeping operations, as described 
in the following sections. 
5.3.1.4.1 Locate the Active Applications Loaded in Memory 
Before the MEMM starts monitoring the memory status of the AAs, it has to detect 
which of them are currently loaded in memory. ActReg (Table 5.1) holds useful 
information about AAs that have been requested by several end-users. The MEMM 
accesses the ActReg and seeks AAs with status PSTAT2 or PSTAT3. AAs with any of 
these statuses are already loaded into memory. Their GMIDs are then copied to a 
private array in the memory address space of the MEMM. The copy is made because
access to the ActReg should be as fast as possible, since other threads share
it too. While a thread is accessing the ActReg, any other thread that tries to access
it at the same time is blocked.
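The idea of holding the ActReg only long enough to copy the GMIDs can be sketched as below. This is a simplified model under stated assumptions: the ActReg is reduced to a dictionary from GMID to status, and the class and method names are hypothetical.

```python
import threading

PSTAT2, PSTAT3 = 2, 3  # statuses of AAs that are loaded in memory


class ActReg:
    """Toy registry shared between threads and protected by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}  # GMID -> status

    def set_status(self, gmid, status):
        with self._lock:
            self._status[gmid] = status

    def loaded_gmids(self):
        """Copy the GMIDs of loaded AAs into a private list.

        The lock is held only for the duration of the copy, so other
        threads are blocked for as short a time as possible.
        """
        with self._lock:
            return [g for g, s in self._status.items() if s in (PSTAT2, PSTAT3)]
```

The monitoring work itself then runs on the private copy, outside the lock.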
5.3.1.4.2 Check the Major and Minor Faults 
After the MEMM has located the loaded AAs, it checks the number of major and
minor faults they have caused. If either of the two thresholds is reached, the
corresponding AA is killed and a message is sent to the RFE to stop forwarding
packets for this AA. Its status is also set to PSTAT5.
5.3.1.4.3 Check if the Active Applications are still "Alive" 
While AAs are loaded and executed in memory, one or more of them could for some
reason exit or become a "zombie". In this case, the ActReg should be updated in order
to avoid packet loss. The MEMM checks for AAs that have stopped executing, using the
search_PID module (Figure 5.18). If it returns zero, the AA has been
terminated and its status is set to PSTAT1. Any packets of this
application stored in the PQ are flushed into the network by the APPLOD, since it can
check (via the ActReg) that the PSTAT has changed to one.
If search_PID returns a non-zero value (the PID of the corresponding AA), it means
that the AA is still loaded in memory. It does not necessarily mean that the AA is
functioning correctly; it might have become a zombie process. Zombies are processes or
threads that have exited but whose "remains" are still in memory. There are two types
of zombies: thread zombies, which are threads that have exited but have not been
"joined", and process zombies, which are "child" processes that have exited but whose
"father" has not called wait().
The MEMM can detect if an AA has become a zombie through the /proc filesystem. In
this case, the AA is killed, its status is set to PSTAT1 and the mjn_counter is increased
by one.
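One way to recognise a zombie from the /proc filesystem is to read the state field of the per-process stat file, where Linux reports "Z" for a zombie. The sketch below only parses the text of such a line; it is an illustration, not the LibGTop-based check used by the MEMM, and the function names are assumptions.

```python
def process_state(stat_line):
    """Extract the one-letter state from the text of a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the last ')'.
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def is_zombie(stat_line):
    """Return True if the process described by stat_line is a zombie."""
    return process_state(stat_line) == "Z"
```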
5.3.1.4.4 Process Ageing 
When an AA is first loaded in memory, it is also timestamped. Timestamping is 
performed by saving the time (in the ActReg as Table 5.1 shows) the AA was first 
loaded in memory. AAs are also timestamped every time they execute after a switch 
(between AAs) takes place. Every new timestamp overwrites the previous one. 
If an AA has status PSTAT2, its process ageing is checked. Process ageing is defined
as the difference between the current time and the timestamp of the AA. If this
difference is more than a predefined interval (set to five hours), the AA is unloaded
from memory and its status is set to PSTAT1. Using this method, memory is saved because
an AA that has not been executed for a long time is unloaded.
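The ageing rule amounts to a single comparison, sketched below. Times are assumed to be in seconds, and the function and constant names are illustrative.

```python
AGEING_LIMIT_S = 5 * 60 * 60  # the predefined interval of five hours, in seconds
PSTAT2 = 2


def should_unload(status, timestamp, now, limit=AGEING_LIMIT_S):
    """Return True if an AA in PSTAT2 has not executed for longer than limit."""
    return status == PSTAT2 and (now - timestamp) > limit
```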
5.3.1.4.5 Memory Monitor 
In this phase, the memory monitoring takes place. The MEMM collects memory 
utilisation information from the /proc filesystem, through LibGTop, for every AA. 
It compares this with the MMU (Table 5.1), and if any AA has utilised more memory
than its MMU, it is killed. Then, its status is set to PSTAT1 and its mjn_counter is
increased by one.
The functionality of the MEMM is summarised in the following diagram: 
[Figure 5.19: Functionality of the Memory Monitor. Flowchart boxes: START; Find the
Active Applications loaded in memory; Kill AA, set it to PSTAT5 and inform RFE; Update
ActReg, increase mjn_counter; Kill AA, update ActReg and increase mjn_counter; Check
Process Ageing; Kill AA and update ActReg; Check memory utilisation; Kill AA, update
ActReg and increase mjn_counter; Go to Start.]
5.3.1.5 The CPU Monitor (CPUM) 
The CPU Monitor (CPUM) is a thread invoked by the CPR to monitor the CPU 
utilisation caused by each AA, as well as the total CPU utilisation in the AE. It uses 
LibGTop to gather CPU statistics from the /proc filesystem. 
The AAs are network applications. This means that the number of CPU cycles they
consume is proportional to the input bandwidth of the active traffic. In [PAW002],
the correlation between the packet size and the CPU utilisation for several network
applications is shown. Also, two types of costs (regarding CPU cycles) have been
defined: the per byte cost and the per packet cost [PAW002].
A generic CPU management framework that does not take into account the volume of 
the input active traffic and penalises AAs when they over-consume CPU cycles will be 
unfair for two reasons: 
i) CPU utilisation is proportional to the incoming active traffic and an AA is not 
responsible for high input traffic, 
ii) In the /proc filesystem, the CPU utilisation of each application is recorded (by
default). The unfairness here is that if an executing application is
temporarily suspended by the operating system because a hardware interrupt has to
be serviced by the CPU, the cost of servicing that interrupt is charged to the
suspended application, even if this application is not associated with the
interrupt.
5.3.1.5.1 CPU Utilisation Measurements 
To overcome these problems, AE has to check if high CPU utilisation is caused by the 
active traffic or not. 
Initially, the CPUM monitors the total CPU utilisation for a total period of sixty
seconds. This period will be called the monitoring period. Measurements are taken
every second from the /proc filesystem, using LibGTop. During the monitoring
period, a counter is increased by one every time the total CPU utilisation is higher
than the cpu_watermark (90 %). At the end of the monitoring period, the value of the
counter is checked against the following categories:
i) If the value of the counter is less than critical_tres_min (= 40 or 66.66 %), the
AE is assumed not to be overloaded,
ii) If the value of the counter is between critical_tres_min and critical_tres_max
(= 50 or 83 %), the AE is assumed to be loaded but not overloaded,
iii) If the value of the counter is greater than critical_tres_max, the AE is
overloaded.
In case (iii) immediate action should be taken to relieve AE from high input load, in 
case (ii) no action is taken and in case (i), the AE could handle more traffic. 
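The counting and classification described above can be sketched as follows. The thresholds are the values quoted in the text; the function names and the treatment of a counter exactly equal to a threshold (counted here as "loaded") are assumptions.

```python
CPU_WATERMARK = 90.0      # per-sample threshold on total CPU utilisation (%)
CRITICAL_TRES_MIN = 40    # out of 60 one-second samples
CRITICAL_TRES_MAX = 50


def count_high_samples(samples, watermark=CPU_WATERMARK):
    """Count the samples of one monitoring period that exceed the watermark."""
    return sum(1 for s in samples if s > watermark)


def classify_load(counter):
    """Map the end-of-period counter onto categories (i), (ii) and (iii)."""
    if counter < CRITICAL_TRES_MIN:
        return "not overloaded"   # case (i): the AE could handle more traffic
    if counter > CRITICAL_TRES_MAX:
        return "overloaded"       # case (iii): immediate action is needed
    return "loaded"               # case (ii): no action is taken
```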
5.3.1.5.2 Traffic Shaping Requests 
If the AE is being overloaded, the CPUM requests from the RFE to reduce the active 
traffic it forwards from the network (Figure 5.1). Traffic is shaped step-by-step. For 
example if the step is chosen to be 0.1, then the traffic is shaped by 10 % after a 
request is sent. This step is called the shaping step. 
The CPUM may issue more than one request if the CPU utilisation continues to be 
over the critical_tres_max threshold. It makes a decision after each monitoring period 
ends and sends a request if it is necessary. How active traffic is shaped is described in 
Chapter 6. 
A request can ask for traffic increases too. If the total CPU utilisation is less than 
critical_tres_min, the CPUM requests an increase of active traffic, if a decrease 
request was sent in the past. 
All requests are sent using a TCP connection. The same connection is used by the
APPLOD and the MEMM to inform the RFE if an AA is in PSTAT5 or PSTAT6.
[Figure 5.20: Traffic Shaping Requests. Components shown: the CPUM and the AF in the
AE; the Traffic Shaper and the Forwarding Engine in the RFE. Arrows: shape requests;
active traffic.]
Different shaping steps have been tested and the results are presented in Chapter 6. 
5.3.1.5.3 Penalising Active Applications for High CPU Utilisation 
The CPUM is trying to relieve the AE, in the case of high input traffic by requesting 
traffic shaping. Then, the RFE shapes the traffic. 
If a bogus AA is loaded in the memory of the AE, CPU utilisation will remain high 
even if traffic is shaped. In this case, the CPUM tries to detect and isolate the 
offending AA. 
Requests for traffic shaping (decrease of active traffic) will stop once 0.9/S requests
have been sent, where S is the shaping step. A second round of CPU measurements will
then start, aiming to detect the offending AA. If, for example, S is 0.1, at most nine
requests will be made for traffic shaping before the second round starts.
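The arithmetic behind the 0.9/S limit can be sketched as below. The sketch assumes that every request removes a further fraction S of the original traffic, which is consistent with the statement that traffic has been reduced by 90 % when the second round starts. The function names are illustrative.

```python
def max_shaping_requests(step):
    """Number of decrease requests sent before the second round starts (0.9/S)."""
    # round() guards against floating-point noise such as 0.9 / 0.1 = 9.000000000000002
    return round(0.9 / step)


def traffic_fraction(requests_sent, step):
    """Fraction of the original active traffic still redirected to the AE."""
    return max(0.0, 1.0 - requests_sent * step)
```

With S = 0.1, nine requests leave roughly one tenth of the original active traffic.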
The "second round" of measurements starts, measuring not the total CPU utilisation, 
but the CPU cycles utilised by each loaded AA. The measuring period is sixty seconds 
and the CPUM performs the measurements every second. It first scans the ActReg to 
find the loaded AAs. Then, for convenience, it creates an array loaded in memory that 
stores the following information for every AA: 
Field        Description
GMID         Global Module ID of the AA
Cpu_util     % sum of the CPU cycles utilised in user and kernel space by the AA
Tmp          Timestamp of the AA, it keeps the time when the measurement took place
Violations   Number of times CPU utilisation caused by the AA was over the AA_watermark
Table 5.5: The array entry CPUM creates for every loaded AA
• GMID, is the Global Module ID of the AA,
• cpu_util, is the % sum of the CPU cycles utilised in user and kernel space by the AA
within the last and the current measurement. It is computed using the following
formula:
\[ \mathrm{cpu\_util}(n) = \frac{\mathrm{Cycles}[n] - \mathrm{Cycles}[n-1]}{\mathrm{tmp}[n] - \mathrm{tmp}[n-1]} \qquad [5.1] \]
Cycles[n] are the CPU cycles of the nth measurement given by the /proc filesystem for 
the specific AA, tmp[n] is the time the nth measurement took place, Cycles[n-1] are 
the CPU cycles measured in the previous measurement and tmp[n-1] the time when 
the previous measurement took place. 
tmp[n]-tmp[n-1] should be one second, since measurements are repeated every
second. This is not always the case, because Linux is not a real-time operating system
and the CPUM is a user-space thread, so under heavy load it is not guaranteed that CPU
measurements will be taken exactly every second.
• Tmp is the time when the current measurement took place. It is used in the above
formula.
• violations. During the second round of measurements, when the CPUM tries to 
locate any offending AA, traffic has been reduced by 90 %. CPU utilisation should not 
be high under these conditions. When a measurement is taken (using formula 5.1), the 
output is compared to the AA_watermark, which is 0.9. Every time the CPU utilisation for
a given AA is higher than the AA_watermark, its violations counter is increased.
When the monitoring period of the second round ends (after sixty seconds), the
violations counter of every AA is checked. If the value of any counter is over
critical_tres_max, the corresponding AA is assumed to be the offending one. It is then
killed, its status is set to PSTAT1 and its mjn_counter is increased by one.
Next, a new monitoring period (the first round of measurements) starts, measuring the 
total CPU utilisation. 
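The second-round bookkeeping for one AA can be sketched as below, using formula 5.1 on consecutive samples. As an assumption for the example, Cycles is expressed in CPU seconds so that the result is a fraction directly comparable with the AA_watermark of 0.9; the function names are illustrative.

```python
AA_WATERMARK = 0.9
CRITICAL_TRES_MAX = 50


def cpu_util(cycles_now, cycles_prev, t_now, t_prev):
    """Formula 5.1: CPU consumed per unit of elapsed time between two samples."""
    return (cycles_now - cycles_prev) / (t_now - t_prev)


def count_violations(samples, watermark=AA_WATERMARK):
    """samples is a list of (tmp, cycles) pairs in measurement order."""
    violations = 0
    for (t_prev, c_prev), (t_now, c_now) in zip(samples, samples[1:]):
        # Dividing by the measured interval compensates for samples that
        # were not taken exactly one second apart.
        if cpu_util(c_now, c_prev, t_now, t_prev) > watermark:
            violations += 1
    return violations


def is_offending(violations, limit=CRITICAL_TRES_MAX):
    """An AA is assumed to be the offending one if its counter exceeds the limit."""
    return violations > limit
```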
5.3.1.6 The Packet Injector (PI) 
The task of the Packet Injector is to re-inject active packets into the network. It
receives packets from the AAs, the APPLOD or the CPR via a UNIX domain socket. The
pathname the PI uses to receive packets is "/usr/active_router/reinjectp".
PI recomputes the UDP checksum of the packets prior to injecting them in the 
network. The mechanism for injecting packets into the network is the Libnet 
[LIBWWW] software interface. Libnet is a generic networking API that provides 
access to several protocols. It uses the raw sockets mechanism to send packets to the 
network. 
PI uses three main functions of Libnet:
• libnet_do_checksum(); this is a function that computes the IP, UDP or TCP
checksums. Input parameters are the type of the checksum (IP, UDP or TCP), the
offset in the packet data where the data starts and the packet size. It returns the
checksum, a 16-bit number.
• libnet_open_raw_sock(); this is a function that creates a socket. Its input parameter
is the protocol of the raw socket. It returns the socket descriptor.
• libnet_write_ip(); this is a function that sends packets into the network. Its input
parameters are the socket descriptor, the packet to send and the packet size.
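For illustration, the 16-bit one's-complement checksum that underlies the IP, UDP and TCP checksums can be written in a few lines. This is a stand-alone sketch of the standard Internet checksum, not a call into Libnet; a real UDP checksum is additionally computed over a pseudo-header, which is omitted here.

```python
def internet_checksum(data):
    """Return the 16-bit one's-complement Internet checksum of data (bytes)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF
```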
5.3.1.7 The Safety Process (SPR) 
The Safety Process (SPR) is a user-space process resident in the memory of the AE 
and its task is to check if the core processes of the AE (CPR, APPLOD, MEMM, 
CPUM and PI) are loaded in memory. Any of the above processes may terminate in 
the case of an error (software bug). 
Another possibility is a bogus AA that allocates a huge chunk of memory before the
MEMM has time to react. The MEMM performs memory monitoring every
second, so if a bogus AA allocates a lot of memory within a period shorter than a
second, there is a threat to the rest of the processes (CPR, PI etc.). An example of such
a threat will be described later in this chapter.
5.3.1.7.1 Functionality of the Safety Process
The SPR wakes up every second and searches the /proc filesystem (through the
search_PID module shown in Figure 5.18) for the process IDs (PIDs) of the CPR and
the PI. If either of these two PIDs is zero (meaning that the process is no longer loaded
in memory), the corresponding process is reloaded into memory. If any of the other
threads (MEMM, CPUM, and APPLOD) terminates, the CPR will terminate too,
because it is their parent process. So checking for the PID of the CPR is an indirect
mechanism to check if all threads of the AE are loaded in memory.
Next, the SPR checks if any of the core processes have become zombies. If this is the 
case, the zombie process is killed and reloaded. Also, the memory utilisation of the 
CPR is monitored. 
The SPR performs these checks every second over a period of sixty seconds.
Every time a process is found unloaded or a zombie, or the memory utilisation is over
the memory threshold, the value of a safety counter is increased. After the test period
of sixty seconds has elapsed, the value of the safety counter is checked and, if it is
over a critical threshold, a message is sent to the RFE to stop forwarding active
packets. Also, a message is printed on the monitor of the RFE to warn the network
administrator that the AE is in an unstable condition. If the value of the safety counter
is below the critical threshold, a new period of checks starts.
5.3.1.7.2 Out of Memory Management (OOM) 
If there is high memory utilisation, Linux tries to free memory by swapping memory
pages out to the hard disk. If there is no free swap space and free memory is still low,
the Out of Memory Management (OOM) is invoked.
The OOM is a built-in Linux functionality that is activated when the amount of the 
free physical memory drops below a threshold value. OOM calls a function called 
select_bad_process() that is responsible for choosing a process to kill. It decides by 
stepping through each running task and calculating how suitable it is for killing with 
the function badness() [GOR04]. The badness of a loaded process is calculated using 
the following formula [GOR04]: 
\[ \mathrm{badness\_for\_task} = \frac{\mathrm{total\_vm\_for\_task}}{\sqrt{\mathrm{cpu\_time\_in\_seconds}} \cdot \sqrt[4]{\mathrm{cpu\_time\_in\_minutes}}} \qquad [5.2] \]
total_vm_for_task is the amount of virtual memory utilised by a process;
cpu_time_in_seconds and cpu_time_in_minutes are the CPU time used by the process
in seconds and minutes respectively.
Using the above formula, the OOM tends to choose a process that is using a large
amount of memory but is not that long lived.
If there is high memory utilisation due to a bogus AA, Linux will choose a process to
kill using the above formula, but it is not guaranteed that the correct process will be
chosen.
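A small numerical sketch of formula 5.2 shows why a short-lived process with a large memory footprint scores highest. It follows the formula as given in the text rather than the kernel source; the function name, the argument names and the handling of a zero denominator are assumptions.

```python
import math


def badness(total_vm, cpu_time_in_seconds, cpu_time_in_minutes):
    """Formula 5.2: memory size divided by sqrt(seconds) * fourth root(minutes)."""
    denominator = math.sqrt(cpu_time_in_seconds) * math.sqrt(math.sqrt(cpu_time_in_minutes))
    if denominator == 0:
        return float(total_vm)  # assumption: skip the division for brand-new tasks
    return total_vm / denominator
```

For the same amount of virtual memory, a task that has used 4 seconds of CPU time scores ten times higher than one that has used 100 seconds and 16 minutes in the units of the formula.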
5.3.1.7.3 Out of Memory Management Test 
In order to test the OOM, an AA was modified in order to allocate more and more 
memory as it was servicing active packets. The MEMM at this point was deactivated. 
Every time the AA receives a packet, it allocates 5000 bytes of memory. A test 
network was set up, as shown in Figure 5.1. The host acts as a packet source injecting 
active packets (of Type 0) into the network. Their length was 200 bytes and the packet 
rate was 30,000 packets/sec. The graph shown in Figure 5.17 was produced. 
In the horizontal axis, the time (in seconds) since the start of the test is shown. In the 
vertical axis, there is the utilisation (%) of the memory and the swap space. 
As shown in Figure 5.17, memory utilisation is very high about 125 seconds after the
beginning of the test. This happens because the AA has allocated a huge amount of
memory. At this point, the kswapd daemon wakes up and tries to free memory by
swapping memory pages out to the hard disk (as described in section 5.3.1.3.3).
After 475 seconds, swap space utilisation is almost 100 % and there is no free swap
space. At this point, the OOM is invoked and an application has to be killed.
After repeating the same test several times, it was noticed that after the OOM
exits, not only was the bogus AA killed but the CPR too. This is unfair to the CPR
since it was not responsible for the high memory utilisation. It was probably killed
either because Formula 5.2 gave a wrong result or because the APPLOD was the
process that invoked the AA and it should be killed too. The APPLOD is a thread
created by the CPR. So, the APPLOD is a child process and the CPR the parent
process. If for some reason a child process is killed, the parent process dies too. It is
very probable that the APPLOD was terminated by the OOM; therefore the CPR was
terminated too.
As stated before, the MEMM was deactivated for this test. With the MEMM active, such a
situation is very unlikely to take place, because an offending AA will be terminated.
If it does happen and the CPR is unfairly terminated, the SPR will reload it and will
inform the network administrator (via the RFE) if this happens frequently.
5.3.1.8 Recovery Procedure performed by the Core Process 
In the previous section, it was described how the CPR is reloaded by the SPR in the
case of a failure. After the CPR is reloaded, the ActReg will have no data, because it
was resident in the memory address space of the CPR and this part of the memory was
released after the failure. However, one or more AAs might still be loaded in
memory. Their status has to be determined and the ActReg has to be re-created
accordingly, otherwise the operation of the AE will be problematic.
In section 5.3.1.3.1, the downloading of an AA from a CS was described. After an AA 
is successfully downloaded, an entry is created into the ActReg saving valuable 
information about the specific AA (Table 5.1). The same data saved in ActReg are also 
saved into a file called the registry file. 
The registry file is used as a backup file and it is useful when a failure takes place and
the CPR has to be reloaded. When the CPR is reloaded, it opens the registry file to
find the AAs that have been stored on the hard disk. It then uses the search_PID
module (Figure 5.18), described in section 5.3.1.3.3, to check if any of the stored AAs
are loaded in memory. If an AA is loaded and it is a non FPGA-AA, its status is
directly set to PSTAT3.
For loaded FPGA-AAs, there is a different approach since only one of them can have
access to the FPGA board at a time. When the CPR detects a loaded FPGA-AA, it
checks if the specific application has access to the FPGA board. This is performed via
LibGTop and the /proc/maps file (Figure 5.15). If it has access to the board, its status
is directly set to PSTAT3. Any other loaded FPGA-AAs are set to PSTAT2.
If FPGA-AAs are loaded and none of them has access to the FPGA board, their status
is set to PSTAT2.
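The status rules above can be sketched as a single pass over the registry entries. This is a simplified model: the entry format, the two callables standing in for search_PID and the /proc/maps check, and the decision to leave AAs that are not loaded out of the result are all assumptions made for the example.

```python
PSTAT2, PSTAT3 = 2, 3


def recover_statuses(entries, is_loaded, has_fpga):
    """Rebuild the status of every AA that is still loaded in memory.

    entries   -- list of (gmid, is_fpga_aa) pairs read from the registry file
    is_loaded -- hypothetical callable: gmid -> bool (stands in for search_PID)
    has_fpga  -- hypothetical callable: gmid -> bool (stands in for the maps check)
    """
    statuses = {}
    for gmid, is_fpga_aa in entries:
        if not is_loaded(gmid):
            continue  # AAs that are not loaded are not covered by this sketch
        if not is_fpga_aa or has_fpga(gmid):
            statuses[gmid] = PSTAT3  # ordinary AA, or the FPGA-AA holding the board
        else:
            statuses[gmid] = PSTAT2  # loaded FPGA-AA without access to the board
    return statuses
```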
After the recovery procedure has ended, ActReg has been partially reconstructed. The 
only fields that cannot be recovered are the major and minor faults (Table 5.1). 
The recovery procedure is summarised using the diagram below. Data_size is the size
of each entry (Table 5.1), which is 32 bytes. File_size is the size of the
registry file in bytes, and file_offset the offset in bytes into that file.
[Figure 5.21: The Recovery Procedure. Flowchart boxes: Load the registry file into
memory; Read file at offset file_offset as long as file_offset < file_size; Set AA to
PSTAT2; Set AA to PSTAT3; file_offset = file_offset + data_size.]
5.4 Summary 
This chapter has described the software architecture of the Active Engine (AE). It 
consists of seven core user-space processes and one kernel-space module. 
The Active Filter (AF) forwards packets from user-space to kernel-space. It is a 
Loadable Kernel Module (LKM). 
The Core Process (CPR) hosts the incoming active packets and sends appropriate
requests to the Application Loader (APPLOD) for loading the requested Active
Applications (AAs).
The APPLOD contacts a Code Server and downloads the AAs using a secure channel.
It then checks the integrity of the downloaded AAs. After downloading, it loads the
AAs into memory and takes extra care if there is more than one FPGA-AA. It
performs a switch, making the FPGA board available to each FPGA-AA for a
period of time.
The CPU Monitor (CPUM) monitors the total CPU utilisation and the CPU cycles 
consumed by every AA. It performs two rounds of measurements. During the first 
round, it monitors the total CPU utilisation, and if this is over some pre-defined limits, 
requests for traffic shaping are sent to the RFE. After 90 % of the total number of
requests has been sent and if CPU utilisation is still high, the second round takes
place. During this round, the CPU cycles spent by each loaded AA are monitored and, if
they are above some limits, the corresponding AA is unloaded.
The Memory Monitor (MEMM) monitors the memory utilisation of each AA and, if it
is over a pre-defined limit, the AA is unloaded. This thread also checks if any of the
AAs has been unloaded, has become a zombie or has caused too many faults, and takes
the appropriate action.
The Safety Process (SPR) monitors the status of the core processes of the AE and 
reloads them in case they exit. 
The Packet Injector (PI) re-injects packets back to the network. 
Chapter 6 
Architecture of the Routing and 
Forwarding Engine 
6. Architecture of the Routing and Forwarding 
Engine (RFE) 
6.1 Introduction 
The architecture of the Routing and Forwarding Engine (RFE) consists of three user-
space processes and a cluster of kernel-space loadable modules, as shown in the figure 
below: 
[Figure 6.1: The Software Architecture Layout of the Routing and Forwarding Engine.
Labels shown: Safety Module, Traffic Shaper and LKM Loader in user-space; kernel-space
below.]
The cluster of the LKMs is located in kernel-space and their task is to redirect active
packets to the AE. Each LKM redirects packets that carry a specific GMID. LKM_N
redirects packets that carry the number N as their GMID.
The Safety Module (SFM) polls the AE at specific intervals to test if it is still
functioning. Polling is achieved by sending ICMP requests. If no ICMP reply is
received within a timeout period, the SFM assumes that the AE has crashed and then
unloads all LKMs from kernel-space. Once the LKMs are unloaded, active packets are
not redirected to the AE but just routed to the next hop.
The Traffic Shaper (TRS) has two tasks: (i) it shapes the traffic the RFE redirects to
the AE after the AE has sent a request for traffic shaping and (ii) it removes a specific
LKM, if the AE has requested this (if an AA has been set to PSTAT5 or PSTAT6).
The LKM Loader (LKL) is a thread invoked by the TRS. Its task is to monitor which 
LKMs have been unloaded from kernel-space and when that took place. Unloaded 
LKMs are reloaded after time T has elapsed, so that AAs in PSTAT5 or PSTAT6 have
the chance to execute again after this period.
6.2 Redirecting Active Packets to the Active Engine 
Figure 5.1 is shown again for convenience: 
Figure 6.2: The Fundamental Parts of the Active Router placed into the Network 
Active packets are sent from the Source to the Destination host. Active Networking is
transparent to the end-users; therefore they do not have to know the IP addresses of
the RFE or the AE.
The RFE is a Linux router that routes packets from one port to another. Active
packets have to be forwarded to the AE. In this case, the destination IP address of the
packets is modified. This is achieved by using LKMs, described in sections 5.3.1.1.1
and 5.3.1.1.2. There is one LKM for every AA and it is registered in the
NF_IP_PRE_ROUTING hook of the kernel-space of the RFE (Figure 6.3).
[Figure 6.3: Replacing the Destination IP Address of the Active Packets by using LKMs.
The diagram shows the RFE kernel-space with one LKM per GMID (LKM[GMID1] ...
LKM[GMIDN]). Labels: From Source (NIC1), To Destination (NIC2), To AE (NIC3),
redirection, default routing.]
Every LKM acts on a specific type of active packet by checking the GMID it carries
and then performs the following operations:
i) It replaces the destination IP address of each packet with the IP address of the AE.
The packet is then redirected to the AE by the default routing mechanism. The original
destination IP address of each packet is stored in the Active Header (in the last_node
field, Table 4.1). This is important since every active packet has to reach its final
destination after it has been processed by an AA.
Before the LKM replaces the IP address, it checks the source MAC Address of each
packet, in order to avoid re-sending packets back to the AE; therefore packets sent from
the AE are routed to the destination host. The PI, described in section 5.3.1.6, copies
the stored (original) destination IP address from the Active Header back to the IP
header, after it receives an active packet from an AA.
ii) After the original IP Address has been replaced with the IP Address of the AE, the
IP and UDP checksums of each packet are recomputed, otherwise packets would be
dropped by the kernel of the AE.
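The per-packet decision made by one LKM can be modelled in user-space terms as below. This is only a sketch of the logic: the real code is a kernel module working on socket buffers, the packet is reduced here to a dictionary, and the MAC address and field names other than last_node are hypothetical. The AE address is the one quoted in Figure 6.4.

```python
AE_IP = "192.168.23.2"          # IP address of the AE, as used in Figure 6.4
AE_MAC = "00:00:00:00:00:ae"    # hypothetical MAC address of the AE


def redirect(packet, lkm_gmid):
    """Redirect one active packet to the AE; return True if it was rewritten."""
    if packet["gmid"] != lkm_gmid:
        return False            # this LKM handles a different GMID
    if packet["src_mac"] == AE_MAC:
        return False            # already processed by the AE: route it normally
    packet["last_node"] = packet["dst_ip"]  # remember the original destination
    packet["dst_ip"] = AE_IP
    # The IP and UDP checksums are now stale and must be recomputed (step ii).
    packet["needs_checksum"] = True
    return True
```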
A single LKM could be used for all active packets but, by using one LKM for every
different GMID, the RFE can stop forwarding packets with a specific GMID on demand.
For example, if a bogus AA executes in the AE, the MEMM will kill it, set its status to
PSTAT5 and then inform the RFE to stop forwarding packets for this AA. This is
performed by simply unloading the corresponding LKM. The specific active packets
will then be handled by the default routing operation and they will be sent to the
destination host, instead of being redirected to the AE.
6.3 The Safety Module (SFM) 
As stated in section 5.2, the functionality of the Active Router is provided, for safety 
reasons, by two hosts (RFE and AE) (Figure 6.2). If the AE crashes as the result of a 
bogus AA, the RFE will not be affected and the routing of packets will continue as 
normal. 
The task of the SFM is to poll the AE and check if it is still functioning. The AE is
polled by sending ICMP requests every polling period, which is equal to two seconds.
After the polling period ends, the SFM waits for a timeout period that is set to
five seconds. If no ICMP reply is received by then, the RFE assumes that the AE has
crashed and all LKMs (Figure 6.3) are unloaded. Once the LKMs are unloaded, active
packets are not redirected to the AE but routed to the next hop.
The SFM will continue sending ICMP requests and, as soon as it receives an ICMP reply
from the AE, the LKMs are reloaded and redirection takes place again.
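The unload and reload behaviour can be sketched as a two-state machine driven by the outcome of each poll. The sketch abstracts away the ICMP traffic and the timers: each input value simply states whether a reply arrived before the timeout, and the function name and action labels are assumptions.

```python
def sfm_actions(replies):
    """Return the LKM actions triggered by a sequence of poll outcomes.

    replies -- iterable of booleans, True if an ICMP reply arrived in time.
    """
    lkms_loaded = True
    actions = []
    for got_reply in replies:
        if lkms_loaded and not got_reply:
            actions.append("unload")   # AE assumed crashed: stop redirecting
            lkms_loaded = False
        elif not lkms_loaded and got_reply:
            actions.append("reload")   # AE is back: redirect again
            lkms_loaded = True
    return actions
```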
6.3.1 Structure of the Safety Module 
The structure of the SFM consists of one user-space process, resident in the memory 
address space of the RFE. This process has two parts: one for sending ICMP requests 
and another for receiving ICMP replies. 
6.3.1.1 Sending ICMP Requests 
The mechanism for sending ICMP requests is the Libnet API [LIBWWW]. The SFM 
uses three functions provided by Libnet: 
• libnet_build_ip() is used to build an IP packet. Input parameters are all the IP
header-related information such as IP addresses, type of service etc.
• libnet_build_icmp() is used to build the ICMP header carried by the IP packet.
Input parameters are the type of the ICMP packet (ICMP_ECHO), the payload of the
packet etc.
• libnet_write_ip() is a function that sends packets into the network. Its input
parameters are the socket descriptor, the packet to send and the packet size.
6.3.1.2 Receiving ICMP Replies 
If the AE is still running, it will send an ICMP reply after it receives the ICMP echo 
sent by the RFE. ICMP replies are received by the kernel networking code of the 
RFE, as shown in Figure 5.3. The SFM is a user-space process, so ICMP replies have 
to be forwarded (redirected) from kernel-space to user-space. This is achieved by 
using iptables, a user-space tool of netfilter. Packets that are sent from the AE and 
carry an ICMP reply are directed from the NF_IP_PRE_ROUTING hook to user-
space (SFM), via the ip_queue LKM (Figure 6.5). 
Iptables is a packet filter mechanism whose input parameters are passed through the
command line interface. In order to automate this, a script passes the appropriate
information to iptables. The script contains the following:
insmod /usr/src/linux-2.4.20/net/ipv4/netfilter/ip_queue.o 
iptables -A INPUT -s 192.168.23.2 -p icmp --icmp-type 0 -j QUEUE 
Figure 6.4: Script that loads the ip_queue Module and queues ICMP Packets from Kernel to 
User-Space 
The first line loads the ip_queue LKM into the kernel memory. The second line passes
the command line parameters to iptables, so that all ICMP reply packets (type 0) sent
by the AE (its IP address is 192.168.23.2) are queued to user-space (SFM).
[Figure 6.5: ICMP Replies forwarded from Kernel to User-Space. Labels shown: Network;
ICMP reply sent by the AE; ip_queue in kernel-space; SFM in user-space.]
6.4 Communication between the Routing and Forwarding 
Engine and the Active Engine 
The RFE and the AE use two different types of communication. The first type is the 
forwarding of active packets as described in section 6.2. The second type of 
communication is the transmission of control messages through a TCP connection 
(Figure 5.20). 
Control messages are sent from the AE to the RFE in the following situations: 
i) The CPUM sends requests for traffic shaping. There are two types of traffic
shaping requests: increase_active_traffic and reduce_active_traffic. The type
of each request depends on the CPU utilisation in the AE.
ii) If an AA has caused too many minor or major faults, it is killed by the MEMM.
Its status is set to PSTAT5 and a message is sent to the RFE to stop redirecting
packets for this AA for a period of time.
iii) The APPLOD can set an AA to PSTAT6 if the CSs cannot provide it or when 
the downloading of its code has failed too many times. The APPLOD sends a 
message to the RFE to stop redirecting packets for this AA. 
iv) When the SPR detects that the core processes of the AE have been terminated 
too many times, it sends a message to inform the RFE that the AE is in an unstable
condition. In this case, the RFE unloads all the LKMs and no more active 
packets are directed to the AE. It also prints a warning message on the screen 
to warn the network administrator. 
The protocol used to carry the control messages is built over TCP, and its message
format is shown in the table below:
Field    Size (bits)
Type     16
GMID     32
Table 6.1: Protocol semantics used for the control messages
The Type of the control packet specifies the action the RFE should take. There are four
possible Types:
i) Reduce_active_traffic (Type 0) is the type of control packet the CPUM sends
to the RFE in the case of high CPU utilisation, to reduce the active traffic it
redirects.
ii) Increase_active_traffic (Type 1) is the type of control packet the CPUM
sends to the RFE in the case of low CPU utilisation, to increase the active
traffic it redirects.
iii) APPL_IN_ST5 (Type 2) is sent by the APPLOD, the CPUM or the MEMM
when an AA is set to PSTAT5 or PSTAT6. In this case, the corresponding LKM
is removed from the kernel-space for a period of time.
iv) AE_IN_PANIC (Type 3) is sent by the SPR when the core processes of the AE
have been terminated too many times. All LKMs are removed from the kernel-space.
GMID is the Global Module ID of an AA and is meaningful only when control packets
of Type 2 are sent. In this case, the RFE unloads the LKM associated with the GMID
(Figure 6.3) and packets of this AA are not redirected to the AE.
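The message format of Table 6.1 can be illustrated with a minimal sketch. Python is used here for illustration only. The byte order (network order), the absence of padding between the two fields and all function and constant names are assumptions, since the text specifies only the field sizes and the four Type values.

```python
import struct
from enum import IntEnum


class ControlType(IntEnum):
    """The four control message Types listed above."""
    REDUCE_ACTIVE_TRAFFIC = 0
    INCREASE_ACTIVE_TRAFFIC = 1
    APPL_IN_ST5 = 2
    AE_IN_PANIC = 3


# Assumed wire layout: 16-bit Type followed by 32-bit GMID,
# network byte order, no padding (hypothetical, not stated in the text).
CONTROL_FORMAT = "!HI"
CONTROL_SIZE = struct.calcsize(CONTROL_FORMAT)


def pack_control(msg_type, gmid=0):
    """Build the bytes of one control message."""
    return struct.pack(CONTROL_FORMAT, int(msg_type), gmid)


def unpack_control(data):
    """Decode one control message into (ControlType, GMID)."""
    type_value, gmid = struct.unpack(CONTROL_FORMAT, data[:CONTROL_SIZE])
    return ControlType(type_value), gmid
```

Under these assumptions each control message occupies six bytes, and the GMID field is simply ignored by the receiver for Types 0, 1 and 3.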
6.5 Loadable Kernel Module Loader (LKL) and Traffic 
Shaper (TRS) 
The LKL and the TRS are parts of the same process. The TRS is the main process,
which invokes the LKL as a thread (Figure 6.1).
6.5.1 The Traffic Shaper (TRS) 
The TRS has two main tasks, depending on the control messages it receives from the
AE (Table 6.1).
6.5.1.1 Unloading the Loadable Kernel Modules 
As described in the previous sections, the APPLOD, the MEMM or the CPUM can, for
several reasons, unload an AA and set its status to PSTAT5 or PSTAT6. They then
inform the RFE about the removal of the specific AA by sending a control packet of
Type 2.
After receiving this control packet, the TRS checks the GMID carried in the control
header (Table 6.1) and unloads the specific LKM (Figure 6.3). For convenience, the
name of each LKM is kernel_filter[GMID].o, where GMID is the Global Module ID
of the AA whose packets are redirected by the specific LKM. For example, if an AA
has a GMID equal to three, the LKM that redirects its packets to the AE is named
kernel_filter3.o.
The unloading of the LKMs is performed using the UNIX command rmmod.
Supposing the MEMM unloads an AA with GMID=3, it sends a control packet (Type
2) carrying the number 3 as the GMID in the control header. The TRS, upon receiving
the control packet, removes the kernel_filter3.o LKM by using the command: "rmmod
kernel_filter3.o". The specific LKM is removed from the kernel-space of the RFE and
any packets with GMID=3 are routed to the next hop instead of to the AE.
If the TRS receives a Type 3 packet, it unloads all the LKMs (Figure 6.3).
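The unloading logic described above can be sketched as follows. This is an illustration only: the function names and the set of currently loaded GMIDs are hypothetical, and the sketch only builds the rmmod command lines (in the form quoted in the text) without executing them.

```python
def lkm_name(gmid):
    """File name of the LKM for a GMID, following kernel_filter[GMID].o."""
    return "kernel_filter%d.o" % gmid


def unload_commands(msg_type, gmid, loaded_gmids):
    """Return the rmmod command lines for one control packet.

    loaded_gmids is a hypothetical set of the GMIDs whose LKMs are
    currently loaded; nothing is executed here.
    """
    if msg_type == 2:
        # APPL_IN_ST5: unload only the LKM of the given AA.
        if gmid in loaded_gmids:
            return [["rmmod", lkm_name(gmid)]]
        return []
    if msg_type == 3:
        # AE_IN_PANIC: unload every LKM.
        return [["rmmod", lkm_name(g)] for g in sorted(loaded_gmids)]
    # Types 0 and 1 are traffic shaping requests and unload nothing.
    return []
```

For the example in the text, a Type 2 packet with GMID=3 yields the single command line rmmod kernel_filter3.o.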
6.5.1.2 Traffic Shaping 
6.5.1.2.1 Queues and Queuing Disciplines in Linux 
The main components of the Linux QoS architecture are queuing disciplines (qdiscs),
classes, classifiers (or filters) and meters. When IP queues a packet on the outgoing
interface, the packet is matched against the available filters, or classifiers. Filters have
an associated priority, so a filter with a higher priority is matched first. The purpose of
filters is to classify the packet into some traffic class. Each class owns a physical
queue to actually hold packets after the filter has assigned the packet to a class. After
a packet has been queued in the qdisc of its class, it is the qdisc's responsibility to eject
it. The ejection may be based on rate disciplines like token bucket filters, priority filters or
round robin [GAR02].
The place in the kernel networking code, where qdiscs (traffic control) are placed, is 
shown in the figure below [GAR02]: 
Figure 6.6: Traffic Control in the Kernel Network Stack 
6.5.1.2.2 Classful Qdiscs 
Classful qdiscs are useful when different kinds of traffic should have different
treatment. For example, suppose the incoming traffic should be classified into web
traffic, interactive traffic and any other traffic. By using the appropriate qdiscs,
bandwidth could be guaranteed to each type of traffic (e.g. 20 % to web traffic, 30 %
to interactive traffic and 50 % to the rest of the traffic).
6.5.1.2.3 The Token Bucket Filter (TBF) [HUB00]
The TBF is a simple qdisc that only passes packets arriving at a rate that does not
exceed some administratively set rate, but with the possibility of allowing short bursts in
excess of this rate.
The TBF implementation consists of a buffer (bucket), constantly filled by some
virtual pieces of information called tokens, at a specific rate (token rate). The most
important parameter of the bucket is its size, that is, the number of tokens it can store.
Each arriving token collects one incoming data packet from the data queue and is then
deleted from the bucket. There are three possible scenarios [HUB00]:
i) The data arrives in TBF at a rate that is equal to the rate of incoming tokens.
Then each incoming packet has its matching token and passes the queue
without delay.
ii) The data arrives in TBF at a rate that is smaller than the token rate. In this case
only a part of the tokens is deleted, so the tokens accumulate up to the bucket
size. The unused tokens can then be used to send data at a speed that exceeds
the standard token rate, if short data bursts occur.
iii) The data arrives in TBF at a rate that is greater than the token rate. In this case,
the bucket will soon be devoid of tokens, which causes the TBF to throttle
itself for a while. If packets keep coming in, they will start to get dropped. It is
here where traffic shaping starts.
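The three scenarios can be reproduced with a simplified token bucket model. This is a sketch and not the Linux TBF implementation: it assumes one token per packet, a bucket that starts full, and it reports a packet that finds no token as non-conforming instead of queuing it first. The class and method names are hypothetical.

```python
class TokenBucket:
    """Simplified token bucket: one token is consumed per packet."""

    def __init__(self, token_rate, bucket_size):
        self.token_rate = token_rate      # tokens added per second
        self.bucket_size = bucket_size    # maximum number of stored tokens
        self.tokens = float(bucket_size)  # assumption: the bucket starts full
        self.last_time = 0.0

    def offer(self, now):
        """Return True if a packet arriving at time `now` (seconds) finds a token."""
        elapsed = now - self.last_time
        # Tokens accumulate at the token rate, but never beyond the bucket size.
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.token_rate)
        self.last_time = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With a token rate of one token per second and a bucket size of two, a burst of three packets at the same instant passes the first two packets (the accumulated tokens of scenario ii) and rejects the third (scenario iii).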
6.5.1.2.4 The Hierarchical Token Bucket (HTB) Qdisc and Shaping of the Active 
Traffic 
The HTB is a classful qdisc. Incoming packets are classified by filters. A ceil 
argument can be set for each different type of incoming traffic (web traffic, 
interactive etc). The ceil argument is the maximum bandwidth a specific type of 
traffic can utilise. 
In this work, incoming traffic is classified into two types: the passive traffic that 
consists of the passive packets and the active traffic that contains the active packets. 
The bandwidth assigned to the passive traffic is not modified. Traffic shaping is 
performed on the active traffic by increasing or reducing the provided bandwidth. 
Traffic is shaped on the outgoing interface that connects the RFE with the AE (NIC3 in
Figure 6.3).
Traffic is shaped (reduced or increased) after a control packet is sent by the AE (Type
0 or 1). Shaping is performed incrementally, by an amount defined by the shaping step. If,
for example, the shaping step is 0.1, the traffic is shaped by 10 % each time. The
theoretical maximum bandwidth utilisation is 100 Mb/s; therefore the shaping
increases or reduces the active traffic redirected to the AE by 10 Mb/s each time.
Initially, the TRS performs no shaping. Every time it receives a shaping request, it
uses the HTB qdisc to enforce a ceil argument on the active traffic.
This is achieved by using a script that receives the ceil argument (maximum 
bandwidth allowed) and loads the HTB qdisc. The sequence of events for traffic 
shaping is shown in the diagram below: 
[Figure 6.7 diagram: Receive request from the AE -> Compute the ceil argument and feed the traffic shaping script -> Shape traffic]
Figure 6.7: The Sequence of Events for Traffic Shaping
The ceil argument is computed by using the formula:
Carg[n] = Carg[n-1] + a * S * BWmax    [6.1]
where Carg[n] is the ceil argument the nth time a request has been received, S is
the shaping step, a is the step coefficient and BWmax is the maximum theoretical
bandwidth. The value of the step coefficient a depends on the type of the request that
the AE sends. Its values are shown below:
Step coefficient (a)    Type of request (RT)
-1                      Reduce_active_traffic (RT0)
1                       Increase_active_traffic (RT1)
Table 6.2: The step coefficient for the two request types
The next graph shows how the ceil argument changes for different values of the
shaping step. It is assumed here that the AE sends requests of reduce_active_traffic,
so a = -1.
[Figure 6.8 plot: ceil argument (0 to 100) against n (sequence number of request, 0 to 100), with one curve per shaping step from 0.01 to 0.1]
Figure 6.8: Computation of the Ceil Argument for each Request for Different Shaping Steps
As shown in the graph above, traffic shaping gets smoother as the shaping step
becomes smaller. The script file that shapes the traffic is shown in Appendix E.
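Formula 6.1 can be illustrated with a short sketch that is independent of the script in Appendix E. The function names are hypothetical, and two details are assumptions not stated in the text: the ceil argument starts at BWmax (no shaping) and is clamped to the range from 0 to BWmax.

```python
def next_ceil(previous_ceil, a, shaping_step, bw_max=100.0):
    """Apply formula 6.1 once. a is -1 (reduce) or 1 (increase)."""
    ceil = previous_ceil + a * shaping_step * bw_max
    # Assumption: the ceil argument is kept within [0, bw_max].
    return max(0.0, min(bw_max, ceil))


def ceil_sequence(step_coefficients, shaping_step, bw_max=100.0):
    """Ceil argument after each request in a sequence of step coefficients."""
    ceil = bw_max  # assumption: no shaping before the first request
    values = []
    for a in step_coefficients:
        ceil = next_ceil(ceil, a, shaping_step, bw_max)
        values.append(ceil)
    return values
```

For example, with S=0.1 and BWmax=100 Mb/s, the request sequence reduce, reduce, increase gives ceil arguments of 90, 80 and 90 Mb/s.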
6.5.1.2.5 Testing the Traffic Shaping Mechanism
In this section, the traffic shaping mechanism is tested and the impact of the shaping
step for different input loads is shown.
The test network used for the experiments is shown in Figure 6.2. The Source host 
injects packets into the network at a specific packet rate. Only the first round of 
measurements in the CPUM is activated (section 5.3.1.5.1). For the following 
experiments, the AA (DES encryption/decryption) described in Chapter 4 was used. 
The packet rate (PR), the packet length (PL) and the shaping step (S) varied to show 
their impact on the traffic shaping mechanism. 
The following graphs are grouped by shaping step, for different packet rates and
packet lengths.
• PR=10000 pkts/sec, PL=100 bytes and S=0.1
The Source host injected active packets into the network at a packet rate of
PR=10,000 pkts/sec. 100-byte packets were used. The Linux utility sar [SEBWWW]
was used to collect CPU utilisation statistics every second. The CPUM was activated
and the shaping step was set to S=0.1. The duration of the experiment was 2000
seconds. Next, the experiment was repeated, but the shaping mechanism was
deactivated. The graph below shows the impact of the traffic shaping on the CPU idle
time (100 - CPU utilisation %).
[Figure 6.9 plot: CPU idle time (%) against time (seconds), with and without traffic shaping]
Figure 6.9: CPU Idle Time with and without Traffic Shaping
As shown in the above graph, the CPU utilisation is very high without shaping. The 
CPU idle time is less than 10 %, meaning that the CPU utilisation is higher than 90 %. 
The CPU idle time remains below 10% throughout the duration of the experiment. 
With shaping, the CPU idle time is low during the first 600 seconds. From the first
minute (60 seconds), the CPUM sends requests for traffic reduction, as described in
section 5.3.1.5.2. During the first requests, there is no impact on the CPU idle time
because the bandwidth of the incoming traffic is smaller than the ceil argument. After
600 seconds, more requests have been sent and the CPU idle time increases because
the incoming bandwidth is reduced further. Then, the CPU idle time is 100 % and after 60
seconds it becomes less than 10 %. This is repeated, as the spikes in the graph above
show. This happens because, after the AE has sent the last request for traffic reduction
(reduce_active_traffic) in the 600th second, CPU utilisation has decreased and a
new monitoring period starts. During this period, CPU utilisation is lower than
critical_tres_min (section 5.3.1.5.1) and the CPUM sends an increase_active_traffic
request. Then, the TRS increases the active traffic by 10 % (S=0.1), which
increases the CPU utilisation in the AE. During the next monitoring period,
CPU utilisation is higher than critical_tres_max and the CPUM sends a
reduce_active_traffic request, and so on. The shaping step is 0.1, so bandwidth is
increased or decreased by 10 Mb/s, or 12,500 pkts/sec for 100-byte packets. The Source
host sends 10,000 pkts/sec, so after a reduce_active_traffic request the incoming packet
rate drops to zero and the CPU idle time becomes 100 %, as shown in Figure 6.9.
In this case, there is CPU cycle wastage because the CPU idle time becomes 100 %
and packets are dropped in the RFE. This is due to the coarse granularity of the shaping
algorithm (the value of the shaping step is high).
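The conversion from a shaping step to a packet rate used above is plain arithmetic, sketched below. The function name is hypothetical, and, as in the text, the calculation treats the packet length as the only per-packet size and ignores any framing overhead.

```python
def shaping_step_in_packets(shaping_step, packet_length_bytes, bw_max_mbps=100.0):
    """Packet rate (pkts/sec) that corresponds to one shaping step."""
    step_bits_per_second = shaping_step * bw_max_mbps * 1_000_000
    return step_bits_per_second / (packet_length_bytes * 8)
```

For S=0.1 and 100-byte packets this gives 12,500 pkts/sec, the value quoted above, which is larger than the 10,000 pkts/sec input rate. The same calculation gives 2,500 pkts/sec for S=0.02 and 625 pkts/sec for S=0.005, so with these steps a single request can no longer move the ceil argument across the whole input rate.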
• PR=30000 pkts/sec, PL=100 bytes and S=0.1
The same experiment was repeated with a different PR value, producing the graph shown in
Figure 6.10. In this graph, there are also spikes as in Figure 6.9, but the
CPU idle time does not become 100 % because there is still a significant amount of
active packets that cause CPU utilisation, even when traffic is reduced. With no shaping, the
AE is starved of CPU cycles, but with traffic shaping, CPU utilisation is not
very high and the AE is relieved.
[Figure 6.10 plot: CPU idle time (%) against time (seconds, 0 to 2000), with one curve for no shaping and one for shaping]
Figure 6.10: CPU Idle Time with and without traffic shaping for PR=30000, PL=100 and
S=0.1
• PR=6250 pkts/sec, PL=1000 bytes and S=0.1
[Figure 6.11 plot: CPU idle time (%) against time (seconds, 0 to 4000), with one curve for no shaping and one for shaping]
Figure 6.11: CPU Idle Time with and without traffic shaping for PR=6250, PL=1000 and
S=0.1
In this case, traffic shaping has only a small impact. The packet rate in
this experiment is smaller but the packets are long.
Every AA is characterised by two types of cost, the per byte cost and the per packet
cost, because AAs can apply computations up to the application layer. The per byte
cost can be as expensive as the per packet cost.
In this experiment, the per packet cost has been decreased due to traffic shaping, but
the packets are long (1000 bytes), so the per byte cost still causes high CPU
utilisation.
Next, the shaping step is changed and the measurements are repeated. 
• PR=10000 pkts/sec, PL=100 bytes and S=0.02
[Figure 6.12 plot: CPU idle time (%) against time (seconds), with one curve for no shaping and one for shaping]
Figure 6.12: CPU Idle Time with and without traffic shaping for PR=10000, PL=100 and
S=0.02
The duration of the experiments was increased to 4000 seconds because the shaping
step was reduced and more time was needed for the traffic shaping (more requests are
needed, and requests take place every 60 seconds).
Figure 6.12 shows that by using a smaller shaping step, the AE is relieved after about 
3000 seconds. There is no CPU cycle wastage and the maximum CPU idle time is 18 
%. 
• PR=30000 pkts/sec, PL=100 bytes and S=0.02
[Figure 6.13 plot: CPU idle time (%) against time (seconds), with one curve for no shaping and one for shaping]
Figure 6.13: CPU Idle Time with and without traffic shaping for PR=30000, PL=100 and
S=0.02
With a higher input packet rate, the traffic shaping mechanism has the same
effect as with a lower packet rate. This happens because the packet
rate is higher and more packets are dropped in the RFE.
• PR=6250 pkts/sec, PL=1000 bytes and S=0.02
[Figure 6.14 plot: CPU idle time (%) against time (seconds), with and without traffic shaping]
Figure 6.14: CPU Idle Time with and without traffic shaping for PR=6250, PL=1000 and
S=0.02
When using larger packets, the CPU utilisation is not affected significantly because
the per byte cost causes high CPU cycle consumption, even when the input traffic is
reduced.
The shaping step is changed to 0.005 and the tests are repeated. 
• PR=10000 pkts/sec, PL=100 bytes and S=0.005
[Figure 6.15 plot: CPU idle time (%) against time (seconds, 0 to 6000), with one curve for no shaping and one for shaping]
Figure 6.15: CPU Idle Time with and without traffic shaping for PR=10000, PL=100 and
S=0.005
In this case, the CPU utilisation is reduced and the AE is relieved from the high input
load. The CPU idle time increases about 5500 seconds after the start of the test. In
Figure 6.9, the CPU idle time increases 600 seconds after the start of the
experiment. This happens because the time needed to shape the traffic is inversely
proportional to the shaping step. With a smaller shaping step, more requests have to be
sent when the input load persists.
• PR=30000 pkts/sec, PL=100 bytes and S=0.005
[Figure 6.16 plot: CPU idle time (%) against time (seconds), with one curve for no shaping and one for shaping]
Figure 6.16: CPU Idle Time with and without traffic shaping for PR=30000, PL=100 and
S=0.005
With a higher input packet rate, the CPU idle time increases 5000 seconds after the
test began.
• PR=6250 pkts/sec, PL=1000 bytes and S=0.005
[Figure 6.17 plot: CPU idle time (%) against time (seconds, 0 to 6000), with one curve for no shaping and one for shaping]
Figure 6.17: CPU Idle Time with and without traffic shaping for PR=6250, PL=1000 and
S=0.005
With larger packets, due to the per byte cost, traffic shaping has almost no effect on
the CPU utilisation.
The following graphs are grouped by packet rate and packet length, for different
shaping steps.
• PR=10000 pkts/sec, PL=100 bytes
Figure 6.18: CPU Idle Time with and without traffic shaping for PR=10000, PL=100 and
different shaping steps
As shown in the graph above, choosing a large shaping step causes CPU cycle
wastage (spikes at 100 %) when the packet rate is small. Smaller shaping steps
relieve the AE without wasting CPU cycles.
• PR=30000 pkts/sec, PL=100 bytes
[Figure 6.19 plot: CPU idle time (%) against time (seconds, 0 to 6000), with one curve per shaping step]
Figure 6.19: CPU Idle Time with and without traffic shaping for PR=30000, PL=100 and
different shaping steps
In this case, where the packet rate is high and the packet length is small, different
shaping steps lead to almost the same result. All relieve the AE without wasting CPU
cycles.
• PR=6250 pkts/sec, PL=1000 bytes
[Figure 6.20 plot: CPU idle time (%) against time (seconds), with one curve per shaping step]
Figure 6.20: CPU Idle Time with and without traffic shaping for PR=6250, PL=1000 and
different shaping steps
In this case, there is almost no impact after traffic shaping, since the packets are long 
and the per byte cost increases the CPU utilisation, although the per packet cost is 
reduced. 
For these experiments, a UDP traffic generator was used that injected packets at a 
constant packet rate each time. In a real network, packet traffic is more random,
and the CPUM was not tested under these conditions. From the measurements above, it
is seen that the shaping mechanism relieves the AE if the active traffic consists of 
small packets at a high or medium packet rate. Generally, small packets stress a PC 
host mainly due to the hardware interrupts issued when a packet is sent or received. In 
the case of the AE, not only do the interrupts cause high CPU utilisation but the AAs 
too. In these experiments, larger packets need more CPU cycles since the AA used is a 
payload-processing application and its per byte cost seems to be very high. Although 
traffic is shaped, the CPU utilisation is not significantly affected. The shaping 
mechanism presented here is effective when small packets are injected into the 
network at high rates. As mentioned before, the second round of CPU measurements 
in the CPUM was deactivated. This round aims to detect applications that overuse the 
CPU. If it were not deactivated, it would probably unload the application tested here 
because of the high CPU utilisation it caused. 
The RFE will still be vulnerable under these conditions, but it is more robust than the
AE, because it performs less complex operations.
6.5.2 The Loadable Kernel Module Loader (LKL) 
As described in section 6.5.1.1, one of the tasks of the TRS is to unload an LKM when
an AA is set to PSTAT5 or PSTAT6. When the LKM is removed, it has to be reloaded
after time T has elapsed. For convenience, the LKL creates an array and, for every
removed LKM, stores the information shown in Table 6.3. By maintaining this
information, the LKL knows which LKM was removed from the kernel and when. The
LKL wakes up every five seconds and checks the timestamps (TMPs) stored in the
array. If the difference between the current time and the TMP is greater than or
equal to the time T, the associated LKM is directly loaded into the kernel. By loading
the LKM, active packets that carry the associated GMID are redirected to the AE 
again. 
Field    Description
GMID     The Global Module ID the removed LKM refers to.
TMP      Timestamp: the time when the LKM was removed from the kernel.
Table 6.3: Information stored for every unloaded LKM
The loaded LKM will be unloaded from the kernel again only when the AE requests
it.
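The periodic check performed by the LKL can be sketched as follows. The function name and the representation of the array as a list of (GMID, TMP) pairs are hypothetical, and the value of T is left as a parameter because it is not given in this section. The five-second wake-up loop and the actual loading of the module are omitted.

```python
def lkms_to_reload(removed, now, reload_delay):
    """Split the array of removed LKMs into those due for reloading and the rest.

    removed is a list of (gmid, tmp) pairs as in Table 6.3, now is the
    current time and reload_delay is the time T, all in seconds.
    """
    due = [gmid for gmid, tmp in removed if now - tmp >= reload_delay]
    waiting = [(gmid, tmp) for gmid, tmp in removed if now - tmp < reload_delay]
    return due, waiting
```

On each wake-up, the LKMs of the GMIDs returned in the first list would be loaded and only the second list kept in the array.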
6.6 Summary 
The primary task of the RFE is the redirection of active packets to the AE. This is
achieved by using a cluster of Loadable Kernel Modules (LKMs) that replace the
destination IP address of each packet with the IP address of the AE. The original IP
address is saved in the Active Header and is restored in the IP header by the
Packet Injector (PI), which executes in the AE.
The RFE also performs traffic shaping, after a request is sent by the AE. Traffic 
shaping is performed by the Traffic Shaper (TRS) and involves reducing the volume 
of the redirected active traffic as well as increasing it, depending on the received type 
of request. The shaping mechanism reduces the per packet cost in the AE and relieves 
it from the input load. The per byte cost is not significantly reduced since large 
packets can still cause high CPU utilisation, although traffic is shaped. 
The RFE unloads one or more LKMs, stopping the redirection of specific Active
Flows (AFLs). This takes place when an AA is set to PSTAT5 or PSTAT6 in the AE,
and a request is then sent to the RFE to unload the associated LKM.
Applications in PSTAT5 or PSTAT6 can execute after time T has elapsed; therefore
their associated LKM has to be reloaded. This is performed by the LKM Loader
(LKL), which records the time when each LKM was unloaded and compares
it with the current time. If the difference is greater than or equal to time T, the LKM is
reloaded and redirection of the active packets takes place again.
Chapter 7 
Performance Evaluation of the Active 
Engine 
Chapter 7 Performance Evaluation of the Active Engine 
7. Performance Evaluation of the Active Engine (AE) 
7.1 Chapter Summary 
In this chapter, the performance of the AE, in terms of packet loss, packet delay and
delay jitter, is investigated.
A method for reducing the packet loss is presented and its effect on the packet delay
and delay jitter is shown.
The second part of the chapter presents the switch operation between two FPGA-Ms 
and measures how fast the AE can switch from one FPGA-M to the other, servicing 
two active packet flows injected into the network at the same time. 
The last part of the chapter presents the performance evaluation of two Active 
Applications, in terms of CPU cycle consumption and delay within the AE. These 
applications are a DES application running in software and a DES application running 
in hardware (FPGA). Several methods for improving the performance of the hardware 
DES are discussed. 
7.2 Packet Loss 
In this section, the performance of the AE, in terms of packet loss, is investigated. 
7.2.1 A Generic Description of the System under Test 
As described in Chapter 5, the AE consists of two parts: 
i) The basic core unit (CU) that includes the Core Process (CPR), the CPU 
Monitor (CPUM), the Memory Monitor (MEMM), the Application Loader 
(APPLOD), the Packet Injector (PI), the Safety Process (SPR) and the Active 
Filter (AF). 
ii) The Active Applications (AAs) that can be downloaded, installed and executed
on the fly.
A simplified figure of the software architecture of the AE, with one running AA, is
shown below:
[Figure 7.1 diagram: the Core Unit, the Active Application and the Packet Injector in sequence, with the Core Unit and the Packet Injector connected to the network]
Figure 7.1: Basic Processes of the Active Engine with one loaded Active Application
The Packet Injector is separated for demonstration purposes, although it is a part of 
the Core Unit (CU). 
The AA (like every AA) is placed between the CU and the PI (Figure 7.1). It receives
packets from the APPLOD or the CPR through the Unix domain sockets mechanism 
(described in Chapter 5). The AA can be changed on the fly. The performance of each 
AA is application-specific and so it heavily depends on the way the application is 
developed. 
In this chapter, the performance of the basic software architecture of the AE is
investigated; therefore the unpredictability of the performance of the AA has to be
removed so that it does not affect the total performance of the AE. For this reason, the
AA was replaced by a simpler process called the Simple Application (SA), shown in
the following figure:
[Figure 7.2 diagram: the Core Unit, the Simple Application and the Packet Injector in sequence, connected to the network]
Figure 7.2: Removing the unpredictability of an Active Application by replacing it with a
Simple Application
The Simple Application receives packets from the CPR and sends them to the PI. This
is the basic operation every AA performs. The SA is the simplest form of an AA.
Using the SA, the basic performance of the AE can be investigated or, in other
words, the upper bound of throughput can be measured. The term upper bound of
throughput was defined in [LCAM01] and is the same as the per packet cost
referred to in previous chapters.
7.2.2 CPU Cycle Consumption 
The CPU usage in the AE can vary and depends on many factors, the most important
of which are:
i) Kernel space - user space crossing. Every packet crosses this boundary twice:
when it is passed from the AF to the CPR and when it is injected back into the
network by the PI (Figure 5.2).
ii) Hardware interrupts. They are issued when packets are received or
transmitted by the AE. Also, FPGA-AAs may use interrupts.
iii) Process context switches. They take place when packets are sent from one
user-space process to another. They can be very expensive if the packet rate is
high, because during every process context switch the CPU has to transfer data
to and from the cache. Also, there is an additional overhead inserted by the
Linux process scheduler that chooses which process is next to execute.
iv) The AAs. The application-dependent code may cause high CPU utilisation. It
has been referred to as the per byte cost.
The CPU cycle usage caused by the above mechanisms is proportional to the load of 
the input active traffic. As shown in section 6.5.1.2.5, not only the packet rate can 
affect the CPU utilisation, but the packet length too. This is because AAs are network 
applications that perform computations up to the application layer. 
7.2.3 Measuring the Packet Loss 
To measure the packet loss, the network shown in Figure 7.3 was set up. The Source
host injects active packets using a kernel-space UDP generator provided by
[UDPGWWW]. The packet length varied from 100 bytes to 1500 bytes and the input
bandwidth from 0 Mb/s to 50 Mb/s. The maximum packet rate used was 35,000
pkts/sec for 100-byte packets and each measurement lasted for 150 seconds. The
CPUM was deactivated; therefore no traffic shaping took place.
Figure 7.3: The Test Network 
In order to measure the packet loss, packet counters were placed in the AE and the RFE.
For a lower overhead, the counters were placed in the input and output hooks of the
Linux kernel-space in both machines, namely the
NF_IP_PRE_ROUTING and NF_IP_POST_ROUTING hooks (Figure 5.4). The
LKMs used are the AF in the AE and one of the LKMs of the RFE (Figure 6.3).
Figure 7.4 shows the placement and the number of the packet counters in the kernel-
space of the AE and the RFE. The counters count the number of packets as follows:
• C1 counts the active packets sent by the source,
• C3 counts the active packets sent from the RFE to the AE,
• C5 counts the active packets received by the AE,
• C6 counts the active packets sent by the AE,
• C4 counts the active packets received by the RFE,
• C2 counts the active packets sent from the RFE to the destination host.
[Figure 7.4 diagram: Source -> C1 -> RFE -> C2 -> Destination, with counters C3 and C4 in the RFE on the path to and from the AE, and counters C5 and C6 in the AE]
Figure 7.4: Counters placed in Kernel-Space of the RFE and the AE
The packet loss in the AE is given by the formula PL = P5 - P6, where Pi is the number of
packets counted by the counter Ci. The total packet loss is given by the formula
PL = P1 - P2. It is expected that P3 = P5 and P4 = P6.
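The two formulas can be applied to a set of counter readings as in the sketch below. The function name is hypothetical, the counter values used with it would be measured ones, and expressing the AE loss as a percentage of the packets received by the AE (P5) is an assumption about how the percentages were obtained.

```python
def packet_loss(counts):
    """Compute the packet loss from the counters C1..C6.

    counts maps the counter index i (1 to 6) to the packet count Pi.
    Returns (loss in the AE, total loss, AE loss as a percentage of P5).
    """
    ae_loss = counts[5] - counts[6]        # PL = P5 - P6
    total_loss = counts[1] - counts[2]     # PL = P1 - P2
    # Assumption: the percentage is taken relative to the packets received by the AE.
    ae_loss_percent = 100.0 * ae_loss / counts[5] if counts[5] else 0.0
    return ae_loss, total_loss, ae_loss_percent
```

When P3 = P5 and P4 = P6 hold and the RFE itself drops nothing, the first two returned values are equal.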
The packet loss at different packet rates and packet sizes is shown in the next figure: 
[Figure 7.5 plot: packet loss (%) against packet rate (packets/sec, 0 to 35000), with one curve per packet length]
Figure 7.5: Packet Loss for Different Packet Lengths at Different Packet Rates
133 
Chapter 7 Performance Evaluation of the Active Engine 
These measurements show that P3 = P5, P4 = P6 and P5 - P6 = P1 - P2 (Figure 7.4);
therefore the total packet loss (RFE+AE) is equal to the packet loss caused by the AE.
The graph above shows no loss for packets larger than 900 bytes, because the
packet rate could not become very high for these packets. With smaller packets, the packet rate can be
excessive. This results in many hardware interrupts, as well as a huge number of
process context switches and many kernel space - user space crossings. Smaller
packets at high packet rates therefore stress the AE more. For example, for an input bandwidth
of 28 Mb/s and 100-byte packets (the packet rate is 35,000 pkts/sec), the packet loss is
88.17 %. In this situation, the AE was starved of CPU cycles and many packets were
dropped in kernel-space, because there were no CPU cycles available for processing
them. If an AA had been loaded, it would have had no chance to execute properly. This
phenomenon is known as receive livelock [KIM01], [MOG97]. Receive livelock takes
place when the delivered throughput (packets sent out) drops to zero, while the input
overload persists.
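The relation between input bandwidth, packet length and packet rate used in this example can be checked with a short calculation. This is a minimal sketch, assuming that 1 Mb/s means 10^6 bits per second and that the packet length counts every byte of the packet; the function name is hypothetical.

```python
def packet_rate(bandwidth_mbps, packet_bytes):
    """Packets per second needed to fill bandwidth_mbps with packets of packet_bytes."""
    return bandwidth_mbps * 1_000_000 / (packet_bytes * 8)


rate_100 = packet_rate(28, 100)   # 35000.0, the rate quoted for 100-byte packets
rate_900 = packet_rate(50, 900)   # about 6944 pkts/sec at the maximum input bandwidth
```

Under these assumptions, 900-byte packets cannot exceed roughly 7,000 pkts/sec at 50 Mb/s, which is consistent with the absence of loss for large packets.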
One way to avoid receive livelock is the traffic shaping described in Chapter
6. Traffic shaping drops packets in the RFE before they reach the AE. A wiser method
would be to first increase the throughput of the AE by improving its software architecture
and then let the CPUM shape the traffic.
There are several ways to improve the throughput of the AE, such as interrupt
mitigation and minimising the number of process context switches. Interrupt
mitigation reduces the number of hardware interrupts issued when packets are received or
transmitted. Instead of issuing one interrupt for every packet, the driver of the NIC
generates one interrupt for a group of received (or transmitted) packets.
The next section describes how packet loss is reduced by reducing the number of the 
process context switches. 
7.2.3.1 Reducing the Number of the Process Context Switches 
In the AE, a process context switch takes place whenever a packet is passed from one process to
another. The cost of a process context switch can be very high in terms of
CPU cycles and delay, since it involves data transfers between the CPU cache and the
memory, the overhead of the Linux scheduler that decides which
process executes next, etc. Other effects can also decrease the
performance of the host if many process context switches take place, such as the
cache interference cost described in [FROM98].
When an AA executes in the user-space of the AE, the "journey" of an active packet in
the user-space is usually: CPR->AA->PI (Figure 7.1). When the AA is replaced by the SA,
the path of the packets is: CPR->SA->PI, as shown in Figure 7.2. For every incoming
packet, three context switches take place. The Linux Trace Toolkit (LTT) [YAGHOO]
is used in order to display these context switches.
The LTT is software that provides a way of recording and analysing complete system 
behaviour. Its architecture is shown below [YAGHOO]: 
[Diagram: the trace daemon in user-space; the virtual file system, the trace module and the trace facility inside the Linux kernel]
Figure 7.6: Linux Trace Toolkit Architecture
The LTT is capable of recording system events and making them available to the user.
The events are forwarded to the trace module via the kernel trace facility. The trace
module, visible in user-space as an entry in the /dev directory, then logs the events in
its buffer. The trace daemon then reads from the trace module device and commits the
recorded events into a user-provided file [YAGHOO].
A single packet was sent from the Source to the Destination (Figure 7.3). The LTT was
used to capture and display the sequence of events that took place in the kernel of the
AE, from the moment this packet was received until it was retransmitted back to the network. Part
of the file created by the LTT to hold the information is shown in Appendix F. The
first column of the file displays the event that took place, the second column displays
the time it took place, the third column shows the process ID of the process that was
running at that time, the fourth column shows the entry length and the fifth column
displays the event description. The process context switches that took place are
highlighted in the file, for convenience. The PID of the CPR is 3104, the SA's is 3116
and the PI's is 3105.
Three context switches take place when one packet is processed, as shown in the
output of the file. At an input packet rate of 35,000 pkts/sec, there were 105,000
context switches per second (csws/s), a huge number that caused very high CPU
utilisation and is one of the reasons for the extensive packet loss (Figure 7.5).
The number of the process context switches can be reduced by buffering packets 
between the CPR and the SA and between the SA and the PI, as shown in Figure 7.7. 
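[Diagram: packet buffers placed in the Core Process and the Simple Application, between the network and the user-space processes]
Figure 7.7: Adding Buffers in the Core Process and the Simple Application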
By adding buffers in the CPR and the SA, packets are transferred from one process to 
another only when the buffers become full. The number of the process context 
switches is reduced, because only one context switch takes place for a group of 
packets rather than one for every packet. The formula that gives the number of the 
csws/s is shown below: 
$$N = P_r \times \left(1 + \frac{2L}{B_f}\right) \qquad [7.1]$$
where N is the number of context switches per second (csws/s), Pr the number of
packets per second, L the packet length in bytes and Bf the buffer size in bytes. L ≥ 40 since the smallest
active packet is 40 bytes long, Pr > 0 and Bf ≥ 40 since with no buffering the size of the
buffer is equal to the packet size.
In the above formula, if Bf = L then N = 3 x Pr, which matches the context switches shown in the LTT file for
Pr = 1 (N = 3 because one packet was used). Without buffering, there are three
csws per packet. The first csw takes place when the CPR receives a packet, the second
when the SA receives the packet and the third when the PI receives the same packet
(Figure 7.2). When buffers are used, only the last two csws are affected.
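Formula 7.1 can be evaluated directly. The sketch below uses a hypothetical function name and assumes that all sizes are in bytes and that 6 Kbytes is taken as 6000 bytes, which reproduces the 13,433 csws/s quoted later in this chapter.

```python
def csws_per_sec(packet_rate, packet_len, buffer_size):
    """Context switches per second according to Formula 7.1 (sizes in bytes)."""
    if buffer_size < packet_len:
        raise ValueError("the buffer must hold at least one packet")
    # One csw per packet at the CPR, plus two csws per buffer flush (SA and PI).
    return packet_rate * (1 + 2 * packet_len / buffer_size)


no_buffering = csws_per_sec(35_000, 100, 100)     # 105000.0, three csws per packet
with_6k = csws_per_sec(13_000, 100, 6_000)        # about 13433
with_60k = csws_per_sec(13_000, 100, 60_000)      # about 13043
```

Setting the buffer size equal to the packet length recovers the unbuffered case of three context switches per packet.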
In the graph below, the number of the csws per sec is plotted as a function of the 
buffer size for different packet lengths (100, 500 and 1000 bytes), at a packet rate of 
1000 pkts/sec. 
[Plot: number of context switches per second against buffer size (bytes), one curve each for 100, 500 and 1000-byte packets]
Figure 7.8: Number of the Process Context Switches as a Function of increased Buffer Size
As shown in Figure 7.8, the number of context switches falls when buffers are
used, and this is expected to reduce the packet loss as a result of the lower CPU
utilisation.
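The buffering scheme described in this section can be sketched as follows. This is an illustrative model only, not the code of the CPR or the SA: the class name, the callback and the "flush when the stored bytes reach the capacity" rule are assumptions.

```python
class PacketBuffer:
    """Collects packets and hands them over in one group when the buffer is full."""

    def __init__(self, capacity, deliver):
        self.capacity = capacity    # buffer size in bytes
        self.deliver = deliver      # called once per flush (one transfer to the next process)
        self.packets = []
        self.used = 0

    def add(self, packet):
        self.packets.append(packet)
        self.used += len(packet)
        if self.used >= self.capacity:
            self.flush()

    def flush(self):
        if self.packets:
            self.deliver(self.packets)
            self.packets, self.used = [], 0


transfers = []                                   # each entry stands for one inter-process transfer
buf = PacketBuffer(6000, transfers.append)
for _ in range(600):
    buf.add(bytes(100))                          # 600 packets of 100 bytes
```

In this model 600 packets cause 10 transfers of 60 packets each instead of 600 single-packet transfers, which is the effect that reduces the number of context switches.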
7.2.3.1.1 Choosing the Size of the Buffers 
In order to repeat the experiment and show the impact of packet buffering on the
packet loss, the size of the buffers has to be defined. Unix domain sockets are the
mechanism used for transferring data between processes. The maximum data rate between
two processes depends on the size of the transferred data. To investigate the impact
of the data size on the transfer bandwidth, the hbench test-bench [BRSE97] is used.
Using hbench the following graph is produced:
[Plot: bandwidth against block size of data transferred (bytes), logarithmic x-axis from 1 to 1E+08]
Figure 7.9: Maximum Data Rate between two Processes as a Function of the transferred Data
The above graph shows that the optimum bandwidth is achieved at a block size of 60
Kbytes; therefore the size of the buffers was initially set to 60 Kbytes.
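A rough stand-in for this kind of measurement is sketched below. It is not hbench: it times transfers over a Unix domain socket pair between two threads of one Python process rather than between two processes, so the absolute figures will differ from Figure 7.9. The function name and the default sizes are assumptions.

```python
import socket
import threading
import time


def measure_bandwidth(block_size, total_bytes=4 * 1024 * 1024):
    """Return the MB/s achieved when sending about total_bytes in block_size chunks."""
    sender, receiver = socket.socketpair()       # a connected pair of Unix domain sockets

    def drain():
        # Read until the sender closes its end.
        while receiver.recv(65536):
            pass

    reader = threading.Thread(target=drain)
    reader.start()
    block = bytes(block_size)
    blocks = max(1, total_bytes // block_size)
    start = time.perf_counter()
    for _ in range(blocks):
        sender.sendall(block)
    sender.close()
    reader.join()
    elapsed = time.perf_counter() - start
    receiver.close()
    return blocks * block_size / elapsed / 1e6


results = {size: measure_bandwidth(size) for size in (100, 1_000, 60_000)}
```

Comparing the entries of `results` gives a feel for how the block size influences the achievable data rate on a given machine.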
7.2.3.1.2 Packet Loss with and without Buffering 
The test performed in the previous section was repeated after buffers were placed in
the CPR and the SA, as shown in Figure 7.7. The packet loss was measured and the
graph in Figure 7.10 was produced. There is almost no loss for packets with a length
over 400 bytes. Comparing this figure with Figure 7.5 shows that, once buffers are used, the packet loss
depends on the packet size. With no buffering, the per-packet cost is independent
of the packet size, as Figure 7.5 shows: there are three context switches for every
packet regardless of its size.
[Plot: packet loss (%) against packet rate (packets/sec), one curve per packet length]
Figure 7.10: Packet Loss for different Packet Lengths at different Packet Rates when Buffers
are used
With buffering, not only the buffer size affects the context switches but the packet
length too, as shown in Formula 7.1. The number of context switches
grows linearly with the packet length, so for the same input packet rate and the same buffer
size, the packet loss is higher for longer packets. For example, from the figure above,
for a packet rate of 24,000 pkts/sec, the loss for the 200-byte packets is 43.51 %,
while for the 100-byte packets it is 37.62 %. The larger the packet, the more frequently
the buffers are flushed; therefore more process context switches take place.
The test was repeated for a buffer size of 6 Kbytes to demonstrate the impact of the
buffer size on the packet loss. The following graphs show the packet loss for different
packet lengths, with and without buffering and for two different buffer sizes.
[Plot: packet loss (%) against packet rate (packets/sec), without buffering and with the two buffer sizes]
Figure 7.11: Packet Loss when using different Buffer Sizes for 100-byte packets
As shown in Figure 7.11, the packet loss is significantly reduced when buffers are used.
The same effect takes place when 200-byte and 300-byte packets are used, as the
following figures show.
[Plot: packet loss (%) against packet rate (packets/sec), without buffering and with the two buffer sizes]
Figure 7.12: Packet Loss when using different Buffer Sizes for 200-byte packets
[Plot: packet loss (%) against packet rate (packets/sec), without buffering and with the two buffer sizes]
Figure 7.13: Packet Loss when using different Buffer Sizes for 300-byte packets
A similar graph was produced for the 400-byte packets. For larger packets there was
almost no packet loss.
As shown in the above figures, packet loss reduces when buffering is used because of
the reduction in the number of context switches. In Figure 7.14, the performance
improvement for the 100-byte packets is shown. The performance improvement is
defined as the difference between the packet loss without buffering and the loss with
buffering.
The performance improves until the input packet rate reaches about 18,000 pkts/sec
and then it drops. This is because the CPU cycles spent on hardware interrupts and
kernel-user space crossings become significant at higher packet rates. Although the
number of context switches is reduced, the performance improvement deteriorates.
Even though the performance improvement drops after a certain point, packet loss
is still significantly reduced when buffers are used.
[Plot: performance improvement against packet rate (packets/sec, 0 to 35,000)]
Figure 7.14: Performance Improvement when Buffers are used for 100-byte Packets at
different Packet Rates
As seen in Figures 7.11, 7.12 and 7.13, the buffer size does not affect the
performance significantly. As shown in Figure 7.8, the number of
context switches falls rapidly as the size of the buffer increases. However, once
the buffer size exceeds a threshold value, there is hardly any further reduction in
the number of context switches. For example (from Formula 7.1), at a packet
rate of 13,000 pkts/sec with 100-byte packets and no buffering, there are 39,000
csws/s (three csws per packet) and the packet loss is 49.7 %. With buffers of 6
Kbytes, Formula 7.1 gives 13,433 csws/s and the packet loss is 0.16 %. With buffers
of 60 Kbytes, there are 13,043 csws/s and the packet loss is 1.35 %. Using the
larger buffer reduces the csws/s by only 2.9 %. Applying buffers means that fewer packets
are lost, but changing the buffer size does not affect the improvement substantially.
The buffer size does not affect the performance above 6 Kbytes but does below it.
The delay when using buffers with sizes between 0 Kbytes and 6 Kbytes would be
proportional to the buffer size. For example, if 3 Kbyte buffers were used, the delay
packets experience would be half of that when 6 Kbyte buffers are used.
7.3 Packet Delay Measurements 
The previous section described how the packet loss reduces when buffers are used.
This section investigates the delay packets experience within the AE, with and without
buffering. The delay can be computed by recording the time each packet enters and
exits the kernel-space of the AE, as shown in Figure 7.15.
[Diagram: the CU, SA and PI in user-space; in kernel-space, the input hook (where TMin is taken) and the output hook (where TMout is taken) above the NIC's device driver and the network]
Figure 7.15: Time-Stamping of the Packets in the Input and Output Hooks
The delay D is defined as:
D = TMout - TMin [7.2]
where TMin is the time the packet enters the kernel of the AE and TMout the time
when it exits the kernel.
In order to measure the delay, the network shown in Figure 7.3 was used. The times
TMin and TMout were recorded for every packet that entered or left the kernel. This
was achieved by using two LKMs: one in the NF_IP_PRE_ROUTING hook and the
second one in the NF_IP_POST_ROUTING hook of the kernel of the AE (Figure 5.4).
The LKM used in the NF_IP_PRE_ROUTING hook was the Active Filter (AF).
Active packets of Type 0 were injected into the network by the Source host. When a
packet entered the kernel space of the AE, it was time-stamped by default by the
operating system (an skbuff structure is created). It then crossed the
NF_IP_PRE_ROUTING hook. Here, code added to the AF read the time TMin (from
the skbuff structure) and copied it into the first four bytes of the packet payload. When
the same packet exited the kernel, it crossed the NF_IP_POST_ROUTING hook. The
TMout time was taken (through the do_gettimeofday function) and saved in the next
four bytes of the payload by the second LKM.
For the delay measurements, the Simple Application (SA) was used. This allowed 
removing the unpredictable delay inserted by an Active Application. 
The active packets that exit the kernel have the following format:
[Packet layout: IP header, UDP header, ACT header, then the payload, whose first eight bytes carry TMin and TMout]
Figure 7.16: Time-Stamped Packet
Each packet that left the AE was timestamped and then reached the Destination
host. In this host, a process had opened a port and was listening for active
packets. Upon receiving a packet it checked its payload, computed the delay using
Formula 7.2 and saved the value into a log file for post-processing.
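The computation performed at the Destination host can be sketched as follows. The encoding of the two time-stamps is not given in the text, so this example assumes, purely for illustration, that each one is a 32-bit unsigned microsecond value in network byte order; the function name is hypothetical.

```python
import struct


def delay_from_payload(payload):
    """Apply Formula 7.2 to the two time-stamps at the start of the payload."""
    # Assumed layout: TMin in bytes 0-3, TMout in bytes 4-7, both big-endian microseconds.
    tm_in, tm_out = struct.unpack("!II", payload[:8])
    return tm_out - tm_in


# Invented example packet payload: two time-stamps followed by application data.
payload = struct.pack("!II", 1_000_100, 1_000_215) + b"data"
delay_us = delay_from_payload(payload)   # 115
```

A real receiver would append each computed value to its log file for post-processing.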
7.3.1 Delay Measurements when using no Buffers 
In this experiment, the Source host injected 100-byte packets at a different packet rate
each time and the delay was computed at the Destination host, using the method
described in the previous section. The experiment was then repeated for 1000-byte
packets, at different packet rates. Finally, 100, 500, 1000 and 1500-byte packets
at a packet rate of 1000 p/s (packets per second) were used to compare the delay for
different packet lengths. At the end of the experiments, the log files were collected
from the Destination host and the following graphs were plotted:
[Plot: delay (microseconds, logarithmic scale) against packet sequence number, one curve per packet rate]
Figure 7.17: Delay for 100-byte Packets at different Packet Rates
[Plot: delay (microseconds) against packet sequence number, one curve per packet rate]
Figure 7.18: Delay for 1000-byte Packets at different Packet Rates
[Plot: delay (microseconds) against packet sequence number, one curve per packet length]
Figure 7.19: Delay for 100, 500, 1000 and 1500 byte Packets at a Packet Rate of 1000 p/s
As shown in the figures above, the delay each packet suffers is independent of its size,
since there are no payload-processing applications executing in the AE. The CU
performs only header-processing operations.
When the packet rate becomes high, packets suffer a much higher delay (Figure 7.17).
This happens because under heavy load (high input packet rate) CPU cycles are
mostly spent servicing the interrupts, so the user-space applications are less often
scheduled to execute and service the packets.
7.3.2 Delay Measurements when using Buffers of 6 Kbytes
The same experiment was repeated with buffers of 6 KB placed in the AE. The
following graphs were produced:
[Plot: delay (microseconds) against packet sequence number, one curve per packet rate]
Figure 7.20: Delay for 100-byte Packets at different Packet Rates when using Buffers of 6
Kbytes
[Plot: delay (microseconds) against packet sequence number, one curve per packet rate]
Figure 7.21: Delay for 1000-byte Packets at different Packet Rates when using Buffers of 6
Kbytes
[Plot: delay (microseconds) against packet sequence number, one curve per packet length]
Figure 7.22: Delay for 100, 500, 1000 and 1500-byte Packets at a Packet Rate of 1000 p/s
when using Buffers of 6 Kbytes
When buffers are used, the delay graphs differ from those obtained
when no buffering takes place. They have a triangular shape, because each packet
experiences a different delay. The first packet that enters each buffer experiences the
maximum delay and the last one the minimum delay, since a buffer is flushed only
when it becomes full. The packet delay jitter is also increased. Jitter is defined as the
variation of the delay. The delay also depends on the packet size and the packet rate.
Buffers become full and are flushed more frequently as the packet size or the packet rate
increases.
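A rough estimate of the maximum delay added by buffering follows from this description: the first packet has to wait until enough further packets arrive to fill the buffer. The sketch below is an approximation introduced here for illustration, not a formula from this thesis; it assumes a constant packet rate and ignores processing time.

```python
def fill_time_us(buffer_bytes, packet_bytes, packet_rate):
    """Approximate time in microseconds needed to fill one buffer at a constant packet rate."""
    packets_per_buffer = buffer_bytes // packet_bytes
    return packets_per_buffer * 1_000_000 / packet_rate


worst_case_100 = fill_time_us(6000, 100, 1000)    # 60000.0 us for 100-byte packets at 1000 p/s
worst_case_1000 = fill_time_us(6000, 1000, 1000)  # 6000.0 us for 1000-byte packets at 1000 p/s
```

With delays spread roughly evenly between zero and this maximum, the median would be about half of it, which is in line with the median of 31,056 microseconds reported in Table 7.1 for 100-byte packets at 1000 p/s with 6 Kbyte buffers.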
Repeating the same test for buffers of 60 KB, similar graphs were produced but the
delay is higher because packets were stored in the buffers for longer.
The following section compares the delay for the three possible scenarios: no
buffering (0 KB), using buffers of 6 Kbytes (6 KB) and using buffers of 60 Kbytes
(60 KB).
7.3.3 Comparing the Packet Delay for no Buffering and Buffering 
using different Buffer Sizes 
The next two figures show the delay 100-byte packets experienced, when they were
injected into the network at two different packet rates (1000 p/s and 15,000 p/s).
[Plot: delay against packet sequence number for 0 KB, 6 KB and 60 KB buffers]
Figure 7.23: Delay for 100-byte Packets at a Packet Rate of 1000 p/s
[Plot: delay against packet sequence number for 0 KB, 6 KB and 60 KB buffers]
Figure 7.24: Delay for 100-byte Packets at a Packet Rate of 15000 p/s
The above graphs show that, with small packets at low packet rates, buffering
increases the delay as well as the delay jitter. At high packet rates, delay jitter and
delay are reduced when using buffers of 6 Kbytes. This is because when a process is
scheduled to execute (after a context switch takes place), a group of packets is
transferred to that process and serviced rather than just one packet (as in the case
where no buffering is used). When buffers of 60 Kbytes are used at high packet rates, delay increases and
jitter reduces compared with no buffering.
The next two graphs show the impact of buffering on larger packets (1000 bytes
long) at two different packet rates (1000 p/s and 5000 p/s).
[Plot: delay against packet sequence number for 0 KB, 6 KB and 60 KB buffers]
Figure 7.25: Delay for 1000-byte Packets at a Packet Rate of 1000 p/s
[Plot: delay against packet sequence number for 0 KB, 6 KB and 60 KB buffers]
Figure 7.26: Delay for 1000-byte Packets at a Packet Rate of 5000 p/s
Delay and delay jitter become worse when buffering is used for larger packets,
because the packet rate for these packets cannot become very high and buffers are less
frequently flushed.
The median, the standard deviation from the mean and the standard deviation of
change in delay from the mean for a range of packet lengths and packet rates (p/s:
packets per second) were computed to show the impact of buffering on them. They
are presented in Table 7.1 and all are expressed in microseconds.
Their definitions are [SAN01]:
i) Median (M) is the (n+1)/2 th value if the values are put in rank order:
$$M = d_{\frac{n+1}{2}} \qquad [7.3]$$
ii) Standard Deviation (from the Mean) B, which is derived by summing the squares of
the differences between each value and the mean value:
[7.4]
iii) Median Change in delay (CM) is the (n+1)/2 th value if the change in delay values are put
in rank order:
$$CM = Cd_{\frac{n+1}{2}} \qquad [7.5]$$
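These three statistics can be computed from a delay log with a few lines of Python. This is a sketch under stated assumptions: the population form of the standard deviation is used because the text does not say whether the sum is divided by n or n-1, and the change in delay is taken as the absolute difference between the delays of consecutive packets. Note also that `statistics.median` averages the two middle values when n is even. The delay values are invented.

```python
import statistics


def delay_summary(delays_us):
    """Return (median, standard deviation, median change in delay) in microseconds."""
    # Assumption: change in delay = |difference| between consecutive packets.
    changes = [abs(b - a) for a, b in zip(delays_us, delays_us[1:])]
    return (
        statistics.median(delays_us),
        statistics.pstdev(delays_us),     # population form, an assumption
        statistics.median(changes),
    )


# Invented triangular pattern: two buffer flushes of five packets, 1000 us apart.
delays = [5000, 4000, 3000, 2000, 1000, 5000, 4000, 3000, 2000, 1000]
median, deviation, median_change = delay_summary(delays)
```

For a triangular pattern such as this one, the median change in delay equals the packet inter-arrival time, which is consistent with the CM values close to 1000 microseconds that Table 7.1 reports at 1000 p/s when buffers are used.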
As shown in Table 7.1, for small packets at low packet rates, M, B and CM increase
when buffering is used. At higher packet rates, M, B and CM reduce if buffers of 6 KB
are used. However, M
increases and B and CM reduce if 60 KB buffers are used. For larger packets, the packet
rate cannot become very high and M, B and CM increase.
                    No Buffers            6 Kbytes Buffers            60 Kbytes Buffers
                  M        B   CM        M         B   CM          M          B   CM
100 bytes
  1000 p/s       97     2.35    0    31056   16792.1  971     309401   168574.6  971
  5000 p/s       98     4.19    0     7167   2899.02  166      71406    28834.5  167
 10000 p/s    14190  7756.03   39     4375   1040.29   60    43419.5   10389.71   60
 15000 p/s    16250  8687.08   87     3748    285.33   17    37014.5    3365.62   17
500 bytes
  1000 p/s       98     4.78    0     6227   3348.12  966    62544.5    34559.1  966
  5000 p/s       99     4.03    0     1490   552.215  164      15138    5629.46  165
 10000 p/s     7471   3858.2   47      937   200.581   54       9716    1791.39   52
1000 bytes
  1000 p/s      100     3.62    0     3174   1664.72  970    31345.5   16898.55  979
  5000 p/s      111     9.55    0      687    282.96  163     7546.5    2952.58  175
1500 bytes
  1000 p/s      101     4.45    0     1710   1086.17  967      20636   11379.71  978
  4000 p/s      101     6.01    0      649       254  225       6156    2595.23  228
Table 7.1: The Median, Standard Deviation from the Mean and the Standard Deviation of
Change in Delay for different packet lengths at different packet rates
Summarising, the method of adding buffers is more suitable when small packets at
high packet rates traverse the network.
7.4 Switching between Active Applications 
As described in Chapter 5, the AE hosts one FPGA device; therefore only one FPGA-AA
can have access to it at a time. If more than one FPGA-AA has to
execute, the APPLOD performs the following operations every time the Rest Period
(RP) expires:
i) It transmits a control packet to the previous AA, asking it to release the FPGA resources.
The control packet is one byte long and it is sent to the AA via the same Unix
domain socket path as the normal packets. The value of the control packet is
0x01 and the AA has to release the FPGA resources (by closing the FPGA
board).
ii) It checks whether the AA has released the FPGA resources. This is achieved
through a LibGTop function accessing the /proc/maps file (Figure 5.15). If an
inode=65410 is found, the FPGA is still reserved by the first AA;
otherwise the FPGA is free to be used by the new AA. The status of the previous
AA is set to PSTAT2. If the previous AA does not release the FPGA resources,
the APPLOD penalises it by killing it. In that case, the mjn_counter is increased by
one and its status is set to PSTAT1.
iii) It transmits a control packet to the new AA to inform it that the FPGA is free for use. The
value of the control packet in this case is 0x02. Then, it checks whether the new AA
has gained access to the FPGA via the /proc/maps file. If it has not, it checks
the inode again up to APPLIC_TRY times (mnf_counter is increased by one each time).
If the AA is finally unable to open the FPGA
card (in case there is a bug in the code of the AA), all the incoming packets
stored in the meantime are sent back to the network. If the AA successfully
opens the FPGA card, its status is set to PSTAT3.
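The inode check used in steps ii) and iii) can be illustrated with a small parser. The APPLOD obtains this information through LibGTop; the sketch below instead parses the text of a Linux memory-map listing directly, which is an assumption made for the example. The function name, the sample listing and the device path in it are hypothetical; only the inode value 65410 comes from the text.

```python
FPGA_INODE = 65410   # inode quoted in the text for the FPGA device


def holds_fpga(maps_text, inode=FPGA_INODE):
    """Return True if any mapping in the listing refers to the FPGA inode."""
    for line in maps_text.splitlines():
        fields = line.split()
        # Linux layout: address perms offset dev inode [pathname]
        if len(fields) >= 5 and fields[4] == str(inode):
            return True
    return False


# Hypothetical listing for an AA that still has the FPGA board mapped.
sample = (
    "08048000-0804c000 r-xp 00000000 03:01 12345 /usr/bin/aa\n"
    "40015000-40016000 rw-s 00000000 03:01 65410 /dev/fpga0\n"
)
still_reserved = holds_fpga(sample)
```

A supervisor such as the APPLOD would repeat this check a bounded number of times before deciding that the AA has failed to release or to open the device.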
The time Ts needed for a switch between FPGA-AAs to take place is: Ts = TRP + Tinit,
where TRP is the time of the RP and Tinit is the time an FPGA-AA needs to initialise its
data and be ready to execute. TRP > Tinit, otherwise the AA will not be able to service
packets.
In order to test the switch between the AAs, two FPGA-AAs were used: the DES
Encryption/Decryption algorithm described in Chapter 4 and a Nibble-Reverse
application. The second application is a very simple FPGA-AA and its task is the
nibble reversal of the payload data of packets. If, for example, the input data is 0xAB,
it returns 0xBA. Like the DES application, it is a master-slave application.
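The operation can be illustrated in software. This is a minimal sketch of the transformation only, not of the FPGA design, and it assumes that the two nibbles are swapped within every byte of the payload; the function name is hypothetical.

```python
def nibble_reverse(payload):
    """Swap the high and low nibble of every byte in the payload."""
    return bytes(((b << 4) & 0xF0) | (b >> 4) for b in payload)


swapped = nibble_reverse(b"\xAB")   # b"\xBA"
```

Applying the function twice returns the original payload, which makes the application convenient for testing.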
For this test, the following network was set up:
Figure 7.27: The Test Network Topology
There are two packet sources called Green and Magenta and one packet sink called
Bella. Green generates active packets with GMID=3 (DES encryption) and Magenta
generates active packets with GMID=2 (Nibble reversion). Both types of packets have
to be serviced by the FPGA, in the AE. Delay measurements were taken only for the
packets with GMID=3 (they were timestamped in the input and the output hooks of
the Linux kernel running in the AE using the method with the LKMs described in
section 7.3).
For each received packet, an application executing in Bella computed the delay and
stored its value in a log file. Both packet sources sent their packets to Bella. As
described before, delay measurements were taken only for packets with GMID=3;
therefore the AE timestamped only the packets sent by Green. In the same way, the
application that ran in Bella had to compute the delay only for packets with GMID=3.
This application was listening on port 44075, which is the active port
(all active packets regardless of their GMID are sent to this port). In order to prevent
this application from receiving packets from Magenta (packets with GMID=2), an
iptables rule was inserted in the kernel of Bella so that packets with GMID=2 were
dropped before reaching the application that computed the delay.
Both packet sources sent 500-byte packets at a packet rate of 100 packets per second. 
For these measurements no buffers were used. 
Before running the test, the Rest Period had to be defined. It should be greater than
the initialisation times Tinit1 and Tinit2, in order for the two FPGA-AAs to have enough
time to initialise their data and execute. Both applications were already loaded into
memory, so they were in PSTAT2. The time each application needs to initialise its data
is not deterministic, because Linux is not a real-time operating system. Under
heavy load, priority is given to hardware interrupts and user-space processes are not
scheduled to execute very often. For this test, the packet rate was kept low at 100 p/s.
Every time a switch takes place, the APPLOD performs the operations described
previously to ensure that an FPGA-AA has opened the FPGA card, and the application
to execute is set to PSTAT3. Every time the two applications used in the test were
about to execute, the FPGA board was opened and loaded with their bitstream.
Opening the FPGA card is not time-consuming, but the time needed to load the
programmable device with the bitstream depends on the method used to load it.
There are four possible methods for loading an FPGA device with a configuration file
(bitstream):
i) Transfer the bitstream from the hard disk to the memory and then load the
FPGA using programmed I/O (referred to as HD),
ii) Transfer the bitstream from the hard disk to the memory and then load the
FPGA using DMA (referred to as HD_DMA),
iii) Save the bitstream into a memory region and, every time a loading has to take
place, load the FPGA using programmed I/O (referred to as MM),
iv) Save the bitstream into a memory region and, every time a loading has to take
place, load the FPGA using DMA (referred to as MM_DMA).
For the methods MM and MM_DMA, every time an FPGA-AA is set to PSTAT3, it
loads the FPGA with its bitstream, which was transferred (from the hard disk) into
memory when it ran for the first time.
The configuration times needed for the above methods have been measured for the
bitstream of the DES application and are displayed in Figure 7.28.
[Bar chart: configuration time for each loading method (HD, HD_DMA, MM, MM_DMA)]
Figure 7.28: Configuration Times for different Methods used to load the FPGA Device
As expected, programming the FPGA from the memory using DMA is the fastest 
method; therefore both FPGA-AAs that ran during the test were modified to use this 
method. 
The minimum Rest Period selected for proper running of the FPGA-AAs was 100
msec. The test ran for two rest periods (100 and 500 msec) and the delay the packets
of the DES application experience is shown in Figure 7.29.
[Plot: delay against packet sequence number (100 to 1000), one curve per rest period]
Figure 7.29: Delay for the DES Packets for two different Rest Periods
The delay graphs have a triangular shape because, while one of the two applications is
being executed, the packets of the second application are stored in the appropriate
Packet Queue (PQ). The first packet stored in the queue experiences the maximum
delay.
The maximum delay pac kets experienced was 150 msec and 550 msec, when the RP 
was set to 100 msec and 500 msec respective ly. There is an overhead of 50 msec and 
is due to the time applications need to initialise their data and load the FPGA with the 
appropriate bitstream. In these 50 msec, the time overhead the APPLOD needs to 
perform the operations described previous ly is included, for ensuling that the previous 
applicat ion has released the FPGA resources, that the new application has gai ned 
access to the FPGA, for sending the control packets etc. The minimum delay packets 
expelienced was about 1320 usec. These were the packets that entered the AE when 
the AA was already at PSTAT3 and were directly passed from the CPR to the AA 
without buffering. 
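The triangular shape can be illustrated with a small model. The sketch below is not the measurement code; it assumes that packets arrive at a constant interval (2 msec is a hypothetical value), that queued packets are released as soon as the rest period plus the 50 msec switch overhead has elapsed, and that draining the queue takes no time. The function and parameter names are illustrative.

```python
def queued_delays_msec(rest_period_msec, overhead_msec=50.0, packet_interval_msec=2.0):
    """Delay of each packet that arrives while its application is switched out.

    Assumption: every queued packet is released at the same instant, namely
    when the rest period and the switch overhead have both elapsed.
    """
    release_time = rest_period_msec + overhead_msec
    delays = []
    arrival = 0.0
    while arrival < release_time:
        # A packet waits from its arrival until the release instant.
        delays.append(release_time - arrival)
        arrival += packet_interval_msec
    return delays


for rp in (100, 500):
    delays = queued_delays_msec(rp)
    print(f"RP={rp} msec: first packet waits {max(delays)} msec, "
          f"last queued packet waits {min(delays)} msec")
```

Under these assumptions the first queued packet waits for the rest period plus the overhead, and each later packet waits slightly less, which produces the falling edge of each triangle.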
7.5 Performance Evaluation of a Software and a Hardware 
DES Application 
This section investigates the performance of two Active Applications, in
terms of CPU cycle usage and delay, within the AE. These applications are a software
DES encryption application (SW-DES) and a hardware DES encryption application
(HW-DES). The Active Application that performs DES encryption in hardware is the
one described in Chapter 4.
In order to make a fair comparison between the two applications, both of them were
developed such that only the actual encryption part is different. The rest of their parts
are identical, as shown in Figure 7.30.
SW-DES: Split packet payload in 32-bit words and feed the encryption algorithm
        -> DES Encryption (S/W)
        -> Place the encrypted data into the correct position in the packet's payload
HW-DES: Split packet payload in 32-bit words and feed the encryption algorithm
        -> DES Encryption (H/W)
        -> Place the encrypted data into the correct position in the packet's payload
Figure 7.30: DES Encryption performed in Software and Hardware
7.5.1 Delay Measurements 
The delay measurements were taken at the input and output hooks of the AE for
packets of different lengths, for both applications. The method for taking the delay
measurements was described in section 7.3. Packets of various sizes were used, at a
packet rate of 500 p/s.
Before presenting the measurements, it is useful to refer to the block diagram
of the FPGA board, shown in Figure 4.3. FPGA applications use a programmable
clock generator. There are two primary clock inputs to the FPGA, both from the
programmable clock generator [PCIMWWW]. The clock generators on this board are
as follows [PCIMWWW]:
Clock Index   Name   Range               Function
0             LCLK   400 KHz - 40 MHz    Local bus clock
1             MCLK   400 KHz - 100 MHz   General purpose
Table 7.2: Programmable Clock Generators on the PCI-based FPGA board
The encryption algorithm used for the performance tests uses the Local bus clock at
two different frequencies: the default of 28.5 MHz and the maximum allowed of
40 MHz.
The delay that packets of various sizes experience is shown in Figure 7.31 for: (i) the
HW-DES using the default clock of 28.5 MHz, (ii) the HW-DES using the maximum
clock of 40 MHz and (iii) the SW-DES.
[Plot: delay against Packet Size (bytes)]
Figure 7.31: Delay for the SW-DES and the HW-DES using two different Clocks for Packets
of various Sizes
The SW-DES application performs better than the HW-DES. Using a higher
clock rate for the HW-DES does not noticeably change its delay, since the difference
between 28.5 MHz and 40 MHz is too small to affect it. The SW-DES runs on a
1.8 GHz Pentium 4 CPU, whose clock is 45 times faster than the 40 MHz maximum
clock of the HW-DES. This is one of the reasons why the software is faster than the
hardware for these specific applications. A more detailed comparison between the
two applications, and the potential bottlenecks, are described in the next sections.
7.5.2 CPU Consumption Measurements 
This section investigates the performance of the two applications, in terms of CPU
cycle consumption.
7.5.2.1 Performance Counters and the Time-Stamp Counter 
Each processor contains a register called the Time-Stamp Counter (TSC). It gives the
number of clock cycles since the CPU was powered up or reset. By reading this
register at different locations in the source code, the CPU cycles consumed between
these locations can be obtained; therefore the performance of specific parts of the code
(in terms of CPU cycle usage) can be determined.
The TSC is a read-only counter and its content can be read by using assembly
functions, such as the rdtscl function. For convenience, the perfctr [PERWWW]
software interface is used. Its main feature is that every process can have its own
virtual performance counters and time-stamp counter. The value of the TSC can be
read by opening a specific file called /dev/perfctr and using the appropriate system
call. The perfctr package provides a user-space library, so a process can read its virtual
TSC from user-space.
7.5.2.2 CPU Cycle Consumption 
In order to take reliable CPU usage measurements, the perfctr software package was
used. Measurements of the CPU cycle usage (from the TSC) were taken for each
application (SW-DES and HW-DES), at the points in their code shown in Figure 7.32.
The data, within each application, follow the path: (1)->(2)->(3)->(4). The CPU
consumption measurements were taken between the points (1)->(2), (2)->(3) and
(3)->(4). The CPU cycles consumed in the paths (1)->(2) and (3)->(4) will be the same
for both applications (HW-DES and SW-DES), because the code between these points
is identical. The number of CPU cycles spent between (2)->(3) is expected to be
different.
(1)
Split packet payload in 32-bit words and feed the encryption algorithm
(2)
DES Encryption (S/W or H/W)
(3)
Place the encrypted data into the correct position in the packet's payload
(4)
Figure 7.32: Perfctr Probes placed in the two DES Applications
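The probe placement can be sketched as follows. This is only an illustration of the measurement pattern, not the thesis code: Python's time.perf_counter_ns() stands in for the virtual TSC read through perfctr (so the result is in nanoseconds rather than CPU cycles), and the cipher stage is a placeholder XOR, not DES. All function names are hypothetical.

```python
import time


def split_into_words(payload):
    """Stage (1)->(2): split the payload into 32-bit (4-byte) words."""
    return [payload[i:i + 4] for i in range(0, len(payload), 4)]


def placeholder_cipher(words):
    """Stage (2)->(3): stand-in for the DES step (a plain XOR, NOT real DES)."""
    return [bytes(b ^ 0x5A for b in word) for word in words]


def reassemble(words):
    """Stage (3)->(4): place the processed words back into one payload."""
    return b"".join(words)


def probe_stages(payload, counter=time.perf_counter_ns):
    """Run the three stages and return the output plus the cost of each segment."""
    p1 = counter()                      # probe (1)
    words = split_into_words(payload)
    p2 = counter()                      # probe (2)
    processed = placeholder_cipher(words)
    p3 = counter()                      # probe (3)
    result = reassemble(processed)
    p4 = counter()                      # probe (4)
    costs = {"1-2": p2 - p1, "2-3": p3 - p2, "3-4": p4 - p3}
    return result, costs


result, costs = probe_stages(bytes(range(64)))
print(costs, "total:", sum(costs.values()))
```

Because the counter is passed in as a parameter, the same probe code could be reused with a different counter source; only segment (2)->(3) changes between the two applications being compared.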
In the following figure, the CPU cycles spent for the path (1)->(4) are shown, for both
applications and for different packet lengths.
[Plot: CPU cycles against Data Size (bytes), for the SW-DES and the HW-DES]
Figure 7.33: CPU Cycle Consumption for the HW-DES and the SW-DES for Packets of
different Sizes
The SW-DES algorithm has a better performance. This can be explained by analysing
the mechanism that the applications use to transfer data to and from the actual
encryption algorithm.
The SW-DES application
This application transfers the unencrypted data to the encryption module and then
reads the encrypted data back. This operation requires two memory copies,
which are fast and have a low CPU overhead. Most of the CPU cycles are spent on the
actual encryption operation.
The HW-DES application
The HW-DES application operates significantly differently because it sends and
receives data through the PCI bus. It therefore consists of two parts: a software part
that splits the data into 32-bit words (because a 32-bit PCI bus is used) and a hardware
part that performs the encryption (as described in Chapter 4). The block diagram of
the DES algorithm is shown in Figure 4.9.
Generally, a DES algorithm takes as input a 64-bit plaintext and a 64-bit key, and gives
a 64-bit ciphertext as output. The plaintext, the key and the ciphertext each have to be
split into two 32-bit words and passed separately to and from the FPGA, because a
32-bit PCI bus is used. In Figure 4.9, Data_in1 is the high word (32-bit) of the plaintext,
Data_in2 is the low word (32-bit) of the plaintext, and Key_in1 and Key_in2 are the low
word and high word of the key respectively. Encrypt is a flag that specifies whether the
algorithm will encrypt or decrypt the incoming data, because this application is
actually an encryption/decryption algorithm.
For every 64 bits of plaintext, five "PCI writes" (Data_in1, Data_in2, Key_in1,
Key_in2, and Encrypt) and two "PCI reads" (Data_out1, Data_out2) are necessary.
Additionally, one hardware interrupt is raised by the FPGA for each 64 bits of
encrypted data. This interrupt is necessary because it informs the host process that the
encryption of the data has finished and the encrypted data can be read back.
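The word splitting and the number of PCI operations described above can be sketched as follows. This is a minimal illustration that assumes a payload which is not a multiple of 8 bytes is padded up to a whole 64-bit block; the helper names are hypothetical and do not come from the thesis code.

```python
import math

MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit integer into its (high, low) 32-bit words."""
    return (value >> 32) & MASK32, value & MASK32


def join64(high, low):
    """Rebuild a 64-bit integer from its high and low 32-bit words."""
    return ((high & MASK32) << 32) | (low & MASK32)


def pci_operations(length_bytes):
    """Count PCI operations for a payload, per the text: five writes,
    two reads and one interrupt for every 64-bit block."""
    blocks = math.ceil(length_bytes / 8)   # assumption: partial blocks are padded
    return {
        "blocks": blocks,
        "writes": 5 * blocks,
        "reads": 2 * blocks,
        "interrupts": blocks,
    }


high, low = split64(0x0123456789ABCDEF)
print(hex(high), hex(low))
print(pci_operations(1400))
```

For a 1400-byte payload this gives 175 blocks, and therefore 875 writes, 350 reads and 175 interrupts, which indicates why the per-block transfer cost matters for large packets.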
For PCI bus reading or writing, memory mapping is used. The address space of the 
host process is memory-mapped to the FPGA space via the PLX 9080 PCI controller 
(Figure 4.10). Writing to and reading from this memory address generates the
appropriate PCI cycles for writing and reading data through the PCI bus.
Referring to Figure 7.32, the total cost (for every 64 bits of plaintext), in terms of
CPU cycle usage, can be written as:
$C_{tot} = C_{1-2} + C_{2-3} + C_{3-4}$ [7.6]
where $C_{i-j}$ is the number of CPU cycles consumed between the points i and j in the
software code. The costs $C_{1-2}$ and $C_{3-4}$ are the same for both applications (SW-DES
and HW-DES), because the code between these points is identical. The cost $C_{2-3}$ will
be significantly different, since the SW-DES algorithm performs only memory copies
but the HW-DES uses the PCI bus plus one interrupt.
Rewriting Formula 7.6, the cost for encrypting L bytes of data (or L/8 blocks of 64
bits) is:
[7.7] 
The cost $C_{2-3}$ (for the HW-DES) can be further analysed into:
$C_{2-3} = 5 \times C_{write} + 2 \times C_{read} + C_{inter}$ [7.8]
where $C_{write}$ is the cost of writing 32 bits of data to the PCI bus, $C_{read}$ is the cost of
reading 32 bits of data from the PCI bus and $C_{inter}$ is the cost of using the hardware
interrupt generated by the FPGA. Formula 7.8 can be rewritten as:
$C_{2-3} = C_{write\_tot} + C_{read\_tot} + C_{inter}$ [7.9]
where $C_{write\_tot} = 5 \times C_{write}$ and $C_{read\_tot} = 2 \times C_{read}$. The sum of the costs
$C_{1-2}$ and $C_{3-4}$ will be referred to as the cost $C_{appl} = C_{1-2} + C_{3-4}$. The total cost is:
$C_{tot}(L) = C_{appl}(L) + \frac{L}{8} \times (C_{write\_tot} + C_{read\_tot} + C_{inter})$ [7.10]
Measurements were taken using perfctr for the costs $C_{appl}(L)$, $C_{write\_tot}$, $C_{read\_tot}$
and $C_{inter}$. The median value of each cost was computed and is shown below:
$C_{read\_tot}$ (cycles)   $C_{write\_tot}$ (cycles)   $C_{inter}$ (cycles)
5012                     500                       31228
Table 7.3: CPU costs for reading and writing data through the PCI bus and using the hardware
interrupt
The cost $C_{appl}(L)$ is a function of the data size (L bytes) and it was measured separately.
Its values, for different packet lengths, are shown in Figure 7.34.
[Plot: CPU cycles against Data Size (bytes)]
Figure 7.34: Cost $C_{appl}$ for Packets of different Sizes
From Table 7.3, it is clear that reading from the PCI bus ($C_{read\_tot}$) and using the
interrupt ($C_{inter}$) are CPU-consuming operations. For these reasons, the performance of
the HW-DES is worse than that of the SW-DES (Figure 7.33).
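The contribution of each term can be checked numerically. The sketch below evaluates Formula 7.10 with the median values of Table 7.3; the measured $C_{appl}(L)$ is only available graphically (Figure 7.34), so it is passed in as a parameter, and the function name is illustrative.

```python
# Median costs per 64-bit block, in CPU cycles (Table 7.3).
C_WRITE_TOT = 500     # five 32-bit PCI writes
C_READ_TOT = 5012     # two 32-bit PCI reads
C_INTER = 31228       # one hardware interrupt


def hw_des_cost_mm(length_bytes, c_appl):
    """Formula 7.10: total cost of the HW-DES with memory mapping.

    c_appl is the measured C_appl(L) for this length (an input here,
    because its numeric values are not tabulated in the text).
    """
    blocks = length_bytes / 8
    return c_appl + blocks * (C_WRITE_TOT + C_READ_TOT + C_INTER)


per_block = C_WRITE_TOT + C_READ_TOT + C_INTER
for name, cost in (("writes", C_WRITE_TOT), ("reads", C_READ_TOT), ("interrupt", C_INTER)):
    print(f"{name}: {cost} cycles, {cost / per_block:.1%} of the per-block cost")
```

With these values the interrupt accounts for roughly 85 % of the 36,740 cycles spent per block, and the two reads for most of the remainder.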
These measurements make it clear that transfers to and from the PCI bus using
memory mapping are too expensive to be used in an Active Router that
has to service a heavy load of input active traffic. The performance of the hardware
could be improved by using a mechanism other than memory mapping for
transferring data through the PCI bus. A possible solution is to use DMA (Direct
Memory Access) transfers. DMA transfers do not use the CPU; therefore a lower total 
CPU utilisation is expected. 
In the next section, the CPU cost for using DMA transfers is measured and compared 
to that of memory mapping. A prediction of the CPU cost is then performed for the 
HW-DES (using DMA) and compared to the SW-DES and HW-DES with memory 
mapping. 
Using Direct Memory Access (DMA) Transfers
DMA is a hardware mechanism that allows peripheral components to transfer their
I/O data directly to and from main memory, without the need for the system processor
to be involved in the transfer.
By replacing memory mapping with DMA for the HW-DES, Formula 7.9 is rewritten
as follows:
$C_{2-3} = C_{write\_DMA} + C_{read\_DMA}$ [7.11]
The cost $C_{inter}$ contained in Formula 7.9 is not present in Formula 7.11, because when
a process performs a DMA read, the hardware writes data to the DMA buffer and
raises an interrupt when it is done. So, the cost of each hardware interrupt is included
in the cost $C_{read\_DMA}$.
An application that performs DMA transfers through the PCI bus was available and
was used to predict the performance of the HW-DES. Its block diagram is shown in
Figure 7.35. This simple application performs a DMA write to transfer data from
Buffer1 to the FPGA and then performs a DMA read to read them back from the
FPGA.
Using this application, measurements of the costs $C_{write\_DMA}$ and $C_{read\_DMA}$ (Formula
7.11) could be taken. These two costs will be the same for every application that has
to perform DMA transfers, because the same software interface (between the
user-space of the process and the hardware) is used. These measurements could then be
used to make a prediction for the CPU cycle usage the HW-DES would cause, if it
were using DMA transfers.
[Block diagram: Buffer1 and Buffer2 on the Linux OS side and Buffer3 on the FPGA side,
connected across the PCI Bus by DMA_Write and DMA_Read]
Figure 7.35: Block Diagram of a Simple Application that performs DMA Transfers
$C_{pred\_tot}$ is the total predicted CPU cost and it is given by the formula:
$C_{pred\_tot}(L) = C_{appl}(L) + \frac{L}{8} \times (C_{write\_DMA} + C_{read\_DMA})$ [7.12]
As for memory mapping, five PCI writes and two PCI reads are necessary
for every 64 bits of plaintext. Formula 7.12 should therefore be written as:
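$C_{pred\_tot}(L) = C_{appl}(L) + \frac{L}{8} \times (5 \times C_{write\_DMA} + 2 \times C_{read\_DMA})$ [7.13]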
or,
$C_{pred\_tot}(L) = C_{appl}(L) + \frac{L}{8} \times (C_{write\_tot\_DMA} + C_{read\_tot\_DMA})$ [7.14]
where $C_{write\_tot\_DMA} = 5 \times C_{write\_DMA}$ and $C_{read\_tot\_DMA} = 2 \times C_{read\_DMA}$.
Using the simple application mentioned above and the perfctr software, the median
values of the costs $C_{write\_tot\_DMA}$ and $C_{read\_tot\_DMA}$ were measured (Table 7.4).
$C_{write\_tot\_DMA}$ (cycles)   $C_{read\_tot\_DMA}$ (cycles)
371,580                        144,836
Table 7.4: CPU costs for reading and writing data through the PCI bus using DMA
Comparing the values of Table 7.4 with those of Table 7.3, it is clear that
transferring only 32 bits at a time using DMA is more expensive than using
memory mapping. This is because of the overhead incurred in preparing each DMA
transfer.
Using Formula 7.14, the predicted cost of using DMA is computed and compared to
that of memory mapping, for different packet sizes.
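The comparison can be reproduced from the two tables. The sketch below evaluates only the transfer term of Formulas 7.10 and 7.14; the term $C_{appl}(L)$ is identical for both mechanisms and is left out because its values are only given graphically. The constant and function names are illustrative.

```python
# Per-block (64-bit) transfer costs in CPU cycles.
MM_PER_BLOCK = 500 + 5012 + 31228     # C_write_tot + C_read_tot + C_inter (Table 7.3)
DMA_PER_BLOCK = 371580 + 144836       # C_write_tot_DMA + C_read_tot_DMA (Table 7.4)


def transfer_cost(length_bytes, per_block_cost):
    """Transfer term of Formulas 7.10 and 7.14: (L / 8) blocks times the block cost."""
    return (length_bytes / 8) * per_block_cost


for size in (100, 700, 1400):
    mm = transfer_cost(size, MM_PER_BLOCK)
    dma = transfer_cost(size, DMA_PER_BLOCK)
    print(f"{size} bytes: MM {mm:,.0f} cycles, DMA {dma:,.0f} cycles, "
          f"ratio {dma / mm:.1f}")
```

Because both terms scale with L/8, the ratio is the same at every size: unbuffered DMA costs about 14 times as many cycles as memory mapping for the transfer itself.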
[Plot: CPU cycles against Data Size (bytes), for Memory Mapping and DMA]
Figure 7.36: Predicted CPU Cycle Usage when using Memory Mapping and DMA for
different Data Sizes
As shown in Figure 7.36, there is a huge performance degradation when using DMA.
Transferring only 32 bits of data at a time using DMA is very expensive because of
the overhead incurred each time a transfer takes place. For each DMA
transfer, a device driver performs several operations before the data are transmitted.
These operations include: (i) providing the DMA controller with the direction of the
transfer (from the device to the host address space or vice-versa), the bus address and
the size of the transfer, (ii) "talking" to the peripheral device to prepare it for
transferring the data and (iii) responding to the interrupt when the DMA is over.
Buffering data and transferring larger blocks could improve the performance of the
DMA. In order to investigate the CPU cycles needed when using DMA with different
buffer sizes, the simple application was used again. Data were buffered in blocks of 100
bytes and sent to the FPGA with DMA. The experiment was repeated for block sizes
of 100 bytes, 200 bytes and so on, up to 1400 bytes. In Figure 7.37, the performance of
the DMA (in terms of CPU cycle usage) is compared when buffering was used (Buf) and
when no buffering was used (No Buf).
[Plot: CPU cycles (logarithmic scale) against Data Size (bytes), for Buf and No Buf]
Figure 7.37: CPU Cycle Usage when using DMA with and without Data Buffering
Fewer CPU cycles are consumed when data are buffered, as shown in Figure 7.37. For
example, if the size of the data to be encrypted is 100 bytes, the cost of transferring them
through the PCI bus using DMA and a buffer of 100 bytes is, from Figure 7.37,
150,334 CPU cycles. If a 4-byte (32-bit) buffer were used, then 100/4 = 25 DMA
writes and 25 DMA reads would be needed. The cost would be
$25 \times C_{write\_DMA} + 25 \times C_{read\_DMA}$ = 3,668,350 cycles. The buffered transfer
therefore needs 95.9 % fewer CPU cycles, so buffering data into
larger blocks improves DMA performance.
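The arithmetic of this example can be checked directly from Table 7.4, where the tabulated totals cover five writes and two reads. The sketch below derives the single-transfer costs from those totals; the variable names are illustrative.

```python
C_WRITE_TOT_DMA = 371_580          # cycles for five DMA writes (Table 7.4)
C_READ_TOT_DMA = 144_836           # cycles for two DMA reads (Table 7.4)
BUFFERED_COST_100_BYTES = 150_334  # cycles, value quoted from Figure 7.37

# Cost of one 32-bit DMA write and one 32-bit DMA read.
c_write_dma = C_WRITE_TOT_DMA // 5
c_read_dma = C_READ_TOT_DMA // 2


def unbuffered_cost(length_bytes, word_bytes=4):
    """Cost of moving the data one 32-bit word at a time (one write and one read each)."""
    transfers = -(-length_bytes // word_bytes)   # ceiling division
    return transfers * (c_write_dma + c_read_dma)


unbuffered = unbuffered_cost(100)
saving = 1 - BUFFERED_COST_100_BYTES / unbuffered
print(unbuffered, f"{saving:.1%}", round(unbuffered / BUFFERED_COST_100_BYTES, 1))
```

The unbuffered cost comes out at 3,668,350 cycles, about 24 times the buffered cost, which corresponds to the 95.9 % saving quoted in the text.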
Another benefit of buffering data is interrupt mitigation. With larger block sizes,
only one hardware interrupt is raised for each block of data, rather than one interrupt
for every eight bytes as before.
The performance of the HW-DES using buffering and interrupt mitigation is predicted
and compared to that of the HW-DES that performs no buffering. The predicted cost
for the HW-DES using DMA, buffering and interrupt mitigation is given by the
formula:
$C_{pred\_tot}(L) = C_{appl}(L) + C_{write\_DMA}(L) + C_{read\_DMA}(L) + C_{write\_DMA}(12)$ [7.15]
The cost $C_{write\_DMA}(12)$ is the number of CPU cycles needed to send the key (8 bytes)
and the encrypt flag (4 bytes) for the DES (assuming that both are sent using a single
DMA write of 12 bytes). The cost $C_{appl}(L)$ is already known from previous measurements.
Figure 7.38 shows the performance improvement for different block sizes for the
HW-DES (with DMA) when buffering and interrupt mitigation were used, compared to
when no buffering and no interrupt mitigation were used.
[Plot: performance improvement against Data Size (bytes)]
Figure 7.38: Performance Improvement when using DMA with data buffering and Interrupt
Mitigation
As shown in the above graph, performance increases as the size of the data increases
because interrupt mitigation becomes stronger (only one interrupt for more data) and
the performance of the actual DMA increases.
The next figure compares the performance of the different modes of the HW-DES
with the SW-DES. MM is the HW-DES using memory mapping for transferring data,
DMA_NOBUF is the HW-DES using DMA but no buffering or interrupt mitigation
and DMA_BUF_INTMIT is the HW-DES that uses DMA with buffering and interrupt
mitigation. The CPU usage figures for DMA_NOBUF and DMA_BUF_INTMIT are
predicted values.
[Plot: CPU cycles (logarithmic scale) against Packet Size (bytes), for the SW-DES and the
HW-DES versions]
Figure 7.39: CPU Cycle Usage for the SW-DES and the different Versions of the HW-DES
As shown in Figure 7.39, the performance of the HW-DES can become better than that
of the SW-DES above a threshold data size (400 bytes), if DMA and interrupt
mitigation are used. For the SW-DES, CPU cycles are spent mainly on the actual
encryption of the data, because the memory copies used to transfer the data to the
encryption algorithm do not need many CPU cycles. For the HW-DES, although the
encryption is performed in hardware (FPGA) and is much faster than in software, the
mechanism that transfers the data to and from the FPGA device needs a large number
of CPU cycles, which reduces the overall performance.
There are some other methods that could be used to further improve the performance 
of the HW-DES, such as the use of a 64-bit PCI bus or a PCI-X bus, or the use of a 
dual processor PC. 
7.6 Summary 
This chapter has presented performance evaluation issues regarding the software
architecture of the AE, the switch between FPGA-AAs, and a comparison between a
DES encryption algorithm implemented in hardware and a DES encryption algorithm
implemented in software.
As far as the performance of the AE in terms of packet loss is concerned, there is
extensive packet loss under heavy input load (small packets at high rates). One method
of improving the throughput is buffering, because it reduces the number of process
context switches between the user-space processes of the AE. With fewer process
context switches, fewer CPU cycles are consumed; therefore fewer
packets are lost.
By using buffers, the packet loss is reduced but there is an impact on the delay that
packets experience within the AE. The delay (when buffers are used) depends on the
size of the buffers, the packet length and the input packet rate. With smaller packets at
high packet rates, the delay and the delay jitter are reduced when buffers of 6 Kbytes
are used. Delay and delay jitter increase when larger buffers are used.
The next section of the chapter has described the switch between two FPGA-AAs.
Under a relatively low input packet rate, the smallest switch time is 150 msec.
The last part of this chapter has presented the performance evaluation of a DES
encryption algorithm implemented in hardware and a DES implemented in software,
in terms of CPU cycle usage and delay within the AE. Initially, the software DES has
a better performance. The performance of the HW-DES in terms of CPU cycle usage
can be improved by using DMA and interrupt mitigation.
Chapter 8 
Active Secure FTP 
8. Active Secure FTP 
8.1 Chapter Summary 
This chapter presents the implementation of an Active Secure FTP application.
Packets are encrypted and decrypted within the Active Network, so they are protected.
The method used to activate and deactivate the packets is described and then the
performance, in terms of downloading time, of the normal FTP and the
Active Secure FTP is compared.
8.2 Activating the Passive Packets 
As described in Chapter 4, passive packets are wrapped into UDP packets to form
active packets (Figure 4.2). The Active Applications, as well as the activation and
deactivation of the passive packets, should be transparent to the end-to-end applications.
In the previous chapters, it was shown how the Active Network approach becomes
transparent to the end hosts. The Active Routers perform routing as the passive
routers do and the end hosts do not need to be aware of the locations of the routers.
A mechanism has to be defined so that the wrapping of the passive packets into active
packets, and their later deactivation, becomes transparent to the end-to-end
applications. Such a mechanism can be implemented by using two Loadable Kernel
Modules (LKMs) in the kernel-space of a Linux host. The first LKM is registered
in the NF_IP_POST_ROUTING hook (Figure 8.1) and activates the outgoing
passive packets; the second LKM is placed in the NF_IP_PRE_ROUTING
hook and deactivates the incoming active packets.
Every packet that traverses the Linux kernel is described by an skbuff structure
(Appendix G). This is created by the operating system when a packet is received
from the network or generated by a user-space process. This structure holds useful
information about the packet, such as the headers (IP, TCP or UDP etc.), the payload
data, the time the packet entered the kernel, the name of the interface card that
captured the packet etc.
User-Space:   Processes (FTP etc)
Kernel-Space: DEACTIVATE_INCOMING_PACKETS() at NF_IP_PRE_ROUTING
              ACTIVATE_OUTGOING_PACKETS() at NF_IP_POST_ROUTING
NETWORK
Figure 8.1: Netfilter Hooks used for the activation and deactivation of Packets
The data-related parts of the skbuff structure are shown below:
Headroom | Data Area | Tailroom
Figure 8.2: The Data-Related Information of the skbuff Structure
When a user-space process creates a packet, and before the packet is injected into the
network, it travels through several parts of the kernel network stack, depending on the
type of packet. If it is a TCP packet, for example, it first travels through the TCP stack,
where the TCP header is added, and then it passes through the IP stack. There, the IP
header is added and the packet is finally passed to the device driver of the network
interface card, which adds the Ethernet header and injects it into the network. All the
above headers are added in the memory area called headroom, shown in Figure 8.2.
Using the same method, packets can be activated by adding the appropriate headers in
the headroom of the corresponding skbuff structures.
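The use of the headroom can be illustrated with a toy user-space model. The class below is not the kernel structure; it only mimics the headroom / data / tailroom layout of Figure 8.2, and its push and pull methods are loosely modelled on the kernel helpers skb_push and skb_pull. The header contents are placeholder bytes.

```python
class PacketBuffer:
    """Toy model of a packet buffer laid out as headroom | data | tailroom."""

    def __init__(self, payload, headroom, tailroom=0):
        self.buf = bytearray(headroom) + bytearray(payload) + bytearray(tailroom)
        self.data_start = headroom
        self.data_end = headroom + len(payload)

    @property
    def headroom(self):
        """Free bytes available in front of the data area."""
        return self.data_start

    def push(self, header):
        """Prepend a header by growing the data area into the headroom."""
        if len(header) > self.headroom:
            raise ValueError("not enough headroom")
        self.data_start -= len(header)
        self.buf[self.data_start:self.data_start + len(header)] = header

    def pull(self, length):
        """Strip `length` bytes from the front of the data area and return them."""
        if length > self.data_end - self.data_start:
            raise ValueError("not enough data")
        removed = bytes(self.buf[self.data_start:self.data_start + length])
        self.data_start += length
        return removed

    def data(self):
        return bytes(self.buf[self.data_start:self.data_end])


pkt = PacketBuffer(b"payload", headroom=64)
for header in (b"TCP", b"IP", b"ETH"):   # placeholder header bytes
    pkt.push(header)
print(pkt.data(), pkt.headroom)
```

Each layer prepends its header without copying the payload, which is why activation can be implemented as further pushes into the same headroom and deactivation as the corresponding pulls.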
Two additional layers, the Active Network Layer 1 (ANL1) and Active Network Layer
2 (ANL2), are created within the kernel IP stack (through the use of the two LKMs).
Incoming packets are deactivated in ANL2 and outgoing packets are activated in ANL1.
User-Space Processes
UDP or TCP Layer
IP Layer
ANL1 | ANL2
Ethernet Layer
Figure 8.3: The Active Network Layers
For the activation, three additional headers (IP, UDP and ACT) are added to the
packets (Figure 4.2). There is a 40-byte overhead for each packet, and if the available
headroom is smaller than 56 bytes (16 bytes are needed for the Ethernet header), it is
expanded to fit the additional headers. The IP checksum of the first IP header and the
UDP checksum of the UDP header are computed (Figure 4.2).
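The two computations mentioned above can be sketched in user space. The checksum function below is the standard 16-bit one's-complement Internet checksum used by IPv4 headers (RFC 1071); the headroom check uses the 40-byte and 16-byte figures from the text. The sample header bytes and the function names are illustrative and are not taken from the LKM source.

```python
import struct

ACTIVATION_OVERHEAD = 40   # bytes for the IP, UDP and ACT headers, per the text
ETHERNET_RESERVE = 16      # bytes reserved for the Ethernet header, per the text


def internet_checksum(data):
    """16-bit one's-complement checksum (RFC 1071), as used for IPv4 headers."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                        # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def extra_headroom_needed(available_headroom):
    """Bytes by which the headroom must be expanded before activation."""
    required = ACTIVATION_OVERHEAD + ETHERNET_RESERVE
    return max(0, required - available_headroom)


# Sample 20-byte IPv4 header with the checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))
print(extra_headroom_needed(32))
```

A header that already contains its correct checksum sums to zero under the same function, which is how a receiver can verify it.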
For the deactivation, the first three headers are removed from the headroom of the
packets and then the passive packet is passed to the upper layers as normal.
The netfilter hooks, used for the activation and deactivation of the packets, give the
ability to: (i) load the LKMs on demand and (ii) use filters that specify which packets
will be activated and deactivated. The packets of an application can be sent as active
or passive packets by loading or unloading the LKMs.
Filtering is necessary because not all the outgoing packets have to be activated. Port
numbers can be explicitly defined in the source code of the LKMs or passed via the
command line interface. For example, by typing "insmod outhook.o port1=20
port2=21 gmid=3", the LKM called outhook activates the packets generated by an
FTP server (the port numbers are the well-known ports 20 and 21). The value of the
GMID in the Active Header is set to 3.
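The filtering decision can be sketched in user space as follows. This is a hypothetical model, not the LKM code: it parses an argument string of the form shown above and assumes that a packet qualifies for activation when either its source or its destination port is one of the configured ports. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivationFilter:
    ports: frozenset   # ports whose packets should be activated
    gmid: int          # value written into the GMID field of the Active Header

    def matches(self, src_port, dst_port):
        """Assumption: match on either end, so both server and client traffic qualify."""
        return src_port in self.ports or dst_port in self.ports


def parse_module_args(args):
    """Parse a string such as 'port1=20 port2=21 gmid=3' into a filter."""
    values = dict(item.split("=", 1) for item in args.split())
    ports = frozenset(int(v) for k, v in values.items() if k.startswith("port"))
    return ActivationFilter(ports=ports, gmid=int(values["gmid"]))


ftp_filter = parse_module_args("port1=20 port2=21 gmid=3")
print(ftp_filter.matches(20, 40000))     # FTP data sent by the server
print(ftp_filter.matches(40000, 21))     # FTP control sent by the client
print(ftp_filter.matches(40000, 40001))  # traffic on unknown ports
```

A filter of this kind only works when the relevant ports are known in advance, which is the case for the well-known FTP ports 20 and 21.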
8.3 Description and Performance Evaluation of the Active Secure 
FTP Application 
FTP (File Transfer Protocol) is built over TCP and is used to transfer files between a
server and a client. It is not a secure method, because data are not encrypted. This
section describes an Active Secure FTP application where data are encrypted within
the network, using the DES algorithm.
For the implementation of the Active Secure FTP, the network topology shown in
Figure 8.4 was set up, with two Active Routers (AR1 and AR2), an FTP client and an
FTP server. AR1 consists of two Linux hosts (AE and RFE), while AR2 consists of one
Linux host, both to show that an FPGA is not always needed and for comparison.
AR1 can host FPGA-AAs as well as software-only AAs, in contrast with AR2, which can
host only software AAs.
[Topology diagram showing the FTP Client, AR1, AR2 and the FTP Server]
Figure 8.4: Active Secure FTP performed by two Active Routers
Active Secure FTP is a conventional FTP application with the difference that data are
encrypted and decrypted in the two ARs by the active applications. AR1 hosts an AA
that encrypts and decrypts data using a DES cipher executed in the FPGA, while AR2
hosts a software-only Active Application that encrypts and decrypts the data. The
network, symbolised by the cloud in Figure 8.4, is assumed to be a non-trusted
network, so FTP data have to be encrypted.
The encryption and decryption that take place within the network require active
packets. The FTP server and the client generate passive packets, which are activated
by LKMs (Figure 8.1) prior to transmission into the network. There are two LKMs in
each host: one that activates the outgoing FTP packets and a second one that
deactivates the incoming active packets. The first LKM registers itself in the
NF_IP_POST_ROUTING hook, while the second one registers in the
NF_IP_PRE_ROUTING hook. The LKMs form the ANL1 and ANL2 (Figure 8.3). For the
second LKM, it is easy to filter and deactivate the incoming active packets only, since
they are UDP packets that have a unique number as their destination port. The first
LKM, which activates the outgoing FTP packets, has to filter them correctly prior to
activation. For this reason, the active mode of FTP is chosen by the client. The active
mode should not be confused with the Active Secure FTP implemented in the Active
Network. When the client requests the active mode of FTP, the FTP server uses the
well-known port 20 to send the requested data, so the first LKM is able to filter and
activate the correct packets. If the passive mode were chosen, both the server and the
client would use ephemeral ports; therefore the activation would not be possible,
because the LKM cannot be aware of these ports.
In this Active Secure FTP implementation, not only the data are encrypted but also the
control messages exchanged between the server and the client. The well-known port
21 is used for these messages.
Both Active Routers perform encryption and decryption on the payload of the packets,
as well as routing. With the Active Secure FTP, the packets that travel within the
untrusted network are encrypted; therefore they are protected. This operation is
completely transparent to the end-to-end FTP application, because packets are
activated and deactivated by LKMs placed in the kernel-space of the FTP server and
the client.
The next diagram shows the time to download files of various sizes, measured for
three possible cases: (i) conventional (passive) FTP, (ii) Active Secure FTP using
encryption/decryption in hardware (FPGA) in AR1 and software encryption/decryption
in AR2, and (iii) Active Secure FTP using software encryption/decryption in both
Active Routers.
[Plot: download time against File Size (Kbytes), for normal FTP and the two Active Secure
FTP cases]
Figure 8.5: Time required to download Files of different Sizes using Passive FTP and Active
Secure FTP
As shown in Figure 8.5, the time to download different files is comparable to that of
normal FTP when using encryption/decryption implemented in software, but when
using the DES algorithm implemented in the FPGA, the time overhead increases. The
reasons behind this and methods for improving the performance of the hardware were
discussed in Chapter 7.
8.4 Summary 
This chapter has presented the implementation of an Active Secure FTP application.
This is performed by two Active Routers, placed at the edges of an untrusted network.
Data are well secured within the untrusted network because they are encrypted.
To implement this application, packets are activated and deactivated in the
kernel-space of the end-hosts. This operation is completely transparent to the FTP client
and server processes because it has been implemented in the kernel IP stack. Two new
layers, called ANL1 and ANL2, have been created in the network layer of the IP stack
(through the use of two LKMs). ANL1 activates the outgoing packets while ANL2
deactivates the incoming packets.
The performance evaluation of the Active Secure FTP shows that its performance, in
terms of the time required to download files of different sizes, is comparable to the
performance of the passive FTP when software encryption/decryption is performed in
the Active Routers. Its performance deteriorates when the encryption/decryption is
performed in hardware, for the reasons described in Chapter 7.
Chapter 9 
Conclusions and Future Work 
9. Conclusions and Future Work 
Active Networks are a different approach in the computer networks field. The idea
behind this approach is that users can inject their own code and program intermediate
Active Routers. This gives them the ability to tailor the network services to their
needs.
Conventional routers perform no computation on the payload data of the packets they
route. This makes them passive devices whose operations are limited to the
network layer. Also, passive routers are vertically integrated devices, and if a new
protocol were to be introduced, it could take many years to be standardised. Active
Routers overcome these limitations.
This thesis demonstrates an Active Router architecture, where programmable
hardware is used. Such an architecture should implement modules for safety and
resource management, since an Active Router is not limited to the network layer and
it hosts applications that are written by third party vendors or authorised users.
Several modules that comprise the software architecture of the router presented in this
thesis implement a robust environment, with resource management and fault detection
and isolation. An Active Application that misbehaves can be quickly detected and
isolated from the rest of the system. The Memory Monitor can detect applications that
violate the memory usage limit and penalise them. The CPU Monitor detects and
isolates applications that overuse the CPU. This is the case for the software part of the
applications. The Safety Process is a software watchdog process that monitors whether the
fundamental processes that comprise the software architecture of the router operate
normally. In every software project, there is always the danger of a bug in the
code. If the basic processes of the router crash for some reason, the router can
effectively recover and continue its operation. The loaded Active Applications
execute in their own address space and they are not affected.
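The detect-and-penalise behaviour of the Memory Monitor can be sketched as a small policy function. The sketch is an assumption made for illustration: the record layout, the strike counter and the rule that an application is isolated after a number of consecutive violations are hypothetical and are not taken from the router implementation.

```c
/* Outcome of one monitoring pass for one Active Application. */
enum verdict { APP_OK, APP_PENALISED, APP_ISOLATED };

/* Hypothetical per-application record kept by the monitor. */
struct app_record {
    unsigned long mem_kb;   /* memory currently used, in Kbytes */
    unsigned strikes;       /* consecutive passes over the limit */
};

/* Compare usage against the limit. An application over the limit receives
   a strike (the penalty); once it has max_strikes consecutive strikes it is
   reported for isolation. Dropping back under the limit clears the strikes. */
enum verdict check_memory(struct app_record *app, unsigned long limit_kb,
                          unsigned max_strikes)
{
    if (app->mem_kb <= limit_kb) {
        app->strikes = 0;
        return APP_OK;
    }
    app->strikes++;
    return app->strikes >= max_strikes ? APP_ISOLATED : APP_PENALISED;
}
```

A CPU monitor could follow the same pattern with a CPU-usage figure in place of `mem_kb`.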
Active Routers download the requested Active Applications from one of several Code
Servers. The communication between the servers and the routers is initially performed
using a protocol built over UDP. The application is then downloaded using a secure
connection that protects the integrity of the data. While the Data Phase (described in
Chapter 5) is secure enough, the Control Phase (implemented using the protocol built
over UDP) is not secure. As in passive networks, there is always the danger of attacks
such as the man-in-the-middle attack, where data can be captured and altered. An end
host could pretend to be a Code Server and provide a router with wrong information
or even a malicious application. The solution to this problem is the use of a strong
authentication algorithm that provides a safer environment. Using this algorithm, the
Code Servers and the Active Routers could authenticate themselves prior to
exchanging any other information. This thesis has not focused on security issues since
security is out of the scope of this work.
The Active Applications are assumed to be secure applications, which do not
intentionally try to compromise the Active Router. The software modules of the router
perform operations regarding safety issues. Even if an Active Application was
developed by a trusted third party, a bug in its code could make it crash. The safety
modules can successfully detect and isolate such an application. However, if, for example,
an application tries to format the hard disk of the router, it will succeed because there
is no protection against this kind of threat.
For the implementation of the Active Router, a PC-based router is used. This kind of
router lacks performance when compared to a conventional router. PC-based routers
have, though, several characteristics that make them ideal for experimentation in
Active Networks. The operating system they run can be based on open source
software, giving researchers the ability to modify or enhance several functions. In
this thesis, a Linux PC-based router was used, because Linux is open source software.
More importantly, it provides a powerful network stack implementation that can be used
to build the software architecture of the Active Router. Several default policies can be
changed, or new ones can be introduced, by modifying several operations of the
operating system that are related to the network stack. This thesis demonstrates such
policies and operations, such as packet redirection, packet activation and deactivation,
and traffic shaping.
Software routers just route packets from one port to another. Software (PC-based)
Active Routers not only route packets but also perform computations on the payload
data of the packets. For this reason, the processing complexity increases as the packet
length increases and as the complexity of the Active Applications increases. Active
Routers should be augmented with adequate processing power to cope with these
requirements.
In this work, an FPGA (Field Programmable Gate Array) device is used in order to
enhance the processing capability of the router. FPGAs can provide superior
functionality (in terms of performance) and accelerate the Active Applications by
many orders of magnitude. Another advantage of the FPGAs is that they are
reprogrammable, so Active Applications can change on the fly, as demonstrated in this
thesis. However, with regard to the switch between Active Applications (on the fly
execution), there is a limitation imposed by the nature of Linux. Linux is not a real
time operating system, so under heavy input traffic, priority is given to lower-level system
operations (such as the servicing of hardware interrupts). Active Applications cannot
operate normally under these conditions because insufficient CPU cycles are available
to them; therefore the switch becomes problematic. In this thesis, the switch between
different Active Applications was performed under a relatively low input packet rate,
and a switch could take place every 150 msec.
Another benefit of the FPGAs is that they provide a safer execution environment. A
malicious hardware application could be detected and isolated by a hardware
watchdog monitor. However, hardware does not provide a completely safe
environment because hardware-equivalent viruses can be created for hardware
applications as well. Misconfigured applications can physically destroy an FPGA
device. The implementation of a hardware watchdog monitor could detect and isolate
such applications.
The performance evaluation of the Active Router revealed several bottlenecks in its
architecture. Due to the characteristics of the operating system and the actual
implementation of the architecture of the Active Router, three major sources of CPU
cycle consumption exist. These are: hardware interrupts issued when packets are sent
or received, process context switches that take place when packets are sent from one
user-space process to another, and kernel-space/user-space crossings that take place
when packets cross this boundary. With regard to hardware interrupts, interrupt
mitigation could substantially improve the performance of the router. Several modern
network interface cards, as well as newer Linux kernel versions, perform a mixed
version of interruption and polling, trying to minimise the cost of servicing the
hardware interrupts.
This thesis described a method for minimising the number of process context
switches through the use of buffers. By buffering, CPU cycle usage is reduced,
leading to less packet loss. However, buffers increase the delay packets experience
within the router when the input packet rate is kept low. Under higher packet rates,
the delay as well as the delay jitter decrease. Active Routers will usually be placed
in busy networks, where high input packet rates are expected.
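The trade-off between fewer context switches and added delay can be made concrete with a simplified model. The sketch below is illustrative only: it assumes a constant packet arrival rate and a buffer that is handed to the next process only when it is full, which may differ from the flushing policy of the actual router; the function names are hypothetical.

```c
/* Hand-overs (and hence process context switches) needed to pass `packets`
   packets from one process to another when `buf` packets are sent at a time. */
unsigned long handovers(unsigned long packets, unsigned long buf)
{
    if (buf == 0)
        return 0;  /* invalid buffer size */
    return packets / buf + (packets % buf != 0);
}

/* Time in milliseconds that the first packet in a buffer waits for the
   buffer to fill, assuming a constant arrival rate in packets per second. */
double fill_wait_ms(unsigned long buf, double rate_pps)
{
    if (buf == 0 || rate_pps <= 0.0)
        return 0.0;
    return (double)(buf - 1) * 1000.0 / rate_pps;
}
```

Under this model a buffer of 10 packets cuts 1000 hand-overs to 100. The first packet waits 90 ms at 100 packets/s but under 1 ms at 10000 packets/s, which matches the observation that the delay penalty shrinks as the input rate grows.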
The hardware element of the router is implemented using a PCI-based FPGA board.
As stated before, FPGAs can accelerate Active Applications by many orders of
magnitude. However, the performance of an application that executes on hardware
depends not only on the characteristics of the FPGA and its surrounding hardware but
also on the actual implementation of the application. Hardware applications can be
developed using several methods such as memory mapping or DMA. DMA can
provide better performance because the processor is not involved when data flow
from a user-space process to the FPGA and vice-versa. The hardware application
presented in this thesis uses memory mapping to transfer data through the PCI bus.
This is one of the reasons its performance is worse than that of the software
application. Other factors that lower the performance of the hardware application are
the 32-bit width of the PCI bus and the clock generator located on the FPGA
board. However, the performance of hardware applications can be substantially
increased by using a 64-bit PCI bus, a PCI-X bus, DMA and a dual processor PC.
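The effect of bus width can be illustrated with the theoretical peak rate of a parallel bus, which is one transfer of the full bus width per clock cycle. The clock rates used in the note below (33, 66 and 133 MHz) are the nominal PCI and PCI-X figures and are assumptions for this illustration, not measurements from the router; sustained throughput is lower because of arbitration and protocol overhead.

```c
/* Theoretical peak transfer rate of a parallel bus in Mbytes/s:
   (width in bytes) x (clock in MHz), i.e. one transfer per cycle. */
double peak_mbytes_per_s(unsigned width_bits, double clock_mhz)
{
    return (width_bits / 8.0) * clock_mhz;
}
```

With these assumed clocks, a 32-bit bus at 33 MHz peaks at 132 Mbytes/s, a 64-bit bus at 66 MHz at 528 Mbytes/s, and a 64-bit PCI-X bus at 133 MHz at 1064 Mbytes/s.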
This thesis focuses on the implementation of the basic architecture of the Active
Router and not on the development of high performance Active Applications. The
development of such applications, in a future commercial Active Network, will be the
responsibility of third-party vendors. This thesis demonstrates how the performance
(in terms of CPU cycle usage) of a hardware DES application can be improved when
DMA and interrupt mitigation are used.
The last part of the thesis has demonstrated an example of the Active Network
approach, by implementing an Active Secure FTP application. The activation and
deactivation of the packets as well as the encryption and decryption operations are
completely transparent to the end applications. The performance of this application
(when the software DES was used) is comparable to that of conventional FTP. This
adheres to the end-to-end argument, which states that "functions should be placed in
the network only if they can be cost-effectively implemented there".
The architecture of the router could be enhanced in many ways. Several operations that its
basic modules perform can be moved to kernel-space, increasing the total
performance.
More FPGA devices can be used to increase the total performance of the router. The
current PCI-based board can support up to two FPGAs, but more boards can be placed
in the PCI architecture. If five boards were installed, for example, ten FPGAs would
be available, thus ten Active Applications could execute concurrently.
The processing power of the Active Router could be substantially increased if more
Active Engines were used. This could create a cluster of Active Engines that would be
able to support multiple Active Applications. The Routing and Forwarding Engine
could forward packets to the Active Engines according to their load and the Active
Applications they host. For example, if five Active Engines were used and each one
hosted five PCI-based boards, then fifty Active Applications could execute. The
number of the Active Applications could be further increased if switching between
several applications were performed (as the router presented in this thesis performs).
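The capacity figure and the load-based forwarding decision can be sketched as follows. This is an illustrative assumption rather than a design from this thesis: the `struct engine` record, the load metric and the least-loaded selection rule are hypothetical.

```c
#include <stddef.h>

/* Hypothetical view of one Active Engine as seen by the forwarder. */
struct engine {
    unsigned load;      /* current load, e.g. queued packets */
    const int *apps;    /* ids of the Active Applications it hosts */
    size_t n_apps;
};

/* Concurrent hardware applications, assuming one application per FPGA. */
unsigned capacity(unsigned engines, unsigned boards_per_engine,
                  unsigned fpgas_per_board)
{
    return engines * boards_per_engine * fpgas_per_board;
}

/* Returns 1 if engine e hosts the application app_id, 0 otherwise. */
static int hosts(const struct engine *e, int app_id)
{
    for (size_t i = 0; i < e->n_apps; i++)
        if (e->apps[i] == app_id)
            return 1;
    return 0;
}

/* Index of the least-loaded engine hosting app_id, or -1 if none does. */
int pick_engine(const struct engine *engines, size_t n, int app_id)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!hosts(&engines[i], app_id))
            continue;
        if (best < 0 || engines[i].load < engines[best].load)
            best = (int)i;
    }
    return best;
}
```

With two FPGAs per board, `capacity(5, 5, 2)` gives the fifty concurrent applications mentioned above.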
Some types of FPGA, such as the one used in this work, support partial
reconfiguration. This means that more than one Active Application could be placed
in the same FPGA at the same time. If partial reconfiguration were used, the operation
of the router would become more complicated, because a process would be needed to
check whether a new bitstream "fits" in the FPGA when another bitstream has already been
placed there. Partial reconfiguration was not used in this thesis because, although the
FPGA supports it, the software interface provided by the manufacturing company
does not.
The total performance of the Active Router does not depend only on the characteristics of
its architecture. The performance of the Active Applications it hosts also affects its
performance, since they become part of its architecture. Several Active Applications
could be tested, using memory mapping, DMA and interrupt mitigation, to
demonstrate the superiority of using FPGAs. This thesis has described how the
performance of the hardware could be substantially improved if DMA and interrupt
mitigation were used. Some predicted results were demonstrated, but an actual
implementation, demonstration and benchmarking of Active Applications that use
these methods could better highlight the advantages of the FPGAs.
This thesis has not focused on security issues regarding the architecture of the router.
For the implementation of the router, the C programming language has been used.
This language is very powerful but insecure as well, due to several characteristics
such as memory pointers. In Active Networks research, several secure languages have
been implemented. These languages cannot be used in this Active Router to
implement hardware Active Applications. The reason for this is that the software
interface (developed by the company which provides the FPGA) is written in C, so
any application that needs to communicate with the FPGA has to have modules
implemented in C. However, the security of the Active Applications could be
enhanced if Java programming were used. Java provides a restricted environment
where Active Applications could execute. Several policies can be defined by the
network administrator (using the Java Security Manager), so the Active Applications
(written in Java) would execute in a restricted environment. For example, the network
administrator could restrict the opening of network connections by the Active
Applications. With these limitations, it would be impossible for an Active Application
to open a connection. The current router implementation does not provide such a
limitation.
Another benefit of using Java is the Java Native Interface. Active Applications written
in Java could call C functions through this interface. The router could provide these C
functions and they would include all the operations an application needs to
communicate with the FPGA. Using this method, Active Applications would be able
to communicate with the FPGA and execute in a restricted environment.
Further work on this research could include the testing or development of several
different FPGA-based architectures, which could increase the performance of the
Active Router. An example of this could be the use of FPGA devices that host Active
Applications to act as network interface cards too. This would increase the
performance of the router since packets would not have to cross the kernel-user space
boundary and the PCI bus as often.
Publications based on parts of this thesis
1. A.G. Fragkiadakis, D.J. Parish, "Performance Evaluation of a PC-based Active
Router and Analysis of an Active Secure FTP Application", to appear in IEEE
International Symposium on Network Computing and Applications (NCA 05),
Cambridge, MA, USA, July 2005.
2. N.G. Bartzoudis, A.G. Fragkiadakis, D.J. Parish and J.L. Nuñez, "A System for
Fault Detection and Reconfiguration of Hardware Based Active Networks", in
Proceedings of 10th IEEE International On-Line Testing Symposium, Madeira
Island, Portugal, July 2004.
3. N.G. Bartzoudis, A.G. Fragkiadakis, D.J. Parish and J.L. Nuñez, "A Monitor
Module for Active Networks with Hardware Support", in Proceedings of IEE
System-on-Chip Design, Test and Technology (Postgraduate Seminar), Cardiff,
U.K., September 2003.
4. Fragkiadakis, A.G., Bartzoudis, N.G., Parish, D.J. and Sandford, J.M., "Hardware
support for Active Networking", SAM03 The 2003 International Conference on
Security and Management, Las Vegas, June 2003, pp 27-33, ISBN 1-932415-16-5.
5. Fragkiadakis, A.G., Bartzoudis, N.G., Parish, D.J. and Sandford, J.M., "Active
Networking using Programmable Hardware", PGNET2003 PostGraduate
Networking Conference, Liverpool, June 2003.
6. Bartzoudis, N.G., Fragkiadakis, A.G., Parish, D.J., Nunez, J.L. and Sandford,
J.M., "Reconfigurable Computing and Active Networks", ERSA03 Proceedings of
the International Conference on Engineering of Reconfigurable Systems and
Algorithms, Las Vegas, June 2003, pp 80-83, ISBN 1-932415-05-X.
Appendix A. des.c (DES Encryption/Decryption) 
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/ipc.h>
#include <netinet/in.h>   /* ntohs(), ntohl(), htons(), htonl() */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>       /* malloc(), exit() */
#include <string.h>       /* strcpy(), memcpy() */
#include <strings.h>      /* bzero() */
#include <unistd.h>       /* read(), close(), unlink() */
#include <linux/ip.h>
#include <linux/udp.h>
#include "main_func.h"
#include "act.h"
#include "../includeF/admxrc2.h"
#include "../includeF/common.h"

#define PATHNAME "/usr/alex/test/path3"
#define REINJECTP "/usr/alex/test/reinjectp"
#define ENCRYPTION 1
#define DECRYPTION 0

int main (void)
{
  ADMXRC2_STATUS status;
  ADMXRC2_HANDLE card;
  ADMXRC2_SPACE_INFO spInfo;
  ADMXRC2_CARDID cardID;
  ADMXRC2_IMAGE image;
  volatile uint32_t* fpgaSpace;
  char* filename;            /* filled in by strcpy() below, so not const */
  float clock_freq;
  u_long mod_id,*key,*data1,*data_out,bit_size;
  int div,r;
  int i,sockfd,msgsock,rval,offset,act_offset;
  u_char *p;
  u_short encrypt_flag,data_length,packet_size;
  struct iphdr *iph;
  struct udphdr *udph;
  struct acth *actpnt;
  struct sockaddr_un server;

  unlink(PATHNAME);
  filename=malloc(50);
  iph=(struct iphdr*)malloc(sizeof(struct iphdr));
  p=(u_char*)malloc(1550);   /* must hold the 1550 bytes requested by read() */
  /* Two 32-bit words each; the code assumes a 32-bit u_long */
  data1=malloc(2*sizeof(u_long));
  data_out=malloc(2*sizeof(u_long));
  key=malloc(2*sizeof(u_long));

  cardID=0;
  status= ADMXRC2_OpenCard(cardID, &card);
  if (status != ADMXRC2_SUCCESS) {
    printf("Failed to open card with ID %ld: %s\n",
           cardID, ADMXRC2_GetStatusString(status));
    exit(-1);
  }
  clock_freq=40000000;

  /* Get the address of FPGA space */
  status= ADMXRC2_GetSpaceInfo(card, 0, &spInfo);
  if (status != ADMXRC2_SUCCESS) {
    printf("Failed to get space 0 info: %s\n",
           ADMXRC2_GetStatusString(status));
    return -1;
  }
  fpgaSpace = (volatile uint32_t*) spInfo.VirtualBase;
  strcpy(filename, "/usr/src/FPGA_bit/3.bit");

  // status= ADMXRC2_ConfigureFromFile(card, filename);
  // if (status != ADMXRC2_SUCCESS) {
  //   printf ("Failed to load the bitstream '%s': %s\n",
  //           filename, ADMXRC2_GetStatusString(status));
  //   return -1;
  // }

  /* Load the bitstream into host memory, then configure the FPGA by DMA */
  status=ADMXRC2_LoadBitstream(card,filename,&image,&bit_size);
  if (status != ADMXRC2_SUCCESS) {
    printf ("Failed to load the bitstream to memory '%s': %s\n",
            filename, ADMXRC2_GetStatusString(status));
    return -1;
  }
  status=ADMXRC2_ConfigureFromBufferDMA(card,image,bit_size,
                                        ADMXRC2_DMACHAN_ANY,NULL);
  if (status != ADMXRC2_SUCCESS) {
    printf ("Failed to load the bitstream to memory '%s': %s\n",
            filename, ADMXRC2_GetStatusString(status));
    return -1;
  }

  if ((sockfd=socket(AF_LOCAL,SOCK_STREAM,0))<0)
  {
    perror("opening socket");
    exit(1);
  }
  server.sun_family=AF_LOCAL;
  strcpy(server.sun_path,PATHNAME);
  if (bind(sockfd,(struct sockaddr*)&server,sizeof(struct sockaddr_un))<0)
  {
    perror("error binding socket");
    exit(1);
  }
  listen(sockfd,5);

  while (1) {
    if ((msgsock=accept(sockfd,(struct sockaddr*)NULL,NULL))==-1)
    {
      perror("error connecting socket");
      exit(1);
    }
    if ((rval=read(msgsock,p,1550))<0)
      perror("error reading on socket");
    if (rval!=0)
    {
      close(msgsock);
      if (rval<20)   /* shorter than an IP header: a control message */
      {
        if (*p==1)
        {
          ADMXRC2_CloseCard(card);
        }
        else
        {
          cardID=0;
          status= ADMXRC2_OpenCard(cardID, &card);
          if (status != ADMXRC2_SUCCESS) {
            printf("Failed to open card with ID %ld: %s\n",
                   cardID, ADMXRC2_GetStatusString(status));
            exit(-1);
          }
          /* Get the address of FPGA space */
          status= ADMXRC2_GetSpaceInfo(card, 0, &spInfo);
          if (status != ADMXRC2_SUCCESS) {
            printf("Failed to get space 0 info: %s\n",
                   ADMXRC2_GetStatusString(status));
            return -1;
          }
          fpgaSpace = (volatile uint32_t*) spInfo.VirtualBase;
          status=ADMXRC2_ConfigureFromBufferDMA(card,image,bit_size,
                                                ADMXRC2_DMACHAN_ANY,NULL);
          if (status != ADMXRC2_SUCCESS) {
            printf ("Failed to load the bitstream to memory '%s': %s\n",
                    filename, ADMXRC2_GetStatusString(status));
            return -1;
          }
        }
        goto Line2;
      }

      iph=(struct iphdr*)p;
      packet_size=ntohs(iph->tot_len);
      /* Set before the first jump to Line3, which dereferences actpnt */
      actpnt=(struct acth*)(p+IP_H+UDP_H);
      if (packet_size==40) goto Line3;
      bzero(data1,2*sizeof(u_long));
      bzero(data_out,2*sizeof(u_long));
      mod_id=ntohl(actpnt->id);
      act_offset=IP_H+UDP_H+ACT_H;
      if (actpnt->type==0) offset=act_offset;
      else if (actpnt->type==1) offset=act_offset+IP_H+TCP_H;
      else if (actpnt->type==2) offset=act_offset+IP_H+UDP_H;
      else goto Line3;
      data_length=packet_size-offset;
      /*-----------------------------------------------------------------*/
      *(key+1)=0x9bbcdff1; /* Low word */
      *key=0x13345779;     /* High word */
      encrypt_flag=ntohs(actpnt->seq_no); /* '1' for encryption, '0' for decryption */
      div=data_length/8;      /* number of complete 8-byte blocks */
      r=data_length-8*div;    /* bytes left over after the last complete block */

      if (div==0) /* if data length is less than 8 bytes long */
      {
        if (r<=4) /* if data length is less or equal to 4 bytes */
        {
          memcpy((u_long*)(data1+1),(u_char*)(p+offset),r);
          bzero(data1,sizeof(u_long));
        }
        else { /* if data length is more than 4 bytes and less than 8 bytes long */
          memcpy((u_long*)data1,(u_char*)(p+offset),4);
          memcpy((u_long*)(data1+1),(u_char*)(p+offset+4),r-4);
        }
        goto Line4;
      }

      for (i=0;i<div;i++)
      {
Back:
        bzero(data1,2*sizeof(u_long));
        bzero(data_out,2*sizeof(u_long));
        if (div!=0)
        {
          memcpy((u_long*)data1,(u_char*)(p+offset),4);
          memcpy((u_long*)(data1+1),(u_char*)(p+offset+4),4);
          goto Line4;
        }
        else
        {
          if (r<=4)
          {
            memcpy((u_long*)(data1+1),(u_char*)(p+offset),r);
            bzero(data1,sizeof(u_long));
          }
          else
          {
            memcpy((u_long*)data1,(u_char*)(p+offset),4);
            memcpy((u_long*)(data1+1),(u_char*)(p+offset+4),r-4);
          }
          goto Line4;
        }
Line4:
        *data1=ntohl(*data1);
        *(data1+1)=ntohl(*(data1+1));
        fpgaSpace[0]=data1[1];                /* 1st PCI bus write */
        fpgaSpace[1]=data1[0];                /* 2nd write */
        fpgaSpace[2]=key[1];                  /* 3rd write */
        fpgaSpace[3]=key[0];                  /* 4th write */
        fpgaSpace[4]=(uint16_t)encrypt_flag;  /* 5th write */
        /* wait for interrupt */
        status=ADMXRC2_WaitForInterrupt(card,NULL,0,NULL);
        if (status!=ADMXRC2_SUCCESS)
        {
          printf("Process failed to wait for interrupt\n");
          exit(1);
        }
        *data_out=fpgaSpace[0];     /* 1st PCI read */
        *(data_out+1)=fpgaSpace[1]; /* 2nd PCI read */
        *(data_out+1)=htonl(*(data_out+1));
        *data_out=htonl(*data_out);

        if (div>1) /* if it is not the last 8-byte block */
        {
          memcpy((u_char*)(p+offset+4),(u_long*)data_out,4);
          memcpy((u_char*)(p+offset),(u_long*)(data_out+1),4);
          div--;
          offset+=8;
          if ((div!=0) || (r!=0)) goto Back;
        }
        if ((div==1) && (encrypt_flag==ENCRYPTION))
        {
          memcpy((u_char*)(p+offset+4),(u_long*)data_out,4);
          memcpy((u_char*)(p+offset),(u_long*)(data_out+1),4);
          div--;
          offset+=8;
          if (r!=0) goto Back;
          actpnt->seq_no=DECRYPTION;
          actpnt->param_size=8;
          goto Line3;
        }
        if ((div==1) && (encrypt_flag==DECRYPTION))
        {
          udph=(struct udphdr*)(p+IP_H);
          r=actpnt->param_size;
          if (r<=4)
          {
            memcpy((u_char*)(p+offset),(u_long*)data_out,4);
            memcpy((u_char*)(p+offset+4),(u_long*)(data_out+1),4);
          }
          else
          {
            memcpy((u_char*)(p+offset+4),(u_long*)data_out,4);
            memcpy((u_char*)(p+offset),(u_long*)(data_out+1),4);
          }
          /* Strip the padding that was added on encryption */
          packet_size-=8-r;
          iph->tot_len=htons(packet_size);
          data_length=ntohs(udph->len);
          data_length-=8-r;
          udph->len=htons(data_length);
          actpnt->seq_no=htons(ENCRYPTION);
          goto Line3;
        }
        if ((div==0) && (encrypt_flag==ENCRYPTION))
        {
          udph=(struct udphdr*)(p+IP_H);
          actpnt->param_size=r;
          memcpy((u_char*)(p+offset+4),(u_long*)data_out,4);
          memcpy((u_char*)(p+offset),(u_long*)(data_out+1),4);
          /* The last block was padded to 8 bytes, so the packet grows */
          packet_size+=8-r;
          iph->tot_len=htons(packet_size);
          data_length=ntohs(udph->len);
          data_length+=8-r;
          udph->len=htons(data_length);
          actpnt->seq_no=DECRYPTION;
          goto Line3;
        }
      } /* for */
Line3:
      actpnt->id=htonl(4);
      open_socket(REINJECTP,p,packet_size); /* re-inject packet to the network */
Line2: ;
    } /* if (rval!=0) */
  } /* while */
  printf("Error!!!\n");
  close(sockfd);
}
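The block arithmetic used by des.c (div, r and the 8-r adjustment of the packet size) can be isolated as a small sketch. The struct and function names below are hypothetical and are not part of des.c; the sketch only mirrors the integer arithmetic of the listing under the assumption that a trailing partial block is always padded up to 8 bytes.

```c
/* Hypothetical helper, not part of des.c: mirrors its block arithmetic. */
struct des_layout {
    unsigned blocks;      /* complete 8-byte blocks ("div" in des.c)      */
    unsigned rem;         /* left-over bytes in the last block ("r")      */
    unsigned padded_len;  /* payload length once the last block is padded */
};

struct des_layout des_layout_for(unsigned data_length)
{
    struct des_layout l;
    l.blocks = data_length / 8;
    l.rem = data_length - 8 * l.blocks;
    /* des.c grows the packet by 8-r only when a partial block exists */
    l.padded_len = l.rem ? data_length + (8 - l.rem) : data_length;
    return l;
}
```

For a 20-byte payload this gives two complete blocks, four left-over bytes and a padded length of 24, which corresponds to the packet_size+=8-r adjustment on encryption and the matching packet_size-=8-r on decryption.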
Appendix B. des.vhd (DES Encryption/Decryption) 
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use work.des_lib.des_kp;
--library freedes;
--use freedes.des_lib.all;
-- Uncomment the following lines to use the declarations that are
-- provided for instantiating Xilinx primitive components.
--library UNISIM;
--use UNISIM.VComponents.all;

entity encrypt_test is
  Port ( lclk      : in std_logic;
         lreseto_l : in std_logic;
         lwrite    : in std_logic;
         lads_l    : in std_logic;
         lblast_l  : in std_logic;
         ld        : inout std_logic_vector(31 downto 0);
         la        : in std_logic_vector(23 downto 2);
         lbe_l     : in std_logic_vector(3 downto 0);
         fholda    : in std_logic;
         lbterm_l  : out std_logic;
         lreadyi_l : out std_logic;
         finti_l   : out std_logic);
end encrypt_test;

architecture RTL of encrypt_test is

  signal rst: std_logic;
  signal ads_i: std_logic;
  signal qlads: std_logic;
  signal write_i: std_logic;
  signal blast_i: std_logic;
  signal lreadyi_oe: std_logic;
  signal lbterm_o: std_logic;
  signal lbterm_oe: std_logic;
  signal ds_xfer: std_logic;
  signal ds_decode: std_logic;
  signal ld_o: std_logic_vector(31 downto 0);
  signal ld_i: std_logic_vector(31 downto 0);
  signal ld_oe: std_logic;
  signal la_i: std_logic_vector(23 downto 2);
  signal be_i: std_logic_vector(3 downto 0);
  signal ld_out: std_logic_vector(31 downto 0);
  signal lreadyi_o: std_logic;
  signal la_q: std_logic_vector(23 downto 2);
  signal write_q: std_logic;
  signal logic0: std_logic;
  signal logic1: std_logic;
  signal reg0: std_logic_vector(31 downto 0);
  signal reg1: std_logic_vector(31 downto 0);
  signal reg2: std_logic_vector(31 downto 0);
  signal reg3: std_logic_vector(31 downto 0);
  signal datain : std_logic_vector(63 downto 0);
  signal valid_datain : std_logic;
  signal buffer1 : std_logic_vector(63 downto 0);
  signal key_perm : std_logic_vector(55 downto 0);
  signal keyin : std_logic_vector(63 downto 0);
  signal dataout: std_logic_vector(63 downto 0);
  signal stall1: std_logic :='0';
  signal key_outp: std_logic_vector(55 downto 0);
  signal valid_dataout: std_logic;
  signal encrypt1 : std_logic;
  signal en_l : std_logic;
  signal finti : std_logic;

  component plxdssm
    port(
      clk: in std_logic;
      rst: in std_logic;
      sr: in std_logic;
      qlads: in std_logic;
      lblast: in std_logic;
      lwrite: in std_logic;
      ld_oe: out std_logic;
      lreadyi: out std_logic;
      lreadyi_oe: out std_logic;
      lbterm: out std_logic;
      lbterm_oe: out std_logic;
      transfer: out std_logic;
      decode: out std_logic;
      ready: in std_logic;
      stop: in std_logic);
  end component;

  component des_fast
    port (clk        :in std_logic;
          reset      :in std_logic;
          stall      :in std_logic;
          encrypt    :in std_logic;
          key_in     :in std_logic_vector (55 downto 0);
          din        :in std_logic_vector (63 downto 0);
          din_valid  :in std_logic;
          dout       :out std_logic_vector (63 downto 0);
          dout_valid :out std_logic;
          key_out    :out std_logic_vector (55 downto 0));
  end component;

begin

  logic0 <= '0';
  logic1 <= '1';

  -- Convert the inputs to active high.
  rst <= not lreseto_l;
  blast_i <= not lblast_l;
  ads_i <= not lads_l;
  write_i <= lwrite;
  la_i <= la;
  ld_i <= ld;
  be_i <= not lbe_l;

  -- Decode the address of the FPGA which is the space when LA[23] is 0
  qlads <= ads_i and not la_i(23) and not fholda;

  -- Latch local bus address and 'write' on LADS# pulse
  latch_addr : process(rst, lclk)
  begin
    if rst = '1' then
      la_q <= (others=> 'Z');
      write_q <= '0';
    elsif lclk'event and lclk = '1' then
      if ads_i = '1' then
        la_q <= la_i;
        write_q <= write_i;
      end if;
    end if;
  end process latch_addr;

  -- BTERM should only be driven when the fpga is addressed otherwise
  -- float, because the control logic on the XRC might also drive it.
  update_bterm : process(lbterm_o, lbterm_oe)
  begin
    if lbterm_oe = '1' then
      lbterm_l <= not lbterm_o;
    else
      lbterm_l <= 'Z';
    end if;
  end process update_bterm;

  -- LREADYI# should only be driven when the fpga is addressed, otherwise
  -- float because the control logic on the XRC might also drive it.
  update_ready : process(lreadyi_o, lreadyi_oe)
  begin
    if lreadyi_oe = '1' then
      lreadyi_l <= not lreadyi_o;
    else
      lreadyi_l <= 'Z';
    end if;
  end process update_ready;

  -- Drive the local data bus on a read.
  data_bus : process(ld_oe,ld_o)
  begin
    if ld_oe = '1' then
      ld <= ld_o;
    else
      ld <= (others=> 'Z');
    end if;
  end process data_bus;

  -- If the current cycle is a write, update the registers
  update_reg : process(lclk, rst)
  begin
    if rst = '1' then
      reg0 <= (others => '0');
      reg1 <= (others => '0');
      reg2 <= (others => '0');
      reg3 <= (others => '0');
      valid_datain<='0';
    elsif lclk'event and lclk = '1' then
      if ds_xfer = '1' and write_i = '1' then --It's a PCI write
        --Check the PCI Address (bits 2,3,4) and load the registers
        if la_q(4)='0' and la_q(3)='0' and la_q(2)='0' then
          if be_i(0) = '1' then
            reg0(7 downto 0) <= ld_i(7 downto 0);
          end if;
          if be_i(1) = '1' then
            reg0(15 downto 8) <= ld_i(15 downto 8);
          end if;
          if be_i(2) = '1' then
            reg0(23 downto 16) <= ld_i(23 downto 16);
          end if;
          if be_i(3) = '1' then
            reg0(31 downto 24) <= ld_i(31 downto 24);
          end if;
        end if;
        if la_q(4)='0' and la_q(3)='0' and la_q(2)='1' then
          if be_i(0) = '1' then
            reg1(7 downto 0) <= ld_i(7 downto 0);
          end if;
          if be_i(1) = '1' then
            reg1(15 downto 8) <= ld_i(15 downto 8);
          end if;
          if be_i(2) = '1' then
            reg1(23 downto 16) <= ld_i(23 downto 16);
          end if;
          if be_i(3) = '1' then
            reg1(31 downto 24) <= ld_i(31 downto 24);
          end if;
        end if;
        if la_q(4)='0' and la_q(3)='1' and la_q(2)='0' then
          if be_i(0) = '1' then
            reg2(7 downto 0) <= ld_i(7 downto 0);
          end if;
          if be_i(1) = '1' then
            reg2(15 downto 8) <= ld_i(15 downto 8);
          end if;
          if be_i(2) = '1' then
            reg2(23 downto 16) <= ld_i(23 downto 16);
          end if;
          if be_i(3) = '1' then
            reg2(31 downto 24) <= ld_i(31 downto 24);
          end if;
        end if;
        if la_q(4)='0' and la_q(3)='1' and la_q(2)='1' then
          if be_i(0) = '1' then
            reg3(7 downto 0) <= ld_i(7 downto 0);
          end if;
          if be_i(1) = '1' then
            reg3(15 downto 8) <= ld_i(15 downto 8);
          end if;
          if be_i(2) = '1' then
            reg3(23 downto 16) <= ld_i(23 downto 16);
          end if;
          if be_i(3) = '1' then
            reg3(31 downto 24) <= ld_i(31 downto 24);
          end if;
        end if;
        if la_q(4)='1' and la_q(3)='0' and la_q(2)='0' then
          if be_i(0) = '1' then
            encrypt1<=ld_i(0);
          end if;
        end if;
        valid_datain<=la_q(4);
      end if;
    end if;
  end process update_reg;

  -- Assemble the 64-bit data word and key from the four 32-bit registers
  fill_registers : process(reg0,reg1,reg2,reg3)
  begin
    datain(63 downto 32)<=reg1(31 downto 0);
    datain(31 downto 0)<=reg0(31 downto 0);
    keyin(31 downto 0)<=reg2(31 downto 0);
    keyin(63 downto 32)<=reg3(31 downto 0);
  end process fill_registers;

  -- Reduce the 64-bit key to the 56-bit key expected by des_fast
  enable_encryption : process(valid_datain,datain,keyin)
  begin
    if valid_datain='1' then
      key_perm(55)<=keyin(7);  key_perm(54)<=keyin(15); key_perm(53)<=keyin(23);
      key_perm(52)<=keyin(31); key_perm(51)<=keyin(39); key_perm(50)<=keyin(47);
      key_perm(49)<=keyin(55); key_perm(48)<=keyin(63);
      key_perm(47)<=keyin(6);  key_perm(46)<=keyin(14); key_perm(45)<=keyin(22);
      key_perm(44)<=keyin(30); key_perm(43)<=keyin(38); key_perm(42)<=keyin(46);
      key_perm(41)<=keyin(54); key_perm(40)<=keyin(62);
      key_perm(39)<=keyin(5);  key_perm(38)<=keyin(13); key_perm(37)<=keyin(21);
      key_perm(36)<=keyin(29); key_perm(35)<=keyin(37); key_perm(34)<=keyin(45);
      key_perm(33)<=keyin(53); key_perm(32)<=keyin(61);
      key_perm(31)<=keyin(4);  key_perm(30)<=keyin(12); key_perm(29)<=keyin(20);
      key_perm(28)<=keyin(28);
      key_perm(27)<=keyin(1);  key_perm(26)<=keyin(9);  key_perm(25)<=keyin(17);
      key_perm(24)<=keyin(25); key_perm(23)<=keyin(33); key_perm(22)<=keyin(41);
      key_perm(21)<=keyin(49); key_perm(20)<=keyin(57);
      key_perm(19)<=keyin(2);  key_perm(18)<=keyin(10); key_perm(17)<=keyin(18);
      key_perm(16)<=keyin(26); key_perm(15)<=keyin(34); key_perm(14)<=keyin(42);
      key_perm(13)<=keyin(50); key_perm(12)<=keyin(58);
      key_perm(11)<=keyin(3);  key_perm(10)<=keyin(11); key_perm(9)<=keyin(19);
      key_perm(8)<=keyin(27);  key_perm(7)<=keyin(35);  key_perm(6)<=keyin(43);
      key_perm(5)<=keyin(51);  key_perm(4)<=keyin(59);
      key_perm(3)<=keyin(36);  key_perm(2)<=keyin(44);  key_perm(1)<=keyin(52);
      key_perm(0)<=keyin(60);
      buffer1(63 downto 0)<=datain(63 downto 0);
    end if;
  end process enable_encryption;

  -- On a read, LA[2] selects the low or high word of the result
  generate_flag: process(lclk,ds_decode,write_q)
  begin
    if lclk'event and lclk='1' then
      if ds_decode='1' and write_q='0' then
        en_l<= la_q(2);
      else
        en_l<='Z';
      end if;
    end if;
  end process generate_flag;

  -- generate 'finti_l' FPGA interrupt
  gen_finti: process(lclk,rst)
  begin
    if rst='1' then
      finti<='0';
    elsif (lclk'event and lclk='1') then
      --It's a PCI read
      if valid_dataout='0' or valid_dataout='1' then
        finti<=valid_dataout;
      end if;
    end if;
  end process gen_finti;

  finti_l<= not finti; --Send an interrupt to the host-process

  get_valid_data : process (valid_dataout,dataout,en_l)
  begin
    if valid_dataout='1' and en_l='0' then
      ld_o(31 downto 0)<=dataout(31 downto 0);
    elsif valid_dataout='1' and en_l='1' then
      ld_o(31 downto 0)<=dataout(63 downto 32);
    else
      ld_o<=(others=>'Z');
    end if;
  end process get_valid_data;

  encrypt_module: des_fast port map (
    clk        => lclk,
    reset      => rst,
    stall      => stall1,
    encrypt    => encrypt1,
    key_in     => key_perm,
    din        => buffer1,
    din_valid  => valid_datain,
    dout       => dataout,
    dout_valid => valid_dataout,
    key_out    => key_outp);

  dssm : plxdssm
    port map (
      clk        => lclk,
      rst        => rst,
      sr         => logic0,
      qlads      => qlads,
      lblast     => blast_i,
      lwrite     => lwrite,
      ld_oe      => ld_oe,
      lreadyi    => lreadyi_o,
      lreadyi_oe => lreadyi_oe,
      lbterm     => lbterm_o,
      lbterm_oe  => lbterm_oe,
      transfer   => ds_xfer,
      decode     => ds_decode,
      ready      => logic1,
      stop       => logic1);

end RTL;
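The key_perm assignments in the enable_encryption process correspond to the standard DES Permuted Choice 1 (PC-1) table, which drops the eight parity bits and reorders the remaining 56. As a cross-check, the same mapping can be sketched in C. The function name des_pc1 and the table name are hypothetical and do not appear in the thesis code; bit 1 is taken to be the most significant bit of the 64-bit key, as in the DES specification, so key bit n corresponds to keyin(64-n).

```c
#include <stdint.h>

/* Standard DES PC-1 table; entries are 1-based bit positions, MSB = 1. */
static const unsigned char DES_PC1[56] = {
    57, 49, 41, 33, 25, 17,  9,
     1, 58, 50, 42, 34, 26, 18,
    10,  2, 59, 51, 43, 35, 27,
    19, 11,  3, 60, 52, 44, 36,
    63, 55, 47, 39, 31, 23, 15,
     7, 62, 54, 46, 38, 30, 22,
    14,  6, 61, 53, 45, 37, 29,
    21, 13,  5, 28, 20, 12,  4
};

/* Hypothetical software model of the key_perm wiring: returns the
   56-bit permuted key in the low 56 bits of the result. */
uint64_t des_pc1(uint64_t key)
{
    uint64_t out = 0;
    for (int i = 0; i < 56; i++) {
        /* key bit DES_PC1[i] (1 = MSB) is keyin(64 - DES_PC1[i]) */
        uint64_t bit = (key >> (64 - DES_PC1[i])) & 1u;
        out = (out << 1) | bit;   /* first entry ends up in bit 55 */
    }
    return out;
}
```

With the key written by des.c (high word 0x13345779, low word 0x9bbcdff1), a call such as des_pc1(0x133457799BBCDFF1ULL) models the value presented on key_perm.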
Appendix C. Local PCI Bus Signals [PLXWWW] 
Symbol     Signal Name              Description
LA[23:2]   Address Bus              Carries the physical address bus
LD[31:0]   Data Bus                 Carries 32 bit data
LWRITE     Write/Read               Asserted low for reads and high for writes
LBLASTL    Burst Last               Signal driven by current Local Bus Master to
                                    indicate last transfer in a bus access
LADSL      Address Strobe           Indicates a valid address and start of a new bus
                                    access. Asserted for first clock of a bus access.
LDACK      DMA Acknowledge Outputs  When a channel is programmed through the
                                    Configuration registers to operate in Demand
                                    mode, its LDACK output indicates a DMA transfer
                                    is being executed
LBE        Byte Enables             For a 32-bit bus, the four byte enables indicate
                                    which of the four bytes are active during a
                                    Data cycle
FHOLD      Hold Request             Asserted to request use of Local Bus
FHOLDA     Hold Acknowledge         Asserted by Local Bus when control is granted in
                                    response to FHOLD
LRESETOL   Local Bus Reset Out      Asserted when the PCI 9080 chip is reset
LCLKA      Local Processor Clock    Local clock input
LBTERML    Burst Terminate          For processors that burst up to four Lwords
LREADYIL   Ready In                 When the PCI 9080 is a Bus Master, indicates that
                                    Read Data on bus is valid or that a write data
                                    transfer is complete
LEOT       End of transfer          Terminates current DMA channel
LINTIL     Local Interrupt In       When asserted low, causes a PCI interrupt
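The effect of the LBE byte enables on a 32-bit register write can be illustrated with a short C model. This is a sketch under stated assumptions: the function name apply_byte_enables is hypothetical, and the mask is taken as active high, that is, after the inversion that des.vhd applies to lbe_l to obtain be_i. Bit 0 of the mask covers data bits 7..0 and bit 3 covers bits 31..24, matching the per-byte tests in the update_reg process.

```c
#include <stdint.h>

/* Hypothetical model of one register update: only the byte lanes whose
   enable bit is set take the new data; the others keep the old value. */
uint32_t apply_byte_enables(uint32_t reg, uint32_t data, unsigned be)
{
    uint32_t mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (be & (1u << lane))
            mask |= (uint32_t)0xFFu << (8 * lane);
    }
    return (reg & ~mask) | (data & mask);
}
```

A full 32-bit write, as issued by des.c, corresponds to a mask of 0xF, in which case the register simply takes the value on the data bus.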
Appendix D. Testbench File used for the Simulation of the des.vhd
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity TB_ENCRYPT is
end TB_ENCRYPT;

architecture TEST of TB_ENCRYPT is

  component encrypt_test
    Port ( lclk : in std_logic;
           lreseto_l : in std_logic;
           lwrite : in std_logic;
           lads_l : in std_logic;
           lblast_l : in std_logic;
           ld : in std_logic_vector(31 downto 0);
           la : in std_logic_vector(23 downto 2);
           lbe_l : in std_logic_vector(3 downto 0);
           fholda : in std_logic;
           --ld_o : out std_logic_vector(31 downto 0));
           ds_decode: in std_logic;
           -- dataout : out std_logic_vector(63 downto 0);
           --ld_o: out std_logic_vector(31 downto 0));
           --datalow: out std_logic_vector(31 downto 0);
           -- datahigh: out std_logic_vector(31 downto 0));
           --oe_dataout: out std_logic);
           -- lbterm_l: out std_logic;
           -- lreadyi_l : out std_logic;
           --valid_dataout: out std_logic);
           finti_l : out std_logic);
           --en_l:out std_logic);
           -- reg0: out std_logic_vector(31 downto 0);
           -- reg1: out std_logic_vector(31 downto 0);
           -- reg2: out std_logic_vector(31 downto 0);
           -- reg3: out std_logic_vector(31 downto 0);
           -- buffer1 : out std_logic_vector(63 downto 0);
           -- key: out std_logic_vector(55 downto 0));
           -- valid_datain : out std_logic);
           -- oe_key: out std_logic);
           -- datain : out std_logic_vector(63 downto 0);
           -- keyin : out std_logic_vector(63 downto 0);
           -- encrypt1 : out std_logic);
           -- key_inp : out std_logic_vector(55 downto 0));
           --buffer1 : out std_logic_vector(63 downto 0));
  end component;

  signal clock : std_logic;
  signal reset : std_logic;
  signal write : std_logic;
  signal ads : std_logic;
  signal blast : std_logic;
  signal ld_l : std_logic_vector(31 downto 0);
  signal la_l : std_logic_vector(23 downto 2);
  signal lbe : std_logic_vector(3 downto 0);
  signal fholda_l : std_logic;
  signal decode : std_logic;
  --signal data_o : std_logic_vector(31 downto 0);
  --signal burst : std_logic;
  --signal ready : std_logic;
  --signal ld_out : std_logic_vector(31 downto 0);
  --signal qlads : std_logic;
  --signal enable_o : std_logic;
  signal rega: std_logic_vector(31 downto 0);
  signal regb: std_logic_vector(31 downto 0);
  signal regc: std_logic_vector(31 downto 0);
  signal regd: std_logic_vector(31 downto 0);
  ----signal enable : std_logic;
  --signal valid_out : std_logic;
  --signal valid_in : std_logic;
  --signal data_in : std_logic_vector(63 downto 0);
  --signal keyinp : std_logic_vector(63 downto 0);
  --signal encrypt_flag : std_logic;
  --signal key1: std_logic_vector(55 downto 0);
  --signal buffer2 : std_logic_vector(63 downto 0);
  signal finti : std_logic;

  constant period : time:=100 ns;

begin

  stimulus: process
  begin
    reset<= '0',
            '1' after period/2;

    fholda_l<= '0',
               '1' after 12*period,
               '0' after 30.6*period,
               '1' after 55*period,
               '0' after 80*period,
               '1' after 107*period,
               '0' after 130*period,
               '1' after 158*period,
               '0' after 180*period,
               '1' after 200*period;

    blast<= '1',
            '0' after 28.89*period,
            '1' after 30.37*period,
            '0' after 73.73*period,
            '1' after 75.21*period,
            '0' after 124.03*period,
            '1' after 125.51*period,
            '0' after 174.75*period,
            '1' after 176.08*period;

    write<= '0',
            '1' after period/2,
            '0' after 30.6*period,
            '1' after 50*period,
            '0' after 80*period,
            '1' after 100*period,
            '0' after 130*period,
            '1' after 150*period,
            '0' after 180*period,
            '1' after 190*period,
            '0' after 220*period,
            '1' after 340*period,
            '0' after 370*period,
            '1' after 390*period,
            '0' after 420*period,
            '1' after 440*period,
            '0' after 470*period,
            '1' after 490*period,
            '0' after 520*period,
            '1' after 540*period,
            '0' after 570*period;

    decode<='0',
            '1' after 210*period,
            '0' after 300*period,
            '1' after 560*period,
            '0' after 680*period;

    -- The target signal of the following waveform is not recoverable
    -- from the printed listing, so it is kept here as a comment.
    --      '1' after 231*period,
    --      '0' after 261*period,
    --      '1' after 281*period,
    --      '0' after 311*period,
    --      '1' after 331*period,
    --      '0' after 361*period,
    --      '1' after 381*period,
    --      '0' after 411*period,
    --      '1' after 431*period,
    --      '0' after 461*period;

    -- LA[4:2] selects the register: 000..011 = reg0..reg3, 100 = encrypt flag
    la_l<= "0001010101101010100000" after 15*period,
           "0010101010101010100001" after 60*period,
           "0101010101010101001010" after 110*period,
           "0101010101010101001011" after 161*period,
           "0000010101010101010100" after 200*period,
           "0010101010101010100000" after 241*period, --1st read
           "0010101010101010101001" after 291*period, --2nd read
           "0101010101010101001000" after 350*period,
           "0101010101010101001001" after 400*period,
           "0000010101010101010010" after 450*period,
           "0101010101010101001011" after 500*period,
           "0000010101010101010100" after 550*period,
           "0010101010101010100000" after 590*period, --1st read
           "0010101010101010101001" after 640*period; --2nd read

    ld_l<="10001001101010111100110111101111" after 16*period,
          "00000001001000110100010101100111" after 61*period,
          "10011011101111001101111111110001" after 111*period,
          "00010011001101000101011101111001" after 162*period,
          "11111111111111110000000000000001" after 201*period,
          "11111111111111110000000000000001" after 242*period,
          "01001110000000001111111111111111" after 292*period,
          "11000001111111110000000000000000" after 351*period,
          "11111110000001111111111111111111" after 401*period,
          "11111111001111110000001000000001" after 451*period,
          "11111110000001111111111111111111" after 501*period,
          "11111111001111110000001000000001" after 551*period,
          "11111111111111110000000000000001" after 591*period,
          "01001110000000001111111111111111" after 641*period;

    ads<= '1',
          '0' after 15*period,
          '1' after 16*period,
          '0' after 60*period,
          '1' after 61*period,
          '0' after 110*period,
          '1' after 111*period,
          '0' after 161*period,
          '1' after 162*period,
          '0' after 200*period,
          '1' after 201*period,
          '0' after 241*period,
          '1' after 242*period,
          '0' after 291*period,
          '1' after 292*period,
          '0' after 350*period,
          '1' after 351*period,
          '0' after 400*period,
          '1' after 401*period,
          '0' after 450*period,
          '1' after 451*period,
          '0' after 500*period,
          '1' after 501*period,
          '0' after 551*period,
          '1' after 552*period,
          '0' after 591*period,
          '1' after 592*period,
          '0' after 641*period,
          '1' after 642*period;

    lbe<= "0000" after 15*period;
    wait;
  end process stimulus;

  DUT: encrypt_test port
    map(clock,reset,write,ads,blast,ld_l,la_l,lbe,fholda_l,
        decode,finti);

end TEST;
Appendix E. Script File used for Traffic Shaping 
#!/bin/bash
dev=eth0
echo "shape rate is $1"
./tc qdisc add dev $dev root handle 1: htb
# This is the initial ceil argument (100 Mb/s)
./tc class add dev $dev parent 1: classid 1:1 htb rate 97639.424Kbit \
    ceil 97639.424Kbit
# Here is placed the ceil argument passed by the command line
./tc class add dev $dev parent 1:1 classid 1:10 htb rate $1Kbit ceil \
    $1Kbit
# shaping will be applied only to active packets (destination UDP port=44075
# (0xac2b)).
./tc filter add dev $dev protocol ip parent 1:0 prio 1 u32 match ip \
    protocol 0x11 0xff match u16 0xac2b 0xffff at 22 flowid 1
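The u32 filter in the script selects packets whose IP protocol field is 0x11 (UDP) and whose 16-bit value at byte offset 22 from the start of the IP header equals 0xac2b. Offset 22 is the UDP destination port only when the IP header carries no options (20 bytes). A minimal C sketch of the same test is given below; the function and macro names are hypothetical, and the sketch assumes pkt points at the first byte of an IPv4 header.

```c
#include <stddef.h>
#include <stdint.h>

#define ACTIVE_UDP_PORT 0xac2b   /* 44075, the port matched by the script */

/* Hypothetical software equivalent of the two u32 matches.
   Returns 1 when the packet would be classified as an active packet. */
int matches_active_filter(const uint8_t *pkt, size_t len)
{
    if (len < 24)
        return 0;                 /* bytes 0..23 must be present */
    if (pkt[9] != 0x11)
        return 0;                 /* IPv4 protocol field: UDP */
    /* 16-bit big-endian value at offset 22: UDP destination port
       when the IP header is exactly 20 bytes long */
    uint16_t dport = (uint16_t)((pkt[22] << 8) | pkt[23]);
    return dport == ACTIVE_UDP_PORT;
}
```

Like the filter itself, the sketch does not inspect the IP header length, so a packet with IP options would be tested against the wrong bytes.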
Appendix F. File produced by LTT 
Sched change 1,079,981,401,241,295 
STATE : 0 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) ' 3 
Socket 
SIZE ' 28 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) ' 1 
Socket 
TYPE : 1 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) 6 
Process 
STATE : 1 
Syscall exit 
Syscall entry 
socketcall; EIP 
socket 
FPM(FD) ' 6 
Socket 
SIZE : 200 
Syscall exit 
Syscall entry 
EIP : Ox08052B92 
File system 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) 3 
Socket 
1,079,981,401,241,304 
1,079,981,401,241,313 
Ox080503E8 
1,079,981,401,241,314 
1,079,981,401,241,317 
1,079,981,401,241,327 
1,079,981,401,241,333 
Ox0804B438 
1,079,981,401,241,334 
1,079,981,401,241,350 
1,079,981,401,241,351 
1,079,981,401,241,354 
Ox08052B92 
1,079,981,401,241,355 
1,079,981,401,241,370 
1,079,981,401,241,371 
1,079,981,401,241,373 
Ox08052B92 
1,079,981,401,241,374 
1,079,981,401,241,375 
1,079,981,401,241,378 
1,079,981,401,241,380 
1,079,981,401,241,381 
1,079,981,401,241,391 
1,079,981,401,241,394 
Ox080500C8 
1,079,981,401,241,394 
1,079,981,401,241,395 
3; SIZE 1536000000 
Sched change 1,079,981,401,241,398 
3104; STATE : 1 
Syscall exit 
Syscall entry 
: Ox08049DB2 
File system 
1550 
Socket 
1; SIZE : 1550 
Syscall exit 
Syscall entry 
EIP : Ox08049DE1 
File system 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) : 1 
Socket 
TYPE ' 1 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) 6 
Process 
STATE : 1 
Syscall exit 
Syscall entry 
socketcall; EIP 
1,079,981,401,241,407 
1,079,981,401,241,412 
1,079,981,401,241,413 
1,079,981,401,241,413 
1,079,981,401,241,416 
1,079,981,401,241,418 
1,079,981,401,241,419 
1,079,981,401,241,426 
1,079,981,401,241,430 
Ox08049910 
1,079,981,401,241,431 
1,079,981,401,241,435 
1,079,981,401,241,436 
1,079,981,401,241,438 
Ox0804996D 
1,079,981,401,241,439 
1,079,981,401,241,447 
1,079,981,401,241,449 
1,079,981,401,241,450 
Ox080499B2 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3104 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
19 
7 
12 
16 
16 
7 
12 
16 
16 
7 
12 
16 
16 
7 
12 
16 
16 
7 
12 
20 
7 
12 
16 
16 
19 
7 
12 
20 
16 
7 
12 
20 
7 
12 
16 
16 
7 
12 
16 
16 
7 
12 
IN 3104; OUT 0; 
SYSCALL : 
SO; CALL 16; 
SO SEND; TYPE : 3; 
SYSCALL : 
SO; CALL : 1; 
SO CREATE; FD 6; 
SYSCALL : 
SO; CALL : 3; 
WAKEUP PID : 3116; 
SYSCALL : 
SO; CALL 9; 
SO SEND; TYPE 1; 
SYSCALL close; 
CLOSE : 6 
SYSCALL 
SO; CALL : 12; 
SO RECEIVE; TYPE 
IN : 3116; OUT : 
SYSCALL : read; EIP 
READ : 6 ; COUNT 
SO RECEIVE; TYPE : 
SYSCALL close; 
CLOSE : 6 
SYSCALL 
SO; CALL : 1; 
SO CREATE; FD 6; 
SYSCALL : 
SO; CALL : 3; 
WAKEUP PID : 3105; 
SYSCALL 
231 
Appendix F File produced by LTT 
Socket 
FPM(FD) ' 6 
Socket 
SIZE : 200 
Syscall exit 
Syscall entry 
EIP ' Ox080499CO 
File system 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) ' 5 
1,079,981,401,241,451 
1,079,981,401,241,452 
1,079,981,401,241,454 
1,079,981,401,241,456 
1,079,981,401,241,457 
1,079,981,401,241,461 
1,079,981,401,241,462 
Ox08049D73 
1,079,981,401,241,463 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
3116 
Sched change 1,079,981,401,241,466 
OUT : 3116; STATE : 1 
Syscall exit 1,079,981,401,241,473 
1,079,981,401,241,477 Syscall entry 
' Ox08048DC6 
File system 
1550 
Socket 
1; SIZE ' 1550 
Syscall exit 
Syscall entry 
EIP : Ox08048DF5 
File system 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) ' 2 
Socket 
TYPE : 3 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) ' 4 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) : 4 
Syscall exit 
Syscall entry 
socketcall; EIP 
Socket 
FPM(FD) ' 4 
Socket 
SIZE : 200 
Network 
PROTOCOL : 8 
Syscall exit 
Syscall entry 
EIP : Ox080495C3 
1,079,981,401,241,478 
1,079,981,401,241,479 
1,079,981,401,241,481 
1,079,981,401,241,484 
1,079,981,401,241,484 
1,079,981,401,241,490 
1,079,981,401,241,494 
Ox08049C43 
1,079,981,401,241,495 
1,079,981,401,241,503 
1,079,981,401,241,504 
1,079,981,401,241,505 
Ox08049CSD 
1,079,981,401,241,506 
1,079,981,401,241,509 
1,079,981,401,241,511 
Ox08049C75 
1,079,981,401,241,512 
1,079,981,401,241,514 
1,079,981,401,241,516 
Ox0804AOF3 
1,079,981,401,241,516 
1,079,981,401,241,517 
1,079,981,401,241,529 
1,079,981,401,241,538 
1,079,981,401,241,540 
File system 1,079,981,401,241,541 
Syscall exit 1,079,981,401,241,545 
Syscall entry 1,079,981,401,241,547 
socketcall; EIP 0x08048D87
Socket 1,079,981,401,241,548 3105 16 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
3105 
SO; CALL: 5; 
16 
16 
7 
12 
20 
7 
12 
16 
3105 
7 
12 
20 
16 
7 
12 
20 
7 
12 
16 
16 
7 
12 
16 
7 
12 
16 
7 
12 
16 
16 
12 
7 
12 
20 
7 
12 
SO; CALL : 9; 
SO SEND; TYPE 1; 
SYSCALL close; 
CLOSE : 6
SYSCALL 
SO; CALL : 5; 
19 IN : 3105;
SYSCALL : read; EIP 
READ : 4 ; COUNT 
SO RECEIVE; TYPE : 
SYSCALL close; 
CLOSE : 4 
SYSCALL 
SO; CALL : 1; 
SO CREATE; FD 4; 
SYSCALL : 
SO; CALL : 14; 
SYSCALL : 
SO; CALL : 14; 
SYSCALL : 
SO; CALL : 11; 
SO SEND; TYPE 3; 
PACKET OUT; 
SYSCALL close; 
CLOSE : 4
SYSCALL 
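The timestamps in the trace are printed with comma grouping, so the interval between two events is easiest to obtain by stripping the commas and subtracting. The following is a minimal sketch of that arithmetic. The function names parse_ltt_timestamp and ltt_elapsed are hypothetical helpers, not part of LTT, and reading the result as microseconds is an assumption about the trace's time unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of LTT): parse a comma-grouped decimal
 * timestamp such as "1,079,981,401,241,451" into an integer.
 * Returns 0 on success, -1 if the string is empty, contains a character
 * other than a digit or a comma, or does not fit in 64 bits. */
int parse_ltt_timestamp(const char *s, uint64_t *out)
{
    uint64_t value = 0;
    int digits = 0;

    for (; *s != '\0'; s++) {
        uint64_t d;

        if (*s == ',')
            continue;               /* skip the grouping separators */
        if (*s < '0' || *s > '9')
            return -1;              /* reject anything else */
        d = (uint64_t)(*s - '0');
        if (value > (UINT64_MAX - d) / 10)
            return -1;              /* value * 10 + d would overflow */
        value = value * 10 + d;
        digits++;
    }
    if (digits == 0)
        return -1;
    *out = value;
    return 0;
}

/* Hypothetical helper: elapsed time between two trace timestamps, in the
 * unit of the trace (assumed here to be microseconds).
 * Returns -1 on a parse error or if end precedes start. */
int ltt_elapsed(const char *start, const char *end, uint64_t *delta)
{
    uint64_t a, b;

    if (parse_ltt_timestamp(start, &a) != 0 ||
        parse_ltt_timestamp(end, &b) != 0 || b < a)
        return -1;
    *delta = b - a;
    return 0;
}
```

Calling ltt_elapsed with the first timestamp on this page of the trace (1,079,981,401,241,451) and the timestamp of the Sched change event (1,079,981,401,241,466) stores 15 in delta.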
Appendix G 
The skbuff Structure 
struct sk_buff {
    /* These two members must be first. */
    struct sk_buff      *next;      /* Next buffer in list */
    struct sk_buff      *prev;      /* Previous buffer in list */

    struct sk_buff_head *list;      /* List we are on */
    struct sock         *sk;        /* Socket we are owned by */
    struct timeval      stamp;      /* Time we arrived */
    struct net_device   *dev;       /* Device we arrived on/are leaving by */

    /* Transport layer header */
    union
    {
        struct tcphdr   *th;
        struct udphdr   *uh;
        struct icmphdr  *icmph;
        struct igmphdr  *igmph;
        struct iphdr    *ipiph;
        struct spxhdr   *spxh;
        unsigned char   *raw;
    } h;

    /* Network layer header */
    union
    {
        struct iphdr    *iph;
        struct ipv6hdr  *ipv6h;
        struct arphdr   *arph;
        struct ipxhdr   *ipxh;
        unsigned char   *raw;
    } nh;

    /* Link layer header */
    union
    {
        struct ethhdr   *ethernet;
        unsigned char   *raw;
    } mac;

    struct dst_entry    *dst;

    /*
     * This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char                cb[48];

    unsigned int        len;        /* Length of actual data */
    unsigned int        data_len;
    unsigned int        csum;       /* Checksum */
    unsigned char       __unused,   /* Dead field, may be reused */
                        cloned,     /* head may be cloned (check refcnt to be sure). */
                        pkt_type,   /* Packet class */
                        ip_summed;  /* Driver fed us an IP checksum */
    __u32               priority;   /* Packet queueing priority */
    atomic_t            users;      /* User count - see datagram.c,tcp.c */
    unsigned short      protocol;   /* Packet protocol from driver. */
    unsigned short      security;   /* Security level of packet */
    unsigned int        truesize;   /* Buffer size */

    unsigned char       *head;      /* Head of buffer */
    unsigned char       *data;      /* Data head pointer */
    unsigned char       *tail;      /* Tail pointer */
    unsigned char       *end;       /* End pointer */

    void                (*destructor)(struct sk_buff *);  /* Destruct function */
#ifdef CONFIG_NETFILTER
    /* Can be used for communication between hooks. */
    unsigned long       nfmark;
    /* Cache info */
    __u32               nfcache;
    /* Associated connection, if any */
    struct nf_ct_info   *nfct;
#ifdef CONFIG_NETFILTER_DEBUG
    unsigned int        nf_debug;
#endif
#endif /*CONFIG_NETFILTER*/

#if defined(CONFIG_HIPPI)
    union {
        __u32           ifield;
    } private;
#endif

#ifdef CONFIG_NET_SCHED
    __u32               tc_index;   /* traffic control index */
#endif
};
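The relationship between the head, data, tail and end pointers and the len field can be illustrated with a small user-space sketch. It uses a hypothetical mock_skb structure and hypothetical mock_* helpers that only imitate the pointer arithmetic of a linear (unfragmented) buffer. It is not the kernel structure or the kernel's own buffer functions, and the invariant len == tail - data is assumed to hold only in that linear case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the buffer pointers of struct sk_buff.
 * head..end is the allocated area, data..tail is the valid packet data. */
struct mock_skb {
    unsigned char *head;   /* Head of buffer */
    unsigned char *data;   /* Data head pointer */
    unsigned char *tail;   /* Tail pointer */
    unsigned char *end;    /* End pointer */
    unsigned int   len;    /* Length of actual data (tail - data) */
};

/* Allocate an empty buffer: data and tail both start at head. */
int mock_alloc(struct mock_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (skb->head == NULL)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

/* Leave n bytes of headroom for headers that will be prepended later.
 * Only meaningful while the buffer is still empty. */
void mock_reserve(struct mock_skb *skb, unsigned int n)
{
    assert(skb->len == 0);
    assert((size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
unsigned char *mock_put(struct mock_skb *skb, unsigned int n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes using the headroom; returns the new start of the data. */
unsigned char *mock_push(struct mock_skb *skb, unsigned int n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Release the buffer and clear the pointers. */
void mock_free(struct mock_skb *skb)
{
    free(skb->head);
    skb->head = skb->data = skb->tail = skb->end = NULL;
    skb->len = 0;
}
```

As a usage example, allocating 128 bytes, reserving 16 bytes of headroom, appending a 20-byte payload with mock_put and then prepending an 8-byte header with mock_push leaves data at head + 8, tail at head + 36 and len equal to 28. A header is therefore added by moving the data pointer back into the reserved headroom, and the payload is not copied.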


