Design of a scalable network interface to support enhanced TCP and UDP processing for high speed networks by Elbeshti, Mohamed
 
 
 
 
Design of a Scalable Network Interface to Support 
Enhanced TCP and UDP Processing for High Speed 
Networks 
 
A thesis submitted for the degree of 
Doctor of Philosophy 
by 
Mohamed Elbeshti 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Murdoch University 
2014 
  
 
Declaration   
 
 
To the best of my knowledge, this thesis contains no material previously 
published by any other person except where due acknowledgment has been 
made.   
This thesis contains no material which has been accepted for award of any 
other degree in any other university.    
Signature:  
 
Date: July 07, 2014
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Copyright by  
Mohamed Elbeshti 
2014 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Dedicated to my Parents, Amira, Ahmed, Aseal, Yousef and Omar 
for their support… 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Abstract  
 
Communication networks have advanced rapidly in providing additional services, with 
improvements made to their bandwidth and the integration of advanced technology. As 
the speed of networks exceeds 10 Gbps, the time frame for completing the processing of 
TCP and UDP packets has become extremely short. The design and implementation of 
high performance Network Interfaces (NIs) that can support offload protocol functions 
for current and next-generation networks is challenging. In this thesis two software 
approaches are presented to enhance protocol processing of TCP and UDP in the network 
interface. A novel software Large Receive Offload (LRO) approach for enhancing the 
receiving side has been proposed. The LRO works by aggregating the incoming TCP and 
UDP packets into larger packets inside the NI’s buffer. The receiving side software has 
been improved to support out-of-order packets. The second proposed software solution is 
applied to the Large Send Offload (LSO). The proposed LSO function segments TCP and 
UDP messages that are larger than the Maximum Transmission Unit into packets of the 
Maximum Segment Size. New packet headers are generated for each new outgoing 
packet. 
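The segmentation step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the names `Segment` and `segment_message`, the MSS of 1460 bytes (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers without options) and the starting sequence number are all assumptions chosen for illustration, and the per-packet header is reduced to a TCP-style sequence number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    seq: int        # sequence number of the first payload byte (TCP-style)
    payload: bytes  # at most `mss` bytes of the original message


def segment_message(message: bytes, mss: int, initial_seq: int = 0) -> List[Segment]:
    """Split one large message into payloads of at most `mss` bytes.

    Hypothetical helper: each segment gets its own minimal "header",
    represented here only by the sequence number.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(message), mss):
        chunk = message[offset:offset + mss]
        # Sequence numbers are 32 bits wide, so wrap around at 2**32.
        segments.append(Segment(seq=(initial_seq + offset) % 2**32, payload=chunk))
    return segments


# Assumed example: a 4000-byte message with an MSS of 1460 bytes.
segments = segment_message(bytes(4000), mss=1460, initial_seq=1000)
print([(s.seq, len(s.payload)) for s in segments])
# prints [(1000, 1460), (2460, 1460), (3920, 1080)]
```

Only the last segment is shorter than the MSS, and concatenating the payloads in sequence order recovers the original message.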
A scalable programmable NI based on a 32-bit RISC core is presented that can support 
100 Gbps network speeds. Acceleration of the processing time frame required at the NI 
has been implemented to prevent hazards (such as data hazards and control hazards) 
during the execution of the LRO and the LSO functions. An R2000/3000 RISC has been 
used to test the LRO and LSO functions and to discover the instruction set that is most 
suitable. Following this, the VHDL NI was implemented with three pipeline RISC cores, a 
simple DMA controller and Content Addressable Memory. An evaluation of the 
RISC clock rate that is required to process TCP and UDP streams at 100 Gbps was 
conducted. It was determined that a RISC core running at 752 MHz with a DMA clock of 
3753 MHz was able to process packets of 512 bytes or larger fast enough to support 100 
Gbps network speeds.  
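To give a feel for how short the per-packet time frame is, the following back-of-the-envelope sketch computes the wire time of a packet and the number of clock cycles that elapse in that time. It is an illustration under stated assumptions, not the evaluation performed in the thesis: the function names are hypothetical, Ethernet preamble and inter-frame gap are ignored, and the figure applies to a single core without accounting for the DMA or for work shared between cores.

```python
def packet_time_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time on the wire for one packet, in nanoseconds.

    bits / (Gbit/s) gives nanoseconds directly. Framing overhead
    (preamble, inter-frame gap) is ignored in this sketch.
    """
    return packet_bytes * 8 / line_rate_gbps


def cycles_per_packet(packet_bytes: int, line_rate_gbps: float, clock_mhz: float) -> float:
    """Clock cycles that elapse at `clock_mhz` during one packet time."""
    return packet_time_ns(packet_bytes, line_rate_gbps) * clock_mhz / 1000.0


# A 512-byte packet at 100 Gbps occupies the link for 40.96 ns.
print(packet_time_ns(512, 100))
# Cycles elapsing at 752 MHz during that time (roughly 30.8).
print(cycles_per_packet(512, 100, 752))
```

Under these assumptions a 512-byte packet leaves only about 31 cycles of a 752 MHz clock per packet time, which shows why packet size matters so much for the achievable rate.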
 
 
 
Publications 
 
Journal Articles: 
• M. Elbeshti, M. Dixon, and T. Koziniec, Large Sending Offload: Design and 
Implementation for High-speed Communications Rate up to 100 Gbps, 
International Journal of New Computer Architectures and their Applications 
(IJNCAA). Vol 3, No. 2. 2013. 
Conference Papers: 
• M. Elbeshti, M. Dixon, and T. Koziniec, Design and Simulating Specialized 
Embedded Cores for UDP Network Interface Processing: 24th International 
Conference on Modeling and Simulation (MS 2013), Canada, 2013. 
• M. Elbeshti, M. Dixon, and T. Koziniec, Design consideration for efficient 
network interface supporting the Large Receive Offload with embedded 
RISC: 36th International Conference on Telecommunications and Signal 
Processing (TSP), Italy, 2013. 
• M. Elbeshti, M. Dixon, and T. Koziniec, Design and Simulating a Network 
Interface-based RISC Cores: 19th Asia-Pacific Conference on Communications, 
Bali, 2013. 
• M. Elbeshti, M. Dixon, and T. Koziniec, Design a Scalable Ethernet Network 
Interface Supporting the Large Receive Offload for 100 Gbps: 12th International 
Symposium on Communications and Information Technologies (ISCIT), Australia, 
2012. 
• M. Elbeshti, M. Dixon, and T. Koziniec, RISC core supporting the Large Sending 
Offload in 100 Gbps: 12th International Symposium on Communications and 
Information Technologies (ISCIT), Australia, 2012. 
• M. Elbeshti, M. Dixon, and T. Koziniec, An Evaluation of the TCP and UDP 
Processing Requirements Network Interface Design at 100 Gbps: 
13th International Symposium on Advances of High Performance Computing and 
Networking (HPCN), Canada, 2011. 
• M. Elbeshti, M. Dixon, and T. Koziniec, TCP and UDP Processing Requirements 
for Network Interface Design at 100 Gbps: 3rd International Congress on Ultra 
Modern Telecommunications and Control Systems and Workshops (ICUMT), 
Budapest, 2011. 
• M. Elbeshti, M. Dixon, and T. Koziniec, Facing the Challenges of Designing a 
Network Interfaces for 100 Gbps: 2nd International Conference on Research 
Challenges in Computer Science, China, 2010. 
 
Posters and Demonstration: 
• Mohamed Elbeshti, Designed and Implemented a Programmable Network Interface 
for High-speed Communications: Hot Chips: A Symposium on High Performance 
Chips, 2013.   
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Table of Contents 
 
 
Abstract .........................................................................................................................iv  
Publications .................................................................................................................. v  
Table of Contents ........................................................................................................ vii 
List of Figures .............................................................................................................. xiii 
List of Tables ................................................................................................................ xvi 
Acknowledgements ...................................................................................................... xvii 
List of Abbreviation .................................................................................................... xviii 
Chapter 1 Introduction...............................................................................................1 
 
1.1 Background ............................................................................................................... 1 
1.2 Protocol Processing Overview .................................................................................. 2 
1.3 Towards 100 Gbps Packet Processing ...................................................................... 4 
1.3.1 Protocol Processing Considerations ................................................................... 5 
1.3.2 Network Interface Design Approaches .............................................................. 5 
1.4 Programmable-based Network Interface Design ...................................................... 6 
1.4.1 Processing Rate .................................................................................................. 7 
1.4.2 Core Structure ..................................................................................................... 8 
1.5 Thesis Contributions ................................................................................................. 8 
1.6 Research Approach ................................................................................................. 10 
1.7 Organization of the Thesis ...................................................................................... 11 
 
Chapter 2 Overview of the Protocol Processing..............................................15 
 
2.1 Introduction ............................................................................................................. 15 
2.2 Overview of the Protocol Processing at a Server .................................................... 16 
2.2.1 Host CPU Time Required for Protocol Processing .......................................... 17 
2.2.2 Protocol Processing Time Used by the Network Interface............................... 18 
2.2.3 Packet Processing Time .................................................................................... 20 
2.2.3.1 Maximum Number of Packets in one Second .......................................................... 20 
2.2.3.2 Maximum throughput of a workstation is dependent on packet size ....................... 22 
2.2.4 Reducing Protocol Overhead and Enhancing Server Performance .................. 25 
2.2.5 Techniques Used for Protocol Processing in the Host Area ............................. 26 
2.2.6 Extending the Ethernet Frame Size .................................................................. 28 
2.2.7 Techniques Implemented Inside the Network Interface ................................... 28 
2.3 An Analysis of the Existing Protocol Processing Solutions for Gigabit  ................ 31 
2.4 Aims of the Research .............................................................................................. 33 
2.5 Research Contributions ........................................................................................... 34 
2.6 Offloading Processing and Related Work ............................................................... 35 
2.6.1 Large Receive Offload in the Virtual Driver .................................................... 35 
2.6.2 Receive Side Coalescing .................................................................................. 36 
2.6.3 Large Receive Offload Concerns in High-Speed Networks ............................. 36 
2.6.4 Large Segment Offload (LSO) Enhancements ................................................. 37 
2.7 Programmable Packet Processor Design Methodology .......................................... 39 
2.7.1 Network Interface Hardware-based .................................................................. 40 
2.7.2 Network Interface Programmable-based .......................................................... 41 
2.8 Programmable Approaches for Packet Processing ................................................. 42 
2.8.1 Network Interface Using a Single Processor .................................................... 43 
2.8.2 Using a General-Purpose Processor Supported by a DMA for Transferring 
Packets to or from a Network ........................................................................... 44 
2.8.3 Using a Dual Core Engine supported with DMA for UDP Protocol................ 45 
2.8.4 Multiprocessing Cores for Packet Processing .................................................. 46 
2.8.5 Target Core Observation .................................................................................. 48 
2.9 Conclusion .............................................................................................................. 49 
 
Chapter 3 Large Receive Offload Methodology..............................................50 
 
3.1 Introduction ............................................................................................................. 50 
3.1 Related Implementations of Large Receive Offload Processing ............................ 51 
3.1.1 Virtual Large Receive Offload ......................................................................... 51 
3.1.2 LRO performance within the host area............................................................. 54 
3.1.3 Receive Side Coalescing .................................................................................. 55 
3.2 Enhancing the Large Receive Offload Processing .................................................. 56 
3.2.1 Lost Large Packet Treatment inside the Network Interface ............................. 58 
3.2.2 Large Packet Processing ................................................................................... 60 
3.3 Primary Design and Structure for the Receiving Side ............................................ 61 
3.4 Receiving Side Processing ...................................................................................... 62 
3.4.1 TCP Processing Methodology .......................................................................... 64 
3.4.1.1 Linked-list Structure Format .................................................................................... 67 
3.4.1.2 Out-of-order Processing ........................................................................................... 70 
3.4.2 UDP/IP Processing Methodology ..................................................................... 72 
3.5 Verification of the LRO Processing ........................................................................ 72 
3.5.1 Modelling SPIM Simulator Architecture ......................................................... 74 
3.5.2 Simulation Processing Analysis ....................................................................... 77 
3.5.3 Instruction Cycles ............................................................................................. 79 
3.5.4 Instruction Type ................................................................................................ 81 
3.5.5 Total Registers .................................................................................................. 82 
3.5.6 RISC Clock Rate .............................................................................................. 83 
3.6 Network Interface Design Considerations for 100 Gbps ........................................ 85 
3.6.1 DMA for Data Movements ............................................................................... 86 
3.6.2 Local Bus Width ............................................................................................... 87 
3.6.3 Pipeline Stages .................................................................................................. 87 
3.6.4 Lookup Memory ............................................................................................... 88 
3.6.5 Overlapped Processing ..................................................................................... 88 
3.7 Conclusion .............................................................................................................. 89 
 
Chapter 4 Large Send Offload Methodology....................................................91 
 
4.1 Introduction ............................................................................................................. 91 
4.2 Sending Side Block Diagram .................................................................................. 91 
4.3 Protocol Processing Methodology .......................................................................... 94 
4.3.1 UDP Processing ................................................................................................ 96 
4.3.2 TCP Processing ................................................................................................ 99 
4.4 SPIM Simulator for LSO ...................................................................................... 100 
4.5 Simulation Results ................................................................................................ 103 
4.5.1 Instruction Types ............................................................................................ 105 
4.5.2 RISC Clock Rate ............................................................................................ 106 
4.6 Design Consideration for 100 Gbps at the Sending Side ...................................... 108 
4.6.1 Enhancing Packet Processing ......................................................................... 109 
4.7 Conclusion ............................................................................................................ 110 
 
 
Chapter 5 A Scalable Network Interface Architecture for 100 Gbps...112 
 
5.1 Introduction ........................................................................................................... 112 
5.2 Network Interface Model ...................................................................................... 113 
5.2.1 Network Interface Buffering .......................................................................... 115 
5.2.2 Data Transfer .................................................................................................. 117 
5.2.2.1 DMA for Data Transfer .......................................................................................... 117 
5.2.2.2 Bus Width............................................................................................................... 120 
5.2.3 Content Addressable Memory ........................................................................ 120 
5.2.4 The CAM Implementation inside the proposed NI ........................................ 122 
5.3 The Network Interface FIFOs ............................................................................... 127 
5.4 The Interface Buffers ............................................................................................ 131 
5.4.1 Memory Management..................................................................................... 131 
5.4.2 The Receiving and Sending Buffer ................................................................ 134 
5.4.3 Receiver and Transmission Line Buffers ....................................................... 134 
5.5 Conclusion ............................................................................................................ 137 
 
 
Chapter 6 Developing the RISC Core for TCP/IP and UDP/IP 
                  Processing   ............................................................................................138 
 
6.1 Introduction ........................................................................................................... 138 
6.2 RISC Pipeline ........................................................................................................ 138 
6.3 Instructions Set Representation ............................................................................. 142 
6.3.1 Arithmetic and Logic Operation Instructions ................................................. 143 
6.3.2 Branch Instructions ......................................................................................... 144 
6.3.3 Memory Access Instructions .......................................................................... 146 
6.4 Pipeline Hazard ..................................................................................................... 148 
6.5 RISC Registers ...................................................................................................... 153 
6.6 Components required for RISC cores ................................................................... 155 
6.7 Packet Data Path ................................................................................................... 155 
6.7.1 The PCI Interface............................................................................................ 159 
6.7.1.1 PCI Interface at the Packet Processing Unit ........................................................... 160 
6.7.1.2 Reading data from the Receiving buffer ................................................................ 162 
6.7.1.3 Reading data from Sending side............................................................................. 163 
6.7.2 Interrupt Moderation Window Size ................................................................ 164 
6.8 Conclusion ............................................................................................................ 166 
 
 
Chapter 7 LRO and LSO Processing Analysis inside the PPU.................168 
 
7.1 Introduction ........................................................................................................... 168 
7.2 Enhancement to Improve Packet Processing ........................................................ 168 
7.3 Processing Analysis .............................................................................................. 171 
7.3.1 Large Receive Offload Analysis through Full-System Simulation ................ 173 
7.3.1.1 TCP processing cycles ........................................................................................... 174 
7.3.1.2 UDP processing cycles ........................................................................................... 175 
7.3.2 Large Send Offload Analysis through Full-System Simulation ..................... 189 
7.3.2.1 TCP processing cycles ........................................................................................... 189 
7.3.2.2 UDP processing cycles ........................................................................................... 190 
7.4 The Payload Length Path ....................................................................................... 199 
7.5 Conclusion ............................................................................................................ 203 
 
Chapter 8 VHDL Simulation Results ...............................................................205 
 
8.1 Introduction ........................................................................................................... 205 
8.2 Packet Processing Enhancements for High-Speed Networks ............................... 205 
8.3 RISC Clock Rate for 100 Gbps ............................................................................. 210 
8.4 Results ................................................................................................................... 211 
8.5 The DMA and RISC Clock Rate for 100 Gbps..................................................... 217 
8.6 Conclusion ............................................................................................................ 218 
 
Chapter 9 Conclusion and Future Work ....................................................... 219 
 
9.1 Summary of Contributions .................................................................................... 219 
9.2 Future Work .......................................................................................................... 222 
References .................................................................................................................. 223  
Appendix A Data Collection.................................................................................. 231 
 
A.1 Collection of real TCP and UDP streams from multiple tests .............................. 231 
A.2 Tests Methodology ................................................................................................ 232 
A.2.1 Receive side Flows ......................................................................................... 233 
A.2.1.1 Commands: ............................................................................................................ 234 
A.2.2 Send Side Flows ............................................................................................. 234 
A.2.2.1 Commands: ............................................................................................................ 234 
 
Appendix B Schematic Diagrams......................................................................  235 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
List of Figures  
 
 
Figure 2.1: Ethernet frame format ................................................................................... 19 
Figure 2.2: Theoretical maximum throughput for 10 Gbps ............................................ 25 
Figure 2.3: Workstation architecture............................................................................... 39 
Figure 3.1a: Receive side data flow when Large Receive Offload is not implemented . 52 
Figure 3.1b: Receive side data flow when Large Receive Offload is implemented ...... 52 
Figure 3.2: Extract part of the LRO code shows packets that do not match the LRO 
requirements in a separate buffer ........................................................................ 53 
Figure 3.3: Offloading the LRO approach to the Network Interface .............................. 57 
Figure 3.4: The Ethernet network interface structure ..................................................... 62 
Figure 3.5: Segment Message of a TCP stream .............................................................. 65 
Figure 3.6: Processing flow of TCP of LRO................................................................... 66 
Figure 3.7: Linked-list data structure .............................................................................. 68 
Figure 3.8: Lookup Memory structure ............................................................................ 71 
Figure 3.9: Illustrates inter-packet processing ............................................................... 73 
Figure 3.10: Receiving block diagram ............................................................................ 74 
Figure 3.11: Packet Processing Unit based SPIM simulator .......................................... 75 
Figure 3.12: Programmed I/O approach for data movement .......................................... 77 
Figure 3.13: A TCP/IP and UDP/IP Hexadecimal format .............................................. 78 
Figure 3.14: Total percentage of data movements of LRO ............................................. 81 
Figure 3.15: Floating Point registers during the processing of the proposed LRO ........ 83 
Figure 3.16: RISC clock rate for packet header processing ............................................ 84 
Figure 3.17: MIPS required for the Receiving side using Programmed I/O ................... 85 
Figure 3.18: DMA approach for data movement inside the Network Interface ............. 87 
Figure 3.19: Overlapped processing at the receiving side .............................................. 89 
Figure 4.1: Sending side Model ...................................................................................... 93 
Figure 4.2: Four pointers are used with the new approach for segmenting packets ....... 96 
Figure 4.3: Processing flow of TCP and UDP of LSO ................................................... 97 
Figure 4.4: Procedure of sending a UDP user data application ...................................... 98 
Figure 4.5: Procedure of sending a TCP user data application ....................................... 99 
Figure 4.6: Sending block diagram ............................................................................... 100 
Figure 4.7: SPIM simulator block diagram ................................................................... 101 
Figure 4.8: Communication between the host and the NI  ............................................ 102 
Figure 4.9: Processing flow........................................................................................... 103 
Figure 4.10: Total percentage of data movements of LSO processing ......................... 105 
Figure 4.11: RISC clock rate for packet header processing .......................................... 107 
Figure 4.12: Amount of MIPS required for sending side using Programmed I/O ........ 108 
Figure 4.13: Pipeline processing at the sending side .................................................... 110 
Figure 5.1: Network interface block diagram ............................................................... 114 
Figure 5.2: DMA structure ............................................................................................ 118 
Figure 5.3: DMA cycles to transfer data from the source to the destination ................ 118 
Figure 5.4: DMA channel ............................................................................................. 119 
Figure 5.5: CAM structure when the Linked-list  ......................................................... 121 
Figure 5.6: CAM based implementation of the look-up-table ...................................... 123 
Figure 5.7: CAM-based search engine block diagram .................................................. 124 
Figure 5.8: Cycles required during read operations  ..................................................... 125 
Figure 5.9: The two FIFOs are used to send data from the receiver RISC processor  .. 129 
Figure 5.10: Sends the TCP active connections to the receiving side through FIFO 3 130 
Figure 5.11: The FIFO carries the information needed for segmenting a message ...... 130 
Figure 5.12: Circulation Buffer architecture ................................................................. 132 
Figure 5.13a: Tracking the size of the RB .................................................................... 133 
Figure 5.13b: Signal sent when 200 pages of the RB are occupied ............................. 133 
Figure 5.14: Receiving Buffer Interface architecture ................................................... 135 
Figure 5.15: Sending buffer interface architecture ....................................................... 136 
Figure 6.1a: Normal Structure of RISC instruction pipeline ........................................ 139 
Figure 6.1b: Structure of RISC instruction pipeline ..................................................... 139 
Figure 6.2: Block diagram of the Fetch, Decode, Execute and Write/Back  ................ 141 
Figure 6.3a: Arithmetic/Logic instruction formation .................................................... 143 
Figure 6.3b: Arithmetic/Logic immediate instruction formation .................................. 143 
Figure 6.4: Arithmetic instructions ............................................................................... 144 
Figure 6.5: Branch instruction format ........................................................................... 145 
Figure 6.6: Branch instruction example ........................................................................ 145 
Figure 6.7: Load/Store instruction format ..................................................................... 146 
Figure 6.8: Load and store instructions with memory address ..................................... 147 
Figure 6.9: LCAM instruction format ........................................................................... 147 
Figure 6.10: Load a memory address from CAM ......................................................... 148 
Figure 6.11a: Before scheduling the branch-delay slot ................................................. 149 
Figure 6.12: Before scheduling procedure .................................................................... 151 
Figure 6.13: Delay slot technique for Data hazard ....................................................... 151 
Figure 6.14: Forward mechanism used in the simulator ............................................... 153 
Figure 6.15: Latching the output of the Arithmetic Logic Unit  ................................... 153 
Figure 6.16: RISC register file ...................................................................................... 154 
Figure 6.17: The topology of the test environment ....................................................... 156 
Figure 6.18: Tested model for sending and receiving packets  ..................................... 157 
Figure 6.19: Transferring packets from the Receiving Buffer to the Host Memory..... 160 
Figure 6.20: Timing diagram captured from the simulation for burst transfer ............ 161 
Figure 7.1: Tested model for sending and receiving packets at the PPUnit ................. 170 
Figure 7.2: Large Receive Offload Processing cycles characteristics .......................... 172 
Figure 7.3:  Timing diagram for TCP BOM packet Instructions  ................................. 177 
Figure 7.4: Total number of instructions for TCP BOM packet without idle cycles .... 178 
Figure 7.5: Total number of instructions for TCP COM packet without idle cycles .... 179 
Figure 7.6: Total number of instructions for TCP EOM packet without idle cycles .... 180 
Figure 7.7: Total number of instructions for TCP SSM packet without idle cycles ..... 181 
Figure 7.8: Total number of instructions for the TCP out-of-order packet when the
sub linked-list is equal to “0” in the CAM ........................................................ 182 
Figure 7.9: Total instructions for TCP out-of-order packet when the sub linked-list          
is not equal to “0” in the CAM .......................................................................... 183 
Figure 7.10: Total number of instructions to process the UDP BOM packet  .............. 184 
Figure 7.11: Total number of instructions to process the UDP COM packet  .............. 185 
Figure 7.12: Total number of instructions to process the UDP EOM packet  .............. 186 
Figure 7.13: Total number of instructions to process the UDP SSM packet  ............... 187 
Figure 7.14: Total instructions for the UDP out-of-order packet when the sub linked-         
list is not equal to “0” in the CAM .................................................................... 188 
Figure 7.15: Total number of instructions to process the TCP BOM packet  ............... 191 
Figure 7.16: Total number of instructions to process the TCP COM packet  ............... 192 
Figure 7.17: Total number of instructions to process the TCP EOM packet  ............... 193 
Figure 7.18: Total number of instructions to process the TCP SSM packet  ................ 194 
Figure 7.19: Total number of instructions for the UDP BOM packets  ........................ 195 
Figure 7.20: Total number of instructions for the UDP COM packets  ........................ 196 
Figure 7.21: Total number of instructions for the UDP EOM packets  ........................ 197 
Figure 7.22: Total number of instructions for the UDP SSM packets  ......................... 198 
Figure 8.1: Total RISC idle cycles when the DMA clock is double RISC clock rate .. 206 
Figure 8.2:  Desired DMA clock rate for LRO and LSO .............................................. 207 
Figure 8.3: RISC clock rate for LSO and LRO for UDP/IP when the DMA is 3759
MHz for receiving-side and 2115 MHz for sending-side .......................... 211 
Figure A.1: Network topology ...................................................................................... 231 
Figure A.2: A snapshot of the Hexadecimal file from Wireshark ................................... 235 
Figure B.1: VHDL based Packet Processing Unit architecture .................................... 237 
Figure B.2: Structure of RISC instruction VHDL based pipeline ................................ 238 
Figure B.3: DMA schematic diagram ........................................................................... 239 
Figure B.4: RISC register file schematic diagram ........................................................ 240 
Figure B.5: CAM schematic diagram ........................................................................... 241 
Figure B.6: Receiver Buffer Interface (RBI) schematic diagram ................................. 242 
Figure B.7: Minimize Data Hazard by latching the output of the ALU by forwarding 
hardware (U12) to be  read within next instruction (forward mechanism). ...... 243 
Figure B.8: VHDL block diagram for  DMA entity ..................................................... 244 
Figure B.9: VHDL block diagram for Register entity .................................................. 245 
Figure B.10: VHDL block diagram for PipeLine entity .............................................. 246 
Figure B.11: VHDL block diagram for CAM entity .................................................... 247 
 
 
 
 
 
 
 
 
List of Tables 
 
 
Table 2.1: The frame rate per second applied to different line speed rates .................... 22 
Table 2.2: Minimum and Maximum-sized Ethernet frames  .......................................... 23 
Table 2.3: Some of the Network Processor cores used as network processors ............... 43 
Table 3.1: A comparison between the virtual LRO processing and offloaded LRO ...... 58 
Table 3.2: Number of cycles needed to complete out-of-order TCP or UDP packet ..... 80 
Table 3.3: Instruction types that are used with LRO processing .................................... 82 
Table 4.1: Number of cycles within the SPIM simulator.............................................. 104 
Table 4.2:  Instruction types used with LRO processing .............................................. 106 
Table 5.1: Input and output signals of the CAM ........................................................   125 
Table 6.1: RISC Instructions ......................................................................................... 142 
Table 6.2: Number of occurrences of Conditional Branch instructions  ........................ 150 
Table 6.3: Number of occurrences of the Read after Write (R/W) hazard for UDP  ..... 152 
Table 6.4: Number of occurrences of the Read after Write (R/W) hazard for TCP  ..... 152 
Table 6.5: Number of Register files size for LRO and LSO ......................................... 155 
Table 6.6: Shows the components needed for the LSO and the LRO functions ........... 155 
Table 6.7: The Interrupt Moderation sizes and absolute time ..................................... 165 
Table 7.1: The number of RISC instructions required to process the LRO when                              
the DMA is double RISC’s clock ...................................................................... 200 
Table 7.2: RISC cycles while performing the Large Send Offload, when the                                                                  
DMA clock rate is double the RISC's clock rate ............................................... 202 
Table 8.1: Total number of RISC instructions to complete the processing of the                   
LRO for TCP and UDP when the DMA clock is 3759 MHz ............................ 208 
Table 8.2: Total number of RISC instructions to complete the processing of the                   
LSO for TCP and UDP when the DMA clock is 2115 MHz ............................ 209 
Table 8.3: Packet processing at the receiving side when the RISC clock is 752                          
MHz and the DMA is 3759 MHz ...................................................................... 213 
Table 8.4: LSO packet processing time when the RISC clock is 752 MHz and
the DMA is 3759 MHz ....................................................................................... 214 
Table 8.5: LRO packet processing time when the RISC clock is 1449 MHz and
the DMA is 3759 MHz ....................................................................................... 215 
Table 8.6: LSO packet processing time when the RISC clock is 1449 MHz and
the DMA is 3759 MHz ....................................................................................... 216 
Table 8.7: The RISC and DMA clock rate supporting LRO and LSO for TCP and                                         
UDP at 100 Gbps ........................................................................................217 
 
 
 
 
 
 
 
 
 
Acknowledgements 
 
 
 
 
 
 
All praises are due to Almighty God “Allah”, Who provided me with the strength and 
willingness to undertake this work and the opportunity to contribute a drop in the sea of 
knowledge. 
I am most grateful to my supervisors, Mike Dixon and Terry Koziniec. Their open doors 
and persistent encouragement were invaluable for the completion of this research. I also 
appreciate the time that Patrick and Lynette spent with me. 
Finally, these acknowledgements would not be complete without appreciating the 
unwavering support of my family including my father, my wife and children, and the 
memory of my mother. 
 
 
 
 
 
 
 
 
 
 
List of Abbreviations 
 
 
 
ACK ACK - Acknowledges received data 
BOM Beginning of Message 
CAM Content Addressable Memory 
COM Continuation of Message 
DMAC Direct Memory Access Controller 
EOM End of Message  
FIN FIN - (Final) Cleanly terminates a connection 
HI Host Interface  
HNIC Host-NI Level of Communication Buffer 
LI Line Interface  
LRO Large Receive Offload 
LSO Large Send Offload 
MAC Media Access Control 
RB Receiving Buffer  
RBI Receiving Buffer Interface 
REP Receiving Embedded Processor 
RISC Reduced Instruction Set Computer  
SB Sending Buffer  
SBI Sending Buffer Interface 
SEP Sending Embedded Processor 
SN Sequence Number  
SSM Single Segment Message 
SYN SYN - (Synchronize) Initiates a connection 
VHDL Very High-Speed Integrated Circuit Hardware Description Language   
 
Chapter 1 
 
Introduction 
 
 
1.1  Background  
The standard for 40-100 Gigabit Ethernet was ratified in 2010 [1]. While 40-100 
Gigabit Ethernet deployments have grown every year since then, the technology has 
primarily been used to interconnect switches and routers [40, 100]. Nearly all of the 
server connections within the data center have remained at 10 Gbps, limiting the amount 
of network throughput available to each server.  The performance of current and next 
generation server applications depends upon the efficiency of the protocol stack 
processing within a network node [54]. Such server applications include e-commerce, 
storage and web servers, which employ Internet Protocol (IP), Transmission Control 
Protocol (TCP) or User Datagram Protocol (UDP) as the communication protocol of 
choice. 
A delay in the processing of incoming or outgoing packets may cause a bottleneck at 
the node [25, 50, 53, 117], which can result in packets being dropped. For instance, after 
a TCP connection is established between two end nodes, each arriving packet must pass 
through several processing steps within a fixed time budget that depends on the line 
speed. This processing includes passing the packets from the 
network interface to the host memory, processing the IP and the TCP headers [71, 117], 
verifying the packet payload and then moving the data to the application space. 
Additionally, acknowledgment messages have to be sent to the sender, confirming that 
the packet has been received successfully. During the process of sending packets from 
one network node to another, the processor at the sender node has other tasks to 
perform, in order to deliver packets to a network. The first step is concerned with 
generating the TCP or UDP as well as the IP packet headers, attaching user data to the 
headers to create a complete packet and then storing the packet in the host memory. The 
second step of the protocol processing is to notify the network device (network 
interface) to fetch the packet from the host memory to be sent to a network line. Finally, 
the network interface notifies the host processor that the packets have been sent to the 
network. 
 
1.2  Protocol Processing Overview   
As technology has grown beyond Gigabit network speeds, the host CPU has become 
burdened with the increased amount of protocol processing required [54]. Enhancing the 
protocol processing at the end node is one way to improve packet processing so that the 
node can keep up with the network speed. Accordingly, a number of solutions have been 
proposed to meet this demand. One host-side strategy dedicates one processor at the end 
node to protocol processing [6, 49], so that the other host CPUs can devote more cycles 
to user applications and Operating System tasks. Distributing the arriving packets 
across all of the server processors has also been proposed as a way to improve 
protocol processing within high-speed networks [7, 70]. The minimization of 
data copying to reduce packet processing in the host has been adopted in most 
modern servers [8]. Sending a copy of the packet headers to a CPU’s cache to reduce 
cache misses has also been implemented to enhance protocol processing [85]. However, 
most of these host-side methods still have drawbacks when it comes to keeping up with 
the packet rates delivered by future network speeds (40-100 Gbps links). For instance, 
none of them addresses the TCP, UDP or IP header processing itself.  
Jumbo frames, which carry payloads of up to 9000 bytes, have the potential to reduce the 
per-packet processing at the end node [96]. However, Jumbo frames are not universally 
used, since most network nodes still do not support 9000-byte payloads and the standard 
Maximum Transmission Unit (MTU) is 1500 bytes. Furthermore, the original IEEE 802.3 
specifications [47, 90] defined the maximum valid Ethernet frame size as 1518 bytes. 
Another method for reducing protocol processing is the TCP Offload Engine (TOE) 
technique [5, 13, 22, 23, 24, 26] or Packet Processing Engine (PPE) technique. The 
offload technique partitions the TCP/IP or the UDP/IP protocol stack so that some 
functions, such as the lower level functions, are processed on the network interface [9, 
21, 35], leaving the other processing functions [10, 85] to the host processor. The 
offload approach reduces the amount of processing that the host processor would 
otherwise perform when the network interface is not involved in protocol processing [120].  
The network requires fast devices in order to pass packets between end nodes. Such 
devices must provide enough processing speed to complete the processing required 
for incoming and outgoing packets. This will help enhance the packet processing 
required to keep pace with the rapid increase of network speeds. In order to improve 
packet processing, data centres and network developers need to consider which of the 
protocol functions can be extracted from the host processor and processed by the 
network interface. Additionally, a network interface can process the shifted protocol 
functions efficiently. However, shifting all of the stack functions to the network 
interface increases the complexity of the network interface design as well as the 
Operating System configuration [14].  
 
1.3 Towards 100 Gbps Packet Processing    
The design of a Gigabit Ethernet network interface that can support protocol 
functions such as the TCP/IP or the UDP/IP stack is extraordinarily challenging. The 
challenges include managing different packet sizes from 64 bytes to 1500 bytes, 
implementing the proper offload protocol functions in a network interface and 
supporting high-speed network transmission lines. As the network speed increases, the 
time available to the network interface to process each arriving packet decreases. 
The time available for a 1500-byte packet, for instance, is 123.4 ns at a line 
speed of 100 Gbps [71]. For small packets (64 bytes), the time available is 6.72 ns 
at a line speed of 100 Gbps. Therefore, the processing unit inside the network interface 
should be fast enough to finish verifying one packet before the next one arrives.  
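
The per-packet time budget can be estimated with a short calculation. The following is a minimal sketch, not the thesis's own accounting: it assumes that each frame occupies the line for its own length plus an 8-byte preamble and start delimiter and a 12-byte inter-frame gap, and the names used are illustrative only. Under these assumptions a 64-byte frame yields 6.72 ns, and a maximum-sized 1518-byte frame (1500-byte payload) yields about 123 ns, which is in line with the budget quoted above.

```python
# Assumed per-frame line overheads (not stated in the text above):
PREAMBLE_SFD_BYTES = 8     # 7-byte preamble + 1-byte start frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum gap between consecutive frames


def frame_time_ns(frame_bytes: int, line_rate_gbps: float) -> float:
    """Time one frame occupies the line, in nanoseconds.

    Bits divided by Gbit/s gives nanoseconds directly.
    """
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return wire_bits / line_rate_gbps


if __name__ == "__main__":
    for frame in (64, 1518):
        print(frame, "bytes:", round(frame_time_ns(frame, 100), 2), "ns")
```

The same function shows how quickly the budget shrinks with line speed: at 10 Gbps the 64-byte budget is ten times larger than at 100 Gbps.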
 
 
1.3.1 Protocol Processing Considerations    
Protocol functions also need to be considered when designing a network interface for 
high-speed networks. For instance, supporting the UDP/IP protocol functions in a 
network interface [15, 16, 19, 39, 65] at 10 Gbps is more straightforward than 
supporting other protocols such as TCP/IP [23, 29]. This is a 
consequence of the reduced processing requirements for UDP/IP in comparison to 
TCP/IP: the UDP header is only 8 bytes long and UDP operates on a best-effort basis, 
leaving the functions associated with end-to-end reliable transport to the higher 
layers of the protocol stack. TCP, on the other hand, carries at least 20 bytes in its 
header and provides end-to-end reliable transmission. 
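
The difference in header size can be checked by laying out the fixed header fields. The sketch below is illustrative only: it uses Python's standard struct module with the field widths of the standard UDP header and the option-free TCP header, and the constant names are hypothetical.

```python
import struct

# "!" selects network byte order with no padding.
# UDP fixed header: source port, destination port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")
# TCP header without options: source port, destination port, sequence number,
# acknowledgment number, data offset/reserved, flags, window, checksum,
# urgent pointer.
TCP_HEADER = struct.Struct("!HHIIBBHHH")

if __name__ == "__main__":
    print("UDP header bytes:", UDP_HEADER.size)
    print("TCP header bytes:", TCP_HEADER.size)
```

The additional TCP fields (sequence and acknowledgment numbers, flags and window) are the ones that carry the reliability state a network interface has to track when TCP processing is offloaded.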
 
1.3.2 Network Interface Design Approaches 
There are two main approaches to implementing a network interface. The first is the 
hardware-based network interface, such as an Application Specific Integrated Circuit 
(ASIC) controller [21, 65] or a field-programmable gate array (FPGA) [29]. The other 
approach is the programmable-based network interface [15, 16, 23, 39, 43]. Advances in 
chip design technology, specifically those which are ASIC based, have made it possible 
to integrate most, if not all, of the discrete components required for the network 
interface onto a single chip, providing reduced power consumption and higher 
performance. Yet ASIC-based network interfaces cannot readily accept new protocols 
without the redesign of parts of the chip. The programmable-based 
network interface, however, provides more flexibility in the modification of the network protocol 
functions within the network interface. FPGAs are reconfigurable hardware that can be 
configured for a specific network interface function, which makes them attractive to 
designers when building a network interface. Even though FPGAs can provide high 
performance, they still carry considerable overhead compared to ASICs, because an FPGA 
requires many transistors to accomplish what an ASIC does with one. This adds latency 
and increases the power consumption of the NI design. In addition, FPGAs are more 
sensitive to coding styles and design practices [65]. 
 
1.4  Programmable-based Network Interface Design  
Programmable-based network interfaces may not provide the same level of 
performance as the hardware-based approaches, but they are more flexible and can easily 
accommodate a protocol revision or even a new protocol. The wide availability of 
programmable processors has contributed to the low development costs of such network 
interfaces. Using these processors when designing a network interface simplifies the 
data path and thus simplifies the overall design. 
Today, many off-the-shelf embedded cores are available and can be integrated 
into the network interface (e.g., recent chips from Cavium, Tilera and IBM). However, 
these general-purpose processor products are designed to support many 
different applications. Their control units have to support general functions, 
complex instructions and long and variable execution times [44]. Embedded 
general-purpose processors can be used in the target network interface to support the 
proposed receive and send offload functions. However, these general-purpose 
processors are not optimized for a particular protocol application in a network interface. 
Hence, some portions of these licensed cores may not be required for the network 
interface design. For example, the Floating-Point Unit is not necessary for network 
interfaces. A number of instructions in the instruction sets of general-purpose 
processors are also not used by network interface processing. The long pipelines also 
lengthen the time needed to complete a single instruction [44]. If the Floating-Point 
Unit is eliminated, the pipeline can be shortened and the large register files reduced. 
Thus, a core processor tailored to this application becomes simpler to develop, faster 
and lower in cost. The limited number of instructions required to support interface 
processing can reduce the size and complexity of the control unit, leading to increased speed.  
 
1.4.1 Processing Rate 
Protocol processing at high communication rates requires an adequate clock rate to 
complete the processing of the protocol functions without creating a bottleneck in the 
network and degrading system performance. Using parallel processors when designing a 
network interface is one of the solutions to facilitate high-speed communications. An 
example of such a network interface design is the one developed at Rice University [15, 
42], which employed six simple pipelined cores to support 10 Gbps full-duplex UDP 
Ethernet frames. Even though the multiple-core method provides better 
performance at 10 Gbps, its use in an Ethernet network interface is problematic 
because of its high development cost and its expensive resource requirements within a 
network interface. Furthermore, the utilization of six processors is difficult due to the 
need to share data and resources. 
 
1.4.2 Core Structure  
The design of a specialised processor for a high-speed network (40-100 Gbps) requires 
further consideration of issues such as which protocol function(s) to offload and the 
clock rate required. To achieve the goal of the design, an understanding of the 
processing requirements of the Packet Processing Unit (PPU) for TCP/IP and UDP/IP is 
required, including the instruction set, the pipeline stages and the program structure 
format. In addition, selecting appropriate devices that can be integrated with the core 
processor for processing the incoming and outgoing packets is also beneficial for 
high-speed processing. 
 
1.5 Thesis Contributions 
The thesis provides three solutions for enhancing protocol processing on the receiving 
and the sending sides of the designed Packet Processing Unit (PPU) for 100 Gbps. A 
Large Receive Offload (LRO) algorithm has been implemented on the PPU receiving 
side. The proposed LRO algorithm works by aggregating the arriving packets from a 
single stream to form large packets in the network interface’s buffer before they are sent 
to the stack processing, so that the host CPU processes fewer packet headers. This 
reduces the number of DMA initiations required to transfer small packets from the 
network interface’s buffer to the host memory [50]. Amalgamating packets that arrived 
from the same stream to form larger packets in the network interface decreases the number 
of packets travelling through the host bus and occupying space in the host memory. The 
proposed offloading does not change the treatment of the protocol stack processing or 
require any amendment to the structure of the protocol. It works similarly to the 
Jumbo Frame (9000 bytes) mechanism. This makes the proposed implementation 
compatible with all types of Operating Systems.  
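
The aggregation idea described above can be illustrated with a small sketch. This is not the algorithm implemented in the thesis: the Segment type, the flow key, the 9000-byte limit and the function name are all assumptions made for illustration, and out-of-order handling and the CAM look-up are not modelled. The sketch only merges a segment when it continues the open aggregate of the same stream exactly where that aggregate ends.

```python
from dataclasses import dataclass

# Hypothetical cap on an aggregated payload, mirroring the Jumbo Frame size.
MAX_AGGREGATE_BYTES = 9000


@dataclass
class Segment:
    flow: tuple      # illustrative flow key, e.g. (src_ip, src_port, dst_ip, dst_port)
    seq: int         # sequence number of the first payload byte
    payload: bytes


def aggregate(segments, limit=MAX_AGGREGATE_BYTES):
    """Merge consecutive in-order segments of the same flow into larger ones."""
    merged = []
    open_by_flow = {}  # flow -> index in `merged` of the aggregate still growing
    for seg in segments:
        idx = open_by_flow.get(seg.flow)
        if idx is not None:
            cur = merged[idx]
            in_order = seg.seq == cur.seq + len(cur.payload)
            fits = len(cur.payload) + len(seg.payload) <= limit
            if in_order and fits:
                merged[idx] = Segment(cur.flow, cur.seq, cur.payload + seg.payload)
                continue
        # Otherwise start a new aggregate for this flow.
        merged.append(Segment(seg.flow, seg.seq, seg.payload))
        open_by_flow[seg.flow] = len(merged) - 1
    return merged
```

In this sketch, three consecutive 1000-byte segments of one stream leave the buffer as a single 3000-byte packet, so the host would process one header instead of three.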
This research also extends the LRO function to manage the sequencing of packets [73]. 
This is done by reading each arriving packet’s header in order to determine the packet 
type and the sequence of the packet. This research also aims to extend the LRO to support 
other network protocols, such as UDP. In adding these improvements, the challenge is to 
process each received packet’s header within the budgeted time frame for 100 Gbps line 
speeds (e.g., 123 ns per 1500-byte packet) in order to identify the criteria by which the 
packets can be merged. 
The sending offload approach, known as Large Send Offload (LSO) [6], has shown 
remarkable improvement in supporting the 1 Gbps and 10 Gbps line rates. This 
thesis provides a novel approach for generating the packet headers and segmenting the 
packets that are larger than the MTU, before sending them to the Media Access Control 
(MAC) unit. A simple data path algorithm has been developed for sending segments to 
the MAC unit faster. 
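
The segmentation step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the approach developed in the thesis: the function name and the two header fields are hypothetical, and a real header would also carry ports, checksums and flags.

```python
def segment_message(message: bytes, mss: int, start_seq: int = 0):
    """Cut a large message into MSS-sized pieces, each paired with a minimal header.

    The header here is a placeholder dictionary holding only a sequence
    number and a length; it stands in for the generated packet header.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(message), mss):
        chunk = message[offset:offset + mss]
        header = {"seq": start_seq + offset, "length": len(chunk)}
        segments.append((header, chunk))
    return segments


if __name__ == "__main__":
    # 1460 bytes is assumed here as the payload that fits a 1500-byte MTU
    # after 20-byte IP and 20-byte TCP headers.
    for header, chunk in segment_message(b"\x00" * 4000, mss=1460):
        print(header)
```

With these assumptions a 4000-byte message is cut into segments of 1460, 1460 and 1080 bytes, each with its own header, which is the work that the offload moves from the host to the network interface.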
The final contribution is the design of a scalable PPU that supports the proposed LRO and 
LSO at communication speeds of up to 100 Gbps and can support two different network 
protocols, TCP and UDP, without changing the NI structure. A specialised 
high-performance embedded processor core is used as the core engine for the LRO and 
LSO functions within the network interface to support a line rate of up to 
100 Gbps, instead of using parallel homogeneous processing cores [15, 108, 109]. This 
can significantly improve the network interface architecture, simplify the data path and 
reduce the cost. 
This research can open the door to improving the common receive methods which have 
been applied to accelerate protocol processing. Such methods include 
Zero-copy [10] and Direct Cache Access (DCA) [85], which could be combined with this 
receiving approach. These methods would then need to work with fewer headers and less 
data copying. For DCA, fewer headers will be copied into the host CPU’s cache instead of 
a large number of small packet headers. Besides that, there will be fewer calls to copy 
data into the host memory. This could decrease the packet processing needs in the future. 
 
1.6 Research Approach 
In this thesis, consideration is given to the design of the Packet Processing Unit 
(PPU), based on a specialised embedded processor core for inbound and outbound 
processing. The PPU has been designed to avoid interrupting the host CPU [76] several 
times with each arriving packet. Furthermore, the model is capable of supporting 
multiple TCP connections. In addition, appropriate devices have been designed and 
implemented in the proposed PPU to support the cores for high-speed networks, including 
Direct Memory Access (DMA) for data movements and Content Addressable Memory (CAM) 
as a look-up table.  
This research implements algorithms based on the network protocols specified in the RFCs 
[33, 47, 60, 62, 63, 67, 69] for amalgamating packets to form large packets inside the 
PPU. In addition, this research provides an approach for cutting large TCP or UDP 
messages into segments of the Maximum Segment Size (MSS) and generating the packet 
headers for each outgoing segment.  
An R2000/R3000 RISC based on the well-known DLX processor architecture [44] has 
been used to test and extract the instruction set required for the LRO and LSO 
functions. A network interface model has been designed in VHDL, the Very High-Speed 
Integrated Circuit (VHSIC) Hardware Description Language [2, 77, 87]. The IEEE 
1164-1993 standard running on the Xilinx simulation environment [78, 97] is used to test 
the behavioural model of the proposed network interface architecture, ensuring the core 
is using the appropriate instruction set and techniques to eliminate data and branch 
hazards. The embedded processor clock rate required while performing the receiving and 
sending processes and supporting different sized packets of TCP/IP and UDP/IP is 
investigated. 
 
1.7 Organization of the Thesis  
This thesis is organized into nine chapters. The introduction contains a brief history of 
protocol processing, followed by the methods used to improve protocol processing at 10 
Gbps. Chapter 1 also highlights the newly proposed approaches for inbound and outbound 
processing and describes the targeted network interface model design for high-speed 
networks.  
Chapter 2 provides an overview of the protocol processing techniques that have been used 
to enhance and accelerate TCP/IP and UDP/IP processing for high-speed networks. It also 
reviews other research related to the work carried out in this thesis. This 
chapter discusses the methods and ideas for enhancing end node performance by shifting 
part of the TCP/IP and UDP/IP processing to the network interface, using a 
programmable platform. The choice of using a specialised embedded processor within a 
network interface for processing the sent and received packets is discussed. The 
enhancements of the network interface structure, scalability and design are described 
in detail in the following chapters.  
Chapter 3 provides the methodology of the Large Receive Offload that is proposed for 
the TCP/IP and UDP/IP protocols. This chapter reviews LRO and then presents a 
new approach to amalgamating packets in the network interface's buffer. A proposed 
design structure that supports the received packets at high speed rates is also presented 
in this chapter. The SPIM simulator [98], which is based on the R2000/R3000 RISC, is used 
for verifying the proposed LRO processing and extracting the instruction set that is 
required for TCP and UDP. The desired RISC clock rate is also determined.  
Chapter 4 provides the methodology of the Large Send Offload for TCP/IP and UDP/IP. 
The primary design of the network interface for sending an MTU-sized packet from the 
network interface's buffer to the MAC is described. The processing methodology also 
includes the header generation processing for each outgoing packet. Following this, the 
chapter discusses the options that have been chosen to enhance packet processing at the 
network interface. The SPIM simulator is used for verifying the proposed sending 
function.  
Chapter 5 describes the scalable network interface structure that has been designed for 
high-speed networks. The IEEE 1164-1993 standard running on the Xilinx simulation 
environment is chosen for structuring the network interface and testing the behavioural 
model. Very High Speed Integrated Circuit Hardware Description Language (VHDL) is used 
for building the scalable network interface. The structure of the DMA, Content 
Addressable Memory (CAM) and FIFO buffers is addressed in this chapter.   
Chapter 6 describes the specialised embedded processor core with three pipeline stages. 
The instruction set format is also presented. The components' functionality and 
behaviour in the network interface are also covered in this chapter.  
Chapter 7 focuses on the RISC analysis and performance for transmitting and 
receiving packets. The analysis of the packet processing covers TCP/IP and UDP/IP 
headers during the receiving and sending of packets within the designed network 
interface. The processing analysis on the receiving side starts after a valid packet has 
been received and verified. For the sending side, the analysis starts from the point 
where the packet has been allocated inside the network interface’s buffer (the Sending 
Buffer). The total number of embedded processor cycles required to complete TCP/IP or 
UDP/IP packet processing is also presented in this chapter.  Chapter 7 also describes a 
method of enhancing packet processing by reducing the core idle cycles.  
Chapter 8 presents the VHDL results and the RISC clock rate that supports the 
high-speed rate of 100 Gbps. The DMA clock rate required for transferring data 
inside the PPU is also illustrated in Chapter 8. 
Chapter 9 provides the conclusion and introduces directions for future work. A comparison 
between the speed of the proposed RISC and previous approaches is discussed. The 
possibility of integrating the proposed LRO and LSO with other existing 
implementations such as DCA or Zero-copying is also discussed. 
 
 
 
  
 
Chapter 2  
Overview of the Protocol Processing 
 
2.1   Introduction   
The TCP over IP (TCP/IP) or the UDP over IP (UDP/IP) protocol stack is the common 
thread that links today’s Local Area Networks (LANs), Wide Area Networks (WANs) 
and Storage Area Networks (SANs). As network speeds continue to increase, server 
CPUs have to dedicate more processing time to keeping up with network protocol 
packets, rather than processing the user applications or the Operating System needs. The 
protocol stack uses a substantial amount of resources for protocol processing and this 
has become a major cause of bottlenecks in high-speed networks [3, 5, 25, 54]. Protocol 
offloading is one technique that has been used to improve the networking performance 
of a server. This improvement is made by partitioning the protocol stack functions so 
that part of the protocol stack, such as the lower level functions (the network and data 
link layers), is processed on the network interface, leaving the others for the host 
CPU.  Doing so reduces the amount of processing that the host processor usually carries 
out. 
This chapter provides an overview of the existing protocol processing techniques that 
are applied to enhance packet processing. The preliminary design of the proposed network 
interface, which can support a novel offload algorithm for sending and receiving TCP/IP 
and UDP/IP packets at high-speed communication rates of up to 100 Gbps, is also 
highlighted in this chapter. In addition, the scalability that allows the network 
interface to accept changes to the network protocols specified for the Internet 
Protocol [62] and the Transmission Control Protocol [33], or to support a new protocol, 
has been considered.  
       
2.2  Overview of  the Protocol Processing at a Server  
Today’s technology shows remarkable improvements in switches and routers, which send 
packets faster to the end nodes [30, 40, 75]. Whether a server can perform protocol stack 
processing at switch and router speeds is the key to avoiding bottlenecks in the network. 
A previous measurement [3] of server performance found that TCP or UDP protocol 
processing costs around one third of the CPU power. In addition, the TCP/IP overheads 
could be as high as 60 to 70 percent for processors associated with web servers [5]. The 
rule of thumb for network processing (see footnote 1) stipulates that 1 GHz of CPU 
processing frequency is required for 1 Gbps of Ethernet network speed. This high share of 
processing consumed by the TCP or UDP protocols leaves servers unable to keep up with 
high-speed communication.  
Several strategies were applied to enhance protocol processing at the end node to meet 
the requirements of high-speed networks. This study analyses several successful 
strategies for accelerating the TCP/IP or UDP/IP processing for high-speed networks.  
The results of this thesis demonstrate the following findings relating to protocol 
processing: 
                                                 
1 Rules of Thumb in Data Engineering, 
http://research.microsoft.com/~gray/papers/MS_TR_99_100_Rules_of_Thumb_in_Data_Engineering.pdf. 
 
  
 
 
• The host CPU needs time to execute the network protocol, including processing the 
network headers (i.e. the TCP and the IP headers), handling cache misses, performing 
data movements (between kernel space and user space) and handling the interrupts 
that are caused by arriving packets.     
• A network interface (NI) has a limited time in which to pass TCP or UDP packets to 
or from a network. Within this period it must interpret the network header to 
recognise the protocol type of incoming packets, or generate the packet header for 
outgoing packets.  
• Ethernet supports different frame sizes (64 bytes to 1518 bytes). The expected packet 
service time at 100 Gbps depends upon the actual size of the packet.  
 
2.2.1  Host CPU Time Required for Protocol Processing   
Assume that the processing time commences when an application requests a send through 
a system call. The TCP or the UDP layer then copies the packet from the user buffer 
into the kernel buffer. The data checksum computation is also performed in this layer 
(most modern Operating Systems, including Linux, perform data copy and checksum 
computation at the same time for better performance). Following this, packet 
segmentation occurs if the data size is larger than the Maximum Transmission Unit 
(MTU) size [47]. The packet is then handed to the Direct Memory Access (DMA) engine, 
which is commonly used to move data from the kernel buffer to the NI buffer memory.   
When receiving a packet, the packet is moved from the NI to the network buffer in the 
kernel space using the DMA. This means that the buffer descriptor is updated in order to 
find a space within the host memory to allocate the arrived frames or packets.  After the 
completion of the DMA operation, the NI triggers an interrupt. The interrupt descriptor 
is then queued in the backlog queue (this is also known as NAPI in Linux) for further 
processing. The host CPU then spends time handling the interrupt which has been 
caused by the arrived packet. For example, when an Ethernet network operates at 1 Gbps, 
the time required to transmit or receive a single bit (bit-time) is 1.0 ns. If the frame 
occupies 1538 bytes on the line, which is 12,304 bits, a full-size frame can be transmitted 
or received every 12,304 ns (12,304 X 1.0 ns). A system (depending on the type of 
Operating System and the CPU) can require up to 13 or 14 percent of the CPU cycles to 
handle the interrupt [31, 76]. 
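The bit-time arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 8-byte preamble and 12-byte inter-packet gap are taken from the frame format used in this chapter, and the CPU clock passed to `cycles_per_frame` is an assumed value rather than a measured one.

```python
def frame_wire_time_ns(frame_bytes: int, line_rate_bps: float) -> float:
    """Time one Ethernet frame occupies the line, in nanoseconds.

    Adds the 8-byte preamble/start-of-frame delimiter and the 12-byte
    inter-packet gap to the frame size (1518 + 20 = 1538 bytes).
    """
    wire_bits = (frame_bytes + 8 + 12) * 8
    return wire_bits * 1e9 / line_rate_bps


def cycles_per_frame(frame_bytes: int, line_rate_bps: float, cpu_hz: float) -> float:
    """CPU cycles that elapse between back-to-back frames.

    cpu_hz is a hypothetical clock rate chosen by the caller.
    """
    return frame_wire_time_ns(frame_bytes, line_rate_bps) * 1e-9 * cpu_hz


if __name__ == "__main__":
    for rate in (1e9, 10e9, 100e9):
        t = frame_wire_time_ns(1518, rate)
        print(f"{rate / 1e9:5.0f} Gbps: {t:10.2f} ns per full-size frame")
```

At 1 Gbps a full-size frame therefore leaves 12,304 ns for interrupt handling and protocol processing; the same frame leaves one hundredth of that time at 100 Gbps.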
After the interrupts are processed, the host CPU reads the network header of the arrived 
packet that was stored in the memory to identify the data packet. Since the packet 
headers are placed in the host memory instead of the CPU cache, the host CPU is 
required to fetch the packet header from the memory so as to process the arrived packet. 
In this case, each packet will result in one compulsory cache miss. Following the 
processing of the packet header, the host CPU is required to transfer the packet data 
from the kernel space to the related application (user) space. 
 
2.2.2  Protocol Processing Time Used by the Network Interface  
The NI time refers to the time frame which is made available for the NI to process an 
Ethernet frame. The processing engines inside the NI must have sufficient performance 
to finish processing one frame before the next frame arrives.  The NI processing time 
can be divided into two parts. The first is the frame processing time at the Media Access 
Control (MAC) and physical units (data link and physical layers).  The second is the 
packet processing time at the Packet Processing Unit (PPU), where the core processes the 
network headers (network layer). Figure 2.1 describes the Ethernet frame format. The 
TCP/IP or UDP/IP packets are processed at the PPU, while the remaining fields (the blue 
boxes in the figure) are processed within the MAC and physical units. The CRC is 
commonly processed at the link layer, for example in the MAC unit, during the transfer 
of the packet from the MAC unit to the PPU or vice versa [4].    
 
 
 
 
 
 
 
 
 
                            
                             
                     
                                  
 
 
Fields shown in Figure 2.1, in frame order: 
  Preamble and start-of-frame delimiter: 8 bytes (frame header, frame overhead; carried out by the MAC and physical units) 
  Ethernet header: 14 bytes (frame header, frame overhead; carried out by the MAC and physical units) 
  TCP/IP or UDP/IP headers: 40 bytes (TCP/IP) or 28 bytes (UDP/IP) (packet overhead; carried out by the core engine in the NI) 
  Payload (data): variable in size from 6 bytes to 1472 bytes (this part contributes to overall throughput) 
  Ethernet FCS: frame tail (carried out by the MAC and physical units) 
  Ethernet inter-packet gap: 12 bytes (frame tail, data overhead) 
 
Figure 2.1: Ethernet frame format 
 
When a frame arrives, it first passes the MAC and the physical units, before proceeding 
to the packet processing. The frame level units are responsible for validating the frame 
and the packet (e.g., processing the frame’s CRC). For high rate communication lines, it 
is preferred that this unit is implemented within the hardware [80, 83]. The packet level 
process interacts with the TCP, the UDP and the IP headers. The core engine of the NI 
stores a valid packet within the temporary packet storage of the NI, ready to be moved to 
the host memory. Sending processing starts after the large packet is moved from the host 
memory to the NI sending buffer. The NI core passes the packet to the MAC unit, which 
adds the MAC headers and then sends a complete frame to the network.  
 
2.2.3  Packet Processing Time    
Packet structure and size both matter when considering the protocol overheads incurred 
within a node.  The transport layer determines the packet size. Small frames of 64 bytes 
require fewer data movements at the workstation because they carry little data (6 bytes 
of payload [71]). On the other hand, the information used to delimit and verify each 
packet (20 bytes for the TCP header and 20 bytes for the IP header) is a high overhead 
when compared with the payload attached to the packet. The message size and the line 
speed determine the number of frames that may flow over a connection. For example, at 
10 Gbps a device could receive about 14 million 64-byte frames in one second, assuming 
there are no delays or errors in the network [71, 38]. The next section calculates the 
maximum packet flow for line rates from 10 Mbps to 100 Gbps.  
2.2.3.1 Maximum Number of Packets in one Second 
In this section, the maximum number of packets that can be transferred over a line speed 
of 10 Mbps when the frame size is 1518 bytes and 64 bytes, is calculated as follows: 
  
 
 
The time that one maximum-size frame (1518 bytes) occupies a 10 Mbps line is the 
physical frame size (1518 bytes plus the 8-byte preamble, i.e. 1526 bytes) plus the 
12-byte interframe gap (9.6 µs at 10 Mbps), divided by the line rate: 

\[
t_{1518} = \frac{(1526 \times 8) + (12 \times 8)\ \text{bits}}{10 \times 10^{6}\ \text{bits/s}}
         = (12{,}208 + 96) \times 10^{-7}\ \text{s}
         = 1230.4\ \mu\text{s}
\]

The number of maximum-size frames (1518 bytes) in one second is therefore 

\[
\frac{1\ \text{s}}{1230.4\ \mu\text{s}} \approx 812\ \text{frames/s} \tag{2.1}
\]

The same calculation for the minimum frame size of 64 bytes (72 bytes with the 
preamble) gives 

\[
t_{64} = \frac{(72 \times 8) + (12 \times 8)\ \text{bits}}{10 \times 10^{6}\ \text{bits/s}}
       = (576 + 96) \times 10^{-7}\ \text{s}
       = 67.2\ \mu\text{s}
\]

so that the number of minimum-size frames (64 bytes) in one second is 

\[
\frac{1\ \text{s}}{67.2\ \mu\text{s}} \approx 14{,}880\ \text{frames/s} \tag{2.2}
\]
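The frame-rate calculation can be reproduced for any line rate with a short Python sketch. The function and constant names are illustrative only; the 8-byte preamble and 12-byte interframe gap are the values used in the calculation above.

```python
PREAMBLE_BYTES = 8   # preamble and start-of-frame delimiter
GAP_BYTES = 12       # interframe gap expressed in byte times


def max_frames_per_second(frame_bytes: int, line_rate_bps: float) -> float:
    """Theoretical maximum number of frames per second on an error-free line."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + GAP_BYTES) * 8
    return line_rate_bps / wire_bits


if __name__ == "__main__":
    line_rates = {"10 Mbps": 10e6, "1000 Mbps": 1e9, "10 Gbps": 10e9,
                  "40 Gbps": 40e9, "100 Gbps": 100e9}
    for name, bps in line_rates.items():
        row = [int(max_frames_per_second(size, bps))
               for size in (64, 128, 256, 512, 1024, 1518)]
        print(name, row)
```

Because the frame rate scales linearly with the line rate, each tenfold increase in speed multiplies every entry by ten.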
  
When the frame size is 1518 bytes, the maximum number of frames that a node can send 
or receive at 10 Mbps is 812 frames per second (2.1). The number of frames processed 
per second increases to 14,880 when the frame size is 64 bytes (2.2). Table 2.1 shows the 
maximum number of frames per second for line rates from 10 Mbps to 100 Gbps. 
Table 2.1 also highlights the fact that an end node is expected to receive many more 
frames when the line rate exceeds 10 Gbps [3, 28]. The number of frames increases when 
the frame size is decreased. Large frames that carry large amounts of data are suitable 
for various applications such as web servers, NFS and multimedia, especially when these 
applications use zero-copy [8] or direct caching [85] (an overview of zero-copy and direct 
caching is presented later in this chapter).  
Table 2.1:  Maximum frame rate (frames per second) for different line rates and Ethernet frame sizes 

Line rate    64 bytes    128 bytes   256 bytes   512 bytes   1024 bytes  1518 bytes 
10 Mbps      14880       8446        4529        2350        1198        812 
100 Mbps     148810      84460       45290       23497       11974       8127 
1000 Mbps    1488096     844595      452899      234962      119732      81275 
10 Gbps      14880953    8445946     4528986     2349624     1197318     812743 
40 Gbps      59523810    33783784    18115942    9398496     4789272     3250976 
100 Gbps     148809524   84459459    45289855    23496241    11973180    8127438 
 
 
2.2.3.2 Maximum throughput of a workstation is dependent on packet size  
The throughput is the amount of data that the end node can transfer while sending or 
receiving. Table 2.2 shows the effective unidirectional throughput of Gigabit Ethernet 
for minimum-sized raw Ethernet frames and for minimum-sized and maximum-sized 
Ethernet frames carrying TCP/IP data (these figures are available in RFCs, books and 
other sources).  
 
 
 
  
 
 
Table 2.2: Effective unidirectional throughput of Gigabit Ethernet for minimum-sized and 
maximum-sized Ethernet frames 

Columns: (A) minimum-sized “raw” Ethernet frames; (B) minimum-sized Ethernet frames 
carrying TCP/IP data; (C) maximum-sized Ethernet frames carrying TCP/IP data. 

                                         (A)                   (B)                   (C) 
Preamble and Start-of-Frame Delimiter    8 bytes               8 bytes               8 bytes 
Ethernet Header                          14 bytes              14 bytes              14 bytes 
TCP/IP Headers                           N/A                   40 bytes              40 bytes 
Payload                                  46 bytes              6 bytes               1460 bytes 
Ethernet Frame-Check-Sequence            4 bytes               4 bytes               4 bytes 
Ethernet Inter-packet Gap                12 bytes              12 bytes              12 bytes 
Total Packet Size                        64 bytes              64 bytes              1518 bytes 
Actual Bandwidth Consumed                84 bytes (672 bits)   84 bytes (672 bits)   1538 bytes (12,304 bits) 
  (i.e., packet size plus framing bytes) 
Link Speed                               1 Gb/s                1 Gb/s                1 Gb/s 
Theoretical Maximum Frame Rate           1,488,095             1,488,095             81,274 
  (packets per second, approx.) 
Theoretical Maximum Throughput           547 Mb/s              71 Mb/s               949 Mb/s 
 
 
The Theoretical Maximum Throughput (TMT) for 10 Gbps is calculated as follows. For 
any line rate, the TMT is the theoretical maximum frame rate at that line rate multiplied 
by the payload size of each frame in bits: 

\[
\text{TMT} = \text{theoretical maximum frame rate} \times \text{payload size (bytes)} \times 8
\]

At a line rate of 10 Gbps the theoretical maximum frame rate for 64-byte frames is 
14,880,952.38 frames per second. A raw 64-byte frame carries 46 bytes of payload, so 

\[
\text{TMT} = 14{,}880{,}952.38 \times 46 \times 8 \approx 5476\ \text{Mbps} \tag{2.3}
\]

When TCP/IP is used as the network protocol and the frame size is 64 bytes, the payload 
is 6 bytes and 

\[
\text{TMT} = 14{,}880{,}952.38 \times 6 \times 8 \approx 714\ \text{Mbps} \tag{2.4}
\]

When UDP/IP is used as the network protocol and the frame size is 64 bytes, the payload 
is 18 bytes and 

\[
\text{TMT} = 14{,}880{,}952.38 \times 18 \times 8 \approx 2143\ \text{Mbps} \tag{2.5}
\]
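The three throughput figures can be checked with the following Python sketch. The names are illustrative; the header sizes (14-byte Ethernet header, 4-byte FCS, 40-byte TCP/IP headers, 28-byte UDP/IP headers) and the 20 bytes of preamble and inter-packet gap are those listed in Table 2.2 and Figure 2.1.

```python
ETH_HEADER = 14      # Ethernet header, bytes
FCS = 4              # frame check sequence, bytes
FRAMING = 8 + 12     # preamble/SFD plus inter-packet gap, bytes


def tmt_mbps(frame_bytes: int, protocol_header_bytes: int,
             line_rate_bps: float) -> float:
    """Theoretical maximum throughput in Mbps.

    protocol_header_bytes is 0 for raw frames, 40 for TCP/IP, 28 for UDP/IP.
    """
    payload_bytes = frame_bytes - ETH_HEADER - FCS - protocol_header_bytes
    frame_rate = line_rate_bps / ((frame_bytes + FRAMING) * 8)
    return frame_rate * payload_bytes * 8 / 1e6


if __name__ == "__main__":
    for label, hdr in (("raw", 0), ("TCP/IP", 40), ("UDP/IP", 28)):
        print(f"{label:7s} 64-byte frames at 10 Gbps: "
              f"{tmt_mbps(64, hdr, 10e9):8.1f} Mbps")
```

Calling `tmt_mbps(1518, 40, 1e9)` gives roughly 949 Mbps, which matches the maximum-sized TCP/IP column of Table 2.2.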
 
From equations 2.3, 2.4 and 2.5, it is clear that the TMT increases when there is less 
protocol overhead. For example, the TMT increases when using the UDP protocol 
header, which is only 8 bytes whilst the TCP header is 20 bytes. To illustrate this 
phenomenon, another example can be cited. Figure 2.2 presents the TMT for different 
Ethernet frame sizes from 64 to 1518 bytes. From the chart, it is clear that the TMT 
increases when the packet size gets larger.   
Maximum Transmission Unit (MTU) discovery [47] allows packets to be encapsulated in a 
larger size (e.g., from 512 bytes to 1500 bytes). Frames larger than 512 bytes provide 
greater throughput and require less processing than small frames. With large frames 
(e.g., 1024-1518 bytes) a node can reach its peak throughput. In addition, when the MTU 
is set to 512 bytes or larger, fewer packet headers are required for the same application 
data than if the MTU is less than 512 bytes [28].  
 
  
 
 
 
Figure 2.2: Theoretical maximum throughput for 10 Gbps 
 
   
2.2.4  Reducing Protocol Overhead and Enhancing Server 
Performance  
Enhancements to server performance can be applied in the host area, for example by 
increasing host CPU power, controlling data movements and reducing the number of 
cache misses that are caused by each arrived packet. Other performance enhancements 
can be achieved by reducing the packet processing time at the NI. This research has 
found that various solutions have been presented in the past for increasing the stack 
processing performance. These efforts can be divided into three groups:   
1. Techniques that have been used in the host area.   
2. Techniques that have been applied to the frame size.   
3. Techniques that have been implemented in an NI.   
[Figure 2.2 chart: maximum throughput (Gbps, axis from 0 to 11) against frame size (bytes: 64, 128, 256, 1024, 1500) for TCP/IP and UDP/IP] 
  
 
 
2.2.5 Techniques Used for Protocol Processing in the Host Area 
The time available at an end node to complete the aforementioned processing of a 
maximum-size Ethernet frame (1518 bytes) is 1230.4 ns per frame when the line speed is 
10 Gbps and 123.04 ns when the communication speed is 100 Gbps. In contrast, the time 
available for the shortest Ethernet frame (64 bytes) is about 67.2 ns at a line rate of 10 
Gbps and only about 6.7 ns at 100 Gbps [71]. With 10 Gigabit communications, the host 
processor is unable to complete protocol processing when the MTU becomes 128 bytes or 
less [27, 28, 34]. Protocol processing may consume more than half of the capacity of the 
processor when the frame size is 1518 bytes [3, 5].  
One approach adopted in recent studies to enhance system performance is the use of an 
additional general-purpose processor at the node for protocol processing. This technique 
is known as TCP onload processing [6, 49]. TCP onload processing refers to the use of a 
chip multiprocessor (CMP) or a symmetric multiprocessor (SMP) to complete the tasks 
related to protocol processing [121]. Microsoft, for instance, increases the performance of 
Windows Server 2008 by queuing incoming packets to multiple CPUs. This is known as 
Receive Side Scaling (RSS) [7]. The RSS method works by allowing the NI to manage 
multiple hardware queues in order to host a number of incoming packets [19, 20]. The NI 
driver interrupts the host system to read the frames or packets that are stored in the host 
memory. With the RSS mechanism, the network driver, together with the network card, 
distributes incoming packets among the processors on the end node. The Interrupt 
Service Routine (ISR) signals the proper CPUs to handle these frames (or packets) 
through Deferred Procedure Calls (DPCs). Packets that belong to the same TCP 
connection are directed to the same processor.   
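The property that packets of one TCP connection always reach the same processor can be illustrated with a small Python sketch. It is not the RSS implementation of any NI or driver: the function name is hypothetical and `zlib.crc32` is used only as a convenient stand-in for the hash that a real device would apply to the connection identifiers.

```python
import zlib


def select_queue(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                 num_queues: int) -> int:
    """Map a connection 4-tuple to a receive queue (and hence a CPU).

    zlib.crc32 is a stand-in hash for illustration only.
    """
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode("ascii")
    return zlib.crc32(key) % num_queues


if __name__ == "__main__":
    # Every packet of the same connection yields the same queue index.
    for _ in range(3):
        print(select_queue("10.0.0.1", 40000, "10.0.0.2", 80, num_queues=4))
```

Because the queue index depends only on the connection identifiers, per-connection packet order is preserved while different connections spread across the available processors.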
Another technique which has been used in the host area is zero-copy of packet data [8, 
11, 12]. Zero-copy is used to increase server performance by reducing the copying of data 
travelling to or from host memory. Applying zero-copy in a server reduces the 
time-consuming switching between user space and kernel space and improves the 
performance of web server workloads, such as SPECweb2005, by 44 percent [10].  
The Direct Cache Access (DCA) [85] approach is also applied to enhance system 
performance. The DCA provides a mechanism for a workstation to indicate that incoming 
data is targeted for a CPU cache. Normally, when packets arrive at a workstation they 
are stored within the host memory. The host CPU then needs to fetch the network 
headers from the host memory in order to locate the payload part of the message. The 
DCA technique therefore places a copy of the incoming network headers into the CPU 
cache, which eliminates the cache misses that occur during packet processing.  This 
method has helped applications such as web servers, which deal with many connections 
and successions of different IP sources and ports [17]. The approaches that are applied in 
the host area enhance the server’s performance by addressing a number of challenges, 
including data movements [6] between network, user and application space and between 
host memory and the CPU cache. However, these techniques focus on data copying or 
moving and do not involve header processing on layer two or three of the protocol stack. 
In addition, these techniques do not provide any level of control over the processor's 
interrupts.  
  
 
 
2.2.6 Extending the Ethernet Frame Size     
The second strategy for accelerating stack processing improves the system’s I/O 
performance by increasing the payload size of the Ethernet frame to 9000 bytes (the 
Jumbo frame). Increasing the frame size reduces the number of packets that a system has 
to handle. Only one packet of 9000 bytes needs to be processed, as opposed to six packets 
with six headers if the MTU is 1500 bytes (6 X 1500 bytes = 9000 bytes). This improves 
the performance of the end nodes and reduces protocol overheads.  
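The reduction in the number of packets can be estimated with a short Python sketch. The function name is illustrative, and the sketch assumes that every packet carries 40 bytes of TCP/IP headers inside the MTU, so the ratio is close to, but not exactly, the factor of six quoted above.

```python
def packets_needed(data_bytes: int, mtu: int, header_bytes: int = 40) -> int:
    """Number of packets required to carry data_bytes of application data.

    Assumes each packet loses header_bytes of the MTU to TCP/IP headers.
    """
    payload_per_packet = mtu - header_bytes
    # Integer ceiling division avoids floating-point rounding.
    return (data_bytes + payload_per_packet - 1) // payload_per_packet


if __name__ == "__main__":
    data = 1_000_000  # one megabyte of application data (arbitrary example)
    standard = packets_needed(data, mtu=1500)
    jumbo = packets_needed(data, mtu=9000)
    print(f"MTU 1500: {standard} packets, MTU 9000: {jumbo} packets")
```

Each packet that is avoided removes one header to parse and, without interrupt moderation, one interrupt to service.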
   
2.2.7  Techniques Implemented Inside the Network Interface 
Offloading part of the protocol processing to an NI is another technique to improve a 
system’s performance [120]. This technique works by reducing the amount of protocol 
processing performed by the host CPU, with part of the TCP/IP or UDP/IP protocol 
processing passed to the NI. The common offload techniques are as follows:  
a) Checksum Offloading [4, 9, 35] is a technique that moves the computation and 
verification of packet checksums to the NI. Checksumming is a necessary operation 
and is generally carried out by the host CPU. However, it is costly in terms of the 
processor time needed to perform it. Adding this service to the NI is becoming 
common because it does not add complexity or cost to the NI. Commonly, the CRC 
is implemented in hardware, within the DMA transfer or during the copying of the 
packet. This offloading technique can improve end node throughput by up to 33 
percent [35].  
b) Large Segment Offloading (LSO) is a technique implemented to reduce overhead 
processing. The advantages of this approach are apparent when a host CPU sends 
data larger than the Maximum Segment Size (MSS) (e.g., more than 1460 bytes) to 
other end nodes in a network. The LSO service is added to the NI in order to 
complete the task of sending data that is larger than the negotiated MSS. This 
service divides oversized messages into segments no larger than the MSS and then 
generates the network headers which need to be sent with each segment. The LSO 
approach can improve system performance by up to 50 percent [6].     
c) Remote Direct Memory Access (RDMA) [122] is a direct transmission of data from 
the memory of one device to the memory of another. Where possible, this is 
accomplished without the intervention of the Operating System of either device. 
This idea is mainly aimed at reducing the data transfer part of protocol processing. 
RDMA has been implemented in an NI [39] and a router [122], where it is used to 
place the data arriving from the other end into the main memory of the device. This 
application therefore requires coordination between the main processor and the core 
engine in the NI in terms of data exchange. For example, buffer descriptors must be 
supplied to the NI in order to allocate data inside the host memory. RDMA 
facilitates this process by reducing the demand on bandwidth.   
d) Newer NI drivers provide an advanced feature known as interrupt moderation (IM) 
[31] or interrupt avoidance. Normally, the host CPU is interrupted after each arrived 
frame or packet, so the interrupt rate increases significantly as the number of packets 
escalates. Host processors then become incapable of handling both the network 
header processing and the interrupts [28, 76]. These interrupts usually inform the 
host CPU that packets have arrived and are in the temporary queue. Without the 
IM, the receiving procedure is likely to interrupt the host processor after each packet 
arrival, which ultimately results in the dropping of messages [76]. Dropped packets 
require the TCP sender to re-send them.  The IM instead holds a number of packets 
or frames in the NI buffer (or in the host side buffer) before it initiates the interrupt. 
Modern NIs that support this service host the incoming packets from the network 
within a temporary buffer before requesting the interrupt. This process has made 
fast networks more efficient [31]. 
e) Other offloading techniques can even hand all the TCP/IP functions to the NI [22, 
23, 24]. This is known as the TCP Offload Engine (TOE). The TOE style leaves 
more time for a host processor to run other applications instead of handling the 
protocol functions. This is done by allowing most of the protocol functions, such as 
those of the data link and network layers, to be executed in the NI.    
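To make the cost behind item (a) concrete, the following Python sketch computes the 16-bit ones' complement checksum used by the IP, TCP and UDP headers. It is a plain software illustration of the operation that checksum offloading moves to the NI, not the hardware design discussed in this thesis; the function name and the sample header bytes are illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (checksum field set to zero)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold the carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    # Sample 20-byte IPv4 header with its checksum field zeroed.
    header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
    print(hex(internet_checksum(header)))
```

The loop touches every 16-bit word of the data, which is why performing it on the host for every packet consumes a noticeable share of CPU time at high packet rates.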
In particular, offload strategies provide more features to the end node, including an 
increase in end node throughput by reducing the overhead and enhancement of end node 
performance [5, 16, 17]. In addition, while using offload techniques, a CPU can devote 
more cycles to other Operating System tasks [31].      
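The segmentation step described for LSO in item (b) can be sketched as follows. This is a minimal illustration under stated assumptions, not the LSO algorithm proposed in this thesis: the function name and the dictionary layout are hypothetical, header generation is reduced to assigning a sequence number to each segment, and an MSS of 1460 bytes is assumed.

```python
def segment_payload(payload: bytes, mss: int = 1460, initial_seq: int = 0):
    """Split an oversized send buffer into MSS-sized segments.

    Returns a list of dicts holding the TCP sequence number that the header
    of each segment would carry and the segment data itself.
    """
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        segments.append({
            "seq": (initial_seq + offset) % 2**32,  # sequence numbers wrap at 32 bits
            "data": chunk,
        })
    return segments


if __name__ == "__main__":
    for seg in segment_payload(bytes(4000)):
        print(seg["seq"], len(seg["data"]))
```

With offloading, the host hands the NI one large buffer and a header template, and a loop of this kind runs on the NI instead of on the host CPU.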
 
  
 
 
2.3 An Analysis of the Existing Protocol Processing Solutions for 
Gigabit Networks    
The previous section addressed the TCP/IP and UDP/IP processing challenges and the 
solutions that have been proposed for them. These solutions range from host area 
processing to NI offload. The solutions that apply to the host domain focus on two 
crucial concepts. The first concept refers to the data transfer between the kernel and the 
user space. Transferring data between the protocol layers or buffers requires several 
steps. For instance, the host CPU is responsible for determining which procedure is used 
to transfer data. There are two approaches for data movement: either the CPU delegates 
the transfer to the Direct Memory Access (DMA) controller, or it uses Programmed I/O 
(PIO) [44]. Using the DMA controller is preferred; the CPU is then required to provide 
certain information to the DMA, such as data size, address and destination.  According 
to the analysis of TCP processing overhead by Clark et al. [27], the data copy itself may 
cost up to 200 μs when a copy occurs between user and system and 386 μs when a copy 
occurs between the network and memory. The copying time could turn a node into a 
bottleneck, since each packet requires a subroutine to move its payload from one space 
in the host memory to another. Decreasing the number of packets is one way to reduce 
data copying.    
The second concept for reducing packet handling overhead in the host area is to provide 
the host CPU with a copy of the information that arrives from the network directly in the 
CPU cache [85]. Although this approach enhances end node performance, it focuses only 
on the upper layer of the protocol (the application layer) and does not provide any 
solutions for the network or data link layers; these need to be improved to allow the end 
node to deal with fast networks. In particular, a small frame carries a small amount of 
data and therefore requires fewer data movements at the end node, but the information 
used by the headers, which is 40 bytes (TCP/IP) or 28 bytes (UDP/IP), is a high 
overhead when compared with the payload that is attached to each packet 
(approximately 6 bytes) [25].  
Increasing the payload frame size to 9000 bytes improves system I/O performance and 
reduces header processing. However, this approach has not been universally deployed, 
since the original IEEE 802.3 specifications [47, 90] define a valid Ethernet frame size of 
at most 1518 bytes and most NIs still support only an MTU of 1500 bytes. An increase 
in payload size can cause some negative impacts on Gigabit networks. Firstly, when the 
packet is larger than the 1518 bytes which the router can deliver to the destination, the 
router divides the packet into separate fragments which are compatible with the port line 
that will deliver them on the network. In this case, information is added to each 
fragment in order to facilitate the reassembly process. Secondly, the whole packet must 
be re-sent if one of the small fragments is lost. Thirdly, fragmentation imposes an 
additional processing burden on routers, which adds overhead on the network. 
Furthermore, fragments may be filtered by a firewall, because they do not include the 
necessary control information such as the TCP header.  
Finally, partitioning the processing of the protocol by allowing some functions to be 
processed on the NI and leaving others for the host processor enhances the processing 
performance at an end node [15, 16, 20]. There are also other methods that allow the NI 
to offload all the protocol processing without host processor involvement [5, 23].  
Offloading all the TCP processing functions from a server CPU to an NI helps the host 
CPU to focus on processing application requests rather than network protocols. 
Nevertheless, shifting the entire TCP/IP or UDP/IP functions to the NI is a more 
complicated solution and involves a significant change in the Operating System. 
Furthermore, there are a number of issues that need addressing when shifting the entire 
TCP/IP or UDP/IP functions to the NI. Such issues include security, Moore’s law, 
performance, flexibility and cost [14]. 
  
2.4 Aims of the Research  
As network speeds increase from 10 Gbps to 100 Gbps, packet processing at the end 
node requires more attention in order to support the higher speed. Enhancement of 
the Packet Processing Unit (PPU) at the NI for processing the incoming and outgoing 
TCP/IP or UDP/IP packets has been studied. 
The aim of this thesis is to propose a novel 32-bit Reduced Instruction Set Computer 
(RISC) processor, based on the well known DLX processor architecture [44], as the core 
engine for the PPU. The specialised core of the extended 32-bit RISC based processor has 
the capabilities to support the outgoing and incoming packet processing for high speed 
networks at 100 Gbps. In order to achieve this goal, a novel algorithm will be developed 
for the Large Receive Offload (LRO) function on the NI. The LRO function reassembles 
the incoming packets that belong to the same stream (e.g., having the same IP addresses 
and port numbers) to form a single large packet inside the NI’s buffer before the 
Interrupt Moderation (IM) [31] threshold is reached. The amalgamated data is then sent 
to the host for further processing. For outgoing packets, an appropriate algorithm will be 
provided for the Large Send Offload (LSO) that segments or fragments the outgoing 
TCP/IP or UDP/IP packets to the MTU before delivering them to the MAC unit. Other 
units, such as the DMA and the Content Addressable Memory (CAM), are also used to 
support the RISC core while processing the LRO and LSO for TCP/IP and UDP/IP.  
 
2.5  Research Contributions   
One contribution of this thesis is the combination of the Jumbo frame approach and the 
offload approach to develop a novel algorithm for the LRO function on the NI that 
reassembles the incoming packets. This algorithm is flexible enough to support different 
types of protocols, such as TCP/IP and UDP/IP. In addition, this implementation is 
designed to manage out-of-order packets (packets that arrive non-sequentially at the 
destination [73, 88]) before delivering them to stack processing. For outgoing packets, 
the LSO processing provides an appropriate algorithm that generates the packet headers 
faster for each segmented or fragmented packet.  
The majority of commercial core engines are designed as multi-core parallel processors, 
such as EZChip’s NP-1-4, Intel's IXP1200, 2400, 2800, 2850 NPs, IBM’s Power NP, 
Motorola’s C-5 NP and many others. Using a single processor for packet processing at 
the NI for high speed networks, instead of multiple identical general purpose RISC 
processor cores, is another contribution of this research.  Such a RISC has fewer pipeline 
stages, fewer instructions and a simple control unit. These features make the RISC 
capable of supporting a wide range of transmission line speeds of up to 100 Gbps. Using 
a specialised core reduces the design complexity and the cost of designing the NI. This 
study can therefore open a door for using specialised cores for next generation networks. 
 
2.6 Offloading Processing and Related Work 
2.6.1 Large Receive Offload in the Virtual Driver  
Receive-side processing is not usually offloaded to the NI. This is not only because of 
the potential for out-of-order packets, but also due to the difficulty of implementing it 
in the NI.  In spite of this, several researchers have succeeded in implementing their 
work in an NI for the receiver side. One line of work has suggested a network stack 
optimisation called Large Receive Offload (LRO) on the host side [51, 52]. LRO is a 
software driver feature in the Linux platform.  Intel implements it to reduce the number 
of TCP packets arriving at the host stack.  LRO combines packets of the same stream 
into large packets inside the host memory. The host CPU then passes these large 
segments to the network stack. LRO improves the handling of incoming messages that 
are sent to the processor because it reduces header processing. As a result, the network 
protocol then treats the assembled group of TCP packets as a single packet within the 
host memory.   
  
 
 
    
2.6.2  Receive Side Coalescing  
Receive Side Coalescing (RSC) [36, 37] is another successful approach to facilitating
packet processing on the receiving side. It is a stateless, software-based offload
mechanism that identifies incoming packets by reading the network header.
RSC then combines the arrived TCP packets into large packets inside the host
memory, and the host CPU completes the processing of the large packets.
Consequently, RSC reduces the number of packets that a host CPU has to process for a
TCP/IP stream by around 20 percent, which also reduces CPU cycles [36].
 
2.6.3 Large Receive Offload Concerns in High-Speed Networks   
LRO and RSC improve the inbound processing of TCP packets on high-speed Ethernet
lines of up to 10 Gbps. However, the packets (header and payload) are still transferred
from the NI's buffer to kernel space by the DMA device, which involves initiating the DMA
to carry each packet to its allocated place in the host memory. These initiations may cost
the host CPU around 300 ns to complete a message with a size of 32 KB [50]. In addition,
the packet headers (40 bytes) are moved across the system bus with each data transfer,
which can reduce bandwidth utilization.
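The header cost mentioned above can be made concrete with a small calculation. The following is a minimal sketch, assuming a 32 KB message of 32768 bytes and a TCP payload of 1460 bytes per packet; the function name and the example message size are illustrative assumptions, not part of the cited measurements.

```python
import math

HEADER_BYTES = 40   # TCP/IP header size quoted in the text
MSS_BYTES = 1460    # assumed TCP payload per packet for a 1500-byte MTU


def header_bytes_on_bus(message_bytes, mss=MSS_BYTES, header=HEADER_BYTES):
    """Return (packet_count, total_header_bytes) for one message.

    Every packet that crosses the system bus carries its own header,
    so the header cost grows linearly with the number of packets.
    """
    packets = math.ceil(message_bytes / mss)
    return packets, packets * header


# Assumed example: a 32 KB (32768-byte) message sent as MSS-sized packets.
packets, overhead = header_bytes_on_bus(32 * 1024)
print(f"{packets} packets carry {overhead} header bytes across the bus")
```

If the same payload crossed the bus as one coalesced packet, only a single 40-byte header would accompany it, which is the saving that coalescing inside the NI aims for.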
The NI buffer descriptor requires regular updates in order to obtain empty space in
the host memory in which to place the arrived packets. Every TCP/IP stack specifies an
amount of buffer space in which to store the data temporarily. If
the receive buffer is set to zero, the application has to ensure that it provides buffers to
store the arrived packets. If the processor is busy executing applications other than the
LRO code while messages continue to arrive, the buffer becomes full
and cannot accept new messages. Late processing of the buffered messages could cause
messages to be dropped and lead to the node becoming a bottleneck in the network.
Furthermore, LRO and RSC do not provide any algorithm to support protocols
other than TCP. In addition, out-of-order packets force RSC and LRO to stop
functioning, so that a number of single packets are passed on instead of being coalesced
into large packets. As a result, more TCP/IP headers need to be processed by the host CPU.
 
2.6.4 Large Segment Offload (LSO) Enhancements   
For outgoing packets, the NI engine fragments or segments a large message (larger than the
MTU) that has been sent by the host CPU to the NI buffer. It then generates the
packet headers, attaches the payload part (the MSS: 1460 bytes for a TCP segment or 1472
bytes for a UDP fragment) to each generated header and finally sends the packet to the
MAC unit to be transmitted on the network line.
Transmissions are made according to the protocol type. For instance, the TCP/IP
protocol uses two identifiers in each packet: the Sequence Number (SN) and the
Acknowledgment Number (AKN). The first segment carries the starting sequence
number of the data. Segments also carry an AKN, which is the SN of the next expected
data portion of the transmission.
LSO is designed for the transmit side [68, 74]. The LSO mechanism works by shifting
the higher-layer transmit processing to the NI, where the core engine is
responsible for handling the tasks related to the transport layer. In an LSO
implementation, a datagram of up to 64 KB can be sent to the
NI buffer [129]. At the NI, the core engine first reads all the information related to the
transferred datagram, including the position of the message inside the NI buffer. The core
engine then examines the datagram size. If the datagram is larger than the MSS, the core
engine generates the network headers for each MSS-sized segment. The smart NI stores a
TCP/IP header template that holds the IP total length and the initial SN for each outgoing
MSS [68]. A copy of the template header is made whenever there is segmented data
to be sent to the network. The core engine updates the essential fields inside the copied
TCP and IP headers, such as the SN and the total length of the datagram, before sending a
packet. It then attaches the header copy to the MSS to create a complete packet and
sends it to the MAC unit. Since the template packet headers are stored in the
internal core engine at the NI, the core needs to perform a set of data transfer activities.
This costs additional CPU cycles, as the core is required to copy the headers from
its internal buffer to the MAC buffer for each segment. The packet CRCs, which are
required by the TCP, UDP and IP headers for packet validation, are commonly
implemented in the MAC unit, where the CRC is added to the header using a technique
known as by-pass.
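The template-based header generation described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited NI: the `HeaderTemplate` class and the `segment` function are hypothetical names, the header is reduced to the two fields mentioned in the text (SN and IP total length), and a 40-byte TCP/IP header without options is assumed.

```python
from dataclasses import dataclass, replace

IP_TCP_HEADER_BYTES = 40  # assumed: 20-byte IP header + 20-byte TCP header, no options


@dataclass(frozen=True)
class HeaderTemplate:
    """Hypothetical stand-in for the stored TCP/IP header template."""
    seq_number: int     # TCP sequence number (SN) of the first payload byte
    total_length: int   # IP total length: headers plus payload


def segment(payload: bytes, initial_sn: int, mss: int = 1460):
    """Split a large datagram into MSS-sized packets.

    For every segment, the template is copied and only the fields that
    differ between segments (SN and total length) are updated.
    """
    template = HeaderTemplate(seq_number=initial_sn, total_length=0)
    packets = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        header = replace(
            template,
            seq_number=(initial_sn + offset) % 2**32,  # the SN wraps at 32 bits
            total_length=IP_TCP_HEADER_BYTES + len(chunk),
        )
        packets.append((header, chunk))
    return packets


# Example: a 4000-byte message becomes two full segments and one short one.
for header, chunk in segment(bytes(4000), initial_sn=1000):
    print(header.seq_number, header.total_length, len(chunk))
```

Copying a template and patching two fields is cheaper than building each header from scratch, which is the motivation given in the text for keeping the template inside the NI.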
  
 
 
       
2.7 Programmable Packet Processor Design Methodology 
At every workstation or server, the NIs are connected to the I/O bus [59]. The NI acts as
an interface between the host CPU and the network: it receives messages from the
host and transfers them to the network, and vice versa. The packet processing
normally has two parts. The first part is the MAC Line Interface, which connects the
packet processing to the frame processing. The second part is the Host Interface (HI),
which connects the NI to the host (Figure 2.3). The HI serves as a buffer between the NI
and the host for receiving and transmitting TCP/IP packets. The TCP/IP protocols are
traditionally implemented in software as part of the Operating System, and the host CPU
is responsible for executing them. When the transmission line runs at a moderate speed
such as 1 Gbps, the design and implementation of an NI is straightforward: it sends
packets to or from the host without any header processing [19, 20].
 
                                    
 
 
                                                                                                              
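Figure 2.3: Workstation architecture
[Figure not reproduced. Labelled components: Host CPU, Cache, Host Memory, I/O Bus, and
the Network Interface, which contains the Host Interface and Core Engine (Packet
Processing) and the MAC and Phy units (Frame Processing) connected to the Network.]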
 
 
 
However, if the transmission line runs at a high speed and the functions
processed by the host are large, performing all of the processing on the host reduces its
ability to perform its normal operations. An interface-based processing capability
that offloads this burden from the host therefore becomes particularly important. There
are two possible methods for processing the offloaded protocol: hardware-based
and programmable-based. The following is a snapshot of both methods:
- Hardware-based design
  o Application Specific Integrated Circuit (ASIC) controller [21, 65]
  o Field Programmable Gate Arrays (FPGAs) [19, 29, 110]
- Programmable-based design
  o General-purpose embedded processor [15, 39, 41, 42, 43, 107]
  o Specialised engine core [23]
2.7.1  Network Interface Hardware-based   
Technological advances in chip design, specifically in Application
Specific Integrated Circuit (ASIC) design, have made it possible to combine most, if not
all, of the discrete components required for an NI onto a single chip. Furthermore,
an ASIC-based NI provides greater energy efficiency, better integration and lower
costs. Nevertheless, ASIC-based NIs have difficulty accepting new protocols
without parts of the chip being redesigned. Field Programmable Gate Array
(FPGA) platforms are reconfigurable hardware. When building an NI, this feature makes
the FPGA approach more attractive to designers than ASIC-based designs. Even
though FPGAs offer this reconfigurability, they still carry multiple overheads compared
to ASICs: an FPGA requires more transistors than an ASIC to implement the same design,
which adds latency and increases the power consumption of the NI design. In addition,
FPGAs are more sensitive to coding styles and construction practices [66]. ASICs, for
their part, have limited flexibility and upgradability and make design-specific tailoring
of the NI complex.
 
 
 
2.7.2  Network Interface Programmable-based  
Designing the NI using the programmable approach can potentially improve server
performance [16], because it allows more flexibility in adjusting network protocol
functions. These adjustments can be made in the NI by modifying the code that
performs the protocol processing. General-purpose embedded processors, for instance,
may not provide the same level of performance as the other methods, but they
are more flexible and can easily accommodate protocol revisions or additional protocols.
The availability of general-purpose processors has contributed to low development costs
for NIs. Nevertheless, the wider use of hardware-based NIs (such as a fully customised
logic-based NI) can be expected for the following reasons:
1) There is a lack of studies on whether processor performance can match
next-generation networks, such as 40 Gbps or 100 Gbps, at a low clock rate.
2) The new trend in NI design is to have all the NI functions implemented on
one chip [32].
3) Using a group of lower-speed cooperating NP processors to support
high-speed networks is more complex than using a single core [16, 108, 109].
 
2.8 Programmable Approaches for Packet Processing
This study investigates the possibility of improving the structure of the target NI
by using a simple data path, rather than a complex structure, for inbound and outbound
processing. Using a single core as a network processor (NP) can simplify the design
while providing the performance required. However, the clock rate of such a core must
increase considerably when protocol functions are shifted to the NI [23]. This is the
reason why parallelism and hardware accelerators can be considered an option. With a
new generation of high-speed networks going beyond 10 Gbps, packet rates increase and
the time between packet arrivals shrinks. During the course of this research,
different programmable-based NPs for Gigabit networks have been studied. Table 2.3
describes the NP cores that are used for high-speed networks.
 
 
 
  
 
 
Table 2.3: Some of the Network Processor cores that researchers have used for high-speed networks

| Processor Type | Number of Processors | Clock Speed of the Core Processor | Type of Services Processed by the Core in the Network Interface | Line Speed Supported | Offload Functions |
|---|---|---|---|---|---|
| Specialized core [23] | One | 5000 MHz | TCP offload only | Up to 10 Gbps | Yes |
| Specialized RISC [107] | One | 2000 MHz | IP routing process and the packet size is 512 B | Up to 100 Gbps | No |
| Intel 82597EX / 1310 [58] | Two | 2000 to 3200 MHz | Forwarding TCP/IP packet to/from Network interface | Over 7 Gbps | No |
| RISC [43] Intel IOP 310 (80200/80312) | Two | 733 MHz | Forwarding TCP packet to/from Network interface | Up to 5 Gbps | No |
| Broadcom (SB1250) [125] | Two | 600 to 1000 MHz | Support TCP offload | Up to 10 Gbps | Yes |
| EZchip NP4 [123] | Two | 365 MHz | Packet forwarding. Determine the output path for incoming packet | Up to 100 Gbps | No |
| RISC processors [42] | Six | 166 MHz | Forwarding UDP/IP packet only to/from Network interface | Up to 10 Gbps | No |
| Intel NP IXP 1200-RISC [124] | Six | 232 MHz | Fast IP router/switch | Up to 100 Gbps | No |
| Intel IXP 2800 [109] | Sixteen | 1000 to 1400 MHz | Programmed to deliver intelligent transmit and receive processing | Up to 100 Gbps | No |
 
2.8.1  Network Interface Using a Single Processor    
Hoskote and his team designed a single processor as a core engine to offload inbound TCP
processing at 10 Gbps [23]. Header processing for minimum-sized packets requires a
5 GHz processor at 10 Gbps. To improve performance, this design applies two
different clock rates to balance data processing at 10 Gbps. The main clock rate is
applied to the units that accelerate data processing, such as the context look-up block and
the transmission control. The minor clock rate is assigned to the other units that have
fewer processing tasks, such as the packet reorder units or the working registers. The NI
also supports the service processor with a CAM, which contains all the connection
information for the connections that the host has made with other ends. However, the
scope of that work is limited to data handling and does not consider any solutions for data
transfer. The simulation focused on the use of small-sized packets in the TCP offload. TCP
carries around 6 bytes in the payload area for small-sized packets, which could require four
CPU cycles over the 64-bit bus in order to move this amount of data from one location to
another. Large packets (1500 bytes) require about 183 CPU cycles over the local 64-bit
bus (a 1460-byte payload = 11680 bits, moved 64 bits at a time). The simulation also offers
no solution for other protocols such as UDP.
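The figure of about 183 cycles for a large packet follows directly from the bus width. A minimal sketch of the arithmetic, assuming one bus-width transfer per cycle and no arbitration or setup overhead (the function name is illustrative):

```python
def bus_transfer_cycles(payload_bytes, bus_width_bits=64):
    """Cycles needed to move a payload over the bus, one bus word per cycle.

    Integer ceiling division is used because a partially filled last
    word still occupies a whole cycle.
    """
    payload_bits = payload_bytes * 8
    return (payload_bits + bus_width_bits - 1) // bus_width_bits


# 1460 bytes = 11680 bits; 11680 / 64 = 182.5, rounded up to a whole cycle.
print(bus_transfer_cycles(1460))
```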
Another network processing approach was designed by Jakimovska, who used a single
core to process IP routing functions for 512-byte packets on a 100
Gbps network [107]. A core engine running at 2000 MHz is used in the router. There is
no TCP or UDP header processing involved in Jakimovska's work.
2.8.2 Using a General-Purpose Processor Supported by a DMA for 
Transferring Packets to or from a Network  
Los Alamos National Laboratories (LANL) [58] has studied the Intel 82597EX as a
processor core for the NI. This NI focuses on implementing the checksumming
function. The LANL design uses DMA to support the core engine with data transfers inside
the NI. This feature reduces the processing that the host requires for transferring TCP
data from inside the NI. This NI gives a remarkable result of up to 7 Gbps, but the
adaptor does not offer any solution for offloading TCP or UDP functions
other than the checksumming of packet headers.
 
2.8.3 Using a Dual Core Engine supported with DMA for UDP 
Protocol 
Using dual processors as a core engine inside the NI is another strategy for protocol
processing. The designers Hyong, Vijay and Scott developed the Tigon programmable
core, which was released in 1997 [39]. This core engine depends on two 88 MHz MIPS
R4000-based processors to complete data processing. Hyong uses two
DMAs (one for incoming and one for outgoing packets) to transfer data inside the NI. The
work claimed that end-node throughput improved by 65 percent when large frames
are used and by 157 percent when small frames are used. The framework focuses
on the UDP protocol, which needs less processing than TCP. Moreover, it does not
support any type of offload function. Rather, it only sends packets from the host memory
to the network or delivers received packets to the host. For outbound processing, the
host CPU accesses the main memory and verifies the data that needs to be sent. Next, the
host CPU generates the network headers of the first segment of the datagram, and then
initiates the DMA to move the packet (header + payload section). During this transfer,
only the DMA occupies the system bus, and the CPU needs to wait until the bus is
entirely released by the DMA. The remaining packets follow the same sequence of
processing. The NI core engine then sends the packet to the MAC unit to be delivered to
the network. After that, an interrupt is sent to the host CPU. When a packet is
received, the NI controller interrupts the host CPU after the packet has been placed inside
the receive buffer. Next, the host CPU pulls the data from the host memory. However,
a larger number of interrupts may reduce the host CPU's performance [31]. Another
disadvantage of this method is the need for two RISCs to complete packet processing
inside the NI for inbound or outbound processing. If one of these processors wants to read
local memory through the local bus, the other processor has to wait (idle cycles)
until the bus is released and the memory becomes available. In addition, protocol
processing has to be distributed between the two processors. In practice, only one
processor at a time can initiate a DMA or interrupt the host CPU. These initiations do not
have hardware support for concurrency and use the only available lock. Therefore, each
access to the shared data structures requires synchronization, which is extremely
complicated because the data structures need to be accessed frequently.
 
2.8.4  Multiprocessing Cores for Packet Processing  
Using multiprocessing cores as processing cores at the NI is another way to scale the NI
to support high-speed networks. The Intel IXP2800 processor is composed of 16
identical multi-threaded RISC processors organized as a pool of parallel
homogeneous processing cores [109]. This allows a single stream to be decomposed
into multiple sequential tasks that can be linked together across these processors.
EZChip processors, such as the NP-4, are also chosen for high speed rates [108, 123].
Previous researchers at Rice University and Purdue University have proposed
strengthening the network card with six processors to enable it to perform at 10 Gbps
[15, 42]. The idea behind the multiprocessing approach is to divide the processing
required for each incoming or outgoing packet. A 166 MHz controller with six
processors can achieve 99 percent of the theoretical throughput of 10 Gbps [42]. Intel
also uses six processors for fast routers and switches to reach 100 Gbps [124]. Despite
achieving this goal, this solution has several problems with regard to the NI structure.
There are a number of complexities associated with these proposed models, such as the
sharing of the main resources between the processors, which could leave processors idle
during packet processing. For example, when one of the processors occupies the
NI bus for a long time, the rest of the processors become inactive until the bus is released.
Furthermore, local memory, such as instruction memory or external memory,
can be accessed by only one processor at a time. These embedded processors also occupy
a large area in the NI and can increase the cost. In addition, Amdahl's law [119] states that
the speed-up achievable on a parallel computer is limited by the existence of a
small fraction of inherently sequential code, which cannot be parallelised. In the case
of parallelising the NI, Amdahl's law bounds the achievable speed-up of the design
(Equation 2.6). The implementation program is divided into two portions: the first, "P", is
the fraction of the protocol processing program that can be made parallel, and "1 - P" is
the remaining fraction of the processing that cannot be parallelised and remains serial.
Thus, the maximum speed-up achievable on the NI by using N processors is:
$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}} \qquad (2.6)$$
  
 
 
Assuming that P is 90%, then 1 - P = 10%. Even with this high percentage of parallelized
code (90%), the problem is sped up by a factor of at most 10, no matter how
large the value of N is. Accordingly, using a number of processors to support the LSO or
LRO pays off only if the design has extraordinarily high values of P, as is the case for what
is known as an Embarrassingly Parallel Problem [64, 99]. The complexity increases
further because LRO and LSO processing code will most likely be migrated from
serial code bases. Therefore, the target software design needs to identify a solution that
meets the parallel processing need. Furthermore, a programming model has to be chosen,
either Symmetric Multiprocessing (SMP) or Asymmetric Multiprocessing (AMP), for the
CPU-intensive code. For instance, it is difficult to redesign code for parallel processing
using SMP [121].
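The bound given by Amdahl's law can be checked numerically. The sketch below evaluates the speed-up S(N) = 1 / ((1 - P) + P / N) for the processor counts discussed in this section, taking P = 0.9 as in the example above; the function names are illustrative.

```python
def amdahl_speedup(p, n):
    """Speed-up on n processors when a fraction p of the work is parallel."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p):
    """Upper bound on the speed-up as the number of processors grows."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


# With 90% parallel code, six processors give a speed-up of 4 and
# sixteen processors give 6.4; the bound for any N is 10.
for n in (2, 6, 16, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
print("limit:", round(speedup_limit(0.9), 2))
```

Under this assumption, going from six to sixteen cores adds only 2.4 to the speed-up, which illustrates why adding cores alone does not scale the NI.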
 
2.8.5  Target Core Observation  
There are extreme challenges in meeting the needs of 100 Gbps Ethernet processing,
including the TCP and UDP offload. Using a specialized high-performance RISC core
that supports the offload functions reduces the complexity
involved in using multi-cores for packet processing. This study aims to
improve the traditional RISC core engine based on the DLX processor architecture, which
was developed by John L. Hennessy and David Patterson [44]. The high-performance
novel RISC used in this research has an instruction set, control unit and register
file specifically suited to supporting the offload functions. Such a core can be
used for processing at 100 Gbps.
  
 
 
 
2.9 Conclusion  
Improved methods of packet processing at the end node are a necessity
in order to keep pace with the substantial and continuous development of high-speed
networks. Designing and implementing a high-speed NI for both the receiving side
and the sending side has become very important in order to achieve 100 Gbps speeds.
Several changes have been designed and implemented within the host area, the NI
and the packet size in order to improve the processing of TCP/IP and UDP/IP. Offloading
part of the protocol processing to the NI reduces the CPU requirements for processing
the protocol stack and enhances protocol processing at the end node.
This chapter has provided an overview of the methods used to improve the processing of
inbound and outbound TCP and UDP packets inside the network interface. It has also
highlighted the implementation approaches that enhance protocol processing inside the
network interface to help the sending and receiving sides complete packet processing
within the given time frame for communication rates of 10 and 100 Gbps.
The next chapter will illustrate the research methodology for supporting the Large
Receive Offload function on the proposed network interface model.
 
  
 
Chapter 3  
Large Receive Offload Processing Methodology 
 
3.1 Introduction 
Reducing the per-packet processing overhead is necessary for the receiving side in order
to serve high-speed communication lines. A possible solution is the use of jumbo
frames, which can carry up to 9000 bytes of payload and so reduce header
processing at the end node. However, jumbo frames are not universally used within
Ethernet for legacy compatibility reasons. The original IEEE 802.3 specifications
[47, 90] defined the maximum valid Ethernet frame size to be 1518 bytes. This chapter
provides a novel approach to Large Receive Offload (LRO) that combines the received TCP
and UDP packets that have the same packet header information in the Network Interface
(NI) buffer. This reduces the number of headers that the host CPU is required to process
and the number of packet headers that pass from the NI through the system bus to the
host memory. The proposed LRO methodology is designed to enhance packet
processing at the end node and to support out-of-order packets (packets that arrive out of
sequence at the end node). This chapter also extends the LRO functions to support the
UDP/IP protocol. To achieve this, a Packet Processing Unit (PPU) at the NI is designed
to support the required processing for TCP/IP and UDP/IP at high communication speeds.
Using a programmable-based NI to implement the LRO functions increases flexibility and
scalability by supporting two different network protocols, TCP and UDP. In
addition, using a single processor for packet processing simplifies the design of the NI
and avoids the complexity of using multi-cores.
In this thesis, the LRO methodology has been verified using the SPIM simulator of a
RISC (R2000/R3000) as the target core engine, in order to
verify the instruction set required for processing the LRO function. The
number of cycles required for processing TCP and UDP packets is investigated.
  
3.1 Related Implementations  of Large Receive Offload Processing  
The virtual Large Receive Offload (LRO) processing starts after a valid
TCP/IP packet has been received from the MAC unit. A DMA initiation is required to pass
each packet from the NI buffer through the system bus to its user space in the host
memory. Figure 3.1a illustrates the receive processing when LRO is not implemented
within the NI driver and each packet has to be stored in host memory (according to the
available space inside the user space). The descriptors have to be updated after a packet is
placed into the host memory. A host CPU interrupt is required when there is a valid packet
stored in memory. The CPU then starts to process the packet(s).
 
3.1.1 Virtual Large Receive Offload  
The virtual LRO approach for TCP/IP is performed below the socket layer in the kernel
of the Operating System, where the user data is passed to the protocol stack through the
socket buffer. The LRO process is responsible for combining the arrived packets that
belong to the same connection (that is, they have the same IP addresses and port IDs) and
encapsulating them into large packets at the end node before presenting them to
the protocol stack (Figure 3.1b). TCP segments that carry no data or have a control
flag set (e.g., SYN, FIN) are not eligible for LRO processing.
Figure 3.1a: Receive side data flow when Large Receive Offload is not implemented
[Figure not reproduced. Labelled components: Network Interface, DMA, I/O, Host Memory.
Each arrived packet (header + payload) is stored separately in the host memory.]

Figure 3.1b: Receive side data flow processing when the Virtual Large Receive Offload
is implemented in the Kernel
[Figure not reproduced. Labelled components: Network Interface, DMA, I/O, Host Memory.
Packets that have the same connection data are combined (one header + multiple
payloads) inside the host memory.]
  
 
The LRO functions generate a socket buffer (SKB) only for the first packet of an LRO
session. The following segments are added to the segment list of that SKB. If
packets do not match the LRO requirements, the virtual LRO algorithm uses the
lro_receive_frags function to store them in a separate SKB within a kernel buffer
(Figure 3.2), and then passes them to the network stack for further processing.
 
void lro_receive_frags(struct net_lro_mgr *lro_mgr, struct skb_frag_struct *frags,
                       int len, int true_size, void *priv);
void lro_vlan_hwaccel_receive_frags(struct net_lro_mgr *lro_mgr,
                                    struct skb_frag_struct *frags,
                                    int len, int true_size, struct vlan_group *vgrp,
                                    u16 vlan_tag, void *priv);
Figure 3.2: Extract of the LRO code that places packets that do not match the LRO
requirements in a separate buffer
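
The aggregation behaviour described in this section can be sketched in a few lines. The following is a simplified model written for illustration only; it is not the kernel's LRO code, and the `Packet`, `Session` and `coalesce` names are hypothetical. It assumes that a connection is identified by its addresses and ports, that only the next in-sequence data segment may join a session, and that control or empty segments and out-of-order segments are passed on individually.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    flow: tuple            # (src_ip, dst_ip, src_port, dst_port)
    seq: int               # TCP sequence number of the first payload byte
    payload: bytes
    control: bool = False  # True for segments with SYN, FIN, etc. set


@dataclass
class Session:
    next_seq: int                               # sequence number expected next
    chunks: list = field(default_factory=list)  # payloads gathered so far


def coalesce(packets):
    """Group in-order data segments per flow; pass everything else through."""
    sessions = {}
    passthrough = []
    for pkt in packets:
        if pkt.control or not pkt.payload:
            passthrough.append(pkt)             # not eligible for aggregation
            continue
        session = sessions.get(pkt.flow)
        if session is None:                     # first packet opens the session
            sessions[pkt.flow] = Session(pkt.seq + len(pkt.payload), [pkt.payload])
        elif pkt.seq == session.next_seq:       # in order: append the payload
            session.chunks.append(pkt.payload)
            session.next_seq += len(pkt.payload)
        else:                                   # out of order: handled separately
            passthrough.append(pkt)
    return sessions, passthrough
```

In this model every packet in `passthrough` still costs the host one header to process, which is the weakness of the host-side approach that the text goes on to discuss.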
 
The reordering of packets is quite common in networks [73]. The LRO does not process
out-of-order packets. This means that the buffer descriptor is updated frequently to find
empty space inside the host memory in which to place the data. This buffer space
depends on the driver's implementation of copying a received packet from the NI into
the host memory, using either SKB mode or page mode (fragmented packets are linked
either by the next pointer in the SKB or by the fragments array). Each TCP stream
specifies the amount of buffer space to be used for each socket.
When the receive buffer is set to zero, the LRO functions must provide buffers fast enough
to accommodate the maximum throughput. Since the LRO processing depends on the
main host processor, the received packets must be processed within a budgeted time
(e.g., 123 ns when the line speed is 100 Gbps and the packet size is 1500 bytes) in order
to store each packet in the receive buffer. The host processor is also required to
service the interrupts caused by the incoming packets. Interrupting the processor has a
negative effect on end node performance: it reduces bandwidth utilization and
consumes processor cycles [31, 76]. The host CPU is also required to process packets
that do not match the LRO's criteria, such as out-of-order packets. This
processing of interrupts and out-of-order packets reduces the LRO's performance.
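
The per-packet time budget quoted above can be reproduced with a short calculation. The sketch below assumes that a 1500-byte packet occupies 1538 bytes on the wire, that is, 38 bytes of Ethernet overhead made up of a 14-byte header, a 4-byte frame check sequence, an 8-byte preamble and a 12-byte inter-frame gap. This breakdown is an assumption chosen because it matches the 123 ns figure; the function names are illustrative.

```python
ETHERNET_OVERHEAD_BYTES = 38  # assumed: 14 header + 4 FCS + 8 preamble + 12 gap


def packet_time_ns(line_rate_gbps, packet_bytes=1500,
                   overhead_bytes=ETHERNET_OVERHEAD_BYTES):
    """Time one packet occupies the line, in nanoseconds.

    Dividing a bit count by a rate in Gbit/s gives nanoseconds directly.
    """
    bits_on_wire = (packet_bytes + overhead_bytes) * 8
    return bits_on_wire / line_rate_gbps


def packets_per_second(line_rate_gbps, packet_bytes=1500,
                       overhead_bytes=ETHERNET_OVERHEAD_BYTES):
    """Maximum packet arrival rate at full line utilisation."""
    return 1e9 / packet_time_ns(line_rate_gbps, packet_bytes, overhead_bytes)


for rate in (10, 40, 100):
    print(f"{rate} Gbps: {packet_time_ns(rate):.2f} ns per packet, "
          f"{packets_per_second(rate):,.2f} packets/s")
```

Any receive-side processing that takes longer than this budget per packet cannot keep up with a fully loaded line, whichever processor performs it.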
 
3.1.2 LRO performance within the host area  
The LRO performance has been measured using the distribution of the number of
instructions executed per packet in the guest domain of a virtualized Linux environment
under Xen [56]. The results show that running the LRO in the kernel causes the CPU to
spend 2927 cycles for a guest domain at 10 Gbps. Processing the virtual LRO data stream
requires around 1600 CPU instructions when the packet size is 1500 bytes.
With 1500 bytes as the default size, the end node receives around 812,743.82
packets per second (Table 2.1). Applying the virtual LRO to network packets
raises the CPU usage [126]. Following the measurements of Linux performance under Xen
by Kumar and his team [56], the CPU usage when the virtual LRO is applied
can be calculated as follows:
 
- Assuming the communication line speed is 10 Gbps, the packet size is 
1500 bytes and the CPU clock speed is 4 GHz.  
 
  
 
The actual MIPS rate and execution time for the virtual LRO are then determined as follows:

$$\mathrm{CPI} = \frac{\text{CPU clock cycles}}{\text{instruction count}} = \frac{2972}{1600} \approx 1.85$$

(the CPU spends 2927 cycles for a guest domain; 1600 is the total instruction count)

$$\text{Execution time} = \text{instruction count} \times \mathrm{CPI} \times \text{cycle time} = \frac{\text{instruction count} \times \mathrm{CPI}}{\text{clock rate}} = \frac{2972 \times 1.85}{4 \times 10^{9}} \approx 1.3 \times 10^{-6}\ \text{s} \qquad (3.1)$$

$$\mathrm{MIPS} = \frac{\text{clock frequency}}{\mathrm{CPI} \times 10^{6}} = \frac{4 \times 10^{9}}{1.85 \times 10^{6}} \approx 2153\ \text{MIPS} \qquad (3.2)$$
 
This estimate (Equation 3.2) shows that over 50% of the capacity of the 4 GHz CPU
is required to run the LRO code while performing the LRO. The host CPU
devotes more cycles to completing the LRO functions than to other services when
the number of incoming packets increases, especially when the packet size is smaller
than 1500 bytes [28]. Moreover, a 1500-byte packet has to be handled within an extremely
short time at 100 Gbps (about 123.04 ns), whereas the time spent on each packet in this
calculation is 1.3 µs (Equation 3.1), which is too long for processing packets at
40-100 Gbps.
 
3.1.3 Receive Side Coalescing 
Receive Side Coalescing (RSC) is a stateless offload technology that is used to
reduce CPU utilization for network processing on the receiving side by offloading
tasks from the CPU to an RSC-enabled network adapter [36, 37]. This technique
coalesces the packets into large packets inside the host memory and reduces TCP/IP
processing by around 20 percent. However, this approach still has a number of
drawbacks, including the transfer of data from the NI to host memory. The NI moves each
packet from its buffer to kernel space using the DMA, which involves initiating the DMA
to transfer the payload over the system bus to its allocated place in the host memory.
Each TCP/IP stream leads to several DMA initiations, and these initiations may cost the
host CPU around 300 ns when the message size is 32 KB [50]. Another concern with the
RSC approach is that out-of-order packets force RSC to stop functioning, leaving a number
of small packets that are not coalesced. As a result, the host CPU has more TCP/IP headers
to process. There are also a number of operations associated with out-of-order packet
processing that require more CPU processing (such as looking up the Transmission Control
Blocks (TCB) for packets that have not passed the RSC criteria test). This processing is
estimated to take 25 CPU instructions [27]. Furthermore, RSC does not provide any
solution for sending TCP data or a framework for the relationship between the host and
the NI. Nor does RSC address other protocols (such as UDP/IP) or
suggest an algorithm for any protocol other than TCP.
 
3.2 Enhancing the Large Receive Offload Processing  
Designing a scalable interface that supports LRO requires several factors be taken into 
consideration. The first factor is to provide a network core engine that can adequately 
identify the specifications of each arrived packet before linking them to form a large 
packet. The second factor is to support different protocols. Another factor is the support 
for out-of-order packet processing.  
  
Processing out-of-order packets in the NI enhances packet processing: the target core engine manages the received packets and puts them in order before sending them to host memory (Figure 3.3). Assume that 8300 bytes of application data are sent from one end to another and that the MTU is 1500 bytes. Six packets of application data will then travel from the sender to the destination. Each packet carries 1460 bytes of application data, except the last packet, which carries 1000 bytes. In addition, each packet carries 40 bytes of TCP/IP headers on top of its application data. With the Linux LRO implementation, all packets (application data plus TCP/IP headers) will pass through the system bus, assuming that these packets have arrived out-of-order at the destination.
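
The packet count in this example follows directly from the MTU and the header size. A minimal Python sketch of the segmentation arithmetic is given below; the function name is hypothetical, and 40 bytes of TCP/IP headers without options are assumed, as in the text.

```python
def segment_payloads(app_bytes, mtu=1500, header_bytes=40):
    """Split application data into per-packet payload sizes.

    Each packet carries at most (mtu - header_bytes) bytes of payload.
    """
    max_payload = mtu - header_bytes
    full_packets, remainder = divmod(app_bytes, max_payload)
    return [max_payload] * full_packets + ([remainder] if remainder else [])


if __name__ == "__main__":
    payloads = segment_payloads(8300)
    print(len(payloads), "packets:", payloads)
```

For 8300 bytes this yields five payloads of 1460 bytes and a final payload of 1000 bytes, matching the six packets described above.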
[Figure: packets arriving at the Network Interface are amalgamated by the REP inside the Receiver Buffer (RB) into two large packets, one TCP/IP header followed by the payloads of packets 1, 2, 3 and 4 and one TCP/IP header followed by the payloads of packets 5 and 6, before being moved by DMA over the local bus to the host memory.]
Figure 3.3: Offloading the LRO approach to the Network Interface
  
 
For example, suppose the packets arrive in the following order: 1, 2, 4, 3, 6 and 5. With virtual LRO processing, these packets need four queues inside the host memory. Packets 1 and 2 will be stored in the first queue, while packets 4 and 3 will each be stored in a new queue because they are out of sequence. Packets 5 and 6 will be stored in the last queue. Table 3.1 illustrates the processing scenario where the virtual LRO is implemented to support these six packets. The table also shows the benefits of implementing the LRO inside the NI.
Table 3.1: A comparison between the virtual LRO processing and offloaded LRO

| | Virtual LRO | Proposed LRO in network interface |
|---|---|---|
| Data size | 8300 bytes | 8300 bytes |
| Total headers passed over the system bus | Six TCP/IP or UDP/IP headers | Two TCP/IP or UDP/IP headers |
| Estimation of DMA initiations | Six DMA initiations | Two DMA initiations |
| Segment header overhead | 240 bytes (40 X 6 TCP/IP headers) | 80 bytes (40 X 2 TCP/IP headers) |
| Processing the out-of-order packets | Packets are not reordered. According to the LRO implementation, four queues are needed: Queue 1 holds packets 1 and 2; Queue 2 holds packet 4; Queue 3 holds packet 3; Queue 4 holds packets 5 and 6 | Only two queues |
| Processed by | A host CPU processor | Network interface's engine |
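
The header and DMA rows of Table 3.1 can be reproduced with a few lines of Python. This is only a sketch of the table's arithmetic: it assumes, as the table does, one 40-byte header and one DMA initiation per unit transferred over the system bus, and the function name is hypothetical.

```python
HEADER_BYTES = 40  # TCP/IP header size assumed in Table 3.1


def bus_cost(transferred_units, header_bytes=HEADER_BYTES):
    """Headers, DMA initiations and header overhead for the units sent over the bus."""
    return {
        "headers": transferred_units,
        "dma_initiations": transferred_units,
        "header_overhead_bytes": transferred_units * header_bytes,
    }


virtual_lro = bus_cost(6)    # six separate packets reach the host
offloaded_lro = bus_cost(2)  # two amalgamated large packets reach the host

if __name__ == "__main__":
    print("virtual LRO  :", virtual_lro)
    print("offloaded LRO:", offloaded_lro)
```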
 
 
3.2.1 Lost Large Packet Treatment inside the Network Interface  
When multiple packets are lost from a window of data, a sender may end up either retransmitting packets that might have already been successfully received, or resending the dropped packets within the round-trip time available. In order to overcome this and enhance TCP performance, there are two approaches that can be used for the treatment of large packets. A selective acknowledgement (SACK) [81] mechanism could be used as a suitable approach for LRO processing. With SACK, a receiver can inform the sender about all segments that have been successfully received, allowing the sender to resend only the segments that have actually been lost. The SACK approach requires the approval of both parties and has to be explicitly enabled by the system administrator. SACK also adds an extra processing burden to the stack, together with extra data carried inside the option area of the TCP header.
The other approach is to send the final acknowledgment number of the amalgamated packets [114]. This will not change the protocol behavior of sending acknowledgements to the sender [115]. Avoiding the retransmission of data during the amalgamation of the arrived packets reduces the data amalgamation time to less than 0.5 ms [93, 94, 113]. This research has studied a maximum number of amalgamated packets of between 20 and 200, depending on the packet size (more details in Section 6.7.2). For instance, if the packet size is 500 bytes, the total time is 24600 ns (200 packets X 123 ns, the time required for a packet at 100 Gbps), which is less than 0.5 ms.
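
The amalgamation-time estimate above is a simple product, sketched below in Python. The per-packet time of 123 ns and the 0.5 ms bound are taken from the text; the function and constant names are illustrative.

```python
AMALGAMATION_BUDGET_NS = 0.5e6  # 0.5 ms expressed in nanoseconds


def amalgamation_time_ns(packet_count, per_packet_ns):
    """Total time to receive packet_count packets back to back."""
    return packet_count * per_packet_ns


if __name__ == "__main__":
    total = amalgamation_time_ns(200, 123)
    print(total, "ns; within budget:", total < AMALGAMATION_BUDGET_NS)
```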
This thesis does not attempt to specify in detail the congestion control algorithms for 
implementing TCP with acknowledgments but will illustrate the proper behavior of TCP 
processing within the proposed LRO functions inside the NI. Amalgamating UDP 
packets is less complex compared to TCP since there are no traffic control requirements. 
Yet UDP packets have larger amounts of data (the payload) than TCP, which requires 
more time to be moved from one location to another inside the NI.  
  
 
3.2.2 Large Packet Processing  
LRO processing determines whether the packet is eligible for amalgamation. The packet 
is not eligible if any of the following conditions applies, including [33, 47, 60, 62, 63, 
67, 69]: 
- Non-Padded frame (IP total packet length must be received packet 
length) 
- Non-TCP or UDP packet 
- IP options are present 
- IP ECN CE is set 
- TCP segment has no data 
- CWR (Congestion Window Reduced) flag is set, ECE (ECN Echo) flag is 
set 
- SYN flag is set 
- FIN flag is set 
- URG flag is set 
- ACK flag is not set  
- TCP Timestamp option is present 
- The IP header does not have the MF bit set (More Fragments) and offset 
is zero.  
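
Most of the conditions listed above can be expressed as a single predicate. The following Python sketch is an illustration under stated assumptions, not the implementation used in this thesis: the `PacketInfo` structure and its field names are hypothetical, and the frame-padding and fragment conditions are left out because they depend on frame-level details that are not modelled here.

```python
from dataclasses import dataclass

# TCP flags that block amalgamation when set, per the list above.
BLOCKING_TCP_FLAGS = frozenset({"CWR", "ECE", "SYN", "FIN", "URG"})


@dataclass
class PacketInfo:
    """Hypothetical summary of the header fields that the checks need."""
    protocol: str                               # "TCP", "UDP" or anything else
    payload_len: int = 0                        # bytes of application data
    ip_options: bool = False                    # IP options present
    ecn_ce: bool = False                        # IP ECN CE set
    tcp_flags: frozenset = frozenset({"ACK"})   # names of the TCP flags that are set
    tcp_timestamp: bool = False                 # TCP Timestamp option present


def eligible_for_amalgamation(pkt: PacketInfo) -> bool:
    """Return True when none of the modelled disqualifying conditions applies."""
    if pkt.protocol not in ("TCP", "UDP"):
        return False
    if pkt.ip_options or pkt.ecn_ce:
        return False
    if pkt.protocol == "TCP":
        if pkt.payload_len == 0:                # TCP segment has no data
            return False
        if pkt.tcp_flags & BLOCKING_TCP_FLAGS:  # CWR, ECE, SYN, FIN or URG set
            return False
        if "ACK" not in pkt.tcp_flags:          # ACK flag is not set
            return False
        if pkt.tcp_timestamp:
            return False
    return True
```

A data segment with only the ACK flag set passes the check, whereas a SYN packet, a pure acknowledgement without data, or a packet carrying IP options does not.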
 
If the embedded processor discovers a packet of one of these types, it buffers the packet without any changes. The procedure of amalgamating packets runs until the Interrupt Moderation (IM) [31] timer expires or the network interface buffer reaches its quota (explained further in Chapter 5). When amalgamating the packets of a stream, the embedded core is required to update the ACK number to be the same as the last packet's ACK. The total length of the packet (inside the IP header) is also set to the total size of the amalgamated datagram. For UDP, the UDP length inside the UDP header is changed to the total size of the datagram.
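
The header update described above can be sketched as follows. This is a minimal Python illustration under assumptions: the header is modelled as a dictionary with hypothetical keys, 40 bytes of TCP/IP headers without options are assumed, and the 16-bit limit of the IP total length field is enforced explicitly.

```python
MAX_IP_TOTAL_LENGTH = 0xFFFF  # the IP total length field is 16 bits wide


def amalgamated_header(first_header, segments, header_bytes=40):
    """Build the header of a large packet from the first packet's header.

    first_header: dict with at least the keys "ack" and "total_length".
    segments: list of (payload_len, ack) tuples in stream order.
    """
    merged = dict(first_header)
    merged["ack"] = segments[-1][1]  # same ACK as the last packet
    merged["total_length"] = header_bytes + sum(length for length, _ in segments)
    if merged["total_length"] > MAX_IP_TOTAL_LENGTH:
        raise ValueError("amalgamated packet exceeds the IP total length field")
    return merged


if __name__ == "__main__":
    first = {"seq": 1000, "ack": 101, "total_length": 1500}
    parts = [(1460, 101), (1460, 101), (1460, 101), (1460, 205)]
    print(amalgamated_header(first, parts))
```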
 
3.3 Primary Design and Structure for the Receiving Side  
Designing a programmable NI that includes LRO functions to support high-speed communication rates of up to 100 Gbps is challenging. The processing includes managing the different virtual TCP connections that a host CPU makes with other remote hosts, and rearranging the out-of-order packets so as to combine more packets in the NI's buffer. Other considerations in designing the NI are the packet size, the support for different streams and the communication between the host and the MAC unit. Scalability, performance and complexity are also considered when designing an NI.
This chapter investigates the use of a single core unit for the NI to support the LRO functions for high-speed communication lines. Using dedicated RISC core engines, one for the sending side and the other for the receiving side, allows more features to be added to the NI and reduces the complexity of designing the NI with multiple core engines [15, 42, 108]. One of the strategies implemented in the NI is to split the packet processing into three parts: the communication Line Interface (LI), the kernel processing, and the Host Interface (HI) (Figure 3.4). The HI and LI parts will be implemented in hardware.
This research also examines UDP/IP packet processing, since it is widely used in applications [72]. Supporting multiple protocols improves the scalability of the proposed NI design. This research can therefore be used as a guide for processing different protocols within the NI.
 
 
 
 
 
Figure 3.4: The Ethernet network interface structure  
 
3.4 Receiving Side Processing 
A server deals with a number of users and each user is considered to be a separate 
connection. Each connection may have a number of virtual connections, each with a 
different identification. A copy of these identifications is required to be stored in the NI, 
which helps the core within the receiving side processor to manage incoming packets. 
The main function of the receiver section of the NI is to amalgamate the arriving packets 
into a complete large packet by linking the packets that have the same identifier in the 
 
 
 
 
Receiver Buffer (RB) using a linked-list mechanism. The packet information also helps the Receiver Embedded Processor (REP) core to recognize the packet type. For example, from the packet information, the REP determines whether the packet is the Beginning of Message (BOM). The BOM is the first packet of a stream to arrive at the NI, for which the REP needs to create a new linked-list. The Continuation of Message (COM) is a packet of a stream that already has a linked-list. The End of Message (EOM) is the last packet of a connection; the linked-list is closed when the Push flag (PUSH) inside the TCP header is seen [33]. The PUSH bit is not a record marker and is independent of segment boundaries, but a packet that carries it is treated as the EOM packet of the stream. A Single Segment Message (SSM) is a single TCP packet that carries application data equal to or less than 1500 bytes. Signaling packets such as FIN or SYN always flow among the TCP packets. The signaling packets are processed as SSM but have fewer processing requirements than an SSM. For example, the RISC is not required to link the SSM to a stream. The RISC sends such packets "as is" to the HI.
The TCP and UDP protocols require different processing scenarios to identify the sequence of the arriving packets. For instance, TCP has a Sequence Number (SN) field, which is used to detect the sequence of the packet. To maintain the stream, the hosts on either side of the TCP session maintain a 32-bit SN, which can have a value between 0 and 4,294,967,295, and an Acknowledgement Number (ACK), which are used to keep track of how much data has been sent or received. The SN is included in each transmitted packet and acknowledged by the opposite host as an ACK to inform the sending host that the transmitted data was received successfully. UDP/IP packets rely on the Identification, fragment flag and fragment offset fields of the IP header, which are used to identify the fragmented packets.
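
Because the SN is a 32-bit value, the sequence number expected after a segment wraps around at 4,294,967,295. A minimal Python sketch of this arithmetic is shown below; the function name is hypothetical, and the SYN and FIN flags, which also consume one sequence number each, are ignored for simplicity.

```python
SN_MODULUS = 2 ** 32  # the SN field is 32 bits wide


def next_expected_sn(sn, payload_len):
    """Sequence number expected after a data segment that starts at sn."""
    return (sn + payload_len) % SN_MODULUS


if __name__ == "__main__":
    print(next_expected_sn(1000, 1460))
    print(next_expected_sn(4294967295, 1))  # wraps around to zero
```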
  
 
 
3.4.1 TCP Processing Methodology  
As soon as a TCP packet arrives at the NI, the IP and TCP headers are processed. The IP address and the Port IDs are masked out of the IP header and TCP header. The Lookup Memory is used to store the active connection information, which is sent by the CPU to the NI. The core engine sends the arrived packet's connection information to the Lookup Memory to find a match. If the connection information of the arrived packet matches an entry in the Lookup Memory, the processing of linking the packet starts. The TCP processing methodology uses the SN bits to identify the arrived packets. Figure 3.5 demonstrates the BOM, COM, EOM and SSM within a TCP stream. The BOM is the first packet received from a TCP stream, which carries the first SN (SN = n) and the ACK (m+1) of the stream. A COM is a packet that arrives at the NI after the BOM of the same stream. The EOM is the last packet of the stream, when the PUSH is signaled [33]. The Single Segment Message (SSM) packet is the first packet of the stream that has the PUSH flag set or that carries a small amount of data (e.g., the packet size is equal to or less than the MTU).
Figure 3.5 illustrates the inter-packet processing methodology for receiving TCP packets. After a match for the packet is found in the Lookup data, the REP examines the packet's type. If the arrived packet is a BOM, the packet needs to be transferred from the Receiver Buffer Interface (RBI) to the Receiver Buffer (RB). If the packet is a COM or an EOM, the body of the packet needs to be linked with the previously amalgamated data of this stream inside the RB.
  
 
 
 
 
 
 
Figure 3.5: Beginning of Message, Continuation of Messages, End of Message and 
Single Segment Message of a TCP stream   
 
The processing methodology for received TCP packets is illustrated in Figure 3.6. If the arrived packet passes the LRO criteria (e.g., the packet is not a SYN or FIN packet) and a match between the packet connection information (the IP address and port ID) and the Lookup Memory entries is found, the REP examines the packet type. If the arrived packet is a BOM, the packet needs to be transferred from the Receiver Buffer Interface (RBI) to the RB. If the packet is a COM or an EOM, the body of the packet needs to be linked with the previously amalgamated data of this stream inside the Receiver Buffer.
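
The decision logic of this section can be summarized in a few lines of Python. The sketch below is an interpretation of the text rather than the REP code itself: the Lookup Memory is modelled as a dictionary, the key layout and field names are hypothetical, and a Start-Address and End-Address of 0 are taken to mean that no linked-list exists yet.

```python
def classify_tcp_packet(lookup, connection_id, sn, psh):
    """Classify an arrived TCP packet that already passed the LRO criteria.

    lookup: dict mapping a connection identifier to an entry with the
            keys "start", "end" and "expected_sn" (hypothetical layout).
    Returns "DISCARD", "BOM", "SSM", "COM", "EOM" or "OUT_OF_ORDER".
    """
    entry = lookup.get(connection_id)
    if entry is None:
        return "DISCARD"                      # no matching connection
    if entry["start"] == 0 and entry["end"] == 0:
        return "SSM" if psh else "BOM"        # no linked-list for this stream yet
    if sn != entry["expected_sn"]:
        return "OUT_OF_ORDER"
    return "EOM" if psh else "COM"


if __name__ == "__main__":
    conn = ("192.168.2.8", 80, 53570)
    table = {conn: {"start": 0, "end": 0, "expected_sn": 1000}}
    print(classify_tcp_packet(table, conn, 1000, psh=False))
```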
 
 
 
 
 
 
[Figure 3.5 box contents:
Beginning of Message (BOM): SN = n, ACK = m+1 (expected of next packet), Push flag not set
Continuation of Message (COM): SN = n+1, ACK = m+1, Push flag not set
End of Message (EOM): SN = n+1, ACK = m+1, Push flag set
Single Segment Message (SSM): SN = n, ACK = m+1, Push flag set]
 
  
 
 
 
 
[Flowchart: for each valid TCP packet, the connection information and ports (source and destination) are read from the incoming packet and compared with the Lookup memory. If no match is found, the message is treated as lost and discarded. If a match is found, the SN and the end address are read from the Lookup memory. If the result is "0", the TCP PSH flag is read: PSH = "1" gives an SSM and PSH = "0" a BOM. Otherwise, the SN is read from the arrived packet and checked against the expected SN: an unexpected SN is handled as out-of-order; for the expected SN the TCP PSH flag is read, where PSH = "1" gives an EOM and PSH = "0" a COM.]
Figure 3.6: Processing flow of TCP of LRO
 
  
 
3.4.1.1 Linked-list Structure Format 
After a new entry is inserted in the Lookup memory, all connection pointers, such as the Start-Address and End-Address of the packets inside the Host Interface (HI), are reset to zero. Changes to these pointers depend on the processing that the arriving packet requires. Each arrived packet may require different processing within the linked-list, depending on its connection information, including the sequence number and Port ID. As soon as the TCP packet arrives at the NI, the IP and TCP headers are processed. The IP address and the Port IDs are masked out of the IP header and the TCP header in order to match these identifiers with the Lookup memory. After a match has been found, the payload needs to join the data of the same stream in the RB. Instead of the REP searching for a free location inside the HI, a Circulation Buffer (CB) mechanism has been added to hold all free pointers inside the HI. The REP reads the head of the CB in order to get the address of a free location inside the RB for the arrived packet. The free pointers, which refer to the available locations inside the HI, are collected after the host reads the amalgamated data.
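
The role of the Circulation Buffer can be illustrated with a small Python sketch. It is a software model only: the class name, method names and addresses are hypothetical, and a `collections.deque` stands in for the hardware buffer of free pointers.

```python
from collections import deque


class FreePointerBuffer:
    """Software model of a circular buffer that holds free buffer addresses."""

    def __init__(self, free_addresses):
        self._free = deque(free_addresses)

    def take(self):
        """Read the head of the buffer to obtain a free location for a packet."""
        if not self._free:
            raise MemoryError("no free location available")
        return self._free.popleft()

    def give_back(self, address):
        """Return a location once the host has read the amalgamated data."""
        self._free.append(address)

    def __len__(self):
        return len(self._free)


if __name__ == "__main__":
    cb = FreePointerBuffer([0x1000, 0x2000, 0x3000])
    first = cb.take()
    print(hex(first), "taken;", len(cb), "free locations remain")
    cb.give_back(first)
```

Reading the head of the buffer avoids searching for free space, and returned pointers are reused in the order in which the host releases them.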
The 32-bit SN in the TCP header assists the REP in identifying the received packets. For example, the first arrived packet of a stream is the BOM. This means that no linked-list has previously been assigned to this stream. The REP needs to create a new linked-list for this packet by inserting the Start-Address and End-Address in the Lookup memory beside the connection information (Figure 3.7). The Start-Address refers to the head of the linked-list (the address that is loaded from the CB for this packet). The End-Address refers to the tail of the linked-list, which is the NULL address (the node's pointer) located at the end of the packet body. The SSM requires only one pointer, which points to the head of the list.
  
 
[Figure: Lookup Memory entries 1 to n, each holding the connection information, Start-Address and End-Address of a stream. The Start-Address and End-Address point to the head and tail of that stream's linked-list in local memory, which consists of a BOM node (TCP/IP header, data and next pointer) followed by COM and EOM data nodes, or of a single SSM node (TCP/IP header and data).]
Figure 3.7: Linked-list data structure
 
COM refers to the packets that arrive after the BOM and have the same connection identifier. After a match of the IP address and Port IDs is found in the Lookup memory, the REP adds a new node (the packet body and its pointer) to the existing linked-list. The linked-list is updated after the new node is added by setting the pointer of the new node to NULL (End-Address). Next, the REP stores the address of the NULL pointer of the current node in the Lookup memory, which refers to the new end of the list.
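
The BOM and COM handling described above amounts to appending nodes to a singly linked list whose head and tail are kept in the Lookup memory. The Python sketch below models this under assumptions: buffer locations are dictionary keys, 0 encodes the NULL pointer, and the End-Address stores the address of the tail node rather than the location of its NULL pointer. All names are hypothetical.

```python
NULL = 0  # assumed encoding of the NULL pointer


class ReceiverBufferModel:
    """Toy model of the buffer: address -> [payload, next_address]."""

    def __init__(self, free_addresses):
        self.free = list(free_addresses)  # stands in for the Circulation Buffer
        self.nodes = {}

    def store(self, payload):
        address = self.free.pop(0)
        self.nodes[address] = [payload, NULL]  # a new node always ends the list
        return address


def link_packet(entry, rb, payload):
    """Add a packet body to the stream described by a Lookup entry."""
    address = rb.store(payload)
    if entry["start"] == NULL:
        entry["start"] = address              # BOM: create the list
    else:
        rb.nodes[entry["end"]][1] = address   # COM: overwrite the old NULL
    entry["end"] = address                    # remember the new tail
    return address


def read_stream(entry, rb):
    """Walk the list from the Start-Address and collect the payloads."""
    payloads, address = [], entry["start"]
    while address != NULL:
        payload, address = rb.nodes[address]
        payloads.append(payload)
    return payloads
```

Keeping the tail address in the Lookup entry lets each arriving COM be linked without walking the list from its head.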
If the EOM is discovered, the REP needs to stop amalgamating the packets for this stream. The REP appends the EOM packet to the related stream inside the RB by initiating the DMA to transfer the packet body from the RBI to the RB, and then deletes the linked-list of the stream. The End-Address in the Lookup memory, which refers to the NULL value of the previous packet, is read by the REP. The REP stores the address of the current packet in the location that held the NULL value of the previous packet (End-Address). The REP is responsible for updating the TCP header of the original TCP/IP packet of the amalgamated large packets inside the RB. Furthermore, it needs to update the length of the datagram inside the IP header to the total number of amalgamated bytes of the stream, and the acknowledgement number inside the TCP header.
When the PUSH flag is equal to "1", the REP examines the Start-Address and End-Address of the connection in the Lookup memory. If the Start-Address and End-Address are equal to "0", then no linked-list is assigned to this stream; the packet is therefore moved from the Line Interface to the RB, and the REP completes this packet as an SSM. The REP then stores the NULL value at the end of the packet body. There is no need to update the Start-Address and End-Address in the Lookup Memory, because no more packets will be amalgamated within this stream. Otherwise, when a linked-list already exists, the current packet is amalgamated with the previous packet of the same stream, and the NULL value is stored at the end of the current node. Finally, the NULL address is stored in the Lookup Memory with the same link information, where it refers to the End-Address of this TCP stream.
When the PUSH flag (inside the TCP header) is equal to "1" and a linked-list exists, the REP needs to add this node to the end of the linked-list, following the procedure used for the EOM. With the EOM, there is no need to extend the linked-list further because it is the last packet of the TCP/IP stream, and there is no need to store the End-Address in the Lookup memory. However, when the PUSH flag is equal to "1" and the Start-Address and End-Address of this connection are equal to "0", the current packet is moved from the Line Interface to the RB, and the REP processes this packet as an SSM. With an SSM, the REP does not need to update the Lookup memory entries that are related to this message, such as the Start-Address and End-Address of the linked-list, since no more packets will be amalgamated after this packet.
 
3.4.1.2 Out-of-order Processing  
Out-of-order processing is more complicated than the processing of a BOM, COM or SSM. The RSC prototype [36] stops coalescing packets when an out-of-order packet is identified, whereas the LRO [51, 52] opens a new queue if out-of-order packets are discovered. Both steps increase the processing cycles required from the host CPU. In this research, a new approach has been implemented to manage the out-of-order processing inside the NI. The out-of-order processing starts after the REP reads the sequence number (SN) of the arrived TCP packet and compares it to the SN expected next in this linked-list. In the next step, the REP joins the arrived packet to the linked-list. Where the SN is not located between the boundaries of the amalgamated stream, the REP creates a sub-linked-list of this stream (Figure 3.8). A duplicate data segment is discovered by checking its SN against the part of the TCP stream that has already been amalgamated in the RB. Furthermore, the REP discards any lost packets that do not match a Lookup entry. However, re-ordering can be performed only within the Interrupt Moderation period. If a lost packet of the same stream arrives after the Interrupt Moderation timer has expired, it is sent to the host as a Single Segment Message (the mechanism for retransmitting lost packets and calculating the time is outside the scope of this thesis).
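
One way to picture the sub-linked-list idea is to track each contiguous run of sequence numbers as a separate list. The Python sketch below is a simplified model and not the REP algorithm: it ignores SN wrap-around and partially overlapping segments, and it merges two runs as soon as a hole between them is filled, which is an assumption that the text does not state. All names are hypothetical.

```python
def place_segment(runs, sn, length):
    """Place a segment into a list of [start_sn, end_sn) runs, kept sorted.

    Returns "duplicate", "appended" (the segment extends an existing run)
    or "new_list" (a new list or sub-list is opened for the segment).
    """
    for run in runs:
        if run[0] <= sn < run[1]:
            return "duplicate"          # data already amalgamated

    status = "new_list"
    for run in runs:
        if run[1] == sn:                # segment continues this run
            run[1] = sn + length
            status = "appended"
            break
    else:
        runs.append([sn, sn + length])  # gap before the segment: open a sub-list
        runs.sort()

    # Merge runs that have become contiguous (assumed policy).
    i = 0
    while i + 1 < len(runs):
        if runs[i][1] == runs[i + 1][0]:
            runs[i][1] = runs[i + 1][1]
            del runs[i + 1]
        else:
            i += 1
    return status


if __name__ == "__main__":
    runs = []
    for sn in (0, 1460, 4380, 1460, 2920):
        print(sn, place_segment(runs, sn, 1460), runs)
```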
 
  
 
[Figure: a Lookup Memory entry for connection N, used by the REP, with two rows of fields (SN->next, End-Address, Start-Address and total bytes). One row holds the linked-list pointers related to the sequence numbers of packets 1, 2, 3 and 4, and the other those of packets 5 and 6. In the Host Interface buffer, each group is stored as one TCP/IP header followed by the payloads of its packets.]
Figure 3.8: Lookup Memory structure
 
The REP is responsible for updating the TCP header (e.g., the sequence and acknowledgement numbers) of the stored large packets. The total length of the packet is also updated according to the total amalgamated data. The receiver host CPU sends a confirmed acknowledgement sequence number to the sender to inform the sending host that the transmitted data was received successfully. When a data segment is sent, a timer is started [93]. The sender expects to get an ACK before the timer expires. When a packet is lost during transmission of the TCP stream, leaving a hole, the sending host re-sends the missing data according to the sequence number that it received from the receiver host. This sequence number is included in each transmitted packet and acknowledged by the opposite host [33, 84]. When the re-sent packet arrives at the NI, the REP processes the packet as an SSM and sends it to the host "as is".
  
 
3.4.2 UDP/IP Processing Methodology  
The User Datagram Protocol (UDP) is used by a number of protocols, such as the Network File System (NFS) and the Trivial File Transfer Protocol (TFTP). UDP packets also need to be formed into a large packet in the NI before being sent to the host. Reassembly is based on the information carried inside the IP header [63]. The IP header includes the information needed to merge the arrived packets, such as the Identification field, the flag bit (MF bit) and the offset number [67]. Within UDP/IP processing, the IP address and the Port ID are used to recognize the packet as belonging to a UDP flow, while the offset number, MF bit and Identification field inside the IP header are used to identify whether the packet is the BOM, COM or EOM packet of the stream (Figure 3.9). These identifiers assist the linked-list mechanism, where each packet is associated with the previously amalgamated data of the same connection. Processing a linked-list for UDP is similar to processing a TCP linked-list.
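
The roles of the MF bit and the offset number can be illustrated with a short Python sketch. It follows the standard IP fragmentation rules (an offset of zero marks the first fragment, and a cleared MF bit marks the last one); the mapping of these cases onto BOM, COM, EOM and SSM is an interpretation of the text, and the function names are hypothetical.

```python
MF_MASK = 0x2000      # More Fragments bit in the 16-bit flags/offset field
OFFSET_MASK = 0x1FFF  # 13-bit fragment offset, counted in 8-byte units


def fragment_fields(flags_and_offset):
    """Extract (more_fragments, offset_in_bytes) from the IP header field."""
    more_fragments = bool(flags_and_offset & MF_MASK)
    offset_bytes = (flags_and_offset & OFFSET_MASK) * 8
    return more_fragments, offset_bytes


def classify_fragment(more_fragments, offset_bytes):
    """Map the MF bit and the offset onto the message types used in the text."""
    if offset_bytes == 0:
        return "BOM" if more_fragments else "SSM"
    return "COM" if more_fragments else "EOM"


if __name__ == "__main__":
    for field in (0x2000, 0x20B9, 0x0172, 0x4000):
        print(hex(field), classify_fragment(*fragment_fields(field)))
```

The Identification field, which is not modelled here, is what ties the fragments of one datagram together.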
 
3.5 Verification of the LRO Processing  
This research requires the design of a scalable network interface based on a specialized embedded RISC core for high-speed communications. In this section, the processing carried out by the RISC processor is evaluated and the proposed LRO algorithm is verified. Such evaluation helps to identify the type of instructions required, as well as the total RISC cycles needed for TCP and UDP. The SPIM S20 simulator, which is based on the DLX RISC processor architecture, is used for testing the functionality of the RISC and finding the amount of processing required [44, 61]. The simulator runs programs for RISC microprocessors; it can read and immediately execute the assembly language program containing the proposed LRO. The simulator is a self-contained system for running these programs and contains an integrated debugger [44]. The simulator helps to analyze and test the execution flow of the assembly program that has been written for the LRO. In addition, it allows the sets of instructions required for LRO to be studied.
[Flowchart: the IP address and PID, together with the packet information (packet ID, offset number and M bit), are read from the incoming packet and compared with the Lookup memory. If no match is found, the message is treated as lost and discarded. If a match is found, the first 64 bits (the packet information and the end address) are read from the Lookup memory. If the result is "0", the M bit is read to distinguish an SSM from a BOM. Otherwise, the packet information is read from the arrived packet and checked against the expected number: an unexpected number is handled as out-of-order; for the expected number the M bit is read, where M bit = "1" gives a COM and M bit = "0" an EOM.]
Figure 3.9: Inter-packet processing to be executed by the NI for receiving UDP packets
  
 
 
3.5.1 Modelling SPIM Simulator Architecture 
The SPIM simulator has been used to process the LRO functions, including 
communication with the host, data movements and packet header processing (Figure 
3.10). Since it is implemented within the NI, sending and receiving are processed in two 
different processors and both are run in parallel. The simulation for the sending function 
can be performed independently of the receiving function. 
 
 
 
 
[Figure: Ethernet frames arrive from the network at the Frame Processing Unit (Media Access Control and Physical units), which handles the frame header and trailer. Packets are then exchanged with the Packet Processing Unit, which consists of the Line Interface (LI) packet buffer, the embedded RISC (R2000/R3000) and the Host Interface (HI) packet buffer, and which processes the packet headers (e.g., TCP/IP) and the application data before delivery to the host.]
Figure 3.10: Receiving block diagram

The components of the Packet Processing Unit have been configured within the SPIM simulator. The embedded RISC is the core unit. The packet buffers required for LRO are implemented inside the simulator's memory (Figure 3.11). These buffers are:
 
 
  
 
 
 
 
 
 
 
 
 
- The Line Interface (LI) buffer stores the packets as they are moved from the MAC.
- The Host Interface (HI) buffer is used to store packets before delivering them to the host. While transferring packets from the HI, the NI's core is able to store new packets in the lower space of the HI.
- The Host-NI Communication (HNIC) buffer is used to exchange control and status messages between the host and the NI. This includes sending the addresses of the packets that have been sent by the CPU inside the Packet Buffer.
- Another buffer is used to store a list of pointers to the free space inside the HI buffer. After the host reads the packets from the HI, the free pointers are sent back to the NI. The embedded processor uses these pointers to store the arrived packets.
- The Look-up table stores the active TCP connections (e.g., the destination IP address and the source and destination port IDs) that are required to hold the active identifiers to support the linked-list mechanism.

[Figure: the simulator's memory holds the Line Interface (LI) buffer, the Host Interface (HI) buffer, the Look-up table, the pointers of free spaces inside the HI and the Host-NI Communication (HNIC) buffer, all of which are accessed by the embedded processor simulator.]
Figure 3.11: Packet Processing Unit based SPIM simulator

In a physical NI, the LI, HI and HNIC buffers are hardware components. The Look-up table is implemented within the simulator's memory.
As the SPIM simulator does not have a DMA unit, programmed I/O is used instead for data movement. The embedded processor core is permitted to simulate the initialization of the control information for the DMA controller but not the data movement itself. This keeps the simulated processing close to reality, where the processor in the Packet Processing Unit needs to initialize the DMA controller to move data. With the programmed I/O method, the embedded processor handles all the procedures for moving the packet payload from the LI buffer to the HI buffer (Figure 3.12). In one clock cycle, the processor loads 32 bits into its register (step 1) and then, during the next cycle, stores these 32 bits inside the HI buffer (step 2).
The embedded core continues the load and store operations until all bytes have been moved to the HI buffer. The copying of data packets from one location to another could be done by using the Copy Memory Address function, because the source and destination addresses are stored in the same memory. However, using load and store is closer to reality, as the LI and the HI are implemented as separate memories.
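
The load and store loop described above can be modelled in a few lines of Python. This sketch only mirrors the counting argument in the text: it assumes one cycle per 32-bit load and one cycle per 32-bit store, ignores loop and address-update instructions, and uses hypothetical names.

```python
WORD_BYTES = 4  # the processor moves 32 bits at a time


def programmed_io_copy(li, li_offset, hi, hi_offset, nbytes):
    """Copy nbytes from the LI buffer to the HI buffer word by word.

    li is a bytes-like source, hi is a bytearray destination.
    Returns the number of cycles, counting one per load and one per store.
    """
    cycles = 0
    for done in range(0, nbytes, WORD_BYTES):
        chunk = min(WORD_BYTES, nbytes - done)
        register = li[li_offset + done: li_offset + done + chunk]   # step 1: load
        cycles += 1
        hi[hi_offset + done: hi_offset + done + chunk] = register   # step 2: store
        cycles += 1
    return cycles


if __name__ == "__main__":
    li_buffer = bytes(i % 251 for i in range(1500))
    hi_buffer = bytearray(2000)
    print(programmed_io_copy(li_buffer, 0, hi_buffer, 0, 1460), "cycles")
```

Under these assumptions a 1460-byte payload needs 365 word transfers, that is, 730 cycles.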
 
  
 
 
 
 
 
 
 
 
 
 
Figure 3.12: Programmed I/O approach for data movement 
 
3.5.2 Simulation Processing Analysis  
The purpose of this section is to validate the LRO processing algorithm through the SPIM simulator. To get accurate results from the LRO processing in the SPIM simulator, data captured from a real network environment is used (see Appendix A). Two other applications are used to capture the real data. The first application is the Microsoft TechNet NTTTCP [101], which was developed by Microsoft to send and receive TCP and UDP streams. The NTTTCP is used to send application data between computers. The other application is Wireshark Libpcap, which is used to capture the packet flow of the streams that are running at the end node [102]. The NTTTCP application starts sending the TCP/IP and UDP/IP streams to the targeted machine. The arrived packets are captured using Wireshark Libpcap, and the data of the TCP/IP and UDP/IP packets is then exported in hexadecimal to form a hexadecimal format file (Figure 3.13). Then, the streams of the real packets are stored inside the LI buffer. The specialized instruction set was developed to create efficient TCP and UDP processing, based on the virtual LRO [51] algorithm and following the RFCs for TCP/IP and UDP/IP [33, 62, 63, 67, 69, 94]. As the captured files are in sequence and contain no lost or duplicated data, manual changes to the hexadecimal file have been made to simulate lost and out-of-order packets. These changes include changing the position of ordered packets or deleting a packet from a stream.
 
 
 
 
 
 
 
 
 
Figure 3.13: A TCP/IP and UDP/IP Hexadecimal format  
 
+---------+---------------+----------+ 
01:59:09,206,211   ETHER 
|00|22|75|29|ec|95|20|68|9d|9a|34|6c|08|00|45|00|01|27|13|ef|40|00|80|06|a9|0f|c0|a8|02|08|4d|ea|2c|38|d1|42|00|50|8c|ef|0e|
c0|62|07|23|00|50|18|00|40|ac|8c|00|00|47|45|54|20|2f|52|2f|41|31|51|4b|49|47|4a|68|4e|6a|64|6a|4e|6a|49|33|59|7a|63|33|5a
|54|52|6a|4d|32|45|34|59|32|45|31|4e|7a|45|33|5a|54|52|6c|5a|44|49|35|4e|7a|56|6a|45|67|51|41|42|77|59|54|47|4f|51|42|49|6
7|45|44|4b|67|51|49|41|78|41|41|4b|67|51|49|42|52|41|42|4b|67|63|49|42|. 
……. 
+---------+---------------+----------+ 
01:59:09,527,074   ETHER 
   
|20|68|9d|9a|34|6c|00|22|75|29|ec|95|08|00|45|00|00|28|ad|86|40|00|32|06|5e|77|4d|ea|2c|38|c0|a8|02|08|00|50|d1|42|62|07|2
3|00|8c|ef|0f|bf|50|10|00|01|7f|b8|00|00| 
+---------+---------------+----------+ 
01:59:09,815,821   ETHER 
20|68|9d|9a|34|6c|00|22|75|29|ec|95|08|00|45|00|00|c2|ad|87|40|00|32|06|5d|dc|4d|ea|2c|38|c0|a8|02|08|00|50|d1|42|62|07|2
3|00|8c|ef|0f|bf|50|18|00|01|a8|e2|00|00|48|54|54|50|2f|31|2e|31|20|32|30|30|20|4f|4b|0d|0a|43|6f|6e|74|65|6e|74|2d|54|79|7
0|65|3a|20|61|70|70|6c|69|63|61|74|69|6f|6e|2f|6f|63|74|65|74|2d|73|74|72|65|61|6d|0d| 
1:59:09,817,228   ETHER 
|20|68|9d|9a|34|6c|00|22|75|29|ec|95|08|00|45|00|05|ae|ad|88|40|00|32|06|58|ef|4d|ea|2c|38|c0|a8|02|08|00|50|d1|42|62|07|2
3|9a|8c|ef|0f|bf|50|10|00|01|c8|46|00|00|31|34|0d|0a|03|12|08|e4|01|12|01|fe|32|0a|08|04|10|ed|c9|cb|15|18|80|0a|0d|0a|31|6
4|66|34|0d|0a|0a|f1|3b|41|53|55|21|56|50|53|7a|02|07|06|13|27|00|00|00|a9|1d|00|00|f4|26|00|00|78|da|25|9a|77|3c|d6|6b|14|
c0|5f|e3|b5|67|46|52|89|54|b6|b2|b2|ae|ac|ac|24|91|2d|59|65|64|54|46|11|4 
……. 
……. 
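
As an illustration of how such a hexadecimal file can be turned back into packets, the sketch below parses the second record of Figure 3.13 (rejoined into one string) and extracts a few header fields. The helper names `parse_record`, `summarize` and `swap_records` are hypothetical and are not taken from the thesis code; the sketch also assumes an untagged Ethernet frame with a 14-byte header.

```python
import struct

# Second record of Figure 3.13 (a 54-byte frame), rejoined into one string.
RECORD = (
    "|20|68|9d|9a|34|6c|00|22|75|29|ec|95|08|00|45|00|00|28|ad|86|40|00|32|06|"
    "5e|77|4d|ea|2c|38|c0|a8|02|08|00|50|d1|42|62|07|23|00|8c|ef|0f|bf|50|10|"
    "00|01|7f|b8|00|00|"
)


def parse_record(text):
    """Turn a '|xx|xx|...' hexadecimal record into raw frame bytes."""
    return bytes(int(token, 16) for token in text.split("|") if token.strip())


def summarize(frame):
    """Extract a few IP and TCP/UDP header fields from an Ethernet frame."""
    ip = frame[14:]                          # skip the 14-byte Ethernet header
    header_len = (ip[0] & 0x0F) * 4          # IHL is counted in 32-bit words
    total_length = struct.unpack("!H", ip[2:4])[0]
    protocol = ip[9]                         # 6 = TCP, 17 = UDP
    src_port, dst_port = struct.unpack("!HH", ip[header_len:header_len + 4])
    return {
        "total_length": total_length,
        "protocol": protocol,
        "src_port": src_port,
        "dst_port": dst_port,
    }


def swap_records(records, i, j):
    """Return a copy of the record list with two packets exchanged,
    which is one way to create an out-of-order stream for testing."""
    reordered = list(records)
    reordered[i], reordered[j] = reordered[j], reordered[i]
    return reordered


frame = parse_record(RECORD)
fields = summarize(frame)
print(len(frame), fields)
```

Deleting an element from the list of records would model a lost packet in the same way that `swap_records` models reordering.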
  
 
3.5.3 Instruction Cycles  
During the simulation, the amount of processing required for network interface protocols
and data movement is measured. The number of instructions required for the LRO
functions is also determined for TCP and UDP. After the embedded
processor finishes processing one packet, it fetches the new connection identifier or a
pointer that has been sent by the host through the HNIC. After it finishes amalgamating
the large packets in the HI, the core processor is required to send the Start-Address and
the total bytes to the HNIC. The number of cycles that the processor takes for
different types of operations for TCP and UDP has been analyzed during this simulation.
The type and the total number of instructions needed by the
embedded processor to process each BOM, COM and EOM message
individually were studied for TCP/IP and UDP/IP. Table 3.2 presents the number of
cycles required to complete the processing of an out-of-order TCP or UDP packet within
the SPIM simulator.
Programmed I/O is applied for the data movement. The RISC core spends over 400
cycles on a 1500-byte packet, most of them on moving the payload from the LI to the HI.
As the packet size is
reduced, the number of cycles decreases. The RISC required 45 cycles to complete
the processing of a 64-byte TCP packet and 56 cycles for the processing of a 64-byte UDP
packet. UDP requires more cycles since it needs more Programmed I/O cycles.
 
 
  
 
Table 3.2: Number of cycles needed to complete out-of-order TCP or UDP packets

Protocol  Type of processing  64 bytes  128 bytes  256 bytes  512 bytes  1024 bytes  1500 bytes
TCP       Header processing         29         29         29         29          29          29
TCP       Lookup data               10         10         10         10          10          10
TCP       Moving data                6         22         54        118         246         365
TCP       Total                     45         61         93        157         285         404
UDP       Header processing         26         26         26         26          26          26
UDP       Lookup data               10         10         10         10          10          10
UDP       Moving data                9         32         64        121         249         368
UDP       Total                     56         68        100        157         285         404
 
 
The share of the total cycles that the processor spends on each type of processing for TCP
and UDP packets is shown in Figure 3.14. The share spent on moving data increases as
the data packet gets larger and is highest
when the packet size is 1500 bytes. This is because the core requires more
processing to move the payload of a large packet than that of a smaller packet. The
shaded area in Figure 3.14 shows that the RISC requires nearly 90 percent of the
total cycles for moving the payload data of packets of 1024 bytes or larger.
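
The TCP rows of Table 3.2 are consistent with a simple cost model, sketched below: one cycle for each 32-bit word of payload, on top of the fixed header and lookup cycles. The model is an interpretation of the table, not code from the thesis, and it assumes 40 bytes of TCP/IP headers without options. Only the TCP rows are modelled.

```python
HEADER_CYCLES = 29         # TCP header processing (Table 3.2)
LOOKUP_CYCLES = 10         # lookup processing (Table 3.2)
TCP_IP_HEADER_BYTES = 40   # assumption: 20-byte IP header + 20-byte TCP header
WORD_BYTES = 4             # 32-bit local bus


def tcp_cycles(packet_bytes):
    """Return (moving-data cycles, total cycles) for one TCP packet."""
    moving = (packet_bytes - TCP_IP_HEADER_BYTES) // WORD_BYTES
    total = HEADER_CYCLES + LOOKUP_CYCLES + moving
    return moving, total


def moving_share(packet_bytes):
    """Percentage of the total cycles spent on moving the payload."""
    moving, total = tcp_cycles(packet_bytes)
    return 100.0 * moving / total


for size in (64, 128, 256, 512, 1024, 1500):
    moving, total = tcp_cycles(size)
    print(size, moving, total, round(moving_share(size), 1))
```

Under this model, the moving-data share is about 86 percent for 1024-byte packets and about 90 percent for 1500-byte packets, which matches the shaded region discussed above.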
 
  
 
 
 
 
Figure 3.14: Total percentage of data movements of LRO  
 
 
3.5.4 Instruction Type  
The instruction types need to be studied in order to design a processor with
maximum performance for LRO. Restricting the design to the instructions needed to run the LRO
processing functions reduces the design complexity and makes the RISC controller
simpler and faster. The instructions used are memory-to-register, register-to-memory,
arithmetic and logic instructions. Table 3.3 illustrates the instruction types and their
format.
 
 
Table 3.3: Instruction types that are used with LRO processing

Category            Instruction    Format                  Examples                                                          Comments
Arithmetic          Add            add $s10, $s2, $s3      Accumulating the total bytes                                      Three operands; data in registers
Arithmetic          Add immediate  addi $s8,$s4,40         Adding the size of headers to the total bytes                     Used to add constants
Data transfer       Load           lw $s1,50($s2)          Loading the TCP, UDP or IP from LI. $s1 = Memory[$s2 + 50]        Data from memory to register; this includes LH and LB
Data transfer       Store          sw $s10,100($s2)        Store total bytes in HI. Memory[$s2 + 100] = $s10                 Data from register to memory, including SB and SH
Logic               Shift Right    sr $11,20               Shift data register to the right                                  Used for shifting data
Logic               And            and $s10,$s1,x00ffffff  Extract the total length from the first 32 bits of the IP header  Used to mask the register with constant
Condition branches  Branch equal   beq $s1,17              To check the protocol type (UDP = 17)                             Condition jump
 
3.5.5 Total Registers  
SPIM processing for the proposed LRO required 32 general registers (r0 to r31).
Floating Point registers are confirmed as not required during the processing of the
proposed LRO functions (Figure 3.15). All the Floating Point registers remain zero.
 
 
 
 
 
 
 
 
 
 
 
Figure 3.15: Floating Point registers during the processing of the proposed LRO 
 
3.5.6 RISC Clock Rate  
The simulator has measured the amount of processing that is required for TCP/IP and
UDP/IP protocol processing and for data movement. Different TCP/IP and UDP/IP
packets have been delivered to the simulator. The number of instructions
required for the protocols, with and without data movement processing, was measured
in MIPS, where every instruction was processed in one cycle. Therefore, the results
shown below represent the required speed of the RISC core in terms of MIPS, while
processing the requirements for header processing alone without data movement or
lookup data for both TCP/IP and UDP/IP. The RISC required 29 cycles for the header
processing when out-of-order packets are processed. The Lookup processing required 10
cycles. Figure 3.16 represents the maximum Receive Embedded Processor (REP) clock
rate required to perform header processing for the out-of-order packet sizes of 1500,
1024 and 512 bytes. The clock rate required of the RISC core increases as
the packets become smaller. This is because the
RISC is required to process about 23,496,240 packets per second when the MTU is 512
bytes but only around 8,127,438 packets per second when the MTU is 1500 bytes (Table 2.1).
 
 
Figure 3.16: RISC clock rate for packet header processing 
 
[Figure 3.16 data labels (MIPS) by packet size: 1500 bytes, 321; 1024 bytes, 467; 512 bytes, 917; 256 bytes, 1766; 128 bytes, 3293; 64 bytes, 5803. Legend: 100 Gbps, 40 Gbps.]

The data movement for the packet payload is simulated using Programmed I/O.
The number of MIPS required for packet processing varies depending
on the packet size. Even though the small packets carry less data, the core has to run
faster to manage the processing of the packets at 100 Gbps. A RISC core with 3322
MIPS is required for the LRO when 1500-byte packets are processed and the line speed is
100 Gbps. A core with 9492 MIPS is required when the packet size is 512 bytes (Figure
3.17).
 
 
Figure 3.17: MIPS required for the Receiving side using Programmed I/O
[Figure 3.17 data labels (MIPS) by packet size: 1500 bytes, 3322; 1024 bytes, 4837; 512 bytes, 9492. Legend: 100 Gbps, 40 Gbps.]

3.6 Network Interface Design Considerations for 100 Gbps
According to the SPIM simulator, 404 cycles are required when out-of-order 512-byte
packets are processed. The requirement of 9492 MIPS for the RISC core is caused by several
factors. The first factor is that the core uses Programmed I/O for data
movement. The second factor is the local bus width, which is 32 bits. Other
factors include the Lookup processing and the RISC pipeline stages.
 
3.6.1 DMA for Data Movements 
Using Programmed I/O for data movement is slow because there are too many
unnecessary overheads for small transfers [43, 44] and the RISC core is tied up with
moving data from one location to another, especially when the core has to move
large amounts of data. This affects the performance of the RISC core, making it
unavailable for other activities. The use of DMA in the NI is more efficient for network
applications than Programmed I/O [13, 39, 58].
DMA is the approach chosen in this research for the data movement
function of the NI within the proposed model. In the proposed model, the DMA is
responsible for moving data between the LI and the HI (Figure 3.18).
• In Step 1, the RISC core initiates the DMA controller. Since the local bus of the
receiving side is shared between the DMA and the RISC core, the RISC core
has to release the local bus to the DMA to perform the data block transfer. Each
transfer of 64 bits consumes two cycles.
• In Step 2, the DMA reads the 64 bits from the source buffer into the
DMA's register.
• In Step 3, the 64 bits move from the DMA's register to the destination buffer. The
DMA state machine provides the read and write signals to the source
and destination buffers. The state machine in the DMA is also required to
increment the address counter and store the 64 bits in the destination buffer. The use
of the DMA reduces the RISC processor instruction cycles.
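
The three steps above can be sketched as a small software model of the DMA state machine. This is an illustration only: the class name `DmaController`, its method and its attributes are hypothetical, and the two-cycle cost per 64-bit transfer is taken from the description above.

```python
class DmaController:
    """Toy model of a DMA engine that moves 64 bits per transfer."""

    WORD_BYTES = 8             # 64-bit transfers
    CYCLES_PER_TRANSFER = 2    # one read cycle plus one write cycle

    def __init__(self):
        self.register = b""
        self.cycles = 0

    def transfer(self, source, src_addr, dest, dst_addr, length):
        """Step 1 is the call itself: the core hands over the addresses and length."""
        moved = 0
        while moved < length:
            n = min(self.WORD_BYTES, length - moved)
            # Step 2: read up to 64 bits from the source buffer into the register.
            self.register = bytes(source[src_addr + moved:src_addr + moved + n])
            # Step 3: write the register to the destination buffer.
            dest[dst_addr + moved:dst_addr + moved + n] = self.register
            moved += n                                # address counter increment
            self.cycles += self.CYCLES_PER_TRANSFER
        return self.cycles


li_buffer = bytes(range(200))      # hypothetical packet payload in the LI
hi_buffer = bytearray(256)         # hypothetical HI buffer
dma = DmaController()
print(dma.transfer(li_buffer, 0, hi_buffer, 16, len(li_buffer)))
```

In this model the bus cycles are spent by the DMA rather than by the core, so the RISC processor only pays for initiating the transfer.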
  
 
 
 
 
Figure 3.18: DMA approach for data movement inside the Network Interface 
 
3.6.2 Local Bus Width  
Each load or store between the LI and the HI moves at most 32 bits, because the local
bus width of the SPIM simulator is 32 bits. Using a bus wider than 32 bits (e.g., 64 bits) for the
proposed NI increases the amount of data moved within each cycle.
3.6.3 Pipeline Stages  
The SPIM simulator has 4 pipeline stages. These stages consist of fetching the
instructions, decoding, executing and writing back. Reducing the number of pipeline stages is
intended to improve the execution throughput of the RISC [44, 95]. The proposed
RISC's pipeline will be discussed in Chapter 6.
 
3.6.4 Lookup Memory  
The active connection information can be accessed by the NI's engine, either by using a
fast look-up-table, such as Content Addressable Memory (CAM) [79], or by accessing
the TCP/IP Control Block (TCB) from the host [58]. In this research, the prototype
model uses the CAM and SRAM [79] to store the active connection information [33],
such as the IP address and the Port ID. Each TCP/IP or UDP/IP packet has its own
headers to carry the packet's information, such as the IP address and Port ID.
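
A software analogy of the CAM and SRAM pair is sketched below. A Python dictionary stands in for the CAM (it maps a connection key to an index) and a list stands in for the SRAM that holds the per-connection state. The class name, the key layout and the stored fields (`hi_address`, `total_bytes`) are assumptions made for illustration and do not describe the actual hardware tables.

```python
class ConnectionTable:
    """Toy model of a CAM (key -> index) backed by an SRAM (index -> state)."""

    def __init__(self):
        self._cam = {}     # connection key -> SRAM index
        self._sram = []    # per-connection state records

    def insert(self, key, hi_address):
        """Register an active connection and where its data sits in the HI."""
        self._cam[key] = len(self._sram)
        self._sram.append({"hi_address": hi_address, "total_bytes": 0})

    def lookup(self, key):
        """Return the state of a connection, or None if it is not active."""
        index = self._cam.get(key)
        return None if index is None else self._sram[index]


def connection_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Build the lookup key from the IP addresses, port IDs and protocol."""
    return (src_ip, dst_ip, src_port, dst_port, protocol)


table = ConnectionTable()
key = connection_key("77.234.44.56", "192.168.2.8", 80, 53570, 6)
table.insert(key, hi_address=0x4000)
entry = table.lookup(key)
entry["total_bytes"] += 1460       # account for one aggregated segment
print(entry)
```

A hardware CAM compares the key against all entries in parallel; the dictionary only mimics the result of that match.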
 
3.6.5 Overlapped Processing  
Completing the processing of a packet within the minimum time frame requires the
use of a particular technique, namely Overlap Processing [50, 107]. During the
transfer of data from one place to another, the local bus is busy and the processor cannot
perform any instructions that use the local bus. For this reason, the
order of the processing steps within the NI has been carefully selected to take advantage of
the processor's ability to complete other tasks that do not use the local
bus. For example, when receiving packets, the processor deals with the
linked-list in the CAM, carrying out such tasks as updating it or reading from it (Figure
3.19). Using scheduled jobs is an appropriate approach [57], but this is only useful with
multi-core processors where more job numbers are required. Using the overlapped
technique reduces the time needed to complete the processing of a packet. In addition,
these enhancements of the NI's structure increase the performance of packet processing
in high-speed networks.
  
[Figure 3.19 timeline: before the data movement, the core processor processes the packet header and copies the related data, such as the SN and port IDs, checks the current packet and initiates the DMA to transfer the header and the first segment. While the DMA transfers the data, for a time that depends on the payload size, the REP performs overlapped processing of instructions that are not related to the NI's bus, such as updating the CAM and the linked-list of a stream. The REP might be idle when the DMA spends a long time on the local bus.]
 
Figure 3.19: Overlapped processing at the receiving side 
 
3.7 Conclusion 
Large Receive Offload (LRO) is a technique used to support protocol processing at the
end node. The methodologies for amalgamating TCP and UDP packets have been
discussed. The proposed LRO function works similarly to the Jumbo Frame (9000 bytes)
mechanism. This thesis extends the LRO to support out-of-order packets. This research
contributes to improving the structure of the network card in terms of scalability,
simplicity and the ability to support line rates as high as 100 Gbps.
The SPIM simulator results show that a 917 MHz RISC core can support the
Receiver unit processing for transmission rates of up to 100 Gbps for TCP/IP and
UDP/IP when the MTU is 512 bytes or larger, provided that data movement is
excluded. However, using Programmed I/O for data movement increases the required RISC
clock rate to over 9000 MHz when the packet size is 512 bytes.
Enhancing the receiving side processing can be accomplished by adding additional units
such as the Content Addressable Memory to store the active connection information and
its location inside the Receiving Buffer. In addition, the NI uses the DMA unit to
complete the transfer of data from the Line Interface to the Host Interface, or vice versa.
The next chapter will illustrate the Large Send Offload. Chapter 5 will discuss the
implementation of the Packet Processing Unit inside the NI, which uses the Xilinx
simulator that supports the Very High Speed Integrated Circuit Hardware Description
Language (VHDL) when
building the RISC processor and other devices. These devices include the Content
Addressable Memory, the DMA and the buffers for performing Large Receive Offload
and Large Send Offload.
 
 
 
 
Chapter 4 
Large Send Offload Methodology 
 
4.1 Introduction  
Another contribution of this thesis is the design and implementation of a scalable Packet
Processing Unit for the sending side. An appropriate algorithm has been developed for
sending segments faster to the Media Access Control (MAC) and physical units. The
processing core of the Packet Processing Unit is required to support a high communication
line rate of up to 100 Gbps. The core is responsible for calculating the size of each
message and generating the packet headers (e.g., TCP and IP). It then transfers the
packets from the NI's buffer to the MAC unit within the budgeted time frame for
high-speed networks. The methodology for sending TCP/IP and UDP/IP packets is
discussed in this chapter. A SPIM simulator has been used for verifying the Large Send
Offload (LSO) functions. The resulting RISC instruction counts and the required
clock rate are also presented.
 
4.2 Sending Side Block Diagram  
Shifting part of the transport and data link layer processing to the network interface (NI)
has enhanced end node performance [28]. Instead of sending multiple small
packets to the NI and generating an interrupt after each packet, the host CPU
sends only one large packet (up to 64 KB) [33, 62, 114]. This reduces the number of
interrupts that the host CPU sends to the NI after completing the generation of a packet.
In addition, sending one large packet reduces the impact of protocol overheads on
network throughput [50] and enhances bus utilization [34]. However, offloading the
entire transport layer processing can further complicate the NI design, and more
consideration must be given to the protocol flow [14]. Today, significant challenges are
faced by server platforms while performing TCP and UDP protocol processing. For
instance, the speed of networks now exceeds 10 Gbps and the design and
implementation of high-performance NIs has become very challenging, with the
budgeted time frame for calculating and generating a packet header being extremely
short (only 123 ns when the line speed is 100 Gbps for 1500-byte packets).
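
The 123 ns figure can be reproduced with a short calculation, sketched below. The 38 bytes of per-packet overhead (14-byte Ethernet header, 4-byte frame check sequence, 8-byte preamble and 12-byte inter-frame gap) are an assumption that happens to give the quoted value; the function name is hypothetical.

```python
def packet_time_budget_ns(packet_bytes, line_rate_gbps, overhead_bytes=38):
    """Time one packet occupies the line, in nanoseconds.

    overhead_bytes is an assumed per-packet line overhead:
    Ethernet header (14) + FCS (4) + preamble (8) + inter-frame gap (12).
    """
    bits_on_wire = (packet_bytes + overhead_bytes) * 8
    return bits_on_wire / line_rate_gbps    # bits / (Gbit/s) gives ns


print(round(packet_time_budget_ns(1500, 100), 2))
print(round(packet_time_budget_ns(1500, 40), 2))
```

Any header generation that takes longer than this budget would prevent the sending side from keeping the line fully utilized.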
In this thesis, an NI structure that supports stateless LSO for processing TCP/IP
segmentation and UDP/IP fragmentation has been designed. The proposed
NI (sending side), which can support high-speed networks, has been divided into three
sections: the Host Interface (HI), the Line Interface (LI) and the Packet Processing Unit
(PPU) (Figure 4.1). The DMA is used to transfer data inside the NI. The Sending
Embedded Processor (SEP) and the DMA are attached to a 64-bit bus.
After receiving an interrupt from the host CPU, confirming that a packet is stored in the
Sending Buffer (SB), the sending core engine at the NI starts processing the packet. The
processing, however, is dependent on the protocol type and the packet size.
 
[Figure 4.1 layout: the Host Interface (HI) contains the Sending Buffer (SB), which is connected to/from the host over the system bus; the processing area contains the RISC core Sending Embedded Processor (SEP), which communicates with the host CPU, and the DMA, both attached to the local bus; the Line Interface (LI) contains the Sending Buffer Interface (SBI).]
Figure 4.1: Sending side Model

For example, the TCP protocol supports reliable end-to-end delivery with segmentation, where each
packet has two identifiers: the Sequence Number (SN) and the Acknowledgment
Number (AKN). The beginning segment carries the start sequence number of the TCP
segment. It also carries an AKN, which is the SN of the next expected data portion of the
transmission [33]. The SN and AKN need to be updated in each outgoing segment. The
Checksum field in the TCP and IP headers is either calculated during the data movement
from the SB to the SBI (this is the approach taken with Myricom's LANai-4 NI), or it is
built into the hardware of the MAC unit (e.g., most of Alteon's Gigabit Ethernet NICs).
The aim of this research is to provide an alternative method for sending data faster to the
MAC and physical unit. This includes the generation of the packet headers using a
specialized high performance RISC processor.
Normally, the NI's engine transfers the TCP and IP header of the original message from
the Host Interface (HI) buffer to the local core buffer as a template header (e.g., inside
the core's registers) [68]. Whenever there is a segment, a copy of the template header is
sent to the Line Interface (LI) after updating the essential fields (e.g., the SN inside the
TCP header and the datagram total length inside the IP header of the copied headers).
Because the packet header is stored inside its register, the core engine then sends the
packet header from its register to the LI. This approach costs the core at least 10 cycles to
transfer the packet header from the HI to its register (or local memory) when moving 40
bytes (TCP and IP header) over the local bus. The core is also required to send the
packet header to the MAC unit with each data segment. The proposed approach for
processing the packet headers is to update the original large packet headers inside
the Sending Buffer instead of saving a template header inside the local memory. The
core then initiates the DMA to move the packet from the HI to the LI. In this case, the
header movement is done by the DMA and not by Programmed I/O.
 
4.3 Protocol Processing Methodology 
The protocol processing aims to establish and maintain the LSO of the TCP/IP and
UDP/IP protocols. The process includes identifying the packet type, calculating the total
length of the packet, generating the packet header and sending a complete packet from
the HI to the LI. This processing can start after the host CPU specifies a packet to be
sent to the network. This packet is then stored in the SB for segmentation or
fragmentation. The host CPU also sends the information needed to assist the core
engine at the NI in segmenting the large packet, such as the Maximum Segment Size
(MSS). It also indicates the size of the packet. Small packets have a priority bit flag and
always appear at the top of the queue [68]. The SEP checks the flags after completing
the transmission of a packet (e.g., every 123 ns). If a flag is identified, the SEP stops
sending the large packet and sends the small packet first. The SEP starts processing the
large packet inside the HI so that it is sent to the MAC unit in packets no larger than the
Maximum Transmission Unit (MTU). The SEP reads the length of the packet by extracting
the packet
length from the IP header. If the size of the packet is equal to or smaller than the MTU
(e.g., retransmitting packets for a lost PDU), then the packet is processed as a Single
Segment Message (SSM) and the packet passes 'as is' to the MAC and the physical unit.
If the packet size is larger than the MTU, the SEP starts calculating the first MSS to be
encapsulated within the packet header (e.g., TCP and IP) and then sends it to the LI. A
TCP/IP packet can carry up to 1460 bytes when the MTU is 1500 bytes, whereas a
UDP/IP packet can carry up to 1472 bytes in the payload part.
The processor uses four pointers to enable it to continue sending data to the LI: the
Start-Header Address Pointer (SHAP), the End-Header Address Pointer (EHAP), the Start
Payload Pointer (SPP) and the End-payload Pointer (EPP) (Figure 4.2). The core engine
is responsible for updating the packet headers for each outgoing segment from the HI to
the LI. The SHAP points to the start address of the large packet inside the HI. The
EHAP points to the end of the packet headers. The SPP helps the SEP to
locate the start of the application data. The last pointer is the EPP, which points to the
end of the first segment. The SEP updates the SPP and EPP during the data
movement of the first packet (the BOM). The processor core has to increment the
SPP and EPP in order to transfer the entire application data to the LI.
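
The way the four pointers advance over a large packet can be sketched as follows. The sketch is illustrative only: the `SegmentPointers` structure, the `walk_segments` generator and the convention that end pointers are exclusive addresses are assumptions made for the example, not details taken from the thesis implementation.

```python
from dataclasses import dataclass


@dataclass
class SegmentPointers:
    shap: int   # start of the packet headers inside the HI
    ehap: int   # end of the packet headers (exclusive)
    spp: int    # start of the payload of the current segment
    epp: int    # end of the payload of the current segment (exclusive)


def walk_segments(start_address, header_bytes, payload_bytes, mss):
    """Yield the pointer values for each outgoing segment of one large packet."""
    shap = start_address
    ehap = shap + header_bytes
    end_of_data = ehap + payload_bytes
    spp = ehap
    while spp < end_of_data:
        epp = min(spp + mss, end_of_data)
        yield SegmentPointers(shap, ehap, spp, epp)
        spp = epp    # the next segment starts where this one ended


# Hypothetical large packet: 40 bytes of headers and 4000 bytes of data at 0x1000.
segments = list(walk_segments(0x1000, 40, 4000, 1460))
for s in segments:
    print(hex(s.spp), hex(s.epp), s.epp - s.spp)
```

The SHAP and EHAP stay fixed because the headers are updated in place, while the SPP and EPP move forward by one MSS for each segment.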
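
[Figure 4.2 layout: the network headers (IP header and TCP header) are followed by the first segment data. The SHAP marks the start of the headers, the EHAP the end of the headers, the SPP the start of the first segment and the EPP the end of the first segment.]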
 
Figure 4.2: Four pointers are used with the new approach for segmenting packets  
 
4.3.1 UDP Processing   
With UDP processing, there is no guarantee that a packet will ever reach its destination.
In addition, UDP has no flow control processing. However, careful flow processing is
required to make sure that each piece of data is sent. The sequence of the UDP packet
processing starts after reading the IP header and identifying the protocol type (the
protocol type field inside the IP header is equal to “17”) (Figure 4.3). The protocol type
is carried in the third 32-bit word of the IP header. The second 32 bits of the IP header
are then fetched (the Identification field (16 bits), the Flags (3 bits) and the Fragment
Offset (13 bits)). These three fields are used by the sender core to fragment the UDP application
data that is larger than the MTU. The application data of the long packet is then divided
into fragments whose sizes are multiples of 8 bytes (64 bits), because the Fragment Offset
counts 8-byte units, so that the first packet, the Beginning of Message
(BOM), carries the first MSS of data (e.g., 1472 bytes).

[Figure 4.3 flowchart: after receiving a send request, the SEP reads the protocol type (TCP or UDP) and the packet length. If the packet is not larger than the MSS, the message is sent as is (SSM). For a larger TCP packet, the SEP calculates the SN and Ack, updates the original headers with new data such as the new length, SN and AKN, passes the header and then the body to the MAC unit and checks the remaining size of the message; this is repeated for the BOM, each COM and the EOM, after which a completion is sent to the host. For a larger UDP packet, the SEP calculates the offset, updates the original headers with new data such as the new length, offset and total segment, passes the header and then the body to the MAC unit and checks the remaining size of the message until the datagram is finished.]
   
Figure 4.3: Processing flow of TCP and UDP of LSO 
 
 
The More Fragments flag (MF) in the first packet is set to one (to indicate that more
fragments of this packet follow) (Figure 4.4). In the example, the original packet's header has the
Identification ID equal to “10”, the MF bit equal to “0” (since no packets follow) and
the offset equal to “0” (the packet is a single packet).

[Figure 4.4 content: fragmentation of a UDP packet carrying 6840 bytes of application data; every fragment has ID = 10]
Packet             MF  Offset  Application data
Original packet    0   0       6840 bytes
Fragment 1 (BOM)   1   0       1472 bytes (data bytes 0-1471)
Fragment 2 (COM)   1   184     1472 bytes (data bytes 1472-2943)
Fragment 3 (COM)   1   368     1472 bytes (data bytes 2944-4415)
Fragment 4 (COM)   1   552     1472 bytes (data bytes 4416-5887)
Fragment 5 (EOM)   0   736     952 bytes (data bytes 5888-6839)
 
Figure 4.4: Procedure of sending a UDP user data application 
 
If the MTU is 1500 bytes, the packet can carry 1472 bytes in the payload part. The SEP
divides the original packet into several fragments. The BOM contains the first fragment
of data (1472 bytes). Its header information is as follows: the Identification ID is equal
to “10”, the MF bit is equal to “1”, which indicates that more packets follow with the same
ID, and the offset is “0”. The COM is the following packet, where the Identification ID is
equal to “10” and the offset is 184 (1472/8), because the data of this packet starts at a
relative location of 1472 bytes, which is 184 units of 8 bytes. The MF bit is equal to “1”.
The last packet is the end
packet of the message (EOM), which holds the rest of the data, with the MF bit equal to
“0”. The offset is 736 (5888/8).
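
The fragment fields of this example can be computed with a short sketch. The function name `fragment_udp` and the dictionary layout are hypothetical; the arithmetic follows the text: each fragment carries at most 1472 bytes, the offset is expressed in units of 8 bytes and the MF bit is cleared only in the last fragment.

```python
def fragment_udp(total_bytes, max_payload=1472, ident=10):
    """Return the ID, MF bit, offset and length of each fragment."""
    if max_payload % 8 != 0:
        raise ValueError("fragment payload must be a multiple of 8 bytes")
    fragments = []
    start = 0
    while start < total_bytes:
        length = min(max_payload, total_bytes - start)
        more_fragments = start + length < total_bytes
        fragments.append({
            "id": ident,
            "mf": int(more_fragments),
            "offset": start // 8,       # Fragment Offset counts 8-byte units
            "length": length,
        })
        start += length
    return fragments


for fragment in fragment_udp(6840):
    print(fragment)
```

For 6840 bytes of application data, the sketch yields five fragments with offsets 0, 184, 368, 552 and 736.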
 
4.3.2 TCP Processing  
If the stored packet is TCP, the SEP starts by examining the packet length. If the
size is larger than the MSS, the first part of the application data is sent as the BOM, whose
header fields (such as the sender and destination
addresses) are set based on the initial values. Conceptually, each TCP packet
requires an SN and an AKN inside the TCP header (Figure 4.5). The control flags field (e.g.,
the Urgent flag [114]) is meaningful and must remain as is within the outgoing
segments. If the total length of the BOM is 1500 bytes (the two network ends may specify
different sizes for the MTU when the TCP connection is being set up), the payload part
of the BOM is 1460 bytes. After completing the sending of the BOM to the LI, the core
engine examines the remaining application data to determine whether more packets need to be
sent to the LI. The Continuation of Message (COM) indicates subsequent packets of the
stream. With the COM, changing the SN and AKN is essential, whereas the length of the
datagram remains as it is (e.g., 1500 bytes). The length of the datagram may change in
the last packet sent, the End of Message (EOM), which bears the last part of the
application data.
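
The per-segment header updates can be sketched as follows. The sketch assumes that the SN of each segment advances by the number of payload bytes already sent, as in standard TCP, and that the TCP and IP headers together occupy 40 bytes; the function name `segment_tcp` and the field names are hypothetical.

```python
def segment_tcp(initial_sn, payload_bytes, mss=1460, header_bytes=40):
    """Return the message type, SN and IP total length of each segment."""
    segments = []
    sent = 0
    while sent < payload_bytes:
        length = min(mss, payload_bytes - sent)
        last = sent + length == payload_bytes
        if sent == 0 and last:
            kind = "SSM"    # the whole message fits into one segment
        elif sent == 0:
            kind = "BOM"
        elif last:
            kind = "EOM"
        else:
            kind = "COM"
        segments.append({
            "kind": kind,
            "sn": (initial_sn + sent) % 2**32,       # the SN is a 32-bit counter
            "ip_total_length": header_bytes + length,
        })
        sent += length
    return segments


for segment in segment_tcp(0, 4000):
    print(segment)
```

Only the EOM has a different datagram length in this example; the BOM and the COM both carry a full MSS.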

[Figure 4.5 layout: the original packet (IP header, TCP header and TCP user data) is divided into MSS-sized pieces of TCP user data, each preceded by its own IP and TCP packet header: the BOM (SN = 0, AKN = 1461), the COM (SN = 1461, AKN = 2922) and the EOM (SN = last part, AKN = final AKN).]
 
Figure 4.5: Procedure of sending a TCP user data application 
 
 
4.4 SPIM Simulator for LSO  
The SPIM S20 simulator (based on the MIPS R2000/R3000 RISC) executes the
proposed LSO code, which is written in assembly language. The data flow from the host to
the Line Interface is presented in Figure 4.6.
 
 
 
 
 
 
 
Figure 4.6: Sending block diagram 
 
The simplified packet processing structure of the simulator is presented in Figure 4.7. The Host
Interface (HI) buffer is used for temporarily storing packets that have arrived from the
host. After the large packets are 'cut' to the size of the MTU, they are sent to the
Line Interface (LI) buffer, which is used to store the outgoing packets before delivering
them to the MAC unit. The Host-NI communication (HNIC) buffer is used for
exchanging control and status messages between the host and the NI. The Circulation
Buffer (CB) is used to store all the address pointers for the free space that exists inside
the HI buffer.
 
 
 
 
 
 
 
 
Figure 4.7: SPIM simulator block diagram 
 
Figure 4.8 shows the structure of the sending unit: the HI, the core processor and the LI. The
host transmits large packets to the HI buffer of the Sending unit. It also provides the
embedded core with the MSS and the location of the packets inside the Sending Buffer
of the Sending unit. This information should be sent to the HNIC for each transferred
packet.
The SEP reads the information that is available in the HNIC in order to start
generating the first packet headers of the large packet. It then calculates the size of the
MSS in order to determine the length of the packets and the SN fields in the TCP/IP
header (each side of a TCP session maintains a 32-bit SN to keep track of
how much data it has sent [33]). After the TCP/IP or UDP/IP headers have been
generated, they are written into the SEP's register to be used with each outgoing segment
or fragment (e.g., with each 1460 bytes).
 
 
 
 
 
 
 
Figure 4.8: Communication between the host and the NI Sending Embedded Processor 
 
The SEP needs to change the TCP/IP header fields for each outgoing packet. The 
sequence number is included in each transmitted packet and acknowledged by the 
opposite host with an AKN to inform the sending host that the transmitted data was 
received successfully. If the packet processing is related to UDP/IP, the SEP is required 
to change the UDP/IP headers, including the length of each outgoing packet, the packet 
flags inside the IP header, the Packet Offset and the UDP segment size inside the UDP 
header. This is not the case when segmenting TCP, where the Offset remains zero and the 
More Fragments bit is not set. 
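The fragment fields mentioned above can be sketched as follows. This is a minimal illustration that follows standard IPv4 fragmentation (offsets counted in 8-byte units, More Fragments set on every fragment except the last); the function name, the 1500-byte MTU and the 20-byte IP header are assumptions and do not describe the SEP code itself.

```python
def fragment_fields(payload_len, mtu=1500, ip_header=20):
    """Return (offset_in_8_byte_units, more_fragments, data_len) per fragment.

    payload_len is the IP payload to be fragmented (UDP header plus data).
    """
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - ip_header) // 8 * 8
    fields = []
    sent = 0
    while sent < payload_len:
        data_len = min(per_fragment, payload_len - sent)
        is_last = sent + data_len >= payload_len
        fields.append((sent // 8, not is_last, data_len))
        sent += data_len
    return fields
```

A payload that fits into a single packet produces one entry with offset zero and the More Fragments bit cleared, which is also the state of these fields in every TCP segment.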
For data movements using Programmed I/O, the embedded processor moves the payload 
data from the HI to the LI. In addition, the headers will be transferred from the SEP's 
register to the LI.  After moving the headers and the payload body from the HI to the LI, 
the SEP starts generating the next packet which includes the actual length of the next 
packet, the SN and the AKN, and then sends the trailer from the SEP's register to the LI 
(Figure 4.9). 
 
 
 
 
 
 
 
 
 
 
 
Figure 4.9: Processing flow 
 
 
4.5 Simulation Results 
During the simulation, the amount of processing required for sending TCP/IP and 
UDP/IP packets was measured. Different streams were captured from real 
data streams and stored in the HI buffer (see Appendix A).  This was done to test the 
LSO function and to determine the number of instructions and the instruction types that 
are required. The number of instructions required for the core to process segmentation or 
fragmentation of TCP or UDP packets is presented in Table 4.1.  
 
Table 4.1: Number of cycles needed to complete the Beginning of Message of a TCP or 
UDP packet within the SPIM simulator 

Protocol | Type of processing | 64 bytes | 128 bytes | 256 bytes | 512 bytes | 1024 bytes | 1500 bytes
TCP | Generates the packet headers (cycles) | 19 | 19 | 19 | 19 | 19 | 19
TCP | Transferring the packet header (cycles) | 12 | 12 | 12 | 12 | 12 | 12
TCP | Transferring data (cycles) | 6 | 22 | 54 | 118 | 246 | 365
TCP | Total cycles | 37 | 53 | 85 | 149 | 277 | 396
UDP | Generates the packet headers (cycles) | 18 | 18 | 18 | 18 | 18 | 18
UDP | Transferring the packet header (cycles) | 8 | 8 | 8 | 8 | 8 | 8
UDP | Transferring data (cycles) | 9 | 32 | 64 | 121 | 249 | 368
UDP | Total cycles | 35 | 58 | 90 | 147 | 275 | 394
 
 
The processing analysis covers the packet header processing and the data movements. 
The RISC processor requires 19 cycles to process the Beginning of Message 
(BOM), which is the highest number of cycles found when generating the TCP/IP 
headers; 18 cycles are required for UDP/IP.    
 
The percentage of cycles spent on data movement was also measured when processing 
TCP/IP and UDP/IP packets (Figure 4.10). The share of data movement processing increases 
when the packet size exceeds 256 bytes (shaded area). Data movement accounts for over 90 
percent of the RISC processor cycles when the packet size is 1500 bytes. With small packets 
(64 bytes), the core uses about 15 percent of its capacity to move the 24-byte payload. 
However, with small packets the header processing accounts for over 50 percent of the 
RISC capacity. 
 
 
Figure 4.10: Total percentage of data movements of LSO processing 
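
The shares discussed above can be re-derived from the TCP rows of Table 4.1. The following is a minimal sketch; the variable and function names are illustrative, and the percentages it produces may differ by a point or so from values read off Figure 4.10.

```python
# Cycle counts copied from the TCP rows of Table 4.1.
SIZES = [64, 128, 256, 512, 1024, 1500]
HEADER_GEN = 19    # generating the packet headers
HEADER_MOVE = 12   # transferring the packet header
DATA_MOVE = [6, 22, 54, 118, 246, 365]


def data_movement_share(data_cycles):
    """Percentage of the total cycles spent moving the payload."""
    total = HEADER_GEN + HEADER_MOVE + data_cycles
    return 100.0 * data_cycles / total


# Map each packet size to its data movement share, rounded to one decimal.
shares = {size: round(data_movement_share(cycles), 1)
          for size, cycles in zip(SIZES, DATA_MOVE)}
```

With these inputs the share for 1500-byte packets is above 90 percent, while for 64-byte packets header generation alone (19 of 37 cycles) takes just over half of the total.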
 
4.5.1 Instruction Types  
The types of instructions used for the LSO are similar to those used with the LRO. 
Table 4.2 describes the types of instructions that are required for the LSO. With the 
LSO, the Floating Point registers remain zero. They were also confirmed as not required 
for the LRO. 
 
Table 4.2:  Instruction types used with LRO processing 

Category | Instruction | Format | Examples | Comments
Arithmetic | Add | add  $s10, $s2, $s3 | Accumulating the total bytes | Three operands; data in registers
Arithmetic | Subtract | sub $s9,$s5,$s7 | Subtracts the total size of the message from MSS | Three operands; data in registers
Arithmetic | Add immediate | addi $s8,$s4,40 | Adding the 40 bytes (size of the headers) to the MSS | Used to add constants
Data transfer | Load | lw $s1,50($s2) | Loading the TCP, UDP or IP from LI. $s1 = Memory[$s2 + 50] | Data from memory to register
Data transfer | Store | sw $s10,100($s2) | Store total bytes in HI. Memory[$s2 + 100] = $s10 | Data from register to memory
Logic | Shift Right | sr $11,20 | Shift data register to the right | Used for shifting data
Logic | And | and $s10,$s1,x00ffffff | Extract the total length from the first 32 bits of the IP header | Used to mask the register with constant
Condition branches | Branch equal, Branch Greater than or Equal | beq $s1,17 | To check the protocol type (UDP = 17) | Condition jump
 
 
4.5.2 RISC Clock Rate 
The simulator has measured the amount of processing required for segmenting 
or fragmenting the TCP/IP and the UDP/IP packets. Different TCP/IP and UDP/IP 
packets have been stored in the HI buffer. Each of these large packets is associated with 
data stored inside the HNIC buffer, including the MSS and the location of the packet 
inside the HI buffer. The RISC segments or fragments the packets. The 
processing required for the protocols, with and without data movement, 
was measured in MIPS, where every instruction is processed in one 
cycle. Therefore, the results shown in Figure 4.11 represent the required speed of 
the RISC core in terms of MIPS when processing the requirements for the generation of 
the packet headers alone without data movement.  
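
The relation between cycles per packet and the required MIPS can be sketched as a simple calculation. This is an approximation under stated assumptions: one instruction per cycle, and a packet time equal to the packet size divided by the line rate with no framing overhead. The function names are illustrative, and the results need not match the simulator figures exactly (the thesis budgets 123 ns for a 1500-byte packet, whereas this model gives 120 ns).

```python
def packet_time_ns(packet_bytes, line_rate_gbps):
    """Time one packet occupies the line, ignoring framing overhead."""
    return packet_bytes * 8 / line_rate_gbps  # bits / (Gbit/s) = ns


def required_mips(cycles_per_packet, packet_bytes, line_rate_gbps):
    """Instructions per microsecond, i.e. MIPS at one instruction per cycle."""
    return cycles_per_packet / packet_time_ns(packet_bytes, line_rate_gbps) * 1000
```

For the 19 header cycles of a TCP packet and 1500-byte packets at 100 Gbps, this model gives roughly 158 MIPS, which is of the same order as the header-processing requirement reported by the simulator.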
 
 
Figure 4.11: RISC clock rate for packet header processing 
 
The data movement for the packet payload is simulated using Programmed I/O. 
The required MIPS varies depending on the packet size. 
Even though small packets carry less data, the core has to run faster to manage the 
processing of the packets at 100 Gbps. At a line speed of 100 Gbps, the LSO requires a 
RISC core with 3280 MIPS when 1500-byte packets are processed, 4753 MIPS for 1024-byte 
packets and 9328 MIPS for 512-byte packets (Figure 4.12).  
[Chart data: 156, 228 and 447 MIPS for packet sizes of 1500, 1024 and 512 bytes; x-axis: Packet Size; y-axis: MIPS; legend: 100 Gbps, 40 Gbps]
 
 
Figure 4.12: Amount of MIPS required for sending side using Programmed I/O  
 
 
4.6 Design Consideration for 100 Gbps at the Sending Side 
The challenges of designing a NI capable of processing packets at 100 Gbps include 
completing the processing cycles for each packet within the budgeted time 
frame (e.g., 123 ns when the packet size is 1500 bytes [1]). The packet size also requires 
consideration when designing the NI. A small packet, which carries a small 
amount of application data, requires less data movement at the end node. On the other 
hand, the information used to delimit and verify each packet, which is 40 bytes 
(the TCP and IP headers), is a high overhead compared to the payload that is 
attached to each packet [28]. Large packets, which are suitable for various 100 Gbps 
applications such as multimedia applications, provide good performance [54] and 
throughput by encapsulating large amounts of data, but they require more RISC cycles. 
Another consideration is the core unit. Using a single processor to design a 100 Gbps 
programmable NI reduces the complexity compared to using a multi-core solution. Such 
a core must have the capacity to complete the packet processing within the budgeted 
timeframe for 100 Gbps. To allow for processing within this timeframe, consideration needs 
to be given to the pipeline structure, the control unit design and the instruction types. 
Finally, the method of data movement at the NI is also important for high speed processing. 
Using Programmed I/O as a method of data transfer inside the NI requires more RISC 
cycles, especially when the packet size becomes larger than 256 bytes. 
 
4.6.1 Enhancing Packet Processing   
Since existing research lacks an analysis of data movement inside the NI, this 
thesis provides a detailed analysis of packet movement inside the network interface 
(see Chapter 7 for more details). DMA is used for transferring data between the HI and the 
LI, with the SEP core initiating the DMA. Since the local bus is shared between the 
DMA and the SEP core, the SEP core must release the local bus to allow the 
DMA to perform the data transfer. The width of the local bus used in this work is 64 bits. 
Each transfer therefore moves 64 bits, which improves the data movement inside the NI 
compared to a 32-bit bus. The DMA controller provides the read and write signals to the 
source and destination buffers. The use of the DMA for transferring data reduces the 
instruction cycles of the SEP processor. The benefit is that the core processor is free to 
handle other packet processing requirements while data is being transferred (Figure 4.13). 
For example, the core can generate the headers of the next packet of the same stream, 
such as the COM or EOM. This increases the speed at which data can travel across the 
NI. 
 
 
[Diagram: processing timeline. Processing required before data movements: read the information related to the TCP or UDP message; generate the outgoing header; initiate the DMA to transfer the header and the first segment. During data movements (overlapped processing): determine the next segment of the message; check other processing; the processor might be idle. Data movement cycle times depend on the size of the payload.]

Figure 4.13: Pipeline processing at the sending side 
 
 
4.7 Conclusion  
The methodology of designing a programmable Packet Processing Unit based on a RISC core 
for the sending side of TCP/IP and UDP/IP has been presented. The SPIM simulator was 
used to verify the types of instructions and the total number of cycles for the LSO. The 
required types of instructions are Arithmetic, Logic, Load, Store and Condition Branch 
instructions. Floating Point instructions are not required for LSO functions. The 
SPIM simulation showed that the use of Programmed I/O inside the NI is not an 
appropriate method for high-speed networks such as 100 Gbps.  
Chapter 5 will discuss the implementation of the scalable NI using a Xilinx simulator 
that supports the Very High Speed Integrated Circuit Hardware Description Language 
(VHDL).  VHDL will be used to build the Packet Processing devices, including the DMA 
and the Content Addressable Memory (CAM).  
 
 
 
 
 
 
 
 
 
 
Chapter 5 
 
A Scalable Network Interface Architecture for 100 Gbps 
 
 
 
 
 
5.1 Introduction  
The Network Interface design structures which deal with high-speed protocol processing 
require an adequate simulation tool that can support the proposed model. The chosen 
simulator can support all of the required parts of the Network Interface, including the 
RISC and DMA activity for high-speed processing. For this research, the Xilinx 
Behavior Model Simulation [97] is used to simulate the scalable Network Interface 
design. The IEEE 1164-1993 standard of the Very High Speed Integrated Circuit 
Hardware Description Language (VHDL) is used for building the scalable Network 
Interface. 
This chapter provides details of the DMA, the Content Addressable Memory (CAM) and 
the buffers required to implement a scalable Network Interface. The VHDL-based 
Network Interface processes the Large Receive Offload and Large Send Offload 
functions for the TCP/IP and the UDP/IP protocols. This interface uses an embedded 
processor for each direction (send and receive), a simple data path and an 
uncomplicated DMA controller to support different transmission line speeds of up to 
100 Gbps. 
 
 
5.2 Network Interface Model 
As explained in the previous chapter, the Network Interface (NI) model is partitioned 
into three parts (Figure 5.1): the communication line interface, the processing core and 
the host bus interface. The proposed model has been designed to support the TCP/IP and 
the UDP/IP protocols for high-speed lines with the following features:  
• Design and implementation of a RISC core for the sending side and another for the 
receiving side. These are used to perform the functions related to the Large Send 
Offload (LSO) and Large Receive Offload (LRO) (the RISC design will be 
discussed in chapter 6).   
• Design and implementation of a Content Addressable Memory (CAM) based SRAM 
[79] for storing the connection information [23, 88], which contains the active 
TCP connections, to help the RISC on the receiving side reconstruct incoming 
packets into a large packet using the Linked-list scheme. 
• Design and implementation of the DMA for data movement. 
• Memory management for the Receiving Buffer (RB) and the Sending Buffer 
(SB).  
 
[Diagram: the NI is divided into the Host Interface (HI), the Processing Area and the Line Interface (LI). Sending side: Sending Buffer (SB), memory management, DMA, RISC core (SEP), Sending Buffer Interface (SBI), FIFO 4 (status and control messages) and FIFO 5 (sending status), connected by a local bus. Receiving side: Receiver Buffer Interface (RBI), CAM, RISC core (REP), DMA, memory management, Receiving Buffer (RB), FIFO 1 (signalling packets), FIFO 2 (pointers to cascading packet) and FIFO 3 (status and control messages), connected by a local bus. An interruption signal is raised if the RB runs out of space. The HI connects to and from the host over the system bus.]

FIFO 1: To provide the start-address for each received signaling packet.  
FIFO 2: To provide the start-address for each amalgamated packet inside the RB.  
FIFO 3: To provide the new TCP connections to the REP. 
FIFO 4: To provide the necessary information such as the sending request, which includes 
the Maximum Segment Size of the payload part, or to inform the SEP to slow down (when 
the receiver buffer is getting full).  
FIFO 5: To exchange the necessary information with the host, such as the free space in the SB.  
DMA: Direct Memory Access.  
CAM: Content Addressable Memory.  
SEP: Send Embedded Processor.  
REP: Receive Embedded Processor. 
 
 
 
Figure 5.1: Network interface block diagram 
 
 
 
 
5.2.1 Network Interface Buffering 
First-In-First-Out (FIFO) buffers have been implemented to provide high flexibility in 
terms of exchanging information between the RISC cores and the host CPU. These 
buffers perform the following tasks: 
• FIFO 1 provides the host CPU with the start-address of each signaling message 
received. 
• FIFO 2 provides the host CPU with the start-address of each large packet that has 
been amalgamated inside the RB. 
• FIFO 3 provides the new connection information from the host CPU to the RISC 
core at the receiving side.  
• FIFO 4 provides the RISC core at the sending side with the location of messages 
that have been stored in the SB, or controls the transmission speed when the 
destination buffer has reached saturation point and there is only limited space 
available for arriving packets. In this prototype, a priority option is used to 
arrange for small packets to be moved to the top of the FIFO [128]. Since 
applications should not make assumptions about the sending behavior, a raised 
bit associated with the CPU order is applied [127]. When higher-priority entries 
are present in the FIFO, the FIFO entries are re-queued to move the entry with 
the raised bit to the head of the FIFO. Alternatively, another FIFO, similar to 
FIFO 1, can be added to the sending side to carry small packets or SSMs 
instead of using the high-priority mechanism.  
• The RISC uses FIFO 5 to signal to the host CPU that it has finished transmitting 
a message and to send information on the free space pointers that have become 
available within the SB.  Instead of triggering an interrupt after completing the 
sending process for a message, the core processor sends the completion 
descriptors through this FIFO. The time for sending 64 KB, for example, has to 
be less than the period required for retransmission [93, 113]. The performance of 
the RISC will be discussed in chapter 7.      
Packet processing uses several packet buffers to store packets temporarily in the NI.  
These buffers are organized as follows: 
• The Receiver Buffer Interface (RBI) stores the received valid packets. 
• The Receiver Buffer (RB) stores packets that have been sent from the RBI (e.g., the 
amalgamated large packets). 
• The Sending Buffer Interface (SBI) stores packets until they are 
delivered to the MAC unit.  
• The Sending Buffer (SB) stores the information received from the 
host CPU, such as packets larger than the MTU. The data packets need to be 
segmented or fragmented into maximum segment sizes (MSS) by the Sending 
Embedded Processor (SEP) and then delivered to the SBI. 

 
5.2.2 Data Transfer 
When data needs to be moved between the RBI and the RB or between the SB and the 
SBI, the RISC core initiates and controls the DMA through the DMA Request 
signal (DRQ). Since the local buses of the receiving side and the sending side are each 
shared between the DMA and the RISC core, the RISC core needs to release the local bus 
to the DMA to perform the data block transfer.  
 
5.2.2.1  DMA for Data Transfer  
The RISC core initiates the DMA with the block length (number of words to transfer) 
and the addresses of the SB or the RB. The DMA controller is designed to transfer data 
in one direction only, from the RBI to the RB or from the SB to the SBI, which 
reduces the complexity of the DMA design. The top-level structure diagram of the 
DMA is shown in Figure 5.2. Each DMA transfer moves 64 bits of data and consumes two 
RISC cycles. In the first cycle, the DMA controller reads the source buffer to load the 
64 bits into the DMA's register. During the second cycle, the data moves from the 
DMA's register to the destination buffer. The DMA state machine provides the read and 
write signals to the source and destination buffers (Figure 5.3). It also increments the 
target memory addresses after storing data in a destination address. 
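
The two-cycle transfer described above can be modelled in a few lines. This is a behavioural sketch only, not the VHDL state machine: buffers are represented as Python lists of 64-bit words, and the function name and arguments are hypothetical.

```python
def dma_transfer(source, dest, src_addr, dst_addr, n_words):
    """Copy n_words 64-bit words and return the number of cycles consumed."""
    cycles = 0
    for _ in range(n_words):
        register = source[src_addr]   # cycle 1: read the source into the DMA register
        cycles += 1
        dest[dst_addr] = register     # cycle 2: write the register to the destination
        cycles += 1
        src_addr += 1                 # addresses advance after the word is stored
        dst_addr += 1
    return cycles
```

Moving three words therefore costs six cycles in this model, during which the RISC core has released the local bus.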
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 5.2: DMA structure 
 
 
 
[Flowchart: on a RISC request, the DMA request supplies the data size, the destination address and the source location bit (0 -> buffer 1, 1 -> buffer 2). On each clock, the DMA reads from the source (RBI) and writes to the destination (RB), updating the source and destination addresses, and repeats until the N total bytes have been moved; it then raises the DMA transaction complete signal.]
 
 
Figure 5.3: DMA cycles to transfer data from the source to the destination 
 
 
 
[Diagram labels: RISC_Req, DMA Controller (DMA state machine controller, data register, bus control), Sel_R_W, Address_in_out, Data_in, Data_out, local bus]
 
5.2.2.1.1 Channel Configuration 
 
In this research, the DMA channel configuration is straightforward. There is only one 
source and one destination. For the receiving side, the RBI has two buffers, and the DMA 
initiation instruction contains one bit which identifies the buffer from which to 
fetch data. The RBI and the SBI are used for storing packets, which are stored in two 
separate buffers. In order to recognize which buffer to read 
from or write to, an “x” bit is added to the initiation instruction. For instance, if the 
“x” bit is equal to “0”, the data is fetched from the upper buffer.  Alternatively, if the 
“x” bit is equal to “1”, the data is fetched from the lower buffer. Figure 5.4 shows the 
DMA channel. The DMA channel is implemented as follows: 
• Source address: from where to fetch data. 
• Destination address: where to store data. 
• Burst count: the number of bytes which the DMA channel moves from 
the source to the destination before releasing the local bus.  
• Transfer count: the number of bytes to be moved from the source to the 
destination. 
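
The four channel fields listed above can be pictured as a small record. The sketch below is illustrative: the class name, the field names and the bursts helper are assumptions, and the way a transfer is split into bursts is one plausible reading of the burst count rather than the documented behaviour of the VHDL design.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    source_address: int       # from where to fetch data
    destination_address: int  # where to store data
    burst_count: int          # bytes moved before the local bus is released
    transfer_count: int       # total bytes to be moved

    def bursts(self):
        """Split the transfer into bursts; the bus is released between them."""
        remaining = self.transfer_count
        sizes = []
        while remaining > 0:
            size = min(self.burst_count, remaining)
            sizes.append(size)
            remaining -= size
        return sizes
```

For instance, a 1460-byte transfer with a 512-byte burst count would be moved as bursts of 512, 512 and 436 bytes.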
 
[Diagram: channel configuration record with the fields Source Address, Destination Address, Burst Count, Transfer Count and Next]
 
Figure 5.4: DMA channel 
 
 
5.2.2.2 Bus Width  
This design parameter deals with the width of the data and address buses. The 
Data Bus width determines the amount of data transferred over the bus at a time (e.g., 
from the Line Interface to the Host Interface). Although the instruction set of the RISC 
is 32 bits, the Data Bus does not need to correspond to this value. The bus width affects 
the NI performance: the wider the Data Bus, the higher the bandwidth [87].  
The proposed model has two data buses available to transfer data, one transferring data 
from the Sending Buffer (SB) to the Sending Buffer Interface (SBI) and the other 
transferring data from the Receiving Buffer Interface (RBI) to the Receiving Buffer 
(RB). Both 32-bit and 64-bit widths are feasible for these buses, in that either performs 
the basic transfer of data required. The preference is to use a 64-bit rather than a 32-bit 
data bus within the packet processing.  This preference is based on the fact that with the 
32-bit option the data transfer requires more cycles (shown in Table 3.3 and Table 4.1).  In 
comparison, the 64-bit option reduces the number of DMA cycles, since each 
cycle carries more data. It is also an option to use a bus which is wider 
than 64 bits. However, this option requires more hardware pins and therefore 
adds more complexity to the design of the Network Interface. 
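
The effect of the bus width on the number of transfers can be checked with a short calculation. The function name is illustrative, and the sketch counts bus transfers only; it does not model arbitration or the two cycles per DMA transfer.

```python
def transfers_needed(payload_bytes, bus_width_bits):
    """Number of bus transfers needed to move a payload (rounded up)."""
    bits = payload_bytes * 8
    return (bits + bus_width_bits - 1) // bus_width_bits
```

A 1460-byte payload needs 365 transfers on a 32-bit bus, which coincides with the TCP data transfer count for 1500-byte packets in Table 4.1, and 183 transfers on a 64-bit bus.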
 
5.2.3 Content Addressable Memory  
Two approaches were tested to integrate the Content Addressable Memory (CAM) 
within the NI. In the first method, the CAM hosts the connection identifier of the active 
connections, such as the IP address and the Port ID. Each CAM entry is associated with 
the Linked-list data. As soon as a match is found, the Linked-list data is released to the 
REP core (Figure 5.5). 
The connection information is 64 bits, which includes the destination IP address (32 
bits) and the source and destination Port IDs (32 bits). The depth of the CAM can be 
extended as far as desired, but the width is limited by the size of the memory. 
[Diagram: an input connection identifier is compared against the CAM entries; when a match is found, the Linked-list information related to the matched data is output]
 
 
 
Figure 5.5: CAM structure when the Linked-list data is associated with the connection 
information 
 
The width of the CAM is 192 bits. The first 64 bits of the CAM entry are used as a 
Matching Unit (MU). Each MU is associated with the Output Results (OUTR). The first 
64 bits of the OUTR contain the Start-address (head of the Linked-list) and the Sequence 
Number. The second 64 bits contain the End-address (tail of the Linked-list) and the 
total amalgamated bytes inside the RB of the connection. The read signal 
(read_end_start) takes one of two values, “0” or “1”, and is read by the RISC. When the 
signal is equal to “0”, the first 64 bits (the Start-address with the Sequence Number) are 
sent out from the CAM as the OUTR signal, which is then read by the RISC. When the 
read_end_start signal is equal to “1”, the second 64 bits of the OUTR (the End-address 
with the total amalgamated bytes) are sent out as the OUTR signal.  
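
The entry layout described above can be summarised in a short sketch. It assumes that read_end_start = 0 selects the first 64 bits of the OUTR and 1 selects the second 64 bits; the class and function names are hypothetical, and the fields are kept as separate integers because the text does not give their individual bit widths.

```python
from dataclasses import dataclass


@dataclass
class CamEntry:
    match_unit: int   # 64-bit connection information (MU)
    start_addr: int   # head of the Linked-list
    seq_no: int       # Sequence Number
    end_addr: int     # tail of the Linked-list
    total_bytes: int  # total amalgamated bytes inside the RB


def read_outr(entry, read_end_start):
    """Return the half of the OUTR selected by read_end_start (assumed 0 or 1)."""
    if read_end_start == 0:
        return entry.start_addr, entry.seq_no
    return entry.end_addr, entry.total_bytes
```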
This implementation can be used in the proposed network interface as a Look-up table 
when searching for the connection's match. However, this method requires more 
processing to be carried out in order to amalgamate packets in the RB. After the 
comparison is completed and a match is found (between the arrived packet and the 
active connection in the CAM), the REP reads the CAM's output. The core is required to 
read the first 64 bits of the OUTR to compare them with the incoming Sequence Number 
of the arrived packet (expected sequence). After it has initiated the DMA to move the 
data packet, the REP also needs to update the CAM entry (e.g., updating the total 
number of bytes, located within the second 64 bits of the CAM). The identification of a 
match is required when the REP needs to read or update the CAM entry.  
 
5.2.4 The CAM Implementation inside the proposed NI 
The second option is to send the memory address after a match has been found [79]. 
Each CAM entry is associated with a memory address inside the Selected RAM 
(SRAM), which stores the Linked-list details (Figure 5.6). This scenario gives the RISC 
core enough room to store the Linked-list data. In addition, this approach can accommodate 
future changes, such as extending the size of the connection data or modifying 
the Linked-list. 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 5.6: CAM based implementation of the look-up-table 
 
 
 
In the simulated VHDL-based CAM, each CAM entry has one pointer that points 
to where the connection information (the IP address, Port ID and sequence number for 
TCP, and the IP address, Port ID and offset details for UDP) is located inside the Match 
Unit (MU). The CAM is used for searching and updating information. When a new 
entry needs to be stored in the CAM, the update operation is used to insert it. 
Figure 5.7 presents the block diagram of the CAM-based search engine.   
 
 
[Diagram: the input bits are compared against the Matching Unit (MU) entries; the output result is the memory address where the connection is stored, which is used to read the memory. Control signal: 0 if the match was not found, 1 if the match was found.]
 
 
 
 
 
 
 
 
CAM (*): stores the 64-bit connection ID 
- Source IP 32 bits 
- Source port 16 bits 
- Destination port 16 bits 
SRAM (**): stores the connection info   
- SN of the next packet 32 bits 
- Start-address of the large packet inside RB 
- End-address  
- Total bytes amalgamated 
 
Figure 5.7: CAM-based search engine block diagram 
 
The write signal "Sel_CAM" is sent by the REP to update the CAM (adding a new 
entry). The processor searches for the first location filled with zeroes (blank location) in 
the CAM and then replaces this first entry of zeroes with the new address. The input and 
output signals of the CAM are defined in Table 5.1. To check for a match, the RISC 
sends the relevant data to the CAM. If no entry of the CAM matches the input data, the 
match signal is equal to “0”, indicating a miss. 
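
The search and update behaviour described above can be modelled as follows. This is a behavioural sketch, not the VHDL entity: the class name, the method names and the use of zero as the blank marker are assumptions drawn from the description, and a real CAM compares all entries in parallel rather than in a loop.

```python
class Cam:
    def __init__(self, depth):
        self.entries = [0] * depth  # zero marks a blank location

    def lookup(self, din):
        """Return (found, match_addr); found is 1 on a hit and 0 on a miss."""
        for addr, entry in enumerate(self.entries):
            if entry != 0 and entry == din:
                return 1, addr
        return 0, None

    def write(self, din):
        """Store din in the first blank location; return its address, or None if full."""
        for addr, entry in enumerate(self.entries):
            if entry == 0:
                self.entries[addr] = din
                return addr
        return None

    def clear(self, din):
        """Replace a matching entry with zeroes, e.g. when a connection terminates."""
        found, addr = self.lookup(din)
        if found:
            self.entries[addr] = 0
        return found
```

The address returned by lookup would then be used to read the connection information from the SRAM.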
 
 
[Diagram signals: Sel_CAM, Read, Read/write, DIN, Reset, Match_Addres and Found (1 = found, 0 = not found); the CAM holds Connection ID 1 to n (*), each paired with a Connection info entry (**) in the SRAM]
 
 
Table 5.1: Input and output signals of the CAM 

Signal Name | Direction | Description
CLK | Input | Clock: CAM operations are synchronous to the rising edge of the clock input.
DIN | Input | Data In: Data to be written to the CAM during the write operation. Also, data to look-up from the CAM during the read operation, when simultaneous read/write option is not selected.
Sel_CAM | Input | Write Enable: Control signal used to enable the transfer of data into the CAM from the DIN bus.
Match_Addr | Output | Match Address: The CAM address where the matching data resides.
Found | Output | Match: This signal indicates that at least one location in the CAM contains the same data as the DIN bus (or CMP_DIN, if simultaneous read/write mode).
 
Two cycles are needed in order to obtain the required data following the completion of 
the matching process (Figure 5.8). The first cycle is needed to identify the corresponding 
data and the second cycle is needed to read the connection data.  
 
Figure 5.8: Cycles required during read operations of a Block from Select RAM memory 
 
 
If no match is found, the RISC considers the arrived packet as a lost packet. If any entry 
of the CAM matches the input data, the CAM produces a signal equal to “1”, indicating that 
a match was found. After a match is found, the processing continues by reading the other 
signals to determine the next procedure, which is to read the SRAM 
address location of the linked data. When receiving the last packet of a stream (when the 
MF bit is equal to “0” [63] or the Push flag is raised [33]), or after the amalgamation 
time expires (the Interrupt Moderation timer has expired), the RISC is required to update 
the large packet in the RB, including the final length and the final ACK number. Zero 
bits are also stored in the connection information associated with the ended stream. After a 
connection between the two ends has terminated and the core has received the status of 
the connection from the host CPU (through FIFO 3), the core finds the match for the 
connection (the IP address and associated ports) and replaces it with zeroes.  
Each of the CAM entries is associated with two pointers of the Linked-list; the Start-
address and the End-address of the Linked-list. Updating the Start-address and End-
address is performed as follows: 
1. The RISC core reads the End-address from SRAM (after a match is found) to add a 
new node at the end of the Linked-list.  
2. Write the End-address inside SRAM after receiving a BOM and COM.       
3. Write the Start-address after receiving the first TCP packet of a stream. There is also 
a need to send the Start-address to FIFO 2 when the last packet of a stream is 
discovered. The host CPU then reads the Start-address from FIFO 2 in order to read 
the large packets from the RB. 
127 
 
4. The processing of UDP is the same as for TCP, except the core engine needs to store 
the connection information inside of the CAM. When the BOM has arrived at the 
network interface (the MF bit inside the IP header is equal to "1") the REP stores the 
IP address and the Port IDs in the CAM. 
5. The embedded processor uses the Start-address inside the SRAM to designate 
whether the message is a COM or a BOM message. When the Start-address value is 
equal to “0”, the REP indicates that the current message is a BOM because no 
Linked-list has been created for this stream. 
6. The Push flag inside the TCP header [33] is used to distinguish whether an arrived 
packet is a COM or an EOM. If the Push flag is equal to “1” then the current packet 
is an EOM. Otherwise, it is a COM. 
7. The RISC core uses the Start-address to differentiate between SSM and EOM packets. 
Both types of packets have the Push flag set to “1” in the TCP header. However, 
when a match is found, the REP reads the Start-address. If the Start-address is equal 
to “0” then the message is an SSM because no Linked-list has been created for the 
stream before.   
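
The classification rules in items 5 to 7 can be summarised in a short sketch. It is written in Python purely for illustration (the thesis implements this logic in the RISC program); the function name classify_tcp_packet and the string labels are assumptions introduced here.

```python
def classify_tcp_packet(start_address: int, push_flag: bool) -> str:
    """Classify a TCP packet whose connection matched a CAM entry.

    start_address: Start-address read from SRAM (0 means no Linked-list yet).
    push_flag:     Push flag taken from the TCP header.
    """
    first_of_stream = (start_address == 0)
    if first_of_stream:
        # No Linked-list exists yet: a one-packet message or the beginning of one.
        return "SSM" if push_flag else "BOM"
    # A Linked-list already exists: the Push flag marks the end of the message.
    return "EOM" if push_flag else "COM"


if __name__ == "__main__":
    # 0x4000 is an arbitrary non-zero Start-address used only for the demonstration.
    for start, psh in [(0, False), (0, True), (0x4000, False), (0x4000, True)]:
        print(hex(start), psh, classify_tcp_packet(start, psh))
```

Only two inputs are needed per packet, which is why the decision can be taken with two compare-and-branch instructions once the CAM match has returned the Start-address.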
 
 
5.3 The Network Interface FIFOs 
The NI communicates with the host through five FIFO-based memory buffers [112]. This 
prototype requires the implementation of a large FIFO to provide enough buffering space 
in the NI. The FIFO is configured to be 64 bits wide with a programmable depth of 511 
words, costing 10 Block SelectRAMs. 
The pointer of each FIFO is stored in a RISC register, and the RISC can access any 
FIFO after reading its address. However, the interrupts raised during the exchange of 
information affect the overall performance of both the NI and the host CPU. Depending 
on the type of host CPU being used, handling the interrupts caused by the NI costs the 
host about 13 to 14 percent of its total CPU cycles [31, 34, 76]. The proposed network 
interface has an enhanced interrupt mechanism which operates between the host CPU and 
the RISC cores during the LRO and the LSO processing. Several mechanisms have been 
implemented in the NI in order to reduce the interrupts exchanged between the host CPU 
and the embedded processor. 
1. Rather than interrupting the host CPU with the arrival of each packet (e.g., every 
123 ns if the line speed is 100 Gbps), the RISC core sends the Start-address of the 
amalgamated packets and the total length of the amalgamated data to FIFO 2 (Figure 
5.9). The host continues fetching large packets from the RB until it reaches the NULL 
value, which is placed at the last entry of the FIFO, or until the Interrupt Moderation 
timer expires.  
2. The NI should pass the signalling packets (e.g., FIN packets) received at the NI to 
the host. The Start-address along with the total size is sent to FIFO 1. A 
signalling packet does not need more processing cycles than a data packet. Since 
there is no amalgamation involved, the REP sends it ‘as is’ to the RB. 
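
The 123 ns figure quoted above is consistent with a 1500-byte packet plus per-frame Ethernet overhead. The sketch below assumes 38 bytes of overhead (preamble and start delimiter, MAC header, FCS and inter-frame gap); this overhead value and the function name are assumptions made here for illustration, not figures taken from the thesis.

```python
# Assumed overhead: 8 (preamble/SFD) + 14 (MAC header) + 4 (FCS) + 12 (inter-frame gap).
ETHERNET_OVERHEAD_BYTES = 38


def arrival_interval_ns(packet_bytes: int, line_rate_gbps: float,
                        overhead_bytes: int = ETHERNET_OVERHEAD_BYTES) -> float:
    """Time between back-to-back packets on the wire, in nanoseconds."""
    bits_on_wire = (packet_bytes + overhead_bytes) * 8
    return bits_on_wire / line_rate_gbps   # bits divided by Gbit/s gives ns


if __name__ == "__main__":
    print(round(arrival_interval_ns(1500, 100.0), 2))   # 1500-byte packet at 100 Gbps
    print(round(arrival_interval_ns(1500, 10.0), 2))    # same packet at 10 Gbps
```

Under these assumptions a per-packet interrupt would reach the host roughly every 123 ns at 100 Gbps, which is the rate that the FIFO-based notification is intended to avoid.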
There are no interruptions used when the host sends information to the NI. All 
information is delivered to the NI through three FIFOs. This information can be 
described as follows: 
 
 
 
 
 
 
 
 
 
 
 
Figure 5.9: The two FIFOs are used to send data from the receiver RISC processor to the 
host CPU 
 
1. The host negotiates with other hosts whenever a new connection is required. After 
a TCP connection is established between two remote ends, there is a specific IP 
address and Port ID for each TCP connection. These identifiers are carried in the 
packet header(s). The RISC cores in the NI use the IP address and the Port ID of the 
arrived packets in order to identify them. The host uses FIFO 3 to deliver these 
identifiers to the receiving side of the NI (Figure 5.10). As the data arrives at 
the NI, the CAM entries are updated by storing the identifiers that were fetched 
from FIFO 3. The embedded RISC at the receiving side checks FIFO 3 after finishing 
the reassembly processing of each TCP packet (e.g., every 123 ns if the NI is 
connected to a 100 Gbps line and the packet size is 1500 bytes). 
 
[Figure 5.9 labels: FIFO 1 holds the Start-address of each received signalling packet; 
FIFO 2 holds the address of each amalgamated packet inside the RB; both FIFOs connect 
through the local bus to the host.] 
 
 
 
 
 
 
 
 
 
 
Figure 5.10: The host sends the active TCP connections to the receiving side through FIFO 3 
 
 
2. When the host moves the large packets to the SB, the host CPU notifies the sender 
RISC of the information required for segmenting the packets, such as the maximum 
segment size (MSS), through FIFO 4 (Figure 5.11). In the other direction, the RISC 
core sends the pointers of the available space inside the SB to the host CPU through 
FIFO 5. In this prototype, the status of the sending data is also sent through FIFO 5. 
The first bit of each FIFO entry is used to distinguish between pointers and status 
messages. If the bit is equal to ‘1’, then the FIFO entry indicates free pointers.  
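
The tagged FIFO entry described in item 2 can be sketched as follows. The text only says that the first bit selects between free pointers and status messages; treating that bit as the most significant bit of a 64-bit entry, and the names pack_entry and unpack_entry, are assumptions made here for illustration.

```python
from typing import Tuple

ENTRY_BITS = 64                        # FIFO width given in Section 5.3
TAG_SHIFT = ENTRY_BITS - 1             # assumed: the "first bit" is the most significant bit
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1    # remaining 63 bits carry the pointer or status


def pack_entry(is_free_pointer: bool, payload: int) -> int:
    """Build one FIFO entry: tag bit 1 for a free pointer, 0 for a status message."""
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload does not fit in 63 bits")
    return (int(is_free_pointer) << TAG_SHIFT) | payload


def unpack_entry(entry: int) -> Tuple[bool, int]:
    """Split a FIFO entry back into (is_free_pointer, payload)."""
    return bool(entry >> TAG_SHIFT), entry & PAYLOAD_MASK


if __name__ == "__main__":
    entry = pack_entry(True, 0x1234)       # hypothetical free-space pointer
    print(hex(entry), unpack_entry(entry))
```

Sharing one FIFO for both kinds of message costs a single bit per entry and lets the host read pointers and status with the same access path.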
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 5.11: The FIFO carries the information needed for segmenting a message 
 
[Figures 5.10 and 5.11 labels: FIFO 3 (local bus, from the host) contains the TCP 
connection information and is shown with the REP and the CAM; FIFO 4 (local bus, from 
the host) has the information needed for large sending offload processing; FIFO 5 
sends information to the host CPU, such as the status of the sending data.] 
 
 
 
 
5.4 The Interface Buffers 
5.4.1 Memory Management 
The memory allocation algorithm is a variation of sequential fit [44]. Best-fit memory 
allocation is one option that makes the best use of the RB space. However, it is 
slower in allocating data because it has to search the entire list of memory blocks 
to find the closest match. Worst-fit memory allocation is the opposite of Best-fit: 
it allocates the largest available free block to the new job, which is not the best 
choice for this particular NI. The other option is First-Fit memory allocation, which 
is faster in making allocations but leads to memory waste. Since timing is critical 
in high-speed communications, First-Fit memory allocation is applied to allocating 
packets inside the RB and the SB. 
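
The trade-off between the two strategies can be seen in a minimal sketch. It is written in Python for illustration and is not the allocator used in the NI; the function names and the free-block sizes are hypothetical.

```python
from typing import List, Optional


def first_fit(free_blocks: List[int], request: int) -> Optional[int]:
    """Return the index of the first free block large enough, or None."""
    for index, size in enumerate(free_blocks):
        if size >= request:
            return index           # stops at the first match: short search
    return None


def best_fit(free_blocks: List[int], request: int) -> Optional[int]:
    """Return the index of the smallest free block large enough, or None."""
    best = None
    for index, size in enumerate(free_blocks):    # always scans the whole list
        if size >= request and (best is None or size < free_blocks[best]):
            best = index
    return best


if __name__ == "__main__":
    free = [8192, 2048, 65536, 1536]    # hypothetical free block sizes in bytes
    print(first_fit(free, 1500), best_fit(free, 1500))   # prints "0 3"
```

First-Fit places the 1500-byte request in the 8192-byte block after one comparison, wasting space, whereas Best-fit finds the 1536-byte block only after examining every entry.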
The Dual-ported memory is divided into 255 blocks of free-shared memory, each with 
64 KB (2^16 bytes). Each block can accommodate 64 KB from a single connection. Up to 
200 connections can be processed with the Interrupt Moderation (more details in section 
6.2.2).  Since every access to memory must be mapped from a virtual to a physical 
address, reading the page table every time can be quite costly. Therefore, a circulation 
buffer (CB) is used to acquire the address of the memory pages (Figure 5.12).  
All pointers of the Start-address of the pages are stored in the CB. Two pointers are used 
to manage the CB contents. The head-pointer points to the first page of the RB and the 
tail-pointer points to the last page. The Memory Management State Machine updates the 
head-pointer after the embedded processor requests a new address page to store the 
arrived packet. In addition to this, a counter is used for tracking the number of pages that 
have been used.  
 
 
 
 
 
 
 
 
 
 
Figure 5.12: Circulation Buffer architecture 
 
The counter increases by one each time the RISC requests an address to accommodate the 
first packet (Figure 5.13a). A signal is sent to the embedded processor (the limit_size 
signal) when the counter reaches 200, that is, when 200 pages have been requested by 
the RISC (Figure 5.13b). The RISC then sends an interrupt signal to the host processor, 
which initiates the withdrawal of messages from the RB. The host CPU replies to the 
interrupt signal and starts pulling data from the RB. Between the interrupt request and 
the host CPU’s response, new packets will continue to arrive at the NI.   
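
The behaviour of the circulation buffer and its page counter can be sketched in software as follows. This is an illustrative Python model, not the VHDL state machine; the class and method names are hypothetical, and returning a drained page at the tail is an assumption based on the head and tail description above.

```python
class CirculationBuffer:
    """Ring of page Start-addresses with head/tail pointers and a usage counter."""

    def __init__(self, num_pages: int = 255, page_size: int = 64 * 1024,
                 limit: int = 200) -> None:
        self.slots = [page * page_size for page in range(num_pages)]
        self.num_pages = num_pages
        self.head = 0       # next free page to hand out
        self.tail = 0       # next slot to receive a returned page
        self.used = 0       # counter of pages currently in use
        self.limit = limit

    def request_page(self) -> int:
        """The RISC asks for a page; returns its Start-address."""
        if self.used == self.num_pages:
            raise MemoryError("RB is full")
        address = self.slots[self.head]
        self.head = (self.head + 1) % self.num_pages
        self.used += 1
        return address

    def release_page(self, address: int) -> None:
        """The host has drained a page; its Start-address is added at the tail."""
        if self.used == 0:
            raise ValueError("no page is in use")
        self.slots[self.tail] = address
        self.tail = (self.tail + 1) % self.num_pages
        self.used -= 1

    @property
    def limit_size(self) -> bool:
        """Mirrors the limit_size signal: true once `limit` pages are in use."""
        return self.used >= self.limit


if __name__ == "__main__":
    cb = CirculationBuffer()
    pages = [cb.request_page() for _ in range(200)]
    print(cb.limit_size, cb.num_pages - cb.used)   # signal raised, pages still free
```

Because the addresses are precomputed and handed out in order, obtaining a page costs one read and one pointer update rather than a page-table lookup.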
 
[Figure 5.12 labels: the CB holds the pointers to the space left inside the RB, between 
the Head-of-the-CB and the Tail-of-the-CB. A pointer is read after each packet is 
received, and the head of the list is updated after reading one pointer; the tail of 
the list is updated after adding a new pointer.] 
 
 
 
 
 
 
 
Figure 5.13a: Tracking the size of the RB 
 
 
 
 
 
 
Figure 5.13b: Signal sent when 200 pages of the RB are occupied 
 
As mentioned before, the RB is divided into 255 pages, which means there will still be 
55 pages remaining in the RB that can be used to store the new packets. Assuming that 
the 55 new packets are small (64 bytes), the total time budget available for the host 
CPU to process the interrupt and pull the packets from the RB would be 368.5 ns (55 × 
6.7 ns). The point at which the interrupt is sent to the host can be adjusted in 
accordance with the interrupt processing time of the host CPU. 
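
The relationship between the counter threshold and the host's interrupt latency can be made explicit with a small calculation. The sketch below takes the per-packet time of about 6.7 ns implied by the 368.5 ns budget for 55 small packets; the function names are hypothetical and the host latency passed in is an arbitrary example value.

```python
import math

TOTAL_PAGES = 255          # pages in the RB (from the text)
SMALL_PACKET_NS = 6.7      # per-packet time implied by 368.5 ns / 55 packets


def headroom_ns(threshold: int, per_packet_ns: float = SMALL_PACKET_NS) -> float:
    """Worst-case time before the RB fills once the interrupt has been raised."""
    return (TOTAL_PAGES - threshold) * per_packet_ns


def threshold_for_latency(host_latency_ns: float,
                          per_packet_ns: float = SMALL_PACKET_NS) -> int:
    """Largest counter threshold that still leaves the host enough time."""
    pages_needed = math.ceil(host_latency_ns / per_packet_ns)
    return max(TOTAL_PAGES - pages_needed, 0)


if __name__ == "__main__":
    print(round(headroom_ns(200), 1))        # budget with the threshold of 200 pages
    print(threshold_for_latency(1000.0))     # threshold for a hypothetical 1 us latency
```

A slower host therefore calls for a lower threshold, at the cost of interrupting the host more often.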
[Figure 5.13 labels: Counter with Clk, Reset and Enable inputs; Compare block; Max_size 
from the RISC (0xC8 = 200); output Limit_size.] 

Signalling packets (packets without data), which are 64 bytes or less, are stored in one 
memory page, which saves memory space. As the size of the memory page is only 64 
KB, it is able to hold no more than about 1000 signalling packets within the IM timer. To 
accumulate the signalling packets in one memory page, the REP uses three internal 
registers. The first register holds the Start-address of the memory location. The second 
register holds the End-address. The third register holds the total number of bytes that 
have been stored in the memory page. The Start-address and the total bytes are sent to 
FIFO 1. 
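
The three-register bookkeeping for signalling packets can be sketched as below. This is an illustrative Python model under stated assumptions: the class and method names are hypothetical, and the End-address is taken to point at the next free byte of the page, which the text does not specify.

```python
PAGE_SIZE = 64 * 1024    # bytes per memory page (from the text)


class SignallingPage:
    """Accumulates signalling packets in one memory page using three registers."""

    def __init__(self, start_address: int) -> None:
        self.start_address = start_address   # register 1: Start-address of the page
        self.end_address = start_address     # register 2: End-address (assumed next free byte)
        self.total_bytes = 0                 # register 3: bytes stored so far

    def append(self, packet_len: int) -> int:
        """Reserve room for one signalling packet; return where it is written."""
        if self.total_bytes + packet_len > PAGE_SIZE:
            raise MemoryError("signalling page is full")
        write_at = self.end_address
        self.end_address += packet_len
        self.total_bytes += packet_len
        return write_at

    def fifo1_entry(self) -> tuple:
        """Values reported to the host through FIFO 1."""
        return (self.start_address, self.total_bytes)


if __name__ == "__main__":
    page = SignallingPage(0x10000)           # hypothetical page Start-address
    for _ in range(3):
        page.append(64)
    print(page.fifo1_entry())
```

With 64-byte packets a 64 KB page fills after 1024 appends, which matches the limit of roughly 1000 signalling packets per page mentioned above.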
 
5.4.2 The Receiving and Sending Buffer   
The NI has a Receiver Buffer (RB) that is used to reassemble the packet bodies arriving 
from the network and to store them until the host CPU is ready to process them.  
The size of the RB buffer is 4 GB. This buffer can hold 256 connections where each 
connection may contain 64 KB of data. The Sending Buffer (SB) stores the large packets 
(e.g., TCP/IP or UDP/IP packets) that have been sent by the host to be segmented by the 
RISC into MSS-sized segments. The RISC adds the packet headers to each segment. In this 
prototype, the SB can hold 128 TCP or UDP messages (each with 64 KB of TCP or UDP data). 
 
5.4.3 Receiver and Transmission Line Buffers 
When a packet arrives at the Receiver Buffer Interface (RBI), the Finite State Machine 
(FSM) in the RBI enables one of the two buffer locations to hold the serial bits arriving 
from the transmission line. The FSM switches to the other buffer after interrupting the 
RISC core on the receiving side. The RISC core starts processing the packet header, 
which is located at the top of the packet body (Figure 5.14). Even though there will be a 
buffer within the MAC unit (discussed later in this chapter), the processing of each 
packet in the Line Interface has to be finished within the budgeted time frame while 
retaining space for the arrival of the next packets, especially when signalling packets 
arrive between the large packets (the core performance will be discussed in the next 
chapter). 
Two signals are used for communication between the RBI and the RISC. After a valid 
packet has arrived at the RBI’s buffer, the FSM interrupts the RISC by raising the 
V_packet signal to ‘1’ to indicate that there is a valid packet in the buffer. The RISC 
continues processing the packets until the V_packet signal changes to ‘0’, indicating 
that there is no packet in the RBI. 
 
[Figure 5.14 labels: two dual-ported memories, each with a Select (0,1) input, and an 
FSM; F_packet and V_packet signals; connections labelled "From MAC unit" and "Local 
bus".] 

Figure 5.14: Receiving Buffer Interface architecture 

To avoid a packet being overwritten before it has been read, the F_packet signal is 
used. After the RISC completes processing the packet, the RISC switches F_packet to 
‘0’. The FSM stores a new packet in the same buffer and then changes the F_packet 
signal to ‘1’.     
Having two buffers within the RBI enhances the packet processing at the proposed PPU. 
Instead of loading the entire packet header after receiving a packet [107] (which requires 
at least 10 RISC cycles to load 40 bytes), the RISC only needs to identify the arrived 
packet by reading the protocol, the flags and the packet length. If the packet is a 
signalling packet (e.g., FIN), the RISC initiates the DMA to transfer the packet ‘as is’ 
from the RBI to the RB. Single Segment Messages (SSM) are also sent ‘as is’ to the RB. 
The packet payload is moved from the RBI to the RB when TCP or UDP packets are to be 
amalgamated inside the RB.   
Using two buffers also simplifies initiating the DMA: the Start-address of the packet 
location (the upper address of the memory buffer) is known to the DMA controller, so it 
knows from where to read the data. The data direction is also known to the DMA 
controller (from the RBI to the RB). These features keep the DMA controller 
uncomplicated.    
The Sending Buffer Interface (SBI) also contains two buffers, each of which holds one 
TCP or one UDP packet.  The state machine controls the SBI and allows only one buffer 
at a time to receive data. The buffer remains enabled until the complete packet has 
been stored (Figure 5.15). The FSM then allows the stored data to be sent out while it 
fills the other buffer. Both the SBI and the RBI have to communicate with the MAC 
unit’s processor for sending and receiving the packets. 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 5.15: Sending buffer interface architecture 
[Figure 5.15 labels: data enters the SBI from the local bus and leaves to the MAC unit; 
one buffer holds packet header n and data body n, the other holds packet header n+1 and 
data body n+1.] 

 
 
5.5 Conclusion  
A Scalable Network Interface Architecture for a line rate of 100 Gbps was designed. 
VHDL, following the IEEE 1164-1993 and 1076-1993 standards and running on the Xilinx 
tools, was used for designing the Network Interface components.  Among the components 
is a simple DMA controller designed for the transmission of data between the Network 
Interface buffers.  Content Addressable Memory was used as a look-up table to support 
the receiving side. FIFOs are used to reduce the interrupts caused by exchanging 
information between the host CPU and the RISC cores. The receiving side and the sending 
side have suitable buffers for storing TCP/IP and UDP/IP packets.  
The next chapter presents the RISC core structure. In addition, it provides the details 
of the instruction set format and discusses the enhancements that are applied to avoid 
Data hazards.   
 
Chapter 6 
 
Developing the RISC Core for TCP/IP and UDP/IP Processing 
 
 
 
 
6.1 Introduction  
A specialized RISC is required to support the proposed LRO and LSO functions. Using a 
specialized RISC core provides more benefits to the NI than using an off-the-shelf 
general-purpose processor (GPP). A Floating-Point unit is not required for processing 
the LRO and LSO (as observed when running the LRO and LSO programs on the SPIM 
simulator). In addition, the limited number of instructions required to support LRO and 
LSO processing can reduce the size of the control unit, improve the speed and reduce the 
core complexity. Reducing the number of pipeline stages is another feature that has been 
developed to improve the execution throughput of the RISC so that it can support high 
line rates of 100 Gbps.  
 
6.2 RISC Pipeline 
Normal RISC pipelines divide the execution of an instruction into a number of steps or 
pipeline stages [44] (Figure 6.1a). The depth of the proposed pipeline corresponds to the 
number of pipeline stages (Figure 6.1b). The NI RISC core has been designed to execute one 
instruction in three pipeline stages: 
 a) Fetch an instruction from local memory (Fetch stage). 
 b) Decode/execute the instruction and read the registers (Decode/Execute stage). 
 c) Store results back into the destination register (Write/Back, or W/B, stage). 
The RISC fetches the instructions of the TCP/IP and UDP/IP protocol program from local 
memory. The Decode/Execute stage executes the instruction that has been fetched by the 
first stage of the pipeline. The last stage of the RISC’s pipeline is W/B, in which the 
data is written to the RISC’s register. Some instructions, such as the Store instruction, 
terminate at the Decode/Execute stage.  
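
As a rough illustration of the three-stage organisation, the following Python sketch lists which instruction occupies which stage in each cycle of an ideal pipeline (no hazards or stalls). The function names are hypothetical, and the sketch ignores the fact that Store instructions finish at the Decode/Execute stage.

```python
STAGES = ["Fetch", "Decode/Execute", "W/B"]


def pipeline_schedule(num_instructions: int) -> dict:
    """Return {cycle: [(instruction index, stage), ...]} for an ideal pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            # Instruction `instr` enters Fetch in cycle instr + 1.
            schedule.setdefault(instr + offset + 1, []).append((instr, stage))
    return schedule


def total_cycles(num_instructions: int) -> int:
    """Cycles needed to drain the pipeline: one per instruction plus the fill time."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for cycle, slots in sorted(pipeline_schedule(3).items()):
        print(cycle, slots)
    print(total_cycles(3))
```

With three stages, the fill time is two cycles, so a long instruction sequence approaches one completed instruction per cycle.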
 
 
 
 
 
Figure 6.1a: Normal Structure of RISC instruction pipeline 
 
 
 
Figure 6.1b: Structure of RISC instruction pipeline 
 
 
A top-level VHDL package for the RISC components is needed to describe the signal types 
used in executing the instructions. These signals include the specific ALU functions, the 
shifter operation and the states needed for the control of the RISC. The processor 
fetches the instructions of the LRO or LSO program (which is stored in the internal 
memory) and executes them. These instructions are decoded by the control unit, which 
provides the appropriate control signals to make the processor unit execute each 
instruction.  Figure 6.2 shows the block diagram of each stage. 

[Figure 6.1 labels: successive instructions (i, i+1, i+2) overlap in the pipeline; the 
stages shown are Fetch, Decode, Execute and W/B, with Decode and Execute sharing one 
stage in the proposed pipeline.] 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 6.2: Block diagram of the Fetch, Decode, Execute and Write/Back with the 
necessary signals 
 
 
 
 
 
6.3 Instructions Set Representation 
The evaluation of the RISC instructions that are required for LRO and LSO has been done 
using the SPIM simulator (Chapters 3 and 4). The set of RISC instructions developed for 
efficient LRO and LSO processing is provided in Table 6.1.   
Table 6.1: RISC Instructions 
Instruction   Function                    Comments 
ADD           r3 ← r1 + r2                Arithmetic operation 
ADDI          r3 ← r1 + imm               Arithmetic operation 
SUB           r3 ← r1 - r2                Arithmetic operation 
AND           r3 ← r1 and r2/imm          Logic operation 
Sr            r3 ← r1 / imm               Logic operation 
LOAD          r1 ← mem                    Load/store instruction (see note) 
LCAM          r1 = (CAM) --> r2           Find the match of r1 with the CAM contents and 
                                          store the CAM data in r2 
STORE         r1 → mem                    Load/store instruction (see note) 
STCAM         r1 → (CAM)                  Store value at Content Addressable Memory (CAM) 
BEQ           r1 = r2/imm → label         Branch if equal 
BGE           r1 >= r2/imm → label        Branch greater or equal 
BLE           r1 <= r2/imm → label        Branch less or equal 
BG            r1 > r2/imm → label         Branch if greater than (imm) 
(DMA)         r1 → (DMA)                  Initiates the DMA, either to move data from SB to 
                                          SBI or from RBI to RB. There is an X bit that is 
                                          used to identify which buffer inside the RBI or 
                                          SBI needs to be used. 
Note: the load/store instructions load or store register values from another register, 
memory location, or with immediate values given in the instruction. 

 
6.3.1 Arithmetic and Logic Operation Instructions 
The arithmetic and logic operation instructions provide computational capabilities for 
processing numeric data. Logic instructions provide logical operations such as and and 
or.  Figure 6.3a presents the Register-to-Register format of the arithmetic and logic 
instructions. Figure 6.3b presents the ‘Immediate’ format of the arithmetic and logic 
instructions. 
 
 
 
 
Bits:    31-27     26   25-21   20-16   15-11   10-0 
Field:   Op-code   F    DES     SRR2    SRR1    X 

Figure 6.3a: Arithmetic/Logic instruction format  

Bits:    31-27     26   25-21   20-16   15-0 
Field:   Op-code   F    SRR1    DES     IMM16 

Where:  SRR1     source register 1 
        SRR2     source register 2 
        DES      destination register  
        F        function bit 
        IMM16    immediate value  
        X        for future use 

Figure 6.3b: Arithmetic/Logic immediate instruction format  

The arithmetic instructions, such as add, are executed as follows: register SRR2 is added 
to register SRR1 and the result is stored in the destination register “DES”. In the 
Add-immediate instruction (addi), the contents of the register to which SRR1 refers are 
added to the immediate value “IMM16”. The result is stored in the destination register 
“DES”. An example of the arithmetic instructions is shown in the following figure. 
 
 
i      add     r3, r2, r1         ; Add r2 to r1 and store the result in r3 
i+1    addi    r4, r3, 10         ; Add the value 10 to r3 and store the result in r4 
 
Figure 6.4: Arithmetic instructions  
 
 
A Read after Write (RAW) data dependency could occur during the program execution; 
however, a forwarding mechanism has been implemented to resolve such a dependency. 
For the two preceding instructions (add and addi), the add instruction stores its result 
in register “r3”, which the addi instruction uses as a source operand. During the 
Decode/Execute stage of the addi instruction, “r3” will not yet have been updated by the 
add instruction, and thus a calculation error would occur. The forwarding mechanism 
keeps this problem from occurring [44]. An F bit in the instruction format is used to 
initiate the forwarding mechanism. If the F bit is equal to “1”, forwarding takes place; 
there is no forwarding action when the F bit is equal to “0”. The F bit is set or reset 
during the program’s execution.       
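
The Register-to-Register format can be illustrated with a short encoder and decoder. The field positions below (Op-code in bits 31-27, F in bit 26, DES in bits 25-21, SRR2 in bits 20-16, SRR1 in bits 15-11) are read from the bit markers of Figure 6.3a; the numeric op-code values and the function names are hypothetical, since the thesis does not list them in this section.

```python
# Hypothetical op-code values, chosen only for this illustration.
OPCODES = {"add": 0b00001, "sub": 0b00010, "and": 0b00011}


def encode_rr(op: str, des: int, srr2: int, srr1: int, forward: bool = False) -> int:
    """Pack a Register-to-Register instruction into a 32-bit word."""
    for reg in (des, srr2, srr1):
        if not 0 <= reg < 32:
            raise ValueError("register index must fit in 5 bits")
    return ((OPCODES[op] << 27) | (int(forward) << 26) |
            (des << 21) | (srr2 << 16) | (srr1 << 11))   # bits 10-0 (X) stay zero


def decode_rr(word: int) -> dict:
    """Unpack the fields of a Register-to-Register instruction word."""
    return {
        "opcode": (word >> 27) & 0x1F,
        "F": (word >> 26) & 0x1,
        "DES": (word >> 21) & 0x1F,
        "SRR2": (word >> 16) & 0x1F,
        "SRR1": (word >> 11) & 0x1F,
    }


if __name__ == "__main__":
    word = encode_rr("add", des=3, srr2=2, srr1=1, forward=True)   # add r3, r2, r1 with F = 1
    print(hex(word), decode_rr(word))
```

Keeping every field at a fixed bit position means the decoder reduces to shifts and masks, which is in keeping with the aim of a small control unit.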
 
6.3.2 Branch Instructions 
Branch instructions are used to test the value of data or the status of a computation before 
jumping to the label's address (Figure 6.5). The RISC checks the M bit to distinguish 
whether the second source operand is an immediate value or a data register. If the M bit is 
equal to “1”, the comparison should be done between the register “SRR1” and the 
immediate value.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 6.5: Branch instruction format 
 
The F bit is used to control the forwarding mechanism when the operand of a current 
instruction is still in the W/B stage of the previous instruction. The forwarding mechanism 
will be turned on when the F bit is set to “1” to prevent pipeline stalling. The F bit will be 
reset if no Read after Write dependency exists between the branch and preceding non-
branch instruction. The branch instruction can be written as defined in Figure 6.6. 
 
 
beq    r1, r2, label               ; Branch to label if contents of r1 = contents of r2 
beq    r1, 10, label               ; Branch to label if contents of r1 = 10 
 
Figure 6.6: Branch instruction example 
 
[Figure 6.5 labels: fields Op-code, F, X, SRR2/imm, SRR1, M and label, with bit 
boundaries marked at 31, 27, 26, 25, 21, 20, 16, 12, 11 and 0.] 

Where:  label    address in memory (instruction memory) 
        imm      immediate value 
        F        function bit 
        M        immediate/register select    
        X        for future use 

 
The program counter is updated to point to the label's address after checking that the 
value in r1 is equal to that in r2, or when r1 is equal to the immediate value of 10. The 
use of an immediate value with the Branch instruction is useful for checking the 
protocol type inside the IP header (e.g., the immediate value is equal to 6 for TCP or 
17 for UDP). For future work, this comparison can be implemented in hardware to create a 
faster response when using the Demultiplexing option. 
 
6.3.3 Memory Access Instructions 
The following instructions are used to move data between memory and the RISC core 
registers: 
 
a) Load/Store instruction: Load data from local memory into the RISC‟s register, or store 
the data register in local memory. The instruction format is defined in Figure 6.7. 
 
 
 
 
 
 
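Where:  Address  Memory address field (added to the contents of SRR1). 
        SRR2     Holds the data that needs to be stored in local memory (if 
                 the instruction is a store). For a load instruction it is used as 
                 the destination register (to receive the data from memory).  
        SRR1     Register used with the Address field to form the memory address.  
        X        Future use.                            

Figure 6.7: Load/Store instruction format 

[Figure 6.7 labels: fields Op-code, SRR1, Address, SRR2 and X, with bit boundaries 
marked at 31, 27, 25, 21, 20, 16 and 0.] 
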
 
For Load instructions, the value in the Address field and the contents of SRR1 are 
combined to form the source memory address. The SRR2 field is used as the destination 
register. The memory address is calculated in the same way for Store instructions, but 
SRR2 is used as the source register instead of the destination register. The Load/Store 
instructions can be written as defined in Figure 6.8. 
 
lw  r1, address (r2)    ;  r1 = Memory[r2 + address]  
sw  r4, address (r2)    ;  Memory[r2+address]  =  r4  
 
 
Figure 6.8: Load and store instructions with memory address 
 
 
 
b) Load CAM (LCAM): This instruction is used to load data from the CAM after a 
match between the data and the contents of the CAM is found. The instruction format 
of the LCAM is shown in Figure 6.9. 
 
 
 
 
 
 
Where:  DES      Holds the first 32 bits of connection info 
        SRR1     The CAM’s memory address after a match is found  
        S        Bit signal to the RISC’s controller 
        X        For future use 

Figure 6.9: LCAM instruction format 

[Figure 6.9 labels: fields Op-code, X, DES, SRR1 and S, with bit boundaries marked at 
31, 27, 26, 25, 21, 20, 16 and 0.] 

 
When a match is found, the CAM returns the address of the matched location. The read 
signal is also sent to SRAM. After reading the S bit, the RISC controller loads the first 32 
bits (the Sequence number and End-address) and stores them in the destination register 
“DES”. If a match is not found, the CAM sends a signal to the RISC (match = 0). The 
RISC considers this packet a lost packet and discards it. The LCAM instruction can then 
be written as presented in Figure 6.10. 
 
lcam    r2, r1       ; find the CAM location that matches the connection identifier 
                     ; in r1 and read its data into r2 
 
 
Figure 6.10: Load a memory address from CAM 
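
The lookup performed by LCAM can be modelled in software as a two-step access: the CAM maps a connection identifier to a location, and the SRAM at that location returns the connection data. The Python sketch below is only a behavioural illustration; the class, its method signatures and the four-field connection key are assumptions made here, not the hardware interface.

```python
from typing import Dict, Optional, Tuple

# Assumed key layout: (source IP, source port, destination IP, destination port).
ConnectionId = Tuple[str, int, str, int]


class CamWithSram:
    def __init__(self) -> None:
        self._cam: Dict[ConnectionId, int] = {}         # connection id -> matched location
        self._sram: Dict[int, Tuple[int, int]] = {}     # location -> (sequence number, End-address)

    def stcam(self, conn: ConnectionId, location: int,
              sequence_number: int, end_address: int) -> None:
        """Store a connection in the CAM and its data in the associated SRAM word."""
        self._cam[conn] = location
        self._sram[location] = (sequence_number, end_address)

    def lcam(self, conn: ConnectionId) -> Optional[Tuple[int, int]]:
        """Return (sequence number, End-address) on a match, or None when match = 0."""
        location = self._cam.get(conn)
        if location is None:
            return None          # no match: the packet is treated as lost and discarded
        return self._sram[location]


if __name__ == "__main__":
    table = CamWithSram()
    conn = ("10.0.0.1", 5000, "10.0.0.2", 80)            # hypothetical connection
    table.stcam(conn, location=7, sequence_number=1000, end_address=0x2000)
    print(table.lcam(conn))
    print(table.lcam(("10.0.0.9", 1, "10.0.0.2", 80)))
```

In hardware the CAM compares all entries in parallel, so the match step takes a fixed time regardless of how many connections are stored; the dictionary here only mimics the result of that comparison.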
 
 
6.4 Pipeline Hazard 
In the instruction stream, a hazard prevents the next instruction from being executed during 
its designated clock cycle. Clearly, hazards reduce the RISC’s performance.  There are two 
types of hazards: Control hazards and Data hazards [44]. A Control hazard occurs during 
the execution of a Conditional Branch instruction. The decision about whether the branch 
is taken or not is not made until the comparison is completed in the Decode/Execute 
pipeline stage. The instruction fetched after the Conditional Branch instruction is 
flushed from the pipeline if the branch is taken. The resulting loss is called the 
Branch Penalty, and it reduces pipeline performance. The Branch-Delay technique [44] is 
used to reduce this problem and is shown in Figure 6.11.  
Figure 6.11a presents the existing code before scheduling, where the second and 
instruction (at i+1), which is independent of the branch, is executed before the branch 
instruction. In Figure 6.11b, this and instruction is moved into the delay slot so that 
it is executed after the branch instruction. In this case, the and instruction will be 
executed whether the branch is taken or not, so the branch does not affect the 
pipeline’s performance.   
i      and   r1, r7, 0x00ff0000      ; mask the protocol type 
i+1    and   r2, r6, 0x0000ffff      ; get the length of the message 
                                     ; and store the result in r2   
   
i+n    beq   r1, TCP, start_TCP      ; if the protocol type is TCP 

Figure 6.11a: Before scheduling the branch-delay slot 

i      and   r1, r7, 0x00ff0000  
i+1    beq   r1, TCP, start_TCP  
   
i+n    and   r2, r6, 0x0000ffff  

Figure 6.11b: After scheduling the branch-delay slot 
 
The number of occurrences of Conditional Branches in the receiving unit for both TCP/IP 
and UDP/IP is presented in Table 6.2. To minimize the impact of the Conditional Branch 
instructions, the branch-delay slot technique [44] is used to eliminate the performance 
degradation that would otherwise follow each Conditional Branch instruction.  A useful 
instruction is found to fill the delay slot of most of the Conditional Branch 
instructions in the processing of both TCP/IP and UDP/IP. 

Table 6.2: Number of occurrences of Conditional Branch instructions during TCP/IP and 
UDP/IP LRO processing 
Message Type    TCP/IP    UDP/IP 
SSM             8         7 
BOM             9         8 
COM             9         8 
EOM             7         8 
 
A Data hazard can occur when an instruction being executed in a pipeline stage needs a 
result from an earlier instruction that is not yet available. Figure 6.12 presents the 
existing code of the program, before scheduling.  The and instruction at i+1 writes the 
value of r1 in the W/B pipeline stage, while the beq instruction at i+2 reads the value 
during the Decode/Execute stage. Clearly, the beq instruction will experience the Data 
hazard problem. Unless precautions are taken to prevent it, the beq instruction at i+2 
could read the wrong value of r1. 
This Data hazard happens quite often in the NI code. Support should be provided in the 
NI’s design in order to reduce the negative impact. Otherwise, a severe reduction in 
performance will occur. There are different techniques that can be used to reduce or even 
eliminate the performance reduction due to the Data hazard (or dependency). One such 
technique makes use of the Forwarding or Injecting procedure [44]. In this work, the 
injection of a useful instruction is used to avoid the hazard. 
 
 
        . 
        . 
i       lw   r4, 0(r7)                ; Load a free pointer from the CB 
i+1     and  r1, r6, 0x00ff0000       ; Mask the protocol type from the IP header (r6) 
                                      ; and store the result in r1 
i+2     beq  r1, TCP, start_TCP       ; Check if the current packet is TCP 
i+3     and  r3, r7, 0x000800000      ; Mask the TCP PSH flag from the header (r7) 
                                      ; and store the result in r3 
i+4     beq  r3, r8, start_EOM        ; Check if the current packet is the EOM/SSM packet 
                                      ; of the message by checking r3’s value 
i+5     lw   r5, 4(RBI)               ; Get the source address from the IP header 
                                      ; and store the result in r5 
        . 
        . 
 
Figure 6.12: Before scheduling procedure 
 
For example, the i+1 instruction of Figure 6.12 is swapped with the i instruction to 
prevent the Data hazard that occurs when the beq instruction reads the value of r1 while 
it is still in the Write/Back stage (Figure 6.13). This scheduling mechanism is applied 
when data from a previous instruction is not yet available for the next executed 
instruction. 
        . 
        . 
i    (i+1 previously)   and  r1, r6, 0x00ff0000     ; Mask the protocol type from the IP 
                                                    ; header (r6); store the result in r1 
i+1  (i previously)     lw   r4, 0(r7)              ; Load a free pointer from the CB 
i+2                     beq  r1, TCP, Start_TCP     ; Check if the current packet is TCP 
i+3                     and  r3, r7, 0x000800000    ; Mask the TCP PSH flag from the header 
                                                    ; (r7); store the result in r3 
i+4  (i+5 previously)   lw   r5, 4(RBI)             ; Get the source address from the IP 
                                                    ; header; store the result in r5 
i+5  (i+4 previously)   beq  r3, r8, Start_EOM      ; Check if the current packet is the 
                                                    ; EOM/SSM packet of the message by 
                                                    ; checking r3’s value 
        . 
        . 
 
Figure 6.13: Delay slot technique for Data hazard 
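
The effect of the rescheduling can be checked mechanically. The Python sketch below flags every instruction that reads a register written by the instruction immediately before it, which is the RAW case that arises in the three-stage pipeline. The tuple representation and the function name are assumptions made for illustration; the register names follow Figure 6.13, and the "before" order is obtained by undoing the swaps listed there.

```python
from typing import List, Optional, Tuple

# (mnemonic, destination register or None, source registers)
Instruction = Tuple[str, Optional[str], List[str]]


def find_raw_hazards(program: List[Instruction]) -> List[int]:
    """Indices of instructions that read the register written just before them."""
    hazards = []
    for index in range(1, len(program)):
        previous_dest = program[index - 1][1]
        sources = program[index][2]
        if previous_dest is not None and previous_dest in sources:
            hazards.append(index)
    return hazards


BEFORE: List[Instruction] = [
    ("lw",  "r4", ["r7"]),          # i
    ("and", "r1", ["r6"]),          # i+1
    ("beq", None, ["r1"]),          # i+2 reads r1 written at i+1
    ("and", "r3", ["r7"]),          # i+3
    ("beq", None, ["r3", "r8"]),    # i+4 reads r3 written at i+3
    ("lw",  "r5", []),              # i+5 (address taken from the RBI)
]
# Swap i with i+1 and i+4 with i+5, as in the rescheduled listing.
AFTER = [BEFORE[1], BEFORE[0], BEFORE[2], BEFORE[3], BEFORE[5], BEFORE[4]]

if __name__ == "__main__":
    print(find_raw_hazards(BEFORE))   # prints [2, 4]
    print(find_raw_hazards(AFTER))    # prints []
```

Both hazards disappear because an independent load now sits between each producer and its consumer, so no bubble is needed.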
Where no useful instruction can be scheduled, the hazard is handled by inserting an 
instruction (in which case a pipeline bubble is required) during the LRO and LSO 
processing. The frequency of Data hazard occurrences for both TCP and UDP processing on 
the receiving side is presented in Tables 6.3 and 6.4. Although the number of 
occurrences appears to be minor, the impact on performance is significant. After 
rescheduling the LRO program, one RAW hazard remains for UDP within the BOM and the COM. 
One RAW hazard also remains within the SSM of the TCP program. 
 
 
Table 6.3: Number of occurrences of the Read after Write (RAW) hazard for UDP in LRO
processing

UDP message type     RAW hazards
BOM and COM          1
SSM and EOM          0
 
 
 
Table 6.4: Number of occurrences of the Read after Write (RAW) hazard for TCP in LRO
processing

TCP message type     RAW hazards
BOM, COM, EOM        0
SSM                  1
 
 
To enhance the processing in the LRO and LSO, a forwarding mechanism is added to reduce the
Data hazard. Independent instructions are used wherever possible to avoid the Control hazard.
An example of the forwarding mechanism used in the simulator is presented in Figure 6.14.
 
 
 
i     sub r3, r1, r2            ; Subtract r2 from r1 and store the result in r3
i+1   beq r3, 5, send_signal    ; If r3 is equal to the value 5, jump to the send
                                ; signal location
 
 
Figure 6.14: Forward mechanism used in the simulator 
 
The result of the Arithmetic Logic Unit (ALU) operation r1 - r2 is latched, and the latched
value destined for register r3 is sent back to the ALU to be compared with the value 5
(during the Decode/Execute stage of the current instruction). The forwarding hardware
receives a signal from the RISC's controller, indicating that the current instruction needs
the latched data of the previous instruction. The latched data is then sent to the
Decode/Execute stage of the current instruction (Figure 6.15).
 
 
[Diagram: pipeline stages (Fetch, Decode/Execute, Write back) of two consecutive instructions]
 
 
 
 
Figure 6.15: Latching the output of the Arithmetic Logic Unit for minimizing the Data 
hazard 
 
 
 
6.5 RISC Registers 
In the implementation, the RISC's instruction format has three register operands. Each
instruction needs to read two data words from the register file and write one data word
into the register file. Two inputs are needed for writing data into the register file. The
first input specifies the register number to be written to. The second input supplies the
data to be written into the register. Thus, four inputs are required (three for the register
numbers and one for the data), along with two data outputs (Figure 6.16). The outputs of the
register file contain the values of whichever registers are selected on the Read register
inputs. Read registers are controlled by a signal called R (R1, R2). The R signal must be
asserted as a read command to the specified register. The write register is controlled by a
specific write control signal, called W (for the W/B register). The RISC asserts the W
signal to write data into a target register.
 
 
 
 
 
 
 
 
Figure 6.16: RISC register file 
 
Each register number is 5 bits wide, the data input is 32 bits wide and the two data outputs
are both 32 bits wide. The size of the register file differs between the LRO and the LSO
functions. The number of registers in the register file of the LRO and the LSO units is
shown in Table 6.5.
 
 
 
 
 
[Diagram: RISC register file with register-number inputs Select R1, Select R2 and Select
Write back, data input Data_in, control signals R (R1, R2) and W (for the W/B register), and
data outputs Output 1 and Output 2]
 
Table 6.5: Register file sizes for the LRO and LSO units

Function unit     Register file size
LRO               64 registers
LSO               32 registers
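
The register file interface described above (two read ports, one write port, R and W control
signals) can be modelled behaviourally. The following Python sketch is an illustration under
stated assumptions rather than the VHDL used in this thesis: the class name RegisterFile and
its method names are hypothetical, and the number of registers is passed in as a parameter
taken from Table 6.5.

```python
class RegisterFile:
    """Behavioural model of a register file with two read ports and one write port."""

    def __init__(self, num_registers, width=32):
        if num_registers <= 0:
            raise ValueError("num_registers must be positive")
        self.num_registers = num_registers
        self.mask = (1 << width) - 1          # keeps stored values to `width` bits
        self.regs = [0] * num_registers

    def _check(self, number):
        if not 0 <= number < self.num_registers:
            raise IndexError(f"register {number} out of range")

    def read(self, select_r1, select_r2, r=True):
        """Return (Output 1, Output 2) when the R control signal is asserted."""
        if not r:
            return None, None
        self._check(select_r1)
        self._check(select_r2)
        return self.regs[select_r1], self.regs[select_r2]

    def write(self, select_wb, data_in, w=True):
        """Store data_in in the selected register when the W control signal is asserted."""
        if not w:
            return
        self._check(select_wb)
        self.regs[select_wb] = data_in & self.mask


lro_regs = RegisterFile(64)   # LRO unit size from Table 6.5
lso_regs = RegisterFile(32)   # LSO unit size from Table 6.5
lro_regs.write(3, 5)
out1, out2 = lro_regs.read(3, 0)
```

A write with W de-asserted leaves the register unchanged, which mirrors the role of the W
signal in Figure 6.16.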
 
 
 
 
6.6 Components required for RISC cores 
Both RISC cores require other components to help with LRO and LSO processing.  Table 
6.6 shows the components that are required for the LRO and LSO functions.  
 
 
Table 6.6: Components needed for the LSO and the LRO functions

Component             LSO                               LRO
DMA                   Required                          Required
CAM                   Not required                      Required for active connections
Local bus             64-bit word bus required          64-bit word bus required
Circulation Buffer    Required to manage the SB space   Required to manage the RB space
Number of FIFOs       2                                 4
SRAM packet buffer    Required (4GB)                    Required (4GB)
 
 
 
 
6.7 Packet Data Path   
The end node processing starts after receiving the Ethernet frames from a network or when
sending frames to a network. Figure 6.17 shows the structure of the data path within the NI.
After a frame has arrived, the NI processes this frame at the Media Access Control (MAC)
unit. The MAC unit then passes the packet on to the next stage, which is the Packet
Processing Unit (PPU). In the sending direction, the MAC unit accepts packets from the PPU,
creates the frame headers and delivers the frame to the network. On the receiving side, the
PPU has to process the packet headers before sending them to the protocol stack. On the
sending side, it generates the packet headers and then sends them to the MAC unit. Each of
these units is required to execute its operation concurrently. When a unit completes an
operation, it passes the result to the next unit and starts a new operation. The PPU is
attached to a MAC unit and to the host. Two interfaces have been implemented: the host
interface and the bus interface.
 
[Diagram: Host, Interface, Packet Processing Unit (PPU), Interface and Media Access Control
(MAC) unit connected in sequence, carrying packets and frames]
 
Figure 6.17: The topology of the test environment 
 
 
This thesis focuses on the PPU (shaded area in Figure 6.18). In order to evaluate packet 
processing at this stage, a complete system environment has been set up (Figure 6.18). 
 
 
[Diagram: MAC VHDL unit (Queue Hexadecimal File Buffer, Gigabit MAC Receive, Gigabit MAC
Transmit, Flow Control, Elastic Buffers and FSMs; frame processing; to/from PHY), Line
Interface (FSMs and FIFOs; to/from the MAC unit), Packet Processing Unit (packet processing,
DMA), and Host Interface and Host Area (FIFOs, FSM, Destination Buffer, Queue Hexadecimal
File Buffer; to/from host)]
 
Figure 6.18: Tested model for sending and receiving packets to or from the Packet 
Processing Unit 
 
The MAC unit sends valid packets to, or receives them from, the PPU (packet checksum
validation or generation is performed during the transfer of the packet (by-pass)). The
structure and design of the MAC unit are out of the scope of this thesis; however, sending
packets to and reading packets from the PPU is required in order to test the performance of
the RISC cores. The external VHDL MAC unit structure is based upon the VHDL code for TCP/IP
and UDP/IP packets, the COM-5402SOFT [102] and the COM-5401SODT [101]. Both the
COM-5402SOFT and COM-5401SODT are designed to support 2 Gbps bidirectional traffic at a
clock rate of 125 MHz. For simulation and timing purposes, the clock rate can be increased
up to 5 GHz. To avoid delays in sending packets from the MAC unit to the PPU, the clock
rates have been modified to match the rates of 40 and 100 Gbps (e.g., 123 ns per 1500-byte
packet at 100 Gbps). The COM-5402SOFT and COM-5401SODT are used to send packets to, or read
packets from, the proposed PPU. The internal design of both units is out of the scope of
this thesis.
The Finite State Machine (FSM) at the VHDL MAC unit passes the frames to the MAC Receiver,
which processes the MAC headers and trailer. It also validates the packets before they are
passed to the Elastic buffer. It then triggers the FSM at the Line Interface to start
reading the packets from the Elastic buffer queue. The RISC processor starts processing the
arrived packets after it receives the interrupt signal from the Line Interface's FSM. The
RISC processes the arrived packets until the Interrupt Moderation window reaches its end.
The RISC then triggers the FSM at the Host Interface to start transferring the packets to
the host memory (Destination Buffer). Meanwhile, the RISC core continues processing the
newly arrived packets.
For the sending side, the large TCP/IP and UDP/IP packets are stored in the Queue Buffer to
be sent to the NI. The host CPU is also required to send the Maximum Segment Size (MSS) and
the location of each packet that is stored inside the Sending Buffer (SB). The RISC at the
sending side starts segmenting or fragmenting the large packets to the MTU and sends them
to the MAC unit, where the CRC, the MAC header and the trailer are added. The details of the
MAC unit's processing are not discussed here, as they are outside the scope of the present
study, which focuses on RISC processing at the Packet Processing Unit.
 
 
6.7.1 The PCI Interface  
There are many different approaches to polling data from the RB. On UNIX systems the data
is stored in a stream buffer called 'mbufs' [92]. Interrupt-driven and polling techniques
are used to get data from the NI. As a means of improving the performance of Linux on
high-end systems, developers have created an alternative network interface subsystem (called
NAPI) for polling data. In contrast, Microsoft uses the Receive Side Scaling (RSS) method
[7], where the NI manages multiple hardware queues to hold a number of incoming packets.
After the Interrupt Moderation [31] or Interrupt Avoidance timer expires, Deferred Procedure
Calls (DPCs) request the appropriate CPUs to handle these packets through the Interrupt
Service Routine (ISR). All processing for the network I/O within the context of an ISR is
routed to the same processor. This behavior differs from that of earlier processors in UNIX.
However, the most significant difference between the proposals is that the NI driver
operates only in response to requests from the kernel, while the network packets are
received asynchronously from the network [92]. Thus, the core processor at the Network
Interface prepares the incoming packets to be ready for processing, whilst the host
processes these packets. The communication with the Network Interface is designed for this
different mode of operation. NAPI and RSS are features of the Operating System architecture
rather than of the TCP or UDP protocols. In addition, there are a number of implementations
associated with polling data, including Zero-Copy [10] and Direct Cache Access [85], which
are used to reduce header processing. In this chapter, the structure of the PCI interface is
described in Figure 6.18.
 
 
6.7.1.1 PCI Interface at the Packet Processing Unit   
The PCI speed (in MHz), bit-rates (the width of the bus) and the number of channels are
essential information for line speeds of up to 100 Gbps [18, 104]. The PCI specification and
design are out of the scope of this thesis. However, demonstrating the movement of the
packets to or from the PPU helps to test the embedded cores' performance and their capacity
to support a maximum throughput of 100 Gbps while using real TCP and UDP streams. When the
Interrupt Moderation (IM) timer expires, a signal is sent from the receiver processor to the
Finite State Machine (FSM) controller (Figure 6.19). The FSM controller starts reading the
FIFOs (FIFO 1 and 2). The FIFOs contain the address in the RB from which the data is to be
fetched and the total number of bytes. The FSM then asserts a Read signal to the host's DMA,
along with the Start-address and the total number of bytes, to start transferring the data
from the RB to the target location in the host memory.
[Diagram: new packets are stored by the RISC in the Receiver Buffer as amalgamated packets;
the FSM reads the FIFOs and the DMA moves the amalgamated packets over the Bus Interface to
the Host Memory in the Host Area]
 
Figure 6.19: Transferring packets from the Receiving Buffer to the Host Memory 
 
 
This research focuses on the Packet Processing Unit area. The other unit is designed to
assist the proposed framework. This prototype uses queues in the host area to act as the
host memory. With real memory, there is extra processing involved in accessing memory,
including finding the available space. The use of queues has some advantages for this
framework, as there is no need to identify where to store the data, manipulate the memory
pages or update buffer descriptors.
The data transfer from the NI to the host memory depends upon the system's architecture,
including the bus width, the number of PCI channels, the burst size and the speed of the
host DMA. A burst transfer is required to enhance the bandwidth utilization of the RB and
the design of the bus interface [111].
Data has to be transferred from the RB in large bursts to provide space inside the RB for
new incoming packets. Figure 6.20 presents the timing diagram of how the burst is
transferred. A burst transfer of 8 × 64 bits (64 bytes) is equivalent to 0.9 RISC cycle.
The host DMA's clock and the system bus have the ability to support burst transfers within
a budgeted timeframe. If the IM size is 100 packets and assuming these 100 packets are each
1500 bytes in size, the total cycles required to complete moving the 100 packets can be
calculated as follows:
\[ \frac{100 \times 1500\ \text{bytes}}{64\ \text{bytes per burst}} \approx 2343\ \text{cycles} \]
 
 
Figure 6.20: Timing diagram captured from the simulation for burst transfer from the RB to 
the host 
 
 
 
 
The PCI-DMA must complete the 2343 cycles within 12300 ns (100 packets, each taking 123 ns
at 100 Gbps) in order to transfer the packets from the RB to the host. If the number of
words per burst is increased to 16, each burst transfers 128 bytes per cycle, and the total
number of cycles required for 100 packets falls to 1172. Increasing the speed of moving the
packets out of the RB enhances the end node's capacity to support 100 Gbps. With smaller
packet sizes, PCI-DMA transfers can be completed in fewer cycles. While the configuration of
the PCI-DMA is out of the scope of this thesis, a demonstration of the movement of data is
required in order to verify the proposed PPU model and to investigate the RISC clock within
the 100 Gbps line rate (RISC performance is covered in the next chapter).
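
The burst arithmetic above can be reproduced with a few lines of Python. This is a sketch
under the assumptions stated in the text (one burst per cycle, 64-bit words, 123 ns per
1500-byte packet at 100 Gbps); the function names burst_cycles and budget_ns are
hypothetical and are not taken from the simulator.

```python
def burst_cycles(num_packets, packet_bytes, words_per_burst, word_bytes=8):
    """Cycles needed to move the packets when each cycle carries one burst."""
    if min(num_packets, packet_bytes, words_per_burst, word_bytes) <= 0:
        raise ValueError("all arguments must be positive")
    return num_packets * packet_bytes / (words_per_burst * word_bytes)


def budget_ns(num_packets, ns_per_packet=123.0):
    """Time available before the same number of new packets has arrived."""
    return num_packets * ns_per_packet


cycles_8 = burst_cycles(100, 1500, 8)     # 64 bytes per burst
cycles_16 = burst_cycles(100, 1500, 16)   # 128 bytes per burst

# Longest cycle time that still meets the budget with 64-byte bursts.
max_cycle_ns = budget_ns(100) / cycles_8

print(cycles_8, cycles_16, max_cycle_ns)
```

The unrounded results are 2343.75 and 1171.875 cycles, which the text quotes as 2343 and
1172. Under these assumptions, the 64-byte case leaves a little over 5 ns per cycle within
the 12300 ns budget.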
 
6.7.1.2 Reading data from the Receiving buffer 
The FSM has a counter unit. This counter starts after the RISC core initiates the FSM to
pull data from the RB. When the IM has reached its end (the total number of packets has
reached its limit), the counter sends a signal to the FSM to stop. The counter then resets
to zero. In this case, even if FIFO 1 or 2 is still receiving new entries for the next IM
(200 packets), there will be room for about 300 new FIFO entries (the depth of each FIFO is
511). As mentioned before, the depth of the FIFO is enough to keep the receiving side
working properly without any overflow.
After the IM has expired and data starts to be moved out to the host, the Memory
Management's State Machine at the PPU updates the Circulation Buffer pointers, depending on
how many memory pages have been used before the Interrupt Moderation expired. For example,
if the Interrupt Moderation window is 30 packets and the Receiver core uses five pages of
the Receiver Buffer to amalgamate 20 packets and another 10 pages for Single Segment Message
packets, the Circulation Buffer (CB)'s pointers are incremented by 15 steps. When the REP
sends a signal to the PCI Interface, the tail of the CB is then incremented by the same
number of memory pages. The total number of pages is stored within the internal buffer of
the Memory Management.
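
The pointer bookkeeping in this example can be sketched in Python as follows. This is a
simplified illustration, not the Memory Management state machine itself: the class name
CirculationBuffer, the head/tail naming and the choice of 256 pages are assumptions made for
the example.

```python
class CirculationBuffer:
    """Ring of page indices: the head hands out free pages, the tail releases used ones."""

    def __init__(self, num_pages):
        if num_pages <= 0:
            raise ValueError("num_pages must be positive")
        self.num_pages = num_pages
        self.head = 0   # next free page to give to the receiver core
        self.tail = 0   # oldest page still waiting to be moved to the host
        self.used = 0   # pages currently holding packet data

    def allocate(self):
        """Give the next free page to the receiver core."""
        if self.used == self.num_pages:
            raise MemoryError("no free page in the Receiver Buffer")
        page = self.head
        self.head = (self.head + 1) % self.num_pages
        self.used += 1
        return page

    def release(self, pages):
        """Advance the tail after `pages` pages have been moved to the host."""
        if not 0 <= pages <= self.used:
            raise ValueError("cannot release more pages than are in use")
        self.tail = (self.tail + pages) % self.num_pages
        self.used -= pages


cb = CirculationBuffer(256)
# 5 pages of amalgamated data plus 10 Single Segment Message pages.
pages = [cb.allocate() for _ in range(5 + 10)]
cb.release(len(pages))   # the tail moves by the same 15 steps
```

Keeping a single count of pages in use is one simple way to make sure that the tail never
overtakes the head; other designs compare the two pointers directly.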
 
6.7.1.3 Reading data from Sending side 
For the sending side, the host interacts with the NI. This means that whenever the host is
ready to send data to a network interface, it reads FIFO 4, which holds the free address
space available in the Sending Buffer (SB). The host CPU controls the Sending Buffer as an
I/O device and there is no need to perform any page swapping techniques. Therefore, this
area becomes a usable space for the host CPU to store large TCP/IP or UDP/IP packets. The
maximum size of the large packets is 64 KB and the host can send up to 100 large packets to
a network interface. A message description is also required to be sent to FIFO 5 with each
message. The description includes the Start-address of the large message and the Maximum
Segment Size (MSS). When the RISC core starts processing a message, it sends the
Start-address to be stored at the end of the Circulation Buffer. FIFO 5 is implemented with
a high priority option for small packets, such as signaling packets (e.g., packets with FIN
or SYN), which are always put at the top of the FIFO. The sending core checks the priority
pointers of the FIFO after completing a packet (e.g., every 123 ns if the packet size is
1500 bytes). Small packets sent from the host CPU that are smaller than or equal to the SSM
are sent 'as is' to the MAC unit. Using the 'cycle stealing' technique to move such packets
is also an option [44]. Stealing cycles requires more processing, including instructing the
DMA to release the local bus in order to read or write data. In our case, however, the core
processor initiates the DMA to transfer the small packets after storing the current packet
in the RISC's registers.
 
6.7.2 Interrupt Moderation Window Size  
The network adapter interrupts the host processor upon receiving a number of packets. In
many scenarios where processor utilization is high, it is best to coalesce several packets
for each interrupt so as to reduce the number of times the host processor is interrupted.
The Interrupt Moderation Window Size is dependent upon many factors, including the speed of
the DMA [103] and the width of the PCI [104]. To avoid overwriting data inside the Receiver
Buffer, the Interrupt Moderation Window Size is fixed. The maximum UDP datagram size for
IPv4 is 65,535 bytes. The Receiver Buffer consists of a 65,535 bytes × 256 column array.
Optimal performance is dependent upon how fast the data can be moved out of the Receiver
Buffer. A larger Interrupt Moderation size results in more amalgamation of packets, provided
that the IM timer is less than the TCP retransmission time of the packets (500 ms [93]). On
the other hand, the large amalgamated data requires fast processing in order for it to be
completed within the budgeted timeframe of the line rate.
 
Configuring the timers is typically a matter of determining the desired interrupt rate or
the desired number of packets per interrupt [33]. Table 6.7 illustrates the total time
required for 20, 50, 100, 200 or 300 packets for line rates of 40 and 100 Gbps.
Table 6.7: The Interrupt Moderation sizes and the absolute time between interrupts to the
host CPU

Line speed   Packets/s   Packets per   Interrupts/s   Time for 1500-byte   Time between
                         interrupt     (approx.)      frame (s)            interrupts (µs)
40 Gbps      3250975     20            162548.8       3.076E-07            6.152
                         50            65019.51       3.076E-07            15.38
                         100           32509.75       3.076E-07            30.76
                         200           16254.88       3.076E-07            61.52
                         300           10836.58       3.076E-07            92.28
100 Gbps     8127438     20            406371.9       1.2304E-07           2.4608
                         50            162548.8       1.2304E-07           6.152
                         100           81274.38       1.2304E-07           12.304
                         200           40637.19       1.2304E-07           24.608
                         300           27091.46       1.2304E-07           36.912
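
The relationship between line rate, packets per interrupt and the time between interrupts
can be sketched in Python. The sketch assumes that a 1500-byte frame occupies 1538 bytes on
the wire (an assumption chosen because it reproduces the 3.076E-07 s frame time at 40 Gbps
and the 123 ns per packet used in this chapter for 100 Gbps); the function names are
hypothetical.

```python
WIRE_BYTES = 1538  # assumption: 1500-byte frame plus 38 bytes of Ethernet overhead


def frame_time_s(line_rate_bps, wire_bytes=WIRE_BYTES):
    """Time one frame occupies the line, in seconds."""
    return wire_bytes * 8 / line_rate_bps


def interrupts_per_second(line_rate_bps, packets_per_interrupt):
    """Interrupt rate when one interrupt is raised per group of packets."""
    packets_per_second = 1.0 / frame_time_s(line_rate_bps)
    return packets_per_second / packets_per_interrupt


def interrupt_interval_us(line_rate_bps, packets_per_interrupt):
    """Time between two interrupts, in microseconds."""
    return packets_per_interrupt * frame_time_s(line_rate_bps) * 1e6


for rate in (40e9, 100e9):
    for n in (20, 50, 100, 200, 300):
        print(f"{rate / 1e9:.0f} Gbps, {n:3d} packets/interrupt: "
              f"{interrupts_per_second(rate, n):10.1f} interrupts/s, "
              f"{interrupt_interval_us(rate, n):7.3f} us between interrupts")
```

With these assumptions, 20 packets per interrupt at 100 Gbps gives roughly 2.46 µs between
interrupts, which agrees with the 2460 ns quoted in the text.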
 
 
When the number of packets per interrupt is 20 and the line speed is 100 Gbps, the total
time before the Interrupt Moderation expires is 2460 ns. This means the first 20 pages of
the Receiver Buffer are used to store the first 20 packets, assuming the worst-case scenario
in which no two of the arrived packets can be amalgamated into the same page. After the
Interrupt Moderation expires, the FSM starts to poll data from the Receiver Buffer, while
the embedded core stores new packets in the next 20 pages of the Receiver Buffer. The
Trailer Pointer of the Circulation Buffer moves ahead (depending on how many pages have been
used). When the Interrupt Moderation is 100 packets, the host DMA time required to complete
transferring 100 packets from the RB would be 12300 ns.
A consequence of this design is that a maximum of approximately 100 flows can be supported
before the Receiver Buffer must be flushed to the host. Similarly, each flow cannot contain
more than 65,535 bytes of data (64 KB). In this study, the processing of TCP deliberately
does not support window scaling and can thus buffer an entire TCP window within one row of
the array. The potential to accommodate window scaling exists if a trade-off is made
between increasing the buffer on the NI and using multiple columns to buffer a single flow
(with consequential processing overheads). The array was sized according to the Interrupt
Moderation parameters currently used in the NI.
 
 
 
 
 
6.8 Conclusion  
In this chapter, the Scalable Network Interface Architecture for a line rate of 100 Gbps was
designed. The IEEE 1164-1993 and 1076-1993 standards, running on the Xilinx tools, were used
for simulating the Network Interface and testing the Behavior Model. VHDL was used for
building the Scalable Network Interface components. Among the components, the three-stage
pipeline RISC cores were successfully built and tested for the Large Send Offload and the
Large Receive Offload functions. A simple DMA controller was designed for transferring data
between the Network Interface buffers. Content Addressable Memory was used as a look-up
table to enhance the receiving side processing. FIFOs are used to reduce the interrupts
issued when exchanging information between the host CPU and the RISC cores. The receiving
side and the sending side have suitable buffers for storing the TCP/IP and UDP/IP packets.
The next chapter determines the RISC core performance for processing TCP/IP and UDP/IP
packets within a line rate of 100 Gbps. It provides a detailed analysis of TCP/IP and UDP/IP
processing. The RISC core clock speed required to sustain the inbound and outbound packets
for both TCP/IP and UDP/IP is determined. The DMA clock speed required for the high-speed
network interface is also investigated.
 
 
 
 
Chapter 7 
LRO and LSO Processing Analysis inside the Packet 
Processing Unit 
 
 
7.1 Introduction 
With the advent of 100 Gbps packet processing speeds, a detailed analysis of the LRO and
LSO implementations is required. This chapter presents that analysis and shows how TCP/IP
and UDP/IP processing can be scaled to higher speeds with the least amount of processing
power. The chapter focuses on the receiving side and sending side processing of TCP/IP
and UDP/IP. The total number of RISC instructions executed for TCP and UDP packets,
including Beginning of Message, Continuation of Message, End of Message and Single
Segment Message packets, is discussed. The path length cycles for TCP and UDP are also
presented.
 
7.2 Enhancement to Improve Packet Processing    
A number of techniques have been implemented in this thesis to enhance packet
processing within the budgeted time frame for 100 Gbps. These techniques include a
communication style between the host and the Media Access Control (MAC) unit that
is based on information exchange rather than on interrupts. In addition, the RISC
instructions enable a single-cycle look-up table and single-cycle 32-bit Reads and
Writes. Generic instructions operate on 32-bit operands, which make up the heart of
the TCP and the UDP processing code. The embedded cores execute the LRO and the
LSO codes, which are stored in the internal memory.
Another approach to enhancing the proposed PPU to support 100 Gbps is the assembly
language implementation of the LRO and LSO algorithms. The code takes the program's
structure into account and avoids reading a register immediately after the instruction
that writes it. In addition, attention has been paid to the Data hazards of the assembly
program, using branch delay techniques. Another technique is the use of overlapped
processing. This allows the RISC core to complete part of the LRO or LSO processing
instructions, including the calculation of the packet headers, during the movement of
data inside the NI.
A testing process has been designed to check the functionality of the PPU and to 
evaluate the RISC performance, whilst processing different TCP and UDP streams at a 
high-speed rate of 100 Gbps. For the receiving side, the first action is to collect different 
real stream TCP and UDP data in hexadecimal files (see Appendix A) and store the 
hexadecimal file(s) in the queue buffer inside the MAC VHDL unit (Figure 7.1).   
The MAC VHDL then processes the frame header and trailer and forwards the valid 
packet to the Elastic buffer.  In response to the trigger signal, the Finite State Machine 
(FSM) at the Line Interface begins reading the packets from the Elastic buffer queue. 
 
[Diagram: MAC VHDL unit (Queue Hexadecimal File Buffer, Gigabit MAC Receive, Gigabit MAC
Transmit, Flow Control, Elastic Buffers and FSMs; frame processing; to/from PHY), Line
Interface (FSMs and FIFOs; to/from the MAC unit), packet processing with DMA, and Host
Interface and Host Area (FIFOs, FSM, Destination Buffer, Queue Hexadecimal File Buffer;
to/from host)]
 
Figure 7.1: Tested model for sending and receiving packets at the Packet Processing 
Unit 
 
The RISC processor starts processing the arrived packets after it receives the interrupt
signal from the Line Interface's FSM. The RISC processes the arrived packets until the
Interrupt Moderation size is reached. The RISC then triggers the FSM at the host
interface to start transferring the packets to the host memory (Destination Buffer). The
RISC core continues processing the newly arrived packets.
For the sending side, the large TCP/IP and UDP/IP packets are stored in the Queue
Hexadecimal Buffer. The large packets in the queue have the same size as the packets
that the server sent to the clients (see Appendix A). The host is also required to send the
Maximum Segment Size (MSS) and the location of each packet, which is stored inside
the sending buffer, to FIFO 4. The RISC at the sending side commences segmenting or
fragmenting the large packets to the MSS and generates the packet headers for each
segment. The RISC then initiates the DMA to transfer the MTU packets (headers and
application data) to the line interface. The FSM at the line interface asserts a signal to
the MAC's FSM to pull the packet from the SBI. The CRC at this stage is added to the
TCP/IP headers at the MAC VHDL. The CRC can also be integrated with the DMA
transfer. The MAC header and trailer are also created within this unit. The details of the
MAC unit's processing are not discussed here, as they are outside the scope of the
thesis, which focuses on the performance of the designed RISC processing at the
proposed PPU.
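
The segmentation step on the sending side can be illustrated with a short Python sketch. It
only shows how a large payload is cut into MSS-sized pieces; header generation and the DMA
transfer are omitted, and the function name segment_offsets and the MSS value of 1460 bytes
are assumptions made for the example.

```python
def segment_offsets(payload_bytes, mss):
    """Return (offset, length) pairs that split a large payload into MSS-sized segments."""
    if payload_bytes < 0 or mss <= 0:
        raise ValueError("payload_bytes must be >= 0 and mss must be positive")
    segments = []
    offset = 0
    while offset < payload_bytes:
        length = min(mss, payload_bytes - offset)   # the last segment may be shorter
        segments.append((offset, length))
        offset += length
    return segments


# A maximum-size large packet payload split with an assumed MSS of 1460 bytes.
segments = segment_offsets(65535, 1460)
print(len(segments), segments[0], segments[-1])
```

Each (offset, length) pair corresponds to one packet for which the sending core would
generate headers before handing the data to the DMA.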
    
 
 
7.3  Processing Analysis   
By delivering different TCP and UDP streams to the simulated model and by
investigating the waveform generated from the simulator, the number of instructions
required to process complete packets for both TCP and UDP is recorded. The RISC
starts by reading the packet headers from the RBI. It then reads the head of the
Circulation Buffer (CB), which contains the pointer space. The RISC initiates the DMA to
move the payload from the RBI to the pointer space inside the Receiver Buffer (Figure 7.2).
The Xilinx Behavior Model simulation presents the timing diagram for packet reception.
Three general steps of packet processing can be identified while processing TCP/IP or
UDP/IP at the PPU. In the first step, the packet header is processed to acquire the
protocol type and the connection information, such as the IP address and Sequence
Number (SN). The CAM is then searched for a matching connection, based on the IP
address and the port address that were extracted in the first step.
 
 
 
 
 
 
Figure 7.2: Large Receive Offload Processing cycles characteristics 
 
The second step determines whether the arrived packet is a Beginning of Message
(BOM), Continuation of Message (COM), End of Message (EOM) or Single Segment
Message (SSM). The final processing step is to move the data from the Receiver Buffer
Interface (RBI) to the Receiver Buffer (RB). To do this, the RISC core executes a
number of instructions during the data transfer and then becomes idle until the DMA
releases the bus. The number of RISC cycles required to process the LRO and LSO for
TCP and UDP is presented in the next section.
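
The second step can be sketched as a small decision function. The Python below is a
simplified illustration for TCP only and rests on an assumption that is not spelled out at
this point in the text: a packet with the PSH flag set closes a message (EOM if data has
already been amalgamated for the connection, SSM otherwise), and a packet without it opens
or continues one. The names MsgType and classify are hypothetical.

```python
from enum import Enum


class MsgType(Enum):
    BOM = "Beginning of Message"
    COM = "Continuation of Message"
    EOM = "End of Message"
    SSM = "Single Segment Message"


def classify(psh_flag_set, has_amalgamated_data):
    """Classify a TCP packet from its PSH flag and the state of its connection.

    has_amalgamated_data is True when earlier payload of the same connection
    is already held in the Receiver Buffer.
    """
    if psh_flag_set:
        return MsgType.EOM if has_amalgamated_data else MsgType.SSM
    return MsgType.COM if has_amalgamated_data else MsgType.BOM


for psh in (False, True):
    for held in (False, True):
        print(psh, held, classify(psh, held).name)
```

The actual LRO program also has to handle UDP and out-of-order packets, which this sketch
leaves out.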
The highlighted (dark yellow) area indicates the instructions that the RISC can
process during the data movement (the DMA runs at twice the processor clock rate).
The DMA needs 184 RISC cycles (each DMA transfer carries 64 bits) to finish
transferring one UDP payload (1472 bytes). During the data movement, the RISC can
execute instructions such as finding a CAM match, updating the CAM entries or
calculating the total bytes of the amalgamated data before the DMA finishes its job.
After the DMA finishes the transfer, the RISC processor takes control of the bus. The
RISC then processes the linked-list mechanism (e.g., updating the value of the packet
header after the movement of the EOM) or reads FIFO 3, which contains a new TCP
connection to be stored within the CAM. In this situation, the RISC has to wait several
cycles until the DMA completes its job. Even though the RISC lies mainly idle during
the processing of a packet (which decreases power consumption during the RISC's
wait cycles [95, 118]), the idle cycles reduce the NI's performance. The long path of the
payload processing extends the execution time for each TCP/IP or UDP/IP packet by the
number of the RISC's wait cycles. Processor idle cycles can be efficiently scheduled
for parallel/distributed computing or shared tasks [118, 119]. Scheduling or sharing
tasks would be an appropriate scenario for the proposed PPU if there were an extra bus
allowing the embedded core to execute a new task (e.g., accessing new packets). As it
stands, the embedded core at the PPU can only perform a single task at a time, and
reducing the core idle cycles enhances the packet processing performance in the PPU.
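
The relationship between the DMA transfer time and the RISC idle cycles can be estimated
with a short calculation. The Python sketch below assumes a 64-bit bus, two DMA cycles per
transfer, a DMA clock at twice the RISC clock and one RISC instruction per cycle; the
function names and the example count of 10 overlapped instructions are hypothetical.

```python
import math


def dma_risc_cycles(payload_bytes, bus_bytes=8, dma_cycles_per_transfer=2,
                    dma_to_risc_clock_ratio=2):
    """RISC cycles that elapse while the DMA moves one payload."""
    transfers = math.ceil(payload_bytes / bus_bytes)
    return math.ceil(transfers * dma_cycles_per_transfer / dma_to_risc_clock_ratio)


def idle_cycles(payload_bytes, overlapped_instructions):
    """RISC cycles left idle after the overlapped instructions have been executed."""
    return max(0, dma_risc_cycles(payload_bytes) - overlapped_instructions)


udp_cycles = dma_risc_cycles(1472)   # one full UDP payload
udp_idle = idle_cycles(1472, 10)     # assuming 10 instructions overlap the transfer

print(udp_cycles, udp_idle)
```

For a 1472-byte payload this reproduces the 184 RISC cycles quoted above, most of which
remain idle unless more work can be overlapped with the transfer.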
 
7.3.1 Large Receive Offload Analysis through Full-System Simulation   
By delivering different packets to the simulated model and by investigating the
waveform generated from the simulator, we are able to find the number of instructions
required to process the LRO algorithm for TCP/IP and UDP/IP. The timing
diagram for receiving BOM TCP/IP packets is illustrated in Figure 7.3. The instructions
that the RISC executes in order to process a received BOM TCP/IP packet are
presented as a diagram model. The REP starts by examining the packet header. The REP
executes 15 instructions before initiating the DMA to move data from the RBI to the RB.
For example, it starts by checking whether the protocol type is TCP or UDP and then
matches the connection information of the received message with the active TCP
connections inside the CAM. The REP executes 7 instructions during the data
movement cycles. These instructions, such as checking the message type, do not require
the use of the local bus. The REP becomes idle after completing a packet header until the
bus has been released by the DMA, so that it can fetch the next header from the RBI. The
DMA requires 2 cycles to move 64 bits from source to destination. It is clear from the
timing diagram that the DMA requires more time to complete moving 1460 bytes over the
64-bit bus. The RISC stays in idle mode until the DMA releases the bus.
 
7.3.1.1 TCP processing cycles  
The analysis of the processing cycles during BOM processing is illustrated in Figure
7.4. The RISC cycles before initiating the DMA are spent examining the packet
headers (TCP and IP). This examination involves reading the packet identifiers (the IP
address and Port ID) from the IP and TCP headers. Processing continues by matching
these identifiers against those stored in the CAM. If a match is found, the RISC
initiates the DMA to start transferring data. During the data transfer, the RISC core
continues by creating a linked-list for this link. After the linked-list has been
completed, the RISC waits until the DMA has finished transferring data before reading
the second packet. While processing the COM, the REP executes more instructions before
the data movement than for the BOM (Figure 7.5). This is because the REP needs to pull
the end-address of the linked-list from the CAM. In addition, the REP spends 10
instructions during the data movement in order to link the received payload to the
previously amalgamated data. The EOM can need fewer instructions than a COM
since there is no need to update the linked-list inside the CAM (Figure 7.6). However,
with the EOM, the REP needs to update the TCP and IP headers of the large packet that
has been amalgamated inside the RB. In addition, it needs to send the link's
information to FIFO 2, including the start address of the current packet. When the REP
discovers that the received message is an SSM, where the PUSH flag is set [33] and
there is no amalgamated data for this link, it sends the packet as is to the RB and
there is no need to create a linked-list (Figure 7.7). Out-of-order packets require more
instructions than other packets (29 instructions). The REP must create a sub-linked-list
if the sequence number of the arrived packet is out of range (Figure 7.8). If the
sequence number of a received message is within the range of an existing list in the
RB, the REP can add the message to an existing linked-list (Figure 7.9).
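
The message types described above can be summarised as a small decision function. This is a
minimal sketch under stated assumptions, not the REP firmware: the `Link` record is a
hypothetical stand-in for one CAM entry, and the rule "unexpected sequence number means
out-of-order" is a simplification of the in-range and out-of-range cases of Figures 7.8 and 7.9.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical per-connection state, standing in for one CAM entry."""
    expected_seq: int          # next sequence number the receiver expects
    aggregated_bytes: int = 0  # payload already amalgamated in the RB


def classify_tcp(link: Link, seq: int, push: bool) -> str:
    """Classify an arriving TCP segment as BOM, COM, EOM, SSM or OUT_OF_ORDER."""
    if seq != link.expected_seq:
        return "OUT_OF_ORDER"            # handled through a sub linked-list
    if link.aggregated_bytes == 0:
        # Nothing amalgamated yet: PUSH set means the packet goes out as is.
        return "SSM" if push else "BOM"
    # Data already amalgamated: PUSH set closes the large packet.
    return "EOM" if push else "COM"


if __name__ == "__main__":
    link = Link(expected_seq=1000)
    print(classify_tcp(link, 1000, push=False))  # BOM
```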
 
7.3.1.2  UDP processing cycles 
Processing the BOM of a UDP packet requires 15 instructions before initiating the
DMA (Figure 7.10). These instructions include extracting the connection information
(the IP address and Port ID) and reading the free space from the Circulation Buffer (CB)
inside the memory. The REP spends 10 instructions to complete the creation of a new
linked-list for this link. The total number of bytes of amalgamated data must also be
stored inside the CAM after each received UDP packet with the same connection data (IP
address, Port ID). When the REP discovers the EOM, or if the timer for amalgamating
the packets has expired, the header of the message stored inside the RB is
modified with the total number of bytes amalgamated for this link. After
sending the COM to the RB, the REP prepares the linked-list with the “offset number” that
the receiving side is expecting for the connection (Figure 7.11). The REP executes 12
instructions to link the received packet to the previously amalgamated data of this
link. When the EOM is processed, the core is required to update the headers of the
large packet (Figure 7.12); the core requires 30 cycles to complete the EOM. When an
SSM is found (Figure 7.13), there is no need to update the CAM with the total number of
bytes or to create a linked-list. The REP spends only 17 instructions on SSM messages.
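
The offset bookkeeping described above can be sketched as follows. This is an illustration
under assumptions, not the simulated REP code: `UdpLink` and `receive_fragment` are
hypothetical names, offsets are kept in bytes for readability, and the flag convention is
standard IPv4 (MF is set on every fragment except the last).

```python
from dataclasses import dataclass


@dataclass
class UdpLink:
    """Hypothetical CAM entry for one UDP connection (IP address, Port ID)."""
    expected_offset: int = 0  # offset the receiving side expects next
    total_bytes: int = 0      # bytes amalgamated so far for this link


def receive_fragment(link: UdpLink, offset: int, length: int, more_fragments: bool) -> str:
    """Update the link state for one fragment and return its message type."""
    if offset != link.expected_offset:
        return "OUT_OF_ORDER"
    if offset == 0:
        kind = "BOM" if more_fragments else "SSM"
    else:
        kind = "COM" if more_fragments else "EOM"
    link.total_bytes += length
    # Expected offset = current offset + bytes carried by this fragment.
    link.expected_offset = offset + length
    return kind


if __name__ == "__main__":
    link = UdpLink()
    for off, size, mf in [(0, 1480, True), (1480, 1480, True), (2960, 500, False)]:
        print(receive_fragment(link, off, size, mf))
    print(link.total_bytes)  # total written into the header of the large packet
```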
   
  
Figure 7.3: Timing diagram for the TCP BOM packet, showing the instructions that are processed
by the RISC core
[Timing diagram. Signals: CLK, DMA_Clk, data movements, RISC instructions. Regions along the
processing-time axis: packet header processing (before data movements), instructions processed
during data movements, DMA cycles, RISC idle cycles. Marked events: “Found it”, “Initiated the
DMA”.]
  
Figure 7.4: Total number of instructions for the TCP BOM packet without idle cycles
[Diagram: instructions that are processed by the RISC core to complete the processing of the TCP
BOM packet where the Receiver Buffer Interface is not empty. Processing needed before moving the
packet body: 15 instructions. Instructions processed during the data movement: 7. Per-step
instruction counts as printed: 3, 4, 4, 3, 1, 2, 1, 1, 1, 2. Steps shown: check the current
packet's protocol (TCP == 6); extract the link info from the arrived packet; examine the
connection's data stored in the CAM and get the free space from the CB; get the match and
examine the location of the incoming packet inside the linked-list; initiate the DMA to move the
packet to the selected place in the RB; get the end address where the next data related to this
stream should be stored; get the next sequence number that the receiver is expecting; update the
CAM with the expected Seq. #, start address and end address of this stream; calculate the total
bytes of the arrived packet to be stored in the CAM; check the PUSH flag's status of the arrived
packet; the RISC becomes idle until the DMA releases the bus.]
  
  
Figure 7.5: Total number of instructions for the TCP COM packet without idle cycles
[Diagram: instructions that are processed by the RISC core to complete the processing of the TCP
COM packet where the Receiver Buffer Interface is not empty. Processing required before the data
movement: 19 instructions. Instructions that the RISC core can process during the data movement:
10. Per-step instruction counts as printed: 3, 4, 4, 3, 4, 1, 2, 1, 5, 2. Steps shown: check the
current packet's protocol; extract the link info from the arrived packet; check the sequence
number of the arrived packet; get the match and check the end address of this stream; get the
size of the packet and the address where the payload part is to be stored in the RB; initiate
the DMA to move the payload part to the selected place in the RB; calculate the end address
where the next data related to this stream should be stored; update the total bytes and the
expected Seq. #; update the CAM with the expected Seq. #, start address and end address of this
stream; check the PSH flag's status of the arrived packet; the RISC becomes idle.]
  
  
Figure 7.6: Total number of instructions for the TCP EOM packet without idle cycles
[Diagram: instructions that are processed by the RISC core to complete the processing of the TCP
EOM packet where the Receiver Buffer Interface is not empty. Processing required before the data
movement: 19 instructions. Instructions that the RISC core can process during the data movement:
5. Instructions after the data has moved: 4. Per-step instruction counts as printed: 3, 4, 4, 3,
4, 1, 5, then 4. Note in the figure: the DMA needs 11 RISC cycles to move 48 bytes. Steps shown:
check the current packet's protocol (TCP == 6); extract the link info from the arrived packet;
check the sequence number of the arrived packet; get the match and check the end address of this
stream; get the size of the packet and the address where the payload part is to be stored in the
RB; initiate the DMA to move the payload part to the selected place in the RB; if the PUSH flag
is raised, update the total bytes and the expected Seq. #; update the IP and TCP headers and
FIFO 2; the RISC becomes idle.]
 
  
Figure 7.7: Total number of instructions for the TCP SSM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC to complete the processing of the
TCP SSM packet. Processing needed before moving the packet body: 15 instructions. Instructions
processed during the data movement: 2. Per-step instruction counts as printed: 3, 4, 4, 3, 1, 2.
Steps shown: check the current packet's protocol (TCP == 6); extract the link info from the
arrived packet; examine the connection's data stored in the CAM and get a free space from the
memory management; get the match and examine the location of the incoming packet; initiate the
DMA to move the packet to the selected place in the RB; check the PSH flag's status of the
arrived packet; the RISC becomes idle.]
 
  
Figure 7.8: Total number of instructions for the TCP out-of-order packet when the sub linked-list
is equal to “0” in the CAM
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the TCP out-of-order packet when creating a new sub linked-list. Processing needed before moving
the packet body: 22 instructions. Instructions that the RISC core can process during the data
movement: 7. Per-step instruction counts as printed: 3, 4, 4, 3, 3, 4, 1, 2, 1, 1, 1, 2. Steps
shown: check the current packet's protocol; extract the link info from the arrived packet; check
the sequence number of the arrived packet; get the match and check the end address of this
stream; get the second match from the CAM and test whether the second entry is “000”; get the
size of the packet and the address where the payload part is to be stored in the RB; initiate
the DMA to move the payload part to the selected place in the RB; get the end address where the
next data related to this stream should be stored; get the next sequence number that the
receiver is expecting; update the CAM with the expected Seq. #, start address and end address of
this stream; calculate the total bytes of the arrived packet to be stored in the CAM; check the
PSH flag's status of the arrived packet; the RISC becomes idle.]
  
  
Figure 7.9: Total instructions for the TCP out-of-order packet when the sub linked-list is not
equal to “0” in the CAM
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the TCP out-of-order packet when updating the sub linked-list. Processing needed before moving
the packet body: 22 instructions. Instructions that the RISC core can process during the data
movement: 10. Per-step instruction counts as printed: 3, 4, 4, 3, 3, 4, 1, 2, 1, 5, 2. Steps
shown: check the current packet's protocol; extract the link info from the arrived packet; check
the sequence number of the arrived packet; get the match and check the end address of this
stream; get the second match from the CAM and, if the second entry is not zero, compare it; get
the size of the packet and the address where the payload part is to be stored in the RB;
initiate the DMA to move the payload part to the selected place in the RB; calculate the end
address where the next data related to this stream should be stored; update the CAM with the
expected Seq. #, start address and end address of this stream; update the total bytes and the
expected Seq. #; check the PSH flag's status of the arrived packet; the RISC becomes idle.]
  
  
Figure 7.10: Total number of instructions to process the UDP BOM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC to complete the processing of the
UDP BOM packet. Processing needed before moving the packet body: 15 instructions. Instructions
processed during the data movement: 10. Per-step instruction counts as printed: 3, 4, 5, 3, 2,
3, 2, 1, 2. Steps shown: check the current packet's protocol (UDP == 17); extract the link info
from the arrived packet; get the match and examine the location of the incoming packet; read a
new address from the memory management and initiate the DMA to move the packet to the selected
place in the RB; assign the end address where the next data related to this stream should be
stored; get the expected offset (current offset + total bytes); merge the start address and
total bytes in a register to be stored in the CAM; update the CAM with the expected offset and
start address of this stream; check the MF flag's status of the arrived packet (printed as
MF = 0); the RISC becomes idle until the DMA releases the bus.]
  
Figure 7.11: Total number of instructions to process the UDP COM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC to complete the processing of the
UDP COM packet. Processing needed before moving the packet body: 17 instructions; the remaining
instructions are processed during the data movement. Per-step instruction counts as printed: 3,
4, 5, 4, 1, 3, 3, 2, 2, 2. Steps shown: check the current packet's protocol (UDP == 17); extract
the link info from the arrived packet; examine the link info against the one stored in the CAM
(matched and MF = 0, as printed); get the match and examine the location of the incoming packet;
initiate the DMA to move the packet to the selected place in the RB; get the new end address
(current address + total length); assign the end address where the next data related to this
stream should be stored; get the expected offset (current offset + total bytes); prepare the
data to be stored in the CAM (connection info, end address and total length); update the CAM
with the expected offset, start address and end address of this stream; the RISC becomes idle.]
  
Figure 7.12: Total number of instructions to process the UDP EOM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the UDP EOM packet. Processing needed before moving the packet payload: 19 instructions.
Instructions that the RISC core can process during the data movement: 5. Instructions after the
data has been moved: 4. Per-step instruction counts as printed: 3, 4, 4, 3, 4, 1, 5, then 4.
Steps shown: check the current packet's protocol; extract the link info from the arrived packet;
examine the link info against the one stored in the CAM (matched and MF = 1, as printed); get
the match and check the end address of this stream; get the size of the packet and the address
where the payload part is to be stored in the RB; initiate the DMA to move the payload part to
the selected place in the RB; update the total bytes and the expected Seq. #; update the IP and
UDP headers and FIFO 2; the RISC might be idle, depending on the payload size, until the DMA
releases the bus.]
 
  
Figure 7.13: Total number of instructions to process the UDP SSM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC to complete the processing of the
UDP SSM packet. Processing needed before moving the packet body: 15 instructions; the remaining
instructions are processed during the data movement. Per-step instruction counts as printed: 3,
4, 5, 3, 2. Steps shown: check the current packet's protocol (UDP == 17); extract the link info
from the arrived packet; get the match and examine the location of the incoming packet; initiate
the DMA to move the packet to the selected place in the RB; check the MF flag's status of the
arrived packet (printed as MF = 1); the RISC becomes idle until the DMA releases the bus.]
  
Figure 7.14: Total instructions for the UDP out-of-order packet when the sub linked-list is not
equal to “0” in the CAM
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the UDP out-of-order packet. Processing needed before moving the packet body: 22 instructions.
Instructions that the RISC core can process during the data movement: 10. Per-step instruction
counts as printed: 3, 4, 4, 3, 3, 4, 1, 2, 1, 5, 2. Steps shown: check the current packet's
protocol (UDP == 17); extract the link info from the arrived packet; check the sequence of the
arrived packet; get the match and examine the location of the incoming packet; get the second
match from the CAM and, if the second entry is not zero, compare it; get the size of the packet
and the address where the payload part is to be stored in the RB; initiate the DMA to move the
payload part to the selected place in the RB; calculate the end address where the next data
related to this stream should be stored; update the CAM with the expected Seq. #, start address
and end address of this stream; update the total bytes and the expected offset; get the expected
offset (current offset + total bytes); the RISC becomes idle.]
 
7.3.2 Large Send Offload Analysis through Full-System Simulation   
The segmentation and fragmentation functions for TCP and UDP have been simulated. The data
path on the sending side differs from that on the receiving side: the DMA controller is
required to move both the packet header and the payload part from the Sending Buffer (SB) to the
Sending Buffer Interface (SBI). In the sending-side simulator, the RISC core initiates the DMA
with the location and size of the data to be transferred from the SB to the SBI. The core is also
responsible for updating the packet headers of each outgoing segment. The core uses several
pointers to segment the messages inside the SB: the Start Header Address Pointer (SHAP),
End-Header Address Pointer (EHAP), Start Payload Pointer (SPP) and End-Payload Pointer (EPP).
The SPP indicates where the data starts, while the EPP indicates the end of the first segment.
The RISC core uses the SHAP to get the starting point of the network headers inside the SB. The
processor stores these pointers in an internal register when the core switches from sending a
large message to sending a small message.
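
The way the payload pointers walk through a message can be sketched as follows. This is a
minimal illustration, not the simulated core's code: the function name is hypothetical, the EPP
is treated as an exclusive end address, and the pointer values are plain integers rather than SB
addresses.

```python
from typing import Iterator, Tuple


def segment_ranges(spp: int, message_len: int, mss: int) -> Iterator[Tuple[int, int]]:
    """Yield one (SPP, EPP) pair per outgoing segment of a message in the SB.

    Assumption: EPP points one byte past the end of the segment.
    """
    end = spp + message_len
    while spp < end:
        epp = min(spp + mss, end)  # the last segment may be shorter than the MSS
        yield spp, epp
        spp = epp                  # the next segment starts where this one ended


if __name__ == "__main__":
    for start, stop in segment_ranges(0, 4000, 1460):
        print(start, stop)
```

For a 4000-byte message and an MSS of 1460 bytes this yields three ranges, the last of which
covers the remaining 1080 bytes.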
 
7.3.2.1  TCP processing cycles  
The SEP starts by calculating the total length of the moved message. If the message is larger
than the MSS, it starts processing the BOM (Figure 7.15). The Sending Embedded Processor (SEP)
executes 12 instructions before initiating the DMA to transfer the packet header from the SB to
the SBI. These 12 instructions include generating the acknowledgment number (AKN) of the
outgoing packet. The sequence number of the BOM remains the default one. During the data
movement, the SEP can execute five instructions that do not use the system bus, such as
calculating the remaining size of the TCP message. If more packets are required to be sent for
the message, the SEP follows the COM procedure for the current message (Figure 7.16). When the
SEP reaches the last part of a TCP message, it sends the header followed by the data (Figure
7.17) and there is no need to update the pointers. If the total TCP message sent from the host is
equal to or less than the MSS, it is sent as is to the SBI (Figure 7.18). With an SSM, the SEP
needs to initiate the DMA to move the complete packet to the SBI. After the message has been
moved to the SBI, memory management collects the free spaces and sends them to FIFO 1. All
signalling packets or packets with zero payload are sent as is to the MAC unit.
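
The segmentation rules above can be illustrated with a short sketch. It is a simplified model
under assumptions, not the SEP implementation: the function name and tuple layout are
hypothetical, and only the sequence number and payload length of each segment are tracked.

```python
from typing import List, Tuple


def tcp_segments(total_payload: int, mss: int, initial_seq: int = 0) -> List[Tuple[str, int, int]]:
    """Return (message type, sequence number, payload length) per outgoing segment."""
    if total_payload <= mss:
        # A message no larger than the MSS is sent as is.
        return [("SSM", initial_seq, total_payload)]
    segments = []
    sent = 0
    while sent < total_payload:
        length = min(mss, total_payload - sent)
        if sent == 0:
            kind = "BOM"                      # keeps the default sequence number
        elif sent + length == total_payload:
            kind = "EOM"                      # last part of the message
        else:
            kind = "COM"
        # New sequence number = old sequence number + bytes already sent.
        segments.append((kind, initial_seq + sent, length))
        sent += length
    return segments


if __name__ == "__main__":
    for segment in tcp_segments(4000, 1460):
        print(segment)
```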
 
7.3.2.2 UDP processing cycles 
The fragmentation process for UDP messages uses an approach similar to that for TCP. However,
UDP/IP packet processing is based on the fragmentation fields of the IP header: the packet
identification (ID), the flag bits (M bit and F bit) and the Fragment Offset (FO). The SEP spends
12 instructions to generate the first packet of a UDP message, the BOM (Figure 7.19). Moving the
data packet from the SB to the SBI requires more DMA cycles, especially if the MTU is 1500 bytes.
During the data movement, the SEP calculates the remaining data and prepares the fragment
fields (flag bits and the new FO). When processing the COM message, the SEP executes five
instructions, checking the remaining size for more packets that need to be sent to the MAC unit
(Figure 7.20). Once the message inside the SB reaches the last piece of data to be sent, the SEP
updates the IP header with the actual packet length and there is no need to update the message
pointers (Figure 7.21). The pointers are then reset to zero. SSM packets are sent as is from the
SB to the SBI (Figure 7.22). An SSM can be a message with a small payload and can be sent to the
MAC unit with no changes.
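
The fragment fields that the SEP prepares can be illustrated with standard IPv4 fragmentation
arithmetic. The sketch below assumes a 20-byte IP header without options, the standard rule that
every fragment except the last carries a multiple of 8 bytes, and a Fragment Offset expressed in
8-byte units; the function name is hypothetical and the simulated design may size its fragments
differently.

```python
from typing import List, Tuple

IP_HEADER_BYTES = 20  # assumption: IPv4 header without options


def ip_fragments(ip_payload: int, mtu: int = 1500) -> List[Tuple[int, int, bool]]:
    """Return (fragment offset in 8-byte units, data length, MF flag) per fragment.

    ip_payload is the UDP header plus the UDP data carried by the original datagram.
    """
    max_data = (mtu - IP_HEADER_BYTES) // 8 * 8  # largest multiple of 8 that fits
    fragments = []
    sent = 0
    while True:
        length = min(max_data, ip_payload - sent)
        more = sent + length < ip_payload        # MF = 1 while data remains
        fragments.append((sent // 8, length, more))
        sent += length
        if not more:
            return fragments


if __name__ == "__main__":
    for fragment in ip_fragments(4000):
        print(fragment)
```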
  
Figure 7.15: Total number of instructions to process the TCP BOM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the TCP BOM packet. Processing needed before moving the packet body: 13 instructions.
Instructions processed during the data movement: 6. Per-step instruction counts as printed: 3,
3, 1, 4, 1, 1, 1, 2, 2, 1. Steps shown: check the current packet's protocol (TCP, 6); check the
length of the message (TL > 1500 B: yes); get the size of the data to be sent with the first
segment (PEP = PSHA + MTU); update the headers (TL = 1500, Seq. = 0, Ack. = 1461); calculate the
PEHA (PSHA + 28 B); initiate the DMA to move the packet header; initiate the DMA to move the
packet to the selected place inside the SB (send the first packet); update the pointers
(PSP = PEP, PEP = PEP + 1460 B for an MTU of 1500 bytes); check the remaining length
(TL = TL - 1500 B: yes); then (COM) calculate the new Seq. and Ack.; the RISC becomes idle until
the DMA releases the bus.]
  
Figure 7.16: Total number of instructions to process the TCP COM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the TCP COM packet. Processing needed before moving the packet body: 4 instructions.
Instructions processed during the data movement: 6. Per-step instruction counts as printed: 2,
1, 1, 1, 3, 2. Steps shown: update the total length inside the IP header and the TCP header with
the new Seq. # and Ack. #; initiate the DMA to move the packet's header to the LB (send first
packet); initiate the DMA to move the payload part to the selected place in the SB; update the
pointers (PSP = PEP, PEP = PEP + remaining bytes); check the remaining length
(TL = TL - 1500 B: no); then (EOM) calculate the new Seq. and Ack. # (new Seq. = old Seq. # +
new length); the RISC becomes idle.]
  
  
Figure 7.17: Total number of instructions to process the TCP EOM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the TCP EOM packet. Processing needed before moving the packet body: 6 instructions. No
instructions are processed during the data movement. Per-step instruction counts as printed: 4,
1, 1. Steps shown: update the total length inside the IP header and the TCP header with the new
Seq. # and Ack. #; initiate the DMA to move the packet's header to the LB (send first packet);
initiate the DMA to move the payload part to the selected place in the LB; the RISC becomes
idle.]
 
  
Figure 7.18: Total number of instructions to process the TCP SSM packet without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the TCP SSM packet. Processing needed before moving the packet body: 9 instructions. Per-step
instruction counts as printed: 3, 4, 1, 1. Steps shown: check the current packet's protocol
(TCP, 6); check the length of the message (TL > 1500 B: no); get the PEP (PSHA + actual length);
initiate the DMA to move the packet to the selected place in the SB (send first packet); the
RISC becomes idle.]
  
Figure 7.19: Total number of instructions for the UDP BOM packets without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the UDP BOM packet. Processing needed before moving the packet body: 12 instructions.
Instructions processed during the data movement: 6. Per-step instruction counts as printed: 3,
3, 1, 3, 1, 1, 1, 2, 2, 1. Steps shown: check the current packet's protocol (UDP == 17); check
the length of the message (TL > 1500 B: yes); get the size of the first data to be sent with the
first fragment (PEP = PSHA + MTU); update the IP and UDP headers (TL = 1500, MF = 1,
Offset = 0); calculate the PEHA (PSHA + 28 B); initiate the DMA to move the packet header;
initiate the DMA to move the packet to the selected place in the SB; update the pointers
(PSP = PEP, PEP = PEP + 1460 B); check the remaining length (TL = TL - 1500 B: yes); then (COM)
calculate the new offset; the RISC becomes idle.]
  
Figure 7.20: Total number of instructions for the UDP COM packets without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the UDP COM packet. Processing needed before moving the packet body: 3 instructions.
Instructions processed during the data movement: 5. Per-step instruction counts as printed: 1,
1, 1, 1, 2, 2. Steps shown: update the IP header (printed as MF = 0, Offset = new offset);
initiate the DMA to move the packet's header to the LB (send first packet); initiate the DMA to
move the payload part to the selected place in the LB; update the pointers (PSP = PEP,
PEP = PEP + remaining bytes); check the remaining length (TL = TL - 1500 B: no); then (EOM)
calculate the new offset (new offset = old offset + new length); the RISC becomes idle.]
  
Figure 7.21: Total number of instructions for the UDP EOM packets without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the UDP EOM packet. Processing needed before moving the packet body: 4 instructions. Per-step
instruction counts as printed: 2, 1, 1. Steps shown: update the IP header (printed as MF = 1,
Offset = new offset); initiate the DMA to move the packet's header to the LB (send first
packet); initiate the DMA to move the payload part to the selected place in the LB; the RISC
becomes idle.]
  
Figure 7.22: Total number of instructions for the UDP SSM packets without idle cycles
[Diagram: type of instructions that are processed by the RISC core to complete the processing of
the UDP SSM packet. Processing needed before moving the packet body: 8 instructions. No
instructions are required to be processed during the data movement. Per-step instruction counts
as printed: 3, 3, 1, 1. Steps shown: check the current packet's protocol (UDP, 17); check the
length of the message (TL > 1500 B: no); get the size of the data to be sent with this fragment
(PSHA + TL); initiate the DMA to move the packet to the selected place in the SB; the RISC
becomes idle.]
 
7.4 The Payload Length Path
At this stage, the packet processing performance is unsatisfactory. Idle cycles occur
while the RISC waits for the DMA to finish moving the data, even though the clock for
the DMA runs at twice the processor's clock rate. The idle cycles might reach levels as
high as 70 percent of the RISC usage (Table 7.1). A greater number of idle cycles would
make the NI a major bottleneck within a network. As shown in Table 7.1, the 1500-byte
packets have the highest number of RISC cycles (206 cycles). Although this is less than
the 415 cycles measured with SPIM (Figure 3.14), it is still a high number of cycles for
high-speed networks.
Smaller packets in the range of 64 to 256 bytes require fewer DMA cycles than larger
packets with a greater payload. There are correspondingly fewer idle cycles because
less data has to be transferred from the RBI to the Receiver Buffer. Monitoring of the
RISC cycles showed that they became particularly high when large packets (i.e., 1500
bytes) were processed.
When the host sends a block of data to the other end, the host CPU sends the large 
packet containing the PDU data to the SB. The host CPU also sends all the information 
needed to transmit this packet through FIFO 4, including the MSS and the location of 
the large packet inside the SB.  
 
 
 
Table 7.1: The number of RISC instructions required to process the LRO for TCP and
UDP packets when the DMA clock is double the RISC's clock

Total RISC cycles for LRO (Total = total RISC cycles, Idle = idle cycles)

                                      1500 bytes  1024 bytes  512 bytes   256 bytes   128 bytes   64 bytes
Packet type                           Total Idle  Total Idle  Total Idle  Total Idle  Total Idle  Total Idle
TCP
  Single Segment Message (SSM)        198   181   138   121   74    57    42    25    26    9     17    1
  Beginning Of Message (BOM)          198   176   138   116   74    52    42    20    26    4     22    0
  Continuation Of Message (COM)       202   173   142   113   78    49    46    17    30    1     29    0
  End Of Message (EOM)                206   176   146   116   82    52    50    20    34    4     28    0
  Out-Of-Order                        205   173   145   113   81    49    49    17    33    1     32    0
UDP
  Single Segment Message (SSM)        199   182   140   123   28    59    43.5  27    18    11    17    0
  Beginning Of Message (BOM)          199   174   140   115   28    51    43.5  19    25    3     25    0
  Continuation Of Message (COM)       201   172   142   113   30    49    45.5  17    29    1     29    0
  End Of Message (EOM)                209   179   150   120   32    56    53.5  24    28    8     28    0
  Out-Of-Order                        206   174   147   115   32    51    50.5  19    32    3     32    0

(0) means there are no idle cycles, where the RISC takes longer than the DMA.
 
The RISC performance was tested at a line speed of 100 Gbps. The segmentation of TCP
and the fragmentation of UDP have been simulated with the Xilinx simulator (Behavior
Model). The data path on the sending side differs from that on the receiving side when
the DMA controller is required to move the packet header. With regard to the Sending
Buffer, the RISC core initiates the DMA to transfer data from the Sending Buffer to the
Sending Buffer Interface (SBI).
The RISC processor is required to 'cut' the larger messages into smaller packets that can
be sent over the network wire. A message is therefore sent as a BOM, followed by the
COMs, followed by the EOM, which is the final part of the sequence. The cycles required
for each packet type are presented in Table 7.2.
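The BOM, COM and EOM sequence can be illustrated with a minimal sketch. It is not the assembly routine used in the PPU; the function name `segment` and the (kind, offset, length) tuple layout are hypothetical and chosen only for illustration. A message that fits in one MSS is treated as an SSM.

```python
def segment(message_bytes: int, mss: int):
    """Split a message into (kind, offset, length) segments of at most mss bytes."""
    if message_bytes <= mss:
        # The whole message fits in one packet: Single Segment Message.
        return [("SSM", 0, message_bytes)]
    segments = []
    offset = 0
    while offset < message_bytes:
        length = min(mss, message_bytes - offset)
        if offset == 0:
            kind = "BOM"                      # first packet of the message
        elif offset + length == message_bytes:
            kind = "EOM"                      # last packet of the message
        else:
            kind = "COM"                      # any packet in between
        segments.append((kind, offset, length))
        offset += length
    return segments


print(segment(4000, 1460))
```

With a 4000-byte message and an MSS of 1460 bytes, the sketch yields one BOM, one COM and a shorter EOM of 1080 bytes.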
The total cycles required by the RISC to complete the BOM, COM, EOM and SSM were
measured with the DMA running at twice the RISC clock speed. The RISC core requires
202 cycles, including the idle cycles, to complete the processing of the BOM. However,
177 of these RISC cycles are idle cycles. This is the case when 1500-byte packets are
processed at the PPU. The number of idle cycles falls as the MSS becomes smaller and
reaches zero when the BOM or COM packet size is 64 bytes.
  
 
 
 
Table 7.2: RISC cycles while performing the Large Send Offload, when the DMA clock
rate is double the RISC's clock rate

Total RISC cycles for LSO (Total = total RISC cycles, Idle = idle cycles)

                                      1500 bytes  1024 bytes  512 bytes   256 bytes   128 bytes   64 bytes
Packet type                           Total Idle  Total Idle  Total Idle  Total Idle  Total Idle  Total Idle
TCP
  Single Segment Message (SSM)        192   183   132   123   68    59    36    27    20    11    17    3
  Beginning Of Message (BOM)          196   177   136   117   72    53    40    21    24    5     22    0
  Continuation Of Message (COM)       187   177   127   117   63    53    31    21    15    5     29    0
  End Of Message (EOM)                187   183   129   123   65    59    33    27    17    11    28    3
UDP
  Single Segment Message (SSM)        192   184   133   125   28    61    37    29    18    13    17    5
  Beginning Of Message (BOM)          196   178   137   119   28    55    41    23    25    7     25    0
  Continuation Of Message (COM)       189   179   130   120   30    56    34    24    29    8     29    0
  End Of Message (EOM)                188   184   129   125   32    61    33    29    28    12    28    5
 
 
It can be seen from Table 7.2 that the RISC core is idle once the transfer has been
initiated and remains so until the DMA has completed the movement of data. The idle
time varies with the packet payload, specifically from 512 to 1500 bytes. For example, if
the packet size is 1500 bytes, the DMA requires 183 cycles (1460 bytes / 64 bits) to move
the payload from the Sending Buffer to the SBI over the 64-bit bus. It should be noted
that the packet sizes shown here (for example 1500, 1024, 512 and 64 bytes) include the
headers. Each packet has 40 bytes or 28 bytes of header information (TCP/IP or UDP/IP,
respectively), with the rest being payload data. For instance, a packet shown as 1024
bytes consists of a 40-byte header, with the remaining 984 bytes consisting of payload.
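The cycle counts quoted above follow from dividing the payload by the bus width. The following is a minimal sketch of that arithmetic, assuming that the DMA delivers one 64-bit word per RISC cycle (two DMA cycles at twice the RISC clock) and that a partial final word still costs a full cycle. The names `dma_cycles`, `BUS_BYTES` and `HEADER_BYTES` are illustrative and do not come from the PPU design.

```python
import math

BUS_BYTES = 8                            # 64-bit local bus
HEADER_BYTES = {"tcp": 40, "udp": 28}    # TCP/IP and UDP/IP header sizes from the text


def dma_cycles(packet_bytes: int, protocol: str) -> int:
    """RISC cycles the DMA needs to move the payload of one packet."""
    payload = packet_bytes - HEADER_BYTES[protocol]
    return math.ceil(payload / BUS_BYTES)


for size in (1500, 1024, 512, 256, 128, 64):
    print(size, dma_cycles(size, "tcp"), dma_cycles(size, "udp"))
```

For 1500-byte packets this gives 183 cycles for TCP (1460 bytes of payload) and 184 cycles for UDP (1472 bytes of payload), which agree with the SSM idle cycles listed in Table 7.2.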
The budgeted time at 100 Gbps is ~43 ns for each 512-byte packet and ~123 ns for each
large packet (1500 bytes). Processing must be completed within this time to avoid the
dropping of packets. High clock rates are therefore required to execute more than 181
RISC cycles and to complete 1500-byte packets within 123 ns.
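The budgeted times can be approximated from the line rate. The sketch below assumes 20 bytes of per-frame wire overhead (an 8-byte preamble plus a 12-byte inter-frame gap). This overhead is an assumption made here because it closely reproduces the quoted 512-byte budget; the derivation of the budgets is not stated in the text, and the names `budget_ns` and `WIRE_OVERHEAD_BYTES` are illustrative.

```python
WIRE_OVERHEAD_BYTES = 20  # assumed: 8-byte preamble + 12-byte inter-frame gap


def budget_ns(frame_bytes: int, line_rate_gbps: float) -> float:
    """Time one frame occupies the wire, in nanoseconds."""
    bits = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return bits / line_rate_gbps          # bits / (Gbit/s) = ns


print(round(budget_ns(512, 100), 2))      # 512-byte packet
print(round(budget_ns(1518, 100), 2))     # 1518-byte Ethernet frame carrying 1500 bytes
```

Under this assumption a 512-byte packet occupies the wire for about 42.6 ns, and a 1518-byte Ethernet frame carrying a 1500-byte packet for about 123 ns.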
 
7.5 Conclusion  
A detailed analysis of TCP/IP and UDP/IP processing has been presented. The RISC
instructions executed while processing BOM, COM, EOM and out-of-order packets were
analyzed during this simulation. The simulation demonstrates that the payload path
length is long, at about 184 cycles when an out-of-order packet is processed. The RISC
is idle for about 180 cycles when the packet size is 1500 bytes.
The idle cycles occur while the RISC waits for the DMA to finish moving the data, even
though the clock for the DMA runs at twice the processor's clock rate. The DMA required
about 184 cycles to complete moving the payload part of a 1500-byte UDP packet.
Enhancing the TCP/IP and UDP/IP packet processing and reducing the idle cycles are
important for achieving 100 Gbps.
In the next chapter, the enhancements made to packet processing at the Packet
Processing Unit so that it scales to 100 Gbps are presented. The RISC core and payload
length are investigated for LRO and LSO.
 
 
 
 
 
 
Chapter 8  
VHDL Simulation Results   
 
 
8.1 Introduction  
The aim of this research is to design a novel PPU based on a high-performance 32-bit
processor for 100 Gbps. As discussed in the previous chapter, a DMA running at twice
the RISC clock rate is suitable for small packets (e.g., 256 bytes or less), for which it
completes the movement of data to the NI within a few cycles (e.g., the DMA requires 3
RISC cycles to move a 64-byte packet). This DMA clock rate is not sufficient for larger
packets (512 bytes or larger).
This chapter presents the enhancements required at the Packet Processing Unit for line
rates of 40 Gbps and 100 Gbps. The RISC clock rate that is capable of supporting a
high-speed network is presented. This chapter also presents the DMA clock rate that
supports the RISC core by reducing the path length of processing the TCP and UDP
payload inside the NI.
 
8.2 Packet Processing Enhancements for High-Speed  Networks  
The DMA required two cycles to move each block of 64 bits from the Line Interface
(LI) to the Host Interface (HI) or from HI to LI. Because the DMA runs at twice the
processor clock rate, this corresponds to one block per RISC cycle. The number of cycles
performed depends upon the type of packet that is being processed (Beginning of
Message (BOM), Continuation of Message (COM), End of Message (EOM) or Single
Segment Message (SSM)). During the simulation, the DMA required more than 180
cycles to move the large payload of a packet (1472 bytes) over the 64-bit bus (Figure
8.1). Some of the RISC instructions need the local bus while the DMA controller
occupies it, so the RISC has to wait. As Figure 8.1 shows, UDP/IP packets have more
idle cycles than TCP/IP packets, because they require more DMA cycles. The number of
RISC idle cycles is high when the packet size is 512 bytes or larger and is reduced when
the packet size is 256 bytes or less.
 
Figure 8.1: Total RISC idle cycles when the DMA clock is double the RISC clock rate
[Bar chart. Vertical axis: number of idle cycles (0 to 200). Horizontal axis: packet size (1500, 1024, 512, 256, 128 and 64 bytes), grouped into Receiving Side and Sending Side. Series: TCP/IP Single Segment Message (SSM), TCP/IP Beginning Of Message (BOM), TCP/IP Continuation Of Message (COM), TCP/IP End Of Message (EOM), TCP/IP Out-Of-Order, UDP/IP Single Segment Message (SSM), UDP/IP Beginning Of Message (BOM), UDP/IP Continuation Of Message (COM), UDP/IP End Of Message (EOM).]
 
The DMA clock rate was modified within the simulator to enhance the NI's performance.
Each increase in the DMA clock rate reduces the idle cycles and enhances the NI's
performance, until most (if not all) of the idle cycles are removed. During the Behavior
Model analysis, packets as small as 512 bytes required a higher DMA clock rate than
1500-byte packets (Figure 8.2). This is because the budgeted time for the DMA to
complete the moving of data is short, around 42.60 ns. Packets larger than 512 bytes can
use a lower DMA clock rate, as presented in Figure 8.2, because their budgeted time is
longer; for a 1500-byte packet it is 123 ns. A DMA clock rate of 1316 MHz was found
when the packet size is 1500 bytes and 1916 MHz when the packet size is 1024 bytes.
 
Figure 8.2: Desired DMA clock rate for LRO and LSO

Desired DMA clock rate (MHz):

               Sending Side          Receiving Side
Packet size    40 Gbps   100 Gbps    40 Gbps   100 Gbps
1500 bytes     296       740         526       1316
1024 bytes     431       1078        766       1920
512 bytes      846       2115        1504      3759

When the packet size is 512 bytes or larger, the performance of the NI improves. The
core executes 32 instructions to process an Out-of-Order packet, with 0 idle cycles,
when the DMA clock is 3759 MHz (Table 8.1). Even though 8 idle cycles remain for the
SSM, the total number of cycles does not exceed 32. The number of RISC processing
cycles stays the same across packet sizes when there are no idle cycles. This can be seen
when the packet size is 256 bytes or less.
Table 8.1: Total number of RISC instructions to complete the processing of the LRO for
TCP and UDP when the DMA clock is 3759 MHz

Total number of instructions (RISC = RISC instructions, Idle = idle instructions)

                                      1500 bytes  1024 bytes  512 bytes   256 bytes   128 bytes   64 bytes
Packet type                           RISC  Idle  RISC  Idle  RISC  Idle  RISC  Idle  RISC  Idle  RISC  Idle
TCP
  Single Segment Message (SSM)        25    8     22    5     27    10    21    4     17    0     17    0
  Beginning Of Message (BOM)          25    3     22    0     27    5     22    0     22    0     22    0
  Continuation Of Message (COM)       29    0     29    0     31    2     29    0     29    0     29    0
  End Of Message (EOM)                32    2     31    1     31    1     30    0     30    0     30    0
  Out-Of-Order                        32    0     32    0     32    0     32    0     32    0     32    0
UDP
  Single Segment Message (SSM)        25    8     22    5     28    11    21    4     18    1     17    0
  Beginning Of Message (BOM)          25    0     25    0     28    3     25    0     25    0     25    0
  Continuation Of Message (COM)       29    0     29    0     30    1     29    0     29    0     29    0
  End Of Message (EOM)                31    3     28    0     32    4     28    0     28    0     28    0
  Out-Of-Order                        32    0     32    0     32    0     32    0     32    0     32    0
 
To enhance the sending side, increasing the DMA clock rate rather than the RISC clock
rate reduces the idle cycles of the system. An increase in the DMA clock rate at the
sending side reduces the number of idle cycles and improves the processing of packets
inside the PPU. In the timing analysis, a DMA clock rate of 2115 MHz was found to
reduce the payload length path of TCP and UDP packets. The performance of the NI
increased significantly and the RISC idle cycles were reduced (Table 8.2).
Table 8.2: Total number of RISC instructions to complete the processing of the LSO for
TCP and UDP when the DMA clock is 2115 MHz

Total number of instructions (RISC = RISC instructions, Idle = idle instructions)

                                      1500 bytes  1024 bytes  512 bytes   256 bytes   128 bytes   64 bytes
Packet type                           RISC  Idle  RISC  Idle  RISC  Idle  RISC  Idle  RISC  Idle  RISC  Idle
TCP/IP
  Single Segment Message (SSM)        18    9     15    6     20    11    14    6     11    2     10    1
  Beginning Of Message (BOM)          23    4     20    1     24    5     19    0     19    0     19    0
  Continuation Of Message (COM)       14    4     11    1     16    6     10    0     10    0     10    0
  End Of Message (EOM)                15    9     11    5     17    11    11    5     9     3     7     1
UDP/IP
  Single Segment Message (SSM)        8     0     8     0     22    13    14    6     11    3     9     1
  Beginning Of Message (BOM)          23    5     20    2     25    7     18    0     18    0     18    0
  Continuation Of Message (COM)       13    5     11    3     16    8     9     1     8     0     8     0
  End Of Message (EOM)                13    9     11    7     16    12    9     5     6     3     5     1
 
All of the RISC cycles that were necessary for the TCP and UDP packets were
completed. It can also be observed that the network interface receives a greater number
of packets when their size is less than 1500 bytes. The DMA clock rate was recorded
while moving packets of different sizes.
On the sending side, the DMA performs best at 2115 MHz when the packet size is 512
bytes, whereas 740 MHz is sufficient when dealing with a packet size of 1500 bytes. The
DMA clock rate is now fixed at 3759 MHz for the LSO as well, which enhances its
performance. More specifically, the RISC processing time for the LSO is reduced,
because the faster DMA (3759 MHz) eliminates all the idle cycles that were associated
with the RISC processing core.
 
8.3 RISC Clock Rate for 100 Gbps 
After the enhancement of the DMA clock rate, the RISC core is able to achieve TCP and
UDP processing at 40 Gbps and 100 Gbps line rates. The DMA clock rate is fixed at
3759 MHz for sending and receiving processing. The idle cycles associated with sender
and receiver RISC core processing are reduced, because a DMA with a 3759 MHz clock
is able to move a large payload (e.g., 1500 bytes) within a short time.
With the DMA set to 3759 MHz, the simulation shows that a 752 MHz RISC processor
can support the LRO function at a line rate of 100 Gbps when the packet size is 512
bytes (Figure 8.3). A RISC core of 263 MHz is sufficient to process the LRO at 100 Gbps
with a packet size of 1500 bytes. These RISC clock rates were measured when there was
enough buffer space in the LI.
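The receive-side clock rates can be reproduced approximately by dividing the worst-case cycle count by the per-packet time budget. The sketch below is an interpretation of the reported figures and not a formula stated in this thesis: it assumes 32 RISC cycles per packet (the Out-of-Order case in Table 8.1) and a budget computed from the packet size plus 20 bytes of wire overhead. The names `required_clock_mhz` and `WIRE_OVERHEAD_BYTES` are illustrative.

```python
WIRE_OVERHEAD_BYTES = 20  # assumed: 8-byte preamble + 12-byte inter-frame gap


def required_clock_mhz(cycles: int, packet_bytes: int, line_rate_gbps: float) -> float:
    """Clock rate needed to finish `cycles` RISC cycles within one packet time."""
    budget_ns = (packet_bytes + WIRE_OVERHEAD_BYTES) * 8 / line_rate_gbps
    return cycles / budget_ns * 1000.0    # cycles per ns -> MHz


for size in (1500, 1024, 512):
    print(size, round(required_clock_mhz(32, size, 100)))
```

Under these assumptions the sketch gives roughly 263 MHz, 383 MHz and 752 MHz for 1500-byte, 1024-byte and 512-byte packets at 100 Gbps.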
 
Figure 8.3: RISC clock rate for LSO and LRO for UDP/IP when the DMA is 3759 MHz
for the receiving side and 2115 MHz for the sending side

RISC clock rate (MHz):

               Receive side          Send side
Packet size    40 Gbps   100 Gbps    40 Gbps   100 Gbps
1500 byte      105       263         80        148
1024 byte      154       383         87        216
512 byte       301       752         169       423

Having adjusted the DMA clock rate, the sender RISC core requires 23 cycles on the
transmitter side (when a BOM is processed). A RISC clock of 423 MHz on the
transmitter side can therefore be used for the LSO of TCP/IP and UDP/IP packets at
line rates of up to 100 Gbps.

8.4 Results
It is important for the proposed PPU model to be able to complete the processing of
packets within the budgeted time frame for 100 Gbps, even if the MTU is 512 bytes or
larger, which can carry more application data and reduce protocol processing overheads
[27, 28, 50, 51]. The PPU can still receive smaller packets such as SSM or signaling
packets.
The core processor must complete the processing of the incoming packets from the
MAC unit in time to avoid overcrowding and the dropping of packets. This is particularly
the case if the MAC unit is designed without a packet buffer (e.g., the Elastic buffer) or
if the buffer at the MAC unit for storing packets temporarily is too small. Table 8.3
presents the PPU processing times for Signaling, SSM, COM and EOM packets with the
RISC clock rate set to 752 MHz (when the MTU is 512 bytes or more).
When the SSM is classified by the embedded RISC, it is sent ‘as is’ to its destination
buffer (e.g., the RB). No linked-list processing is required at the receiving side and no
header is generated at the sending side. For a 1500-byte SSM, the processing time is 32
ns, whereas the total budgeted time frame for 100 Gbps is 123 ns. This makes the PPU
efficient, as there is no delay associated with 1500-byte packets. There is also no delay
in processing 1024-byte and 512-byte packets. This means that the local bus is free and
that the core is capable of processing the next packet or reading FIFO 3. However, there
is a delay when the SSM packets become smaller than 512 bytes, as the processing time
becomes greater than the budgeted time for 100 Gbps (the positive time differences in
Table 8.3, highlighted in yellow). This is because the RISC requires 21 cycles (including
four idle cycles) to complete the processing of a 256-byte packet. The delay becomes
larger as the packets become smaller. Yet, the time saved with packets that are 512 bytes
or larger is long enough to compensate for the delays that occur with the smaller
packets.
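The compensation between large and small packets can be illustrated with a simple backlog calculation. This is a minimal sketch, assuming that packets waiting for the core are held in a buffer and that the backlog grows by (processing time - budget) for each packet and can never fall below zero. The SSM times are those reported in Table 8.3; the names `SSM_TIMES` and `worst_backlog_ns` are illustrative.

```python
# (budget_ns, processing_ns) for SSM packets, as reported in Table 8.3
SSM_TIMES = {
    1500: (123.00, 32.0),
    1024: (83.50, 28.1),
    512: (42.60, 34.5),
    256: (22.10, 27.52),
    128: (11.80, 21.76),
    64: (6.72, 21.76),
}


def worst_backlog_ns(sizes):
    """Largest backlog (ns) reached while processing a sequence of SSM sizes."""
    backlog = 0.0
    worst = 0.0
    for size in sizes:
        budget, processing = SSM_TIMES[size]
        # The backlog grows when processing exceeds the budget and drains otherwise.
        backlog = max(0.0, backlog + processing - budget)
        worst = max(worst, backlog)
    return worst


print(worst_backlog_ns([64, 64, 1500]))
```

In this example two back-to-back 64-byte SSMs leave the PPU about 30 ns behind, and a following 1500-byte SSM, which finishes 91 ns ahead of its budget, clears that backlog entirely.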
 
Table 8.3: Packet processing at the receiving side when the RISC clock is 752 MHz and
the DMA is 3759 MHz

Packet type                Packet size   Total budget time required   TCP processing time   Time difference
                           (bytes)       for 100 Gbps (ns)            at PPU (ns)           (ns)
Single Segment Message     1500          123.00                       32                    -91
                           1024          83.50                        28.1                  -55.4
                           512           42.60                        34.5                  -8.1
                           256           22.10                        27.52                 5.42
                           128           11.80                        21.76                 9.96
                           64            6.72                         21.76                 15.04
Signaling                  64            6.72                         12.8                  6.08
Continuation of Message    1500          123.00                       37.12                 -85.88
                           1024          83.50                        37.12                 -46.38
                           512           42.60                        39.68                 -2.92
End of Message             1500          123.00                       39.68                 -83.32
                           1024          83.50                        35.84                 -47.66
                           512           42.60                        39.68                 -2.92
                           256           22.10                        35.84                 13.74
                           128           11.80                        35.84                 24.04
                           64            6.72                         23.04                 16.34
 
Small packets such as signaling packets always arrive either before or after the large
packets and require less processing than a 64-byte SSM. Signaling packets can be
classified after reading the packet size and flags, whereas the SSM requires further
processing, such as reading the protocol type and checking the CAM. The total time that
a signaling packet spends at the PPU is 12.8 ns; the delay time is only 6.08 ns.
The core completes an End-of-Message packet in a different manner from the SSM.
Large packets, such as those over 512 bytes, are processed within the budgeted time
frame inside the PPU, whereas a small EOM spends more time inside the PPU than
budgeted. The budgeted time frame for 64 bytes, for instance, is approximately 7 ns. The
packet spends 24.04 ns, with a delay time of 16.34 ns. This delay requires a larger buffer
(e.g., the Elastic buffer or the RBI). However, the EOM rarely occurs with the smaller
packets.
On the sending side, when the RISC runs at 752 MHz and the DMA clock rate is 3759
MHz, packets of 256 bytes or larger are finished on time and there is no delay (Table
8.4).
Table 8.4: LSO packet processing time when the RISC clock is 752 MHz and the DMA
is 3759 MHz

Packet type                Packet size   Total budget time required   TCP processing time   Time difference
                           (bytes)       for 100 Gbps (ns)            at PPU (ns)           (ns)
Single Segment Message     1500          123.00                       23.04                 -99.96
                           1024          83.50                        19.2                  -64.3
                           512           42.60                        23.04                 -19.56
                           256           22.10                        17.92                 -4.18
                           128           11.80                        14.08                 2.28
                           64            6.72                         12.8                  6.08
Signaling                  64            6.72                         6.4                   -0.32
Continuation of Message    1500          123.00                       17.92                 -105.08
                           1024          83.50                        14.08                 -69.42
                           512           42.60                        20.48                 -22.12
End of Message             1500          123.00                       19.2                  -103.8
                           1024          83.50                        14.08                 -69.42
                           512           42.60                        21.76                 -20.8
                           256           22.10                        14.08                 -8.02
                           128           11.80                        11.52                 -0.28
                           64            6.72                         8.96                  2.26
 
The sending side processing requires fewer cycles than the receiving side. A RISC at
752 MHz therefore gives the sending side additional headroom, since a RISC at 423
MHz is already able to complete TCP and UDP segmentation processing.
With an MTU of 256 bytes, the desired clock rate for the RISC is 1449 MHz. Table 8.5
presents the results of the processing carried out within the PPU when the RISC clock
rate is 1449 MHz. Packet processing improves: the PPU handles packets of different
sizes without any significant delay, especially the signaling packets.
 
Table 8.5: LRO packet processing time when the RISC clock is 1449 MHz and the
DMA is 3759 MHz

Packet type                    Packet size   Total budget time required   TCP processing time   Time difference
                               (bytes)       for 100 Gbps (ns)            at PPU (ns)           (ns)
Single Segment Message (SSM)   1024          83.50                        14.025                -69.475
                               512           42.60                        17.2125               -25.3875
                               256           22.10                        13.3875               -8.7125
                               128           11.80                        10.8375               -0.9625
                               64            6.72                         10.8375               4.1175
Signaling                      64            6.72                         5.1                   -1.62
Continuation of Message        1500          123.00                       18.4875               -104.5125
                               1024          83.50                        18.4875               -65.0125
                               512           42.60                        19.7625               -22.8375
End of Message                 1500          123.00                       19.7625               -103.2375
                               1024          83.50                        17.85                 -65.65
                               512           42.60                        19.7625               -22.8
                               256           22.10                        17.85                 -4.25
                               128           11.80                        17.85                 6.05
                               64            6.72                         10.8375               4.1375
 
When the core is running at 1449 MHz on the sending side, the packet processing is
completed faster than the budgeted time required for 100 Gbps (Table 8.6). The core can
support packets smaller than 256 bytes.
 
Table 8.6: LSO packet processing time when the RISC clock is 1449 MHz and the
DMA is 3759 MHz

Packet type                Packet size   Total budget time required   TCP processing time   Time difference
                           (bytes)       for 100 Gbps (ns)            at PPU (ns)           (ns)
Single Segment Message     1500          123.00                       11.475                -111.525
                           1024          83.50                        9.5625                -73.9375
                           512           42.60                        11.475                -31.125
                           256           22.10                        8.925                 -13.175
                           128           11.80                        7.0125                -4.7875
                           64            6.72                         6.375                 -0.345
Signaling                  64            6.72                         3.1875                -3.5325
Continuation of Message    1500          123.00                       8.925                 -114.075
                           1024          83.50                        7.0125                -76.4875
                           512           42.60                        10.2                  -32.4
End of Message             1500          123.00                       9.5625                -113.4375
                           1024          83.50                        7.0125                -76.4875
                           512           42.60                        10.8375               -31.8
                           256           22.10                        7.0125                -15.0875
                           128           11.80                        5.7375                -6.0625
                           64            6.72                         4.4625                -2.2375
 
8.5 The DMA and RISC Clock Rate for 100 Gbps
In this thesis, I have presented computer simulation results to measure the amount of
processing required for LRO and LSO functions for the TCP and UDP protocols. The behavior
model simulation results have shown that a cost-effective embedded RISC core can
provide the efficiency required of the network interface to support a wide range of transmission
line speeds of up to 100 Gbps (Table 8.5). A 752 MHz RISC core can support the receiver side
processing for transmission speeds of up to 100 Gbps for TCP/IP and UDP/IP when the packet
size is 512 bytes, with a DMA clock rate of 3759 MHz being required to enhance the network
performance and reduce the RISC idle cycles. The required DMA clock rate falls significantly
when the NI's bus is wider than 64 bits.
The proposed model can support small packets down to 256 bytes for receiving when the
RISC clock is 1449 MHz. With 1449 MHz for sending, the core is able to support packets down
to 64 bytes.
 
Table 8.7: The RISC and DMA clock rate supporting LRO and LSO for TCP and UDP
at 100 Gbps

Line rate   RISC (MHz)   DMA (MHz)   Receive side                          Send side
100 Gbps    752          3759        The PPU supports packet sizes         Can support packet sizes
                                     down to 512 bytes                     down to 256 bytes
100 Gbps    1449         3759        The PPU supports packet sizes         The PPU supports small packet
                                     down to 256 bytes                     sizes down to 64 bytes
 
8.6 Conclusion  
This chapter has presented computer simulation results for the proposed Packet
Processing Unit. It has been demonstrated that specialized RISC cores are able to cope
with 100 Gbps speeds while performing the Large Receive Offload (LRO) and the
Large Send Offload (LSO) for TCP/IP and UDP/IP. This was achieved after enhancing
the packet processing at the Packet Processing Unit. The key aspect is to reduce the
total number of RISC cycles and to increase the DMA clock rate for incoming and
outgoing packets. When the DMA runs at 3759 MHz for the LRO and 2115 MHz for the
LSO, the idle cycles of the system are reduced. The required DMA clock rate decreases
significantly if the NI's bus becomes wider than 64 bits (e.g., 320 bits [10]).
As a result, a 263 MHz RISC core supports the receiver side processing for lines with a
transmission speed of up to 100 Gbps, for TCP/IP and UDP/IP, when the packet size is
1500 bytes. A core running at 752 MHz is needed to support the PPU at a line speed of
100 Gbps when the packet size is 512 bytes; this is achieved by running the DMA at
3759 MHz. However, there is still a delay when processing End-of-Message packets,
especially at the receiving side. A RISC core running at 1449 MHz can be used for LRO
and LSO to reduce the delay time and can be considered an appropriate choice for 100
Gbps. Furthermore, a single specialized core can achieve packet processing at a line rate
of 100 Gbps, which reduces the complexity and cost of designing the Network Interface
compared to using a series of processors such as the IXP1200.
 
Chapter 9 
Conclusion and Future Work 
 
9.1 Summary of Contributions 
The speed of networks now exceeds 10 Gigabit per second (Gbps). The performance for 
processing protocols such as TCP/IP and UDP/IP is also required to be improved so that it may 
support future speeds of 40-100 Gbps. Offloading part of the protocol processing to the Network 
Interface (NI) can provide some performance enhancing features, making protocol processing 
more efficient.  
By enhancing the protocol processing of the incoming and outgoing packets, performance issues 
and the occurrence of a bottleneck at the end node can be avoided. In this thesis, two software 
solutions and hardware have been implemented in order to enhance protocol processing for the 
receiving and the sending side of the designed Packet Processing Unit (PPU) for 100 Gbps. On 
the receiving side, a novel offload algorithm was developed for Large Receive Offload (LRO) 
functions for the TCP and UDP protocols. The LRO algorithm links the arrived packets that 
belong to the same TCP or UDP streams to form large packets inside the NI’s buffer. These large 
packets are then sent from the NI to the protocol stack for further processing. The proposed LRO 
function works similarly to the Jumbo Frame (9000 bytes) mechanism. This means that no 
changes to the protocol stack are required and all Operating Systems can support the proposed 
LRO. 
 
The proposed LRO reduces the amount of per-packet processing performed at the end node. 
Based on this research, it is anticipated that the proposed LRO also supports out-of-order 
processing by managing the out-of-order packets while linking them in the NI’s buffer.  
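The linking of packets with out-of-order handling can be sketched as follows. This is a toy model and not the linked-list routine implemented in the PPU. The class `LroBuffer`, its methods and the flow key are hypothetical, `seq` stands for a byte position in the stream (a TCP sequence number or, as an assumption, a UDP fragment offset), and duplicate segments are not handled.

```python
from collections import defaultdict


class LroBuffer:
    """Toy model: per-flow lists of (seq, payload) kept in sequence order."""

    def __init__(self):
        self.flows = defaultdict(list)

    def add(self, flow, seq: int, payload: bytes) -> None:
        """Link a packet into its flow, placing out-of-order arrivals correctly."""
        segments = self.flows[flow]
        # Walk back from the tail so that in-order arrivals need one comparison.
        i = len(segments)
        while i > 0 and segments[i - 1][0] > seq:
            i -= 1
        segments.insert(i, (seq, payload))

    def flush(self, flow) -> bytes:
        """Return the contiguous run at the head of the flow as one large payload."""
        segments = self.flows[flow]
        if not segments:
            return b""
        out = bytearray()
        expected = segments[0][0]
        taken = 0
        for seq, payload in segments:
            if seq != expected:
                break                     # gap: a packet is still missing
            out += payload
            expected += len(payload)
            taken += 1
        del segments[:taken]
        return bytes(out)


buf = LroBuffer()
flow = ("10.0.0.1", 5000, "10.0.0.2", 80)   # hypothetical flow key
buf.add(flow, 0, b"aaaa")
buf.add(flow, 8, b"cccc")                   # arrives before the packet at offset 4
buf.add(flow, 4, b"bbbb")
print(buf.flush(flow))
```

The out-of-order packet is linked into its correct position, so the host receives a single aggregated payload in stream order.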
Another enhancement for protocol processing at the proposed PPU is applied to the sending side 
(Large Send Offload (LSO)). A novel algorithm has been implemented for LSO for segmenting 
the packets that are larger than the Maximum Transmission Unit (MTU) into Single Segment 
Messages (SSM) (i.e. 1460 bytes). The packet headers are then generated for each SSM. 
A scalable Packet Processing Unit (PPU) that supports the LRO and LSO functions using a
programmable NI has been implemented, including RISC cores, Content Addressable
Memory (CAM) and DMA. In this thesis, a novel high-performance 32-bit RISC-based PPU was
proposed, replacing multiple processing cores, which occupy a large area of the NI and add
complexity to the NI design. The three-stage pipelined RISC core, with an instruction set of 11
instructions and a 64-entry register file, delivers some of the important features for protocol
processing, such as scalability to support different network protocols like TCP and UDP,
simplicity of the data path and a shorter development cycle.
In order to support high-speed rates, several approaches have been applied to the processing flow
of the LRO and LSO. The program flow for the LRO and the LSO avoids read-after-write and
write-back instructions. In addition, attention has been paid to data hazards in the assembly
program, using branch delay techniques. These features reduce the pipeline processing and
enhance the flow of the program during the conditional branches, which occur while processing
a Beginning of Message, Continuation of Message, End of Message or Single Segment Message
for TCP/IP and UDP/IP packets.
 
Another approach described in this thesis is the Pipeline Processing technique, an enhancement
to the packet processing flow in the PPU for 100 Gbps. Pipeline Processing works by executing a
number of instructions that do not use the local bus during the movement of data. This
technique removes more than 85 percent of the RISC idle cycles incurred while processing
the LRO and LSO functions, especially when the packet size is 512 bytes or larger.
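The effect of overlapping bus-free instructions with a DMA transfer can be expressed with a short calculation. This is a simplified sketch that assumes every overlapped instruction hides exactly one cycle of the transfer; the function names and the numbers in the example are hypothetical and are not measurements from the simulation.

```python
def idle_cycles(dma_cycles: int, bus_free_instructions: int) -> int:
    """Idle RISC cycles left after overlapping bus-free work with a DMA transfer."""
    return max(0, dma_cycles - bus_free_instructions)


def reduction_percent(dma_cycles: int, bus_free_instructions: int) -> float:
    """Share of the original idle cycles removed by the overlap."""
    if dma_cycles == 0:
        return 0.0
    saved = dma_cycles - idle_cycles(dma_cycles, bus_free_instructions)
    return 100.0 * saved / dma_cycles


# Hypothetical numbers, chosen only to illustrate the calculation.
print(idle_cycles(20, 17))
print(reduction_percent(20, 17))
```

When the RISC has more bus-free work than the transfer lasts, the idle count is zero, which corresponds to the (0) entries in the idle columns of the tables.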
To complete the packet flow processing efficiently at the PPU and to assist the single RISC core, 
supporting units (including a CAM, acting as a lookup table to support the receiver core), were 
designed and added to PPU. For the movement of data from the Line Interface (LI) to the Host 
Interface (HI) or from HI to LI, a simple DMA controller was designed. 
A SPIM simulator based on an R2000/R3000 RISC core has been used for testing the LRO and
LSO functions. The results of these tests indicated that a 917 MHz RISC core can support PPU
processing for transmission speeds of up to 100 Gbps for TCP/IP and UDP/IP, excluding data
movement cycles. The instructions used to complete the processing of the LRO and LSO
functions are memory-to-register, register-to-memory, arithmetic and logic instructions.
Floating-point registers were confirmed as not required during the processing of the proposed
LRO and LSO functions.
The PPU has been designed in VHDL, including the RISC core, CAM and DMA, and simulated
with the Behaviour Model within the Xilinx environment. A computer simulation measuring the
payload path length for LRO and LSO processing required by the network interface for both TCP
and UDP was presented. A DMA clock rate of 3759 MHz was used with the proposed model in
order to reduce the packet processing time.
 
The simulation results have also shown that a high-performance, low-power, 32-bit RISC core
running at 752 MHz can support the LRO and LSO processing for transmission speeds of up to
100 Gbps when the Maximum Transmission Unit (MTU) is 512 bytes. When the MTU is 256
bytes, a 1449 MHz core can be used for receiving TCP and UDP packets over a 100 Gbps line. A
1449 MHz core is capable of supporting 64-byte TCP and UDP packets on the sending side.
These results are based on the use of a specialised RISC core that was developed and simulated
for the TCP/IP and UDP/IP protocols.
 
9.2 Future Work  
Future work following this thesis includes investigating support for other protocols. Such
support will require modification of the protocol functions, rather than the architecture of the
network interface. In addition, the work can be extended to processing the upper layers of the
TCP protocol, identified as the Acknowledgment Reply. Such support would be useful because it
provides direct network-to-device communication with minimal interference from the host
processor. The RISC processing core clock rates needed to support other protocols can also be
investigated in the future. This research also opens the way to improving the common methods
currently applied to accelerate protocol processing, by combining Zero-Copy, RDMA or Direct
Cache Access (DCA) with the proposed Large Receive Offload approach. DCA and Zero-Copy would
enhance protocol processing if they were integrated with the proposed LRO. Data copying can be
reduced while processing the large packets delivered by the proposed LRO. For DCA, fewer
headers would be copied into the host CPU’s cache, instead of a large number of small TCP/IP
or UDP/IP headers. In addition, large packets that carry large amounts of application data
would need fewer system calls for shared memory, because each call serves one large packet
instead of several small packets. A large amount of application data would move from user
space to application space for each memory call. This would enhance the protocol processing
required at the end node in the future.
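
The header reduction described above can be illustrated with a small calculation. The sketch below counts how many TCP/IP packets, and therefore headers, are needed to carry a given amount of application data at a given MTU, compared with a single aggregated packet. It assumes 20-byte IPv4 and 20-byte TCP headers without options, and it treats full aggregation into one packet as the ideal case; neither assumption is taken from the measurements in this thesis, and the function names are illustrative.

```python
import math

IP_HEADER_BYTES = 20   # IPv4 header without options (assumption)
TCP_HEADER_BYTES = 20  # TCP header without options (assumption)


def segments_needed(app_bytes: int, mtu: int) -> int:
    """Number of TCP/IP packets needed to carry app_bytes at the given MTU."""
    payload_per_packet = mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES
    return math.ceil(app_bytes / payload_per_packet)


def headers_saved(app_bytes: int, mtu: int) -> int:
    """Headers the host avoids if all packets are aggregated into one (ideal case)."""
    return segments_needed(app_bytes, mtu) - 1


if __name__ == "__main__":
    # 23000 bytes is the TCP flow size used in Appendix A.
    for mtu in (1500, 512):
        print(mtu, segments_needed(23000, mtu), headers_saved(23000, mtu))
```

For example, 23000 bytes of TCP application data need 16 packets at an MTU of 1500 bytes and 49 packets at an MTU of 512 bytes, so ideal aggregation would spare the host 15 and 48 headers respectively.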
 
 
References  
 
 
[1] IEEE Standard  “IEEE P802.3ba 40Gb/s and 100Gb/s Ethernet Task Force,” June 
2010. 
[2] D. Perry. VHDL third edition. Published by McGraw-Hill Companies, 1998. 
[3] S. Makineni and R. Iyer, “Performance characterization of TCP/IP packet
processing in commercial server workload,” Proc. 6th IEEE Workshop on Workload
Characterization, IEEE Press, 2003, pp. 33-41.
[4] G. Gregory, S. Hotz and R. Meter. “The Impact of a Zero-Scan Internet
Checksumming Mechanism,” ACM SIGCOMM Computer Communication Review,
1994.
[5] K. Kant, “TCP offload performance for front-end servers,” Proc. IEEE Global 
Telecommunications Conference (GLOBECOM 03), IEEE Press, 2003, pp. 3242-
3247.  
[6] G. Regnier et al. “TCP onloading for data center servers,” IEEE Computer
journal, pp. 48-58, Nov. 2004.
[7] Microsoft. “Receive-side scaling,”
http://msdn.microsoft.com/en-us/library/windows/hardware/ff556942(v=vs.85).aspx,
May 2012 [June 2011].
[8] P. Shivam, P. Wyckoff and D. Panda. “Zero-copy OS-bypass NIC-driven gigabit 
Ethernet message passing,” Proceedings of the 2001 ACM/IEEE conference on 
Supercomputing, November 2001. 
[9] K. Kleinpaste, P. Steenkiste, and B. Zill. “Software support for outboard buffering 
and checksumming,” In Proceedings of the ACM SIGCOMM ’95 Conference on 
Applications, Technologies, Architectures, and Protocols for Computer 
Communication, 1995, pp. 87–98. 
[10] O. Suzumura, M. Tatsubori, S. Trent, A. Tozawa and  T. Onodera. “Highly scalable 
web applications with Zero-copy data transfer,” Proceedings of the 18th 
international conference on World Wide Web, 2009, pp. 921-929. 
[11] M. Thadani and Y. Khalidi. “An efficient Zero-copy I/O Framework for Unix,”
tech. report SMLI TR-95-39, Sun Microsystems Laboratories, May 1995.
[12] D. Stancevic.  “Zero copy I: user-mode perspective,” Linux Journal, Vol 3, Issue 
105, 2003. 
[13] A. Earls. “TCP offload engines finally arrive,” Storage Magazine, March 2002. 
[14] J. Mogul. “TCP offload is a dumb idea whose time has come,” Proc. 9th Workshop 
on Hot Topics in Operating Systems (HotOS IX), Usenix Assoc., 2003.   
[15] P. Willmann, H. Kim, S. Rixner and V. Pai. “An efficient programmable 10 gigabit 
Ethernet network interface card,” Proceedings of the 11th Int’l Symposium on 
High-Performance Computer Architecture November 2005. 
[16] H. Kim. “Improving networking server performance with programmable network 
interfaces,” RICE University, Texas, United States.  Master’s thesis, April, 2003. 
[17] H. Kim, V. S. Pai, and S. Rixner. “Improving web server throughput with network
interface data caching,” Proceedings of the Tenth International Conference on
Architectural Support for Programming Languages and Operating Systems, Oct.
2002, pp. 239–250.
[18] Intel, “Intel® X38 Express Chipset,” 
www.intel.com/Products/Desktop/Chipsets/X38/X38-overview.htm., 2009 [Jan. 
2011].  
[19] T. Henriksson. “Intra-packet data-flow protocol processing,” PhD Dissertation, 
Linkoping university, 2003. 
[20] S. Chu and Y. Bai. “Packet buffer management for a high-speed network interface 
card,”  Computer Communications and Networks, 2007. ICCCN 2007. Proceedings 
of 16th International Conference on 13-16 Aug. 2007, pp.191–196.  
[21] O. Elkeelany. “On chip novel video streaming system for bi-network multicasting 
protocols,” Integration, the VLSI Journal, vol.42 n.3, pp.356-366, June, 2009.  
[22] Alacritech SLIC. “A Data Path TCP Offload Methodology,”   
http://www.alacritech.com/html/techreview.html [Jan. 2010].  
[23] Y. Hoskote et al. “A TCP offload accelerator for 10 Gb/s Ethernet in 90-nm 
CMOS,” IEEE Journal of Solid-State Circuits, 38(11):1866–1875, Nov. 2003. 
[24] M. Rangarajan et al. “TCP servers: offloading TCP/IP processing in internet 
servers. Design, implementation, and performance,” Rutgers University, 
Department of Computer Science Technical Report, DCS-TR-481, March 2002. 
[25] A. Foong, T. Huff, H. Hum, J. Patwardhan, and G. Regnier. “TCP Performance
re-visited,” in Proc. IEEE Int. Symp. Performance Analysis of Systems and Software,
pp. 70–79, March 2003.
[26] D. Dittia, G. M. Parulkar, J. Jerome and R. Cox. “The APIC approach to high 
performance network interface design: protected DMA and other techniques,” in 
Proceedings of INFOCOM ’97, IEEE, April 1997. 
[27] D. Clark, V. Jacobson, J. Romkey, and H. Salwen. “An analysis of TCP processing 
overhead,” IEEE Communications Magazine, 27(6), June 1989. 
[28] S. Makineni and R. Iyer, “Measurement-based analysis of TCP/IP processing 
requirements,” In 10th International Conference on High Performance Computing 
(HiPC 2003), Hyderabad, India, Dec. 2003. 
[29] T. Mohsenin, “Design and evaluation of FPGA-based gigabit- Ethernet/PCI 
network interface card,” Rice University, Texas, United States, Master’s thesis, 
April 2004.  
[30] Cisco®, “100 Gigabit Solution,”
http://www.cisco.com/en/US/netsol/ns581/networking_solutions_solution.html,
updated March 2013 [April 2013].
[31] Intel. “Interrupt Moderation Using Intel® GbE Controllers,”  
download.intel.com/design/network/applnots/ap450.pdf, 2007 [Jan. 2010]. 
[32]  N. Tredennick, and B. Shimamoto. “Go reconfigure,” Special report in IEEE 
Spectrum, Dec. 2003, pp. 36-41. 
[33] J. B. Postel, “Transmission Control Protocol,” NIC- RFC 793, Information Sciences 
Institute, Sept. 1981. 
[34] Intel Inc, “Small packet traffic performance optimization for 8255x and 8254x 
Ethernet controllers”, Application Note. 2003 (AP-453). 
[35] N. L. Binkert, A. G. Saidi and S. K. Reinhardt. "Integrated network interfaces for 
high-bandwidth TCP/IP," ASPLOS, 2006.   
[36] P. Govindarajan et al. “Achieving 10Gbps network processing: Are we there yet? ” 
High Performance Computing – HiPC. 2008 pp 518-528.  
[37] S. Makineni  et al. “Receive side coalescing for accelerating TCP/IP processing,”  
HiPC 2006. LNCS 2006, pp. 289-300. 
[38] M. Norris. “Gigabit Ethernet,” Technology and application. Artech House, 2003. 
[39] H. Kim, V. S. Pai and S. Rixner. “Exploiting task-level concurrency in a 
programmable network interface,” Proceedings of the ninth ACM SIGPLAN 
symposium on Principles and practice of parallel programming, 2003 pp 61-72. 
[40] W. Bux, W. E. Denzel, T. Engbersen, A. Herkersdorf, and R. P. Luijten. 
“Technologies and building blocks for fast packet forwarding,” IEEE 
Communications Magazine, Vol. 31, No. 1, pp. 70-77, Jan 2001. 
[41] M. Attia and I. Verbauwhede. “Programmable gigabit Ethernet packet processor 
design methodology,” European Conference on Circuit Theory and Design, August 
28-31, 2001.  
[42] D.  Schuff and S. Pai. “Design alternatives for a high-performance self-securing 
Ethernet network interface,” Parallel and Distributed Processing Symposium,  
IEEE International, 2007.  
[43] X. Yang, D. Wu and N. Sun. “Design of NIC Based on I/O Processor for Cluster 
Interconnect Network,” Networking, Architecture, and Storage, pp 3-8, July 2007.  
[44] D. Patterson and J. Hennessy. Computer Organization and Design: The
Hardware/Software Interface, 4th ed. Morgan Kaufmann, 2009.
[45] P. Shivam and J. S. Chase, “On the elusive benefits of protocol offload,” In ACM 
SigComm Workshop on Network-IO Convergence (NICELI), August 2003.  
[46] R. Westrelin, N. Fugier, E. Nordmark, K. Kunze, and E. Lemoine, “Studying
network protocol offload with emulation: Approach and preliminary results,” In
12th Annual IEEE Symposium on High Performance Interconnects, Stanford, Aug
2004.
[47] J. Mogul and S. Deering, “Path MTU discovery,” RFC 1191, Nov. 1990.
[48] IBM. “TCP and UDP performance- Large Send Offload, tuning,”
http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.prftungd%2Fdoc%2Fprftungd%2Ftcp_large_send_offload.htm
[May 2010].
[49] A. Ortiz, J. Ortega, A. F. Díaz and A. Prieto “Comparison of onloading and 
offloading strategies to improve network interfaces,” 16th Euromicro Conference 
on Parallel, Distributed and Network-Based Processing . pp. 253-260, 2008. 
[50] H.  Jin and C. Yoo. “Impact of protocol overheads on network throughput over 
high-speed interconnects: Measurement, Analysis, and Improvement,” Journal of 
Supercomputing, Vol 41, Number 1,  July, 2007 
[51] L. Grossman. “Large Receive Offload implementation in Neterion 10GbE Ethernet 
driver,” In Ottawa Linux Symposium (OLS), 2005. 
[52] A. Menon and W. Zwaenepoel. “Optimizing TCP receive performance,” In 
USENIX Annual Technical Conference, June 2008. 
[53] N. L. Binkert et al. “Performance analysis of system overheads in TCP/IP 
workloads,” Parallel Architectures and Compilation Techniques, 2005.  
[54] C. Hermsmeyer et al. “Towards 100G packet processing: Challenges and
technologies,” Bell Labs Technical Journal, Vol 14(2), pp. 57–80, 2009.
[55] V. Paxson. “End-to-end Internet packet dynamics,” IEEE/ACM Trans. Networking, 
vol. 7, pp. 277–292, June 1999. 
[56] K. Kumar, J. Renato, Y. Turner and L. Alan  “Achieving 10 Gb/s using safe and 
transparent network interface virtualization,” Proceedings of the 2009 ACM 
SIGPLAN/SIGOPS, international Conference on Virtual Execution Environments, 
March 2009. 
[57] D. Field, D. Johnson, D. Mize and R. Stober. “Scheduling to overcome the
multi-core memory bandwidth Bottleneck,” Hewlett-Packard Development Company,
http://www.hpcadvisorycouncil.com/pdf/vendor_content/Platform_Scheduling%20to%20Overcome%20the%20Multi-Core%20Memory%20Bandwidth%20Bottleneck%20WP.pdf,
2007 [May 2012].
[58] W. Feng, et al., “Optimizing 10-gigabit Ethernet for networks of workstations, 
clusters and grids: A Case Study,” in Supercomputing Conference 2003, Phoenix, 
Nov. 2003. 
[59] S. Davie, “The Architecture and implementation of a high-speed host interface,” 
IEEE Journal on Selected Area in Communication, Vol. 11 No. 2, February 1993.  
[60] J. Postel, “User Datagram Protocol,” Internet Engineering Task Force, RFC
768, 1980.
[61] A MIPS32 Simulator: http://pages.cs.wisc.edu/~larus/spim.html, 2001[Jan. 2010]. 
[62] DARPA Internet Program. “Internet Protocol,” RFC 791, 1981.
[63] J. Heffner, M. Mathis, and B. Chandler. “IPv4 Reassembly Errors at High Data
Rates,” RFC 4963, 2007.
[64] F. Richard, P. Hosbson, “Onload parallel embedded-processor architecture for 
ATM reassembly" IEEE/ACM Trans. On Networking, Vol. 7, No 1 February 1999. 
[65] Yi-Mao et al. “High Speed UDP/IP ASIC design,” International Symposium on 
Intelligent Signal Processing and Communication Systems, Dec. 2009. 
[66] Xilinx. “Xilinx design reuse methodology for ASIC and FPGA designers,”
http://www.fpga.com.cn/advance/skill/Design_Reuse_Methodology.pdf, 2010 [Jan.
2012].
[67] D. David, “IP Datagram reassembly algorithms,” RFC 815, July 1982.
[68] O. Cardona and J. B. Cunningham. “System load based dynamic segmentation for
network interface card,” U. S. Patent 0295098 A1. 2008.
[69] K.  Lahey. “TCP problems with path MTU discovery,” RFC 2923. Sep. 2000.   
[70] P. Willmann, S. Rixner, and L. Cox. “An evaluation of network stack 
parallelization strategies in modern Operating Systems,” In Proceedings of the 2006 
Annual USENIX Technical Conference, 2006.  
[71] G. Held. Ethernet Networks: Design, Implementation, Operation and Management,
4th ed. John Wiley publisher LTD, 2003.
[72] A. A. Bare, "Measurement and analysis of packet reordering using reorder density, 
” Master’s Thesis, Department of Computer Science, Colorado State University, 
Fort Collins, Colorado, Fall 2004. 
[73] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose and D. Towsley, “Measurement and 
classification of out-of-sequence packets in Tier-1 IP Backbone,"  Proc. IEEE 
INFOCOM, Mar.  2003. 
[74] M. Allman and A. Falk, “On the effective evaluation of TCP,” ACM SIGCOMM 
Comput. vol 29, pp 59-70, Oct 1999. 
[75] R. Morris, E. Kohler, J. Jannotti, and M. Kaashoek, “The click modular router,” 
ACM Transactions on Computer  Systems, vol. 8, no. 3, Aug. 2000, pp. 263-297. 
[76] K. Salah, K. El-Badawi, and F. Haidari, “Performance analysis and comparison of
interrupt-handling schemes in gigabit networks,” Computer Communications, vol.
30, no. 17, pp. 3425–3441, 2007.
[77] J. Mermet “VHDL for Simulation, Synthesis, and Formal Proofs of Hardware,” 
Kluwer Academic Publishers. Published: Dec. 2009.  
[78] Xilinx ISE 13.1 release notes: 
http://www.xilinx.com/support/documentation/dt_ise13-1.htm, 2011 [Jan. 2012]. 
[79] K. Pagiamtzis and A. Sheikholeslami, “Content Addressable Memory (CAM)
circuits and architectures: a tutorial and survey,” IEEE Journal of Solid-State
Circuits, vol. 41, 2006, pp. 712–727.
[80] Altera. “40- and 100-Gbps Ethernet MAC and PHY,”    
http://www.altera.com/literature/ug/ug_40_100gbe.pdf, 2012 [Dec, 2012].  
[81] M. Mathis and J. Mahdavi. “TCP Selective acknowledgment options,” RFC 2018, 
Oct. 1996. 
[82] LAN/MAN Standards Committee. “Carrier sense multiple access with Collision 
Detection (CSMA/CD) Access Method and Physical Layer Specifications,” 
IEEE 802.3 standard, Dec. 2008. 
[83] Xilinx, “Virtex-6 FPGA embedded Tri-Mode Ethernet MAC,” 
http://www.xilinx.com/support/documentation/user_guides/ug368.pdf, 2011 [Dec. 
2011]. 
[84] S. Dharmapurikar and V. Paxson. “Robust TCP stream reassembly in presence of 
adversaries,” In USENIX Security Symposium, Aug. 2005. 
[85] R. Huggahalli, and S. Tetrick. “Direct Cache Access for high-bandwidth network 
I/O,” In International Symposium on Computer Architecture (ISCA), 2005. 
[86] V. Pedroni . “Circuit design and simulation with VHDL,” Publisher: Cambridge 
2nd ed, Mass. 2010 
[87] S. P. Dandamudi  “Fundamentals of Computer Organization and Design.” Publisher 
Springer 2003 
[88] M. Przybylski,  B. Belter and  A. Binczewski, “Shall we worry about packet 
reordering?” Computational methods in science and technology. Vol.11 (2), 
pp.141-146, 2005. 
[89] P. Lekkas. “Network processors architectures, protocols, and platforms,” Publisher: 
McGraw-Hill, 2003   
[90] IEEE Standard, “Local and metropolitan area networks 802.1D,”
http://www.dcs.gla.ac.uk/~lewis/teaching/802.1D-2004.pdf, 2004 [Jan. 2011].
[91] NetLogic MicroSystems Application Note, “Intra device configuration of network 
search engines,” 
[92] J. Corbet, A. Rubini and G. Kroah-Hartman. “Linux Device Drivers,” Networking
drivers, 3rd ed. Chapter 14-17. Retrieved 2011-03-06, Feb 2005.
[93] R. Braden. “Requirements for Internet Hosts – Communication Layers,” RFC
1122, Oct 1989.
[94] H. Balakrishnan et al. “TCP performance implications of network path asymmetry,” 
RFC 3449, Dec 2002    
[95] D. Patterson and J. Hennessy, “Computer Architecture: A Quantitative Approach,”
5th ed. Morgan Kaufmann, 2011.
[96] D. Murray, T. Koziniec, K. Lee and M. Dixon, “Large MTUs and internet 
performance,” 13th IEEE Conference on High Performance Switching and Routing, 
2012. 
[97] Xilinx Inc. “Performing Behavioral Simulation,” 
http://www.xilinx.com/itp/xilinx10/isehelp/pp_p_process_simulate_behavioral_mo
del.htm, 2008 [Oct. 2010]. 
[98] J. Larus, “SPIM: A MIPS32 Simulator,” http://spimsimulator.sourceforge.net,
2011 [Feb. 2011].
[99] J.W. McCormick. “Building parallel, embedded, and real-time application with 
Ada,” Publisher Cambridge, April 2011. 
[100] Extreme Networks. “40 and 100 Gigabit Ethernet Overview,”
http://www.extremenetworks.com/libraries/whitepapers/WPDC40G100G_1621.pdf,
2011 [Nov. 2011].
[101] Microsoft TechNet. “NTTTCP, last update. Ver 5.28,”
http://gallery.technet.microsoft.com/NTttcp-Version-528-Now-f8b12769, 2008
[Jan. 2010].
[102] Wireshark, “network protocol analyzer,” http://www.wireshark.org [Feb. 2010]. 
[103] B. Li, Y. Peng, Da-Tong Liu, and X. Peng “A high speed DMA transaction 
method for PCI express devices,” Journal of Electronic Science and  Technology 
China, VOL. 7, NO. 4, DEC. 2009. 
[104] R. Bittner "Speedy bus mastering PCI express,"  22nd International Conference 
on Field Programmable Logic and Applications (FPL), 2012  
[105] ComBlock. “COM-5401SOFT Tri-Mode 10/100/1000 Ethernet MAC VHDL
source code overview,” WWW.ComBlock.com, 2011 [Sep. 2012].
[106] ComBlock “COM-5402SOFT IP/TCP/UDP/ARP/PING stack GbE VHDL source 
code,” WWW.ComBlock.com, 2011 [Sep. 2012].  
[107] D. Jakimovska et al. “Performance estimation of novel 32-bit and 64-bit RISC
based network processor cores,” Cyber Journals: Multidisciplinary Journals in
Science and Technology, Journal of Selected Areas in Telecommunications
(JSAT), May Edition, 2011, pp. 28-40.
[108] NP-4, “100-Gigabit network processors for carrier Ethernet applications, product 
brief,” EZchip Technologies, 2010. 
[109] Intel, “IXP2800 Network Processor® Product brief, for OC-192/10 Gbps network
edge and core applications,”
http://int.xscale-freak.com/XSDoc/IXP2xxx/27905403.pdf, 2004 [Mar. 2012].
[110] Y. Qi et al. “Multi-dimensional packet classification on FPGA: 100 Gbps and
beyond,” In Proceedings of the Intl. Conf. on Field-Programmable Technology,
2010.
[111] Xilinx Inc, “PCI-X specification,”
http://www.xilinx.com/support/documentation/ip_documentation/pcix_64_ds208.pdf,
2010 [July 2012].
[112] Xilinx Inc, “Virtex-II Pro™ Platform FPGA user guide,” Xilinx User Guide, 
UG012 (v2.3), March 2003. 
[113] V. Paxson et al. “Computing TCP’s retransmission timer,” Internet RFC 6298,
2011.
[114] F. Gont and A. Yourtchenko, “On the implementation of the TCP urgent 
mechanism,” Internet RFC 6093, 2011. 
[115] S.Rewaskar, J. Kaur and F. Donelson "A Performance Study of Loss 
Detection/Recovery in Real-world TCP Implementations," Proc. IEEE ICNP, 
2007. 
[116] M. Zhang et al., “Analysis of UDP traffic usage on Internet backbone links,” Proc.
9th Annual International Symposium on Application and Internet, 2009.
[117] K. Fall and W. Stevens. TCP/IP Illustrated. The Protocols, 2nd ed.
Addison-Wesley, 2012.
[118] D. Madhuri,  S. N. Pradhan “Scavenging Idle CPU cycles for creation of 
inexpensive supercomputing Power,”  International Journal of Computer Theory 
and Engineering, Vol. 1, No. 5, December, 2009 , pp. 1793-8201. 
[119] G. Amdahl, “Validity of the single processor approach to achieving large-scale
computing capabilities,” AFIPS Conference Proceedings, pp. 483–485, 1967.
[120] A. Ortiz et al., “Protocol offload analysis by simulation,” Journal of System 
Architecture, pp 25-42, 2009.  
[121] R. Oshana, “A decision-tree approach to picking a multicore software
architecture,” Embedded System.
http://www.embedded.com/design/operating-systems/4372692/1/A-decision-tree-approach-to-picking-the-right-embedded-multicore-software-architecture,
2012 [July 2012].
[122] R. Maram, "Design and operation of RDMA based routing architecture,” Master’s 
Thesis, Department of Electrical and Computer Engineering, Wichita State 
University, Fall 2008. 
[123] Radisys, “100 Gigabit EZchip NP4 Packet Processing,” 
http://www.radisys.com/products/atca/packet-processing/atca-7300/ [June, 2013]. 
[124] Intel, “IXP1200 Network Processor family,”
http://int.xscale-freak.com/XSDoc/IXP12xx/27904001.pdf [June 2013].
[125] BroadCom, “Increasing performance in network storage with multi-processors and
high-speed I/O,” http://www.broadcom.com/collateral/wp/1250-WP100-R.pdf
[Jan. 2013].
[126] X. Pu et al., “Who is your neighbour: net I/O performance interference in 
virtualized Clouds,” IEEE Transactions on Services Computing, VOL. 6, No. 3, 
pp 314-328, 2013.  
[127] E. P. Markatos and T. J. LeBlanc, “Multiprocessor synchronization primitives
with priorities,” in Eighth IEEE Workshop on Real-Time
[128] A. Wieder and B. Brandenburg, “On Spin Locks in AUTOSAR: Blocking 
Analysis of FIFO, Unordered, and Priority-Ordered Spin Locks,”  
[129] IBM® AIX® 6.1 Information Center, “Large Send Offload,”
http://publib.boulder.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%2Fcom.ibm.aix.prftungd%2Fdoc%2Fprftungd%2Ftcp_large_send_offload.htm
[Jan. 2012].
 
Appendix A  
Data collection  
 
 
A.1 Collection of real TCP and UDP streams from multiple tests  
A network topology was set up for collecting the TCP and UDP flows. The purpose of
using real TCP and UDP data streams is to provide realistic input for processing by
the Packet Processing Unit (PPU). Three PCs (A, B and C) were chosen for sending and
receiving multiple TCP and UDP streams to or from the server "D" (Figure A.1). The
server chosen to send or collect the data is an IBM eServer that runs Windows 2008 as
its OS. The clients are Dell PCs, each with a Core 2 Duo E8400 CPU at 3.00 GHz and
4 GB of RAM with Physical Address Extension.

Figure A.1: Network topology (labels: clients A, B and C; Fast Ethernet switch; server)

 
A.2 Test Methodology
All performance measurements were carried out using the benchmark tool NTTTCP
[101]. This tool is included in the Microsoft® Windows® Driver Development Kit and
is a popular multithreaded, asynchronous application for sending streams of
TCP and UDP data between one or more end points.
To obtain valid results and collect different types of TCP and UDP streams, the following
parameters were adjusted to achieve the highest throughput on unidirectional
communications:
1- A specified number of threads and processors is required to support the
network processing. For example, three threads are used for one stream (either
send or receive) and two CPUs on the server are responsible for the network
processing. This is done to avoid delay in the processing.
2- An asynchronous data transfer that posts six overlapped receive buffers must
also be specified. This increases the likelihood that a user buffer is
available when data arrives from the network.
3- The MTU within the NI is configured manually (e.g., 512 bytes or 1500 bytes).
4- To allow enough time to complete sending and receiving all flows, each test
run lasts 120 seconds (two minutes).
5- To avoid conflicts between data streams, the communication ports 5555, 5566,
5577 and 5588 were assigned for this test.
6- All other applications that may interfere with the test remain closed.
7- The registry entry Tcp1323Opts, of type REG_DWORD, is set to 1.
8- The registry entry TcpWindowSize, of type REG_DWORD, is set to the default
value of 64K.
9- The systems are rebooted for the settings to take effect.
 
A.2.1  Receive side Flows 
The server receives different TCP and UDP flows as enumerated below:
• PC A will send 29040 bytes (28 K) of UDP application data to the server.
• PC B will send 14400 bytes (14 K) of TCP application data to the server
through
• PC B will also send 14520 bytes (14 K) of UDP application data to the
server.
• PC C will send 23000 bytes (22 K) of TCP application data to the server.
 
For each transfer measurement, the NTTTCP tool initiated a test using 1500-byte
packets (the default MTU). All receiver threads were started first, and then the sender
threads were started simultaneously. For bi-directional tests, a sender and a receiver
thread were started on both the client and server systems simultaneously. The -t
parameter was adjusted for a run time of approximately 120 seconds per data point. The
measurements were repeated more than ten times, without interruption, and the best three
results were averaged to obtain the numbers saved as a stream in Wireshark [102], which
is installed on the server.
 
A.2.1.1 Commands:
TCP flow (Test 1)
Client: NTTTCP -s -m 3,0,192.196.2.2 -l 22K -a 6 -p 5555 -t 120 -n 1
Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5555

UDP flow (Test 2)
Client: NTTTCP -s -m 3,0,192.196.2.2 -l 28K -a 6 -u -p 5566 -t 120 -n 1
Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5566

TCP and UDP flows (Test 3)
Client: NTTTCP -s -m 3,0,192.196.2.2 -l 14K -a 6 -u -p 5577 -t 120 -n 1
Client: NTTTCP -s -m 3,0,192.196.2.2 -l 14K -a 6 -p 5588 -t 120 -n 1

Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5577
Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5588

TCP and UDP flows (Tests 4, 5 and 6)

Client: NTTTCP -s -m 3,0,192.196.2.2 -l 64K -a 6 -p 5555 -t 120 -n 1
Client: NTTTCP -s -m 3,0,192.196.2.2 -l 64K -u -a 6 -p 5566 -t 120 -n 1
Client: NTTTCP -s -m 3,0,192.196.2.2 -l 64K -u -a 6 -p 5577 -t 120 -n 1
Client: NTTTCP -s -m 3,0,192.196.2.2 -l 64K -a 6 -p 5588 -t 120

Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5555
Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5566
Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5577
Server: NTTTCP -r -m 6,1,192.196.2.2 -rb 64K -a 6 -t 120 -p 5588
 
   
 
A.2.2    Send Side Flows  
Large files of TCP and UDP application data, each larger than the MTU, were chosen.
Server D sends TCP segments or UDP fragments to PCs A, B and C.
 
A.2.2.1 Commands:
Server D: sends packets out to the MAC unit
NTTTCP -s -m 3,0,192.196.2.10 -l 28K -u -a 6 -p 5555 -t 120 -n 1
NTTTCP -s -m 3,0,192.196.2.11 -l 14K -a 6 -p 5566 -t 120 -n 1
NTTTCP -s -m 3,0,192.196.2.12 -l 22K -u -a 6 -p 5577 -t 120 -n 1
NTTTCP -s -m 3,0,192.196.2.12 -l 22K -a 6 -p 5588 -t 120 -n 1

Clients A, B and C
NTTTCP -r -m 3,0,192.196.2.10 -rb 28K -a -p 5555 -t 120
NTTTCP -r -m 3,0,192.196.2.11 -rb 28K -a -p 5566 -t 120
NTTTCP -r -m 3,0,192.196.2.12 -rb 28K -a -p 5577 -t 120
NTTTCP -r -m 3,0,192.196.2.12 -rb 28K -a -p 5588 -t 120
 
 
Simultaneous TCP and UDP flows that were sent or received through the server were
captured from the real environment using Wireshark. A copy of the hexadecimal file
exported from Wireshark is stored in the Queue Hexadecimal File Buffer (Figure A.2).
The main purpose of these tests is to determine the total number of cycles the designed
RISC core needs while processing the LRO of TCP and UDP at 100 Gbps.

Figure A.2: A snapshot of the hexadecimal file from Wireshark (labels: Ethernet header, IP header, TCP)
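
To illustrate how the header fields labelled in Figure A.2 can be recovered from such a hexadecimal export, the following sketch decodes the Ethernet, IPv4 and TCP/UDP port fields from a string of hex bytes. It is a minimal example and not the loader used with the Queue Hexadecimal File Buffer: the frame below is synthetic (its checksums are left as zero), the addresses and port are borrowed from the commands above purely for illustration, and the function and variable names are hypothetical.

```python
import struct

# Synthetic 54-byte frame: Ethernet II + IPv4 (no options) + TCP (no options).
SAMPLE_FRAME_HEX = """
00 11 22 33 44 55 66 77 88 99 aa bb 08 00
45 00 00 28 00 01 00 00 40 06 00 00 c0 c4 02 0a c0 c4 02 02
c0 00 15 b3 00 00 00 01 00 00 00 00 50 10 ff ff 00 00 00 00
"""


def parse_frame(hex_text: str) -> dict:
    """Decode basic Ethernet/IPv4/transport fields from whitespace-separated hex bytes."""
    frame = bytes.fromhex("".join(hex_text.split()))

    # Ethernet II header: destination MAC, source MAC, EtherType (14 bytes).
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")

    # IPv4 header starts at byte 14; the low nibble of its first byte is the
    # header length in 32-bit words.
    ip_header_len = (frame[14] & 0x0F) * 4
    (total_length,) = struct.unpack("!H", frame[16:18])
    protocol = frame[23]  # 6 = TCP, 17 = UDP
    src_ip = ".".join(str(b) for b in frame[26:30])
    dst_ip = ".".join(str(b) for b in frame[30:34])

    # TCP and UDP both begin with source port then destination port.
    l4 = 14 + ip_header_len
    src_port, dst_port = struct.unpack("!HH", frame[l4:l4 + 4])

    return {
        "dst_mac": dst_mac.hex(":"),
        "src_mac": src_mac.hex(":"),
        "total_length": total_length,
        "protocol": protocol,
        "src_ip": src_ip,
        "dst_ip": dst_ip,
        "src_port": src_port,
        "dst_port": dst_port,
    }


if __name__ == "__main__":
    for key, value in parse_frame(SAMPLE_FRAME_HEX).items():
        print(f"{key}: {value}")
```

The same source address, destination address, protocol and port fields are the ones a receive-side aggregation function would need in order to decide which flow an incoming packet belongs to.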
 
 
 
 
Appendix B 
 
VHDL Simulation Diagrams 
 
 
This appendix presents the schematic diagrams of the Packet Processing Unit (PPU).

Figure B.1: VHDL based Packet Processing Unit architecture

Figure B.2: Structure of RISC instruction VHDL based pipeline

Figure B.3: DMA schematic diagram

Figure B.4: RISC register file schematic diagram

Figure B.5: CAM schematic diagram

Figure B.6: Receiver Buffer Interface (RBI) schematic diagram

Figure B.7: Minimize Data Hazard by latching the output of the ALU by forwarding
hardware (U12) to be read within next instruction (forward mechanism).

Figure B.8: VHDL block diagram for DMA entity

Figure B.9: VHDL block diagram for Register entity

Figure B.10: VHDL block diagram for PipeLine entity

Figure B.11: VHDL block diagram for CAM entity
 
 
