The speed of Ethernet networks has increased to 40-100 Gbps since the release of IEEE P802.3ba. Enhancing protocol processing at the end node is essential to meet the demands of these increased network speeds. This research presents enhanced pre-packet processing for inbound and outbound traffic using a scalable, Network Interface-based, three-pipeline Embedded Processor. The designed Network Interface uses a specialized, cost-effective 760 MHz embedded processor core that can support a wide range of received UDP/IP packets at up to 100 Gbps. A 430 MHz Embedded Processor can be used for the send side. Furthermore, we provide a processing methodology for Large Receive Offload and Large Send Offload that contributes to pre-packet processing and lets the host work with fewer headers and fewer data transfers from the network interface.
Introduction
UDP-based protocols are needed as a short-term solution to effectively use kernel space transport protocols for high-speed networks [23]. Protocol processing for incoming or outgoing packets consumes a large share of the host CPU's cycles [12, 13]. Shifting part of the protocol processing to the network interface (NI) is commonly used to reduce the processing requirements at the host CPU [14]. However, offloading the protocol functions to the NI moves the network bottleneck towards the core engine in the NI.
Enhancing the NI performance and reducing the protocol processing time have become priorities in satisfying the demand for high-speed networks. In this paper, a novel technique for the Large Receive Offload (LRO) function [17, 18] is presented in the NI. This approach amalgamates the incoming UDP/IP packets that share the same Internet Protocol (IP) address, port ID (PID) and Identification into a single large packet inside the NI buffer before sending the large packet to the protocol stack for further processing. The LRO has been extended to manage out-of-order packets. Another contribution is the enhancement of the Large Send Offload (LSO) methodology to divide UDP/IP datagrams into Maximum Transmission Unit (MTU) sized fragments.
Increasing the core processor clock to 5 GHz [2] or using multiple cores in the NI [4] have been proposed as solutions for 10 Gbps. Today, many cost-effective embedded cores have become available and can be ported to the NI (e.g. Intel IXP800 and EZChip's NP-1-4 processor). However, these processors rely on multi-processing to meet the requirements of 100 Gbps processing. In addition, off-the-shelf processors are not optimized for LRO and LSO functions. Since these processors are designed to support other general functions, the control unit has to support general functions, complex instructions, and long and variable execution times. These general-purpose CPUs have a large number of registers to accommodate all the possible uses.
The goal of the research is to design a Network Interface that supports our algorithms for LRO and LSO on high-speed networks, up to 100 Gbps. In addition, we investigate the use of two specialized cores, one for LRO and the other for LSO, that run at lower clock frequencies while supporting 40 and 100 Gbps networks.
Large Receive Offload Enhancements
Large Receive Offload (LRO) was designed as a software driver in the Linux platform. Intel has used the virtual LRO to reduce the number of arriving UDP/IP packets.
The virtual LRO combines the packets from the same stream into larger-sized packets inside a host memory Socket buffer (SKB) by generating an SKB only for the first packet of an LRO session. The virtual LRO does not support out-of-order packet processing and instead stores packets as separate SKBs if they do not match the LRO requirements.
This approach benefits the receiving side, but the host CPU spends a number of cycles [3] running the virtual LRO. In addition, the host CPU must process the small packets that do not match the LRO's criteria, such as out-of-order packets. Packet reordering is quite common in networks [19], and the host CPU is required to handle all the out-of-order packets.
With virtual LRO, DMA initiations are required to pass packets from the network interface's buffer through the system bus to the user space in the host memory. Each 32 KB message leads to several DMA initiations, and these initiations may cost the host CPU around 300 ns when the message size is 32 KB [20].
The proposed approach is to have the core engine at the network interface amalgamate the valid UDP packets that have the same connection information (e.g. IP address and port ID) into a large packet before passing it to the protocol stack (Fig. 1). Processing the LRO in the network interface does not change the protocol stack process, since the large packet is handled in a similar way to a Jumbo frame (9000 bytes). In addition, offloading the LRO function relieves the host CPU from processing the out-of-order packets and reduces the DMA initiations, since only one large packet needs to be moved from the network interface buffer to user space.
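The amalgamation idea can be illustrated with a minimal Python sketch. It is not the NI implementation: packets are modelled as plain dictionaries, the field names (`src_ip`, `port_id`, `ident`, `payload`) are hypothetical, and all packets are assumed to arrive in order.

```python
from collections import defaultdict

def amalgamate(packets):
    """Group in-order UDP/IP packets by flow and join their payloads.

    Each packet is a dict with hypothetical keys: src_ip, port_id,
    ident and payload (bytes). Returns one large payload per flow.
    """
    flows = defaultdict(bytearray)
    for pkt in packets:
        # Packets with the same connection information belong together.
        key = (pkt["src_ip"], pkt["port_id"], pkt["ident"])
        flows[key] += pkt["payload"]
    return {key: bytes(data) for key, data in flows.items()}
```

Under these assumptions, four small packets from two flows would be handed to the protocol stack as two large packets, which is where the reduction in DMA initiations comes from.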
Large Send Offload
The LSO feature is helpful only on the transmitting side, where it frees the host CPU from the task of segmenting the application's data. The core engine in the network interface divides the data sent by the host CPU into segments of the Maximum Segment Size (MSS) (default is 1472 bytes/fragment). The core engine is required to generate the packet header, attach it to the payload part and eventually send the packet to the MAC layer for transmission on the network line.
The core engine in the NI is responsible for handling the tasks related to the transport layer [21]. The LSO processing starts after a datagram (sized up to 64 KB) is sent to the NI's buffer along with the information related to the moved datagram, such as the position of the message inside the NI's buffer and the MSS. The core engine at the NI copies the header of the moved packet (as a template header) into an internal buffer (e.g., the core's registers). A copy of the template header is used whenever a segment needs to be sent to the network. The core updates the essential fields inside the UDP and IP headers of the copied header, such as the Offset and the total length, before sending a packet. This copy is attached to the segment to create a complete packet, which is then sent to the MAC layer. This approach has been successful in offloading the LSO, but on high-speed networks it costs the core engine a significant number of cycles to send each packet from the NI's buffer to the MAC. The header copy itself requires at least four cycles over the 64-bit bus to load the 28-byte header. In addition, the core is required to move the copy of the template header to the MAC layer.
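The four-cycle figure for the header copy follows from the bus width. A small sketch, assuming one bus word is transferred per cycle (the function name is ours):

```python
import math

def bus_cycles(num_bytes, bus_width_bits=64):
    """Cycles needed to move num_bytes, assuming one bus word per cycle."""
    return math.ceil(num_bytes * 8 / bus_width_bits)

# 28-byte UDP/IP header (20-byte IP header + 8-byte UDP header)
header_cycles = bus_cycles(28)  # 224 bits over a 64-bit bus
```

The 28 bytes occupy 3.5 bus words, so the last word is only half used and the copy rounds up to four cycles.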
In this paper, we present an alternative method for sending data faster to the physical and MAC layers. The presented approach focuses on header processing and data movements. For header processing, we present a new algorithm to enhance the flow of packets. When the host CPU has data to send that is larger than the MTU, it sends one large UDP/IP packet to the NI. At the network interface, a packet header is generated for each data segment. The IP identification values are set based on the initial value of the original packet sent by the host CPU. Conceptually, only the packet length, the MF bit and the offset value need to change from one packet to the next [5] (Fig. 2).
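The per-segment header changes can be sketched as follows. This is an illustration of standard IPv4 fragmentation, not the NI's algorithm: it assumes a 20-byte IP header without options, and the function name and tuple layout are ours.

```python
IP_HEADER_LEN = 20  # bytes; assumes an IPv4 header without options

def plan_fragments(data_len, mtu=1500):
    """Return (total_length, mf_bit, offset_in_8_byte_units) per fragment.

    data_len counts the bytes that follow the IP header
    (UDP header plus payload).
    """
    # Every fragment except the last must carry a multiple of 8 bytes.
    max_data = ((mtu - IP_HEADER_LEN) // 8) * 8
    fragments = []
    sent = 0
    while sent < data_len:
        chunk = min(max_data, data_len - sent)
        more = 1 if sent + chunk < data_len else 0
        fragments.append((IP_HEADER_LEN + chunk, more, sent // 8))
        sent += chunk
    return fragments
```

For 4000 bytes of data and a 1500-byte MTU this yields three headers that differ only in total length, MF bit and offset, which is why a single template header can be reused.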
Network Interface Model
We have structured the proposed NI model into three parts: the communication Line Interface (LI), kernel processing and the Host Interface (HI) (Fig. 3). The HI and LI are implemented in hardware. The processing unit in the NI, which handles the functions related to header processing, is implemented by specialized embedded processors. When a valid packet arrives at the Receiver Buffer Interface (RBI), the finite state machine (FSM) enables one buffer to hold the received data, interrupts the Receiver Embedded Processor (REP) core on the receiving side, and then switches to another buffer. The REP core processes the packet header, which is located at the top of the packet body. If a new packet arrives, it is stored in the second buffer of the RBI. This arrangement gives the REP about 123.04 ns per packet when the line rate is 100 Gbps and the packet size is 1500 bytes (MTU) [1]. To provide high flexibility in exchanging information between the embedded cores and the host CPU, three memory-based FIFOs were implemented. The pointer of each FIFO is stored in the processor's registers. The cores can reach any of these FIFOs after reading its address. FIFO 1 carries the received signaling packets (e.g., packets that have zero payload). FIFO 2 carries the connection information to the host CPU, including the start-address of each received amalgamated packet inside the Receiver Buffer (RB). FIFO 3 carries the status of the connection information to the REP core.
FIFO 4 carries the sending status to the host (e.g., after sending all the data). FIFO 5 carries information to the Sender Embedded Processor (SEP) core at the sending side, such as the location of messages inside the Sending Buffer (SB), or flow-control information used to adjust the sending speed when the destination buffer becomes saturated and has limited space for arriving packets.
The NI has an RB which is used to reassemble the packet bodies that arrive from the network. The large packets are stored in the RB until the Interrupt Moderation timer expires [11]. The size of the RB buffer is 4 GB. This buffer can hold up to a maximum of 256 connections, where each connection may contain 64 KB of data.
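The per-packet budget of about 123.04 ns quoted for the REP can be reproduced with a short calculation. The 38 bytes of per-frame Ethernet overhead used below (14-byte header, 4-byte FCS, 8-byte preamble and start delimiter, 12-byte inter-frame gap) is our assumption; the text does not state how the figure was derived.

```python
ETHERNET_OVERHEAD = 38  # bytes; assumed: 14 header + 4 FCS + 8 preamble + 12 gap

def packet_time_ns(ip_packet_bytes, line_rate_gbps, overhead=ETHERNET_OVERHEAD):
    """Time one packet occupies the line, in nanoseconds."""
    bits = (ip_packet_bytes + overhead) * 8
    return bits / line_rate_gbps  # bits / (Gbit/s) gives ns

def packets_per_second(ip_packet_bytes, line_rate_gbps):
    """Back-to-back packet arrival rate at the given line rate."""
    return 1e9 / packet_time_ns(ip_packet_bytes, line_rate_gbps)

budget = packet_time_ns(1500, 100)  # about 123.04 ns under these assumptions
```

Under the same assumptions a 512-byte packet leaves the core only 44 ns, which is why the smaller packet sizes drive the clock requirements discussed later.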
UDP Receive Processing
The REP core at the receiving side receives streams of packets. The RISC core applies a set of processing rules to each packet to distinguish its type. A beginning of message (BOM) is the first packet received of a UDP stream. Continuation of message (COM) packets are processed after the BOM. An end of message (EOM) is the last packet of the stream. A Single Segment Message (SSM) is an individual packet. To identify the received packets, a fast look up table (LUP) mechanism is needed to choose the sub-routine for the packet type. The Content Addressable Memory (CAM) [2] is used to store the received connection information [8], such as the IP address and the Port-ID (PID). This information is extracted from the BOM header, since each UDP/IP packet carries its own headers with the packet's information. The PIDs are unique numbers for each stream. The Receive Embedded Processor (REP) matches the arriving packet's connection data with the data stored in the CAM (Fig. 4). After a match is found, the CAM sends out the address of the linked-list information and the match signal to the REP core for further processing. Because the CAM does not use address lines to find data, its depth can be extended as far as desired. In this prototype, a 256-entry CAM is used.
As soon as the UDP packet arrives at the NI, the IP header and UDP header are processed. The IP address and the PID are masked from the IP header and UDP header.
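The CAM lookup can be modelled in software with a dictionary keyed by the connection information. This is only a behavioural sketch: a hardware CAM compares all entries in parallel, and the class and method names below are hypothetical.

```python
class Cam:
    """Toy content-addressable memory: entries are found by content."""

    def __init__(self, depth=256):
        self.depth = depth
        self.entries = {}  # (ip, pid) -> linked-list information

    def match(self, ip, pid):
        """Return (match_signal, entry), like a match line plus data output."""
        entry = self.entries.get((ip, pid))
        return (entry is not None, entry)

    def insert(self, ip, pid, info):
        """Store connection information taken from a BOM header."""
        if (ip, pid) not in self.entries and len(self.entries) >= self.depth:
            raise MemoryError("CAM is full")
        self.entries[(ip, pid)] = info
```

A lookup for an unknown connection returns a cleared match signal, which is the case in which the REP would treat the packet as the start of a new stream.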
Since the CAM holds the received connections, the arriving packet's connection information is matched against the entries stored in the CAM. If a match is found, the REP needs to store the data packet in the Host Interface (HI). In order to manage the HI accurately, a Circulation Buffer (CB) mechanism has been added to the memory management to hold all the free pointers inside the RB. The REP reads the head of the CB in order to retrieve the address of a free location inside the RB for the arriving packet. The free pointers that refer to available locations inside the RB are collected again after the host reads the amalgamated data.
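The free-pointer mechanism behaves like a ring of addresses. A minimal sketch, assuming fixed-size slots in the RB and using hypothetical names:

```python
from collections import deque

class FreePointerRing:
    """Ring of free RB addresses, read at the head and refilled at the tail."""

    def __init__(self, addresses):
        self.ring = deque(addresses)

    def allocate(self):
        """Take the free address at the head for an arriving packet."""
        return self.ring.popleft()

    def release(self, address):
        """Return an address once the host has read the amalgamated data."""
        self.ring.append(address)
```

Allocation and release are both constant-time, so this bookkeeping adds little to the per-packet cycle budget.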
If the packet is related to a UDP flow, the offset number, MF bit, and PID fields inside the IP header are used to identify whether the packet is the BOM packet, COM, EOM or SSM packet. These identifications assist the linked-list mechanism to build the linked list of the stream.
The BOM is the first packet of a message, where the MF bit is equal to "1" and the offset is zero. The REP creates a data entry inside the CAM, such as the start and end address of the linked list. If the MF bit is equal to "0" and no linked list is assigned to the stream, the message is an SSM. With an SSM, the REP sends the message as-is to the RB and does not update the start and end address of the linked list. COM packets arriving after the BOM have the same Identification and the MF bit equal to "1". With a COM, the REP needs to update the CAM entry. The EOM is the last packet of a message, where the MF bit is equal to "0". The REP combines the data packet with the large packet previously amalgamated for this address. The REP is responsible for updating the UDP header (i.e. the offset number) when the EOM is discovered or the interrupt moderation timer expires. The total length of the datagram is updated according to the amalgamated data.
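The four packet types can be told apart from two header fields. The sketch below follows standard IPv4 fragmentation semantics, in which the MF bit is 1 on every fragment except the last and the first fragment has offset zero. Distinguishing SSM from EOM by the offset, rather than by checking the CAM for an existing linked list, is our simplification.

```python
def classify(mf_bit, fragment_offset):
    """Classify a UDP/IP packet as BOM, COM, EOM or SSM."""
    if mf_bit == 1:
        # More fragments follow: first fragment or a middle one.
        return "BOM" if fragment_offset == 0 else "COM"
    # No more fragments: unfragmented packet or the final fragment.
    return "SSM" if fragment_offset == 0 else "EOM"
```

In the NI, the result of this decision selects the sub-routine that either creates, extends or closes the stream's linked list.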
Linked-List Design Format
The 32-bit field inside the IP header (the Identification, flags and offset) supports the REP in identifying the packets in order to enable the link. For example, if the packet is a BOM, the first packet in a stream, there is no linked-list previously assigned to this stream. The REP needs to create a new linked-list for this packet by inserting the Start-address and End-address in the CAM beside the connection information (Fig. 5). The Start-address refers to the head of the linked-list (the address that is loaded from the CB for this packet). The End-address refers to the tail of the linked-list, which is the address of the NULL node pointer located at the end of the packet body. Thus, a linked-list with one node has been created for the arrived UDP/IP packet. Continuation-of-Message (COM) packets arriving after the BOM have the same connection identifier. After the packet's IDs match the ones in the CAM, the REP adds a new node (the packet body and its pointer) to the existing linked-list. The linked-list is updated after adding a new node by setting the node pointer of the new node to NULL (the End-address). Next, the REP stores the NULL address of the current node in the CAM as the new end of the list.
The End-of-Message (EOM) packet is marked by a flag, the MF-bit, inside the IP header [7]. The REP stops amalgamating the packets for this stream. The REP then appends the EOM packet to the related stream inside the RB and deletes the linked-list information of this stream. The End-address in the CAM, which refers to the NULL value of the previous packet, is read by the REP. The REP stores the address of the current packet in the same place as the NULL value of the previous location (the End-address). The REP is responsible for updating the UDP header of the original UDP/IP packet of the large packet, which is inside the RB. Furthermore, it needs to update the datagram length inside the IP header and the datagram size inside the UDP header to the total amalgamated bytes of the stream. When the MF-bit is equal to "0", the REP examines the CAM's start and end address of the connection. If the start and end address are equal to "1", there is no linked-list assigned to this stream. The packet is then moved from the LI to the RB buffer. The REP then completes the packet as a Single Segment Message (SSM) and stores the NULL value at the end of the packet body. Furthermore, there is no need to update the end or start address in the CAM because no more packets will be amalgamated with this packet.
In the COM case, the current packet is amalgamated with the previous packet of the same stream, and the NULL value is stored at the end of the current node. Finally, the NULL address is stored in the CAM with the same link information, which refers to the end address of the UDP stream. When the MF-bit flag (inside the IP header) is equal to "0" [10], the REP needs to add this node to the end of the linked-list, following the same procedure as that used for the COM. With the EOM, there is no need to extend the linked-list further because it is the last packet of the stream, and there is no need to store the End-address in the CAM. However, when the MF-bit is equal to "0" and the start and end address of the connection is "0", the current packet is moved from the Line Interface to the RB buffer. The REP then processes this packet as an SSM. With an SSM, the REP does not need to update the CAM entries related to this message, such as the start and end address of the linked list, since no more packets will be amalgamated after this packet.
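The linked-list bookkeeping described in this section can be sketched in a few lines. This is a simplified model with hypothetical names: the RB is a dictionary from address to node, `None` stands for the NULL pointer, and the CAM entry stores the address of the tail node rather than the address of its NULL pointer field.

```python
def append_packet(rb, cam_entry, address, payload):
    """Link the packet stored at `address` onto the stream's list."""
    rb[address] = {"payload": payload, "next": None}  # new tail ends in NULL
    if cam_entry["start"] is None:
        cam_entry["start"] = address                  # BOM: list head
    else:
        rb[cam_entry["end"]]["next"] = address        # overwrite old NULL
    cam_entry["end"] = address                        # remember the new tail

def read_stream(rb, cam_entry):
    """Walk the list from Start-address and return the amalgamated data."""
    data, addr = b"", cam_entry["start"]
    while addr is not None:
        data += rb[addr]["payload"]
        addr = rb[addr]["next"]
    return data
```

Because only the old tail and the CAM entry are touched, adding a COM or EOM packet costs the same number of steps regardless of how long the stream already is, and packets need not be contiguous in the RB.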
Out-of-Order Processing
The Out-of-Order procedure is more complex than the processing of BOM, COM or SSM packets. The virtual LRO opens a new queue if out-of-order packets are discovered. These steps lead to extra processing requirements and additional constraints on the host CPU cycles and space. This research therefore extends order-dependent, in-order processing architectures by introducing a more flexible out-of-order processing architecture. Out-of-order processing starts after the REP reads the Offset of the arrived packet contained in the IP header and compares it with the Offset expected next in the linked-list. In the next step, the REP joins the arrived packet into the linked-list.
In the case where the offset number does not fall within the expected range of the target stream, the REP creates a sub-linked-list of the stream (Fig. 6). A duplicate data segment is discovered by checking the Identification and the offset against the UDP stream that has been amalgamated beforehand in the RB.
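The offset comparison can be sketched as follows. This is an illustration under simplifying assumptions, not the REP's procedure: offsets are given in bytes, the stream is a dictionary with hypothetical keys `data` and `pending`, and the pending dictionary stands in for the sub-linked-list.

```python
def receive(stream, offset, payload):
    """Place a fragment into a stream, tolerating out-of-order arrival.

    stream = {"data": bytearray, "pending": {offset: payload}}
    """
    expected = len(stream["data"])
    if offset < expected or offset in stream["pending"]:
        return "duplicate"                   # already amalgamated or held
    if offset > expected:
        stream["pending"][offset] = payload  # early arrival: hold it aside
        return "held"
    stream["data"] += payload
    # Drain held fragments that have now become contiguous.
    while len(stream["data"]) in stream["pending"]:
        stream["data"] += stream["pending"].pop(len(stream["data"]))
    return "appended"
```

A fragment that arrives early is held until the gap before it is filled, so the host still receives a single in-order large packet.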
Send Processing
Since the receiving side is entirely independent of the send side, the Send Embedded Processor (SEP), the DMA and other devices share a single bus to operate the LSO.
Fragmentation is required if a message is larger than the MTU. The SEP is responsible for fragmenting the large packet into multiple smaller packets that can be transmitted over the network. The IP header has three fields that are used for fragmentation processing [9]: the Identification (16 bits), a datagram ID set by the source; the Fragmentation offset (13 bits), which is required to locate the fragment within the datagram and is specified in multiples of 8 bytes; and the flags (D-bit and MF-bit). The D-bit (do not fragment bit) prevents fragmentation. The MF-bit specifies whether this fragment is the last one in the original message.
With the Programmed I/O (PI/O) method for data movement, the embedded processor core controls the bus while data is being moved, so the core becomes busy with transferring data rather than header processing. The DMA is therefore used for transferring data between the SB and the SBI. The embedded processor core initiates and controls the DMA with the location and size of the data. Since the local bus is shared between the DMA and the embedded processor core, the core has to release the local bus to let the DMA perform the data transfer. Each transfer of 64 bits consumes two cycles. In the first cycle, the DMA reads 64 bits from the source buffer into the DMA's register. During the second DMA cycle, the word moves from the DMA's register to the destination buffer. The DMA state machine provides the read and write signals to the source and destination buffers. The state machine in the DMA is also responsible for incrementing the address counter.
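The flag and offset fields used during fragmentation share one 16-bit word of the IPv4 header (a reserved bit, the D-bit, the MF-bit and the 13-bit offset). A minimal sketch of packing and unpacking that word; the function names are ours:

```python
def pack_flags_offset(d_bit, mf_bit, offset_bytes):
    """Build the 16-bit flags/offset word of an IPv4 header."""
    if offset_bytes % 8:
        raise ValueError("fragment offset must be a multiple of 8 bytes")
    units = offset_bytes // 8          # the field counts 8-byte units
    if units >= 1 << 13:
        raise ValueError("offset does not fit in 13 bits")
    return (d_bit << 14) | (mf_bit << 13) | units

def unpack_flags_offset(word):
    """Return (d_bit, mf_bit, offset_bytes) from the 16-bit word."""
    return (word >> 14) & 1, (word >> 13) & 1, (word & 0x1FFF) * 8
```

For the second fragment of a datagram sent with a 1500-byte MTU, the offset of 1480 bytes is stored as 185 units, giving the word 0x20B9 when more fragments follow.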
In this simulator, the embedded processor core initiates the DMA to transfer data from the SB to the SBI. The core is responsible for updating the packet headers for each segment inside the SB. The processor uses several pointers in order to continue updating data in the SB (Fig. 7): the Start-Header Address Pointer (SHAP), End-Header Address Pointer (EHAP), Start-Payload Pointer (SPP) and End-Payload Pointer (EPP).
The SEP core uses the SHAP for retrieving the network headers inside the SB. The SPP helps the SEP to locate the start of the payload. The EPP points to the end of the first segment. The SEP updates this pointer during the data movements of the first packet (the BOM).
Processing Analysis
A behavioral model simulation on the Xilinx [22] was chosen for this research. We started the model simulation by delivering different packets to the receiving side that can keep the embedded processor busy while the DMA transfer cycle is in operation. We have observed from the simulation that the REP requires 15 instructions to ident- (Fig. 8). To avoid conflicts over the local bus, we have assigned instructions to the REP that do not require the use of the local bus during data movements, such as checking the status of the current packet of the BOM.
If the MF-bit is set to "0" [10] and there is no linked-list assigned to this stream, then the REP treats this packet as an SSM; otherwise, the REP processes this packet as the EOM. If the packet is an EOM, the REP core executes two instructions to update the UDP and IP headers inside the RB and uses one cycle to send the start-address to FIFO 2. The REP core has to wait until the local bus is released by the DMA controller in order to complete the EOM processing.
A. Data Movements
The packet payloads varied from 512 to 1472 bytes [1]. If the packet size is 1500 bytes, the DMA required 183 cycles to move the data packet from the RBI to the RB over the 64-bit bus. It is clear from Fig. 8 that the embedded processor is idle until the DMA completes the data movements. These idle cycles affect the performance of the network interface and its capability to deal with high-speed networks.
Reducing the idle cycles of the cores has been studied. One of these solutions is to use a multi-bus on the receiving-side. The embedded processor can access multiport memories while the DMA controller moves data. The second scenario is to place data into the RB first, instead of the RBI, and then combine the message with the previous one. Stealing cycles from the DMA can be used but the overhead of this approach can reduce the NI's performance [24] .
The other approach is to use a DMA that runs at a higher clock rate than the embedded processor. We have used this method since it does not require any modifications to the proposed model in Fig. 3. We adjusted the DMA's clock to reduce the idle cycles and to improve the performance of the NI. Small packets, such as 64 bytes to 256 bytes, may require fewer DMA cycles than packets that have more payload bytes. However, while using these smaller packets can improve the NI's performance, it affects the end node's throughput [6]. We have focused our research on packets sized from 512 bytes to 1500 bytes (MTU). The use of small packets can be studied with this model, but they bear little payload data and may not be able to achieve 100 Gbps.
Fig. 8: The embedded core remains idle until the DMA completes moving the data
Using MTU packets requires a fast DMA to eliminate the embedded processor's idle cycles. Therefore, the DMA clock speed was increased to run faster than the REP or SEP clock so that the local bus is available for both the DMA and the cores.
Simulation Results
During the timing analysis, we found that when the DMA's clock is five times faster than that of the embedded processor core, the NI performance increases significantly, with most, if not all, of the idle cycles removed (Table 1). Table 1 presents the total embedded processor instructions needed to complete UDP packets. The network interface receives more packets per second when the packet is smaller than 1500 bytes [1]. We recorded the speed of the DMA while moving different messages (Fig. 9). The required DMA speed rises to 3759 MHz when the size of the packet is 512 bytes, while it is 1.4 GHz when moving packets sized 1500 bytes.
We found that the number of idle cycles decreased to zero when the core engine processed out-of-order packets. The core requires 32 cycles to completely process an out-of-order packet. In that time the DMA managed to move 484 bytes from the RBI to the RB (the payload part of a 512-byte packet is 484 bytes). Although there are fewer idle cycles when the packet size is 512 bytes, a large number of idle cycles were recorded when the size of the packets is 1024 bytes or greater.
The required DMA clock fluctuates with the size of the packets being moved. The DMA required a lower clock rate when dealing with 1500-byte packets, which arrive at about 8,127,438 packets per second [1]. Because the required DMA speed depends on the size of the message, we have fixed the DMA clock speed at 3759 MHz on the receive side and 2115 MHz on the transmit side. This rate was found when a 512-byte packet is processed at the receiving side and a BOM at the send side. The transmitter and receiver have been verified after adjusting the DMA clock speed to the highest clock speed, and the cycles have been recorded in Table 2.
The DMA clock speeds identified produced results which are acceptable for speeds up to 100 Gbps. The embedded processor core requires 32 cycles on the reception side, and 23 cycles on the transmitter side. The desired processor clock was recorded on this basis (Fig. 10). A 423 MHz RISC core is required to support LSO for 100 Gbps speeds.
Fig. 10: The desired embedded clock rate for LSO and LRO for UDP/IP (when the DMA is 3759 MHz for the receiving side and 2115 MHz for the send side)
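A rough lower bound on the core clock follows from the cycles per packet and the packet arrival rate. The sketch below is our back-of-the-envelope estimate, not the simulation's method: it assumes 38 bytes of Ethernet framing overhead per packet and ignores bus waits and idle cycles, so it should not be expected to reproduce the simulated clock rates exactly.

```python
def required_core_mhz(cycles_per_packet, ip_packet_bytes,
                      line_rate_gbps=100, overhead=38):
    """Minimum core clock (MHz) to finish one packet per arrival slot.

    overhead is the assumed per-frame Ethernet overhead in bytes.
    """
    packet_time_ns = (ip_packet_bytes + overhead) * 8 / line_rate_gbps
    return cycles_per_packet * 1000 / packet_time_ns  # cycles/ns = GHz

receive_1500 = required_core_mhz(32, 1500)  # about 260 MHz
receive_512 = required_core_mhz(32, 512)    # about 727 MHz
```

With 32 receive-side cycles per packet, the estimate shows the same trend as the simulation: the required clock grows quickly as the packet size shrinks.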
Conclusion
We have presented computer simulation results that measure the amount of processing required for LRO functions for UDP/IP. The behavioral model simulation results have shown that a cost-effective embedded core can provide the efficiency required of the network interface to support a wide range of transmission line speeds, up to 100 Gbps. A 270 MHz embedded core can support the receiver side processing for up to 100 Gbps transmission speed for UDP/IP when the packet size is 1500 bytes, while a core running at 800 MHz was found to support 512-byte packets. A fast DMA running at 3759 MHz is assumed in order to enhance the network performance and reduce the embedded processor's idle cycles. The required DMA clock drops significantly when the NI's bus is wider than 64 bits (e.g. 320-bit [10]).
This research can open the door for improving the common receive methods that have been applied to accelerate protocol processing, such as the use of zero-copy, RDMA or Direct Cache Access (DCA) with this LRO approach. These applications then need to work with fewer headers and fewer data copies. For DCA, fewer headers will be copied into the host CPU's cache instead of a large number of small UDP/IP headers. Besides that, there will be fewer calls to copy data into the host memory. This could increase the processing need in the future.
