The trend in network processing is towards layer 4-7 processing in routers and switches. In this paper, we present a hybrid network processor (NPU) architecture that supports high-speed execution of higher layer TCP/IP protocol processing. The architecture achieves this by employing optimized logic blocks for specific tasks in the processing elements. The architecture is unique in that it can realize both task-level and packet-level parallelism. We also present an example implementation of a fast adaptive routing algorithm, Cognitive Packet Networks (CPNs), using the proposed NPU architecture. CPN uses a neural network model with a reinforcement learning algorithm to find routes. The applicability of the CPN concept has been demonstrated through several software implementations. Through hardware implementation, we show that this new network model can sustain line rates similar to those of the current IP protocols.
INTRODUCTION
The rapid expansion of network applications and data traffic is leading to new specialized processor designs to keep up with the growing field of networking and communications. Network component design becomes more challenging as the performance and usage of communication networks increase. To meet rapidly changing requirements such as performance, cost, flexibility and interoperability, the networking industry has opted to build products around network processors or network processing units (NPUs).
The current state of NPU designs resembles the infancy stage of CPU architectures in the '60s-'70s prior to the emergence of standardized tracts of architectures. There are at least 30 companies marketing or developing various types of NPUs, which are very different in terms of architectural features and functionality [1, 2]. Almost all these NPUs target IP packet forwarding and layer 3 applications. Recently, with increasing line rates, servers have become overwhelmed by the burden of processing TCP/IP packets. One approach to ease that problem is to provide hardware support to the CPU by offloading some of the tasks involved [3]. This may work for the server or client side; however, as line rates scale upward, layer 4-7 processing will impose more processing requirements on the processor. So the trend in network processing is towards layer 4-7 processing in the routers and switches. Implementing 7-layer processing at high speeds, e.g. 40 Gbps, requires innovative technology. Legacy solutions based on RISC processing cannot meet the challenge. They cannot cope with the considerable processing required in a limited number of clock cycles [4]. To handle specific operations, specialized logic is necessary. We address this in our proposed network processor architecture by including task-specific logic (TSL) blocks in processing elements (PEs). Each PE employs a different TSL, which is optimized for a specific function.
The growth of the Internet is pushing the capabilities of the current TCP/IP based network architecture to its limits and beyond. Alternative network architectures are needed to fix some of the problems of IP-based networks. The recently proposed Cognitive Packet Network (CPN) [5, 6, 7] attempts to solve some of the problems associated with the legacy IP networks, such as QoS, the never-ending expansion of routing tables and their related maintenance issues.
In this paper, we introduce a hybrid network processor architecture that not only supports CPN and IP packet forwarding, but also high-speed execution of upper layer protocols. To accomplish this, the architecture utilizes optimized logic blocks for specific tasks. As such, this is a unique networking hardware architecture that can achieve both task-level and packet-level parallelism. Hardware support for upper layer processing improves content-aware security, intelligent bandwidth management, and delivery of the highly granular QoS required for time-sensitive applications. Also, there is a growing interest in extensible networks, overlay networks and grid computing. Higher layer processing built into hardware can powerfully support these networks and computational styles.
The rest of the paper is organized as follows: in the next section, we provide the preliminaries for network processing hardware and upper layer protocol processing. Section 3 introduces the hybrid NPU architecture. In Section 4, a brief overview of CPN is given. Section 5 presents the design and implementation of a high-speed execution block for CPN. Section 6 discusses the system evaluation results. Finally, the paper ends with some conclusions and future work.
PRELIMINARIES

Network processing hardware
PEs in networking hardware are traditionally either ASICs or general-purpose processors (GPPs). GPPs are preferred due to their flexibility while ASICs provide better performance. Recently, network processors have emerged offering the best of both worlds. A network processor (NPU) is a programmable processor that is optimized to perform one or more of the following functions: packet classification, packet modification, buffer management and scheduling, and packet forwarding. Note that this is not an all-encompassing definition and excludes GPPs, which are still being used for packet forwarding in routers. It is expected that GPPs will be replaced by programmable network components for packet processing. However, GPPs will still find use for initializing, configuring and orchestrating the NPU control path. There is also the notion of a specialized co-processor, which is used hand in hand with a network processor. In general, a co-processor is used for more specialized tasks, and is more likely to be shared among multiple PEs. Common networking functions that are implemented as co-processors are search engines, classification, buffer management, encryption/decryption and traffic management.
Most of the currently available NPU designs employ such hardware-oriented techniques as pipelining and parallel processing, as well as software-oriented techniques such as multi-threading and special-purpose instructions [8, 9]. NPUs achieve high performance by taking advantage of inherent data parallelism in the networking applications. There are two common parallel packet processing architectures as illustrated in Figure 1. The first is task-level parallelism, in which a packet is divided into smaller units and processed by multiple engines. Each unit requires a different task and includes a task-specific processing engine, which performs a unique operation to extract information from the particular components. All processed units are then integrated back and prepared for departure. The second parallel architecture is called packet-level parallelism, which allows the processing of whole packets by many different engines. In this approach, multiple incoming packets can be processed at the same time, as in a super-scalar architecture.
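The difference between the two parallel styles can be illustrated with a small software sketch. It is only an analogy under stated assumptions: the engine functions `parse_header` and `inspect_payload`, the packet dictionary layout and the use of a thread pool are all hypothetical and stand in for hardware engines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task-specific engines; each one handles one part of a packet.
def parse_header(packet):
    return {"dst": packet["header"]["dst"]}

def inspect_payload(packet):
    return {"payload_len": len(packet["payload"])}

TASK_ENGINES = [parse_header, inspect_payload]

def task_level(packet, pool):
    """Task-level: one packet is split across several engines, results are merged."""
    futures = [pool.submit(engine, packet) for engine in TASK_ENGINES]
    merged = {}
    for future in futures:
        merged.update(future.result())
    return merged

def process_whole_packet(packet):
    """A single engine that performs every task on one packet."""
    result = {}
    for engine in TASK_ENGINES:
        result.update(engine(packet))
    return result

def packet_level(packets, pool):
    """Packet-level: several whole packets are processed at the same time."""
    return list(pool.map(process_whole_packet, packets))

packets = [{"header": {"dst": d}, "payload": b"x" * n} for d, n in [(1, 10), (2, 20)]]
with ThreadPoolExecutor(max_workers=4) as pool:
    task_results = [task_level(p, pool) for p in packets]
    packet_results = packet_level(packets, pool)
```

Both styles produce the same per-packet result; they differ only in how the work is distributed over the engines.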
Upper layer protocol processing
Traditionally, network processing has been limited to layer 3, or purely packet forwarding. However, currently we see a shift in this paradigm in which increasingly more services need to be provided, possibly utilizing information from higher layers. For instance, a router with layer 4-7 processing capability can gather application-level information about the user and the details of the request. This information can be used for intelligent load balancing among the servers as well as for intrusion detection [10]. Upper layer processing also helps delivery of the highly granular QoS required for time-sensitive applications such as voice and video over IP, and real-time business transactions. One of the most important properties of layer 7 processing is the large variation of tasks at this level. The tasks in this layer are usually more complicated than the traditional router tasks (i.e. layer 2-3 tasks such as IP forwarding). Upper layer protocol processing requires more than just TCP/IP header processing. Most upper layer applications require matching a string in the content of the packet through a filter. Unlike fixed-length fixed-location fields in the header, the size of the payload varies from packet to packet. This makes the string matching harder. Besides this, a layer 4-7 router needs to perform many tasks such as TCP header modification, connection setup, address translation, server assignment, accounting and billing. All these tasks are required not only for each new session, but also for thousands of them simultaneously. As a result, implementing higher layer processing at high speeds, e.g. 40 Gbps, requires innovative technologies. We address this in our proposed architecture by including task-specific customized logic blocks in PEs.
CPN and Upper Layer IP Execution 191
HYBRID NETWORK PROCESSOR ARCHITECTURE
The need for high performance computing in network processors is satisfied by deploying a large number of PEs and multi-threading techniques. Most NPUs show a similar architectural characteristic: a general purpose CPU core combined with multiple PEs. PEs include some network processing tailored instructions in their instruction sets. A common NPU architecture is shown in Figure 2. The core CPU orchestrates the individual PEs and communication tasks. It also assists in packet context switching (i.e. task distribution among PEs). The PEs are assigned specific tasks by the core CPU based on the load and performance requirements of the NPU. The architecture also allows an external host processor which can be located on the line card and assists the core CPU in control-plane processing and system (protocols) management. In other words, it assists the core CPU in control of new protocols, exception processing and complex internetworking applications. In the NPU architecture each PE requires local memory to store values which provide rules for its functionality. In addition to local memory, there is a need to use shared memory. The SRAM-based shared memory moves data among the functional units. Several system buses are proposed to provide point-to-point connections between the components. This improves the processor and memory bandwidth.
The trend going from hardwired NPU ASICs to programmable NPUs has been so fast that many companies have actually neglected the very idea of building network processing functions in hardware and implemented them in software running on microengines. We believe this type of architecture will have difficulty sustaining the line rates when we approach hundreds of gigabits per second. In order to achieve wire-speed processing for higher layer protocols we propose an architecture that contains PEs with hard-wired TSL for already known fixed applications. This architecture will also enable the rapid deployment of new network services and distributed applications through a general-purpose computation unit. The proposed architecture for a PE is shown in Figure 3. This PE architecture has general purpose registers, special purpose registers, local memory, a controller unit and output registers. Each PE contains a replication of the same computation logic unit (CLU). Further, there is a TSL unit in each PE to differentiate its functionality. The architecture is unique in that it can achieve both task-level parallelism (through TSL) and packet-level parallelism (through CLU). Each TSL unit is hard-wired and is optimized to perform a specific task (e.g. address translation, packet modification, lookup, classification).
T. Kocak
An example TSL is shown in Figure 4. This is a multipurpose content inspector, which can be used for any layer. Each configurable filter compares the filter value with the content of the packet and returns a match or miss. The CLU instruction set is a blend of conventional RISC instructions with additional features specifically tailored for network processing. Further, the CLU provides the TSL with the required parameters to add processing options to adapt to new packet protocols and network options. PEs contain both TSL and CLU units so that they can share specific and general registers as well as the local memory module within each PE. Hardware sharing within each PE reduces the chip area. The combination of TSL and CLU functionality increases system throughput, control, scalability and network protocol flexibility. Two parallel buses are used to provide higher bandwidth for communication between the modules.
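As a rough software analogy of the content inspector, each configurable filter can be modelled as a byte pattern that is searched for in the packet content and reports a match or a miss. The `Filter` class and `inspect` function are hypothetical names, and a substring search is only an assumption about what "compares the filter value with the content" means; the hardware block in Figure 4 may compare differently.

```python
from dataclasses import dataclass

@dataclass
class Filter:
    value: bytes  # pattern the filter has been configured with (assumed byte string)

def inspect(content: bytes, filters):
    """Return 'match' or 'miss' for every configured filter, in order."""
    return ["match" if f.value in content else "miss" for f in filters]

# Usage: two filters applied to the same packet content.
results = inspect(b"GET /index.html HTTP/1.1", [Filter(b"GET"), Filter(b"POST")])
```

In hardware the filters would operate side by side on the same content; the list comprehension here evaluates them one after another.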
COGNITIVE PACKET NETWORKS
CPN [7] is a packet-switched network in which QoS-based routing goals are accomplished by a learning algorithm [11] . CPN has been already described by Gelenbe et al. in great detail in [5, 6, 12, 13, 14] ; in this section, we briefly overview the important features of the CPN model for the sake of completeness. CPN employs three types of packets: smart packets, payload packets, and acknowledgment packets. Smart packets act as network explorers for different QoS-source-destination (QSD) sets. As a smart packet traverses the network, its next hop is determined by the experiences of previous smart packets with the same QSD parameters. When the smart packet arrives at its destination, the destination node generates an acknowledgment that follows the reverse course recorded by the smart packet. As the acknowledgment travels to the smart packet's source, it delivers network measurement data to the mailboxes (MBs) in each router on its path. These data describe the quality of connections between the particular router and its neighbors. The MB at each node is a special type of data structure that is updated by each acknowledgment packet going through that node. Payload packets are source routed. The path that a payload packet will follow is determined by the data that the acknowledgment packets return to the source node. Each packet type contains a Cognitive Map (CM) that stores its path and data from the trip. CPN uses a fully recurrent random neural network (RNN) model [15, 16] with a reinforcement learning (RL) algorithm [11] to determine the next hop for the smart packets. The RNN is an analytically tractable spiked neural network model that has been implemented in a wide range of applications [17] . The function of the RNN in the CPN model is to capture the effect of the unpredictable network parameters and convert it into a routing decision. There are different learning algorithms that may be applied to the RNN model. 
The gradient descent algorithm has been used with feed-forward topologies in many applications [17] . This algorithm has two distinct modes: offline training and online execution. However, due to the dynamic nature of the CPN parameters, this type of training algorithm cannot be utilized.
An n-port CPN router has an n-neuron RNN within it (each neuron corresponds to a particular port). After a smart packet enters the router, the next hop is determined by referencing the neurons of the RNN. The neuron with the highest output value (in RNN terminology, q: the steady-state probability of being excited) represents the outgoing port that the smart packet will use. The acknowledgment packet that returns will carry data, which will either deter or promote the decision to use this port again by updating the weights between the neurons.
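The port selection described above can be sketched as follows. The steady-state equation used here, q_i = λ⁺_i / (r_i + λ⁻_i), is the commonly published RNN formulation and is not stated in this paper, so it should be read as an assumption; the function names, the weight values and the fixed number of iterations are likewise illustrative.

```python
def steady_state(w_plus, w_minus, ext_exc, ext_inh, iterations=50):
    """Iterate the (assumed) RNN equations to approximate the excitation probabilities q."""
    n = len(ext_exc)
    # r[i]: total firing rate of neuron i, the sum of its outgoing weights.
    r = [sum(w_plus[i]) + sum(w_minus[i]) for i in range(n)]
    q = [0.5] * n
    for _ in range(iterations):
        new_q = []
        for i in range(n):
            excitation = ext_exc[i] + sum(q[j] * w_plus[j][i] for j in range(n))
            inhibition = ext_inh[i] + sum(q[j] * w_minus[j][i] for j in range(n))
            new_q.append(min(1.0, excitation / (r[i] + inhibition)))
        q = new_q
    return q

def next_port(q):
    """The neuron with the highest q identifies the outgoing port."""
    return max(range(len(q)), key=lambda i: q[i])

# Usage: a 4-port router whose excitatory weights favour port 2.
n = 4
w_plus = [[0.0 if i == j else 0.25 for j in range(n)] for i in range(n)]
w_minus = [[0.0 if i == j else 0.25 for j in range(n)] for i in range(n)]
for i in range(n):
    if i != 2:
        w_plus[i][2] = 0.5
q = steady_state(w_plus, w_minus, ext_exc=[0.5] * n, ext_inh=[0.0] * n)
chosen = next_port(q)
```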
Applicability of the CPN concept and some examples
Applicability of the CPN concept has been demonstrated through several software implementations. In [6], the experiments are run on a test-bed to deliver real-time voice over CPN. The test-bed is based on PCs running Linux. All the nodes/PCs are configured to use CPN and IP packets at the same time. There is a 'tunneling' program running on every CPN interface router that allows sending IP packets through the CPN cloud. In recent work, the CPN model has been extended to mobile ad hoc networks [18, 19]. Battery lifetime and signal-to-noise ratio of communication channels are included within the routing algorithm to reduce the error rate. CPN has some further advantages over TCP/IP when it comes to security and dependability. In the CPN model, denial-of-service attacks can be prevented by maximizing the dispersion as the goal function of smart packets to distribute the packets on various nodes and avoid sniffers. CPN also reduces the impact of address spoofing dramatically. Since the packet is source routed, even if the sender's address is spoofed, the exact route the packet travelled through the network will still be known; thus bad traffic can always be traced back to the source's first hop. Furthermore, the nodes in CPN have the provisional power to inspect the packet, and take such action as is necessary to discard or encapsulate the packet if it represents a threat.
IMPLEMENTATION OF THE CPN PROTOCOL
In this section, we describe an example implementation of the CPN protocol on the proposed hybrid NPU architecture. There are five major modules that need to be considered whether the implementation of CPN is in software or hardware. The interaction of these modules is shown in Figure 5. Below, we provide a functional description for each module.
System controller
The CPN system controller coordinates the interaction of the functional units in the CPN implementation. Instead of using a shared common bus as an interconnect, the system controller has separate 32-bit connections to each major module to improve the throughput. Two major functions are implemented in the system controller: (i) delivering a packet from the input queue to the appropriate idle packet processing unit (e.g. a smart packet is forwarded to the smart packet processor (SPP) as long as it is idle), and (ii) selecting the path based on the packet's destination. There are three cases for this path selection. First, if the packet is destined to this node and it is a smart packet, the acknowledgment MB is activated along with the packet generator to generate an acknowledgment packet; if it is an acknowledgment packet, the packet generator forms a payload packet. Second, if the packet is destined to a neighboring router, the system controller forwards the packet to the output port connected to that neighbor. Third, if the packet is at an intermediate node, the system controller forwards the packet to the corresponding functional unit (e.g. a payload packet goes to the payload packet switch).
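The three path-selection cases can be written down as a small decision function. This is a sketch only: the string labels, the `neighbors` mapping from node address to port number, and the handling of a payload packet that has reached its destination (a case the text does not describe) are assumptions.

```python
def select_path(packet_type, destination, this_node, neighbors):
    """Return the unit or output port a packet is handed to (labels are illustrative)."""
    # Case 1: the packet is destined to this node.
    if destination == this_node:
        if packet_type == "smart":
            return "ack_mailbox+packet_generator"   # generate an acknowledgment packet
        if packet_type == "ack":
            return "packet_generator"               # form a payload packet
        return "local_delivery"                     # assumption: not covered in the text
    # Case 2: the destination is a directly connected neighbor.
    if destination in neighbors:
        return f"output_port:{neighbors[destination]}"
    # Case 3: intermediate node, hand over to the matching functional unit.
    units = {
        "smart": "smart_packet_processor",
        "ack": "ack_mailbox",
        "payload": "payload_packet_switch",
    }
    return units[packet_type]

# Usage: node 5 with neighbors 6 and 7 attached to ports 0 and 1.
neighbors = {6: 0, 7: 1}
decision = select_path("payload", 9, this_node=5, neighbors=neighbors)
```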
Payload packet switch
In CPN, the acknowledgment and payload packets are source routed. The source node for these packet types inserts the routing information in the CM, which is located right after the packet header. The routing information holds the addresses of the next hops in order. The payload packet switch extracts the next hop information from the CM with a simple algorithm. Then, the payload packet is forwarded to the I/O port connected to the next hop through the system controller.
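The paper does not spell out the extraction algorithm or the CM layout. One minimal possibility, assuming the CM holds the complete ordered list of node addresses on the route and that addresses are not repeated, is to locate the current node and return the entry that follows it; `next_hop` and the example addresses are hypothetical.

```python
def next_hop(cm_route, current_node):
    """Return the address after current_node in the ordered route, or None at the end."""
    position = cm_route.index(current_node)   # raises ValueError if the node is not on the route
    if position + 1 < len(cm_route):
        return cm_route[position + 1]
    return None

# Usage: a source-routed payload packet travelling 3 -> 8 -> 5 -> 12.
route = [3, 8, 5, 12]
hop_at_8 = next_hop(route, 8)
hop_at_12 = next_hop(route, 12)
```

A hardware implementation would more likely keep a hop pointer in the header than search the list, so this is an illustration of the idea only.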
Acknowledgement mailbox
The acknowledgment packets collect measurements about the network status, and deposit them in the MB at each node. These measurements are used to calculate a reward value which determines whether the routing algorithm made a right decision when the smart packet was travelling through the current node. Once this value is calculated, it is forwarded to the SPP where this value is used to modify the neural network parameters. If the acknowledgment packet is in transit (i.e. the current node is not its destination) the next hop information is extracted from the CM very much like the payload packet switch.
Packet generator
The packet generator module generates all three types of packets. A smart packet is generated when the node acts as a source node and an efficient path to the destination has not already been determined by another smart packet. A payload packet is generated when the node acts as a source node and has all the necessary routing information to find the destination node. The routing information can be known ahead of time because of the previous acknowledgments, or it can be extracted from the just arrived acknowledgment packet. An acknowledgment packet is generated when a smart packet is received and it is destined to the current node.
Smart packet processor
The SPP determines the next hop address for the incoming smart packets. The SPP uses the flow information such as QoS, source and destination addresses as well as the previous experience (if any) on routing packets from the same flow to make its decision. The previous experience is stored in the RNN model, which is trained by an RL algorithm. The neural network parameters are modified (i.e. the experience is gained) based on the reward value calculated with the data from the acknowledgment packets. Thus, the SPP needs to interface with the MB as well. The SPP is computationally more complex than the other modules. Therefore, we partition the CPN implementation into two such that the SPP is implemented in an application-specific, hard-wired TSL architecture while the rest of the modules are implemented in software to be executed in the programmable CLU in the PE. This choice of software implementation for the CPN system controller also gives an opportunity for future updates in the CPN algorithm.
SYSTEM EVALUATION
In order to evaluate the hardware design for the CPN protocol at the system level, a basic 4-port router is implemented in VHDL. The router shown in Figure 6 houses a single PE which includes the CPN TSL unit. Note that we use only the necessary modules from the hybrid network processor architecture that are required to confirm the proper functionality of the CPN implementation. In the following sub-sections, we describe the I/O port architecture and the CPN TSL unit. As mentioned in the previous section, all the other modules except the SPP are implemented on the CLU processor. The CLU instruction set in this work adopts the conventional RISC paradigm. Design and implementation of the SPP is reported in our earlier paper [20] . In this work, we present how it can be integrated as a TSL unit to implement the CPN protocol on the proposed hybrid NPU architecture. In addition, this paper provides network performance analysis which was not discussed in the previous paper. One of the main contributions of this work is to show that any new network model can be implemented side by side with the ubiquitous TCP/IP protocol in hardware for wire-speed performance. This is especially true when it comes to computationally more difficult new algorithms than the simple IP forwarding such as the neural network component of the CPN model.
I/O port architecture
In this test router implementation, there are four I/O ports. The architecture of an I/O port is illustrated in Figure 7. Both input and output queues can store up to 30 memory addresses in their registers. Each register in the queue is 12 bits long and the first 2 bits indicate the type of the packet stored at that address. The remaining 10 bits represent the memory address where the packet is stored. The I/O controller manages the communication between the dual-port RAM (DPR), the input/output queues and the system controller.
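The 12-bit queue register format can be illustrated with a pair of helper functions. Two details are assumptions: that the "first 2 bits" are the most significant ones, and the numeric codes in `PACKET_TYPES`, which the paper does not give.

```python
TYPE_BITS = 2
ADDR_BITS = 10
PACKET_TYPES = {0: "smart", 1: "payload", 2: "ack"}  # hypothetical type codes

def pack_entry(packet_type, address):
    """Build a 12-bit queue entry: 2-bit type in the upper bits, 10-bit address below."""
    if not 0 <= packet_type < (1 << TYPE_BITS):
        raise ValueError("packet type does not fit in 2 bits")
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 10 bits")
    return (packet_type << ADDR_BITS) | address

def unpack_entry(entry):
    """Split a 12-bit queue entry back into (packet_type, address)."""
    return entry >> ADDR_BITS, entry & ((1 << ADDR_BITS) - 1)

# Usage: an acknowledgment packet (assumed code 2) stored at DPR address 0x003.
entry = pack_entry(2, 0x003)
fields = unpack_entry(entry)
```

The 10-bit address field is enough to reach every word of a 1K-word memory.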
The DPR is 32 × 1K in size and has two separate ports (A and B). Incoming packets from other nodes are written to the DPR through port A. Outgoing packets to other nodes are read from port B. The processed packets from modules (SPP, acknowledgment MB, packet generator, payload packet switch) are written to the DPR through port B. Outgoing packets to the other modules within the router are read from port A. Thus, the input queue communicates with the DPR through port A, and the output queue communicates with the DPR through port B. Read and write operations can be done simultaneously through both ports while the I/O controller can manage simultaneous write operations. The I/O port architecture design is verified through simulations using the Synopsys Design Analyzer. Figure 8 depicts a snapshot from the simulation conducted for the write operations in the DPR for the incoming packets. First, the 'data_valid' signal is set to high, which means the node connected to the I/O port is ready to send a packet. Packets are received in 32-bit segments. The I/O controller parses the first 32 bits to determine the length of the packet. In the header of the CPN packet, bits 31-16 determine the packet length. In this simulation, the first 32 bits of the packet contain '00801234'. Therefore, the length of the packet is '0080' = 8 × 16 = 128 bits, that is, 128/32 = 4 fragments of 32 bits each. Since each word in RAM is 32 bits long, we will need 4 memory locations to store this packet. Thus, the four consecutive words are stored in memory locations 000, 001, 002 and 003 (in hexadecimal). When the I/O controller detects the end of the packet, it changes its state to FINISH_PACKET. The process continues by parsing the incoming data to determine the length of the next packet.
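The length calculation in this simulation example can be reproduced in a few lines. The function name `packet_words` is hypothetical, and rounding up to a whole word for lengths that are not multiples of 32 bits is an assumption; the example in the text divides evenly.

```python
WORD_BITS = 32

def packet_words(first_word):
    """Read the length field (bits 31-16, in bits) and return (length_bits, words_needed)."""
    length_bits = (first_word >> 16) & 0xFFFF
    words_needed = -(-length_bits // WORD_BITS)   # ceiling division
    return length_bits, words_needed

# Usage: the first 32 bits seen in the simulation are 0x00801234.
length_bits, words = packet_words(0x00801234)
addresses = [format(a, "03x") for a in range(words)]   # DPR locations starting at 000
```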
CPN TSL unit
In the hybrid NPU architecture, the TSL units can also be made reconfigurable. However, in this work we chose not to use FPGAs, in order to exploit the high performance and speed of a full-custom ASIC design. The block diagram of the TSL unit for CPN (i.e. the SPP design) is shown in Figure 9a.
The SPP has four major components: the smart packet interface, the RL algorithm, the neuron array, and the weight storage table. The SPP interacts with the system controller through the SP interface. The RL algorithm module updates the neural network parameters based on the reward value received from the MB. The neuron array calculates the RNN output qs. The weight storage table holds the neural network parameters (weights) as well as the flow information (QSD indexes). The table architecture as shown in Figure 9b employs a table controller, a content addressable memory (CAM), and a random access memory (RAM). The CAM enables parallel searches upon all the words stored in it. For this implementation, the CAM words represent the QSD combinations of different RNNs. Each row of the RAM in this table architecture is used to store the threshold, decisions and weights of a particular RNN. The memory structures in the table are made dual port to increase the throughput.
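The CAM-plus-RAM organization of the weight storage table can be mimicked in software. In this sketch a Python dictionary plays the role of the CAM (a hash lookup rather than a true parallel comparison), a list plays the role of the RAM, and the class name, the QSD tuple format and the row contents are all hypothetical.

```python
class WeightTable:
    """Software stand-in for the table: the CAM maps a QSD key to a RAM row index."""

    def __init__(self, default_row):
        self.cam = {}            # QSD key -> row index (models the CAM match)
        self.ram = []            # row index -> threshold/decision/weights record
        self.default_row = default_row

    def lookup(self, qsd):
        """Return the stored row for this QSD, or a copy of the defaults on a miss."""
        row = self.cam.get(qsd)
        if row is None:
            return dict(self.default_row)
        return self.ram[row]

    def store(self, qsd, record):
        """Write a record, allocating a new row the first time a QSD is seen."""
        row = self.cam.get(qsd)
        if row is None:
            self.cam[qsd] = len(self.ram)
            self.ram.append(record)
        else:
            self.ram[row] = record

# Usage: QSD keys are assumed to be (qos, source, destination) tuples.
table = WeightTable(default_row={"threshold": 1.0})
missing = table.lookup(("delay", 3, 12))
table.store(("delay", 3, 12), {"threshold": 0.44})
found = table.lookup(("delay", 3, 12))
```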
Circuit implementation
The CPN TSL unit is implemented in VHDL. The behavioral model for the design is synthesized with Synopsys tools using 0.6 µm CMOS library cells to obtain the hardware circuit implementation. For the synthesis, some optimization constraints, such as maximum area, maximum delay and clock specifications, are set. For instance, the processor clock frequency is set at 50 MHz. The synthesized gate-level netlist is imported to Cadence Silicon Ensemble for floorplanning, placing and routing of the design. The obtained layout for the design is shown in Figure 10. The core occupies 6.46 mm² in a 3-metal single-poly 0.6-µm digital CMOS process.
We ran some simulations in Synopsys Design Analyzer to test the execution of the learning algorithm. The results are depicted for a punishment scenario in Figure 11 . The MB triggers the SPP with a start signal (start_ack) when it calculates a new reward value. It passes the QSD along with the port number to the SPP. The RL module reads the threshold and the weights from the weight storage table. If there is no match for the QSD in the table then the default values are loaded for the threshold and the RNN weights. The default value is 08000 in hexadecimal with 15 bits after the radix point.
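To interpret the default value, the raw word can be converted from fixed point to a real number. The sketch assumes an unsigned format in which the lowest 15 bits are the fractional part; the helper names are hypothetical.

```python
FRACTION_BITS = 15

def to_real(raw):
    """Convert a raw fixed-point word (15 fractional bits) to a float."""
    return raw / (1 << FRACTION_BITS)

def to_fixed(value):
    """Convert a float to the nearest raw fixed-point word."""
    return round(value * (1 << FRACTION_BITS))

default_value = to_real(0x08000)   # 0x08000 = 32768 = 2**15
```

Under this reading the default threshold and weights correspond to the real value 1.0.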
The RL component compares the reward value with the threshold. The reward value in Figure 11 is less than the threshold. Thus, the RL component will punish the previous decision by reducing the probability of selecting this port again for the same flow. This is done by increasing the inhibition weight for that port (w−) and the excitation weights (w+) associated with the other ports. Then, the weight terms are normalized to prevent them from growing unbounded. After the weight updates, the new steady-state outputs (qs) are calculated through a few iterations.
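A compact sketch of this reward/punish step is given below. It loosely follows the RL rule commonly published for CPN; the division of the reward by n − 2 for the other ports, the row-wise normalization that preserves each neuron's total rate, and the smoothing constant `a` for the threshold are assumptions that are not stated in this paper.

```python
def rl_update(w_plus, w_minus, port, reward, threshold, a=0.8):
    """Reward or punish the decision `port`, normalize, and return the new threshold."""
    n = len(w_plus)                      # requires n > 2
    old_rates = [sum(w_plus[i]) + sum(w_minus[i]) for i in range(n)]
    success = reward >= threshold
    for i in range(n):
        for k in range(n):
            if k == i:
                continue                 # no self-connections
            if k == port:
                # Chosen port: excite on success, inhibit on punishment.
                (w_plus if success else w_minus)[i][k] += reward
            else:
                # Other ports receive the opposite, smaller adjustment.
                (w_minus if success else w_plus)[i][k] += reward / (n - 2)
    # Normalize each row so that weights do not grow without bound.
    for i in range(n):
        scale = old_rates[i] / (sum(w_plus[i]) + sum(w_minus[i]))
        w_plus[i] = [w * scale for w in w_plus[i]]
        w_minus[i] = [w * scale for w in w_minus[i]]
    return a * threshold + (1 - a) * reward   # assumed smoothed threshold

# Usage: punishment scenario, reward below threshold for port 1 of a 4-port router.
n = 4
w_plus = [[0.0 if i == j else 0.25 for j in range(n)] for i in range(n)]
w_minus = [[0.0 if i == j else 0.25 for j in range(n)] for i in range(n)]
new_threshold = rl_update(w_plus, w_minus, port=1, reward=0.2, threshold=0.5)
```

After the call, the inhibitory weights towards port 1 exceed the excitatory ones, which lowers the chance of that port being selected again for the flow.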
Hardware performance
Worst-case delay (or critical path delay) is usually used to describe the performance of digital circuit implementations. The critical path delay for the synthesized CPN TSL unit is measured as 30.58 ns. This measurement is the sum of all critical path delays for the individual modules (i.e. SP interface, RL algorithm, and the weight storage table). Thus, it takes as much as 30.58 ns to process a smart packet in a node where CPN is implemented in hardware. This corresponds to 32.7 M smart packets per second. With an average 40-byte (header + CM) smart packet size, each node can process 10 Gbps at wire speed, which is not possible with the software implementation of CPN running on GPPs. In addition, our hardware implementation uses CAM for the QSD parameter search in the table. In comparison, the software runs an algorithmic search which suffers from long latency due to the significant number of memory accesses. CAM-based parallel search techniques offer the only viable high-speed solution for the next-generation networks.
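The packet rate and line rate quoted above follow directly from the critical path delay and the average packet size. The short calculation below reproduces them, assuming one smart packet is completed per critical-path interval; the variable names are illustrative.

```python
critical_path_ns = 30.58      # worst-case delay per smart packet
packet_bytes = 40             # average smart packet size (header + CM)

packets_per_second = 1e9 / critical_path_ns
line_rate_gbps = packets_per_second * packet_bytes * 8 / 1e9

print(f"{packets_per_second / 1e6:.1f} M packets/s, {line_rate_gbps:.2f} Gbps")
```

The result is about 32.7 million packets per second and slightly more than 10 Gbps, consistent with the figures in the text.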
Network performance analysis
After verifying the operation of the design in an isolated environment, the next step is to create a networked environment to measure the performance. The network configuration is very similar to the one used in the CPN software test bed [5, 6]. Several instances of the synthesized test router module are used to form the network. Figure 12 shows the average delay of smart and acknowledgment packets for various payload sizes. The average delay represents the average latency that a packet experiences from entry to exit. Smart packets are of variable size and consist basically of three areas: a header, a CM and a data portion (payload). In this experiment, the smart packets were configured to carry payload as well. The latency for the smart packets increases with the packet size as expected. Since acknowledgment packets consist of only a header, their latencies are not affected by the payload size change. As in the software implementation of the CPN testbed [5], it is modeled such that each port of a CPN node uses a 10 Mbps Ethernet link connected with another CPN node. Figure 13 shows the throughput of smart packets for various payload sizes. Throughput is determined as the amount of user data transferred by the CPN. In theory, as packet size increases packet throughput should also increase. In our simulations, we notice that packet throughput increases non-linearly as the payload size increases, except for a small drop at packet size = 2048 bytes. This small drop is within the simulation tolerance limits, and as can be seen from the figure, the throughput is asymptotically converging between 2.5 and 3 Mbps.
We have also looked at packet loss probability and the results are illustrated in Figure 14 . The x-axis represents the inter-packet time for the main flow of packets. The experiment is conducted with a fixed input rate for a payload size of 250 bytes. We observe a significant improvement in packet loss probability as the transmission time increases. This is due to the fact that the buffer sizes in the nodes are not large, thus some packets are dropped if they cannot be forwarded to the next node.
CONCLUSION
Higher layer protocol processing will be an integral part of next-generation network processor (NPU) designs. In this paper, we presented a hybrid NPU architecture that supports high-speed execution of higher layer TCP/IP protocol processing. The architecture achieves this by employing optimized logic blocks for specific tasks in the PEs. This is a unique architecture that can achieve both task-level and packet-level parallelism. We also presented an example implementation of a fast adaptive routing algorithm, CPNs, using the proposed NPU architecture. Implementation of the blocks in the CPN TSL unit is completed in VHDL and is verified through initial simulations using Synopsys Design Analyzer. The design simultaneously services smart packets while integrating the RL algorithm. To accomplish this, the CPN TSL unit incorporates several interacting state machines with a dual-port memory structure for storing and accessing the parameters of multiple neural network models. A basic 4-port router has also been developed in VHDL to test the CPN TSL design at the system level. Future directions for this work include development of PEs and the rest of the modules in the hybrid network processor and then the integration of the CPN TSL unit with these modules.
