NT SoC dedicated for offloading of protocol processing tasks in network terminals is presented. The [1], but in the future the terminals will also require offloading using programmable high-speed solutions.
In chapter 2, our dual-processor terminal protocol processing architecture is introduced. Chapter 3 discusses the control path of one of the two processors, with a focus on the performance-limiting task of program flow selection. In chapter 4, three implementation strategies for this control path are proposed. Chapter 5 lists performance figures for two of the alternatives, and chapter 6 gives conclusions and directions for further work.
II. PROGRAMMABLE PROTOCOL PROCESSOR
A Protocol Processor (PP) architecture intended to be used as an offloading device in a network terminal was proposed by the authors in [2]. Like most PPs, it consists of more or less programmable devices that accelerate and offload a host processor by handling the communication protocol processing. The protocol processor is a domain-specific processor that has superior performance over general purpose CPUs but still provides flexibility through programmability within the application domain. The proposed architecture has a unique dataflow-based strategy for storage and wire-speed processing of incoming packet data. Instead of storing the data in an input buffer before it is processed, as traditional network processing hardware does, the proposed architecture manages some of the fast path processing before the packet is either discarded or stored for further processing. An overview of the PP, which is a dual-processor architecture, is given in figure 1. The first processor is a general purpose microcontroller responsible for the control-intensive processing of the slow path. This type of processing task is common in upper protocol layers such as TCP/UDP. The second component is the Programmable Protocol Processor (PPP), which is responsible for the high-performance acceleration of the data-intensive processing tasks, i.e. packet decoding. This fast path processes the data at wire speed as it streams through a chain of flip-flop based registers. The PPP consists of a number of accelerators denoted Functional Pages (FP) (e.g. [8]
Figure 3. Branch unit supporting single clock cycle program flow selection. The critical path includes extraction of packet header data, comparison and branch decision.
the network clock). By using the same clock frequency, the requirements on synchronization with the data flowing through the PPP become very strict.
Since the data is only available to each FP during one clock cycle, it is necessary to start and stop the processing in the FPs at exactly the right clock cycle. The concept of synchronized protocol processing was first introduced by Henriksson et al. in [4], even if the proposed implementation is slightly different.
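The timing constraint described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' implementation: the chain length, the tap positions and the FP names (`crc_fp`, `addr_fp`) are hypothetical, and the model simply records the clock cycle in which each word is visible to each FP.

```python
from collections import deque


def stream_through_chain(words, chain_length, taps):
    """Shift `words` through a register chain, one word per clock cycle.

    `taps` maps a (hypothetical) FP name to the index of the register it
    reads. Returns {fp_name: [(cycle, word), ...]}, i.e. the single cycle
    in which each word was visible to each FP.
    """
    chain = deque([None] * chain_length, maxlen=chain_length)
    seen = {name: [] for name in taps}
    # Keep clocking after the last word so that the chain drains completely.
    for cycle, word in enumerate(list(words) + [None] * chain_length):
        chain.appendleft(word)  # new word enters register 0, oldest word drops out
        for name, index in taps.items():
            if chain[index] is not None:
                seen[name].append((cycle, chain[index]))
    return seen


# Three 32-bit words of a packet; the word width is an assumption.
packet = [0xAAAA0001, 0xBBBB0002, 0xCCCC0003]
seen = stream_through_chain(packet, chain_length=4,
                            taps={"crc_fp": 0, "addr_fp": 2})
```

In this model, word number i reaches the FP tapped at register k in cycle i + k and is gone one cycle later, which is why the control path has to start and stop each FP in exactly the right cycle.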
The use of a single clock domain simplifies the layout and reduces synchronization problems between the FPs and the control path (C&C), i.e. the need for synchronization registers is eliminated. An implementation alternative for the C&C suitable for synchronized special purpose protocol processing is illustrated in figure 3. There are many alternative ways of implementing a hardwired case-statement. Using Content Addressable Memories (CAM) is one rather straightforward alternative ([5] and [6]). The CAM-based branch unit depicted in figure 4 uses the PC value and the flags generated in the FPs as inputs. If there is a match in one of the CAM entries, i.e. a conditional branch is taken, a new instruction and a new instruction fetch address are provided. Since the branch unit is part of the critical path of the PPP, and thereby determines the maximum clock frequency at which it can operate, it is very important to optimize this CAM search. The latency of a CAM search depends mainly on the size of the two search fields and the number of entries in the memory. The latency of the CAM search must be added to the latency of the FP and the muxes to find the critical path of the PPP.
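The behaviour of such a CAM-based branch unit can be sketched in software as follows. This is a minimal functional model, not the hardware described in the paper: the entry layout, the field names and the example values are assumptions made for illustration, and the sequential loop stands in for what a real CAM does as a parallel comparison over all entries.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CamEntry:
    pc: int             # search field 1: program counter value
    flags: int          # search field 2: flag pattern produced by the FPs
    instruction: int    # instruction issued when the entry matches
    fetch_address: int  # next instruction fetch address


def branch_lookup(entries: Sequence[CamEntry], pc: int,
                  flags: int) -> Optional[Tuple[int, int]]:
    """Return (instruction, fetch_address) on a match, else None.

    A match means the conditional branch is taken. None means no entry
    matched, so normal sequential fetch would continue.
    """
    for entry in entries:  # hardware compares all entries in parallel
        if entry.pc == pc and entry.flags == flags:
            return entry.instruction, entry.fetch_address
    return None


# Hypothetical case-statement at PC 4 with two cases selected by FP flags.
table = [
    CamEntry(pc=4, flags=0b01, instruction=0x10, fetch_address=8),
    CamEntry(pc=4, flags=0b10, instruction=0x20, fetch_address=12),
]
taken = branch_lookup(table, pc=4, flags=0b10)
not_taken = branch_lookup(table, pc=4, flags=0b11)
```

The model makes the two cost drivers named above visible: the width of the `pc` and `flags` fields, and the number of entries in `table`.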
C. Pipelined C&C With Branch Unit
Even if a pipelined processor is used, it is possible to accelerate program flow selection using a branch unit. The size of the branch unit and the clock frequency to be used can be optimized after the protocol coverage has been set. The branch unit then makes it possible to accelerate case-statements. Normally, classification is accelerated using pipelined classification engines, e.g. CAMs, but other types of case-statements can also be covered using branch unit acceleration.
Since the branch unit can be pipelined, it is possible to reduce the critical path compared with alternative A, thereby enabling a higher clock frequency while still allowing for a large number of entries in the branch unit memory. In fact, alternative C is the only feasible option for complex terminals with a large number of protocols and of source and destination addresses, since the number of entries in the various case-statements is too large for the other alternatives. Note that synchronization registers are still needed.
V. PERFORMANCE
The synchronized data-flow version of the C&C has been implemented using the AMS 0.35 µm standard cell library. The implemented branch unit supports 16 different conditional branches, each with four case-entries. Static timing analysis of the implemented layout shows that the critical path is 10.9 ns long, which indicates that it can support wire-speed processing at 2.9 Gbit/s. This is comparable with the performance of a configurable CRC FP implemented using the same process. Four entries to the case-statement means that only four destination addresses (including multicast addresses) can be checked.
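The relation between clock rate and wire speed behind these figures can be checked with a few lines of arithmetic. The sketch below assumes that one 32-bit word is processed per network clock cycle. That width is not stated in this section; it is inferred because it reproduces the quoted numbers, both the 2.9 Gbit/s of the synchronized version and the figures reported for the pipelined alternative (588 MHz, with the C&C at four or eight times the network clock). The function and parameter names are illustrative only.

```python
WORD_BITS = 32  # assumption: inferred from the quoted figures, not stated


def clock_mhz_from_path(critical_path_ns):
    """Maximum clock frequency in MHz for a given critical path in ns."""
    return 1000.0 / critical_path_ns


def wire_speed_gbps(core_clock_mhz, clock_ratio=1, word_bits=WORD_BITS):
    """Wire speed in Gbit/s when the C&C runs `clock_ratio` times faster
    than the network clock and one word arrives per network clock cycle."""
    network_clock_mhz = core_clock_mhz / clock_ratio
    return network_clock_mhz * word_bits / 1000.0


# Synchronized C&C: the core clock equals the network clock.
synchronized = wire_speed_gbps(clock_mhz_from_path(10.9))
# Pipelined C&C: 588 MHz core clock, four or eight times the network clock.
pipelined_x4 = wire_speed_gbps(588.0, clock_ratio=4)
pipelined_x8 = wire_speed_gbps(588.0, clock_ratio=8)

print(f"{synchronized:.2f} {pipelined_x4:.2f} {pipelined_x8:.2f}")
```

Under the 32-bit assumption, the three results are roughly 2.9, 4.7 and 2.4 Gbit/s, in line with the values given in the text.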
The critical path of the pipelined C&C is the ALU. Static timing analysis of the layout of the C&C datapath shows that the simple two-stage pipelined ALU can run at 588 MHz when implemented using the AMS 0.35 µm standard cell library. If the C&C runs at four times the clock frequency of the network interface, the PPP can support a network speed of 4.7 Gbit/s. If the protocol coverage is increased, the C&C might have to run at eight times the clock frequency of the network interface (GMII). This, however, still allows for more than 2 Gbit/s of wire-speed processing.
VI. CONCLUSIONS AND FURTHER WORK
This paper proposes three different implementation strategies for the control path of a data-flow based protocol processor. Based on the protocols covered and the terminal type, the C&C can be optimized for high-speed operation. By enabling high-speed and low-latency program flow selection, the overall throughput is optimized. Timing analysis indicates that multi-gigabit network speeds are feasible for a restricted set of protocols when the C&C is implemented in a mature standard cell technology.
As future work, it would be interesting to investigate implementation alternative C with a more complex protocol stack. In particular, the size of the branch unit and the number of pipeline stages have to be carefully optimized using benchmarks.
