Demand for flexible processing has moved general-purpose processing into the data path of networks. With the development of System-On-a-Chip technology, it is possible to put several processors with memory and I/O components on a single ASIC. We present a model of such a system with a simple performance metric and show how the number of processors and cache sizes can be optimized for a given workload. Based on a telecommunications benchmark, we show the results of such an optimization and discuss how specialized hardware and appropriate scheduling can further improve system performance.
in turn, impact performance and also the required off-chip memory bandwidth. This bandwidth requirement is yet another important design constraint and is one which, in certain cases, has led to the adoption of Rambus [10] and other techniques. This paper is concerned with determining the optimal configuration of processors and caches for a given network oriented workload and chip size. The following design parameters are considered:
• Number of processors.
• Size of on-chip caches.
• Number of I/O channels.
• Processing workload.
A performance model is developed whose metric is the total processing power of a network processor in terms of instructions per second. The model is used to explore the design space (i.e., the set of parameters listed above) associated with developing a single chip network processor which has multiple processing units on the chip. The evaluation is performed using an analytical model that takes cache configurations, I/O requirements, and workload characteristics into account. It is shown how the overall system can be optimized for maximum processing power for a given workload. As an example, optimization results are presented using statistics gathered from a benchmark of programs (CommBench [13]) which has been designed to reflect activity typical of network processing applications.
Section 2 characterizes the problem in more detail and provides an overall system design. Section 3 covers the analysis of the optimization problem. Section 4 introduces a benchmark that is used as a sample workload and Section 5 shows the system optimization results. Section 6 discusses extensions to the system and Section 7 summarizes the work.
Design Issues With Multiple Processor Systems-On-a-Chip
As a base model for the analysis, we use the multiprocessor system described in this section.
Background
There are a host of advantages associated with integrating multiple processing units on a single chip and developing what is referred to as a SOC (System-On-a-Chip) network processor. Chief among them are the ability to achieve higher performance and, by using fewer chips, lower cost. Such implementations are however limited by the size of the chip (i.e., silicon real estate) that is feasible (for cost and technology reasons), the packaging technology that can be utilized (to achieve given pin requirements), and the power which can be dissipated (at a given frequency).
Network processors are used in routers to flexibly process data streams where such data streams can be in the form of either fixed-bandwidth cell-stream connections, or packet datagrams. In either case, there is a real-time bound on the time duration permitted for processing a given packet. On a heavily loaded link, data is sent back-to-back in a virtually continuous manner and if the system is to be responsive and reliable, processing of a packet cannot take more than the average interarrival time.
With gigabit links, for example, packet interarrival time is on the order of hundreds of nanoseconds. SRAM access times on the other hand are about 10ns. Thus, with a single processor, only a limited amount of processing can be done before the next packet arrives. However, to increase the amount of processing possible between packet arrivals, one can make use of a basic network traffic property. That is, packets from different data streams can be processed independently and thus parallel processors can be used in this situation without the need for complex synchronization and inter-processor communication. Each processor in such a design handles data packets from different flows and, when a processor becomes idle, a system level scheduler routes the appropriate packet to a processor as necessary. Thus, with an ideal scheduler and n processors, the amount of time to process a packet is extended to n times the packet interarrival time.
With a dozen processors this time can be pushed into the range of microseconds. Note also that the processors can utilize very simple real-time operating systems since they always process only a single packet and generally will not require multitasking and associated complex context switching capabilities.
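The time budget described above can be illustrated with a small calculation. The sketch below assumes minimum-size 64-byte packets arriving back-to-back on a 1 Gb/s link; the packet size, the link rate, and the function names are illustrative assumptions and are not parameters taken from the analysis.

```python
def interarrival_time_ns(packet_bytes: int, link_bps: float) -> float:
    """Back-to-back packet interarrival time in nanoseconds."""
    return packet_bytes * 8 / link_bps * 1e9


def processing_budget_ns(packet_bytes: int, link_bps: float, n_processors: int) -> float:
    """Per-packet time budget with an ideal scheduler and n processors."""
    return n_processors * interarrival_time_ns(packet_bytes, link_bps)


if __name__ == "__main__":
    # Assumed example: 64-byte packets on a 1 Gb/s link.
    print(interarrival_time_ns(64, 1e9))       # budget of a single processor
    print(processing_budget_ns(64, 1e9, 12))   # budget with a dozen processors
```

Under these assumptions a single processor has roughly half a microsecond per packet, while twelve processors extend the budget to about six microseconds.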
In the next section a functional design for a single chip multiprocessor SOC is presented. In use, groups of such chips would be placed on router line cards and appropriately adapted to various link speeds.
Functional Design
For the remainder of the paper we focus on a single architecture that follows the basic ideas described above. The SOC consists of multiple independent processing engines. The memory hierarchy consists of on-chip, per-processor instruction and data caches, and shared off-chip memory. Memory accesses are done through several I/O channels that are shared among sets of processors. The I/O channels are also used by the system controller/scheduler to send packets requiring processing to the individual processors. The overall system is shown in Figure 1.
Typically, a packet is first received and reassembled by the Transmission Interface on the input port of the router. The packet then enters a Packet Demultiplexer which uses packet header information to determine the flow to which the packet belongs. Based on this flow information, the Packet Demultiplexer decides what processing is required for the packet. The packet is then enqueued until a processor becomes available. When a processor becomes available, the packet and the flow information are sent over an I/O channel to one of the processors on the network processor chip. After processing has completed, the packet is returned to the Packet Demultiplexer and enqueued before being sent through the router switching fabric to its designated output port. Note that in this design it is assumed that the network processor chip appears at the router's input ports. Alternatively, it can be positioned at the router's output ports. Its position, however, will have little influence on the design issues considered in this paper. A more detailed functional description of the above design can be found in [14]. Here, we consider the single chip design optimization problem associated with selection of the:
• Number of processors on the chip.
• Cache size per processor cache (and the split between instruction and data cache).
We assume that the processors are general purpose RISC processors that can execute one instruction per cycle if no cache misses occur. Thus the access time to on-chip cache is one cycle. The I/O channel is assumed to be in the style of a Rambus interface [10] .
Analysis
Given that we are interested in the amount of traffic the system can handle, we view the design problem as one of selecting the parameter values which maximize the throughput assuming chip area constraints, reasonable technology parameters, and the operational characteristics of a benchmark of network processing programs.
Throughput in this environment corresponds to the number of packets that can be processed in a given time. This is determined by a combination of the instruction processing requirements of a given application (e.g., number of instructions necessary for routing table lookup, packet encoding, etc.), and the number of instructions that can be executed per second on the network processor. We assume that all packet processing tasks are performed in software on RISC microprocessors (specialized hardware is considered in Section 6.1). Thus, the throughput is proportional to the number of Instructions Per Second (IPS) that can be executed on the system. Given a typical RISC instruction set, network application benchmark characteristics (e.g., fault rates for different cache designs), and various other parameters (e.g., CPU clock rate, cache miss times, etc.), an optimal system configuration, that maximizes IPS, can be determined. For a given chip size, this configuration falls between the following two extremes:
• Too few processors: Each processor has enough cache to execute programs efficiently (i.e., with very low fault rates) but the total number of processors is too low to achieve high throughput.
• Too many processors: Each processor has very little cache and thus the number of cache misses and off-chip memory accesses is high. The processors spend most of the time stalling and, as a result, have a high Clocks Per Instruction (CPI). Application processing is therefore slow and throughput low.
In the remainder of this section an analytic model reflecting the interactions between these various items is developed and an optimized system determined.
Configurations
We begin by defining the fundamental chip area limitations that are present. The network processor chip size limits the number of processors, n, the amount of instruction and data cache (assumed to be identical for each processor), and the number of I/O channels, m, that may be present (parameters and definitions may be found in Table 1). The total area of all components on the chip cannot exceed the chip area, s(ASIC).
With identical processors, cache configurations, and I/O channels this becomes:
$$ n \cdot \big( s(p) + s(c_i) + s(c_d) \big) + m \cdot s(io) \le s(ASIC) \quad (2) $$
Further, we can assume that the best performance is achieved with a set of design parameters which result in an area as close to s(ASIC) as possible. That is, we need to investigate only configurations that try to "fill" the available chip area.
Single Processor
We continue with a simple performance model for a single processor and then extend it to multiple processors. Assume a simple RISC design where the ideal CPI is 1. The actual number of instructions that can be executed with a given instruction cache size $c_i$ (in bytes) and data cache size $c_d$ depends on the number of off-chip memory accesses (i.e., instruction cache misses and data cache misses) and the time associated with each of these accesses. The average number of off-chip memory accesses per instruction, $M$, can be expressed as:
$$ M = m_{ic} + (f_{load} + f_{store}) \cdot m_{dc} \cdot (1 + d_{cd}) \quad (3) $$
where $m_{ic}$ and $m_{dc}$ are the instruction and data cache miss rates, $f_{load}$ and $f_{store}$ are the frequencies associated with load and store instructions, and $d_{cd}$ is the probability that the dirty bit is set in the data cache (a write-back caching mechanism is assumed).
Given an off-chip memory cache line access and transfer time, $t_{mem}$, and a processor clock rate $clk_p$, the average number of cycles per instruction, $CPI$, is:
$$ CPI = 1 + M \cdot t_{mem} \cdot clk_p \quad (4) $$
The total number of instructions that can be carried out per second, $IPS_1$, by a single processor is thus:
$$ IPS_1 = \frac{clk_p}{CPI} = \frac{clk_p}{1 + M \cdot t_{mem} \cdot clk_p} \quad (5) $$
The average off-chip memory bandwidth, $BW_{mem1}$, required by a single processor can now be expressed in terms of the number of instructions per second, $IPS_1$, the number of memory accesses per instruction, $M$, and the cache line size, $linesize$, transferred by each off-chip memory access. Thus:
$$ BW_{mem1} = IPS_1 \cdot M \cdot linesize \quad (6) $$
The I/O channel bandwidth required to transfer packets from the packet demultiplexer to the processors is relatively small compared to the bandwidth generated by memory accesses, and is therefore not considered any further.
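The single-processor model can be sketched in a few lines of code. The sketch below assumes that M combines the miss rates as m_ic + (f_load + f_store) · m_dc · (1 + d_cd), that each off-chip access stalls the processor for t_mem · clk_p cycles, and that the bandwidth is the product of instruction rate, accesses per instruction, and line size. The function names and the sample miss rates are hypothetical placeholders, not CommBench measurements.

```python
def accesses_per_instruction(m_ic, m_dc, f_load, f_store, d_cd):
    """Average off-chip accesses per instruction (assumed form of M)."""
    return m_ic + (f_load + f_store) * m_dc * (1.0 + d_cd)


def cpi(M, t_mem_s, clk_p_hz):
    """Ideal CPI of 1 plus the stall cycles caused by off-chip accesses."""
    return 1.0 + M * t_mem_s * clk_p_hz


def ips_single(M, t_mem_s, clk_p_hz):
    """Instructions per second of one processor."""
    return clk_p_hz / cpi(M, t_mem_s, clk_p_hz)


def bw_mem_single(M, t_mem_s, clk_p_hz, linesize_bytes):
    """Average off-chip memory bandwidth of one processor in bytes/s."""
    return ips_single(M, t_mem_s, clk_p_hz) * M * linesize_bytes


if __name__ == "__main__":
    # Placeholder miss rates and instruction mix (assumptions for illustration).
    M = accesses_per_instruction(m_ic=0.01, m_dc=0.02, f_load=0.2, f_store=0.1, d_cd=0.5)
    print(M, cpi(M, 88e-9, 400e6), ips_single(M, 88e-9, 400e6))
    print(bw_mem_single(M, 88e-9, 400e6, 32))
```

With no cache misses (M = 0) the model reduces to one instruction per cycle, which is a quick sanity check on any implementation.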
Multiple Processors
The single processor analysis can now be extended to the case where there are multiple processors on the chip. With n processors present, Equation 5 becomes:
$$ IPS = n \cdot IPS_1 \quad (7) $$
If contention for the I/O system is ignored, and if all the processors are executing the same workload, then the bandwidth to memory can be expressed simply as:
$$ BW_{mem} = n \cdot BW_{mem1} \quad (8) $$
Equations 7 and 8, however, assume that the I/O system has sufficient bandwidth so that contention between processors does not increase the memory access time $t_{mem}$ since, if $t_{mem}$ increases, $IPS$ will decrease (see Equation 5).
To account for contention and potential queueing delays it is convenient to first define the parameter $l_{io}$, $0 \le l_{io} \le 1$, as the load on the I/O channel associated with the n processors and their memory requests (i.e., $M$). A value of $l_{io}$ of 1 would indicate that the entire bandwidth of the I/O system is being used by the processors. In such a situation, contention would be high and $t_{mem}$ would increase.
The approach taken is to select a value of $l_{io}$ so that contention-based delays are negligible. Once this is done, the I/O system bandwidth becomes a design constraint. Enforcing this constraint effectively makes the queueing delay components of $t_{mem}$ negligible and simplifies the design of an optimal system. The proper selection of $l_{io}$ is considered in more detail in Appendix A. The analysis presented there, for example, indicates that for a typical system with a load of $l_{io}$ = 0.5 and a DRAM access time of 40ns, the probability that a memory request is blocked is about 7% and thus, for this analysis, is ignored.
Given a selected load of $l_{io}$ for the processor-generated memory requests, the constraint on the bandwidth thus becomes:
Now a single I/O channel having a clock rate of $clk_{io}$ and a width of $w_{io}$ will have a bandwidth of $clk_{io} \cdot w_{io}$. Given an overall I/O requirement of $BW_{mem}$ for a low contention I/O design, the number of I/O channels m is simply:
This value of m is used below in determining the chip area requirements for the I/O channels.
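The multiprocessor scaling and the channel count can be sketched as follows. The code assumes that the bandwidth constraint takes the form BW_mem ≤ l_io · m · clk_io · w_io and that m is rounded up to the next integer; this form, the function names, and the minimum of one channel (needed to deliver packets at all) are assumptions made for illustration.

```python
import math


def total_ips(n, ips_1):
    """Aggregate instruction rate of n identical processors."""
    return n * ips_1


def total_bw_mem(n, bw_mem_1):
    """Aggregate off-chip memory bandwidth of n identical processors."""
    return n * bw_mem_1


def io_channels(bw_mem, l_io, clk_io_hz, w_io_bytes):
    """Smallest m with bw_mem <= l_io * m * clk_io * w_io (assumed constraint)."""
    usable_per_channel = l_io * clk_io_hz * w_io_bytes
    return max(1, math.ceil(bw_mem / usable_per_channel))


if __name__ == "__main__":
    # 16-bit (2-byte) channel at 800 MHz, loaded to l_io = 0.5.
    print(io_channels(2.0e9, 0.5, 800e6, 2))
```

For example, a channel that is 2 bytes wide and clocked at 800MHz offers 1.6 GB/s, of which 0.8 GB/s is usable at a load of 0.5, so an aggregate demand of 2 GB/s needs three channels under this assumption.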
Multiple Applications
So far we have considered only a single program to be executed on the processors. A more realistic assumption is that there is a whole set of programs that make up the workload on the processors. The above analysis can easily be extended to accommodate such a workload notion. Let the network processing workload $W$ consist of $l$ applications $a_1, a_2, ..., a_l$. Each application $a_i$ is executed on a fraction $q_i$ of the total data stream ($\sum_i q_i = 1$). The actual number of instructions that are executed by an application $a_i$ depends on its ratio $q_i$ as well as on its complexity, $compl_i$. Complexity in this context is a measure of the number of instructions in the application that have to be executed on average for each byte in a packet of data. Let $r_i$ be the fraction of instructions that are executed on average belonging to application $a_i$:
$$ r_i = \frac{q_i \cdot compl_i}{\sum_{k=1}^{l} q_k \cdot compl_k} \quad (11) $$
The fraction $r_i$ determines the contribution of each application to memory accesses and associated processor stalls. Depending on the load and store frequencies $f_{load,i}$ and $f_{store,i}$ of each application $a_i$, the respective cache miss rates $m_{ic,i}$, $m_{dc,i}$, and the dirty bit probability $d_{cd,i}$ can be determined. The number of memory accesses per instruction $M_W$ for workload $W$ is:
$$ M_W = \sum_{i=1}^{l} r_i \cdot \big( m_{ic,i} + (f_{load,i} + f_{store,i}) \cdot m_{dc,i} \cdot (1 + d_{cd,i}) \big) \quad (12) $$
$M_W$ can now be substituted for $M$ in the expressions for $CPI$, $IPS$ and $BW_{mem}$ and thus represents the execution of a selected workload.
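The workload weighting can be sketched as below. The code assumes that the instruction fraction of an application is its share q_i of the data stream weighted by its complexity and normalized over all applications, and that the per-application accesses per instruction have the form m_ic + (f_load + f_store) · m_dc · (1 + d_cd). The dictionary keys, function names, and sample values are hypothetical.

```python
def instruction_fractions(q, compl):
    """Fraction r_i of executed instructions that belong to application i."""
    weights = [q_i * c_i for q_i, c_i in zip(q, compl)]
    total = sum(weights)
    return [w / total for w in weights]


def workload_accesses_per_instruction(r, apps):
    """Weighted off-chip accesses per instruction, M_W, for a workload.

    apps is a list of dicts with keys m_ic, m_dc, f_load, f_store, d_cd.
    """
    return sum(
        r_i * (a["m_ic"] + (a["f_load"] + a["f_store"]) * a["m_dc"] * (1.0 + a["d_cd"]))
        for r_i, a in zip(r, apps)
    )


if __name__ == "__main__":
    # Two made-up applications: a light header task and a heavy payload task.
    r = instruction_fractions(q=[0.5, 0.5], compl=[10.0, 90.0])
    apps = [
        {"m_ic": 0.01, "m_dc": 0.02, "f_load": 0.2, "f_store": 0.1, "d_cd": 0.5},
        {"m_ic": 0.00, "m_dc": 0.00, "f_load": 0.2, "f_store": 0.1, "d_cd": 0.5},
    ]
    print(r, workload_accesses_per_instruction(r, apps))
```

The example shows why payload processing can dominate the instruction mix: with equal shares of the data stream, the application with nine times the complexity accounts for 90% of the executed instructions.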
Optimization
To find an optimal design, first assume that instruction and data cache sizes can be selected over a wide range of values. The number of processors that can be combined with the given cache sizes will, of course, depend on the chip size as well as the total number of I/O channels needed for contention-free operation.
Substituting the expression for the number of I/O channels necessary, m, from Equation 10 into the area constraint Equation 2, and assuming we attempt to fill the chip area, we obtain:
This expression reflects both the chip size and I/O channel constraints. The $BW_{mem}$ term can be further expanded using the above equations going back to Equation 3 or 12 for $M$, the average number of memory accesses per instruction. $M$ in turn is a function of the fault rates for different cache configurations. This function can be obtained from benchmark execution data, which is discussed in Section 4. From this data, one can thus obtain fault rates as a function of cache sizes and other characteristics. Thus, given the chip size, s(ASIC), and the sizes of the processor, cache and I/O components, Equation 13 defines the space of parameter choices for the design. If the number of memory sizes that are considered is limited (e.g., only sizes that are powers of two), then the optimization can be performed by an exhaustive search procedure. For each choice there will be corresponding fault rates, number of processors and resulting overall $IPS$. The choice that yields the largest $IPS$ is the design yielding the highest throughput.
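The exhaustive search can be sketched as follows. The code is a minimal illustration under several assumptions: cache area grows linearly with capacity, the number of I/O channels is the demand divided by the usable bandwidth of one channel and rounded up, and the caller supplies a hypothetical function workload_M(c_i, c_d) that returns the off-chip accesses per instruction for cache sizes given in kB. None of the names or area figures are taken from the tables of this paper.

```python
import math
from itertools import product

CACHE_SIZES_KB = [2 ** k for k in range(0, 11)]  # 1 kB ... 1024 kB


def best_configuration(chip_area, s_proc, s_cache_per_kb, s_io,
                       workload_M, clk_p, t_mem, linesize,
                       l_io, clk_io, w_io):
    """Return (IPS, n, c_i, c_d, m) of the feasible design with maximum IPS."""
    best = None
    for c_i, c_d in product(CACHE_SIZES_KB, repeat=2):
        M = workload_M(c_i, c_d)
        ips_1 = clk_p / (1.0 + M * t_mem * clk_p)
        bw_1 = ips_1 * M * linesize
        s_unit = s_proc + s_cache_per_kb * (c_i + c_d)
        n_max = int(chip_area // s_unit)
        # IPS grows with n, so the largest n that still fits is the best one
        # for this pair of cache sizes.
        for n in range(n_max, 0, -1):
            m = max(1, math.ceil(n * bw_1 / (l_io * clk_io * w_io)))
            if n * s_unit + m * s_io <= chip_area:
                candidate = (n * ips_1, n, c_i, c_d, m)
                if best is None or candidate[0] > best[0]:
                    best = candidate
                break
    return best


if __name__ == "__main__":
    # Toy miss model and area figures (in mm^2), chosen only for illustration.
    toy_M = lambda c_i, c_d: 0.05 / c_i + 0.05 / c_d
    print(best_configuration(100.0, 2.0, 0.1, 10.0, toy_M,
                             400e6, 88e-9, 32, 0.5, 800e6, 2))
```

Because only power-of-two cache sizes are enumerated, the search visits 121 cache pairs and is cheap enough to repeat for every chip size and workload of interest.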
CommBench and Workload Definition
To properly evaluate and design network processors it is necessary to specify a workload that is typical of that environment. This has been done in the development of the benchmark CommBench [13]. Applications for CommBench were selected to include a balance between header-processing applications (HPA) and payload-processing applications (PPA). HPA process only packet headers, which generally makes them computationally less demanding than PPA, which process all of the data in a packet. The applications included in CommBench are shown in Table 2.
Application Properties
For each application, we need to know the following properties that can be measured experimentally: computational complexity, load and store instruction ratio, instruction cache and data cache miss rate, and dirty bit probability (for all cache sizes over the optimization space). The complexity of an application can be obtained by measuring the number of instructions that are required to process a packet of a certain length and dividing by that length in bytes (for header-processing applications, we assumed 64 byte packets):
$$ compl = \frac{\text{instructions required}}{\text{packet length}} \quad (14) $$
We measured the complexity of the benchmark applications with Spixtools [3] on an UltraSparc II. To amortize the system-specific program initialization overhead, the measurements were taken for a large number of packets in a single program run. The complexity numbers, as well as the load and store frequencies, for the different applications are shown in Table 3.
Note that the complexity of payload processing is significantly higher than for header processing. This is due to the fact that payload processing actually touches every byte of the packet payload and executes complex transcoding algorithms. Header processing, on the other hand, typically reads only a few header fields and does simple lookup and comparison operations.
The cache properties were measured with Shade [4] and Dinero [5]. A 2-way associative write-back cache with a line size of 32 bytes was simulated. The miss rates for the various applications are shown in Figure 2 (for illustration purposes only cache sizes 1kB through 32kB are shown, but 1kB through 1024kB were measured). The cache miss rates were obtained such that cold cache misses were amortized over a long program run. Thus, they represent the steady-state miss rates of these applications. As can be seen from the figures, for most applications, miss rates under one percent can be achieved with an instruction cache size of 8kB and a data cache size of 16 to 32kB. Effects of cold caches are considered in Section 6.3. The differences between header-processing and payload-processing applications with respect to computational complexity, cache behavior, and instruction mix (not shown) suggest the use of specialized processors and non-uniform cache configurations for the different application categories. This is considered in Sections 6.1 and 6.2, but for the remaining analysis identical configurations are used for all applications.
Workload
In this analysis we consider three workloads that are weighted combinations of the applications listed above. The workloads are defined by the vectors $q = (q_1, ..., q_8)$ as follows:
Using individual application complexities (Table 3), the instruction ratios $r = (r_1, ..., r_8)$ for the different workloads can be obtained.
• Workload III: $r_{III}$ = (0, 0, 0, 0, 0.105, 0.227, 0.587, 0.081).
Example System
Given the analysis of Section 3 and the workload and application properties of Section 4, the optimal configurations of a network processor can now be determined.
Area Constraints
The three main network processor components are the processors, their caches, and the I/O channels. By examining the specifications of several RISC processor cores, estimates of the sizes of these components were obtained and, for 0.25µm technology, are given in Table 4. It is assumed that the sizes scale linearly with the number of each component (e.g., s(n · c) = n · s(c)). Based on these estimates, and using Equation 2, all 'legal' configurations for an ASIC can be enumerated. total cache size is 8kB. One possible division for this case would be to have 4kB for both instruction and data cache. Note that this table just indicates possible configurations prior to imposing the I/O bandwidth constraint. Next, using CommBench data and workload selection, one can determine the resulting cache miss rates and the effective number of instructions that can be executed on the system. This, as explained before, determines the throughput of the system and thus the total performance. The cache miss rates also determine the required I/O bandwidth, which has to be related to the number of I/O channels shown in Table 5.
Evaluation
Optimization through exhaustive search can now be done over two independent parameters, the instruction cache size $c_i$ and the data cache size $c_d$, for cache sizes of 1kB, 2kB, 4kB, 8kB, ..., 1024kB. For each combination of cache sizes, we determine the resulting miss rates and bandwidth requirements. The processors are assumed to be clocked at $clk_p$ = 400MHz and the off-chip memory access time is $t_{mem}$ = 35 clocks. Considering a load of $l_{io}$ = 0.5 on the I/O channels (see Appendix A), we can determine the number of I/O channels required for a given number of processors. Thus, we search for the maximum number of processors that can fit onto the chip, while still providing the necessary number of I/O channels.
The optimization space is shown in Figure 3. The 3-D figure shows the total MIPS rating of a 100mm² ASIC optimized for Workload I. The other figures show the top-down view of the surface, also for a 100mm² ASIC, optimized for Workloads I through III. It can be seen that there is an optimum in the area around 8 or 16kB. For smaller or larger cache sizes, the total performance drops significantly. Optimization results for various chip sizes are shown in Table 6. The number of processors scales with the ASIC size, and the optimal cache sizes vary only slightly due to rounding effects.
It is notable that the best performance is achieved in only a small configuration space (caches of 8kB to 16kB). This shows that workload-specific optimization is very important. Comparing application properties of CommBench applications to those of workstation applications, as found in the SPEC benchmark [11] , one can see that program kernels of networking applications are about one order of magnitude smaller than those of SPEC applications [13] . This means that SPEC programs require much larger instruction caches to achieve the same miss rates.
In comparison to commercial network processors, the results match roughly with the cache configurations on Intel's IXP1200 [7] with 16kB instruction and 8kB data cache and Tsqware's TS704 [12] with 16kB instruction and 16kB data cache. C-Port's C-5 [2] uses smaller caches, where 4 processing engines share 16kB of cache. For other commercial products, no information about cache configurations could be obtained.
Extensions
For further improvements to the system, we elaborate in this section on three extensions that can be considered: specialized processors, applications-dependent cache configurations, and smart scheduling.
Processor Pools
One important result from the benchmark measurements is that header processing applications have very different characteristics from payload processing applications. It is a natural extension to use specialized processors for the processing of certain categories of applications.
Header processing applications might be able to make use of sophisticated branch predictors due to their large fraction of load, compare, and branch instructions. Payload processing applications, which are dominated by arithmetic, logic, and shift instructions, could make use of instruction-level parallelism.
Ideally, a system should be configured such that there is the right number of processors for each workload category. This is particularly important if applications can only be executed on processors which are specialized for them because of binary incompatibilities. Thus, the specialization requires knowledge of the expected workload in the system. If the workload is not known in advance, the system can be overengineered to handle 100% header processing traffic as well as 100% payload processing traffic. This will result in a lower overall cost-performance.
Application-Specific Cache Configurations
Analogous to specializing processors, it is possible to specialize cache configurations for individual applications or groups of applications. Table 6 (100mm² chip size) shows that the optimal cache configuration for header processing applications (workload I) is $c_i$ = 8kB and $c_d$ = 16kB. The optimal configuration for payload processing (workload III), though, is $c_i$ = 16kB and $c_d$ = 8kB. As a result of a global optimization that does not take individual applications into account, workload II, which contains both types of applications, has its optimum at $c_i$ = 16kB and $c_d$ = 16kB. This optimum requires 32kB of cache per processor because payload processing applications need $c_i$ = 16kB for good performance and header processing applications need $c_d$ = 16kB for good performance. If the optimization can be performed for application-specific cache configurations, the optimal configuration is a set of processors with $c_i$ = 16kB and $c_d$ = 8kB for payload processing and a set of processors with $c_i$ = 8kB and $c_d$ = 16kB for header processing. In this case the amount of per-processor cache is only 24kB versus 32kB in the uniform design. The chip area saved by the smaller cache configuration can then be used for additional processors. Naturally, a packet scheduler would need to route packets to the processor with the best cache configuration.
Scheduling
So far, all packets were distributed randomly over the processors in the system. The packet at the head of the queue in the packet demultiplexer was sent to the first processor that became available. While this approach is easy to implement, it has the disadvantage that it does not take into account what program was executed on that processor prior to the current packet. In most cases these programs will be different, and this will cause an increase in cold cache misses.
There are two approaches to improving packet demultiplexer scheduling to reduce cold cache misses:
• send packets of the same type of processing to the same processor
• buffer packets of one type and execute them back-to-back on the same processor
The first approach to using warm caches is to have the scheduler keep track of all programs that are executed on the processors and their expected finishing time (this can be estimated using the computational complexity measure). If the packet at the head of the queue requires processing of type i, the scheduler can check if a processor, that is currently executing program i, is expected to finish within some time ∆t. If so, the packet will be queued for this processor rather than sent to another processor that might become available earlier, but has executed a different program. The amount of waiting, ∆t, is the time that can be saved by executing the program on a warm cache instead of a cold cache. If the packet has to wait longer than ∆t, it is not worth waiting for a processor with warm caches, because it will be done quicker on a processor with cold caches for which it does not have to wait.
Another approach is to buffer packets of the same type and then send them back-to-back to the same processor. This guarantees that except for the first packet the caches are warm. There is a tradeoff between buffering and introducing delay. If too much is buffered before the packets are processed, the first packets can be delayed significantly, but this scheme is much simpler to implement, since it does not require an estimation of finishing times.
Both scheduling approaches require multiple queues for the different processing types. This is also necessary when specialized processors are used, because packets have to be queued until a processor suitable for their specific processing requirements becomes available.
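The first scheduling approach can be sketched as a small decision function. The code below is one interpretation, not a definitive design: it compares the extra wait for a warm-cache processor against the earliest available processor and accepts the wait only if it does not exceed ∆t. The Processor record, its field names, and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    pid: int
    program: str         # program currently (or most recently) executed
    time_to_idle: float  # estimated time until the processor is free


def choose_processor(processors, packet_program, delta_t):
    """Pick a processor for the packet at the head of the queue.

    A processor with a warm cache for packet_program is preferred if waiting
    for it costs at most delta_t more than taking the earliest free processor.
    """
    earliest = min(processors, key=lambda p: p.time_to_idle)
    warm = [p for p in processors if p.program == packet_program]
    if warm:
        best_warm = min(warm, key=lambda p: p.time_to_idle)
        if best_warm.time_to_idle - earliest.time_to_idle <= delta_t:
            return best_warm
    return earliest


if __name__ == "__main__":
    pool = [Processor(0, "A", 0.0), Processor(1, "B", 3.0)]
    print(choose_processor(pool, "B", delta_t=5.0).pid)  # waits for the warm cache
    print(choose_processor(pool, "B", delta_t=2.0).pid)  # wait too long, runs cold
```

When the earliest processor is already idle, the rule reduces to checking whether the warm processor is expected to finish within ∆t, as described above.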
Summary and Conclusions
In this paper, we consider a multiprocessor system-on-a-chip that is specialized for the telecommunications environment. Network traffic can be processed by special application software that executes on a set of processors contained on a single chip. The problem analyzed is that of determining the optimal number of processors, associated cache sizes and I/O channels that should be present in such a design given a set of defining parameters and constraints with the principal constraint being the total chip area available.
An analytical model of the system has been presented that reflects the overall computational power based on the computational complexity and cache behaviour of a set of measured programs. The analytic expressions developed are such that the design can be optimized to maximize network processor throughput with the optimal system specified in terms of number of processors, cache sizes, and I/O channels for a given workload characterization. Workload statistics were obtained using CommBench, a telecommunications workload that contains both header and payload processing applications.
The results indicate that, for example, with a 200mm² chip and a workload whose computational load is equally split between header and payload processing, the optimal configuration would have 16 processors, both instruction and data caches of 16kB, and 2 I/O channels. Using the expressions provided, optimal designs can be determined over a range of parameter choices.
While this model and the resulting designs are promising, there are some clear extensions to this work which would permit greater performance with the same given area constraints. One example concerns the use of non-uniform cache sizes. In such a situation, scheduler assignment of tasks would be based on their type (e.g., header or payload processing) and resulting cache requirements. Another example would permit non-uniform processors where processor instruction sets are specialized to different application class requirements and the scheduler routes tasks accordingly.
Further extensions of the current model are also being considered. In particular, the effect of cold cache misses is being analyzed and efforts are being undertaken to analyze the effect of having the processor scheduler distribute tasks on the basis of minimizing such misses.
$$ t_{mem} = t_{control\ access} + t_{signaling} + t_{mem\ access} + t_{transfer\ access} + t_{transfer} \quad (15) $$
To use the I/O channel efficiently, both pipelining and memory interleaving techniques are commonly incorporated in the bus design. For analysis purposes we assume an h-way interleaved memory and constrain the design so that only h memory requests are permitted at any given time. Since the processors associated with these h possible requests are all dealing with different non-interacting flows, we assume that different memory banks are assigned to each flow and thus remove any possible queueing delays associated with contention between the h flows within the off-chip memory system. $t_{mem\ access}$ now is a single value obtained from the memory chip manufacturer.
Consider now the time associated with the memory reacquiring the bus after a memory request has been satisfied, $t_{transfer\ access}$. From the perspective of the memory, the bus can be viewed approximately as an M/G/1 system. With such a system and an average utilization or load, $l_{io}$, the probability of the memory having h requests outstanding for use of the bus is [1]:
The probability of having more than h requests outstanding is:
The average time to access the transfer component of the I/O channel, $t_{transfer\ access}$, then depends on the average number of requests that will be served before it can gain access.
E[t transf er access ] = t transf er
To simplify the analysis we consider the queueing delays associated with $t_{control\ access}$ to be constrained in the same manner as those associated with $t_{transfer\ access}$. Doing this, Equation 18 above applies to $t_{control\ access}$ with $t_{signaling}$ replacing $t_{transfer}$.
Consider now a system with a 16 bit wide, 800MHz bus and a DRAM access time of 40ns. For $t_{signaling}$, we assume 4 bus clocks or 5ns. $t_{transfer}$ for a 32 byte memory line is 16 bus clocks or 20ns. With up to h = 4 interleaved requests, we choose a load on the I/O channel of $l_{io}$ = 0.5 to get a probability of over 93% that there are 4 or fewer simultaneous memory requests. The average access times then become $t_{transfer\ access}$ = 18ns and $t_{control\ access}$ = 5ns. Thus, the total memory access time is $t_{mem}$ = 5ns + 5ns + 40ns + 18ns + 20ns = 88ns.
In terms of the processor clock, this corresponds to about 35 clock cycles on a 400MHz processor.
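The arithmetic of this example can be checked with a short script. The sketch below takes the bus parameters, the DRAM access time, and the two average queueing times (5ns and 18ns) directly from the example above; the function names are hypothetical.

```python
def bus_time_ns(bus_clocks: int, bus_hz: float) -> float:
    """Duration of a number of bus clocks in nanoseconds."""
    return bus_clocks / bus_hz * 1e9


def t_mem_ns(t_control_access, t_signaling, t_mem_access,
             t_transfer_access, t_transfer):
    """Total off-chip memory access time as the sum of its five components."""
    return (t_control_access + t_signaling + t_mem_access
            + t_transfer_access + t_transfer)


def to_cpu_cycles(t_ns: float, cpu_hz: float) -> float:
    """Convert a time in nanoseconds to processor clock cycles."""
    return t_ns * 1e-9 * cpu_hz


if __name__ == "__main__":
    bus_hz = 800e6
    t_signaling = bus_time_ns(4, bus_hz)        # 4 bus clocks
    t_transfer = bus_time_ns(32 // 2, bus_hz)   # 32-byte line over a 2-byte bus
    total = t_mem_ns(5.0, t_signaling, 40.0, 18.0, t_transfer)
    print(total, to_cpu_cycles(total, 400e6))
```

The result, 88ns or roughly 35.2 processor cycles, matches the value of 35 clocks used for $t_{mem}$ in the evaluation.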
