Abstract-Along with sharply increasing bandwidth requirements, modern network applications and new protocols demand highly intelligent and sophisticated processing over the network. Since these workloads require processing capability beyond that of state-of-the-art microprocessors, parallel architectures are required in order to handle the packet data without slowing down line speed. Network processors with various parallel architectures are appearing in the market; however, a thorough investigation of the implications of static versus dynamic scheduling for this class of emerging workloads has not been done. In this paper, we characterize the performance and power dissipation of statically identified ILP architectures, and we also compare them to dynamically scheduled architectures for network processing. In dynamically scheduled architectures, the power consumption of the instruction window greatly increases with increasing issue widths. On the other hand, statically scheduled architectures show better performance, as well as tremendous advantages in power, since they do not have instruction window wakeup/select, reorder buffer, and other scheduling-related hardware modules. With the large parallelism and the loop nature of network applications, our experimental analysis supports static scheduling as an appropriate strategy for network processor applications.
I. INTRODUCTION
Modern network applications and protocols demand intelligent and sophisticated processing over the network, requiring non-trivial computation capabilities. The processing requirements within network interfaces and routers are becoming more complex. To keep up with current trends, programmable microprocessors called network processors (NP) are being introduced in network interfaces to handle these demands.
The network processor on the physical port of a router should be able to process modern workloads without slowing down line speed. Assuming a stream of minimum-sized packets of 64 bytes, a router on a 10 Gbps link must handle about 19.5 million packets per second. In this case, one packet time, i.e., the processing time available for one packet, is about 51 ns. Given a single processor with a 1 GHz clock frequency, and assuming every instruction executes within a cycle, this processor can execute only 51 instructions per packet time. The required number of instructions per packet for executing NP applications is in the range of 300~10,000 [4], and hence a conventional processor cannot handle these workloads. Highly parallel architectures are required.
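The packet-time budget above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the constant and function names are illustrative and not from the paper, and it keeps the text's simplifying assumption of one instruction per cycle.

```python
# Per-packet instruction budget for a single processor (illustrative names).
LINK_RATE_BPS = 10e9   # 10 Gbps link, from the text
PACKET_BYTES = 64      # minimum-sized packet, from the text
CLOCK_HZ = 1e9         # 1 GHz clock, one instruction per cycle (text's assumption)

packets_per_second = LINK_RATE_BPS / (PACKET_BYTES * 8)
packet_time_ns = 1e9 / packets_per_second
cycles_per_packet = packet_time_ns * CLOCK_HZ / 1e9
instr_budget = int(cycles_per_packet)  # whole instructions that fit in one packet time


def required_ipc(instructions_per_packet):
    """Instructions per cycle needed to finish one packet within one packet time."""
    return instructions_per_packet / cycles_per_packet


print(f"{packets_per_second / 1e6:.2f} Mpps, {packet_time_ns:.1f} ns per packet, "
      f"budget of {instr_budget} instructions")
print(f"required IPC: {required_ipc(300):.1f} to {required_ipc(10_000):.1f}")
```

Under these assumptions, a workload of 300 to 10,000 instructions per packet would need roughly 6 to 195 instructions completed per cycle, which is the gap that motivates parallel architectures.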
To obtain high parallelism, recent research and commercial products implement network processors with multithreaded or vector-type array processors. Melvin et al. [12] utilize multiple multithreaded processing engines to get a high degree of thread-level parallelism (TLP) in an NP design that supports 256 simultaneous threads in eight processing engines. In this scheme, each thread has its own independent register file, while sharing functional resources and memory ports with other threads. ClearSpeed [13] introduces an MTAP (Multi-Threaded Array Processing) processor, which provides a scalable processing solution based on an array of tens to thousands of small processing elements (PEs). Each PE has its own local memory and I/O capability. Although these implementations can meet the demanded performance, they still suffer from high hardware complexity, cost and power consumption. In addition, complex and specialized compiler infrastructure is needed to utilize these architectures.
NP workloads have large instruction-level parallelism (ILP) and data-level parallelism (DLP), since they have many loops in the algorithm [4]. However, existing research and products have mostly focused on DLP and TLP. With these changing trends, ILP should also be considered in designing a network processor in order to reach the required throughput. Generally, the architectural concept of ILP implementation extends to information embedded in the program pertaining to the available parallelism between the instructions [14] [15]. The two most important types of ILP processors are superscalar and VLIW.
Current popular media processors (TI's C6x and TriMedia's TM-1300) rely on the simpler hardware of VLIW processors in order to minimize the cost and power of ILP implementation [3]. This is because multimedia and DSP applications include many loop operations in the algorithm and are well suited to statically scheduled architectures. Network processor workloads are also loop-intensive, so we consider statically identified ILP implementation an appropriate architecture for NP applications. As the demand for programmability of network processors increases, compiler support for network processors will also take on a significant role in performance evaluation.
Exploiting Statically Identified ILP for Network Processor Applications
Byeong Kil Lee, Member, IEEE
In this paper, we characterize the performance (throughput) and power dissipation of statically identified ILP implementation, and we also compare them to the dynamic optimization for network processor applications. To achieve high throughput, more aggressive parallelism and multiprocessor architectures are needed in designing a network processor, but an analytical characterization of a single processor, with respect to performance and power dissipation, is also necessary. Such a characterization can help find the appropriate architecture for a specific application workload. An important question is whether network processor architectures should be statically scheduled or dynamically scheduled. A thorough investigation of the implications of static versus dynamic scheduling for this class of emerging workloads has not been done. The goals of our experiments are to find:
• Do NP workloads benefit from dynamically scheduled processors?
• Can static scheduling gain better performance than dynamic scheduling?
• Can naive (static) scheduling be sufficient? How important are sophisticated compiler techniques?
• What kind of power saving can be obtained from static scheduling (and dynamic scheduling)?
In our experiments, we do not advocate any particular statically scheduled architecture at this point. We are simply using the VLIW paradigm as a vehicle to investigate the feasibility of static scheduling for NP applications.
Modern NP applications can be functionally categorized into two types of operations: the data plane operations and the control plane operations [4] . While the data plane performs packet operations, the control plane handles flow management, signaling, congestion control and higher-level protocols. Although NPs have initially been targeted for data plane applications, they also play a major role in the control plane, particularly for emerging workloads. In fact, with the increased demand for complex processing, the boundaries between data plane and control plane have become blurred [1] . Along with this tendency, the recently released network processor benchmark, NpBench, includes control plane applications [4] .
For our experiments, eight control plane and data plane applications, as well as three media applications, are used to evaluate performance. Experimenting with both static and dynamic approaches, we see that the statically scheduled architecture shows better performance (throughput) than the dynamically scheduled architecture in the NP domain. If proper compiler optimization options (e.g., different block formation) are chosen for each application in VLIW, the performance becomes 1.4x ~ 4.1x better than that of a comparable superscalar architecture. With regard to energy consumption, the statically scheduled architecture shows 6.5x ~ 12x better energy efficiency than the comparable superscalar architecture with respect to the power per instruction.
Given the large parallelism [4] and loop-intensive nature of NP applications, our experimental analysis supports static scheduling as an appropriate paradigm for them. Even though NP applications are quite different from multimedia and DSP applications in architectural aspects, the success of VLIW in multimedia areas could be applied to NP domains with better performance, lower hardware complexity and lower power dissipation.
The rest of the paper is organized as follows: Section 2 describes the network processor workloads, including control plane and data plane applications, used in this paper. Section 3 describes our experimental framework. In Section 4, we analyze the characteristics of dynamically scheduled architectures for network processors. The performance comparison of statically and dynamically scheduled architectures is discussed in Section 5. Finally, we present conclusions and future work in Section 6.
II. NETWORK PROCESSOR WORKLOADS
It is extremely important to identify appropriate benchmarks for efficient design and evaluation of any processor. In NP fields, three benchmark suites have been previously proposed: CommBench [9], NetBench [5] and NpBench [4]. Wolf et al. presented eight selected workloads called CommBench [9] for traditional routers and active routers. CommBench has two groups of benchmarks, namely Header Processing Applications (HPA) and Payload Processing Applications (PPA). Memik et al. proposed nine benchmarks called NetBench [5], covering micro-level, IP-level and application-level benchmarks. NpBench [4], proposed by Lee et al., categorizes NP workloads into control plane and data plane categories. Control plane workloads are just emerging and evolving in current network environments, and they perform congestion control, flow management, higher-level protocols and other control tasks [4]. The other benchmark suites focus on the data plane, while NpBench focuses on the control plane. One significant feature of modern network workloads is differentiated service. In fact, large amounts of network bandwidth are consumed by multimedia content, which is provided with different quality of service levels. In order to capture these newer workload scenarios, control plane workloads and payload processing should be included in NP benchmarks.
Past studies using CommBench [9] and NetBench [5] contrast network workloads with other benchmarks, such as SPEC [6] and MediaBench [7], with respect to instruction set characteristics and memory behaviors. NpBench [4] shows the different characteristics of control plane and data plane network workloads. Based on the three benchmark suites, the characteristics of network processor workloads can be summarized as a large number of memory accesses, poor data cache performance, a large number of branch instructions (particularly in the control plane), and a high level of data parallelism.
We choose eight representative NP applications from the above three benchmarks for our experiments, including control plane workloads and payload applications. We also include three media applications in order to compare the effectiveness of static scheduling. Table 1 summarizes selected applications. Contrary to other applications, network processor applications have two different types of data: packet header data and payload data. While some NP applications need to process only packet header information, others have to deal with payload data as well. Network processor applications handle the packet header data and payload data with the same kernel, which means they have many loops in the algorithm, resulting in high parallelism.
Generally, parallelism and throughput are restricted by branch operations and memory operations. These two operations typically consume a significant part of the total execution cycles. Therefore, we investigate the instruction frequency of branch operations and memory operations for the selected NP applications. Figure 1 shows the percentage of the total number of dynamic instructions accounted for by memory operations ('load' and 'store') and branch operations. This data is based on the executables compiled for the Simplescalar environment. From this experiment, we observe that NP applications use more memory operations and branch operations than media applications. The large number of memory accesses indicates that NP applications are data-intensive. Also, many comparison and conditional operations are required for several control plane workloads (e.g., QoS (Quality of Service) [11]), because they have to process each packet according to its priority. In the next section, we provide a more detailed analysis of the performance factors.
III. EXPERIMENTAL FRAMEWORK
In order to study the effectiveness of static and dynamic scheduling for NP applications, we perform experiments on an out-of-order superscalar processor model and a VLIW architecture model. However, we do not advocate any particular statically scheduled architecture. We are simply using the VLIW paradigm as a vehicle to investigate the feasibility of static scheduling for NP applications. Performance and power consumption are used as metrics; hence power simulators are also needed. We utilize four tools in this evaluation. The Simplescalar out-of-order simulator [16] is used to simulate the dynamically scheduled architecture and analyze its performance, and the Trimaran [17] tool is used for statically scheduled architecture simulation and performance evaluation. Wattch [8] is used to estimate the power dissipation of the superscalar architecture for each application, and PowerImpact [10] is used to estimate the power consumption of the VLIW architecture. In the static scheduling simulation using Trimaran, we use three region formations for front-end compiler optimization, which is independent of the target processor, in order to see the effectiveness of aggressive compiler optimization techniques. These three region formations are basicblock, hyperblock and superblock. Basicblock scheduling has a limited scope for exploiting ILP, as each basicblock has 4-5 interdependent instructions on average. Hyperblocks and superblocks are extended basicblocks for scheduling, in which groups of basicblocks are scheduled as a single unit.
Wattch is based on the Simplescalar framework and PowerImpact is designed on the Impact tool [22] . Wattch is capable of breaking down the power consumption into the various units of the processor and hence we do detailed analysis of the power consumption of various processor units.
Several processor configurations are simulated on the different tools. For Section 4, various superscalar configurations ranging from 4-issue to 64-issue are simulated. Simplescalar configurations for the 4-issue and 8-issue superscalar are given in Table 2, and the wider superscalar configurations use proportionally larger resources. For the comparison of superscalar and VLIW in Section 5, we use an 8-issue VLIW architecture, whose Trimaran configuration is also shown in Table 2. To obtain superscalar architectures comparable to the 8-issue VLIW, we use both the 4-issue and 8-issue configurations (Table 2). The 4-issue superscalar architecture has a smaller issue width than the 8-issue VLIW, but the two have comparable functional units. For example, while this VLIW has dedicated functional units for memory and branch operations, the 4-issue superscalar handles memory and branch operations, as well as integer ALU operations, in 4 integer ALUs. It should be noted that the integer units in the VLIW configuration can perform multiply and divide. The 8-issue superscalar architecture has the same issue width as the 8-issue VLIW, but it has more functional units. In our experiments, we treat the 8-issue VLIW configuration as closest to the 4-issue superscalar, although it lies somewhere between the 4-issue and 8-issue superscalar configurations.
IV. ANALYSIS OF DYNAMICALLY SCHEDULED ARCHITECTURES FOR NP APPLICATIONS

A. Effectiveness of Wide Issue
The most significant issue in designing network processors is maintaining the required throughput. Network processor workloads contain a large amount of instruction-level parallelism. Hence they may be able to exploit wider and wider issue widths. For each NP application, we applied various issue widths of superscalar architectures in order to see the performance (IPC) variations; the result is shown in Figure 2. We consider a 4-issue superscalar as the base configuration, and we double the hardware resources with each doubling of the issue width. Most applications show better performance (IPC) with increasing issue widths, but some applications, such as FRAG, REED, SSLD, ADPCM and FFT, saturate early at a small issue width. An aggressive increase of issue width can help to improve throughput, but the cost and complexity of the hardware might diminish the benefit.
B. Power Consumption of Wide-Issue Superscalar
In order to understand the power consumption of wide superscalar processors, we perform experiments using the Wattch framework. As shown in Figure 3, total energy consumption increases with increasing issue width. In dynamically scheduled architectures, large amounts of power are consumed in instruction window wakeup/select, the reorder buffer, and other scheduling-related hardware modules. We note that the total energy consumption of a 64-issue architecture is, on average, 10 times larger than that of a 4-issue architecture. In particular, the power consumption of the instruction window greatly increases with increasing issue widths.
C. Sensitivity Analysis
In order to investigate which hardware element is most influential on throughput in dynamically scheduled architectures, we perform a sensitivity analysis for NP applications. In this experiment, we use nine restricted hardware elements, including branch prediction, commit width, decode width, the number of functional units, issue width, load/store queue size, the number of memory ports, memory bus width and the register update unit. Figure 4 presents the results of the sensitivity analysis for NP applications. For each NP application, the impact of restricting the resource is studied. In this analysis, we consider the performance of a 64-issue machine as the baseline performance. In each experiment, a single constraint is intentionally inserted into the baseline performance model. From the results of the experiment, we can determine the degree of impact, which indicates how the constraint affects the overall performance during dynamic execution. The percentage value of each bar represents a normalized performance metric, which is the relative performance compared to the assumed baseline performance (100%). For this sensitivity analysis, nine constraints, which are independent of each other, are applied. The 'bpred' bar shows the effect of branch misprediction compared to perfect prediction. The 'commit', 'decode' and 'issue' bars show the impact of the limited size of each resource. The 'FU' bar illustrates the impact of restricted functional units. The 'LSQ' and 'RUU' bars show the effects of limited load/store queues and register update units, respectively. The 'mem_port' bar provides the sensitivity to the limited number of memory system ports available to the CPU, and the 'mem_width' bar represents the sensitivity to limited memory access bus width.
We use each hardware element of the 4-issue superscalar as the corresponding constraint for the baseline performance model. From this analysis, we see that the restriction of memory width has little impact on the overall performance in most NP applications, except for RED and MPLS. Branch misprediction strongly affects all NP applications, except for SSLD. This observation shows that branches are quite unpredictable in NP applications. As shown in Figure 4 (a), MPLS is the application most affected by all of the constraints. The common bottlenecks across all NP applications are 'LSQ' and 'RUU', with RED and MPLS showing the largest impact. Also, the 'commit', 'decode' and 'issue' widths are medium-level bottlenecks in the overall performance for all NP applications. Figure 4 (b) shows the sensitivity analysis of the total energy with respect to the resource constraints. It is interesting to note that branch misprediction leads to a large amount of additional energy dissipation, which is due to the increase in cycles caused by misprediction. The 'commit' constraint has a similar effect to 'bpred' in some applications. We see that better performance (and hence fewer cycles) means less energy consumption. All other constraints, except for the above two, show an impact on power dissipation proportional to the reduced resources. Table 3 shows the impact of inserted constraints on detailed resource elements in one representative application. We choose WFQ for this experiment because WFQ shows typical characteristics among the selected NP applications. When 'bpred' is given as a constraint, the power dissipation of all resources (except for 'LSQ') increases, because additional instructions must be executed to compensate for misprediction penalties. The most affected resource in the WFQ experiment is the register file. The register file consumes twelve times more energy when the 'commit' width is restricted.
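The normalization used for the bars can be sketched as follows. This is only an illustration: it assumes IPC as the performance metric, and the IPC values are hypothetical placeholders rather than measurements from Figure 4.

```python
def normalized_performance(baseline_ipc, constrained_ipc):
    """Return the constrained performance as a percentage of the baseline (100%)."""
    if baseline_ipc <= 0:
        raise ValueError("baseline IPC must be positive")
    return 100.0 * constrained_ipc / baseline_ipc


# Hypothetical IPC values, NOT results from the paper.
baseline_ipc = 8.0  # unconstrained 64-issue model
constrained_ipc = {"bpred": 4.0, "LSQ": 5.0, "mem_width": 7.8}

# One bar per constraint; each run applies a single constraint to the baseline.
bars = {name: normalized_performance(baseline_ipc, value)
        for name, value in constrained_ipc.items()}

# The shortest bar marks the constraint that hurts performance the most.
most_limiting = min(bars, key=bars.get)

for name, percent in bars.items():
    print(f"{name}: {percent:.1f}% of baseline")
print("most limiting constraint:", most_limiting)
```

Because each run changes only one resource, a short bar can be attributed to that resource alone.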
This is because the narrow commit width causes a large number of accesses to the register file. The 'issue' and 'RUU' constraints have the largest impact on power dissipation across all applications, which implies that a large amount of power is demanded by the related resources (e.g., the instruction window) in large issue-width architectures. As shown in Figure 5, the share of dynamic energy consumed by the instruction window increases from 7.2% to 22.3% as the issue width grows, across the selected NP applications.
We assume that aggressive clock-gating is employed and, therefore, power is scaled linearly with port or unit usage. It is assumed that unused units dissipate 10% of their maximum power.
In this section, we compare the performance of statically scheduled architectures to dynamically scheduled architectures while executing NP applications. We also experiment with media applications in order to investigate the effectiveness of applying static architecture for NP applications. Since the performance of network processors is measured by actual throughput, we compare the total execution cycles of the two architectures, rather than IPC. In some cases, IPC can be misleading as a performance criterion because some aggressive compiler optimizations increase the total number of dynamic instructions. However, the increased number of instructions can be executed in fewer cycles, since the optimization techniques introduce much higher parallelism. Table 4 shows IPCs (Instructions per Cycle), total number of instructions and total execution cycles when NP applications and media applications are run on VLIW and superscalar architectures. Even though the two architectures use different ISAs and several different dynamic and static options, it is meaningful that, for NP applications, the VLIW architecture shows better performance than the compatible superscalar architecture with respect to throughput.
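The reason for preferring total execution cycles over IPC can be seen in a small numeric sketch. The instruction and cycle counts below are hypothetical and are not taken from Table 4; the function names are illustrative.

```python
def ipc(instructions, cycles):
    """Instructions per cycle."""
    return instructions / cycles


def speedup(baseline_cycles, cycles):
    """Throughput speedup: baseline cycles divided by the cycles actually needed."""
    return baseline_cycles / cycles


# Hypothetical counts for one program, NOT taken from Table 4.
base_instr, base_cycles = 1_000_000, 500_000  # before aggressive optimization
opt_instr, opt_cycles = 1_500_000, 375_000    # after: more instructions, fewer cycles

ipc_ratio = ipc(opt_instr, opt_cycles) / ipc(base_instr, base_cycles)
cycle_speedup = speedup(base_cycles, opt_cycles)

print(f"IPC ratio: {ipc_ratio:.2f}, speedup in cycles: {cycle_speedup:.2f}")
```

In this made-up case, code inflation doubles the IPC while the program finishes only about 1.33 times faster, so IPC overstates the throughput gain.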
A. Performance Evaluation of Static Scheduling in NP Applications
In the static scheduling experiments, we see that some applications, such as WFQ and RED, need fewer cycles with superblock optimization, while MPLS, SSLD and MTC have better results with hyperblock. Compared to the media applications, some NP applications benefit greatly when a region formation is properly selected for the static optimization in VLIW. As shown in Figure 6, the VLIW approach with the selected optimization option for each NP application shows 1.4x ~ 4.1x (2.2x on average) better performance (throughput) than the compatible superscalar (4-issue) architecture. Compared to the 8-issue superscalar architecture, this VLIW approach shows better performance across all NP applications except for DRR and MTC. In Figure 6, the Y-axis shows the speedup normalized to the total execution cycles of the 4-issue superscalar.
Contrary to media applications, most NP applications show poorer performance with 'VLIW-sched (basicblock)' than with the superscalar model. Therefore, NP applications need aggressive optimization techniques in compilation. Figure 7 illustrates the effectiveness of aggressive optimization techniques in NP applications. Since hyperblock and superblock optimizations reduce the impact of conditional operations at the cost of code inflation, our experimental results also show that NP applications benefit from optimized region formation techniques.
B. Comparison of Different Region Formation Techniques
In VLIW architecture, the compiler plays a major role in finding parallelism, decreasing dependencies among instructions and exploiting other optimization techniques in static mode. For more aggressive optimization, several types of region formations (basicblock, hyperblock and superblock) have been used in the compilation stage. Table 5 shows the static code size of each region formation in VLIW optimization. Hyperblock and superblock optimizations produce a much larger code size than basicblock optimization, since they use several algorithms, such as tail duplication, node splitting and loop peeling, to expose more parallelism. These algorithms enlarge the code during the optimization process. The most important means of exploiting parallelism is instruction scheduling, which assigns instructions to fixed functional units in the VLIW architecture. Figure 7 shows the performance comparison between the scheduled and the unscheduled VLIW experiments. For a more intuitive comparison, we use the speedup normalized to the total execution cycles of the unscheduled VLIW. From this experiment, we see that the impact of optimized block formation (e.g., hyperblock or superblock) on instruction scheduling is significantly larger in the NP applications than in the media applications.
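To illustrate why a larger scheduling region gives the scheduler more to work with, the following toy sketch applies a greedy list scheduler to two small blocks, first separately and then as one region. It is not Trimaran's algorithm: it assumes unit latency and interchangeable issue slots, it ignores control dependences, predication and tail duplication, and the instruction names are hypothetical.

```python
def list_schedule(deps, width):
    """Greedy cycle-by-cycle list scheduling.

    deps maps each instruction to the instructions it depends on.
    Every instruction has unit latency; up to `width` issue per cycle.
    Returns a list of cycles, each a list of instruction names.
    """
    done = set()
    remaining = list(deps)  # program order (dict insertion order)
    schedule = []
    while remaining:
        # Ready means every producer finished in an earlier cycle.
        ready = [i for i in remaining if all(d in done for d in deps[i])]
        if not ready:
            raise ValueError("cyclic dependence")
        issued = ready[:width]
        schedule.append(issued)
        done.update(issued)
        remaining = [i for i in remaining if i not in issued]
    return schedule


# Two hypothetical basic blocks, each a load -> add -> store chain.
block_a = {"ld_a": [], "add_a": ["ld_a"], "st_a": ["add_a"]}
block_b = {"ld_b": [], "add_b": ["ld_b"], "st_b": ["add_b"]}

WIDTH = 8  # issue slots per cycle, matching the 8-issue VLIW in the text

# Scheduling each block on its own: the two chains cannot overlap.
per_block_cycles = (len(list_schedule(block_a, WIDTH))
                    + len(list_schedule(block_b, WIDTH)))

# Scheduling both blocks as one region: independent chains share cycles.
region_cycles = len(list_schedule({**block_a, **block_b}, WIDTH))

print(per_block_cycles, region_cycles)
```

In this example the merged region needs half as many cycles as the two blocks scheduled separately, because the two dependence chains fill slots that a single short block would leave empty.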
C. Power Effectiveness of Static Scheduling in NP Applications
We compare the power consumption of statically and dynamically scheduled architectures. As shown in Figure 8, the VLIW approach with the selected optimization option for each NP application shows 6.5x ~ 12x (9.1x on average) better energy efficiency than the compatible superscalar (4-issue) architecture with respect to the power per instruction (PPI). Even with basicblock optimization, the VLIW architecture is on average 5.6x more efficient than the compatible superscalar architecture model for the NP applications. Since emerging NP workloads require high throughput, parallel architectures are required in order to handle the packet data without slowing down line speed. The parallel implementation of network processors always comes with an increase in power dissipation, which cannot be ignored in clustered or multiple processors. However, large-issue and clustered architectures are still needed to reach the desired throughput.
VI. CONCLUSION AND FUTURE WORK
In this paper, we characterize the performance and power dissipation of statically identified ILP architectures and compare them to dynamically optimized architectures for network processing. In dynamically scheduled architectures, power consumption of the instruction window greatly increases with increasing issue widths. On the other hand, statically scheduled architectures show better performance as well as tremendous advantages in power, since they do not have instruction window wakeup/select logic, reorder buffer, and other scheduling-related hardware modules. With the large parallelism and the loop nature of network applications, our experimental analysis supports static scheduling as an appropriate strategy for network processor applications.
NP applications have large parallelism, and they have many loops in the algorithm. Even though NP applications are quite different from multimedia and DSP applications in architectural aspects, the success of VLIW in multimedia areas could be applied to NP domains. While media applications can achieve sufficient throughput with basic block optimization, NP applications need more aggressive optimization techniques in compilation.
Much research on VLIW architectures focuses on clustered VLIW [18], VLIW code compression [19] and value prediction modules [21] in order to get better performance. Using these techniques, the performance of NP applications could be improved with smaller code size (through compression) and more aggressive parallelism (through clustered architectures). In future work, we will attempt to analyze the architectural differences between multimedia applications and NP applications in order to find clues for improving performance. We will also consider the clustered VLIW architecture for network processors to obtain high parallelism.
