Abstract
Introduction
Programmable network processors are a viable alternative to custom ASICs for supporting high speed network processing applications. These application specific processors try to exploit packet-level parallelism in a network application. A network processor (NP) has multiple microengines, each supporting multiple thread contexts. Each thread is assigned to perform a specific task related to packet processing. Several NPs are available commercially, such as Intel IXP2400 [11], IBM Power NP [14] and Motorola C-Port [4].
When network packets are processed in an NP, they are buffered in an off-chip memory store, referred to as a packet buffer. These buffers also store packets during periods of congestion in the network. Packet buffer size is usually equal to round trip time * transmit rate. This could be as large as 150MB for an OC-48 link, assuming a 500ms round trip time. On account of their high storage capacity at low cost, DRAMs are the storage medium of choice for packet buffers. Since each packet is written into and read from the packet buffer, the DRAM must support twice the transmit bandwidth of the NP. The bandwidth requirement could be higher if parts of a packet are accessed multiple times. Exploiting DRAM bandwidth is the key to obtaining higher throughput in network applications [10]. In [9], the authors show that buffering packets in the DRAM is a bottleneck in network processing applications. In order to understand these overheads, we study the memory hierarchy in detail. Unlike the probabilistic model used in [9], our colored Petri net model captures the various hardware structures of the DRAM, including banks, rows, columns, page buffers and the data bus. We also model the packet buffer allocation, the DRAM address mapping scheme and the cell based interface for packet transmission. The model enables us to study the effect of multiple DRAM banks in hiding the bank access latency. The other levels of the memory hierarchy, such as the on-chip scratchpad memory and the off-chip SRAM, are also modeled. We model the processing of the most common network application, IP forwarding, on the IXP 2400 NP. Our colored Petri net model captures the timing of packet arrivals, access to packet buffers, on-chip and off-chip memory, and the state of the hardware structures and threads for processing packets. Unlike a trace based approach, this integrated framework allows us to capture the timing information and to study the interaction of the program with a given hardware.
This approach provides insight into the potential bottlenecks and enables us to measure the performance improvement from various packet buffer allocation schemes. The presence of multiple banks in a DRAM allows data transfer from one bank to be overlapped with access of another bank. The achievable bandwidth from the DRAM drops if the large access latency to the banks is not overlapped by data transfers from other banks. In an NP, the size of data to be transferred to/from the DRAM is variable. We observe that small data transfers to/from DRAM do not hide the bank access latency sufficiently, so the bandwidth realized from the system decreases. With the naïve packet buffering scheme in IXP2400, a significant portion (up to 30%) of the accesses are of small size, which we call narrow accesses. Using our Petri net model, we show that the bandwidth realized reduces by 7.7% in such a scenario. Based on this observation, we propose to buffer parts of the packet in the scratchpad memory in order to prevent narrow accesses to DRAM. Buffering packet data in on-chip memory is different from the earlier approaches proposed in [16, 7], where the application data structure is cached. These schemes exploit the temporal locality observed in accesses to program data such as the trie and classification tables. Buffering packet data in on-chip memory has not been studied previously, as this data is large and there exists little temporal locality in its access. Storing small segments of the packet in on-chip memory has low space overhead and has the added benefit of distributing the memory accesses to different levels of the memory hierarchy. Using the detailed Petri net model for IP forwarding, the throughput of the NP is studied using real traces from the internet. Previous studies [9] only considered traffic consisting of 64 byte sized packets and showed a throughput of 3.2 Gbps. For this trace, the limited computation power of the hash unit is observed to be an impediment.
However, we show that with real traces from the internet, a throughput of 6.2 Gbps is achievable. At higher transmit rates, we show that the data bus becomes the bottleneck. This demonstrates that different traffic patterns stress different components of the NP. Such behavior must be carefully considered while provisioning network resources for different traffic patterns. Our contributions in this paper include:
1. We present a detailed evaluation of the IP forwarding application on IXP 2400. The model captures the DRAM structures and packet allocation in different levels of the memory hierarchy along with the entire sequence of network processing steps. It enables us to study the hardware structures and access patterns that lead to performance bottlenecks.
2. We identify the reduction in DRAM bandwidth due to narrow accesses. Packet buffer allocation schemes that reduce narrow accesses are proposed and evaluated. These schemes do not assume the use of additional hardware in the NP and can improve the throughput by 21% on average with real traffic traces.
3. We find that diverse traffic patterns stress different components of the NP. With real traffic traces, it is shown that the data bus is the bottleneck resource.
We describe the architecture of IXP 2400 and DDR SDRAM (Section 2), followed by the description of their Petri net models and validation (Section 3). Section 4 explains the reason for narrow accesses to DRAM and the various allocation schemes to prevent such accesses. The improvement in throughput and utilization of NP resources are discussed in Section 5. We conclude the paper in Section 7.
Background
DRAM access involves not only managing resource conflicts but also maintaining precise timing requirements between commands. Given the observation that packet buffering is a performance bottleneck, a detailed Petri net model of DDR SDRAM helps us to study the resources and access patterns inhibiting higher throughput. In order to understand the setup of the simulation, we briefly describe the architecture of IXP2400 network processor and DDR SDRAM.
Intel IXP2400 Network Processor Architecture
Network processors are application specific processors that are geared towards processing of network applications. NPs have extensive support for multiple thread contexts, which aim to exploit the substantial amount of packet-level parallelism in network processing applications. The presence of multiple threads helps to hide the latency of accessing the program data in SRAM and packet data in DRAM. NPs typically use hardware engines to offload specialized tasks frequently required by network processing applications, such as hash calculation and CAM lookup. Figure 1 shows the architecture of Intel IXP2400. This NP has 8 simple in-order cores called microengines. The microengines perform the data plane processing. Each microengine (ME) can support up to 8 thread contexts. Each ME has 256 general purpose 32-bit registers which are partitioned among the threads. As a result, the threads have zero cycle context switch overhead.
IXP 2400 has off-chip SRAM and DDR SDRAM memories. The dual-ported quad data rate (QDR) SRAM [3] is used to store the program data like the lookup trie. The SRAM is connected to two on-chip controllers. DDR SDRAM with 4 banks is used to buffer packets. The SDRAM is connected to the on-chip memory controller by a 64 bit channel operating at 200MHz. The SRAM size is usually 8MB whereas the DDR SDRAM size may vary from 64MB to 1GB. There is an on chip scratchpad memory of size 16 KB, that may be used to store temporary variables, setup communication rings between threads and distribute memory accesses. Apart from these storage locations, each ME has 640 32-bit word local memory space. Further details of the Intel IXP2400 are available from [11] .
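As a rough illustration of what the 64 bit, 200MHz channel can deliver, the following sketch computes a peak data-bus bandwidth. It assumes that 200MHz is the bus clock and that a DDR device transfers data on both clock edges; the function name and the `transfers_per_clock` parameter are ours and are not taken from the IXP2400 documentation.

```python
def peak_bandwidth_gbps(bus_width_bits, clock_hz, transfers_per_clock=2):
    """Peak data-bus bandwidth in Gbps (10^9 bits per second).

    transfers_per_clock=2 is an assumption: a DDR device moves one
    bus-width word on each edge of the clock.
    """
    return bus_width_bits * clock_hz * transfers_per_clock / 1e9


# 64-bit channel at 200 MHz, as stated in the text.
peak = peak_bandwidth_gbps(64, 200e6)
print(f"peak DRAM channel bandwidth: {peak:.1f} Gbps")
```

This is an upper bound only. The bandwidth actually realized is lower whenever bank access latencies are not hidden behind data transfers.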
Packet Receive and Transmit Interface
As we focus on the buffering of packets in the NP, we discuss the processing involved in receiving, buffering and transmitting packets. The IXP communicates with the network physical device through the Media Switch Fabric (MSF). The MSF supports the CSIX-L1 [5] interface, which handles ATM and Ethernet traffic. It provides a cell based interface to the NP for receiving and transmitting packets. Each cell is called an mpacket. Its size can be configured to 64, 128 or 256 bytes. We assume an mpacket size of 64 bytes in the rest of the paper. The MSF has two buffers, called the receive buffer and the transmit buffer, which are used to hold the mpackets belonging to the input and output queues. Each of these on-chip buffers is 8KB in size. The state of the incoming mpackets is stored in a receive status word (RSW). The information about an outgoing mpacket is put in a control word. When a packet is received by the MSF, it signals the ME. Following this signal, a thread from a free pool is assigned for processing the packet. The thread reads the corresponding RSW and is responsible for allocating buffers and moving the mpacket data from the receive buffer to the packet buffer. In order to move the packet data within the NP, the MSF has a data path to the MEs and to the off-chip SRAM and DRAM memories.
DDR SDRAM Architecture
The IXP 2400 uses a DDR SDRAM as the packet buffer. The architecture of the DRAM device is shown in Figure 3 . Such a device has 4 memory arrays called banks which can be accessed independently. Each bank has an associated SRAM array, called page buffer which is used to hold the contents of the currently accessed row. The page buffers share IO gates that connect them to the data pins. The banks are controlled by the address and command bus. These buses are used to send the data address and to issue commands to the DRAM device.
Figure 3. SDRAM Architecture
The banked architecture allows the access of each bank to be pipelined. The other banks can be in various stages of memory access when data is fetched from a given bank. An access to the DRAM device consists of the following steps [19] .
• Precharge: The data in the page buffer belonging to the previously accessed row is restored to the memory array and the sense amps are prepared for the next access. This step has to be performed since the data in the DRAM cells is lost once it is read into the page buffer. The time to precharge is denoted by t_RP.
• Row access: The row address is sent over the address bus and is latched on to the row decoder by enabling the row access strobe (RAS) command. The entire row is read into the page buffer after a finite delay called the row to column delay (t_RCD).
• Column access: After the column address is latched onto the column decoder through the column access strobe (CAS) command, the appropriate data from the page buffer is available on the data bus after a finite delay called the column access latency (t_CL). The data passes through the IO gates which connect the page buffers to the data pins. The data is transferred on the data bus for 't' cycles, where 't' is determined by the data access length and bus width.
The memory controller is responsible for maintaining the precise timing between these steps. A page miss is said to occur when two consecutive accesses to a bank go to different rows. A page hit occurs when consecutive accesses to a bank are to the same row. Depending on the page miss behavior, the banks may be managed with either a closed policy or an open policy. In the closed policy, the page buffer is immediately precharged following a read. In contrast, the open policy leaves the data in the page buffer and does not precharge it until an access to a different row in the bank is seen. The detailed architecture of DDR SDRAM memory and the steps involved in accessing it are elaborated in [19].
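The access steps and the two page policies can be summarized in a small latency sketch. This is a simplified model of a single isolated read, not the memory controller's actual behavior: it ignores queueing, refresh and the minimum row-active time, and the timing values (3 cycles each) and the 16 bytes per cycle bus rate are hypothetical placeholders rather than data sheet numbers.

```python
from dataclasses import dataclass


@dataclass
class DramTiming:
    t_rp: int   # precharge time, in cycles
    t_rcd: int  # row to column delay, in cycles
    t_cl: int   # column access latency, in cycles


def transfer_cycles(size_bytes, bytes_per_cycle=16):
    """Data bus cycles 't' for one access (ceiling division).

    bytes_per_cycle=16 assumes a 64-bit bus moving data on both clock edges.
    """
    return -(-size_bytes // bytes_per_cycle)


def read_latency(timing, size_bytes, policy, page_hit=False):
    """Cycles from issuing an isolated read until its last data beat."""
    t = transfer_cycles(size_bytes)
    if policy == "closed":
        # The bank was precharged right after the previous access,
        # so only row access and column access are on the critical path.
        return timing.t_rcd + timing.t_cl + t
    if policy == "open":
        if page_hit:
            # The row is already in the page buffer.
            return timing.t_cl + t
        # Page miss: the old row must be precharged first.
        return timing.t_rp + timing.t_rcd + timing.t_cl + t
    raise ValueError(f"unknown policy: {policy}")


timing = DramTiming(t_rp=3, t_rcd=3, t_cl=3)  # hypothetical values
for policy, hit in [("closed", False), ("open", True), ("open", False)]:
    print(policy, "hit" if hit else "miss", read_latency(timing, 64, policy, hit))
```

Under these assumptions the closed policy gives a constant latency, while the open policy is faster on a page hit and slower on a page miss.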
Petri Net Model for Packet Buffering on IXP2400
In this section we describe our Petri net model for DDR SDRAM and IP forwarding on IXP2400. The DDR SDRAM model is independently validated against DRAMsim [20] , an accurate DRAM simulator, before it is integrated with the IXP2400 Petri net model. The entire model is then validated with Intel IXA SDK [12] .
DDR SDRAM Petri Net Model
Our Petri net models the primary structures of DDR SDRAM such as the data bus, rows and columns of the memory array, page buffer, multiple banks, IO gating, and address and command bus (Figure 4). Unlike a superscalar processor with caches, in an NP environment the DRAM has to service requests of different sizes. Therefore the transmission time over the data bus depends on the size of the data. For the DDR SDRAM model we use the resource usage model proposed by Wang [19], which allows two accesses to be processed concurrently as long as they do not use a common resource. This generic access protocol enables us to model the various degrees of overlap associated with DRAM commands. The page buffers are managed with a closed policy as in IXP 2400. The presence of multiple resources is elegantly modeled using colored tokens. We describe the Petri net model for processing a read access to the DRAM. In the following discussion, the places and transitions in the Petri net are shown in italics. An instantaneous transition is represented by a thin line (example ReadMemCtrlTransform) and a timed transition by a thick line (example RowAddrTrans). In our model, each timed transition takes a fixed amount of time, determined by the respective activity. The availability of hardware resources like banks, data bus, IOgating, and address and command bus is modeled by tokens in the corresponding places. Transitions that are described below are highlighted with colored input and output arcs in Figure 4. A request is presented to the DRAM by placing a token in the DRAMReq place (shown by a double circle). The token contains the size of the access, type (read/write) and address of the DRAM request. From the address of the memory request, the bank, row and column are identified by the memory controller based on the address mapping scheme. This action is modeled by the ReadMemCtrlTransform transition.
When a bank is free, the row address along with the row activation command is sent over the address and command bus (RowAddrTrans transition). After the row activation delay, modeled by the RowAct transition, the row is read into the page buffer. This is captured by the presence of tokens in the WaitColAddr place. The column address is sent to the DRAM device over the address and command bus. The use of the address and command bus for RowAddrTrans and ColAddrTrans is modeled by arcs between the AddrCmdBus place and these transitions. The data to be read is available for output after a fixed interval of time, modeled by the ColAccess transition. The data passes through the IOgating that connects the page buffer to the input-output pins (ReadDataTrans transition). The CpuTrans transition captures the transmission of data over the data bus to the MEs. The time interval for which the bus is used is determined by the size of the data being transferred. The request is completed by placing the token in the CpuBuf place. Since we model a closed page policy, the bank is precharged following the row access: after the transfer of data through the IOgating, the row is precharged. These actions are captured by the PrechargeCmdTrans and Precharge transitions. The minimum delay between the row access command and precharge within a bank is modeled by RASLatency. Following the precharge, the bank can be used to service the next access. The delays between these commands are determined by the DRAM access protocol. The time periods for the transitions are taken from the data sheet [1]. The memory controller can service only a fixed number of DRAM accesses simultaneously. This is modeled by the presence of a fixed number of tokens in the DramSlots place. The DRAM memory is refreshed at regular intervals of time, which is modeled by RefreshInterval. During refresh, row activation is inhibited by the presence of a token in the WaitRefresh place. The DRAM does not accept any requests for the duration of the RefreshInterval transition.
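The way tokens in a place such as DramSlots bound the number of accesses in service can be illustrated with a minimal, untimed and uncolored Petri net. This sketch is not the model of Figure 4: the `Admit` and `Complete` transitions and the `InService` place are hypothetical names introduced only for this example, and the slot count of 2 is arbitrary.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net: a marking plus weighted arcs."""

    def __init__(self, marking):
        self.marking = Counter(marking)
        self.transitions = {}

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (Counter(inputs), Counter(outputs))

    def enabled(self, name):
        # A transition is enabled when every input place holds enough tokens.
        inputs, _ = self.transitions[name]
        return all(self.marking[place] >= n for place, n in inputs.items())

    def fire(self, name):
        if not self.enabled(name):
            raise RuntimeError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        self.marking.subtract(inputs)  # consume input tokens
        self.marking.update(outputs)   # produce output tokens


# Five pending requests, but only two controller slots (arbitrary numbers).
net = PetriNet({"DRAMReq": 5, "DramSlots": 2})
net.add_transition("Admit", {"DRAMReq": 1, "DramSlots": 1}, {"InService": 1})
net.add_transition("Complete", {"InService": 1}, {"CpuBuf": 1, "DramSlots": 1})

admitted = 0
while net.enabled("Admit"):
    net.fire("Admit")
    admitted += 1
print("requests admitted before any completion:", admitted)
```

Firing `Complete` returns a token to DramSlots, which re-enables `Admit`. Timed transitions and colored tokens, which the actual model relies on, are left out to keep the sketch short.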
Validation of DRAM Petri Net Model
We use CNET Petri net simulator [21] to obtain performance metrics from the model. This simulator provides a rich set of features to model Petri nets including colored tokens, timed transitions, priority based transitions. We validate our Petri net based DDR SDRAM model with the DRAMsim [20] simulator which accurately models DDR SDRAM. DRAM access traces of the IP forwarding application on IXP2400 simulator and the IPSec application on IXP2850 simulator of the IXA SDK are used as input to both the simulators. The access size is fixed at 64 bytes as DRAMsim cannot vary the size of accesses within a trace. Figure 5 shows the throughput observed from both models. The Petri net model predicts throughput within 4% of the throughput predicted by DRAMsim. DRAMsim only supports DRAM accesses whose size is equal to the cache line size. However, our DRAM Petri net model can support accesses of different lengths and is suited for modeling NP applications. 
IPv4 Forwarding Petri Net Model
We built a colored Petri net model of IPv4 forwarding application running on IXP2400. Each thread in the base scheme assumes the processing flow as described in [18] . The thread assigned for packet processing moves the mpackets into the DRAM. During processing, the header is read into the registers. The time-to-live field is decremented and the destination address is used to hash into the lookup table entry in the SRAM. This provides the output interface for the next hop. The modified header is written back into the DRAM after recomputing the checksum. The packet is then enqueued in the transmit buffer. The control words for the transmit mpackets are suitably updated. Each token corresponding to a packet carries information about its allocation address. This allows us to model the bank conflicts during DRAM access. Unlike the model in [9] , the Petri net model handles packets of different lengths. We model the cell based interface for receiving and transmitting packets (described in Section 2.1.1), the SRAM and scratchpad memories and the buses connecting them to the MEs. The DRAM model described in Section 3.1 is integrated with the IXP2400 model. High order interleaving is used to map the memory address to DRAM banks. The Petri net model for IPv4 forwarding is validated with the IXP simulator available with the IXA SDK. We compare the ME utilization and the throughputs obtained from both the models. A list of free threads is maintained and a single thread is assigned to each packet that arrives at an ingress port. In such a model, each thread performs the tasks related to receiving a packet, buffering it, processing for IP forwarding and transmitting the packet by enqueuing it in the transmit buffer. As in [18] , a stream of 64 byte packets is used to simulate the traffic pattern observed under denial of service(DOS) attacks. For each configuration, the rate of input traffic is the same as the throughput supported by the NP in that configuration. 
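The high order interleaving mentioned above can be sketched as follows. The sketch assumes that the bank is selected by the most significant part of the address, so each bank holds one contiguous quarter of the address space; the 64MB capacity is one of the sizes quoted earlier, and the low order variant is shown only for contrast. The exact bit fields used by the IXP2400 memory controller are not given in the text, so both functions are illustrative.

```python
def high_order_bank(address, total_bytes, num_banks=4):
    """Bank index when the high order address bits select the bank."""
    bank_size = total_bytes // num_banks
    return address // bank_size


def low_order_bank(address, access_bytes=64, num_banks=4):
    """Bank index when consecutive 64-byte blocks rotate across banks."""
    return (address // access_bytes) % num_banks


TOTAL = 64 * 1024 * 1024  # 64MB DRAM, assumed for illustration
for addr in (0, 64, 128, 16 * 1024 * 1024):
    print(addr, high_order_bank(addr, TOTAL), low_order_bank(addr))
```

With high order interleaving, consecutive mpackets of one buffer fall in the same bank, so bank-level overlap comes from accesses to different packet buffers rather than from consecutive cells of one packet.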
The throughput obtained for various configurations of the MEs and threads per ME is shown in Figure 6 (a). Figure 6 (b) shows the ME utilization obtained with the IXA SDK simulator and with the Petri net model. We compare the trends in the throughput between the IXA SDK simulator and the Petri net model. We observe that the throughput increases with the number of threads. In the IXA SDK, a maximum throughput of 2410 Mbps was achieved with 8 threads, after which the throughput drops marginally. This is because the IXP implements back pressure on the MEs when the DRAM request queue length is above a certain threshold [11]. When the number of outstanding DRAM accesses crosses this threshold (10 in the case of IXP2400), the DRAM controller sends an almost-full signal to the MEs, as a result of which further requests are inhibited. This reduces the rate of execution of the MEs, which in turn reduces the requests to the DRAM. The Petri net model does not implement back pressure. Therefore, its throughput reaches a higher value of 3210 Mbps and saturates beyond 16 threads. The DRAM Petri net model is also more detailed than the DRAM model in the IXA SDK. As a result, there is a difference in the throughput obtained from the two models, but the overall trend is the same. The utilization of the MEs follows the same patterns
Packet Buffer Allocation Schemes
In this section, we first explain the detrimental effect of narrow accesses on DRAM bandwidth and quantify its impact. We then identify the causes for narrow accesses and propose allocation schemes to alleviate their negative impact.
Why Do Narrow Accesses Affect Bandwidth?
An access to the DRAM involves RAS, CAS and precharge steps, in addition to data transfer (refer to Section 2.2). While the delays involved in the RAS and CAS steps, viz. t_RCD and t_CL, are fixed, the data transfer time is proportional to the size of the access. Note that, when the banks are free, the RAS steps (and similarly the CAS steps) corresponding to accesses to two different memory banks can be overlapped (as shown in Figure 7), except for the duration for which the address bus is used. When successive accesses to different memory banks transfer "sufficient" data, these accesses can be overlapped as shown in Figure 7, such that the data bus is fully utilized. In such a situation, the RAS, CAS and precharge latencies are not exposed, as consecutive memory accesses overlap. For the DDR SDRAM, such a situation occurs if the accesses are for 64 bytes or larger. We refer to these accesses as wide accesses and to accesses of less than 64 bytes as narrow accesses. Figure 8 shows data transfers when some of the accesses are less than 64 bytes. In this situation, even though the data transfer for the first access is complete at time step 11 and the data bus is free, the transfer for the second access does not start, as its CAS step is not complete. This results in under-utilization of the data bus, which in turn leads to lower DRAM bandwidth. What is the impact of these narrow accesses on DRAM bandwidth? With 64 byte packet traffic, we find that up to 50% of the accesses to the DRAM are narrow accesses, and with real traffic traces, 30% of the accesses are narrow. Using our Petri net model for DRAM, we evaluate the DRAM bandwidth when a certain part of the traffic results in narrow accesses. In this experiment, we use a trace with read and write accesses of two different sizes: 64 byte accesses, which utilize the data bus fully, and 32 byte accesses, which are considered narrow accesses.
The trace has equal number of read and write accesses which are randomly distributed across banks. For different mixes of 64 and 32 byte accesses the DRAM bandwidth achieved is shown in Figure 9 . 
Figure 9. Effect of narrow accesses
The bandwidth reduces by up to 15% when 50% of the accesses are 32 bytes. With 30% narrow accesses, as in real traces, the bandwidth reduces by 7.7%. The bandwidth reduction would be higher if the data accesses were narrower. The packet distribution in the internet leads to narrow accesses to the packet buffer. This is explained using the following traces from the NLANR project [2]. PSC (PSC-1135643836-1) and FRG (FRG-1133697651-1) are edge traces, whereas AMP (AMP-1124777370-1) is a core trace. Figure 10 shows the packet lengths and their distribution in the three traces considered. Packet length is observed to be bimodal, with modes around 40 bytes and 1500 bytes [17]. We observe that packets of length less than 64 bytes constitute about 42% of the total number of packets in the PSC trace. In the FRG and AMP traces, this number is 33% and 20% respectively. Such small packets lead to narrow accesses to the DRAM during packet buffering. In contrast, when DRAM is used in a superscalar processor, the size of a memory access is typically fixed to the L2 cache line size. Since DRAM bandwidth affects the throughput of the NP [10], we focus on preventing narrow accesses, which are detrimental to performance.
Figure 10. CDF of packet length distribution
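The mechanism by which narrow accesses leave the data bus idle can be reproduced with a toy timing model. This is a sketch under strong assumptions and is not our Petri net model: accesses are issued round-robin over 4 closed-page banks, all requests are available at time zero, the address and command bus and refresh are ignored, and the timing values (3 cycles each for t_RCD, t_CL and t_RP, 16 bytes per bus cycle) are hypothetical. It therefore does not reproduce the 7.7% and 15% figures reported above, only the direction of the effect.

```python
def realized_bandwidth(sizes, num_banks=4, t_rcd=3, t_cl=3, t_rp=3,
                       bytes_per_cycle=16):
    """Average bytes per cycle for a sequence of access sizes (toy model)."""
    bank_free = [0] * num_banks  # cycle at which each bank can start a row access
    bus_free = 0                 # cycle at which the shared data bus is free
    for i, size in enumerate(sizes):
        bank = i % num_banks                      # round-robin bank assignment
        transfer = -(-size // bytes_per_cycle)    # data bus cycles, rounded up
        data_ready = bank_free[bank] + t_rcd + t_cl
        start = max(data_ready, bus_free)         # wait for the data and for the bus
        end = start + transfer
        bus_free = end
        bank_free[bank] = end + t_rp              # closed policy: precharge after access
    return sum(sizes) / bus_free


wide = realized_bandwidth([64] * 1000)
mixed = realized_bandwidth([64, 32] * 500)
narrow = realized_bandwidth([32] * 1000)
print(f"wide {wide:.2f}  mixed {mixed:.2f}  narrow {narrow:.2f} bytes/cycle")
```

With 64 byte accesses the four transfers of one round take longer than a bank's full access cycle, so the bus stays busy. With 32 byte accesses the banks cannot be recycled fast enough and the bus idles between transfers.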
We propose changes in buffer allocation to reduce narrow accesses to the DRAM. Since mpackets belonging to the middle of the packet are 64 bytes long, they require wide accesses and are stored in the DRAM. In the proposed schemes, we modify the buffering for the first mpacket and the tail portion of the packet.
Header buffering (HB):
In header processing applications, the header is read from the DRAM after buffering the packet. In the case of IP forwarding, the header is 20 bytes. Accessing it results in a narrow access of 24 bytes. Govind et al. [9] propose storing the header in the SRAM in an effort to reduce the fully saturated DRAM utilization. We observe that the packet header is accessed multiple times and exhibits a certain amount of locality. Thus storing the header in a faster on-chip memory such as the scratchpad memory not only improves the access latency of the packet header, but also prevents narrow accesses to DRAM, which have an adverse impact on the bandwidth. Since the header occupies limited space, it can be stored in scratchpad memory.
First cell buffering (FCB):
Even with header buffering, there are still a number of narrow accesses. Accessing the remaining part of the first mpacket from the DRAM leads to a narrow access, especially if the packet itself is of small size (≤ 64 bytes). For example, when header buffering is used for IP forwarding, the first 24 bytes are stored in the scratchpad memory. The remaining 40 bytes required to form the first cell of the packet would still lead to a narrow access. This can be overcome by storing the first cell in scratchpad memory. With FCB, packets smaller than 64 bytes and the first mpacket of larger packets are entirely buffered in scratchpad memory. From the packet length distribution of the real traces (refer to Figure 10), we find that 30 to 50% of packets are stored in on-chip scratchpad memory. Note that the first mpacket can be easily identified as it has the SOP bit set.
From our Petri net model, we observe that throughput saturates with 16 threads. Conservatively, even if we assume that 32 free threads are used in the free pool of threads and each thread is statically assigned 64 bytes in scratchpad for packet buffering, this scheme requires only 2KB of memory. This requirement can be easily met by Intel IXP2400 which has 16KB of scratchpad memory.
First cell buffering + Tail buffering (FCB+TB):
The other source of narrow accesses to the DRAM is the tail portion of the packet (shown in Figure 2). In the case of Ethernet traffic, packet length can vary from the minimum size of 40 bytes to the maximum length of 1500 bytes.
As the length of a packet need not be a multiple of the cell size, the last part of the packet usually results in a narrow access to the DRAM. Thus we propose buffering the tail portion of a packet in the on-chip scratchpad memory along with the first cell. The tail portion of the packet occupies a bounded amount of space and buffering it in the scratchpad memory along with the first cell requires twice as much memory as FCB. 
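The placement decisions of the base, FCB and FCB+TB schemes can be summarized in a short sketch. It models only the buffering of each cell, not the later header read that header buffering targets, and it assumes that a last cell of exactly 64 bytes is treated as a wide access and left in DRAM. The function names and the scheme labels passed as strings are ours.

```python
MPACKET = 64  # mpacket (cell) size in bytes, as assumed in the paper


def split_cells(packet_len):
    """Split a packet into full mpackets plus an optional shorter tail."""
    full, tail = divmod(packet_len, MPACKET)
    return [MPACKET] * full + ([tail] if tail else [])


def place(packet_len, scheme):
    """Return (size, location) for every cell under 'base', 'FCB' or 'FCB+TB'."""
    cells = split_cells(packet_len)
    last = len(cells) - 1
    placement = []
    for i, size in enumerate(cells):
        if i == 0 and scheme in ("FCB", "FCB+TB"):
            placement.append((size, "scratchpad"))   # first cell stays on chip
        elif i == last and size < MPACKET and scheme == "FCB+TB":
            placement.append((size, "scratchpad"))   # short tail stays on chip
        else:
            placement.append((size, "DRAM"))
    return placement


def narrow_dram_accesses(packet_len, scheme):
    """Count cells buffered in DRAM with an access smaller than one mpacket."""
    return sum(1 for size, where in place(packet_len, scheme)
               if where == "DRAM" and size < MPACKET)


for length in (40, 150, 1500):
    print(length, [narrow_dram_accesses(length, s) for s in ("base", "FCB", "FCB+TB")])
```

For a 150 byte packet, for example, the cells are 64, 64 and 22 bytes: FCB moves the first cell on chip but still leaves the 22 byte tail as a narrow DRAM access, which FCB+TB removes.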
Remarks
The allocation information regarding the storage of different parts of the packet in various levels of the memory hierarchy must be stored during packet processing. For this purpose, two additional fields are added to the packet descriptor. These fields point to the first cell and tail portion of the packet in scratchpad. The amount of data stored in these portions can be derived from the packet length information in the packet descriptor. For FCB+TB, each thread requires a bounded amount of scratchpad memory. This is equal to twice the mpacket size, 128 bytes in our case. The scratchpad space allocated for buffering is divided into fixed sized segments. The segments can be dynamically assigned to each thread to ensure maximum utilization of scratchpad memory. A key observation about our schemes is that the reported performance benefits are achieved by enabling greater utilization of existing resources and without the use of any additional hardware to the base IXP2400 scheme.
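The scratchpad requirement quoted for the two schemes follows from simple arithmetic, sketched below for the conservative case of 32 threads with a statically assigned segment per thread. The function name is ours; the 64 byte mpacket size, the 32 thread count and the 16KB scratchpad size are taken from the text.

```python
def scratchpad_bytes(threads, mpacket_bytes=64, tail_buffering=False):
    """Scratchpad space needed when each thread owns a fixed segment.

    FCB needs one mpacket per thread; FCB+TB needs two (first cell and tail).
    """
    per_thread = mpacket_bytes * (2 if tail_buffering else 1)
    return threads * per_thread


SCRATCHPAD = 16 * 1024  # on-chip scratchpad size of the IXP2400 in bytes
for name, tb in (("FCB", False), ("FCB+TB", True)):
    need = scratchpad_bytes(32, tail_buffering=tb)
    print(f"{name}: {need} bytes, {100 * need / SCRATCHPAD:.1f}% of scratchpad")
```

Even FCB+TB with 32 threads uses a quarter of the scratchpad, which leaves room for its other uses such as communication rings between threads.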
Improvement in Throughput and Memory Hierarchy Analysis
In this section, the performance improvement achieved by the proposed schemes is quantified. Further, the Petri net model is used to identify the performance constraining resources in the memory hierarchy.
Impact of Buffering Schemes
The packet lengths from real traces are used as input for the simulation. The simulation is run with different configurations, varying the number of MEs and threads per ME. For all buffering schemes, we observe that the throughput increases when the number of threads increases from 1 to 8. Since we intend to study the effect of network processing on the memory hierarchy at higher packet forwarding rates, we report results only for 8 or more threads. Figure 11 shows the throughput improvement due to various buffering schemes. In order to understand the effect of the reduction in narrow accesses, the bandwidth realized by the DRAM under the base scheme and with first cell buffering + tail buffering (FCB+TB) is measured (Figure 12). For the three traces, the bandwidth realized from the DRAM improves by up to 4%. This has contributed to some of the improvement in throughput observed in the FCB+TB scheme. The remaining increase in throughput is attributed to the distributed allocation of packets, which improves the parallel utilization of resources. Further, with FCB+TB, a peak throughput of 6.25 Gbps is achieved with 16 threads. In contrast, the throughput saturates at 5.05 Gbps in the base scheme. The data bus utilization increases from 97.4% to 99%.
Utilization of NP structures
The Petri net model allows us to study the utilization of various resources within the NP and DRAM. Such a study helps us to identify the performance constraining structures in packet buffering. We study the utilization of resources in the NP with worst case traffic consisting of 64 byte packets, referred to as the DOS trace, and with traffic from real traces.
Figure 13. NP resource utilization (FCB+TB)
Figure 13 presents the utilization of the data bus, DRAM banks, MEs and hash unit observed for a configuration with 4 MEs and 4 threads per ME. The ME utilization is the fraction of time a single ME is busy in computation, and the bank utilization is the average utilization of the DRAM banks. The bank utilization is observed to be uniform across all banks. With real traces, the utilization of the bus reaches 99% at a throughput of around 6200 Mbps, which is almost twice that observed with minimum sized packets (refer to Figure 15). Thus for real traces, we observe that the data bus is the bottleneck resource. Two sets of simulations were performed to substantiate this claim. In one experiment, the bus width was doubled while keeping the number of banks the same (4). In the other experiment, the number of banks was doubled with the bus width kept at 64 bits. Figure 14 shows the throughput with the FCB+TB scheme for different configurations when real traces are used as input. The throughput with the base IXP configuration is shown in the first row. When the bus width is increased to 128 bits, keeping the number of banks constant, the throughput is observed to increase by 33%. However, with a 64 bit bus and 8 DRAM banks, the throughput increases by only 1%. This clearly shows that the data bus connecting the off-chip DRAM to the MEs is the bottleneck in achieving higher throughput. It also explains the moderate increase in bus utilization when using FCB+TB.
Figure 14. DRAM configurations
Figure 15 shows the throughput and utilization of various NP resources with the base allocation scheme for the DOS trace.
The number of banks and the data bus width are varied. It is seen that for such traffic, the hash unit becomes the bottleneck resource. There is no significant improvement in throughput when the bus capacity is doubled or the number of banks is increased from 4 to 8. In fact, the utilization of these resources decreases by roughly 50%. The ME utilization, for a 4x4 configuration, increases marginally with the throughput. In all these configurations, the hash unit utilization is close to 100%, making it the bottleneck resource. This observation is in line with the result shown in [9]. The utilization of NP resources varies widely with different traffic traces. The throughput of the NP is observed to be highly dependent on the input traffic. Traffic consisting of 64 byte packets, as in a DOS attack, requires greater compute resources. However, if the traffic is similar to real traces, then greater throughput can be achieved by having multiple channels. Provisioning processing and IO capacity for a network application must consider these traffic conditions.
Related Work
Hasan et al. [10] propose opportunistic mechanisms such as a locality aware packet buffer allocation scheme and batched access algorithms in order to increase page hits. We observe that page misses are harmful only when the latency of the miss is exposed by narrow accesses. We try to reduce such accesses to DRAM by distributing the narrow accesses to scratchpad. Improving row access locality provides performance benefit only when the data bus connecting the DRAM to the MEs is not a bottleneck. Our simulations show that the data bus is an impediment to achieving higher throughput. By keeping the most frequently used part of the packets in on-chip memory, we try to reduce the amount of data transferred on the data bus. A Petri net model is used for evaluation and design space exploration of NPs in [9]. The study shows that accessing DRAM is a performance bottleneck. We use a detailed model of SDRAM to study packet buffering issues. Narrow accesses to DRAM are identified as one of the factors leading to lower DRAM bandwidth. Iyer et al. [13] propose the use of an SRAM to store an entire packet before buffering it in the DRAM. Their scheme uses a large number of banks, a wide bus and a look ahead mechanism to schedule requests to DRAMs, but does not try to improve the utilization of available resources. In contrast, our proposal tries to maximize the utilization of existing resources to improve throughput. The proposals in [8, 6] implement complex memory controllers for packet buffering in high end routers. These schemes implement look ahead, memory reordering or universal hashing mechanisms in the memory controller to provide packet buffer access guarantees. Our schemes require no changes in the existing memory controller policy.
Conclusions
Using a detailed Petri net model for packet buffering in an IP forwarding application, we show that narrow accesses to the packet buffers in NPs reduce the bandwidth realized by DRAMs. Distributed packet buffering strategies are proposed, which lead to a 21% average improvement in throughput with real traffic traces. The on-chip memory requirement for the proposed schemes is low. Further, our performance study demonstrates that the data bus is the bottleneck resource, especially for real world traces. Using the integrated Petri net model, the utilization of NP resources is shown to vary widely with different types of network traffic. Other header processing applications such as network address translation have similar DRAM access characteristics [9] and can benefit from the buffering schemes proposed. We plan to study the effect of congestion on packet buffer allocation, as the utilization of the on-chip memory may drop in this case. The performance of the NP under various traffic mixes also remains to be studied.
