A packet memory stores packets in Internet routers and typically requires $RTT \times C$ of buffer space, e.g., several GBytes, where $RTT$ is the average round-trip time of a TCP flow and $C$ is the bandwidth of the router's output link. It is implemented with DRAM parts that are accessed in parallel to achieve the required bandwidth. These parts consume significant power in a router, whose scalability is heavily limited by power and heat problems. Previous work shows the packet memory size can be reduced to $RTT \times C / \sqrt{N}$, where $N$ is the number of TCP flows.
Introduction
In routers, scalability is a critical issue and is usually limited by power. Packet memory accounts for a significant portion of a router's power consumption [1]. The off-chip DRAM packet memory holds the output queues in which incoming packets are stored. Since the occupancy of a queue varies over time, its storage space is allocated dynamically using a linked-list data structure [2], with one linked list per queue. Figure 1 (A) shows a conventional packet buffer in which the linked list for a queue contains two packets, P0 and P1. Whenever a packet is added to or removed from a linked list, the off-chip DRAM memories are accessed, and these memories consume tens of watts.
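As an illustration of the per-queue linked list described above, the following minimal sketch keeps one shared pool of memory segments and chains the segments of each queue through a next-pointer table. The class names (`SegmentPool`, `LinkedQueue`) are hypothetical, and the one-segment-per-packet simplification is an assumption made for brevity, not a detail taken from [2].

```python
class SegmentPool:
    """Shared pool of fixed-size memory segments (hypothetical model)."""

    def __init__(self, num_segments):
        self.next = [None] * num_segments      # next-pointer per segment
        self.free = list(range(num_segments))  # indices of free segments

    def alloc(self):
        # Return a free segment index, or None when the pool is exhausted.
        return self.free.pop() if self.free else None

    def release(self, seg):
        self.next[seg] = None
        self.free.append(seg)


class LinkedQueue:
    """One output queue stored as a linked list of segments."""

    def __init__(self, pool):
        self.pool = pool
        self.head = None
        self.tail = None
        self.depth = 0  # number of packets currently queued

    def enqueue(self):
        # Assumption: each packet occupies exactly one segment.
        seg = self.pool.alloc()
        if seg is None:
            return None  # buffer full: the packet would be dropped
        if self.tail is None:
            self.head = seg
        else:
            self.pool.next[self.tail] = seg
        self.tail = seg
        self.depth += 1
        return seg

    def dequeue(self):
        if self.head is None:
            return None
        seg = self.head
        self.head = self.pool.next[seg]
        if self.head is None:
            self.tail = None
        self.pool.release(seg)
        self.depth -= 1
        return seg
```

Every `enqueue` and `dequeue` touches the pointer table and the segment itself, which is why each packet arrival and departure costs memory accesses in a conventional off-chip buffer.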
Previous work [3] suggests that core routers need only a small packet buffer as the number of TCP flows increases. This motivates us to propose a memory architecture that splits the packet memory into on-chip and off-chip memories to save power. In addition, we propose a novel packet mapping method that estimates the latency of packets through a router and maps them to on-chip or off-chip memory accordingly. By utilizing the on-chip memory efficiently, the method can reduce power consumption significantly.
Low Power Packet Memory Architecture
Embedded DRAM makes it possible to integrate a relatively large memory on a die, and it consumes far less power than off-chip DRAM. In our scheme, embedded DRAM implements the on-chip memory, and both the on-chip and off-chip memories are segmented in the same way, as shown in Fig. 1 (B). When a packet arrives at the router, memory segments for the packet are allocated from the on-chip memory if space is available there. If the occupancy of the on-chip memory exceeds a predefined threshold, new segments are allocated in the off-chip memory according to the proposed mapping method. Figure 1 (B) shows a case where the linked list for a queue straddles the on-chip and off-chip memories. The rationale behind this mapping is the finding in [3] that the output link bandwidth utilization can reach 99% when the packet memory size is

$$\frac{RTT \times C}{\sqrt{N}},$$

where $N$ is the number of TCP flows, provided that most of the traffic consists of desynchronized, long-lived TCP flows. Considering that modern routers support tens or hundreds of thousands of queues [3], [4], the scaling by $\sqrt{N}$ could be substantial. This implies that a relatively small on-chip memory can eliminate most off-chip memory accesses and save significant power. In the next section, we show how the proposed method allocates space for incoming packets in either on-chip or off-chip memory under different scheduling schemes to maximize the power saving.
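A quick numerical check of the $\sqrt{N}$ scaling may help here. The sketch below assumes the sizing rule from [3] is $RTT \times C / \sqrt{N}$ and that 1 KB means 1000 bytes; the function name `buffer_bytes` is ours.

```python
import math


def buffer_bytes(rtt_s, link_bps, n_flows=1):
    """Packet memory size in bytes for the rule RTT * C / sqrt(N).

    rtt_s    -- average round-trip time in seconds
    link_bps -- output link bandwidth in bits per second
    n_flows  -- number of long-lived TCP flows (N); N = 1 gives RTT * C
    """
    return rtt_s * link_bps / math.sqrt(n_flows) / 8


# 80 msec RTT, 100 Mbps link, 100 flows: the values used in the
# priority scheduling experiment later in the paper.
classic = buffer_bytes(0.080, 100e6)        # RTT * C
scaled = buffer_bytes(0.080, 100e6, 100)    # RTT * C / sqrt(100)
print(f"RTT*C = {classic / 1e3:.0f} KB, with N=100: {scaled / 1e3:.0f} KB")
```

With these inputs the rule shrinks the buffer by a factor of ten, which is what makes an on-chip implementation plausible.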
Latency-Aware Packet Mapping Algorithm
Although the router tries to capture most packets in the on-chip memory in the proposed method, an off-chip memory is still needed to handle the following overflow cases. First, the number of queues supported in routers may not be fixed because of reprogramming in the field or a time-varying set of active queues. For a small number of queues, the buffering requirement grows. Second, routers use random early detection (RED) to desynchronize TCP flows, and they need a buffer larger than the minimum requirement to make RED work and to avoid the TCP synchronization caused by tail drops [5]. Finally, the work in [3] assumes a single output queue and simple FIFO scheduling. Real routers, however, shape the bandwidth of queues non-uniformly and employ various scheduling algorithms, e.g., priority scheduling or deficit round robin (DRR) scheduling [6]. Thus, it is very difficult to estimate the optimal on-chip memory size, and a misprediction could result in overflow into the external memory.

Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
To handle the case where the on-chip memory fills up, we propose the packet mapping algorithm in Fig. 2. The mapping algorithm uses three threshold values: the overflow error threshold (global), the overflow warning threshold (global), and the congested threshold (per queue). The overflow error threshold is set to the size of the on-chip memory. The overflow warning threshold is programmed to be 5 to 10% less than the overflow error threshold so that directing packets for congested queues to the off-chip memory does not kick in too early. When the occupancy of the on-chip memory is between the overflow warning and overflow error thresholds, the memory space for packets arriving at congested queues is allocated in the off-chip memory. This avoids thrashing the on-chip memory, because those packets are likely to stay in the router for a long time due to the congestion. The key idea of our mapping algorithm is to identify the packets that potentially have a large latency through the router and place them in the off-chip memory when the on-chip memory is about to become fully occupied. The latency is estimated by comparing the depth of each queue against its per-queue congested threshold. The congested threshold in Eq. (2) follows from Little's law in Eq. (1):

$$L_i = \lambda_i \times W_i \qquad (1)$$

$$\text{congested threshold}_i = \alpha_i \times \lambda_i \times \text{target } W_i \qquad (2)$$
In Eq. (1), $L_i$, $\lambda_i$, and $W_i$ are the average queue length, average effective arrival rate, and average queue latency for queue $i$, respectively. The average effective arrival rate is typically given for TCP flows because the arrival rate is shaped to the allocated bandwidth. Thus, by measuring the depth of a queue, we can estimate the latency of its packets. We choose the target $W_i$ to be greater than the average latency of the low-latency queues but smaller than that of the high-latency queues, so that only packets for the low-latency queues are dynamically allocated in the on-chip memory. The target $W_i$ is determined statically from the average queue latencies, which follow from the queue bandwidth allocation and the input traffic pattern; we measure the average latency and depth of each queue through simulations. $\alpha_i$ is a multiplication factor determined empirically from experiments so that the congested threshold for queue $i$ is higher than the average queue depth of the low-latency queues but lower than that of the high-latency queues.
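The mapping decision described in this section can be sketched as follows. This is our reading of the prose rather than a transcription of Fig. 2: the boundary comparisons (`>` versus `>=`), the use of bytes for all quantities, and all names (`Thresholds`, `map_packet`) are assumptions.

```python
from dataclasses import dataclass

ON_CHIP = "on-chip"
OFF_CHIP = "off-chip"


@dataclass
class Thresholds:
    overflow_error: int    # bytes; set to the on-chip memory size
    overflow_warning: int  # bytes; 5 to 10% below overflow_error


def map_packet(pkt_bytes, on_chip_occupancy, queue_depth,
               congested_threshold, thr):
    """Choose the memory for one arriving packet.

    on_chip_occupancy   -- bytes currently held in on-chip memory (global)
    queue_depth         -- bytes currently queued in the destination queue
    congested_threshold -- per-queue threshold in bytes
    """
    # The on-chip memory cannot hold the packet: overflow to off-chip.
    if on_chip_occupancy + pkt_bytes > thr.overflow_error:
        return OFF_CHIP
    # Between warning and error: only packets for congested queues, which
    # are expected to stay in the router for a long time, go off-chip.
    if (on_chip_occupancy >= thr.overflow_warning
            and queue_depth > congested_threshold):
        return OFF_CHIP
    return ON_CHIP


thr = Thresholds(overflow_error=50_000, overflow_warning=45_000)
print(map_packet(1500, 46_000, 20_000, 5_160, thr))  # congested queue
print(map_packet(1500, 46_000, 1_000, 5_160, thr))   # shallow queue
```

Below the warning threshold every packet stays on-chip regardless of queue depth, so the per-queue comparison only matters in the narrow band just before the on-chip memory fills.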
Experimental Results
We modify ns-2 [7] to model routers employing two different scheduling policies: priority scheduling, which favors low-latency packets, and deficit round robin, which ensures fairness. The purpose of these experiments is to validate the effectiveness of our method under two of the most popular scheduling policies. We simulate hundreds of different realistic scenarios using the configuration in Fig. 3. $N$ TCP sources send traffic to a router, and the $N$ sources are mapped to the output queues of the router. The scheduler implements either priority scheduling or DRR.
Priority Scheduling
In priority scheduling, queues are prioritized and scheduled accordingly. Packets belonging to the high-priority queues (HPQs) are scheduled early and have very small latency through the router. The purpose of this experiment is to verify that the proposed mapping method identifies these high-priority, low-latency packets and maps them to the on-chip memory to increase its utilization.
In this experiment, we use 9 queues, numbered 0 through 8, which represent 9 different priority levels: queue 0 has the highest priority and queue 8 the lowest. Queues 0 through 7 are referred to as high-priority queues, and queue 8 is referred to as the low-priority queue. We map 100 TCP flows to the 9 queues, and each flow is bandwidth-shaped to 2 Mbps. Since our work is the first to propose a packet mapping method for on-chip and off-chip memory, we implement a latency-unaware packet mapping method as a baseline for our proposed latency-aware method. In the latency-unaware method, packets are allocated in the off-chip memory whenever the on-chip memory is full.
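For reference, the latency-unaware baseline reduces to a single occupancy check. The sketch below is a minimal illustration; the function name and the interpretation of "full" as "the arriving packet does not fit" are assumptions.

```python
def map_latency_unaware(pkt_bytes, on_chip_occupancy, on_chip_size):
    """Baseline mapping: use on-chip memory until the packet no longer fits.

    All arguments are in bytes. Queue depth and priority are ignored,
    which is what distinguishes this baseline from the latency-aware method.
    """
    if on_chip_occupancy + pkt_bytes > on_chip_size:
        return "off-chip"
    return "on-chip"


print(map_latency_unaware(1500, 10_000, 50_000))
print(map_latency_unaware(1500, 49_000, 50_000))
```

Because the baseline never looks at the destination queue, long-lived packets of congested queues can occupy the on-chip memory and force later low-latency packets off-chip.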
We evaluate the performance of our proposed method with respect to the amount of low-latency packets. For this experiment, the error threshold is set to 50 KB and the warning threshold is set to 45 KB. The error threshold is half of the optimal packet memory size suggested in [3] ($\frac{80\ \text{msec} \times 100\ \text{Mbps}}{\sqrt{100}} = 100$ KB) so that (1) many packets overflow to the off-chip memory, which emulates the overflow case, and (2) we can show that a memory smaller than that in [3] is sufficient thanks to the high utilization of the on-chip memory achieved by the proposed mapping method. Figure 4 shows the results when we increase the number of flows per HPQ to increase the amount of low-latency packets. The number of flows per HPQ is swept from two (4 Mbps per queue) to eight (16 Mbps per queue). For queues 0 through 7 (high-priority queues), the target $W_i$ is 1 msec and the $\alpha_i$ values are 2.58, 3.44, 5.16, and 10.33 for 8, 6, 4, and 2 flows per HPQ, respectively. For queue 8 (the low-priority queue), the target $W_i$ is 1 msec and the $\alpha_i$ values are 36.57, 6.37, 1.08, and 0.60 for 8, 6, 4, and 2 flows per HPQ, respectively. One interesting point is that, in the latency-unaware method, more packets are mapped to the off-chip memory as we increase the number of flows for the HPQs. This is because the frequently arriving low-latency, high-priority packets are pushed off-chip more often by low-priority packets as the share of high-priority traffic grows. When the number of flows per HPQ is 8, 95% of all packets are allocated in the off-chip memory in the latency-unaware method (Fig. 4 (a)), whereas only 5.9% of packets are in the proposed method (Fig. 4 (b)). This indicates that the proposed scheme can reduce the power consumed by external memory accesses by 94.1% with only 50% of the packet memory size suggested in [3].
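The $\alpha_i$ values for the high-priority queues can be sanity-checked with a few lines of arithmetic. The sketch below assumes the congested threshold is the product $\alpha_i \times \lambda_i \times \text{target } W_i$ and that $\lambda_i$ equals the number of flows per HPQ times the 2 Mbps shaping rate; both are our reading of the text, and the function name is hypothetical.

```python
TARGET_W_S = 1e-3      # target W_i for queues 0 through 7: 1 msec
FLOW_RATE_BPS = 2e6    # each flow is shaped to 2 Mbps

# alpha_i for the high-priority queues, keyed by flows per HPQ.
ALPHA_BY_FLOWS = {8: 2.58, 6: 3.44, 4: 5.16, 2: 10.33}


def congested_threshold_bytes(alpha, flows_per_queue):
    """alpha * lambda * W in bytes, with lambda = flows * 2 Mbps (assumed)."""
    arrival_rate_bps = flows_per_queue * FLOW_RATE_BPS
    return alpha * arrival_rate_bps * TARGET_W_S / 8


for flows, alpha in ALPHA_BY_FLOWS.items():
    thr = congested_threshold_bytes(alpha, flows)
    print(f"{flows} flows/HPQ: alpha={alpha:5.2f} -> {thr / 1e3:.2f} KB")
```

Under these assumptions all four settings give a threshold of roughly 5.16 KB, which suggests that $\alpha_i$ was tuned to hold the high-priority congested threshold constant as the arrival rate changes.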
As the bandwidth of latency-sensitive traffic increases, the gain from the proposed scheme becomes significant. Considering that more and more Internet traffic is becoming latency sensitive as real-time applications proliferate, this result is very promising.
Deficit Round Robin Scheduling
DRR is a modified weighted round robin scheduling algorithm that guarantees fairness among queues by allocating weights proportional to the bandwidths of the queues. A quantum, expressed in bytes, represents the scheduling weight and is assigned to each queue. In DRR scheduling, a large quantum is assigned to a queue to allocate it a large bandwidth or to reduce the latency of its packets. The purpose of this experiment is to verify that the proposed mapping method identifies the low-latency packets belonging to queues with a large quantum and maps them to the on-chip memory to increase its utilization.
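To make the role of the quantum concrete, here is a minimal sketch of deficit round robin as it is commonly described [6]. It is not the modified ns-2 scheduler used in the experiments; the function name, the list-of-deques representation, and the fixed number of rounds are assumptions made for illustration.

```python
from collections import deque


def drr_schedule(queues, quanta, rounds):
    """Serve packets with deficit round robin.

    queues -- list of deques holding packet sizes in bytes
    quanta -- quantum in bytes for each queue (the scheduling weight)
    rounds -- number of round-robin passes to run
    Returns the service order as a list of (queue_index, packet_bytes).
    """
    deficits = [0] * len(queues)
    served = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0  # an empty queue does not bank credit
                continue
            deficits[i] += quanta[i]
            # Send head-of-line packets while the deficit covers them.
            while q and q[0] <= deficits[i]:
                size = q.popleft()
                deficits[i] -= size
                served.append((i, size))
            if not q:
                deficits[i] = 0
    return served


# Queue 0 has twice the quantum of queue 1 and is served twice as often.
order = drr_schedule(
    [deque([500] * 4), deque([500] * 4)], quanta=[1000, 500], rounds=2
)
print(order)
```

A queue with a larger quantum drains more bytes per round, so its packets wait for fewer rounds and see a smaller latency, which is the property the mapping method exploits.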
In the first experiment using DRR scheduling, 100 TCP flows are mapped to 10 different queues. The quantum for queues 0 through 3 is set to ten times that of the remaining queues so that packets for those queues have small latencies. In this test case, we verify the performance of the proposed method under a highly skewed bandwidth allocation. That is, packets passing through high-bandwidth queues have low latency because of the large bandwidth allocation, and we verify whether the proposed method efficiently maps these low-latency packets to the on-chip memory. We wanted to have as many high-bandwidth queues as possible. The four high-bandwidth queues can have 20 Mbps each, and the remaining 20 Mbps is shared among the low-bandwidth queues. The target $W_i$ is 0.01 sec. The $\alpha_i$ values for queues 0 through 3 (high-bandwidth queues) are 2.2, whereas those for queues 4 through 9 are 7.8. The result of the first experiment is shown in Fig. 5. The proposed method in Fig. 5 (b) maps 3.35 to 3.62 times more bytes to the on-chip memory than the latency-unaware method in Fig. 5 (a). With the proposed mapping method, 75 KB of on-chip memory captures 72.7% of the total bytes and reduces the power consumption proportionally. Compared with the priority scheduling results in Sect. 4.1, the on-chip memory captures more bytes as we increase the on-chip memory size from 45 KB to 75 KB. In priority scheduling, 50 KB of on-chip memory already captures a significant portion of the packets. For this reason, we show the results for different on-chip memory sizes in priority and DRR scheduling.
In the second experiment using DRR scheduling, we choose a test case where the bandwidth of the queues is distributed more smoothly (monotonically decreasing). That is, the bandwidth for queue 0 is ten times that for queue 9, the bandwidth for queue 1 is nine times that for queue 9, and so on. The target $W_i$ is 0.05 sec. The $\alpha_i$ values for queues 0 through 9 range from 0.45 to 3.49, and Error Thr, Warning Thr, and Congested Thr are set to 75 KB, 70 KB, and 45 KB, respectively. With the latency-unaware method, 10.5% of packets are mapped to the on-chip memory, whereas 22.6% of packets are mapped to the on-chip memory with the latency-aware method. The latency-aware method thus shows 2.15 times better mapping efficiency.
Conclusion
The proposed method significantly reduces the power consumption caused by the off-chip packet memory accesses via a small on-chip memory implemented with embedded DRAM and a novel packet mapping scheme. Our packet mapping method estimates the latency of packets based on the queue depth and maps low latency packets to the on-chip memory to increase the utilization of the on-chip memory.
