Packet switches (that is, IP routers and ATM and Ethernet switches) maintain statistics for performance monitoring, network management, security, network tracing, and traffic engineering. Counters usually collect such statistics as the number of arrivals of a specific packet type or they count a particular event, such as when the network drops a packet. A packet's arrival can lead to the updating of several different statistics counters.
The number of statistics counters in a network device and their rate of update are often limited by memory technology. On-chip registers or SRAM (on- or off-chip) can hold a few counters. Often, a network device has to maintain many counters and therefore must store them in off-chip DRAM. But the large random access times of DRAMs make their use difficult when supporting high-bandwidth links. The time it takes to read, update, and write a single counter would be too long, and worse still, each arriving packet can trigger the update of multiple counters.
To alleviate these problems, we use a well-known architecture for storing and updating statistics counters. This approach maintains smaller-size counters in fast (potentially on-chip) SRAM, while maintaining full-size counters in a large, slower DRAM. Our goal is to ensure that the system always correctly maintains counter values at line rate. An optimal counter management algorithm (CMA) minimizes the required SRAM size while ensuring correct line-rate operation for a large number of counters.
Role of packet switches
Packet switches perform many processing tasks on arriving packets. Jobs include address lookup, classification, buffering, quality-of-service scheduling, header editing, and statistics maintenance. Packet switches typically perform these tasks on the line cards of switches and routers, so the tasks must run at line rate. When optical carrier line rates increase beyond Sonet specification OC-192 (10 Gbps) to OC-768 (40 Gbps), packet processing tasks will become more difficult. Although several proposed techniques deal with address lookup, 1 packet classification, 2 packet buffering, [3] [4] [5] and quality-of-service scheduling, 6 we are not aware of research besides ours that addresses the maintenance of a large number of statistics counters.
Packet switches maintain statistics for many reasons. These include firewall support (especially stateful inspection), intrusion detection, performance monitoring (for example, remote monitoring), network tracing, load balancing, and traffic engineering (for example, policing and shaping of traffic patterns). In addition, most packet switches maintain statistics counters to facilitate network management. We can characterize the general problem of statistics maintenance as follows: When a packet arrives, the router classifies it to determine what actions to perform on it: should it be accepted or dropped, receive expedited service or not, and so on. Depending on the chosen action, the router updates statistics counters.
We are interested in statistics that count events: for example, the number of fragmented, dropped, or arriving packets, or the number of bytes forwarded. We refer to these types of statistics as counters. Here, we describe and quantitatively analyze the problem of maintaining these counters.
We are particularly interested in applications that maintain many counters, such as a routing table that counts how many times a packet uses each prefix, or a router that counts the packets belonging to each TCP connection. Both examples would require simultaneously maintaining several hundreds of thousands, or even millions, of counters, making it infeasible (or at least very costly) to store them in SRAM and hence requiring DRAM storage. Furthermore, we are interested in applications that have frequent updates, such as an OC-192c link in which the router updates multiple counters upon each packet arrival. These read-modify-write operations must occur at the same rate as packet arrival.
If each counter is M bits wide, then a counter update operation
• reads the M-bit value stored in the counter,
• increments the M-bit value, and
• writes back the updated M-bit value.
If packets arrive at rate R (in gigabits per second), the minimum packet size is P bits, and the router updates C counters each time a packet arrives, the memory may need to be accessed (either read or written) every P/(2CR) ns. Let's consider the example of 40-byte TCP packets arriving on a 10-Gbps link; each arrival leads to the update of two counters. The memory needs to be accessed every 8 ns, about eight times faster than the random-access speed of today's commercial DRAMs.
It is a strict requirement that routers correctly update a counter or counters every time a packet arrives. Counters must account for every packet. If the scheme that updates counters performs an update operation every time a packet arrives and updates C counters per packet, then the minimum bandwidth R_D required on the interface to the memory that stores the counters is at least 2RMC/P. Again, this bandwidth requirement can become unmanageable as the size of the counters and the line rates increase.
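As a quick check of the two expressions above, the following sketch evaluates the access interval P/(2CR) and the memory bandwidth 2RMC/P. The function names are illustrative, and the counter width M = 64 bits in the second call is an assumption made only for this example; the text does not give M for the 40-byte case.

```python
def access_interval_ns(p_bits, c_counters, r_gbps):
    """Time between memory accesses, P / (2CR). Bits divided by Gbps gives ns."""
    return p_bits / (2 * c_counters * r_gbps)


def memory_bandwidth_gbps(r_gbps, m_bits, c_counters, p_bits):
    """Minimum memory-interface bandwidth, 2RMC / P, in Gbps."""
    return 2 * r_gbps * m_bits * c_counters / p_bits


# 40-byte packets on a 10-Gbps link, two counters updated per packet.
interval = access_interval_ns(p_bits=40 * 8, c_counters=2, r_gbps=10)
print(interval)  # 8.0

# Same link; M = 64 bits is an assumed counter width for illustration.
bandwidth = memory_bandwidth_gbps(r_gbps=10, m_bits=64, c_counters=2, p_bits=40 * 8)
print(bandwidth)  # 8.0
```

With these inputs the interval matches the 8 ns quoted in the text, and the assumed 64-bit counters would need 8 Gbps of memory bandwidth.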
We propose an approach that uses DRAM to maintain statistics counters and a small fixed amount of (possibly on-chip) SRAM to support these operations. We assume that DRAM stores N counters of width M bits and that SRAM stores N counters of width m < M. The SRAM counters track the number of updates not yet reflected in the DRAM counters. Periodically, under the control of a CMA, our approach updates the DRAM counters by adding the values in the SRAM counters to the DRAM counters, as shown in Figure 1. Updating the DRAM counters relatively infrequently reduces memory bandwidth requirements.
Our approach derives strict bounds on the size of the SRAM so that, irrespective of the arriving traffic pattern, none of the SRAM counters overflow. The DRAM access rate and bandwidth requirements decrease, yet the scheme still ensures correct counter operation.
SRAM size and DRAM access rate both depend on the CMA used. The largest-counter-first (LCF) CMA minimizes SRAM size. We derive necessary and sufficient conditions on counter sizes (and hence the SRAM that stores these counters), and prove that the LCF CMA is optimal.
As an example of how our technique can work, consider an OC-192c line card on a router that maintains a million counters. Assume that the maximum size of a counter is 64 bits and that each arriving packet updates a maximum of 10 counters. Our results indicate that such a system can use a statistics counter with a 51.2-ns DRAM access time, 1.25-Gbps DRAM memory bandwidth, and 9-Mbyte SRAM.
Memory hierarchy
Packets arriving at a switch have variable lengths; we denote by P the minimum packet length, in bits. The time slot is the time taken to receive a minimum-size packet at link rate R. We organize the SRAM as a statically allocated memory, consisting of separate storage spaces for each of the N counters. In this article, we assume that an arriving packet increments only one counter. If we instead considered the case where each packet arrival updates C counters, the line rate on the interface would be CR.
Each counter is represented by a large counter of size M bits in DRAM and a small counter of size m < M bits in SRAM. The small counter counts the most recent events, that is, those occurring since the large counter was last updated, and the large counter holds the count of all events up to that update. At any time, the correct counter value is the sum of the small and large counters.
Updating a DRAM counter consists of a read-modify-write operation:
• Read an M-bit value from the large counter.
• Add the m-bit value of the corresponding small counter to the large counter.
• Write the new M-bit value of the large counter to DRAM.
• Reset the small-counter value.
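The two-level arrangement and the update steps above can be sketched in a few lines. This is a minimal model under stated assumptions, not the authors' implementation: the class name `HybridCounters`, its methods, and the use of Python lists to stand in for SRAM and DRAM are all hypothetical.

```python
class HybridCounters:
    """N counters, each split into an m-bit SRAM part and an M-bit DRAM part."""

    def __init__(self, n, m_bits, big_m_bits):
        self.small = [0] * n                   # stands in for the fast SRAM counters
        self.large = [0] * n                   # stands in for the slow DRAM counters
        self.small_limit = 2 ** m_bits         # first value that no longer fits in m bits
        self.large_mask = 2 ** big_m_bits - 1  # keeps large counters within M bits

    def increment(self, i):
        """Per-packet update: touches the small (SRAM) counter only."""
        if self.small[i] + 1 >= self.small_limit:
            raise OverflowError(f"small counter {i} would overflow")
        self.small[i] += 1

    def flush(self, i):
        """Read-modify-write of the large counter, then reset the small one."""
        value = self.large[i]                              # read the M-bit value
        value = (value + self.small[i]) & self.large_mask  # add the small counter
        self.large[i] = value                              # write back to DRAM
        self.small[i] = 0                                  # reset the small counter

    def value(self, i):
        """Correct count is the sum of the large and small counters."""
        return self.large[i] + self.small[i]


counters = HybridCounters(n=4, m_bits=3, big_m_bits=64)
for _ in range(5):
    counters.increment(2)
counters.flush(2)
counters.increment(2)
print(counters.value(2), counters.large[2], counters.small[2])  # 6 5 1
```

The overflow check in `increment` marks the failure the CMA has to prevent: a small counter must be flushed before it exceeds what m bits can hold.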
Our goal is to decrease DRAM bandwidth by a factor b, that is, R_D = 2RM/(Pb), and to increase DRAM access time accordingly, that is, A_t = Pb/(2R). Thus, the CMA will update a large counter only once every b time slots. The minimum SRAM size is a function g that depends on N, M, and b. Therefore, the system designer can trade off SRAM size against DRAM bandwidth.
Count: C(i, t) is, at time t, the number of times that the ith small counter has been incremented since the ith large-counter update.
Empty counter: Counter i is empty at time t if C(i, t) = 0.
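To see how the factor b trades DRAM bandwidth against access time, the expressions 2RM/(Pb) and Pb/(2R) can be tabulated for a few values of b. The sketch below assumes one counter update per packet, as in the rest of this section; the link parameters (R = 10 Gbps, M = 64 bits, P = 320 bits) and the function names are hypothetical values chosen for illustration.

```python
def dram_bandwidth_gbps(r_gbps, m_bits, p_bits, b):
    """DRAM bandwidth after reduction by factor b: 2RM / (Pb), in Gbps."""
    return 2 * r_gbps * m_bits / (p_bits * b)


def dram_access_time_ns(p_bits, b, r_gbps):
    """Allowed DRAM access time: Pb / (2R), in ns."""
    return p_bits * b / (2 * r_gbps)


# Hypothetical link: R = 10 Gbps, M = 64 bits, P = 320 bits (40 bytes).
rows = [
    (b, dram_bandwidth_gbps(10, 64, 320, b), dram_access_time_ns(320, b, 10))
    for b in (1, 4, 16)
]
for b, bandwidth, access_time in rows:
    print(b, bandwidth, access_time)
# 1 4.0 16.0
# 4 1.0 64.0
# 16 0.25 256.0
```

Each increase in b relaxes both DRAM requirements by the same factor, at the cost of letting the small counters accumulate more updates between flushes.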
The correct large-counter value could be lost if the small counter overflows before it is added to the large counter. Our approach is to find the smallest possible size of counters in the SRAM and a suitable CMA such that the small counter cannot overflow before its corresponding large-counter update.
Necessity conditions on any CMA
For this hierarchy of counters to work, under any CMA the SRAM must meet certain conditions, which we define in the following theorem.
Theorem 1 (necessity): Under any CMA, a counter can reach a count C(i, t) of
Proof: We will argue that we can create an arrival pattern for which, after some time, there exists k such that there will be (N − 
Optimality
Key to establishing the LCF CMA's ability to minimize SRAM size is the concept of optimality, which we explain using the theorem that follows.
Theorem 2 (optimality of LCF CMA): Under all arriving traffic patterns, LCF CMA is optimal in the sense that it minimizes the maximum count that a counter must be able to hold.
Proof: We give a brief intuition of this proof here. Consider a traffic pattern from time t that causes some counter C_i (which is smaller than the largest counter at time t) to reach maximum threshold M*. A similar traffic pattern can cause the largest counter at time t to exceed M*. This implies that not serving the largest counter is suboptimal. We provide a detailed proof elsewhere. 7
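The LCF service rule itself is short: every b time slots, flush whichever small counter currently holds the largest count. The following sketch simulates that rule with one arrival per time slot. The uniform random traffic is an assumption for illustration only; it is not the adversarial pattern used in the proofs, so the peak it reports is merely an observed value, not the worst-case bound. The function name and return values are hypothetical.

```python
import random


def simulate_lcf(n, b, slots, seed=0):
    """Simulate LCF service; return (peak small count, small counters, large counters)."""
    rng = random.Random(seed)
    small = [0] * n   # SRAM counters
    large = [0] * n   # DRAM counters
    peak = 0
    for t in range(1, slots + 1):
        i = rng.randrange(n)          # assumed traffic: uniform random arrivals
        small[i] += 1
        peak = max(peak, small[i])
        if t % b == 0:                # one DRAM update every b time slots
            j = max(range(n), key=small.__getitem__)  # largest counter first
            large[j] += small[j]
            small[j] = 0
    return peak, small, large


peak, small, large = simulate_lcf(n=8, b=4, slots=1000)
# No event is lost: every arrival is held in either a small or a large counter.
print(sum(small) + sum(large))  # 1000
```

Replacing the random arrivals with a pattern that keeps feeding the counters LCF has not yet served is one way to probe how close a given SRAM counter width comes to overflowing.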
Sufficiency conditions on LCF service policy
We must now show what size of SRAM is sufficient under the LCF service policy. Consider the values of these counters after they are incremented: their contribution to F(t + b) becomes dα. But a counter with count C(i,t) ≥ i 1 + 1 is served at time t + b and its count becomes zero. Hence, the decrease to F(t + b) is at least dα/b. Thus, the net increase is at most dα [ 
Hence, the net increase is at most zero, that is, if arrivals occur to nonzero queues, F(t) can't increase. 
Proof: We know that to store value x we need at most log_2 x bits. Hence, this theorem follows from Theorem 3.
Choosing the correct value of b
There are three constraints to consider when choosing b.
The system designer can choose any value of b that satisfies these three bounds. For very large N and small M, there might be no suitable value of b. Such a case forces the system designer to store all the counters in SRAM.
OC-192 line card counter design
Consider an OC-192c line card that maintains a million counters. Assume that the minimum packet size is P = 64 bytes and that each arriving packet updates a maximum of C = 10 counters; hence, the effective rate is R = 100 Gbps (C times the 10-Gbps line rate). Suppose that the fastest available DRAM has access time T_R = 51.2 ns. Since our approach requires Pb/(2R) ≥ T_R, this means that b ≥ 20. Given present DRAM technology, this is sufficient to meet the lower bound obtained on b using the memory I/O bandwidth constraint. Hence the lower bound on b is simply b ≥ 20.
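The numbers in this design example can be reproduced with a few lines of arithmetic. The sketch below computes the smallest b that satisfies Pb/(2R) ≥ T_R and the resulting DRAM bandwidth 2RM/(Pb) for M = 64. It also evaluates the expression log_2{ln(bN)/ln[b/(b − 1)]} that this example compares against M. Treating the ceiling of that expression as the SRAM counter width is an assumption here; it happens to reproduce the 9 bits quoted for this design. Function names are illustrative.

```python
import math


def min_b_for_dram(t_r_ns, p_bits, r_gbps):
    """Smallest integer b with P*b / (2R) >= T_R."""
    exact = 2 * r_gbps * t_r_ns / p_bits
    return math.ceil(exact - 1e-9)  # small tolerance against floating-point noise


def dram_bandwidth_gbps(r_gbps, m_bits, p_bits, b):
    """DRAM bandwidth 2RM / (Pb), in Gbps."""
    return 2 * r_gbps * m_bits / (p_bits * b)


def counter_width_expr(b, n):
    """log2( ln(bN) / ln(b / (b - 1)) ), the quantity compared against M."""
    return math.log2(math.log(b * n) / math.log(b / (b - 1)))


P_BITS = 64 * 8        # minimum packet size, 64 bytes
R_GBPS = 100           # effective rate, C = 10 counters on a 10-Gbps link
N_COUNTERS = 10 ** 6   # one million counters

b = min_b_for_dram(t_r_ns=51.2, p_bits=P_BITS, r_gbps=R_GBPS)
bandwidth = dram_bandwidth_gbps(R_GBPS, m_bits=64, p_bits=P_BITS, b=b)
sram_bits = math.ceil(counter_width_expr(b, N_COUNTERS))  # assumed rounding rule
print(b, bandwidth, sram_bits)  # 20 1.25 9
```

For M = 8 the same expression already exceeds 8 at b = 20 and grows with b, which is why that case falls back to keeping every counter in SRAM.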
We now consider the upper bound on b, using two different values for the counter size M required in the system. If M = 64, then log_2 {ln(bN) / ln[b / (b − 1)]} < M and we design the counter architecture with b = 20. We find that 9 bits is the minimum size for the SRAM counters, as required for the LCF policy. This counter size results in a 9-Mbyte SRAM. Keeping the SRAM memory on-chip supports the required access rate. If M = 8, then for all b ≥ 20, log_2 {ln(bN) / ln[b / (b − 1)]} > M. Thus there is no suitable value of b, and this design must store all the counters in SRAM without using DRAM.
Packet switches need to maintain counters for gathering statistics on various events. Our method can help build a high-bandwidth statistics counter for any pattern of arrival traffic. We discussed the necessary condition on the size of SRAM required to keep exact statistics under any policy. The LCF CMA policy is optimal in the sense of smallest SRAM size, and we obtained the bounds on the size of SRAM. But LCF CMA is a complex algorithm to implement at a very high speed. It would be interesting to obtain performance similar to that of LCF CMA with a less complex algorithm.
