Abstract - For monitoring and measuring high-speed networks accurately in real time, a large number of statistics counters may need to be maintained at wirespeed (e.g., 10 Gbps). Expensive, but fast, SRAM is needed for storing the counters to satisfy the speed requirements. However, high-density, but slower, DRAM is needed to provide the necessary storage capacity for storing all counter values exactly. Recent papers by Shah et al. [1] and Ramabhadran and Varghese [3] have addressed the problem using counter memory architectures based on one level of fast SRAM for storing partial counter values and a high-capacity DRAM for storing full counter values. In this paper, we propose to extend their work with a multi-level counter memory architecture to reduce the amount of fast memory required. Our multi-level counter memory architecture can reduce the amount of equivalent fast memory storage required by as much as 28%.
I. INTRODUCTION
A number of recent papers [1], [3], [4] have motivated the need to maintain a large number of statistics counters at wirespeed for monitoring and measuring high-speed networks in real time. In general, packet switches may need to maintain different statistics for a variety of performance monitoring, security, network tracing, and traffic engineering applications.
To maintain statistics counters at wirespeed (e.g., 10 Gbps), the majority of memory accesses for counter maintenance must be to fast memory like SRAM. However, storing all counter data in SRAM is not economical for large numbers of counters (e.g., millions of counters). Shah et al. [1] provided a Counter Management Algorithm (CMA) for maintaining a partial count of m bits in the SRAM while maintaining the full count with M bits in the DRAM. Their counter management algorithm, called Largest-Counter-First (LCF), selects the largest counter in the SRAM to update to the DRAM every b cycles, where b is the access time ratio of DRAM over SRAM. Their algorithm ensures that none of the counters in the SRAM will ever overflow if at least m bits are used per counter, and they showed that the required m bits is minimal. Recently, Ramabhadran and Varghese [3] provided an alternative counter management algorithm called Largest-Recent-with-Threshold (LR(T)) that achieves the same minimal SRAM width of m bits, but with a lower hardware implementation cost than LCF. Both LCF and LR(T) assume the use of one level of SRAM and a DRAM.
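The LCF policy described above can be sketched in a few lines of Python. This is a naive illustration under simplifying assumptions, not the hardware design from [1]: one counter is incremented per cycle, the largest counter is found by a linear scan, and the names `simulate_lcf`, `arrivals`, and `peak` are hypothetical.

```python
def simulate_lcf(arrivals, num_counters, b):
    """Naive Largest-Counter-First sketch.

    arrivals     -- sequence of counter indices, one increment per cycle
    num_counters -- N, the number of counters
    b            -- DRAM/SRAM access time ratio (one DRAM update every b cycles)
    """
    sram = [0] * num_counters  # partial counts held in fast memory
    dram = [0] * num_counters  # full counts held in slow memory
    peak = 0                   # largest partial count ever held in SRAM

    for cycle, k in enumerate(arrivals, start=1):
        sram[k] += 1
        peak = max(peak, sram[k])
        if cycle % b == 0:
            # Every b cycles, move the largest partial count to DRAM.
            largest = max(range(num_counters), key=sram.__getitem__)
            dram[largest] += sram[largest]
            sram[largest] = 0
    return sram, dram, peak


sram, dram, peak = simulate_lcf([0, 0, 1, 0, 2, 0, 1, 1], num_counters=3, b=4)
# The exact count of counter k is always sram[k] + dram[k].
print([s + d for s, d in zip(sram, dram)], peak)
```

The value `peak` is the quantity that the SRAM width m has to accommodate; the linear scan used here is exactly the cost that a real LCF implementation must avoid with a dedicated data structure.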
A. PAPER CONTRIBUTIONS
In this paper, we propose to extend the work of Shah et al. [1] and Ramabhadran and Varghese [3] with a multi-level counter memory architecture. We propose to use multiple levels of fast memory instead of one, namely one level of SRAM and one or more levels of CAM (Content Addressable Memory) for storing partial counter values, and a high-capacity DRAM for storing the full counter values. This architecture is described in Section II.
In [1] and [3], two different counter management algorithms were proposed for ensuring that none of the counters in the SRAM will ever overflow before they are updated in the DRAM. For each method, the respective authors derived an upper bound $c_{\max}$ on the maximum partial count value that a counter can reach in the SRAM before it gets updated to the DRAM. Based on $c_{\max}$, they showed that the SRAM must be $m = \log_2 c_{\max}$ bits wide to ensure that every counter can count up to $c_{\max}$.
A key observation is that in practice most counters will never reach a count close to $c_{\max}$ before they are updated. Therefore, using m bits for every counter in the SRAM may be wasteful since most counters can be represented using fewer than m bits. The basic idea is to use a first level of SRAM to store the lower $m_0 < m$ bits of every counter. Then, for those counters that can reach partial counter values greater than or equal to $2^{m_0}$, we use one or more levels of CAM to store the upper bits of those counters. If the number of counters that can reach or exceed $2^{m_0}$ is significantly less than N, where N is the total number of counters, then we can significantly reduce the amount of expensive fast memory that we need to maintain the partial counts.
Using this basic idea, we first consider the case where we use exactly two levels of fast memory: the first level is in an SRAM, and the second level is in a CAM. In this case, we split the m bits into a lower $m_0$ bits and an upper $m_1$ bits such that $m = m_0 + m_1$. The lower $m_0$ bits are stored in the first-level SRAM with N locations, and the upper $m_1$ bits are stored in the second-level CAM with G < N locations, where G corresponds to the number of counters that can reach or exceed a count of $2^{m_0}$. We describe in Sections III and IV how G is computed for the LCF algorithm and the LR(T) algorithm, respectively.
We then describe in Section V how we can generalize our solution for λ levels of fast memory. In particular, we split the m bits into $m_0, m_1, \ldots, m_{\lambda-1}$ bits, where each $m_l$ is the number of bits in level l + 1 of the memory hierarchy ($m_0$ is the number of bits in the first level, $m_1$ is the number of bits in the second level, and so on). We call these multi-level counter memory architectures, or MMA for short. Using our approach, we are able to reduce the amount of equivalent fast memory storage required by as much as 28%, as discussed in Section VI.
II. ARCHITECTURE
In this section, we first describe how the single-level memory architecture from [1] can be extended to two levels of fast memory. Then we describe a general architecture with λ levels of fast memory. In both cases, a DRAM is used to store the full counter values for all counters.
A. TWO-LEVEL ARCHITECTURE
We extend the basic one-level counter memory architecture to a two-level counter memory architecture, as shown in Figure 1. Whenever a counter k in the first-level SRAM overflows the maximum value $2^{m_0} - 1$, a "1" is added to the corresponding entry in the second-level CAM using k as the content address. If there is no entry matching k, a new entry is created. By the way that we compute G, we can prove that there are enough entries in the second-level CAM to store all the overflowed counters. When the CMA decides to update a counter k to DRAM, we reset the corresponding entry in the SRAM, and we remove the corresponding entry in the CAM.
At any moment in time, the partial counter value for counter k is defined as follows. Let σ[k] be the SRAM value stored for entry k. Let ψ[k] be the CAM value stored for the content address k if an entry is found, and let ψ[k] = 0 if no entry is found. Then the partial counter value for counter k is

$$2^{m_0}\,\psi[k] + \sigma[k].$$

This is the same as taking the binary interpretation of the bits formed by concatenating the upper $m_1$ bits retrieved from the CAM with the lower $m_0$ bits retrieved from the SRAM.
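The two-level update rule and the partial counter value can be illustrated with a small software model. This is only a sketch under stated assumptions: the CAM is modeled as a Python dictionary keyed by the counter index k, the class and method names are hypothetical, and the $m_1$-bit limit on CAM values is not enforced because the CMA is assumed to flush counters before that limit is reached.

```python
class TwoLevelCounters:
    """Software model of a first-level SRAM plus a second-level CAM."""

    def __init__(self, num_counters, m0, cam_entries):
        self.m0 = m0
        self.sram = [0] * num_counters  # sigma[k]: lower m0 bits of each counter
        self.cam = {}                   # psi[k]: upper bits, keyed by content address k
        self.cam_entries = cam_entries  # G: number of CAM locations
        self.dram = [0] * num_counters  # full counter values

    def increment(self, k):
        if self.sram[k] == (1 << self.m0) - 1:
            # SRAM entry overflows: wrap to 0 and add 1 to the CAM entry for k,
            # creating the entry if it does not exist yet.
            if k not in self.cam and len(self.cam) >= self.cam_entries:
                raise OverflowError("CAM is full; G was chosen too small")
            self.sram[k] = 0
            self.cam[k] = self.cam.get(k, 0) + 1
        else:
            self.sram[k] += 1

    def partial(self, k):
        # Concatenate the CAM bits (upper) with the SRAM bits (lower).
        return (self.cam.get(k, 0) << self.m0) | self.sram[k]

    def flush(self, k):
        # CMA update: add the partial count to DRAM, reset SRAM, drop the CAM entry.
        self.dram[k] += self.partial(k)
        self.sram[k] = 0
        self.cam.pop(k, None)


counters = TwoLevelCounters(num_counters=4, m0=2, cam_entries=1)
for _ in range(5):
    counters.increment(1)
print(counters.partial(1))  # SRAM holds 1, CAM holds 1, so 1 * 2**2 + 1
```

A missing dictionary key plays the role of "no entry found", which is why `partial` reads it as 0.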
The memory size of the first-level SRAM is simply $N m_0$. For the size of the second-level CAM, we define it in terms of the equivalent SRAM size. In particular, for a CAM ψ with parameters $\langle \gamma, \omega, \mu \rangle$, where γ is the number of entries in the CAM, ω is the width of the content address, and µ is the width of the entry value, we define the memory size of ψ as

$$\gamma\,(r\,\omega + \mu + 2), \tag{2}$$

where r is the area ratio of CAM cells over SRAM cells. This ratio is about 1.5 because a typical SRAM cell uses 6 transistors whereas a typical CAM cell uses 9 transistors [6]. Thus, with an ω-bit wide content address and γ entries, we calculate one component of the CAM cost as rωγ. We add another component µγ to the CAM size, corresponding to the memory bits required to store γ entries of µ-bit wide entry values. Finally, we add 2 additional bits of storage for each CAM entry to implement an aggregate bitmap for keeping track of the available CAM locations. This aggregate bitmap implementation was explained in [3]. For the specific architecture shown in Figure 1, the parameters for the CAM are $\langle G, \log_2 N, m_1 \rangle$. Thus, the total equivalent memory size for the two-level counter memory architecture is

$$N m_0 + G\,(r \log_2 N + m_1 + 2).$$
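As a worked illustration of the equivalent-size accounting described above, the following sketch adds up the three CAM cost components (address bits scaled by r, value bits, and 2 bitmap bits per entry) and the SRAM bits. The function names and the example numbers are hypothetical and are not results from the paper; rounding the address width up to a whole number of bits is also an assumption.

```python
import math


def cam_equivalent_bits(entries, addr_width, value_width, r=1.5):
    """Equivalent SRAM bits of a CAM: r*omega + mu + 2 bits per entry."""
    return entries * (r * addr_width + value_width + 2)


def two_level_bits(n, m0, m1, g, r=1.5):
    """Equivalent size of an N-entry, m0-bit SRAM plus a G-entry CAM."""
    addr_width = math.ceil(math.log2(n))  # assumption: round up to whole bits
    return n * m0 + cam_equivalent_bits(g, addr_width, m1, r)


# Illustrative numbers only: 1024 counters, an 8-bit partial count split 4 + 4,
# and a 64-entry CAM, compared with a single-level 8-bit SRAM.
single_level = 1024 * 8
two_level = two_level_bits(n=1024, m0=4, m1=4, g=64)
print(single_level, two_level)
```

With these made-up parameters the CAM costs 64 × (1.5 × 10 + 4 + 2) = 1344 equivalent bits, so the saving comes entirely from halving the SRAM width.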
B. MULTI-LEVEL ARCHITECTURE
We generalize our architecture to a λ-level counter memory architecture, as shown in Figure 2. Whenever a counter k in the previous-level CAM or SRAM overflows the maximum value $2^{m_{l-1}} - 1$ that level l − 1 can hold, a "1" is added to the corresponding entry in the level-l CAM using k as the content address, or a new entry with content address k is created if one does not exist. When the CMA decides to update a counter k to DRAM, we reset the corresponding entry in the SRAM, and we remove the corresponding entries in all the levels of CAM if the entries exist.
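At any moment in time, the partial counter value for counter k is the binary interpretation of the bits formed by concatenating the values stored for k at every level, from the highest-level CAM (most significant bits) down to the first-level SRAM (lowest $m_0$ bits), where a missing CAM entry is read as 0. Using Equation 2 to compute the equivalent sizes for the intermediate CAM memories, the overall equivalent memory size for a λ-level counter memory architecture is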
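
$$N m_0 + \sum_{l=1}^{\lambda-1} G(l)\,(r\,\omega_l + m_l + 2),$$

where G(l) is the number of entries at level l, with G(0) = N, and $\omega_l$ is the width of the content address at level l. At each level of CAM, the number of bits needed to represent the content address is the logarithm of the number of locations in the previous level of the memory hierarchy, namely

$$\omega_l = \log_2 G(l-1).$$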
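The multi-level size accounting can be sketched by applying the per-CAM cost of Equation 2 at every level above the SRAM, with each level's address width taken as the logarithm of the number of locations in the level below it. The function name, the list-based argument layout, the rounding of address widths, and the example numbers are all assumptions made for illustration.

```python
import math


def multi_level_bits(widths, entries, r=1.5):
    """Equivalent fast-memory size of a lambda-level counter memory.

    widths  -- [m_0, m_1, ..., m_{lambda-1}], bits stored per entry at each level
    entries -- [G(0) = N, G(1), ..., G(lambda-1)], locations at each level
    r       -- area ratio of a CAM cell over an SRAM cell
    """
    total = entries[0] * widths[0]  # first level is a plain SRAM
    for level in range(1, len(widths)):
        # Content address width: log2 of the previous level's location count
        # (rounded up to whole bits, which is an assumption of this sketch).
        addr_width = math.ceil(math.log2(entries[level - 1]))
        total += entries[level] * (r * addr_width + widths[level] + 2)
    return total


# Illustrative numbers only: an 8-bit partial count split 4 + 2 + 2.
print(multi_level_bits(widths=[4, 2, 2], entries=[1024, 64, 4]))
```

With two levels the function reduces to the two-level size given in Section II-A.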
The throughput of the multi-level counter memory architecture will be similar to that of the single-level counter memory architecture. Each level of an MMA CMA can be implemented as a stage in a counter update pipeline. Each stage only needs to read and write a counter location once per counter update, just as is the case for the single-level CMA. The CAMs may have slightly longer access times than the SRAMs that are used in the single-level CMAs. However, the CAMs used in the MMA CMA are much smaller than the SRAMs used for the single-level CMAs, so the CAMs are unlikely to become a bottleneck.
III. COMPUTING G FOR LCF
Using the LCF algorithm proposed by Shah et al. [1], we first consider how this CMA approach can be applied to the two-level counter memory architecture described in Section II-A. We defer to Section V the generalized solution for the λ-level counter memory architecture described in Section II-B.
For describing their LCF algorithm, they defined a notion of a potential function F as follows [1] :
where
In Equation 6, i refers to a counter value, and $N_i$ refers to the number of counters that have that counter value. The potential function F tries to capture an abstract notion of the relative fullness of all the counters. They used their definition of F to prove that there is no traffic pattern that can cause F to exceed the following.
To derive an upper bound on c max under LCF , they assumed the worst-case situation where one counter would have the largest value c max , and all other counters would be 0. This implies
If only one level of fast memory is used, the number of bits required to represent $c_{\max}$ for LCF is $m = \log_2 c_{\max}$. For the two-level counter memory architecture described in Section II-A, we split these m bits into a lower $m_0$ part and an upper $m_1$ part. With $m_0 < m$, it is possible for some counters to reach or exceed $2^{m_0}$, thus causing the first-level SRAM to overflow. For these counters, we essentially extend the number of bits for representing their counter values by $m_1$ bits in the second-level CAM. However, we need to determine the number of counters that can simultaneously reach or exceed $2^{m_0}$ at any given moment in time. This is what we have been referring to as G, which corresponds to the number of entries in the second-level CAM.
Theorem 1: Let i be a counter value, and let $G_i$ be the worst-case number of counters that can simultaneously reach or exceed i under LCF. Then
We first assume the case where all counters have either the value i or 0. Then according to the potential function F defined in Equations 6 and 8, the following must hold:
Equation 14 states that the number of counters with the counter value i must be less than or equal to $bN / d^i$. Therefore, the maximum number of counters $N_i$ that can have the value i must be
Suppose that G i > N i . This implies that there are some counters in the set of counters that generate G i such that those counters are greater than i. However each of the counters that is larger than i could be reduced to i without increasing the value of F . This is because, for k > 0, the contribution to F of a counter with value i + k is larger than the contribution of a counter of value i.
This implies that there is some set that generates N i such that N i = G i . However, this contradicts the assumption that $G_i > N_i$.
And by the definition of N i , N i must be less than or equal to N i . Therefore, G i must also be less than or equal to N i . Finally, since there are only N counters,
For a given $m_0$, we can use Equation 12 to compute $G_i$, where $i = 2^{m_0}$.
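The sizing step for the second-level CAM under LCF can be sketched as follows. The sketch assumes only what the proof above states: the number of counters at value i is at most $bN / d^i$, and there are only N counters in total. Taking the floor of the bound is an assumption of this sketch, d is the constant used in the potential function of [1] and is passed in as a parameter rather than derived, and the function name and example values are hypothetical.

```python
import math


def lcf_cam_entries(i, n, b, d):
    """Upper bound on the number of counters that can reach or exceed i under LCF.

    i -- counter value of interest (i = 2**m0 for the two-level design)
    n -- N, the total number of counters
    b -- DRAM/SRAM access time ratio
    d -- constant from the potential function in [1] (supplied by the caller)
    """
    # Assumption: the bound bN / d**i is rounded down to a whole number of
    # counters, and it can never exceed the N counters that exist.
    return min(n, math.floor(b * n / d ** i))


# Illustrative values only; d = 2.0 is a placeholder, not the constant from [1].
print(lcf_cam_entries(i=3, n=1000, b=4, d=2.0))
```

Because the bound shrinks geometrically in i, each extra bit of $m_0$ squares the divisor $d^i$, which is why G can fall far below N for modest SRAM widths.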
IV. COMPUTING G FOR LR(T)
The derivation of $G_i$ using the LR(T) algorithm proposed by Ramabhadran and Varghese [3] is similar to the derivation of $G_i$ for LCF. Ramabhadran and Varghese [3] introduced a family of algorithms that they called Largest-Recent-with-Threshold (LR(T)). They provided a detailed analysis for the specific case where T = b, and they referred to this version as LR(b).
To find the c max value for LR(b) they developed a potential function just as was done in [1] . The potential function from [3] for an individual counter is as follows:
Using the definition of f i , they defined their potential function F as follows:
It was shown in [3] that this function was constrained by the following inequality for all possible traffic patterns:
To derive an upper bound on c max under LR(b), they assumed the worst-case situation where one counter would have the largest possible value c max allowed by Equation 18 and all other counters would be 0. They used their potential function (Equations 17 and 18) to show that the upper bound on c max is the following:
Just as with LCF, the number of bits required to represent $c_{\max}$ for LR(b) in a single-level memory is $m = \log_2 c_{\max}$.
Theorem 2: Let $G_i$ be the worst-case number of counters that can simultaneously reach or exceed i under LR(b). Then
Just as with LCF we first assume the case where all counters have either the value i or 0. According to the potential F defined in Equations 17 and 18, the following must hold:
Equation 22 states that the number of counters with the counter value i must be less than or equal to d b−i−1 (N −1) . Therefore, the maximum number of counters N i that can have the value i must be
For the purpose of a proof by contradiction suppose that G i > N i , for i ≥ b. This implies that there are some counters in the set of counters that generate G i such that those counters are greater than i. However each of the counters that is larger than i could be reduced to i without increasing the value of F . This is because, for k > 0, the contribution to F of a counter with value i + k is larger than the contribution of a counter of value i.
This implies that there is some set that generates N i such that N i = G i . However, this also contradicts the assumption that G i ≥ N i since N i = G i . And by the definition of N i , N i must be less than or equal to N i . Therefore, G i must also be less than or equal to N i .
Finally, since there are only N counters,
We use Equation 20 to compute the number of locations G i necessary in the second level of memory given m 0 and i = 2 m0 under LR(b). 
V. MULTI-LEVEL DESIGN
In this section, we discuss how we can compute G for each level of the memory hierarchy in a λ-level counter memory architecture. 
., G(λ−1).
Once these values are derived, the overall architecture can be put together, as shown in Figure 2 .
The optimal partitioning of the m bits into $m_0, \ldots, m_{\lambda-1}$ bits is a problem that can be solved using a dynamic algorithm. The problem is to minimize the area of the device subject to the constraints on $m_i$ and G.
The optimal partitions for all of the data points presented fit within three levels of fast memory. The improvement curves of the multi-level CMA methods over the single-level CMA methods look like step functions because the multi-level method benefits the most when the $c_{\max}$ value first steps over a power-of-two boundary. It is when $c_{\max}$ is slightly greater than $2^j$ for some j that the multi-level method offers the most improvement, because this is when the values of $G_i$ will be smallest, allowing for the most aggressive introduction of memory levels.
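For the two-level case, the partitioning problem is small enough to illustrate with an exhaustive search over $m_0$. The sketch below is a brute-force stand-in and is not the dynamic algorithm used in the paper; the function name is hypothetical, `g_of` is a caller-supplied bound on the number of counters that can reach or exceed a given value, and the example bound and numbers are made up for illustration.

```python
import math


def best_two_level_split(n, m, g_of, r=1.5):
    """Return (size, m0, m1, G) minimizing equivalent size over all splits of m.

    n    -- N, the number of counters
    m    -- single-level partial counter width in bits
    g_of -- function mapping a counter value i to a bound on how many
            counters can reach or exceed i (hypothetical, caller-supplied)
    r    -- area ratio of a CAM cell over an SRAM cell
    """
    addr_width = math.ceil(math.log2(n))  # assumption: round up to whole bits
    best = (n * m, m, 0, 0)               # baseline: single-level SRAM only
    for m0 in range(1, m):
        m1 = m - m0
        g = g_of(2 ** m0)                 # CAM entries needed for this split
        size = n * m0 + g * (r * addr_width + m1 + 2)
        if size < best[0]:
            best = (size, m0, m1, g)
    return best


# Illustrative bound only: at most min(N, 4096 // 2**i) counters reach value i.
bound = lambda i: min(1024, 4096 // (2 ** i))
print(best_two_level_split(n=1024, m=4, g_of=bound))
```

Extending this to λ levels multiplies the number of candidate splits, which is the motivation for solving the general problem with a dynamic algorithm instead of enumeration.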
In all of the optimal partitions, the values for $G_1$ and $G_2$ are small with respect to N. This confirms our observation that very few counters ever reach $c_{\max}$ before they are updated to DRAM. The partitions described in Tables I and II are optimal; they were generated using a dynamic algorithm that optimizes for area given the constraints on $m_i$ and G. The tables show that many optimal multi-level memory CMA architectures can be built using three or fewer levels of memory.
For the area calculations, we have only considered the size of the counter memory, since this is the area that our method improves upon. In an actual implementation, the device areas would be much larger for LCF and slightly larger for LR(b) to account for the CMA control memories required by these methods. LCF requires a large data structure for quickly finding the largest counter in the array. LR(b) only requires a small amount of control memory to store an aggregate bitmap that is used for identifying the entries that exceed the threshold value b.
VII. CONCLUSION
In this paper, we proposed an improvement to the counter memory architectures developed by Shah et al. [1] and Ramabhadran and Varghese [3]. Our method uses multiple levels of fast memory to reduce the amount of surface area required for a CMA implementation. Using a multi-level counter memory architecture, the amount of equivalent fast memory storage required for a counter array may be reduced by as much as 28% for reasonable configurations. We use a dynamic algorithm to find the optimal bit partition for each pair of values N and b.
While this work extends the earlier work by Shah et al. [1] and Ramabhadran and Varghese [3] , several questions remain open for future research.
• One is to determine if better bounding functions G can be developed for LR(b) and LCF implementations if the bounding functions are tailored to specific counter values rather than using F and $c_{\max}$.
• Another is to see if we can further reduce the implementation cost of the multi-level counter memory architecture. In the current proposed architecture, we proposed to use CAM to implement the overflow memories. One area of investigation is to see if we can continue to exploit the idea that relatively few counters can reach large values without using a CAM.
• Finally, it would be interesting to see how approaches for maintaining exact statistics counters can be applied to applications other than measuring high-speed networks on a per-packet basis in real-time.
