The exponential growth in Internet popularity motivates network router and switch designers to develop custom software/hardware that can move packets through the network faster. Recently, a new breed of microprocessors called Internet processors have come into existence specifically to address the performance problem due to exploding Internet traffic. The development efforts of these Internet processors concentrate almost exclusively on streamlining their data-paths to speed up network packet processing, mainly routing and data movement. As higher-level protocol processing functionality is shifted towards network devices, e.g., the active network architecture, Internet processors are required to perform more sophisticated tasks at ever higher speeds. Rather than blindly pushing the performance of packet processing hardware, an alternative approach is to avoid repeated computation by applying the time-tested architectural idea of caching to network packet processing. Because the data streams presented to Internet processors and general-purpose CPUs exhibit different characteristics, detailed cache design tradeoffs for the two also differ considerably. This research represents one of the first, if not the first, reported works that focus on cache memory design specifically for Internet processors. Using a trace-driven simulation methodology, we evaluate a series of three progressively more aggressive routing-table cache designs. Our simulation results demonstrate that the incorporation of hardware caches into Internet processors, combined with efficient caching algorithms, can significantly improve the overall packet forwarding performance. We show that the use of progressively more efficient cache designs can result in up to a factor of 5 difference in the average routing table lookup time and thus the packet forwarding rate.
Introduction
With the enormous momentum behind Internet-related technologies and applications, demands for data network bandwidth are on the rise at an astounding rate. It is predicted that the total data traffic volume will exceed that of the voice traffic by the turn of this century. As a result of the exploding bandwidth requirement, a growing number of microchips are designed and fabricated specifically for networking devices rather than for traditional computing applications. In particular, a new breed of microprocessors called Internet processors have emerged that are designed specifically to efficiently execute network protocols on such internetworking devices as link-layer (Layer 2) switches, network-layer (Layer 3) routers, and transport/application-layer (Layer 4) gateways.
Network protocol code typically performs one of two functions: control packet processing and payload packet processing. Control packets exchange information among network nodes to maintain the state of the network itself. Network devices process payload packets according to this state information. Control packet processing, because of low traffic volume, rarely causes performance problems, and is thus typically executed on general-purpose CPUs. Payload processing, on the other hand, is on the critical path and is thus the target for performance optimization. Specific tasks performed during payload processing include:
Routing: Upon receiving an input packet, the network device needs to determine which output interface to use to forward the packet. In the case of real-time network connections, one needs to further identify which buffer queue on that output interface is to be used (flow classification).
Forwarding: Each packet is forwarded from the incoming input interface to the chosen output interface after the routing decision is made.
Scheduling: To support guaranteed Quality Of Service (QOS), packets destined to an output link need to be scheduled according to their resource reservations.
Filtering: At the boundaries between the Internet and the private networks, incoming and outgoing packets need to be filtered based on a set of pre-determined criteria, which constitute the security policy, to protect the resources in the enterprises.
Manipulation: Network devices may modify the contents of data packets for protocol-specific processing, e.g., the time-to-live counter field in IP, or application-specific processing, e.g., transcoding, compression/decompression and encryption/decryption.
In general, Internet processors are designed to accelerate all the above tasks except forwarding, which is related to data movement and whose performance depends more on the network device's underlying switching fabric than its processing power. However, the focus of this paper is on routing and filtering tasks only, as in most existing Internet processors. Despite a great deal of industrial research and development efforts invested in this area, to our knowledge no attempt has been made previously to perform a systematic study on the architectural tradeoffs of Internet processors. Given the observed and projected growth rate of Internet applications, high-performance internetwork devices are expected to play a crucial role in advancing future information infrastructures. Therefore, it is important to examine the design issues of the basic building block of internetwork devices: Internet processors.
At the IP level, the routing table lookup problem is equivalent to longest prefix matching because of the Internet's hierarchical addressing structure. Specifically, each IP routing table entry (see footnote 1) logically includes the following fields: a network mask, a destination network address and an output port identifier. Given an IP packet's destination host address, the network mask field of an IP routing table entry is used to extract the destination network address, which is then compared to the entry's destination network address field. If they match, this routing table entry is considered a potential candidate. Logically the destination host address is compared against each and every routing table entry this way. Finally, the routing table entry candidate with the longest network address field wins, and the packet is routed via the output port specified in this entry to the corresponding next-hop router. If none of the routing table entries match the incoming IP packet's destination host address, the packet is forwarded to the default router.
The network mask field in each routing table entry extracts the most significant N bits from the IP packet's destination address as the network address, where N is the number of 1's in the network mask. Therefore, IP routing table lookup is equivalent to searching for the longest prefix match in the routing table for a given destination IP address. Moreover, with the increasing threat of IP address depletion, a technique called Classless Inter-Domain Routing (CIDR) is currently in use that allows more efficient allocation of contiguous address chunks in the IP address space. In contrast to the classical addressing scheme, in which N can take only one of three possible values: 8, 16, and 24, CIDR allows N to take any value from 1 to 31. This generality further complicates routing table lookup because it significantly increases the number of possible network addresses.
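The logical lookup procedure described above can be sketched as follows. This is a minimal illustration, not the lookup algorithm used in this paper: the function names, the tuple layout of a table entry, and the example routes are all hypothetical, and the linear scan is chosen for clarity rather than speed.

```python
import ipaddress


def ip(text):
    """Convert dotted-quad notation to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


def longest_prefix_match(table, dest, default_port=None):
    """Linear-scan longest prefix match.

    table: iterable of (network_address, prefix_len, output_port), where
    prefix_len is N, the number of 1's in the entry's network mask.
    """
    best_len, best_port = -1, default_port
    for network, n, port in table:
        # Mask with the N most significant bits set.
        mask = ((1 << n) - 1) << (32 - n) if n else 0
        if (dest & mask) == network and n > best_len:
            best_len, best_port = n, port
    return best_port


# Hypothetical routing table used only for illustration.
ROUTES = [
    (ip("10.0.0.0"), 8, "A"),
    (ip("10.1.0.0"), 16, "B"),
    (ip("10.1.2.0"), 24, "C"),
]
```

For instance, `longest_prefix_match(ROUTES, ip("10.1.2.3"))` matches all three entries and selects the one with the 24-bit prefix, while an address outside 10.0.0.0/8 falls through to the default port.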
The packet filtering problem is equivalent to a multi-dimensional range search problem because in the most general case the conditions under which filter rules are applied represent sets of hypercubes in a high-dimensional space. For example, a typical packet-level firewall filter rule could state that all packets whose source and destination host addresses fall within a certain range, and whose source and destination ports fall within another range should be rejected. In this case, the condition part of the rule corresponds to a hypercube in a 4-dimensional space, whose four main axes are defined by source IP address, source port number, destination IP address, and destination port number.
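To make the hypercube view concrete, the sketch below represents a filter rule as four inclusive ranges, one per axis, and tests whether a packet falls inside it. The class and function names, the first-match policy, and the default action are assumptions made for illustration; the paper does not prescribe a rule representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterRule:
    """A rule whose condition is a 4-dimensional hypercube (inclusive ranges)."""
    src_ip: tuple
    src_port: tuple
    dst_ip: tuple
    dst_port: tuple
    action: str

    def matches(self, packet):
        # packet is (src_ip, src_port, dst_ip, dst_port) as integers.
        axes = (self.src_ip, self.src_port, self.dst_ip, self.dst_port)
        return all(lo <= value <= hi for (lo, hi), value in zip(axes, packet))


def classify(rules, packet, default="accept"):
    """Return the action of the first matching rule (assumed policy)."""
    for rule in rules:
        if rule.matches(packet):
            return rule.action
    return default


# Hypothetical rule: reject a block of sources talking to low ports.
RULES = [FilterRule((100, 199), (0, 65535), (500, 599), (0, 1023), "reject")]
```

A packet is a point in the 4-dimensional space, so filtering reduces to asking which hypercube, if any, contains that point.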
Efficient algorithms to solve these two problems exist [6] [7] [8]. However, the architecture-level research question is how to execute them at wire speed. For example, if the router's performance target is 10 million packets per second, the per-packet processing, including both longest prefix match and multi-dimensional range search, should be completed within 100 nsec. While many attempts have been made to build specialized hardware for clever packet routing and filtering algorithms, in this work we chose a time-tested architectural idea, caching, to attack this problem, based on the belief that there is sufficient locality in the packet stream for reusing results of routing computation. That is, instead of trying to speed up hard-to-accelerate algorithms with custom hardware, we opt to use caching to minimize the number of times such algorithms need to be invoked in the first place.
Footnote 1: Unless explicitly indicated otherwise, the term "IP" refers to Internet Protocol Version 4. Each routing table entry also includes a next-hop router's address and an expiration timer, but they are ignored in this paper.
Caching alone is not sufficient due to less locality in packet address streams than the instruction/data reference streams in program execution. Given caches of a fixed configuration, the only way to improve the cache performance is to increase their effective coverage of the IP address space, i.e., each cache entry covering a larger portion of the IP address space. Towards this end, this work develops a novel address range merging technique by exploiting the fact that there is a limited number of outcomes for routing table lookup (the number of output interfaces in a network device) regardless of the size of the IP address space. Our simulation results demonstrate that address range merging improves the caching efficiency by a factor of 5 over generic IP host address caching, in terms of average routing table lookup time.
The rest of this paper is organized as follows. In Section 2, we review previous work related to Internet processors. In Section 3, we describe the network packet traces used and the architectural models assumed in this study. Section 4 presents the results for the baseline cache design, which supports routing table lookup based on individual destination host addresses. Section 5 presents the performance results of caching host address ranges rather than individual destination host addresses. Section 6 presents the simulation results of a further performance optimization technique that exploits the fact that the number of outcomes of network packet processing is typically finite and small. In Section 7, we discuss the issues of applying caching to support general packet filtering, which requires longer input addresses. Section 8 concludes this paper with a summary of the main research results and a brief outline of on-going work in this project.
Related Work
State-of-the-art Internet routing devices use two types of Internet processors for routing and filtering: general-purpose CPU and ASIC. BBN's MGR router [1] uses a DEC Alpha 21164 processor as the routing engine and relies on the on-chip L1 and L2 cache for software-based routing table cache. IBM's Integrated Switch Router [3] uses PowerPC 603e for both control engines and forwarding engines on the interface cards. Some IP routers are rooted in a massively parallel architecture such as Pluris' massively parallel router [4], which uses a 100+ MHz general-purpose CPU on each node to perform the routing function. The router described in [5] is based on a cluster architecture that uses a 300-MHz Pentium on each node for both packet routing and real-time packet scheduling.
Most high-performance switches and routers on the market implement proprietary routing/filtering algorithms in ASIC chips whose micro-architecture description is generally not disclosed. None of the custom-designed processors explicitly mention the use of cache in the chip descriptions, although some of them did mention [10] the reason why they choose not to use cache: the traffic as seen in major Internet backbone routers is not expected to exhibit sufficient locality to justify the use of caches. Feldmeier [18] studied the management policy for the routing-table cache, and showed that the routing-table lookup time can be reduced by up to 65%. Chen [19] investigated the validity and effectiveness of caching for routing-table lookup in multimedia environments. Estrin and Mitzel [20] derived the storage requirements for maintaining state and lookup information on the routers, and showed that locality exists by performing trace-driven simulations of an LRU routing table lookup cache, for different conversation granularities. Jain [21] found that the cache replacement algorithm may need to be different for different types of traffic (interactive vs. non-interactive), and in some cases, cache could actually hurt the performance if the cache size is not sufficiently large.
More recently, results from Internet traffic studies [2] as well as our own showed that there is significant locality in the packet stream, so that caching could be a simple and powerful technique to address the per-packet processing overhead in gigabit/terabit routers. Instead of generic host address caching, the goal of this work is to perform an empirical study on various design dimensions of the packet processing cache incorporated in Internet processors.
Performance Evaluation Methodology

Trace Collection
We use a trace-driven simulation methodology to study the design of Internet processors' cache memory systems. Although there were several IP packet traces available in the public domain, none of them are suitable for our purposes for two reasons. First, they were mostly captured before 1993, before WWW came into existence, and thus may not be representative of today's Internet traffic. Second, all of these traces have been "sanitized", i.e., IP addresses replaced with unique integers. While this is fine for temporal locality studies such as fully associative cache simulation, it is completely unusable for studies that require spatial locality information such as set-associative hardware cache simulation.
As a result, we decided to collect a packet trace from the periphery link that connects the Brookhaven National Laboratory (BNL) to the Internet via ESnet at T3 rate, i.e., 45 Mbits/sec. This link is the only link that connects the entire BNL community to the outside world. The internal backbone network of BNL is a multi-ring FDDI network, while the periphery router is a Cisco 7500 router. The trace is collected by a tcpdump program running on a Pentium-II/233 MHz machine, which snoops on a mirror port of a Fast Ethernet switch that links BNL's periphery router with ESnet's router. Therefore the packet collection is completely un-intrusive. The first 80 bytes of each Ethernet packet are captured, compressed, and stored to the disks in real time. The packet trace was collected from 9AM on 3/3/98 to 5PM on 3/6/98. The total number of packets in the trace is 184,400,259, with no packet loss reported by tcpdump. To the best of our knowledge, this work is one of the first network packet locality studies based on real packet traces that are collected from major production routers in the post WWW/Mosaic era. Because there are only three output interfaces in the BNL router, we used a backbone router's routing table from the IPMA project [22] in the simulations.
Recognizing the fact that the BNL network is at the edge of the Internet and thus the collected trace may not reflect the traffic patterns on backbone routers, we multiplexed portions of the original trace to create an aggregated packet stream that emulates the effect of interleaving the traffic from a large number of un-related TCP/UDP connections. Specifically, we extracted from the collected trace 100 30-minute packet sequences that are temporally spaced as far apart as possible, and interleaved them to form a new amalgamated trace. Essentially we are simulating spatial traffic (un-)correlation using temporal traffic (un-)correlation. To further mitigate the performance skews due to higher traffic locality in the collected trace, we focus our simulation study mostly on caches that are smaller than is feasible in current processors.
Architectural Assumptions
Because the major goal of this research is to explore the design space of cache memory subsystems specifically for Internet processors, there is no reason to restrict ourselves only to the conventional CPU cache hardware structures. In the following sections, we will explore four Internet processor cache designs and their detailed architectural tradeoffs using trace-driven simulations. The first design is a generic CPU cache memory for routing-table lookup, where the destination host address is treated as a virtual address. The second design is an improvement over the first by exploiting the fact that each routing table entry corresponds to a contiguous range of the IP address space. Therefore, instead of caching individual destination host addresses, the Internet processor cache can cover a larger portion of the IP address space if each cache entry corresponds to a host address range.
The third design represents a further performance optimization over the second by exploiting the fact that the number of distinct outcomes of routing-table lookup is equal to the number of output interfaces in a router and is thus relatively small. As a result, one could choose a different hash function than that used in generic CPUs to "combine" disjoint host address ranges that share the same routing-table lookup result into a larger logical host address set. Each such logical host address set is mapped to one cache entry. This technique thus further increases the effective size of the Internet processor cache's coverage of the IP address space. The last design focuses on cache memory support required to broaden the scope of network packet processing from simple routing-table lookup to general packet filtering.
Cache Miss Handling
The average routing table lookup time depends on both the cache hit ratio and the cache miss penalty, which is determined by the software algorithm used to perform routing-table lookup. Figure 1 illustrates the data flow of the NART lookup process.
We have implemented the above NART algorithm on a Pentium-II 233 MHz machine with 16-KByte L1 data cache running Linux, and tested it on the IPMA routing table. The measured software NART lookup time is 120 CPU cycles on the average.
Baseline: Host Address Cache (HAC)
Figure 2 shows the baseline Internet processor cache architecture, which is identical to a conventional CPU cache. In this section, we report the results of generic cache simulations by varying the cache size, the cache block size, and the degree of associativity for two reasons: to identify possible differences in locality characteristics between network packet streams and program reference streams, and to establish the baseline model against which subsequent cache design alternatives are compared. The simulated cache miss ratio results in Table 1 show that the cache size and degree of associativity have a similar performance effect on the Internet processor cache as on the CPU cache. However, a distinct difference between network packet streams and program reference streams is that the former lacks spatial locality, as evidenced by the fact that for a given cache size and degree of associativity, decreasing the block size monotonically decreases the cache miss ratio. Intuitively this behavior is expected as there is no direct temporal correlation among network activities of the hosts residing in the same subnet. Caches with larger block sizes perform worse because a larger block size leads to inefficient cache space utilization when references to addresses within the same block are not correlated temporally, i.e., when spatial locality is low. The performance difference between cache configurations that are identical except for the block size could be dramatic. For example, the miss ratios of a 4-way set associative 8K-entry cache with a 32-entry block size and one with a 1-entry block size are nearly an order of magnitude apart, 38.05% vs. 3.29%.
As cache size increases, the performance impact of block size decreases, although still significant, because the space utilization efficiency is less of an issue with larger caches. From this study, we conclude that the block size of network processor cache should always be small, preferably one entry wide.
Whenever the base data structure from which a cache is built changes, there is a cache consistency problem. For the host address cache, modifications to the routing table due to the routing protocol's message exchanges give rise to the consistency problem. However, unlike CPU cache, temporal inconsistency in the host address cache is tolerable, because the routing protocol itself does not expect the route information to converge immediately. Therefore, there is much more latitude in the timing of when consistency maintenance actions should be taken. To simulate the effects of routing table changes, we flush the contents of the host address cache and measure the impacts of routing table update frequency on the effectiveness of the host address cache. The results are shown in Table 2, which assume the cache is direct-mapped and its block size is one entry wide. As the flush interval increases, the miss ratio decreases as expected. But the performance difference due to flushing, as shown by the ratio of the miss rates corresponding to the 100K and ∞ flush intervals, increases with the cache size. The reason for this behavior is that larger caches require a longer cold-start time, and therefore tend to suffer more than smaller caches when the flush interval is small. Consequently, the relative performance difference between different flush frequencies is more significant for larger caches. To put the flush intervals used in these simulations in perspective, 100K packets is equivalent to 100 msec for a router that can process one million packets per second. On the other hand, the interval between consecutive routing table changes is on the order of seconds.
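The flushing experiment described above can be approximated with a few lines of code. The following is only a sketch of a direct-mapped cache with a one-entry block size, not the simulator used in this study; the function name, the use of the low-order address bits as the set index, and the toy traces in the usage note are assumptions.

```python
def simulate_direct_mapped(addresses, num_entries, flush_interval=None):
    """Return the miss ratio of a direct-mapped, one-entry-block cache.

    addresses: list of integer destination addresses (the packet trace).
    num_entries: number of cache entries, assumed to be a power of 2.
    flush_interval: flush the whole cache every this many packets;
    None means the cache is never flushed.
    """
    tags = [None] * num_entries
    misses = 0
    for i, addr in enumerate(addresses):
        if flush_interval and i > 0 and i % flush_interval == 0:
            tags = [None] * num_entries   # emulate a routing table update
        index = addr % num_entries        # low-order bits select the entry
        if tags[index] != addr:           # full address kept as the tag
            misses += 1
            tags[index] = addr
    return misses / len(addresses) if addresses else 0.0
```

On the toy trace `[1, 2, 1, 2, 1, 2]` with four entries, only the two cold-start references miss when the cache is never flushed, whereas flushing every two packets makes every reference a miss. This mirrors, in miniature, the cold-start effect discussed above.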
Figure 3: The Internet processor cache architecture that caches host address ranges rather than individual host addresses. Range size is a global parameter applied across the entire address space, and is determined by maximally concatenating individual entries in the flattened routing table.
while the latter is an encompassed entry. An encompassing entry's network address is a prefix of those entries it "encompasses." The address range associated with each encompassed routing table entry needs to be "culled" away from the address ranges of all the entries that encompass it, so that every address range in the IP address space is covered by exactly one routing table entry. Consequently, one routing table entry could correspond to multiple disjoint address ranges. This culling step is essential because it ensures that at most one cache entry matches each incoming IP packet's destination address.
Second, adjacent address ranges that share the same output interface should be merged into larger ranges as much as possible. Once this merging is done, these ranges are "aligned", that is, ranges are potentially split to make all range sizes powers of 2 and to make the starting addresses of all ranges aligned with a multiple of their size. Then the minimum of all resulting address ranges is calculated. Finally the minimum range granularity parameter of the host address range cache is chosen to be the largest integer that is smaller than the minimum-granularity range and is also a power of 2. Range size, which is defined as log2(minimum range granularity), thus represents the number of least significant bits of the IP addresses that could be ignored during routing-table lookup, as destination addresses within a minimum-granularity range are guaranteed to be routed in the same way. Figure 3 shows the hardware architecture of the host address range cache, which is the baseline cache augmented with a logical shifter. The destination address of an incoming packet is logically right-shifted by range size before being fed to the baseline cache. Because each cachable item corresponds to an address range, the effective size of a host address range cache's coverage of the IP address space is multiplied by the minimum range granularity.
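The merge, align, and shift steps can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function names and the `(start, end_exclusive, port)` representation of a culled routing table are assumptions, and the sketch takes the smallest aligned block itself as the minimum range granularity.

```python
def merge_adjacent(ranges):
    """Merge neighbouring ranges that share an output port.

    ranges: sorted, disjoint list of (start, end_exclusive, port).
    """
    merged = []
    for start, end, port in ranges:
        if merged and merged[-1][1] == start and merged[-1][2] == port:
            merged[-1] = (merged[-1][0], end, port)
        else:
            merged.append((start, end, port))
    return merged


def aligned_blocks(start, end):
    """Split [start, end) into power-of-2 sized blocks aligned to their size."""
    while start < end:
        # Largest power of 2 that divides start (any size if start is 0).
        size = start & -start if start else 1 << (end - start).bit_length()
        while size > end - start:
            size >>= 1
        yield start, size
        start += size


def range_size_bits(ranges):
    """Number of low-order address bits that can be ignored during lookup."""
    smallest = min(size
                   for s, e, _ in merge_adjacent(ranges)
                   for _, size in aligned_blocks(s, e))
    return smallest.bit_length() - 1      # log2 of a power of 2


def harc_key(dest, range_bits):
    """Logical right shift applied before the baseline cache lookup."""
    return dest >> range_bits
```

For example, `range_size_bits([(0, 4, 1), (4, 8, 1), (8, 16, 2)])` merges the first two ranges and yields 3, so every address in a block of 8 maps to the same cache key.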
Table 3: Cache miss ratio comparisons between host address range cache (HARC) and host address cache (HAC), assuming the block size is one entry wide and the range size is 5. The last column is the ratio between HAC's and HARC's average routing-table lookup times, assuming the hit access time is one cycle and the miss penalty is 120 cycles. Recoverable column headers: Cache Size, Associativity, Miss Ratio (HARC), Miss Ratio.
We processed the IPMA routing table according to the steps described above, and calculated the range size parameter, which turned out to be 5. This means that each host address range cache entry now corresponds to a continuous range of 32 addresses, a factor of 32 increase in the cache's effective coverage. Table 3 shows the comparison between the cache miss ratios of host address range cache (HARC) and host address cache (HAC), assuming the block size is one entry wide. HAC's miss ratio is between 1.68 and 2.10 times higher than that of HARC. In terms of average routing-table lookup time, HARC is between 58% and 78% faster than HAC, assuming that the hit access time is one cycle and the miss penalty is 120 cycles. Because the logical right shifting step in the HARC lookup lowers the "degree of variation" in the address stream as seen by HARC, the probability of conflict miss increases. As a result, the miss ratio gap between HAC and HARC widens with the degree of associativity, because HARC benefits more from higher degrees of associativity by eliminating more conflict misses than HAC.
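The conversion from miss ratio to average lookup time can be written down explicitly. The sketch below assumes that a hit costs the hit access time and a miss costs the miss penalty in total; the paper states the one-cycle and 120-cycle figures but not the exact formula, so this weighting is an assumption, as are the function names.

```python
def average_lookup_time(miss_ratio, hit_cycles=1, miss_cycles=120):
    """Expected cycles per lookup under the assumed cost model."""
    return (1.0 - miss_ratio) * hit_cycles + miss_ratio * miss_cycles


def lookup_time_ratio(miss_ratio_a, miss_ratio_b):
    """How many times slower design A is than design B on average."""
    return average_lookup_time(miss_ratio_a) / average_lookup_time(miss_ratio_b)
```

Applied to the two baseline miss ratios quoted earlier for the 4-way 8K-entry cache, `average_lookup_time(0.0329)` is about 4.9 cycles and `average_lookup_time(0.3805)` is about 46.3 cycles, which shows how strongly the 120-cycle miss penalty dominates the average.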
Intelligent Host Address Range Cache (IHARC)
Traditional CPU cache directly takes the least significant bits of a given address, and uses them to index into the data and tag arrays. Therefore the corresponding hash function is a simple selector function using the K least significant bits of the input addresses, where $2^K$ is the number of cache sets. In this section, we show that it is possible to further increase the minimum-granularity range in the host address range cache by choosing a more appropriate hash function for cache lookup.
Figure 4: A routing table example that illustrates the usefulness of carefully choosing the index bits. In this case, the 1-th bit is chosen to be the index bit. The total number of distinct address ranges is reduced from 8, if the basic merging operation used in the formation of the host address range cache is used, to 3. The three address ranges are labeled as A, B, and C.
Consider the example routing table in Figure 4, where there are 16 4-bit host addresses with three distinct output interfaces, 1, 2 and 3. The merging algorithm used in calculating the range size of the host address range cache will stop after all adjacent address ranges with identical output interfaces are combined. In this case, the total number of address ranges is 8, because the minimum range granularity is 2. To further grow the address range that a cache entry can cover, one could choose the index bits carefully so that when they are ignored, some of the identically labeled address ranges are now "adjacent" and thus could be combined. For example, if the 1-th bit (counting from 0 at the least significant bit) is chosen as the index bit into the data/tag array, then the host addresses 0000, 0001, 0100, and 0101 can be merged into an address range because they have the same label, 1, and when the 1-th bit is ignored, they form a continuous sequence, 000, 001, 010, and 011. Similarly, 1000, 1001, 1100, and 1101 also can be merged into an address range, as well as all the host addresses whose corresponding output interface is 2. With this choice of the index bit, the total number of address ranges to be distinguished during cache lookup is reduced from 8 to 3.
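The counts in this example can be checked mechanically. In the sketch below, the label list is reconstructed from the description above (it is not copied from Figure 4 itself), the function names are hypothetical, and the count ignores the power-of-2 alignment constraint, which happens to be satisfied by the merged ranges in this example.

```python
# Output interface of each 4-bit host address 0000..1111, as implied by the text.
LABELS = [1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2]


def split_address(addr, index_bits, width):
    """Return (index, rest): the index bits and the remaining bits, compacted."""
    index = rest = 0
    i_pos = r_pos = 0
    for b in range(width):              # b = 0 is the least significant bit
        bit = (addr >> b) & 1
        if b in index_bits:
            index |= bit << i_pos
            i_pos += 1
        else:
            rest |= bit << r_pos
            r_pos += 1
    return index, rest


def count_ranges(labels, index_bits, width):
    """Total number of address ranges to distinguish across all partitions."""
    partitions = {}
    for addr, label in enumerate(labels):
        index, rest = split_address(addr, index_bits, width)
        partitions.setdefault(index, {})[rest] = label
    total = 0
    for part in partitions.values():
        seq = [part[k] for k in sorted(part)]
        # One range per run of identical labels within the partition.
        total += 1 + sum(a != b for a, b in zip(seq, seq[1:]))
    return total
```

With no index bit removed, `count_ranges(LABELS, set(), 4)` gives the 8 ranges of the basic merging step; with bit 1 as the index bit, `count_ranges(LABELS, {1}, 4)` gives 3.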
Intuitively, IHARC provides more opportunities for merging identically-labeled intervals by decomposing the address space into partitions and attempting to merge those intervals within the same partition. Note that HARC corresponds to a partitioning based on least-significant bits because it insists on merging intervals that are adjacent in the original IP address space, and thus is a special case of IHARC.
While HAC and HARC use a fixed three-level-table NART structure (16, 8 and 8 bits) that is independent of the hardware cache configurations, the NART associated with IHARC depends on the hardware configuration. In particular, the size of the Level-1 Table is determined by the number of index bits in IHARC, and the set of bits used to index the Level-1 Table is the same as the selected index bits in IHARC. As a result, the cache miss penalty for IHARC may be different for different cache configurations. However, measurements from our prototype implementation show that the average miss penalty is almost the same for all IHARC cache configurations we experimented with, and moreover is close to that of HAC and HARC, i.e., 120 cycles. Another important difference is that in addition to the output interface, an NART entry contains the address range it corresponds to. Note that every leaf NART entry, viz., an NART entry containing an output interface, would correspond to exactly one IHARC address range. Thus, together with the routing lookup decision, it also keeps the corresponding address range itself. This address range is then used in populating the IHARC cache upon a miss, as explained later in this section.
Figure 5: A greedy index bit selection algorithm used to pick the bits in the input addresses for cache lookup.
We first describe our index bit selection algorithm. Assume N and K are the number of bits in the input address and in the index key, respectively. In general, any subset of K bits in the input addresses could be used as the index bits, except the least significant range size bits as determined by the basic merging step in constructing the host address range cache. We use a greedy algorithm to select the K index bits, as shown in Figure 5. S represents the set of index bits chosen by the algorithm so far. Score(S, j) is a heuristic function that calculates the desirability of including the j-th bit, given that the bits in S have already been chosen to be included in the index bit set. As the bits in $S \cup \{j\}$ decompose the IP address space into disjoint partitions, the heuristic function first computes, for each induced partition, the number of distinct address ranges that need to be uniquely tagged. Note that this partitioning makes address ranges that were originally non-adjacent in the address space adjacent (as illustrated in Figure 4). While computing Score(S, j), those "adjacent" address ranges in a partition that have the same output interface are merged into larger ranges.
In the IHARC cache design, each IHARC partition corresponds to one cache set. Hence, the metric Score() measures the extent of contention within a cache set because the number of address ranges in each partition that need unique tags is the number of caching entries that are going to compete for the partition's corresponding cache set in cache lookup. Therefore, the probability of conflict miss in a cache set is proportional to the number of address ranges that need to be uniquely tagged in the associated partition of the IP address space. Note that uniquely tagged address ranges are determined only after address ranges in each partition are properly merged.
Given a candidate index bit set $S \cup \{j\}$, the algorithm computes the $i$-th partition's metric $M_{i,S,j}$, for all $i$. Then the algorithm minimizes Score(S, j), which is a weighted sum of $\sum_i M_{i,S,j}$ and $\sum_i |\bar{M}_{S,j} - M_{i,S,j}|$, where $\bar{M}_{S,j}$ is the mean of all $M_{i,S,j}$'s. The second term of the weighted sum measures the deviation among partitions and is included to prevent the occurrence of hot-spot partitions, which potentially could lead to excessive conflict misses in the corresponding IHARC cache sets. Figure 6 shows the hardware architecture of a host address range cache with a programmable hash function engine that allows tailoring the choice of the index bit set to individual routing tables.
Figure 6: The intelligent host address range cache architecture that uses a hash function specific to a routing table in order to combine disjoint address ranges into a logical host address set which is then mapped to a cache entry. The programmable hash engine provides the flexibility needed to tailor the hash function to the network routing table.
Table 4: The number of address ranges that need to be distinguished and the bits chosen as index bits after applying the index bit selection algorithm to a routing table from the IPMA project. The last two columns correspond to numbers of address ranges with and without the constraint that each address range's size has to be a power of 2, respectively.
... or 134,217,728. In other words, the index bit set selection algorithm effectively reduces the number of distinct address ranges from HARC to IHARC by three orders of magnitude. In addition, this number is only 3 to 4 times the number of entries in the original routing table, even though the resultant address ranges can now be easily matched with conventional cache lookup hardware. As mentioned before, for address ranges to be tag-matched based on masks, their size has to be a power of two. Table 4 shows the difference in the number of distinct address ranges with and without this constraint.
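The greedy index bit selection described in this section can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Figure 5: the two weights of the score are not given in the text and are set to 1 here, the function names are hypothetical, the table is given as one label per address, and the power-of-2 constraint on range sizes is ignored.

```python
def partition_metrics(labels, index_bits, width):
    """M_i for each partition: number of ranges needing a unique tag."""
    parts = {}
    for addr, label in enumerate(labels):
        index = rest = 0
        i_pos = r_pos = 0
        for b in range(width):          # b = 0 is the least significant bit
            bit = (addr >> b) & 1
            if b in index_bits:
                index |= bit << i_pos
                i_pos += 1
            else:
                rest |= bit << r_pos
                r_pos += 1
        parts.setdefault(index, []).append((rest, label))
    metrics = []
    for entries in parts.values():
        seq = [label for _, label in sorted(entries)]
        metrics.append(1 + sum(a != b for a, b in zip(seq, seq[1:])))
    return metrics


def score(labels, index_bits, width, w_total=1.0, w_spread=1.0):
    """Weighted sum of total ranges and absolute deviation from the mean."""
    m = partition_metrics(labels, index_bits, width)
    mean = sum(m) / len(m)
    return w_total * sum(m) + w_spread * sum(abs(mean - x) for x in m)


def greedy_index_bits(labels, width, k, excluded=frozenset()):
    """Pick k index bits one at a time, each minimizing the score."""
    chosen = set()
    for _ in range(k):
        candidates = [j for j in range(width)
                      if j not in chosen and j not in excluded]
        best = min(candidates, key=lambda j: score(labels, chosen | {j}, width))
        chosen.add(best)
    return chosen


# Labels of the 16 addresses in the Figure 4 example, as implied by the text.
FIG4_LABELS = [1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2]
```

On the Figure 4 example, `greedy_index_bits(FIG4_LABELS, 4, 1)` selects bit 1, in agreement with the choice discussed in the text. The `excluded` argument can hold the least significant range size bits that must not be used as index bits.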
Table 5 shows the miss ratios for IHARC, assuming that the block size is one entry wide. In terms of average routing- 
Packet Filter Cache
As network routers evolve towards interpreting packet bits beyond network-layer headers, such functions as real-time packet classification and firewalling are now being incorporated in state-of-the-art routers. A similar caching mechanism could be applied to these "packet filtering" tasks, in which filter rules are fired if the incoming packet matches their firing conditions. That is, once a packet invokes a filter rule, one can cache the packet's connection ID, either as a pair (source and destination IP addresses) or a quadruple (source and destination IP addresses, and source and destination port numbers), together with the outcome of the invocation. Subsequently, all other packets belonging to the same connection can be filtered via a cache lookup without performing any software-based rule matching. Conceptually, the only difference between the routing table cache and the packet filter cache is that the latter uses a longer "input." From a cache design standpoint, a filter cache may need better hash functions for indexing into the data/tag arrays because of the larger address space. In this section, we report the initial results of a simulation study of a packet filter cache.
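The caching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FilterCache`, the `slow_path` callback standing in for software rule matching, the sample rule, and the simple summing index function are all hypothetical and are not taken from the simulator used in this study.

```python
class FilterCache:
    """Direct-mapped cache with one-entry blocks.

    Maps a connection ID (a pair or quadruple of integers) to the cached
    outcome of the filter rule that the connection's first packet invoked.
    """

    def __init__(self, index_bits, index_fn):
        self.size = 1 << index_bits
        self.index_fn = index_fn              # hash index function (assumed)
        self.entries = [None] * self.size     # slot: (conn_id, outcome) or None
        self.hits = 0
        self.misses = 0

    def classify(self, conn_id, slow_path):
        slot = self.index_fn(conn_id) % self.size
        entry = self.entries[slot]
        if entry is not None and entry[0] == conn_id:   # full tag match
            self.hits += 1
            return entry[1]
        # Miss: fall back to software rule matching, then fill the slot.
        self.misses += 1
        outcome = slow_path(conn_id)
        self.entries[slot] = (conn_id, outcome)
        return outcome


def match_rules(conn_id):
    """Hypothetical rule set: deny traffic to destination port 23."""
    return "deny" if conn_id[3] == 23 else "allow"


cache = FilterCache(index_bits=12, index_fn=sum)
conn = (0x0A000001, 0x0A000002, 40000, 23)   # src IP, dst IP, src port, dst port
first = cache.classify(conn, match_rules)    # miss, runs match_rules
second = cache.classify(conn, match_rules)   # hit, served from the cache
```

Storing the whole connection ID as the tag keeps the sketch simple; a hardware design would store only the bits not implied by the index.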
Table 7: The miss ratio for a packet filter cache whose input is a quadruple that consists of the incoming packet's source and destination addresses and the source and destination port numbers. All cache configurations are direct mapped and have a one-entry block size.

Tables 6 and 7 show the miss ratios for the pair filter cache and the quadruple filter cache, respectively, for different cache sizes and choices of hash functions. All cache configurations are direct mapped and have a one-entry block size. We have experimented with four hash index functions. Assume $K$ is the number of index bits used to access the data/tag arrays. Least significant extracts the least significant 16 bits from each field in the pair or quadruple, sums them up, and takes the remainder from dividing the sum by $2^K$. Most significant is similar to Least significant except that it is the most significant 16 bits of each field that are extracted. Random-1 and Random-2 are two random choices of a subset of $K$ bits from the pair or quadruple.
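The hash index functions described above can be written out as a short Python sketch. It assumes 32-bit IPv4 addresses and 16-bit port numbers, so the "most significant 16 bits" of a port is the whole port. The function names, the `seed` parameter, and the way the fields are concatenated before random bits are picked are assumptions made for illustration.

```python
import random

# Assumed field widths in bits: src IP, dst IP, src port, dst port.
QUAD_WIDTHS = (32, 32, 16, 16)


def least_significant(fields, k):
    """Sum the low 16 bits of every field, then reduce modulo 2**k."""
    return sum(f & 0xFFFF for f in fields) % (1 << k)


def most_significant(fields, widths, k):
    """Sum the high 16 bits of every field, then reduce modulo 2**k."""
    return sum((f >> (w - 16)) & 0xFFFF
               for f, w in zip(fields, widths)) % (1 << k)


def make_random_bits(widths, k, seed):
    """Build an index function that picks k fixed, randomly chosen bits.

    Different seeds give different bit subsets, in the spirit of
    Random-1 and Random-2.
    """
    positions = random.Random(seed).sample(range(sum(widths)), k)

    def index(fields):
        key = 0
        for f, w in zip(fields, widths):   # concatenate the fields
            key = (key << w) | f
        idx = 0
        for p in positions:                # gather the selected bits
            idx = (idx << 1) | ((key >> p) & 1)
        return idx

    return index


quad = (0xC0A80001, 0x0A000002, 40000, 80)
ls_index = least_significant(quad, 12)
ms_index = most_significant(quad, QUAD_WIDTHS, 12)
random_1 = make_random_bits(QUAD_WIDTHS, 12, seed=1)
r1_index = random_1(quad)
```

For the pair cache the same functions apply with `widths=(32, 32)` and a two-element tuple of addresses.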
As shown in these two tables, proper choices of index bits are crucial, and conventional choices such as Least significant and Most significant outperform random choices by a large margin. For cache configurations using Least significant and Most significant, doubling the cache size from 4K to 8K entries leads to a relative cache miss ratio difference ranging from 3% to 44%. The cache size does not play as significant a role as expected, which is surprising because the cache sizes we experimented with are relatively small (4K and 8K entries). We have also varied the degree of associativity for a given cache configuration, and the performance differences among the direct-mapped, 2-way, and 4-way cases are significant and comparable to those of routing table caches. These initial results show the same general trends as conventional caches, but the absolute cache miss ratios for the packet filter cache are still too high to be acceptable. Further research into more accurate locality characterizations of "pair" and "quadruple" streams is called for.
Conclusion
This paper reports the results of one of the first research efforts on cache memory designs for emerging Internet processors. Based on a real packet trace collected from the main router of a national laboratory, we studied a series of routing-table cache designs and a packet filter cache design. Major results from this research are summarized as follows:
Based upon the interleaved trace used in the study, there seems to be sufficient temporal locality in the packet stream to justify the use of a routing-table cache in Internet processors. However, spatial locality is weak, and therefore the block size should be as small as possible, preferably one entry wide.
Caching address ranges rather than individual addresses greatly improves the effective coverage of caches of a given size and therefore their hit ratios.
A careful choice of the index bits during cache lookup is crucial and can dramatically reduce the number of address ranges that need to be distinguished, and thus the cache miss ratio.
The locality characteristics of source/destination address pair streams and source/destination address/port quadruple streams are very different from those of destination address streams. As a result, traditional cache optimization techniques such as increasing the cache size and degree of associativity alone may not be as effective. Rather, choices of hash functions to index the data/tag arrays seem to play a key role in the performance of a packet filter cache.
We are currently investigating the performance impact of routing table updates on HARC and IHARC, both of which exploit the current contents of routing tables to dynamically reconfigure the cache hardware, and therefore need to be changed on the fly. Specifically, developing an incremental version of the index bit selection algorithm is essential to enhancing IHARC's practical usability. Finally, we are working on a detailed characterization study of the pair/quadruple streams, and on the design of better hash index functions to improve the hit ratio of packet filter caches.
