Abstract: In this paper, we propose new shared memory multiprocessor architectures and evaluate their performance for future Internet Protocol (IP) routers based on Symmetric Multiprocessor (SMP) and Cache Coherent Nonuniform Memory Access (CC-NUMA) paradigms. We also propose a benchmark application suite, RouterBench, which consists of four categories of applications representing key functions on the time-critical path of packet processing in routers. An execution-driven simulation environment is created to evaluate SMP and CC-NUMA router architectures using RouterBench. The execution-driven simulation can produce accurate cycle-level execution time predictions and reveal the impact of various architectural parameters on the performance of routers. We port the FUNET trace and its routing table for use in our experiments. We find that the CC-NUMA architecture provides excellent scalability for the design of high-performance IP routers. Results also show that the CC-NUMA architecture can sustain good lookup performance, even at a high frequency of route updates.
INTRODUCTION
As link speed continues to grow exponentially, packet processing at network switches and routers is becoming a bottleneck. Assuming an average packet size of 1,000 bits, a multigigabit router has to process several million packets per second. This implies that the average packet processing time has to be less than one microsecond, which is impossible for a single processor to achieve.
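The arithmetic behind this budget can be checked with a short sketch. The function names and the sample link speeds below are illustrative assumptions; only the 1,000-bit average packet size comes from the text.

```python
def packets_per_second(link_bps: float, avg_packet_bits: float = 1000.0) -> float:
    """Packets a saturated link delivers per second at the given average size."""
    return link_bps / avg_packet_bits


def per_packet_budget_us(link_bps: float, avg_packet_bits: float = 1000.0) -> float:
    """Average time available to process one packet, in microseconds."""
    return 1e6 / packets_per_second(link_bps, avg_packet_bits)


# Sample link speeds are assumptions chosen for illustration.
for gbps in (1, 2.5, 10):
    bps = gbps * 1e9
    print(f"{gbps} Gb/s: {packets_per_second(bps) / 1e6:.1f} Mpps, "
          f"{per_packet_budget_us(bps):.2f} us per packet")
```

At 1 Gb/s the budget is exactly one microsecond per packet, so any multigigabit link leaves less than that.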
Different approaches are proposed and deployed to improve the performance of Internet Protocol (IP) routers, whose performance metric has been mainly the throughput or packets per second. Application Specific Integrated Circuits (ASICs) are usually preferred in high-end core routers due to their high performance, while tolerating their inflexibility and lack of programmability. The emerging network processors provide programmability and flexibility together with high performance, but are usually targeted to edge routers. Therefore, general purpose multiprocessors are preferred for high-end core routers because of their high performance, programmability, and scalability [7] , [27] . Packet processing in these core routers is accomplished through software, hence, they are called software IP routers.
Modern commercial routers such as the BBN Multigigabit Router [27] and Cisco 12000 Series Gigabit Switch Router [7] employ multiprocessors to process packets. BBN's MGR [27] incorporates a number of centralized Forwarding Engines (FEs) that are responsible for processing packets coming from the line cards (LCs). LCs are network interfaces where packets arrive and depart. Cisco 12000 series multigigabit routers [7] adopt a distributed architecture, in which each LC has a dedicated FE associated with it. As these commercial routers are deployed, some questions arise for both the industrial and academic communities. Among them, the following two questions are addressed in this paper:
• How do we evaluate the software packet processing capability of a multiprocessor router architecture?
• What types of multiprocessor architectures yield the best performance for core routers?

The aim of this paper is to evaluate different multiprocessor architectures for software routers and to develop techniques to improve performance. Execution time or packet processing latency in a router is considered as the main performance metric in this paper instead of throughput. Thus, we employ execution-driven simulation to predict the performance of IP routers. The advantage of execution-driven simulation is that it can produce accurate cycle-level execution time and reveal the impact of architectural parameters on router performance. We extend Augmint [22] to perform multiprocessor architecture simulation. We choose Augmint because it runs fast due to its simple processor architecture model and it enables us to focus on exploring parallelism at the packet level. The impact of packet-level parallelism on the router performance is known to be more profound compared to architectural extensions like instruction-level parallelism (ILP), branch prediction, or speculation [4].
In order to evaluate the router architectures, we propose a benchmark, called RouterBench, which consists of four categories of applications that reside on the time-critical path of packet processing in a core router. Our benchmark applications are similar to those of the Click router [16]. We quantitatively analyze the computation requirements of the key functions in RouterBench and evaluate their execution time on the shared memory multiprocessor router architectures.
Route updates may happen frequently due to IGP, BGP [28] , or other routing protocols. The routing update messages cause changes to the routing table structures, stall the route lookup procedure, and degrade the lookup performance. Hence, we also study the impact of such route updates on the lookup time of multiprocessor routers.
In both BBN's MGR [27] and Cisco's 12000 series routers [7], common data structures such as the routing table are replicated in all the processors, which wastes memory. Updating multiple copies of a large data structure is also time consuming and needs special hardware support. Shared memory architectures enable multiple processors to share one copy of a data structure, e.g., the routing table, which not only saves memory space but also eases updating. The scalability and high performance of shared memory multiprocessor architectures make them salient candidates for router architectures. Hence, we propose to design and test two shared memory architectures in this paper.
The contributions of this paper are the following:
• We propose two shared memory multiprocessor architectures for IP routers based on SMP and CC-NUMA organizations.
• We construct an execution-driven multiprocessor simulation framework for evaluating these router architectures consisting of forwarding engines (FEs), line cards (LCs), and a switching fabric.
• We develop a set of benchmark applications that reside on the critical path of packet processing in the routers. This RouterBench is used to evaluate the multiprocessor architectures under study.
• We study the impact of route updates on the performance of routing table lookup in a multiprocessor router architecture.

This paper is organized as follows: The related research is presented in Section 2. We describe the SMP and CC-NUMA router architectures and introduce a simulation framework in Section 3. In Section 4, we identify the key functions that a router performs for processing a packet, and propose RouterBench, which is used in the simulation study of multiprocessor router architectures. The simulation results are presented in Section 5. In Section 6, we study the impact of route updates on the routing table lookup performance of the CC-NUMA based router. Finally, Section 7 concludes the paper.
RELATED WORK
Wolf and Turner proposed the design of a scalable, high-performance active router [34], where multiple network processors with cache and memory on a single application specific integrated circuit (ASIC) were used. However, a general purpose multiprocessor architecture has advantages over an ASIC or a specialized processor because it avoids a long development process, incurs less design cost, and is programmable. In addition, general purpose processors usually have on-chip caches, which can store frequently used data structures locally. Hence, BBN MGR [27] is based on DEC Alpha 21164 general purpose processors.
The performance evaluation of a router should be carried out with appropriate benchmark applications. There exist a few network benchmarks for network processor design and evaluation, such as CommBench [32], NetBench [18], and the PNI benchmark [9]. CommBench includes a set of header processing applications in traditional networks and payload processing applications in active networks. NetBench contains microlevel, packet-level, and application-level benchmarks. The PNI benchmark covers packet header applications such as packet classification and forwarding and packet data processing applications such as IPSec. Although these benchmarks cover various applications in the network realm, none of them specifically addresses the full workload of an IP router. RFC 1812 is the classic document that states the requirements of IP Version 4 routers [2]. Karlin and Peterson [15] summarized the main functions of routers as classification, forwarding, and scheduling. Kohler et al. [16] built a functional software router, the Click Modular Router, in which the key functions of an IP router are implemented.
Most of the work in performance evaluation of routers emphasizes the delay in packet transmission and ignores the packet processing latency. Chan and Morris presented a modeling framework to optimize the cost and performance of routers with multiple forwarding engines [5]. They assumed that route lookup time was exponentially distributed, which did not reveal the real packet processing delay. Papagiannaki et al. [26] measured the single-hop delay from an operational backbone network router, but did not specify the processor computation time, queuing delay, and other contributions to the total delay. Chen et al. [6] studied the buffer and queue management techniques to optimize the Click software router [16] on off-the-shelf multiprocessor PC hardware. Both [6] and [26] are based on existing hardware, so neither investigates the impact of architectural parameters (such as CPU power and memory size) or tunes these parameters to optimize router performance.
Bhuyan and Wang [4] were the first to measure the packet processing time and architectural impact through an execution driven simulation. They extended the RSIM [25] simulator to evaluate the execution of a routing table lookup algorithm on a multiprocessor router. In this paper, we further extend the previous research [4] to a complete simulation framework and router workload. We extend Augmint [22] to perform multiprocessor architecture simulation because RSIM spends too much simulation time on architectural detail inside the CPU, such as instruction level parallelism (ILP), branch prediction, speculative execution, etc. It was shown that the impact of these parameters is not that significant compared to multiprocessing [4] . Augmint assumes simple processor architectures, but enables us to focus on exploring packet-level parallelism.
MULTIPROCESSOR ARCHITECTURE FOR ROUTERS

Existing Multiprocessor Router Architectures
A basic router architecture consists of line cards, a router processor, and a backplane switch. A current trend in router design is to use multiple packet processing units instead of a single centralized one. High-performance routers can be divided into three organizations, shown in Fig. 1. The architecture of Fig. 1a connects multiple FEs and LCs via a high-speed bus. Packet headers can be sent to one of the FEs from an LC to perform route lookup, packet classification, header updates, etc. Such a router is inexpensive; however, since the LCs also forward packets to outbound LCs via the bus, the bus could be a bottleneck. The multiprocessor Click router in [6] is based on such an architecture. A parallel architecture in Fig. 1b

There are two additional decisions to be made when designing a high-performance multiprocessor router. The first is programmability. A programmable architecture has growing advantages over ASICs because new protocols and router functions can be incorporated. So, we consider general-purpose processors as forwarding engines. The second is memory usage. The routers can either have all data structures replicated for each processing element or have a shared data structure. The obvious benefit of replication is simultaneous memory access by all the FEs. However, as the size of the data structures (such as the tree structure for routing table lookup) grows and the number of processing elements increases, the memory requirement may easily exceed practical limits. For example, the Cisco 12000 one-port 10-Gigabit Ethernet Line Card has 256MB of routing table memory [8], which adds up to 4GB for a router with only 16 LCs. Another drawback of replicated data structures is the difficulty in updating them. For example, whenever there is an update on the routing table, every copy of the data structure for routing table lookup will need to be updated simultaneously.
This will increase the hardware complexity or degrade the performance significantly if done sequentially. So, a shared memory architecture seems to be a good candidate. A distributed shared memory (DSM) organization can be adopted to provide simultaneous memory access from different processors. This is possible because different processors will access different parts of the shared data structure at any particular time.
Shared memory architectures can be divided into two categories, namely, Symmetric Multiprocessors (SMP) and Cache-Coherent Nonuniform Memory Access (CC-NUMA) multiprocessors. Their organizations are briefly described later in this section. In shared memory architectures, a global memory space is shared by all the processors in the system. Each processor has a local cache which can store recently accessed shared data blocks. In order to study the sharing behavior in routers, we executed the Radix Tree Routing (RTR) [31] table lookup algorithm on a simulated SMP architecture with 32 processors. RTR is a classic route table lookup algorithm for layer three (IP) packet forwarding in routers which performs route lookup based on a radix tree data structure. We used publicly available FUNET trace and its routing table [24] in this experiment. Fig. 2 illustrates the sharing behavior of RTR, where the x-axis denotes the number of processors sharing a certain data block and the y-axis is the number of data blocks that x processors share. It can be observed that the number of data blocks shared by all processors is very high, which means that the degree of sharing in RTR lookup is substantial. These highly shared blocks are actually the radix tree root node and its near descendants. They are shared most frequently because every lookup has to begin from these nodes located at the top of the radix tree. As RTR lookup traverses the tree, route lookup on packets with the same destination IP address will follow the same path until the lookup finishes, whereas other lookups may branch to different ways after some point. A continuous stream of packets with the same destination IP will be seen in the packet stream because two communicating parties will maintain a connection and send data packets for a period of time. However, these packets may be sent to different FEs for processing. 
The corresponding FEs will follow the same path on the radix tree and, thus, the tree nodes along the path will be shared by these FEs. This interesting observation motivates us to propose multiprocessor architectures with shared memory for IP routers. We examine the suitability of both SMP and CC-NUMA multiprocessor architectures in the following sections.
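The sharing measurement described above can be sketched as a histogram over a memory access trace. This is a minimal illustration, not the instrumentation used in the simulator; the trace format (pairs of processor id and block address) and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def sharing_histogram(accesses):
    """accesses: iterable of (processor_id, block_address) pairs.

    Returns {x: number of data blocks touched by exactly x distinct processors},
    i.e. the quantity plotted on the y-axis against x.
    """
    sharers = defaultdict(set)
    for proc, block in accesses:
        sharers[block].add(proc)
    return dict(Counter(len(procs) for procs in sharers.values()))


# Toy trace (assumed): block 0 plays the role of the radix tree root, which
# every processor touches; the other blocks are deeper, less shared nodes.
trace = [(p, 0) for p in range(4)] + [(0, 1), (1, 1), (2, 7), (3, 9)]
hist = sharing_histogram(trace)
print(hist)
```

In a real trace the blocks near the root would populate the rightmost bins of the histogram, matching the observation that the root and its near descendants are shared by all processors.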
SMP-Based IP Router
In SMP router architectures, the FEs and LCs are connected through a high-speed bus. The centralized shared memory module is also attached to the bus, as depicted in Fig. 3. This architecture is the same as that of the multiprocessor Click router [6]. An SMP-based router architecture has the following features:
• Each FE has cache memory of one or multiple levels. The SMP router architecture uses uniform/centralized shared memory and broadcast/bus-based snoopy cache coherence protocol [3], [13].
• Centralized shared memory stores routing table and other data structures which are shared by all the FEs.
• An LC has an on-card memory buffer, where incoming packets (both header and payload) are stored.
• An FE accesses these memory spaces using memory mapped I/O operation, processes the packet header, and writes the port number of the outbound LC into the packet header.
• LCs transfer the packets to outbound LCs via the same shared bus.

The FEs perform header checking, classification, routing table lookup, etc. The routing table lookup process involves loading radix tree nodes from centralized main memory to cache, upon which the RTR algorithm is performed. The lookup result and updated packet header are written back (again via bus) to the origin LC. Intuitively, the potential problem could be the bottleneck of the centralized memory and bandwidth limit of the bus. The FEs have to compete for the bus to access shared data structures in centralized main memory and the LCs compete for the bus to transfer packets consisting of both headers and payloads.
CC-NUMA-Based IP Router
The architecture shown in Fig. 1b avoids the bus problem by putting a crossbar network between the LCs and FEs. A crossbar switch allows all one-to-one connections between the LCs and FEs. Although BBN's MGR has an organization similar to Fig. 1b , the memory of FEs is not shared, so the routing table is replicated in all the FEs. We propose a CC-NUMA architecture to eliminate the replication by using a shared memory paradigm.
A Cache Coherent Nonuniform Memory Access (CC-NUMA) router architecture, shown in Fig. 4 , has the following features.
• Each FE has cache memory that can be of multiple levels. Cache coherence is maintained through a directory-based organization [3], [13]. Each FE also has a local memory module, which is part of the global shared memory space.
• Each LC has an on-card memory buffer, which stores the packet (both header and payload). An FE can access any memory remotely via the crossbar. Ideally, an FE should be located at the LC, as considered in [4]. However, we chose the organization in Fig. 4, similar to BBN's MGR.
• All FEs and LCs are connected via a crossbar switch fabric, which allows simultaneous multiple connections for high bandwidth.

A CC-NUMA router works as follows: Each FE gets a new packet header from an LC and performs header checking, classification, routing table lookup, etc. Routing table lookup involves loading radix tree nodes from distributed memory to the local cache. The packet header is updated using the lookup result and is written back to the originating LC. The LCs then initiate packet transfers via the crossbar.
Multiprocessor Router Architecture Simulation
To evaluate the performance of the above router architectures, we develop an execution-driven simulator. Augmint [22] was originally a simulation tool for Intel CISC processors that simulates only the instruction execution on the processor side. Its processor simulation is relatively simple; however, it enables us to focus on the multiprocessing of packets. Augmint gives us the flexibility to add a memory module and a bus/crossbar module for the SMP/CC-NUMA architecture in the back end. We construct a cache memory module with 32-byte cache blocks and a 32KB cache size. It is two-way set-associative and the cache replacement policy is LRU. We extended Augmint to implement a snoopy cache coherence protocol for SMP and a full-map directory-based protocol for the CC-NUMA architecture [3], [13]. The memory management policy in CC-NUMA is page-based round-robin, so the radix tree data structure is distributed uniformly across the memory modules. We simulate FEs and LCs as independent components, as in a real router, and they interact through the interconnection (bus or crossbar switch).
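The cache organization described above (32KB, 32-byte blocks, two-way set-associative, LRU) can be illustrated with a small single-cache model. This is a sketch under stated assumptions: the class and method names are hypothetical, it is not the Augmint back end, and it models no coherence traffic or latency.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Hit/miss model of one cache with LRU replacement inside each set."""

    def __init__(self, size_bytes=32 * 1024, block_bytes=32, ways=2):
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = size_bytes // (block_bytes * ways)
        # One ordered dict per set; insertion order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Return True on a hit, False on a miss (the block is then loaded)."""
        block = address // self.block_bytes
        index, tag = block % self.num_sets, block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)    # evict the least recently used line
        lines[tag] = True
        self.misses += 1
        return False


cache = SetAssociativeCache()
print(cache.num_sets)  # 32KB / (32 B * 2 ways)
```

With these parameters the cache has 512 sets, so addresses that are 16KB apart compete for the same two lines.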
ROUTERBENCH
To evaluate router architecture performance with appropriate benchmark applications, we need to identify the key functions of routers. RFC 1812 [2] , "Requirements for IP Version 4 Routers," states that "an IP router can be distinguished from other sorts of packet switching devices in that a router examines the IP protocol header as part of the switching process."
The processing at the IP layer involves IP header validation, routing table lookup, decrementing time-to-live (TTL), fragmentation, etc. [2], [16], [15]. Profiling of an example Click software router [16] also shows that these tasks take a considerable amount of time to execute: header validation, classification, and decrementing TTL take 12.3 percent, 9.2 percent, and 3 percent of the total execution time, respectively. In addition, packet classification is becoming a mandatory task to support QoS in routers [14]. In general, each packet header will go through the processing shown in Fig. 5. To capture these key router functions, we propose RouterBench, which consists of the following four categories of applications:
• classification,
• forwarding,
• queuing, and
• miscellaneous.

These operations reside on the critical path of packets passing through the router; they are time-critical and are key factors in overall router performance. Therefore, we believe that RouterBench specifically identifies the workload of IP routers and is suitable for evaluating their performance. At this stage, we do not consider other non-time-critical functionalities, such as BGP routing, logging, and administrative tasks.
The source codes of the RouterBench applications are available for public download at http://www.cs.ucr.edu/ cial/routerbench/ [17].
Classification
Packet classification is the primary task for identifying flows, filtering data traffic, QoS [14], etc. Packets can themselves carry an explicit service classification field, such as Type of Service (TOS) in the IP header [1], [23]. A general classification is done with an access list that allows specification of a packet's source and destination address, protocol, and the protocol port address for TCP or UDP. In some systems such as Click [16], a more generic configuration mechanism allows any field of the packet header to be identified using specification of an offset and a field length. We take the general classification mechanism as our benchmark application, which uses a rule list and can check any field of the packet header. In theory, there can be an unlimited number of rules in a rule list. Since there is no typical rule list described in the literature, we list a set of rules that a typical router would configure based on an access list from a campus level router. There are a total of 336 rules declared for the inbound traffic. An example of a rule is "dst host 138.23.168.40 and tcp port 80," which classifies all the incoming HTTP packets destined for a server with the IP address 138.23.168.40. Fig. 6 shows the cumulative distribution function (CDF) of packet counts on top rules based on the data in our campus level router. The x-axis is the rank of a rule and the y-axis is the cumulative fraction of traffic matched. We find that only 31 of the 336 rules classified 99 percent of the traffic and many of the rules were never encountered by any packet. We thus use the top 31 rules in our classifier application. Our classification algorithm, ported from the Click router [16], is based on the "hierarchical trie" algorithm described in [11].
The hierarchical trie structure is an extension of the trie data structure. A trie structure is constructed by examining one field (for example, the "source IP" field) of the classification rules such that the trie consists of all the different values in this field. Each trie node contains an {offset, prefix, mask} tuple to represent the value in the field. Offset specifies the bits to be examined. Those bits are ANDed with mask and then compared with prefix to determine if there is a match or not. The left and right children of a node can be either another trie node or a leaf node, which indicates that the final classification decision can be made here. Then, the hierarchical trie is constructed by connecting all the tries based on the rule list. Classification of an incoming packet proceeds from the root node, determining if the packet matches at each node until it reaches a leaf node (decision node).
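The node test described above (select bits by offset, AND with the mask, compare with the prefix, then follow one of two children) can be sketched as follows. This is a hand-built illustration for the single example rule from the text, not the Click classifier; the `TrieNode` layout, the `on_match`/`on_miss` names, the byte-granular offsets, and the leaf labels are assumptions. It also assumes a 20-byte IPv4 header with no options and, as a simplification, checks only the TCP destination port.

```python
import socket
import struct
from dataclasses import dataclass
from typing import Union


@dataclass
class TrieNode:
    offset: int                        # byte offset of the 32-bit word to examine
    mask: int                          # bits of that word that matter
    prefix: int                        # expected value after masking
    on_match: Union["TrieNode", str]   # next node, or a leaf decision (string)
    on_miss: Union["TrieNode", str]


def classify(header: bytes, node: Union[TrieNode, str]) -> str:
    """Walk from the root until a leaf (decision) is reached."""
    while isinstance(node, TrieNode):
        (word,) = struct.unpack_from("!I", header, node.offset)
        node = node.on_match if (word & node.mask) == node.prefix else node.on_miss
    return node


def ip_to_int(dotted: str) -> int:
    return struct.unpack("!I", socket.inet_aton(dotted))[0]


# Rule from the text: "dst host 138.23.168.40 and tcp port 80".
port_node = TrieNode(20, 0x0000FFFF, 80, "http-server", "default")
proto_node = TrieNode(8, 0x00FF0000, 6 << 16, port_node, "default")
root = TrieNode(16, 0xFFFFFFFF, ip_to_int("138.23.168.40"), proto_node, "default")


def make_header(dst: str, proto: int, dst_port: int) -> bytes:
    """Build a 20-byte IPv4 header plus TCP/UDP ports, for demonstration only."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, proto, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton(dst))
    return ip + struct.pack("!HH", 12345, dst_port)


print(classify(make_header("138.23.168.40", 6, 80), root))
```

Each field of the rule becomes one level of the structure, and a packet that fails any test falls through to the default decision.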
Forwarding
To forward packets to their destination hosts is the primary task of a router. To forward a packet, a router has to look up a route table to determine its outgoing port. A route table lookup algorithm is used to search for a most specific route prefix. Because of Classless InterDomain Routing (CIDR), forwarding becomes a nontrivial problem and quite a few routing algorithms have been proposed in the literature [29] . Among them, the Radix Tree Routing (RTR) route lookup algorithm is from the public domain BSD Unix [21] distribution that performs lookup based on a radix tree data structure. It is widely used and included in both CommBench [32] and NetBench [18] . We thus incorporate it as our benchmark for forwarding.
In the RTR algorithm, a route lookup is performed on the radix tree, starting from the root node. Each tree node (radix node) indicates which bit of the 32-bit destination IP address should be examined. The value of that bit (1 or 0) determines which child node will be visited next. When a leaf node is reached, the destination IP is ANDed with the route prefix mask to verify the final match. Backtracking is possible if there is no match at the leaf node.
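The bit-by-bit descent can be illustrated with a plain, uncompressed binary trie that performs longest-prefix matching. This is a deliberately simplified stand-in and not the BSD radix tree: it examines every bit instead of skipping to an indicated bit, and it remembers the best match seen on the way down instead of backtracking from a leaf. The class name, the dict-based node layout, and the sample routes are assumptions.

```python
import ipaddress


class TrieRouteTable:
    """Uncompressed binary trie keyed on destination address bits."""

    def __init__(self):
        self.root = {}  # node layout: {"0": child, "1": child, "port": value}

    def insert(self, cidr: str, port: int) -> None:
        net = ipaddress.ip_network(cidr)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["port"] = port

    def lookup(self, addr: str):
        """Return the port of the most specific matching prefix, or None."""
        bits = format(int(ipaddress.ip_address(addr)), "032b")
        node = self.root
        best = node.get("port")
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            best = node.get("port", best)  # a deeper match is more specific
        return best


table = TrieRouteTable()
table.insert("0.0.0.0/0", 0)      # default route
table.insert("10.0.0.0/8", 1)
table.insert("10.1.0.0/16", 2)
print(table.lookup("10.1.2.3"), table.lookup("10.2.0.1"), table.lookup("192.168.1.1"))
```

Every lookup starts at `self.root`, which is why the nodes near the top of the tree are the ones most heavily shared among processors.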
Queuing
Queuing may happen at both the input side and output side of a router. Different queuing policies are proposed to control congestion, guarantee fair sharing of bandwidth, and provide QoS [30] , [10] , [14] . There are two categories of queuing policies: scheduling and dropping. Scheduling policy decides how a number of packet sources, usually queues, can share a single output channel. Dropping policy drops packets when the queue size is beyond certain threshold. We use the following two algorithms in our RouterBench.
DRR (Deficit Round-Robin). DRR is a Deficit Round-Robin fair scheduling algorithm [30] that is commonly used for bandwidth scheduling on network links. DRR assigns a quantum to each queue and serves the queues in a round-robin fashion. A queue cannot send a packet if the packet size is larger than the quantum. After a packet is sent, its size is subtracted from the quantum. The remainder of the quantum is added to the quantum for the next round so that the large packet gets a chance to be sent out later. The DRR algorithm is implemented in one form or another in currently available routers (e.g., Cisco 12000 series).
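A minimal sketch of the scheduling loop follows, using the usual formulation with a per-queue deficit counter that carries unused quantum into the next round. It is not the RouterBench implementation; the function name, the representation of packets as byte sizes, and the sample values are assumptions.

```python
from collections import deque


def drr_schedule(queues, quantum, rounds):
    """queues: list of deques holding packet sizes in bytes.

    Returns the send order as a list of (queue_index, packet_size) pairs.
    """
    deficit = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, queue in enumerate(queues):
            if not queue:
                deficit[i] = 0          # idle queues do not accumulate credit
                continue
            deficit[i] += quantum       # add this round's quantum to the carry-over
            while queue and queue[0] <= deficit[i]:
                size = queue.popleft()
                deficit[i] -= size
                sent.append((i, size))
            if not queue:
                deficit[i] = 0          # drained queue gives up leftover credit
    return sent


# Queue 1 holds a packet larger than one quantum, so it waits one round.
order = drr_schedule([deque([300, 300]), deque([700])], quantum=500, rounds=2)
print(order)
```

In the first round the 700-byte packet exceeds the 500-byte credit and is skipped; in the second round the carried-over credit reaches 1,000 bytes and the packet is sent.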
RED (Random Early Detection). RED is used to drop packets when there is network congestion. A link is considered congested when there are too many packets in the queue serving that link. RED monitors the average queue size of each output queue. Congestion in the link increases the average size of the queues. If the average size of a queue lies between the minimum and maximum thresholds, RED drops an arriving packet with a probability that increases with the average queue size, so that the drops seen by a connection are roughly proportional to its share of the throughput; once the average exceeds the maximum threshold, every arriving packet is dropped. The drops implicitly notify the senders about the congestion. RED was first proposed by Floyd and Jacobson in [10] and later implemented on various platforms. UCLADew [35] is one of the implementations of RED and we port it to build our RED benchmark application.

DRR and RED are usually performed on the LC instead of the FE because FEs are responsible for processing packet headers, while LCs are responsible for queuing management. However, future high-performance routers are likely to implement the QoS activities in the FE. Also, when an FE is physically located in an LC, as in Cisco 12000 routers [7], it has to perform QoS activities.
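The drop decision can be sketched in its commonly described form: an exponentially weighted moving average of the queue length, no drops below a minimum threshold, certain drops above a maximum threshold, and a linearly increasing drop probability in between. This is a simplified illustration rather than the UCLADew port: it omits the count-based spacing of drops and the idle-period correction, and the class name and all parameter values are assumptions.

```python
import random


class RedQueue:
    """Simplified RED drop decision for one output queue."""

    def __init__(self, min_th, max_th, max_p=0.1, weight=0.002, rng=None):
        self.min_th = min_th      # below this average, never drop
        self.max_th = max_th      # at or above this average, always drop
        self.max_p = max_p        # drop probability just below max_th
        self.weight = weight      # EWMA weight for the average queue size
        self.avg = 0.0
        self.rng = rng or random.Random()

    def should_drop(self, queue_len: int) -> bool:
        """Update the average with the current queue length and decide."""
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        if self.avg < self.min_th:
            return False
        if self.avg >= self.max_th:
            return True
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return self.rng.random() < p


# weight=1.0 makes the average track the instantaneous length, for demonstration.
red = RedQueue(min_th=5, max_th=15, weight=1.0)
print(red.should_drop(2), red.should_drop(20))
```

With a realistic small weight the average reacts slowly, so short bursts pass through while sustained queue growth triggers drops.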
Miscellaneous
There are some other tasks that are executed in the critical path of processing a packet; however, it is not appropriate to put them into the above categories. These functions are included in both BBN MGR and Click router configurations. We run a profiling tool on a Click software router and identify that these functions are consuming a nonnegligible amount of time. This miscellaneous category consists of CheckIPHeader, DecTTL, and Fragmentation. 
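Two of these functions can be sketched on a raw IPv4 header: a validity check based on the version, header length, and header checksum, and a TTL decrement. The function names below are hypothetical Python stand-ins, not the Click or RouterBench code, and the TTL sketch recomputes the whole checksum for clarity where a production router would more likely update it incrementally.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement of the one's-complement sum of the 16-bit header words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def check_ip_header(packet: bytes) -> bool:
    """Basic validation: length, version, header length, and checksum."""
    if len(packet) < 20:
        return False
    version, ihl = packet[0] >> 4, packet[0] & 0x0F
    if version != 4 or ihl < 5 or len(packet) < ihl * 4:
        return False
    return ipv4_checksum(packet[: ihl * 4]) == 0  # valid headers sum to zero


def dec_ttl(packet: bytes):
    """Return the packet with TTL decremented, or None if the TTL has expired."""
    if packet[8] <= 1:
        return None
    ihl = (packet[0] & 0x0F) * 4
    header = bytearray(packet[:ihl])
    header[8] -= 1
    header[10:12] = b"\x00\x00"                    # clear the checksum field
    header[10:12] = struct.pack("!H", ipv4_checksum(bytes(header)))
    return bytes(header) + packet[ihl:]


def build_header(ttl: int) -> bytes:
    """Demonstration helper: a 20-byte header with a correct checksum."""
    h = bytearray(struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, ttl, 6, 0,
                              bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])))
    h[10:12] = struct.pack("!H", ipv4_checksum(bytes(h)))
    return bytes(h)


pkt = build_header(64)
print(check_ip_header(pkt), dec_ttl(pkt)[8])
```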
SIMULATION RESULTS AND ANALYSIS
It is desirable to conduct performance evaluation based on a real routing table from a backbone router with a real packet trace from the same site. The only routing table and trace pair that is publicly available is from FUNET [24]. The trace includes near-real IP addresses (the last 8 bits of an IP address are masked to 0) and is acceptable for evaluating the RTR lookup because most route prefixes are shorter than 24 bits. The FUNET routing table has 41K entries and the trace file contains 100K destination IP addresses. The other functions, such as CheckIPHeader, Classifier, DecTTL, and Fragmentation, need other IP header fields besides the destination IP. However, FUNET traces do not contain such information. Therefore, we use traces from NLANR [20] because they have a full packet header up to the TCP layer. On the other hand, NLANR traces have sanitized IP addresses that make them unsuitable for routing table lookup.

Fig. 7 depicts the per-packet execution time of each RouterBench application in a uniprocessor environment, where the processor has a one-level cache. The cache has 32-byte cache blocks, a 32KB cache size, and is two-way set-associative. Cache latency and memory access latency are assumed to be two cycles and 30 cycles, respectively. The experimental results show that classification and routing table lookup take up to 56 percent of the total processing time (2,186 cycles). CheckIPHeader, DecIPTTL, Fragmentation, RED, and DRR share the other 44 percent. Classification and route table lookup have comparable execution time complexity. RTR route table lookup takes the most time, 617 cycles per packet, and DecIPTTL takes the least, 51 cycles. The total processing time for a packet (excluding queuing time) is 2,186 cycles. Assuming memory access cycles do not change with an increase in CPU frequency, the above result corresponds to a maximal throughput of 91.5K packets/sec with a 200MHz CPU or 457K packets/sec with a 1GHz CPU.
In reality, as we move to faster CPUs, the number of cycles for each memory access will increase proportionately. As a result, the packet processing time will increase and throughput will decrease. In any event, the throughput obtained in a uniprocessor is far below the multigigabit link requirement (multimillion packets/sec) and justifies the need for a multiprocessor architecture in routers. Hence, in the following sections, we show how multiprocessor architectures can improve the performance of packet header processing applications.
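The throughput figures follow directly from the per-packet cycle count. The short sketch below reproduces the arithmetic under the same simplifying assumption that the cycle count does not change with clock frequency; the function name is illustrative.

```python
CYCLES_PER_PACKET = 2186  # total per-packet processing cost reported above


def max_throughput_pps(cpu_hz: float, cycles_per_packet: int = CYCLES_PER_PACKET) -> float:
    """Upper bound on packets per second for one processor at the given clock."""
    return cpu_hz / cycles_per_packet


for label, hz in (("200 MHz", 200e6), ("1 GHz", 1e9)):
    print(f"{label}: {max_throughput_pps(hz) / 1e3:.1f}K packets/sec")
```

Both values are well under one million packets per second, the order of magnitude a multigigabit link requires.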
RouterBench Applications Performance
Multiprocessor Router Architecture Performance
We run each benchmark application individually on the SMP and CC-NUMA architectures of our execution-driven simulator to get various performance measurements. Table 1 summarizes the default parameters of our architectures. We adopt packet-level parallel processing in the sense that, when a packet arrives at a line card, it is sent to a forwarding engine for processing in a round-robin manner. All forwarding engines, however, access the same shared memory.
SMP Performance
Fig. 8 plots the execution time of various RouterBench applications on an SMP architecture. Except for RTR, the execution time of all other applications in RouterBench decreases almost by half as the number of processors doubles. RTR lookup speeds up until eight processors, after which its execution time increases. In order to understand the cause of the RTR performance trend on SMP, we further break down the RTR lookup execution time in Fig. 9. The computation time decreases by up to 50 percent when the number of processors doubles, as does the memory stall time. Memory stall time is the time taken at the centralized memory of Fig. 3. The bus contention time is considered separately and increases with the number of processors. This is because the radix tree is located in the centralized memory and every processor has to request and load the radix tree data structure via the bus when a cache miss occurs. Bus contention becomes the performance bottleneck, which leads to the increase of total execution time after a certain number of processors.

Fig. 10 illustrates the cache behavior of the RTR lookup algorithm on SMP. There are few write misses since we are not simulating the route update procedure of the radix tree structure. The read misses come from the fact that the radix tree structure is allocated in the shared memory and every processor has to traverse the tree structure to perform a lookup. When a processor traverses a node in the tree for the first time, a cache miss occurs. The next visit to the same tree node will usually be a cache hit. The cache miss ratio is relatively low (between 3 percent and 4.5 percent) because the temporal locality in packet traces makes the lookup procedure follow the same path along the radix tree. The miss ratio increases as the number of processors increases because arriving packets are distributed to all processors in a round-robin fashion, which reduces the temporal locality in the trace as seen by each processor.
The reason for round-robin distribution is that the LCs do not classify the incoming packets in our current design, so no information is available for directing a flow of packets to the same FE. If more computation power is invested in the LCs, simple classification can be done and processors can be allocated so as to preserve locality between packet trains.
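The difference between round-robin distribution and a locality-preserving alternative can be sketched as follows. This is a minimal illustration, not the dispatch logic of the simulated router: the function names are hypothetical, and hashing the destination address with CRC32 is only one assumed form of the "simple classification" an LC could perform.

```python
import zlib
from itertools import count


def round_robin_dispatcher(num_fes):
    """Return a dispatch function that assigns packets to FEs in strict rotation."""
    counter = count()

    def dispatch(dst_addr):
        # The destination is ignored: the LC performs no classification.
        return next(counter) % num_fes

    return dispatch


def flow_hash_dispatcher(num_fes):
    """Return a dispatch function that maps each destination to a fixed FE.

    Assumption: the LC can afford one hash over the 32-bit destination
    address. Packets to the same destination always reach the same FE,
    so that FE's cache keeps seeing the same radix tree path.
    """

    def dispatch(dst_addr):
        key = dst_addr.to_bytes(4, "big")
        return zlib.crc32(key) % num_fes

    return dispatch


# A short packet train: three packets to 10.0.0.1, one to 10.0.0.2.
packets = [0x0A000001, 0x0A000002, 0x0A000001, 0x0A000001]
rr = round_robin_dispatcher(4)
fh = flow_hash_dispatcher(4)
rr_assignment = [rr(p) for p in packets]  # the train is spread over FEs
fh_assignment = [fh(p) for p in packets]  # the train stays on one FE
```

With round-robin, the three packets to the same destination land on three different FEs, so each FE misses on the same tree path once. With the hash-based variant they share one FE, at the cost of possible load imbalance when a few destinations dominate the trace.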
CC-NUMA Architecture Performance
A CC-NUMA architecture overcomes the shared bus bottleneck by providing a crossbar network and by distributing the shared memory among different processors. Fig. 11 plots the execution time of the same RouterBench applications on the CC-NUMA architecture. It shows that all RouterBench applications gain linear speedup as the number of processors doubles. Fig. 12 analyzes the RTR lookup execution time, which consists of four parts: computation, local memory stall, remote memory stall, and crossbar contention. As radix trees are created in shared memory, a node of the tree may reside in any one of the physical memory modules. If a processor accesses a node located in its own memory module, the access is a local memory access. Accesses to nodes in remote memory are accomplished through the crossbar switch, which has a high delay. However, the experimental data show that the crossbar mitigates the remote memory access cost by providing simultaneous accesses to several memory modules. As a result, the performance is much better than that of a bus-based SMP. It is also observed that the crossbar contention time increases until the number of processors equals four, after which it drops. This implies that, as we distribute the radix tree data structure over more processors, read accesses to the radix tree are also spread over all memory modules instead of being directed to a particular one. The crossbar switch provides a separate path when the accesses are to different memory modules, which leads to less contention. Hence, the results show that CC-NUMA has better scalability than SMP for the RTR lookup algorithm.
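The distinction between local and remote accesses, and the way accesses spread out as modules are added, can be illustrated with a small sketch. The placement policy here is an assumption: the text does not say how radix nodes are assigned to memory modules, so the sketch uses simple block interleaving with a hypothetical 64-byte block and one memory module per processor.

```python
def home_module(node_addr, num_modules, block_size=64):
    """Memory module owning the block that contains node_addr.

    Assumption: blocks are interleaved across modules in address order.
    """
    return (node_addr // block_size) % num_modules


def classify_accesses(proc_id, node_addrs, num_modules):
    """Count (local, remote) accesses made by processor proc_id."""
    local = remote = 0
    for addr in node_addrs:
        if home_module(addr, num_modules) == proc_id:
            local += 1
        else:
            remote += 1
    return local, remote


def busiest_module_load(node_addrs, num_modules):
    """Largest number of accesses that land on any single module."""
    loads = [0] * num_modules
    for addr in node_addrs:
        loads[home_module(addr, num_modules)] += 1
    return max(loads)


# Sixteen radix nodes, one per 64-byte block (illustrative addresses only).
addrs = [i * 64 for i in range(16)]
local_remote = classify_accesses(0, addrs, 4)
load_2_modules = busiest_module_load(addrs, 2)
load_8_modules = busiest_module_load(addrs, 8)
```

Under this assumed placement, the share of remote accesses grows with the number of modules, but the load on the busiest module shrinks, which is consistent with the observation that crossbar contention drops once accesses are spread over more modules.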
IMPACT OF ROUTING TABLE UPDATES
In this section, we study the impact of routing table updates on performance of the CC-NUMA multiprocessor router architecture.
Routing tables are updated periodically by routing protocols such as BGP. A route in a routing table may be modified or deleted, or a new route entry may be added to the table. The frequency of route updates is known to range from hundreds to thousands of updates per second [12], [27]. As future routing tables scale, we conjecture that this number can reach up to 100,000 updates per second, which is used as a test case in the performance study in [33]. In the following experiment, we study the impact of routing table updates at frequencies of 1K, 10K, and 100K updates per second, which correspond to one update per 1M, 100K, and 10K cycles, respectively, for 1 GHz processors.
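The conversion from update frequency to update interval in cycles is a single division, shown here as a short check of the three test cases. The function name is illustrative.

```python
def cycles_per_update(clock_hz, updates_per_second):
    """Number of processor cycles between two consecutive route updates."""
    return clock_hz // updates_per_second


CLOCK_HZ = 1_000_000_000  # 1 GHz processor, as stated in the text

intervals = {
    rate: cycles_per_update(CLOCK_HZ, rate)
    for rate in (1_000, 10_000, 100_000)
}
```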
There are three kinds of route updates: modification, deletion, and addition of a route. When a route entry is modified, its outgoing link number, priority, or other route properties are updated. From the perspective of the radix tree, updating such information does not change the tree structure. Thus, the cost of such updates is not significant. However, when a new route entry is added or an existing one is deleted, radix nodes are allocated or deallocated and pointers are updated or deleted. Such updates to the radix tree have significant costs.
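The structural difference between the three kinds of update can be seen in a small sketch. It uses a plain, uncompressed binary trie rather than the path-compressed radix tree that RTR is based on, so node counts are larger than in a real radix tree; the class and method names are hypothetical. The point is only that a modification touches one existing node, while an addition or deletion allocates or frees nodes and rewrites pointers.

```python
class Node:
    __slots__ = ("children", "route")

    def __init__(self):
        self.children = [None, None]
        self.route = None  # route data if a prefix ends at this node


class BinaryTrie:
    """Uncompressed binary trie over 32-bit prefixes (simplifying assumption)."""

    def __init__(self):
        self.root = Node()
        self.node_count = 1

    @staticmethod
    def _bits(prefix, length):
        # Top `length` bits of a 32-bit prefix, most significant first.
        return [(prefix >> (31 - i)) & 1 for i in range(length)]

    def _walk(self, prefix, length):
        # Return the nodes on the path from the root to the prefix node.
        path = [self.root]
        for b in self._bits(prefix, length):
            nxt = path[-1].children[b]
            if nxt is None:
                raise KeyError("no such route")
            path.append(nxt)
        if path[-1].route is None:
            raise KeyError("no such route")
        return path

    def add(self, prefix, length, route):
        # Addition: may allocate new nodes and set new pointers.
        node = self.root
        for b in self._bits(prefix, length):
            if node.children[b] is None:
                node.children[b] = Node()
                self.node_count += 1
            node = node.children[b]
        node.route = route

    def modify(self, prefix, length, route):
        # Modification: overwrite route data; the tree shape is unchanged.
        self._walk(prefix, length)[-1].route = route

    def delete(self, prefix, length):
        # Deletion: clear the route, then free nodes that became useless.
        path = self._walk(prefix, length)
        bits = self._bits(prefix, length)
        path[-1].route = None
        for depth in range(length, 0, -1):
            node = path[depth]
            if node.route is not None or any(node.children):
                break
            path[depth - 1].children[bits[depth - 1]] = None
            self.node_count -= 1

    def lookup(self, addr):
        # Longest-prefix match for a 32-bit address.
        node, best = self.root, self.root.route
        for i in range(32):
            node = node.children[(addr >> (31 - i)) & 1]
            if node is None:
                break
            if node.route is not None:
                best = node.route
        return best


trie = BinaryTrie()
trie.add(0x0A000000, 8, "link1")      # 10.0.0.0/8
nodes_after_add = trie.node_count
trie.modify(0x0A000000, 8, "link2")   # same nodes, new route data
nodes_after_modify = trie.node_count
trie.add(0x0A010000, 16, "link3")     # 10.1.0.0/16 allocates more nodes
nodes_after_second_add = trie.node_count
hit_specific = trie.lookup(0x0A010203)
trie.delete(0x0A010000, 16)           # frees the nodes again
nodes_after_delete = trie.node_count
hit_fallback = trie.lookup(0x0A010203)
```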
It would be desirable to use a BGP update trace from the same FUNET site, because we use the FUNET routing table and packet trace pair in the simulation experiments. However, such an update trace is not accessible. Although BGP route update traces are publicly available, they are not suitable for our simulation experiment because BGP updates are highly site-dependent. Hence, we construct a synthetic route update trace using the FUNET routing table. We take the raw FUNET routing table with 41K entries, scan the table linearly, and extract one route entry out of every 10 entries. Thus, we obtain a new routing table T, which is used for simulating route update messages. We define each entry in T as three consecutive update messages: a route deletion, an addition, and a modification. It is assumed that such an update message sequence repeats for all the entries in T. We use a separate processor to execute the routing table update procedure; all other processors perform route lookups as they do in the case of an IP router [7], [27].
Fig. 13 shows the impact of route updates on the routing table lookup performance of a CC-NUMA architecture. Routing table updates generally degrade the lookup time. This is expected, since route updates increase memory traffic and delay the memory accesses of other processors. Also, the routes stored in each processor's local cache may have to be invalidated by the updating processor, so the next lookup generates a cache miss. The route update frequency clearly affects the amount of degradation in lookup time. As shown in Fig. 13, when the number of processors equals one, 1K updates per second cause an increase of 3.5 percent in the average lookup time (from 648 to 671 cycles), 10K updates per second cause an increase of 15 percent (to 746 cycles), and 100K updates per second cause an increase of 18 percent (to 762 cycles).
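The construction of the synthetic update trace described above can be sketched as follows. Two details are assumptions, since the text does not fix them: the sketch takes the first entry of every group of 10, and it represents a route as a (prefix, link) tuple. The function name and the small stand-in table are hypothetical.

```python
def build_update_trace(routing_table, stride=10):
    """Build table T and the synthetic update trace derived from it.

    T holds one entry out of every `stride` entries of the routing table
    (assumption: the first of each group). Each entry of T expands into
    three consecutive messages: deletion, addition, modification.
    """
    sampled = routing_table[::stride]
    trace = []
    for entry in sampled:
        trace.append(("delete", entry))
        trace.append(("add", entry))
        trace.append(("modify", entry))
    return sampled, trace


# Stand-in for the 41K-entry FUNET table: 41 made-up (prefix, link) routes.
table = [(f"10.{i}.0.0/16", f"link{i % 4}") for i in range(41)]
table_t, update_trace = build_update_trace(table)
```

In the experiment, messages from such a trace would be issued by the updating processor at the chosen update frequency while the remaining processors perform lookups.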
As the number of processors increases, however, this lookup time degradation becomes insignificant. In particular, beyond eight processors the performance degradation is not noticeable. This is because, although route updates affect the latency of one memory module, other modules operate without any conflict. Also, there are several caches that maintain copies of radix nodes, so the memory contention due to updates and cache misses is reduced. When a route update message is received, the updating processor locates the appropriate radix node in the tree structure and then allocates and appends a new node, deallocates a node, or modifies an existing one. These updates invalidate only the particular cache block in the processors that hold a copy of it. As the number of processors increases, each processor tends to keep a different portion of the tree structure in its cache. As long as the invalidated block is not shared by many processors, the overall lookup performance does not degrade. Hence, a CC-NUMA architecture can sustain high lookup performance even in the presence of highly frequent route updates.
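The reason an update disturbs only a few processors can be illustrated with a toy model of directory-based invalidation. This is a deliberately reduced sketch, not the directory protocol used in the simulator: it tracks only a sharer set per block and ignores protocol states, acknowledgments, and write-backs. The class and method names are hypothetical.

```python
class Directory:
    """Toy directory: tracks which processors cache each memory block."""

    def __init__(self):
        self.sharers = {}  # block id -> set of processor ids with a copy

    def read(self, proc, block):
        # A lookup processor loads the block and becomes a sharer.
        self.sharers.setdefault(block, set()).add(proc)

    def write(self, proc, block):
        # The updating processor writes the block; only current sharers
        # of this block receive an invalidation.
        invalidated = self.sharers.get(block, set()) - {proc}
        self.sharers[block] = {proc}
        return invalidated


directory = Directory()
directory.read(1, "node_A")   # processor 1 caches one part of the tree
directory.read(2, "node_B")   # processor 2 caches a different part
directory.read(3, "node_B")
victims = directory.write(0, "node_A")  # processor 0 applies a route update
```

Here the update to `node_A` invalidates the copy in processor 1 only; processors 2 and 3 keep their cached copies of `node_B` and continue to hit on them.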
CONCLUSION
In this paper, we proposed two shared memory multiprocessor architectures, SMP and CC-NUMA, to meet the increasing line speed of future IP routers. Such architectures not only save memory space for storing the routing table, but also have the advantages of general purpose multiprocessor architectures, which include programmability, scalability, and high performance. We developed an execution-driven simulation environment to evaluate the proposed SMP and CC-NUMA router architectures. A snoopy cache coherence protocol was incorporated into the SMP architecture to maintain coherence among caches. The CC-NUMA architecture is designed based on a directory protocol. To quantitatively analyze the performance of such routers, we proposed a benchmark application suite, RouterBench, which consists of key functions on the time-critical path of packet processing. The source code of this benchmark suite is publicly available at [17]. Experimental results showed that classification and route table lookup take up to 56 percent of the total processing time of a packet. CheckIPHeader, DecIPTTL, Fragmentation, RED, and DRR share the other 44 percent.
The SMP and CC-NUMA architectures improve RouterBench performance as the number of processors increases. However, the routing table lookup performance in SMP improves until the shared bus becomes saturated. On the other hand, the CC-NUMA architecture provides excellent scalability for the design of high-performance routers. We showed different components of the execution time and their variation with the number of processors for the RTR lookup algorithm when executed on both SMP and CC-NUMA architectures.
Finally, we studied the impact of routing table updates on the performance of the CC-NUMA multiprocessor router. The route updates degrade route lookup performance due to the cache block invalidation generated by the updating processor. However, this degradation becomes less significant as the number of processors increases. This implies that the CC-NUMA architecture can sustain high lookup performance even at a high frequency of route updates.
Yan Luo received the BE and ME degrees in computer science and engineering from the Huazhong University of Science and Technology in 1996 and 2000, respectively. He was a technical staff member in the R&D Center of Guangdong Nortel Corp. from 1996 to 1997. He is currently a PhD candidate in the Department of Computer Science and Engineering at the University of California, Riverside. His research interests include network processor, router architecture, parallel and distributed processing, and cluster/grid computing.
Laxmi Narayan Bhuyan has been a professor of computer science and engineering at the University of California, Riverside since January 2001. Prior to that, he was a professor of computer science at Texas A&M University (1989-2000) and program director of the Computer System Architecture Program at the US National Science Foundation (1998-2000). He has also worked as a consultant to Intel and HP Labs. His current research interests are in the areas of network processor architecture, Internet routers, and parallel and distributed processing. He has published more than 100 papers in related areas in reputable journals and conference proceedings. His brief biography and recent publications can be found at his web page at http://www.cs.ucr.edu/~bhuyan/. Dr. Bhuyan is a fellow of the IEEE, the ACM, and the AAAS.
Xi Chen received the BE degree in computer science and engineering from Zhejiang University in 2000, the MS degree in computer science from the University of California at Riverside in 2002, and is currently a PhD candidate in computer science at the University of California at Riverside. His research interests include verification of embedded system designs, system-level design methodologies, and distributed computing. He is a student member of the IEEE.
