Abstract
Introduction
Network processors provide flexible support for communications workloads at high performance levels. Designing a network processor can involve the design and optimization of many component devices and subsystems, including: (multi)processors, memory systems, hardware assists, interconnects and I/O systems. Often, there are too many options to consider via detailed system simulation or prototyping alone.
System modeling can be an economical means of evaluating design alternatives; effective system models provide a fast and sufficiently accurate description of system performance. The modeling framework introduced here has been designed to help answer questions such as:
Is a system S sufficiently provisioned to support application W at the target line rate and number of interfaces? If not, where are the bottlenecks? If yes, can S support application W' (that is, application W plus some new task) under the same constraints? Will a given hardware assist improve system performance relative to a software implementation of the same task? How sensitive is system performance to input traffic?
The framework is composed of independent application, system and traffic models which describe router functionality, system resources/organization and packet traffic, respectively. The framework uses the Click Modular router to describe router functionality. Click modules are mapped onto an object-based description of the system hardware and are profiled and simulated to determine maximum packet flow through the system and aggregate resource utilization for a given traffic model. This paper presents several modeling examples of uniprocessor and multiprocessor systems executing IPv4 routing and IP security (IPSec) virtual-private network (VPN) encryption/decryption. Model-based performance estimates are compared to the measured performance of the real systems being modeled; the estimates are found to be accurate within 10%. The framework emphasizes ease of use and utility, and permits users to begin full system analysis of existing or novel applications and systems very quickly. Framework users can include those writing router software, router system architects or network processor architects.
The remainder of the paper is structured as follows. Section 2 introduces the modeling framework, and describes its components, operation and implementation. Examples of system models executing an IP router application are presented and analyzed in Section 3. Section 4 uses an IPSec VPN router extension to demonstrate application modeling within the framework. Likewise, section 5 demonstrates traffic modeling for the systems and applications from preceding sections. Section 6 considers future work . The paper concludes with a discussion in section 7.
Framework Description
This section describes the framework component models, the Click modular router, and the overall operation of the framework. As described below, we use Click modular router [9] configurations to describe applications in terms of modular program elements and the possible packet flow between them. The traffic model dictates the packet flow between elements. The system model describes system resources in a manner compatible with the work description admitted by the application model. Our general approach is to approximate application characteristics statistically, based on instruction profiling when considering a software implementation of a task, and to use those approximations to form system resource usage and contention estimates.
We begin with a high-level description of the models.
Application -A modular, executable specification of router functionality described in terms of program elements and the packet flow between them. Examples of program elements for an Ethernet IPv4 router include: IP Fragmentation, IP address lookup, and packet classification. System -A description of the processing devices, memories, and interconnects in the system being modeled. For instance, one or more processors could be connected to SDRAM via a high-speed memory bus. All elements of the application model are mapped to system components. Traffic -A description of the type of traffic offered to the system. Traffic characteristics influence A) which paths (sequence of elements) are followed in the application model and B) the resources used within each element along the path.
In addition to these models, the user provides a mapping of the application elements onto the system; this mapping describes where each element gets executed in the system and how the packets flow between devices when they move from one program element to another.
The Click Modular Router
This section describes the Click modular router. A good introduction to Click is found in [9] ; Kohler's thesis [8] describes the system in greater detail. Click is fully functioning software built to run on real systems (primarily x86 Linux systems). When run in the Linux kernel, Click handles all networking tasks; Linux networking code is avoided completely. This is of great benefit; Click can forward packets around 5 times faster than Linux on the same system due to careful management of I/O devices [9] .
Click describes router functionality with a directed graph of modules called elements; such a graph is referred to as a Click configuration. A connection between two elements indicates packet flow. Upon startup, Click is given a configuration which describes router functionality. An example click configuration implementing an Ethernet-based IPv4 router is shown in Figure 1 . In this paper, configurations are often referred to as routers, even if they do more than route packets. The Click distribution comes with a fairly complete catalog of elements -enough to construct IP routers, firewalls, QoS routers, network address-translation (NAT) routers and IPSec VPNs. Click users can also build their own elements using the Click C++ API. Configurations are described in text files. Elements are declared and connections are made between them in a clear, intuitive syntax.
In this work, Click's modular design and ease of use are used to describe router functionality. We ported Click to the Alpha ISA so that we could profile and simulate the code with various processor parameters using the Simplescalar toolkit [1] . Within the framework, Click configurations function as application specifications and implementations. Using Click fulfills the authors' goal to use real-world software rather than benchmark suites whenever possible; previous experience in evaluating network processor architectures [3] has shown the importance of considering complete system workloads.
In Click, the mechanism used to pass packets between elements depends on the type of output and input ports involved. Ports can use either push or pull processing to deliver packets. With push processing, the source element is the active agent and passes the packet to the receiver. In pull processing, the receiver is active and requests a packet from an element on its input port. Packet flow through a Click configuration has two phases corresponding to these two types of processing. When a FromDevice element is scheduled, it pulls a packet off the inbound packet queue for a specified network device and initiates push processing; push processing generally ends with the packet being placed in a Queue element near a ToDevice element; this can be seen by examing paths through Figure 1 . When a ToDevice element gets scheduled, it initiates pull processing, generally by dequeueing a packet from a Queue. As a general rule, push processing paths are more time consuming that pull processing paths.
Click maintains a worklist of schedulable elements and schedules them in round-robin fashion. FromDevice, PollDevice (the polling version of FromDevice used and described later in this paper) and ToDevice are the only schedulable elements. On a uniprocessor-based system, there is no need for synchronization between elements or for shared resources like free-buffer lists, since there is only one packet active at a time and packets are never pre-empted within an element. However, Click also includes support for shared-memory multiprocessors [2] and thread-safe versions of elements with potentially shared state. Synchronized access and contention for shared resources will be of concern for SMP-based systems, as will be seen later.
Application Model
The application is modeled by profiling the Click configuration and simulating its execution on the target processor over the packet input described in the traffic model. This yields a tremendous amount of information, including: instruction count and type information, cache behavior, branch behavior, ILP and how each contributes to cycles per instruction (CPI). The specific information kept in the model depends on what aspects of the system are of interest.
We profile and simulate the code with the SimpleScalar toolkit [1] , version 3.0, targeted for the Compaq Alpha ISA [4] . In general, the application is simulated over the target traffic, on the target processor(s). This helps determine packet processing performance. However, execution time is only part of the overall packet forwarding time; even this time must later be adjusted to account for contention for shared resources and the cost of any synchronization overhead. If an element has no software implementation (i.e., a hardware assist) and cannot be simulated, the profile and external dependencies must be provided manually by the user.
The examples presented in this paper illustrate only a portion of the flexibility of application analysis provided by the framework. Click configurations, and individual elements, can be profiled in great detail. The studies found in later sections only use the number of instructions, CPI achieved and number of system bus transactions (i.e., L2 misses) to estimate performance.
System Model
The system model consists of devices (e.g., processor cores and memories) and channels (e.g., memory and I/O buses) that connect devices. Each device component has internal operations (e.g., an add instruction), which are completely specified by the device, and external operations (e.g, a load that misses in caches), whose implementations are dependent on target devices (e.g., SDRAM) and the channels that connect them.
System resources and organization are described with objects. There are two base objects: devices (e.g., processor, memory) and channels (bus, or other interconnect). Device and channel objects, once instantiated, are attached to one another to describe the system's organization. Once connected, the system can begin to deduce the cost of the external operations for each device; this is only a portion of the cost, however, since certain aspects of external operation cost such as locality and contention depend on the application and traffic models. A sample system model is shown in Figure 2 . The examples in this study involve only four device types: processors, memories, interfaces and bridges. Other types are certainly possible within the framework, but only these are discussed here.
Many network processor designs include specialized instructions not present in the Alpha ISA. To explore the impact of a specialized instruction, the user can do the following: add a new op code and instruction class to the profiling tool, define a macro (e.g., compiler intrinsic) that uses the new instruction class, modify at least one Click element to use the macro rather than C/C++ statements, and update the system device to include the new instruction. Adding a new instruction to the profiler involves adding new instruction class and op code declarations, as well as adding a C function that implements the new instruction. Note that the application can no longer run natively, so this must be done after the element graph has been profiled with the traffic model to obtain path frequencies.
Traffic Model
The traffic model is a statistical description of packet stream characteristics. These characteristics include: packet size distribution, IP packet type distribution (e.g., UDP or TCP), inter-arrival time distribution, distribution of source addresses, and distribution of destination addresses. This description can be specified directly by the user or measured from a trace of packets.
Application-System Mapping
As mentioned previously, a complete model description also requires a mapping of Click elements onto system devices. Comment(PC):In SMP Click, only schedulable elements are assigned to devices. Since I don't actually use the mapping in the way its described here, should I just fall back to SMP Click's approach? Each element must be assigned to a processing device (usually a processor) for execution and one or more memory devices for data and packet storage. For example, if the system is a PC-based router, then all computation elements in the application are assigned to the host processor for execution and all packets and data are assigned to system RAM.
Given this mapping, we can find a packet's path through the system as it flows through the Click configuration. The mapping indicates where each element stores its packets, thus the packet must flow through the system from memory to memory, according to the elements visited along the packet's path through the Click configuration, across whatever channels connect them.Comment(PC):This is basically what I do to find the average contention and shared data structure L2 misses on SMP Click. Except, it's really Queue movement that I track rather than packet flow, since packets only move between processors while part of a Queue.
While Click describes packet flow, it does not describe how packets are delivered to Click in the first place; the packet delivery mechanism is defined by the operating system (OS) and device drivers. Since OS code is not included in our application model, a manual analysis is required to model the packet arrival mechanism. The mapping must also indicate whether packet arrival is implemented with interrupts, polling, a hybrid interrupt/polling scheme [10] , or special hardware supporting a scheme like active messages [5] . The Click system itself, when running in kernel mode on Linux, uses polling to examine DMA descriptors (a data structure shared by the CPU and the device for moving packets.) This issue is of paramount importance in real systems, and is worthy of the attention it has received in the research literature. All examples presented in this paper will use polling, but any of the other schemes could be modeled within the framework.
Metrics & Performance Estimates
Our primary ambition is to find the forwarding rate, forwarding latency and resource utilization for a given application, system and traffic description. Intuitively, this means finding the time and resources needed to move the packet through the system, from input interface to output interface. This involves three phases:
1. moving the packet from the input interface to memory (IM) 2. processing the packet (PP) 3. moving the packet from memory to the output interface (MI).
Our principal metrics -latency, bandwidth and resource utilization -for these three phases directly determine system performance.
Determining values for these metrics is complicated by the fact that the three phases generally rely on shared resources. For example, suppose the input and output interfaces both reside on the same I/O bus. Then, inbound and outbound packets will be forced to share the I/O bus's total bandwidth.
These phases have a further dependence. While each phase may have a different maximum throughput, the effective throughput of each is limited by that phase's arrival rate. A phase's arrival rate is determined by the throughput of the preceding phase since they act as stages in a pipeline: phase 3 is fed by phase 2, phase 2 by phase 1, and phase 1 by the traffic model.
Framework Operation
Overall forwarding rate (i.e., system throughput) will be limited to the lowest forwarding rate among the phases. Once the lowest of the three maximum forwarding rates is found, it can be used as the arrival rate for all phases to determine whether any resources are over-subscribed at that rate; if no shared resource is over-subscribed at this rate, then that rate is the maximum loss-free forwarding rate (MLFFR). System forwarding latency will be the sum of these individual latencies. Finally, resource utilization will be the aggregate utilization of all phases, since they will generally operate in pipelined fashion.
The steps to determine MLFFR, as well as the latency and resource utilization seen at that rate, are summarized as follows:
1. Find the latency and throughput of each phase assuming no contention for shared resources. 2. Use the lowest resulting throughput as the arrival rate at each phase, and find the offered load on all shared resources. 3. If no shared resources are over-subscribed, then the MLFFR has been found. Otherwise, the current rate is too high and must be adjusted down to account for contention. The adjustment is made by adding arbitration cycles to the phases sharing the congested channel.
The framework assumes that only shared channels can be over-subscribed; this is a reasonable assumption so long as the channels leading to a device saturate before the device itself does. For the buses presented here, oversubscription is said to occur when utilization reaches 80%. If no shared resources in the system are over-subscribed (that is, if offered load is less than capacity), then the system is contention-free. Any over-subscribed resources represent contention and changes must be made to estimates of latency and throughput for any phases using those resources. To resolve contention, slowdown is applied equally to contributors. If capacity represents Ü% of offered load, then all users of the congested resource must reduce their offered load by Ü%. Offered load is reduced by iteratively increasing the number of arbitration cycles needed for each user to master the bus until offered load on the channel drops below the saturation point. Initially, the model includes no arbitration cycles; they are only added to a device's bus usage latency once saturation is reached.
On the whole, a balanced system will have the greatest throughput, since throughput is determined by the lowest performing phase (note that latency is unaffected).
Implementation
The object-based system description and modeling logic present in the framework are implemented in programming language Python. A sample set of model declarations in the using the Python object syntax is shown in Figure 3 . Python is also used as a glue language to permit the modeling logic to interact with Click and Simplescalar. As mentioned previously, Click provides the router specification and implementation, and Simplescalar is used to profile and simulate Click's execution on a target processor.
The framework operates very quickly, in general; on the order of seconds when considering tens of thousands of packets. However, processor simulation can be time consuming for experiments involving great amounts of traffic. This is, in general, not a problem. While an hour can be spent simulating system operation over a trace of tens of millions of packets, there are two reasons why this presents no problem: 1) most analyses can be conducted with far fewer packets (so this situation is unlikely to be useful), and 2) this is simulation time, not development time, and can, when needed, be tolerated.
Bridge
Main Memory 
System Modeling
This section will now illustrate, by way of example, how the framework can be used to model network processing systems. The systems modeled below were chosen because they are the systems used by the Click designers to evaluate Click performance. Hence, the performance estimates yielded by the framework can be compared to the observed perfomance of the actual system. 
Uniprocessor PC
The first example models the PC-based system used in [9] to measure the performance of the Click IP router configuration from Figure 1 . The details of the system being modeled can be seen in Table 1 in row "Uni-PC". The main features are: a 700MHz Pentium III Xeon processor, and a 32-bit wide 33MHz PCI I/O bus. There are three channels in this system: the host, memory and PCI buses. Of these, only the memory and PCI buses are shared.
The organization of the system model used here is shown in Figure 2 . In addition to the single CPU, the system contains 8 interfaces in total: four are used as packet sources and four as destinations. The code declaring models in this experiment is shown in Figure 3 . The traffic used in the Click study was very simple. The packet stream consisted of 64 byte UDP packets generated by each of the sources; each source generated packets with uniformly chosen destinations (from among the 4 destination interfaces). Finally, since there is only one processing device, there are no mapping decisions to be made. The packet delivery mechanism, as mentioned above, is polling.
Polling, with Click, involves: checking the DMA descriptor for notice of a new packet, moving the new packet onto an incoming packet queue for processing, moving a packet from the outgoing packet queue into the DMA buffer, and updating the DMA descriptor to indicate the buffer modifications. Both the DMA descriptor and the DMA buffers are effectively non-cached (since they are accessed by both the processor and the I/O devices), thus each of these steps involve L2 cache misses, for a total of 5 L2 misses per polling event.
These results are shown in Figure 4 . The framework estimates an MLFFR of 322,000 packets per second. This is within 3% of the value of 333,000 reported in [9] . At this rate, system throughput is limited by the packet processing (PP) phase. Table 2 reports the per-phase details. Note that neither bus is over-subscribed at the MLFFR. With the PP phase, the Click configuration executes 2110 instructions at a CPI of 0.8467 for a total of 1787 cycles. With a cycle time 1.43ns (i.e., 700MHz), those 2554ns, plus the 550ns spent on the 5 polling L2 misses per packet, become the bottleneck to system throughput.
In addition to the base system, the experiment presented here includes a number of speculative systems for which real performance numbers are not available. These speculative systems include faster processors and faster PCI buses. By increasing the CPU clock rate to 1GHz, the MLFFR increases to 428,000 packets per second. Figure 4 also shows the performance for a simple configuration that performs no packet processing. This configuration is limited by PCI bandwidth. The PCI bus saturates (achieves 80% utilization) at 450,000 packets per second. This number also agrees well with the measured rate of 452,000 packets per second. Table 2 . Per-phase details for Click IP routing on a uniprocessor.
While the thrust of this paper is to introduce the modeling framework, it is interesting to briefly discuss the bottleneck of this PC-based system. No matter how fast the processor, the forwarding rate will remain at 450K packet per second due to PCI saturation. This PCI saturation is due to Click's packet delivery mechanism; PCI could, in theory, deliver more throughput. The Click designers chose to implement polling to eliminate interrupt overheads. This means that the processor never initiates any DMA transactions; this fact has consequences for PCI performance. The reason is that most PCI bridge implementations (including the 440GX used here) only have good burst transfers when the source of the data initiates the burst. With Click, the network interface masters all transactions. This is desirable for packet delivery into memory; the interface is the source of the data, so the packet can be bursted into memory. For packet retrieval from memory, on the other hand, the interface is not the source and thus must use un-bursted memory read transactions. This fact is reflected in the model's PCI 
SMP PC
Click's operation on SMP-based systems was described and measured in [2] . As was the case for the uniprocessorbased PC, the system modeled here will be based on the system used in the study; model estimates will be compared to the observed results from the real system.
The system organization of the SMP-based PC is very similar to that of the single processor system; Figure 2 is still an accurate depiction. The only significant resource differences are: additional processors on the host bus, slower clock rates on those processors (500 Mhz vs. 700 Mhz), a wider PCI bus (64b vs. 32b), and fewer interfaces (4 vs. 8). System details can be found in Table 1 .
Another difference of note is found in the packet han- dling mechanism. The network interfaces used in the uniprocessor system allowed the Click designers to implement polling and to make all host-device interaction indirect through DMA buffers. The interfaces on the systems used in the SMP study, however, do not have this flexibility. In fact, on every packet send, the host must write a pair of device registers on the interface; this requires an additional PCI transaction for every packet. In the uniprocessor case, the packet processing phase (PP -phase 2) involved no PCI bus traffic (the host just moved the packet into a DMA buffer and updated a descriptor); this is not the case for SMP-based system. Aside from system resources and organization, there are several other important differences between SMP and uniprocessor Click operation that need to be discussed. These include: scheduling elements on CPUs, synchronizing access to shared data (such as mutable elements and the free buffer list) and cache misses due to shared data (such as Queue elements).
The four-interface version of Figure 1 has eight schedulable elements; a PollDevice and ToDevice element for each interface. Each of these must be assigned to a CPU. The Click designers considered both dynamic and static assignments and found static scheduling of elements to CPUs to be more effective than their own adaptive approach (the adaptive approach couldn't account for cache misses and therefore was capable of poor locality); static scheduling will be modeled here.
The experiments in this section will consider two and four CPU SMPs. In the two CPU system, each CPU will be assigned two PollDevice and two ToDevice elements. To increase locality, each CPU is assigned PollDevice and ToDevice elements for different devices (incoming packets rarely depart on the same interface on which they arrived). Likewise in the four CPU system, each CPU will host one PollDevice element and one ToDevice element, each element associated with a different interface.
Any sharing or synchronization between CPUs will involve accesses to memory. So, whenever possible, Click minimizes interactions between CPUs to keep all data cached. To this end, each CPU is given a private worklist from which to schedule elements; so, the static assignments of element to CPU are permanent. Each CPU is also given its own list of packet buffers to manage. Buffers are allocated and returned to this private list, so only when a CPU is overloaded will it incur cache misses to allocate a buffer from another CPU. Some elements have private state that must be protected via synchronized access. Queue elements are an especially important example. It is often the case that paths from multiple interfaces (and therefore multiple CPUs) lead to the same output Queue. As a result, enqueues must be synchronized. In many cases, however, multiple CPUs will not dequeue from the same Queue element, since Queue elements often feed ToDevice elements, which are CPU-specific. Click recognizes when this common case holds and disables Queue dequeue synchronization.
While ToDevice elements are CPU-specific, the Queue elements that feed them are not. In fact, Queues are the only way for packets to move between CPUs. Say a packet arrives at CPU A destined for interface AE, which is assigned to CPU B. CPU A will initiate push processing, which culminates in an enqueue onto a Queue element. When the ToDevice(N) element gets scheduled on CPU B, it will initiate pull processing by dequeuing a packet from element Queue and send the packet on its way. Since the Queue element is used by both CPUs (and any other CPUs in the system), either of these accesses might have resulted in an L2 cache miss. Such cache misses will increase with the number of CPUs in the system. Note that in the uniprocessor case, there are no L2 misses of this sort.
Synchronization and L2 misses due to shared data are important because they represent the cost of having multiple CPUs. These costs tend to increase as CPUs are added to the system. The benefit, of course, is that multiple packets (ideally N, when there are N CPUs) can be processed in parallel. When the costs outweigh the benefits, increasing CPUs in an SMP system is wasteful. Similarly, when the benefits outweight the costs, increasing CPUs is a design win.
For the situation modeled here, the cost is not very high. This is because, when processing a packet, the time spent outside of synchronization regions is much greater than the time spent within synchronized regions. Pushing packets onto Queue elements is the only significant synchronization event; enqueues, even when they involve L2 cache misses, are at least an order of magnitude faster than the rest of the IP routing processing path. So CPUs do not, for a moderate number of CPUs, backup at Queue elements. Furthermore, while the expectation of an L2 cache miss per enqueue goes up as CPUs are added to the system (more CPUs implies more remote users), the maximum is 1 L2 cache miss per enqueue. Again, this represents only a fraction of the total time needed to process the packet. As in the previous section, the experiment presented here includes speculative systems for which no data has been published. The results of the IP routing experiments on SMT systems of 2 and 4 CPUs are shown in Figure 5 . The results are once again within 10% of the original Click study's results. The 2 CPU SMP model estimates the MLFFR at 506K packets per second, as compared to the observed rate of 492K packets per second. The 4 CPU SMP model yields an MLFFR of 543K packets per second, as compared to a measured value of 500K packets per second.
The per-phase details are shown in Table 3 . The 2 CPU case is normal; the MLFFR is equal to the minimum phase forwarding rate. However, this not the case for the 4 CPU case since its PCI bus is over-subscribed at the minimum phase forwarding rate. The MLFFR, then, is the minimum forwarding rate beyond which the PCI bus becomes saturated; this rate is 543K packets per second. As described in Section 2.7, the framework finds the MLFFR by solving for the number bus arbitration cycles needed per packet in order to keep the minimum phase forwarding rate beneath saturation.
While accurate to within 10%, the SMP models have not been as accurate as the uniprocessor models. This is because the SMP case is more complicated, and our model leaves out many details. For instance, the Pentium III processor modeled here uses a specific coherency protocol not captured by our model. Cache misses due to sharing are modeled, but the overhead due to coherency traffic is not. 
IPSec VPN Decryption
This section demonstrates how the framework can be used to explore application performance. The intent is to show that the framework can help determine the speed at which a given system can support a target application.
We begin by adding IPSec VPN [7] encryption and decryption to the IP router Click configuration. The VPN encryption and decryption blocks are shown in Figure 7 and Figure 6 , respectively. The figures indicate how simple it is to add significant functionality to an existing router; adding the same functionality to a Linux or FreeBSD router would be, by no means, trivial.
Note that in these experiments, only the packet processing phase is changed; packet receive and transmit will have the same characteristics for these systems as before. Simulation of the new Click configuration on our target processors shows that the processing of each packet requires 28325 instructions, over 13 times as many instructions compared to the baseline IP router model. Instruction processing, rather than packet traffic on the PCI bus, dominates performance in this case.
Hence, as the results indicate in Figure 8 , the SMP system experiences linear speed up in the number of CPUs. Once again, the model's peformance estimates match the observed values closely. Estimated MLFFRs for the uniprocessor, 2 CPU SMP and 4 CPU SMP are 28K, 46K and 92K, respectively. Observed values for the same systems are: 24K, 47K, 89K -all well within 10% of the model's estimate. The one speculative system included in the experiment, a 1GHz uniprocessor, sees a performance increase in direct proportion to the increase in clock rate; an expected result on such a compute bound workload. Forwarding rate, it is noted, remains well below the system limit.
Packet Size Distributions
All of the previous experiments have used a primitive traffic model: a stream of 64B packets at varying uniformlydistributed arrival rates. Changing the traffic model is a simple matter of specifying the distribution of packet sizes, distribution of packet arrival times and the distribution of packet type (TCP, UDP, errors, etc.). Application profiling and system resource usage are both dependent on packet size. Thus, both steps must be carried out for relevant packet sizes and then weighted according to the distribution (and influence) of those sizes. 
Future Work
This modeling framework is very much a work in progress. The models presented in this paper have served a dual purpose: 1) they helped to introduce Click as Click was meant to be used and 2) they helped to informally validate the modeling approach embodied in the framework. However, there are several other types of network processing systems that have organizations quite different from a PC. These systems tend to use a greater variety of devices, including switches and special-purpose hardware assists, as well as employ distributed memory and hardware queues and buffers. In future work, we plan to model these systems and devise ways to map Click configurations onto more heterogeneous systems.
To this end, the alpha release of the framework will include all the models presented in this work in addition to a model of a highly integrated network processor, such as the Intel IXP12000 [6] or its predecessors. 
Conclusions
This paper has presented a modeling framework for network processing systems. The framework is intended to enable quick analysis of systems that implement complete and detailed router functionality. To illustrate the modeling approach, a number of system, application and traffic models were presented. The framework yielded performance estimates accurate to within 10%, as compared to the measured performance of the systems being modeled.
