Abstract-Modern supercomputers aggregate thousands of microprocessors through a high performance network. Many of these systems place a processor on the network interface controller (NIC) to handle some portion of the MPI processing. This processing involves traversing a linked list and invoking a matching function for each item. Although this task is critical to the performance of the system, microprocessors perform it extremely poorly. Furthermore, the traditional network processor approaches of multicore and multithreading map poorly to the problem because the list is a shared data structure. While match processing can be implemented directly in hardware, hardware implementations can be extremely inflexible and lead to extremely high risk. This paper presents a novel, programmable architecture for a processor to handle the matching function. The matching engine approaches the performance of a direct hardware implementation while maintaining a high degree of flexibility and programmability. More importantly, it requires a dramatically smaller area than a conventional processor.
I. INTRODUCTION
The dominant programming model for supercomputers is the Message Passing Interface (MPI) [1], [2]. The most commonly used communication functions are the blocking and nonblocking variants of two-sided, point-to-point transfers. In the nonblocking variants (MPI_Isend and MPI_Irecv), the sender and receiver can post a series of nonblocking operations that become a linked list of operations. These two-sided operations require MPI matching of the MPI envelope at the receiver to resolve incoming messages to matching receives; thus, the posted receive list is traversed for each new message. The matching operation is a computationally complex step of this traversal and has parallelism that is only bounded by the bandwidth to load an item into the processing core. Because matching can be decoupled from the latency dominated list traversal operation, this paper focuses on an architecture to perform matching quickly.
Network interface controllers (NICs) often include an embedded microprocessor to offload matching operations (e.g. Quadrics [3] and Cray [4] products); however, this approach can lead to significant increases in message latency under some realistic usage scenarios [5]. Given that the match time per item is over 3× the memory access latency (and matching exhibits spatial locality that exploits the cache), this points to limited parallelism within the processor as a significant factor for match time.
* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
While some networks place most processing on the host processor [6] , [7] , we have proposed a dedicated hardware solution for traversing lists and performing the matching operation [8] . While the MPI matching operation can certainly be implemented entirely in hardware, practical considerations make it undesirable to do so. Most systems evolve the lowest level network API, the implementation of that API, and even aspects of the MPI header format over the lifetime of the system. Thus, it is desirable to have a general purpose programmable design, but the required flexibility is limited and the processor can be heavily customized for the MPI matching problem.
We propose a microcoded engine to process MPI list items. It contains two ALUs fed by two independent register files with the ability to pass data between the ALUs. Both ALUs are capable of operating in a SIMD manner at 2 byte boundaries within an 8 byte word; however, the two ALUs support different types of operations. One supports typical binary operators, while the other is designed to efficiently implement ternary matching to deal with wildcarded matching entries. This enables the microcoded engine to approach the performance of a dedicated hardware solution.
To evaluate this microcoded engine, we compare it to an embedded microprocessor design point (comparable to current practice) and a multithreaded design point (comparable to what is typically used in network processors). We found that the microcoded engine achieved 94% of the performance of a comparable hardware unit (as limited by the local memory bandwidth) when only 10 list items are traversed, and the embedded microprocessor achieved only 34% of this potential, despite having twice the memory bandwidth. Similarly, a 16 core multithreaded design point only achieves 52% of this potential, despite having 4× the memory bandwidth. The remarkable observation here is that through architectural specialization, it is possible to achieve hardware levels of performance in a programmable processor without the area overhead of a conventional processing approach. An extremely conservative estimate places the microcoded match unit at 4.6× smaller than the embedded processor and 3.8× smaller than a single multithreaded core.
In the next section, we present related work. In Section III, we place the work in context by presenting an overview of the matching problem followed by a brief overview of the network interface architecture. Details of the proposed microcoded architecture are then presented in Section IV. Our methodology is explained in Section V followed by results in Section VI. Finally, we present conclusions in Section VII.
II. RELATED WORK
Relatively little work has been devoted to MPI matching. While Quadrics has used a customized processor to perform matching on the network interface for many generations [3] , the newest hardware (the Elan5) increases the number of thread units rather than specializing the processors. Notably, these processors must implement general code; thus, they cannot be particularly specialized to the matching problem. Similarly, the network interface for the Cray XT3 machine [4] implements the Portals [9] programming interface using a general purpose PowerPC 440 embedded processor. However, embedded processors are ill-suited to traversing the posted receive queue.
To address potentially long linked lists, research has considered reducing the search cost by using hash tables [7] , [10] . However, while a hash table can significantly reduce the time needed to find a matching entry, it also increases the time needed to insert an entry into the list. Because posting a receive is often on the application's critical path [11] , the increase in insertion time is prohibitive. The hashing process is also complicated by the need to support wildcard matching and maintain ordering semantics; thus, the approach has largely been abandoned. There is also a significant amount of previous work on using the general processor on the network interface to implement other operations (MPI collectives, for example) efficiently [12] , [13] , [14] . Similarly, these approaches focus on protocol optimizations and efficient data movement operations rather than list traversal.
MPI matching may appear closely related to the more broadly studied field of network intrusion detection (NID). Network intrusion detection typically matches the contents of network packets against a list of exploit signatures, whereas MPI matching must match the MPI envelope information in a packet against the posted receive list. Researchers have studied NID algorithms that operate well on network processors [15] , [16] as well as hardware NID accelerators running in FPGAs [17] . Both of these approaches work by allowing parallel searches through the signature database. Although MPI matching is similar to the string matching done in NID, two key differences prevent these approaches from working for MPI matching. First, the NID signature database changes slowly (hours) and, thus, leverages off-line processing. In MPI matching, however, there is high list turnover (nanoseconds), making it prohibitively expensive to use off-line techniques. Second, unlike NID, MPI matching must maintain strict ordering semantics.
Longest prefix matching is required for an IP router to determine where to route a packet. Longest prefix matching matches the destination with the most complete (i.e. having the fewest number of wildcards) routing rule. A common approach uses network processors backed with ternary content addressable memories (TCAM) to accelerate this matching [18] . Unfortunately, MPI matching cannot be formulated as a longest prefix match due to the ordering constraints. Also, while a TCAM can prioritize based on longest prefix, it is inherently unordered. The TCAM approach can be adapted to support MPI matching by adding ordering into the ternary structure [19] . While this works well for a small number of entries in the posted receive queue, longer queues yield prohibitive hardware requirements and require a linear traversal of those items not in the TCAM structure.
III. SYSTEM CONTEXT
The matching operation fits into a system context defined by the MPI matching problem. Solving the matching problem, however, requires a specific instantiation on a NIC. MPI matching issues and our basic assumptions about the network interface are described here.
A. The MPI Matching Problem
When offloading MPI processing, MPI matching is typically abstracted into a lower level network API. For this work, we will consider the Portals API [9] , [20] , since it is used on Red Storm/Cray XT3 [21] . The matching code required by Portals is shown in Figure 1 .
The outer loop in Figure 1 traverses a linked list that makes up the equivalent of the MPI posted receive queue. At each position, there is a memd structure containing a match entry (me) and memory descriptor (md). The memd is 64 bytes long and has numerous subfields ranging from 16 bits (process ID) to 64 bits (address, match bits). Similar data arrives in the header of the incoming message. Many of the fields in the memd can be wildcarded at the field (e.g. me->pid != ANY) or bit (e.g. me->dont_ignore_bits) level. In addition, the range of operations includes all variants of compares, ternary operations ("don't care bit" masking), and basic arithmetic operations. Finally, virtually all of the comparisons are independent and can yield concurrent operations, if the architecture supports it. More importantly, that concurrency is free if the list item is going to be retrieved anyway.
The code has interesting memory access requirements in terms of both latency and bandwidth. The linked list traversal incurs the memory latency for each list item if the list is not in cache. In contrast, the header is in cache after traversing the first list item. At the same time, in the most common case, the code short-circuits after the first test if the match will fail; thus, a relatively small part of the cache line that is loaded is actually used. Although the loop short-circuits on a failure, with the tests prioritized based on the most common failure conditions, full matching can fail at any point along the path. Thus, a matching unit must be designed to handle all cases.
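The field-level and bit-level wildcards described above can be illustrated with a minimal sketch. The structure layouts, the names (sketch_me, sketch_hdr, SKETCH_ANY) and the wildcard encoding are assumptions made for illustration; they are not the Portals definitions, and the mask polarity (1 = ignore) is likewise assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wildcard value for a 16-bit ID field. */
#define SKETCH_ANY 0xFFFFu

/* Hypothetical, trimmed-down match entry; not the 64-byte Portals memd. */
struct sketch_me {
    uint64_t match_bits;   /* bits the posted receive expects */
    uint64_t ignore_bits;  /* assumed polarity: 1 = "don't care" position */
    uint16_t nid;          /* expected node ID, or SKETCH_ANY */
    uint16_t pid;          /* expected process ID, or SKETCH_ANY */
};

/* Hypothetical view of the fields carried in an incoming header. */
struct sketch_hdr {
    uint64_t match_bits;
    uint16_t src_nid;
    uint16_t src_pid;
};

/* Returns true when the header satisfies the entry. The tests are
 * independent of each other; they are ordered so the most common
 * failure (match bits) short-circuits first. */
static bool sketch_entry_matches(const struct sketch_me *me,
                                 const struct sketch_hdr *hdr)
{
    /* Bit-level wildcard: differences count only where ignore_bits is 0. */
    if (((me->match_bits ^ hdr->match_bits) & ~me->ignore_bits) != 0)
        return false;
    /* Field-level wildcards on the source node and process IDs. */
    if (me->nid != SKETCH_ANY && me->nid != hdr->src_nid)
        return false;
    if (me->pid != SKETCH_ANY && me->pid != hdr->src_pid)
        return false;
    return true;
}
```

Because the three tests do not depend on one another, hardware that already holds the whole list item can evaluate them in the same cycle, which is the concurrency the text refers to.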
B. Network Interface Context
When MPI matching is done on the network interface, it is a receive side problem. It requires inspecting the incoming message headers to determine where the data should be placed and informing a DMA engine. Thus, a construct is needed to traverse the posted receive queue and find a matching entry. Figure 2 shows the example of a receive side NIC architecture used for this work. The dashed box designates the functionality that would typically be served by an embedded microprocessor. It interacts with a FIFO structure to deliver network headers and a DMA to place the data into memory.

    for (current = memd; current != NULL; current = current->next) {
        struct userme *me = &current->me;
        struct usermd *md = &current->md;

        /* Check the match bits, and the NID PID */
        [...]

        /* MD is inactive if usemaxsize() and remain is less than maxsize */
        if (usemaxsize(md) && !allowtruncation(md)) {
            if ((mdlen - *offset) < getmaxsize(md))
                continue;
        }

        /* Determine the length to receive into the MD */
        if (hdr->length <= remain)
            *mlength = hdr->length;
        else if (allowtruncation(md))
            *mlength = remain;
        else
            continue;

        /* Set the offset output and return the match */
        return current;
    }

Figure 1. Matching code required by Portals.
In Figure 2, a list manager and match unit replace the embedded microprocessor. The list manager adds items to and deletes items from the posted receive queue, streams list items to the match unit for matching, and provides information to the host and DMA engine about a matching receive. The match unit's sole responsibility is to compare an incoming header with the items in the posted receive queue.
As in [8], the list manager manages a small cache of list items to cover the local memory latency. This hides memory latency and allows header processing to proceed immediately. When a new header is received, the list manager pulls it out of the header buffer and passes it to the match unit for processing. At the same time, it starts memory requests for the first list item not in the cache. Immediately after the header is sent to the match unit, the list manager starts streaming in the list items, starting with those in the cache. The list manager receives either a "Match Failed" or a "Match Successful" for each list item sent to be matched. When a successful match is found, the list manager completes sending of the current list item, then sends an "end list" command. On a successful match, the match unit also sends an offset and a length for the destination of the message in the target buffer. This information is used by the list manager to create the appropriate DMA commands for sending the message data to the host's memory. The list manager also sends an event to the host letting it know about the received message.
IV. MATCH UNIT ARCHITECTURE
The match unit is much like a general purpose processing engine, but it has been specialized in several ways. Foremost, inputs and outputs arrive through FIFOs, rather than queues in memory. FIFOs provide a simple interface mechanism with other system components. Inputs arrive into two independent register files that feed independent ALU and ternary ALU datapaths. Results from these operations are aggregated through a predicate register file with a predicate combining unit to affect branch behavior. On each cycle, the match unit can: 1) input an item, 2) output or copy an item, 3) perform an ALU operation, 4) perform a ternary operation, 5) perform two predicate merge operations, and 6) resolve a branch.
As a by-product of the extensive concurrency, the highest level of programming for the match unit uses assembly language. The level and specificity of concurrency within the core would be difficult to exploit with a high level language. The assembly language is translated directly into microcode, and the characteristics of the matching code are discussed in Section IV-C.
A. Motivating Objectives
The architecture of the match unit was driven by three main considerations: high throughput, irregular data alignment, and program consistency. Ideally, the match unit would be able to process the data as quickly as it arrives. However, there is a trade-off between circuit complexity and throughput. This trade-off led to an architecture with a small number of computational units operating in parallel and a three stage pipeline. In general, the first cycle reads operands from the appropriate register file, the second cycle does the operation, and the third writes the result to the register file.
Inherent in header and list item processing is the need to process data of varying bit widths packed into native size words (64 bits). This led to specialized functions that combine and reorder data, as well as SIMD-like functionality in the ALU and ternary unit. Finally, given the streaming nature of the input stream, we require strict ordering semantics for the program: all operations in the same instruction word are independent and their results are available for use by the next instruction. This is complicated when the input FIFO becomes empty. If the FIFO is empty and an element of the wide instruction requires input data, the issue of the entire wide instruction is delayed until the FIFO has data. These issues make it necessary to include result forwarding (forwarding paths shown as dashed lines in Figure 3 ) and to require some of the register file write ports to be write before read. It is also necessary to modify the pipelining of the predicate unit (see Section IV-B.4).
B. Match Unit Details
The match unit consists of 4 computational/control units, 4 memories and 2 data transfer units, as seen in Figure 3. The computational units include the arithmetic logic unit (ALU), the ternary unit, the predicate unit and the branch unit. The four memories consist of the microcode memory and three register files: the ALU registers, the ternary registers and the predicate registers. The data transfer units perform data copies: 1) from the input FIFO to the ALU and ternary register files and 2) from the ALU register file to the output FIFO or the ternary register file. Each unit has dedicated ports into the register files and is controlled independently. Each of the 6 units has an instruction slot in the wide instruction word format of the microcode (shown in Figure 4). The bit widths of the major unit instruction fields are shown below the label for each field (the overall instruction word is 164 bits). The minor fields for each instruction are also shown.
1) Register Files:
The ALU register file and the ternary register file are 64 bits wide and have 16 entries with 2 write ports and 3 read ports each. Register 0 is the constant (not writable) zero for the ALU and all 1's for the ternary unit. To comply with the program consistency semantics, the write ports connected to the input FIFO are read before write and the other ports are write before read. The predicate register file contains sixteen 1-bit entries which can be accessed through 7 read ports. The write port structure is more complex: Eight of the registers are directly connected to the ALU and ternary unit to receive results of comparison operations. Register 0 is set to a constant 1, and the remaining 7 registers are writable from the predicate unit.
2) Arithmetic Logic Unit (ALU):
The ALU can perform most binary operations, including addition, subtraction, logical operations and comparisons. Notably absent are the multiply and shift operations. Multiply is not needed in matching operations, and the required shift functionality is provided by the combination of the permute operator and sub-word operations. The permute operator can arbitrarily combine two registers on 8-bit boundaries. This means that each byte of output can be chosen from any of the 8-bytes from either of the two inputs. In addition, each byte can also be set to all zeros. Combined with a SIMD capability, this allows arbitrarily aligned byte level operations. Bit level operations are supported by the ternary unit (Section IV-B.3).
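How the permute operator can stand in for a shifter is easiest to see in a small software model. The sketch below is only a behavioral illustration: the selector encoding (0 to 7 for the first input, 8 to 15 for the second, SEL_ZERO for a zero byte) and the function name are assumptions, since the actual control-field encoding is not given here.

```c
#include <stdint.h>

/* Hypothetical selector value meaning "force this output byte to zero". */
#define SEL_ZERO 16u

/* Behavioral model of a byte permute over two 64-bit inputs.
 * sel[i] chooses output byte i (byte 0 is least significant):
 *   0..7   -> that byte of a
 *   8..15  -> byte (sel[i] - 8) of b
 *   others -> 0x00
 */
static uint64_t sketch_permute(uint64_t a, uint64_t b, const uint8_t sel[8])
{
    uint64_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = 0;
        if (sel[i] < 8)
            byte = (a >> (8 * sel[i])) & 0xFF;
        else if (sel[i] < 16)
            byte = (b >> (8 * (sel[i] - 8))) & 0xFF;
        out |= byte << (8 * i);
    }
    return out;
}
```

With the selectors {2, 3, 4, 5, 6, 7, SEL_ZERO, SEL_ZERO} the model behaves as a logical right shift of a by 16 bits, and with {6, 7, 8, 9, 10, 11, 12, 13} it extracts a 64-bit window that straddles the two inputs, which covers the byte-aligned shifts that matching needs.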
Because headers have several fields ranging from 16 to 64 bits, the ALU includes SIMD operations to process data smaller than 64-bits at arbitrary boundaries. The arithmetic functions are divided into four 16-bit sections that can be aggregated into larger operations. This is controlled by 3 SIMD bits, which tell the unit which internal 16-bit boundaries the operation will cross. Each 16-bit section includes a comparison result output and a set of 4-bits that control which results will be written to the predicate register file. This allows the unit to work on multiple, non-natively aligned fields simultaneously.
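The aggregation of 16-bit sections into wider fields can be sketched for the equality comparison as follows. This is a behavioral model under assumed conventions: bit i of the hypothetical join argument stands for one of the 3 SIMD bits and is taken to mean that sections i and i+1 belong to the same field; the hardware encoding may differ.

```c
#include <stdint.h>

/* Behavioral model of a SIMD equality compare over four 16-bit sections.
 * join bit i (i = 0..2) set: sections i and i+1 form one wider field.
 * Returns a 4-bit mask; bit k is 1 when the whole field containing
 * section k compares equal. */
static unsigned sketch_simd_eq(uint64_t a, uint64_t b, unsigned join)
{
    unsigned lane_eq = 0;
    for (int k = 0; k < 4; k++) {
        uint16_t la = (uint16_t)(a >> (16 * k));
        uint16_t lb = (uint16_t)(b >> (16 * k));
        if (la == lb)
            lane_eq |= 1u << k;
    }

    unsigned result = 0;
    int start = 0;  /* first section of the field being assembled */
    for (int k = 0; k < 4; k++) {
        /* A field ends at section k when the boundary above it is not joined. */
        int last = (k == 3) || !((join >> k) & 1u);
        if (last) {
            unsigned field = 0;
            for (int j = start; j <= k; j++)
                field |= 1u << j;
            if ((lane_eq & field) == field)
                result |= field;
            start = k + 1;
        }
    }
    return result;
}
```

With join = 0 the word is treated as four independent 16-bit fields, with join = 0x2 the middle two sections form one 32-bit field flanked by two 16-bit fields, and with join = 0x7 the word is compared as a single 64-bit value.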
The instruction word for the arithmetic unit also includes a 32-bit immediate field to support constants. While this field doubles as part of the control field for the permute instruction, other instructions use the immediate in either the upper or lower 32-bits of the second operand. The remainder of the bits are passed through unchanged from the operand read from the register file (e.g. register zero is used for a traditional immediate instruction). Creating a 64 bit constant involves replacing the lower 32 bits of a register and then replacing the upper 32 bits in a second instruction; thus, 64 bit constants that are needed frequently should be placed in a register at initialization to save time and infrequently used constants should be built at execution time to save register file space.
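The two-instruction construction of a 64-bit constant can be modeled in a few lines. The function names are hypothetical, and the sketch captures only the operand-formation rule stated above (the immediate replaces one 32-bit half, the other half passes through from the register).

```c
#include <stdint.h>

/* Second operand as described in the text: the 32-bit immediate replaces
 * either the upper or the lower half of the register value, and the other
 * half passes through unchanged. */
static uint64_t sketch_operand_with_imm(uint64_t reg, uint32_t imm, int upper)
{
    if (upper)
        return ((uint64_t)imm << 32) | (reg & 0xFFFFFFFFull);
    return (reg & 0xFFFFFFFF00000000ull) | imm;
}

/* Two-step sequence for a full 64-bit constant. */
static uint64_t sketch_build_const64(uint32_t hi, uint32_t lo)
{
    /* Step 1: start from register 0 (constant zero) and set the low half. */
    uint64_t r = sketch_operand_with_imm(0, lo, 0);
    /* Step 2: keep the low half just written and set the high half. */
    return sketch_operand_with_imm(r, hi, 1);
}
```

The second step depends on the result of the first, which is why a frequently used 64-bit constant is cheaper to keep resident in a register than to rebuild.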
To minimize the number of branch stalls, predication controls whether an ALU operation writes a result to the ALU register file: a result is written when its associated predicate is asserted. Predicate register 0 (always true) can be used for unconditional writes, and "nop" operations use ALU register 0 as their destination. In contrast, comparison operations do not write results to the ALU register file (and are not affected by the predicate), but instead write their results directly to the predicate register file. In addition to writing directly to the predicate register file, comparison results can be combined with the existing predicate value as it is written (i.e. the comparison can overwrite the register or can be anded or ored with the register). This accelerates compound expressions, such as checking to see if a field matches a particular value or is set to accept any value (a == b || a == ANY). Compound functions could be computed in the predicate unit, but the results are available a cycle earlier when done during the register write.
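The three write modes for comparison results can be summarized with a small model. The enum names, the SKETCH_ANY_ID wildcard value and the helper functions are illustrative assumptions; only the overwrite/and/or behavior comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three predicate write modes. */
enum sketch_pred_mode { PRED_SET, PRED_AND, PRED_OR };

/* Value stored in a predicate register when a comparison result arrives. */
static bool sketch_pred_write(bool old, bool cmp, enum sketch_pred_mode mode)
{
    switch (mode) {
    case PRED_AND: return old && cmp;  /* combine with existing value */
    case PRED_OR:  return old || cmp;
    default:       return cmp;         /* PRED_SET: plain overwrite */
    }
}

/* Hypothetical wildcard value for a 16-bit field. */
#define SKETCH_ANY_ID 0xFFFFu

/* (a == b || a == ANY) expressed as two comparisons that target the same
 * predicate register; no separate predicate-unit operation is needed. */
static bool sketch_field_ok(uint16_t a, uint16_t b)
{
    bool p = sketch_pred_write(false, a == b, PRED_SET);
    p = sketch_pred_write(p, a == SKETCH_ANY_ID, PRED_OR);
    return p;
}
```

In this model the second comparison folds its result into the register as it is written, which mirrors why the compound result is available a cycle earlier than if it were merged in the predicate unit.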
3) Ternary Unit: Unlike the ALU, the ternary unit performs one operation (although it can be used for multiple functions). There are three inputs to the ternary unit: match0, match1 and mask. It does an equal comparison under mask -only the bits which are specified in the mask are used in the comparison, all other bits are ignored 1 . Like the ALU, the ternary unit has SIMD functionality.
The ternary unit serves two purposes for the matching functionality. The first is to determine if the match bits in the header match the mask and match bits in the posted receive. In addition, it can do equal comparisons at smaller granularity than 16 bits. This replaces the shifting and masking that would be used to pull out single bit flags from packed control words. This feature also allows for limited boolean functions (wide and functions with optional negation of inputs) to be performed on flags which are found in the same word. For example, if a, b and c are one bit values packed into a single control word, the ternary unit could perform !a && b && !c. As a side benefit, the ternary unit can perform an equal comparison on any size field (e.g. the expected NID can be compared with the NID in the header in either the ALU or the ternary unit). The results go directly to the predicate register file, with the same facility for combining results described in Section IV-B.2.
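A behavioral sketch of the ternary operation and of the packed-flag example follows. The flag positions (FLAG_A, FLAG_B, FLAG_C) and function names are assumptions chosen for illustration; the mask polarity follows the text, where a 1 in the mask marks a bit that takes part in the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/* Equality under mask: only bit positions set in mask are compared. */
static bool sketch_ternary_eq(uint64_t match0, uint64_t match1, uint64_t mask)
{
    return ((match0 ^ match1) & mask) == 0;
}

/* Hypothetical positions of three one-bit flags in a packed control word. */
#define FLAG_A (1ull << 0)
#define FLAG_B (1ull << 1)
#define FLAG_C (1ull << 2)

/* !a && b && !c as a single ternary operation: compare the control word
 * against a pattern with only b set, looking at bits a, b and c only. */
static bool sketch_not_a_and_b_and_not_c(uint64_t control)
{
    return sketch_ternary_eq(control, FLAG_B, FLAG_A | FLAG_B | FLAG_C);
}
```

A conventional processor would need a mask, a compare and possibly shifts to evaluate the same expression; here the pattern and mask can be held in registers, so the test costs one operation.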
The ternary unit also includes a simple permute unit on the input to the register file from the input FIFO. This permute divides the input into four 16-bit fields (corresponding to the four fields in a SIMD operation) and allows each of the 4 output fields to arbitrarily select any of the 4 input fields. This is useful for allowing the ternary unit to pull multiple flags out of a single 16-bit word. If this simplified permute is not sufficient, the more complex permute of the ALU can be used to properly stage the data.
4) Predicate Unit:
The predicate unit combines predicates generated by the arithmetic and ternary units. It consists of the predicate register file and two logic units that can perform arbitrary boolean functions on two predicates. The logic units can read any predicate, but can only write to predicates 1 through 7. There are also three other read ports: one each for the branch unit, the ALU, and the data copy unit. All of the read ports are write before read, allowing predicates to be read and used more quickly. Since predicates are single bit values, the predicate unit uses slightly unusual timing for consistency. The first cycle does nothing, the second cycle reads from the register file and the third computes and writes the result.
5) Branch Unit:
The branch unit controls the flow of the microcode program using a predicate register and an absolute target address. The branch instruction can be either branch on one or branch on zero. Thus, a branch depends on the results of some number of previous ALU and/or ternary unit comparisons that have been accumulated in predicate registers. In the absence of a branch command, the unit retrieves the next instruction word. A branch requires two cycles to resolve, so there are two branch delay slots. All instructions in these two slots will be executed on a taken branch, with the exception of other branch commands; a taken branch will invalidate branches in the next two instructions. These branch delay slots are generally easy to fill as most branches are "early out" cases where the header does not match that list item. The branch unit also controls program flow when the input FIFO is empty. Each instruction is tagged as requiring FIFO input or not. If the input FIFO is empty and the instruction requires input, then the branch unit will prevent the next instruction from issuing. Instructions that have already issued continue to progress through the pipeline, with the exception of branch instructions, which will wait for input to proceed.
6) Data Transfer Units:
Two data transfer units move data around the processor. The input FIFO unit moves data from the input FIFO to either one or both register files. The second data transfer unit copies from the ALU register file into the ternary register file or the output FIFO. These instructions can be controlled by a predicate, but cannot cause the instruction to stall, because the output FIFO can always accept data.
C. Matching Code Characteristics
The match unit is programmed entirely in an assembly language that translates one-to-one into microcode instructions. Part of the design target of the architecture was to minimize the number of instructions (and, therefore, cycles) required to implement the matching code. Table I gives a breakdown of the 44 instructions required to implement the matching operation. Initialization is needed to establish constants that will be used in the primary loop and is only executed at boot time. The primary loop for matching includes one execution of the header code, one or more executions of the list item code (typically shared code plus the fast path), and one execution of the flush code for a total of at least 31 cycles (the extra cycle arises from a pipelining impact). Each additional list item traversed adds at least 8 cycles (assuming the common case). More detail on the individual code segments follows.
The header code must read an 8 item header (one per cycle) from the input FIFO. Because header fields and list item fields are not typically identical, this code must reformat the data to match the list items to improve matching speed. The list item code then reads list items and compares them to the header. This code is split into "fast" and "slow" paths that share a common preamble, where the "slow" path supports a less common Portals semantic on a per list item basis. The "fast" path is optimized to complete a list item as soon as a match fails, but this cannot be less than 8 total cycles, as it takes 8 cycles to read the list item from the FIFO. The most common match failure is the match bits test that occurs first in Figure 1. This failure requires 8 cycles per list item, where a full match will execute 12 instructions in 14 cycles (on a match, only 5 of the 6 "shared" instructions are executed). Finally, in most cases, the list manager will have sent an extra list item to be matched, not knowing that a match will be found. This requires that the flush code execute (8 cycles) to drain the extra list item from the input.

    for (i = 0; i < 12; i++) {
        struct memd *mm;
        struct ptlhdr *start = (ptlhdr *)(offset);
        int before = readSimCycle();
        for (j = 0; j < 64; j++) {
            start = getNextHeader();
            mm = match(list, start, 0, &offset, &length);
        }
        int after = readSimCycle();
        times[i] = after - before;
    }

Figure 5. Core of the benchmark.
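The cycle counts quoted in this section can be combined into a simple lower-bound model for one isolated header. This is only a sketch of the arithmetic: the function and constant names are hypothetical, and it assumes the common case in which every non-matching item fails on the first test and the final item takes the full match path.

```c
/* Cycle counts taken from the text. */
#define SKETCH_HEADER_CYCLES   8u   /* read the 8 item header */
#define SKETCH_FAIL_CYCLES     8u   /* list item that fails the first test */
#define SKETCH_MATCH_CYCLES   14u   /* list item that fully matches */
#define SKETCH_FLUSH_CYCLES    8u   /* drain the extra list item */
#define SKETCH_PIPELINE_EXTRA  1u   /* extra cycle from pipelining */

/* Minimum cycles to process one header when the match is found at the
 * items_traversed-th list item (items_traversed >= 1). */
static unsigned sketch_min_match_cycles(unsigned items_traversed)
{
    return SKETCH_HEADER_CYCLES
         + SKETCH_FAIL_CYCLES * (items_traversed - 1u)
         + SKETCH_MATCH_CYCLES
         + SKETCH_FLUSH_CYCLES
         + SKETCH_PIPELINE_EXTRA;
}
```

For a match on the first item the model gives the 31 cycle minimum stated above, and each additional item traversed adds 8 cycles.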
V. METHODOLOGY
We compare the matching performance of three basic architectures: the customized architecture described in this paper, a typical embedded CPU and a multithreaded CPU. The more traditional processors were simulated using the Structural Simulation Toolkit (SST) [22], and are described below. The microcoded match unit was simulated in a cycle accurate hardware simulator.
A. Benchmark
The performance of the three architectures is compared using a benchmark that times the match under different conditions. The benchmark measures the total match time for 64 matches based on the total length of the posted receives queue, as well as the number of list items actually traversed. The core of the benchmark is shown in Figure 5 with matching described by Figure 1 .
The benchmark code loops over a set of headers that are designed to match 0% (the first entry) through 100% of the way through the list in increments of 10%. In addition, the first iteration is used to prime the instruction cache. The inner loop reads 64 identical headers and is the only part included in the measured time. A large number of headers was chosen to allow for a comparison with the multithreaded unit, which provides no advantage for a single header, but which can greatly improve throughput for a large number of headers. To support the multithreaded processor, two specific changes to the code were made. First, each match was invoked in a new thread. Second, locking was added to the matching function to ensure that two different threads did not simultaneously access one list item and to ensure that one thread did not "pass" another thread and cause out of order matching to occur. Both thread startup and locking are simulated as exceptionally fast to ensure that the comparison to the multithreaded processor is not unfair.
B. Embedded CPU
The simulation model for the embedded CPU is configured to be similar to a PowerPC 440 processor, as shown in Table II. The SimpleScalar [23] processor model simulated a PowerPC instruction set and the code was compiled with gcc 3.3.3 to target Mac-OSX (the loader supported by the simulator; no OS was used).
C. Multithreaded CPU
Table III highlights many of the properties of the multithreaded processor configurations. The number of execution cores was varied for typical (1 and 2 core) configurations as well as an aggressive configuration reflective of a state of the art multithreaded network processor like the Intel IXP2800 [24]. Each execution core had a fixed number of hardware thread contexts, of which up to 64 total contexts could be used by the benchmark 2 . The thread creation time was two cycles. While this is aggressive, two cycles is plausible in this type of application, so it was chosen to make the multithreaded core as competitive as possible. Each core switches between active contexts each cycle. Every 1024 cycles (or, when there are no active contexts), inactive contexts are swapped into the core. Inactive contexts are created when a new thread is spawned and there is not a free hardware context. Swapping in an inactive context takes 10 cycles. Again, this is an aggressive design point, but it maximizes the competitiveness of the multithreaded architecture.
D. Microcoded Match Unit
The match unit is simulated using cycle accurate simulation in JHDL [25] . The match function was hand coded in assembly and translated to microcode for the match unit. The simulation assumes that the list items are read by the list manager as discussed in Section III. Since the list manager is separate from the match code, it is likely that the match unit will receive an extra item after a match is found. The match time includes the time required to flush this extra item and notify the list manager.
VI. RESULTS
We selected four technology points for comparison: a conventional embedded processor, a multi-core, multi-threaded processor, a pure hardware unit with memory bandwidth to match the microcoded match unit, and our proposed match unit architecture. These were selected to represent current practice in NICs supporting MPI, current state of the art NPUs, the "best case" 3 , and our proposed design. Each architecture runs at 500 MHz under the assumption that each design point could approach approximately the same clock rate.
A. Performance
Figures 6 and 7 present results from five configurations at four list lengths: 10, 30, 100, and 300 items. The data is presented in three ways: as an absolute time (a, b), as a time per list item traversed (c, d), and as the time relative to a best case hardware unit (e, f). The best case hardware time assumes that list items can be processed as quickly as they can be fed to the matching unit; thus, it assumes an overhead of 8 cycles to load the header and a delay of 8 cycles per item traversed. In all cases, the X-axis is the percentage of the list that must be traversed to find a match and the Y-axis is a metric of time.
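The best case hardware bound described above can be written as a small cost model. The sketch below is illustrative only: the function and constant names are hypothetical, it models a single incoming header, and it makes no assumption about whether the figures report per-header or aggregate time. The 500 MHz clock is the rate assumed for every design point.

```python
HEADER_LOAD_CYCLES = 8   # overhead to load one incoming header
CYCLES_PER_ITEM = 8      # delay per list item traversed
CLOCK_HZ = 500e6         # clock rate assumed for every design point


def best_case_cycles(items_traversed):
    """Cycles for the idealized hardware unit to resolve one header."""
    return HEADER_LOAD_CYCLES + CYCLES_PER_ITEM * items_traversed


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles * 1e9 / clock_hz
```

For example, traversing 10 items costs `best_case_cycles(10)` = 88 cycles, or 176 ns at 500 MHz.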
The most notable result is that all of the programmable configurations pay a larger fixed overhead than a pure hardware implementation. This is particularly noticeable for the embedded processor and the multithreaded processors, where the memory latency for loading the header imposes a significant overall penalty when a single list item is traversed. As more list items are traversed, this overhead is amortized away. The overhead for the microcoded match unit is only 6 cycles per item matched, with no penalty per item traversed. The 6 cycle penalty is the difference between the time to match an item using the microcoded match unit (14 cycles) and the time to feed an item to be matched into the match unit (8 cycles). For our streaming test, this results in a constant 384 cycle penalty (64 incoming items that match in the list, 6 cycles per item that matches). The embedded processor clearly pays a penalty (relative to the hardware approach) both for each item traversed and for each item matched, because the "zero length" time is larger than the asymptotic time per item and the asymptotic time per item does not approach the hardware limit. The multithreaded units, however, exploit much more concurrency in the multicore cases, so that the asymptotic time per item approaches the lower bound of the hardware time as the list grows long. They do, however, pay a higher penalty per item matched. At short list lengths (10 items), we see nearly a 2× advantage for the microcoded match unit over any of the other configurations, an advantage that grows to 3× if only a portion of the list is traversed. In general, the embedded processor, with its more robust pipeline and out-of-order execution capabilities, has a significant win (14%) over the largest multithreaded configuration when traversing only a few items. However, at 10 items traversed, the concurrency that the multithreaded unit is able to exploit yields a slight advantage for the single multithreaded core and a 34% advantage for 16 multithreaded cores.
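The fixed penalty of the microcoded match unit follows directly from the cycle counts quoted above. As a quick check of that arithmetic (the names below are hypothetical and only restate the numbers in the text):

```python
MATCH_CYCLES = 14      # time to match an item in the microcoded match unit
FEED_CYCLES = 8        # time to feed an item into the match unit
HEADERS_PER_RUN = 64   # incoming headers in the timed inner loop


def per_match_penalty(match_cycles=MATCH_CYCLES, feed_cycles=FEED_CYCLES):
    """Extra cycles paid once per matched item, relative to pure hardware."""
    return match_cycles - feed_cycles


def streaming_penalty(headers=HEADERS_PER_RUN):
    """Total fixed penalty for the streaming test; independent of list length."""
    return headers * per_match_penalty()
```

Here `per_match_penalty()` is 6 cycles and `streaming_penalty()` is 384 cycles. Because the penalty does not depend on the number of items traversed, its share of the total time shrinks as the traversal grows.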
As the list length grows (30 items), the multithreaded configurations begin to distinguish themselves from the embedded processor. Although there are significant impacts from computation time, the memory latency is sufficient to highlight the latency tolerance of the multithreaded cores. With a list of 30 items, however, there is still not sufficient concurrency to dramatically differentiate 16 multithreaded cores from 2. In all of these cases, the microcoded match unit maintains a dramatic advantage over the most aggressive of the multithreaded configurations, an advantage of almost 2×. In the case of the microcoded match unit, the overhead over a pure hardware solution (which begins at 19%) has dropped to only 2.3% when 30 list items are traversed.
As the list grows long, sufficient concurrency becomes available to clearly highlight the advantages of additional multithreaded cores. With 100 list items traversed, the 16 multithreaded core case approaches the performance of the microcoded engine (although using drastically more hardware). By the time the list reaches 300 items, 2 multithreaded cores approach the performance of the microcoded match unit and 16 multithreaded cores exceed it. Indeed, leveraging the fact that it is configured with 4× more memory bandwidth than the microcoded match engine, the 16 multithreaded core case exceeds the "best case" scenario posed by the hardware at lower bandwidth. This is because 16 functional units are employed to work on 64 different incoming messages traversing a single list.
B. Area
A major advantage of the proposed microcoded match unit is its savings in chip area compared to other potential approaches. Using the CACTI tool [26], we estimate that the memories in the match unit take 0.326 mm² in 90 nm technology. For a conservative upper bound estimate of the area required, we double this area (0.652 mm²) to account for the size of the functional units. In addition, although it should be much smaller, we assume that the list management unit needed to support the match unit is as large as the match unit itself, for a total of 1.3 mm². In Table IV, we compare this area to that of an embedded processor (the PowerPC 440 approximated by our simulations [27]) and a multi-core, multithreaded approach. For the multithreaded approach, we leveraged a die photo and area information about the Sun Niagara multithreaded processor found in [28] and then used information available about the IXP2800 [29] as a sanity check 4 . The microcoded match unit has significant area advantages over the single core approaches and dramatic advantages over the only multicore configuration (16 multithreaded cores) that is competitive in performance. The primary area advantage arises from the lack of a cache in the match unit and the list manager. While it could be argued that the cache could be eliminated from the processor designs, these designs have caches by default. If a new processor is being designed, it is better to optimize the architecture to the domain. Furthermore, the performance of the embedded microprocessor and the multithreaded cores depends on the cache. In the case of the embedded microprocessor, the cache is needed to hide the memory latency. In the multicore multithreaded approach, the cache acts as an effective memory bandwidth multiplier.
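The conservative area bound above is two successive doublings of the CACTI memory estimate. A short sketch of that arithmetic, with hypothetical names and only the 0.326 figure taken from the text:

```python
MEMORY_AREA_MM2 = 0.326  # CACTI estimate for the match unit memories at 90 nm


def match_unit_area(memory_area=MEMORY_AREA_MM2):
    """Double the memory area to cover the functional units."""
    return 2 * memory_area


def total_area(memory_area=MEMORY_AREA_MM2):
    """Assume the list manager is as large as the match unit itself."""
    return 2 * match_unit_area(memory_area)
```

With the default input, `match_unit_area()` is about 0.652 and `total_area()` is about 1.304, which the text rounds to 1.3.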
C. FPGA Prototype
The microcoded architecture was implemented on a Virtex4 (-11 speed grade) FPGA to approximate the achievable clock frequency. The prototype operates at 150 MHz on the FPGA, which is produced using a 90 nm CMOS process. Conservative estimates place standard cell ASICs at 5× the clock rate of FPGAs; thus, the design should achieve 750 MHz operation in a 90 nm standard cell ASIC.
VII. CONCLUSIONS
As supercomputer networks attempt to improve latency and message rate, MPI matching must be performed at ever higher rates. Rather than rely on a pure hardware implementation, we present a customized architecture to perform the MPI matching operation: the microcoded match unit. The customizations include the elimination of the traditional memory interface in favor of streaming data to be matched through FIFO based constructs. In addition, the architecture includes two ALUs (one of which can only perform ternary operations) that are both capable of multiple simultaneous sub-word operations to match the irregular data structures typically found in network headers and linked list elements. Finally, the architecture includes a high degree of concurrency that enables six types of operations in each cycle. We compare the proposed architecture to a dedicated hardware implementation and find that the microcoded match unit is within 20% of dedicated hardware when only a single item is traversed and within 6% when 10 list items are traversed. We also compare the microcoded match unit to a conventional embedded processor and a multi-core multithreaded approach. The microcoded match unit is almost 3× faster than either when the list is only a single element long and over 2× as fast when the list is 10 items long. In fact, the multithreaded approach only approaches comparable performance when the list is 100 items long. These results were achieved while using an area that is 4.6× smaller than an embedded microprocessor and 3.8× smaller than a single multithreaded core.
