In this paper, a novel hybrid wave-pipelined bit-pattern associative router is presented.
Introduction
Communication channels between any two nodes, regardless of their physical location within a communication network, can be established by using routers at each node. The purpose of the router is to receive, forward, and deliver messages. The router transfers messages based on a routing algorithm, which is a crucial element of any communication network [1]. Given the reconfiguration requirements of many computing systems, a number of routing algorithms as well as network topologies must be supported. This leads to a need for a very high performance, flexible router to support these requirements.
State University of New York, Binghamton, NY 13902-6000, jabu@binghamton.edu
Maximizing a machine's overall performance and accommodating reconfiguration in a distributed environment require matching the application characteristics with a suitable routing algorithm and topology. The capability of changing the routing algorithm at run time could facilitate smart interconnects and adaptability that allow changes to the machine's topology for different applications.
A router intended to be used for a number of routing algorithms and/or network topologies has to be able to accommodate a number of routing requirements. It is of great importance that the routing algorithm execution time be extremely short. This time dictates how fast a message can advance through the network, since a message cannot be transferred until an output port has been selected by the routing algorithm. Thus, the routing algorithm execution time must be reduced to decrease message delays. Other requirements may include flexibility to accommodate modifications to a network, algorithm and/or topology switching with minimum delay, and programmability to support a large number of routing algorithms and network topologies.
In this paper we present a high performance hybrid wave-pipelined VLSI router that consists of several modules and uses dynamic circuitry within these modules. Wave-pipelining is a design method that enables pipelining in logic without the use of intermediate registers [2]. To realize practical systems using wave-pipelining, accurate system-level and circuit-level timing analysis is required. At the system level, generalized timing constraints for proper clocking and system optimization need to be considered. At the circuit level, performance is determined by the maximum circuit delay difference in propagating signals within a given module of the system. Accurate analysis and strict control of these delays are required in the study of worst case delay paths of circuits. Data-dependent delays also present a problem that needs to be considered in the use of wave-pipelining. In this study, hybrid wave-pipelining, a different approach to minimizing the clock period, is undertaken.
Section 2 provides an overview of the bit-pattern associative router, describing the router operation and the circuits that form the basic blocks of each module. In Section 3 we describe hybrid wave-pipelining and demonstrate how the delay differences are narrowed, resulting in clock period reduction. Some concluding remarks appear in Section 4.
Router Functional Organization
The bit-pattern associative router (BPAR) scheme supports the execution of routing algorithms that are used in most communication switches. The associative router uses a content addressable memory as its bit-pattern associative unit, which enables the destination address alternatives to be considered in parallel. The destination address is presented as the input to the bit-pattern associative unit for comparison with the stored data. The patterns stored in the bit-pattern associative unit allow the router to make a decision about the destination port based on the routing algorithm.
The destination address needs to be compared to the current node's address C ($c_{n-1}, \ldots, c_0$). A routing algorithm compares the bits of the two addresses; some bits are ignored since they do not affect the current routing decision. These "don't care" bits usually occur at different positions for each potential path being considered and need to be customized according to the routing algorithm requirements. To provide the flexibility required to support multiple interconnection networks and routing algorithms, the bit-pattern associative router must be programmable. The results of the comparison are passed to the selection function, which then passes the match output with the highest priority to the port assignment to select the word corresponding to the selected output port. Figure 1 shows the bit-pattern associative router organization.
The BPAR supports three basic operation modes: normal (matching), programming (data loading), and refreshing [3]. In normal mode the previously loaded data is compared to the data presented at the search argument register, and the results are passed to the selection function. The programming mode initializes the memories with destination addresses, while refreshing replenishes the loaded data during normal operation.
Dynamic CAM Cell (DCAM)
The DCAM cell is the basic building block of the bit-pattern associative unit array. It implements a comparison between an input and the ternary digit condition stored in the cell. A single DCAM cell is shown in the figure. The BIT-COMPARE and NBIT-COMPARE lines are shared by the corresponding bit in all words of the matching unit. The design uses a precharged match line to allow fast and simple evaluation of the match condition.
Normal operation (Table 1) involves comparing the input data to the patterns stored in the DCAM and determining whether a match has been found. During the match operation, the input data is presented on the BIT-COMPARE line and its inverse value on the NBIT-COMPARE line. Before the actual matching of these two values is performed, the match line is precharged to "1", which indicates a match condition. The matching of the input data and the stored data is performed by means of an exclusive-OR operation implemented by transistors $T_{c1}$ and $T_{c0}$, whose gates hold the stored value. The match line is discharged through a series transistor pair; a logic 0 on the match line indicates a non-matching condition, while a logic 1 indicates a matching condition.
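The match operation described above can be modeled in software as follows. This is a minimal behavioral sketch, not the circuit itself: the function names and the string encoding of stored ternary digits ('0', '1', and 'X' for "don't care") are assumptions made for illustration.

```python
def dcam_word_matches(stored: str, search: str) -> bool:
    """Model one DCAM word; `stored` uses '0', '1', or 'X' (don't care)."""
    if len(stored) != len(search):
        raise ValueError("stored pattern and search argument differ in width")
    match_line = True  # precharge: logic 1 means "match"
    for stored_digit, input_bit in zip(stored, search):
        # A cell discharges the match line only when it holds a definite
        # value that differs from the input bit (exclusive-OR mismatch).
        if stored_digit != "X" and stored_digit != input_bit:
            match_line = False
    return match_line


def matching_unit(patterns: list[str], search: str) -> list[bool]:
    """Every word sees the same compare lines, so all words are checked."""
    return [dcam_word_matches(pattern, search) for pattern in patterns]


# Hypothetical 3-bit patterns compared against the search argument "110".
print(matching_unit(["1X0", "X11", "XXX"], "110"))  # [True, False, True]
```

In hardware all words evaluate simultaneously; the list comprehension only stands in for that parallelism.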
Selection Function and Port Assignment
The selection function should be designed to ensure the deterministic execution of the routing algorithms. For a given input (i.e., destination address) and a set of patterns stored in the matching unit, the port assignment should always be the same. The priority allows only the highest priority pattern that matches the current input to pass on to the port assignment memory. The encoded priority ($EP_i$) output depends on the match at the current bit-pattern and the priority for this row. If both match and priority are "1", then the encoded priority is true. A priority lookahead scheme has been proposed and implemented; it has been reported in [5]. The port assignment memory (RAM) holds information about the output port that has to be assigned after a bit-pattern that matches the current input is found. This memory is proposed to be implemented using a dynamic approach. The selected row address is passed from the priority encoder, the cells in this row are read, and their data are latched in the port assignment register. The DRAM structure is able to perform an OR function per column when multiple RAM rows are selected, that is, when multiple matches are passed on. The DRAM structure is explained in [3].
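The selection and port assignment steps can be sketched behaviorally as follows. The sketch assumes that row 0 has the highest priority and that a row's priority input is 1 only when no row above it has matched; the function names and the port words are hypothetical, and the priority lookahead circuit of [5] is not modeled.

```python
def encode_priority(matches: list[bool]) -> list[bool]:
    """Return the encoded priority (match AND priority) for each row.

    Assumption: row 0 has the highest priority, and a row keeps its
    priority only while no row above it has reported a match.
    """
    encoded = []
    priority = True
    for match in matches:
        encoded.append(match and priority)
        if match:
            priority = False  # rows below lose priority
    return encoded


def assign_port(selected: list[bool], port_rows: list[int]) -> int:
    """OR together, column by column, the port words of all selected rows."""
    word = 0
    for is_selected, row in zip(selected, port_rows):
        if is_selected:
            word |= row
    return word


# Hypothetical example: rows 1 and 2 match, so only row 1 is passed on.
matches = [False, True, True]
port_rows = [0b0001, 0b0010, 0b0100]
selected = encode_priority(matches)
print(selected)                                 # [False, True, False]
print(bin(assign_port(selected, port_rows)))    # 0b10
# If both matches were passed on, the columns would be ORed together.
print(bin(assign_port(matches, port_rows)))     # 0b110
```

Because the highest-priority match is always the same for a given input and pattern set, the resulting port assignment is deterministic.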
Hybrid Wave-Pipelining
In this section we first outline some of the challenges of wave-pipelining and then describe the timing constraints for the proposed hybrid wave-pipelining approach. Conventional circuit pipelining uses intermediate latches in addition to the input and output registers. Intermediate latches (registers) ensure that when the leading edge of the system clock arrives, data propagates from one stage to the next in a synchronous manner. In such a system there is only one set of data between register stages. Wave-pipelining is an approach aimed at achieving high performance in pipelined digital systems by removing intermediate latches or registers [4]. Idle time of individual logic gates within combinational logic blocks can be minimized using wave-pipelining.
Some of the challenges of designing wave-pipelined systems are the following. Preventing data collision: there must be no data overrun in any circuit block, and the data path must not be overcommitted. Designing dedicated control circuitry: control logic circuits must be designed to operate synchronously with the circuitry of the pipeline stages. Balancing delay paths: delay paths must be controlled or equalized to reduce major differences between maximum and minimum delays [4]. The requirements stated above are not exhaustive but represent some of the most important design issues in wave-pipelining. Timing constraints for the proposed hybrid wave-pipelining approach are derived in the same fashion as in [4]. In many computer/digital systems each stage has a significantly different function and circuitry; wide variations in delays ($D_{min}$ and $D_{max}$) may not be tolerated. A common engineering practice is to consider the worst case delay ($D_{max}$) to ensure that the system runs properly. $D_{max}$ plays a very important role in the system's performance and safe regions of operation. $D_{min}$ (the shortest delay path), on the other hand, gives information about when the results will begin to emerge.
The equations derived for the hybrid wave-pipelining are denoted by the subscript h. To derive the equations that describe the timing constraints for the hybrid wave-pipeline, the temporal/spatial diagram representing this scheme is presented first. The shaded regions of Figure 3 indicate that data is not stable; therefore, register outputs cannot be sampled. The computational cones in this diagram have been arranged to represent each stage within the design. We define some of the variables appearing in the figure. $D_{min}$ and $D_{max}$ are the minimum and maximum propagation delays through the stages, with $T_{clk}$ defining the clock period.
$T_s$ and $T_h$ are the register setup and hold times, $\Delta$ refers to the constructive clock skew, and $\Delta_{clk}$ is the register's worst case uncontrolled clock skew. $D_R$ is the register's propagation delay, $d_{min}(n)$ is the minimum delay encountered in propagating data within a single stage $n$, and $D_{min\_hold}$ is the overall minimum delay of all the stages.
The time it takes for data to emerge at the output register after $N$ clock cycles is $T_L$, and it is given by:
Clocking the earliest data associated with wave $i$ requires the following condition:
where 
The clock period for the hybrid approach is determined to be:
The hybrid wave-pipelined approach allows the clock period to be reduced, and hence performance to be increased. A complete analysis of the hybrid wave-pipelining scheme must include clock cycle minimization, taking into consideration the constraints of the internal nodes of the system and the register constraints. The minimum delay of the hybrid approach can be written to include the stage hold times as follows:
Also, from Figure 3 it can be noticed that the region in which data is not stable, i.e., the difference $D_{max} - D_{min\_hold}$, is short. It can then be safely stated that $D_{max} \approx D_{min\_hold}$. The signal latching time expression becomes:
$$T_L > D_R + D_{min\_hold} + T_s + \Delta_{clk}$$
Hybrid wave-pipelining allows for the reduction of clock cycle time using the delays to propagate data from stage to stage without the use of either intermediate latches or distributed clocks.
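For orientation, the effect of narrowing the delay gap can be illustrated with the conventional wave-pipelining clock bound commonly cited in the literature, $T_{clk} > (D_{max} - D_{min}) + T_s + T_h + 2\Delta_{clk}$. The sketch below uses that conventional bound, not the hybrid expressions of this paper; the function name and all delay values (in picoseconds) are hypothetical.

```python
def min_clock_period_ps(d_max: int, d_min: int, t_setup: int,
                        t_hold: int, skew: int) -> int:
    """Conventional wave-pipelining lower bound on the clock period.

    Implements (D_max - D_min) + T_s + T_h + 2 * skew, with all values
    given as integers in picoseconds. This is the textbook constraint,
    used here as an assumption; it is not the hybrid equation.
    """
    if d_min > d_max:
        raise ValueError("d_min must not exceed d_max")
    return (d_max - d_min) + t_setup + t_hold + 2 * skew


# Hypothetical stage with a wide gap between maximum and minimum delay.
wide_gap = min_clock_period_ps(4000, 1500, 200, 100, 50)
# Same stage after the minimum delay is raised toward the maximum.
narrow_gap = min_clock_period_ps(4000, 3700, 200, 100, 50)
print(wide_gap, narrow_gap)  # 2900 700
```

With the register overhead unchanged, the bound shrinks from 2900 ps to 700 ps purely because the delay difference drops from 2500 ps to 300 ps, which is the effect the delay-narrowing argument relies on.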
Control Signals for The Bit-Pattern Associative Unit
In this study, propagating coherent data waves from one pipeline stage to the next without the use of a distributed system clock is achieved by using the delays in the design. These delays are manipulated to allow data to ripple from one stage to the next without any collision. The difference between the maximum (worst case) delay and the minimum delay is of particular interest in designing the timing scheme of a wave-pipelined bit-pattern associative router [2]. The buffer insertion method for balancing the delays described in [4] is not used; instead, specialized circuitry is designed to provide very precise timing sequences.
The evaluate signal must go to "1" after the data passed to the DCAM for comparison has stabilized on the BIT-COMPARE and NBIT-COMPARE lines. The circuit that generates this signal once all the lines have been set to their appropriate input values is shown in Figure 4(a). Once evaluation completes, the match line outputs are passed to the selection function. Thus, the pass signal must become active immediately following completion of the evaluation process. The circuit used to generate the pass signal is shown in Figure 4(b). The pass signal is designed to mimic the path that a zero on the match line (non-matching condition) takes once evaluation completes. Passing a "0" to the selection function provides the maximum delay that can be experienced in passing the matching unit's outputs to the selection function and, therefore, constitutes the worst case propagation delay for this operation. The circuits in Figures 4(a) and 4(b) have been designed to sense the duration and voltage levels of these signals. All the signals presented to this point depend on the system clock.

Hybrid Wave-Pipelining Signals
The signals generated by the above circuits appear in Figure 5(a) along with the clock. The delays of the first two stages are clearly marked in the figure to show a correlation with the hybrid wave-pipelined approach. In Figure 5(a) the minimum delays ($D_{min}$) and hold times of each stage are shown. We have also included some results from the test of the BPAR chip, fabricated in a 0.5 µm technology. These results appear in Figures 5(b) and 7(b).
Once the outputs of the match lines have been received by the selection function, a decision needs to be made when more than one matching condition has been received. The selection function is designed to propagate a priority status $P_i$ to the entry below it, indicating whether it has registered a match. This priority must be propagated very fast to prevent false starts. Some of the signals of importance in this scheme are shown in Figure 6. The selection function's critical operation occurs when its first and last entries simultaneously receive inputs indicating that matching conditions have been found in the corresponding bit-pattern associative unit entries. The first entry has to propagate a "0" to the last entry in order to prevent generation of a pointer to the DRAM by the last entry. Signals that lessen the possibility of false starts are generated; these are labeled enable. They are designed to ensure that even if a false start occurs, the pointer to the DRAM array is not established. Figure 6 shows the signals used to enable the DRAM pointers for the first and last entries in a setup in which the last entry always finds a match and the first entry finds a match every other clock cycle. The plots in Figure 7(a) are those of the DRAM pointers to its first and last entries. They serve to show that an output port assignment can be read from the DRAM array every one and a half clock cycles.
The computational cones of Figure 8 show the three stages of the bit-pattern associative router and their associated delays. Each stage accommodates the latest data associated with the current wave by allocating additional time to process this data. The bit-pattern associative unit completes worst case operation in 5.1 ns. This time includes the input latch delays, clock skew, and hold and setup times for this stage. The hold time for the selection function is very short, almost equal to the minimum delay of this stage. This short hold time is a direct result of the selection function design, which ensures very fast propagation of the priority status to entries below the current one by means of the priority lookahead. The port assignment has a minimum delay of 0.4 ns and requires 3 ns of hold time. Reduction of the gap between $D_{max}$ and $D_{min}$ at each stage is presented here graphically, with the delays displayed to show how the clock period is made shorter in this design. Issues such as signal generation, uneven delays, and timing have been addressed to synchronize the pipeline.
