We present the design for the two VLSI components used in a processor-to-memory interconnection network for a shared memory system. These components allow the combining of requests that are destined to the same memory location. The design contains both semi-systolic queues and an associative "wait buffer." Transition equations and schematics of the critical pieces of the design are included.
INTRODUCTION
Communication between hundreds or thousands of cooperating processors is the key to massively parallel processing. The NYU Ultracomputer project has studied shared-memory architectures throughout the decade. Successful use of highly parallel, shared-memory MIMD systems requires avoiding serial bottlenecks at all levels, from algorithm design through hardware components, and thus providing scalability as the number of processors grows. The nature of serial bottlenecks in applications, programming environments, operating systems, coordination primitives, system architecture, and within processor-to-memory interconnection networks has been researched extensively by our group. We have developed techniques either to eliminate such bottlenecks or to reduce their impact significantly.
The NYU Ultracomputer network has the topology of an Omega network [13] with a buffered VLSI switch at each node (see Figure 1), N processing elements (PEs) at each input and N memory modules (MMs) at each output. As discussed in [8], a major problem with such a network may be tree saturation due to hot spots (locations that receive a disproportionately large number of accesses) at the memory modules [18]. To facilitate synchronization operations and alleviate this problem, the NYU Ultracomputer combines fetch-and-φ operations (including loads and stores) at the switches.
Fetch-and-φ(X, e), where X is an integer variable and e is an integer expression, returns the (old) value of a memory location X and replaces it with φ(X, e).
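The definition above can be illustrated with a minimal sketch. The function name `fetch_and_phi`, the dictionary standing in for memory, and the three example operators are illustrative assumptions, not part of the design. Treating a load as φ(a, b) = a and a store as φ(a, b) = b is one common way to read "including loads and stores".

```python
from typing import Callable, Dict

Phi = Callable[[int, int], int]


def fetch_and_phi(memory: Dict[str, int], x: str, e: int, phi: Phi) -> int:
    """Return the old value of location x and replace it with phi(old, e)."""
    old = memory[x]
    memory[x] = phi(old, e)
    return old


# Hypothetical operators used only for illustration.
def add(a: int, b: int) -> int:
    return a + b          # fetch-and-add


def keep(a: int, b: int) -> int:
    return a              # behaves like a load: memory is unchanged


def overwrite(a: int, b: int) -> int:
    return b              # behaves like a store that also returns the old value


memory = {"X": 5}
returned_by_add = fetch_and_phi(memory, "X", 3, add)          # returns 5, X becomes 8
returned_by_load = fetch_and_phi(memory, "X", 0, keep)        # returns 8, X stays 8
returned_by_store = fetch_and_phi(memory, "X", 42, overwrite)  # returns 8, X becomes 42
```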
Concurrent fetch-and-φ operations must satisfy the serialization principle. Fetch-and-φ operations simultaneously directed at X cause the final value of X to be the same as the result of executing the operations in some serial order, and each operation returns a value corresponding to an intermediate value of X in a serialized execution. Fetch-and-φ operations can be combined in the network for any associative operator φ [9]. Since combined requests can themselves be combined, any number of concurrent memory references to the same location can be satisfied in the time required for one shared memory access from a single PE. This paper describes the detailed VLSI design of components for a combining switch. These components have been fabricated, tested, and are being used in the network of a 16 × 16 Ultracomputer prototype. Section 2 describes the operating constraints and CMOS implementation of a semi-systolic queue that is useful as a component in many kinds of switches, particularly if a cut-through switching strategy is employed. Section 3 specifies the combining switch architecture actually constructed, including packaging, packet format, operations supported and flow control logic. Section 4 discusses design choices for the arrangement and arbitration of buffers within a switch. Details about the design of the forward path component are given in section 5, followed by a description of the return path component in section 6.
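A minimal sketch of how two requests to the same location might be combined and later decombined, assuming the usual scheme in which the switch forwards φ(e, f) and keeps e so that the second reply can be computed on the return path. The names `combine` and `decombine` and the single-variable memory are hypothetical; this illustrates the serialization principle and is not the switch logic itself.

```python
from typing import Callable, Tuple

Phi = Callable[[int, int], int]


def combine(e: int, f: int, phi: Phi) -> Tuple[int, int]:
    """Merge two requests: forward phi(e, f) and save e for the return path."""
    return phi(e, f), e


def decombine(old_value: int, saved_e: int, phi: Phi) -> Tuple[int, int]:
    """Turn the single reply from memory into replies for both requests."""
    return old_value, phi(old_value, saved_e)


def add(a: int, b: int) -> int:
    return a + b


# Two fetch-and-add requests, with operands 3 and 4, meet in a switch.
memory_x = 10
forwarded, saved = combine(3, 4, add)    # one request carrying 7 goes on

# Memory performs a single fetch-and-add with the combined operand.
old_value = memory_x
memory_x = add(memory_x, forwarded)

reply_first, reply_second = decombine(old_value, saved, add)
# reply_first == 10, reply_second == 13, memory_x == 17
```

These values match a serial execution in which the request with operand 3 runs first (returning 10) and the request with operand 4 runs second (returning 13). Associativity of φ is what makes the combined update equal to the two separate updates.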
SYSTOLIC QUEUE DESIGNS
Systolic queue designs, as described in [10], have advantages even for non-combining switches. Memory-based FIFO designs require that input and output buses be connected to all storage elements; the capacitance on these buses must be charged and discharged for each insertion and deletion. Systolic designs require external connections only to the first slot in the queue (see Figure 2). Items enter at the edge of the IN row and shift right according to certain rules until all items in the OUT row have been passed, then move down and begin shifting left until they exit from the leftmost slot.
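The movement pattern described above can be sketched as a small behavioral model. The text only refers to "certain rules", so the rule used here is an assumption chosen to keep FIFO order: the rightmost item in the IN row moves down when no OUT slot at or to the right of its column is occupied. The class name, method names, and the row-wide shifting are hypothetical and do not reproduce the cell-level logic of [10] or [3].

```python
from typing import List, Optional, Tuple


class SemiSystolicQueue:
    """Two-row queue: items shift right in IN, drop, then shift left in OUT."""

    def __init__(self, columns: int) -> None:
        self.in_row: List[Optional[str]] = [None] * columns
        self.out_row: List[Optional[str]] = [None] * columns

    def cycle(self, new_item: Optional[str] = None,
              output_ready: bool = True) -> Tuple[Optional[str], bool]:
        """Run one cycle; return (item leaving the queue, new_item accepted)."""
        n = len(self.in_row)
        # Horizontal step: the head of OUT exits, OUT shifts left one slot.
        item_out = None
        if output_ready and self.out_row[0] is not None:
            item_out, self.out_row[0] = self.out_row[0], None
        for j in range(n - 1):
            if self.out_row[j] is None and self.out_row[j + 1] is not None:
                self.out_row[j], self.out_row[j + 1] = self.out_row[j + 1], None
        # IN shifts right one slot, then the new item enters at the left edge.
        for j in range(n - 2, -1, -1):
            if self.in_row[j] is not None and self.in_row[j + 1] is None:
                self.in_row[j + 1], self.in_row[j] = self.in_row[j], None
        accepted = new_item is not None and self.in_row[0] is None
        if accepted:
            self.in_row[0] = new_item
        # Vertical step (assumed rule): only the rightmost, oldest IN item may
        # drop, and only once every OUT item lies strictly to its left.
        for j in range(n - 1, -1, -1):
            if self.in_row[j] is not None:
                if all(slot is None for slot in self.out_row[j:]):
                    self.out_row[j], self.in_row[j] = self.in_row[j], None
                break
        return item_out, accepted


def run(items: str, columns: int = 3, blocked_cycles: int = 3,
        max_cycles: int = 50) -> Tuple[List[str], List[str]]:
    """Feed items while the output is blocked for a few cycles, then drain."""
    queue = SemiSystolicQueue(columns)
    pending, accepted_order, outputs = list(items), [], []
    for t in range(max_cycles):
        offer = pending[0] if pending else None
        item_out, accepted = queue.cycle(offer, output_ready=(t >= blocked_cycles))
        if accepted:
            accepted_order.append(pending.pop(0))
        if item_out is not None:
            outputs.append(item_out)
    return accepted_order, outputs


accepted_order, outputs = run("abcde")
```

In this model an item entering an empty queue drops into the OUT row in the same cycle and leaves on the next one, and items that arrive while the output is blocked still leave in arrival order.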
The semi-systolic design we have implemented requires only two [3] . Furthermore, systolic queue designs have the advantages of regular layout and limited connections per cell characteristic of systolic structures in general [14] .
The systolic combining queue design has the further advantage of distributing the comparison logic used to find matching messages in a way that does not add significantly to the cycle time of the switch. Memory-based designs, like those described in [21, 22], require a comparator connecting the input bus to every message slot in the queue. Such a comparison is likely to be quite slow and must be done in series with insertion, since the destination of the message on the input bus will be different depending on the result of the comparison. In a systolic combining queue, matching can be done in parallel with insertion.
Guibas and Liang's design [10] The NORA methodology described in [7] uses this latch to construct two-phase pipelined circuits that use only two basic clocks, $\phi$ and $\bar{\phi}$. In this methodology, phase 1 corresponds to the time when $\phi$ is high and $\bar{\phi}$ is low, and phase 2 corresponds to the time when $\bar{\phi}$ is high and $\phi$ is low. Therefore, in a standard NORA system, a phase 1 latch has $\phi$ on its N transistor and $\bar{\phi}$ on its P transistor; a phase 2 latch has the reverse. Restrictions on the parity of inversions between latches prevent data from flowing through a pair of latches during overlap periods when both $\phi$ and $\bar{\phi}$ have the same value. In [3] we augment the NORA methodology to include qualified clocks and describe how to avoid noise problems with NORA logic when implementing a systolic queue design.
Our basic data cell is shown in Figure 3. Data movement within a row (IN or OUT) occurs on phase 1 (when clock signal $\phi_1$ is high); data transfer from IN to OUT occurs on phase 2 (when clock signal $\phi_2$ is high). Pass transistors control the horizontal and vertical data movement from cell to cell in the data path (Figure 2). The decision as to which direction data will move in a phase is computed in the previous phase; thus data movement and control computations are completely overlapped.
In Figure 3 
Operations Supported
We implement a set of combinable memory requests that have been found useful in the development of parallel algorithms, including fetch-and-φ operations as well as loads and stores. If two messages with the same combinable op code meet in a queue, they will be combined, as described in [9]. See Table I. This section discusses the design choices in the arrangement and arbitration of these buffers for the combining switch.
Arrangement of Buffers
Switches can be classified according to the presence or absence of buffers, and according to the location of the buffers [4]. Figure 5 shows the three types of [15], combining networks with Type B switches still show significantly better performance than non-combining networks.
Arbitration of Buffers
The arbitration rules for the queues paired at an output in Type B switches must not reduce bandwidth or starve one of the queues. The analysis described in [17] assumes that if one queue is empty and the other is not, the non-empty queue will drive the output, and that if both queues are not empty, the queue driving the output will be selected at random.
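The idealized arbitration assumed by that analysis can be written down directly. This is a sketch of the assumption only, with a hypothetical function name and a software random number generator standing in for whatever the analysis assumes; it is not the circuit used in the switch.

```python
import random
from typing import Optional


def select_queue(q0_not_empty: bool, q1_not_empty: bool,
                 rng: Optional[random.Random] = None) -> Optional[int]:
    """Pick which of two paired queues drives the shared output this cycle."""
    chooser = rng if rng is not None else random
    if q0_not_empty and q1_not_empty:
        # Both queues have traffic: choose one at random with equal probability.
        return chooser.choice((0, 1))
    if q0_not_empty:
        return 0
    if q1_not_empty:
        return 1
    return None  # nothing to send


only_first = select_queue(True, False)
only_second = select_queue(False, True)
neither = select_queue(False, False)
contended = select_queue(True, True, random.Random(1))
```

A non-empty queue is never held back while its partner is empty, so no bandwidth is lost, and the random choice keeps either queue from being starved when both have traffic.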
In practice, it is not easy to find a simple and reliable digital CMOS circuit that will select each output randomly but with equal probability. Strict [5] , but the alternative is to increase the latency for all traffic patterns, either by degrading the cycle time or losing the property of having single cycle transmission time when the queue is empty.
Control logic on each component decodes the op code to determine the ALU operation and the length of the message. The output is blocked if a wait buffer data accept signal (WB.DA) or a data accept signal from the next stage (FP_OUT.DA) is low. The QNE (not empty) signal from the paired queue is used to determine which component has priority to drive the output port, as described in section 4.2; out_moving is true if the output is not blocked and the component has priority to drive the port. The DA (data accept) signal sent to the previous stage is derived from the queue full signal and latched on the input end of message (EOM) signal. The EOM signal is derived from the op code, which is used to differentiate two and four packet messages.
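The output control signals described above can be sketched as boolean functions. The blocking condition and the definition of out_moving follow the text. The priority rule is an assumption made for illustration: a component has priority when the paired queue is empty, and otherwise a hypothetical `tie_break` input decides. The actual rule is the one given in section 4.2.

```python
def output_blocked(wb_da: bool, fp_out_da: bool) -> bool:
    """Blocked when either WB.DA or FP_OUT.DA is low."""
    return not (wb_da and fp_out_da)


def has_priority(paired_qne: bool, tie_break: bool) -> bool:
    """Assumed rule: an empty paired queue (QNE low) always yields priority;
    otherwise a hypothetical tie_break bit decides."""
    return (not paired_qne) or tie_break


def out_moving(wb_da: bool, fp_out_da: bool,
               paired_qne: bool, tie_break: bool) -> bool:
    """True if the output is not blocked and this component has priority."""
    return (not output_blocked(wb_da, fp_out_da)) and has_priority(paired_qne, tie_break)


free_and_alone = out_moving(True, True, paired_qne=False, tie_break=False)
wait_buffer_full = out_moving(False, True, paired_qne=False, tie_break=True)
lost_arbitration = out_moving(True, True, paired_qne=True, tie_break=False)
won_arbitration = out_moving(True, True, paired_qne=True, tie_break=True)
```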
Combining Queue
The semi-systolic combining queue in the forward path component adds another row to the queue design described in section 2.1 (see Figure 6). A noncombining queue has a hardware cost similar to that of a shift register; the extra row costs about 50 percent more, while the comparator to recognize matching messages adds only 8 transistors to each cell (Figure 7). The comparator consists of a dynamic XOR gate followed by an inverter whose output can pull down a pre-charged match line which is shared by all the cells used for the match. These gates form a short chain of Domino CMOS logic, as described in [12, 19]. The $\ldots\,\mathrm{IN}, j) \wedge \neg\,\mathit{chute\_transfer}_1(j)) \vee \mathit{valid}_1(\mathrm{CHUTE}, j)$. (See [1] for more details.) The CHH, CHR and CHL signals shown in Figure 8 control the movement of the chute_transfer signal between slices of the queue.
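The comparator with its shared match line behaves like a wired-AND of per-bit equality checks. The following is a behavioral sketch only, assuming that a differing bit is what discharges the pre-charged line; the function names are hypothetical and no circuit timing is modeled.

```python
from typing import Sequence


def cell_pulls_down(stored_bit: int, incoming_bit: int) -> bool:
    """One cell: the XOR of the two bits decides whether the line is discharged."""
    return (stored_bit ^ incoming_bit) == 1


def match_line(stored: Sequence[int], incoming: Sequence[int]) -> bool:
    """Shared match line: starts pre-charged high, any mismatching cell pulls it low."""
    if len(stored) != len(incoming):
        raise ValueError("both fields must have the same number of bits")
    line = True  # pre-charged
    for s, i in zip(stored, incoming):
        if cell_pulls_down(s, i):
            line = False
    return line


same_address = match_line([1, 0, 1, 1], [1, 0, 1, 1])
different_address = match_line([1, 0, 1, 1], [1, 0, 0, 1])
```

Because every cell only ever pulls the line down, adding cells to the match costs a few transistors per cell rather than a wider centralized comparator.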
Combining ALU
Inputs to the ALU come from the second slice of the OUT and CHUTE rows of the combining queue. The next stage output (FP_OUT in Figure 6) gets address packets from the OUT row, passing through the ALU unchanged; data packets on FP_OUT are the output of the ALU, which simply selects the OUT row for messages that did not combine. The wait buffer output (WB) gets message identifiers from both rows for the address packet; data packets on WB come from the OUT row without change. Table II: For each bit, we compute the generate and propagate signals from func0, func1, the CHUTE bit, and the OUT bit using the equations $G = \mathit{func0} \wedge \mathit{func1} \wedge \mathrm{OUT} \wedge \mathrm{CHUTE}$ and $P = (\mathrm{CHUTE} \wedge (\mathit{func0} \vee (\mathit{func1} \wedge \mathrm{OUT}))) \vee (\mathrm{OUT} \wedge (\mathit{func1} \vee (\mathit{func0} \wedge \mathrm{CHUTE})))$. Figure 11.
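The per-bit generate and propagate equations can be evaluated directly. This sketch assumes the operators in the equations are logical AND and OR with no complemented terms; the function name is hypothetical, and no mapping from func0/func1 codes to named ALU operations is implied.

```python
from itertools import product
from typing import Tuple


def generate_propagate(func0: int, func1: int,
                       out_bit: int, chute_bit: int) -> Tuple[int, int]:
    """Return (G, P) for one bit position, with all inputs being 0 or 1."""
    g = func0 & func1 & out_bit & chute_bit
    p = ((chute_bit & (func0 | (func1 & out_bit)))
         | (out_bit & (func1 | (func0 & chute_bit))))
    return g, p


# Tabulate both signals for every combination of inputs.
table = {
    (f0, f1, o, c): generate_propagate(f0, f1, o, c)
    for f0, f1, o, c in product((0, 1), repeat=4)
}
```

Under this reading, with func0 = func1 = 1 the signals reduce to G = OUT AND CHUTE and P = OUT OR CHUTE, the usual carry generate and propagate terms of an adder. With only func0 set, P equals the CHUTE bit; with only func1 set, P equals the OUT bit; with neither set, both signals are 0.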
A typical message slot is shown in Figure 12. Thus the wait buffer never sees a broadcast message.
Since the input to the NOR that pulls down R (see Figure 13) comes from the "a2" section of the recirculating loop, the head packet of a message will always be read out of the wait buffer an odd number of cycles after it was read in. An additional cycle is required to pass through the ALU and enter the queue, giving the total of an even number of cycles from WB to D in Figure 10, as mentioned previously. The input to the NOR could equally well have come from the "b2" section, which would have changed the parity of the cycle total and thus changed the parity of the memory latency, measured in switch cycles.
Decombining ALU
A four-function ALU identical in design to the combining ALU described in section 5.2 is used to genw b2 C,n Read 
CONCLUSION
We have presented the design for two components used in a processor-to-memory interconnection network for a shared memory system. These compo-MOSIS service. These parts were designed using the Magic design tools from UCB. Simulations were performed at multiple levels. We constructed both a behavioral model of the switch and a detailed RTL-level model of our design, both written in C. These were linked to each other and to a switch-level simulator that operated on the circuit extracted from the layout. The verification process confirmed that all three levels of abstraction agreed. We were able to test the agreement of the behavioral and RTL-level models on simulations of the entire 16-PE system and verified the conformance of the RTL-level model with the layout while simulating smaller systems.
To aid the layout verification process, schematics of each cell were entered into a commercial PC-based schematic capture system. We wrote conversion software to produce transistor list files in the same format as those produced from our layout. 16-PE machine using these components and plan to undertake a 2-year effort to measure the characteristics of the resulting system.
