A hardware sorter suitable for VLSI implementation is proposed. It operates in a parallel and pipelined fashion, with the actual sorting time absorbed by the input/output time. A detailed VLSI implementation is described which has a very favorable device count compared to existing static RAM.
Introduction
Sorting is one of the most important operations in data processing. It is estimated that in data processing centers over 25 percent of CPU time is devoted to sorting [1]. Many sequential and parallel sorting algorithms have been proposed and studied [2-13]. Implementation of various sorting algorithms in different hardware structures has also been investigated [3, 4, 6, 11, 13-17].
In this paper, we describe a sorter in which the sorting time is completely overlapped with the input/output time. It has complete parallel operation and processes data in a pipelined fashion. It can sort in both ascending and descending order and can overlap the sorting time of two consecutive input sequences. Because of the regularity of its structure, it is most suitable for VLSI implementation. A detailed implementation is presented to illustrate the basic principle. Further optimization in various aspects of the design is clearly possible.
Principle
The sorter consists basically of a linear array of n/2 cells (we assume n is even), each of which can store two items of the sequence to be sorted (Fig. 1). There is only one connection between a cell and each of its upper and lower neighbor cells. After comparison, one of the two items goes to the next neighbor cell through this connection. Since the data flow is the same for all cells at any given time, this removed item occupies the space newly created in the next cell. (The removed item at the bottom cell goes out of the array in a downward data flow, while the item at the top cell goes out of the array in an upward data flow.) The initial sequence is entered into the sorter one item at each step. After the last item has been entered, the data flow direction is reversed, and the sorted sequence is then extracted as output, also serially. Each step, executed synchronously and simultaneously by all the cells, has two phases:
1. Compare: The two items in each and every cell are compared to each other.
2. Transfer: Subject to the result of the comparison, the desired sorting order (ascending or descending), and the sorting state (input or output), one or the other of the two items is transferred to a neighbor cell and the original cell receives an item from the other neighbor cell.
The sorter not only processes the items of a given sequence in a pipelined fashion, but can also sort different sequences in a pipelined way (provided that some extra hardware is added to the sorter); i.e., while one sorted sequence is being produced as output, a new sequence can be entered as input at the same time from the other end of the sorter. In this way, the I/O time of one sequence is completely absorbed by the sorting time needed by another sequence.
In the output stage, the smaller of the two items in each cell is transferred up. Note that at the end of the input stage (step 6), the smallest item must be in the top cell, and the second smallest must be in either the top or the second cell. In general, the ith smallest item must be in one of the top i cells. This is why the output sequence is sorted.
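The single-sequence behavior described above can be illustrated with a small software simulation. This is only a minimal sketch under stated assumptions, not the hardware design: the function name `sort_top_sequence` is hypothetical, every cell is assumed to be flooded with infinity before the sort, and infinity is assumed to enter at the bottom during the output stage. The inner loops visit the cells one after another, but each cell only uses its contents from before the step, which mimics the simultaneous transfer.

```python
import math

def sort_top_sequence(items):
    """Simulate an ascending sort of a top sequence on n/2 two-item cells."""
    n = len(items)
    if n % 2:
        raise ValueError("the text assumes an even number of items")
    # Assumption: every cell starts flooded with +infinity.
    cells = [[math.inf, math.inf] for _ in range(n // 2)]

    # Input stage: data flows downward, one new item per step.
    for x in items:
        incoming = x                          # item entering the top cell
        for cell in cells:                    # top cell first
            larger, smaller = max(cell), min(cell)   # compare phase
            cell[:] = [smaller, incoming]            # transfer phase
            incoming = larger                 # the larger item moves down
        # The item leaving the bottom cell here is an infinity filler.

    # Output stage: data flows upward, one sorted item per step.
    result = []
    for _ in range(n):
        incoming = math.inf                   # assumed filler entering at the bottom
        for cell in reversed(cells):          # bottom cell first
            larger, smaller = max(cell), min(cell)
            cell[:] = [larger, incoming]
            incoming = smaller                # the smaller item moves up
        result.append(incoming)               # item leaving the top cell
    return result

print(sort_top_sequence([3, 1, 4, 2]))        # [1, 2, 3, 4]
```

The sketch takes n steps for input and n steps for output, with no separate sorting pass in between, which is the sense in which the sorting time is absorbed by the input/output time.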
The same principle applies to the descending sort; we have only to replace "∞" by "−∞", the smallest item, and interchange larger and smaller. It is shown later that it is not necessary to flood the sorter initially with either "∞" or "−∞" (see Fig. 14, shown later). Let A, B be the two items stored in a cell. Let M = max(A, B), m = min(A, B). If we consider the sorting of an isolated sequence, and the sequence is entered through and extracted from the top (top sequence), the specific action in the transfer phase can be summarized as shown in Table 1.
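The roles of M and m for a top sequence can be written as a small decision function. This is a sketch inferred from the surrounding text (larger item moves down on ascending input, smaller item moves up on ascending output, roles interchanged for a descending sort); it is not a transcription of Table 1, and the name `transferred` and the string values of `stage` are hypothetical.

```python
def transferred(a, b, stage, descending=False):
    """Return (moved, kept) for one cell holding items a and b (top sequence).

    stage is "input" (downward flow) or "output" (upward flow).
    """
    big, small = max(a, b), min(a, b)     # M and m in the text
    # Ascending: M moves on input, m moves on output.
    # Descending: the roles of M and m are interchanged.
    move_big = (stage == "input") != descending
    return (big, small) if move_big else (small, big)

print(transferred(3, 5, "input"))                    # (5, 3): M goes down
print(transferred(3, 5, "output"))                   # (3, 5): m goes up
print(transferred(3, 5, "input", descending=True))   # (3, 5): m goes down
```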
If the sequence is entered through and extracted from the bottom port of the sorter (bottom sequence), the situation would be as reflected in Table 2. A fact to be noted is that the roles of M and m are interchanged when we consider a descending as opposed to an ascending sort.
[Figure: sorting example, Step 1 through Step 9. The input stage ends at Step 6; in the output stage, from Step 7 on, smaller items are circled and transferred.]
Table 2 Transfer actions for a single bottom sequence.
Table 3 Transfer actions for overlapping sequences.
In each cell containing two items from the top sequence, we promote the minimum of the two upwards (m↑), while in each cell containing two items from the bottom sequence, we promote the maximum of the two upwards (M↑). This is because the top sequence is in its output mode, and items should go out in ascending order, while the bottom sequence is still in its input mode, and larger items should be pushed up to the top so that later in the output mode the items of this sequence can come out (from the bottom) in ascending order.
For a cell containing an item from each sequence, we want to promote the item from the top sequence up, whatever the relative magnitudes of the two items may be. Thus we attach a flag to each item when it is entered: "0" ("1") for items in the top (bottom) sequence. This flag is considered part of the item, in the comparison as well as in the transfer. Consequently, for a cell containing items from both sequences, we simply promote the minimum of the two up (m↑). Thus, we obtain Table 3 for the transfer actions. The parenthesized entries correspond to the descending sort.
The third column represents the frontier cell between the two sequences. If we include the tag bit as the most significant bit of the items for the purpose of comparison, the item from a bottom sequence with tag bit = 1 is always M and the two sequences are always kept separate. An example of the sorting with the added tag bits is shown in Fig. 3. It should be emphasized once again that the sorter can be used to sort one ascending and one descending sequence with the existing flags but without the complexity of the moving M↑/m↑ boundary.
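The overlapping of a top and a bottom sequence with tag bits can be sketched as follows, for ascending sorts only. This is an illustrative simulation under assumptions, not the circuit: items are modeled as `(tag, value)` tuples so that Python's tuple ordering treats the tag as the most significant part, the kind of each cell is read directly from the tags of its two items, the array is assumed to start filled with a bottom-tagged infinity, and the names `step_up`, `step_down`, and `sort_overlapped` are hypothetical.

```python
import math

TOP, BOTTOM = 0, 1        # tag bits: "0" for top-sequence items, "1" for bottom

def step_up(cells, entering):
    """One upward step; returns the item leaving the top cell."""
    incoming = entering
    for cell in reversed(cells):              # bottom cell first
        a, b = cell
        both_bottom = a[0] == BOTTOM and b[0] == BOTTOM
        # Two bottom items (input mode): M goes up. Otherwise m goes up;
        # at the frontier the tag makes the top-sequence item the minimum.
        go = max(a, b) if both_bottom else min(a, b)
        stay = b if go is a else a
        cell[:] = [stay, incoming]
        incoming = go
    return incoming

def step_down(cells, entering):
    """One downward step; returns the item leaving the bottom cell."""
    incoming = entering
    for cell in cells:                        # top cell first
        a, b = cell
        both_bottom = a[0] == BOTTOM and b[0] == BOTTOM
        # Two bottom items (output mode): m goes down. Otherwise M goes down;
        # at the frontier the tag makes the bottom-sequence item the maximum.
        go = min(a, b) if both_bottom else max(a, b)
        stay = b if go is a else a
        cell[:] = [stay, incoming]
        incoming = go
    return incoming

def sort_overlapped(first_top, bottom, second_top):
    """Sort three equal-length sequences (even length), overlapping I/O."""
    filler = (BOTTOM, math.inf)               # assumed initial contents
    cells = [[filler, filler] for _ in range(len(first_top) // 2)]
    for x in first_top:                       # down: first top sequence enters
        step_down(cells, (TOP, x))
    # up: first top sequence leaves while the bottom sequence enters
    out_first = [step_up(cells, (BOTTOM, y))[1] for y in bottom]
    # down: bottom sequence leaves while the second top sequence enters
    out_bottom = [step_down(cells, (TOP, z))[1] for z in second_top]
    # up: second top sequence leaves; fillers enter at the bottom
    out_second = [step_up(cells, filler)[1] for _ in second_top]
    return out_first, out_bottom, out_second

print(sort_overlapped([3, 1, 4, 2], [7, 5, 8, 6], [2, 9, 1, 5]))
# ([1, 2, 3, 4], [5, 6, 7, 8], [1, 2, 5, 9])
```

Because the tag takes part in the comparison, the sequences stay separate even when the bottom sequence holds smaller values than the top sequence.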
Logic design
Throughout this paper, the cell array of the sorter is represented vertically. Each cell, containing two w-bit items, is a horizontal linear array (row) of w "dibit" cells. The overall topological layout is shown in Fig. 4. In an actual physical layout, a carpenter folding [18] of the cell array might be needed to obtain a more square-shaped chip.
Dibit cell
Each such cell is a compare/steer unit for two bits, one from each of the two items A and B, representing the same bit position. For simplicity, these bits are referred to as bit A and bit B, respectively. Figure 5 is the block diagram of a dibit cell. In downward (upward) movements, after comparison, one of the two bits is shifted out on line a (b) to the next (previous) cell, while a bit from the previous (next) cell is being shifted in on line I (O). In this figure, the terms "input" and "output" refer to a top sequence, and the controls are indicated for an ascending sort. For example, at the input stage, if A < B, then the signal from the comparator is 1, which sets off the selector (SEL), allowing bit B to go down line a and a new bit to come in from I. The case A = B also generates signal 1.
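The comparison signal used by the dibit cells can be modeled in software as a carry rippling through the bit positions of a cell row. The sketch below is a behavioral model only: the ripple direction (least to most significant bit), the initial carry of 1, and the names `compare_chain`, `to_bits`, and `input_step_ascending` are assumptions, and the precharged circuit may differ in polarity and direction.

```python
def to_bits(value, width):
    """Bits of a non-negative integer, most significant bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]

def compare_chain(a_bits, b_bits, carry_in=1):
    """Return 1 if A <= B and 0 if A > B, using one carry per bit position."""
    carry = carry_in                      # 1 also covers the case A = B
    # Ripple from the least significant pair to the most significant pair,
    # so the most significant differing position decides the result.
    for a, b in zip(reversed(a_bits), reversed(b_bits)):
        if a != b:
            carry = 1 if a < b else 0     # equal bits pass the carry through
    return carry

def input_step_ascending(a, b, width):
    """Return (item shifted down, item kept) at the input stage."""
    signal = compare_chain(to_bits(a, width), to_bits(b, width))
    return (b, a) if signal else (a, b)   # signal 1: item B goes down

print(compare_chain(to_bits(5, 4), to_bits(9, 4)))   # 1, since 5 <= 9
print(input_step_ascending(5, 9, 4))                 # (9, 5)
```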
The comparators of the dibit cells in a cell row are chained. A circuit schematic of a dibit cell is shown in Fig. 7. The precharged carry-propagate-type comparator is shown together with the two bit cells. It should be noted that every bit cell of item A (B) in a cell row is controlled by the same four signals C1, C2, C3, and C4 (C1', C2', C3', and C4'), so that all the bits of an item are recycled or shifted at the same time.
Since the comparator circuit in Fig. 7 ... Now that B = 0, Ā = 0, Q = 0, we have C_out = 1 whatever the value of C_in may be.
The other parts (i.e., the bit cells) of Fig. 7 are explained in the next paragraph.
Control
To illustrate, let us consider an ascending sort with a top sequence. Each cell is a two-inverter loop controlled by four gates using a two-nonoverlapping-phase clock. Note that for the global control of the sorter, we need one extra clock phase, in which one can change from the up to the down phase, or from the down to the up phase. It can also be used for initialization. More importantly, it is needed to make sure that a race condition does not occur.
Figure 9 Circuit schematic of the cell control.
Table 4 Transfer actions and corresponding shift register control for overlapping sequences.
... and bottom sequences (Table 3); instead we have a bidirectional double shift-register chain, whose contents move up and down in synchrony with those of the cells and whose output at each level is taken to be SR, as shown in Fig. 10, so that an item of a top (bottom) sequence is always chaperoned by SR = 0 (1). A slight complication occurs at the frontier. The desired transfer action then is shown in Table 4. The reader can easily check from Fig. 10 that the two extra unidirectional shift registers at the two ends are needed to fulfill the requirement of the third column in both ascending and descending sorts.
Timing
As mentioned in the previous section, we use a three-nonoverlapping-phase clock, as shown in Fig. 11. During phase φ1, the transfer bit is read out from cell (i), while the other bit is recycled and the comparison carry chain precharged [Fig. 12(a)]. During φ2, the transfer bit is written ... Specifically, at φ2 = 1, we obtain the value of C (see Fig. 9), and it goes to the control at φ3 = 1. At the next φ1 = 1, bit A and bit B begin their transfer phase while the C line (or the C_out line in the individual bit comparators) is precharged. [See Fig. 12(a).] At φ2 = 1, a new bit is written in (the other has been circulating) and compared with the other bit. Without φ3, we would have the situation in Fig. 12(b). Thus, ...
Initialization
Before the beginning of a sort, instead of initializing all the cells with "∞" or "−∞," it is necessary only to fill in the two border cells with tags, distinct from the tags of the sequence coming in, together with appropriate setting of the comparison shift registers as in Fig. 14. Recall that top (bottom) sequences have tag bit "0" ("1"). So here "∞" ("−∞") represents any number with tag bit "1" ("0"). It can easily be checked from Table 4 and, e.g., Fig. 14(e) that these initializations are indeed adequate.
1. ... comparisons on adjacent row cells must be implemented differently. Indeed, as can be seen in Fig. 6, a bit leaving a cell is in complemented form in comparison to when it was entered. Therefore, to produce the same comparison carry output we need to invert the roles of A and Ā, and also B and B̄, as in Fig. 15. A redrawn global block diagram is shown in Fig. 16, where the alternation between adjacent rows is clearly indicated. Note also that an even number of rows is recommended so that data are entered and extracted in "true" form. (Otherwise either the top or bottom would be in "false," i.e., negated form.)
2. For our implementation (Fig. 6) ...
