Abstract: An algorithm is described that allows log(n) processors to sort n records in just over 2n write cycles, together with suitable hardware to support the algorithm. The algorithm is a parallel version of the straight merge sort. The passes of the merge sort are run overlapped, with each pass supported by a separate processor. The intermediate files of a serial merge sort are replaced by first-in first-out queues. The processors and queues may be implemented in conventional solid logic technology or in bubble technology. A hybrid technology is also appropriate.
Introduction
Most conventional sorting algorithms operate on a single processor and require of order $n \log n$ cycles to sort $n$ records. Examples are the merge sort [1, pp. 163-165] and quicksort [1, pp. 114-116]. There are single processor algorithms with sort times proportional to $n$, but these are only effective in certain circumstances. Address sorting [1, pp. 99-102] requires the spread of sort key values to be known and fairly random. Digital sorting [1, p. 170] is very good for main storage sorts of files with short keys. When secondary storage is used, the digit length has to be small to reduce the number of open files; the key then consists of many digits and the constant of proportionality of the sort is high.
A variety of multiple processor sorts exists, most of which require a very large number of processors, proportional to $n$ or more. These are the network sorts [1, pp. 220-243], in particular Batcher's merge exchange sort [1, pp. 111-114], Thompson and Kung's mesh sorts [2], and Chen's parallel bubble sort [3]. Some of these sorts are very fast, but all require very special hardware and are impracticable for large files with current technology.
Even proposed a sort using $\lceil \log_2 n \rceil$ processors and $4 \lceil \log_2 n \rceil$ tape units to sort in $3 \cdot 2^{\lceil \log_2 n \rceil}$ write cycles [4]. This sort is made very complicated by the necessity of rewinding tapes before they can be read.
We present a sort that is similar to Even's. It uses more sophisticated hardware, which makes it both faster and simpler. The basic algorithm permits $\lceil \log_2 n \rceil + 1$ processors to sort $n$ records in $2n + \log_2 n - 1$ write cycles.
This requires the storage of $2 \log_2 n$ intermediate queues of variable length and maximum total length $n$ records.
These can be implemented using conventional main storage or shift register (e.g., bubble) storage. Our queues differ from Even's tapes in that they can be read before they have been fully written, and no rewind is needed. There are variations on our basic algorithm requiring fewer resources.
The proposed sort is suitable for use when several processors are available, but not order n or more. Very simple processors, which are only required to do a merge, can be used. Our sort is faster for sorting general files than single processor sorts, but not as fast as the network sorts.
Our sort could be used in a low cost special purpose sorting machine. Sorting is traditionally used in batch processing and also now in efficient implementations of relational query systems (e.g., [5] ). Our sort would form a natural part of a relational data base machine [6] .
The algorithm is a variant of a straight merge sort [1, pp. 163-165]. The passes are run overlapped rather than serially. Each pass is supported by a separate processor. Reading from the output of one pass begins before the writing of that output is complete, so the intermediate structures are first-in first-out queues rather than files. When the number of records to be sorted is not an exact power of 2, the normal serial algorithm deals with the remainder at the end of each pass; our algorithm deals with it first.
There are several variations of the algorithm that are more suitable in certain circumstances. A multi-way merge sort reduces the number of processors. Small sections of data can be sorted before being introduced to the merge sort, which reduces the number of processors and the handling of very short queues. Blocking may be used to store queued records. Sort processors may be used to steer records directly from input to output queues, rather than to read them into buffers and later write them into queues. It may be best to let the processors run asynchronously. Various hardware techniques can support the algorithm, including solid logic and bubble technologies. Either can support both the processing and queue storage, or solid logic processors can be used with bubble queues.
Copyright 1978 by International Business Machines Corporation. Copying is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract may be used without further permission in computer-based and other information-service systems. Permission to republish other excerpts should be obtained from the Editor.
[Figure caption, partially recovered: "... records. An overview of the sorting process is shown at (a), and a snapshot of the sort after seven cycles is shown at (b). The [symbol not recovered] indicates a break between strings of records."]
In the next section we discuss the basic algorithm and analyze its processor and queue requirements. We show how multi-way sorting and presorting affect these requirements. The following section discusses more detailed variations of the algorithm that relate to the supporting hardware. Finally, hardware alternatives are covered, with general comments and three specific implementations.
S. TODD
The algorithm
The algorithm is a variation of a straight two-way merge sort [1, pp. 163-165]. A serial two-way merge sort operates in several passes, with each pass creating sorted sequences (strings) of records. The first pass creates strings of two records; the second pass merges these pairwise into four-record strings. After $i$ passes, the strings have length $2^i$. After $\lceil \log_2 n \rceil$ passes, all $n$ records are in one sorted string.
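The serial pass structure can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names `merge` and `straight_merge_sort` are hypothetical, and strings are held as in-memory lists rather than files.

```python
def merge(left, right):
    """Stable merge of two sorted lists (ties are taken from `left`)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def straight_merge_sort(records):
    """Return (sorted_records, number_of_passes)."""
    strings = [[rec] for rec in records]  # before pass 1: strings of length 1
    passes = 0
    while len(strings) > 1:
        # One pass: merge neighbouring strings pairwise; an odd string
        # left over at the end of the pass is carried forward unchanged.
        strings = [
            merge(strings[k], strings[k + 1]) if k + 1 < len(strings) else strings[k]
            for k in range(0, len(strings), 2)
        ]
        passes += 1
    return (strings[0] if strings else []), passes
```

For eight records the sketch makes three passes, and for five records it also makes three, with the short string handled at the end of each pass as in the serial algorithm.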
In our variation the passes are run overlapped. We consider first the special case where the number $n$ of records to be sorted is equal to $2^r$ for integer $r$ and where there are $r + 1$ processors, 0 through $r$. The output from the $i$th processor consists of sorted sequences of $2^i$ records, created by merging two output strings from the $(i-1)$th processor. Figure 1 shows the general setup for our algorithm, Fig. 2, the operation of a serial merge sort, and Fig. 3, the operation of an overlapped merge sort. We assume the processors run synchronously and can both read and write one record per cycle. Each processor starts when the previous processor has written one complete string and the first record of a second string.
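The overlapped operation can be illustrated with a small cycle-by-cycle simulation. The sketch below is an assumption-laden model rather than the paper's implementation: it handles only $n = 2^r$, it uses Python `deque` objects for the queues, and it lets each processor fire whenever the records it needs have been written in an earlier cycle (which reproduces the start rule stated above). The name `overlapped_merge_sort` is hypothetical.

```python
from collections import deque


def overlapped_merge_sort(records):
    """Simulate the overlapped passes; return (sorted_records, last_cycle)."""
    n = len(records)
    r = n.bit_length() - 1
    if n < 1 or 2 ** r != n:
        raise ValueError("this sketch handles only n = 2**r with n >= 1")

    # queues[i] is the pair of FIFO queues written by p_i and read by p_(i+1).
    queues = [(deque(), deque()) for _ in range(r)]
    written = [0] * (r + 1)                 # records written so far by each p_i
    taken = [[0, 0] for _ in range(r + 1)]  # records taken from each input string
    output = []
    cycle = 0

    while len(output) < n:
        cycle += 1
        # Visit p_r first and p_0 last, so that a record written in this
        # cycle cannot be read before the next cycle.
        for i in range(r, -1, -1):
            if written[i] == n:
                continue                    # p_i has finished
            if i == 0:
                rec = records[written[0]]   # p_0 passes on one input record
            else:
                half = 2 ** (i - 1)         # length of each input string
                left, right = queues[i - 1]
                if taken[i][0] == half:     # left string used up: copy from right
                    src = 1
                elif taken[i][1] == half:   # right string used up: copy from left
                    src = 0
                elif left and right:        # both heads present: compare
                    src = 1 if right[0] < left[0] else 0
                else:
                    continue                # must wait for a head to arrive
                source = right if src else left
                if not source:
                    continue                # the needed record is not written yet
                rec = source.popleft()
                taken[i][src] += 1
                if sum(taken[i]) == 2 * half:
                    taken[i] = [0, 0]       # output string complete
            if i == r:
                output.append(rec)
            else:
                # Strings go alternately to the two queues (first, third, ...
                # to the first queue; second, fourth, ... to the second).
                string_number = written[i] // 2 ** i
                queues[i][string_number % 2].append(rec)
            written[i] += 1
    return output, cycle
```

Under this model the last record of an eight-record file leaves the final processor in cycle 18, whatever the input order, because every processor writes exactly one record per cycle once it has started.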
The first process is finished before the last one starts; thus a single processor can be used for both. Alternatively, a sorter can be run almost as a continuous process. As soon as input of one file is finished, input of the next starts. The later processors act on the end of one file while the early ones act on the start of the next file.
With serial processing the natural merge sort reduces the number of passes needed when the file is already partially sorted. This is of less value with overlapped passes. The processors are usually preallocated, and only a few cycles are saved. Also, the storage requirements for the intermediate queues are less predictable.
In the remainder of this section we analyze the algorithm. We see how it must be adapted when $n$ is not a power of 2, and we discuss the use of multi-way merge sorts and the presorting of blocks of records.
Analysis of the algorithm
We establish sorting time and bounds on queue lengths.
There are $r + 1$ processors ($p_0, \ldots, p_r$) to sort $n = 2^r$ records. The output from $p_i$ consists of $2^{r-i}$ sorted sequences of $2^i$ records each, $s_{i,1}, \ldots, s_{i,2^{r-i}}$. Processor $p_0$ breaks the input into separate records; $s_{0,j}$ consists of the $j$th input record. Sequence $s_{i,j}$ ($i > 0$) is created by merging $s_{i-1,2j-1}$ with $s_{i-1,2j}$. Processors $p_i$ and $p_{i+1}$ are connected by two queues, $q_{2i+1}$ and $q_{2i+2}$. Output sequence $s_{i,j}$ from $p_i$ is written to $q_{2i+1}$ for $j$ odd or $q_{2i+2}$ for $j$ even. Thus $p_{i+1}$ always merges a sequence from $q_{2i+1}$ with one from $q_{2i+2}$.
Processor $p_{i+1}$ starts operation as soon as there is input in both $q_{2i+1}$ and $q_{2i+2}$, that is, $2^i + 1$ cycles after $p_i$. Processor $p_0$ starts in cycle 1; thus $p_i$ starts in cycle

$$1 + \sum_{j=0}^{i-1} (2^j + 1) = 2^i + i.$$

Processor $p_i$ operates for $n$ cycles and completes in cycle

$$2^i + i + n - 1.$$

The sort ends with processor $p_r$ in cycle $2n + \log_2 n - 1$.
A different way to find sorting time is to consider the case when the last record in is to be the first out. It passes $p_0$ in cycle $n$, $p_1$ in cycle $n + 1$, and so on. It emerges from the sort out of $p_r$ in cycle $n + r$. The final record emerges from the sort $n - 1$ cycles later in cycle $2n + r - 1$.
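The timing rules stated in this analysis ($p_0$ starts in cycle 1, each later processor starts $2^i + 1$ cycles after its predecessor $p_i$, and every processor runs for $n$ cycles) can be tabulated directly. The following sketch uses a hypothetical helper named `schedule` and is only a check on the arithmetic.

```python
def schedule(r):
    """Start and finish cycles of p_0 .. p_r when sorting n = 2**r records.

    Returns a list of (i, start_cycle, finish_cycle) tuples.
    """
    n = 2 ** r
    rows = []
    start = 1                      # p_0 starts in cycle 1
    for i in range(r + 1):
        rows.append((i, start, start + n - 1))
        start += 2 ** i + 1        # p_(i+1) starts 2**i + 1 cycles after p_i
    return rows


for i, start, finish in schedule(3):
    print(f"p_{i}: cycles {start} to {finish}")
```

For $r = 3$ the final processor runs from cycle 11 to cycle 18, which agrees with $2n + r - 1$, and $p_0$ has finished (cycle 8) before $p_3$ starts.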
We establish bounds on queue lengths. Consider a time when $p_i$ and $p_{i+1}$ are both operating. Processor $p_{i+1}$ is about to write into $s_{i+1,j}$, into which it has already read the head ($l$ records) of $s_{i,2j-1}$ and the head ($m$ records) of $s_{i,2j}$. Processor $p_i$ started writing into $s_{i,2j}$ one cycle before $p_{i+1}$ started writing into $s_{i+1,j}$. Either [...]
During the period before $p_{i+1}$ starts and after $p_i$ stops, the queues may be shorter than these rules imply.
The equation for the sum of the queue lengths is easily derived from the fact that $p_{i+1}$ starts after $p_i$ has written $2^i + 1$ records into these queues, and that from then until $p_i$ finishes each write cycle sees one record written by $p_i$ and one read by $p_{i+1}$.
Number of records not a power of two
Consider the case where the number (n) of records is not a power of two. The serial straight two-way merge sort deals with the remaining "short" sequences at the end of each pass. This can be improved in the parallel case by taking the short sequences first. This does not upset the convenient property that merge sort preserves the order of records with equal sort key.
Let $r = \lceil \log_2 n \rceil$. We still need $r + 1$ processors. If the short sequences are taken at the end in the parallel case, $p_r$ starts at cycle $2^r + r$. The sort ends after $n + 2^r + r - 1$ cycles.
Taking the short sequences at the front is more complicated. We initialize all processors as if they had already operated on $2^r - n$ very small pseudo-records. For $i = 1$ to $r - 1$, $p_i$ still starts operation $2^{i-1} + 1$ cycles after $p_{i-1}$.
Processor $p_r$ starts as soon as there is a record in both its input queues. String $s_{r-1,1}$ contains all the pseudo-records, and thus only $2^{r-1} - 2^r + n$ real records. So $p_r$ starts $2^{r-1} - 2^r + n + 1$ cycles after $p_{r-1}$; that is, in cycle $2^{r-1} + r - 1 + 2^{r-1} - 2^r + n + 1 = n + r$. The sort finishes in $2n + r - 1$ cycles.
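The two finishing times can be compared numerically. The helper below, with the hypothetical name `finish_cycles`, simply evaluates the two expressions given in this section for a file of $n \ge 1$ records; it does not simulate the pseudo-records.

```python
def finish_cycles(n):
    """Return (r, cycles_short_last, cycles_short_first) for n >= 1 records.

    r is ceil(log2(n)), computed with integer arithmetic.
    """
    r = (n - 1).bit_length()
    short_last = n + 2 ** r + r - 1    # short sequences taken at the end
    short_first = 2 * n + r - 1        # short sequences taken at the front
    return r, short_last, short_first


for n in (5, 8, 1000):
    print(n, finish_cycles(n))
```

For $n = 5$ the front-first variant finishes in 12 cycles instead of 15; when $n$ is an exact power of two the two expressions coincide.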
We illustrate this with $n = 5$, $r = 3$ (Fig. 5). The operation of $p_0$, $p_1$, and $p_2$ is similar to their operation in the case $n = 8$ (Fig. 2), only three cycles earlier and with "a," "b," and "c" replaced by pseudo-records. Processor $p_2$ could have started in cycle 5 rather than cycle 6, but it would have been held up in the next cycle waiting for the "d." Processor $p_3$ is able to start in cycle 8 without risk of being held up. It is amusing when the last record is the smallest to watch the lower queues clear themselves to let it through to the front.
Multi-way merge sort
A multi-way merge sort can be operated with all passes in parallel in the same fashion as a two-way merge sort. Fewer, more powerful processors are needed.
For a $k$-way merge sort, $r + 1 = \lceil \log_k n \rceil + 1$ processors sort $n$ records in $2n + r - 1$ cycles. The output of $p_i$ consists of strings of length $k^i$, which are merged $k$ at a time [...]
[Figure 5 annotations: At start of sort, three pseudo-records are already processed. Sort begins exactly as in Fig. 3, cycle 3, with a, b, c replaced by pseudo-records (+). All continues as in Fig. 3. $p_2$ starts its second string.]
Table 1 shows the number of processors and queues for various values of $k$ and $n$. The best choice of $k$ depends on the comparative cost and speed of processors and queues for a particular implementation.
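The trade-off between $k$ and the resource counts can be explored with a short calculation. In the sketch below the processor count and cycle count follow the expressions in this section; the queue count of $k$ queues per merging processor is an assumption, generalized from the $2 \log_2 n$ queues quoted for the two-way case. The function names are hypothetical.

```python
def ceil_log(n, k):
    """Smallest r with k**r >= n, using integer arithmetic only."""
    r, power = 0, 1
    while power < n:
        power *= k
        r += 1
    return r


def k_way_requirements(n, k):
    """Resources for a k-way overlapped merge sort of n records."""
    r = ceil_log(n, k)
    return {
        "processors": r + 1,
        "queues": k * r,          # assumption: k input queues per merging processor
        "cycles": 2 * n + r - 1,
    }


for k in (2, 4, 16):
    print(k, k_way_requirements(2 ** 20, k))
```

For about a million records, going from $k = 2$ to $k = 16$ cuts the processors from 21 to 6 while the cycle count changes only in its low-order term.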
Presorting
The first few processors deal with very short strings. They can be eliminated if $p_0$ carries out an "in core" presort of sets of $s$ input records. The output from $p_0$ consists of strings of $s$ records, $p_i$ produces strings of $s \cdot k^i$ records, and only $\lceil \log_k (n/s) \rceil + 1$ processors are needed to sort $n$ records.
The number $s$ should be chosen so that $p_0$ can carry out the sort ($s \log_2 s$ comparisons) as the records are read in. It can then write the smallest record of the first string as it reads the first record of the second string. This gives a delay of $s$ read cycles before the first write cycle. The processor $p_0$ requires storage for $s$ records. Research at the University of Strathclyde on the LEECH processor [7] suggests that a very cheap special purpose processor should be able to handle values of $s$ up to 50 or more. This simplifies the design of hardware for the intermediate queues. Space can be efficiently allocated to the queue sets at an early stage, which reduces allocation problems at execution time.
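As a rough illustration of presorting, the sketch below splits the input into sets of $s$ records, sorts each set, and evaluates the processor count $\lceil \log_k (n/s) \rceil + 1$. The names `presort` and `processors_with_presort` are hypothetical, and Python's built-in `sorted` stands in for whatever in-core sort $p_0$ would actually use.

```python
def presort(records, s):
    """Split the input into sets of s records and sort each set in core."""
    return [sorted(records[j:j + s]) for j in range(0, len(records), s)]


def processors_with_presort(n, s, k):
    """ceil(log_k(n / s)) + 1, using integer arithmetic only."""
    r, reach = 0, s        # reach = string length after r merging processors
    while reach < n:
        reach *= k
        r += 1
    return r + 1


print(presort([5, 3, 8, 1, 9, 2, 7, 4], 4))
print(processors_with_presort(2 ** 20, 32, 2))
```

With $s = 32$ and a two-way merge, about a million records need 16 processors instead of the 21 required without a presort.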
Variations of the algorithm
We discuss in this section variations of the algorithm that are made to suit particular implementations. They are the blocking of the records held in queues; overlapping input, comparison, and output; and asynchronous operation. These are not important in the general behavior of the algorithm but must be considered for a specific implementation.
Blocking
In some implementations of queues the designer may prefer to deal with records in blocks rather than individually. Blocking delays the transfer of a record from $p_i$ to $p_{i+1}$ and slightly slows down the sort.
If presorting is not used, the effect of blocking on the early processors is complicated as several strings fit into one block. We only consider the case where the number of records in a block is equal to the length $s$ of strings produced by the presort, which is generally the most convenient [...]
As long as no block holds records from several strings, processors complete a block for one output queue before starting a block for another. Thus a single buffering store and write channel can be shared by all output queues.
Overlapping input, comparison, and output
For some implementations the processors see the records as a stream of bits. Rather than reading, comparing, and writing records, a bit serial comparison is made and the data steered to the correct place. This applies only with fixed length records with the key at their head. Figure 6 shows the setup for overlapped input, comparison, and output when merging two strings. Data are read one bit at a time into a processor, which controls the crossover switch and inhibitors. In a normal cycle the processor is comparing the loser record (that with the larger key) from the previous cycle (which is in the buffer loop) with the next record from the winner record's queue. The loser record's input queue is inhibited. As long as the records are identical, the status of the crossover switch is not important: one stream of bits flows into the output queue and an identical stream into the buffer loop. As soon as a difference is recorded, the switch is set to steer the smaller (winner) record to the output queue and the loser into the buffer loop.
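The steering of two bit streams through the crossover switch can be modelled in software. The sketch below is a simplified illustration, not a description of the hardware in Fig. 6: it compares just one pair of fixed-length records, assumes the key occupies the first `key_bits` bits with the most significant bit first, and leaves the switch uncrossed when the keys are equal. The function name `steer` and its parameters are hypothetical.

```python
def steer(record_a, record_b, key_bits):
    """Route two equal-length bit sequences through a crossover switch.

    Returns (to_output_queue, to_buffer_loop): the record with the smaller
    key goes to the output queue and the other to the buffer loop.
    """
    to_output, to_buffer = [], []
    crossed = None                       # switch not yet committed
    for pos, (bit_a, bit_b) in enumerate(zip(record_a, record_b)):
        if crossed is None:
            if pos >= key_bits:
                crossed = False          # keys identical: keep a as the winner
            elif bit_a != bit_b:
                crossed = bit_a > bit_b  # first difference decides the switch
        # While the bits are still identical the switch setting is
        # irrelevant: the same bit flows to both destinations.
        if crossed:
            to_output.append(bit_b)
            to_buffer.append(bit_a)
        else:
            to_output.append(bit_a)
            to_buffer.append(bit_b)
    return to_output, to_buffer


out, loop = steer([1, 0, 1, 1, 1], [1, 0, 0, 0, 0], key_bits=3)
print(out, loop)
```

Because no bit is held back while the comparison is undecided, input, comparison, and output proceed at the same rate, which is the point of the overlapped scheme.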
During the copy sequences, no comparisons are needed. To keep a steady stream of data the output is directed from the input around the buffer loop. In the last cycle of a copy sequence, the first record from the next string of input 1 is steered into the buffer loop. The processor is then ready to start work on its new output string as soon as the first input 2 record arrives. The output queue control switch is flipped, but there is no break in the flow of input data.
When k > 2 strings are to be merged, k -1 comparators must be cascaded.
Timing and asynchronous operation
In the analyses so far, we have assumed the processors to be synchronized by their output cycles. This may prove inconvenient because of the different number of reads to be carried out between the writes, because of different comparison costs in each cycle or because of a queue access being held up for implementation reasons (e.g., storage conflict). Reads and writes can be balanced by buffering, but the other problems make asynchronous operation desirable.
No data comparisons are necessary during a copy sequence. Thus the amount of work involved in an output cycle, particularly in the multi-way merge, is not constant. Also the queues may not be implemented completely independently, and storage access may delay a processor. For these reasons it may be required to run the processors asynchronously. If $p_{i+1}$ operates too fast for $p_i$, $p_{i+1}$ will attempt to read an intermediate queue and find it empty. Thus $p_{i+1}$ must then wait until the required record is written by $p_i$. If $p_i$ operates too fast for $p_{i+1}$, $p_i$ may find that the storage available for the intermediate queues becomes full; then $p_i$ must wait for $p_{i+1}$ to read some records.
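The waiting behavior described here maps naturally onto bounded blocking queues. The sketch below is one possible software model, with several assumptions: each processor is a Python thread, each intermediate queue is a `queue.Queue` whose capacity of $2^i + 1$ records is chosen only so that a full string always fits, $n$ is a power of two, and no record is `None`. The names `stage` and `async_merge_sort` are hypothetical.

```python
import queue
import threading


def stage(i, left_q, right_q, out_qs, n):
    """Processor p_i (i >= 1): merge strings of 2**(i-1) records pairwise."""
    half = 2 ** (i - 1)
    for string_number in range(n // (2 * half)):
        out = out_qs[string_number % len(out_qs)]
        read_l = read_r = 0
        left = right = None                  # buffered heads (None = empty)
        for _ in range(2 * half):
            if left is None and read_l < half:
                left = left_q.get()          # waits while the queue is empty
                read_l += 1
            if right is None and read_r < half:
                right = right_q.get()
                read_r += 1
            if right is None or (left is not None and not right < left):
                out.put(left)                # waits while the queue is full
                left = None
            else:
                out.put(right)
                right = None


def async_merge_sort(records):
    n = len(records)
    r = n.bit_length() - 1
    if n < 1 or 2 ** r != n:
        raise ValueError("this sketch handles only n = 2**r with n >= 1")
    # qs[i]: the two bounded queues between p_i and p_(i+1).
    qs = [(queue.Queue(2 ** i + 1), queue.Queue(2 ** i + 1)) for i in range(r)]
    final = queue.Queue()                    # unbounded output of p_r

    def source():                            # plays the role of p_0
        for j, rec in enumerate(records):
            (qs[0][j % 2] if r > 0 else final).put(rec)

    threads = [threading.Thread(target=source)]
    for i in range(1, r + 1):
        out_qs = qs[i] if i < r else (final,)
        threads.append(threading.Thread(
            target=stage, args=(i, qs[i - 1][0], qs[i - 1][1], out_qs, n)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [final.get() for _ in range(n)]


print(async_merge_sort([5, 3, 8, 1, 9, 2, 7, 4]))
```

A blocking `get` models $p_{i+1}$ waiting on an empty queue, and a blocking `put` models $p_i$ waiting when its queue storage is full; the relative speeds of the threads affect only the timing, not the sorted result.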
Analysis of asynchronous operation involves the comparative speed of the processors and queue storage, the degree of independence of the queues, and the details of the file being sorted. With processors appreciably faster than storage and reasonably independent queues, the performance is comparable to that with synchronous operation.
Hardware
The algorithm is suitable for implementation on a wide range of hardware. We discuss in general terms the implementation using solid logic technology and bubble technology, with a specific implementation in each and a hybrid implementation. We show how the variants of the algorithm are used in different circumstances. Other technologies (e.g., multi-channel disks) are not discussed.
Figure 6 Setup for overlapped input, comparison, and output. Data are read from the read heads into a processor, which controls the selection switches and the inhibitors.
Solid logic technology
Solid logic technology is good at providing random access and powerful processors but less good with independent access to several queues. In an implementation of the algorithm in this technology we would use presorting and a multi-way merge sort to minimize queue traffic.
Simultaneous accesses to a single module of main storage are very complicated (or impossible). Thus we allocate one storage module for each queue set. The queues are implemented by chaining. The bounds on queue set size determine the size of the storage modules.
With suitable buffering each processor requires one read and one write access per write cycle. All processors can be made to read during the first part of a cycle and to write during the second part. This avoids contention of one processor writing a queue while the following processor is reading it. Each processor requires a small local store for the buffered values.
If the sorting rate is limited by the speed with which data can be fed to a device, a very fast store is not appropriate except for the storage used for the presort. With fairly fast processors a possible design carries out a presort to produce strings of records to a total of 1 Kbyte, where K = 1024, and subsequent processors handle a 16-way merge sort. A device with a sort capacity of 256 Kbytes is shown in Fig. 7 .
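The capacity figure follows from the presort size and the merge order. The sketch below assumes that each $k$-way merge processor multiplies the string length by $k$, so that a 1-Kbyte presort followed by two 16-way merge processors reaches 256 Kbytes; the number of merge processors in Fig. 7 is inferred from this arithmetic rather than stated in the text. The function names are hypothetical.

```python
def sort_capacity_kbytes(presort_kbytes, k, merge_processors):
    """Longest sorted string after the given number of k-way merge processors."""
    return presort_kbytes * k ** merge_processors


def merge_processors_needed(capacity_kbytes, presort_kbytes, k):
    """Smallest number of k-way merge processors reaching the capacity."""
    count, reach = 0, presort_kbytes
    while reach < capacity_kbytes:
        reach *= k
        count += 1
    return count


print(sort_capacity_kbytes(1, 16, 2))        # Kbytes
print(merge_processors_needed(256, 1, 16))   # merge processors after the presort
```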
Bubble technology
Magnetic bubble devices are particularly suited to handling streams of data, but not to random access. The bit serial merge of Fig. 6 [...]
Several bubble register organizations can support multiple queues with simultaneous access. It is too early to predict which would be the most economical. The most complicated (the shift register array) handles records independently; the others use blocking.
The shift register array [6] was designed for the storage of multiple queues. The operation is analogous to the use of many tapes with independent read and write heads; the problem is the neat stacking of the tape between the heads. The bounds on queue lengths simplify the dynamic allocation of space between them.
Two other mechanisms use blocking. Each can be seen as an array of data, with each column representing a block and several active columns from which data can be read or written. The columns can be moved to transfer them to an active column position when they are to be accessed. The control of the blocks can be handled by some external processor, which remembers which is the next block in each queue. Alternatively, the blocks can be self-identifying and retrieved associatively.
The first of the blocking mechanisms stores data using a major-minor loop scheme [8]. Because simultaneous access to several blocks is needed, the data are retrieved via a small section of shift register array rather than via a single major loop [9]. This allows several blocks to be accessed together, with buffering and unblocking handled in the store.
The second blocking mechanism makes use of much more tightly packed bubbles in an array [10]. This scheme operates faster but does not permit one block to be accessed while another is being moved towards an access column. Thus some additional support is needed to buffer and unblock, or there will be interference between queues.
Of the above, the shift register array is the only scheme which effectively deals with the very short queues encountered between the early processors. The other mechanisms are more efficient for holding long queues. The shift register array can handle many queues with little interference, the other mechanisms only a few. Thus queues produced from several processors can be implemented in one shift register array, but for the blocking schemes independent modules are needed for the different sets of queues. Figure 8 gives a schematic diagram of one implementation of the algorithm using bubble technology. All processors use bit serial merge. The early queues are implemented in a shift register array. The later queues use minor loops accessed via a shift register array.
The first section is organized to produce strings of records to fit a 1-Kbyte buffer. The programming depends on the record length. There are sufficient queues and processors to support the smallest record length; for longer records some of them are not used. This section is effectively a presorter.
The later sections each have shift register arrays to block and buffer two input queues and buffer and unblock two output queues. The block size is 1 Kbyte. The $i$th section has minor loops long enough to hold $2^i$ blocks. A total of 12 of these sections gives the last a capacity of 2 megabytes. The total storage required is just over 4 Mbytes (where M = $K^2$), which is the sorting capacity of the device.
Hybrid implementation
Many combinations of hardware can be used. Figure 9 gives an example. A solid logic front end is combined with a back end using solid logic processors and bubble queues. Four-way merge sort is used to make reasonable use of the processor without overcomplicating queue interaction in the store.
Summary
We have discussed the application of multiple processors to a merge sort. Solid logic and magnetic bubble technologies can be used to implement the hardware, or a hybrid of these technologies can be used.
