This work studies comparator networks in which several of the outputs are accelerated. That is, they are generated much faster than the other outputs, and this without hindering the other outputs. We study this acceleration in the context of merging networks and sorting networks.
Introduction
We study comparator networks in which several of the outputs are accelerated. That is, they are generated much faster than the other outputs, and this without hindering the other outputs. Namely, for every 0 < k ≤ n, we present a merging network of minimal depth that merges two sorted sequences of length n into a single sorted sequence. This merging network produces either the lowest k keys or the highest k keys (see footnote 1) after a delay of log(k) + 1 comparators. Building on that, we construct, for every 0 < k < n, an n-key sorting network that accelerates its k lowest or its k highest outputs. This sorting network is a merge-sort network (see footnote 2) and has minimal depth among these networks. Namely, its depth is $\frac{\log(n) \cdot \log(2n)}{2}$, the same depth as the Batcher merge-sort networks [2]. However, in contrast to the Batcher merge-sort networks, which may accelerate only the first and last outputs, our merge-sort networks accelerate either the k lowest keys or the k highest keys (see footnote 1) to a delay of less than log(n) · log(2k) comparators.
The paper presents a new merging technique, the Tri-section technique, that separates, by a depth one network, two sorted sequences into three sets, such that every key in one set is smaller or equal to any key in the following set. After this separation, each of these sets can be sorted separately and this leads to the desired acceleration. The idea of separating the input into two sets is known and is used, for example, in the Bitonic sorter of Batcher [2] ; however, to the best of our knowledge, separation into three sets as above is novel.
To put our results in context, let us compare the acceleration of our networks with the acceleration of other well-known merging networks: the Bitonic sorter and the odd-even merging network, both of Batcher [2]. The Bitonic sorter has no accelerated outputs at all; all outputs have exactly the same delay. On the other hand, the odd-even merging network has only two accelerated outputs, the first and the last ones, whose delay is exactly one. All other outputs have the same delay.
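The comparison above can be checked on a small instance. The following is a minimal sketch, assuming a wire-based representation in which a comparator is a pair (min_wire, max_wire); the function names are illustrative and not taken from the paper. It builds the width-8 versions of both Batcher networks and computes the delay of every output as the longest comparator path that reaches it.

```python
def oddeven_merge(lo, hi, r=1):
    """Comparators of Batcher's odd-even merging network on wires lo..hi (inclusive)."""
    step = 2 * r
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)       # recursive merge of one half of the wires
        yield from oddeven_merge(lo + r, hi, step)   # recursive merge of the other half
        # final depth-one stage that combines the two recursive results
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def bitonic_merge(lo, size):
    """Comparators of Batcher's Bitonic sorter on `size` wires starting at `lo`."""
    if size > 1:
        half = size // 2
        yield from ((i, i + half) for i in range(lo, lo + half))
        yield from bitonic_merge(lo, half)
        yield from bitonic_merge(lo + half, half)


def output_delays(comparators, width):
    """Delay of each output: length of the longest comparator path ending on that wire."""
    delay = [0] * width
    for i, j in comparators:
        delay[i] = delay[j] = max(delay[i], delay[j]) + 1
    return delay


print(output_delays(list(oddeven_merge(0, 7)), 8))  # [1, 3, 3, 3, 3, 3, 3, 1]
print(output_delays(list(bitonic_merge(0, 8)), 8))  # [3, 3, 3, 3, 3, 3, 3, 3]
```

In this instance only the first and last outputs of the odd-even merging network have delay one, while every output of the Bitonic sorter has delay log(8) = 3.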
To the best of our knowledge, the idea of accelerating certain outputs was never addressed. The only prior work which is somewhat similar to our work concerns selectors. A (k, n)-selector is a network that separates a set of n keys into the lowest k and the other keys. Fast selection leads to a sorting network that accelerates certain outputs, as follows: First, the k lowest keys are separated from the other keys. Afterwards, each set is sorted separately. Yao presented a (k, n)-selector which is efficient when k is constant and n is very large. This selector can be extended into a sorting network that accelerates its lowest k outputs; however, the depth of the resulting sorting network exceeds the minimal depth of a merge-sort network. Our network accelerates the lowest k outputs while its depth is minimal among merge-sort networks.
Our paper has several additional contributions. The first one concerns the well-known 0-1 Principle [6] . This principle is a powerful tool that simplifies the construction and analysis of comparator networks. The paper demonstrates that, in some cases, there is a more convenient tool to achieve the same goal. In the context of merging, we use a small and elegant set of vectors which constitute a conclusive set [4] ; namely, a network is a merging network if and only if it sorts this set. This tool simplifies our proof by having fewer special cases than the classical 0-1 Principle.
The second additional contribution concerns Batcher's merging techniques. Batcher's odd-even merging technique [2] works as follows: Each of the input sequences is partitioned into its even part and its odd part. The even part of one sequence is merged with the even part of the other sequence recursively and similarly, the odd parts are merged. Finally, the two resulting sequences are merged into a single sorted sequence by a depth one network.
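The odd-even merging technique described above can be sketched at the level of key values. This is an illustrative sketch, assuming both inputs have the same power-of-two length; the function name is hypothetical, and the two-element `sorted` call stands in for a single comparator.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length by Batcher's odd-even technique."""
    n = len(a)
    if n == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]   # a single comparator
    evens = odd_even_merge(a[0::2], b[0::2])        # even parts merged recursively
    odds = odd_even_merge(a[1::2], b[1::2])         # odd parts merged recursively
    merged = [evens[0]]                             # lowest key skips the last stage
    for i in range(n - 1):
        lo, hi = sorted((odds[i], evens[i + 1]))    # one comparator of the depth-one stage
        merged += [lo, hi]
    merged.append(odds[-1])                         # highest key skips the last stage
    return merged


print(odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8]))   # [1, 2, 3, 4, 5, 6, 7, 8]
```

The first and last keys bypass the final depth-one stage at every level of the recursion, which is why these two outputs have a delay of exactly one comparator.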
Footnote 1: When n is a power of two, both the lowest k keys and the highest k keys can be accelerated.
Footnote 2: A merge-sort network is a sorting network which operates as follows. The input is arbitrarily divided into two sets of (almost) equal size and each set is recursively sorted; the two sorted sequences are then merged.
A slight variant of this method, due to Knuth [6, pp 231] and Leighton [7, pp 623], recursively merges the even part of each input sequence with the odd part of the other sequence. Again, the resulting two sorted sequences can be merged by a depth one network. We refer to the family of networks produced by allowing each of the above two variants anywhere in the recursion process as Batcher merging networks. It was shown in [8] that all published merging networks, whose width is a power of two, are members of this family. All these merging networks are of minimal depth and have no degenerate comparators. (A degenerate comparator has a fixed incoming edge whose value is always greater than or equal to the value on the other incoming edge, for every valid input of the network.) This fact raises the following question:
Question 1 Are the Batcher merging networks the only merging networks with the following properties:
1. Their width, 2n, is a power of two.
2. Their depth is minimal, namely log(2n).
3. They have no degenerate comparators.
The Tri-section technique provides a negative answer to this question, as shown in Section 5. Another question, which remains open, concerns accelerating all the outputs of a merging network, each to a delay that is close to the trivial reachability bound of this output. This bound is due to the fact that the j-th lowest (or highest) output may come from any of 2j specific input edges. Therefore, our question is:
Question 2 For any n (or arbitrarily large n), is there a merging network of width 2n that, for every j < n, accelerates the j-th lowest output and the j-th highest output to a delay of log(j) + o(log(j))?
Preliminaries
The concept of comparator networks is well-known and an example is depicted in Figure 1 . A comparator (represented by a circle) receives two keys via its two incoming edges. The comparator sorts these keys; it transmits the minimal one on the outgoing Min edge (indicated by a hollow arrowhead) and the maximal key on the outgoing Max edge (indicated by the solid arrowhead). The network's input edges are indicated by an open arrowhead.
The network of Figure 1 is in fact a merging network. Its input consists of two sorted sequences, each of width 2, and its output is a sorted sequence of width 4. Keys enter the network through its input edges and exit the network through its output edges. We name the input edges to denote how the input is fed into the network. In such a network, one input sequence enters the edges $\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_{n-1}$ and the other input sequence enters the edges $\hat{b}_0, \hat{b}_1, \ldots, \hat{b}_{n-1}$. Similarly, output edges are named $\hat{o}_0, \hat{o}_1, \ldots, \hat{o}_{2n-1}$ to denote how the output keys are assembled into a sequence. Namely, the output sequence $o = o_0, o_1, \ldots, o_{2n-1}$ is composed of the values on these edges, in that order.
The width of a network N is the number of its input edges which clearly equals the number of its output edges. Let e be an edge of N . The depth of e, denoted d(e), is the length of the longest path that ends in the tail of e (i.e., the path does not include e). Hence d(e) = 0, for every input edge e. The depth of N , denoted d(N ), is the maximal depth of the edges of N . In the network, M , of Figure 1 ,
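A sequence of keys $x_0, x_1, \ldots, x_{n-1}$ is denoted by $x$; the width of this $x$, denoted $|x|$, is $n$. A bisequenced vector $\langle a, b \rangle$ is the concatenation of two sequences $a$ and $b$ of equal width; its width is $2|a| = 2|b|$. A bisorted vector is a bisequenced vector $\langle a, b \rangle$ in which both $a$ and $b$ are sorted. Such a vector is a valid input to a merging network of the appropriate width.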
Let K be the set of possible keys. Usually the cardinality of K is insignificant, as long as it is greater than one, and this paper is no exception. This fact is reflected by the 0-1 principle. Surprisingly, there are some properties of networks which depend on the size of K, as shown in [10].
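The notions of this section can be illustrated by a small simulator. The sketch below assumes a wire-based encoding (a comparator is a pair of wire indices, with the minimum routed to the first and the maximum to the second) and uses a hypothetical width-4 merging network, which is not necessarily the network of Figure 1. In the spirit of the 0-1 principle, it checks the merging property on binary bisorted vectors only.

```python
def apply_network(comparators, keys):
    """Run the keys through the comparators; (lo, hi) puts the minimum on wire lo."""
    keys = list(keys)
    for lo, hi in comparators:
        if keys[lo] > keys[hi]:
            keys[lo], keys[hi] = keys[hi], keys[lo]
    return keys


def binary_bisorted_vectors(n):
    """All 0-1 vectors <a, b> of width 2n in which both a and b are sorted."""
    for zeros_a in range(n + 1):
        for zeros_b in range(n + 1):
            a = [0] * zeros_a + [1] * (n - zeros_a)
            b = [0] * zeros_b + [1] * (n - zeros_b)
            yield a + b


def merges_all_binary_inputs(comparators, n):
    """True if the network sorts every binary bisorted vector of width 2n."""
    return all(apply_network(comparators, v) == sorted(v)
               for v in binary_bisorted_vectors(n))


# Hypothetical width-4 merging network: a on wires 0-1, b on wires 2-3.
MERGE_4 = [(0, 2), (1, 3), (1, 2)]
print(merges_all_binary_inputs(MERGE_4, 2))        # True
print(merges_all_binary_inputs(MERGE_4[:-1], 2))   # False
```

Dropping the last comparator breaks the network, and the binary input a = 0, 1 and b = 0, 1 already exposes the failure.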
The Asymmetric Tri-section Merging Technique
This paper presents two Tri-section merging techniques. As said, they are based on separating, by a depth one network, a bisorted vector of width 2n into three sequences, x, y and z, such that every key in one set is smaller or equal to any key in the next set. This allows us to sort each of these sequences separately. In all our techniques the resulting merging network is of a minimal depth. Furthermore, the sequences x, y and z are sorted by networks of depth log(| x|) , log(| y|) and log(| z|) , respectively.
In this section we present the Asymmetric Tri-section technique in which $|x| = k$, $|y| = n$ and $|z| = n - k$, where k is an arbitrary number smaller than or equal to n. The technique is called "Asymmetric" in contrast to the "Symmetric" variant in which $|x| = |z|$.
The depth one network performing the Asymmetric Tri-section, with these parameters, is called $T_{k,n}$. Figure 2 presents the network $T_{5,11}$. In this figure, a comparator is denoted as in Figure 3. Namely, it contains two horizontal edges: a Min edge and a Max edge, connected by a diagonal line. The names of the edges entering this comparator are written on the diagonal line while the names of the edges coming out of it are written on the edges. (See Figure 3.) The general network, $T_{k,n}$, naturally follows the format of Figure 2 and a formal definition is omitted. Note that the network $T_{0,n}$ is identical to the first stage of Batcher's Bitonic sorter [2]. Hence, in some sense, the Tri-section technique is a generalization of Batcher's technique.
Let $\bar{T}_{k,n}$ denote the mapping performed by the network $T_{k,n}$. That is, $\bar{T}_{k,n}(\langle a, b \rangle) = \langle x, y, z \rangle$, where x, y and z are the sequences generated by $T_{k,n}$ when it receives the input vector $\langle a, b \rangle$. To study $\bar{T}_{k,n}$ we name several types of vectors. A sequence x is sorted (ascending) if $x_i \le x_j$ whenever $i \le j$. Similarly, x is descending if $x_i \ge x_j$ whenever $i \le j$. A sequence is ascending-descending if it is a concatenation of an ascending sequence followed by a descending sequence. Similarly, a concatenation of a descending sequence followed by an ascending sequence is called descending-ascending. Note that either of the sequences may be empty; therefore, ascending sequences and descending sequences are both ascending-descending and descending-ascending. A sequence is Bitonic (see footnote 4) if it is a rotation of an ascending-descending sequence. A comparator network is an ascending-descending sorter if it sorts all ascending-descending sequences. Similarly, we define descending-ascending sorter and Bitonic sorter.
Footnote 4: This term was coined by Batcher [2] and we follow his terminology. We caution the reader that some authors use the term "Bitonic" with other meanings.
A powerful tool to study merging networks, similar to the 0-1 Principle, is the set of Sandwich vectors, presented in [4] . As demonstrated in the proof of Lemma 4, their simple and elegant form simplifies the analysis of merging networks and leads to fewer special cases than the traditional 0-1 Principle. For the sake of sandwiches we assume that K = N. A sandwich of width 2n is a bisorted vector a, b in which every member of the interval [0, 2n) appears exactly once and the range of the a sequence is an interval. The term "sandwich" follows from the fact that the vector can be sorted by inserting the sequence a consecutively in a certain place in the sequence b. The sandwich technique is based on the following lemma:
Lemma 3 (The sandwich Lemma [4]) A network is a merging network if and only if it sorts all sandwiches.
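To make the Sandwich Lemma concrete, the sketch below enumerates the sandwiches of width 2n, of which there are only n + 1 since each is determined by its key a_0, and tests a network against them. The wire-based encoding and the hard-coded width-8 odd-even merging network are assumptions made for illustration; the function names are hypothetical.

```python
def sandwiches(n):
    """Yield the n + 1 sandwiches <a, b> of width 2n; each is determined by a_0."""
    for a0 in range(n + 1):
        a = list(range(a0, a0 + n))                      # the range of a is an interval
        b = list(range(a0)) + list(range(a0 + n, 2 * n))  # b holds the remaining keys
        yield a, b


def apply_network(comparators, keys):
    """Run the keys through the comparators; (lo, hi) puts the minimum on wire lo."""
    keys = list(keys)
    for lo, hi in comparators:
        if keys[lo] > keys[hi]:
            keys[lo], keys[hi] = keys[hi], keys[lo]
    return keys


def sorts_all_sandwiches(comparators, n):
    """The test of the Sandwich Lemma: does the network sort every sandwich?"""
    return all(apply_network(comparators, a + b) == list(range(2 * n))
               for a, b in sandwiches(n))


# Batcher's odd-even merging network of width 8 (a on wires 0-3, b on wires 4-7).
ODD_EVEN_MERGE_8 = [(0, 4), (2, 6), (2, 4), (1, 5), (3, 7), (3, 5),
                    (1, 2), (3, 4), (5, 6)]
print(sorts_all_sandwiches(ODD_EVEN_MERGE_8, 4))        # True
print(sorts_all_sandwiches(ODD_EVEN_MERGE_8[:-1], 4))   # False
```

For n = 4 the test involves five vectors, whereas a test over binary bisorted vectors involves (n + 1)^2 = 25 of them.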
The following lemma is the keystone of the Asymmetric Tri-section technique.
Lemma 4 Let $\langle a, b \rangle$ be a bisorted vector of width $2n$, let $k \in [0, n)$ and let $\langle x, y, z \rangle = \bar{T}_{k,n}(\langle a, b \rangle)$. Then
1. $|x| = k$, $|y| = n$, $|z| = n - k$.
2. Every key in x is smaller than or equal to any key in y and every key in y is smaller than or equal to any key in z.
3. x is ascending-descending, z is descending-ascending.
4. y is Bitonic.
Proof
The fact that z is descending-ascending is proved analogously and Statement (3) follows. The hard part of this proof is Statement (4). To use the 0-1 Principle or the Sandwich Lemma we need the network in question to be a merging network. To this end, we extend the network T k,n into a network M as follows: The sequence x enters an arbitrary ascending-descending sorter, the sequence y enters an arbitrary Bitonic sorter and the sequence z enters an arbitrary descending-ascending sorter. We now prove that M is a merging network. This fact can be proved using (a variant of) the 0-1 Principle but this leads to many special cases which need to be verified. On the other hand, sandwiches lead to a proof having only two symmetric cases. Therefore, we next assume that the input a, b is a sandwich and show that in this case the sequence y is Bitonic.
Note that a sandwich vector $\langle a, b \rangle$ is determined by the key $a_0$. There are two (overlapping) cases; either $a_0 \le k$ or $a_0 \ge k$. The two cases are similar and we focus on the first. Figure 4 depicts the network $T_{5,11}$ processing the sandwich with $a_0 = 2$. The initial part of y, having $k - a_0$ keys, comes from b in reverse order. Hence, this initial part of y is descending. The rest of y comes from a in the natural order; hence, this part is ascending. To summarize, in the case of $a_0 \le k$, y is Bitonic. In the other case, where $a_0 \ge k$, the sequence y is ascending-descending; hence, y is Bitonic also in this case. This and the Sandwich Lemma imply that M is a merging network. If we were just to prove that M is a merging network, then the proof would have ended here. However, our lemma is stronger: it says that y is always Bitonic, for any bisorted input. To prove that, assume for a contradiction that y is not Bitonic. By the following Lemma 6 (whose proof does not depend on the current lemma), there exists a Bitonic sorter that does not sort y. As said, M processes y using an arbitrary Bitonic sorter. In particular, this Bitonic sorter could be the one that does not sort y. This contradicts the fact that the entire network M is a merging network.
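The statements of Lemma 4 can be explored numerically. Since the formal definition of the Tri-section network is omitted in the text, the sketch below assumes one depth-one wiring that is consistent with the properties stated here: a_i is compared with b_{k−1−i} for i < k, and a_{k+i} with b_{n−1−i} for i < n − k. This wiring coincides with the first stage of the Bitonic sorter when k = 0 and reproduces the sandwich example with k = 5, n = 11 and a_0 = 2. The function names are hypothetical.

```python
def tri_section(a, b, k):
    """Assumed depth-one wiring: a[i] vs b[k-1-i] (i < k), a[k+i] vs b[n-1-i] (i < n-k)."""
    n = len(a)
    low = [(a[i], b[k - 1 - i]) for i in range(k)]
    high = [(a[k + i], b[n - 1 - i]) for i in range(n - k)]
    x = [min(p) for p in low]                              # the k lowest keys
    y = [max(p) for p in low] + [min(p) for p in high]     # the n middle keys
    z = [max(p) for p in high]                             # the n - k highest keys
    return x, y, z


def is_ascending_descending(s):
    i = 0
    while i + 1 < len(s) and s[i] <= s[i + 1]:   # longest ascending prefix
        i += 1
    return all(s[j] >= s[j + 1] for j in range(i, len(s) - 1))


def is_descending_ascending(s):
    return is_ascending_descending([-v for v in s])


def is_bitonic(s):
    """A sequence is Bitonic if some rotation of it is ascending-descending."""
    return any(is_ascending_descending(s[r:] + s[:r]) for r in range(max(len(s), 1)))


# The sandwich with a_0 = 2 for k = 5, n = 11.
a = list(range(2, 13))
b = [0, 1] + list(range(13, 22))
x, y, z = tri_section(a, b, 5)
print(x)   # [2, 3, 4, 1, 0]
print(y)   # [15, 14, 13, 5, 6, 7, 8, 9, 10, 11, 12]
print(z)   # [21, 20, 19, 18, 17, 16]
print(is_ascending_descending(x), is_bitonic(y), is_descending_ascending(z))
```

In the printed example the first k − a_0 = 3 keys of y come from b in reverse order and the remaining keys come from a in natural order, matching the description in the proof.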
Our goal now is to show that for every non-Bitonic vector there exists a Bitonic sorter which does not sort it. To this end, we use the following result of [4]: for any 0-1 vector there is a network that identifies it in the following sense.
Lemma 5 (The identifying lemma, [4]) For every 0-1 vector v which is not constant, there is a network that sorts all the 0-1 vectors of the appropriate width, except v.
Lemma 6 For every non-Bitonic sequence there is a Bitonic sorter that does not sort it.
Proof: Let v be a non-Bitonic sequence and let v′ be a binary image of v which is not Bitonic. Clearly, v′ is not constant. By Lemma 5, there exists a network N of width |v| that sorts all binary vectors except v′. By a straightforward 0-1 argument, a network sorts a vector if and only if it sorts all its binary images. Since all binary images of Bitonic vectors are Bitonic, it follows that N is a Bitonic sorter. Since N does not sort v′, it does not sort v.
Returning to the Tri-section technique, recall that our goal is to construct a merging network of minimal depth in which the sequences x, y and z are sorted by networks of depth log(|x|), log(|y|) and log(|z|), respectively. To this end, we use pruning ([5], [12], [8, Section 4.2.2]) to reduce the width of certain comparator networks. This technique is based on the concept of degenerate comparators, as defined in the introduction. Pruning, in the context of merging networks, is studied in [8] and can be applied as follows. Several consecutive input edges at the top of the input sequences are fed with the fictive value +∞ while the rest of the inputs are fed with real keys. Clearly, the paths of the fictive values are fixed: they do not depend on the values of the real keys. Any comparator that is on such a path is degenerate and can be removed without affecting the network's functionality. The resulting network, that processes no fictive values, is a merging network of a smaller width.
Returning to our x, y and z, we first consider the case where n is a power of two. In this case, the vector y is Bitonic and its width is a power of two. Such a vector can be sorted by Batcher's Bitonic sorter [2], whose depth is log(n).
Concerning the sequence x, recall that this sequence is not only Bitonic, but also ascending-descending. Such a sequence can be expanded into a Bitonic sequence of a desired width by adding fictive keys of value −∞ at the beginning (or the end) of the sequence. This implies that any wide enough Bitonic sorter can be pruned into an ascending-descending sorter of a smaller given width. The depth of the resulting network is clearly not greater than the depth of the original one. Therefore, x can be sorted by a network of depth log(| x|) . By symmetry, the sequence z can be sorted in a similar manner.
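The pruning idea used above can be sketched on a wire-based network. This is a minimal sketch under assumptions: comparators are pairs (min_wire, max_wire), the fictive value is +∞ placed on the top wires of each input sequence, and the names `prune` and `run_pruned` are hypothetical. The code tracks which real input line currently sits on each wire; a comparator touching a fictive value only re-routes a key and is dropped.

```python
def prune(comparators, width, fictive_wires):
    """Remove the comparators that lie on the fixed paths of the +infinity values."""
    # line_on_wire[w] is the label of the real line on wire w, or None for +infinity.
    line_on_wire = [None if w in fictive_wires else w for w in range(width)]
    kept = []
    for lo, hi in comparators:
        line_lo, line_hi = line_on_wire[lo], line_on_wire[hi]
        if line_lo is None or line_hi is None:
            # Degenerate comparator: +infinity always leaves on the Max wire,
            # so the real key (if any) is simply routed to the Min wire.
            line_on_wire[lo] = line_hi if line_lo is None else line_lo
            line_on_wire[hi] = None
        else:
            kept.append((line_lo, line_hi))   # (line receiving min, line receiving max)
    output_order = [line for line in line_on_wire if line is not None]
    return kept, output_order


def run_pruned(kept, output_order, keys_by_line):
    """Run the pruned network; keys_by_line maps a real line label to its input key."""
    keys = dict(keys_by_line)
    for lo, hi in kept:
        if keys[lo] > keys[hi]:
            keys[lo], keys[hi] = keys[hi], keys[lo]
    return [keys[line] for line in output_order]


# Prune the width-8 odd-even merging network into a merger of two 3-key sequences
# by feeding +infinity into the top wire of each input sequence (wires 3 and 7).
ODD_EVEN_MERGE_8 = [(0, 4), (2, 6), (2, 4), (1, 5), (3, 7), (3, 5),
                    (1, 2), (3, 4), (5, 6)]
kept, order = prune(ODD_EVEN_MERGE_8, 8, {3, 7})
a, b = [1, 3, 5], [2, 4, 6]
print(len(kept))                                                    # 6
print(run_pruned(kept, order, dict(zip([0, 1, 2, 4, 5, 6], a + b))))  # [1, 2, 3, 4, 5, 6]
```

The line labels are kept as the original wire numbers rather than being compacted, which keeps the sketch short at the cost of a less tidy drawing of the pruned network.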
Next consider the case where n is not a power of two. Note that in this case | y| is not a power of two. One may assume that a wider Bitonic sorter can be pruned into a Bitonic sorter of the desired width; however, as shown in [9] , the minimal depth of a Bitonic sorter is not monotonic in the width of the input; therefore, such pruning is impossible.
The problem is solved as follows. Let $\bar{n} = 2^{\lceil \log(n) \rceil}$ be the first power of two following n. Let M be the merging network of width $2\bar{n}$ generated by the asymmetric construction in which the k lowest outputs are accelerated to a depth of log(k) + 1. The network M is pruned into a width 2n merging network as discussed above. This results in a merging network N of width 2n and of depth log(2n) whose k lowest outputs are accelerated to a delay of log(k) + 1 comparators.
Our construction possesses an additional important attribute which enables the concatenation of several such accelerating merging networks into an accelerating sorting network. This attribute relates to 'restricted reachability' as follows. We say that an edge e of a merging network is reachable only from the k lowest (highest) inputs if there is no path to e from an input edge which is not one of the k lowest (highest) input edges of one of the input sequences. The construction of this section is summarized in the following lemma.
Lemma 7 For every k < n there is a merging network of width 2n and of depth log(2n) in which each of the k lowest (highest) outputs is accelerated to a depth of log(k) + 1 comparators and is reachable only from the k lowest (highest) inputs.
The Symmetric Tri-section Merging Technique
Another Tri-section technique that accelerates certain outputs is presented in this section. As said, the Tri-section technique separates, by a depth one network, a bisorted input into three sets, x, y and z such that every key in one set is smaller or equal to any key in the following set. We first consider the case where the width of the input, 2n, and the number of accelerated outputs, k, are powers of two. In this case the Symmetric Tri-section satisfies | x| = | z| = k. Hence, the network accelerates both the lowest k outputs and the highest k outputs to a delay of log(2k) comparators.
The symmetric technique is based on the Bitonic sorting technique of Nakatani et al. [11] which considers the keys to be arranged in a matrix. To this end, we denote a matrix of keys by m. Their technique is based on the following lemma. Following those stages the resulting matrix is sorted in a row major fashion.
Lemma 8 (The
Note that the matrix technique does not require that j and k be powers of two; however, we use it only under this restriction. Assume that the Bitonic sorters used in stage (2) and stage (3) are of minimal depth. The depth of the entire network is log(j) + log(k) which is minimal. As shown in [8] , for every n, a power of two, there is a unique n-key Bitonic sorter of minimal depth. This implies that, when the width is a power of two, the network of Nakatani et al. is identical to Batcher's [2] Bitonic sorter. Yet, even in this case, Nakatani's technique sheds a new light on Batcher's Bitonic sorter.
Note that the parameter k in the technique of Nakatani et al. and the parameter k of our technique refer to the same number; namely, using a j × k matrix as per Lemma 8, we accelerate the highest k outputs and the lowest k outputs. Since we are only interested in merging and not in Bitonic sorting, we can perform Stage (2) in a special manner. When the bisorted input is turned into a Bitonic sequence and arranged in the above matrix, every column in the matrix m is not only Bitonic, but in fact bisorted; hence, it can be sorted by any merging network. There are many merging networks [8, Section 6] of minimal depth that produce the lowest key and highest key after a delay of a single comparator (e.g., Batcher's [2] odd-even merging network); therefore, Stage (2) can be performed by k such merging networks, working in parallel. Stage (3) can be performed by j Batcher Bitonic sorters, working in parallel. The depth of these Bitonic sorters is minimal, namely log(k); therefore, this construction accelerates the lowest k outputs and highest k outputs to a delay of log(k) + 1 comparators.
Let M be the network performing Stage (2). By Statement (4) of Lemma 8, M Tri-sects the keys into three sets: the first row, the last row and all the rest. Moreover, the very same Tri-section is performed by a subnetwork of M of depth one; hence this technique is a Tri-section technique. Our construction not only accelerates the required outputs but also has a restricted reachability as summarized in the following lemma.
Lemma 9 For any k < n, both powers of two, there is a merging network of width 2n and of depth log(2n) in which:
• Each of the k lowest outputs is accelerated to a delay of log(2k) comparators and is reachable only from the lowest k inputs.
• Each of the k highest outputs is accelerated to a delay of log(2k) comparators and is reachable only from the highest k inputs.
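The matrix view behind this construction can be sketched at the level of key values. The statement of Lemma 8 is not reproduced in this text, so the stages below are an assumption based on the surrounding description: the bisorted input is turned into a Bitonic sequence, arranged row by row in a j × k matrix, its columns are sorted (Stage 2) and then its rows are sorted (Stage 3). Python's `sorted` stands in for the merging networks and Bitonic sorters, and the function name is hypothetical.

```python
def matrix_merge(a, b, k):
    """Merge sorted a and b (equal length n, with k dividing 2n) via a j x k matrix."""
    s = a + b[::-1]                        # bisorted input turned into a Bitonic sequence
    j = len(s) // k
    rows = [s[r * k:(r + 1) * k] for r in range(j)]        # row-major arrangement
    # Stage 2: sort every column (each column is bisorted up to reversal).
    cols = [sorted(rows[r][c] for r in range(j)) for c in range(k)]
    after_columns = [[cols[c][r] for c in range(k)] for r in range(j)]
    # Stage 3: sort every row (a stand-in for the Bitonic sorters).
    merged = [v for row in after_columns for v in sorted(row)]
    return after_columns, merged


a, b = [5, 6, 7, 8], [1, 2, 3, 4]
after_columns, merged = matrix_merge(a, b, 2)
print(after_columns)   # [[2, 1], [4, 3], [5, 6], [7, 8]]
print(merged)          # [1, 2, 3, 4, 5, 6, 7, 8]
```

After Stage 2 the first row already holds the k lowest keys and the last row the k highest keys, which is the Tri-section into the first row, the last row and all the rest.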
Next consider the case where k is not a power of two. As in the previous section, instead of accelerating k outputs we accelerate $\bar{k} = 2^{\lceil \log k \rceil}$ outputs. Note that in this construction, each accelerated output is reachable from $\bar{k} = 2^{\lceil \log k \rceil}$ (rather than k) extreme inputs.
Finally, consider the case where the network's width, 2n, is not a power of two. In this case, we do not know how to accelerate both the highest k and the lowest k keys, simultaneously; in fact, we do not know if such acceleration is possible. We do know how to accelerate either the smallest k outputs or the highest k outputs. This is accomplished by pruning a network whose width is a power of two. The following Lemma (similar to Lemma 7) summarizes this case.
Lemma 10 For any k < n there is a merging network of width 2n and of depth log(2n) in which each of the k lowest (highest) outputs is accelerated to a delay of log k + 1 comparators and is reachable only from the lowest (highest) $2^{\lceil \log k \rceil}$ inputs.
A counterexample
As shown in [8] , all published merging networks (whose width is a power of two) are Batcher merging networks. Namely, they are constructed by a straightforward generalization of Batcher's odd-even technique. The depth of all these merging networks is minimal. This raises the following question:
Question 1 Are the Batcher merging networks the only merging networks with the following properties:
1. Their width, 2n, is a power of two.
2. Their depth is minimal, namely log(2n).
3. They have no degenerate comparators.
The answer to this question is no. The Tri-section technique can generate a counterexample based on the fact that when | x| is small w.r.t. n, the sequence x can be sorted in an arbitrary manner (by a network of excessive depth) while maintaining the minimal depth of the entire merging network. We next present such a network for any n ≥ 8, a power of two.
Our construction starts with the network $T_{3,n}$ that produces the three sequences x, y and z. The sequence x is sorted by the network depicted in Figure 5 which has no degenerate comparators. The sequences y and z are sorted by any minimal depth network as per Section 3. The resulting merging network, M, satisfies the three conditions of Question 1. (If M has degenerate comparators, they should be removed.) The network M is not a Batcher merging network since it has a comparator, c, with the following property. Of the two edges exiting c, one is the output edge $\hat{o}_2$ and the other is not an output edge. This is never the case in a Batcher merging network.
The above construction can be extended to yield a merging network of minimal depth which does not follow the "divide and conquer" paradigm. Let k = |x| be large enough and still much smaller than n. Then the sequence x can be sorted using a network which is clearly not of the above paradigm. Two such examples are Knuth's bubble-sort network and Knuth's odd-even transposition sort [6, pp 223,241]. This construction may produce degenerate comparators that can be removed without affecting the network's functionality. This implies the existence of a minimal depth merging network that has no degenerate comparators and has an arbitrarily large subnetwork lacking any recursive structure.
Accelerating Sorting Networks
Building on the merging networks introduced in the previous sections, we now utilize the classical merge sort algorithm to construct sorting networks that accelerate certain outputs. Clearly, the depth of a merge-sort network is at least $\frac{\log(n) \cdot \log(2n)}{2}$ and, due to Batcher [2], this bound is tight. In theory, due to the AKS construction, there are sorting networks which are much faster than merge-sort networks. However, this holds only for impractically large n. The merge-sort networks of Batcher [2], invented in 1968, are still the best practical sorting networks [6, Section 5.3.4] and, as said, their depth is $\frac{\log(n) \cdot \log(2n)}{2}$. We refer to the last number as the Batcher depth. This section presents a merge-sort network of a Batcher depth which accelerates the k lowest outputs (or k highest outputs) to a delay smaller than log(n) · log(2k) as follows.
We assume, without loss of generality, that k is a power of two. By pruning, we may also assume that n is a power of two. Our construction is composed of a sorting stage followed by log(n) − log(k) merging stages. In the sorting stage the n input keys are separated into sets of k keys each, and each of these sets is sorted separately by any sorting network of a Batcher depth. We now follow the merge sort method. Namely, in each of the merging stages, all the sorted sequences produced in the previous stage are paired and each pair is merged into a single sorted sequence. This merge is performed by a merging network, as per Lemma 7 or Lemma 10, that accelerates its k lowest outputs to a delay of log(2k) comparators and moreover, it possesses the restricted reachability property.
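The role of restricted reachability in this construction can be sketched at the level of key values. The sketch below assumes n and k are powers of two and uses Python's `sorted` as a stand-in for the sorting and merging networks; the function name is hypothetical. In every merging stage, the k lowest outputs are computed from the k lowest keys of each of the two sequences only, so the accelerated outputs never wait for the slower parts of the previous stage.

```python
def lowest_k_outputs(keys, k):
    """Follow only the restricted path that produces the k lowest outputs."""
    # Sorting stage: the n keys are separated into sets of k keys, each sorted separately.
    blocks = [sorted(keys[i:i + k]) for i in range(0, len(keys), k)]
    merging_stages = 0
    while len(blocks) > 1:
        # Merging stage: the k lowest outputs of each merge depend only on the
        # k lowest keys of each of its two input sequences.
        blocks = [sorted(blocks[i][:k] + blocks[i + 1][:k])[:k]
                  for i in range(0, len(blocks), 2)]
        merging_stages += 1
    return blocks[0], merging_stages


keys = [9, 3, 14, 0, 7, 12, 5, 1, 15, 8, 2, 11, 6, 13, 4, 10]
print(lowest_k_outputs(keys, 2))   # ([0, 1], 3)
```

For n = 16 and k = 2 there are log(n) − log(k) = 3 merging stages, matching the stage count used in the construction.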
Consider the delay of the lowest k outputs. This delay is composed of $\frac{\log k \cdot \log 2k}{2}$ comparators in the sorting stage and log 2k comparators in each of the log n − log k merging stages. Due to the restricted reachability property, these delays are added up; that is, in the entire network, the delay of the lowest k outputs is at most $\frac{\log k \cdot \log 2k}{2} + (\log n - \log k) \cdot \log 2k$. Clearly, the depth of the entire sorting network is a Batcher depth. Our construction is summarized in the following theorem.
Theorem 11 For every 0 < k < n, there is a sorting network of width n and of Batcher depth that accelerates all the lowest k (or highest k) outputs to a delay of $\frac{\log k \cdot \log 2k}{2} + (\log n - \log k) \cdot \log 2k$.
In the special case where n is a power of two, we may use Lemma 9 to achieve the above acceleration both for the highest k keys and the lowest k keys, simultaneously.
