Abstract: Merging-based sorting networks are an important family of sorting networks. Most merge sorting networks are based on 2-way or multiway merging algorithms that use 2-sorters as basic building blocks. An alternative is to use n-sorters, instead of 2-sorters, as the basic building blocks so as to greatly reduce the number of gates as well as the latency. Based on a modified version of Leighton's columnsort algorithm, an n-way merging algorithm, referred to as SS-Mk, that uses n-sorters as basic building blocks was proposed. In this work, we first propose a new multiway merging algorithm with n-sorters as basic building blocks that merges n sorted lists of m values each in 1 + ⌈m/2⌉ stages (n ≤ m). Based on our merging algorithm, we also propose a multiway sorting algorithm. We also show an application of our sorting algorithm with sorters implemented in threshold logic. Though both our algorithm and the SS-Mk require the same asymptotic number of gates, O(N log^2 N), to sort N inputs, our algorithm requires fewer gates than the SS-Mk for wide ranges of N.
I. INTRODUCTION
Sorting is an important operation in data processing, and its efficiency greatly affects the overall performance of a wide variety of applications [1], [2]. One of the most popular sorting algorithms is merge sort [2]. It first divides the input list (a sequence of values) into multiple sublists (smaller sequences of values) and sorts each sublist simultaneously. Then, the sorted sublists are merged into a single sorted list. The sorting of each sublist can in turn be decomposed recursively into the sorting and merging of even smaller sublists. The odd-even merge sort in [2] is based on a 2-way merging algorithm, which merges two sorted lists (odd and even lists) into one sorted list. The basic building block in this 2-way merging algorithm is a 2-sorter, which is simply a 2 × 2 switching element or comparator, as shown in Fig. 1(a).
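The 2-way odd-even merge sort described above can be sketched in software as a list of 2-sorter positions. This is a minimal sketch of the textbook recursive construction of Batcher's network for a power-of-two number of inputs; the function names (`two_sorter`, `odd_even_merge`, `odd_even_merge_sort`, `sort_with_2_sorters`) are illustrative and do not come from the paper.

```python
def two_sorter(a, b):
    """2-sorter (compare-exchange element): return (min, max)."""
    return (a, b) if a <= b else (b, a)


def odd_even_merge(lo, hi, r):
    """Yield 2-sorter positions merging wires lo..hi (inclusive) at stride r."""
    step = r * 2
    if step < hi - lo:
        # Merge the even-indexed and odd-indexed subsequences recursively,
        # then fix up neighbouring pairs with one more column of 2-sorters.
        yield from odd_even_merge(lo, hi, step)
        yield from odd_even_merge(lo + r, hi, step)
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def odd_even_merge_sort(lo, hi):
    """Yield 2-sorter positions that sort wires lo..hi (inclusive)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from odd_even_merge_sort(lo, mid)       # sort first half
        yield from odd_even_merge_sort(mid + 1, hi)   # sort second half
        yield from odd_even_merge(lo, hi, 1)          # 2-way merge


def sort_with_2_sorters(values):
    """Sort a list whose length is a power of two using only 2-sorters."""
    x = list(values)
    for i, j in odd_even_merge_sort(0, len(x) - 1):
        x[i], x[j] = two_sorter(x[i], x[j])
    return x


print(sort_with_2_sorters([3, 1, 2, 0]))
```

The sequence of comparator positions is generated without looking at the data, which is what makes the construction a network rather than an ordinary data-dependent sort.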
Instead of merging two lists, n sorted lists can be merged simultaneously. Hence, n-sorters can naturally be used as basic building blocks. If larger sorters can be implemented efficiently, the total area as well as the latency of a sorting network using n-sorters may be smaller than that of one using 2-sorters. An n-way merging algorithm was first proposed by Lee and Batcher [3], where n is not restricted to 2. However, the combining operation of the n-way merging algorithm in [3] still uses 2-sorters as basic building blocks. Leighton proposed an algorithm for sorting r lists of c values each, represented as an r × c matrix [4]. This algorithm is a generalization of the odd-even merge sort and is named columnsort, since it merges all sorted columns to obtain a single sorted list in row order. In the original columnsort, no specific operation was provided for sorting columns and no recursive construction of a sorting network was provided. In [5], a modified columnsort algorithm was proposed with sorting networks constructed from n-sorters (n ≥ 2) [6]. However, a 2-way merge is still used for the merging process. In [7], an n-way merging algorithm, named SS-Mk, based on the modified columnsort was proposed with n-sorters as basic building blocks, where n is prime. An improved version of the SS-Mk merge sort, called ISS-Mk, was provided in [8], where n can be any integer. We compare our sorting scheme with the SS-Mk but not the ISS-Mk, because for the ranges of N of interest, the ISS-Mk has larger latency due to a large constant.
In this work, we propose an n-way merging algorithm, which generalizes the odd-even merge by using n-sorters as basic building blocks, where n (≥ 2) is prime. Based on this merging algorithm, we also propose a sorting algorithm. The work in this paper differs from previous works [7], [8] in that our multiway sorting algorithm is a direct generalization of that in [3]. We analyze the latency and number of gates required by our algorithm, and compare them with those of the algorithm in [7].
II. BACKGROUND
A sorting network is a feedforward network that produces a sorted list from unsorted inputs. It is composed of two kinds of items: switching elements (or comparators) and wires. The depth of a comparator is defined as the length of the longest path from the inputs of the sorting network to that comparator's outputs. The latency of the sorting network is the maximum depth over all comparators. The network is fixed ahead of time and does not depend on the input values [2]. We use the Knuth diagram in [1] to represent sorting networks, where switching elements are denoted by connections on a set of wires. The basic building block used by the odd-even merge [2] is a 2-by-2 comparator (compare-exchange element). It receives two inputs and outputs their minimum and maximum in order. The symbol for a 2-sorter is shown in Fig. 1(a), where xi and yi for i = 1, 2 are the inputs and outputs, respectively. Similarly, an n-sorter is a device that sorts n values in unit time. The symbol for an n-sorter is shown in Fig. 1(b), where xi and yi for i = 1, 2, · · · , n are the inputs and outputs, respectively, and the outputs satisfy y1 ≤ y2 ≤ · · · ≤ yn. In this work, we denote the sorted values y1 ≤ y2 ≤ · · · ≤ yn by y1, y2, · · · , yn and use n-sorters as basic blocks for sorting.
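The definitions of depth and latency above can be made concrete with a small model. The sketch below assumes a network is given as an ordered list of sorters, each a tuple of wire indices, with every sorter (of any size) taking unit time; the representation and the names `sorter_depths`, `latency` and `apply_network` are illustrative choices, not the paper's notation.

```python
def sorter_depths(num_wires, sorters):
    """Return the depth of each sorter, counting every sorter as one time unit."""
    wire_depth = [0] * num_wires            # depth reached so far on each wire
    depths = []
    for wires in sorters:
        d = 1 + max(wire_depth[w] for w in wires)
        for w in wires:
            wire_depth[w] = d
        depths.append(d)
    return depths


def latency(num_wires, sorters):
    """Latency of the network: the maximum depth over all sorters."""
    return max(sorter_depths(num_wires, sorters), default=0)


def apply_network(values, sorters):
    """Run the network; each sorter writes its inputs back in ascending order."""
    x = list(values)
    for wires in sorters:
        for w, y in zip(wires, sorted(x[w] for w in wires)):
            x[w] = y
    return x


# A 4-input network built from 2-sorters, and a single 4-sorter.
net_2 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
net_4 = [(0, 1, 2, 3)]
print(latency(4, net_2), latency(4, net_4))
print(apply_network([3, 1, 2, 0], net_2))
```

Under the unit-time assumption, the same four inputs take three time units with 2-sorters but one with a 4-sorter, which is the motivation for larger basic blocks.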
978-1-4799-7088-9/14/$31.00 ©2014 IEEE GlobalSIP 2014: Data Flow Algorithms and Architecture for Signal Processing Systems
Algorithm 1 Algorithm for n-way merging network.
  …
      Apply (m − i)-spaced sorters between lists j and j + 1;
    end for
    Merge all (m − i)-spaced sorters;
    Update n sorted lists x … j,m for j = 1, · · · , n;
    i = i + 1;
  end while
  for j = 1 to n − 1 do
    Apply (m − 1)-sorters on m − 1 adjacent lines with first half, … j,m−k , from list j and second half, … ;
  end for
  Output: Sorted lists.
In this work, we focus on multiway merge sort with binary values as inputs. Our merge sort also works for arbitrary values, which is justified by the zero-one principle in [2], stating that a sorting network will sort any arbitrary list of n values if it can sort all 2^n lists of 0s and 1s.
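The zero-one principle suggests a simple exhaustive check: a candidate network on n wires is a sorting network if it sorts all 2^n binary inputs. The following minimal sketch assumes a network is given as an ordered list of sorters, each a tuple of wire indices; the helper names are hypothetical.

```python
from itertools import product


def apply_network(values, sorters):
    """Run the network; each sorter writes its inputs back in ascending order."""
    x = list(values)
    for wires in sorters:
        for w, y in zip(wires, sorted(x[w] for w in wires)):
            x[w] = y
    return x


def sorts_all_binary_inputs(num_wires, sorters):
    """Zero-one check: try all 2**num_wires lists of 0s and 1s."""
    for bits in product((0, 1), repeat=num_wires):
        if apply_network(bits, sorters) != sorted(bits):
            return False
    return True


good = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]   # valid 4-input network
bad = good[:-1]                                   # last 2-sorter removed
print(sorts_all_binary_inputs(4, good), sorts_all_binary_inputs(4, bad))
```

The check costs 2^n network evaluations, so it is practical only for small n, but it is cheaper than testing all n! permutations once n exceeds 3.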
III. SORTING

A. Multiway Merging
Instead of merging two lists, multiple sorted lists can be merged into a single sorted list simultaneously. An n-way merger (n ≥ 2) of size m is a network merging n sorted lists of size m (m values) each into a single sorted list in multiple stages. This was first proposed in [3] as a generalization of Batcher's odd-even merging algorithm. However, the combining network of the merging network in [3] still uses 2-sorters as basic blocks. In the following, we propose an n-way merging algorithm with n-sorters as basic building blocks, as shown in Alg. 1. We consider a sorting network in which all iterations of Alg. 1 are instantiated simultaneously (loop unrolling). We refer to the instantiation of iteration i of Alg. 1 as stage i of the sorting network. The sorters in the last for loop in Alg. 1 constitute the last stage. Let the n sorted input lists be x … Denote the values of the j-th list after stage k by (x … ⌉ stages, all input lists are sorted as a single list, x … n,m.
For convenience in describing and proving our algorithm, we introduce some notation and definitions. Denote the number of zeros in the j-th list after stage i as r … j,k+1, respectively, for some j ∈ Z_m and k ∈ Z_{m−1}. Then, our n-way merging Alg. 1 can be intuitively understood as flooding lists with zeros in descending order. The correctness of Alg. 1 can be shown by first proving the following lemmas. See the extended version [9] of this paper for the proofs of the following lemmas and theorems.
Algorithm 2 Algorithm for combining n lists of m = n^{p−1} values.
  …
  for j = 1, · · · , n and obtain a single sorted list x …
  … 1,q , x … 2,q , x … n,q , x …
We first show that the first connections of two adjacent sorters, S1 and S2, belong to either the same list or two adjacent lists. The same relation holds for the last connections of S1 and S2. This gives a total of four cases, as shown in Fig. 2, where b ≥ a + 1 for Fig. 2(a)-(c) and b ≥ a for Fig. 2(d), such that S1 and S2 have a size of at least two.
The following theorem proves the correctness of Alg. 1. The theorem can be proved by induction on i.
In the following, we propose Alg. 
where s ≥ 0 and s + l ≤ n i .
The following theorem proves the correctness of Alg. 2. 
B. Multiway sorting algorithm
Algorithm 3 Algorithm for sorting N = n^p values.
  … p−1 ; Apply one n-sorter on each of n^{p−1} lists and obtain x (1) … , · · · , …
  … for k = 1, · · · , n, and obtain a single … j,n i ;
    end for
  end for
  Output: Sorted list.
Based on the multiway merging algorithm in Sec. III-A, we propose a merge sorting algorithm via a divide-and-conquer method. The idea is to first decompose a large list of inputs into smaller sublists, then sort each sublist, and finally merge them into one sorted list. The sorting of each sublist is done by further decomposition. For instance, for N = n^p inputs, we first divide the n^p inputs into n lists of n^{p−1} values. Then we sort each of these n lists and combine them with Alg. 2. The sorting of each of the n lists is done by dividing its n^{p−1} inputs into n smaller lists of n^{p−2} values. We repeat the above operations until each of the smaller lists contains only n values, which can be sorted by a single n-sorter. The detailed procedure is shown in Alg. 3.
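The divide-and-conquer structure described above can be sketched as a short recursion. This is only a software illustration of the decomposition: the standard-library `heapq.merge` is used as a stand-in for the n-way combining step, so it does not model the merging network of Alg. 2, its stages, or its latency. The names `n_sorter` and `multiway_sort` are hypothetical.

```python
import heapq


def n_sorter(values):
    """Stand-in for one n-sorter: returns its inputs in ascending order."""
    return sorted(values)


def multiway_sort(values, n):
    """Sort len(values) == n**p values (p >= 1) by recursive n-way splitting."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(values) == n:
        return n_sorter(values)              # base case: a single n-sorter
    if len(values) < n or len(values) % n != 0:
        raise ValueError("length must be a power of n")
    size = len(values) // n                  # each sublist holds n**(p-1) values
    sublists = [multiway_sort(values[k * size:(k + 1) * size], n)
                for k in range(n)]
    # Assumption: a software n-way merge replaces the combining network here.
    return list(heapq.merge(*sublists))


print(multiway_sort(list(range(26, -1, -1)), 3))
```

For N = 27 and n = 3 the recursion bottoms out in nine 3-sorters, followed by three merges of three 3-value lists and one merge of three 9-value lists.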
For example, a 3-way sorting network of N = 3^3 = 27 inputs is shown in Fig. 3. The first stage contains nine 3-sorters. The second stage contains three 3-way mergers with a depth of 3. The last stage contains one 3-way merger with a depth of 5. The total depth is therefore 1 + 3 + 5 = 9.
IV. APPLICATION IN THRESHOLD LOGIC
In this section, we focus on the threshold logic implementation and analyze the complexity by the number of threshold gates. This is a very narrow application in the sense that sorters are implemented by threshold logic and the inputs are binary values.
A. Sorter in threshold logic
A threshold function [10] f with n inputs (n ≥ 1), x1, x2, · · · , xn, is a Boolean function whose output is determined by
$$f(x_1, x_2, \ldots, x_n) = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} w_i x_i \ge T, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$
where wi is called the weight of xi and T the threshold. In this paper we denote this threshold function as [x1, x2, · · · , xn; w1, w2, · · · , wn; T], and for simplicity sometimes denote it as f = [x; w; T], where x = (x1, x2, · · · , xn) and w = (w1, w2, · · · , wn). The physical entity realizing a threshold function is called a threshold gate, which can be realized with CMOS or nanotechnology. Fig. 4(a) shows the symbol of a threshold gate realizing (1). Larger binary sorters cannot be efficiently implemented in FPGAs. However, they can be easily implemented in threshold logic. In [11], a 2-by-2 comparator (2-sorter) was implemented with two threshold gates, as shown in Fig. 4(b). We introduce a threshold logic implementation of an n-sorter, as shown in Fig. 4(c), where n threshold gates are required. As shown in Fig. 4, the number of gates of an n-sorter scales linearly with the number of inputs n, while the latency stays constant. Hence, large sorters are preferred as basic blocks. However, we cannot use arbitrarily large sorters as basic blocks, since larger sorters are more difficult to implement due to practical concerns such as fan-in and cost. Hence, the benefit of using a larger block diminishes with increasing n. For this reason, some limit on the size of basic sorters is assumed.
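To illustrate how a binary n-sorter can be built from n threshold gates in a single layer, the sketch below uses one common construction: every gate has unit weights, and the gate producing output y_k uses threshold n − k + 1, so that y_k = 1 exactly when at least n − k + 1 inputs are 1. The unit weights, the thresholds, and the convention that a gate fires when the weighted sum is at least T are assumptions made for this illustration; the exact wiring of Fig. 4(c) is not reproduced here.

```python
def threshold_gate(x, w, T):
    """Threshold function [x; w; T]: 1 if the weighted sum reaches T, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= T else 0


def binary_n_sorter(x):
    """Sort n binary inputs with n threshold gates in one layer.

    Assumed construction: unit weights, threshold n - k + 1 for output y_k,
    which gives y_1 <= y_2 <= ... <= y_n.
    """
    n = len(x)
    ones = [1] * n
    return [threshold_gate(x, ones, n - k + 1) for k in range(1, n + 1)]


print(binary_n_sorter([1, 0, 1]))
print(binary_n_sorter([1, 0]))
```

For n = 2 the two gates reduce to AND (threshold 2) for the minimum and OR (threshold 1) for the maximum, which matches the usual two-gate binary comparator.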
B. Latency analysis
First, we focus on the latency for sorting N values. The latency is defined as the number of basic sorters on the longest path from the inputs to the sorted outputs. In Alg. 3, there are p iterations. In iteration i, there are n^i merging networks, each of which merges n sorted lists of n^{p−i} values. For iteration i, the latency is given by L_our(n, n^{i−1}) = 1 … ⌉. For a sorting network of N = n^p values via Alg. 3, by summing up the latencies of all levels, we obtain the total latency
The closed-form expression of latency for the SS-Mk given in [7] is
C. Analysis of number of gates
In the following, we assume all gates are the same and derive the total number of gates. The sorting network of N inputs is composed of multiple stages, each of which partially sorts the N values. Not all values in each stage participate in the comparison-and-switch operation. A simple way to count the gates is to insert buffer gates in each stage to store the values that are not involved in any sorting operation. Buffer insertion is also needed for the implementation of threshold logic in some nanotechnologies, where synchronization is required for correct operation. Hence, each stage contains N gates, and the total number of gates is obtained by multiplying the latency by N. Note that N does not have to be a power of n. Hence, the total numbers of gates of our Alg. 3 and the SS-Mk [7] are simply given by
and
If n is bounded, the total numbers of gates of our Alg. 3 and the SS-Mk in [7], given in Eqs. (4) and (5), respectively, are of order O(N log^2 N).
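The counting rule used in this subsection (N gates per stage once buffers are inserted) can be checked on a small example. The sketch below assumes a network is given as a list of stages, each a list of sorters on disjoint wires, that an n-sorter costs n gates as in Sec. IV-A, and that every wire not touched in a stage gets one buffer gate; the function names are hypothetical.

```python
def gates_per_stage(num_wires, stage):
    """Return (sorter_gates, buffer_gates) for one stage."""
    used = [w for wires in stage for w in wires]
    if len(used) != len(set(used)):
        raise ValueError("sorters within a stage must use disjoint wires")
    sorter_gates = len(used)             # assumed: n gates per n-sorter
    buffer_gates = num_wires - len(used) # assumed: one buffer per idle wire
    return sorter_gates, buffer_gates


def total_gates(num_wires, stages):
    """Total gate count over all stages, buffers included."""
    return sum(sum(gates_per_stage(num_wires, s)) for s in stages)


# 4-input network of 2-sorters arranged in 3 stages.
stages = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
print([gates_per_stage(4, s) for s in stages])
print(total_gates(4, stages))
```

With buffers included, every stage contributes exactly N gates regardless of how many wires are actually sorted, so the total equals N times the number of stages.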
D. Comparison of the number of gates
We compare the numbers of gates, including buffers, for N being a power of two. The numbers of gates are minimized by varying p according to Eqs. (4) and (5) for our algorithm and the SS-Mk [7]. The results are shown in Table I, where columns two to four show the numbers of gates for the SS-Mk, the numbers for our Alg. 3, and the reduction achieved by our Alg. 3, respectively, with n ≤ 20, and columns five to seven show those with n ≤ 10. For n ≤ 10 and n ≤ 20, our algorithm requires up to 25% and 39% fewer gates, respectively, than the SS-Mk in [7] for N = 2^i with i = 1, 5, · · · , 16. We observe that the number of gates needed with n ≤ 20 is less than or equal to that with n ≤ 10 for all N = 2^i with i = 1, 2, · · · , 16. The reduction percentage with n ≤ 20 is also greater than or equal to that with n ≤ 10 for all N = 2^i with i = 1, 2, · · · , 16 except N = 16. This means our sorting network takes better advantage of larger basic sorters.
V. CONCLUSION
In this work, we proposed a new merging algorithm based on n-sorters for parallel sorting networks, where n is prime. Based on the n-way merging, we also proposed a merge sorting algorithm. Our sorting algorithm is a direct generalization of the odd-even merge sort with n-sorters as basic blocks. We showed an application of our sorting algorithm with sorters implemented in linear threshold logic. Our algorithm has a smaller latency and fewer gates for wide ranges of N than other multiway sorting networks.
