ABSTRACT This paper presents a new parallel structured lookahead multidimensional sorting algorithm. Our algorithm can be based on any sequential sorting algorithm. The amount of parallelism can be controlled using several parameters such as the number of threads, word size, memory/processor communication overhead, and the dimension of the algorithm. The proposed technique is ideally suited for general purpose graphic processing units and shared-memory massively parallel processor systems. It ensures that data being processed exhibits temporal and spatial locality to maximize the utilization of processor cache. The algorithm achieves a speedup even when a single processor is used. A lookahead algorithm is also proposed to achieve even higher speedup. The performance of the proposed algorithm is verified numerically and experimentally.
I. INTRODUCTION
Sorting is a fundamental and common operation in computing. It is used in several applications such as digital signal processing and communications [1] - [4] , database management systems [5] , optimization, searching, data compression, data clustering, spreadsheets, etc. [6] , [7] . Computer system programs, such as editors and compilers also make use of sorting to speed their operation [8] .
Many researchers proposed hardware structures specifically designed to speed up data sorting [6] , [8] - [15] .
The prevalence of multicore computers and availability of high performance computers built around Graphics Processing Units (GPUs) [16] prompted many researchers to explore implementing parallel sorting algorithms on parallel computers in general and GPUs in particular [17] - [20] . A GPU consists of several Streaming Multiprocessors (SMs). Each multiprocessor contains a number of CUDA cores. A GPU comes with a global memory, which can be used for storing the data to be processed. Each multiprocessor has a much faster memory, called shared memory. Shared memory is about 100× faster than the global memory.
The associate editor coordinating the review of this manuscript and approving it for publication was Muhamamd Aleem.
Our approach in this paper, which can be based on any sorting algorithm, achieves a speedup given by (25) . These speedup times are much better than the speedup estimates previously reported while taking into account the communication overhead between the processors and memory. In that sense, our algorithm can be applied equally well to internal as well as external parallel sorting. Furthermore, the upper limit on the number of stream processors is less than the number of items to be sorted. This upper bound clearly indicates that our proposed algorithm scales well with future parallel computing platforms with arbitrary communication overhead figures.
This paper is organized as follows. Section II looks at related work. Section III gives a detailed description of the parallel processor model employed. Section IV introduces the structured algorithm and how to synchronize the threads or processors implementing the algorithm. Section V provides theories and lemmas to prove the correctness of the algorithm and aid in developing the performance formulas. Section VI provides the formulas for worst, average, and ideal delays for the sorting problem using our proposed algorithm. The section also provides speedup estimates for best, average, and ideal cases. Section VII provides the results for simulating the performance of the one-dimensional algorithm and how it depends on the algorithm parameters. Section VIII discusses the structured multidimensional parallel sorting algorithm. Section IX derives the performance models as well as numerical simulation results for the multidimensional algorithm. Section X provides the results for simulating the performance of the α-dimensional algorithm and how it depends on the algorithm parameters. Section XI discusses whether the proposed algorithm shows any speedup when implemented on a single CPU machine without any parallelism or not. Speedup formulas are provided for this case. Section XII experimentally evaluates the speedup of the proposed algorithm when implemented on a single CPU machine and GPU platforms. Section XIII summarizes the features and advantages of the proposed algorithm and concludes the paper.
II. RELATED WORK
Several parallel sorting algorithms were presented to use parallel processors. Al Rjoub et al. proposed a deterministic parallel sorting algorithm to improve the performance of quick sort [21] . The data to be sorted was assumed to be stored in a matrix of three rows and c columns. The total number of elements to sort is therefore n = 3c. Each row is independently sorted by a processor using quick sort. After the sort operation is completed, the elements in the first and last location in the matrix are the global maximum and minimum of the list to be sorted. After extracting these two elements, the remaining elements are shifted and this operation is repeated n/3 times. A restriction was placed that the number of columns c must be an odd number. The proposed algorithm applied only to three parallel processors to sort a matrix, each row independently. The sort operation was estimated to complete in time order O (n/3 log n/3). Numerical simulations for the proposed algorithm were presented using Microsoft Excel 2007 in which the runtime of the algorithm and the sequential quick sort algorithm were estimated. In a later article, Al Rjoub et al. modified the algorithm to accept matrix rows r > 3 [22] . This allowed the algorithm to have a time complexity given by O (n/r log n/r).
Marcelino et al. compared FPGA-based hardware units implementing three sorting algorithms: sorting network, insertion sorting, and FIFO-based merge sort [23] . Their system achieved 20× speedup over a software implementation of quick sort algorithm. Olariu et al. discussed how to sort data using a fixed sorting network with a fixed I/O size [24] , [25] .
Yildiz et al. implemented two parallel algorithms: bitonic and radix sort algorithms using many core GPUs. Results showed a significant improvement using GPUs and as a comparison, the bitonic sort algorithm was found to have a higher performance than the parallel radix sort algorithm [26] .
He et al. proposed two novel optimal sorting algorithms based on a model that uses linear array with The Linear Array with Reconfigurable Pipelined Bus System (LARPBS) [14] . The first one is an extension of Leighton's seven-phase column sort algorithm and the second is an optimal multiway merge sort algorithm. Implantation and results showed that proposed methods are optimal sorting algorithms for the 2-D ARPBS model. Sujatha et al. made a comparison between single core and multicore for several sorting algorithms [27] . This research shows that a multicore CPU is more power-efficient than a single core and programs developed based on multicore have a better performance and run faster than on single core.
Durad et al. analyzed most of the parallel sorting algorithms and implemented them using MPI by two different architectures [28] . This research illustrated experimentally the relation between number of processors, data size, and execution time. The effect of network communication and saturation of local bandwidth are studied on radix sort and odd-even sort algorithms.
Amato et al. presented a study of the interaction between the algorithm and the architecture over several parallel sorting algorithms: bitonic sort, sample sort, and parallel radix sort and showed that the choice of an algorithm depends on the number of elements to be sorted and the performance of the algorithms differed on various machines [29] .
Taniar and Rahayu studied parallel database sorting on multiprocessors [30] . Five algorithms with different organization and sequencing of merge sort were studied: parallel merge-all sort, parallel binary-merge sort, parallel redistribution binary-merge sort, parallel redistribution merge-all sort, and parallel partitioned sort. Greß and Zachmann implemented a parallel sorting algorithm on a stream processing architectures [31] . The proposed sort algorithm was based on adaptive bitonic sorting. The optimal time complexity reported for sorting N values on n stream processors was O ((N log N )/n). This result tends to agree with Amdahl's law for a fully parallelizable algorithm. Authors reported that previous attempts at parallelizing sorting algorithms on stream processors only achieved O (N log 2 N )/n times. The proposed algorithm was applied for a maximum number of processors n = N / log N .
Peters et al. implemented a parallel in-place sorting algorithm on a GPU using bitonic sort [32] . Their approach was to divide the lists of bitonic sort among threads such that communication with the slow global-memory is reduced. This was accomplished by partitioning the elements in the list such that several compare/exchange operations dealing with the same list elements are assigned to one thread. These elements would then be stored in the thread registers and exchanged with other threads using the shared memory between threads. Access to the much slower global memory is done exactly once for each element of the array during the execution of the bitonic sort.
Satish et al. proposed a parallel radix sort algorithm for manycore GPUs [19] . They reported four times performance speedup compared to graphics-based GPUSort. Their strategy was to use fine-grain parallelism and find independent tasks with minimal global memory communication [33] .
Memory access and cross actions between processors represent barriers in parallel sort algorithms. Marszalek et al. increased efficiency of sorting algorithms by distributing tasks flexibly between logical cores [34] . Proposed method preserves separation of concerns; therefore, each of the processors works separately without any cross actions and interruptions. The idea was theoretical analyzed, experimentally tested, and compared to other similar methods. Obtained results confirmed high efficiency and showed that, with each newly added processor, sorting becomes faster and more efficient. Furthermore, Marszalek proposed a new method for rapid parallel sorting of large data based on a Parallel Random Access Machine (PRAM) to achieve high efficient reads and writes in the cell memory for each single processor [35] . To solve the memory access barrier, Marszalek proposed another technique that utilizes the principle that all processors can read memory cells but only one processor can write to the same memory cell [36] . This makes all processors work totally independent and achieve fast performance.
We propose in this paper a structured algorithm for parallelizing sorting algorithms. An algorithm to implement the structured algorithm is also proposed. The algorithm sorts data from both ends of the list, i.e., from the high end, where high values are located, and the low end, where low values are located. At each iteration, a block of sorted numbers is obtained to speedup convergence. Performance and speedup estimates of the algorithm are derived.
III. MODEL DESCRIPTION
The data to be sorted is assumed to be stored in a list L. To start our algorithm, L is divided into smaller sublists indexed as L i for one-dimensional subdivision, L i,j for 2-D division, L i,j,k for 3-D division, etc. When sublists are sorted, a block of elements from each sorted list is used for further processing. The size of each block is b with b ≥ 1. The proposed algorithm uses three more listsM ,M and S to process the sorted elements extracted from the sublists. ListM stores the maximum elements obtained from the sublists. ListM stores the minimum elements obtained from the sublists. List S stores the maximum and minimum elements obtained from M andM . We assume that we have n+3 threads or processors so that each list has its own thread or processor.
We can index our threads or processors into one-, two-or many-dimensional indexing schemes, as shown in Figure 1 . There are no restrictions on the number of processors in each dimension. This is in contrast to earlier parallelization attempts such as those reported in [21] , [22] , where only a 2-D algorithm was considered and one of the dimensions had to have an odd number of processors for the algorithm to work. The notations we use in this paper are listed below. b Data block size; i.e., number of elements in a block M List that stores minima of all sublists
Number of parallel processors denoted by P i with 0 ≤ i < n P i i-th processor in the parallel processor sys- Figure 1(a) , each thread or processor will be indexed as P i with 0 ≤ i < n. For the 2-D case in Figure 1(b) , each thread or processor will be indexed as P i,j with 0 ≤ i < n 1 and 0 ≤ j < n 2 , where n 1 × n 2 = n. For the 3-D case in Figure 1(c) , each thread or processor will be indexed as P i,j,k with 0 ≤ i < n 1 , 0 ≤ j < n 2 and 0 ≤ k < n 3 , where n 1 × n 2 × n 3 = n. Extensions to higher dimensions are straightforward and will be explored in the sequel.
We assume, without loss of generality, that our list of items to be sorted L is defined using a one-dimensional index set i = {0, · · · , N − 1} and the order relation ≤ is defined among the elements. The list is ordered when the following inequality is valid
The sorted array is assumed to be stored in another list S, where element s 0 represents the smallest element of the list and element s N −1 represents the largest element of the list. Figure 2 shows the memory architecture model for the parallel processor system for the one-dimensional case as an example.
The parallel computer system under consideration is assumed to contain n + 3 processors/threads. The first n processors, P i (0 ≤ i < n), deal with the sublists L i . Three more processors/threads PM , PM , P S are associated with lists M ,M , and S, respectively. The roles of these lists will be explained shortly.
Each processor is assumed to have its own local cache. This could be done using one level of cache or a cache hierarchy or even a combination of cache and local memory. The figure shows also shared or distributed memory. Communication between the processors could then be established through shared variables for the former or message passing for the latter option. In the sequel, we shall refer to the shared/distributed memory as ''memory'' for brevity. The memory will store the sorted list S or size N and will also store the intermediate listsM andM of sizes bn, whose function will be explained below. Data is aggregated and exchanged in blocks or chunks of size b. The block size controls the number of data items exchanged between the threads/processors. The intermediate listM stores the maximum blocks supplied by the processors from sublists L i . Likewise, intermediate list M stores the mimimum blocks supplied by the processors from sublists L i . ListsM andM are constructed by appending the maximum blocks and minimum blocks from all sorted sublists L i , respectively, aŝ
whereb i is maximum block obtained from thread/processor P i andb i is minimum block obtained from thread/processor P i . In the sequel, we use P i to indicate either the i-th processor in the multiprocessor system or the i-th thread in a multithreaded system. We denote by τ p the processing time, which might be the time needed by the arithmetic logic unit (ALU) to do a compare operation between two elements. τ p is also assumed to be approximately equal to the cache access time.
The time required to access the shared memory is denoted by τ m . The memory mismatch ratio [37] is defined as
Essentially, R is the ratio of global and local memory access latencies. Values of R vary from 2 for on-chip memory to as much as 10 6 for a cluster of computers with limited communication bandwidth, where memory access can take values in the millisecond range. Memory is distributed among physically remote machines. For example, access time of a DRAM memory varies from 5 to 100 ns, whereas instruction processing delay is a measure in 1 to 2 ns [38] , [39] . In the sequel, we shall assume R ≥ 1.
Because of processor/memory speed mismatch, it is advantageous to move a large chunk of data each time a message or a memory access is attempted. We access a block of data of size b each time a message is passed between threads/processors or each time the shared memory is accessed. The parameter b is thereby our data aggregation parameter.
IV. STRUCTURED ONE-DIMENSIONAL ALGORITHM FOR SORTING PROBLEM PARALLELIZATION
This section describes a one-dimensional algorithm for structured parallel sorting. The one-dimensional algorithm relies on the following major steps. 1) L is divided into n sublists. Each sublist has N /n elements. When n is not a factor of N , some elements will be unallocated. These extra elements are added to some of the sublists at random. Each element in sublist L i is identified with a label indicating which sublist it belongs to. 2) Sublist L i is associated with thread/processor P i , which performs a single sort operation on L i using any suitable sorting algorithm. This operation will not be repeated again for the duration of the sort problem. 3) Each thread/processor P i sends two blocks of size b each comprising the largest b elements and smallest b elements in L i to a pair of listsMM , respectively, such thatb
where max(L i , b) finds the largest b elements (i.e., a block) from list L i and min(L i , b) finds the smallest b elements (i.e., a block) from list L i 4) ProcessorsP M andP M associated with the semi-sorted listsM andM , respectively, sort the bn elements using any suitable sorting algorithm. 5) Blockb inM contains the global largest elements of L. That statement will be proved in Theorems 1 and 2 below. This block is inserted in S, which is sorted by P S , such that elements ofb are inserted at their proper positions. Similar action is performed byM on its smallest blockb. 6) ListsM andM remove the two blocksb andb and send these two blocks to all the sublists L i . 7) Upon reception of the blocks to be removed, each sublist removes the corresponding elements matching those inb andb.
VOLUME 7, 2019
8) If the removed blockb is the last block in the sublist L i , then this block should also be removed from listM . Similar action is performed byM on blockb. This is because the last item in sublist L i is considered the largest b elements and at the same time the smallest b elements, and therefore is added to both listsM andM . 9) Each sublist L i , which has an element removedl i produces the next largest element and sends it tom. Similarly, each list L i which had an element removed l i produces the next smallest element and sends it tom. 10) Steps 5 to 9 are repeated until all the lists L i (0 ≤ i < n) are empty.
We should point out that sublists L i are sorted once only at the start of the algorithm. All subsequent sort operations are performed on the semi-sorted listsM andM .
The stages performed by the proposed structured algorithm result in a serial-parallel algorithm, as illustrated in Figure 3 . Stage 1 is where all n threads/processors perform individual sorting operations on sublists L i in parallel. This operation is performed once only and will not be repeated for the duration of the sort problem.
Stage 2 follows Stage 1, where listsM andM collect the b maximum and b minimum elements from each thread/processor, respectively, and are sorted to determine the b global maximum and b minimum elements at each iteration. The sorting operation performed onM andM could take advantage of the fact that the data supplied by sublists L i are partially sorted and also data inM andM are fully sorted.
Stage 3 moves the maximum b elements fromM and minimum b elements fromM to their proper positions in list S and removes these 2b elements from the sublists that produced them and from listsM andM also.
A. THREAD AND PROCESSOR SYNCHRONIZATION
The structured algorithm we propose and its associated one-dimensional algorithm require simple thread/processor synchronization constructs that are readily available in operating system libraries and in GPUs. The operations in Stage 1 must all be completed before the operations in Stage 2 could start. Likewise, the operations in Stage 2 must all be complete before the operations in Stage 3 could start. This can be ensured using the barrier constructs in Table 1 . Figure 4 (a) shows that each thread/processor sorts its sublist and produces the local maximum and minimumb i andb i , respectively (shown by the shaded boxes). Figure 4(b) shows the allocation ofb i andb i to listsM andM , respectively (the two lists have also been sorted). The boxes with thick lines indicate the global maximum and minimum elements of list L. The next step of our algorithm is to placem in S at location N − 1 and remove these elements from sublists L i . Similarly, we placem in S at location 0 and remove these elements from sublists L i . 
Line 9: SortM andM in parallel to find blocksm andm. 
Algorithm 1 Pseudocode for the Structured Algorithm to Parallelize Sorting Algorithms
sort(L i ); 5: end parallel for 6: parallel for j = 0 : n − 1 parallel do 16 : Delete 2b maximum and minimum elements from sublists L j knowing the IDs of the processors that generated them. The sublists will remain sorted after these delete operations and no further sorting is required. Lines 18 & 19: For every element in the deleted blockm, an element is extracted from corresponding list L j . Same is done withm. These elements are placed at their proper locations inM andM using any search-and-compare algorithm. As a result, listsM andM will remain sorted after this update operation.
V. ANALYSIS OF THE PROPOSED ALGORITHM
The following theorem assures us that we are able to find the maximum and minimum elements of list L even after we break up the list into n sublists.
At the same iteration, elements of block s i are among the smallest b elements of sorted sublists L i . Proof: The largest b elements to be added to S are obtained from the largest elementsb i in sorted sublists L i . These elements will be picked up and moved to listM . A similar argument applies to the minimum elements.
The following theorem assures us that at iteration i, the elements in the block at position N − b(i + 1) in S are extracted from blocksb i of the sorted sublists L j . Also, the element at position bi in S is extracted from blocksb i of the sorted sublists L j . At iteration 0 also, and according to Theorem 1, we know that the smallest b elements of L are distributed among thȇ b blocks of sorted sublists L i . These nb elements are collected and sorted in listM . Thus, elements ofm are also the smallest b elements of L. This bottom block will be moved to block s 0 .
At the end of iteration 0, and according to our algorithm, the elements corresponding to s N −1 and s 0 will be removed fromM ,M , and sublists L i . What is left in sublists L i are all the remaining elements of L, which include elements that should occupy blocks s N −2 and s 1 . The former is the next block that is directly smaller than s N /b−1 and the latter is the next block that is directly larger than s 0 . Assuming L is the truncated list with elements from block s N −1 and s 0 removed.
At start of iteration 1, we will have truncated lists L i with elements from blocks s N /b−1 and s 0 removed. The operations in iteration 1 will allow us to extract the largest elements in the truncated sublists and we will allocate them to block s N /b−2 . Similarly, we will be able to extract the smallest elements in the truncated sublists and allocate them to block s 1 . This argument could be extended to all subsequent iterations until sublists L i are exhausted.
A. USING LOOKAHEAD SCANNING OF SUBLISTS L i
The algorithm presented thus far picked the top and bottom blocks for each sublist L i and loaded them in listsM andM , VOLUME 7, 2019 FIGURE 6. Flowchart for the structured algorithm to parallelize sorting algorithms.
respectively. Consequently, 2b elements were moved from sorted listsM andM into the final sorted list S.
Lookahead algorithm attempts to increase the number of elements appended to S at each iteration. This is accomplished by collecting element β i from sublist L i located at position N /n − b − 1 and element γ i from sublist L i located at position b. These two elements will prove very useful in speeding up the sorting algorithm presented here, as will be shown.
The following two theorems prove useful in speeding up the convergence even faster by reducing the number of needed iterations to sort the list. They rely on the use of element lookahead to examine the elements at locations N /n −b−1 in all the sublists. The first theorem below provides bounds on the number of largest sorted elements that can be appended to S and deleted from L at each iteration.
Theorem 3: At each iteration, at least b largest elements are available and at most bn largest elements are available to be appended to sorted list S.
We denote β as the largest element among all β i at location N /n − b − 1. β will help us maximize number of elements placed in sorted list S. Proof: At a given iteration, Theorem 1 indicates that each thread/processor P i produces b largest elements in sublist L i . The lookahead algorithm now performs the following two steps.
1) Define β such that
2) Instead of movingm to S, select all elements ofM that are larger than β and place those elements in S. There are three possible cases for the distribution of elements among the sublists. Sublist L 0 , say, contains all the largest bn elements of list L. Also, elements β 0 is the largest β elements (i.e., β = β 0 = max(β i )). Our algorithm picks up the top b elements from each sublist. In that caseM will contain only the b largest elements of L from sublist L 0 . All other sublists will produce blocks, which do not contain any useful elements. These b elements can be loaded to list S in the proper location. Figure 7 shows this case. The algorithm picks the largest n blocks from the sorted sublists and places them inM . Upon sortingM , we find that element β is larger than the largest of any element from sublists L i with i > 1. This element should be placed at location bn−b − 1 in the list of elements transferred toM . This indicates that we should only pick b elements to place in S.
Case 2:
The top bn elements in list L are uniformly distributed among the n sublists such thatb i of each sublist contains exactly b of the top bn elements of L. Also, elements β 0 is the largest β elements (i.e., β = β 0 = max(β i )). The collected blocks will be sorted inM to produce the top bn largest elements of L. Figure 8 shows this case. Figure 8 (a) shows that the largest bn elements are uniformly distributed over all sublists L i . Black boxes show β i elements at location N /n − b − 1 in all sublists. Figure 8(b) shows the location of the largest element β = β 0 relative to elements transferred toM .
The algorithm picks the largest n blocks from the sorted sublists and places them inM . Upon sortingM , we find that element β is smaller than the smallest elements transferred from all sublists L i . This element should be placed at location −1 in the list of elements transferred toM . This indicates that we should pick bn elements to place in S. So, we effectively sorted and moved all the top n blocks from the sublists into S.
Case 3:
The top bn elements are distributed non-uniformly among the sublists. In that case, the sublists will contain a fraction of the top bn elements of L. Some sublists might contain a number of the largest elements and some might not contain any of the largest elements at all. This is shown in Figure 9 . Figure 9 (a) shows that the largest bn elements are nonuniformly distributed over the sublists L i . Black boxes show β i elements at location N /n − b − 1 in all sublists. Figure 8(b) shows the location of the largest element β = β 0 relative to elements transferred toM .
The algorithm picks the largest n blocks from the sorted sublists and places them inM . Upon sortingM , we find that element β lies in the middle of the elements form the sublists L i . This element should be placed at a location somewhere between −1 and bn−b−1 inM . This indicates that we could pick any number of elements between b and bn elements to place in S.
So, at each iteration, we can extract at least the largest b elements or at most the largest bn elements as stated by the theorem.
The following theorem provides bounds on the number of least sorted elements that can be appended to S and deleted from L at each iteration.
Theorem 4: At each iteration, at least b smallest elements are available and at most bn smallest elements are available to be appended to sorted list S. Proof: The proof is identical to that provided in Theorem 3, but we need to determine the smallest element γ from among elements at location b in all sublists. The following theorem provides the basic idea to determine exactly how many sorted elements to be extracted fromM to place in S and deleted from L.
Theorem 5: The number of largest sorted elements from M that can be appended to S can be determined by examining elements β i in each sublist L i . Also, the number of smallest sorted elements fromM that can be appended to S can be determined by examining elements γ i at location b in each sublist L i . Proof: The proof was provided in Theorems 3 and 4.
The following theorem is related to estimating the communication overhead by providing bounds on the minimum and maximum number of messages or memory read/write accesses per iteration.
Theorem 6: The number of memory read/writes or messages exchanged between threads or processors (x) per iteration lies in the range 1 ≤ x ≤ n.
Proof: According to Theorems 3 and 4, the lower bound on the number of elements added to S occurs when only one sublist L i is used to extract the largest and smallest elements. Also, the upper bound on the number of elements added to S occurs when n sublists L i are used to extract the largest and smallest elements.
Therefore, the limits on the number of shared memory accesses or messages is 1 ≤ × ≤ n.
VI. SPEEDUP PERFORMANCE ESTIMATES FOR THE 1-D ALGORITHM WITH LOOKAHEAD
We obtain in this section estimates for the speedup of the 1-D parallelization algorithm in Algorithm 1.
The sorting delay is a random variable with values ranging between N and N 2 when the algorithm is executed sequentially on a single CPU machine [6] . The average value for the random sorting delay depends on two factors: the sorting algorithm used and the length of the list N . We express this average delay value as
where T sort (N , 1) denotes sorting a list of length N using one single processor and f (.) is a monotonically-increasing function that depends on the sorting algorithm and the value of N . The literature abounds with innovations to improve the performance of serial sorting algorithms. Throughout this paper, we assume the best sequential sorting algorithm has the optimistic sorting delay given by
Furthermore, we also assume that communication overhead for the serial algorithm is small, i.e., R = 1. Again, this is a very optimistic estimate that will only serve to reduce our theoretical estimates of our proposed algorithm when compared to the serial algorithms. It is well known that quick sort is one of the best sorting algorithms and gives a time delay of N log N . Hence, our estimate above is very generous.
To estimate the average time delay for our proposed algorithm, we need to know the average number of iterations required to complete the proposed parallel sort algorithm. According to Theorems 3 and 4, elements β and γ at locations N /n −b−1 and b, respectively, in sublist L i determine how many elements are added to S and removed from sublists L i . The average number of elements moved varies between b and bn. Assuming uniform distribution of the first bn elements among the sublists, element β could exist at any location in the range b + 1 to bn + 1 with probability
This probability is the same when considering the smallest elements in L. The average number of largest and smallest elements removed from L for 1-D algorithm is given by
This is an impressive number of sorted elements at each iteration. The average number of iterations required to completely sort L using our algorithm would be
For the 1-D structured algorithm, the average time required to sort list L is estimated as (17) 75454 VOLUME 7, 2019 where t 1 is time to sort each sublists L i . t 2 is time to sort listsM orM . During t 2 lists for elements β and γ will be sorted also. t 3 is the time to send n messages from the sublists to listsM andM . t 4 is time to update the lists assuming y avg iterations and n/2 elements being moved during each iteration according to Theorem 6. Substituting (8) in (13) and assuming n 1 gives as
For a fixed value of N , low values of parallelism (i.e., small n) result in large delay due to the first term in the above equation.
Very high values of parallelism (i.e., large n) result also in large delay due to the middle term in the above equation.
The delay is a function of the two variables b and n and its minimum is obtained by finding the derivatives
These partial derivatives yield critical values b c and n c given by
From the above equations we get the optimum values of
Notice that the critical values of b and n do not increase linearly with increasing N , rather they increase more slowly. Notice also that critical b value increases with increasing communication overhead R and critical n value decreases with increasing R. The minimum value of the sorting delay is given by
The above equation indicates that optimum sorting time using our proposed algorithm increases as N and R increase, which is intuitive. We conclude from (22) that our proposed algorithm O(N 2/3 ) time complexity. The area × time (AT ) complexity for our proposed algorithm could be expressed as
We should point out that previously reported results for using sorting networks have optimal AT complexity of O(N 2 log N ) [11] . We define our speedup according to the following general equation
If we use the critical values of b and n, then speedup would be defined by
We see that our proposed algorithm achieves higher speedup ratios for larger lists and speedup slows down when communication overhead increases, as expected.
VII. NUMERICAL RESULTS FOR THE 1-D ALGORITHM
Speedup S depends on the three parameters b, n, and R. We explore in this section the effect of these three parameters on speedup. Figure 10 shows the speedup variation with b and n when R = 10, which is the case of multicore systems. We note that the speedup surface shows a peak with the critical values of b and n roughly equal to the values in (21) . The peak of optimum speedup is relatively sharp and the two parameters b and n must be carefully chosen. Figure 11 shows the speedup variation with b and n when R = 10 5 , which is the case of cluster or grid computing systems. We note that the peak of the speedup has now shifted toward higher values of b and lower values of n in agreement with (25) . When communication overhead is very high, optimal speedup is to be sought using lesser numbers of processors to reduce number of communication packets and at the same time increasing the block size of data to be exchanged between processors and memory in an effort to increase the workload per processor. As expected, speedup is now reduced since communication overhead dominates the parallel algorithm delay, which is absent in a single processor with its own memory.
The most important observation we make from Figure 10 and Figure 11 is that increasing the number of threads will not lead to higher speedup. Also, we need to consider communication overhead R when determining the optimum level of parallelism n and data aggregation b.
VIII. STRUCTURED MULTIDIMENSIONAL ALGORITHM FOR SORTING PROBLEM PARALLELIZATION
In this section, we generalize the algorithm we proposed in Section IV to multidimensional parallelization. We consider, VOLUME 7, 2019 without loss of generality, the situation of a three-dimensional case.
Assume that the threads or processors are now identified as P i,j,k and the dimensions of the cube are defined as follows.
The 3-D algorithm we propose here relies on the following major steps. 1) L is divided into n = n 1 × n 2 × n 3 sublists of N /n elements each and allocate each sublist to a processor: sublist L i,j,k belongs to processor/thread P i,j,k . When n is not a factor of N , some elements will be unallocated. These extra elements are added to some of the sublists at random to balance the processing load. 2) Sublist L i,j,k is sorted by thread/processor P i,j,k using any suitable sorting algorithm. This operation will not be repeated again for the duration of the sort problem. 3) All processors lying along a line defined by the two equations
send two blocks of size b each, comprising the largest b elements and smallestb elements in L i,c 2 ,c 3 to a pair of listsm c 2 ,c 3m c 2 ,c 3 , respectively, such that
The pair of listsm c 1 ,c 2 andm c 1 ,c 2 lie on a 2-D plane defined by the indices j and k. We have a total of 2n 2 × n 3 such lists and each list has bn 1 elements. 4) ProcessorsP j,k andP j,k , associated withm j,k andm j,k , respectively, now use any suitable sorting algorithm to sort the partially sorted lists having bn 1 elements each. This operation will not be repeated again for the duration of the sort problem. 5) All processorsP j,k andP j,k , lying along a line defined by
send their largestb j,k elements and smallestb j,k elements, respectively, to the pair of listsm k andm k . Listŝ m k andm k lie on a 1-D plane defined by the index k.
We have a total 2n 3 such lists and each list has bn 2 elements. 6) ProcessorsP k andP k , associated with sublistsm k and m k , respectively, now use any suitable sorting algorithm to sort the partially sorted lists having bn 2 elements each. This operation will not be repeated again for the duration of the sort problem. 7) The largestb k and smallestb k blocks inm k andm k , respectively, are inserted in listsM andM , respectively. 8) ProcessorsP M andP M , associated with the semi-sorted listsM andM , respectively, sort the bn 3 elements using any suitable sorting algorithm and insertb andb in list S at their proper position, which are determined by processor P S associated with list S. This operation will not be repeated again for the duration of the sort problem. 9) ListsM andM remove the two blocksb andb and send these two blocks to all sublists.m k /m k ,m j,k /m j,k and L i,j,k . 10) Upon reception of the removed blockb, each sublist L i,j,k removes elements corresponding to elements inb and sends an equal number of the next largest elements tom j,k . Similarly, upon reception of removed block b, each list L i,j,k removes elements corresponding to elements inb and sends an equal number of the next smallest elements tom j,k . 11) Upon reception of the removed blocksb and the corresponding replaced elements from sublists L i,j,k , each sublistm j,k removes elements corresponding to received elements and sends an equal number of next largest elements tom k . Similarly, upon reception of the removed blocksb and the corresponding replaced elements from sublists L i,j,k , each sublistm j,k removes elements corresponding to received elements and sends an equal number of next smallest elements tom k . 12) Upon reception of the removed blocksb and the corresponding replaced elements from listsm j,k , each sublist m k removes elements corresponding to received elements and sends an equal number of next largest elements toM . Similarly, upon reception of the removed blocksb and the corresponding replaced elements from listsm j,k , each sublistm j,k removes elements corresponding to received elements and sends an equal number of next smallest elements tom k . 13) Upon reception of replacement elements from listsm k , listM produces the next largest block and sends it to S. Similarly, upon reception of replacement elements from listsm k , listM produces the next smallest block and sends it to S. 14) Steps 3 to 13 are repeated until all the lists L i,j,k (0 ≤ i < n 1 , 0 ≤ n 2 < n 2 and 0 ≤ k < n 3 ) are empty. The stages performed by the proposed structured algorithm result in a serial-parallel algorithm as illustrated in Figure 12 and explained as follows. Stage 1 is where all n threads perform individual sorting operations on sublists L i,j,k in parallel. Each thread sorts a sublist of length N /n = N /(n 1 × n 2 × n 3 ).
Stage 2 is where listsm j,k andm j,k collect the b maximum and b minimum elements from sublists L i,j,k , where 0 ≤ i < n 1 . Thread associated with m j,k sorts the data to determine the b maximum and b minimum elements.
Stage 3 is where listsm k andm k collect the largest and smallest b elements from listsm j,k andm j,k , where 0 ≤ j < n 2 . Thread associated withm k andm k sorts bn 2 elements.
Stage 4 is where listsM andM collect the b maximum and b minimum elements from listsm k andm k , where 0 ≤ k < n 3 . Threads associated withM andM determine the b maximum and b minimum elements.
Stage 5 is where the largest and smallest b elements are removed fromM andM to their proper positions in list S and removes these 2b elements from the sublists L i,j,k ,m j,k ,m j,k andm k andm k that produced them.
IX. SPEEDUP PERFORMANCE ESTIMATES FOR THE 3-D ALGORITHM WITH LOOKAHEAD
Assume for simplicity that the sublists are distributed along the three dimensions such that n 1 = n 2 = n 3 = n 1/3 . With reference to Figure 12 , we can estimate the average time delay for our proposed algorithm. For that, we need to know the average number of iterations and the average number of elements added to S at each iteration. Theorems 3 and 4 still apply to our 3-D algorithm. Elements β and γ in sublist L i,j,k determine how many elements are added to S and removed from sublists L i,j,k .
The average number of largest and smallest elements removed from L for the 3-D algorithm is given by
This is an impressive number of sorted elements at each iteration, which is proportional to b and n. The average number of iterations required to completely sort L using our algorithm would be
For the 3-D structured algorithm, the average time required to sort list L is estimated as
(39)
where t 1 , t 2 , t 3 , and t 4 have the same definition as in (13) . We can generalize the above equation to any dimension α as
The above equation reduces to (18) for the special case when α = 1.
The maximum value of α is governed by the number of processors or threads available n. α max can be estimated by the expression
which implies that each of the lists contains only two elements. We therefore have
X. NUMERICAL RESULTS FOR THE α-D ALGORITHM
We now explore the effect of the parameters α, b, n, and R on speedup S for α = 2 and 3 and R = 10 and 10 5 .
The limits imposed on variables b and n were 1 ≤ b, n ≤ 2 32 . The upper limit could be achieved for b when message exchange is in the form of packets or streams of packets per message and the upper limit for n could be achieved using the up-to-date GPUs, such as NVIDIA's GPU's [16] . In fact, the maximum number of threads for such a GPU is 2 41 , if we consider the maximum number of threads per block and the maximum number of thread blocks [32] . Figure 13 shows the speedup variation with b and n when R = 10, which is the case of multicore systems. We note that the speedup surface shows a broad peak. Comparing Figure 13 with Figure 10 , we note that the maximum speedup has increased by a factor of approximately 20×. Also, the higher peak speedup now occurs using slightly smaller block size b and about double the processors n. Figure 14 shows the speedup variation with b and n when R = 10 5 , which is the case of cluster or grid computing systems. Comparing Figure 14 with Figure 14 , we note that increasing R shifts the peak of the speedup toward higher values of b and lower values of n. Comparing Figure 14 with Figure 11 , we note that the maximum speedup has increased by a factor of 3 only. The speedup peak occurs using smaller block sizes and about double the number of processors. This shows how detrimental communication overhead is to parallel processor systems.
We now explore the effect of the three parameters b, n, and R for α = 3 on speedup S for two values of R. Figure 15 shows the speedup variation with b and n when R = 10, which is the case of multicore systems. Comparing Figure 15 with Figure 13 , we note that the maximum speedup has increased by a factor of 10 and the peak is achieved using smaller block sizes and slightly larger number of processors. Figure 16 shows the speedup variation with b and n when R = 10 5 , which is the case of cluster or grid computing systems. Comparing Figure 14 with Figure 16 , we note that the maximum speedup has increased by a factor of 3 again. 
XI. SPEEDUP PERFORMANCE ESTIMATES FOR A SINGLE CPU MACHINE
We explore in this section whether the proposed algorithm shows any speedup when implemented on a single CPU machine without any parallelism or not.
We employ the following assumptions in order to perform a fair comparison. 1) Quick sort is used as the sort algorithm used to sort list L when we attempt to estimate the performance of a standard sequential algorithm τ seq .
2) Quick sort is also used to sort the sublists in our proposed algorithm when we attempt to estimate the performance of our algorithm τ tech . 3) The chosen pivots for quick sort consistently lie in the middle of the elements to be sorted in each iteration. Therefore, quick sort will require exactly log N iterations to complete. This assumption is optimistic but we will assume it holds for our comparisons. 4) Communication overhead is R = 1, which is assumed whenever the processor needs to access data. 5) Data needed is always available in the cache and cache update is done transparently so that data is always present in the cache when needed without any cache misses. The last assumption about data being present in the cache can always be modified by increasing the communication overhead R. But, we will see later that this factor is not important.
The delay associated with implementing quick sort on a single CPU is estimated as
The factor R refers to one cache read/write operation per element of the list. Denote τ tech as the total sorting delay of our proposed algorithm.
The speedup of our proposed algorithm when implemented on a single-processor system is given by
The most important thing we note here is that the communication overhead R is not present and will not affect the speedup performance. Figure 17 shows the speedup of our proposed algorithm when implemented on a single CPU. We have assumed α = 1 and N = 2 30 . We note from the figure that there is a maximum speedup that depends on our choice of b and n. The best speedup occurs when b = 1. Having the number of threads approximately equal to N is practical especially when the threads that completed their tasks are either reused or terminated and new threads are created in their stead. Figure 18 shows the speedup of our proposed algorithm when α = 2 and N = 2 30 . In that case, best speedup is achieved when n ≈ N and b could be increased to approximately 2 3 bytes without decreasing the speedup. Figure 19 shows the speedup of our proposed algorithm when α = 10 and N = 2 30 . In that case, best speedup is achieved when n ≈ N and b could be increased to approximately 2 10 bytes without decreasing the speedup. 
XII. EXPERIMENTAL RESULTS
We evaluate in this section the speedup of the proposed algorithm when implemented on a single CPU machine and GPU platforms.
In all cases, the keys to be sorted were randomly generated as 32 bit integers. Input sequences sizes vary from approximately 2 13 to 2 30 keys. All experiments were repeated 15 times in single-user mode and results reported are averaged over the 15 runs.
A. RESULTS ON A SINGLE CPU MACHINE
The one-dimensional algorithm is implemented using C language and run on an Intel Core i7 − 3770K CPU with a frequency of 3.5 GHz. The initial sublists were sorted using quick sort algorithm. Figure 20 shows the time taken to sort arrays of varying sizes using our proposed algorithm and other serial/parallel algorithms. Figure 21 shows the speedup of our proposed algorithm and other parallel algorithms as compared to the quick sort algorithm. Results show that the proposed algorithm achieves speedup of 5× compared to quick sort algorithm.
B. RESULTS ON GPU PLATFORMS
The one-dimensional algorithm is implemented using C language and CUDA for both CPU and GPU platforms. We compared the efficiency of the algorithm on the Intel Core i7 − 3770K CPU with a frequency of 3.5 GHz and on NVIDIA GPUs GeForce GT 720M , GeForce GTX 980, and GeForce GTX 1080. NVIDIAs GTX 1080 device contains 20 SMs, a total of 2560 CUDA cores, and 8 GBytes global memory and 48 KBytes shared memory per multiprocessor. Table 2 shows the specifications of the three used GPUs.
Our GPU implementation uses blocks to sort the initial sublists. Each block sorts one sublist using parallel odd-oven sorting algorithm. Each block can have up to 1024 threads. Each sublist contains 2048 keys. Thus, the CUDA kernel has N /2048 blocks, each has 1024 threads. Blocks copy their sublists to the shared memory before start sorting to the advantage of using low latency memory. All the frequently used data structures are also kept in the shared memory. Figure 22 shows the time taken by the proposed algorithm and the quick sort algorithm to sort arrays of varying sizes on different GPUs. Figure 23 shows the speedup of our proposed algorithm on different GPUs as compared to the quick sort algorithm. The GT 720M device shows the worst performance due to its limited capabilities. The GTX 1080 device shows a speedup reaching 3× compared to the quick sort algorithm implemented on the same device and a speedup reaching 17× compared to the quick sort algorithm implemented on an Intel Core i7 − 3770K CPU.
Results show that parallel quick sort algorithm is not performing very well. This was expected due to an inefficient choice of the pivot value. If the pivot value is not the median value, the load distribution among the different processors will be unbalanced. Parallel bitonic sorting is performing better, but the algorithm performance is significantly affected by the communication overhead. Our proposed algorithm transfers less data, which leads to less communication overhead.
XIII. SUMMARY AND CONCLUSIONS
We provided in this paper a structured algorithm and the associated algorithm for parallelizing any sorting problem using n processors. The proposed algorithm is simple and applies to any sorting algorithm and to any parallel processor platform. The algorithm achieves speedup for any sorting algorithm even when a single processor is used but this processor should have multithreading capabilities. It is very well suited to GPUs due to their multithreading ability and global shared memory. Simple synchronization constructs are required to ensure correctness of thread and processor operations. Estimates for speedup were derived. We find that in the best case speedup is proportional to the product of the number of parallel processors and the number of elements in the list. The average speedup is only proportional to the number of parallel processors.
We can summarize the main features and strengths of the proposed algorithm as follows.
1) The proposed algorithm applies to any sorting algorithm.
2) The proposed algorithm applies to any type of parallel processing system. 3) The proposed algorithm can be implemented in software using multithreading or in parallel hardware structures capable of doing sorting and search operations. 4) Speedup could be controlled by choosing the values of b, n, and α to accommodate communication overhead R. 5) Lookahead is possible to achieve more speedup. 
