Abstract - The paper is dedicated to parallel data sorting based on sorting networks. The proposed methods and circuits have the following characteristics: 1) using two-level parallel comparators in even-odd transition networks with feedback to a register keeping input/intermediate data; 2) parallel merging of many sorted sequences; 3) using even-odd transition networks built from other sorting networks; 4) rational reuse of comparators in different types of networks, namely even-odd transition networks and networks for discovering maximum/minimum values. The experiments in FPGAs, which were done for up to 16×2^20 32-bit data items, demonstrate very good results (as fast as 3-5 ns per data item).
I. INTRODUCTION
The use of hardware accelerators has a long tradition in data processing, which is essential for a vast variety of computational systems. Among the numerous problems that need to be solved, sorting is one of the most important [1], [2]. Since it is a time-consuming task for large volumes of data, speed-up is required in many practical applications. A number of recent research activities target the potential of advanced hardware accelerators, which are analyzed in detail in [3]. Notable results have been achieved by applying parallelism, pipelining, non-sequential circuits, and other techniques, and by building specialized blocks in hardware. Special attention has been paid to such competitive implementation platforms as field-programmable gate arrays (FPGAs) (e.g., [3]-[7]), graphics processing units (GPUs) (e.g., [4], [8]-[13]), and multi-core central processing units (CPUs) (e.g., [14], [15]). Although different methods have been used in the implementations referenced above and elsewhere, they share a number of common features: applying broad parallelism, mainly based on different types of sorting networks; minimizing the depth of networks to decrease either the number of clock cycles or the delays in the combinational circuits that form the network components (comparators); reducing the number of comparators to minimize the required hardware resources; and applying various pipelining techniques for potential acceleration.
One of the most important features of FPGA-based circuits is the opportunity to build an entire system composed of various components, and data sorters can be among them. Optimizing the resources of any component permits the same microchip to be used for implementing additional tasks. Thus, either functionality can be extended or the released resources can be used for additional needs, such as better testing, verification, and so forth.
Increasing performance is important for many practical applications, especially for real-time embedded systems. Significant speed-up can be achieved if hardware circuits are used as accelerators for general-purpose and application-specific software. A number of comparisons (FPGA vs. multi-core CPU, FPGA vs. GPU, FPGA vs. DSP) can be found in [3], [7], [16]-[18]. Existing extendable middleware frameworks, such as VForce [19], permit the same application code to be run in software or in application-specific hardware, supporting calls to both FPGAs and GPUs and requiring no changes in user code (results on systems with NVIDIA Tesla GPUs and Xilinx FPGAs are presented in [19]). Thus, application-specific designs can be linked with general-purpose systems. FPGAs have become beneficial in more and more cases, mainly due to their inherent configurability and relatively low development costs. Generally, FPGA-based systems are more redundant and slower than CPUs and GPUs. However, they allow creating exactly those operations and blocks that are required for particular applications. For example, in parallel sorting, the size of operands can be customized, and combinational and sequential circuits can be reasonably combined. Besides, multi-core CPUs and hardware accelerators can be implemented on the same microchip, such as the Zynq-7000 from Xilinx [20]. Our experience has shown that the best FPGA-based implementations should be as regular as possible, avoiding complicated routing. The design of such regular circuits for data sorting is the main target of this paper.
The remainder of this paper contains seven sections. Section II presents general ideas for efficient parallel data sorting. Section III discusses the related work. FPGA-based implementation and a detailed analysis of the basic components are presented in Section IV. Merging of sorted subsequences for large datasets is described in Section V. Section VI suggests potential improvements to the proposed methods. Section VII discusses implementation details and the results of numerous experiments and comparisons. The conclusion is given in Section VIII.
II. GENERAL IDEAS
The proposed method for parallel data sorting includes at most three basic steps:
• At the first step, the given set of data is decomposed into subsets that are individually sorted. The size of each subset is limited by FPGA resources, which will be analyzed in subsequent sections.
• At the second step, several subsets are joined to form bigger sorted subsets. Once again, the size of any bigger subset is limited by FPGA resources.
doi: 10.1515/acss-2014-0013
• At the third step, the subsets are merged to produce the final sorted set.

Suppose we need to sort N M-bit data items. Figure 1a depicts a very regular circuit which includes cascaded even and odd comparators that are invoked sequentially as long as there is at least one data swap. Figure 1b demonstrates an example for M = 4, N = 8 and the following set of data: 14, 15, 0, 12, 4, 8, 1, 9. The circuit is the same as an even-odd transition network (see Fig. 2 for the example considered above), but the two-level comparators (N/2 comparators at the first level and N/2−1 comparators at the second level) are connected to the register R, and the multi-level comparisons shown in Fig. 2 are executed sequentially until there are no swaps in any of the N−1 comparators. In the best case (for a sorted sequence) the result is ready immediately, and in the worst case the result is ready after N/2 steps (clock cycles that control the register). Thus, the minimum delay is 1 clock cycle and the maximum delay is N/2 clock cycles for sorting N M-bit data items.

Let us list the basic features of the circuit in Fig. 1a:
• The circuit is very regular and does not require complex interconnections between the elements (the comparators);
• The number of comparators is N−1 and they can be implemented even within low-cost FPGAs for relatively large values of N (the details and comparisons will be given later).
Although the circuit is sequential, the performance is comparable with multi-level combinational networks because the latter involve significant delays in multi-level circuits.
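The behaviour of the circuit in Fig. 1a can be illustrated with a small software model. The following C++ fragment is only a sketch under stated assumptions: the function name even_odd_sort is hypothetical, items are sorted in descending order (as in the example of Fig. 1b), and the model also counts the last iteration, in which no swap occurs, so the reported number of clock cycles can exceed the number of swap-performing iterations by one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Software model (an assumption, not the authors' code) of the two-level
// circuit with the feedback register R. Each loop iteration models one
// clock cycle: the first (even) and the second (odd) levels of comparators
// are applied, and the register is updated. Returns the number of cycles.
std::size_t even_odd_sort(std::vector<std::uint32_t>& r) {
    std::size_t clock_cycles = 0;
    bool activity_flag = true;
    while (activity_flag) {
        activity_flag = false;
        ++clock_cycles;
        // First level: comparators on pairs (0,1), (2,3), ...
        for (std::size_t i = 0; i + 1 < r.size(); i += 2) {
            if (r[i] < r[i + 1]) { std::swap(r[i], r[i + 1]); activity_flag = true; }
        }
        // Second level: comparators on pairs (1,2), (3,4), ...
        for (std::size_t i = 1; i + 1 < r.size(); i += 2) {
            if (r[i] < r[i + 1]) { std::swap(r[i], r[i + 1]); activity_flag = true; }
        }
    }
    // Postcondition: no comparator swapped, so the register is sorted.
    assert(std::is_sorted(r.begin(), r.end(), std::greater<>()));
    return clock_cycles;
}
```

Within one level the comparators operate on non-intersecting pairs, so the sequential loops give the same result as the parallel comparators in hardware.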
At the second step, two subsets from the previous step can be joined applying the following method:
1. Comparing and interchanging two sorted subsequences using the method of [21] (this method is also described in [1]).
2. Sorting the produced subsets with the largest and the smallest values.

As a result, a bigger sequence will be sorted. Figure 3 shows an example for the following two sorted subsets: 15, 14, 12, 9, 8, 4, 1, 0 (see Fig. 1b and Fig. 2) and 367, 211, 127, 14, 8, 3, 2, 1. The left-hand network in Fig. 3 is composed of N comparators for two sorted sequences of N items each, and the interchange can be done in one clock cycle because the comparisons are applied to non-intersecting elements. Sorting by the circuit in Fig. 1a requires N/2 clock cycles (in the worst case). Assuming two sequential sorts for the smallest and the largest subsets, the total joining time is equal to 1 + N/2 + N/2 = N+1 clock cycles. Note that the two sorts can be executed in parallel, but this doubles the required hardware resources. More than two sorted subsets can be joined using a very similar technique, but the time will grow exponentially. Thus, the number of subsets should be reasonably limited, which will be discussed in the subsequent sections.
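The two joining operations can be sketched in C++ as shown below. This is a hedged illustration rather than the circuit of Fig. 3: the names compare_interchange and join_sorted are hypothetical, the pairing of item i of the first sequence with item N−1−i of the second one is an assumption about the compare-and-interchange method, and std::sort is used only as a software stand-in for the two runs of the circuit in Fig. 1a.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Items = std::vector<std::uint32_t>;

// One clock cycle in hardware: N comparators on non-intersecting pairs.
// Assumed pairing: a[i] is compared with b[n-1-i]; the larger value goes
// to a, the smaller one to b. Both inputs are sorted in descending order.
void compare_interchange(Items& a, Items& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] < b[n - 1 - i]) std::swap(a[i], b[n - 1 - i]);
    }
}

// Joins two sorted subsets into one sorted (descending) sequence.
Items join_sorted(Items a, Items b) {
    compare_interchange(a, b);  // a holds the N largest, b the N smallest
    // Stand-in for sorting each half with the circuit of Fig. 1a.
    std::sort(a.begin(), a.end(), std::greater<>());
    std::sort(b.begin(), b.end(), std::greater<>());
    a.insert(a.end(), b.begin(), b.end());
    return a;
}
```

For the two subsets of the example above, the interchange leaves 367, 211, 127, 15, 14, 14, 12, 9 (in some order) in the first subset and the eight smallest values in the second one, so sorting the halves independently completes the join.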
At the third step, several sorted subsets are merged using the method illustrated in Fig. 4. Let K be the number of sorted subsets. Each subset k (0 ≤ k ≤ K−1) contains Nk sorted items, and all K subsets are kept in dual-port memory blocks embedded in FPGAs. Each subset has two associated counters pointing, respectively, to the current maximum and minimum items through the first and the second memory ports. The registers Rmax and Rmin keep K maximum items (one from each subset) and K minimum items (one from each subset), respectively. The registers and the memory are filled in while the sorted subsets are formed. Sorting is then executed by sequentially extracting the maximum and the minimum items, two values at each clock cycle.

Since the first and the last steps are executed in different time slots, the first and the second levels of comparators shown in Fig. 1a can be reused in the networks of Fig. 6.

Suppose N M-bit items have to be sorted and only the first and the third steps are executed. At the first step, K sorted subsets are created sequentially (such subsets can be built in parallel if FPGA resources are sufficient). Any subset k requires Nk/2 clock cycles in the worst case. Assuming that all values N0, …, NK−1 are equal, we need K × (Nk/2) clock cycles for the first step. The third step requires N/2 clock cycles. Thus, since K × Nk = N, the total time in the worst case is K × (Nk/2) + N/2 = N/2 + N/2 = N clock cycles (if all K sorted subsets have an equal number of elements). What is very important is the low delay in the combinational paths of the different circuits, i.e., the minimal potential clock period. Such characteristics will be thoroughly analyzed in real experiments with different FPGA-based circuits. We will also present the results of simulation in software for different datasets.
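The merging step can be modelled in software as follows. The fragment is a sketch under stated assumptions: the names MergeResult and merge_max_min are hypothetical, each subset is sorted in descending order, the vectors top and bottom play the role of the two counters associated with each subset, and a linear search replaces the comparator networks that discover the maximum and the minimum values. One loop iteration models one clock cycle, in which one maximum and one minimum item are extracted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Items = std::vector<std::uint32_t>;

struct MergeResult {
    Items sorted;              // final sequence in descending order
    std::size_t clock_cycles;  // one maximum and one minimum per cycle
};

// Each subset must be sorted in descending order.
MergeResult merge_max_min(const std::vector<Items>& subsets) {
    const std::size_t K = subsets.size();
    std::size_t total = 0;
    for (const Items& s : subsets) total += s.size();
    // top[k] indexes the current maximum of subset k,
    // bottom[k] is one past the current minimum of subset k.
    std::vector<std::size_t> top(K, 0), bottom(K);
    for (std::size_t k = 0; k < K; ++k) bottom[k] = subsets[k].size();

    MergeResult out{Items(total), 0};
    std::size_t front = 0, back = total;  // output is filled from both ends
    while (front < back) {
        ++out.clock_cycles;
        std::size_t best = K;  // subset holding the overall maximum
        for (std::size_t k = 0; k < K; ++k)
            if (top[k] < bottom[k] &&
                (best == K || subsets[k][top[k]] > subsets[best][top[best]]))
                best = k;
        assert(best != K);
        out.sorted[front++] = subsets[best][top[best]++];
        if (front == back) break;  // odd number of items: nothing left
        best = K;  // subset holding the overall minimum
        for (std::size_t k = 0; k < K; ++k)
            if (top[k] < bottom[k] &&
                (best == K || subsets[k][bottom[k] - 1] < subsets[best][bottom[best] - 1]))
                best = k;
        assert(best != K);
        out.sorted[--back] = subsets[best][--bottom[best]];
    }
    return out;
}
```

For K subsets with N items in total, the loop runs N/2 times for even N, which agrees with the N/2 clock cycles stated for the third step.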
III. RELATED WORK
Since sorting is a core problem in computer science, it has been very well researched over the last five decades [11]. The main direction in this area is aimed at different types of parallel implementations involving sorting networks. In [3], [7] various types of networks were implemented in FPGAs, and it was shown that even in advanced FPGAs (such as those from the Virtex-5 family of Xilinx) it is possible to construct combinational networks for only up to 64 items, which is undoubtedly insufficient. In the subsequent sections we will demonstrate that, using the circuits from Fig. 1a, sorters for significantly more items can be built even in low-cost FPGAs. The results of comparisons of different networks [3] (namely bitonic merge and even-odd merge) show that even-odd merge networks require slightly fewer hardware resources. From [1] we can see the minimum number of comparators in different networks. Although even-odd transition networks are not the best [12] (from the point of view of hardware resources and the number of levels), they are very regular. Besides, the two-level comparator structure shown in Fig. 1a cannot be built for bitonic merge and even-odd merge networks. To our knowledge, such networks are among the fastest, but they cannot be used in the way considered in this paper. Comparisons of the improved bitonic networks proposed in [11] with state-of-the-art sorting methods (namely quicksort and radix sort) demonstrate the advantages of the methods of [11]; thus, the results of [3], [7], [11] can be taken for comparison with our technique. Besides, the best networks can be used as components within the proposed circuits, and we will demonstrate such an opportunity in subsequent sections. Many other publications [3]-[15] present platform-targeted (multi-core CPU, GPU, FPGA) methods. Our approach is more relevant to FPGAs, for the following reasons.
Current trends among FPGA vendors demonstrate that more and more standard blocks (memories, digital signal processors, multi-core CPUs, transceivers, etc.) are accommodated with gate arrays on the same microchips. An example is the dual-core Cortex-A9 ARM processor embedded in Zynq microchips of Xilinx and combined with Artix/Kintex FPGAs [20]. Thus, we can expect that, as the functionality of individual microchips is more and more integrated in ASICs (application-specific integrated circuits) and ASSPs (application-specific standard products), FPGAs will contain more and more ASIC and ASSP functionality implemented as built-in units. We presume that one of such units might be a GPU. Thus, the boundary between different platforms will be eliminated [22]. Although our technique is FPGA-targeted, it can be combined with other implementations. Some examples are given below:
• Rapid creation of K sorted subsets, which can be further merged into a single sorted data set in general-purpose computers, multi-core CPUs, etc.;
• Using fast bitonic (or some other) networks [11] in Fig. 1a for sorting more than two items at each comparator level (see some particular proposals in Section VI);
• Rational combination of address-based sorting [23] for very large data sets (close to 2^N) and the proposed technique for smaller data sets (N << 2^N);
• Sorting M-bit data with very large values of M, which is easier in FPGAs since many constraints inherent to GPUs and multi-core CPUs can be eliminated (the results of experiments with 128-bit data items, i.e., M = 128, are given in Section VII);
• Design of the fastest sorters for small values of N using just the circuit in Fig. 1a. Such sorters are very valuable for different priority buffers/queues [24];
• Using the results of [19], which permit the same application code to be run in software or in application-specific hardware (FPGA in our case).
IV. FPGA-BASED IMPLEMENTATION AND A THOROUGH COMPARISON OF BASIC COMPONENTS
In this section, we evaluate the primary components to be used as building blocks for the considered sorters, which are:
• Comparators;
• A two-level sequential circuit shown in Fig. 1a;
• The circuits that discover the maximum and the minimum values in given sets of items (see Fig. 6).

The objective is to get the following parameters for different FPGAs:
• Maximum delay in combinational circuits;
• Maximum attainable clock frequency;
• Required hardware resources.
A. Comparators
Any comparator is described in VHDL (Very High Speed Integrated Circuits Hardware Description Language) with the following ports: Op1 and Op2 are input data items; MinValue and MaxValue are output data items; activity_flag is an output signal, which is equal to 1 if and only if the input data items Op1 and Op2 are swapped.

Table I indicates the hardware resources (the number of occupied slices S for Xilinx FPGAs or the number of logic elements LE for Altera FPGAs). Here, SFPGA/LEFPGA is the number of slices/logic elements available in the indicated FPGA; EMB is the size of the embedded memory blocks in Kbits (this size is not directly used in the paper, but it gives an idea of which datasets can be stored inside FPGAs); D is the maximum delay in nanoseconds. Please note that slices of different FPGAs are not equal. Synthesis and implementation of the circuits were done in Xilinx ISE 14.1 and Altera Quartus 12. From Table I we can see that the FPGA resources needed for one comparator are small and the maximum delay varies from 2.3 ns to 15 ns. Note that the results in Table I and below are given for particular types of FPGAs because only such FPGAs were available for the experiments (prototyping boards Nexys-2, Atlys, FX12, ML505, ML605, DE2-115, and ZedBoard). The optimization goal for ISE 14.1 was set to speed and the optimization effort was set to normal. The optimization technique for the Quartus software was set to balanced.
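The comparator interface described in this subsection can be mirrored by a short behavioural model in C++. This is not the VHDL description used for synthesis; the names ComparatorOutput and comparator are hypothetical, the 32-bit operand width is one of the sizes examined in the paper, and the convention that a swap occurs when Op1 is smaller than Op2 (descending order) is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Outputs of one comparator, named after the ports described in the text.
struct ComparatorOutput {
    std::uint32_t MaxValue;
    std::uint32_t MinValue;
    bool activity_flag;  // true if and only if the operands were swapped
};

// Behavioural model: the larger operand is routed to MaxValue, the smaller
// one to MinValue. Assumption: Op1 is expected to be the larger item, so a
// swap is reported when Op1 < Op2.
ComparatorOutput comparator(std::uint32_t Op1, std::uint32_t Op2) {
    const bool swapped = Op1 < Op2;
    const ComparatorOutput out{swapped ? Op2 : Op1, swapped ? Op1 : Op2, swapped};
    assert(out.MaxValue >= out.MinValue);
    return out;
}
```

Equal operands are not swapped in this model, so activity_flag stays at 0 for them and a sorter built from such comparators terminates on data with repeated values.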
B. Sorter Based on Two-level Sequential Circuit
Sorting circuits composed of two-level comparators (see Fig. 1a) can be described in structural VHDL as shown below:

generate_even_comparators: -- the first level of comparators in Fig. 1a
for ...

The signal activity_flag is assigned as follows:

activity_flag <= activity_flag1 or activity_flag2;

where activity_flag1 and activity_flag2 are signals from the first and the second levels of comparators, respectively, and n_data (the number of data items that need to be sorted) is a generic parameter (we will change this parameter in order to fill in Table II).

Table II presents the complexities of circuits for different numbers N of data items that have to be sorted using the method in Fig. 1a. Here, Nmax is the maximum number of data items that can be sorted in the indicated FPGA, and Fmax is the maximum clock frequency in MHz for updating the feedback register in Fig. 1a. Figure 7 presents the comparison of the results obtained for the Virtex-5 FPGA with the purely combinational sorting networks from [3]. From Fig. 7 we can see that the proposed method allows more data items to be sorted.
C. Discovering the Maximum and the Minimum Items
In Section II we explained that, since the first and the last steps are executed in different time slots, the first and the second levels of comparators shown in Fig. 1a can be reused in the networks of Fig. 6. Table III gives the results of synthesis and implementation of such circuits in different FPGAs in the following format: the number of occupied FPGA slices (logic elements for Altera) S_LE, with the percentage of S_LE relative to all slices (logic elements for Altera) available in the particular FPGA given in parentheses; the maximum attainable clock frequency Fmax in MHz; and the total number of comparators Nc. These data are recorded in the format S_LE (percentage)/Fmax/Nc. From Table III we can conclude that the required resources are very reasonable and enable us to implement the proposed sorters even in low-cost commercial FPGAs. There is some inconsistency between the tables. For example, the number of slices for one comparator is 21 for the Spartan-6 FPGA (see Table I), which means that the resources needed for the total number of comparators in the last row of Table III should exceed the available FPGA resources. However, the resources were sufficient, and we assume that this is due to optimization techniques applied by ISE 14.1 (such as removing some redundant signals).
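To illustrate how a circuit that discovers the maximum value can be assembled from ordinary comparators, the following C++ sketch models a tree of comparators. It is only one plausible structure and is not taken from Fig. 6; the names MaxSearch and find_max are hypothetical. The index of the winning item is returned as well, because the counter of the subset that supplied the maximum has to be advanced after each extraction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

struct MaxSearch {
    std::uint32_t value;      // the maximum item
    std::size_t index;        // position of the maximum in the input
    std::size_t comparators;  // number of two-input comparators used
    std::size_t levels;       // depth of the comparator tree
};

// Tree of comparators: at each level, items are compared in pairs and the
// larger one advances; an unpaired item advances without a comparison.
MaxSearch find_max(const std::vector<std::uint32_t>& items) {
    assert(!items.empty());
    std::vector<std::size_t> alive(items.size());
    std::iota(alive.begin(), alive.end(), std::size_t{0});
    std::size_t comparators = 0, levels = 0;
    while (alive.size() > 1) {
        std::vector<std::size_t> next;
        for (std::size_t i = 0; i + 1 < alive.size(); i += 2) {
            ++comparators;
            next.push_back(items[alive[i]] >= items[alive[i + 1]] ? alive[i] : alive[i + 1]);
        }
        if (alive.size() % 2 == 1) next.push_back(alive.back());
        alive = std::move(next);
        ++levels;
    }
    return {items[alive[0]], alive[0], comparators, levels};
}
```

In this model, K items need K−1 comparators arranged in about log2(K) levels; the minimum value can be found by the same structure with the comparison reversed.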
V. MERGING SORTED SUBSEQUENCES FOR LARGE SCALE DATASETS
Synthesis and implementation of merging circuits take a very long time. That is why we tested only relatively simple circuits in FPGAs (up to 2^14 32-bit data items). At the same time, we modeled the circuits in general-purpose software (in C++) and counted the number of FPGA clock cycles. Since the number of actual clock cycles in the FPGA and the number of clock cycles obtained from the software programs were exactly the same, we decided to present the results for more complicated sorters (up to millions of 32-bit items) from simulation in software only.
Sorting subsets is done in the following C++ fragment, which uses two levels of comparators invoked in a loop that runs as long as the set is not sorted:

for (int n = 0; n < number_of_blocks; n++) // K = number_of_blocks
{
    do /* exchange_flag indicates the occurrence of at least one swap in parallel_even_sorter or in parallel_odd_sorter */
    {
        exchange_flag = false;
        clock_cycles++; /* each iteration is executed in one clock cycle (see Fig. 1a) */
        ...

Initial data inside FPGAs were generated randomly and stored in embedded block RAMs in such a way that all N M-bit data items can be loaded into the register in Fig. 1a and written back to memory in parallel. For the FPGA implementations, K = 256 sorted blocks with 64 32-bit items in each block were taken for merging and producing the final sorted sequence containing 2^14 32-bit items. The maximum attainable clock frequency for the Spartan-6 FPGA was about 80 MHz. The results of simulation in software based on two steps (1 and 3) are presented in Table IV for different values of K and Nk (assuming all blocks have an equal number of data items). We found that the value of M affects mainly the consumed hardware resources and has practically no effect on the maximum achievable clock frequency (three different sizes, M = 32, M = 64 and M = 128, were examined). Simulation in software based on all three steps did not show any advantage, and the performance even decreased. Therefore, step 2 was removed, and in all subsequent implementations only the first and the third steps were used.
VI. POTENTIAL IMPROVEMENTS
Additional speed-up of the circuit from Fig. 1a can be achieved if the comparators are replaced with sorting networks processing more than two items simultaneously. Let us consider the circuit in Fig. 8a, where the data items in a given subset are divided into G t-element groups. Data within each group are sorted using networks. As a result, instead of an even-odd transition network for pairs of items we consider an even-odd transition network for t-element groups of items. Figure 8b gives an example of sorting 16 data items divided into two 8-element groups (i.e., G = 2, t = 8). Such improvements permit additional acceleration of sorting by a factor of log2(t) (a factor of 3 for our example). However, the required FPGA resources also increase significantly. Indeed, for our example (t = 8) the best sorting networks contain 19 comparators [1], and the number of slices required for each comparator is shown in Table I. Thus, even for Nk = 64 the circuit in Fig. 8a will occupy the entire FPGA xc6slx45. Nevertheless, more advanced FPGAs allow the considered technique to be implemented.
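The idea of an even-odd transition network built from t-element sorting networks can be sketched in software as follows. Several details are assumptions that are not confirmed by the text: the names sort_group and grouped_even_odd_sort are hypothetical, the second level is assumed to be shifted by t/2 items relative to the first one, and std::sort is used as a software stand-in for a t-input sorting network.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Items = std::vector<std::uint32_t>;

// Stand-in for one t-input sorting network: sorts len items starting at
// first in descending order and reports whether any item was moved.
static bool sort_group(Items& r, std::size_t first, std::size_t len) {
    const auto b = r.begin() + static_cast<std::ptrdiff_t>(first);
    const auto e = b + static_cast<std::ptrdiff_t>(len);
    if (std::is_sorted(b, e, std::greater<>())) return false;
    std::sort(b, e, std::greater<>());
    return true;
}

// Two-level transition network on t-element groups with a feedback
// register; one loop iteration models one clock cycle.
std::size_t grouped_even_odd_sort(Items& r, std::size_t t) {
    assert(t >= 2 && t % 2 == 0 && r.size() % t == 0);
    std::size_t clock_cycles = 0;
    bool activity_flag = true;
    while (activity_flag) {
        activity_flag = false;
        ++clock_cycles;
        // First level: G aligned groups of t items.
        for (std::size_t first = 0; first + t <= r.size(); first += t)
            if (sort_group(r, first, t)) activity_flag = true;
        // Second level (assumed): G-1 groups shifted by t/2 items.
        for (std::size_t first = t / 2; first + t <= r.size(); first += t)
            if (sort_group(r, first, t)) activity_flag = true;
    }
    return clock_cycles;
}
```

When an iteration changes nothing, every aligned and every shifted group is sorted, so each pair of neighbouring items is in order and the whole register is sorted. For t = 2 the model reduces to the pairwise circuit of Fig. 1a.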
Similarly, the blocks for discovering the maximum (the minimum) value in Fig. 4 (the structures of these blocks are shown in Fig. 6) can be replaced with networks that enable a group of maximum (minimum) values to be chosen. An example is shown in Fig. 9.
Let us consider 4 sorted subsets. It is guaranteed that the h left segments include the h maximum values of the entire set and the h right segments include the h minimum values of the entire set. Now let us consider segments instead of individual items in Fig. 6, ordering them with sorting networks as shown in Fig. 9. The circuit in Fig. 9 works much like the circuit in Fig. 6; the only difference is that segments are used instead of individual items. Now the circuit in Fig. 4 selects 8 values instead of 2 at each clock cycle, and performance is increased 4 times. Note that since just a half of the outputs are actually used in each sorting network in Fig. 9, the complexity of this sorting network (i.e., the number of comparators) can be reduced. Now the registers Rmax and Rmin in Fig. 4 contain t/2 segments and each segment is composed of K values. The circuit in Fig. 4 ...
VII. EXPERIMENTS AND COMPARISONS
We found (see Section V) that step 2 (see Section II) is useful only for autonomous merging of a small number of sorted subsets. Thus, we have not analyzed this step in the experiments below. Figure 10 presents the results of synthesis and implementation in different FPGAs (the types of FPGAs are indicated in Tables I and II) of circuits that sort small subsets of data (from 16 32-bit items to 1,024 32-bit items) using just step 1 (see Section II and Fig. 1a). Such circuits (Fig. 10) are extremely fast and can be used as components of real-time embedded systems implemented in the same microchip (an example of such a system is given in [24]). Figure 11 presents the results of simulation in software of circuits for complete data sorters. Data sets of up to 2^14 32-bit data items were also verified in FPGAs; larger sets were not synthesized and implemented in FPGAs due to the very long synthesis and implementation time in ISE and Quartus. As an example, obtaining the results in Table II for the Altera FPGA and Nmax = 1024 required about 2 hours in the Quartus software running on a quad-core PC at 3.2 GHz under a 64-bit operating system.
Let us compare these results with the results in the referenced publications. In [3] there are no sorting data measured in a way similar to Fig. 11b; only the speed-up compared to a PowerPC 440 core is given. Thus, we will take the data for median operators, which require less time than sorting of similar data. Processing median operators in an FPGA of the Virtex-5 family for large data sets requires approximately 19 ns per data item [3]. We can see from Fig. 11b that in our case this time is less than 5 ns. Let us now compare the results from Fig. 11 with sorting in a GPU, for which results are given in Fig. 5 of [11]. Sorting of 16×2^20 data items requires between 500 and 1,000 milliseconds. In our case sorting 16×2^20 32-bit data items requires less than 80 milliseconds. Thus, the advantages of the proposed FPGA-based sorters are evident.
In [6] the maximum speed of sorting is estimated as 180 million records per second. Please note that this is an estimated result, and even in this case it corresponds to about 5.6 ns per record, which is worse than 5 ns per data item. Thus, our results are better, because if we consider groups of items instead of individual items (see Fig. 11b) we can execute sorting with a performance of 3 ns per data item. Comparison with [8] (which provides very useful data for numerous sorting algorithms) also shows that the proposed data sorters are faster.
It should also be mentioned that in the simulation results shown in Fig. 11 the clock frequency was taken to be just 50 MHz, which is less than the real frequency of the FPGA-based circuits. If we look at Table III, we can see that the actually achievable frequency is higher. Thus, we can expect better acceleration compared to the results shown in Fig. 11b. Besides, the potential improvements in Section VI are currently limited by insufficient FPGA resources. Since such resources will undoubtedly increase in upcoming generations of FPGAs, we can expect that the proposed methods will provide additional acceleration in future systems.

Fig. 11. The results of simulation in software of circuits for complete data sorters (clock frequency is 50 MHz).

Another important feature of the proposed circuits is that the sorting time does not depend much on the size of data items. To support this conclusion, we present in Fig. 12 the results of sorting similar to Fig. 10 but for two sizes of data items: 32 bits and 128 bits (the results for 128-bit data items are shown by dashed lines). Please note that we were able to implement in the FPGA xc5vlx110t of the Virtex-5 family a data sorter for a set of 128 128-bit data items, whereas in [3] even 128 32-bit data items could not be sorted in the same FPGA due to the lack of FPGA resources.
VIII. CONCLUSION AND FUTURE WORK
The paper suggests a new method and FPGA-based circuits for very fast parallel sorting of data. The method includes two basic steps: 1) representation of data in the form of subsets in which data are sorted by applying the proposed even-odd comparison-based networks, which combine parallel and sequential processing and allow the number of comparators to be reduced to N−1 (N is the number of data items to be sorted); 2) concurrent merging of many sorted subsets from the first step into a single sorted dataset with the aid of sequential extraction of minimal and maximal data items. The proposed circuits permit the number of clock cycles per data item to be significantly reduced. The technique enables datasets of N M-bit items to be sorted in parallel in approximately N clock cycles, which is faster than other sorting methods known from available publications. The first step can be executed autonomously for small datasets or in combination with the subsequent step. Experiments, which were done for up to 16,777,216 (16×2^20) 32-bit data items, demonstrated very good results; in particular, sorting was performed as fast as 3-5 ns per data item with a relatively low clock frequency, which was set to just 50 MHz.
A number of potential improvements were proposed, but they were not completely verified in hardware or in software. This task is considered to be one of the future directions. Besides, due to the very long synthesis and implementation time in commercial CAD systems, only sorters with up to 2^14 32-bit items were verified in hardware, and the others (up to 2^24 32-bit items) were modeled in software. We will also work on additional optimization of circuit components. A very interesting direction is software/hardware co-design, in which the problem of sorting is split between FPGA-based circuits and the software of powerful processing systems. One possible way is to combine the power of a dual-core ARM Cortex-A9 processor and the flexibility of an Artix FPGA in the Zynq application-specific system-on-chip. This direction is also considered for the future.
