Abstract: Sorting is one of the most investigated tasks computers are used for. Up to now, not much research has been put into increasing the flexibility and performance of sorting applications by applying reconfigurable computer systems. There are parallel sorting algorithms (sorting circuits) which are highly suitable for VLSI hardware realization and which outperform sequential sorting methods applied on traditional software processors by far. But usually they require a large area that increases with the number of keys to be sorted. This drawback concerns ASIC and statically reconfigurable systems.
I. INTRODUCTION AND MOTIVATION
Reconfigurable hardware devices combine performance and flexibility. They outperform software implementations by exploiting parallelism, and they offer the option to change the implementation after shipping and to time-multiplex different applications on the same device. Dynamic partial reconfiguration furthermore allows only a part of the FPGA (Field Programmable Gate Array) to be reconfigured at runtime. This can be used to modify or extend the current configuration on the FPGA while saving the time that a reconfiguration of the whole FPGA would otherwise require.
Sorting (long) sequences of keys is one of the classical and most common tasks solved by computers today, e. g., in numerous database and computer graphics applications.
An algorithm relevant for practical purposes due to the small constant factor (see below) involved is Batcher's Bitonic sorting circuit [1]. Sorting circuits are built from comparator modules, or comparators for short, that connect horizontally drawn "wires" numbered from 1 to n, where the n keys enter on the left side. A single comparator receives two keys and outputs the minimum key on the lower numbered wire and the maximum key on the other wire (e. g., see Fig. 1). The sequence of keys leaves the circuit on the right.
* The full version of this paper can be found under http://www.opus.ub.unierlangen.de/opus/volltexte/2011/2403/
These circuits are well suited to be realized by a hardware implementation. The Bitonic sorting algorithm performs $\frac{1}{2} \log n (\log n + 1)$ parallel steps (also called stages) and consists of $\frac{1}{4} n \log n (\log n + 1)$ comparators that all have to be implemented in hardware. The drawback of this approach is that only a small fraction of all comparators is active during a single step, i. e., in a certain way the implementation of sorting circuits in hardware leads to a considerable waste of area. Or, the other way around, this means that only short sequences can be sorted. Here, modern hardware with the ability of dynamic partial reconfiguration can be of great help: Just a piece of k parallel steps is implemented, and when the steps of the piece have been executed, the hardware is reconfigured for the execution of the next piece of k parallel steps. So it is possible to sort longer sequences of keys while retaining all the benefits of parallel processing. In this paper, we demonstrate this approach by applying it to Bitonic sort.
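To make the area argument concrete, the following minimal sketch evaluates the two formulas above for n a power of two. The function name `bitonic_counts` is a hypothetical helper introduced only for this illustration; it is not part of the implementation described in this paper.

```python
def bitonic_counts(n):
    """Return (parallel_steps, comparators) of a Bitonic sorter for n = 2**m keys.

    Evaluates 1/2 * log n * (log n + 1) and 1/4 * n * log n * (log n + 1),
    with log taken to base 2. Hypothetical helper for illustration only.
    """
    m = n.bit_length() - 1
    if n < 1 or (1 << m) != n:
        raise ValueError("n must be a power of two")
    steps = m * (m + 1) // 2
    comparators = n * m * (m + 1) // 4
    return steps, comparators


if __name__ == "__main__":
    for n in (8, 64, 512):
        steps, comparators = bitonic_counts(n)
        # Each step uses n/2 comparators, so only 1/steps of them is active.
        print(n, steps, comparators, n // 2)
```

For n = 512, for instance, the circuit has 45 steps and 11520 comparators, of which only 256 are active in any single step.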
The paper is organized as follows: In Section II other approaches which apply reconfigurable devices and parallel sorting algorithms are presented. In Section III, we explain the general approach of two implementation variants of Bitonic sort which we realized in our work. Furthermore, the design and implementation are explained in Section IV, and the experimental results are described in Section V.
II. BACKGROUND AND RELATED WORK
A comprehensive introduction to sequential and parallel sorting can be found in [2]. In particular, Sec. 5.3.4 of [2] presents a thorough overview of sorting circuits.
There are just a few reports on the implementation of parallel sorting algorithms on FPGAs. They have in common that they do not use partial reconfiguration. In [3], implementations of quicksort, heap sort, radix sort, Bitonic sort, and odd/even mergesort (a further sorting method by Batcher [1]) are compared on a reconfigurable platform. They are compared to each other with respect to memory bandwidth, clock speed, algorithm computational density, and the algorithms' ability to be pipelined.
In [4], a hardware realization of a recurrent scalable sorting circuit based on Bitonic sorting is presented. The Bitonic sorter has been implemented on an FPGA. Techniques to reduce the communication within the circuit and to minimize the costs in terms of hardware resources are explained. Additionally, an enhancement of the input registers that allows the same architecture to be reused for different input widths is shown in detail. Thus, the time complexity of the original Bitonic sorter could be achieved.
Marcelino et al. [5] present and evaluate three different hardware sorting implementations and compare them to the quicksort algorithm implemented in software. Their experimental results show the differences in resources and performance among the three proposed sorting units and also between the sorting units and plain software implementations. By combining an insertion sorting unit and a merge FIFO sorting unit, they achieve a speed-up between 1.6 and 25 compared to a quicksort software implementation. Furthermore, they realized implementations supporting up to 128 parallel inputs.
In [6], a hardware realization of a recurrent scalable sorting network based on Batcher's Bitonic algorithm is presented. The idea is to reuse the comparators of one comparator level stage by stage. The input registers of the same architecture are used for different input widths by distributing the role of each comparator level over the network. Furthermore, such a sorter has been implemented on an FPGA, but no partial reconfiguration is applied.
In contrast to these existing works, our approach applies the technique of dynamic partial reconfiguration. In this paper, it is shown that implementations of parallel sorting algorithms on FPGAs can benefit considerably when used in conjunction with partial reconfiguration. The large area usage of parallel sorting circuits like Bitonic sort can be reduced, as partial reconfiguration allows the complete sorting circuit to be separated into multiple pieces that are loaded and executed one after another. Thus, smaller FPGAs can be used, and parallel sorting circuits for a larger number of keys can be realized. To our knowledge, this idea has not been investigated before.
III. ALGORITHM
A. Bitonic Sort
The Bitonic sorting circuit is based on the sequential mergesort algorithm. It follows the divide-and-conquer approach. The basic procedure is exemplified in Figure 1. The unsorted sequence of keys enters the circuit on the left. A vertical connection represents a comparator which compares two keys and, if necessary, exchanges them. In Bitonic sort, n/2 comparators can be executed in parallel, defining a parallel step. In the figure, the corresponding outputs of each comparator after each step are shown. Finally, the completely sorted sequence of keys is output on the right. The sorter consists of a cascaded sequence of Bitonic mergers. In Fig. 1, these mergers are marked with boxes. This design is developed for n being a power of 2. For arbitrary n, one can use the circuit for $2^{\lceil \log n \rceil}$ inputs and remove the unnecessary lower wires and the incident comparators. With its running time of $\frac{1}{2} \cdot \log n \cdot (\log n + 1)$ parallel steps, Bitonic sort belongs to the fastest practical sorting algorithms.
But the original algorithm also has disadvantages that it shares with most circuit-based sorting methods. (i) The input sequence must always pass the full sorting circuit in order to guarantee that the result is eventually sorted. That means there is almost no simple way to recognize an already sorted sequence at the beginning or in the middle of the computation and thereby save the rest of the sorting process. It is even improbable that the current sequence is sorted in the middle of the computation. (ii) It necessarily needs a large hardware area, in particular due to the wiring [7]. The more keys are to be sorted by the Bitonic sorting circuit, the more space it needs for comparator stages and wiring, and thus it consumes a lot of space on a chip. (iii) In addition, the space is used inefficiently, as at any time just one single stage is active while the remaining comparator stages are idle: they either wait for their inputs or have already completed their share of the work.
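The structure described above can be sketched in software as follows. This is a minimal illustration, not the hardware design of this paper: it assumes the common formulation of the Bitonic circuit in which every comparator places the minimum on the lower numbered wire, numbers wires from 0 instead of 1, and uses the hypothetical function names `bitonic_stages`, `bitonic_stages_any`, and `bitonic_sort`.

```python
def bitonic_stages(n):
    """Stages of a Bitonic sorting circuit for n = 2**m wires (numbered from 0).

    Each stage is a list of comparators (lo, hi) with lo < hi; a comparator
    puts the minimum on wire lo and the maximum on wire hi.
    """
    stages = []
    k = 2
    while k <= n:
        # First step of a merger: compare mirrored wires within blocks of size k.
        stages.append([(i, i ^ (k - 1)) for i in range(n) if i < (i ^ (k - 1))])
        # Remaining steps: compare wires at distance j = k/4, k/8, ..., 1.
        j = k // 4
        while j >= 1:
            stages.append([(i, i ^ j) for i in range(n) if i < (i ^ j)])
            j //= 2
        k *= 2
    return stages


def bitonic_stages_any(n):
    """Circuit for arbitrary n: build the next power of two and drop the
    surplus high-numbered wires together with their incident comparators."""
    size = 1
    while size < n:
        size *= 2
    return [[(lo, hi) for lo, hi in stage if hi < n]
            for stage in bitonic_stages(size)]


def bitonic_sort(keys):
    """Run all stages in order; within a stage the comparators are independent."""
    keys = list(keys)
    for stage in bitonic_stages_any(len(keys)):
        for lo, hi in stage:
            if keys[lo] > keys[hi]:
                keys[lo], keys[hi] = keys[hi], keys[lo]
    return keys


if __name__ == "__main__":
    print(len(bitonic_stages(8)))          # number of parallel steps for n = 8
    print(bitonic_sort([5, 3, 8, 1, 9]))   # n = 5 uses the pruned 8-wire circuit
```

Because the comparators of one stage touch disjoint wires, the inner loop corresponds to what the hardware executes in a single parallel step.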
The last mentioned disadvantage can be compensated by applying pipelining techniques when several sequences are to be sorted. Thus, the throughput can be increased, and many comparator units are engaged in each time step. In this paper, we focus on (ii). We propose to apply dynamic partial reconfiguration to overcome the large area usage.
In our implementation, we differentiate between two implementation variants, the static version and the dynamic one. Their characteristics are explained in the following subsections.
B. Static Variant
In the static Bitonic algorithm implementation, the Bitonic sorting circuit is implemented as a whole on the reconfigurable device. The complete circuit is synthesized and loaded into the FPGA in a single step. Thus, the corresponding device must offer enough reconfigurable resources, e. g., comparator units, for the corresponding number of inputs. As the number of comparators increases considerably with the number of inputs, either very large FPGAs must be used or only short input sequences can be supported. Nevertheless, the execution in hardware can speed up the sorting task enormously on an adequate device.
C. Dynamic Variant
In the dynamic variant of our implementation of the Bitonic sorting algorithm, we adapt it to dynamically partially reconfigurable systems. We partition the sorting circuit, as illustrated with dotted lines in Fig. 2, into multiple pieces such that each piece includes a set of parallel steps which fit altogether on just a portion of our reconfigurable device. Thus, we separate the complete sorting circuit into multiple smaller pieces, also called partial modules. These modules can be loaded and executed one after another on a reconfigurable device. Furthermore, multiple modules can also be loaded and executed on the reconfigurable device at the same time. The advantage of this approach is that less reconfigurable area is needed by the comparison units. The space saved might be used to run different units of the application in parallel or to run other applications simultaneously on the FPGA. But the main advantage is that longer sequences can now be sorted and processed in parallel by multiplexing the pieces one after another until all steps of the circuit are executed.
IV. DESIGN
The design of the static algorithm is as follows: The inputs are prepared by the control CPU and sent via a hardware-software communication module to the FPGA of the reconfigurable platform. Furthermore, the control CPU is also in charge of the initial static reconfiguration of the FPGA. The main part of this application is done on the FPGA. For the static variant, this hardware application consists of two main modules. The first is a pre- and postprocessing module; the second realizes the actual sorting circuit. The pre- and postprocessing module controls the hardware-software communication module, receives the input, and sends the output back to the control CPU. Its main task is to de-serialize and serialize the sequentially transferred data. The second hardware module realizes the Bitonic sorting circuit. Its size directly depends on the number of input keys to be sorted.
In comparison to the static version, the design of the dynamic adaptation of Bitonic sort has some important differences. An overview of this design is given in Figure 3. The control CPU is not only in charge of transferring the input and output data and loading the initial configuration into the FPGA; it also controls the stepwise reconfiguration of the partial sorting modules. On the hardware side, an additional module is needed for storing and restoring the temporary results in a temporary memory when reconfiguration between the execution of successive pieces takes place.
Fig. 3. Design overview of the dynamic Bitonic algorithm approach: The control CPU is in charge of the reconfiguration process and the transfer of the input and output values. On the FPGA, the de-serialization and serialization of the input and output data is done, respectively. A temporary memory unit is used to save and restore the results after executing the partial modules. Furthermore, the FPGA is loaded with the actual partial sorting module.
Thus, the control CPU initially loads all static modules and the first partial sorting module, then transfers the input data. After executing the first piece of sorting steps, the results are saved in a temporary memory. Then, the control CPU triggers the reconfiguration of the next partial sorting module according to the Bitonic sorting circuit. Then, the temporarily saved sequence is restored from the temporary memory and fed into the newly reconfigured part of the Bitonic sorter. This procedure is repeated until all partial modules of the Bitonic sorter have been run through and the resulting sequence is guaranteed to be completely sorted. Finally, the resulting sequence is transferred back to the control CPU.
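The control flow of this procedure can be simulated in a few lines. The sketch below is an illustrative software model only: the names `partition` and `sort_dynamic`, the piece size `k`, and the hand-written 4-key circuit `STAGES_4` are assumptions made for this example, and loading a partial module is modeled simply by incrementing a counter.

```python
# Hand-written Bitonic circuit for 4 keys (wires numbered from 0); each
# comparator (lo, hi) puts the minimum on wire lo. Illustrative data only.
STAGES_4 = [
    [(0, 1), (2, 3)],
    [(0, 3), (1, 2)],
    [(0, 1), (2, 3)],
]


def partition(stages, k):
    """Split the list of parallel steps into partial modules of at most k steps."""
    return [stages[i:i + k] for i in range(0, len(stages), k)]


def sort_dynamic(keys, stages, k):
    """Model of the dynamic variant: load a module, restore, execute, save."""
    temp_memory = list(keys)      # input data transferred by the control CPU
    module_loads = 0
    for module in partition(stages, k):
        module_loads += 1         # control CPU (re)configures the partial region
        data = list(temp_memory)  # restore the intermediate sequence
        for stage in module:      # execute the k steps of this piece
            for lo, hi in stage:
                if data[lo] > data[hi]:
                    data[lo], data[hi] = data[hi], data[lo]
        temp_memory = data        # save the results before the next module
    return temp_memory, module_loads


if __name__ == "__main__":
    result, loads = sort_dynamic([3, 1, 2, 0], STAGES_4, k=2)
    print(result, loads)
```

In this model, a smaller `k` means fewer comparators resident at a time but more module loads, which mirrors the trade-off between area and reconfiguration time discussed in the experimental results.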
V. EXPERIMENTAL RESULTS

A. Implementation
For the implementation of the algorithm, a dynamically partially reconfigurable platform called ESM is used [8], [9], which offers a Xilinx Virtex-II 6000 and an external PowerPC as control CPU. The communication between both devices is done via a 32-bit bus system.
B. Static Bitonic Sorter
We implemented and tested static Bitonic sorters for input sequences of 1 to 300 keys, each a 32-bit integer. Our main FPGA consists of 33,792 slices. In our maximal implementation, 50% of the resources were consumed by the sorting circuit, and the other half by the Serializer and Deserializer unit. The reconfigurable resource usage, the number of Slices, Look-up Tables (LUT) [...] term 32 · 20 ns depends on the word length of the keys. This is exactly the measured running time of the Bitonic sorting process in hardware. We also measured the running time of sequential merge sort on a 2.8 GHz CPU for the same number of input keys and calculated its average running time. The ratio of the two running times shows that Bitonic sort is several orders of magnitude faster than the sequential merge sort (see Fig. 4).
However, the cost of communication between control CPU and reconfigurable hardware is still too large on our experimental board, and also on currently available reconfigurable platforms in general. When the communication cost is taken into account, the sequential merge sort on a 2.8 GHz CPU is still faster for sorting sequences of lengths 1 through 300 (see Fig. 5). But it can be estimated that for input sequences of length 400 or more, Bitonic sort would also be faster even when the communication cost between CPU and reconfigurable device is taken into account.
Thus, a communication channel with a higher bandwidth is needed to further increase the performance. This problem may be approached by designing a novel platform specialized for sorting purposes with adequate parallel data access. But for our purposes of testing the resource usage and usefulness of the Bitonic sorter and its adaptation to dynamically partially reconfigurable platforms, this communication bottleneck did not limit the goal of our approach. The experiments still validate that sorting in hardware can speed up the sorting process itself by several orders of magnitude.
C. Dynamic Bitonic Sorter
In the implementation of the dynamic Bitonic sorter, the size of the circuit realization is reduced enormously. In our implementation for sorting sequences of length 300, again 50% of the area is occupied by the Serializer, Deserializer, and Load and Restore unit, but now only 3% of the area is sufficient for the sorting circuit. In the static approach no space was left, but in the dynamic approach 47% of the resources are now freely available. Thus, the main goal of our new approach was achieved. The newly available space might be used by other application extensions or in order to sort longer sequences.
The sorting time is only twice the sorting time of the static Bitonic sorter. The reason for this delay is the hard macros in the partial regions, which store the data transferred between partial modules for one additional clock cycle in their flip-flops. Though the running time is doubled, it is still enormously faster than traditional approaches. However, the additional reconfiguration time is a bottleneck. Reconfiguration of only one partial module took 33 ms to 35 ms for one macro column, which is almost as long as the whole sorting. It can be assumed that future dynamically reconfigurable platforms will tackle the reconfiguration bottleneck.
In summary, as long as no reconfigurable devices with a much faster reconfiguration speed are available, the static approach for the application of the Bitonic sorter on reconfigurable devices is still the better choice and provides an enormous performance speedup, provided the communication infrastructure is chosen carefully.
