Abstract-This paper presents a merge sorter capable of merging thousands of streams in a single run, where the logic cost scales logarithmically with the number of streams merged. Moreover, we apply several performance tuning techniques, including speculative execution, deep pipelining and optimized communication schemes between processing elements. An end-to-end case study utilizing a Xilinx VC709 board merges 2048 sequences of the Graysort benchmark between two DRAMs at 9.5 GB/s, or 1024 sequences at 10.3 GB/s effective throughput.
I. INTRODUCTION
Sorting is one of the best studied computer science problems. Over recent years, many high-performance merge sorter implementations on FPGAs have been proposed. Most of them aim at high throughput: in [1] the authors achieve 26 GB/s for a 32-way merge sorter, [2] achieves 77 GB/s and [3] even achieves 126 GB/s. On the other hand, in [4] the authors focus on the number of merged sequences and merge 4096 sequences at once, but without providing the necessary I/O interface. This paper proposes a parallel merge sorter that can merge a large number of data sequences in a single run. The three key contributions of this paper are:
• We show that in practice the utility (i.e., the number of merged sequences) of a hardware merge sorter is often more important than throughput for large scale problems.
• We propose an optimized design merging up to 16,384 streams at once. It runs at 188 MHz and utilizes only 5.1K slices and 132 BRAMs.
• We demonstrate a case study targeting Graysort [5] on a Xilinx VC709 merging 1024 sequences of data between two DDR-3 memories at 80% of peak memory throughput.
A. Traditional FPGA Merge Sorting
Merge sort is a very promising approach for sorting in hardware, especially when considering external sorting. The naive implementation of a hardware sequence merger utilizes a balanced binary tree structure as shown in Fig. 1. The comparing cells use FIFO buffers to decouple operation from control and sort whenever the output FIFO is not full and both input FIFOs hold data. Each stage N of such a tree consists of 2^N FIFOs and 2^(N−1) compare units. An implementation of an e-way merge sorting tree therefore consists of 2e − 2 FIFOs and e − 1 comparators.
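The per-stage and total counts can be checked with a short calculation. The following Python sketch is illustrative only: the function name `tree_resources` is hypothetical, and it assumes that e is a power of two and that stages are numbered from 1 (next to the output) to log2(e) (next to the inputs).

```python
def tree_resources(e):
    """Count FIFOs and compare units of a balanced binary e-way merge tree.

    Assumes e is a power of two and that stage N holds 2**N FIFOs and
    2**(N - 1) compare units, with N running from 1 to log2(e).
    """
    if e < 2 or e & (e - 1):
        raise ValueError("e must be a power of two >= 2")
    stages = e.bit_length() - 1                      # log2(e)
    per_stage = [(n, 2 ** n, 2 ** (n - 1)) for n in range(1, stages + 1)]
    fifos = sum(f for _, f, _ in per_stage)          # equals 2e - 2
    comparators = sum(c for _, _, c in per_stage)    # equals e - 1
    return per_stage, fifos, comparators


per_stage, fifos, comparators = tree_resources(16)
for stage, f, c in per_stage:
    print(f"stage {stage}: {f} FIFOs, {c} compare units")
print(fifos, comparators)   # 30 15 for a 16-way tree
```

Both totals grow linearly with e, which is the resource limitation listed among the disadvantages of the traditional design.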
1) Traditional Merge Sorter Advantages:
• Uses only little control logic and is easy to implement
• Locality of communication, which leads to fast designs
• Small critical path: only a key compare and a 2:1 MUX
2) Traditional Merge Sorter Disadvantages:
• Linear resource complexity: limits utility for large e
• Costly input module to handle simultaneous requests
3) Observations:
When analyzing a traditional e-way merge sorter, we can observe the following characteristics which we will exploit for an improved implementation.
• All FIFOs are full in steady state. Then, whenever output occurs, only one empty token will be replaced from the entire higher level of the tree (dashed path in Fig. 1 ).
• Consequently in each level of the tree, only one compare unit is active as well as one (input) FIFO pull request and one (output) FIFO push request.
• A load unit in charge of filling the input FIFO buffers (above Stage 4 in Fig. 1) will receive at most one refill request per cycle, which implies that it does not need a complex multi-channel arbitration scheme.
It should be mentioned that these observations apply to decision trees in general and that much of the work presented in this paper can serve as a design pattern for implementing such trees efficiently on FPGAs.
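The observations can be reproduced with a small software model. The sketch below is not the hardware design: it assumes one-record buffers instead of FIFOs, and the names `MergeTreeModel` and `merge_all` are hypothetical. It counts how many compare units and load requests are active per tree level for every record that leaves the tree.

```python
from collections import Counter
import heapq
import random


class MergeTreeModel:
    """Binary merge tree with one-record buffers (heap-indexed nodes).

    Node 1 is the root; nodes e..2e-1 are the leaf buffers fed by the inputs.
    """

    def __init__(self, streams):
        self.e = len(streams)
        assert self.e >= 2 and self.e & (self.e - 1) == 0, "power of two"
        self.inputs = [iter(s) for s in streams]
        self.slot = [None] * (2 * self.e)
        self.active = Counter()
        # Sequential initialization: fill the leaves first, then inner nodes.
        for node in range(2 * self.e - 1, 0, -1):
            self._refill(node)

    def _refill(self, node):
        if node >= self.e:                        # leaf: request to load unit
            self.active["load"] += 1
            self.slot[node] = next(self.inputs[node - self.e], None)
            return
        self.active[node.bit_length() - 1] += 1   # one compare in this level
        left, right = 2 * node, 2 * node + 1
        a, b = self.slot[left], self.slot[right]
        # Ties go to the left input, which keeps the merge stable.
        child = left if b is None or (a is not None and a <= b) else right
        self.slot[node] = self.slot[child]
        if self.slot[child] is not None:
            self._refill(child)                   # only this child is refilled

    def pop(self):
        value = self.slot[1]
        self.active = Counter()
        self._refill(1)
        return value, dict(self.active)


def merge_all(streams):
    tree, out, busiest = MergeTreeModel(streams), [], 0
    while tree.slot[1] is not None:
        value, active = tree.pop()
        out.append(value)
        busiest = max(busiest, max(active.values()))
    return out, busiest


random.seed(1)
streams = [sorted(random.randrange(1000) for _ in range(20)) for _ in range(16)]
out, busiest = merge_all(streams)
print(out == list(heapq.merge(*streams)), busiest)
```

In this model every output triggers at most one compare per level and at most one load request, because a refill only ever descends into the child whose record was taken.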
B. Analyzing Sorters
Mass storage access in external sorting is expensive compared to local memory, thus it is of paramount importance to minimize the number of major runs. We use the term major run to express a sorting step where the entire problem is read from mass storage, processed and written back to mass storage. A major run may include one or more minor runs through local memory (e.g., DDR) that work on smaller problems. In major runs, local memory may be used as a buffer for mass storage accesses. Minor runs also utilize local memory to hold intermediate data, thus all data sorted in a series of minor runs is limited by local and on-FPGA memory capacity.
Assuming that reading or writing the entire problem takes time T, sorting needs at least the time 2 × T × (number of runs). Similarly, the energy needed for sorting depends mostly on the problem size and the number of major runs. Because sorting cannot deliver a result before all input records have been seen by the sorter at least once, external sorting of large problems cannot be accomplished in one major run. Consequently, most external sorting approaches aim for two major runs: a first run generates large sorted sequences, which are then merged (ideally) in just one final major run. Therefore, external sorting takes at least the time 4 × T (two major runs, each requiring a read and a write).
In order to sort large problems in just two runs (or a few runs), the goal is not so much to deliver high throughput per run (as long as we can saturate the mass storage or memory throughput) but to maximize the utility per run. By the utility of a sorter, we refer to the amount of work the sorter performs per run, which in the case of a merge sorter is the number of channels merged (i.e., e).
C. Analyzing Merge Sorters
An e-way merge sorter that operates at throughput p and merges, in multiple runs, n presorted sequences of a problem with total size d has a total sorting time of:
As p may be bound by the available I/O throughput, the actual throughput is the minimum of p and the effective DDR/SSD I/O throughput. Considering saturation of the available I/O, p effectively becomes a constant that is bound by the maximal effective throughput of the particular run configuration. While decreasing n and d obviously results in less time to sort, these values are defined by the user or the problem, and we treat them as constants (c). Therefore, considering sorting at saturated I/O, the effective time to sort relates to the sorter's properties as follows:
Thus, once the design becomes I/O bound, improving sort throughput does not translate into faster sorting in reality. The only remaining strategy is to improve sorting utility.
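As a hedged sketch (a reconstruction from the definitions of e, n, d, p and c given above, not necessarily the authors' exact formulation), and assuming that every run reduces the number of remaining sequences by a factor of e and takes time d/p, the two relations can be written as:

```latex
% e is the merge width (number of merged channels), not Euler's number.
t_{\mathrm{sort}} = \left\lceil \log_{e} n \right\rceil \cdot \frac{d}{p}

% With n, d and p fixed (collected in a constant c) and ignoring the ceiling,
% using \log_{e} n = \log n / \log e:
t_{\mathrm{sort}} \approx \frac{c}{\log e}, \qquad c = \frac{d \, \log n}{p}
```

Under these assumptions the only sorter property left in the second relation is e, which matches the conclusion that utility is the remaining lever once the design is I/O bound.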
In practice, we can use any sorter [6] that takes unsorted data and produces small sorted sequences, and then merge these within one or more minor runs. Any subsequent major run can turn the SDRAM into a buffer for hiding latencies and allowing large burst sizes with mass storage. This sorting pattern scales, and the possible amount of data to be sorted in m major runs is e^(m−1). In order to sort big data problems efficiently on FPGAs, merge sorters with a large number of merged sequences are highly beneficial for the subsequent runs. Even Intel's new FPGA devices with embedded fast HBM memory [7] would not help here, because for external sorting we are I/O bound by the mass storage devices or limited by the available (HBM) memory sizes. This would even hold for temporary storage in DRAM, which is significantly slower than HBM. Finally, improving sort utility minimizes the number of major and minor runs and is consequently an important strategy for saving energy.
II. LARGE UTILITY MERGE SORTER
2) Sorting Cell:
Assuming steady state, every stage k of the sorter represents 2^(k−1) sorter cells and 2^k FIFOs as shown in Fig. 1. While a single combined compare unit can be used for the sorter cells of a tree layer, we need at least two FIFO elements, as we need two read ports. This requires at least two random access memories for mapping the buffered data, as shown in Fig. 2.
3) Logical FIFOs:
This module provides an abstraction from the underlying hardware. It implements a parameterized number of FIFOs with parameterized depth and width, with a single read port and a single write port. The logical FIFO read and write operations are addressed through what we refer to as the channel ID. As we use powers of two, in each cell the most significant bit of the channel ID is used to choose the destination of the incoming record. We implemented this strategy using a distributed memory register file for the read and write pointers. This allows asynchronous reads, and pushing and popping are done by incrementing the corresponding pointer.
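A behavioural sketch of such a module is given below. It is a software analogy only: the class name `LogicalFifos` and its methods are hypothetical, a Python list stands in for the shared memory, and two integer lists stand in for the pointer register file.

```python
class LogicalFifos:
    """Many FIFOs mapped onto one memory and addressed by a channel ID."""

    def __init__(self, channels, depth):
        assert depth > 0 and depth & (depth - 1) == 0, "power-of-two depth"
        self.depth = depth
        self.mem = [None] * (channels * depth)   # stands in for the BRAM
        self.rd = [0] * channels                 # read pointers
        self.wr = [0] * channels                 # write pointers

    def _addr(self, channel, pointer):
        # Pointers count freely here; hardware would keep only a few bits.
        return channel * self.depth + pointer % self.depth

    def fill(self, channel):
        return self.wr[channel] - self.rd[channel]

    def push(self, channel, record):
        assert self.fill(channel) < self.depth, "logical FIFO full"
        self.mem[self._addr(channel, self.wr[channel])] = record
        self.wr[channel] += 1                    # push = increment pointer

    def peek(self, channel, offset=0):
        assert offset < self.fill(channel), "not enough records"
        return self.mem[self._addr(channel, self.rd[channel] + offset)]

    def pop(self, channel):
        record = self.peek(channel)
        self.rd[channel] += 1                    # pop = increment pointer
        return record


fifos = LogicalFifos(channels=4, depth=8)
fifos.push(2, "a")
fifos.push(2, "b")
fifos.push(0, "x")
print(fifos.pop(2), fifos.peek(2), fifos.pop(0))   # a b x
```

The `peek` method with an offset is included because a later optimization looks at the second element of a logical FIFO; whether the real module exposes it this way is an assumption.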
4) Synchronization:
In the traditional design (Fig. 1), each comparing element continuously polls for data, but this is infeasible when sharing ports among multiple FIFOs. We overcome this by generating only one refill request per stage of the tree. As soon as a pop operation in cell k occurs, the corresponding refill channel ID is sent to cell k+1 (see Fig. 5), ensuring that cell k will receive a fill command back.
5) Initial and Steady States:
The previously described scheme demands that data buffers are already full. We therefore incorporate a sequential initialization phase.
6) Communication and Stalling:
The proposed design does not use request buffers between sorting stages. The sorter cells themselves consume relatively few resources (less than 2K LUTs), which allows them to be placed closely together and consequently with small wiring delays. Not requiring request buffers between the sorting cells results in a constant request serve time, which leads to predictable behaviour. This allows us to explore trade-offs as we change the depth of the logical FIFOs.
In related work [6], a methodology for pipelining the compare-and-select element has been proposed where future records are speculatively precompared. Our proposed design cannot directly benefit from that approach because different intermediate channels may be read in consecutive clock cycles, which does not easily allow our sorter to look into future records. To overcome this, we implement a pipeline stage with a multiplexer and a channel comparison as shown in Fig. 4. The figure is simplified; in our implementation we actually use three speculative comparisons that run in parallel with the selector logic (including flag bits), and we pipeline the channel compare. Any consecutive requests accessing the same intermediate channel result in speculative peeking into the second elements of the corresponding logical FIFOs as shown in Fig. 3. The sorter unit then checks whether two consecutive sorting steps request data from the same logical FIFO channel and, in this case, multiplexes the precomputed compare result. All these modifications allow pipelining the accesses to the FIFO memories with relatively relaxed timing, allowing us to route to a larger number of BRAMs as we eventually implement thousands of logical FIFO channels in a single cell.
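The idea of replacing a second compare by a multiplexer can be illustrated in software. The sketch below is one possible reading, not the implemented circuit: which three comparisons are precomputed is an assumption (the text does not enumerate them), flag bits are ignored, and all function names are hypothetical.

```python
import itertools


def precompare(left, right):
    """Compare results computed ahead of time for one pair of logical FIFOs.

    left and right hold the first two records of each FIFO.
    True means "take the record from the left FIFO" (ties go left).
    """
    return {
        "current": left[0] <= right[0],
        "after_left_pop": left[1] <= right[0],    # assumed speculative compare
        "after_right_pop": left[0] <= right[1],   # assumed speculative compare
    }


def back_to_back_decisions(left, right):
    """Decisions for two consecutive requests to the same channel."""
    spec = precompare(left, right)
    first = spec["current"]
    # The second decision is only a selection between precomputed results.
    second = spec["after_left_pop"] if first else spec["after_right_pop"]
    return first, second


def sequential_reference(left, right):
    """Same two decisions, computed one after the other without speculation."""
    l, r, out = list(left), list(right), []
    for _ in range(2):
        take_left = l[0] <= r[0]
        out.append(take_left)
        (l if take_left else r).pop(0)
    return tuple(out)


mismatches = sum(
    back_to_back_decisions((l0, l1), (r0, r1))
    != sequential_reference((l0, l1), (r0, r1))
    for l0, l1, r0, r1 in itertools.product(range(4), repeat=4)
    if l0 <= l1 and r0 <= r1                      # FIFO contents are sorted
)
print(mismatches)   # 0
```

The exhaustive check over small sorted inputs shows that the multiplexed result agrees with the sequential compare, which is what allows the compare to be taken off the critical path for back-to-back requests.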
2) Variable Buffer Sizes:
The proposed design was implemented with sufficient logical FIFO depth (S = 8) such that stalling logic could be fully omitted. Considering randomly distributed data, the probability of two consecutive requests with the same channel ID in stage k is 2^(−2k), and if the depth of the logical FIFOs is S, then the total number of records in that stage is S · 2^k. Based on these observations, we propose an optimization consisting of cells with variable buffer sizes. We propose a parameterizable positioning of a special stage that decouples the cells that have a small probability of consecutive requests from those with a high probability (as presented in Fig. 5). The former cells are assigned smaller FIFO depths in order to save resources, while the latter cells have sufficient depth to prevent stalling. Stage K in Fig. 5 restricts consecutive requests within a programmable window such that cells N to K+1 are rendered safe even when omitting consecutive request detection logic.
3) Skewed Data: Previous work [1] proposes an optimization that targets stalling issues with skewed data but cannot guarantee stable sorting. Stable sorting is, however, essential for most practical applications such as databases. In the event of skewed data, the last proposed optimization can be omitted and sufficient logical FIFO depth provided, which ensures that the design will not stall for any skewed data pattern.
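To give a feeling for the possible savings, the following sketch sums S · 2^k over all stages for a uniform and for a variable depth assignment. The split position K = 6 and the reduced depth of 2 are hypothetical values chosen for illustration, and the reading that the stages above K (closer to the inputs) receive the smaller depth is an assumption.

```python
def buffered_records(n_stages, depth_of_stage):
    """Total records held in the logical FIFOs: sum of S_k * 2**k."""
    return sum(depth_of_stage(k) * 2 ** k for k in range(1, n_stages + 1))


N, K = 11, 6                       # 2048-way tree; K is a hypothetical split
uniform = buffered_records(N, lambda k: 8)
# Assumption: stages K+1..N get depth 2 instead of 8.
variable = buffered_records(N, lambda k: 8 if k <= K else 2)
print(uniform, variable)           # 32752 8944
```

Because the stages close to the inputs hold most of the records, reducing their depth dominates the total buffer requirement.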
III. EVALUATION AND CASE STUDY
We synthesized our sorters using Xilinx Vivado 18.1 with the options AreaOptimized High and Performance Explore, targeting a Xilinx VC709 board featuring an XC7VX690T-2 FPGA. First, we implemented multiple configurations in out-of-context mode to evaluate utilization and stall rates using a detailed DDR-3 memory model. We then carried out tests on real hardware for a case study targeting Graysort [5] on the Xilinx VC709 board.
A. Evaluation
1) Active rate:
To evaluate the proposed sorter, we ran detailed simulations of the sorter configured for 64-bit keys and multiple parameter values (e.g., K in Fig. 5). We used a random data generator with uniform distribution. Independent of the utility parameters of the sorter, we observe active rates higher than 99% when K > 5 and 100% when K = N. We observe that, for different configurations, the design needs between 16K and 33K clock cycles to initialize. This time is independent of the problem size and is a negligible overhead when merging eventually terabytes of data per run.
2) Throughput, Utility and scalability: As the proposed design implements a single-rate merge sorter, the achieved throughput is lower than that of previously proposed multirate sorters [1] [2] [3]. As explained in Section I, however, this is not necessarily a drawback, as those sorters can only merge a rather limited number of sequences (up to 32 sequences in [1] [2] [3]), which in turn means more runs to sort the same problem size. If we compare our solution with the previous best high utility merge sorter [4], as summarized in Table I, we see that the proposed optimizations in our sorter lead to a significantly higher throughput and the ability to achieve higher utility. Both sorters are single rate, but our proposed sorter achieves higher operating frequencies and throughput and also scales much better: our design achieves a higher operating frequency at e = 16384 than the previous best at only e = 512, as shown in Table I. Again, for external sorting, utility is commonly the key and not the throughput, but even the slowest sorter configuration with 16K channels will likely saturate a fast NVMe SSD (please note that for external sorting we need to read and write simultaneously).
Ultimately, a large utility, high throughput sorter can be implemented by running multiple instances of the high utility merge sorter in parallel and combining their resulting streams with a high-throughput design like in [3]. Instantiating the proposed e-way design K times, buffering the outputs in deep FIFOs to overcome skewed data, and feeding them into the input side of a K-way multirate merge sorter will result in a (K × e)-way merge sorter that produces up to K output elements per clock cycle.
The aim of the case study was to provide high utility merge sorting that reads its input from one DDR memory and stores the resulting sequence in another DDR memory.
Although the utilization of the proposed sorter is rather small, incorporating it into a full system requires substantial amounts of on-chip memory to buffer large SDRAM bursts. This is needed because reading from a large number of input sequences is eventually fully random. The first access to a random memory location incurs a high latency penalty, while consecutive sequential accesses are cheap. This means that current DDR memory technology performs poorly on random accesses. To achieve high effective speeds, the bursts have to be of significant size.
Our target hardware consists of two DDR3 modules that operate at 64-bit data width while running at 800 MHz, thus providing a theoretical maximal throughput of 12.8 GB/s per module. A Xilinx MIG core provides a 512-bit interface running at 200 MHz. We therefore decided to pack 7 × 800-bit records into 11 × 512-bit memory bursts, calling this a record block. We generate test data by using a 64-bit time-dependent self-seeded LFSR to fill one of the DDR modules. We use 80 bits of the data payload for a hash value of the rest of the record, so that we can verify on the fly at the output side of the sorter that records are intact and in the correct order.
It is noticeable that the sorter utilizes most of the available BlockRAMs, as just buffering two blocks of Graysort records per sequence in a configuration of 2K sequences already requires more than 2.8 MB of on-chip memory (> 600 BRAMs). The infrastructure incorporates numerous performance counters, both FSM dependent and trigger armed, that are read using Xilinx ILA.
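The figures in the two preceding paragraphs can be retraced with integer arithmetic. In the sketch below, the variable names are hypothetical, and counting a block RAM as 36 Kb including parity bits (4608 bytes) is an assumption about how the buffers are mapped.

```python
RECORD_BITS = 800            # one Graysort record (100 bytes)
BURST_BITS = 512             # width of the MIG user interface
RECORDS_PER_BLOCK = 7
BURSTS_PER_BLOCK = 11

payload_bits = RECORDS_PER_BLOCK * RECORD_BITS     # 5600 bits of records
block_bits = BURSTS_PER_BLOCK * BURST_BITS         # 5632 bits in memory
padding_bits = block_bits - payload_bits           # 32 unused bits per block

# DDR3 at 800 MHz transfers on both clock edges over a 64-bit bus.
ddr_peak = 800_000_000 * 2 * 64 // 8               # bytes per second
mig_peak = 200_000_000 * 512 // 8                  # bytes per second

# Two record blocks buffered per sequence, 2048 sequences.
buffer_bytes = 2 * (block_bits // 8) * 2048
brams = buffer_bytes / 4608                        # assumes 36 Kb BRAMs

print(payload_bits, block_bits, padding_bits)
print(ddr_peak, mig_peak)
print(buffer_bytes, round(brams))
```

Under these assumptions a record block wastes only 32 of 5632 bits, the 512-bit interface at 200 MHz matches the 12.8 GB/s of the module, and the input buffers alone come to roughly 2.88 MB.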
We measure the overall achievable sorting speed of such an end-to-end system on real hardware as well as the impact of memory access patterns on effective sorting throughput. Stalling the sorter externally is a result of full DDR job queues, which in turn means that the sorter fully utilizes the DDR bandwidth available for the corresponding access pattern. We evaluated three scenarios for Graysort high utility merge sorters representing three different read burst sizes (1, 2 and 4 blocks of records), while the write of the output is fully sequential. We implemented 2048 merged sequences, but we cannot physically store 4 blocks of records for 2048 sequences on-chip. That is why the first two setups merge 20,846 Graysort records for each of the 2048 sequences and the last setup merges 41,692 records for each of the 1024 sequences, resulting in 3.98 GiB in each case.
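The problem sizes of the setups can be verified as follows. The helper name `dataset` is hypothetical; the sketch assumes 100-byte records and the record block layout of 7 records in 11 × 512 bits described earlier.

```python
RECORD_BYTES = 100
RECORDS_PER_BLOCK = 7
BLOCK_BYTES = 11 * 512 // 8          # 704 bytes in DRAM per record block


def dataset(sequences, records_per_sequence):
    """Return (total records, payload bytes, blocks per sequence, DRAM bytes)."""
    total = sequences * records_per_sequence
    blocks, rest = divmod(records_per_sequence, RECORDS_PER_BLOCK)
    assert rest == 0, "sequence length should be a whole number of blocks"
    return total, total * RECORD_BYTES, blocks, sequences * blocks * BLOCK_BYTES


for sequences, records in ((2048, 20_846), (1024, 41_692)):
    total, payload, blocks, in_dram = dataset(sequences, records)
    print(sequences, total, f"{payload / 2**30:.2f} GiB", blocks, in_dram)
```

Both configurations contain the same 42,692,608 records. Each sequence is a whole number of record blocks, and the packed data occupies just under 2^32 bytes, which suggests the problem size was chosen to fill one memory module (assuming 4 GiB modules).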
2) Evaluation: We ran each setup a minimum of five times to obtain sufficient data. Table II presents the obtained results. Note that the presented throughput is the total data to sort divided by the total runtime. Through performance counters we observe for the last setup that the bottleneck of the system is the sequential writing to the second DDR module, which means we have reached the maximal write performance of the memory. The 1024-way setup achieves 81% of the theoretical throughput available to the SODIMM memories (12.8 GB/s).
This case study confirms the total sorting time estimation from Section I-C and shows how important utility is as long as the problem is bandwidth limited. Comparing an assumed 1024-way merge with any multirate merge sorter [1] [2] [3], all sorters achieve a throughput that saturates the available write bandwidth of ∼10.4 GB/s on the Xilinx VC709. Because the latter sorters merge only 32 sequences at a time, they need log_32 1024 = 2 major runs, while our sorter can execute this in a single run. Independent of problem size, our sorter is therefore about two times faster for the last 1024-way run.
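The run counts in this comparison follow from repeatedly dividing the number of sequences by the merge width. A minimal sketch, with the hypothetical helper `merge_runs` and integer arithmetic to avoid rounding issues with logarithms:

```python
def merge_runs(n, e):
    """Runs an e-way merger needs to reduce n sorted sequences to one."""
    if e < 2:
        raise ValueError("merge width must be at least 2")
    runs = 0
    while n > 1:
        n = -(-n // e)      # ceil(n / e) sequences remain after one run
        runs += 1
    return runs


for e in (32, 1024):
    print(e, merge_runs(1024, e))   # 32 -> 2 runs, 1024 -> 1 run
```

If every run is limited by the same write bandwidth, the ratio of run counts is also the ratio of merge times, which gives the factor of about two stated above.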
IV. CONCLUSION
In this paper, we proposed a high utility parallel hardware merge sorter. We implemented highly optimized and scalable designs, including novel approaches for pipelining and for variable buffer sizes. We evaluated different designs for merging up to 16384 sequences with up to 66% higher throughput compared to the previous best 4096-way merger implementation [4]. A case study targeting a Xilinx VC709 board merges 1024 (2048) sequences of the Graysort benchmark from one DDR memory to another and utilizes the DDR-3 memory at 81% (75%) of peak throughput. To the best of our knowledge, this is the first end-to-end implementation of a high utility sorter on an FPGA.
