Abstract-In 2011, we proposed a new receiver structure called Frequency domain linear MMSE filter with sphere detection for single-carrier frequency-division multiple access (SC-FDMA) uplink transmission. Frame error rate simulations and complexity estimations were used to define the most practical sphere detector algorithm for this receiver. High-level Synthesis (HLS) tools were used for comparing different architectures for the sphere detector. After 2011, the HLS design approach has gained more popularity and the tools have evolved. In this paper, HLS tools from different generations were used for implementing the same K-best list sphere detector. The results were compared and the overall experience of the optimization process and the evolution of the HLS tools was discussed. Additionally, the evolution of FPGA technology was discussed. In conclusion, the HLS tools have evolved into practical implementation tools even for high complex wireless communication algorithms. Moreover, 25% increase in throughput and 58% lower resource usage was achieved with the latest generation FPGA.
I. INTRODUCTION
Fourth generation (4G) wireless cellular systems [1] use single-carrier frequency-division multiple access (SC-FDMA) as the uplink transmission scheme instead of orthogonal frequency division multiplexing (OFDM) [2] . The reduced peakto-average power ratio of single-carrier transmission reduces the mobile transmitter cost by allowing cheaper power amplifiers [3] . In single-carrier transmission an equalizer is needed in the receiver to cancel the inter-symbol interference (ISI). Multiple-input multiple-output (MIMO) antenna configuration [4] , [5] is used in 4G systems to increase the peak data rates. Similar to any spatial multiplexing based MIMO transmission, a spatial equalizer is required in the receiver to cancel the interantenna interference (IAI). Both linear and nonlinear receiver structures have been considered for MIMO receivers with an emphasis on ones operating in OFDM systems, wherein ISI is not a problem. Different variants of sphere detector (SD) [6] , which calculate the maximum likelihood (ML) solution with reduced complexity, have received a lot of attention in the literature. For systems employing forward error control (FEC) coding, the soft output list sphere detector (LSD) [7] is needed.
In 2011, we proposed a new receiver structure Frequency domain linear MMSE filter with sphere detection for SC-FDMA uplink transmission and compared this to conventional frequency domain linear MMSE equalization with soft demodulation receiver structure [8] algorithms were considered for our receiver, namely the Kbest list LSD algorithm [9] and the selective spanning with fast enumeration (SSFE) algorithm [10] . The receiver structure and sphere detector are illustrated in Figure 1 . In [11] , we implemented the K-best LSD algorithm on a field-programmable gate array (FPGA) using an early version of Xilinx Vivado High-level Synthesis (HLS) tool to get a good understanding of its complexity. The HLS tool enabled the implementation and comparison of different macroarchitectures in a relatively short time. The architecture optimization was done in the C language level, which gave a clear benefit in terms of design time and effort. The area efficiency could have been further optimized with the conventional design approach, but HLS was well suited for this type of architecture evaluation. There are studies showing that the HLS tools can produce competitive quality of results compared to hand written register-transfer level (RTL) abstraction [12] . The HLS tools have been earlier used for MIMO detector FPGA implementations e.g. in [13] and for application-specific integrated circuit (ASIC) implementation e.g. in [14] .
In addition to results reported in [11] , we have implemented the same LSD algorithm blocks using different HLS tools from a year 2010 until 2017. This paper compares these implementation results and discusses the evolution of HLS tools and FPGA technology. Our first implementations were carried out using the Mentor Graphics Catapult C 2010 HLS tool. Later we adopted the AutoESL AutoPilot 2011 and currently we are using the latest version of the Xilinx Vivado HLS 2017. In this paper, the same C language input has been used with different tools. The original 40 nm Virtex 6 FPGA is not supported by the latest Xilinx tool anymore. Therefore, the latest implementations are targeted for the 28 nm Virtex 7 FPGA. Initially, the same macro-architecture, parametrization and FPGA optimizations (e.g. bit-width optimization and efficient use of embedded DSP blocks) were used for the implementations. The complexity and throughput results are compared. Additionally, the overall experience of the optimization process and the evolution of the HLS tools are discussed. The portfolio of currently available FPGAs is relatively large. Therefore, in addition to 28 nm FPGA implementation, the Vivado HLS 2017 tool was also used to implement the K-best LSD on the latest 20 nm and 16 nm FPGAs. The benefit of smaller silicon technology is discussed.
The rest of this paper is organized as follows. In Section II, we introduce the short history of the HLS tools used in this paper. Additionally, we compare the synthesis time and overall user experience of the tools. The evaluation of the HLS tools is divided into two sections. Section III discusses the evolution from Catapult C 2010 to AutoPilot 2011. Similarly, the Section IV discusses the evolution from AutoPilot 2011 to Vivado HLS 2017. The effect of FPGA technology as well as the effect of the built-in Vivado HLS 2017 implementation strategies are studied in Section V. The final conclusions are given in Section VI.
II. EVOLUTION OF HLS TOOLS AND SYNTHESIS TIME
The Behavioral Compiler introduced in 1994 by Synopsys can be considered as the first HLS product. Although, at that time, behavioral synthesis or algorithmic synthesis were more common terms. The real emerging of the HLS tools started in 2004 with the next generation HLS products. These tools accepted the popular C language as an input and enabled the required flexibility and ease of use. Mentor Graphics Catapult C and AutoESL AutoPilot are few examples of the HLS tools of this generation. In late 2010, the AutoESL Design Technologies Inc. was still a small independent HLS vendor. In 2011, Xilinx acquired AutoESL Design Technologies Inc. and the AutoPilot tool name was later changed into Vivado HLS.
The initial tool for our MIMO detector architecture comparison was the 2010 version of the Mentor Graphics Catapult C. However, due to very long synthesis time, the iterative design process was not efficient. The second evaluated tool was the AutoESL AutoPilot. It immediately impressed with remarkably fast synthesis. The disadvantage was the high number of bugs. After Xilinx acquiring the AutoESL, the AutoPilot soon became the Vivado HLS. The first Xilinx release of the tool was improved in the terms of usability. The disadvantage was the fact that other vendors FPGAs were not supported anymore. The latest version of Xilinx Vivado HLS tool is the Vivado HLS 2017. The average synthesis times of 8-best LSD MIMO detector implementation for these three tools are described in Table I . It should be noted that the desktop PC performance used to run the tools was not fully equal. Yet, the results give an overview of the user experience of different HLS tools used in this paper. The Catapult C and AutoPilot synthesis performance in terms of scheduling and resource usage were compared with a smaller design than the K-best LSD due to very long synthesis time of the Catapult C with the 8-best LSD design. Here, the small design was the non-iterative version of the loglikelihood ratio (LLR) calculation block of the 8-best LSD. The development time in the number of iterations vs. resource usage and processing time for 1200 subcarriers using Catapult C and AutoPilot are shown in Figures 2 and 3 , respectively. Tables II and III show the target latencies, achieved latencies and achieved resource usages for the Catapult C and AutoPilot, respectively. No target latency was set for the first iteration. It can be seen that the scheduling is very different with these tools. It is easier to achieve the target latency with the Catapult C. However, the resource usage is higher than that with the AutoPilot. From the user experience point of view, the Catapult C had a better schedule viewer and it showed the architecture view of all the loops and directives. Hence, tracing back to C code was easier. 
IV. AUTOPILOT 2011 VS VIVADO HLS 2017
The Catapult C 2010 achieved the target latency accurately if the design was relatively small. However, due to very long synthesis time with the complex 8-best tree search algorithm, the design process was not efficient. Therefore, the implementation of the K-best tree search algorithm was completed using the AutoPilot as described in Section III and reported in [11] .
In this paper, Vivado HLS 2017 tool was used to create reference implementation for the AutoPilot implementation. The complexity and throughput results maximizing the throughput are compared in Table IV . With AutoPilot, the 8-best LSD MIMO detector design including the sorting operation did not achieve the target throughput for the 4 × 4 64-QAM system. Sort-free architecture was exploited to achieve the target throughput of 347 Mbps. The so called sort-free architecture can be exploited to increase the performance of the SD tree search algorithms. However, it often increases the resource usage dramatically especially if the K is not smaller than the number of constellation points. With 2017 version of the Vivado HLS tool the target throughput was exceeded by 44% without the need for using the sort-free architecture.
V. FPGA TECHNOLOGY COMPARISON
Due to latest FPGA technology updates, the portfolio of the currently supported FPGAs by HLS tools is relatively large. In addition to 40 nm and 28 nm FPGA implementations, the Kbest LSD MIMO detector was also implemented on the 20 nm and 16 nm FPGAs to get good understanding of the technology influence on the results. resuls for the Xilinx high-end Virtex FPGA family. Figure  4 highlights the throughput gains. Xilinx mid-range Kintex FPGA family implementations were also carried out. However, those are not reported herein due to high correlation to Virtex results. Each of the implementations achieve the target throughput for LTE SC-FDMA 4 × 4 64-QAM uplink with a conventional K-best architecture. As it is seen in Table. V and Figure 4 , the latest FPGAs schedule with higher throughput and less resources. As an instance, the latest 16 nm implementation, has shown to achieve 25% increase in throughput in addition to approximately, 19% saving in LUT usage and 58% saving in FF usage in comparison to one with 28 nm process. This fact suggests that more complex algorithms with higher MIMO configuration orders or higher modulation rates could potentially be implemented with the latest FPGAs without massive parallelism or implementation complexity. In addition to capability of achieving a higher throughput, the more advanced FPGAs have shown a better power performance as well. Table VI demonstrate In order to improve the throughput of the implementation, Xilinx Vivado provides implementation strategies [15] . Considering the requirement, software tries different optimizations in placing and routing. However, each of them gives a different performance in terms of power. Table VII, presents performance of four different implementation strategies for Virtex 7 to achieve the maximum throughput. The first strategy (A: PerformanceRefinePlacement), increases the placer effort in the post-placement optimization phase and disables timing relaxation in the router. The second strategy (B: PerformanceNetDelaylow), compensates for the optimistic delay estimation and adds extra delay cost to distance and high fanout connections. Third strategy (C: AreaExplore), uses multiple optimization algorithms to get potentially fewer LUTs. The fourth strategy (D: Flow run post rout opt), enables physical optimizations in post implementation phase including routing. The result have shown that in addition to achieve the highest throughput, strategy D manages to perform a better performance in terms of power. Optimization strategy D results are reported in Table VI for Virtex 7. Similar iterative approach was exploited for all the other results.
VI. CONCLUSION
In this paper, the evolution of HLS tools was studied in two steps. The evolution from the early version of Catapult C to AutoPilot enabled significantly shorter synthesis time which is crucial for the iterative HLS design approach. Also lower resource usage was achieved with AutoPilot. The evolution from AutoPilot to the latest version of Vivado HLS has enabled even shorter maximum synthesis time. Moreover, more advanced built-in optimization strategies enabled higher maximum throughput achieving the required 64-QAM 4 × 4 MIMO LTE/LTE-A detection requirement easily.
The effect of FPGA technology was also studied. 25% higher throughput and 58% lower FF usage was archived by using the latest 16 nm FPGA over the 28 nm FPGA. Moreover, 57% lower dynamic power consumption was achieved with the 16 nm FPGA. Finally, the built-in Vivado HLS implementation strategies were compared. As a result, 22% lower power consumption was achieved with carefully selected strategy.
In conclusion, the HLS tools have evolved into practical implementation tools even for high complex wireless communication algorithms. While choosing between algorithms with different performance-complexity ratio, it should be noted that few tens of percentage difference in performance or complexity falls within the variation of FPGA technology selection.
