I. INTRODUCTION
In future 5 ℎ generation (5G) networks [1] [2], nonorthogonal signals will become a dominate technical direction with different applications gradually appearing. One promising 5G waveform, termed spectrally efficient frequency division multiplexing (SEFDM) [3] [4] , can improve spectral efficiency by compressing signal bandwidth compared to orthogonal frequency division multiplexing (OFDM). The generation of SEFDM signals has been fully studied and implemented in hardware such as [5] [6] . However, signal detection at the receiver is still challenging both in software modelling and hardware implementation. Maximum likelihood (ML) detector shows optimal performance but with high complexity. Its complexity reduced algorithm, termed sphere decoding (SD) [7] , shows attractive performance with reasonable complexity. It has been accepted in multiple input multiple output (MIMO) systems with very large scale integration (VLSI) hardware implementation in [8] . However, the random complexity limits its widely use. Its fixed complexity version, termed fixed sphere decoding (FSD) [9] , shows competitive performance while it is still limited to small size MIMO systems. Therefore, an efficient linear detector, which doesn't require Euclidean norm operations, seems to be a practical solution. The hardware design of the linear truncated singular value decomposition (TSVD) detector has been reported in [10] due to its low complexity and improved performance compared to zero forcing (ZF) detector. Moreover, theoretical work in [11] proves that the iterative detection (ID) outperforms the TSVD significantly. However, its complexity is analyzed based on counting the number of basic arithmetic operations which gives a poor estimate of hardware resource utilizations. The only way to get an accurate complexity estimate is to actually implement a system. But realizing a system in hardware is time consuming and sometimes not necessary. Therefore, this work shows, for the first time, a field programmable gate array (FPGA) RTL modelling of the ID detector.
II. SYSTEM MODEL AND ITERATIVE DETECTION
The discrete format of SEFDM signals is expressed as
where is the bandwidth compression factor defined as = Δ , where Δ denotes the frequency distance between adjacent sub-carriers, is the period of one SEFDM symbol.
[ 
where F is a × sub-carrier matrix with elements equal to the SEFDM complex sub-carriers
At the receiver after additive white Gaussian noise (AWGN), the signal model in matrix format is
where is an N-dimensional vector of demodulated symbols, C is an × correlation matrix defined by the multiplication of F and its conjugate F * and is an N-dimensional vector of AWGN noise samples correlated with F * . The principle of iterative detection has been studied in [11] . The basic idea is to cancel inter carrier interference (ICI) iteratively using a gradually reduced decision interval, Δ = 1 − / , where is the ℎ iteration and is the total iteration number. The interval is reduced gradually until zero in the iteration process and hard decision is made at the end. The iterative operation is expressed as
where e is an × identity matrix, is an N-dimensional vector of recovered symbols after iterations and −1 is an N-dimensional vector of estimated symbols after − 1 iterations. More details of the iteration operation are referred to [11] . The benefits of using the iterative detection scheme are demonstrated via constellation points in Fig. 1 . The red dots indicate matched filter (MF) results, which are interfered significantly. The black dots are much better due to the iterative removal of self-created ICI using the iterative detection scheme.
III. ANALYSIS ON THE FIXED-POINT PRECISION
Prior to the hardware modelling, Matlab simulation is required to get an initial result for the purpose of comparison. In a real hardware system, the cost of floating-point arithmetic is significant since it consumes more hardware resources and power than a fixed-point operation. Therefore, a fixed-point representation is favoured in hardware design. In terms of the Matlab simulation, the fixed-point representation is used since it is very close with an actual hardware result. In order to get the result, it is necessary to understand the composition of a fixed-point symbol. A fixed-point symbol is composed of three parts: the first part determines the sign (positive or negative); the second part is its integer part while the last stands for the decimal part. In this section, a concept termed 'precision' is defined as the number of bits represented in the decimal part.
The precision plays an important role in the performance and complexity of a system. Generally speaking, the higher precision, the better performance but at the expense of higher complexity. Therefore, this section firstly investigates the effects of various precisions and illustrates their performance through Matlab simulations. The error performance is tested by evaluating 10,000 4QAM-SEFDM symbols through an AWGN channel. Fig. 2 presents both fixed-point and floating-point results modulated with 4QAM symbols at precisions ranging from 4 bits to 7 bits. It is evident that 7 bits are sufficient for a fixedpoint representation to approach the floating-point result. It should be noted that there is a 0.5 dB degradation when the precision goes down to 5 bits. However, the 0.5 dB gap is negligible considering that a lower precision can save a lot of hardware resources. Therefore, in the following hardware modelling, 5 bits are used to be the default precision for a fixed-point representation.
IV. DESIGN ARCHITECTURE
The hardware modelling in this section is based on the first principle that a parallel and pipelined structure is adopted. This kind of architecture has a maximum throughput while the complexity level is relatively high.
The conversion from a floating-point representation to a fixed-point one is determined by the level of accuracy and the dynamic range [10] . The dynamic range is defined as the ratio of the maximum symbol amplitude to the minimum one. The accuracy of a symbol is decided by its decimal part. Therefore, the conversion work involves scaling the decimal part by multiplying with a scaling factor [10] . In practice, the scaling factor is limited by the resolution of a hardware component. In our design, it is determined by the available resources in the FPGA chip. Assuming sufficient hardware resources are available and the effects of different scaling factors have been emulated and illustrated in Fig. 2 where the 5-bit decimal representation is proved to be a compromising precision.
Noticing that there is no need to implement a product using a multiplier. This operation can be replaced by a simple shift and add operation. This approach brings resource/power savings and improved processing speed. Therefore, the scaling factor here is defined as 2 where is the number of bits standing for the resolution of a decimal part. For a binary representation, being multiplied by a scaling factor 2 indicates shifting all the binary bits left by bit positions with zeros inserted in the old positions. This shift operation is much simpler than a high precision multiplier, meanwhile, it is cost efficient and power efficient.
In this section, the length of one fixed-point symbol is set to be 8 bits where the first bit determines the sign (positive or negative) of a symbol; the next 2 bits represent the integer part and the last 5 bits (precision =5) stand for the decimal part. The software used here is Xilinx ISE and its programming language is Verilog HDL. Since the purpose of this hardware modelling is to test the functionality of the ID detector, therefore, system optimization is not considered. Throughout the modelling, a parallel design principle is followed leading to a high throughput at the cost of high resource utilizations.
The use of the first bit for the symbol sign is sufficient since only two options exist. A positive symbol indicates a bit '0' while a negative symbol is represented by a bit '1'. However, in terms of the integer part, two bits are used with an amplitude ranging from '0' to '3'. If the integer part of a number exceeds the defined maximum absolute amplitude '3', an overflow occurs leading to unexpected errors. The absolute amplitude of an ideal 4QAM symbol is '1', with small fluctuations from noise or interference, its absolute amplitudes maybe still within the range. Therefore, the use of two bits for the integer part has little effect on 4QAM symbols. The rest of bits are reserved for the representation of a decimal value ranging from '0' to '1'. The effect of decimal precision has been studied in the preceding section. Fig. 3 shows the RTL architecture of the ID detector. Seven modules including MUX 2 TO 1, ID FSM, INITIAL FSM, CORE, HOLD, IUPUT PORT and OUTPUT PORT are illustrated. The functions of each module are described in the following.
• MUX 2 TO 1. This component is responsible for controlling the source of incoming symbols. One source comes from the IUPUT PORT while the other one is the feedback estimates from the CORE. The signal detection in the rest of modules results in processing delays, therefore, MUX 2 TO 1 has to block the incoming symbols from the IUPUT PORT before the previous symbol is completely detected. A control signal termed 'SEL' coming from the ID FSM is used to switch the incoming symbols between the two modules.
This small component has two functions based on the output signal 'SEL'. The first function is to control the switching operation in the MUX 2 TO 1. The second one is to decide the symbol output time of the module HOLD where its output time is consistent with the symbol input time of the IUPUT PORT.
This module provides a system parameter termed uncertainty interval Δ [11] for the CORE. The uncertainty interval Δ is reduced iteratively until zero with the assistance of a counter inside. More information about the function of this parameter is referred to [11] .
This module is the key component that performs the iterative detection where a large amount of computations are required here. The uncertainty interval Δ is required to determine a decision boundary. After the first iteration, the estimated symbols derived from the CORE are fed back to the MUX 2 TO 1 waiting for the next operation. After receiving the control signal 'SEL' and the updated uncertainty interval Δ , the estimates stored inside the MUX 2 TO 1 are transferred back to the CORE for the second iteration. Therefore, from the first iteration to the last but one, based on the control signal 'SEL' and the uncertainty interval Δ , the estimates are fed back and fed forward between CORE and MUX 2 TO 1. At the last iteration, the estimates are directly transferred to the HOLD module and finally output.
• HOLD.
This module aims to hold or output detected symbols based on the control signal 'SEL'. Prior to the completion (i.e. the last iteration) of an signal detection, output symbols are blocked from this module. At the last iteration, according to the signal 'SEL', the symbols delivered from the CORE are output through this component. V. PERFORMANCE AND COMPLEXITY ASSESSMENTS This is the first attempt to model the ID detector in an FPGA RTL scenario. Therefore, at the first stage, a simple system is configured. Starting with a functionality evaluation, the number of sub-carriers is set to be 16; bandwidth compression factor is =0.8 and modulation scheme is 4QAM. The optimal bit precision is considered to be 5 bits, which is trade-off between performance and complexity. Multiplication can be replaced by using a shift and add operation for the purpose of resource/power saving. Thus, iteration number in this case is set to be 2 5 = 32, which is larger than a normal one (i.e. 20). Hence, the performance can be guaranteed.
In the design of practical systems, an M-dimensional complex computation can be simplified through the real valued decomposition (RVD) where an M-dimensional complex matrix/vector can be decomposed into an equivalent 2M-dimensional real matrix/vector. Therefore, (3) can be converted to
The timing performance and resource utilization of the ID detector in a hardware modelling environment are presented in Table. I and Table. II, respectively. It is noted that the resource utilization of DSP48E1 is 100%. DSP48E1 is a hard core in the FPGA chip. This element can realize multiplications between two real or complex numbers in a more optimized way. The function of this element can be replaced using slice LUTs, which will save a lot of hard cores in the FPGA chip. One concern about the ID hardware modelling is its iterative architecture. Employing parallel processing architecture consumes a large amount of hardware resources. Therefore, reusing hardware resources is quite important. The second concern is the processing delay. According to the aforementioned modelling, one iteration requires 5 clock cycles. This latency leads to a pipelined delay for the ID detector. With the increase of iteration numbers, the processing delay becomes more significant. Thus, reducing processing time for each symbol or processing more symbols simultaneously would be crucial for a practical ID detector.
Hardware RTL modelling results are shown in Fig. 4 where the screenshots of input symbols and detected symbols are illustrated together with the theoretical result generated from Matlab fixed-point simulations. In the first inset (right top) of Fig. 4 , due to the use of 4QAM modulation scheme, one complex SEFDM symbol (16 aligned 4QAM symbols) can be decomposed into 32 real symbols using the RVD. The first 16 symbols represent real parts while the rest of symbols stand for imaginary parts. Each original floating-point symbol has been converted to its corresponding fixed-point symbol multiplying with 2 5 = 32 (5 bits for the decimal resolution). In other words, the integers illustrated in the first inset are the symbols after shifting left by 5 bit positions with zeros inserted in the old positions. After a predefined iteration number, the detected symbols are illustrated in the second inset (bottom). The value of '-32' indicates '-1' and '32' represents '1'. The third inset (left top) presents one detected SEFDM symbol through Matlab fixed-point simulation which emulates an equivalent hardware modelling. Through a careful comparison, it is evident that the two results are identical indicating a successful hardware modelling.
To verify the effectiveness of the RTL modelled iterative detector, a co-simulation methodology is employed as shown in Fig. 5 . Source data is generated, modulated, AWGN distorted and demodulated in the Matlab environment. Two copies of the demodulated symbols are fed to the RTL environment and Matlab environment, respectively. Xilinx ISE simulator within the RTL environment generates fixed-point results while Matlab provides floating-point results. BER is calculated for each result and two BER curves are overlapped in the same figure to evaluate their similarities. Fig. 6 shows the overlapped BER performance of RTL fixed-point result and Matlab emulated fixed/floating-point results. As mentioned before, considering the trade-off between performance and complexity, 5-bit precision is employed. Thus, a performance gap, approximately 0.5dB at / =1e-3, exists between the theoretical floating-point result and its fixed-point one. However, there is a 0.3dB gap between the Matlab fixed-point result and RTL fixed-point one at / =1e-3. This is due to the overflow issue mentioned in section III. This challenge exists in the RTL emulation rather than the Matlab simulation.
VI. CONCLUSIONS
This paper studied bit precision on the performance of an iterative SEFDM detector. The trade-off between bit precision and system performance has been considered. An optimal design should reduce the complexity by minimizing the bit precision while meeting the BER performance requirement. Due to the attractive performance and relatively low complexity, a hardware model of the iterative detector is investigated here for the purpose of future practical applications. The theoretical throughput of the modelled ID detector can reach 33.37 Mbit/s. Furthermore, the model of the ID detector is verified through a co-simulation methodology. The RTL modelling result is similar to the theoretical fixed-point Matlab simulation results confirming the correctness of the hardware model at a pre-implementation stage.
