This technical report summarizes the fixed point implementation for lattice-reduction aided detectors. More important, this report illustrates the performance of coded LR aided detectors.
I. INTRODUCTION
With the evolution of wireless communication systems, the multiple-input multiple-output (MIMO) system has been adopted to provide higher data rates [1]. In addition, error control codes (ECC), e.g., Turbo codes [2] and low-density parity-check (LDPC) codes [3], are usually included in the system to enhance information reliability. The challenge in applying both MIMO and ECC to wireless systems lies in designing a reliable but low-complexity receiver.
The optimal receiver for coded MIMO systems uses a joint detector and decoder for the whole coded data block, which is extremely complex and infeasible in practical systems because of the long coded block length. Although decoupled detectors and decoders can significantly reduce the complexity, their performance is largely degraded compared to the optimal receiver. To balance complexity and performance, a receiver with iterative detection and decoding (IDD) is proposed in [4], where a separate soft-input soft-output (SISO) detector and SISO decoder achieve near-optimal performance by exchanging extrinsic information iteratively.
The optimal SISO detector under IDD for coded MIMO systems is the maximum a posteriori (MAP) detector, whose complexity is high, especially when the constellation size and/or the channel dimension are large. List MIMO detectors, such as the list sphere detector [4] and the list sequential detector [5], are an attractive choice because they allow a flexible tradeoff between performance and complexity. One key issue for a list MIMO detector is to generate, with low complexity, a list of candidates that contains the transmitted symbol vector. Both the way the list is found and the number of candidates in it directly affect performance and complexity, so it is desirable for the detector to obtain near-optimal performance using only a small number of candidates.
Recently, the lattice reduction (LR) technique has been proposed in [6], [7], and [8] to improve the performance of MIMO detectors by transforming the channel matrix into a better-conditioned matrix. It has been shown that LR-aided linear detectors can achieve the full diversity of the maximum likelihood (ML) receiver. Furthermore, combining LR with list MIMO detection, such as the K-best detector [9], maintains near-ML performance even with very low K values (the number of candidates), which means much lower detector complexity. LR-aided IDD algorithms with list MIMO detectors have been well studied in the literature [10].
However, there are few papers focusing on the fixed-point design for the whole LR-aided IDD system, which is a key step for practical hardware implementation in VLSI chips or FPGAs.
In this paper, we evaluate the LR-aided IDD performance under finite precision in operands and arithmetic operations, and design the detailed fixed-point implementation of the whole LR-aided IDD receiver under the criterion that the bit error rate (BER) performance of the fixed-point system degrades by no more than 0.2 dB compared to the corresponding floating-point system.
The rest of this paper is organized as follows. Section II presents the system model of the LR-aided IDD receiver for MIMO coded systems. Section III introduces the key algorithms used in the fixed-point LR-aided IDD receiver. Section IV provides the detailed fixed-point implementation for the whole LR-aided IDD receiver followed by the conclusion in Section V.
II. SYSTEM MODEL
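Consider a coded multiplexing transmission system depicted in Fig. 1. At the transmitter, a sequence of binary information bits b is randomly generated, encoded by the ECC, and interleaved. Then the coded sequence c is mapped into a symbol sequence s, where the constellation size is k bits/symbol. For a system with N transmit and M receive antennas, the MIMO transmission can be expressed as:
$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{w}, \qquad (1)$$
where $\mathbf{y}$ is the M × 1 received signal vector, $\mathbf{H}$ is the M × N channel matrix, $\mathbf{s}$ is the N × 1 transmitted symbol vector, and $\mathbf{w}$ is the M × 1 additive white Gaussian noise vector, whose covariance matrix is a scaled $\mathbf{I}_M$. We assume that the channel matrix H is time-invariant during a block that is longer than a symbol period and changes independently from block to block, and that it is known at the receiver but unknown at the transmitter.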
Report for Gatech ECE 8903
Qingsong Wen; GTID: 902537161
Next, we will focus on the LR-aided soft-output sphere decoding. Consider a coded multiplexing transmission system as depicted in Figure 2. Suppose a sequence of binary information bits b passes through the ECC and the interleaver, and the coded sequence c is then mapped into a symbol sequence s, where the constellation size is κ bits/symbol. At the receiver, an iterative detection and decoding structure is adopted to exchange extrinsic information between the soft-output detector and the soft decoder of the ECC. Given the above system model, the extrinsic information is calculated from the a posteriori probability (APP), which, for the ith bit of c, is approximated as in [20], where $S_{i,+1}$ represents the set of all κN-bit-long sequences with the ith bit equal to +1, and $S_{i,-1}$ is defined similarly. This new APP is then passed to the soft decoder of the ECC, which takes it as a priori information. Both complexity and performance depend on the size of the candidate list $C_s$. If the list of candidates is too long, the complexity is too high (near that of the exhaustive search); if the list is too short, the performance will be close to that of hard detectors. In the following, the CLLL-based low-complexity algorithms will be used to generate the lists of candidates [21].

At the receiver, the LR-aided IDD structure is adopted to exchange extrinsic information between the SISO detector and the SISO decoder. The extrinsic information $L_{E,t}$ is first calculated by the SISO detector based on the observation y, the channel H, and the a priori information $L_{A,t}$, which is fed back by the SISO decoder. Then, the extrinsic information from the detector is passed through the interleaver to the SISO decoder, which takes it as a priori information $L_{A,d}$ to obtain the information bits and to calculate new extrinsic information $L_{E,d}$, which is fed back to the detector.
Thus, the receiver is designed in an iterative way between the detection and decoding.
III. KEY ALGORITHMS IN LR-AIDED IDD RECEIVER

A. Lattice Reduction
In the MIMO transmission model in Eq. (1), the received signal vector y is the noisy observation of the vector Hs, which lies in the lattice spanned by the columns of H, since all the entries of s can be transformed to complex integers by shifting and scaling. In general, a lattice has more than one set of basis vectors. There exist some bases that span the same lattice as H but are closer to orthogonality than H. The process of finding a basis closer to orthogonality is called LR. Theoretically, finding an optimal basis (closest to orthogonality) in a lattice is computationally expensive. Thus, the goal of LR algorithms is to find a "better" channel matrix $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$, where T is a unimodular matrix, which means that all the entries of T and $\mathbf{T}^{-1}$ are complex integers and the determinant of T is ±1 or ±j. These restrictions on T ensure that the lattice generated by $\tilde{\mathbf{H}}$ is the same as that generated by H.

Generally, LR techniques involve preprocessing H to produce a reduced-lattice basis $\tilde{\mathbf{H}} = \mathbf{H}\mathbf{T}$. This factorization allows us to rewrite the system in Eq. (1) as
$$\mathbf{y} = \mathbf{H}\mathbf{T}\,(\mathbf{T}^{-1}\mathbf{s}) + \mathbf{w} = \tilde{\mathbf{H}}\mathbf{z} + \mathbf{w}, \qquad (2)$$
where $\mathbf{z} = \mathbf{T}^{-1}\mathbf{s}$.
Here we adopt the complex LLL (CLLL) algorithm [8], [11] to perform the LR on the channel matrix H. The detailed pseudo-code of the CLLL algorithm is summarized in Fig. 2.

[Fig. 2: Pseudo-code of the CLLL algorithm. Recoverable fragments: a size-reduction loop "for n = k − 1 : ..."; the step "Swap the (k−1)th and kth columns in $\tilde{\mathbf{R}}$ and T"; the integer-rounded division on Line 8; the calculation of the complex unitary matrix Θ on Line 16; the parameter δ, which controls the complexity of the algorithm.]

If H is singular, i.e., rank(H) < N, then H is no longer a basis; in this case, the size of H must be reduced before the CLLL algorithm is applied. The CLLL algorithm does not guarantee a reduction of the orthogonality deficiency (od) for every realization of H, but the new basis $\tilde{\mathbf{H}}$ has an upper bound on od that is strictly less than one.

In order to facilitate the fixed-point design while maintaining the performance, some modifications are adopted in [12]: the Relaxed Size Reduction Condition is defined for the Size Reduction part, the Complex Lovasz Condition is replaced by the Siegel Condition, the integer-rounded division (Line 8 in Fig. 2) is implemented with a single Newton-Raphson (NR) iteration, and the calculation of Θ (Line 16 in Fig. 2) is performed by the Householder CORDIC algorithm.
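To make the size-reduction and swap steps concrete, the following is a minimal floating-point sketch of a complex LLL reduction. It is not the pseudo-code of Fig. 2 and does not include the fixed-point modifications of [12]: it uses the plain complex Lovasz condition, recomputes the QR decomposition on every pass instead of applying the Θ update, and assumes δ = 0.75. The function name `clll` is hypothetical.

```python
import numpy as np

def clll(H, delta=0.75):
    """Complex LLL sketch: returns (H_red, T) with H_red = H @ T and T unimodular."""
    H_red = np.array(H, dtype=complex)
    N = H_red.shape[1]
    T = np.eye(N, dtype=complex)
    k = 1                                    # 0-based index of the column being processed
    while k < N:
        _, R = np.linalg.qr(H_red)           # recomputed each pass for clarity, not speed
        # Size reduction of column k against columns k-1, ..., 0.
        for n in range(k - 1, -1, -1):
            mu = R[n, k] / R[n, n]
            u = np.round(mu.real) + 1j * np.round(mu.imag)   # nearest Gaussian integer
            if u != 0:
                R[:, k] -= u * R[:, n]
                H_red[:, k] -= u * H_red[:, n]
                T[:, k] -= u * T[:, n]
        # Complex Lovasz condition: swap the two columns and step back when it fails.
        if delta * abs(R[k - 1, k - 1]) ** 2 > abs(R[k, k]) ** 2 + abs(R[k - 1, k]) ** 2:
            H_red[:, [k - 1, k]] = H_red[:, [k, k - 1]]
            T[:, [k - 1, k]] = T[:, [k, k - 1]]
            k = max(k - 1, 1)
        else:
            k += 1
    return H_red, T
```

Because only integer column operations and column swaps are applied, T keeps Gaussian-integer entries and a determinant of unit magnitude, which is the unimodularity property required above.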
B. List MIMO detector
For the list MIMO detector in the LR-aided IDD receiver, the authors of [10] proposed three methods: the fixed radius algorithm (FRA), the fixed candidates algorithm (FCA), and the fixed memory-usage algorithm (FMA). FRA, like FMA, combines sphere decoding [13] with LR and searches all possible candidates in the sphere. In this case the number of candidates is random, which complicates hardware implementation. FCA combines the K-best algorithm [14] with LR and applies an element-by-element search with a fixed number of points on each layer, which makes it suitable for hardware implementation.
For LR-aided linear hard detectors, LR is first applied to the channel matrix H, followed by linear equalization based on the reduced-lattice basis $\tilde{\mathbf{H}}$. For example, when the zero-forcing (ZF) equalizer is adopted, we get
$$\tilde{\mathbf{H}}^{\dagger}\mathbf{y} = \mathbf{z} + \tilde{\mathbf{H}}^{\dagger}\mathbf{w}, \qquad (3)$$
where $\tilde{\mathbf{H}}^{\dagger}$ denotes the pseudo-inverse of $\tilde{\mathbf{H}}$ and $\mathbf{z} = \mathbf{T}^{-1}\mathbf{s}$.
Then we need to obtain an estimate of z in Eq. (3), from which s is estimated through a one-to-one mapping; this implies that the list MIMO detector needs a candidate list of z. Different from the SD method in [13], here the sphere is built in the z-domain, centered at the LR-aided estimate, instead of in the s-domain, centered at the ZF estimate or another estimate from preprocessing.
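The LR-aided ZF hard decision described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the symbols are assumed to have the form s = scale·a − offset with a a Gaussian-integer vector (scale = 2 and offset = 1 + j give 4-QAM points ±1 ± j), and the function name `lr_zf_detect` and its parameters are hypothetical.

```python
import numpy as np

def lr_zf_detect(y, H_red, T, offset=1 + 1j, scale=2.0):
    """LR-aided ZF hard detection for symbols s = scale * a - offset (a Gaussian integer)."""
    N = T.shape[0]
    T_inv = np.linalg.inv(T)
    z_zf = np.linalg.pinv(H_red) @ y              # ZF estimate in the z-domain
    shift = offset * (T_inv @ np.ones(N))         # image of the constellation offset
    u = (z_zf + shift) / scale                    # lies near a Gaussian-integer vector
    u_hat = np.round(u.real) + 1j * np.round(u.imag)
    z_hat = scale * u_hat - shift                 # quantized estimate of z
    return T @ z_hat                              # one-to-one mapping back to s
```

With noise, the returned vector can fall outside the constellation, which is the validity problem that motivates searching for candidates directly in the s-domain.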
However, because of the matrix T, the constellation of z is not readily available: some candidates $\hat{\mathbf{z}}$ on the integer lattice may not generate valid candidates in the s-domain. One option is to find all possible z's and then perform the search, which incurs high computational complexity. Since our final goal is to obtain s rather than z, and the alphabet of s is known, we instead search directly for the candidate list in the s-domain, $C_s$, i.e., the valid symbol vectors s whose images $\mathbf{T}^{-1}\mathbf{s}$ lie inside the sphere centered at the LR-aided estimate $\hat{\mathbf{z}}$.

To further reduce the complexity, we can apply the QR decomposition to $\mathbf{T}^{-1}$, so that $\mathbf{T}^{-1} = \mathbf{Q}_T\mathbf{R}_T$; the distance $\|\hat{\mathbf{z}} - \mathbf{T}^{-1}\mathbf{s}\|$ then equals $\|\mathbf{Q}_T^H\hat{\mathbf{z}} - \mathbf{R}_T\mathbf{s}\|$, which has an upper-triangular structure.

Here, low-complexity tree-search methods can be performed starting from the bottom layer. To facilitate the hardware implementation, we select the FCA as the list MIMO detector in the LR-aided IDD receiver for the fixed-point design, because its breadth-first tree search has a fixed throughput, like the K-best method. Furthermore, FCA always includes the LR-aided hard decision in the candidate list to guarantee diversity. The detailed pseudo-code of the FCA algorithm is summarized in Fig. 3 [10].
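The breadth-first search with a fixed number of survivors per layer can be illustrated with a generic K-best search over an upper-triangular system. This sketch is not the FCA of Fig. 3 (for instance, it does not force the LR-aided hard decision into the list); the function name `k_best` and its arguments are hypothetical, with `R` playing the role of $\mathbf{R}_T$ and `v` the role of $\mathbf{Q}_T^H\hat{\mathbf{z}}$.

```python
import numpy as np

def k_best(R, v, alphabet, K):
    """Return up to K candidates (metric, s) with small ||v - R s||^2, R upper triangular."""
    N = R.shape[1]
    survivors = [(0.0, [])]                  # (partial metric, symbols of layers n+1..N-1)
    for n in range(N - 1, -1, -1):           # start from the bottom layer
        expanded = []
        for metric, tail in survivors:
            # Interference from the layers that are already decided on this path.
            interf = sum(R[n, n + 1 + i] * s for i, s in enumerate(tail))
            for a in alphabet:
                inc = abs(v[n] - interf - R[n, n] * a) ** 2
                expanded.append((metric + inc, [a] + tail))
        expanded.sort(key=lambda c: c[0])
        survivors = expanded[:K]             # keep a fixed number of paths per layer
    return [(m, np.array(s)) for m, s in survivors]
```

Since every layer expands at most K paths by the alphabet size, the work per layer is fixed, which is the fixed-throughput property that makes this style of search attractive for hardware.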
C. QR Decomposition
Common methods for the QR decomposition (QRD) include the Gram-Schmidt (GS) algorithm, Householder transformation (HT), and Givens rotation (GR).
In [15], [16], it has been shown that GR can be efficiently implemented through the Coordinate Rotation Digital Computer (CORDIC) and Triangular Systolic Array (TSA) algorithms. With the CORDIC algorithm, GR does not require norm and division operations, and with the TSA algorithm it can easily exploit parallelism when processing a large matrix. Furthermore, GR demonstrates higher numerical stability in VLSI implementations of the QRD process than the GS and HT methods. For these reasons, we select the GR method as the QRD algorithm in the LR-aided IDD receiver.
The QRD process under the GR algorithm with TSA and CORDIC [15] can be illustrated on a 2 × 2 complex matrix H as:
$$\mathbf{H} = \begin{bmatrix} A e^{j\theta_a} & B e^{j\theta_b} \\ C e^{j\theta_c} & D e^{j\theta_d} \end{bmatrix}, \qquad (6)$$
where j = √ −1, A, B, C, D represent the magnitudes, and θ a , θ b , θ c , θ d stand for the angles of the matrix entries. In order to get QRD of the H matrix, the H is first transformed by the unitary matrix Q 1 expressed by:
where the three angles θ 1 , θ 2 , θ 3 are calculated as follows:
After the above transformation, we can get an upper triangular matrix R 1 as: 
Next, the R 1 is transformed by another simple unitary matrix Q 2 expressed by:
So that we get the last R matrix of the QRD process as follows:
Based on the above procedure, for a 4 × 4 matrix H, the QRD can be implemented through the CORDIC-based systolic array depicted in Fig. 4. Three different types of cells are shown in Fig. 4: the delay unit (DU), the processing element (PE), and the rotational unit (RU) [15]. The DU delays the incoming data by the number of clock cycles that the neighboring cell takes to process the data, then delivers it to the PE when it is available. The PE, as the most complex unit, can operate in either vectoring mode or rotation mode. In vectoring mode, the PE calculates the three angles described in (8), stores them in the cell memory, and meanwhile computes the norm of the complex vector. The computed norm is passed to the east of the cell with a flag that requests the next PE to operate in vectoring mode. In rotation mode, the PE rotates the incoming complex vector by the angles stored in the cell memory, and passes the results from the north to the south port and from the west to the east port. Fig. 5 depicts the structure of the PE in both modes with data flows. Similarly, the RU has the same operation modes as the PE, but operates in vectoring mode only when a diagonal element of the channel matrix enters from the north port.
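The two-step procedure on a 2 × 2 matrix can be checked numerically with the sketch below. The exact forms of Q1, Q2, and the three angles did not survive in the text, so this code assumes one common three-angle construction: θ1 = arctan(C/A), θ2 = θa, θ3 = θc for Q1, followed by a phase rotation Q2 that makes the second diagonal entry real. It uses floating-point trigonometry in place of CORDIC iterations, and the function name is hypothetical.

```python
import numpy as np

def qrd_2x2_three_angle(H):
    """QRD of a 2x2 complex matrix via an assumed three-angle rotation; returns (Q, R)."""
    H = np.asarray(H, dtype=complex)
    A, C = abs(H[0, 0]), abs(H[1, 0])
    theta1 = np.arctan2(C, A)                # rotation angle that nulls the (2,1) entry
    theta2 = np.angle(H[0, 0])               # phase of the (1,1) entry
    theta3 = np.angle(H[1, 0])               # phase of the (2,1) entry
    c, s = np.cos(theta1), np.sin(theta1)
    Q1 = np.array([[ c * np.exp(-1j * theta2), s * np.exp(-1j * theta3)],
                   [-s * np.exp(-1j * theta2), c * np.exp(-1j * theta3)]])
    R1 = Q1 @ H                              # upper triangular, (2,2) entry still complex
    Q2 = np.diag([1.0, np.exp(-1j * np.angle(R1[1, 1]))])
    R = Q2 @ R1                              # real diagonal
    Q = (Q2 @ Q1).conj().T                   # so that H = Q @ R
    return Q, R
```

Both steps only remove phases and apply a real plane rotation, which is why they map naturally onto vectoring-mode and rotation-mode CORDIC operations.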
D. LLR Computing between the detector and the decoder
The extrinsic information L E,t shown in Fig. 1 is usually expressed by the log-likelihood ratio (LLR) of each transmitted bit as follows [10] :
where $C_s$ denotes the candidate list from the list MIMO detector in the LR-aided IDD receiver, $S_{i,+1}$ represents the subset of $C_s$ with the ith bit equal to +1, and $S_{i,-1}$ is defined similarly, so that $C_s = S_{i,+1} \cup S_{i,-1}$.
Both the complexity and the performance of the list MIMO detector depend on the size of the candidate list $C_s$. If the list of candidates is too long, the results will be close to those of the optimal MAP detector, but the complexity is too high (near that of the exhaustive search). On the other hand, if the list is too short, the performance will be degraded because of inaccurate $L_{E,t}$ values. Furthermore, the error of $L_{E,t}$ is especially large when the output list $C_s$ includes only candidates with $c_i$ equal to +1, or only candidates with $c_i$ equal to −1, which may result in very large values in Eq. (12) that would prevent the decoder from correcting the falsely detected data.
The undesirable effect of a small candidate list in the list MIMO detector can be reduced by LLR clipping [4], which limits the dynamic range of the LLR values so that the decoder can still correct erroneous data from the detector. The LLR clipping is defined as follows:
$$L^{clip}_{E,t}(c_i|\mathbf{y}) = \begin{cases} L_{E,t}(c_i|\mathbf{y}), & |L_{E,t}(c_i|\mathbf{y})| \le L_{max} \\ \mathrm{sign}\big(L_{E,t}(c_i|\mathbf{y})\big)\, L_{max}, & \text{otherwise} \end{cases} \qquad (13)$$
where $L^{clip}_{E,t}(c_i|\mathbf{y})$ is the clipped LLR and $L_{max}$ is the predefined maximum LLR magnitude for $L_{E,t}$. Besides improving the performance of the list MIMO detector, LLR clipping can also reduce the word-length of the fixed-point design and decrease the complexity of the hardware implementation.
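The interaction between a short candidate list and clipping can be sketched as follows. The code assumes the common max-log form of the list LLR with no a priori input, i.e., half the difference between the best scaled distances of the two bit hypotheses; the exact normalization of Eq. (12) is not recoverable here, so `noise_var` and the factor 0.5 are assumptions, and the function name `list_llr` is hypothetical. The default `l_max=8.0` follows the threshold selected later in this report.

```python
import numpy as np

def list_llr(y, H, candidates, bits, noise_var, l_max=8.0):
    """Clipped max-log LLRs from a candidate list (no a priori information).

    candidates: (L, N) symbol vectors; bits: (L, B) labels in {+1, -1}.
    """
    cand = np.asarray(candidates)
    bits = np.asarray(bits)
    # Scaled squared distance of every candidate to the observation.
    d = np.sum(np.abs(y[None, :] - cand @ H.T) ** 2, axis=1) / noise_var
    llr = np.empty(bits.shape[1])
    for i in range(bits.shape[1]):
        plus = d[bits[:, i] == +1]
        minus = d[bits[:, i] == -1]
        if plus.size == 0:                   # list holds only -1 hypotheses for bit i
            llr[i] = -l_max
        elif minus.size == 0:                # list holds only +1 hypotheses for bit i
            llr[i] = +l_max
        else:
            llr[i] = 0.5 * (minus.min() - plus.min())
    return np.clip(llr, -l_max, l_max)       # LLR clipping
```

When one hypothesis is missing from the list, the unclipped value would be unbounded, so the sketch assigns ±L_max directly, which is the situation the clipping is meant to handle.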
E. Turbo Decoding
The Turbo decoder contains two elementary MAP decoders connected serially through interleavers (π) and deinterleavers ($\pi^{-1}$), as shown in Fig. 6.
Each elementary decoder has three inputs: the systematic bit ($y_{ks}$), the parity bits from the component encoder ($y_{kp1}$ or $y_{kp2}$), and the extrinsic information from the other component decoder ($L(u_k)$), also known as the a priori information of the systematic bit. During Turbo decoding, the component decoders iteratively exchange the probabilities of each information bit, represented by LLRs, which refines the LLRs of the information bits and improves the decoding accuracy.
For the fixed-point implementation, we adopt the well-known Max-Log-MAP algorithm, which has nearly the same performance as the optimal MAP algorithm at much lower complexity [17]. For the Max-Log-MAP algorithm, the calculation in each constituent decoder can be summarized in the following parts:
[Fig. 6: General structure of a Turbo decoder.]

1. Branch Metric computing (BM)
2. Forward Recursion computing (FW)
where $\alpha_{k=0}(s = 0) = 0$, and $\alpha_{k=0}(s \neq 0) = -\infty$.
3. Backward Recursion computing (BW)
where $\beta_{k=N}(s = 0) = 0$, and $\beta_{k=N}(s \neq 0) = -\infty$.
4. LLR computing
5. Extrinsic information computing
In Eqs. (14)-(18), $u_k$ is the information bit that produces the transition from state s' to state s in the Turbo trellis. $L(u_k)$ is the a priori information, and $x_{kl}$ and $y_{kl}$ are the expected transmitted symbols and the actual received symbols, respectively. $L_c$ is the channel reliability, defined as $L_c = 2/\sigma^2$, where $\sigma^2$ is the average noise power.
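For reference, the five parts of the Max-Log-MAP algorithm are commonly written as below. This is the textbook form, assumed (not confirmed by the surviving text) to correspond to Eqs. (14)-(18). The symbols $\gamma_k$ and $L_e$ are names introduced here for the branch metric and the extrinsic output, and the sum over l is assumed to run over the symbols of one trellis branch.

```latex
\begin{align}
\gamma_k(s', s) &= \tfrac{1}{2}\, u_k\, L(u_k) + \tfrac{L_c}{2} \sum_{l} x_{kl}\, y_{kl} \\
\alpha_k(s) &= \max_{s'} \big[ \alpha_{k-1}(s') + \gamma_k(s', s) \big] \\
\beta_{k-1}(s') &= \max_{s} \big[ \beta_k(s) + \gamma_k(s', s) \big] \\
L(u_k \mid \mathbf{y}) &= \max_{(s', s):\, u_k = +1} \big[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \big]
  - \max_{(s', s):\, u_k = -1} \big[ \alpha_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \big] \\
L_e(u_k) &= L(u_k \mid \mathbf{y}) - L(u_k) - L_c\, y_{ks}
\end{align}
```

In this form every operation is an addition, a subtraction, or a maximum, which is what makes the algorithm well suited to a fixed-point datapath.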
IV. FIXED-POINT DESIGN FOR LR-AIDED IDD RECEIVER
In this section, the fixed-point design of the whole LR-aided IDD receiver is analyzed and determined based on the algorithms of the previous section. For the fixed-point simulation, let FP(iwl, fwl) be the finite representation of a wl-bit two's complement number, where fwl is the fractional wordlength and iwl is the integer wordlength including the sign bit, so that wl = iwl + fwl. In order to compare the practical fixed-point performance under different wordlengths with the ideal floating-point performance, all the simulations are based on the same system parameters, given in the following paragraph.
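The FP(iwl, fwl) format can be emulated in a floating-point simulation with a small quantizer such as the one below. The report only specifies two's complement with the sign bit counted in iwl; rounding to the nearest representable level and saturating on overflow are assumptions of this sketch, and the function name `quantize_fp` is hypothetical.

```python
import numpy as np

def quantize_fp(x, iwl, fwl):
    """Quantize x to FP(iwl, fwl): two's complement, iwl integer bits including sign."""
    step = 2.0 ** -fwl                        # weight of the least significant bit
    lo = -(2.0 ** (iwl - 1))                  # most negative representable value
    hi = 2.0 ** (iwl - 1) - step              # most positive representable value
    q = np.round(np.asarray(x, dtype=float) / step) * step   # assumed: round to nearest
    return np.clip(q, lo, hi)                 # assumed: saturate instead of wrapping
```

For example, FP(5, 3) covers the range from −16 to 15.875 in steps of 0.125, so an LLR clipped to a magnitude of 8 fits in it without saturation.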
In this paper, the LR-aided IDD receiver is applied in the i. 
A. LLR clipping between the detector and the decoder
To study the LLR clipping effect and to find the optimal clipping threshold of the LR-aided IDD receiver, we examined the BER performance under different clipping values, as shown in Fig. 7. The simulation results demonstrate that the performance of the system can be clearly improved by applying a proper LLR clipping threshold, whereas clipping values that are either too large or too small degrade the system performance. Based on the simulation, an LLR clipping threshold of $L_{max} = 8$ is an appropriate and simple choice, which is also consistent with the results in [18] and [4].
We also examined the effect of the number of iterations of the IDD and of the Turbo decoding on the system performance under the above selected clipping threshold, $L_{max} = 8$.

Fig. 9 shows that 16 bits are enough for the fractional wordlength in the QRD, since in this case both the data and the angles achieve an accuracy within 0.14%. In sum, FP(5, 16) and FP(4, 16) are suitable for the data and the angles in the QRD module, respectively.

For the QRD of the unimodular matrix $\mathbf{T}^{-1}$, the angles have the same properties as in the QRD of H, so the same FP(4, 16) is adopted. For the data part, because all the entries are Gaussian integers, we can reduce the fractional wordlength and increase the integer wordlength while keeping the total wordlength unchanged. Here we adopt FP(13, 8) for the data, with which the system performance is almost the same as that of the floating-point system in the following simulation. Besides, because the total wordlength is identical to that used for H, the same QRD hardware implementation can be used for both the unimodular matrix $\mathbf{T}^{-1}$ and the channel matrix H.

For the CLLL part, the fixed-point design mainly follows our earlier work in [12]. The fixed-point representations of some key parameters in CLLL are as follows: the integer bits for u, T, and the internal datapath of the Householder CORDIC are 11 bits, 9 bits, and 5 bits, respectively; the fraction bits for both Q and R are 13 bits; and the integer bits of R after size reduction and basis updating are 5 bits at most.
When the fixed-point design is applied only to the list MIMO detector in the LR-aided IDD receiver, following the above analysis, its performance compared with the floating-point system under LLR clipping is depicted in Fig. 10. The results show that the BER performance degradation of the fixed-point list MIMO detector is kept below 0.2 dB.

Fixed-point design for Turbo decoding has been well studied in the literature [19], [20], [21], and [22]. The most important parts of the fixed-point implementation of Turbo decoding are the BM, FW, and BW parts described in Section III-E. The fixed-point implementation in this paper is mainly based on the results in [22]. The bit widths we adopt for the BM, FW, and BW are FP(5, 3), FP(7, 3), and FP(7, 3), respectively, and the bit width for both the extrinsic information and the a priori information is FP(5, 3).

When the fixed-point design is applied only to the Turbo decoder in the LR-aided IDD receiver, following the above analysis, its performance compared with the floating-point system under LLR clipping is depicted in Fig. 11. The results show that the BER performance degradation of the fixed-point Turbo decoder is kept within 0.1 dB.
D. Fixed-point performance of the whole LR-aided IDD receiver
Based on the above finite-wordlength analysis for the MIMO detector and the Turbo decoder, and by adding the fixed-point design for the LLR information exchanged between the detector and the decoder, the simulation results for the whole fixed-point LR-aided IDD receiver show that its BER performance degradation is within 0.2 dB compared with the floating-point system. With these results, the LR-aided IDD receiver can be implemented straightforwardly in VLSI and FPGA hardware, which is also our planned next step.
