Abstract-In this paper, we consider the low density parity check (LDPC) coded multi-input multi-output (MIMO) system with iterative detection and decoding (IDD). Since the traditional frame-by-frame receiver scheme suffers from a huge decoding delay, we propose an efficient scheme with a shuffled structure between the demapper and decoder, which adopts the group vertical shuffled belief propagation (BP) algorithm. The proposed shuffled iterative receiver converges faster and significantly reduces the delay introduced by the IDD process. Simulation results demonstrate that our proposed shuffled iterative receiver exhibits a signal-to-noise ratio gain of several tenths of a dB in comparison to the existing schemes, while requiring a much lower average number of iterations for the IDD process.
I. INTRODUCTION
Many receiver schemes have been designed to approach the channel capacity of multi-input multi-output (MIMO) systems. In particular, the receivers that adopt an iterative detection and decoding (IDD) structure [1] , [2] are capable of closely approximating the optimal joint detection and decoding in an iterative fashion and, therefore, achieving excellent performance while maintaining tractable complexity. An IDD receiver consists of a soft detector/demapper and a soft decoder. The demapper estimates the log likelihood ratios (LLRs) of the encoded bits, which serve as the input of the decoder. Then the decoder generates a posteriori LLRs and feeds back the extrinsic information to the demapper. This iterative process is repeated until the procedure converges or the preset maximum number of iterations is reached.
Low density parity check (LDPC) codes are a class of linear block codes with near-Shannon-limit performance. They have been widely considered as forward error correction (FEC) codes in the IDD schemes for MIMO systems [3]-[7]. In [3], the decoder exchanges the extrinsic information with the demapper frame by frame once per $l_c$ decoding-loop iterations. In the process, the check node messages are either reset to zero or left unchanged after each demapper-decoder iteration, which are referred to as the resetting and non-resetting algorithms, respectively. The non-resetting algorithm with $l_c = 1$ is the traditional frame-by-frame scheme commonly used in LDPC-coded MIMO systems. However, such LDPC-coded MIMO systems suffer from the drawbacks of high computational complexity and severe iteration delay.
Shuffled decoding was first proposed in the turbo-decoding field to improve the convergence speed. The scheme proposed in [8] extends shuffled decoding to reduce the delay of the demapper-decoder iteration for bit-interleaved coded modulation with iterative demapping (BICM-ID) in single-input single-output systems. However, the number of demapping units required remains large, as it equals the parallel order of the decoder. This may lead to a prohibitive computational complexity for high-order modulation, which makes the scheme unsuitable for MIMO systems.
In this paper, we develop an efficient IDD scheme with a shuffled structure between the demapper and decoder for LDPC-coded MIMO systems. As usual, the proposed shuffled iterative receiver consists of a soft demapper, a bit-wise interleaver and an LDPC decoder. However, our decoder adopts a semi-parallel structure in which the extrinsic information generated in each decoding cycle is fed back to the demapper as the a priori information immediately, instead of waiting for the whole code frame to be decoded. The bit-wise interleaver is carefully designed to guarantee that the bits fed back by the decoder in each cycle are mapped onto several intact symbol vectors. The number of demapping units required by our shuffled iterative receiver equals the number of these symbol vectors, which is much smaller than that of the scheme proposed in [8]. We also propose a partial feedback of decoded bits, which offers a flexible performance-complexity trade-off. Based on a well-designed schedule, our scheme enjoys a low iteration delay as well as a relatively low complexity. Simulation results show that the proposed shuffled iterative receiver exhibits a gain of several tenths of a dB in the signal-to-noise ratio (SNR) in comparison to the non-iterative scheme and the resetting algorithm given in [3], while requiring a much lower average number of iterations.
II. BACKGROUND

A. System Model
The LDPC-coded MIMO system with $N_t$ transmit antennas and $N_r$ receive antennas is considered, in which the interleaver and de-interleaver are denoted by $\Pi$ and $\Pi^{-1}$, respectively. As shown in Fig. 1, the $K$ source bits are encoded by an LDPC encoder of code rate $R_c$ into a codeword of length $N$, where $K = N \cdot R_c$. The coded bits after passing through the interleaver are grouped into vectors of length $K_b = m \cdot N_t$, and each bit vector is mapped onto a symbol vector $\mathbf{s} \in \mathbb{C}^{N_t \times 1}$ whose entries are chosen from a complex-valued constellation $\mathcal{A}$, where $|\mathcal{A}| = 2^m$ and $m$ is the number of bits per constellation symbol. The received signal $\mathbf{y} \in \mathbb{C}^{N_r \times 1}$ is given by

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \qquad (1)$$

where $\mathbf{H} \in \mathbb{C}^{N_r \times N_t}$ is the MIMO channel matrix and $\mathbf{n} \in \mathbb{C}^{N_r \times 1}$ denotes a complex-valued additive white Gaussian noise (AWGN) vector with covariance matrix $\sigma^2 \mathbf{I}_{N_r}$. We assume a quasi-static Rayleigh flat fading environment, and the entries of $\mathbf{H}$ are independent and identically distributed (i.i.d.) complex-valued Gaussian variables with zero mean and a variance of 0.5 per dimension. We further assume that the receiver has perfect knowledge of the channel matrix $\mathbf{H}$.
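The system model above can be illustrated with a short simulation sketch. It is only a minimal illustration under stated assumptions: the Gray-labelled, unnormalised square QAM mapping and the helper names `gray_pam`, `map_symbol_vector`, `rayleigh_channel` and `receive` are hypothetical choices made here, not taken from the paper.

```python
import math
import random


def gray_pam(bits):
    """Map bits to a Gray-labelled PAM level in {..., -3, -1, 1, 3, ...} (assumed labelling)."""
    idx, prev = 0, 0
    for b in bits:
        prev ^= b                      # Gray-to-binary conversion, bit by bit
        idx = (idx << 1) | prev
    return 2 * idx - (2 ** len(bits) - 1)


def map_symbol_vector(bits, m, n_t):
    """Map K_b = m * N_t bits onto a symbol vector s with N_t square-QAM entries."""
    assert len(bits) == m * n_t and m % 2 == 0
    half = m // 2
    s = []
    for a in range(n_t):
        chunk = bits[a * m:(a + 1) * m]
        s.append(complex(gray_pam(chunk[:half]), gray_pam(chunk[half:])))
    return s


def rayleigh_channel(n_r, n_t, rng):
    """Draw H with i.i.d. zero-mean complex Gaussian entries, variance 0.5 per dimension."""
    std = math.sqrt(0.5)
    return [[complex(rng.gauss(0.0, std), rng.gauss(0.0, std)) for _ in range(n_t)]
            for _ in range(n_r)]


def receive(H, s, sigma2, rng):
    """Return y = H s + n, where n has covariance sigma2 * I (sigma2 / 2 per real dimension)."""
    std = math.sqrt(sigma2 / 2.0)
    return [sum(h * x for h, x in zip(row, s))
            + complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
            for row in H]
```

For a 2 × 2 system with 16-QAM ($m = 4$), each call to `map_symbol_vector` consumes $K_b = 8$ interleaved coded bits.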
The receiver performs IDD as illustrated in Fig. 1 . For each demapper-decoder iteration, the demapper calculates the extrinsic information L 
B. Soft-Input Soft-Output Demapper
Both the optimal maximum likelihood (ML) demapper [1] and the suboptimal K-BEST sphere decoder demapper [9] are considered. The demapper computes the extrinsic information for each coded block of bits based on the received vector y and the a priori information L 
where $B_{i,0}$ and $B_{i,1}$ denote the sets of the candidate symbol vectors with $b_i = 0$ and $b_i = 1$, respectively. All the possible transmit vectors are considered by the ML demapper, leading to a computational complexity that increases exponentially with $K_b = m \cdot N_t$. By contrast, for the K-BEST demapper, a small set of candidate vectors is generated by a breadth-first tree search that keeps only the best $K$ candidates at each level, and consequently the complexity is reduced. It should be noted that if either $B_{i,0}$ or $B_{i,1}$ is empty, no information is obtained regarding one of the two hypotheses of this bit. In such a case, the output LLR is clipped to a constant value of $+l_{\mathrm{clip}}$ or $-l_{\mathrm{clip}}$, depending on which set is empty. For both the ML demapper and the K-BEST demapper, the candidate transmit vectors and the corresponding Euclidean distances $\|\mathbf{y} - \mathbf{H}\mathbf{s}\|$ are stored for the iterative operation.
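The role of the candidate sets and of the clipping rule can be sketched as follows. This is an illustrative sketch only: it uses the common max-log approximation, assumes the LLR convention $L = \ln[P(b=0)/P(b=1)]$, and takes squared Euclidean distances as input, none of which is confirmed by the text above. The function name `maxlog_extrinsic` and the default clipping value are hypothetical.

```python
def maxlog_extrinsic(candidates, la, sigma2, l_clip=8.0):
    """Max-log extrinsic LLRs for one symbol vector (assumed convention: LLR > 0 favours bit 0).

    candidates: list of (bits, dist2) pairs, where bits is a tuple of K_b bits and
                dist2 is the stored squared distance ||y - H s||^2 of that candidate.
    la:         a priori LLRs of the K_b bits.
    l_clip:     hypothetical clipping magnitude used when one hypothesis has no candidate.
    """
    out = []
    for i in range(len(la)):
        best = {0: None, 1: None}
        for bits, dist2 in candidates:
            # Metric of the candidate: channel term plus a priori term of the bits equal to 1.
            mu = -dist2 / sigma2 - sum(b * l for b, l in zip(bits, la))
            v = bits[i]
            if best[v] is None or mu > best[v]:
                best[v] = mu
        if best[1] is None:        # B_{i,1} is empty: only b_i = 0 is supported
            out.append(l_clip)
        elif best[0] is None:      # B_{i,0} is empty: only b_i = 1 is supported
            out.append(-l_clip)
        else:                      # a posteriori LLR minus the a priori LLR
            out.append(best[0] - best[1] - la[i])
    return out
```

Because the candidate list and distances are stored, later demapper-decoder iterations only need to re-evaluate the a priori term with the updated `la`.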
C. LDPC Decoder
A group vertical shuffled belief propagation (BP) algorithm [10] is adopted at the decoder to speed up the convergence of decoding. We group all the bit nodes into $G$ layers uniformly and perform the vertical shuffled BP algorithm layer by layer in each iteration. The decoding process is summarized in Algorithm 1.
Algorithm 1 Group Vertical Shuffled BP Algorithm
For iteration $t$ ($t = 1, 2, \cdots, t_{\max}$) and layer $g$ ($g = 1, 2, \cdots, G$), perform the following operations on each bit node $b$ that belongs to layer $g$.
Horizontal process
Updating process
The decoding process of a layer is referred to as a decoding cycle. In each decoding cycle, the decoder generates the a posteriori LLRs of $P = N/G$ bits. The initialization, stopping criterion test and output steps remain the same as those of the standard BP algorithm [11].
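The layer-by-layer schedule can be sketched in a few lines of Python. Since the update equations of Algorithm 1 are not reproduced in this copy, the sketch assumes the standard tanh-rule check-node update and the LLR convention $L = \ln[P(b=0)/P(b=1)]$; the function name `group_shuffled_bp` and its arguments are hypothetical, and the code is meant as an illustration rather than as the exact algorithm of [10].

```python
import math


def group_shuffled_bp(H, llr, layers, t_max=10):
    """Group vertical shuffled BP sketch. H: parity-check rows (0/1 lists);
    llr: channel LLRs; layers: list of lists of bit indices (one list per layer)."""
    n_chk, n_bit = len(H), len(llr)
    checks_of = [[l for l in range(n_chk) if H[l][b]] for b in range(n_bit)]
    bits_of = [[b for b in range(n_bit) if H[l][b]] for l in range(n_chk)]
    z = {(l, b): llr[b] for l in range(n_chk) for b in bits_of[l]}  # bit-to-check messages
    eps = {key: 0.0 for key in z}                                   # check-to-bit messages
    post = list(llr)
    hard = [0 if q >= 0 else 1 for q in post]
    for _ in range(t_max):
        for layer in layers:                      # one decoding cycle per layer
            # Horizontal process: refresh check-to-bit messages for the bits of this layer,
            # using the most recent bit-to-check messages of all other bits.
            for b in layer:
                for l in checks_of[b]:
                    prod = 1.0
                    for b2 in bits_of[l]:
                        if b2 != b:
                            prod *= math.tanh(z[(l, b2)] / 2.0)
                    prod = max(min(prod, 1.0 - 1e-12), -(1.0 - 1e-12))
                    eps[(l, b)] = 2.0 * math.atanh(prod)
            # Updating process: a posteriori LLRs and bit-to-check messages of this layer.
            for b in layer:
                post[b] = llr[b] + sum(eps[(l, b)] for l in checks_of[b])
                for l in checks_of[b]:
                    z[(l, b)] = post[b] - eps[(l, b)]
        hard = [0 if q >= 0 else 1 for q in post]
        if all(sum(hard[b] for b in bits_of[l]) % 2 == 0 for l in range(n_chk)):
            break                                 # all parity checks satisfied
    return hard, post


# Toy usage with a (7,4) Hamming code and one unreliable bit (illustrative values).
H_toy = [[1, 1, 1, 0, 1, 0, 0],
         [1, 1, 0, 1, 0, 1, 0],
         [1, 0, 1, 1, 0, 0, 1]]
hard_toy, post_toy = group_shuffled_bp(H_toy, [-0.5, 2, 2, 2, 2, 2, 2],
                                       layers=[[b] for b in range(7)])
```

Messages updated in one layer are used immediately by the following layers of the same iteration, which is where the faster convergence relative to the flooding schedule comes from.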
The group shuffled decoding is well suited to quasi-cyclic LDPC (QC-LDPC) codes, whose parity-check matrix is composed of circulant matrices and null matrices of the same size $q \times q$. We can simply set $G = q$, so that the $g$-th layer contains the $g$-th bit of each sub-matrix, where $g = 1, 2, \cdots, G$. In this paper, we use a QC-LDPC code as an example, but other kinds of LDPC codes can also be supported.
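The layer construction for a QC-LDPC code can be written down directly. The following minimal sketch assumes that bit indices are 0-based and that the $j$-th block column covers bits $jq, \ldots, jq + q - 1$; the function name `qc_layers` is hypothetical.

```python
def qc_layers(n, q):
    """Group the n bit nodes of a QC-LDPC code into G = q layers.

    Layer g (0-based here) collects the g-th bit of every q-wide block column,
    so each layer holds P = n / q bits.
    """
    if n % q != 0:
        raise ValueError("code length must be a multiple of the sub-matrix size")
    blocks = n // q
    return [[j * q + g for j in range(blocks)] for g in range(q)]


# Parameters quoted later in the paper: N = 1944, q = 81.
layers = qc_layers(1944, 81)
```

With these parameters the construction yields $G = 81$ layers of $P = 24$ bits each.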
D. Iterative Operation Between Demapper and Decoder
The traditional MIMO IDD receiver performs the demapper-decoder iteration in a frame-by-frame schedule, which means that the extrinsic information generated by the decoder can only be fed back to the demapper after the entire code frame has been decoded. In such a frame-by-frame schedule, the decoder and demapper work in turn, and each waits for the other to complete its operations on an entire code frame, which leads to a huge iteration delay. The long delay of traditional frame-by-frame schemes severely limits the effective throughput of the system. Reducing the IDD delay within this schedule requires a large number of demapping units, which in turn increases the complexity considerably.
For example, the message passing algorithm proposed in [8] extends the idea of shuffled decoding to exchange information between the demapper and decoder efficiently, utilizing the parallelism of the LDPC decoder and a partial update strategy of the demapper. In each sub-iteration, the LLRs of the bits involved are calculated by the demapper employing the existing a priori information, and then the decoder generates the extrinsic information of these bits, which is fed back to the demapper immediately. This scheme reduces the delay introduced by the demapper-decoder iteration considerably, but the number of demapping units required, which equals the parallel order of the LDPC decoder, is large. Consequently, for a large MIMO system with high-order modulation, the computational complexity of the shuffled decoding scheme proposed in [8] may become prohibitively high.
III. EFFICIENT SHUFFLED ITERATIVE RECEIVER
As discussed in the previous section, a critical problem of the conventional frame-by-frame receiver scheme is the severe iteration delay induced. The shuffled receiver scheme of [8] may effectively reduce this IDD delay at the cost of high complexity. We propose an efficient shuffled iterative receiver which enjoys a low IDD delay and converges faster, while only imposing a relatively low complexity.
A. The Proposed Shuffled Iterative Receiver
Consider the group vertical shuffled BP algorithm adopted by the decoder. In the g-th decoding cycle, the extrinsic information L
, which will be forwarded to the decoder for next iteration. At the same time, the decoder moves to the next decoding cycle without waiting for the completion of demapping operation. The demapper and decoder form a pipeline structure which reduces the iteration delay significantly, compared with the traditional frame-by-frame scheme.
We also propose a partial feedback strategy, which only feeds back the extrinsic information of $P_f$ bits in each decoding cycle, where $P_f \le P$. With a smaller number of bits participating in the feedback, the computational complexity of demapping is reduced. Besides, only the candidate transmit vectors and the corresponding Euclidean distances associated with the bits that participate in the feedback need to be stored, which leads to a reduction of RAM resources. Hence the partial feedback strategy offers a flexible trade-off between the performance and complexity. In particular, $P_f = 0$ indicates that no extrinsic information is exchanged between the decoder and demapper, which is equivalent to the non-iterative scheme, while $P_f = P$ means that all the extrinsic information is fed back and, therefore, the hardware complexity required is the highest and the BER performance attainable is the best.
The interleaver and de-interleaver are carefully designed to guarantee that the feedback bits are mapped onto several intact symbol vectors. Thus we have $P_f / K_b \in \mathbb{N}$. This minimizes the number of symbol vectors related to these $P_f$ bits, and consequently it reduces the number of demapping units required, which equals $P_f / K_b$. By contrast, the scheme proposed in [8] requires $P$ demapping units, which is much larger than $P_f / K_b$. Thus our proposed shuffled iterative receiver enjoys a much lower complexity. The interleaving process amounts to reading and writing the extrinsic information at the appropriate addresses and, therefore, it can simply be realized as a look-up table (LUT) that memorises the reading and writing addresses for each decoding cycle. The proposed shuffled iterative receiver is given in Algorithm 2. In each decoding cycle $g$, $P$ bits are decoded. For the $P_f$ bits among these $P$ bits that are to be fed back, the extrinsic information is calculated by the $P_f / K_b$ demapping units, and thereafter the a priori information of these $P_f$ bits is updated.
Note that for MIMO systems with a large number of antennas and/or high-order modulation, $K_b$ may become larger than $P$, which indicates that the bits decoded in one cycle cannot be mapped onto an intact symbol vector. Therefore some modifications are made to the schedule for such systems. We perform a decoder-demapper iteration once per $l_c$ decoding cycles, and the decoded $l_c \cdot P_f$ bits are mapped onto several intact symbol vectors. Thus the number of demapping units required becomes $l_c \cdot P_f / K_b$. The receiver still enjoys a pipeline structure with low IDD delay.
Algorithm 2 Algorithm of Shuffled Iterative Receiver
For iteration $t$ ($t = 1, 2, \cdots, t_{\max}$) and cycle $g$ ($g = 1, 2, \cdots, G$), denote $n$ ($n = 1, 2, \cdots, P$) as the index of the bits processed in this cycle, $\tilde{n}$ ($\tilde{n} = 1, 2, \cdots, P_f$) as the index of the bits to be fed back to the demapper, and n (n = $\Pi(1), \Pi(2), \cdots, \Pi(P_f)$) as the index of the interleaved feedback bits associated with index $\tilde{n}$. The interleaved bits are mapped onto symbol vector $k$ ($k = 1, 2, \cdots, P_f / K_b$).
Decoding Process
1. Calculate the a posteriori LLRs of all the bits $Q_n^{(t)}$ using Algorithm 1.
2. Calculate the extrinsic information of the bits to be fed back as L
Interleaving Process
Demapping Process
Calculate the extrinsic information L e 1 (n) by demapping symbol vector k using Eq. (2).
De-interleaving Process
Update the a priori information F (t) n according to
The a priori information of the bits that do not participate in the iterative process remains unchanged.
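The number of demapping units implied by the schedule described above can be computed with a small helper. This is a sketch under one explicit assumption: when the $P_f$ feedback bits of a single cycle do not fill whole symbol vectors, it picks the smallest $l_c$ for which $l_c \cdot P_f$ is a multiple of $K_b$, a choice the text does not state. The function name `demapping_schedule` is hypothetical.

```python
import math


def demapping_schedule(p_f, m, n_t):
    """Return (l_c, units): decoding cycles per demapper update and demapping units needed.

    p_f: number of bits fed back per decoding cycle.
    m:   bits per constellation symbol; n_t: number of transmit antennas.
    """
    k_b = m * n_t                       # bits carried by one symbol vector
    # Assumption: use the smallest l_c such that l_c * p_f is a multiple of k_b.
    l_c = k_b // math.gcd(p_f, k_b)
    units = l_c * p_f // k_b            # intact symbol vectors demapped per update
    return l_c, units
```

For the two simulated configurations with $P_f = 24$ and 16-QAM, this gives 3 demapping units for $N_t = 2$ ($K_b = 8$) and 2 units for $N_t = 3$ ($K_b = 12$), both with $l_c = 1$.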
B. Analysis of The Proposed Scheme
The proposed shuffled iterative receiver has several advantages. Firstly, the delay induced by the IDD process is greatly reduced compared with the traditional frame-by-frame scheme. Secondly, the number of demapping units required is much smaller than that of the existing shuffled receiver given in [8], leading to a low complexity. Furthermore, the proposed partial feedback strategy offers a flexible trade-off between the performance and complexity. Fig. 2 illustrates the schedule of our proposed shuffled iterative receiver, where it can be observed that this shuffled iterative receiver employs a parallel schedule, namely, the decoder and demapper form a pipeline structure and work simultaneously. As long as the sum of the clock cycles required for the LUT and demapping operations is no larger than that of a decoding cycle, the decoder never has to idle while waiting for the demapping operation to complete.
Let us take the QC-LDPC code in IEEE 802.11n with the code length $N = 1944$, the code rate $R_c = 2/3$ and the sub-matrix size of $q = 81$ as an example. The decoding process consists of $G = 81$ cycles, and in each cycle the $P = 24$ bits of a layer are decoded. For simplicity, consider that the extrinsic information of all the $P_f = P$ bits is fed back. Further assume that a decoding cycle occupies $T_c$ clock cycles, a demapping unit which handles a symbol vector needs $T_d$ clock cycles, and the LUT operation needs $\delta$ clock cycles. As can be inferred from Fig. 2, for the proposed shuffled iterative receiver, a total of $81 T_c + T_d + \delta$ clock cycles are required for an iteration. By contrast, for the traditional frame-by-frame scheme, if we use the same number of demapping units as the shuffled one, a decoder-demapper iteration requires $81 T_c + 81 T_d + \Delta$ clock cycles, where $\Delta$ is the delay of the interleaver, which is typically very long. It can be seen that our shuffled iterative receiver significantly reduces the delay induced by the IDD process, compared with the traditional frame-by-frame scheme. Even compared with the non-iterative scheme, for which $81 T_c$ clock cycles are needed for one iteration, our proposed shuffled scheme is competitive in terms of processing delay.
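The delay comparison above can be checked numerically. In the sketch below, the function names and the clock-cycle values are hypothetical placeholders chosen only to illustrate the two expressions; they are not measurements from the paper.

```python
def shuffled_cycles(g, t_c, t_d, delta):
    """Clock cycles per iteration of the shuffled receiver: G*T_c + T_d + delta."""
    return g * t_c + t_d + delta


def frame_by_frame_cycles(g, t_c, t_d, big_delta):
    """Clock cycles per iteration of the frame-by-frame scheme: G*T_c + G*T_d + Delta."""
    return g * t_c + g * t_d + big_delta


def pipeline_never_stalls(t_c, t_d, delta):
    """Condition from the text: LUT plus demapping must fit within one decoding cycle."""
    return t_d + delta <= t_c


# Hypothetical timing values, for illustration only.
G, T_C, T_D, DELTA_LUT, DELTA_INTERLEAVER = 81, 10, 8, 2, 100
shuffled = shuffled_cycles(G, T_C, T_D, DELTA_LUT)                         # 820
frame_based = frame_by_frame_cycles(G, T_C, T_D, DELTA_INTERLEAVER)        # 1558
non_iterative = G * T_C                                                    # 810
```

With these placeholder values, the shuffled schedule adds only $T_d + \delta = 10$ clock cycles to the 810 cycles of pure decoding, whereas the frame-by-frame schedule adds 748.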
IV. SIMULATION RESULTS
We now present the simulation results to compare the proposed shuffled iterative receiver with the non-iterative scheme and the resetting algorithm given in [3]. The QC-LDPC code
For the ($N_r = 2$) × ($N_t = 2$) MIMO system with 16-QAM modulation and the ML detection, Figs. 3 and 4 compare the bit error rate (BER) performance and the average iteration numbers of the three receivers, respectively. For the non-iterative scheme, the maximum iteration number was set to 50. For the resetting algorithm, the decoder and demapper exchanged the extrinsic information once per $l_c = 25$ decoding-loop iterations, and the maximum iteration number of the decoder-demapper loop was set to 2. For our proposed scheme, the extrinsic information of $P_f = P = 24$ bits was fed back in each cycle, and maximum iteration numbers of 20 and 30 were considered. For each scenario, the iterative process was repeated until the LDPC decoder converged or the preset maximum iteration number was reached.
It can be observed from Fig. 3 that our proposed shuffled iterative receiver provides approximately 0.7 dB and 0.5 dB gains in the SNR over the non-iterative scheme and the resetting algorithm, respectively, at the BER level of $10^{-5}$. Furthermore, the average iteration number of our shuffled iterative algorithm is much smaller than those of the other two schemes at the same BER level, as can be seen from Fig. 4. Additionally, we notice that the performance gain of our proposed shuffled iterative receiver attained by increasing its maximum iteration number from 20 to 30 is limited. This demonstrates that our shuffled iterative receiver is capable of obtaining a good performance even with a relatively small maximum iteration number.
Next we present the simulation results for the ($N_r = 3$) × ($N_t = 3$) MIMO system with 16-QAM modulation and the K-BEST detection, where $K$ was set to 64. For our shuffled iterative receiver, only the maximum iteration number of 20 was considered, while in each cycle all the decoded $P_f = P = 24$ bits were fed back. The parameters of the other two schemes remained the same as in the previous example. As can be seen from Fig. 5, the proposed shuffled iterative receiver exhibits approximately 0.5 dB and 0.2 dB gains in the SNR at the BER level of $10^{-5}$ over the non-iterative scheme and the resetting algorithm, respectively. Due to the suboptimal demapping algorithm adopted, the gains are not as large as in the case of adopting the ML demapper, but they are still substantial. Fig. 6 shows that the average iteration number is also greatly reduced by our shuffled iterative receiver, compared with the other two schemes.
Our simulation investigation therefore shows that the proposed shuffled iterative receiver attains gains of several tenths of a dB in the SNR in comparison to the widely used non-iterative scheme and resetting algorithm, and requires a smaller number of iterations at a given BER level. Furthermore, as demonstrated in the previous section, our shuffled iterative receiver exhibits a much lower IDD delay, compared with the traditional frame-by-frame scheme. Therefore, our proposed scheme offers a low-complexity and low-delay design to achieve a high MIMO system throughput.
V. CONCLUSIONS
In this contribution, we have proposed a shuffled iterative receiver for LDPC-coded MIMO systems. In our shuffled iterative receiver, the decoder adopts the group vertical shuffled BP algorithm, and the extrinsic information of the decoded bits generated in each cycle is fed back to the demapper immediately, rather than waiting for the completion of decoding the entire code frame. The decoder and demapper form a pipeline structure which leads to a significant reduction in the IDD delay. A partial feedback strategy has also been suggested to provide a flexible trade-off between performance and complexity. Simulation results have demonstrated that the proposed shuffled iterative receiver outperforms the existing non-iterative scheme and resetting algorithm in terms of achievable BER performance, while requiring a smaller average number of iterations. Our work has thus shown that the proposed scheme offers a low-complexity and low-delay IDD design for high-throughput LDPC-coded MIMO systems.
