I. INTRODUCTION
THE adverse effects of channel fading may be significantly reduced by employing orthogonal transmit diversity invoking multiple antennas [1, 2]. The concept of combining orthogonal transmit diversity designs with the principle of sphere packing was introduced by Su et al. in 2003 [3] in order to maximise the achievable coding advantage, where it was demonstrated that the proposed Sphere Packing (SP) aided Space-Time Block Coded (STBC) system, referred to here as STBC-SP, was capable of outperforming the conventional orthogonal design based STBC schemes of [4, 5]. The ultimate rationale of this paper is to use a novel three-dimensional Extrinsic Information Transfer (EXIT)-chart-based technique to jointly design the two time-slots' STBC signal by near-optimally combining them into an iteratively detected SP symbol.
The turbo principle of [6] was extended to multiple parallel concatenated codes [7], to serially concatenated codes [8] as well as to multiple serially concatenated codes [9]. The appeal of concatenated coding is that low-complexity iterative detection replaces the potentially more complex optimum decoder, such as that of [10]. In [11], the employment of the turbo principle was considered for iterative soft demapping in the context of bit-interleaved coded modulation (BICM), where a soft demapper was used between the multilevel demodulator and the channel decoder. In [12], a turbo coding scheme was proposed for the multiple-input multiple-output (MIMO) Rayleigh fading channel, where a block code was employed as an outer channel code, while an orthogonal STBC scheme was considered as the inner code. The iterative soft demapping principle of [11] was extended to STBC-SP schemes in [2, 13-15], where it was demonstrated that turbo-detected STBC-SP schemes provide useful performance improvements over conventionally-modulated orthogonal design based STBC schemes. It was shown in [16] that a recursive inner code is needed in order to maximise the interleaver gain and to avoid the formation of a bit-error rate (BER) floor, when employing iterative decoding. This principle has been adopted by several authors designing serially concatenated schemes, where rate-1 inner codes were employed for designing low-complexity turbo codes suitable for bandwidth and power limited systems having stringent BER requirements [17-21].
Recently, studying the convergence behaviour of iterative decoding has attracted considerable attention [22-26]. In [22], ten Brink proposed the employment of the so-called EXIT characteristics between a concatenated decoder's output and input for describing the flow of extrinsic information through the soft-in/soft-out constituent decoders. The computation of EXIT charts was further simplified in [23] to a time average for scenarios in which the PDFs of the communicated information at the input and output of the constituent decoders are both symmetric and ergodic. The concept of EXIT chart analysis has been extended to three-stage concatenated systems in [24-26].
In this paper, we propose a capacity-approaching three-stage turbo-detected STBC-SP scheme, where iterative decoding is carried out between three constituent decoders, namely an STBC-SP demapper, an inner rate-1 recursive A Posteriori Probability (APP)-based decoder and an outer APP-based decoder. We first derive the capacity limit for STBC-SP schemes. Then, an upper bound on the maximum achievable rate is calculated, based on the EXIT charts of the STBC-SP demapper. At a spectral efficiency of η = 1 bit/s/Hz, the upper bound of the maximum achievable rate is within 0.5 dB of the capacity, and our proposed three-stage scheme operates within 1.0 dB of the capacity. The rationale of the proposed architecture is explicit: (1) SP modulation maximises the coding advantage of the transmission scheme by jointly designing and detecting the SP symbols hosting the two time-slots' STBC symbols; (2) the inner rate-1 recursive decoder maximises the interleaver gain and hence avoids having a BER floor; and (3) the outer irregular convolutional codes (IRCCs) [23, 27] minimise the area of the EXIT chart's convergence tunnel and hence facilitate near-capacity operation [28].

Fig. 1. Three-stage serially concatenated system.

This paper is organised as follows. In Section II, a brief description of our three-stage system is presented. Section III provides our 3D EXIT chart analysis along with its simplified 2D projections. The capacity of STBC-SP schemes is derived in Section IV, where an upper bound on the maximum achievable rate is also calculated based on the EXIT chart analysis. Our simulation results and discussions are provided in Section V. Finally, we conclude in Section VI.
II. SYSTEM OVERVIEW

A. Encoder
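The schematic of the entire system is shown in Figure 1, where the transmitted source bits $u_1$ are encoded by the outer channel Encoder I having a rate of $R_I$. The outer channel encoded bits $c_1$ are then interleaved by the first random bit interleaver, and the randomly permuted bits $u_2$ are fed through the rate-1 Encoder II. The coded bits $c_2$ at the output of the rate-1 encoder are interleaved by the second random bit interleaver, producing the permuted bits $u_3$. After channel interleaving, the sphere packing mapper maps each block of $B$ channel-coded bits $\mathbf{b} = b_{0,\ldots,B-1} \in \{0, 1\}$ to one of the $L = 2^B$ legitimate four-dimensional sphere packing modulated symbols $\mathbf{s}_l \in S$, where $S = \{\mathbf{s}_l = [a_{l,1}\ a_{l,2}\ a_{l,3}\ a_{l,4}] \in \mathbb{R}^4 : 0 \le l \le L-1\}$ is the set of $L$ legitimate constellation points selected from the lattice $D_4$, having a total energy of $E = \sum_{l=0}^{L-1} \left(|a_{l,1}|^2 + |a_{l,2}|^2 + |a_{l,3}|^2 + |a_{l,4}|^2\right)$. The STBC encoder then maps each sphere packing modulated symbol $\mathbf{s}_l$ to a space-time signal $\mathbf{C}_l$ as [3, 13]:

$$\mathbf{C}_l = \sqrt{\frac{2L}{E}}\, G_2(x_{l,1}, x_{l,2}), \tag{1}$$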
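where $x_{l,1}$ and $x_{l,2}$ are complex-valued symbols constructed from the four-dimensional real-valued coordinates of the SP symbol $\mathbf{s}_l$ in order to maximise the coding advantage of the space-time signal $\mathbf{C}_l$ [3], since the lattice $D_4$ has the best minimum Euclidean distance in the four-dimensional real-valued Euclidean space $\mathbb{R}^4$ [29]. Specifically, with $a_{l,1}, \ldots, a_{l,4}$ denoting the four real-valued coordinates of $\mathbf{s}_l$, the symbols $x_{l,1}$ and $x_{l,2}$ may be written as

$$x_{l,1} = a_{l,1} + j a_{l,2}, \qquad x_{l,2} = a_{l,3} + j a_{l,4}. \tag{2}$$

Furthermore, $G_2(x_{l,1}, x_{l,2})$ is the space-time transmission matrix given by [4]

$$G_2(x_{l,1}, x_{l,2}) = \begin{bmatrix} x_{l,1} & x_{l,2} \\ -x_{l,2}^{*} & x_{l,1}^{*} \end{bmatrix}, \tag{3}$$

where the rows and columns of Eq. (3) represent the temporal and spatial dimensions, corresponding to two consecutive time slots and two transmit antennas, respectively.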
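The mapping chain described above (an SP symbol from the lattice $D_4$, two complex-valued symbols, then the $G_2$ matrix) can be illustrated with a minimal Python sketch. It rests on stated assumptions: the four real coordinates are paired as (first, second) and (third, fourth), the energy normalisation factor is omitted, and the 24 minimum-norm $D_4$ points stand in for the actual $L$-point constellation, whose selection is not reproduced here. All function names are hypothetical.

```python
from itertools import product


def d4_first_shell():
    """Minimum-norm points of the D4 lattice: integer 4-tuples with an even coordinate sum."""
    return [p for p in product((-1, 0, 1), repeat=4)
            if sum(p) % 2 == 0 and sum(c * c for c in p) == 2]


def sp_to_complex(s):
    """Assumed pairing of the four real coordinates into two complex symbols."""
    a1, a2, a3, a4 = s
    return complex(a1, a2), complex(a3, a4)


def g2(x1, x2):
    """Alamouti's G2 matrix: rows are time slots, columns are transmit antennas."""
    return [[x1, x2],
            [-x2.conjugate(), x1.conjugate()]]


def gram(m):
    """Return M^H M for a 2x2 complex matrix M (diagonal if the columns are orthogonal)."""
    return [[sum(m[t][i].conjugate() * m[t][j] for t in range(2))
             for j in range(2)] for i in range(2)]


shell = d4_first_shell()
x1, x2 = sp_to_complex(shell[0])
print(len(shell), gram(g2(x1, x2)))
```

The Gram matrix is diagonal for every SP symbol, which is the orthogonality that the later diversity combining relies on.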
B. Channel Model
In this treatise, we consider a time-correlated narrowband Rayleigh fading channel, based on Jakes' fading model [30] associated with a normalised Doppler frequency of $f_D = f_d T_s = 0.1$, where $f_d$ is the Doppler frequency and $T_s$ is the symbol period. The complex-valued fading envelope is assumed to be constant across the transmission period of a space-time coded symbol spanning $T = 2$ time slots, and it varies from one space-time symbol to another according to the aforementioned Rayleigh fading model. The complex Additive White Gaussian Noise (AWGN) of $n = n_I + j n_Q$ is also added to the received signal, where $n_I$ and $n_Q$ are two independent zero-mean Gaussian random variables having a variance of $\sigma_n^2 = N_0/2$ per dimension.
C. Decoder
As shown in Figure 1, the received complex-valued symbols are first decoded by the STBC decoder in order to produce the received SP soft-symbols $\mathbf{r}$, where each SP symbol represents a block of $B$ coded bits [13]. Then, iterative demapping/decoding is carried out between the SP demapper, the APP-based soft-in/soft-out (SISO) module II and the APP-based SISO module I, where extrinsic information is exchanged between the three constituent demapper/decoder modules. More specifically, $L_{\cdot,a}(\cdot)$ in Figure 1 represents the a priori information, expressed in terms of the log-likelihood ratios (LLRs) of the corresponding bits, whereas $L_{\cdot,o}(\cdot)$ and $L_{\cdot,e}(\cdot)$ represent the a posteriori and extrinsic LLRs of the corresponding bits, respectively, where the first subscript is used to distinguish the different constituent decoders, i.e. Decoder I, Decoder II and the SP demapper. The iterative process is performed for a number of consecutive iterations. During the last iteration, only the a posteriori LLR values $L_{I,o}(u_1)$ of the original uncoded systematic information bits $u_1$ are required, which are passed to a hard decision decoder in order to determine the estimated transmitted source bits $\hat{u}_1$, as shown in Figure 1.
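The relation between the three LLR types and the final hard decision can be sketched as follows. This is a minimal illustration under two assumptions that the text does not fix: the LLR sign convention $L = \ln[P(b=1)/P(b=0)]$, and the usual SISO relation in which the extrinsic LLR is the a posteriori LLR minus the a priori LLR. The function names are hypothetical.

```python
import math


def llr_from_prob(p1):
    """LLR of a bit under the assumed convention L = ln(P(b=1) / P(b=0))."""
    return math.log(p1 / (1.0 - p1))


def extrinsic(a_posteriori, a_priori):
    """Extrinsic LLRs: remove what the module was told from what it concluded."""
    return [lo - la for lo, la in zip(a_posteriori, a_priori)]


def hard_decision(llrs):
    """Map a posteriori LLRs to bit estimates (positive LLR means bit 1 here)."""
    return [1 if value > 0 else 0 for value in llrs]


l_apriori = [0.5, -0.25]
l_aposteriori = [2.5, -1.0]
print(extrinsic(l_aposteriori, l_apriori), hard_decision(l_aposteriori))
```

Only the extrinsic part is passed on between modules, so that each module receives information it did not itself supply.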
III. EXIT CHART ANALYSIS
A. Preliminaries
The main objective of employing EXIT charts [22] is to predict the convergence behaviour of the iterative decoder by examining the evolution of the input/output mutual information exchange between the inner and outer decoders in consecutive iterations. The application of EXIT charts is based on two assumptions, both of which are justified for large interleaver lengths: (1) the a priori LLR values are fairly uncorrelated; (2) the a priori LLR values exhibit a Gaussian PDF. In this section, the approach presented in [26] is adopted in order to provide the EXIT chart analysis of the proposed three-stage system of Figure 1.
Let I ·,a (x), 0 ≤ I ·,a (x) ≤ 1, denote the mutual information (MI) [31] between the a priori LLRs L ·,a (x) as well as the corresponding bits x and let I ·,e (x), 0 ≤ I ·,e (x) ≤ 1, denote the MI between the extrinsic LLRs L ·,e (x) and the corresponding bits x.
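As an illustration of how such mutual information values may be measured in a simulation, the sketch below generates a priori LLRs obeying the Gaussian assumption and estimates the MI by a time average of the kind attributed to [23]. The consistent-Gaussian model (mean $\sigma^2/2$, variance $\sigma^2$), the sign convention (bit 1 maps to +1) and all function names are assumptions made for this example.

```python
import math
import random


def gaussian_apriori_llrs(bits, sigma, rng):
    """Draw LLRs L = (sigma^2 / 2) * x + n with n ~ N(0, sigma^2) and x = 2b - 1."""
    return [0.5 * sigma * sigma * (2 * b - 1) + rng.gauss(0.0, sigma) for b in bits]


def mutual_information(bits, llrs):
    """Time-average estimate I = 1 - mean(log2(1 + exp(-x * L)))."""
    total = 0.0
    for b, llr in zip(bits, llrs):
        z = -(2 * b - 1) * llr
        # Numerically stable evaluation of ln(1 + exp(z)).
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += softplus / math.log(2.0)
    return 1.0 - total / len(bits)


rng = random.Random(1)
bits = [rng.randint(0, 1) for _ in range(20000)]
for sigma in (0.5, 2.0, 6.0):
    llrs = gaussian_apriori_llrs(bits, sigma, rng)
    print(sigma, round(mutual_information(bits, llrs), 3))
```

Sweeping `sigma` from zero upwards moves the a priori MI from 0 towards 1, which is how the horizontal axis of an EXIT curve is typically scanned.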
B. 3D EXIT Charts
As seen from Figure 1, the input of Decoder II is constituted by the a priori input $L_{II,a}(c_2)$ and the a priori input $L_{II,a}(u_2)$ provided after bit-deinterleaving by the SP demapper and Decoder I, respectively. Therefore, the EXIT characteristic of Decoder II can be described by the following two EXIT functions [22, 26]:

$$I_{II,e}(c_2) = T_{II,c_2}\left[I_{II,a}(u_2), I_{II,a}(c_2)\right], \tag{4}$$
$$I_{II,e}(u_2) = T_{II,u_2}\left[I_{II,a}(u_2), I_{II,a}(c_2)\right], \tag{5}$$

which are illustrated by the 3D surfaces drawn in dotted lines in Figures 2 and 3, respectively. On the other hand, the EXIT characteristic of the SP demapper as well as that of Decoder I are each dependent on a single a priori input, namely on $L_{M,a}(u_3)$ and $L_{I,a}(c_1)$, respectively, both of which are provided by the rate-1 Decoder II after appropriately ordering the bits, as seen in Figure 1. The EXIT characteristic of the SP demapper is also dependent on the $E_b/N_0$ value. Consequently, the corresponding EXIT functions for the SP demapper and Decoder I, respectively, may be written as

$$I_{M,e}(u_3) = T_{M,u_3}\left[I_{M,a}(u_3), E_b/N_0\right], \tag{6}$$
$$I_{I,e}(c_1) = T_{I,c_1}\left[I_{I,a}(c_1)\right], \tag{7}$$

which are illustrated by the 3D surfaces drawn in solid lines in Figures 2 and 3, respectively. Eqs. (4) to (7) may be represented with the aid of two 3D EXIT charts. More specifically, the 3D EXIT chart of Figure 2 can be used to plot Eq. (4) and Eq. (6), which describe the EXIT relation between the SP demapper and Decoder II. Similarly, the 3D EXIT chart of Figure 3 can be used to describe the EXIT relation between Decoder II and Decoder I by plotting Eq. (5) and Eq. (7). Figures 2 and 3 show an example of these 3D EXIT charts, when Encoder I is a half-rate memory-1 recursive systematic convolutional (RSC) code having octally represented generator polynomials of $(G_r, G) = (3, 2)_8$, where $G_r$ is the feedback polynomial, while Encoder II is a simple rate-1 accumulator, described by the pair of octal generator polynomials $(G_r, G) = (3, 2)_8$.
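For readers unfamiliar with the two example encoders, the following sketch shows one plausible reading of the generator pair $(G_r, G) = (3, 2)_8$, namely a feedback polynomial of $1 + D$ and a feedforward polynomial of 1, so that the recursive branch is a running modulo-2 sum. This interpretation of the octal notation, the bit ordering of the half-rate output and the function names are assumptions of this example.

```python
def accumulate(bits):
    """Rate-1 accumulator 1/(1 + D): each output is the running XOR of the inputs."""
    state = 0
    out = []
    for u in bits:
        state ^= u
        out.append(state)
    return out


def rsc_half_rate(bits):
    """Half-rate memory-1 RSC sketch: systematic bit followed by the accumulator parity bit."""
    coded = []
    for u, p in zip(bits, accumulate(bits)):
        coded.extend((u, p))
    return coded


message = [1, 0, 1, 1]
print(accumulate(message), rsc_half_rate(message))
```

The recursion is what matters for the EXIT analysis: a single input bit influences all subsequent outputs, which is the property that allows the inner curve to reach the (1.0, 1.0) point.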
C. 2D EXIT Chart Projections
From the 3D EXIT charts of Figures 2 and 3, we derive their unique and unambiguous 2D representations, which can be interpreted in the usual way.
The intersection of the surfaces in Figure 2, shown as a thick solid line, portrays the best achievable performance, when exchanging mutual information between the SP demapper and the rate-1 Decoder II for different fixed values of $I_{II,a}(u_2)$. Each $(I_{II,a}(u_2), I_{II,a}(c_2), I_{II,e}(c_2))$ point belonging to the intersection line in Figure 2 uniquely specifies a 3D point $(I_{II,a}(u_2), I_{II,a}(c_2), I_{II,e}(u_2))$ in Figure 3, according to the EXIT function of Eq. (5). Therefore, the line corresponding to the $(I_{II,a}(u_2), I_{II,a}(c_2), I_{II,e}(c_2))$ points along the thick line of Figure 2 is projected to the solid line shown in Figure 3, while the 2D projection of the solid line in Figure 3 at $I_{II,a}(c_2) = 0$ onto the plane spanned by the lines $(I_{II,a}(u_2), I_{II,e}(u_2))$ and $(I_{I,e}(c_1), I_{I,a}(c_1))$ is shown in Figure 4, represented by the dotted line at $E_b/N_0 = 2.0$ dB. This projected EXIT curve may be written as

$$I_{II,e}(u_2) = T^{p}_{II,u_2}\left[I_{II,a}(u_2), E_b/N_0\right]. \tag{8}$$
Projected 2D EXIT charts of a similar nature will be used throughout the rest of the paper for the sake of describing the convergence behaviour of the three-stage turbo-detected STBC-SP scheme. More details on the related 3D-to-2D EXIT chart projection are provided in [26]. Figure 4 shows the 2D-projected EXIT curve of the SP demapper, when operating at $E_b/N_0 = 2.0$ dB and employing the Anti-Gray Mapping (AGM-1) scheme, which is described in Appendix A and in Table II. The figure also shows the 2D-projected EXIT curve of the outer RSC Decoder I and the 2D-projected EXIT curves of the combined SP demapper and the rate-1 Decoder II at different $E_b/N_0$ values, when employing AGM-1 of Table II. Observe in Figure 4 that an open convergence tunnel is taking shape for the three-stage scheme upon increasing the Signal-to-Noise Ratio (SNR) beyond $E_b/N_0 = 2.0$ dB. This implies that according to the predictions of the 2D EXIT chart seen in Figure 4, the iterative decoding process is expected to converge to the (1.0, 1.0) point and hence an infinitesimally low BER may be attained beyond $E_b/N_0 = 2.0$ dB. By contrast, for the traditional two-stage turbo-detected STBC-SP scheme, there would be a BER floor preventing it from achieving an infinitesimally low BER due to the non-recursive nature of the SP demapper, which also prevents the intersection of the EXIT curves of the SP demapper and the outer RSC Decoder I from reaching the (1.0, 1.0) point of convergence, regardless of the SNR or the number of iterations. The three-stage scheme of Figure 1, however, becomes capable of achieving an infinitesimally low BER, as suggested by the EXIT-chart predictions of Figure 4.
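The way a 2D EXIT chart predicts convergence can be made concrete with a small staircase simulation. The two inner curves and the outer curve below are toy functions chosen for this sketch, not the measured curves of Figure 4: one inner curve leaves an open tunnel up to the (1.0, 1.0) point, while the other crosses the inverted outer curve early, which is the situation associated with a BER floor.

```python
def decoding_trajectory(inner, outer, iterations):
    """Alternate between the inner and outer EXIT functions, starting without a priori information."""
    i_a = 0.0
    path = []
    for _ in range(iterations):
        i_e_inner = inner(i_a)          # extrinsic MI delivered by the inner stage
        i_e_outer = outer(i_e_inner)    # extrinsic MI fed back by the outer decoder
        path.append((i_a, i_e_inner, i_e_outer))
        i_a = i_e_outer
    return path


def open_inner(i):
    """Toy inner curve that stays above the inverted outer curve for all i < 1."""
    return 0.6 + 0.4 * i


def closed_inner(i):
    """Toy inner curve that crosses the inverted outer curve before reaching (1, 1)."""
    return 0.4 + 0.6 * i


def toy_outer(i):
    """Toy outer decoder curve, extrinsic MI as a function of a priori MI."""
    return i * i


print(decoding_trajectory(open_inner, toy_outer, 100)[-1][2])
print(decoding_trajectory(closed_inner, toy_outer, 100)[-1][2])
```

With the open tunnel the trajectory climbs to the top right corner, whereas with the crossing curves it stalls at the intersection point, however many iterations are spent.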
D. EXIT Tunnel-Area Minimisation for Near-Capacity Operation
In this section we exploit the well-understood property of conventional 2D EXIT charts that a narrow but open EXIT tunnel indicates near-capacity performance. Therefore, we invoke Irregular Convolutional Codes (IRCCs) for the sake of appropriately shaping the EXIT curves by minimising the area within the EXIT tunnel using the procedure of [23, 27].
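Let $A_I$ and $\bar{A}_I$ be the areas under the EXIT curve $T_{I,c_1}(i)$ of Eq. (7) and its inverse $T^{-1}_{I,c_1}(i)$, respectively, which are expressed as:

$$A_I = \int_0^1 T_{I,c_1}(i)\,di, \qquad \bar{A}_I = \int_0^1 T^{-1}_{I,c_1}(i)\,di = 1 - A_I. \tag{9}$$

Similarly, the area $A^p_{II}$ is defined under the EXIT curve $T^p_{II,u_2}(i)$ of Eq. (8). It was observed in [23, 32] that for the APP-based outer Decoder I, the area $\bar{A}_I$ may be approximated by $\bar{A}_I \approx R_I$, where the equality $\bar{A}_I = R_I$ was later shown in [28] for the family of Binary Erasure Channels (BECs). The area property of $\bar{A}_I \approx R_I$ implies that the lowest SNR convergence threshold occurs when we have $A^p_{II} = R_I + \epsilon$, where $\epsilon$ is an infinitesimally small number, provided that the following convergence constraints hold [27]:

$$T^p_{II,u_2}(0) > 0, \qquad T^p_{II,u_2}(1) = 1, \qquad T^p_{II,u_2}(i) > T^{-1}_{I,c_1}(i) \quad \forall\, i \in [0, 1). \tag{10}$$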
Observe in Figure 4, however, that there is a 'larger-than-necessary' tunnel area between the projected EXIT curve $T^p_{II,u_2}(i)$ and the EXIT curve $T^{-1}_{I,c_1}(i)$ of the outer 1/2-rate RSC code at $E_b/N_0 = 2.0$ dB. This implies that the BER curve is farther from the achievable capacity than necessary, despite the fact that the specific bit-to-SP-symbol mapping scheme of AGM-1 and the 1/2-rate RSC code employed in Figure 4 were specifically optimised for convergence at a low $E_b/N_0$ value. More quantitatively, the area $A^p_{II}$ under the projected EXIT curve $T^p_{II,u_2}(i)$ exceeds the outer code rate $R_I$, and hence a lower $E_b/N_0$ convergence threshold may be attained, provided that the constraints outlined in Eq. (10) are satisfied. Hence we will invoke IRCCs [23, 27] as outer codes that exhibit flexible EXIT characteristics, which can be optimised to more closely match the 2D-projected EXIT curve $T^p_{II,u_2}(i)$ of Figure 4, rendering the near-capacity code optimisation a simple curve-fitting process.
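The area argument can be checked numerically on sampled curves. The sketch below integrates two toy curves with the trapezoidal rule and tests the tunnel conditions in the form assumed here: the inner curve starts above zero, ends at one, and stays above the inverted outer curve everywhere below $i = 1$. The curves, the grid and the function names are illustrative assumptions, not the measured curves of Figure 4.

```python
import math


def trapezoid_area(samples):
    """Area under a curve sampled on an equally spaced grid over [0, 1]."""
    step = 1.0 / (len(samples) - 1)
    return step * (sum(samples) - 0.5 * (samples[0] + samples[-1]))


def tunnel_is_open(inner, outer_inverse):
    """Check the assumed convergence constraints on two curves sampled on the same grid."""
    starts_above_zero = inner[0] > 0.0
    ends_at_one = abs(inner[-1] - 1.0) < 1e-9
    stays_above = all(a > b for a, b in zip(inner[:-1], outer_inverse[:-1]))
    return starts_above_zero and ends_at_one and stays_above


grid = [k / 1000 for k in range(1001)]
inner_curve = [0.6 + 0.4 * i for i in grid]      # toy projected inner curve
outer_inverse = [math.sqrt(i) for i in grid]     # inverse of the toy outer curve T(i) = i^2
outer_rate_estimate = trapezoid_area(outer_inverse)
tunnel_area = trapezoid_area(inner_curve) - outer_rate_estimate
print(tunnel_is_open(inner_curve, outer_inverse), round(tunnel_area, 3))
```

A large positive `tunnel_area` with an open tunnel is the 'larger-than-necessary' situation: the outer code rate could be raised, or the SNR lowered, before the curves touch.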
An IRCC scheme constituted by a set of $P = 17$ subcodes was constructed in [27] from a systematic 1/2-rate memory-4 mother code defined by the octally represented generator polynomials $(G_r, G) = (31, 27)_8$. Each of the $P = 17$ subcodes encodes a specific fraction of the uncoded bits determined by the weighting coefficient $\alpha_i$, $i = 1, \ldots, P$. The coefficients $\alpha_i$ are optimised with the aid of the iterative algorithm of [23], so that the EXIT curve of the resultant IRCC closely matches the 2D-projected EXIT curve $T^p_{II,u_2}(i)$.
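The role of the weighting coefficients can be sketched as follows, assuming the formulation commonly used for irregular codes: the weights are non-negative and sum to one, the overall rate is the weighted sum of the subcode rates, and the EXIT curve of the irregular code is the weighted superposition of the subcode EXIT curves. The three subcodes, their rates and their sampled curves are made-up placeholders rather than the 17 subcodes of [27].

```python
def validate_weights(weights):
    """Weights must be non-negative and sum to one."""
    return all(a >= 0.0 for a in weights) and abs(sum(weights) - 1.0) < 1e-9


def ircc_rate(weights, rates):
    """Overall code rate as the weighted sum of the subcode rates."""
    return sum(a * r for a, r in zip(weights, rates))


def ircc_exit_curve(weights, subcode_curves):
    """Weighted superposition of subcode EXIT curves sampled on a common grid."""
    points = len(subcode_curves[0])
    return [sum(a * curve[k] for a, curve in zip(weights, subcode_curves))
            for k in range(points)]


weights = [0.5, 0.25, 0.25]
rates = [0.25, 0.5, 1.0]                 # placeholder subcode rates
curves = [[0.0, 0.5, 1.0],               # placeholder curves sampled at i = 0, 0.5, 1
          [0.0, 0.25, 1.0],
          [0.0, 0.75, 1.0]]
print(validate_weights(weights), ircc_rate(weights, rates), ircc_exit_curve(weights, curves))
```

Because the combined curve is linear in the weights, matching it to a target inner curve under a fixed overall rate becomes a constrained curve-fitting problem.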
IV. CAPACITY AND MAXIMUM ACHIEVABLE RATE
A. Capacity of STBC-SP Schemes
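For the sake of simplicity, a system having two transmit antennas and a single receive antenna is considered, although its extension to systems having more than one receive antenna is straightforward. Assuming perfect channel estimation, the complex-valued channel output symbols received during two consecutive STBC time slots are first diversity-combined in order to extract the estimates $\tilde{x}_1$ and $\tilde{x}_2$ of the transmitted symbols $x_{l,1}$ and $x_{l,2}$:

$$\tilde{x}_1 = \left(|h_1|^2 + |h_2|^2\right) \sqrt{\frac{2L}{E}}\, x_{l,1} + \acute{n}_1,$$
$$\tilde{x}_2 = \left(|h_1|^2 + |h_2|^2\right) \sqrt{\frac{2L}{E}}\, x_{l,2} + \acute{n}_2,$$

where $h_1$ and $h_2$ represent the complex-valued non-dispersive channel coefficients corresponding to the first and second transmit antenna, respectively, and $\acute{n}_1$ as well as $\acute{n}_2$ are zero-mean complex Gaussian random variables with variance $\sigma^2_{\acute{n}}$.

A received sphere-packed symbol $\mathbf{r}$ is then constructed from the estimates $\tilde{x}_1$ and $\tilde{x}_2$ by inverting the mapping of Eq. (2), i.e. by collecting their real and imaginary parts as

$$\mathbf{r} = [\tilde{a}_1\ \tilde{a}_2\ \tilde{a}_3\ \tilde{a}_4] = [\Re\{\tilde{x}_1\}\ \Im\{\tilde{x}_1\}\ \Re\{\tilde{x}_2\}\ \Im\{\tilde{x}_2\}],$$

where $\mathbf{r} \in \mathbb{R}^4$. The received sphere-packed symbol $\mathbf{r}$ can be written as

$$\mathbf{r} = h \sqrt{\frac{2L}{E}}\, \mathbf{s}_l + \mathbf{w}, \tag{15}$$

where

$$h = |h_1|^2 + |h_2|^2$$

and $\mathbf{w}$ is a four-dimensional real-valued Gaussian random variable having a covariance matrix of $\sigma_w^2 \mathbf{I}_{N_D}$, where we have $N_D = 4$, since the symbol constellation $S$ is four-dimensional. According to Eq. (15), the conditional PDF $p(\mathbf{r}|\mathbf{s}_l)$ is given by

$$p(\mathbf{r}|\mathbf{s}_l) = \frac{1}{(2\pi\sigma_w^2)^{N_D/2}} \exp\left(-\frac{1}{2\sigma_w^2} (\mathbf{r} - \alpha\,\mathbf{s}_l)(\mathbf{r} - \alpha\,\mathbf{s}_l)^T\right),$$

where we have $\alpha = h \cdot \sqrt{2L/E}$ and $(\cdot)^T$ represents the transpose of a vector.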
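The receiver-side steps described in this subsection can be traced with a short numerical sketch: Alamouti-style diversity combining of the two received samples, conversion of the two complex estimates into a four-dimensional real vector, and evaluation of a Gaussian likelihood for each candidate SP symbol. The sketch assumes a noise-free channel, sets the energy normalisation factor to one so that $\alpha = |h_1|^2 + |h_2|^2$, uses an i.i.d. Gaussian model with variance `sigma_w2` per dimension, and picks an arbitrary four-point candidate set; all names are hypothetical.

```python
import math


def alamouti_combine(y1, y2, h1, h2):
    """Diversity-combine the samples of two time slots for the G2 code."""
    x1 = h1.conjugate() * y1 + h2 * y2.conjugate()
    x2 = h2.conjugate() * y1 - h1 * y2.conjugate()
    return x1, x2


def complex_to_sp(x1, x2):
    """Stack real and imaginary parts into a four-dimensional real vector."""
    return (x1.real, x1.imag, x2.real, x2.imag)


def likelihood(r, s, alpha, sigma_w2):
    """Gaussian p(r | s) for r = alpha * s + w with i.i.d. noise of variance sigma_w2."""
    dist2 = sum((ri - alpha * si) ** 2 for ri, si in zip(r, s))
    return math.exp(-dist2 / (2.0 * sigma_w2)) / (2.0 * math.pi * sigma_w2) ** (len(r) / 2)


def symbol_posteriors(r, constellation, alpha, sigma_w2):
    """Normalised likelihoods, i.e. posteriors for equiprobable symbols."""
    weights = [likelihood(r, s, alpha, sigma_w2) for s in constellation]
    total = sum(weights)
    return [w / total for w in weights]


h1, h2 = complex(0.8, 0.3), complex(-0.2, 0.9)
constellation = [(1, 1, 0, 0), (1, 0, 1, 0), (0, 0, 1, 1), (-1, -1, 0, 0)]
x1, x2 = complex(1, 1), complex(0, 0)            # complex symbols of constellation[0]
y1 = h1 * x1 + h2 * x2                           # first time slot, no noise added
y2 = -h1 * x2.conjugate() + h2 * x1.conjugate()  # second time slot, no noise added
r = complex_to_sp(*alamouti_combine(y1, y2, h1, h2))
alpha = abs(h1) ** 2 + abs(h2) ** 2
print(symbol_posteriors(r, constellation, alpha, 0.5))
```

In a soft demapper these symbol likelihoods would subsequently be combined with the a priori bit LLRs to form extrinsic bit LLRs; that step is not shown here.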
The channel capacity valid for STBC schemes using $N_D$-dimensional so-called L-orthogonal signalling [33] over the Discrete-input Continuous-output Memoryless Channel (DCMC) [34] was derived in [35]. Assuming that all the legitimate transmitted four-dimensional SP symbols, $\mathbf{s}_l \in S$, $0 \le l \le L-1$, are equiprobable, the channel capacity of the STBC-SP scheme over the DCMC is given by the expression of [35, 36], evaluated for $N_D = 4$. The corresponding bandwidth efficiency $\eta^{STBC\text{-}SP}_{DCMC}$ of Eq. (19) was defined in [35].

Figure 6 shows the DCMC capacity of the four-dimensional SP modulation assisted STBC scheme for $L = 16$, where the Continuous-Input Continuous-Output Memoryless Channel (CCMC) [34] capacity of the MIMO scheme is given by [37]. More specifically, Figure 6 demonstrates that at a bandwidth efficiency of η = 1 bit/s/Hz, the capacity limit for the DCMC STBC-SP scheme employing $N_t = 2$ transmit and $N_r = 1$ receive antennas is $E_b/N_0 = 0.78$ dB. The EXIT chart analysis of Figure 5 predicts that our three-stage system will converge at $E_b/N_0 = 1.5$ dB, i.e. within 0.72 dB of the capacity limit. The dashed curve in Figure 6, which refers to the maximum achievable rate of the three-stage turbo-detected STBC-SP scheme, is discussed next in Section IV-B.
B. Maximum Achievable Rate
A tighter upper limit on the maximum achievable rate of the system can be calculated based on the area property $\bar{A}_I \approx R_I$ of the EXIT charts discussed in Section III-D.
In Eq. (20), $B = \log_2(L)$ is the number of bits per SP symbol and $R_{STBC\text{-}SP}$ is the rate of the STBC-SP scheme. In Eq. (21), $R_o$ is the original outer code rate used when generating the 2D-projected EXIT curves of the SP demapper and the rate-1 Decoder II of Eq. (8). The procedure is summarised in Algorithm 1.

Algorithm 1:
Step 1: Let $R_I = R_o$.
Step 2: Let $E_b/N_0 = \rho_{min}$ dB.
Step 3: Calculate $N_0$.
Step 4: Let $I_{M,a}(u_3) = 0$.
Step 5: Activate the SP demapper.
Step 6: Save $I_{M,e}(u_3) = T_{M,u_3}(I_{M,a}(u_3), E_b/N_0)$.
Step 7: [...]
Step 8: [...]
Step 9: Calculate $E_b/N_0$ using Eq. (21).
Step 10: Save $\eta_{max}(E_b/N_0)$ of Eq. (20).
Step 11: [...] Step 3.
Step 12: Output $\eta_{max}(E_b/N_0)$ from Step 10.

Observe that $\rho_{min}$ and $\rho_{max}$ are adjusted accordingly in order to produce the desired range of the resultant $E_b/N_0$ values. Furthermore, by virtue of Eq. (21), the output of Algorithm 1 is independent of the specific choice of $R_o$.

The resultant maximum achievable bandwidth efficiency is shown in Figure 6, and it is slightly lower than the bandwidth efficiency of Eq. (19), i.e. we have $\eta_{max} < \eta^{STBC\text{-}SP}_{DCMC}$. Observe that the bandwidth efficiency calculated using Eq. (19) and that calculated using the EXIT charts as well as Eq. (20) were only proven to be equal for the family of BECs [28]. Nonetheless, similar trends have been observed for both AWGN and Inter-Symbol-Interference (ISI) channels [25, 27], when APP-based decoders are used for all decoder blocks [28]. However, the discrepancy between the two bandwidth efficiency curves shown in Figure 6 that are calculated using Eq. (19) and Eq. (20) is due to the fact that the SP demapper is not an APP-based decoder. Nevertheless, the bandwidth efficiency calculated based on the EXIT charts using Eq. (20) and Algorithm 1 constitutes a tighter bound on the maximum achievable bandwidth efficiency of the system. Figure 6 shows that at a bandwidth efficiency of η = 1 bit/s/Hz, the maximum achievable rate limit of the STBC-SP scheme is about $E_b/N_0 = 1.3$ dB, which is within 0.2 dB of the prediction of our EXIT chart analysis seen in Figure 5, where convergence is predicted at $E_b/N_0 = 1.5$ dB.
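A rough sketch of an area-based sweep in the spirit of Algorithm 1 is given below. Since Eqs. (20) and (21) and several steps of the algorithm are not reproduced in the text available here, everything specific in this sketch is an assumption: the demapper transfer function is a made-up stand-in, the maximum rate is taken as $B \cdot R_{STBC\text{-}SP}$ times the area under the demapper's EXIT curve, the value $R_{STBC\text{-}SP} = 1/2$ is inferred from the quoted 1 bit/s/Hz operating point with $L = 16$ and a half-rate outer code, and the $E_b/N_0$ value is recomputed for the rate actually supported.

```python
import math


def demapper_exit(i_a, snr_db):
    """Hypothetical stand-in for the SP demapper's EXIT function (not a measured curve)."""
    ceiling = 1.0 - math.exp(-0.35 * 10.0 ** (snr_db / 10.0))
    return ceiling * (0.6 + 0.4 * i_a)


def area_under(curve, points=101):
    """Trapezoidal area of curve(i) for i swept over [0, 1]."""
    step = 1.0 / (points - 1)
    values = [curve(k * step) for k in range(points)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def max_rate_sweep(bits_per_symbol, r_stbc_sp, r_o, rho_min, rho_max, rho_step):
    """Sweep the design Eb/N0 and return (Eb/N0 in dB, assumed eta_max) pairs."""
    results = []
    steps = int(round((rho_max - rho_min) / rho_step))
    for k in range(steps + 1):
        ebn0_db = rho_min + k * rho_step
        eta_o = bits_per_symbol * r_stbc_sp * r_o               # design throughput
        snr_db = ebn0_db + 10.0 * math.log10(eta_o)             # fixes N0 for this pass
        area = area_under(lambda i: demapper_exit(i, snr_db))   # sweep of I_a from 0 to 1
        eta_max = bits_per_symbol * r_stbc_sp * area            # assumed form of the rate bound
        ebn0_at_eta_max = snr_db - 10.0 * math.log10(eta_max)   # assumed Eb/N0 recalculation
        results.append((ebn0_at_eta_max, eta_max))
    return results


for ebn0, eta in max_rate_sweep(4, 0.5, 0.5, 0.0, 4.0, 0.5):
    print(round(ebn0, 2), round(eta, 3))
```

Under these assumptions, each pass yields one point of a rate-versus-$E_b/N_0$ curve of the kind drawn as the dashed line in Figure 6.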
V. RESULTS AND DISCUSSIONS
Without loss of generality, we considered a sphere packing modulation scheme associated with $L = 16$ using two transmit antennas and a single receive antenna in order to demonstrate the performance improvements achieved by the proposed system. All simulation parameters are listed in Table I.

Fig. 7. Decoding trajectory of the three-stage turbo-detected STBC-SP scheme, employing the mapping of Table II in combination with the system parameters outlined in Table I and operating at $E_b/N_0 = 1.8$ dB with an interleaver depth of $D = 10^6$ bits after 33 three-stage iterations.
A. Decoding Trajectory
EXIT chart based convergence predictions are usually verified by the actual iterative decoding trajectory. Figure 5 shows that the three-stage turbo-detected STBC-SP scheme is expected to converge at $E_b/N_0 = 1.5$ dB, where convergence to the (1.0, 1.0) point requires an excessive number of three-stage iterations. However, convergence to the (1.0, 1.0) point becomes more feasible for $E_b/N_0 > 1.5$ dB. Figure 7 illustrates the actual decoding trajectory of the three-stage turbo-detected STBC-SP scheme of Figure 1 at $E_b/N_0 = 1.8$ dB, when using an interleaver depth of $D = 10^6$ bits and 33 three-stage iterations. The zigzag path seen in Figure 7 represents the actual extrinsic information transfer between the SP demapper and the rate-1 Decoder II on one hand and the outer IRCC Decoder I on the other.

B. BER Performance

Fig. 8. Performance comparison of the anti-Gray mapping AGM-2 based IRCC-coded three-stage STBC-SP scheme (1) in conjunction with $L = 16$ against an identical-throughput 1 bit/symbol (BPS) uncoded STBC-SP scheme (2) using $L = 4$ and against Alamouti's conventional $G_2$-BPSK scheme (3) as well as against a two-stage RSC-coded STBC-SP scheme (4), when employing the system parameters outlined in Table I and using an interleaver depth of $D = 10^6$ bits.

Figure 8 compares the performance of the proposed three-stage IRCC-coded STBC-SP scheme employing anti-Gray mapping (AGM-2) against that of an identical-throughput 1 Bit Per Symbol (1BPS) uncoded STBC-SP scheme [3] using $L = 4$ and against Alamouti's conventional $G_2$-BPSK scheme [4]. The system is also benchmarked against a two-stage RSC-coded STBC-SP scheme [13], when employing the system parameters outlined in Table I and using an interleaver depth of $D = 10^6$ bits. Figure 8 specifically demonstrates that the proposed three-stage turbo-detected scheme is capable of achieving infinitesimally low BER values, where its performance is not limited by a BER floor, which is in contrast to the two-stage turbo-detected STBC-SP scheme. Observe that the two-stage turbo-detected STBC-SP scheme uses only 10 iterations, since the advantage of employing any further iterations diminishes owing to the presence of a BER floor. Explicitly, Figure 8 demonstrates that a coding advantage of about 22.2 dB was achieved at a BER of $10^{-5}$ after 28 iterations by the three-stage turbo-detected STBC-SP system over both the uncoded STBC-SP [3] and the conventional orthogonal STBC design based [4, 5] schemes for transmission over the correlated Rayleigh fading channel considered. Additionally, a coding advantage of approximately 2.0 dB was attained over the 1BPS-throughput RSC-coded AGM-3 STBC-SP scheme [13] at the expense of an increased decoding complexity due to the employment of the rate-1 decoder and the additional three-stage iterations. According to Figure 8, the three-stage turbo-detected STBC-SP scheme operates within approximately 1.0 dB of the capacity limit of Eq. (19) and within 0.5 dB of the maximum achievable bandwidth efficiency limit of Eq. (20).
C. Effect of Interleaver Depth
The EXIT chart predictions are typically closely met when large interleaver depths are employed, as mentioned in Section III-A. Moreover, from a practical perspective it is always interesting to investigate the achievable performance when employing shorter interleaver depths, while using different numbers of three-stage iterations. Figure 9 shows the achievable coding gain of the three-stage STBC-SP scheme against Alamouti's conventional identical-throughput 1 BPS $G_2$-BPSK scheme, when employing various interleaver depths and different numbers of three-stage iterations.
VI. CONCLUSION
We proposed a three-stage serially concatenated turbo-detected STBC-SP scheme that is capable of achieving infinitesimally low BER values, where the performance is not limited by a BER floor, which is routinely encountered in conventional two-stage systems, especially when the inner stage is constituted by a non-recursive inner decoder. The convergence behaviour of the three-stage system was analysed with the aid of novel 3D EXIT charts and their 2D projections [25, 26]. With the aid of the 2D projections, an IRCC [23, 27] was constructed for the sake of matching the projected EXIT curve of the SP demapper and the rate-1 inner decoder, leading to a near-capacity performance. The capacity of the STBC-SP scheme was calculated and a procedure was proposed for calculating a tighter upper bound on the maximum achievable bandwidth efficiency of the three-stage system using EXIT chart analysis. Our proposed three-stage scheme operated within about 1.0 dB of the capacity limit and within 0.5 dB of the maximum achievable bandwidth efficiency limit.
In this appendix, the different anti-Gray mapping schemes introduced in this paper for STBC-SP signal sets of size L = 16 are described in detail. Since the space-time signal, which is constructed from an orthogonal design using the sphere packing scheme of Eq. (1) 
