Review of Parallel Decoding of Space-time Block Codes toward 4G Wireless and Mobile Communications  by Amani, Elie et al.
 Procedia Computer Science  19 ( 2013 )  1059 – 1067 
1877-0509 © 2013 The Authors. Published by Elsevier B.V.
Selection and peer-review under responsibility of Elhadi M. Shakshuki
doi: 10.1016/j.procs.2013.06.149 
The 8th International Symposium on Intelligent Systems Techniques for Ad hoc and 
Wireless Sensor Networks (IST-AWSN)  
Review of Parallel Decoding of Space-Time Block Codes 
toward 4G Wireless and Mobile Communications 
 
Elie Amani (a), Karim Djouani (a, b) and Anish Kurien (a)
(a) FSATI/Dept. of Electrical Engineering, Tshwane University of Technology (TUT), Pretoria, 0001, South Africa
(b) Université Paris-Est Créteil (UPEC), 12/LISSI Lab, Créteil, France
 
Abstract 
This paper presents a review of recent developments in the area of STBC decoding, particularly parallel decoding of
full-rate full-diversity STBCs toward real-time 4G wireless communications. After reviewing some parallel STBC
decoding techniques and presenting one of the most promising types of parallel processors for the 4G SDR, the
SIMD processor, the paper shows that parallel decoding of the Golden Code on the ClearSpeed CSX700
SIMD processor achieves a speedup of up to 30 times over sequential decoding on a CPU. The paper highlights the
potential to achieve real-time decoding of high-rate STBCs with the use of robust parallel processors.
 
 
Keywords: Data level Parallelism (DLP); Multiple-input multiple-output (MIMO); parallel decoding; single-instruction multiple-
data (SIMD) processor; Software Defined Radio (SDR); Space-Time Block Code (STBC). 
1. Introduction 
Multiple-input multiple-output (MIMO) communication techniques in which several antennas are used at both the 
transmitter and receiver are considered in t
high data rates while exploiting bandwidth and power as efficiently as possible. They play a key role in improving the 
reliability of communication over multipath fading channels in current wireless standards, namely IEEE 802.11 
(wireless LAN), IEEE 802.16 (WiMAX) and 3GPP Long-Term Evolution (LTE). Space-Time Block Coding is one of 
the most widely used MIMO transmission schemes. This scheme allows transmission of redundant signals from 
numerous antennas at different time periods, and thus mitigates the effects of multipath fading. Since the invention of 
the first full-rate, full-diversity, orthogonal space-time block code (OSTBC), the 2 × 2 Alamouti Code [1], many
orthogonal and quasi-orthogonal STBCs (QOSTBCs) have been constructed [2], [3], [4], [5], [6]. QOSTBCs were 
developed to overcome the rate limitation of OSTBCs for systems with more than two transmit antennas. Also, perfect 
Open access under CC BY-NC-ND license.
STBCs (PSTBCs) [7], which belong to the Golden Code [8] family, have been invented. PSTBCs outperform the 
Alamouti Code and all OSTBCs and QOSTBCs in that they have full rate, full diversity, a non-vanishing constant 
minimum determinant that improves spectral efficiency, uniform average transmitted energy per antenna, and good 
shaping [7]. However, they suffer from higher decoding complexity. PSTBCs are suitable for multi-antenna base stations
serving multi-antenna mobile terminals. 
Many decoding techniques have been applied to STBCs that include well-known signal processing algorithms such 
as the Zero Forcing (ZF) algorithm, Minimum Mean Square Error (MMSE) decoding, Decision Feedback (DF) 
decoding, Sphere Decoding (SD), Interference Cancellation (IC), Conditional Optimisation (CO), and Maximum 
Likelihood (ML) decoding. Among all of these, ML decoding has been proven to have the best Signal-to-Noise Ratio 
(SNR) performance [9]. Because the complexity of the optimal decoding of most STBCs increases as the number of 
transmit antennas and/or the size of the underlying QAM constellation increases, the focus of recent research in the 
area of STBC decoding has been on reducing this decoding complexity. This has been done by developing modified or 
improved versions of the ZF, MMSE, DF, SD, IC, CO and ML algorithms as applied to STBC decoding, and most 
importantly by considering their parallelisation [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21]. In fact, 
most of these algorithms exhibit a high level of parallelism that can be exploited to map and implement them on highly 
parallel processors. Implementation on parallel processors generally yields a large increase in decoding
speed, as demonstrated in [13]: an algorithm with a high level of parallelism typically runs faster on a
multi-processor (parallel) architecture than on a mono-processor (sequential) architecture. However, the fourth
generation (4G) base station (BS) and mobile station (MS) software defined radio (SDR) presents another challenge.
The signal processing of the 4G SDR technology requires 10-1000 times more computational performance than 3G, but 
allows only 2-5 times power budget increase [22]. Thus, although highly parallel processors offer high computational 
performance, not all types of parallel processors are suitable to implement the digital baseband processing of 
algorithms in IEEE 802.16 (WiMAX) and 3GPP LTE standards. Low-power consumption is a crucial requirement. 
This paper reviews some parallel algorithms applied to the decoding of STBCs. As a complete review of all 
existing parallel STBC decoding algorithms is not possible, emphasis is put on some of the most famous approaches 
that have been applied over the past few years. The paper also studies a type of parallel processor that may be suitable 
for implementing the 4G SDR, the single-instruction multiple-data (SIMD) processor, due to the data level parallelism 
(DLP) exhibited by many 4G baseband signal processing algorithms.  The paper further discusses the speedup results 
achieved by implementing the parallel conditionally optimised (CO) decoding of the Golden Code on a low-power 
highly parallel SIMD processor, the ClearSpeed CSX700. This parallel implementation shows the potential of 
achieving real-time decoding of high-rate STBCs using a suitable parallel processor. 
The rest of the paper is organised as follows. In Section 2, the MIMO system model is discussed. In Section 3,
some STBC parallel decoding techniques are reviewed. Section 4 describes SIMD processors and their suitability for
the 4G SDR. Section 5 presents the algorithmic steps of the parallel CO decoding of the Golden Code. Experimental
results are discussed in Section 6. Finally, conclusions are provided in Section 7.
2. MIMO System Model 
The general transmission model of a basic MIMO system with N transmit antennas and M receive antennas is
\[ \mathbf{r} = \mathbf{H}\mathbf{s} + \mathbf{n} \qquad (1) \]
where $\mathbf{r}$ is the $M \times 1$ received signal vector, $\mathbf{H}$ is the $M \times N$ channel matrix, $\mathbf{s}$ is the $N \times 1$ transmitted signal vector,
and $\mathbf{n}$ is the $M \times 1$ additive noise vector made of complex white Gaussian variables, each with zero mean and
variance $\sigma^2$. Equation 1 represents a single MIMO user in a multi-user system or a subcarrier in Orthogonal
Frequency Division Multiplexing (OFDM) systems. At the receiver end, the channel matrix $\mathbf{H}$ is estimated by a
channel estimator, and an estimate $\hat{\mathbf{s}}$ of the transmitted signal vector $\mathbf{s}$ is computed by transforming the received
signal vector $\mathbf{r}$ using linear or nonlinear MIMO decoders. In the next section, five MIMO-STBC decoders are
reviewed. In recent literature, some of these decoders have been parallelised and mapped on SIMD processors to
achieve real-time decoding.
3. Some Parallel STBC Decoding Techniques 
3.1. Parallel ZF and MMSE Decoding 
The ZF and MMSE are low-complexity linear decoders extensively used in MIMO-STBC decoding, especially in 
decoding the Alamouti STBC. The Alamouti STBC has been adopted and implemented in many industrial standards, 
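including WiMAX and 3GPP LTE. In ZF decoding, the estimate $\hat{\mathbf{s}}$ of the transmitted signal vector $\mathbf{s}$ is computed as
follows:
\[ \hat{\mathbf{s}} = (\mathbf{H}^{*}\mathbf{H})^{-1}\mathbf{H}^{*}\mathbf{r} = \mathbf{H}^{+}\mathbf{r} \qquad (2) \]
where * denotes the conjugate transpose operation, $\mathbf{H}^{+}$ is the pseudo-inverse of the channel matrix $\mathbf{H}$ and $\mathbf{r}$ is the
received signal vector. In MMSE decoding, $\hat{\mathbf{s}}$ is obtained by
\[ \hat{\mathbf{s}} = (\mathbf{H}^{*}\mathbf{H} + \sigma^{2}\mathbf{I})^{-1}\mathbf{H}^{*}\mathbf{r} \qquad (3) \]
where $\sigma^2$ is the noise variance and $\mathbf{I}$ is the identity matrix with the same size as $\mathbf{H}^{*}\mathbf{H}$. If the equalization matrix that is
computed in both ZF and MMSE is denoted by X, then $X = \mathbf{H}^{+}$ in ZF and $X = (\mathbf{H}^{*}\mathbf{H} + \sigma^{2}\mathbf{I})^{-1}\mathbf{H}^{*}$ in MMSE. The decoding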
cost for both ZF and MMSE depends on the time taken to compute matrix X in every channel coherence time. The 
computation of X involves matrix inversion and multiplication, and matrix inversion is very computationally 
expensive in MIMO-STBC decoding. 
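The ZF and MMSE equalization matrices described above can be illustrated with a minimal Python sketch. It assumes a system with two transmit antennas, so that the matrix to invert is 2 × 2 and has a closed-form inverse. The function names (`equalization_matrix`, `apply_matrix`) and the example channel values are hypothetical and are not taken from the reviewed papers.

```python
def conj_transpose(m):
    """Return the conjugate transpose of a matrix stored as a list of rows."""
    return [[m[r][c].conjugate() for r in range(len(m))] for c in range(len(m[0]))]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(m):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def equalization_matrix(h, noise_var=0.0):
    """X = (H*H + noise_var I)^-1 H*. noise_var = 0 gives ZF, noise_var > 0 gives MMSE.
    Assumes two transmit antennas, so H*H is 2 x 2."""
    hh = conj_transpose(h)
    gram = matmul(hh, h)
    for i in range(len(gram)):
        gram[i][i] += noise_var
    return matmul(inverse_2x2(gram), hh)

def apply_matrix(x, r):
    """Matrix-vector product x r."""
    return [sum(x[i][k] * r[k] for k in range(len(r))) for i in range(len(x))]

# Hypothetical 2 x 2 channel and a pair of QPSK symbols, noiseless for clarity.
h = [[1 + 1j, 0.5 - 0.2j], [0.3j, -1 + 0.4j]]
s = [1 + 1j, -1 + 1j]
r = apply_matrix(h, s)                                   # r = H s
s_zf = apply_matrix(equalization_matrix(h), r)           # recovers s exactly without noise
s_mmse = apply_matrix(equalization_matrix(h, 0.1), r)    # slightly biased estimate of s
```

In the noiseless case the ZF estimate coincides with the transmitted vector, while the MMSE estimate is slightly shrunk by the regularisation term. The dominant cost in both cases is the inversion, which is what the methods reviewed next aim to speed up.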
In [23] and [24], the authors present a fast, data-parallel and silicon-efficient method to perform complex-valued 
matrix inversion, named blockwise analytic matrix inversion (BAMI) and its simplified version for STBC decoding. 
This method, mainly based on 2 × 2 and 4 × 4 matrices which are the most relevant in MIMO-STBC, partitions the
matrix to be inverted into four smaller matrices. The inverse is computed based on computations performed on these
smaller matrices. For a two-user Alamouti STBC system, the computation of the 4 × 4 equalization matrix X in ZF
and MMSE requires only a few Multiply-and-Accumulate (MAC) operations, plus two extra additions in MMSE. 
These operations are easily performed on a baseband processor with a 4-way SIMD floating-point complex MAC 
datapath. The method in [24] allows the computation of matrix X for 32 subcarriers in 1.4 μs and thus allows real-
time communication in an Alamouti-STBC-based WiMAX system with 5 MHz channel bandwidth, 8 ms channel 
coherence time and 102.9 μs symbol duration. 
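The idea of inverting a 4 × 4 matrix through four 2 × 2 blocks can be sketched as follows. This is the general textbook blockwise (Schur complement) inversion, given here only as an illustration. It is not claimed to be the exact BAMI procedure of [23] and [24], and the helper names and the example matrix are hypothetical.

```python
def mul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b, sign=1):
    """Element-wise a + sign * b."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def neg(a):
    return [[-x for x in row] for row in a]

def inv2(m):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def blockwise_inverse_4x4(m):
    """Invert M = [[A, B], [C, D]] using only 2 x 2 inverses and products.
    Requires A and the Schur complement S = D - C A^-1 B to be invertible."""
    a = [row[:2] for row in m[:2]]
    b = [row[2:] for row in m[:2]]
    c = [row[:2] for row in m[2:]]
    d = [row[2:] for row in m[2:]]
    a_inv = inv2(a)
    s_inv = inv2(add(d, mul(c, mul(a_inv, b)), sign=-1))
    top_right = neg(mul(a_inv, mul(b, s_inv)))
    bottom_left = neg(mul(s_inv, mul(c, a_inv)))
    top_left = add(a_inv, mul(mul(a_inv, b), mul(s_inv, mul(c, a_inv))))
    return ([tl + tr for tl, tr in zip(top_left, top_right)] +
            [bl + br for bl, br in zip(bottom_left, s_inv)])

# Hypothetical Hermitian, diagonally dominant matrix standing in for H*H.
m = [[4, 1j, 0.5, 0],
     [-1j, 3, 0, 0.5],
     [0.5, 0, 5, 1 - 1j],
     [0, 0.5, 1 + 1j, 4]]
m_inv = blockwise_inverse_4x4(m)
```

All the 2 × 2 products in this sketch are small multiply-and-accumulate patterns, which is why this kind of decomposition maps naturally onto a narrow SIMD MAC datapath.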
3.2. Parallel Interference Cancellation 
OSTBCs such as the Alamouti Code only need linear processing at the receiver to separate the data streams from 
multiple antenna transmissions and to eliminate co-channel interference or Multiple Access Interference (MAI) in 
multi-user environments [1], [2]. Simple linear processing produces accurate estimates of the transmitted symbols. 
However, with QOSTBCs, some co-channel interference remains after linear processing is applied and this severely 
reduces performance. To counter the effect of this performance degradation and obtain better estimates of the 
transmitted symbols, some detection techniques are used after linear processing. The most famous of these techniques 
are Joint-Detection (JD), Successive Interference Cancellation (SIC), Iterative Parallel Interference Cancellation 
(IPIC) [12], and Partial Parallel Interference Cancellation (PPIC) [11]. JD and SIC suffer high complexity compared 
to IPIC and PPIC. 
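The codewords of the rate-one QOSTBC of [6] are 4 × 4 matrices of the form
\[ X = \begin{pmatrix} z_1 & z_2 & z_3 & z_4 \\ -z_2^{*} & z_1^{*} & -z_4^{*} & z_3^{*} \\ -z_3^{*} & -z_4^{*} & z_1^{*} & z_2^{*} \\ z_4 & -z_3 & -z_2 & z_1 \end{pmatrix} \qquad (4) \]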
where $z_1$, $z_2$, $z_3$, $z_4$ are constellation symbols, * denotes the complex conjugate operation, each column represents
a symbol sequence transmitted from a different antenna, and each row contains the symbol sequence transmitted on a
different time slot. This QOSTBC therefore transmits four symbols using four transmit antennas in four time slots. If
the channel is modelled as Rayleigh flat fading, the path gain affecting a symbol $z_k$ between transmit antenna $j$ and
receive antenna $i$ at time $t$ is a complex coefficient of the form $h_{i,j}(t)$. Assuming quasi-static fading, the channel
coefficients remain constant during each block of 4 transmission time slots, i.e. $h_{i,j}(1) = h_{i,j}(2) = h_{i,j}(3) = h_{i,j}(4) = h_{i,j}$.
Thus, after linear processing, the estimates of each 4 transmitted symbols at $p$ receive antennas are
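\[
\begin{aligned}
\tilde{z}_1 &= \sum_{i=1}^{p}\left(|h_{i,1}|^2+|h_{i,2}|^2+|h_{i,3}|^2+|h_{i,4}|^2\right) z_1 + \sum_{i=1}^{p} 2\,\mathrm{Re}\left(h_{i,1}h_{i,4}^{*} - h_{i,2}h_{i,3}^{*}\right) z_4 + \text{noise},\\
\tilde{z}_2 &= \sum_{i=1}^{p}\left(|h_{i,1}|^2+|h_{i,2}|^2+|h_{i,3}|^2+|h_{i,4}|^2\right) z_2 + \sum_{i=1}^{p} 2\,\mathrm{Re}\left(h_{i,2}h_{i,3}^{*} - h_{i,1}h_{i,4}^{*}\right) z_3 + \text{noise},\\
\tilde{z}_3 &= \sum_{i=1}^{p}\left(|h_{i,1}|^2+|h_{i,2}|^2+|h_{i,3}|^2+|h_{i,4}|^2\right) z_3 + \sum_{i=1}^{p} 2\,\mathrm{Re}\left(h_{i,2}h_{i,3}^{*} - h_{i,1}h_{i,4}^{*}\right) z_2 + \text{noise},\\
\tilde{z}_4 &= \sum_{i=1}^{p}\left(|h_{i,1}|^2+|h_{i,2}|^2+|h_{i,3}|^2+|h_{i,4}|^2\right) z_4 + \sum_{i=1}^{p} 2\,\mathrm{Re}\left(h_{i,1}h_{i,4}^{*} - h_{i,2}h_{i,3}^{*}\right) z_1 + \text{noise},
\end{aligned} \qquad (5)
\]
where the first summation for each symbol estimate represents the desired terms and the second summation
represents the interfering terms.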
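The split between desired and interfering terms can be checked numerically. The sketch below assumes a single receive antenna and the rate-one quasi-orthogonal codeword structure of [6] with the usual sign pattern (rows $(z_1, z_2, z_3, z_4)$, $(-z_2^*, z_1^*, -z_4^*, z_3^*)$, $(-z_3^*, -z_4^*, z_1^*, z_2^*)$, $(z_4, -z_3, -z_2, z_1)$). The received samples of time slots 2 and 3 are conjugated to obtain a linear model in the symbols. The function names and channel values are hypothetical.

```python
def equivalent_channel(h):
    """Equivalent 4 x 4 channel for the assumed QOSTBC and one receive antenna.
    Rows correspond to (r1, conj(r2), conj(r3), r4), columns to (z1, z2, z3, z4)."""
    h1, h2, h3, h4 = h
    c = lambda x: x.conjugate()
    return [
        [h1, h2, h3, h4],
        [c(h2), -c(h1), c(h4), -c(h3)],
        [c(h3), c(h4), -c(h1), -c(h2)],
        [h4, -h3, -h2, h1],
    ]

def gram(m):
    """Return M* M, i.e. what the linear (matched filter) processing applies to the symbols."""
    n = len(m[0])
    return [[sum(m[k][i].conjugate() * m[k][j] for k in range(len(m))) for j in range(n)]
            for i in range(n)]

# Hypothetical channel gains from the four transmit antennas.
h = [0.8 + 0.3j, -0.5 + 1.1j, 0.2 - 0.7j, 1.0 + 0.4j]
g = gram(equivalent_channel(h))

gain = sum(abs(x) ** 2 for x in h)                                        # desired term
coupling = 2 * (h[0] * h[3].conjugate() - h[1] * h[2].conjugate()).real   # interfering term
```

Under these assumptions the diagonal of `g` carries the desired gain, the only non-zero off-diagonal entries couple $z_1$ with $z_4$ and $z_2$ with $z_3$ (with opposite signs), and all other pairs are orthogonal.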
Parallel Interference Cancellation (PIC), described in detail in [10], [11] and [12], is one of the techniques
then used to remove the interfering terms and obtain better estimates. PIC regenerates the estimated interfering
terms and subtracts them simultaneously from each received signal to form new received signals, which go through PIC
again. It is called parallel IC because the suppression of interference from the received signals is performed in
parallel. This process is repeated iteratively until no further significant improvement in performance is achieved.
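A minimal Python sketch of this iteration is given below. It assumes QPSK symbols, hard decisions between iterations, a fixed number of iterations instead of a convergence test, and the pairwise coupling pattern described above ($z_1$ with $z_4$, $z_2$ with $z_3$, with opposite signs for the two pairs). The names `pic_decode`, `PARTNER` and `SIGN`, and the numeric values, are hypothetical; the detectors of [10], [11] and [12] differ in their details.

```python
PARTNER = [3, 2, 1, 0]   # index of the interfering symbol for z1, z2, z3, z4
SIGN = [1, -1, -1, 1]    # assumed sign of the coupling term for each estimate

def qpsk_slice(x):
    """Hard decision onto the QPSK points (+/-1 +/-1j)."""
    return complex(1 if x.real >= 0 else -1, 1 if x.imag >= 0 else -1)

def pic_decode(z_tilde, gain, coupling, iterations=3):
    """Parallel interference cancellation on the four linear estimates.
    Every iteration updates all four symbols from the previous estimates only,
    so the four updates are independent and can run in parallel."""
    est = [qpsk_slice(z / gain) for z in z_tilde]
    for _ in range(iterations):
        est = [qpsk_slice((z_tilde[k] - SIGN[k] * coupling * est[PARTNER[k]]) / gain)
               for k in range(4)]
    return est

# Hypothetical example: one noisy estimate makes the plain linear decision fail.
sent = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
gain, coupling = 3.0, 1.2
noise = [-2.0, 0.0, 0.0, 0.0]
z_tilde = [gain * sent[k] + SIGN[k] * coupling * sent[PARTNER[k]] + noise[k]
           for k in range(4)]

linear_only = [qpsk_slice(z / gain) for z in z_tilde]
after_pic = pic_decode(z_tilde, gain, coupling)
```

In this example the first symbol is decided wrongly after linear processing alone, because interference and noise add up, and is corrected once the regenerated interference from the fourth symbol has been subtracted.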
3.3. Parallel Conditionally Optimised Decoding 
Conditional optimisation (CO) is a decoding technique widely used in statistical estimation and signal processing 
[17]. In STBC decoding, this technique reduces the constellation search space by optimisation over a subset of 
symbols conditioned on the remaining symbols. CO avoids exhaustive search and therefore reduces the decoding 
complexity. CO applies to QOSTBCs and codes that are constructed by multiplexing several other STBC blocks such 
as the 2 × 2 STBC in [25], the Silver Code [26], the Golden Code [8] and other PSTBCs [7]. CO has essentially ML
performance and naturally reduces to a parallel calculation that fits SIMD processors. The application of sequential
CO decoding to the Silver Code or the Golden Code reduces their decoding complexity from $O(N^4)$ to $O(N^2)$ for an
N-QAM constellation. Parallel CO decoding of the Silver Code on general purpose graphics processing units
(GPGPUs) in [13] achieves its real-time decoding in WiMAX systems with all channel bandwidths. In the following 
paragraph, parallel CO decoding of the Silver Code is described.  
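The Silver Code codewords are made of two orthogonally encoded pairs and are of the form
\[ X = \begin{pmatrix} z_1 & -z_2^{*} \\ z_2 & z_1^{*} \end{pmatrix} + v \begin{pmatrix} z_3 & -z_4^{*} \\ z_4 & z_3^{*} \end{pmatrix} \qquad (6) \]
where $v = \dfrac{1}{\sqrt{7}} \begin{pmatrix} 1-i & 1-2i \\ 1+2i & -1-i \end{pmatrix}$ is a unitary matrix, and the entries $z_1$, $z_2$, $z_3$, $z_4$ are the QAM symbols to be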
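transmitted. Let $\mathbf{s}_1 = (s_{11}, s_{12})$ and $\mathbf{s}_2 = (s_{21}, s_{22})$ be the received signal vectors where the components are the signals
received over two consecutive time slots as shown in Eq. 7 and Eq. 8.
\[ (s_{11}, s_{12}) = (h_{11}, h_{21}) \left[ \begin{pmatrix} z_1 & -z_2^{*} \\ z_2 & z_1^{*} \end{pmatrix} + v \begin{pmatrix} z_3 & -z_4^{*} \\ z_4 & z_3^{*} \end{pmatrix} \right] + (\mu_{11}, \mu_{12}) \qquad (7) \]
\[ (s_{21}, s_{22}) = (h_{12}, h_{22}) \left[ \begin{pmatrix} z_1 & -z_2^{*} \\ z_2 & z_1^{*} \end{pmatrix} + v \begin{pmatrix} z_3 & -z_4^{*} \\ z_4 & z_3^{*} \end{pmatrix} \right] + (\mu_{21}, \mu_{22}) \qquad (8) \]
where $s_{kt}$ is the received signal from receive antenna k at time slot t, $h_{lk}$ is the channel gain from transmit antenna l to
receive antenna k, and $\mu_{kt}$ represents complex Gaussian noise variables. The two received signal vectors can be
combined in a single vector $\mathbf{s}$ and expressed as
\[ \mathbf{s} = H\mathbf{a} + G\mathbf{b} + \boldsymbol{\mu} \qquad (9) \]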
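where $\mathbf{s} = (s_{11}, s_{12}^{*}, s_{21}, s_{22}^{*})^{T}$, $\mathbf{a} = (z_1, z_2)^{T}$, $\mathbf{b} = (z_3, z_4)^{T}$, $\boldsymbol{\mu} = (\mu_{11}, \mu_{12}^{*}, \mu_{21}, \mu_{22}^{*})^{T}$,
\[ H = \begin{pmatrix} h_{11} & h_{21} & h_{12} & h_{22} \\ h_{21}^{*} & -h_{11}^{*} & h_{22}^{*} & -h_{12}^{*} \end{pmatrix}^{\!\top_{B}} = \begin{pmatrix} h_{11} & h_{21} \\ h_{21}^{*} & -h_{11}^{*} \\ h_{12} & h_{22} \\ h_{22}^{*} & -h_{12}^{*} \end{pmatrix}, \]
(the two 2 × 2 blocks, one per receive antenna, stacked vertically), and $G$ has the same form as $H$ with each pair
$(h_{1k}, h_{2k})$ replaced by $(g_{1k}, g_{2k}) = (h_{1k}, h_{2k})\, v$. Since $H^{*}H = \|h\|^{2} I$ with $\|h\|^{2} = \sum_{l,k} |h_{lk}|^{2}$, the pair $\mathbf{a}$ can be
estimated by simple matched filtering once $\mathbf{b}$ is fixed.
The parallel CO decoding is performed as follows:
 Step 1: In parallel, for every candidate pair $\mathbf{b}$ of constellation symbols,
choose $\mathbf{b}$; given $\mathbf{b}$, compute $\tilde{\mathbf{a}} = H^{*}(\mathbf{s} - G\mathbf{b}) / \|h\|^{2}$; round $\tilde{\mathbf{a}}$ off: $\hat{\mathbf{a}}(\mathbf{b}) = Q(\tilde{\mathbf{a}})$;
compute the decision metric $d(\mathbf{b}) = \|\mathbf{s} - H\hat{\mathbf{a}}(\mathbf{b}) - G\mathbf{b}\|^{2}$;
 Step 2: Of all the simultaneously computed decision metric values, find the minimum using parallel reduction and
estimate the transmitted symbols as follows:
\[ \hat{\mathbf{b}} = \arg\min_{\mathbf{b}} d(\mathbf{b}), \qquad \hat{\mathbf{a}} = \hat{\mathbf{a}}(\hat{\mathbf{b}}); \]
Here, * denotes the conjugate transpose operation and $Q$ is the quantizer for the QAM constellation.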
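The two steps above can be sketched in Python under explicitly stated assumptions: the codeword is taken as an Alamouti block in $(z_1, z_2)$ plus $v$ times an Alamouti block in $(z_3, z_4)$, each receive antenna $k$ observes the row vector $(h_{1k}, h_{2k})$ multiplied by the codeword, the second sample of each antenna is conjugated, the unitary matrix `V` is one commonly quoted choice, and the symbols are QPSK. All names (`co_decode`, `equivalent_channels`, `V`) and numeric values are illustrative and do not reproduce the implementation of [13].

```python
import itertools
import math

QPSK = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]
S7 = math.sqrt(7)
# Assumed unitary matrix v (illustrative choice).
V = [[(1 - 1j) / S7, (1 - 2j) / S7], [(1 + 2j) / S7, (-1 - 1j) / S7]]

def quantize(x):
    """Nearest QPSK point (the quantizer Q)."""
    return min(QPSK, key=lambda q: abs(x - q))

def alamouti_rows(h1, h2):
    """Rows for (s_k1, conj(s_k2)) of one receive antenna."""
    return [[h1, h2], [h2.conjugate(), -h1.conjugate()]]

def equivalent_channels(h):
    """h[k] = (h_1k, h_2k). Returns the 4 x 2 matrices H and G of the assumed model."""
    big_h, big_g = [], []
    for h1, h2 in h:
        g1 = h1 * V[0][0] + h2 * V[1][0]   # (g1, g2) = (h1, h2) v
        g2 = h1 * V[0][1] + h2 * V[1][1]
        big_h += alamouti_rows(h1, h2)
        big_g += alamouti_rows(g1, g2)
    return big_h, big_g

def apply(m, x):
    return [row[0] * x[0] + row[1] * x[1] for row in m]

def co_decode(s, h):
    """Conditional optimisation: enumerate b = (z3, z4), estimate a = (z1, z2) per candidate."""
    big_h, big_g = equivalent_channels(h)
    energy = sum(abs(x) ** 2 for pair in h for x in pair)
    best = None
    # Step 1: every candidate b is independent, so each could be mapped to one PE.
    for b in itertools.product(QPSK, repeat=2):
        resid = [si - gi for si, gi in zip(s, apply(big_g, b))]
        a = tuple(quantize(sum(big_h[k][j].conjugate() * resid[k] for k in range(4)) / energy)
                  for j in range(2))
        metric = sum(abs(r - x) ** 2 for r, x in zip(resid, apply(big_h, a)))
        # Step 2: keep the minimum metric (a parallel reduction on a SIMD array).
        if best is None or metric < best[0]:
            best = (metric, a + b)
    return best[1]

# Hypothetical noiseless example with two receive antennas.
h = [(0.9 + 0.2j, -0.4 + 0.7j), (0.3 - 1.1j, 0.6 + 0.5j)]
sent = (1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j)
big_h, big_g = equivalent_channels(h)
s = [x + y for x, y in zip(apply(big_h, sent[:2]), apply(big_g, sent[2:]))]
decoded = co_decode(s, h)
```

Only $N^2$ candidates are examined for an N-QAM constellation (16 for QPSK) instead of $N^4$, and the loop body has no dependence between candidates, which is what makes the method data-parallel.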
3.4. Parallel Dimensionality Reduced Sphere Decoding 
In [20] and [21], dimensionality reduction and parallelisation are applied to the sphere decoding (SD) algorithm to 
achieve fast and efficient decoding of the Golden Code with worst-case complexity of $O(N^2)$ and even $O(N^{1.5})$, for an
N-QAM constellation, and a loss of only 1 dB with respect to optimal decoding in the low SNR range. The method
uses a parallelised depth-first search strategy. The dependent and independent parts of the search tree are separated. 
Then, dimensionality reduction and parallelisation with two hard decisions and six parallel trees is applied. Thus, six 
parallel independent depth-first searches are run with two independent hard decisions on a subset of the constellation 
symbols. Six independent and parallel solutions are obtained, and the solution with the minimum distance metric is 
selected as final solution. 
4. SIMD Processors and 4G SDR 
Fourth generation (4G) wireless standards have been defined to support higher data rates.
Indeed, data rates ranging from hundreds of megabits per second (Mbps) for high mobility scenarios to one Gigabit
per second (Gbps) for stationary and low mobility situations are now supported in the IEEE 802.16 and 3GPP LTE
standards. Unfortunately, this also implies an increase in the computational capacity required to process these
standards on SDR systems. More powerful processors are needed. Fortunately, most baseband signal processing
algorithms of the 4G standards exhibit high DLP that can be exploited through the use of SIMD processors. In fact,
all the three major components of the 4G physical layer, namely the OFDM modulator/demodulator, the MIMO 
encoder/decoder and the channel encoder/decoder [22] exhibit enough DLP to be implemented on massively parallel 
SIMD processors. For instance, for the 1024-point Fast Fourier Transform (FFT) algorithm used in 4G OFDM 
demodulation, all 2-point FFT operations can be performed in parallel. This means that an SIMD processor with a 
number of processing elements (PEs) that is as large as the FFT vector width is the most suitable to achieve 
maximum performance. Highly parallel SIMD processors are made of a control unit and a number of PEs. Each PE 
has limited local memory and uses asynchronous I/O mechanisms to access large shared memory [27]. The single-
instruction multiple-data operation means that all software threads simultaneously running on the array of PEs 
execute the same instruction but may operate on different data. Advances in semiconductor manufacturing and 
processor architectures have made possible the development of multi-core SIMD array processors. The latter have 
multiple processor cores on the same chip, each with an array of PEs, that operate in SIMD mode. The presence of 
multiple cores allows a program with a large amount of data to be partitioned into multiple threads that run side-by-side
on multiple cores, and the data in each thread to be further partitioned among different PEs on the cores. It is clear
that multi-core architectures tremendously increase computational capability. The biggest challenge for programmers 
is the optimisation of code execution, memory access, and data transfer patterns in order to fully benefit from the 
parallelism of multi-core SIMD processors. But the most crucial challenge for manufacturers of SIMD processors that 
are suitable to implement the 4G SDR is power consumption. Computational capability comes with increase in power 
consumption. This makes low-power highly parallel SIMD processors such as the ClearSpeed CSX700 [28] and their 
future advanced and adapted versions better candidates for embedded 4G SDR applications compared to SIMD 
architectures such as GPGPUs [13] that have high power consumption. 
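The data level parallelism of the FFT mentioned above can be made concrete with a short Python sketch. It implements a standard iterative radix-2 decimation-in-time FFT in which, at every stage, all 2-point butterflies read and write disjoint pairs of samples. In this sequential sketch they are simply looped over, whereas an SIMD array could assign one butterfly per PE. The function names are illustrative.

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 DIT FFT. len(x) must be a power of two."""
    n = len(x)
    bits = n.bit_length() - 1
    # Bit-reversal permutation of the input samples.
    data = [x[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)] if bits else list(x)
    size = 2
    while size <= n:
        half = size // 2
        # The n/2 butterflies of this stage touch disjoint (top, bottom) pairs,
        # so they are mutually independent and could all run in one SIMD step.
        pairs = [(start + k, start + k + half, cmath.exp(-2j * cmath.pi * k / size))
                 for start in range(0, n, size) for k in range(half)]
        updated = list(data)
        for top, bottom, twiddle in pairs:
            t = twiddle * data[bottom]
            updated[top] = data[top] + t
            updated[bottom] = data[top] - t
        data = updated
        size *= 2
    return data

def dft(x):
    """Direct O(n^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

samples = [1 + 0j, 2 - 1j, 0.5j, -1 + 0j, 3 + 2j, 0j, -2 - 2j, 1 - 1j]
spectrum = fft_radix2(samples)
reference = dft(samples)
```

For a 1024-point FFT this structure gives 10 stages of 512 independent butterflies each, which indicates how many PEs can be kept busy at once.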
5. Parallel CO Decoding of the Golden Code 
In this section, the parallel CO decoding is applied to the Golden Code. The codewords of the Golden Code are 2 × 2
matrices of the form
                  
where z1, z2, z3, z4 are the transmitted QAM symbols;  =1+i i  is the Golden Number, and = 
- .  Let   and   be the received signal vectors where the entries are the received signals over two 
successive time slots as shown in Eq. 11 and Eq. 12. 
  
                               (11)
 
                                (12) 
where skt is the received signal from receive antenna k at time slot t, hlk is the channel gain from transmit antenna l to 
receive antenna k, =1+i , , and μkt denotes complex Gaussian noise variables. The short notation for the 
received signal vector is           
where , , , ,  
, and . 
 
Fig. 1.  Algorithm for parallel CO decoding of the Golden Code. 
Similar to the parallel CO decoding of the Silver Code, the parallel CO decoding of n QAM symbols, in a frame 
transmitted using the Golden STBC scheme on a multicore SIMD processor with p cores and k PEs per core is 
illustrated in the algorithm in Fig. 1. Additional details regarding the implementation of this algorithm on the 
CSX700 are provided in [29]. The choice of the symbol set to condition on the other depends on the estimated entries 
of the channel matrix. If ,  is selected as the main symbol set, and  is conditioned on 
 as depicted in Fig. 1. On the other hand, if ,  is selected as the main symbol set,  is 
conditioned on  and the rest of the algorithm remains the same. 
6. Experimental Results 
This section presents the experimental results obtained from the sequential and parallel implementations of the 
algorithm in Fig. 1. Sequential implementation is performed on the Intel Core i7-2600 CPU (3.40 GHz core clock
speed) and the parallel implementation on the ClearSpeed CSX700. The CSX700 is a low-power massively parallel 
SIMD processor with 2 cores, each with 96 PEs [28]. The CSX700 has a peak computational performance of 96 
GFLOPS and a typical power consumption of 9 to 10 W. The parallel implementation requires data transfers between 
a host CPU and the ClearSpeed card where the CSX700 is mounted. The host CPU is programmed in C language and 
the CSX700 in a data-parallel extension to the C language called Cn. Cn parallelism is expressed at the type level and 
not at the code level. A variable of poly type stores parallel data, meaning that each PE has a copy of the variable in 
its local memory and can store a different value in it. On the other hand, a variable of mono type stores non-parallel 
data which resides in mono memory of the processor core.  
The results are based on systems that assume perfect channel state information (CSI) at the receiver. It has been 
proved in [18] that the sequential CO decoding has essentially ML performance. Even the parallelised algorithm has 
essentially ML performance. Tables 1 and 2 contain the decoding durations for both sequential and parallel 
implementations and the observed speedup ratio. QPSK and 16-QAM modulation schemes are used. These results 
show that parallel CO decoding of the Golden Code on the CSX700 runs up to 10 times faster than sequential CO 
decoding on a CPU for QPSK and up to 30 times faster for 16-QAM. Table 3 depicts the obtained decoding time per 
QPSK/16-QAM symbol on the CSX700 with 1 active core and with 2 active cores, and projects the decoding time 
that would be achieved with 8 similar cores. Fig. 2 shows that the speedup ratio is directly
proportional to the number of cores used on the parallel SIMD processor. More cores naturally imply greater speedup 
especially if the algorithm can scale up to the number of available PEs. But this is not always the case and speedup 
may reach an asymptotic value beyond which there will not be any significant increase irrespective of the increase in 
the number of cores. In the case of parallel CO decoding of the Golden Code, if all decision metrics need to be 
calculated simultaneously, QPSK will require 16 PEs ($4^2$), 16-QAM will require 256 PEs ($16^2$) and 64-QAM will
require 4096 PEs (642). Currently, there are no low-power SIMD processors that can reach such a large number of 
PEs. However, any number of available PEs can be exploited meaningfully to implement parallel CO decoding of the 
Golden Code. The number of decision metrics that are computed simultaneously will just be limited by the number of 
available PEs. Finally, it can be proved from the results in Table 3 that the use of 4 to 5 CSX700 processors has the 
potential to deliver real-time decoding of a 1.25 MHz Mobile WiMAX OFDMA frame using QPSK or 16-QAM 
modulation and the Golden STBC scheme.  
 
7. Conclusions 
Current 4G wireless communications demand higher data rates than 3G and consequently more computational 
capability for processors used to process the 4G algorithms. STBC coding and decoding techniques are among the 
techniques that need to adapt to the required high data rates in 4G. The trend in recent literature has been to 
parallelise the existing sequential STBC decoding algorithms to fit them for implementation on highly parallel 
processors. Some SIMD implementations have shown that parallelisation indeed yields a large reduction
in the computational latency of decoding. The results in this paper confirm this. Parallel CO decoding of the Golden
Code on the ClearSpeed CSX700 SIMD processor achieves a speedup of up to 30 times when compared to sequential 
CO decoding on a CPU. These results reveal a potential to achieve real-time decoding of high-rate STBCs such as the 
Golden Code with the use of robust parallel SIMD processors. However, power consumption remains a challenge. 
The increase in computational capability comes with the increase in power consumption. Manufacturers of highly 
parallel processors suitable for 4G SDR applications need to take this into consideration. 
 
 
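Fig. 2.  Speedup ratio of parallel vs. sequential decoding. Horizontal axis: number of QPSK/16-QAM symbols (log scale); vertical axis: speedup (linear scale). Curves: 1 CSX700 core, QPSK; 2 CSX700 cores, QPSK; 1 CSX700 core, 16-QAM; 2 CSX700 cores, 16-QAM.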
 
Table 1. QPSK Decoding Time and Speedup on CSX700
Number of       Decoding time [seconds]                      Speedup
QPSK symbols    CPU      1 CSX700 core   2 CSX700 cores      1 CSX700 core   2 CSX700 cores
524288          5.328    2.477           1.239               2.151 ×         4.300 ×
8192            0.096    0.040           0.020               2.400 ×         4.800 ×
512             0.012    0.0025          0.0012              4.800 ×         10.00 ×
 
Table 2. 16-QAM Decoding Time and Speedup on CSX700
Number of        Decoding time [seconds]                      Speedup
16-QAM symbols   CPU      1 CSX700 core   2 CSX700 cores      1 CSX700 core   2 CSX700 cores
32768            5.085    0.535           0.266               9.505 ×         19.12 ×
2048             0.348    0.035           0.017               9.943 ×         20.47 ×
256              0.059    0.004           0.002               14.75 ×         29.50 ×
 
Table 3. Estimated Decoding Time per QPSK/16-QAM Symbol
Modulation   Decoding instance     Decoding time/symbol   Decoding time/symbol   Projected decoding time/symbol
scheme                             (1 core)               (2 cores)              (8 cores)
QPSK         Worst-case speedup    4.725 μs               2.363 μs               0.591 μs
QPSK         Best-case speedup     4.688 μs               2.344 μs               0.586 μs
16-QAM       Worst-case speedup    16.357 μs              8.118 μs               2.045 μs
16-QAM       Best-case speedup     15.625 μs              7.812 μs               1.953 μs
 
References 
[1]  IEEE Journal on Selected Areas in 
Communication, vol. 16, no. 8, pp. 1451-1458, October 1998.  
[2]  V. Tarokh, H. Jafarkhani and A. R. Cal - IEEE Trans. Inform. Theory, 
vol. 45, pp. 1456-1467, July 1999.  
[3]  - IEEE 
J. Select. Areas Commun., vol. 17, pp. 451-460, March 1999.  
[4]  - in Proc. Int. Symp. 
Wireless Personal Multimedia Communications, September 2001.  
[5]  - Wireless Personal Commun., 
vol. 18, pp. 165-178, August 2001.  
[6]  - IEEE Trans. Commun., vol. 49, pp. 1-4, January 2001.  
[7]  F. Oggier, G. Rekaya, J.- - IEEE Trans. Inf. Theory, vol. 52, no. 9, pp. 
3885-3902, September 2006.  
[8]  J.- Code: A 2×2 Full-Rate Space-Time Code with Non-
IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1432-1436, April 2005.  
[9]  in Proc. 10th Mediterranean Electrotechnical Conference, vol. III, pp. 1218-
1221, 2000.  
[10] -based multiuser detection and parallel interference cancellation for space-time block coded WCDMA 
systems employing 16- The 13th IEEE International Symposium on Personal, Indoor and Mobile 
Radio Communications, vol. 2, pp. 688- 692, 15-18 September 2002.  
[11] Y.-F. Huang, J.-Y. Lin and T.- -time block coded MC-CDMA 
in Proc. 2006 International Conference on Wireless Communications and Mobile Computing, pp. 491-496, 3-6 July 2006.  
[12] 
IEEE Trans. Wireless Commun., vol. 7, no. 5, pp. 1603-1613, May 2008.  
[13] -Time Codes on Graphics 
47th Annual Allerton Conference on Communication, Control, and computing, pp. 1262-1269, 30 Sept. - 2 Oct. 
2009.  
[14] 
Space- in Proc. IEEE ICASSP 2009, pp. 2725-2728, 19-24 April 2009.  
[15] U. Park, D.-S. Oh, B.- -coded QO- 2010 IEEE 6th 
International Conference on Wireless and Mobile Computing, Networking and Communications, pp. 252-255, 11-13 October 2010.  
[16] - KICS, vol. 36, no. 2, pp. 100-107, 
19 January 2011.  
[17] S. Sirianunpiboon, Y. Wu, A. R. Cal
IEEE Trans. Inf. Theory, vol. 56, no. 3, pp. 1106-1113, 2010.  
[18] IEEE 
Trans. Inf. Theory, vol. 57, no. 6, pp. 3537-3541, June 2011.  
[19] -Orthogonal Space-Time Block Codes: Derivation and 
Lowpower I IEICE Tech. Rep., vol. 108, no. 445, pp. 43-46, March 2009.  
[20] -Case Decoding Complexity of O(m2 in 
Proc. ISWCS, November 2011.  
[21] S. Kahrama -case Complexity of O(m1.5) 
in Proc. IEEE WCNC, pp. 246-250, April 2012.  
[22] alability of SIMD for the Next Generation Software Defined 
in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5388-5391, 31 March-4 April 2008.  
[23] in Proc. IEEE ISCAS, pp. 2610-
2613, 2007.  
[24] - 4th IEEE International 
Conference on Circuits and Systems for Communications, pp. 279-282, 26-28 May 2008.  
[25] -rate full-diversity 2x2 space- in Proc. IEEE Int. Conf. Signal 
Process. Commun., pp. 416-419, 24-27 November 2007.  
[26] O. Tirkkonen a -orthogonal space- in Proc. IEEE 
GLOBECOM, vol. 2, pp. 1122-1126, 25-29 November 2001.  
[27] published in the book 
"Languages and Compilers for Parallel Computing", Springer-Verlag Berlin, Heidelberg, 2008.  
[28] 
http://www.clearspeed.com/. 
[29] -Power SIMD 
accepted paper, IEEE ICIT 2013.  
 
