A High Speed Interleaver for Emerging Wireless Communications by Yuan-Wei Wu et al.
2005 International Conference on Wireless Networks, Communications and Mobile Computing
A High Speed Interleaver for Emerging Wireless Communications*
Yuan-Wei Wu and Pangan Ting
Computer and Communication Lab.
Industrial Technology Research Institute
Hsinchu, Taiwan 310, R.O.C.
{ywvwu, pating}@itri.org.tw
Abstract
In this paper, a novel high-speed interleaver/deinterleaver is proposed. As the data rates demanded by emerging communication applications increase, architectures for high-speed interleaving become essential. The interleaver comprises three pipelined blocks that execute the three permutation steps at high speed. The addition of static random access memories (SRAM) in stage 1 speeds up interleaving by a factor of 24. Operating at 200 MHz, this design achieves an overall throughput of 4.8 Gbps. The memory architecture of the interleaver can be configured to dynamically enable and disable memory blocks for low power consumption. This flexible design performs both interleaving and deinterleaving. It also supports different wireless standards, such as IEEE Standard (Std.) 802.16a, IEEE Std. 802.11a/g, and the IEEE 802.11n proposal.
1. Introduction
Interleavers, which permute data sequences, are commonly deployed to break up data dependence and improve the capability of error correction codes. Owing to their outstanding performance, binary convolutional and turbo codes [1], [2] are widely used with interleaving in various communication systems. Studies have shown that the performance of convolutional or turbo codes with interleaving closely approaches the Shannon limit [2], [3].
In a common communication system, information bits are coded by a convolutional or turbo encoder and interleaved by an interleaver at the transmitting end. At the receiving end, the coded bits are de-interleaved by a de-interleaver and decoded by a Viterbi or turbo decoder. To obtain the desired data rate, a puncturer and de-puncturer with specific coding rates can be adopted. A typical communication system is shown in Figure 1.

* This research is supported in part by the National Science Council, Taiwan, R.O.C., under Grants NSC93-2220-E-007-018 and NSC93-2220-E-007-026.

Hsi-Pin Ma
Department of Electrical Engineering
National Tsing Hua University
Hsinchu, Taiwan 30013, R.O.C.
hp@ee.nthu.edu.tw
Coding performance depends greatly on the interleaving technique. Many papers [5]-[8] have evaluated the error-rate performance of interleaving algorithms. However, few papers [9] address the architecture design of high-throughput interleavers, especially for MIMO systems. Modern wireless communications require an overall data rate of more than 100 Mbps, and wireline communications require even higher rates. For these applications, the design of high-speed interleavers becomes critical. This paper proposes a high-speed interleaver architecture for both MIMO and SISO systems.
The paper is structured as follows. Section 2 describes the interleaving mapping based on the WWiSE proposal [10] for IEEE Std. 802.11n. Section 3 introduces a high-throughput interleaver architecture. Configurations for system conformance and low-power solutions are addressed in Section 4. Section 5 gives the implementation results, and Section 6 concludes the paper.
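[Figure 1: block diagram. Transmit chain: encoder, puncturer, interleaver, transmitter. The signal passes through the channel. Receive chain: receiver, de-interleaver, de-puncturer, decoder.]
Figure 1. A typical communication system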
2. WWiSE Interleaver
The WWiSE interleaver is a block interleaver with block size NCBPS, the number of coded bits in a MIMO-OFDM symbol. The interleaver performs four operations:
1. distribute coded bits to NSS spatial streams,
2. in each stream, write data into a buffer row by row and read out column by column,
3. perform the intra-column permutation, and
4. perform the inter-column permutation.
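As a minimal sketch of the index mapping behind these four operations, the following Python function evaluates the interleaving functions (1)-(4) of this section for a single input bit index. The function and argument names are illustrative assumptions and are not taken from the WWiSE proposal; IDEPTH is passed in explicitly rather than derived from NSS and NSD.

```python
def wwise_interleave_index(k, n_ss, n_bpsc, n_cbps, i_depth):
    """Map input bit index k to (stream n, output index j_n) following (1)-(4)."""
    per_stream = n_cbps // n_ss               # N_CBPS / N_SS, bits per stream
    n, k_n = k % n_ss, k // n_ss              # invert (1): k = k_n * N_SS + n
    s = max(n_bpsc // 2, 1)
    d_n = 5 * n                               # subcarrier shift for stream n
    # (2): write row by row, read column by column
    i = (per_stream // i_depth) * (k_n % i_depth) + k_n // i_depth
    # (3): intra-column permutation
    j = s * (i // s) + (i + per_stream - (i_depth * i) // per_stream) % s
    # (4): inter-column permutation
    j_n = (j + per_stream - n_bpsc * d_n) % per_stream
    return n, j_n


# Example with NSS = 4, NBPSC = 4, NCBPS = 864 (the case of Table 1):
# wwise_interleave_index(50, 4, 4, 864, 6)  -> (2, 178)
# wwise_interleave_index(126, 4, 4, 864, 6) -> (2, 0)
```

The second example is the bit that becomes the first output of stream 2, which sits at column 1, row 5 of the buffer.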
The interleaving functions corresponding to the above operations are defined by [10]:

$$k = k_n \times N_{SS} + n, \tag{1}$$

$$i = \frac{N_{CBPS}/N_{SS}}{I_{DEPTH}} \left( k_n \bmod I_{DEPTH} \right) + \left\lfloor \frac{k_n}{I_{DEPTH}} \right\rfloor, \tag{2}$$

$$j = s \left\lfloor \frac{i}{s} \right\rfloor + \left( i + \frac{N_{CBPS}}{N_{SS}} - \left\lfloor \frac{I_{DEPTH} \times i}{N_{CBPS}/N_{SS}} \right\rfloor \right) \bmod s, \tag{3}$$

$$j_n = \left( j + \frac{N_{CBPS}}{N_{SS}} - N_{BPSC} D_n \right) \bmod \frac{N_{CBPS}}{N_{SS}}, \tag{4}$$

where
$N_{BPSC}$ denotes coded bits per subcarrier per antenna,
$N_{CBPS}$ denotes coded bits per MIMO-OFDM symbol,
$N_{SS}$ denotes the number of transmitted spatial streams,
$N_{SD}$ denotes the number of data subcarriers,
$k = 0, 1, ..., N_{CBPS} - 1$, denoting the index of the coded bits before interleaving,
$n = 0, 1, ..., N_{SS} - 1$, denoting the index of the spatial streams,
$k_n = 0, 1, ..., (N_{CBPS}/N_{SS}) - 1$,
$D_n = 5n$, denoting the shift in subcarriers for spatial stream n,
$s = \max(N_{BPSC}/2, 1)$, and
$I_{DEPTH}$, denoting the interleaving depth, $= 12$ if $N_{SS} = 1$ and $N_{SD} = 108$, and $= 6$ otherwise.

According to (1)-(4), with NSS = 4, NBPSC = 4, and NCBPS = 864, the relation between the input and output interleaving data sequences for stream 2 is partially illustrated in Table 1, where jn# denotes the data sequence after interleaving and k# denotes the data sequence before interleaving.

Table 1. A case of the interleaving data sequence
Column 0    Column 1    Column 2    Column 3    Column 4    Column 5
jn176=k2    jn213=k6    jn32=k10    jn69=k14    jn104=k18   jn141=k22
jn177=k26   jn212=k30   jn33=k34    jn68=k38    jn105=k42   jn140=k46
jn178=k50   jn215=k54   jn34=k58    jn71=k62    jn106=k66   jn143=k70
jn179=k74   jn214=k78   jn35=k82    jn70=k86    jn107=k90   jn142=k94
jn180=k98   jn1=k102    jn36=k106   jn73=k110   jn108=k114  jn145=k118
jn181=k122  jn0=k126    jn37=k130   jn72=k134   jn109=k138  jn144=k142

The four operations above can be examined in Table 1. First, data bits are distributed into four streams (NSS = 4), so data bits k2, k6, k10, ... are allotted to stream 2 according to (1). Second, these bits are written into the buffer row by row and read out column by column. Third, intra-column operations are performed on the bits in columns 1, 3, and 5. Finally, an inter-column permutation is performed, which places the first output data bit at column 1, row 5. Note that, in the second operation, data bits in the buffer are read from an initial address that is not necessarily zero because of the inter-column permutation.

3. Interleaver Architecture
The interleaving operations introduced in Section 2 can be performed either serially or in parallel. However, as the required data rates increase, serial interleaving can no longer meet the requirements of high-data-rate communication applications. A parallel pipelined interleaver architecture provides the speed-up at the cost of a modest increase in hardware complexity. Figure 2 shows a block diagram of the proposed interleaver architecture, which supports a symbol size of 2592 bits in this case. The first stage collects the input data bits in six 4-bit groups and exports four 6-bit groups to stage 2. At stage 2, these 6-bit groups are written into memories row by row and read out column by column to ensure that adjacent bits are allotted to nonadjacent carriers. Stage 3 performs the shift operations for the intra-column permutation and part of the inter-column permutation. All stages can be configured to be compatible with specific wireless communication standards and to reduce power consumption.
[Figure 2: block diagram. A 24-bit data path passes through Interleaver Stage 1 (spr4bx6*12), Interleaver Stage 2 (spr6bx108*8), and Interleaver Stage 3 (Swapper).]
Figure 2. Interleaver architecture
Table 2. Another case of the interleaving data sequence
Column 0    Column 1    Column 2    Column 3    Column 4    Column 5
jn0=k0      jn110=k4    jn217=k8    jn324=k12   jn434=k16   jn541=k20
jn1=k24     jn108=k28   jn218=k32   jn325=k36   jn432=k40   jn542=k44
jn2=k48     jn109=k52   jn216=k56   jn326=k60   jn433=k64   jn540=k68
jn3=k72     jn113=k76   jn220=k80   jn327=k84   jn437=k88   jn544=k92
jn4=k96     jn111=k100  jn221=k104  jn328=k108  jn435=k112  jn545=k116
jn5=k120    jn112=k124  jn219=k128  jn329=k132  jn436=k136  jn543=k140
To illustrate the operation of the interleaver architecture in detail, each of the three stages is analyzed in the following subsections. A set of parameters specified in the IEEE 802.11 WWiSE proposal is used as an example: the number of transmitted spatial streams (NSS) is 4, the number of coded bits per subcarrier per antenna (NBPSC) is 6, and the number of coded bits per MIMO-OFDM symbol (NCBPS) is 2592. In this case, the interleaving relation for stream 0 is partially shown in Table 2. Table 3 presents part of the data sequence for all streams before interleaving.
3.1. Interleaver Stage 1
By accessing the data buffer X bits at a time, a speed-up of X times can be achieved compared with serial interleavers. In this research, four banks of single-port SRAM compose the data buffer mentioned in operation 2 of Section 2. The input and output ports of each of the four memory banks are 6 bits wide. Hence, X equals 24.

As presented in Table 2, the first 6 bits written to bank 0 of the data buffer are k0, k24, k48, k72, k96, and k120. Similarly, kn, k24+n, k48+n, k72+n, k96+n, and k120+n are the first 6 bits written to bank n, where n denotes the index of the spatial stream. In the second cycle, the 6 bits in column 1 for each stream are written into the corresponding bank of the data buffer. The writing operation is repeated until all data bits of the symbol are stored in the buffer.
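The write pattern described above can be sketched in Python as follows. This is an illustration for the example parameters only (four streams, 6-bit groups, six columns); the function name and its arguments are hypothetical and do not describe the actual control logic.

```python
def bank_write_indices(cycle, bank, n_ss=4, group_bits=6, n_cols=6):
    """Input bit indices k written to `bank` (one spatial stream) in a write cycle.

    Assumes the example configuration: each cycle writes one 6-bit group per
    bank, and n_cols consecutive cycles cover one block of 6 rows x 6 columns.
    """
    block, col = divmod(cycle, n_cols)
    row_stride = n_ss * n_cols                 # 24 input bits per buffer row
    block_stride = row_stride * group_bits     # 144 input bits per block
    return [block_stride * block + row_stride * row + n_ss * col + bank
            for row in range(group_bits)]


# bank_write_indices(0, 0) -> [0, 24, 48, 72, 96, 120]
# bank_write_indices(1, 0) -> [4, 28, 52, 76, 100, 124]
```

With NCBPS = 2592, 108 such cycles store every bit of the symbol exactly once across the four banks.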
Table 3. Combined interleaving data sequence
Column 0  Column 1  Column 2  Column 3  Column 4  Column 5
k0-3      k4-7      k8-11     k12-15    k16-19    k20-23
k24-27    k28-31    k32-35    k36-39    k40-43    k44-47
k48-51    k52-55    k56-59    k60-63    k64-67    k68-71
k72-75    k76-79    k80-83    k84-87    k88-91    k92-95
k96-99    k100-103  k104-107  k108-111  k112-115  k116-119
k120-123  k124-127  k128-131  k132-135  k136-139  k140-143
According to the above operations, a data collector, the first stage of the interleaver, is required to store data bits in the incoming sequence and provide the buffer in stage 2 with data bits in the desired sequence. For instance, the collector receives blocks of data bits in the sequence {{[k0-23], [k24-47], ..., [k120-143]}, {[k144-167], ..., [k264-287]}, ...} and exports data in the sequence {{[k0-3, k24-27, ..., k120-123], ..., [k20-23, k44-47, ..., k140-143]}, ...}. After storing the first block of data, {[k0-23], ..., [k120-143]}, in banks 0-5, as shown in Table 3, the interleaver starts to store the second data block, {[k144-167], ..., [k264-287]}, in banks 0'-5' and, at the same time, exports the stored first data block from banks 0-5 in the sequence {[k0-3, k24-27, ..., k120-123], ..., [k20-23, k44-47, ..., k140-143]}. Clearly, the latency of stage 1 equals the column depth, and the block size is determined by

$$\text{block size} = \text{column depth } (d) \times \text{buffer bit width } (X). \tag{5}$$

[Figure 3: block diagram; only the label "outputs" is recoverable.]
Figure 3. Architecture of interleaver stage 1

While exporting the previous (ith) block of data bits, the interleaver must be able to receive the current ((i+1)th) block concurrently at stage 1. The architecture with ping-pong SRAM sets shown in Figure 3 performs the desired functions. In the beginning, as shown in Table 4, this module stores the 24-bit input data, [k0-23], into the upper set of six single-port SRAM banks at address 0 and continues to store subsequent input data at increasing addresses. After the whole block has been stored, the other (lower) memory set takes over the storing task and the upper memory set reads out the data in the desired sequence. The upper and lower sets of memories form a ping-pong buffer. In this case, d and X in (5) are 6 and 24, respectively, so the block size is 144 bits. Note that, to make use of the whole bandwidth of the memory I/O data bus, the sequence of data fragments, e.g. {[k24-27], [k28-31], ..., [k44-47]}, written into or read from each bank is shifted by additional control signals. Although the addresses used for writing are the same in all 6 banks, the reading addresses differ from bank to bank. Table 4 shows the manipulation for this case.

Table 4. Data manipulation at stage 1
Bank 0    Bank 1    Bank 2    Bank 3    Bank 4    Bank 5
k0-3      k4-7      k8-11     k12-15    k16-19    k20-23
k44-47    k24-27    k28-31    k32-35    k36-39    k40-43
k64-67    k68-71    k48-51    k52-55    k56-59    k60-63
k84-87    k88-91    k92-95    k72-75    k76-79    k80-83
k104-107  k108-111  k112-115  k116-119  k96-99    k100-103
k124-127  k128-131  k132-135  k136-139  k140-143  k120-123
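The shifted storage can be illustrated with a short Python model of one memory set. It is a sketch under the assumption that fragment f of the 24-bit word written at address a is stored in bank (f + a) mod 6, which is consistent with the entries of Table 4; the names below are illustrative only, and the ping-pong switching between the two memory sets is not modeled.

```python
NUM_BANKS = 6   # banks per ping-pong set (column depth d = 6)
FRAG_BITS = 4   # bits per fragment, so X = 6 * 4 = 24 bits per word


def write_block(first_bit=0):
    """Return banks[b][addr] = (lo, hi): the bit range stored in bank b at addr."""
    banks = [[None] * NUM_BANKS for _ in range(NUM_BANKS)]
    for addr in range(NUM_BANKS):          # one 24-bit word per write cycle
        for f in range(NUM_BANKS):         # six 4-bit fragments per word
            lo = first_bit + (addr * NUM_BANKS + f) * FRAG_BITS
            # Assumed rotation: each word is shifted by its address.
            banks[(f + addr) % NUM_BANKS][addr] = (lo, lo + FRAG_BITS - 1)
    return banks


def read_column(banks, col):
    """Fetch fragment `col` of every word; each bank is read at its own address."""
    return [((col + addr) % NUM_BANKS, banks[(col + addr) % NUM_BANKS][addr])
            for addr in range(NUM_BANKS)]


banks = write_block()
print(read_column(banks, 0))   # (bank, (lo, hi)) pairs for k0-3, k24-27, ..., k120-123
print(NUM_BANKS * NUM_BANKS * FRAG_BITS)   # block size from (5): 6 x 24 = 144
```

Because the six fragments of any output group sit in six different banks, one group can be read in a single cycle, which is why the read addresses differ per bank while the write addresses do not.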
[Figure 4: block diagram. Under common control signals, the 24-bit inputs are split into four 6-bit groups; each group feeds a pair of 6b x 108 SRAMs whose 6-bit outputs go to streams 0, 1, 2, and 3.]
Figure 4. Architecture of interleaver stage 2
3.2. Interleaver Stage 2
The data buffer mentioned in operation 2 of Section 2 is maintained in stage 2. The hardware architecture is shown in Figure 4, where the 24-bit input data is divided into four 6-bit groups that are written into four banks, respectively, for MIMO systems with four streams. For example, the data group {kn, kn+24, kn+48, kn+72, kn+96, kn+120} in Table 3 is stored in the first entry, A0, of bank n, where n denotes the index of the spatial stream and is 0, 1, 2, or 3. Each bank stores input data groups row-wise and retrieves them column-wise. Table 5 presents the addressing of bank 0, where entry Ai, at address i, stores the 6-bit data group at column i in Table 2. The same addressing method is applied to all banks.
Table 5. Addressing method for memory banks at stage 2
A0    A1    A2    A3    A4    A5
A6    A7    A8    A9    A10   A11
...
A30   A31   A32   A33   A34   A35
With this technique, data groups can be stored row-wise, from A0 to A35, and read column-wise, starting from an address specified by the interleaving equations in the standard, for example, in the order {A6, A12, ..., A30, A1, ..., A0}. As in stage 1, a ping-pong memory structure is applied to stage 2 for pipelining. Because the buffer in stage 2 must be filled with a whole MIMO-OFDM symbol, the latency of the second stage is NCBPS/24 cycles.
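The row-wise write and column-wise read with a nonzero starting point can be sketched as an address generator. The function below is a hypothetical illustration: `start` stands for the position in the column-wise sequence at which reading begins, and how that value follows from (4) is not modeled here.

```python
def stage2_read_order(num_rows, num_cols, start=0):
    """Read addresses for a bank written row-wise at addresses 0..rows*cols-1.

    Addresses are visited column by column, beginning `start` positions into
    the column-wise sequence and wrapping around at the end.
    """
    column_major = [r * num_cols + c
                    for c in range(num_cols)
                    for r in range(num_rows)]
    return column_major[start:] + column_major[:start]


order = stage2_read_order(6, 6, start=1)
print(order[:6], order[-1])   # [6, 12, 18, 24, 30, 1] 0
```

For the 36-entry example this gives the order A6, A12, ..., A30, A1, ..., A0 quoted in the text, and consecutive addresses within a column differ by the number of columns.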
3.3. Interleaver Stage 3
Stage 3 is designed to assist the intra-column and inter-column permutations. The intra-column permutation, identified as swap in Figure 5, is a circular shift among 3 or 2 bits. According to (3), three kinds of intra-column permutations exist in WWiSE, as illustrated in Figure 6(a)-(c). The intra-column permutations can be implemented with multiplexers.
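A minimal Python sketch of the swap follows. It assumes that NCBPS/NSS is a multiple of s and that each group starts at a multiple of s, in which case (3) moves the bit at offset o within its run of s bits to offset (o - c) mod s, where c is the column index. The function name is illustrative, and which shift amount corresponds to which panel of Figure 6 is not assumed.

```python
def intra_column_permute(bits, s, col):
    """Circularly shift every run of s bits for a group read from column `col`.

    The bit at offset o inside its run moves to offset (o - col) mod s, which
    is what (3) reduces to under the assumptions stated above.
    """
    out = [None] * len(bits)
    for pos, bit in enumerate(bits):
        offset = pos % s
        base = pos - offset
        out[base + (offset - col) % s] = bit
    return out


group = ["b0", "b1", "b2", "b3", "b4", "b5"]
print(intra_column_permute(group, 2, 1))   # ['b1', 'b0', 'b3', 'b2', 'b5', 'b4']
print(intra_column_permute(group, 3, 1))   # ['b1', 'b2', 'b0', 'b4', 'b5', 'b3']
print(intra_column_permute(group, 3, 2))   # ['b2', 'b0', 'b1', 'b5', 'b3', 'b4']
```

When the column index is a multiple of s the group passes through unchanged, which corresponds to the bypass path.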
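[Figure 5: block diagram with inputs, a swap-or-bypass block, a register (R), and a shift block; bus widths of 6 and 12 bits are labeled.]
Figure 5. Architecture of interleaver stage 3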
[Figure 6: three bit-permutation diagrams, panels (a), (b), and (c).]
Figure 6. Three swap modes
In addition to the intra-column permutation, the inter-column permutation, described by (4) and identified as shift in Figure 5, is used in WWiSE as well. Consequently, the first output data bit may not be located at a boundary between 6-bit groups, as seen in Table 1. By means of registers, the output data of two consecutive cycles are combined, and the desired data bits are extracted at the expense of a 1-cycle latency in stage 3. Three kinds of extraction schemes exist for inter-column permutations, illustrated by the three ranges covered by (a), (b), and (c) in Figure 7.
[Figure 7: diagram marking three extraction ranges, (a), (b), and (c).]
Figure 7. Three shift modes
Figure 5 shows the third stage of the interleaver for a single spatial stream. For MIMO systems with N spatial streams, stage 3 comprises N copies of the logic in Figure 5.
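The register-based extraction used for the shift can be sketched as follows. This is a behavioral illustration only: the generator name and the `offset` parameter are hypothetical, words are modeled as Python lists of bits, and the wrap-around at the end of a symbol is not handled.

```python
def realign(words, offset, width=6):
    """Yield width-bit words that start `offset` bits into the input stream.

    A register holds the previous word; each cycle it is joined with the
    current word into a 2*width-bit window from which the output is taken,
    so the first output appears one cycle late.
    """
    prev = None
    for cur in words:
        if prev is not None:
            window = prev + cur
            yield window[offset:offset + width]
        prev = cur   # register update


stream = [list(range(0, 6)), list(range(6, 12)), list(range(12, 18))]
print(list(realign(stream, 2)))   # [[2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13]]
```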
4. Configurations
The same architecture can be applied to deinterleaving by reversing the data flow. Furthermore, the bit width of the data bus can be extended for convolutional codecs with a soft-decision mechanism. The architecture shown in Figure 2 evolves into the structure in Figure 8 to integrate interleaving and deinterleaving with six soft-decision bits in a single interleaver. To perform interleaving, data sequences pass through stages 1, 2, and 3 in that order. Conversely, the order for deinterleaving is stage 3, 2, and 1. Because two operations are performed in stage 3, the execution order of the two steps, swap and shift, must also be reversed for deinterleaving. Hence, there are two possible data flows in stage 3, and the architecture becomes that of Figure 9.
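Why the order of swap and shift must be reversed can be seen in a toy Python model. It treats swap as a rotation within runs of s bits and shift as a rotation of the whole sequence; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def rotate_groups(bits, s, r):
    """Rotate each run of s bits left by r positions (toy model of swap)."""
    return [bits[g + (o + r) % s]
            for g in range(0, len(bits), s)
            for o in range(s)]


def rotate_all(bits, r):
    """Rotate the whole sequence left by r positions (toy model of shift)."""
    r %= len(bits)
    return bits[r:] + bits[:r]


def stage3_forward(bits, s, r_swap, r_shift):
    # Interleaving: swap first, then shift.
    return rotate_all(rotate_groups(bits, s, r_swap), r_shift)


def stage3_inverse(bits, s, r_swap, r_shift):
    # Deinterleaving: undo the shift first, then undo the swap.
    return rotate_groups(rotate_all(bits, -r_shift), s, -r_swap)


x = list(range(12))
y = stage3_forward(x, 3, 1, 2)
print(stage3_inverse(y, 3, 1, 2) == x)                          # True
print(rotate_all(rotate_groups(y, 3, -1), -2) == x)             # False: wrong order
```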
The interleaver buffer in stage 2 can be modified for low-power operation. While performing interleaving, the memories in the right half of the banks in Figure 10 are disabled to reduce power consumption. Similarly, for systems with fewer than 4 spatial streams, part of the memories can be disabled. For example, for a SISO system, which has only one stream, only the first bank of memories is enabled and the other three banks are disabled, as shown in Figure 11.
[Figure 8: block diagram. A 144-bit data path passes through Interleaver Stage 1 (spr24bx6*12), Interleaver Stage 2 (spr36bx108*8), and Interleaver Stage 3 (Swapper).]
Figure 8. Modified Interleaver architecture
[Figure 9: block diagram of stage 3 with inputs and a swap block; the remaining labels are not recoverable.]
Figure 9. Modified architecture for stage 3
In addition to low power consumption, system conformance is taken into account in the proposed architecture. Many communication standards adopt interleaving algorithms similar to (1)-(4) with different parameters. IEEE Std. 802.11a conforms to (1)-(3) with NSS = 1 and IDEPTH = 16. In IEEE Std. 802.16a [11], NSS = 1 and IDEPTH = 16 or 18. In the IEEE 802.11 TGn Sync proposal [12], IDEPTH can be 16 or 18, too. The influence of the spatial stream count on the interleaver has already been addressed in the previous paragraph: only part of the data bus and memories are in use and enabled. Adjusting the address controllers and multiplexers for the memories in stage 1 and stage 2 enables the interleaver to accommodate different IDEPTH values. For interleaving/deinterleaving in stage 2, the difference between two consecutive reading/writing addresses is exactly the value of IDEPTH.
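[Figure 10: block diagram of the stage 2 memory banks with control signals, inputs, and outputs.]
Figure 10. Modified architecture for stage 2

[Figure 11: block diagram of the stage 2 memory banks with control signals, 144-bit inputs, and outputs.]
Figure 11. Power-saving mode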
5. Implementation results
We have described the proposed interleaver in Verilog HDL and synthesized the circuits with the TSMC 0.18 µm cell library. Based on the maximum interleaving block size of 2592 bits, the implementation results are summarized in Table 7. The proposed interleaver can operate above 200 MHz, and hence the throughput can exceed 4.8 Gbps. The memories in stage 1 afford 24-bit parallel access in stage 2. Hence, the overall throughput for interleaving is 24 bits/cycle and the deinterleaving throughput is 144 bits/cycle. Compared with serial interleavers, this design obtains a 24-fold speed-up at the expense of roughly 32% more hardware complexity. The overall LILO latency is accumulated from the three stages and is given by IDEPTH + (NCBPS/24) + 1.
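The throughput and latency figures can be checked with a few lines of Python. The helper names are illustrative; the numbers are the ones used in this paper (24 bits/cycle, 200 MHz, IDEPTH = 6, NCBPS = 2592).

```python
def throughput_bps(bits_per_cycle, clock_hz):
    """Throughput in bits per second for a fixed number of bits per clock cycle."""
    return bits_per_cycle * clock_hz


def latency_cycles(i_depth, n_cbps, bus_width=24):
    """Overall latency in cycles: IDEPTH (stage 1) + NCBPS/24 (stage 2) + 1 (stage 3)."""
    return i_depth + n_cbps // bus_width + 1


print(throughput_bps(24, 200_000_000))   # 4800000000, i.e. 4.8 Gbps
print(latency_cycles(6, 2592))           # 115 cycles
```

At 200 MHz, 115 cycles correspond to 575 ns.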
Table 7. Implementation results
clock rate                100 MHz       200 MHz
memory area for stage 1   178945 µm²    178945 µm²
memory area for stage 2   385642 µm²    385642 µm²
total area                729021 µm²    733245 µm²
6. Conclusion
A high-performance architecture for both the interleaver and the deinterleaver in emerging wireless communications is presented in this paper. The parallelization of the proposed structure is efficient in complexity and power consumption. The architecture is a three-stage structure in which the first stage collects and arranges small pieces of data in the desired order, the second stage permutes a whole data symbol using a simple memory structure, and the third stage shifts bits on the data bus. All parameters in this paper are taken only as examples to explain the architecture in detail.
Many possible configurations can be applied to this flexible architecture. Optimal configuration sets allow minimal complexity in the control circuits or the least power consumption. The control flow is not covered in depth in this paper and can be analyzed in further research.
References
[1] P. Elias, "Coding for Noisy Channels," IEEE Conv. Rec., vol. 3, pt. 4, pp. 37-46, 1955.
[2] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon Limit Error-Correcting Coding and Decoding: Turbo Codes," Proc. IEEE ICC'93, Geneve, Switzerland, pp. 1064-1070, May 1993.
[3] Robert H. Morelos-Zaragoza, "The Art of Error Correcting Coding," John Wiley & Sons, April 2002.
[4] Wireless LAN Medium Access Control and Physical Layer specifications (Standards style), IEEE Standard 802.11a, 1999.
[5] Xiao-Feng Wang, Yousef R. Shayan, and Mao Zeng, "On the code and interleaver design of broadband OFDM systems," Communications Letters, IEEE, vol. 8, pp. 653-655, Nov. 2004.
[6] H. R. Sadjadpour, N. J. A. Sloane, M. Salehi, and G. Nebe, "Interleaver design for turbo codes," IEEE J. Select. Areas Commun., vol. 19, pp. 831-837, May 2001.
[7] A. Shibutani, et al., "Performance of W-CDMA mobile radio with turbo codes using prime interleaver," IEEE Int. Conf. VTC, vol. 2, pp. 15-18, May 2000.
[8] A. Shibutani, et al., "Multistage recursive interleaver for turbo codes in DS-CDMA mobile radio," IEEE Trans. Vehicular Technology, vol. 51, pp. 88-100, Jan. 2002.
[9] Y. Uchida, et al., "VLSI Architecture of Digital Matched Filter and Prime Interleaver for W-CDMA," IEEE Int. Symp. Circuits and Systems, vol. 3, pp. III-269-272, May 2002.
[10] Manoneet Singh, et al., "WWiSE Proposal: High throughput extension to the 802.11 Standard," Available: http://www.wwise.org/technicalproposal.htm.
[11] Air Interface for Fixed Broadband Wireless Access Systems (Standards style), IEEE Standard 802.16a, 2003.
[12] Syed Aon Mujtaba, "TGn Sync Proposal Technical Specification," Available: http://www.tgnsync.org/techdocs/.
