A Study on High Performance Gbps MIMO Wireless System by Tran Thi Hong
Author: Tran Thi Hong
Year: 2014-12
Academic year of degree conferral: Heisei 26 (fiscal year 2014)
Degree number: 17104甲情工第294号
URL http://hdl.handle.net/10228/5332
Kyushu Institute of Technology
Department of Computer Science and Systems Engineering
Tran Thi Hong
December 2014
Contents
1 Introduction
1.1 Background
1.2 Research Objectives
1.3 Thesis Hierarchy
2 Wireless System Overview
2.1 Overview
2.2 MAC Layer
2.2.1 Channel Access Control
2.2.2 Data Encryption
2.3 PHY Layer
2.3.1 PHY Overview
2.3.2 MIMO Detection
2.3.3 Forward Error Correction (FEC)
2.4 Summary
3 RC4 Encryption Architectures
3.1 Introduction
3.2 RC4 Stream Cipher
3.2.1 Algorithm
3.2.2 Analysis of Hardware Implementation
3.3 RAM-based Four-bit RC4 Architecture
3.4 Register-based M-byte RC4 Architecture
3.4.1 Architecture and Operation
3.4.2 How to Design SWAP Block
3.5 Experimental Results and Comparisons
3.6 Conclusion
4 MIMO Detection Algorithm and Architecture
4.1 Introduction
4.2 Background
4.2.1 Notations
4.2.2 Channel Model and Full K-best Algorithm
4.3 The Proposed Algorithm
4.3.1 Direct Expansion
4.3.2 Parent Node Grouping
4.3.3 Two-Dimensional Sorter (2D Sorter)
4.4 The Proposed Hardware Architecture
4.4.1 Overview Architecture
4.4.2 Hardware Implementation
4.5 BER Performance Comparison
4.5.1 QRD versus SQRD
4.5.2 Parent Node Grouping
4.5.3 2D Sorter
4.5.4 The Proposed Detection
4.6 Complexity Comparison
4.7 Conclusion
5 LDPC Decoder Architecture
5.1 Introduction
5.2 Background
5.2.1 LDPC Code
5.2.2 Min-Sum Decoding Algorithm
5.3 The Proposed LDPC Decoder Architecture
5.3.1 Basic Ideas
5.3.2 The Proposed Architecture
5.3.3 LLR Quantization
5.4 Result Evaluation
5.4.1 PER Simulation
5.4.2 ASIC Synthesis
5.5 Conclusion
6 Conclusion and Future Work
A Verification of the Designs
B Snapshots of the Designs
Bibliography
List of Tables
3.1 Operation of the RAM-based Four-bit RC4
3.2 Operation of the Register-based M-byte RC4
3.3 SWAP block, rules for designing S(t)[im] and S(t)[jm]
3.4 SWAP block, rules for designing S(m)[km]
3.5 ASIC synthesis results of RC4 circuits
3.6 Tri-port RAM versus dual-port RAM
4.1 Total visited nodes of 4 × 4 MIMO K-best
4.2 ASIC synthesis results of 4 × 4 MIMO K-best
4.3 ASIC synthesis results of several MIMO detection schemes
5.1 ASIC synthesis results of LDPC Decoders
List of Figures
1.1 Global mobile data traffic forecast (2013 - 2018) [4]
1.2 Wireless LAN standards
1.3 Security and reliability of a wireless communication system
1.4 Our research field: MIMO wireless system
1.5 MIMO system model
1.6 802.11ac system’s and our research topics’ overview
2.1 The Open System Interconnection reference (OSI) model
2.2 The data flow through MAC and PHY layers
2.3 Using RC4 to encrypt and decrypt data at the transmitter and receiver
2.4 Processing flow of a) PHY transmitter and b) PHY receiver of a 4 × 4 MIMO system
2.5 Performance-Complexity relationship of the typical MIMO detection schemes
2.6 Brief history of some forward error correction (FEC) codes
3.1 RC4 algorithm
3.2 Block diagram of the RAM-based Four-bit RC4
3.3 The utilization of tri-port RAM
3.4 Block diagram of the Register-based M-byte RC4
4.1 Stages N and N − 1 of the full K-best algorithm
4.2 Direct expansion: a) compute DIn, DQn; and b) compute Dn
4.3 Probability (in %) that a child node may be selected as one of K best nodes
4.4 Stages N and N − 1 of the proposed 2D sorter-based K-best
4.5 An example of 2D sorter operation
4.6 Probability (in %) that an element of the sorted matrix becomes one of the actual K best nodes
4.7 a) Detection’s configuration; and b) the corresponding overview hardware architecture
4.8 Conventional multiplier versus GAIN-MUX-based multiplier
4.9 Block diagram of ‘STAGE 4’ and ‘STAGE 3’
4.10 The design of ‘2D-SORT’ block
4.11 The overview of ‘LLR’ block
4.12 BER of 802.11ac system, QRD versus SQRD
4.13 BER of 802.11ac system, parent node grouping
4.14 BER of 802.11ac system, 2D Sorter
4.15 BER of 802.11ac system, various MIMO detection schemes
4.16 PER of 802.11ac system, various MIMO detection schemes
5.1 Check matrix H and the corresponding Tanner graph
5.2 Check matrix of the structured LDPC code
5.3 Positions of variable nodes A^z_{rc} and check nodes B^z_{rc}
5.4 Overview hardware architecture of LDPC decoder
5.5 Circuits inside a) ‘SUM-VN(c)’ and b) ‘VN(c)’ for the zth sub-column
5.6 Timing diagram of the proposed LDPC decoder
5.7 LLR values and the quantization range
5.8 Floating point PER simulation results
5.9 Fixed point PER simulation results
A.1 RC4’s Verification: Step 1
A.2 RC4’s Verification: Step 2
A.3 RC4’s Verification: Step 3
A.4 Test the functional behavior of MIMO detection
A.5 Test the BER performance of the MIMO detection circuit
A.6 BER performance Results of the MIMO detection circuit
A.7 Verification flow of LDPC decoder
B.1 RAM-based RC4’s top view circuit
B.2 RAM-based RC4, inside of ‘RC4-Main-Process’ block
B.3 RAM-based RC4, inside of ‘S-box’ of ‘RC4-Main-Process’ block
B.4 RAM-based RC4, inside of ‘CALCULATE-J’ of ‘RC4-Main-Process’ block
B.5 Register-based RC4’s waveform 1 (M = 4 bytes)
B.6 Register-based RC4’s waveform 2 (M = 4 bytes)
B.7 2D sorter-based K-best MIMO detection’s top view circuit
B.8 2D sorter-based K-best MIMO detection, inside of ‘Stage 4’ block
B.9 2D sorter-based K-best MIMO detection, inside of ‘Stage 3’ block
B.10 LDPC decoder’s waveform
Summary
Driven by strong user demand, the throughput of wireless communication systems has improved significantly in recent years. For instance, the theoretical maximum throughput of wireless local area network (WLAN) systems increased from 54 Mbps (802.11a) to 600 Mbps (802.11n) and then to 6.93 Gbps (802.11ac) within the 9 years from 2003 to 2012. However, the development of high throughput systems such as 802.11ac faces the following problems. First, these systems need a high throughput Ron cipher (RC4) hardware circuit to encrypt and decrypt the transferred data for security purposes, whereas the conventional RC4 hardware architectures cannot achieve high throughput because of RC4’s swapping issue. Second, because its purpose is to transfer data from one point to another, any communication system must be reliable enough to guarantee that the data is transferred correctly. To achieve high reliability in terms of bit error rate (BER) or packet error rate (PER) performance, high throughput systems need multiple input multiple output (MIMO) detection and forward error correction (FEC) that each combine high performance with high throughput. Many types of MIMO detection exist, but none yet satisfies both the throughput and the performance requirements. Some types, such as zero forcing (ZF) and minimum mean square error (MMSE), achieve high throughput but low detection performance, while others, such as maximum likelihood detection (MLD) and K-best sphere decoding (K-best), achieve high detection performance but low throughput. Regarding FEC codes, although low density parity check (LDPC) is the best FEC code in terms of error correcting performance, its decoder hardware circuit is too complex and is significantly affected by the operation modes. Meanwhile, most wireless communication systems nowadays adopt multiple operation modes; e.g., 802.11ac adopts 12 modes and 802.16e adopts 76 modes. Consequently, developing a high throughput LDPC decoder at an acceptable hardware cost is still a problem.
To bring high throughput wireless communication systems into real applications, we solve the above-mentioned problems in this thesis. In particular, we propose the following algorithm and hardware architectures.
• We propose two high throughput RC4 hardware architectures: RAM-based and Register-based RC4. The RAM-based one improves the throughput by 33% compared to the conventional RAM-based works and is recommended for systems that support up to 1 Gbps. The Register-based one improves the throughput by 400% compared to the conventional Register-based works and is recommended for systems that operate at throughputs from 1 Gbps to 7.4 Gbps. The Register-based RC4 is thus able to support the maximum throughput of the 802.11ac system (6.93 Gbps).
• We propose a high performance, high throughput MIMO detection algorithm and hardware architecture. We evaluate the BER performance of the 802.11ac system when it employs our detection algorithm and conventional MIMO detection algorithms such as Bell Labs layered space-time MMSE (BLAST-MMSE), lattice reduction aided MMSE (LRA-MMSE), and K-best. Our algorithm outperforms BLAST-MMSE and LRA-MMSE in detection performance and comes very close to K-best, whereas its detection complexity is just 1.2% of that of K-best. It is also less complex than LRA-MMSE and even MMSE.
• We propose a mode-independent, high throughput LDPC decoder hardware architecture. The mode-independent LDPC decoder implements one circuit for all the modes. A new operation mode can be added to the developed decoder easily by inserting a ROM block that stores the parity check matrix corresponding to that mode. A decoder that supports the 12 modes of the 802.11ac system is designed and evaluated. It provides throughput of up to 7.7 Gbps at 500 MHz, or 15.4 bits/cycle, which is about 8 times better than the conventional works.
The content of this thesis may interest researchers in the fields of the WiFi 802.11 and WiMAX 802.16 families. In terms of real applications, the results of our research can be used to develop the wireless chips in devices such as laptops, iPhones, tablets, and wireless routers.
Chapter 1
Introduction
1.1 Background
Recently, user demand for wireless communication has increased dramatically. People use wireless communication to download web content on mobile devices and to connect a television (TV) wirelessly to video sources such as Blu-Ray DVD players, satellite set-top boxes, and internet video [1]-[3]. According to the Cisco Visual Networking Index forecast shown in Fig. 1.1, global mobile data traffic will increase about tenfold from 2013 to 2018, reaching about 16 exabytes per month in 2018 [4]. To meet such demand, the throughput of wireless communication must also be improved continually. For example, the IEEE 802.11 committee has issued standards that raised the throughput of wireless local area network (LAN) systems from 54 Mbps (802.11a) to 600 Mbps (802.11n) and then to 6.93 Gbps (802.11ac) within just 9 years, from 2003 to 2012; refer to Fig. 1.2 [5]-[7].
However, it is not easy to develop a wireless system that achieves throughput as high as the standards specify. Much research must be done in advance to solve problems such as 1) the degradation of reliability, 2) the limited throughput of signal processing hardware circuits, and 3) the complexity of the system.
[Figure: Cisco Visual Networking Index forecast of global mobile data traffic from 2013 to 2018, in exabytes (EB) per month.]
Figure 1.1: Global mobile data traffic forecast (2013 - 2018) [4]
[Figure: timeline of wireless LAN standards by year: 802.11 (1997, 2 Mbps, SISO-DSSS); 802.11b (1999, 11 Mbps, SISO-DSSS); 802.11a/g (2003, 54 Mbps, SISO-OFDM); 802.11n (2009, 600 Mbps, MIMO-OFDM); 802.11ac (2012, 6.9 Gbps, MU-MIMO-OFDM). SISO: single input single output; DSSS: direct sequence spread spectrum; OFDM: orthogonal frequency division multiplexing; MIMO: multiple input multiple output; MU-MIMO: multi-user MIMO.]
Figure 1.2: Wireless LAN standards
From another aspect, security and reliability are the most important factors of any wireless communication system and must be considered first. Suppose we have a system with one access point (AP) and three users; refer to Fig. 1.3. The AP wants to send the information "password is Kyutech" to user 2 only. The system needs security to make sure that only user 2 can obtain the transmitted information: unintended users such as users 1 and 3 can hear the transmitted data but cannot understand it. Sufficient reliability is needed to guarantee that user 2 receives the information correctly. If the system is not reliable enough, user 2 has a high probability of receiving incorrect information. Incorrect information is, of course, meaningless, so the system must retransmit the information until it is received correctly, which degrades the throughput.
[Figure: an AP sends "password is Kyutech" to user 2, while users 1 and 3 receive only unintelligible data. With a reliable link user 2 receives "password is Kyutech"; with an unreliable link user 2 receives "password is Kyu36ch".]
Figure 1.3: Security and reliability of a wireless communication system
[Figure: an AP connected to the Internet and to a distributed system (DS) that links BSS1, BSS2, and BSS3; stations STA1 to STA3 act as transmitter (TX) and receiver (RX). BSS: basic service set; AP: access point; STA: station.]
Figure 1.4: Our research field: MIMO wireless system
In this thesis, we propose algorithms and/or hardware architectures that address the security and reliability of MIMO wireless systems providing several gigabits per second (Gbps) of throughput. As illustrated in Fig. 1.4, our research focuses on wireless communication within a basic service set (BSS).
1.2 Research Objectives
The objectives of this thesis are to present:
• Two high throughput Ron cipher (RC4) hardware architectures
• A 2D Sorter-based K-best MIMO detection algorithm and its prototype hardware architecture
• A mode-independent low density parity check (LDPC) decoder hardware architecture
for MIMO wireless systems that achieve several Gbps of throughput. Among these, RC4 is the mandatory part of the wired equivalent privacy (WEP) and Wi-Fi protected access (WPA) security standards, which are implemented in most wireless LAN devices. Meanwhile, MIMO detection and forward error correction (FEC) codes play the most important role in determining the system reliability, and LDPC is the best FEC code in terms of decoding accuracy.
A simple MIMO wireless system and our research topics are illustrated in Fig. 1.5. RC4 is implemented in the medium access control (MAC) layer of both the transmitter and the receiver to provide the security of the system. MIMO detection is located at the physical (PHY) layer of the receiver to deal with inter-antenna interference (IAI) and detect the transmitted data.
[Figure: block diagram of a MIMO link. Blocks shown: RC4 in the MAC of the transmitter and of the receiver, LDPC encoder, serial-to-parallel conversion, PHY with transmitted streams s1 to s4, channel with noise and inter-antenna interference (IAI), MIMO detection in the receiver PHY with detected streams s1 to s4, parallel-to-serial conversion, and LDPC decoder.]
Figure 1.5: MIMO system model
[Figure: thesis structure. Chapter 1: Introduction; Chapter 2: Wireless System Overview; Chapter 3: RC4 Encryption Architectures (security); Chapter 4: MIMO Detection Algorithm and Architecture, and Chapter 5: LDPC Decoder Architecture (reliability); Chapter 6: Conclusion and Future Work.]
Figure 1.6: 802.11ac system’s and our research topics’ overview
Because MIMO detection does not always detect correctly, the LDPC decoder is used to correct the errors of MIMO detection. MIMO detection and the LDPC decoder work together to improve the reliability of the system.
In this thesis, we investigate how the proposed MIMO detection by itself improves the reliability of the system in terms of bit error rate (BER) and packet error rate (PER) performance. We also investigate how the LDPC decoder works together with the MIMO detection to further improve the PER performance. Regarding the hardware design, we evaluate the complexity and power consumption and compute the maximum throughput of the proposed RC4, MIMO detection, and LDPC hardware architectures.
1.3 Thesis Hierarchy
Fig. 1.6 shows the hierarchy of this thesis. The thesis has six chapters, the first of which is devoted to the introduction. The remaining chapters are organized as follows.
Chapter 2. Wireless System Overview
This chapter presents the background of the MAC and PHY layers of a wireless communication system. In particular, we explain the basics of system security in the MAC layer and of MIMO detection and FEC codes in the PHY layer.
Chapter 3. RC4 Encryption Architectures
This chapter presents the RC4 algorithm, briefly describes the conventional works on RC4 architectures, proposes our RAM-based and Register-based RC4 architectures, and finally compares our works with the conventional ones in terms of complexity, throughput, and power consumption.
Chapter 4. MIMO Detection Algorithm and Architecture
This chapter briefly introduces the conventional MIMO detection algorithms and their weaknesses, proposes our 2D Sorter-based K-best MIMO detection algorithm and its prototype hardware architecture for the 802.11ac system, and finally compares our works with the conventional works in terms of BER performance, complexity, throughput, and power consumption.
Chapter 5. LDPC Decoder Architecture
This chapter presents the fundamentals of LDPC codes, proposes our LDPC decoder architecture and an efficient quantization method, compares our works with the previous works in terms of complexity, throughput, and power consumption, and finally evaluates the PER performance of the system when the proposed MIMO detection and LDPC decoder are combined.
Chapter 6. Conclusion and Future Work
This chapter provides a short summary of the whole thesis and the obtained results. We also discuss the future development of our research topics.
Chapter 2
Wireless System Overview
2.1 Overview
In this chapter, we give an overview of wireless communication systems, taking the wireless local area network (WLAN) system as a typical example. Other systems, such as the wireless metropolitan area network (WMAN) and the wireless personal area network (WPAN), may differ slightly in some points because they target different coverage ranges. However, the basic functions and the data flow through the systems are the same.
Fig. 2.1 shows the open systems interconnection (OSI) model of a communication system [8]. A wireless device operates in the physical (PHY) layer and the medium access control (MAC) sublayer of the data-link layer of the OSI model. The MAC layer controls the operation of a wireless device in a basic service set (BSS) and encrypts and decrypts the transferred data for security purposes. The PHY layer is responsible for modulating and transmitting the data [5]. Each transmitted data packet goes through each layer and is encapsulated with that layer's header, which is needed for the layer's operation [9].
[Figure: the seven OSI layers (7. Application, 6. Presentation, 5. Session, 4. Transport, 3. Network, 2. Data Link, 1. Physical). The Data Link layer is split into LLC and MAC; a wireless device covers the MAC and Physical layers.]
Figure 2.1: The Open System Interconnection reference (OSI) model
Fig. 2.2 shows the data flow through the MAC and PHY layers. At the transmitter, the logical link control (LLC), i.e., the upper part of the data-link layer, sends a service data unit (SDU) to the MAC. The MAC encrypts the SDU to obtain the MAC SDU (MSDU), generates the MAC protocol data unit (MPDU) by adding the MAC header to the MSDU, and sends the MPDU to the PHY layer. The PHY treats the MPDU from the MAC as the PHY SDU (PSDU). It then generates the PHY protocol data unit (PPDU) by adding the PHY header and the frame check sequence (FCS) to the PSDU. In the PHY layer, the PPDU goes through PHY processes such as channel coding and modulation before being sent via the antennas into the wireless environment, which is affected by noise and interference [5], [9].
At the receiver side, the PHY layer detects the PPDU and uses the PHY header to select the appropriate PHY process types such as the modulation type, coding rate, and error correction type. It then uses the FCS to check whether the obtained PPDU is correct. If the PPDU is incorrect, the transmission has failed, and the receiver may request a retransmission. Otherwise, the PHY removes the PHY header and FCS and sends the PSDU to the MAC layer. The MAC layer uses the MAC header to select the appropriate MAC processes, decrypts the received MSDU, and sends the decrypted MSDU to the LLC. The LLC treats this data as its SDU [5], [9].
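The encapsulation and FCS check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header contents (`MAC_HEADER`, `PHY_HEADER`) are placeholders, CRC-32 from `zlib` stands in for the FCS, and the function names are hypothetical; none of these details come from the thesis or from the 802.11 frame formats.

```python
import zlib
from typing import Optional

MAC_HEADER = b"MACH"  # placeholder MAC header (hypothetical contents)
PHY_HEADER = b"PHYH"  # placeholder PHY header (hypothetical contents)

def mac_tx(msdu: bytes) -> bytes:
    """MAC transmitter: prepend the MAC header to the (already encrypted) MSDU."""
    return MAC_HEADER + msdu

def phy_tx(psdu: bytes) -> bytes:
    """PHY transmitter: add the PHY header and a CRC-32 stand-in for the FCS."""
    fcs = zlib.crc32(psdu).to_bytes(4, "big")
    return PHY_HEADER + psdu + fcs

def phy_rx(ppdu: bytes) -> Optional[bytes]:
    """PHY receiver: verify the FCS; return the PSDU, or None if it is corrupted."""
    psdu, fcs = ppdu[len(PHY_HEADER):-4], ppdu[-4:]
    if zlib.crc32(psdu).to_bytes(4, "big") != fcs:
        return None  # transmission failed; the receiver may request a retransmission
    return psdu

def mac_rx(mpdu: bytes) -> bytes:
    """MAC receiver: strip the MAC header to recover the MSDU."""
    return mpdu[len(MAC_HEADER):]

msdu = b"encrypted payload"
ppdu = phy_tx(mac_tx(msdu))
print(mac_rx(phy_rx(ppdu)) == msdu)   # intact frame is recovered

corrupted = bytearray(ppdu)
corrupted[6] ^= 0x01                  # flip one bit inside the PSDU
print(phy_rx(bytes(corrupted)))       # FCS mismatch is reported as None
```

Each layer only touches its own header, which is why the receiver can undo the steps in reverse order.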
Commonly, part of the MAC layer is run by software embedded in a microprocessor, part of the PHY layer and part of the MAC layer are processed by digital hardware circuits, and part of the PHY layer is handled by analog hardware circuits.
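[Figure: at the transmitter, the SDU from the LLC is encrypted into the MSDU in the MAC, the MAC header is added to form the MPDU, and the PHY treats the MPDU as the PSDU and adds the PHY header and FCS to form the PPDU. Noise and interference are added in the channel. At the receiver, the PHY performs data detection and error correction, and the MAC removes its header and decrypts the MSDU into the SDU for the LLC.]
Figure 2.2: The data flow through MAC and PHY layers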
2.2 MAC Layer
2.2.1 Channel Access Control
The first task of the MAC layer is channel access control; refer to [9]. That is, the MAC layer of a wireless device decides when and how to find, join, use, and leave a BSS. To do that, the following processes are necessary [9]:
• Beacon: The MAC of the access point (AP) periodically broadcasts Beacon frames, which carry regulatory information for managing the BSS such as the country code, the maximum allowable transmit power, and the channel number.
• Scanning: The MAC of a station (STA) finds the BSS through the scanning process. There are commonly two types of scanning: passive scanning and active scanning. The former simply looks for Beacon frames, while the latter is more complicated. With active scanning, the MAC of the STA sends a Probe Request to ask for information that is not available in the Beacon frames. The MAC of the AP then replies with a Probe Response to the requesting STA.
• Authentication: Once the STA has detected the existence of a BSS via scanning and wants to join it, its MAC must complete the Authentication process to obtain permission to join. In this process, the wired equivalent privacy (WEP) protocol is commonly used to demonstrate knowledge of a shared encryption key. The basic idea of the Authentication is as follows. The STA sends an Authentication request. The AP receives the request, uses a key to encrypt some information, and sends the encrypted data to the STA. The STA uses its own key to decrypt the received data and returns the result to the AP. The AP compares the received data with the information it originally encrypted. If the two match, the Authentication is successful. Otherwise, the STA does not have permission to join the BSS.
• Association: Before a STA is allowed to exchange data with other STAs via an AP, it must associate with the AP. The Association provides a mapping between the STA and the AP so that messages from the distributed system (DS) can reach the destination STA via the AP. The Association begins with a request from the STA, and the AP responds to the request.
• Transfer Data: After a successful Association, the STA is allowed to exchange data with the AP. However, because many STAs may be associated with a BSS, the MAC of the AP must control the transfer time slot of each STA, and the MAC of each STA responds to the control of the AP. This avoids the collisions that simultaneous channel access requests from multiple STAs may cause. To control the channel access time, the MAC may use the carrier sense multiple access with collision detection (CSMA/CD) or with collision avoidance (CSMA/CA) mechanism. The Data/ACK frame exchange may be used to determine whether a transfer has completed successfully. Mechanisms such as the request-to-send/clear-to-send (RTS/CTS) frame exchange and the network allocation vector (NAV) may be used to protect a station’s transmissions from the hidden node problem caused by neighboring overlapping BSSs.
• Reassociation: This process is needed when a STA moves from one BSS to another BSS within the same extended service set (ESS). The Reassociation begins with a request from the STA, and the AP responds to the request.
• Disassociation: This process is used to terminate an existing association. It may be performed by either the STA or the AP. The STA performs the Disassociation when it wants to leave the network, and the AP performs the Disassociation when it is unable to support that STA. Disassociation is merely a notification; it requires no response message.
[Figure: at the transmitter, RC4 generates the ciphering key from the key, and the plain text is combined (XOR) with the ciphering key to produce the cipher text. At the receiver, RC4 generates the same ciphering key from the same key, and the cipher text is combined (XOR) with it to recover the plain text.]
Figure 2.3: Using RC4 to encrypt and decrypt data at the transmitter and receiver
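The Authentication exchange described in the list above can be illustrated with a short Python sketch. It follows the steps exactly as the text states them (the AP encrypts some information, the STA decrypts it with its own key and returns it, the AP compares). The repeating-key XOR in `toy_cipher` is only a placeholder for the real cipher and is not secure, and the function names are hypothetical.

```python
import secrets

def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder cipher: repeating-key XOR (encrypts and decrypts; NOT secure)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def authenticate(ap_key: bytes, sta_key: bytes) -> bool:
    """Return True if the STA proves knowledge of the AP's shared key."""
    challenge = secrets.token_bytes(16)          # AP picks some random information
    encrypted = toy_cipher(challenge, ap_key)    # AP encrypts it and sends it to the STA
    response = toy_cipher(encrypted, sta_key)    # STA decrypts with its own key and returns it
    return secrets.compare_digest(response, challenge)  # AP compares with the original

print(authenticate(b"Kyutech", b"Kyutech"))    # matching keys: the STA may join
print(authenticate(b"Kyutech", b"wrong-key"))  # mismatched keys: permission denied
```

Because the challenge is random for every attempt, replaying an old response does not help a station that lacks the key.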
2.2.2 Data Encryption
The second main task of the MAC layer is to encrypt and decrypt the data for authentication and confidentiality purposes. Authentication makes sure that only permitted clients (or STAs) can join the network; refer to Authentication in Sect. 2.2.1. Confidentiality guarantees that the transferred data can be read only by the intended users, who own the correct key [10].
Nowadays, a wireless LAN device can select one of three security standards: wired equivalent privacy (WEP), Wi-Fi protected access (WPA), and WPA2 [10]-[13]. While WEP and WPA use the simple Ron cipher (RC4) encryption, WPA2 adopts the highly complex advanced encryption standard (AES) to protect the data. Because of their low complexity, WEP and WPA are preferred for common purpose systems that do not require a high level of security. In contrast, WPA2 is more complex and stronger than WEP and WPA. It is preferred for special systems in which security is the first priority.
To support high throughput systems in which security is necessary but is not the first priority, we study high throughput RC4, i.e., the mandatory part of WEP and WPA [13], [14]. Fig. 2.3 shows how RC4 is used to encrypt and then decrypt the data. At the transmitter side, an RC4 circuit uses the key information to generate the ciphering key. The plain-text data is encrypted by XORing it with the values of the ciphering key. At the receiver side, an identical RC4 circuit uses the same key information to generate the same ciphering key as the transmitter. The received data is decrypted by XORing it with the generated ciphering key [13].
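As a software illustration of the XOR-based encryption and decryption described above, the following is a minimal Python sketch of the standard RC4 algorithm (key scheduling followed by keystream generation). The function names are hypothetical, and the code is a byte-per-iteration reference model, not the hardware architectures proposed in this thesis.

```python
def rc4_keystream(key: bytes, n: int) -> bytes:
    """Generate n bytes of RC4 ciphering key (keystream) from the key."""
    # Key-scheduling algorithm: initialize and permute the 256-entry S-box.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation: one byte-wise S-box swap per output byte.
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)

def rc4_xor(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the keystream (same operation both ways)."""
    keystream = rc4_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))

cipher_text = rc4_xor(b"Key", b"Plaintext")   # transmitter side
plain_text = rc4_xor(b"Key", cipher_text)     # receiver side, same key
print(cipher_text.hex(), plain_text)
```

The swap inside the generation loop depends on the values just read from the S-box, which is the byte-wise swapping dependency that makes more than one keystream byte per cycle hard to achieve in hardware.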
Designing a high throughput RC4 circuit is very challenging because of the byte-wise swapping of the S-box. A number of studies have been carried out to improve the throughput of RC4 circuits [15]-[20]. However, none of them yet provides more than one byte of ciphering key per cycle. In this thesis, we propose two RC4 architectures that achieve several bytes of ciphering key per cycle in order to support multi-Gbps systems. We present this work in Chapter 3.
2.3 PHY Layer
2.3.1 PHY Overview
The PHY layer takes care of the physical transmission, including forward error correction (FEC), modulation, detection, etc. The main processes at the transmitter and at the receiver are shown in Fig. 2.4a and b, respectively. The details of these processes are described in [21].
To increase the system throughput, there are common ways such as:
1. Increasing the number of spatial streams (Nss)
2. Expanding the channel bandwidth
3. Adopting high order modulation
4. Improving the reliability of the system to eliminate the need for re-transmission
Figure 2.4: Processing flow of a) PHY transmitter and b) PHY receiver of a 4 × 4 MIMO
system
(Block diagram not reproduced. Transmitter blocks shown: Scrambler, FEC Encoder, Spatial
Stream Parser, Spatial Mapping, and, for each of the four streams, IL, Mapper, Pilot, Gain,
FCS, IFFT, GI Insert, Window, MUX, and Analog. Receiver blocks shown: Analog, GIR, and FFT
for each of the four antennas, together with AGC, Frame Sync, Carrier Frequency Offset,
Channel Estimation, Phase Tracking, MIMO Detection, De-Map and De-IL for each stream,
Spatial Stream De-Parser, FEC Decoder, and De-Scrambler. A Preamble Generator block is
also shown. Abbreviations: FEC: Forward Error Correction; IL: Interleave; FCS: Frequency
Cyclic Shift; GI: Guard Interval; GIR: GI Remove; AGC: Automatic Gain Control.)
For example, to increase the maximum throughput from 600 Mbps (802.11n) to 6.9 Gbps
(802.11ac), the maximum number of spatial streams is increased from 4 to 8, the maximum
channel bandwidth is expanded from 40 MHz to 160 MHz, and the maximum modulation
order is raised from 64-QAM to 256-QAM; refer to [7].
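The two peak rates quoted for 802.11n and 802.11ac can be reproduced with a short calculation. The sketch below is only illustrative: the numbers of data subcarriers (108 for 40 MHz, 468 for 160 MHz), the code rate 5/6, and the 3.6 µs OFDM symbol with short guard interval are assumptions taken from the respective standards, not from this thesis, and `peak_rate_mbps` is a hypothetical helper name.

```python
from fractions import Fraction

def peak_rate_mbps(data_subcarriers, bits_per_symbol, code_rate, n_ss, symbol_us):
    """Peak PHY rate: information bits carried by one OFDM symbol / symbol duration."""
    bits_per_ofdm_symbol = data_subcarriers * bits_per_symbol * code_rate * n_ss
    return bits_per_ofdm_symbol / symbol_us  # bits per microsecond equals Mbps

# Assumed parameters (from the standards, not from the thesis text).
SHORT_GI_SYMBOL_US = Fraction(36, 10)  # 3.2 us symbol + 0.4 us short guard interval
CODE_RATE = Fraction(5, 6)

# 802.11n: 40 MHz, 64-QAM (6 bits per subcarrier), 4 spatial streams
rate_11n = peak_rate_mbps(108, 6, CODE_RATE, 4, SHORT_GI_SYMBOL_US)
# 802.11ac: 160 MHz, 256-QAM (8 bits per subcarrier), 8 spatial streams
rate_11ac = peak_rate_mbps(468, 8, CODE_RATE, 8, SHORT_GI_SYMBOL_US)

print(float(rate_11n), float(rate_11ac))
```

Under these assumptions the first value is exactly 600 Mbps and the second is about 6.93 Gbps, which shows how the three factors (streams, bandwidth, modulation order) multiply together.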
Because the limited size of a wireless device restricts the number of antennas it can
carry, and because bandwidth is a limited resource, methods 1 and 2 cannot be used in
some cases. If method 3 is used, the reliability of the system is degraded due to the
high density of constellation points. Consequently, the overall throughput may be reduced
if the throughput lost to re-transmission becomes larger than the throughput gained by
using high order modulation. For that reason, improving the reliability of the system
when using high order modulation is required. Among the processes in Fig. 2.4, MIMO
detection and FEC decoding play the most important roles in improving the system
reliability.
2.3.2 MIMO Detection
A multiple-input multiple-output (MIMO) communication system is a system in which both
the transmitter and the receiver use multiple antennas for transferring the data [22]. The
use of MIMO can be classified into two types: spatial multiplexing and transmit diversity.
The former increases the throughput (data rate) by sending different data via different
antennas. The latter improves the transmission reliability by sending the same data via
multiple antennas [22, p. 11]. In this thesis, we focus on the former type.
Assume an $N \times M$ MIMO system with $N$ transmit and $M$ receive antennas.
The system has a transmit vector $x = [x_1, \ldots, x_N]$, an $M \times N$ channel matrix $H$,
a noise vector $n = [n_1, \ldots, n_M]$, and a received vector $y = [y_1, \ldots, y_M]$. The
channel model is given in (2.1).
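\[ y = Hx + n \tag{2.1} \]
\[ \hat{x} = H^{-1} y \tag{2.2} \]
\[ H^{-1} \approx H^{+}_{ZF} = (H^{H} H)^{-1} H^{H} \tag{2.3} \]
\[ H^{-1} \approx H^{+}_{MMSE} = (H^{H} H + I\sigma_n^{2})^{-1} H^{H} \tag{2.4} \]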
where $I$ is the identity matrix. Basically, MIMO detection computes an estimate $\hat{x}$ of
the transmitted vector $x$ as shown in (2.2). Because the inverse matrix $H^{-1}$ cannot always
be found, some MIMO detection algorithms such as zero forcing (ZF) [23] and minimum
mean square error (MMSE) [24] compute the pseudo-inverse matrices $H^{+}_{ZF}$ and
$H^{+}_{MMSE}$ as in (2.3) and (2.4), respectively. In (2.4), $\sigma_n^{2}$ is the noise variance.
The problem of these detection schemes is that they cannot obtain the diversity gain. Their
detection capability is thus considered to be low. To improve the detection capability, the
Bell Labs layered space-time MMSE (BLAST-MMSE) [25] and lattice-reduction-aided MMSE
(LRA-MMSE) [26] have been proposed. The basic idea of BLAST-MMSE is to first detect
the most powerful signal using the MMSE scheme, and then to treat the estimated signal as
known interference when detecting the next signal. Because the most powerful signal
is detected first, the probability of a correct decision is high. The performance of this
scheme is thus better than that of ZF and MMSE. The basic idea of LRA-MMSE is to transform
the channel matrix $H$ into a more nearly orthogonal form before applying MMSE detection.
Because the channel matrix seen by the MMSE detector is then closer to orthogonal, its
detection capability is improved compared to the case without orthogonalization.
The disadvantages of this scheme are: 1) the complexity of the LR calculation is high and
significantly affected by the modulation type; 2) it has been shown in [27] that if the number
of spatial streams becomes large enough, i.e., above 4 streams, the orthogonality defect
problem occurs. Consequently, the detection capability of LRA-MMSE is remarkably
reduced. LRA-MMSE is thus not a good choice for large MIMO systems and/or for
high order modulation types such as 256-QAM.
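The linear detectors in (2.3) and (2.4) can be illustrated with a very small example. The following is a minimal sketch under simplifying assumptions that are not in the thesis: a real-valued 2 × 2 channel, BPSK symbols (+1/-1), and hand-written 2 × 2 matrix helpers (`transpose`, `matmul`, `inv2`, `linear_detect` are hypothetical names). For real matrices the Hermitian transpose reduces to the ordinary transpose.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(a):
    """Inverse of a 2 x 2 matrix (assumes a non-zero determinant)."""
    (p, q), (r, s) = a
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def linear_detect(h, y, noise_var=0.0):
    """noise_var = 0 gives the ZF estimate (2.3); noise_var > 0 gives MMSE (2.4)."""
    ht = transpose(h)
    gram = matmul(ht, h)              # H^H H
    gram[0][0] += noise_var           # + I * sigma_n^2
    gram[1][1] += noise_var
    w = matmul(inv2(gram), ht)        # pseudo-inverse
    return [row[0] for row in matmul(w, [[v] for v in y])]

def decide(estimate):
    """BPSK slicer: map each soft estimate to the nearest symbol."""
    return [1.0 if v >= 0 else -1.0 for v in estimate]

h = [[1.0, 0.5], [0.2, 1.0]]          # illustrative channel, chosen arbitrarily
x = [1.0, -1.0]
y = [sum(hij * xj for hij, xj in zip(row, x)) for row in h]  # noise-free y = Hx

x_zf = linear_detect(h, y)
x_mmse = linear_detect(h, y, noise_var=0.1)
print(x_zf, x_mmse, decide(x_mmse))
```

In this noise-free case ZF returns the transmitted vector up to rounding, while MMSE returns a slightly shrunk estimate whose hard decisions still match the transmitted symbols.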
\[ \hat{x} = \arg\min_{x \in \Omega} \| y - Hx \|^{2} \tag{2.5} \]
In terms of detection performance, maximum likelihood detection (MLD) is well known
as the optimal scheme [28], [29]. It calculates the Euclidean distance from the received
vector to every candidate of $x$ and selects the candidate that yields the smallest
distance. Suppose the system uses $W$-QAM modulation, which has $W$ constellation points.
There are $W^{N}$ possible candidates of $x = [x_1, \ldots, x_N]$, so the MLD must evaluate
the distance in (2.5) $W^{N}$ times, where $\Omega$ is the set of the $W^{N}$ possible
candidates. For instance, if a $4 \times 4$ MIMO system uses 256-QAM modulation, i.e.,
$N = 4$ and $W = 256$, the MLD scheme must evaluate the distance
$256^{4} \approx 4.3 \times 10^{9}$ times using the $256^{4}$ candidates of
$x = [x_1, \ldots, x_4]$. It can be seen from (2.5) that the MLD complexity, $W^{N}$,
increases rapidly with the constellation size $W$ and exponentially with the number of
spatial streams $N$.
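The exhaustive search behind (2.5) can be sketched in a few lines. This is an illustrative brute-force search, assuming a 2 × 2 complex channel with QPSK symbols and a small, fixed noise vector; the channel values and the function name `mld` are made up for the example and are not taken from the thesis.

```python
from itertools import product

def mld(h, y, constellation):
    """Return the candidate x that minimizes ||y - Hx||^2 over all W^N candidates."""
    n_tx = len(h[0])
    best, best_dist = None, float("inf")
    for cand in product(constellation, repeat=n_tx):
        dist = sum(
            abs(y_m - sum(h_mn * x_n for h_mn, x_n in zip(row, cand))) ** 2
            for row, y_m in zip(h, y)
        )
        if dist < best_dist:
            best, best_dist = cand, dist
    return best

qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]            # W = 4
h = [[1.0 + 0.2j, 0.3 - 0.1j], [0.2 + 0.4j, 0.9 - 0.3j]]
x_true = (1 - 1j, -1 + 1j)
noise = [0.05 - 0.02j, -0.03 + 0.04j]                 # small, fixed perturbation
y = [sum(h_mn * x_n for h_mn, x_n in zip(row, x_true)) + w
     for row, w in zip(h, noise)]

detected = mld(h, y, qpsk)
candidates_small = len(qpsk) ** len(x_true)           # 4^2 candidates here
candidates_4x4_256qam = 256 ** 4                      # the case discussed in the text
print(detected, candidates_small, candidates_4x4_256qam)
```

The toy case visits only 16 candidates, whereas the 4 × 4, 256-QAM case would require 4,294,967,296 distance evaluations per received vector, which is why the suboptimal schemes below are of interest.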
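\begin{align*}
H &= QR \tag{2.6} \\
z &= Q^{H} y = Rx + Q^{H} n \tag{2.7} \\
\hat{x} &= \arg\min_{x \in \Omega} \| z - Rx \|^{2}
         = \arg\min_{x \in \Omega} \sum_{i=N}^{1} \Bigl| z_i - \sum_{j=i}^{N} r_{ij} x_j \Bigr|^{2} \tag{2.8} \\
\mathrm{PED}_n &= \sum_{i=N}^{n} \Bigl| z_i - \sum_{j=i}^{N} r_{ij} x_j \Bigr|^{2} \tag{2.9}
\end{align*}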
To reduce the MLD complexity, several suboptimal MLD algorithms, especially the K-best
sphere detection and improved K-best variants [30]-[38], have been proposed. The
conventional K-best first decomposes the channel matrix $H$ into a unitary
matrix $Q$ and an upper triangular matrix $R$ as in (2.6). It then computes (2.7) and (2.8),
where (2.8) is evaluated through $N$ stages from $N$ down to 1. Stage $N$ computes (2.9)
(with $n = N$) using the $W$ candidates of $x_N$. The results are sorted, and only the $K$ best
candidates are kept for the next stage. Stage $n$ ($N-1 \ge n \ge 1$) computes (2.9) $KW$
times from the $KW$ candidates of $[x_N, \ldots, x_n]$. The results are again sorted to keep
only the $K$ best candidates. The conventional K-best is described in detail in chapter 4.
By keeping only the $K$ best candidates per stage, the K-best detection algorithm partly
solves the high-complexity problem of MLD, at the cost of degraded detection performance.
However, the complexity of K-best detection is still strongly affected by the constellation
size $W$, which makes it difficult to adopt in systems using high order modulation.
The detection performance-complexity relationship of several MIMO detection schemes
is illustrated in Fig. 2.5. In this thesis, we propose a MIMO detection algorithm and its
prototype hardware architecture for 802.11ac systems. The complexity of our algorithm is
not affected by the constellation size $W$, and is much lower than that of the K-best. Our
research on MIMO detection is presented in chapter 4 of this thesis.
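The stage-by-stage pruning of the conventional K-best can be sketched as follows. This is a simplified illustration, not the detector of chapter 4: it assumes the QR decomposition (2.6) has already been done, works on a real-valued upper triangular $R$ with 4-PAM symbols, and uses a noise-free $z = Rx$; the function name `k_best` and all numeric values are hypothetical.

```python
def k_best(r, z, constellation, k):
    """Breadth-first tree search that keeps the k smallest PEDs (2.9) per stage."""
    n = len(z)
    survivors = [(0.0, ())]  # (PED, partial candidate ordered as x_i, ..., x_N)
    for i in range(n - 1, -1, -1):          # stage N down to stage 1 (0-based rows)
        expanded = []
        for ped, tail in survivors:
            for s in constellation:          # extend each survivor by W symbols
                cand = (s,) + tail
                reconstructed = sum(r[i][i + t] * cand[t] for t in range(len(cand)))
                expanded.append((ped + abs(z[i] - reconstructed) ** 2, cand))
        expanded.sort(key=lambda entry: entry[0])
        survivors = expanded[:k]             # keep only the K best candidates
    best_ped, best_x = survivors[0]
    return best_x, best_ped

pam4 = [-3.0, -1.0, 1.0, 3.0]                # W = 4
r = [[2.0, 0.5, -0.3],
     [0.0, 1.5, 0.4],
     [0.0, 0.0, 1.0]]
x_true = (1.0, -3.0, 3.0)
z = [sum(r[i][j] * x_true[j] for j in range(3)) for i in range(3)]  # noise-free

x_hat, ped = k_best(r, z, pam4, k=2)
visited = len(pam4) + (len(z) - 1) * 2 * len(pam4)   # W + (N-1)*K*W candidates
print(x_hat, ped, visited, len(pam4) ** len(z))
```

With $K = 2$ the search evaluates 20 partial candidates instead of the 64 full candidates of MLD. The per-stage cost still scales with $W$, which is the limitation pointed out in the text.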
Figure 2.5: Performance-Complexity relationship of the typical MIMO detection schemes
(Plot not reproduced. Axes: Complexity and Performance. Schemes shown: ZF, MMSE,
BLAST MMSE, LRA MMSE, K-best, MLD, and the proposed scheme; the plot also carries the
label "Optimal". ZF: Zero Forcing; MMSE: Minimum Mean Square Error; BLAST MMSE: Bell
Labs Layered Space Time MMSE; LRA MMSE: Lattice Reduction Aided MMSE; MLD: Maximum
Likelihood Detection.)
Figure 2.6: Brief history of some forward error correction (FEC) codes
(Timeline not reproduced. Year axis: 1960 to 2000. Entries shown: Hamming code, Reed
Solomon code, BCH code, Convolutional code (CC), Turbo code, LDPC introduced (1963),
renewed interest in LDPC (1996), practical implementation of codes, LDPC beats Turbo
and CC.)
2.3.3 Forward Error Correction (FEC)
Because MIMO detection does not always detect the data correctly, the system commonly
employs FEC to correct the MIMO detection errors. As a result, the system
reliability is improved. Once the system adopts high order modulation to increase the
communication data rate, the MIMO detection errors increase. A high performance FEC code
is thus required to maintain the system reliability.
There are a number of FEC codes such as the Hamming code [39], Reed-Solomon code
[40], binary convolutional code (BCC) [41]-[42], Turbo code [42], and low density parity
check (LDPC) code [43]; see Fig. 2.6. Among these codes, LDPC is a linear block
code. It was introduced by R. Gallager in 1963 [43] and was forgotten for decades until it
was rediscovered by [44]. Its performance has been shown to exceed that of the other FEC
codes and to be close to the Shannon limit [44], [45].
\[ c = uG \tag{2.10} \]
\[ H c^{T} = 0 \tag{2.11} \]
Suppose the system wants to send information $u$ of $k$ bits. At the transmitter
side, the LDPC encoder computes the $n$ bits of the codeword $c$ ($n > k$) using (2.10),
where $G$ is a $k \times n$ generator matrix. The ratio $r = k/n$ is called the code rate. The
smaller the code rate $r$, the higher the system reliability and the lower the system
throughput. Every codeword $c$ satisfies (2.11), where $H$ is the $m \times n$ check
matrix ($m = n - k$), not to be confused with the channel matrix of Section 2.3.2. Designing
an LDPC code means providing a pair of matrices $G$ and $H$. At
the receiver side, the LDPC decoder checks whether the received data of $c$ is a codeword
using the parity check constraint in (2.11). If (2.11) is satisfied, the received data
is concluded to be correct. Otherwise, a number of iterations are performed to update the
values of the received data until (2.11) is satisfied. How the data is updated in each
iteration depends on the LDPC decoding algorithm. Typical decoding algorithms include
sum-product [46], min-sum [47], and min-max [48]. Regardless of the decoding algorithm,
the computation in each iteration is tailored to a specific check
matrix $H$. Meanwhile, practical wireless communication systems commonly support several
operation modes, e.g., 802.11ac supports 12 modes and 802.16e supports 76 modes,
and each mode has its own check matrix. Consequently, the LDPC decoder must deal
with several check matrices. To provide sufficiently high throughput, the hardware
implementation of such an LDPC decoder tends to become too complicated for practical
applications. In this thesis, we propose a multi-mode LDPC decoder architecture for high
throughput MIMO systems. Our architecture can be configured easily to operate in any
mode without increasing the complexity. This work is presented in chapter 5.
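Equations (2.10) and (2.11) can be checked on a toy code. The sketch below deliberately uses the small systematic (7,4) Hamming code as a stand-in, because a real LDPC check matrix is large and sparse; it only illustrates encoding, the parity check, and the code rate, not iterative LDPC decoding. The matrices and the helper names `encode` and `syndrome` are chosen for this example.

```python
# Systematic generator matrix G = [I | P] (k = 4, n = 7), arithmetic over GF(2).
G = [[1, 0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 1, 0, 1],
     [0, 0, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
# Matching check matrix H = [P^T | I] (m = n - k = 3).
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

def encode(u, g):
    """c = uG over GF(2), as in (2.10)."""
    return [sum(ui * gij for ui, gij in zip(u, col)) % 2 for col in zip(*g)]

def syndrome(h, c):
    """H c^T over GF(2); all zeros means the parity check (2.11) is satisfied."""
    return [sum(hij * cj for hij, cj in zip(row, c)) % 2 for row in h]

u = [1, 0, 1, 1]
c = encode(u, G)
code_rate = len(u) / len(c)           # r = k / n

received = list(c)
received[2] ^= 1                      # flip one bit to emulate a detection error

print(c, syndrome(H, c), syndrome(H, received), code_rate)
```

The clean codeword gives an all-zero syndrome, while the corrupted word does not, which is the condition that triggers the decoder iterations described above.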
2.4 Summary
In this chapter, we have given an overview of the MAC and PHY layers of a wireless
communication system. Our explanation is mostly based on WLAN systems; others such as
WMAN and WPAN may differ slightly in some points. The MAC
layer takes care of the channel access control and the security of the system. Channel access
control means that the MAC controls when and how to find, join, use, and leave the BSS.
Several processes have been standardized for channel access control, such as Beacon
Broadcasting, Scanning, Authentication, Association, and Disassociation. In terms of security,
the MAC of a WLAN device encrypts and decrypts the data following the WEP, WPA, or
WPA2 security standard. WEP and WPA have low complexity and are preferred
for general-purpose systems that do not require a high level of security, while WPA2 has
high complexity and is preferred for special systems in which security is the first priority.
The mandatory core of WEP and WPA is RC4, while the mandatory core of WPA2 is AES.
To support high throughput wireless systems in which security is necessary but not the first
priority, research on high throughput RC4 architectures is required. On the other side, the
PHY layer processes the data so that it can be safely transferred over a noisy channel.
Among the necessary processes such as scrambling, modulation, interleaving, and forward
error correction (FEC), MIMO detection and FEC are the most important ones. The
reliability of the system is mainly determined by the performance of these two processes. The
performance-complexity relationship of some typical MIMO detection schemes and a brief
history of some FEC codes have been shown in this chapter. LDPC has been shown to achieve
the best performance, but the hardware implementation of an LDPC decoder is complicated
and strongly affected by the operation modes, while a wireless communication system
supports not one but many operation modes, e.g., 802.11ac supports 12 modes and 802.16e
supports 76 modes.
This chapter has also outlined our proposals: RC4 architectures for securing Gbps
wireless systems, a new MIMO detection algorithm and its hardware architecture, and a
multi-mode LDPC decoder for improving the reliability of high throughput wireless
systems. Our work on RC4, MIMO detection, and LDPC is presented in chapters 3, 4, and
5, respectively.
Chapter 3
RC4 Encryption Architectures
3.1 Introduction
The Rivest Cipher 4 (RC4), or Alleged RC4 (ARC4), algorithm was proposed by Ron Rivest
of RSA Security in 1987. It was kept as a trade secret until 1994. As mentioned in chapter
2, RC4 is a mandatory part of the Wired Equivalent Privacy (WEP) and Wi-Fi Protected
Access (WPA) security schemes implemented in the MAC layer of a wireless communication
system. It has become one of the best known and perhaps the most studied stream ciphers.
Recently, due to the significant improvement of wireless system throughput, a high
throughput RC4 design is required to secure such systems. However, achieving high throughput
in RC4 is very challenging because of the byte-wise swapping of the S-box elements. Byte
swapping means that the S-box elements must be read, processed, and written back to a
different location for every generated byte of the ciphering key [14], [15].
In practice, the main component of RC4, the S-box, is stored in either RAM or a set
of registers. We call the resulting designs RAM-based and Register-based RC4 architectures,
respectively. While a RAM-based RC4 takes several clock cycles to generate a single byte of
ciphering key, a Register-based one is able to produce one byte or even more per cycle. The
disadvantage of the Register-based RC4 is its complexity in terms of both design and hardware
cost. Based on the RAM-based architecture, P. D. Kundarewich et al. [16] proposed
a CPLD-based hardware implementation of RC4 using assembly language. This
architecture took 4 clock cycles to generate a byte of ciphering key. Tsoi et
al. [17] presented a massively parallel implementation of RC4 on FPGA that made use of two
dual-port RAMs. Its processing time is three clock cycles per byte. The next architecture
was proposed by Kitsos et al. [18] in 2003. Three single-port RAMs were used
and three clock cycles per byte were required. Recently, Chattopadhyay et al. [20]
proposed an RC4 architecture that combines two dual-port RAMs with pipelining
to reduce the processing time to two cycles per byte. To the best of our knowledge,
no further efficient improvement of the RAM-based RC4 architecture has been published yet.
Typical examples of the Register-based RC4 are a patent by Matthews Jr. [19] in 2008 and
the second design of [20] in 2012. The difference between the two designs is that
[19] used hardware pipelining, whereas [20] made use of so-called loop unrolling.
The processing time of both designs is one clock cycle per byte of ciphering key.
In this chapter, we propose two novel high throughput RC4 architectures. The first one
is a RAM-based RC4 that uses only a single tri-port RAM to store the S-box.
Its processing time is 2 clock cycles per byte. The contribution of this architecture is
lower power consumption than the previous work. The second architecture is a Register-based
multi-byte RC4. It is expected to be the first architecture that outputs M bytes
of ciphering key per cycle, where M is any integer larger than one. Based on this
architecture, a 4-byte RC4 circuit is developed (M = 4).
The rest of this chapter is organized as follows. Section 3.2 presents the RC4 algorithm
and an analysis of its hardware implementation. Section 3.3 explains the RAM-based Four-bit
RC4 architecture. Section 3.4 focuses on the Register-based multi-byte RC4 architecture and
also describes in detail how to design the SWAP block. Section 3.5 presents our
experimental results and comparison. The final section, Section 3.6, is the conclusion.
3.2 RC4 Stream Cipher
3.2.1 Algorithm
The RC4 algorithm defines a method to generate a pseudo-random stream of ciphering key
from a provided master key; refer to [14] and [15]. The algorithm is shown in Fig. 3.1. It
operates in two stages: the Key Scheduling Algorithm (KSA) and the Pseudo Random Generator
Algorithm (PRGA).
Figure 3.1: RC4 algorithm (flowcharts, given here as pseudocode)
  KSA, Initial substage:
    for i = 0 to 255
      S[i] = i
    endfor
  KSA, KeySetup substage:
    j = 0
    for i = 0 to 255
      j = j + S[i] + Key[i]
      Swap: S[i] <-> S[j]
    endfor
  PRGA:
    i = j = n = 0
    repeat N times
      i = i + 1
      j = j + S[i]
      Swap: S[i] <-> S[j]
      k = S[i] + S[j]
      Ckey[n] = S[k]
      n = n + 1
    endrepeat
The KSA stage includes two substages: Initial and KeySetup.
In the Initial substage, the 256 bytes of the S-box are initialized with values equal to their
index. In the KeySetup substage, the initialized S-box is permuted based on the
provided master key. It performs 256 iterations as shown in the KSA part of Fig. 3.1,
where i and j are two index pointers of the S-box. S[i] and S[j] respectively represent the
current values of the S-box at index pointers i and j. Key[i] is the i-th byte of the key stream,
which is generated by cyclically extending the master key to a length of 256 bytes. All the
sum operations shown in Fig. 3.1 are additions modulo 256.
The PRGA stage generates the ciphering key to encrypt or decrypt the data. Its operation
is similar to the KeySetup substage, except for the following points. Firstly, the index
pointer j is calculated without information from the master key. Secondly, in addition to the
two index pointers i and j used for permuting the S-box, an index pointer k is added for
generating the ciphering key. In Fig. 3.1, N is the number of required ciphering key
bytes.
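The two stages described above translate directly into software. The following is a plain reference sketch of the standard RC4 algorithm (it models the algorithm only, not the hardware architectures proposed in this chapter); the function names are chosen for this example, and the key/plaintext pair is a widely published RC4 test vector.

```python
def rc4_ksa(key: bytes) -> list:
    """KSA: Initial substage followed by the KeySetup substage."""
    s = list(range(256))                         # Initial: S[i] = i
    j = 0
    for i in range(256):                         # KeySetup: 256 iterations
        j = (j + s[i] + key[i % len(key)]) % 256  # key is extended cyclically
        s[i], s[j] = s[j], s[i]
    return s

def rc4_prga(s: list, n: int) -> bytes:
    """PRGA: generate n bytes of ciphering key from a scheduled S-box."""
    s = list(s)                                  # work on a copy
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256                     # no master key involved here
        s[i], s[j] = s[j], s[i]
        k = (s[i] + s[j]) % 256
        out.append(s[k])
    return bytes(out)

def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the ciphering key stream."""
    stream = rc4_prga(rc4_ksa(key), len(data))
    return bytes(d ^ c for d, c in zip(data, stream))

ciphertext = rc4_crypt(b"Key", b"Plaintext")
recovered = rc4_crypt(b"Key", ciphertext)
print(ciphertext.hex(), recovered)
```

Because encryption and decryption are the same XOR operation, applying `rc4_crypt` twice with the same key returns the original data.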
3.2.2 Analysis of Hardware Implementation
RAM-based RC4 implementation: The S-box elements are stored in RAM, and both read
and write commands need one clock cycle. Thus, the following steps are required to
generate a byte of ciphering key, one step per cycle.
• Step 1: Issue the read command for S[i].
• Step 2: Obtain the S[i] value, calculate the j pointer, then issue the read command for S[j].
• Step 3: Obtain the S[j] value, swap S[i] ↔ S[j] by writing S[i] to the j pointer and
S[j] to the i pointer, and calculate the k pointer.
• Step 4: Issue the read command for S[k]; the value read is the expected ciphering key
byte. In addition, issue the read command for S[i] of the next iteration.
Obviously, if the S-box RAM has at least two write ports (required for Step 3) and two read
ports (required for Step 4), the RC4 circuit needs three clock cycles per byte, because
Step 4 and Step 1 of the next iteration can be performed in the same cycle. If a single-port
RAM is used, Step 3 takes two clock cycles, and Step 4 must be separated from Step 1. The
RC4 core then needs five clock cycles to complete the four steps above.
We observe that it is possible to combine Step 3 of the current iteration and Step 1 of
the next iteration (or Step 4) into one clock cycle. Thus, the four steps above
can be shortened to two pipeline steps as follows:
• Step-p1 (1st iteration): Issue the read of S[i] for the 1st iteration.
• Step-p2 (1st iteration): Obtain S[i], calculate the j pointer, then issue the read of S[j]
for the 1st iteration.
• Step-p1 (2nd iteration): Obtain S[j], swap S[i] ↔ S[j], and calculate the k pointer for
the 1st iteration; issue the read of S[i] for the 2nd iteration.
• Step-p2 (2nd iteration): Read S[k] to obtain the 1st byte of ciphering key; obtain S[i],
calculate the j pointer, then issue the read of S[j] for the 2nd iteration.
• . . .
According to the RC4 algorithm, S[i] of the second iteration must be read after the swap
of the first iteration is completed. The operation in Step-p1 (2nd iteration) therefore
seems to violate the original RC4 algorithm. In fact, it generates the correct ciphering
key because we perform a pseudo-swap as follows: before reading the second S[i], the second
pointer i is compared with the first pointer j. If they are equal, the second S[i] is
assigned the value of the first S[i]. Otherwise, the second S[i] is read from the S-box RAM.
As a result, Step-p1 produces the same result as if the second S[i] were read after the
first swap had completed.
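The pseudo-swap can be checked with a small behavioral model. The sketch below is a software approximation of the two-step pipeline, not the proposed circuit: it reads the next S[i] from the RAM before the swap is written (a stale read) and then applies the comparison described above. The function names are hypothetical, and only the PRGA stage is modeled.

```python
def ksa(key: bytes) -> list:
    s, j = list(range(256)), 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s

def prga_reference(s: list, n: int) -> bytes:
    """Textbook PRGA, used as the golden model."""
    s, i, j, out = list(s), 0, 0, bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)

def prga_pipelined(s: list, n: int) -> bytes:
    """Two cycles per byte; the read of the next S[i] overlaps the current swap."""
    ram, out = list(s), bytearray()
    i, j = 1, 0
    s_i = ram[i]                                  # Step-p1 (1st iteration)
    for _ in range(n):
        # Step-p2: S[i] is available, calculate j and read S[j].
        j = (j + s_i) % 256
        s_j = ram[j]
        # Step-p1 of the next iteration: the read of the next S[i] sees the RAM
        # contents from before the swap, because both happen in the same cycle.
        i_next = (i + 1) % 256
        stale = ram[i_next]
        ram[i], ram[j] = s_j, s_i                 # swap via two writes
        k = (s_i + s_j) % 256
        # Pseudo-swap: if the next i equals the current j, use the first S[i].
        s_i_next = s_i if i_next == j else stale
        # Step-p2 of the next iteration: read S[k], the ciphering key byte.
        out.append(ram[k])
        i, s_i = i_next, s_i_next
    return bytes(out)

identity = list(range(256))   # with this state, the 2nd iteration has i_next == j
print(prga_pipelined(ksa(b"Key"), 8).hex())
```

Starting from the identity S-box, the second iteration reaches j = 3 while the next i is also 3, so the comparison branch is exercised; without it the stale value 3 would be used instead of the swapped value 2.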
Based on this idea, we propose a novel RAM-based RC4 architecture using a single
tri-port RAM which has one read-only, one write-only, and one read-write port. In
Step-p1, the write-only port and the write mode of the read-write port are used for swapping
S[i] ↔ S[j], and the read-only port is used to read S[i]. In Step-p2, the read-only port is
used to read S[k], and the read mode of the read-write port is used to read S[j]. Our design
requires only two clock cycles per byte of ciphering key. This is the shortest processing
time that a RAM-based RC4 architecture can achieve, because it is impossible for
a RAM-based architecture to read S[i] and S[j] of the same iteration in the same cycle:
S[i] must be read out first, and only in the next cycle can j be calculated from the read
data S[i] and S[j] be read.
In terms of hardware implementation, our tri-port RAM has two read and two write
control circuits. This is the same as the dual-port RAM in [20], which has two read-write
ports. Theoretically, our tri-port RAM costs the same hardware resources as the dual-port
RAM in [20]. Because it uses a single RAM instead of two, our design is expected to reduce
the hardware resources and power consumption by almost half compared to [20].
Register-based RC4 implementation: We observe that the KSA-KeySetup substage
and the PRGA stage of the RC4 algorithm shown in Fig. 3.1 can be rewritten as follows:
\begin{align*}
j_0 &= 0 \tag{3.1} \\
i_1 = i_0 &= \begin{cases} 0 & \text{(if it is the KSA-KeySetup substage)} \\ 1 & \text{(if it is the PRGA stage)} \end{cases} \tag{3.2} \\
j_1 &= j_0 + S[i_1] + Key[1] \tag{3.3} \\
S[i_1] &\leftrightarrow S[j_1] \tag{3.4} \\
k_1 &= S^{(1)}[i_1] + S^{(1)}[j_1] = S[j_1] + S[i_1] \tag{3.5} \\
Ckey[1] &= S^{(1)}[k_1] \tag{3.6} \\
i_2 &= i_1 + 1 \tag{3.7} \\
j_2 &= j_1 + S^{(1)}[i_2] + Key[2] \tag{3.8} \\
S^{(1)}[i_2] &\leftrightarrow S^{(1)}[j_2] \tag{3.9} \\
k_2 &= S^{(2)}[i_2] + S^{(2)}[j_2] = S^{(1)}[j_2] + S^{(1)}[i_2] \tag{3.10} \\
Ckey[2] &= S^{(2)}[k_2] \tag{3.11} \\
&\;\;\vdots \\
i_M &= i_{M-1} + 1 \tag{3.12} \\
j_M &= j_{M-1} + S^{(M-1)}[i_M] + Key[M] \tag{3.13} \\
S^{(M-1)}[i_M] &\leftrightarrow S^{(M-1)}[j_M] \tag{3.14} \\
k_M &= S^{(M)}[i_M] + S^{(M)}[j_M] = S^{(M-1)}[j_M] + S^{(M-1)}[i_M] \tag{3.15} \\
Ckey[M] &= S^{(M)}[k_M] \tag{3.16} \\
&\;\;\vdots
\end{align*}
where $S^{(n)}[i]$ represents the value of the S-box at index pointer $i$ after $n$ swaps.
It is obvious that if a circuit can evaluate all equations from (3.3) to (3.16) within one
clock cycle, it will generate M bytes of ciphering key per cycle. We call such a circuit an
M-byte RC4 core for short, where M is an integer larger than one.
When registers are used to store the S-box, its elements can be read out immediately,
but one clock cycle is still required to update them. Thus, in the cycle in which the
circuit calculates (3.3) and (3.4), the S-box cannot provide $S^{(1)}[k_1]$ and $S^{(1)}[i_2]$.
Consequently, (3.6), (3.8), etc., cannot be calculated in parallel with (3.3) and
(3.4). To implement an M-byte RC4 architecture, $S^{(1)}[i_1]$, $S^{(1)}[j_1]$, $S^{(2)}[i_1]$,
etc., must be calculated in advance by a combinational circuit that we call the SWAP block.
In this chapter, we propose and describe in detail a general multi-byte RC4 architecture.
Based on this, a 4-byte RC4 circuit is designed and compared with previous works.
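The role of the SWAP block can be illustrated with a behavioral model. In the sketch below, which is an assumption-laden software analogy rather than the proposed circuit, the register file `regs` is only updated once per modeled clock cycle, and a dictionary `pending` plays the part of the SWAP block by holding the values $S^{(n)}[\cdot]$ that differ from the registers within the cycle. Only the PRGA stage is modeled, and all names are hypothetical.

```python
def ksa(key: bytes) -> list:
    s, j = list(range(256)), 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s

def prga_reference(s: list, n: int) -> bytes:
    """Textbook PRGA, one byte per loop iteration."""
    s, i, j, out = list(s), 0, 0, bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)

def prga_m_bytes_per_cycle(s: list, m: int, cycles: int) -> bytes:
    """Each modeled cycle evaluates the PRGA form of (3.3)-(3.16), m bytes at once."""
    regs = list(s)                      # SBOX registers, written once per cycle
    i = j = 0
    out = bytearray()
    for _ in range(cycles):
        pending = {}                    # SWAP block: in-cycle S^(n) values

        def read(idx):
            # READ SBOX plus SWAP: prefer the in-cycle value if one exists.
            return pending.get(idx, regs[idx])

        for _ in range(m):
            i = (i + 1) % 256
            s_i = read(i)               # S^(n-1)[i_n]
            j = (j + s_i) % 256
            s_j = read(j)               # S^(n-1)[j_n]
            pending[i], pending[j] = s_j, s_i   # swap, visible only via SWAP
            out.append(read((s_i + s_j) % 256))  # Ckey[n] = S^(n)[k_n]
        for idx, val in pending.items():         # UPDATE SBOX at the cycle edge
            regs[idx] = val
    return bytes(out)

stream = prga_m_bytes_per_cycle(ksa(b"Key"), m=4, cycles=8)
print(stream.hex())
```

With M = 4 and 8 modeled cycles the function returns 32 bytes that can be compared with the one-byte-per-iteration reference, which is the equivalence the rewritten equations are meant to preserve.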
3.3 RAM-based Four-bit RC4 Architecture
We propose a RAM-based Four-bit RC4 architecture as shown in Fig. 3.2. The CAL-J block
calculates the index pointer j. The COUNT block is an 8-bit counter used to generate
the index pointer i. The CAL-K block calculates the index pointer k. The main part of the
core is the SBOX block, which stores the 256 elements of the S-box and performs the swapping
via the RAM's read and write commands. It includes a 256-byte tri-port RAM and a small
circuit for controlling the RAM's read and write commands. The three ports of the RAM are
read-only (port 1), write-only (port 2), and read-write (port 3). The utilization of the
RAM is shown in Fig. 3.3. In the Initial substage of KSA, port 2
writes the values 0 to 127 into the first 128 elements of the S-box, and port 3 writes the
values 128 to 255 into the last 128 elements. In this way, the KSA-Initial
substage is completed within 128 clock cycles. In the KSA-KeySetup substage and the PRGA
stage, each iteration is completed in two steps, where each step needs one clock cycle. The
three ports of the RAM are scheduled as follows: port 1 reads the S-box at the i pointer in
Step-p1 and at the k pointer in Step-p2; port 2 writes S[j] into the S-box at the i pointer;
port 3 writes S[i] into the S-box at the j pointer in Step-p1, and reads the S-box at the j
pointer in Step-p2.
Further detail about the operation and timing of this architecture is shown in Table 3.1.
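Figure 3.2: Block diagram of the RAM-based Four-bit RC4
(Diagram not reproduced. Blocks shown: COUNT (counter from 0 to 255, generating i), CAL-J
(8-bit adder with inputs S[i], Key[i], and the delayed pointer prev_j, producing j), CAL-K
(8-bit adder with inputs S[i] and S[j], producing k), and SBOX (outputs S[i], S[j], and
S[k], the ciphering key). Z^-1 denotes a delay of one clock cycle.)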
In this table, a thick line separates the table into two parts: the upper
part describes the commands applied to the RAM's ports (RAM inputs); the lower part
shows the RAM's outputs (read data), the calculations based on the read data, and the
generated ciphering key. The small numbers in parentheses indicate the flow
of generating one byte of ciphering key, as follows:
1. Port 1 reads the S-box at pointer i. The value of i is determined as follows: for the
first iteration, it is set to 0 (KSA stage) or 1 (PRGA stage); for the other
iterations, it is increased by 1.
2. The read value S[i] appears at the output of port 1.
3. The index pointer j is calculated, where Key[n] is the n-th byte of the input key, and
Key[n] = 0 in the PRGA stage.
4. Port 3 reads the S-box at the j pointer.
5. The read value S[j] appears at the output of port 3.
6. The swap S[i] ↔ S[j] is performed by port 2 and port 3: port 2 writes
S[j] to the i pointer, and port 3 writes S[i] to the j pointer. For the PRGA stage, the k
pointer is calculated by summing S[i] and S[j].
7. Port 1 reads the S-box at the k pointer.
8. The read value S[k] appears at the output of port 1.
9. The ciphering key byte equals S[k].
Figure 3.3: The utilization of tri-port RAM
(Diagram not reproduced. Port-1: read only; Port-2: write only; Port-3: read/write.
Sub-stage KSA-Initial: P2_waddr and P2_wdata carry i = 0, 1, …, 127; P3_rwaddr and P3_wdata
carry i = 128, …, 255. Sub-stage KSA-KeySetup and stage PRGA: P1_raddr takes i or k and
P1_rdata returns S[i] or S[k]; P2_waddr and P2_wdata write S[j] at i; P3_rwaddr takes j as
read or write address, P3_wdata carries S[i], and P3_rdata returns S[j]. Z^-1 denotes a
delay of one clock cycle.)
To avoid the race condition that occurs when two writes target the same memory
location simultaneously, the values of the i and j indexes are compared before deciding
whether to perform the write commands. If i equals j, swapping is not necessary, and
the write commands are canceled to avoid any potential problem.
From Table 3.1, it can be seen that the core generates one byte of ciphering key every
two clock cycles. In other words, the encryption speed of this core is 4 bits per cycle.
Table 3.1: Operation of the RAM-based Four-bit RC4
(Columns "It. n, Step-p1/p2" belong to the KSA-KeySetup substage and the PRGA stage.)

| Name | KSA-Initial | It. 1, Step-p1 | It. 1, Step-p2 | It. 2, Step-p1 | It. 2, Step-p2 | It. 3, Step-p1 | It. 3, Step-p2 | It. 4, Step-p1 | … |
|---|---|---|---|---|---|---|---|---|---|
| Port-1: Read port | | (1) i1 = 0/1 | | i2 = i1 + 1 | (7) k1 | i3 = i2 + 1 | k2 | i4 = i3 + 1 | … |
| Port-2: Write port | Addr. 0 to Addr. 127 | | | (6) i1 ← S[j1] | | i2 ← S[j2] | | i3 ← S[j3] | … |
| Port-3: Read/Write port | (WR) Addr. 128 to Addr. 255 | | (4) (RD) j1 | (6) (WR) j1 ← S[i1] | (RD) j2 | (WR) j2 ← S[i2] | (RD) j3 | (WR) j3 ← S[i3] | … |
| Port-1: Read Data | | | (2) S[i1] | | S[i2] | (8) S[k1] | S[i3] | S[k2] | … |
| Port-3: Read Data | | | | (5) S[j1] | | S[j2] | | S[j3] | … |
| Calculations | | | (3) j1 = 0 + S[i1] + Key[1] | (6) k1 = S[i1] + S[j1] | j2 = j1 + S[i2] + Key[2] | k2 = S[i2] + S[j2] | j3 = j2 + S[i3] + Key[3] | k3 = S[i3] + S[j3] | … |
| Ciphering Key | | | | | | (9) Ckey[1] = S[k1] | | Ckey[2] = S[k2] | … |

Notes: (n) marks the n-th step in generating the first byte of ciphering key. The entries
that generate the second, third, and fourth bytes of ciphering key follow the same pattern.
The first three rows are the commands applied to the RAM ports; the remaining rows are the
RAM outputs, the calculations, and the generated ciphering key.
Figure 3.4: Block diagram of the Register-based M-byte RC4
(Diagram not reproduced. Blocks shown: SBOX (registers 0…255), READ SBOX, COMPUTE, SWAP,
and UPDATE SBOX, connected through Z^-1 delay elements. COMPUTE receives Key[1], Key[2],
…, Key[M] and evaluates $i_m = i_0 + m$, $j_m = j_{m-1} + S^{(m-1)}[i_m] + Key[m]$, and
$k_m = S^{(m-1)}[j_m] + S^{(m-1)}[i_m]$ for $m = 1, \ldots, M$, with $S^{(0)} = S$. READ SBOX
supplies $S[i_m]$ and $S[j_m]$ to SWAP. SWAP outputs $S^{(1)}[k_1]$, $S^{(2)}[k_2]$, …,
$S^{(M)}[k_M]$ as the M bytes of ciphering key, and passes $S^{(M)}[i_m]$ and $S^{(M)}[j_m]$
to UPDATE SBOX.)
3.4 Register-based M-byte RC4 Architecture
3.4.1 Architecture and Operation
The block diagram of our Register-based M-byte RC4 architecture is shown in Fig. 3.4. The
SBOX block is composed of 256 discrete 8-bit registers, which store the 256 elements of
the S-box. The READ SBOX block returns the values of the S-box at the index pointers $i_1$,
$j_1$, $k_1$, …, $i_M$, $j_M$, and $k_M$. It is a purely combinational circuit, so there is
no read latency. The SWAP block calculates $S^{(1)}[i_2]$, $S^{(1)}[j_2]$, …, $S^{(M)}[i_M]$,
$S^{(M)}[j_M]$, and $S^{(1)}[k_1]$, $S^{(2)}[k_2]$, …, $S^{(M)}[k_M]$. The values
$S^{(1)}[i_2]$, $S^{(1)}[j_2]$, …, $S^{(M-1)}[i_M]$, $S^{(M-1)}[j_M]$ are then used by the
COMPUTE block, while $S^{(1)}[k_1]$, $S^{(2)}[k_2]$, …, $S^{(M)}[k_M]$ are the first, second, …, and M-th
33
T a
bl
e
3.
2:
O
pe
ra
tio
n
of
th
e
R
eg
is
te
r-
ba
se
d
M
-b
yt
e
R
C
4
B
l
o
c
k
 
&
 
T
a
s
k
F
o
r
 
K
S
A
–
K
e
y
S
e
t
u
p
 
s
u
b
-
s
t
a
g
e
 
a
n
d
 
 
P
R
G
A
 
s
t
a
g
e
1
s
t
c
l
o
c
k
 
c
y
c
l
e
2
n
d
c
l
o
c
k
 
c
y
c
l
e
3
r
d
c
l
o
c
k
 
c
y
c
l
e
…
1
.
C
O
M
P
U
T
E
:
 
C
a
l
c
u
l
a
t
e
i
m
R
E
A
D
 
S
B
O
X
:
 
O
b
t
a
i
n
 
S
[
i
m
]
i
0
=
 
-
1
/
0
i
1
=
 
i
0
+
 
1
 

S
[
i
1
]
i
2
=
 
i
0
+
 
2
 

S
[
i
2
]
… i
M
=
 
i
0
+
 
M
 

S
[
i
M
]
i
0
=
 
i
M
(
i
M
o
f
 
1
s
t
c
y
c
l
e
)
i
1
=
 
i
0
 
+
 
1
 

S
[
i
1
]
i
2
=
 
i
0
+
 
2
 

S
[
i
2
]
… i
M
=
 
i
0
+
 
M

S
[
i
3
]
 
(
i
M
o
f
 
2
n
d
c
y
c
l
e
)
…
…
2
.
C
O
M
P
U
T
E
:
 
C
a
l
c
u
l
a
t
e
 
j
m
R
E
A
D
 
S
B
O
X
:
 
O
b
t
a
i
n
 
S
[
j
m
]
S
W
A
P
:
 
c
a
l
c
u
l
a
t
e
 
S
(
1
)
[
i
2
]
,
…
j
0
=
 
0
j
1
 
=
 
j
0
 
+
 
S
[
i
1
]
 
+
 
K
e
y
[
1
]

S
[
j
1
]
 

S
(
1
)
[
i
2
]
j
2
 
=
 
j
1
 
+
 
S
(
1
)
[
i
2
]
 
+
 
K
e
y
[
2
]

S
[
j
2
]
 
…

S
(
M
-
1
)
[
i
M
]
j
M
=
 
j
M
-
1
 
+
 
S
(
M
-
1
)
[
i
M
]
 
+
 
K
e
y
[
M
]

S
[
j
M
]
 
j
0
=
 
j
M
(
j
M
o
f
 
1
s
t
c
y
c
l
e
)
j
1
 
=
 
j
0
 
+
 
S
[
i
1
]
 
+
 
K
e
y
[
M
+
1
]

S
[
j
1
]
 

S
(
1
)
[
i
2
]
j
2
 
=
 
j
1
 
+
 
S
(
1
)
[
i
2
]
 
+
 
K
e
y
[
M
+
2
]

S
[
j
2
]
 
…

S
(
M
-
1
)
[
i
M
]
j
M
=
 
j
M
-
1
 
+
 
S
(
M
-
1
)
[
i
M
]
 
+
 
K
e
y
[
2
M
]

S
[
j
M
]
 
(
j
M
o
f
 
2
n
d
c
y
c
l
e
)
…
…
3
.
S
W
A
P
:
 
C
a
l
c
u
l
a
t
e
 
S
(
m
)
[
i
m
]
,
 
S
(
m
)
[
j
m
]
S
(
1
)
[
i
1
]
,
 
 
S
(
1
)
[
j
1
]
,
 
…
,
 
S
(
M
)
[
i
M
]
,
 
 
S
(
M
)
[
j
M
]
S
(
M
)
[
i
1
]
,
 
 
S
(
M
)
[
j
1
]
,
…
,
 
S
(
M
)
[
i
M
]
,
 
 
S
(
M
)
[
j
M
]
 
S
(
1
)
[
i
1
]
,
 
 
S
(
1
)
[
j
1
]
,
 
…
,
 
S
(
M
)
[
i
M
]
,
 
 
S
(
M
)
[
j
M
]
S
(
M
)
[
i
1
]
,
 
 
S
(
M
)
[
j
1
]
,
…
,
 
S
(
M
)
[
i
M
]
,
 
 
S
(
M
)
[
j
M
]
 
…
…
4
.
U
P
D
A
T
E
 
S
B
O
X
:
 
W
r
i
t
e
 
t
o
 
S
-
b
o
x
S
(
M
)
[
i
1
]
 

i
1
;
 
S
(
M
)
[
j
1
]
 

j
1
S
(
M
)
[
i
2
]
 

i
2
;
 
S
(
M
)
[
j
2
]
 

j
2
… S
(
M
)
[
i
M
]
 

i
M
;
 
S
(
M
)
[
j
M
]
 

j
M
S
(
M
)
[
i
1
]
 

i
1
;
 
S
(
M
)
[
j
1
]
 

j
1
S
(
M
)
[
i
2
]
 

i
2
;
 
S
(
M
)
[
j
2
]
 

j
2
… S
(
M
)
[
i
M
]
 

i
M
;
 
S
(
M
)
[
j
M
]
 

j
M
…
…
5
.
C
O
M
P
U
T
E
:
 
C
a
l
c
u
l
a
t
e
 
k
m
R
E
A
D
 
S
B
O
X
:
 
O
b
t
a
i
n
 
S
(
M
)
[
k
m
]
k
1
 
=
 
S
(
1
)
[
i
1
]
 
+
 
S
(
1
)
[
j
1
]
 

S
(
M
)
[
k
1
]
 
 
k
2
 
=
 
S
(
2
)
[
i
2
]
 
+
 
S
(
2
)
[
j
2
]
 

S
(
M
)
[
k
2
]
 
 
… k
M
=
 
S
(
M
)
[
i
M
]
 
+
 
S
(
M
)
[
j
M
]
 

S
(
M
)
[
k
M
]
 
 
k
1
 
=
 
S
(
1
)
[
i
1
]
 
+
 
S
(
1
)
[
j
1
]
 

S
(
M
)
[
k
1
]
 
 
k
2
 
=
 
S
(
2
)
[
i
2
]
 
+
 
S
(
2
)
[
j
2
]
 

S
(
M
)
[
k
2
]
 
 
… k
M
=
 
S
(
M
)
[
i
M
]
 
+
 
S
(
M
)
[
j
M
]
 

S
(
M
)
[
k
M
]
 
 
…
6
.
S
W
A
P
:
 
c
a
l
c
u
l
a
t
e
 
c
i
p
h
e
r
i
n
g
 
k
e
y
C
k
e
y
[
1
]
 
=
 
S
(
1
)
[
k
1
]
 
 
C
k
e
y
[
2
]
 
=
 
S
(
2
)
[
k
2
]
…
C
k
e
y
[
M
]
 
=
 
S
(
M
)
[
k
M
]
C
k
e
y
[
1
]
 
=
 
S
(
1
)
[
k
1
]
 
 
C
k
e
y
[
2
]
 
=
 
S
(
2
)
[
k
2
]
…
C
k
e
y
[
M
]
 
=
 
S
(
M
)
[
k
M
]
…
N
o
t
e
s
:
m
 
=
 
1
,
2
,
…
,
M
34
bytes of ciphering key that need to be generated by the core. The SWAP block is a special
combinational circuit. It will be described deeply in the next subsection. COMPUTE block
calculates i1, j1, k1, . . . , iM, jM, and kM. To increase the maximum frequency of the entire
RC4 core, pipeline registers are placed before calculating k1, k2, . . . , and kM. Because of
using these pipeline registers, the pointers k1, k2, . . . , and kM are calculated when the S-
box has been updated by values after M times of swapping. Thus, the output of READ
SBOX block for k1, k2, . . . , and kM are S (M)[k1], S (M)[k2], . . . , and S (M)[kM], respectively.
UPDATE SBOX block writes the new values S (M)[i1], S (M)[ j1], . . . , S (M)[iM], and S (M)[ jM]
into registers of the S-box pointed by i1, j1, . . . , iM, and jM, respectively. Where, S (M)[iM]
is the expected value of S-box at iM after M times of swapping, and similar for S (M)[ jM]
and the others.
The operation of the M-byte RC4 architecture is explained as follows.
At the reset cycle, all 256 registers of the SBOX block are initialized with values equal
to their index numbers. This completes the KSA-Initial substage.
The KSA-KeySetup substage and the PRGA stage are performed in six steps, as shown in
Table 3.2 and explained below:
1. The COMPUTE block calculates the index pointers $i_1, \ldots, i_M$. The READ SBOX
block then outputs $S[i_1], \ldots, S[i_M]$. Note that at the first cycle of KSA-KeySetup,
$i_0 = -1$; at the first cycle of PRGA, $i_0 = 0$; otherwise, $i_0$ is assigned $i_M$ of the
previous cycle.
2. The following calculation is repeated for m = 1 to M: the COMPUTE block calculates $j_m$,
the READ SBOX block outputs $S[j_m]$, and the SWAP block calculates $S^{(m)}[i_{m+1}]$.
Note that at the first cycle of either KSA-KeySetup or PRGA, $j_0 = 0$; otherwise, $j_0$ is
assigned $j_M$ of the previous cycle.
3. The SWAP block calculates $S^{(1)}[i_1], S^{(1)}[j_1], \ldots, S^{(M)}[i_M], S^{(M)}[j_M]$ for computing
$k_1, \ldots, k_M$ in the next cycle. It also outputs $S^{(M)}[i_1], S^{(M)}[j_1], \ldots, S^{(M)}[i_M], S^{(M)}[j_M]$
for updating the S-box.
4. The UPDATE SBOX block writes $S^{(M)}[i_1], S^{(M)}[j_1], \ldots, S^{(M)}[i_M], S^{(M)}[j_M]$ into the
registers pointed to by $i_1, j_1, \ldots, i_M, j_M$, respectively. This updates the S-box
with the new values after M swaps.
5. In the next clock cycle, the COMPUTE block calculates the index pointers $k_1, \ldots, k_M$.
The READ SBOX block outputs the values of the S-box at the pointers $k_1, \ldots, k_M$. Note that the
S-box now contains the updated values after M swaps. Thus, the output
data are $S^{(M)}[k_1], \ldots, S^{(M)}[k_M]$.
6. The SWAP block calculates $S^{(1)}[k_1], \ldots, S^{(M-1)}[k_{M-1}]$. These are the 1st, …, (M-1)th
bytes of the ciphering key. The Mth byte of the ciphering key is $S^{(M)}[k_M]$.
In the KSA-KeySetup substage, the index pointers $k_1, \ldots, k_M$ and the ciphering key $S^{(1)}[k_1],
\ldots, S^{(M)}[k_M]$ are not calculated, while in the PRGA stage the values Key[1], …, Key[M], etc.,
are set to zero.
As can be seen from Table 3.2, M bytes of ciphering key are generated every clock
cycle.
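The cycle-level behaviour described above can be sketched in software. The following Python model is only an illustration under stated assumptions, not the circuit itself: the function names (ksa, prga_cycle, keystream) are hypothetical, all index arithmetic is assumed to be modulo 256 (8-bit registers), and the KSA is written as a plain byte-wise loop rather than M bytes per cycle. It shows that grouping M PRGA steps into one modelled clock cycle yields the same key stream as byte-wise RC4.

```python
def ksa(key):
    """KSA-Initial (S[n] = n) followed by KSA-KeySetup, one byte per step."""
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    return S


def prga_cycle(S, i0, j0, M):
    """Model one PRGA clock cycle: M swaps and M ciphering-key bytes."""
    i, j, out = i0, j0, []
    for _ in range(M):
        i = (i + 1) % 256                    # i_m = i_0 + m
        j = (j + S[i]) % 256                 # j_m = j_{m-1} + S^(m-1)[i_m], Key[m] = 0
        S[i], S[j] = S[j], S[i]              # m-th swap
        out.append(S[(S[i] + S[j]) % 256])   # Ckey[m] = S^(m)[k_m]
    return out, i, j


def keystream(key, nbytes, M=4):
    """Generate nbytes of key stream, M bytes per modelled cycle."""
    S = ksa(key)
    i = j = 0                                # i_0 = j_0 = 0 at the first PRGA cycle
    out = []
    while len(out) < nbytes:
        block, i, j = prga_cycle(S, i, j, M)
        out.extend(block)
    return bytes(out[:nbytes])
```

For example, keystream(b"Key", 10, M=4) and keystream(b"Key", 10, M=1) return the same bytes; only the number of bytes produced per modelled cycle differs.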
3.4.2 How to Design SWAP Block
Handling the swaps of the S-box is the main difficulty in designing a multi-byte
RC4 circuit. In other words, SWAP is the key block of the M-byte RC4 architecture. In
this section, we give simple rules for implementing this block. The function of this block
is to determine exactly the following:
1. The values of the S-box at the pointers $i_m$, $j_m$ after t swaps: $S^{(t)}[i_m]$ and $S^{(t)}[j_m]$,
where m = 1, 2, …, M and t = 1, 2, …, M. The formulas for calculating these
values are shown in Table 3.3.
2. The values of the ciphering key: $S^{(m)}[k_m]$, where m = 1, 2, …, M − 1. The formulas for
calculating these values are shown in Table 3.4.

Table 3.3: SWAP block, rules for designing $S^{(t)}[i_m]$ and $S^{(t)}[j_m]$ (swapping points are marked with *)

| Name | Initial value | t = 1 | t = 2 | … | t = M |
|---|---|---|---|---|---|
| $S^{(t)}[i_1]$ | $S[i_1]$ | * $S^{(1)}[i_1] = S[j_1]$ | $S^{(2)}[i_1] = S^{(2)}[j_2]$ if $i_1 = j_2$; $= S^{(1)}[i_1]$ otherwise | … | $S^{(M)}[i_1] = S^{(M)}[j_M]$ if $i_1 = j_M$; $= S^{(M-1)}[i_1]$ otherwise |
| $S^{(t)}[j_1]$ | $S[j_1]$ | * $S^{(1)}[j_1] = S[i_1]$ | $S^{(2)}[j_1] = S^{(2)}[j_2]$ if $j_1 = j_2$; $= S^{(2)}[i_2]$ if $j_1 = i_2$; $= S^{(1)}[j_1]$ otherwise | … | $S^{(M)}[j_1] = S^{(M)}[j_M]$ if $j_1 = j_M$; $= S^{(M)}[i_M]$ if $j_1 = i_M$; $= S^{(M-1)}[j_1]$ otherwise |
| $S^{(t)}[i_2]$ | $S[i_2]$ | $S^{(1)}[i_2] = S^{(1)}[j_1]$ if $i_2 = j_1$; $= S[i_2]$ otherwise | * $S^{(2)}[i_2] = S^{(1)}[j_2]$ | … | $S^{(M)}[i_2] = S^{(M)}[j_M]$ if $i_2 = j_M$; $= S^{(M-1)}[i_2]$ otherwise |
| $S^{(t)}[j_2]$ | $S[j_2]$ | $S^{(1)}[j_2] = S^{(1)}[j_1]$ if $j_2 = j_1$; $= S^{(1)}[i_1]$ if $j_2 = i_1$; $= S[j_2]$ otherwise | * $S^{(2)}[j_2] = S^{(1)}[i_2]$ | … | $S^{(M)}[j_2] = S^{(M)}[j_M]$ if $j_2 = j_M$; $= S^{(M)}[i_M]$ if $j_2 = i_M$; $= S^{(M-1)}[j_2]$ otherwise |
| … | … | … | … | … | … |
| $S^{(t)}[i_M]$ | $S[i_M]$ | $S^{(1)}[i_M] = S^{(1)}[j_1]$ if $i_M = j_1$; $= S[i_M]$ otherwise | $S^{(2)}[i_M] = S^{(2)}[j_2]$ if $i_M = j_2$; $= S^{(1)}[i_M]$ otherwise | … | * $S^{(M)}[i_M] = S^{(M-1)}[j_M]$ |
| $S^{(t)}[j_M]$ | $S[j_M]$ | $S^{(1)}[j_M] = S^{(1)}[j_1]$ if $j_M = j_1$; $= S^{(1)}[i_1]$ if $j_M = i_1$; $= S[j_M]$ otherwise | $S^{(2)}[j_M] = S^{(2)}[j_2]$ if $j_M = j_2$; $= S^{(2)}[i_2]$ if $j_M = i_2$; $= S^{(1)}[j_M]$ otherwise | … | * $S^{(M)}[j_M] = S^{(M-1)}[i_M]$ |

Table 3.4: SWAP block, rules for designing $S^{(m)}[k_m]$

| Output | Rule |
|---|---|
| $S^{(M-1)}[k_{M-1}]$ | $= S^{(M-1)}[i_M]$ if $k_{M-1} = i_M$; $= S^{(M-1)}[j_M]$ if $k_{M-1} = j_M$; $= S^{(M)}[k_{M-1}]$ otherwise |
| … | … |
| $S^{(1)}[k_1]$ | $= S^{(1)}[i_M]$ if $k_1 = i_M$; $= S^{(1)}[j_M]$ if $k_1 = j_M$; …; $= S^{(1)}[i_2]$ if $k_1 = i_2$; $= S^{(1)}[j_2]$ if $k_1 = j_2$; $= S^{(M)}[k_1]$ otherwise |

Calculate $S^{(t)}[i_m]$ and $S^{(t)}[j_m]$:
• t = 1; m = 1, 2, …, M:
In the first swap, the value at pointer $i_1$ is exchanged with the value at pointer $j_1$:
$S[i_1] \leftrightarrow S[j_1]$. Thus $S^{(1)}[i_1] = S[j_1]$ and $S^{(1)}[j_1] = S[i_1]$. The values at
the other pointers $i_2, j_2, \ldots, i_M, j_M$ depend on the relationship between these pointers and
$i_1$ and $j_1$, as explained below.
- $S^{(1)}[i_2]$: If $i_2 = j_1$, then $S^{(1)}[i_2] = S^{(1)}[j_1]$, where $S^{(1)}[j_1]$ has just been
found above. Otherwise, the value at $i_2$ does not change, and $S^{(1)}[i_2] = S[i_2]$. Note that
we always have $i_2 \neq i_1$ because $i_2 = i_1 + 1$.
- $S^{(1)}[j_2]$: If $j_2 = j_1$, then $S^{(1)}[j_2] = S^{(1)}[j_1]$. If $j_2 = i_1$, then $S^{(1)}[j_2] =
S^{(1)}[i_1]$. Otherwise, the value at $j_2$ does not change, and $S^{(1)}[j_2] = S[j_2]$.
$S^{(1)}[i_3], S^{(1)}[j_3], \ldots, S^{(1)}[i_M], S^{(1)}[j_M]$ are calculated similarly to $S^{(1)}[i_2]$ and $S^{(1)}[j_2]$.
Refer to the third column of Table 3.3 for more detail.
• t = 2; m = 1, 2, …, M:
In the second swap, the value at pointer $i_2$ is exchanged with the value at pointer
$j_2$: $S^{(1)}[i_2] \leftrightarrow S^{(1)}[j_2]$. Thus, $S^{(2)}[i_2] = S^{(1)}[j_2]$ and $S^{(2)}[j_2] = S^{(1)}[i_2]$. The values at the
other pointers depend on their relationship with $i_2$ and $j_2$, as explained below.
- $S^{(2)}[i_1]$: If $i_1 = j_2$, then $S^{(2)}[i_1] = S^{(2)}[j_2]$. Otherwise, the value at $i_1$ does not
change, and $S^{(2)}[i_1] = S^{(1)}[i_1]$.
- $S^{(2)}[j_1]$: If $j_1 = j_2$, then $S^{(2)}[j_1] = S^{(2)}[j_2]$. If $j_1 = i_2$, then $S^{(2)}[j_1] =
S^{(2)}[i_2]$. Otherwise, the value at $j_1$ does not change, and $S^{(2)}[j_1] = S^{(1)}[j_1]$.
$S^{(2)}[i_3], S^{(2)}[j_3], \ldots, S^{(2)}[i_M], S^{(2)}[j_M]$ are calculated similarly to $S^{(2)}[i_1]$ and $S^{(2)}[j_1]$.
Refer to the fourth column of Table 3.3 for more detail.
• t = 3, …, M:
The calculation is similar to the cases t = 1 and t = 2, noting that in the t-th
swap the values at $i_t$ and $j_t$ are exchanged, while the values at the other
pointers depend on their relationship with $i_t$ and $j_t$. To help the reader recognize
this point easily, the swapping points in Table 3.3 are marked with an asterisk.
Calculate $S^{(m)}[k_m]$:
The input of this calculation is $S^{(M)}[k_m]$, while the expected output is $S^{(m)}[k_m]$. After
the m-th swap, the swaps m + 1, …, M still take place. Because of these swaps, the value at $k_m$
may be exchanged with the value of the S-box at $i_{m+1}, j_{m+1}, \ldots, i_M$, or $j_M$ if $k_m$ equals one
of them. In short, if $k_m$ does not equal any of these pointers, then
$S^{(m)}[k_m] = S^{(M)}[k_m]$. If $k_m$ equals one of them, for example $k_m = i_{m+1}$, then
$S^{(m)}[k_m] = S^{(m)}[i_{m+1}]$. The details are shown in Table 3.4.
From the proposed M-byte RC4 architecture, a 4-byte RC4 circuit (with M = 4) is
developed. Its hardware cost and power consumption are shown in the next section.
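The rules of Tables 3.3 and 3.4 can be checked with a small software model. The sketch below is an assumption-laden illustration, not the SWAP circuit: the names rc4_reference and cycle_with_forwarding are hypothetical, the pointer comparisons of Table 3.3 are modelled by a dictionary lookup (a hit means the pointer equals an earlier swap pointer, a miss means the original register value is still valid), and arithmetic is assumed to be modulo 256. During one modelled cycle the register array is only read, and it is written once at the end, as the UPDATE SBOX block does.

```python
def rc4_reference(S, i, j, n):
    """Byte-wise PRGA used as a reference; returns (key bytes, final S)."""
    S = S[:]
    out = []
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return out, S


def cycle_with_forwarding(S, i0, j0, M):
    """One modelled cycle: S holds the pre-cycle values and is never modified."""
    fwd = {}     # index -> value after the swaps done so far (Table 3.3)
    snaps = []   # snaps[m-1]: copy of fwd after m swaps
    ptrs = []    # (i_m, j_m) for m = 1..M
    j = j0
    for m in range(1, M + 1):
        i = (i0 + m) % 256
        si = fwd.get(i, S[i])          # S^(m-1)[i_m]
        j = (j + si) % 256
        sj = fwd.get(j, S[j])          # S^(m-1)[j_m]
        fwd[i], fwd[j] = sj, si        # m-th swap
        ptrs.append((i, j))
        snaps.append(dict(fwd))
    S_new = S[:]
    for idx, val in fwd.items():       # UPDATE SBOX: write the values after M swaps
        S_new[idx] = val
    out = []
    for m in range(1, M + 1):
        i, j = ptrs[m - 1]
        snap = snaps[m - 1]
        k = (snap[i] + snap[j]) % 256  # k_m = S^(m)[i_m] + S^(m)[j_m]
        later = {p for pair in ptrs[m:] for p in pair}
        if k in later:                 # Table 3.4: k_m hits a later swap pointer
            out.append(snap.get(k, S[k]))
        else:                          # otherwise S^(m)[k_m] = S^(M)[k_m]
            out.append(S_new[k])
    return out, S_new, ptrs[-1][0], ptrs[-1][1]
```

Calling cycle_with_forwarding repeatedly, feeding back the returned S-box and the last pointers as $i_0$ and $j_0$, should reproduce the byte-wise output of rc4_reference for any M.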
3.5 Experimental Results and Comparisons
Both of our proposed RC4 circuits, the RAM-based Four-bit and the Register-based Four-byte
ones, were designed in the Verilog hardware description language and verified with the
ModelSim simulation tool. The circuits were finally synthesized in the SAED 90 nm CMOS
ASIC process. The typ_tm library for the middle-voltage operating condition and
the compile_ultra command were used. To compare with the designs of [20], which were
synthesized in a 65 nm ASIC technology, we used the method shown in [50] to normalize the
throughput of [20] to the 90 nm technology. Table 3.5 compares our proposed
RC4 architectures with the previous works. The contributions of our proposed architectures
can be seen clearly from the table, as follows:
• Register-based RC4: High throughput at low operating frequency. With 32
bits/cycle, the proposed Four-byte RC4 achieves the highest encryption speed per cycle:
four times that of [19] and [20], and twelve times that of [17] and [18]. To provide
high throughput, [19], [20], and the previous works need a dedicated high-frequency
clock. A clock tree as well as clock synchronization issues must then be taken care
of if the rest of the system operates at a lower frequency. The proposed Four-byte architecture
avoids these issues.
Table 3.5: ASIC synthesis results of RC4 circuits

| Architecture | Technology | Area (Kgate) | Frequency (MHz) | Throughput, published (Gbps) | Throughput, normalized* (Gbps) | Throughput (bits/cycle) | Power (mW) |
|---|---|---|---|---|---|---|---|
| RAM-based | | | | | | | |
| [17] by Tsoi | FPGA | n/a | 50 | 0.133 | n/a | 2.67 | n/a |
| [18] by Kitsos | FPGA | n/a | 66 | 0.176 | n/a | 2.67 | n/a |
| [20] by Chatto | ASIC 65 nm | 22 | 810 | 3.24 | 2.34 | 4 | 8.1 |
| The proposed | ASIC 90 nm | 30.5 | 585 | n/a | 2.34 | 4 | 3.2 |
| Register-based | | | | | | | |
| [19] by Matthews | n/a | n/a | n/a | n/a | n/a | 8 | n/a |
| [20] by Chatto | ASIC 65 nm | 33.6 | 1280 | 10.24 | 7.4 | 8 | 9.3 |
| The proposed | ASIC 90 nm | 124 | 232 | 7.4 | 7.4 | 32 | 4.6 |

*: Normalized throughput from S nm technology to 90 nm = (throughput at S) × S/90
n/a: not available
• Register-based RC4: Low power consumption. For the Register-based architecture,
at the same normalized throughput of 7.4 Gbps, our RC4 circuit requires 4.6 mW of
power, which is only half of that of [20]. The proposed architecture saves power because
it operates at a lower frequency.
• RAM-based RC4: Low power consumption. For the RAM-based architecture, to
provide the same normalized throughput of 2.34 Gbps, the proposed Four-bit RC4
needs 3.2 mW of power, which is only 40% of that of the RAM-based design in [20]. This
power reduction is due to the fact that the design in [20] uses two separate
RAMs, twice as many as ours. Moreover, it needs a special circuit to maintain data
coherence between the two RAMs, which ours does not. The same conditions (the
saed90nm_typ_tm library, the compile_ultra command, operation at 585 MHz, etc.)
were used to synthesize our tri-port RAM and the dual-port RAM mentioned in [20]. Their
hardware cost and power consumption are shown in Table 3.6. The results show that,
under the same conditions, both the hardware cost and the power consumption of two
dual-port RAMs are about twice those of a single tri-port RAM.

Table 3.6: Tri-port RAM versus dual-port RAM

| Name | Area (Kgate) | Power (mW) |
|---|---|---|
| Tri-port RAM | 26.8 | 1.61 |
| 2 × dual-port RAM | 53.4 | 3.3 |
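The throughput figures in Table 3.5 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the normalization uses the footnote of Table 3.5 (throughput at S nm multiplied by S/90), and the relation throughput = bits/cycle × frequency is an assumption that happens to be consistent with the table entries.

```python
def throughput_gbps(bits_per_cycle, freq_mhz):
    """Raw throughput in Gbps, assuming bits/cycle x clock frequency."""
    return bits_per_cycle * freq_mhz / 1000.0


def normalize_to_90nm(throughput, tech_nm):
    """Footnote of Table 3.5: (throughput at S) x S / 90."""
    return throughput * tech_nm / 90.0


# Design [20], synthesized in 65 nm
ram_20 = normalize_to_90nm(throughput_gbps(4, 810), 65)     # RAM-based
reg_20 = normalize_to_90nm(throughput_gbps(8, 1280), 65)    # Register-based

# Proposed designs, already in 90 nm
ram_ours = throughput_gbps(4, 585)
reg_ours = throughput_gbps(32, 232)
```

Rounded to the precision of the table, these values agree with the normalized throughput column (2.34 Gbps for the RAM-based and 7.4 Gbps for the Register-based designs).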
3.6 Conclusion
In this chapter, we have proposed two novel multi-Gbps RC4 architectures for securing
high-throughput wireless systems. The RAM-based RC4 uses a single tri-port RAM to
produce 4 bits of ciphering key per cycle, which is the maximum encryption speed that can be
achieved by a RAM-based architecture. The Register-based RC4 is the first proposed
architecture that generates M bytes per cycle, where M can theoretically be any positive
integer. From the M-byte architecture, a 4-byte RC4 has been developed (M = 4). This
chapter has also shown that: 1) The proposed RAM-based RC4 is the first RAM-based
architecture achieving up to 4 bits/cycle throughput, which is 1.5 times higher than the
previous work. Later, another work [20] was published that also achieves 4
bits/cycle. To provide the same throughput of 2.34 Gbps, our work consumes 3.2 mW of
power, which is only 40% of that of the RAM-based RC4 in [20]. 2) The proposed Register-
based RC4 is the first architecture achieving 32 bits/cycle throughput, which is at least 4
times higher than the conventional works. To provide the same throughput of 7.4 Gbps, our
Register-based RC4 consumes 4.6 mW, which is half of that of the Register-based RC4 in [20].
Chapter 4
MIMO Detection Algorithm and Architecture
4.1 Introduction
As mentioned in Chapter 2, MIMO detection plays an important role in improving the reliability
of wireless communication systems. Among MIMO detection schemes such as
maximum likelihood detection (MLD) [28], linear minimum mean square error (LMMSE)
[24], Bell Labs layered space-time MMSE (BLAST MMSE) [25], and lattice-reduction-aided
MMSE (LRA MMSE) [26], MLD is the optimal one in terms of reliability.
However, its complexity grows rapidly with the number of constellation nodes of
the modulation and exponentially with the number of spatial streams [29]. Much research has
therefore been devoted to suboptimal MLD algorithms, especially the K-best and its improved
versions [30]-[38]. If a MIMO system sends data via N spatial streams, the full K-best
processes N stages. In each stage, it first computes the Euclidean distance
from the received information to all of the constellation nodes (i.e., the expansion task) and
then sorts the obtained results (i.e., the sorting task) to select the K best nodes. If we denote W as
the number of constellation nodes, the complexity of the expansion and sorting tasks increases
proportionally to W and W², respectively.
To reduce the K-best's complexity, several studies have been carried out and published.
They can be classified into two methods, the complex domain method
and the real domain method. The former processes N stages as the full K-best does.
However, new ideas are proposed to reduce the complexity in exchange for an acceptable
performance degradation. Typical proposals of this kind are the fixed sphere decoder
(FSD) algorithm in [30], a step-reduced K-best sphere detection algorithm in [31], and a
zigzag on-demand expansion scheme in [32]. On the other hand, the real domain method
separates the in-phase (IP) and quadrature-phase (QP) components of complex data into
two independent real data streams and processes them in the real domain. Thus, the complexity
of each stage is reduced, while the number of stages increases from N (in the complex
domain) to 2N (in the real domain). Well-known works on this method are [33, 34,
35, 36]. Studying these works, we recognize that the expansion and sorting tasks are still
too complex for practical implementation if a large value of K and high-order modulation
types such as 256-QAM are needed.
In this chapter, we propose an algorithm and a hardware design of a low-complexity 2D
sorter-based K-best MIMO detector. The detection is based on the complex domain method.
The contributions of this work are briefly described as follows:
• In terms of algorithm, we propose the direct expansion and parent node grouping
methods to reduce the complexity of the expansion, and a two-dimensional (2D) sorter to
simplify the sorting task. The direct expansion specifies the best candidates directly,
without searching all the constellation nodes. Consequently, the complexity of the algorithm
is negligibly affected by the constellation size. The Euclidean distance computation
becomes simpler, and the divider is eliminated. The parent node grouping
reduces the number of search candidates to an acceptable amount without sacrificing
the BER performance. The 2D sorter performs matrix-based sorting. It has
low complexity, is suitable for hardware resource sharing, and provides an approximate
result.
• In terms of hardware architecture, a prototype of the algorithm that targets
4 × 4 MIMO 802.11n/ac systems is developed. We use techniques such as
resource sharing and a GAIN-MUX-based multiplier to further reduce the complexity.
The rest of this chapter is organized as follows: Section 4.2 presents preliminary
information such as notations, the channel model, and the full K-best algorithm. Section 4.3 describes
our algorithm. Section 4.4 focuses on the hardware design. Sections 4.5 and 4.6 compare
the proposed design with the previous works in terms of BER performance and
application-specific integrated circuit (ASIC) results, respectively. We conclude the chapter in Section
4.7.
4.2 Background
4.2.1 Notations
We shall use bold lowercase letters for vectors and bold capital letters for matrices.
Furthermore, $\|\cdot\|$ denotes the L2 norm (Euclidean distance), $(\cdot)^H$ denotes the
Hermitian transpose of a matrix, and $(\cdot)^I$ and $(\cdot)^Q$ denote the in-phase (IP) and
quadrature-phase (QP) parts of a signal, respectively.
4.2.2 Channel Model and Full K-best Algorithm
This chapter considers a MIMO system with spatial multiplexing signaling (i.e., the signals
transmitted from the individual antennas are independent of each other). Let N and M represent
the number of transmit and receive antennas, respectively, with M ≥ N. Assume that the
transmit symbols are taken from a quadrature amplitude modulation (QAM) constellation that has W
nodes.

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n} \qquad (4.1)$$

The transmission of each vector x over flat-fading MIMO channels can be modeled as
(4.1), in which y is the M × 1 received signal vector, H is the M × N channel matrix, x is
the N × 1 transmit symbol vector, and n is the M × 1 independent identically distributed
(i.i.d.) white Gaussian noise vector. The channel H is decomposed into two matrices Q and
R: H = QR, in which Q is an M × M unitary matrix and R is an M × N upper triangular
matrix. In the case M > N, the last M − N rows of R are zero, and the size of the R matrix
thus becomes N × N. For simplicity, in this work we assume that M = N.

[Figure 4.1 (diagram): in stage N, one parent is expanded into W child nodes, which are sorted to select K parent nodes; in stage N − 1, each of the K parent nodes is expanded into W child nodes, which are sorted again.]
Figure 4.1: Stages N and N − 1 of the full K-best algorithm

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x} \in \Omega^N} \|\mathbf{z} - \mathbf{R}\mathbf{x}\|^2 = \arg\min_{\mathbf{x} \in \Omega^N} \sum_{i=N}^{1} \Big| z_i - \sum_{j=i}^{N} r_{ij} x_j \Big|^2 \qquad (4.2)$$

$$PED_n = \underbrace{\sum_{i=N}^{n+1} \Big| z_i - \sum_{j=i}^{N} r_{ij} x_j \Big|^2}_{PED_{n+1}} + \underbrace{\Big| z_n - \sum_{j=n}^{N} r_{nj} x_j \Big|^2}_{D_n} \qquad (4.3)$$

where $\sum_{i=N}^{1}(\cdot)$ denotes the sum over i = N, N − 1, …, 1.
The full K-best finds the transmitted symbol vector x by solving (4.2). In this equation, $\Omega^N$
denotes the set of $W^N$ possible transmitted symbol vectors x, and $\mathbf{z} = \mathbf{Q}^H \mathbf{y}$. Equation (4.2)
is computed through N stages in the order from N to 1, one after another. The nth stage (n =
N, …, 1) computes the nth partial Euclidean distance ($PED_n$) in (4.3) by adding $PED_{n+1}$
(i.e., the result of the (n + 1)th stage) to $D_n$ (i.e., calculated in the nth stage).
Two main tasks, expansion and sorting, are done in stage n (n = N, …, 1) (refer
to [32] for details).
• The expansion task first computes K × W values of $D_n$ and $x_n$ (i.e., child nodes) from
the K parent nodes selected in stage n + 1. It then calculates $PED_n = PED_{n+1} + D_n$.
• The sorting task sorts the K × W values of $PED_n$ to find the K smallest values of $PED_n$ and
the corresponding $\{x_N, \ldots, x_n\}$. The selected data become the parent nodes of
the next stage (i.e., stage n − 1).
The processing of the first two stages (i.e., N and N − 1) is illustrated in Figure 4.1.
Notice that K = 1 if n = N.
At the final stage (i.e., stage 1), the sorting is not performed. All K × W values of $PED_1$
are used for the final decision, whether hard or soft. The hard decision method
finds the $\{x_N, \ldots, x_1\}$ that corresponds to the smallest value of $PED_1$ and takes
it as the decoded data, while the soft decision method calculates the log-likelihood
ratio (LLR) of all information bits.
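The stage-by-stage procedure can be summarized in a short reference model. The following Python sketch is a minimal illustration with hard decision only; the function name kbest_detect and its argument layout are assumptions made for the example, R is taken as an N × N upper triangular matrix stored as nested lists, and z is assumed to be the already rotated vector $\mathbf{Q}^H\mathbf{y}$.

```python
def kbest_detect(R, z, constellation, K):
    """Full K-best (hard decision) for z = R x + noise; returns (x_1, ..., x_N)."""
    N = len(z)
    survivors = [(0.0, ())]                  # (PED, (x_{n+1}, ..., x_N)); one parent at stage N
    for n in range(N - 1, -1, -1):           # stage N down to stage 1 (0-based row index)
        children = []
        for ped, tail in survivors:
            interference = sum(R[n][n + 1 + t] * x for t, x in enumerate(tail))
            f = z[n] - interference
            for x in constellation:          # expansion: W child nodes per parent
                d = abs(f - R[n][n] * x) ** 2            # D_n of (4.3)
                children.append((ped + d, (x,) + tail))  # PED_n = PED_{n+1} + D_n
        if n > 0:                            # sorting: keep the K smallest PEDs
            children.sort(key=lambda c: c[0])
            children = children[:K]
        survivors = children                 # no sorting at the final stage
    return min(survivors, key=lambda c: c[0])[1]
```

A usage example with a noise-free 3 × 3 QPSK system: build z from a known x as z = R x and call kbest_detect(R, z, qpsk, K=2); the returned tuple is the transmitted vector.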
4.3 The Proposed Algorithm
First, we use sorted QR decomposition (SQRD) pre-processing [49], [50], H = SQR, instead
of the conventional QRD, H = QR, to improve the BER performance. In [50], the
authors have shown that a low-complexity SQRD can be designed by using the modified
Gram-Schmidt algorithm with pipelining and resource sharing.
The main process of our algorithm is done through N stages, as in the full K-best.
The following ideas are proposed to reduce the complexity.
4.3.1 Direct Expansion
The direct expansion finds the L best child nodes per parent node. First, $D_n$ in (4.3) is
rewritten as (4.4) and (4.5):

$$D_n = \Big| \underbrace{z_n - \sum_{j=n+1}^{N} r_{nj} x_j}_{f_n} - r_{nn} x_n \Big|^2 = | f_n - r_{nn} x_n |^2 \qquad (4.4)$$

$$\phantom{D_n} = \underbrace{(f_n^I - r_{nn} x_n^I)^2}_{D_n^I} + \underbrace{(f_n^Q - r_{nn} x_n^Q)^2}_{D_n^Q} \qquad (4.5)$$
In the first quadrant of the constellation (in which the IP and QP parts are both non-negative),
we divide the IP space into $\sqrt{W} - 1$ subdomains: $[0, r_{nn}), [r_{nn}, 2r_{nn}), \ldots, [(\sqrt{W} - 2) r_{nn}, \infty)$.
Each subdomain is associated with a set of $\lceil \sqrt{L} \rceil$ best values of $x_n^I$. For
example, if the modulation is 16-QAM and L = 9, the IP space is divided into the subdomains $[0, r_{nn})$,
$[r_{nn}, 2r_{nn})$, and $[2r_{nn}, \infty)$. The corresponding three best values of $x_n^I$ are (1, −1,
3), (1, 3, −1), and (3, 1, −1), respectively (refer to Figure 4.2a). The QP space and $x_n^Q$ are
treated similarly.
The L best child nodes per parent node in stage n (n = N, …, 1) are directly specified
as follows:
Step 1. Calculate $f_n$, which is defined in (4.4).
Step 2. Determine the IP subdomain that $f_n^I$ belongs to by comparing $f_n^I$ with the values
$r_{nn}, 2r_{nn}, \ldots, (\sqrt{W} - 2) r_{nn}$. From that, the $\lceil \sqrt{L} \rceil$ best values of $x_n^I$ are known.
If $f_n^I < 0$, the signs of $x_n^I$ are reversed. Then, we calculate the corresponding $D_n^I$ in
(4.5). The values $x_n^Q$ and $D_n^Q$ are found similarly (refer to Figure 4.2a).
Step 3. From the $\lceil \sqrt{L} \rceil$ best values of $x_n^I$, $D_n^I$ and $x_n^Q$, $D_n^Q$, we compute the L best values of
$x_n$ and $D_n$ in (4.5). Let $i_n$ and $q_n$ be the index numbers of the best values of
$D_n^I$ and $D_n^Q$, which are already in ascending order. The combinations of the sum
$D_n = D_n^I + D_n^Q$ are arranged so that the sum $i_n^2 + q_n^2$ increases. Consequently, the results
for $D_n$ are approximately in ascending order without sorting (refer to Figure 4.2b).
To expand the L best child nodes from a parent node, previous works such as [32] first
find the center node by rounding the result $x_c = f_n / r_{nn}$ and then seek the L nodes nearest
to the center node. A divider is thus required. By comparing as in Step 2, the proposed
algorithm eliminates the divider $f_n / r_{nn}$. Furthermore, by using (4.5), the L values of $D_n$ are
obtained from $\lceil \sqrt{L} \rceil$ values each of $D_n^I$ and $D_n^Q$, so the complexity of computing the Euclidean
distance $D_n$ is reduced.
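Steps 1 to 3 can be illustrated with a small software sketch. It is only an approximation of the idea under explicit assumptions: $r_{nn}$ is taken to be real and positive, the per-subdomain candidate table is generated programmatically from an interior point of each subdomain (the hardware would store it as constants), and all function names (pam_levels, build_table, expand_1d, index_order, direct_expand) are hypothetical.

```python
import math


def pam_levels(W):
    """Per-dimension levels of a square W-QAM, e.g. [-3, -1, 1, 3] for W = 16."""
    m = math.isqrt(W)
    return list(range(-(m - 1), m, 2))


def build_table(W, L):
    """Best ceil(sqrt(L)) levels for each subdomain [s*r, (s+1)*r) of |f|."""
    c = math.ceil(math.sqrt(L))
    levels = pam_levels(W)
    table = []
    for s in range(math.isqrt(W) - 1):
        u = s + 0.5                          # interior point of subdomain s, in units of r
        table.append(tuple(sorted(levels, key=lambda x: abs(x - u))[:c]))
    return table


def expand_1d(f, r, table):
    """Step 2 for one real dimension: compare |f| with r, 2r, ... (no divider)."""
    s = 0
    while s < len(table) - 1 and abs(f) >= (s + 1) * r:
        s += 1
    sign = -1 if f < 0 else 1                # reverse the signs when f < 0
    xs = [sign * x for x in table[s]]
    ds = [(f - r * x) ** 2 for x in xs]      # D^I or D^Q of (4.5)
    return xs, ds


def index_order(c):
    """Step 3: 1-based (i_n, q_n) pairs ordered by increasing i_n^2 + q_n^2."""
    pairs = [(i, q) for i in range(1, c + 1) for q in range(1, c + 1)]
    return sorted(pairs, key=lambda p: p[0] ** 2 + p[1] ** 2)


def direct_expand(f, r, table, L):
    """Return up to L (x_n, D_n) candidates for a complex f_n."""
    xi, di = expand_1d(f.real, r, table)
    xq, dq = expand_1d(f.imag, r, table)
    order = index_order(len(xi))[:L]
    return [(complex(xi[i - 1], xq[q - 1]), di[i - 1] + dq[q - 1]) for i, q in order]
```

For 16-QAM and L = 9, build_table(16, 9) reproduces the three candidate sets quoted in the text, and index_order(3) gives the combination order whose $i_n^2 + q_n^2$ values are 2, 5, 5, 8, 10, 10, 13, 13, 18.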
[Figure 4.2, panel a): the $f_n$ axis is split at $0, r_{nn}, 2r_{nn}, 3r_{nn}$ to obtain $D_n^I(1), D_n^I(2), D_n^I(3)$ and $D_n^Q(1), D_n^Q(2), D_n^Q(3)$. Panel b): the sums $D_n = D_n^I + D_n^Q$ over $i_n, q_n \in \{1, 2, 3\}$ are taken in the order $d_n = 1, \ldots, 9$, for which $i_n^2 + q_n^2 = 2, 5, 5, 8, 10, 10, 13, 13, 18$.]
Figure 4.2: Direct expansion: a) compute $D_n^I$, $D_n^Q$; and b) compute $D_n$
4.3.2 Parent Node Grouping
It is important to choose the number of child nodes per parent node (L) properly.
If L is large, the BER performance improves, but the detection's complexity
also increases. If L is too small, the BER performance may be too poor to fulfill the
system requirement.
Notice that once the L best child nodes are directly specified as described in Sect.
4.3.1, if L > K, there is no chance that any of the last L − K child nodes of a parent
will be selected. Thus, selecting L ≤ K reduces the complexity
without sacrificing performance.
From another aspect, let k and c be the index numbers of the K parent nodes
($PED_{n+1}$) and of the L child nodes ($D_n$) per parent node in stage n, respectively. Because
the values of $PED_{n+1}$ are already sorted in stage n + 1, a parent node with a high index k
has a large value of $PED_{n+1}$. Thus, its child nodes are expected to have a low probability of
being selected as one of the K smallest (best) nodes for the next stage. To verify this analysis,
we ran a simulation and computed the probability (in %) that a child node
becomes one of the K best nodes. We used the 4 × 4 IEEE 802.11ac simulator, channel D,
256-QAM modulation, stage 3 of the MIMO detection, 148,000 data samples, K = 21, and
L = 9. The result is shown in Figure 4.3. From this figure, it can be seen that the larger the
index k is, the fewer child nodes may be selected.

         c = 1  c = 2  c = 3  c = 4  c = 5  c = 6  c = 7  c = 8  c = 9
k = 1 99.3 96.2 90.7 84.2 73.2 58.1 42.1 27.4 19.2
k = 2 98.7 95 83.8 69.1 29.7 16.1 6.6 2.6 1.3
k = 3 98.1 93.7 75.9 48.2 8.7 2.8 0.6 0.2 0.1
k = 4 94.9 85.1 48.6 19.5 1.8 0.6 0.2 0.1 0
k = 5 84.7 64.2 21.6 6.8 0.6 0.2 0.1 0 0
k = 6 65.8 36.6 7.2 2.2 0.2 0.1 0 0 0
k = 7 47.9 20.7 2.7 0.8 0.1 0 0 0 0
k = 8 31.2 10 1.1 0.3 0 0 0 0 0
k = 9 23 6.1 0.6 0.2 0 0 0 0 0
k = 10 15.1 3.1 0.4 0.1 0 0 0 0 0
k = 11 11.7 2 0.2 0.1 0 0 0 0 0
k = 12 9.8 1.6 0.2 0.1 0 0 0 0 0
k = 13 8.2 1.2 0.2 0.1 0 0 0 0 0
k = 14 6.8 0.9 0.2 0.1 0 0 0 0 0
k = 15 5.4 0.6 0.1 0.1 0 0 0 0 0
k = 16 4.8 0.6 0.1 0.1 0 0 0 0 0
k = 17 3.9 0.5 0.1 0 0 0 0 0 0
k = 18 3.4 0.4 0.1 0.1 0 0 0 0 0
k = 19 2.9 0.4 0.1 0.1 0 0 0 0 0
k = 20 2.6 0.3 0.1 0.1 0 0 0 0 0
k = 21 2.4 0.3 0.1 0.1 0 0 0 0 0
Figure 4.3: Probability (in %) that a child node may be selected as one of K best nodes

Based on that fact, we propose a parent node grouping method as follows: The K parent
nodes are divided into G groups. Each group has A = K/G parent nodes. Note that K and
G should be selected so that K is divisible by G (i.e., mod(K, G) = 0). Group 1 contains
the best parent nodes, while group G contains the worst parent nodes. Each parent node of
the gth (g = 1, 2, …, G) group is expanded into $L_g$ child nodes, so that $L_G < \cdots < L_1 \le K$.
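The bookkeeping of the grouping can be written down in a few lines. The sketch below is illustrative; the function name parent_grouping and its return layout are assumptions. It returns A, the total number of child nodes per row of groups B (the sum of all $L_g$), the total number of child nodes C = A × B, and the number of children expanded from each of the K sorted parents.

```python
def parent_grouping(K, Ls):
    """Ls = (L_1, ..., L_G); returns (A, B, C, children_per_parent)."""
    G = len(Ls)
    if K % G != 0:
        raise ValueError("K must be divisible by G")
    if Ls[0] > K or any(a <= b for a, b in zip(Ls, Ls[1:])):
        raise ValueError("require L_G < ... < L_1 <= K")
    A = K // G                                    # parents per group
    B = sum(Ls)                                   # children of one parent from every group
    C = A * B                                     # total number of child nodes
    per_parent = [Ls[k // A] for k in range(K)]   # parent k (best first) gets L_g children
    return A, B, C, per_parent
```

With K = 21 and (L1, L2, L3) = (4, 3, 1), this gives A = 7, B = 8, and C = 56 child nodes, compared with K × L = 189 when every parent is expanded into L = 9 children.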
4.3.3 Two-Dimensional Sorter (2D Sorter)
Sorting is the major bottleneck of K-best detection because of its high complexity.
Theoretically, sorting n elements requires $(n^2 - n)/2$ comparators.
In this subsection, we propose a two-dimensional (2D) sorter that has low complexity,
is suitable for hardware resource sharing, and produces an approximate result. The 2D sorter
for sorting $C = \sum_{g=1}^{G} A L_g$ child nodes is described as follows: we put the C child nodes
into an A × B matrix, in which $B = \sum_{g=1}^{G} L_g$. The jth row of the matrix contains all the
child nodes of the jth parent of every group. The case G = 3 is illustrated in
Figure 4.4. The matrix is processed by row sorting followed by column
sorting:
• Row sorting. The B elements in a row are sorted so that the smallest value is located at the
left of the row. This sorting is repeated for all rows.
• Column sorting. The A elements in a column are sorted so that the smallest value is located
at the top of the column. This sorting is repeated for all columns.
[Figure 4.4 (diagram): in stages N and N − 1, the K parent nodes are expanded into C child nodes (1 … L1, 1 … L2, or 1 … L3 per parent, depending on the group); the 2D sorter arranges them in a matrix and applies a row sort followed by a column sort.]
Figure 4.4: Stages N and N − 1 of the proposed 2D sorter-based K-best
a) Initial value
1 2 3 2 5 2
2 3 4 3 5 3
1 3 5 2 3 6
3 4 6 2 3 1
b) After row sorting
1 2 2 2 3 5
2 3 3 3 4 5
1 2 3 3 5 6
1 2 3 3 4 6
c) 2D sorter's result (after column sorting)
1 2 2 2 3 5
1 2 3 3 4 5
1 2 3 3 4 6
2 3 3 3 5 6
Figure 4.5: An example of 2D sorter operation
Figure 4.5 shows an example of the 2D sorter operation.
After completing the row and column sorting, the K top-left elements of the sorted
matrix are expected to be the best (smallest) values and are selected. Because the result is
approximate, a simulation is needed in advance to determine which positions hold the best candidates.
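The two sorting passes can be sketched as follows. This is a software illustration only (the function name `sort_2d` is hypothetical, and Python's built-in `sorted` stands in for the comparator networks); the input is the 4 × 6 example of Figure 4.5.

```python
def sort_2d(matrix):
    """Row sorting followed by column sorting, smallest value to the left / top."""
    rows = [sorted(row) for row in matrix]             # row sorting
    cols = [sorted(col) for col in zip(*rows)]         # column sorting on the row-sorted matrix
    return [list(row) for row in zip(*cols)]           # transpose back to row-major order


initial = [
    [1, 2, 3, 2, 5, 2],
    [2, 3, 4, 3, 5, 3],
    [1, 3, 5, 2, 3, 6],
    [3, 4, 6, 2, 3, 1],
]
result = sort_2d(initial)
```

The result is not a full sort of the 24 values: it only guarantees that every row and every column is non-decreasing, which is why the positions of the K best candidates have to be fixed by simulation.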
To verify the 2D sorter, we ran a simulation and measured the
probability (in %) that an element of the sorted matrix is one of the actual
K = 7, K = 14, and K = 21 best nodes. The simulation parameters are: 802.11ac
simulator, channels B and D, 256-QAM, 148,000 data samples, G = 3, L1 = 4, L2 = 3, and
L3 = 1. The results are shown in Figure 4.6. From these results, the positions of the 1st to
the 7th (yellow), 8th to the 14th (green), and 15th to the 21st (blue) best
nodes are determined one by one. The figure also shows that the obtained probabilities are
slightly affected by the channel type. However, the influence is so small that the positions
of the best nodes do not depend on the channel type.
The 2D sorter is suitable for hardware resource sharing because all the rows (columns)
perform the same task. A circuit which sorts the B elements of the 1st row in the 1st cycle can be
reused to sort the 2nd, ..., Ath rows in the 2nd, ..., Ath cycles.
4.4 The Proposed Hardware Architecture
4.4.1 Overview Architecture
To evaluate the effectiveness of the proposed algorithm in practice, we develop a 4 × 4
2D sorter-based K-best MIMO detection for 802.11n and 802.11ac systems. The detection
supports five modulation types: BPSK, QPSK, 16-QAM, 64-QAM, and 256-QAM.
After exhaustive simulation and considering the trade-off between BER performance
and complexity, the detection is configured as follows:
• At all stages, we select K = 21, G = 3, A = K/G = 7, L1 = 4, L2 = 3, L3 =
1, B = 8, and C = 56. In the cases of 16-QAM, QPSK, and BPSK, which have
W < K, the numbers of parent nodes of stages 3, 2, and 1 (denoted by K3, K2,
and K1, respectively) are selected as follows: with 16-QAM mode, K3 = 14 and
K2 = K1 = 21; with QPSK mode, K3 = 4, K2 = 14, and K1 = 21; and with BPSK
mode, K3 = 2, K2 = 4, and K1 = 8.
a.1) Channel B, K = 7
100 98.2 76.4 42.2 0 0 0 0
100 83.4 19.3 0 0 0 0 0
86.2 14.5 0 0 0 0 0 0
49.4 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0
7.6 0 0 0 0 0 0 0
1.8 0 0 0 0 0 0 0
a.2) Channel B, K = 14
100 100 99.8 85.5 2.9 0.4 0.1 0
100 100 94.9 56.7 0.1 0 0 0
100 100 67.6 18.5 0 0 0 0
99.9 90.9 11.8 0 0 0 0 0
77.1 29.5 0 0 0 0 0 0
41.7 4 0 0 0 0 0 0
18.3 0.2 0 0 0 0 0 0
a.3) Channel B, K = 21
100 100 100 99.9 30.7 7.7 1.8 0.1
100 100 100 97.4 7.1 0.8 0.1 0
100 100 99.9 81.7 0.5 0 0 0
100 100 95.7 51.8 0 0 0 0
99.9 98.8 64.2 14 0 0 0 0
98.4 69.8 4.6 0 0 0 0 0
57.7 17.1 0.1 0 0 0 0 0
b.1) Channel D, K = 7
100 98.1 72.7 37.1 0 0 0 0
100 83.1 16.7 0 0 0 0 0
88.6 16 0 0 0 0 0 0
54.2 0 0 0 0 0 0 0
23.3 0 0 0 0 0 0 0
8.3 0 0 0 0 0 0 0
1.9 0 0 0 0 0 0 0
b.2) Channel D, K = 14
100 100 99.7 82.9 3.3 0.5 0.1 0
100 100 94.1 52.3 0.1 0 0 0
100 100 63.8 15.7 0 0 0 0
99.9 91.3 10.9 0 0 0 0 0
80.2 32.3 0 0 0 0 0 0
46.6 4.6 0 0 0 0 0 0
21.2 0.3 0 0 0 0 0 0
b.3) Channel D, K = 21
100 100 100 99.8 34.6 8.5 2 0.1
100 100 100 96.9 8.1 0.8 0.1 0
100 100 99.9 79.1 0.6 0 0 0
100 100 95.3 47.6 0 0 0 0
99.9 98.7 61.3 12.8 0 0 0 0
98.2 70.8 4.8 0 0 0 0 0
60.5 19.4 0.2 0 0 0 0 0
Notes: cell colors mark the positions of the 1st to 7th, 8th to 14th, and 15th to 21st best nodes.
Figure 4.6: Probability (in %) that an element of the sorted matrix becomes one of the
actual K best nodes
• Stage 4 does not use the sorter, while stages 3 and 2 use the proposed 2D sorter with
a matrix size of 7 × 8.
• To achieve high detection performance, soft decision is implemented, in which the
C = 56 values of $PED_1$ are used for calculating the log likelihood ratio (LLR).
The LLR of the ith bit of $x_n$ (denoted as $b_{ni}$) is calculated as (4.6).
\[ LLR(b_{ni}) = LLR_0(b_{ni}) - LLR_1(b_{ni}) \tag{4.6} \]
where $LLR_0(b_{ni})$ and $LLR_1(b_{ni})$ denote the smallest value of $PED_1$ when $b_{ni}$ is
supposed to be zero and one, respectively.
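\[ PED_4 = \underbrace{|z_4^I - r_{44} x_4^I|^2}_{DI_4} + \underbrace{|z_4^Q - r_{44} x_4^Q|^2}_{DQ_4} \tag{4.7} \]
\[ f_3 = z_3 - r_{34} x_4 \tag{4.8} \]
\[ PED_3 = PED_4 + \underbrace{|f_3^I - r_{33} x_3^I|^2}_{DI_3} + \underbrace{|f_3^Q - r_{33} x_3^Q|^2}_{DQ_3} \tag{4.9} \]
\[ f_2 = z_2 - r_{23} x_3 - r_{24} x_4 \tag{4.10} \]
\[ PED_2 = PED_3 + \underbrace{|f_2^I - r_{22} x_2^I|^2}_{DI_2} + \underbrace{|f_2^Q - r_{22} x_2^Q|^2}_{DQ_2} \tag{4.11} \]
\[ f_1 = z_1 - r_{12} x_2 - r_{13} x_3 - r_{14} x_4 \tag{4.12} \]
\[ PED_1 = PED_2 + \underbrace{|f_1^I - r_{11} x_1^I|^2}_{DI_1} + \underbrace{|f_1^Q - r_{11} x_1^Q|^2}_{DQ_1} \tag{4.13} \]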
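The recursion in (4.7) to (4.13) can be written as a short loop over the stages. The sketch below is floating-point software, not the fixed-point hardware; the function name `ped_stages` and the numerical values of R, x, and z are hypothetical, and the diagonal entries of R are assumed to be real, as the equations imply.

```python
def ped_stages(z, R, x):
    """Return [PED_4, PED_3, PED_2, PED_1] for one candidate vector x.

    z, x: lists of 4 complex values (0-based, so x[3] is x_4);
    R: 4 x 4 upper-triangular matrix with a real diagonal (assumed).
    """
    n = len(x)
    ped = 0.0
    peds = []
    for i in range(n - 1, -1, -1):                     # stage 4 first, stage 1 last
        # f_i = z_i - sum_{j > i} r_ij * x_j, cf. (4.8), (4.10), (4.12)
        f = z[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        r_ii = R[i][i].real
        d_i = (f.real - r_ii * x[i].real) ** 2         # DI_i
        d_q = (f.imag - r_ii * x[i].imag) ** 2         # DQ_i
        ped += d_i + d_q                               # PED_i = PED_{i+1} + DI_i + DQ_i
        peds.append(ped)
    return peds


# Hypothetical example data (not taken from the thesis).
R = [
    [2.0, 1 + 1j, 0.5j, 1.0],
    [0.0, 1.5, 1 - 0.5j, 0.5],
    [0.0, 0.0, 1.0, 0.25 + 0.25j],
    [0.0, 0.0, 0.0, 0.5],
]
x = [1 + 1j, -1 + 3j, 3 - 1j, -3 - 3j]
z = [0.5 - 1j, 2 + 0.5j, -1 + 1j, -1.2 - 1.7j]
peds = ped_stages(z, R, x)
```

The last element, PED_1, equals the full squared Euclidean distance between z and Rx, and the partial distances never decrease from stage 4 to stage 1, which is what allows pruning at every stage.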
The configuration and the overview hardware architecture of the detection are shown in
Figure 4.7a and b, respectively. The ‘STAGE 4’ block computes the K best values of $PED_4$ and the
corresponding $x_4$ in (4.7). Similarly, the ‘STAGE 3’ block computes the K best values of $PED_3$
and the corresponding $\{x_4, x_3\}$ in (4.9). The ‘STAGE 2’ block computes the K best values of
$PED_2$ and the corresponding $\{x_4, x_3, x_2\}$ in (4.11). The ‘STAGE 1’ block computes the C best
values of $PED_1$ and the corresponding $\{x_4, x_3, x_2, x_1\}$ in (4.13). The ‘LLR’ block computes
the log likelihood ratio. The ‘Multiplier-Less’ block prepares the necessary data so that the
above-mentioned blocks can eliminate the highly complex multipliers.
Figure 4.7: a) Detection’s configuration (K = 21, G = 3, L1 = 4, L2 = 3, L3 = 1, C = 56); and b) the corresponding overview hardware
architecture
4.4.2 Hardware Implementation
To achieve low complexity, in addition to using the proposed algorithm, the following
implementation points are worth noting.
GAIN-MUX-based Multiplier
From (4.7) to (4.13), it can be seen that the detection requires a large number of multipliers
to compute $r_{ij} x_j$ ($i = 4, 3, 2, 1$; $j \ge i$). For example, $2\sqrt{K}$ multipliers are needed to compute
$r_{44} x_4^I$ and $r_{44} x_4^Q$ in stage 4 (see (4.7)), and each multiplier costs a large amount of hardware resources.
To compute $r_{ij} x_j$, instead of using multipliers, we implement
a GAIN block and multiplexers (MUX) as shown in Figure 4.8. This figure illustrates the case
of multiplying $r_{ij}$ by the m best values of $x_j$. The left part shows the conventional method,
which uses m different multipliers. The right part shows our proposed GAIN-MUX-based
multiplier. The input $r_{ij}$ first goes into the ‘GAIN’ block, which amplifies $r_{ij}$ by the
modulation gain D and then by the values of the constellation nodes 1, 3, 5, ..., 15.
Notice that all the possible values of $x_j$ are {D, 3D, ..., 15D}. The outputs of the ‘GAIN’
block are then fed to m MUX blocks. Each MUX is controlled by a select signal of
$x_j$ (denoted by $sel\_x_j^{(m)}$). If the values of $x_j^{(m)}$ are {D, 3D, ..., 15D}, the values of $sel\_x_j^{(m)}$ are
{0, 1, ..., 7}. Consequently, the outputs of the MUX blocks are equivalent to the outputs of the
multipliers in the left part, while the hardware cost of a MUX is much smaller than that
of a multiplier.
The detection needs multipliers to compute many products, such as $r_{44} x_4^I$, $r_{33} x_3^I$, and $r_{22} x_2^I$,
while the possible values of $x_4^I$, $x_3^I$, and $x_2^I$ are the same. Thus, one ‘GAIN’ block can be shared
among them. The ‘Multiplier-Less’ block implements this ‘GAIN’ block.
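The GAIN-MUX idea can be illustrated in software as a small lookup: the eight scaled copies of $r_{ij}$ are produced once, and each of the m products is then obtained by selection only. The function names below are hypothetical, and the handling of negative constellation values (the sign) is left out, as in the description above.

```python
ODD_LEVELS = (1, 3, 5, 7, 9, 11, 13, 15)   # constellation magnitudes in units of D


def gain_block(r, D):
    """'GAIN': compute r*D, 3*r*D, ..., 15*r*D once for a given r."""
    return [r * D * level for level in ODD_LEVELS]


def gain_mux_products(r, D, selects):
    """One MUX per candidate: pick the pre-computed product indexed by sel (0..7)."""
    gains = gain_block(r, D)
    return [gains[sel] for sel in selects]


# Three candidates with x = D, 7D, and 15D, i.e. select signals 0, 3, and 7.
products = gain_mux_products(0.75, 0.5, [0, 3, 7])
```

However large m is, the number of scalings stays at eight per $r_{ij}$; only the number of selections grows with m.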
Resource Sharing
This technique is implemented in the STAGE 4, STAGE 3, STAGE 2, and STAGE 1 blocks.
The STAGE 4 block computes the K best values of $PED_4$ and $x_4$ in (4.7). Based on the
direct expansion method, it finds the $\lceil\sqrt{21}\rceil = 5$ best values of $DI_4$ and $DQ_4$ and then adds
these values together. Because the processes of finding $DI_4$ and $DQ_4$ are similar to each
other, they share the same circuit. Figure 4.9a shows the block diagram of STAGE 4, in
which ‘BLOCK A’ is shared to find the best values of $DI_4$, $x_4^I$ and $DQ_4$, $x_4^Q$ in two clock cycles.
In other words, the sharing factor of this block is 2. The design of BLOCK A is shown in
Figure 4.9b, in which the ‘SIGN ABS’ block determines the sign and absolute value of $z_4^I$
(and $z_4^Q$). The ‘CONS-LOCAT’ block specifies the subdomain in the constellation that $|z_4^I|$
(and $|z_4^Q|$) belongs to. Based on the information from the CONS-LOCAT block, the ‘DI/DQ CAL’
block computes the best values of $DI_4$ and $DQ_4$, while the ‘XDE-CODE’ block finds the
best values of $x_4^I$ and $x_4^Q$.
Figure 4.8: Conventional multiplier versus GAIN-MUX-based multiplier
The STAGE 3 block computes the best values of $PED_3$ and the corresponding $\{x_4, x_3\}$ in
(4.9). The block diagram is shown in Figure 4.9c, in which ‘B1,’ ‘B2,’ and ‘B3’ respectively
perform the direct expansion for the 1st to the 7th (i.e., group 1), the 8th to the 14th (i.e.,
group 2), and the 15th to the 21st (i.e., group 3) parent nodes. Because all parent nodes in
the same group are processed in the same way, they can share the same circuit. Consequently, the B1
block is designed to find the L1 best child nodes of one parent node only. It is then reused over
seven clock cycles to complete the direct expansion for the seven parent nodes of group 1, i.e., its
sharing factor is 7. Similarly, the B2 and B3 blocks are each shared seven times. Each of the B1,
B2, and B3 blocks has the following components: ‘CAL f3’ computes $f_3$ in (4.8), ‘BLOCK
A*’ computes $DI_3$ and $DQ_3$, and ‘SUM’ computes $PED_3$ from $PED_4$, $DI_3$, and $DQ_3$ (see
(4.9)).
Figure 4.9: Block diagram of ‘STAGE 4’ and ‘STAGE 3’
After each clock cycle, B1, B2, and B3 output the best child nodes of one parent node
in each group. In other words, all elements of one row of the sort matrix (see Sect. 4.3.3)
are obtained per cycle. The ‘2D-SORT’ block thus requires only a one-row sorting circuit.
This circuit is shared to sort all seven rows in seven clock cycles, so its sharing factor
is 7. The hardware design of the 2D-SORT block is shown in Figure 4.10. The ‘ROW-SORT’
block sorts the eight outputs of B1, B2, and B3 per clock cycle, and only the four best values are
kept. In ‘COL-SORT’, the ‘1to7’ block collects the best values from ROW-SORT over seven
cycles and sorts them. The ‘1to6’ block collects the 2nd best values from ROW-SORT over
seven cycles, sorts them, and obtains the six best values, and so on. The designs of ROW-SORT and
of ‘1to3’ of COL-SORT are shown in Figure 4.10b and c, respectively. The
2D-SORT block needs only 36 comparators to sort 56 child nodes, which is a significant reduction
compared with the $(56^2 - 56)/2 = 1,540$ comparators of a full sort.
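The data flow of the 2D-SORT block can be mimicked in software as below. This is a behavioural sketch only: the per-column counts (7, 6, 5, 3) are read off the ‘1to7’, ‘1to6’, ‘1to5’, and ‘1to3’ blocks of Figure 4.10, the function names are hypothetical, and `sorted` replaces the comparator networks, so the sketch says nothing about the 36-comparator count.

```python
def select_best_21(rows, keep=(7, 6, 5, 3)):
    """rows: 7 lists of 8 PED values, one list per clock cycle.

    Returns 21 values that approximate the 21 smallest of the 56 inputs.
    """
    # ROW-SORT: sort each row and keep only its 4 smallest values.
    row_sorted = [sorted(row)[:len(keep)] for row in rows]
    selected = []
    # COL-SORT: the c-th column is sorted and its keep[c] smallest values are kept.
    for c, count in enumerate(keep):
        column = sorted(row[c] for row in row_sorted)
        selected.extend(column[:count])
    return selected


def full_sort_comparators(n):
    """Comparator count of a full sort of n elements, (n^2 - n) / 2."""
    return (n * n - n) // 2


# Hypothetical PED values: 7 rows of 8 distinct integers.
rows = [[(13 * (8 * j + i)) % 59 for i in range(8)] for j in range(7)]
best = select_best_21(rows)
```

The global minimum is always selected (it is the smallest row minimum), while the remaining 20 outputs are only likely, not guaranteed, to be among the true 21 best, in line with the probabilities of Figure 4.6.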
The architectures of STAGE 2 and STAGE 1 are similar to STAGE 3. The sharing
factor of these blocks is 7. However, the 2D-SORT block is not implemented in STAGE 1.
Instead, the results of B1, B2, and B3 are directly passed to the LLR block.
Figure 4.10: The design of ‘2D-SORT’ block
Figure 4.11: The overview of ‘LLR’ block
Bit-Shift-based LLR Computation
To compute $LLR(b_{ni})$ in (4.6), both $LLR_0(b_{ni})$ and $LLR_1(b_{ni})$ must be known.
However, a K-best based detection considers only a subset of the constellation nodes,
not all of them, when computing the LLR.
This leads to a problem: the information bit corresponding to the selected nodes may be
zero (or one) for all of them, in which case $LLR_1(b_{ni})$ (or
$LLR_0(b_{ni})$) cannot be computed. For these cases, we propose a bit-shift-based LLR computation
method that computes $LLR_1(b_{ni})$ and $LLR_0(b_{ni})$ as (4.14) and (4.15), respectively.
\[ LLR_1(b_{ni}) = 2^c \, LLR_0(b_{ni}) \tag{4.14} \]
\[ LLR_0(b_{ni}) = 2^c \, LLR_1(b_{ni}) \tag{4.15} \]
In (4.14) and (4.15), c can be any positive integer, i.e., c = 1, 2, .... To compute (4.14)
and (4.15), we simply shift the input data $LLR_0(b_{ni})$ (or $LLR_1(b_{ni})$) to the left by c bits. The
bit-shift operation does not cost hardware resources.
The overview of the ‘LLR’ block is shown in Fig. 4.11, where ‘BIT i-Xn’ (i = 0, 1, ..., 7;
n = 1, ..., 4) computes $LLR(b_{ni})$ in (4.6). Equations (4.14) and (4.15) are used if the
$b_{ni}$ values of the selected nodes are all zero or all one, respectively.
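A minimal sketch of (4.6) with the fallbacks (4.14) and (4.15) is given below. The function name `bit_llr`, the integer PED values, and the choice c = 2 are assumptions made for the illustration; the candidate list stands for the C values of $PED_1$ with their associated bits.

```python
def bit_llr(candidates, bit_index, c=2):
    """candidates: list of (ped, bits), ped a non-negative integer, bits a tuple of 0/1.

    Returns LLR = LLR0 - LLR1 as in (4.6); a missing side is replaced by a left shift.
    """
    peds0 = [ped for ped, bits in candidates if bits[bit_index] == 0]
    peds1 = [ped for ped, bits in candidates if bits[bit_index] == 1]
    if not peds0 and not peds1:
        raise ValueError("no candidates")
    if peds0 and peds1:
        llr0, llr1 = min(peds0), min(peds1)
    elif peds0:                      # every candidate has bit = 0: use (4.14)
        llr0 = min(peds0)
        llr1 = llr0 << c
    else:                            # every candidate has bit = 1: use (4.15)
        llr1 = min(peds1)
        llr0 = llr1 << c
    return llr0 - llr1


# Hypothetical candidates: bit 0 is zero everywhere, bit 1 takes both values.
cands = [(5, (0, 1)), (9, (0, 0)), (12, (0, 1))]
llr_bit0 = bit_llr(cands, 0)
llr_bit1 = bit_llr(cands, 1)
```

A larger c makes the missing hypothesis look less likely, so the decoder receives a more confident soft value for that bit.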
The hardware design of the ‘LLR’ block was done by our partner, Ms. Hongyo Reina, in her
graduation thesis [52].
4.5 BER Performance Comparison
Our simulation parameters are: 4 × 4 MIMO 802.11ac simulator, one user, channel D,
80 MHz bandwidth, binary convolutional code (BCC) error correction, 5000 packets, and
$2.5 \times 10^6$ bytes of data in total.
4.5.1 QRD versus SQRD
Figure 4.12 shows the BER performance of the 802.11ac system with 256-QAM
modulation and full K-best detection. The figure shows that SQRD pre-processing
improves the BER performance by 0.6 dB and 0.8 dB (at BER = $10^{-3}$) and by 1 dB (at BER = $10^{-2}$)
for K = 21, K = 10, and K = 6, respectively, as compared to
QRD.
Figure 4.12: BER of 802.11ac system, QRD versus SQRD (BER versus SNR [dB]; curves for K = 6, 10, and 21, each with QRD and SQRD)
4.5.2 Parent Node Grouping
Figure 4.13 shows the BER performance of the 802.11ac system with 256-QAM and
the proposed MIMO detection, in which K = 21, G = 3, and SQRD is used. The figure shows
that the BER performance is only slightly degraded when ‘L1-L2-L3’ is decreased from
256-256-256 (full K-best) to ‘9-9-9’, ‘9-6-3’, ‘4-4-4’, and ‘4-3-1’. Numerically, the performance
degradation of ‘L1-L2-L3 = 4-3-1’ is about 0.15 dB as compared to the full K-best
(at BER = $10^{-3}$). However, when the number of child nodes per
parent node is further reduced to ‘L1-L2-L3 = 1-1-1’, the performance degradation is about 1.2 dB, which is
considerable, as compared to the full K-best.
Figure 4.13: BER of 802.11ac system, parent node grouping (BER versus SNR [dB]; curves for L1-L2-L3 = 1-1-1, 4-3-1, 4-4-4, 9-6-3, 9-9-9, and full K-best)
4.5.3 2D Sorter
Figure 4.14 shows the BER performance of the 802.11ac system with 256-QAM and the
proposed MIMO detection, in which K = 21, G = 3, L1 = 4, L2 = 3, L3 = 1, and SQRD is used.
The terms ‘S4 FullSort’, ‘S4 NoSort’, ‘S32 FullSort’, and ‘S32 2D Sort’ denote that a full
sorter is used in stage 4, no sorter is used in stage 4, full sorters are used in stages 3 and 2,
and 2D sorters are used in stages 3 and 2, respectively.
The figure shows that the BER performance of S4 NoSort - S32 2D Sort is only slightly
degraded as compared to the case of S4 FullSort - S32 FullSort. The
degradation is about 0.08 dB (at BER = $10^{-3}$). In other words, 1) by applying the direct
expansion method, the sorter can be eliminated in stage 4, and 2) the 2D sorter is an acceptable
approximation of the full sorter, at the cost of about 0.08 dB in BER
performance.
Figure 4.14: BER of 802.11ac system, 2D Sorter (BER versus SNR [dB]; curves for S4 FullSort - S32 FullSort, S4 FullSort - S32 2D Sort, S4 NoSort - S32 FullSort, and S4 NoSort - S32 2D Sort)
4.5.4 The Proposed Detection
Figures 4.15 and 4.16 show the BER and PER of the 802.11ac system for various
MIMO detection schemes: BLAST MMSE (soft decision), LRA-MMSE (soft
decision), full K-best (soft decision), and the proposed detection (soft and hard decisions).
From Fig. 4.15, it can be seen that for all modulation types (16-QAM, 64-QAM, and
256-QAM), the proposed detection with soft decision (green line) outperforms the BLAST
MMSE (blue line) and LRA MMSE (black line), and is close to the full K-best with soft
decision (red line). Numerically, at BER = $10^{-3}$, the proposed
detection (with soft decision) is better than BLAST MMSE by 6.7, 3.7, and 2.3 dB for
16-QAM, 64-QAM, and 256-QAM, respectively. It is better than LRA MMSE by 1, 0.5, and 0.02 dB, respectively. As compared
to the full K-best, the BER performance degradation of the proposed detection is about 0.2 dB
in all cases. In addition, soft decision improves the performance of the proposed
detection by about 2 dB as compared to hard decision (green line versus pink line).
From Fig. 4.15, we also see that the BER performance gap between the proposed
detection (soft decision) or the full K-best on one side and the LRA MMSE or the BLAST MMSE on the other
decreases when the modulation order increases from 16-QAM to 64-QAM and to 256-QAM.
That is because the modulation size increases while the K value is fixed to 21. Consequently,
the BER performance of the proposed detection and of the full K-best is expected
to degrade as the modulation size increases. Notice that in the cases of BPSK and QPSK, the
proposed detection searches all of the constellation nodes; it thus achieves the same BER
as the optimal MLD does.
A similar trend appears in Fig. 4.16: the PER performance of the proposed detection
(with soft decision) is close to the full K-best in all cases, better than BLAST MMSE and
LRA MMSE in the 16-QAM and 64-QAM cases, and approximately equal to LRA MMSE in the 256-QAM
case. If we increase the K value, both the BER and PER performance of the proposed
detection are expected to become better than LRA MMSE in the 256-QAM case.
Figure 4.15: BER of 802.11ac system, various MIMO detection schemes (BER versus SNR [dB]; 16-QAM, 64-QAM, and 256-QAM, each with rate 3/4; curves for full K-best, BLAST MMSE, LRA MMSE, and the proposed decoder with soft and hard decisions)
Figure 4.16: PER of 802.11ac system, various MIMO detection schemes (PER versus SNR [dB]; 16-QAM, 64-QAM, and 256-QAM; curves for BLAST MMSE, LRA MMSE, full K-best with soft decision, and the proposed detection with hard and soft decisions)
Table 4.1: Total visited nodes of 4 × 4 MIMO K-best
Algorithm   | Full K-best | [30], 2008 | [31], 2012 | [32], 2013 | [32], 2013 | Proposed
Modulation  | 256-QAM     | 256-QAM    | 256-QAM    | 64-QAM     | 64-QAM     | 256-QAM
K value     | 21          | 26         | 26         | 10         | 21         | 21
Total nodes | 16384       | 1024       | 1004       | 189        | 387        | 189
4.6 Complexity Comparison
Due to the application of the direct expansion method, the number of search candidates (or
visited nodes) of the proposed detection is no longer affected by the constellation size. It is
affected by K, $L_g$ (g = 1, ..., G), and N only.
Numerically, we compare the complexity of the proposed algorithm with the previous
works in terms of the total number of visited nodes (abbreviated as ‘total nodes’) in Table 4.1. All
the compared algorithms are configured as 4 × 4 MIMO detection (N = 4). The data of
[30] and [31] are obtained from their papers. The data of [32] are calculated by ourselves
from the description of the algorithm. To the best of our knowledge, this algorithm needs to visit
$(\sqrt{W} + K + 1) + 2K(RSE_{num} + CSE_{num} + 1) + K$ nodes, in which $RSE_{num} = 4$ and
$CSE_{num} = 3$ are reported to be optimal for the case of N = 4, K = 10, and W = 64
(64-QAM).
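The node counts quoted for [32] can be reproduced from the expression above. In the sketch below, `nodes_ref32` implements that expression directly; `nodes_full_kbest` is an assumption of ours (W nodes at the first stage and K·W at each of the remaining N − 1 stages) that happens to reproduce the full K-best entry of Table 4.1, and both function names are hypothetical.

```python
import math


def nodes_ref32(W, K, rse_num, cse_num):
    """Visited nodes of [32]: (sqrt(W) + K + 1) + 2K(RSE_num + CSE_num + 1) + K."""
    return (math.isqrt(W) + K + 1) + 2 * K * (rse_num + cse_num + 1) + K


def nodes_full_kbest(W, K, N):
    """Assumed count for the full K-best: W at the first stage, K*W at each later stage."""
    return W + (N - 1) * K * W


ref32_k10 = nodes_ref32(64, 10, 4, 3)        # 64-QAM, K = 10
ref32_k21 = nodes_ref32(64, 21, 4, 3)        # 64-QAM, K = 21
full_k21 = nodes_full_kbest(256, 21, 4)      # 256-QAM, K = 21
ratio_percent = 100 * 189 / full_k21         # proposed (189 nodes) versus full K-best
```

The 189 nodes of the proposed detection correspond to roughly 1.2% of the full K-best count under this assumption.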
This table shows that:
• Compared with the full K-best, the proposed algorithm visits only about
1.2% of the nodes.
• Compared with [30] and [31], the proposed algorithm visits about 5.4 times fewer
nodes, while its K value is only about 1.24 times smaller.
• The proposed algorithm visits about half as many nodes as [32] when both
use K = 21, even though the proposed one supports a higher modulation order than [32]
(256-QAM versus 64-QAM). When [32] uses K = 10 and the proposed one uses
K = 21, they visit the same number of nodes.
The comparison in Table 4.1, however, only reflects the algorithm’s complexity in terms
of total nodes. The complexity of computing the Euclidean distance of each visited node
and of sorting the nodes is not captured.
To compare the proposed detection with the previous ones thoroughly, we designed and
synthesized our detection in ASIC. The synthesis tool was Synopsys Design Vision.
The CMOS SAED 90 nm technology and the saed90nm min library were used. The applied
voltage was 1.32 V.
The ASIC synthesis results are shown and compared in Table 4.2. All the designs are
4 × 4 MIMO K-best. From this table, the contributions of the proposed detection can be seen
as follows:
High throughput. The proposed detection achieves the highest throughput among all
designs. Compared with the most recent work in [32], the proposed detection’s throughput
is about two times higher.
Low power consumption. Among all the designs, the proposed design consumes the least
power, which is about 56 mW.
Small area. Although supporting a higher modulation order (i.e., 256-QAM) and a larger K (i.e.,
K = 21) than the most recent work in [32], the proposed detection occupies less
hardware area. It needs 180 Kgates, which is almost half of [32]. Remember that the
proposed detection and [32] have the same number of visited nodes (see Table 4.1).
This is evidence of the effectiveness of the 2D sorter and of the computation method
of the direct expansion.
High normalized hardware efficiency (NHE). The proposed design obtains the highest
NHE. It is 15.2 Mbps/Kgate, which is better than [35, 37, 38], and [32] by 50.7,
29.2, 8.5, and 3.6 times, respectively.
Short latency. The proposed design has the shortest latency. It is 0.07 µs.
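The normalized throughput and NHE used in Table 4.2 follow from two one-line formulas given in the table notes. The sketch below restates them; the function names are hypothetical, and the sample inputs are the entries of [32] (1,000 Mbps, 0.13 µm, 340 Kgates) and [33] (424 Mbps, 0.25 µm, 114 Kgates).

```python
def normalized_throughput(throughput_mbps, process_nm, target_nm=90.0):
    """Scale a throughput reported at technology S (in nm) to 90 nm: throughput * S / 90."""
    return throughput_mbps * process_nm / target_nm


def nhe(throughput_mbps, process_nm, area_kgates):
    """Normalized hardware efficiency = normalized throughput (Mbps) / area (Kgates)."""
    return normalized_throughput(throughput_mbps, process_nm) / area_kgates


norm_32 = normalized_throughput(1000, 130)   # [32], 0.13 um
nhe_32 = nhe(1000, 130, 340)
norm_33 = normalized_throughput(424, 250)    # [33], 0.25 um
nhe_33 = nhe(424, 250, 114)
```

The linear scaling with feature size is the normalization stated in the table notes; other scaling rules would change the ranking only through this factor.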
Table 4.2: ASIC synthesis results of 4 × 4 MIMO K-best
Design                           | [34], 2006 | [33], 2006 | [35], 2010 | [36], 2012 | [37], 2007 | [38], 2010 | [32], 2013 | Proposed
Modulation                       | 16-QAM     | 16-QAM     | 64-QAM     | 64-QAM     | 64-QAM     | (4-64)QAM  | 64-QAM     | (2-256)QAM
K value                          | 5          | 5          | 5-64       | 10         | 64         | N/A        | 10         | 21
Method                           | Real       | Real       | Real       | Real       | Complex    | Complex    | Complex    | Complex
Process                          | 0.35 µm    | 0.25 µm    | 65 nm      | 0.13 µm    | 0.13 µm    | 0.13 µm    | 0.13 µm    | 90 nm
Hard/soft decision               | N/A        | N/A        | Hard       | Hard       | Soft       | Soft       | Hard       | Soft
f_max (MHz)                      | 100        | 132        | 158        | 282        | 270        | 198        | 417        | 590
Throughput (Mbps)                | 54         | 424        | 732-100    | 675        | 100        | 285-431    | 1,000      | 2,700
Normalized throughput (Mbps) (a) | 210        | 1,178      | 529-72     | 975        | 140        | 411-623    | 1,444      | 2,700
Area (Kgate)                     | 91         | 114        | 1,760      | 114        | 280        | 350        | 340        | 180
Power (mW)                       | 626        | N/A        | 165        | 135        | 94         | 57-74      | 1,700      | 56
NHE (Mbps/Kgate) (b)             | 2.33       | 10.3       | 0.3-0.04   | 8.5        | 0.52       | 1.18-1.79  | 4.26       | 15.2
Latency (µs)                     | 2.4        | 0.4        | N/A        | 0.6        | N/A        | N/A        | 0.36       | 0.07
(a) Normalized throughput from S technology to 90 nm = (throughput at S) × S/90.
(b) Normalized hardware efficiency (NHE) = Normalized throughput (Mbps) / Area (Kgates).
Table 4.3 compares the ASIC synthesis results of some typical MIMO detection schemes,
namely MMSE, LRA-MMSE, and the proposed detection. The LRA-MMSE needs the
lattice reduction (LR) pre-processing [26], and the proposed detection needs the SQRD
pre-processing [50]. In [24], the authors have proposed 9-step and 2-step MMSE hardware
architectures for 4 × 4 MIMO systems. The ASIC synthesis results of the two architectures
are 303 Kgates and 885 Kgates, respectively. The total hardware area of the LRA-MMSE
is obtained from the LR pre-processing in [26] and the MMSE detection in [24]. The total area of the
proposed detection includes the SQRD pre-processing in [50].
Table 4.3: ASIC synthesis results of several MIMO detection schemes
MIMO Scheme                  | MMSE (9-step) | MMSE (2-step) | LRA MMSE       | Proposed
Technology                   | 90 nm         | 90 nm         | 65 nm          | 90 nm
Pre-processing Area (Kgates) | -             | -             | 193 [26]       | 179 [50]
Detection Area (Kgates)      | -             | -             | 303 - 885 [24] | 180
Total Area (Kgates)          | 303 [24]      | 885 [24]      | 496 - 1078     | 359
From the results in Table 4.3, it
can be seen that the total area of the proposed MIMO detection (359 Kgates) is smaller than that of the
2-step MMSE (885 Kgates) and of the LRA-MMSE (496 to 1078 Kgates), and is close to that of the
9-step MMSE (303 Kgates). Although the complexity of the BLAST MMSE is not shown in
this chapter, it is expected to be more complex than MMSE.
4.7 Conclusion
In this chapter, we have proposed an algorithm and hardware design of a 2D sorter-based
K-best MIMO detection that supports up to 256-QAM. By utilizing ideas such as direct
expansion, parent node grouping, and the 2D sorter, the algorithm has been shown to
be less complex than the previous works, and its complexity is negligibly affected by the
constellation size. A prototype hardware architecture of the algorithm has been developed
to support 4 × 4 MIMO 802.11n and 802.11ac systems. Techniques such as resource
sharing, the GAIN-MUX-based multiplier, and the bit-shift-based LLR have been implemented to
further reduce the complexity.
The chapter has shown that the proposed detection outperforms the BLAST MMSE and
LRA MMSE, and is close to the full K-best in terms of BER and PER performance. The
hardware design of the detection achieves the highest throughput (2.7 Gbps), consumes the
least power (56 mW), obtains the best hardware efficiency (15.2 Mbps/Kgate), and has the
shortest latency (0.07 µs) as compared to the conventional works on K-best based MIMO
detection. This chapter has also shown that the hardware area of the proposed detection is
lower than or comparable to that of the MMSE and LRA-MMSE detection schemes.
Chapter 5
LDPC Decoder Architecture
5.1 Introduction
To improve the reliability of a wireless communication system, we should select a good
error correction code in addition to adopting a high performance MIMO detection. The
research on high performance MIMO detection has been presented in Chapter 4. To continue
the work of improving the system reliability, this chapter presents our research on a high
performance error correction decoder, i.e., the LDPC decoder.
In Chapter 2, we have shown that among various codes such as the binary convolutional code
(BCC) [41], Turbo code [42], and low density parity check (LDPC) code [43], LDPC
has been proven to achieve the best performance, close to the Shannon limit
[44], [45]. However, the hardware design of an LDPC decoder is highly complex because of the
multiple operation modes. To cope with the variety of channel conditions and packet sizes,
a wireless system supports many code rates and code lengths, respectively. A low code rate
results in a low data rate but high decoding performance; it is preferred when the
channel condition is poor. The opposite holds for a high code rate. A large code length
provides higher decoding performance but requires a more complex circuit than a small
code length does. Depending on the situation, a suitable code length and rate are used. For
example, 802.16e WiMAX supports 76 operation modes, which are combinations of 4
code rates (1/2, 2/3, 3/4, and 5/6) and 19 code lengths (ranging from 576 bits to 2304 bits),
and 802.11ac Wi-Fi supports 12 operation modes, which are combinations of four code rates (1/2,
2/3, 3/4, and 5/6) and 3 code lengths (648, 1296, and 1944) [54]. Each operation mode has
a specific check matrix H. Meanwhile, the hardware design of an LDPC decoder is
commonly dedicated to a pre-defined H. This results in a highly complicated LDPC
decoder for multiple operation modes. Furthermore, a long time is needed for developing
and debugging the decoder if the design depends heavily on the value of every entry of
H.
In this chapter, we propose a multi-mode LDPC decoder architecture in which the main
parts of the decoder are independent of the value of H. A new operation mode can be added
by simply inserting a new ROM that stores the H value into the developed decoder. Because
the decoder does not depend on the value of H, developing and debugging it are expected to
be simple and to require less time. Based on the proposed architecture, we design a min-sum
LDPC decoder for the 802.11ac system and synthesize it in ASIC. In this chapter, we
also show that the way the log likelihood ratio (LLR) of the codeword bits is quantized is an
important decision that significantly affects the error correcting performance. We then
propose an efficient quantization method.
The rest of this chapter is organized as follows. Section 5.2 presents the background of
the LDPC code and the min-sum decoding algorithm. Section 5.3 describes our proposed LDPC
decoder architecture as well as the quantization method. Section 5.4 evaluates the results in
terms of packet error rate (PER) performance and ASIC synthesis. Finally, Section 5.5 gives our
conclusion.
5.2 Background
5.2.1 LDPC Code
LDPC code is a special class of linear block codes. It was introduced by R. Gallager
[43] in 1963. It is represented by a sparse parity check matrix H, in which H is an M × N
matrix. If each entry of H is either zero or one, the code is called a binary
LDPC code. N is the length of the codeword, and M is the number of parity bits. Each row
of H represents one parity check constraint.
The LDPC code can also be described by a Tanner graph [53]. Each row of H
corresponds to a check node (CN), while each column corresponds to a variable node (VN)
in the graph. If $h_{mn} = 1$ (n = 1, ..., N; m = 1, ..., M), there is a connection between the nth
variable node ($VN_n$) and the mth check node ($CN_m$). These nodes exchange information
with each other in the decoding process. An illustration of the Tanner graph is shown in Fig. 5.1.
Figure 5.1: Check matrix H and the corresponding Tanner graph
If all the CNs (or VNs) have the same number of connections, the LDPC code is called
regular. Otherwise, the code is irregular. Irregular codes produce better decoding
performance than regular codes do.
Current wireless standards adopt structured irregular LDPC codes in which the
parity-check matrix H can be partitioned into square submatrices $P_{rc}$ (r = 1, ..., R; c =
1, ..., C) of size Z × Z. These submatrices are either null submatrices or cyclic permutations
of the identity matrix. A structured matrix H can be compactly represented by the following
$H_{BASE}$ matrix.
\[ H_{BASE} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1C} \\ p_{21} & p_{22} & \cdots & p_{2C} \\ \vdots & \vdots & \ddots & \vdots \\ p_{R1} & p_{R2} & \cdots & p_{RC} \end{pmatrix} \tag{5.1} \]
Here, $p_{rc}$ is the value that represents the submatrix $P_{rc}$: $P_{rc}$ is obtained
by cyclically shifting all rows of the Z × Z identity matrix to the right by $p_{rc}$ elements [6].
An illustration of the structured LDPC code is shown in Fig. 5.2.
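The expansion from $H_{BASE}$ to H can be sketched as follows. The encoding of a null submatrix as a negative entry is an assumption made for this illustration (the text above does not specify one), the function names are hypothetical, and the 2 × 2 base matrix is a toy example rather than a matrix from any standard.

```python
def permutation_block(Z, shift):
    """Z x Z identity matrix with every row cyclically shifted right by `shift`.

    A negative shift denotes the null (all-zero) submatrix (assumed encoding).
    """
    if shift < 0:
        return [[0] * Z for _ in range(Z)]
    return [[1 if col == (row + shift) % Z else 0 for col in range(Z)]
            for row in range(Z)]


def expand_base(h_base, Z):
    """Expand an R x C base matrix of shift values into the (Z*R) x (Z*C) matrix H."""
    H = []
    for base_row in h_base:
        blocks = [permutation_block(Z, p) for p in base_row]
        for r in range(Z):
            H.append([bit for block in blocks for bit in block[r]])
    return H


P1 = permutation_block(3, 1)              # shift by one position, Z = 3
H_toy = expand_base([[0, -1], [1, 0]], 3)
```

With Z = 3 and a shift of 1, the block has rows 0 1 0, 0 0 1, and 1 0 0, which is the permutation matrix drawn in Figure 5.2. Storing only the shift values is what makes a ROM-based description of each operation mode compact.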
Figure 5.2: Check matrix of the structured LDPC code (N = Z × C, M = Z × R; the Z = 3 examples show an identity matrix, a null matrix, and a permutation matrix)
A valid codeword x satisfies the parity-check equation (5.2).
\[ H x^T = 0 \tag{5.2} \]
Decoding received data $x = [x_1, x_2, \ldots, x_N]$ is an iterative process of checking (5.2) and
updating x as long as (5.2) is not satisfied.
5.2.2 Min-Sum Decoding Algorithm
Belief propagation (BP), also known as sum-product message passing (SPMP), is the
optimal LDPC decoding algorithm in terms of error correction capability. However, due to
its high complexity, the min-sum algorithm is commonly selected for hardware implementation.
The input data of the decoder are the log-likelihood ratios (LLRs) of the estimated
codeword bits (denoted as $\lambda = [\lambda_1, \ldots, \lambda_N]$). The min-sum algorithm runs for
at most a predefined number L of iterations. Each iteration has three main steps: CN
Update, VN Update, and Parity Check. The lth iteration operates as follows.
• CN Update: $\forall CN_m$, $1 \le m \le M$, compute:
$$\beta^{(l)}_{mn} = \min_{n' \in N(m),\, n' \ne n} \left|\alpha^{(l-1)}_{mn'}\right| \prod_{n' \in N(m),\, n' \ne n} \mathrm{sgn}\left(\alpha^{(l-1)}_{mn'}\right) \tag{5.3}$$
• VN Update: $\forall VN_n$, $1 \le n \le N$, compute:
$$\alpha^{(l)}_{mn} = \lambda_n + \sum_{m' \in M(n),\, m' \ne m} \beta^{(l)}_{m'n} \tag{5.4}$$
• Parity Check: apply a hard decision to compute $\hat{x} = [\hat{x}_1, \ldots, \hat{x}_N]$ as (5.5).
$$\hat{x}_n = \begin{cases} 0 & \text{if } \lambda_n + \sum_{m \in M(n)} \beta^{(l)}_{mn} \ge 0, \\ 1 & \text{otherwise} \end{cases} \tag{5.5}$$
If $H\hat{x}^T = 0$ or $l = L$, the computed $\hat{x}$ is output as the decoded data. Otherwise, the
(l + 1)th iteration is started.
Note that $\beta^{(0)}_{mn} = 0$ and $\alpha^{(0)}_{mn} = \lambda_n$, $\forall m, n$; N(m) is the set of n values in the mth
row of H for which $h_{mn} = 1$; and M(n) is the set of m values in the nth column of H
for which $h_{mn} = 1$. Because N(m) and M(n) depend on the check matrix H, the
computations in (5.3) and (5.4) also depend on H. Consequently, the hardware design of an
LDPC decoder is dedicated to a specific H. For wireless systems that
support several operation modes, i.e., several check matrices, the LDPC decoder circuit
becomes complicated. Depending on the required throughput and complexity, the hardware
design of the LDPC decoder can be serial, partially parallel, or fully parallel [54].
In the next section, we propose a partially parallel LDPC decoder based on the min-sum
algorithm. Most parts of our architecture are independent of the check matrix value.
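The three steps of one iteration can be sketched in software as follows. This is a plain floating-point reference sketch, not the proposed hardware: the function name, the dictionary-based message storage, and the default of 15 iterations are assumptions of this example, and every check node is assumed to connect to at least two variable nodes.

```python
def min_sum_decode(H, llr, max_iter=15):
    """Min-sum decoding following (5.3)-(5.5); returns the hard decisions."""
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # N(m)
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # M(n)
    # Initial variable-node messages equal the channel LLRs.
    vn_msg = {(m, n): llr[n] for m in range(M) for n in rows[m]}
    cn_msg = {}
    x_hat = [0 if v >= 0 else 1 for v in llr]
    for _ in range(max_iter):
        # CN Update (5.3): minimum magnitude and sign product of the other inputs.
        for m in range(M):
            for n in rows[m]:
                others = [vn_msg[(m, k)] for k in rows[m] if k != n]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                cn_msg[(m, n)] = sign * min(abs(v) for v in others)
        # VN Update (5.4) and hard decision (5.5).
        for n in range(N):
            total = llr[n] + sum(cn_msg[(m, n)] for m in cols[n])
            for m in cols[n]:
                vn_msg[(m, n)] = total - cn_msg[(m, n)]
            x_hat[n] = 0 if total >= 0 else 1
        # Parity Check: stop as soon as every row of H is satisfied.
        if all(sum(x_hat[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return x_hat


# Example matrix as read from Fig. 5.1; the last LLR has a weak, wrong sign.
H_FIG = [[1, 0, 0, 1],
         [1, 1, 0, 1],
         [0, 1, 1, 0]]
print(min_sum_decode(H_FIG, [-2.0, 3.0, 1.0, 0.5]))
```

The VN update subtracts the incoming check-node message from the full sum, which is equivalent to summing over all other check nodes as in (5.4).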
5.3 The Proposed LDPC Decoder Architecture
5.3.1 Basic Ideas
To understand the min-sum algorithm intuitively, let us assume a simple structured check
matrix H that has C = 3, R = 2, Z = 3, M = R × Z = 6, N = C × Z = 9, and the
corresponding $H_{BASE}$ given in (5.6).
$$H_{BASE} = \begin{pmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 2 & 1 \end{pmatrix} \tag{5.6}$$
The decoder must compute the values of the variable nodes $\alpha_{mn}$ and check nodes $\beta_{mn}$ at the positions
where $h_{mn} = 1$ in every iteration. For a structured LDPC code, Z variable nodes
and Z check nodes must be calculated in the $Z \times Z$ zone of submatrix $P_{rc}$ if $P_{rc}$ is not a null
matrix. For simplicity, we denote by $A^z_{rc}$ and $B^z_{rc}$ the values of the variable node
and check node, respectively, at the zth column of the $P_{rc}$ zone. The positions of the variable and check
nodes corresponding to (5.6) are shown in Fig. 5.3a and d. We can see that the positions of $A^z_{rc}$
and $B^z_{rc}$ are not fixed but depend on the value of $H_{BASE}$. If the positions of $A^z_{rc}$ and $B^z_{rc}$ were
fixed, the hardware circuit that performs CN Update and VN Update would be independent
of the $H_{BASE}$ value. For that reason, we propose the following min-sum computation method.
• Concatenate all variable nodes inside the submatrix $P_{rc}$ into a wide signal $A_{rc}$
($A_{rc} = [A^Z_{rc} \ldots A^1_{rc}]$).
• Find $A'_{rc} = [A'^{Z}_{rc} \ldots A'^{1}_{rc}]$ by circularly shifting $A_{rc}$ to the right by $p_{rc}$ values. The positions
of $A^z_{rc}$ ($1 \le z \le Z$) in the check matrix become diagonal, as in Fig. 5.3b.
• CN Update: Compute $B'^{z}_{rc}$ as (5.7).
$$B'^{z}_{rc} = \min_{1 \le c' \le C,\, c' \ne c} \left|A'^{z}_{rc'}\right| \prod_{1 \le c' \le C,\, c' \ne c} \mathrm{sgn}\left(A'^{z}_{rc'}\right) \tag{5.7}$$
Assign $B'_{rc} = [B'^{Z}_{rc} \ldots B'^{1}_{rc}]$.
• Find $B_{rc} = [B^Z_{rc} \ldots B^1_{rc}]$ by circularly shifting $B'_{rc}$ to the left by $p_{rc}$ values. By doing
so, the positions of $B^z_{rc}$ ($1 \le z \le Z$) are returned to their original ones before performing
VN Update; refer to Fig. 5.3d.
• VN Update: Compute the sum $A^z_c$ of all check nodes in the zth column of the cth
column of submatrices as (5.8). Then compute $A^z_{rc}$ as (5.9).
$$A^z_c = \lambda^z_c + \sum_{1 \le r \le R} B^z_{rc} \tag{5.8}$$
$$A^z_{rc} = A^z_c - B^z_{rc} \tag{5.9}$$
• Parity Check: Apply a hard decision to find $\hat{x}$ and check the parity constraint $H\hat{x}^T = 0$.
Figure 5.3: Positions of variable nodes $A^z_{rc}$ and check nodes $B^z_{rc}$ (panels a to d, related by a circular shift by $p_{rc}$)
[Figure: blocks MUX, CSR, CN-UPDATE, CSL, PRE (SUM-VN, CN-RAM), VN-UPDATE, P-ROM (MODE(1) ... MODE(K)), CTRL, and Parity-Check, with signals P_value, P_has, and mode]
Figure 5.4: Overview hardware architecture of LDPC decoder
Figure 5.5: Circuits inside a) ‘SUM-VN(c)’ and b) ‘VN(c)’ for the zth sub-column
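The steps listed above can be sketched in software for one iteration. This is an illustrative floating-point sketch under stated assumptions, not the hardware design: the function names are hypothetical, no submatrix is assumed to be null (as in (5.6)), and arrays are indexed in ascending z, so the "shift right" of the word written from index Z down to 1 is modelled as reading element (z + p) mod Z.

```python
def rotate(word, p):
    """Return w with w[i] = word[(i + p) % Z], entries stored in ascending z."""
    Z = len(word)
    return [word[(i + p) % Z] for i in range(Z)]


def shifted_min_sum_iteration(h_base, Z, lam, A):
    """One iteration. lam[c][z]: channel LLRs; A[r][c][z]: variable-node values."""
    R, C = len(h_base), len(h_base[0])
    B = [[None] * C for _ in range(R)]
    for r in range(R):
        # Align the variable nodes of block row r so each check node sees index z.
        A_al = [rotate(A[r][c], h_base[r][c]) for c in range(C)]
        B_al = [[0.0] * Z for _ in range(C)]
        for z in range(Z):
            for c in range(C):
                # CN Update (5.7): combine the aligned values of the other columns.
                others = [A_al[k][z] for k in range(C) if k != c]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                B_al[c][z] = sign * min(abs(v) for v in others)
        for c in range(C):
            # Shift back so check-node values return to their column positions.
            B[r][c] = rotate(B_al[c], -h_base[r][c])
    # VN Update: column sums (5.8), then remove each own contribution (5.9).
    total = [[lam[c][z] + sum(B[r][c][z] for r in range(R)) for z in range(Z)]
             for c in range(C)]
    A_new = [[[total[c][z] - B[r][c][z] for z in range(Z)] for c in range(C)]
             for r in range(R)]
    return A_new, total


H_BASE = [[1, 0, 2], [0, 2, 1]]          # base matrix of (5.6)
lam = [[1.0, 5.0, 5.0], [5.0, 5.0, 5.0], [5.0, 5.0, 5.0]]
A0 = [[list(col) for col in lam] for _ in H_BASE]   # first iteration uses the LLRs
A1, sums = shifted_min_sum_iteration(H_BASE, 3, lam, A0)
print(sums)
```

Because the shift amounts are data rather than wiring, the same loop body serves any base matrix of the same dimensions, which is the property the method aims for.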
5.3.2 The Proposed Architecture
Based on the computation method above, we propose the LDPC decoder architecture
shown in Fig. 5.4. The decoder supports K operation modes whose
check matrix sizes are smaller than or equal to the predefined parameters R, C, and Z.
In Fig. 5.4, the designs of ‘MUX’, ‘CSR’, ‘CN-UPDATE’, ‘CSL’, ‘PRE’, and
‘VN-UPDATE’ are independent of the $H_{BASE}$ value. They work together to update the check
nodes and variable nodes in one row of $H_{BASE}$ per clock cycle. In detail, $\{A_{r1}, \ldots, A_{rC}\}$
and $\{B_{r1}, \ldots, B_{rC}\}$ ($1 \le r \le R$) are calculated in the rth cycle of an iteration.
The ‘MUX’ block has C multiplexers that select $\lambda = \{\lambda_1, \ldots, \lambda_C\}$ in the first iteration and
$A_r = \{A_{r1}, \ldots, A_{rC}\}$ in the other iterations. Specifically, ‘MUX(c)’ selects either
$\lambda_c = [\lambda^Z_c \ldots \lambda^1_c]$ or $A_{rc} = [A^Z_{rc} \ldots A^1_{rc}]$ ($1 \le c \le C$). Let D be the number of bits
that represent $\lambda^z_c$ as well as $A^z_{rc}$ ($1 \le z \le Z$); the output bit-width of each multiplexer is
then $D \times Z$. The ‘CSR’ block performs the circular right shift once it receives $p_{rc}$, i.e.,
P_value, from the ‘P-ROM’ block: ‘CSR(c)’ shifts the input $A_{rc}$ to obtain $A'_{rc}$.
The ‘CN-UPDATE’ block has Z ‘CN’ sub-blocks that compute the check node values as in
(5.7). If $P_{rc}$ is a null matrix, the ‘P-ROM’ informs the ‘CN-UPDATE’ block
by deasserting the signal P_has. The ‘CN-UPDATE’ block then sets the values of the corresponding
$[B'^{Z}_{rc} \ldots B'^{1}_{rc}]$ to zero. The ‘CSL’ block receives P_value and performs the circular left shift
to return the check node values to their original positions: ‘CSL(c)’ shifts the
input $B'_{rc}$ to obtain $B_{rc}$. Inside the ‘PRE’ block, ‘SUM-VN(c)’ ($1 \le c \le C$) adds the values
of all check nodes in the zth ($1 \le z \le Z$) sub-column of the cth column; refer to Fig. 5.5a.
The computation is based on (5.8). The ‘CN-RAM(c)’ ($1 \le c \le C$) stores the check nodes
$B^z_{rc}$. The ‘VN-UPDATE’ block computes the variable nodes $A^z_{rc}$ as in (5.9); refer to Fig. 5.5b.
The ‘P-ROM’ stores the $p_{rc}$ values of $H_{BASE}$. If the decoder supports K operation modes with K
different check matrices, K ‘ROM’ sub-blocks are needed. The ‘Parity-Check’
block checks the condition $H\hat{x}^T = 0$, in which $\hat{x} = [\hat{x}_1, \ldots, \hat{x}_N]$ are the signs of the ‘SUM-VN’
results. The ‘CTRL’ block controls the timing of the above-mentioned blocks. In Fig. 5.4, the
yellow blocks represent the delay due to registers.
Figure 5.6: Timing diagram of the proposed LDPC decoder
The data flow of our proposed decoder is shown in Fig. 5.6. It can be seen that the
decoder needs R + 3 clock cycles to complete one decoding iteration. If the decoder
needs L iterations to decode an LDPC code with M = R × Z sub-rows and N = C × Z
sub-columns, its throughput is:
$$\mathrm{Throughput} = \frac{N - M}{L(R + 3)} = \frac{Z(C - R)}{L(R + 3)} \quad [\mathrm{bits/cycle}] \tag{5.10}$$
where N − M is the number of information bits per codeword, and L(R + 3) is the number
of cycles needed to decode a codeword. Equation (5.10) shows that the throughput of the proposed
decoder is affected by the code rate (through R). When the system selects a high code rate
(R becomes small) for high throughput, the proposed decoder in turn provides high
throughput.
[Figure: LLR value versus sample index (samples 100 to 600, values from -15 to 15), with the quantization range from -E to E marked]
Figure 5.7: LLR values and the quantization range
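Equation (5.10) can be evaluated numerically. The sketch below uses a hypothetical helper name and plugs in the 802.11ac parameters quoted in Sect. 5.4.2 (Z = 81, C = 24, L = 15, 500 MHz clock); pairing the rate-5/6 mode (R = 4) with the peak throughput figure is an assumption of this example.

```python
def throughput_bpc(Z, C, R, L):
    """Information bits decoded per clock cycle, following (5.10)."""
    return Z * (C - R) / (L * (R + 3))


CLOCK_HZ = 500e6  # synthesis frequency quoted in Sect. 5.4.2

for rate, R in [("1/2", 12), ("2/3", 8), ("3/4", 6), ("5/6", 4)]:
    bpc = throughput_bpc(Z=81, C=24, R=R, L=15)
    print(f"rate {rate}: {bpc:.2f} bits/cycle, {bpc * CLOCK_HZ / 1e9:.2f} Gbps")
```

Under these assumptions the rate-5/6 mode gives about 15.4 bits/cycle, which is consistent with the figure reported for the synthesized decoder.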
5.3.3 LLR Quantization
For hardware implementation, we must quantize the LLR values of the estimated
codeword bits and represent them with a fixed number of binary bits. Let D denote the number
of bits that represent each LLR value; the range of LLR values is then divided into $2^D$ discrete
values. Fig. 5.7 shows the LLR values of a 648-bit codeword. Assume that we
only quantize the values from $-E$ to $E$, i.e., inside the red box in Fig. 5.7; the values that
are smaller than $-E$ or larger than $E$ are truncated to $-E$ and $E$, respectively. The
quantization resolution $\Delta$ is:
$$\Delta = \frac{2E}{2^D} \tag{5.11}$$
From (5.11) we can see that the smaller the value of E is, the smaller the resolution
$\Delta$ becomes, and thus the higher the accuracy of the quantization. However, if E is
too small, many values exceed E in magnitude and must be truncated. The error due to
truncation reduces the accuracy of the decoder. In short, determining the threshold
value E for quantization is an issue that significantly affects the performance of
the LDPC decoder.
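The trade-off between resolution and truncation can be illustrated with a small uniform quantizer. This is a sketch only: the function name is hypothetical, and the choice of reconstructing each level at the centre of its interval is an assumption, since the thesis specifies only the range and the resolution (5.11).

```python
import math


def quantize_llr(llr, E, D):
    """Clip llr to [-E, E] and map it to one of 2**D uniform levels.

    Returns (level index, reconstructed value). Mid-interval reconstruction
    is an assumption of this sketch.
    """
    step = 2.0 * E / (2 ** D)                       # resolution from (5.11)
    clipped = max(-E, min(E, llr))                  # truncation to [-E, E]
    index = min(int(math.floor((clipped + E) / step)), 2 ** D - 1)
    return index, -E + (index + 0.5) * step


for E in (14, 7, 3.5, 1.75):
    print(f"E = {E}: step = {2.0 * E / 2 ** 4}, LLR 2.0 -> {quantize_llr(2.0, E, 4)}")
```

With D = 4, reducing E from 14 to 3.5 shrinks the step from 1.75 to 0.4375, while any LLR beyond ±3.5 is mapped to the outermost level.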
On the other hand, the LLR $\lambda_n$ of the nth codeword bit is computed as (5.12).
$$\lambda_n = \log\left(\frac{\Pr(x_n = 0 \mid y)}{\Pr(x_n = 1 \mid y)}\right) \tag{5.12}$$
In (5.12), $\Pr(x_n = 0 \mid y)$ and $\Pr(x_n = 1 \mid y)$ are the probabilities that the nth bit of the
codeword is zero and one, respectively. If $\Pr(x_n = 0 \mid y) = K \Pr(x_n = 1 \mid y)$
where K > 10 or K < 1/10, it is nearly certain that the nth bit is zero or one,
respectively. In that case, the error due to truncation is negligible.
Based on our research, we propose to select the threshold E so that 2.5 < E < 4. This
corresponds to 50 > K > 10 or 1/50 < K < 1/10 at the truncation points. By doing so, the quantization range is
shortened significantly compared to the full range, which is commonly from -15 to 15.
The quantization accuracy is thus improved, while the truncation error remains negligible.
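The relation between the threshold E and the likelihood ratio K follows from (5.12) by exponentiation. The short check below assumes that the logarithm in (5.12) is the natural logarithm, which the text does not state explicitly; the helper name is hypothetical.

```python
import math


def likelihood_ratio(E):
    """K such that log(K) = E, assuming the natural logarithm in (5.12)."""
    return math.exp(E)


for E in (2.5, 3.5, 4.0):
    print(f"E = {E}: K is about {likelihood_ratio(E):.1f}")
print(f"K = 10 -> E = {math.log(10):.2f}, K = 50 -> E = {math.log(50):.2f}")
```

Under this assumption the proposed range 2.5 < E < 4 maps to K between roughly 12 and 55, the same order as the bounds of 10 and 50 quoted in the text.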
5.4 Result Evaluation
5.4.1 PER Simulation
Fig. 5.8 shows the packet error rate (PER) performance of the 802.11ac PHY system when
using BCC and LDPC codes. The simulation parameters are: 4 × 4 MIMO, 80 MHz
bandwidth, channel D, 64-QAM and 256-QAM modulation, 2000 packets, a packet size of
1000 data bytes, and the 2D Sorter-based K-best MIMO detection described in chapter 4.
Fig. 5.8 shows that using LDPC codes improves the PER performance of the system
by about 1.1 dB and 1.5 dB (at PER = $10^{-1}$) for 64-QAM and 256-QAM, respectively.
[Figure: PER versus SNR [dB] from 26 to 38 dB; curves: 64QAM BCC, 64QAM LDPC, 256QAM BCC, 256QAM LDPC; rate 3/4 for both modulations]
Figure 5.8: Floating point PER simulation results
[Figure: PER versus SNR [dB] from 32 to 38 dB; curves: BCC E=7, LDPC E=14, LDPC E=7, LDPC E=3.5, LDPC E=1.75]
Figure 5.9: Fixed point PER simulation results
To validate the quantization proposed in Sect. 5.3.3, we simulated
the 802.11ac PHY system with 256-QAM modulation and a packet size of 500 data bytes. The
other parameters are the same as above. We use D = 4 bits to quantize the LLR
values, and select E = 14, E = 7, E = 3.5, and E = 1.75. The simulation results are
shown in Fig. 5.9. The results show that the PER performance improves when E
is reduced from 14 to 7 and then to 3.5, because shortening the quantization range
improves the quantization accuracy. However, if we continue to reduce E to
1.75, the error due to truncation becomes considerable, and the PER performance is no longer
improved. With the proposed quantization, the PER performance is improved by about 1.7 dB
(at PER = $10^{-1}$) compared to the case E = 14. In addition, in the fixed point simulation,
the PER performance of LDPC is much better than that of BCC.
5.4.2 ASIC Synthesis
Based on the proposed architecture, we have designed a multi-mode LDPC decoder for the
802.11ac system. The decoder supports three code lengths: 1944 (Z = 81), 1296
(Z = 54), and 648 (Z = 27). Each code length has four code rates: 1/2 (R = 12),
2/3 (R = 8), 3/4 (R = 6), and 5/6 (R = 4). All 12 modes have C = 24.
We have synthesized the decoder as an ASIC using the SAED 90 nm CMOS technology (saed90nm_typ
library). Our decoder is compared with previous works in Table 5.1. It can be seen that
our design is the first one that achieves a throughput of up to 7.7 Gbps, or 15.4 bits/cycle (bpc). To
compare fairly with the previous works, we have normalized the efficiency of the designs
in terms of throughput per area (bpc/mm²), taking into account the differences in technology, code
length, and number of iterations, as in (5.13) and (5.14).
$$\mathrm{Eff.} = \mathrm{norm\_tech} \times \mathrm{norm\_leng} \times \mathrm{norm\_iter} \times \frac{\mathrm{T.P.\,(bpc)}}{\mathrm{area\,(mm^2)}} \tag{5.13}$$
$$= \frac{\mathrm{Tech.}}{90\,\mathrm{nm}} \times \frac{\mathrm{MaxLen.}}{1944} \times \frac{\mathrm{MaxIter.}}{15} \times \frac{\mathrm{T.P.\,(bpc)}}{\mathrm{area\,(mm^2)}} \tag{5.14}$$
The result shows that our decoder achieves the best efficiency (1.89 bpc/mm²), which is
better than [56], [57], [58], and [59] by 23.6, 9, 1.4, and 2.7 times, respectively.
Table 5.1: ASIC synthesis results of LDPC Decoders
Design                  [56], 2009  [57], 2010  [58], 2010  [59], 2011  Proposed
Technology              0.13 µm     0.13 µm     0.13 µm     90 nm       90 nm
Code Length (bits)      1536        576-2304    576-2304    576-2304    648-1944
No. of Iterations       2-8         15          15          8-12        15
Frequency (MHz)         125         260         150         400         500
Area (mm²)              4.94        6.3         2.46        0.679       8.2
Throughput (Mbps)       86          205         248-287     200         7700
Throughput (bpc)        0.69        0.79        1.91        0.5         15.4
Efficiency (bpc/mm²)    0.08        0.21        1.33        0.7         1.89
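The normalization in (5.13) and (5.14) can be reproduced from the rows of Table 5.1. In the sketch below the function name is hypothetical, and taking the upper end of each code-length and iteration range as MaxLen. and MaxIter. is an assumption of this example.

```python
def normalized_efficiency(tech_nm, max_len, max_iter, tp_bpc, area_mm2):
    """Throughput per area normalized to 90 nm, length 1944, 15 iterations (5.14)."""
    return (tech_nm / 90.0) * (max_len / 1944.0) * (max_iter / 15.0) * tp_bpc / area_mm2


# (technology [nm], max code length, max iterations, throughput [bpc], area [mm^2])
DESIGNS = {
    "[56]": (130, 1536, 8, 0.69, 4.94),
    "[57]": (130, 2304, 15, 0.79, 6.3),
    "[58]": (130, 2304, 15, 1.91, 2.46),
    "[59]": (90, 2304, 12, 0.5, 0.679),
    "Proposed": (90, 1944, 15, 15.4, 8.2),
}

for name, params in DESIGNS.items():
    print(f"{name}: {normalized_efficiency(*params):.3f} bpc/mm^2")
```

With these inputs the computed values agree with the efficiency row of Table 5.1 to within rounding of the tabulated figures.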
5.5 Conclusion
In this chapter, we have proposed a high-throughput multi-mode LDPC decoder architecture
for wireless communication systems. Most blocks of our architecture are independent of the
check matrix value. To add a new operation mode, we simply add to the ‘P-ROM’ block a new ROM that
stores the check matrix value corresponding to that operation mode.
We have also proposed an efficient quantization method, which is required for the hardware
implementation of the decoder. The floating point simulation results have shown that using
LDPC codes can improve the PER performance of the 802.11ac PHY system by about 1.1 dB
and 1.5 dB for 64-QAM and 256-QAM, respectively. The fixed point simulation
results have shown that applying our proposed quantization method can improve the PER
performance by about 1.7 dB, and that the LDPC performance is much better than that of BCC.
Based on the proposed architecture, we have designed a twelve-mode LDPC decoder for the
802.11ac system and synthesized it in a 90 nm ASIC technology. Our decoder achieves up to
15.4 bits per cycle, or 7.7 Gbps at a frequency of 500 MHz. In terms of efficiency normalized
for the differences in technology, code length, and number of iterations, our design
achieves 1.89 bpc/mm², which is the best among the compared designs.
Chapter 6
Conclusion and Future Work
Due to the high demand from wireless communication users, the throughput of wireless
systems must be significantly increased. In chapter 2, we have shown a number of methods
to increase the system throughput. Among these methods, improving the system reliability
in addition to adopting high-order modulation has been shown to be the most preferable
one because it is not constrained by physical limitations such as channel resources or
device size. However, this method requires much research effort. In addition, once
data is transferred wirelessly, anyone (including unintended users) within the coverage
area is able to obtain the data; security thus becomes indispensable. The goal of this thesis
is to improve and implement the security and reliability of MIMO wireless systems
that achieve several Gbps of throughput. In chapter 2, we have also given an overview of
a wireless system. The data flow through the MAC and PHY layers has been described. The
MAC layer performs channel access control and encrypts/decrypts the transferred data for
security purposes. To encrypt/decrypt the data, the system can follow the WEP, WPA,
or WPA2 security standard, of which WEP and WPA are preferred in most common-purpose
systems. As a mandatory part of both WEP and WPA, RC4 needs further research
for high-throughput implementation. The PHY layer, in turn, is responsible for transmitting
the data correctly, i.e., it guarantees the reliability of the system. The main processes of a
4 × 4 MIMO transceiver have been shown. Among these processes, MIMO detection and FEC
play the most important roles in determining the system reliability. An overview of MIMO
detection and of the LDPC decoder, which is the best-performing FEC, has been given.
In chapter 3, we have presented our research on high-throughput RC4 architectures for
securing wireless systems. The RAM-based and the Register-based RC4 architectures
have been proposed. The former uses a single tri-port RAM to store the S-box
and has a throughput of 4 bits/cycle. We have proven that 4 bits/cycle is the best throughput
that can be achieved by a RAM-based RC4. The latter uses a set of 256 registers to store
the S-box and has a throughput of M (M = 1, 2, ...) bytes/cycle. We have described the
SWAP rule, which is the key point of the Register-based RC4. The ASIC synthesis results
of the proposed RAM-based RC4 and Register-based RC4 with M = 4 have shown that
both architectures consume less power than the conventional works. In addition, while the
conventional RAM-based RC4 requires several RAM blocks and a special circuit to manage
the data coherence between the RAMs, our proposed RAM-based RC4 uses only a single
RAM block. Our RAM-based RC4 is recommended for systems that need a throughput of
up to 1 Gbps because of its low complexity and low power consumption. Meanwhile, the
proposed Register-based RC4 is the first one that produces more than one byte of ciphering
key per cycle. It is best suited for securing future high-throughput systems that require
a throughput above 1 Gbps.
In chapter 4, we have proposed a 2D Sorter-based K-best MIMO detection algorithm
and hardware architecture for system reliability. Basically, our algorithm performs similarly
to the conventional full K-best. However, new ideas such as direction expansion, parent
node grouping, and a two-dimensional sorter have been introduced to significantly reduce the
complexity. Consequently, its complexity is negligibly affected by the constellation size.
The proposed algorithm is thus best suited for MIMO systems that adopt high-order modulation
such as 256-QAM. This chapter has shown that our proposed detection outperforms
BLAST MMSE and LRA MMSE, and is close to the full K-best in terms of BER and
PER performance. Regarding the hardware architecture, techniques such as resource sharing,
a MUX-GAIN-based multiplier, and bit-shift-based LLR have been utilized to further
reduce the complexity. The ASIC synthesis results have shown that our proposed detection
achieves the highest throughput (2.7 Gbps), consumes the least power (56 mW), obtains the
best hardware efficiency (15.2 Mbps/Kgate), and has the shortest latency (0.07 µs)
compared to the conventional works on K-best based MIMO detection. This chapter has also
shown that the proposed detection is less complex than the MMSE and LRA-MMSE detection
schemes.
Continuing with the aim of improving the system reliability, we have proposed a
multi-mode high-throughput LDPC decoder in chapter 5. The proposed decoder is largely
unaffected by the check matrix value. It is thus best suited for high-throughput systems that require
multiple operation modes. In this chapter, we have also shown that the way the
LLR, i.e., the input data of the LDPC decoder, is quantized significantly affects the system reliability.
We then proposed an efficient quantization method. Both floating point and fixed point
simulation results on 802.11ac systems have been shown. All the results have shown that the
performance of the system when using LDPC is much better than when using BCC: the PER
performance is improved by at least 1 dB. In addition, applying the proposed quantization
method can improve the PER performance of LDPC by about 1.7 dB. In terms of hardware
implementation, the ASIC synthesis results have shown that the proposed twelve-mode
LDPC decoder for the 802.11ac system achieves up to 15.4 bits per cycle, or 7.7 Gbps at a
frequency of 500 MHz, which is the highest throughput compared to the conventional works.
To obtain a fair comparison, we have normalized the efficiency of all designs in terms of
throughput per area (bpc/mm²), taking into account the differences in technology, code length, and
number of iterations. The normalized results have shown that our design achieves 1.89
bpc/mm², which is the best among the compared designs.
For future work, we will integrate the developed RAM-based RC4, 2D Sorter-based
K-best MIMO detection, and LDPC decoder into the 802.11ac system. Because the target
throughput of the 802.11ac system we are developing is up to 1.73 Gbps, the RAM-based RC4
will be used. The Register-based RC4 is reserved for the next project, which needs higher
throughput.
Appendix A
Verification of the Designs
This appendix shows how we debugged and tested the hardware circuits of our proposed
architectures. We used two design methods to develop the proposed architectures,
and the verification approach depended on the design method.
• Model based design: In this method, the Synphony model developed by Synopsys
was used to design the hardware circuits. The Synphony model is a tool that is
embedded and run in the Matlab software. Using this method, we designed the
RAM-based RC4 and the 2D Sorter-based K-best MIMO detection.
• Verilog based design: In this method, the Verilog hardware description language
(HDL) was used to design the hardware circuits. The Modelsim tool was used to
simulate the functional behavior of the developed circuits. Using this method, we
designed the Register-based RC4 and the LDPC decoder.
RC4’s Verification: The verification was done in three steps. In step 1, we developed
an “RC4.m” software program that runs in Matlab. We tested the “RC4.m” program
by inputting the same key as provided by the IEEE 802.11 standard [5], p. 1134; refer to Fig.
A.1. The “RC4.m” program generated the same ciphering key as the one provided by
[5], which means that “RC4.m” operated correctly. In step 2, we used “RC4.m” to test the
RAM-based RC4 circuit. The test flow and test cases are illustrated in Fig. A.2. In step 3,
we used “RC4.m” to test the Register-based RC4 circuit. The test flow and test cases are
shown in Fig. A.3.
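The role of the software reference in these steps can be sketched as follows. This is not the “RC4.m” program itself (which is a Matlab program and is not listed in the thesis); it is a minimal Python sketch of the standard RC4 algorithm with hypothetical function names, plus a mismatch counter of the kind used to compare software and hardware outputs. The 8-byte key is the one shown in Fig. A.1.

```python
def rc4_keystream(key, n):
    """Generate n keystream bytes with the standard RC4 algorithm."""
    S = list(range(256))
    j = 0
    for i in range(256):                      # key scheduling
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    i = j = 0
    out = bytearray()
    for _ in range(n):                        # keystream generation
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)


def count_mismatches(reference, candidate):
    """Number of differing bytes, counting any length difference as errors."""
    diff = sum(a != b for a, b in zip(reference, candidate))
    return diff + abs(len(reference) - len(candidate))


key_fig_a1 = bytes([0xFB, 0x02, 0x9E, 0x30, 0x31, 0x32, 0x33, 0x34])
soft = rc4_keystream(key_fig_a1, 16)
# In a real flow the second argument would be the bytes dumped by the simulator.
print(soft.hex(), count_mismatches(soft, soft))
```

A hardware run passes a test case when the mismatch count against the software keystream is zero, which corresponds to the "No. Error" column of the test tables.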
[Figure: the key (hex: fb, 02, 9e, 30, 31, 32, 33, 34; decimal: 251, 2, 158, 48, 49, 50, 51, 52) is fed to RC4.m, and the software cipher key is matched against the cipher key given in the IEEE 802.11 standard]
Figure A.1: RC4’s Verification: Step 1
[Figure: the same key is fed to the RC4 software (RC4.m, Matlab) and the RC4 hardware (RC4.mdl); Cipherkey_soft and Cipherkey_hard are compared]
Test Case   Input Key Values    No. Test Data (#Bytes)   No. Error
1           Ref. 802.11         200.000                  0
2           18                  200.000                  0
3           255 (8 bytes)       100.000                  0
4           0 (8 bytes)         100.000                  0
5-20        Random (8 bytes)    100.000                  0
Figure A.2: RC4’s Verification: Step 2
[Figure: the same key is fed to the RC4 software (RC4.m, Matlab), which writes software_cipherkey.dat, and to the RC4 hardware (RC4.v, Modelsim), which writes Hardware_cipherkey.txt; the two files are compared]
Test Case   Input Key Values    No. Test Data (#Bytes)   No. Error
1           Ref. 802.11         200.000                  0
2           0 (8 bytes)         100.000                  0
3           255 (8 bytes)       100.000                  0
4-10        Random (8 bytes)    100.000                  0
Figure A.3: RC4’s Verification: Step 3
MIMO detection’s Verification: The test flows of our MIMO detection hardware
circuit are shown in Fig. A.4 and A.5. In Fig. A.4, we converted the LLR calculation of
the MIMO detection from floating point to fixed point (4 bits). We captured data from the
802.11ac simulator and fed them to both the “MIMO Detection Software” with 4-bit
fixed point LLR and the “MIMO Detection Circuit” with 44-bit fixed point Main Processing
(MP) and 4-bit fixed point LLR. We then compared their output data, which
matched. In Fig. A.5, we checked the BER performance of the “MIMO Detection
Circuit” in two cases. Case 1 is 44-bit fixed point MP and 4-bit fixed point LLR. Case 2 is
16-bit fixed point MP and 4-bit fixed point LLR. The BER performance results are shown
in Fig. A.6. This figure shows that the BER of the MIMO detection circuit is close to that
of the floating point MIMO detection software.
Figure A.4: Test the functional behavior of MIMO detection
Figure A.5: Test the BER performance of the MIMO detection circuit
[Figure: BER versus SNR from 15 to 35 for 16-QAM, 64-QAM, and 256-QAM at rate 3/4; curves: MP(float), LLR(float); MP(fixed, 44bits), LLR(fixed, 4bits); MP(fixed, 16bits), LLR(fixed, 4bits)]
Figure A.6: BER performance Results of the MIMO detection circuit
[Figure: random inputs (Infor_Data.dat) are encoded by LDPC_Encoder.m, noise is added, and LLR.m produces LLR.txt; the LDPC software (LDPC_Decoder.m, Matlab) produces DEC_soft.dat and the LDPC hardware (LDPC_Decoder.v, Modelsim) produces DEC_hard.txt; the outputs are matched against each other and against Infor_Data.dat]
Figure A.7: Verification flow of LDPC decoder
LDPC Decoder’s Verification: We checked the LDPC decoder’s hardware circuit
by comparing its result DEC_hard.txt with the information data Infor_Data.dat and with
the output DEC_soft.dat of the LDPC decoder software program LDPC_Decoder.m.
LDPC_Decoder.m is the program that was used to simulate the PER of the 802.11ac system in
chapter 5.
Appendix B
Snapshots of the Designs
This appendix shows the snapshots of our proposed designs. For the Model based designs
such as RAM-based RC4 and MIMO detection, we show the snapshots of the circuits. For
the Verilog based designs such as Register-based RC4 and LDPC decoder, we show the
snapshots of simulation waveform run by Modelsim.
Figure B.1: RAM-based RC4’s top view circuit
Figure B.2: RAM-based RC4, inside of ‘RC4-Main-Process’ block
Figure B.3: RAM-based RC4, inside of ‘S-box’ of ‘RC4-Main-Process’ block
Figure B.4: RAM-based RC4, inside of ‘CALCULATE-J’ of ‘RC4-Main-Process’ block
Figure B.5: Register-based RC4’s waveform 1 (M = 4 bytes)
Figure B.6: Register-based RC4’s waveform 2 (M = 4 bytes)
Figure B.7: 2D sorter-based K-best MIMO detection’s top view circuit
Figure B.8: 2D sorter-based K-best MIMO detection, inside of ‘Stage 4’ block
Figure B.9: 2D sorter-based K-best MIMO detection, inside of ‘Stage 3’ block
Figure B.10: LDPC decoder’s waveform
Acknowledgment
I am heartily thankful to my supervisor, Professor Hiroshi Ochi, whose guidance, support,
and encouragement sustained me during my study at Kyushu Institute of Technology (Kyutech). I
also want to thank Ass. Professor Masayuki Kurosaki for his support and advice.
To my family for always loving, encouraging, and supporting me.
I am also indebted to Prof. Katsuhiro Inoue and Prof. Masato Tsuru from Kyutech,
who took the time to read my thesis manuscript and give very helpful advice. To
Prof. Shigenori Kinjo from Japan Coast Guard Academy, who traveled from Hiroshima to
Fukuoka for my thesis defense and also gave very insightful comments.
I would also like to thank the JASSO scholarship Foundation, ASO CEMENT Co. Ltd.
and Sato Oya Scholarship Foundation for giving me financial support. To VLSI Design and
Education Center (VDEC), the University of Tokyo in collaboration with Synopsys for
providing the ASIC synthesis tool license. To Prof. Huh Jong Honn, Ms. Hirata, Ms. Hasegawa,
Ms. Noguchi for teaching me Japanese language and culture. To Ms. Maki for helping
me with various university related documents. To Ms. Kobayashi for guiding me to enjoy
life in Japan. To Mr. Nawata for taking care of my physical and moral health during my
stay in Japan.
I sincerely express my gratitude to Mr. Nagao, Mr. Leo, and Ms. Hongyo for sharing
their knowledge with me and kindly cooperating with me in the research. To Mr. Nico,
Mr. Khoa, Mr. Khai, Ms. Nguyen, Mr. Uwai, Mr. Koga, Mr. Nana, and all of the Ochi and
Kurosaki lab members for their cooperation in the research and for sharing their time to enjoy
life with me.
To my fellow international students whom I met and shared experiences with. Lastly, I would
like to offer my regards to all of those who supported me in any respect during my studies.
Bibliography
[1] M.S. Obaidat, “Trends and challenges in wireless systems,” The 14th IEEE Int. Conf.
on Electronics, Circuits and Systems (ICECS), Dec. 2007, pp. 1.
[2] V.K. Bhargava, “State of the art and future trends in wireless communication: ad-
vances in the physical layer,” Proc. of Comm. Networks and Services Conf. (CNSR),
May 2006, pp. 1-3.
[3] M.A. Uusitalo, “Global vision for the future wireless world from the WWRF,” IEEE
Vehicular Technology Magazine, vol. 1, no. 2, pp. 4-8, 2006.
[4] Cisco Visual Networking Index, “Global Mobile Data Traffic Forecast Update,
2013-2018,” White Paper, Feb. 2014.
[5] IEEE 802.11 standard: Wireless LAN Medium Access Control (MAC) and Physical
Layer (PHY) Specifications, IEEE Std. 802.11, 2007.
[6] IEEE 802.11n standard: Wireless LAN Medium Access Control (MAC) and Physical
Layer (PHY) Specifications, IEEE Std. 802.11n, 2009.
[7] IEEE 802.11ac standard: Wireless LAN Medium Access Control (MAC) and Physi-
cal Layer (PHY) Specifications, IEEE Std. 802.11ac, 2013.
[8] The OSI model: understanding the seven layers of computer networks, Expert refer-
ence series of white papers, Global knowledge at www.globalknowledge.com.
[9] E. Perahia, and R. Stacey, “Next generation wireless LANs 802.11n and 802.11ac,”
Second Edition, Cambridge, 2013, pp. 221-247.
[10] S. Minho, J. Ma, A. Mishra, and W.A. Arbaugh, “Wireless network security and
interworking,” Proceedings of the IEEE, vol. 94, No. 2, pp. 455-466, 2006.
[11] N.R. Mead, and G. McGraw, “Wireless security’s future,” IEEE Security and Privacy,
vol. 1, no. 4, pp. 68-72, 2003.
[12] E. Shithirasenan, V. Muthukkumarasamy, and D.A. Powell, “IEEE 802.11i WLAN
security protocol - A software engineer’s model,” Proc. of the 4th Asia Pacific Infor.
Tech. security conf. 2005.
[13] A.H. Lashkari, M.M.S. Danesh, and B. Samadi, “A survey on wireless security proto-
cols (WEP, WPA and WPA2/802.11i),” 2nd IEEE Inter. Conf. on Computer Science
and Inf. Tech. (ICCSIT), 2009, pp. 48-52.
[14] B. Schneier, “Applied Cryptography-Protocols, Algorithms and Source Code in C,”
Second Edition, John Wiley and Sons, New York, 1996.
[15] A. Mousa, and A. Hamad, “Evaluation of the RC4 algorithm for data encryption,”
International Journal of Computer Science and Applications, Vol. 3, No. 2, pp.44-56,
2006.
[16] P. D. Kundarewich, S. J.E. Wilton, and A. J. Hu, “A CPLD-based RC-4 cracking
system,” The 1999 Canadian Conference on Electrical and Computer Engineering,
vol.1, pp. 397-402, Canada, 1999.
[17] K.H. Tsoi, K.H. Lee, and P.H.W Leong, “A massively parallel RC4 key search
engine,” Proc. of the 10th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM’02), pp. 13-21, USA, 2002.
[18] P. Kitsos, G. Kostopoulos, N. Sklavos, and O. Koufopavlou, “Hardware implemen-
tation of the RC4 stream cipher,” In Proc. of 46th IEEE Midwest Symposium on
Circuits and Systems 2003, vol.3, pp. 1363-1366, Egypt, 2003.
[19] D.P. Matthews Jr., “Methods and apparatus for accelerating ARC4 processing,” US
Patent Number 7403615, Morgan Hill, CA, 2008.
[20] A. Chattopadhyay, and G. Paul, “Exploring security-performance trade-offs during
hardware accelerator design of stream cipher RC4,” in Proc. IEEE/IFIP 20th Int’l
Conf. VLSI and System-on-Chip (VLSI-SOC), pp. 251-254, CA, 2012.
[21] T. H. Tran, “Research and design the PHY layer of an IEEE 802.11n 4 × 4 MIMO
wireless LAN system,” M.S. thesis, Electronics and Telecommunication department,
University of Science, Ho Chi Minh, Vietnam, May 2012, pp. 17-21.
[22] E. Biglieri et al., “MIMO wireless communications,” Cambridge University press,
New York, 2007.
[23] C. Siriteanu et al., “MIMO zero-forcing detection analysis for correlated and esti-
mated Rician fading,” IEEE transactions on Vehicular Technology, vol. 61, no. 7, pp.
3087-3099, 2012.
[24] S. Yoshizawa, H. Ikeuchi, and Y. Miyanaga, “VLSI implementation of a scalable
pipeline MMSE MIMO detector for a 4 × 4 MIMO-OFDM receiver,” IEICE Trans.
on Fundamentals, Vol. E94-A, No. 1, pp. 324-331, Jan. 2011.
[25] C.Z.W.H. Swetman, J.S. Thomson, B. Mulgrew, and M. Peter, “A comparison of
the MMSE detector and its BLAST versions for MIMO channels,” IEE Seminar on
MIMO: Comm. Sys. from Concept to Implementations, 2001, pp. 19/1-19/6.
[26] R. Ragumadhavan, “Hardware-optimized Lattice Reduction Algorithm for
WiMax/LTE MIMO detection using VLSI,” Inter. Journal of Computer Science
and Information Technology (IJCSMC), Vol. 2, No. 4, pp. 146-154, Apr. 2013.
[27] Y. Yokota, and H. Ochi, “Complexity reduction for higher order MIMO decoder us-
ing block diagonalization,” 2013 Int. Sym. on Intelligent Signal Proc. and Comm.
Systems (ISPACS), Nov. 2013, pp.235-239.
[28] D.K.C. So, and R.S. Cheng, “Layered maximum likelihood detection for MIMO sys-
tems in frequency selective fading channels,” IEEE Transactions on Wireless Com-
munications, vol. 5, no. 4, pp. 752-762, 2006.
[29] L. Azzam, and E. Ayanoglu, “Reduction of ML decoding complexity for MIMO
sphere decoding, QOSTBC, and OSTBC,” Information Theory and Application
Workshop, San Diego CA USA, 27 Jan. - 1 Feb. 2008, pp. 18–25.
[30] L.G. Barbero, and J.S. Thompson, “Fixing the complexity of the sphere decoder for
MIMO detection,” IEEE Trans. Wireless Commun. Vol. 7, No. 6, pp. 2131–2142,
2008.
[31] X. Mao, Y. Cheng, L. Ma, and H. Xiang, “Step reduced K-best sphere decoding,”
Vehicular Technology Conference (VTC Fall), Quebec Canada, 3-6 Sept. 2012, pp.
1–4.
[32] M. Mahdavi, and M. Shabany, “Novel MIMO detection algorithm for high-order con-
stellations in the complex domain,” IEEE Trans. VLSI Syst., Vol. 21, pp. 834–847,
2013.
[33] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, “K-best MIMO de-
tection VLSI architectures achieving up to 424 Mbps,” Proc. Int. Symp. Circuits and
Systems (ISCAS 2006), Kos Island Greece, 21-24 May 2006, pp. 1151–1154.
[34] Z. Guo, and P. Nilsson, “Algorithm and implementation of the K-best sphere decoder
for MIMO detection,” IEEE Trans. Sel. Areas Commun. vol. 24, No. 3, pp. 491–503,
2006.
[35] S. Mondal, A. Eltawil, C.A. Shen, and K.N. Salama, “Design and implementation of
a sort free K-best sphere decoder,” IEEE Trans. VLSI Syst., Vol. 18, pp. 1497–1501,
2010.
[36] M. Shabany, and P.G. Gulak, “A 675 Mb/s, 4 × 4 64-QAM K-best MIMO detector in
0.13 µm CMOS,” IEEE Trans. VLSI Syst., Vol. 20, pp. 135–147, 2012.
[37] S. Chen, T. Zhang, and Y. Xin, “Relaxed K-best MIMO signal detector design and
VLSI implementation,” IEEE Trans. VLSI Syst., Vol. 15, pp. 328–337, 2007.
[38] C. Liao, T. Wang, and T. Chiueh, “A 74.8 mW soft-output detector IC for 8 × 8
spatial-multiplexing MIMO communications,” IEEE Trans. Solid State Circuits, Vol. 45, pp.
411–421, 2010.
[39] C. Yuanyuan, and Z. Xunying, “Research and implementation of interleaving group-
ing Hamming code algorithm,” IEEE Int. Conf. on Signal Processing, Comm. and
Computing (ICSPCC), 2013, pp. 1-4.
[40] P. Dayal, and R.K. Patial, “Implementation of Reed-Solomon CODEC for IEEE
802.16 network using VHDL code,” The Int. Conf. on Optimization, Reliability, and
Infor. Tech. (ICROIT), 2014, pp. 452-455.
[41] T. Matsumoto, and F. Adachi, “BER analysis of convolution coded QDPSK in digital
mobile radio,” IEEE Transactions on Vehicular Technology, vol. 40, no. 2, pp. 435-
442, 2002.
[42] W. Chuanyu, W. Shanshan, T. Ye, and M. Yufeng, “Research and simulation on the
performance of turbo code and convolution code in advanced orbiting systems,” IEEE
Int. Conf. on Microwave, Antenna, Propagation and EMC Tech. for Wireless Comm.
(MAPE), 2013, pp. 487-491.
[43] R. Gallager, “Low-Density Parity-Check Codes,” IEEE Trans. on Information Theory,
vol. 8, no. 1, pp. 21-28, 1962.
[44] D. J. C. MacKay, and R. M. Neal, “Near Shannon limit performance of low-density
parity-check codes,” Electronics Letters, vol. 32, pp. 1645-1646, 1996.
[45] S. Chung, G.D. Forney, T.J. Richardson, and R. Urbanke, “On the design of low-
density parity-check codes within 0.0045 of the Shannon limit,” IEEE Communica-
tion Letters, vol. 5, no. 2, 2001.
[46] X.Y. Hu, E. Eleftheriou, D.M. Arnold, and A. Dholakia, “Efficient implementation of
the sum-product algorithm for decoding LDPC codes,” IEEE Global Telecommuni-
cations Conference (GLOBECOM), 2001, vol. 2, pp. 1036-1036E.
[47] F. Zarkeshvari, and A.H. Banihashemi, “On implementation of min-sum algorithm
for decoding low-density parity-check (LDPC) codes,” IEEE Global Telecommuni-
cations Conference (GLOBECOM), 2002, vol. 2, pp. 1349-1352.
[48] V. Savin, “Min-Max decoding for non binary LDPC codes,” IEEE Int. Symposium on
Information Theory, 2008, pp. 960-964.
[49] C.H. Pan, T.S. Lee, and Y. Li, “An efficient near-ML algorithm with SQRD for
wireless MIMO communications in metro transportation systems,” IEEE Intelligent
Transportation Systems Conference (ITSC), 2007, pp. 603-606.
[50] Y. Miyaoka, Y. Nagao, M. Kurosaki, and H. Ochi, “RTL Design of High-Speed QR
Decomposition for MIMO Decoder,” IEICE Transactions on Fundamentals of Elec-
tronics, Communications and Computer Sciences, vol.E95-A, no.11, pp.1991-1997,
Nov 2012.
[51] T.H. Tran, Y. Nagao, M. Kurosaki, B. Sai, and H. Ochi, “ASIC implement of
600 Mbps IEEE 802.11n 4 × 4 MIMO wireless LAN system,” The 14th IEEE Int.
Conf. on Advan. Commu. Tech. (ICACT), PyeongChang Korea, 19-22 Feb. 2012, pp.
360–363.
[52] Reina Hongyo, “Design of A Low Complexity Soft Decision Detector for Maximum
Likelihood Detection MIMO Decoder,” B.E. thesis, Department of Computer Science
and Systems Engineering, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan,
March 2014.
[53] R.M. Tanner, “A Recursive Approach to Low Complexity Codes,” IEEE trans. on
Information Theory, vol. 27, no. 5, pp. 533-547, 1981.
[54] M. Awais, and C. Condo, “Flexible LDPC Decoder Architectures,” Hin-
dawi Publishing Corporation, VLSI Design, Vol. 2012, Article ID 730835,
doi:10.1155/2012/730835.
[55] T.H. Tran, Y. Nagao, and H. Ochi, “Algorithm and Hardware Design of A 2D Sorter-
based K-best MIMO Decoder,” EURASIP Journal on Wireless Communications and
Networking 2014, 2014:93, DOI: 10.1186/1687-1499-2014-93, Jun. 2014. Circuits,
vol. 43, no. 3, pp. 672-683, 2008.
[56] X.Y. Shih, C.Z. Zhan, and A.Y. Wu, “A real-time programmable LDPC decoder chip
for arbitrary QC-LDPC parity check matrices,” the IEEE Asian Solid-State Circuits
Conf. (A-SSCC-2009), Nov. 2009, pp. 369-372.
[57] S. Huang, D. Bao, B. Xiang, Y. Chen, and X. Zeng, “A flexible LDPC decoder archi-
tecture supporting two decoding algorithms,” the IEEE Inter. Symp. on Cir. and Sys.
(ISCAS-2010), Jun. 2010, pp. 3929-3932.
[58] B. Xiang, D. Bao, S. Huang, and X. Zeng , “A fully-overlapped multi-mode QC-
LDPC decoder architecture for mobile WiMAX applications,” the 21st IEEE Inter.
Conf. on Application-Specific Sys. Arch. and Processors (ASAP-2010), July 2010,
pp. 225-232.
[59] Y.L. Wang, Y.L. Ueng, C.L. Peng, and C.J. Yang, “Processing-task arrangement for
a low-complexity full-mode WiMAX LDPC codec,” IEEE Trans. on Cir. and Sys. I,
vol. 58, no. 2, pp. 415-428, 2011.
Publication List
Journals
1. Thi Hong Tran, Yuhei Nagao, and Hiroshi Ochi, “Algorithm and hardware design of
a 2D sorter-based K-best MIMO Decoder,” EURASIP Journal on Wireless
Communications and Networking 2014, 2014:93, DOI: 10.1186/1687-1499-2014-93, Jun.
2014.
2. Thi Hong Tran, Leonardo Lanante Jr., Yuhei Nagao, and Hiroshi Ochi, “Hardware
design of multi Gbps RC4 stream cipher,” 2013 IEICE Transactions on Fundamentals
of Electronics, Communications and Computer Sciences, vol. E96-A, No. 11, pp.
2120-2127, Nov. 2013.
International Conferences
1. Thi Hong Tran, Yuhei Nagao, Hiroshi Ochi, and Masayuki Kurosaki, “ASIC design
of 7.7 Gbps multi-mode LDPC decoder for IEEE 802.11ac,” 14th International Sym-
posium on Communications and Information Technologies (ISCIT-2014), Incheon,
Korea, Sept. 2014, pp. 259-263.
2. Thi Hong Tran, Yuhei Nagao, and Hiroshi Ochi, “A 2D Sorter-based K-best algo-
rithm for high order modulation MIMO systems,” 2014 IEEE 80th Vehicular Tech-
nology Conference (VTC2014-Fall), Vancouver, Canada, Sept. 2014, 5 pages (un-
published).
3. Thi Hong Tran, Yuhei Nagao, and Hiroshi Ochi, “A 4 × 4 multiplier-divider-less
K-best MIMO decoder up to 2.7 Gbps,” 2014 IEEE International Symposium on
Circuits and Systems (ISCAS-2014), Melbourne, Australia, Jun. 2014, pp. 1696-
1699.
4. Thi Hong Tran, Reina Hongyo, Yuhei Nagao, and Hiroshi Ochi, “Algorithm and
hardware design of a quasi MLD decoder for MIMO systems,” 2014 International
Symposium on Dependable Integrated Systems (DISC), Fukuoka, Japan, March 2014.
5. Thi Hong Tran, Yuhei Nagao, and Hiroshi Ochi, “A novel efficient MLD algorithm
using I/Q separating method for MIMO system,” IEEE International Conference on
Advanced Technologies for Communications (ATC) 2013, Vietnam, pp. 506-510,
Oct. 2013.
6. Thi Hong Tran, Hiroshi Ochi, and Masayuki Kurosaki, “The novel M-Byte RC4 ar-
chitecture for high throughput WLAN systems,” The 2013 International Symposium
on Electrical-Electronics Engineering (ISEE-2013), Ho Chi Minh, Vietnam, Nov.
2013.
7. Thi Hong Tran, Leonardo Lanante Jr., Yuhei Nagao, Masayuki Kurosaki, and
Hiroshi Ochi, “Hardware implementation of high throughput RC4 algorithm,” 2012
IEEE International Symposium on Circuits and Systems (ISCAS-2012), Seoul, Ko-
rea, May 2012, pp. 77-80.
8. Thi Hong Tran, Yuhei Nagao, Masayuki Kurosaki, Hiroshi Ochi, and Baiko Sai,
“Model-based Design Method for Wireless System VLSI with Synopsys Synphony
Tool,” The 2nd Solid-State Systems Symposium VLSI and Related Technologies
(4S), Vietnam, Aug. 2012, pp. 283-289.
9. Thi Hong Tran, Yuhei Nagao, Masayuki Kurosaki, Baiko Sai, and Hiroshi Ochi,
“ASIC implement of 600 Mbps IEEE 802.11n 4 × 4 MIMO wireless LAN system,”
The 14th IEEE International Conference on Advanced Communication Technology
(ICACT-2012), Korea, Feb. 2012, pp. 360-363.
Technical Reports
1. Thi Hong Tran, Reina Hongyo, Yuhei Nagao, and Hiroshi Ochi, “Low complexity
quasi MLD MIMO decoder using 2D sorter,” IEICE Tech. Rep., vol. 113, no. 386,
RCS2013-265, pp. 59-64, Jan. 2014.
2. Reina Hongyo, Thi Hong Tran, and Hiroshi Ochi, “An efficient MIMO maximum
likelihood detection algorithm using I/Q separating method,” IEICE General
Conference, B-5-103, p. 466, Sept. 2013.
3. Leonardo Lanante Jr., Shogo Fujita, Yuji Yokota, Takuro Yoshida, Thi Hong
Tran, Yuhei Nagao, Baiko Sai, and Hiroshi Ochi, “Design of 1.7 Gbps IEEE
802.11ac multi-user MIMO wireless LAN system,” WTP Wireless Technology Park
2012, Yokohama, Japan, Jul. 2012.
4. Thi Hong Tran, Andjas Ardiansyah, Nico Surantha, Yuhei Nagao, Masayuki Kurosaki,
and Hiroshi Ochi, “Design and ASIC implementation of 600 Mbps IEEE 802.11n
4 × 4 MIMO OFDM system,” IEICE Society Conference 2011, Hokkaido, Japan,
Sept. 2011.
