High-Performance Decoder Architectures For Low-Density Parity-Check Codes by Zhang, Kai
Worcester Polytechnic Institute
Digital WPI
Doctoral Dissertations (All Dissertations, All Years) Electronic Theses and Dissertations
2012-01-09
Repository Citation
Zhang, K. (2012). High-Performance Decoder Architectures For Low-Density Parity-Check Codes. Retrieved from
https://digitalcommons.wpi.edu/etd-dissertations/17
HIGH-PERFORMANCE DECODER ARCHITECTURES
FOR LOW-DENSITY PARITY-CHECK CODES
by
Kai Zhang
A Dissertation
Submitted to the Faculty of the
WORCESTER POLYTECHNIC INSTITUTE
In partial fulfillment of the requirements for the
Degree of Doctor of Philosophy
in
Electrical and Computer Engineering
August, 2011
APPROVED:
Prof. Xinming Huang, Thesis Advisor, Worcester Polytechnic Institute
Prof. Berk Sunar, Thesis Committee, Worcester Polytechnic Institute
Prof. James Duckworth, Thesis Committee, Worcester Polytechnic Institute
Dr. Zhongfeng Wang, Thesis Committee, Broadcom Corporation
Abstract
Low-Density Parity-Check (LDPC) codes, which were invented by Gallager in the 1960s, have attracted considerable attention recently. Compared with other error correction codes, LDPC codes are well suited for wireless, optical, and magnetic recording systems due to their near-Shannon-limit error-correcting capacity, high intrinsic parallelism and high-throughput potential. With these remarkable characteristics, LDPC codes have been adopted in several recent communication standards such as 802.11n (Wi-Fi), 802.16e (WiMax), 802.15.3 (WPAN), DVB-S2 and CMMB.
This dissertation is devoted to exploring efficient VLSI architectures for high-performance LDPC decoders and LDPC-like detectors in sparse inter-symbol interference (ISI) channels. The performance of an LDPC decoder is mainly evaluated by area efficiency, error-correcting capability, throughput and rate flexibility. In this work we investigate tradeoffs between these four performance aspects and develop several decoder architectures that improve one or several of them while maintaining acceptable values for the others.
First, we present a high-throughput decoder design for Quasi-Cyclic (QC) LDPC codes. Two new techniques are proposed: parallel layered decoding architecture (PLDA) and critical path splitting. PLDA enables parallel processing of all layers by establishing dedicated message passing paths among them, so the decoder avoids a large crossbar-based interconnect network. Critical path splitting carefully adjusts the starting point of each layer to maximize the time intervals between adjacent layers, such that the critical path delay can be split into pipeline stages. Furthermore, min-sum and loosely coupled algorithms are employed for area efficiency. As a case study, a rate-1/2 2304-bit irregular LDPC decoder is implemented as an ASIC in a 90 nm CMOS process. The decoder achieves an input throughput of 1.1 Gbps, a 3- to 4-fold improvement over state-of-the-art LDPC decoders, while maintaining a comparable chip size of 2.9 mm².
Second, we present a high-throughput decoder architecture for rate-compatible (RC) LDPC codes which supports arbitrary code rates between the rate of the mother code and 1. While the original PLDA lacks rate flexibility, the problem is solved gracefully by incorporating a puncturing scheme. Simulation results show that our selected puncturing scheme introduces a BER performance degradation of less than 0.2 dB compared with the dedicated codes for different rates specified in the IEEE 802.16e (WiMax) standard. PLDA is then employed for the high-throughput decoder design. As a case study, an RC-LDPC decoder based on the rate-1/2 WiMax LDPC code is implemented in a 90 nm CMOS process. The decoder achieves an input throughput of 975 Mbps and supports any rate between 1/2 and 1.
Third, we develop a low-complexity VLSI architecture and implementation for the LDPC decoder used in China Multimedia Mobile Broadcasting (CMMB) systems. An area-efficient layered decoding architecture based on the min-sum algorithm is incorporated in the design. A novel split-memory architecture is developed to efficiently handle the weight-2 submatrices that are rarely seen in conventional LDPC decoders. In addition, the check-node processing unit is highly optimized to minimize complexity and computing latency while facilitating a reconfigurable decoding core.
Finally, we propose an LDPC-decoder-like channel detector for sparse ISI channels using belief propagation (BP). The computational complexity of BP-based detection depends only on the number of nonzero interferers, so it is well suited to sparse ISI channels, which are characterized by long delay but a small fraction of nonzero interferers. The layered decoding algorithm, which is popular in LDPC decoding, is also adopted in this work. Simulation results show that layered decoding doubles the convergence speed of the iterative belief propagation process. Exploiting the special structure of the connections between the check nodes and the variable nodes on the factor graph, we propose an effective detector architecture for generic sparse ISI channels to facilitate the practical application of the proposed detection algorithm. The proposed architecture is also reconfigurable in order to switch flexible connections on the factor graph in time-varying ISI channels.
Aknowledgements
First of all, I wish to thank my advisor, Professor Xinming Huang, for his guid-
ane and support through all stages of my studies and researh at the Worester
Polytehni Institute. I am very grateful for his reognition, his inspiration, and the
exposure and opportunities that I have reeived during the ourse of my study.
I am also indebted to to Professor Berk Sunar, Professor James Dukworth, and
Dr. Zhongfeng Wang for their valuable supports as members of my thesis ommittee.
I am grateful to my fellow graduate students in the Embedded Computing Lab,
Cao Liang, Wenxuan Guo, Yanjie Peng, Chen Shen and Wei Wang for their friend-
ship and support.
I would also like to thank all the sta in ECE department, inluding Robert
Brown, Catherine Emmerton, Colleen Sweeney, Brenda MDonald and Staie Mur-
ray for their kind assistane and oordinations.
This thesis is dediated to my wonderful family. I am forever grateful to my
parents Shanfeng Zhang and Qingfang Zhao, my wife Wanning Jiang, for their love,
support, and enouragement. I shall not try to put my appreiation and love for
them into words.
i
Contents
1 Introduction
1.1 Background
1.2 Related Works
1.3 Summary of Motivations and Contributions
1.3.1 Ultra-High-Throughput LDPC Decoder Architecture Design
1.3.2 Flexible-Rate High-Throughput LDPC Decoder Architecture Design
1.3.3 Efficient Decoder Architecture Design for LDPC Codes with Special Structure
1.3.4 Efficient Architecture Design in LDPC-Like BP-Based Circumstances
1.4 Outline
2 LDPC Codes and Decoding Algorithms
2.1 Introduction of LDPC Codes
2.2 Quasi-Cyclic LDPC Codes
2.3 Belief Propagation Algorithm
2.4 Min-Sum Algorithm
2.5 Loosely Coupled Algorithm
2.6 Early Termination Strategy
2.7 Layered Decoding Algorithm
3 High-Throughput LDPC Decoder Architecture with Parallel Layered Decoding
3.1 High Throughput Strategies
3.2 Parallel Layered Decoding Architecture
3.3 Critical Path Splitting
3.4 Proposed Decoder Architecture
3.4.1 Overall Decoder Architecture
3.4.2 Pipelined Architecture for CNU
3.4.3 Decision Units
3.5 Implementation Results
3.6 Summary
4 High-Throughput Rate-Compatible LDPC Decoder Architecture
4.1 Introduction
4.2 Puncturing Schemes for Rate-Compatible LDPC Codes
4.2.1 Quasi-Cyclic LDPC Codes with Dual-Diagonal Parity Structure
4.2.2 Rate-Compatible LDPC Codes
4.2.3 Selected Puncturing Scheme
4.3 High-Throughput Rate-Compatible LDPC Decoder Architecture
4.3.1 Summary of the Parallel Layered Decoding Architecture
4.3.2 RC LDPC Decoder Design
4.4 Experimental Results
4.4.1 Simulation Results for Punctured WiMax Codes
4.4.2 Hardware Implementation Results
4.5 Summary
5 Low-Complexity LDPC Decoder Architecture for CMMB Systems
5.1 Introduction
5.2 QC-LDPC Codes in CMMB Standard
5.3 Dual-rate Decoder Design
5.3.1 Overall Architecture
5.3.2 Dual-Rate CNU Design
5.3.3 Memory Access for Partially Parallel Layered Decoding
5.3.4 Split-Memory Architecture for Weight-2 Sub-matrices
5.3.5 Number of Pipeline Stages
5.4 Area-Efficient Design Techniques
5.4.1 Memory Reduction
5.4.2 Read/Write Networks
5.5 Implementation results
5.6 Summary
6 Design of Belief-Propagation Based Detectors for Sparse ISI Channels
6.1 Introduction
6.2 Channel Model and Decoding Algorithms
6.2.1 Channel Model and Factor Graph Representation
6.2.2 Belief-Propagation Algorithm
6.2.3 Layered Decoding Algorithm
6.3 Simulation Results
6.4 Detector Architecture Design
6.4.1 Overall Architecture
6.4.2 Architecture of CNU
6.4.3 Cache-Like Architecture
6.5 Implementation Results
6.6 Summary
7 Conclusions
List of Figures
2.1 Regular (8,4) LDPC codes: (a) Parity-check matrix (b) Tanner graph
2.2 Implementation of early termination strategy
2.3 Parity-check matrix for the selected rate-1/2 LDPC code in 802.16e standard.
2.4 Message passing flow in horizontal layered decoding.
2.5 Architecture of horizontal layered decoding with loosely coupled algorithm.
3.1 Processing status at four different clock cycles.
3.2 Variable summations passing directions of the H base matrix: (a) for column 1 (b) for column 6.
3.3 Offset-modified parity-check matrix for rate-1/2 LDPC code in 802.16e.
3.4 Timing diagram for parallel layered decoding architecture.
3.5 BER performance comparison of different decoding algorithms.
3.6 Overall parallel layered decoding architecture for QC-LDPC codes.
3.7 CNU architecture with critical path splitting into 4 pipelined stages.
3.8 Register allocation in each section of the hard decisions.
3.9 Layout of the decoder core area.
4.1 Description of 1-SR node and k-SR node.
4.2 Parity-check matrix for the selected rate-1/2 LDPC code in WiMax.
4.3 BERs of the three punctured codes and the dedicated code at rate 2/3 over AWGN channels.
4.4 Overall architecture of a rate-compatible LDPC decoder.
4.5 Architecture of LLR Initialization Block.
4.6 Architecture of the Address Generator.
4.7 BERs of the punctured LDPC codes over AWGN channels.
4.8 Layout of the proposed decoder chip.
5.1 Structure of the parity check matrix for LDPC codes in CMMB standard: (a) rate-1/2; (b) rate-3/4.
5.2 BER performance for different rates and quantization schemes.
5.3 Overall architecture of the CMMB LDPC decoder.
5.4 Architecture of CNU for dual-rate (1/2, 3/4) CMMB LDPC codes.
5.5 The design of an PROF-based comparator
5.6 Element correspondence relations between CTV memory and APP memory.
5.7 Split-memory design for handling the weight-2 sub-matrices.
5.8 For rate-1/2 codes (a) Architecture of read network; (b) Architecture of write network.
6.1 Factor graph of an ISI channel.
6.2 BER comparison between original BP algorithm and LDA-based BP algorithm.
6.3 BER performance for various quantization schemes of received data.
6.4 BER performance for various quantization schemes of extrinsic messages.
6.5 Overall detector architecture.
6.6 Architecture of the CNU.
6.7 Direct-mapped Cache Architecture
List of Tables
3.1 Overall comparison between proposed decoder and other existing LDPC decoders.
4.1 Three puncturing schemes for achieving rate 2/3 from rate 1/2 mother code
4.2 Index of punctured blocks at different desired rates
4.3 Overall comparison between proposed decoder and other existing WiMax LDPC decoders
5.1 Connections of input and output ports in the read network for the rate-1/2 codes
5.2 Connections of input and output ports in the read network for the rate-3/4 codes
5.3 Overall comparison between proposed decoder and other existing irregular decoders
Chapter 1
Introduction
This chapter first introduces the history of error correction codes and LDPC codes and then describes the motivation, related works, contributions and outline of this dissertation.
1.1 Background
A major concern in designing reliable data transmission and storage systems is the control of data errors induced by a noisy channel or storage medium. In 1948, Shannon demonstrated that if the information rate is less than the channel capacity, introduced errors can be corrected by proper encoding and decoding of the information [63]. From then on, various types of error correction codes were designed by adding redundancy to the original data (known as encoding), which the receiver can use to check the consistency of the delivered message and to recover data determined to be erroneous (known as decoding). Classical and widely used error correction codes include Hamming Code (1950), BCH Code (1959) [34, 8], Reed-Solomon Code (1960) [59], Convolutional Code (1955) [20, 74], Turbo Code (1993) [6] and LDPC Code (1962) [24].
First discovered by Gallager in 1962 [24], LDPC codes are a class of Shannon-limit-approaching codes that exploit two innovative ideas: iterative decoding and constrained random code construction. However, LDPC codes were mostly neglected by code theorists for more than 30 years due to the tremendous computational effort required to implement their encoder and decoder and the introduction of Reed-Solomon codes [59]. Shortly after the debut of another Shannon-limit-approaching code, the Turbo code [6], in which both of the above ideas are explored, LDPC codes were rediscovered by MacKay and Neal in 1996 [50]. LDPC codes have several advantages over Turbo codes: no error floors at high SNRs, more inherent decoding parallelism, higher throughput potential, and lower hardware complexity due to the elimination of long interleavers.
LDPC codes are specified by sparse parity check matrices H and can be fully represented by bipartite graphs (known as Tanner graphs) with two sets of nodes, called the check nodes and the variable nodes. LDPC codes can be effectively decoded using the standard BP algorithm, also called the sum-product algorithm (SPA). Two types of messages, the check-to-variable (CTV) messages and the variable-to-check (VTC) messages, are transmitted along the edges of the Tanner graph [69] and update each other iteratively.
Good LDPC codes that have been found are largely long, random, computer-generated codes with BP-based iterative decoding, and they can achieve an error performance as close as 0.0045 dB to the theoretical Shannon limit [50, 49, 60, 61, 14]. Despite their excellent error performance, these codes are impractical for hardware implementation for the following reasons: (1) they are very long (up to 100,000 bits), so a large memory is required to store the iterative messages as well as the configuration of the H matrix; (2) their encoding circuitry would be very large in order to carry out the complex matrix and vector multiplications; (3) decoding on a random Tanner graph has large latency and lacks parallelism; (4) randomly constructed LDPC codes also require a complex routing network, which not only consumes a large amount of chip area but also significantly increases the computation delay.
In order to make LDPC codes hardware-friendly, structured LDPC codes, which have elegant regularity in the structure of their H matrices and can provide comparable or even better error performance, were introduced through algebraic construction [40, 39, 75, 35, 73, 53, 19, 57]. It has been demonstrated in [48] that structured, algebraically constructed codes can outperform random ones for medium-length LDPC codes (up to a few thousand bits). Besides, their encoding can be simply implemented by linear shift registers. That is why LDPC codes are widely adopted by several recent communication standards such as 802.11n (Wi-Fi), 802.15.3 (WPAN), 802.16e (WiMax), DVB-S2, CMMB and 802.3an (10G-BaseT). Quasi-Cyclic (QC) LDPC codes, a special class of structured LDPC codes, decompose the large H matrix into small sub-matrices, each of which is either an identity matrix or a permutation of it [22, 43, 11]. Therefore, QC LDPC codes are best suited for hardware implementation, since the regular structures of their H matrices make simple message switching and memory access possible.
1.2 Related Works
This dissertation is devoted to high-performance VLSI architecture design and implementation for LDPC codes. Generally, state-of-the-art LDPC decoder architectures can be divided into three categories: the fully parallel method, the serial method and the partly parallel method.
Blanksby et al. [7] directly mapped the BP algorithm into hardware and implemented a fully parallel irregular LDPC decoder. This method, though it offers high throughput, requires a huge number of interconnections to complete the message passing within a random Tanner graph, leading to an average net length of 3 mm and a total chip area of 52.5 mm². On the other hand, Yeo et al. [80] designed a serial decoder by sharing the computation units and updating messages sequentially. The serial method results in a very simple hardware architecture but low decoding throughput, which is usually unacceptable for real-world applications, due to the large latency of ten thousand cycles needed for decoding a codeword.
Sine strutured LDPC odes are arhiteture-aware, their deoders an be im-
plemented by partly parallel method to explore potential parallelism and improve
the throughput [85, 12, 51, 86, 37, 87, 77, 36℄. Compared with random LDPC odes,
the H matrix of strutured LDPC an be easily stored by storing only permutation
values of eah small sub-matrix instead of the entire H matrix. Also, regularity in
the H matrix of strutured LDPC odes enables easy ontrol over the read/write
operations of VTC and CTV messages when deoding by employing a simple bitwise
ounter, thus avoiding the huge interonnetion network in [7℄. QC LDPC deoders
proposed in [77, 79, 78, 17, 66, 28, 45, 70, 9, 25, 27, 67℄ also belong to the partly
parallel ategory.
1.3 Summary of Motivations and Contributions
This research is motivated by the desire to explore high-performance LDPC decoder architectures in terms of area efficiency, error-correcting capability, throughput and rate flexibility. Although LDPC decoders have been an extensive research topic for almost a decade, there are still many unresolved problems. In this dissertation, we consider existing VLSI design issues and develop various decoder architectures to improve throughput, enhance flexibility and reduce the complexity of an LDPC decoder.
The specific motivations for the work and the corresponding contributions are summarized as follows.
1.3.1 Ultra-High-Throughput LDPC Decoder Architecture Design
Motivation: With the increasing demand for high-data-rate wireless applications, many recent communication systems employ ultra-high-throughput channel codes to match the data-rate requirements. For example, the 802.15.3 standard targets data rates of multiple gigabits per second (Gbps). However, conventional high-throughput LDPC decoders proposed in [7, 46, 54] were implemented by employing more hardware resources to operate at a higher parallelism, which consumed a large amount of chip area. For example, the decoder in [46] achieves an equivalent input throughput of 1 Gbps with a clock rate of only 200 MHz, a high parallelism of 512 and a large chip area of 14.5 mm² in a 90 nm process. Thus it is desirable to investigate other methods to improve throughput.
Contribution: We proposed a parallel layered decoding architecture to achieve ultra-high throughput. The parallel layered decoding architecture not only keeps the advantage of the conventional layered decoding algorithm, which can halve the number of iterations required for convergence and thus double the throughput, but also avoids the serial operations of conventional layered decoding that limit overall throughput. Two other techniques were also applied: critical path splitting and fixed message passing. The former optimizes the decoder's critical path and divides it into several pipelined stages. The latter avoids a large crossbar-based interconnect network, which usually prevents an LDPC decoder from achieving a high clock rate. Both techniques improve the clock rate, and thus the throughput, of an LDPC decoder without increasing parallelism and hardware resources. As a case study, a rate-1/2 2304-bit irregular LDPC decoder was implemented as an ASIC in a 90 nm CMOS process. The decoder achieves an input throughput of 1.1 Gbps, a 3- to 4-fold improvement over state-of-the-art LDPC decoders, while maintaining a comparable chip size of 2.9 mm².
1.3.2 Flexible-Rate High-Throughput LDPC Decoder Architecture Design
Motivation: In wireless communication systems, it is desirable to adjust the ECC code rate according to the channel state information (CSI) to meet various service requirements and channel conditions. For example, the WiMax standard provides 6 different LDPC code patterns at 4 different rates (1/2, 2/3A, 2/3B, 3/4A, 3/4B, 5/6) [4]. Although there has been some research on the design of flexible-rate LDPC decoders [68, 84, 52, 46, 66], none of them can provide arbitrary code rates. Moreover, these decoders usually employ complicated interconnection networks to switch between different code patterns [84, 46, 66], such as the Benes networks used in [52], which lead to signal congestion, large delay and low throughput. How to design a flexible-rate LDPC decoder with high throughput remains challenging.
Contribution: In this dissertation we proposed a rate-compatible (RC) LDPC decoder architecture that can provide an arbitrary code rate between the rate of the mother code and 1. Codes with flexible rates were generated by puncturing the parity bits of the mother code with negligible error performance degradation. The parallel layered decoding architecture was also adopted to achieve high throughput, with the complicated interconnect network replaced by fixed wires. The proposed RC LDPC decoder also helps solve the rate inflexibility problem of the previous parallel layered decoding architecture. As a case study, an RC-LDPC decoder based on the rate-1/2 WiMax LDPC code is implemented in a 90 nm CMOS process. The decoder achieves an input throughput of 975 Mbps and supports any rate between 1/2 and 1.
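The rate arithmetic behind puncturing can be sketched in a few lines of Python. This is only an illustration of how removing parity bits raises the rate of a mother code; it does not reproduce the puncturing scheme selected in Chapter 4, which also decides which bits to remove. The function names are hypothetical, and the block length N = 2304 with K = 1152 information bits is an assumption borrowed from the rate-1/2 2304-bit case study above.

```python
from fractions import Fraction


def punctured_rate(n, k, punctured):
    """Code rate after removing `punctured` parity bits from an (n, k) mother code."""
    if not 0 <= punctured <= n - k:
        raise ValueError("can puncture at most n - k parity bits")
    return Fraction(k, n - punctured)


def parity_bits_to_puncture(n, k, target):
    """Number of parity bits to puncture to reach `target` exactly.

    Returns None when the target rate cannot be hit exactly at this
    block length or lies outside [k/n, 1].
    """
    transmitted = Fraction(k) / target  # k / R bits must be transmitted
    if transmitted.denominator != 1:
        return None
    p = n - transmitted.numerator
    return p if 0 <= p <= n - k else None


# Assumed mother code: rate 1/2, 2304 bits (illustrative values).
N, K = 2304, 1152
for p in (0, 576, 768, 1152):
    print(p, punctured_rate(N, K, p))
```

With these assumed parameters, puncturing 576 of the 1152 parity bits gives rate 2/3 and puncturing 768 gives rate 3/4; rate 5/6 is not reachable exactly at this length, so a practical scheme would settle for the nearest achievable rate.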
1.3.3 Efficient Decoder Architecture Design for LDPC Codes with Special Structure
Motivation: Although current LDPC decoders can handle QC LDPC codes effectively and efficiently, specially constructed codes remain a problem, such as the LDPC codes in the CMMB standard, whose sub-matrices are weight-2 superimposed matrices instead of identity matrices. The design of an area-efficient LDPC decoder that supports multiple code patterns is essential for wireless applications such as CMMB.
Contribution: We proposed a low-complexity LDPC decoder for CMMB systems. An area-efficient layered decoding architecture based on the min-sum algorithm is incorporated in the design. A reconfigurable architecture, which can support the dual-rate LDPC codes specified in the CMMB standard, is constructed with minimal overhead. A novel split-memory architecture is developed to efficiently handle the weight-2 submatrices that are rarely seen in conventional LDPC decoders. In addition, the check-node processing unit is highly optimized to minimize complexity and computing latency while facilitating a reconfigurable decoding core.
1.3.4 Efficient Architecture Design in LDPC-Like BP-Based Circumstances
Motivation: Besides LDPC decoding, a large variety of algorithms in communication and signal processing can be viewed as instances of the BP algorithm, which operates by iterative message passing on a bipartite graph [47]. For example, recent research has focused on the application of the BP algorithm to detection over a known ISI channel [15, 62, 5]. The computational complexity of BP-based detection increases exponentially only with the number of nonzero interferers, in contrast to the optimal maximum a posteriori (MAP) or maximum-likelihood (ML) algorithms, whose computational complexity is exponential in the channel length. A BP-based channel detector is needed especially in sparse ISI channels with long delay spreads and only a few nonzero interferers, such as underwater acoustic (UWA) channels [62]. Thus it is desirable to investigate LDPC-like VLSI architectures for applications under these circumstances.
Contribution: In this dissertation, we provided a feasible detector architecture for random sparse ISI channels. A reconfigurable architecture was developed in order to switch flexible connections on the factor graph in time-varying ISI channels. Layered decoding and task rescheduling, both previously used in LDPC decoding, were also incorporated to accelerate the iterative process. All of the tasks are managed by a control unit implemented as a finite state machine (FSM). The proposed detector is implemented as an ASIC in a 90 nm CMOS process and also prototyped on an FPGA.
1.4 Outline
This dissertation is outlined as follows.
Chapter 2 reviews the fundamentals of LDPC codes and the decoding algorithms that are used throughout this dissertation. Besides standard BP, other decoding algorithms such as the loosely coupled algorithm, the min-sum algorithm and the layered decoding algorithm are also presented.
Chapter 3 develops an ultra-high-throughput LDPC decoder architecture implementation with parallel layered decoding. We first explain how parallel layered decoding works and then present how it helps split the critical path and eliminate the interconnect network.
In Chapter 4, a puncturing-based RC LDPC decoder architecture that supports any code rate from 1/2 to 1 is designed and implemented. First we investigate various puncturing schemes and pick the best one in terms of error performance. Then we modify the parallel layered decoding architecture developed in Chapter 3 by adding a control block to carry out puncturing.
Chapter 5 presents a low-complexity LDPC decoder for CMMB systems. A split-memory architecture is proposed to efficiently handle the weight-2 sub-matrices. A reconfigurable architecture, which can support the dual-rate LDPC codes specified in the CMMB standard, is then constructed with minimal overhead.
Chapter 6 presents the VLSI implementation of a BP-based detector for sparse ISI channels, such as the underwater acoustic channel. A cache-like architecture is developed by storing only the messages that the current node interferes with.
Chapter 7 draws the conclusions.
Chapter 2
LDPC Codes and Decoding Algorithms
In this chapter, we review some fundamentals of LDPC codes, QC LDPC codes and their decoding algorithms, including the standard BP algorithm, the min-sum algorithm, the loosely coupled algorithm and the layered decoding algorithm.
2.1 Introduction of LDPC Codes
An LDPC code can be described by an M × N sparse parity check matrix H, in which most of the elements are 0's and only a few are 1's. M denotes the number of parity check equations, that is, the number of check nodes, while N is the block length, that is, the number of variable nodes. Fig. 2.1(a) shows a rate-1/2 4×8 parity check matrix of a regular (8,4) LDPC code. For decoding, a parity check matrix can be mapped into a bipartite graph, called the Tanner graph, with all of the variable nodes on one side and the check nodes on the other. Locations of the nonzero elements in the H matrix indicate the direct connections between variable nodes and check nodes, as illustrated in Fig. 2.1(b). These connections can also be considered as the paths along which messages (CTV messages and VTC messages) are transmitted when decoding with the iterative BP algorithm.
Figure 2.1: Regular (8,4) LDPC codes: (a) Parity-check matrix (b) Tanner graph
2.2 Quasi-Cyli LDPC Codes
QC-LDPC codes are a special class of LDPC codes with a structured H matrix,
which can be generated from an $m_b \times n_b$ base matrix $H_b$:
\[
H_b =
\begin{bmatrix}
P_{0,0} & P_{0,1} & \cdots & P_{0,n_b-1} \\
P_{1,0} & P_{1,1} & \cdots & P_{1,n_b-1} \\
\vdots & \vdots & \ddots & \vdots \\
P_{m_b-1,0} & P_{m_b-1,1} & \cdots & P_{m_b-1,n_b-1}
\end{bmatrix}
\tag{2.1}
\]
Eah nonzero element Pi,j in the base matrix is a z × z submatrix that an
be expanded by irularly right-shifting an identity matrix with the shift value
dened by Pi,j. The struture of the parity hek matrix makes it onvenient to
11
determine the loations of the nonzero elements. Random onnetions between
CNUs and VNUs now beome well-regulated and easy to handle. Therefore, QC-
LDPC odes are welomed by several advaned ommuniation standards, suh as
802.11n, 802.15.3 and 802.16e.
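As an illustration of the expansion rule, the following minimal Python sketch builds the z × z submatrix for one shift value. The function names are our own and are only meant to make the circular right shift concrete; rows and columns are indexed from 0 here.

```python
def expand_submatrix(shift, z):
    """Return the z x z identity matrix circularly right-shifted by `shift`."""
    # Row r of the identity has its 1 in column r; after a right shift by
    # `shift`, that 1 sits in column (r + shift) mod z.
    return [[1 if c == (r + shift) % z else 0 for c in range(z)]
            for r in range(z)]


def nonzero_columns(shift, z):
    """Column index (0-based) of the single 1 in each row of the submatrix."""
    return [(r + shift) % z for r in range(z)]
```

Because each row and each column of such a submatrix holds exactly one nonzero entry, a decoder only needs the shift value to locate every connection, which is the regularity exploited in the rest of this dissertation.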
2.3 Belief Propagation Algorithm
The BP algorithm provides an efficient and powerful method for decoding LDPC
codes. The standard BP algorithm in [24] is usually transformed into the logarithmic
domain, where additions can be used instead of complex multiplications. Before
presenting the BP algorithm, we first make some definitions. Let $c_n$ denote the n-th
bit of a codeword and $y_n$ denote the corresponding received value from the channel.
Let $r_{mn}[k]$ be the CTV message from check node m to variable node n, and
$q_{mn}[k]$ the VTC message from variable node n to check node m, at the k-th
iteration. Let $N(m)$ denote the set of variables that participate in
check m and $M(n)$ denote the set of checks in which variable n participates. The set
$N(m)$ without variable n is denoted as $N(m) \setminus n$ and the set $M(n)$ without
check m is denoted as $M(n) \setminus m$.
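To make the sets $N(m)$ and $M(n)$ concrete, the sketch below derives them from a parity check matrix. The 4 × 8 matrix used here is a hypothetical example with row weight 4 and column weight 2; it is not the matrix of Fig. 2.1, and the function name is our own.

```python
def neighbor_sets(H):
    """Return (N, M): N[m] lists the variables in check m,
    M[n] lists the checks in which variable n participates."""
    rows, cols = len(H), len(H[0])
    N = [[n for n in range(cols) if H[m][n]] for m in range(rows)]
    M = [[m for m in range(rows) if H[m][n]] for n in range(cols)]
    return N, M


# Hypothetical 4 x 8 parity check matrix (assumption, for illustration only).
H_EXAMPLE = [
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
```

Each entry of N corresponds to the edges of one check node in the Tanner graph, and each entry of M to the edges of one variable node.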
1. Initialization:
Under the assumption of equal a priori probabilities, compute the channel
probability $p_n$ (intrinsic information) of variable node n by
\[
p_n = \log \frac{P(y_n \mid c_n = 0)}{P(y_n \mid c_n = 1)} \tag{2.2}
\]
The CTV message $r_{mn}$ is set to zero.
2. Iterative Decoding:
At the k-th iteration, for variable node n, calculate the VTC message $q_{mn}[k]$
by
\[
q_{mn}[k] = p_n + \sum_{m' \in M(n) \setminus m} r_{m'n}[k-1] \tag{2.3}
\]
Meanwhile, the decoder can make a hard decision by calculating the APP
(a posteriori probability) as
\[
\Lambda_n[k] = p_n + \sum_{m' \in M(n)} r_{m'n}[k-1] \tag{2.4}
\]
Deide the n-th bit of the deoded odeword xn = 0 if Λn > 0 and xn = 1
otherwise. The deoding proess terminates when the entire odeword x =
[x1, x2, · · · · · · xN ] satisfy all of the M parity hek equations: Hx = 0, or the
preset maximum number of iteration is onsumed.
If the deoding proess does not stop, then, alulate the CTV message rmn
for the hek node m, by
rmn [k] =
∏
n′∈{N(m)\n}
sign (qmn′ [k])
×Ψ−1

 ∑
n′∈{N(m)\n}
Ψ (|qmn′ [k]|)


(2.5)
Ψ (x) = Ψ−1 (x) = log
1 + e−x
1− e−x
(2.6)
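The check node update in (2.5) and (2.6) can be sketched in a few lines of floating-point Python. This is only an illustration of the equations under the assumption that all incoming VTC messages are nonzero; the function names are our own and no hardware detail is implied.

```python
import math


def psi(x):
    """Psi(x) = log((1 + e^-x) / (1 - e^-x)) for x > 0, as in (2.6)."""
    return math.log((1.0 + math.exp(-x)) / (1.0 - math.exp(-x)))


def check_node_update(q):
    """CTV messages of one check node from its VTC messages, as in (2.5).

    q holds q_mn' for every n' in N(m); all entries must be nonzero.
    """
    r = []
    for n in range(len(q)):
        others = q[:n] + q[n + 1:]            # the set N(m) \ n
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign
        total = sum(psi(abs(v)) for v in others)
        r.append(sign * psi(total))           # Psi is its own inverse
    return r
```

The magnitude of each outgoing message is never larger than the smallest incoming magnitude among the other edges, which is the observation that the min-sum approximation of the next section builds on.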
2.4 Min-Sum Algorithm
The nonlinear log-tanh function in the check node updating step is usually
implemented by a look-up table (LUT), which seriously increases the complexity and
operating latency of the CNU. In [78], the author proposed a method to
balance the path latency of the CNU and the BNU by transferring one of the two
log-tanh functions from the CNU to the BNU. Another effective method is to use the
min-sum algorithm, which lowers the CNU complexity by approximating the CTV
message with a minimum operation, as shown in (2.7).
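\[
r_{mn}[k] = \prod_{n' \in N(m) \setminus n} \operatorname{sign}\left(q_{mn'}[k]\right)
\times \left( \min_{n' \in N(m) \setminus n} \left\{ \left|q_{mn'}[k]\right| \right\} \times \alpha \right) \tag{2.7}
\]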
Here, a normalization factor α is introduced to compensate for the performance
loss of the uncompensated min-sum algorithm relative to the BP algorithm
[23, 26, 10]. In this dissertation, α is set to 0.75.
With the min-sum algorithm, the look-up tables (LUTs) that implement the
intricate nonlinear function in the standard BP algorithm are replaced by rather
simple comparators, which reduces the computation complexity of the CNU. Besides,
storage resources can also be reduced, as only the minimum and second minimum
values of all of the VTC messages within a check node need to be stored.
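The storage saving mentioned above follows from the fact that every outgoing message in (2.7) is built from only three quantities: the smallest magnitude, its position, and the second smallest magnitude. A minimal floating-point sketch, with names of our own choosing and assuming at least two inputs per check node:

```python
ALPHA = 0.75  # normalization factor used in this dissertation


def min_sum_update(q, alpha=ALPHA):
    """Normalized min-sum CTV messages for one check node, as in (2.7)."""
    mags = [abs(v) for v in q]
    pos = min(range(len(mags)), key=mags.__getitem__)   # location of minimum
    min1 = mags[pos]
    min2 = min(m for i, m in enumerate(mags) if i != pos)
    total_sign = 1
    for v in q:
        if v < 0:
            total_sign = -total_sign
    r = []
    for n, v in enumerate(q):
        # Product of the signs of all other inputs: remove the own sign.
        sign_others = total_sign * (-1 if v < 0 else 1)
        # The edge that holds the minimum receives the second minimum.
        mag = min2 if n == pos else min1
        r.append(sign_others * alpha * mag)
    return r
```

For the inputs [2.0, -1.0, 3.0, -4.0], the edge holding the minimum magnitude 1.0 receives the second minimum 2.0 scaled by α, and all other edges receive 1.0 scaled by α.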
2.5 Loosely Coupled Algorithm
In the BP algorithm, messages are transmitted between check nodes and variable
nodes iteratively to update each other. VTC messages are renewed from the channel
probabilities and the CTV messages from the check nodes. These renewed messages
are then passed back to update the CTV messages. Every VTC and CTV message
must be transmitted along the edges of the Tanner graph immediately after it is
renewed. This large number of transmitted messages causes an interconnection
problem: a large chip area is occupied by interconnections, and designing an
area-efficient decoder becomes a challenge.
The loosely coupled algorithm [36] was introduced to solve the complex
interconnection problem. The decoder does not exchange the CTV and VTC messages
between the check nodes and variable nodes. Instead, it delivers only the check and
variable summations $\Delta_m$ and $\Lambda_n$. At a variable node, given the check
summation values $\Delta_m$, a VNU first recovers the individual CTV message
$r_{mn}$ and then calculates the next variable summation $\Lambda_n$, which is
transmitted to the CNUs. As a result, (2.3) and (2.4) can now be modified as:
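\[
r_{mn}[k-1] = \left( \operatorname{sign}\left(\Delta_m[k-1]\right) \times \operatorname{sign}\left(q_{mn}[k-1]\right) \right)
\times \Psi^{-1}\left( \Psi\left(\left|\Delta_m[k-1]\right|\right) - \Psi\left(\left|q_{mn}[k-1]\right|\right) \right) \tag{2.8}
\]
\[
\Lambda_n[k] = p_n + \sum_{m' \in M(n)} r_{m'n}[k-1] \tag{2.9}
\]
\[
q_{mn}[k] = p_n + \sum_{m' \in M(n) \setminus m} r_{m'n}[k-1] \tag{2.10}
\]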
Similarly, at a check node, given the variable summation values $\Lambda_n$, a CNU
first recovers the individual VTC message $q_{mn}$ and then calculates the next check
summation $\Delta_m$, which is transmitted to the VNUs. As a result, (2.5) can now
be modified as:
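\[
q_{mn}[k] = \Lambda_n[k] - r_{mn}[k-1] \tag{2.11}
\]
\[
\Delta_m[k] = \prod_{n' \in N(m)} \operatorname{sign}\left(q_{mn'}[k]\right)
\times \Psi^{-1}\left( \sum_{n' \in N(m)} \Psi\left(\left|q_{mn'}[k]\right|\right) \right) \tag{2.12}
\]
\[
r_{mn}[k] = \prod_{n' \in N(m) \setminus n} \operatorname{sign}\left(q_{mn'}[k]\right)
\times \Psi^{-1}\left( \sum_{n' \in N(m) \setminus n} \Psi\left(\left|q_{mn'}[k]\right|\right) \right) \tag{2.13}
\]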
If the min-sum algorithm is also applied, then only the variable summation
$\Lambda_n$ needs to be transmitted, leading to a further simplification of the
interconnection complexity [70].
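The recovery step in (2.8) relies on Ψ being its own inverse: applying Ψ to the magnitude of $\Delta_m$ returns the full sum over $N(m)$, from which the contribution of one edge can be subtracted. The sketch below checks this numerically in floating point; the function names are our own, and nonzero messages are assumed.

```python
import math


def psi(x):
    """Psi(x) of (2.6), defined for x > 0."""
    return math.log((1.0 + math.exp(-x)) / (1.0 - math.exp(-x)))


def sign(v):
    return -1.0 if v < 0 else 1.0


def check_summation(q):
    """Check summation Delta_m over all of N(m), as in (2.12)."""
    s = 1.0
    for v in q:
        s *= sign(v)
    return s * psi(sum(psi(abs(v)) for v in q))


def recover_ctv(delta, q_mn):
    """Individual CTV message r_mn from Delta_m and q_mn, as in (2.8)."""
    return sign(delta) * sign(q_mn) * psi(psi(abs(delta)) - psi(abs(q_mn)))
```

Up to rounding error, `recover_ctv(check_summation(q), q[n])` gives the same value as evaluating (2.13) directly on the other edges, so only one summation per check node has to cross the interconnect.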
2.6 Early Termination Strategy
As presented in the BP algorithm, the decoding process can be terminated by two
general conditions: one is that the parity check equations Hx = 0 are satisfied; the
other is that the number of iterations reaches the predefined maximum. For the
regular LDPC code decoder in [85], the parity check equations are easy to verify,
because in every clock cycle the decoded bits from the VNUs belong to the same
check, so the parity check can be completed immediately after the decoded bits are
decided. However, for irregular codes such a parity check method would lead to more
hardware resources, longer decoding latency and lower decoding throughput, because
extra storage resources and clock cycles are required to save the hard decision bits
and to verify whether the parity check equations are satisfied.
Another conventional termination method is to set only the maximum number
of iterations and stop at the preset number without considering whether the parity
check equations are satisfied. This method does little harm to the error-correction
performance of the LDPC codes if the maximum number is sufficiently large. At the
circuit level, a simple counter is enough to fully implement this termination
method. Thus, less hardware complexity is incurred compared with the parity
check equation method. However, the method cannot adapt the number
of iterations to the channel and wastes iterations in high-SNR channels, such as
wire-line environments, where fewer errors occur.
In order to address the drawbacks of the two existing termination criteria, we
propose a more convenient and robust early termination strategy [66] to balance
the throughput requirements and hardware usage. The early termination strategy
can not only dynamically adjust the number of iterations when dealing with
communication channels of different SNRs, but is also suitable for low-cost and
low-power hardware implementation.
The key idea of the early termination strategy is that the hard decisions from the
previous iteration are stored and compared with the newly generated hard decisions.
If all of the decoded bits at the current iteration are identical to those of the
previous iteration, then the decoder declares a successful decoding of the current
codeword and exits the iterative process. Otherwise, the decoding process continues
until two successive hard decisions become the same or the maximum number of
iterations is reached. As indicated in [66], for the LDPC code used in this work,
we only need to check the 1152 information bits instead of all 2304 bits, because it
is a rate-1/2 systematic linear block code.
Fig. 2.2 shows the hardware architecture of the early termination strategy. It
mainly consists of XOR gates, OR gates and registers that are used to save the
hard decisions. At every iteration, the hard decisions of the previous iteration are
retrieved from the registers and XORed with the current decisions to generate an
array of intermediate signals same_diff. Then the bits of the same_diff signal are
ORed to generate the flag signal. A high-level flag signal indicates a difference
between the hard decisions of two successive iterations, and the decoding process
continues if the predefined number of iterations has not been reached. Otherwise, a
low-level flag signal indicates that the two successive hard decisions are identical
and the process stops.

Figure 2.2: Implementation of early termination strategy
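The behavior of the XOR/OR comparison can be mimicked in software as follows. This is a behavioral sketch only: `iterate` is a hypothetical callback standing in for the decoder core, and the signal names same_diff and flag are taken from the description above.

```python
def early_termination_flag(prev_bits, curr_bits):
    """Return 1 if any hard decision changed between two iterations, else 0."""
    same_diff = [p ^ c for p, c in zip(prev_bits, curr_bits)]   # XOR stage
    flag = 0
    for d in same_diff:                                          # OR stage
        flag |= d
    return flag


def decode_with_early_termination(iterate, max_iter):
    """Run iterations until two successive hard decisions agree.

    iterate(k) is assumed to return the hard-decision bits of iteration k
    (hypothetical interface). Returns (bits, iterations_used).
    """
    prev = iterate(0)
    for k in range(1, max_iter):
        curr = iterate(k)
        if early_termination_flag(prev, curr) == 0:
            return curr, k + 1
        prev = curr
    return prev, max_iter
```

In the design described here only the information bits would be passed to the comparison, which halves the number of XOR gates and registers for a rate-1/2 systematic code.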
2.7 Layered Decoding Algorithm
In the BP algorithm [24], the two-phase mutual messages, namely the VTC messages
$q_{mn}$ and the CTV messages $r_{mn}$, are updated by separate processing units and
passed to each other iteratively. The $q_{mn}$ updates do not start until all of the
$r_{mn}$ are ready and vice versa. In horizontal layered decoding, the CTV messages
from the current layer are passed vertically to all other unprocessed layers that
share the same variable node. In each iteration, the horizontal layers are processed
sequentially from the top to the bottom layer.
As an example, the H matrix of the rate-1/2 LDPC code from the 802.16e standard is
quasi-cyclic; it consists of sub-matrices that are generated by circularly shifting
an identity matrix, as shown in Fig. 2.3. Such a structure is well suited for
horizontal layered decoding, as each sub-matrix has a column weight of one and we
can treat each row of the base matrix as a layer. The message passing flow of layered
decoding for the selected H matrix is illustrated in Fig. 2.4.
Layer   Nonzero shift values, in order of increasing column index (1 to 24)
1       94 73 55 83 7 0
2       27 22 79 9 12 0 0
3       24 22 81 33 0 0 0
4       61 47 65 25 0 0
5       39 84 41 72 0 0
6       46 40 82 79 0 0 0
7       95 53 14 18 0 0
8       11 73 2 47 0 0
9       12 83 24 43 51 0 0
10      94 59 70 72 0 0
11      7 65 39 49 0 0
12      43 66 41 26 7 0
Figure 2.3: Parity-check matrix for the selected rate-1/2 LDPC code in the 802.16e
standard.

Figure 2.4: Message passing flow in horizontal layered decoding.

Figure 2.5: Architecture of horizontal layered decoding with the loosely coupled
algorithm.
The loosely coupled algorithm is also adopted in this work to reduce the
interconnection complexity. As illustrated in Fig. 2.5, at the j-th layer, the VTC
messages are first recovered from the variable summations $\Lambda_n$ and the CTV
messages read from the memories, as in (2.14). At the output of $\mathrm{CNU}_j$, the
updated CTV messages $\{ r_{j,n_j} \mid n_j \in N(j) \}$ in (2.15) are stored back to
the CTV memories. The variable summations $\{ \Lambda_{n_j} \mid n_j \in N(j) \}$ are
renewed as in (2.16) and sent back to the APP memory.
The entire flow can be expressed as follows:
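\[
q_{j,n_j} = \Lambda_{n_j} - r_{j,n_j} \tag{2.14}
\]
\[
r_{j,n_j} = \prod_{n'_j \in N(j) \setminus n_j} \operatorname{sign}\left(q_{j,n'_j}\right)
\times \left( \min_{n'_j \in N(j) \setminus n_j} \left\{ \left|q_{j,n'_j}\right| \right\} \times \alpha \right) \tag{2.15}
\]
\[
\Lambda_{n_j} = q_{j,n_j} + r_{j,n_j} \tag{2.16}
\]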
Compared with the conventional layered decoding algorithm, it can be observed
that the loosely coupled algorithm does not require variable node operations. In
other words, the VNUs can be eliminated, since their main function of updating the
variable summations $\Lambda_n$ can be done by the CNUs using the CTV messages
$r_{j,n_j}[k]$ from previous layers, as indicated in (2.16).
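The three equations (2.14) to (2.16) can be read as one in-place update per layer. The following floating-point sketch shows that flow for a single layer; the data layout (one list for the APP memory, one list of CTV messages per layer) and the function name are assumptions made for illustration, not the memory organization of the decoder.

```python
ALPHA = 0.75  # normalization factor used in this dissertation


def process_layer(app, ctv, cols, alpha=ALPHA):
    """Update one layer in place following (2.14) to (2.16).

    app  : list of variable summations Lambda_n (APP memory)
    ctv  : stored CTV messages of this layer, one per entry of cols
    cols : indices n of the variable nodes in N(j)
    """
    # (2.14): recover the VTC messages from the variable summations.
    q = [app[n] - r for n, r in zip(cols, ctv)]
    mags = [abs(v) for v in q]
    for i, n in enumerate(cols):
        others = [k for k in range(len(cols)) if k != i]
        sign = 1.0
        for k in others:
            if q[k] < 0:
                sign = -sign
        # (2.15): normalized min-sum over N(j) without n.
        ctv[i] = sign * alpha * min(mags[k] for k in others)
        # (2.16): refresh the variable summation for the next layer.
        app[n] = q[i] + ctv[i]
```

No separate variable node step appears in the loop: the summation written back in the last line is exactly what the next layer reads.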
Chapter 3
High-Throughput LDPC Decoder Architecture with Parallel Layered Decoding
3.1 High Throughput Strategies
With the inreasing demand for high-data-rate wireless appliations, many reent
ommuniation systems employ ultra-high throughput hannel odes to math the
data-rate requirements. For example, 802.15.3 standard is targeted for the data
rate of multi-giga bits per seond (Gbps), thus LDPC odes are preferred ompared
with onvolutional odes and Turbo odes. However, it is a great hallenge to design
a high-through LDPC deoder due to the omplexity of deoding algorithm. Sine
QC-LDPC odes are inreasingly popular in emerging ommuniation standards, we
fous on the design and arhiteture of QC-LDPC deoder in this paper.
The throughput of an LDPC decoder can be calculated as
\[
\text{Throughput} = \frac{\text{Freq} \times \text{Block Length}}{\text{Cycles per Iter} \times \text{Num of Iter}} \tag{3.1}
\]
Therefore, three strategies can be attempted in order to improve the throughput:
reducing the number of iterations required for convergence, reducing the decoding
latency per iteration and increasing the operating frequency. Correspondingly,
three architecture-aware schemes are studied in this work: the layered
decoding algorithm, the parallel layered decoding architecture, and the critical
path splitting technique.
First, the layered decoding algorithm is adopted to reduce the required number of
iterations by a factor of two for any given SNR, compared with the standard BP
algorithm. Hence, the decoding throughput is expected to double without
any bit error performance loss. Generally, there are two layered decoding methods:
horizontal layered decoding [51, 33, 45, 9, 25, 28] and vertical layered decoding
[82]. It has been shown that these two methods are theoretically equivalent and both
can converge twice as fast as the BP algorithm [65]. In this work, we employ
the horizontal layered decoding strategy because it is favorable for the min-sum
algorithm.
Seondly, we propose a novel sheme namely parallel layered deoding arhi-
teture that enables onurrent proessing among all layers. In traditional layered
deoding arhiteture, the layers are proessed sequentially whih leads to longer de-
oding lateny per iteration [33℄. In parallel layered deoding arhiteture, preisely
sheduled message passing among dierent layers guarantees that all updated mes-
sages are passed to their designated loations in onneted layers. The parity-hek
matrix optimization proedure adds spei osets to eah layer (row) in the base
23
parity hek matrix, making the idle time intervals between onneted layers su-
iently large for message passing. Moreover, parallel layered deoding arhiteture
supports the deoding arhiteture in whih hek node updates unit (CNU) and
variable node updates unit (VNU) are ombined into a single funtional unit. Thus,
no extra lok yles are needed to omplete the variable node update as they have
been merged into the CNU. As a result, the number of lok yles per iteration
in parallel layered deoding arhiteture an be redued by 75% ompared with the
existing arhitetures.
Finally, a technique for reducing the critical path delay is proposed to improve the
operating frequency at the circuit level. The combination of the CNU and VNU results
in a long critical path in the decoder implementation [66], which limits the maximum
clock speed. Fortunately, there are idle time intervals among different layers, which
allow the iterative messages to be processed and passed to the next layer within
several clock cycles. Therefore, we can insert registers to split the CNU (including
the VNU) into several pipelined stages. Consequently, the critical path is split into
multiple stages and the clock speed can be improved dramatically, on the condition
that the number of stages does not exceed the idle time interval. In practice, this
critical path splitting method can increase the maximum frequency of the decoder
by a factor of 3 or more.
To demonstrate the aforementioned three techniques, a rate-1/2 2304-bit QC-LDPC
code is selected from the 802.16e standard as a case study. In addition, existing
techniques such as the min-sum algorithm and the loosely coupled algorithm are also
employed to simplify the decoding complexity and to reduce the chip area of the
decoder design.
3.2 Parallel Layered Decoding Architecture
As disussed above, the essential reason that layered deoding algorithm an redue
the number of iterations is that the latest extrinsi messages are passed to and
employed by the subsequent layers within the urrent iteration. Therefore, layered
deoding requires layers to be proessed sequentially, whih results a large deoding
lateny per iteration. A method of inreasing parallelism inside a layer is proposed
in [51, 45℄, but all layers are still proessed in series. Although deoding throughput
an be improved, this method introdues rossbar-based interonnetion networks
that inrease the hardware omplexity.
Motivated by the partly parallel mechanism in [85], we propose the
parallel layered decoding architecture, which allows all layers to be processed
concurrently. Each layer generates and sends updated messages and at the same time
receives the updated messages from other layers. Unlike the method proposed
in [51, 45], the parallel layered decoding architecture uses parallel processing
among all layers and serial processing within each layer. The detailed message
processing flow at the j-th layer (CNU) can be summarized as follows:
1. Feth the orresponding variable summations
{
Λnj | nj ∈ N (j)
}
from the
APP memory and CTV messages
{
rj,nj | nj ∈ N (j)
}
from loal CTV memory.
2. Calulate the VTC messages
{
qj,nj | nj ∈ N (j)
}
in the same row using (2.14).
3. Calulate horizontally to obtain new CTV messages
{
rj,nj | nj ∈ N (j)
}
, as in
(2.15).
4. Immediately update the variable summations
{
Λnj | nj ∈ N (j)
}
using (2.16).
5. Deliver the new variable summation Λnj to the same loation at another layer.
Figure 3.1: Processing status at four different clock cycles.
We now explain how the $\Lambda_{n_j}$ messages are passed in the parallel layered
decoding architecture. Instead of passing $\Lambda_{n_j}$ to all unprocessed layers as
in conventional layered decoding, the parallel layered decoding architecture sends
$\Lambda_{n_j}$ only to the layer that will use it next. Let us take the first column
of the H base matrix as an example. Its nonzero entries are at the 4th, 9th and 12th
layers, whose permutation numbers are 61, 12 and 43, respectively. Now suppose that
at cycle 0 each of the three corresponding CNUs starts processing from the first row
of its sub-matrix, which corresponds to the 62nd, 13th and 44th column indices. In
general, the row and column indices of the entry being processed at cycle l can be
calculated as
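\[
r_{index} = l + 1 \tag{3.2}
\]
\[
c_{index} = \operatorname{mod}\left(l + 1 + s(i,j),\, 96\right) \tag{3.3}
\]
where $s(i,j)$ is the permutation value of the sub-matrix at layer i and column j of
the base matrix, and rows and columns are counted from 1.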
Fig. 3.1 illustrates the message passing routes in the parallel layered decoding
architecture, using the first column of the parity check matrix in Fig. 2.3 as an
example. The operation sequence in time of these three layers is described as
follows:
1. At yle 0, CNUs at layers 4, 9 and12 start to proess simultaneously from
row 1 of eah sub-matrix, orresponding to olumn 62, olumn 13 and olumn
44 aording to (3.3).
2. At yle 18, all three layers are proessing at row 19, orresponding to olumn
80 (in layer 4), olumn 31 (in layer 9) and olumn 62 (in layer 12). Layer 12
alls for the latest summation message of olumn 62 whih has already been
updated by layer 4. Therefore, the updated variable summations at layer 4
should be sent to layer 12 (Layer 4 →Layer 12).
27
  


 
 
	


 
 


 







 
 



 






 
 




 







 










Figure 3.2: Variable summations passing diretions of the H base matrix: (a) for
olumn 1 (b) for olumn 6.
3. At yle 31, all three layers are proessing at row 32, orresponding to olumn
93 (in layer 4), olumn 44 (in layer 9) and olumn 75(in layer 12). Layer 9
alls for the latest summation message of olumn 44 whih has already been
updated by layer 12. Therefore, the updated variable summations at layer 12
should be sent to layer 9 (Layer 12 →Layer 9).
4. Similarly, at yle 47, all three layers are proessing row 48, orresponding to
olumn 13 (in layer 4), olumn 60 (in layer 9) and olumn 91 (in layer 12).
Layer 4 alls for latest summation message of olumn 13 whih has already
been updated by layer 9. Therefore, the updated variable summations at layer
9 should be sent to layer 4 (Layer 9 →Layer 4).
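The cycle-by-cycle example above can be checked numerically. The sketch below evaluates the column index of (3.3) for the three layers of column 1 and also computes the cycle at which a layer reaches a given column. The helper names are our own, and mapping a result of 0 to column 96 is an assumption made so that columns stay in the range 1 to 96.

```python
Z = 96  # sub-matrix size

# Column 1 of the base matrix: layer -> permutation value (from the text).
COLUMN1 = {4: 61, 9: 12, 12: 43}


def column_index(cycle, shift, z=Z):
    """1-based column processed at `cycle`, following (3.3)."""
    c = (cycle + 1 + shift) % z
    return c if c != 0 else z   # assumption: report column z instead of 0


def first_cycle_for_column(col, shift, z=Z):
    """Cycle at which a layer with this shift reaches 1-based column `col`."""
    return (col - 1 - shift) % z
```

With these helpers, layer 12 reaches column 62 at cycle 18, layer 9 reaches column 44 at cycle 31, and layer 4 reaches column 13 at cycle 47, in agreement with steps 2 to 4.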
Based on the description above, the message passing routes for column 1 of the base
matrix are shown in Fig. 3.2(a). An additional example is illustrated in Fig. 3.2(b)
for column 6 of the base matrix. In general, we can determine the message passing
routes among layers based on their permutation values. For each column, we
sort the permutation values of all layers in descending order. Each layer then
passes messages to the layer with the next smaller permutation value. Moreover, the
layer with the smallest permutation value loops back and connects to the layer with
the largest value. This message passing scheme guarantees that the updated messages
among all layers are processed progressively within each iteration.

Layer   Offset   Nonzero shift values, in order of increasing column index (1 to 24)
1       +0       94 73 55 83 7 0
2       +8       35 30 87 17 20 8 8
3       +0       24 22 81 33 0 0 0
4       +12      73 59 77 37 12 12
5       +84      27 72 29 60 84 84
6       +0       46 40 82 79 0 0 0
7       +88      87 45 6 10 88 88
8       +8       27 81 10 55 8 8
9       +0       12 83 24 43 51 0 0
10      +16      14 75 86 88 16 16
11      +0       7 65 39 49 0 0
12      +80      27 50 25 10 87 80
Figure 3.3: Offset-modified parity-check matrix for the rate-1/2 LDPC code in 802.16e.

Figure 3.4: Timing diagram for parallel layered decoding architecture.
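The sorting rule for the message passing routes is simple enough to state as code. The following sketch, with a function name of our own, derives the routes of one base-matrix column from its permutation values and assumes that the values within a column are distinct.

```python
def passing_routes(shifts):
    """Map each layer to the layer that receives its variable summations.

    shifts: dict layer -> permutation value for one base-matrix column.
    """
    # Sort the layers by permutation value in descending order.
    order = sorted(shifts, key=shifts.get, reverse=True)
    # Each layer feeds the next smaller value; the smallest wraps around
    # to the layer with the largest value.
    return {src: order[(i + 1) % len(order)] for i, src in enumerate(order)}
```

For column 1, with permutation values 61, 12 and 43 at layers 4, 9 and 12, this yields Layer 4 → Layer 12 → Layer 9 → Layer 4, matching Fig. 3.2(a).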
3.3 Critial Path Splitting
Aording to the message passing mehanism explained above, we suppose CNUs in
all layers start at row 1 of eah sub-matrix. Atually, the CNUs an start to operate
from dierent rows, whih is equivalent to adding an oset to the permutation value
of eah layer. For example, Fig. ?? shows a modied H base matrix with a set
of oset values added for dierent layers. The oset values are arefully seleted,
suh that the dierene of modied permutation values between any two layers is
29
at least 5. It means eah layer needs to read, update, store and pass the message to
the next onneted layer with 5 lok yles. The detail steps for the proessing in
a layer are shown in in Fig. 3.4. In other words, eah CNU has a period of 4 yles
to omplete the task of reading old message and update new message, as indiated
in (2.14) to (2.16). The message passing rule remains the same, exept that the
updated permutation values should be used to determine the onnetions in PLDA.
From Fig. 3.3, we can see that the row weight of each layer is either 6 or 7, which
requires 6 or 7 messages to be compared in the CNU. Combined with other necessary
functions such as adding, rounding and memory read/write, the CNU becomes
the critical path that limits the maximum operating frequency of the decoder. Taking
advantage of the 4-cycle time intervals, we can split the critical path of the CNU
into 4 pipelined stages by inserting 3 levels of registers. An optimal splitting
yields balanced delays among the pipeline stages. The implementation of critical path
splitting is given in Section 3.4.2.
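The offset condition can be checked per column with a short script. The sketch below applies the layer offsets and measures the smallest circular gap between the modified permutation values of one column; interpreting "difference" as a circular distance modulo 96 is our assumption, motivated by the wrap-around route from the smallest to the largest value. The function names are our own, and at least two layers per column are assumed.

```python
def apply_offsets(shifts, offsets, z=96):
    """Add each layer's offset to its permutation value, modulo z."""
    return {layer: (s + offsets[layer]) % z for layer, s in shifts.items()}


def min_circular_gap(values, z=96):
    """Smallest circular distance between neighboring values (needs >= 2)."""
    order = sorted(values)
    gaps = [(order[(i + 1) % len(order)] - order[i]) % z
            for i in range(len(order))]
    return min(gaps)
```

For column 1, the offsets +12, +0 and +80 of layers 4, 9 and 12 in Fig. 3.3 turn the values 61, 12 and 43 into 73, 12 and 27. The smallest gap is then 15 cycles, comfortably above the 5 cycles required.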
3.4 Proposed Decoder Architecture
In order to prove the concept of the proposed parallel layered decoding architecture
and high-throughput strategies, a decoder for the rate-1/2 2304-bit QC-LDPC code
selected from the 802.16e standard is designed based on the offset-modified H matrix.
The size of each sub-matrix is 96 × 96. Before constructing the functional units
of the decoder, we first perform fixed-point analysis to determine the word length of
the messages. Word-length quantization is a tradeoff between memory resources
and bit-error-rate performance. Fig. 3.5 shows the BER performance of the selected
rate-1/2 LDPC code using message word length (3, 2). As demonstrated in [?], 5
bits (3 bits for the integer part and 2 bits for the fractional part) are adequate
for representing the absolute values of the extrinsic messages. Considering one
additional sign bit, we choose a 6-bit fixed-point representation for the messages in
this implementation.

Figure 3.5: BER performance comparison of different decoding algorithms. (Axes: SNR
(dB), 0.5 to 2.5, versus bit error rate, 10^0 down to 10^-7. Curves: floating-point
standard BP at 20 iterations; floating-point min-sum at 20 iterations; floating-point
min-sum at 10 iterations; floating-point proposed at 10 iterations; fixed-point (3,2)
proposed at 10 iterations.)
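A (3, 2) word length with one sign bit can be modeled in software as follows. The sketch assumes a sign-magnitude representation with round-to-nearest and saturation; the text does not state which number format or rounding mode the hardware quantizers use, so these choices and the function name are illustrative only.

```python
def quantize(value, int_bits=3, frac_bits=2):
    """Round and saturate `value` to a sign-magnitude fixed-point grid."""
    step = 2.0 ** -frac_bits               # 0.25 for word length (3, 2)
    max_mag = 2.0 ** int_bits - step       # 7.75 for word length (3, 2)
    # Python's round() resolves exact ties to the nearest even integer.
    mag = min(round(abs(value) / step) * step, max_mag)
    return -mag if value < 0 else mag
```

Under these assumptions a message of 1.3 becomes 1.25, and any magnitude above 7.75 saturates at 7.75, which is the behavior that prevents overflow in the message memories.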
3.4.1 Overall Decoder Architecture
The overall architecture of the QC-LDPC decoder is shown in Fig. 3.6. It consists
of check node processing units (CNUs), APP memory banks, CTV memory banks
and hard-decision units. The entire H matrix is divided into 12 layers and each
layer employs a dedicated CNU. The APP memory banks and CTV memory banks
are used to store the variable summations and CTV messages. Each nonzero
sub-matrix corresponds to one APP memory unit and one CTV memory unit. In layered
decoding, each APP memory exports a summation message to the CNU and imports
a new summation message from the CNU of another layer. Similarly, each CTV
memory exports a CTV message to the CNU and imports a new CTV message
from the same CNU for each layer. As a result, single-port memories that support
concurrent read and write operations are used in the design. Since there are 76
sub-matrices in total, the APP memory bank and the CTV memory bank each consist
of 76 small single-port memory units.

Figure 3.6: Overall parallel layered decoding architecture for QC-LDPC codes.
Eah layer in the parallel layered deoding arhiteture orresponds to 6 or 7
APP memory units, whih means that the updated variable summations of the j-
th layer Λj,nj will be delivered to APP memory units in other layers based on the
message passing sheme desribed in Setion ?? and Fig. 3.2. Therefore, these
variable summations are passed to their destinations through xed onnetion wires
instead of rossbar-based networks.
3.4.2 Pipelined Architecture for CNU
For hek node update, eah layer employs a CNU to perform a series of funtions
inluding subtration, omparison and addition. Consequently, the CNU beomes
the longest delay path in timing. In min-sum algorithm, a CNU needs to ompare
6 or 7 numbers to nd the minimum value and its loation as well as the seond
minimum value. Thus, the omparator beomes a key omponent in a CNU. A
2-input omparator onsists of an adder and a multiplexer. A 3-input ompara-
tor an be implemented with three adders, ve multiplexers and some basi logi
gates. Based on omparison units of 2-input and 3-input omparators, 6-input or 7-
input omparator an be onstruted by ombining three levels of 2-input or 3-input
omparators, as shown in Fig. 3.7.
Fig. 3.7 shows the detailed architecture of the CNU and its functional blocks. The
subtractor and adder blocks fulfill the functions defined in (2.14) and (2.16),
respectively. Two quantizers are inserted to prevent the CTV messages and variable
summations from overflowing during computation. The abs block calculates the absolute
values of the VTC messages and the compare & select block determines the values of
the CTV messages.

Figure 3.7: CNU architecture with critical path splitting into 4 pipelined stages.

Figure 3.8: Register allocation in each section of the hard decisions.
As explained in Section 3.3, in order to reduce the critical path delay and improve
clock speed, a CNU can be split into 4 pipelined stages by inserting 3 sets of registers.
Registers are carefully inserted such that all 4 pipeline stages have approximately
the same delay, since the maximum frequency is determined by the longest delay path.
Delays of each functional unit and the pipeline stages are also shown in Fig. 3.7.
3.4.3 Decision Units
The output bits of the LDPC decoder are decided by the signs of the variable
summations. In traditional horizontal layered decoding, decisions can be made during
the variable node processing at the bottom layer. The parallel layered decoding
architecture operates in a slightly different way: every layer generates variable
summations during each iteration, as CNUs in different layers are working
concurrently. As illustrated in Fig. 3.8, let us take the first column of the modified H base
matrix with offset as an example. CNU 4, CNU 9 and CNU 12 start to operate
from row 13, row 1 and row 81, respectively, as defined by the offset values shown
in Fig. 3.3, which correspond to column 74, column 13 and column 28, respectively.
According to the message passing rule derived in Section 3.4.2, the passing route of the
variable summations for the first column is Layer 4→Layer 12→Layer 9→Layer 4, as also
shown in Fig. 3.8. After completion of each iteration, the final variable summations
for columns 13 through 27 are stored at layer 12. Variable summations for columns
28 through 73 are stored at layer 4. Similarly, variable summations for columns 74
through 96 and 1 through 12 (wrapping around) are stored at layer 9. Therefore, the
hard decision bits for columns 1 through 96 are stored in the memories distributed in
3 different layers.
In this design, we employ a convenient and robust early termination strategy,
similar to [64, 66]. The key to the early termination strategy is that the hard
decision bits from the previous iteration are stored and compared with the decision
bits from the current iteration. If all decoded bits are identical, then the decoder
indicates successful decoding of a codeword and the iterative decoding terminates.
Otherwise, the decoding process continues until the maximum number of iterations is reached.
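The early termination rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_iteration` is a hypothetical callable standing in for one full decoding iteration, and the mapping "negative summation means bit 1" is an assumed sign convention, not something fixed by the text.

```python
def decode_with_early_termination(run_iteration, llr_in, max_iter=10):
    """Stop as soon as two consecutive iterations give identical hard decisions.

    run_iteration: hypothetical callable mapping the current list of variable
                   summations to the updated list (one decoding iteration).
    Returns (bits, iterations_used, converged).
    """
    sums = list(llr_in)
    prev_bits = None
    for it in range(1, max_iter + 1):
        sums = run_iteration(sums)
        # Hard decision from the sign (assumed convention: negative -> 1).
        bits = [1 if s < 0 else 0 for s in sums]
        if bits == prev_bits:
            return bits, it, True
        prev_bits = bits
    return prev_bits, max_iter, False


# Toy stand-in for an iteration: every summation drifts upward by 1.0.
bits, used, ok = decode_with_early_termination(
    lambda s: [x + 1.0 for x in s], [-0.5, 2.0, -3.0])
```

Note that the comparison needs at least two iterations before it can fire, since the first iteration has no previous decision to compare against.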
3.5 Implementation Results
In order to evaluate the performance of our proposed parallel layered decoding
architecture, we implement a rate-1/2 2304-bit QC-LDPC decoder in TSMC 90nm
1.0V CMOS technology with 8 metal layers. We complete synthesis and core area
place and route using Synopsys tools.
Implementation results show that the decoder can operate at a maximum frequency
of 950MHz after synthesis, corresponding to 2.2Gbps decoding throughput
using 10 iterations. A total of 152 single-port memory units, each with a size of 96×6
bits (including one sign bit for each message), are employed, which sums up to
87,552 bits of memory and occupies more than 75% of the core area. The parallel
layered decoding architecture only needs (p + 3) × Iter clock cycles for the decoding
process, where p is the size of the sub-matrix. In [66] and [70], this number rises to
p × (4 × Iter + 1) and p × (5 × Iter) + 12, respectively. Hence, under the same
number of iterations, the parallel layered decoding architecture could reduce the
decoding latency by approximately 75%. Moreover, the implementation results show that
the maximum operating frequencies with and without the critical path splitting method
are 950MHz and 305MHz, respectively, which demonstrates an improvement of the
decoder speed by a factor of 3. Combined with the layered decoding algorithm, which
doubles the convergence speed, our proposed architecture can significantly improve
the throughput of the QC-LDPC decoder, up to multi-Gbps.
Figure 3.9: Layout of the decoder core area.
Fig. 3.9 shows the layout view of the decoder core area of 1.8mm × 1.6mm
and the logic density of 70%. The design does not have a crossbar interconnection
network, since the parallel layered decoding architecture employs fixed message passing
paths. Single-port memories are generated by the Synopsys DesignWare tool and thus
flattened during the synthesis and place and route design flow. The core area of the
decoder is 2.9mm² and the estimated power consumption is 870 mW.
Table 3.1: Overall comparison between proposed decoder and other existing LDPC decoders.
             C. Liu [45]  X. Shih [66]  T. Brack [9]  G. Gentile [25]  Y. Ueng [70]    M. Karkooti [?]  Proposed Decoder
Code Length  576~2304     576~2304      576~2304      576~2304         2304            1944             2304
Frequency    150MHz       83.3MHz       333MHz        400MHz           200MHz          412MHz           950MHz
Iterations   20           2~8           10, 15        15               4.6 (average)   15               10
Throughput   105Mbps      60~220Mbps    133-928Mbps   128-746Mbps      106Mbps         736Mbps          2.2Gbps
Technology   90nm         130nm         130nm         65nm             180nm           130nm            90nm
Area         6.25mm²      8.29mm²       3.83mm²       0.59mm²          -               2.4mm²           2.9mm²
Power        264mW        52mW          -             -                -               502mW            870mW
Table 3.1 shows the decoder implementation results compared with the existing
QC-LDPC decoders. Note that the throughput values from [9, 25] are recalculated
based on the decoded bits for fair comparison with the other implementations listed in
Table 3.1. We show that the proposed decoder can achieve higher decoding throughput
with comparable or smaller chip area. The decoder consumes more power mainly
due to its high operating frequency.
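As a quick sanity check of the cycle counts quoted in this section, the sketch below evaluates the three expressions for p = 96 and 10 iterations. The throughput formula (codeword bits times clock frequency divided by cycles per codeword) is an assumption made here for illustration; it is not stated explicitly in the text, although it reproduces the quoted figure of about 2.2 Gbps.

```python
def cycles_parallel_layered(p, iters):
    # (p + 3) x Iter clock cycles for the parallel layered architecture.
    return (p + 3) * iters

def cycles_ref66(p, iters):
    # p x (4 x Iter + 1), the count quoted for [66].
    return p * (4 * iters + 1)

def cycles_ref70(p, iters):
    # p x (5 x Iter) + 12, the count quoted for [70].
    return p * (5 * iters) + 12

def throughput_bps(codeword_bits, f_hz, cycles):
    # Assumed model: one codeword is decoded every `cycles` clock cycles.
    return codeword_bits * f_hz / cycles

p, iters = 96, 10
ours = cycles_parallel_layered(p, iters)
saving_vs_66 = 1 - ours / cycles_ref66(p, iters)
saving_vs_70 = 1 - ours / cycles_ref70(p, iters)
gbps = throughput_bps(2304, 950e6, ours) / 1e9
```

Under these assumptions the latency saving comes out at roughly 75% against [66] and slightly more against [70].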
3.6 Summary
LDPC codes are widely used in recent communication systems due to their superior
error-correction performance. In this chapter, we proposed a new architecture to
improve the throughput of QC-LDPC decoders. With the parallel layered decoding
architecture and the critical path splitting technique, the decoder implementation, which
uses TSMC 90nm technology, can achieve 2.2 Gbps decoding throughput for selected
rate-1/2 irregular QC-LDPC codes. In addition, min-sum and loosely coupled
algorithms are employed for area efficiency and the core size is 2.9mm². The proposed
parallel layered decoding architecture is suitable for near-capacity channel codes
and can meet the increasing demand for high-data-rate communication systems.
Chapter 4
High-Throughput Rate-Compatible LDPC Decoder Architecture
4.1 Introduction
Recently, there has been a growing interest in rate-compatible (RC) LDPC codes and
their applications. On time-varying channels, it is desirable to adjust the FEC code rate
according to the channel state information (CSI). Rate compatibility is also
desirable for communications in the type-II hybrid automatic-repeat-request (ARQ)
protocols [42]. The concept of RC LDPC codes was first proposed in [32, 30, 31] by
puncturing the parity bits of a low-rate randomly constructed LDPC code (called
the mother code). The punctured codes showed good error performance, but the
complicated puncturing algorithm and large block size were not practical for
hardware implementation. Later, finite-length puncturing patterns were proposed in [29]
by introducing the definition of the k-step recovery (k-SR) variable node. Based on the
k-SR theory, several studies were presented on puncturing schemes for QC LDPC
codes with dual-diagonal parity structure [13, 58]. Nevertheless, the hardware design
of RC LDPC decoders has not yet been well studied.
From the hardware implementation perspective, there is much existing research
on VLSI implementation of LDPC decoders for single-rate codes [85, 12, 51,
86, 77, 17, 16] and multi-rate codes [52, 44, 81, 68, 84, 46, 66]. With the increasing
demand for high-data-rate wireless applications, high-throughput LDPC decoder
architectures are needed. In our previous work presented in Chapter 3 (also see
[83]), we proposed a parallel layered decoding architecture that can provide up to 1
Gbps input throughput owing to the concurrent processing of all layers and the split
critical path, which results in a much faster clock speed. However, the parallel layered
decoding architecture has a major drawback. Because it uses fixed connections among
layers, the custom-designed decoder only fits one specific code. That provides the
motivation for us to conduct further research to solve the flexibility problem.
In this chapter, we investigate the puncturing schemes for rate-compatible LDPC
codes and their hardware implementations. For QC LDPC codes with dual-diagonal
parity structure, an efficient puncturing scheme is selected which includes the
weight-3 vertical sub-matrix in the puncturing block. As a case study using the rate-1/2
WiMax LDPC code, we show that the selected puncturing scheme results in a
bit-error-rate (BER) performance degradation of less than 0.2dB compared with
dedicated WiMax codes at four different code rates. Subsequently, we incorporate
the puncturing scheme into the parallel-layered-decoding based architecture for the
design of an RC LDPC decoder. The decoder is implemented in a CMOS 90 nm process
and can achieve an input throughput of 975 Mbps at 10 iterations. It supports any
arbitrary rate between the rate of the mother code and 1.
The rest of this chapter is organized as follows. Section 4.2 introduces the
background of RC LDPC codes and a comparison of different puncturing schemes. The
high-throughput RC LDPC decoder architecture is presented in Section 4.3.
Section 4.4 gives the implementation results by comparing with some existing WiMax
LDPC decoders, followed by summary and future work in Section 4.5.
4.2 Puncturing Schemes for Rate-Compatible LDPC Codes
4.2.1 Quasi-Cyclic LDPC Codes with Dual-Diagonal Parity Structure
As introduced in Section 2.2, QC LDPC codes are a special class of structured
LDPC codes which are well suited for hardware implementation. In [56], a special
class of systematic codes is defined based on QC LDPC codes. The base matrix
$H_b$ can be partitioned into two parts, the systematic bits matrix $H_s$ on the left and
the parity bits matrix $H_p$ on the right, such that
$H_b = \left[ (H_s)_{m_b \times k_b} \;\vdots\; (H_p)_{m_b \times m_b} \right]$,
where $k_b = n_b - m_b$. The parity bits matrix $H_p$ can be further partitioned into two
sections: the leftmost column $H_o$ is a weight-3 matrix and the remaining columns
$H_d$ form a dual-diagonal matrix.
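\[
H_p = \left[\, H_o \mid H_d \,\right] =
\left[\begin{array}{c|ccccc}
P_{b_0} & I       & 0       & \cdots & 0             & 0 \\
0       & P_{b_1} & I       & \cdots & 0             & 0 \\
\vdots  & 0       & P_{b_2} & \cdots & 0             & 0 \\
P_p     & \vdots  & \vdots  & \cdots & \vdots        & \vdots \\
\vdots  & \vdots  & \vdots  & \cdots & I             & 0 \\
0       & 0       & 0       & \cdots & P_{b_{m_b-2}} & I \\
P_q     & 0       & 0       & \cdots & 0             & P_{b_{m_b-1}}
\end{array}\right]
\tag{4.1}
\]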
We refer to such a structure as the dual-diagonal parity structure. The original purpose
of this structure in [56] was fast encoding, because it guarantees
linear-time encoding. Consequently, the dual-diagonal parity structure
has been adopted in many LDPC codes, including those in the WiMax standard
[4]. In WiMax codes, the secondary diagonal elements
$P_{b_1}, P_{b_2}, \cdots, P_{b_{m_b-1}}$ are also
identity matrices. Then the connections between the parity bit nodes and the check
nodes become zigzag edges. In this chapter, we focus on the puncturing schemes
and decoder design for rate-compatible LDPC codes with such dual-diagonal parity
structures.
4.2.2 Rate-Compatible LDPC Codes
Rate compatibility can be achieved by puncturing the parity bits of the mother code.
If the belief-propagation (BP) algorithm [24] is employed for decoding, puncturing can
be carried out by simply setting the log-likelihood ratios (LLRs) of the punctured
bits to zero (in the logarithmic domain). Several puncturing schemes have been
proposed for randomly constructed LDPC codes [32, 30, 31] and for short-length LDPC
codes [29]. The punctured codes, though at different code rates, originate from
the same mother code. Therefore, only one encoder and one decoder are needed.
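The zero-LLR treatment of punctured bits can be illustrated with a short sketch. The function name `depuncture_llrs` and its arguments are hypothetical; the only idea taken from the text is that punctured positions of the mother codeword receive an LLR of zero while the received LLRs fill the remaining positions in order.

```python
def depuncture_llrs(received_llrs, punctured_positions, n):
    """Rebuild a length-n LLR vector for the mother code.

    received_llrs:       channel LLRs of the transmitted (unpunctured) bits.
    punctured_positions: indices in 0..n-1 that were not transmitted.
    """
    punctured = set(punctured_positions)
    if len(received_llrs) != n - len(punctured):
        raise ValueError("received length does not match puncturing pattern")
    it = iter(received_llrs)
    # A punctured bit carries no channel information, hence LLR = 0.
    return [0.0 if i in punctured else next(it) for i in range(n)]


full = depuncture_llrs([1.5, -2.0, 0.7], punctured_positions=[1, 3], n=5)
```

Because the decoder always sees a full-length LLR vector, the same decoder can serve every punctured rate.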
A major advantage of using rate-compatible codes is the reduction in storage
memory. For instance, the WiMax standard lists six different LDPC codes for four
different rates (1/2, 2/3A, 2/3B, 3/4A, 3/4B, 5/6). A typical WiMax LDPC encoder/decoder
design [9, 25, 66, 45] must store the parity-check matrices of all six codes. In
addition, it also requires a complicated switching network that can change the
connections among VNUs and CNUs for each different code. Rate-compatible codes
avoid the storage overhead as well as the potential latency problem caused by the
large switching network. In communication systems, rate-compatible codes are also
suitable for the hybrid ARQ protocols [42], because the transmitter can add
redundancies progressively by puncturing the mother code based on the CSI.
Figure 4.1: Description of 1-SR node and k-SR node.
Since rate-compatible codes have the many advantages listed above, a prominent
question is: can they provide error performance comparable to the unpunctured
codes? For fair comparison, the unpunctured codes refer to dedicated codes
with the same length as the punctured codes. Before presenting the detailed puncturing
schemes, we first introduce the definition of the k-SR variable node [29].
A punctured variable node v is called 1-step recoverable (1-SR) if there is at least
one connected check node c, called the survived check node, such that all other variable
nodes connected to c, except for v, are not punctured. It is called 1-SR because such a
punctured variable node can be recovered in one iteration on the binary erasure channel
(BEC). An example of a 1-SR node is shown in Fig. 4.1(a). The punctured variable
node v1 is connected to two check nodes, i.e., nodes c1 and c3. One of these two
check nodes, node c1, has four neighboring variable nodes, namely v1, v2, v6 and
v7. Except for node v1, all other variable nodes, v2, v6 and v7, are not punctured.
Therefore, variable node v1 is called a 1-SR node.
The definition of a k-SR node can be extended from that of a 1-SR node: a punctured
variable node is k-SR if at least one connected check node c, called the survived check
node, contains one or more (k − 1)-SR variable nodes among its other neighbors while the
remaining ones are m-SR nodes, where 0 ≤ m < k − 1 [29].
On BEC, a k-SR node can be recovered in the kth iteration. Obviously, a $k_1$-SR node
is more reliable than a $k_2$-SR node if $k_1 < k_2$. An illustration of a k-SR node is shown
in Fig. 4.1(b).
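The k-SR classification can be sketched as an iterative peeling procedure on the Tanner graph. The sketch below is an illustration, not the procedure of [29]: it assumes that unpunctured nodes count as 0-SR, and the small graph used in the example is hypothetical (only the check {v1, v2, v6, v7} mirrors the 1-SR example above).

```python
def recovery_steps(checks, punctured):
    """Return {variable: k} for every punctured variable that is k-SR.

    checks:    list of check nodes, each a list of connected variable indices.
    punctured: iterable of punctured variable indices.
    Variables missing from the result can never be recovered by peeling.
    """
    punctured = set(punctured)
    level = {}
    k = 1
    while len(level) < len(punctured):
        newly = set()
        for neighbors in checks:
            # Punctured neighbors not yet recovered in earlier steps.
            unresolved = [v for v in neighbors
                          if v in punctured and v not in level]
            if len(unresolved) == 1:
                # This check "survives" for its single unresolved neighbor.
                newly.add(unresolved[0])
        if not newly:
            break
        for v in newly:
            level[v] = k
        k += 1
    return level


# Hypothetical graph: check 0 is {v1, v2, v6, v7}; the others are made up.
levels = recovery_steps([[1, 2, 6, 7], [1, 3, 4], [3, 5]], punctured={1, 3, 4})
```

In this toy graph v1 and v3 are 1-SR, while v4 becomes recoverable only after both of them are known and is therefore 2-SR.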
4.2.3 Selected Puncturing Scheme
Since a $k_1$-SR node is more reliable than a $k_2$-SR node if $k_1 < k_2$, we first maximize
the size of the 1-SR group $G_1$ in order to minimize the error performance loss of
the punctured codes. Then we try to maximize the size of the 2-SR group $G_2$, and
so on. In other words, for the groups $\{G_1, G_2, \cdots, G_n\}$ the puncturing scheme will
puncture the lowest-indexed group first.
The puncturing procedure can be divided into two steps: (1) the punctured block
selection and (2) the punctured bits selection inside a block.
First, we investigate how to select the punctured blocks. As an example, we
present an efficient puncturing scheme by puncturing the rate-1/2 2304-bit mother
code from the WiMax standard [4], as shown in Fig. 4.2. It includes 1152 parity bits
which are partitioned into 12 blocks with a block size of 96 bits. Due to the zigzag
pattern of the parity structure, the grouping of k-SR nodes becomes easy to handle
[13]. For example, if a rate-2/3 LDPC code is obtained by puncturing the mother
code, half of the parity bits, i.e., 6 blocks (or 576 bits), are punctured. Table 4.1 shows
three puncturing schemes in which the 6 punctured blocks are all 1-SR nodes. PBI
denotes the punctured block index and SC denotes the number of survived checks
corresponding to each punctured block.
Figure 4.2: Parity-check matrix for the selected rate-1/2 LDPC code in WiMax.
Despite the same number of 1-SR nodes, the error performances of the three
schemes still differ from each other, as shown in Fig. 4.3. The number of survived
checks, as listed in Table 4.1, has an impact on the error performance of the
punctured code. Scheme 1 and scheme 3 both have more survived checks than scheme
2. Therefore, the BER performances of scheme 1 and scheme 3 are better than that
of scheme 2.
Scheme 1 is the puncturing scheme proposed in [58], in which the weight-3 block,
i.e., $H_o$, is not punctured. In contrast, scheme 3 selects the weight-3 block for
puncturing. Simulation results in Fig. 4.3 show that scheme 3 has better error
performance than scheme 1 over AWGN channels. This is largely because the punctured
weight-3 variable nodes have more neighboring checks, which can provide more
information during the decoding iterations. Thus, puncturing the weight-3 block as
in scheme 3 is recommended.
Figure 4.3: BERs of the three punctured codes and the dedicated code at rate 2/3
over AWGN channels. (Plot: BER versus SNR per bit, rate 2/3, 10 iterations with
layered decoding; curves: Scheme 1, Scheme 2, Scheme 3, Dedicated.)
Table 4.1: Three puncturing schemes for achieving rate 2/3 from rate 1/2 mother
code
          Scheme 1     Scheme 2     Scheme 3
          PBI   SC     PBI   SC     PBI   SC
          1     2      0     1      0     2
          3     2      1     1      2     2
          5     2      3     1      4     2
          7     2      4     1      6     1
          9     2      6     1      8     2
          11    2      7     1      10    2
Total SC        12           6            11
Table 4.2 shows the selected puncturing scheme when
1, 2, . . . 12 blocks are punctured, which corresponds to a group of code rates of
12/23, 12/22, . . . 12/12. Here we refer to this group of rates as block rates.
Table 4.2: Index of punctured blocks at different desired rates
Num  Rate           Pun Bits      Pun Blk Idx (PBI)            Surv Blk Idx (SBI)
1    1/2 - 12/23    1 - 96        0                            0,1,2,3,4,5,6,7,8,9,10,11
2    12/23 - 6/11   97 - 192      0,10                         1,2,3,4,5,6,7,8,9,10,11
3    6/11 - 4/7     193 - 288     0,2,10                       1,3,4,5,6,7,8,9,10,11
4    4/7 - 3/5      289 - 384     0,2,8,10                     1,3,4,5,6,7,9,10,11
5    3/5 - 12/19    385 - 480     0,2,4,8,10                   1,3,5,6,7,9,10,11
6    12/19 - 2/3    481 - 576     0,2,4,6,8,10                 1,3,5,7,9,10,11
7    2/3 - 12/17    577 - 672     0,2,4,6,8,9,10               1,3,5,7,9,11
8    12/17 - 3/4    673 - 768     0,1,2,4,6,8,9,10             1,3,5,7,11
9    3/4 - 4/5      769 - 864     0,1,2,4,5,6,8,9,10           3,5,7,11
10   4/5 - 6/7      865 - 960     0,1,2,4,5,6,7,8,9,10         3,7,11
11   6/7 - 12/13    961 - 1056    0,1,2,3,4,5,6,7,8,9,10       3,11
12   12/13 - 1      1057 - 1152   0,1,2,3,4,5,6,7,8,9,10,11    11
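The block rates follow directly from the code parameters: puncturing k blocks of 96 parity bits from the 2304-bit, rate-1/2 mother code leaves 1152 information bits in 2304 − 96k transmitted bits. A minimal sketch (the constant and function names are chosen here for illustration):

```python
from fractions import Fraction

N, K, Z = 2304, 1152, 96  # mother code length, information bits, block size

def block_rate(num_blocks):
    # Code rate after puncturing `num_blocks` whole parity blocks.
    return Fraction(K, N - num_blocks * Z)

rates = [str(block_rate(k)) for k in range(1, 13)]
```

The resulting list reduces 12/22, 12/21, and so on to lowest terms, which gives the rate boundaries used for the ranges in Table 4.2.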
If the desired rate is between two consecutive block rates, we can puncture a
block partially by taking some bits out of a block. In the Pun Blk Idx (PBI) column
of Table 4.2, the normal numbers indicate that the entire blocks are punctured and an
italic number indicates the designated block which may be punctured partially if
needed. The punctured bits within that block are selected based on the following
procedure [58]. In order to disperse the punctured bits within a block, a special
sequence $u_z$ is generated recursively from $u_1$ as in (4.2) through (4.3) and (4.4),
where z is the size of the submatrix.
\[ u_1 = \{0\}, \tag{4.2} \]
\[ u_{2k} = \{u_k(0),\; u_k(0)+k,\; u_k(1),\; u_k(1)+k,\; \cdots,\; u_k(k-1),\; u_k(k-1)+k\}, \tag{4.3} \]
\[ u_{2k+1} = \{k,\; u_k(0),\; u_k(0)+k+1,\; \cdots,\; u_k(k-1),\; u_k(k-1)+k+1\}. \tag{4.4} \]
Next, the actual punctured bit sequence $u'_z$ is adjusted from $u_z$ based on the
values of $b_0$, $l$ and $q$ using (4.5), where $l$ is the row number of $P_p$, $b_0$ is the
permutation value of $P_{b_0}$ and $q$ is the permutation value of $P_q$ in the weight-3
submatrix $H_o$ from (4.1). For the selected rate-1/2 WiMax code, z = 96, q = 7 and l = 6.
\[
u'_z =
\begin{cases}
\operatorname{mod}(b_0 u_z,\, z) & \text{if } \mathrm{PBI} \le l \\
\operatorname{mod}((z - q)\, u_z,\, z) & \text{if } \mathrm{PBI} > l
\end{cases}
\tag{4.5}
\]
Note that $u'_z$ is a sequence of length z, and each element $u'_z(i)$, i = 0, 1, ..., z − 1,
indicates the column index of the bit to be punctured within the block. In practice,
the sequence is computed and stored in a LUT. As soon as the number of punctured
bits is calculated from the given rate, the indices for the punctured blocks and the
bits of the partially punctured block can be looked up instantly. More details will
be discussed in the decoder implementation.
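The recursion (4.2) through (4.4) and the adjustment (4.5) can be sketched as follows. The function names are hypothetical, and the value of b0 is not given in this section, so it is left as a caller-supplied parameter; the example only exercises the PBI > l branch with the quoted values z = 96, q = 7 and l = 6.

```python
def base_sequence(z):
    """Recursively build u_z following (4.2)-(4.4)."""
    if z == 1:
        return [0]                      # (4.2)
    k, odd = divmod(z, 2)
    half = base_sequence(k)
    out = [k] if odd else []            # (4.4) starts with the element k
    shift = k + 1 if odd else k
    for v in half:                      # interleave u_k(i) and u_k(i) + shift
        out += [v, v + shift]
    return out

def adjusted_sequence(z, pbi, b0, q, l):
    """Apply (4.5): multiply by b0 or (z - q), modulo z."""
    mult = b0 if pbi <= l else (z - q)
    return [(mult * x) % z for x in base_sequence(z)]

u4 = base_sequence(4)
# b0=1 is a placeholder only; it is unused on the PBI > l branch taken here.
u96_adj = adjusted_sequence(96, pbi=8, b0=1, q=7, l=6)
```

For z = 96 both the base and the adjusted sequence visit every column index exactly once, so taking the first n entries spreads n punctured bits across the block.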
4.3 High-Throughput Rate-Compatible LDPC Decoder Architecture
4.3.1 Summary of the Parallel Layered Decoding Architecture
LDPC codes can be effectively decoded using the belief-propagation (BP) algorithm [24].
Two phases of messages, check-to-variable (CTV) messages and variable-to-check
(VTC) messages, are transmitted along the edges of the Tanner graph to update each
other iteratively. The min-sum algorithm and modified min-sum algorithm [23, 26, 10]
have been introduced to reduce the complexity of CTV message updating.
The layered decoding algorithm [51, 9, 25, 45, 33, 28] has been adopted to reduce the
number of iterations by a factor of two, compared with the standard BP algorithm.
Hence, the decoding throughput can be improved without any bit error performance
loss. In the BP algorithm, VTC updates do not start until all of the CTV messages are
received and vice versa. In the horizontal layered decoding algorithm, the updated CTV
messages from the current layer are passed vertically to all layers below for the same
variable node. In each iteration, the horizontal layers are processed sequentially from
the top to the bottom layer.
The overall decoding procedure for an m × n parity-check matrix with the min-sum
and layered decoding algorithms is summarized as follows:
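\[ q_{j,n_j}[k] = \Lambda_{n_j}[k] - r_{j,n_j}[k], \quad n_j \in N(j) \tag{4.6} \]
\[
r_{j,n_j}[k] = \left[ \prod_{n'_j \in \{N(j) \setminus n_j\}} \operatorname{sign}\left( q_{j,n'_j}[k] \right) \right]
\times \left( \min_{n'_j \in \{N(j) \setminus n_j\}} \left\{ \left| q_{j,n'_j}[k] \right| \right\} \times \alpha \right)
\tag{4.7}
\]
\[ \Lambda_{n_j}[k] = q_{j,n_j}[k] + r_{j,n_j}[k] \tag{4.8} \]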
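The three update equations for a single check row can be sketched in floating point as follows. This is an illustration only: the scaling factor alpha = 0.75 is an assumed value, zero is treated as positive when taking signs, the row must have at least two neighbors, and no quantization is modelled.

```python
def update_check_row(app, r_old, alpha=0.75):
    """One layered min-sum update for a single check node j.

    app:   current APP values (Lambda) of the variable nodes in N(j).
    r_old: stored CTV messages r for the same edges.
    Returns (updated APP values, updated CTV messages).
    """
    # (4.6): VTC messages, removing the old CTV contribution.
    q = [a - r for a, r in zip(app, r_old)]
    r_new = []
    for i in range(len(q)):
        others = q[:i] + q[i + 1:]          # N(j) without the target edge
        sign = 1.0
        for x in others:
            if x < 0:
                sign = -sign
        # (4.7): product of signs times scaled minimum magnitude.
        r_new.append(sign * min(abs(x) for x in others) * alpha)
    # (4.8): new APP values.
    app_new = [qi + ri for qi, ri in zip(q, r_new)]
    return app_new, r_new


app_new, r_new = update_check_row([2.0, -1.0, 4.0], [0.0, 0.0, 0.0])
```

A hardware CNU would typically track only the two smallest magnitudes instead of recomputing the minimum per edge; the explicit loop is kept here to mirror (4.7) term by term.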
Details of the layered decoding algorithm can be found in [51, 9, 25, 45, 33,
28]. However, the traditional layered decoding algorithm processes layers in sequential
order, which results in a large decoding latency per iteration. A method of increasing
parallelism inside a layer is proposed in [51, 45], but all layers are still processed
in series. Although decoding throughput can be improved, this method introduces
crossbar-based interconnection networks that increase the hardware complexity.
In Chapter 3 (also see [83]), a parallel layered decoding architecture was proposed
which allows all layers to be processed in parallel. Each layer has an individual CNU
which generates and sends updated messages and at the same time also receives
the updated messages from other layers. Unlike the method proposed in [51, 45], the
parallel layered decoding architecture uses parallel processing among all layers and
serial processing within each layer. In parallel layered decoding, the message passing
routes among layers are based on their permutation values in the parity-check
matrix. Fig. 3.3 shows the permutation values in the parity-check matrix of the rate-1/2
WiMax code. For each vertical block, we first sort the permutation values of all layers in
descending order. Subsequently, we designate each layer to pass its message to another
layer which has the next smaller permutation value in the same column. Finally, the
layer with the smallest permutation value loops back and connects to the layer with
the largest value. This message passing scheme guarantees that the updated messages
among all layers are processed progressively within each iteration. This designated
message passing scheme works well, because the rows within each layer are processed
sequentially and the updated messages are passed only to the layer that is going to
process the same column next.
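The routing rule can be sketched as a sort followed by a cyclic link. The function name is hypothetical, and the example values 74, 13 and 28 are the starting columns quoted for CNU 4, CNU 9 and CNU 12 in Section 3.4.3, used here as stand-ins for the offset-modified permutation values of one block column (an assumption for illustration). Distinct values per column are assumed.

```python
def message_routes(perm_by_layer):
    """Map each layer to the layer it passes its messages to.

    perm_by_layer: {layer index: permutation value} for one block column.
    Layers are ordered by descending permutation value; each layer sends to
    the next smaller one and the smallest loops back to the largest.
    """
    order = sorted(perm_by_layer, key=perm_by_layer.get, reverse=True)
    return {src: order[(i + 1) % len(order)] for i, src in enumerate(order)}


routes = message_routes({4: 74, 9: 13, 12: 28})
```

With these values the result is the cycle Layer 4 to Layer 12 to Layer 9 and back to Layer 4, which agrees with the route described for the first column in Section 3.4.3.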
It is worth mentioning that we have added an offset to the permutation
values of each layer as in Fig. 3.3, which is equivalent to having the CNUs
start processing from different rows (instead of all starting from row 1). From the
decoding perspective, changing the processing order in each layer affects neither
the performance nor the throughput of a decoder. In fact, the offset values are carefully
selected such that the difference of the modified permutation values between any
two layers is at least 5. It means that each layer (or CNU) has a time span of 5
clock cycles to read, update, store and then pass the message to the next connected
layer. The main advantage is that we can design the CNUs, which are usually the
critical path in decoder design, as a 5-stage pipeline architecture. This is called the
critical path splitting technique in Chapter 3, which reduces the delay of the critical
path and thus improves the clock speed and decoding throughput.
The major issue of the aforementioned parallel layered decoding architecture is
that the concurrent message passing routes among all layers are fixed and optimized
for one specific code. As mentioned in Section 4.1, rate-compatible LDPC codes
or at least multiple rates are desirable for communication systems on time-varying
channels. Thus, we investigate various puncturing schemes and incorporate rate
compatibility into the parallel layered decoding architecture to provide the much needed
flexibility, without sacrificing its advantage of high throughput.
Figure 4.4: Overall architecture of a rate-compatible LDPC decoder.
4.3.2 RC LDPC Decoder Design
The overall architecture of the RC LDPC decoder is shown in Fig. 4.4. It consists of
CNUs, an LLR initialization block, APP memory banks, CTV memory banks and bit
decision units. It is similar to a regular parallel layered decoding architecture (PLDA)
design except for the initialization of the APP memory. Using the rate-1/2 WiMax code
as an example, the entire $H_b$ matrix is divided into 12 layers and each layer has a
dedicated CNU. APP memory banks and CTV memory banks are used to store APP messages
and CTV messages. Each nonzero element in the base parity-check matrix corresponds to
one APP memory unit and one CTV memory unit. Each APP memory exports
APP messages to the CNU and each CTV memory also exports CTV messages
to the CNU. The CNU first calculates the VTC messages (as indicated in (4.6)),
then calculates the updated set of CTV and APP messages (as indicated in (4.7)
and (4.8)), and finally writes the updated CTV messages to the CTV memory banks and
the updated APP messages to the APP memory banks. Therefore, the CNU becomes the
critical path of the PLDA, which limits the maximum frequency.
However, as indicated in Section 4.3.1, an interval of 5 clock cycles is available
for the APP message passing from one layer to another if the offset-modified
parity-check matrix in Fig. 3.3 is used. Considering the memory writing operation, which
costs one clock cycle, a split CNU with 4 pipeline stages can be designed by
inserting registers into the original CNU, resulting in a reduced critical path delay
and an improved maximum frequency.
The puncturing scheme is implemented by patching the LLRs of the punctured
parity bits to 0, which is called LLR initialization. The length of the actual
received codeword is smaller than that of the original mother code because of the
punctured bits. For the selected puncturing scheme, the punctured blocks and bits
listed in Table 4.2 should be initialized to zero.
Figure 4.5: Architecture of LLR Initialization Block. (Blocks shown: Comparators,
Adder, Address Generator, 12×12 LUT, Memory Banks.)
Fig. 4.5 shows the architecture of the LLR Initialization Block, including a
group of Comparators, a Decoder, an Address Generator and a 12×12 look-up table
(LUT). The LLRs from the channel are stored in the assigned memory banks and the
others are set to 0. Based on the number of punctured bits (n_pun), the
rate of the code can be deduced using a group of Comparators and an Adder. The
thresholds of the comparators are set to the multiples of the sub-matrix size
z, which is 96 for the mother code. Each comparator compares n_pun with one
threshold and returns 1 if n_pun is greater and 0 otherwise. In total there are
11 comparison results. These results are sent to a decoder to determine the range
of the punctured code. Here the decoder is simply composed of a group of adders
which add all of the comparison results to get the idx_rate signal. For example, if
the number of punctured bits is smaller than 96, then each comparator returns
0 and idx_rate is therefore 0. This corresponds to row 1 in Table 4.2 and the rate
lies between 1/2 and 12/23. The Address Generator generates addresses for the
memory banks, as well as for the 12 × 12 LUT. Signal idx_blk denotes which memory
bank is being written with the LLRs at a given time instant. The 12 × 12 LUT stores
the enable signals for the memory banks to decide which memory should be written
with LLRs at a given time instant. Thus, the contents of the 12 × 12 LUT represent the
puncturing bit selection as in Table 4.2.
Figure 4.6: Architecture of the Address Generator. (Blocks and signals shown: Clock
Accumulator, LUT, equality comparator; n_parity, idx_rate, clk, idx_blk, addr_mem.)
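The comparator-and-adder rate detection can be mimicked in software as a sum of threshold comparisons. This is a behavioural sketch only; the names `THRESHOLDS` and `idx_rate` are chosen here to mirror the signal names in the text and do not describe the actual RTL.

```python
Z = 96                                        # sub-matrix size of the mother code
THRESHOLDS = [Z * i for i in range(1, 12)]    # 96, 192, ..., 1056 (11 comparators)

def idx_rate(n_pun):
    # Each comparator outputs 1 when n_pun exceeds its threshold;
    # the adder sums the 11 outputs into the rate-range index.
    return sum(1 for t in THRESHOLDS if n_pun > t)

examples = {n: idx_rate(n) for n in (50, 96, 97, 576, 1152)}
```

An index of 0 selects row 1 of Table 4.2, and for instance n_pun = 576 yields index 5, i.e. row 6 (481 to 576 punctured bits).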
The detailed design of the Address Generator is shown in Fig. 4.6. The core
component is an accumulator which accumulates every clock cycle and outputs two
signals, idx_blk and addr_mem, one for the 12 × 12 LUT to select the enable
signal and another for the memory banks as the write address. However, it can be
observed from Table 4.2 that for every rate range, only one vertical block is partially
punctured and the rest are entirely punctured. In other words, the clock
accumulator advances every 96 cycles, except for the partially punctured block, where
it advances after a smaller number of cycles determined by the number of
punctured bits in this block. Therefore, a LUT is used here to store the index of the
partially punctured block based on Table 4.2, in order to signal the clock
accumulator to advance at a different step when it reaches the partially punctured
block. The content of the LUT is exactly the italic numbers in Table 4.2.
4.4 Experimental Results
For the experimental study, we implement the selected puncturing scheme for the WiMax
LDPC codes. We choose the rate-1/2 LDPC code in the WiMax standard as the
mother code. Numerical simulations are performed to compare the BER performance
of the punctured codes and the dedicated codes at three different rates.
Furthermore, the rate-compatible LDPC decoder is developed based on the parallel
layered decoding architecture and then implemented using a standard cell ASIC design
flow.
4.4.1 Simulation Results for Punctured WiMax Codes
The BER performance of a group of punctured LDPC codes is presented in Fig.
4.7. The rate of the mother code is 1/2 and the code length is 2304 bits. Five different
rates are generated using the selected puncturing scheme, i.e., 3/5, 2/3, 3/4, 5/6
and 6/7. The corresponding numbers of punctured bits are 384, 576, 768, 922 and
960. Thus the code lengths of the punctured codes are 1920, 1728, 1536, 1382 and
1344, respectively.
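The punctured bit counts can be reproduced from the target rates: with K = 1152 information bits and a mother code length of N = 2304, a rate R needs N − K/R punctured bits. The sketch below rounds to the nearest integer, which is an assumption (the text does not state the rounding rule; it only matters for rate 5/6, where the exact value is 921.6).

```python
from fractions import Fraction

N, K = 2304, 1152  # mother code length and number of information bits

def punctured_bits(rate):
    # Number of parity bits to puncture to reach (approximately) `rate`.
    return round(N - K / rate)

targets = [Fraction(3, 5), Fraction(2, 3), Fraction(3, 4),
           Fraction(5, 6), Fraction(6, 7)]
n_pun = [punctured_bits(r) for r in targets]
lengths = [N - n for n in n_pun]
```

Only rate 5/6 is approximated; the other four rates are hit exactly.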
To verify the selected puncturing scheme, dedicated LDPC codes at rates 2/3,
3/4 and 5/6 from the WiMax standard are simulated to compare their performance
with the punctured codes, as also shown in Fig. 4.7. At each rate, a specific mode is
selected from the 19 modes of each WiMax code to make the selected code lengths
of the dedicated codes equal or similar to those of the punctured codes. The code
lengths of the dedicated codes at rates 2/3, 3/4 and 5/6 are adjusted to 1728, 1536
and 1344, equivalent or similar to those of the corresponding punctured codes, whose
code lengths are 1728, 1536 and 1382. Fig. 4.7 shows that the BER of the punctured
code is very close to that of the dedicated code, with less than 0.2dB performance loss at
a BER of $10^{-5}$.
Figure 4.7: BERs of the punctured LDPC codes over AWGN channels. (Plot: BER versus
SNR per bit, 10 iterations with layered decoding; curves: mother code at rate 1/2,
punctured codes at rates 3/5, 2/3, 3/4, 5/6 and 6/7, dedicated codes at rates 2/3, 3/4
and 5/6.)
4.4.2 Hardware Implementation Results
In order to demonstrate the combined system performance of the selected rate-compatible
LDPC codes and the PLDA architecture, we implement the rate-compatible LDPC
decoder in TSMC 90 nm technology with 8 layers. Synthesis and
core-area place and route are completed using the standard Synopsys tools.

Implementation results show that the decoder can operate at a maximum frequency
of 838 MHz after synthesis, which corresponds to a constant input throughput
of 975 Mbps for all code rates. Fig. 4.8 shows the layout view of the decoder,
with a core area of 1.96 mm^2 and a logic density of 70%. Read/write memories
are generated by the Synopsys DesignWare tool and are thus flattened during synthesis and
place and route. The estimated power consumption of the decoder core is
650 mW at a clock frequency of 838 MHz.

[Figure 4.8 layout labels: APP memory bank; CUs and other logic blocks; LLR
initialization block.]
Figure 4.8: Layout of the proposed decoder chip.
We also compare the proposed RC LDPC decoder design with several
existing WiMax LDPC decoder implementations, as listed in Table 4.3. For a fair
comparison, we first scale all the decoders to the 65 nm technology node. Then, a
metric called the throughput-to-area ratio (TAR) is introduced to show how much
throughput a decoder can achieve per unit area. Table 4.3 shows that the proposed
decoder provides higher throughput using a smaller chip area. More interestingly,
the proposed decoder design can provide any code rate between 1/2 and
1, as opposed to only 4 selected rates in the existing WiMax LDPC decoders.
Table 4.3: Overall comparison between the proposed decoder and other existing WiMax LDPC decoders
                             C. Liu [44]         T. Brack [9]        X. Shih [66]        C. Liu [45]         Proposed
Supported rates              1/2, 2/3, 3/4, 5/6  1/2, 2/3, 3/4, 5/6  1/2, 2/3, 3/4, 5/6  1/2, 2/3, 3/4, 5/6  Any rate between 1/2 and 1
Frequency (MHz)              300                 333                 83.3                150                 1100
Iterations                   20                  10, 15              2~8                 20                  10
Throughput (Mbps)            212                 83-155              60~220              105                 1280
Technology (nm)              90                  130                 130                 90                  65
Area (mm^2)                  6.22                3.83                8.29                6.25                1.96
Area scaled to 65 nm (mm^2)  3.24                0.96                2.07                3.26                1.96
TAR (Mb s^-1 mm^-2)          65.4                86.5~161.5          29.0~106.3          32.2                653.1
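The scaled areas and TAR values in Table 4.3 can be cross-checked with a few lines of
arithmetic. The sketch below assumes that area scales with the square of the feature
size; the text does not state the scaling rule, but this assumption reproduces the
tabulated values. The function names are illustrative.

```python
def scale_area(area_mm2, from_nm, to_nm=65):
    """Scale chip area to another node, assuming area ~ (feature size)^2."""
    return area_mm2 * (to_nm / from_nm) ** 2

def tar(throughput_mbps, area_mm2):
    """Throughput-to-area ratio in Mb/s per mm^2."""
    return throughput_mbps / area_mm2

# (throughput in Mbps, area in mm^2, technology in nm), taken from Table 4.3
designs = {
    "C. Liu [44]": (212, 6.22, 90),
    "C. Liu [45]": (105, 6.25, 90),
    "Proposed": (1280, 1.96, 65),
}
for name, (mbps, area, node) in designs.items():
    scaled = round(scale_area(area, node), 2)  # the table rounds to two decimals
    print(f"{name}: {scaled} mm^2 at 65 nm, TAR = {tar(mbps, scaled):.1f}")
```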
4.5 Summary
This chapter presents the algorithm, design and implementation of a rate-compatible
LDPC decoder. Using the selected puncturing scheme, the BER performance of
the punctured codes is comparable with that of the dedicated codes, with less than 0.2 dB
performance degradation in simulation. In addition, rate-compatible LDPC
codes provide an ideal solution to the flexibility problem of the parallel layered
decoding architecture. Considering the WiMax standard, a rate-compatible LDPC
decoder is designed using the rate-1/2 code as the mother code. The hardware
implementation shows a maximum input throughput of 975 Mbps. Compared to a
multi-rate LDPC decoder, the rate-compatible design eliminates the memory
needed to store multiple codes and the network needed to switch among them. Therefore,
rate-compatible LDPC coders are highly desirable for advanced wireless communication
systems.
Chapter 5
Low-Complexity LDPC Decoder
Architecture for CMMB Systems
5.1 Introduction
Mobile digital TV broadcasting is an emerging area for next-generation multimedia
communications and entertainment services. Several communication standards
have already been established worldwide, including DVB-H [21] in Europe, T-DMB [2]
in Korea, and ISDB-T [3] in Japan. The China Multimedia Mobile Broadcasting
(CMMB) standard [1, 76], ratified in 2006, is the mobile television and multimedia
standard developed and specified by the State Administration of Radio, Film, and
Television (SARFT) in China. Both high data rate and high reliability are desirable
for broadcasting networks, which require high-throughput forward error correction
(FEC) codes with excellent error-correcting performance.

LDPC codes are therefore ideal candidates due to their near-Shannon-limit error
performance and inherent parallelism. They have been
chosen as the ECC for CMMB together with Reed-Solomon (RS) codes. The mobility
requirement of the CMMB standard demands an area-efficient and low-power
LDPC decoder. In recent years, many research projects have focused on
reducing the memory size and the complexity of the node processing units and
interconnection network in LDPC decoders [85, 51, 36, 78].

With the increasing popularity of mobile handheld devices, there is demand for a
low-complexity FEC decoder that is both area-efficient and low-power. In recent
years, many research projects have focused on reducing the memory size
and simplifying the interconnection network in an LDPC decoder.
Memory-efficient architectures are constructed by employing the min-sum algorithm [78]
and a memory-aware architecture [51]. Complicated interconnects can be reduced by
partially parallel architectures [85, 51] and further optimized by the loosely
coupled algorithm [36]. While layered decoding architectures addressing weight-1
matrices are presented in [45], the novelty of this chapter lies in the low-complexity
architecture design for layered decoding with weight-2 matrices.

In this chapter, we present the architecture and implementation of an LDPC
decoder for the CMMB standard. The main contributions of this chapter are as
follows: (1) a reconfigurable architecture that supports the dual-rate LDPC codes
specified in the CMMB standard is constructed with minimal overhead; (2) a
split-memory architecture is proposed to efficiently handle the weight-2 superimposed
sub-matrices. The proposed techniques take advantage of the regular structure of
both the rate-1/2 and rate-3/4 codes in the CMMB standard, and they apply well to codes
with a few weight-2 sub-matrices.
The rest of this chapter is organized as follows. The QC-LDPC code structure
in the CMMB standard and its decoding algorithms are introduced in Section 5.2.
The reconfigurable architecture of the dual-rate LDPC decoder design is presented
in Section 5.3. The memory reduction technique and simplified read/write networks are
discussed in Section 5.4. Section 5.5 shows the ASIC implementation of the decoder
and its performance results, followed by the summary in Section 5.6.

[Figure 5.1 plots: circulant H matrix, check nodes versus variable nodes; (a) rate = 1/2;
(b) rate = 3/4.]
Figure 5.1: Structure of the parity check matrix for LDPC codes in CMMB standard:
(a) rate-1/2; (b) rate-3/4.
5.2 QC-LDPC Codes in CMMB Standard
Fig. 5.1 shows the structures of the base matrix H for the LDPC codes defined
in the CMMB standard, including the rate-1/2 code and the rate-3/4 code. They
belong to the class of QC-LDPC codes that can be expanded from a base matrix,
and the size of each sub-matrix is 256 × 256. In fact, the original parity check matrix
in the CMMB standard is regular but not well structured. We obtain an equivalent
QC form, as shown in Fig. 5.1, through proper column permutations. The size of
the base matrix H is 18 × 36 for the rate-1/2 code and 9 × 36 for the rate-3/4
code. A blank element is expanded as a 256 × 256 all-zero sub-matrix. A non-blank
element is generally expanded as a 256 × 256 circularly shifted identity
matrix, also called a weight-1 sub-matrix, except for a few weight-2 sub-matrices. As
highlighted in Fig. 5.1, there is one weight-2 sub-matrix in the rate-1/2 code and
three in the rate-3/4 code. To the best of our knowledge, QC-LDPC
decoders with weight-2 sub-matrices have not been studied much. Most existing
works focus on QC codes with only weight-1 sub-matrices, e.g. [45].

Both the rate-1/2 and rate-3/4 codes in the CMMB standard are regular LDPC codes with
a code length of 9216 bits and a column weight of 3. The row weight of the rate-3/4 code
is 12, twice the row weight of the rate-1/2 code, which is 6. Therefore, the total
number of nonzero elements is exactly the same in both codes, a
property that favors the design of a unified decoder supporting the dual rates specified
in the CMMB standard.
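The equal edge count can be verified from the figures given above (base matrix sizes,
sub-matrix size, row and column weights). The following is a small sanity check, with
an illustrative function name, that counts the nonzero elements once by columns and
once by rows.

```python
SUB_MATRIX_SIZE = 256  # each base-matrix element expands to 256 x 256

def nonzeros(block_rows, block_cols, row_weight, col_weight, l=SUB_MATRIX_SIZE):
    """Return the nonzero count of a regular H, counted by rows and by columns."""
    by_rows = block_rows * l * row_weight
    by_cols = block_cols * l * col_weight
    return by_rows, by_cols

half = nonzeros(block_rows=18, block_cols=36, row_weight=6, col_weight=3)
three_quarters = nonzeros(block_rows=9, block_cols=36, row_weight=12, col_weight=3)
print(half, three_quarters)  # both pairs hold the same edge count
```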
5.3 Dual-rate Decoder Design
Before constructing the functional units of the decoder, we first perform fixed-point
simulations to determine the word length of the messages. Word-length quantization
is a tradeoff between memory size and bit-error-rate performance. Fig. 5.2
shows the BER performance of the LDPC codes in CMMB with CTV messages represented
in floating point and with word length (3, 1), which corresponds to 3 bits for the
integer part and 1 bit for the fractional part. Including a sign bit, we choose a 5-bit
fixed-point representation for the CTV messages in the design. APP messages use a
6-bit representation in our design, including a sign bit, 4 bits for the integer part and 1
bit for the fractional part.

[Figure 5.2 plot: bit error rate versus SNR per bit; BPSK, AWGN, 15 iterations. Curves:
fixed (3,1) and floating point for rate 1/2; fixed (3,1) and floating point for rate 3/4.]
Figure 5.2: BER performance for different rates and quantization schemes.
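To make the (integer bits, fractional bits) notation concrete, the sketch below
quantizes a real-valued message to such a format. The text does not specify the number
representation or rounding mode, so the sign-magnitude format, round-to-nearest and
saturation used here are assumptions chosen for illustration.

```python
def word_length(int_bits, frac_bits):
    """Total bits including one sign bit."""
    return 1 + int_bits + frac_bits

def quantize(value, int_bits, frac_bits):
    """Saturating sign-magnitude quantization (assumed scheme, not from the source)."""
    step = 2.0 ** -frac_bits                 # resolution, 0.5 for one fractional bit
    max_mag = 2 ** int_bits - step           # largest representable magnitude
    # Python's round() resolves exact ties to the nearest even integer.
    mag = min(round(abs(value) / step) * step, max_mag)
    return -mag if value < 0 else mag

print(word_length(3, 1), quantize(2.3, 3, 1), quantize(-9.7, 3, 1))   # CTV format
print(word_length(4, 1), quantize(20.0, 4, 1))                        # APP format
```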
5.3.1 Overall Architecture
Fig. 5.3 shows the overall decoder architecture for CMMB LDPC codes. The CTV
memory stores the CTV messages corresponding to one row of the H
matrix of the QC-LDPC code. Instead of storing 6 CTV messages for each row
(because the rate-1/2 code has a row weight of 6), a concatenated message is
stored, consisting of four elements: the minimum magnitude; the second minimum
magnitude; the relative location index of the minimum magnitude; and the signs of
all messages in this row. Since there are 6 elements in each row of the rate-1/2 code,
only 3 bits are needed to represent the relative location index. The recovery unit
recovers the individual CTV messages from the compressed form. The APP memory
bank stores the APP messages for each column, and it loads the intrinsic messages
from the channel at initialization. The read network fetches the APP messages
from the APP memory bank according to the locations of the nonzero elements in each row.
Figure 5.3: Overall architecture of the CMMB LDPC decoder.
Adder_1 recovers the VTC message as in (??) and then sends it to the CNU. The
CNU reads the VTC messages from adder_1 and updates the CTV messages. The
updated CTV messages are saved in the CTV memory in compressed form, while
in their original form they are sent to adder_2 for updating the APP messages as defined
in (??). Subsequently, the write network stores the updated APP messages in the
APP memory bank. The decoded memory stores the output bits after the decision.
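The data flow described above (recovery unit, adder_1, CNU, adder_2) can be sketched in
software as follows. Because the referenced equations are unresolved in this copy, the
sketch assumes the standard layered min-sum form: VTC = APP - old CTV, and new APP =
VTC + new CTV. It also assumes plain min-sum without scaling or offset correction, and
that the stored signs are those of the outgoing CTV messages. All function names are
illustrative.

```python
def compress_ctv(vtc):
    """CNU: return (min1, min2, min_index, ctv_signs) for one row of VTC messages."""
    mags = [abs(v) for v in vtc]
    min_index = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min_index]
    min2 = min(m for k, m in enumerate(mags) if k != min_index)
    negative = [v < 0 for v in vtc]
    parity = sum(negative) % 2 == 1
    # Sign of CTV message k is the product of all other VTC signs.
    ctv_signs = [parity != n for n in negative]
    return min1, min2, min_index, ctv_signs

def recover_ctv(compressed, k):
    """Recovery unit: rebuild CTV message k from the compressed row entry."""
    min1, min2, min_index, ctv_signs = compressed
    mag = min2 if k == min_index else min1
    return -mag if ctv_signs[k] else mag

def layer_update(app, old_compressed):
    """One row update: adder_1, CNU, adder_2 (assumed standard layered form)."""
    vtc = [a - recover_ctv(old_compressed, k) for k, a in enumerate(app)]
    new_compressed = compress_ctv(vtc)
    new_app = [v + recover_ctv(new_compressed, k) for k, v in enumerate(vtc)]
    return new_app, new_compressed

app = [2.5, -1.0, 3.0, -0.5, 4.0, 1.5]
first_pass = (0.0, 0.0, 0, [False] * 6)      # all CTV messages start at zero
new_app, stored = layer_update(app, first_pass)
print(new_app, stored)
```

With 4-bit magnitudes, a 3-bit index and 6 sign bits, such an entry occupies 17 bits per
row, consistent with the memory figures given later in Section 5.4.1.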
5.3.2 Dual-Rate CNU Design
Before describing the dual-rate CNU architecture, we first elaborate on the CNU design for
the rate-1/2 code. Fig. 5.4 can be viewed as two rate-1/2 CNUs in parallel. To
implement the min-sum algorithm, the CNU requires a 6-number comparator
to find the minimum and second minimum values of the VTC messages in each row.
Here we employ a pseudo-rank order filter (PROF) based design derived from
previous work [78], as shown in Fig. 5.5. The PROF outputs the minimum
and second minimum magnitudes of two pairs of inputs. Within each pair, the two elements
are compared and the one with the larger value is placed on top. Thus, compare
& swap blocks are added before the PROF blocks to sort the magnitudes of each
pair of inputs. To help record the position of the minimum value, the compare
& swap block outputs a bit value of 0 if the two inputs are swapped; otherwise, it outputs
1, as shown by the dotted line. Similarly, the PROF block indicates the location
of the minimum magnitude by setting an output to 0 if the minimum value belongs to
the upper pair and to 1 otherwise.

Figure 5.4: Architecture of CNU for dual-rate (1/2, 3/4) CMMB LDPC codes.
Figure 5.5: The design of a PROF-based comparator.
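The compare & swap and PROF behavior described above can be modeled in a few lines.
This is a behavioral sketch only: the exact wiring of Fig. 5.5 is not recoverable from
this copy, so the two-level tree in `two_smallest` is an assumed arrangement, and the
tracking of the minimum's location index is omitted for brevity.

```python
def compare_swap(a, b):
    """Return (larger, smaller, flag); flag is 0 if the inputs were swapped, else 1."""
    if a >= b:
        return a, b, 1   # already ordered, larger value on top
    return b, a, 0       # swapped

def prof(upper, lower):
    """Pseudo-rank order filter on two sorted (larger, smaller) pairs.

    Returns (min2, min1, sel); sel is 0 if the minimum comes from the upper pair.
    """
    (u_hi, u_lo), (l_hi, l_lo) = upper, lower
    if u_lo <= l_lo:
        return min(u_hi, l_lo), u_lo, 0
    return min(l_hi, u_lo), l_lo, 1

def two_smallest(mags):
    """Minimum and second minimum of six magnitudes (assumed PROF tree)."""
    pairs = [compare_swap(mags[k], mags[k + 1])[:2] for k in range(0, 6, 2)]
    min2, min1, _ = prof(pairs[0], pairs[1])
    min2, min1, _ = prof((min2, min1), pairs[2])
    return min1, min2

print(two_smallest([5, 3, 7, 1, 4, 2]))
```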
In order to achieve a fully functional LDPC decoder for the CMMB standard, the
proposed architecture must support both code rates, 1/2 and 3/4. Instead of
introducing a large switch network as in other multi-rate LDPC decoders, which increases
complexity and area, we extend the rate-1/2 CNU design and
make it compatible with rate 3/4 by adding a few extra multiplexers together with
minor modifications of several functional units. With minimal overhead, we
achieve a reconfigurable architecture for dual-rate decoding.

For the rate-3/4 implementation, the CNU has to deal with a row weight of 12,
twice that of the rate-1/2 code. The idea is to combine two rate-1/2 CNUs, namely
CNU0 and CNU1, into a single rate-3/4 CNU, as shown in Fig. 5.4. Here we
briefly describe the construction of the dual-rate CNU. For the rate-3/4 code,
a 12-number comparator is required. Therefore, a row is divided into two sets, each
of which has a weight of 6. Then, two CNUs, each with a 6-number comparator, are
assigned to process the two sets of CTV messages in parallel. As expected, each set
produces a minimum and a second minimum magnitude. We then introduce
an additional PROF block to compare these two sets of minimum magnitudes, which
belong to the same row; its outputs are denoted as signals min and min2. Signal min_sel
is set to 0 if the minimum value from CNU0 is selected and to 1 if the minimum value from
CNU1 is selected. The locations of the two sets of minimum values are denoted
by signals loc_0 and loc_1, respectively.

An additional multiplexer is introduced to assign the location index of the minimum
value. If min_sel is 0, which indicates that the minimum value of the row
belongs to CNU0, the 3-bit location signal loc_0 is passed to the upper recovery
block through signal min_loc_0. The location signal of the lower recovery block is
assigned the idle value 111, which indicates that the minimum magnitude is not
contained in CNU1. Similarly, the location index is passed to the lower recovery
block if CNU1 contains the minimum magnitude of the row.

As we can see from Fig. 5.4, the overall operation of the reconfigurable CNU
remains the same. The introduced overhead is minimal and does not affect the
decoder throughput or the clock frequency of the circuits. Therefore, this is a very efficient
approach to designing a reconfigurable dual-rate decoder.
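The merging step for a weight-12 row can be summarized behaviorally as below. The
signal names follow the text; the tie-breaking rule (CNU0 wins on equal minima) and the
name of the second location output (`min_loc_1`) are assumptions made for this sketch.

```python
IDLE = 0b111  # idle location code: the row minimum is not in this CNU

def merge_cnu(min_0, min2_0, loc_0, min_1, min2_1, loc_1):
    """Combine CNU0 and CNU1 results for one rate-3/4 row.

    Returns (min, min2, min_sel, min_loc_0, min_loc_1).
    """
    if min_0 <= min_1:  # assumed tie-break in favor of CNU0
        return min_0, min(min2_0, min_1), 0, loc_0, IDLE
    return min_1, min(min2_1, min_0), 1, IDLE, loc_1

# CNU1 holds the row minimum, so the upper recovery block receives the idle code.
print(merge_cnu(min_0=2, min2_0=5, loc_0=3, min_1=1, min2_1=4, loc_1=0))
```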
Figure 5.6: Element correspondence relations between CTV memory and APP memory.
5.3.3 Memory Access for Partially Parallel Layered Decoding
To increase the decoding throughput, multiple rows can be processed in parallel,
which is usually referred to as partially parallel layered decoding. For a parallelism
factor p, the CTV messages of p consecutive rows must be accessed in one clock
cycle. Therefore, we group p rows of CTV messages into a single entry in the
CTV memory. Similarly, p columns of APP messages are grouped into a
single entry in the APP memory. However, the row-based and column-based
memory partitions may not align exactly, depending on the permutation value of
the sub-matrix. As shown in Fig. 5.6, the p CTV messages from an entry of the
CTV memory correspond to p APP messages that are stored in 2 adjacent locations
in the APP memory. Therefore, a shift register can be used to store and select
the appropriate APP messages read from the memory.

The parallelism factor p is usually chosen as a number that divides l, where l is
the size of the sub-matrix (l = 256 for CMMB codes). In practice, two APP memory
locations are read in 2 consecutive clock cycles and the data are stored in a shift
register of size 2p. The permutation value s_ij of the sub-matrix determines
which p messages are selected. It can be proved that the corresponding p
messages are located between (s_ij%p + 1) and (p − 1 + s_ij%p) in the shift register,
where the shift control signal, given by (s_ij%p), has a bit width of ⌈log2(p)⌉.
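The selection from the 2p-wide shift register can be illustrated with a small model.
The sketch assumes that row i of a circulant with permutation value s connects to column
(i + s) mod l, that memory entry e holds columns e*p to e*p + p - 1, and that positions
are counted from 0, so the selected window is [s mod p, s mod p + p - 1]; the index
convention in the text may differ by an offset. The function name is illustrative.

```python
def select_app(app_memory, row_group, shift, p, l=256):
    """Return the p APP messages for rows row_group*p .. row_group*p + p - 1."""
    entries = l // p
    first_col = (row_group * p + shift) % l
    first_entry = first_col // p
    # Two consecutive reads fill a shift register of size 2p.
    shift_reg = app_memory[first_entry] + app_memory[(first_entry + 1) % entries]
    offset = shift % p                      # shift control signal, log2(p) bits
    return shift_reg[offset:offset + p]

p, l = 8, 256
# For the demonstration, each stored "message" is simply its own column index.
memory = [[e * p + k for k in range(p)] for e in range(l // p)]
print(select_app(memory, row_group=31, shift=2, p=p))  # wraps around the circulant
```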
5.3.4 Split-Memory Architecture for Weight-2 Sub-matrices
A major challenge of decoder design for the CMMB standard is handling the weight-2
sub-matrices in the base matrix H highlighted in Fig. 5.1. Each weight-2 sub-matrix
is composed of 2 superimposed cyclically shifted identity matrices, as opposed
to 1 shifted identity matrix for a weight-1 sub-matrix. For layered decoding,
a simplified algorithm is proposed in [55] to deal with such
superimposed sub-matrices without APP message passing inside each sub-matrix. To
carry out a complete layered decoding algorithm with weight-2 sub-matrices, the CNU
needs to read two APP messages from two different memory locations at the same
time. Similarly, two updated APP messages must be written back to memory in the
same clock cycle. Therefore, a dual-read, dual-write memory is needed to support
these operations. However, dual-port memories are undesirable for implementation
because of their increased area and power consumption.
We therefore propose a novel split-memory architecture that consists of two single-port
memories to handle the weight-2 sub-matrix. Fig. 5.7 shows the memory
structure. A weight-2 sub-matrix is decomposed into two weight-1 sub-matrices
s0 and s1, and each uses a single-port memory to store the APP messages. One
memory stores the APP messages corresponding to permutation value s0 and the other
those corresponding to permutation value s1. Without loss of generality, we suppose
s0 < s1. Each row i corresponds to two columns j0 and j1. The two APP messages from
columns j0 and j1 are sent to the CNU through the read network, while the updated APP
messages Λj0 and Λj1 are interchanged before being stored, each in the other memory,
as shown in Fig. 5.7(a). In this case, it is guaranteed that the CNU receives
the most recently updated APP messages within the same iteration. More specifically, the
processing of columns [s1, l − 1] and [0, s0 − 1] for sub-matrix s0 incorporates the
newest APP messages that have just been updated in sub-matrix s1. Similarly,
the processing of columns [s0, s1 − 1] for sub-matrix s1 involves the newest APP
messages updated in sub-matrix s0. Upon completion of the processing
of the weight-2 sub-matrix, the newest APP messages corresponding to columns
[s1, l − 1] ∪ [0, s0 − 1] are stored in memory M0 and the newest APP messages
corresponding to columns [s0, s1 − 1] are stored in memory M1. When processing
the other, regular weight-1 sub-matrices, a multiplexer is used to read the correct
APP messages for the columns involved in the weight-2 sub-matrices. As in
Fig. 5.7(b), the APP message from memory M0 is selected if the column index
j ∈ [s1, l − 1] ∪ [0, s0 − 1]; otherwise, the APP message from M1 is selected, i.e., if
j ∈ [s0, s1 − 1].

Figure 5.7: Split-memory design for handling the weight-2 sub-matrices.

The key advantage of the proposed split-memory architecture is that it makes
the processing of weight-2 sub-matrices the same as that of regular weight-1
sub-matrices. It does not introduce additional delays that would reduce the decoding
throughput. Moreover, there is no performance loss for layered decoding using the proposed
split-memory architecture.
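The freshness guarantee of the split-memory scheme can be checked with a small
simulation that stores version counters instead of messages. The sketch assumes that
row i connects to columns (i + s0) mod l and (i + s1) mod l, that rows are processed in
order without pipelining, and that both memories start with identical contents. The
names `mem_a` and `mem_b` are hypothetical labels; which of them corresponds to M0 or M1
in Fig. 5.7 depends on the labeling convention and is not asserted here.

```python
def simulate_split_memory(l, s0, s1):
    """Process one weight-2 sub-matrix row by row and check every read is fresh.

    Returns the set of columns whose newest value ends up in mem_a.
    """
    mem_a = [0] * l    # read by the s0 path, written by the s1 path
    mem_b = [0] * l    # read by the s1 path, written by the s0 path
    latest = [0] * l   # reference: newest version number of every column
    for i in range(l):
        j0, j1 = (i + s0) % l, (i + s1) % l
        # One read per single-port memory in the same cycle.
        assert mem_a[j0] == latest[j0], "stale read on the s0 path"
        assert mem_b[j1] == latest[j1], "stale read on the s1 path"
        latest[j0] += 1
        latest[j1] += 1
        # Interchange: each updated message is written to the other memory.
        mem_b[j0] = latest[j0]
        mem_a[j1] = latest[j1]
    return {j for j in range(l) if mem_a[j] == latest[j]}

newest_in_a = simulate_split_memory(l=256, s0=0, s1=208)
print(min(newest_in_a), max(newest_in_a), len(newest_in_a))
```

In this model, the columns in [s0, s1 − 1] end with their newest value in one memory and
the remaining columns in the other, which mirrors the column partition stated in the text.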
5.3.5 Number of Pipeline Stages
For the overall decoder architecture shown in Fig. 5.3, there is a long data path
that begins at the APP memory, passes through the read network, CNU, two adders and
write network, and returns to the APP memory. In order to improve the overall clock
frequency, this path is split into several pipeline stages. After the APP
messages are read from memory, multiple clock cycles are allowed to complete the CNU
processing, update the APP messages for the current layer and store them back to
the memory. In fact, it is not necessary to write back the updated
APP messages immediately unless the same columns are used again during the processing of
the following layers. However, the number of pipeline stages cannot be chosen
arbitrarily because of the constraint imposed by the weight-2 sub-matrices during
layered decoding.
The weight-2 sub-matrices are composed of two overlapped weight-1 sub-matrices
with permutation values s0 and s1. Without loss of generality, we suppose
that s1 ≥ s0. Row i corresponds to two columns, and the difference of the two
column indices is calculated as

\[
\triangle_i =
\begin{cases}
s_1 - s_0, & \text{for } i \in [0,\, l - s_1 - 1] \cup [l - s_0 - 1,\, l - 1] \\
l - (s_1 - s_0), & \text{for } i \in [l - s_1,\, l - s_0 - 2]
\end{cases}
\tag{5.1}
\]

The minimal value of △_i is denoted as △_min. For each weight-2 sub-matrix, two
columns are processed concurrently. As described for the split-memory architecture in
Section 5.3.4, the columns that are processed by sub-matrix s0 first must be stored
back to the APP memory before sub-matrix s1 starts to process the same columns
again, and vice versa for the columns processed by s1 first. Considering the partially
parallel architecture that processes p rows at a time, the property of the weight-2
sub-matrix requires the APP messages to be updated and stored back within ⌊△_min/p⌋
clock cycles. In the CMMB LDPC codes, there are four weight-2 sub-matrices, including
1 in the rate-1/2 code and 3 in the rate-3/4 code, as shown in Fig. 5.1. The
permutation values of the four sub-matrices are (0, 146), (0, 208), (0, 145), and (0,
138), respectively. Thus △_min is 48 (from the pair (0, 208), since 256 − 208 = 48),
which corresponds to the sub-matrix at row 0,
column 7 of the base matrix H of the rate-3/4 code. If we choose a parallelism
factor of 8, then the number of pipeline stages is 6. The decoder implementation
with a 6-stage pipeline shows that it reduces the critical path delay and results in
a high clock frequency.
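The pipeline-depth bound can be reproduced numerically from the four permutation pairs
quoted above. The sketch below takes the smaller of the two column-index differences in
(5.1) for each pair and divides the overall minimum by the parallelism factor; the
function names are illustrative.

```python
def delta_min(s0, s1, l=256):
    """Smallest column-index difference of a weight-2 sub-matrix, per (5.1)."""
    d = abs(s1 - s0)
    return min(d, l - d)

def max_pipeline_stages(shift_pairs, p, l=256):
    """Number of clock cycles available for update and write-back."""
    return min(delta_min(s0, s1, l) for s0, s1 in shift_pairs) // p

CMMB_WEIGHT2 = [(0, 146), (0, 208), (0, 145), (0, 138)]
print([delta_min(s0, s1) for s0, s1 in CMMB_WEIGHT2])
print(max_pipeline_stages(CMMB_WEIGHT2, p=8))
```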
5.4 Area-Efficient Design Techniques
5.4.1 Memory Reduction
Memory usually dominates the overall chip area and power consumption of an LDPC
decoder [51], because the iterative messages at each check node and variable node must
be stored for processing in subsequent iterations. The min-sum algorithm simplifies
the storage requirement at the check nodes. As mentioned earlier, the CTV messages
of a row are saved in a compressed form that includes the minimum magnitude, the
second minimum magnitude, the relative location index of the minimum magnitude
and the signs of the CTV messages in this row. As an example, the row weight of
the rate-1/2 code is 6 and each CTV message is quantized to 5 bits. For m rows,
the size of the CTV memory is reduced from 30m bits to 17m bits, a memory reduction
of about 43%.

In addition, the proposed layered decoding algorithm only requires the storage
of APP messages instead of the actual VTC messages. This reduces the memory size,
since only one APP message needs to be stored per column. For example, the LDPC
codes in CMMB have a column weight of 3, so the memory required
to store variable node information is reduced by 66.7%.
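Both reduction figures can be recomputed from the quantities in the text. The breakdown
of the 17 bits into two 4-bit magnitudes, a 3-bit index and 6 sign bits is an assumption
consistent with the 5-bit CTV format (one sign bit plus four magnitude bits); the text
states only the totals. Function names are illustrative.

```python
from math import ceil, log2

def ctv_bits(row_weight, msg_bits):
    """Return (uncompressed, compressed) CTV storage per row in bits."""
    uncompressed = row_weight * msg_bits
    mag_bits = msg_bits - 1  # magnitude without the sign bit
    # min, second min, location index, one sign per message (assumed breakdown)
    compressed = 2 * mag_bits + ceil(log2(row_weight)) + row_weight
    return uncompressed, compressed

def reduction(before, after):
    """Relative saving as a percentage."""
    return 100.0 * (before - after) / before

full, packed = ctv_bits(row_weight=6, msg_bits=5)
print(full, packed, round(reduction(full, packed), 1))
print(round(reduction(3, 1), 1))  # one APP message instead of three VTC messages
```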
5.4.2 Read/Write Networks
As a key part of the decoder, the read and write networks control the message passing
between the APP memory and the CNU. In general, a structured (m, n) code such as the
QC-LDPC code in CMMB reduces large m × n read and write networks to a
much smaller m_b × n_b switching network, which is 18 × 36 for the rate-1/2 CMMB code.
The size of the network can be further reduced by exploiting the structure of the
QC codes. As seen from Fig. 5.1, the row weight of the rate-1/2 code is limited to 6,
which limits the number of cyclic shift patterns in a horizontal block. Therefore,
the network can be further reduced to a size of 36 × 6. Moreover, the output
ports of the read network do not need to connect to all 36 input ports. From Fig.
5.1, we can see that the first output port of the read network only needs to be
connected to the APP memories Λ0, Λ1, Λ2, Λ3, Λ4 and Λ5, which correspond to the first
6 columns of the base matrix. The connections between the input and output ports of the
read network for the rate-1/2 code are listed in Table 5.1. Therefore, only four
6 × 1 multiplexers and two 7 × 1 multiplexers are required to implement the read
network. Similarly, four 1 × 6 de-multiplexers and two 1 × 7 de-multiplexers are used
to implement the write network. The architectures of the read network and write
network for the rate-1/2 code are shown in Fig. 5.8. Applying the same simplification
technique, the connections between the input and output ports of the read network
for the rate-3/4 code are listed in Table 5.2.

Figure 5.8: For rate-1/2 codes: (a) architecture of read network; (b) architecture of
write network.

Table 5.1: Connections of input and output ports in the read network for the rate-1/2
codes
Input Ports                           Output Ports
Λ0, Λ1, Λ2, Λ3, Λ4, Λ5                Port 0
Λ6, Λ7, Λ8, Λ9, Λ10, Λ11              Port 1
Λ12, Λ13, Λ14, Λ15, Λ16, Λ17          Port 2
Λ18, Λ19, Λ20, Λ21, Λ22, Λ23, Λ24     Port 3
Λ22, Λ24, Λ25, Λ26, Λ27, Λ28, Λ29     Port 4
Λ30, Λ31, Λ32, Λ33, Λ34, Λ35          Port 5

Table 5.2: Connections of input and output ports in the read network for the rate-3/4
codes
Input Ports                           Output Ports
Λ0, Λ1, Λ2                            Port 0
Λ3, Λ4, Λ5                            Port 1
Λ6, Λ7, Λ8, Λ9                        Port 2
Λ6, Λ9, Λ10, Λ11, Λ12                 Port 3
Λ12, Λ13, Λ14, Λ15                    Port 4
Λ13, Λ15, Λ16, Λ17, Λ20               Port 5
Λ18, Λ19, Λ20                         Port 6
Λ21, Λ22, Λ23, Λ24                    Port 7
Λ24, Λ25, Λ26                         Port 8
Λ27, Λ28, Λ29, Λ30                    Port 9
Λ29, Λ30, Λ31, Λ32, Λ33               Port 10
Λ32, Λ33, Λ34, Λ35                    Port 11
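The multiplexer count for the rate-1/2 read network follows from the port connections
in Table 5.1. The sketch below encodes that table as a dictionary (the data structure
and function names are illustrative, not part of the original design) and models one
output port as a multiplexer over its listed APP memories.

```python
from collections import Counter

# Table 5.1: indices of the APP memories wired to each output port (rate 1/2).
READ_PORTS_RATE_HALF = {
    0: [0, 1, 2, 3, 4, 5],
    1: [6, 7, 8, 9, 10, 11],
    2: [12, 13, 14, 15, 16, 17],
    3: [18, 19, 20, 21, 22, 23, 24],
    4: [22, 24, 25, 26, 27, 28, 29],
    5: [30, 31, 32, 33, 34, 35],
}

def mux_sizes(ports):
    """Count multiplexers by their number of inputs."""
    return Counter(len(inputs) for inputs in ports.values())

def read_port(ports, port, select, app_memories):
    """Model one multiplexer: forward the output of the selected APP memory."""
    return app_memories[ports[port][select]]

print(dict(mux_sizes(READ_PORTS_RATE_HALF)))  # inputs per mux -> number of muxes
```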
5.5 Implementation results
To evaluate the performance of the proposed architecture, we implemented the dual-rate
9216-bit LDPC decoder in 90 nm 1.0 V CMOS technology with 8 metal layers.
Synthesis results show that the decoder can operate at a maximum frequency of 431
MHz. The parallelism factor is 8 and the number of pipeline stages is 6. Because
there are 18 horizontal blocks in the H matrix and each block is expanded to a
256 × 256 sub-matrix, the number of clock cycles per iteration is 18 × 256/8 + 5 = 581.
The decoder is capable of processing the two rates specified in the CMMB standard with
a maximum decoding throughput of 456 Mbps at 15 iterations. In terms of
information bits, the effective throughput is 228 Mbps for the rate-1/2 code and 342
Mbps for the rate-3/4 code.
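The throughput figures follow from the clock frequency, the cycle count per iteration
and the number of iterations. The sketch below reproduces that calculation; the
function names are illustrative, and the 5 extra cycles are taken from the text as
given rather than derived.

```python
def cycles_per_iteration(block_rows, l, p, extra_cycles):
    """Clock cycles for one decoding iteration."""
    return block_rows * l // p + extra_cycles

def throughput_mbps(code_length, f_mhz, iterations, cycles):
    """Coded throughput: codeword bits divided by decoding time (MHz gives Mbps)."""
    return code_length * f_mhz / (iterations * cycles)

cycles = cycles_per_iteration(block_rows=18, l=256, p=8, extra_cycles=5)
coded = throughput_mbps(code_length=9216, f_mhz=431, iterations=15, cycles=cycles)
print(cycles, round(coded), round(coded * 1 / 2), round(coded * 3 / 4))
```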
The decoder chip has a core area of 2.1 mm × 2.1 mm and a power consumption
of 115 mW. SRAMs are generated by the Synopsys DesignWare tool and are thus
flattened during synthesis and chip layout. In practice, the high throughput of the
current design allows the decoder to operate at a reduced clock frequency with a
significantly lower supply voltage. The resulting design
should therefore be more power efficient.
Table 5.3 shows the decoder implementation results compared with other LDPC
decoders for long code lengths, such as the LDPC codes in the DVB-S2 standard. In
the proposed decoder, over 75% of the core area is consumed by SRAMs, which are
synthesized memory blocks, whereas in [72, 71] the memories are customized memory
modules, which are area-efficient and low-power. Still, the proposed decoder
achieves higher decoding throughput with comparable chip area.
Table 5.3: Overall comparison between the proposed decoder and other existing irregular decoders
             F. Kienle [38]  J. Dielessen [18]  P. Urard [72]  P. Urard [71]  Proposed Decoder
Code Length  64800           64800              16200/64800    16200/64800    9216
Frequency    270 MHz         -                  300 MHz        174 MHz        431 MHz
Iterations   30              30                 programmable   programmable   15
Throughput   255 Mbps        90 Mbps            135 Mbps       105 Mbps       228 Mbps or 342 Mbps
Technology   130 nm          90 nm              90 nm          65 nm          90 nm
Memory       -               -                  2.832M bits    3.18M bits     170K bits
Area         22.7 mm^2       4.1 mm^2           15.8 mm^2      6.07 mm^2      4.4 mm^2
5.6 Summary
In this chapter, we present a low-complexity QC-LDPC decoder implementation that
supports the dual-rate LDPC codes specified by the CMMB standard. Various
optimizations are incorporated in the design to reduce the complexity and latency of the
computational units, to minimize memory usage, and to simplify the switching network.
Min-sum based layered decoding is employed to achieve high effective decoding
throughput with low computational complexity. A split-memory technique is
proposed to efficiently handle the exceptional weight-2 sub-matrices. Reconfiguration for
supporting dual-rate LDPC codes is enabled with minimal hardware overhead.
Chapter 6
Design of Belief-Propagation Based
Detectors for Sparse ISI Channels
6.1 Introduction
Sparse hannels are most often found in underwater aousti ommuniations (UWA)
and ultra-wideband ommuniations (UWB) both of whih have attrated lots of
interests in reent years. For instane, the underwater aousti ommuniation sys-
tems have large delay spreading and multipath propagation. The hannel is usually
modeled with large delay spread and high sparsity. It is a hallenge to design a
low-omplexity detetor with aeptable error performane for sparse hannel. Tra-
ditional approahes suh as Viterbi algorithm is optimal but the omplexity is too
high for long hannel length. In reent years, a group of researhers in signal pro-
essing and ommuniations proposed to use the BP algorithm for symbol detetion
over a known ISI hannel [41, 62℄. The omputational omplexity of the BP-based
detetion is solely determined by the number of the nonzero interferers. On the
ontrary, the omplexities of the optimal maximum a posterior (MAP) algorithms
82
or maximum-likelihood (ML) algorithms inrease exponentially as of the hannel
length. Therefore, BP-based hannel detetor is onsidered in the sparse ISI hannel
with long delay spreads and only a few nonzero interferers, suh as the underwater
aousti hannels [62℄.
The BP algorithm is widely used to describe the iterative decoding of
LDPC codes [24] on a factor graph (also called a Tanner graph). When the
BP algorithm is applied to detect symbols over a known ISI channel, the input and
output symbols of the sparse channel are described as the variable nodes and check
nodes on the factor graph, respectively.
Arhiteture of the BP-based detetor an be referred to the arhiteture of an
LDPC deoder [85, 83℄. The main funtional bloks inlude memories for storing
the iterative messages, node proessing units (inluding hek node units (CNUs)
and variable node units (VNUs)), onnetion networks and other ontrol bloks.
However, the LDPC odes are usually designed to have some speial struture to
failitate parallel node proessing and redue the omplexity of ontrolling and mes-
sage passing, suh as QC LDPC odes [83℄. Although we an model the sparse
hannel using similar fator graph struture, the major hallenge of a BP-based
detetor design lies in that the hannel struture is lak of regularity due to the
random loations of the interferers in a time-varying ISI hannel.
In this chapter, we provide a feasible and efficient solution for this type of
random ISI channel. A reconfigurable architecture of the BP-based detector is
designed, which can change the connections in the factor graph for time-varying ISI
channels. The channel state information (CSI), including the connections between
CNUs and VNUs, is stored in registers which can be updated if the CSI changes.
The layered decoding algorithm is also incorporated to reduce the iteration latency.
All of the tasks are managed by the control unit, implemented as a finite state
machine (FSM). The proposed detector is implemented as an ASIC design in a TSMC
90 nm CMOS process and also prototyped on an FPGA.
The rest of this chapter is organized as follows. In Section 6.2, the BP algorithm
and the layered decoding algorithm for ISI channels are introduced. Section 6.3
presents the simulation results. The detector architecture is presented in Section
6.4 and the implementation results are presented in Section 6.5. Finally, Section
6.6 draws the conclusion.
6.2 Channel Model and Deoding Algorithms
6.2.1 Channel Model and Fator Graph Representation
In a wireless ommuniation system through a disrete-time ISI hannel with noise,
the information bits d is rst enoded by the forward error orretion (FEC) enoder
to add redundant information to the stream of information bits in a way that allows
errors whih are introdued by the noise in the hannel to be orreted. The oded
bits c are then enter the modulator to be onverted into signals appropriate for
transmission over the hannel. For simpliity, the binary phase-shift keying is used
in this paper. The modulated symbols, i.e., the input of the hannel x, are orrupted
by interferene and noise in the hannel, whih an be written as
\[ y_n = \sum_{l=0}^{L-1} f_l x_{n-l} + \omega_n \tag{6.1} \]
where $L$ is the channel length, $\omega_n$ is the noise sample, and
$F = \{f_0, f_1, \ldots, f_{L-1}\}$ denotes the equivalent impulse response of the
ISI channel. Here we assume that $F$ is already known to the detector. The number of
nonzero elements in $F$ is the number of nonzero interferers, denoted by $D$.
Figure 6.1: Factor graph of an ISI channel. (Check nodes y0 to y7 and variable
nodes x0 to x7; the edge labels give the tap amplitudes 1 and 0.2.)
The fator graph representation of an ISI hannel is shown in Fig. 6.1 with L = 4
and F = {1, 0, 1, 0.2}. The input and the output of the hannel are denoted as
variable nodes and hek nodes, respetively. The onnetions between the hek
nodes and variable nodes represent the dependenies of the output on the hannel
input symbols. The number on the edge denotes the amplitude of eah tap in F .
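The mapping from the tap vector F to the edges of the factor graph can be sketched in a few lines of Python. This is only an illustration: the function name `factor_graph_edges` and the choice to skip inputs before the start of the frame are assumptions made for the example, not part of the original design.

```python
def factor_graph_edges(F, num_nodes):
    """Map each check node n to its (variable node index, tap amplitude) edges."""
    # Only nonzero taps create edges; their count is the number of interferers D.
    taps = [(l, f) for l, f in enumerate(F) if f != 0]
    edges = {}
    for n in range(num_nodes):
        # By Eq. (6.1), y_n depends on x_{n-l} for every nonzero tap f_l.
        # Assumption: symbols before the start of the frame (n - l < 0) are ignored.
        edges[n] = [(n - l, f) for l, f in taps if n - l >= 0]
    return edges


F = [1, 0, 1, 0.2]                      # channel of Fig. 6.1, L = 4
edges = factor_graph_edges(F, num_nodes=8)
D = sum(1 for f in F if f != 0)         # number of nonzero interferers
print("D =", D)
print("check node 3 is connected to", edges[3])
```

Each check node therefore has at most D edges regardless of the channel length L, which is the property the BP-based detector exploits.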
6.2.2 Belief-Propagation Algorithm
Before presenting the BP algorithm [62], we first give some definitions.
Let $r_{n \to n-j}[k]$ be the check-to-variable (CTV) message from check node $n$ to
variable node $n-j$ during the $k$-th iteration. Let $q_{n \to n+j}[k]$ be the
variable-to-check (VTC) message from variable node $n$ to check node $n+j$.
1. Initialization:
Under the assumption of equal a priori probabilities, compute the channel
log-likelihood ratio (LLR) $p_n$ (intrinsic information) of variable node $n$ by
\[ p_n = \log \frac{P(y_n \mid x_n = +1)}{P(y_n \mid x_n = -1)} \tag{6.2} \]
For uncoded systems, $p_n = 0$. The CTV message $r_{n \to n-j}$ is initially set to
zero and the APP message $\Lambda_n$ is initially set to $p_n$.
2. Calulating the VTC messages:
At the k-th iteration, for variable node n, alulate VTC message qn→n+j [k]
by
qn→n+j [k] = Λn [k − 1]− rn+j→n [k − 1] (6.3)
3. Calulating the CTV messages:
Calulate the CTV message rn→n−j by (6.4). Here bn = (dn + 1)/2.
rn→n−j [k] = max
∀x: xn−j=+1


−
∣∣∣∣yn − L∑
l=0
flxn−l
∣∣∣∣
2
2σ2
+
L∑
i=0,i6=j
bn−i · qn−i→n [k]


− max
∀x: xn−j=−1


−
∣∣∣∣yn − L∑
l=0
flxn−l
∣∣∣∣
2
2σ2
+
L∑
i=0,i6=j
bn−i · qn−i→n [k]

(6.4)
4. Renewing the APP messages:
The APP (a posteriori probability) messages are then renewed as follows:
\[ \Lambda_n[k] = q_{n \to n+j}[k] + r_{n+j \to n}[k] \tag{6.5} \]
5. Deiding hard bits:
If the preset maximum number of iterations is reahed, deide the n-th bit of
the deoded odeword xn = +1 if Λn ≥ 0 and xn = −1 otherwise.
Figure 6.2: BER comparison between original BP algorithm and LDA-based BP
algorithm. (Plot of BER versus SNR in dB for the original BP at 6 iterations, the
original BP at 3 iterations, and the LDA-based BP at 3 iterations.)
6.2.3 Layered Deoding Algorithm
In order to alleviate message dependeny and redue deoding lateny, we borrow
the layered deoding algorithm from LDPC deoding to use it here in the BP-
based detetor design. Layered deoding is performed in a way that eah hek
node is treated as a layer and within an iteration, renewed CTV messages from
urrent horizontal layer will be transmitted vertially to other unproessed layers
that belong to the same variable nodes as the newly updated messages. In this
way, layered deoding is able to redue the required number of iterations by half
for a given SNR, as shown in Fig. 6.2. In this gure, the sparse ISI hannel is
haraterized by L = 6 and F = {0.408, 0, 0, 0, 0.816, 0.408}.
6.3 Simulation Results
In this section, we evaluate the performance of the proposed algorithm by Matlab
simulation in terms of bit error rate (BER) versus signal-to-noise ratio (SNR) per
bit, Eb/N0. The finite word length effect on the detector performance is also
analyzed to minimize the required word length of each kind of message. In this
chapter, the channel is modeled as a 64-tap channel (L = 64) with typically 4 nonzero
interferers (D = 4). The data are modulated by BPSK and transmitted over an AWGN
channel.
Figure 6.3: BER performance for various quantization schemes of received data.
(Plot of BER versus SNR per bit for the schemes 1:1, 1:2, 2:1, 2:2, 3:2, 3:3, and 4:3.)
Let us first examine the quantization of the received data $y_n$ in (6.4). A short
word length would result in poor BER performance. Through simulation, we therefore
look for the quantization scheme with the shortest word length that causes
no significant BER performance loss.
The quantization sheme an be express as i : f , in whih the reeived data are
quantized to i integer bits and f frational bits. Various quantization shemes for
the reeived data are investigated from 1 : 1 to 4 : 3. Error performanes for some
typial quantization shemes, suh as 1 : 1, 1 : 2, 2 : 1, 2 : 2, 3 : 2, 3 : 3, and
88
7 7.5 8 8.5 9 9.5 10
10−4
10−3
10−2
10−1
SNR per bit
BE
R
BER performances for different quantization schemes of extrinsic messages
 
 
1:2
2:1
2:2
3:1
3:2
3:3
4:2
4:3
Figure 6.4: BER performane for various quantization shemes of extrinsi messages.
4 : 3, are depited in Fig. 6.3. It is observed that the dierene between 3 : 2 and
3 : 3 is negligible within a wide range of BER, while the dierene 3 : 2 and 2 : 2 is
signiant. Thus it turns out that using the 3 : 2 sheme for the reeived data seems
to be the optimal tradeo between hardware omplexity and deoding performane.
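The i : f notation can be made concrete with a small fixed-point model. The sketch below is an assumption, not the quantizer used in the simulations: it treats the word as two's complement with the sign bit counted among the i integer bits (so the total word length is i + f), rounds to the nearest step, and saturates at the range limits. The helper names `quantize` and `word_length` are hypothetical.

```python
def quantize(value, i_bits, f_bits):
    """Quantize to a signed fixed-point grid described as i : f.

    Assumption: two's complement with the sign bit counted among the
    i integer bits, round-to-nearest, and saturation at the limits.
    """
    step = 2.0 ** (-f_bits)                 # resolution of the fractional part
    lo = -(2.0 ** (i_bits - 1))             # most negative representable value
    hi = 2.0 ** (i_bits - 1) - step         # most positive representable value
    q = round(value / step) * step
    return min(max(q, lo), hi)


def word_length(i_bits, f_bits):
    """Total number of bits of an i : f word under the assumption above."""
    return i_bits + f_bits


print(quantize(1.3, 3, 2))     # value on the 0.25 grid of the 3 : 2 scheme
print(quantize(10.0, 3, 2))    # saturates at the positive limit
print(word_length(3, 2))
```

Under this reading, adding one fractional bit halves the step size while adding one integer bit doubles the range, which is why schemes of equal word length can behave differently.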
Similarly, the quantization of the extrinsic messages is also analyzed. Error
performances for some typical quantization schemes, such as 1 : 2, 2 : 1, 2 : 2, 3 : 1,
3 : 2, 3 : 3, 4 : 2, and 4 : 3 for the CTV and VTC messages, are depicted in Fig. 6.4.
Also, as the APP message is the sum of several CTV messages, the quantization
scheme for an APP message is (i + 2) : f if the corresponding CTV quantization
scheme is denoted as i : f. It is observed that the difference between 4 : 2 and
4 : 3 is negligible within a wide range of BER, while the difference between 4 : 2
and 3 : 2 is significant. Since schemes 4 : 2 and 3 : 3 have the same word length,
4 : 2 is preferred due to its better BER performance. Thus the 4 : 2 scheme for the
extrinsic messages (6 : 2 for the APP messages) appears to be the best tradeoff
between hardware complexity and decoding performance.
6.4 Detetor Arhiteture Design
6.4.1 Overall Arhiteture
In this setion, we propose an eetive arhiteture for the BP-based detetor with
layered deoding. Overall arhiteture is shown in Fig. 6.5. In this setion, the
hannel model is haraterized as a 64-tap hannel (L = 64) with typially 6 nonzero
interferers (D = 4). The length of eah frame of data is 2048 bits. The detetor
mainly onsists of two memories whih store the extrinsi messages, a ahe whih
is used to store extrinsi messages for hek node urrently being proessed, a CNU
responsible for the hek node update shown in (6.4) and a ontrol unit whih
swithes from one state to another. The LDA is implemented by proessing eah
hek node (layer) one by one and the proessing of eah hek node an be divided
into several states. The proposed arhiteture is also reongurable in order to
swith exible onnetions on the fator graph in the time-varying ISI hannels.
Below are the detailed desriptions of eah state and how to ontrol them.
1. The APP messages are initialized by channel LLRs and stored in the APP Memory
with the size of $n \times m_{app}$, where $n$ is the length of a frame of the data
stream and $m_{app}$ denotes the word length of the quantized APP messages. Here
$n = 2048 = 2^{11}$ and $m_{app} = 6 + 2 = 8$ according to the word length analysis
presented in Section 6.3.
2. An APP message is read from the APP Memory and written to the Cache. A
PC (program counter) indicates the index of the currently processed
layer (ICL), which is also the address of the APP Memory and the CTV
Memory. The size of the Cache is $L \times m_{app}$, where $L = 64 = 2^6$ is the
channel length ($L \ll n$). When the detector starts to process layer $j+1$ after
layer $j$, only one new APP message ($\Lambda_{j+1}$) is needed; the others are either
replaced by new APP messages from the CNU or remain unchanged.
Figure 6.5: Overall detector architecture.
3. APP messages corresponding to the current layer are read from the Cache. The
read addresses come from the CSI register, which stores the CSI including the
relative interferer locations (RIL) and the interferer coefficients. The CSI is
obtained from the channel estimator and changes with the time-varying channel.
Meanwhile, the CTV messages are also fetched and the CNU completes the tasks
defined in (6.3), (6.4) and (6.5).
4. Renewed CTV messages are stored back into the CTV Memory and renewed APP
messages are stored back to the Cache. Signal APPRead is used to switch between
writing a new APP message from the APP Memory and writing renewed APP
messages from the CNU.
5. When the detetor starts to proess layer j + 1, variable nodes in the range
[j − L+ 2, j + 1] are all potential interferers. Hereby, variable node j−L+1
will never be used again within this iteration and orresponding APP message
Λj−L+1 must be stored bak to the APP memory and prepared for the use
in the next iteration. Signal StoreUpdate deides where the output of the
Cahe should go: the CNU or the APP memory? Also, at the input the APP
memory, signal ChannelRead ontrols what to be saved in the APP memory :
hannel LLRs or the APP messages.
6. Partiularly, if the preset number of iterations is reahed, the APP messages
do not need to store bak to the APP memory and are ontrolled by signal
92
  
 




 


 
 
 
	




	

	




	

 
 
 




ﬀ
ﬁﬂﬂ


ﬀ
ﬃ
 
!
ﬃ
 
!
"
#$
%
 
 
 
 
 
 
 
 
 

 
 
 
&
'
ﬀ()
'*+'
,
'

-./
0
1
2
3
4
2
3
4
5
5

.


ﬀ
ﬁﬂﬂ
.


ﬀ
	




	
 6
7
8
6
7
8
6
7
8
6
7
8




	

   
Figure 6.6: Arhiteture of the CNU.
Deision and delivered diretly to the Deision unit to deide the hard bits.
6.4.2 Arhiteture of CNU
The CNU arhiteture is illustrated in Fig. 6.6. Its funtionality overs equation
(6.3), (6.4) and (6.5). As here the number of nonzero interferers D is 4, totally
2D = 16 values of the term

−
∣∣∣∣∣yn−
L∑
l=0
flxn−l
∣∣∣∣∣
2
2σ2
+
L∑
i=0,i6=j
bn−i · qn−i→n [k]

 should be
alulated, as indiated in (6.4). Among these 16 values, 8 will be used to alulate
the minuend term (rst maximum funtion) in (6.4) and another 8 values are for the
subtrahend term (seond maximum funtion). Parallel arhiteture is onstruted
to alulated the 16 values onurrently and a Reshedule Network is designed to
93
 








	

	

	

	









Figure 6.7: Diret-mapped Cahe Arhiteture
partition the 16 values into two groups, one for the minuend and another for the
subtrahend.
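The grouping performed by the Reschedule Network in Section 6.4.2 can be illustrated with a short Python sketch. It only models which sign patterns feed which maximum function; the function name `reschedule` and the tuple representation of a hypothesis are assumptions made for the example and say nothing about the hardware wiring.

```python
from itertools import product


def reschedule(D, j):
    """Split the 2^D sign patterns into the two groups used by Eq. (6.4).

    The first group feeds the minuend (x_j = +1), the second the
    subtrahend (x_j = -1) of the CTV message towards interferer j.
    """
    patterns = list(product((+1, -1), repeat=D))
    minuend = [x for x in patterns if x[j] == +1]
    subtrahend = [x for x in patterns if x[j] == -1]
    return minuend, subtrahend


D = 4
for j in range(D):
    first, second = reschedule(D, j)
    print(f"interferer {j}: {len(first)} + {len(second)} of {2 ** D} patterns")
```

For every target interferer the 16 hypotheses split evenly, which matches the 8 and 8 values mentioned above; only the assignment of patterns to groups changes with j.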
6.4.3 Cahe-Like Arhiteture
The LLRs from the hannel and the APP messages from the belief propagation
will be stored in the APP Memory during the iterative proess. Layered deod-
ing algorithm is implemented by proessing the hek nodes one by one and eah
hek node is onsidered as one layer. However, at eah hek node (layer), only
D variable nodes out of previous L nodes are onneted with urrent hek node.
In other words, D VTC messages from previous L variable nodes are seleted and
sent to CNU for iterative proessing. Therefore, as shown in Fig. 6.5, a ahe-like
arhiteture is developed in whih a Cahe is used to read L APP messages from
the main memory (the APP Memory) and then provide D useful APP messages for
the CNU.
94
Fig. 6.7 shows the diret mapping from the APP Memory to the Cahe in whih
eah APP message is mapped to exat one loation in the Cahe. Sine n = 2048
and L = 64, in this paper a 32 : 1 diret-mapped ahe is used. L onseutive APP
messages are opied from the APP Memory to the Cahe, but the start-point of
these L onseutive messages may not be stored in the rst loation of the Cahe.
Atually the L messages are stored in a irulant shifted format with the shift value
determined by the least signiant bits (LSBs) of the ICL, as shown in Fig 6.5.
Below are the detailed onlusions on how to set the addresses for the APP Memory
and the Cahe:
1. When storing an updated APP message back from the Cache to the APP
Memory, the address for the Cache is the LSBs of the ICL and the address for
the APP Memory equals ICL − 64.
2. When reading a new APP message from the APP Memory to the Cache, the
address for the Cache is the LSBs of the ICL and the address for the APP
Memory equals ICL.
3. When reading D = 4 APP messages from the Cache, the address for the Cache
is the sum of the RIL and the shift value of the 64 consecutive messages, which
equals the LSBs of the ICL plus 1, as depicted in Fig. 6.5.
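The three addressing rules above can be checked with a small software model. This is a sketch under stated assumptions: the LSBs of the ICL are taken as ICL mod 64, cache addresses wrap modulo 64 (consistent with the circularly shifted storage), and the RIL is counted from the oldest message in the 64-entry window, so that tap l corresponds to RIL = 63 − l. The function names are hypothetical.

```python
L = 64          # cache depth = channel length
N = 2048        # frame length = APP Memory depth


def cache_addr(icl):
    """Cache address given by the 6 LSBs of the ICL."""
    return icl % L


def writeback_addrs(icl):
    """Rule 1: (cache address, APP Memory address) when storing back."""
    return cache_addr(icl), icl - L


def load_addrs(icl):
    """Rule 2: (cache address, APP Memory address) when loading a new message."""
    return cache_addr(icl), icl


def interferer_addrs(icl, ril):
    """Rule 3: cache read addresses of the interferers of layer ICL.

    Assumptions: wrap-around modulo L, and RIL measured from the oldest
    message in the window (tap l corresponds to RIL = L - 1 - l).
    """
    shift = (cache_addr(icl) + 1) % L
    return [(shift + r) % L for r in ril]


icl = 100
taps = [0, 4, 5]                              # nonzero tap positions l
ril = [L - 1 - l for l in taps]
print(N // L)                                 # ratio of the direct mapping
print(writeback_addrs(icl), load_addrs(icl))
print(interferer_addrs(icl, ril))
```

Under these assumptions the message of variable node ICL − l is found at cache address (ICL − l) mod 64, and the evicted message ICL − 64 occupies the same slot that the new message ICL is written to.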
6.5 Implementation Results
To evaluate the performance of the proposed schemes, a BP-based detector with a
frame size of 2048 bits is synthesized and implemented on both FPGA and ASIC
platforms. Implemented on a Xilinx XC2VP30 device, the detector reaches a maximum
frequency of 56.2 MHz, which corresponds to a throughput of 1.44 Mbps with 3
iterations.
The same architecture is implemented and synthesized using TSMC 90 nm ASIC
technology. The synthesized detector can achieve a maximum throughput of 3.48
Mbps for 3 decoding iterations. The core area of the detector is 1.8 mm^2 and the
power consumption is 54 mW at the maximum frequency of 136 MHz.
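As a quick consistency check on the reported figures, the ratio of clock frequency to throughput gives the number of clock cycles spent per detected bit. The short calculation below assumes that throughput equals clock frequency divided by cycles per bit; the cycle counts it produces are inferred from the reported numbers and are not stated in the text.

```python
def cycles_per_bit(f_clk_hz, throughput_bps):
    """Clock cycles per detected bit, assuming throughput = f_clk / cycles."""
    return f_clk_hz / throughput_bps


ITERATIONS = 3
fpga = cycles_per_bit(56.2e6, 1.44e6)   # Xilinx XC2VP30 figures
asic = cycles_per_bit(136e6, 3.48e6)    # TSMC 90 nm figures

print(f"FPGA: {fpga:.2f} cycles/bit, {fpga / ITERATIONS:.2f} per iteration")
print(f"ASIC: {asic:.2f} cycles/bit, {asic / ITERATIONS:.2f} per iteration")
```

Both platforms come out at roughly 39 cycles per bit, which is what one would expect if the same architecture runs with the same schedule and only the clock frequency differs.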
6.6 Summary
In this chapter, a low-complexity detector design was presented for sparse ISI
channels using the belief-propagation algorithm. The layered decoding algorithm is
also employed in the design. Simulation results show that the 3 : 2 quantization
scheme for the received data and the 4 : 2 scheme for the extrinsic messages cause
negligible BER performance loss while reducing the hardware complexity significantly.
The proposed detector for sparse ISI channels is implemented on both FPGA and
ASIC platforms, which demonstrate real-time throughputs of 1.44 Mbps and
3.48 Mbps, respectively.
Chapter 7
Conlusions
This dissertation investigated several VLSI design issues for iterative belief propaga-
tion (BP) algorithm, inluding high-performane LDPC deoders and ISI detetors
with the goal to redue hip area and/or ahieve high-throughput and/or rate ex-
ible implementations.
A parallel layered deoding LDPC deoder arhiteture was proposed to ahieve
ultra high throughput. Sequential operations on layers in traditional layered de-
oding arhiteture was replaed by onurrent message alulations and transmis-
sions among all layers aording to ertain rules that won't hange the nature of
layered deoding algorithm. Besides onurrent proessing, this arhiteture also
enables pipelined ritial path and avoids the ompliated interonnetion network
whih is usually onsidered as the barrier for high-throughput LDPC deoders. All
above methods ontribute to a high-throughput LDPC deoder and push the input
throughput to more than 1 Gbps.
Then, we extended the proposed parallel layered decoding architecture to punctured
LDPC codes and developed a decoder architecture with both high throughput
and rate flexibility. We studied several puncturing schemes and selected the one that
has the best error performance and the least BER degradation compared with
dedicated codes. A corresponding implementation of the selected puncturing scheme
was designed and added to the parallel layered decoding architecture. The proposed
LDPC decoder can achieve an input throughput of 975 Mbps and supports any rate
between 1/2 and 1.
Furthermore, we proposed a low-omplexity, area-eient LDPC deoder arhi-
teture that is ompatible with China Multimedia Mobile Broadasting (CMMB)
standard. Using resoure-sharing, the hek node unit in this arhiteture an swith
between the two required ode rates in CMMB. Split-memory arhiteture enables
eetive implementation of layered deoding algorithm on weigh-2 superimposed
matries.
Finally, we developed a detetor arhiteture for sparse ISI hannels whih is also
based on sparse matrix and BP algorithm, like the LDPC deoder. Bipartite graph,
min-sum algorithm and layered deoding algorithm, whih are popular in LPDC
deoding, are also applied here. Unlike the LDPC odes, for deteting in sparse
ISI hannels, variable nodes are possibly onneted to part, not all, of the hek
nodes. Hene, a ahe-based detetor arhiteture was proposed to use a ahe
to store messages from/to hek nodes that interfere with urrent variable node.
Also, the detetor arhiteture is reongurable to support any possible onnetions
between variable nodes and hek nodes. To our best knowledge, this is the rst
VLSI realization of BP-based detetor based upon the available literature resoures.
98
Bibliography
[1] Mobile Multimedia Broadcasting (P. R. China) Part 1: Framing Structure,
Channel Coding and Modulation for Broadcasting Channel.
[2] [Online]: http://eng.t-dmb.org/.
[3] [Online]: http://www.dibeg.org/tehp/tehp.htm.
[4] [Online]: http://www.ieee802.org/16/tge.
[5] E. Aktas, "Belief Propagation with Gaussian Priors for Pilot-Assisted
Communication over Fading ISI Channels," IEEE Trans. Wireless Commun., vol. 8,
no. 4, pp. 2056-2066, Apr. 2009.
[6] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon Limit
Error-Correcting Coding and Decoding: Turbo-Codes," in Proc. IEEE Intl. Conf.
Commun. (ICC 1993), vol. 2, May 1993, pp. 1064-1070.
[7] A. Blanksby and C. Howland, "A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density
parity-check code decoder," IEEE J. Solid-State Circuits, vol. 37, no. 3, pp.
404-412, Mar. 2002.
[8] R. C. Bose and D. K. Ray-Chaudhuri, "On a Class of Error Correction Binary
Group Codes," Inform. Control, vol. 3, pp. 68-79, Mar. 1960.
[9℄ T. Brak, M. Alles, F. Kienle, and N. Wehn, A Synthesizable IP Core for
WiMAX 802.16E LDPC Code Deoding, in Pro. IEEE 17th Int. Symp. Per-
sonal, Indoor and Mobile Radio Communiations, Sep. 2006, pp. 15.
[10℄ J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X.-Y. Hu, Redued-
Complexity Deoding of LDPC Codes, IEEE Trans. Commun., vol. 53, no. 8,
pp. 12881299, Aug. 2005.
[11℄ L. Chen, J. Xu, I. Djurdjevi, and S. Lin, Near-Shannon-Limit Quasi-Cyli
Low-Density Parity-Chek Codes, IEEE Trans. Commun., vol. 52, no. 7, pp.
10381042, Jul. 2004.
[12℄ Y. Chen and D. Hoevar, A FPGA and ASIC Implementation of Rate 1/2,
8088-b Irregular Low Density Parity Chek Deoder, in Pro. IEEE GLOBE-
COM 2003, vol. 1, De. 2003, pp. 113117.
[13℄ E. Choi, S. Suh, and J. Kim, Rate-Compatible Punturing for Low-Density
Parity-Chek Codes with Dual-Diagonal Parity Struture, in Pro. IEEE
Symp. Person. Indoor Mobile Radio Commun. (PIMRC), vol. 4, Sep. 2005,
pp. 26422646.
[14℄ S. Y. Chung, G. Forney, T. Rihardson, and R. Urbanke, On the Design of
Low-Density Parity-Chek Codes within 0.0045 dB of the Shannon Limit,
IEEE Commun. Lett., vol. 5, no. 2, pp. 5860, Feb. 2001.
[15℄ G. Colavolpe and G. Germi, On the Appliation of Fator Graphs and the
Sum-Produt Algorithm to ISI Channels, IEEE Trans. Commun., vol. 53,
no. 4, p. 746, Apr. 2005.
100
[16] O. Daesun and K. Parhi, "Min-Sum Decoder Architectures with Reduced Word
Length for LDPC Codes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57,
no. 1, pp. 105-115, Jan. 2010.
[17] Y. Dai, N. Chen, and Z. Yan, "Memory Efficient Decoder Architectures for
Quasi-Cyclic LDPC Codes," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55,
no. 9, pp. 2898-2911, Oct. 2008.
[18] J. Dielissen, A. Hekstra, and V. Berg, "Low Cost LDPC Decoder for DVB-S2,"
in Proc. Design, Autom. Test Eur., 2006, pp. 130-135.
[19] I. Djurdjevic, J. Xu, K. Abdel-Ghaffar, and S. Lin, "A Class of Low-Density
Parity-Check Codes Constructed Based on Reed-Solomon Codes with Two
Information Symbols," IEEE Commun. Lett., vol. 7, no. 7, pp. 317-319, Jul.
2003.
[20] P. Elias, "Coding for Noisy Channels," IRE Conv. Rec., vol. 4, pp. 37-47, 1955.
[21] European Telecommunications Standards Institute (ETSI), "Digital Video
Broadcasting (DVB): Transmission System for Handheld Terminals," EN 302
307 V1.1.1. http://www.dvb.org.
[22] M. Fossorier, "Quasicyclic low-density parity-check codes from circulant
permutation matrices," IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1788-1793, Aug.
2004.
[23] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative
decoding of low-density parity-check codes based on belief propagation," IEEE
Trans. Commun., vol. 47, no. 5, pp. 673-680, May 1999.
[24] R. Gallager, "Low-Density Parity-Check Codes," IRE Trans. Inf. Theory, vol. 7,
pp. 21-28, 1962.
[25] G. Gentile, M. Rovini, and L. Fanucci, "Low-Complexity Architectures of A
Decoder for IEEE 802.16e LDPC Codes," in Proc. Euromicro Conf. Digital
System Design (DSD), Aug. 2007, pp. 369-375.
[26] E. B. Guilloud and J. Danger, "λ-Min Decoding Algorithm of Regular and
Irregular LDPC codes," in Proc. 3rd Int. Symp. Turbo Codes and Related Topics,
Sep. 2003, pp. 451-454.
[27] K. Gunnam, G. Choi, W. Wang, and M. Yeary, "Multi-Rate Layered Decoder
Architecture for Block LDPC Codes of the IEEE 802.11n Wireless Standard,"
in IEEE ISCAS 2007, May 2007, pp. 1645-1648.
[28] K. Gunnam, G. Choi, M. Yeary, and M. Atiquzzaman, "VLSI architectures
for layered decoding for irregular LDPC codes of WiMax," in Proc. IEEE Int.
Conf. Commun. (ICC), Jun. 2007, pp. 4542-4547.
[29] J. Ha, J. Kim, D. Klinc, and S. McLaughlin, "Rate-Compatible Punctured
Low-Density Parity-Check Codes with Short Block Lengths," IEEE Trans. Inf.
Theory, vol. 52, no. 2, pp. 728-738, Feb. 2006.
[30] J. Ha, J. Kim, and S. McLaughlin, "Puncturing for Finite Length Low-Density
Parity-Check Codes," in Proc. Inter. Symp. Inform. Theory (ISIT), Jun. 2004,
p. 151.
[31] J. Ha, J. Kim, and S. McLaughlin, "Rate-Compatible Puncturing of Low-Density
Parity-Check Codes," IEEE Trans. Inf. Theory, vol. 50, no. 11, pp. 2824-2836,
Nov. 2004.
[32℄ J. Ha and S. MLaughlin, Optimal Punturing Distributions for Rate-
Compatible Low-Density Parity-Chek Codes, in Pro. Inter. Symp. Inform.
Theory (ISIT), Jun. 2003, p. 233.
[33℄ D. Hoevar, A redued omplexity deoder arhiteture via layered deoding
of LDPC odes, in Pro. IEEE Workshop on Signal Proess. Syst. (SiPS), Ot.
2004, pp. 107112.
[34℄ A. Hoquenghem, Codes oreteurs d'erreurs, Chires, vol. 2, pp. 147156,
1959.
[35℄ S. J. Johnson and S. R.Weller, Codes for Iterative Deoding from Partial Ge-
ometries, in Pro. IEEE Intl. Symp. Inform. Theory, Jul. 2002, p. 310.
[36℄ S.-H. Kang and I.-C. Park, Loosely Coupled Memory-Based Deoding Arhi-
teture for Low Density Parity Chek Codes, IEEE Trans. Ciruits Syst. I,
Reg. Papers, vol. 53, no. 5, pp. 10451056, May 2006.
[37℄ M. Karkooti and J. Cavallaro, Semi-Parallel Reongurable Arhitetures for
Real-Time LDPC Deoding, in Pro. ITCC½	2004, vol. 1, Apr. 2004, pp. 579
585.
[38℄ F. Kienle, T. Brak, and N. Wehn, A Synthesizable IP Core for DVB-S2 LDPC
Code Deoding, in Pro. Design, Autom. Test Eur., vol. 3, Mar. 2005, pp. 100
105.
[39℄ Y. Kou, S. Lin, and M. Fossorier, Low Density Parity Chek Codes Based on
Finite Geometries: A Redisovery and New Results, IEEE Trans. Inf. Theory,
vol. 47, no. 7, pp. 27112736, Nov. 2001.
103
[40] Y. Kou, S. Lin, and M. Fossorier, "Low Density Parity Check Codes Based on
Finite Geometries: A Rediscovery," in Proc. IEEE Intl. Symp. Inform. Theory,
Jun. 2000, p. 200.
[41] B. Kurkoski, P. Siegel, and J. Wolf, "Joint Message-Passing Decoding of LDPC
Codes and Partial-Response Channels," IEEE Trans. Inf. Theory, vol. 48, no. 6,
pp. 1410-1422, Jun. 2002.
[42] J. Li and K. Narayanan, "Rate-Compatible Low Density Parity Check Codes
for Capacity-Approaching ARQ Schemes in Packet Data Communications," in
Proc. Int. Conf. on Comm., Internet, and Info. Tech. (CIIT), Nov. 2002.
[43] Z. Li and B. Kumar, "A Class of Good Quasi-Cyclic Low-Density Parity Check
Codes Based on Progressive Edge Growth Graph," in Proc. 38th Asilomar Conf.
Signals, Syst. Comput., vol. 2, Nov. 2004, pp. 1990-1994.
[44] C. Liu, C. Lin, S. Yen, C. Chen, H. Chang, C. Lee, Y. Hsu, and S. Jou, "Design
of A Multimode QC-LDPC Decoder Based on Shift-Routing Network," IEEE
Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 9, pp. 734-738, Sep. 2009.
[45] C.-H. Liu, S.-W. Yen, C.-L. Chen, H.-C. Chang, C.-Y. Lee, Y.-S. Hsu, and
S.-J. Jou, "An LDPC Decoder Chip Based on Self-Routing Network for IEEE
802.16e Applications," IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 684-694,
Mar. 2008.
[46] L. Liu and C.-J. Shi, "Sliced Message Passing: High Throughput Overlapped
Decoding of High-Rate Low-Density Parity-Check Codes," IEEE Trans. Circuits
Syst. I, Reg. Papers, vol. 55, no. 11, pp. 3697-3710, Dec. 2008.
[47] H.-A. Loeliger, "An Introduction to Factor Graphs," IEEE Signal Process. Mag.,
vol. 21, no. 1, pp. 28-41, Jan. 2004.
[48℄ R. Luas, M. Fossorier, Y. Kou, and S. Lin, Iterative Deoding of One-Step
Majority Logi Dedutible Codes Based on Belief Propagation, IEEE Trans.
Commun., vol. 48, no. 6, pp. 931937, Jun. 2000.
[49℄ D. MaKay, Good Error-Correting Codes Based on Very Sparse Matries,
IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 399431, Mar. 1999.
[50℄ D. MaKay and R. Neal, Near Shannon Limit Performane of Low Density
Parity Chek Codes, Eletron. Lett., vol. 32, no. 18, p. 1645, Aug. 1996.
[51℄ M. Mansour and N. Shanbhag, High-throughput LDPC deoders, IEEE
Trans. Very Large Sale Integr.(VLSI) Syst., vol. 11, no. 6, pp. 976996, De.
2003.
[52℄ G. Masera, F. Quaglio, and F. Vaa, Implementation of A Flexible LDPC
Deoder, IEEE Trans. Ciruits Syst. II, Exp. Briefs, vol. 54, no. 6, pp. 542
546, Jun. 2007.
[53℄ T. Mittleholzer, Eient Enoding and Minimum Distane Bounds of Reed-
Solomon-Type Array Codes, in Pro. IEEE Intl. Symp. Inform. Theory, Jul.
2002, p. 282.
[54℄ T. Mohsenin and B. Baas, High-throughput ldp deoders using a multiple
split-row method, in Pro. ICASSP 2007, vol. 2, Apr. 2007, pp. II13II16.
[55℄ S. Muller, M. Shreger, M. Kabutz, M. Alles, F. Kienle, and N. Wehn, A
novel LDPC deoder for DVB-S2 IP, in Pro. Design, Automation and Test
in Europe, (DATE'09), April 2009, pp. 13081313.
[56℄ S. Myung, K. Yang, and J. Kim, Quasi-Cyli LDPC Codes for Fast Enod-
ing, IEEE Trans. Inf. Theory, vol. 51, no. 8, pp. 28942901, Aug. 2005.
105
[57] T. Okamura, "Designing LDPC Codes Using Cyclic Shifts," in Proc. IEEE Intl.
Symp. Inform. Theory, Jun. 2003, p. 151.
[58] H. Park, K. Kim, D. Kim, and K. Whang, "Structured Puncturing for
Rate-Compatible B-LDPC Codes with Dual-Diagonal Parity Structure," IEEE
Trans. Wireless Commun., vol. 7, no. 10, pp. 3692-3696, Oct. 2008.
[59] I. S. Reed and G. Solomon, "Polynomial Codes over Certain Fields," J. Soc. Ind.
Appl. Math., vol. 8, pp. 300-304, Jun. 1960.
[60] T. Richardson, M. Shokrollahi, and R. Urbanke, "Design of
Capacity-Approaching Irregular Low-Density Parity-Check Codes," IEEE Trans. Inf.
Theory, vol. 47, no. 2, pp. 619-637, Feb. 2001.
[61] T. Richardson and R. Urbanke, "The Capacity of Low-Density Parity-Check
Codes under Message-Passing Decoding," IEEE Trans. Inf. Theory, vol. 47,
no. 2, pp. 599-618, Feb. 2001.
[62] S. Roy, T. Duman, and V. McDonald, "Error rate improvement in underwater
MIMO communications using sparse partial response equalization," IEEE J.
Ocean. Eng., vol. 34, no. 2, pp. 181-201, Apr. 2009.
[63] C. E. Shannon, "A Mathematical Theory of Communications," Bell Syst. Tech.
J., pp. 379-423, Jul. 1948.
[64] R. Shao, S. Lin, and M. Fossorier, "Two Simple Stopping Criteria for Turbo
Decoding," IEEE Trans. Commun., vol. 47, no. 8, pp. 1117-1120, Aug. 1999.
[65] E. Sharon, S. Litsyn, and J. Goldberger, "Efficient Serial Message-Passing
Schedules for LDPC Decoding," IEEE Trans. Inf. Theory, vol. 53, no. 11, pp.
4076-4091, Nov. 2007.
[66℄ X.-Y. Shih, C.-Z. Zhan, C.-H. Lin, and A.-Y. Wu, An 8.29 mm
2
52 mW Multi-
Mode LDPC Deoder Design for Mobile WiMAX System in 0.13 µm CMOS
Proess, IEEE J. Solid-State Ciruits, vol. 43, no. 3, pp. 672683, Mar. 2008.
[67℄ C. Studer, N. Preyss, C. Roth, and A. Burg, Congurable High-Throughput
Deoder Arhiteture for Quasi-Cyli LDPC odes, in Pro. 42th Asilomar
Conf. Signals, Syst., Comput., Ot. 2008, pp. 11371142.
[68℄ Y. Sun and J. Cavallaro, A Low-Power 1-Gbps Reongurable LDPC Deoder
Design for Multiple 4G Wireless Standards, in Pro. 2008 IEEE Int. SOC
Conf., Sep. 2008, pp. 367370.
[69℄ R. Tanner, A Reursive Approah to Low Complexity Codes, IEEE Trans.
Inf. Theory, vol. 27, no. 5, pp. 533547, Sep. 1981.
[70] Y.-L. Ueng, C.-J. Yang, Z.-C. Wu, C.-E. Wu, and Y.-L. Wang, "VLSI Decoding
Architecture with Improved Convergence Speed and Reduced Decoding Latency
for Irregular LDPC Codes in WiMAX," in Proc. IEEE ISCAS 2008, May 2008,
pp. 520–523.
[71] P. Urard, L. Paumier, V. Heinrich, N. Raina, and N. Chawla, "A 360mW
105Mb/s DVB-S2 Compliant Codec based on 64800b LDPC and BCH Codes
Enabling Satellite-Transmission Portable Devices," IEEE ISSCC Dig. Tech.
Papers, pp. 310–311, Feb. 2008.
[72] P. Urard, E. Yeo, L. Paumier, P. Georgelin, T. Michel, V. Lebars,
E. Lantreibecq, and B. Gupta, "A 135Mb/s DVB-S2 Compliant Codec Based on
64800b LDPC and BCH Codes," IEEE ISSCC Dig. Tech. Papers, pp. 446–609,
Feb. 2005.
[73] B. Vasic, "Combinatorial Constructions of Low-Density Parity-Check Codes for
Iterative Decoding," in Proc. IEEE Intl. Symp. Inform. Theory, Jul. 2002, p.
312.
[74] A. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically
Optimum Decoding Algorithm," IEEE Trans. Inf. Theory, vol. 13, no. 2,
pp. 260–269, Apr. 1967.
[75] P. O. Vontobel and R. M. Tanner, "Construction of Codes Based on Finite
Generalized Quadrangles for Iterative Decoding," in Proc. IEEE Intl. Symp.
Inform. Theory, Jun. 2001, p. 223.
[76] P. Wang and Y. Chen, "Low-Complexity Real-Time LDPC Encoder Design for
CMMB," in Proc. IIHMSP '08, Aug. 2008, pp. 1209–1212.
[77] Z. Wang and Z. Cui, "A Memory Efficient Partially Parallel Decoder
Architecture for QC-LDPC Codes," in Proc. 39th Asilomar Conf. Signals, Syst.,
Comput., Nov. 2005, pp. 729–733.
[78] Z. Wang and Z. Cui, "Low-Complexity High-Speed Decoder Design for
Quasi-Cyclic LDPC Codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 15, no. 1, pp. 104–114, Jan. 2007.
[79] Z. Wang and Q.-W. Jia, "Low Complexity, High Speed Decoder Architecture
for Quasi-Cyclic LDPC Codes," in Proc. IEEE ISCAS 2005, May 2005, pp.
5786–5789.
[80] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, "VLSI architectures for
iterative decoders in magnetic recording channels," IEEE Trans. Magn., vol. 37,
no. 2, pp. 748–755, Mar. 2001.
[81] C. Zhang, Z. Wang, J. Sha, L. Li, and J. Lin, "Flexible LDPC Decoder Design
for Multigigabit-per-Second Applications," IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 57, no. 1, pp. 116–124, Jan. 2010.
[82] J. Zhang and M. Fossorier, "Shuffled iterative decoding," IEEE Trans.
Commun., vol. 53, no. 2, pp. 209–213, Feb. 2005.
[83] K. Zhang and X. Huang, "High-Throughput Layered Decoder Implementation
for Quasi-Cyclic LDPC Codes," IEEE J. Select. Areas Commun., vol. 27, no. 6,
pp. 985–994, Aug. 2009.
[84] L. Zhang, L. Gui, Y. Xu, and W. Zhang, "Configurable Multi-Rate Decoder
Architecture for QC-LDPC Codes Based Broadband Broadcasting System,"
IEEE Trans. Broadcast., vol. 54, no. 2, pp. 226–235, Jun. 2008.
[85] T. Zhang and K. Parhi, "VLSI implementation-oriented (3,k)-regular
low-density parity-check codes," in Proc. IEEE Workshop on Signal Process. Syst.
(SiPS), 2001, pp. 25–36.
[86] H. Zhong and T. Zhang, "Design of VLSI Implementation-Oriented LDPC
Codes," in Proc. IEEE VTC 2003, vol. 1, Oct. 2003, pp. 670–673.
[87℄ , Blok-LDPC: A Pratial LDPC Coding System Design Approah,
IEEE Trans. Ciruits Syst. I, Reg. Papers, vol. 52, no. 4, pp. 766775, Apr.
2005.
