High Performance Decoder Architectures for Error Correction Codes by Lin, Jun
Lehigh University
Lehigh Preserve
Theses and Dissertations
2015
Recommended Citation
Lin, Jun, "High Performance Decoder Architectures for Error Correction Codes" (2015). Theses and Dissertations. 2686.
http://preserve.lehigh.edu/etd/2686
HIGH PERFORMANCE DECODER
ARCHITECTURES FOR ERROR
CORRECTION CODES
by
Jun Lin
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Doctor of Philosophy
in
Electrical Engineering
Lehigh University
May 2015
© Copyright 2015 by Jun Lin
All Rights Reserved
Approved and recommended for acceptance as a dissertation in partial fulfillment
of the requirements for the degree of Doctor of Philosophy.
Date
Prof. Zhiyuan Yan
(Dissertation Advisor)
Accepted Date
Committee Members:
Prof. Zhiyuan Yan
(Committee Chair)
Prof. Meghanad D. Wagh
Prof. Tiffany Jing Li
Dr. Zhongfeng Wang
Broadcom Inc.
Acknowledgments
First and foremost, I would like to express my special appreciation to my advisor,
Prof. Zhiyuan Yan, for his continuous support of my Ph.D. study and research. His
rich experience, insightful instruction and valuable expertise are the most important
gifts I have ever received. Instead of sticking to a single topic, Prof. Yan
encouraged me to explore more interesting and challenging research topics, which
broadened my views and paved the way for my future career. His patience and
encouragement inspired me to pursue my dream of becoming an academic researcher.
Moreover, he taught me how to appreciate various ideas in a comprehensive manner
and to express complicated concepts and work in a clear way. Prof. Yan is also my
life mentor and gave me a lot of valuable advice in seeking career opportunities.
He recommended me for an internship at Qualcomm, which was one of the most
wonderful experiences of my Ph.D. study. I am really lucky to have pursued my
Ph.D. degree under his guidance. I could not imagine having a better advisor and
mentor for my Ph.D. study.
Besides, I would like to thank Prof. Wagh, Prof. Tiffany Jing Li and Dr.
Zhongfeng Wang for serving on my Ph.D. committee and spending their precious
time examining my work. I am grateful for their encouragement, insightful
comments and inspiring questions.
Many sincere thanks to my labmates and friends at Lehigh University, Feng Shi,
Hongmei Xie and Chenrong Xiong. We shared a lot of memorable times and exciting
discussions. It was exciting to cooperate with them on developing wonderful ideas
and results. In particular, I would like to thank them for helping me find an
apartment and shop for groceries every week. My life at Lehigh University would
have been much harder without their help. Many thanks also go to my friends at Lehigh:
Yang Liu, Jiangfan Zhang, Xuanxuan Lu and others. The wonderful time spent with you
will be a precious part of my memory.
Finally, I would like to thank my parents for their endless support and unconditional
love. They provided everything, financially and spiritually, for me to get
a better education. Without their love and encouragement, I would not have made
any achievements in my life.
Contents
Acknowledgments
Contents
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Motivations
1.1.1 Non-binary LDPC Codes
1.1.2 Polar Codes
1.1.3 Error Control Decoders for RLNC
1.2 Contributions and Organization
2 An Efficient Shuffled Decoder Architecture for Nonbinary Quasi-Cyclic LDPC Codes
2.1 Introduction
2.2 Background
2.3 Shuffled and Modified Shuffled Schedule
2.3.1 Shuffled Schedule
2.3.2 Modified Shuffled Schedule
2.3.3 Simulation Results
2.4 Shuffled Decoder Architecture
2.4.1 Check Node Unit Architecture
2.4.2 Variable Node Unit Architecture
2.4.3 Top Decoder Architecture
2.4.4 Implementation Results
2.5 Conclusion
3 An Efficient Fully Parallel Decoder Architecture for Non-binary LDPC Codes
3.1 Introduction
3.2 TBCP algorithm
3.2.1 Trellis based check node processing algorithm
3.3 Improved Decoding Algorithm for NB-LDPC Codes
3.3.1 RTBCP algorithm
3.3.2 LLR compression for a priori messages
3.3.3 Simplified variable node processing algorithm
3.3.4 Numerical results
3.4 Fully Parallel Decoder Architecture
3.4.1 Top decoder architecture
3.4.2 Parallel CNU architecture
3.4.3 Low-latency VNU architecture
3.4.4 Decoding schedule, decoder throughput, and interconnection
3.5 Implementation Results and Comparisons
3.6 Conclusion
4 Efficient Error Control Decoder Architectures for Noncoherent Random Linear Network Coding
4.1 Introduction
4.2 KK and MV codes
4.2.1 KK codes and its decoding algorithms
4.2.2 MV codes and its list decoding algorithm
4.3 Efficient KK decoder architectures
4.3.1 Serial decoder architecture
4.3.2 Unfolded decoder architecture
4.4 Efficient MV list Decoder Architecture
4.4.1 Serial list decoder architecture
4.4.2 Efficient interpolator architecture for MV codes
4.4.3 Efficient factorization architecture for MV codes
4.5 Implementation Results
4.6 Conclusion
5 An Efficient List Decoder Architecture for Polar Codes
5.1 Introduction
5.2 Polar Codes and Its CA-SCL Algorithm
5.2.1 Polar Codes
5.2.2 SCL and CA-SCL Algorithms
5.3 Two Improvements of the CA-SCL Algorithm
5.3.1 Numerical Results
5.4 Efficient List Decoder Architecture
5.4.1 Message Memory Architecture
5.4.2 Processing Unit Array
5.4.3 Path Pruning Unit
5.4.4 Partial Sum Update Unit and the CRC Unit
5.4.5 Decoding Cycles
5.4.6 Scalability of the Proposed List Decoder Architecture
5.5 Implementation Results
5.6 Conclusion
6 A High Throughput List Decoder Architecture for Polar Codes
6.1 Introduction
6.2 Preliminaries
6.2.1 Polar Codes
6.2.2 Prior Tree-Based SC Algorithms
6.2.3 LLR Based List Decoding Algorithms
6.3 Reduced Latency List Decoding Algorithm
6.3.1 SCL Decoding on A Tree
6.3.2 Proposed RLLD algorithm
6.3.3 Discussions on the Parameters of Our RLLD Algorithm
6.3.4 Comparison with Related Algorithms
6.3.5 Simulation Results
6.4 High Throughput List Polar Decoder Architecture
6.4.1 Top Decoder Architecture
6.4.2 Memory Efficient Quantization Scheme
6.4.3 Proposed path pruning unit
6.4.4 Proposed hybrid partial sum computation unit
6.4.5 Latency and Throughput
6.5 Implementation Results and Comparisons
6.6 Conclusion
7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
Bibliography
Vita
List of Tables
2.1 Decoder Complexity Comparison for an (837, 726) LDPC Code over GF(32)
3.1 Computational complexity comparison between the proposed SVNP and the VNP algorithm in [1]
3.2 Computational complexity comparison between the improved decoding algorithm and the RHS algorithm in [2]
3.3 Comparisons of LGUs and CGU with a 32×5 SRAM.
3.4 Comparisons with other decoder architectures.
4.1 Interpolation by Polynomials and Linearized Polynomials
4.2 Hardware implementation results comparison.
5.1 Area per Bit for RFs with Different Depth and Width 128 using TSMC 90nm CMOS technology
5.2 Bit width of LLM Inputs of $PU_{l,j}$ when $n = 10$, $T = 8$ and $t = 4$
5.3 Area comparison between fine grained PU array and regular PU using TSMC 90nm CMOS technology
5.4 Comparison of ASIC implementation results using TSMC 90nm CMOS technology
5.5 Implementation Results With $R' = 0.468$ and $R = 0.5$
6.1 The Values of $q_{I_v,L}$'s under Different List Sizes and $I_v$'s
6.2 Hardware resources needed by different methods per list
6.3 Implementation Results for $N = 2^{10}$, $R = 0.5$
6.4 Implementation Results for $N = 2^{13}$, $R = 0.5$
6.5 Implementation Results for $N = 2^{15}$, $R = 0.9004$
6.6 $N_P^{(i)}$ with Respect to $I_v$ and $L$
List of Figures
2.1 Messages sent from check node to variable node
2.2 FERs of selected codes
2.3 Comparison of convergence rates
2.4 CNU Architecture
2.5 Architecture of Proposed Top Sorter
2.6 Proposed RMAG Architecture
2.7 Proposed Path Constructor Architecture
2.8 Proposed VNU Architecture
2.9 Proposed shuffled decoder architecture for NB QC-LDPC codes
2.10 Decoding Schedule of Proposed Shuffled Decoder
3.1 LLR evolution
3.2 LLR approximation when $n_m = 32$
3.3 BER performance of the (110, 88) NB-LDPC code over GF(256)
3.4 FER performance of the (110, 88) NB-LDPC code over GF(256)
3.5 BER performance of the (372, 248) NB-LDPC code over GF(32)
3.6 FER performance of the (372, 248) NB-LDPC code over GF(32)
3.7 Proposed fully parallel decoder architecture
3.8 Parallel CNU architecture
3.9 Parallel sorter architecture
3.10 The macro architecture of MLF when the number of input is 8
3.11 Path constructor architecture
3.12 VNU architecture assuming $d_v = 2$
3.13 Reading behavior of the SRGs during the variable node processing
3.14 Architectures of the proposed LGUs
3.15 Architectures of the proposed CGU
4.1 Serial KK decoder architecture
4.2 Architecture of interpolator0
4.3 Architecture of polyEvl for interpolator0
4.4 Architecture of orderComp for interpolator0
4.5 Architecture of the PUU that updates the coefficient of x[j]
4.6 Architecture of the polyDiv unit
4.7 Parallel inversion architecture over GF($2^8$)
4.8 Unfolded decoder architecture
4.9 Unfolded decoder architecture
4.10 The architecture of interpolator0 for the proposed MV decoder
4.11 Architecture of polyEvl for interpolator0 of the proposed MV decoder
4.12 Architecture of orderComp for interpolator0 of the proposed MV decoder
4.13 Architecture of PUU that updates x[j] for interpolator0 of the proposed MV decoder
4.14 Root pattern
4.15 Architecture of factorization for MV decoder ($L = 2$)
4.16 Architecture of SV0
5.1 Compressed channel message
5.2 FER performance of a polar code with $N = 1024$
5.3 FER performances under CRC16 and rate 0.75
5.4 FER performances under CRC16 and rate 0.5
5.5 Top architecture of the list decoder
5.6 The split of an irregular LLM memory
5.7 Maximum values filter architecture
5.8 (a) Architectures of IS (b) Architectures of DS (c) Architectures of CAS ($z = x_1 + x_2 + 1$)
5.9 PSU architecture
5.10 Architecture of the proposed CRC unit
6.1 Polar encoder with $N = 8$
6.2 Binary tree representation of an (8, 3) polar code
6.3 Node activation schedule for SC based list decoding on $G_n$
6.4 BER performance for an (8192, 4096) polar code
6.5 Decoder top architecture
6.6 Effects of the proposed MEQ scheme on the error performances
6.7 The proposed architecture for PPU
6.8 Hardware architecture of the proposed NG-I$_l$
6.9 Architecture of NG-II$_l$
6.10 (a) Top architecture of CU$_l$. (b) Type-I PE. (c) Type-II PE. (d) Inputs and outputs of the CN.
Abstract
Due to the rapid development of the information industry, modern communication
and storage systems require much higher data rates and reliability to serve various
demanding applications. However, these systems suffer from noise in practical
channels. Various error correction codes (ECCs), such as Reed-Solomon (RS)
codes, convolutional codes, turbo codes and Low-Density Parity-Check (LDPC) codes,
have been adopted in many current standards. With ever-increasing data
rates, research on more advanced ECCs and the corresponding efficient decoders
will never stop.
Binary LDPC codes have been adopted in many modern communication and
storage applications due to their superior error performance and efficient hardware
decoder implementations. Non-binary LDPC (NB-LDPC) codes are an important
extension of traditional binary LDPC codes. Compared with their binary counterparts,
NB-LDPC codes show better error performance under short to moderate block
lengths and higher order modulations. Moreover, NB-LDPC codes have a lower error
floor than binary LDPC codes. In spite of their excellent error performance, it is hard
for current communication and storage systems to adopt NB-LDPC codes due to
their complex decoding algorithms and decoder architectures. In terms of hardware
implementation, current NB-LDPC decoders need much larger area and achieve much
lower data throughput.
Besides the recently proposed NB-LDPC codes, polar codes, discovered by Arıkan,
appear as a very promising candidate for future communication and storage systems.
Polar codes are considered a major breakthrough in coding theory.
Polar codes are proved to be capacity-achieving codes over binary-input symmetric
memoryless channels. Besides, polar codes can be decoded by the successive
cancelation (SC) algorithm with a complexity of $O(N \log_2 N)$, where $N$ is the block
length. The main sticking point of polar codes to date is that their error performance
under short to moderate block lengths is inferior to that of LDPC codes
or turbo codes. The list decoding technique can be used to improve the error
performance of the SC algorithm at the cost of higher computational and memory
complexities. Besides, hardware implementations of current SC based decoders suffer
from long decoding latency, which is unsuitable for modern high speed communications.
ECCs also find applications in improving the reliability of network coding.
Random linear network coding is an efficient technique for disseminating information
in networks, but it is highly susceptible to errors. Kötter-Kschischang (KK)
codes and Mahdavifar-Vardy (MV) codes are two important families of subspace
codes that provide error control in noncoherent random linear network coding. List
decoding has been used to decode MV codes beyond half the minimum distance.
Existing hardware implementations of the rank metric decoder for KK codes suffer
from limited throughput, long latency and high area complexity. The interpolation-based
list decoding algorithm for MV codes still has high computational complexity, and its
feasibility for hardware implementations has not been investigated.
In this dissertation, we present efficient decoding algorithms and hardware decoder
architectures for NB-LDPC codes, polar codes, and KK and MV codes. For NB-LDPC
codes, an efficient shuffled decoder architecture is presented to reduce the average
number of iterations and improve the throughput. Besides, a fully parallel decoder
architecture for NB-LDPC codes with short or moderate block lengths is also presented.
Our fully parallel decoder architecture achieves much higher throughput and
area efficiency compared with state-of-the-art NB-LDPC decoders. For polar codes,
a memory efficient list decoder architecture is first presented. Based on our reduced
latency list decoding algorithm for polar codes, a high throughput list decoder
architecture is also presented. Finally, we present efficient decoder architectures for
both KK and MV codes.
Chapter 1
Introduction
Error correction codes (ECCs), such as Reed-Solomon (RS) codes, convolutional
codes, turbo codes and Low-Density Parity-Check (LDPC) codes, are widely
used in current communication and storage systems. Non-binary LDPC (NB-LDPC)
codes and polar codes are recently emerged ECCs for future applications. However,
NB-LDPC codes suffer from high-complexity decoding algorithms and inefficient
hardware decoder architectures. The successive cancelation (SC) based list (SCL)
decoding algorithm for polar codes has much better error performance than the SC
algorithm. However, hardware implementations of the SCL decoding algorithm suffer
from long decoding latency, which is unsuitable for high speed applications. Besides,
ECCs also find applications in random linear network coding (RLNC). The
Kötter-Kschischang (KK) codes and Mahdavifar-Vardy (MV) codes are two important
families of subspace codes that provide error control in noncoherent random linear
network coding.
In this chapter, we first explain the motivations of our research in Sec. 1.1, and
then present the main contributions and the organization of this dissertation in
Sec. 1.2.
1.1 Motivations
1.1.1 Non-binary LDPC Codes
Binary low-density parity-check (LDPC) codes are increasingly popular in
applications because of their capacity-approaching performance. In terms of
performance, however, binary LDPC codes start to show their weaknesses when the
codeword length is small or moderate, or when a higher order modulation is used.
For these cases, nonbinary LDPC (NB-LDPC) codes over high order Galois fields have
shown great potential [3, 4]. For instance, in [1], a rate-1/2 NB-LDPC code of length
84 over GF(64) is shown to perform 0.375dB better than a rate-1/2 binary irregular
LDPC code of equivalent length 504 bits in [5] over the binary-input additive white
Gaussian noise (AWGN) channel. Over QAM-AWGN channels, NB-LDPC codes with
a field order greater than or equal to the size of the constellation have the advantage
that the encoder/decoder works directly with symbols. All mapping choices of the
codeword symbols to the constellation points are equivalent and lead to the same
performance. In [1], an NB-LDPC code over GF(256) performs 0.5dB better than a
rate-1/2 binary LDPC code of the equivalent length 1008 bits over the QAM256-AWGN
channel.
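The "equivalent length" figures above follow from counting bits per symbol, since one symbol of GF($2^m$) carries $m$ bits. As a quick check of the numbers quoted from [1] (the 126-symbol length of the GF(256) code is inferred here from the stated 1008 bits and is not given in the text):

```latex
\[
  84 \text{ symbols} \times \log_2 64 = 84 \times 6 = 504 \text{ bits},
  \qquad
  \frac{1008 \text{ bits}}{\log_2 256} = \frac{1008}{8} = 126 \text{ symbols}.
\]
```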
A significant obstacle to the application of NB-LDPC codes is that their decoding
algorithms have high complexities. Hence, a lot of research effort has been
spent on efficient decoding algorithms for NB-LDPC codes [1,6]. Among them, the
EMS [1] and the Min-Max [6] algorithms draw a lot of attention because of their low
computation and memory complexity. For an NB-LDPC code over GF($2^m$), both
the EMS and the Min-Max algorithms store only the $n_m$ ($n_m \ll 2^m$) most reliable
messages, thus reducing the memory requirement at the cost of small performance
degradation. The check node processing of the EMS algorithm needs additions and
comparisons, while the Min-Max algorithm needs only maximizations and comparisons
in check node processing. The trellis-based check node processing (TBCP)
algorithm [7] reduces the computational complexity of check node processing of the
Min-Max algorithm by eliminating unnecessary check-to-variable messages.
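The message truncation and the max/compare flavor of Min-Max check node processing described above can be illustrated with a short sketch. This is only a toy illustration under stated assumptions, not the EMS or Min-Max algorithm of [1] or [6] in full: reliability values are assumed nonnegative with smaller meaning more reliable, GF($2^m$) symbols are represented as integers so that field addition is a bitwise XOR, and the nonzero parity-check coefficients are ignored. The function names `truncate` and `min_max_step` are hypothetical.

```python
from itertools import product


def truncate(llrs, n_m):
    """Keep the n_m most reliable entries as (value, symbol) pairs.

    Assumption: a smaller value means a more reliable symbol.
    """
    pairs = sorted((value, symbol) for symbol, value in enumerate(llrs))
    return pairs[:n_m]


def min_max_step(u, v):
    """Combine two truncated messages at a check node.

    out[c] = min over all (a, b) with a XOR b == c of max(u[a], v[b]).
    Only comparisons are needed; no additions of reliability values.
    """
    out = {}
    for (value_u, a), (value_v, b) in product(u, v):
        c = a ^ b  # addition in GF(2^m) is a bitwise XOR of the symbol labels
        candidate = max(value_u, value_v)
        if c not in out or candidate < out[c]:
            out[c] = candidate
    return out


# Toy example over GF(4), keeping n_m = 2 of the 4 entries per message.
u = truncate([0.0, 1.5, 3.0, 0.5], 2)
v = truncate([2.0, 0.0, 1.0, 4.0], 2)
print(min_max_step(u, v))
```

With truncation, the combination step visits $n_m^2$ symbol pairs instead of $2^{2m}$, which is where the complexity saving comes from; the price is that some output symbols receive no value at all.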
Besides the Min-Max and EMS decoding algorithms, stochastic decoding [2,8,9]
is another way to reduce the hardware complexity of NB-LDPC decoders while
maintaining the decoding performance. Compared to conventional belief propa-
gation decoding algorithms, the stochastic decoding algorithm has lower hardware
complexity [9]. The relaxed half-stochastic decoding algorithm optimized for NB-LDPC
codes with variable node degree 2, called the RD2 algorithm, was proposed
in [8]. The RD2 decoding algorithm reduces the decoding complexity by reducing
the number of real multiplications significantly. An improved version of the RD2
algorithm, called the NoX decoding algorithm [2], was proposed to further reduce the
computational complexity.
Recently, a considerable amount of research effort has been spent on efficient
decoder architectures for NB-LDPC codes [9–17]. Existing NB-LDPC decoders still
suffer from low throughput and large hardware complexity. For example, a (248,
124) NB-LDPC decoder over GF(32) [15] achieves a throughput of 47.69 Mb/s at the
cost of 10.33 mm$^2$ silicon area under 90nm technology. An (837, 726) NB-LDPC
decoder [16] over GF(32) achieves a throughput of 60Mb/s at the cost of 1.29M
standard NAND gates using the 180nm technology. The FPGA implementation of
a (192, 96) stochastic AMSA decoder [9] over GF(256) achieves a throughput of
64Mb/s at the frequency of 108MHz.
1.1.2 Polar Codes
Polar codes, recently introduced by Arıkan [18], are a significant breakthrough in
coding theory. It is proved that polar codes can achieve the channel capacity of any
discrete or continuous memoryless channel [18, 19]. Polar codes can be efficiently
decoded by the low-complexity successive cancelation (SC) decoding algorithm [18]
with a complexity of $O(N \log N)$, where $N$ is the block length. To approach the
channel capacity using the SC algorithm, polar codes require a very large code block
length (for example, $N > 2^{20}$ [20]), which is impractical in many applications.
For short or moderate lengths, the error performance of polar codes under the SC
algorithm is worse than that of turbo or low-density parity-check (LDPC) codes [21].
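To make the $O(N \log N)$ figure concrete, the following is a minimal recursive sketch of SC decoding. It is a sketch under stated assumptions rather than the decoder of [18]: the code is encoded as $x = u F^{\otimes n}$ with $F = [[1, 0], [1, 1]]$ and no bit-reversal permutation, LLRs are defined as $\ln(P(0)/P(1))$, the check-type update uses the min-sum approximation, and the frozen set in the example is chosen arbitrarily for illustration. Each recursion level does $O(N)$ work and there are $\log_2 N$ levels.

```python
import math


def polar_encode(u):
    """Compute x = u * F^(kron n) over GF(2), without bit reversal."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    x1 = polar_encode(u[:half])
    x2 = polar_encode(u[half:])
    return [a ^ b for a, b in zip(x1, x2)] + x2


def f(a, b):
    """Check-type LLR update (min-sum approximation, an assumption here)."""
    return math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))


def g(a, b, s):
    """Variable-type LLR update given the already decided partial sum s."""
    return b + (1 - 2 * s) * a


def sc_decode(llr, frozen):
    """Return (decoded bits, re-encoded partial sums) for this sub-block."""
    n = len(llr)
    if n == 1:
        bit = 0 if frozen[0] or llr[0] >= 0 else 1
        return [bit], [bit]
    half = n // 2
    a, b = llr[:half], llr[half:]
    # Decode the first half of u, then use its partial sums for the second half.
    u1, x1 = sc_decode([f(p, q) for p, q in zip(a, b)], frozen[:half])
    u2, x2 = sc_decode([g(p, q, s) for p, q, s in zip(a, b, x1)], frozen[half:])
    return u1 + u2, [p ^ q for p, q in zip(x1, x2)] + x2


# Noiseless toy example with N = 8; the frozen positions are illustrative only.
frozen = [True, True, True, False, True, False, False, False]
u = [0, 0, 0, 1, 0, 1, 0, 1]
x = polar_encode(u)
llr = [4.0 if bit == 0 else -4.0 for bit in x]
u_hat, _ = sc_decode(llr, frozen)
print(u_hat == u)
```

The second recursive call cannot start until the first has returned its partial sums, which is the serial dependency behind the long decoding latency of SC based decoders.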
Considerable effort [21–28] has already been devoted to improving the error-correction
performance of polar codes with short or moderate lengths. An SC list
(SCL) decoding algorithm was proposed recently in [21], which performs better
than the SC algorithm and performs almost the same as a maximum-likelihood
(ML) decoder [21]. In [22–24], the cyclic redundancy check (CRC) is used to pick
the output codeword from $L$ candidates, where $L$ is the list size. The CRC-aided
SCL algorithm performs much better than the SCL algorithm at the expense of
a negligible loss in code rate.
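The CRC-aided selection step can be sketched as follows. This is a simplified illustration, not the procedure of [22–24]: it assumes candidates are byte strings with a 4-byte CRC appended, uses Python's `binascii.crc32` as a stand-in for whichever CRC polynomial a real code would use, and assumes a larger path metric means a more reliable candidate. The helper names are hypothetical.

```python
import binascii


def append_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (stand-in CRC for this sketch)."""
    return payload + binascii.crc32(payload).to_bytes(4, "big")


def crc_ok(word: bytes) -> bool:
    """Check that the trailing 4 bytes match the CRC-32 of the rest."""
    payload, tail = word[:-4], word[-4:]
    return binascii.crc32(payload).to_bytes(4, "big") == tail


def select_candidate(candidates):
    """Pick the output word from a list of (path_metric, word) pairs.

    The most reliable candidate that passes the CRC wins; if none passes,
    fall back to the most reliable candidate overall.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    for _, word in ranked:
        if crc_ok(word):
            return word
    return ranked[0][1]


# The best-metric candidate has a bit error; the CRC rescues the second one.
good = append_crc(b"polar")
bad = bytes([good[0] ^ 1]) + good[1:]
print(select_candidate([(-1.0, bad), (-2.5, good)]) == good)
```

The CRC bits occupy positions that would otherwise carry information, which is the small rate loss mentioned above.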
In terms of hardware implementations of the SC algorithm, an efficient semi-parallel
SC decoder was proposed in [20], where resource sharing and semi-parallel
processing were used to reduce the hardware complexity. An overlapped computation
method and a pre-computation method were proposed in [29] to improve the
throughput and to reduce the decoding latency of SC decoders. Compared to the
semi-parallel decoder architecture in [20], the pre-computation based decoder
architecture [29] can double the throughput. A simplified SC decoder for polar codes,
proposed in [30], reduces the decoding latency by more than 88% for a rate 0.7 polar
code with length $2^{18}$.
Despite the significantly improved error performance of the SCL algorithm, hardware
implementations of SC based list decoders [31–34] still suffer from long decoding
latency and limited throughput due to the serial decoding schedule. In order to
reduce the decoding latency of an SC based list decoder, $M$ ($M > 1$) bits are decoded
in parallel in [35–37], where the decoding latency can ideally be reduced by a factor
of $M$. However, for the hardware implementations of the algorithms in [35–37], the
actually achieved latency reduction is less than a factor of $M$ due to the extra decoding
cycles spent on finding the $L$ most reliable paths among $2^M L$ candidates, where $L$
is the list size. A software adaptive SSC-list-CRC decoder was proposed in [38]. For a
(2048, 1723) polar+CRC-32 code, the SSC-list-CRC decoder with $L = 32$ was shown to
be about 7 times faster than an SC based list decoder. However, it is unclear whether
the list decoder in [38] is suitable for hardware implementation.
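The origin of the $2^M L$ candidates can be seen in a small sketch of one expand-and-prune step. It is a software illustration under stated assumptions, not the hardware method of [35–37]: the metric increment function `ext_metric` is hypothetical (a real decoder derives it from LLRs), a larger metric is assumed to mean a more reliable path, and a library selection routine stands in for the sorting hardware.

```python
import heapq
from itertools import product


def expand_and_prune(paths, ext_metric, m, list_size):
    """Extend every path by m bits and keep the list_size best candidates.

    paths: list of (metric, bits) pairs, with bits given as a tuple.
    ext_metric(bits, ext): hypothetical metric increment for appending ext.
    """
    candidates = []
    for metric, bits in paths:
        # Each surviving path spawns 2**m extensions.
        for ext in product((0, 1), repeat=m):
            candidates.append((metric + ext_metric(bits, ext), bits + ext))
    # len(candidates) == len(paths) * 2**m; selecting the best list_size of
    # them is the step that costs extra cycles in hardware.
    return heapq.nlargest(list_size, candidates, key=lambda c: c[0])


# Toy metric: every appended 1 costs one unit of reliability.
paths = [(0.0, (0,)), (-1.0, (1,))]
best = expand_and_prune(paths, lambda bits, ext: -float(sum(ext)), 2, 2)
print(best[0])
```

Because the candidate count grows exponentially in $M$, the selection step grows with it, which is consistent with the observation that the achieved speedup falls short of $M$.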
1.1.3 Error Control Decoders for RLNC
Random linear network coding (RLNC) is an efficient technique for disseminat-
ing information in networks (see, for example, [39–42]). Due to its random linear
operations, RLNC not only achieves network capacity with high probability in a
distributed manner, but also provides robustness against varying network condi-
tions [43]. Unfortunately, it is highly susceptible to errors due to noise, malicious or
malfunctioning nodes, or insufficient min-cut [44]. As a result, error control is vital
for RLNC.
Error control methods proposed for RLNC assume two transmission models. The
methods for the first model (see, for example, [45]) depend on and take advantage
of the underlying network topology or the particular linear networking operations
performed at various nodes. The methods for the other model (see, e.g., [44, 46])
assume that both the transmitter and the receiver have no knowledge of such channel
transfer characteristics. The two models are referred to as coherent and noncoherent
network coding, respectively. In this dissertation, we focus on error control for
noncoherent RLNC.
An error control code for noncoherent network coding [44], called a subspace
code, is a set of subspaces. Information is encoded in the choice of a subspace
spanned by a set of transmitted packets. A subspace code is called a constant-
dimension code (CDC) if all subspaces are of the same dimension. CDCs lead to
simplified network protocols due to the constant dimension. A class of asymptotically
optimal CDCs, referred to as Kötter-Kschischang (KK) codes, has been
proposed in [44]. A decoding algorithm based on interpolation for bivariate lin-
earized polynomials is also proposed for KK codes in [44]. It was shown in [46] that
KK codes correspond to lifting of Gabidulin codes, a class of optimal rank metric
codes. As a result, KK codes can be decoded by the generalized decoding algorithm
for the rank metric codes [46].
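The idea that information is carried by a subspace, and the lifting construction in particular, can be illustrated over GF(2). Lifting maps a $k \times m$ matrix $X$ to the row space of $[I \mid X]$, so every codeword subspace has dimension $k$. The sketch below is a toy under stated assumptions: $X$ is an arbitrary binary matrix rather than a Gabidulin codeword, rows are stored as integers, the channel is error-free, and the function names `lift` and `rref_gf2` are hypothetical. It shows that an invertible random-looking mixing of the transmitted packets leaves the row space unchanged, so the receiver recovers $[I \mid X]$ by row reduction without knowing the mixing.

```python
def lift(x_rows, m):
    """Return the rows of [I | X] as integers; x_rows holds k rows of m bits."""
    k = len(x_rows)
    width = k + m
    return [(1 << (width - 1 - i)) | row for i, row in enumerate(x_rows)]


def rref_gf2(rows, width):
    """Reduced row echelon form over GF(2); the leftmost column is the top bit."""
    rows = list(rows)
    rank = 0
    for col in range(width):
        if rank == len(rows):
            break
        bit = 1 << (width - 1 - col)
        pivot = next((r for r in range(rank, len(rows)) if rows[r] & bit), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r] & bit:
                rows[r] ^= rows[rank]  # row addition over GF(2) is XOR
        rank += 1
    return rows[:rank]


# k = 3 packets with m = 2 payload bits each.
sent = lift([0b10, 0b01, 0b11], 2)
# The network delivers linear combinations of the sent packets (invertible mix).
received = [sent[0] ^ sent[1], sent[1] ^ sent[2], sent[0] ^ sent[1] ^ sent[2]]
print(rref_gf2(received, 5) == sent)
```

When packets are corrupted or too few arrive, the received row space differs from the transmitted one, and that is the situation the KK and rank metric decoders are designed to handle.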
Motivated by KK codes, a new family of subspace codes, referred to as Mahdavifar-Vardy
(MV) codes in this dissertation, was proposed in [47–49]. List decoding, which has
been used to decode beyond the error correction diameter bound [50], can be applied
to the decoding of MV codes. Using algebraic list decoding, it was shown [49] that
MV codes can achieve a better tradeoff between rate and decoding radius than KK
codes.
Error control for RLNC comes at the expense of additional computations needed
for encoding and decoding. The complexities of existing decoding algorithms [44,
49, 51] for KK and MV codes are much higher than those of encoding, and are
hence critical to applications of RLNC. Most previous works focus on theoretical
aspects of network coding. For example, the decoding complexities of KK and MV
codes were analyzed in [44,46] and [47–49], respectively. However, theoretical analysis
does not completely reflect how the decoding algorithms affect the hardware
implementation results, such as area and throughput. For KK codes, decoder architectures
based on the generalized decoding algorithm for rank metric codes [46]
were proposed in [43]. Unfortunately, the rank metric decoder architectures in [43]
suffer from limited throughput, long decoding latency and high area complexity.
Besides, to the best of our knowledge, decoder architectures for MV codes and their
hardware implementations have not been investigated in the open literature.
1.2 Contributions and Organization
This dissertation has the following contributions and is organized as follows.
• In Chapter 2, the shuffled decoding algorithm and its corresponding decoder
architecture are investigated. Our main contributions of this chapter are two-
fold. First, we propose a shuffled schedule (SS) of the Min-Max algorithm
for NB-LDPC codes. To reduce the memory requirement and improve the
throughput, we also propose a modified shuffled schedule (MSS), which em-
ploys a novel shuffle sort (SST) algorithm to reduce the complexity of check
node processing significantly. Our simulation results show that both the SS
and MSS converge faster and have slightly better error performance than the
flooding schedule, and that the degradation of the MSS in error performance
as well as convergence rate is negligible. The simulation results also show that
the error performance of the MSS and layered schedule are almost the same.
Second, an efficient shuffled decoder architecture for NB QC-LDPC codes is
proposed based on the Min-Max algorithm using the modified shuffled sched-
ule. The proposed architecture has a similar top structure to other partly
parallel decoder architectures for binary and nonbinary LDPC codes. How-
ever, it has several key novelties: 1) its underlying modified shuffled schedule
is novel; 2) on-the-fly computation and hardware re-usage have been used to
reduce memory consumption and to improve the throughput; 3) a random
memory address generator (RMAG) has been employed in the check node
unit (CNU) to reduce the number of cycles required by CNP; 4) since the
variable node unit (VNU) becomes complex for decoders storing only the nm
most reliable values, the variable-to-check messages are stored in an improved
way so as to simplify the message access.
• In Chapter 3, a new decoding algorithm for NB-LDPC codes and a fully parallel
decoder architecture based on it are proposed. The main contributions of this chapter are
as follows:
1. Based on the Min-Max algorithm, a reduced memory complexity trellis
based check node processing (RTBCP) algorithm is proposed.
2. A simplified algorithm is proposed to reduce the computational complex-
ity of variable node processing (VNP). As a result, compared with the
RHS algorithm in [2], a stochastic decoder, the proposed decoding al-
gorithm needs fewer real multiplications but more real comparisons and
finite field additions.
3. For each a priori message, all LLRs except several most reliable ones
are approximated with a linear function. Two kinds of low complexity
LLR generation units (LGUs) are also proposed for the approximation of the
check-to-variable (c-to-v) LLR and a priori LLR, respectively. With a 5-bit
quantization scheme and $n_m = 32$, the areas of the two LGUs are
10.7% and 13.3%, respectively, of that of an SRAM which stores an LLR
vector under a 90nm CMOS technology. A similar approach was proposed
in [52] to approximate a priori LLR. The main differences between our
work and that in [52] are as follows:
– Besides the approximation of channel LLR vectors, we also approximate
check-to-variable LLR vectors.
– A simplified variable node processing (SVNP) algorithm is proposed
to compensate for the performance degradation caused by LLR approximation.
4. A parallel check node unit (CNU) and a low-latency variable node unit
(VNU) are proposed. Based on the proposed CNU and VNU, an efficient
fully parallel decoder architecture is also proposed. A fully parallel NB-LDPC
decoder over GF(256) is implemented with 28nm CMOS
technology. The decoder over GF(256) achieves a throughput of 546Mb/s
and an energy efficiency of 0.178nJ/b/iter.
Since routing congestion tends to be challenging for fully parallel LDPC de-
coder architectures, the proposed decoder architecture is not suitable for very
long LDPC codes. The proposed fully parallel decoder architecture is partic-
ularly advantageous for NB-LDPC codes over large fields, since the memory
reduction will be more significant when nm is large.
• In Chapter 4, we focus on efficient architectures and their hardware implementations
of interpolation based decoders for KK and MV codes. The main
contributions of this chapter are:
1. The decoder of KK codes has two stages: interpolation and factorization.
The generalized interpolation algorithm in [51] is used for the first stage
since it is more efficient than Gaussian elimination [51]. For factoriza-
tion, we propose a reformulated right division algorithm for linearized
polynomials, which is suitable for hardware implementations.
2. The list decoder of MV codes also has two stages: interpolation and
factorization. The generalized interpolation algorithm in [51] is used in
the interpolation process. A linearized Roth-Ruckenstein (LRR) algorithm [53]
was proposed in [47] to solve the factorization problem for MV
codes. In this chapter, we study the LRR algorithm in more detail.
For list size L = 2, we derive the equations used to compute all
the information symbols and uncover the relation between two possible
solutions. A matrix based LRR (M-LRR) algorithm, which is suitable
for hardware implementations, is also proposed for factorization.
3. A serial decoder architecture and an unfolded decoder architecture for
KK codes are proposed for applications with moderate and high through-
puts, respectively. Both architectures are implemented for KK codes over
GF($2^8$) and GF($2^{16}$) to demonstrate their efficiency. To the best of our
knowledge, this is the first efficient implementation of interpolation-based
decoder for KK codes. Compared to the rank metric decoder architec-
tures for KK codes [43], the proposed serial decoder architecture improves
the throughput by 4.9 and 13.2 times, while its gate counts are only 56%
and 76% of their respective counterparts in [43]. Moreover, for these two
codes, the unfolded architecture achieves a throughput of 12.5Gb/s and
41.6Gb/s, much higher than the throughput of 214Mb/s and 134Mb/s
of their respective counterparts in [43]. The throughputs per thousand
NAND gates of our architectures are much higher and their latency much
shorter than their counterparts in [43].
4. A serial list decoder architecture for MV codes is proposed. To the best
of our knowledge, this is the first hardware implementation of MV de-
coders. An efficient architecture for solving equations over an extension
field GF($q^{ml}$) (q > 2 is moderate) is proposed. The proposed equation
solver does not require complicated inversion operations over GF($q^{ml}$).
Besides, an implementation of factorization that computes all L possible
transmitted packets in parallel is proposed, where L is the list size for
list decoding.
• In Chapter 5, to the best of our knowledge, we propose the first hardware
implementation of the CA-SCL algorithm. Based on both algorithmic and architectural
improvements, our decoder architecture achieves better error performance
and higher area efficiency compared with the decoder architecture
in [31]. Specifically, the major contributions of this work are:
1. Message memories account for a significant fraction of the area of an SC or SCL
decoder [20,31]. In this chapter, an area efficient message memory architecture
is proposed. Besides, a new compression method for the channel
messages is used to reduce the area of the proposed decoder architecture.
2. An efficient processing unit (PU) is proposed. For the proposed list de-
coder architecture, a fine grained PU profiling (FPP) algorithm is pro-
posed to determine the minimum quantization size of each input message
for each PU so that there is no message overflow. By using the quantiza-
tion size generated by the FPP algorithm for each PU, the overall area
of all PUs is reduced.
3. An efficient scalable path pruning unit (PPU) is proposed to control the
copying of decoding paths. Based on the proposed memory architecture
and the scalable PPU, our list decoder architecture is suitable for large
list sizes.
4. A low-complexity direct selection scheme is proposed for the CA-SCL
algorithm when a strong CRC is used (e.g. CRC32). The proposed
direct selection scheme simplifies the selection of the final output data
word.
5. For a (1024, 512) rate-1/2 polar code, the proposed list decoder architecture
is implemented for list size L = 2 and 4, respectively, under a 90nm
CMOS technology. Compared with the decoder architecture in [31] synthesized
under the same technology, our decoder achieves 1.24 to 1.83
times the area efficiency (throughput normalized by area). Besides, the proposed
CA-SCL decoder has better error performance compared with the
SCL decoder in [31].
• In Chapter 6, a tree based reduced latency list decoding algorithm and its
corresponding high throughput hardware architecture are proposed for polar
codes. The main contributions are:
– A tree based reduced latency list decoding (RLLD) algorithm over the log-likelihood
ratio (LLR) domain is proposed for polar codes. Inspired
by the simplified successive cancelation (SSC) [30] decoding algorithm
and the ML-SSC algorithm [54], our RLLD algorithm performs the SC
based list decoding on a binary tree. Previous SCL decoding algorithms
visit all the nodes in the tree and consider all possibilities of the information
bits, while our RLLD algorithm visits far fewer nodes in the
tree and considers fewer possibilities of the information bits. When configured
properly, our RLLD algorithm significantly reduces the decoding
latency and hence improves throughput, while introducing little perfor-
mance degradation.
– Based on our RLLD algorithm, a high throughput list decoder architecture
is proposed for polar codes. Compared with the state-of-the-art SCL
decoders in [32, 33, 36], our list decoder achieves lower decoding latency
and higher area efficiency (throughput normalized by area).
More specifically, the major innovations of the proposed decoder architecture
are:
– An index based partial sum computation (IPC) algorithm is proposed to
avoid copying partial sums directly when one decoding path needs to be
copied to another. Compared with the lazy copy algorithm in [55], our
IPC algorithm is more hardware friendly since it copies only path indices,
while the lazy copy algorithm needs more complex index computation.
– Based on our IPC algorithm, a hybrid partial sum unit (Hyb-PSU) is
proposed so that our list decoder is suitable for larger block lengths.
The Hyb-PSU is able to store most of the partial sums in area efficient
memories such as register file (RF) or SRAM, while the partial sum units
(PSUs) in [31–33] store partial sums in registers, which need much larger
area when the block length N is larger. Compared with the PSU of [32],
our Hyb-PSU achieves an area saving of 23% and 63% for block length
$N = 2^{13}$ and $2^{15}$, respectively, under the TSMC 90nm CMOS technology.
– For our RLLD algorithm, when certain types of nodes are visited, each
current decoding path splits into multiple ones, among which the L most
reliable paths are kept. In this chapter, an efficient path pruning unit
(PPU) is proposed to find the L most reliable decoding paths among
the split ones. For our high throughput list decoder architecture, the
proposed PPU is the key to the implementation of our RLLD algorithm.
– For the fixed-point implementation of our RLLD algorithm, a memory
efficient quantization (MEQ) scheme is used to reduce the number of
stored bits. Compared with the conventional quantization scheme, our
MEQ scheme reduces the number of stored bits by 17%, 25% and 27%
for block length $N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively, at the cost of slight
error performance degradation.
Chapter 2
An Efficient Shuffled Decoder
Architecture for Nonbinary
Quasi-Cyclic LDPC Codes
2.1 Introduction
Binary low-density parity-check (LDPC) codes are increasingly popular in applications
because of their capacity-approaching performance. However, binary LDPC codes
start to show their weaknesses when the codeword
length is small or moderate, or when higher order modulation is used for transmission.
For these cases, nonbinary LDPC (NB-LDPC) codes over high order Galois
fields have shown great potential [3, 4]. For instance, in [1], a rate-1/2 NB-LDPC
code of length 84 over GF(64) is shown to perform 0.375dB better than a rate-1/2
binary irregular LDPC code of equivalent length 504 bits in [5] over binary input
additive white Gaussian noise (AWGN) channel. Over the QAM-AWGN channels,
NB-LDPC codes with a field order greater than or equal to the size of constellation
have the advantage that the encoder/decoder works directly with symbols. All map-
ping choices of the codeword symbols to the constellation points are equivalent and
lead to the same performance. In [1], an NB-LDPC code over GF(256) performs
0.5dB better than a rate-1/2 binary LDPC code of the equivalent length 1008 bits
over the QAM256-AWGN channel.
A significant obstacle to the application of NB-LDPC codes is that their decoding
algorithms are of high complexity. Hence, a lot of research effort has been spent
on efficient decoding algorithms for NB-LDPC codes [3, 6, 56–58]. Among them,
the EMS [58] and the Min-Max [6] algorithms draw the most attention because
of their low computation and memory complexity. Both the EMS and the Min-Max
algorithms can store only the $n_m$ ($n_m \ll q$) most reliable messages, thus reducing
the memory requirement at the cost of small performance degradation. The check
node processing of the EMS algorithm needs addition and comparison operations,
while the Min-Max algorithm needs only maximization and comparison in check
node processing. The trellis-based check node processing (TBCP) algorithm [7]
reduces the computational complexity of check node processing (CNP) of the Min-
Max algorithm by eliminating unnecessary check-to-variable messages from CNP.
Recently, a considerable amount of research effort has already been spent on
efficient decoder architectures for NB-LDPC codes [7, 10, 11, 15, 59] based on the
EMS or the Min-Max algorithm. The existing NB-LDPC decoders still suffer from
low throughput and large hardware complexity. For example, a (248, 124) NB-
LDPC decoder over GF(32) [15] achieves a throughput of 47.69 Mb/s at the cost of
10.33 mm$^2$ silicon area under 90nm technology, while a (1200, 720) binary LDPC
decoder [60] achieves a throughput of 5.92 Gb/s at the cost of 13.5 mm$^2$ silicon area
under 180nm technology.
The main contributions of this chapter are two-fold. First, we propose a shuffled
schedule (SS) of the Min-Max algorithm for NB-LDPC codes. To reduce the mem-
ory requirement and improve the throughput, we also propose a modified shuffled
schedule (MSS), which employs a novel shuffle sort (SST) algorithm to reduce the
complexity of check node processing significantly. Our simulation results show that
both the SS and MSS converge faster and have slightly better error performance
than the flooding schedule, and that the degradation of the MSS in error perfor-
mance as well as convergence rate is negligible. The simulation results also show
that the error performance of the MSS and layered schedule are almost the same.
Second, an efficient shuffled decoder architecture for NB QC-LDPC codes is pro-
posed based on the Min-Max algorithm using the modified shuffled schedule. The
proposed architecture has a similar top structure to other partly parallel decoder
architectures for binary and nonbinary LDPC codes. However, it has several key
novelties: 1) its underlying modified shuffled schedule is novel; 2) on-the-fly compu-
tation and hardware re-usage have been used to reduce memory consumption and to
improve the throughput; 3) a random memory address generator (RMAG) has been
employed in the check node unit (CNU) to reduce the number of cycles required by
CNP; 4) since the variable node unit (VNU) becomes complex for decoders storing
only the nm most reliable values, the variable-to-check messages are stored in an
improved way so as to simplify the message access.
For NB-LDPC codes, a shuffled schedule for the EMS algorithm was proposed
in [1], and a shuffled schedule and a probabilistic shuffled schedule for the belief
propagation were proposed in [61]. The work herein differs from the previous works
in [1, 61] in three aspects. First, while the works in [61] and [1] focus on the belief
propagation and EMS decoding of NB-LDPC codes, respectively, our work considers
the Min-Max algorithm. Second, while our shuffled schedule is similar to those in [1,
61], our modified shuffled schedule with reduced-complexity check node processing
is novel. Third, while the works in [1, 61] focus on decoding algorithms, this work
considers not only decoding algorithms but also decoder architectures as well as
their hardware implementations.
For binary LDPC codes, a layered decoder architecture tends to be more efficient
than a shuffled decoder architecture. This is because both the CNP and variable
node processing (VNP) are relatively simple. However, for NB-LDPC codes, layered
decoder architectures [7, 15, 59] have several drawbacks. First, the check node processing
of existing decoding algorithms for NB-LDPC codes is complex and requires
many cycles to finish. Processing the rows serially in the layered fashion requires
more cycles than processing all rows at the same time. Second, the variable node
processing for NB-LDPC codes is also complex compared to that of binary LDPC
codes. A round of variable node processing for a variable node takes 2nm cycles
for the nonbinary layered decoder architecture in [7]. Third, for layered decoder
architectures, the variable node processing may be performed several times for one
variable node during an iteration, leading to lower throughput. In contrast, the shuf-
fled decoder architecture proposed in this chapter processes all rows concurrently
and needs only a round of variable node processing for all variable nodes during an
iteration. This reduces the number of cycles needed for an iteration.
The shuffled schedule has also been used in decoder architectures for binary
LDPC codes (see, e.g., [62]). However, each check to variable (c-2-v) or variable
to check (v-2-c) message is a vector for NB-LDPC codes, whereas each c-2-v or
v-2-c message is just a single log likelihood ratio (LLR) for binary codes. This
fundamental difference also leads to higher complexities for decoder architectures
of NB-LDPC codes than their binary counterparts. The key novelties mentioned
above also apply when comparing the proposed architecture herein with that in [62].
The rest of this chapter is organized as follows. Section 2.2 briefly reviews the
TBCP algorithm. Section 2.3 proposes the shuffled and modified shuffled schedule
and presents our simulation results. Our decoder architecture and the hardware
implementation results are presented in Section 2.4. The conclusion is drawn in
Section 2.5.
2.2 Background
Consider check node $m$ and variable node $n$ in the Tanner graph [63] defined by
$H$, respectively. Let $M(n)$ denote the set of neighboring check nodes connected to
$n$, and $N(m)$ the set of variable nodes connected to $m$. For $a \in \mathrm{GF}(q)$, let $L_n(a)$
be the a priori information of variable node $n$ concerning the symbol $a$ and $Q_n(a)$
be the posteriori information of the same symbol. $R_{m,n}(a)$ and $Q_{m,n}(a)$ denote
the messages passed from $m$ to $n$ and from $n$ to $m$ concerning $a$, respectively. Let
$c_n$ be the $(n+1)$-th coordinate of a codeword and $s_n$ be the most likely symbol
for $c_n$. The Min-Max decoding algorithm [6] can be formulated as follows, where
$I_{max}$ denotes the maximal number of iterations and
$$A(m|a_n = a) \overset{\mathrm{def}}{=} \Big\{ (a_j)_{j \in N(m)\setminus\{n\}} \;\Big|\; \sum_{j \in N(m)\setminus\{n\}} h_{m,j} a_j + h_{m,n} a = 0 \Big\}:$$

Algorithm 1: Min-Max Algorithm [6]
Initialization:
    $L_n(a) = \ln(\Pr(c_n = s_n | \text{channel}) / \Pr(c_n = a | \text{channel}))$
    $Q_{m,n}(a) = L_n(a)$ $(0 \le m < M,\ 0 \le a < q)$, $i = 0$
Iteration:
    while $HC \neq 0$ and $i < I_{max}$ do
        check node processing:
            $R_{m,n}(a) = \min_{(a_j) \in A(m|a_n = a)} \big( \max_{j \in N(m)\setminus\{n\}} Q_{m,j}(a_j) \big)$
        variable node processing:
            $Q'_{m,n}(a) = L_n(a) + \sum_{m' \in M(n)} R_{m',n}(a)$
            $Q'_{m,n} = \min_{a \in GF(q)} Q_{m,n}(a)$
            $Q_{m,n}(a) = Q'_{m,n}(a) - Q'_{m,n}$
        tentative decoding:
            $c_n = \arg\min_a (Q_{m,n}(a))$
        $i = i + 1$

[Figure 2.1: Messages sent from check node to variable node]
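The check node processing step of Algorithm 1 can be illustrated by a brute-force sketch. It assumes, purely for simplicity, that every nonzero parity-check coefficient equals 1 and that q is a power of two, so that addition over GF(q) reduces to bitwise XOR; the function name `min_max_check_node` is hypothetical. The enumeration costs on the order of q^(d-1) operations per output message for a check node of degree d, which is precisely the complexity that practical CNP algorithms avoid.

```python
from itertools import product


def min_max_check_node(Q, q):
    """Brute-force Min-Max update for one check node.

    Q[j][a] is the v-2-c LLR of neighbor j for symbol a (0 = most reliable).
    Assumption: all coefficients h are 1 and q = 2^p, so GF(q) addition is XOR.
    Returns R with R[n][a] = min over valid configurations of the largest LLR.
    """
    d = len(Q)
    R = [[float("inf")] * q for _ in range(d)]
    for n in range(d):
        others = [j for j in range(d) if j != n]
        for config in product(range(q), repeat=len(others)):
            # The check equation forces a_n to be the XOR of the other symbols.
            a = 0
            for s in config:
                a ^= s
            worst = max(Q[j][s] for j, s in zip(others, config))
            R[n][a] = min(R[n][a], worst)
    return R
```

With nonunit coefficients, each symbol would first be multiplied by its coefficient over GF(q) before the XOR; that step is omitted here.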
The Min-Max algorithm stores the $n_m$ most reliable messages. Suppose there
are $n$ variable nodes connected with check node $c$. The truncated messages sent
from $n$ variable nodes to check node $c$ are shown in Fig. 2.1, where $q_{v,c}(k)$ is an LLR
value and $qs_{v,c}(k) \in \mathrm{GF}(q)$, where GF($q$) is a finite field with $q$ elements. Each
v-2-c message contains $n_m$ LLRs and $n_m$ associated field symbols.
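A minimal sketch of this truncation, assuming that a smaller LLR means a more reliable symbol (as implied by the normalization in Algorithm 1) and using a hypothetical helper name `truncate_message`:

```python
def truncate_message(llr, n_m):
    """Keep the n_m most reliable entries of a q-ary LLR vector.

    llr[a] is the LLR of field symbol a; smaller means more reliable.
    Returns (values, symbols), both ordered by non-decreasing LLR.
    """
    order = sorted(range(len(llr)), key=lambda a: llr[a])[:n_m]
    return [llr[a] for a in order], order
```

The pair returned corresponds to the LLR vector and the associated field symbol vector of one truncated message.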
Normally, the CNP is performed in a forward-backward way [6], which is memory
demanding. The TBCP algorithm [7] eliminates unnecessary v-2-c messages from
CNP, thus reducing the memory consumption. The TBCP algorithm first sorts these
$n(n_m - 1)$ nonzero LLRs in non-decreasing order as $x_c(1), x_c(2), \cdots, x_c(X)$, and only
the $X$ smallest ones are kept. Their associated field symbols are $\alpha_c(1), \alpha_c(2), \cdots, \alpha_c(X)$,
and they belong to variable nodes with indices $e_c(1), e_c(2), \cdots, e_c(X)$. A path
construction (PC) algorithm shown in Algorithm 2 is proposed in [7] to compute the
truncated c-2-v message: $r_{c,v}$ and $rs_{c,v}$, where $r_{c,v}$ is the $n_m$-dimension LLR vector
and $rs_{c,v}$ is the associated field symbol.
Algorithm 2: PC Algorithm [7]
input : $x_c(i)$, $\alpha_c(i)$, $e_c(i)$, $i = 1, \cdots, X$; $z_c(v)$, $v = 0, \cdots, d_c - 1$; $\alpha_{sum}$
output: $r_{c,v}(j)$, $rs_{c,v}(j)$ for $j = 0, \cdots, n_m - 1$
Initialization:
    $r_{c,v}(0) = 0$, $rs_{c,v}(0) = \alpha_{sum} \oplus z_c(v)$, $i = 1$, $cnt = 1$, $P_{c,0} = [0, 0, \cdots, 0]$
while $cnt < n_m$ do
    if $e_c(i) \neq v$ then
        $j = cnt$
        for $k = 0$ to $j - 1$ do
            $\alpha = rs_{c,v}(k) \oplus z_c(e_c(i)) \oplus \alpha_c(i)$
            if $P_{c,k}(e_c(i)) \neq 1$ for $\alpha \notin rs_{c,t}$ then
                $r_{c,v}(cnt) = x_c(i)$; $rs_{c,v}(cnt) = \alpha$
                $P_{c,cnt}(e_c(i)) = 1$
                $P_{c,cnt}(s) = P_{c,k}(s)$ for $s \neq e_c(i)$
                $cnt = cnt + 1$
    $i = i + 1$
As shown in Algorithm 2, $z_c(v) = \alpha_{c,v}(0)$, $v = 1, \cdots, n$ and $\alpha_{sum} = \sum_{k=1}^{n} z_c(k)$.
$P_{c,k}$ is an $n$-dimension vector over GF(2) which stores the constructed path. $\oplus$
denotes addition over GF($q$). The constructed $n_m$ LLRs are picked from the sorted
list $x_c(i)$, and stored in $r_{c,v}$. Their associated field symbols are stored in $rs_{c,v}$. $X$
is set to $1.5 n_m$ [7] so that the decoding performance could be maintained. It takes
$2 n_m$ iterations to compute $r_{c,v}$, $rs_{c,v}$ [7].
2.3 Shuffled and Modified Shuffled Schedule
2.3.1 Shuffled Schedule
Suppose $H$ is divided into $G$ block columns: $H = [H_0\ H_1\ \cdots\ H_{G-1}]$, where $H_i$ is an
$M \times g$ sub-matrix and $g = N/G$. Let $M(n)$ denote the set of neighboring check
nodes connected to $n$, and $N(m)$ denote the set of variable nodes connected to $m$.
Let $I_v(i), S_v(i)$ for $i = 0, \cdots, n_m - 1$ denote an $n_m$-ary a priori message, where
$I_v$ is the LLR vector and $S_v$ is the corresponding field symbol vector, respectively.
Let $CM^{(k)}_{c,l}(i) = (x^{(k)}_{c,l}(i), \alpha^{(k)}_{c,l}(i), e^{(k)}_{c,l}(i))$ for $i = 1, \cdots, X$, which are the inputs of the
PC algorithm [7] when computing updated c-2-v messages within block column $l$
in iteration $k$. Let $(q^{(k)}_{v,c}, qs^{(k)}_{v,c})$ and $(r^{(k)}_{c,v}, rs^{(k)}_{c,v})$ denote a v-2-c and c-2-v message in
iteration $k$, respectively. Suppose the row weight for each $H_i$ is exactly 1, which can
be easily satisfied by QC-LDPC codes. The proposed shuffled schedule is shown in
Algorithm 3.
During the initialization step, for each check node c, the init sort (IS) algorithm
sorts out X LLR values in a non-decreasing order from incoming dc(nm − 1) v-
2-c message elements, where dc is the corresponding check node degree. The IS
algorithm is shown in Algorithm 4, where n = dc; t, ts and ti are all X-dimension
vectors.
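The init sort step described above can be sketched as a sequence of two-way merges that keeps only the X smallest entries after each merge. This is an illustrative reading rather than a transcription of Algorithm 4: messages are assumed to be (LLR list, symbol list) pairs sorted in non-decreasing LLR order whose first entry has LLR 0 and is skipped, and the name `init_sort` is hypothetical.

```python
def init_sort(messages, X):
    """Merge d_c sorted truncated messages, keeping the X smallest nonzero LLRs.

    messages[j] = (llrs, syms) for the j-th neighboring variable node.
    Returns a list of (llr, symbol, node_index) in non-decreasing LLR order.
    """
    llrs0, syms0 = messages[0]
    t = [(llrs0[k], syms0[k], 0) for k in range(1, len(llrs0))]
    for j in range(1, len(messages)):
        llrs, syms = messages[j]
        new = [(llrs[k], syms[k], j) for k in range(1, len(llrs))]
        merged, a, b = [], 0, 0
        # One comparison per output entry, so at most X comparisons per merge.
        while len(merged) < X and (a < len(t) or b < len(new)):
            if b == len(new) or (a < len(t) and t[a][0] <= new[b][0]):
                merged.append(t[a])
                a += 1
            else:
                merged.append(new[b])
                b += 1
        t = merged
    return t
```

Because each of the d_c - 1 merges produces at most X entries, the sketch uses at most (d_c - 1)X comparisons, matching the count quoted for the IS algorithm in Section 2.3.2.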
The block_vnp($l$) function in Algorithm 3 computes the corresponding $q^{(k+1)}_{v,c}$,
Algorithm 3: Shuffled Schedule
Initialization:
    $q^{(0)}_{c,v} = I_v$, $qs^{(0)}_{c,v} = S_v$ for $c \in M(v)$
    for $c = 0$ to $M - 1$ do
        $CM^{(0)}_{c,0} = IS(\{(q^{(0)}_{v,c}, qs^{(0)}_{v,c}) \mid v \in N(c)\})$
Iteration:
    for $k = 0$ to $I_{max} - 1$ do
        for $l = 0$ to $G - 1$ do
            for $c = 0$ to $M - 1$ do
                for $\{v \in N(c)$ and $lG \le v < (l+1)G\}$ do
                    $(r^{(k+1)}_{c,v}, rs^{(k+1)}_{c,v}) = PC(CM^{(k)}_{c,v})$
            $(q^{(k+1)}_{v,c}, qs^{(k+1)}_{v,c}) =$ block_vnp($l$)
            for $c = 0$ to $M - 1$ do
                for $v \in N(c)$ do
                    if $v < (l+1)G$ then
                        $tq_{v,c} = q^{(k+1)}_{v,c}$; $tqs_{v,c} = qs^{(k+1)}_{v,c}$
                    else $tq_{v,c} = q^{(k)}_{v,c}$; $tqs_{v,c} = qs^{(k)}_{v,c}$
                $CM^{(k)}_{c,l+1} = IS(\{(tq^{(k+1)}_{v,c}, tqs^{(k+1)}_{v,c}) \mid j \in N(c)\})$
        $CM^{(k+1)}_{c,0} = CM^{(k)}_{c,G}$
Algorithm 4: IS Algorithm
input : $(q^{(k)}_{v,c}, qs^{(k)}_{v,c}) \mid v \in N(c) = \{v_0, v_2, \cdots, v_{d_c - 1}\}$
output: $CM^{(k)}_{c,l}(i) = (x^{(k)}_{c,l}(i), \alpha^{(k)}_{c,l}(i), e^{(k)}_{c,l}(i))$ for $i = 1, \cdots, X$
$t(i) = q^{(k)}_{v_1,c}(i)$, $ts(i) = qs^{(k)}_{v_1,c}(i)$, $ti(i) = 0$ for $i = 1, 2, \cdots, n_m - 1$
for $j = 1$ to $d_c - 1$ do
    $a = b = 1$
    for $i = 1$ to $X$ do
        if $t(a) \le q^{(k)}_{v_j,c}(b)$ then
            $T^{(k)}_{c,l}(i) = (t(a), ts(a), ti(a))$; $a = a + 1$
        else $T^{(k)}_{c,l}(i) = (q^{(k)}_{v_j,c}(b), q^{(k)}_{v_j,c}(b), j)$; $b = b + 1$
    $(t(i), ts(i), ti(i)) = CM^{(k)}_{c,l}(i)$ for all $i$
$CM^{(k)}_{c,l} = T^{(k)}_{c,l}$
$qs^{(k+1)}_{v,c}$ messages within block $l$. Take variable node $v$ as an example. The $q$-ary
message $Q_{v,c}$ is first computed as $Q_{v,c}(s) = L_v(s) + \sum_{c' \in M(v)\setminus c} R_{c',v}(s)$ for
$s = 0, 1, \cdots, q - 1$. Here, $L_v(s) = I_v(i)$ if $S_v(i) = s$; otherwise $L_v(s) = I_v(n_m - 1)$,
which is the maximum of $I_v$. Besides, $R_{c',v}(s) = r^{(k+1)}_{c',v}(i_{c'})$ if $rs^{(k+1)}_{c',v}(i_{c'}) = s$; otherwise
$R_{c',v}(s) = r^{(k+1)}_{c',v}(n_m - 1)$. Finally, $q^{(k+1)}_{v,c}$ and $qs^{(k+1)}_{v,c}$ are computed by sorting $Q_{v,c}$.
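The expansion-and-sort computation just described can be sketched for a single edge as follows. The helper names `expand` and `vnp_one_edge` are hypothetical, messages are assumed to be (LLR list, symbol list) pairs sorted in non-decreasing LLR order, and the final subtraction of the minimum is an assumption carried over from the normalization step of Algorithm 1.

```python
def expand(llrs, syms, q):
    """Truncated message -> q-ary vector; missing symbols get the largest stored LLR."""
    full = [llrs[-1]] * q
    for val, s in zip(llrs, syms):
        full[s] = val
    return full


def vnp_one_edge(prior, c2v_others, q, n_m):
    """v-2-c message toward one check node.

    prior      : (I_v, S_v), the truncated a priori message.
    c2v_others : truncated c-2-v messages from the *other* neighboring check nodes.
    Returns (llrs, syms) holding the n_m most reliable entries.
    """
    total = expand(*prior, q)
    for llrs, syms in c2v_others:
        r = expand(llrs, syms, q)
        total = [t + x for t, x in zip(total, r)]
    order = sorted(range(q), key=lambda s: total[s])[:n_m]
    base = total[order[0]]  # assumption: normalize so the best entry is 0
    return [total[s] - base for s in order], order
```

The full q-ary vector exists only transiently inside the function; only the n_m entries returned would need to be stored.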
2.3.2 Modified Shuffled Schedule
For the shuffled schedule in Algorithm 3, the computation of $CM^{(k)}_{c,l+1}$ employs the
IS algorithm, which needs $(d_c - 1)X$ comparisons. Thus, $d_c(d_c - 1)X$ comparisons
are needed for computing all $CM^{(k)}_{c,l}$ during an iteration. Besides, all v-2-c messages
need to be stored. This results in low throughput as well as a significant memory
requirement. Instead, we propose a modified shuffled schedule (MSS) which uses a
shuffled sort (SST) algorithm to compute $CM^{(k)}_{c,l+1}$ as shown in Algorithm 5.
Algorithm 5: SST Algorithm
input : $(q^{(k+1)}_{v,c}, qs^{(k+1)}_{v,c}, l)$, $CM^{(k)}_{c,l}$
output: $CM^{(k)}_{c,l+1}$
$a = 1$, $b = 1$, $c = 1$
for $x = 0$ to $n_m + X - 2$ do
    if $e^{(k)}_{c,l} == l$ then $b = b + 1$; continue
    if $q^{(k+1)}_{v,c}(a) \le x^{(k)}_{c,l}(b)$ then
        $CM^{(k)}_{c,l+1}(c) = (q^{(k+1)}_{c,v}(a), qs^{(k+1)}_{c,v}(a), l)$
        $c = c + 1$, $a = a + 1$
    else $CM^{(k)}_{c,l+1}(c) = CM^{(k)}_{c,l}(b)$; $c = c + 1$, $b = b + 1$
    if $c == X + 1$ then break
The proposed SST algorithm needs at most $n_m + X - 1$ comparisons to compute
$CM^{(k)}_{c,l+1}$ based on $CM^{(k)}_{c,l}$ and $(q^{(k+1)}_{v,c}, qs^{(k+1)}_{v,c})$. It only takes at most
$(d_c - 1)X + (d_c - 1)(n_m + X - 1) = (d_c - 1)(n_m + 2X - 1)$ comparisons to compute all $CM^{(k)}_{c,l}$ during
an iteration. Besides, only $(q^{(k+1)}_{v,c}, qs^{(k+1)}_{v,c})$ need to be stored. As a result, compared
to the shuffled schedule in Algorithm 3, the MSS using the SST algorithm needs
fewer comparisons and less memory.
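A sketch of the shuffled sort idea, together with the two comparison counts quoted above, is given below. It is one possible reading of Algorithm 5: stale entries of the old list that belong to the variable node being updated are dropped, the remaining entries are merged with the new message, and the X smallest results are kept. The names `shuffled_sort`, `ss_comparisons` and `mss_comparisons` are hypothetical, and the first (zero-LLR) entry of the new message is assumed to be skipped.

```python
def shuffled_sort(new_llrs, new_syms, l, cm, X):
    """Update a sorted candidate list after one variable node has been refreshed.

    cm : list of (llr, symbol, node_index) in non-decreasing LLR order.
    l  : index of the variable node whose v-2-c message was just updated.
    """
    fresh = [(new_llrs[k], new_syms[k], l) for k in range(1, len(new_llrs))]
    out, a, b = [], 0, 0
    while len(out) < X and (a < len(fresh) or b < len(cm)):
        if b < len(cm) and cm[b][2] == l:
            b += 1  # stale entry from the node being updated
            continue
        if b == len(cm) or (a < len(fresh) and fresh[a][0] <= cm[b][0]):
            out.append(fresh[a])
            a += 1
        else:
            out.append(cm[b])
            b += 1
    return out


def ss_comparisons(d_c, X):
    """Comparisons per iteration when every update reruns the IS algorithm."""
    return d_c * (d_c - 1) * X


def mss_comparisons(d_c, n_m, X):
    """Upper bound on comparisons per iteration with the SST algorithm."""
    return (d_c - 1) * (n_m + 2 * X - 1)
```

For instance, with $n_m = 16$ and $X = 24$ as in Section 2.3.3 and a purely hypothetical check node degree of 4, the two counts are 288 and 189.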
2.3.3 Simulation Results
Fig. 2.2 shows the frame error rate (FER) performance of the FFT-BP algorithm
and the Min-Max algorithm with the flooding schedule, the shuffled and modified shuffled
schedules as well as the layered schedule for three NB-LDPC codes over GF(32) [64]
over the AWGN channel with BPSK modulation. For our simulations, $I_{max} = 30$,
$n_m = 16$, and $X = 1.5 n_m = 24$ for the SS and MSS. The flooding and layered
schedules also use the TBCP algorithm in CNP. For all three codes, the error performance
with the MSS and SS is slightly better than that with the flooding schedule.
The layered schedule, the SS, and the MSS have nearly the same error performance,
which implies that the MSS results in little or no error performance degradation.
[Figure 2.2: FERs of selected codes. FER versus SNR (dB) for the (372, 286), (620, 509) and (837, 726) codes under FFT-BP, SS, MSS, flooding and layered decoding; the (837, 726) legend also lists MSS-(3,2).]
[Figure 2.3: Comparison of convergence rates. IR1, IR0 and LR versus SNR for the (372, 286), (620, 509) and (837, 726) codes.]
We also compare the convergence speed of the MSS, SS and layered schedule with
the flooding schedule in Fig. 2.3. Here IR0 = $N_{mss}/N_f$, IR1 = $N_{ss}/N_{mss}$ and LR = $N_l/N_f$,
where $N_{ss}$, $N_{mss}$, $N_f$ and $N_l$ are the average numbers of iterations of the SS,
the MSS, the flooding and layered schedule, respectively. Several observations can
be made about Fig. 2.3. First, throughout the SNR range, IR0 < 1 and IR1 ≈ 1.
Thus, the MSS results in no degradation in convergence speed compared with the
SS, and both the MSS and SS converge faster than the flooding schedule. Second,
when FER is around $10^{-4}$, the average number of iterations of the MSS is only
60% to 70% of that of the flooding schedule. Third, IR0 and LR start to grow in
the high SNR region, since even the flooding schedule converges very fast at high SNR.
Thus the advantage of the MSS and SS in convergence speed decreases when the
SNR is high. The same phenomenon was observed in [61].
As shown in Fig. 2.3, the layered schedule requires fewer iterations for the sim-
ulated codes than the MSS and SS, especially when the SNR is high. However, in a
hardware implementation, the MSS results in fewer clock cycles per iteration than
the layered schedule for two reasons. First, all rows can be processed at the same
time for the MSS, while only a block of rows can be processed concurrently for the
layered schedule. Second, the proposed MSS with the shuffled sort algorithm simplifies
the CNP after init sort is finished. In order to update the c-2-v messages in one
block column, only the updated v-2-c messages in the previous block column and
old CMs are needed. As a result, the number of real value comparisons is reduced.
[Figure 2.4: CNU Architecture]
2.4 Shuffled Decoder Architecture
In this section, a shuffled decoder architecture with reduced memory consumption
and higher throughput is proposed for nonbinary QC-LDPC codes whose parity
check matrices consist of sub-matrices that are either zero matrices or shifted identity
matrices with nonzero entries replaced by elements of GF($q$), where $q = 2^p$.
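The structure of such a parity check matrix can be illustrated with a short sketch. The function name `qc_parity_check`, the shift table and the coefficient values are all hypothetical, and GF(q) elements are represented simply as integers in the range 1 to q - 1.

```python
def qc_parity_check(shifts, coeffs, b):
    """Assemble H from b x b blocks that are zero or shifted, scaled identities.

    shifts[i][j] : cyclic shift of block (i, j), or None for an all-zero block.
    coeffs[i][j] : b nonzero GF(q) elements (as ints), one per row of the block.
    """
    rows, cols = len(shifts), len(shifts[0])
    H = [[0] * (cols * b) for _ in range(rows * b)]
    for i in range(rows):
        for j in range(cols):
            s = shifts[i][j]
            if s is None:
                continue
            for r in range(b):
                # Row r of the block has its single nonzero entry in column (r + s) mod b.
                H[i * b + r][j * b + (r + s) % b] = coeffs[i][j][r]
    return H
```

If each block column of the shuffled schedule is taken to be one column of blocks, every row has at most one nonzero entry per block column, and exactly one when no block in that row is zero, which is the row weight condition assumed in Section 2.3.1.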
2.4.1 Check Node Unit Architecture
The architecture of the proposed CNU, which includes a top sorter and a path
constructor, is shown in Fig. 2.4, where $m$ denotes the number of quantization bits of an LLR
message and $z = \lceil \log_2 d_c \rceil$. The top sorter provides the corresponding inputs for the path
constructor, which implements Algorithm 2. Taking check node $c$ as an example, the
top sorter in Fig. 2.4 computes $CM^{(0)}_{c,0}$ using the IS algorithm at the initialization step.
It also computes $CM^{(k)}_{c,l+1}$ for $l = 0, 1, \cdots, G - 1$ using the SST algorithm during the
iteration process. Once $CM^{(k)}_{c,l}$ is available, the path constructor computes updated
c-2-v messages within block column $l$ using Algorithm 2.
The architecture of the proposed top sorter is shown in Fig. 2.5. It consists
of a parallel sorter (PS), two X×(m+p+z)-bit RAMs (RAM0 and RAM1) used
to store $CM^{(k)}_{c,l}$, and a random memory address generator (RMAG). Each word of
RAM0 and RAM1 stores an LLR value, the associated field symbol and index. At
the initialization step, the computation of $CM^{(0)}_{c,0}$ is carried out in dc rounds. In
the first round, $q^{(0)}_{v_1,c}$ and $qs^{(0)}_{v_1,c}$ are copied into the corresponding locations
of RAM0. In addition, the index value of each word of RAM0 is set to 1. In the second round,
the temporary result $T^{(0)}_{c,0}$ shown in Algorithm 4 is stored in RAM1. In the next round,
$T^{(0)}_{c,0}$ is stored in RAM0. This alternation repeats until $CM^{(0)}_{c,0}$ is computed.
Figure 2.5: Architecture of Proposed Top Sorter
Figure 2.6: Proposed RMAG Architecture
The computation of $CM^{(k)}_{c,l+1}$ can be implemented in the same way as the
computation of $CM^{(0)}_{c,0}$. However, extra cycles are spent on testing whether
$e^{(k)}_{c,l} = l$, as shown in Algorithm 5. In the worst case, nm − 1 cycles are used for
the index testing. As shown in Fig. 2.6, a random memory address generator is proposed
to eliminate the cycles used for index testing during the computation of $CM^{(k)}_{c,l+1}$.
As a result, only X cycles are needed for the computation of $CM^{(k)}_{c,l+1}$. For example,
assume X = 5 and suppose we need to compute $CM^{(0)}_{c,1}$ while $CM^{(0)}_{c,0}$ is stored in
RAM0. Suppose $e^{(0)}_{c,0}(1) = e^{(0)}_{c,0}(3) = 0$. Then the RMAG generates the read address
sequence (0, 2, 4).
The RMAG first stores a binary sequence $S_l = (s_1, s_2, \cdots, s_X)$, where $s_i = 1$
if $e^{(k)}_{c,l}(i) = l$ and $s_i = 0$ otherwise. Then, $S_l$ is used to generate the read address
sequence for the computation of $CM^{(k)}_{c,l+1}$. As shown in Fig. 2.6, at the initialization
step, $S_0$ is computed and stored in registers $A_0, A_1, \cdots, A_{X-1}$, which are used to
generate the read address sequence when computing $CM^{(0)}_{c,1}$. Meanwhile, $S_1$ is stored
in $B_0, B_1, \cdots, B_{X-1}$, which are used in the computation of $CM^{(0)}_{c,2}$. This repeats
until the decoding of a codeword is finished. The First One Encoder outputs the smallest
i such that $d_i = 1$. The one-hot encoder (OE) in Fig. 2.6 outputs a binary sequence
$(e_0, e_1, \cdots, e_{X-1})$, where $e_x = 1$ and $e_j = 0$ for $j \neq x$. Here x is the decimal
value of the input of the OE.
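The address generation described above can be illustrated with a small behavioral sketch in Python. It is not the RTL of Fig. 2.6: the function name `rmag_read_addresses` is hypothetical, and the sketch assumes that the words to be read are those whose stored index differs from the current block column l (the complement of $S_l$), which is the reading consistent with the X = 5 example in the text. The index values 1, 2 and 3 in the usage line are made up for illustration.

```python
def rmag_read_addresses(e, l):
    """Return the RAM read addresses of the entries whose index differs from l."""
    # s[i] = 1 marks an entry whose variable-node index equals block column l.
    s = [1 if idx == l else 0 for idx in e]
    # Assumption: the entries to read are those with s[i] = 0, so the
    # "first one" search below runs on the complement d of s.
    d = [1 - bit for bit in s]
    addresses = []
    while any(d):
        first = d.index(1)        # First One Encoder: smallest i with d[i] = 1
        addresses.append(first)
        d[first] = 0              # the one-hot mask clears the bit just served
    return addresses


# Example from the text: X = 5, e(1) = e(3) = 0, block column l = 0.
print(rmag_read_addresses([1, 0, 2, 0, 3], 0))
```

Because each pass serves exactly one address, the number of cycles equals the number of entries that are actually read, with no cycle spent on entries that fail the index test.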
The proposed path constructor shown in Fig. 2.7 is almost the same as that
in [7] except that: 1) the size of the CRAM is q×w instead of nm×w; 2) the maximum
of $r^{(k)}_{c,v}$ is stored in MaxR; 3) part of the hardware in the path constructor is reused
in the VNP. These changes simplify the VNP when only part of the c-2-v messages are
stored. As shown in Fig. 2.7, $r^{(k)}_{c,v}(i)$ is stored in the memory word whose address is
$rs^{(k)}_{c,v}(i) \otimes h^{-1}_{i,j}$, where $\otimes$ denotes multiplication over GF(q). According to
Algorithm 2, only nm LLRs are generated, so q − nm words of the CRAM are undefined. During
the VNP, the CRAM outputs $R_{c,v}(s)$ to the VNU. If a VNU needs $R_{c,v}(s)$, then the input
vsym in Fig. 2.7 equals s. If $s \in rs^{(k)}_{c,v}$, which is stored in nm p-bit registers, then
CRAM(s) is sent to the output. Otherwise, CRAM(s) is not defined and MaxR is
sent to the output.
Figure 2.7: Proposed Path Constructor Architecture
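The read behavior of the CRAM can be summarized by the following Python sketch. It only models the lookup rule stated above (return the stored LLR if the symbol was written, otherwise MaxR); the function name `cram_read`, the toy field size q = 8 and the LLR values are assumptions for illustration, and the GF(q) multiplication that produces the write address is taken as already applied.

```python
def cram_read(cram, stored_symbols, max_r, s):
    """Return the c-2-v LLR for field symbol s, or MaxR if it was never written."""
    if s in stored_symbols:       # s matches one of the nm symbol registers
        return cram[s]
    return max_r                  # CRAM(s) is undefined, so MaxR is used


# Toy example (hypothetical values): q = 8 words, only three of them written.
q = 8
cram = [None] * q
written = {3: 0.0, 5: 1.25, 6: 2.0}   # address (field symbol) -> LLR
for address, llr in written.items():
    cram[address] = llr
max_r = max(written.values())          # MaxR holds the largest stored LLR

print(cram_read(cram, set(written), max_r, 5))
print(cram_read(cram, set(written), max_r, 2))
```

Addressing the CRAM by field symbol is what makes the memory q words deep instead of nm, in exchange for a lookup that needs no search.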
2.4.2 Variable Node Unit Architecture
The proposed VNU architecture is similar to that used in the binary LDPC decoder in [65].
As shown in Fig. 2.8, assuming a variable node degree of 4, a w×q RAM (TempRAM
in Fig. 2.8) is employed to store channel LLR values in the same way that c-2-v
values are stored in the CRAM in Fig. 2.7. For variable node v, the proposed VNU
computes the q elements of $Q_{v,c}$ serially. Meanwhile, these q elements are sent to the PS,
which sorts out the nm minimal LLRs and their corresponding field symbols. The
architecture of the PS is similar to the sorter proposed in [10], and hence is omitted in
this chapter.
Figure 2.8: Proposed VNU Architecture
2.4.3 Top Decoder Architecture
Consider a nonbinary QC-LDPC code whose parity-check matrix H can be divided
into r × t sub-matrices of dimension s × s. Accordingly, H can be divided
into t block columns. The top architecture of the proposed shuffled decoder, shown
in Fig. 2.9, is a partly parallel architecture and hence has a similar top structure to
other partly parallel architectures (see, for example, [62,65]). M = r × s CNUs process
all rows concurrently, and s VNUs process s columns concurrently. Two groups
of barrel shifters, BS0 and BS1, implement the interconnection between the VNUs
and the CNUs. The barrel shifter BS0 has s k-bit inputs and s k-bit outputs, where
$k = m + \lceil \log_2 d_v \rceil$. The barrel shifter BS1 has s u-bit inputs and s u-bit outputs,
where u = max(m, q). The channel message RAM has two parts: an LLR RAM and
its field symbol RAM. When a CNU needs to load messages from the channel message
RAM, the LLR value and its associated field symbol travel through BS0 and
BS1, respectively.
The decoding schedule of the proposed shuffled decoder is shown in Fig. 2.10.
During the initial sort process, the CNU loads channel LLR messages to compute
$CM^{(0)}_{c,0}$. It takes nm + (dc − 1)(X + 1) cycles to compute all X elements of $CM^{(0)}_{c,0}$.
Actually, the path construction (PCons) process can start two cycles after
$CM^{(0)}_{c,0}(1)$ is written into RAM0 or RAM1. At the same time, the RMAG begins
to store the index comparison results (SICR) in registers $A_i$ or $B_i$ of the RMAG
once $CM^{(0)}_{c,0}(1)$ is available. Since the path construction takes 2nm cycles, the total
number of cycles used before iteration 1 is (dc − 2)(X + 1) + 3nm + 2. After the
initialization process, the shuffled decoder enters regular iterations. Consider
the processing of block column 0 during iteration 1. The VNP updates the v-2-c messages
within block column 0. The updated v-2-c messages $(q^{(1)}_{v,c}, qs^{(1)}_{v,c})$ are stored in the
PS. The VNP takes q cycles. The shuffle sort (SST) begins once the VNP is finished.
It takes only 1 + 1.5nm cycles to compute $CM^{(0)}_{c,1}$, because the RMAG eliminates the
cycles used in index comparison. The PCons process can start two cycles after the
SST starts. The number of cycles used for processing one block column is then just
2 + 2nm, because the SST and PCons are conducted at the same time. The processing
of the other block columns is the same as that of block column 0. The total number
of cycles used for decoding one received word is (dc − 2)(X + 1) + 3nm + 2 + dc(2 +
2nm + v)Imax.
Figure 2.9: Proposed shuffled decoder architecture for NB QC-LDPC codes
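As a quick numerical check, the total cycle count above can be evaluated with a few lines of Python. The function name `cycles_per_word` is hypothetical. The values X = 24, nm = 16 and 15 iterations are those used for the implemented decoder; dc = 27 and v = 34 are not stated in this section and are assumptions, chosen here because they reproduce the Ncycle value reported for the proposed decoder in Table 2.1.

```python
def cycles_per_word(d_c, X, n_m, v, i_max):
    """Evaluate (dc - 2)(X + 1) + 3nm + 2 + dc(2 + 2nm + v)Imax."""
    init_cycles = (d_c - 2) * (X + 1) + 3 * n_m + 2     # before iteration 1
    cycles_per_iteration = d_c * (2 + 2 * n_m + v)      # all block columns
    return init_cycles + cycles_per_iteration * i_max


# d_c = 27 and v = 34 are assumed values (see the lead-in above).
print(cycles_per_word(d_c=27, X=24, n_m=16, v=34, i_max=15))
```

With these inputs the initialization contributes 675 cycles and each iteration 1836 cycles, so the per-iteration term dominates the total.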
Figure 2.10: Decoding Schedule of Proposed Shuffled Decoder (legend: PCons = path construction; SICR = store index comparison results; SM = shuffled merge; AG = address generating; PS = parallel sorting; VNP = variable node processing)
2.4.4 Implementation Results
A shuffled decoder architecture for an (837, 726) QC-LDPC code over GF($2^5$) is
implemented. nm and X are set to 16 and 24, respectively, to ensure good decoding
performance. Each LLR is represented by w = 5 bits, with three integer bits and two
fractional bits. This quantization scheme introduces little error performance
degradation, as shown in Fig. 2.2. Two stages of pipeline registers have been inserted as
shown in Fig. 2.9. The memories in the CNUs and VNUs are implemented with register
files in order to increase the achievable clock frequency. The decoder
is synthesized with Cadence RTL Compiler using an SMIC 130nm library. The
synthesis results are summarized in Table 2.1, where Ncycle denotes the total number of
clock cycles required to decode one received word (assuming 15 iterations). If
the clock frequency of a decoder is f MHz, then the corresponding throughput
of the proposed shuffled decoder is $(f N_b R_{code})/N_{cycle}$ Mb/s, where Nb and Rcode denote
the equivalent code length in bits and the code rate, respectively. The
efficiency in Table 2.1 is defined as the throughput-to-gate-count ratio (Mbps/Million
gates).
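The throughput and efficiency definitions can be checked numerically with the short Python sketch below. The function name `throughput_mbps` is hypothetical; the inputs are the figures reported for the proposed decoder (500 MHz, Nb = 4185 bits, rate 726/837, Ncycle = 28215, 2.13M gates). The results agree with the reported 64.3 Mb/s and 30.18 Mbps/Million gates up to rounding.

```python
def throughput_mbps(f_mhz, n_b, code_rate, n_cycle):
    """Throughput in Mb/s for a clock of f_mhz MHz: f * Nb * Rcode / Ncycle."""
    return f_mhz * n_b * code_rate / n_cycle


throughput = throughput_mbps(f_mhz=500, n_b=4185, code_rate=726 / 837, n_cycle=28215)
efficiency = throughput / 2.13      # Mb/s per million NAND gates

print(round(throughput, 1), round(efficiency, 1))
```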
Implementations in [7,15,59] for the same (837, 726) nonbinary QC-LDPC code
are also shown in Table 2.1 (the results presented in [7] are based not on synthesis
but on estimation). The efficiency of the proposed decoder architecture is much
higher than that of these previous works. Compared with [59], the efficiency of our work is
about 2.6 times as high. Even if the frequency of [59] doubles, the efficiency of our work
is still about 30% higher. The efficiency of our work is about 3.4 times that of [15], despite
the more advanced technology used in [15].
Table 2.1: Decoder Complexity Comparison for an (837, 726) LDPC Code over GF(32)
                               [7]       [59]      [15]      proposed
Iterations                     15        15        15        15
nm                             16        32        8         16
Schedule                       layered   layered   layered   MSS
Process technology (nm)        N/A       180       90        130
Quantization bits              5         5         7         5
Frequency (MHz)                150       200       260       500
Ncycle                         62240     53541     N/A       28215
Throughput (Mb/s)              10        16        29.0      64.3
Gate count (NAND)              1.27M     1.37M     3.28M     2.13M
Efficiency (Mbps/Mil gates)    7.87      11.67     8.84      30.18
We remark that both the throughput and the efficiency assume that the decoding
takes 15 iterations. We adopt this definition for consistency, since the throughput
in [7,15,59] is also defined in the same fashion. While this definition reflects the worst
case instantaneous throughput of the decoder architecture, it does not account for
the convergence behavior of the decoding algorithm if early termination is enabled.
For the three codes in Fig. 2.3, the layered schedule reduces the required number of
iterations by less than 20% compared with the MSS. If the throughput were instead
defined in terms of the number of iterations actually required, our decoder architecture
would still have better throughput than those in [7,15,59].
As shown in Table 2.1, the proposed decoder architecture needs far fewer
clock cycles than those in [7,59], which are based on the layered schedule. In addition to
the reasons discussed in the introduction, the improved throughput is also attributed
to two features of the proposed architecture. First, the RMAG reduces the number
of cycles used in the shuffled sorting and ensures that the path constructor always
has some $CM^{(k)}_{c,l}$ available without waiting. Second, as shown in Fig. 2.10, part of the
path construction and the shuffled sorting process can be carried out simultaneously to
increase the throughput.
The proposed (837, 726) (Nb = 4185 bits) NB-LDPC decoder achieves a throughput
of 64.3 Mb/s and has a 13.28 mm² silicon area under 130nm technology, while
a (4608, 4096) binary LDPC decoder [65] achieves a throughput of 2.1 Gb/s and
has a 1.92 mm² silicon area under 65nm technology. This comparison indicates that
the decoding complexity and decoder architectures for NB-LDPC codes need to be
further improved.
2.5 Conclusion
In this chapter, we propose the shuffled and modified shuffled schedules of the Min-Max
decoding algorithm for NB-LDPC codes. Both the shuffled and modified shuffled
schedules have a slightly better error performance and converge faster than the
flooding schedule. By significantly reducing the complexity of check node processing,
the modified shuffled schedule leads to higher throughput and smaller memory
requirement while resulting in negligible degradation in error performance and
convergence speed. Moreover, an efficient shuffled decoder architecture based on the modified
shuffled schedule for Quasi-Cyclic (QC) LDPC codes is also presented. With the
improved CNU and VNU, the proposed decoder architecture needs far fewer clock
cycles to decode a received word than state-of-the-art designs. The implementation
of an (837, 726) LDPC decoder over GF(32) demonstrates the efficiency of
the proposed architecture.
Chapter 3
An Efficient Fully Parallel
Decoder Architecture for
Non-binary LDPC Codes
3.1 Introduction
Binary low-density parity-check (LDPC) codes are increasingly popular in
applications because of their capacity-approaching performance. In terms of
performance, however, binary LDPC codes start to show their weaknesses when the codeword
length is small or moderate, or when a higher order modulation is used. For these cases,
nonbinary LDPC (NB-LDPC) codes over high order Galois fields have shown great
potential [3, 4]. For instance, in [1], a rate-1/2 NB-LDPC code of length 84 over
GF(64) is shown to perform 0.375dB better than a rate-1/2 binary irregular LDPC
code of equivalent length 504 bits in [5] over the binary-input additive white Gaussian
noise (AWGN) channel. Over QAM-AWGN channels, NB-LDPC codes with
a field order greater than or equal to the size of the constellation have the advantage
that the encoder/decoder works directly with symbols. All mapping choices of the
codeword symbols to the constellation points are equivalent and lead to the same
performance. In [1], an NB-LDPC code over GF(256) performs 0.5dB better than a
rate-1/2 binary LDPC code of the equivalent length 1008 bits over the QAM256-AWGN
channel.
A significant obstacle to the application of NB-LDPC codes is that their de-
coding algorithms have high complexities. Hence, a lot of research effort has been
spent on efficient decoding algorithms for NB-LDPC codes [1,6]. Among them, the
EMS [1] and the Min-Max [6] algorithms draw a lot of attention because of their low
computation and memory complexity. For an NB-LDPC code over GF($2^m$), both
the EMS and the Min-Max algorithms store only the nm ($n_m \ll 2^m$) most reliable
messages, thus reducing the memory requirement at the cost of small performance
degradation. The check node processing of the EMS algorithm needs additions and
comparisons, while the Min-Max algorithm needs only maximizations and compar-
isons in check node processing. The trellis-based check node processing (TBCP)
algorithm [7] reduces the computational complexity of check node processing of the
Min-Max algorithm by eliminating unnecessary check-to-variable messages.
Besides the Min-Max and EMS decoding algorithms, stochastic decoding [2,8,9]
is another way to reduce the hardware complexity of NB-LDPC decoders while
maintaining the decoding performance. Compared to conventional belief propagation
decoding algorithms, the stochastic decoding algorithm has lower hardware
complexity [9]. The relaxed half-stochastic decoding algorithm optimized for NB-LDPC
codes with variable node degree 2, called the RD2 algorithm, was proposed
in [8]. The RD2 decoding algorithm reduces the decoding complexity by significantly
reducing the number of real multiplications. An improved version of the RD2
algorithm, called the NoX decoding algorithm [2], was proposed to further reduce the
computational complexity.
Recently, a considerable amount of research effort has been spent on efficient
decoder architectures for NB-LDPC codes [9–17]. Existing NB-LDPC decoders still
suffer from low throughput and large hardware complexity. For example, a (248,
124) NB-LDPC decoder over GF(32) [15] achieves a throughput of 47.69 Mb/s at the
cost of 10.33 mm² silicon area under 90nm technology. An (837, 726) NB-LDPC
decoder [16] over GF(32) achieves a throughput of 60Mb/s at the cost of 1.29M
standard NAND gates using the 180nm technology. The FPGA implementation of
a (192, 96) stochastic AMSA decoder [9] over GF(256) achieves a throughput of
64Mb/s at the frequency of 108MHz.
In this chapter, several improvements are proposed to reduce both the mem-
ory requirements and computational complexities of the Min-Max algorithm. A
fully parallel decoder architecture based on the proposed decoding algorithm is also
proposed. The main contributions of this chapter are as follows:
1. Based on the Min-Max algorithm, a reduced memory complexity trellis based
check node processing (RTBCP) algorithm is proposed.
2. A simplified algorithm is proposed to reduce the computational complexity
of variable node processing (VNP). As a result, compared with the RHS algorithm
in [2], a stochastic decoder, the proposed decoding algorithm needs
fewer real multiplications but more real comparisons and finite field additions.
3. For each a priori message, all LLRs except several most reliable ones are
approximated with a linear function. Two kinds of low complexity LLR generation
units (LGUs) are also proposed for the approximation of the check-to-variable
(c-to-v) LLR and a priori LLR, respectively. With a 5-bit quantization scheme
and nm = 32, the areas of the two LGUs are 10.7% and 13.3%, respectively, of
that of an SRAM which stores an LLR vector under a 90nm CMOS technology.
A similar approach was proposed in [52] to approximate a priori LLRs.
The main differences between our work and that in [52] are as follows:
• Besides the approximation of channel LLR vectors, we also approximate
check-to-variable LLR vectors.
• A simplified variable node processing (SVNP) algorithm is proposed to
compensate for the performance degradation caused by LLR approximation.
4. A parallel check node unit (CNU) and a low-latency variable node unit (VNU)
are proposed. Based on the proposed CNU and VNU, an efficient fully parallel
decoder architecture is also proposed. A fully parallel NB-LDPC decoder
over GF(256) is implemented with 28nm CMOS technology. The decoder
achieves a throughput of 546Mb/s and an energy efficiency of
0.178nJ/b/iter.
Since routing congestion tends to be challenging for fully parallel LDPC decoder
architectures, the proposed decoder architecture is not suitable for very long LDPC
codes. The proposed fully parallel decoder architecture is particularly advantageous
for NB-LDPC codes over large fields, since the memory reduction will be more
significant when nm is large.
The rest of this chapter is organized as follows. Section 3.2 reviews the TBCP
algorithm. The RTBCP algorithm as well as the simplified variable node processing
algorithm is proposed in Section 3.3. The parallel CNU architecture, the low-latency
VNU architecture and the fully parallel decoder architecture are proposed in Sec-
tion 3.4. The implementation results and comparisons are presented in Section 3.5.
The conclusions are drawn in Section 3.6.
3.2 TBCP algorithm
3.2.1 Trellis based check node processing algorithm
Let GF($2^m$) be a Galois field with $2^m$ elements. Let $H = \{h_{i,j}\}$ be a sparse parity
check matrix over GF($2^m$) with M rows and N columns. We focus on regular nonbinary
LDPC codes, and hence assume H has constant row and column weights and
is an array of sparse circulants over GF($2^m$). Consider a check node c and a variable
node v in the Tanner graph defined by H. Let ε(v) denote the set of check nodes
adjacent to v, and τ(c) the set of variable nodes adjacent to c. For $a \in$ GF($2^m$), let
$L_v(a)$ be the a priori information of the variable node v concerning the symbol a.
For an LDPC code over GF($2^m$), check node processing is the most complex
part of the Min-Max algorithm [6,7]. The forward-backward approach [6] is widely
used in check node processing. However, both memory complexity and latency
are high for the forward-backward approach when the check node degree is high.
In [7], a TBCP algorithm is proposed to reduce the memory required by check
node processing. Let dc be the check node degree of a check node c. The truncated
variable-to-check (v-to-c) message from a variable node v to a check node c
is $(\phi_{v,c}, \phi^f_{v,c})$, where $\phi_{v,c}$ is an nm-dimensional LLR vector and $\phi^f_{v,c}$ is the
associated nm-dimensional Galois field symbol vector. For the Min-Max algorithm, the first
element of $\phi_{v,c}$, $\phi_{v,c}(0)$, is always zero. The TBCP algorithm [7] first merges the
incoming dc(nm − 1) nonzero LLRs in non-decreasing order as $x_c(0), x_c(1), \cdots$. Their
associated Galois field symbols are $\alpha_c(0), \alpha_c(1), \cdots$, and they belong to variable
nodes with indices $e_c(0), e_c(1), \cdots$. For 1 ≤ X ≤ dc(nm − 1), a truncated message
vector $m_c = \{m_c(0), m_c(1), \cdots, m_c(X-1)\}$, where $m_c(i) = (x_c(i), \alpha_c(i), e_c(i))$ for
$i = 0, 1, \cdots, X-1$, is used by the path construction (PC) algorithm [7], shown in
Algorithm 6, to compute the updated c-to-v message $(\rho_{c,v}, \rho^f_{c,v})$.
Algorithm 6: Path Construction Algorithm [7]
input : $m_c(i)$, $i = 0, 1, \cdots, X-1$; $z_c$; $\alpha_{sum}$
output: $\rho_{c,v}(j)$, $\rho^f_{c,v}(j)$ for $j = 0, \cdots, n_m - 1$
Initialization:
    $\rho_{c,v}(0) = 0$, $\rho^f_{c,v}(0) = \alpha_{sum} \oplus z_c(v)$, i = 0, cnt = 1, $P_{c,0} = [0, 0, \cdots, 0]$
while cnt < nm do
    if $e_c(i) \neq v$ then
        j = cnt
        for k = 0 to j − 1 do
            $\alpha = \rho^f_{c,v}(k) \oplus z_c(e_c(i)) \oplus \alpha_c(i)$
            if $P_{c,k}(e_c(i)) \neq 1$ and $\alpha \notin \rho^f_{c,t}$ then
                $\rho_{c,v}(cnt) = x_c(i)$; $\rho^f_{c,v}(cnt) = \alpha$
                $P_{c,cnt}(e_c(i)) = 1$
                $P_{c,cnt}(s) = P_{c,k}(s)$ for $s \neq e_c(i)$
                cnt = cnt + 1
    i = i + 1
As shown in Algorithm 6, $z_c(v) = \phi^f_{v,c}(0)$ for $v \in \tau(c) = \{v_0, v_1, \cdots, v_{d_c-1}\}$ and
$\alpha_{sum} = \sum_{v=v_0}^{v_{d_c-1}} z_c(v)$. Algorithm 6 assumes a check node c and deals with a variable
node $v \in \tau(c)$. $P_{c,cnt}$, which stores the constructed path, is a dc-dimensional vector
over GF(2). ⊕ denotes addition over the Galois field GF($2^m$). The constructed nm
LLRs are picked from the sorted list $x_c(i)$ and stored in $\rho_{c,v}$. Their associated
Galois field symbols are stored in $\rho^f_{c,v}$. X is set to 1.5nm [7] so that at least nm
LLRs are generated for $\rho_{c,v}$. Let L be the number of times that the code within
the for loop (lines 6 to 11) in Algorithm 6 is executed. In [7], L = 2nm.
The TBCP algorithm [7] computes all nm elements of a c-to-v message serially.
First, the LLR and corresponding Galois field symbol of the most reliable element are
computed. The corresponding path information is also stored. Based on the path
information of the first element, the TBCP algorithm computes the LLR and Galois
field symbol of the second most reliable element. The third element is computed
based on the path information of the previous two elements. The process repeats
until all nm elements of a c-to-v message are computed. The LLR vector of a c-to-v
message is sorted in non-decreasing order once all nm elements are computed.
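The serial construction in Algorithm 6 can be sketched in Python as follows. This is a behavioral illustration under stated assumptions, not the implementation in [7]: GF($2^m$) symbols are represented as integers so that field addition is a bitwise XOR, each path is kept as a set of deviated variable-node indices instead of a binary vector, and two guards that the pseudocode leaves implicit are added (stop when the input list is exhausted, stop as soon as nm entries exist). The function name and the toy inputs are hypothetical.

```python
def path_construction(m_c, z_c, v, n_m):
    """Sketch of Algorithm 6 for check node c and target variable node v.

    m_c : list of (llr, symbol, var_index) tuples sorted by non-decreasing llr
    z_c : z_c[u] is the most reliable symbol of variable node u
    """
    alpha_sum = 0
    for z in z_c:
        alpha_sum ^= z                        # GF(2^m) addition as XOR
    rho = [0.0]                               # rho_{c,v}(0) = 0
    rho_f = [alpha_sum ^ z_c[v]]              # rho^f_{c,v}(0)
    paths = [frozenset()]                     # P_{c,0}: no deviation yet
    i = 0
    while len(rho) < n_m and i < len(m_c):    # added guard on i
        x, alpha_i, e = m_c[i]
        if e != v:
            j = len(rho)                      # j = cnt, fixed for this entry
            for k in range(j):
                if len(rho) == n_m:           # added guard: output is full
                    break
                alpha = rho_f[k] ^ z_c[e] ^ alpha_i
                if e not in paths[k] and alpha not in rho_f:
                    rho.append(x)
                    rho_f.append(alpha)
                    paths.append(paths[k] | {e})
        i += 1
    return rho, rho_f


# Toy input (hypothetical): d_c = 3, target variable node 0, n_m = 4.
example = [(0.5, 5, 1), (0.7, 6, 0), (0.9, 5, 2), (1.1, 7, 1)]
print(path_construction(example, z_c=[1, 2, 3], v=0, n_m=4))
```

In this toy run the entry belonging to variable node 0 is skipped, and the third input entry extends both existing paths, which is why the same LLR value appears twice in the output.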
3.3 Improved Decoding Algorithm for NB-LDPC
Codes
3.3.1 RTBCP algorithm
In this section, a reduced complexity trellis based check node processing (RTBCP)
algorithm is proposed to further reduce the complexity of check node processing.
For a check node c, by observing the X LLR magnitudes of $m_c$, it is found that these
LLRs come from the first few elements of all dc connected v-to-c messages. As a
result, one can store only nv (nv < nm) elements for each truncated v-to-c message.
Usually, nv should satisfy dc·nv > X, where dc is the check node degree. The
exact value of nv can be determined by performance simulation. When dc is large,
nv could be much smaller than nm. Since only nv elements are needed, the memory
for v-to-c messages can be further reduced. Moreover, a length-nm parallel
sorter [10] is normally used to store the nm minimum LLRs and sort them in non-decreasing
order. Only a length-nv parallel sorter is needed if only nv elements are stored. Thus,
the overall hardware cost is further reduced.
Figure 3.1: LLR evolution (vertical axis: $r_{c,v}(j)$; horizontal axis: index j; curves: iterations 1 to 4)
By studying the magnitudes of the truncated nm LLRs of a c-to-v message, it is
found that the LLR vector can be approximated using a piece-wise linear function.
In Fig. 3.1, we plot the LLR evolution of a randomly picked c-to-v message during
the decoding of a (110, 88) QC-LDPC code over GF(256) [66] over the BPSK-AWGN
channel. The signal to noise ratio (SNR) is 3.9dB. These LLRs are computed using
the PC algorithm in [7] with nm = 32. As shown in Fig. 3.1, these LLRs demonstrate
piece-wise linearity during each iteration (a similar behavior is also demonstrated
by the LLRs for other SNR values).
Figure 3.2: LLR approximation when nm = 32 (vertical axis: LLR magnitude; horizontal axis: index j; curves: LLRs during iteration 3, approximated LLRs with nc = 0 and β = 1.25, approximated LLRs with nc = 4, η = 1 and β = 1.25)
Since the nm LLRs of $\rho_{c,v}$ are generated serially in Algorithm 6, we propose to
store only $\theta_{c,v} = \rho_{c,v}(n_m - 1)$ and $M_{c,v} = \rho_{c,v}(n_c)$ during the while loop in
Algorithm 6, where nc is a parameter that can be determined by performance simulation.
Once the while loop is finished, the approximated LLR vector is computed with
the piece-wise linear function shown in Eq. (3.1), where η and β are scaling parameters
that make the piece-wise interpolation more accurate. As shown in Fig. 3.2, we
approximate all the elements of an LLR vector using a piece-wise linear function:
$$\hat{\rho}_{c,v}(j) = \begin{cases} \eta \dfrac{M_{c,v}}{F(n_c)}\, j, & j \le n_c \\ M_{c,v} + \beta \dfrac{\theta_{c,v} - M_{c,v}}{F(n_m)}\, (j - n_c), & j > n_c, \end{cases} \tag{3.1}$$
where $F(x) = 2^{\lceil \log_2 x \rceil}$. Note that the divisions in Eq. (3.1) can be implemented
with bit shifting. If nc = 0, then all LLRs are approximated with a linear function.
When an LLR element is needed, it is computed using Eq. (3.1). This reduces the
memory required to store the LLR vector for c-to-v messages.
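A floating-point sketch of Eq. (3.1) in Python is given below for illustration. The function names `pow2_ceil` and `approx_llr` and the sample values of $M_{c,v}$ and $\theta_{c,v}$ are hypothetical; the parameters nm = 32, nc = 4, η = 1 and β = 1.25 are those quoted for Fig. 3.2. The case nc = 0 is handled separately as an assumption of this sketch, since F(0) is not defined and the only index with j ≤ 0 is j = 0, whose LLR is zero.

```python
def pow2_ceil(x):
    """F(x) = 2**ceil(log2(x)) for an integer x >= 1, computed without floats."""
    return 1 << (x - 1).bit_length()


def approx_llr(j, M, theta, n_c, n_m, eta, beta):
    """Piece-wise linear approximation of rho_{c,v}(j) following Eq. (3.1)."""
    if j <= n_c:
        if n_c == 0:              # only j = 0 reaches this branch
            return 0.0
        return eta * M / pow2_ceil(n_c) * j
    return M + beta * (theta - M) / pow2_ceil(n_m) * (j - n_c)


# Hypothetical stored values M = rho(n_c) = 1.0 and theta = rho(n_m - 1) = 3.0.
for j in (0, 4, 31):
    print(j, approx_llr(j, M=1.0, theta=3.0, n_c=4, n_m=32, eta=1.0, beta=1.25))
```

Because F(·) is always a power of two, each division in a fixed-point datapath reduces to a right shift, which is the property the text relies on.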
Based on the above discussions, an improved path construction (IPC) algorithm
is also proposed as part of the RTBCP algorithm. The IPC algorithm differs from
the PC algorithm in [7] in two respects. First, the if block in Algorithm 6 (lines 7 to
11) is changed to that shown in Algorithm 7, where each path $p_{c,cnt}$ is represented
by an integer index instead of a dc-dimensional binary vector. In addition, $p_{c,0} = d_c$ for
the proposed IPC algorithm. In a hardware implementation of the proposed IPC
algorithm, only t bits are needed to store an integer index, where $t = \log_2 d_c + 1$ if dc
is a power of 2, and $t = \lceil \log_2 d_c \rceil$ otherwise. The process of
deciding whether $x_c(i)$ is an element of $\rho_{c,v}$ is also simplified. Moreover, the number
of elements in the truncated message vector $m_c$ used in Algorithm 6, X, is reduced
to nm. For the proposed IPC algorithm, the number of loop executions in Algorithm 6, L,
can be smaller than 2nm. Second, when the while loop of Algorithm 6 is finished, all nm
LLR elements are computed using Eq. (3.1).
Algorithm 7: Improved if block
if $p_{c,k} \neq e_c(i)$ and $\alpha \notin \rho^f_{c,t}$ then
    if cnt = nc then $M_{c,v} = x_c(i)$
    $\theta_{c,v} = x_c(i)$; $\rho^f_{c,v}(cnt) = \alpha$
    $p_{c,cnt} = e_c(i)$; cnt = cnt + 1
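A compact Python sketch of the IPC idea is shown below: the loop structure of the path construction is kept, but each path is a single integer and only $M_{c,v}$ and $\theta_{c,v}$ are retained instead of the full LLR vector. As before, this is an illustration under assumptions rather than the hardware algorithm: symbols are integers with XOR as field addition, the function name and toy inputs are hypothetical, and guards on the input length and output size are added.

```python
def improved_path_construction(m_c, z_c, v, n_m, n_c):
    """Sketch of the IPC algorithm; returns (rho_f, M, theta)."""
    d_c = len(z_c)
    alpha_sum = 0
    for z in z_c:
        alpha_sum ^= z                       # GF(2^m) addition as XOR
    rho_f = [alpha_sum ^ z_c[v]]
    p = [d_c]                                # p_{c,0} = d_c matches no node index
    M = 0.0                                  # rho(0) = 0 covers the case n_c = 0
    theta = 0.0
    i = 0
    while len(rho_f) < n_m and i < len(m_c):
        x, alpha_i, e = m_c[i]
        if e != v:
            for k in range(len(rho_f)):      # range is fixed at loop entry (j = cnt)
                if len(rho_f) == n_m:
                    break
                alpha = rho_f[k] ^ z_c[e] ^ alpha_i
                if p[k] != e and alpha not in rho_f:
                    if len(rho_f) == n_c:    # cnt = n_c: remember M_{c,v}
                        M = x
                    theta = x                # last accepted LLR so far
                    rho_f.append(alpha)
                    p.append(e)
        i += 1
    return rho_f, M, theta


example = [(0.5, 5, 1), (0.7, 6, 0), (0.9, 5, 2), (1.1, 7, 1)]
print(improved_path_construction(example, z_c=[1, 2, 3], v=0, n_m=4, n_c=1))
```

When the loop ends, `theta` holds the LLR of the last accepted element, which plays the role of $\rho_{c,v}(n_m - 1)$ in Eq. (3.1).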
3.3.2 LLR compression for a priori messages
Similar to the piece-wise linear approximation of the LLR vector of a c-to-v message,
part of the LLR vector of an a priori message can also be approximated by its
linear interpolation. Let $L_v = (L_v(0), L_v(1), \cdots, L_v(n_m - 1))$ be the sorted LLR
vector estimated from the channel for a variable node v. The approximated LLR
vector is $\hat{L}_v = (\hat{L}_v(0), \hat{L}_v(1), \cdots, \hat{L}_v(n_m - 1))$, where $\hat{L}_v(k) = L_v(k)$ for $k \le n_I$ and
$\hat{L}_v(k) = L_v(n_I) + \Delta_v (k - n_I)$ for $k > n_I$. Here $\Delta_v = \beta_v \frac{L_v(n_m - 1) - L_v(n_I)}{F(n_m)}$ and $\beta_v$ is a
scaling parameter. For hardware implementations, only nI (nI < nm) LLRs and $\Delta_v$
are stored.
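The compression of an a priori LLR vector can be sketched in Python as follows. The function names and the sample vector are hypothetical. The sketch keeps the entries with indices 0 to nI so that the formula can be evaluated directly; how this maps onto the count of nI stored LLRs mentioned in the text (for example, whether an always-zero first entry is left out) is not specified here and is left as an assumption.

```python
def pow2_ceil(x):
    """F(x) = 2**ceil(log2(x)) for an integer x >= 1."""
    return 1 << (x - 1).bit_length()


def compress_a_priori(L, n_i, beta_v):
    """Keep the leading entries of the sorted LLR vector L and the slope Delta_v."""
    n_m = len(L)
    delta = beta_v * (L[n_m - 1] - L[n_i]) / pow2_ceil(n_m)
    return L[: n_i + 1], delta


def approx_a_priori(kept, delta, n_i, k):
    """Return the approximated LLR for index k."""
    if k <= n_i:
        return kept[k]
    return kept[n_i] + delta * (k - n_i)


# Hypothetical sorted LLR vector with n_m = 32 entries.
L_v = [k / 8 for k in range(32)]
kept, delta = compress_a_priori(L_v, n_i=3, beta_v=1.0)
print(len(kept), delta, approx_a_priori(kept, delta, 3, 31))
```

In this example 4 values and one slope replace 32 stored LLRs, at the cost of an approximation error for the less reliable entries.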
3.3.3 Simplified variable node processing algorithm
In this chapter, a simplified variable node processing (SVNP) algorithm is proposed
in Algorithm 8, where dv is the degree of a variable node v, and $(\rho_{i,v}, \rho^f_{i,v})$ for
$i = 0, 1, \cdots, d_v - 1$ are the dv c-to-v messages sent to variable node v. In addition,
$\rho_{d_v,v} = \hat{L}_v$ and $\rho^f_{d_v,v} = \hat{L}^f_v$, where $\hat{L}^f_v$ is the Galois field symbol vector
corresponding to the a priori LLR vector $\hat{L}_v$ of variable node v. $h_{0,v}, h_{1,v}, \cdots, h_{d_v-1,v}$
are the dv nonzero Galois field symbols associated with variable node v.
The FindLLR function returns $\rho_{w,v}(k)$ such that $h_{w,v} F_{i,j} = \rho^f_{w,v}(k)$ for $0 \le w < d_v$.
For $w = d_v$, the FindLLR function returns $\rho_{w,v}(k)$ such that $F_{i,j} = \rho^f_{w,v}(k)$.
When $w < d_v$, if $h_{w,v} F_{i,j} \notin \rho^f_{w,v}$, the FindLLR function returns $\gamma\theta_{w,v}$, where γ
is a correction factor and $\theta_{w,v}$ is defined above. When $w = d_v$, if $F_{i,j} \notin \rho^f_{w,v}$,
$\theta_{w,v} = \hat{L}_v(n_m - 1)$. $l_i$ for $i = 0, 1, \cdots, d_v$ are dv + 1 integer parameters. The SORT
function sorts $R_{i,v}$, which has at most $l_{sum} = \sum_{i=0}^{d_v} l_i$ ($l_{sum} \le n_m$) LLR elements,
in non-decreasing order and stores the nv minimal LLRs and their corresponding
Galois field symbols in $\phi_{v,i}$ and $\phi^f_{v,i}$, respectively.
The proposed SVNP algorithm first serially computes at most lsum elements of a
v-to-c message. Among them, only the nv (nv < nm) most reliable elements are stored.
Thus, both the memory requirement and the computational complexity are reduced.
Algorithm 8: SVNP algorithm
input : $(\rho_{0,v}, \rho^f_{0,v}), \cdots, (\rho_{d_v,v}, \rho^f_{d_v,v})$; $l_i$, $i = 0, 1, \cdots, d_v$;
        $h_{i,v}$, $i = 0, 1, \cdots, d_v - 1$
output: $(\phi_{v,i}, \phi^f_{v,i})$, $i = 0, 1, \cdots, d_v - 1$
Initialization: $S_v = \emptyset$; $R_{i,v} = \emptyset$, $RS_{i,v} = \emptyset$, $i = 0, 1, \cdots, d_v - 1$;
for i = 0 to dv do
    for j = 0 to $l_i - 1$ do
        if i < dv then $F_{i,j} = h^{-1}_{i,v} \rho^f_{i,v}(j)$;
        else $F_{i,j} = \rho^f_{i,v}(j)$;
        if $F_{i,j} \notin S_v$ then
            push $F_{i,j}$ into $S_v$;
            for w = 0 to dv do
                $t_w$ = FindLLR($F_{i,j}$, $\rho_{w,v}$, $\rho^f_{w,v}$);
            for w = 0 to dv − 1 do
                push $((\sum_{b=0}^{d_v} t_b) - t_w)$ into $R_{w,v}$;
                push $F_{i,j}$ into $RS_{w,v}$;
for i = 0 to dv − 1 do
    $(\phi_{v,i}, \phi^f_{v,i})$ = SORT($R_{i,v}$, $RS_{i,v}$);
For the VNP algorithm in [1], all the elements of the incoming c-to-v messages are
needed for a round of variable node processing. With the SVNP algorithm, however,
only part of the elements of each incoming c-to-v message are needed for variable
node processing.
For a round of variable node processing of a variable node v, the computational
complexities of the proposed SVNP algorithm and the VNP algorithm in [1] are
dominated by real value additions and comparisons. Hence, we compare the numbers
of real additions and comparisons of these two algorithms in Table 3.1. Since nv <
nm and lsum ≤ nm, the numbers of real comparisons and additions of the VNP
algorithm in [1] are more than $(3 - \frac{4}{d_v})\frac{n_m}{n_v}$ and $3 - \frac{4}{d_v}$ times (note that
$\frac{3d_v - 4}{d_v} > 1$ when dv > 2), respectively, those of the proposed SVNP algorithm.
Table 3.1: Computational complexity comparison between the proposed SVNP and the VNP algorithm in [1]
                      [1]                             SVNP
real comparisons      $(3d_v - 4) n_m \log_2(2n_m)$   $d_v n_v \log_2 l_{sum}$
real additions        $2(3d_v - 4) n_m$               $2 d_v l_{sum}$
When dv = 2, the complexity reduction brought by the SVNP algorithm is not obvious
from Table 3.1, because (3dv − 4) = dv in that case. Under this
condition, the advantage of the SVNP algorithm depends on the specific values of
nv and lsum, which in turn are selected by performance simulation. Thus, when
dv = 2 the advantage of the SVNP algorithm may be code specific.
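To make the dv = 2 case concrete, the expressions of Table 3.1 can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and the parameter values (nm = 32, nv = 8, lsum = 2 + 2 + 10 = 14) are borrowed from the simulation settings of Section 3.3.4 as one possible operating point.

```python
import math


def vnp_ops(d_v: int, n_m: int):
    """Operation counts of the VNP algorithm in [1], per Table 3.1."""
    comparisons = (3 * d_v - 4) * n_m * math.log2(2 * n_m)
    additions = (3 * d_v - 4) * 2 * n_m
    return comparisons, additions


def svnp_ops(d_v: int, n_v: int, l_sum: int):
    """Operation counts of the proposed SVNP algorithm, per Table 3.1."""
    comparisons = d_v * n_v * math.log2(l_sum)
    additions = d_v * 2 * l_sum
    return comparisons, additions


vnp = vnp_ops(d_v=2, n_m=32)
svnp = svnp_ops(d_v=2, n_v=8, l_sum=14)
print("VNP  comparisons/additions:", vnp)
print("SVNP comparisons/additions:", svnp)
```

For this particular choice the SVNP counts are lower in both rows, but with dv = 2 the gain comes entirely from nv and lsum being smaller than nm, not from the (3dv − 4)/dv factor.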
In Table 3.2, we compare the computational complexity per decoding iteration
of the proposed improved decoding algorithm (IDA) with that of the RHS algorithm
in [2], a stochastic decoder. For the proposed IDA, the computational complexity
of the proposed IPC algorithm depends on the v-to-c messages. To be conservative,
the maximal computational complexity of the IPC algorithm is assumed. As shown
in Table 3.2, compared with the RHS algorithm, the proposed IDA needs fewer
real multiplications but more real comparisons and additions over GF(2^m). When
lsum < 2^m, the proposed IDA needs fewer real additions than the RHS algorithm.
lsum could be smaller than 2^m, as its value is determined by the decoding performance
simulation.
Table 3.2: Computational complexity comparison between the improved decoding
algorithm and the RHS algorithm in [2]
                        RHS [2]              proposed IDA
real comparisons        0                    N dv nv log2(lsum) + 2 M dc nm^2
real additions          2 N dv 2^m           2 N dv lsum
real multiplications    N (3dv − 4) 2^m      M dc (nm − 2)
GF(2^m) additions       M (2dc − 2)          2 M dc nm^2
3.3.4 Numerical results
We compare the error performance of the proposed IPC and SVNP algorithms with
that of the original PC and VNP algorithms for a (110, 88) NB-LDPC code over
GF(256) [66] with variable node degree 2 and check node degree 10 under the BPSK-
AWGN channel. In our simulations, nm = 32, the maximal number of iterations is
30, and the flooding schedule is used. The resulting bit error rate (BER) and frame
error rate (FER) are shown in Fig. 3.3 and Fig. 3.4, respectively, where PC-VNP
denotes the decoding algorithm with the original PC algorithm and variable node
processing algorithm in [1].
[Figure 3.3 (plot): BER versus SNR. Curves: PC-VNP L=64 floating; PC-(4,4,10)-8
L=64 floating; IPC0-VNP L=64 floating; IPC3-VNP L=64 floating; IPC3-(2,2,10)-8
L=64 floating; IPC3-(2,2,10)-6 L=64 floating; Min-Max floating; IPC3-(4,4,10)-8
L=64 fixed; IPC0-(4,4,10)-8 L=50 fixed; IPC0-(2,2,10)-8 L=64 fixed]
Figure 3.3: BER performance of the (110, 88) NB-LDPC code over GF(256)
[Figure 3.4 (plot): FER versus SNR for the same set of decoding algorithms as in Fig. 3.3]
Figure 3.4: FER performance of the (110, 88) NB-LDPC code over GF(256)
In Fig. 3.3, L denotes the number of loops required by the corresponding IPC
algorithm. PC-(i, j, k)-ω denotes the decoding algorithm with the original PC
algorithm and the proposed SVNP algorithm with l0 = i, l1 = j, l2 = k and nv = ω.
IPCe-VNP denotes the decoding algorithm that employs the proposed IPC algorithm
with nc = e and the VNP algorithm in [1]. IPCe-(i, j, k)-ω denotes the
decoding algorithm with the proposed IPC algorithm with nc = e and the proposed
SVNP algorithm with l0 = i, l1 = j, l2 = k and nv = ω. For IPC0, β = 1.25. For
IPC3, η = 1.25 and β = 1.75. γ = 1.25 for all simulated algorithms. For all
IPCe-(i, j, k)-ω algorithms, part of each a priori LLR message is linearly approximated
with nI = 4 and βv = 1. For fixed-point simulations, a (4,1) quantization scheme
is used, where four bits and one bit are used to represent the integer and fraction
parts of an LLR, respectively.
Based on the results shown in Fig. 3.3, several observations are made as follows:
1. The BER performance of PC-VNP and PC-(4, 4, 10)-8 is nearly the same.
The proposed SVNP algorithm does not introduce noticeable performance
degradation.
2. The IPC0-VNP decoding algorithm shows an early error floor, while the BER
performance of the IPC3-VNP is even better than that of the PC-VNP algo-
rithm.
3. The BER performance of the IPC3-(2, 2, 15)-8 is better than that of the PC-
VNP and IPC0-(2, 2, 10)-8 decoding algorithms. The decoding performance
of IPC3-(2, 2, 15)-8 is close to that of original Min-Max floating algorithm.
4. The IPC0-(2, 2, 10)-8 decoding algorithm performs better than the PC-VNP
and IPC0-VNP decoding algorithms.
5. Reducing the number of loops used by the IPC algorithm will worsen the
corresponding decoding performance.
6. For decoding algorithms that employ the SVNP algorithm, nv should be large
enough to maintain decoding performance.
Both the IPC3-VNP and IPC3-(2, 2, 15)-8 decoding algorithms perform better
than the PC-VNP algorithm. Since the proposed LLR approximation can also be
viewed as a non-uniform scaling technique, the improved decoding performance can
be attributed to the non-uniform scaling of each element in an LLR vector. As is
well known, for binary LDPC codes, the scaling technique has been used to improve
the decoding performance of the Min-Sum algorithm [67].
The early error floor of the IPC0-VNP decoding algorithm may come from the
inaccurate LLR approximation of the IPC0 algorithm. The performance of the
IPC0-VNP algorithm is improved when the VNP algorithm is replaced with the
proposed SVNP algorithm. On the other hand, the performance of the IPC3-VNP
algorithm and that of the IPC3-(2, 2, 15)-8 algorithm are almost the same. With a
proper value for each li, it seems that the SVNP algorithm is less sensitive to the
LLR deviation caused by the LLR approximation of the IPC0 algorithm. A conceptual
explanation is as follows.
Take the IPC0-(2, 2, 10)-8 decoding algorithm as an example, where l0 = 2,
l1 = 2 and l2 = 10. It is possible that the transmitted code symbol for variable node v
lies in the first several elements of $\hat{L}^f_v$, which is the Galois field symbol
vector corresponding to the sorted LLR vector $\hat{L}_v$. For the IPC0 algorithm, the LLR
approximation is not accurate enough. During variable node processing, however, the
SVNP algorithm considers only the few most reliable symbols and their associated LLRs
from each c-to-v message sent to variable node v. For these LLRs, the approximation
error tends to be small. At the same time, the SVNP algorithm considers more
symbols from $\hat{L}^f_v$. In this way, symbols with poorly approximated LLRs, which may
degrade the decoding performance, are excluded from variable node processing.
The proposed IPC and SVNP algorithms are also applied to a (372, 248) (dv = 4)
quasi-cyclic NB-LDPC (QC-NB-LDPC) code over GF(32) [68]. For both PC and
IPC algorithms, nm = 8. For the IPC0-(1, 1, 1, 1, 6)-5 and IPC1-(1, 1, 1, 1, 6)-5
algorithms, nv = 5 and nI = 4. The BER and FER performance is shown in Fig. 3.5
and Fig. 3.6, respectively. As shown in Fig. 3.5, the BER performance of the
IPC0-(1, 1, 1, 1, 6)-5 and IPC1-(1, 1, 1, 1, 6)-5 algorithms is nearly the same as
that of the PC-VNP algorithm. Compared to the original Min-Max floating algorithm,
IPC1-(1, 1, 1, 1, 6)-5 has a 0.1 dB performance degradation. For the fixed point
simulation, a (4,1) quantization scheme is used, where four bits and one bit are used
to represent the integer and fraction parts of an LLR, respectively. For the
(372, 248) code, the IPC0-VNP algorithm does not show an early error floor.
[Figure 3.5 (plot): BER versus SNR. Curves: PC-VNP; IPC0-(1,1,1,1,6)-5 L=16
floating; IPC1-(1,1,1,1,6)-5 L=16 floating; IPC1-(1,1,1,1,6)-3 L=16 floating;
Min-Max floating; IPC0-(1,1,1,1,6)-5 L=16 fixed (4,2); IPC1-(1,1,1,1,6)-5 L=16
fixed (4,2); IPC0-VNP L=16 floating]
Figure 3.5: BER performance of the (372, 248) NB-LDPC code over GF(32)
3.4 Fully Parallel Decoder Architecture
3.4.1 Top decoder architecture
Suppose there are M rows and N columns in the parity-check matrix. As shown in
Fig. 3.7, the proposed decoder architecture employs M CNUs and N VNUs. The
main characteristics of the proposed fully parallel decoder architecture are as follows:
1. Check node processing and variable node processing are interleaved. During
check (variable, respectively) node processing, all rows (columns, respectively)
of the parity check matrix are processed simultaneously.
2. For both c-to-v and v-to-c messages, each message element is transmitted in
serial via the CNU-to-VNU (C2V) or VNU-to-CNU (V2C) interconnection
networks, respectively. The C2V and V2C networks can be hard-wired
connections between CNUs and VNUs. If multiple quasi-cyclic NB-LDPC (QC-NB-LDPC)
codes need to be supported, barrel shifters can be used instead.
3. For each c-to-v message, the LLR vector, which has nm LLR elements, is not
stored. Instead, only one or two LLRs are stored, and the others are computed
on-the-fly.
4. The proposed fully parallel decoder architecture is suitable for moderate or
short length (around 10^3 bits) NB-LDPC codes over large fields.
[Figure 3.6 (plot): FER versus SNR. Curves: PC-VNP floating; IPC0-(1,1,1,1,6)-5
L=16 floating; IPC1-(1,1,1,1,6)-5 L=16 floating; IPC1-(1,1,1,1,6)-3 L=16 floating;
Min-Max floating; IPC0-(1,1,1,1,6)-5 L=16 fixed (4,2); IPC1-(1,1,1,1,6)-5 L=16
fixed (4,2); IPC0-VNP L=16 floating]
Figure 3.6: FER performance of the (372, 248) NB-LDPC code over GF(32)
Figure 3.7: Proposed fully parallel decoder architecture
3.4.2 Parallel CNU architecture
In this chapter, a parallel check node unit (CNU) is proposed for the fully
parallel decoder architecture. The top architecture of the proposed CNU is shown
in Fig. 3.8, where m is the bit width of a Galois field symbol, p is the number of
quantization bits, r = ⌈log2 dv⌉ + p is the number of quantization bits of the
LLRs sent to the CNU, and t is the bit width of the index. For a check node c, the
proposed CNU is capable of computing all the corresponding c-to-v messages, which
are sent from c, in parallel. For each c-to-v message, the nm elements are computed
in serial.
Figure 3.8: Parallel CNU architecture
As shown in Fig. 3.8, the parallel CNU for a check node c consists of dc parallel
sorters (PS), dc path constructors (PC), a minimal LLR finder (MLF) and a truncated
message register group (TMR). The proposed parallel CNU engages in both
check node processing and variable node processing. During variable node processing,
dc v-to-c messages are sent to the CNU in parallel. Take PSi as an example: it
receives each element of the incoming v-to-c message in serial. Meanwhile, PSi sorts
out the nv minimum LLRs and their corresponding Galois field symbols from the
received v-to-c message. During check node processing, the MLF unit computes the
truncated message vector mc. The X elements of mc are computed in serial and
stored in the TMR. At the same time, the dc path constructors compute dc updated
c-to-v messages in parallel based on the proposed IPC algorithm. The computation
of mc and that of the dc c-to-v messages are performed at the same time.
The architecture of the parallel sorter PSi, shown in Fig. 3.9, is similar to that
in [10]. The nv sorted LLRs and their corresponding Galois field symbols are stored
in LLR registers (LR) and symbol registers (SR), respectively. As shown in Fig. 3.9,
the mode signal configures the function of the PS. During variable node processing,
the PS acts as the parallel sorter in [10]. However, the PS acts as a shift register
group during check node processing. During check node processing, if shifti is
enabled, the shifting operations of LRi and SRi are defined as $L_{i,j} = L_{i,j+1}$ and
$S_{i,j} = S_{i,j+1}$, respectively, for $j = 1, 2, \cdots, n_v - 2$.
Figure 3.9: Parallel sorter architecture
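The two operating modes of the parallel sorter can be mimicked in software. The following Python sketch is a behavioral model only, not the register-level design of [10]: the class name and methods are hypothetical, and a binary search replaces the parallel comparators that hardware would use to locate the insertion position.

```python
import bisect


class ParallelSorterModel:
    """Behavioral model of one PS: keeps the n_v smallest LLRs seen so far
    (LR registers) together with their symbols (SR registers)."""

    def __init__(self, n_v: int):
        self.n_v = n_v
        self.lr = []   # sorted LLRs, non-decreasing
        self.sr = []   # symbols aligned with self.lr

    def push(self, llr: float, sym: int) -> None:
        """Sorting mode (variable node processing): accept one serial element."""
        pos = bisect.bisect_right(self.lr, llr)
        if pos >= self.n_v:
            return                      # larger than all kept entries: dropped
        self.lr.insert(pos, llr)
        self.sr.insert(pos, sym)
        del self.lr[self.n_v:]          # discard the overflow entry, if any
        del self.sr[self.n_v:]

    def shift(self):
        """Shift mode (check node processing): drop the head entry so that
        every remaining entry moves one position toward the output."""
        return self.lr.pop(0), self.sr.pop(0)


ps = ParallelSorterModel(n_v=3)
for llr, sym in [(3.0, 1), (1.0, 2), (2.0, 3), (0.5, 4)]:
    ps.push(llr, sym)
print(ps.lr, ps.sr)
head = ps.shift()
print(head, ps.lr, ps.sr)
```

Only nv entries are ever held, so the storage of the model does not grow with the length of the incoming message.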
The minimal LLR finder (MLF) unit computes the minimal LLR, the corresponding
Galois field symbol and the index based on its input messages. For the case
of 8 input messages, the macro architecture of the MLF unit is shown
in Fig. 3.10. Each input message Ii in Fig. 3.10 consists of three parts: the LLR
LOi, the corresponding Galois field symbol SOi, and the associated index idxi. The
MLF unit outputs the input message with the smallest LLR. The MLF unit is a
tree of compare-and-select (CAS) units. Each CAS unit has two input messages and
outputs the one with the smaller LLR.
Figure 3.10: The macro architecture of the MLF when the number of inputs is 8
It takes X cycles for the proposed MLF to compute all X elements of the truncated
message vector mc. Once the minimal nv LLRs are sorted out, mc(0) =
(xc(0), αc(0), ec(0)) is computed by the MLF unit. Meanwhile, if the output index
ec(0) = k (0 ≤ k ≤ dc − 1), then shiftk is enabled to shift both LRk and SRk
by one step at the next clock cycle. Once LRk and SRk have been shifted, the MLF unit
computes mc(1) and generates the corresponding shift signal. This repeats until all
X elements of mc are generated. Once an element of mc is computed, it is stored
in the TMR, which consists of nm (p + m + t)-bit registers. The truncated message
element mc(i) is stored in the i-th location. All the PC units within a CNU can
access the message stored in any location of the TMR simultaneously.
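The interaction between the CAS tree and the shift signals amounts to a serial merge of the dc sorted lists. The Python sketch below illustrates that behavior under stated assumptions: the function names are hypothetical, ties are resolved in favor of the lower index, and every list entry (including the head of each list) is merged, whereas the exact content of mc is fixed by the IPC algorithm defined earlier in the chapter.

```python
from typing import List, Tuple

Entry = Tuple[float, int, int]   # (LLR, Galois field symbol, index)
INF = float("inf")


def cas(a: Entry, b: Entry) -> Entry:
    """Compare-and-select: forward the entry with the smaller LLR."""
    return a if a[0] <= b[0] else b


def mlf(inputs: List[Entry]) -> Entry:
    """Reduce the inputs with a tree of CAS units."""
    level = list(inputs)
    while len(level) > 1:
        nxt = [cas(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])        # odd entry passes to the next level
        level = nxt
    return level[0]


def truncated_message(sorted_lists: List[List[Tuple[float, int]]], x: int) -> List[Entry]:
    """Generate up to x elements of m_c, one per modeled clock cycle."""
    queues = [list(q) for q in sorted_lists]
    m_c = []
    for _ in range(x):
        heads = [(q[0][0], q[0][1], k) if q else (INF, -1, k)
                 for k, q in enumerate(queues)]
        best = mlf(heads)
        if best[0] == INF:
            break                        # all lists exhausted
        m_c.append(best)
        queues[best[2]].pop(0)           # shift_k: advance the winning list
    return m_c


lists = [[(0.0, 3), (2.0, 5)], [(0.0, 1), (1.0, 4)], [(0.5, 7), (3.0, 2)]]
m_c = truncated_message(lists, x=4)
print(m_c)
```

Because one element is produced per cycle and only the winning list is shifted, X elements are available after X modeled cycles, matching the timing described above.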
The path constructor unit computes the updated c-to-v message $(\rho_{c,v}, \rho^f_{c,v})$ based
on the proposed IPC algorithm. Take PCi as an example. The architecture of the
proposed PC unit is shown in Fig. 3.11, where $sum = \sum_{i=0}^{d_c-1} S_{i,0}$, and s = ⌈log2 nm⌉
is the width of the write and read addresses for the symbol register group (SRG) with
nm total locations. During the computing of the first element of a c-to-v message,
the initial load (IL) signal equals 1 and $ZS = S_{i,0}$. For the computing of other
elements, $ZS = S_{I,0}$, where $I = e_c(j_i)$, and $j_i$ is the location index of the truncated
message element in the TMR that is read by PCi. The index register group (IRG),
which contains nm t-bit registers, stores nm paths: $p_{i,0}, p_{i,1}, \cdots, p_{i,n_m-1}$. Each path
is represented by a t-bit integer as shown in Algorithm 7. The SRG, which contains
nm m-bit registers, stores the Galois field symbol vector $\rho^f_{c,v}$. The SRG also tests
whether the input symbol (symIni) has already been stored. The proposed PC
unit reads the truncated messages from the TMR and computes the corresponding
updated c-to-v message in 2nm cycles. Compared to the path constructor in [7], the
hardware complexity of the proposed PC unit is reduced for several reasons:
1. The memory used to store all nm paths is reduced. In [7], nm·dc bits are needed
to store all nm paths. However, the proposed PC needs only nm·t bits, where
t = log2 dc + 1 if dc is a power of 2, and t = ⌈log2 dc⌉ if dc is not a power of 2.
2. It is easier to determine whether xc(ji) is an LLR element of the updated
c-to-v message. As shown in Fig. 3.11, it only needs to compare whether two
indices are the same. In contrast, the PC unit in [7] needs to first encode the
t-bit index into a dc-bit binary sequence and then do the bit-test operation.
3. Due to LLR approximation, the proposed PC does not need an LLR RAM to
store the LLR vector of the updated c-to-v message. The PC shown in Fig. 3.11
adopts the approximation scheme with nc = 0. Thus, only the maximal LLR
θc,v is stored in the p-bit register, MR, shown in Fig. 3.11.
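The first point can be quantified with a few lines of Python. This is an illustrative calculation, with hypothetical function names; the values nm = 32 and dc = 10 are taken from the (110, 88) code simulated in Section 3.3.4.

```python
def index_width(d_c: int) -> int:
    """Bit width t of a path index, following the rule stated in item 1."""
    is_power_of_two = (d_c & (d_c - 1)) == 0
    if is_power_of_two:
        return (d_c.bit_length() - 1) + 1    # log2(d_c) + 1
    return (d_c - 1).bit_length()            # ceil(log2(d_c))


def path_memory_bits(n_m: int, d_c: int):
    """Return (bits needed in [7], bits needed by the proposed PC)."""
    return n_m * d_c, n_m * index_width(d_c)


bits_ref, bits_proposed = path_memory_bits(n_m=32, d_c=10)
print(bits_ref, bits_proposed)
```

For these values the path storage drops from 320 bits to 128 bits per path constructor.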
The SRG unit in CNU is moved to VNU in order to avoid the global connection
wires between different CNUs. This will be discussed in Sections 3.4.3 and 3.4.4.
Figure 3.11: Path constructor architecture
For the proposed CNU, all PC units can access any element of the TMR simultaneously.
The check node processing can be initiated once the nv LLR elements
are sorted out. Starting from mc(0), each PC unit processes at most one element of mc
per cycle, in sequence. Meanwhile, the MLF unit generates one
new element of mc during each cycle. As a result, the PC units and the MLF unit can
work simultaneously. Thus, it takes L (L ≤ 2nm) cycles to finish a round of check
node processing.
3.4.3 Low-latency VNU architecture
Many previous works [7, 15] on the NB-LDPC decoder architecture focus on the
simplification of check node processing. However, variable node processing of current
NB-LDPC decoders has not been carefully examined. The variable node processing
for one layer takes 2nm cycles for the VNUs in [7, 15]. A low latency VNP algorithm
is proposed in [69] to reduce the number of cycles from 2nm to $L_{S\text{-}VN} + n_m$, where
$L_{S\text{-}VN} < n_m$. However, the VNP algorithm in [69] works for the layered schedule.
Usually, the variable node processing for different layers is performed in serial,
which leads to reduced throughput.
In this chapter, based on the proposed SVNP algorithm, a low-latency VNU
architecture is proposed to reduce the number of cycles used by variable node
processing. The proposed low-latency VNU architecture works for the flooding or
shuffled schedule. During an iteration, it takes only lsum cycles to finish variable node
processing for each variable node. For each c-to-v message sent to a variable node
v, only a fraction of the nm elements are used in variable node processing. Likewise,
only a fraction of the nm elements of each channel message is needed during variable
node processing.
Figure 3.12: VNU architecture assuming dv = 2
For a variable node v, suppose the variable node degree dv = 2. The proposed
VNU architecture is shown in Fig. 3.12; the VNU architecture for other dv values is
similar. The proposed VNU computes dv temporary v-to-c messages (Rw,v, RSw,v)
for w = 0, 1, · · · , dv − 1 as shown in Algorithm 8. Each temporary v-to-c message
has at most lsum elements, which are serially sent to the corresponding parallel
sorters in the CNU. As shown in Fig. 3.12, two symbol register groups, SRG0 and
SRG1, store the Galois field symbol vectors of the c-to-v messages sent to v. Take
SRG0 as an example: the input multiplexors select cnuSymIn0 and cnuRA0 as its
Galois field symbol input and read address during check node processing. cnuSymIn0
and cnuRA0 are driven by symIn and RA, respectively, in the corresponding PC unit
shown in Fig. 3.11. The Galois field symbol vector for the a priori message
concerning v is stored in SRG2. The output sai is the address of the input Galois
field symbol in SRGi. If the input symbol has not been stored in SRGi, sai =
nm − 1.
Based on the Galois field symbol vectors stored in the SRGs, the proposed VNU
performs variable node processing. Not all nm elements of the LLR vector of a c-to-v
message are stored. Instead, at most two LLR elements are stored, owing to the use
of the IPC algorithm. During variable node processing, each LLR element is computed
on-the-fly based on Eq. (3.1). This LLR approximation approach reduces the
memory required to store c-to-v messages. The LLR generation unit (LGU) computes
the corresponding approximated LLR for c-to-v messages. The channel LLR
generation unit (CGU) computes the approximated LLR for the a priori message based
on the approximation scheme proposed in Section 3.3.2. Suppose lsum = l0 + l1 + l2
for dv = 2. Variable node processing is then carried out as follows:
1. The input multiplexors select vnuRA0 as the read address of SRG0. The
address signal, vnuRA0, goes from 0 to l0 − 1, and is increased by 1 each cycle.
The corresponding Galois field symbol output is so0. The input multiplexors
of SRG1 and SRG2 select si1,0 and si2,0 as the symbol inputs, respectively.
Here, si1,0 = so0·h1/h0 and si2,0 = so0/h0, where h0 and h1 are the non-zero
Galois field symbols in the v-th column of the parity-check matrix. The output
multiplexor of SRG0 selects vnuRA0 as the address input of LGU0.
2. The output multiplexors of SRG1 and SRG2 select sa1 and sa2 as the address
inputs of LGU1 and CGU, respectively.
3. Similar read operations are applied to SRG1 and SRG2 in serial. vnuRA1
and vnuRA2 go from 0 to l1 − 1 and from 0 to l2 − 1, respectively. The reading
behavior of the SRGs during a round of variable node processing is shown in
Fig. 3.13. In addition, si0,1 = so1·h0/h1, si0,2 = so2·h0, si1,2 = so2·h1 and si2,1 =
so1/h1, where so1 and so2 are the symbol outputs of SRG1 and SRG2, respectively.
Figure 3.13: Reading behavior of the SRGs during the variable node processing
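The symbol conversions used in the steps above (si1,0 = so0·h1/h0, si2,0 = so0/h0, and their counterparts) are plain Galois field multiplications and divisions. The following Python sketch shows them over GF(2^8). It is an illustration under assumptions: the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D) is chosen here only as a common example and is not stated in the text, and the function names and the values of h0, h1 and so0 are hypothetical.

```python
def gf_mul(a: int, b: int, poly: int = 0x11D, m: int = 8) -> int:
    """Multiply two GF(2^m) elements (shift-and-add with reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a >> m:          # degree reached m: reduce by the field polynomial
            a ^= poly
        b >>= 1
    return r


def gf_inv(a: int, poly: int = 0x11D, m: int = 8) -> int:
    """Inverse of a non-zero element, computed as a^(2^m - 2)."""
    result, base, e = 1, a, (1 << m) - 2
    while e:
        if e & 1:
            result = gf_mul(result, base, poly, m)
        base = gf_mul(base, base, poly, m)
        e >>= 1
    return result


def gf_div(a: int, b: int) -> int:
    return gf_mul(a, gf_inv(b))


def map_from_srg0(so0: int, h0: int, h1: int):
    """Symbols presented to SRG1 and SRG2 when SRG0 outputs so0."""
    si1_0 = gf_div(gf_mul(so0, h1), h0)
    si2_0 = gf_div(so0, h0)
    return si1_0, si2_0


h0, h1, so0 = 3, 7, 0x53          # illustrative values only
si1_0, si2_0 = map_from_srg0(so0, h0, h1)
print(hex(si1_0), hex(si2_0))
```

The mappings are mutually consistent: feeding si1,0 back through si0,1 = so1·h0/h1 recovers so0, which is what allows the three SRGs to be searched for the same underlying code symbol.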
Suppose nm = 32 and a 5-bit quantization scheme is used. Based on the scaling
factors proposed in Section 3.3.4, we propose two low complexity LGUs for nc = 0
and 3, respectively. The LGU for nc = 0 is shown in Fig. 3.14(a), where maxLLR
is the maximum LLR stored during check node processing. The SC0 unit first
multiplies the input by β, then divides the product by F(nm), where F(nm) is a
power of 2. The division in SC0 is just bit shifting. The output of SC0 has 7 bits
for the fraction part. In the proposed LGU, only one bit is kept for the fraction
part. The ST0 unit returns the maximal value in the quantization range if the
input is saturated. Otherwise, the output of ST0 is the same as the input. The
LGU for nc = 3, shown in Fig. 3.14(b), is more complex than that for nc = 0 and
needs both fixed-point multiplication and addition. Mc,v, shown in Algorithm 7, is
another stored LLR. When the index is no greater than nc, the SC1 unit multiplies
the input by η. Otherwise, the input of SC1 is multiplied by β. The product
is then divided by F(nc) or F(nm) using bit shifting. The functionality of ST1 is
similar to that of ST0. In addition, we also propose the CGU using the scaling
parameters from Section 3.3.4. As shown in Fig. 3.15, the architecture of the CGU is
similar to that of the LGU. The DEC unit in Fig. 3.15 generates the select signal for
the multiplexer in the CGU.
Figure 3.14: Architectures of the proposed LGUs
Figure 3.15: Architecture of the proposed CGU
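The scale-and-saturate datapath described for SC0 and ST0 can be sketched with integer arithmetic. The sketch below is an interpretation, not the circuit of Fig. 3.14: it assumes an integer-valued input, β = 1.25 represented with two fraction bits (5/4), F(nm) = 32 = 2^5, truncation when fraction bits are dropped, and a 5-bit output with one fraction bit. Under these assumptions the scaled value has 2 + 5 = 7 fraction bits, in line with the description in the text. The function name and argument names are hypothetical.

```python
def lgu0_scale(x: int, beta_q2: int = 5, log2_f: int = 5, out_bits: int = 5) -> int:
    """Return the output code in units of 0.5 (one fraction bit).

    x        : integer-valued input to SC0 (assumption)
    beta_q2  : beta with 2 fraction bits, 5 -> 1.25 (assumption)
    log2_f   : log2 of F(n_m), so dividing is a pure re-interpretation
               of the binary point (bit shifting)
    """
    prod = x * beta_q2                  # value = prod / 2^2
    frac_bits = 2 + log2_f              # after dividing by F: 7 fraction bits
    kept = prod >> (frac_bits - 1)      # SC0 output truncated to 1 fraction bit
    max_code = (1 << out_bits) - 1      # ST0: largest code in the 5-bit range
    return min(kept, max_code)


for x in (0, 100, 1000):
    code = lgu0_scale(x)
    print(x, code, code / 2)            # code / 2 is the represented LLR value
```

No multiplier array or memory lookup is required in this model, since the scaling by 5/4 reduces to a shift and an add, which is consistent with the small gate counts reported for the LGUs.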
Since nm = 32 5-bit LLRs need to be stored for the PC algorithm in [7], we
also synthesize the proposed LGUs and compare the synthesis results with those of
a (32 × 5)-bit SRAM module under a 90nm CMOS technology in Table 3.3. The
SRAM is built with a register file using a memory compiler. LGU-0 and LGU-3
denote the LGU unit with nc = 0 and nc = 3, respectively, and CGU-4 denotes
the CGU unit with nI = 4. The areas of the proposed LGUs and CGU are only
a fraction (about 11% to 16%) of that of an SRAM storing 32 LLRs while
maintaining the same clock rate.
Table 3.3: Comparisons of LGUs and CGU with a 32×5 SRAM.
                   SRAM    LGU-0   LGU-3   CGU-3
Frequency (MHz)    400     400     400     400
Area (µm^2)        7801    835     1031    1217
Gate count         2763    295     365     431
3.4.4 Decoding schedule, decoder throughput, and interconnection
The decoding schedule of the proposed fully parallel decoder is simple. Take dv = 2
as an example. At the beginning of decoding, each a priori message is loaded into
SRG2 in its corresponding VNU. The compressed LLR messages are stored in the
CGU. At the same time, each message pair, which consists of an LLR and its
related Galois field symbol, is sent to the corresponding PS unit through the V2C
interconnection network. The loading of a priori messages takes nm cycles. Check
node processing begins once the loading of a priori messages is finished. The c-to-v
messages and the compressed message vector mc are updated at the same time,
since the MLF unit generates the elements of mc at a speed higher than the speed
at which the elements of mc are consumed by the PC units. A round of check node
processing takes L cycles.
Variable node processing starts once check node processing is finished, and takes
lsum cycles. In order to improve the decoder's clock frequency, P stages of pipeline
registers are inserted in the VNU. Each element of the computed v-to-c messages is
sent to the corresponding PS unit. For each v-to-c message, only the nv minimum
LLRs and their corresponding Galois field symbols are stored in non-decreasing
order in the corresponding PS unit. As a result, it takes L + lsum cycles, plus P
cycles of pipeline latency, to finish an iteration. The throughput T of the proposed
fully parallel decoder is given by
$$T = \frac{N m R f}{N_I (L + l_{sum} + P)}, \tag{3.2}$$
where R is the code rate, f the clock rate, and NI the number of iterations.
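Eq. (3.2) is easy to evaluate numerically. In the Python sketch below, N = 110, m = 8, R = 0.8, f = 520 MHz and NI = 10 are the values listed for the GF(256) decoder in Table 3.4, while L = 50, lsum = 14 and P = 3 are illustrative assumptions chosen for this example and are not stated in the text for that implementation. The function name is hypothetical.

```python
def throughput_bps(n: int, m: int, rate: float, f_hz: float,
                   n_iter: int, l_cycles: int, l_sum: int, p_stages: int) -> float:
    """Decoder throughput in bit/s according to Eq. (3.2)."""
    cycles_per_iteration = l_cycles + l_sum + p_stages
    return n * m * rate * f_hz / (n_iter * cycles_per_iteration)


t = throughput_bps(n=110, m=8, rate=0.8, f_hz=520e6,
                   n_iter=10, l_cycles=50, l_sum=14, p_stages=3)
print(f"{t / 1e6:.1f} Mb/s")
```

The expression makes the design trade-off explicit: for a fixed code and clock rate, throughput improves only by shortening the per-iteration cycle count L + lsum + P or by reducing the number of iterations.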
Since the SRGs of each PC unit are moved to the corresponding VNU to avoid
global interconnection between different CNUs, the input control signals for the SRG
during check node processing are transmitted from CNUs to VNUs through the C2V
interconnection network. The output signal of the SRG that engages in the path
construction is transmitted through the V2C interconnection network. As shown in
Fig. 3.11, the control signals of the SRG during check node processing include the
write and read addresses, the symbol input, and the write enable signal. The output
signals that go through the V2C network are existi and soi. As a result, the bus
width of each message that goes from CNUs to VNUs is b0 = max(m + p, m + 2s + 1),
where s = ⌈log2 nm⌉ is the width of the read and write addresses. The bus width of
each message that goes from VNUs to CNUs is b1 = m + r + 1, where r = ⌈log2 dv⌉ + p
is the number of bits used to represent an output LLR of the VNU.
It is well known that the main obstacle of the fully parallel decoder architecture
for binary LDPC codes is the routing congestion [70]. However, the routing
congestion for the proposed fully parallel NB-LDPC decoder architecture is alleviated.
The routing congestion is mainly determined by the number of global
interconnection wires that connect CNUs and VNUs and by the numbers
of CNUs and VNUs instantiated in the decoder. For a binary (Nb, Mb) LDPC code
with average variable node degree dv, the fully parallel decoder in [70] needs Mb
CNUs and Nb VNUs, and the number of edges in the corresponding Tanner graph is
Nb·dv. As a result, the number of global interconnection wires is 2·Nb·dv·p, where p
is the number of quantization bits. For an NB-LDPC code over GF(2^m) with the
same code length and code rate, the proposed fully parallel decoder will be easier
to route since the numbers of CNUs and VNUs are reduced to Mb/m and Nb/m,
respectively. Hence the number of global interconnection wires is reduced to
Nb·dv·(b0 + b1)/m for the proposed decoder architecture.
It has been shown that for large fields (2^m ≥ 64), the best NB-LDPC codes decoded
with belief propagation should be ultra sparse (cyclic codes, dv = 2) [71]. The
proposed fully parallel decoder architecture is especially suitable for such codes with
moderate or short length. Suppose the length of an NB-LDPC code over GF(256) is
1024 bits. Let nm = 32, dv = 2 and p = 5, so that s = 5, r = 6, b0 = 19 and b1 = 15.
Based on the proposed fully parallel decoder architecture, the implementation of
this 1024-bit NB-LDPC decoder has
$\frac{1024}{8} \times 2 \times (19 + 15) = 8704$ global interconnection wires. On the other hand, the
fully parallel LDPC decoder architecture in [70] for a 1024-bit binary LDPC code
with dv = 2 and p = 5 has 2 × 1024 × 2 × 5 = 20480 global interconnection wires.
This comparison demonstrates that the proposed fully parallel decoder architecture
for moderate or short NB-LDPC codes over large fields is feasible.
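The wire counts in this comparison follow directly from the bus-width expressions for b0 and b1. A short Python sketch, with hypothetical function names, reproduces the arithmetic for the 1024-bit example:

```python
import math


def nb_ldpc_wires(n_bits: int, m: int, d_v: int, p: int, n_m: int) -> int:
    """Global wires of the proposed decoder: (Nb / m) * dv * (b0 + b1)."""
    s = math.ceil(math.log2(n_m))          # SRG address width
    r = math.ceil(math.log2(d_v)) + p      # width of a VNU output LLR
    b0 = max(m + p, m + 2 * s + 1)         # CNU-to-VNU bus width
    b1 = m + r + 1                         # VNU-to-CNU bus width
    return (n_bits // m) * d_v * (b0 + b1)


def binary_ldpc_wires(n_bits: int, d_v: int, p: int) -> int:
    """Global wires of the binary fully parallel decoder: 2 * Nb * dv * p."""
    return 2 * n_bits * d_v * p


nb = nb_ldpc_wires(n_bits=1024, m=8, d_v=2, p=5, n_m=32)
binary = binary_ldpc_wires(n_bits=1024, d_v=2, p=5)
print(nb, binary, nb / binary)
```

Although each non-binary edge carries a wider bus, the m-fold reduction in the number of edges dominates, which is why the total falls well below the binary figure.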
The proposed fully parallel decoder architecture not only works for QC-NB-
LDPC codes, but also works for irregular NB-LDPC codes.
3.5 Implementation Results and Comparisons
In order to demonstrate the efficiency of our proposed fully parallel decoder
architectures, we synthesize our NB-LDPC decoders using Design Compiler® Graphical
(DCG) by Synopsys. Since DCG tightens timing and area correlation between synthesis
and placement to 5% [72], the timing/area results obtained by using DCG
are very close to those produced by place and route tools, such as IC Compiler®.
For power estimation, PrimeTime® PX (PTPX) is employed. For the power
estimation flow, the SPEF file generated by DCG is used to improve the accuracy of
the power consumption results. The process of employing DCG synthesis is as
follows:
1. Step 1: Perform the normal synthesis. Based on the cell area, determine the
floorplan constraints, such as core area, pin locations, placement bounds
and so on.
2. Step 2: Perform the DCG synthesis with these floorplan constraints.
3. Step 3: If the signal congestion predicted by DCG is not acceptable, revise
the floorplan constraints and repeat Step 2. Otherwise, the DCG synthesis is
finished.
Since the results in [15] are derived from place and route, we compare our DCG
results with those in [15] in terms of energy efficiency (consumed energy per decoded
bit) and area efficiency, where
$$\text{energy efficiency} = \frac{\text{power}}{\text{throughput}} \quad \text{and} \quad \text{area efficiency} = \frac{\text{area}}{\text{throughput}}.$$
In this chapter, the (110, 88) NB-LDPC decoder is synthesized with
DCG, and the power consumption is measured at the SNR point where BER is
around 1 × 10^−6. The physical results are shown in Table 3.4, where TGR denotes
throughput to gate count ratio. Table 3.4 shows that our decoder architecture has
better area efficiency and energy efficiency than those in [15].
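The derived metrics can be recomputed from the raw entries of Table 3.4. The Python sketch below does so for the GF(256) decoder of this work (546 Mb/s, 2.57 million gates, 976 mW, 10 iterations); the function names are hypothetical, and small differences from the tabulated values are due to rounding.

```python
def tgr(throughput_mbps: float, gates_million: float) -> float:
    """Throughput to gate count ratio in Mb/s per million gates."""
    return throughput_mbps / gates_million


def energy_per_bit_nj(power_mw: float, throughput_mbps: float) -> float:
    """Energy efficiency in nJ/bit: mW / (Mb/s) = 1e-3 W / 1e6 bit/s = nJ/bit."""
    return power_mw / throughput_mbps


tgr_gf256 = tgr(546, 2.57)
energy_gf256 = energy_per_bit_nj(976, 546)
print(f"TGR: {tgr_gf256:.1f} Mb/s per million gates")
print(f"Energy: {energy_gf256:.3f} nJ/bit, {energy_gf256 / 10:.4f} nJ/bit/iteration")
```

Normalizing by throughput in this way allows decoders with different clock rates and code parameters to be compared on a per-bit basis.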
The implementation results in [14] are derived from synthesis with Design Compiler
using 180nm CMOS technology. Only the gate count (the gate count of memory
is estimated) and throughput are provided in [14]; area and power are not
available. In order to make a fair comparison, our fully parallel decoder is
also synthesized under 180nm CMOS technology, and we compare the TGR of our
decoder with those in [14]. The memory in [14] can be implemented with normal
dual port SRAM. The memory of our proposed decoder architecture can be
implemented with content addressable memory (CAM). Due to the lack of such
a memory compiler, the memories in our proposed decoders are implemented with
registers, and the gate counts of our decoder in Table 3.4 are the synthesis gate
counts. Since registers require more area than memories generated from a memory
compiler, our decoder architectures would have smaller gate counts than those
in Table 3.4 if CAM modules were used. In terms of TGR, the proposed decoder
architecture is better than those in [14].
The stochastic decoding algorithm for NB-LDPC codes is promising due to its
low hardware complexity [9]. Besides, the stochastic decoding algorithm is a good
candidate for fully parallel LDPC decoder architectures. The FPGA implementation
of a (192, 96) NB-LDPC stochastic decoder over GF(256) [9] achieves a throughput
of 65Mb/s, and it is projected [9] that a corresponding ASIC implementation can
achieve a throughput of 698Mb/s, which would be 27% higher than the throughput
of our proposed decoder on GF(256). In this chapter, however, we did not compare
our decoder architectures with that in [9] because the corresponding area, frequency
and power results under ASIC implementation are not provided in [9].
3.6 Conclusion
In this chapter, a reduced memory complexity trellis based check node processing
algorithm is proposed. An a priori message compression algorithm is also proposed
to reduce the memory requirement further. A simplified algorithm is also proposed to
reduce the complexity of variable node processing. Based on the proposed algorithms,
a fully parallel decoder architecture for NB-LDPC codes is proposed. The hardware
efficiency of the proposed fully parallel decoder architecture is much higher than those
of previous comparable decoder architectures in the open literature.
Table 3.4: Comparisons with other decoder architectures.

| | [15] | [14] | This work‡ | This work‡ |
|---|---|---|---|---|
| Code | GF(32) | GF(32) | GF(256) | GF(32) |
| Block Length | 248 | 372 | 110 | 372 |
| Code Rate | 0.55 | 0.66 | 0.8 | 0.66 |
| Process | 90nm | 180nm | 28nm | 180nm |
| Core Area (mm$^2$) | 10.33 (0.99*) | - | 1.289 | - |
| Utilization | - | - | 0.757 | - |
| NAND Gate Count | 1.92M | 0.6M† | 2.57M | 4.1M |
| Frequency (MHz) | 260 | 200 | 520 | 220 |
| Iterations | 10 | 10 | 10 | 10 |
| Throughput (Mb/s) | 47.69 (153.3*) | 66 | 546 | 982 |
| TGR (Mb/s/Million gate) | 24.9 (79.6*) | 110 | 212.4 | 239.5 |
| Power (mW) | 479 | - | 976 | - |
| Energy Efficiency (nJ/bit) | 10.06 | - | 1.78 | - |
| Energy Efficiency (nJ/bit/Iteration) | 1.006 | - | 0.178 | - |
| Area Efficiency (Mb/s/mm$^2$) | 4.62 (154.5*) | - | 423.58 | - |

†The gate count of the memories in [14] is estimated, assuming
that one memory bit takes the area of 1.5 NAND gates.
‡The implementation results of this work are not as accurate as
those obtained from a place and route tool. In this work, the
memories are implemented as registers, and the gate count of
our decoder is obtained from DCG or normal synthesis.
*These results have been normalized to 28nm for comparison.
Since the voltage of the process in [15] is not available, the
energy efficiency has not been scaled.
Chapter 4
Efficient Error Control Decoder Architectures for Noncoherent Random Linear Network Coding
4.1 Introduction
Random linear network coding (RLNC) is an efficient technique for disseminat-
ing information in networks (see, for example, [39–42]). Due to its random linear
operations, RLNC not only achieves network capacity with high probability in a
distributed manner, but also provides robustness against varying network condi-
tions [43]. Unfortunately, it is highly susceptible to errors due to noise, malicious or
malfunctioning nodes, or insufficient min-cut [44]. As a result, error control is vital
for RLNC.
Error control methods proposed for RLNC assume two transmission models. The
methods for the first model (see, for example, [45]) depend on and take advantage
of the underlying network topology or the particular linear networking operations
performed at various nodes. The methods for the other model (see, e.g., [44,46]) as-
sume that both the transmitter and the receiver have no knowledge of such channel
transfer characteristics. The two models are referred to as coherent and nonco-
herent network coding, respectively. In this chapter, we focus on error control for
noncoherent RLNC.
An error control code for noncoherent network coding [44], called a subspace
code, is a set of subspaces. Information is encoded in the choice of a subspace
spanned by a set of transmitted packets. A subspace code is called a constant-
dimension code (CDC) if all subspaces are of the same dimension. CDCs lead to
simplified network protocols due to the constant dimension. A class of asymptotically
optimal CDCs, referred to as Kötter-Kschischang (KK) codes, has been
proposed in [44]. A decoding algorithm based on interpolation for bivariate lin-
earized polynomials is also proposed for KK codes in [44]. It was shown in [46] that
KK codes correspond to lifting of Gabidulin codes, a class of optimal rank metric
codes. As a result, KK codes can be decoded by the generalized decoding algorithm
for the rank metric codes [46].
Motivated by KK codes, a new family of subspace codes, referred to as Mahdavifar-
Vardy (MV) codes in this chapter, was proposed [47–49]. List decoding, which has
been used to decode beyond the error correction diameter bound [50], can be applied
to the decoding of MV codes. Using algebraic list decoding, it was shown [49] that
MV codes can achieve a better tradeoff between rate and decoding radius than KK
codes.
Error control for RLNC comes at the expense of additional computations needed
for encoding and decoding. The complexities of existing decoding algorithms [44,
49, 51] for KK and MV codes are much higher than those of encoding, and are
hence critical to applications of RLNC. Most previous works focus on theoretical
aspects of network coding. For example, the decoding complexities of KK and MV
codes were analyzed in [44,46] and [47–49], respectively. However, theoretical anal-
ysis does not completely reflect how the decoding algorithms affect the hardware
implementation results, such as area and throughput. For KK codes, decoder architectures
based on the generalized decoding algorithm for rank metric codes [46]
were proposed in [43]. Unfortunately, the rank metric decoder architectures in [43]
suffer from limited throughput, long decoding latency and high area complexity.
Besides, to the best of our knowledge, decoder architectures for MV codes and their
hardware implementations have not been investigated in the open literature.
In this chapter, we focus on efficient architectures and their hardware implemen-
tations of interpolation based decoders for KK and MV codes. The main contribu-
tions of this chapter are:
1. The decoder of KK codes has two stages: interpolation and factorization. The
generalized interpolation algorithm in [51] is used for the first stage since it is
more efficient than Gaussian elimination [51]. For factorization, we propose
a reformulated right division algorithm for linearized polynomials, which is
suitable for hardware implementations.
2. The list decoder of MV codes also has two stages: interpolation and factoriza-
tion. The generalized interpolation algorithm in [51] is used in the interpola-
tion process. A linearized Roth-Ruckenstein (LRR) algorithm [53] is proposed
in [47] to solve the factorization problem for MV codes. In this chapter, we
make a more detailed study on the LRR algorithm. For list size L = 2, we
derive the equations used to compute all the information symbols and uncover
the relation between two possible solutions. A matrix based LRR (M-LRR)
algorithm, which is suitable for hardware implementations, is also proposed
for factorization.
3. A serial decoder architecture and an unfolded decoder architecture for KK
codes are proposed for applications with moderate and high throughputs, respectively.
Both architectures are implemented for KK codes over $\mathrm{GF}(2^8)$ and
$\mathrm{GF}(2^{16})$ to demonstrate their efficiency. To the best of our knowledge, this is
the first efficient implementation of an interpolation-based decoder for KK codes.
Compared to the rank metric decoder architectures for KK codes [43], the
proposed serial decoder architecture improves the throughput by 4.9 and 13.2
times, while its gate counts are only 56% and 76% of their respective counter-
parts in [43]. Moreover, for these two codes, the unfolded architecture achieves
a throughput of 12.5Gb/s and 41.6Gb/s, much higher than the throughput of
214Mb/s and 134Mb/s of their respective counterparts in [43]. The through-
puts per thousand NAND gates of our architectures are much higher and their
latency much shorter than their counterparts in [43].
4. A serial list decoder architecture for MV codes is proposed. To the best of
our knowledge, this is the first hardware implementation of MV decoders.
An efficient architecture for solving equations over an extension field $\mathrm{GF}(q^{ml})$
($q > 2$ is moderate) is proposed. The proposed equation solver does not require
complicated inversion operations over $\mathrm{GF}(q^{ml})$. Besides, an implementation
of factorization that computes all L possible transmitted packets in parallel is
proposed, where L is the list size for list decoding.
The rest of the chapter is organized as follows. Section 4.2 provides some related
background about KK and MV codes. Our serial and unfolded decoder architectures
for KK codes are proposed in Section 4.3. Section 4.4 presents the list decoder
architecture for MV codes. Section 4.5 presents the implementation results, and
conclusions are drawn in Section 4.6.
4.2 KK and MV codes
4.2.1 KK codes and their decoding algorithms
KK codes [44] constitute an important class of subspace codes with constant dimension.
A KK code over $\mathrm{GF}(2^m)$ is described by three parameters $(m, n, k)$, where
$n$ is the dimension of the transmitted subspace and $k$ is the number of information
symbols over $\mathrm{GF}(2^m)$. A $k$-dimensional information vector $u = (u_0, u_1, \cdots, u_{k-1})^T$ is
treated as a linearized polynomial [46] $u(x) = u_0 x^{[0]} + u_1 x^{[1]} + \cdots + u_{k-1} x^{[k-1]}$, where
$x^{[i]}$ denotes $x^{2^i}$ and $u_i \in \mathrm{GF}(2^m)$. In this chapter, for all linearized polynomials
$u(x)$, $\deg(u(x)) = \max\{j : u_j \neq 0\}$ denotes the degree of $u(x)$. The information
vector $u$ is encoded into $n$ packets over $\mathrm{GF}(2^m)$: $p_0, p_1, \cdots, p_{n-1}$, where $p_i = (\beta_i, u(\beta_i))$
and $\beta_0, \beta_1, \cdots, \beta_{n-1}$ are elements of $\mathrm{GF}(2^m)$ that are linearly independent over $\mathrm{GF}(2)$. Each packet
consists of two elements from $\mathrm{GF}(2^m)$. After these $n$ encoded packets are injected
into the network, $N$ potentially corrupted packets $(r_0(s), r_1(s))$ are received, where
$r_0(s), r_1(s) \in \mathrm{GF}(2^m)$ for $s = 0, 1, \cdots, N-1$. Based on the received packets, the
KK decoder produces $\hat{u} = (\hat{u}_0, \hat{u}_1, \cdots, \hat{u}_{k-1})^T$.
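The encoding map above can be illustrated with a minimal Python sketch. It assumes $m = 8$, a polynomial-basis representation with the reduction polynomial $x^8 + x^4 + x^3 + x^2 + 1$ (the architectures in this chapter use a normal basis instead), and arbitrary example values for $u$ and the $\beta_i$; all function names are hypothetical.

```python
M = 8
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed reduction polynomial)


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^M) elements held as integers (polynomial basis)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # reduce as soon as the degree reaches M
            a ^= POLY
    return r


def frob(a: int, i: int) -> int:
    """Return a^[i] = a^(2^i), computed by i successive squarings."""
    for _ in range(i):
        a = gf_mul(a, a)
    return a


def lin_eval(u, x: int) -> int:
    """Evaluate the linearized polynomial u(x) = sum_i u[i] * x^[i]."""
    acc = 0
    for i, ui in enumerate(u):
        acc ^= gf_mul(ui, frob(x, i))
    return acc


def kk_encode(u, betas):
    """Return the packets p_i = (beta_i, u(beta_i))."""
    return [(b, lin_eval(u, b)) for b in betas]


u = [0x53, 0xCA]                   # k = 2 information symbols (arbitrary)
betas = [0x01, 0x02, 0x04, 0x08]   # n = 4 elements, independent over GF(2)
packets = kk_encode(u, betas)
print(packets)
```

Because $u(x)$ is linearized, $u(a + b) = u(a) + u(b)$ for all field elements, which is why the transmitted packets span a subspace that survives linear recombination in the network.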
Algorithm 9: Interpolation algorithm for KK codes [51]
input : $(r_0(s), r_1(s))$; $s = 0, 1, \cdots, N-1$
output: $d(x, y)$ s.t. $d(r_0(s), r_1(s)) = 0$, $s = 0, 1, \cdots, N-1$
Initialization: $f_0(x, y) = x$, $f_1(x, y) = y$
for $s = 0$ to $N-1$ do
    for $i = 0$ to $1$ do
        $\Delta_i = f_i(r_0(s), r_1(s))$
        $O_i = \max(\deg(f_{i,x}(x)), \deg(f_{i,y}(y)) + k - 1)$
    $I_0 = \{i : \Delta_i \neq 0\}$; $I_1 = \{i : \Delta_i = 0\}$
    if $I_0 \neq \emptyset$ then
        $i^* \leftarrow \arg\min_{i \in I_0} \{O_i\}$
        for $i \in I_0$ do
            if $i \neq i^*$ then
                $F_i(x, y) = \Delta_{i^*} f_i + \Delta_i f_{i^*}$
            else $F_i(x, y) = f_i^2 + \Delta_i f_i$
    if $I_1 \neq \emptyset$ then
        for $i \in I_1$ do
            $F_i(x, y) = f_i$
    $f_0(x, y) = F_0$, $f_1(x, y) = F_1$
$O_0 = \max(\deg(f_{0,x}(x)), \deg(f_{0,y}(y)) + k - 1)$
$O_1 = \max(\deg(f_{1,x}(x)), \deg(f_{1,y}(y)) + k - 1)$
if $O_0 \le O_1$ then
    $d(x, y) = f_0(x, y)$
else $d(x, y) = f_1(x, y)$
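A software model can help when checking the hardware against Algorithm 9. The following Python sketch is one possible rendering under stated assumptions: GF($2^8$) in a polynomial basis with reduction polynomial 0x11D (not the normal basis used by the architectures), each $f_i$ stored as a pair of coefficient lists (x-part, y-part), ties in the order broken in favour of $f_0$, and error-free example packets. The helper names are hypothetical.

```python
M, POLY = 8, 0x11D  # assumed field: GF(2^8), polynomial basis


def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def lin_eval(p, x):
    """Evaluate sum_j p[j] * x^(2^j)."""
    acc, xp = 0, x
    for c in p:
        acc ^= gf_mul(c, xp)
        xp = gf_mul(xp, xp)
    return acc


def deg(p):
    """max{j : p[j] != 0}, or -inf for the zero polynomial."""
    return max((j for j, c in enumerate(p) if c), default=float("-inf"))


def order(f, k):
    return max(deg(f[0]), deg(f[1]) + k - 1)


def combine(c1, p1, c2, p2):
    """Return c1*p1 + c2*p2 on coefficient lists."""
    n = max(len(p1), len(p2))
    p1 = p1 + [0] * (n - len(p1))
    p2 = p2 + [0] * (n - len(p2))
    return [gf_mul(c1, a) ^ gf_mul(c2, b) for a, b in zip(p1, p2)]


def square(p):
    """p(x)^2: the coefficient c of x^[j] becomes c^2 at x^[j+1]."""
    return [0] + [gf_mul(c, c) for c in p]


def interpolate(points, k):
    f = [([1], []), ([], [1])]                      # f0 = x, f1 = y
    for r0, r1 in points:
        delta = [lin_eval(fx, r0) ^ lin_eval(fy, r1) for fx, fy in f]
        nz = [i for i in (0, 1) if delta[i]]        # the set I0
        if not nz:
            continue                                # both f_i unchanged
        star = min(nz, key=lambda i: order(f[i], k))
        new = list(f)
        for i in nz:
            if i != star:                           # cross-term update
                new[i] = tuple(combine(delta[star], f[i][c],
                                       delta[i], f[star][c]) for c in (0, 1))
            else:                                   # order-raising update
                new[i] = tuple(combine(1, square(f[i][c]),
                                       delta[i], f[i][c]) for c in (0, 1))
        f = new
    return f[0] if order(f[0], k) <= order(f[1], k) else f[1]


u, k = [0x53, 0xCA], 2
points = [(b, lin_eval(u, b)) for b in (0x01, 0x02, 0x04, 0x08)]
d = interpolate(points, k)
print("d_x coefficients:", d[0], "d_y coefficients:", d[1])
```

The returned $d(x, y)$ is nonzero and vanishes at every processed point, which is the property the interpolation stage has to deliver to the factorization stage.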
The KK decoder in [44] consists of two stages. The first stage, called interpolation,
finds a nonzero bivariate linearized polynomial $d(x, y) = d_x(x) + d_y(y)$ such
that $d(r_0(s), r_1(s)) = 0$ for $s = 0, 1, \cdots, N-1$. The degrees of $d_x(x)$ and $d_y(y)$ are at
most $m$ and $m - k + 1$, respectively. The second stage, referred to as factorization,
obtains the transmitted information symbols by computing a linearized polynomial
$\hat{u}(x)$ such that $d(x, \hat{u}(x)) = 0$. While the interpolation can be implemented
by solving a system of linear equations via Gaussian elimination, a more efficient
generalized interpolation algorithm in the ring of linearized polynomials has been
proposed in [51] (the interpolation algorithm proposed in [44] is in fact a special case
of this generalized interpolation algorithm). The generalized interpolation algorithm
in [51] adapted to the interpolation problem for an $(m, n, k)$ KK code is shown in
Algorithm 9, where $f_i(x, y) = f_{i,x}(x) + f_{i,y}(y)$ is a bivariate linearized polynomial over
$\mathrm{GF}(2^m)$. A rank metric decoder has also been proposed for KK codes in [46], and
its hardware implementation has been investigated in [43]. Unfortunately, the rank
metric decoder architectures in [43] suffer from limited throughput, long decoding
latency, and high area complexity.
The interpolation algorithm in [51] for KK codes parallels Koetter's interpolation
algorithm for RS codes. The comparison between these two algorithms is
shown in Table 4.1, where IRS denotes the interpolation algorithm for RS codes.
As shown in Table 4.1, the IRS algorithm and Algorithm 9 are similar in their
polynomial updating rules. The key difference lies in the fact that Algorithm 9 deals
with linearized polynomials while the IRS algorithm deals with ordinary polynomials. Due to
their differences in multiplications, interpolator architectures for RS codes are not
applicable to Algorithm 9.
Table 4.1: Interpolation by Polynomials and Linearized Polynomials

| Algorithms | IRS | Algorithm 9 |
|---|---|---|
| Ring | Polynomials | Linearized polynomials |
| Module basis | $x, y, y^2, \ldots, y^L$ | $y_0 = x, y_1, y_2, \ldots, y_L$ |
| Monomials | $x^i y^j$ | $x^{[i]} \circ y_j \overset{\text{def}}{=} y_j^{[i]}$ |
| Elements | $P = \sum_{i,j \ge 0} a_{i,j} x^i y^j$ | $Q = \sum_{i,j \ge 0} b_{i,j} y_j^{[i]}$ |
| Linear functionals | $D_{r,s} P(\alpha, \beta) = \sum_k \sum_j \binom{k}{r} \binom{j}{s} a_{k,j} \alpha^{k-r} \beta^{j-s}$ | $D(Q) = Q(\alpha, \beta_1, \ldots, \beta_L)$ |
| Initialization | $f_{0,j} = y^j$ | $g_{0,j} = y_j$ |
| Update: no update | $f_{i+1,j} = f_{i,j}$, $j \notin J = \{j : D_{i+1}(f_{i,j}) \neq 0\}$ | same as IRS |
| Update: cross-term | $f_{i+1,j} = D(f_{i,j^*}) f_{i,j} - D(f_{i,j}) f_{i,j^*}$, $j \in J$, $j \neq j^*$ | same rule as IRS |
| Update: order-raise | $f_{i+1,j} = D(f_{i,j})(x f_{i,j}) - D(x f_{i,j}) f_{i,j}$, $j = j^*$ | $g_{i+1,j} = D(g_{i,j}) g_{i,j}^{[1]} - D(g_{i,j}^{[1]}) g_{i,j}$, $j = j^*$ |
4.2.2 MV codes and their list decoding algorithm
MV codes are similar to but different from KK codes [44]. To enable list decoding,
different code constructions are proposed for different code dimensions in [47,48].
For an $l$-dimensional MV code over $\mathrm{GF}(q^{ml})$, where $l$ is a positive integer that
divides $q - 1$, the equation $x^l - 1 = 0$ has $l$ distinct roots $e_0 = 1, e_1, \ldots, e_{l-1}$ over
$\mathrm{GF}(q)$. We first choose a primitive element $\gamma$ of $\mathrm{GF}(q^{ml})$ so that $\gamma, \gamma^{[1]}, \ldots, \gamma^{[ml-1]}$
form a normal basis of $\mathrm{GF}(q^{ml})$, where $\gamma^{[i]} = \gamma^{q^i}$. We then construct elements
$\alpha_i = \gamma + e_i \gamma^{[m]} + e_i^2 \gamma^{[2m]} + \cdots + e_i^{l-1} \gamma^{[m(l-1)]}$ over $\mathrm{GF}(q^{ml})$ for $i = 0, 1, \ldots, l-1$,
where the $e_i$'s are the $l$ distinct roots of the equation $x^l - 1 = 0$ over $\mathrm{GF}(q)$. It is proved
in [48] that the set $\{\alpha_i^{[j]} : i = 0, 1, \ldots, l-1,\ j = 0, 1, \ldots, m-1\}$ is a basis of $\mathrm{GF}(q^{ml})$
over $\mathrm{GF}(q)$.
For an information vector $u = (u_0, u_1, \ldots, u_{k-1})$ over $\mathrm{GF}(q)$ and its corresponding
linearized polynomial $u(x) = \sum_{i=0}^{k-1} u_i x^{[i]}$, let $u^{\otimes i}(x)$ denote the composition of
$u(x)$ with itself $i$ times for any nonnegative integer $i$, where
$$u^{\otimes i}(x) \triangleq \begin{cases} x & i = 0 \\ u(x) & i = 1 \\ u(u^{\otimes (i-1)}(x)) & i > 1 \end{cases} \qquad (4.1)$$
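The recursive definition in Eq. (4.1) can be sketched in Python as follows. For illustration only, the sketch works over GF($2^8$) with $q = 2$ and a polynomial basis (reduction polynomial 0x11D), which is not a valid MV parameter set since $l$ must divide $q - 1$; the point is only to show how self-composition acts on the coefficient list. All names are hypothetical.

```python
M, POLY = 8, 0x11D  # illustrative field GF(2^8) with q = 2 (assumption)


def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def frob(a, i):
    """a^[i] = a^(2^i)."""
    for _ in range(i):
        a = gf_mul(a, a)
    return a


def lin_eval(p, x):
    acc = 0
    for i, c in enumerate(p):
        acc ^= gf_mul(c, frob(x, i))
    return acc


def compose(a, b):
    """Coefficients of a(b(x)): sum_j sum_i a_j * b_i^[j] * x^[i+j]."""
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for i, bi in enumerate(b):
            out[i + j] ^= gf_mul(aj, frob(bi, j))
    return out


def compose_power(u, i):
    """u^{(x)i}(x): x for i = 0, then u applied i times, as in Eq. (4.1)."""
    result = [1]                 # the identity polynomial x
    for _ in range(i):
        result = compose(u, result)
    return result


u = [0x53, 0xCA]
u2 = compose_power(u, 2)
print("u composed with itself:", u2)
```

Evaluating `u2` at any field element gives the same value as applying `u` twice, which is the property the packet construction relies on.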
The information vector $u$ is encoded into $l$ packets $p_0, p_1, \ldots, p_{l-1}$, where
$$p_i = \begin{cases} \left(\alpha_0, u(\alpha_0), u^{\otimes 2}(\alpha_0), \ldots, u^{\otimes L}(\alpha_0)\right) & i = 0 \\ \left(\alpha_i, \frac{u(\alpha_i)}{\alpha_i}, \ldots, \frac{u^{\otimes L}(\alpha_i)}{\alpha_i}\right) & \text{otherwise} \end{cases} \qquad (4.2)$$
and $L$ is the desired list size. Each packet consists of $L + 1$ elements in $\mathrm{GF}(q^{ml})$. At
the receiver, $N$ potentially corrupted packets $(r_0(s), r_1(s), \cdots, r_L(s))$ are received,
where $r_0(s), r_1(s), \cdots, r_L(s) \in \mathrm{GF}(q^{ml})$ for $s = 0, 1, \cdots, N-1$.
Similar to the decoding of KK codes, the list decoding of MV codes is divided into
two stages: interpolation and factorization. The generalized interpolation algorithm
is also capable of performing the interpolation for the list decoding of MV codes.
The generalized interpolation algorithm adapted to the interpolation problem for
an l-dimensional MV code is shown in Algorithm 10.
As shown in Algorithm 10, $f_i(x, y_1, \cdots, y_L) = f_{i,x}(x) + f_{i,y_1}(y_1) + \cdots + f_{i,y_L}(y_L)$ is
a nonzero multivariate linearized polynomial, where $f_{i,x}$ and the $f_{i,y_j}$'s ($j = 1, 2, \cdots, L$)
are linearized polynomials. The maximal degrees of $f_{i,x}$ and $f_{i,y_j}$ are $(l + t)L$ and
$(l + t)L - j(k - 1)$, respectively, where $t < lL - L(L+1)\frac{k-1}{2m}$ is the dimension of the error packets
received. The output is a nonzero multivariate linearized polynomial $d(x, y_1, \cdots, y_L)$
that satisfies $d(r_0(s), r_1(s), \cdots, r_L(s)) = 0$ for $s = 0, 1, \cdots, N-1$. The interpolation
step is finished in $N$ iterations.
The factorization step finds at most $L$ possible solutions of $u(x)$ for the following
equation:
$$d(x, u(x), u^{\otimes 2}(x), \cdots, u^{\otimes L}(x)) = 0. \qquad (4.3)$$
An LRR algorithm [47], shown in Algorithm 11, has been proposed to solve Eq. (4.3).
Let $Y$ be a variable in the ring $L_q[x]$, where $L_q[x]$ is the set of linearized polynomials
with coefficients in $\mathrm{GF}(q)$. Since the output of Algorithm 10 is $d(x, y_1, \cdots, y_L) = d_x(x) + d_{y_1}(y_1) + \cdots + d_{y_L}(y_L)$, Eq. (4.3) is equivalent to
$$d(x, Y) = d_0(x) + d_1(x) \otimes Y + \cdots + d_L(x) \otimes Y^{\otimes L} = 0 \qquad (4.4)$$
Algorithm 10: Interpolation algorithm for MV codes [51]
input : $(r_0(s), r_1(s), \cdots, r_L(s))$; $s = 0, 1, \cdots, N-1$
output: $d(x, y_1, \cdots, y_L)$ s.t. $d(r_0(s), r_1(s), \cdots, r_L(s)) = 0$, $s = 0, 1, \cdots, N-1$
Initialization: $f_0(x, y_1, \cdots, y_L) = x$, $f_i(x, y_1, \cdots, y_L) = y_i$ for $i = 1, 2, \cdots, L$
for $s = 0$ to $N-1$ do
    for $i = 0$ to $L$ do
        $\Delta_i = f_i(r_0(s), r_1(s), \cdots, r_L(s))$
        $d_j = \deg(f_{i,y_j}(y)) + j(k-1)$ for $j = 1, 2, \cdots, L$
        $O_i = \max(\deg(f_{i,x}(x)), d_1, d_2, \cdots, d_L)$
    $I_0 = \{i : \Delta_i \neq 0\}$; $I_1 = \{i : \Delta_i = 0\}$
    if $I_0 \neq \emptyset$ then
        $i^* \leftarrow \arg\min_{i \in I_0} \{O_i\}$
        for $i \in I_0$ do
            if $i \neq i^*$ then
                $F_i(x, y_1, \cdots, y_L) = \Delta_{i^*} f_i + \Delta_i f_{i^*}$
            else $F_i(x, y_1, \cdots, y_L) = \Delta_i f_i^{[1]} + \Delta_i^{[1]} f_i$
    if $I_1 \neq \emptyset$ then
        for $i \in I_1$ do
            $F_i(x, y_1, \cdots, y_L) = f_i$
    $f_i = F_i$ for $i = 0, 1, \cdots, L$
for $i = 0$ to $L$ do
    $d_j = \deg(f_{i,y_j}(y)) + j(k-1)$ for $j = 1, 2, \cdots, L$
    $O_i = \max(\deg(f_{i,x}(x)), d_1, d_2, \cdots, d_L)$
$d(x, y_1, \cdots, y_L) = f_0$
$O_{\min} = O_0$
for $i = 1$ to $L$ do
    if $O_i < O_{\min}$ then
        $d(x, y_1, \cdots, y_L) = f_i$
        $O_{\min} = O_i$
where $u(x)$ is the solution of $Y$ and
$$d_i(x) = \begin{cases} d_x(x) & i = 0 \\ d_{y_i}(y_i)|_{y_i = x} & i > 0. \end{cases} \qquad (4.5)$$
If the polynomial $d(x, Y)$ is divisible by $x^{[s]}$, then we define
$$d_{\downarrow s}(x, Y) = d'_0(x) + d'_1(x) \otimes Y + \cdots + d'_L(x) \otimes Y^{\otimes L}, \qquad (4.6)$$
where $d'_i(x)^{[s]} = d_i(x)$.
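Algorithm 11: LRR algorithm [47]
Procedure: LRR($d(x, Y)$, $k$, $\lambda$)
Global variables: $A \subseteq L_q[x]$, $u(x) = \sum_{i=0}^{k-1} u_i x^{[i]} \in L_q[x]$
Call procedure initially with $d(x, Y) \neq 0$, $\lambda = 0$
if $\lambda == 0$ then
    $A = \emptyset$
$s \leftarrow$ largest integer s.t. $d(x, Y)$ is divisible by $x^{[s]}$
$H(x, \gamma) \leftarrow \frac{1}{x} d_{\downarrow s}(x, \gamma x)$
$Z \leftarrow$ set of all roots of $H(0, \gamma)$ in $\mathrm{GF}(q)$
foreach $\gamma \in Z$ do
    $u_\lambda \leftarrow \gamma$
    if $\lambda < k - 1$ then
        LRR($d_{\downarrow s}(x, Y^{[1]} + \gamma x)$, $k$, $\lambda + 1$)
    else
        if $d(x, u_{k-1} x) == 0$ then
            $A \leftarrow A \cup \{u(x)\}$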
As shown in Algorithm 11, the $L$ possible solutions of $u(x)$ are stored in the set
$A$. The original LRR algorithm is a high-level algorithm; the detailed expressions of
$d(x, Y)$ are not specified in [47].
4.3 Efficient KK decoder architectures
In this section, we propose a serial decoder architecture and an unfolded decoder
architecture for KK codes.
4.3.1 Serial decoder architecture
In order to minimize the hardware cost, a serial decoder architecture is proposed
in Fig. 4.1, where the widths of multi-bit buses are shown. The serial architecture
consists of the following major parts: coefficient registers CXR$_i$ and CYR$_i$, two
interpolators interpolator0 and interpolator1, a polynomial selection unit polySel, and
a polynomial divider polyDiv, which implements the factorization step. Algorithm 9
updates two bivariate linearized polynomials: $f_i(x, y) = f_{i,x}(x) + f_{i,y}(y)$ for $i = 0$
and $1$, where $f_{i,x}(x) = \sum_{j=0}^{N_x-1} \mathrm{CEX}_i(j)\, x^{[j]}$ and $f_{i,y}(y) = \sum_{j=0}^{N_y-1} \mathrm{CEY}_i(j)\, y^{[j]}$ are
linearized polynomials in $x$ and $y$, respectively. For $i = 0$ or $1$, the coefficients of
$f_{i,x}(x)$ and $f_{i,y}(y)$ are stored in CXR$_i$ and CYR$_i$, respectively. CXR$_i$ and CYR$_i$
consist of $N_x$ and $N_y$ $m$-bit registers, respectively, since each element in $\mathrm{GF}(2^m)$ is
represented by $m$ bits. $N_x - 1$ and $N_y - 1$ are set to the maximal degrees of $f_{i,x}(x)$
and $f_{i,y}(y)$, respectively, during the interpolation process. Hence, $N_x = m + 1$ and
$N_y = m - k + 2$.
Interpolator0 and interpolator1 compute the updated coefficients for $f_0(x, y)$ and
$f_1(x, y)$, respectively, and write the updated coefficients back to CXR and CYR
during each cycle. Since interpolator0 and interpolator1 have the same circuitry,
the architecture of interpolator0 is discussed in Sec. 4.3.1. After the interpolation
is finished, the polySel unit selects $f_0(x, y)$ if $O_0 \le O_1$, or $f_1(x, y)$ otherwise. Since
the polySel unit can be easily implemented with the orderComp unit (see Fig. 4.4)
used for interpolator0, the details of the polySel unit are omitted. The coefficients
of the polynomial selected by the polySel unit will be stored in DCXR and DCYR,
which consist of $N_x$ and $N_y$ $m$-bit registers, respectively. The polyDiv unit will then
compute $\hat{u}$ based on a reformulated right division algorithm described in Sec. 4.3.1.

Figure 4.1: Serial KK decoder architecture

Figure 4.2: Architecture of interpolator0
Efficient interpolator architecture
The architecture of interpolator0 is shown in Fig. 4.2. It computes the corresponding
∆0 and O0 as well as the updated polynomial coefficients for f0(x, y) as specified in
Algorithm 9. Interpolator0 consists of Nx + Ny polynomial updating units (PUUs)
and the orderComp and polyEvl units. Each PUU generates a new coefficient during
each cycle. The polyEvl unit evaluates the corresponding linearized polynomial and
generates $\Delta_0$, and the orderComp unit generates $O_0$. The number of bits needed for
$O_0$ is $l = \lceil \log_2 \max\{N_x, N_y + k - 1\} \rceil$.
To achieve high throughput, a fully parallel architecture shown in Fig. 4.3 is used
for the polyEvl unit. In this work, all Galois field elements are represented with
respect to a normal basis so that the q-exponentiation operation is a cyclic
shift. Suppose the normal basis representation of $b = \sum_{j=0}^{m-1} \gamma^{[j]} b_j \in \mathrm{GF}(2^m)$ is
$(b_{m-1}, b_{m-2}, \cdots, b_0)$, where $b_j \in \mathrm{GF}(2)$ and the $\gamma^{[j]}$'s constitute a normal basis. The
corresponding normal basis representation of $b^2$ is $(b_{m-2}, \cdots, b_0, b_{m-1})$. In Fig. 4.3,
the computation of $b^{[j]}$ is carried out by the S(j) unit, which cyclically shifts its input
by $j$ positions and requires wiring only. The SUM0 unit in Fig. 4.3 performs bit-wise
XOR operations on its $m$-bit input messages. Finite field multiplications and
additions are also used in the polyEvl unit. Additions over $\mathrm{GF}(2^m)$ are simply
bit-wise XOR operations. In this work, we use the improved Massey–Omura normal
basis multipliers proposed in our previous work [43, Sec. III-B].
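The cyclic-shift property can be checked numerically with a small Python sketch. It assumes the toy field GF($2^4$) built from $x^4 + x + 1$, finds a normal basis by exhaustive search, and stores the coordinate $b_j$ in bit $j$ of an integer so that the tuple $(b_{m-1}, \cdots, b_0)$ reads as a binary string. The function names are hypothetical, and the brute-force conversion is for illustration only.

```python
M, POLY = 4, 0b10011  # toy field GF(2^4), x^4 + x + 1 (assumption)


def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def conjugates(g):
    """Return [g, g^2, g^4, ..., g^(2^(M-1))]."""
    out = [g]
    for _ in range(M - 1):
        out.append(gf_mul(out[-1], out[-1]))
    return out


def span(vectors):
    """All GF(2)-linear combinations of the given elements."""
    s = {0}
    for v in vectors:
        s |= {x ^ v for x in s}
    return s


def find_normal_basis():
    for g in range(1, 1 << M):
        basis = conjugates(g)
        if len(span(basis)) == 1 << M:   # conjugates are independent
            return basis
    raise ValueError("no normal element found")


def to_normal(b, basis):
    """Integer whose bit j is the normal-basis coordinate b_j of b."""
    for coords in range(1 << M):
        acc = 0
        for j in range(M):
            if (coords >> j) & 1:
                acc ^= basis[j]
        if acc == b:
            return coords
    raise ValueError("element not representable")


def rotl(coords, j):
    """Cyclic left shift of an M-bit word by j positions."""
    j %= M
    return ((coords << j) | (coords >> (M - j))) & ((1 << M) - 1)


basis = find_normal_basis()
for b in range(1 << M):
    assert to_normal(gf_mul(b, b), basis) == rotl(to_normal(b, basis), 1)
print("squaring is a 1-position cyclic shift for all", 1 << M, "elements")
```

Applying the shift $j$ times gives $b^{[j]}$, which is why the S(j) unit needs no logic gates.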
The architecture of the orderComp unit is shown in Fig. 4.4. The IsZero unit
tests whether its $m$-bit input message is zero. The two priority decoders in Fig. 4.4
compute $\deg(f_{0,x}(x))$ and $\deg(f_{0,y}(y))$, respectively, and their outputs have
$l_x = \lceil \log_2 N_x \rceil$ and $l_y = \lceil \log_2 N_y \rceil$ bits, respectively. In Fig. 4.4, $k$ is the number of
information symbols. A fixed-point adder that performs integer addition is used.
The MAX unit computes the maximum of its two inputs.

Figure 4.3: Architecture of polyEvl for interpolator0

Figure 4.4: Architecture of orderComp for interpolator0
The two groups of PUUs in Fig. 4.2 ($N_y$ PUUs on the left and $N_x$ PUUs
on the right) update $f_{0,y}(y)$ and $f_{0,x}(x)$, respectively. Since all PUUs have the
same circuitry, the PUU that updates the coefficient of $x^{[j]}$ is shown in Fig. 4.5. In
Algorithm 9, $f_0(x, y)$ is updated in three different ways: 1) $f_0(x, y) = f_0(x, y)^2 + \Delta_0 f_0(x, y)$
when $\Delta_0$ is not zero and either $O_0 \le O_1$ or $\Delta_1$ is zero; 2) $f_0(x, y) = \Delta_0 f_1(x, y) + \Delta_1 f_0(x, y)$
when $O_0 > O_1$ and both $\Delta_0$ and $\Delta_1$ are not zero; 3) $f_0(x, y)$ remains unchanged when $\Delta_0$ is zero. As a
result, there are three different polynomial operations: 1) computing the square of a
linearized polynomial; 2) multiplying a linearized polynomial by a constant; and 3)
adding two linearized polynomials. The proposed PUU is configured to implement
these three operations with two control signals MS$_i$ and KEEP$_i$.

Figure 4.5: Architecture of the PUU that updates the coefficient of $x^{[j]}$

Taking $\Delta_0$, $\Delta_1$, $O_0$ and $O_1$ as inputs, the control unit computes the control
signals
$$(\mathrm{KEEP}_0, \mathrm{KEEP}_1, \mathrm{MS}_0, \mathrm{MS}_1, c_0, c'_0, c_1, c'_1) = \begin{cases} (1, 1, 0, 1, \Delta_0, \mathrm{X}, \Delta_0, \Delta_1) & \text{if } \Delta_0 \neq 0, \Delta_1 \neq 0, O_0 \le O_1 \\ (1, 1, 1, 0, \Delta_1, \Delta_0, \Delta_1, \mathrm{X}) & \text{if } \Delta_0 \neq 0, \Delta_1 \neq 0, O_0 > O_1 \\ (0, 1, \mathrm{X}, 0, \mathrm{X}, \mathrm{X}, \Delta_1, \mathrm{X}) & \text{if } \Delta_0 = 0, \Delta_1 \neq 0 \\ (1, 0, 0, \mathrm{X}, \Delta_0, \mathrm{X}, \mathrm{X}, \mathrm{X}) & \text{if } \Delta_0 \neq 0, \Delta_1 = 0 \end{cases} \qquad (4.7)$$
where X in Eq. (4.7) is "don't care".
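The case analysis of Eq. (4.7) maps directly onto a small lookup function. The Python sketch below is a behavioural model only, with "don't care" encoded as `None`; the final branch, where both discrepancies are zero, is not listed in Eq. (4.7), and the value returned there is an assumption (neither polynomial is updated).

```python
X = None  # stands for "don't care"


def control_signals(delta0: int, delta1: int, o0: int, o1: int):
    """Return (KEEP0, KEEP1, MS0, MS1, c0, c0', c1, c1') following Eq. (4.7)."""
    if delta0 != 0 and delta1 != 0:
        if o0 <= o1:
            return (1, 1, 0, 1, delta0, X, delta0, delta1)
        return (1, 1, 1, 0, delta1, delta0, delta1, X)
    if delta0 == 0 and delta1 != 0:
        return (0, 1, X, 0, X, X, delta1, X)
    if delta0 != 0 and delta1 == 0:
        return (1, 0, 0, X, delta0, X, X, X)
    # Both discrepancies are zero: this case does not appear in Eq. (4.7).
    # Assumed here: KEEP0 = KEEP1 = 0, so neither polynomial changes.
    return (0, 0, X, X, X, X, X, X)


# Example: both discrepancies nonzero and f0 has the smaller order.
print(control_signals(0x1F, 0xA2, 3, 4))
```

In every listed case KEEP$_i$ is 1 exactly when $\Delta_i \neq 0$, that is, when $f_i$ has to be rewritten in the current iteration.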
Reformulated right division algorithm
In [44], a recursive right division procedure is proposed to solve the factorization
problem. In this chapter, the right division procedure [44] is reformulated in a
non-recursive manner. Let $a(x) = d_x(x)$ and $b(x) = d_y(y)|_{y=x}$. Let $\mathrm{lc}(a(x))$ denote
the leading coefficient of $a(x)$. That is, if $a(x)$ has degree $d$, i.e.,
$a(x) = a_d x^{[d]} + a_{d-1} x^{[d-1]} + \cdots + a_0 x^{[0]}$, then $\mathrm{lc}(a(x)) = a_d \neq 0$. The reformulated right
division algorithm is shown in Algorithm 12. The $k$ message symbols are recovered
within at most $k$ iterations.
Algorithm 12: Reformulated right division algorithm
Input: $a(x)$ and $b(x)$, $b(x) \not\equiv 0$
Output: $\hat{u} = (\hat{u}_0, \hat{u}_1, \cdots, \hat{u}_{k-1})$
Initialization: $j = 0$; $\hat{u}_i = 0$, for $0 \le i < k$
while $\deg(a(x)) \ge \deg(b(x))$ and $a(x) \neq 0$ do
    $d = \deg(a(x))$, $e = \deg(b(x))$, $q = d - e$
    $a_d = \mathrm{lc}(a(x))$, $b_e = \mathrm{lc}(b(x))$, $\hat{u}_q = (a_d / b_e)^{[m-e]}$
    $t(x) = (a_d / b_e)^{[m-e]} x^{[q]}$, $a(x) = a(x) - b(t(x))$
    $j = j + 1$
end while
if $j > k$ or $\deg(a(x)) > 0$ then
    return decoding failure
end if
return $\hat{u}$
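A behavioural model of Algorithm 12 can be sketched in Python as follows. The sketch assumes GF($2^8$) in a polynomial basis (reduction polynomial 0x11D) rather than the normal basis of the hardware, represents polynomials as coefficient lists, and treats any nonzero remainder or any quotient position $q \ge k$ as a decoding failure, which is one reading of the failure test above. Names such as `right_divide` are hypothetical.

```python
M, POLY = 8, 0x11D  # assumed field: GF(2^8), polynomial basis


def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r


def frob(a, i):
    """a^[i] = a^(2^i)."""
    for _ in range(i):
        a = gf_mul(a, a)
    return a


def gf_inv(a):
    """a^(-1) = a^(2^M - 2) = a^2 * a^4 * ... * a^(2^(M-1)) for a != 0."""
    r = 1
    for _ in range(M - 1):
        a = gf_mul(a, a)
        r = gf_mul(r, a)
    return r


def deg(p):
    """max{j : p[j] != 0}, or -1 for the zero polynomial."""
    return max((j for j, c in enumerate(p) if c), default=-1)


def compose(a, b):
    """Coefficients of a(b(x)); used only to build a test input."""
    out = [0] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for i, bi in enumerate(b):
            out[i + j] ^= gf_mul(aj, frob(bi, j))
    return out


def right_divide(a, b, k):
    """Return u_hat with a(x) = b(u_hat(x)), or None on decoding failure."""
    a = list(a)
    u_hat = [0] * k
    e = deg(b)                      # b must be nonzero
    inv_be = gf_inv(b[e])
    iterations = 0
    while deg(a) >= e:              # deg(a) = -1 once a(x) = 0
        d = deg(a)
        q = d - e
        if q >= k:                  # assumed failure: position out of range
            return None
        c = frob(gf_mul(a[d], inv_be), (M - e) % M)   # (a_d / b_e)^[m-e]
        u_hat[q] = c
        for i in range(e + 1):      # a(x) <- a(x) - b(c * x^[q])
            a[q + i] ^= gf_mul(b[i], frob(c, i))
        iterations += 1
    if iterations > k or any(a):    # nonzero remainder treated as failure
        return None
    return u_hat


u = [0x53, 0xCA]
b = [0x05, 0x01, 0x09]
a = compose(b, u)
print(right_divide(a, b, k=2))
```

Each pass cancels the leading term of $a(x)$ because $b_e \left((a_d/b_e)^{[m-e]}\right)^{[e]} = b_e (a_d/b_e)^{[m]} = a_d$ in $\mathrm{GF}(2^m)$, so one information symbol is recovered per iteration.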
Efficient factorization architecture
A parallel polynomial divider that implements Algorithm 12 is shown in Fig. 4.6,
where AX(j) and BX(j) denote the coefficients of $x^{[j]}$ for $a(x)$ and $b(x)$, respectively,
UAX(j) denotes the updated coefficients for $a(x)$, and coeff and pos denote a recovered
information symbol and its position in the information vector, respectively. The COS unit finds
the leading coefficient and the degree of a given linearized polynomial. The inv
unit computes the inverse of the leading coefficient $\mathrm{lc}(b(x))$ of $b(x)$. The CS unit
cyclically shifts the $m$-bit input by $m - e$ positions, where $e$ is the degree of $b(x)$.
S(j) cyclically shifts its input by $j$ positions and hence requires wiring only. The LS
unit has $N_y$ $m$-bit inputs and $N_x - 1$ $m$-bit outputs. As shown in Fig. 4.6, we have
L(j+pos) = BX(j) for $j = 0, 1, \cdots, N_y - 1$. For other $j$, L(j) = 0. In this work, a
parallel inversion architecture is employed, and such an architecture for inversions
over $\mathrm{GF}(2^8)$ is shown in Fig. 4.7.

Figure 4.6: Architecture of the polyDiv unit

Figure 4.7: Parallel inversion architecture over $\mathrm{GF}(2^8)$
Performance of the serial decoder architecture
We now consider the critical path delay (CPD), latency, and throughput of the serial
architecture in Fig. 4.1. The critical path delay of the polyDiv unit is given by
$T_{\mathrm{COS}} + T_{\mathrm{inv}} + T_{\mathrm{mul}} + T_{\mathrm{CS}} + T_{\mathrm{mul}} + T_{\mathrm{add}}$, where $T_{\mathrm{COS}}$ is the delay of the COS unit
shown in Fig. 4.6 and $T_{\mathrm{inv}}$, $T_{\mathrm{mul}}$, and $T_{\mathrm{add}}$ are the delays of finite field inversion,
multiplication, and addition, respectively. On the other hand, the critical path delay
of an interpolator is given by $\max\{T_{\mathrm{polyEvl}}, T_{\mathrm{orderComp}}\} + T_{\mathrm{ctrl}} + T_{\mathrm{PUU}}$. The
CPD of the polySel unit is negligible compared to those of the polyDiv unit and
the interpolator. When $m = 8$, the CPDs of the polyDiv unit and the interpolator
are dominated by $5T_{\mathrm{mul}}$ and $2T_{\mathrm{mul}}$, respectively. In order to balance the CPDs
between the polyDiv unit and the interpolator, a stage of pipeline registers is inserted in the
polyDiv unit (indicated by the dotted line in Fig. 4.6).
For the serial architecture in Fig. 4.1, the interpolation needs $N$ cycles, where $N$
is the number of linearly independent received packets. The maximal value of $N$ is
set to $2n - k$, since the decoder will fail if the number of errors exceeds its correction
capability when $N > 2n - k$. Without the pipeline registers, the factorization will
finish in $M$ ($M \le k$) cycles, since the polyDiv unit recovers one non-zero information
symbol during each cycle. The polySel unit takes one cycle. In the worst case, it takes
at most $N + 1 + k$ cycles to generate $\hat{u}$. With the pipeline registers inserted in
the polyDiv unit as shown in Fig. 4.6, it takes $2M$ cycles to finish the polynomial
division, and the overall latency becomes $N + 1 + 2M$. Once the coefficients of $d(x, y)$,
which is the output of interpolation, are loaded into CXR and CYR, the interpolator
could start processing the following received packets while the polyDiv unit is still
inferring the previous information vector. The throughput of the proposed serial
decoder architecture is given by $\frac{fmk}{\max(N, 2M, 1)}$ Mb/s, assuming that the clock rate of
the decoder is $f$ MHz.
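The latency and throughput expressions above can be evaluated with a short Python sketch. The function names and the numeric example (an $(8, 4, 4)$ code at 200 MHz) are assumptions chosen for illustration and are not results reported in this chapter.

```python
def serial_latency_cycles(N: int, M: int, pipelined: bool = True) -> int:
    """Interpolation (N) + polySel (1) + factorization (2M if pipelined, else M)."""
    return N + 1 + (2 * M if pipelined else M)


def serial_throughput_mbps(f_mhz: float, m: int, k: int, N: int, M: int) -> float:
    """f*m*k / max(N, 2M, 1) Mb/s: m*k information bits per decoded word."""
    return f_mhz * m * k / max(N, 2 * M, 1)


# Hypothetical operating point: m = 8, n = 4, k = 4, so N is at most 2n - k = 4.
m, n, k = 8, 4, 4
N = 2 * n - k
M = k                      # worst case: every information symbol is nonzero
f_mhz = 200.0

print("latency (cycles):", serial_latency_cycles(N, M))
print("throughput (Mb/s):", serial_throughput_mbps(f_mhz, m, k, N, M))
```

With these numbers the pipelined divider, not the interpolator, sets the throughput because $2M > N$.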
4.3.2 Unfolded decoder architecture
For modern network applications, a throughput well beyond several Gb/s is de-
sirable. Since the serial architecture described in Section 4.3.1 may not meet such
throughput requirements, an unfolded decoder architecture is also proposed for high
throughput scenarios. As shown in Algorithms 9 and 12, both the interpolation and
right division algorithms contain a loop, and both algorithms finish within a bounded
number of iterations. As a result, an unfolded architecture is proposed in Fig. 4.8.

Figure 4.8: Unfolded decoder architecture

Compared with the serial decoder architecture, the unfolded architecture in
Fig. 4.8 has $N$ stages of interpolators and $k$ stages of polyDiv units, where $N$ is the
maximal number of received packets. For all received words, the $N$ iterations of
an interpolation process are distributed to $N$ pairs of interpolators. interpolator$_{i,s}$
performs iteration $s$ of Algorithm 9 and passes the results to the
next stage of interpolators. $N_{x,s}$ and $N_{y,s}$ denote the numbers of registers needed by
interpolator$_{i,s}$. The polyDiv$_j$ unit implements iteration $j$ of Algorithm 12, and $D_{x,j}$ and
$D_{y,j}$ denote the numbers of registers needed by polyDiv$_j$.
Based on Algorithm 9, at the beginning of iteration $s$,
$(\deg(f_{i,x}(x)), \deg(f_{i,y}(y))) \le (d_{x,s}, d_{y,s})$, where
$$(d_{x,s}, d_{y,s}) = \begin{cases} (s, 0) & \text{if } 0 \le s \le k \\ (s, s - k) & \text{if } k < s \le m \\ (m, m + 1 - k) & \text{if } s > m \end{cases}$$
Similarly, we can show that the
degrees of $a(x)$ and $b(x)$ at the beginning of iteration $j$ of Algorithm 12 are at most
$p_{x,j} = m - j$ and $p_{y,j} = m - k + 1$, respectively, for $j = 0, 1, \cdots, k-1$. Thus, we can
use $N_{x,s} = d_{x,s} + 1$ and $N_{y,s} = d_{y,s} + 1$ registers for interpolator$_{i,s}$ and $D_{x,j} = p_{x,j} + 1$
and $D_{y,j} = p_{y,j} + 1$ registers for polyDiv$_j$. These upper bounds help to reduce the
hardware cost of the proposed unfolded decoder architecture.
The decoding latency of the unfolded decoder architecture is the same as that of
the serial architecture, but its throughput is $fmk$ Mb/s, where $f$ is the clock
frequency, which is higher than that of the serial architecture.
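The register bounds and the throughput expression above can be checked numerically with a minimal sketch. The function names are hypothetical, and the values m = 8, k = 4 and a 400 MHz clock are assumed here for the (8, 8, 4) KK code over GF(2^8); under that assumption the product f·m·k agrees with the 12800 Mb/s reported for the unfolded decoder in Table 4.2.

```python
def interp_degree_bounds(s, k, m):
    """Degree bounds (d_x, d_y) at the beginning of interpolation iteration s."""
    if 0 <= s <= k:
        return s, 0
    if s <= m:
        return s, s - k
    return m, m + 1 - k


def interp_registers(s, k, m):
    """Register counts (N_x, N_y) for the interpolators of stage s."""
    dx, dy = interp_degree_bounds(s, k, m)
    return dx + 1, dy + 1


def polydiv_registers(j, k, m):
    """Register counts (D_x, D_y) for polyDiv stage j, 0 <= j <= k-1."""
    return (m - j) + 1, (m - k + 1) + 1


def unfolded_throughput_mbps(f_mhz, m, k):
    """Throughput f*m*k in Mb/s, with the clock frequency given in MHz."""
    return f_mhz * m * k


if __name__ == "__main__":
    m, k = 8, 4  # assumed parameters of the (8, 8, 4) KK code
    for s in range(m + 1):
        print("stage", s, "registers", interp_registers(s, k, m))
    print("throughput:", unfolded_throughput_mbps(400, m, k), "Mb/s")
```

Early stages need far fewer registers than the worst case, which is where the hardware saving of the unfolded architecture comes from.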
4.4 Efficient MV list Decoder Architecture
MV codes enable stronger error correction at the cost of higher computational com-
plexity. As shown in [48, Fig. 1], MV codes under list decoding enhance the average
decoding radius when packet rate is low. In this section, an efficient serial list de-
coder architecture for MV codes is proposed. Without loss of generality, we take
L = 2 as an example to simplify the presentation. The decoder architecture for a
different L is similar and can be easily obtained.
4.4.1 Serial list decoder architecture
A serial list decoder architecture for MV codes is shown in Fig. 4.9. When L = 2,
three multivariate linearized polynomials, f0(x, y1, y2), f1(x, y1, y2), and f2(x, y1, y2),
participate in interpolation according to Algorithm 10, where fi(x, y1, y2) = fi,x(x)+
fi,y1(y1) + fi,y2(y2). As shown in Fig. 4.9, three groups of coefficient registers, CR0,
CR1, and CR2, store the coefficients of linearized polynomial f0, f1, and f2, re-
spectively. For each group of coefficient registers, CX, CY1, and CY2 store the
coefficients of fi,x, fi,y1 , and fi,y2 , respectively. The interpolator units perform the
interpolation step and update corresponding coefficient registers according to Algo-
rithm 10. Once the interpolation step is finished, the factorization unit computes
two possible transmitted information vectors.
Figure 4.9: Serial list decoder architecture for MV codes
The coefficients of these multivariate linearized polynomials are elements in
GF($q^{ml}$), where $q = 2^h$. Each coefficient is represented by an ml-dimensional
vector over GF(q). As a result, each coefficient is represented by a z-bit binary vector
and stored in a z-bit register, where $z = mlh$.
The degree of each multivariate linearized polynomial may keep increasing during
the interpolation step. However, within the decoding radius, there is an upper bound
on the degree that each linearized polynomial can reach. For an l-dimension MV
code over GF($q^{ml}$), if a transmitted information vector can be correctly recovered,
then the maximal values of $\deg(f_{i,x})$, $\deg(f_{i,y_1})$, and $\deg(f_{i,y_2})$ are
$(l+t)L$, $(l+t)L - (k-1)$, and $(l+t)L - 2(k-1)$, respectively, where
$t < lL - L(L+1)\frac{k-1}{2m}$ is the dimension of error packets received. Let $N_0$,
$N_1$ and $N_2$ denote the numbers of z-bit coefficient registers for CX, CY1 and CY2,
respectively. Then, $N_0 = (l+t)L + 1$, $N_1 = (l+t)L - (k-1) + 1$,
$N_2 = (l+t)L - 2(k-1) + 1$ and $w = (N_0 + N_1 + N_2)z$.
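As a rough illustration of these sizing formulas, the sketch below evaluates z, N0, N1, N2 and w for one parameter set. The function name and the example values (l = 3, t = 1, L = 2, k = 3, m = 2, h = 2) are assumptions chosen only so that the decoding-radius condition on t holds; they are not taken from the implementation results.

```python
def mv_register_sizing(l, t, L, k, m, h):
    """Return (z, N0, N1, N2, w) for the L = 2 serial MV list decoder."""
    if L != 2:
        raise ValueError("this sketch only covers the L = 2 case described in the text")
    # Decoding-radius condition: t < l*L - L*(L+1)*(k-1)/(2*m)
    if not t < l * L - L * (L + 1) * (k - 1) / (2 * m):
        raise ValueError("t is outside the decoding radius")
    z = m * l * h                    # bits per coefficient register
    base = (l + t) * L               # maximal degree of f_{i,x}
    n0 = base + 1                    # registers in CX
    n1 = base - (k - 1) + 1          # registers in CY1
    n2 = base - 2 * (k - 1) + 1      # registers in CY2
    w = (n0 + n1 + n2) * z           # total bits per coefficient register group
    return z, n0, n1, n2, w


if __name__ == "__main__":
    print(mv_register_sizing(l=3, t=1, L=2, k=3, m=2, h=2))
```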
Similar to the KK decoders proposed in Section 4.3, all the finite field arithmetic
operations for the decoding of MV codes are performed assuming a normal basis.
Suppose an element $c = \sum_{i=0}^{ml-1} c_i \gamma^{[i]}$ is represented as
$(c_{ml-1}, \cdots, c_1, c_0)$ under the normal basis, where the $\gamma^{[i]}$'s constitute
a normal basis of GF($q^{ml}$) and $c_i \in$ GF(q). We also define $c^{[-i]} \triangleq g$,
where $g^{[i]} = c$. Then, under the normal basis, $c^{[-1]}$ is represented
as $(c_0, c_{ml-1}, \cdots, c_1)$.
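The cyclic-shift property above is what makes the $[\pm 1]$ operations cheap in hardware. A minimal sketch follows, assuming that $c^{[1]}$ denotes $c^q$ (the usual convention for linearized polynomials) and that coordinates are listed as $(c_{ml-1}, \cdots, c_1, c_0)$; the function names are hypothetical.

```python
def frobenius(vec):
    """Coordinates of c^[1]: each c_i moves from gamma^[i] to gamma^[i+1] (rotate left)."""
    return vec[1:] + vec[:1]


def frobenius_inverse(vec):
    """Coordinates of c^[-1]: (c_{n-1}, ..., c_1, c_0) -> (c_0, c_{n-1}, ..., c_1)."""
    return vec[-1:] + vec[:-1]


if __name__ == "__main__":
    c = ("c3", "c2", "c1", "c0")
    print(frobenius_inverse(c))
    print(frobenius(frobenius_inverse(c)) == c)
```

Both operations are pure rewiring of the coordinate vector, so no multiplier is needed for them.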
Figure 4.10: The architecture of interpolator0 for the proposed MV decoder
4.4.2 Efficient interpolator architecture for MV codes
The proposed interpolator for the serial MV list decoder is similar to that of the
KK decoders in Section 4.3. The interpolator architecture is shown in Fig. 4.10.
Interpolator$_i$ updates the coefficients of the multivariate linearized polynomial $f_i$ in
Algorithm 10. Take interpolator0 as an example. As shown in Fig. 4.10, interpolator0
consists of polyEvl, orderComp and three groups of PUUs. The functions of these
units are similar to those of an interpolator for the KK decoders in Section 4.3. As
shown in Fig. 4.10, coef0, coef1 and coef2 denote the coefficients of $f_0$, $f_1$ and $f_2$,
respectively. The architectures of polyEvl and orderComp are similar to those of the
KK decoders in Section 4.3. Let $f_{0,x} = \sum_{i=0}^{N_0-1} CEX(i)\, x^{[i]}$,
$f_{0,y_1} = \sum_{i=0}^{N_0-1} CEY1(i)\, y_1^{[i]}$ and
$f_{0,y_2} = \sum_{i=0}^{N_0-1} CEY2(i)\, y_2^{[i]}$. The architectures of polyEvl and
orderComp of interpolator0 when L = 2 are shown in Fig. 4.11 and Fig. 4.12, respectively,
where all the multipliers are based on normal basis and z is the number of bits used to
represent a coefficient.
Figure 4.11: Architecture of polyEvl for interpolator0 of the proposed MV decoder
Figure 4.12: Architecture of orderComp for interpolator0 of the proposed MV decoder
The PUU of interpolator0, shown in Fig. 4.13, updates the coefficient of $x^{[j]}$
for $f_{0,x}$. The linearized polynomial $f_0$ is updated in four different ways:
1) $f_0 = \Delta_1 f_0 + \Delta_0 f_1$; 2) $f_0 = \Delta_2 f_0 + \Delta_0 f_1$;
3) $f_0 = \Delta_0 f_0^{[1]} + \Delta_0^{[1]} f_0$ and 4) $f_0$ remains
the same. The proposed PUU in Fig. 4.13 is configured to implement any of these
updating operations with properly set control signals PS0, MS0 and KEEP0.
Figure 4.13: Architecture of PUU that updates $x^{[j]}$ for interpolator0 of the proposed
MV decoder
4.4.3 Efficient factorization architecture for MV codes
Hardware efficient matrix based LRR algorithm
An LRR algorithm is proposed in [47] to solve the factorization problem for the list
decoding of MV codes. For efficient hardware implementation of the factorization
algorithm, more details about the LRR algorithm should be derived.
In this chapter, assuming L = 2, we derive the expression of the linearized polynomial
$d_{\downarrow s}(x, Y)$ in each iteration of Algorithm 11. We denote $d_{\downarrow s}(x, Y)$
as $d^{(i)}(x, Y)$ during the computation of the information symbol $u_i$. Based on
$d^{(i)}(x, Y)$, the equation used to compute $u_i$ is also derived. Here, Lemma 1 is given
without proof since it is straightforward.
Lemma 1. Let $Y = \sum_{k=0}^{l} a_k x^{[k]}$ be an element in the ring $L_q[x]$, and
$\lambda \in$ GF(q), then
$(Y^{[1]} + \lambda x)^{[i]} \otimes (Y^{[1]} + \lambda x)^{[j]} = Y^{[1+i]} \otimes Y^{[1+j]} + \lambda^2 x^{[i+j]}$.
Let $Y_i$ denote $Y \otimes Y^{[i]}$. When L = 2, d(x, Y) in Eq. (4.4) has the following
general form:
$$\begin{aligned} d(x, Y) = {} & d_{0,0}x^{[0]} + \cdots + d_{0,n_0}x^{[n_0]} \\ & + d_{1,0}Y^{[0]} + \cdots + d_{1,n_1}Y^{[n_1]} \\ & + d_{2,0}(Y_0)^{[0]} + \cdots + d_{2,n_2}(Y_0)^{[n_2]}, \end{aligned} \tag{4.8}$$
where $n_0 = ml - 1$, $n_1 = n_0 - (k-1)$, and $n_2 = n_0 - 2(k-1)$. Then, $d^{(i)}(x, Y)$ has
the following general form:
$$\begin{aligned} d^{(i)}(x, Y) = {} & d^{(i)}_{0,0}x^{[0]} + \cdots + d^{(i)}_{0,n_0}x^{[n_0]} \\ & + d^{(i)}_{1,0}Y^{[0]} + \cdots + d^{(i)}_{1,n_1}Y^{[n_1]} \\ & + d^{(i)}_{2,0}(Y_i)^{[0]} + \cdots + d^{(i)}_{2,n_2}(Y_i)^{[n_2]}, \end{aligned} \tag{4.9}$$
and $d^{(0)}(x, Y) = d_{\downarrow s}(x, Y)$. The equation to solve for $u_0$ is
$d^{(0)}_{2,0}u_0^2 + d^{(0)}_{1,0}u_0 + d^{(0)}_{0,0} = 0$.
The coefficients of $d^{(i+1)}(x, Y)$ can be derived from $d^{(i)}(x, Y)$. Let
$D^{(i)}(x, Y) = d^{(i)}(x, (Y^{[1]} + u_i x))$, then
$$\begin{aligned} D^{(i)}(x, Y) = {} & d^{(i)}_{0,0}x^{[0]} + \cdots + d^{(i)}_{0,n_0}x^{[n_0]} \\ & + d^{(i)}_{1,0}Y^{[1]} + \cdots + d^{(i)}_{1,n_1}Y^{[n_1+1]} \\ & + d^{(i)}_{1,0}u_i x^{[0]} + \cdots + d^{(i)}_{1,n_1}u_i x^{[n_1]} \\ & + d^{(i)}_{2,0}(Y_{i+1})^{[1]} + \cdots + d^{(i)}_{2,n_2}(Y_{i+1})^{[n_2+1]} \\ & + d^{(i)}_{2,0}u_i^2 x^{[i]} + \cdots + d^{(i)}_{2,n_2}u_i^2 x^{[i+n_2]}. \end{aligned} \tag{4.10}$$
Eq. (4.10) can be simplified to be
$$\begin{aligned} D^{(i)}(x, Y) & = D^{(i)}_0(x) + D^{(i)}_1(Y) + D^{(i)}_2(Y_{i+1}) \\ & = D^{(i)}_{0,0}x^{[0]} + \cdots + D^{(i)}_{0,n_0}x^{[n_0]} \\ & \quad + D^{(i)}_{1,0}Y^{[1]} + \cdots + D^{(i)}_{1,n_1}Y^{[n_1+1]} \\ & \quad + D^{(i)}_{2,0}(Y_{i+1})^{[1]} + \cdots + D^{(i)}_{2,n_2}(Y_{i+1})^{[n_2+1]}, \end{aligned} \tag{4.11}$$
where $D_{0,j} = d^{(i)}_{0,j} + d^{(i)}_{1,j}u_i$ for $0 \le j < i$,
$D_{0,j} = d^{(i)}_{0,j} + d^{(i)}_{1,j}u_i + d^{(i)}_{2,j}u_i^2$ for $j \ge i$,
$D_{1,j} = d^{(i)}_{1,j}$ and $D_{2,j} = d^{(i)}_{2,j}$ for all j. Then, we have
$$d^{(i+1)}(x, Y) = D^{(i)}_{\downarrow s}(x, Y), \tag{4.12}$$
where s is the largest integer such that $D^{(i)}_0(x)$, $D^{(i)}_1(Y)$ and
$D^{(i)}_2(Y \otimes Y^{[i+1]})$ are divisible by $x^{[s]}$, $Y^{[s]}$ and
$(Y \otimes Y^{[i+1]})^{[s]}$, respectively. Once $d^{(i)}(x, Y)$ is determined, we obtain
an equation about $u_i$, $d^{(i)}(x, u_i x) = 0$.
It is interesting to illustrate the root pattern of the factorization algorithm for
MV codes. Since L = 2, $u_i$ is a root of the following i + 1 equations:
$$\begin{cases} d^{(i)}_{0,0} + d^{(i)}_{1,0}u_i = 0 \\ d^{(i)}_{0,1} + d^{(i)}_{1,1}u_i = 0 \\ \cdots \\ d^{(i)}_{0,i-1} + d^{(i)}_{1,i-1}u_i = 0 \\ d^{(i)}_{0,i} + d^{(i)}_{1,i}u_i + d^{(i)}_{2,0}u_i^2 = 0. \end{cases} \tag{4.13}$$
The first nonzero equation in Eq. (4.13) is used to solve $u_i$. The equation to derive
$u_0$ is $d^{(0)}_{2,0}u_0^2 + d^{(0)}_{1,0}u_0 + d^{(0)}_{0,0} = 0$.
Lemma 2. Consider a nonzero quadratic equation $f(u) = au^2 + bu + c = 0$, where
$a, b, c \in$ GF($q^{ml}$). Here $u \in$ GF(q), where $q = 2^h$. If f(u) = 0 has two distinct
roots, then $a \ne 0$ and $b \ne 0$. If f(u) = 0 has two identical roots, then $a \ne 0$ and
$b = 0$. If f(u) = 0 only has one root, then $a = 0$ and $b \ne 0$.
The proof of Lemma 2 is omitted here since it is very straightforward. The
root generation of the factorization process is shown in Fig. 4.14. Without loss
of generality, suppose the equation is $f_0(u_0) = 0$, derived from $d^{(0)}(x, Y)$, which
has only one root $u_0$. $d^{(1)}(x, Y)$ is computed based on $d^{(0)}(x, Y)$ and $u_0$. The
equation $f_1(u_1) = 0$ also has only one root $u_1$. The same root pattern repeats
until $d^{(i)}(x, Y)$ is computed. Suppose $f_i(u_i) = 0$ has two different roots $u_{i,0}$ and
$u_{i,1}$. As a result, $d_l^{(i+1)}(x, Y)$ and $d_r^{(i+1)}(x, Y)$ are computed based on $u_{i,0}$
and $u_{i,1}$, respectively. $d_l^{(i+1)}(x, Y)$ and $d_r^{(i+1)}(x, Y)$ are the linearized
polynomials used to derive the equations about $u_{i+1,0}$ and $u_{i+1,1}$, which are two
possible values of information symbol $u_{i+1}$, respectively. Then we give the following
two lemmas which are proved in Appendix A and B, respectively.
Figure 4.14: Root pattern
Lemma 3. Both equations $f_{i+1,0}(u_{i+1,0}) = 0$ and $f_{i+1,1}(u_{i+1,1}) = 0$ have only one
root, where $f_{i+1,0}(u_{i+1,0}) = 0$ and $f_{i+1,1}(u_{i+1,1}) = 0$ are derived from
$d_l^{(i+1)}(x, Y)$ and $d_r^{(i+1)}(x, Y)$, respectively.
Lemma 4. The equation $f_{i+j,0}(u_{i+j,0}) = 0$, derived from $d_l^{(i+j)}(x, Y)$, has only one
root for j > 1. The same is true for $d_r^{(i+j)}(x, Y)$.
Based on the above discussions, for L = 2, the two possible output information
vectors have the following general form
$$\begin{aligned} u^{(0)} & = \{u_0, u_1, \cdots, u_{i,0}, u_{i+1,0}, \cdots, u_{k-1,0}\} \\ u^{(1)} & = \{u_0, u_1, \cdots, u_{i,1}, u_{i+1,1}, \cdots, u_{k-1,1}\} \end{aligned} \tag{4.14}$$
where $0 \le i \le k-1$. It is also possible that factorization produces only one solution.
Then $u^{(0)} = u^{(1)}$. In this chapter, a matrix based LRR (M-LRR) algorithm that is
suitable for hardware implementation is proposed in Algorithm 13. Here, we assume
L = 2.
Algorithm 13: M-LRR algorithm
input : $d(x, Y)$, $n_0$, $n_1$, $n_2$
output: $u^{(0)}$, $u^{(1)}$
Initialization: $M^{(0)} = 0_{3 \times n_0}$, $M^{(1)} = 0_{3 \times n_0}$
for i = 0 to 2 do
    for j = 0 to $n_i$ do
        $M^{(0)}_{i,j} = M^{(1)}_{i,j} = d_{i,j}$
for i = 0 to k − 1 do
    $M^{(0)}$ = shiftPow($M^{(0)}$); $M^{(1)}$ = shiftPow($M^{(1)}$)
    $u^{(0)}_i$ = solver0($M^{(0)}$, i); $u^{(1)}_i$ = solver1($M^{(1)}$, i)
    $M^{(0)}$ = matrixUpdate($M^{(0)}$, $u^{(0)}_i$, i)
    $M^{(1)}$ = matrixUpdate($M^{(1)}$, $u^{(1)}_i$, i)
As shown in Algorithm 13, the coefficients of d(x, Y), which are the output
of interpolation, are first copied into the coefficient matrices $M^{(0)}$ and $M^{(1)}$. The
proposed M-LRR algorithm outputs at most two possible transmitted information
vectors. The shiftPow function is shown in Algorithm 14. The solver0 function in
Algorithm 13 is shown in Algorithm 15. The solver1 function is the same as solver0
except that it returns $r_1$ when f(u) = 0 has two different roots. The matrixUpdate
function shown in Algorithm 16 updates coefficient matrices based on Eq. (4.10).
The shiftPow function preprocesses the coefficient matrix so that at least one
of $M_{0,0}$, $M_{1,0}$ and $M_{2,0}$ is a nonzero coefficient, where M is the input matrix. Take
$M^{(0)}$ as an example. After preprocessing, the solver0 function derives the coefficients
of equation $f^{(0)}_i(u^{(0)}_i) = 0$ and solves this equation. It is possible that
$f^{(0)}_i(u^{(0)}_i) = 0$ may have two distinct roots. However, only one root is selected as
shown in Algorithm 15. The matrixUpdate function updates the coefficient matrix that will
be used to compute the next information symbol.
Algorithm 14: shiftPow
input : $M$
output: $M'$
Initialization: $M' = 0_{3 \times n_0}$
i = 0
for j = 0 to $n_0$ do
    if $M_{0,j} \ne 0$ or $M_{1,j} \ne 0$ or $M_{2,j} \ne 0$ then
        i = j
        break
for s = 0 to 2 do
    for j = i to $n_0$ do
        $M'_{s,j-i} = M^{[-i]}_{s,j}$
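A minimal software sketch of shiftPow is given below. It assumes, purely for illustration, that each matrix entry is stored as a tuple of normal-basis coordinates $(c_{ml-1}, \cdots, c_0)$ over GF(q), so that the $[-i]$ operation reduces to i cyclic right rotations; the function names and this data layout are not part of the proposed architecture.

```python
def rot_right(vec, times=1):
    """Apply c -> c^[-1] `times` times on normal-basis coordinates (c_{n-1}, ..., c_0)."""
    t = times % len(vec)
    return vec[-t:] + vec[:-t] if t else vec


def is_zero(vec):
    """A field element is zero when all its coordinates are zero."""
    return not any(vec)


def shift_pow(M):
    """Drop the leading all-zero columns of the 3-row matrix M and apply [-i] to the rest."""
    cols = len(M[0])
    zero = (0,) * len(M[0][0])
    i = 0
    for j in range(cols):
        if any(not is_zero(M[s][j]) for s in range(3)):
            i = j
            break
    out = [[zero] * cols for _ in range(3)]
    for s in range(3):
        for j in range(i, cols):
            out[s][j - i] = rot_right(M[s][j], i)
    return out


if __name__ == "__main__":
    z = (0, 0, 0)
    M = [[z, (1, 0, 0), z],
         [z, z, (0, 1, 1)],
         [z, z, z]]
    for row in shift_pow(M):
        print(row)
```

With this layout the whole function is column selection plus rewiring, which is consistent with the text's later remark that the unit can be built from combinational logic.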
Algorithm 15: solver0
input : $M$, $i$
output: $u_i$
Initialization: a = 0, b = 0, c = 0, $a, b, c \in$ GF($q^{ml}$)
pow = 2
for j = 0 to i − 1 do
    if $M_{0,j} \ne 0$ or $M_{1,j} \ne 0$ then
        a = 0, b = $M_{1,j}$, c = $M_{0,j}$
        pow = 1
        break
if pow == 2 then
    a = $M_{2,i}$, b = $M_{1,i}$, c = $M_{0,i}$
$f(u) = au^2 + bu + c$
solve equation f(u) = 0, $u \in$ GF(q) = {0, 1, · · · , q − 1}
If f(u) = 0 has only one root r, then $u_i = r$
If f(u) = 0 has two roots $r_0$ and $r_1$, where $r_0 < r_1$, then $u_i = r_0$
Algorithm 16: matrixUpdate
input : $M$, $u_i$, $i$
output: $M'$
Initialization: $M' = 0_{3 \times n_0}$
for j = 1 to i − 1 do
    $M'_{0,j-1} = (M_{0,j} + u_i M_{1,j})^{[-1]}$
for j = i to $n_0$ do
    $M'_{0,j-1} = (M_{0,j} + u_i M_{1,j} + u_i^2 M_{2,j})^{[-1]}$
for j = 0 to $n_0$ do
    $M'_{1,j} = M^{[-1]}_{1,j}$
    $M'_{2,j} = M^{[-1]}_{2,j}$
Efficient implementation of the M-LRR algorithm
Based on the proposed M-LRR algorithm, a parallel factorization architecture for
list size L = 2 is proposed in Fig. 4.15. The proposed factorization architecture
computes two possible transmitted information vectors u(0) and u(1) at the same
time. It takes at most n0 cycles to compute all possible information symbols. As
shown in Fig. 4.15, after the interpolation step is finished, the output of polySel is
loaded into the coefficient matrix registers $M^{(0)}$ and $M^{(1)}$. The equation coefficients
selector (ECS) unit computes the coefficients of f(u) as described in Algorithm 15.
The SV0 and SV1 units compute corresponding roots based on coefficients output
from the ECS unit. The coefficient matrix update (CMU) unit implements both the
shiftPow and matrixUpdate functions specified in Algorithms 14 and 16. Both ECS
and CMU can be implemented with combinational logic.
Figure 4.15: Architecture of factorization for MV decoder (L = 2)
The quadratic equation f(u) = 0 over GF($q^{ml}$) is solved by enumeration. The
architecture of the proposed SV0 unit is shown in Fig. 4.16. As shown in Fig. 4.16,
the proposed SV0 consists of q equation checkers. The checker-i unit checks whether
f(i) = 0, where i ∈ GF(q). It outputs zero if f(i) = 0. The root selector (RS) unit
chooses the final root. The root selection rule is specified in Algorithm 15. The
architecture of SV1 is almost the same as SV0 except the slight difference in the
RS unit. The proposed SV0 architecture is suitable for small or moderate q. Both
hardware cost and critical path will increase dramatically when q is large.
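The enumeration idea behind SV0 and SV1 can be sketched in software as follows. This is only an illustration under stated assumptions: it uses the small field GF(2^4) with the primitive polynomial x^4 + x + 1 in a polynomial basis (not the normal basis used by the proposed hardware), takes GF(4) as the subfield in which roots are searched, and orders roots by their integer labels. All function names are hypothetical.

```python
def gf16_mul(a, b):
    """Multiply two elements of GF(2^4) defined by x^4 + x + 1 (polynomial basis)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce modulo x^4 + x + 1
    return r


def gf16_pow4(u):
    """Compute u^4 by squaring twice."""
    sq = gf16_mul(u, u)
    return gf16_mul(sq, sq)


# Elements of the subfield GF(4) are exactly those satisfying u^4 = u.
GF4 = [u for u in range(16) if gf16_pow4(u) == u]


def checker(a, b, c, u):
    """Evaluate f(u) = a*u^2 + b*u + c; a zero output means u is a root."""
    return gf16_mul(a, gf16_mul(u, u)) ^ gf16_mul(b, u) ^ c


def solve(a, b, c, pick_larger=False):
    """Enumerate all u in GF(4); return the smaller root (SV0-like) or the larger (SV1-like)."""
    roots = [u for u in GF4 if checker(a, b, c, u) == 0]
    if not roots:
        return None
    return roots[-1] if pick_larger else roots[0]


if __name__ == "__main__":
    print("GF(4) inside GF(16):", GF4)
    print("two distinct roots:", solve(1, 1, 1), solve(1, 1, 1, pick_larger=True))
    print("single root (a = 0):", solve(0, 1, 6))
    print("repeated root (b = 0):", solve(1, 0, 7))
```

The three calls correspond to the three cases of Lemma 2. In hardware the q checks run in parallel, which is why cost grows quickly with q.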
It needs at most $n_0 = ml - 1$ cycles to finish the factorization step. As a result,
the worst case throughput is $\frac{fhk}{\max(N, n_0)}$ Mb/s, where f is the clock frequency.
Figure 4.16: Architecture of SV0
Table 4.2: Hardware implementation results comparison.
code               | KK codes                                                                   | MV codes
architecture type  | rank metric decoder [43] | proposed serial       | proposed unfolded      | proposed serial
field              | GF(2^8)    | GF(2^16)    | GF(2^8)   | GF(2^16)  | GF(2^8)    | GF(2^16)  | GF(4^6)
gate count         | 71.1k      | 421.5k      | 40.2k     | 319.8k    | 274.8k     | 4390.9k   | 341.8k
Frequency (MHz)    | 455        | 276         | 400       | 333       | 400        | 333       | 322
Latency (ms)       | 0.47       | 2.72        | 0.053     | 0.13      | 0.053      | 0.13      | 0.047
Throughput (Mb/s)  | 214        | 134         | 1066      | 1777      | 12800      | 42666     | 193
Gate efficiency    | 3.0        | 0.32        | 26.5      | 5.56      | 46.5       | 9.72      | 0.56
4.5 Implementation Results
The two decoder architectures proposed in Section 4.3 are implemented for the two
KK codes investigated in [43]: an (8, 8, 4) KK code over GF($2^8$) and a (16, 16, 8) KK
code over GF($2^{16}$). The technology we used is a Free PDK 45nm process [73], which is
the same as that used in [43]. In Table 4.2, we compare the implementation results
of our decoder architectures with those in [43] in terms of gate count, frequency,
throughput and latency. The gate count is measured in two-input one-output
NAND gates. The throughput of the proposed serial KK decoder architecture over
GF($2^8$) and GF($2^{16}$) is 4.9 and 13.2 times that of the designs in [43], respectively,
while the gate count is only 56% and 76% of that of the designs in [43], respectively.
The unfolded decoder architecture achieves a throughput of 12.5Gb/s and
41.6Gb/s, respectively. The latency of the rank metric decoders in [43] over GF($2^8$)
and GF($2^{16}$) is 8.6 and 20.9 times that of the proposed serial and unfolded architectures.
The throughput per thousand NAND gates (throughput divided by the gate
count in thousands) of the proposed serial and unfolded KK decoder architectures
is much higher than that of the architectures in [43]. The throughput per thousand NAND
gates of the unfolded architecture is higher than that of the serial architecture,
because the upper bounds on $N_{x,s}$, $N_{y,s}$, $D_{x,j}$, and $D_{y,j}$ above reduce the hardware
cost of coefficient registers, interpolators and the polyDiv units of the unfolded
architecture.
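The gate-efficiency row of Table 4.2 can be reproduced from the throughput and gate-count rows with a short sketch. The dictionary keys are labels invented here for readability; the numbers are those listed in Table 4.2.

```python
# (throughput in Mb/s, gate count in thousands of NAND gates), taken from Table 4.2
DESIGNS = {
    "rank metric [43], GF(2^8)": (214, 71.1),
    "rank metric [43], GF(2^16)": (134, 421.5),
    "serial KK, GF(2^8)": (1066, 40.2),
    "serial KK, GF(2^16)": (1777, 319.8),
    "unfolded KK, GF(2^8)": (12800, 274.8),
    "unfolded KK, GF(2^16)": (42666, 4390.9),
    "serial MV, GF(4^6)": (193, 341.8),
}


def gate_efficiency(throughput_mbps, gates_k):
    """Throughput per thousand NAND gates."""
    return throughput_mbps / gates_k


if __name__ == "__main__":
    for name, (tp, gates) in DESIGNS.items():
        print(f"{name:28s} {gate_efficiency(tp, gates):6.2f}")
```

Small differences in the last digit relative to the table come from rounding versus truncation.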
The proposed serial list decoder architecture is also implemented for a 2-dimensional
MV code over GF($4^6$), where k = 3, l = 3, L = 2. As shown in Table 4.2, the
proposed MV decoder achieves a throughput of 193Mb/s at the cost of 341.8k gates.
For KK and MV codes, the code parameters are restricted by the size of the finite
field. The decoding complexity of long KK and MV codes may increase significantly
when the field size is large. In this chapter, we focus on KK and MV codes with
small parameters to reduce complexity. However, for practical applications, the
packet size is usually on the order of several thousand bytes. The proposed
decoder architectures also work for long codes based on Cartesian products of KK
or MV codes [43,46]. For example, suppose the packet length is 1000 bytes; we can
use the Cartesian product of 250 (8, 8, 4) KK codes over GF($2^8$) to get a code of
length 2000 bytes. In this case, the evaluations of 250 polynomials are placed into
a single packet. The decoding of long packets is just the decoding of these 250 KK
codes. Based on the hardware complexity limit and throughput requirement, we
can use one or multiple (8, 8, 4) KK decoders.
4.6 Conclusion
In this chapter, based on a generalized interpolation algorithm and a reformulated
right division algorithm, an area efficient serial decoder architecture and a high
throughput unfolded decoder architecture are proposed for KK codes. The implementation
results show that the proposed decoder architectures are much more efficient than
previously proposed rank metric decoder architectures. A serial list decoder
architecture for MV codes is also proposed. An M-LRR algorithm is proposed for efficient
implementation of the factorization step of the decoding of MV codes. The synthesis
results demonstrate that the proposed list decoder architecture for MV codes
is feasible for hardware implementation. Future work includes developing efficient
decoder architectures for fields with large sizes to achieve a good tradeoff between
error correction capability and hardware complexity.
Chapter 5
An Efficient List Decoder
Architecture for Polar Codes
5.1 Introduction
Polar codes, recently introduced by Arıkan [18], are a significant breakthrough in
coding theory. It is proved that polar codes can achieve the channel capacity of any
discrete or continuous memoryless channel [18, 19]. Polar codes can be efficiently
decoded by the low-complexity successive cancelation (SC) decoding algorithm [18]
with a complexity of O(N logN), where N is the block length. To approach the
channel capacity using the SC algorithm, polar codes require very large code block
length (for example, $N > 2^{20}$ [20]), which is impractical in many applications.
For short or moderate length, the error performance of polar codes under the SC
algorithm is worse than that of Turbo or low-density parity-check (LDPC) codes [21].
Considerable effort [21–28] has already been devoted to the improvement of the
error-correction performance of polar codes with short or moderate lengths. An SC list
(SCL) decoding algorithm was proposed recently in [21], which performs better
than the SC algorithm and performs almost the same as a maximum-likelihood
(ML) decoder [21]. In [22–24], the cyclic redundancy check (CRC) is used to pick
the output codeword from L candidates, where L is the list size. The CRC-aided
SCL algorithm performs much better than the SCL algorithm at the expense of
negligible loss in code rate.
In terms of hardware implementations of the SC algorithm, an efficient semi-
parallel SC decoder was proposed in [20], where resource sharing and semi-parallel
processing were used to reduce the hardware complexity. An overlapped computa-
tion method and a pre-computation method were proposed in [29] to improve the
throughput and to reduce the decoding latency of SC decoders. Compared to the
semi-parallel decoder architecture in [20], the pre-computation based decoder archi-
tecture [29] can double the throughput. A simplified SC decoder for polar codes,
proposed in [30], reduces the decoding latency by more than 88% for a rate 0.7 polar
code with length $2^{18}$.
The investigation of efficient list decoder architectures for polar codes is moti-
vated by improved error performance of the SCL and CA-SCL algorithms, especially
for polar codes with short or moderate lengths. The tree search list decoder architec-
ture for the SCL algorithm proposed in [31] is the first list decoder architecture for
polar codes in the literature to the best of our knowledge. In this chapter, we pro-
pose the first hardware implementation of the CA-SCL algorithm to the best of our
knowledge. Based on both algorithmic and architectural improvements, our decoder
architecture achieves better error performance and higher area efficiency compared
with the decoder architecture in [31]. Specifically, the major contributions of this
work are:
1. Message memories account for a significant fraction of an SC or SCL de-
coder [20, 31]. In this chapter, an area efficient message memory architecture
is proposed. Besides, a new compression method for the channel messages is
used to reduce the area of the proposed decoder architecture.
2. An efficient processing unit (PU) is proposed. For the proposed list decoder
architecture, a fine grained PU profiling (FPP) algorithm is proposed to de-
termine the minimum quantization size of each input message for each PU so
that there is no message overflow. By using the quantization size generated
by the FPP algorithm for each PU, the overall area of all PUs is reduced.
3. An efficient scalable path pruning unit (PPU) is proposed to control the copy-
ing of decoding paths. Based on the proposed memory architecture and the
scalable PPU, our list decoder architecture is suitable for large list sizes.
4. A low-complexity direct selection scheme is proposed for the CA-SCL algo-
rithm when a strong CRC is used (e.g. CRC32). The proposed direct selection
scheme simplifies the selection of the final output data word.
5. For a (1024, 512) rate-1/2 polar code, the proposed list decoder architecture
is implemented for list sizes L = 2 and 4 under a 90nm CMOS
technology. Compared with the decoder architecture in [31] synthesized under
the same technology, our decoder achieves 1.24 to 1.83 times the area efficiency
(throughput normalized by area). Besides, the proposed CA-SCL decoder has
better error performance compared with the SCL decoder in [31].
The rest of this chapter is organized as follows. In Section 5.2, polar codes as
well as the SCL and CA-SCL algorithms are briefly reviewed. Two improvements
of the CA-SCL algorithm are discussed in Section 5.3. The proposed list decoder
architecture is described in Section 5.4. Section 5.5 shows the implementation and
comparison results of the proposed list decoder architecture. The conclusions are
drawn in Section 5.6.
5.2 Polar Codes and Its CA-SCL Algorithm
5.2.1 Polar Codes
A generator matrix of a polar code is an N × N matrix $G = B_N F^{\otimes n}$, where
$N = 2^n$, $B_N$ is the bit reversal permutation matrix [18], and
$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Here $\otimes n$ denotes the
nth Kronecker power and $F^{\otimes n} = F \otimes F^{\otimes (n-1)}$. Let
$u_0^{N-1} = (u_0, u_1, \cdots, u_{N-1})$ denote the data bit sequence and
$x_0^{N-1} = (x_0, x_1, \cdots, x_{N-1})$ the corresponding encoded bit
sequence, then $x_0^{N-1} = u_0^{N-1} G$. The indices of the data bit sequence
$u_0^{N-1}$ are divided into two sets: the information bits set $\mathcal{A}$ contains K
indices and the frozen bits set $\mathcal{A}^c$ contains N − K indices.
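The construction of G and the encoding rule can be illustrated with a short, unoptimized sketch in pure Python. The function names are hypothetical, and dense matrices are used only for clarity; a practical encoder would use the butterfly structure instead.

```python
F = [[1, 0],
     [1, 1]]


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of i (n >= 1)."""
    return int(format(i, f"0{n}b")[::-1], 2)


def generator_matrix(n):
    """G = B_N F^{(x)n}: row i of G is row bit_reverse(i) of the n-th Kronecker power of F."""
    Fn = F
    for _ in range(n - 1):
        Fn = kron(F, Fn)
    return [Fn[bit_reverse(i, n)] for i in range(1 << n)]


def encode(u, n):
    """Compute x = u G over GF(2)."""
    G = generator_matrix(n)
    N = 1 << n
    return [sum(u[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]


if __name__ == "__main__":
    for row in generator_matrix(2):
        print(row)
    print(encode([1, 1, 0, 0], 2))
```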
5.2.2 SCL and CA-SCL Algorithms
List decoding was applied to the SC algorithm in [21] and the resulting SCL
algorithm outperforms the SC algorithm. For a list size L, the SCL algorithm keeps at
most L decoding paths and outputs L possible data words
$\hat{u}_{0,0}^{N-1}, \hat{u}_{1,0}^{N-1}, \cdots, \hat{u}_{L-1,0}^{N-1}$,
where $\hat{u}_{l,0}^{N-1} = (\hat{u}_{l,0}, \hat{u}_{l,1}, \cdots, \hat{u}_{l,N-1})$. A low
complexity state copying scheme was proposed in [31] to simplify the copying process when
a decoding path needs to be duplicated.
Algorithm 17: SCL algorithm [21]
input : n, the received channel message $y_0^{N-1}$
output: $\hat{u}_0^{N-1}$
for l = 0 to L − 1 do
    for β = 0 to N − 1 do
        $P_{l,0}[\beta][s] = \Pr(y_\beta | s)$, s = 0, 1
    for λ = 0 to n do $r_l[\lambda] = 0$
for i = 0 to N − 1 do
    for λ = $\phi^{(i)}$ to n − 1 do $r_l[\lambda] = l$
    foreach survived decoding path l do
        metricComp(l, i)
    if $i \in \mathcal{A}^c$ then
        foreach survived decoding path l do
            $\hat{u}_{l,i} = C_{l,n}[0][i \bmod 2] = 0$
    else
        pathPruning($P_{0,n}, \cdots, P_{L-1,n}$)
    if i mod 2 == 1 then
        foreach survived decoding path l do
            pUpdate(l, n, i)
Algorithm 18: metricComp(l, i) [21]
input : l, i
determine $(b_n^{(i)}, b_{n-1}^{(i)}, \cdots, b_1^{(i)})$ and $\phi^{(i)}$
for λ = $\phi^{(i)}$ to n do
    for k = 0 to $2^{n-\lambda} - 1$ do
        if $b_\lambda^{(i)} = 1$ and $\lambda = \phi^{(i)}$ then
            $s = C_{l,\lambda}[k][0]$
            $P_{l,\lambda}[k][u] = G(P_{r_l[\lambda-1],\lambda-1}[2k], P_{r_l[\lambda-1],\lambda-1}[2k+1], s)$
            $= \frac{1}{2} P_{r_l[\lambda-1],\lambda-1}[2k][u \oplus s] \cdot P_{r_l[\lambda-1],\lambda-1}[2k+1][u]$ for $u \in \{0, 1\}$
        else
            $P_{l,\lambda}[k][u] = F(P_{l,\lambda-1}[2k], P_{l,\lambda-1}[2k+1])$
            $= \sum_{u'=0}^{1} \frac{1}{2} P_{l,\lambda-1}[2k][u \oplus u'] \cdot P_{l,\lambda-1}[2k+1][u']$ for $u \in \{0, 1\}$
For $l = 0, 1, \cdots, L-1$ and $\lambda = 0, 1, \cdots, n$, let $P_{l,\lambda}$ be an array with
$2^{n-\lambda}$ elements: $P_{l,\lambda}[j]$ contains two messages $P_{l,\lambda}[j][0]$ and
$P_{l,\lambda}[j][1]$ for $j = 0, 1, \cdots, 2^{n-\lambda} - 1$. $C_{l,\lambda}$
has the same structure as $P_{l,\lambda}$: $C_{l,\lambda}[j]$ contains two binary partial sums
$C_{l,\lambda}[j][0]$ and $C_{l,\lambda}[j][1]$ for $j = 0, 1, \cdots, 2^{n-\lambda} - 1$. The SCL
algorithm with low complexity state copying [31] is reformulated in Algorithm 17. For the
decoding of $u_i$, the SCL algorithm can be divided into the following parts:
• For each surviving decoding path l, compute the path metrics $P_{l,n}[0][0]$ and
$P_{l,n}[0][1]$ using the function metricComp(l, i) shown in Algorithm 18.
For $i = 1, 2, \cdots, N-1$, let $(b_n^{(i)}, b_{n-1}^{(i)}, \cdots, b_1^{(i)})$ denote the
binary representation of index i, where $i = \sum_{j=0}^{n-1} 2^j b_{n-j}^{(i)}$.
$\phi^{(i)}$ ($1 \le \phi^{(i)} \le n$) in Algorithm 18 is the
largest integer such that $b^{(i)}_{\phi^{(i)}} = 1$. When i = 0, $\phi^{(i)} = 1$. Based on the
recursive algorithm for computing path metric in [21] and the low complexity
state copying algorithm in [31], the path metric computation is formulated in a
non-recursive way in Algorithm 18, where $r_l = (r_l[n-1], r_l[n-2], \cdots, r_l[0])$ is
the message updating reference index array for decoding path l. For decoding
path l, $r_l[0] \equiv 0$, while all other elements are initialized with 0. Two types of
basic operations, denoted as F and G operations, respectively, are employed
in Algorithm 18.
• If $u_i$ is a frozen bit, then for each decoding path l the decoded bit is
$\hat{u}_{l,i} = 0$, and decoding path l carries on with $\hat{u}_{l,i} = 0$. If $u_i$ is an
information bit, decoding path l ($l = 0, 1, \cdots, L-1$) splits into two decoding paths
with corresponding path metrics being $P_{l,n}[0][0]$ and $P_{l,n}[0][1]$, respectively.
There are at most 2L paths after splitting, and 2L associated path metrics. The pathPruning
function in Algorithm 17 finds the L most reliable decoding paths based on
their corresponding path metrics.
• For each of the L surviving decoding paths, the pUpdate(l, n, i) function shown
in Algorithm 19 [21] updates the partial sum matrices that will be used in the
following path metric computation.
We make several observations about the path metric computation:
• When i = 0, $P_{l,1}, \cdots, P_{l,n}$ are updated in serial, and only the F computation
is employed.
• For i > 0, $P_{l,\phi^{(i)}}, \cdots, P_{l,n}$ are updated in serial. The G computation is used
when computing $P_{l,\phi^{(i)}}$, while the F computation is used for the other
probability message arrays.
• The computation of $P_{l,\phi^{(i)}}$ is based on $P_{r_l[\phi^{(i)}-1],\phi^{(i)}-1}$, while the
computation of $P_{l,\lambda}$ ($\lambda > \phi^{(i)}$) is based on $P_{l,\lambda-1}$.
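The definition of $\phi^{(i)}$ and the observations above can be summarized in a small sketch. Since $b_n^{(i)}$ is the least significant bit of i under the stated indexing, $\phi^{(i)}$ equals n minus the number of trailing zeros of i; this reading, and the function names, are assumptions made for illustration.

```python
def phi(i, n):
    """Largest index t in 1..n with b_t^(i) = 1, where b_n is the least significant bit of i."""
    if i == 0:
        return 1
    trailing_zeros = (i & -i).bit_length() - 1
    return n - trailing_zeros


def update_schedule(i, n):
    """List the (lambda, operation) pairs used when decoding u_i."""
    start = phi(i, n)
    schedule = []
    for lam in range(start, n + 1):
        # G is used only at stage phi(i) and only when i > 0; F is used everywhere else.
        op = "G" if (i > 0 and lam == start) else "F"
        schedule.append((lam, op))
    return schedule


if __name__ == "__main__":
    n = 3
    for i in range(1 << n):
        print(i, phi(i, n), update_schedule(i, n))
```

Odd indices only touch the last stage, which is why most of the message updates are concentrated in a few bit positions.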
Algorithm 19: pUpdate(l, λ, i) [21]
input : l, λ, i
if λ == 0 then return
$j = \lfloor i/2 \rfloor$
for β = 0 to $2^{n-\lambda} - 1$ do
    $C_{l,\lambda-1}[2\beta][j \bmod 2] = C_{l,\lambda}[\beta][0] \oplus C_{l,\lambda}[\beta][1]$
    $C_{l,\lambda-1}[2\beta + 1][j \bmod 2] = C_{l,\lambda}[\beta][1]$
if j mod 2 == 1 then pUpdate(l, λ − 1, j)
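A single-path software sketch of Algorithm 19 is given below. The path index l is dropped, the nested-list layout of C and the driver function are assumptions introduced for illustration, and the driver simply feeds known bits instead of decoded ones. With this setup the level-0 partial sums after the last bit coincide with the encoded sequence $u_0^{N-1} G$, which gives a convenient sanity check.

```python
def new_partial_sums(n):
    """C[lam] holds 2^(n-lam) entries, each a pair of binary partial sums."""
    return [[[0, 0] for _ in range(2 ** (n - lam))] for lam in range(n + 1)]


def p_update(C, n, lam, i):
    """Propagate partial sums from level lam down towards level 0 (Algorithm 19, one path)."""
    if lam == 0:
        return
    j = i // 2
    for beta in range(2 ** (n - lam)):
        C[lam - 1][2 * beta][j % 2] = C[lam][beta][0] ^ C[lam][beta][1]
        C[lam - 1][2 * beta + 1][j % 2] = C[lam][beta][1]
    if j % 2 == 1:
        p_update(C, n, lam - 1, j)


def encode_via_partial_sums(u, n):
    """Feed bits u_0..u_{N-1} as in Algorithm 17 and read the level-0 partial sums."""
    C = new_partial_sums(n)
    for i, bit in enumerate(u):
        C[n][0][i % 2] = bit
        if i % 2 == 1:
            p_update(C, n, n, i)
    return [C[0][beta][0] for beta in range(2 ** n)]


if __name__ == "__main__":
    print(encode_via_partial_sums([1, 1, 0, 1], 2))
```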
$$G(P_{r_l[\lambda-1],\lambda-1}[2k], P_{r_l[\lambda-1],\lambda-1}[2k+1], s) = P_{r_l[\lambda-1],\lambda-1}[2k][u \oplus s] + P_{r_l[\lambda-1],\lambda-1}[2k+1][u] \tag{5.1}$$
$$F(P_{l,\lambda-1}[2k], P_{l,\lambda-1}[2k+1]) = \max{}^{*}\big(P_{l,\lambda-1}[2k][u] + P_{l,\lambda-1}[2k+1][0],\; P_{l,\lambda-1}[2k][u \oplus 1] + P_{l,\lambda-1}[2k+1][1]\big) \tag{5.2}$$
$$F(P_{l,\lambda-1}[2k], P_{l,\lambda-1}[2k+1]) = \max\big(P_{l,\lambda-1}[2k][u] + P_{l,\lambda-1}[2k+1][0],\; P_{l,\lambda-1}[2k][u \oplus 1] + P_{l,\lambda-1}[2k+1][1]\big) \tag{5.3}$$
The path pruning function, pathPruning, finds the L most reliable paths, $a_0$,
$a_1, \cdots, a_{L-1}$, and their corresponding decoded bits, $c_0, c_1, \cdots, c_{L-1}$,
based on the path metrics. The path metrics of the surviving L decoding paths are the L
largest ones among 2L input metrics. Once the surviving decoding paths are found, the partial
sums and the reference index array of decoding path l are copied from decoding path
$a_l$. The partial sum computation of decoding path l is carried on with the binary
input $c_l$.
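A software view of this selection is sketched below: all 2L candidate metrics are ranked and the L largest are kept, returning the parent path $a_l$ and the decoded bit $c_l$ of each survivor. The sort-based selection and the function name are illustrative assumptions; they say nothing about how the hardware path pruning unit is organized.

```python
def path_pruning(metrics, L):
    """metrics[l] = (metric if bit 0 is appended, metric if bit 1 is appended).

    Returns a list of (a, c) pairs: parent path index and decoded bit of each survivor,
    ordered from most to least reliable.
    """
    candidates = [(metrics[l][b], l, b) for l in range(len(metrics)) for b in (0, 1)]
    # Larger metric means more reliable; Python's sort is stable, so ties keep input order.
    candidates.sort(key=lambda cand: cand[0], reverse=True)
    return [(l, b) for _, l, b in candidates[:L]]


if __name__ == "__main__":
    metrics = [(-1.0, -3.0), (-0.5, -2.0)]  # log-domain metrics of L = 2 paths
    print(path_pruning(metrics, L=2))
```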
The pruning scheme in this chapter and the path pruning scheme in [28] both
try to eliminate decoding paths that are less reliable. However, there are still some
differences:
• The pruning scheme in [28] is used for the successive cancelation stack (SCS)
decoding algorithm as well as the SCH decoding algorithm, which is a hybrid
of the SCL and SCS decoding algorithms, whereas our pruning scheme is used for
the SCL algorithm.
• For the SCL algorithm, suppose there are L decoding paths before the decoding
of $u_i$; then the metrics of the 2L expanded decoding paths are computed. The
pruning scheme in this chapter finds the L largest metrics out of the 2L metrics
and keeps their corresponding decoding paths. For the pruning scheme in [28], a
path is deleted if its path metric is smaller than a dynamic threshold,
$a_i - \ln(\tau)$, where $a_i$ is the largest metric of the candidate paths and
τ is a configuration parameter.
• For the path pruning scheme in [28], the number of deleted paths is not fixed
and depends on the configuration parameter τ , while the number of deleted
5.3. TWO IMPROVEMENTS OF THE CA-SCL ALGORITHM
paths is always L for the pruning scheme in this chapter.
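The difference between the two pruning rules listed above can be illustrated with a minimal Python sketch. The function names and the example metrics are hypothetical; the threshold rule follows the description of [28] given in the list (delete a path whose metric is below the largest metric minus ln τ).

```python
import math


def prune_top_l(metrics, L):
    """Scheme of this chapter: keep the indices of the L largest metrics."""
    order = sorted(range(len(metrics)), key=lambda k: metrics[k], reverse=True)
    return sorted(order[:L])


def prune_threshold(metrics, tau):
    """Scheme described for [28]: keep paths whose metric is not smaller
    than max(metrics) - ln(tau)."""
    threshold = max(metrics) - math.log(tau)
    return [k for k, m in enumerate(metrics) if m >= threshold]


metrics = [-1.0, -5.0, -0.5, -2.0]    # 2L = 4 expanded paths (made-up values)
print(prune_top_l(metrics, 2))        # always L survivors
print(prune_threshold(metrics, 1.5))  # number of survivors depends on tau
print(prune_threshold(metrics, 10))
```

With these example metrics the fixed-size rule always keeps two paths, while the threshold rule keeps one path for τ = 1.5 and three paths for τ = 10.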
The F and G operations in Algorithm 18 are in the probability domain. They can
also be performed in the logarithm domain [23]. For u ∈ {0, 1}, the resulting
logarithm domain G and F computations are shown in Eq. (5.1) and Eq. (5.2),
respectively, where $\max^{*}(x, y) = \max(x, y) + \log(1 + e^{-|x-y|})$. Since
the correction term $\log(1 + e^{-|x-y|})$ is at most log 2, max∗(x, y) can also
be approximated with max(x, y), resulting in the approximated F computation in
Eq. (5.3).
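As an illustration of Eqs. (5.1) to (5.3), the following Python sketch evaluates the G computation and both variants of the F computation on pairs of LLMs. The function names and the representation of each message as a two-element list indexed by u are assumptions made for this example.

```python
import math


def max_star(x, y):
    """max*(x, y) = max(x, y) + log(1 + exp(-|x - y|))."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def g_op(p_a, p_b, s):
    """Eq. (5.1): p_a and p_b are LLM pairs indexed by u, s is the partial sum bit."""
    return [p_a[u ^ s] + p_b[u] for u in (0, 1)]


def f_op(p_a, p_b, exact=True):
    """Eq. (5.2) when exact is True, otherwise the approximation of Eq. (5.3)."""
    combine = max_star if exact else max
    return [combine(p_a[u] + p_b[0], p_a[u ^ 1] + p_b[1]) for u in (0, 1)]


p_a, p_b = [1.0, 0.0], [0.0, 2.0]
print(g_op(p_a, p_b, 0), g_op(p_a, p_b, 1))
print(f_op(p_a, p_b, exact=False), f_op(p_a, p_b, exact=True))
```

The exact and approximated F outputs differ by at most log 2 per entry, which is the size of the correction term that Eq. (5.3) drops.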
In [22], the performance of the SCL algorithm is further improved by the adoption
of CRC, which helps to pick the right path from the L possible decoded data words.
In terms of the fixed point implementation, the CA-SCL algorithm is quite sensitive
to saturation. For two decoding paths, it is hard to decide which is better if the
metrics of both paths are saturated. In order to avoid message saturation, a non-
uniform quantization scheme is proposed in [31]. If the channel messages (Pl,0) are
all quantized with t bits, all the log-likelihood messages (LLMs) of Pl,λ need to be
quantized with t+ λ bits in order to avoid saturation.
5.3 Two Improvements of the CA-SCL Algorithm
In this chapter, two improvements of the CA-SCL algorithm are proposed. Firstly,
for the i-th received bit yi, there are two likelihoods, Pr{yi|0} and Pr{yi|1}. Suppose
Pr{yi|m} (m ∈ {0, 1}) is the smaller one among the two likelihoods. For j ∈ {0, 1},
two log-likelihood messages (LLMs) are defined as
$$P_{l,0}[i][j] = \log \frac{\Pr\{y_i|j\}}{\Pr\{y_i|m\}}. \tag{5.4}$$
Thus one of the LLMs is always 0, and the other is always non-negative. For the
proposed list decoder, only the non-negative LLM and its corresponding binary
index s are stored. As shown in Fig. 5.1, Msg denotes the stored non-negative LLM,
and its corresponding bit index is s. When s = 0, Pl,0[i][0] = Msg, Pl,0[i][1] = 0.
When s = 1, Pl,0[i][0] = 0, Pl,0[i][1] = Msg. If t bits are needed to quantize a channel
LLM, it takes t + 1 bits to represent two LLMs corresponding to a received bit yi,
while it takes 2t bits to store two LLMs directly.
Figure 5.1: Compressed channel message (the stored word consists of the fields Msg and s)
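The compression described above can be sketched in a few lines of Python. The function names are hypothetical and the LLMs are kept as floating-point values here, whereas the decoder stores a t-bit quantized Msg plus the one-bit index s.

```python
import math


def channel_llms(pr_y_given_0, pr_y_given_1):
    """Eq. (5.4): normalise both likelihoods by the smaller one."""
    smallest = min(pr_y_given_0, pr_y_given_1)
    return (math.log(pr_y_given_0 / smallest), math.log(pr_y_given_1 / smallest))


def compress(llm0, llm1):
    """Keep only the non-negative LLM (Msg) and its bit index s."""
    return (llm1, 1) if llm1 > llm0 else (llm0, 0)


def decompress(msg, s):
    """Rebuild the pair (P[i][0], P[i][1]) from the stored (Msg, s)."""
    return (msg, 0) if s == 0 else (0, msg)


def storage_bits(t):
    """Bits per received symbol: compressed (t + 1) versus direct (2t)."""
    return t + 1, 2 * t


msg, s = compress(*channel_llms(0.2, 0.8))
print(msg, s, decompress(msg, s), storage_bits(4))
```

With t = 4 the compressed format needs 5 bits per received symbol instead of 8.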
Secondly, at the end of the CA-SCL decoding algorithm, the candidate data word
that passes the CRC and has the greatest path metric is the output data word, which
incurs additional comparisons. In this chapter, a simple direct selection scheme
is proposed: we first calculate all L checksums in parallel and then scan from the
checksum of data word 0 to the checksum of data word L − 1. As soon as a data word
passes the CRC, the scan is terminated and the corresponding candidate data
word is the final output. When all L CRC checks fail, since the CRC checksum
could be corrupted, a decoding failure is announced if re-transmission is possible;
otherwise, a data word is picked at random and output.
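The direct selection rule, together with the metric based rule it replaces, can be sketched as follows. The function names, the use of `None` to signal a decoding failure, and the example inputs are assumptions for illustration only.

```python
import random


def direct_selection(candidates, crc_pass, retransmission_possible, rng=random):
    """Return the first candidate that passes the CRC.

    If none passes, return None (decoding failure) when re-transmission is
    possible, otherwise a randomly chosen candidate.
    """
    for word, ok in zip(candidates, crc_pass):
        if ok:
            return word
    if retransmission_possible:
        return None
    return rng.choice(candidates)


def metric_selection(candidates, crc_pass, metrics):
    """Return the CRC-passing candidate with the largest path metric, if any."""
    passing = [k for k, ok in enumerate(crc_pass) if ok]
    if not passing:
        return None
    return candidates[max(passing, key=lambda k: metrics[k])]


words = ["a", "b", "c"]
print(direct_selection(words, [False, True, True], True))
print(metric_selection(words, [False, True, True], [0.9, 0.1, 0.5]))
```

The direct scheme stops at the first passing candidate and needs no metric comparison, which is where the complexity reduction comes from; the two schemes can disagree when more than one candidate passes the CRC.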
The direct selection scheme reduces computational complexity at the expense
of possible performance degradation. In this chapter, we estimate the resulting
frame error rate (FER) degradation. Let w denote the number of detectable
errors for our CRC. Assume all the bits of the final L candidate data words are
independently subject to a bit error probability, $p_b$. We calculate the increase in
FER, Pe, caused by the direct selection scheme instead of the ideal selection scheme,
which always selects the transmitted data word if it is within the final L candidates.
For each candidate data word, there are three probabilities:
• The probability that the candidate data word is the same as the transmitted
one is given by $p_1 = (1 - p_b)^K$.
• The probability that the candidate fails the CRC is denoted as $p_2$.
• The probability that the CRC identifies the candidate as the transmitted data
word by mistake is denoted as $p_3$, and
$$p_3 \doteq \sum_{r=w+1}^{K} \binom{K}{r} p_b^{r} (1 - p_b)^{K-r} \doteq \binom{K}{w+1} p_b^{w+1} (1 - p_b)^{K-w-1}.$$
Thus, $p_1 + p_2 + p_3 = 1$. We have
$$P_e \le p_3 \frac{1 - p_2^{L}}{1 - p_2} + p_2^{L} - (1 - p_1)^{L}. \tag{5.5}$$
Note that $p_b$ depends on the signal to noise ratio (SNR) and the list size L. For a
specific SNR, in order to simplify our analysis, we can use $p_{b,SC}$ to approximate
$p_b$, where $p_{b,SC}$ denotes the bit error probability of the SC algorithm. The
probabilities $p_2$ and $p_3$ are also approximated. Though approximated probabilities
are employed when calculating $P_e$, the order of magnitude of $P_e$ still helps us
determine whether our direct selection scheme is applicable. For instance, when a
strong CRC is used, i.e. w is large, $p_3$ is small, leading to a small $P_e$. On the
other hand, a higher data rate leads to a greater K and hence a greater $P_e$.
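The bound in Eq. (5.5) is easy to evaluate numerically. The sketch below uses the single-term approximation of $p_3$ and sets $p_2 = 1 - p_1 - p_3$. The function name is hypothetical, and K = 768 is an assumed value (rate 0.75 at N = 1024) used together with the bit error probability and w quoted later in this section for CRC16.

```python
from math import comb


def fer_degradation_bound(K, w, p_b, L):
    """Evaluate p1, p2, p3 and the right-hand side of Eq. (5.5)."""
    p1 = (1 - p_b) ** K
    # Leading term of the sum that defines p3 (r = w + 1).
    p3 = comb(K, w + 1) * p_b ** (w + 1) * (1 - p_b) ** (K - w - 1)
    p2 = 1 - p1 - p3
    bound = p3 * (1 - p2 ** L) / (1 - p2) + p2 ** L - (1 - p1) ** L
    return p1, p2, p3, bound


for L in (2, 4, 8, 16):
    print(L, fer_degradation_bound(K=768, w=2, p_b=6.28e-4, L=L)[3])
```

Under these assumptions the bound grows with L and stays within the order of magnitude reported for CRC16 at rate 0.75. For L = 2 the expression simplifies to $p_1 p_3$, which gives a quick sanity check of the implementation.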
5.3.1 Numerical Results
For a rate 1/2 polar code with N = 1024, the FERs of the SC, SCL and CA-SCL
algorithms are shown in Fig. 5.2, where SC denotes the floating-point SC algorithm.
CS2-max and CS2-map denote the floating-point CA-SCL algorithm with L = 2 and
the approximated F computation shown in Eq. (5.3) and the F computation shown
in Eq. (5.2), respectively. CSi-max-j denotes the fixed-point CA-SCL algorithm
with L = i and non-uniform quantization scheme with t = j, where t is the number
of quantization bits for channel probability message. Si-max-j denotes the fixed-
point SCL algorithm with L = i and non-uniform quantization scheme with t = j.
For all simulated CA-SCL algorithms, a CRC scheme with a generator polynomial
0x1EDC6F41 is employed, and the direct selection scheme is employed to pick the
final output codeword from L possible candidates.
Figure 5.2: FER performance of a polar code with N = 1024 (FER versus SNR in dB, 2.0 to 4.0; curves: SC, CS2-max, CS2-map, CS2-max-3, CS2-max-4, CS4-max-4, CS8-max-4, S2-max-3, S4-max-3)
The simulated results show that:
• For the CA-SCL algorithm, the approximated F computation in Eq. (5.3)
results in negligible performance degradation.
• When each channel LLM is quantized with 4 bits, the employment of the
proposed non-uniform quantization scheme leads to negligible performance
degradation. When each channel LLM is quantized with 3 bits, the resulting
FER performance is roughly 0.2 dB worse than that using 4-bit quantization.
• Using a larger list size leads to obvious performance improvement for the
CA-SCL algorithm, whereas the SCL algorithm with L = 2 and L = 4 has nearly
the same performance, especially in the high SNR region. For polar codes with
moderate block length (e.g. $N = 2^{11}, 2^{12}, 2^{13}$), similar phenomena have
been observed in [22].
More simulation results on the proposed direct selection scheme are provided.
There are three selection schemes employed in our simulations.
• The proposed direct selection (DS) scheme, which outputs the first data word
that passes CRC.
• Ideal selection (IS) scheme, which always outputs the correct data word if it
exists in the final list.
• Metric based selection (MS) scheme [22], which outputs the data word that
has the maximal path metric among all data words that have passed CRC.
In Figs. 5.3 and 5.4, DSk, ISk and MSk denote the CA-SCL algorithms with list
size L = k under the direct, ideal and metric based selection schemes, respectively.
The generator polynomial of the CRC16 used in our simulations is 0x1021.
As shown in Fig. 5.3, when the code rate is 0.75 and CRC16 is used, the proposed
direct selection scheme introduces an early error floor for all simulated list sizes,
while the metric based selection scheme performs nearly the same as the ideal
selection scheme. When the code rate is 0.5, as shown in Fig. 5.4, the direct
selection scheme performs nearly the same as the ideal selection scheme with list
size L = 2. When the list size is L = 4, 8, 16, the proposed direct selection scheme
shows some performance degradation compared with the ideal selection scheme, while
the metric based selection scheme has little performance degradation. When CRC32
is used, the proposed direct selection scheme performs nearly the same as the ideal
selection scheme for both code rates 0.5 and 0.75 [74].
Figure 5.3: FER performances under CRC16 and rate 0.75 (FER versus SNR, 3.0 to 4.8; curves: DS2, IS2, MS2, DS4, IS4, MS4, DS8, IS8, MS8, DS16, IS16, MS16)
Figure 5.4: FER performances under CRC16 and rate 0.5 (FER versus SNR, 2.0 to 3.6; curves: DS2, IS2, DS4, IS4, DS8, IS8, MS8, DS16, IS16, MS16)
We also calculate the bound on the FER degradation for all simulated cases. We
choose SNR = 3.6 dB, since DS4, DS8 and DS16 begin to show an error floor at this
SNR in Fig. 5.3. For the length 1024 polar code, the bit error probability $p_b$ of the
SC algorithm is $6.28 \times 10^{-4}$ and $3.04 \times 10^{-6}$ for rate 0.75 and 0.5,
respectively. For CRC16 and CRC32, w = 2 [75] and 4 [76], respectively. When CRC16
is used, for each simulated list size, the bound is around $10^{-2}$ and $10^{-10}$
for rate 0.75 and 0.5, respectively. When CRC32 is used, for each simulated list
size, the bound is $10^{-4}$ and $10^{-17}$ for rate 0.75 and 0.5, respectively. The
FER degradation caused by our DS scheme is large when the corresponding $P_e$ is
large (e.g. $10^{-2}$). On the other hand, when $P_e$ is quite small (e.g. $10^{-17}$),
our DS scheme leads to little performance degradation.
Based on our calculation results, for a given CRC and code rate, $P_e$ increases
with the list size L. This observation indicates that the potential performance
degradation caused by the DS scheme will increase when L increases. This is
consistent with the simulation results shown in Figs. 5.3 and 5.4.
5.4. EFFICIENT LIST DECODER ARCHITECTURE
Figure 5.5: Top architecture of the list decoder (blocks: C-MEM, deComp, L-MEM, ISel, OSel, CB, PUA0 to PUAL−1, PSU0 to PSUL−1, CRCU0 to CRCUL−1, path pruning unit (PPU) containing MVF and CCG, wBUF, rBUF; ports Din and Dout)
5.4 Efficient List Decoder Architecture
For the CA-SCL algorithm, we propose an efficient partial parallel list decoder
architecture shown in Fig. 5.5. The proposed list decoder architecture mainly consists
of the channel message memory (C-MEM), the internal LLM memory (L-MEM), L
processing unit arrays (PUAs) ($PUA_0, PUA_1, \cdots, PUA_{L-1}$), the path pruning
unit (PPU) and the CRC checksum unit (CRCU). These components are described in
detail in the following subsections.
5.4.1 Message Memory Architecture
The L-MEM stores all the inner LLMs used for metric computation. Since all the
LLMs in $P_{l,\lambda}$ need to be quantized with t + λ bits for λ ≥ 1, the
variable-size LLMs make the L-MEM architecture for the proposed list decoder
nontrivial. In this chapter, an area efficient scalable memory architecture for
L-MEM is proposed based on the non-uniform quantization:
• For λ = 1, 2, · · · , n, since each LLM within $P_\lambda = (P_{0,\lambda},
P_{1,\lambda}, \cdots, P_{L-1,\lambda})$ is quantized with t + λ bits, a regular
sub-memory is created for storing the LLMs in $P_\lambda$.
• All n sub-memories are combined into a single memory.
• Due to the non-uniform quantization, the width of each sub-memory may be
different. As a result, the concatenated L-MEM is an irregular memory with
varying width within its address space. For the proposed memory architecture,
the irregular L-MEM is split into several regular memories to fit current
memory generation tools.
The proposed L-MEM is a mix of different types of memories, including SRAM,
register files (RFs) and registers. Since SRAM and RF are more area efficient than
registers, the proposed L-MEM architecture is more area efficient than the register
based LLM memory in [31].
Suppose there are T processing units (PUs) in each PUA shown in Fig. 5.5. The L
PUAs then consume at most 4LT LLMs in one round of computation. For
λ = 1, 2, · · · , n, we store all the LLMs within $P_\lambda = (P_{0,\lambda},
P_{1,\lambda}, \cdots, P_{L-1,\lambda})$ in a single memory as follows.
• When $2^{n-\lambda+1}L > 4LT$, it takes a sub-memory of $2^{n-\lambda-1}/T$
words, where each word has $4LT(t + \lambda)$ bits.
• When $2^{n-\lambda+1}L \le 4LT$, it takes a sub-memory with only a single word,
which has $2^{n-\lambda+1}(t + \lambda)L$ bits.
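The two cases above can be turned into a small helper that returns the shape of the sub-memory for each λ. This is only a sketch: the function name is hypothetical and T is assumed to be a power of two so that the divisions are exact.

```python
def sub_memory_shape(n, lam, L, T, t):
    """Return (number of words, bits per word) of the sub-memory holding P_lambda."""
    llms = 2 ** (n - lam + 1) * L          # LLMs stored for layer lambda
    if llms > 4 * L * T:
        words = 2 ** (n - lam - 1) // T    # several full-width words
        width = 4 * L * T * (t + lam)
    else:
        words = 1                          # everything fits in one word
        width = llms * (t + lam)
    return words, width


# Example parameters (assumed): N = 2**10, L = 2, T = 8, t = 4.
for lam in range(1, 11):
    print(lam, sub_memory_shape(10, lam, 2, 8, 4))
```

For these example parameters the sub-memory of layer 1 has 32 words of 320 bits, while every layer with λ ≥ 6 fits in a single word.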
An example of the concatenation of n = 6 sub-memories, ($S_1, S_2, \cdots, S_6$), is
shown in Fig. 5.6(a). It is hard for current memory compilers to generate an
irregular single memory instance as shown in Fig. 5.6(a). For the proposed L-MEM
architecture, the concatenated irregular memory is split into several regular
memory instances as shown in Fig. 5.6(b), where additional dummy memories are added
so that each instance is regular. In general, the irregular memory is divided into
$\lambda_o = n - \log_2 T - 1$ regular instances. Depending on the number of words,
each memory instance can be implemented with SRAM, RF or registers.
Figure 5.6: The split of an irregular LLM memory: (a) concatenated sub-memories S1 to S6; (b) regular memory instances M1 to M4 with dummy memory
Compared with the register based LLM memory, the proposed L-MEM architecture
is more area efficient because:
• Some sub-memory instances can be implemented with SRAM or RF, which is
denser than register based memory.
• As shown in Fig. 5.6(b), most of the LLMs are stored in the largest memory
instance M1, which contains
$$N_w = n - \lambda_o + 1 + \sum_{\lambda=1}^{\lambda_o-1} \frac{2^{n-\lambda-1}}{T} \tag{5.6}$$
words, where each word has 4LT(t + 1) bits.
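Eq. (5.6) can be evaluated directly to see how the depth of the largest instance changes with T. The helper below is a sketch with a hypothetical name; T is assumed to be a power of two.

```python
def largest_instance_words(n, T):
    """Evaluate N_w of Eq. (5.6) with lambda_o = n - log2(T) - 1."""
    lam_o = n - (T.bit_length() - 1) - 1
    multi_word = sum(2 ** (n - lam - 1) // T for lam in range(1, lam_o))
    return (n - lam_o + 1) + multi_word


for T in (8, 16, 32, 64):
    print(T, largest_instance_words(10, T))
```

For n = 10 the depth drops from 67 words at T = 8 to 36 words at T = 16, so fewer PUs per array give a deeper memory instance.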
As shown in Eq. (5.6), $N_w$ decreases as the number of processing units, T, within
a PUA increases. As a result, the area of the proposed L-MEM depends on T for a
fixed block length $N = 2^n$ and t. Taking RF as an example, we show the comparison
of area efficiency of RFs with different depths in Table 5.1, where area per bit
(APB) denotes the total area of a memory normalized by the total number of stored
bits. The total areas shown in Table 5.1 are from a memory compiler associated with
a 90nm technology. As shown in Table 5.1, an RF with a larger depth has a smaller
APB. Hence, given the same number of bits, it takes a smaller area if those bits
can be stored in an RF with a larger depth. For SRAM, the same trend has been
observed.
Table 5.1: Area per Bit for RFs with Different Depths and Width 128 using TSMC
90nm CMOS technology
depth               8       16      32      64      128
total area (µm²)    24331   27022   32308   42812   63811
APB (µm²/bit)       23.7    13.1    7.9     5.2     3.89
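The APB row of Table 5.1 follows from dividing the total area by the number of stored bits (depth times width). A short sketch, with a hypothetical function name, recomputes it from the table values:

```python
def area_per_bit(total_area_um2, depth, width=128):
    """Area per bit: total area divided by depth * width stored bits."""
    return total_area_um2 / (depth * width)


table_5_1 = {8: 24331, 16: 27022, 32: 32308, 64: 42812, 128: 63811}
for depth, area in table_5_1.items():
    print(depth, area_per_bit(area, depth))
```

The recomputed values agree with the tabulated APB to within about 0.1 µm² per bit and fall monotonically with depth.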
The C-MEM can be implemented with a simple regular memory with $N/(2T)$ words,
where each word has $2T(t + 1)$ bits. Due to the compression of the channel message,
each compressed channel message is de-compressed into two LLMs by the deComp
unit in Fig. 5.5 before being fed to the PUs. The deComp unit can be implemented
with multiplexers.
5.4.2 Processing Unit Array
The G and approximated F computations shown in Eq. (5.1) and Eq. (5.3),
respectively, are used in the metric computation. These two types of basic
operations can be performed with a PU [31, 74]. The hardware complexity of the
proposed PU is determined by p, which is the width of an input LLM.
Due to the non-uniform quantization of the LLMs belonging to different message
arrays, for each PU, the number of quantization bits, p, for each input LLM should
be large enough so that no overflow happens. According to the fixed point
implementation of the CA-SCL algorithm, the quantization of $P_{l,n}$
(l = 0, 1, · · · , L − 1) needs the most binary bits, which is t + n. For each PUA,
it is unnecessary to employ T PUs with p + 1 = t + n. In this chapter, a
fine-grained PU profiling (FPP) algorithm, shown in Algorithm 20, is proposed to
decide p for each PU.
Algorithm 20: FPP Algorithm
input : n, t, λo = n − log2 T − 1
output: p[0], p[1], · · · , p[T − 1]
for j = 0 to T − 1 do
    p[j] = t + λo − 1
for λ = λo + 1 to n do
    for j = 0 to 2^{n−λ} − 1 do
        p[j] = t + λ − 1
Table 5.2: Bit width of LLM Inputs of PUl,j when n = 10, T = 8 and t = 4
j 0 1 2 3 4 5 6 7
p[j] 13 12 11 11 10 10 10 10
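Algorithm 20 translates almost line by line into Python. The sketch below uses a hypothetical function name and assumes T is a power of two; it reproduces the widths listed in Table 5.2.

```python
def fpp(n, t, T):
    """Fine grained PU profiling: input LLM width p[j] for each of the T PUs."""
    lam_o = n - (T.bit_length() - 1) - 1   # lambda_o = n - log2(T) - 1
    p = [t + lam_o - 1] * T
    for lam in range(lam_o + 1, n + 1):
        for j in range(2 ** (n - lam)):    # only the first 2**(n - lam) PUs are used
            p[j] = t + lam - 1
    return p


print(fpp(n=10, t=4, T=8))  # [13, 12, 11, 11, 10, 10, 10, 10]
```

Only PU 0 ever processes the widest messages, which is why a single PU per array needs the full 13-bit datapath in this example.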
Table 5.3: Area comparison between fine grained PU array and regular PU using
TSMC 90nm CMOS technology
n                                 10      10      10      10      15      15      15      15
T                                 8       16      32      64      32      64      128     256
CPD (ns)                          0.555   0.555   0.555   0.555   0.588   0.588   0.588   0.588
regular PU array area (µm²)       27650   55259   113902  225418  150951  308640  602509  1212359
fine grained PU array area (µm²)  19280   34131   59048   101377  104434  190615  334927  594048
area saving                       30%     38%     48%     55%     30%     38%     44%     51%
For the j-th PU of $PUA_l$ (l = 0, 1, · · · , L − 1), each LLM input is quantized
with p[j] bits. The proposed FPP algorithm is based on the observation that only
$2^{n-\lambda} \le T$ PUs are needed when computing the updated $P_{l,\lambda}$ with
$\lambda > \lambda_o$. Thus, in the proposed $PUA_l$, only $PU_{l,0}, PU_{l,1}, \cdots,
PU_{l,2^{n-\lambda}-1}$ are enabled for the computation of $P_{l,\lambda}$. Based on
the proposed FPP algorithm, each PUA can finish the metric computation without any
overflow while consuming less area. As shown in Algorithm 20, the bit width of the
LLM inputs of a PU is determined by n, T and t. One example is shown in Table 5.2,
where n = 10, T = 8 and t = 4.
The area saving due to the proposed fine-grained profiling algorithm also depends
on T, n and t. For the proposed list decoder architecture, there are L identical PU
arrays, where each array contains T PUs. In Table 5.3, we compare the area of a
regular PU array with that of an array where the input message width of each PU is
determined by the fine-grained profiling algorithm. As shown in Table 5.3, the area
of PU arrays is reduced by 30% to 55% depending on the number of PUs within an
array and the block length $N = 2^n$. Here, each channel message is quantized with
t = 4 bits.
Metric Computation Schedule
For the proposed L-MEM, each data word is capable of storing 4TL LLMs. Moreover,
each word is equally divided into L consecutive parts, where the l-th part stores
the LLMs corresponding to decoding path l. The metric computation schedule is
almost the same as that of the partial parallel SC decoder in [20] except that L
PUAs work simultaneously for the L decoding paths, respectively.
When a data word needs to be updated, a write mismatch occurs since the
L PUAs generate only 2LT updated LLMs during one clock cycle. These L PUAs
need to read two consecutive data words from L-MEM in order to generate 4TL
LLMs. For the proposed list decoder architecture, as shown in Fig. 5.5, L write
buffers (wBUFs) are employed to store half of the 4TL LLMs generated by the L PUAs.
Once the remaining LLMs are computed, the output selection (OSel) module arranges
all 4TL LLMs in the format in which they are stored in the L-MEM.
Since all the LLMs belonging to $P_\lambda = (P_{0,\lambda}, P_{1,\lambda}, \cdots,
P_{L-1,\lambda})$ with $\lambda > \lambda_o$ are stored in a single data word in L-MEM
and the computation of the LLMs belonging to $P_{\lambda+1}$ can only take place once
$P_\lambda$ is updated, an additional clock cycle would be needed to read out the
LLMs within $P_\lambda$ that have just been written into the L-MEM. This would
increase the delay and decrease the throughput of the proposed list decoder. As
shown in [20], a bypass buffer, rBUF, is used to temporarily store the messages
written into the L-MEM and eliminate the extra read cycle.
5.4.3 Path Pruning Unit
For the CA-SCL algorithm, once the path metric computation of decoding step i is
finished, each current decoding path splits into two sub decoding paths. However,
the list decoder keeps at most L decoding paths. For the proposed list decoder
architecture, a path pruning unit (PPU) is proposed to prune the split decoding
paths in an efficient way. As shown in Fig. 5.5, the proposed PPU contains two
submodules, the maximum value filter (MVF) and the crossbar control signals
generator (CCG). The MVF generates L path indices $a_0, a_1, \cdots, a_{L-1}$ and L
associated decoded bits $c_0, c_1, \cdots, c_{L-1}$. For a current decoding path l,
both the path metric and partial sum computations will be based on the LLMs and
partial sums within decoding path $a_l$, and the decoded data bit for $u_{l,i}$ is
$c_l$. The values $a_l$ and $c_l$ for l = 0, 1, · · · , L − 1 are used to control
the copying of partial sums and checksums.
Table 5.4: Comparison of ASIC implementation results using TSMC 90nm CMOS
technology
               metric sorter [31]                         proposed MVF
L              2      4      8       16       32          2      4      8       16      32
CPD (ns)       0.45   0.85   1.8     4.1      9.6         0.54   1.25   2.25    3.7     5.2
area (µm²)     1995   9199   47119   241633   1392617     1580   8401   30814   96979   319498
area saving    -                                          20%    8%     34%     59%     77%
Maximum Values Filter
Taking list size L = 8 as an example, the proposed MVF architecture, shown in
Fig. 5.7, consists of a bitonic sequence generator (BSG) and a stage of compare and
select (CAS) modules. The BSG has 16 inputs ($D_0, D_1, \cdots, D_{15}$) and 16
outputs ($S_0, S_1, \cdots, S_{15}$). Each input or output consists of three parts:
the path metric, the associated list index and the decoded bit. The width of each
input and output is $z = x_1 + x_2 + 1$, where $x_1 = t + n$ is the number of bits
used to quantize a path metric and $x_2 = \log_2 L$ is the number of bits used to
represent a list index.
Each stage of the BSG consists of $L/2$ increase-order sorters (ISs) and $L/2$
decrease-order sorters (DSs), which are shown in Fig. 5.8(a) and Fig. 5.8(b),
respectively. Both the IS and DS have two inputs and two outputs. For k = 0, 1,
$SI_k = (LR_k, l_k, b_k)$ and $SO_k = (LR'_k, l'_k, b'_k)$, where $LR_k$, $l_k$ and
$b_k$ denote the path metric and its corresponding list index and decoded bit. The
IS reorders the inputs such that the path metrics satisfy $LR'_0 \le LR'_1$. The
output of the comp-max module is 1 when $LR_0 > LR_1$. The DS reorders the inputs
such that $LR'_0 \ge LR'_1$, and the output of the comp-min module is 1 when
$LR_0 < LR_1$.
The BSG reorders the inputs based on the magnitude of the path metrics. Let
$LS_r$ (r = 0, 1, · · · , 15) denote the path metric associated with output $S_r$.
The path metrics of the 16 outputs satisfy:
$$LS_0 \le LS_1 \le \cdots \le LS_7, \tag{5.7}$$
$$LS_8 \ge LS_9 \ge \cdots \ge LS_{15}. \tag{5.8}$$
It is proved in [77] that the 8 maximum values among the $LS_r$'s are
$\max(LS_r, LS_{8+r})$ for r = 0, 1, · · · , 7. Hence, a stage of CAS modules is
appended at the outputs of the BSG shown in Fig. 5.7, where $CAS_r$ takes $S_r$ and
$S_{r+8}$ as inputs. This stage of CAS modules produces the outputs
$O_l = (a_l, c_l)$ for l = 0, 1, · · · , L − 1. As shown in Fig. 5.8(c), the CAS
module compares the path metrics of its two inputs and selects the list index and
bit value whose associated path metric is larger.
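The selection principle behind the MVF can be checked in software. In the sketch below, Python's `sorted` stands in for the BSG sorting network (an assumption made for brevity, not a model of the hardware), and the CAS stage is the element-wise comparison of the ascending half with the descending half. The function name and the example metrics are hypothetical.

```python
def maximum_values_filter(entries):
    """entries: 2L tuples (metric, list_index, decoded_bit).

    Returns the L survivors chosen by the CAS stage.
    """
    L = len(entries) // 2
    # Stand-in for the BSG outputs: first half ascending, second half descending.
    lower = sorted(entries[:L], key=lambda e: e[0])
    upper = sorted(entries[L:], key=lambda e: e[0], reverse=True)
    # CAS_r keeps whichever of S_r and S_{r+L} has the larger path metric.
    return [a if a[0] > b[0] else b for a, b in zip(lower, upper)]


metrics = [5, 1, 9, 3, 7, 2, 8, 6, 4, 10, 0, 11, 15, 13, 12, 14]
entries = [(m, k // 2, k % 2) for k, m in enumerate(metrics)]
survivors = maximum_values_filter(entries)
print(sorted(e[0] for e in survivors))  # [8, 9, 10, 11, 12, 13, 14, 15]
```

Note that the survivors are the L largest metrics but are not returned in sorted order, which is all the path pruning needs.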
Figure 5.7: Maximum values filter architecture for L = 8 (inputs D0 to D15; bitonic sequence generator (BSG) built from stages of IS and DS modules with outputs S0 to S15; CAS0 to CAS7 with outputs O0 to O7)
Figure 5.8: (a) Architecture of IS (b) Architecture of DS (c) Architecture of CAS
(z = x1 + x2 + 1)
The metric sorter in [31] has the same function as that of the proposed MVF. We
compare the proposed bitonic sorter based MVF module with the metric sorter [31]
in terms of area and critical path delay (CPD) under different list sizes when both
modules are synthesized under the TSMC 90nm CMOS technology. As shown in
Table 5.4, the proposed MVF module is more suitable for large list sizes. For
list size L = 2 to 32, the proposed MVF achieves 8% to 77% area saving. The
proposed MVF architecture achieves area saving because comparators dominate
the area of both the metric sorter and the MVF modules. For list size L, the metric
sorter needs $N_{MS} = L(2L - 1)$ comparators, while the proposed MVF module needs
$N_{MVF} = L(1 + 2 + \cdots + \log_2 L) + L = \frac{L}{2}\big((\log_2 L)^2 + \log_2 L + 2\big)$
comparators. When L is large, $N_{MS}/N_{MVF} \approx \frac{4L}{(\log_2 L)^2}$.
Clearly, our MVF module needs fewer comparators.
When L = 2, 4, 8, compared with the metric sorter, the proposed MVF has a
longer CPD while achieving area saving. However, the longer delay of the MVF
is inconsequential because it is not in the critical path of the decoder
architecture when L ≤ 8. When L = 16, 32, the proposed MVF is better than the
metric sorter in terms of both area and CPD. Thus, the proposed MVF is more
suitable for large list sizes.
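The comparator counts of the two sorters can be tabulated with a short sketch. The MVF count is taken as L sorters in each of the 1 + 2 + ... + log2 L BSG stages plus L CAS modules, which for L = 8 gives the six sorter stages and eight CAS modules of Fig. 5.7. The function names are hypothetical and L is assumed to be a power of two.

```python
def comparators_metric_sorter(L):
    """N_MS = L(2L - 1): one comparator for every pair among 2L metrics."""
    return L * (2 * L - 1)


def comparators_mvf(L):
    """N_MVF = L(1 + 2 + ... + log2 L) + L."""
    m = L.bit_length() - 1   # log2 L
    return L * m * (m + 1) // 2 + L


for L in (2, 4, 8, 16, 32):
    print(L, comparators_metric_sorter(L), comparators_mvf(L))
```

The gap widens quickly with L: 120 versus 56 comparators at L = 8 and 2016 versus 512 at L = 32.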
Crossbar Control Signal Generator
Due to the lazy copy method [31], when decoding path l needs to be copied to
decoding path l′, instead of copying LLMs from path l to path l′, the index
references ($r_l = (r_l[n-1], \cdots, r_l[0])$ shown in Algorithm 18) to the LLMs of
path l are copied to path l′. For decoding path l, when $PUA_l$ is computing updated
LLMs in $P_{l,\lambda}$, the crossbar (CB) module shown in Fig. 5.5 selects input
LLMs from decoding path $r_l[\lambda - 1]$. The CB can be implemented with L-to-1
multiplexers.
The crossbar control signal generator (CCG) computes the control signals of the CB,
$cc_0, cc_1, \cdots, cc_{L-1}$, where the l-th output of the CB is connected to the
$cc_l$-th input. Since the CCG is a direct implementation of the lazy copy scheme
in [31], the details are omitted and can be found in our extended manuscript [74].
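The idea of copying index references instead of LLMs can be sketched as follows. This is an illustration only: the function names are hypothetical, and the rule that a path marks a layer as its own after writing it is an assumption about the lazy copy bookkeeping rather than something stated in this section.

```python
def apply_pruning(refs, survivors):
    """After pruning, path l inherits the reference array of path a_l.

    refs[l][k] names the path whose memory holds the layer-k LLMs used by path l.
    """
    return [list(refs[a]) for a in survivors]


def compute_layer(refs, l, lam):
    """Return the path that supplies the inputs for P_{l,lam} (the crossbar
    selection r_l[lam - 1]) and, as an assumed bookkeeping step, record that
    path l now owns layer lam."""
    source = refs[l][lam - 1]
    if lam < len(refs[l]):
        refs[l][lam] = l
    return source


refs = [[0, 0, 0], [1, 1, 1]]           # n = 3, L = 2, each path owns its LLMs
refs = apply_pruning(refs, [0, 0])      # both survivors descend from path 0
print(compute_layer(refs, 1, 2), refs)  # path 1 reads layer 1 from path 0
```

Only n small indices per path are copied at each pruning step, independent of how many LLMs each layer holds.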
5.4.4 Partial Sum Update Unit and the CRC Unit
In this chapter, a parallel partial sum update unit (PSU) is proposed to provide
the partial sum inputs to the L PUAs when performing the G computation. Compared
with the PSU in [20,31], which needs N − 1 single-bit registers for a decoding path,
our PSU needs only $N/2 - 1$ single-bit registers.
Taking $N = 2^3$ as an example, the architecture of $PSU_l$, which computes the
partial sums for decoding path l, is shown in Fig. 5.9, where stage3 and stage2 have
one and two elementary update units (EUs), respectively. $r_{l,3,0}$, $r_{l,2,0}$ and
$r_{l,2,1}$ shown in Fig. 5.9 are single-bit registers. $c_l = \hat{u}_{l,i}$ is the
binary input of $PSU_l$. There are three partial sum outputs: $b_{l,3}$, $b_{l,2}$
and $b_{l,1}$ with a width of 1, 2 and 4 bits, respectively. When the LLMs in
$P_{l,\lambda}$ need to be updated with the G computation, $b_{l,\lambda}$ is the
corresponding partial sum input. The architectures of $PSU_l$ for other code lengths
can be derived from the architecture in Fig. 5.9. For a polar code with length
$N = 2^n$, the corresponding $PSU_l$ contains n − 1 stages: $stage_n$,
$stage_{n-1}$, · · · , $stage_2$, where $stage_j$ has $2^{n-j}$ EUs for n ≥ j ≥ 2.
When bit index i is even, $c_l$ is stored in $r_{l,n,0}$ and the other registers
keep their current values unchanged. When bit index i is odd, the bit registers in
$stage_n$, $stage_{n-1}$, · · · , $stage_{\phi(i+1)}$ are updated with their
corresponding inputs. When decoding path index $l \ne a_l$, the updated partial sums
of decoding path l should be computed based on the bit registers in $PSU_{a_l}$. The
switch network (SW) shown in Fig. 5.9 selects the corresponding bit register value
from $PSU_{a_l}$. The width of the input signal
$B_{l,j,k} = \{r_{0,j,k}, r_{1,j,k}, \cdots, r_{L-1,j,k}\} \setminus \{r_{l,j,k}\}$
is L − 1 bits.
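The register count quoted for the proposed PSU follows from the stage sizes given above, assuming one single-bit register per EU as in the $N = 2^3$ example of Fig. 5.9:

```latex
\sum_{j=2}^{n} 2^{\,n-j} \;=\; 2^{\,n-2} + 2^{\,n-3} + \cdots + 2^{0}
\;=\; 2^{\,n-1} - 1 \;=\; \frac{N}{2} - 1 .
% Example: n = 3 gives 2 + 1 = 3 registers, namely r_{l,3,0}, r_{l,2,0}, r_{l,2,1}.
```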
Figure 5.9: PSU architecture (input cl; stage3 with EU3,0 and register rl,3,0; stage2 with EU2,0, EU2,1 and registers rl,2,0, rl,2,1; one SW per register with (L−1)-bit inputs Bl,3,0, Bl,2,0, Bl,2,1; outputs bl,3[0], bl,2[0..1], bl,1[0..3])
Figure 5.10: Architecture of the proposed CRC unit (registers dl,0 to dl,h−1 and csl, each with an SW module; XOR gates and multiplexers at the taps p1 to ph−1; input cl; control signal shiftl)
The CRC unit (CRCU) checks whether a data word passes the CRC. Suppose
an h-bit CRC checksum is used. The architecture of $CRCU_l$ for decoding path
l is shown in Fig. 5.10, where the generator polynomial for the CRC checksum
computation is $p(x) = x^h + p_{h-1}x^{h-1} + \cdots + p_1 x + 1$. The proposed
$CRCU_l$ is based on a well known serial CRC computation architecture [78]. If the
polynomial coefficient $p_k = 0$, the corresponding XOR gate and multiplexer are
removed. During the decoding of the first K − h information bits, the control signal
$shift_l = 0$ and $CRCU_l$ computes the h-bit checksum of these information bits.
The checksum is stored in the bit registers $d_{l,0}, d_{l,1}, \cdots, d_{l,h-1}$
shown in Fig. 5.10. When a frozen bit is being decoded, $d_{l,0}, d_{l,1}, \cdots,
d_{l,h-1}$ are not updated. Once the checksum computation is finished, the control
signal $shift_l = 1$ and the checksum is compared bit by bit with the remaining h
decoded information bits. The comparison result is stored in the register $cs_l$.
The decoded data word for decoding path l passes the CRC only if $cs_l = 0$. The SW
module shown in Fig. 5.10 is the same as that used in the partial sum computation
unit $PSU_l$. When $l \ne a_l$, the SW module selects $d_{a_l,k}$ for
k = 0, 1, · · · , h − 1.
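The behaviour of a bit-serial CRC check can be sketched in Python. This is a generic shift-register CRC with zero initial state, not a gate-level model of Fig. 5.10; the function names and the assumption that the checksum bits follow the data most significant bit first are made for this example. The polynomial 0x1021 is the CRC16 generator used in the simulations above.

```python
def serial_crc(bits, poly, h):
    """Shift the bits through an h-bit LFSR.

    poly holds the coefficients p_{h-1}..p_0 of the generator polynomial;
    the x^h term is implicit.
    """
    mask = (1 << h) - 1
    reg = 0
    for b in bits:
        feedback = ((reg >> (h - 1)) & 1) ^ b
        reg = (reg << 1) & mask
        if feedback:
            reg ^= poly
    return reg


def passes_crc(decoded_bits, poly, h):
    """Compare the checksum of the first K - h bits with the last h bits."""
    info, tail = decoded_bits[:-h], decoded_bits[-h:]
    received = int("".join(str(b) for b in tail), 2)
    return serial_crc(info, poly, h) == received


data = [(byte >> k) & 1 for byte in b"123456789" for k in range(7, -1, -1)]
checksum = serial_crc(data, 0x1021, 16)
word = data + [(checksum >> k) & 1 for k in range(15, -1, -1)]
print(hex(checksum), passes_crc(word, 0x1021, 16))
```

One register update per decoded information bit is all that is required, so the check can run alongside the decoding of each path.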
Table 5.5: Implementation Results With R′ = 0.468 and R = 0.5
architecture                          proposed  proposed  proposed  proposed  [31]†     [31]†     [31]†     [31]†     [31]‡     [31]‡
algorithm                             CA-SCL    CA-SCL    CA-SCL    CA-SCL    SCL       SCL       SCL       SCL       SCL       SCL
list size L                           2         2         4         4         2         2         4         4         2         4
total number of PUs LT                16        32        32        64        16        32        32        64        128       256
channel message quantization bits t   4         4         4         4         3         3         3         3         3         3
process                               TSMC 90nm TSMC 90nm TSMC 90nm TSMC 90nm TSMC 90nm TSMC 90nm TSMC 90nm TSMC 90nm UMC 90nm  UMC 90nm
frequency (MHz)                       500       500       454       476       699       757       684       694       459       314
total area (mm²)                      0.406     0.553     0.810     1.132     1.114     1.174     2.181     2.197     1.60      3.53
NC                                    3200      2816      3200      2816      3200      2816      3200      2816      2592      2592
latency (µs)                          6.4       5.63      7.04      5.91      4.57      3.71      4.67      4.05      5.64      8.25
throughput (Mbps)                     160R′     181R′     145R′     173R′     224R      275R      219R      252R      181R      124R
area efficiency (Mbps/mm²)            394R′     327R′     179R′     152R′     201R      234R      100R      114R      113R      35R
normalized area efficiency            1.83      1.30      1.67      1.24      1         1         1         1         -         -
‡ Original synthesis results based on a UMC 90nm technology in [31].
† Synthesis results based on a TSMC 90nm technology, provided by the authors of [31].
5.4.5 Decoding Cycles
For the proposed list decoder, pipeline registers can be inserted in the paths that
pass through the MVF. Let $N_C$ denote the number of cycles spent on the decoding of
one data word. For list decoder architectures based on partial parallel
processing [20],
$$N_C = 2N + \frac{N}{T}\log_2\frac{N}{4T} + n_p R N, \tag{5.9}$$
where N, T, $n_p$ and R denote the block length, the number of PUs per decoding
path, the number of pipeline registers inserted in the path pruning unit and the
code rate, respectively.
The corresponding throughput is $T_P = \frac{f N R}{N_C}$, where f is the clock
frequency of the list decoder. The latency is $T_D = \frac{N_C}{f}$.
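Eq. (5.9) and the throughput and latency expressions can be evaluated with a few lines of Python. The function names are hypothetical, N and T are assumed to be powers of two, and $n_p = 1$ is an assumption that reproduces the cycle counts listed in Table 5.5 for N = 1024 and R = 0.5.

```python
def decoding_cycles(N, T, n_p, R):
    """N_C of Eq. (5.9); N and T are assumed to be powers of two."""
    log_term = (N // (4 * T)).bit_length() - 1   # log2(N / (4T))
    return 2 * N + (N // T) * log_term + round(n_p * R * N)


def throughput_bps(f_hz, N, R, n_c):
    """T_P = f * N * R / N_C."""
    return f_hz * N * R / n_c


def latency_s(f_hz, n_c):
    """T_D = N_C / f."""
    return n_c / f_hz


for T in (8, 16, 64):
    print(T, decoding_cycles(1024, T, 1, 0.5))   # 3200, 2816, 2592

n_c = decoding_cycles(1024, 8, 1, 0.5)
print(throughput_bps(500e6, 1024, 0.468, n_c), latency_s(500e6, n_c))
```

At 500 MHz and T = 8 this gives a latency of 6.4 microseconds per data word.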
5.4.6 Scalability of the Proposed List Decoder Architecture
For our list decoders, in terms of error performance, it is desirable to use large
list sizes since a larger L leads to more performance gain for the CA-SCL decoding
algorithm. For the current list decoder architecture in [31], two problems arise
when L increases.
• The message memories of the list decoder in [31] are built with registers due to
the non-uniform quantization of the logarithm domain messages. Besides, the
message memories dominate the whole decoder area. As a result, the memory
area of the list decoder is linearly proportional to list size L. For a larger list
size, the list decoder architecture in [31] will suffer from large area and high
power consumption due to its register based memory.
• As shown in Table 5.4, when the list size grows, the metric sorter suffers
from large area and long critical path delay, which results in a lower clock
frequency of the list decoder. If multiple pipeline stages are inserted in the metric
sorter, the number of cycles for decoding one codeword also increases, as shown
in Eq. (5.9).
For our list decoder architecture, these two issues are solved as follows.
• The proposed memory architecture is more area efficient than register
based memory. Moreover, the proposed memory architecture offers a tradeoff
between data throughput and memory area. The area of the register based memory [31]
remains almost unchanged when the number of PUs changes. For the
proposed memory architecture, however, the number of PUs affects the depth-width
ratio of the message memories. Hence, the area of the message memory can be tuned
by varying the number of PUs. Reducing the number of PUs increases the
depth of the message memories, which makes them more area efficient. On the other hand,
reducing the number of PUs also increases the number of cycles needed to
decode one codeword and decreases the data throughput.
• When the list size increases, the proposed MVF is more area efficient and has
a shorter critical path delay compared with the metric sorter [31].
As shown in Eq. (5.6), the depth of the largest LLM memory instance will
increase when N increases. Hence, the area efficiency will be improved when N
increases. As a result, our list decoder architecture is more suitable for large block
length N .
5.5 Implementation Results
In this chapter, our list decoder architecture has been implemented with list sizes
L = 2 and 4 for a rate 1/2 polar code with N = 1024. For each list size, two list
decoders with T = 8 and T = 16 PUs, respectively, are implemented
and synthesized under a TSMC 90nm CMOS technology. For the L-MEM within
each of our list decoders, each sub memory is compiled with a memory compiler if
its depth is large enough. Otherwise, the sub memory is built with registers. For
all implemented decoders, each channel LLM is quantized with 4 bits in order to
achieve near floating point decoding performance. For our list decoders with L = 2
and 4, one stage of pipeline registers is used. Since the synthesis results in [31] were
based on a UMC 90nm technology, the authors of [31] have generously re-synthesized
their decoder architecture using the TSMC 90nm technology. We list both the original
synthesis results from [31] and the re-synthesized results provided by the authors of [31] in
Table 5.5. To make a fair comparison, we focus on the re-synthesized results.
The implementation results in Table 5.5 show that:
• The decoder architecture in [31] has a higher throughput than our list decoder
architecture. The reason is that the decoder architecture in [31] employs regis-
ter based memory while the proposed list decoder architecture employs register
file (RF) based memories. The read and write delays of an RF are larger than
those of a register based memory.
• On the other hand, our list decoder architecture is more area efficient. For list
decoders with the same L and T values, our list decoder architecture achieves
1.24 to 1.83 times the area efficiency of the decoder of [31].
Our list decoder is implemented for the N = 1024 polar code because the same
block length is used in [31]. For larger block lengths or larger list sizes, our advantage
in area efficiency is expected to be greater due to the more area efficient LLM memory.
Since the CA-SCL algorithm helps to select the correct one from L possible
decoded codewords [22], the decoding performance of the CA-SCL algorithm is
better than that of the SCL algorithm with the same list size in [31]. As shown in
Fig. 5.2, the proposed CA-SCL decoders in Table 5.5 outperform the SCL decoders in
Table 5.5. We note that the number of PUs has no impact on the error performance
of the SCL and CA-SCL decoders.
As shown in Fig. 5.2, for the CA-SCL algorithm, increasing the list size results
in noticeable decoding gain according to our simulations. In contrast, as shown in [21, Fig. 1],
increasing the list size of the SCL algorithm leads to negligible decoding gain, especially
in the high SNR region. For the CA-SCL algorithm, the choice of list size L
depends on the tradeoff between error performance and decoding complexity: better
error performance can be achieved by increasing the list size L. For the SCL
algorithm, we need to find the threshold value $L_T$ beyond which little further decoding gain
is achieved, i.e., employing a list size $L > L_T$ brings little benefit. The feasible list
size for the SCL algorithm should be no greater than $L_T$ and satisfy the error performance requirement.
Due to the serial nature of the successive cancelation method, SC based decoders
and their list variants suffer from long decoding latency. The throughput
of SC based decoders is expected to be lower than that of BP based
decoders, since the BP algorithm for polar codes has a much higher parallelism.
On the other hand, the BP algorithm for polar codes still suffers from inferior finite
length error performance [26,79]. Current simulation results [26] show that the error
performance of the BP algorithm for polar codes is similar to that of the SC algorithm,
but worse than those of the SCL and CA-SCL algorithms.
5.6 Conclusion
In this chapter, an efficient list decoder architecture has been proposed for polar
codes. The proposed decoder architecture achieves higher area efficiency and better
error performance than previous list decoder architectures.
Chapter 6
A High Throughput List Decoder
Architecture for Polar Codes
6.1 Introduction
Polar codes [18] are a significant breakthrough in coding theory, since polar codes can
achieve the channel capacity of binary-input symmetric memoryless channels [18]
and arbitrary discrete memoryless channels [19]. Polar codes of block length N
can be efficiently decoded by a successive cancelation (SC) algorithm [18] with a
complexity of $O(N \log N)$. While polar codes of very large block length ($N > 2^{20}$ [20])
approach the capacity of underlying channels under the SC algorithm, for
short or moderate polar codes, the error performance of the SC algorithm is worse
than that of turbo or LDPC codes [55].
Considerable effort [23, 24, 55] has already been devoted to improving the error
performance of polar codes with short or moderate lengths. An SC list (SCL)
decoding algorithm [55] performs better than the SC algorithm. In [23, 24, 55], the
cyclic redundancy check (CRC) is used to pick the output codeword from L candidates,
where L is the list size. The CRC-aided SCL (CA-SCL) decoding algorithm
performs much better than the SCL decoding algorithm at the expense of negligible
loss in code rate.
Despite their significantly improved error performance, the hardware implementations
of SC based list decoders [31–34] still suffer from long decoding latency and
limited throughput due to the serial decoding schedule. In order to reduce the decoding
latency of an SC based list decoder, M (M > 1) bits are decoded in parallel
in [35–37], where the decoding latency can ideally be reduced by a factor of M. However,
for the hardware implementations of the algorithms in [35–37], the actually
achieved latency reduction is less than M times due to the extra decoding cycles spent on
finding the L most reliable paths among $2^M L$ candidates, where L is the list size. A
software adaptive SSC-list-CRC decoder was proposed in [38]. For a (2048, 1723)
polar+CRC-32 code, the SSC-list-CRC decoder with L = 32 was shown to be about
7 times faster than an SC based list decoder. However, it is unclear whether the list
decoder in [38] is suitable for hardware implementation.
In this chapter, a tree based reduced latency list decoding algorithm and its
corresponding high throughput hardware architecture are proposed for polar codes.
The main contributions are:
• A tree based reduced latency list decoding (RLLD) algorithm in the log-likelihood
ratio (LLR) domain is proposed for polar codes. Inspired by the
simplified successive cancelation (SSC) [30] decoding algorithm and the ML-SSC
algorithm [54], our RLLD algorithm performs the SC based list decoding
on a binary tree. Previous SCL decoding algorithms visit all the nodes in
the tree and consider all possibilities of the information bits, while our RLLD
algorithm visits far fewer nodes in the tree and considers fewer possibilities
of the information bits. When configured properly, our RLLD algorithm significantly
reduces the decoding latency and hence improves throughput, while
introducing little performance degradation.
• Based on our RLLD algorithm, a high throughput list decoder architecture
is proposed for polar codes. Compared with the state-of-the-art SCL decoders
in [32,33,36], our list decoder achieves lower decoding latency and higher area
efficiency (throughput normalized by area).
More specifically, the major innovations of the proposed decoder architecture
are:
• An index based partial sum computation (IPC) algorithm is proposed to avoid
copying partial sums directly when one decoding path needs to be copied to
another. Compared with the lazy copy algorithm in [55], our IPC algorithm is
more hardware friendly since it copies only path indices, while the lazy copy
algorithm needs more complex index computation.
• Based on our IPC algorithm, a hybrid partial sum unit (Hyb-PSU) is proposed
so that our list decoder is suitable for larger block lengths. The Hyb-PSU is
able to store most of the partial sums in area efficient memories such as register
files (RFs) or SRAM, while the partial sum units (PSUs) in [31–33] store partial
sums in registers, which need much larger area when the block length N is
larger. Compared with the PSU of [32], our Hyb-PSU achieves an area saving
of 23% and 63% for block lengths $N = 2^{13}$ and $2^{15}$, respectively, under the
TSMC 90nm CMOS technology.
• For our RLLD algorithm, when certain types of nodes are visited, each current
decoding path splits into multiple ones, among which the L most reliable paths
are kept. In this chapter, an efficient path pruning unit (PPU) is proposed
to find the L most reliable decoding paths among the split ones. For our
high throughput list decoder architecture, the proposed PPU is the key to the
implementation of our RLLD algorithm.
• For the fixed-point implementation of our RLLD algorithm, a memory efficient
quantization (MEQ) scheme is used to reduce the number of stored bits. Compared
with the conventional quantization scheme, our MEQ scheme reduces
the number of stored bits by 17%, 25% and 27% for block length $N = 2^{10}$, $2^{13}$
and $2^{15}$, respectively, at the cost of slight error performance degradation.
Note that the SSC and ML-SSC algorithms reduce the latency of the SC algorithm
by performing it on a binary tree. Inspired by this idea, our RLLD algorithm
performs the SC based list decoding algorithm on a binary tree. The low-latency
list decoding algorithm [38] also performs the list decoding algorithm on a binary
tree. Our work [80] and the decoding algorithm in [38] were developed independently.
Both our RLLD algorithm and the low-latency list decoding algorithm [38] try to
reduce the number of visited nodes in the binary tree so that the decoding latency
can be reduced. However, there are some differences.
• Compared with the decoding algorithm in [38], our RLLD algorithm visits
fewer nodes. Inspired by the ML-SSC algorithm, our RLLD algorithm
processes certain arbitrary rate nodes [30] in a fast way.
• When a rate-1 node [30] is visited, our RLLD algorithm employs a less complex
and more hardware friendly algorithm to compute the returned constituent
codewords.
• Our RLLD algorithm is based on LLR messages, while the decoding algorithm
in [38] is based on log-likelihood (LL) messages, which require a larger
memory to store.
In terms of hardware implementations, compared with state-of-the-art SC list decoders
[31–34,36,37], our high throughput list decoder architecture shows advantages
in various aspects.
• For the high throughput list decoder architecture, LLR messages are employed,
while LL messages were used in [31,32,36,37]. LL messages require
more quantization bits and hence a larger memory to store. The area efficient memory
architecture in [32] is employed to store all LLR messages. LLR messages
were also employed in [33,34]. However, the register based memories in [33,34]
suffer from excessive area and power consumption when N is large.
• Our list decoder architecture employs a Hyb-PSU, which is scalable for polar
codes of large block lengths. The register based PSUs of the list decoders
in [31–33] suffer from area overhead when the block length is large. Instead
of copying partial sums directly, our scalable PSU copies only decoding path
indices, which avoids additional energy consumption.
The proposed high throughput list decoder architecture has been implemented for
several block lengths and list sizes under the TSMC 90nm CMOS technology. The
implementation results show that our decoders outperform existing SCL decoders in
both decoding latency and area efficiency. For example, compared with the decoders
of [33], the area efficiency and decoding latency of our decoders are 1.65 to 45 times
and 3.4 to 6.8 times better, respectively.
The rest of the chapter is organized as follows. Related preliminaries are reviewed
in Section 6.2. The proposed RLLD algorithm is presented in Section 6.3. The high
throughput list decoder architecture is presented in Section 6.4. In Section 6.5, the
implementation and comparison results are shown. Finally, conclusions are drawn
in Section 6.6.
6.2 Preliminaries
6.2.1 Polar Codes
Let $u_0^{N-1} = (u_0, u_1, \cdots, u_{N-1})$ denote the data bit sequence and
$x_0^{N-1} = (x_0, x_1, \cdots, x_{N-1})$ the corresponding codeword, where $N = 2^n$. Under the polar encoding,
$x_0^{N-1} = u_0^{N-1} B_N F^{\otimes n}$, where $B_N$ is the bit reversal permutation matrix, and
$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Here $\otimes n$ denotes the $n$th Kronecker power and $F^{\otimes n} = F \otimes F^{\otimes (n-1)}$. For
$i = 0, 1, \cdots, N-1$, $u_i$ is either an information bit or a frozen bit, which is usually set
to zero. For an $(N, K)$ polar code, there are a total of $K$ information bits
within $u_0^{N-1}$. The encoding graph of a polar code with $N = 8$ is shown in Fig. 6.1.
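The encoding rule above can be sketched in a few lines. The following is a minimal illustration, assuming the data word is given as a Python list of 0/1 values whose length is a power of two; the function names are hypothetical. It applies the bit reversal permutation and then evaluates the multiplication by the Kronecker power of F with in-place XOR butterflies instead of building the N × N matrix.

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of the index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_encode(u):
    """Compute x = u * B_N * F^{(x)n} over GF(2) for a 0/1 list u of length N = 2^n."""
    N = len(u)
    n = N.bit_length() - 1
    assert N == 1 << n, "block length must be a power of two"
    # v = u * B_N: B_N is a symmetric permutation matrix, so v[j] = u[bit_reverse(j)].
    x = [u[bit_reverse(j, n)] for j in range(N)]
    # Multiply by F^{(x)n} using n stages of butterflies; with F = [[1, 0], [1, 1]],
    # a length-2 block (a, b) is mapped to (a XOR b, b).
    step = 1
    while step < N:
        for start in range(0, N, 2 * step):
            for i in range(start, start + step):
                x[i] ^= x[i + step]
        step *= 2
    return x


if __name__ == "__main__":
    print(polar_encode([0, 1, 0, 0]))
```

For N = 2 this reduces to $(x_0, x_1) = (u_0 \oplus u_1, u_1)$, which is a convenient sanity check.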
6.2.2 Prior Tree-Based SC Algorithms
A polar code of block length $N = 2^n$ can also be represented by a full binary tree
$G_n$ of depth $n$ [30], where each node of the tree is associated with a constituent
code. For example, for node 1 shown in Fig. 6.2, the corresponding constituent
code is the set $\{(s_{20}, s_{22}, s_{24}, s_{26})\}$, where each element $(s_{20}, s_{22}, s_{24}, s_{26})$ relates to
the data word $u_0^7$ as shown in Fig. 6.1. The binary tree representation of an (8, 3)
polar code is shown in Fig. 6.2, where the black and white leaf nodes correspond to
information and frozen bits, respectively. There are three types of nodes in a binary
tree representation of a polar code: rate-0, rate-1 and arbitrary rate nodes. The
leaf nodes of a rate-0 node correspond only to frozen bits, and the leaf nodes of a
rate-1 node correspond only to information bits. The leaf nodes of an arbitrary rate
node are associated with both information and frozen bits. The rate-0, rate-1 and
arbitrary rate nodes in Fig. 6.2 are represented by circles in white, black and gray, respectively.
Figure 6.1: Polar encoder with N = 8
Figure 6.2: Binary tree representation of an (8, 3) polar code (layer indices 0 to 3, node indices 0 to 14)
The SC algorithm can be mapped on $G_n$, where each node acts as a decoder for
its constituent code. The SC algorithm is initialized by feeding the root node with
the channel LLRs, $(Y_0, Y_1, \cdots, Y_{N-1})$, where $Y_i = \log(\Pr(y_i|x_i = 0)/\Pr(y_i|x_i = 1))$
and $(y_0, y_1, \cdots, y_{N-1})$ is the received channel message vector. As shown in Fig. 6.2,
the decoder at node $v$ receives a soft information vector $\alpha_v$ and returns a constituent
codeword $\beta_v$. When a non-leaf node $v$ is activated by receiving an LLR vector $\alpha_v$,
it calculates a soft information vector $\alpha_v^0$ and sends it to its left child. Node $v$ first
waits until it receives a constituent codeword $\beta_v^0$, and then computes and sends a soft
information vector $\alpha_v^1$ to its right child. Once the right child returns a constituent
codeword $\beta_v^1$, node $v$ computes and returns a constituent codeword $\beta_v$. When a
leaf node $v$ is activated, the returned constituent codeword $\beta_v$ contains only one bit
$\beta_v[0]$, where $\beta_v[0]$ is set to 0 if leaf node $v$ is associated with a frozen bit; otherwise,
$\beta_v[0]$ is calculated by making a hard decision on the received LLR $\alpha_v[0]$, where
$$\beta_v[0] = h(\alpha_v[0]) = \begin{cases} 0, & \alpha_v[0] > 0, \\ 0 \text{ or } 1, & \alpha_v[0] = 0, \\ 1, & \alpha_v[0] < 0. \end{cases} \tag{6.1}$$
From the root node, all nodes in a tree are activated in a recursive way for the SC
algorithm. Once $\beta_v$ for the last leaf node is generated, the codeword $x_0^{N-1}$ can be
obtained by combining and propagating $\beta_v$ up to the root node.
The SSC decoding algorithm in [30] simplifies the processing of both rate-0 and
rate-1 nodes. Once a rate-0 node is activated, it immediately returns the all zero
vector. Once a rate-1 node is activated, a constituent codeword is directly calculated
by making hard decisions on the received soft information vector as shown in
Eq. (6.1). The ML-SSC decoding algorithm [54] further accelerates the SSC decoding
algorithm by performing the exhaustive-search ML decoding on some resource
constrained arbitrary rate nodes, which are called ML nodes in [54]. For an ML
node with layer index $t$, the constituent codeword passed to the parent node $p_v$ is
$$\beta_v = \arg\max_{x \in \mathcal{C}} \sum_{i=0}^{2^{n-t}-1} (1 - 2x[i])\,\alpha_v[i], \tag{6.2}$$
where $\mathcal{C}$ is the constituent code associated with node $v$.
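The exhaustive search of Eq. (6.2) can be illustrated with a short sketch. It assumes that the constituent code is supplied as an explicit list of codewords (a hypothetical interface chosen for clarity; a hardware ML node would not enumerate the code this way) and that the LLR vector is a list of floats.

```python
def ml_node_decode(codebook, alpha):
    """Return the codeword x in `codebook` that maximizes sum_i (1 - 2 x[i]) * alpha[i]."""
    def correlation(x):
        # A bit value 0 contributes +alpha[i] and a bit value 1 contributes -alpha[i].
        return sum((1 - 2 * bit) * llr for bit, llr in zip(x, alpha))
    return max(codebook, key=correlation)


if __name__ == "__main__":
    # Hypothetical length-4 node with a single information bit (a repetition code).
    repetition = [(0, 0, 0, 0), (1, 1, 1, 1)]
    print(ml_node_decode(repetition, [1.5, -0.5, -2.0, -0.25]))
```

In the usage example the LLRs sum to a negative value, so the all-one codeword has the larger correlation and is returned, even though the first LLR alone favors a zero.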
6.2.3 LLR Based List Decoding Algorithms
For SCL decoding algorithms [31, 34, 55], when decoding an information bit $u_i$,
each decoding path splits into two paths with $\hat{u}_i$ being 0 and 1, respectively. Thus
2L path metrics are computed and the L paths corresponding to the L minimum
path metrics are kept. The list decoding algorithms in [31, 55] are performed in either
the probability or the log-likelihood (LL) domain. In [34], an LLR based list
decoding algorithm was proposed to reduce the message memory requirement and
the computational complexity of LL based list decoding algorithms. For decoding
path $l$ ($l = 0, 1, \cdots, L-1$), the LLR based list decoding algorithm employs a novel
path metric
$$PM_l^{(i)} = \sum_{k=0}^{i} D(L_n^{(k)}[l], \hat{u}_k[l]), \tag{6.3}$$
where $D(L_n^{(k)}[l], \hat{u}_k[l]) = 0$ if $h(L_n^{(k)}[l])$ equals $\hat{u}_k[l]$. Otherwise, $D(L_n^{(k)}[l], \hat{u}_k[l]) = |L_n^{(k)}[l]|$.
Here $L_n^{(k)}[l] \triangleq \log\frac{W_n^{(k)}(y_0^{N-1}, \hat{u}_0^{k-1}[l] \mid 0)}{W_n^{(k)}(y_0^{N-1}, \hat{u}_0^{k-1}[l] \mid 1)}$
and $y_0^{N-1} = (y_0, y_1, \cdots, y_{N-1})$ is the received channel message vector.
6.3 Reduced Latency List Decoding Algorithm
6.3.1 SCL Decoding on A Tree
Similar to the SSC decoding algorithm, the SC based list decoding algorithms [31,55]
can also be performed on a full binary tree $G_n$ [38, 80]. The SCL decoding is
initiated by sending the received channel LLR vector to the root node of $G_n$. As
shown in Fig. 6.3, without loss of generality, each internal node $v$ in $G_n$ is activated
by receiving L LLR vectors, $\alpha_{v,0}, \alpha_{v,1}, \cdots, \alpha_{v,L-1}$, from its parent node $v_p$ and is
responsible for producing L constituent codewords, $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$, where $\alpha_{v,l}$
and $\beta_{v,l}$ correspond to decoding path $l$ for $l = 0, 1, \cdots, L-1$. Suppose the layer index
of node $v$ is $t$; then $\alpha_{v,l}$ and $\beta_{v,l}$ have $2^{n-t}$ LLR messages and binary bits, respectively,
for $l = 0, 1, \cdots, L-1$.
Once a non-leaf node $v$ is activated, it calculates L LLR vectors, $\alpha_{v_L,0}, \alpha_{v_L,1}, \cdots, \alpha_{v_L,L-1}$,
and passes them to its left child node $v_L$, where
$$\alpha_{v_L,l}[i] = f(\alpha_{v,l}[2i], \alpha_{v,l}[2i+1]) \tag{6.4}$$
for $0 \le i < 2^{n-t-1}$ and $l = 0, 1, \cdots, L-1$.
Here $f(a, b) = 2\tanh^{-1}(\tanh(a/2)\tanh(b/2))$ and can be approximated as:
$$f(a, b) \approx \mathrm{sign}(a) \cdot \mathrm{sign}(b)\,\min(|a|, |b|). \tag{6.5}$$
Node $v$ then waits until it receives L codewords, $\beta_{v_L,0}, \beta_{v_L,1}, \cdots, \beta_{v_L,L-1}$, from
$v_L$. In the following step, node $v$ calculates another L LLR vectors, $\alpha_{v_R,0}, \alpha_{v_R,1}, \cdots, \alpha_{v_R,L-1}$,
and passes them to its right child node $v_R$, where
$$\alpha_{v_R,l}[i] = g(\alpha_{v,l}[2i], \alpha_{v,l}[2i+1], \beta_{v_L,l}[i]) = \alpha_{v,l}[2i](1 - 2\beta_{v_L,l}[i]) + \alpha_{v,l}[2i+1] \tag{6.6}$$
for $0 \le i < 2^{n-t-1}$ and $l = 0, 1, \cdots, L-1$.
At last, node $v$ waits until it receives L codewords, $\beta_{v_R,0}, \beta_{v_R,1}, \cdots, \beta_{v_R,L-1}$, from
$v_R$. It then calculates $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$ and passes them to its parent node $v_p$,
where
$$(\beta_{v,l}[2i], \beta_{v,l}[2i+1]) = (\beta_{v_L,l}[i] \oplus \beta_{v_R,l}[i], \beta_{v_R,l}[i]), \tag{6.7}$$
for $0 \le i < 2^{n-t-1}$ and $l = 0, 1, \cdots, L-1$.
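The node schedule of Eqs. (6.4) to (6.7) can be sketched for a single decoding path (L = 1), in which case it reduces to plain SC decoding on the tree. This is only an illustrative sketch: the function names are hypothetical, the min-sum approximation of Eq. (6.5) is used for f, the frozen pattern is passed as a list of booleans, and a zero LLR is resolved to bit 0, which is one of the two choices permitted by Eq. (6.1).

```python
def f(a, b):
    """Min-sum approximation of Eq. (6.5): sign(a) * sign(b) * min(|a|, |b|)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g(a, b, beta):
    """Eq. (6.6): a * (1 - 2 * beta) + b."""
    return a * (1 - 2 * beta) + b


def hard(llr):
    """Hard decision of Eq. (6.1); ties at llr == 0 are resolved to 0 here."""
    return 0 if llr >= 0 else 1


def sc_decode(alpha, frozen):
    """Decode the subtree that receives `alpha`; returns (u_hat, beta)."""
    if len(alpha) == 1:                       # leaf node
        bit = 0 if frozen[0] else hard(alpha[0])
        return [bit], [bit]
    half = len(alpha) // 2
    alpha_left = [f(alpha[2 * i], alpha[2 * i + 1]) for i in range(half)]      # Eq. (6.4)
    u_left, beta_left = sc_decode(alpha_left, frozen[:half])
    alpha_right = [g(alpha[2 * i], alpha[2 * i + 1], beta_left[i])             # Eq. (6.6)
                   for i in range(half)]
    u_right, beta_right = sc_decode(alpha_right, frozen[half:])
    beta = []
    for i in range(half):                                                      # Eq. (6.7)
        beta += [beta_left[i] ^ beta_right[i], beta_right[i]]
    return u_left + u_right, beta


if __name__ == "__main__":
    # Noiseless LLRs of the codeword (0, 1, 0, 1) with only u_0 frozen.
    print(sc_decode([4.0, -4.0, 4.0, -4.0], [True, False, False, False]))
```

The list decoder applies the same three computations to every one of the L paths and differs only in what happens at the leaf nodes.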
Figure 6.3: Node activation schedule for SC based list decoding on $G_n$
For $l = 0, 1, \cdots, L-1$, $PM_l$ is the path metric associated with decoding path $l$
and is initialized with 0. When a leaf node $v$ associated with an information bit is
activated, decoding path $l$ splits into two paths with $\beta_{v,l}$ being 0 and 1, respectively.
Note that the layer index of a leaf node is $n$; hence $\alpha_{v,l}$ and $\beta_{v,l}$ have only one LLR
and binary bit, respectively, when node $v$ is a leaf node. For the SCL decoding, 2L
expanded path metrics are computed, where
$$PM_l^j = PM_l + D(\alpha_{v,l}, j), \tag{6.8}$$
for $j = 0, 1$ and $l = 0, 1, \cdots, L-1$. Here $D(\alpha_{v,l}, j) = 0$ if $h(\alpha_{v,l})$ equals $j$. Otherwise,
$D(\alpha_{v,l}, j) = |\alpha_{v,l}|$. Suppose the L minimum expanded path metrics are
$PM_{a_0}^{j_0}, PM_{a_1}^{j_1}, \cdots, PM_{a_{L-1}}^{j_{L-1}}$, which correspond to the L most reliable paths; then $\beta_{v,l} = j_l$
for $l = 0, 1, \cdots, L-1$. Decoding path $a_l$ will be copied to decoding path $l$ before
further partial sum and LLR vector computations. For each decoding path $l$, the path
metric is also updated with $PM_l = PM_{a_l}^{j_l}$. When a leaf node $v$ associated with a
frozen bit is activated, $\beta_{v,l} = 0$ for $l = 0, 1, \cdots, L-1$ are passed to its parent node
$v_p$, and the path metric is updated as $PM_l = PM_l + D(\alpha_{v,l}, 0)$. Note that the SCL algorithm
on a tree described above is equivalent to the SCL algorithms in [31,55].
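The path splitting and pruning at an information leaf node, Eq. (6.8), can be sketched as follows. The function and variable names are hypothetical, a plain sort stands in for the hardware metric sorter, and a zero LLR is resolved to bit 0 as an assumption.

```python
def hard(llr):
    """Hard decision of Eq. (6.1); ties at llr == 0 are resolved to 0 here."""
    return 0 if llr >= 0 else 1


def split_at_info_leaf(path_metrics, leaf_llrs):
    """Expand L paths into 2L candidates and keep the L with the smallest metrics.

    Returns (new_metrics, sources, bits): new path l is a copy of old path
    sources[l] (the index a_l) extended with the bit bits[l] (the value j_l).
    """
    L = len(path_metrics)
    candidates = []
    for l in range(L):
        for j in (0, 1):
            # D(alpha, j) is zero when j agrees with the hard decision, |alpha| otherwise.
            penalty = 0.0 if hard(leaf_llrs[l]) == j else abs(leaf_llrs[l])
            candidates.append((path_metrics[l] + penalty, l, j))   # Eq. (6.8)
    kept = sorted(candidates)[:L]
    new_metrics = [pm for pm, _, _ in kept]
    sources = [a for _, a, _ in kept]
    bits = [j for _, _, j in kept]
    return new_metrics, sources, bits


if __name__ == "__main__":
    # Path 1 already has a large metric, so both survivors descend from path 0.
    print(split_at_info_leaf([0.0, 5.0], [0.5, 2.0]))
```

In the usage example both surviving paths are copies of path 0, which is exactly the situation in which decoding path $a_l$ must be copied to decoding path $l$.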
6.3.2 Proposed RLLD algorithm
In this chapter, a reduced latency list decoding (RLLD) algorithm is proposed to
reduce the decoding latency of SC list decoding for polar codes. For a node $v$, let
$I_v$ denote the total number of its leaf nodes that are associated with information bits.
Let $X_{th}$ be a predefined threshold value and $X_0$ and $X_1$ be predefined parameters.
Our RLLD algorithm performs the SC based list decoding on $G_n$ and follows the
node activation schedule in Section 6.3.1, except when certain types of nodes are
activated. These nodes calculate and return the codewords to their parent nodes
while updating the decoding paths and their metrics, without activating their child
nodes. Specifically:
• When a rate-0 node $v$ is activated, $\beta_{v,l}$ is a zero vector for $l = 0, 1, \cdots, L-1$.
• When a rate-1 node $v$ with $I_v > X_{th}$ is activated, $\beta_{v,l}$ is just the hard decision
of $\alpha_{v,l}$ for $l = 0, 1, \cdots, L-1$. For a well constructed polar code, we observe
that the polarized channel capacities of the information bits corresponding to
rate-1 nodes with $I_v > X_{th}$ are greater than those of the other information
bits. Hence, for rate-1 nodes with $I_v > X_{th}$, our RLLD algorithm considers
only the most reliable candidate codeword for each decoding path, since the
underlying channels are more reliable.
• When a rate-1 node $v$ with $I_v \le X_{th}$ is activated, the returned codewords are
calculated by the proposed candidate generation (CG) algorithm.
• Let $t$ denote the layer index of node $v$. When an arbitrary rate node $v$ with
$I_v \le X_0$ and $2^{n-t} \le X_1$ is activated, each decoding path splits into $2^{I_v}$ paths.
From now on, such an arbitrary rate node is called a fast processing (FP) node.
The proposed metric based search (MBS) algorithm is used to calculate the
returned codewords.
When performed on a binary tree, the SCL algorithms in [31, 55] do the path
expanding and pruning as well as the updating of path metrics when a leaf node is
activated. In contrast, our RLLD algorithm does the path expanding and pruning as
well as the updating of path metrics when a certain intermediate node is visited. Thus,
our RLLD algorithm visits fewer nodes.
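The node rules above amount to a small decision procedure that depends only on the frozen pattern below a node. The sketch below is one way to express it; the function name, the returned labels and the threshold values used in the example are assumptions made for illustration.

```python
def classify_node(frozen, X_th, X_0, X_1):
    """Decide how the RLLD algorithm handles a node.

    `frozen` lists, for the 2^(n-t) leaf nodes under the node, whether each
    leaf is a frozen bit (True) or an information bit (False).
    """
    size = len(frozen)                 # 2^(n-t)
    I_v = frozen.count(False)          # number of information leaves
    if I_v == 0:
        return "rate-0: return all-zero codewords"
    if I_v == size:
        if I_v > X_th:
            return "rate-1: hard decision"
        return "rate-1: CG algorithm"
    if I_v <= X_0 and size <= X_1:
        return "FP node: MBS algorithm"
    return "arbitrary rate: activate child nodes"


if __name__ == "__main__":
    # Hypothetical thresholds X_th = 8, X_0 = 8, X_1 = 16.
    print(classify_node([True, False, False, False], 8, 8, 16))
```

Only the last case descends into the child nodes, which is why fewer nodes are visited than in the leaf-level SCL schedule.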
When a rate-1 node with $I_v > X_{th}$ or a rate-0 node is activated, ideally, $PM_l$ is
updated with $PM_l + \Delta_{v,l}$ for $l = 0, 1, \cdots, L-1$, where $\Delta_{v,l} = \sum_{i=0}^{I_v-1} D(\alpha_{v,l}[i], \beta_{v,l}[i])$.
For each rate-1 node with $I_v > X_{th}$, $\Delta_{v,l} = 0$ since $\beta_{v,l}$ is the hard decision of
$\alpha_{v,l}$. However, for a rate-0 node, $\Delta_{v,l}$ could have a non-zero value. For our RLLD
algorithm, $\Delta_{v,l}$ is also set to 0 for each rate-0 node, since the resulting performance
degradation is negligible. By setting $\Delta_{v,l}$ to 0, we no longer need to calculate
the vectors $\alpha_{v,l}$ sent to a rate-0 node.
Proposed CG Algorithm
When a rate-1 node with $I_v \le X_{th}$ is activated, ideally, we should consider $2^{I_v}$
candidate codewords for each decoding path. Since at most L codewords
from the same decoding path can be passed to the parent node, it is enough
to find only the L most reliable codewords among the $2^{I_v}$ candidates for each decoding
path. When $I_v$ is large (e.g. $I_v > 32$), finding the L most reliable codewords is computationally
intensive and lacks efficient hardware implementations. For our RLLD
algorithm, we consider only the W (W < L) most reliable codewords among the $2^{I_v}$
candidates for each decoding path. In this chapter, W is set to 2, since it results
in efficient hardware implementations at the cost of negligible error performance
loss.
When W = 2, the proposed CG algorithm, shown in Alg. 21, is used to calculate
the codewords passed to the parent node. Besides, the CG algorithm also outputs
L list indices, $a_0, a_1, \cdots, a_{L-1}$, which indicate that decoding path $a_l$ needs to be
copied to path $l$. Suppose the layer index of such a rate-1 node $v$ is $t$. For each
decoding path $l$, there are $2^{I_v} = 2^{2^{n-t}}$ candidate codewords that could be passed to
the parent node $v_p$. However, our CG algorithm considers only the most reliable
codeword $C_{v,l,0}$ and the second most reliable codeword $C_{v,l,1}$. In order to find these
two codewords, each candidate codeword $C_{v,l,j}$ is associated with a node metric
$$NM_l^j = \sum_{k=0}^{I_v-1} m_k |\alpha_{v,l}[k]| \tag{6.9}$$
for $j = 0, 1, \cdots, 2^{I_v}-1$, where $m_k = 0$ if $C_{v,l,j}[k]$ equals $h(\alpha_{v,l}[k])$ and 1 otherwise. As
a result, the smaller a node metric is, the more reliable the corresponding candidate
codeword is. Based on Eq. (6.9), $C_{v,l,0} = h(\alpha_{v,l})$ is the hard decision of the received
LLR vector $\alpha_{v,l}$. $C_{v,l,1}$ is obtained by flipping the $k_{M,l}$-th bit of $C_{v,l,0}$, where $k_{M,l}$ is
the index of the LLR element with the smallest absolute value among $\alpha_{v,l}$.
Each decoding path splits into two paths and has two associated candidate codewords.
Alg. 21 calculates 2L expanded path metrics $PM_l^j$ for $l = 0, 1, \cdots, L-1$ and
$j = 0, 1$ to select the L codewords passed to the parent node. The minL function in
Alg. 21 finds the L smallest values among the 2L input expanded path metrics. Once
$\beta_{v,l}$ for $l = 0, 1, \cdots, L-1$ are computed, decoding path $a_l$ is copied to decoding
path $l$ before further operations.
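Algorithm 21: The proposed CG algorithm
input : $\alpha_{v,0}, \alpha_{v,1}, \cdots, \alpha_{v,L-1}$
output: $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$; $a_0, a_1, \cdots, a_{L-1}$
for $l = 0$ to $L-1$ do
    $k_{M,l} = \arg\min_{k \in \{0,1,\cdots,I_v-1\}} |\alpha_{v,l}[k]|$
    $NM_l^0 = 0$; $C_{v,l,0} = h(\alpha_{v,l})$
    $NM_l^1 = |\alpha_{v,l}[k_{M,l}]|$; $C_{v,l,1} = \mathrm{Flip}(C_{v,l,0}, k_{M,l})$
    $PM_l^j = PM_l + NM_l^j$ for $j = 0, 1$
$(PM_{a_0}^{b_0}, \cdots, PM_{a_{L-1}}^{b_{L-1}}) = \mathrm{minL}(PM_0^0, PM_0^1, \cdots, PM_{L-1}^1)$
for $l = 0$ to $L-1$ do
    $\beta_{v,l} = C_{v,a_l,b_l}$; $PM_l = PM_{a_l}^{b_l}$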
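Alg. 21 translates almost line by line into software. The following is a minimal sketch for W = 2, assuming each $\alpha_{v,l}$ is a list of floats and the path metrics are kept in a list; the function names are hypothetical and a sort replaces the minL hardware unit.

```python
def hard(llr):
    """Hard decision of Eq. (6.1); ties at llr == 0 are resolved to 0 here."""
    return 0 if llr >= 0 else 1


def cg_rate1_node(alphas, path_metrics):
    """Candidate generation for a rate-1 node with W = 2.

    Returns (betas, new_metrics, sources), where sources[l] is the index a_l of
    the old decoding path that must be copied to path l.
    """
    L = len(alphas)
    expanded = []    # entries: (expanded path metric, source path, codeword)
    for l, alpha in enumerate(alphas):
        k_min = min(range(len(alpha)), key=lambda k: abs(alpha[k]))
        c0 = [hard(a) for a in alpha]        # most reliable candidate, NM = 0
        c1 = list(c0)
        c1[k_min] ^= 1                       # second most reliable, NM = |alpha[k_min]|
        expanded.append((path_metrics[l], l, c0))
        expanded.append((path_metrics[l] + abs(alpha[k_min]), l, c1))
    kept = sorted(expanded, key=lambda e: e[0])[:L]      # minL over 2L metrics
    betas = [e[2] for e in kept]
    new_metrics = [e[0] for e in kept]
    sources = [e[1] for e in kept]
    return betas, new_metrics, sources


if __name__ == "__main__":
    print(cg_rate1_node([[2.0, -0.3], [1.0, 4.0]], [0.0, 1.5]))
```

In the usage example both returned codewords originate from path 0, because flipping its least reliable bit costs only 0.3, which is less than the metric of path 1.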
Proposed MBS Algorithm
When an FP node is activated, each current decoding path expands to $2^{I_v}$ paths,
each of which is associated with a candidate codeword. Similar to the CG algorithm,
the proposed MBS algorithm calculates the L codewords passed to the parent node and
L path indices, $a_0, a_1, \cdots, a_{L-1}$. The returned codewords are calculated
as follows.
• For each candidate codeword $C_{v,l}^j$, calculate its corresponding node metric $NM_l^j$
for $j = 0, 1, \cdots, 2^{I_v}-1$ and $l = 0, 1, \cdots, L-1$.
• Calculate $2^{I_v}L$ expanded path metrics $PM_l^j$ for $l = 0, 1, \cdots, L-1$ and
$j = 0, 1, \cdots, 2^{I_v}-1$.
• Find L expanded path metrics among the $2^{I_v}L$ ones. The corresponding candidate
codewords are passed to the parent node $v_p$.
To calculate the node metric, we propose a new method with low computational
complexity. In the literature, two methods can be used: the direct-mapping method
(DMM) shown in Eq. (6.9) and the recursive channel combination (RCC) [37]. In
terms of computational complexity, the former needs $2^{I_v}(2^{n-t}-1)L$ additions, where
$N = 2^n$ and $t$ is the layer index of an FP node $v$. The RCC needs
$\left(\sum_{i=1}^{n-t-1} 2^i 2^{2^{n-t-i}} + 2^{I_v}\right)L$ additions. Compared to the DMM, the RCC approach needs fewer additions.
For our RLLD algorithm, we want to compute these $2^{I_v}$ node metrics in parallel.
However, the parallel hardware implementations of the DMM and RCC algorithms
require large area consumption. This will be discussed in more detail in Section 6.4.3.
In this chapter, a hardware efficient node metric computation method, which
takes advantage of both the DMM and the RCC, is proposed. The proposed method,
referred to as the DR-Hybrid (DRH) method, is shown in Alg. 22, where
$C_{v,l}^j[2i : 2i+1] = (C_{v,l}^j[2i], C_{v,l}^j[2i+1])$, and $r$ is represented by a binary tuple of length two,
i.e. $r = r_0 + 2r_1$. In our method, the RCC approach is used to calculate $\theta_{l,i}$ first.
Then, the DMM is carried out.
Algorithm 22: DR-Hybrid method
for $l = 0$ to $L-1$ do
    /* RCC */
    for $i = 0$ to $2^{n-t-1}-1$ do
        for $r = 0$ to 3 do
            $\theta_{l,i}[(r_0, r_1)] = (1 - 2r_0)\alpha_{v,l}[2i] + (1 - 2r_1)\alpha_{v,l}[2i+1]$;
    /* DMM */
    for $j = 0$ to $2^{I_v}-1$ do
        $NM_l^j = \sum_{i=0}^{2^{n-t-1}-1} \theta_{l,i}[C_{v,l}^j[2i : 2i+1]]$.
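For one decoding path, Alg. 22 can be sketched as below and compared against a direct evaluation of the same sum. The function names are hypothetical and the candidate codewords are assumed to be given as tuples of bits. As written, Alg. 22 accumulates the signed terms $(1 - 2c[k])\alpha[k]$; for nonzero LLRs this sum equals $\sum_k |\alpha[k]| - 2\sum_k m_k|\alpha[k]|$, so it ranks the candidates in the opposite order to the mismatch sum of Eq. (6.9).

```python
from itertools import product


def drh_node_metrics(alpha, candidates):
    """DR-Hybrid evaluation of sum_k (1 - 2 c[k]) * alpha[k] for every candidate c."""
    half = len(alpha) // 2
    # RCC step: for every pair of positions (2i, 2i + 1), precompute the four
    # possible partial sums indexed by the bit pair (r0, r1).
    theta = [
        {(r0, r1): (1 - 2 * r0) * alpha[2 * i] + (1 - 2 * r1) * alpha[2 * i + 1]
         for r0, r1 in product((0, 1), repeat=2)}
        for i in range(half)
    ]
    # DMM step: each candidate only looks up and adds `half` precomputed values.
    return [sum(theta[i][(c[2 * i], c[2 * i + 1])] for i in range(half))
            for c in candidates]


def direct_node_metrics(alpha, candidates):
    """Reference evaluation that adds all 2^(n-t) signed LLRs per candidate."""
    return [sum((1 - 2 * bit) * a for bit, a in zip(c, alpha)) for c in candidates]


if __name__ == "__main__":
    alpha = [1.0, -2.0, 0.5, 3.0]
    cands = list(product((0, 1), repeat=4))
    print(drh_node_metrics(alpha, cands) == direct_node_metrics(alpha, cands))
```

The saving comes from sharing the pairwise sums: they are computed once per node and reused by all candidates.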
The DRH method needs $4 \times 2^{n-t-1} + 2^{I_v}(2^{n-t-1} - 1)$ additions. Taking $I_v = X_0 = 8$ and
$2^{n-t} = X_1 = 16$ as an example, the DMM, RCC and DRH methods need 3840, 864 and
1824 additions per decoding path, respectively. Though our DRH method needs more additions than the RCC, it
results in a more area efficient hardware implementation when all $2^{I_v}$ node metrics
are computed in parallel, since the RCC method needs more complex multiplexors.
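The three addition counts can be reproduced directly from the formulas given above. In this sketch the counts are per decoding path (the common factor L is left out), `size` stands for $2^{n-t}$, and the function names are hypothetical.

```python
def dmm_additions(I_v, size):
    """Direct-mapping method: 2^I_v * (2^(n-t) - 1) additions per path."""
    return 2 ** I_v * (size - 1)


def rcc_additions(I_v, size):
    """Recursive channel combination: sum_{i=1}^{n-t-1} 2^i * 2^(2^(n-t-i)) + 2^I_v."""
    depth = size.bit_length() - 1          # n - t, assuming size is a power of two
    return sum(2 ** i * 2 ** (2 ** (depth - i)) for i in range(1, depth)) + 2 ** I_v


def drh_additions(I_v, size):
    """DR-Hybrid method: 4 * 2^(n-t-1) + 2^I_v * (2^(n-t-1) - 1)."""
    return 4 * (size // 2) + 2 ** I_v * (size // 2 - 1)


if __name__ == "__main__":
    print(dmm_additions(8, 16), rcc_additions(8, 16), drh_additions(8, 16))
```

For $I_v = 8$ and $2^{n-t} = 16$ this evaluates to 3840, 864 and 1824.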
Once we have $2^{I_v}L$ node metrics and corresponding candidate codewords, $2^{I_v}L$
expanded path metrics $PM_l^j = PM_l + NM_l^j$ for $l = 0, 1, \cdots, L-1$ and
$j = 0, 1, \cdots, 2^{I_v}-1$ can be computed. The next step is selecting L returned codewords
and their corresponding expanded path metrics.
Ideally, we should find the L minimum expanded path metrics, which correspond
to the L most reliable codewords, among the $2^{I_v}L$ ones. However, directly finding the
L minimum values among $2^{I_v}L$ ones is computationally intensive and lacks an efficient
hardware implementation. A bitonic sequence based sorter [32] (BBS) with $2^{I_v}L$
inputs is able to fulfill this task. Such a BBS takes
$2^{I_v-1}L\left(\sum_{i=1}^{s-1} i\right) + 2^{I_v-2}L$ compare-and-switch (CS) units [32],
where each CS unit has one comparator and two 2-to-1 multiplexers and
$s = \log_2(2^{I_v}L)$. For example, when $I_v = 8$ and $L = 4$, such a BBS
needs 23296 CS units. In order to simplify the hardware implementation, a two-stage
sorting scheme was proposed in [37], where the first stage selects the q (q < L) smallest
node metrics from $2^{I_v}$ ones for each decoding path. The second stage selects the
L smallest metrics from the Lq expanded path metrics produced by the first stage.
Compared with the direct sorting scheme [32, 36], the hardware implementation of
the two-stage sorting scheme is more efficient at the cost of certain error performance
degradation.
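The CS-unit count quoted from [32] can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `bbs_cs_units` is ours, and the number of sorter inputs is assumed to be a power of two.

```python
def bbs_cs_units(i_v, list_size):
    """CS units of a bitonic sorter with 2^Iv * L inputs, per the count quoted from [32]."""
    inputs = (2 ** i_v) * list_size
    s = inputs.bit_length() - 1            # s = log2(2^Iv * L), inputs assumed a power of two
    stage_sum = sum(range(1, s))           # sum_{i=1}^{s-1} i
    # 2^(Iv-1) * L = inputs / 2 and 2^(Iv-2) * L = inputs / 4
    return (inputs // 2) * stage_sum + inputs // 4

cs_units = bbs_cs_units(8, 4)              # the example in the text: Iv = 8, L = 4
```

For $I_v = 8$ and $L = 4$ the sorter has 1024 inputs, so $s = 10$ and the count evaluates to $512 \times 45 + 256 = 23296$.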
In this chapter, our MBS algorithm employs the two-stage sorting scheme and
improves the first stage in the following two aspects:
• Instead of using a fixed q, our MBS algorithm employs a dynamic $q_{I_v,L}$
($q_{I_v,L} \le L$), which is a power of 2 and depends on both $I_v$ and L.
• An approximated sorting (ASort) method is used to select $q_{I_v,L}$ metrics
from $2^{I_v}$ ones. Though the selected metrics are not always precisely the $q_{I_v,L}$
smallest among the $2^{I_v}$ ones, our ASort method leads to an efficient hardware
implementation.
Our ASort method is illustrated as follows:
• When $2^{I_v} \le 2L$, the BBS with 2L inputs and L outputs is used to select the
$q_{I_v,L}$ minimum node metrics from $2^{I_v}$ ones.
• When $2^{I_v} > 2L$, all $2^{I_v}$ node metrics are divided into $q_{I_v,L}$ groups as follows:
$$\underbrace{NM^{0}_{l}, \cdots, NM^{m-1}_{l}}_{\text{group } 1}, \; \cdots, \; \underbrace{NM^{(q_{I_v,L}-1)m}_{l}, \cdots, NM^{q_{I_v,L}m-1}_{l}}_{\text{group } q_{I_v,L}}.$$
Here $m = 2^{I_v}/q_{I_v,L}$. The two minimum node metrics of each group are first computed.
The BBS then selects the $q_{I_v,L}$ minimum node metrics among the resulting $2q_{I_v,L}$ ones.
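The grouping step can be illustrated in software. The sketch below is not the hardware design: Python's `sorted` stands in for the BBS and for the per-group search of the two minima, and the function name `asort` and the toy metrics are our own.

```python
def asort(node_metrics, q, list_size):
    """Approximate selection of q node metrics for one path; returns (metric, index) pairs."""
    indexed = [(m, j) for j, m in enumerate(node_metrics)]
    if len(indexed) <= 2 * list_size:
        # Small case: an exact sorter picks the q smallest directly.
        return sorted(indexed)[:q]
    group_size = len(indexed) // q
    finalists = []
    for g in range(q):
        group = indexed[g * group_size:(g + 1) * group_size]
        finalists.extend(sorted(group)[:2])    # two minima of each group
    return sorted(finalists)[:q]               # q smallest among the 2q finalists

# Toy example with 16 metrics, q = 4 and L = 4 (values are hypothetical).
metrics = [9, 1, 2, 3, 8, 7, 6, 5, 4, 10, 11, 12, 13, 14, 15, 16]
picked = asort(metrics, q=4, list_size=4)
```

In this toy case the first group holds three of the four globally smallest metrics (1, 2 and 3), but only two survive per group, so the result is 1, 2, 4, 5 rather than 1, 2, 3, 4. The global minimum is always retained, since it is the minimum of its own group.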
After the first stage of sorting, the number of expanded path metrics $N_e$ could
be $2L, 4L, \cdots, L \times L$. The second stage of sorting is the same as that in [37]. A
binary tree of 2L-L BBSs is employed to sort out the final L minimum expanded
path metrics. Taking $N_e = 4L$ as an example, there are 4L expanded path metrics:
$PM^{j_0}_{l_0}, PM^{j_1}_{l_1}, \cdots, PM^{j_{4L-1}}_{l_{4L-1}}$. Then
$PM^{j_0}_{l_0}, \cdots, PM^{j_{2L-1}}_{l_{2L-1}}$ and
$PM^{j_{2L}}_{l_{2L}}, \cdots, PM^{j_{4L-1}}_{l_{4L-1}}$ are
applied to two 2L-L BBSs, respectively. Thus, a total of 2L metrics are selected.
Then a 2L-L BBS is employed again to generate the final L minimum expanded
path metrics: $PM^{j'_0}_{l'_0}, PM^{j'_1}_{l'_1}, \cdots, PM^{j'_{L-1}}_{l'_{L-1}}$.
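The tree of 2L-L sorters can be sketched as a loop that halves the number of surviving metrics at each level. This is a software illustration only: `sorted` stands in for each BBS, and the function names and the sample metrics are hypothetical.

```python
def select_2l_to_l(metrics, list_size):
    """Stand-in for one 2L-to-L sorter: keep the L smallest of up to 2L inputs."""
    return sorted(metrics)[:list_size]

def second_stage(expanded_metrics, list_size):
    """Reduce Ne = 2L, 4L, ... expanded path metrics to L with a tree of 2L-to-L sorters."""
    level = list(expanded_metrics)
    while len(level) > list_size:
        chunk = 2 * list_size
        level = [m for start in range(0, len(level), chunk)
                 for m in select_2l_to_l(level[start:start + chunk], list_size)]
    return level

# Ne = 4L = 16 metrics for L = 4 (values are hypothetical).
metrics = [7, 3, 9, 1, 12, 5, 8, 2, 11, 4, 10, 6, 15, 0, 14, 13]
survivors = second_stage(metrics, 4)
```

Unlike the first stage, this tree is exact: each of the L globally smallest metrics is among the L smallest of whatever chunk it falls into, so it survives every level.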
6.3.3 Discussions on the Parameters of Our RLLD Algorithm
For our RLLD algorithm, the returned codewords from rate-1 nodes with $I_v > X_{th}$
are obtained by making hard decisions on the received LLR vectors. The other
rate-1 nodes are processed by our CG algorithm. Note that both the hard decision
approach and our CG algorithm could cause potential error performance degradation,
since ideally we should consider $2^{I_v}$ candidate codewords for each decoding
path. With more rate-1 nodes (decreasing $X_{th}$) being processed by the hard decision
approach, the decoding latency could be reduced at the cost of more error
performance degradation.
The choices of X0 and X1 are tradeoffs between implementation complexity and
achieved decoding latency reduction. Ideally, we want X0 and X1 to be as large as
possible so that more data bits could be decoded in parallel. However, the number of
adders needed by a hardware implementation of Alg. 22 is proportional to
$2^{I_v}2^{n-t}$, which grows exponentially with $I_v$. Thus, only moderate values of X0
and X1 are realistic for practical implementations.
For the two-stage sorting scheme of our MBS algorithm, we want $q_{I_v,L}$ to be
as small as possible so that the sorting complexity could be minimized. However,
reducing $q_{I_v,L}$ could degrade the resulting error performance, since ideally we
need to consider the L most reliable candidate codewords for each decoding path.
As a result, the selections of $q_{I_v,L}$ are tradeoffs between sorting complexity and error
performance.
6.3.4 Comparison with Related Algorithms
If we perform the SC based list decoding algorithms [31, 55] on a tree, then all
2N − 1 nodes of the tree will be activated. For our RLLD algorithm, denote $n_a$ as
the number of activated nodes. Then we have $n_a < 2N - 1$, where $n_a$ is determined
by the block length N, the code rate, the locations of frozen bits and the parameters
X0 and X1. X0 and X1 are used to identify all FP nodes. The reduction of the
number of activated nodes translates into reduced decoding latency and increased
throughput. Taking the (8, 3) polar code in Fig. 6.2 as an example and supposing
X0 = 1 and X1 = 2, only 5 nodes (nodes 0, 1, 2, 5 and 6) need to be activated by our
RLLD algorithm, whereas the previous algorithms [31,55] need to activate all 15 nodes.
The CA-SCL decoding algorithm was also performed on a binary tree in [38].
Compared with the low-latency list decoding algorithm [38], our RLLD algorithm
employs the proposed MBS algorithm to process FP nodes, while FP nodes were
processed by activating their child nodes in [38]. Our MBS algorithm results in decreased
decoding latency at the cost of potential error performance loss. Besides, our RLLD
algorithm takes a simpler approach when a rate-1 node is activated. In [38], a
Chase-like algorithm was used to calculate the L codewords passed to the parent
node when a rate-1 node is activated. Compared to the Chase-like algorithm, our CG
algorithm has lower computational complexity and is more suitable for hardware
implementation due to the following facts:
(1) The Chase-like algorithm in [38] was performed in the log-likelihood (LL)
domain, while our method is performed in the LLR domain. Compared with our LLR
based method, the Chase-like algorithm takes more additions to calculate the related
metrics.
(2) For each decoding path, the Chase-like algorithm considers
$1 + \binom{c}{1} + \binom{c}{2}$ candidate constituent codewords, where c = 2 in [38],
i.e. four candidates. In contrast, our method considers only two constituent codewords.
(3) In order to find the L best decoding paths and their constituent codewords,
the Chase-like algorithm creates a candidate path list. The final L candidates are
determined by inserting and removing elements from the list. The Chase-like algorithm
is suitable for software implementations. However, hardware implementations
of the Chase-like algorithm have not been discussed in [38]. On the other hand, with
a bitonic based sorter [32] (BBS), the L most reliable decoding paths can be decided
in parallel for our CG algorithm.
6.3.5 Simulation Results
For an (8192, 4096) polar code, the bit error rate (BER) performances of the
proposed RLLD algorithm as well as other algorithms are shown in Fig. 6.4. In Fig. 6.4,
CSx denotes the CA-SCL decoding algorithm with L = x, where CRC-32 is used.
Rx-y denotes our RLLD algorithm with L = x and $X_{th} = y$. The values of $q_{I_v,L}$
under different list sizes and $I_v$ values are shown in Table 6.1. For all simulated
algorithms, the additive white Gaussian noise (AWGN) channel and binary phase-shift
keying (BPSK) modulation are used. For all simulated RLLD algorithms, X0 = 8
and X1 = 16 are chosen with practical implementations in mind.
Table 6.1: The Values of $q_{I_v,L}$ under Different List Sizes and $I_v$ Values

  L \ Iv |   1   2   3   4   5   6   7   8
  -------+--------------------------------
     2   |   2   2   2   2   2   2   2   2
     4   |   2   4   4   4   4   4   4   2
     8   |   2   4   8   8   8   8   4   2
    16   |   2   4   8   8   8   8   8   2
    32   |   2   4   8   8   8   4   4   2
[Figure 6.4 plot: bit error rate (BER, 10^-9 to 10^-1) versus SNR (dB, 1.2 to 2.8) for SC, CS2, R2-8, R2-64, CS4, R4-8, R4-128, CS8, R8-128, CS16, R16-256, CS32 and R32-256.]
Figure 6.4: BER performance for an (8192, 4096) polar code
Based on the simulation results shown in Fig. 6.4, we observe that R2-8 performs
nearly the same as CS2 and R2-64. When the list size increases to L = 4, R4-8 shows
obvious error performance degradation compared with CS4 when the BER is below
$10^{-7}$. The degradation is reduced by increasing $X_{th}$ to 128, as R4-128 performs
nearly the same as CS4. When the list size increases further (e.g. L = 16 and 32),
the error performance degradation appears again at low BER levels even with
$X_{th} = 256$. As shown in Fig. 6.4, R16-256 and R32-256 are worse than CS16 and CS32
when the BER is below $10^{-5}$ and $10^{-6}$, respectively. Note that for the (8192, 4096)
polar code in this chapter, $I_v$ of a rate-1 node is at most 256.
Depending on the specific list size, we predict that our RLLD algorithm will
show performance degradation compared to the CA-SCL algorithm at certain BER
values even when all rate-1 nodes are processed by the proposed CG algorithm.
Nevertheless, for the (8192, 4096) polar code, our RLLD algorithm can still show
obvious advantage in terms of error performance compared with the SC algorithm.
The root causes of the error performance degradation of our RLLD algorithm may
be as follows:
(1) For our RLLD algorithm, when a rate-1 node with $I_v \le X_{th}$ is activated, only
the two most reliable constituent codewords are kept. When the list size L is large, there
may not be enough candidate codewords to include the correct codeword, since our
CG algorithm could miss certain good candidate codewords.
(2) When a rate-1 node with $I_v > X_{th}$ is activated, only the most reliable
candidate codeword is considered for each decoding path, which could also cause error
performance degradation.
(3) During the first sorting stage of our MBS algorithm, when $2^{I_v} > L$, $q_{I_v,L}$ is
selected to be no greater than L for certain $I_v$ values for efficient hardware
implementation. As a result, we may lose certain good candidate codewords due to the
limitation on $q_{I_v,L}$.
6.4 High Throughput List Polar Decoder Architecture
6.4.1 Top Decoder Architecture
[Figure 6.5 block diagram: CMEM, IMEM, buffers LBuf0 to LBuf2 and CBuf, processing unit arrays PUA0 to PUAL-1, the Hyb-PSU with CU0 to CUL-1, the PPU, and the data input Din and output Dout.]
Figure 6.5: Decoder top architecture
In this chapter, based on the proposed RLLD algorithm, a high throughput list
decoder architecture, shown in Fig. 6.5, for polar codes is proposed. In Fig. 6.5, the
channel message memory (CMEM) stores the received channel LLRs, and the inter-
nal LLR message memory (IMEM) stores the LLRs generated during the SC com-
putation process. With the concatenation and split method in our prior work [32],
the IMEM is implemented with area efficient memories, such as register file (RF) or
SRAM. The proposed architecture has L groups of processing unit arrays (PUAs),
each of which contains T processing units [20] (PUs) and is capable of performing
either the f or the g computation. The hybrid partial sum unit (Hyb-PSU) in Fig. 6.5
consists of L computation units, $CU_0, CU_1, \cdots, CU_{L-1}$, which are responsible for
updating the partial sums of the L decoding paths, respectively. The path pruning unit
(PPU) in Fig. 6.5 finds the list indices and the corresponding constituent codewords for
the L surviving decoding paths.
Both our high throughput list decoder architecture in Fig. 6.5 and that in [32]
employ a partial parallel processing method. Besides, both architectures contain a
channel message memory and internal message memory. However, compared to the
architecture in [32], the major improvements of our list decoder architecture are:
(a) Instead of LL messages, our high throughput list decoder architecture em-
ploys LLR messages, which result in more area efficient internal and channel message
memories.
(b) The PPU in Fig. 6.5 implements our CG and MBS algorithms, while the
PPU in [32] is just a sorter which selects L values among 2L ones. Due to the
proposed PPU, our decoder architecture achieves much higher throughput than
that in [32].
(c) Our list decoder architecture employs a novel Hyb-PSU, which is more area
and energy efficient than that in [32]. Our Hyb-PSU is based on the proposed index
based partial sum computation algorithm. When a decoding path needs to be copied
to another one, our Hyb-PSU avoids copying partial sums directly by copying only
decoding path indices. In contrast, the PSU in [32] copies partial sums directly, which
incurs additional energy consumption. Our Hyb-PSU stores most of the partial
sums in area efficient memories, while the PSU in [32] stores all the partial sums in
area demanding registers. Hence, our Hyb-PSU is scalable for larger block lengths.
6.4.2 Memory Efficient Quantization Scheme
For an SC or SCL decoder, the message memory occupies a large part of the overall
decoder area [20, 32]. An SCL decoder needs a channel message memory and an
internal message memory. For an LLR based SCL decoder, the channel memory
stores N channel LLR messages. The internal message memory stores Ln LLR
matrices: $P_{l,t}$ for $l = 0, 1, \cdots, L-1$ and $t = 1, 2, \cdots, n$, where $P_{l,t}$ has
$2^{n-t}$ LLR messages.
For a fixed point implementation of our RLLD algorithm, it is straightforward
to quantize all LLRs in the internal memory with Q bits. In this chapter, a memory
efficient quantization (MEQ) scheme is proposed to reduce the size of the internal
memory. f(a, b) in Eq. (6.5) has the same magnitude range as those of a and b,
while the magnitude range of g(a, b, s) in Eq. (6.6) is at most twice those of
a and b (s is either 0 or 1). Since $P_{0,t}, P_{1,t}, \cdots, P_{L-1,t}$ are computed based on
$P_{0,t-1}, P_{1,t-1}, \cdots, P_{L-1,t-1}$, for a decoding path l, the LLRs in $P_{l,t_1}$ may need a
greater magnitude range than that of the LLRs in $P_{l,t_2}$, where $t_1 > t_2$. Suppose each
channel LLR is quantized with $Q_c$ bits. The proposed MEQ scheme is as follows:
(1) Suppose all LLRs within the internal memory are quantized with $Q_m$ bits.
Determine the minimal $Q_m$ such that the error performance degradation of the
fixed-point decoder is negligible.
(2) Let $t_1, t_2, \cdots, t_r$ be r integers, where $t_1 \le t_2 \le \cdots \le t_r \le n$ and
$r = Q_m - Q_c$. Denote $\mathbf{P}_t = (P_{0,t}, P_{1,t}, \cdots, P_{L-1,t})$. Suppose the LLRs
associated with $\mathbf{P}_1, \mathbf{P}_2, \cdots, \mathbf{P}_{t_1}$ are quantized with $Q_c$ bits and
the remaining LLRs are quantized with $Q_m$ bits. Decide the maximal $t_1$ such that the
resulting fixed-point error performance degradation is negligible. Once $t_1$ is decided,
suppose the LLRs within $\mathbf{P}_{t_1+1}, \mathbf{P}_{t_1+2}, \cdots, \mathbf{P}_{t_2}$ are
all quantized with $Q_c + 1$ bits, and find the maximal $t_2$ such that the corresponding
error performance degradation is negligible. In this way, $t_3, \cdots, t_r$ are decided in
a serial manner so that $\mathbf{P}_{t_i+1}, \mathbf{P}_{t_i+2}, \cdots, \mathbf{P}_{t_{i+1}}$ are
quantized with $Q_c + i$ bits for $1 \le i \le r-1$, and $\mathbf{P}_j$ is quantized with
$Q_m$ bits for $j > t_r$.
With the proposed MEQ scheme, the number of bits stored in the internal memory
is
$$N_B = \sum_{j=1}^{r+1} \sum_{t=t_{j-1}+1}^{t_j} L\,2^{n-t}\,(Q_c + j - 1), \qquad (6.10)$$
where $t_0 = 0$ and $t_{r+1} = n$ are introduced for convenience.
In order to show the effectiveness of our MEQ scheme, the error performances of
our RLLD algorithm with the proposed MEQ scheme are shown in Fig. 6.6, where
the RLLD algorithm with our MEQ scheme is compared with the floating-point
CA-SCL decoding algorithm, the floating-point RLLD algorithm, and the RLLD algorithm
with a uniform quantization scheme for three different polar codes, (1024, 512),
(8192, 4096) and (32768, 29504), with $X_{th}$ = 32, 128 and 1024, respectively. For all
fixed-point decoders, each channel LLR is quantized with $Q_c = 5$ bits. For the
RLLD algorithm with uniform quantization, each LLR in the internal memory is
quantized with $Q_m = 7$ bits. Since $Q_m - Q_c = 2$, we need to determine two integers,
$t_1$ and $t_2$, for our MEQ scheme. When $N = 2^{10}$, $2^{13}$ and $2^{15}$,
$(t_1, t_2)$ = (1,2), (3,4) and (4,5), respectively. As shown in Fig. 6.6, the performance
degradation caused by our MEQ scheme is small. Compared with the uniform quantization,
the proposed MEQ scheme reduces the number of stored bits by 17%, 25% and 27% for
$N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively.
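The memory sizes behind these percentages can be reproduced from Eq. (6.10). The sketch below rests on our reading of the text: the reported integer pairs (1,2), (3,4) and (4,5) are taken as the layer thresholds, the uniform scheme stores every internal LLR with $Q_m$ bits, and the function names are our own. The list size cancels in the ratio.

```python
def meq_bits(n, list_size, q_c, thresholds):
    """Internal-memory bits under MEQ, Eq. (6.10); thresholds = (t_1, ..., t_r)."""
    bounds = [0, *thresholds, n]                 # t_0 = 0 and t_{r+1} = n
    total = 0
    for j in range(1, len(bounds)):
        for t in range(bounds[j - 1] + 1, bounds[j] + 1):
            total += list_size * 2 ** (n - t) * (q_c + j - 1)
    return total

def uniform_bits(n, list_size, q_m):
    """Internal-memory bits when every LLR matrix uses q_m bits per LLR."""
    return sum(list_size * 2 ** (n - t) * q_m for t in range(1, n + 1))

def reduction(n, thresholds, list_size=4, q_c=5, q_m=7):
    """Fraction of internal-memory bits removed by MEQ relative to uniform quantization."""
    return 1 - meq_bits(n, list_size, q_c, thresholds) / uniform_bits(n, list_size, q_m)

savings = {n: reduction(n, th) for n, th in [(10, (1, 2)), (13, (3, 4)), (15, (4, 5))]}
```

Under these assumptions the reductions come out to roughly 17.9%, 25.9% and 27.2%, which agrees with the reported 17%, 25% and 27% if the reported figures are truncated.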
[Figure 6.6 plot: FER (10^-6 to 10^0) versus SNR (dB, 1 to 5) for floating-point CA-SCL, floating-point RLLD, uniform quantization and MEQ, with curve groups for N = 2^10, 2^13 and 2^15.]
Figure 6.6: Effects of the proposed MEQ scheme on the error performances
6.4.3 Proposed path pruning unit
When a rate-1 node with $I_v \le W_T$ or an FP node is activated, each decoding path
splits into multiple ones and only the L most reliable paths are kept. The PPU in
Fig. 6.5 implements our CG and MBS algorithms, and is responsible for calculating
L returned codewords, $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$, and L path indices,
$a_0, a_1, \cdots, a_{L-1}$. For $l = 0, 1, \cdots, L-1$, decoding path l copies from decoding
path $a_l$ before further decoding steps.
Taking L = 4 as an example, the proposed PPU is shown in Fig. 6.7, and it can be
easily adapted to other L values. Our PPU in Fig. 6.7 has two types of node metric
generation (NG) units, NG-I and NG-II, which compute the node metrics for a rate-1
node and an FP node, respectively. NG-I$_l$ and NG-II$_l$ correspond to decoding path
l. For decoding path l, the expanded path metrics $PM^{j}_{l}$ are obtained by adding the
node metrics to the path metric $PM_l$, which is stored in the path metric registers
(PMR) and initialized with 0.
When a rate-1 node is activated, NG-I$_l$ outputs two node metrics for
$l = 0, 1, \cdots, L-1$. After the 2L expanded path metrics are computed, a stage of metric
sorter ($MS_{2L-L}$) selects the L minimum metrics and their corresponding codewords
from the 2L ones. The metric sorter $MS_{2L-L}$ implements the $\min_L$ function in Alg. 21
and can be constructed with a BBS. When an FP node is activated, L NG-II modules
implement the first part of our two-stage sorting scheme. For each decoding
path, $q_{I_v,L}$ node metrics and their corresponding codewords are computed. The tree
of metric sorters selects the L minimum metrics among the $q_{I_v,L}L$ ones. This is achieved
by $\log_2 q_{I_v,L}$ stages of metric sorters when $q_{I_v,L}$ is a power of 2. The expanded
path metrics output by the last stage of metric sorters are saved in the PMR. The
corresponding codewords of the selected L expanded path metrics are also chosen.
The related circuitry is omitted for simplicity.
[Figure 6.7 block diagram: NG-I0 to NG-I3 and NG-II0 to NG-II3 feeding MS8-4 metric sorters, with path metrics PM0 to PM3 held in the PMR.]
Figure 6.7: The proposed architecture for PPU
The micro architecture of NG-I$_l$ is shown in Fig. 6.8. The most complex part of
NG-I$_l$ is finding the minimum LLR magnitude and its corresponding index among
the LLR vector $|\alpha_{v,l}| \triangleq (|\alpha_{v,l}[0]|, |\alpha_{v,l}[1]|, \cdots, |\alpha_{v,l}[I_v - 1]|)$.
Since the node metric of the most reliable candidate codeword is always 0, we only
need to compute $NM^{1}_{l} = |\alpha_{v,l}[k_{M,l}]|$ in Fig. 6.8, which is the node metric of
the second most reliable candidate codeword, with a corresponding index $k_{M,l}$. For our
list decoder architecture, for each decoding path, at most T LLRs are computed in one
clock cycle, since we have only T PUs per decoding path. The Min-1 unit in Fig. 6.8 is
capable of finding the minimum value, mLLR, and its corresponding index, mIdx, from at
most T parallel inputs. When $I_v \le T$, $NM^{1}_{l}$ = mLLR and $k_{M,l}$ = mIdx.
$C_{v,l,0} = h(\alpha_{v,l})$ in Fig. 6.8 is the hard decision of $\alpha_{v,l}$, which is the most
reliable candidate codeword. The second most reliable candidate codeword is obtained by
flipping the $k_{M,l}$-th bit of $C_{v,l,0}$.
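[Figure 6.8 block diagram: Min-1 unit producing mLLR and mIdx, registers mLR and mIR with a comparator (cmp), and codeword memories HCM0 and HCM1; inputs |α_{v,l}| and h(α_{v,l}), outputs k_{M,l}, NM^1_l and C_{v,l,0}.]
Figure 6.8: Hardware architecture of the proposed NG-Il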
When $I_v > T$, suppose T is a power of 2; then $I_v$ is divisible by T. During
each clock cycle, only T LLRs are fed to NG-I$_l$, and the minimum value and its
corresponding index are computed in a partial parallel way. The minimum value
and associated index of the first T inputs are stored in mLR and mIR, respectively.
The minimum value of the second group of T inputs is compared with the current
value stored in mLR, and is stored in mLR if it is smaller than the current value
of mLR. This repeats until the whole LLR vector $\alpha_{v,l}$ is processed. At last, the
minimum value of $|\alpha_{v,l}|$ and its index are stored in mLR and mIR, respectively.
The hard decision of $\alpha_{v,l}$ is stored in the hard decoded constituent codeword
memory (HCM0), and is copied to HCM1 when the second most reliable constituent
codeword is computed.
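The behaviour of NG-I described above can be mimicked in software as follows. This is a behavioural sketch, not the circuit: the loop body models one clock cycle with at most T LLRs, the variables `m_lr` and `m_ir` play the role of the mLR and mIR registers, and the hard-decision convention (a non-negative LLR maps to bit 0) is our assumption, since h(·) is defined outside this section.

```python
def ng1(alpha, t):
    """Two candidates for a rate-1 node: the hard decision and one least-reliable-bit flip."""
    m_lr, m_ir = float("inf"), -1              # running minimum |LLR| and its index
    for start in range(0, len(alpha), t):      # one chunk of at most T LLRs per clock cycle
        chunk = alpha[start:start + t]
        m_idx = min(range(len(chunk)), key=lambda k: abs(chunk[k]))   # Min-1 unit
        if abs(chunk[m_idx]) < m_lr:           # comparator against the stored minimum
            m_lr, m_ir = abs(chunk[m_idx]), start + m_idx
    # Assumed convention: a non-negative LLR is decided as bit 0.
    best = [0 if a >= 0 else 1 for a in alpha]
    second = best.copy()
    second[m_ir] ^= 1                          # flip the least reliable position
    return (best, 0.0), (second, m_lr)

# Toy LLR vector with Iv = 8 processed T = 4 values at a time (values are hypothetical).
(best, nm0), (second, nm1) = ng1([2.0, -3.5, 0.5, -1.0, 4.0, -0.25, 1.5, 3.0], t=4)
```

Here the first chunk yields a minimum magnitude of 0.5 at index 2, which the second chunk replaces with 0.25 at index 5, so the second candidate differs from the hard decision only at position 5.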
The micro-architecture of NG-II$_l$ under X0 = 8 and X1 = 16 is shown in Fig. 6.9,
where the block MUX4T256 includes 256 4-to-1 multiplexers. Our NG-II$_l$ consists
of two parts, where the first part calculates $2^{I_v}$ node metrics,
$NM^{0}_{l}, NM^{1}_{l}, \cdots, NM^{2^{I_v}-1}_{l}$, based on Alg. 22. The second part implements
the first stage sorting of our MBS algorithm. For L = 4, when $2^{I_v} > 2L$, the $2^{I_v}$
metrics are first divided into four groups. The Min-2 [81] block is modified slightly to
find the two minimum node metrics and their associated indices for each metric group.
The $MS_{8-4}$ block calculates the final output metrics. When $2^{I_v} = 2L = 8$, the
$MS_{8-4}$ block works directly on the 2L = 8 expanded path metrics. When $2^{I_v} \le L$,
the expanded path metrics are output directly. As shown in Figs. 6.7 to 6.9, our PPU has
a long critical path delay, since there are many levels of logic from the inputs to the
outputs. Pipelining should be used to improve the overall decoder frequency.
[Figure 6.9 block diagram: inputs |α_{v,l}[0]| to |α_{v,l}[15]|, MUX4T256 blocks and SUM units producing NM^0_l to NM^255_l, followed by four Min-2 blocks and an MS8-4 sorter.]
Figure 6.9: Architecture of NG-IIl
Based on the DMM method in Eq. (6.9), the node metric computation part needs
$2^{I_v}(2^{n-t} - 1)L$ adders and $2^{I_v}2^{n-t}L$ 2-to-1 multiplexers, where $N = 2^n$ and t is
the layer index of an FP node v. Based on the RCC method, it takes
$\left(\sum_{i=1}^{n-t-1} 2^{i}\,2^{2^{n-t-i}} + 2^{I_v}\right)L$ adders, $2^{I_v+1}L$
$2^{2^{n-t-1}}$-to-1 multiplexers and $4 \times 2^{n-t-1}L$ 2-to-1 multiplexers.
In contrast, based on our DRH method, it takes $4 \times 2^{n-t-1} + 2^{I_v}(2^{n-t-1} - 1)$
adders, $2^{I_v}2^{n-t-1}$ 4-to-1 and $4 \times 2^{n-t-1}$ 2-to-1 multiplexers. Table 6.2
compares the hardware resources needed by the DMM, RCC and DR-Hybrid methods when
X0 = 8, X1 = 16, and $\alpha_{v,l}[j]$ ($0 \le j < 2^{n-t}$) is a 6-bit LLR. As shown in
Table 6.2, the DRH method requires the smallest total area. Besides, the implementations
based on DMM, RCC and DRH have roughly the same critical path delay.
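Table 6.2: Hardware resources needed by different methods per list

                             |    DMM    |    RCC     |   DRH
  ---------------------------+-----------+------------+---------
  # of adders                |   3840    |    864     |   1824
  # of 2-to-1 MUXes          |   4096    |     32     |     32
  # of 4-to-1 MUXes          |      0    |      0     |   2048
  # of 256-to-1 MUXes        |      0    |    512     |      0
  total area (# of NANDs)    | 313,967   | 1,673,810  | 229,449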
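The adder and multiplexer rows of Table 6.2 follow from the per-list counts given in the text. The sketch below evaluates them, assuming that X0 = 8 and X1 = 16 correspond to $I_v = 8$ and $2^{n-t} = 16$ (our reading of the parameters); the function names are hypothetical, and the area row is not modelled.

```python
def dmm_resources(i_v, k):
    """Per-list adders and 2-to-1 multiplexers of the DMM method, with k = n - t."""
    return {"adders": 2 ** i_v * (2 ** k - 1), "mux2": 2 ** i_v * 2 ** k}

def rcc_resources(i_v, k):
    """Per-list adders and multiplexers of the RCC method, with k = n - t."""
    adders = sum(2 ** i * 2 ** (2 ** (k - i)) for i in range(1, k)) + 2 ** i_v
    wide_inputs = 2 ** (2 ** (k - 1))          # each wide multiplexer is wide_inputs-to-1
    return {"adders": adders, "mux2": 4 * 2 ** (k - 1),
            "wide_mux": 2 ** (i_v + 1), "wide_mux_inputs": wide_inputs}

def drh_resources(i_v, k):
    """Per-list adders and multiplexers of the DRH method, with k = n - t."""
    half = 2 ** (k - 1)
    return {"adders": 4 * half + 2 ** i_v * (half - 1),
            "mux2": 4 * half, "mux4": 2 ** i_v * half}

dmm, rcc, drh = dmm_resources(8, 4), rcc_resources(8, 4), drh_resources(8, 4)
```

With these inputs the RCC adder sum expands to 512 + 64 + 32 + 256 = 864, and its wide multiplexers have 256 inputs each, matching the 256-to-1 row of the table.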
6.4.4 Proposed hybrid partial sum computation unit
For the list decoder architectures in [31, 32], all partial sums are stored in registers
and the partial sums of decoding path l′ are copied to decoding path l when decoding
path l′ needs to be copied to decoding path l. The PSUs in [31] and [32] need
$L(N - 1)$ and $L(\frac{N}{2} - 1)$ single bit registers to store all partial sums, respectively.
Thus, for large N, the register based PSU architectures in [31,32] are inefficient for
two reasons. First, the area of the PSU is linearly proportional to N. For large N
(e.g. $N > 2^{15}$), the area of the PSU is large since registers are usually area demanding.
Second, the power dissipation due to the copying of partial sums between different
decoding paths is high when N is large.
Proposed Index Based Partial Sum Computation Algorithm
In order to avoid copying partial sums directly, an index based partial sum
computation (IPC) algorithm is proposed in Algorithm 23, where $p_l[z]$
($l = 0, 1, \cdots, L-1$ and $z = 0, 1, \cdots, n$) is a list index reference. $C_{l,z}$ for
$l = 0, 1, \cdots, L-1$ and $z = 0, 1, \cdots, n$ are partial sum matrices [32, 55]. $C_{l,z}$
has $2^{n-z}$ elements, each of which stores two binary bits.
For our RLLD algorithm, once a rate-0, rate-1 or an FP node sends L codewords
to its parent node, the partial sum computation is performed after decoding path
pruning. Let t denote the layer index of such a node v. Let $(B_{n-1}, B_{n-2}, \cdots, B_0)$
denote the binary representation of the index of the last leaf node belonging to
node v, where $B_{n-1}$ is the most significant bit. Let $t_e = n - j$, where j is the
smallest integer such that $B_j = 0$. If $B_j = 1$ for $j = 0, 1, \cdots, n-1$, $t_e = 0$. Once
$\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$ are calculated, decoding path l′ may need to be copied
to path l before the following partial sum computation. Under this circumstance, the index
references are first copied, where $p_{l'}[z]$ is copied to $p_l[z]$ for $z = t, t-1, \cdots, 0$. The
lazy copy algorithm was proposed in [55] to avoid copying partial sums directly.
However, the lazy copy algorithm is not suitable for hardware implementation due
to complex index computation. The PSU in [32] copies partial sums directly.
Micro Architecture of the Proposed Hybrid Partial Sum Unit
Based on our IPC algorithm, a Hyb-PSU is proposed with two improvements. First,
some partial sums are stored in memory, while others are stored in registers. Second,
instead of partial sums, only list index matrices are copied. These two improvements
reduce the area and power overhead of the partial sum computation unit when N is large.
Algorithm 23: Index Based Partial Sum Computation (IPC) Algorithm
input : $t_e$, $t$, $(\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1})$
output: $C_{l,t_e}[j][0]$ for $l = 0, 1, \cdots, L-1$ and $j = 0, 1, \cdots, 2^{n-t_e} - 1$
for $l = 0$ to $L - 1$ do
    for $j = 0$ to $2^{n-t} - 1$ do
        if v is the left child node of its parent node then
            $C_{l,t}[j][0] = \beta_{v,l}[j]$; $p_l[t] = l$
        else $C_{l,t}[j][1] = \beta_{v,l}[j]$
if v is the left child node of its parent node then exit
for $l = 0$ to $L - 1$ do
    for $z = t - 1$ to $t_e$ do
        for $j = 0$ to $2^{n-z-1} - 1$ do
            $v_0 = C_{p_l[z+1],z+1}[j][0]$; $v_1 = C_{l,z+1}[j][1]$
            if $z == t_e$ then
                $C_{l,z}[2j][0] = v_0 \oplus v_1$; $C_{l,z}[2j+1][0] = v_1$
                $p_l[z+1] = p_l[z] = l$
            else
                $C_{l,z}[2j][1] = v_0 \oplus v_1$; $C_{l,z}[2j+1][1] = v_1$
                $p_l[z+1] = l$
[Figure 6.10 diagrams: (a) stages n down to m built from processing elements PE_{l,z,j}, followed by stages m-1 down to 1 built from bit memories BM_{l,z} and connector modules (CN); (b) and (c) PE internals with signals b_{l,z,j}, s_{l,z,j}, d_{l,z,j}, m_{l,z,j}, o_{l,z,2j}, o_{l,z,2j+1} and controls LD_z, LZ_z, EN_z, where (b) additionally contains the DLU; (d) CN with inputs I0, I1 and outputs O0, O1.]
Figure 6.10: (a) Top architecture of CUl. (b) Type-I PE. (c) Type-II PE. (d) Inputs
and outputs of the CN.
The Hyb-PSU consists of L computation units, $CU_0, CU_1, \cdots, CU_{L-1}$, where the
micro architecture of $CU_l$ is shown in Fig. 6.10(a) and is described as follows.
(a) For block length $N = 2^n$, $CU_l$ consists of n stages, where m is an integer and
the first n − m + 1 stages are a binary tree of the type-I and type-II unit processing
elements (PEs) shown in Figs. 6.10(b) and 6.10(c), respectively. Stage z ($z \ge m$)
has $2^{n-z}$ PEs. Each of the remaining m − 1 stages has the same circuitry.
(b) Two types of PEs are used in the PE tree in Fig. 6.10(a). Suppose the
maximal length of a constituent codeword that is returned from a rate-0, rate-1 or
FP node is $2^{\mu}$; then stage z ($z \ge n - \mu$) employs only the type-I PEs. The remaining
stages in the PE tree employ the type-II PEs.
(c) Compared with the type-II PE, the type-I PE has an extra data load unit
(DLU). For $PE_{l,z,j}$ within stage z ($j = 0, 1, \cdots, 2^{n-z} - 1$), the binary outputs,
$o_{l,z,2j}$ and $o_{l,z,2j+1}$, are connected to $b_{l,z-1,2j}$ and $b_{l,z-1,2j+1}$, respectively.
The wired connections are not shown in Fig. 6.10(a) for simplicity.
(d) $BM_{l,z}$ ($z \le m - 1$) is a bit memory with $c_w = 2^{n-z}/T$ words, where each word
contains T bits. T is the number of processing elements belonging to a decoding
path in a partial parallel list decoder. For our memory compiler, if $c_w$ is greater
than a threshold value, then $BM_{l,z}$ is implemented with an RF. If $c_w$ is greater
than a second, larger threshold value, then $BM_{l,z}$ is implemented with an SRAM.
(e) The connector module (CN) has two T-bit inputs and two T-bit outputs.
The connections between the outputs and inputs are
$$\begin{cases}
O_0[2j] = I_0[j] \oplus I_1[j], & 0 \le j < T/2 \\
O_0[2j+1] = I_1[j], & 0 \le j < T/2 \\
O_1[2j-T] = I_0[j] \oplus I_1[j], & T/2 \le j < T \\
O_1[2j+1-T] = I_1[j], & T/2 \le j < T
\end{cases} \qquad (6.11)$$
(f) For our Hyb-PSU, L computation units are needed. For each PE within
$CU_l$, $m_{l,z,j}$ in Figs. 6.10(b) and 6.10(c) is the output of an L-to-1 multiplexer whose
inputs are $d_{0,z,j}, d_{1,z,j}, \cdots, d_{L-1,z,j}$, where L − 1 of them are from other computation
units. For each CN, $M_{l,z}$ is the output of an L-to-1 multiplexer whose inputs are
$D_{0,z}, D_{1,z}, \cdots, D_{L-1,z}$.
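The wiring of the connector module in Eq. (6.11) can be written out as a short Python sketch. The function name `connector` and the bit-list representation are our own choices for illustration; T is assumed to be even.

```python
def connector(i0, i1):
    """Map two T-bit inputs to two T-bit outputs following Eq. (6.11)."""
    t = len(i0)
    o0, o1 = [0] * t, [0] * t
    for j in range(t // 2):                # first half of the inputs drives O0
        o0[2 * j] = i0[j] ^ i1[j]
        o0[2 * j + 1] = i1[j]
    for j in range(t // 2, t):             # second half of the inputs drives O1
        o1[2 * j - t] = i0[j] ^ i1[j]
        o1[2 * j + 1 - t] = i1[j]
    return o0, o1

# Toy 4-bit inputs (hypothetical values).
i0, i1 = [1, 1, 0, 0], [0, 1, 0, 1]
o0, o1 = connector(i0, i1)

# Equivalent view: interleave (I0 xor I1, I1) pairs and split the 2T bits in half.
interleaved = [x for a, b in zip(i0, i1) for x in (a ^ b, b)]
```

Seen this way, the CN performs the same combine step as the last loop of Alg. 23, applied to T bits at a time and written back as two T-bit memory words.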
Computation Schedule of Our Hybrid Partial Sum Unit
Once the returned L codewords $\beta_{v,0}, \beta_{v,1}, \cdots, \beta_{v,L-1}$ are computed, the path
pruning unit also outputs L indices $a_0, a_1, \cdots, a_{L-1}$, where decoding path $a_l$ needs to
be copied to decoding path l. For $l = 0, 1, \cdots, L-1$, $\beta_{v,l}$ is first loaded into stage t
by the DLU in Fig. 6.10(b), and the output partial sums in Alg. 23 come out from stage
$t_e$. For stage t, if $\beta_{v,l}$ is sent from a rate-0 node, then the control signal $LZ_t$ is 0,
since $\beta_{v,l}$ is a zero vector. Otherwise, $LD_t = 0$ and $LZ_t = 1$. For the other stages,
$LD_z = 1$ and $LZ_z = 1$ ($z \ne t$).
For all partial sums within the partial sum matrix $C_{l,z}$, we divide them into two
sets: $C^{0}_{l,z}$ and $C^{1}_{l,z}$, where $C^{0}_{l,z}$ consists of $C_{l,z}[j][0]$ for
$j = 0, 1, \cdots, 2^{n-z} - 1$ and $C^{1}_{l,z}$ consists of the other partial sums within
$C_{l,z}$. For each $C_{l,z}$, our Hyb-PSU stores only $C^{0}_{l,z}$ in the registers or bit
memory of stage z. As shown in Alg. 23, for $z = t - 1$ to $t_e + 1$, $C^{1}_{l,z}$ is computed
in serial. At last, $C^{0}_{l,t_e}$ is computed. For our Hyb-PSU, after loading the returned
L codewords into stage t, for $z = t - 1$ to $t_e + 1$, $C^{1}_{l,z}$ is
computed on-the-fly and passed to the next stage as shown in Fig. 6.10.
When $t_e \ge m$, $C^{0}_{l,t_e}$ is computed in one clock cycle and is output from stage
$t_e$, where $C_{l,t_e}[j][0]$ is set to $s_{l,t_e,j}$ produced by the type-I and type-II PEs for
$j = 0, 1, \cdots, 2^{n-t_e} - 1$. When $t_e < m$, $C^{0}_{l,t_e}$ is computed in $2^{n-t_e}/T$
cycles, and T updated partial sums are computed in each clock cycle. Since decoding path
$a_l$ needs to be copied to path l, for $z = t, t-1, \cdots, t_e + 1$, the computation of
$C^{1}_{l,z}$ is based on $C^{0}_{a_l,z+1}$ and $C^{1}_{l,z+1}$. Hence, the multiplexers within
stage z are configured so that $m_{l,z,j} = d_{a_l,z,j}$ for $z \ge m$. When $z < m$,
$M_{l,z} = Q_{a_l,z}$.
Comparisons with Related Works
Compared to the partial sum computation architectures in [31, 32], the proposed
Hyb-PSU architecture has advantages in the following two aspects.
(1) The proposed Hyb-PSU is a scalable architecture. The PSU architectures
in [31,32] require $L(N - 1)$ and $L(N/2 - 1)$ single bit registers, where $N = 2^n$ is the
block length. Hence, they will suffer from excessive area overhead when the block
length N is large. In contrast, the proposed Hyb-PSU stores $L(N - 1)$ bits and
most of these bits are stored in RFs or SRAMs, which are more area efficient than
registers.
(2) The architectures in [31,32] copy the partial sums of a decoding path to another
decoding path when needed, while our Hyb-PSU copies only index references. We
define the copying of a single bit from one register to another as a single copy
operation. When decoding path l′ needs to be copied to path l, the PSU in [32]
requires $N_1 = 2^{n-1} - 1$ copy operations, while our Hyb-PSU needs only
$N_2 = (n + 1)\log_2 L$ copy operations. Since the value of L for practical hardware
implementation is small, our lazy copy needs far fewer copy operations than direct copy.
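To give a sense of scale for the two copy counts, the following sketch evaluates $N_1$ and $N_2$ for the block lengths used in this chapter. The function names are ours, and L is assumed to be a power of two so that $\log_2 L$ is an integer.

```python
def direct_copy_ops(n):
    """N1 = 2^(n-1) - 1 single-bit copies when partial sums are copied directly."""
    return 2 ** (n - 1) - 1

def index_copy_ops(n, list_size):
    """N2 = (n + 1) * log2(L) single-bit copies when only index references are copied."""
    return (n + 1) * (list_size.bit_length() - 1)   # log2(L) for a power-of-two L

# Block lengths N = 2^13 and N = 2^15 with L = 4.
counts = {n: (direct_copy_ops(n), index_copy_ops(n, 4)) for n in (13, 15)}
```

For L = 4 this gives 4095 versus 28 copy operations at $N = 2^{13}$ and 16383 versus 32 at $N = 2^{15}$, so the index based count grows only linearly in n.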
In this chapter, with L = 4 and T = 128, the proposed
hybrid partial sum computation unit architecture is implemented for $N = 2^{13}$ and $2^{15}$ with m = 3 and
m = 5, respectively, under a TSMC 90nm CMOS technology. Our partial sum
computation unit consumes an area of 0.779 mm² and 1.31 mm² for $N = 2^{13}$ and
$N = 2^{15}$, respectively.
To the best of our knowledge, the decoder architectures in [31,32,36,82] are the
only ones for SC based list decoding algorithms of polar codes. However, in [31, 36, 82],
the partial sum computation unit architecture was not discussed in detail and
implementation results for the PSU alone are not shown. Hence, we compare our
proposed Hyb-PSU with that in [32]. When L = 4, the partial sum unit architecture
in [32] for $N = 2^{13}$ and $2^{15}$ consumes an area of 1.011 mm² and 3.63 mm², respectively,
under the same CMOS technology. All PSUs are synthesized under a frequency of
500MHz. Our Hyb-PSU achieves an area saving of 23% and 63% for block lengths
$2^{13}$ and $2^{15}$, respectively.
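The area savings quoted above can be rechecked from the reported PSU areas. The short Python sketch below does this; the helper name `area_saving` is hypothetical and the only inputs are the four area figures stated in this subsection.

```python
def area_saving(proposed_mm2: float, reference_mm2: float) -> float:
    """Fractional area saving of the proposed PSU relative to a reference PSU."""
    return 1.0 - proposed_mm2 / reference_mm2


if __name__ == "__main__":
    # (proposed Hyb-PSU area, area of the PSU in [32]) in mm^2, as reported above.
    reported = {"N = 2^13": (0.779, 1.011), "N = 2^15": (1.31, 3.63)}
    for label, (ours, ref) in reported.items():
        print(f"{label}: {100 * area_saving(ours, ref):.1f}% saving")
```

The computed savings are about 22.9% and 63.9%, in line with the 23% and 63% quoted in the text.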
6.4.5 Latency and Throughput
For the proposed high throughput decoder architecture, the number of clock cycles,
$N_D$, used for the decoding of a codeword depends on the block length, the code rate
and the positions of frozen bits. For our RLLD algorithm, let $N_V$ be the number
of nodes (except the root node) visited in $G_n$. Let $S_V$ denote the set of indices of
visited nodes (except the root node). Let $S'_V$ be a subset of $S_V$, where $S'_V$ consists
of rate-1 nodes with $I_v \le X_{th}$ and all FP nodes. For $v_i \in S_V$, let $t_i$ be the layer
index of node $v_i$ for $i = 0, 1, \cdots, N_V - 1$. Then
$$N_D = \sum_{i=0}^{N_V - 1} \left( N_L^{(i)} + N_P^{(i)} \right) + N_C, \qquad (6.12)$$
where $N_L^{(i)} = \lceil 2^{n-t_i}/T \rceil$ is the number of clock cycles needed to calculate the LLR
vectors sent to node $v_i$, and $N_P^{(i)}$ is the number of clock cycles used by our PPU
when $v_i$ is activated. Note that a decoding path splits only if node $v_i$ is a rate-1 node
with $I_v \le X_{th}$ or an FP node. Hence, $N_P^{(i)} = 0$ if $v_i \notin S'_V$. If $v_i \in S'_V$, $N_P^{(i)} \neq 0$ and
depends on the node type, $X_{th}$, $q_{I_v,L}$, T, L and the number of pipeline stages in our
PPU. This will be discussed in more detail in Section 6.5.
Since our list decoder outputs $x_0^{N-1}$ instead of $u_0^{N-1}$, we need to obtain $u_0^{N-1}$
based on $x_0^{N-1}$ before calculating the CRC checksum of the information bits. A
partial-parallel polar encoder [83] can be used and the corresponding latency is
N/T when T bits are fed to the encoder in parallel. For the computation of the CRC,
a partial parallel CRC unit [84] can be used, and the corresponding latency is also
N/T. As a result, $N_C = 2N/T$.
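The cycle count in (6.12) can be evaluated mechanically once the visited nodes are known. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the list of visited nodes is a made-up toy schedule rather than one taken from a real code, and the PPU cycle count of each node is supplied by the caller.

```python
def llr_cycles(n: int, t_i: int, num_pus: int) -> int:
    """N_L^(i) = ceil(2^(n - t_i) / T) for a node at layer t_i."""
    vector_len = 2 ** (n - t_i)
    return -(-vector_len // num_pus)  # integer ceiling division


def decoding_cycles(n: int, num_pus: int, visited) -> int:
    """Evaluate (6.12).

    visited: iterable of (t_i, n_p_i) pairs, one per visited non-root node,
    where n_p_i is the PPU cycle count N_P^(i) (0 if the path does not split).
    """
    n_c = 2 * 2 ** n // num_pus  # N_C = 2N/T (encoder plus CRC latency)
    return sum(llr_cycles(n, t, num_pus) + n_p for t, n_p in visited) + n_c


if __name__ == "__main__":
    # Hypothetical toy schedule for n = 4, T = 4; not taken from the thesis.
    toy_visited = [(1, 0), (2, 2), (2, 0)]
    print(decoding_cycles(4, 4, toy_visited))  # 6 node cycles + N_C = 8 -> 14
```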
The latency of our decoder is $T_L = N_D/f$, where f is the decoder frequency.
Since a CRC is used to determine the final output data word, we calculate the net information
throughput (NIT) of our decoder, where $\mathrm{NIT} = \frac{(NR - h) f}{N_D - N_C}$ and h is the CRC
checksum length. Here, the latency due to the CRC checksum computation does
not affect our decoder throughput, since our decoder can work on the next frame
once our Hyb-PSU begins to output decoded codewords for the current frame.
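As a sanity check, the latency (decoding cycles divided by clock frequency) and the NIT expression can be evaluated for one entry of Table 6.3. The Python sketch below uses hypothetical function names. The CRC length h = 32 is an assumption, since it is not stated in this section; it was chosen because it appears to reproduce the tabulated throughput.

```python
def latency_us(n_d: int, f_mhz: float) -> float:
    """Decoding latency in microseconds: N_D cycles at a clock of f MHz."""
    return n_d / f_mhz


def nit_mbps(block_len, rate, crc_len, f_mhz, n_d, n_c):
    """Net information throughput (NR - h) f / (N_D - N_C), in Mbps."""
    return (block_len * rate - crc_len) * f_mhz / (n_d - n_c)


if __name__ == "__main__":
    # Proposed decoder, N = 2^10, R = 0.5, L = 2 (Table 6.3): f = 423 MHz, N_D = 337.
    # T = 64 PUs per path gives N_C = 2N/T = 32. h = 32 is an ASSUMED CRC length.
    n_c = 2 * 1024 // 64
    print(round(latency_us(337, 423), 2))
    print(round(nit_mbps(1024, 0.5, 32, 423, 337, n_c)))
```

Under these assumptions the sketch gives about 0.80 µs and 666 Mbps, compared with the 0.79 µs and 666 Mbps listed in Table 6.3.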
Table 6.3: Implementation Results for $N = 2^{10}$, $R = 0.5$
                      proposed                [33]                    [32]‡                   [36]                      [37]
L                     2       4       8       2       4       8       2       4       8       2            4            4
Frequency (MHz)       423     403     289     847     794     637     507     492     462     500*/361†    400*/288†    500
Cell Area (mm²)       1.92    3.69    6.95    0.88    1.78    3.85    1.23    2.46    5.28    1.06*/2.03†  2.14*/4.10†  1.403
# of Decoding Cycles  337     371     404     2592    2649    2649    2592    2592    3104    1022         1022         1290
NIT (Mbps)            666     570     374     168     154     123     93      91      71      250*/180†    200*/144†    186
Latency (µs)          0.79    0.92    1.21    3.06    3.34    4.16    5.11    5.26    6.72    2.04*/2.83†  2.55*/3.54†  2.58
AE (Mbps/mm²)         347     154     53      191     86      32      76      37      13      237*/88†     94*/35†      132
‡ The decoder architecture in [32] has been re-synthesized under the TSMC 90nm CMOS technology.
* These are the original implementation results based on a 65nm CMOS technology.
† These are the scaled results under the TSMC 90nm CMOS technology.
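The scaled entries marked † in Table 6.3 appear consistent with a simple first-order technology scaling rule. The Python sketch below assumes, and this is not stated in the text, that frequency scales inversely with the feature size and that area scales with its square when moving from 65nm to 90nm. The function name and the scaling rule itself are illustrative assumptions.

```python
def scale_to_90nm(freq_mhz: float, area_mm2: float,
                  from_nm: float = 65.0, to_nm: float = 90.0):
    """First-order scaling (ASSUMED rule): f -> f / s, area -> area * s^2, s = to/from."""
    s = to_nm / from_nm
    return freq_mhz / s, area_mm2 * s * s


if __name__ == "__main__":
    # Original 65nm figures reported for the decoders of [36] in Table 6.3.
    for freq, area in ((500.0, 1.06), (400.0, 2.14)):
        f90, a90 = scale_to_90nm(freq, area)
        print(f"{freq:.0f} MHz, {area} mm^2 -> {f90:.0f} MHz, {a90:.2f} mm^2")
```

With this rule, 500 MHz and 1.06 mm² map to roughly 361 MHz and 2.03 mm², which agrees with the † values in the table.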
Table 6.4: Implementation Results for $N = 2^{13}$, $R = 0.5$
                      proposed                [33]†                   [32]‡                   [37]‡
L                     2       4       8       2       4       8       2       4       8       4
Frequency (MHz)       416     398     289     847     794     637     467     434     434     434
Cell Area (mm²)       3.12    5.86    11.06   6.48    12.73   28.04   3.97    7.93    17.45   7.02
# of Decoding Cycles  2146    2367    2576    20736   20736   20736   20736   20736   24832   11488
NIT (Mbps)            839     723     479     167     156     125     92      85      71      153
Latency (µs)          5.16    5.94    8.91    24.48   26.11   32.55   44.40   47.78   58.56   26.47
AE (Mbps/mm²)         268     123     43      26      12      4.6     23      11      4.1     21.79
† These results are estimated conservatively.
‡ The decoder architectures in [32,37] have been re-synthesized under the TSMC 90nm CMOS technology. The number of PUs per decoding path is 128.
Table 6.5: Implementation Results for $N = 2^{15}$, $R = 0.9004$
                      proposed                [33]†                   [32]‡                   [37]‡
L                     2       4       8       2       4       8       2       4       8       4
Frequency (MHz)       367     359     286     847     794     637     398     389     389     389
Cell Area (mm²)       4.56    8.56    16.45   25.68   50.41   111.08  8.59    17.54   34      15.5
# of Decoding Cycles  6070    6492    6895    96576   96576   96576   96576   96576   126080  63606
NIT (Mbps)            1949    1772    1323    258     242     194     121     118     90      180
Latency (µs)          16.53   18.08   24.11   114.02  121.63  151.61  242.65  248.26  324.11  163.5
AE (Mbps/mm²)         427     207     80      11      4.8     1.75    14      6.72    2.64    11.61
† These results are estimated conservatively.
‡ The decoder architectures in [32,37] have been re-synthesized under the TSMC 90nm CMOS technology. The number of PUs per decoding path is 128.
6.5 Implementation Results and Comparisons
To compare with prior works, we implement our high throughput list decoder architecture
for three polar codes with lengths of $2^{10}$, $2^{13}$ and $2^{15}$, and
rates 0.5, 0.5 and 0.9, respectively. The last polar code is intended for storage
applications. For each code, three different list sizes are considered: L = 2, 4, 8.
All our decoders are synthesized under the TSMC 90nm CMOS technology using
the Cadence RTL compiler. The area efficiency (AE) of a partly parallel decoder
architecture depends on the number of PUs. In order to make a fair comparison
with prior works in [32, 33, 37], the number of PUs for each decoding path of our
implemented decoders is selected to be 64 when $N = 2^{10}$. When $N = 2^{13}$ and $2^{15}$,
the number of PUs per decoding path is 128 for our decoders. The list decoders
in [85] are based on a line architecture, which always requires N/2 PUs.
A total of 3, 4 and 6 pipeline stages are inserted in the PPU for
decoders with L = 2, 4 and 8, respectively. The number of pipeline stages needed
for our PPU is determined by the longest data path. For each $v_i \in S'_V$, if node $v_i$ is a
rate-1 node with $I_v \le X_{th}$, $N_P^{(i)}$ depends on the number of PUs in a decoding path:
when $I_v \le T$, $N_P^{(i)} = 2$ for all our implemented decoders; otherwise, $N_P^{(i)} = 4$ for all
our decoders, since the minimum value of a received LLR vector is calculated in a
partially parallel way, which incurs extra clock cycles. When node $v_i$ is an FP node,
$N_P^{(i)}$ relates to $q_{I_v,L}$. Depending on the value of $q_{I_v,L}$, we may use different
data paths when computing the L minimum expanded path metrics. The locations
of all pipeline stages are arranged so that fewer clock cycles are needed when
$q_{I_v,L}$ is smaller. In Table 6.6, we list the value of $N_P^{(i)}$ with respect to
$I_v$ and L.
The selection of $X_{th}$ is a trade-off between AE and error performance. When
$X_{th}$ increases, more rate-1 nodes are processed by our CG algorithm. Hence,
$N_D$ increases and the resulting NIT decreases. Meanwhile, the corresponding error
performance is better, especially in the high SNR region. Our high throughput list
decoder architecture supports all $X_{th}$ values. For all our implemented decoders, $X_{th}$
is set to be large enough so that all rate-1 nodes are processed by our CG algorithm.
In this setup, for each implemented decoder, $N_D$ is maximized with respect to $X_{th}$,
and hence the throughput of our decoder architecture in Tables 6.3, 6.4 and 6.5
is the minimum achieved by our decoders. For each code, the corresponding error
performance is better than that of the RLLD with the MEQ in Fig. 6.6.
Table 6.6: $N_P^{(i)}$ with Respect to $I_v$ and L
$I_v$    1  2  3  4  5  6  7  8
L = 2    2  2  3  3  3  3  3  3
L = 4    2  4  4  4  4  4  4  3
L = 8    2  3  4  5  5  6  5  3
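The PPU cycle counts described in this section can be collected into a small lookup. The Python sketch below is an illustration only: the function and constant names are hypothetical, and it assumes that Table 6.6 gives the PPU cycles for FP nodes with an $I_v$ between 1 and 8, while rate-1 nodes follow the 2-cycle and 4-cycle rule stated in the text.

```python
# Table 6.6 as a lookup: PPU cycles for I_v = 1..8 (columns), per list size L.
# ASSUMPTION: these entries apply to FP nodes.
PPU_CYCLES_TABLE = {
    2: (2, 2, 3, 3, 3, 3, 3, 3),
    4: (2, 4, 4, 4, 4, 4, 4, 3),
    8: (2, 3, 4, 5, 5, 6, 5, 3),
}


def ppu_cycles(node_type: str, i_v: int, list_size: int,
               num_pus: int, x_th: int) -> int:
    """Return the PPU cycle count of one visited node.

    node_type: "rate1", "fp", or anything else (no path splitting, 0 cycles).
    """
    if node_type == "rate1":
        if i_v > x_th:
            return 0  # not processed by the PPU, so no path splitting
        return 2 if i_v <= num_pus else 4
    if node_type == "fp":
        return PPU_CYCLES_TABLE[list_size][i_v - 1]
    return 0


if __name__ == "__main__":
    print(ppu_cycles("fp", 6, 8, 128, 10**9))      # Table 6.6, L = 8, I_v = 6
    print(ppu_cycles("rate1", 256, 4, 128, 1024))  # I_v > T, partially parallel minimum
```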
The implementation results are shown in Tables 6.3, 6.4 and 6.5. The implementation
results show that our decoders outperform existing SCL decoders [32, 33, 36]
in both decoding latency and area efficiency. Compared with the decoders of [33],
the area efficiency and decoding latency of our decoders are 1.65 to 45 times and
3.4 to 6.8 times better, respectively. The area efficiency and decoding latency of
our decoders are 4.07 to 30 times and 5.5 to 13 times better, respectively, than those of the
decoders of [32]. Compared with the decoders of [37], our decoders improve the area
efficiency and decoding latency by 1.16 to 17.8 times and 2.8 to 9 times, respectively.
When $N = 2^{10}$, the area efficiency and decoding latency of our decoders are 3.9
to 4.4 times and 3.58 to 3.84 times better, respectively, than those of the decoders of [36].
Compared with the decoders of [36], our decoders would show more significant
improvements in area efficiency and decoding latency when N is larger.
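The improvement factors above are ratios of tabulated values. The Python sketch below shows the computation for one pair of entries (proposed decoder versus [33] at the block length of 1024 with L = 8, using the AE and latency values listed in Table 6.3). The helper name is hypothetical, and because the table entries are rounded the ratios are approximate.

```python
def improvement(ours: float, theirs: float, higher_is_better: bool) -> float:
    """Improvement factor of our decoder over a reference decoder."""
    return ours / theirs if higher_is_better else theirs / ours


if __name__ == "__main__":
    # Table 6.3, L = 8: AE of 53 vs 32 Mbps/mm^2, latency of 1.21 vs 4.16 us.
    print(f"AE:      {improvement(53, 32, higher_is_better=True):.2f}x")
    print(f"Latency: {improvement(1.21, 4.16, higher_is_better=False):.2f}x")
```

These two ratios come out near 1.66 and 3.44, close to the lower ends of the ranges quoted for [33].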
Based on the implementation results shown in Tables 6.3, 6.4 and 6.5, it is
observed that when the block length is fixed, as the list size L increases, the area
efficiency and decoding latency will decrease and increase, respectively, due to the
following reasons:
• It takes more memory to store internal LLRs when L increases.
• The number of pipeline stages within our PPU will increase when L increases,
which in turn increases the overall decoding clock cycles.
The latency reduction and area efficiency improvement of our decoders are due
to the reduced number of nodes activated in the decoding. However, the area and
frequency overhead of the proposed PPU somewhat dilutes the benefit of the reduction
in decoding clock cycles. For example, our decoder reduces the number of decoding
cycles to approximately 1/7 of that of the decoders in [33] for L = 2, 4 and 8. However,
the reduction in decoding cycles does not fully translate into an improvement in
decoding latency and area efficiency. Based on our implementation results, taking
L = 2 as an example, the PPU occupies 61.99%, 40.16% and 25.40% of the area
of the whole decoder for $N = 2^{10}$, $2^{13}$ and $2^{15}$, respectively. Compared with the
decoders with $N = 2^{10}$ and $2^{13}$, the effect on the area efficiency caused by the area
overhead of the PPU is smaller for decoders with $N = 2^{15}$. Keeping T unchanged, as
N increases, the area of the PPU increases very slowly while the total area of all
LLR memories is proportional to N. Hence, for larger N, the PPU occupies a smaller
percentage of the total area of the whole decoder. When the list size L is fixed, as N
increases, the latency reduction and area efficiency improvement compared with
other decoders in the literature will be greater.
6.6 Conclusion
In this chapter, a reduced latency list decoding algorithm is proposed for polar codes.
The proposed list decoding algorithm results in a high throughput list decoder
architecture for polar codes. A memory efficient quantization method is also proposed
to reduce the size of message memories. The proposed list decoder architecture can
be adapted to large block lengths due to our hybrid partial sum computation unit,
which is area efficient. The implementation results of our high throughput list
decoder demonstrate significant advantages over current state-of-the-art SCL decoders.
Chapter 7
Conclusions and Future Work
7.1 Conclusions
In this thesis, efficient hardware decoder architectures for NB-LDPC codes, polar
codes, MV and KK codes are presented. An algorithm-architecture co-optimization
approach is employed.
In Chapter 2, the shuffled decoding algorithm and decoder architecture are presented.
The shuffled decoding algorithm reduces the average number of iterations.
Implementations for the (837, 726) nonbinary QC-LDPC code show that the
efficiency of the proposed decoder architecture is much higher than that of previous
works.
In Chapter 3, a fully parallel NB-LDPC decoder architecture is presented. A
reduced memory complexity trellis based check node processing (RTBCP) algorithm
is first proposed. A parallel check node unit (CNU) and a low-latency variable node
unit (VNU) are also proposed. Based on the proposed CNU and VNU, an
efficient fully parallel decoder architecture is developed. A fully parallel NB-LDPC
decoder over GF(256) is implemented with 28nm CMOS technology. The
decoder achieves a throughput of 546Mb/s and an energy efficiency of
0.178nJ/b/iter. Compared with the state-of-the-art NB-LDPC decoder architectures,
the implementation results demonstrate that our fully parallel decoder architecture
has obvious advantages in terms of throughput and area efficiency.
In Chapter 4, efficient decoder architectures for KK and MV codes are presented.
A serial decoder architecture and an unfolded decoder architecture for KK codes are
proposed for applications with moderate and high throughputs, respectively. Both
architectures are implemented for KK codes over GF($2^8$) and GF($2^{16}$) to
demonstrate their efficiency. Compared to the rank metric decoder architectures for KK
codes [43], the proposed serial decoder architecture improves the throughput by
4.9 and 13.2 times, while its gate counts are only 56% and 76% of their respec-
tive counterparts in [43]. Moreover, for these two codes, the unfolded architecture
achieves a throughput of 12.5Gb/s and 41.6Gb/s, much higher than the throughput
of 214Mb/s and 134Mb/s of their respective counterparts in [43]. The throughputs
per thousand NAND gates of our architectures are much higher and their latency
much shorter than their counterparts in [43]. A serial list decoder architecture for
MV codes is also proposed. To the best of our knowledge, this is the first hardware
implementation of MV decoders.
In Chapter 5, we present, to the best of our knowledge, the first hardware
implementation of the CA-SCL algorithm. A memory-efficient memory partition
method is employed to reduce the area of the message memories. A fine grained PU
profiling (FPP) algorithm is proposed to determine the minimum quantization size
of each input message for each processing unit so that there is no message overflow.
An efficient scalable path pruning unit (PPU) is proposed to control the copying of
decoding paths. Based on the proposed memory architecture and the scalable PPU,
our list decoder architecture is suitable for large list sizes. For a (1024, 512) rate-1/2
polar code, the proposed list decoder architecture is implemented for list sizes L = 2
and 4 under a 90nm CMOS technology. Compared with the decoder
architecture in [31] synthesized under the same technology, our decoder achieves
1.24 to 1.83 times the area efficiency (throughput normalized by area). Besides, the
proposed CA-SCL decoder has better error performance than the SCL
decoder in [31].
In Chapter 6, a tree based reduced latency list decoding algorithm and its cor-
responding high throughput hardware architecture for polar codes are presented.
Our reduced latency list decoding algorithm reduces the number of nodes visited in
a decoding tree. The proposed high throughput list decoder architecture has been
implemented for several block lengths and list sizes under the TSMC 90nm CMOS
technology. The implementation results show that our decoders outperform existing
SCL decoders in both decoding latency and area efficiency. For example, compared
with the decoders of [33], the area efficiency and decoding latency of our decoders
are 1.65 to 45 times and 3.4 to 6.8 times better, respectively.
7.2 Future Work
For future work, the following directions may be worth investigating:
• Low complexity low latency decoding algorithms and decoder
architectures for NB-LDPC codes. Compared to binary LDPC decoders,
current NB-LDPC decoders still suffer from excessive hardware complexity. For the
practical application of NB-LDPC codes, efficient decoding algorithms still
need to be investigated. Stochastic computation can reduce the computational
complexity at the cost of long decoding latency. It would be interesting to
combine stochastic computation with current NB-LDPC decoding algorithms
over the real domain. It is also promising to explore joint detection and
decoding with NB-LDPC codes. Besides, efficient hard decoding algorithms for
NB-LDPC codes still need to be investigated.
• Efficient belief propagation decoding algorithms and decoder
architectures for polar codes. Considerable effort has already been devoted to
research on efficient SC based decoding algorithms for polar codes. For
applications that require soft output, belief propagation decoding algorithms
are essential. Current belief propagation decoding algorithms suffer
from high computational complexity, high hardware complexity, and inferior error
performance due to inefficient message updating schedules. It would be interesting to
explore efficient schedules for belief propagation decoding of polar codes.
Bibliography
[1] A. Voicila, D. Declercq, F. Verdier, M. Fossorier and P. Urard, “Low-complexity
decoding for non-binary LDPC codes in high order fields,” IEEE Trans. Com-
mun., vol. 58, no. 5, pp. 1365–1375, May 2010.
[2] G. Sarkis, S. Hemati, S. Mannor and W. J. Gross, “Stochastic decoding of
LDPC codes over GF(q),” IEEE Trans. Commun., vol. 61, no. 3, pp. 939–950,
Mar. 2013.
[3] M. Davey and D. J. C. Mackay, “Low-density parity check codes over GF(q),”
IEEE Commun. Lett., vol. 2, no. 6, pp. 165–167, Jun. 1998.
[4] C. Poulliat, M. Fossorier, and D. Declercq, “Design of non binary LDPC codes
using their binary image: algebraic properties,” in Proc. IEEE Int. Symp. on
Information Theory, Seattle, USA, Jul. 2006.
[5] D. J. C. MacKay, “Online database of low-density parity check codes,” Available:
http://www.inference.phy.cam.ac.uk/mackay/codes/data.html.
[6] V. Savin, “Min-Max decoding for non-binary LDPC codes,” in Proc. IEEE Int.
Symp. on Information Theory, Toronto, Canada, Jul. 2008, pp. 960–964.
[7] X. Zhang and F. Cai, “Reduced-complexity decoder architecture for non-Binary
LDPC codes,” IEEE Trans. VLSI Systems, vol. 19, no. 7, pp. 1229–1238, Jul.
2011.
[8] G. Sarkis and W. J. Gross, “Efficient stochastic decoding of non-binary LDPC
codes with degree-two variable nodes,” IEEE Commun. Lett., vol. 16, no. 3,
pp. 389–391, Mar. 2012.
[9] A. Ciobanu, S. Hemati and W. J. Gross, “Adaptive multiset stochastic decoding
of non-binary LDPC codes,” IEEE Trans. Signal Processing, vol. 61, no. 16,
pp. 4100–4113, Aug. 2013.
[10] A. Voicila, D. Declercq, F. Verdier, M. Fossorier and P. Urard, “Architecture
of a low-complexity non-binary LDPC decoder for high order fields,” in Proc.
IEEE Int. Symp. Commun. and Inf. Technologies (ISCIT), Sydney, Australia,
Oct. 2007, pp. 1201–1206.
[11] J. Lin, J. Sha, Z. Wang, and L. Li, “Efficient decoder design for nonbinary
quasicyclic LDPC codes,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 57,
no. 5, pp. 1071–1082, May 2010.
[12] J. Lin and Z. Yan, “Efficient shuffled decoder architecture for nonbinary Quasi-
Cyclic LDPC codes,” IEEE Trans. VLSI Systems, vol. 21, no. 9, pp. 1756–1761,
Sept. 2013.
[13] J. Lin, J. Sha, Z. Wang, and L. Li, “An efficient VLSI architecture for nonbinary
LDPC decoders,” IEEE Trans. Circuits Syst. II: Exp. Briefs, vol. 57, no. 1, pp.
51–56, Jan. 2010.
[14] F. Cai and X. Zhang, “Relaxed Min-Max decoder architectures for nonbinary
Low-Density Parity-Check Codes,” IEEE Trans. VLSI Systems, vol. 21, no. 11,
pp. 2010–2023, Nov. 2013.
[15] Y. Ueng, C. Leong, C. Yang, C. Cheng, K. Liao and S. Chen, “An efficient
layered decoding architecture for nonbinary QC-LDPC codes,” IEEE Trans.
Circuits Syst. I: Reg. Papers, vol. 56, no. 2, pp. 385–398, Feb. 2012.
[16] X. Chen and C.-L. Wang, “High-Throughput Efficient Non-Binary LDPC De-
coder Based on the Simplified Min-Sum Algorithm,” IEEE Trans. Circuits Syst.
I: Reg. Papers, vol. 59, no. 11, pp. 2784–2794, Nov. 2012.
[17] C. Zhang and K. K. Parhi, “A network-efficient nonbinary QC-LDPC decoder
architecture,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 59, no. 6, pp.
1359–1371, Jun. 2012.
[18] E. Arıkan, “Channel polarization: a method for constructing capacity-achieving
codes for symmetric binary-input memoryless channels,” IEEE Trans. Info.
Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[19] E. T. E. Sasoglu and E. Arıkan, “Polarization for arbitrary discrete memoryless
channels,” in Proc. IEEE Information Theory Workshop, 2009, pp. 144–148.
[20] C. Leroux, A. J. Raymond, G. Sarkis and W. J. Gross, “A semi-parallel
successive-cancellation decoder for polar codes,” IEEE Trans. Signal Process-
ing, vol. 61, no. 2, pp. 289–299, Jan. 2013.
[21] I. Tal and A. Vardy, “List decoding of polar codes,” in Proc. IEEE Int. Symp.
on Information Theory, Jul. 2011, pp. 1–5.
[22] ——, “List decoding of polar codes,” in
http://webee.technion.ac.il/people/idotal/papers/preprints/polarList.pdf.
[23] K. Niu and K. Chen, “CRC-aided decoding of polar codes,” IEEE Commun.
Lett., vol. 16, no. 10, pp. 1668–1671, Oct. 2012.
[24] H. S. B. Li and D. Tse, “An adaptive successive cancellation list decoder for po-
lar codes with cyclic redundancy check,” IEEE Commun. Lett., vol. 16, no. 12,
pp. 2044–2047, Dec. 2012.
[25] S. K. N. Goela and M. Gastpar, “On LP decoding of polar codes,” in Proc. IEEE
Information Theory Workshop, Aug. 2010, pp. 1–5.
[26] A. Eslami and H. Pishro-Nik, “On finite-length performance of polar codes:
stopping sets, error floor, and concatenated design,” IEEE Trans. Commun.,
accepted.
[27] P. Trifonov, “Efficient design and decoding of polar codes,” IEEE Trans. Com-
mun., vol. 60, no. 11, Dec. 2012.
[28] K. Chen, K. Liu, and J. Lin, “Improved successive cancellation decoding of
polar codes,” IEEE Trans. Commun., vol. 61, no. 8, pp. 3100–3107, Aug. 2013.
[29] C. Zhang and K. K. Parhi, “Low-latency sequential and overlapped architec-
tures for successive cancellation polar decoder,” IEEE Trans. Signal Processing,
vol. 61, no. 10, pp. 2429–2441, Mar. 2013.
[30] A. Alamdar-Yazdi and F. R. Kschischang, “A simplified successive-cancellation
decoder for polar codes,” IEEE Commun. Lett., vol. 15, no. 12, pp. 1378–1380,
Dec. 2011.
[31] A. Balatsoukas-Stimming, A. J. Raymond, W. J. Gross and A. Burg, “Tree
search architecture for list SC decoding of polar codes,” in arXiv:1303.7127.
[32] J. Lin and Z. Yan, “An efficient list decoder architecture for polar codes,” IEEE
Trans. Very Large Scale Integr. (VLSI) Syst., 2015, to appear.
[33] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, “LLR-based successive
cancellation list decoding of polar codes,” http://arxiv.org/abs/1401.3753v3,
submitted to IEEE Trans. Signal Process. [Online]. Available:
http://arxiv.org/abs/1401.3753
[34] ——, “LLR-based successive cancellation list decoding of polar codes,” in Proc.
IEEE Int. Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Florence, Italy, May 2014, pp. 3903–3907.
[35] B. Li, H. Shen, and D. Tse, “Parallel decoders of polar codes,”
http://arxiv.org/abs/1309.1026v1, Sep. 2013.
[36] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation list decoders
for polar codes with multibit decision,” IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., to appear.
[37] C. Xiong, J. Lin, and Z. Yan, “Symbol-decision successive cancellation list
decoder for polar codes,” http://arxiv.org/abs/1501.04705, submitted to IEEE
Trans. Signal Processing.
[38] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, “Increasing the
speed of polar list decoders,” in Proc. IEEE Workshop on Signal Processing
Systems (SiPS), Belfast, UK, 2014.
[39] R. Ahlswede, N. Cai, S. Li, and R. Yeung, “Network information flow,” IEEE
Trans. Info. Theory, vol. 46, pp. 1204–1216, Jul. 2000.
[40] P. A. Chou, Y. Wu, and K. Jain, “Practical network coding,” in Allerton Conf.
on Comm., Control, and Computing, Monticello, IL, Oct. 2003.
[41] T. Ho, M. Médard, R. Kötter, D. Karger, M. Effros, J. Shi, and B. Leong,
“A random linear network coding approach to multicast,” IEEE Trans. Info.
Theory, vol. 52, no. 10, pp. 4413–4430, Oct. 2006.
[42] T. Ho, R. Kötter, M. Médard, D. R. Karger, and M. Effros, “The benefits of
coding over routing in a randomized setting,” in Proc. IEEE Int. Symp. on
Information Theory, Yokohama, June-July 2003, p. 442.
[43] N. Chen, Z. Yan, M. Gadouleau, Y. Wang and B. W. Suter, “Rank metric
decoder architectures for random linear network coding with error control,”
IEEE Trans. VLSI Syst., vol. 20, no. 2, pp. 296–309, Feb. 2012.
[44] R. Kötter and F. R. Kschischang, “Coding for errors and erasures in random
network coding,” IEEE Trans. Info. Theory, vol. 54, no. 8, pp. 3579–3591,
August 2008.
[45] N. Cai and R. W. Yeung, “Network coding and error correction,” in Proc. IEEE
Information Theory Workshop, Bangalore, India, Oct. 2002, pp. 20–25.
[46] D. Silva, F. R. Kschischang and R. Kötter, “A rank-metric approach to error
control in random network coding,” IEEE Trans. Info. Theory, vol. 54, no. 9,
pp. 3951–3967, September 2008.
[47] H. Mahdavifar and A. Vardy, “Algebraic list-decoding on the operator channel,”
in Proc. IEEE Int. Symp. Info. Theory, Austin, USA, June 2010, pp. 1193–1197.
[48] ——, “Algebraic list-decoding of subspace codes,” http://arxiv.org/abs/1202.0338.
[49] ——, “Algebraic list-decoding of subspace codes with multiplicities,” in Proc.
2011 Allerton Conf. Communications, Control and Computing, Illinois, USA,
Sep. 2011, pp. 1430–1437.
[50] V. Guruswami and M. Sudan, “Improved Decoding of Reed-Solomon Codes
and Algebraic Geometry Codes,” IEEE Trans. Info. Theory, vol. 45, no. 6, pp.
1757–1767, Sep. 1999.
[51] H. Xie, J. Lin, Z. Yan and B. W. Suter, “Linearized polynomial interpolation
and its applications,” IEEE Trans. Signal Processing, vol. 61, no. 1, pp. 206–
217, Jan. 2013.
[52] G. He, G. Sarkis, S. Hemati, W. J. Gross and B. Bai, “Low-complexity channel-
likelihood estimation for non-binary codes and QAM,” IEEE Commun. Lett.,
vol. 16, no. 6, pp. 801–804, Aug. 2012.
[53] M. Sudan, “Decoding of Reed-Solomon codes beyond the error-correction
bound,” J. Complexity, vol. 13, pp. 180–193, 1997.
[54] G. Sarkis and W. J. Gross, “Increasing the throughput of polar decoders,”
IEEE Commun. Lett., vol. 17, no. 9, pp. 725–728, Apr. 2013.
[55] I. Tal and A. Vardy, “List decoding of polar codes,” IEEE Trans. Info. Theory,
2015, [Online; DOI: 10.1109/TIT.2015.2410251].
[56] L. Barnault and D. Declercq, “Fast decoding algorithm for LDPC over GF($2^q$),”
in Proc. IEEE Information Theory Workshop, 2003, pp. 70–73.
[57] H. Wymeersch, H. Steendam, and M. Moeneclaey, “Log-domain decoding of
LDPC codes over GF(q),” in Proc. IEEE Int. Conf. Commun, Paris, France,
Jun. 2004, pp. 772–776.
[58] D. Declercq and M. Fossorier, “Decoding algorithms for nonbinary LDPC codes
over GF(q),” IEEE Trans. Commun., vol. 55, no. 4, pp. 633–643, Apr. 2007.
[59] X. Chen, S. Lin and V. Akella, “Efficient configurable decoder architecture
for nonbinary quasi-cyclic LDPC codes,” IEEE Trans. Circuits Syst. I: Reg.
Papers, 2012.
[60] C. Lin, K. Lin, H. Chan, and C. Lee, “A 3.33 Gb/s (1200, 720) low-density
parity check code decoder,” in 31st Eur. Solid-State Circuits Conf., Sep. 2005,
pp. 211–214.
[61] M. Beermann, L. Schmalen, and P. Vary, “Improved decoding of binary and
non-binary LDPC codes by probabilistic shuffled belief propagation,” in Proc.
IEEE Int. Conference on Communications (ICC), Kyoto, Japan, Mar. 2011,
pp. 1–5.
[62] L. Liu and C.-J. Shi, “Sliced message passing: High throughput overlapped
decoding of high-rate low-density parity-check codes,” IEEE Trans. Circuits
Syst. I: Reg. Papers, vol. 55, no. 11, pp. 3697–3710, 2008.
[63] R. M. Tanner, D. Sridhara, A. Sridharan, T. E. Fuja, D. J. Costello, Jr., “LDPC
block and convolutional codes based on circulant matrices,” IEEE Trans. Info.
Theory, vol. 50, no. 12, pp. 2966–2984, Dec. 2004.
[64] B. Zhou, Y. Y. Tai, L. Lan, S. Song, L. Zeng, and S. Lin, “Construction of Non-
Binary Quasi-Cyclic LDPC Codes by Arrays and Array Dispersions,” IEEE
Trans. Commun., vol. 57, no. 6, pp. 1652–1662, Jun. 2009.
[65] H. Zhong, W. Xu, N. Xie and T. Zhang, “Area-efficient Min-Sum decoder
design for high-rate Quasi-Cyclic Low-Density Parity-Check codes in magnetic
recording,” IEEE Trans. Magnetics, vol. 43, no. 12, pp. 4117–4122, 2007.
[66] K. Kasai, M. Hagiwara, H. Imai and K. Sakaniwa, “Quantum error correc-
tion beyond the bounded distance decoding limit,” IEEE Trans. Info. Theory,
vol. 58, no. 2, pp. 1223–1230, Feb. 2012.
[67] M. R. Yazdani, S. Hemati, and A. H. Banihashemi, “Improving belief propa-
gation on graphs with cycles,” IEEE Commun. Lett., vol. 8, no. 1, pp. 57–59,
Jan. 2004.
[68] B. Zhou, J. Kang, S. Song, S. Lin, K. Abdel-Ghaffar, and M. Xu, “Construction
of non-binary quasi-cyclic LDPC codes by arrays and array dispersions,” IEEE
Trans. Commun., vol. 57, no. 6, pp. 1652–1662, Jun. 2009.
[69] Y. Tao, Y. S. Park and Z. Zhang, “High-throughput architecture and imple-
mentation of regular (2, dc) nonbinary LDPC decoders,” in Proc. IEEE Int.
Symp. on Circuits and Systems (ISCAS), Seoul, Korea, May 2012.
[70] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 Low-
Density Parity-Check code decoder,” IEEE J. Solid-State Circuits, vol. 37,
no. 3, pp. 404–412, Mar. 2002.
[71] X.Y. Hu and E. Eleftheriou, “Binary representation of cycle Tanner-graph
GF($2^q$) codes,” in Proc. IEEE Int. Conference on Communications (ICC),
Paris, France, Jun. 2004.
[72] “Design Compiler Graphical,” www.synopsys.com/tools/implementation/
rtlsynthesis/dcgraphical/Pages/default.aspx, 2013, [Online; accessed 29-July-
2013].
[73] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D.
Franzon, M. Bucher, S. Basavarajaiah, J. Oh, and R. Jenkal, “FreePDK: An
open-source variation-aware design kit,” in Proc. IEEE Int. Conf. Microelec-
tron. Syst. Education (MSE07), San Diego, USA, Jun. 2007, pp. 173–174.
[74] J. Lin and Z. Yan, “An efficient list decoder architecture for polar codes,” in
http://arxiv.org/abs/1409.4744.
[75] C. A. Klein, www2.ece.ohio-state.edu/∼klein/ece766/766-10n.ppt.
[76] R. Cideciyan and M. Gustlin, “Double burst error detection capability of ethernet
CRC,” in http://www.ieee802.org/3/bj/public/jul12/cideciyan_01_0712.pdf.
[77] K. E. Batcher, “Sorting networks and their applications,” in Proc. ACM spring
joint computer conference, Apr. 1968, pp. 307–314.
[78] S. M. Sait and W. Hasan, “Hardware design and VLSI implementation of a
byte-wise CRC generator chip,” IEEE Trans. Consumer Electron., vol. 41,
no. 1, pp. 195–200, Feb. 1995.
[79] N. Hussami, S. B. Korada, and R. Urbanke, “Performance of polar codes for
channel and source coding,” in Proc. IEEE Int. Symp. on Information Theory,
Seoul, South Korea, Jun. 2009, pp. 1488–1492.
[80] J. Lin, C. Xiong, and Z. Yan, “A reduced latency list decoding algorithm for
polar codes,” in Proc. IEEE Workshop on Signal Processing Systems (SiPS),
Belfast, UK, Oct. 2014, pp. 56–61.
[81] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, “Algorithms of finding the first two
minimum values and their hardware implementation,” IEEE Trans. Circuits
Syst. I, Reg. Papers, vol. 55, no. 11, pp. 3430–3437, Dec. 2008.
[82] C. Zhang, X. Yu, and J. Sha, “Hardware architecture for list successive can-
cellation polar decoder,” in Proc. IEEE Int. Symp. on Circuits and Systems
(ISCAS), Melbourne, AU, Jun. 2014, pp. 209–212.
[83] H. Yoo and I.-C. Park, “Partially parallel encoder architecture for long polar
codes,” IEEE Trans. Circuits Syst. II, Exp. Briefs, accepted.
[84] C. Cheng and K. K. Parhi, “High-speed parallel CRC implementation based
on unfolding, pipelining, and retiming,” IEEE Trans. Circuits Syst. II, Exp.
Briefs, vol. 53, no. 10, pp. 1017–1021, Oct. 2006.
[85] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation polar decoder
architectures using 2-bit decoding,” IEEE Trans. Circuits Syst. I, Reg. Papers,
vol. 61, no. 4, pp. 1241–1254, Apr. 2014.
Vita
Jun Lin received the B.S. degree in physics and the M.S. degree in microelectronics
from Nanjing University, Nanjing, China, in 2007 and 2010, respectively. From
2010 to 2011, he was an ASIC design engineer with AMD. During summer 2013, he
was an intern with Qualcomm Research, Bridgewater, NJ. He is currently working
toward the Ph.D. degree in the Department of Electrical and Computer Engineering,
Lehigh University, Bethlehem.
His current research interests include low-power high-speed VLSI design, specif-
ically VLSI design for digital signal processing and cryptography. He was a co-
recipient of the Merit Student Paper Award at the IEEE Asia Pacific Conference
on Circuits and Systems in 2008. He was a recipient of the 2014 IEEE Circuits
& Systems Society (CAS) student travel award. He was also a recipient of the 2015
Lehigh University Doctoral Travel Grant for Global Opportunities.
