SIGNAL PROCESSING TECHNIQUES AND APPLICATIONS by Shi, Feng
Lehigh University
Lehigh Preserve
Theses and Dissertations
2015
Recommended Citation
Shi, Feng, "SIGNAL PROCESSING TECHNIQUES AND APPLICATIONS" (2015). Theses and Dissertations. Paper 1624.
SIGNAL PROCESSING TECHNIQUES
AND APPLICATIONS
by
Feng Shi
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Doctor of Philosophy
in
Electrical Engineering
Lehigh University
January 2015
© Copyright 2015 by Feng Shi
All Rights Reserved
Approved and recommended for acceptance as a dissertation in partial fulfillment
of the requirements for the degree of Doctor of Philosophy.
Date
Prof. Zhiyuan Yan
(Dissertation Advisor)
Accepted Date
Committee Members:
Prof. Zhiyuan Yan
(Committee Chair)
Prof. Meghanad D. Wagh
Prof. Tiffany Jing Li
Dr. Viswanath Annampedu
Avago Technologies
Acknowledgments
Foremost, I would like to express my sincere gratitude to my advisor Prof. Zhiyuan
Yan for the continuous support of my Ph.D. study and research, and for his patience
and expertise. There were numerous times when I was lost in details and made no
progress. He could always guide me to work out the details with his expertise and
valuable advice. More importantly, he showed me how to look at the work from a higher
level and a more intriguing perspective. I am obliged to him for his endless support
and encouragement. He is also my mentor and gave me a lot of valuable advice in
seeking career opportunities. He recommended me for an internship, from which I
gained invaluable industry experience. I am really lucky to have pursued my Ph.D.
degree under his guidance. I could not imagine having a better advisor and mentor
for my Ph.D. study.
Besides my advisor, I would like to thank my co-advisor Prof. Wagh for his
guidance in my later Ph.D research. His valuable expertise and insightful comments
were very beneficial and inspired me in the new area. I would also like to thank Prof.
Tiffany Jing Li and Dr. Viswanath Annampedu for serving on my Ph.D. committee
and spending their precious time examining my work. I am grateful for their
encouragement, insightful comments and inspiring questions.
Many sincere thanks to my labmates and friends at Lehigh University, Ning
Chen, Xuebin Wu, Hongmei Xie, Chenrong Xiong, and Jun Lin. We shared many
memorable times and exciting discussions. In particular, I would like to thank Xuebin
Wu for helping my wife and me find an apartment and shop for groceries every
week in my first year. I am also grateful for the many discussions and comments we
had on our collaborative work. Many thanks also go to my friends at Lehigh: Chen
Chen, Xingjian Zhang, Yang Liu, and others.
Finally, I would like to thank my parents for their endless support and unconditional
love. They provided everything, financially and spiritually, for me to get a better
education. I am proud of them and owe them this Ph.D. degree. I am indebted to
my wife for everything. She sacrificed her best years accompanying me at Lehigh.
Without her love, I could not imagine how far I could have gone.
Contents
Acknowledgments iv
Contents vi
List of Tables xii
List of Figures xiv
Abstract 1
1 Introduction 4
1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Delay modeling for On-Chip Interconnects . . . . . . . . . . . 5
1.1.2 Crosstalk Avoidance Codes . . . . . . . . . . . . . . . . . . . 6
1.1.3 Quantum Error Correction . . . . . . . . . . . . . . . . . . . . 7
1.1.4 Efficient Threshold Architecture . . . . . . . . . . . . . . . . . 8
1.1.5 Multiway Sorting . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Delay Models for On-Chip Interconnects 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 DELAY MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 System model . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Three-wire model . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.3 Five-wire model . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 PERFORMANCE EVALUATION . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Three-wire and five-wire buses . . . . . . . . . . . . . . . . . . 24
2.3.2 17-wire bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Performance of CACs . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Improved CACs Based On A New Classification 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 INTERCONNECT DELAYS AND CLASSIFICATION . . . . . . . . 33
3.2.1 Interconnect Modeling . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2 Derivation of Closed-form Expressions . . . . . . . . . . . . . 34
3.2.3 Pattern Classification . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 NEW MEMORYLESS CROSSTALK AVOIDANCE CODES . . . . . 44
3.3.1 Previous CAC Design . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 CAC Design with New Classification . . . . . . . . . . . . . . 47
3.3.3 Codes Under (C3, 1C) . . . . . . . . . . . . . . . . . . . . . . 52
3.3.4 Codes Under (C4, 2C) . . . . . . . . . . . . . . . . . . . . . . 53
3.3.5 Codes Under (C5, 3C) . . . . . . . . . . . . . . . . . . . . . . 55
3.3.6 Codes Under (C2, 1C) . . . . . . . . . . . . . . . . . . . . . . 56
3.3.7 Pruned Codes Under (C2, 1C) . . . . . . . . . . . . . . . . . . 57
3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Crosstalk avoidance codes for RLC On-Chip Interconnects 64
4.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 CAPACITANCE AND INDUCTANCE EFFECTS . . . . . . . . . . 68
4.2.1 Interconnect Model . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Crosstalk Delay . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.3 Interconnect Ring . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 CAC design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Previous CAC Design . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.3 New CAC Design . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.4 (2, 1)-SOTA codes . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 CODEC design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 (2, 1)-SOTA codes . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5 Quasi-Cyclic Low-Density Parity-Check Stabilizer Codes 93
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 QC-LDPC Stabilizer Codes . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Base parity check matrix . . . . . . . . . . . . . . . . . . . . . 101
5.3.2 QC-LDPC stabilizer codes with no cycles of girth four . . . . 104
5.3.3 QC-LDPC stabilizer codes with rotation . . . . . . . . . . . . 105
5.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6 Efficient Threshold Architectures for Finite Field Operations 111
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2.1 Boolean function . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2.2 Symmetric function . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2.3 Threshold logic . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2.4 RTD Implementation of TG . . . . . . . . . . . . . . . . . . . 119
6.3 XOR via Sort-and-Search . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3.1 Sort-and-search . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3.2 Generalized sort-and-search . . . . . . . . . . . . . . . . . . . 124
6.3.3 Analysis of gate area . . . . . . . . . . . . . . . . . . . . . . . 128
6.4 Tree Implementation of XOR . . . . . . . . . . . . . . . . . . . . . . 129
6.4.1 Direct conversion . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.4.2 XOR with a small number of inputs . . . . . . . . . . . . . . . 130
6.4.3 XOR with a large number of inputs . . . . . . . . . . . . . . . 133
6.4.4 Complexity of multi-input XOR . . . . . . . . . . . . . . . . . 133
6.5 Multiplication over GF(2^m): Threshold Implementation . . . . . . . . 139
6.5.1 Polynomial basis multiplication over GF(2^m) . . . . . . . . . . 140
6.5.2 Implementation of PB multiplication using multi-input XORs 141
6.5.3 Normal basis multiplication over GF(2^m) . . . . . . . . . . . . 142
6.5.4 Implementation of NB multiplication using multi-input XORs 145
6.5.5 Complexity of threshold implementations of multiplication . . 146
6.5.6 Comparison with existing approaches . . . . . . . . . . . . . . 152
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7 An Enhanced Multiway Sorting Network Based on n-Sorters 155
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.3 Multiway Merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.4 Multiway Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.4.1 Multiway sorting algorithm . . . . . . . . . . . . . . . . . . . 170
7.4.2 Latency analysis . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.4.3 Analysis of the number of sorters . . . . . . . . . . . . . . . . 172
7.4.4 Comparison of the number of sorters . . . . . . . . . . . . . . 175
7.5 Application in Threshold Logic . . . . . . . . . . . . . . . . . . . . . 178
7.5.1 Threshold logic . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.5.2 n-sorter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.5.3 Analysis of number of gates . . . . . . . . . . . . . . . . . . . 180
7.5.4 Comparison of the number of gates . . . . . . . . . . . . . . . 182
7.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8 Conclusion and Future Work 189
Bibliography 193
A Appendix 210
A.1 Proof of Lemma 6.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
A.2 Proof of Lemma 6.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
A.3 Proof of Theorem 6.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 212
A.4 Proof of Lemma 6.3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
A.5 Proof of Theorem 6.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Vita 215
List of Tables
2.1 Analytical three-wire model . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Decomposition of worst-case patterns in the five-wire model. . . . . . 20
2.3 Analytical five-wire model . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Comparison of delays of our three-wire model and the model in [7] . . 22
2.5 Comparison of delays of our five-wire model and the model in [7] . . . 23
2.6 Comparison of delays for wire 9 in a 17-wire bus . . . . . . . . . . . . 25
2.7 Comparison of delays for all wires in a 17-wire bus . . . . . . . . . . . 27
3.1 Subclassification of patterns for a five-wire bus . . . . . . . . . . . . . 36
3.2 Closed-form expressions for wire 3 in a five-wire bus . . . . . . . . . . 37
3.3 Closed-form expressions for wire 2 in a four-wire bus . . . . . . . . . 42
3.4 Closed-form expressions for wire 1 in a four-wire bus . . . . . . . . . 43
3.5 Largest 5-bit codebook(s) under constraint (Ci, jC). . . . . . . . . . 50
3.6 Comparison of size and throughput of IOLC, UC, and OLC [9] . . . . 61
4.1 Classification of patterns with respect to W3 = |∑_{i=1}^{5} w_i|. . . . . . . . 76
4.2 Classification of patterns with respect to W3 = |∑_{i=1}^{5} w_i|. . . . . . . . 76
4.3 Code rates of our (2,1)-SOTA codes for an m-bit bus (m = 5, · · · , 32). 90
4.4 Reduction of worst case delays . . . . . . . . . . . . . . . . . . . . . . 91
4.5 Reduction of worst case noise . . . . . . . . . . . . . . . . . . . . . . 91
6.1 Truth table for the searching network of a 4-input parity function. . . 123
6.2 A searching function y of ordered binary sequences (x′1, x′2, · · · , x′9). . 127
6.3 Comparison of XORs via [81, 91, 92] for fan-ins B = 3, 4, 5, 6, 7. . . . . 134
6.4 Comparison of complexities of PB multipliers with that in [25, 106] . . 151
6.5 Comparison of complexities of NB multipliers with that in [25, 106] . 152
7.1 Comparison of latencies of sorting networks of N = np inputs . . . . . 172
7.2 Comparison of the number of sorters of SS-Mk and our scheme . . . . 178
7.3 Comparison of the number of gates with buffers . . . . . . . . . . . . 185
7.4 Comparison of the number of gates without buffers (n ≤ 10) . . . . . 187
7.5 Comparison of the number of gates without buffers (n ≤ 20) . . . . . 188
List of Figures
2.1 A distributed RC model of an m-wire bus. . . . . . . . . . . . . . . . 18
3.1 A distributed RC model for five wires. . . . . . . . . . . . . . . . . . 33
3.2 Delays of the middle wire in a five-wire bus . . . . . . . . . . . . . . . 40
3.3 Delays of side wires in a four-wire bus . . . . . . . . . . . . . . . . . . 40
3.4 Simulated delays of middle and side wires . . . . . . . . . . . . . . . 45
4.1 A distributed RLC model for five wires. . . . . . . . . . . . . . . . . . 69
4.2 Ringing on wire 3 of a five-wire bus for ↑↑↑↑↑ and ↓↑↑↑↓. . . . . . . . 71
4.3 Construct m-bit (i, kw)-SOTA codebook . . . . . . . . . . . . . . . . 77
4.4 CODEC for an m-bit CAC via Alg. 4 . . . . . . . . . . . . . . . . . . 84
5.1 Classification of stabilizer codes. . . . . . . . . . . . . . . . . . . . . . 95
5.2 Parity check matrices of Codes 1 and 2 . . . . . . . . . . . . . . . . . 108
5.3 Comparison of block error probability of our codes with others . . . . 109
6.1 Symbol of a threshold gate . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 RTD implementation of a threshold gate . . . . . . . . . . . . . . . . 120
6.3 Computing a symmetric function via a sort-and-search algorithm [98]. 121
6.4 2-Sorters implemented in threshold logic . . . . . . . . . . . . . . . . 122
6.5 A 4-input XOR implemented via the sort-and-search algorithm in [98]. 123
6.6 n-Sorters implemented in threshold logic . . . . . . . . . . . . . . . . 123
6.7 9-input XOR implemented via a sort-and-search algorithm . . . . . . 127
6.8 Threshold implementation of two-input XOR gate . . . . . . . . . . . 130
6.9 Threshold gate implementation of s = x1 ⊕ x2 · · · ⊕ xn via [91]. . . . . 132
6.10 Threshold gate implementation of s = x1 ⊕ x2 · · · ⊕ xn via [92]. . . . . 132
6.11 Tree implementation of n-input XOR . . . . . . . . . . . . . . . . . . 134
6.12 Gate area of n-input XOR with fan-in bound B . . . . . . . . . . . . 136
6.13 Number of interconnects of n-input XOR with fan-in bound B . . . . 137
6.14 Latency of n-input XOR with fan-in bound B . . . . . . . . . . . . . 137
6.15 Implementation of polynomial basis multiplication . . . . . . . . . . . 143
6.16 Implementation of normal basis multiplication . . . . . . . . . . . . . 146
6.17 Gate area for PB and NB multiplications . . . . . . . . . . . . . . . . 148
6.18 Number of interconnects for PB and NB multiplications . . . . . . . . 149
6.19 Latency for PB and NB multiplications . . . . . . . . . . . . . . . . . 149
7.1 Symbols for 2-sorter and n-sorter . . . . . . . . . . . . . . . . . . . . 160
7.2 The odd-even merge of two sorted lists of 4 values using 2-sorters . . 161
7.3 Iterative construction rule for the n-way merger [111]. . . . . . . . . . 162
7.4 The network for n sorted lists of m wires. . . . . . . . . . . . . . . . . 164
7.5 Configurations of two adjacent sorters S1 and S2 . . . . . . . . . . . 165
7.6 A 3-way merging network of 21 inputs . . . . . . . . . . . . . . . . . 168
7.7 A 3-way merging network of 27 inputs . . . . . . . . . . . . . . . . . 169
7.8 A 3-way sorting network of 81 inputs . . . . . . . . . . . . . . . . . . 169
7.9 Comparison of the number of sorters (n ≤ 10 and n ≤ 20) . . . . . . 177
7.10 Comparison of the latency (n ≤ 10 and n ≤ 20) . . . . . . . . . . . . 177
7.11 Symbol for threshold gate . . . . . . . . . . . . . . . . . . . . . . . . 179
7.12 Sorters implemented in threshold logic . . . . . . . . . . . . . . . . . 180
7.13 Comparison of the number of gates (n ≤ 10 and n ≤ 20) . . . . . . . 184
7.14 Comparison of the latency (n ≤ 10 and n ≤ 20) . . . . . . . . . . . . 184
Abstract
As technology scales down, more transistors can be fabricated in the same
area, which enables the integration of many components on the same substrate,
referred to as a system-on-chip (SoC). The components on an SoC are connected by
on-chip global interconnects. The recent International Technology Roadmap for
Semiconductors (ITRS) shows that, with scaling, gate delay decreases
but global interconnect delay increases due to crosstalk. The interconnect delay
has become a bottleneck of the overall system performance. Many techniques have
been proposed to address crosstalk, such as shielding, buffer insertion, and crosstalk
avoidance codes (CACs). The CAC is a promising technique due to its good crosstalk
reduction, lower power consumption, and smaller area. In this dissertation, I will present
analytical delay models for on-chip interconnects with improved accuracy. This
enables more accurate control of the delays of transition patterns and leads
to a more efficient CAC, whose worst-case delay is 30–40% smaller than that of the
best previously proposed CACs. As the clock frequency approaches multi-gigahertz,
the parasitic inductance of on-chip interconnects has become significant, and its
detrimental effects, including increased delay, voltage overshoots and undershoots,
and increased crosstalk noise, cannot be ignored. We introduce new CACs to address
both capacitive and inductive couplings simultaneously.
Quantum computers are more powerful than classical computers in solving some NP
problems. However, quantum computers suffer greatly from unwanted interactions
with the environment. Quantum error correction codes (QECCs) are needed
to protect quantum information against noise and decoherence. Given the good
error-correcting performance of classical low-density parity-check (LDPC) codes, it is
desirable to adapt their existing iterative decoding algorithms to obtain LDPC-based
QECCs. Several QECCs based on nonbinary LDPC codes have been proposed with a much
better error-correcting performance than existing quantum codes over a qubit channel.
In this dissertation, I will present stabilizer codes based on nonbinary QC-LDPC codes
for qubit channels. The results support the observation that QECCs based on nonbinary
LDPC codes appear to achieve better performance than QECCs based on binary
LDPC codes.
As technology scales down further to the nanoscale, CMOS devices suffer
greatly from quantum mechanical effects. Some emerging nano devices, such
as resonant tunneling diodes (RTDs), quantum cellular automata (QCA), and single
electron transistors (SETs), have no such issues and are promising candidates
to replace traditional CMOS devices. Threshold gates, which can implement
complex Boolean functions within a single gate, can be easily realized with these
devices. Several applications dealing with real-valued signals have already been
realized using nanotechnology-based threshold gates. Unfortunately, applications
using finite fields, such as error-correcting coding and cryptography, have not been
realized using nanotechnology. The main obstacle is that they require a great number
of exclusive-ORs (XORs), which cannot be realized in a single threshold gate.
Besides, the fan-in of a threshold gate in RTD nanotechnology needs to be bounded
for both reliability and performance purposes. In this dissertation, I will present a
majority-class threshold architecture of XORs with bounded fan-in, and compare it
with a Boolean-class architecture. I will show an application of the proposed XORs
to finite field multiplication. The analysis results will show that the majority-class
architecture outperforms the Boolean-class architecture in terms of hardware complexity
and latency. I will also introduce a sort-and-search algorithm, which can be used
to implement any symmetric function. Since XOR is a special symmetric
function, it can be implemented via the sort-and-search algorithm. To leverage
the power of multi-input threshold functions, I generalize the previously proposed
sort-and-search algorithm from a fan-in of two to arbitrary fan-ins, and propose an
architecture of multi-input XORs with bounded fan-ins.
Chapter 1
Introduction
In many communication systems, such as on-chip interconnects and quantum systems,
interference from the system and the environment often degrades performance
and leads to functional issues. Techniques such as crosstalk avoidance coding
and error correction coding have been proposed to address these issues. In
nanotechnologies, implementations of Boolean operations are quite different
from those in conventional CMOS technology, and new techniques are needed for
efficient implementations. In this dissertation, we investigate and propose such
signal processing techniques to address issues in these systems.
In the following, we first give a brief introduction and show the motivations of
our work. Then, we present the main results of our work.
1.1. MOTIVATIONS
1.1 Motivations
1.1.1 Delay modeling for On-Chip Interconnects
As process technologies scale down into the deep-submicrometer regime, coupling
capacitance between adjacent wires becomes more significant and greatly increases
crosstalk delays. The recent International Technology Roadmap for Semiconductors
(ITRS) [1] shows that gate delay decreases with scaling while global wire
delay increases. The crosstalk delay becomes a major part of the total delay, and
greatly affects the overall system performance.
To evaluate and alleviate crosstalk delays, various delay models of interconnects
have been proposed recently (see, for example, [2–7]), most of which are based
on numerical approaches and offer little insight (see, e.g., [2–5]). Although these
numerical models can have high accuracy, they have several drawbacks, such as bulky
lookup tables, dependence on technology, poor portability, and high complexity.
In contrast, analytical delay models (see, e.g., [6, 7]) depend on few technology
parameters and have very low computational complexities. The model in [6] has
much higher accuracy; however, it is not conducive to alleviating crosstalk.
One widely used analytical delay model, proposed by Sotiriadis et al. [7], illustrates
the connection between delays of coupled interconnects and transition patterns and
appears to be the most comparable previous delay model. However, the model in [7]
has limited accuracy. To improve accuracy, we focus on closed-form expressions of
the signals on the bus based on a distributed RC model, and approximate the wire
delays by evaluating these closed-form expressions.
1.1.2 Crosstalk Avoidance Codes
The analytical model proposed by Sotiriadis et al. [7], a widely used delay model,
gives upper bounds on the delay of all wires on a bus. In this model, the delay of
the k-th wire depends on the transition patterns of at most three wires, k − 1, k,
and k + 1 only. From [7], the delay of the k-th wire (k ∈ {1, 2, · · · , m}) of an m-bit
bus is given by
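\[
T_k =
\begin{cases}
\tau_0\left[(1+\lambda)\Delta_1^2 - \lambda\Delta_1\Delta_2\right], & k = 1, \\
\tau_0\left[(1+2\lambda)\Delta_k^2 - \lambda\Delta_k(\Delta_{k-1}+\Delta_{k+1})\right], & k \neq 1, m, \\
\tau_0\left[(1+\lambda)\Delta_m^2 - \lambda\Delta_m\Delta_{m-1}\right], & k = m.
\end{cases}
\tag{1.1}
\]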
According to this model, there are five classes of transition patterns, denoted by iC
for i = 0, 1, 2, 3, 4, each of which has a delay (1 + iλ)τ0. This classification enables
one to limit the worst-case delay over a bus by restricting the patterns transmitted
on the bus. That is, by avoiding all transition patterns in iC for i > i0, one can
achieve a worst-case delay of (1+ i0λ)τ0 over the bus. Based on this basic principle,
crosstalk avoidance codes (CACs) of different worst-case delays have been proposed
recently (see, for example, [8–10]). For example, forbidden overlap codes (FOCs),
forbidden transition codes (FTCs), forbidden pattern codes (FPCs), and one lambda
codes (OLCs) achieve a worst-case delay of (1 + 3λ)τ0, (1 + 2λ)τ0, (1 + 2λ)τ0, and
(1 + λ)τ0, respectively. In theory, a worst-case delay of τ0 can be achieved by
assigning two protection wires to each data wire [9].
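As a rough illustration of how (1.1) leads to the classes iC, the following Python sketch evaluates the bound for each wire of a bus and reports the class index of a transition pattern. It rests on assumptions that this section does not spell out: Δk is taken to be +1 for a rising transition, −1 for a falling transition, and 0 for no transition on wire k; λ is taken to be the ratio of coupling capacitance to ground capacitance; and τ0 is taken to be the delay of a crosstalk-free wire. The function names `wire_delays` and `pattern_class` and the value λ = 3 are hypothetical and chosen only for the example.

```python
from fractions import Fraction


def wire_delays(delta, lam, tau0=1):
    """Evaluate the bound of Eq. (1.1) for every wire of an m-wire bus.

    delta[k] is the assumed transition on wire k: +1, -1, or 0.
    lam and tau0 are the model parameters (illustrative values only).
    """
    m = len(delta)
    if m < 2:
        raise ValueError("need at least two wires")
    delays = []
    for k, d in enumerate(delta):
        if k == 0:                      # first wire: one neighbor
            t = (1 + lam) * d * d - lam * d * delta[1]
        elif k == m - 1:                # last wire: one neighbor
            t = (1 + lam) * d * d - lam * d * delta[m - 2]
        else:                           # interior wire: two neighbors
            t = (1 + 2 * lam) * d * d - lam * d * (delta[k - 1] + delta[k + 1])
        delays.append(tau0 * t)
    return delays


def pattern_class(delta, lam=Fraction(3)):
    """Return i such that the worst wire delay equals (1 + i*lam)*tau0.

    Returns None when no wire switches (all delays are zero).
    """
    worst = max(wire_delays(delta, lam))
    if worst == 0:
        return None
    return int((worst - 1) / lam)
```

Under these assumptions, the pattern (+1, −1, +1) on a three-wire bus falls in 4C, while (+1, +1, +1) falls in 0C, which matches the intuition that opposite transitions on adjacent wires are the slowest.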
The classification of transition patterns based on the model in [7] has two draw-
backs. First, the model in [7] has limited accuracy because of its dependence on
only three wires. That is, the model overestimates the delays of patterns in 1C
through 4C, while it underestimates the delays of patterns in 0C. For this reason,
the scheme with a worst-case delay of τ0 is invalid since its actual delay is much
greater. Second, the actual delay ranges of some classes overlap with those of their
adjacent classes.
This, plus the overestimation of delays for 1C through 4C, implies that the
delays of existing CACs are not tightly controlled. These drawbacks motivate us to
include more wires and to classify the transition patterns without overlapping delay
ranges.
1.1.3 Quantum Error Correction
Quantum computers are more efficient than classical computers for some computa-
tional problems, such as factoring a large number and searching an unknown space
for an element satisfying a known property [11]. However, quantum information,
represented by quantum bits or qubits, suffers greatly from unwanted interactions
with the outside world. Thus, quantum error correction codes (QECCs) are needed
to protect quantum information against noise and decoherence [11].
Many QECCs have been proposed in the literature by importing classical er-
ror correction codes, such as low-density parity-check (LDPC) codes, convolutional
codes, Turbo codes, and polar codes (see, for example, [12–21]). Among them,
QECCs based on LDPC codes (see, for example, [12,13,16,17]) are important, since
they can be decoded by adapting existing iterative decoding algorithms. As classical
LDPC codes have asymptotically good performance for a wide class of noisy chan-
nels when decoded by the belief propagation algorithm [22], well-designed quantum
LDPC codes also show good performance [16, 17, 23]. While most quantum LDPC
codes are based on binary LDPC codes, recently several QECCs based on nonbi-
nary LDPC codes have been proposed in [23] with a much better error-correcting
performance than existing quantum codes over a qubit channel.
Since stabilizer codes based on nonbinary LDPC codes have not been studied, and
motivated by the success of adopting nonbinary QC-LDPC codes in CSS codes
in [23], we investigate stabilizer codes based on nonbinary QC-LDPC codes for
qubit channels.
1.1.4 Efficient Threshold Architecture
According to the International Technology Roadmap for Semiconductors (ITRS) [1],
conventional CMOS technology faces great challenges in further scaling. Although
new materials and device structures can sustain CMOS scaling for the next several
years, CMOS scaling will eventually reach its fundamental limits. Some
emerging nanotechnologies, such as resonant tunneling diodes (RTDs), quantum cellular
automata (QCA), and single electron transistors (SETs), have nanoscale structures
and are promising candidates to replace CMOS technology [24]. These
new nanotechnology devices promise smaller feature sizes, higher speed, and
lower power consumption. Even at the system design level, they present two advantages.
First, they easily realize threshold gates (see Fig. 7.11). Threshold gates
are often more powerful than Boolean gates, and can implement complex Boolean
functions with a single gate [25]. Thus the hardware complexity of larger systems
implemented using nanotechnology tends to be much smaller. Second, the outputs
of the threshold gates built with nanotechnology are self-latched. This provides a
natural way of pipelining these systems in most signal processing applications.
Several applications dealing with real-valued signals have already been realized
using nanotechnology based threshold gates [26–29]. However, there is an equally
important class of signal processing applications using finite fields, such as error
correcting coding and cryptography [30]. Unfortunately, the applications using finite
fields have not been realized using nanotechnology.
The main obstacle to nanotechnology-based implementations of applications
of finite fields of characteristic two, denoted as GF(2^m), is that they require
exclusive-ORs (XORs) to realize all arithmetic operations over GF(2^m). Unlike most
conventional Boolean gates such as AND, OR, NOT, NAND, and NOR, XOR cannot
be realized as a single threshold gate. Thus the translation of a finite field
architecture to nanotechnology merely by replacing each conventional gate with an
appropriate combination of threshold gates becomes overly complex. Efficient
implementations based on threshold logic are desired.
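The claim that XOR needs more than one threshold gate can be illustrated with a small Python sketch. It assumes a common convention, not stated in this section, that a threshold gate outputs 1 when the weighted sum of its inputs is at least its threshold. The helper names and the weight bound of 3 are hypothetical. The bounded search is only a sanity check and not a proof; the underlying reason is the well-known fact that two-input XOR is not linearly separable. The two-gate realization shown is one standard construction and is not necessarily the one used later in this dissertation.

```python
from itertools import product


def threshold_gate(weights, threshold, inputs):
    """Assumed gate model: output 1 iff sum(w_i * x_i) >= threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def realizable_by_single_gate(func, n, bound=3):
    """Search integer weights and thresholds in [-bound, bound] for one gate
    that reproduces func on all 2**n binary inputs."""
    points = list(product((0, 1), repeat=n))
    for params in product(range(-bound, bound + 1), repeat=n + 1):
        *weights, threshold = params
        if all(threshold_gate(weights, threshold, p) == func(*p) for p in points):
            return True
    return False


def xor_two_gates(x1, x2):
    """Two-input XOR built from two threshold gates."""
    g = threshold_gate((1, 1), 2, (x1, x2))            # first gate: AND
    return threshold_gate((1, 1, -2), 1, (x1, x2, g))  # second gate uses g
```

With these definitions, the search finds single-gate realizations of AND and NOR but none of XOR within the searched range, while `xor_two_gates` reproduces the XOR truth table.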
1.1.5 Multiway Sorting
Merging-based sorting networks are an important family of sorting networks. One
popular 2-way merging algorithm, called odd-even merging [31], merges two sorted
lists into one sorted list by separately merging their odd-indexed and even-indexed
elements. Most merge sorting networks are based
on 2-way or multi-way merging algorithms using 2-sorters as basic building blocks.
An alternative is to use n-sorters, instead of 2-sorters, as the basic building blocks
so as to greatly reduce the number of sorters as well as the latency. This is also
motivated by efficient threshold implementations of n-sorters due to the powerful
computing capability of threshold logic.
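For reference, the classical 2-way odd-even merge built from 2-sorters can be sketched in Python as follows. This is the textbook construction, shown only as a baseline and not the n-sorter network proposed in this dissertation. A 2-sorter is modeled here as a compare-exchange on two wire positions; the function names and the zero-based wire indexing are hypothetical.

```python
def oddeven_merge(lo, hi, r=1):
    """Yield 2-sorter positions (i, j) that merge the two sorted halves of
    wires lo..hi (inclusive). hi - lo + 1 must be a power of two."""
    step = 2 * r
    if step < hi - lo:
        # Recursively merge the two interleaved subsequences.
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        # Final stage: compare-exchange neighboring results.
        for i in range(lo + r, hi - r, step):
            yield (i, i + r)
    else:
        yield (lo, lo + r)


def apply_network(comparators, values):
    """Apply each 2-sorter in order: smaller value to the lower index."""
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


# Network that merges two sorted lists of 4 values placed on wires 0-3 and 4-7.
merge8 = list(oddeven_merge(0, 7))
```

Each element of `merge8` is one 2-sorter, so replacing groups of them with a single n-sorter is what reduces the sorter count and latency in the multiway approach motivated above.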
1.2. MAIN RESULTS
1.2 Main results
Motivated by the above ideas, we have proposed several techniques and methods to
address these issues. The main results are as follows.
1. Delay modeling: In Chapter 2, we propose analytical delay models for coupled
interconnects with improved accuracy. Based on a distributed RC model,
we first derive closed-form expressions of the signals on the bus. Then we
approximate the wire delays by evaluating these closed-form expressions. Our
delay models differ from the model in [7] in two aspects. First, we approximate
the delays by direct evaluation rather than by the Elmore delay used in the
model in [7]. Second, we consider either three wires or five wires in our delay
models for improved accuracy. Thus, our models achieve better accuracy
than the model in [7]. Since our delay models use the same classification as the
model in [7], they also maintain the simplicity of the model in [7]. Hence, our
delay models are easy to use and conducive to CAC designs. Also, our
five-wire model can be applied to buses with any number of wires. Our extensive
simulation results show that our delay models are more accurate than
the model in [7].
2. Crosstalk avoidance codes: In chapter 3, we propose new CACs for RC-
coupled on-chip interconnects. First, we partition all transition patterns with
respect to the delays on the middle wire of a 5-wire bus. By grouping these
patterns according to their evaluated delays, we have a finer classification of
patterns without overlapping delays between adjacent classes. This enables us
to have a more accurate control of delays for transition patterns, and CACs
designed based on our classification will be more effective. Then, we provide a
method to design CACs with our classification. To illustrate this method, we
present a new CAC based on our classification, which achieves a worst-case
delay that is 30%–40% smaller than that of OLCs.
In chapter 4, we propose novel CACs accounting for both the capacitive and
inductive couplings. The capacitive crosstalk is reduced by restricting oppo-
site transitions in adjacent wires. Since the inductive coupling is a long-range
effect, more neighboring wires are considered for inductive crosstalk. The re-
duction of inductive coupling is achieved by restricting same transitions in
neighboring wires. We also propose a CODEC design for our codes based on
binary mixed-radix numeral systems. The complexity and delay of our CODECs
grow quadratically with the size of the bus.
3. Quantum error correction: In chapter 5, we propose quasi-cyclic (QC)
LDPC stabilizer codes over a qubit depolarizing channel. The construction of
our QC-LDPC stabilizer codes is reduced to the construction of nonbinary QC-
LDPC codes over GF(2^m) satisfying the zero SIP condition, and the decoding
of our QC-LDPC stabilizer codes is based on that of the nonbinary QC-LDPC
codes. First, we derive conditions for nonbinary QC-LDPC codes over GF(2^m)
in order to satisfy the zero SIP condition and to eliminate cycles of length
four, which usually lead to poor decoding performance by iterative decoding
algorithms for LDPC codes. We have constructed two QC-LDPC stabilizer
codes, and simulation results show that they outperform their counterparts
in [16, 17]. This seems to confirm the observation [23] that QECCs based on
nonbinary LDPC codes appear to achieve better performance than QECCs
based on binary LDPC codes.
4. Threshold architecture: In chapter 6, we propose efficient threshold ar-
chitectures for exclusive-ORs with bounded fan-in. The main results are two
classes of threshold architectures with bounded fan-ins of an n-input XOR.
The first, called the Boolean class, expresses the XOR in a two-stage NAND
circuit implemented through threshold gates. The second, referred to as the
majority class, also has a two-level implementation and uses only generalized
majority gates in the first level. Since one can implement an n-input XOR
as a tree of two-input XORs, each of which can be expressed based on other
Boolean gates and implemented by their threshold gates, we refer to this ap-
proach as direct conversion and use it as a basis for comparison. It turns out
the architectures obtained by direct conversion are the same as Boolean class
architectures with B = 3. Hence, our Boolean class architectures provide a
variety of tradeoffs between hardware and time complexities beyond the direct
conversion architectures. Our analysis results also show that the majority class
performs better than the Boolean class as well as the architectures by direct
conversion in both the hardware and time complexity, because the majority
class takes better advantage of the more powerful nature of threshold gates.
5. Multiway Sorting Network: In chapter 7, we propose a new multiway
merging algorithm with n-sorters as basic blocks. This merging algorithm
merges n sorted lists of m values each in 1 + ⌈m/2⌉ steps, where n ≤ m. A
sorting algorithm based on the proposed merging algorithm is also introduced.
Our sorting networks of N inputs have an order O(N log2N) of basic sorters,
which is asymptotically the same as previously proposed multiway sorting
algorithms. Over a wide range of N, our algorithm performs better than other
sorting algorithms. For N ≤ 1.46 × 10^4, our algorithm has up to 46% fewer
sorters. For a more accurate comparison, we show a binary sorting network in
threshold logic, where the basic sorter size scales linearly with the number of
inputs, and compare the number of gates for sorting N inputs. For N ≤ 2 × 10^4,
there are up to 39% fewer gates for a binary sorting network in threshold logic.
Chapter 2
Delay Models for On-Chip
Interconnects
2.1 Introduction
As process technologies scale into the deep submicron region, crosstalk delay is
becoming increasingly severe, especially for global on-chip buses. To cope with this
problem, accurate delay models of coupled interconnects are needed. In particular,
delay models based on analytical approaches are desirable, because they not only
are largely transparent to technology, but also explicitly establish the connections
between delays of coupled interconnects and transition patterns, thereby enabling
crosstalk alleviating techniques such as crosstalk avoidance codes (CACs). Unfor-
tunately, existing analytical delay models, such as the widely cited model [7], have
limited accuracy and do not account for possibly asynchronous switching instants
of wires.
The delay of the k-th wire (k ∈ {1, 2, · · · , m}) of an m-bit bus is given as
follows [7]:
\[
T_k =
\begin{cases}
\tau_0[(1+\lambda)\Delta_1^2 - \lambda\Delta_1\Delta_2], & k = 1 \\
\tau_0[(1+2\lambda)\Delta_k^2 - \lambda\Delta_k(\Delta_{k-1}+\Delta_{k+1})], & k \neq 1, m \\
\tau_0[(1+\lambda)\Delta_m^2 - \lambda\Delta_m\Delta_{m-1}], & k = m,
\end{cases}
\tag{2.1}
\]
where λ is the ratio of the coupling capacitance between adjacent wires and the
loading capacitance, τ0 is the intrinsic delay of a transition on a single wire, and
∆k is 1 for 0 → 1 transition, -1 for 1 → 0 transition, or 0 for no transition on the
k-th wire. We observe that in this model, the delay of the k-th wire depends on
the transition patterns of wires k − 1, k, and k + 1 only. As shown in Eq. (2.1),
all possible values of Tk are given by (1 + iλ)τ0 for i ∈ {0, 1, 2, 3, 4}. Thus, all
transition patterns on wires k − 1, k, and k + 1 can be divided into five classes iC
for i ∈ {0, 1, 2, 3, 4} according to their corresponding i (this classification was also
used in [8]). By limiting transition patterns over the bus, the worst delay can be
reduced. Various crosstalk avoidance codes (CACs) (see, for example, [8,10,32,33])
have been proposed based on this model.
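
As a minimal sketch of how Eq. (2.1) maps a transition pattern to per-wire delays and to a class index i, the following Python fragment may help. The function names `wire_delays` and `pattern_class` are hypothetical and are not taken from [7]; the values τ0 = 3.75 ps and λ = 10 in the usage note are the ones quoted later in Section 2.3.

```python
def wire_delays(deltas, tau0, lam):
    """Delay of every wire of an m-wire bus according to Eq. (2.1).

    deltas holds one entry per wire: +1 (0 -> 1), -1 (1 -> 0), 0 (no transition).
    """
    m = len(deltas)
    if m < 2:
        raise ValueError("Eq. (2.1) needs at least two wires")
    delays = []
    for k, d in enumerate(deltas):
        if k == 0:                      # first wire, one neighbor
            t = (1 + lam) * d**2 - lam * d * deltas[1]
        elif k == m - 1:                # last wire, one neighbor
            t = (1 + lam) * d**2 - lam * d * deltas[m - 2]
        else:                           # interior wire, two neighbors
            t = (1 + 2 * lam) * d**2 - lam * d * (deltas[k - 1] + deltas[k + 1])
        delays.append(tau0 * t)
    return delays


def pattern_class(left, mid, right):
    """Class index i of an interior wire, so that T_k = (1 + i*lambda)*tau0.

    Returns None when the wire itself does not switch.
    """
    if mid == 0:
        return None
    return 2 * mid**2 - mid * (left + right)
```

For instance, `wire_delays([-1, 1, -1], 3.75, 10)[1]` evaluates the middle wire of a ↓↑↓ pattern, which falls in class 4C, and `pattern_class(1, 1, 1)` returns 0 for ↑↑↑.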
Unfortunately, the model in [7] has limited accuracy for the following reasons.
To achieve simplicity, only three wires are considered in the derivation of the model.
In a bus with more than three wires, the simulated wire delay for 0C transition
patterns is much larger than τ0, the delay of 0C given by Eq. (2.1). For example, the
scheme to achieve a delay of τ0 in [9] would be ineffective. Furthermore, the Elmore
delay, which tends to overestimate the delay [34], is used in the derivation. This is
also verified by our simulation results.
In the following, we propose analytical delay models for coupled interconnects
with improved accuracy. Based on a distributed RC model, we first derive closed-
form expressions of the signals on the bus. Then we approximate the wire delays by
evaluating these closed-form expressions. Our delay models differ from the model
in [7] in two aspects. First, we use direct evaluations rather than the Elmore delay
used in the model in [7] to approximate the delays. Second, we consider either three
wires or five wires in our delay models for improved accuracy. Thus, our models
achieve better accuracy than the model in [7]. Since our delay models use the
same classification as the model in [7], they also maintain the simplicity of the model
in [7]. Hence, it is easy to use our delay models for CAC designs.
Also, our five-wire model can be applied to buses with any number of wires. Our
extensive simulation results show that our delay models are more accurate
than the model in [7].
2.2 DELAY MODEL
2.2.1 System model
The on-chip buses are often approximated by the distributed RC model [35]. Our
delay models do not consider the effects of inductance for two reasons. First, it is
difficult to derive a closed-form expression of the signals on the bus based on the
RLC model. More importantly, according to the criteria in [36], the inductance
effects are negligible for buses with length in some range. This conclusion was also
confirmed by other works: the 16b, 32Gb/s, 5mm-long bus and 8b, 16Gb/s, 10mm-
long bus in [37] show that the distributed RC model is still accurate to characterize
these high-speed long interconnects from 5mm to 10mm. We therefore use the
distributed RC model in our derivation of the delay models.
In the following, our models do not account for the source resistance and load
capacitance. However, they can be readily modified to account for both using the
techniques in [38]. In general, source resistance and load capacitance tend to increase
the delay. Since the crosstalk delay on the bus is the major part of the whole delay,
the delays introduced by other parts are ignored. For this reason, no buffer is used.
We assume that ideal step signals are applied on the bus directly.
According to [5], the closed-form expressions of the signals on the bus via a
distributed RC model are sums of infinite terms. It was shown in [38] that sums of
the two most significant terms provide a very close approximation of signals on the
bus. This technique is crucial for the evaluation of the closed-form expressions.
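
To illustrate why truncating the infinite sum costs little accuracy, the sketch below evaluates the far-end step response of a single isolated distributed RC line and compares its 50% delay with the one-term estimate ln(8/π)·τ. The series used here is the standard textbook solution of the diffusion equation for an ideal step at x = 0 and an open end at x = L; it is an assumption of this sketch rather than a formula quoted from [5] or [38], and the names `step_response_far_end` and `delay_50` are hypothetical.

```python
import math


def step_response_far_end(t, tau, n_terms=50):
    """Normalized V(L, t) of one isolated distributed RC line.

    Assumed series: 1 - (4/pi) * sum_n (-1)^n / (2n+1) * exp(-(2n+1)^2 * t / tau),
    with tau = 4*r*c*L^2/pi^2, i.e. tau = (8/pi^2)*tau0 for tau0 = r*c*L^2/2.
    """
    total = 0.0
    for n in range(n_terms):
        k = 2 * n + 1
        total += ((-1) ** n / k) * math.exp(-k * k * t / tau)
    return 1.0 - (4.0 / math.pi) * total


def delay_50(tau, n_terms=50):
    """Time at which the far end crosses 0.5, found by bisection."""
    lo, hi = 1e-3 * tau, 10.0 * tau     # response is below 0.5 at lo, above at hi
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if step_response_far_end(mid, tau, n_terms) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With τ0 = 3.75 ps, calling `delay_50(8 / math.pi**2 * 3.75)` can be compared against `math.log(8 / math.pi) * tau`; under the stated assumptions the higher-order terms change the crossing time only marginally.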
The distributed RC model of an m-wire bus is shown in Fig. 2.1, where Vi(x, t)
denotes the transient signal at a position x along wire i for i ∈ {1, 2, · · · , m}, r and c
denote the resistance and capacitance per unit length, respectively. Also, λc denotes
the coupling capacitance per unit length between two adjacent wires. In this work,
we focus on a uniformly distributed bus and hence assume the parameters r, c, and
λ are the same for all wires.
Our models are based on the 50% delay, which is defined as the time difference
between the respective instants when the input signal and corresponding output
signal cross 50% of the supply voltage Vdd. In the following, we focus on worst-case
patterns leading to the largest 50% delay of the middle wire(s). For some transition
patterns, the delay of the middle wire(s) is the greatest among all wires. For other
transition patterns, other wires may have a greater delay, but the worst delays of
all wires within the same class are close. Hence, our model can also be applied to
other wires so as to approximate their delays with high accuracy. For simplicity, we
assume m is odd, and hence wire (m+1)/2 is the middle wire. We use $T_m^{iC}$ to denote
the worst delay of the middle wire (wire (m+1)/2) of an m-wire bus for all iC patterns.

[Figure 2.1: A distributed RC model of an m-wire bus. Labels: Wire 1, Wire 2, Wire 3,
..., Wire m; inputs Vi(0, t); outputs Vi(L, t); bus length L.]
We first investigate the case m = 3 and then extend our results to the case m = 5.
There are two reasons for studying the three-wire model. First and foremost, the
three-wire model is the foundation of the derivation of our five-wire model. Second,
our three-wire model shows higher accuracy than our five-wire model for buses with
only three wires, which are used in partial coding schemes (see, e.g., [8, 10, 32]).
2.2.2 Three-wire model
Based on the same technique as in [38], the differential equations characterizing a
three-wire bus with length L are given by:
\[
\frac{\partial^2}{\partial x^2} V(x,t) = RC \frac{\partial}{\partial t} V(x,t), \tag{2.2}
\]
where $V(x,t) = [V_1(x,t)\ V_2(x,t)\ V_3(x,t)]^T$ and $V_i(x,t)$ denotes the voltage of wire
i at distance x (0 ≤ x ≤ L) at time t for i = 1, 2, 3,
\[
R = \begin{bmatrix} r & 0 & 0 \\ 0 & r & 0 \\ 0 & 0 & r \end{bmatrix},
\quad \text{and} \quad
C = \begin{bmatrix}
c+\lambda c & -\lambda c & 0 \\
-\lambda c & c+2\lambda c & -\lambda c \\
0 & -\lambda c & c+\lambda c
\end{bmatrix}.
\]
The three eigenvalues of C are given by $p_1 = c$, $p_2 = (1+\lambda)c$, and $p_3 = (1+3\lambda)c$,
and their respective eigenvectors $e_i$ are $[1\ 1\ 1]^T$, $[1\ 0\ -1]^T$, and $[-1\ 2\ -1]^T$. Hence,
Eq. (2.2) is transformed to
\[
\frac{\partial^2}{\partial x^2} U_i(x,t) = r p_i \frac{\partial}{\partial t} U_i(x,t) \quad \text{for } i = 1, 2, 3, \tag{2.3}
\]
where $U_i(x,t) = V^T(x,t) e_i$ for i = 1, 2, 3. So $U_1(x,t) = V_1(x,t) + V_2(x,t) + V_3(x,t)$,
$U_2(x,t) = V_1(x,t) - V_3(x,t)$, and $U_3(x,t) = 2V_2(x,t) - V_1(x,t) - V_3(x,t)$.
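
The eigenpairs stated above can be checked directly by multiplying the capacitance matrix with each eigenvector. The following is a small sketch in plain Python; the helper names `coupling_matrix`, `matvec`, and `check_eigenpairs` are hypothetical, and the numeric values of c and λ passed in are arbitrary test inputs.

```python
def coupling_matrix(c, lam):
    """Capacitance matrix C of the three-wire bus in Eq. (2.2)."""
    return [
        [c + lam * c, -lam * c, 0.0],
        [-lam * c, c + 2 * lam * c, -lam * c],
        [0.0, -lam * c, c + lam * c],
    ]


def matvec(a, v):
    """Matrix-vector product for small dense lists."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def check_eigenpairs(c, lam, tol=1e-12):
    """Return True if C e_i = p_i e_i holds for the three stated eigenpairs."""
    C = coupling_matrix(c, lam)
    pairs = [
        (c, [1, 1, 1]),
        ((1 + lam) * c, [1, 0, -1]),
        ((1 + 3 * lam) * c, [-1, 2, -1]),
    ]
    for p, e in pairs:
        lhs = matvec(C, e)
        for lhs_j, e_j in zip(lhs, e):
            if abs(lhs_j - p * e_j) > tol * max(1.0, abs(p)):
                return False
    return True
```

Because the three eigenvectors are mutually orthogonal, projecting V onto them decouples Eq. (2.2) into the three scalar diffusion equations of Eq. (2.3).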
Applying the Laplace transform to Eq. (2.3), we have
\[
\frac{\partial^2}{\partial x^2} U_i(x,s) = r p_i [s U_i(x,s) - U_i(x,0)] \quad \text{for } i = 1, 2, 3. \tag{2.4}
\]
Using appropriate initial conditions, we solve Eq. (2.4) for $U_i(x,t)$ and obtain
$V_2(L,t) = \frac{1}{3}[U_1(L,t) + U_3(L,t)]$. By solving $V_2(L,t) = 0.5V_{dd}$, we can approximate
the 50% delay of a three-wire bus for different transition patterns.
The expressions of wire 2 are given by
$V_2(L,t) \doteq 1 - a_1 e^{-t/\tau} - a_2 e^{-t/((1+3\lambda)\tau)}$, where $a_i$,
i = 1, 2, are constant coefficients, and $\tau = \frac{8}{\pi^2}\tau_0$. We use “↑” to denote a transition
from 0 to the supply voltage Vdd (normalized to 1), “-” no transition, and “↓” a
transition from Vdd to 0. We first identify the worst-case patterns in all classes
through simulations, which are shown in Tab. 2.1. The expressions of wire 2 and
the approximate delays of all classes are also shown in Tab. 2.1.
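
As a rough numerical illustration, the closed-form delay entries of Tab. 2.1 can be evaluated for given τ0 and λ. The function name `three_wire_delays` is hypothetical; the parameter values τ0 = 3.75 ps and λ = 10 suggested in the usage note are the ones reported in Section 2.3.

```python
import math


def three_wire_delays(tau0, lam):
    """Evaluate the delay column of Tab. 2.1 (same time unit as tau0)."""
    tau = 8.0 / math.pi**2 * tau0       # fast time constant
    slow = (1 + 3 * lam) * tau          # time constant of the coupled mode
    return {
        "0C": math.log(8 / math.pi) * tau,
        "1C": math.log(16 / math.pi) * tau,
        "2C": math.log(16 / (3 * math.pi)) * slow,
        "3C": math.log(8 / math.pi) * slow,
        "4C": math.log(32 / (3 * math.pi)) * slow,
    }
```

Calling `three_wire_delays(3.75, 10)` gives values that can be set against the model column of Tab. 2.4. Note that the 0C and 1C entries do not involve λ, while the 2C through 4C entries scale with (1 + 3λ).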
iC   Worst-case pattern   a1          a2          Delay
0C   ↑↑↑                  4/π         0           ln(8/π) τ
1C   -↑↑                  8/(3π)      4/(3π)      ln(16/π) τ
2C   -↑-                  4/(3π)      8/(3π)      ln(16/(3π)) (1 + 3λ) τ
3C   -↑↓                  0           4/π         ln(8/π) (1 + 3λ) τ
4C   ↓↑↓                  −4/(3π)     16/(3π)     ln(32/(3π)) (1 + 3λ) τ

Table 2.1: Analytical three-wire model ($V_2(L,t) \doteq 1 - a_1 e^{-t/\tau} - a_2 e^{-t/((1+3\lambda)\tau)}$,
$\tau_0 = rcL^2/2$, and $\tau = \frac{8}{\pi^2}\tau_0$).

iC   Worst-case pattern   Decomposition
0C   ↓↑↑↑↓                (↓-↑-↓) + (-↑- - -) + (- - -↑-)
1C   ↓-↑↑↓                (↓-↑-↓) + (- - -↑-)
2C   ↓-↑-↓                (↓-↑-↓)
3C   ↑-↑↓↑                (↑↑↑↑↑) + 2(- - -↓-) + (-↓- - -)
4C   ↑↓↑↓↑                (↑↑↑↑↑) + 2(-↓- - -) + 2(- - -↓-)

Table 2.2: Decomposition of worst-case patterns in the five-wire model.
2.2.3 Five-wire model
To further improve the accuracy of the delay, we include two extra adjacent wires and
consider the influences of all five wires to approximate the delays. There are three
kinds of transition: ↑, -, and ↓ for each wire. Thus, for such a five-wire bus, there are
$3^5 = 243$ transition patterns. To maintain the simplicity of our models, we still categorize
them into five classes (iC, i ∈ {0, 1, 2, 3, 4}) based on the transition patterns of the
middle three wires (wires 2, 3, and 4). Hence, each transition pattern on the middle
three wires corresponds to nine ($3^2$) five-wire patterns, which differ only in the
transitions on wires 1 and 5.
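
The pattern counts used in this classification can be confirmed by brute-force enumeration. This is only an illustrative sketch; the names `five_wire_patterns` and `group_by_middle_three` are hypothetical.

```python
from collections import Counter
from itertools import product

SYMBOLS = "↑-↓"   # upward transition, no transition, downward transition


def five_wire_patterns():
    """All transition patterns of a five-wire bus as 5-character strings."""
    return ["".join(p) for p in product(SYMBOLS, repeat=5)]


def group_by_middle_three(patterns):
    """Count how many five-wire patterns share each pattern on wires 2, 3, 4."""
    return Counter(p[1:4] for p in patterns)
```

Grouping by `p[1:4]` mirrors the choice of classifying on wires 2, 3, and 4 only, so that the five classes of the three-wire model carry over unchanged.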
Since the bus is a linear system, any pattern can be decomposed into a
combination of patterns with a single transition. The signal on the middle wire then
equals the sum of the contributions of all individual single-wire transitions. However,
this would lead to complicated expressions, which are not easy to solve. We propose
to group these individual transitions into two kinds of special patterns, reducible
transition patterns (RTPs) and single transition patterns (STPs), which are easy to analyze.
An RTP is defined as a transition pattern in the five-wire model which can
be reduced to a transition pattern in the three-wire model. The set of RTPs for the
five-wire model is {↑↑↑↑↑, ↓↓↓↓↓, ↓-↑-↓, ↑-↓-↑}. For the transition ↑↑↑↑↑
(similarly for ↓↓↓↓↓), the expression of wire 3 is approximated by
$V_3(L,t) \doteq 1 - \frac{4}{\pi} e^{-t/\tau}$.
For the transition ↓-↑-↓ (similarly for ↑-↓-↑), it can be converted into a three-wire
pattern ↓↑↓, where the coupling capacitor between wire 1 (or 5) and wire 3 is $\lambda/2$ per
unit length. The expression of wire 3 is approximated by
$V_3(L,t) \doteq 1 + \frac{4}{3\pi} e^{-t/\tau} - \frac{16}{3\pi} e^{-t/((1+\frac{3}{2}\lambda)\tau)}$,
and the delay is approximated by $\ln(\frac{32}{3\pi})(1+\frac{3}{2}\lambda)\tau$, in agreement with the 2C
entry of Tab. 2.3.
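
As a sanity check on the RTP ↓-↑-↓, one can solve V3(L, t) = 0.5 numerically and compare the result with the closed-form 2C delay listed in Tab. 2.3, ln(32/(3π))(1 + 3λ/2)τ. The sketch below uses plain bisection; the names `v3_rtp` and `crossing_time` are hypothetical, and τ0 = 3.75 ps with λ = 10 are the values from Section 2.3.

```python
import math


def v3_rtp(t, tau, lam):
    """Two-term approximation of V3(L, t) for the RTP 'down, -, up, -, down'."""
    slow = (1 + 1.5 * lam) * tau
    return (1
            + 4 / (3 * math.pi) * math.exp(-t / tau)
            - 16 / (3 * math.pi) * math.exp(-t / slow))


def crossing_time(f, lo, hi, level=0.5, iters=200):
    """Bisection for f(t) = level; assumes f(lo) < level < f(hi) and one crossing."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

At the crossing time the fast exponential has essentially vanished, which is why dropping it yields the simple logarithmic expression.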
iC   Worst-case pattern   a1          a2          a3          Delay
0C   ↓↑↑↑↓                0           16/(3π)     −8/(3π)     0.165 (1 + 3λ) τ
1C   ↓-↑↑↓                0           16/(3π)     −4/(3π)     0.310 (1 + 3λ) τ
2C   ↓-↑-↓                −4/(3π)     16/(3π)     0           ln(32/(3π)) (1 + (3/2)λ) τ
3C   ↑-↑↓↑                0           0           4/π         ln(8/π) (1 + 3λ) τ
4C   ↑↓↑↓↑                −4/(3π)     0           16/(3π)     ln(32/(3π)) (1 + 3λ) τ

Table 2.3: Analytical five-wire model
($V_3(L,t) \doteq 1 - a_1 e^{-t/\tau} - a_2 e^{-t/((1+\frac{3}{2}\lambda)\tau)} - a_3 e^{-t/((1+3\lambda)\tau)}$,
$\tau_0 = rcL^2/2$, and $\tau = \frac{8}{\pi^2}\tau_0$).

                          Our three-wire model                       Model in [7]
iC   Pattern   Td (ps)    T_3^{iC}                   (ps)     Error      T2 in (2.1)   (ps)     Error
0C   ↑↑↑       2.87       ln(8/π) τ                  2.84     1.05%      τ0            3.75     30.66%
1C   ↑↑-       4.99       ln(16/π) τ                 4.94     1.00%      (1 + λ)τ0     41.25    726.65%
2C   -↑-       49.70      ln(16/(3π)) (1 + 3λ) τ     49.87    0.34%      (1 + 2λ)τ0    78.75    58.45%
3C   ↓↑-       88.61      ln(8/π) (1 + 3λ) τ         88.08    0.60%      (1 + 3λ)τ0    116.25   31.19%
4C   ↓↑↓       115.89     ln(32/(3π)) (1 + 3λ) τ     115.18   0.61%      (1 + 4λ)τ0    153.75   32.67%

Table 2.4: Comparison of simulated delays (Td), our three-wire model, and the model
in [7] (τ0 = 3.75ps, $\tau = \frac{8}{\pi^2}\tau_0$, and λ = 10). Error denotes |T − Td|/Td for the
respective model.

An STP is defined to be a transition pattern with transitions on only one wire
in the five-wire model. For our five-wire model, we focus on the set of STPs with
transitions on wire 2 or 4, {-↑- - -, -↓- - -, - - -↑-, - - -↓-}.
The expressions of wire 3 can be approximated by our three-wire model. Let
$V_j^i(x,t)$ denote the signal on wire j due to coupling from wire i. For example, by
ignoring coupling from wires 4 and 5 in -↑- - -, the output of wire 3 is approximated
by $V_3^2(L,t) \doteq -\frac{4}{3\pi} e^{-t/\tau} + \frac{4}{3\pi} e^{-t/((1+3\lambda)\tau)}$,
which is obtained by our three-wire model.
We propose the following approach to derive the delay of the five-wire bus.
(1) We first decompose the worst-case pattern in each class into a combination
of an RTP and STP(s).
(2) Then we combine the expressions of the RTP and STP(s) for the middle wire
based on our three-wire model.
(3) Finally, we evaluate the expression of the middle wire to approximate its
delay.

                            Our five-wire model                          Model in [7]
iC   Pattern   Td (ps)    T_5^{iC}                       (ps)     Error      T3 in (2.1)   (ps)     Error
0C   ↓↑↑↑↓     23.30      0.165 (1 + 3λ) τ               19.18    17.68%     τ0            3.75     83.91%
1C   ↓-↑↑↓     37.42      0.310 (1 + 3λ) τ               34.99    6.49%      (1 + λ)τ0     41.25    10.24%
2C   ↓-↑-↓     56.58      ln(32/(3π)) (1 + (3/2)λ) τ     59.46    5.09%      (1 + 2λ)τ0    78.75    39.18%
3C   ↑-↑↓↑     88.55      ln(8/π) (1 + 3λ) τ             88.08    0.53%      (1 + 3λ)τ0    116.25   31.28%
4C   ↑↓↑↓↑     127.29     ln(32/(3π)) (1 + 3λ) τ         115.18   9.51%      (1 + 4λ)τ0    153.75   20.79%

Table 2.5: Comparison of simulated delays (Td), our five-wire model, and the model in [7]
(τ0 = 3.75ps, $\tau = \frac{8}{\pi^2}\tau_0$, and λ = 10). Error denotes |T − Td|/Td for the
respective model.
Since the performance is limited by the worst delay in each class, we only need to
approximate the delays of the worst-case patterns in all classes. For classes 0C-4C,
we use simulations to identify the worst-case patterns, which are given by ↓↑↑↑↓,
↓-↑↑↓, ↓-↑-↓, ↑-↑↓↑, and ↑↓↑↓↑, respectively (assuming the middle wire has an upward
transition). With RTPs and STPs, we decompose the worst-case pattern in each
class as shown in Tab. 2.2.
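
The decompositions of Tab. 2.2 can be verified by treating each pattern as a vector of transition amplitudes (+1, 0, −1) and summing the weighted RTP and STP vectors. This is a minimal sketch that relies only on the linearity argument above; the names `to_vector`, `superpose`, and `DECOMPOSITION` are hypothetical.

```python
ARROWS = {"↑": 1, "-": 0, "↓": -1}


def to_vector(pattern):
    """Map a pattern string to per-wire transition amplitudes (spaces ignored)."""
    return [ARROWS[ch] for ch in pattern if ch != " "]


def superpose(terms):
    """Sum weighted five-wire patterns; terms is a list of (weight, pattern)."""
    total = [0] * 5
    for weight, pattern in terms:
        for j, amplitude in enumerate(to_vector(pattern)):
            total[j] += weight * amplitude
    return total


# Worst-case pattern of each class -> its RTP/STP decomposition from Tab. 2.2.
DECOMPOSITION = {
    "↓↑↑↑↓": [(1, "↓-↑-↓"), (1, "-↑---"), (1, "---↑-")],
    "↓-↑↑↓": [(1, "↓-↑-↓"), (1, "---↑-")],
    "↓-↑-↓": [(1, "↓-↑-↓")],
    "↑-↑↓↑": [(1, "↑↑↑↑↑"), (2, "---↓-"), (1, "-↓---")],
    "↑↓↑↓↑": [(1, "↑↑↑↑↑"), (2, "-↓---"), (2, "---↓-")],
}
```

A weight of 2 on a downward STP reflects that the wire must first cancel the upward transition contributed by ↑↑↑↑↑ and then swing down by one more unit.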
The expressions of wire 3 are given by
$V_3(L,t) \doteq 1 - a_1 e^{-t/\tau} - a_2 e^{-t/((1+\frac{3}{2}\lambda)\tau)} - a_3 e^{-t/((1+3\lambda)\tau)}$,
where $a_i$, i = 1, 2, 3, are constant coefficients. For all worst-case patterns
in a five-wire bus, the expressions of wire 3 and the approximate delays are shown
in Tab. 2.3.
2.3 PERFORMANCE EVALUATION
To evaluate the performance of our models and compare them with the model in [7],
we consider the following three scenarios. First, we focus on three-wire and five-wire
buses, for which our models are originally derived. This scenario can also be applied
to partial coding schemes (see, e.g., [8, 10, 32]), where a wide bus is divided into
sub-buses with a few wires. Then we consider buses with more than five wires and
run extensive simulations with an odd number of wires (up to 33 wires). For brevity,
only simulation results for a 17-wire bus are presented. In the first two scenarios,
we only focus on the worst delays of middle wires. In the third scenario, we assume
the transition patterns are limited to three families of CACs and consider the worst
delays for all wires of a 17-wire bus.
The simulation results are obtained by HSPICE. The coupling factor λ depends
on the layer for routing the interconnect, the layer for the ground, the width for each
wire, and the space between adjacent wires. We adopt a 0.1µm process and route
the global interconnects in the top metal layer. The bulk capacitance is considered
from top metal layer to the substrate, with λ = 10. For the 0.1µm process, the
parasitic parameters are given by [39], and the parameter $\tau_0 = rcL^2/2$ for a 5mm long
bus is approximately 3.75ps. Though this process is somewhat outdated, we have
also tried other process technologies with different values for λ and τ0 such as 45nm
technology [40]. For all process technologies, our delay models can be easily adapted
and show better accuracy than the model in [7].
2.3.1 Three-wire and five-wire buses
For a three-wire bus, we compare the simulated delays with the delays given by
our model and the model in [7] for all classes in Tab. 2.4, where Td denotes the
simulated worst delay of wire 2, $T_3^{iC}$ the approximate delay for iC patterns by our
three-wire model, and T2 the delay by the model in [7]. The error percentages of our model
and the model in [7] for each class are also included in Tab. 2.4. For all five classes
of transition patterns in a three-wire bus, the maximum and minimum errors by
our model are 1.05% and 0.34%, respectively, as opposed to 726.65% and 30.66% by
the model in [7]. As shown in Tab. 2.4, our three-wire model is much
more accurate than the model in [7] for all patterns in a three-wire bus. We remark
that the delay of a 1C pattern by our model, $(\ln\frac{16}{\pi})\tau$, does not depend on λ.

                                                   Our model            Model in [7]
iC   Worst-case pattern             Td        T_5^{iC}   Error      T9        Error
0C   ↑↑↑↑↓↓↓ (↑↑↑) ↓↓↓↑↑↑↑          25.07     19.18      23.49%     3.75      85.04%
1C   ↑↑↑↑↑↓↓ (↑↑-) ↓↓↑↑↑↑↑          39.13     34.99      10.58%     41.25     5.42%
2C   ↓↓↑↑↑↑↓ (-↑-) ↓↑↑↑↑↓↓          65.93     59.46      9.81%      78.75     19.44%
3C   ↓↓↓↑↑↑↑ (↓↑-) ↑↑↑↑↓↓↓          95.39     88.08      7.66%      116.25    21.87%
4C   ↑↓↓↓↑↑↑ (↓↑↓) ↑↑↑↓↓↓↑          130.43    115.18     11.69%     153.75    17.88%

Table 2.6: Comparison of simulated delays and delays given by our five-wire model
and the model in [7] for wire 9 in a 17-wire bus (τ0 = 3.75ps, $\tau = \frac{8}{\pi^2}\tau_0$, and λ = 10).
Worst-case patterns are identified with respect to our assumptions. Error denotes
|T − Td|/Td for the respective model. All delays are in ps.
For a five-wire bus, we compare the simulated delays with the delays given by
our five-wire model and the model in [7] for all classes in Tab. 2.5, where Td denotes
the simulated worst delay of wire 3 for all iC patterns, $T_5^{iC}$ the approximate delay
for iC patterns by our five-wire model, and T3 the delay by the model in [7]. For a
five-wire bus, the maximum and minimum errors by our model are 17.68% and 0.53%,
respectively, in comparison to 83.91% and 10.24% by the model in [7].
As shown in Tab. 2.5, our five-wire model is more accurate than the model in [7]
for all patterns in a five-wire bus. In particular, we observe that the worst delay for
the 0C patterns is much larger than that given by the model in [7]. In [9], the
author proposed a scheme to achieve a delay of τ0 by surrounding each data wire
with two identical wires. According to our model, this scheme is ineffective, because
the worst delay could be as large as 0.165(1 + 3λ)τ0.
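
The error percentages quoted for the five-wire bus follow from the definition |T − Td|/Td applied to the delay values of Tab. 2.5. The sketch below recomputes them; the names `relative_error_percent`, `TABLE_2_5`, and `error_columns` are hypothetical, and the numbers are copied from the table.

```python
def relative_error_percent(model, simulated):
    """|T - Td| / Td expressed in percent."""
    return abs(model - simulated) / simulated * 100.0


# class: (simulated Td, our five-wire model, model in [7]); all delays in ps
TABLE_2_5 = {
    "0C": (23.30, 19.18, 3.75),
    "1C": (37.42, 34.99, 41.25),
    "2C": (56.58, 59.46, 78.75),
    "3C": (88.55, 88.08, 116.25),
    "4C": (127.29, 115.18, 153.75),
}


def error_columns(table):
    """Return {class: (error of our model, error of the model in [7])}."""
    return {
        cls: (relative_error_percent(ours, td), relative_error_percent(ref, td))
        for cls, (td, ours, ref) in table.items()
    }
```

Taking the maximum and minimum over the first component of each pair gives the error range of our model, and the second component does the same for the model in [7].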
2.3.2 17-wire bus
We next compare our five-wire model with the model in [7] for a 17-wire bus. We
still focus on the middle wire (wire 9) in the 17-wire bus and identify the worst-case
patterns in all classes through simulations. The transition patterns are categorized
by the transitions of the middle three wires (wires 8, 9, and 10). Since there are $3^{14}$
transition patterns in each class, it is infeasible to search all transitions to identify
the pattern with the longest delay. For any two wires symmetric to wire 9 (wire
i and wire 18−i, i ∈ {1, 2, · · · , 8}), there are nine possible patterns, ↑↑, ↓↓, - -, ↑-,
-↑, ↓-, -↓, ↑↓, and ↓↑. If the transitions on the two symmetric wires are in opposite
directions, we assume the influences of the two transitions cancel out as a result
of symmetry. For other patterns, we assume ↑↑ has greater influence than ↑- or -↑.
Similarly, ↓↓ has greater influence than ↓- or -↓. Based on the discussion above, we
assume the longest delay happens when two symmetric wires have either ↑↑ or ↓↓
transitions. So there are only $2^7 = 128$ patterns left to check in each class.
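
The reduced search space can be generated explicitly. The sketch below assumes one fixed pattern on wires 8, 9, and 10 and mirrors the transitions of wires 1 through 7 onto wires 17 through 11; the generator name `symmetric_candidates` is hypothetical.

```python
from itertools import product


def symmetric_candidates(middle_three):
    """Yield 17-wire patterns whose outer wires are mirror images about wire 9.

    middle_three is the fixed 3-character pattern on wires 8, 9, and 10.
    Each of wires 1..7 is either up or down, and wire 18-i copies wire i.
    """
    for outer in product("↑↓", repeat=7):
        left = "".join(outer)           # wires 1..7
        right = left[::-1]              # wires 11..17, mirrored
        yield left + middle_three + right
```

For example, `list(symmetric_candidates("↑↑↑"))` enumerates the candidates for one 0C pattern; each candidate would then be simulated to find the longest delay of wire 9.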
To find the worst-case patterns, we search all possible symmetric transition pat-
terns in each class. The worst-case patterns are listed in the second column of
Tab. 2.6, where the pattern on wires 8, 9, and 10 is shown in the parenthesis. We
compare the simulated worst delays with the delays given by our five-wire model
and the model in [7] for all classes in Tab. 2.6, where Td denotes the simulated worst
delay of wire 9 for all iC patterns, T iC5 the approximate delay for iC pattern by our
five-wire model, and T9 by the model in [7]. For all five classes, the maximum and
minimum errors by our model are 23.49% and 7.66%, respectively, as opposed to
85.04% and 5.42% by the model in [7], respectively. For all classes except 1C, our
five-wire model outperforms the model in [7]. The model in [7] also shows a large
error percentage for 0C. We have also tried other buses with an odd number of wires
up to 33. Based on the simulation results, we conjecture that our five-wire model
would be more accurate than the model in [7] for buses with any number of wires.
Delays           OLC      FPC      FOC
Model in [7]     41.25    78.75    116.25
T_5^{iC}         34.99    59.46    88.08
Wire 1           33.96    64.31    63.00
Wire 2           21.59    62.52    95.06
Wire 3           32.37    63.73    94.90
Wire 4           32.40    62.14    96.56
Wire 5           32.16    63.40    94.30
Wire 6           32.49    64.63    96.99
Wire 7           32.55    65.00    93.69
Wire 8           32.50    62.26    93.31
Wire 9           33.18    60.74    94.93
Wire 10          33.27    62.21    95.94
Wire 11          31.92    61.10    94.74
Wire 12          32.07    61.55    94.33
Wire 13          32.02    63.31    96.56
Wire 14          32.97    60.22    97.24
Wire 15          32.83    64.70    92.01
Wire 16          21.29    63.25    95.29
Wire 17          33.61    63.52    63.83

Table 2.7: Comparison of simulated delays and delays given by our five-wire model
and [7] for all wires in a 17-wire bus (τ0 = 3.75ps, $\tau = \frac{8}{\pi^2}\tau_0$, and λ = 10).
All delays are in ps.
2.3.3 Performance of CACs
In the simulation results above, we only focus on the middle wire of a 17-wire
bus. Now we evaluate the delays on all wires of a 17-wire bus for three families of
CACs [8, 10, 32]: one lambda codes (OLCs), forbidden pattern codes (FPCs), and
forbidden overlap codes (FOCs). Based on our five-wire model, the worst delays of
the aforementioned CACs are given by $0.310(1+3\lambda)\tau$,
$(\ln\frac{32}{3\pi})(1+\frac{3}{2}\lambda)\tau$, and $(\ln\frac{8}{\pi})(1+3\lambda)\tau$,
respectively. Based on the model in [7], the worst delays of the aforementioned
CACs are given by (1+λ)τ0, (1+2λ)τ0, and (1+3λ)τ0, respectively. For each CAC,
1000 codewords from the code book are randomly chosen and transmitted over a
17-wire bus consecutively, thus forming 999 transitions. We compare the simulated
worst delays of each wire and the delays given by our five-wire model and the model
in [7], respectively, in Tab. 2.7. For the OLC, the FPC, and the FOC in a 17-wire
bus, the largest delays are emphasized in boldface. As shown in Tab. 2.7, our five-
wire model is more accurate than the model in [7] for all three families of CACs.
Also, in [41], a new CAC with a smaller worst delay than an OLC has been proposed
based on our five-wire model and a new classification of transition patterns. It shows
that our five-wire model enables the design of more effective CACs.
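
To make the comparison in Tab. 2.7 concrete, the sketch below extracts the largest simulated delay per code and checks which model prediction lies closer to it. The data are copied from Tab. 2.7; the names `SIMULATED`, `worst_simulated`, and `closer_model` are hypothetical.

```python
CODES = ("OLC", "FPC", "FOC")

# wire index -> simulated worst delay in ps for (OLC, FPC, FOC), from Tab. 2.7
SIMULATED = {
    1: (33.96, 64.31, 63.00), 2: (21.59, 62.52, 95.06), 3: (32.37, 63.73, 94.90),
    4: (32.40, 62.14, 96.56), 5: (32.16, 63.40, 94.30), 6: (32.49, 64.63, 96.99),
    7: (32.55, 65.00, 93.69), 8: (32.50, 62.26, 93.31), 9: (33.18, 60.74, 94.93),
    10: (33.27, 62.21, 95.94), 11: (31.92, 61.10, 94.74), 12: (32.07, 61.55, 94.33),
    13: (32.02, 63.31, 96.56), 14: (32.97, 60.22, 97.24), 15: (32.83, 64.70, 92.01),
    16: (21.29, 63.25, 95.29), 17: (33.61, 63.52, 63.83),
}
FIVE_WIRE = {"OLC": 34.99, "FPC": 59.46, "FOC": 88.08}    # our five-wire model
REF_7 = {"OLC": 41.25, "FPC": 78.75, "FOC": 116.25}       # model in [7]


def worst_simulated():
    """Largest simulated delay over all 17 wires for each code."""
    return {code: max(row[j] for row in SIMULATED.values())
            for j, code in enumerate(CODES)}


def closer_model():
    """For each code, name the model whose prediction is nearer the worst delay."""
    worst = worst_simulated()
    return {
        code: "five-wire" if abs(FIVE_WIRE[code] - worst[code])
        < abs(REF_7[code] - worst[code]) else "[7]"
        for code in CODES
    }
```

This only compares against the per-code worst case; a per-wire comparison would proceed the same way row by row.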
2.4 Summary
In this chapter, we propose improved analytical delay models for coupled intercon-
nects, based on the distributed RC model. First the closed-form expressions of the
signals on three-wire and five-wire buses are derived, which are also motivated by
partial coding schemes. Then delays corresponding to different patterns are approx-
imated by evaluating the closed-form expressions. The simulation results show that
our models have better accuracy than that in [7]. Although our models are based
on three-wire and five-wire buses, they can also be employed for a bus with more
than five wires. Our simulation results also show that our five-wire model could still
approximate delays better than the model in [7] for a general bus.
Chapter 3
Improved Crosstalk Avoidance
Codes Based On A Novel Pattern
Classification
3.1 Introduction
The recent International Technology Roadmap for Semiconductors (ITRS) [1] has shown
a troubling trend: while gate delay decreases with scaling, global wire delay in-
creases. This is because with the process technologies scaling down into deep sub-
micrometer (DSM), the crosstalk delay becomes dominant in global wire delay due
to the increasing coupling capacitance between adjacent wires. Hence, the crosstalk
delay has become a serious bottleneck of the overall system performance.
The analytical model proposed by Sotiriadis et al. [7, 42], a widely used delay
model, gives upper bounds on the delay of all wires on a bus. According to [7, 42],
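the delay of the k-th wire (k ∈ {1, 2, · · · , m}) of an m-bit bus is given by
\[
T_k =
\begin{cases}
\tau_0[(1+\lambda)\Delta_1^2 - \lambda\Delta_1\Delta_2], & k = 1 \\
\tau_0[(1+2\lambda)\Delta_k^2 - \lambda\Delta_k(\Delta_{k-1}+\Delta_{k+1})], & k \neq 1, m \\
\tau_0[(1+\lambda)\Delta_m^2 - \lambda\Delta_m\Delta_{m-1}], & k = m,
\end{cases}
\tag{3.1}
\]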
where λ is the ratio of the coupling capacitance between adjacent wires and the
ground capacitance, τ0 is the propagation delay of a wire free of crosstalk, and ∆k
is 1 for 0 → 1 transition, -1 for 1 → 0 transition, or 0 for no transition on the k-th
wire. In this model, the delay of the k-th wire depends on the transition patterns
of at most three wires, k − 1, k, and k + 1 only. The transition patterns over
these three wires can be classified based on Eq. (3.1) into five classes, denoted by
Di for i = 0, 1, 2, 3, 4, and the patterns in Di have a worst-case delay (1 + iλ)τ0.
This classification enables one to limit the worst-case delay over a bus by restricting
the patterns transmitted on the bus. That is, by avoiding all transition patterns
in Di for i > i0, one can achieve a worst-case delay of (1 + i0λ)τ0 over the bus.
Based on this principle, crosstalk avoidance codes (CACs) of different worst-case
delays have been proposed (see, for example, [8–10]). For example, forbidden overlap
codes (FOCs), forbidden transition codes (FTCs), forbidden pattern codes (FPCs),
and one lambda codes (OLCs) achieve a worst-case delay of (1 + 3λ)τ0, (1 + 2λ)τ0,
(1+2λ)τ0, and (1+λ)τ0, respectively. Based on Eq. (3.1), a worst-case delay of τ0 can
be achieved by assigning two protection wires to each data wire [9]. Other types of
CACs, such as those with equalization [43] or two-dimensional CACs [44], have been
proposed in the literature. For CACs, since the area and power consumption of their
encoder/decoder (CODECs) are all overheads, the complexities of the CODECs are
important to the effectiveness of CACs. Thus, efficient CODECs have been proposed
for CACs [45–47].
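
The principle of bounding the worst-case delay by restricting codewords can be illustrated on a toy 3-bit bus. The sketch below computes, for every codeword-to-codeword transition, the class index i implied by Eq. (3.1) on each wire and reports the largest one. The helper names `wire_class` and `worst_class` are hypothetical, and the restricted codebook (all 3-bit words except 010 and 101, as forbidden pattern codes are commonly described) is an assumption of this example rather than a construction taken from the text.

```python
from itertools import product


def wire_class(deltas, k):
    """Integer i such that Eq. (3.1) gives T_k = (1 + i*lambda)*tau0.

    Returns None when wire k does not switch.
    """
    d = deltas[k]
    if d == 0:
        return None
    m = len(deltas)
    if k == 0:                           # first wire
        return d * d - d * deltas[1]
    if k == m - 1:                       # last wire
        return d * d - d * deltas[m - 2]
    return 2 * d * d - d * (deltas[k - 1] + deltas[k + 1])


def worst_class(codebook):
    """Largest class index over all wires and all transitions between codewords."""
    worst = 0
    for old, new in product(codebook, repeat=2):
        deltas = [b - a for a, b in zip(old, new)]
        for k in range(len(deltas)):
            i = wire_class(deltas, k)
            if i is not None:
                worst = max(worst, i)
    return worst
```

With `all_words = list(product((0, 1), repeat=3))`, `worst_class(all_words)` reflects the unrestricted bus, while removing `(0, 1, 0)` and `(1, 0, 1)` from the codebook lowers the worst class, at the cost of fewer usable codewords.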
The classification of transition patterns based on the model in [7, 42] has two
drawbacks. First, the model in [7,42] has limited accuracy because of its dependence
on only three wires: the model overestimates the delays of patterns in D1 through
D4, while it underestimates the delays of patterns in D0. For this reason, the
scheme with a worst-case delay of τ0 in [9] is invalid since its actual delay is much
greater. Second, the actual delay ranges in some classes overlap with others. This,
plus the overestimation of delays for D1 through D4, implies that the delays of
existing CACs are not tightly controlled. These drawbacks motivate us to include
more wires and to classify the transition patterns without overlapping delay ranges.
In [48], we have proposed a new analytical five-wire delay model. Two extra
neighboring wires are included in the delay model [48], and the delay of the middle
wire of five neighboring wires is determined by the transition patterns on all five
wires. This five-wire model has better accuracy than the model in [7,42] for Di for
i = 0, 1, 2, 3, 4 [48]. This work confirms that using more wires leads to improved
accuracy.
There are two main contributions in this chapter:
• First, we approximate the crosstalk delay in a five-wire model and propose a
new classification of transition patterns.
• Second, we propose a family of CACs based on our classification.
The work in this chapter is different from previous works, including our previous
works, in several aspects:
• First, although the delay approximation in this chapter is also based on a
five-wire model, it is different from that in our previous work [48]. The delay
approximation in this chapter is carried out by extending the approach in [38]
from a three-wire model to a five-wire one.
• Second, our classification of transition patterns is different from that in [7, 42]
(based on Eq. (3.1)), in two aspects. First, our classification has seven classes
as opposed to five based on Eq. (3.1). Second, while the delays of some classes
overlap for the classification based on Eq. (3.1), all classes in our classification
have non-overlapping delays. These two key differences allow us to have a
more accurate control of delays for transition patterns.
• Our new family of CACs is also different from previously proposed CACs, all
of which are based on the classification in [7, 42] (based on Eq. (3.1)). While
some codes in this new family are shown to be the same as the existing CACs
(OLCs, FPCs, and FOCs), this family also includes new codes that achieve
smaller worst-case delays and higher throughputs than OLCs, which have
the smallest worst-case delays among all existing CACs.
The rest of the chapter is organized as follows. In Section 3.2, we first propose
our classification and compare it with that in [7, 42]. We then present our new
family of CACs in Section 3.3 and compare their performance with existing CACs
in Section 4.5. Some concluding remarks are provided in Section 7.6.
3.2 INTERCONNECT DELAYS AND CLASSIFICATION
3.2.1 Interconnect Modeling
Since the functionality and performance in DSM technology are greatly affected
by the parasitics, distributed RC models are widely employed to analyze on-chip
interconnects. In this chapter, we consider the distributed RC model of five wires
shown in Fig. 3.1, where Vi(x, t) denotes the transient signal at time t and position
x (0 ≤ x ≤ L) over wire i for i ∈ {1, 2, 3, 4, 5}, and r and c denote the resistance and
ground capacitance per unit length, respectively. Also, λc denotes the coupling
capacitance per unit length between two adjacent wires. The value of λ depends on
many factors, such as the metal layer in which we route the bus, the wire width, the
spacing between adjacent wires, and the distance to the ground layer. We consider
a uniformly distributed bus with the same parameters r, c, and λ for all the wires.
Figure 3.1: A distributed RC model for five wires. (Schematic: wires 1 to 5 of length L,
with input signals Vi(0, t) and output signals Vi(L, t).)
3.2.2 Derivation of Closed-form Expressions
When determining the delay of a wire, the model in [7,42] considers only the effects
of either one or two neighboring wires (cf. Eq. (3.1)). To address the drawbacks
of the model in [7, 42] described above, additional neighboring wires need to be
accounted for. In our delay derivation below, whenever possible we consider four
neighboring wires of a wire, two neighboring wires on each side, to determine its
delay. To approximate the delay of a side wire (wires 1, 2, n− 1 or n) of an n-wire
bus, three neighboring wires are considered. This is because the side wires are
affected by fewer neighboring wires. This scheme is similar to the model in [7, 42]
and appears to work well. We focus on the 50% delay, which is defined as the time
required for the unit step response to reach 50% of its final value.
In [38], the crosstalk of two coupled lines was described by partial differential
equations (PDEs), and a technique for decoupling these highly coupled PDEs was
introduced by using eigenvalues and corresponding eigenvectors. In our work, we
extend this approach from a three-wire model to a five-wire one. Specifically, we
first use the technique in [38] to decouple the PDEs that describe the crosstalk of
four coupled wires, then solve these independent PDEs for closed-form expressions,
and finally approximate the delays of each wire.
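The PDEs characterizing five wires with length L are given by:
\[
\frac{\partial^2}{\partial x^2}\mathbf{V}(x,t) = \mathbf{R}\mathbf{C}\,\frac{\partial}{\partial t}\mathbf{V}(x,t), \tag{3.2}
\]
where R = diag{r r r r r}, V(x, t) = [V1(x, t) V2(x, t) V3(x, t) V4(x, t) V5(x, t)]^T, and
\[
\mathbf{C} = c
\begin{bmatrix}
1+\lambda & -\lambda & 0 & 0 & 0 \\
-\lambda & 1+2\lambda & -\lambda & 0 & 0 \\
0 & -\lambda & 1+2\lambda & -\lambda & 0 \\
0 & 0 & -\lambda & 1+2\lambda & -\lambda \\
0 & 0 & 0 & -\lambda & 1+\lambda
\end{bmatrix}.
\]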
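The eigenvalues of C/c are given by $p_1 = 1$, $p_2 = 1 + \frac{5+\sqrt{5}}{2}\lambda$,
$p_3 = 1 + \frac{5-\sqrt{5}}{2}\lambda$, $p_4 = 1 + \frac{3+\sqrt{5}}{2}\lambda$, and
$p_5 = 1 + \frac{3-\sqrt{5}}{2}\lambda$. Their corresponding eigenvectors $e_i$ are given by
\[
\begin{aligned}
e_1 &= [\,1 \;\; 1 \;\; 1 \;\; 1 \;\; 1\,]^T, \\
e_2 &= \left[\,\tfrac{\sqrt{5}-1}{4} \;\; -\tfrac{1+\sqrt{5}}{4} \;\; 1 \;\; -\tfrac{1+\sqrt{5}}{4} \;\; \tfrac{\sqrt{5}-1}{4}\,\right]^T, \\
e_3 &= \left[\,-\tfrac{\sqrt{5}+1}{4} \;\; \tfrac{\sqrt{5}-1}{4} \;\; 1 \;\; \tfrac{\sqrt{5}-1}{4} \;\; -\tfrac{\sqrt{5}+1}{4}\,\right]^T, \\
e_4 &= \left[\,-1 \;\; \tfrac{\sqrt{5}+1}{2} \;\; 0 \;\; -\tfrac{\sqrt{5}+1}{2} \;\; 1\,\right]^T, \\
e_5 &= \left[\,-1 \;\; -\tfrac{\sqrt{5}-1}{2} \;\; 0 \;\; \tfrac{\sqrt{5}-1}{2} \;\; 1\,\right]^T,
\end{aligned}
\]
respectively.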
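As a quick numerical sanity check (a sketch, not part of the original derivation), the
eigenpairs listed above can be substituted back into C/c. The helper names
`coupling_matrix` and `eigen_residual` are hypothetical, and λ = 12.24 is simply the
value used later in this chapter; any positive λ should behave the same way.

```python
import math

def coupling_matrix(n, lam):
    """C/c for an n-wire bus: tridiagonal, 1+lam on the two end wires, 1+2*lam inside."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1 + (lam if i in (0, n - 1) else 2 * lam)
        if i + 1 < n:
            m[i][i + 1] = m[i + 1][i] = -lam
    return m

def eigen_residual(m, p, e):
    """Largest entry of |M e - p e|; (numerically) zero when (p, e) is an eigenpair."""
    return max(abs(sum(row[j] * e[j] for j in range(len(e))) - p * e[i])
               for i, row in enumerate(m))

lam = 12.24  # assumed coupling factor, taken from the evaluation setup of this chapter
s5 = math.sqrt(5)
pairs = [
    (1.0, [1, 1, 1, 1, 1]),
    (1 + (5 + s5) / 2 * lam, [(s5 - 1) / 4, -(1 + s5) / 4, 1, -(1 + s5) / 4, (s5 - 1) / 4]),
    (1 + (5 - s5) / 2 * lam, [-(s5 + 1) / 4, (s5 - 1) / 4, 1, (s5 - 1) / 4, -(s5 + 1) / 4]),
    (1 + (3 + s5) / 2 * lam, [-1, (s5 + 1) / 2, 0, -(s5 + 1) / 2, 1]),
    (1 + (3 - s5) / 2 * lam, [-1, -(s5 - 1) / 2, 0, (s5 - 1) / 2, 1]),
]
M = coupling_matrix(5, lam)
residuals = [eigen_residual(M, p, e) for p, e in pairs]
print(max(residuals))  # should be at floating-point noise level
```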
With a technique for decoupling partial differential equations similar to [38],
Eq. (3.2) is transformed into
\[
\frac{\partial^2}{\partial x^2}U_i(x,t) = r c p_i\,\frac{\partial}{\partial t}U_i(x,t), \quad \text{for } i = 1, 2, 3, 4, 5,
\]
where $U_i(x,t) = \mathbf{V}^T(x,t)\,e_i$ denotes the transformed signals. These decoupled PDEs
are independent of each other. Each Ui(x, t) describes a single wire with a modified
capacitance cpi. The solution Ui(L, t) is given by a series of the form
$U_i(L,t) = V_{dd} + \sum_{k=0}^{\infty} r_k e^{-t/(s_k\tau)}$. As shown in [38], a single-exponent
approximation $V_{dd}(1 + r_0 e^{-t/(s_0\tau)})$ is enough for t/τ > 0.1, where r0 and s0 are the
coefficients of the most significant term.
For different transitions, we solve the decoupled equations above for Ui(x, t) and obtain
$V_3(L,t) = \frac{1}{5}\left[U_1(L,t) + 2U_2(L,t) + 2U_3(L,t)\right]$, which is given by the sum of a
constant and three exponential terms,
$V_{dd}\left(1 - c_0 e^{-t/(a_0\tau)} - c_1 e^{-t/(a_1\tau)} - c_2 e^{-t/(a_2\tau)}\right)$. Then the 50% delay of wire
3 can be evaluated by solving V3(L, t) = 0.5Vdd.
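The last step, solving V3(L, t) = 0.5Vdd, has no closed form once several exponentials
are present, so it is done numerically. The sketch below uses plain bisection; it is an
illustration only (the evaluation in this chapter is done in MATLAB), the function names
are hypothetical, and the coefficient triples are the ones listed for subclasses 1 and 13
in Tab. 3.2 with τ0 = 1.42 ps and λ = 12.24.

```python
import math

TAU0 = 1.42e-12                   # tau_0 in seconds (value quoted in this chapter)
LAM = 12.24                       # coupling factor lambda
TAU = 8 / math.pi ** 2 * TAU0     # tau = (8 / pi^2) * tau_0

def v3_normalized(t, c, a):
    """V3(L, t) / Vdd = 1 - sum_k c_k * exp(-t / (a_k * tau))."""
    return 1 - sum(ck * math.exp(-t / (ak * TAU)) for ck, ak in zip(c, a))

def delay_50(c, a, t_hi=1e-9, iters=200):
    """Bisection for the 50% crossing; assumes a single crossing inside [0, t_hi]."""
    lo, hi = 0.0, t_hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if v3_normalized(mid, c, a) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

s5 = math.sqrt(5)
a = (1, 1 + (5 - s5) / 2 * LAM, 1 + (5 + s5) / 2 * LAM)
d1 = delay_50((4 / math.pi, 0, 0), a)                                         # subclass 1
d13 = delay_50((4 / (5 * math.pi), 8 / (5 * math.pi), 8 / (5 * math.pi)), a)  # subclass 13
print(f"{d1 * 1e12:.2f} ps, {d13 * 1e12:.2f} ps")
```

Under these assumptions the two results should land close to the evaluated delays
reported for the same subclasses in Tab. 3.2.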
For side wires, PDEs characterizing four wires with length L are given by:
\[
\frac{\partial^2}{\partial x^2}\mathbf{V}(x,t) = \mathbf{R}\mathbf{C}\,\frac{\partial}{\partial t}\mathbf{V}(x,t),
\]
where R = diag{r r r r}, V(x, t) = [V1(x, t) V2(x, t) V3(x, t) V4(x, t)]^T, and
\[
\mathbf{C} = c
\begin{bmatrix}
1+\lambda & -\lambda & 0 & 0 \\
-\lambda & 1+2\lambda & -\lambda & 0 \\
0 & -\lambda & 1+2\lambda & -\lambda \\
0 & 0 & -\lambda & 1+\lambda
\end{bmatrix}.
\]

Table 3.1: Subclassification of patterns by signal expressions on wire 3 in a five-wire
bus.

Subclass k   Patterns
1            ↑↑↑↑↑
2            -↑↑↑↑, ↑↑↑↑-
3            ↑-↑↑↑, ↑↑↑-↑
4            -↑↑↑-, ↓↑↑↑↑, ↑↑↑↑↓
5            - -↑↑↑, ↑↑↑- -, -↑↑-↑, ↑-↑↑-
6            ↑-↑-↑, ↑↑↑↓↑, ↑↓↑↑↑
7            -↑↑↑↓, ↓↑↑↑-
8            - -↑↑-, -↑↑- -, ↓-↑↑↑, ↓↑↑-↑, ↑-↑↑↓, ↑↑↑-↓
9            ↓↑↑↑↓
10           - -↑↑↓, ↓↑↑- -, -↑↑-↓, ↓-↑↑-
11           ↓-↑↑↓, ↓↑↑-↓
12           - -↑-↑, ↑-↑- -, -↑↑↓↑, ↑↑↑↓-, -↓↑↑↑, ↑↓↑↑-
13           - -↑- -, ↑-↑-↓, ↓-↑-↑, -↑↑↓-, ↑↑↑↓↓, ↓↑↑↓↑, -↓↑↑-, ↑↓↑↑↓, ↓↓↑↑↑
14           - -↑-↓, ↓-↑- -, -↑↑↓↓, ↓↑↑↓-, -↓↑↑↓, ↓↓↑↑-
15           ↓-↑-↓, ↓↑↑↓↓, ↓↓↑↑↓
16           ↓↓↑-↓, ↓-↑↓↓
17           - -↑↓↓, ↓↓↑- -, -↓↑-↓, ↓-↑↓-
18           - -↑↓-, -↓↑- -, ↑-↑↓↓, ↑↓↑-↓, ↓-↑↓↑, ↓↓↑-↑
19           - -↑↓↑, ↑↓↑- -, -↓↑-↑, ↑-↑↓-
20           ↑-↑↓↑, ↑↓↑-↑
21           ↓↓↑↓↓
22           ↓↓↑↓-, -↓↑↓↓
23           -↓↑↓-, ↑↓↑↓↓, ↓↓↑↓↑
24           ↑↓↑↓-, -↓↑↓↑
25           ↑↓↑↓↑
The eigenvalues of C/c are given by $p_1 = 1$, $p_2 = 1 + (2-\sqrt{2})\lambda$, $p_3 = 1 + 2\lambda$,
and $p_4 = 1 + (2+\sqrt{2})\lambda$. Their corresponding eigenvectors $e_i$ are given by
$e_1 = [\,1 \;\; 1 \;\; 1 \;\; 1\,]^T$, $e_2 = [\,-1 \;\; (1-\sqrt{2}) \;\; -(1-\sqrt{2}) \;\; 1\,]^T$,
$e_3 = [\,1 \;\; -1 \;\; -1 \;\; 1\,]^T$, and $e_4 = [\,-1 \;\; (1+\sqrt{2}) \;\; -(1+\sqrt{2}) \;\; 1\,]^T$, respectively.
By decoupling the PDEs for side wires, we have
\[
\frac{\partial^2}{\partial x^2}U_i(x,t) = r c p_i\,\frac{\partial}{\partial t}U_i(x,t), \quad \text{for } i = 1, 2, 3, 4.
\]
Table 3.2: Closed-form expressions for the output signals on wire 3,
V3(L, t) = Vdd(1 − c0 e^{−t/(a0 τ)} − c1 e^{−t/(a1 τ)} − c2 e^{−t/(a2 τ)}), in a five-wire bus with
evaluated (Eva.) and simulated (Sim.) 50% delays (τ0 = 1.42 ps, τ = (8/π²)τ0, λ = 12.24,
a0 = 1, a1 = 1 + ((5−√5)/2)λ, and a2 = 1 + ((5+√5)/2)λ for all classes).

Ci   Subclass k   c0          c1                 c2                 Eva. (ps)   Sim. (ps)
C0   1            4/π         0                  0                  1.08        1.18
C0   2            16/(5π)     2(1+√5)/(5π)       2(1−√5)/(5π)       1.41        1.50
C0   3            16/(5π)     2(1−√5)/(5π)       2(1+√5)/(5π)       1.41        1.50
C1   4            12/(5π)     4(1+√5)/(5π)       4(1−√5)/(5π)       2.35        2.40
C1   5            12/(5π)     4/(5π)             4/(5π)             2.35        2.40
C1   6            12/(5π)     4(1−√5)/(5π)       4(1+√5)/(5π)       2.35        2.45
C2   7            8/(5π)      6(1+√5)/(5π)       6(1−√5)/(5π)       6.17        6.84
C2   8            8/(5π)      2(3+√5)/(5π)       2(3−√5)/(5π)       9.62        9.21
C2   9            4/(5π)      8(1+√5)/(5π)       8(1−√5)/(5π)       9.90        10.70
C3   10           4/(5π)      4(2+√5)/(5π)       4(2−√5)/(5π)       14.07       14.22
C3   11           0           2(5+3√5)/(5π)      2(5−3√5)/(5π)      16.91       17.18
C3   12           8/(5π)      2(3−√5)/(5π)       2(3+√5)/(5π)       19.24       18.47
C4   13           4/(5π)      8/(5π)             8/(5π)             22.67       22.60
C4   14           0           2(5+√5)/(5π)       2(5−√5)/(5π)       24.58       24.68
C4   15           −4/(5π)     4(3+√5)/(5π)       4(3−√5)/(5π)       25.84       26.03
C5   16           −8/(5π)     2(7+√5)/(5π)       2(7−√5)/(5π)       36.63       36.91
C5   17           −4/(5π)     12/(5π)            12/(5π)            37.24       37.52
C5   18           0           2(5−√5)/(5π)       2(5+√5)/(5π)       38.07       38.35
C5   19           4/(5π)      4(2−√5)/(5π)       4(2+√5)/(5π)       39.22       39.47
C5   20           8/(5π)      6(1−√5)/(5π)       6(1+√5)/(5π)       40.87       41.11
C6   21           −12/(5π)    16/(5π)            16/(5π)            48.43       48.85
C6   22           −8/(5π)     2(7−√5)/(5π)       2(7+√5)/(5π)       50.43       50.86
C6   23           −4/(5π)     4(3−√5)/(5π)       4(3+√5)/(5π)       52.78       53.25
C6   24           0           2(5−3√5)/(5π)      2(5+3√5)/(5π)      55.48       55.97
C6   25           4/(5π)      8(1−√5)/(5π)       8(1+√5)/(5π)       58.52       59.04
The expressions of wires 1 and 2 are given by
\[
V_1(L,t) = \tfrac{1}{4}U_1(L,t) - \tfrac{2+\sqrt{2}}{8}U_2(L,t) + \tfrac{1}{4}U_3(L,t) - \tfrac{2-\sqrt{2}}{8}U_4(L,t)
\]
and
\[
V_2(L,t) = \tfrac{1}{4}U_1(L,t) - \tfrac{\sqrt{2}}{8}U_2(L,t) - \tfrac{1}{4}U_3(L,t) + \tfrac{\sqrt{2}}{8}U_4(L,t),
\]
respectively. Then the 50% delays of wires 1 and 2 can be evaluated by solving
Vi(L, t) = 0.5Vdd for i = 1, 2.
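The weights in front of each Ui(L, t) follow from the orthogonality of the eigenvectors
of the symmetric matrix C: since Ui = V^T ei, each wire voltage is recovered as
Vk = Σi (ei[k] / |ei|²) Ui. The short sketch below (helper name `weights` is
hypothetical) recomputes these weights for the four-wire model from the eigenvectors
listed earlier in this section.

```python
import math

r2 = math.sqrt(2)
# Eigenvectors e1..e4 of C/c for the four-wire model, as listed in this section.
E = [
    [1, 1, 1, 1],
    [-1, 1 - r2, -(1 - r2), 1],
    [1, -1, -1, 1],
    [-1, 1 + r2, -(1 + r2), 1],
]

def weights(wire):
    """Coefficient of each U_i(L, t) in V_wire(L, t): e_i[wire] / (e_i . e_i)."""
    k = wire - 1  # wires are numbered from 1 in the text
    return [e[k] / sum(x * x for x in e) for e in E]

w1 = weights(1)  # compare with 1/4, -(2+sqrt2)/8, 1/4, -(2-sqrt2)/8
w2 = weights(2)  # compare with 1/4, -sqrt2/8, -1/4, sqrt2/8
print(w1, w2)
```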
3.2.3 Pattern Classification
First, we consider the classification of transition patterns over five wires with respect
to the delay of the middle wire (wire 3). In this chapter, we use “↑” to denote a
transition from 0 to the supply voltage Vdd (normalized to 1), “-” no transition, and
“↓” a transition from Vdd to 0. We first focus on patterns with a ↑ transition on wire
3 in a five-wire bus and derive V3(L, t) for each pattern as described in Sec. 3.2.2.
There are 3^4 = 81 different transition patterns, which can be partitioned into 25
subclasses as shown in Tab. 3.1 according to the expressions of the output signals on
wire 3: All transition patterns in each subclass have the same expression V3(L, t).
The coefficients for all 25 subclasses are shown in columns 3-5 of Tab. 3.2. Then
the expressions V3(L, t) of all patterns in the 25 subclasses are evaluated for their
50% delays. By grouping subclasses with close delays into one class, we can divide
the 81 transition patterns into seven classes Ci for i = 0, 1, · · · , 6 shown in Tab. 3.2.
For all 25 subclasses, evaluated and simulated delays are provided in columns 6 and
7 of Tab. 3.2, respectively. For all seven classes, the difference between evaluated
delay and simulated delay in Tab. 3.2 is small.
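The count of 81 patterns and 25 subclasses can be reproduced with a few lines of code.
The sketch below is an illustration under stated assumptions rather than the procedure
used for Tab. 3.1: it encodes ↑, -, ↓ as +1, 0, −1, projects each pattern onto the three
eigenvectors that contribute to wire 3, and groups patterns whose (rounded) projections
coincide. The function name `signature` and the rounding to nine decimals are choices
made here.

```python
import itertools
import math

s5 = math.sqrt(5)
# Eigenvectors e1, e2, e3 of the five-wire model (e4 and e5 are zero on wire 3).
E = [
    [1, 1, 1, 1, 1],
    [(s5 - 1) / 4, -(1 + s5) / 4, 1, -(1 + s5) / 4, (s5 - 1) / 4],
    [-(s5 + 1) / 4, (s5 - 1) / 4, 1, (s5 - 1) / 4, -(s5 + 1) / 4],
]
W = [1 / 5, 2 / 5, 2 / 5]  # weights of U1, U2, U3 in V3(L, t)

def signature(pattern):
    """Rounded step amplitudes of the three modes that make up V3 for one pattern."""
    return tuple(round(w * sum(x * y for x, y in zip(e, pattern)), 9)
                 for w, e in zip(W, E))

subclasses = {}
for v1, v2, v4, v5 in itertools.product((1, 0, -1), repeat=4):
    pattern = (v1, v2, 1, v4, v5)  # wire 3 always makes a rising transition
    subclasses.setdefault(signature(pattern), []).append(pattern)

total = sum(len(p) for p in subclasses.values())
print(total, len(subclasses))  # 81 25
```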
All evaluations and simulations are based on a FreePDK 45 nm CMOS technology
with 10 metal layers [49]. We assume that the top two metal layers, layers 9 and
10, are used for routing global interconnects, and that metal layer 8 is used as the
ground layer. An interconnect model in [50] is used for parasitic extraction. For a
5 mm bus in the top metal layer, the key parasitics, namely resistance, ground capacitance,
and coupling capacitance, are given by R = 68.75 Ω, Cgnd = 41.32 fF, and Ccouple =
505.68 fF, respectively. The bus is modeled by a distributed RC model as shown
in Fig. 3.1 with 100 segments. The two important parameters used in our delay
approximation are τ0 = 0.5RCgnd = 1.42ps and λ = Ccouple/Cgnd = 12.24. Since the
crosstalk delay on the bus constitutes a major part of the whole delay, the delays
introduced by buffers are ignored. We assume that ideal step signals are applied
on the bus directly. The closed-form expressions are evaluated for 50% delays via
MATLAB and the simulation is done by HSPICE.
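For reference, the two parameters follow directly from the extracted parasitics. The
snippet below is only a worked-arithmetic sketch; the variable names are chosen here.

```python
R = 68.75               # total wire resistance in ohms
C_GND = 41.32e-15       # total ground capacitance in farads
C_COUPLE = 505.68e-15   # total coupling capacitance in farads

tau0 = 0.5 * R * C_GND  # tau_0 = 0.5 * R * C_gnd
lam = C_COUPLE / C_GND  # lambda = C_couple / C_gnd
print(f"tau0 = {tau0 * 1e12:.2f} ps, lambda = {lam:.2f}")  # tau0 = 1.42 ps, lambda = 12.24
```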
From Tab. 3.2, it can be easily verified that C5 and C6 are the same as D3 and
D4 in [7,42], respectively. That is, the middle three wires of the transition patterns
in C5 (C6, respectively) constitute D3 (D4, respectively). The transition patterns
in D0, D1, and D2 are divided into five classes C0 to C4 in our classification with the
following relations: C4 ⊂ D2, C3 ⊂ D1 ∪ D2, C2 ⊂ D0 ∪ D1, C1 ⊂ D0 ∪ D1 ∪ D2,
and C0 ⊂ D0 ∪ D1.
Note that the coefficients ci for i = 0, 1, 2 of the expression of wire 3 are independent
of technology and determined only by the transition pattern. For a given pattern, the
coefficients ci are fixed and the delay is a function of τ0 and λ. Since the ratio t/τ0
appears in the exponent term, varying τ0 scales the delays in all classes by the same
factor. Thus, the classification does not depend on τ0. The coupling factor λ could affect
the delay differently. In the following, we verify our classification for technologies with
different coupling factors, λ = 1, 2, · · · , 13, and show the results in Fig. 3.2. Different
classes are denoted by different line styles. Each class contains multiple lines, each of
which represents a subclass. Patterns in each subclass have the same delay. For λ ≥ 3,
the ranges of delays in all classes do not overlap. Also, the delay in each subclass
increases linearly with λ. This implies that our classification is valid provided that
the coupling factor λ is at least 3.
Figure 3.2: Delays of the middle wire for all patterns with respect to λ in a five-wire
bus (τ0 = 1.42 ps). (Plot: delay/τ0 versus λ for classes C0 to C6.)

Figure 3.3: Delays of side wires for all patterns with respect to λ in a four-wire bus
(τ0 = 1.42 ps). (Plot: delay/τ0 versus λ for classes 0C to 4C.)

Then, we consider the classification of transition patterns over four wires with
respect to the delays of the side wires. We classify patterns by considering the worst-
case delays of wires 1 and 2, respectively. Note that the classification with respect
to the delays of wires 4 and 5 would be the same by symmetry. We first focus on
patterns with a ↑ transition on wire 2 in a four-wire bus. There are 3^3 = 27 different
transition patterns. As described in Sec. 3.2.2, we first derive the expressions V2(L, t)
of these 27 patterns shown in Tab. 3.3. By evaluating these patterns for their 50%
delays, we group patterns with close delays into one class, and form five classes
jC for j = 0, 1, 2, 3, 4 as shown in Tab. 3.3. Then, we focus on patterns with a ↑
transition on wire 1. There are 3^3 = 27 different transition patterns. As described
in Sec. 3.2.2, we first derive the expressions V1(L, t) of these 27 patterns shown in
Tab. 3.4. By evaluating these patterns for their 50% delays, we group patterns with
close delays into one class, and form three classes jC for j = 0, 1, 2 as shown in
Tab. 3.4. When both wires 1 and 2 have transitions, the delay on wire 2 is larger
than that of wire 1, which can be verified from Tabs. 3.3 and 3.4. In this case, we
focus on the delay of wire 2. When only wire 1 has a transition, we focus on the delay
of wire 1. The difference between evaluated delay and simulated delay is small as
shown in Tabs. 3.3 and 3.4 with one exception (the pattern ↑↑↓↑ in 1C in Tab. 3.3),
which does not change our classification.
From Tabs. 3.3 and 3.4, the classes 3C and 4C of our classification are exactly the
same as D3 and D4 in [7,42], respectively. Classes 1C and 2C of our classification
are subsets of D1 and D2 in [7,42], respectively. The class 0C is a subset of D0∪D1
in [7, 42].
Similar to the classification of middle wires, we conclude that the classification
on side wires does not depend on τ0. To verify our classification for technologies with
different coupling effects, we consider coupling factors λ = 1, 2, · · · , 13, and show the
results in Fig. 3.3. Each class contains multiple lines, each of which represents a
pattern in Tabs. 3.3 and 3.4. For λ ≥ 1, the ranges of delays in all classes do not
overlap. Also, the delay in each subclass increases linearly with λ. This implies that
our classification on side wires is valid provided that the coupling factor λ is at least
1.

Table 3.3: Closed-form expressions for the output signals on wire 2,
V2(L, t) = Vdd(1 − c0 e^{−t/(a0 τ)} − c1 e^{−t/(a1 τ)} − c2 e^{−t/(a2 τ)} − c3 e^{−t/(a3 τ)}), in a four-wire
bus with evaluated (Eva.) and simulated (Sim.) 50% delays (τ0 = 1.42 ps, τ = (8/π²)τ0,
λ = 12.24, a0 = 1, a1 = 1 + (2−√2)λ, a2 = 1 + 2λ, and a3 = 1 + (2+√2)λ for all classes).

jC   Pattern   c0      c1               c2              c3               Eva. (ps)   Sim. (ps)
0C   ↑↑↑↑      4/π     0                0               0                1.08        1.18
0C   ↑↑↑-      3/π     √2/(2π)          1/π             −√2/(2π)         1.55        1.61
0C   ↑↑-↑      3/π     (2−√2)/(2π)      −1/π            (−2+√2)/(2π)     1.55        1.62
0C   -↑↑↑      3/π     −√2/(2π)         1/π             √2/(2π)          1.55        1.64
1C   ↑↑↑↓      2/π     √2/π             2/π             −√2/π            3.33        3.22
1C   ↑↑- -     2/π     1/π              0               1/π              4.54        3.48
1C   -↑↑-      2/π     0                2/π             0                7.21        5.15
1C   ↑↑-↓      1/π     (2+√2)/(2π)      1/π             (2−√2)/(2π)      9.70        9.38
1C   ↑↑↓↑      2/π     0                (2−√2)/(2π)     −2/π             9.98        3.92
1C   -↑↑↓      1/π     √2/(2π)          3/π             −√2/(2π)         12.89       13.03
2C   ↑↑↓-      1/π     (4−√2)/(2π)      −1/π            (4+√2)/(2π)      17.02       16.05
2C   -↑-↑      2/π     (1−√2)/π         0               (1+√2)/π         19.67       18.79
2C   ↑↑↓↓      0       2/π              0               2/π              20.05       19.85
2C   -↑- -     1/π     (2−√2)/(2π)      1/π             (2+√2)/(2π)      22.59       22.48
2C   -↑-↓      0       1/π              2/π             1/π              24.12       24.22
2C   ↓↑↑↑      2/π     −√2/π            2/π             √2/π             26.02       26.06
2C   ↓↑↑-      1/π     −√2/(2π)         3/π             √2/(2π)          26.89       27.06
2C   ↓↑↑↓      0       0                4/π             0                27.45       27.68
3C   -↑↓↓      −1/π    (4−√2)/(2π)      1/π             (4+√2)/(2π)      37.44       37.74
3C   -↑↓-      0       (2−√2)/π         0               (2+√2)/π         38.61       38.89
3C   ↓↑-↓      −1/π    (2−√2)/(2π)      3/π             (2+√2)/(2π)      39.06       39.40
3C   -↑↓↑      1/π     (4−√2)/(2π)      −1/π            (4+√2)/(2π)      40.12       40.39
3C   ↓↑- -     0       (1−√2)/π         2/π             (1+√2)/π         40.21       40.55
3C   ↓↑-↑      1/π     (2−3√2)/(2π)     1/π             (2+3√2)/(2π)     41.63       41.98
4C   ↓↑↓↓      −2/π    (2−√2)/π         2/π             (2+√2)/π         50.92       51.36
4C   ↓↑↓-      −1/π    (4−3√2)/(2π)     1/π             (4+3√2)/(2π)     52.99       53.44
4C   ↓↑↓↑      0       (2−2√2)/π        0               (2+2√2)/π        55.28       55.79
Table 3.4: Closed-form expressions for the output signals on wire 1,
V1(L, t) = Vdd(1 − c0 e^{−t/(a0 τ)} − c1 e^{−t/(a1 τ)} − c2 e^{−t/(a2 τ)} − c3 e^{−t/(a3 τ)}), in a four-wire
bus with evaluated (Eva.) and simulated (Sim.) 50% delays (τ0 = 1.42 ps, τ = (8/π²)τ0,
λ = 12.24, a0 = 1, a1 = 1 + (2−√2)λ, a2 = 1 + 2λ, and a3 = 1 + (2+√2)λ for all classes).

jC   Pattern   c0      c1               c2      c3               Eva. (ps)   Sim. (ps)
0C   ↑↑↑↑      4/π     0                0       0                1.08        1.18
0C   ↑↑↑-      3/π     (−2+√2)/(2π)     −1/π    (2−√2)/(2π)      1.55        1.59
0C   ↑↑-↑      3/π     √2/(2π)          1/π     −√2/(2π)         1.55        1.61
0C   ↑-↑↑      3/π     −√2/(2π)         1/π     √2/(2π)          1.55        1.64
1C   ↑↑↑↓      2/π     (2+√2)/π         −2/π    (2−√2)/π         2.50        2.70
1C   ↑↑- -     2/π     (1+√2)/π         0       (1−√2)/π         2.83        2.90
1C   ↑↑↓↑      2/π     √2/π             2/π     −√2/π            3.33        3.20
1C   ↑↑-↓      1/π     (4+3√2)/(2π)     −1/π    (4−3√2)/(2π)     4.65        4.99
1C   ↑-↑-      2/π     1/(2π)           0       1/(2π)           4.54        3.49
1C   ↑↑↓-      1/π     (2+3√2)/(2π)     1/π     (2−3√2)/(2π)     5.53        5.88
1C   ↑↑↓↓      0       (2+2√2)/π        0       (2−2√2)/π        7.03        7.39
1C   ↑- -↑     2/π     0                2/π     0                7.21        5.15
1C   ↑-↑↓      1/π     (4+√2)/(2π)      −1/π    (4−√2)/(2π)      7.41        6.89
1C   ↑- - -    1/π     (2+√2)/(2π)      1/π     (2−√2)/(2π)      9.70        9.35
1C   ↑- -↓     0       (2+√2)/π         0       (2−√2)/π         10.68       10.54
1C   ↑-↓↑      1/π     √2/(2π)          3/π     −√2/(2π)         12.89       13.03
1C   ↑-↓-      0       (2+2√2)/(2π)     2/π     (2−2√2)/(2π)     13.03       13.14
1C   ↑-↓↓      −1/π    (4+3√2)/(2π)     1/π     (4−3√2)/(2π)     13.11       13.21
2C   ↑↓↑↓      0       2/π              0       2/π              20.05       19.85
2C   ↑↓-↓      −1/π    (4+√2)/(2π)      1/π     (4−√2)/(2π)      21.86       21.91
2C   ↑↓↑-      1/π     (2−√2)/(2π)      1/π     (2+√2)/(2π)      22.59       22.48
2C   ↑↓↓↓      −2/π    (2+√2)/π         2/π     (2−√2)/π         23.10       23.23
2C   ↑↓- -     0       1/π              2/π     1/π              24.12       24.22
2C   ↑↓↓-      −1/π    (2+√2)/(2π)      3/π     (2−√2)/(2π)      25.10       25.30
2C   ↑↓↑↑      2/π     −√2/π            2/π     √2/π             26.02       26.06
2C   ↑↓-↑      1/π     −√2/(2π)         3/π     √2/(2π)          26.89       27.06
2C   ↑↓↓↑      0       0                4/π     0                27.45       27.68
In addition to being finer, the new classification has no overlapping delays among
different classes. Fig. 3.4 compares the simulated delays of
different classes based on the classification in [7, 42] and our new classification. In
Fig. 3.4, the grey bars identify the minimum and maximum simulated delays in
every class. Note that only the two extremes are important; not every delay value within
a grey bar is achieved by some transition pattern. In Fig. 3.4(a), the thick
line segments denote the upper bounds for the delay of each class based on Eq. (3.1).
The upper bounds by the model in [7, 42] overestimate the delays of D1 through
D4 and underestimate the delay of D0. As shown in Fig. 3.4(a), the actual delays
in D0, D1, and D2 overlap with each other. Some patterns with smaller delays
have the potential to transmit information at a higher speed, but are categorized into a
class with a larger delay bound. Thus, the classification by the model in [7,42] does
not result in effective crosstalk avoidance codes. In contrast, the delays of different
classes in our new classification do not overlap, as shown in Fig. 3.4(b), 3.4(c), and
3.4(d). By classifying patterns this way, we have a more accurate control of delays for
transition patterns.
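The non-overlap property for the middle wire can be checked mechanically from the
simulated delays listed in Tab. 3.2. The following sketch copies that column by class
and tests whether any two class ranges intersect; the function name `ranges_overlap`
is introduced here for illustration.

```python
# Simulated 50% delays (ps) of the subclasses in each class, copied from Tab. 3.2.
simulated = {
    "C0": [1.18, 1.50, 1.50],
    "C1": [2.40, 2.40, 2.45],
    "C2": [6.84, 9.21, 10.70],
    "C3": [14.22, 17.18, 18.47],
    "C4": [22.60, 24.68, 26.03],
    "C5": [36.91, 37.52, 38.35, 39.47, 41.11],
    "C6": [48.85, 50.86, 53.25, 55.97, 59.04],
}

def ranges_overlap(classes):
    """True if any two classes have intersecting [min, max] delay ranges."""
    spans = sorted((min(d), max(d)) for d in classes.values())
    return any(prev_hi >= lo for (_, prev_hi), (lo, _) in zip(spans, spans[1:]))

print(ranges_overlap(simulated))  # False
```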
3.3 NEW MEMORYLESS CROSSTALK AVOIDANCE CODES
3.3.1 Previous CAC Design
CACs reduce the crosstalk delay for on-chip global interconnects by encoding a k-
bit data word (x1x2 · · ·xk) into an n-bit (n > k) codeword (c1c2 · · · cn). Two kinds
of CACs, CACs with memory and memoryless CACs, have been investigated in
the literature [51]. CACs with memory need to store all codebooks corresponding
to different codewords (c1c2 · · · cn), since the encoding depends on the data word
(x1x2 · · ·xk) as well as the preceding codeword. In contrast, memoryless CACs
require a single codebook to generate codewords for transmission, because the
encoding depends on the data word only. Hence, memoryless CACs are simpler to
implement than CACs with memory. We focus on memoryless CACs in this chapter.

Figure 3.4: Simulated delays of different classes of transition patterns using (a)
Classification based on (3.1); (b) Classification with respect to the delay of the
middle wire in a five-wire bus; (c) Classification with respect to the delay of wire 2
in a four-wire bus; (d) Classification with respect to the delay of wire 1 in a four-wire
bus (λ = 12.24 and τ0 = 1.42ps). (Bar charts: one bar per class, horizontal axis
Delay (ps) from 0 to 70.)
The codebook of a memoryless CAC satisfies the property that each codeword
must be able to transition to every other codeword in the codebook with a delay less
than the requirement. Most memoryless CACs in the literature are based on the
model in [7,42]. The key idea is to eliminate undesirable patterns for transmission.
Existing memoryless CACs include OLCs, FPCs, FTCs, and FOCs [8–10,32], which
achieve a worst-case delay of (1 + λ)τ0, (1 + 2λ)τ0, (1 + 2λ)τ0, and (1 + 3λ)τ0,
respectively. As mentioned above, the scheme that was proposed to achieve a worst-
case delay of τ0 is invalid since the model in [7,42] underestimates the delays for D0.
Thus, OLCs achieve the smallest worst-case delay (1 + λ)τ0 among existing CACs.
There exist several methods to obtain a memoryless codebook based on pattern
pruning, transition pruning, or recursive construction. The pattern pruning
technique is quite straightforward, and gives a codebook with a smaller worst-case delay
by eliminating some patterns. For example, FOCs cannot have both 010 and 101
patterns around any bit position, and FPCs are free of 010 and 101 patterns [32].
The transition pruning technique [10] is based on graph theory. This method first
builds a transition graph with all possible codewords as nodes and all valid
transitions as edges, and then finds a maximum clique. A clique is defined as a subgraph
where every pair of nodes is connected with an edge. A maximum clique is defined
as a clique of the largest possible size in a given graph. Since every pair of nodes is
connected, a maximum clique in this graph constitutes a memoryless codebook with
the largest size. The codebook generation method is based on exhaustive search.
Although it is easy to get a maximum clique from a transition graph with a small
n, the complexity increases rapidly with n. This is because the number of edges in
an n-bit transition graph is upper bounded by 2^{n−1}(2^n − 1), which increases
exponentially with n. In fact, finding a maximum clique for given
constraints is an NP-hard problem [52]. The recursive technique constructs an (n+1)-bit
codebook from an n-bit codebook [8,9]. Since for a small n, a largest codebook can be
obtained easily via the transition pruning technique, a codebook for an n-wire bus can
be constructed recursively.
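The transition pruning idea can be sketched in a few lines for very small n. The code
below is an illustration only and is not the search used in this chapter: as a stand-in
for a delay constraint it uses the forbidden-transition rule (no two adjacent wires may
switch in opposite directions), and it finds a maximum clique by brute force, which is
feasible only for a handful of wires. The names `opposite_adjacent` and `max_codebook`
are hypothetical.

```python
from itertools import combinations

def opposite_adjacent(u, v, n):
    """True if switching from codeword u to v makes two adjacent wires move in opposite directions."""
    for i in range(n - 1):
        a0, a1 = (u >> i) & 1, (v >> i) & 1              # wire i before and after
        b0, b1 = (u >> (i + 1)) & 1, (v >> (i + 1)) & 1  # its neighbour before and after
        if (a1 - a0) * (b1 - b0) == -1:
            return True
    return False

def max_codebook(n, forbidden=opposite_adjacent):
    """Exhaustive search for a largest set of n-bit codewords without forbidden pairwise transitions."""
    words = range(2 ** n)
    # Edges of the transition graph: unordered pairs of codewords with a valid transition.
    ok = {(u, v) for u, v in combinations(words, 2) if not forbidden(u, v, n)}
    for size in range(2 ** n, 0, -1):
        for cand in combinations(words, size):
            if all(pair in ok for pair in combinations(cand, 2)):
                return list(cand)
    return []

print(len(max_codebook(2)), len(max_codebook(3)))  # 3 5
```

The loop over all subsets grows exponentially with 2^n, which is the practical reason
for switching to a recursive construction for larger buses.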
3.3.2 CAC Design with New Classification
Since our classification of patterns is different from that in [7,42], the CAC designs
should be reconsidered with our new classification. In the following, we first intro-
duce a recursive method for codebook construction under different constraints, and
then derive the size of codebooks.
In our work, we use the recursive method to obtain a memoryless codebook
for the following two reasons. First, it is complex to apply the pattern pruning
technique, since our new classification is based on transitions over five wires, and
it is not clear which patterns have larger worst-case delays and should be removed.
Second, it is hard to find a maximum clique for a transition graph with a large n. In
our method, we first start with a 5-bit codebook, obtained by searching for maximum
cliques in a five-wire bus, and then build an (n + 1)-bit codebook by appending ’0’
and ’1’ to codewords of an n-bit codebook while satisfying delay constraints.
Our new classifications partition patterns over five adjacent wires into seven
classes, C0 to C6, and patterns over four adjacent wires into five classes, 0C to 4C.
Similar to the CAC design based on the model in [7,42], the new classifications are
conducive to the design of CACs by eliminating undesirable transition patterns with
large worst-case delays.
To get valid 5-bit codebooks, we first assume the allowed patterns are from C0
to Ci for i = 0, 1, · · · , 6 in our classification for middle wires. Then, for the side
wires, we assume patterns are from 0C to jC based on the classification for side
wires. Under these two assumptions, there are many configurations of constraints,
which are referred to as (Ci, jC), where i ∈ {0, 1, · · · , 6} and j ∈ {0, 1, · · · , 4}.
Since the worst-case delay of a bus is determined by the largest delays among
all wires, for an n-bit (n ≥ 5) bus under (Ci, jC) we require that the worst-case
delays on middle wires and side wires are close enough. By our classifications, we
find 0C is close to C0, 1C close to C2 and C3, 2C close to C4, 3C close to C5, and
4C close to C6. Hence, among all configurations of constraints (Ci, jC), we only
focus on (C0, 0C), (C2, 1C), (C3, 1C), (C4, 2C), (C5, 3C), and (C6, 4C). When
n ≤ 4, the constraint Ci cannot be enforced. Hence, the constraint (Ci, jC) reduces
to jC. The constraint (C0, 0C) appears to be too restrictive, and hence we do not
investigate it in this chapter. The last configuration (C6, 4C) is trivial, since it
allows arbitrary transitions.
Algorithm 1 Codebook design under (Ci, jC)
Input: C_5^0, C_5^1, n;
Initialize: k = 5, C(5) = C_5^0, s = 1;
while k ≤ n − 1 do
  for all c_k = (c1c2 · · · ck) ∈ C(k) do
    if (c_{k−3}c_{k−2}c_{k−1}c_k 0) ∈ C_5^s then
      append 0 to c_k and add the new codeword to C(k + 1);
    end if
    if (c_{k−3}c_{k−2}c_{k−1}c_k 1) ∈ C_5^s then
      append 1 to c_k and add the new codeword to C(k + 1);
    end if
  end for
  s = 1 − s;
  k = k + 1;
end while
Output: C(n).
In the following, we propose a scheme for finding an n-bit codebook $C_{(Ci,jC)}(n)$.
For simplicity, we denote $C_{(Ci,jC)}(n)$ as C(n) when there is no ambiguity about the
constraint. First, for a five-wire bus under constraint (Ci, jC), a pattern transition
graph is obtained. We search the graph for the largest 5-bit codebooks. One or two
5-bit codebooks of maximum size exist for each constraint in Tab. 3.5, where we
denote an n-bit binary codeword (c1c2 · · · cn) as the decimal number $\sum_{i=1}^{n} c_i 2^{n-i}$ for
simplicity. In [10], a bit boundary in a set of codewords is said to be 01-type if only
codewords with 00, 01, and 11 are allowed across that boundary, and a bit boundary
is said to be 10-type when only codewords with 00, 10, and 11 are allowed across that
boundary. It is shown that the largest clique for a given constraint has alternating
boundary types. Thus, there are two largest cliques. Similarly, from Tab. 3.5,
we conjecture that the largest codebooks have alternating constraints, $C_5^0$ and $C_5^1$,
for every five consecutive wires. For constraint (C4, 2C), only one maximum 5-bit
codebook exists. We assume $C_5^1$ is the same as $C_5^0$ for constraint (C4, 2C). Since
we have two types of constraints, two largest codebooks for each constraint can
be obtained, except for (C4, 2C), where the two codebooks are the same. Then we
apply Alg. 1 to obtain C(n). In the initialization, we pick a 5-bit codebook $C(5) = C_5^0$.
Then, the algorithm recursively appends one bit to the codewords in the codebook
in each iteration. For $c_k = (c_1c_2\cdots c_k)$, the appended bit x needs to satisfy that
the last five bits $(c_{k-3}c_{k-2}c_{k-1}c_kx)$ form a codeword in $C_5^s$, which alternates between
$C_5^0$ and $C_5^1$. If we pick the other 5-bit codebook $C(5) = C_5^1$, we would obtain another
codebook.
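A compact way to experiment with the recursive construction is sketched below. It is
an interpretation of Alg. 1 under two assumptions made here: codewords are stored as
integers with c1 as the most significant bit (matching the decimal notation of Tab. 3.5),
and both 0 and 1 are tried for every codeword. The function name `expand_codebook` is
hypothetical; the two 5-bit codebooks are those listed for (C3, 1C) in Tab. 3.5.

```python
def expand_codebook(c5_0, c5_1, n):
    """Grow a 5-bit codebook to n bits; the last five bits must alternate between the two 5-bit codebooks."""
    books = (set(c5_0), set(c5_1))
    codebook, s = set(c5_0), 1        # start from C_5^0; the next window must lie in C_5^1
    for _ in range(5, n):
        nxt = set()
        for w in codebook:
            for bit in (0, 1):
                cand = (w << 1) | bit             # append the new bit on the right
                if (cand & 0b11111) in books[s]:  # check the last five bits
                    nxt.add(cand)
        codebook, s = nxt, 1 - s
    return codebook

# 5-bit codebooks for constraint (C3, 1C), copied from Tab. 3.5.
C5_0 = {0, 3, 14, 15, 24, 30, 31}
C5_1 = {0, 1, 7, 16, 17, 28, 31}
sizes = [len(expand_codebook(C5_0, C5_1, n)) for n in (5, 6, 7)]
print(sizes)  # [7, 9, 12]
```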
Table 3.5: Largest 5-bit codebook(s) under constraint (Ci, jC).

Constraint (C5, 3C):
  C_5^0 = {0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 24, 25, 26, 27, 28, 30, 31}
  C_5^1 = {0, 1, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 28, 29, 30, 31}
Constraint (C4, 2C):
  C_5^0 = {0, 1, 3, 6, 7, 12, 14, 15, 16, 17, 19, 24, 25, 28, 30, 31}
Constraint (C3, 1C):
  C_5^0 = {0, 3, 14, 15, 24, 30, 31}
  C_5^1 = {0, 1, 7, 16, 17, 28, 31}
Constraint (C2, 1C):
  C_5^0 = {0, 3, 15, 24, 30, 31}
  C_5^1 = {0, 1, 7, 16, 28, 31}

The recursive construction allows us to derive the size of the codebooks. Let
$V_{(Ci,jC)}$ be an all-one m-dimensional row vector ($m = |C_5^0|$) under constraint (Ci, jC).
Let $c_k^s$ be a k-bit codeword with last five consecutive bits $(c_{k-4}c_{k-3}c_{k-2}c_{k-1}c_k) \in C_5^s$
for s = 0 or 1. If a 0 or 1 can be appended to $c_k^s$ to form a (k + 1)-bit codeword
whose last five bits $(c_{k-3}c_{k-2}c_{k-1}c_kc_{k+1}) \in C_5^{1-s}$, such an expansion is called a valid
expansion. Otherwise, it is called an invalid expansion. An expansion matrix is
denoted as an m×m matrix $D^s_{(Ci,jC)}$, where $D^s_{(Ci,jC)}(i, j) = 0$ denotes an invalid
expansion and $D^s_{(Ci,jC)}(i, j) = 1$ a valid expansion from the i-th codeword in $C_5^s$ to the
j-th codeword in $C_5^{1-s}$ under constraint (Ci, jC). Each row of $D^s_{(Ci,jC)}$ has at most
two ones, since each k-bit codeword can be appended to form at most two (k + 1)-
bit codewords whose last five bits satisfy the appropriate constraints. Let Y be an
m×m anti-diagonal matrix with all ones. Due to symmetry between $C_5^0$ and $C_5^1$, $D^0$
and $D^1$ satisfy $D^1_{(Ci,jC)} = Y D^0_{(Ci,jC)} Y$. Define $D_{(Ci,jC)} = D^0_{(Ci,jC)} Y = Y D^1_{(Ci,jC)}$.
We denote $V_{(Ci,jC)}$ and $D_{(Ci,jC)}$ as V and D, respectively, when there is no
ambiguity about the constraint. For example, the expansion matrices corresponding to
constraints (C3, 1C), (C4, 2C), and (C5, 3C) are given by
D(C3,1C) =
0 0 0 0 0 1 1
0 0 0 0 1 0 0
0 1 0 0 0 0 0
1 0 0 0 0 0 0
0 0 1 1 0 0 0
0 1 0 0 0 0 0
1 0 0 0 0 0 0

D(C4,2C) =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

D(C5,3C) =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


Then, for n ≥ 5, the number of codewords on an n-bit bus is obtained by counting the
valid transitions and is given by
\[
|C(n)| = V D^0 D^1 \cdots V^T =
\begin{cases}
V (D^0 Y Y D^1)^{\frac{n-5}{2}} V^T & \text{if $n$ is odd,} \\
V (D^0 Y Y D^1)^{\frac{n-6}{2}} D^0 Y Y V^T & \text{if $n$ is even}
\end{cases}
= V D^{n-5} Y V^T. \tag{3.3}
\]
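The collapse of the alternating product in Eq. (3.3) can be traced in a few short steps. The sketch below uses only the definition D = D^0 Y = Y D^1 and two observations that are assumptions of this note rather than statements from the text: Y^2 = I for an anti-diagonal all-ones Y, and Y V^T = V^T when V is an all-ones vector (as it is in the size formulas of this section). It is a reading aid, not the omitted proof.
\[
\begin{aligned}
D^0 Y \, Y D^1 &= D \cdot D = D^2, \\
V (D^0 Y Y D^1)^{\frac{n-5}{2}} V^T &= V D^{n-5} V^T = V D^{n-5} Y V^T
  && (n \text{ odd, using } Y V^T = V^T), \\
V (D^0 Y Y D^1)^{\frac{n-6}{2}} D^0 Y \, Y V^T &= V D^{n-6} D \, Y V^T = V D^{n-5} Y V^T
  && (n \text{ even}).
\end{aligned}
\]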
In the following, we first focus on constraints (C3, 1C), (C4, 2C), and (C5, 3C).
The codes based on these constraints are shown to have the same codebooks as
OLCs, FPCs, and FOCs, respectively. Then, we consider constraint (C2, 1C), which
leads to codes with a smaller delay at the expense of a lower code rate. Several
lemmas and theorems about the aforementioned codebooks and their sizes are
stated below. All the proofs are straightforward and hence omitted for
conciseness. See the extended manuscript [53] of this work for more details.
3.3.3 Codes Under (C3, 1C)
The one Lambda codes have a worst-case delay of (1 + λ)τ. According to [32], the
worst-case delay (1 + λ)τ is achieved if and only if the transitions ↑↓×,
-↑-, and ↑-↑, plus their symmetric and complement versions (e.g., ↑↓× and ×↓↑
are symmetric, and -↓- is the complement of -↑-), are avoided, where ↑, ↓, ×, and -
denote 0→1, 1→0, don't care, and no transition, respectively. The first constraint,
avoiding ↑↓×, ensures that a transition between any two codewords does not cause
opposite transitions on any two adjacent wires. This condition is referred to as the
forbidden-transition (FT) condition. The second constraint, avoiding -↑-, ensures that
2C patterns are removed. It ensures that two adjacent bit boundaries cannot both be
of 01-type or 10-type, and is referred to as the forbidden adjacent boundary pattern
(FABP) condition [32]. The last two forbidden patterns give the constraint that no
patterns 010 and 101 appear in the codeword, which is referred to as the
forbidden-pattern (FP) condition [32]. Codes satisfying these necessary and sufficient
conditions are called one Lambda codes (OLCs). We denote the largest OLC codebook
size for an n-bit bus by G_n, which is given by
\[
G_n = G_{n-1} + G_{n-5} \tag{3.4}
\]
with initial conditions G_1 = 2, G_2 = 3, G_3 = 4, G_4 = 5, and G_5 = 7 [54].
With our classification, we explore codes under constraint (C3, 1C). From
Tab. 3.5, the two largest 5-bit codebooks are given by C_5^0 = {0, 3, 14, 15, 24, 30,
31} and C_5^1 = {0, 1, 7, 16, 17, 28, 31}. An n-bit codebook C(n) can be obtained via
Alg. 1. The number of codewords is given by
\[
|C(n)| = V D_{(C3,1C)}^{n-5} V^T \quad \text{for } n \ge 5, \tag{3.5}
\]
where V is a seven-dimensional all-ones vector and D_{(C3,1C)} is a 7 × 7 expansion
matrix. We further establish that the largest codebook sizes under constraint (C3, 1C)
satisfy the following recursion:
Lemma 3.3.1. For n ≥ 8, |C_{(C3,1C)}(n)| is given by the recursion |C_{(C3,1C)}(n)| =
|C_{(C3,1C)}(n − 2)| + |C_{(C3,1C)}(n − 3)|, with initial conditions |C_{(C3,1C)}(n)| = 7, 9, 12
for n = 5, 6, 7, respectively.
In fact, we can further relate these codes to OLCs by the following:
Theorem 3.3.1. The codes under (C3, 1C) have the same codebooks as OLCs.
Hence, G_n = |C_{(C3,1C)}(n)|.
Theorem 3.3.1 implies that the codes under constraint (C3, 1C) are equivalent
to the class of OLCs.
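As a numerical cross-check of Eq. (3.5), Lemma 3.3.1, and Eq. (3.4), the following sketch evaluates V D^{n−5} V^T with the 7 × 7 matrix transcribed from the text and compares it with both recursions. The function names are illustrative only, and the check covers a finite range of n; it does not replace the omitted proofs.

```python
# Expansion matrix D_(C3,1C), transcribed row by row from the text above.
D_C3_1C = [
    [0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
]


def codebook_size(D, n):
    """Evaluate V D^(n-5) V^T for n >= 5, with V the all-ones vector."""
    x = [1] * len(D)                      # V^T
    for _ in range(n - 5):                # left-multiply by D, n - 5 times
        x = [sum(d * xi for d, xi in zip(row, x)) for row in D]
    return sum(x)                         # V (D^(n-5) V^T)


def olc_size(n):
    """G_n from Eq. (3.4): G_n = G_{n-1} + G_{n-5}."""
    G = {1: 2, 2: 3, 3: 4, 4: 5, 5: 7}
    for k in range(6, n + 1):
        G[k] = G[k - 1] + G[k - 5]
    return G[n]


def lemma_size(n):
    """Recursion of Lemma 3.3.1 with sizes 7, 9, 12 for n = 5, 6, 7."""
    a = {5: 7, 6: 9, 7: 12}
    for k in range(8, n + 1):
        a[k] = a[k - 2] + a[k - 3]
    return a[n]


sizes = [codebook_size(D_C3_1C, n) for n in range(5, 13)]
```

For n = 5, ..., 12 the three computations agree, which is consistent with Theorem 3.3.1 over that range.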
3.3.4 Codes Under (C4, 2C)
The (1+2λ) codes have a worst-case delay of (1 + 2λ)τ. No necessary and sufficient
condition is known for a code to be a (1 + 2λ) code. Two sufficient conditions, FT
and FP, are known, which lead to two families of (1 + 2λ) codes, FTCs and FPCs,
respectively. The size of an FTC codebook for an n-wire bus is given by F_{n+2},
where F_n is the Fibonacci sequence that satisfies F_{n+2} = F_{n+1} + F_n with initial
conditions F_1 = F_2 = 1 [10]. The FPCs for an n-wire bus have a larger codebook
size of 2F_{n+1} [8].
With our classification, we explore codes under constraint (C4, 2C). From
Tab. 3.5, only one largest 5-bit codebook is found: C_5^0 = {0, 1, 3, 6, 7, 12, 14, 15,
16, 17, 19, 24, 25, 28, 30, 31}. An n-bit codebook C(n) can be obtained via Alg. 1
by setting C_5^1 = C_5^0. The number of codewords is given by
\[
|C(n)| = V D_{(C4,2C)}^{n-5} V^T \quad \text{for } n \ge 5, \tag{3.6}
\]
where V is a 16-dimensional all-ones vector and D_{(C4,2C)} is a 16 × 16 expansion
matrix. We further establish that the largest codebook sizes under constraint (C4, 2C)
satisfy the following recursion:
Lemma 3.3.2. For n ≥ 9, |C_{(C4,2C)}(n)| satisfies the recursion |C_{(C4,2C)}(n)| =
2|C_{(C4,2C)}(n − 1)| − |C_{(C4,2C)}(n − 2)| + |C_{(C4,2C)}(n − 4)|, with initial conditions
|C_{(C4,2C)}(n)| = 16, 26, 42, 68 for n = 5, 6, 7, 8, respectively.
Again, we can relate these codes to existing CACs by the following:
Theorem 3.3.2. The codes under (C4, 2C) have the same codebooks as FPCs.
Hence, 2F_{n+1} = |C_{(C4,2C)}(n)|.
Since FPCs and our codes under (C4, 2C) are obtained by excluding the D3 and
D4 patterns and the C5 and C6 patterns, respectively, Theorem 3.3.2 is not surprising
given that C5 and C6 are the same as D3 and D4, respectively. Theorem 3.3.2
implies that results in the literature regarding FPCs are also applicable to codes
under constraint (C4, 2C).
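The relation in Theorem 3.3.2 can be illustrated by brute force for small n. The sketch below enumerates all n-bit words that contain neither 010 nor 101 (the FP condition as described in Sec. 3.3.3) and compares the count with 2F_{n+1}. The helper names are hypothetical, and enumerating by substring search is a convenience of this sketch, not the construction of Alg. 1.

```python
def fp_codebook(n):
    """All n-bit words (as integers) that contain neither 010 nor 101."""
    words = []
    for w in range(2 ** n):
        bits = format(w, "0{}b".format(n))
        if "010" not in bits and "101" not in bits:
            words.append(w)
    return words


def fibonacci(k):
    """F_k with F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 2):
        a, b = b, a + b
    return b if k >= 2 else a


fp_sizes = {n: len(fp_codebook(n)) for n in range(5, 10)}
```

For n = 5 the enumeration returns exactly the 16 codewords of C_5^0 listed above, and the sizes for n = 5, ..., 8 match the initial conditions of Lemma 3.3.2.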
3.3.5 Codes Under (C5, 3C)
The (1+3λ) codes have a worst-case delay of (1 + 3λ)τ, which is achieved if and
only if ↓↑↓ and ↑↓↑ are avoided. Hence, the necessary and sufficient condition for the
(1+3λ) codes is that the codebook cannot have both 010 and 101 appearing centered
around any bit position, which is referred to as the forbidden-overlap (FO) condition.
Codes satisfying the FO condition are called FOCs. It is shown that the largest
FOC codebook size for an n-bit bus is given by T_{n+2}, where T_n = T_{n−1} + T_{n−2} + T_{n−3}
is the tribonacci sequence with initial conditions T_1 = 1, T_2 = 1, and
T_3 = 2 [32].
With our classification, we explore codes under constraint (C5, 3C). Two largest
5-bit codebooks, C_5^0 = {0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 24, 25,
26, 27, 28, 30, 31} and C_5^1 = {0, 1, 3, 4, 5, 6, 7, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22,
23, 24, 25, 28, 29, 30, 31}, are found. Via Alg. 1, an n-bit codebook C(n) can be
obtained. The number of codewords is given by
\[
|C(n)| = V D_{(C5,3C)}^{n-5} V^T \quad \text{for } n \ge 5, \tag{3.7}
\]
where V is a 24-dimensional all-ones vector and D_{(C5,3C)} is a 24 × 24 expansion
matrix.
We further establish that the largest codebook sizes under constraint (C5, 3C)
satisfy the following recursion:
Lemma 3.3.3. For n ≥ 8, |C_{(C5,3C)}(n)| satisfies the recursion |C_{(C5,3C)}(n)| =
|C_{(C5,3C)}(n − 1)| + |C_{(C5,3C)}(n − 2)| + |C_{(C5,3C)}(n − 3)|, with initial conditions
|C_{(C5,3C)}(n)| = 24, 44, 81 for n = 5, 6, 7, respectively.
Again, we can relate these codes to existing CACs by the following:
Theorem 3.3.3. The codes under (C5, 3C) have the same codebooks as FOCs.
Hence, T_{n+2} = |C_{(C5,3C)}(n)|.
Theorem 3.3.3 is not surprising, since FOCs and our codes under (C5, 3C) can
be obtained by excluding D4 and C6 patterns, respectively, and D4 and C6 have
been shown to be the same. Theorem 3.3.3 implies that results in the literature
regarding FOCs are also applicable to codes under constraint (C5, 3C).
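The FO condition and the tribonacci sizes can be illustrated with a small enumeration. The sketch below assumes one particular way of satisfying the FO condition: the forbidden pattern alternates between 101 and 010 from one center position to the next, with bit 1 taken as the most significant bit. This alternating rule is an assumption of the sketch, chosen because it reproduces the two 5-bit codebooks listed above; it is not claimed to be how Alg. 1 proceeds.

```python
def foc_codebook(n, first="101"):
    """n-bit words obeying an alternating forbidden-overlap rule.

    Around center position 2 the pattern `first` is forbidden, around
    center position 3 its complement, and so on (assumed rule).
    """
    other = "010" if first == "101" else "101"
    words = []
    for w in range(2 ** n):
        bits = format(w, "0{}b".format(n))
        ok = True
        for center in range(1, n - 1):        # 0-based index of the center bit
            banned = first if center % 2 == 1 else other
            if bits[center - 1:center + 2] == banned:
                ok = False
                break
        if ok:
            words.append(w)
    return words


def tribonacci(k):
    """T_k with T_1 = T_2 = 1 and T_3 = 2."""
    t = [0, 1, 1, 2]
    while len(t) <= k:
        t.append(t[-1] + t[-2] + t[-3])
    return t[k]
```

With `first="101"` the enumeration gives C_5^0, with `first="010"` it gives C_5^1, and the sizes for n = 5, 6, 7 are 24, 44, and 81, in line with T_{n+2}.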
3.3.6 Codes Under (C2, 1C)
With our classification, we explore codes under constraint (C2, 1C). From Tab. 3.5,
the two largest 5-bit codebooks are given by C_5^0 = {00000, 00011, 01111, 11000, 11110,
11111} and C_5^1 = {00000, 00001, 00111, 10000, 11100, 11111}. An n-bit codebook
C(n) can be obtained via Alg. 1. The number of codewords is given by
\[
|C(n)| = V D^{n-5} V^T \quad \text{for } n \ge 5, \tag{3.8}
\]
where V is a six-dimensional all-ones vector and

D =
0 0 0 0 1 1
0 0 0 1 0 0
1 0 0 0 0 0
0 0 1 0 0 0
0 1 0 0 0 0
1 0 0 0 0 0

We further establish that the largest codebook sizes under constraint (C2, 1C)
satisfy the following recursion:
Lemma 3.3.4. For n ≥ 10, |C_{(C2,1C)}(n)| satisfies the recursion |C_{(C2,1C)}(n)| =
|C_{(C2,1C)}(n − 2)| + |C_{(C2,1C)}(n − 5)|, with initial conditions |C_{(C2,1C)}(n)| = 6, 7, 9, 11,
14 for n = 5, 6, 7, 8, 9, respectively.
Lemma 3.3.5. The codebook under (C2, 1C) is a subset of the OLC codebook.
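The two lemmas above can be spot-checked numerically. The following sketch evaluates Eq. (3.8) with the 6 × 6 matrix D, compares it with the recursion of Lemma 3.3.4, and verifies for n = 5 only that each listed (C2, 1C) codebook is contained in the corresponding (C3, 1C) codebook of Sec. 3.3.3. Names are illustrative, and the subset check for a single n is only an example of Lemma 3.3.5, not a proof.

```python
# Expansion matrix D for (C2, 1C), transcribed from the text above.
D_C2_1C = [
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]


def codebook_size(D, n):
    """Evaluate V D^(n-5) V^T for n >= 5, with V the all-ones vector."""
    x = [1] * len(D)
    for _ in range(n - 5):
        x = [sum(d * xi for d, xi in zip(row, x)) for row in D]
    return sum(x)


def lemma_3_3_4(n):
    """Recursion of Lemma 3.3.4 with sizes 6, 7, 9, 11, 14 for n = 5..9."""
    a = {5: 6, 6: 7, 7: 9, 8: 11, 9: 14}
    for k in range(10, n + 1):
        a[k] = a[k - 2] + a[k - 5]
    return a[n]


# 5-bit codebooks listed in the text, written as integers.
C2_SETS = ({0b00000, 0b00011, 0b01111, 0b11000, 0b11110, 0b11111},
           {0b00000, 0b00001, 0b00111, 0b10000, 0b11100, 0b11111})
C3_SETS = ({0, 3, 14, 15, 24, 30, 31}, {0, 1, 7, 16, 17, 28, 31})
subset_ok = all(c2 <= c3 for c2, c3 in zip(C2_SETS, C3_SETS))

uc_sizes = [codebook_size(D_C2_1C, n) for n in range(5, 17)]
```

The resulting sizes for n = 5, ..., 16 coincide with the UC column of Tab. 3.6.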
3.3.7 Pruned Codes Under (C2, 1C)
For (C2, 1C), the restriction on the side wires is more relaxed than that on the
middle wires, which results in larger worst-case delays for the side wires. Hence,
we prune the CACs under constraint (C2, 1C) by removing codewords with larger
delays on the side wires in order to achieve a smaller worst-case delay. Since the
pruned codes have a smaller delay than OLCs, we call these pruned CACs improved
one Lambda codes (IOLCs). We obtain IOLCs by first finding an n-bit codebook
via Alg. 1 as in Sec. 3.3.6, and then pruning the codebook with Alg. 2. To prune
the codebook C(n), we search for maximum subsets of C_5^i (i = 0, 1) with smaller
delays on the side wires. For C_5^0, two maximum subsets C_5^{0,0} = {0, 3, 15, 30, 31}
and C_5^{0,1} = {0, 15, 24, 30, 31} are found with smaller worst-case delays on wires 1
and 2 and wires 4 and 5, respectively. For C_5^1, a maximum subset C_5^{1,1} = {0, 1, 7,
16, 31} is found with smaller worst-case delays on wires 4 and 5. Finally, a valid
n-bit codebook is obtained with the leftmost five bits belonging to C_5^{0,0} and the
rightmost five bits belonging to C_5^{0,1} or C_5^{1,1}, depending on whether n is odd or even.
The pruning algorithm for CACs under (C2, 1C) on an n-bit bus is shown in
Alg. 2. By examining every codeword c^n in C(n), the algorithm removes codewords
with larger delays on the side wires.

Algorithm 2 Pruning CACs under (C2, 1C)
Input: C_5^{0,0}, C_5^{0,1}, C_5^{1,1}, C(n);
if n is odd then
    i = 1;
else
    i = 0;
end if
for all c^n = (c_1 c_2 · · · c_n) ∈ C(n) do
    if (c_1 c_2 c_3 c_4 c_5) ∉ C_5^{0,0} or (c_{n−4} c_{n−3} c_{n−2} c_{n−1} c_n) ∉ C_5^{1−i,1} then
        eliminate c^n from C(n);
    end if
end for
Output: C(n).

With Alg. 2, we get an n-bit IOLC under constraint (C2, 1C), and its size is given by
\[
|C_{\mathrm{IOLC}}(n)| = W_1 D^{n-5} Y W_2^T \quad \text{for } n \ge 5, \tag{3.9}
\]
where W_1 = [1 1 1 0 1 1], W_2 = [1 0 1 1 1 1], and D is the same as that in
Eq. (3.8). Note that W_1 and W_2 are used instead of V because of the pruning of
valid patterns on the side wires.
We further establish that the largest codebook sizes of IOLCs satisfy the following
recursion:
Lemma 3.3.6. For n ≥ 10, |C_IOLC(n)| satisfies the recursion |C_IOLC(n)| =
|C_IOLC(n − 2)| + |C_IOLC(n − 5)|, with initial conditions |C_IOLC(n)| = 4, 5, 7, 8, 11
for n = 5, 6, 7, 8, 9, respectively.
Lemma 3.3.7. The IOLC codebook is a subset of the OLC codebook.
3.4 Performance Evaluation
In this section, we evaluate the performance of CACs based on our classification
with extensive simulations, and compare them with existing CACs. Each CAC has
two key performance metrics: delay and rate. The delay of a CAC is the worst-case
delay when the codewords from the CAC are transmitted over the bus. Codebook
size and code rate are often used to measure the overhead of CACs. The codebook
size of a CAC is simply the number of codewords. If a CAC of size M is
transmitted over an n-bit bus, its rate is defined as ⌊log_2 M⌋/n. A CAC of rate
k/n implies that n − k extra wires are used in addition to k data wires so as to reduce
the crosstalk delay. Hence, the code rate measures the area and power overhead of
CACs: the higher the rate, the smaller the overhead. There is a tradeoff
between the code rate and delay of a CAC: typically, a lower-rate code is needed
to achieve a smaller delay. To measure the overall effect of both rate and delay,
we also define the throughput of a CAC as the ratio of code rate to delay. The
assumptions behind this definition are: (1) the clock rate of the bus is determined by the
inverse of the worst-case delay; (2) the throughput of the bus is linearly proportional
to k, the number of data wires.
Since codes under (C3, 1C), (C4, 2C), and (C5, 3C) have exactly the same codebooks
as OLCs, FPCs, and FOCs, their delay, rate, and throughput are also the
same. Under constraint (C2, 1C), we propose two kinds of codes: unpruned codes
and pruned codes (IOLCs). In the following, we compare their performance with
that of the OLCs in [9] through extensive simulations.
To compare the worst-case delays of our IOLCs, unpruned (C2, 1C) codes, and
OLCs, we simulate two buses, a 10-bit bus and a 16-bit bus, with all transitions
between any two codewords in their codebooks and obtain the worst-case delay of
each wire. The simulation environment has been explained in Sec. 3.2.3. Both buses
have a length of 5 mm, with τ_0 = 1.42 ps and λ = 12.24. For the 10-bit bus, the
worst-case delays of our IOLC, the unpruned (C2, 1C) code, and the OLC are 10.14 ps,
13.50 ps, and 14.84 ps, respectively. The worst-case delays of our IOLC and unpruned
(C2, 1C) code are thus 31.67% and 9.03% smaller than that of the OLC, respectively.
For the 16-bit bus, the worst-case delays of our IOLC, the unpruned (C2, 1C) code, and
the OLC are 10.40 ps, 13.92 ps, and 16.11 ps, respectively. The worst-case
delays of our IOLC and unpruned (C2, 1C) code are thus 35.44% and 13.59% smaller than
that of the OLC, respectively. See the extended manuscript [53] of this work for
additional information.
In all simulations, our IOLCs have better delay performance than OLCs. Although
both IOLCs and unpruned (C2, 1C) codes have almost the same code rate
and better delay performance than OLCs, the delay performance of IOLCs is much
better than that of the unpruned (C2, 1C) codes. With a more advanced technology where
the coupling effect is significant, the improvement of our IOLCs is larger.
The codebook sizes of our IOLCs, unpruned (C2, 1C) codes, and OLCs [9], together
with the throughput gains with respect to OLCs, are compared in Tab. 3.6. The
throughput gain of our CACs with respect to OLCs is the ratio of the throughput
of our CACs to the throughput of OLCs.
The codebook sizes of the three codes are close. In all cases, the difference in the
number of bits between our IOLCs and unpruned (C2, 1C) codes is within 1 bit. The
difference in the number of bits between our IOLCs and OLCs [9] is within 2 bits
for n ≤ 16. With respect to throughput, our IOLCs always have a greater throughput
than OLCs, and their throughput gain ranges from 1.02 to 1.55 for an n-wire bus
(5 ≤ n ≤ 16). The unpruned (C2, 1C) codes have better throughput than OLCs in
some cases, and their throughput gain ranges from 0.78 to 1.10 for an n-wire bus
(5 ≤ n ≤ 16). When unpruned (C2, 1C) codes have a lower throughput than OLCs,
IOLCs can be used.
Our IOLCs and unpruned (C2, 1C) codes provide additional options for the
tradeoff between code rate and code delay. In addition to achieving higher throughputs,
the new CACs are also appropriate for interconnects where delay is the top
priority.
Table 3.6: Comparison of codebook size and throughput of IOLC, unpruned
(C2, 1C) code (UC), and OLC [9] (λ = 12.24 and τ0 = 1.42 ps).

# of        # of words             # of bits            Throughput gain
wires   IOLC    UC   OLC [9]   IOLC    UC   OLC [9]     IOLC      UC
  5        4     6       7        2     2       2       1.55    1.10
  6        5     7       9        2     2       3       1.07    0.78
  7        7     9      12        2     3       3       1.02    1.14
  8        8    11      16        3     3       4       1.12    0.84
  9       11    14      21        3     3       4       1.10    0.84
 10       12    17      28        3     4       4       1.10    1.10
 11       16    21      37        4     4       5       1.18    0.88
 12       18    26      49        4     4       5       1.19    0.89
 13       23    32      65        4     5       6       1.03    0.96
 14       27    40      86        4     5       6       1.02    0.95
 15       34    49     114        5     5       6       1.27    0.95
 16       41    61     151        5     5       7       1.11    0.83
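The throughput gains in Tab. 3.6 follow directly from the definitions of rate and throughput in this section. The sketch below recomputes them for the two buses whose worst-case delays are quoted in the text (n = 10 and n = 16); the dictionary layout and function names are illustrative. Delays for the other bus widths are not given here, so those rows are not reproduced.

```python
import math

# Worst-case delays in ps for the two simulated buses, as quoted in the text.
DELAY_PS = {
    10: {"IOLC": 10.14, "UC": 13.50, "OLC": 14.84},
    16: {"IOLC": 10.40, "UC": 13.92, "OLC": 16.11},
}
# Codebook sizes from Tab. 3.6 for the same buses.
WORDS = {
    10: {"IOLC": 12, "UC": 17, "OLC": 28},
    16: {"IOLC": 41, "UC": 61, "OLC": 151},
}


def code_rate(num_words, n):
    """Rate floor(log2 M) / n of a size-M codebook on an n-bit bus."""
    return math.floor(math.log2(num_words)) / n


def throughput_gain(n, code):
    """Throughput (rate / delay) of `code` relative to that of the OLC."""
    ours = code_rate(WORDS[n][code], n) / DELAY_PS[n][code]
    olc = code_rate(WORDS[n]["OLC"], n) / DELAY_PS[n]["OLC"]
    return ours / olc


gains = {(n, code): round(throughput_gain(n, code), 2)
         for n in (10, 16) for code in ("IOLC", "UC")}
```

Rounded to two decimals, the results agree with the n = 10 and n = 16 rows of Tab. 3.6.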
It has been shown that the encoding and decoding of OLCs, FPCs, and FOCs
have quadratic complexity based on numeral systems [47]. Since codes under (C3, 1C),
(C4, 2C), and (C5, 3C) have exactly the same codebooks as OLCs, FPCs, and FOCs,
their CODECs also have quadratic complexity. It is also expected that the encoding
and decoding of our IOLCs and unpruned (C2, 1C) codes have quadratic
complexity, since the codebooks of our IOLCs and unpruned (C2, 1C) codes are
proper subsets of OLCs.
We remark that the simulation results in Sections 3.2.3 and 4.5 are all based
on a 45 nm CMOS technology. We have also run the same set of simulations based
on a 0.1-µm technology (omitted for brevity). Between the two sets of simulation
results, the main conclusions of this work and the key features of our proposed
classification and CACs remain the same. For instance, the delays of the patterns
in different classes do not overlap, regardless of the technology. The proposed
CACs based on the new classification are also the same. This demonstrates
that our approach to delay classification and CACs is applicable to a wide variety of
technologies. This is because, in our approach, the dependency of the crosstalk delay
on the technology is represented by two parameters: the propagation delay τ0 of
a wire free of crosstalk and the coupling factor λ. Since our analytical approach to
the classification and CACs treats these two parameters as variables, it
can be easily adapted to a wide variety of technologies.
3.5 SUMMARY
In this chapter, we propose a new classification of transition patterns. The new
classification has finer classes, and the delays do not overlap among different classes.
Hence, the new classification is conducive to the design of CACs. To illustrate this,
we design a family of CACs with different constraints. Some codes of the family
are the same as the existing OLCs, FPCs, and FOCs. We also propose two new
CACs with a smaller worst-case delay and better throughput than OLCs. Since our
analytical approach to the classification and CACs treats the technology-dependent
parameters as variables, our approach can be easily adapted to a wide variety of
technologies.
Chapter 4
Crosstalk avoidance codes for
RLC On-Chip Interconnects
4.1 INTRODUCTION
The recent International Technology Roadmap for Semiconductors (ITRS) [1] has shown
a troubling trend: while gate delay decreases with scaling, global wire delay
increases. This is because, as process technologies scale down, the crosstalk
delay becomes more prominent due to the increasing capacitive and inductive
couplings among all wires. At low clock frequencies, the inductive coupling can be
ignored and only the capacitive coupling determines the propagation delays. Many
approaches (see, e.g., [8,10,32,33,44,55–57]) have been proposed to alleviate the
capacitive coupling. As the clock frequency approaches the multi-gigahertz range, the
parasitic inductance of on-chip interconnects has become significant, and its
detrimental effects, including increased delay, voltage overshoots and undershoots, and
increased crosstalk noise [58–60], cannot be ignored. Hence, as process technologies
scale down into the deep submicrometer (DSM) regime and the clock frequency approaches
the multi-gigahertz range, the crosstalk delay and noise due to the capacitive and
inductive couplings become the performance bottleneck in many high-performance VLSI
designs, especially for global on-chip buses. It is imperative for designers to devise
new techniques to address both capacitive and inductive couplings simultaneously.
Many approaches have been proposed to reduce the crosstalk delays due to
capacitive coupling, such as shielding, repeater insertion, and bus encoding [8, 10,
32,33,44,55–57]. Among these approaches, shielding is the simplest,
but it requires a large area overhead. Repeater insertion prevents simultaneous
opposite switching between adjacent wires by introducing intentional time
skewing, but it consumes considerable power. Bus encoding, referred
to as crosstalk avoidance coding (CAC) [8,10,32,33,44,56,57], is a promising
technique because of its effective delay reduction and low power consumption compared with
other techniques. Hence, in this work, we focus on this coding scheme for crosstalk
reduction. However, the previously proposed CACs are based on a distributed RC
model and consider only the two neighboring wires for crosstalk. When the inductance
effect is significant, more neighboring wires should be considered due
to the long-range effect of inductive coupling. It has been shown that the worst-case
switching pattern with the largest delay for RLC-coupled interconnects is
quite different from that for RC-coupled interconnects [59, 60]. The growing inductive
coupling renders the previously proposed approaches inefficient in delay reduction.
In addition, signal noise such as overshoots and undershoots is not accounted for by
these previously proposed techniques. Hence, it is necessary to develop other coding
schemes to reduce the crosstalk delay and noise due to both capacitive and inductive
couplings.
The inductive coupling depends greatly on the switching patterns on on-chip
interconnects. It is important to find the patterns incurring larger delays and noise.
In this chapter, we use “↑” to denote a transition from 0 to the supply voltage Vdd
(normalized to 1), “-” no transition, and “↓” a transition from Vdd to 0. In [59], a
worst-case pattern considering capacitive and inductive coupling is given by ↑↓↑↓↑,
where immediate neighbors switch oppositely and higher-order neighbors
switch in the same direction. In [60],
the authors show that the worst-case switching pattern changes from ↑↓↑↓↑ to
↑↑↑↑↑ when inductive coupling dominates. A bus-invert scheme is also proposed
to reduce the inductance effects by inverting the input data when the number of
wires switching in the same direction is more than half of the number of wires [60].
Hence, patterns with more than half of the wires switching in the same direction are
eliminated.
The bus-invert scheme in [60] is the first coding scheme in the literature to address
on-chip inductive coupling. However, this scheme has two disadvantages.
First, the capacitive coupling is ignored for the crosstalk delay. The worst-case
pattern is based only on the largest inductive coupling, which increases linearly
with length. However, the capacitive coupling is a quadratic function of length and
cannot be ignored for the long wires of global on-chip buses. Second, the classification
of patterns for an RLC-modeled bus is too simple, since only one worst-case pattern
is considered for inductive coupling reduction. Other patterns with slightly less
inductive coupling would compromise the coding scheme. For instance, for the 5-bit
pattern ↑↑↑↑-, the inductive coupling is also large for the
middle wire, and this pattern should also be avoided for better inductive coupling
reduction.
To address these disadvantages of the schemes in [59,60], in this work we propose
a new coding scheme. There are two main contributions in this chapter:
• First, we define a parameter to quantify the significance of inductive effects and
propose a new classification of patterns based on a combination of two constraints
for capacitive and inductive couplings, respectively.
• Second, we propose new CACs based on our classification and design
architectures of encoders and decoders (CODECs) based on a revised numeral
system.
Our approach allows us to fine-tune the patterns for different combinations of
capacitive and inductive couplings. Note that there are two extreme scenarios. If
capacitive coupling dominates, our classification reduces to the classification
for RC-coupled interconnects [56]. If inductive coupling dominates, our classification
considers only inductance effects.
The rest of the chapter is organized as follows. In Section 4.2, we first present
adverse inductance effects and then define a parameter to quantify the significance
of inductance effects. We then propose new CACs for RLC-coupled interconnects
based on our classification of patterns in Section 4.3 and their CODEC designs based
on a revised binary mixed-radix numeral system in Section 4.4. In Section 4.5, we
compare their performance in terms of worst-case delays and peak noise. Some
concluding remarks are provided in Section 7.6.
4.2 CAPACITANCE AND INDUCTANCE EFFECTS
4.2.1 Interconnect Model
With the scaling of technologies and the clock frequency approaching multi-gigahertz,
the inductance is becoming significant and impacts the signals on the bus greatly. In-
ductive coupling can cause adverse effects, such as crosstalk delay, signal overshoots
and undershoots, and switching noise, which can lead to serious signal integrity
issues [60]. In addition, the worst-case patterns due to the inductive coupling are
quite different from those due to capacitive coupling [60], making previously pro-
posed coding schemes ineffective. Hence, in today’s high performance circuit design,
the inductance effects cannot be neglected. A transition from an RC interconnect
model to an RLC model is necessary.
A distributed RLC model of a five-wire bus is shown in Fig. 4.1, where V_i(x, t)
denotes the transient signal at time t and position x (0 ≤ x ≤ L) on wire i for
i ∈ {1, 2, 3, 4, 5}, and r, l, and c denote the resistance, inductance, and capacitance
per unit length, respectively. λ is the ratio of the coupling capacitance between
two adjacent wires to the wire capacitance, and l_{i,j} denotes the coupling inductance
per unit length between wires i and j. The values of λ and l_{i,j} depend on many
factors, such as the metal layer in which the bus is routed, the wire width, the
spacing between adjacent wires, and the distance to the ground layer. We consider
a uniformly distributed bus with the same parameters r, l, c, and λ for all the wires.
Figure 4.1: A distributed RLC model for five wires.
4.2.2 Crosstalk Delay
For a single line, many approaches have been proposed to analyze and characterize
the delay and noise [38,61]. For a single distributed RC line, a closed-form expression
of the time delay is derived in [38] and given by τ = 0.693 R_tr c L + 0.377 r c L^2, where
R_tr is the driver resistance and L is the interconnect length. In [61], inductance
is included to derive the time delay of a single line in the following two scenarios:
• If (R/Z_0) ≤ ln[4Z_0/(R_tr + Z_0)] AND R_tr < 3Z_0, the delay is τ = t_f = L√(lc),
where Z_0 = √(l/c) is the lossless characteristic impedance, R = rL is the
resistance of each wire, and t_f = L√(lc) is the time of flight of the signals
across the whole interconnect;
• If (R/Z_0) ≥ 2 ln[4Z_0/(R_tr + Z_0)] OR R_tr > 3Z_0, the time delay is τ =
0.693 R_tr c L + 0.377 r c L^2, the same as that of a distributed RC line.
The first case occurs when the inductance becomes significant, since the two
inequalities can be easily satisfied for large Z_0. In this case, the 50% time delay is
approximated by the time of flight t_f = L√(lc). The second case is for small inductance
effects, and the delay is the same as that of an RC-modeled bus.
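The two regimes can be written as a small selection function. The sketch below follows the conditions exactly as quoted above; the function name, argument names, and the numerical values in the example are hypothetical and only serve to exercise both branches. When neither condition holds, the text gives no expression, so the function returns None.

```python
import math


def single_line_delay(r, l, c, length, r_tr):
    """50% delay of a single line under the two regimes quoted above.

    r, l, c are per-unit-length values, length is L, and r_tr is the
    driver resistance. Returns None when neither condition holds.
    """
    z0 = math.sqrt(l / c)                 # lossless characteristic impedance
    big_r = r * length                    # total wire resistance R = rL
    bound = math.log(4 * z0 / (r_tr + z0))
    if big_r / z0 <= bound and r_tr < 3 * z0:
        return length * math.sqrt(l * c)  # time of flight t_f
    if big_r / z0 >= 2 * bound or r_tr > 3 * z0:
        return 0.693 * r_tr * c * length + 0.377 * r * c * length ** 2
    return None


# Hypothetical 5 mm wire with Z_0 = 50 ohm and a 20 ohm driver.
tof_case = single_line_delay(r=8e3, l=0.5e-6, c=2e-10, length=5e-3, r_tr=20.0)
rc_case = single_line_delay(r=1e5, l=0.5e-6, c=2e-10, length=5e-3, r_tr=20.0)
```

With these illustrative values, the low-resistance wire falls in the time-of-flight regime and the high-resistance wire falls in the RC regime.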
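For a multi-wire bus, the on-chip RLC interconnects are characterized by the
telegrapher's equations given by [5]
\[
\begin{cases}
\dfrac{\partial}{\partial x} \mathbf{V}(x,t) = -\mathbf{L} \dfrac{\partial}{\partial t} \mathbf{I}(x,t) - \mathbf{R}\,\mathbf{I}(x,t), \\[1ex]
\dfrac{\partial}{\partial x} \mathbf{I}(x,t) = -\mathbf{C} \dfrac{\partial}{\partial t} \mathbf{V}(x,t),
\end{cases}
\tag{4.1}
\]
where V(x, t) and I(x, t) denote the voltage and current vectors of the interconnects,
respectively, and R = [r_{i,j}], L = [l_{i,j}], and C = [c_{i,j}] are the resistance, inductance,
and capacitance matrices, respectively. R = R[I], with [I] the identity matrix, is a
diagonal matrix. Since only the wire capacitance c_{i,i} and the coupling capacitance
between adjacent wires c_{i,i+1} are considered, we have c_{i,j} = 0 for j ∉ {i − 1, i, i + 1}.
Hence, C = [c_{i,j}] is a tri-diagonal matrix. L is a dense matrix, since the inductance
effect is a long-range effect. Eq. (4.1) can be simplified as
\[
\frac{\partial^2}{\partial x^2} \mathbf{V}(x,t) = \mathbf{L}\mathbf{C} \frac{\partial^2}{\partial t^2} \mathbf{V}(x,t) + \mathbf{R}\mathbf{C} \frac{\partial}{\partial t} \mathbf{V}(x,t). \tag{4.2}
\]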
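The step from Eq. (4.1) to Eq. (4.2) can be spelled out as follows, under the assumption (suggested by the uniform bus model of Sec. 4.2.1, but not stated explicitly here) that the matrices R, L, and C depend on neither x nor t. Differentiating the first line of Eq. (4.1) with respect to x and substituting the second line gives
\[
\frac{\partial^2}{\partial x^2}\mathbf{V}
= -\mathbf{L}\frac{\partial}{\partial t}\left(\frac{\partial}{\partial x}\mathbf{I}\right) - \mathbf{R}\,\frac{\partial}{\partial x}\mathbf{I}
= \mathbf{L}\mathbf{C}\frac{\partial^2}{\partial t^2}\mathbf{V} + \mathbf{R}\mathbf{C}\frac{\partial}{\partial t}\mathbf{V}.
\]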
It is known that the PDEs for a distributed RC interconnect can be decoupled into
isolated equations by diagonalizing the coupling matrix C [5,38]. It has been shown
that the capacitance and inductance matrices of a bus with an ideal return path satisfy [5]
\[
\mathbf{L}\mathbf{C} = \frac{1}{\nu^2}[I],
\]
where ν is the speed of an electromagnetic wave in the given dielectric material and
[I] is the identity matrix. Hence, Eq. (4.2) simplifies to
\[
\frac{\partial^2}{\partial x^2} \mathbf{V}(x,t) = \frac{1}{\nu^2} \frac{\partial^2}{\partial t^2} \mathbf{V}(x,t) + \mathbf{R}\mathbf{C} \frac{\partial}{\partial t} \mathbf{V}(x,t). \tag{4.3}
\]
Figure 4.2: Ringing on wire 3 of a five-wire bus for ↑↑↑↑↑ and ↓↑↑↑↓.
Then, Eq. (4.3) can be decoupled using the same technique as in [5, 38] for a
distributed RC interconnect. The conclusions for a single wire can be used to estimate
the 50% delays and noise of all wires in a multi-wire bus. Since the product
LC = (1/ν²)[I] is a constant matrix, patterns with larger capacitive couplings have
smaller inductive couplings, and vice versa. This has been verified in [60] by finding
the best- and worst-case patterns considering inductive couplings. The worst pattern
with the largest ringing has all wires switching simultaneously in the same direction,
and the worst pattern with the largest delay has immediate neighbors switching
oppositely [60].
4.2.3 Interconnect Ring
Another adverse inductance effect is the severe ringing of on-chip interconnects with
growing inductance. The ringing is more severe for patterns with many wires switching
in the same direction due to larger inductive couplings. The time delay has been
approximated by the time of flight when inductance is significant [61]. However,
this is only true when overshoots or undershoots do not cross 50% Vdd multiple
times. When the inductance is significant, the ringing decays slowly, and multiple
undershoots (overshoots) may go below (above) 50% Vdd for a rising (falling) step
signal. Glitches then appear at the receiver end, and a larger delay is required to
obtain a stable result. In this case, the 50% delay is obtained based on the last
crossing of 50% Vdd.
In the following, we show that the significance of ringing depends on the
transition activity. Since the mutual inductance decays slowly, the inductance effect
is a long-range effect, and all higher-order neighbors contribute to the crosstalk.
For this reason, we include two more wires and focus on wire 3 of the 5-wire bus in
Fig. 4.1. The total capacitance C_t and inductance L_t of wire 3 satisfy C_t L_t = 1/ν².
For transition ↑↑↑↑↑, C_t takes its smallest value, since there is no capacitive coupling.
However, L_t takes its maximum value and the inductive coupling is significant. The
resulting ringing decays slowly, as shown in Fig. 4.2. Similarly, for transition ↓↑↑↑↓, C_t
increases and L_t decreases. The inductive coupling decreases and the ringing decays
quickly, as shown in Fig. 4.2.
The significance of ringing can also be explained by a parameter ζ introduced
in [62], where a closed-form delay model is derived for a single RLC wire as a function
of ζ. The parameter ζ is given by
\[
\zeta = \frac{R_t}{2\sqrt{L_t/C_t}} \cdot \frac{R_T + C_T + R_T C_T + 0.5}{\sqrt{1 + C_T}},
\]
where C_T = C_L/C_t and R_T = R_tr/R_t. For a small ζ, the ringing is significant and the 50%
delay is large due to multiple crossings of 50% Vdd [62]. For a large ζ, the ringing is
weak and the 50% delay is obtained based on the first crossing of 50% Vdd. For the
two transition patterns ↑↑↑↑↑ and ↓↑↑↑↓, the former has a smaller ζ than
the latter, since ↑↑↑↑↑ has a larger L_t and a smaller C_t. Hence, the ringing of pattern
↑↑↑↑↑ is more significant, as shown in Fig. 4.2.
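The dependence of ζ on L_t and C_t can be made concrete with a short numerical sketch. The function below evaluates the expression for ζ given above. It assumes that C_L denotes the load capacitance and R_t the total wire resistance, which the text does not define explicitly, and all numerical values are hypothetical; the two parameter sets keep the product L_t C_t fixed, in the spirit of C_t L_t = 1/ν².

```python
import math


def zeta(r_t, l_t, c_t, r_tr, c_l):
    """Evaluate the parameter zeta from the expression above.

    r_t, l_t, c_t: total wire resistance, inductance, and capacitance;
    r_tr: driver resistance; c_l: load capacitance (assumed meaning of C_L).
    """
    big_r = r_tr / r_t                    # R_T
    big_c = c_l / c_t                     # C_T
    return (r_t / (2 * math.sqrt(l_t / c_t))
            * (big_r + big_c + big_r * big_c + 0.5) / math.sqrt(1 + big_c))


# Hypothetical values with the same product L_t * C_t for both patterns.
zeta_same_dir = zeta(r_t=50.0, l_t=4e-9, c_t=0.5e-12, r_tr=20.0, c_l=10e-15)
zeta_mixed = zeta(r_t=50.0, l_t=2e-9, c_t=1.0e-12, r_tr=20.0, c_l=10e-15)
```

For these illustrative numbers, the set with the larger L_t and smaller C_t yields the smaller ζ, which matches the qualitative argument for ↑↑↑↑↑ versus ↓↑↑↑↓.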
4.3 CAC design
4.3.1 Previous CAC Design
CACs were first proposed to reduce the crosstalk delay for on-chip global
interconnects. A k-bit data word (x_k x_{k−1} · · · x_1) is encoded into an m-bit (m > k) codeword
(c_m c_{m−1} · · · c_1). Two kinds of CACs, CACs with memory and memoryless CACs,
have been investigated in the literature [51]. CACs with memory need to store all
codebooks corresponding to different codewords (c_m c_{m−1} · · · c_1), since the encoding
depends on the data word (x_k x_{k−1} · · · x_1) as well as the preceding codeword. In
contrast, memoryless CACs require a single codebook to generate codewords for
transmission, because the encoding depends on the data word only. Hence,
memoryless CACs are much simpler to implement than CACs with memory. We focus on
memoryless CACs in this chapter.
There exist several methods to obtain a memoryless codebook, based on pattern pruning, transition pruning, or recursive construction. The pattern pruning technique is straightforward, and gives a codebook with a smaller worst-case delay by eliminating some patterns. The transition pruning technique [10] is based on graph theory. This method first builds a transition graph with all possible codewords as nodes and all valid transitions as edges, and then finds a maximum clique. A clique is defined as a subgraph in which every pair of nodes is connected by an edge. A maximum clique is defined as a clique of the largest possible size in a given graph. Since every pair of nodes is connected, a maximum clique in this graph constitutes a memoryless codebook with the largest size. The codebook generation method is based on exhaustive search. Although it is easy to get a maximum clique from a transition graph with a small m, the complexity increases rapidly with m. This is because the number of edges in an m-bit transition graph is upper bounded by $2^{m-1}(2^m - 1)$, which increases exponentially with m. In fact, finding a maximum clique under given constraints is an NP-hard problem [52]. The recursive technique constructs an (m+1)-bit codebook from an m-bit codebook [8, 9]. Since for a small m a largest codebook can be obtained easily via the second method, a codebook for an m-wire bus can be constructed recursively.
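The transition pruning idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: the validity rule `valid_transition` (no two adjacent wires switch in opposite directions) is a stand-in constraint chosen for the example, and the exhaustive branch-and-bound search is one simple way to find a maximum clique, not necessarily the procedure used in [10].

```python
from itertools import product

def valid_transition(u, v):
    """Assumed rule: adjacent wires must not switch in opposite directions."""
    for i in range(len(u) - 1):
        both_switch = u[i] != v[i] and u[i + 1] != v[i + 1]
        if both_switch and u[i] != u[i + 1]:
            return False
    return True

def maximum_clique(nodes, adjacent):
    """Exhaustive search with a simple size bound (exponential in general)."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = clique[:]
        if len(clique) + len(candidates) <= len(best):
            return  # this branch cannot beat the best clique found so far
        for idx, v in enumerate(candidates):
            # keep only later candidates that are adjacent to v
            rest = [w for w in candidates[idx + 1:] if adjacent(v, w)]
            expand(clique + [v], rest)

    expand([], list(nodes))
    return best

def all_words(m):
    return ["".join(bits) for bits in product("01", repeat=m)]

codebook = maximum_clique(all_words(3), valid_transition)
print(len(codebook), codebook)
```

Every pair of codewords in the returned list is a valid transition under the assumed rule, so the list can serve as a memoryless codebook for that rule. The search time grows quickly with m, which is the motivation for the recursive construction.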
Previously proposed CACs (see, for example, [8, 10, 32, 33]) are not efficient if the inductance effects are significant. Adverse inductance effects, such as voltage overshoots and undershoots and switching noise, change the worst-case switching pattern and also lead to serious signal integrity issues [60]. Hence, a different coding scheme is needed to account for these adverse effects. The key idea of previous CACs is to eliminate transition patterns incurring larger delays. To account for inductance effects, we first find patterns with larger inductive couplings. Then, using a similar idea, we extend CACs to account for inductance effects by eliminating those patterns with larger inductive couplings.
4.3.2 Classification
In the following, we consider the classification of transition patterns with respect to the total inductance of the middle wire (wire k). To quantify the inductive coupling, we introduce a parameter
\[ W_k = \left| \sum_{i=k-\lfloor \Delta/2 \rfloor}^{k+\lfloor \Delta/2 \rfloor} w_i \right| \quad \text{for wire } k, \tag{4.4} \]
where ∆ is the number of neighboring wires considered for mutual inductance and $w_i = -1, 0, 1$ corresponds to ↓, −, and ↑ on wire i, respectively. Note that $w_i = 1$ or $-1$ denotes the largest inductive coupling. Since the mutual inductance decays slowly, more neighbors contribute to the crosstalk. In contrast to the two adjacent wires considered for capacitive coupling, we consider two more neighboring wires (∆ = 4) for inductive coupling. The first reason for choosing ∆ = 4 is that the classification of transitions is easy, since we have a reasonable number of transition patterns to classify. For instance, for ∆ = 4 there are a total of $3^5 = 243$ transition patterns, compared with $3^7 = 2187$ for ∆ = 6. The other reason is that our CAC design is based on a recursive coding scheme, as explained in Sec. 4.3.3, which helps to restrict inductive coupling on wires beyond the chosen neighboring wires.
We first focus on a five-wire bus for transition pattern classification. There are $3^4 = 81$ different transition patterns with an ↑ transition on wire 3, which can be partitioned into 6 classes as shown in Table 4.1. For patterns with a ↓ transition on wire 3, a similar classification can be obtained by inverting all patterns in Table 4.1. For patterns with no transition on wire 3, a classification is shown in Table 4.2.
From Table 4.1, we note that the worst-case pattern with respect to inductive coupling is ↑↑↑↑↑, which is the best-case pattern in terms of capacitive coupling. By choosing those patterns with $W_3 \le k_w$ in Tables 4.1 and 4.2, where $k_w \in \{0, \cdots, 4\}$, we can reduce the worst-case inductive coupling. The smaller $k_w$ is, the larger the reduction of inductive coupling.
Table 4.1: Classification of patterns with ↑ on wire 3 with respect to $W_3 = |\sum_{i=1}^{5} w_i|$.
W3 = 0:
↓-↑- -, -↓↑- -, - -↑↓-, - -↑-↓, ↑-↑↓↓, ↑↓↑-↓, ↑↓↑↓-, -↑↑↓↓
↓↑↑-↓, ↓↑↑↓-, -↓↑↑↓, ↓-↑↑↓, ↓↓↑↑-, -↓↑↓↑, ↓-↑↓↑, ↓↓↑-↑
W3 = 1:
- -↑- -, ↑↓↑- -, ↑-↑↓-, ↑-↑-↓, ↓↑↑- -, -↑↑↓-, -↑↑-↓, ↓-↑↑-
-↓↑↑-, - -↑↑↓, ↓-↑-↑, -↓↑-↑, - -↑↓↑, ↑↑↑↓↓, ↓↓↑↑↑, ↑↓↑↑↓
↑↓↑↓↑, ↓↑↑↑↓, ↓↑↑↓↑, ↓↓↑- -, - -↑↓↓, ↓-↑-↓, ↓-↑↓-, -↓↑-↓
-↓↑↓-, ↑↓↑↓↓, ↓↑↑↓↓, ↓↓↑↑↓, ↓↓↑↓↑
W3 = 2:
↑-↑- -, -↑↑- -, - -↑↑-, - -↑-↑, ↑↑↑↓-, ↑↑↑-↓, ↑↓↑↑-, ↑-↑↑↓
↑↓↑-↑, ↑-↑↓↑, ↓↑↑↑-, -↑↑↑↓, ↓↑↑-↑, -↑↑↓↑, ↓-↑↑↑, -↓↑↑↑
-↓↑↓↓, ↓-↑↓↓, ↓↓↑-↓, ↓↓↑↓-
W3 = 3:
↑↑↑- -, - -↑↑↑, ↑-↑-↑, ↑-↑↑-, -↑↑↑-, -↑↑-↑
↓↑↑↑↑, ↑↓↑↑↑, ↑↑↑↓↑, ↑↑↑↑↓, ↓↓↑↓↓
W3 = 4:
↑↑↑↑-, ↑↑↑-↑, ↑-↑↑↑, -↑↑↑↑
W3 = 5:
↑↑↑↑↑
Table 4.2: Classification of patterns with - (no transition) on wire 3 with respect to $W_3 = |\sum_{i=1}^{5} w_i|$.
W3 = 0:
- - - - -, ↑↓- - -, ↑- -↓-, ↑- - -↓, -↑-↓-, -↑- -↓, - - -↑↓, - - -↓↑
-↓-↑-, -↓- -↑, ↓↑- - -, ↓- -↑-, ↓- - -↑, ↑↑-↓↓, ↓↓-↑↑, ↑↓-↑↓
↑↓-↓↑, ↓↑-↑↓, ↓↑-↓↑
W3 = 1:
↑- - - -, -↑- - -, - - -↑-, - - - -↑, ↓- - - -, -↓- - -, - - -↓-, - - - -↓
↑↑-↓-, ↑↑- -↓, ↑- -↑↓, ↑- -↓↑, ↑↓-↑-, ↑↓- -↑, -↑-↑↓, -↑-↓↑
-↓-↑↑, ↓↑-↑-, ↓↑- -↑, ↓- -↑↑, ↑- -↓↓, ↑↓- -↓, ↑↓-↓-, -↑-↓↓
-↓-↑↓, -↓-↓↑, ↓↑- -↓, ↓↑-↓-, ↓- -↑↓, ↓- -↓↑, ↓↓-↑-, ↓↓- -↑
W3 = 2:
↑↑- - -, - - -↑↑, ↑- -↑-, ↑- - -↑, -↑- -↑, -↑-↑-, ↓↓- - -, - - -↓↓
↓- -↓-, ↓- - -↓, -↓- -↓, -↓-↓-, ↑↑-↑↓, ↑↑-↓↑, ↑↓-↑↑, ↓↑-↑↑
↓↓-↓↑, ↓↓-↑↓, ↓↑-↓↓, ↑↓-↓↓
W3 = 3:
↑↑-↑-, ↑↑- -↑, ↑- -↑↑, -↑-↑↑, ↓↓-↓-, ↓↓- -↓, ↓- -↓↓, -↓-↓↓
W3 = 4:
↑↑-↑↑, ↓↓-↓↓
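The class sizes in Tables 4.1 and 4.2 can be reproduced by direct enumeration. The sketch below applies Eq. (4.4) with ∆ = 4 to every 5-wire pattern with a fixed transition on wire 3; the function names are illustrative and not part of the original design flow.

```python
from collections import Counter
from itertools import product

def coupling_weight(pattern):
    """W_3 for a 5-wire pattern with entries -1 (down), 0 (idle), +1 (up)."""
    return abs(sum(pattern))

def classify(middle):
    """Count the patterns in each W_3 class for a fixed transition on wire 3."""
    counts = Counter()
    for others in product((-1, 0, 1), repeat=4):
        pattern = others[:2] + (middle,) + others[2:]
        counts[coupling_weight(pattern)] += 1
    return dict(sorted(counts.items()))

up_classes = classify(+1)    # wire 3 switches up (Table 4.1)
idle_classes = classify(0)   # wire 3 does not switch (Table 4.2)
print(up_classes)
print(idle_classes)
```

The counts agree with the number of patterns listed per row of the two tables (81 patterns in each table in total).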
Figure 4.3: (a) Find a 5-bit (i, kw)-SOTA codebook; (b) Construct an m-bit (i, kw)-
SOTA codebook recursively.
4.3.3 New CAC Design
New CACs accounting for both capacitive and inductive couplings are desired. Opposite transitions on adjacent wires lead to large capacitive couplings, and same-direction transitions on adjacent wires lead to large inductive couplings. Hence, to reduce the reliability and delay issues caused by capacitive and inductive coupling effects, we need to avoid the patterns with the most same-direction and opposite switchings, as shown in Tables 4.1 and 4.2. In the following, with consideration of both the capacitive and inductive couplings, we propose a Same and Opposite Transitions Avoidance (SOTA) coding scheme. The reduction of the capacitive coupling is achieved by avoiding iC patterns in [42] for $i \in \{1, 2, 3, 4\}$. The reduction of the inductive coupling is achieved by eliminating patterns in Tables 4.1 and 4.2 with $W_3 > k_w$ ($k_w \in \{0, 1, 2, 3, 4\}$). Such codes are referred to as $(i, k_w)$-SOTA codes.
In this chapter, we use the recursive scheme to find an $(i, k_w)$-SOTA codebook for an m-bit bus. First, we focus on wire 3 in a five-bit window and find a 5-bit $(i, k_w)$-SOTA codebook. The procedure is illustrated in Fig. 4.3(a) with two steps. The first step is to obtain all allowable transitions by applying the two constraints, iC and $W_3 \le k_w$. This can be done by applying the two constraints sequentially. Here, we first pick all transitions satisfying $W_3 \le k_w$ in Tables 4.1 and 4.2, and then remove those having (i+1)C, · · · , 4C patterns. The second step is to find a maximum clique of nodes from the list of all allowable transitions. A 5-bit $(i, k_w)$-SOTA codebook is given by such a maximum clique of nodes. To obtain a 6-bit $(i, k_w)$-SOTA codebook, we first obtain a 6-bit candidate codebook by prepending 0's and 1's to the left of all 5-bit codewords. Then, we check whether the leftmost 5-bit pattern satisfies the two constraints and remove those 6-bit codewords violating either constraint from the 6-bit candidate codebook. After all the prepending and checking operations, we obtain a 6-bit $(i, k_w)$-SOTA codebook. Similarly, an m-bit $(i, k_w)$-SOTA codebook can be obtained recursively, as shown in Fig. 4.3(b).
4.3.4 (2, 1)-SOTA codes
For a worst-case capacitive coupling of 2C and a worst-case inductive coupling of $W_3 \le 1$, the list of allowable transitions in Fig. 4.3 is obtained by removing 3C and 4C patterns from the patterns with $W_3 \le 1$ in Tables 4.1 and 4.2. Using MATLAB, we find a maximum clique given by {00011, 00110, 00111, 01100, 01110, 10001, 10011, 11000, 11001, 11100}, which is a 5-bit (2, 1)-SOTA codebook. Let C(m) be the set of m-bit (2, 1)-SOTA codewords and $c(m) = c_m c_{m-1} \cdots c_1$ be a codeword in C(m). An m-bit (2, 1)-SOTA codebook can be generated recursively by the following algorithm, where · denotes the concatenation operation.
Algorithm 3 (2, 1)-SOTA codeword generation.
Input: C(5) = {00011, 00110, 00111, 01100, 01110, 10001, 10011, 11000, 11001, 11100}; k = 5;
while k ≤ m − 1 do
  for all c(k) ∈ C(k) do
    if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 0001 then
      add 1 · c(k) to C(k + 1);
    else if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 0011 then
      add 0 · c(k) and 1 · c(k) to C(k + 1);
    else if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 0110 then
      add 0 · c(k) to C(k + 1);
    else if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 0111 then
      add 0 · c(k) to C(k + 1);
    else if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 1000 then
      add 1 · c(k) to C(k + 1);
    else if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 1001 then
      add 1 · c(k) to C(k + 1);
    else if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 1100 then
      add 0 · c(k) and 1 · c(k) to C(k + 1);
    else if $c_k c_{k-1} c_{k-2} c_{k-3}$ = 1110 then
      add 0 · c(k) to C(k + 1);
    end if
  end for
  k = k + 1;
end while
Output: C(m);
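For readers who want to experiment with Alg. 3, the following is a minimal Python sketch of the same recursion. Codewords are represented as strings with the most significant bit $c_k$ first; the names `SEED`, `EXTENSIONS`, and `sota_codebook` are illustrative choices, not names from the original work.

```python
# 5-bit (2, 1)-SOTA codebook used as the starting point of Alg. 3.
SEED = ["00011", "00110", "00111", "01100", "01110",
        "10001", "10011", "11000", "11001", "11100"]

# Bits that Alg. 3 allows to be prepended, keyed by c_k c_{k-1} c_{k-2} c_{k-3}.
EXTENSIONS = {
    "0001": "1", "0011": "01", "0110": "0", "0111": "0",
    "1000": "1", "1001": "1", "1100": "01", "1110": "0",
}

def sota_codebook(m):
    """Return the m-bit codebook produced by the recursion of Alg. 3 (m >= 5)."""
    if m < 5:
        raise ValueError("the recursion starts from the 5-bit codebook")
    codebook = list(SEED)
    for _ in range(5, m):
        codebook = [bit + word
                    for word in codebook
                    for bit in EXTENSIONS.get(word[:4], "")]
    return codebook

for m in range(5, 9):
    print(m, len(sota_codebook(m)))
```

The printed sizes for m = 5, 6, 7, 8 can be compared against the first rows of Table 4.3.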
The m-bit codebook generated by Alg. 3 is a subset of the m-bit FPC codebook. Hence, no 010 or 101 patterns are allowed in any codeword. The following lemma gives a necessary and sufficient condition for an m-bit codebook to be a (2, 1)-SOTA codebook.
Lemma 4.3.1. An m-bit (m ≥ 5) codebook is a (2, 1)-SOTA codebook if and only if all m-bit codewords avoid the 010, 101, 0000, and 1111 patterns.
Proof. It is easy to see that Alg. 3 does not introduce 010, 101, 0000, or 1111 patterns. Hence, it is equivalent to prove that a 5-bit codebook is a (2, 1)-SOTA codebook if and only if all 5-bit codewords avoid the 010, 101, 0000, and 1111 patterns.
We first prove the necessity. A 5-bit (2, 1)-SOTA codebook is given by {00011, 00110, 00111, 01100, 01110, 10001, 10011, 11000, 11001, 11100}. It is observed that no 010, 101, 0000, or 1111 pattern appears in any of these codewords.
To prove the sufficiency, we eliminate those codewords containing a 010, 101, 0000, or 1111 pattern from all 32 5-bit codewords. The refined codebook is given by {00011, 00110, 00111, 01100, 01110, 10001, 10011, 11000, 11001, 11100}, which is the same as the 5-bit (2, 1)-SOTA codebook.
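The elimination step in the sufficiency argument is easy to reproduce. The sketch below filters all m-bit words by the four forbidden patterns of Lemma 4.3.1; the helper name `pattern_free_words` is illustrative.

```python
from itertools import product

FORBIDDEN = ("010", "101", "0000", "1111")

def pattern_free_words(m):
    """All m-bit words (as strings) that contain none of the forbidden patterns."""
    words = ("".join(bits) for bits in product("01", repeat=m))
    return [w for w in words if not any(p in w for p in FORBIDDEN)]

five_bit = pattern_free_words(5)
print(five_bit)
```

For m = 5 the surviving words are exactly the ten codewords listed in the proof.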
Let $C_m$ be the size of codebook C(m). We further establish that the largest (2, 1)-SOTA codebook size satisfies the following recursion.
Lemma 4.3.2. For m ≥ 8, $C_m$ is given by the recursion $C_m = C_{m-2} + C_{m-3}$, with initial conditions $C_m = 10, 14, 18$ for m = 5, 6, 7, respectively.
Proof. For codewords c(m) ∈ C(m), define $C^d_m$ as the number of codewords satisfying $c_m = c_{m-1}$ and $c_{m-1} \ne c_{m-2}$. Define $C^t_m$ and $C^f_m$ as the numbers of codewords satisfying $c_m = c_{m-1} = c_{m-2}$ and $c_m \ne c_{m-1}$, respectively. Hence, $C_m = C^d_m + C^t_m + C^f_m$. For m = 5, we have $C^d_5 = 4$, $C^t_5 = 2$, $C^f_5 = 4$, and $C_5 = 10$.
For m > 5, according to Alg. 3, we have
\[ C^d_m = C^f_{m-1}, \]
\[ C^t_m = C^d_{m-1}, \]
\[ C^f_m = C^t_{m-1} + C^d_{m-1}, \]
\[ C_m = C^t_{m-1} + 2C^d_{m-1} + C^f_{m-1}. \]
For m = 6, $C^d_6 = 4$, $C^t_6 = 4$, $C^f_6 = 6$, and $C_6 = 14$. For m = 7, $C^d_7 = 6$, $C^t_7 = 4$, $C^f_7 = 8$, and $C_7 = 18$.
Since $C_m = C^d_m + C^t_m + C^f_m$, we also have $C_{m-1} = C^d_{m-1} + C^t_{m-1} + C^f_{m-1} = C^d_m + C^f_m$. Hence, for m ≥ 8,
\[ C_m = C^t_{m-1} + 2C^d_{m-1} + C^f_{m-1} \]
\[ = (C^d_{m-1} + C^f_{m-1}) + (C^t_{m-1} + C^d_{m-1}) \]
\[ = C_{m-2} + (C^d_{m-2} + C^f_{m-2}) \]
\[ = C_{m-2} + C_{m-3}. \]
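The class recursion used in the proof can be iterated directly, which gives a quick numerical check of Lemma 4.3.2. This is a small sketch; the function names are illustrative, and `data_bits` computes $\lfloor \log_2 C_m \rfloor$, the number of data bits an m-bit codebook can carry.

```python
def class_counts(m):
    """Return (C^d_m, C^t_m, C^f_m) by iterating the recursion from m = 5."""
    d, t, f = 4, 2, 4                  # values for m = 5
    for _ in range(5, m):
        d, t, f = f, d, t + d          # C^d_m, C^t_m, C^f_m from the m-1 values
    return d, t, f

def codebook_size(m):
    return sum(class_counts(m))

def data_bits(m):
    """floor(log2(C_m)), computed exactly with integer arithmetic."""
    return codebook_size(m).bit_length() - 1

for m in (5, 6, 7, 8, 16, 32):
    print(m, class_counts(m), codebook_size(m), f"{data_bits(m)}/{m}")
```

The sizes satisfy $C_m = C_{m-2} + C_{m-3}$ for m ≥ 8, as the lemma states.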
4.4 CODEC design
A numeral system is a mathematical notation for representing numbers of a given set by symbols in a consistent manner [63]. A binary mixed-radix numeral system represents a number as $\sum_{i=1}^{m} d_i f_i$, where $d_m \cdots d_2 d_1$ is a binary string and $\{f_m, \cdots, f_2, f_1\}$ is a basis set of non-negative numbers. If any integer $u \in [0, \sum_{i=1}^{m} f_i]$ can be represented by at least one binary string $d_m \cdots d_2 d_1$, the numeral system is complete. In [47], a generic CAC encoding algorithm is proposed based on a binary mixed-radix numeral system. Since all-zero and all-one codewords are forbidden for RLC-coupled interconnects, the representable range starts from a nonzero value P, that is, $u \in [P, P + \sum_{i=1}^{m} f_i]$. The revised encoding algorithm is shown in Alg. 4, where $\{f_i\}_{i=1}^{m}$ denotes the basis set of the encoding numeral system, v ($0 \le v \le \sum_{i=1}^{m} f_i$) is the data message, P is a non-zero integer, $\{\alpha_i\}_{i=1}^{m}$, $\{\beta_i\}_{i=1}^{m}$, and Θ are constants to be determined for different CACs, and $d_m d_{m-1} \cdots d_1$ is the encoded codeword. The offset P is first added to the data message v. The decoding is straightforward: compute $\sum_{i=1}^{m} d_i f_i - P$. The CODEC based on Alg. 4 is shown in Fig. 4.4. The encoder has m − 1 identical processing elements, as shown in Fig. 4.4(c), and one additional adder for input v. One of the inputs to the top processing element is a don't care and is connected to ground. Each processing element has two inputs, $d_{k+1}$ and $r_{k+1}$, and two outputs, $d_k$ and $r_k$, and consists of two comparators, one adder, one multiplexer, one AND gate, and one OR gate.
Algorithm 4 Generic CAC encoding algorithm.
Input: code length m; data message v; non-zero integer P;
Initialize v = v + P;
for k = m downto 2 do
  if k = m then
    if v ≥ Θ then
      $d_m$ = 1;
    else
      $d_m$ = 0;
    end if
    $r_m = v - d_m \cdot f_m$;
  else
    if $r_{k+1} \ge \alpha_k$ then
      $d_k$ = 1;
    else if $r_{k+1} < \beta_k$ then
      $d_k$ = 0;
    else
      $d_k = d_{k+1}$;
    end if
    $r_k = r_{k+1} - d_k \cdot f_k$;
  end if
end for
$d_1 = r_2$;
Output: $d_m d_{m-1} \cdots d_1$.
Figure 4.4: CODEC for an m-bit CAC via Alg. 4. (a) Encoder; (b) Decoder; (c)
Processing element in (a).
4.4.1 (2, 1)-SOTA codes
Assume the basis $\{f_i\}_{i=1}^{m}$ is positive and non-decreasing. We define the min and max codewords of (2, 1)-SOTA codes as
\[ \mathrm{Min}_n = \begin{cases} (00011)^k, & n = 5k, \\ (00011)^k \cdot 0, & n = 5k + 1, \\ (00011)^k \cdot 00, & n = 5k + 2, \\ (00011)^k \cdot 000, & n = 5k + 3, \\ (00011)^k \cdot 0001, & n = 5k + 4, \end{cases} \tag{4.5} \]
and
\[ \mathrm{Max}_n = \begin{cases} (11100)^k, & n = 5k, \\ (11100)^k \cdot 1, & n = 5k + 1, \\ (11100)^k \cdot 11, & n = 5k + 2, \\ (11100)^k \cdot 111, & n = 5k + 3, \\ (11100)^k \cdot 1110, & n = 5k + 4, \end{cases} \tag{4.6} \]
where n ≤ m and $(00011)^k$ denotes k repetitions of 00011. For n = 1, 2, 3, 4, let $\mathrm{Min}_n$ be 0, 00, 000, 0001, respectively, and $\mathrm{Max}_n$ be 1, 11, 111, 1110, respectively.
Define $g(c) = \sum_{i=1}^{m} c_i f_i$ for a word $c = c_m c_{m-1} \cdots c_1$ as the weight function based on a basis $\{f_i\}_{i=1}^{m}$. A basis $\{f_i\}_{i=1}^{m}$ with $f_1 = 1$, $f_2 = 1$, $f_3 = 2$, $f_i = g(\mathrm{Max}_{i-1}) - g(\mathrm{Min}_{i-2}) - f_{i-1} + 1$ for 4 ≤ i ≤ m − 1, and $f_m = g(\mathrm{Max}_{m-1}) - g(\mathrm{Min}_{m-1}) + 1$ defines a complete system. With the basis set $\{f_i\}_{i=1}^{m}$, the (2, 1)-SOTA CODEC can be designed by choosing
\[ \gamma_k = f_k, \quad \Theta = g(\mathrm{Min}_{m-1}) + f_m, \quad \alpha_k = g(\mathrm{Max}_{k-1}) + 1, \quad \beta_k = g(\mathrm{Min}_{k-1}) + f_k. \tag{4.7} \]
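To make Alg. 4 and the constants in Eq. (4.7) concrete, the following is a hedged Python sketch of the encoder and decoder for (2, 1)-SOTA codes. It is not the hardware CODEC of Fig. 4.4. Two points are assumptions made for this sketch: the offset is taken as $P = g(\mathrm{Min}_m)$, which the text does not state explicitly, and all function names are illustrative. Words are strings with $d_m$ first, and `f[i]` holds $f_i$ (index 0 is unused).

```python
def prefix(pattern, n):
    """First n symbols of the periodic repetition of pattern (Min_n / Max_n)."""
    return (pattern * (n // len(pattern) + 1))[:n]

def weight(word, f):
    """g(word) = sum of f_i over the positions i where the bit is 1."""
    n = len(word)
    return sum(f[n - idx] for idx, bit in enumerate(word) if bit == "1")

def sota_basis(m):
    """Basis f_1..f_m as defined before Eq. (4.7), for m >= 5."""
    f = [0, 1, 1, 2]
    for i in range(4, m):
        f.append(weight(prefix("11100", i - 1), f)
                 - weight(prefix("00011", i - 2), f) - f[i - 1] + 1)
    f.append(weight(prefix("11100", m - 1), f)
             - weight(prefix("00011", m - 1), f) + 1)
    return f

def encode(v, m, f):
    """Alg. 4 with the constants of Eq. (4.7); P = g(Min_m) is an assumption."""
    g_min = lambda n: weight(prefix("00011", n), f)
    g_max = lambda n: weight(prefix("11100", n), f)
    r = v + g_min(m)                                  # add the offset P
    d = {m: 1 if r >= g_min(m - 1) + f[m] else 0}     # compare with Theta
    r -= d[m] * f[m]
    for k in range(m - 1, 1, -1):
        if r >= g_max(k - 1) + 1:                     # r_{k+1} >= alpha_k
            d[k] = 1
        elif r < g_min(k - 1) + f[k]:                 # r_{k+1} < beta_k
            d[k] = 0
        else:
            d[k] = d[k + 1]
        r -= d[k] * f[k]
    d[1] = r
    return "".join(str(d[k]) for k in range(m, 0, -1))

def decode(word, f):
    """Inverse mapping: weighted sum minus the assumed offset P."""
    return weight(word, f) - weight(prefix("00011", len(word)), f)

basis5 = sota_basis(5)
print(basis5[1:], encode(4, 5, basis5), decode("01110", basis5))
```

For m = 5 the sketch maps the messages 0 through 9 onto the ten codewords of the 5-bit codebook in increasing order of weight, and `decode` recovers the message from the weighted sum.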
To show the correctness of the encoding algorithm in Alg. 4 for (2, 1)-SOTA codes, we need the following lemmas.
Lemma 4.4.1. $f_n = f_{n-2} + f_{n-3}$ for 6 ≤ n ≤ m − 1.
Proof. We prove this property by induction on n from 6 to m − 1. For n = 6, $f_6 = g(\mathrm{Max}_5) - g(\mathrm{Min}_4) - f_5 + 1 = 5 = 3 + 2 = f_4 + f_3$. Suppose that $f_n = f_{n-2} + f_{n-3}$ for n ≤ i (i ≥ 6). When n = i + 1, $f_n = f_{i+1} = g(\mathrm{Max}_i) - g(\mathrm{Min}_{i-1}) - f_i + 1$. It is equivalent to prove that
\[ g(\mathrm{Max}_i) + 1 = g(\mathrm{Min}_{i-1}) + f_i + f_{i-1} + f_{i-2}. \tag{4.8} \]
If i = 5k + 1 (k ≥ 1), $\mathrm{Max}_i = 111 \cdot (00111)^{k-1} \cdot 001$ and $\mathrm{Min}_{i-1} = 000 \cdot (01100)^{k-1} \cdot 011$. Then, in Eq. (4.8), LHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j} + f_{i-5j-1} + f_{i-5j-2}) + f_1 + 1$ and RHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j+1} + f_{i-5j}) + f_2 + f_1$. According to our assumption, $f_{i-5j+1} = f_{i-5j-1} + f_{i-5j-2}$. Since $f_2 = 1$, we have LHS = RHS.
If i = 5k + 2 (k ≥ 1), $\mathrm{Max}_i = 111 \cdot (00111)^{k-1} \cdot 0011$ and $\mathrm{Min}_{i-1} = 000 \cdot (01100)^{k-1} \cdot 0110$. Then, in Eq. (4.8), LHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j} + f_{i-5j-1} + f_{i-5j-2}) + f_2 + f_1 + 1$ and RHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j+1} + f_{i-5j}) + f_3 + f_2$. Since $f_{i-5j+1} = f_{i-5j-1} + f_{i-5j-2}$ and $f_3 = f_1 + 1 = 2$, LHS = RHS.
If i = 5k + 3 (k ≥ 1), $\mathrm{Max}_i = 111 \cdot (00111)^{k-1} \cdot 00111$ and $\mathrm{Min}_{i-1} = 000 \cdot (01100)^{k-1} \cdot 01100$. Then, in Eq. (4.8), LHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j} + f_{i-5j-1} + f_{i-5j-2}) + f_3 + f_2 + f_1 + 1$ and RHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j+1} + f_{i-5j}) + f_4 + f_3$. Since $f_{i-5j+1} = f_{i-5j-1} + f_{i-5j-2}$ and $f_4 = f_2 + f_1 + 1 = 3$, LHS = RHS.
If i = 5k + 4 (k ≥ 1), $\mathrm{Max}_i = 111 \cdot (00111)^{k-1} \cdot 001110$ and $\mathrm{Min}_{i-1} = 000 \cdot (01100)^{k-1} \cdot 011000$. Then, in Eq. (4.8), LHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j} + f_{i-5j-1} + f_{i-5j-2}) + f_4 + f_3 + f_2 + 1$ and RHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j+1} + f_{i-5j}) + f_5 + f_4$. Since $f_{i-5j+1} = f_{i-5j-1} + f_{i-5j-2}$ and $f_5 = f_3 + f_2 + 1 = 4$, LHS = RHS.
If i = 5k + 5 (k ≥ 1), $\mathrm{Max}_i = 111 \cdot (00111)^{k-1} \cdot 0011100$ and $\mathrm{Min}_{i-1} = 000 \cdot (01100)^{k-1} \cdot 0110001$. Then, in Eq. (4.8), LHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j} + f_{i-5j-1} + f_{i-5j-2}) + f_5 + f_4 + f_3 + 1$ and RHS $= f_i + f_{i-1} + f_{i-2} + \sum_{j=1}^{k-1}(f_{i-5j+1} + f_{i-5j}) + f_6 + f_5 + f_1$. Since $f_{i-5j+1} = f_{i-5j-1} + f_{i-5j-2}$ and $f_6 + f_1 = f_4 + f_3 + 1 = 6$, LHS = RHS.
Lemma 4.4.2. $\alpha_{n-1} = \beta_n$ for 3 ≤ n ≤ m − 1.
Proof. For n = 3, $\alpha_2 = g(\mathrm{Max}_1) + 1 = 2$ and $\beta_3 = g(\mathrm{Min}_2) + f_3 = 2$. We have $\alpha_2 = \beta_3$. For n = 5k − 2 (k ≥ 1), $\mathrm{Max}_{n-2} = (00111)^{k-1} \cdot 001$ and $1 \cdot \mathrm{Min}_{n-1} = (10001)^{k-1} \cdot 100$. LHS $= \sum_{j=1}^{k-1}(f_{n-5j+3} + f_{n-5j+2} + f_{n-5j+1}) + f_1 + 1$ and RHS $= \sum_{j=1}^{k-1}(f_{n-5j+5} + f_{n-5j+1}) + f_3$. Since $f_{n-5j+5} = f_{n-5j+3} + f_{n-5j+2}$ and $f_3 = f_1 + 1$, LHS = RHS. For n = 5k − 1, 5k, 5k + 1, 5k + 2 (k ≥ 1), the proof is similar to that of Lemma 4.4.1. Hence, for 3 ≤ n ≤ m − 1, we have $\alpha_{n-1} = \beta_n$.
Lemma 4.4.3. $g(\mathrm{Max}_n) = g(\mathrm{Max}_{n-3}) + f_n + f_{n-1}$ for 4 ≤ n ≤ m − 1.
Proof. For n = 4, LHS $= g(1110) = 6$ and RHS $= g(1) + f_4 + f_3 = 6$. We have LHS = RHS. For n = 5k − 1 (k ≥ 1), $\mathrm{Max}_n = 11 \cdot (10011)^{k-1} \cdot 10$ and $11 \cdot \mathrm{Max}_{n-3} = 11 \cdot (01110)^{k-1} \cdot 01$. LHS $= f_n + f_{n-1} + \sum_{j=1}^{k-1}(f_{n-5j+3} + f_{n-5j} + f_{n-5j-1}) + f_2$ and RHS $= f_n + f_{n-1} + \sum_{j=1}^{k-1}(f_{n-5j+2} + f_{n-5j+1} + f_{n-5j}) + f_1$. Since $f_{n-5j+3} = f_{n-5j+1} + f_{n-5j}$, $f_{n-5j} + f_{n-5j-1} = f_{n-5j+2}$, and $f_1 = f_2$, we have LHS = RHS. For n = 5k, 5k + 1, 5k + 2, 5k + 3 (k ≥ 1), the proof is similar to that of Lemma 4.4.1. Hence, for 4 ≤ n ≤ m − 1, we have $g(\mathrm{Max}_n) = g(\mathrm{Max}_{n-3}) + f_n + f_{n-1}$.
Lemma 4.4.4. $g(\mathrm{Min}_n) = g(\mathrm{Max}_{n-4}) + 1$ for 5 ≤ n ≤ m.
Proof. For n = 5, LHS $= g(00011) = 2$ and RHS $= g(1) + 1 = 2$. We have LHS = RHS. For n = 5k (k ≥ 1), $\mathrm{Min}_n = 00 \cdot (01100)^{k-1} \cdot 011$ and $\mathrm{Max}_{n-4} = 00 \cdot (00111)^{k-1} \cdot 001$. LHS $= \sum_{j=1}^{k-1}(f_{n-5j+2} + f_{n-5j+1}) + f_2 + f_1$ and RHS $= \sum_{j=1}^{k-1}(f_{n-5j+1} + f_{n-5j} + f_{n-5j-1}) + f_1 + 1$. Since $f_{n-5j+2} = f_{n-5j} + f_{n-5j-1}$ and $f_2 = 1$, we have LHS = RHS. For n = 5k + 1, 5k + 2, 5k + 3, 5k + 4 (k ≥ 1), the proof is similar to that of Lemma 4.4.1. Hence, for 5 ≤ n ≤ m, we have $g(\mathrm{Min}_n) = g(\mathrm{Max}_{n-4}) + 1$.
The following theorem shows the correctness of the encoding algorithm for (2, 1)-SOTA codes.
Theorem 4.4.1. The output of the encoding algorithm in Alg. 4 with the constants specified in Eq. (4.7) is a (2, 1)-SOTA codeword.
Proof. According to Lemma 4.3.1, the correctness of the encoding algorithm can be proved by showing that 010, 101, 0000, and 1111 are forbidden patterns.
If $d_k = 1$ and $d_{k-1} = 0$, we have $r_k < \beta_{k-1} = g(\mathrm{Min}_{k-2}) + f_{k-1}$. Hence, $r_{k-1} = r_k < \beta_{k-1} = \alpha_{k-2}$ (Lemma 4.4.2), implying that $d_{k-2} = 0$. Hence, 101 is forbidden.
If $d_k = 0$ and $d_{k-1} = 1$, we have $r_k \ge \alpha_{k-1} = g(\mathrm{Max}_{k-2}) + 1$. Hence, $r_{k-1} = r_k - f_{k-1} \ge \alpha_{k-1} - f_{k-1} = (g(\mathrm{Max}_{k-2}) + 1) - (g(\mathrm{Max}_{k-2}) - g(\mathrm{Min}_{k-3}) - f_{k-2} + 1) = g(\mathrm{Min}_{k-3}) + f_{k-2} = \beta_{k-2}$, implying that $d_{k-2} = 1$. Hence, 010 is forbidden.
If $d_k = d_{k-1} = d_{k-2} = 0$, we have $r_{k-2} = r_{k-1} = r_k \ge g(\mathrm{Min}_k) = g(\mathrm{Max}_{k-4}) + 1 = \alpha_{k-3}$ (Lemma 4.4.4), implying that $d_{k-3} = 1$. Hence, the 0000 pattern is forbidden.
If $d_k = d_{k-1} = d_{k-2} = 1$, we have $r_{k-2} \le g(\mathrm{Max}_k) - f_k - f_{k-1} - f_{k-2} = g(\mathrm{Max}_k) - f_k - f_{k-1} - g(\mathrm{Max}_{k-3}) + \beta_{k-3} - 1 = \beta_{k-3} - 1$ (Lemma 4.4.3), implying that $d_{k-3} = 0$. Hence, the 1111 pattern is forbidden.
4.5 Performance
In this section, we evaluate the performance of our new CACs for RLC-coupled interconnects with respect to worst-case delays, peak noises, and rates. The worst-case delay of a CAC is the largest delay among all wires when the codewords from the CAC are transmitted over the bus. The peak noise of a CAC is the maximum of the overshoots and undershoots, which are normalized to the supply voltage Vdd. The code rate of a CAC over an m-bit bus is defined by $\lfloor \log_2 C_m \rfloor / m$, where $C_m$ is the codebook size. The code rate measures the redundancy of a CAC. A rate k/n implies that n − k additional bits are needed for a k-bit data word to reduce the crosstalk.
The codebook size and code rate of our (2,1)-SOTA codes are summarized in Table 4.3. For an m-bit bus (5 ≤ m ≤ 32), the code rate ranges between 0.41 and 0.60. The best code rate, 0.6, is achieved for m = 5. When m approaches infinity, the asymptotic code rate is given by 0.406, the same as that of OLCs [8].
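The asymptotic rate can be cross-checked from the recursion $C_m = C_{m-2} + C_{m-3}$ of Lemma 4.3.2. By the standard theory of linear recurrences, $C_m$ grows like $\lambda^m$, where λ is the real root of the characteristic equation $x^3 = x + 1$, so the rate tends to $\log_2 \lambda$. The sketch below finds λ by bisection; the function name and tolerance are illustrative choices.

```python
from math import log2

def growth_rate(tol=1e-12):
    """Real root of x**3 = x + 1, located by bisection on [1, 2]."""
    lo, hi = 1.0, 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid ** 3 - mid - 1 > 0:
            hi = mid      # root lies to the left of mid
        else:
            lo = mid      # root lies to the right of mid
    return (lo + hi) / 2

asymptotic_rate = log2(growth_rate())
print(f"lambda = {growth_rate():.6f}, asymptotic rate = {asymptotic_rate:.3f}")
```

The value obtained this way rounds to the 0.406 quoted above.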
All the simulation results in this chapter are obtained from HSPICE based on a 45nm technology with 10 metal layers [49]. We focus on global buses in the top metal layer (layer 10) with the substrate as the ground. The bus parameters are obtained from structure 1 in [50]. All wires are uniformly distributed with a length L = 5 mm, width w = 0.8 µm, spacing s = 0.8 µm, thickness t = 2 µm, and height to ground h = 9.3 µm. The bus parameters (unit-length resistance, inductance, capacitance, and coupling capacitance) are obtained by a 2D extraction tool, Raphael from Synopsys, for on-chip multi-level interconnect structures. We assume RS = 50 Ω and CL = 100 fF for simulations. To show the reduction of capacitive and inductive couplings, we also simulate interconnects without coding. For the same information bits, the scheme without coding uses fewer wires than our CAC scheme. Assuming the scheme without coding uses equal width and spacing, we choose its width and spacing such that it occupies the same area as our CAC scheme.
Table 4.3: Code rates of our (2,1)-SOTA codes for an m-bit bus (m = 5, · · · , 32).
m-bit   # of words   Rate      m-bit   # of words   Rate
5       10           3/5       19      530          9/19
6       14           3/6       20      702          9/20
7       18           4/7       21      930          9/21
8       24           4/8       22      1232         10/22
9       32           5/9       23      1632         10/23
10      42           5/10      24      2162         11/24
11      56           5/11      25      2864         11/25
12      74           6/12      26      3794         11/26
13      98           6/13      27      5026         12/27
14      130          7/14      28      6658         12/28
15      172          7/15      29      8820         13/29
16      228          7/16      30      11684        13/30
17      302          8/17      31      15478        13/31
18      400          8/18      32      20504        14/32
The simulation results of delays and noises are shown in Tables 4.4 and 4.5, respectively. As shown in Table 4.4, our (2,1)-SOTA codes significantly reduce the worst-case delays except for a 3-wire bus. This is because the inductive coupling comes only from the two neighboring wires for a 3-wire bus. For a larger bus, the ringing due to the increasing inductive coupling crosses the threshold multiple times for the scheme without coding, leading to larger delays. For our CAC scheme, the ringing is significantly reduced and the delay is determined by capacitive coupling. For k ≥ 4, the reduction of the worst-case delay is at least about 24% compared with the scheme without coding. With regard to the peak noise, the reduction of our CAC scheme is at least about 37%. Hence, our proposed CAC scheme can reduce both crosstalk delay and noise due to the capacitance and inductance effects.
Table 4.4: Reduction of worst-case delays via our (2,1)-SOTA coding scheme over the no-coding scheme (NC).
NC                     Ours
k    Delay (ps)        n     Delay (ps)     Reduction
3    80.84             5     95.90          -18.63%
4    140.30            7     105.77         24.61%
5    164.39            9     106.02         35.51%
6    164.30            12    112.37         31.61%
7    169.53            14    107.06         36.85%
Table 4.5: Reduction of worst-case noise via our (2,1)-SOTA coding scheme over the no-coding scheme (NC). Noise is normalized to Vdd.
NC                     Ours
k    Noise             n     Noise          Reduction
3    0.63              5     0.38           39.68%
4    0.64              7     0.39           39.06%
5    0.65              9     0.34           47.69%
6    0.65              12    0.41           36.92%
7    0.66              14    0.38           42.42%
4.6 Conclusions
In this chapter, we propose a new family of CACs accounting for both the capacitive and inductive couplings. The capacitive crosstalk is reduced by restricting opposite transitions in adjacent wires, and the inductive coupling is reduced by restricting same-direction transitions in neighboring wires. CODECs based on a revised binary mixed-radix numeral system are also proposed. Simulation results show that our codes can significantly reduce the worst-case delay and peak noise simultaneously. The complexity and delay of our CODECs increase quadratically with the size of the bus.
Chapter 5
Quasi-Cyclic Low-Density
Parity-Check Stabilizer Codes
5.1 Introduction
Quantum computers are more efficient than classical computers for some computa-
tional problems, such as factoring a large number and searching an unknown space
for an element satisfying a known property [11]. However, quantum information,
represented by quantum bits or qubits, suffers greatly from unwanted interactions
with the outside world. Thus, quantum error correction codes (QECCs) are needed
to protect quantum information against noise and decoherence [11].
Many QECCs have been proposed in the literature by importing classical er-
ror correction codes, such as low-density parity-check (LDPC) codes, convolutional
codes, Turbo codes, and polar codes (see, for example, [12–21]). Among them,
QECCs based on LDPC codes (see, for example, [12,13,16,17]) are important, since
they can be decoded by adapting existing iterative decoding algorithms. As classical
LDPC codes have asymptotically good performance for a wide class of noisy chan-
nels when decoded by the belief propagation algorithm [22], well-designed quantum
LDPC codes also show good performance [16, 17, 23]. While most quantum LDPC
codes are based on binary LDPC codes, recently several QECCs based on nonbi-
nary LDPC codes have been proposed in [23] with a much better error-correcting
performance than existing quantum codes over a qubit channel.
Most existing QECCs belong to two related classes. In [64], Gottesman proposed
the theory of stabilizer codes, which allows us to construct QECCs based on classical
error correction codes by satisfying a zero symplectic inner product (SIP) condition
(also called the general stabilizer formalism). A subclass of stabilizer codes, known as
CSS codes [65,66], enables us to construct QECCs by using classical error correction
codes that satisfy the dual-containing condition (referred to as the CSS formalism
sometimes). Since the dual-containing condition is a special case of the zero SIP
condition, CSS codes are a subclass of stabilizer codes. Since the dual-containing
condition is much easier to satisfy than the zero SIP condition, CSS codes have
attracted a lot of attention. However, the error correction capability of CSS codes is
limited [16,17] in comparison to stabilizer codes. For example in a binary quantum
system, to correct one qubit error, a CSS code takes seven qubits to encode one
qubit, while a general stabilizer code needs only five qubits [67]. Most QECCs
mentioned above are CSS codes. Tan et al. [16, 17] proposed several systematic
constructions of binary quasi-cyclic low-density parity-check (QC-LDPC) codes based on
the general stabilizer formalism, and their codes are the first LDPC stabilizer codes
to the best of our knowledge.
Figure 5.1: Classification of stabilizer codes. [Diagram labels: stabilizer codes; QC-LDPC stabilizer codes; CSS codes; CSS QC-LDPC codes.]
Since stabilizer codes based on nonbinary LDPC codes have not been studied,
motivated by the success of adopting nonbinary QC-LDPC codes in CSS codes
in [23], in this chapter we investigate stabilizer codes based on nonbinary QC-LDPC
codes, referred to as QC-LDPC stabilizer codes henceforth, for qubit channels. As
in [16,17], we consider LDPC codes with a quasi-cyclic (QC) structure, which makes
it easier to satisfy the zero SIP condition. The relationship of stabilizer codes, CSS
codes, and QC-LDPC stabilizer codes is shown in Fig. 5.1. Our QC-LDPC stabilizer
codes are a subclass of stabilizer codes, while the CSS QC-LDPC codes, including
those proposed in [23], are a special case of the QC-LDPC stabilizer codes.
Our main contributions are:
• The construction of our QC-LDPC stabilizer codes is reduced to the construction of nonbinary QC-LDPC codes over GF($2^m$) satisfying the zero SIP condition, and the decoding of our QC-LDPC stabilizer codes is based on that of the nonbinary QC-LDPC codes. First, we derive conditions for nonbinary QC-LDPC codes over GF($2^m$) in order to satisfy the zero SIP condition and to eliminate cycles of length four, which usually lead to poor decoding performance by iterative decoding algorithms for LDPC codes.
• We have constructed two QC-LDPC stabilizer codes, and simulation results
show that they outperform their counterparts in [16,17]. This seems to confirm
the observation [23] that QECCs based on nonbinary LDPC codes appear to
achieve better performance than QECCs based on binary LDPC codes.
Our work is different from recent works in [16,17,23]. Our QC-LDPC stabilizer
codes are constructed through nonbinary codes and are decoded by a nonbinary sum-
product algorithm, whereas Tan et al. [16, 17] focus on QC-LDPC stabilizer codes
based on binary LDPC codes and decoded by a binary sum-product algorithm.
Our codes also outperform those in [16, 17]. As mentioned above, the QC-LDPC
codes in [23] are CSS codes, whereas our codes herein are stabilizer codes. The
two stabilizer codes constructed herein have worse performance than those in [23].
However, we emphasize that the CSS codes in [23] build on extensive research on
CSS codes and are the results of various optimizations in [23]. In contrast, our
two codes are the first QC-LDPC stabilizer codes based on nonbinary codes. As
explained above, CSS codes are a subclass of stabilizer codes, and hence stabilizer
codes promise better error performance. Hence, we plan to further improve our
QC-LDPC stabilizer codes in our future work.
5.2 Preliminary
In this section, we present basic concepts and notions of stabilizer codes. More
details on the theory of stabilizer codes can be found in [64].
Quantum noise can be modeled in several ways. Among them, the depolarizing channel is often used to characterize a worst-case channel, where three types of errors, bit flip error X, phase flip error Z, and bit-and-phase flip error Y, occur independently and equally likely on each qubit [11]. For a depolarizing channel with a total flip probability f on each qubit, X, Z, and Y each occur with probability f/3. Since a Y error is equivalent to the combination of an X error and a Z error, the marginal probability of an X (Z) error is given by 2f/3.
Stabilizer codes can be represented by a compact quaternary form with I, X, Y, Z corresponding to $0, 1, \omega, \omega^2$ over GF(4), where ω is a primitive element in GF(4) [68]. It is more convenient to denote stabilizer codes by an expanded parity check matrix over GF(2). For an $[[n, k]]_2$ stabilizer code, the n − k stabilizer generators can be described as the juxtaposition of a pair of (n − k) × n matrices, H = (C|D), where each row in H corresponds to a unique stabilizer generator and each pair of columns corresponds to a qubit [16, 17]. Each “1” entry in C and D corresponds to an X and a Z operator, respectively, and each “0” entry corresponds to an I operator. For a qubit channel, such a matrix of size (n − k) × 2n over GF(2) defines a binary stabilizer code. For example,
\[ H = \begin{pmatrix} I & X & I & X \\ X & I & I & I \\ I & Z & I & Z \end{pmatrix} \]
can be represented by an expanded parity check matrix
\[ H = \left( \begin{array}{cccc|cccc} 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \end{array} \right). \]
A necessary and sufficient condition for a matrix to represent a stabilizer code is given by the following theorem.
Theorem 5.2.1 (Zero symplectic inner product condition [16, 17]). An (n − k) × 2n matrix H = (C|D) is a parity check matrix of a stabilizer code if and only if H satisfies
\[ C D^T + D C^T = 0, \tag{5.1} \]
where T denotes the matrix transpose and 0 denotes an (n − k) × (n − k) zero matrix.
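The expanded representation and the condition in Eq. (5.1) can be checked mechanically. The sketch below converts rows of Pauli labels into the binary pair (C, D) and evaluates $CD^T + DC^T$ over GF(2). The helper names are illustrative, and mapping Y to a 1 in both C and D follows from Y being the combination of an X and a Z error.

```python
def expand(paulis):
    """Map rows of Pauli labels to (C, D): X -> C, Z -> D, Y -> both."""
    C = [[1 if p in "XY" else 0 for p in row] for row in paulis]
    D = [[1 if p in "ZY" else 0 for p in row] for row in paulis]
    return C, D

def symplectic_products(C, D):
    """Return C D^T + D C^T over GF(2) as a list of rows."""
    rows, n = range(len(C)), len(C[0])
    return [[sum(C[i][q] * D[j][q] + D[i][q] * C[j][q] for q in range(n)) % 2
             for j in rows] for i in rows]

# The three-generator example from the text.
C, D = expand(["IXIX", "XIII", "IZIZ"])
sip = symplectic_products(C, D)
print(C, D, sip)

# A pair of generators that fails the condition: X and Z on the same qubit.
C_bad, D_bad = expand(["XI", "ZI"])
sip_bad = symplectic_products(C_bad, D_bad)
```

For the example from the text, every entry of the result is zero, so the three generators satisfy Eq. (5.1); the second pair does not.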
Many existing stabilizer codes are based on the CSS formalism, which makes use of classical dual-containing codes for the design of QECCs. Let $H_C$ and $H_D$ be two parity check matrices corresponding to two classical codes C and D, respectively. If $D^\perp \subset C$ (the dual code $D^\perp$ of D is a subset of C), then $H_C H_D^T = 0$, which is referred to as the dual-containing condition. The following matrix defines a stabilizer code [65, 66]:
\[ H = \left( \begin{array}{c|c} H_C & 0 \\ 0 & H_D \end{array} \right). \]
If $H_C = H_D$, the dual-containing condition reduces to $H_C H_C^T = 0$. Code C (D) is called a weakly self-dual code. A stabilizer matrix is given by
\[ H = \left( \begin{array}{c|c} H_C & 0 \\ 0 & H_C \end{array} \right). \]
It can be easily verified that CSS codes satisfy the zero SIP condition in Eq. (5.1). CSS codes are a special family of stabilizer codes.
Recently, nonbinary LDPC codes have been used for the construction of binary
CSS codes through a ring homomorphism [23]. A ring homomorphism, $A :
\mathrm{GF}(2^m) \to \mathrm{GF}(2)^{m \times m}$, with its images homomorphic to GF($2^m$) by matrix addition
and multiplication operations, is given in [23]. Let $\alpha$ be a primitive element of
GF($2^m$). The minimal polynomial of $\alpha$ is $\pi(x) = \sum_{i=0}^{m-1} \pi_i x^i + x^m$. Such a mapping
is given by $A(\alpha^i) := A(\alpha)^i \in \mathrm{GF}(2)^{m \times m}$, $\forall \alpha^i \in \mathrm{GF}(2^m)$, with $A(0) = 0$ and
\[
A(\alpha) := \begin{pmatrix}
0 & 0 & \cdots & 0 & \pi_0 \\
1 & 0 & \cdots & 0 & \pi_1 \\
0 & 1 & \cdots & 0 & \pi_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & \pi_{m-1}
\end{pmatrix}.
\]
By the ring homomorphism, two nonbinary LDPC codes satisfying the zero SIP
condition can be mapped to a binary stabilizer code. For example, for $C, D \in
\mathrm{GF}(2^m)^{M \times N}$, $C^A, D^A \in \mathrm{GF}(2)^{mM \times mN}$ are obtained by replacing all entries in C and
D with their images under the mapping A. If $CD^T + DC^T = 0$, then $C^A (D^A)^T +
D^A (C^A)^T = 0$. Hence, it is shown that H = (C|D) satisfying Eq. (5.1) over GF($2^m$)
defines a binary stabilizer code with $(C^A|D^A)$.
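The mapping A can be sketched in Python for a small field. The example below assumes GF(8) with the primitive polynomial 1 + x + x^3, which is an illustrative choice and not a field used in the text; the helper names are hypothetical.

```python
def companion(pi):
    """A(alpha) for pi(x) = sum_i pi[i] x^i + x^m, where m = len(pi)."""
    m = len(pi)
    A = [[0] * m for _ in range(m)]
    for j in range(m - 1):
        A[j + 1][j] = 1        # ones on the subdiagonal
    for i in range(m):
        A[i][m - 1] = pi[i]    # last column holds pi_0 .. pi_{m-1}
    return A


def identity(m):
    return [[1 if r == c else 0 for c in range(m)] for r in range(m)]


def matmul2(X, Y):
    """Matrix product over GF(2)."""
    n = len(X)
    return [[sum(X[i][k] & Y[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]


def matadd2(X, Y):
    """Matrix sum over GF(2)."""
    return [[a ^ b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def image_table(pi):
    """Return [A(alpha^0), A(alpha^1), ..., A(alpha^(2^m - 2))]."""
    m = len(pi)
    A = companion(pi)
    powers = [identity(m)]
    for _ in range(2 ** m - 2):
        powers.append(matmul2(powers[-1], A))
    return powers


# Assumed example: GF(8) with pi(x) = 1 + x + x^3, so alpha^3 = 1 + alpha.
PI = [1, 1, 0]
powers = image_table(PI)

# Field relation 1 + alpha = alpha^3 carries over to the matrix images.
sum_matches = matadd2(powers[0], powers[1]) == powers[3]
# alpha^7 = 1, so one more multiplication returns to the identity.
order_matches = matmul2(powers[6], companion(PI)) == identity(3)
```

Because the images are powers of a single matrix, multiplication is preserved by construction; the addition check shows that sums of images are again images.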
Quantum LDPC codes can be decoded by a belief propagation decoding algorithm
similar to that of classical LDPC codes [12, 69]. For CSS codes with a parity
check matrix $H = \left(\begin{array}{c|c} H_C & 0 \\ 0 & H_D \end{array}\right)$, X and Z errors can be corrected by decoding C and
D, respectively. For general stabilizer codes with a parity check matrix H = (C|D),
it has been shown that the decoding is equivalent to a syndrome version of the
sum-product algorithm on the Tanner graph of [C, D], which is obtained by merging
corresponding checks of C and D [16, 17].
5.3 QC-LDPC Stabilizer Codes
In this section, we propose two constructions of QC-LDPC stabilizer codes for a
qubit channel. This is achieved by constructing nonbinary QC-LDPC codes over
finite fields of characteristic two satisfying Eq. (5.1). We choose such fields because the state of a
qubit in most quantum systems is binary and nonbinary codes over GF($2^m$) can be
easily connected to a binary stabilizer code in a qubit channel. Also, a nonbinary
LDPC code satisfying Eq. (5.1) over GF($2^m$) defines a binary stabilizer code, which
can be decoded by a sum-product algorithm for nonbinary LDPC codes. The remaining
task is to find good nonbinary QC-LDPC codes with parity check matrices
satisfying Eq. (5.1) over GF($2^m$).
We focus on nonbinary codes over GF($2^m$) with column weight two only, since
the nonbinary LDPC codes with column weight two over GF($2^m$) are empirically
known as the best performing codes for $2^m \ge 64$ [70]. Several approaches to the
construction of QC-LDPC codes have been proposed based on finite geometry, arrays
and array dispersions, and finite fields [71–73]. The key idea is first constructing a
base matrix over some finite field satisfying a certain constraint, and then replacing
the elements in the base matrix by binary or nonbinary cyclic matrices to obtain
parity check matrices of QC-LDPC codes.
We propose the following method to obtain a nonbinary parity check matrix
over GF($2^m$) via a pair of base matrices over GF(2). We first construct two base
quasi-cyclic parity check matrices $C_b$ and $D_b$ satisfying Eq. (5.1) with column weight
J = 2 and row weight L. Both matrices consist of 2 × L block matrices, which are
shifted identity matrices of size P × P. Then, we use the pair of base parity check
matrices $H_b = (C_b|D_b)$ and replace each one in $C_b$ and $D_b$ with a nonzero element
in GF($2^m$). By solving a set of linear equations over $\mathbb{Z}_{2^m-1}$ that enforce Eq. (5.1), we
obtain two 2P × LP nonbinary parity check matrices C and D, which form a parity
check matrix H = (C|D) over GF($2^m$). The code length is given by LP symbols
and the number of information symbols is approximated by LP − 2P. Hence, the
quantum code rate is lower bounded by $R_Q = 1 - 2/L$.
5.3.1 Base parity check matrix
Our nonbinary quantum codes are obtained from their base matrices. Hence, the performance
of the nonbinary codes is affected by the parameters of the base matrices. There
are three parameters to consider when designing such base matrices: row weight,
minimum distance, and girth. In the error-floor region, a small minimum distance
leads to poor decoding performance. If the regular (J, L) LDPC code is a CSS
code, the minimum distance is upper-bounded by the row weight L due to the
dual-containing condition and sparsity of the parity check matrix [23]. To have a
large minimum distance, the row weight of the parity check matrix should be chosen
large. In the waterfall region, however, the sum-product (SP) decoding performance degrades with
increasing row weight L [74], so the row weight L should not be too large. It is
also known that cycles of girth four in the Tanner graph degrade the SP decoding
performance [23]. The cycles of girth four can be classified into two groups, critical
cycles of girth four and non-critical cycles of girth four [16]. The critical cycles
of girth four are present in both the compact quaternary form and the expanded
form, whereas the non-critical cycles of girth four are present only in the compact form.
For example,
\[
H_1 = \begin{pmatrix} I & X & I & X \\ Z & I & I & I \\ I & X & I & X \end{pmatrix}
\]
contains a critical cycle of girth four, and
\[
H_2 = \begin{pmatrix} I & X & I & X \\ Z & I & I & I \\ I & Z & I & Z \end{pmatrix}
\]
contains only a non-critical cycle of girth four. In our work, we consider
only the cycles of girth four in the expanded parity check matrix H = (C|D), since
our decoding scheme is based on the Tanner graph corresponding to the expanded
parity check matrix. It is desirable to reduce or avoid critical cycles of girth four.
In the following, we first introduce the base matrices used in the quasi-cyclic
structure of our proposed stabilizer codes. Then, we introduce a juxtaposition technique
to construct longer codes.
Definition [Binary cyclic matrices] Let I be a P × P identity matrix. A binary
cyclic matrix $I(1) \in \{0, 1\}^{P \times P}$ is obtained by cyclically shifting each row of I to the
right by one position:
\[
I(1) := \begin{pmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & \cdots & 0
\end{pmatrix}.
\]
We define $I(0) := I$ and $I(i) := I(1)^i$, where i is the offset and 0 < i < P. We
have $I(a)I(b) = I(a + b)$ and $I^T(a) = I(-a)$, with offsets taken modulo P.
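The two identities for shifted identity matrices are easy to confirm numerically. The following is a minimal sketch with hypothetical helper names; P = 15 is chosen only because it is the size used in the later examples.

```python
def shifted_identity(P, a):
    """I(a): row r has its single 1 in column (r + a) mod P."""
    return [[1 if c == (r + a) % P else 0 for c in range(P)] for r in range(P)]


def matmul(X, Y):
    """Plain integer matrix product of two square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(X):
    return [list(col) for col in zip(*X)]


P = 15
# I(a) I(b) = I(a + b), with the offset reduced modulo P.
product_rule = matmul(shifted_identity(P, 7), shifted_identity(P, 11)) == shifted_identity(P, 3)
# I(a)^T = I(-a).
transpose_rule = transpose(shifted_identity(P, 4)) == shifted_identity(P, -4)
```

Both properties follow from the fact that I(a) is the permutation matrix of the cyclic shift by a, which is what makes the block computations in the proofs reduce to arithmetic on offsets.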
To obtain longer codes with different code rates, we juxtapose shorter codes as
follows.
Definition [Juxtaposition of Matrices] For a set of matrices $C_1, C_2, \cdots, C_L$ and
$D_1, D_2, \cdots, D_L$ with the same number of rows, we juxtapose corresponding pairs
horizontally as $H = (C_1 C_2 \cdots C_L | D_1 D_2 \cdots D_L)$.
The following lemma shows that juxtaposition preserves the zero SIP condition.
Lemma 5.3.1 ([16, 17]). Let $H_i = (C_i|D_i)$ for $1 \le i \le L$. If each $H_i$ satisfies the zero
SIP condition in Eq. (5.1), the juxtaposed matrix $H = (C_1 C_2 \cdots C_L | D_1 D_2 \cdots D_L)$
also satisfies the zero SIP condition.
According to Lemma 5.3.1, we first construct shorter codes with a pair of matrices
C and D satisfying the zero SIP condition. In the following, we propose a
construction of binary QC stabilizer codes free of cycles of girth four.
Let H = (C|D), where
\[
C = \begin{pmatrix} I(c_{1,1}) & I(c_{1,2}) \\ I(c_{2,1}) & I(c_{2,2}) \end{pmatrix}
\quad \text{and} \quad
D = \begin{pmatrix} I(d_{1,1}) & I(d_{1,2}) \\ I(d_{2,1}) & I(d_{2,2}) \end{pmatrix}.
\]
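Theorem 5.3.1. A sufficient condition for a binary QC-LDPC stabilizer code with
H = (C|D) to satisfy the zero SIP condition is given by
\[
\begin{cases}
c_{1,1} - d_{1,1} = d_{1,2} - c_{1,2} \\
c_{2,1} - d_{1,1} = d_{2,1} - c_{1,1} \\
c_{2,2} - d_{1,2} = d_{2,2} - c_{1,2}
\end{cases}
\pmod{P}. \tag{5.2}
\]
Proof. Using $I^T(a) = I(-a)$ and $I(a)I(b) = I(a + b)$,
\begin{align*}
CD^T + DC^T
&= \begin{pmatrix} I(c_{1,1}) & I(c_{1,2}) \\ I(c_{2,1}) & I(c_{2,2}) \end{pmatrix}
   \begin{pmatrix} I(-d_{1,1}) & I(-d_{2,1}) \\ I(-d_{1,2}) & I(-d_{2,2}) \end{pmatrix}
 + \begin{pmatrix} I(d_{1,1}) & I(d_{1,2}) \\ I(d_{2,1}) & I(d_{2,2}) \end{pmatrix}
   \begin{pmatrix} I(-c_{1,1}) & I(-c_{2,1}) \\ I(-c_{1,2}) & I(-c_{2,2}) \end{pmatrix} \\
&= \begin{bmatrix}
I(c_{1,1}-d_{1,1}) + I(c_{1,2}-d_{1,2}) + I(d_{1,1}-c_{1,1}) + I(d_{1,2}-c_{1,2}) &
I(c_{1,1}-d_{2,1}) + I(c_{1,2}-d_{2,2}) + I(d_{1,1}-c_{2,1}) + I(d_{1,2}-c_{2,2}) \\
I(c_{2,1}-d_{1,1}) + I(c_{2,2}-d_{1,2}) + I(d_{2,1}-c_{1,1}) + I(d_{2,2}-c_{1,2}) &
I(c_{2,1}-d_{2,1}) + I(c_{2,2}-d_{2,2}) + I(d_{2,1}-c_{2,1}) + I(d_{2,2}-c_{2,2})
\end{bmatrix} \\
&= 0,
\end{align*}
since under Eq. (5.2) the four terms in each block cancel in pairs over GF(2). Hence, H satisfies the zero SIP condition.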
Example: Given parameters J = 2, L = 2, and P = 15, a parity check matrix of
a (2, 2) QC stabilizer code is given by
\[
\left(\begin{array}{cc|cc}
I(7) & I(5) & I(13) & I(1) \\
I(5) & I(7) & I(1) & I(13)
\end{array}\right).
\]
To construct longer codes, we juxtapose additional pairs of codes satisfying the
condition in Eq. (5.1). The problem with juxtaposition is that cycles of girth four
can be introduced if the offset parameters are not carefully chosen. For example, for
P = 15, an expanded parity check matrix of a (2, 4) code satisfying Eq. (5.2) is given by
\[
\left(\begin{array}{cccc|cccc}
I(7) & I(5) & I(8) & I(6) & I(3) & I(9) & I(12) & I(2) \\
I(5) & I(7) & I(6) & I(8) & I(9) & I(3) & I(2) & I(12)
\end{array}\right),
\]
where the four cyclic matrices in the first and third block columns, with offsets 7, 5, 8, and 6,
introduce cycles of girth four since 7 − 5 = 8 − 6 (mod 15).
Let $\mathcal{H} = (h_{j,l})$ denote a matrix containing the offset information of $H = (I(h_{j,l}))$,
where $1 \le j \le J$ and $1 \le l \le L$. The following theorem gives a necessary and
sufficient condition to avoid cycles of girth four for a QC-LDPC code with $\mathcal{H} = (h_{j,l})$.
Theorem 5.3.2 ([75]). A QC-LDPC code with $\mathcal{H} = (h_{j,l})$ has no cycles of girth
four if and only if $h_{j_1,l_1} - h_{j_2,l_1} \ne h_{j_1,l_2} - h_{j_2,l_2} \pmod{P}$ for $1 \le j_1 < j_2 \le J$ and
$1 \le l_1 < l_2 \le L$, where P is the size of cyclic matrices $I(h_{j,l})$.
Example: Given parameters J = 2, L = 4, and P = 15, a parity check matrix of
a (2, 4) QC stabilizer code is given by
\[
\left(\begin{array}{cccc|cccc}
I(7) & I(5) & I(8) & I(5) & I(3) & I(9) & I(12) & I(1) \\
I(5) & I(7) & I(5) & I(8) & I(9) & I(3) & I(1) & I(12)
\end{array}\right),
\]
where no cycle of girth four exists.
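Both conditions operate on offsets only, so they can be checked without building any matrix. The sketch below uses hypothetical function names and applies the test of Theorem 5.3.2 to the 2 × 8 offset matrix of the expanded form (C|D), which is our reading of how the theorem is used for the (2, 4) examples.

```python
def zero_sip_offsets(c, d, P):
    """Check the three congruences of Eq. (5.2) for one 2 x 2 pair of offsets."""
    return ((c[0][0] - d[0][0]) % P == (d[0][1] - c[0][1]) % P
            and (c[1][0] - d[0][0]) % P == (d[1][0] - c[0][0]) % P
            and (c[1][1] - d[0][1]) % P == (d[1][1] - c[0][1]) % P)


def has_four_cycle(h, P):
    """Theorem 5.3.2: a four-cycle exists iff two block columns share the
    same offset difference between some pair of block rows."""
    J, L = len(h), len(h[0])
    for j1 in range(J):
        for j2 in range(j1 + 1, J):
            for l1 in range(L):
                for l2 in range(l1 + 1, L):
                    if (h[j1][l1] - h[j2][l1]) % P == (h[j1][l2] - h[j2][l2]) % P:
                        return True
    return False


P = 15
# Offsets of the (2, 4) example that contains cycles of girth four.
with_cycle = [[7, 5, 8, 6, 3, 9, 12, 2],
              [5, 7, 6, 8, 9, 3, 2, 12]]
# Offsets of the (2, 4) example that is free of such cycles.
cycle_free = [[7, 5, 8, 5, 3, 9, 12, 1],
              [5, 7, 5, 8, 9, 3, 1, 12]]

first_has_cycle = has_four_cycle(with_cycle, P)
second_has_cycle = has_four_cycle(cycle_free, P)

# Each juxtaposed 2 x 2 pair of the cycle-free example satisfies Eq. (5.2).
pair1_ok = zero_sip_offsets([[7, 5], [5, 7]], [[3, 9], [9, 3]], P)
pair2_ok = zero_sip_offsets([[8, 5], [5, 8]], [[12, 1], [1, 12]], P)
```

A search for valid offsets could loop over candidate pairs, keep those for which `zero_sip_offsets` holds, and reject any juxtaposition for which `has_four_cycle` returns True.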
5.3.2 QC-LDPC stabilizer codes with no cycles of girth four
The parity check matrix of nonbinary QC-LDPC codes can be obtained from a pair
of base matrices based on single-weight shifted identity matrices. This is achieved
by replacing the ones in its binary image with nonzero elements in GF($2^m$) such that
the nonbinary matrix H satisfies the zero SIP condition in Eq. (5.1) and defines a
binary stabilizer code.
Let $H_b = (C_b|D_b)$ be a parity check matrix of a (2, 2) binary $[[N, K]]_2$ code,
where $C_b = (I(c_{i,j}))_{2P \times 2P}$ ($D_b = (I(d_{i,j}))_{2P \times 2P}$) for i, j = 1, 2. A sufficient condition
for zero SIP is given in Eq. (5.2).
Let $\alpha$ be a primitive element in GF($2^m$). Suppose each block $I(c_{i,j})$ ($I(d_{i,j})$, respectively)
of $C_b$ ($D_b$, respectively) is replaced with $X_{ij}(c_{i,j}) = \mathrm{diag}(\alpha^{x_{i,j,1}}, \cdots, \alpha^{x_{i,j,P}}) \cdot I(c_{i,j})$
($Y_{ij}(d_{i,j}) = \mathrm{diag}(\alpha^{y_{i,j,1}}, \cdots, \alpha^{y_{i,j,P}}) \cdot I(d_{i,j})$, respectively) for i, j = 1, 2. For
simplicity, we denote $X_{ij}(c_{i,j})$ and $Y_{ij}(d_{i,j})$ as $X_{ij}$ and $Y_{ij}$, respectively, when there
is no ambiguity about the offsets $c_{i,j}$ and $d_{i,j}$. Since each component block matrix
$X_{ij}$ ($Y_{ij}$, respectively) has P nonzeros $\alpha^{x_{i,j,l}}$ ($\alpha^{y_{i,j,l}}$, respectively) over GF($2^m$),
there are a total of 8P unknown exponents $x_{i,j,l}$ and $y_{i,j,l}$ to determine. After the
replacement, the parity check matrix is given by
\[
H = (C|D) = \left(\begin{array}{cc|cc}
X_{11} & X_{12} & Y_{11} & Y_{12} \\
X_{21} & X_{22} & Y_{21} & Y_{22}
\end{array}\right).
\]
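\[
CD^T + DC^T =
\begin{pmatrix}
X_{11}Y_{11}^T + X_{12}Y_{12}^T + Y_{11}X_{11}^T + Y_{12}X_{12}^T &
X_{11}Y_{21}^T + X_{12}Y_{22}^T + Y_{11}X_{21}^T + Y_{12}X_{22}^T \\
X_{21}Y_{11}^T + X_{22}Y_{12}^T + Y_{21}X_{11}^T + Y_{22}X_{12}^T &
X_{21}Y_{21}^T + X_{22}Y_{22}^T + Y_{21}X_{21}^T + Y_{22}X_{22}^T
\end{pmatrix}. \tag{5.3}
\]
\begin{align*}
& C^R (D^R)^T + D^R (C^R)^T \\
&= \begin{pmatrix} X_{11} & X_{12}R^T \\ RX_{21} & RX_{22}R^T \end{pmatrix}
   \begin{pmatrix} Y_{11} & Y_{12}R^T \\ RY_{21} & RY_{22}R^T \end{pmatrix}^T
 + \begin{pmatrix} Y_{11} & Y_{12}R^T \\ RY_{21} & RY_{22}R^T \end{pmatrix}
   \begin{pmatrix} X_{11} & X_{12}R^T \\ RX_{21} & RX_{22}R^T \end{pmatrix}^T \\
&= \begin{pmatrix}
(X_{11}Y_{11}^T + X_{12}Y_{12}^T) + (X_{11}Y_{11}^T + X_{12}Y_{12}^T)^T &
[(X_{11}Y_{21}^T + X_{12}Y_{22}^T) + (X_{21}Y_{11}^T + X_{22}Y_{12}^T)^T] R^T \\
R [(X_{21}Y_{11}^T + X_{22}Y_{12}^T) + (X_{11}Y_{21}^T + X_{12}Y_{22}^T)^T] &
R [(X_{21}Y_{21}^T + X_{22}Y_{22}^T) + (X_{21}Y_{21}^T + X_{22}Y_{22}^T)^T] R^T
\end{pmatrix}. \tag{5.4}
\end{align*}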
The symplectic inner product is shown in Eq. (5.3). Due to the quasi-cyclic structure,
each of the four block matrices in Eq. (5.3) would introduce P linear equations
in the exponents $x_{i,j,l}$ and $y_{i,j,l}$ for i, j = 1, 2, and there are a total of 4P equations.
Since the number of equations is smaller than the number of variables, we can always
find a set of solutions satisfying the zero SIP condition. By picking randomly from
the solutions, we obtain a parity check matrix H = (C|D) over GF($2^m$). Then, we
can use juxtaposition to obtain codes with different rates.
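As an illustration of where the P linear equations per block come from, the top-left block of Eq. (5.3) can be expanded entry by entry. This is a sketch under stated assumptions: Eq. (5.2) holds, the shorthand $s := c_{1,1} - d_{1,1} \pmod{P}$ is introduced here only for illustration, $2s \not\equiv 0 \pmod{P}$ so that the two diagonals below do not coincide, and row indices are taken modulo P.

```latex
% Illustrative notation (not from the source): s := c_{1,1} - d_{1,1} (mod P).
% By the first line of Eq. (5.2), d_{1,2} - c_{1,2} = s as well, so the two
% products below have their nonzero entries at the same positions (r, r + s).
\begin{align*}
\bigl(X_{11}Y_{11}^T\bigr)_{r,\,r+s} &= \alpha^{x_{1,1,r} + y_{1,1,r+s}}, \\
\bigl(Y_{12}X_{12}^T\bigr)_{r,\,r+s} &= \alpha^{y_{1,2,r} + x_{1,2,r+s}}, \\
\text{so cancellation over } \mathrm{GF}(2^m) \text{ requires}\quad
x_{1,1,r} + y_{1,1,r+s} &\equiv y_{1,2,r} + x_{1,2,r+s} \pmod{2^m - 1},
\qquad r = 1, \dots, P.
\end{align*}
```

The remaining two terms of the block, $X_{12}Y_{12}^T$ and $Y_{11}X_{11}^T$, are the transposes of the two products above, so they lead to the same P congruences rather than to new ones.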
5.3.3 QC-LDPC stabilizer codes with rotation
In the following, we use the rotation operation similar to that in [16,17] to increase
the randomness.
Definition [General rotation operation]: A binary square matrix R is called a
general rotational matrix if $R^T = R^{-1}$. We focus only on sparse matrices R, since
a dense matrix could increase the density of the sparse parity check matrix. A permutation
matrix is a special rotational matrix and is used in our work for rotation operations.
The general rotation operation $\Pi$ on a square matrix X is given by
\[
\Pi\{X\} = RX^T, \qquad \Pi^k\{X\} = \Pi\{\Pi^{k-1}\{X\}\}, \quad k = 2, 3, \cdots.
\]
Let
\[
H = \left(\begin{array}{cc|cc}
X_{11} & X_{12} & Y_{11} & Y_{12} \\
X_{21} & X_{22} & Y_{21} & Y_{22}
\end{array}\right)
\]
be a parity check matrix obtained in Sec. 5.3.2.
We apply the rotation operation on H and obtain $H^R$ as follows:
\[
H^R = (C^R|D^R) =
\left(\begin{array}{cc|cc}
X_{11} & (\Pi\{X_{12}\})^T & Y_{11} & (\Pi\{Y_{12}\})^T \\
\Pi\{X_{21}^T\} & \Pi^2\{X_{22}\} & \Pi\{Y_{21}^T\} & \Pi^2\{Y_{22}\}
\end{array}\right)
=
\left(\begin{array}{cc|cc}
X_{11} & X_{12}R^T & Y_{11} & Y_{12}R^T \\
RX_{21} & RX_{22}R^T & RY_{21} & RY_{22}R^T
\end{array}\right).
\]
Then, the symplectic inner product is shown in Eq. (5.4). When
$X_{11}Y_{11}^T + X_{12}Y_{12}^T = 0$, $X_{11}Y_{21}^T + X_{12}Y_{22}^T = 0$,
$X_{21}Y_{11}^T + X_{22}Y_{12}^T = 0$, and $X_{21}Y_{21}^T + X_{22}Y_{22}^T = 0$, $H^R$ satisfies the
zero SIP condition. Since $X_{ij}$ and $Y_{ij}$ are single-weight cyclic matrices, each
of the four sets of equations above introduces P linear equations in the unknown
exponents $x_{i,j,l}$ and $y_{i,j,l}$. We obtain a total of 4P linear equations. Since each
component block matrix of $C^R$ and $D^R$ has P nonzeros, there are a total of 8P
unknown exponents $x_{i,j,l}$ and $y_{i,j,l}$ to determine. Thus, we can always find a set
of solutions satisfying the zero SIP condition. By picking randomly from the set of
solutions, we obtain a parity check matrix $H^R = (C^R|D^R)$ over GF($2^m$). A higher
rate code can be obtained by juxtaposition.
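The block identities used for the rotated matrix follow directly from the definition of the rotation operation. A minimal Python sketch, with hypothetical helper names, a randomly drawn 5 × 5 permutation matrix for R, and a random binary block X, confirms them:

```python
import random


def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    """Plain integer matrix product of two square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rotate(R, X):
    """General rotation operation: Pi{X} = R X^T."""
    return matmul(R, transpose(X))


rng = random.Random(0)   # fixed seed, chosen arbitrarily for repeatability
n = 5
perm = list(range(n))
rng.shuffle(perm)
# Permutation matrix: row r has its single 1 in column perm[r].
R = [[1 if c == perm[r] else 0 for c in range(n)] for r in range(n)]
X = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
I_n = [[1 if r == c else 0 for c in range(n)] for r in range(n)]

# A permutation matrix is a rotational matrix: R^T R = I.
is_rotational = matmul(transpose(R), R) == I_n
# (Pi{X})^T = X R^T, the form of the top-right blocks.
top_right = transpose(rotate(R, X)) == matmul(X, transpose(R))
# Pi{X^T} = R X, the form of the bottom-left blocks.
bottom_left = rotate(R, transpose(X)) == matmul(R, X)
# Pi^2{X} = R X R^T, the form of the bottom-right blocks.
bottom_right = rotate(R, rotate(R, X)) == matmul(matmul(R, X), transpose(R))
```

Because R only permutes rows and columns, the rotated blocks keep the same number of nonzeros as the original blocks, so the sparsity of the parity check matrix is unchanged.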
5.4 Performance Evaluation
In this section, we evaluate our QC-LDPC stabilizer codes in a qubit depolarizing
channel with a total flip probability f, where bit-flip error X, phase-flip error Z,
and bit-and-phase flip error Y occur independently with probability f/3 [11]. Our
binary QC-LDPC stabilizer codes are constructed through nonbinary QC-LDPC
codes satisfying Eq. (5.1). The decoding is based on the nonbinary QC-LDPC codes
with an expanded parity check matrix H = (C|D) over GF($2^m$). Hence, we use a
sum-product algorithm powered by FFT over finite field GF($2^m$) [76] and simulate the
frame error rate (FER) of our codes. In [21], Dutton et al. proposed quantum polar
codes with good performance for the depolarizing channel. However, at medium
length, the codes proposed by Kasai et al. in [23] outperform the quantum polar
codes [21]. Hence, we compare our code with the best CSS QC-LDPC code in [23].
We also include the best binary QC-LDPC stabilizer code in [16,17] for comparison,
since both our codes and the codes in [16, 17] are based on the general stabilizer
formalism.
Using the approach described in Sec. 5.3.2, we first construct a (2, 4) nonbinary
QC-LDPC code (referred to as code 1) over GF($2^8$) without cycles of girth four. It
has a code rate 1/2 and length of 520 symbols, which is equivalent to 2080 qubits.
A parity check matrix of code 1 is given by
\[
H = \left(\begin{array}{cccc|cccc}
X_{11}(1) & X_{12}(64) & X_{13}(7) & X_{14}(58) & Y_{11}(5) & Y_{12}(60) & Y_{13}(13) & Y_{14}(52) \\
X_{21}(64) & X_{22}(1) & X_{23}(58) & X_{24}(7) & Y_{21}(60) & Y_{22}(5) & Y_{23}(52) & Y_{24}(13)
\end{array}\right),
\]
where $X_{i,j}(c_{i,j})$ and $Y_{i,j}(d_{i,j})$ are cyclically shifted matrices over GF($2^8$) with a size of P =
65. Based on the approach described in Sec. 5.3.3, we obtain another (2, 4) nonbinary
QC-LDPC code (referred to as code 2) over GF($2^8$), which has the same offsets,
code rate, and length as code 1. A parity check matrix of code 2 is given by
\[
H^R = \left(\begin{array}{cccc|cccc}
X_{11}(1) & X_{12}(64)R^T & X_{13}(7) & X_{14}(58)R^T & Y_{11}(5) & Y_{12}(60)R^T & Y_{13}(13) & Y_{14}(52)R^T \\
RX_{21}(64) & RX_{22}(1)R^T & RX_{23}(58) & RX_{24}(7)R^T & RY_{21}(60) & RY_{22}(5)R^T & RY_{23}(52) & RY_{24}(13)R^T
\end{array}\right).
\]
The two parity check matrices of codes 1 and 2 with column weight 2, row weight 8, and
size 130 × 520 are plotted in Fig. 5.2, where each dot denotes a nonzero element.
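The structural parameters quoted for code 1 can be reproduced from its offsets alone. The sketch below builds only the binary base pattern (the positions of the nonzeros, not the GF($2^8$) values, which are not listed in the text) and stores each row as a set of column indices; all names are hypothetical.

```python
P = 65
C_OFFSETS = [[1, 64, 7, 58], [64, 1, 58, 7]]
D_OFFSETS = [[5, 60, 13, 52], [60, 5, 52, 13]]


def qc_rows(offsets, P):
    """Rows of a binary QC matrix, each given as a set of column indices.

    Block (j, l) is the shifted identity I(offsets[j][l]), whose row r has
    its single 1 in column (r + offset) mod P of block column l.
    """
    rows = []
    for off_row in offsets:
        for r in range(P):
            rows.append({l * P + (r + a) % P for l, a in enumerate(off_row)})
    return rows


C_rows = qc_rows(C_OFFSETS, P)
D_rows = qc_rows(D_OFFSETS, P)
half = len(C_OFFSETS[0]) * P          # number of columns of C (and of D)
shape = (len(C_rows), 2 * half)       # size of the expanded matrix (C|D)

# Row weights of (C|D): nonzeros in the C part plus those in the D part.
row_weights = {len(c) + len(d) for c, d in zip(C_rows, D_rows)}

# Column weights of (C|D).
col_count = [0] * (2 * half)
for c, d in zip(C_rows, D_rows):
    for col in c:
        col_count[col] += 1
    for col in d:
        col_count[half + col] += 1
col_weights = set(col_count)

# Zero SIP check of the binary pattern: (C D^T + D C^T)_{ij} is the parity
# of the overlaps between row i of C and row j of D, and vice versa.
sip_is_zero = all((len(C_rows[i] & D_rows[j]) + len(D_rows[i] & C_rows[j])) % 2 == 0
                  for i in range(len(C_rows)) for j in range(len(C_rows)))
```

The set-based representation keeps the check fast, since every row has only four nonzeros in each half.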
Figure 5.2: Parity check matrices of (a) Code 1 (b) Code 2.
We compare our codes with the KHIS (Kasai-Hagiwara-Imai-Sakaniwa) codes
proposed in [23], which are the best known CSS QC-LDPC codes. The KHIS code
for comparison has a code rate 1/2 and length of 2624 qubits. We also compare our
codes with code B in [16, 17], which has a code rate 1/2 and length of 2068 qubits.
The FER performance of our codes, code B in [16, 17], and the KHIS code in [23] is
shown in Fig. 5.3. Our code 2 has better FER performance than code 1 up to $10^{-3}$. It
shows that the rotation operation can improve the FER performance by introducing
randomness, which can reduce the number of smallest cycles. Though both our codes
and code B in [16, 17] are for a qubit channel, our codes outperform code B by using a
nonbinary LDPC decoding algorithm. We conclude that QC-LDPC stabilizer codes
via a nonbinary decoding algorithm have better FER performance than the binary
QC-LDPC stabilizer codes via a binary decoding algorithm. The performance of
our code 1 is not as good as that of the KHIS code in [23]. This is because our construction
removes only the cycles of girth four. In contrast, the method for constructing KHIS
codes ensures that cycles of girth up to 2L are eliminated, where L is the row weight.
Figure 5.3: The block error probability of our codes, code B in [16, 17], and KHIS
code in [23]. The plot shows the frame error rate (from $10^{-5}$ to $10^{0}$) versus the total flip probability f (from 0.005 to 0.055) in the depolarizing channel for Code 1 ($R_q$ = 1/2, n = 2080, q = $2^8$), Code 2 ($R_q$ = 1/2, n = 2080, q = $2^8$), Code B ($R_q$ = 1/2, n = 2068, q = 2), and the KHIS code ($R_q$ = 1/2, n = 2624, q = $2^8$).
By introducing randomness, which can reduce the number of girth six cycles, the
performance of our code 2 is better than that of code 1. Our codes do not perform as well
as the KHIS code. We remark that this comparison is somewhat misleading and
to our disadvantage, because the KHIS code in [23] is the result of significant
optimization efforts on CSS codes, whereas our code 1 is obtained without much
optimization. With more work on optimizing our stabilizer codes, it may be possible
to find QC-LDPC stabilizer codes that are better than the CSS QC-LDPC codes, which are
shown to be a special case of the QC-LDPC stabilizer codes.
5.5 Summary
Many quantum error correction codes in the literature are based on the CSS formalism,
which uses classical dual-containing codes as component codes. The drawback
of CSS codes is the restriction on the code structure, which leads to a lower rate
code compared with non-CSS codes. In this work, we focus on the constructions of
quantum codes based on the general stabilizer formalism and propose two constructions
of QC-LDPC stabilizer codes decoded by a nonbinary sum-product algorithm.
Our simulation results show that our nonbinary quantum QC-LDPC codes outperform
their binary counterparts. Though the performance of our codes is not as good
as that of the best CSS QC-LDPC code in [23], our approach may lead to better codes than CSS
QC-LDPC codes by further reducing the number of smallest cycles. We plan to
search for better codes in our future work.
Chapter 6
Efficient Threshold Architectures
with Bounded Fan-Ins for Finite
Field Operations
6.1 Introduction
According to the International Technology Roadmap for Semiconductors (ITRS) [1],
conventional CMOS technology faces great challenges in further scaling. Although
new materials and device structures can sustain CMOS scaling for the
next ten years, it would reach fundamental limits during 2020–2025 [1]. After
that, it would be difficult to operate any MOS-based transistor structure using
classical physics. With smaller feature sizes, higher speeds, and lower power consumption,
some emerging nanotechnology devices such as resonant tunneling diodes
(RTDs), quantum cellular automata (QCA) and single electron transistors (SETs)
are promising candidates to replace the CMOS devices. At the system level, they
have two distinct advantages over their CMOS counterparts. First, they can easily
realize threshold gates (see Fig. 6.2). Since threshold gates can implement complex
Boolean functions with single gates [25], the area of larger systems implemented using
nanotechnology tends to be much smaller. Second, the outputs of the threshold
gates built with nanotechnology are self-latched. This provides a natural way of
pipelining these systems in most signal processing applications.
Several applications dealing with real-valued signals have already been realized
based on threshold gates, such as parallel adders via RTDs [27–29], comparison
via QCAs [77], pattern matching for nanotechnology [78], parity via neural
networks [79], multiplication via neural networks [80–82], and division via neural
networks [83]. However, an important class of signal processing applications, including
error correcting coding and cryptography, which use characteristic-2 fields (denoted
by GF($2^m$)) [30, 84], has yet to be realized using nanotechnology.
The main obstacle to the nanotechnology implementations of applications over
finite field GF($2^m$) is that all arithmetic operations, namely addition, multiplication [30], and
inversion [85–89], require exclusive-ORs (XORs). Unlike most conventional Boolean
primitives such as AND, OR, NOT, NAND and NOR, XOR is not a threshold function
and cannot be realized as a single threshold gate [90]. While two-input XORs
have been the focus in CMOS technology, multi-input XORs are better suited to
finite field applications, which typically employ a large number of XOR operations.
Further, multi-input XORs allow us to better exploit the power of the threshold
gates. Previously proposed threshold logic gate (TLG) implementations of n-input
XORs have a linear ($O(n)$) [91, 92] or sublinear ($O(\sqrt{n})$) [81] number of threshold
gates. However, these TLG implementations of multi-input XORs require threshold
gates with unbounded fan-ins. Bounded fan-in is critical to both reliability and
performance of nanotechnology architectures. The reliability of a threshold gate in
nanotechnology decreases sharply as the fan-in grows [93]. In addition, threshold
gates with a large fan-in tend to have slower switching speeds [94]. Previous theoretical
results on XOR implementation with threshold logic ignored the practically
important fan-in bounds [91, 92], and hence they are not readily applicable when
fan-in is constrained.
One straightforward method of applying fan-in constraint to a multi-input XOR
is to decompose each threshold gate of the XOR in [81, 91, 92] into gates with smaller
fan-ins. Many approaches for decomposing threshold functions or symmetric functions
have been proposed in [95–99]. In [95], the decomposition of an arbitrary
Boolean function of n inputs considers only a fan-in of 2 and has a complexity of
$O(2^n/n)$. In [96–98], a divide and conquer algorithm for fan-in reduction was proposed
for a majority function of n inputs with a complexity of quasi-polynomial in
n, $O(n^{\log n}/B^{\log B})$, where B is the maximum fan-in. In [99], the proposed decomposition
of a majority function of n inputs with a fan-in B reduced the complexity
to quadratic in n, $O(n^2/B)$. For XORs with sublinear complexity in [81], $O(\sqrt{n})$
threshold gates are needed. The total complexity of an XOR with a bounded fan-in
is on the order of $O(n^{2.5})$ using the decomposition in [99].
Alternatively, one could use generic threshold synthesis approaches with arbitrary
fan-ins proposed for $F_{n,m}$ (a class of functions of n inputs with m groups of ones
in their truth table) [100, 101] or symmetric Boolean functions [98], since XORs
are both a class of $F_{n,m}$ functions and symmetric functions. However, we show that
when applied to XORs, these approaches result in much higher complexities than our
approach, and hence are ineffective. For instance, the number of groups of ones in
the truth table of an n-input XOR increases exponentially and the total complexity
via the $F_{n,m}$ approach with a bounded fan-in B is on the order of $O(n 2^n/B)$. The
total number of gates required via the sort-and-search approach in [98] is on the order
of $O(n \log^2 n)$. Among all previous approaches based on decomposition and synthesis,
the complexity of n-input XORs via the sort-and-search approach [98] is much smaller than
that of the other approaches. However, the sort-and-search algorithm in [98] assumed a fan-in
of two. Due to the regular structure of XORs, one could decompose a large
XOR into a tree of smaller XORs with bounded fan-ins. In our work, we present
tree implementations of XORs by using TLG implementations of multi-input XORs
in [79, 91, 92] as primitives. We treat the fan-in as a parameter for the architecture
of multi-input XORs, which satisfies the fan-in requirement by design. Regardless
of the fan-in requirement, the complexity of n-input XORs of our design is linear
in n.
In this chapter we aim to address the class of architectures over finite fields
GF(2m). In particular, we study the multiplication architectures using threshold
gates with a given fan-in bound. The work in this chapter presents two main results.
The first main result of the manuscript is TLG implementation of multi-input XORs
with bounded fan-ins. To leverage the power of multi-input threshold function, we
first generalize the sort-and-search algorithm in [98] from fan-in of two to arbitrary
fan-ins, and propose an architecture of multi-input XORs with finite fan-ins. We use
the XORs in [81,91,92] as primitives and propose two classes of tree implementations
of multi-input XORs with finite fan-ins and compare them with the XORs via the
sort-and-search algorithm.
The other main contribution of our manuscript is multiplication over finite fields.
Using our proposed multi-input XORs, we then develop two efficient threshold ar-
chitectures for multiplication in GF(2m) with a given fan-in bound. Many other
GF(2m) architectures such as those for division and inversion are based on multipli-
cations [102]. Yet, to the best of our knowledge, this is the first work on the imple-
mentation of characteristic-2 multiplication in threshold logic. We investigate two
types of bit-parallel multiplications, the polynomial basis multiplication [103, 104]
and the Massey-Omura (MO) multiplication [105]. We propose efficient implemen-
tations of both of these using multi-input XORs and obtain analytical expressions
for the gate area and the latency of our designs. These are compared with the ar-
chitectures synthesized by approaches in [25, 106]. While the synthesis approach in
[2,Theorem 12.2.1.2] is chosen for its simplicity, the work in [106] is the first compre-
hensive synthesis methodology and provides a tool for general multilevel threshold
logic design. Our results show that our custom-designed multipliers outperform
those synthesized via generic approaches in [25, 106].
The rest of the chapter is organized as follows. In Sec. 6.2, we introduce
threshold logic and show a typical threshold gate implementation using resonant
tunneling diodes (RTDs). Sec. 6.3 generalizes the sort-and-search approach to arbitrary
fan-ins and presents a threshold architecture for multi-input XORs. Sec. 6.4
presents our tree architectures for multi-input XORs and evaluates their performance.
Sec. 6.5 provides the new efficient threshold implementations of polynomial
basis and normal basis multiplications over GF($2^m$). This section also evaluates the
gate area, number of interconnects and latency of these designs and compares them
to prior work. Finally, Sec. 6.6 presents the conclusions of this work.
Figure 6.1: Threshold gate realizing f(x) for n inputs, x1, x2, · · · , xn, with corre-
sponding weights w1, w2, · · · , wn and a threshold T .
6.2 Background
6.2.1 Boolean function
A Boolean function is a function with the mapping $f : \{0, 1\}^n \to \{0, 1\}$, where
n is the number of inputs. Denote the n-input Boolean function by $B_n$. Boolean
functions $B_2$, such as 2-input AND, OR, and XOR, play an important role in CMOS
circuit design. For $x \in \{0, 1\}$, the negation of x is denoted by $\neg x$ or $\bar{x}$. Let $x^1 = x$
and $x^0 = \bar{x}$. For $x, y \in \{0, 1\}$, the logical conjunction $x \wedge y$ is 1 if and only if x = y = 1
and the logical disjunction $x \vee y$ is 1 if and only if x = 1 or y = 1 [107]. For $x =
(x_1, \cdots, x_n)$, define the minterm $m_a$ for $a = (a(1), \cdots, a(n)) \in \{0, 1\}^n$ by
$m_a(x) = x_1^{a(1)} \wedge \cdots \wedge x_n^{a(n)}$. Similarly, define the maxterm $s_a$ for $a = (a(1), \cdots, a(n)) \in \{0, 1\}^n$
by $s_a(x) = x_1^{\neg a(1)} \vee \cdots \vee x_n^{\neg a(n)}$ [107].
An arbitrary n-input Boolean function can be expressed by
\[
f(x) = \bigvee_{a \in f^{-1}(1)} m_a(x) = \bigwedge_{b \in f^{-1}(0)} s_b(x),
\]
where the first and second representations are called disjunctive and
conjunctive normal form (DNF and CNF), respectively [107]. $\{\wedge, \vee, \neg\}$ is called a
complete basis [107]. For Boolean functions, we use $\cdot$, $+$, and an overbar for $\wedge$, $\vee$, $\neg$, respectively,
and omit $\cdot$ when there is no ambiguity.
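The minterm and maxterm definitions translate directly into code. The following Python sketch, with hypothetical function names, builds the DNF and CNF of a function from its truth table and uses a 3-input XOR as the test function.

```python
from itertools import product


def minterm(a, x):
    """m_a(x) is 1 exactly when x equals a, since x_i^1 = x_i and x_i^0 = not x_i."""
    return int(all(xi == ai for xi, ai in zip(x, a)))


def maxterm(a, x):
    """s_a(x) is 0 exactly when x equals a."""
    return int(any(xi != ai for xi, ai in zip(x, a)))


def dnf(f, n):
    """Disjunction of the minterms m_a over all a with f(a) = 1."""
    ones = [a for a in product((0, 1), repeat=n) if f(a)]
    return lambda x: int(any(minterm(a, x) for a in ones))


def cnf(f, n):
    """Conjunction of the maxterms s_b over all b with f(b) = 0."""
    zeros = [b for b in product((0, 1), repeat=n) if not f(b)]
    return lambda x: int(all(maxterm(b, x) for b in zeros))


def xor3(x):
    return x[0] ^ x[1] ^ x[2]


xor3_dnf = dnf(xor3, 3)
xor3_cnf = cnf(xor3, 3)
inputs = list(product((0, 1), repeat=3))
forms_agree = all(xor3_dnf(x) == xor3(x) == xor3_cnf(x) for x in inputs)
num_minterms = sum(xor3(x) for x in inputs)
```

For an n-input XOR half of the $2^n$ inputs evaluate to 1, so the DNF has $2^{n-1}$ minterms, which illustrates why a two-level realization of XOR grows exponentially with n.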
6.2.2 Symmetric function
A Boolean function is said to be symmetric if its output is invariant under any
permutation of its input bits. If f is a symmetric function of n variables, then
\[
f(x_1, x_2, \cdots, x_n) = f(x_{\sigma(1)}, x_{\sigma(2)}, \cdots, x_{\sigma(n)})
\]
for all permutations $\sigma$ of $\{1, \cdots, n\}$. Boolean functions such as n-input
AND, OR, NAND, NOR, and XOR are all symmetric functions. Since any Boolean
function has a disjunctive normal form, a symmetric function can be constructed in
two levels. The first level is to compute all conjunctions (products) of literals. The
second level is to compute the disjunction (sum) of all terms obtained in the first
level.
Symmetric functions can also be represented by partially defined Boolean functions
if the inputs are sorted first. A partially defined Boolean function has a mapping
$f : \{0, 1\}^n \to \{0, 1, ?\}$, where “?” can be either 0 or 1. Let $\langle x'_1, x'_2, \cdots, x'_n \rangle$ be
the sorted sequence of $(x_1, x_2, \cdots, x_n)$. $f_p(x'_1, x'_2, \cdots, x'_n)$ denotes a partially defined
Boolean function with inputs only from the n + 1 ordered binary sequences $(0 \cdots 00)$,
$(0 \cdots 01)$, $(0 \cdots 11)$, $\cdots$, $(1 \cdots 11)$. We refer to $f_p(x'_1, x'_2, \cdots, x'_n)$ as a searching
function of the symmetric function $f(x_1, x_2, \cdots, x_n)$.
The cost of a polynomial $f(x_1, x_2, \cdots, x_n) = \sum_{i=1}^{m} f_i(x_1, x_2, \cdots, x_n)$ is defined
as the sum of the costs of all products $f_i(x_1, x_2, \cdots, x_n)$, each of which has a cost equal
to the number of its literals. For example, $f(x_1, x_2, x_3) = x_1x_2 + x_1x_3 + x_2x_3$ has a
cost of six. A polynomial $f_{\min}$ is a minimal polynomial for f if $f_{\min}$ computes f and
no other polynomial computing f has a smaller cost than $f_{\min}$. An implicant of f is
a product term $f_m$ satisfying $f_m^{-1}(1) \subseteq f^{-1}(\{1, ?\})$ and $f_m^{-1}(1) \not\subseteq f^{-1}(?)$. An implicant
$f_p$ of f is called a prime implicant if no proper sub-term of $f_p$ is an implicant of f.
According to Thm. 1.1 in [107], minimal polynomials for f consist only of prime
implicants. By computing all prime implicants of f, we can obtain the minimal
polynomial for $f(x'_1, \cdots, x'_n)$.
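The idea of sorting first and then evaluating a partially defined function can be sketched in a few lines of Python. The names below are hypothetical, the ascending sort order follows the ordered sequences listed above, and XOR is used as the symmetric function.

```python
from itertools import product


def sorted_sequences(n):
    """The n + 1 ordered inputs (0...00), (0...01), ..., (1...11)."""
    return [tuple([0] * (n - k) + [1] * k) for k in range(n + 1)]


def searching_function(f, n):
    """Partially defined f_p: equals f on sorted inputs and '?' elsewhere."""
    table = {s: f(s) for s in sorted_sequences(n)}
    return lambda xs: table.get(tuple(xs), "?")


def evaluate_by_sorting(f_p, x):
    """Sort the inputs first, then look the result up in f_p."""
    return f_p(sorted(x))


def xor(x):
    return sum(x) % 2


n = 4
xor_p = searching_function(xor, n)
# Sorting followed by the searching function reproduces XOR on all 2^n inputs.
agree = all(evaluate_by_sorting(xor_p, x) == xor(x) for x in product((0, 1), repeat=n))
# An unsorted input is outside the domain on which f_p is defined.
undefined_value = xor_p((1, 0, 0, 0))
```

Only n + 1 input patterns need to be specified for $f_p$, compared with $2^n$ for f itself, which is what the don't-care entries exploit when prime implicants are computed.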
6.2.3 Threshold logic
A threshold function f with n inputs (n ≥ 1), x1, x2, · · · , xn, is a Boolean function
whose output is determined by [25]
f(x1, x2, · · · , xn) =


1 if
∑n
i=1wixi ≥ T
0 otherwise,
(6.1)
where wi is called the weight of xi and T the threshold. In this chapter we de-
note this threshold function as [x1, x2, · · · , xn;w1, w2, · · · , wn;T ], and for simplic-
ity sometimes denote it as f = [x;w;T ], where x = (x1, x2, · · · , xn) and w =
(w1, w2, · · · , wn).
The Boolean functions NOT and n-input AND, OR, NAND, and NOR each correspond to a single threshold function: [x;−1; 0] is the NOT gate; [x; 1, 1, · · · , 1; n] and [x; 1, 1, · · · , 1; 1] are n-input AND and OR, respectively; and [x;−1,−1, · · · ,−1; 1−n] and [x;−1,−1, · · · ,−1; 0] are n-input NAND and NOR, respectively. In contrast, an XOR cannot be expressed as a single threshold function.
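The threshold forms of the basic gates can be checked by brute force. The following is a minimal sketch, not part of the original design flow: the helper names `threshold`, `basic_gates_match`, and `xor2_is_threshold` are hypothetical, and the sketch assumes NAND uses threshold 1 − n and NOR uses threshold 0 with all weights −1. The XOR search only covers small integer weights, so it illustrates rather than proves the non-existence claim.

```python
from itertools import product

def threshold(x, w, T):
    """Evaluate the threshold function [x; w; T] of (6.1) on a 0/1 input tuple."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= T)

def basic_gates_match(n):
    """Check the unit-weight forms of AND, OR, NAND, NOR on all 2^n inputs."""
    for x in product((0, 1), repeat=n):
        ones = sum(x)
        pairs = [
            (threshold(x, [1] * n, n), int(ones == n)),       # AND
            (threshold(x, [1] * n, 1), int(ones >= 1)),       # OR
            (threshold(x, [-1] * n, 1 - n), int(ones < n)),   # NAND
            (threshold(x, [-1] * n, 0), int(ones == 0)),      # NOR
        ]
        if any(got != want for got, want in pairs):
            return False
    return True

def xor2_is_threshold(bound=4):
    """Search integer weights and thresholds in [-bound, bound] for a XOR b."""
    rng = range(-bound, bound + 1)
    for w1, w2, T in product(rng, repeat=3):
        if all(threshold((a, b), (w1, w2), T) == (a ^ b)
               for a, b in product((0, 1), repeat=2)):
            return True
    return False
```

Calling `basic_gates_match(n)` for small n confirms the four forms, while `xor2_is_threshold()` finds no single gate for the two-input XOR within the searched range.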
Certain threshold functions are of particular interest. An n-input threshold function with unit weights and a threshold $\lfloor n/2 \rfloor + 1$ is called a majority function. A threshold function with all unit weights but an arbitrary threshold is called a generalized majority function. Henceforth we denote a generalized n-input majority gate with a threshold $k$ by $t^n_k$.
The physical entity realizing a threshold function is called a threshold gate.
Fig. 6.1 shows a threshold gate realizing (6.1).
6.2.4 RTD Implementation of TG
RTD is a promising (see, e.g., [27]) nanotechnology to replace CMOS. Hence, in
this work we focus on RTD implementations of threshold gates. RTD is a diode
with resonant tunneling structure. It has a negative differential resistance, i.e., if
one increases voltage across it, the current through it initially increases and then
drops down to zero again after reaching a certain peak. If two RTDs are tied in
series and voltage across them is swept from low to high, then at the end, the
RTD with the higher peak current bears all the applied voltage. The peak current
depends on the area of the RTD. By replacing each of these RTDs by multiple RTDs
in parallel, and selectively adding them into the circuit with input variables, one
can change the effective area of the top and bottom RTDs. The voltage at the
junction of the two sets of RTDs is thus decided by the comparison of the two sets
of areas. This structure, in fact, implements a threshold gate. A typical threshold
gate built with RTDs is shown in Fig. 6.2, where a load RTD and a driver RTD
are needed for every output. Let a Boolean function of n variables be realized as a network of k threshold functions $f_i(\mathbf{x}) = [\mathbf{x}^i; \mathbf{w}^i; T_i]$ with $\mathbf{x}^i = (x_{i1}, \cdots, x_{in_i})$ and $\mathbf{w}^i = (w_{i1}, \cdots, w_{in_i})$ for $i = 1, \cdots, k$, where $n_i$ denotes the number of variables involved in the i-th threshold function. The total gate area of the implementation, composed of areas of all the RTD gates, including the load and the driver RTDs, is given by

$$A(n) = 2k + \sum_{i=1}^{k} \Big( \sum_{j=1}^{n_i} |w_{ij}| + |T_i| \Big).$$

Figure 6.2: RTD implementation of a threshold gate computing $f = a\bar{b} + \bar{b}c + ac = [a, b, c; 1, -1, 1; 1]$. The numerical value next to each RTD is its area.
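As a quick illustration of the area formula, the sketch below evaluates A(n) for a list of gates and applies it to the single gate [a, b, c; 1, −1, 1; 1]. The function names and the `FIG_6_2_GATE` constant are hypothetical; the area value is what the formula yields and is not read off the figure.

```python
from itertools import product

def threshold(x, w, T):
    """Evaluate the threshold function [x; w; T]."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= T)

def gate_area(w, T):
    """Area of one gate: load and driver RTDs (2) plus |weights| and |T|."""
    return 2 + sum(abs(wi) for wi in w) + abs(T)

def network_area(gates):
    """A = 2k + sum_i (sum_j |w_ij| + |T_i|) for a list of (weights, T) pairs."""
    return sum(gate_area(w, T) for w, T in gates)

# The gate [a, b, c; 1, -1, 1; 1] as a (weights, threshold) pair.
FIG_6_2_GATE = ((1, -1, 1), 1)

def fig_6_2_function(a, b, c):
    """Boolean form a*(not b) + (not b)*c + a*c, used to cross-check the gate."""
    return int((a and not b) or ((not b) and c) or (a and c))
```

Under this formula the single gate has area 2 + 3 + 1 = 6, and the gate output agrees with `fig_6_2_function` on all eight inputs.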
The fan-in of a threshold gate in RTD nanotechnology needs to be bounded for both reliability and performance. First, the reliability of an RTD threshold gate decreases sharply with the fan-in due to noise, fluctuation of the supply voltage, and manufacturing variations [108]. Second, the switching speed of an RTD implementation of a threshold gate depends on the ratio of load to driver peak currents [94]: the closer the ratio is to one, the slower the RTDs switch. Since a gate with more inputs is more likely to have a ratio close to one than a small gate, this also suggests that the fan-in of an RTD threshold gate be bounded. A maximum fan-in of seven inputs was suggested in [27] for RTDs.
6.3 XOR via Sort-and-Search
In this section, we first propose a sort-and-search algorithm with n-input (n ≥ 2) threshold gates for implementing any symmetric function. Then, we show an implementation of XOR via the proposed algorithm.

Figure 6.3: Computing a symmetric function via a sort-and-search algorithm [98].
6.3.1 Sort-and-search
Any Boolean function has a two-level implementation. The first level is to compute
all product terms and the second level to compute the disjunction of all product
terms. For symmetric functions, the evaluation is reduced to comparing the sum of
the binary input variables with some constants [98]. A sort-and-search algorithm
was proposed in [98] for realizing symmetric functions. It first sorts the inputs and
then detects the position in the sorted list switching from zero to one. This sort-
and-search algorithm contains two blocks: a sorting block and a searching block as
shown in Fig. 6.3. The searching block can be easily implemented as a tree of gates
as in [98]. The sorting network has a more complex implementation. Many efficient sorting networks exist in the literature, such as Batcher's odd-even sort [31] and the bitonic merge sort [109], which use a 2-sorter as the basic block. Binary sorters can be easily implemented in threshold logic. In [98], a 2-by-2 comparator (2-sorter) was implemented by two threshold gates as shown in Fig. 6.4(a). We use the Knuth diagram in [110] for easy representation of the sorting networks, where switching elements are denoted by connections on a set of wires. A symbol for a two-sorter is shown in Fig. 6.4(b).
Figure 6.4: Sorters implemented in threshold logic (a) 2-sorter threshold implemen-
tation; (b) 2-sorter symbol.
Since XOR is a special symmetric function, it can be implemented via the sort-
and-search algorithm in [98]. For instance, a 4-input XOR is shown in Fig. 6.5(a).
The corresponding threshold logic implementation is shown in Fig. 6.5(b) with 2-input threshold gates as the basic blocks. The weights and thresholds for all 2-sorters are omitted for simplicity and can be obtained by reviewing Fig. 6.4. The sorting network is simply a 4-input sorting network, where the inputs and outputs are denoted by $(x_1, x_2, x_3, x_4)$ and $\langle x'_1, x'_2, x'_3, x'_4 \rangle$, respectively. For the searching network, a truth table with sorted inputs is shown in Table 6.1, where the sorted list $x'_1, x'_2, x'_3, x'_4$ has only five possible cases and the output y is true only if the inputs have an odd number of 1's. The searching network is composed of two ANDs and one OR as shown in Fig. 6.5(a).
Figure 6.5: A 4-input XOR implemented via the sort-and-search algorithm in [98].

Table 6.1: Truth table for the searching network of a 4-input parity function.
x'1  x'2  x'3  x'4  |  y
 0    0    0    0   |  0
 0    0    0    1   |  1
 0    0    1    1   |  0
 0    1    1    1   |  1
 1    1    1    1   |  0

Figure 6.6: Sorters implemented in threshold logic (a) n-sorter threshold implementation; (b) n-sorter symbol.
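The 4-input case can be sketched end to end in a few lines. This is an illustrative model under stated assumptions, not the circuit of Fig. 6.5: the 2-sorter is assumed to be an AND gate [a, b; 1, 1; 2] for the minimum and an OR gate [a, b; 1, 1; 1] for the maximum, the wiring `NETWORK_4` is a standard 5-comparator network chosen for the sketch, and the searching stage is assumed to compute (NOT x'1 AND x'2) OR (NOT x'3 AND x'4), which reproduces Table 6.1.

```python
from itertools import product

def threshold(x, w, T):
    """Evaluate the threshold function [x; w; T]."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= T)

def two_sorter(a, b):
    """Binary 2-sorter from two threshold gates: (min, max) = (AND, OR)."""
    return threshold((a, b), (1, 1), 2), threshold((a, b), (1, 1), 1)

# A standard 5-comparator sorting network on 4 wires (ascending order).
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def sort4(x):
    """Sort four bits by applying the comparators in order."""
    wires = list(x)
    for i, j in NETWORK_4:
        wires[i], wires[j] = two_sorter(wires[i], wires[j])
    return tuple(wires)

def search4(s):
    """Searching stage for Table 6.1: two AND-type gates and one OR gate."""
    g1 = threshold((s[0], s[1]), (-1, 1), 1)   # NOT x'1 AND x'2
    g2 = threshold((s[2], s[3]), (-1, 1), 1)   # NOT x'3 AND x'4
    return threshold((g1, g2), (1, 1), 1)      # OR of the two terms

def xor4(x):
    """4-input XOR as sort followed by search."""
    return search4(sort4(x))
```

Iterating `xor4` over all 16 inputs gives the parity of the input, and `search4` applied to the five sorted sequences reproduces the y column of Table 6.1.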
6.3.2 Generalized sort-and-search
Any symmetric function can be realized by the sort-and-search algorithm in [98]. However, that algorithm is based on gates with a fan-in of two. In this work, we generalize it to arbitrary fan-ins. Similar to the basic block, the 2-sorter, used in [98], we use n-sorters as the basic blocks. The threshold logic gate (TLG) implementation of an n-sorter is shown in Fig. 6.6(a), where n threshold gates are required. A symbol for an n-sorter is shown in Fig. 6.6(b). As shown in Fig. 6.6(a), the number of gates of an n-sorter scales linearly with the number of inputs n. For practical concerns, such as reliability, some limit on the fan-in of the basic sorter is assumed. Many sorting networks with n-sorters as building blocks have been proposed in the literature. The multiway merge sort in [111] and the multiway bitonic sort in [109] use n-sorters in part of the sorting network and require 2-sorters in other parts. Sorting networks with n-sorters (n ≥ 2) as the basic blocks were proposed in [112, 113]. It has been shown that using larger sorters can greatly reduce the number of gates. Our generalization uses the multiway merge sort algorithm we proposed in [114].
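A binary n-sorter built from n threshold gates can be modelled as follows. This is a sketch under an assumption that Fig. 6.6(a) is not reproduced here to confirm: output x'_i is taken to be the generalized majority gate with unit weights and threshold n + 1 − i, so that x'_i is 1 exactly when at least n + 1 − i inputs are 1. The name `n_sorter` is hypothetical.

```python
from itertools import product

def threshold(x, w, T):
    """Evaluate the threshold function [x; w; T]."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= T)

def n_sorter(x):
    """Sort n bits in ascending order with n unit-weight threshold gates.

    Assumed realization: output x'_i (i = 1..n) uses threshold n + 1 - i.
    """
    n = len(x)
    return tuple(threshold(x, [1] * n, n + 1 - i) for i in range(1, n + 1))
```

Each output gate sees all n inputs, so the gate count grows linearly with n while the fan-in of every gate equals n.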
The searching network implements a partially defined function with sorted binary inputs. As explained in Sec. 6.3.1, the truth table for an n-input partially defined function contains n + 1 output entries corresponding to n + 1 sorted binary sequences. We denote the (n + 1)-entry output by $v_f = (v_f(0), v_f(1), \cdots, v_f(n))$. Suppose there are k groups of 1's in $v_f$, denoted by pairs $(b_j, e_j)$ for $1 \le j \le k$, where $b_j$ and $e_j$ are the beginning and ending positions of the j-th group with $0 \le b_j \le e_j \le n$ and $e_{j-1} + 1 < b_j$. We assume $x'_0 = 0$ and $x'_{n+1} = 1$ to deal with groups of 1's with boundaries on the leftmost and rightmost positions of $v_f$. We have the following lemma.

Lemma 6.3.1. For a partially defined Boolean function $f_p$ of sorted binary inputs $(x'_1, \cdots, x'_n)$, suppose there are k groups of 1's in the (n+1)-entry output $v_f$, denoted by pairs $(b_j, e_j)$ ($1 \le j \le k$). Then, the minimal polynomial is given by $f_{\min} = \sum_{j=1}^{k} \bar{x}'_{n-e_j} x'_{n+1-b_j}$, assuming $x'_0 = 0$ and $x'_{n+1} = 1$.
Proof. We first prove that $\sum_{j=1}^{k} \bar{x}'_{n-e_j} x'_{n+1-b_j}$ is a polynomial for $f_p$. Since $m_j = \bar{x}'_{n-e_j} x'_{n+1-b_j}$ outputs 1 for all inputs corresponding to the j-th group of 1's, $m_j$ is an implicant of $f_p$. For each group of 1's, there is an implicant $m_j$ for $f_p$. Hence, $\sum_{j=1}^{k} m_j$ is a polynomial for $f_p$.

Then we show that $\sum_{j=1}^{k} \bar{x}'_{n-e_j} x'_{n+1-b_j}$ is the minimal polynomial of $f_p$. It is equivalent to prove that all implicants $m_j$ are prime implicants. For $b_j = 0$ or $e_j = n$, the implicant $m_j$ contains only one literal, since $x'_{n+1} = 1$ and $\bar{x}'_0 = 1$, and is already a prime implicant. For $b_j \ne 0$ and $e_j \ne n$, each implicant $m_j = \bar{x}'_{n-e_j} x'_{n+1-b_j}$ has two proper sub-terms, $m_{j1} = \bar{x}'_{n-e_j}$ and $m_{j2} = x'_{n+1-b_j}$. Since $m_{j1}^{-1}(1)$ contains all ordered sequences with $x'_{n-e_j} = 0$, at least one input is not in $f_p^{-1}(\{1,?\})$. Hence, $m_{j1}^{-1}(1) \not\subseteq f_p^{-1}(\{1,?\})$ for $1 \le j \le k$. Similarly, $m_{j2}^{-1}(1) \not\subseteq f_p^{-1}(\{1,?\})$ for $1 \le j \le k$. Thus, all implicants $m_j$ are prime implicants of $f_p$ and $\sum_{j=1}^{k} m_j$ is the minimal polynomial of $f_p$.
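Lemma 6.3.1 can be exercised numerically. The sketch below is an illustration with hypothetical helper names; it assumes that $v_f(s)$ is the output for the sorted input containing s ones, and that the first literal of each product term is complemented, i.e. the j-th term is NOT x'_{n−e_j} AND x'_{n+1−b_j}. Entries of `vf` are restricted to 0 and 1 here (no "?" entries).

```python
from itertools import product

def sorted_input(n, ones):
    """Ordered sequence (x'_1..x'_n) with the given number of trailing 1's."""
    return (0,) * (n - ones) + (1,) * ones

def groups_of_ones(vf):
    """Return the (b_j, e_j) pairs of maximal runs of 1's in v_f."""
    groups, start = [], None
    for pos, v in enumerate(vf):
        if v == 1 and start is None:
            start = pos
        if v != 1 and start is not None:
            groups.append((start, pos - 1))
            start = None
    if start is not None:
        groups.append((start, len(vf) - 1))
    return groups

def f_min(vf, xs):
    """Evaluate sum_j NOT(x'_{n-e_j}) AND x'_{n+1-b_j}, x'_0 = 0, x'_{n+1} = 1."""
    n = len(xs)
    x = (0,) + tuple(xs) + (1,)      # x[0] is x'_0 and x[n+1] is x'_{n+1}
    return int(any((1 - x[n - e]) and x[n + 1 - b]
                   for b, e in groups_of_ones(vf)))

def cost(vf):
    """Literal count of f_min; the constants x'_0 and x'_{n+1} are not counted."""
    n = len(vf) - 1
    return sum((e != n) + (b != 0) for b, e in groups_of_ones(vf))
```

For the 4-input parity function, `vf = (0, 1, 0, 1, 0)` yields the groups (1, 1) and (3, 3), and `f_min` reproduces `vf` on the five sorted inputs.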
According to Lemma 6.3.1, the searching network implements the minimal polynomial $f_{\min} = \sum_{j=1}^{k} \bar{x}'_{n-e_j} x'_{n+1-b_j}$. If the fan-in is not constrained, the searching network is simply a two-level tree, with the first level computing all prime implicants of $f_{\min}$ and the second level combining all terms with a k-input OR gate. If the fan-in is no more than B, the searching network can be decomposed as a B-ary tree of at most $\lceil \log_B (2\lfloor n/2 \rfloor + 1) \rceil$ levels. The following approaches are used for the decomposition.
If B is even, we apply the following approach.
1. Starting from j = k, B/2 adjacent implicants $m_j$ are grouped to form a B-input threshold gate with weight 1 for $x'_{n+1-b_j}$, −1 for $x'_{n-e_j}$, and a threshold of 1. If fewer than B/2 implicants are left for small j, combine the remaining implicants into a smaller threshold gate using the same rule for assigning weights and threshold.
2. Since at most one output from the first level is true, outputs from the first level are combined by a B-ary tree of OR gates.
If B is odd, we apply the following approach.
1. Starting from j = k, B adjacent literals in the $m_j$'s are grouped to form a B-input threshold gate with alternating 1 and −1 as weights, starting from the largest literal in the group, and a threshold of 1. If fewer than B literals are left for small j, combine the remaining literals into a smaller threshold gate using the same rule for assigning weights and threshold.
2. It can be easily shown that outputs from the first level are still an ordered sequence. Repeat step 1 until only one gate is needed for all inputs from the last level.
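The even-B decomposition can be sketched for the XOR searching function. This is a simplified model, not the exact network of the chapter: the function names are hypothetical, the implicants are assumed to be NOT x'_{n−s} AND x'_{n+1−s} for odd s, and the constant x'_0 = 0 is fed to the gate as an ordinary input rather than being removed from it. Because the inputs are sorted, each pair contributes x'_pos − x'_neg ∈ {0, 1} to the weighted sum, so one gate with threshold 1 covers B/2 implicants.

```python
def threshold(x, w, T):
    """Evaluate the threshold function [x; w; T]."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= T)

def xor_implicants(n):
    """Index pairs (neg, pos) of m_j = NOT x'_neg AND x'_pos, ordered by j."""
    return [(n - s, n + 1 - s) for s in range(1, n + 1, 2)]

def first_level(xs, implicants, B):
    """Group B/2 adjacent implicants per gate, starting from j = k (even B)."""
    n = len(xs)
    x = (0,) + tuple(xs) + (1,)          # pad with x'_0 = 0 and x'_{n+1} = 1
    per_gate = B // 2
    outputs = []
    for end in range(len(implicants), 0, -per_gate):
        chunk = implicants[max(0, end - per_gate):end]
        inputs, weights = [], []
        for neg, pos in chunk:
            inputs += [x[pos], x[neg]]
            weights += [1, -1]           # +1 for x'_pos, -1 for x'_neg
        outputs.append(threshold(inputs, weights, 1))
    return outputs

def search(xs, B):
    """First level followed by a B-ary tree of OR gates."""
    outs = first_level(xs, xor_implicants(len(xs)), B)
    while len(outs) > 1:
        outs = [threshold(outs[i:i + B], [1] * len(outs[i:i + B]), 1)
                for i in range(0, len(outs), B)]
    return outs[0]
```

For n = 9 and B = 4 the first level has three gates (two with four inputs and one with two), and `search` returns the parity of the number of ones for each of the ten sorted inputs.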
The generalized sort-and-search algorithm proposed above works for any symmetric function. Since the XOR function is symmetric, we can implement XORs via the generalized sort-and-search algorithm. For an n-input XOR, we first sort the inputs $x_1, x_2, \cdots, x_n$ and obtain $x'_1 \le x'_2 \le \cdots \le x'_n$. According to Lemma 6.3.1, the searching function is $y = x'_1 + \bar{x}'_2 x'_3 + \bar{x}'_4 x'_5 + \cdots + \bar{x}'_{n-1} x'_n$ for odd n, and $y = \bar{x}'_1 x'_2 + \bar{x}'_3 x'_4 + \cdots + \bar{x}'_{n-1} x'_n$ for even n. For a 9-input XOR, the partially defined truth table of a searching network y with ordered binary inputs $(x'_1, x'_2, \cdots, x'_9)$ is shown in Table 6.2. According to Lemma 6.3.1, $y = x'_1 + \bar{x}'_2 x'_3 + \bar{x}'_4 x'_5 + \bar{x}'_6 x'_7 + \bar{x}'_8 x'_9$. The whole network for the 9-input XOR is shown in Fig. 6.7, and consists of 33 threshold gates in 6 levels. The weights and thresholds of 2-sorters and 3-sorters in Fig. 6.7 are omitted for simplicity, and can be obtained by reviewing Fig. 6.6.

Table 6.2: A searching function y of ordered binary sequences $(x'_1, x'_2, \cdots, x'_9)$.
x'1  x'2  x'3  x'4  x'5  x'6  x'7  x'8  x'9  |  y
 0    0    0    0    0    0    0    0    0   |  0
 0    0    0    0    0    0    0    0    1   |  1
 0    0    0    0    0    0    0    1    1   |  0
 0    0    0    0    0    0    1    1    1   |  1
 0    0    0    0    0    1    1    1    1   |  0
 0    0    0    0    1    1    1    1    1   |  1
 0    0    0    1    1    1    1    1    1   |  0
 0    0    1    1    1    1    1    1    1   |  1
 0    1    1    1    1    1    1    1    1   |  0
 1    1    1    1    1    1    1    1    1   |  1

Figure 6.7: A 9-input XOR implemented via the generalized sort-and-search algorithm.
6.3.3 Analysis of gate area
In the following, we first derive the total number of gates. For $n = B^p$, the latency is given by

$$L_{\mathrm{sort}}(B^p) = p + \lceil B/2 \rceil \times \frac{p(p-1)}{2}. \qquad (6.2)$$

The total number of gates for the sorting network is given by

$$G_{\mathrm{sort}}(B^p) = B^p \cdot L_{\mathrm{sort}}(B^p) - G(B^p), \qquad (6.3)$$

where

$$G(B^p) = (p-1)B^{p-2}\,\frac{B^2+6B-5}{4} + \frac{\big((p-2)B^{p-1} - (p-1)B^{p-2} + 1\big)(B+5)}{4(B-1)} + \frac{(p-1)(p-2)}{2}\,B^{p-1}$$

for $B \ne 2$ and $G(B^p) = (p^2 - p + 4)2^{p-1} - 2$ for $B = 2$. The total number of gates for the searching network is given by

$$G_{\mathrm{search}}(B^p) = \frac{B^p - 1}{B - 1}. \qquad (6.4)$$

Hence, the total number of gates for implementing an n-input XOR is given by

$$G_{\mathrm{s\&s}}(n) = \frac{n-1}{B-1} + n \cdot L_{\mathrm{sort}}(n) - G(B^p). \qquad (6.5)$$

For a fixed fan-in bound B, the area of each gate is bounded by a constant. Hence, the total gate area of an n-input XOR via the generalized sort-and-search approach is on the order of $O(n \log^2 n)$.
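The latency and searching-network expressions are easy to evaluate for concrete parameters. The sketch below covers only (6.2), (6.4), and the bound on the number of searching levels from Sec. 6.3.2; the correction term G(B^p) is deliberately left out, and the function names are hypothetical.

```python
def sort_latency(B, p):
    """L_sort(B^p) = p + ceil(B/2) * p(p-1)/2, as in (6.2)."""
    return p + ((B + 1) // 2) * (p * (p - 1) // 2)

def search_levels(n, B):
    """Smallest h with B**h >= 2*floor(n/2) + 1 (integer form of the ceil-log)."""
    target, h, reach = 2 * (n // 2) + 1, 0, 1
    while reach < target:
        reach *= B
        h += 1
    return h

def search_gates(B, p):
    """G_search(B^p) = (B^p - 1) / (B - 1), as in (6.4)."""
    return (B ** p - 1) // (B - 1)
```

For B = 3 and p = 2 (n = 9), these give 4 sorting levels and at most 2 searching levels, which is consistent with the 6 levels quoted for the 9-input XOR in Fig. 6.7.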
6.4 Tree Implementation of XOR
In Sec. 6.3, we showed that the gate area of n-input XORs based on the generalized sort-and-search approach is on the order of $O(n \log^2 n)$. Next, we propose tree implementations of n-input XORs with linear gate area. First, we present an implementation of an n-input XOR as a tree of two-input XORs, each of which is
expressed based on other Boolean gates and implemented by their threshold gates.
We refer to this approach as direct conversion and use it as a basis for comparison.
Then, we present a traditional manner of implementing an n-input XOR, and then
propose new designs. The former, called the Boolean class, expresses the XOR in a
two-level NAND circuit implemented through threshold gates. The latter, referred
to as the majority class, has a two-level implementation and uses only generalized
majority gates in the first level. Finally, we analyze the gate area, number of inter-
connects, and latency of an n-input XOR via these approaches.
6.4.1 Direct conversion
Although an XOR cannot be expressed as a single threshold function, one can first
express an XOR based on other Boolean gates and then use their threshold logic
implementations. We refer to this approach as direct conversion.
One can implement an n-input XOR through a binary tree of two-input XORs. A two-input XOR $s = a \oplus b$ can be expressed as $s = a\bar{b} + \bar{a}b$, $s = \overline{\overline{\bar{a} + \bar{b}} + \overline{a + b}}$, or $s = \overline{\overline{a\bar{b}} \cdot \overline{\bar{a}b}}$, and implemented as shown in Fig. 6.8. Among the three implementations in Fig. 6.8, the NAND-type implementation has the smallest gate area and is therefore chosen in our implementation.

Figure 6.8: Threshold implementation of two-input XOR gate (a) SOP-type (b) NOR-type and (c) NAND-type. All three use threshold gates with a fan-in of only two.
6.4.2 XOR with a small number of inputs
We consider two classes of architectures without considering the maximum fan-in,
which is valid when the number of inputs to an XOR is sufficiently small.
Boolean-class implementations
We can use Boolean algebra to express an n-input XOR based on two levels of
NANDs. Since NANDs can be implemented as threshold gates, this provides a
two-level implementation, called the Boolean-class implementation.
Let s denote the XOR of $x_1, x_2, \ldots, x_n$. One can express s in a sum-of-product (SOP) form. The NAND implementation of such a form is obtained by using a NAND to combine the literals (a literal is a variable or its inverse) in each product term and combining the outputs of these NANDs with another NAND. For example, a three-input XOR $s = a \oplus b \oplus c$ can be expressed as $s = \overline{\overline{a\bar{b}\bar{c}} \cdot \overline{\bar{a}b\bar{c}} \cdot \overline{\bar{a}\bar{b}c} \cdot \overline{abc}}$.
Note that there are $2^{n-1}$ product terms in the SOP expression of an n-input XOR. Since the last threshold gate has inputs from each of these terms, this implementation is possible only when the maximum fan-in B satisfies $B \ge 2^{n-1}$.
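The two-level NAND construction can be written out as a list of threshold gates and checked against the closed-form area and interconnect counts $3(n+2)2^{n-2}+1$ and $(n+1)2^{n-1}$ given in Sec. 6.4.4. The sketch below is an illustration with hypothetical function names; it assumes each first-level NAND absorbs input inversions as signed weights (−1 for a plain literal, +1 for a complemented one, threshold 1 minus the number of plain literals), and that gate area follows A = 2 + Σ|w| + |T|.

```python
from itertools import product

def threshold(x, w, T):
    """Evaluate the threshold function [x; w; T]."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= T)

def boolean_class_xor(n):
    """Two-level NAND network for n-input XOR as (weights, T) gate descriptions."""
    minterms = [m for m in product((0, 1), repeat=n) if sum(m) % 2 == 1]
    # NAND of a minterm: negate the AND gate [w; T] to get [-w; 1 - T].
    first = [([-1 if bit else 1 for bit in m], 1 - sum(m)) for m in minterms]
    second = ([-1] * len(first), 1 - len(first))   # NAND of all first-level outputs
    return first, second

def evaluate(x, first, second):
    """Evaluate the network on the input tuple x."""
    mid = [threshold(x, w, T) for w, T in first]
    return threshold(mid, second[0], second[1])

def area(first, second):
    """Total gate area: 2 + sum|w| + |T| per gate."""
    return sum(2 + sum(map(abs, w)) + abs(T) for w, T in first + [second])

def interconnects(first, second):
    """Total number of gate inputs."""
    return sum(len(w) for w, _ in first + [second])
```

For n = 2 this yields three gates with a total area of 13 and 6 interconnects; for n = 3 the area is 31.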
Majority-class implementations
A two-level implementation of n-input XORs is proposed in [91] as shown in Fig. 6.9. All the gates in the first level are generalized majority gates with thresholds from 1 to n. A single gate in the second level has alternating weights 1 and −1 corresponding to odd and even gates, respectively, in the first level. This gate combines the outputs from the first level and compares the result with a threshold of 1 to compute the n-input XOR. All the threshold gates have the same fan-in n. Thus this two-level implementation in [91] is useful only when B ≥ n. Another two-level implementation of n-input XORs is proposed in [92] as shown in Fig. 6.10. For an n-input XOR $s = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, let the generalized majority functions $t^n_i$ with even thresholds i be the intermediate variables. Then, $s = [x_1, x_2, \cdots, x_n, t^n_2, t^n_4, \cdots, t^n_{2\lfloor n/2 \rfloor}; 1, 1, \cdots, 1, -2, -2, \cdots, -2; 1]$, which can be implemented in two levels. The first level computes how many pairs of ones there are in the inputs. The second level subtracts twice the number of pairs from the number of ones in the inputs. The result is either one (odd number of ones) or zero (even number of ones). The threshold gate of the second level has the maximum fan-in amongst all the gates, which equals $\lfloor 3n/2 \rfloor$. Thus this two-level implementation in [92] is useful only when $B \ge \lfloor 3n/2 \rfloor$.
The above two two-level implementations require a linear number of gates. In [81], the number of gates for an n-input XOR is reduced to $2\sqrt{n} + O(1)$ in a three-level implementation. There are $\lfloor \sqrt{n} \rfloor$ gates in the first level, $2l$ ($2l \le \lceil n/\lfloor \sqrt{n} \rfloor \rceil$) gates in the second level, and 1 gate in the third level, where l is an integer. The threshold gates of the second level have the maximum fan-in amongst all the gates, which equals $n + 2l$. Thus the three-level implementation in [81] is useful only when $B \ge n + 2l$.
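The two two-level constructions of [91] and [92] can be modelled directly from their descriptions. The following is a sketch with hypothetical function names; gate areas are computed with A = 2 + Σ|w| + |T| per gate, which is an assumption carried over from Sec. 6.2.4, and the three-level design of [81] is not modelled.

```python
from itertools import product

def threshold(x, w, T):
    """Evaluate the threshold function [x; w; T]."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= T)

def gate_area(w, T):
    """Area of one gate: 2 + sum|w| + |T|."""
    return 2 + sum(map(abs, w)) + abs(T)

def xor_muroga(x):
    """[91]: first level t^n_1..t^n_n, second level weights +1, -1, ... and T = 1."""
    n = len(x)
    t = [threshold(x, [1] * n, k) for k in range(1, n + 1)]
    w2 = [1 if k % 2 == 1 else -1 for k in range(1, n + 1)]
    return threshold(t, w2, 1)

def xor_minnick(x):
    """[92]: first level t^n_2, t^n_4, ...; second level [x, t; 1.., -2..; 1]."""
    n = len(x)
    t = [threshold(x, [1] * n, k) for k in range(2, n + 1, 2)]
    return threshold(list(x) + t, [1] * n + [-2] * len(t), 1)

def area_muroga(n):
    """Total area of the [91] network for n inputs."""
    first = sum(gate_area([1] * n, k) for k in range(1, n + 1))
    return first + gate_area([1] * n, 1)     # |+1| = |-1| = 1 in the second level

def area_minnick(n):
    """Total area of the [92] network for n inputs."""
    evens = range(2, n + 1, 2)
    first = sum(gate_area([1] * n, k) for k in evens)
    return first + gate_area([1] * n + [-2] * len(evens), 1)
```

Under these assumptions, `area_muroga(3)` gives 27 and `area_minnick(n)` gives 13, 15, 29, and 32 for n = 2, 3, 4, 5, matching the G column of Table 6.3 for the corresponding B′.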
Figure 6.9: Threshold gate implementation of s = x1 ⊕ x2 · · · ⊕ xn via [91].
Figure 6.10: Threshold gate implementation of s = x1 ⊕ x2 · · · ⊕ xn via [92].
6.4.3 XOR with a large number of inputs
When the number of inputs for an XOR exceeds the maximum fan-in, the implementations in Sec. 6.4.2 cannot be used directly. Instead, an n-input XOR is realized using a tree of $B'$-input XOR gates as shown in Fig. 6.11. For given B and n, the height of the tree l is minimized while satisfying $n \le B'^{\,l}$, and $B'$ is maximized such that no gate in the tree exceeds the maximum fan-in B.
If the Boolean-class implementation is used for the $B'$-input XORs, then from the results in Sec. 6.4.2, $B \ge 2^{B'-1}$, which gives $B' = 1 + \lfloor \log_2 B \rfloor$.
Alternatively, we can use the threshold gate implementations of XORs via [81, 91, 92] as basic $B'$-input XORs. For [91], $B = B'$. For [92], $B \ge \lfloor 3B'/2 \rfloor$, which gives $B' = \lfloor (2B+1)/3 \rfloor$. For [81], $B \ge B' + 2l$, where $2l \le \lceil B'/\lfloor \sqrt{B'} \rfloor \rceil$. In Table 6.3, we show the gate area (G), number of interconnects (I), and latency (L) of XORs via [81, 91, 92] for fan-ins B = 3, 4, 5, 6, 7. For the same B in Table 6.3, the XOR via [92] has the smallest gate area, number of interconnects, and latency. In the following, we refer to XORs via [92] as majority-class XORs and use them as primitives for our tree implementation.
6.4.4 Complexity of multi-input XOR
We now investigate the gate area, number of interconnects, and latency of the various designs presented in Secs. 6.4.2 and 6.4.3. We treat the gate area, number of interconnects, and latency of an XOR as functions of two parameters: the number of inputs n and the fan-in bound B. When the fan-in bound is not violated, the XOR is simply implemented in a two-level structure. When the fan-in bound is violated, the XOR is decomposed into a tree of smaller XORs such that all smaller XORs can be implemented without violating the fan-in bound. For mathematical convenience, assume that $n = B'^{\,l}$ for some l, where $B'$ depends on the fan-in bound B. That is, the tree is complete with height $l = \log_{B'} n$. Thus the implementation involves $N_B = (B'^{\,l} - 1)/(B' - 1)$ $B'$-input XORs. For a tree of k threshold gates, the total gate area is simply the sum of the areas of all smaller XORs, which is given by $A_{\mathrm{XOR}}(n,B) = 2k + \sum_{i=1}^{k} \big( \sum_{j=1}^{n_i} |w_{ij}| + |T_i| \big)$. The number of interconnects is given by $I_{\mathrm{XOR}}(n,B) = \sum_{i=1}^{k} n_i$. The latency $L_{\mathrm{XOR}}(n,B)$ of the XOR is given by identifying the critical path from the inputs to the output.

Table 6.3: Comparison of XORs via [81, 91, 92] for fan-ins B = 3, 4, 5, 6, 7.
B   B'  Impl.          G     I    L
3   3   Muroga [91]    27    12   2
3   2   Minnick [92]   13    5    2
3   2   Siu [81]       29    12   3
4   4   Muroga [91]    41    20   2
4   3   Minnick [92]   15    7    2
4   2   Siu [81]       29    12   3
5   5   Muroga [91]    58    30   2
5   3   Minnick [92]   15    7    2
5   3   Siu [81]       44    20   3
6   6   Muroga [91]    78    42   2
6   4   Minnick [92]   29    14   2
6   4   Siu [81]       48    24   3
7   7   Muroga [91]    101   56   2
7   5   Minnick [92]   32    17   2
7   5   Siu [81]       94    44   3

Figure 6.11: n-input XOR realized as a tree with a height of l using $B'$-input XORs as basic units, where $s = x_1 \oplus x_2 \oplus \cdots \oplus x_n$ for $n = B'^{\,l}$.
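For Boolean-class implementations, the gate area, number of interconnects, and latency are given by

$$A^{\mathrm{BC}}_{\mathrm{XOR}}(n,B) = \begin{cases} 3(n+2)2^{n-2} + 1, & n \le B' \\ \dfrac{(n-1)\big(3(\lfloor \log_2 B \rfloor + 3)2^{\lfloor \log_2 B \rfloor - 1} + 1\big)}{\lfloor \log_2 B \rfloor}, & n > B' \end{cases}$$

$$I^{\mathrm{BC}}_{\mathrm{XOR}}(n,B) = \begin{cases} (n+1)2^{n-1}, & n \le B' \\ \dfrac{(n-1)(\lfloor \log_2 B \rfloor + 2)2^{\lfloor \log_2 B \rfloor}}{\lfloor \log_2 B \rfloor}, & n > B' \end{cases}$$

$$L^{\mathrm{BC}}_{\mathrm{XOR}}(n,B) = \begin{cases} 2, & n \le B' \\ 2\lceil \log_{\lfloor 1 + \log_2 B \rfloor} n \rceil, & n > B' \end{cases}$$

respectively, where $B' = 1 + \lfloor \log_2 B \rfloor$, which determines the fan-in violation condition for the Boolean-class implementation.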
For majority-class implementations, the gate area, number of interconnects, and latency are given by

$$A^{\mathrm{MC}}_{\mathrm{XOR}}(n,B) = \begin{cases} \lfloor n/2 \rfloor \lfloor (3n+10)/2 \rfloor + n + 3, & n \le B' \\ \dfrac{(n-1)\big(\lfloor \frac{B}{3} \rfloor^2 + \lfloor \frac{2B+16}{3} \rfloor \lfloor \frac{B}{3} \rfloor + \lfloor \frac{2B+10}{3} \rfloor\big)}{\lfloor \frac{2B-2}{3} \rfloor}, & n > B' \end{cases}$$

$$I^{\mathrm{MC}}_{\mathrm{XOR}}(n,B) = \begin{cases} (n+1)\lfloor n/2 \rfloor + n, & n \le B' \\ \dfrac{(n-1)\big(\lfloor \frac{2B+4}{3} \rfloor \lfloor \frac{B}{3} \rfloor + \lfloor \frac{2B+1}{3} \rfloor\big)}{\lfloor \frac{2B-2}{3} \rfloor}, & n > B' \end{cases}$$

$$L^{\mathrm{MC}}_{\mathrm{XOR}}(n,B) = \begin{cases} 2, & n \le B' \\ 2\lceil \log_{\lfloor (2B+1)/3 \rfloor} n \rceil, & n > B' \end{cases}$$

respectively, where $B' = \lfloor (2B+1)/3 \rfloor$, which determines the fan-in violation condition for the majority-class implementation.

Figure 6.12: Gate area of n-input XOR using threshold gate with fan-in bound B. [Plot: gate area (in RTDs) versus n for 0 ≤ n ≤ 600; curves: Boolean, B=3; Boolean, B=4,5,6,7; Majority, B=3; Majority, B=4,5; Majority, B=6; Majority, B=7.]
Figure 6.13: Number of interconnects of n-input XOR using threshold gate with fan-in bound B. [Plot: number of interconnects versus n for 0 ≤ n ≤ 600; curves: Boolean, B=3; Boolean, B=4,5,6,7; Majority, B=3; Majority, B=4,5; Majority, B=6; Majority, B=7.]

Figure 6.14: Latency of n-input XOR using threshold gate with fan-in bound B. [Plot: latency versus n for 0 ≤ n ≤ 600; curves: Boolean, B=3; Boolean, B=4,5,6,7; Majority, B=3; Majority, B=4,5; Majority, B=6; Majority, B=7.]

The gate area, number of interconnects, and latency of the three classes of XORs are compared in Figs. 6.12, 6.13, and 6.14, respectively, for B = 3, 4, · · · , 7. Though the closed-form expressions of the gate area, number of gates, and latency for tree implementations are derived for a complete tree with $n = B'^{\,l}$, we assume the expressions are valid for all n. According to [115], the operand size n is as large as 521. Hence, all the curves in Figs. 6.12, 6.13, and 6.14 are depicted for n up to 600. In Figs. 6.12, 6.13, and 6.14, all the markers correspond to the discrete values $n = B'^{\,l}$ for l = 1, 2, · · · . For $n \ne B'^{\,l}$, a multi-input XOR is realized as a partial tree of $B'$-input XORs. We assume that the gate area and number of interconnects of the partial tree of $B'$-input XORs increase linearly with n, so all the curves corresponding to the gate area and number of interconnects are straight lines. In contrast, the curves corresponding to the latency are step functions, which is due to the ceiling function.
All three classes of XORs have linear complexity. The direct conversion has the same gate area, number of interconnects, and latency as the Boolean class with B = 3, which implies that the Boolean class includes the direct conversion as a special case and hence provides more tradeoffs between the gate area, number of interconnects, and latency. From Fig. 6.12, the Boolean-class and majority-class XORs have the same gate area when B = 3, while the majority-class XOR is more area efficient than the Boolean-class XOR when B = 4, 5, 6, 7. From Fig. 6.13, the number of interconnects of the majority-class XOR is smaller than that of the Boolean-class XOR for any B. For example, when B = 7, the gate area and number of interconnects of the Boolean-class XOR are about twice those of the majority-class XOR with the same number of inputs. For any given B, the latency of the majority-class XOR is no larger than that of the Boolean-class XOR.
It is observed that the optimum fan-in with respect to the gate area is B = 3 for
Boolean-class, and B = 4, 5 for majority-class XORs. For a large maximum fan-in
B, the overall gate area and number of interconnects are not the smallest, though
the tree is composed of fewer gates, each of which is more powerful dealing with
multiple inputs. It implies the majority-class implementation of a multi-input XOR
is more efficient in terms of gate area and number of interconnects when B = 4, 5,
even if a greater fan-in is available. The optimum fan-in can also be explained
through the expressions of AXOR(n,B). For an n-input XOR implemented as a tree,
there is a tradeoff between the number of B′-input XORs, NB, and the gate area
of each XOR. A small B leads to a smaller B′-input XOR but a larger NB. There
exists an optimum value B for a fixed n such that the overall gate area is minimized.
When the cost is the main concern, we search all B's for the minimum gate area of $A^{\mathrm{BC}}_{\mathrm{XOR}}(n,B)$ and $A^{\mathrm{MC}}_{\mathrm{XOR}}(n,B)$. The following lemma gives the optimal fan-ins for Boolean-class and majority-class XORs implemented as a tree.
Lemma 6.4.1. For a given n (n ≥ 3), among all values of B such that $B \le 2^{n-1}$ and $B \le \lfloor 3n/2 \rfloor$ for the Boolean class and majority class, respectively, B = 3 and B = 4 (or 5) minimize the gate area of the Boolean-class and majority-class implementations of an n-input XOR, respectively.
Proof. For the Boolean-class implementation, the total gate area of an n-input XOR with the maximum fan-in B is given by $A^{\mathrm{BC}}_{\mathrm{XOR}}(n,B) = (n-1) \cdot [3(\lfloor \log_2 B \rfloor + 3)2^{\lfloor \log_2 B \rfloor - 1} + 1] / \lfloor \log_2 B \rfloor$. For a given n (n ≥ 3), $A^{\mathrm{BC}}_{\mathrm{XOR}}(n,B)$ is a piecewise function of B. For all $B \le 2^{n-1}$, $A^{\mathrm{BC}}_{\mathrm{XOR}}(n,B)$ is minimized when B = 3.
Similarly, for the majority-class implementation of an n-input XOR, the gate area is given by $A^{\mathrm{MC}}_{\mathrm{XOR}}(n,B) = (n-1) \cdot (\lfloor B/3 \rfloor^2 + \lfloor (2B+16)/3 \rfloor \lfloor B/3 \rfloor + \lfloor (2B+10)/3 \rfloor) / \lfloor (2B-2)/3 \rfloor$. By scanning $B \le \lfloor 3n/2 \rfloor$, $A^{\mathrm{MC}}_{\mathrm{XOR}}(n,B)$ is minimized when B = 4 (or 5).
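The scan described in the proof can be reproduced numerically from the tree-case (n > B′) area expressions. The sketch below is illustrative only: the function names are hypothetical, exact rational arithmetic is used to avoid rounding, and the scan is limited to a finite candidate range (3 ≤ B ≤ 64 in the example) rather than the full ranges stated in the lemma.

```python
from fractions import Fraction

def area_boolean(n, B):
    """Tree-case A^BC_XOR(n, B); requires B >= 2 so that floor(log2 B) >= 1."""
    q = B.bit_length() - 1                       # floor(log2 B) for integer B
    return Fraction((n - 1) * (3 * (q + 3) * 2 ** (q - 1) + 1), q)

def area_majority(n, B):
    """Tree-case A^MC_XOR(n, B); requires B >= 3 so the denominator is nonzero."""
    num = (B // 3) ** 2 + ((2 * B + 16) // 3) * (B // 3) + (2 * B + 10) // 3
    return Fraction((n - 1) * num, (2 * B - 2) // 3)

def best_fanins(area, n, candidates):
    """Return every candidate B that attains the minimum area."""
    values = {B: area(n, B) for B in candidates}
    smallest = min(values.values())
    return [B for B in sorted(values) if values[B] == smallest]
```

For n = 521, `best_fanins(area_boolean, 521, range(3, 65))` returns `[3]` and `best_fanins(area_majority, 521, range(3, 65))` returns `[4, 5]`, in line with the lemma over the scanned range.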
6.5 Multiplication over GF(2^m): Threshold Implementation
In CMOS technology, many characteristic-2 field multiplication structures have been
proposed using different basis representations of field elements in the literature.
Most of them are based on the polynomial basis and normal basis [104]. In the
following, we propose threshold architectures of polynomial basis and normal basis
multiplications over GF(2m) using multi-input XORs. Our implementation is based
on the RTD technology, which needs a four-phase clocking scheme. Hence, buffers
are inserted to the circuit wherever needed. In our implementation, each output is
synthesized as an independent network of threshold gates. No sharing of gates with
other outputs is considered. We analyze the gate area, number of interconnects, and
latency of our implementations.
6.5.1 Polynomial basis multiplication over GF(2^m)
Polynomial basis is widely used for representing finite field elements. There are
two classes of implementations: bit-serial and bit-parallel [103]. The former can be
modified to obtain a systolic structure, which has a highly regular structure and less
interconnect complexity. However, the structure is not suitable for implementation
in the new nanotechnology due to the complex clocking scheme. In contrast, the bit-
parallel multiplication can be constructed in a cascaded network, which is suitable
for the new nanotechnology.
Let an irreducible polynomial $P(x) = p_0 + p_1 x + \cdots + p_{m-1}x^{m-1} + x^m$ be the generator polynomial of the field GF($2^m$). Let $A(x)$ and $B(x)$ be two field elements in GF($2^m$) and $C(x)$ be their product modulo $P(x)$. Then,

$$\begin{aligned} C(x) &= A(x)B(x) \bmod P(x) \\ &= (A(x)b_0 \bmod P(x)) + (A(x)x\,b_1 \bmod P(x)) + \cdots + (A(x)x^{m-1}b_{m-1} \bmod P(x)). \end{aligned} \qquad (6.6)$$

Representing $C(x)$ and $B(x)$ as vectors, the multiplication can be rewritten in matrix form as $\mathbf{C} = \mathbf{Z}\mathbf{B}$:

$$\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{bmatrix} = \begin{bmatrix} z_{0,0} & z_{0,1} & \cdots & z_{0,m-1} \\ z_{1,0} & z_{1,1} & \cdots & z_{1,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ z_{m-1,0} & z_{m-1,1} & \cdots & z_{m-1,m-1} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-1} \end{bmatrix}, \qquad (6.7)$$

where $\mathbf{Z}$ is called a product matrix. The i-th column of $\mathbf{Z}$ is obtained by $A(x)x^i \bmod P(x)$.
The matrix-vector multiplication in (6.7) requires $m^2$ two-input AND gates and m m-input XOR gates. The complexity of computing $\mathbf{Z}$ depends on the selected generator polynomial $P(x)$. A suitable choice of the generator polynomial may reduce the arithmetic complexity over GF($2^m$). Trinomials, pentanomials, equally-spaced polynomials (ESPs), and all-one polynomials (AOPs) are usually considered for selecting generator polynomials [104]. It has been shown that roughly one half of the characteristic-2 fields GF($2^m$) for 2 ≤ m ≤ 1000 have a trinomial generator [103]. For fields without a trinomial generator, a prime pentanomial exists with very high probability [103].
6.5.2 Implementation of polynomial basis multiplication us-
ing multi-input XORs
Based on polynomial basis, an m-bit multiplication $C(x) = A(x)B(x) \pmod{P(x)}$
over GF($2^m$) is implemented as $\mathbf{C} = \mathbf{Z}\mathbf{B}$, where
$\mathbf{C} = (c_0, c_1, \cdots, c_{m-1})^T$ and $\mathbf{B} = (b_0, b_1, \cdots, b_{m-1})^T$.
It needs two steps: computation of the product matrix $\mathbf{Z}$ and vector
multiplication. For a generator polynomial $P(x) = x^m + x + 1$, the product matrix
$\mathbf{Z}$ is given by
\[
\mathbf{Z} =
\begin{pmatrix}
a_0 & a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 \\
a_1 & a_0 + a_{m-1} & a_{m-1} + a_{m-2} & \cdots & a_3 + a_2 & a_2 + a_1 \\
a_2 & a_1 & a_0 + a_{m-1} & \cdots & a_4 + a_3 & a_3 + a_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 & a_0 + a_{m-1}
\end{pmatrix},
\]
where column i is given by the coefficients of $A(x)x^i \bmod P(x)$ for $i = 0, 1, \cdots, m-1$.
Hence, row 0 of $\mathbf{Z}$ is given by $(a_0, a_{m-1}, a_{m-2}, \cdots, a_1)$ and row i for
$i = 1, 2, \cdots, m-1$ is given by
$(a_i, a_{i-1}, \cdots, a_1, a_0 + a_{m-1}, a_{m-1} + a_{m-2}, \cdots, a_{i+1} + a_i)$, where all indices
are modulo m. An m-bit polynomial basis multiplication contains m independent
blocks in parallel, which compute $c_i$ for $i = 0, 1, \cdots, m-1$ simultaneously.
The multiplication structure for computing $C(x) = A(x)B(x) \bmod P(x)$ is shown
in Fig. 6.15(a), where the block $u_i$ for computing $c_i$ is shown in Fig. 6.15(b). The
generation of row i of $\mathbf{Z}$, given by $A(x)x^i \bmod P(x)$, is shown in Fig. 6.15(c) and
requires $m - i$ ($i = 1, \cdots, m-1$) two-input XORs. Each block $u_i$ requires m
two-input ANDs, $m - i$ two-input XORs, and one m-input XOR. Hence, the total
numbers of two-input ANDs, two-input XORs, and m-input XORs for an m-bit
multiplication are given by $m^2$, $\frac{m(m-1)}{2}$, and m, respectively.
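The two steps above can be sketched in Python as follows. This is a minimal software illustration, not the gate-level design of Fig. 6.15: it builds $\mathbf{Z}$ for $P(x) = x^m + x + 1$ column by column and then forms $\mathbf{C} = \mathbf{Z}\mathbf{B}$ over GF(2). The function names `build_Z`, `pb_multiply`, `schoolbook_mod`, and `two_input_xor_count` are illustrative assumptions, as is the convention that coefficient lists run from $a_0$ to $a_{m-1}$ with $m \ge 2$.

```python
def build_Z(a):
    """Product matrix Z for P(x) = x^m + x + 1 (assumes m >= 2).

    Column i holds the coefficients of A(x) * x^i mod P(x).
    """
    m = len(a)
    col = list(a)
    cols = []
    for _ in range(m):
        cols.append(col)
        top = col[-1]              # coefficient that would move to x^m
        col = [top] + col[:-1]     # multiply by x; the "+1" of x^m = x + 1 lands on x^0
        col[1] ^= top              # the "x" of x^m = x + 1 lands on x^1
    # Transpose the list of columns into a list of rows.
    return [[cols[j][i] for j in range(m)] for i in range(m)]


def pb_multiply(a, b):
    """C = Z B over GF(2): per output bit, m two-input ANDs and one m-input XOR."""
    return [sum(z & bj for z, bj in zip(row, b)) & 1 for row in build_Z(a)]


def schoolbook_mod(a, b):
    """Reference: polynomial product over GF(2), reduced with x^m = x + 1."""
    m = len(a)
    prod = [0] * (2 * m - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] ^= ai & bj
    for k in range(2 * m - 2, m - 1, -1):
        if prod[k]:                # x^k = x^(k-m+1) + x^(k-m)
            prod[k] = 0
            prod[k - m + 1] ^= 1
            prod[k - m] ^= 1
    return prod[:m]


def two_input_xor_count(m):
    """Two-input XORs needed to form rows 1..m-1 of Z (m - i for row i)."""
    return sum(m - i for i in range(1, m))
```

For m = 4, `pb_multiply` agrees with `schoolbook_mod` on every operand pair, and `two_input_xor_count(4)` gives 3 + 2 + 1 = 6, which equals m(m-1)/2.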
6.5.3 Normal basis multiplication over GF(2m)
Similar to the implementations of polynomial basis multiplications, there are two
classes of implementations of normal basis multiplications: bit-serial and
bit-parallel [103]. The latter can be easily obtained by putting multiple identical
blocks in parallel, each of which is the same as that in the former, with cyclically
shifted versions of the inputs. Bit-parallel implementations achieve much higher
throughput at the expense of larger gate area.
Figure 6.15: Implementation of polynomial basis multiplication (a) Structure for
PB multiplication $C(x) = A(x)B(x) \bmod P(x)$; (b) Block $u_i$ for computing $c_i$; (c)
Generation of row i of $\mathbf{Z}$.
Some characteristic-2 field operations in normal basis can be implemented
efficiently. For instance, the squaring of an element in GF($2^m$) is simply given by
a cyclic shift. With this property, multiplications in normal basis can be
implemented via an algorithm given by Massey and Omura in [105]. The implemented
multiplications are referred to as Massey-Omura (MO) multiplications.
Suppose $\beta \in$ GF($2^m$) so that $\{\beta, \beta^2, \cdots, \beta^{2^{m-1}}\}$ forms a normal basis of GF($2^m$).
Let $A' = a'_0\beta + a'_1\beta^2 + \cdots + a'_{m-1}\beta^{2^{m-1}}$ be any element in GF($2^m$). Denote the
vector form of $A'$ by $\mathbf{A}' = (a'_0, a'_1, \cdots, a'_{m-1})^T$. Then, $A'^2$ is a right cyclic shift of
$\mathbf{A}'$, $(a'_{m-1}, a'_0, \cdots, a'_{m-2})^T$. Let $B'$ be any element in GF($2^m$) with vector form
$\mathbf{B}' = (b'_0, b'_1, \cdots, b'_{m-1})^T$, and let $C' = A'B'$, with vector form
$\mathbf{C}' = (c'_0, c'_1, \cdots, c'_{m-1})^T$, be the product with respect to the same normal basis.
Then, the last coefficient $c'_{m-1}$ is some function y of the coefficients of $\mathbf{A}'$ and $\mathbf{B}'$,
$c'_{m-1} = y(\mathbf{A}', \mathbf{B}')$. The i-th coefficient of $C'$ is given by
$c'_i = y(\mathbf{A}'^{(m-1-i)}, \mathbf{B}'^{(m-1-i)})$, where
$\mathbf{A}'^{(j)} = (a'_{m-j}, a'_{m-j+1}, \cdots, a'_{m-1}, a'_0, a'_1, \cdots, a'_{m-j-1})^T$
denotes the j-fold right cyclic shift of $\mathbf{A}'$. The y function is implemented in
matrix form, $y(\mathbf{A}', \mathbf{B}') = \mathbf{A}'^T \mathbf{H} \mathbf{B}'$, where $\mathbf{H} = [h_{ij}]_{m \times m}$ is a binary
matrix.
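For an irreducible polynomial $P(x) = 1 + x + x^2 + x^3 + x^4$, let $\beta$ be one of its roots.
The set $\{1, \beta, \beta^2, \beta^3\}$ forms a polynomial basis of GF($2^4$). It can be shown that the
set $\{\beta, \beta^2, \beta^4, \beta^8\}$ is linearly independent and forms a normal basis of GF($2^4$). The
two bases are related in the following form:
\[
\begin{pmatrix} 1 \\ \beta \\ \beta^2 \\ \beta^3 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \beta \\ \beta^2 \\ \beta^4 \\ \beta^8 \end{pmatrix}. \tag{6.8}
\]
Then $C' = A'B' = \sum_{i=0}^{3} \sum_{j=0}^{3} a'_i b'_j \beta^{2^i + 2^j}$. Let
$\beta^{2^i + 2^j} = \sum_{l=0}^{3} \lambda_l(i, j) \beta^{2^l}$, where $\lambda_l(i, j)$ denotes the coefficient of
$\beta^{2^l}$ corresponding to the normal basis representation of $\beta^{2^i + 2^j}$. Then, we have
$c'_l = \sum_{i=0}^{3} \sum_{j=0}^{3} \lambda_l(i, j)\, a'_i b'_j$. Hence,
\[
c'_3 = y(\mathbf{A}', \mathbf{B}') = \mathbf{A}'^T \mathbf{H} \mathbf{B}' =
\begin{pmatrix} a'_0 & a'_1 & a'_2 & a'_3 \end{pmatrix}
\begin{pmatrix}
0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}
\begin{pmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \end{pmatrix}
= a'_0 b'_1 + a'_0 b'_2 + a'_1 b'_0 + a'_1 b'_3 + a'_2 b'_0 + a'_2 b'_2 + a'_3 b'_1.
\]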
Assume $\mathbf{H}$ contains $N_m$ 1's and $h_{i_k j_k} = 1$ for $k = 1, 2, \cdots, N_m$. The y function
implements the addition of $N_m$ terms, $\sum_{k=1}^{N_m} a'_{i_k} b'_{j_k}$, over GF(2). Each block y
requires $N_m$ two-input ANDs and one $N_m$-input XOR. For an m-bit bit-parallel
multiplication, there are m identical blocks of y in parallel, and the total numbers
of two-input ANDs and $N_m$-input XORs are given by $mN_m$ and m, respectively.
If the sharing of the same terms $a'_i b'_j$ among different blocks is allowed, the total
number of two-input ANDs is at most $m^2$, since there are at most $m^2$ different terms
$a'_i b'_j$ for $0 \le i, j \le m - 1$. The gate area is determined by block y, which depends
on the choice of normal basis. It has been shown [116] that $N_m \ge 2m - 1$, and a
normal basis that achieves the equality is said to be optimal. About 23% of the
fields GF($2^m$) for $2 \le m \le 1200$ have an optimal normal basis [116]. In the above
example, $N_m = 2 \times 4 - 1 = 7$, which means that $\{\beta, \beta^2, \beta^4, \beta^8\}$ is an optimal normal
basis.
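As a small software check of the GF($2^4$) example, the sketch below evaluates the bit-parallel Massey-Omura product with the matrix $\mathbf{H}$ given above. The helper names `right_shift`, `y`, and `mo_multiply` are illustrative assumptions, and coordinate lists are taken in the order $(a'_0, a'_1, a'_2, a'_3)$ with respect to $\{\beta, \beta^2, \beta^4, \beta^8\}$.

```python
# H from the GF(2^4) example (rows indexed by a'_i, columns by b'_j).
H = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
]


def right_shift(v, j):
    """j-fold right cyclic shift: (v[m-j], ..., v[m-1], v[0], ..., v[m-j-1])."""
    m = len(v)
    j %= m
    return v[m - j:] + v[:m - j]


def y(a, b):
    """y(A', B') = A'^T H B' over GF(2): one AND per 1 in H, then one wide XOR."""
    acc = 0
    for i, row in enumerate(H):
        for j, h in enumerate(row):
            if h:
                acc ^= a[i] & b[j]
    return acc


def mo_multiply(a, b):
    """Bit-parallel MO product: c'_i = y(A'^(m-1-i), B'^(m-1-i))."""
    m = len(a)
    return [y(right_shift(a, m - 1 - i), right_shift(b, m - 1 - i))
            for i in range(m)]


def ones_in_H():
    """N_m, the number of 1's in H (7 = 2m - 1 for this optimal normal basis)."""
    return sum(sum(row) for row in H)
```

With these definitions, `mo_multiply([1, 0, 0, 0], [0, 1, 0, 0])` returns `[0, 0, 0, 1]`, matching $\beta \cdot \beta^2 = \beta^3 = \beta^8$ (since $\beta^5 = 1$), and `mo_multiply(a, a)` equals `right_shift(a, 1)` for every `a`, which is the cyclic-shift squaring property.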
6.5.4 Implementation of normal basis multiplication using
multi-input XORs
The normal basis multiplication algorithm was first given by Massey and Omura in
[105]. An m-bit MO multiplication contains m blocks in parallel, each of which
implements a function y with cyclically shifted versions of the inputs. The m-bit MO
multiplication $C' = A'B'$ over GF($2^m$) computes all $c'_i$ for $i = 0, 1, \cdots, m - 1$
simultaneously. The computation of each $c'_i = y(\mathbf{A}', \mathbf{B}')$ needs $N_m$ two-input ANDs
and one $N_m$-input XOR. The multiplication structure for computing $C'$ is shown in
Fig. 6.16(a). Each block y is implemented as shown in Fig. 6.16(b).
For simplicity, we focus on characteristic-2 fields with an optimal normal basis.
Hence, the number of 1's in $\mathbf{H}$ is given by $N_m = 2m - 1$. If the same term $a'_i b'_j$
is shared among different blocks, large gate area is needed for routing the global
interconnects. Thus, we assume that no term $a'_i b'_j$ is shared among different blocks.
Each block needs $(2m - 1)$ two-input ANDs and one $(2m - 1)$-input XOR. The total
numbers of two-input ANDs and $(2m - 1)$-input XORs for an m-bit multiplication
are given by $m(2m - 1)$ and m, respectively.
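The gate counts stated for the two bases can be tabulated side by side. The following sketch simply encodes the counts given in Secs. 6.5.2 and 6.5.4; the function names and dictionary keys are illustrative assumptions, and the normal basis is assumed to be optimal with no sharing of terms among blocks.

```python
def pb_gate_counts(m):
    """Polynomial basis, P(x) = x^m + x + 1: counts from Sec. 6.5.2."""
    return {
        "and2": m * m,                 # m two-input ANDs per block, m blocks
        "xor2": m * (m - 1) // 2,      # two-input XORs for generating Z
        "wide_xor": m,                 # one wide XOR per output bit
        "wide_xor_inputs": m,          # fan-in of each wide XOR
    }


def nb_gate_counts(m):
    """Optimal normal basis (N_m = 2m - 1), no term sharing: counts from Sec. 6.5.4."""
    n_m = 2 * m - 1
    return {
        "and2": m * n_m,
        "xor2": 0,
        "wide_xor": m,
        "wide_xor_inputs": n_m,
    }
```

For example, for m = 4 the polynomial basis multiplier uses 16 two-input ANDs, 6 two-input XORs, and four 4-input XORs, while the normal basis multiplier uses 28 two-input ANDs and four 7-input XORs.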
Figure 6.16: Implementation of normal basis multiplication (a) Structure of MO
multiplication; (b) Block y for computing c′.
6.5.5 Complexity of multiplication implementations in nanotechnology
Since majority-class XOR has smaller gate area complexity than Boolean-class XOR,
we use majority-class XORs for implementations of multipliers. The two-input AND
can be realized as a single threshold gate with a threshold function given by
$f_{2AND} = [x_1, x_2; 1, 1; 2]$. Denote the gate area, number of interconnects, and latency
of a two-input AND by $A_{2AND}$, $I_{2AND}$, and $L_{2AND}$, respectively. Then $A_{2AND} = 6$,
$I_{2AND} = 2$, and $L_{2AND} = 1$.
Denote the gate area, number of interconnects, and latency of an m-bit majority-class
polynomial basis multiplication by $A^{MC}_{PB,Mul}(m, B)$, $I^{MC}_{PB,Mul}(m, B)$, and
$L^{MC}_{PB,Mul}(m, B)$, respectively, where PB denotes polynomial basis. According to Sec. 6.4.4,
the gate area, number of interconnects, and latency of an m-bit majority-class
polynomial basis multiplication are given in Eqs. (6.9)-(6.11), where
$B' = \lfloor (2B + 1)/3 \rfloor$ determines the fan-in violation condition.
\[
A^{MC}_{PB,Mul}(m, B) = m^2 A_{2AND} + \frac{m(m-1)}{2} A^{MC}_{XOR}(2, B) + m A^{MC}_{XOR}(m, B)
\]
\[
= \begin{cases}
m \lfloor \frac{m}{2} \rfloor \lfloor \frac{3m+10}{2} \rfloor + 13.5m^2 - 3.5m, & m \le B' \\
\frac{\lfloor \frac{B}{3} \rfloor^2 + \lfloor \frac{2B+16}{3} \rfloor \cdot \lfloor \frac{B}{3} \rfloor + \lfloor \frac{2B+10}{3} \rfloor}{\lfloor \frac{2B-2}{3} \rfloor} (m^2 - m) + 12.5m^2 - 6.5m, & m > B'
\end{cases} \tag{6.9}
\]
\[
I^{MC}_{PB,Mul}(m, B) = m^2 I_{2AND} + \frac{m(m-1)}{2} I^{MC}_{XOR}(2, B) + m I^{MC}_{XOR}(m, B)
\]
\[
= \begin{cases}
(m^2 + m) \lfloor \frac{m}{2} \rfloor + 5.5m^2 - 2.5m, & m \le B' \\
\frac{\lfloor \frac{2B+4}{3} \rfloor \lfloor \frac{B}{3} \rfloor + \lfloor \frac{2B+1}{3} \rfloor}{\lfloor \frac{2B-2}{3} \rfloor} (m^2 - m) + 4.5m^2 - 2.5m, & m > B'
\end{cases} \tag{6.10}
\]
\[
L^{MC}_{PB,Mul}(m, B) = L_{2AND} + L^{MC}_{XOR}(2, B) + L^{MC}_{XOR}(m, B)
= \begin{cases}
5, & m \le B' \\
3 + 2 \lceil \log_{\lfloor \frac{2B+1}{3} \rfloor} m \rceil, & m > B'
\end{cases} \tag{6.11}
\]
Denote the gate area, number of interconnects, and latency of an m-bit majority-class
normal basis multiplication by $A^{MC}_{NB,Mul}(m, B)$, $I^{MC}_{NB,Mul}(m, B)$, and
$L^{MC}_{NB,Mul}(m, B)$, respectively, where NB denotes normal basis. According to Sec. 6.4.4,
the gate area, number of interconnects, and latency of an m-bit majority-class normal basis
multiplication are given in Eqs. (6.12)-(6.14), where $B' = \lfloor (2B + 1)/3 \rfloor$ determines
the fan-in violation condition.
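\[
A^{MC}_{NB,Mul}(m, B) = m(2m - 1) A_{2AND} + m A^{MC}_{XOR}(2m - 1, B)
\]
\[
= \begin{cases}
3m^3 + 14m^2 - 7m, & m \le (B' + 1)/2 \\
\frac{\lfloor \frac{B}{3} \rfloor^2 + \lfloor \frac{2B+16}{3} \rfloor \cdot \lfloor \frac{B}{3} \rfloor + \lfloor \frac{2B+10}{3} \rfloor}{\lfloor \frac{2B-2}{3} \rfloor} (2m^2 - 2m) + 12m^2 - 6m, & m > (B' + 1)/2
\end{cases} \tag{6.12}
\]
\[
I^{MC}_{NB,Mul}(m, B) = m(2m - 1) I_{2AND} + m I^{MC}_{XOR}(2m - 1, B)
\]
\[
= \begin{cases}
2m^3 + 4m^2 - 3m, & m \le (B' + 1)/2 \\
\frac{\lfloor \frac{2B+4}{3} \rfloor \lfloor \frac{B}{3} \rfloor + \lfloor \frac{2B+1}{3} \rfloor}{\lfloor \frac{2B-2}{3} \rfloor} (2m^2 - 2m) + 4m^2 - 2m, & m > (B' + 1)/2
\end{cases} \tag{6.13}
\]
\[
L^{MC}_{NB,Mul}(m, B) = L_{2AND} + L^{MC}_{XOR}(2m - 1, B)
= \begin{cases}
3, & m \le (B' + 1)/2 \\
1 + 2 \lceil \log_{\lfloor \frac{2B+1}{3} \rfloor} (2m - 1) \rceil, & m > (B' + 1)/2
\end{cases} \tag{6.14}
\]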
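The closed-form expressions for gate area and latency can be evaluated directly. The sketch below is a plain transcription of Eqs. (6.9), (6.11), (6.12), and (6.14) as read from this section; the function names (`b_prime`, `xor_area_slope`, and so on) are illustrative assumptions, and $B \ge 3$ is assumed so that $B' \ge 2$. Since the expressions are derived for $m = B'^l$, area values for other m should be read as approximations.

```python
def ceil_log(base, x):
    """Smallest integer k >= 0 with base**k >= x (integer arithmetic only)."""
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k


def b_prime(B):
    """B' = floor((2B + 1) / 3); assumes B >= 3."""
    if B < 3:
        raise ValueError("this sketch assumes B >= 3")
    return (2 * B + 1) // 3


def xor_area_slope(B):
    """Coefficient multiplying (m^2 - m) in the second case of Eq. (6.9)."""
    q = B // 3
    numerator = q * q + ((2 * B + 16) // 3) * q + (2 * B + 10) // 3
    return numerator / ((2 * B - 2) // 3)


def area_pb(m, B):
    """Gate area of the PB multiplier, Eq. (6.9)."""
    if m <= b_prime(B):
        return m * (m // 2) * ((3 * m + 10) // 2) + 13.5 * m * m - 3.5 * m
    return xor_area_slope(B) * (m * m - m) + 12.5 * m * m - 6.5 * m


def area_nb(m, B):
    """Gate area of the NB multiplier, Eq. (6.12)."""
    if 2 * m - 1 <= b_prime(B):          # same condition as m <= (B' + 1) / 2
        return 3 * m ** 3 + 14 * m * m - 7 * m
    return xor_area_slope(B) * (2 * m * m - 2 * m) + 12 * m * m - 6 * m


def latency_pb(m, B):
    """Latency of the PB multiplier, Eq. (6.11)."""
    bp = b_prime(B)
    return 5 if m <= bp else 3 + 2 * ceil_log(bp, m)


def latency_nb(m, B):
    """Latency of the NB multiplier, Eq. (6.14)."""
    bp = b_prime(B)
    return 3 if 2 * m - 1 <= bp else 1 + 2 * ceil_log(bp, 2 * m - 1)
```

For B = 5 and m = 3, 4, 5, 6, `latency_pb` gives 5, 7, 7, 7 and `latency_nb` gives 5, 5, 5, 7, which agree with the latency column of the "Ours" rows in Tables 6.4 and 6.5; `area_pb(3, 5)` evaluates to 138, the gate area listed for m = 3 in Table 6.4.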
Figure 6.17: Gate area for polynomial basis and normal basis multiplications over
GF($2^m$). [Plot: gate area ($\times 10^6$, 0 to 14) versus m (0 to 600); curves for PB and NB
with B = 3, 4, 5, 6, 7.]
Figure 6.18: Number of interconnects for polynomial basis and normal basis
multiplications over GF($2^m$). [Plot: number of interconnects ($\times 10^6$, 0 to 6) versus m
(0 to 600); curves for PB and NB with B = 3, 4, 5, 6, 7.]
Figure 6.19: Latency for polynomial basis and normal basis multiplications over
GF($2^m$). [Plots: two panels with horizontal axis from 0 to 600; upper panel PB (0 to 25),
lower panel NB (0 to 15), each with B = 3, 4, 5, 6, 7.]
The gate area, number of interconnects, and latency of both PB and NB
implementations are illustrated for comparison with respect to different maximum fan-in
B in Figs. 6.17, 6.18, and 6.19, respectively. Though the closed-form expressions
in Eqs. (6.9)-(6.14) are derived for the discrete values $m = B'^l$ for $l = 1, 2, \cdots$,
we assume the expressions are valid for all m with $B \le \lfloor 3m/2 \rfloor$ or $B \le 3m - 2$. In
Figs. 6.17, 6.18, and 6.19, all the legends correspond to the discrete values $m = B'^l$
for $l = 1, 2, \cdots$. For $m \ne B'^l$, a multi-input XOR used in various implementations
is realized as a partial tree of $B'$-input XORs. We assume that the gate area and
number of interconnects of the partial tree of $B'$-input XORs increase linearly with
m. Since the gate area and number of interconnects of various implementations are
dominated by the multi-input XORs, all the curves corresponding to the hardware
complexity are smooth. In contrast, the curves corresponding to the latency are
step functions, since the latency is a ceiling function of m. Both polynomial basis
and normal basis implementations require multi-input XORs. Though the polynomial
basis implementation needs an additional $(m^2 - m)/2$ two-input XORs for computing
$\mathbf{Z}$, it has smaller complexity than the normal basis implementation. This is because
the gate area and number of interconnects are dominated by the multi-input XORs,
and the polynomial basis implementation requires m m-input XORs, compared with m
$(2m - 1)$-input XORs for the normal basis implementation. This is observed in
Figs. 6.17 and 6.18, where the complexity of the normal basis implementation is about
1.4 times that of the polynomial basis implementation. From Figs. 6.17 and 6.18, the
gate area and number of interconnects with respect to B = 4, 5 are the smallest for both
polynomial basis and normal basis implementations, since the complexities of both
implementations are dominated by multi-input XORs, which have the smallest complexities
when B = 4, 5 for the majority class, as explained in Sec. 6.4.4. From Fig. 6.19, the
polynomial basis multiplication has a smaller latency than the normal basis multiplication
for the same B.
Though the proposed implementations have similar structures to the bit-parallel
multiplications in CMOS technology [103], there are two main differences. First,
unlike CMOS technology, where a gate with n > 2 inputs will be implemented as a
tree of two-input gates for smaller area [103], our implementation uses majority-class
multi-input XORs, which significantly reduce the gate area, number of interconnects,
and latency. Second, the operation of RTD nanotechnology needs a four-phase
clocking scheme. The output is self-latched and the operation is suited for pipelining
by constructing a cascaded network of threshold gates.
Table 6.4: Comparison of our complexities of polynomial basis multipliers over
GF($2^m$) with those of [25] and [106] with or without preprocessing. In the Impl. column,
NP denotes no preprocessing, and B and A denote preprocessing by running script.Boolean
and script.algebraic, respectively. Columns: G, gate area; B, buffer area; A, total gate
area; I, number of interconnects; L, latency.
m   Impl.        G        B        A        I        L
3   [106], NP    524      104      628      275      5
3   [106], B     256      628      884      266      10
3   [106], A     242      324      566      183      9
3   [25]         515      128      643      263      4
3   Ours         138      116      254      83       5
4   [106], NP    4900     1676     5066     1915     6
4   [106], B     739      2704     3443     1015     16
4   [106], A     714      1004     1718     549      11
4   [25]         1916     784      2700     1036     5
4   Ours         286      264      550      176      7
5   [106], NP    17240    8732     25972    9715     7
5   [106], B     1688     2944     4632     1458     12
5   [106], A     1618     2928     4546     1380     11
5   [25]         7329     3688     11017    4093     6
5   Ours         485      388      873      282      7
6   [106], NP    82199    15084    97283    38926    7
6   [106], B     3132     7320     10452    3153     15
6   [106], A     3340     7968     11308    3321     16
6   [25]         28794    15720    44514    16314    7
6   Ours         669      560      1229     401      7
Table 6.5: Comparison of our complexities of normal basis multipliers over GF($2^m$)
with those of [25] and [106] with or without preprocessing. In the Impl. column, NP
denotes no preprocessing, and B and A denote preprocessing by running script.Boolean and
script.algebraic, respectively. Columns: G, gate area; B, buffer area; A, total gate
area; I, number of interconnects; L, latency.
m   Impl.        G        B        A        I        L
3   [106], NP    735      68       803      273      5
3   [106], B     385      1128     1513     464      13
3   [106], A     330      372      702      232      8
3   [25]         831      264      1095     435      6
3   Ours         213      84       297      102      5
4   [106], NP    4223     2464     6687     2428     6
4   [106], B     1395     9816     11211    3099     27
4   [106], A     1250     1804     3054     960      10
4   [25]         3892     1824     5716     2148     8
4   Ours         348      176      524      184      5
5   [106], NP    20975    9856     30831    11440    7
5   [106], B     3960     8152     12112    3816     12
5   [106], A     3540     6992     10532    3154     13
5   [25]         18425    9880     28305    10405    10
5   Ours         570      240      810      290      5
6   [106], NP    101117   14192    115309   46721    7
6   [106], B     9155     22040    31195    9466     15
6   [106], A     8105     19228    27333    7993     17
6   [25]         86766    48624    135390   49398    12
6   Ours         912      432      1344     468      7
6.5.6 Comparison with existing approaches
We focus on m-bit polynomial basis and normal basis multiplications and compare
the gate area, number of interconnects, and latency of our implementations
with those obtained via the generic synthesis approaches in [25, 106]. The
implementations in [106] are obtained with and without preprocessing. Two
preprocessed Boolean networks are obtained by running two scripts, script.Boolean
and script.algebraic, respectively. The fan-in B = 5 is chosen for our implementation,
since it minimizes the gate area of our implementation. For the implementations
in [106], we also assume the fan-in B = 5. For the implementations in [25], the fan-in
B = 3 is chosen by necessity, since only symmetric three-input majority gates are used.
For m = 3, 4, 5, 6, the complexities of our proposed multipliers and those synthesized
via the approaches in [25, 106] are shown in Tables 6.4 and 6.5 for polynomial basis
and normal basis multiplications, respectively, where the gate area G, buffer area B,
total gate area A, number of interconnects I, and latency L are shown in columns
3 through 7, respectively. Buffers are listed separately in Tables 6.4 and 6.5. Each
buffer consumes four RTDs and one interconnect. The total gate area is the sum
of the gate area and the buffer area. Due to the efficient implementation of multi-input
XORs, our implementation requires smaller area than those in [25, 106], as well as
fewer interconnects. For a 6-bit multiplier, the total gate area and number of
interconnects of our implementations are about one order of magnitude smaller than the
best results given in [25, 106]. The advantage of our implementation becomes more
significant as m grows. In elliptic curve cryptography, m can be as large as 512 [115].
For m = 512, our custom-designed implementation requires much smaller area than the
others. For the normal basis multiplier, the latency of ours is always smaller than that
in [25]. For the polynomial basis multiplier, the latency of ours is slightly greater
than that in [25] for m = 3, 4, 5. However, the latency of ours increases
logarithmically with m as shown in Eq. (6.11), compared with that in [25], which
increases linearly with m. Hence, for a large multiplier, the latency of ours is much
smaller than that in [25]. For both polynomial basis and normal basis
multiplications, the implementations without preprocessing in [106] have roughly
the same latency as ours, but require much larger gate area. The implementations
with preprocessing in [106] have larger latency than ours, as well as larger gate area.
Our analysis results show that our implementations of multipliers perform better
than the existing approaches in [25, 106] in terms of overall area complexity and latency.
6.6 Conclusion
This chapter has provided the designs of multipliers over GF(2m) using threshold
gates with bounded fan-in that are suitable for nanotechnology implementation.
Fan-in of nanotechnology gates influences their reliability and speed. Thus our de-
signs allow the trade-off between complexity, reliability and speed. A comparison of
our implementations of various multiplication architectures shows that they use less
gate area and fewer interconnects than those obtained by the approaches available
in the literature [25, 106].
Since most designs of GF($2^m$) architectures use a large number of XOR gates,
we have first focused on efficient designs of multi-input XORs using threshold gates.
We have shown that the implementations based on generalized majority gates have
smaller area and latency than those that use Boolean algebra approaches. Other
architectures over GF($2^m$) will also benefit from the use of the multi-input
XORs developed here.
Chapter 7
An Enhanced Multiway Sorting
Network Based on n-Sorters
7.1 Introduction
Sorting is an important operation in data processing, and hence its efficiency greatly
affects the overall performance of a wide variety of applications [31, 110]. Sorting
networks can achieve high throughput rates by performing operations simultaneously.
These parallel sorting networks have attracted the attention of researchers due
to increasing hardware speed and decreasing hardware cost. One of the most popular
sorting algorithms is the merge-sort algorithm, which performs the sorting in
two steps [31]. First, it divides the input list (a sequence of values) into multiple
sublists (smaller sequences of values) and sorts each sublist simultaneously. Then,
the sorted sublists are merged as a single sorted list. The sorting process of sublists
can then be decomposed recursively into the sorting and merging of even smaller
sublists, which are then merged as a single sorted list. Hence, the merging operation
is the key procedure for the decomposition-based sorting approach. One popular
2-way merging algorithm called odd-even merging [31] merges two sorted lists (odd
and even lists) into one sorted list. In [117], a modulo merge sorting was introduced
as a generalization of the odd-even merge by dividing the two sorted input lists into
multiple sublists with a modulo not limited to 2. Another popular 2-way merging
algorithm is the bitonic merging algorithm [109]. Two sorted lists are first arranged as
a bitonic list, which is then converted to obtain a sorted list. These 2-way merging
algorithms employ the 2-way merge procedure recursively and are capable of
sorting N values in $O(\log^2 N)$ stages [31]. In [118], a sorting network, named the AKS
sorting network, with $O(\log N)$ stages was proposed. However, there is a very large
constant in the depth expression, which makes it impractical. Recently, a modular
design of high-throughput low-latency sorting units was proposed in [119]. However,
the basic building block in these 2-way merging algorithms is a 2-sorter, which is
simply a $2 \times 2$ switching element or comparator as shown in Fig. 7.1(a).
Instead of using 2-sorters, n-sorters can be used as basic building blocks. This
was first proposed as a generalization of Batcher's odd-even merging algorithm [111].
It was also motivated by the use of n-sorters, which sort n ($n \ge 2$)
values in unit time [112, 120]. Since large sorters are used as basic building blocks,
the number of sorters as well as the latency is expected to be reduced greatly. An
n-way merging algorithm was first proposed by Lee and Batcher [111], where n is
not restricted to 2. A version of the bitonic n-way merging algorithm was proposed
by Nakatani et al. [121, 122]. However, the combining operation in these n-way merging
algorithms still uses 2-sorters as basic building blocks. Leighton proposed an
algorithm for sorting r lists of c values each, represented as an $r \times c$ matrix [123].
This algorithm is a generalization of the odd-even merge-sort and is named columnsort,
since it merges all sorted columns to obtain a single sorted list in row order.
In the original columnsort, no specific operation was provided for sorting columns
and no recursive construction of a sorting network was provided. In [112], a modified
columnsort algorithm was proposed with sorting networks constructed from
n-sorters ($n \ge 2$) [124]. However, a 2-way merge is still used for the merging process.
In [113], an n-way merging algorithm, named SS-Mk, based on the modified
columnsort was proposed with n-sorters as basic building blocks, where n is prime.
For n sorted lists of m values each, the idea is to sort the $m \times n$ values first in each
row and then in slope lines with decreasing slope rates. An improved version of the
SS-Mk merge sort, called ISS-Mk, was provided in [125], where n can be any integer.
We compare our sorting scheme with the SS-Mk but not the ISS-Mk, because for the
ranges of N of interest, the ISS-Mk requires larger latency due to a large constant.
In this chapter, we propose an n-way merging algorithm, which generalizes the
odd-even merge by using n-sorters as basic building blocks, where n ($\ge 2$) is prime.
Based on this merging algorithm, we also propose a sorting algorithm. For $N = n^p$
input values, $p + \lceil n/2 \rceil \times \frac{p(p-1)}{2}$ stages are needed. The complexity of the sorting
network is evaluated by the total number of n-sorters. The closed-form expression
for the number of sorters is also derived.
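The stage count stated above is easy to evaluate for concrete parameters. The following one-function sketch is only a transcription of that expression; the name `num_stages` is an illustrative assumption.

```python
def num_stages(n, p):
    """Stages needed to sort N = n**p values: p + ceil(n / 2) * p * (p - 1) / 2."""
    ceil_half_n = (n + 1) // 2          # ceil(n / 2) for integer n
    return p + ceil_half_n * (p * (p - 1) // 2)
```

For n = 2 the expression reduces to p(p + 1)/2, which is the depth of Batcher's odd-even merge sort on $2^p$ inputs; for example, `num_stages(2, 3)` gives 6 for N = 8.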
Instead of 2-sorters, n-sorters (n > 2) are used as basic blocks in this chapter.
This is because larger sorters can be implemented efficiently. For example, for
binary sorting in threshold logic, the area of an n-sorter scales linearly with the
number of inputs n, while the latency stays constant. Hence, a smaller number
of sorters and a smaller latency of the whole sorting network can be achieved. However,
we cannot use arbitrarily large sorters as basic blocks, since larger sorters are more
complex and more difficult to implement. Hence, the benefit of using a larger block
diminishes with increasing n. We assume that the size of the basic sorter satisfies
$n \le 20$ or $n \le 10$ when evaluating the number of sorters and latency. Our algorithm
works for any upper bound on n, and one can plug any upper bound on n into our algorithm.
Asymptotically, the number of sorters required by our sorting algorithm is on the
same order of $O(N \log^2 N)$ as the SS-Mk [113] for sorting N inputs. Our sorting
algorithm requires fewer sorters than the SS-Mk in [113] in wide ranges of N. For
instance, for $n \le 20$, when $N \le 1.46 \times 10^4$, our algorithm requires up to 46% fewer
sorters than the SS-Mk. When $1.46 \times 10^4 < N \le 1.3 \times 10^5$, our algorithm has fewer
sorters for some segments of N's. When $N > 1.3 \times 10^5$, our algorithm needs more
sorters.
The work in this chapter is different from previous works [111, 113, 125] in the
following aspects:
• While the multiway merge [111] uses 2-sorters in the combining network, our
proposed n-way merging algorithm uses n-sorters as basic building blocks. By
using larger sorters (n > 2), the number of sorters as well as the latency would
be reduced greatly.
• The merge-based sorting algorithms in [113, 125] are based on the modified
columnsort [124], which merges sorted columns as a single sorted list in row
order. Our n-way merge sorting algorithm is a direct generalization of the
multiway merge sorting in [111].
• We analyze the performance of our approach by deriving the closed-form ex-
pressions of the latency and the number of sorters. We also derive the closed-
form expression of the number of sorters for the SS-Mk [113], since it was not
provided in [113]. Then we present extensive comparisons between the latency
and the number of sorters required by our approach and the SS-Mk [113].
• Finally, we show an implementation of a binary sorting network in threshold
logic. With an implementation of a large sorter in threshold logic, we compare
the performance of sorting networks in terms of the number of gates.
The rest of the chapter is organized as follows. In Sec. 7.2, we briefly review
the background of sorting networks. In Sec. 7.3, we propose a multiway merging
algorithm with n-sorters as basic blocks. In Sec. 7.4, we introduce a multiway sorting
algorithm based on the proposed merging algorithm, and show extensive results for
the comparison of our sorting algorithm and previous works. In Sec. 7.5, we focus
on a binary sorting network, where basic sorters are implemented by threshold logic
and have complexity linear with the input size, and measure the complexity in terms
of number of gates. Finally Sec. 7.6 presents the conclusion of this chapter.
7.2 Background
A sorting network is a feedforward network, which gives a sorted list for unsorted
inputs. It is composed of two types of items: switching elements (or comparators)
and wires. The depth of a comparator is defined to be the length of the longest path
from the inputs of the sorting network to that comparator's outputs. The latency of
the sorting network is the maximum depth of all comparators. The network is
oblivious in the sense that the time and location of input and output are fixed
ahead of time and not dependent on the values [31]. We use the Knuth diagram
in [110] for easy representation of the sorting networks, where switching elements
are denoted by connections on a set of wires. The inputs enter at one side and
sorted values are output at the other side, and what remains is how to arrange
the switching elements. The sorting network is measured in two aspects, latency
(number of stages) and complexity (number of sorters). The basic building block
used by the odd-even merge [31] is a 2-by-2 comparator (compare-exchange element).
It receives two inputs and outputs the minimum and maximum in an ordered way.
The symbol for a 2-sorter is shown in Fig. 7.1(a), where $x_i$ and $y_i$ for i = 1, 2 are
input and output, respectively. Similarly, an n-sorter is a device sorting n values
in unit time. The symbol for an n-sorter is shown in Fig. 7.1(b), where $x_i$ and
$y_i$ for $i = 1, 2, \cdots, n$ are input and output, respectively, and the output satisfies
$y_1 \le y_2 \le \cdots \le y_n$. In this chapter, we denote the sorted values $y_1 \le y_2 \le \cdots \le y_n$
by $\langle y_1, y_2, \cdots, y_n \rangle$ and use n-sorters as basic blocks for sorting.
Figure 7.1: (a) 2-sorter ($y_1 \le y_2$); (b) n-sorter ($y_1 \le y_2 \le \cdots \le y_n$).
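The two measures defined above, latency and complexity, can be made concrete with a small software model of a sorting network. In this sketch a network is a list of stages, each stage is a list of sorters, and each sorter is a tuple of wire indices; this representation and the names `apply_network`, `latency`, `complexity`, and `FOUR_INPUT_NETWORK` are illustrative assumptions, not notation from this chapter.

```python
def apply_network(stages, values):
    """Run a sorting network on a list of values.

    Each sorter is a tuple of wire indices in ascending order; it places the
    sorted values back on those wires (smallest on the first index). Sorters
    within one stage are assumed to touch disjoint wires.
    """
    wires = list(values)
    for stage in stages:
        for sorter in stage:
            ordered = sorted(wires[w] for w in sorter)
            for w, v in zip(sorter, ordered):
                wires[w] = v
    return wires


def latency(stages):
    """Latency measured as the number of stages."""
    return len(stages)


def complexity(stages):
    """Complexity measured as the total number of sorters."""
    return sum(len(stage) for stage in stages)


# A 4-input network built from 2-sorters only (3 stages, 5 sorters).
FOUR_INPUT_NETWORK = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]

# The same 4 values sorted by a single 4-sorter (1 stage, 1 sorter).
SINGLE_SORTER_NETWORK = [[(0, 1, 2, 3)]]
```

Both example networks sort `[4, 3, 2, 1]` into `[1, 2, 3, 4]`; the first needs 3 stages and 5 sorters, while the second needs one stage and one (larger) sorter, which is the trade-off that motivates n-sorters.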
Merging-based sorting networks are an important family of sorting networks,
where the merging operation is the key. There are two classes of merging algorithms,
the odd-even merging [31] and the bitonic merging [109]. The former is an efficient
sorting technique based on the divide-and-conquer approach, which decomposes the
inputs into two sublists (odd and even), sorts each sublist, and then merges the two
sorted lists into one. Further decomposition and merging operations are applied on
the sublists. An example of an odd-even merging network using 2-sorters is shown in
Fig. 7.2, where two sorted lists, $\langle x^{(0)}_{1,1}, \cdots, x^{(0)}_{1,4} \rangle$ and $\langle x^{(0)}_{2,1}, \cdots, x^{(0)}_{2,4} \rangle$, are merged as a
single list $\langle x^{(2)}_{1,1}, \cdots, x^{(2)}_{1,4}, x^{(2)}_{2,1}, \cdots, x^{(2)}_{2,4} \rangle$ in two stages.
Figure 7.2: The odd-even merge of two sorted lists of 4 values each using 2-sorters.
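The 2-way odd-even merge can be sketched recursively in a few lines. The code below follows the standard textbook form of Batcher's odd-even merge rather than the exact wiring of Fig. 7.2; it assumes two sorted lists of equal power-of-two length, and the function name `odd_even_merge` is illustrative. With 0-based indexing, the "even" sublists here hold the 1st, 3rd, ... values of each list.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length using only 2-sorters."""
    if len(a) == 1:
        # A single 2-sorter (compare-exchange element).
        return [min(a[0], b[0]), max(a[0], b[0])]
    evens = odd_even_merge(a[0::2], b[0::2])   # merge the even-indexed sublists
    odds = odd_even_merge(a[1::2], b[1::2])    # merge the odd-indexed sublists
    merged = [evens[0]]
    # Final stage: one 2-sorter per pair (evens[k + 1], odds[k]).
    for e, o in zip(evens[1:], odds[:-1]):
        merged += [min(e, o), max(e, o)]
    merged.append(odds[-1])
    return merged
```

For example, `odd_even_merge([1, 4, 6, 7], [2, 3, 5, 8])` returns `[1, 2, 3, 4, 5, 6, 7, 8]`.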
Instead of merging two lists, multiple sorted lists can be merged as a single sorted
list simultaneously. An n-way merger ($n \ge 2$) of size m is a network merging n
sorted lists of size m (m values) each into a single sorted list in multiple stages. This
was first proposed as a generalization of Batcher's odd-even merging algorithm.
It is also motivated by the use of n-sorters, which sort n ($n \ge 2$) values in unit time
[112, 120]. Since large sorters are used as basic building blocks, the number of sorters
as well as the latency is expected to be reduced greatly. Many multiway merging
algorithms exist in the literature [111–113, 121–127]. The algorithms in [126, 127]
implement multiway merge using 2-sorters. In [111], a generalization of Batcher's
odd-even merge is introduced as shown in Fig. 7.3, where an n-way merger of n lists of
size $ud$ is decomposed into d n-way mergers of n sublists of size u plus a combining
network. Each of the small n-way mergers is further decomposed similarly. However,
the combining network in the merging network in Fig. 7.3 still uses 2-sorters as
basic blocks. In [123], Leighton proposed a columnsort algorithm, which showed
how to sort an $m \times n$ matrix denoting the n sorted lists of m values each. A
modification of Leighton's columnsort algorithm was given in [112]. In [113, 125],
merging networks with n-sorters as basic blocks are introduced based on the modified
Leighton's columnsort algorithm.
Figure 7.3: Iterative construction rule for the n-way merger [111].
In this chapter, we focus on multiway merge sort with binary values as inputs.
Our merge sort also works for arbitrary values, which is justified by the following
theorem.
Theorem 7.2.1 (Zero-one principle [31]). If a network with n input lines sorts all
$2^n$ lists of 0s and 1s into nondecreasing order, it will sort any arbitrary list of n
values into nondecreasing order.
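The zero-one principle suggests a simple exhaustive test for small networks: check only the $2^n$ binary inputs. The sketch below illustrates this on a 4-input network of 2-sorters; the list-of-stages representation and the names `apply_network`, `sorts_all_binary`, and `sorts_all_permutations` are illustrative assumptions.

```python
from itertools import permutations, product


def apply_network(stages, values):
    """Run a network given as stages of sorters (tuples of ascending wire indices)."""
    wires = list(values)
    for stage in stages:
        for sorter in stage:
            ordered = sorted(wires[w] for w in sorter)
            for w, v in zip(sorter, ordered):
                wires[w] = v
    return wires


def sorts_all_binary(stages, n):
    """True if the network sorts every one of the 2**n lists of 0s and 1s."""
    return all(apply_network(stages, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


def sorts_all_permutations(stages, n):
    """Brute-force check on all n! orderings of distinct values (small n only)."""
    return all(apply_network(stages, perm) == list(range(n))
               for perm in permutations(range(n)))


GOOD = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
BROKEN = GOOD[:2]   # last stage removed, so the input (0, 1, 0, 1) is not sorted
```

Here `GOOD` passes both checks, while `BROKEN` already fails the binary check, so the 16 binary inputs are enough to expose the missing comparator.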
7.3 Multiway Merging
In the following, we propose an n-way merging algorithm with n-sorters as basic
building blocks as shown in Alg. 5. We consider a sorting network, where all
iterations of Alg. 5 are simultaneously instantiated (loop unrolling). We refer to the
instantiation of iteration i of Alg. 5 as stage i of the sorting network. The sorters
in the last for loop in Alg. 5 constitute the last stage. Let the n sorted input lists be
$\langle x^{(0)}_{j,1}, x^{(0)}_{j,2}, \cdots, x^{(0)}_{j,m} \rangle$ for $j = 1, \cdots, n$. Denote the values of the j-th list after stage k by
$(x^{(k)}_{j,1}, x^{(k)}_{j,2}, \cdots, x^{(k)}_{j,m})$. After $T = 1 + \lceil \frac{m}{2} \rceil$ stages, all input lists are sorted as a single
list, $\langle x^{(T)}_{1,1}, x^{(T)}_{1,2}, \cdots, x^{(T)}_{1,m} \rangle, \langle x^{(T)}_{2,1}, x^{(T)}_{2,2}, \cdots, x^{(T)}_{2,m} \rangle, \cdots, \langle x^{(T)}_{n,1}, x^{(T)}_{n,2}, \cdots, x^{(T)}_{n,m} \rangle$.
For convenience of describing and proving our algorithm, we introduce some
notations and definitions. Denote the number of zeros in the j-th list after stage i
as $r^{(i)}_j$, where $i = 1, 2, \cdots, \lceil \frac{m}{2} \rceil + 1$ and $j = 1, \cdots, n$. A sorter is called a k-spaced
sorter if its adjacent inputs span k other wires and each connection of the same
sorter comes from different lists of m wires, where $0 \le k \le m - 1$. For simplicity,
we arrange the sorters in the order of their first connections in each stage. Denote
$\{1, 2, \cdots, m\}$ as $Z_m$. Two k-spaced sorters are said to be adjacent if they connect
adjacent two wires, $x^{(i)}_{j,k}$ and $x^{(i)}_{j,k+1}$, respectively, for some $j \in Z_m$ and $k \in Z_{m-1}$.
Then, our n-way merging Alg. 5 can be intuitively understood as flooding lists with
zeros in descending order. The correctness of Alg. 5 can be shown by first proving
the following lemmas. See the appendix for the proofs of the following lemmas and
theorems.
Algorithm 5 Algorithm for n-way merging network.
Input: n sorted lists $\langle x^{(0)}_{j,1}, x^{(0)}_{j,2}, \cdots, x^{(0)}_{j,m} \rangle$ for $j = 1, \cdots, n$;
i = 1;
while $i \le \lceil \frac{m}{2} \rceil$ do
    for j = 1 to n − 1 do
        Apply (m − i)-spaced sorters between lists j and j + 1;
    end for
    Merge all (m − i)-spaced sorters;
    Update n sorted lists $\langle x^{(i)}_{j,1}, x^{(i)}_{j,2}, \cdots, x^{(i)}_{j,m} \rangle$ for $j = 1, \cdots, n$;
    i = i + 1;
end while
for j = 1 to n − 1 do
    Apply (m − 1)-sorters on m − 1 adjacent lines with first half, $x^{(i-1)}_{j,m-k}$, from list
    j and second half, $x^{(i-1)}_{j+1,k}$, from list j + 1, where $k = 1, \cdots, \frac{m-1}{2}$;
end for
Output: Sorted lists.
Figure 7.4: The network for n sorted lists of m wires.
Lemma 7.3.1. Apply $(m-1)$-spaced sorters to $n$ lists of $m$ values, $\langle x_{j,1}, x_{j,2}, \cdots, x_{j,m}\rangle$,
for $j = 1, \cdots, n$. The outputs of each list are still sorted, $\langle x'_{j,1}, x'_{j,2}, \cdots, x'_{j,m}\rangle$, for
$j = 1, 2, \cdots, n$.

Figure 7.5: Two adjacent sorters $S_1$ and $S_2$ in each stage of Alg. 5 can be classified
into four cases. (a) Case I ($\Delta = \frac{v-w}{b-a}$); (b) Case II ($\Delta = \frac{v-1}{b-a+1}$); (c) Case III
($\Delta = \frac{m-w+1}{b-a+1}$); (d) Case IV ($\Delta = \frac{m}{b-a+2}$).
For $n$ sorted lists of $m$ values, there are $m$ $(m-1)$-spaced sorters as illustrated
in Fig. 7.4(a). The proof of the lemma can be reduced to showing that any two
wires $s, s + l \in Z_m$ of each list connected by the $s$-th and $(s + l)$-th sorters are sorted.
The simplified network is shown in Fig. 7.4(b). Without loss of generality, we can
choose $l = 1$.
Lemma 7.3.2. In each stage of Alg. 5, there are at most four cases of adjacent two
sorters as shown in Fig. 7.5. If m is prime, case IV is impossible.
We first show that the first connections of two adjacent sorters, $S_1$ and $S_2$,
belong to either the same list or two adjacent lists. The same relation is true for
the last connections of $S_1$ and $S_2$. This gives us a total of four cases as shown in
Fig. 7.5, where $b \ge a + 1$ for Fig. 7.5(a)-(c), and $b \ge a$ for Fig. 7.5(d) such that $S_1$
and $S_2$ have a size of at least two.
The following theorem proves the correctness of Alg. 5.
Theorem 7.3.1. For a prime m in Alg. 5, all lists are self-sorted after every stage.
In particular, all lists are sorted after the final stage.
The theorem can be proved by induction on i.
In Alg. 5, the latency increases linearly with $\lceil m/2 \rceil$. When $m$ is large, the latency
is also very large. By further decomposing $m$ into a product of small factors, we can
reduce the latency significantly. In the following, we propose Alg. 6 for merging $n$
lists of $m$ values, where $m = n^{p-1}$ for $p \ge 2$. When $m$ is not a power of $n$, we can use
a larger network of $m' = n^{p'} > m$ inputs. For any $q$ in stage $i$ ($2 \le i \le p - 1$), denote
the number of zeros in each newly formed list after stage $i$ as $r^{(i)}_{j,q}$, where $j = 1, \cdots, n^i$.
Assume two dummy lists with $r^{(i)}_{0,q} = n$ and $r^{(i)}_{n^i+1,q} = 0$ are appended to the two ends
of the $n^i$ lists. The correctness of Alg. 6 can be shown by first proving the following
lemma.
Lemma 7.3.3. In Alg. 6, the new lists in stage $i$ with respect to $q$ are self-sorted.
The numbers of zeros of all new lists after stage $i$ are non-increasing,
$$r^{(i)}_{j,q} \ge r^{(i)}_{j+1,q} \quad \text{for } j = 1, \cdots, n^i - 1,$$
where $i = 2, \cdots, p - 1$ and $q = 1, \cdots, n^{p-1-i}$. Furthermore, there are at most $n$
consecutive lists that have between 1 and $n - 1$ zeros,
$$r^{(i)}_{s,q} = n > r^{(i)}_{s+1,q} \ge \cdots \ge r^{(i)}_{s+l,q} > 0 = r^{(i)}_{s+l+1,q} \quad \text{for } l \le n,$$
where $s \ge 0$ and $s + l \le n^i$.

Algorithm 6 Algorithm for combining n lists of $m = n^{p-1}$ values.
Input: $n$ sorted lists $\langle x^{(0)}_{j,1}, x^{(0)}_{j,2}, \cdots, x^{(0)}_{j,m}\rangle$ for $j = 1, \cdots, n$ and $m = n^{p-1}$;
$i = 1$;
for $q = 1$ to $n^{p-2}$ do
    Apply Alg. 5 on $\langle x^{(0)}_{j,q}, x^{(0)}_{j,n^{p-2}+q}, x^{(0)}_{j,2n^{p-2}+q}, \cdots, x^{(0)}_{j,(n-1)n^{p-2}+q}\rangle$
    for $j = 1, \cdots, n$ and obtain a single sorted list
    $\langle x^{(1)}_{1,q}, x^{(1)}_{1,n^{p-2}+q}, \cdots, x^{(1)}_{1,(n-1)n^{p-2}+q}, x^{(1)}_{2,q}, x^{(1)}_{2,n^{p-2}+q}, \cdots, x^{(1)}_{2,(n-1)n^{p-2}+q}, \cdots, x^{(1)}_{n,q}, x^{(1)}_{n,n^{p-2}+q}, \cdots, x^{(1)}_{n,(n-1)n^{p-2}+q}\rangle$;
end for
for $i = 2$ to $p - 1$ do
    for $q = 1$ to $n^{p-1-i}$ do
        Group $n$ adjacent values of $\langle x^{(i-1)}_{j,q}, x^{(i-1)}_{j,n^{p-i-1}+q}, x^{(i-1)}_{j,2n^{p-i-1}+q}, \cdots, x^{(i-1)}_{j,(n-1)n^{p-i-1}+q}\rangle$
        for $j = 1, \cdots, n$ and denote the new lists as
        $\langle x^{(i-1)}_{j,q}, x^{(i-1)}_{j,n^{p-i-1}+q}, \cdots, x^{(i-1)}_{j,(n-1)n^{p-i-1}+q}\rangle$ for $j = 1, \cdots, n^i$;
        for $k = 2$ to $\lceil n/2 \rceil$ do
            Apply $(n - k)$-spaced sorters between lists $j$ and $j + 1$;
        end for
        Apply $(n - 1)$-sorters between lists $j$ and $j + 1$ for $j = 1, \cdots, n^i - 1$;
        Obtain a single sorted list $\langle x^{(i)}_{1,q}, x^{(i)}_{1,n^{p-i-1}+q}, \cdots, x^{(i)}_{1,(n-1)n^{p-i-1}+q}, x^{(i)}_{2,q}, x^{(i)}_{2,n^{p-i-1}+q}, \cdots, x^{(i)}_{2,(n-1)n^{p-i-1}+q}, \cdots, x^{(i)}_{n^i,q}, x^{(i)}_{n^i,n^{p-i-1}+q}, \cdots, x^{(i)}_{n^i,(n-1)n^{p-i-1}+q}\rangle$;
    end for
end for
Output: Sorted list.

Figure 7.6: A 3-way merging network of $N = 3 \times 7$ inputs implemented via 7 stages.
See Sec. A.4 for the proof.
The following theorem proves the correctness of Alg. 6.
Theorem 7.3.2. Alg. 6 combines $n$ sorted lists of $m = n^{p-1}$ values as a single
sorted list.
In Alg. 6, the latency is reduced to $1 + (p - 1)\lceil n/2 \rceil$ for $n$ sorted lists of $m = n^{p-1}$
values.
In the following, we show two examples to compare the two algorithms.
First, a 3-way merging network of $N = 3 \times 7$ inputs via Alg. 5 is shown in Fig. 7.6.
Then, a 3-way merging network of $N = 3 \times 9$ inputs via Alg. 6 is shown in Fig. 7.7.
Though Fig. 7.7 has more inputs than Fig. 7.6, the latency of Alg. 6
is smaller due to recursive decomposition. The networks in Figs. 7.6 and
7.7 use 40 and 41 sorters, respectively, so the six additional inputs in Fig. 7.7
cost only one more sorter. Hence, Alg. 6 can be more efficient than Alg. 5 for a large
$m$.
Figure 7.7: A 3-way merging network of $N = 3 \times 9$ inputs implemented via 5 stages.
Figure 7.8: A 3-way sorting network of $N = 3^3$ inputs implemented via 9 stages.
7.4 Multiway Sorting
In this section, we first focus on how to construct sorting networks with $n$-sorters
using the multiway merging algorithm in Sec. 7.3. Then, we analyze the latency and
the number of sorters of the proposed sorting networks by deriving closed-form
expressions. We compare them with the previously proposed SS-Mk in [113] but not
the ISS-Mk [125], because for the ranges of $N$ of interest, the ISS-Mk has a larger
latency due to a large constant.
7.4.1 Multiway sorting algorithm
Based on the multiway merging algorithm in Sec. 7.3, we propose a parallel sorting
algorithm using a divide-and-conquer method. The idea is to first decompose a large
list of inputs into smaller sublists, then sort each sublist, and finally merge them
into one sorted list. The sorting of each sublist is done by further decomposition.
For instance, for $N = n^p$ inputs, we first divide the $n^p$ inputs into $n$ lists of $n^{p-1}$
values. Then we sort each of these $n$ lists and combine them with Alg. 6. The
sorting operation of each of the $n$ lists is done by dividing the $n^{p-1}$ inputs into $n$
smaller lists of $n^{p-2}$ values. We repeat the above operations until each of the $n$
smaller lists contains only $n$ values, which can be sorted by a single $n$-sorter. The
detailed procedures are shown in Alg. 7.
For example, a 3-way sorting network of $N = 3^3$ inputs is shown in Fig. 7.8. The
first stage contains nine 3-sorters. The second stage contains three 3-way mergers with
a depth of 3. The last stage contains one 3-way merger with a depth of 5. The
total depth is $1 + 3 + 5 = 9$.
Algorithm 7 Algorithm for sorting $N = n^p$ values.
Input: $N = n^p$ values, $x^{(0)}_1, x^{(0)}_2, \cdots, x^{(0)}_{n^p}$;
Partition the $N = n^p$ values as $n^{p-1}$ lists of $n$ values each, $(x^{(0)}_{j,1}, x^{(0)}_{j,2}, \cdots, x^{(0)}_{j,n})$ for
$j = 1, \cdots, n^{p-1}$;
Apply one $n$-sorter on each of the $n^{p-1}$ lists and obtain $\langle x^{(1)}_{j,1}, x^{(1)}_{j,2}, \cdots, x^{(1)}_{j,n}\rangle$ for
$j = 1, \cdots, n^{p-1}$;
for $i = 2$ to $p$ do
    for $j = 1$ to $n^{p-i}$ do
        Apply Alg. 5 on $\langle x^{(i-1)}_{(j-1)n+k,1}, x^{(i-1)}_{(j-1)n+k,2}, \cdots, x^{(i-1)}_{(j-1)n+k,n^{i-1}}\rangle$ for $k = 1, \cdots, n$,
        and obtain a single sorted list $\langle x^{(i)}_{j,1}, x^{(i)}_{j,2}, \cdots, x^{(i)}_{j,n^i}\rangle$;
    end for
end for
Output: Sorted list.
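
The divide-and-conquer structure of Alg. 7 can be mimicked in software. The following Python sketch is only an analogy under stated assumptions: the function name `multiway_sort` is hypothetical, `sorted` stands in for a single n-sorter, and `heapq.merge` stands in for the n-way merging network. It does not model the comparator wiring of Alg. 5 or Alg. 6.

```python
import heapq


def multiway_sort(values, n):
    """Sort len(values) == n**p items by n-way divide and conquer.

    Software analogy only (assumption): sorted() plays the role of one
    n-sorter and heapq.merge plays the role of the n-way merger.
    """
    if len(values) <= n:
        return sorted(values)              # base case: a single n-sorter
    assert len(values) % n == 0, "input length must be a power of n"
    size = len(values) // n                # each sublist holds n**(p-1) values
    sublists = [multiway_sort(values[k * size:(k + 1) * size], n)
                for k in range(n)]
    return list(heapq.merge(*sublists))    # stand-in for the n-way merger


# N = 3**3 distinct inputs, as in the 3-way example of Fig. 7.8.
data = [(7 * k) % 27 for k in range(27)]
result = multiway_sort(data, 3)
```

For `n = 3` and 27 inputs the recursion bottoms out in nine length-3 sublists, mirroring the nine 3-sorters of the first stage described above.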
7.4.2 Latency analysis
First, we focus on the latency for sorting $N$ values. The latency is defined as the
number of basic sorters in the longest paths from the inputs to the sorted output.
In Alg. 7, there are $p$ iterations. In iteration $i$, there are $n^i$ merging networks, each
of which is to merge $n$ sorted lists of $n^{p-i}$ values. For iteration $i$, the latency is given
by $L_{\text{our}}(n, n^{i-1}) = 1 + (i - 1)\lceil n/2 \rceil$. For a sorting network of $N = n^p$ values via Alg. 7,
by summing up the latencies of all levels, we obtain the total latency
$$L_{\text{our}}(n^p) = \sum_{i=1}^{p} L_{\text{our}}(n, n^{i-1}) = p + \lceil n/2 \rceil \times \frac{p(p-1)}{2}. \tag{7.1}$$
The closed-form expression of latency for the SS-Mk given in [113] is
$$L_{\text{SS-Mk}}(n^p) = 1 + (p - 1)n + \frac{(p-1)(p-2)}{2}\lceil \log_2 n \rceil. \tag{7.2}$$
We compare our latency for sorting $N = n^p$ values with that for the SS-Mk
in [113]. From Eqs. (7.1) and (7.2), for $N = n^p$ inputs, $p$ should be as small
as possible to obtain small latencies. In Table 7.1, we compare the latencies of
Eqs. (7.1) and (7.2) for small $p$ ($p = 2, 3, 4$). It is easily seen that our implementation
has a smaller latency than the SS-Mk in [113] for a prime $n$ greater than 3. It is also
observed that $L_{\text{our}}(2^p) = L_{\text{SS-Mk}}(2^p) = p(p + 1)/2$ for $n = 2$, which is the same as
the odd-even merge sort in [31].
Table 7.1: Comparison of latencies of sorting networks of $N = n^p$ inputs via the
SS-Mk in [113] and our implementation.
          p = 2          p = 3                    p = 4
[113]     1 + n          1 + 2n + ⌈log_2 n⌉       1 + 3n + 3⌈log_2 n⌉
Ours      2 + ⌈n/2⌉      3 + 3⌈n/2⌉               4 + 6⌈n/2⌉
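
The closed forms in Eqs. (7.1) and (7.2) are easy to evaluate numerically. The sketch below is a minimal Python illustration; the function names `latency_ours` and `latency_ss_mk` are hypothetical, and the code simply transcribes the two equations.

```python
def ceil_half(n):
    """Return ceil(n / 2) using integer arithmetic."""
    return (n + 1) // 2


def ceil_log2(n):
    """Return ceil(log2(n)) for an integer n >= 1."""
    return (n - 1).bit_length()


def latency_ours(n, p):
    """Eq. (7.1): p + ceil(n/2) * p(p-1)/2."""
    return p + ceil_half(n) * p * (p - 1) // 2


def latency_ss_mk(n, p):
    """Eq. (7.2): 1 + (p-1)n + (p-1)(p-2)/2 * ceil(log2 n)."""
    return 1 + (p - 1) * n + (p - 1) * (p - 2) // 2 * ceil_log2(n)


# Per-iteration latencies 1 + (i-1)*ceil(n/2) for n = 3, p = 3.
stages = [1 + (i - 1) * ceil_half(3) for i in range(1, 4)]
total = latency_ours(3, 3)
```

For `n = 3` and `p = 3` the per-iteration latencies are 1, 3, and 5, which sum to the total depth of 9 quoted for Fig. 7.8.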
7.4.3 Analysis of the number of sorters
In the following, we compare the number of sorters of our algorithms with the SS-Mk
in [113]. Since the distribution of sorters for an arbitrary sorting network of $N$ inputs
is not known, we assume that any $m$-sorter ($m < n$) has the same delay and area as
the basic $n$-sorter and count the number of sorters. We first derive the closed-form
expression of the number of sorters for sorting $N$ values via our Alg. 7. Since the
expression of the number of sorters for the SS-Mk was not provided in [113], we also
derive the corresponding closed-form expression and compare it with our algorithm.
The whole sorting network is constructed recursively by merging small sorted lists
into a larger sorted list. We first derive the number of sorters of a merging network
of $n$ lists of $n^{p-i}$ values, which is given by
$$S_{\text{our}}(n, n^{p-i}) = (p - i) \cdot M^*_{n^{p-i}} + \frac{n^{p-i} - 1}{n - 1} \cdot C^*_n + n^{p-i},$$
where $M^*_{n^{p-i}} = \left(1 + \frac{\lceil n/2 \rceil(\lceil n/2 \rceil - 1)}{2}\right) n^{p-i}$ and $C^*_n = (\lceil n/2 \rceil - 1)n - \frac{3\lceil n/2 \rceil(\lceil n/2 \rceil - 1)}{2} - 1$.
By summing up the numbers of sorters of all mergers in all stages, we obtain the
total number of sorters, which is given by
$$T_{\text{our}}(n^p) = \sum_{i=1}^{p-1} n^{i-1} \cdot S_{\text{our}}(n, n^{p-i}) + n^{p-1} = \frac{p(p-1)}{2} \cdot M^*_{n^{p-1}} + \left[\frac{(p-1)n^{p-1}}{n-1} - \frac{n^{p-1}-1}{(n-1)^2}\right] \cdot C^*_n + p\,n^{p-1}. \tag{7.3}$$
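As $N \to \infty$, $T_{\text{our}}(n^p)$ is on the order of
$$O\left(A_1 \frac{N \log N (\log N - \log n)}{(\log n)^2 / n} + A_2 \frac{N(\log N - \log n)}{\log n} + A_3 \frac{N \log N}{n \log n}\right).$$
Similarly for the SS-Mk in [113], the number of sorters of the merging
network of $n$ lists of $n^{p-i}$ values each is given by
$$S_{\text{SS-Mk}}(n, n^{p-i}) = M^{\dagger}_{n^{p-i}} + K^{\dagger}_{n,n^{p-i}} + C^{\dagger}_n,$$
where
$$M^{\dagger}_{n^{p-i}} = \left(\frac{(n + 1 - \lceil n/2 \rceil)(n - \lceil n/2 \rceil)}{2} + \frac{(\lceil n/2 \rceil + 1)(\lceil n/2 \rceil - 2)}{2} + 2\right) n^{p-i},$$
$$K^{\dagger}_{n,n^{p-i}} = \lceil \log_2 n^{p-1-i} \rceil\, n^{p-i} + (n - 3)\, 2^{\lceil \log_2 n^{p-1-i} \rceil + 1},$$
and
$$C^{\dagger}_n = (\lceil n/2 \rceil - 2)n - \frac{3(\lceil n/2 + 1 \rceil)(\lceil n/2 \rceil - 2)}{2} - \frac{(n + 1 - \lceil n/2 \rceil)(n - \lceil n/2 \rceil)}{2} - (n - 3).$$
The total number of sorters of the sorting network via the SS-Mk in [113] is given
by
$$T_{\text{SS-Mk}}(n^p) = \sum_{i=1}^{p-1} n^{i-1} \cdot S_{\text{SS-Mk}}(n, n^{p-i}) + n^{p-1} = (p - 1) \cdot M^{\dagger}_{n^{p-1}} + \frac{n^{p-1} - 1}{n - 1} \cdot C^{\dagger}_n + n^{p-1} + n^{p-1} \sum_{i=1}^{p-2} \lceil i \log_2 n \rceil + \sum_{i=1}^{p-1} n^{i-1}(n - 3)\, 2^{\lceil (p-1-i) \log_2 n \rceil + 1}. \tag{7.4}$$
As $N \to \infty$, $T_{\text{SS-Mk}}(n^p)$ is on the order of
$$O\left(B_1 \frac{N(\log N - \log n)}{(\log n)/n} + B_2 \frac{N \log N (\log N - \log n)}{n \log n} + B_3 \frac{N(\log N - \log n)}{n \log n} + B_4 \frac{N}{n}\right).$$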
According to the big-O expressions of $T_{\text{our}}(n^p)$ and $T_{\text{SS-Mk}}(n^p)$, when $n$ is
bounded, the asymptotic bounds on the number of sorters required by both our
Alg. 7 and the SS-Mk in [113] are given by $O(N \log^2 N)$, which is also the asymptotic
bound for the odd-even and bitonic sorting algorithms [31, 109]. When $N$
is fixed and $n$ increases, the first term of the big-O expressions of $T_{\text{our}}(n^p)$ and
$T_{\text{SS-Mk}}(n^p)$ decreases first, then increases, and decreases to zero when $n \to N$,
while the other terms decrease monotonically with $n$. Hence, if $n$ is not constrained,
the minimum value of $T_{\text{our}}(n^p)$ and $T_{\text{SS-Mk}}(n^p)$ is one when $n = N$, meaning a
single $N$-sorter is used.
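
The sorter counts above can be checked numerically by evaluating the summation forms of Eqs. (7.3) and (7.4) directly. The following Python sketch does this with integer arithmetic; the function names are hypothetical, and the code assumes $n \ge 2$ and reads $\lceil n/2 + 1 \rceil$ as $\lceil n/2 \rceil + 1$.

```python
def ceil_half(n):
    return (n + 1) // 2


def ceil_log2(m):
    """ceil(log2(m)) for an integer m >= 1."""
    return (m - 1).bit_length()


def s_our(n, e):
    """Sorters in our merger of n lists of n**e values (e = p - i)."""
    h = ceil_half(n)
    m_star = (1 + h * (h - 1) // 2) * n**e
    c_star = (h - 1) * n - 3 * h * (h - 1) // 2 - 1
    return e * m_star + (n**e - 1) // (n - 1) * c_star + n**e


def t_our(n, p):
    """Summation form of Eq. (7.3)."""
    return sum(n**(i - 1) * s_our(n, p - i) for i in range(1, p)) + n**(p - 1)


def s_ss_mk(n, p, i):
    """Sorters in the SS-Mk merger of n lists of n**(p-i) values."""
    h = ceil_half(n)
    size = n**(p - i)
    a = (n + 1 - h) * (n - h) // 2
    b = (h + 1) * (h - 2) // 2
    lg = ceil_log2(n**(p - 1 - i))
    m_dag = (a + b + 2) * size
    k_dag = lg * size + (n - 3) * 2**(lg + 1)
    c_dag = (h - 2) * n - 3 * b - a - (n - 3)
    return m_dag + k_dag + c_dag


def t_ss_mk(n, p):
    """Summation form of Eq. (7.4)."""
    return sum(n**(i - 1) * s_ss_mk(n, p, i) for i in range(1, p)) + n**(p - 1)


ours, theirs = t_our(5, 2), t_ss_mk(5, 2)
reduction = 100 * (theirs - ours) / theirs
```

With `n = 5` and `p = 2` the two counts evaluate to 30 and 38, which agree with the $N = 16$ row of Table 7.2 (a reduction of about 21.05%). For `n = 2`, `t_our(2, 4)` gives 63 sorters for 16 inputs.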
7.4.4 Comparison of the number of sorters
According to the analysis of both our Alg. 7 and the SS-Mk in [113], the number
of sorters for sorting $N = n^p$ inputs can be reduced by using a larger basic sorter.
However, a very large basic sorter is not feasible due to practical concerns, such
as fan-in and cost. In this chapter, we assume that the basic sorter size is limited.
For a given $N$, we take the total number of sorters in Eqs. (7.3) and (7.4) as a
function of $p$ with $n = N^{1/p} \le n_b$, where $n_b$ is the upper bound of the basic sorter
size. When $N$ is not a power of a prime, we append redundant inputs of 0's and get a
larger $N'$ such that $N'$ is a power of a prime. Hence, we have $n' = N'^{1/p} = \lceil\lceil N^{1/p} \rceil\rceil$,
where $\lceil\lceil x \rceil\rceil$ denotes the smallest prime larger than or equal to $x$. There exists an
optimal $p$ such that the total number of sorters is the minimum. We search for
the optimal $p$'s for our Alg. 7 and the SS-Mk [113] using MATLAB. By plugging
the optimal $p$'s into Eqs. (7.3) and (7.4), we obtain the total number of sorters for
sorting networks of $N$ inputs.
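
The search over $p$ described above can be sketched as follows. This is a Python illustration rather than the MATLAB code used for the reported results, and all names (`padded_base`, `best_p`, `latency_eq71`) are hypothetical. The cost function is passed in as a parameter; Eq. (7.1) is used in the example only because it is short. The loop stops once $n = 2$ is reached, which assumes the cost does not decrease when $p$ grows further with $n = 2$.

```python
def is_prime(k):
    """Trial-division primality test, adequate for small sorter sizes."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True


def padded_base(N, p):
    """Smallest prime n with n**p >= N, i.e. the double-ceiling of N**(1/p)."""
    n = 2
    while n**p < N or not is_prime(n):
        n += 1
    return n


def best_p(N, n_b, cost):
    """Return (cost, p, n) minimizing cost(n, p) subject to n <= n_b."""
    best = None
    p = 1
    while True:
        n = padded_base(N, p)
        if n <= n_b:
            candidate = (cost(n, p), p, n)
            if best is None or candidate < best:
                best = candidate
        if n == 2:          # assumption: larger p with n = 2 cannot help
            return best
        p += 1


def latency_eq71(n, p):
    """Example cost: the latency of Eq. (7.1)."""
    return p + (n + 1) // 2 * p * (p - 1) // 2


choice = best_p(32, 20, latency_eq71)
```

For $N = 32$ and $n_b = 20$, a single sorter would need $n = 37 > n_b$, so the search settles on $p = 2$ with $n = 7$ (a padded network of 49 inputs).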
We compare the number of sorters for sorting networks via Batcher's odd-even
algorithm [31], our Alg. 7, and the SS-Mk [113] for wide ranges of $N$. The
results are shown in Fig. 7.9. The numbers of sorters are illustrated by staircase
curves, because we use a larger sorting network when $N$ is not a power of a prime.
From Fig. 7.9, Batcher's odd-even algorithm using 2-sorters always requires more
sorters than both our Alg. 7 and the SS-Mk in [113]. For both our Alg. 7 and the
SS-Mk [113], the number of sorters is smaller for a larger $n_b$, meaning that using
larger basic sorters reduces the number of sorters. For the comparison of the number
of sorters required by our Alg. 7 and the SS-Mk [113], there are three scenarios with
respect to three ranges of $N$. We first focus on $n_b = 10$. For $N \le 6.25 \times 10^2$, our
Alg. 7 requires no more sorters than the SS-Mk as shown in Fig. 7.9.
For some segments in $6.25 \times 10^2 < N \le 3.13 \times 10^3$, our Alg. 7 has fewer sorters
than the SS-Mk. For $N > 3.13 \times 10^3$, the SS-Mk in [113] needs fewer sorters. For
$n_b = 20$, we have similar results. For $N \le 1.46 \times 10^4$, our Alg. 7 requires no more
sorters than the SS-Mk as shown in Fig. 7.9. For some segments
in $1.46 \times 10^4 < N < 1.3 \times 10^5$, our Alg. 7 has fewer sorters than the SS-Mk. For
$N > 1.3 \times 10^5$, the SS-Mk in [113] needs fewer sorters.
Similarly, we compare the latency of Batcher's odd-even algorithm, our
Alg. 7, and the SS-Mk in [113]. The latencies are obtained by plugging the
corresponding optimal $p$'s into Eqs. (7.1) and (7.2) and are shown in Fig. 7.10 for $N \le 2 \times 10^4$.
From Fig. 7.10, Batcher's odd-even algorithm using 2-sorters has the largest
latency. For both our Alg. 7 and the SS-Mk [113], the latency can be reduced by
having a larger $n_b$. The latency of our Alg. 7 is not greater than that of the SS-Mk for
$N \le 2 \times 10^4$ for both $n_b = 10$ and $n_b = 20$ as shown in Fig. 7.10. This is because our
Alg. 7 tends to use large sorters, leading to fewer stages of sorters. We note that the
latency goes up and down for some $N$ in Fig. 7.10. This is because of the switching
from a smaller basic sorter to a larger one to reduce the number of sorters.
Since it may be of interest to some researchers, we also compare the number of sorters for $N$ being
a power of two. The results are shown in Table 7.2, where columns two and three
show the numbers of sorters for the SS-Mk and our Alg. 7, respectively, and column
four shows the reduction by our Alg. 7 compared with the SS-Mk [113]. For our
Alg. 7, there are up to 46% fewer sorters than the SS-Mk in [113] for $N = 2^i$, for
$i = 4, 5, \cdots, 16$. It is also observed that a greater reduction is obtained for small
$p$, meaning our approach is more efficient for networks with larger sorters as basic
blocks.

[Plot: number of sorters versus N for 0 ≤ N ≤ 2 × 10^4, with an inset for 0 ≤ N ≤ 2000. Curves: Odd-even (n = 2), SS-Mk (n ≤ 10), Ours (n ≤ 10), SS-Mk (n ≤ 20), Ours (n ≤ 20).]
Figure 7.9: Comparison of the number of sorters (n ≤ 10 and n ≤ 20) for sorting N
inputs via the SS-Mk in [113] and our Alg. 7.

[Plot: latency versus N for 0 ≤ N ≤ 2 × 10^4. Curves: Odd-even (n = 2), SS-Mk (n ≤ 10), Ours (n ≤ 10), SS-Mk (n ≤ 20), Ours (n ≤ 20).]
Figure 7.10: Comparison of the latency for sorting N inputs with n ≤ 10 and n ≤ 20
via the SS-Mk in [113] and our Alg. 7.
Table 7.2: Comparison of the number of sorters for sorting $N = 2^k$ inputs ($1 \le k \le 16$)
with n ≤ 20 via the SS-Mk in [113] and our Alg. 7.
N        SS-Mk      Ours       Rd. (%)
2        1          1          0.0
4        5          5          0.0
8        11         11         0.0
16       38         30         21.05
32       95         65         31.58
64       347        207        40.35
128      566        326        42.40
256      1250       690        44.80
512      3952       3500       11.44
1024     8287       6378       23.04
2048     15595      12039      22.80
4096     44652      33891      24.10
8192     143762     136574     5.00
16384    179631     183143     -1.96
32768    1176250    1134692    3.53
65536    1176250    1134692    3.53
7.5 Application in Threshold Logic
In Sec. 7.4.4, we assume all basic sorters in the sorting network are the same and
measure the complexity by the number of sorters, since the distribution of sorters is
unknown. This would overestimate the total complexity. In this section, we focus
on threshold logic and measure the complexity by the number of threshold gates.
In the following, we first briefly introduce threshold logic, which is very powerful
for computing complex functions, such as the parity function, addition, multiplication,
and sorting, with a significantly reduced number of gates. Then, we present an
implementation of a large sorter in threshold logic. Last, we compare the complexity
of sorting networks in terms of the number of gates. This is a very narrow
application in the sense that sorters are implemented by threshold logic and the inputs are
binary values.
7.5.1 Threshold logic
A threshold function [25] $f$ with $n$ inputs ($n \ge 1$), $x_1, x_2, \cdots, x_n$, is a Boolean
function whose output is determined by
$$f(x_1, x_2, \cdots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge T \\ 0 & \text{otherwise,} \end{cases} \tag{7.5}$$
where $w_i$ is called the weight of $x_i$ and $T$ the threshold. In this chapter we
denote this threshold function as $[x_1, x_2, \cdots, x_n; w_1, w_2, \cdots, w_n; T]$, and for
simplicity sometimes denote it as $f = [x; w; T]$, where $x = (x_1, x_2, \cdots, x_n)$ and $w =
(w_1, w_2, \cdots, w_n)$. The physical entity realizing a threshold function is called a
threshold gate, which can be realized with CMOS or nano technology. Fig. 7.11
shows the symbol of a threshold gate realizing (7.5).
Figure 7.11: Threshold gate realizing $f(x)$ for $n$ inputs, $x_1, x_2, \cdots, x_n$, with
corresponding weights $\omega_1, \omega_2, \cdots, \omega_n$ and a threshold $T$.
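
As a concrete reading of Eq. (7.5), the following Python sketch evaluates a threshold function $[x; w; T]$ and instantiates three familiar 3-input gates. The function names are hypothetical; the weight and threshold choices for AND, OR, and majority are standard textbook instances rather than values taken from this chapter.

```python
from itertools import product


def threshold_gate(x, w, T):
    """Evaluate [x; w; T] as in Eq. (7.5): 1 if sum_i w_i * x_i >= T, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= T else 0


def and3(x):
    return threshold_gate(x, (1, 1, 1), 3)   # fires only when all inputs are 1


def or3(x):
    return threshold_gate(x, (1, 1, 1), 1)   # fires when any input is 1


def maj3(x):
    return threshold_gate(x, (1, 1, 1), 2)   # fires when at least two are 1


# Truth table over all 3-bit inputs: x -> (AND, OR, MAJ).
table = {x: (and3(x), or3(x), maj3(x)) for x in product((0, 1), repeat=3)}
```

Only the threshold changes between the three gates, which illustrates how a single threshold gate covers functions that need several Boolean gates.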
7.5.2 n-sorter
Binary sorters can be easily implemented in threshold logic. In [98], a 2-by-2
comparator (2-sorter) was implemented by two threshold gates as shown in Fig. 7.12(a).
Similarly, we introduce a threshold logic implementation of an $n$-sorter as shown
in Fig. 7.12(b), where $n$ threshold gates are required. As shown in Fig. 7.12, the
number of gates of an $n$-sorter scales linearly with the number of inputs $n$. Hence,
large sorters are preferable as basic blocks. However, larger sorters are
more complex and expensive to implement. For practical concerns, such as
fan-in and cost, some limit on the size of basic sorters is assumed.
Figure 7.12: Sorters implemented in threshold logic (a) 2-sorter; (b) n-sorter.
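
One way to obtain a binary $n$-sorter from $n$ threshold gates is sketched below in Python. The wiring is an assumption made for illustration, since Fig. 7.12 is not reproduced here: every gate uses unit weights, and the gate driving output $j$ (counting from the smallest output) uses threshold $n - j$. The names `threshold_gate` and `binary_n_sorter` are hypothetical.

```python
from itertools import product


def threshold_gate(x, w, T):
    """1 if the weighted sum of the inputs reaches the threshold T, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= T else 0


def binary_n_sorter(x):
    """Sort n binary inputs in ascending order using n threshold gates.

    Assumed wiring: unit weights, threshold n - j for output j = 0..n-1.
    Output j is 1 exactly when at least n - j inputs are 1.
    """
    n = len(x)
    w = (1,) * n
    return [threshold_gate(x, w, n - j) for j in range(n)]


# Exhaustive check over all 4-bit inputs.
all_sorted = all(binary_n_sorter(x) == sorted(x)
                 for x in product((0, 1), repeat=4))
```

For $n = 2$ this reduces to one gate with threshold 2 (the minimum) and one with threshold 1 (the maximum), so two gates suffice, consistent with the gate count stated for the 2-sorter.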
7.5.3 Analysis of number of gates
In the following, we assume all gates are the same and derive the total number of
gates. The sorting network of $N$ inputs is composed of multiple stages, each of which
partially sorts $N$ values. Not all values in each stage participate in the
comparison-and-switch operation. A simple way to count the gates is to insert buffer gates in
each stage to store values that are not involved in any sorting operation. Buffer insertion
is also needed for implementation of threshold logic in some nanotechnologies, where
synchronization is required for correct operation. Hence, each stage contains $N$
gates and the total number of gates is obtained by multiplying $N$ by the latency.
Note that $N$ does not have to be a power of $n$. Hence, the total numbers of gates of
our Alg. 7 and the SS-Mk [113] are simply given by
$$Q_{\text{our}}(N) = N \cdot L_{\text{our}}(N), \tag{7.6}$$
and
$$Q_{\text{SS-Mk}}(N) = N \cdot L_{\text{SS-Mk}}(N). \tag{7.7}$$
If $n$ is bounded, the total numbers of gates in Eqs. (7.6) and (7.7) have an order
of $O(N \log^2 N)$, which is the same as the order for the numbers of sorters via our
Alg. 7 and the SS-Mk in [113] in Sec. 7.4.3.
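
As a worked instance of Eqs. (7.6) and (7.7), the Python sketch below multiplies $N$ by the latencies of Eqs. (7.1) and (7.2). The function names are hypothetical, and the pair $(n, p) = (7, 2)$ used for $N = 32$ is an assumed choice of sorter size and depth, not one stated in the text.

```python
def latency_ours(n, p):
    """Eq. (7.1)."""
    return p + (n + 1) // 2 * p * (p - 1) // 2


def latency_ss_mk(n, p):
    """Eq. (7.2); (n - 1).bit_length() equals ceil(log2 n) for n >= 1."""
    return 1 + (p - 1) * n + (p - 1) * (p - 2) // 2 * (n - 1).bit_length()


def gates_with_buffers(N, latency):
    """Eqs. (7.6)/(7.7): every stage holds N gates, buffers included."""
    return N * latency


N, n, p = 32, 7, 2      # assumed choice: 7**2 = 49 >= 32 and 7 <= 20
q_ours = gates_with_buffers(N, latency_ours(n, p))
q_ss_mk = gates_with_buffers(N, latency_ss_mk(n, p))
reduction = 100 * (q_ss_mk - q_ours) / q_ss_mk
```

Under this choice the counts are 192 and 256 gates, a 25% reduction, which coincides with the $N = 2^5$ row of Table 7.3.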
To derive the accurate number of gates, we first derive the number of buffers
added for Eqs. (7.6) and (7.7). When $N$ is a power of a prime, the number of buffers for
sorting $N = n^p$ values via our Alg. 7 and the SS-Mk [113] can be easily obtained due
to the regular structure. For our Alg. 7, the number of buffers is given by
$$G_{\text{our}}(N) = (p-1)\,n^{p-2}\,\frac{n^2+6n-5}{4} + \frac{((p-2)n^{p-1} - (p-1)n^{p-2} + 1)(n+5)}{4(n-1)} + \frac{(p-1)(p-2)}{2}\,n^{p-1}$$
for $n \ne 2$ and $G(n^p) = (p^2 - p + 4)2^{p-1} - 2$ for $n = 2$. Similarly, we derive the number of buffers for the
SS-Mk in [113], which is given by
$$G_{\text{SS-Mk}}(N) = 2\sum_{i=2}^{p} \left(2^{\lceil (i-2)\log_2 n \rceil + 1} - 1\right) n^{p-i} + \frac{(n^{p-1}-1)(n^2-5)}{2(n-1)} + \frac{(p-1)(n-1)^2\, n^{p-1}}{4}$$
for $n \ne 2$ and $G(n^p) = (p^2 - p + 4)2^{p-1} - 2$ for $n = 2$.
By subtracting the number of buffers from Eqs. (7.6) and (7.7), we obtain the total
numbers of gates for our algorithm and the SS-Mk as shown in the following,
$$R_{\text{our}}(n^p) = n^p \cdot L_{\text{our}}(n^p) - G_{\text{our}}(n^p), \tag{7.8}$$
and
$$R_{\text{SS-Mk}}(n^p) = n^p \cdot L_{\text{SS-Mk}}(n^p) - G_{\text{SS-Mk}}(n^p). \tag{7.9}$$
Although adding buffers overestimates the total number of gates,
the asymptotic gate counts are not affected, since both $G_{\text{our}}(n^p)$ and $G_{\text{SS-Mk}}(n^p)$
have the same order of $O(N \log^2 N)$.
7.5.4 Comparison of the number of gates
In the following, we first compare the number of gates with consideration of buffers.
Using the same idea as in Sec. 7.4.3, we search for the optimal $p$'s of Eqs. (7.6) and
(7.7) using MATLAB. For $n \le 10$ and $n \le 20$, the numbers of gates of the SS-Mk
and of our implementation are illustrated in Fig. 7.13. We also plot the odd-even
sorting for comparison. The curves in Fig. 7.13 are piecewise linear. This can
be explained by Eqs. (7.6) and (7.7), which are functions of $N$ and latency. From
Fig. 7.13, Batcher's odd-even algorithm using 2-sorters has more gates than both
our algorithm and the SS-Mk in [113]. For both our Alg. 7 and the SS-Mk [113], the
number of gates is smaller with a larger $n_b$, meaning that using larger basic sorters
reduces the number of gates. For the comparison of the number of gates required
by our Alg. 7 and the SS-Mk [113], there are also three scenarios with respect to
three ranges of $N$. We first focus on $n_b = 10$. For $N \le 1.68 \times 10^4$, our Alg. 7
requires no more gates than the SS-Mk as shown in Fig. 7.13. For
$1.68 \times 10^4 < N \le 1.17 \times 10^5$, our Alg. 7 has the same number of gates as the SS-Mk.
For $N > 1.17 \times 10^5$, the SS-Mk in [113] needs fewer gates. For $n_b = 20$, we have
similar results. For $N \le 3.71 \times 10^5$, our Alg. 7 requires no more
gates than the SS-Mk. For some segments in $3.71 \times 10^5 < N \le 2.47 \times 10^6$, our
Alg. 7 has fewer gates than the SS-Mk. For $N > 2.47 \times 10^6$, the SS-Mk in [113]
needs fewer gates.
Similarly, we compare the latency of our sorting algorithm with the SS-Mk
in [113]. The latencies are obtained by plugging the corresponding optimal $p$'s
into Eqs. (7.6) and (7.7) and are shown in Fig. 7.14 for $N \le 2 \times 10^4$. Note that
minimizing the number of gates is essentially minimizing the latency, since
each $N$ is fixed in Eqs. (7.6) and (7.7). Fig. 7.14 also shows the minimal latencies
of Batcher's odd-even algorithm. All the latencies are illustrated by staircase
curves. From Fig. 7.14, Batcher's odd-even algorithm using 2-sorters has the
largest latency. For both our Alg. 7 and the SS-Mk [113], the latency can be reduced
by having a larger $n_b$. The latency of our Alg. 7 is not greater than that of the SS-Mk for
$N \le 2 \times 10^4$ for both $n_b = 10$ and $n_b = 20$ as shown in Fig. 7.14. This is because
our Alg. 7 tends to use large basic sorters, leading to fewer stages.
We also compare the number of gates with buffers for $N$ being a power of two.
The numbers of gates are minimized by varying $p$ according to Eqs. (7.6) and (7.7)
for our algorithm and the SS-Mk [113]. Note that the optimal $p$'s are different from those
in Sec. 7.4.4. The results are shown in Table 7.3, where columns two to four show
the numbers of gates for the SS-Mk, our Alg. 7, and the reduction of our Alg. 7,
respectively, with $n \le 20$, and columns five to seven show those with $n \le 10$. For
$n \le 10$ and $n \le 20$, there are up to 25% and 39% fewer gates, respectively, than
the SS-Mk in [113] for $N = 2^i$ with $i = 1, 2, \cdots, 16$. It is observed that no more
gates are needed for $n \le 20$ than for $n \le 10$ for all $N = 2^i$ with
$i = 1, 2, \cdots, 16$. The reduction percentage of $n \le 20$ is also greater than or equal
to that of $n \le 10$ for all $N = 2^i$ with $i = 1, 2, \cdots, 16$ except $N = 16$. This means our
sorting network takes better advantage of larger basic sorters.

[Plot: number of gates versus N for 0 ≤ N ≤ 2 × 10^4, with an inset for 0 ≤ N ≤ 2000. Curves: Odd-even (n = 2), SS-Mk (n ≤ 10), Ours (n ≤ 10), SS-Mk (n ≤ 20), Ours (n ≤ 20).]
Figure 7.13: Comparison of the number of gates (n ≤ 10 and n ≤ 20) for sorting N
inputs via the SS-Mk in [113] and our Alg. 7.

[Plot: latency versus N for 0 ≤ N ≤ 2 × 10^4. Curves: Odd-even (n = 2), SS-Mk (n ≤ 10), Ours (n ≤ 10), SS-Mk (n ≤ 20), Ours (n ≤ 20).]
Figure 7.14: Comparison of the latency (n ≤ 10 and n ≤ 20) for sorting N inputs
via the SS-Mk in [113] and our Alg. 7.

Table 7.3: Comparison of the number of gates with buffers for sorting $N = 2^k$ inputs
($1 \le k \le 16$) with n ≤ 20 and n ≤ 10 via the SS-Mk in [113] and our Alg. 7.
         n ≤ 20                           n ≤ 10
N        SS-Mk      Ours       Rd. (%)    SS-Mk      Ours       Rd. (%)
2        1          1          0.00       1          1          0.00
2^2      4          4          0.00       4          4          0.00
2^3      8          8          0.00       32         32         0.00
2^4      16         16         0.00       96         80         16.67
2^5      256        192        25.00      256        192        25.00
2^6      768        512        33.33      896        768        14.29
2^7      1792       1152       35.71      2304       1920       16.67
2^8      4608       2816       38.89      4608       3840       16.67
2^9      12800      11264      12.00      12800      11264      12.00
2^10     27648      21504      22.22      31744      28672      9.68
2^11     63488      49152      22.58      63488      57344      9.68
2^12     163840     122880     25.00      192512     184320     4.26
2^13     376832     327680     13.04      385024     368640     4.26
2^14     770048     737280     4.26       770048     737280     4.26
2^15     2162688    1900544    12.12      2162688    2162688    0.00
2^16     4325376    3801088    12.12      4325376    4325376    0.00
For $N$ being a power of a prime, we compare the number of gates without buffers
according to Eqs. (7.8) and (7.9). For $N \le 3 \times 10^4$, we search for the same $N$'s
for our Alg. 7 and the SS-Mk with the minimum number of gates. The results are
shown in Tables 7.4 and 7.5 for $n \le 10$ and $n \le 20$, respectively, where columns
three and four show the numbers of gates for the SS-Mk and our Alg. 7, and column
five shows the reduction of our Alg. 7. For all $N$'s except for $N = 7^5$, our Alg. 7 has
no more gates than the SS-Mk in [113]. There are up to 13% and 23% fewer gates
than the SS-Mk in [113] for $n \le 10$ and $n \le 20$, respectively. This means our sorting
network takes better advantage of larger basic sorters. We also remark that using a
larger sorter size $n$ may reduce the number of gates for sorting $N = n^p$ inputs. For
all $N$'s common to Table 7.4 ($n \le 10$) and Table 7.5 ($n \le 20$), the same number
of gates is needed, since the same sorter size $n$ is used. For all remaining $N$'s except
for $N = 3^9$ in Table 7.4, there is a corresponding larger $N$ in Table 7.5 with fewer
gates. For $N = 3^9 = 19683$ in Table 7.4 and $N = 13^4 = 28561$ in Table 7.5, the
latter has about 1% more gates than the former, but accommodates 45% more inputs.
7.6 Conclusion
In this chapter, we proposed a new merging algorithm based on $n$-sorters for parallel
sorting networks, where $n$ is prime. Based on the $n$-way merging, we also proposed
a merge sorting algorithm. Our sorting algorithm is a direct generalization of odd-even
merge sort with $n$-sorters as basic blocks. By using larger sorters ($2 \le n \le 20$),
the number of sorters as well as the latency is reduced greatly. In comparison with
other multiway sorting networks in [113], our implementation has a smaller latency
and fewer sorters for wide ranges of $N \le 1.46 \times 10^4$. We also showed an application
of sorting networks implemented by linearly scaling sorters in threshold logic and
reached a similar conclusion that the number of gates can be greatly reduced by using
larger sorters.
Table 7.4: Comparison of the number of gates without buffers for sorting $N = n^p$
inputs for n ≤ 10 via the SS-Mk in [113] and our Alg. 7.
N        n^p     SS-Mk      Ours       Rd. (%)
2        2       2          2          0.00
3        3       3          3          0.00
5        5       5          5          0.00
7        7       7          7          0.00
9        3^2     29         29         0.00
25       5^2     118        110        6.78
27       3^3     197        188        4.57
49       7^2     305        269        11.80
81       3^4     1067       998        6.47
125      5^3     1450       1315       9.31
128      2^7     2942       2942       0.00
343      7^3     5072       4728       6.78
625      5^4     13489      12140      10.00
729      3^6     22801      20411      10.48
1024     2^10    48126      48126      0.00
2401     7^4     63354      62254      1.74
3125     5^5     108175     97265      10.09
4096     2^12    278526     278526     0.00
6561     3^8     377375     330236     12.49
8192     2^13    655358     655358     0.00
16807    7^5     688713     704693     -2.32
19683    3^9     1443791    1259711    12.75
Table 7.5: Comparison of the number of gates without buffers for sorting $N = n^p$
inputs for n ≤ 20 via the SS-Mk in [113] and our Alg. 7.
N        n^p     SS-Mk      Ours       Rd. (%)
2        2       2          2          0.00
3        3       3          3          0.00
5        5       5          5          0.00
7        7       7          7          0.00
11       11      11         11         0.00
13       13      13         13         0.00
17       17      17         17         0.00
19       19      19         19         0.00
25       5^2     118        110        6.78
27       3^3     197        188        4.57
49       7^2     305        269        11.80
121      11^2    1117       917        17.91
125      5^3     1450       1315       9.31
169      13^2    1814       1454       19.85
289      17^2    3970       3074       22.57
361      19^2    5501       4205       23.56
625      5^4     13489      12140      10.00
729      3^6     22801      20411      10.48
1331     11^3    29107      26668      8.38
2197     13^3    54703      50763      7.20
2401     7^4     63354      62254      1.74
3125     5^5     108175     97265      10.09
4913     17^3    156812     143443     8.53
6859     19^3    239590     221052     7.74
14641    11^4    564513     562214     0.41
16807    7^5     688713     704693     -2.32
28561    13^4    1230724    1271788    -3.34
Chapter 8
Conclusion and Future Work
As technologies scale down, many issues arise and need to be accounted for,
such as delay in on-chip interconnects, noise and decoherence in quantum
computation, and quantum effects in nano technologies. To address these issues,
efficient signal processing techniques or new design approaches are imperative. In
this dissertation, we propose several efficient processing techniques and approaches
in the following areas: delay modeling, crosstalk avoidance coding, quantum error
correction, and threshold logic design.
The delay issue in on-chip interconnect is motivated by the fact that gate delay
decreases with scaling, but global interconnect delay increases due to crosstalk. We
proposed analytical delay models for on-chip interconnects with improved accuracy,
leading to a new CAC with worst-case delay 30-40% smaller than the best known
in the literature.
Quantum error correction codes (QECCs) are needed to protect quantum
information against noise and decoherence. Given the good error-correcting performance
and existing iterative decoding algorithms of classic LDPCs, it is desirable to obtain
LDPC-based QECCs. Several QECCs based on nonbinary LDPC codes have been
proposed with a much better error-correcting performance than existing quantum
codes over a qubit channel. We proposed stabilizer codes based on the nonbinary
QC-LDPC codes for qubit channels. Results show that QECCs based on the
nonbinary LDPC codes achieve better performance than those based on binary LDPC
codes.
Finally, threshold logic designs in nanotechnologies are investigated. Nano devices,
such as resonant tunneling diodes (RTDs), quantum cellular automata (QCA),
and single electron transistors (SETs), are well suited to threshold logic, which
differs from the Boolean logic widely used in CMOS technology, where Boolean gates
such as AND, OR, NOT, NAND, NOR, and XOR serve as the basic building
blocks. In addition, the fan-in of a threshold gate in RTD nanotechnology needs to be
bounded for both reliability and performance. We first focused on the implementation
of symmetric functions in threshold logic, since AND, OR, NAND, NOR,
and XOR are all special cases of symmetric functions. Furthermore, any Boolean
function can be treated as a symmetric function by replicating its inputs. We
proposed an improved sort-and-search algorithm to implement any symmetric function
in threshold logic. Both the sorting and the searching networks in our proposed
algorithm use multi-input threshold gates. Since XOR cannot be realized by a single
threshold gate, we also proposed a majority-class threshold tree architecture for XORs
with bounded fan-in, and compared it with a Boolean-class architecture and the
sort-and-search approach. Analytical results show that the majority class outperforms
the other architectures in terms of both hardware complexity and latency.
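As an illustration of why symmetric functions fit threshold logic, the following minimal Python sketch evaluates a symmetric Boolean function using only threshold gates. It is a generic textbook-style construction and is not the improved sort-and-search algorithm proposed in this dissertation; the names `threshold_gate`, `symmetric_function`, and `xor_table` are hypothetical and introduced only for this example. The first layer computes "at least j inputs are 1" for every j, which is the sorted input vector, and the second layer locates the 1-to-0 boundary in that sorted vector.

```python
from itertools import product

def threshold_gate(weights, threshold, inputs):
    """Return 1 if the weighted input sum reaches the threshold, else 0."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def symmetric_function(value_by_count, inputs):
    """Evaluate a symmetric Boolean function using only threshold gates.

    value_by_count[k] is the function value when exactly k inputs are 1.
    """
    n = len(inputs)
    # "Sort" layer: at_least[j] = 1 iff at least j inputs are 1.
    # Read for j = 1..n, this is the input vector sorted in descending order.
    at_least = [threshold_gate([1] * n, j, inputs) for j in range(n + 2)]
    # "Search" layer: exactly k ones <=> at_least[k] AND NOT at_least[k + 1],
    # which is itself a threshold gate with weights (1, -1) and threshold 1.
    exactly = [threshold_gate([1, -1], 1, (at_least[k], at_least[k + 1]))
               for k in range(n + 1)]
    # Output gate: OR over the counts for which the function is 1.
    selected = [e for e, v in zip(exactly, value_by_count) if v]
    return threshold_gate([1] * len(selected), 1, selected)

def xor_table(n):
    """Value vector of n-input XOR: 1 exactly when the number of ones is odd."""
    return [k % 2 for k in range(n + 1)]
```

For example, `symmetric_function(xor_table(4), (1, 0, 1, 1))` evaluates a 4-input XOR, and `symmetric_function([0, 0, 1, 1], x)` evaluates a 3-input majority. The sketch ignores the fan-in bound discussed above: the first-layer gates have fan-in n.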
For future work, the following directions can be investigated:
• As clock frequencies approach the multi-gigahertz range, the parasitic inductance
of on-chip interconnects has become significant, and its detrimental effects,
including increased delay, voltage overshoots and undershoots, and increased
crosstalk noise [58–60], cannot be ignored. Hence, as process technologies
scale down into the deep submicrometer (DSM) regime, the crosstalk delay and
noise due to capacitive and inductive coupling become the performance bottleneck
in many high-performance VLSI designs, especially for global on-chip buses. It is
imperative for designers to devise new techniques that address capacitive and
inductive coupling simultaneously.
• The QECCs proposed in this dissertation have a column weight of two.
Quasi-cyclic QECCs with column weights greater than two provide more powerful
error correction capability and are worth investigating.
• It has been shown that some complex Boolean functions can be realized with
a single threshold gate. However, efficient identification of the threshold
function for a given problem has not been fully investigated. Identification by
reformulating existing Boolean expressions for a given problem is worth
investigating.
• In this dissertation, we show how to construct multi-input XORs using our
proposed sort-and-search algorithm and use them as building blocks for finite
field multiplications. However, some arithmetic operations might be decomposed
into small blocks that can be implemented more efficiently by symmetric functions.
Since our proposed sort-and-search algorithm is applicable to any symmetric
Boolean function, such decompositions using blocks based on our sort-and-search
algorithm are worth investigating.
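To make the threshold-function identification problem in the third direction concrete, the sketch below performs an exhaustive search for a single threshold gate that realizes a given Boolean function. It is only an illustration under stated assumptions: the function name `find_threshold_gate` and the `bound` parameter are hypothetical, the search is restricted to small integer weights, and its cost grows exponentially with the number of inputs, which is why efficient identification remains open. A result of `None` means only that no gate exists within the searched range.

```python
from itertools import product

def find_threshold_gate(func, n, bound=2):
    """Search small integer weights and thresholds for one gate realizing func.

    func maps a tuple of n bits to 0 or 1. Returns (weights, threshold) if a
    gate with weights in [-bound, bound] is found, otherwise None.
    """
    points = list(product((0, 1), repeat=n))
    targets = [func(x) for x in points]
    for weights in product(range(-bound, bound + 1), repeat=n):
        sums = [sum(w * xi for w, xi in zip(weights, x)) for x in points]
        for threshold in range(-bound * n, bound * n + 2):
            if all(int(s >= threshold) == t for s, t in zip(sums, targets)):
                return weights, threshold
    return None
```

Under these assumptions, `find_threshold_gate(lambda x: x[0] & (x[1] | x[2]), 3)` finds a single gate for the compound function x1 AND (x2 OR x3), for instance weights (2, 1, 1) with threshold 3, whereas `find_threshold_gate(lambda x: x[0] ^ x[1], 2)` returns `None`, consistent with the fact that XOR is not a threshold function.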
Bibliography
[1] [Online], “International technology roadmap for semiconductors,” available at
http://www.itrs.net/Links/2011ITRS/Home2011.htm.
[2] R. Kay and L. Pileggi, “PRIMO: Probability interpretation of moments for
delay calculation,” Proceedings of the 35th annual Design Automation Con-
ference, vol. 35, pp. 463–468, 1998.
[3] C. J. Alpert, A. Devgan, and C. V. Kashyap, “RC delay metrics for performance
optimization,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 20, no. 5, pp. 571–582, 2001.
[4] F. Liu, C. Kashyap, and C. J. Alpert, “A delay metric for RC circuits based
on the Weibull distribution,” Proc. IEEE/ACM International Conference on
Computer-Aided Design, pp. 620–624, 2002.
[5] J. A. Davis and J. D. Meindl, “Compact distributed RLC interconnect models,
Part II: Coupled line transient expressions and peak crosstalk in multilevel
networks,” IEEE Transactions on Electron Devices, vol. 47, no. 11, pp. 2078–
2087, November 2000.
[6] S. Roy and A. Dounavis, “Closed-form delay and crosstalk models for RLC
on-chip interconnects using a matrix rational approximation,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 10, pp.
1481–1492, October 2009.
[7] P. P. Sotiriadis and A. Chandrakasan, “Reducing bus delay in submicron tech-
nology using coding,” Proceedings of the Asia and South Pacific Design Au-
tomation Conference, pp. 109–114, February 2001.
[8] C. Duan, A. Tirumala, and S. Khatri, “Analysis and avoidance of cross-talk
in on-chip buses,” The Ninth Symposium on High Performance Interconnects
(HOTI ’01), pp. 133–138, August 2001.
[9] C. Duan and S. Khatri, “Exploiting crosstalk to speed up on-chip buses,”
Proceedings of the Conference on Design Automation and Test in Europe,
vol. 2, pp. 778–783, February 2004.
[10] B. Victor and K. Keutzer, “Bus encoding to prevent crosstalk delay,” Proc.
IEEE/ACM International Conference on Computer-Aided Design, pp. 57–63,
2001.
[11] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Infor-
mation. Cambridge, 2000.
[12] D. J. C. MacKay, G. Mitchison, and P. L. McFadden, “Sparse-graph codes for
quantum error correction,” IEEE Trans. Info. Theory,
vol. 50, no. 10, pp. 2315–2330, 2004.
[13] M. S. Postol, “A proposed quantum low density parity check code,” quant-
ph/0108131, 2001.
[14] G. D. Forney, M. Grassl, and S. Guha, “Convolutional and tail-biting quantum
error-correcting codes,” IEEE Trans. Info. Theory, vol. 53, no. 3, pp. 865–880,
2007.
[15] D. Poulin, J. Tillich, and H. Ollivier, “Quantum serial turbo codes,” IEEE
Trans. Info. Theory, vol. 55, no. 6, pp. 2776–2798, 2009.
[16] P. Tan, “Exploring error-correction technology in source coding and quantum
communications,” Ph.D. Dissertation, Lehigh University, 2009.
[17] P. Tan and J. Li, “Efficient quantum stabilizer codes: LDPC and LDPC-
convolutional constructions,” IEEE Trans. Info. Theory, vol. 56, no. 1, pp.
476–491, January 2010.
[18] M. M. Wilde and M. Hsieh, “Entanglement boosts quantum turbo codes,”
Proc. IEEE Int. Symp. on Information Theory, pp. 445–449, July 2011.
[19] M. M. Wilde and S. Guha, “Polar codes for degradable quantum channels,”
eprint ArXiv: quant-ph/1109.5346v2 2011.
[20] M. M. Wilde and J. M. Renes, “Quantum polar codes for arbitrary channels,”
eprint ArXiv: quant-ph/1201.2906v2 2012.
[21] Z. Dutton, S. Guha, and M. M. Wilde, “Performance of polar codes
for quantum and private classical communication,” eprint ArXiv: quant-
ph/1205.5980v1 2012.
[22] M. C. Davey and D. J. C. MacKay, “Low-density parity check codes over
GF(q),” IEEE Commun. Lett., vol. 2, no. 6, pp. 165–167, June 1998.
[23] K. Kasai, M. Hagiwara, H. Imai, and K. Sakaniwa, “Quantum error correc-
tion beyond the bounded distance decoding limit,” IEEE Trans. Info. Theory,
vol. 58, no. 2, pp. 1223–1230, February 2012.
[24] D. Goldhaber-Gordon, M. S. Montemerlo, J. C. Love, G. J. Opiteck, and J. C.
Ellenbogen, “Overview of nanoelectronic devices,” Proceedings of the IEEE,
vol. 85, no. 4, pp. 521–540, April 1997.
[25] S. Muroga, Threshold Logic and Its Applications. New York: WILEY-
INTERSCIENCE, 1971.
[26] V. Annampedu and M. D. Wagh, “Reconfigurable approximate pattern matching
architectures for nanotechnology,” Microelectronics Journal, vol. 38, pp. 430–438,
2007.
[27] C. Pacha, U. Auer, C. Burwick, P. Glosekotter, A. Brennemann, W. Prost,
F. Tegude, and K. F. Goser, “Threshold logic circuit design of parallel adders
using resonant tunneling devices,” IEEE Trans. VLSI Systems, vol. 8, no. 5,
pp. 558–572, October 2000.
[28] C. Pacha and K. Goser, “Design of arithmetic circuits using resonant tunnel-
ing diodes and threshold logic,” in Proc. of the 2nd Workshop on Innovative
Circuits and Systems for Nanoelectronics, Delft, NL, Sep. 1997, pp. 83–93.
[29] Y. Sun and M. D. Wagh, “A fan-in bounded low delay adder for nanotechnol-
ogy,” in Proc. of 2010 NanoTech Conf., vol. 2, Anaheim, CA, July 2010, pp.
83–86.
[30] B. Sunar and Ç. K. Koç, “Mastrovito multiplier for all trinomials,” IEEE
Trans. Computers, vol. 48, no. 5, pp. 522–527, May 1999.
[31] K. E. Batcher, “Sorting networks and their applications,” in Proc. The Spring
Joint Computer Conference. ACM, 1968, pp. 307–314.
[32] S. Sridhara, A. Ahmed, and N. Shanbhag, “Coding for reliable on-chip buses:
A class of fundamental bounds and practical codes,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 5, pp. 977–982,
May 2007.
[33] X. Wu, Z. Yan, and Y. Xie, “Two-dimensional crosstalk avoidance codes,”
in Proc. IEEE Workshop on Signal Processing Systems (SiPS), pp. 106–111,
October 2008.
[34] R. Gupta, B. Tutuianu, and L. T. Pileggi, “The Elmore delay as a bound for
RC trees with generalized input signals,” IEEE Transactions on CAD, vol. 16,
no. 1, pp. 95–104, January 1997.
[35] J. M. Rabaey, Digital integrated circuits. Prentice-Hall, 1996.
[36] Y. I. Ismail, E. G. Friedman, and J. L. Neves, “Figures of merit to characterize
the importance of on-chip inductance,” IEEE Trans. VLSI Systems, vol. 7,
no. 4, pp. 440–449, December 1999.
[37] L. Zhang, J. M. Wilson, R. Bashirullah, L. Luo, J. Xu, and P. D. Franzon, “A
32-Gb/s on-chip bus with driver pre-emphasis signaling,” IEEE Trans. VLSI
Systems, vol. 17, no. 9, pp. 1267–1274, September 2009.
[38] T. Sakurai, “Closed-form expressions for interconnection delay, coupling, and
crosstalk in VLSI’s,” IEEE Transactions on Electron Devices, vol. 40, no. 1,
pp. 118–124, January 1993.
[39] S. Khatri, R. Brayton, and A. Sangiovanni-Vincentelli, Crosstalk noise im-
mune VLSI design using regular layout fabrics. Kluwer Academic Publishers,
2001.
[40] [Online], “FreePDK45,” available at http://www.eda.ncsu.edu/wiki/FreePDK.
[41] F. Shi, X. Wu, and Z. Yan, “Improved crosstalk avoidance codes based on a
novel pattern classification,” IEEE Workshop on Signal Processing Systems
(SiPS), pp. 245–250, October 2011.
[42] P. P. Sotiriadis, “Interconnect modeling and optimization in deep sub-micron
technologies,” Ph.D. Dissertation, Massachusetts Institute of Technology,
2002.
[43] S. Sridhara, G. Balamurugan, and N. Shanbhag, “Joint equalization and cod-
ing for on-chip bus communication,” IEEE Trans. VLSI Systems, vol. 16,
no. 3, pp. 314–318, March 2008.
[44] X. Wu, Z. Yan, and Y. Xie, “Two-dimensional crosstalk avoidance codes,”
in Proc. IEEE Workshop on Signal Processing Systems (SiPS), pp. 106–111,
October 2008.
[45] C. Duan, C. Zhu, and S. P. Khatri, “Forbidden transition free crosstalk avoid-
ance codec design,” Proceedings of annual Design Automation Conference, pp.
986–991, 2008.
[46] C. Duan, V. H. C. Calle, and S. P. Khatri, “Efficient on-chip crosstalk avoid-
ance codec design,” IEEE Trans. VLSI Systems, vol. 17, no. 4, pp. 551–560,
April 2009.
[47] X. Wu and Z. Yan, “Efficient CODEC designs for crosstalk avoidance codes
based on numeral systems,” IEEE Trans. VLSI Systems, vol. 19, no. 4, pp.
548–558, April 2011.
[48] F. Shi, X. Wu, and Z. Yan, “Improved analytical delay models for coupled
interconnects,” in Proc. IEEE Workshop on Signal Processing Systems (SiPS),
pp. 134–139, October 2011.
[49] [Online], “PDK for the 45nm technology,” available at
http://www.eda.ncsu.edu/wiki/FreePDK.
[50] ——, “Predictive technology model (PTM),” available at
http://ptm.asu.edu.
[51] B. Victor, “Bus encoding to prevent crosstalk delay,” Master Thesis, Univer-
sity of California, Berkeley, 2001.
[52] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to
the Theory of NP-Completeness. W. H. Freeman and Company, New York,
1979.
[53] F. Shi, X. Wu, and Z. Yan, “New crosstalk avoidance codes based on a novel
pattern classification,” available at http://arxiv.org/abs/1209.2672.
[54] S. R. Sridhara, A. Ahmed, and N. R. Shanbhag, “Area and energy efficient
crosstalk avoidance codes for on-chip buses,” in Proc. Int. Conference on Com-
puter Design, pp. 12–17, 2004.
[55] K. Hirose and H. Yasuura, “A bus delay reduction technique considering
crosstalk,” in Design, Automation and Test in Europe Conference and Ex-
hibition 2000. Proceedings. IEEE, 2000, pp. 441–445.
[56] F. Shi, X. Wu, and Z. Yan, “New crosstalk avoidance codes based on a novel
pattern classification,” IEEE Trans. VLSI Systems, vol. 21, no. 10, pp. 1892–
1902, 2013.
[57] ——, “Improved analytical delay models for RC-coupled interconnects,” IEEE
Trans. VLSI Systems, vol. 22, no. 7, pp. 1639–1644, 2014.
[58] Y. I. Ismail, “On-chip inductance cons and pros,” IEEE Trans. VLSI
Systems, vol. 10, no. 6, pp. 685–694, 2002.
[59] Y. Cao, X. Huang, N. H. Chang, S. Lin, O. S. Nakagawa, W. Xie, D. Sylvester,
and C. Hu, “Effective on-chip inductance modeling for multiple signal lines
and application to repeater insertion,” IEEE Trans. VLSI
Systems, vol. 10, no. 6, pp. 799–805, 2002.
[60] S. Tu, Y. Chang, and J. Jou, “RLC coupling-aware simulation and on-chip bus
encoding for delay reduction,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 25, no. 10, pp. 2258–2264, 2006.
[61] J. A. Davis and J. D. Meindl, “Compact distributed RLC interconnect models,
Part I: Single line transient, time delay, and overshoot expressions,” IEEE
Transactions on Electron Devices, vol. 47, no. 11, pp. 2068–2077, November
2000.
[62] Y. I. Ismail and E. G. Friedman, “Effects of inductance on the propagation
delay and repeater insertion in VLSI circuits,” IEEE Trans. VLSI
Systems, vol. 8, no. 2, pp. 195–206, 2000.
[63] D. Knuth, The Art of Computer Programming, 3rd ed. Addison-Wesley, 1997,
vol. 2.
[64] D. Gottesman, “Stabilizer codes and quantum error correction,” Ph.D. Dis-
sertation, California Institute of Technology, 1997.
[65] A. R. Calderbank and P. W. Shor, “Good quantum error-correcting codes
exist,” Phys. Rev. A, vol. 54, no. 2, pp. 1098–1105, August 1996.
[66] A. M. Steane, “Multiple particle interference and quantum error correction,”
Proc. R. Soc. Lond. A, vol. 452, pp. 2551–2577, 1996.
[67] ——, “Error-correcting codes in quantum theory,” Phys. Rev. Lett., vol. 77,
pp. 793–797, June 1996.
[68] A. R. Calderbank, E. M. Rains, P. W. Shor, and N. J. A. Sloane, “Quantum
error correction via codes over GF(4),” Proceedings Of the Annual Allerton
Conference On Communication Control And Computing, vol. 34, pp. 662–672,
1996.
[69] D. Poulin and Y. Chung, “On the iterative decoding of sparse quantum codes,”
Quantum Info. and Comp., vol. 8, no. 10, pp. 987–1000, November 2008.
[70] C. Poulliat, M. Fossorier, and D. Declercq, “Design of regular (2,dc)-LDPC
codes over GF(q) using their binary images,” IEEE Trans. Commun., vol. 56,
no. 10, pp. 1626–1635, October 2008.
[71] L. Zeng, L. Lan, Y. Y. Tai, B. Zhou, S. Lin, and K. Abdel-Ghaffar,
“Construction of nonbinary cyclic, quasi-cyclic and regular LDPC codes: A finite
geometry approach,” IEEE Trans. Commun., vol. 56, no. 3, pp. 378–387,
March 2008.
[72] S. Song, B. Zhou, S. Lin, and K. Abdel-Ghaffar, “A unified approach to the
construction of binary and nonbinary quasi-cyclic LDPC codes based on finite
fields,” IEEE Trans. Commun., vol. 57, no. 1, pp. 84–93, January 2009.
[73] B. Zhou, J. Kang, S. Song, S. Lin, and K. Abdel-Ghaffar, “Construction of
non-binary quasi-cyclic LDPC codes by arrays and array dispersions,” IEEE
Trans. Commun., vol. 57, no. 6, pp. 1652–1662, June 2009.
[74] M. Hagiwara, K. Kasai, H. Imai, and K. Sakaniwa, “Spatially coupled quasi-
cyclic quantum LDPC codes,” Proc. IEEE Int. Symp. on Information Theory,
pp. 638–642, July 2011.
[75] M. Fossorier and D. Declercq, “Quasi-cyclic low-density parity-check codes
from circulant permutation matrices,” IEEE Trans. Info. Theory, vol. 50,
no. 8, pp. 1788–1793, 2004.
[76] [Online], “An LDPC decoding algorithm based on number-theoretic transform,”
available at http://ivms.stanford.edu/~varodayan/multilevel/index.html.
[77] M. D. Wagh, Y. Sun, and V. Annampedu, “Implementation of compari-
son function using quantum-dot cellular automata,” in Proc. NanoTech2008,
vol. 3, June 2008, pp. 76–79.
[78] V. Annampedu and M. D. Wagh, “Reconfigurable approximate pattern match-
ing architectures for nanotechnology,” Microelectronics Journal, vol. 38, pp.
430–438, 2007.
[79] K.-Y. Siu, V. P. Roychowdhury, and T. Kailath, “Depth-size tradeoffs for
neural computation,” IEEE Transactions on Computers, vol. 40, no. 12, pp.
1402–1412, 1991.
[80] K.-Y. Siu and J. Bruck, “Neural computation of arithmetic functions,” Pro-
ceedings of the IEEE, vol. 78, no. 10, pp. 1669–1675, 1990.
[81] K.-Y. Siu, V. Roychowdhury, and T. Kailath, “Circuit complexity for neural
computation,” in Signals, Systems and Computers, 1991. 1991 Conference
Record of the Twenty-Fifth Asilomar Conference on. IEEE, 1991, pp. 487–
490.
[82] K.-Y. Siu and V. P. Roychowdhury, “On optimal depth threshold circuits for
multiplication and related problems,” SIAM Journal on Discrete Mathematics,
vol. 7, no. 2, pp. 284–292, 1994.
[83] K.-Y. Siu, J. Bruck, T. Kailath, and T. Hofmeister, “Depth efficient neural
networks for division and related problems,” IEEE Transactions on
Information Theory, vol. 39, no. 3, pp. 946–956, 1993.
[84] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applica-
tions. Cambridge University Press, 1994.
[85] Z. Yan and D. Sarwate, “Reduced-complexity pipelined architectures for finite
field inversions,” in Proc. 2006 IEEE Workshop on Signal Processing Systems
(SiPS06), October 2006, pp. 56–61.
[86] Z. Yan, “Digit-serial systolic architectures for inversions over GF(2^m),” in
Proc. 2006 IEEE Workshop on Signal Processing Systems (SiPS06), October
2006, pp. 77–82.
[87] Z. Yan, D. Sarwate, and Z. Liu, “Hardware-efficient systolic architectures for
inversion in GF(2^m),” in IEE Proc. on Information Security, vol. 152, no. 1,
October 2005, pp. 31–45.
[88] ——, “High-speed systolic architectures for finite field inversion,” Integration:
the VLSI Journal, vol. 38, no. 3, pp. 383–398, January 2005.
[89] Z. Yan and D. V. Sarwate, “New systolic architectures for inversion and
division in GF(2^m),” IEEE Transactions on Computers, vol. 52, no. 11, pp.
1515–1520, November 2003.
[90] P. M. Lewis and C. L. Coates, Threshold logic. Wiley, 1967.
[91] S. Muroga, “The principle of majority decision logical elements and the com-
plexity of their circuits.” in IFIP Congress, 1959, pp. 400–406.
[92] R. C. Minnick, “Linear-input logic,” IRE Trans. Electronic Computers, vol.
EC-10, no. 1, pp. 6–16, March 1961.
[93] V. Annampedu and M. D. Wagh, “Building multi-input RTD circuits under
reliability constraints,” in Proc. the 2nd IEEE Int. Workshop on Defect and
Fault Tolerant Nanoscale Architectures (NANOARCH 2006), June 2006, pp.
45–52.
[94] K. Maezawa, “Analysis of switching time of monostable-bistable transition
logic elements based on simple model calculation,” Jpn. J. Appl. Phys., vol. 34,
no. 2B, pp. 1213–1217, February 1995.
[95] B. G. Horne and D. R. Hush, “On the node complexity of neural networks,”
Neural Networks, vol. 7, no. 9, pp. 1413–1426, 1994.
[96] V. Beiu, J. Peperstraete, and R. Lauwereins, “Algorithms for fan-in
reduction,” in Les réseaux neuro-mimétiques et leurs applications. Journées
internationales, 1992, pp. 589–600.
[97] ——, “Simpler neural networks by fan-in reduction,” Proceedings of the
International Joint Conference on Neural Networks IJCNN’92, pp. 204–209,
1992.
[98] ——, “Enhanced threshold gate fan-in reduction algorithms,” in Proceedings
of the third international conference on Young computer scientists. Tsinghua
University Press, 1993, pp. 339–342.
[99] V. Annampedu and M. D. Wagh, “Decomposition of threshold functions into
bounded fan-in threshold functions,” Information and Computation, vol. 227,
pp. 84–101, 2013.
[100] N. Red’kin, “Synthesis of threshold circuits for certain classes of Boolean
functions,” Cybernetics and Systems Analysis, vol. 6, no. 5, pp. 540–544, 1970.
[101] V. Beiu, “Constant fan-in digital neural networks are VLSI-optimal,” in
Mathematics of Neural Networks. Springer, 1997, pp. 89–94.
[102] J.-H. Guo and C.-L. Wang, “Systolic array implementation of Euclid’s
algorithm for inversion and division in GF(2^m),” IEEE Transactions on
Computers, vol. 47, no. 10, pp. 1161–1167, 1998.
[103] E. D. Mastrovito, “VLSI architectures for computations in Galois field,” Ph.D.
Dissertation, Linköping University, 1991.
[104] J. Deschamps, J. L. Imana, and G. D. Sutter, Hardware Implementation of
Finite-Field Arithmetic. The McGraw-Hill Companies, 2009.
[105] J. L. Massey and J. K. Omura, “Computational method and apparatus for
finite field arithmetic,” US Patent No. 4,587,627, to OMNET Assoc., Sunny-
vale CA, Washington, D.C.: Patent and Trademark Office, 1986.
[106] R. Zhang, P. Gupta, L. Zhong, and N. K. Jha, “Threshold network synthesis
and optimization and its application to nanotechnologies,” IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 1, pp. 107–
118, January 2005.
[107] I. Wegener et al., “The complexity of Boolean functions,” 1987.
[108] W. Prost, U. Auer, F. J. Tegude, C. Pacha, K. F. Goser, G. Janssen, and
T. van der Roer, “Manufacturability and robust design of nanoelectronic logic
circuits based on resonant tunnelling diodes,” Int. J. Circ. Theor. App, vol. 28,
pp. 537–552, 2000.
[109] K. E. Batcher, “On bitonic sorting networks,” in Proc. International Conference
on Parallel Processing (ICPP), 1990, pp. 376–379.
[110] D. E. Knuth, “The Art of Computer Programming. Sorting and Searching,
vol. III,” 1973.
[111] D.-L. Lee and K. E. Batcher, “A multiway merge sorting network,” IEEE
Transactions on Parallel and Distributed Systems, vol. 6, no. 2, pp. 211–215,
1995.
[112] B. Parker and I. Parberry, “Constructing sorting networks from k-sorters,”
Information Processing Letters, vol. 33, no. 3, pp. 157–162, 1989.
[113] Q. Gao and Z. Liu, “Sloping-and-shaking,” Science in China Series E: Tech-
nological Sciences, vol. 40, no. 3, pp. 225–234, 1997.
[114] F. Shi, Z. Yan, and M. D. Wagh, “An enhanced multiway sorting
network based on n-sorters,” [Online]. Available at
http://arxiv.org/abs/1407.0961.
[115] [Online], “SEC 2: Recommended elliptic curve domain parameters,” available
at http://www.secg.org.
[116] R. C. Mullin, I. M. Onyszchuk, S. A. Vanstone, and R. M. Wilson, “Optimal
normal bases in GF(p^n),” Discrete Appl. Math, vol. 22, no. 2, pp. 149–161,
1988/89.
[117] K. J. Liszka and K. E. Batcher, “A modulo merge sorting network,” in Proc.
Fourth Symposium on the Frontiers of Massively Parallel Computation, 1992.
IEEE, 1992, pp. 164–169.
[118] M. Ajtai, J. Komlós, and E. Szemerédi, “An O(n log n) sorting network,” in
Proc. The fifteenth annual ACM symposium on Theory of Computing. ACM,
1983, pp. 1–9.
[119] A. Farmahini-Farahani, H. J. Duwe, M. J. Schulte, and K. Compton, “Modular
design of high-throughput, low-latency sorting units,” IEEE Transactions on
Computers, vol. 62, no. 7, pp. 1389–1402, 2013.
[120] R. Beigel and J. Gill, “Sorting n objects with a k-sorter,” IEEE Transactions
on Computers, vol. 39, no. 5, pp. 714–716, 1990.
[121] T. Nakatani, S.-T. Huang, B. W. Arden, and S. K. Tripathi, “k-way bitonic
sort,” IEEE Transactions on Computers, vol. 38, no. 2, pp. 283–288, 1989.
[122] D. Lee and K. E. Batcher, “On sorting multiple bitonic sequences,” in Proc.
International Conference on Parallel Processing (ICPP 1994)., vol. 1, 1994,
pp. 121–125.
[123] T. Leighton, “Tight bounds on the complexity of parallel sorting,” in Proc.
The sixteenth annual ACM symposium on Theory of Computing. ACM, 1984,
pp. 71–80.
[124] K. J. Liszka and K. E. Batcher, “A generalized bitonic sorting network,” in
Proc. International Conference on Parallel Processing (ICPP 1993), vol. 1,
1993, pp. 105–108.
[125] L. Zhao, Z. Liu, and Q. Gao, “An efficient multiway merging algorithm,”
Science in China Series E: Technological Sciences, vol. 41, no. 5, pp. 543–551,
1998.
[126] R. Drysdale, III and F. H. Young, “Improved divide sort merge sorting net-
works,” SIAM Journal on Computing, vol. 4, no. 3, pp. 264–270, 1975.
[127] D. C. Van Voorhis, “An economical construction for sorting networks,” in Pro-
ceedings of the May 6-10, 1974, national computer conference and exposition.
ACM, 1974, pp. 921–927.
Appendix A
Proofs for Sorting Algorithms
A.1 Proof of Lemma 6.3.1
Proof. The proof of the lemma can be reduced to showing that, for $l > 0$, any two
wires $s, s+l \in \mathbb{Z}_m$ of each list are sorted as shown in Fig. 7.4(b). We prove the
lemma by contradiction. The inputs satisfy $x_{j,s} \le x_{j,s+l}$ for $j \in \mathbb{Z}_n$ and
$s, s+l \in \mathbb{Z}_m$. Suppose there exist $k \in \mathbb{Z}_n$ and $s, s+l \in \mathbb{Z}_m$ such that
$x'_{k,s} > x'_{k,s+l}$. Since the sorter for $x_{k,s}$, $k \in \mathbb{Z}_n$, acts as a permutation of the
index $k$, we denote the permutation of the sorter connecting wire $s$ as
$f : \{1, \cdots, n\} \to \{1, \cdots, n\}$. Because $f$ is a bijection, an inverse $f^{-1}$ exists.
Then we have
$x_{f^{-1}(t),s+l} \overset{a}{\ge} x_{f^{-1}(t),s} = x'_{t,s} \ge x'_{k,s} > x'_{k,s+l}$
for $k \le t \le n$, where the inequality labeled $a$ holds because the inputs are sorted and the
“=” is due to the permutation. There are $n-k+1$ inputs $x_{f^{-1}(t),s+l}$ satisfying
$x_{f^{-1}(t),s+l} > x'_{k,s+l}$. However, at most $n-k$ outputs satisfy $x'_{t,s+l} > x'_{k,s+l}$ for
$t \in \{k+1, k+2, \cdots, n\}$, resulting in a contradiction. Hence, all lists are self-sorted
after applying the $n$-sorters.
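The statement proved above can be checked numerically with a small sketch: start from n lists that are each sorted, apply one n-sorter to every wire position across the lists, and confirm that every list is still sorted. The function names `apply_wire_sorters`, `is_sorted`, and `check_lemma` are hypothetical, and random testing is only a sanity check under these assumptions, not a substitute for the proof.

```python
import random

def apply_wire_sorters(lists):
    """Sort wire s across all lists, for every s (one n-sorter per wire)."""
    n, m = len(lists), len(lists[0])
    out = [row[:] for row in lists]
    for s in range(m):
        column = sorted(lists[j][s] for j in range(n))
        for j in range(n):
            out[j][s] = column[j]
    return out

def is_sorted(seq):
    """True if seq is in non-decreasing order."""
    return all(a <= b for a, b in zip(seq, seq[1:]))

def check_lemma(n, m, trials=1000, seed=0):
    """Randomly test that self-sorted lists stay self-sorted after the sorters."""
    rng = random.Random(seed)
    for _ in range(trials):
        lists = [sorted(rng.randint(0, 9) for _ in range(m)) for _ in range(n)]
        if not all(is_sorted(row) for row in apply_wire_sorters(lists)):
            return False
    return True
```

Calling `check_lemma(3, 5)` exercises three lists of five wires each; small value ranges are used deliberately so that ties occur often.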
A.2 Proof of Lemma 6.3.2
Proof. First, we show that the first connections of two adjacent sorters belong to
either the same list or two adjacent lists. Let $(j, t_1)$ and $(j+l, t_2)$ be the first
connections of two adjacent sorters $S_1$ and $S_2$, respectively, where $(j, t)$ denotes
wire $t$ in list $j$. If $l > 1$, the connection of $S_1$ in list $j+l-1$ should be wire $m$;
otherwise, $S_2$ would have a valid connection in list $j+l$. For lists $j$ to $j+l-2$,
only wire $m$ in each list is connected by $S_1$, since wire $m$ can be connected to
the preceding list only by an $(m-1)$-spaced sorter. Hence, $S_1$ is the last
$(m-1)$-spaced sorter in stage 1 and $S_2$ does not exist. Similarly, we can show that the
last connections of two adjacent sorters $S_1$ and $S_2$ belong to either the same list or
two adjacent lists. This gives a total of four cases, as shown in Fig. 7.5, where
$b \ge a+1$ for Fig. 7.5(a)-(c) and $b \ge a$ for Fig. 7.5(d), such that $S_1$ and $S_2$ have a
size of at least two.
Next, we show that if $m$ is prime, no two adjacent sorters belong to case IV.
Equivalently, we show that $m$ is a composite number if case IV in Fig. 7.5 occurs.
Assume two adjacent sorters $S_1$ and $S_2$ belong to case IV. Let the first connection
of $S_1$ be $(j, m)$ and the last connection of $S_2$ be $(j+p, 1)$. The last connection of
$S_1$ satisfies $(k+1)p \equiv 0 \bmod m$, so $m \mid (k+1)p$. Since case IV is not possible in
the first stage, we have $p < m$. Since two adjacent sorters connect two adjacent
wires in at least one list, we have $p > 1$. If $k = 0$, $S_1$ would connect the last and
first wires of adjacent lists, respectively, in which case $S_2$ does not exist. We thus have
$1 < k+1 < m$. If $m$ were prime, Euclid's lemma would force $m \mid (k+1)$ or $m \mid p$,
which is impossible because both $k+1$ and $p$ are positive and smaller than $m$. So $m$
has a proper factor dividing $k+1$ or $p$. Hence, $m$ is a composite number.
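The arithmetic core of this argument can be checked by brute force. The sketch below tests only the congruence condition $(k+1)p \equiv 0 \bmod m$ with $1 < k+1 < m$ and $1 < p < m$; it does not model the sorter configurations themselves, and the function name `has_case_iv_solution` is hypothetical.

```python
def has_case_iv_solution(m):
    """True if some a = k + 1 and p with 1 < a < m and 1 < p < m give m | a * p."""
    return any((a * p) % m == 0 for a in range(2, m) for p in range(2, m))
```

Under these assumptions, the function returns `False` for every prime m and `True` for every composite m, matching the conclusion that case IV requires a composite m.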
A.3 Proof of Theorem 6.3.1
Proof. The theorem can be proved by induction on $i$. In stage 1, $m$-sorters are
applied on corresponding wires of all $m$ lists. According to Lemma 7.3.1, the outputs
of each list are sorted. Assume any two adjacent wires $s$ and $s+1$ in list $j$ are sorted
after stage $i-1$, that is, $x^{(i-1)}_{j,s} \le x^{(i-1)}_{j,s+1}$ for $1 \le j \le n$ and $1 \le s \le m-1$. We will show
that $x^{(i)}_{j,s} \le x^{(i)}_{j,s+1}$ for $1 \le j \le n$ and $1 \le s \le m-1$.
According to Lemma 7.3.2, for a prime m, there are three cases of two adjacent
sorters S1 and S2 as shown in Fig. 7.5(a)-(c).
1. For case I, let $y^{(i-1)}_{j,1}$ and $y^{(i-1)}_{j,2}$ be the two adjacent wires in list $j$ connected by
two adjacent sorters in stage $i-1$ for $a \le j \le b$. According to Lemma 7.3.1
($n = 2$), the outputs of each list are sorted.
2. For case II, there is an additional single wire $y^{(i-1)}_{b+1,1}$ connected by $S_2$. If
$y^{(i-1)}_{b+1,1} = 1$, we have $y^{(i)}_{b+1,1} = 1$. The last connection of $S_2$ can be removed without
changing the order of the others in $S_2$. $S_1$ and the revised $S_2$ reduce to case I,
and the outputs are sorted according to Lemma 7.3.1. If $y^{(i-1)}_{b+1,1} = 0$, we have
$y^{(i-1)}_{b,1} = 0$, because they are connected by the same sorter in stage $i-1$.
Then, we have $y^{(i)}_{a,1} = y^{(i)}_{a,2} = 0$, which are sorted outputs in list $a$. After removing
$y^{(i-1)}_{b+1,1}$, $y^{(i-1)}_{b,1}$, $y^{(i)}_{a,1}$, and $y^{(i)}_{a,2}$, the remainder of $S_1$ and $S_2$ reduces to a smaller
configuration of case II. By recursively applying the above approach, $S_1$
and $S_2$ reduce to either a smaller case I or a single wire, both of which give
sorted outputs.
3. For case III, there is an additional single wire $y^{(i-1)}_{a-1,m}$ connected by the first
sorter. Similarly, the two sorters can be reduced to either a case I or a smaller
configuration of case III, and the outputs of two adjacent wires in each list are
sorted.
Assume all lists are self-sorted after stage $i-1$, so that $x^{(i-1)}_{j,1} \le \cdots \le x^{(i-1)}_{j,m}$ for
$1 \le j \le n$. For stage $1 \le i \le \lceil m/2 \rceil$, all wires in lists $j = 2, \cdots, n-1$ have connections
with some sorters. We have $x^{(i)}_{j,k} \le x^{(i)}_{j,k+1}$ for $j = 2, \cdots, n-1$ and $k = 1, \cdots, m-1$.
Hence, lists $j = 2, \cdots, n-1$ are self-sorted after stage $i$. For list 1, since $x^{(i-1)}_{1,i-1} \le x^{(i-1)}_{2,1}$
and $x^{(i-1)}_{1,i-1} \le x^{(i-1)}_{1,i}$, we have $x^{(i)}_{1,i-1} \le x^{(i)}_{1,i}$. The sublist
$\langle x^{(i)}_{1,1}, x^{(i)}_{1,2}, \cdots, x^{(i)}_{1,i-1} \rangle$ is sorted, since list
1 is self-sorted after stage $i-1$ and $x^{(i-1)}_{1,k} = x^{(i)}_{1,k}$ for $k = 1, \cdots, i-1$. The sublist
$\langle x^{(i)}_{1,i}, x^{(i)}_{1,i+1}, \cdots, x^{(i)}_{1,m} \rangle$ is also sorted. Hence, list 1 is self-sorted after stage $i$,
giving $\langle x^{(i)}_{1,1}, x^{(i)}_{1,2}, \cdots, x^{(i)}_{1,m} \rangle$.
Due to symmetry, list $n$ is also self-sorted after stage $i$, giving
$\langle x^{(i)}_{n,1}, x^{(i)}_{n,2}, \cdots, x^{(i)}_{n,m} \rangle$.
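To prove that the outputs of the $n$ sorted lists
$\langle x^{(\lceil m/2 \rceil)}_{j,1}, \cdots, x^{(\lceil m/2 \rceil)}_{j,m} \rangle$ for $j = 1, \cdots, n$
after stage $\lceil m/2 \rceil$ are combined into a single sorted list in stage $\lceil m/2 \rceil + 1$, we need to
show that
$$x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}} \le x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}+1} \quad \text{for } j = 1, \cdots, n-1$$
and
$$x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}-1} \le x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}} \quad \text{for } j = 2, \cdots, n.$$
Since $x^{(\lceil m/2 \rceil)}_{j,\frac{m+1}{2}} \le x^{(\lceil m/2 \rceil)}_{j,\frac{m+1}{2}+1}$
and $x^{(\lceil m/2 \rceil)}_{j,\frac{m+1}{2}} \le x^{(\lceil m/2 \rceil)}_{j+1,1}$, we have
$x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}} \le x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}+1}$
for $j = 1, \cdots, n-1$. Similarly, we have
$x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}-1} \le x^{(\lceil m/2 \rceil + 1)}_{j,\frac{m+1}{2}}$
for $j = 2, \cdots, n$.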
A.4 Proof of Lemma 6.3.3
Proof. In stage $i-1$, there are $n^{i-1}$ sorted lists of $n$ values with respect to each
$q$ ($q = 1, \cdots, n^{p-i}$). Since the outputs of each merging network are sorted after
stage $i-1$, we can replace each merging network by an $n^i$-sorter. According to
Lemma 7.3.1, the outputs of each newly formed list after stage $i$ are sorted,
$x^{(i)}_{j,q} \le x^{(i)}_{j,n^{p-i-1}+q} \le \cdots \le x^{(i)}_{j,(n-1)n^{p-i-1}+q}$ for $j = 1, \cdots, n^i$.
Since the corresponding wires in the
new lists are connected by the same $n^i$-sorter in stage $i-1$, we have $x^{(i)}_{j,q} \le x^{(i)}_{j+1,q}$
for $j = 1, \cdots, n^i - 1$. Hence, $r^{(i)}_{j,q} \ge r^{(i)}_{j+1,q}$ for $j = 1, \cdots, n^i - 1$.
For $r^{(i)}_{s,q} = n > r^{(i)}_{s+1,q} \ge \cdots \ge r^{(i)}_{s+l,q} > 0 = r^{(i)}_{s+l+1,q}$ with $l \le n$, it is equivalent
to prove that $x^{(i)}_{j+n,q} = 1$ if $x^{(i)}_{j,(n-1)n^{p-i-1}+q} = 1$ for $j \in \{1, \cdots, n^i - n\}$. For any
$q \in \{1, \cdots, n^{p-1-i}\}$ in stage $i$, there are $n^i$ lists of $n$ values. Suppose
$x^{(i)}_{j,(n-1)n^{p-i-1}+q} = 0$ for $j \le s$ and $x^{(i)}_{s+1,(n-1)n^{p-i-1}+q} = 1$. If $t$ ($t \le s$) zeros of
$x^{(i)}_{j,(n-1)n^{p-i-1}+q}$ are from
the same list of the original $n$ sorted lists, there are at most $t+1$ zeros of $x^{(i)}_{j,q}$ from
that same list. Since the zeros $x^{(i)}_{j,(n-1)n^{p-i-1}+q} = 0$ for $j \le s$ are from at most $n$ original lists,
there are at most $s+n$ zeros in $x^{(i)}_{j,q}$, implying that $x^{(i)}_{s+n,q} = 1$. Hence, $x^{(i)}_{j+n,q} = 1$ if
$x^{(i)}_{j,(n-1)n^{p-i-1}+q} = 1$ for $j \in \{1, \cdots, n^i - n\}$.
A.5 Proof of Theorem 6.3.2
Proof. In stage 1, all outputs with respect to the operation of the same Alg. 5
are sorted. For any $q \in \{1, \cdots, n^{p-1-i}\}$ in stage $i$, according to Lemma 7.3.3, at
most $n$ consecutive lists are neither all-zero nor all-one. All preceding lists are all-zero lists
and all following lists are all-one lists. Hence, the combining network in stage $i$
only needs to sort $n$ lists of $n$ values, which reduces to Alg. 5. In stage $p-1$, we have
$q = 1$ and the single sorted list,
$\langle x^{(i)}_{1,q}, x^{(i)}_{1,n^{p-i-1}+q}, \cdots, x^{(i)}_{1,(n-1)n^{p-i-1}+q}, x^{(i)}_{2,q}, x^{(i)}_{2,n^{p-i-1}+q}, \cdots, x^{(i)}_{2,(n-1)n^{p-i-1}+q}, \cdots, x^{(i)}_{n^i,q}, x^{(i)}_{n^i,n^{p-i-1}+q}, \cdots, x^{(i)}_{n^i,(n-1)n^{p-i-1}+q} \rangle$,
contains $n^p$ values, implying that all inputs are sorted as a single list.
Vita
Feng Shi received his B.E. and M.E. degrees from electrical engineering and mi-
croelectronics institute of Tsinghua University, Beijing, China in 2006 and 2009,
respectively. He is currently pursuing the Ph.D. degree in electrical and computer
engineering at Lehigh University, Bethlehem, Pennsylvania.
His research interests lie in delay modeling and crosstalk avoidance coding for
current CMOS technologies, and in threshold architectures for finite field operations
in nanotechnology.
