A high-speed integrated circuit with applications to RSA Cryptography by Onions, Paul David
A High-Speed Integrated Circuit for 
Applications to RSA Cryptography 
by 
Paul David Onions 
BEng(Hons), MSc 
A thesis submitted to the University of Plymouth 
in partial fulfilment for the degree of
DOCTOR OF PHILOSOPHY 
School of Electronic Communication and Electrical Engineering 
Faculty of Technology 
July 1995 
For Sian, 
who wanted to wear the floppy hat. 
A High-Speed Integrated Circuit for Applications to RSA Cryptography 
Paul David Onions 
BEng(Hons), MSc 
The rapid growth in the use of computers and networks in government, commercial and 
private communications systems has led to an increasing need for these systems to be 
secure against unauthorised access and eavesdropping. To this end, modern computer 
security systems employ public-key ciphers, of which probably the most well known is the 
RSA ciphersystem, to provide both secrecy and authentication facilities. 
The basic RSA cryptographic operation is a modular exponentiation where the modulus 
and exponent are integers typically greater than 500 bits long. Therefore, to obtain rea- 
sonable encryption rates using the RSA cipher requires that it be implemented in hardware. 
This thesis presents the design of a high-performance VLSI device, called the WHiSpER 
chip, that can perform the modular exponentiations required by the RSA cryptosystem 
for moduli and exponents up to 506 bits long. The design has an expected throughput 
in excess of 64kbit/s making it attractive for use both as a general RSA processor within 
the security function provider of a security system, and for direct use on moderate-speed 
public communication networks such as ISDN. 
The thesis investigates the low-level techniques used for implementing high-speed arith- 
metic hardware in general, and reviews the methods used by designers of existing modular 
multiplication/exponentiation circuits with respect to circuit speed and efficiency. 
A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arith- 
metic, together with an efficient multiplier architecture, are proposed that remove the 
speed bottleneck of previous designs. 
Finally, the implementation of the new algorithm and architecture within the WHiSpER 
chip is detailed, along with a discussion of the application of the chip to ciphering and key 
generation. 
Contents 
List of Figures
List of Tables
Acknowledgements
Declaration
Glossary
1 Introduction
2 Cryptology
  2.1 Cryptography and Cryptanalysis
  2.2 Secret- and Public-Key Ciphersystems
    2.2.1 Secret-Key Ciphersystems
    2.2.2 Public-Key Ciphersystems
  2.3 Modular Arithmetic
    2.3.1 Divisibility
    2.3.2 Integer Range
    2.3.3 Integer Division
    2.3.4 Congruences
    2.3.5 Least Non-negative Residue
  2.4 Modular Arithmetic Operations
    2.4.1 Addition
    2.4.2 Subtraction
    2.4.3 Multiplication
    2.4.4 Division
  2.5 Euclid's Algorithms
    2.5.1 The Euclidean Algorithm
    2.5.2 The Extended Euclidean Algorithm
  2.6 Fundamental Theorems
    2.6.1 The Fundamental Theorem of Arithmetic
    2.6.2 Fermat's Little Theorem
    2.6.3 The Euler Totient Function
    2.6.4 Euler's Theorem
    2.6.5 The Chinese Remainder Theorem
  2.7 The RSA Ciphersystem
    2.7.1 Key Generation
    2.7.2 Secret Information Exchange
    2.7.3 Authentic Information Exchange
    2.7.4 Secret and Authentic Information Exchange
    2.7.5 Framing
  2.8 Summary
3 Binary Arithmetic
  3.1 The Binary Representation of Numbers
    3.1.1 Right-to-Left Binary Evaluation Algorithm
    3.1.2 Left-to-Right Binary Evaluation Algorithm
  3.2 Binary Arithmetic
    3.2.1 Addition
    3.2.2 Multiplication
    3.2.3 Exponentiation
  3.3 Modular Binary Arithmetic
    3.3.1 Modular Multiplication
    3.3.2 Modular Exponentiation
  3.4 Summary
4 Arithmetic Hardware
  4.1 Adders
    4.1.1 Ripple Adder
    4.1.2 Carry-Completion Adder
    4.1.3 Carry-Select Adder
  4.2 Iterative Multipliers
    4.2.1 Carry-Save Adder
    4.2.2 High-Radix Iterative Multiplication
  4.3 Multiplier Recoding
  4.4 Signed Number Representations
    4.4.1 2's Complement Representation
    4.4.2 1's Complement Representation
    4.4.3 Sign-Magnitude Representation
    4.4.4 Signed-Digit Representation
    4.4.5 Redundant Signed-Digit (RSD) Representation
  4.5 A Recoded Multiplier
  4.6 Summary
5 Standard RSA Hardware
  5.1 Multiple-Precision Arithmetic Hardware
  5.2 Multiply-Divide Hardware
  5.3 Radix-2 Concurrent Multiply/Reduce Hardware
    5.3.1 Simple Modular Reduction
    5.3.2 Residue-Table Reduction
    5.3.3 Quotient Estimation
  5.4 Radix-4 Concurrent Multiply/Reduce Hardware
  5.5 Radix-$2^b$ Concurrent Multiply/Reduce Hardware
  5.6 Other Proposed Systems
  5.7 Summary
6 Montgomery Arithmetic
  6.1 Montgomery Multiplication
    6.1.1 Calculating Z
    6.1.2 Interpreting P
  6.2 Montgomery Exponentiation
    6.2.1 N-residue Representation
    6.2.2 N-residue Exponentiation
  6.3 Iterative Montgomery Multiplication
    6.3.1 Radix-2 Montgomery Multiplication
    6.3.2 Radix-$2^b$ Montgomery Multiplication
  6.4 Montgomery Multiplier Implementations
    6.4.1 Multi-precision Implementations
    6.4.2 Systolic Array Implementations
    6.4.3 A Pipelined Implementation
  6.5 Summary
7 Optimized Montgomery Multiplication
  7.1 Radix-2 Multiplication
    7.1.1 The DAMMM algorithm
    7.1.2 Result Range
    7.1.3 Radix-2 DAMMM Performance Summary
  7.2 Radix-4 Multiplication
    7.2.1 Recoding X
    7.2.2 Recoding Z
    7.2.3 An RSD Montgomery Multiplier
    7.2.4 The MMDAMMM Algorithm
    7.2.5 Generating z(i) under MMDAMMM
    7.2.6 Radix-4 recoded MMDAMMM Performance Summary
  7.3 Radix-$2^b$ Multiplication
    7.3.1 Generating x(i)
    7.3.2 Generating z(i)
    7.3.3 MMDAMMM Performance Summary
  7.4 The MMDDAMMM Algorithm
    7.4.1 Radix-4 MMDDAMMM
    7.4.2 Radix-$2^b$ MMDDAMMM
    7.4.3 Radix-$2^b$ MMDDAMMM Performance Summary
  7.5 Summary
8 The WHiSpER Chip
  8.1 Technology
  8.2 The Multiplier
    8.2.1 Efficiency of Recoded Multipliers
    8.2.2 Multiplier Selection
    8.2.3 The Carry-Propagate Adder
  8.3 The Exponentiator
    8.3.1 R-to-L M-residue Exponentiation
    8.3.2 L-to-R M-residue Exponentiation
    8.3.3 Optimizing L-to-R Exponentiation
  8.4 Register Variable Analysis
  8.5 Architecture
    8.5.1 The SRAM Device
    8.5.2 WHiSpER
  8.6 Operation
    8.6.1 RAM
    8.6.2 Registers
    8.6.3 Commands
    8.6.4 Operation Examples
  8.7 Performance
    8.7.1 Key Load Process
    8.7.2 Transfer Process
    8.7.3 Exponentiation Process
    8.7.4 Reduction Process
    8.7.5 RSA Throughput
    8.7.6 Gate-Array Selection
    8.7.7 Power Consumption
  8.8 Testability
  8.9 The WHiSpER PC-Card
  8.10 Summary
9 Conclusions
  9.1 The WHiSpER Chip and Extended Moduli
    9.1.1 CRT Exponentiation Using The WHiSpER Chip
    9.1.2 Host Exponentiation with a Small Exponent
  9.2 The WHiSpER Chip and Key Generation
    9.2.1 Rabin's Primality Test
    9.2.2 Primality Testing on WHiSpER
  9.3 Achievements
    9.3.1 A New Algorithm
    9.3.2 An Efficient Architecture
    9.3.3 The WHiSpER Chip
  9.4 Further Work
    9.4.1 Exponentiation Algorithms
    9.4.2 Improved Technology
    9.4.3 Towards a New Architecture
  9.5 Summary
Bibliography
A The WHiSpER SMC
  A.1 SMC Input Signals
  A.2 SMC Output Signals
  A.3 SMC Internal Signals
  A.4 State-Transition Diagrams
B The WHiSpER Schematics
C Published Work
List of Figures 
2.1 A generic ciphersystem
3.1 A k-bit ripple adder
3.2 Multiplier partial products
4.1 One-bit full adder (FA)
4.2 Carry-completion adder
4.3 Carry-select adder sub-block
4.4 Generic iterative multiplier
4.5 Carry-save adder
4.6 Generating $x_i \cdot Y$
4.7 A 2-level CSA multiplier
4.8 A 5:3 adder cell
4.9 Adder interconnect optimization
4.10 A pipelined iterative multiplier
4.11 General full adders
4.12 RSD addition
4.13 Addition of 2's complement vector to RSD vector
4.14 Simplification of lower (k - 1)-th adder of Figure 4.13
4.15 Conversion of RSD vector to 2's complement
4.16 Bitslice of 0, ±Y and ±2Y generation
4.17 Recoded multiplier architecture
5.1 L-to-R modular multiplication
5.2 L-to-R MM: overflow determination of subtractions of N
5.3 L-to-R MM: residue-table lookup
5.4 Tomlinson modular multiplier
5.5 L-to-R MM: quotient estimation
5.6 Delayed-carry adder (DCA)
5.7 Half-adder (HA)
5.8 VICTOR architecture
5.9 DR b=2 architecture
5.10 3-stage pipelined, b=4 DR multiplier
6.1 Montgomery modular multiplication
6.2 Right-to-Left modular exponentiation
6.3 R-to-L Montgomery N-residue exponentiation
6.4 L-to-R Montgomery N-residue exponentiation
6.5 A 1-dimensional systolic array
7.1 Implementation of AMMM
7.2 Implementation of DAMMM
7.3 Radix-2 DAMMM CSA array
7.4 Radix-2 DAMMM $x_i \cdot Y$ and $z_i \cdot N$ generation
7.5 Radix-2 DAMMM delay-path
7.6 Recoding $X_i \in \{0, 1, 2, 3\}$ to $x(i) \in \{-2, -1, 0, 1\}$
7.7 Multiple $x(i) \cdot Y$ generation
7.8 Pipelined generation of x(i)
7.9 Radix-4 recoded DAMMM with RSD architecture
7.10 Circuit for generating z(i)
7.11 Radix-4 DAMMM delay path
7.12 Optimized generation of z(i) using MMDAMMM
7.13 Delay path for MMDAMMM
7.14 MMDAMMM recoded RSD multiplier
7.15 General x(i) recoding
7.16 Radix-$2^b$ z(i) generation
7.17 Radix-$2^b$ $x_0(i)$ generation
7.18 Radix-$2^b$ $z_{b/2-1}(i)$ generation
7.19 Radix-4 MMDDAMMM adder structure
7.20 Radix-4 MMDDAMMM z(i) generation
7.21 Radix-$2^b$ recoded MMDDAMMM multiplier
8.1 Full RSD MMDDAMMM b=2 multiplier
8.2 The WHiSpER and SRAM devices
8.3 SRAM memory map
8.4 The WHiSpER chip
8.5 MME - Montgomery Modular Exponentiator
8.6 SMC - State Machine Controller
8.7 WHiSpER memory map
8.8 Multiple exponentiation state transition diagram
8.9 XT-bus address map
8.10 The WHiSpER PC-Card schematic
9.1 Rabin's primality test
A.1 LSM state-transition diagram
A.2 TSM state-transition diagram
A.3 ESM state-transition diagram
A.4 RSM state-transition diagram
List of Tables 
4.1 Signed-digit addition; transfer and intermediate sum digits
5.1 Minimum 8 and c for minimum MAX
5.2 Optimum 8 and e
5.3 Optimum values for b, c and gMAX
6.1 Standard and N-residue representations for N = 21 and R = 32
7.1 Generating z(i)
8.1 CLA70000 Cell characteristics
8.2 CSA MMDDAMMM (unrecoded) performance figures
8.3 RSD MMDDAMMM (recoded) performance figures
Acknowledgements 
I would like to express my sincere thanks to the following people:
• Peter Sanders, my Director of Studies, for his encouragement and motivation, and
  for his confidence in my abilities throughout the research program,
• Alan Roberts for his help with the GPS/Mentor ECAD design environment,
• Simon Shepherd for originally suggesting the direction of the research, and
• to all the other members of the Network Research Group and to the technicians of
  the School of Electronic Communication and Electrical Engineering for their help
  and support during the last two and a half years.
Declaration 
At no time during the registration for the degree of Doctor of Philosophy has the author 
been registered for any other university award. 
Relevant scientific seminars and conferences were regularly attended at which work was 
often presented; external institutions were visited for consultation purposes, and several 
papers prepared for publication. 
The work presented in this thesis is solely that of the author. 
Signed ....................
Date ....................
Glossary 
x' the prime symbol applied to the variable x, as x', is usually
used to indicate a modification to x. The exact meaning
should be clear from the context in which it is used.
$[x_{k-1} x_{k-2} \ldots x_0]$ is the k-bit vector representation of the integer X, whose
value is given by $X = \sum_{i=0}^{k-1} 2^i \cdot x_i$ where $x_i \in \{0, 1\}$.
$[X_{l-1} X_{l-2} \ldots X_0]$ is the l-digit vector representation of the integer X, whose
value is given by $X = \sum_{i=0}^{l-1} 2^{ib} \cdot X_i$ for some integer $b > 0$.
x(i) the i-th recoded digit of X. See Sections 4.3 and 7.2.1.
$x \mid m$ means x divides into m exactly.
[a, b] integer range a to b inclusive.
(a, b) integer range a to b exclusive.
$\lfloor x/m \rfloor$ greatest integer not exceeding x/m.
$\lceil x/m \rceil$ least integer not less than x/m.
W. least non-negative residue of x modulo m. 
$x^{-1}$ (mod m) multiplicative inverse of x modulo m.
$\phi(m)$ Euler's totient on m.
$\Delta_{AND}$ propagation delay of AND gate.
$\Delta_{NAND}$ propagation delay of NAND gate.
$\Delta_{HA}$ propagation delay of Half Adder.
$\Delta_{FA}$ propagation delay of Full Adder.
$\Delta_{MUX}$ propagation delay of 4-to-1 Multiplexer.
$\Delta_{FF}$ setup and propagation delay of Flip-Flop register element.
$\Omega_{AND}$ gate-count complexity of AND gate.
$\Omega_{NAND}$ gate-count complexity of NAND gate.
$\Omega_{HA}$ gate-count complexity of Half Adder.
$\Omega_{FA}$ gate-count complexity of Full Adder.
$\Omega_{MUX}$ gate-count complexity of 4-to-1 Multiplexer.
$\Omega_{FF}$ gate-count complexity of Flip-Flop register element.
$M_{R,N}(X, Y)$ fully reduced Montgomery multiplication. See Section 6.2.2.
)R, M (X, Y) partially reduced Montgomery multiplication. See Section 8.3.3. 
$T_M$ time (in nanoseconds) required for one multiplication. See
Section 8.2.1.
$R_M$ multiplication rate.
$G_M$ multiplier circuit gate-count.
$E_M$ multiplier circuit efficiency.
$T_E$ time (in nanoseconds) required for one exponentiation. See
Section 8.3.1.
$R_E$ exponentiation rate.
$G_E$ exponentiator circuit gate-count.
$E_E$ exponentiator circuit efficiency.
$X^+, X^-$ positive and negative component vectors of RSD representation.
AE Array Element. 
AMMM Additive Montgomery Modular Multiplication. 
ASIC Application Specific Integrated Circuit. 
CMOS Complementary Metal-Oxide Semiconductor. 
CPA Carry-Propagate Adder. 
CRT Chinese Remainder Theorem. 
CSA Carry-Save Adder. 
DAMMM Delayed Additive Montgomery Modular Multiplication.
DCA Delay-Carry Adder. 
DR Diminished Radix. 
FA Full Adder. 
FCPA Fast Carry-Propagate Adder. 
FPGA Field Programmable Gate Array. 
GFA General Full Adder. 
GPS GEC Plessey Semiconductors. 
HA Half Adder. 
L-to-R Left-to-Right. 
MMDAMMM Modified Modulus Delayed Additive Montgomery Modular 
Multiplication. 
MMDDAMMM Modified Modulus Double Delayed Additive Montgomery 
Modular Multiplication. 
MUX Multiplexer. 
PAM Programmable Array Memory. 
RAM Random Access Memory. 
RNS Residue Number System. 
ROM Read Only Memory. 
RSA Rivest, Shamir and Adleman. 
RSD Redundant Signed-Digit. 
R-to-L Right-to-Left. 
SRAM Static RAM. 
VLSI Very Large Scale Integration. 
WHiSpER Wide-word High-Speed Encryption for RSA. 
Chapter 1 
Introduction 
This thesis concerns the design of a high-speed integrated circuit device capable of per- 
forming the modular exponentiation operations required of the RSA cryptosystem for 
moduli of around 500 bits in length. 
The thesis is constructed as a progression, starting from basic concepts in Chapter 2,
through standard arithmetic algorithms and hardware in Chapters 3 and 4, current RSA
hardware in Chapter 5, and Montgomery arithmetic and optimised Montgomery multipliers in
Chapters 6 and 7, and culminating in details of the WHiSpER chip in Chapter 8, followed
by conclusions and ideas for further work in Chapter 9.
Chapter 2 serves as a brief introduction to cryptography, covering the definition of 
a ciphersystem and the difference between secret-key and public-key systems. This is 
followed by an overview of the main theorems and basic algorithms of modular arith- 
metic, introducing the notation that will be used throughout this thesis. Finally, the RSA 
ciphersystem is explained in detail. 
Chapter 3 reviews the binary representation of numbers and details the right-to-left 
and left-to-right binary evaluation techniques. These techniques are then used to derive 
algorithms for iterative multiplication and exponentiation. 
Chapter 4 studies arithmetic hardware. Adder circuits and iterative multipliers are 
explained together with a discussion of multiplier recoding techniques and signed-number 
representations. At the end of this chapter the design of an efficient iterative multiplier is 
presented, the basic principles of which will be used to construct the optimised multipliers 
of Chapter 7. 
Chapter 5 reviews the current literature concerning the implementation of RSA cryptosystems
using standard modular multipliers. In-depth descriptions of two particular
designs are given, showing the trade-offs that have to be made in an effort to create a
fast and efficient design. The limitations of current designs are identified in this chapter.
In Chapter 7 it will be shown that these limitations can be removed with optimised
Montgomery multipliers.
Chapter 6 introduces Montgomery modular arithmetic. The technique is explained 
and basic algorithms for Montgomery multiplication and exponentiation are studied. A 
review of some of the proposed hardware implementation schemes is also given. 
In Chapter 7 new, optimised designs for Montgomery multipliers are presented that 
allow the multiplier to operate at full-speed. The algorithms and architecture of Chapters 
6 and 4 serve as the basis from which the new optimised Montgomery multipliers are 
developed. 
Chapter 8 presents the design of the WHiSpER chip. Selection of the multiplier and
exponentiator circuit is performed based on the technology issues of the GPS CLA70000
series gate array device used. The architecture of the chip is described, showing how
an efficient and high-throughput device can be realised, followed by a description of its
operation and an analysis of its expected performance. Finally, the details of a simple
WHiSpER-based IBM PC card are given.
Chapter 9 concludes the thesis. It discusses the use of the WHiSpER chip for applications
with key sizes of up to 1000 bits and shows how key generation can be effected by
using WHiSpER to implement a primality testing function.
Chapter 2 
Cryptology 
The discipline of cryptology is that of `secret-writing' or, in less dramatic terms, the study 
of systems that can hide the information content of messages. Though the complexity 
of these systems varies considerably (from the simple letter-substitution ciphers of Julius 
Caesar to the sophisticated `information-randomisation' algorithms of the present day), 
they are all generally known as ciphersystems or cryptosystems. This chapter serves as a 
brief introduction to the field of cryptology and cryptosystems [1] [2] [3]. 
2.1 Cryptography and Cryptanalysis 
Cryptology can be split into two main areas: cryptography and cryptanalysis. Broadly
speaking, it is the cryptographer's job to create new ciphersystems, and the cryptanalyst's
job to break them. Figure 2.1 shows a generic ciphersystem. The message to be encrypted,
known as the plaintext, is converted into an encrypted message, the ciphertext, by means 
of an encryption algorithm. The encryption algorithm makes use of an encryption key to 
determine the exact ciphertext produced. Decryption from the ciphertext to the plaintext 
follows a similar process using the decryption algorithm and a decryption key. 
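
The roles of plaintext, ciphertext, algorithm and key described above can be illustrated with a minimal sketch. It uses a Caesar-style letter shift, chosen here only because it is the simplest cipher mentioned in this chapter; the function names `encrypt` and `decrypt` and the sample message are illustrative assumptions, not part of the thesis.

```python
import string

ALPHABET = string.ascii_uppercase  # 26-letter plaintext/ciphertext alphabet


def encrypt(plaintext: str, key: int) -> str:
    """Shift each letter forward by `key` places; other characters pass through."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + key) % 26] if ch in ALPHABET else ch
        for ch in plaintext.upper()
    )


def decrypt(ciphertext: str, key: int) -> str:
    """Undo the shift; in this toy cipher the same key value serves both directions."""
    return encrypt(ciphertext, -key)


message = "ATTACK AT DAWN"
ciphertext = encrypt(message, 3)    # the key selects which ciphertext is produced
recovered = decrypt(ciphertext, 3)
print(ciphertext, "->", recovered)
```

With only 26 possible keys such a cipher offers no practical security; the sketch is meant only to make the data flow of Figure 2.1 concrete.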
The goal of the cryptographer is to make these processes simple enough to be executed
quickly, yet make them complex enough so that it is generally infeasible to infer properties
of one part of the system from properties of another part of the system.
Figure 2.1: A generic ciphersystem. (Blocks: encryption key, encryption algorithm, plaintext, ciphertext, decryption algorithm, decryption key.)
In designing such a system, however, the cryptographer must bear in mind the various 
different techniques that the cryptanalyst has at his disposal. Basic attack models vary 
in their assumptions concerning the nature of the information that the cryptanalyst has 
obtained. They are, in increasing order of significance, 
• ciphertext-only; the cryptanalyst has access only to a selection of ciphertexts,
• known-plaintext; the cryptanalyst has access to plaintext-ciphertext pairs,
• chosen-plaintext (-ciphertext); the cryptanalyst is able to obtain the ciphertext
  (plaintext) corresponding to a plaintext (ciphertext) of his own choosing.
A secure ciphersystem should be resistant to all of these. 
This raises the question of what exactly is meant by the term `secure'. The following 
two definitions are commonly used:
• perfect security; the ciphersystem is unbreakable even under the assumption that
  the cryptanalyst has unlimited computing power, and
• practical security; the ciphersystem is considered unbreakable under the assumption
  that the cryptanalyst has powerful but finite computing resources.
It has been shown by Shannon in [4] that the only perfectly secure ciphersystem is one in 
which the uncertainty of the encryption/decryption key is at least as large as the uncer- 
tainty of the plaintext message, and that this key be used only once and then discarded. 
Such a system is known as a `one-time-pad' and obviously has only very limited use. 
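
As a rough illustration of the one-time-pad idea, the sketch below combines a message with a random key of equal length using bytewise XOR. The XOR construction, the helper name `xor_bytes` and the sample message are assumptions made for this example; the text above only states the key-length and single-use conditions.

```python
import secrets


def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """Combine data with the pad bytewise; the pad must cover the whole message."""
    if len(pad) < len(data):
        raise ValueError("pad must be at least as long as the message")
    return bytes(d ^ p for d, p in zip(data, pad))


message = b"MEET AT NOON"
pad = secrets.token_bytes(len(message))  # fresh random key, used once and then discarded
ciphertext = xor_bytes(message, pad)
recovered = xor_bytes(ciphertext, pad)   # applying the same pad again restores the message
print(recovered)
```

The need for as much key material as message, delivered securely in advance and never reused, is what limits the practical use of such a system.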
The majority of ciphersystems in use today are of the practically secure type, where 
the level of security is often measured in terms of the number of years required to break 
a ciphersystem using state-of-the-art computing power. However, there is no consensus on
the level of difficulty of solving certain problems that arise in breaking cryptosystems, and 
so such estimates are always very approximate. 
The elements of the ciphersystem that are considered `secret' vary from system to 
system. For example in some governmental and most military ciphersystems, the encryp- 
tion and decryption algorithms themselves are not public knowledge. Whether this makes 
the ciphersystem more secure is open to debate since although fewer people know the 
algorithm and thus can perform cryptanalytic attacks on it, this does not prove that the 
algorithm is a strong one. Indeed, there is currently no method of proving the general 
level of security of any ciphersystem and the only widely accepted measure of the security 
of a system is that the cipher algorithm be public knowledge and that it have survived 
repeated cryptanalytic attacks over many years. 
Thus, if a particular ciphersystem is in common use (and so assuming that the algo- 
rithm is public knowledge), we see that the security of the system depends entirely on the 
cipher keys. The criteria for such systems can thus be summarised as follows:
• the resultant ciphertext should be statistically uncorrelated with the plaintext,
• it should not be possible to determine a cipher-key given an arbitrary number of
  plaintext-ciphertext pairs, and
• if there is a large range from which a key can be selected (a large key-space), then
  the security of such a system is increased, e.g. against exhaustive key search attacks.
2.2 Secret- and Public-Key Ciphersystems
Ciphersystems split quite neatly into two distinct types: secret-key and public-key systems.
2.2.1 Secret-Key Ciphersystems 
A secret-key system is, referring back to Figure 2.1, a system in which the encryption key 
and decryption key are identical. This means that, with two parties wishing to exchange 
secret information (call them Alice and Bob; Alice wants to send a message to Bob), it is 
first necessary for each of them to possess the common key. If they are physically remote 
from each other and wish to communicate via some sort of communications network then 
some method of distributing this key is needed. Obviously the key cannot be transmitted 
over a non-encrypted communication channel and so the usual solution to this problem 
(at least as far as initial key distribution is concerned) is to place the key into some secure 
physical device and manually transport it to Bob (assuming Alice created the key). 
The above solution to the key distribution problem may be acceptable in isolated cases, 
but consider the situation in which there is a network of n users each of whom wishes to 
communicate with any one of the n-1 other users in a secure way. It can be seen that 
each of the users must possess n-1 keys and so the total number of keys in the system
is $n(n-1)/2 \approx n^2/2$ for large n. Thus for any sizeable network of n people there are two major
problems associated with the use of a secret-key cryptosystem:
• key distribution; physically transporting the n-1 keys to each user, and
• key space; the greater the ratio of keys used to available key space, the greater the
  probability of `random' cryptanalytic attacks on all n users succeeding in finding the
  key for one of the $n(n-1)/2$ keys in use.
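
The growth in the number of keys can be checked with a short sketch. The function names and the sample network sizes are illustrative assumptions; the count itself is simply the number of distinct user pairs.

```python
from itertools import combinations


def keys_per_user(n: int) -> int:
    """Each user holds one key for every other user."""
    return n - 1


def secret_key_count(n: int) -> int:
    """Distinct pairwise keys needed so that any two of n users share a key."""
    return n * (n - 1) // 2


# Cross-check the closed form against an explicit enumeration of user pairs.
for n in range(2, 8):
    assert secret_key_count(n) == sum(1 for _ in combinations(range(n), 2))

for n in (10, 100, 1000):
    print(n, keys_per_user(n), secret_key_count(n))
```

For a network of 1000 users this already amounts to 499500 keys, each of which has to be transported securely.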
Among the many secret-key cryptosystems in use today (such as the IDEA block 
cipher [5] and the various stream ciphers [1]) the most common is probably the DES 
(Data Encryption Standard) cipher [3]. Developed in the 1970's as an offshoot of IBM's 
Lucifer ciphersystem [1], DES was accepted by the American NBS (National Bureau of 
Standards) as a U.S. Federal Information Processing Standard in 1977. In that same year
the complete specification of DES was also published but IBM's design principles for the 
cipher were classified by the NSA (National Security Agency). Considerable controversy 
has been generated by the DES cipher, not least concerning the small key size of 56 
bits. Although the resultant key space of $2^{56} \approx 10^{17}$ is considered too large to mount a 
brute-force cryptanalytic attack on any particular key using current computer hardware, 
it is not considered impossible that this approach may become feasible in the near future. 
Indeed, recent developments in cryptanalytic techniques (differential cryptanalysis, see [6] 
[7]) have shown that DES is susceptible to certain cryptanalytic attacks, and this is why 
enhanced DES-like algorithms (such as DES double key mode [8]) have been investigated 
as to their increased level of security. 
2.2.2 Public-Key Ciphersystems 
A public-key ciphersystem is one in which the encryption and decryption keys of Figure 2.1 
are distinct. Together the keys are known as a key-pair and, in general, one of the keys is 
kept secret (known only to the generator of the key-pair) whilst the other is made public. 
Returning to the example of the previous section, if Bob creates a key-pair and re- 
leases the encryption-key to the public then Alice can send secret information to Bob by 
encrypting her plaintext with the encryption-key and sending the resultant ciphertext to 
Bob. Since Bob is the only person with the decryption-key, he is the only one who can 
read the encrypted message and so the desired goal of keeping the communication secret 
has been achieved. 
If Alice and Bob are now considered to belong to a network of n users, where each 
user creates their own key-pair and makes public only the encryption-key from this pair, 
then it is possible for secret communications to take place between any two users on the 
network. Note that this has been achieved firstly with the use of only n key-pairs and 
secondly without any form of secure key-distribution being necessary. 
There are a number of public-key cryptosystems known today including ElGamal's [9], 
the elliptic curve cryptosystems [10] and others [11] [12] [13]. However, perhaps the most 
widely used is the RSA system named after its inventors Rivest, Shamir and Adleman [14]. 
Like most other public-key systems, the RSA cipher uses elements from number theory 
to construct the encryption/decryption algorithms, and a complete description of the 
mathematics of RSA is given in the following sections. For the moment though, it suffices 
to say that the computation of an RSA ciphertext involves arithmetic calculations on very 
large numbers (numbers over 150 digits or 500 bits long), and the security of an RSA 
ciphersystem depends (in the main) on the difficulty of factoring very large integers into 
their prime components. An interesting consequence of the latter statement is that, since 
the integer factoring problem has been studied by mathematicians since `time immemorial' 
with no generally applicable fast algorithm ever having been found, the security of the 
RSA cryptosystem has at least a sound historical footing. 
More information on the security of RSA can be found in [15] [16] [17] [18] [19]. 
Another feature of the RSA system is that, since its level of security depends on 
the difficulty of the general problem of factoring integers of a given size, that level of 
security can be increased simply by increasing the size of the numbers that the RSA 
system manipulates. In other words, the RSA system is scalable with respect to security. 
However, RSA is not all good news. There are two points which detract from its 
appeal, and these are 
• calculation time; arithmetic operations on very large numbers take time, and 
• security; the unknown level of hardness of the integer factorization problem. 
The first of the above means that the throughput of an RSA cryptosystem is very slow 
compared to say the DES system. The second item above acknowledges the fact that 
mathematicians have never been able to show exactly how hard the integer factorization 
problem is. In other words, is there a lower limit on the `easiness' of being able to factor 
integers of a given size? Since this is not known then it is not impossible for someone to 
discover tomorrow a new algorithm to solve the integer factorization problem in very fast 
time, and thus render useless any cryptosystem based on this problem. In fairness though, 
such proofs of minimal problem complexity have not been found for other cryptosystems, 
and so any cipher may be rendered obsolete tomorrow by the discovery of new efficient 
algorithms (however unlikely). 
In summary, secret-key ciphersystems are generally fast but suffer from key distribution 
problems, whilst public-key systems are slow but allow simpler key-management schemes. 
2.3 Modular Arithmetic 
Modular arithmetic is a tool that is used in the branch of mathematics known as number 
theory [20] [21] [22] [23]. Number theory, as its name suggests, is all about the properties of 
numbers, and in the following sections the notation, arithmetical operations, some simple 
algorithms and a few of the major theorems of modular arithmetic will be described. 
The following is an overview of the modular arithmetic notation that will be used 
throughout this thesis. As a general rule, if a variable, say m, is used in the context of 
a modulus then it is understood that m is a positive integer. On the other hand, if a 
variable, say x, is used in a general context then x may be any integer. 
2.3.1 Divisibility 
The notation $x \mid m$ means that x divides into m exactly. In other words m is a multiple of 
x. Conversely, $x \nmid m$ means that x does not divide into m. 
2.3.2 Integer Range 
The notation $x \in [a, b]$ means that x is an integer that ranges over the values a to b such 
that $a \le x \le b$. The notation $x \in [a, b)$ means $a \le x < b$. 
2.3.3 Integer Division 
The notation $q = \lfloor x/m \rfloor$ means the largest integer q not exceeding x/m. The notation 
$q = \lceil x/m \rceil$ means the smallest integer q not less than x/m. 
For example, $\lfloor 2.7 \rfloor = 2$ and $\lceil 2.7 \rceil = 3$. 
2.3.4 Congruences 
The congruence is a modulo relationship between numbers. The notation $x \equiv r \pmod{m}$ 
is pronounced `x is congruent to r modulo m' and means that $m \mid (x - r)$, or in other 
words, x and r differ only by multiples of m. Thus x and r are said to belong to the same 
residue class modulo m. 
For example, $25 \equiv 18 \equiv 11 \equiv 4 \pmod{7}$. 
2.3.5 Least Non-negative Residue 
The least non-negative residue is a modulo operator on numbers. The notation $r = (x)_m$ 
means $r = x - qm$ where $q = \lfloor x/m \rfloor$ is the integer part of x/m, and so r is the remainder 
part of x/m such that $r \in [0, m-1]$. 
For example, $(20)_7 = 6$. 
2.4 Modular Arithmetic Operations 
The following are definitions of the basic modular arithmetic operations of addition, sub- 
traction, multiplication and `division'. 
2.4.1 Addition 
Modular addition is performed as in `normal' arithmetic addition, with the result usually 
denoted by the least non-negative residue. 
For example, $3 + 5 = 8 \equiv 1 \pmod{7}$. Or, using the alternative notation, $(3 + 5)_7 = 1$. 
2.4.2 Subtraction 
Modular subtraction can be performed by noting that $-r \equiv m - r \pmod{m}$. 
For example, $-3 \equiv 7 - 3 \equiv 4 \pmod{7}$. Or $(-3)_7 = 4$. 
2.4.3 Multiplication 
Modular multiplication is again performed as in `normal' arithmetic multiplication. 
For example, $3 \cdot 4 = 12 \equiv 5 \pmod{7}$. Or $(3 \cdot 4)_7 = 5$. 
2.4.4 Division 
Although the above three operations are all very straightforward, the operation of division 
in modular arithmetic is not always defined, and requires special constraints to be met in 
order for it to be possible. 
Consider $x \cdot y \equiv z \pmod{m}$. If we want to find x then, using `normal' arithmetic 
principles we would write $x \equiv z \cdot y^{-1} \pmod{m}$. But this implies that, given a y, there 
must exist an integer $y^{-1}$ (called the multiplicative inverse of y modulo m) such that 
$y \cdot y^{-1} \equiv 1 \pmod{m}$. The problem is that given arbitrary y and m it is not always 
possible to find such an integer. Thus for `division' by y to be possible modulo m, the 
multiplicative inverse of y must exist, and this occurs if and only if y and m are coprime. 
That is $\gcd(y, m) = 1$, where gcd() is the greatest common divisor function. 
The gcd of two numbers can be computed by the Euclidean algorithm, and the multi- 
plicative inverse of an integer with respect to a coprime modulus can be computed by the 
extended Euclidean algorithm. 
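The four operations above can be tried out directly. The following is a minimal sketch in Python (3.8 or later), not part of the original design; the helper name `mod_div` is hypothetical, and the built-in `pow(y, -1, m)` is used here as a stand-in for the extended Euclidean algorithm.

```python
m = 7

# Python's % operator returns the least non-negative residue for a positive
# modulus, even when the left operand is negative.
add_result = (3 + 5) % m      # modular addition
sub_result = (-3) % m         # modular subtraction, using -r = m - r (mod m)
mul_result = (3 * 4) % m      # modular multiplication


def mod_div(z, y, m):
    """Solve x*y = z (mod m) for x. Only possible when gcd(y, m) == 1."""
    y_inv = pow(y, -1, m)     # raises ValueError when y has no inverse modulo m
    return (z * y_inv) % m
```

Calling `mod_div(5, 4, 7)` recovers the x of the multiplication example, whereas `mod_div(1, 2, 4)` raises `ValueError` because 2 and 4 are not coprime.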
2.5 Euclid's Algorithms 
2.5.1 The Euclidean Algorithm 
The Euclidean algorithm can be used to find the greatest common divisor of two integers. 
It can be explained as follows (for strict proof see for example [24]). 
Given two integers a and b with $a > b$ and $d = \gcd(a, b)$ then we have $d \mid a$ and $d \mid b$, and 
so also $d \mid (a \pm b)$. In fact $d \mid (a - b)$, $d \mid (a - 2b)$, ..., $d \mid (a - nb)$ for arbitrary integer n. 
If now, given the two integers a and b, we want to find $d = \gcd(a, b)$, we can first 
calculate an integer $q = \lfloor a/b \rfloor$ so that 
$$a - qb = r$$ 
where $r \in [0, b-1]$. From the above we know that $d \mid (a - qb)$ and so also $d \mid r$. This means 
that we now have two integers b and r, each respectively less than a and b, for which $b > r$ 
and $d = \gcd(b, r)$. If we substitute these new numbers into a and b so that 
$$a \leftarrow b$$ 
$$b \leftarrow r$$ 
and perform the calculation of q again, we see that we can keep going in this recursive 
fashion with a and b getting smaller with each iteration. If we end the process when r=0 
(before b is assigned the value of r), we will find that b is the greatest common divisor of 
the two original numbers. 
Using the notation that r(i) means the value of r at the end of the i-th iteration, we 
can restate Euclid's algorithm as: 
Algorithm 1 (Euclidean Algorithm) Given two integers a and b then to find $d = \gcd(a, b)$ 
do the following. Let 
$$r(-2) = a, \qquad r(-1) = b$$ 
Then, starting at $i = 0$ let 
$$r(i) = (r(i-2))_{r(i-1)}$$ 
until $r(k) = 0$ for some $k \ge 0$. Then $d = \gcd(a, b) = r(k-1)$. 
Proof : In [24] ∎. 
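Algorithm 1 translates almost line for line into software. The sketch below is illustrative only; the function name `gcd` and the variable names are our own, and the two variables hold r(i-2) and r(i-1) at the start of each iteration.

```python
def gcd(a, b):
    """Greatest common divisor by repeated remaindering (Algorithm 1)."""
    r_prev, r_cur = a, b                        # r(-2), r(-1)
    while r_cur != 0:
        # r(i) = (r(i-2)) reduced modulo r(i-1)
        r_prev, r_cur = r_cur, r_prev % r_cur
    return r_prev                               # r(k-1), the last non-zero remainder
```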
2.5.2 The Extended Euclidean Algorithm 
The extended Euclidean algorithm (sometimes known as the modified Euclidean algorithm) 
can be used to find the multiplicative inverse of an integer with respect to a coprime 
modulus. It will be stated as follows. 
Algorithm 2 (Extended Euclidean Algorithm) Given coprime integers x and m such 
that $x < m$, then to find the integer $x^{-1}$ such that $x \cdot x^{-1} \equiv 1 \pmod{m}$ do the following. 
Let 
$$r(-2) = m, \quad s(-2) = 0, \quad r(-1) = x, \quad s(-1) = 1$$ 
Then, starting at $i = 0$ let 
$$q(i) = \lfloor r(i-2) / r(i-1) \rfloor$$ 
$$r(i) = r(i-2) - q(i) \cdot r(i-1)$$ 
$$s(i) = s(i-2) - q(i) \cdot s(i-1)$$ 
until $r(k) = 0$ for some $k \ge 0$. Then $x^{-1} = (s(k-1))_m$. 
Proof : Can be found in [24] ∎. 
Note that, in the above, $(s(k-1))_m$ just means if $s(k-1) < 0$ then add m to get the 
result into the range $[0, m-1]$. 
The symbol ∎ means `end of proof'. 
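As an illustration of Algorithm 2, the following sketch carries the r and s sequences side by side. The function name `mod_inverse` is hypothetical, and the explicit coprimality check is an addition of ours rather than part of the algorithm as stated.

```python
def mod_inverse(x, m):
    """Multiplicative inverse of x modulo m via the extended Euclidean algorithm."""
    r_prev, r_cur = m, x          # r(-2), r(-1)
    s_prev, s_cur = 0, 1          # s(-2), s(-1)
    while r_cur != 0:
        q = r_prev // r_cur                              # q(i)
        r_prev, r_cur = r_cur, r_prev - q * r_cur        # r(i)
        s_prev, s_cur = s_cur, s_prev - q * s_cur        # s(i)
    if r_prev != 1:
        raise ValueError("x and m are not coprime, so no inverse exists")
    return s_prev % m             # (s(k-1)) reduced into [0, m-1]
```

Throughout the loop the invariant $s(i) \cdot x \equiv r(i) \pmod{m}$ holds, which is why the s value paired with the final remainder of 1 is the inverse.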
2.6 Fundamental Theorems 
The theorems presented in this section are fundamental to the study of number theory, 
and as such their proofs can be found in many text books. The only proof given here is 
for Fermat's Little Theorem, the sole purpose of which is to give an overall impression of 
how theorems can be proven within this subject. 
2.6.1 The Fundamental Theorem of Arithmetic 
The fundamental theorem of arithmetic states that any positive integer can be uniquely expressed 
as a product of powers of distinct prime numbers. Since the expression is unique, it is 
known as the prime factorization of the integer. 
Theorem 1 (Fundamental Theorem of Arithmetic) Any positive integer m can be 
uniquely expressed as 
$$m = \prod_{i=0}^{n-1} p_i^{k_i}$$ 
where $p_i$ are distinct primes, $k_i$ is the power to which each $p_i$ is raised and n is the number 
of distinct primes in the factorization of m. 
Proof : Intuitively `obvious', but proof can be found in [24] ∎. 
2.6.2 Fermat's Little Theorem 
Before presenting Fermat's Little Theorem (FLT for short), we first need to understand 
the following two lemmas. 
Lemma 2 For p a prime and $p \nmid a$, the numbers $a, 2a, 3a, \ldots, (p-1)a$ are all incongruent 
modulo p. 
Proof (by contradiction): Suppose $i \cdot a \equiv j \cdot a \pmod{p}$ for $1 \le j < i \le p-1$, then 
$(i - j) \cdot a \equiv 0 \pmod{p}$ and so, since p is prime, either $p \mid a$ or $p \mid (i - j)$. But the former is 
not possible by choice of a, and the latter is not possible since $0 < (i - j) < p$ ∎. 
Lemma 3 For p a prime and $p \nmid a$, the numbers $a, 2a, 3a, \ldots, (p-1)a$ are all incongruent 
to 0 modulo p. 
Proof (by contradiction): Suppose $k \cdot a \equiv 0 \pmod{p}$ for $1 \le k \le p-1$, then, since p is 
prime, either $p \mid a$ or $p \mid k$. But the former is not possible by choice of a, and the latter is 
not possible since $0 < k < p$ ∎. 
Theorem 4 (Fermat's Little Theorem) For p a prime and a an integer such that $p \nmid a$, 
then 
$$a^{p-1} \equiv 1 \pmod{p}$$ 
Proof: From the above lemmas we know that the numbers $a, 2a, 3a, \ldots, (p-1)a$ are 
all incongruent modulo p and that none of them are congruent to 0 modulo p. This must 
mean that the numbers $\{a, 2a, \ldots, (p-1)a\}$ must be congruent to the numbers $\{1, 2, \ldots, p-1\}$ 
in some order. Therefore 
$$a \cdot 2a \cdot 3a \cdots (p-1)a \equiv 1 \cdot 2 \cdot 3 \cdots (p-1) \pmod{p}$$ 
so 
$$a^{p-1} \cdot (p-1)! \equiv (p-1)! \pmod{p}$$ 
and since $\gcd(p, (p-1)!) = 1$, we can cancel the $(p-1)!$ terms in the above to obtain 
$$a^{p-1} \equiv 1 \pmod{p}$$ 
∎ 
A generalisation of the above can be obtained for any integer a as follows. 
Corollary 5 For p a prime, and a any integer, then $a^p \equiv a \pmod{p}$. 
Proof: If $p \nmid a$ then multiply both sides of FLT by a, else if $p \mid a$ then $a \equiv 0 \pmod{p}$ and the 
answer is trivial ∎. 
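The lemmas and the theorem can be checked numerically for a small prime. The sketch below is only a sanity check under our own naming (`check_fermat` is hypothetical); it is not a proof and it is not a primality test.

```python
def check_fermat(p):
    """Return True if Lemmas 2 and 3, Theorem 4 and Corollary 5 hold for modulus p."""
    for a in range(1, p):                          # every a in [1, p-1]
        multiples = sorted((k * a) % p for k in range(1, p))
        if multiples != list(range(1, p)):         # a, 2a, ..., (p-1)a permute 1..p-1
            return False
        if pow(a, p - 1, p) != 1:                  # Fermat's Little Theorem
            return False
    # Corollary 5 for arbitrary a, including multiples of p
    return all(pow(a, p, p) == a % p for a in range(3 * p))
```

For a composite modulus such as 15 the permutation property already fails (for example with a = 3), so the function returns False.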
2.6.3 The Euler Totient Function 
Let $\Phi(m)$ be the number of integers in the range $[0, m-1]$ that are coprime with the integer 
m; this is the Euler Totient function (sometimes known as the Euler Phi function). It has 
an intimate relationship with the prime factorization of m. 
Theorem 6 (Euler Totient Function) Given positive integer m, the expression for the 
Euler Totient function is 
$$\Phi(m) = \prod_{i=0}^{n-1} p_i^{k_i - 1} (p_i - 1)$$ 
where the $p_i$, $k_i$ and n are as in Theorem 1. 
Proof : can be found in [24] ∎. 
For example, if $m = 539 = 7^2 \cdot 11$ then $\Phi(m) = 7 \cdot (7 - 1) \cdot (11 - 1) = 420$. That is, 
there are 420 numbers less than 539 that are coprime with 539. To be precise, these are the 420 
numbers that are not multiples of 7 or 11. 
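The worked example can be reproduced in two independent ways: from the prime factorization, as in Theorem 6, and by directly counting coprime integers. This is a sketch under our own assumptions; the function names are hypothetical, and trial division is used only because the numbers are tiny (it is hopeless for RSA-sized moduli).

```python
from math import gcd


def factorize(m):
    """Prime factorization by trial division, returned as {p_i: k_i}."""
    factors = {}
    p = 2
    while p * p <= m:
        while m % p == 0:
            factors[p] = factors.get(p, 0) + 1
            m //= p
        p += 1
    if m > 1:                                   # whatever remains is itself prime
        factors[m] = factors.get(m, 0) + 1
    return factors


def totient(m):
    """Euler totient from the product formula of Theorem 6."""
    result = 1
    for p, k in factorize(m).items():
        result *= p ** (k - 1) * (p - 1)
    return result


def totient_by_counting(m):
    """Euler totient straight from the definition."""
    return sum(1 for x in range(m) if gcd(x, m) == 1)
```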
2.6.4 Euler's Theorem 
Euler's theorem is an extension of Fermat's Little Theorem to non-prime moduli. 
Theorem 7 (Euler's Theorem) Given coprime integers x and m, then 
$$x^{\Phi(m)} \equiv 1 \pmod{m}$$ 
Proof : in [24] ∎. 
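A generalisation that considers non-coprime integers x and m can be stated as 
Corollary 8 For positive integers x, m and n, with m square-free (that is, a product of 
distinct primes, as is the case for the RSA moduli used later), $x^{n \cdot \Phi(m) + 1} \equiv x \pmod{m}$ 
Proof : in [24] ∎. 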
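Both statements are easy to test exhaustively for small moduli. The sketch below uses hypothetical helper names and counts the totient by brute force. It also illustrates a caveat worth keeping in mind: the generalisation to non-coprime x relies on m being square-free, and fails for a modulus such as 12.

```python
from math import gcd


def phi(m):
    """Euler totient by direct counting over [0, m-1]."""
    return sum(1 for x in range(m) if gcd(x, m) == 1)


def euler_holds(m):
    """Theorem 7: x^Phi(m) = 1 (mod m) for every x coprime with m."""
    t = phi(m)
    return all(pow(x, t, m) == 1 for x in range(1, m) if gcd(x, m) == 1)


def corollary_holds(m, n=1):
    """Corollary 8: x^(n*Phi(m)+1) = x (mod m) for every x in [0, m-1]."""
    t = phi(m)
    return all(pow(x, n * t + 1, m) == x for x in range(m))
```

With m = 77 = 7 · 11 the corollary holds for every residue, whereas with m = 12 it fails for x = 2, since $2^5 = 32 \equiv 8 \pmod{12}$.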
2.6.5 The Chinese Remainder Theorem 
The Chinese Remainder Theorem (CRT) allows arithmetic operations modulo m to be 
performed using coprime divisors of m. For large m this can significantly speed up modulo 
m operations. 
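Theorem 9 (Chinese Remainder Theorem) Given n pairwise coprime moduli $m_1, m_2, \ldots, m_n$ 
and setting 
$$m = \prod_{i=1}^{n} m_i$$ 
then any integer $x \in [0, m-1]$ can be uniquely expressed as the n-tuple of residues of x 
modulo $m_i$. That is 
$$x \equiv (x_1, x_2, \ldots, x_n) \pmod{m}$$ 
where $x_i = (x)_{m_i}$. 
Furthermore, x can be recovered from its n-tuple using 
$$x = \left( \sum_{i=1}^{n} x_i \cdot m'_i \cdot \left( (m'_i)^{-1} \right)_{m_i} \right)_m$$ 
where 
$$m'_i = \frac{m}{m_i}$$ 
and $(m'_i)^{-1}$ is the multiplicative inverse of $m'_i$ modulo $m_i$. 
Proof : in [24] ∎. 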
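A small numerical sketch may make the reconstruction formula concrete. The function names below are hypothetical, `M_i` plays the role of $m/m_i$, and Python 3.8 or later is assumed for `math.prod` and for modular inverses via `pow`.

```python
from math import prod


def to_residues(x, moduli):
    """Express x as its tuple of residues modulo each m_i."""
    return [x % m_i for m_i in moduli]


def from_residues(residues, moduli):
    """Recover x in [0, m-1] from its residues (moduli must be pairwise coprime)."""
    m = prod(moduli)
    total = 0
    for x_i, m_i in zip(residues, moduli):
        M_i = m // m_i                          # m / m_i
        total += x_i * M_i * pow(M_i, -1, m_i)  # inverse of M_i taken modulo m_i
    return total % m


def crt_multiply(a, b, moduli):
    """Multiply modulo m by working independently in each small modulus."""
    ra, rb = to_residues(a, moduli), to_residues(b, moduli)
    rc = [(u * v) % m_i for u, v, m_i in zip(ra, rb, moduli)]
    return from_residues(rc, moduli)
```

The last function illustrates the speed-up mentioned above: the multiplication is carried out on residues that are each much shorter than m, and only the final recombination works at full width.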
2.7 The RSA Ciphersystem 
The RSA ciphersystem is based upon the generalized version of Euler's Theorem. It relies 
upon the following observation; if e and d are two integers such that 
$$e \cdot d \equiv 1 \pmod{\Phi(m)}$$ 
where m is a positive integer, then we have 
$$e \cdot d = n \cdot \Phi(m) + 1$$ 
for some positive integer n, and so, by Corollary 8 
$$(x^e)^d \equiv x \pmod{m}$$ 
What this equation is saying, is that, given a piece of information, x, this information 
can first be `encrypted' (raised to the power e mod m), and then decrypted (raised to the 
power d mod m), and we will recover the original information, x. Since the integer e is 
the multiplicative inverse of d modulo $\Phi(m)$, and, in general, $e \ne d$, we have effectively 
just created a two-key cryptosystem. 
The significance of the numbers d, e, $\Phi(m)$ and m with regard to security can be 
summarised as follows, 
• integers d and e can only be derived once $\Phi(m)$ is known, and 
• $\Phi(m)$ can only be calculated once the prime factorization of m is known. 
Now, if m is a number such that, using current techniques, it is very hard to factor into its 
prime components, then it will be very difficult to derive integers d and e directly from m. 
The key to using all this in the context of a cryptosystem then, is for the implementor of 
the system to create the number m from the product of a few very large prime numbers. 
If he then creates the numbers $\Phi(m)$, d and e using his knowledge of the prime factors of 
m, and releases the numbers m and e to the public (but keeps the primes, $\Phi(m)$ and d 
secret), then it will be an extremely difficult task for a member of the public to recreate 
either the primes, $\Phi(m)$ or d from just m and e. For suitably large initial primes, the 
problem can be made to be infeasibly difficult. 
2.7.1 Key Generation 
Formalizing the above discussion, an RSA ciphersystem requires both public and secret 
keys - called the RSA key-pair - to encrypt and decrypt plaintext and ciphertext respectively. 
Each key consists of the modulus and an exponent. The public key consists of the 
modulus and the public exponent, while the secret key consists of the modulus and the 
secret exponent. That is 
$$k_p = \{e_p, m\}$$ 
$$k_s = \{e_s, m\}$$ 
where $k_p$ and $k_s$ are the public and secret keys, $e_p$ and $e_s$ are the public and secret 
exponents, and m is the modulus. 
An RSA key-pair is typically generated as follows, 
1. Generate two large primes, p and q, each greater than 75 digits (250 bits) in size, 
then calculate the modulus m, such that 
$$m = p \cdot q$$ 
2. Calculate Euler's Totient $\Phi(m)$, 
$$\Phi(m) = (p - 1) \cdot (q - 1)$$ 
3. Choose the public exponent $e_p$ such that, 
$$e_p \in [0, \Phi(m) - 1]$$ 
$$\gcd(e_p, \Phi(m)) = 1$$ 
4. Calculate the secret exponent $e_s$ given that, 
$$e_p \cdot e_s \equiv 1 \pmod{\Phi(m)}$$ 
using the extended Euclidean algorithm. 
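The four steps can be sketched for toy-sized primes as follows. This is an illustration only: the function name `generate_keypair` is hypothetical, the primes are supplied by the caller rather than generated and tested for primality, and the built-in `pow(e_p, -1, phi)` (Python 3.8 or later) stands in for the extended Euclidean algorithm.

```python
from math import gcd


def generate_keypair(p, q, e_p):
    """Toy RSA key generation from given primes p, q and public exponent e_p."""
    m = p * q                                  # step 1: modulus
    phi = (p - 1) * (q - 1)                    # step 2: Euler totient of m
    if not 0 <= e_p < phi or gcd(e_p, phi) != 1:
        raise ValueError("e_p must lie in [0, Phi(m)-1] and be coprime with Phi(m)")
    e_s = pow(e_p, -1, phi)                    # step 4: e_p * e_s = 1 (mod Phi(m))
    public_key = (e_p, m)
    secret_key = (e_s, m)
    return public_key, secret_key
```

For example, p = 61, q = 53 and e_p = 17 give the modulus 3233 and the secret exponent 2753. Real keys use primes of the sizes stated in step 1, for which these toy numbers are no substitute.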
2.7.2 Secret Information Exchange 
Assume that we have two people, Alice and Bob, and Alice wants to send some secret 
information to Bob. Assume also that Bob has previously generated an RSA key-pair 
such that his public key, $k_p(B) = \{e_p(B), m(B)\}$, has been placed into some public list of 
keys that Alice has access to, and his secret key, $k_s(B) = \{e_s(B), m(B)\}$, is known only 
to Bob himself. Now, if the information that Alice wants to send to Bob is represented by 
the integer I where $I \in [0, m(B) - 1]$, then 
1. Alice calculates encrypted data, 
$$x = \left( I^{e_p(B)} \right)_{m(B)}$$ 
2. Alice sends encrypted data, x, to Bob. 
3. Bob decrypts data, 
$$\begin{aligned}
y &= \left( x^{e_s(B)} \right)_{m(B)} \\
&= \left( I^{e_p(B) \cdot e_s(B)} \right)_{m(B)} \\
&= \left( I^{n \cdot \Phi(m(B)) + 1} \right)_{m(B)} \\
&= (I)_{m(B)} \\
&= I
\end{aligned}$$ 
and Bob now has the information, I. 
Note that for an eavesdropper to recover the information I from x he would either have 
to know the secret exponent $e_s(B)$ or be able to perform the number-theoretic logarithm 
function on x which is similar in difficulty to factoring m. 
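The exchange can be traced with a toy key-pair. The values below (p = 61, q = 53, public exponent 17, secret exponent 2753) are illustrative assumptions and far too small for real use; the helper name `rsa_apply` is hypothetical.

```python
# Bob's toy key-pair: m(B) = 61 * 53, with 17 * 2753 = 1 (mod 60 * 52).
m_B, ep_B, es_B = 3233, 17, 2753


def rsa_apply(value, exponent, modulus):
    """One RSA operation: a modular exponentiation."""
    return pow(value, exponent, modulus)


I = 65                            # the information, an integer in [0, m(B)-1]
x = rsa_apply(I, ep_B, m_B)       # step 1: Alice encrypts with Bob's public key
y = rsa_apply(x, es_B, m_B)       # step 3: Bob decrypts with his secret key
```

After these steps y equals I, and the same holds for every I in the range [0, m(B)-1].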
2.7.3 Authentic Information Exchange 
Assume now that Alice wants to send some information to Bob, but that this time the 
information is not secret. However, Bob wants to be sure that the information originated 
from Alice and not from anyone else. This is called authenticity of information and can 
be achieved as follows. 
If Alice has previously generated a key-pair, $k_p(A)$ and $k_s(A)$, and placed $k_p(A)$ into a 
public registry, then authentic transfer can take place by Alice first `encrypting' the information, 
I, with her secret key. Since no-one else knows Alice's secret key, the ciphertext 
thus produced will be the unique result of Alice and the information. In effect Alice has 
`digitally signed' the document represented by I. Thus 
1. Alice authenticates data, 
$$x = \left( I^{e_s(A)} \right)_{m(A)}$$ 
2. Alice sends authenticated data, x, to Bob. 
3. Bob validates the authentication, 
$$\begin{aligned}
y &= \left( x^{e_p(A)} \right)_{m(A)} \\
&= \left( I^{e_s(A) \cdot e_p(A)} \right)_{m(A)} \\
&= (I)_{m(A)} \\
&= I
\end{aligned}$$ 
2.7.4 Secret and Authentic Information Exchange 
The obvious problem with the method for authenticated information exchange presented 
above, is that, since ep(A) is public, persons other than Bob will be able to read the 
message, I, if they have access to the transported data, x. This can be easily prevented 
by combining the protocols for secret and authentic information exchange. 
Assuming that both Alice and Bob have created and distributed their key-pairs, $k_s(A)$, 
$k_p(A)$, $k_s(B)$ and $k_p(B)$ respectively, then Alice can send secret information to Bob, that 
Bob knows must have originated from Alice, as follows. 
1. Alice authenticates data, 
$$x_1 = \left( I^{e_s(A)} \right)_{m(A)}$$ 
2. Alice encrypts data, 
$$x_2 = \left( x_1^{e_p(B)} \right)_{m(B)}$$ 
3. Alice sends data, $x_2$, to Bob. 
4. Bob decrypts data, 
$$y_2 = \left( x_2^{e_s(B)} \right)_{m(B)}$$ 
5. Bob validates the authentication, 
$$y_1 = \left( y_2^{e_p(A)} \right)_{m(A)} = I$$ 
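The five steps can be traced with two toy key-pairs. All numbers below are illustrative assumptions and far too small for real use. Alice's modulus is deliberately chosen smaller than Bob's so that the authenticated value fits below m(B) without re-blocking; the function names `send` and `receive` are hypothetical.

```python
# Toy key-pairs as (modulus, public exponent, secret exponent).
m_A, ep_A, es_A = 2773, 17, 157      # Alice: 47 * 59, 17 * 157 = 1 (mod 46 * 58)
m_B, ep_B, es_B = 3233, 17, 2753     # Bob:   61 * 53, 17 * 2753 = 1 (mod 60 * 52)


def send(I):
    """Alice's side: authenticate with her secret key, then encrypt for Bob."""
    x1 = pow(I, es_A, m_A)           # step 1
    x2 = pow(x1, ep_B, m_B)          # step 2 (valid because x1 < m_A < m_B)
    return x2                        # step 3: transmitted value


def receive(x2):
    """Bob's side: decrypt with his secret key, then validate with Alice's public key."""
    y2 = pow(x2, es_B, m_B)          # step 4
    y1 = pow(y2, ep_A, m_A)          # step 5
    return y1
```

Had m(A) been larger than m(B), the intermediate value could exceed Bob's modulus and would have to be split into smaller blocks before the second exponentiation.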
2.7.5 Framing 
In practice, the use of RSA to encrypt/authenticate messages of arbitrary length requires a 
process called framing. This process is simply the `slicing' up of messages so that each slice, 
when combined with a small amount of protocol information, forms a `block' which, when 
viewed as representing a number, has magnitude less than that of the modulus being 
used. In the case of secret-only or authentication-only information exchange protocols, 
these `blocks' are then simply enciphered and sent to their destination. In the case of 
the secret-authentic protocol there are two moduli in use, and so the information must be 
re-blocked after the first encipherment to prepare it for the second pass. 
In essence, the sender of encrypted messages must perform framing immediately before 
any RSA exponentiation, whilst the receiver of the messages must perform de-framing 
immediately after any exponentiation. 
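One possible way to slice a message is sketched below. It is purely illustrative and does not describe the framing format of any particular system: the function names are hypothetical, and the only `protocol information' carried is the message length, which the receiver needs in order to restore leading zero bytes and the short final block.

```python
def frame(message, modulus):
    """Slice a byte string into integers that are each smaller than the modulus."""
    block_len = (modulus.bit_length() - 1) // 8      # whole bytes that always fit
    if block_len < 1:
        raise ValueError("modulus too small to carry a whole byte")
    blocks = [int.from_bytes(message[i:i + block_len], "big")
              for i in range(0, len(message), block_len)]
    return len(message), blocks


def deframe(length, blocks, modulus):
    """Rebuild the byte string from its blocks and its original length."""
    block_len = (modulus.bit_length() - 1) // 8
    out = bytearray()
    for block in blocks:
        size = min(block_len, length - len(out))     # the last block may be short
        out += block.to_bytes(size, "big")
    return bytes(out)
```

A block of `block_len` bytes is always below $2^{8 \cdot \text{block\_len}}$, which in turn does not exceed the modulus, so every block is a valid input to the exponentiation.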
2.8 Summary 
This chapter served as a brief introduction to cryptography, covering the definition of a 
ciphersystem and the difference between secret-key and public-key systems. This was fol- 
lowed by an overview of the main theorems and basic algorithms of modular arithmetic, 
introducing the notation that will be used throughout this thesis. Finally, the RSA cipher- 
system was explained including key generation, ciphering for secrecy and authentication 
and the need to frame the plaintext into suitable blocks before performing the modular 
exponentiation cipher operation. 
Chapter 3 
Binary Arithmetic 
In this chapter we will look at algorithms that permit the efficient calculation of long- 
integer arithmetic operations. 
3.1 The Binary Representation of Numbers 
A positive integer, X, may be represented in the binary number system as a k-bit bit-vector 
$$X = [x_{k-1}, x_{k-2}, \ldots, x_0]$$ 
where $x_i \in \{0, 1\}$, and whose value is given by 
$$X = \sum_{i=0}^{k-1} 2^i \cdot x_i \qquad (3.1)$$ 
The right-hand side of the above expression can be thought of as a set of instructions 
for evaluating the integer X given the k-bit vector $[x_{k-1}, x_{k-2}, \ldots, x_0]$. This is therefore 
a kind of algorithm for finding X given its bit-vector representation. 
Two other well-known [25] algorithms for the evaluation of binary integers are discussed 
in the next sections. 
3.1.1 Right-to-Left Binary Evaluation Algorithm 
An iterative algorithm can be defined to determine the value of an integer, X, given its 
bit-vector representation. Using the notation of the previous section, with s(i) and t(i) 
referring respectively to the value of s and t after the i-th iteration, then 
Algorithm 3 Given a positive binary integer, X, and setting 
$$s(0) = 0, \qquad t(0) = 1$$ 
then letting 
$$s(i+1) = s(i) + x_i \cdot t(i)$$ 
$$t(i+1) = 2 \cdot t(i)$$ 
on the k-th iteration we will have $s(k) = X$. 
Proof : By noting that $t(i) = 2^i$ and expanding out the right-hand side of Equation 3.1, 
we have 
$$\begin{aligned}
X &= x_0 \cdot t(0) + x_1 \cdot t(1) + x_2 \cdot t(2) + \cdots + x_{k-2} \cdot t(k-2) + x_{k-1} \cdot t(k-1) \\
&= [x_0 \cdot t(0) + \cdots + x_{k-2} \cdot t(k-2)] + x_{k-1} \cdot t(k-1) \\
&= s(k-1) + x_{k-1} \cdot t(k-1) \\
&= s(k)
\end{aligned}$$ 
∎ 
Note that this is called the right-to-left algorithm because the indexing, $x_i$, of the 
bit-vector starts at 0 and moves up towards k-1. That is from the least-significant- 
bit towards the most-significant-bit which, under normal binary notation, corresponds to 
starting at the right-hand end of the vector and working left. 
3.1.2 Left-to-Right Binary Evaluation Algorithm 
Again using the previous notation, 
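Algorithm 4 Given a positive binary integer, X, set 
$$s(0) = 0$$ 
and let 
$$s(i+1) = 2 \cdot s(i) + x_{k-i-1}$$ 
then on the k-th iteration it can also be shown that $s(k) = X$. 
Proof : By applying the above equation k times, and then expanding out the recursion by 
3 steps, we get 
$$\begin{aligned}
s(k) &= 2 \cdot s(k-1) + x_0 \\
&= 2 \cdot (2 \cdot s(k-2) + x_1) + x_0 \\
&= 2 \cdot (2 \cdot (2 \cdot s(k-3) + x_2) + x_1) + x_0 \\
&= 2^3 \cdot s(k-3) + 2^2 \cdot x_2 + 2^1 \cdot x_1 + 2^0 \cdot x_0
\end{aligned}$$ 
and so we see that, expanding the recursion by $a$ steps gives 
$$s(k) = 2^a \cdot s(k-a) + \sum_{i=0}^{a-1} 2^i \cdot x_i$$ 
if we now set $a = k$, then 
$$\begin{aligned}
s(k) &= 2^k \cdot s(0) + \sum_{i=0}^{k-1} 2^i \cdot x_i \\
&= \sum_{i=0}^{k-1} 2^i \cdot x_i \\
&= X
\end{aligned}$$ 
∎ 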
Similarly, this algorithm is called left-to-right because processing starts at the most- 
significant-bit and proceeds towards the least-significant-bit. 
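Both evaluation orders can be sketched in a few lines. In this illustration (function names are our own) the bit-vector is held as a Python list with `bits[i]` equal to $x_i$, that is, least-significant bit first.

```python
def eval_right_to_left(bits):
    """Algorithm 3: accumulate x_i * 2^i, starting from the least-significant bit."""
    s, t = 0, 1                      # s(0), t(0)
    for x_i in bits:                 # x_0, x_1, ..., x_{k-1}
        s = s + x_i * t
        t = 2 * t                    # t(i) = 2^i
    return s


def eval_left_to_right(bits):
    """Algorithm 4: double and add, starting from the most-significant bit."""
    s = 0                            # s(0)
    for x in reversed(bits):         # x_{k-1}, ..., x_1, x_0
        s = 2 * s + x
    return s
```

The left-to-right form needs no separate power-of-two variable t, which is the property exploited by the multiplication and exponentiation algorithms that follow.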
3.2 Binary Arithmetic 
Having seen how a positive integer can be represented in binary bit-vector form, we now 
look at the fundamental algorithms for performing integer arithmetic on these numbers. 
3.2.1 Addition 
Given two positive k-bit integers, X and Y, whose values are 
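$$X = \sum_{i=0}^{k-1} 2^i \cdot x_i, \qquad Y = \sum_{i=0}^{k-1} 2^i \cdot y_i$$ 
where $x_i, y_i \in \{0, 1\}$. 
Then their sum, Z, can be expressed by 
$$Z = X + Y = \sum_{i=0}^{k} 2^i \cdot z_i$$ 
where $z_i \in \{0, 1\}$ and so Z is a (k + 1)-bit vector. 
The value of each $z_i$ must be calculated from $i = 0$ up to $i = k$ with 
$$z_i = (x_i + y_i + c_{i-1})_2$$ 
where 
$$c_i = \left\lfloor \frac{x_i + y_i + c_{i-1}}{2} \right\rfloor$$ 
and $x_k = y_k = c_{-1} = 0$. 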
If the above addition mechanism is viewed from a hardware perspective, then its implementation 
would take the form of a ripple adder as shown in Figure 3.1. To add two 
k-bit binary numbers the adder employs k 1-bit adders, and it can be seen that the $z_i$ and 
$c_i$ of the above equations correspond to the `sum-out' and `carry-out' signals respectively 
of the i-th adder unit. 
[Figure 3.1: A k-bit ripple adder.] 
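The behaviour of the ripple adder can be modelled bit by bit in software. The sketch below is a behavioural model only, with hypothetical names; bit-vectors are lists with the least-significant bit first, and each loop iteration plays the role of one 1-bit adder.

```python
def ripple_add(x_bits, y_bits):
    """Add two k-bit vectors (least-significant bit first); returns k+1 bits."""
    if len(x_bits) != len(y_bits):
        raise ValueError("operands must have the same number of bits")
    z_bits = []
    carry = 0                            # c_{-1} = 0
    for x_i, y_i in zip(x_bits, y_bits):
        total = x_i + y_i + carry
        z_bits.append(total % 2)         # sum-out   z_i
        carry = total // 2               # carry-out c_i
    z_bits.append(carry)                 # z_k = c_{k-1}, since x_k = y_k = 0
    return z_bits
```

Because each iteration needs the carry produced by the previous one, the loop cannot be parallelised as written, which mirrors the carry propagation delay of the hardware.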
3.2.2 Multiplication 
Given two k-bit integers, X and Y, as in the previous section, their product, Z, can be 
expressed by 
$$Z = X \cdot Y = \sum_{i=0}^{2k-1} 2^i \cdot z_i$$ 
where $z_i \in \{0, 1\}$ and so Z is a (2k)-bit vector. 
The product can be calculated by adding together all of the partial products formed 
by multiplying Y by each weighted bit of X. That is 
$$Z = \sum_{i=0}^{k-1} 2^i \cdot x_i \cdot Y$$ 
and since each $x_i$ is either 0 or 1, this is, in effect, adding together the left-shifted versions 
of Y that correspond to the bit positions where $x_i = 1$. When the multiplication is 
expressed in this way, X is considered to be the multiplier and Y the multiplicand. This 
is shown, for a 4x4-bit multiplication, in Figure 3.2. 
[Figure 3.2: Multiplier partial products. The four rows $2^0 x_0 Y$, $2^1 x_1 Y$, $2^2 x_2 Y$ and $2^3 x_3 Y$, each made up of the bit products $x_i y_j$, are summed to give the product bits $z_7 \ldots z_0$ of Z.] 
On comparison with the right-to-left and left-to-right binary evaluation algorithms of 
Section 3.1, we see that similar iterative algorithms can be found for multiplication. 
Algorithm 5 (Right-to-Left Serial Multiplication) If X and Y are two k-bit binary 
integers and s(i) refers to the value of s after the i-th iteration, then setting 
$$s(0) = 0$$ 
and letting 
$$s(i+1) = s(i) + 2^i \cdot x_i \cdot Y$$ 
then on the k-th iteration we will have $s(k) = X \cdot Y$. 
Proof : Essentially the same as the proof of Algorithm 3. 
$$\begin{aligned}
X \cdot Y &= (2^0 \cdot x_0 + 2^1 \cdot x_1 + 2^2 \cdot x_2 + \cdots + 2^{k-2} \cdot x_{k-2} + 2^{k-1} \cdot x_{k-1}) \cdot Y \\
&= (2^0 \cdot x_0 \cdot Y + \cdots + 2^{k-2} \cdot x_{k-2} \cdot Y) + 2^{k-1} \cdot x_{k-1} \cdot Y \\
&= s(k-1) + 2^{k-1} \cdot x_{k-1} \cdot Y \\
&= s(k)
\end{aligned}$$ 
∎ 
Algorithm 6 (Left-to-Right Serial Multiplication) With X and Y as above, then 
setting 
$$s(0) = 0$$ 
and letting 
$$s(i+1) = 2 \cdot s(i) + x_{k-i-1} \cdot Y$$ 
will give $s(k) = X \cdot Y$. 
Proof : Similar to Algorithm 4. For $0 < a \le k$, then 
$$s(k) = 2^a \cdot s(k-a) + \sum_{i=0}^{a-1} 2^i \cdot x_i \cdot Y$$ 
which, for $a = k$ gives 
$$s(k) = \sum_{i=0}^{k-1} 2^i \cdot x_i \cdot Y = X \cdot Y$$ 
∎ 
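The two serial multiplication algorithms can be sketched as follows. The function names are hypothetical and the helper `to_bits` is our own; k is passed explicitly so that leading zero bits of X are processed as well.

```python
def to_bits(x, k):
    """k-bit vector of x with the least-significant bit first."""
    return [(x >> i) & 1 for i in range(k)]


def mul_right_to_left(X, Y, k):
    """Algorithm 5: add the partial product 2^i * x_i * Y for i = 0 .. k-1."""
    s = 0
    for i, x_i in enumerate(to_bits(X, k)):
        s = s + ((x_i * Y) << i)         # shift Y left by i places when x_i = 1
    return s


def mul_left_to_right(X, Y, k):
    """Algorithm 6: double the running sum, then add Y when the bit is set."""
    s = 0
    for x in reversed(to_bits(X, k)):    # x_{k-1} first
        s = 2 * s + x * Y
    return s
```

In the left-to-right form Y is always added unshifted and only the accumulator is shifted, which becomes important once a modular reduction is added to each step.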
3.2.3 Exponentiation 
Consider the exponentiation of A to the power $E > 0$, so that 
$$D = A^E$$ 
now if E is expressed as a k-bit bit-vector, then its value is given by 
$$E = \sum_{i=0}^{k-1} 2^i \cdot e_i$$ 
and therefore 
$$A^E = A^{2^{k-1} \cdot e_{k-1} + 2^{k-2} \cdot e_{k-2} + \cdots + 2^0 \cdot e_0} \qquad (3.2)$$ 
$$A^E = A^{2^{k-1} \cdot e_{k-1}} \cdot A^{2^{k-2} \cdot e_{k-2}} \cdots A^{2^0 \cdot e_0} \qquad (3.3)$$ 
and the Right-to-Left and Left-to-Right algorithms can be modified to perform exponentiation 
as follows. 
Algorithm 7 (Right-to-Left Exponentiation) Given an integer A and a positive k-bit 
binary integer E, then setting 
$$s(0) = 1, \qquad t(0) = A$$ 
and letting 
$$s(i+1) = s(i) \cdot (t(i))^{e_i}$$ 
$$t(i+1) = (t(i))^2$$ 
results in 
$$s(k) = A^E$$ 
Proof : Noting that $t(i) = A^{2^i}$, we have 
$$s(i+1) = s(i) \cdot A^{2^i \cdot e_i}$$ 
which, when multiplied over $i = 0 \ldots k-1$, gives 
$$s(k) = 1 \cdot A^{2^0 \cdot e_0} \cdot A^{2^1 \cdot e_1} \cdots A^{2^{k-1} \cdot e_{k-1}}$$ 
and so, on comparison with the right-hand-side of Equation 3.3 
$$s(k) = A^E$$ 
∎ 
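Algorithm 8 (Left-to-Right Exponentiation) Given an integer A and a positive k-bit 
integer E, then setting 
$$s(0) = 1$$ 
and letting 
$$s(i+1) = (s(i))^2 \cdot A^{e_{k-i-1}}$$ 
results in 
$$s(k) = A^E$$ 
Proof : Using a similar technique to the proof of Algorithm 4, we have 
$$\begin{aligned}
s(k) &= (s(k-1))^2 \cdot A^{e_0} \\
&= \left( (s(k-2))^2 \cdot A^{e_1} \right)^2 \cdot A^{e_0} \\
&= \left( \left( (s(k-3))^2 \cdot A^{e_2} \right)^2 \cdot A^{e_1} \right)^2 \cdot A^{e_0} \\
&= (s(k-3))^8 \cdot A^{4 \cdot e_2} \cdot A^{2 \cdot e_1} \cdot A^{1 \cdot e_0}
\end{aligned}$$ 
thus in general 
$$s(k) = (s(k-a))^{2^a} \cdot \prod_{i=0}^{a-1} A^{2^i \cdot e_i}$$ 
and setting $a = k$ gives 
$$\begin{aligned}
s(k) &= (s(0))^{2^k} \cdot \prod_{i=0}^{k-1} A^{2^i \cdot e_i} \\
&= \prod_{i=0}^{k-1} A^{2^i \cdot e_i} \\
&= A^E
\end{aligned}$$ 
∎ 
Note that in the above algorithms, the term $A^{e_i}$ is evaluated simply as 
$$A^{e_i} = \begin{cases} 1 & \text{if } e_i = 0 \\ A & \text{if } e_i = 1 \end{cases}$$ 
and so, assuming uniformly distributed k-bit exponents, the average number of multiplications 
required by the algorithms is $3k/2$ (k squarings plus, on average, k/2 multiplications 
for the exponent bits that are set). 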
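The two exponentiation algorithms, together with a count of the multiplications they perform, can be sketched as follows. The function names are hypothetical, and the counter includes every squaring and every multiplication for a set exponent bit, even the trivial ones in which an operand is still 1.

```python
def exp_right_to_left(A, E, k):
    """Algorithm 7. Returns (A**E, number of multiplications performed)."""
    s, t, mults = 1, A, 0
    for i in range(k):                   # e_0 first
        if (E >> i) & 1:
            s = s * t                    # multiply in A^(2^i) when e_i = 1
            mults += 1
        t = t * t                        # t(i+1) = t(i)^2
        mults += 1
    return s, mults


def exp_left_to_right(A, E, k):
    """Algorithm 8. Returns (A**E, number of multiplications performed)."""
    s, mults = 1, 0
    for i in range(k - 1, -1, -1):       # e_{k-1} first
        s = s * s                        # square
        mults += 1
        if (E >> i) & 1:
            s = s * A                    # multiply by A when the bit is set
            mults += 1
    return s, mults
```

Averaging the count over all $2^k$ possible k-bit exponents gives exactly 3k/2 for both functions under this counting convention.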
3.3 Modular Binary Arithmetic 
Modular arithmetic algorithms for addition, multiplication and exponentiation can be 
created by modifying the previous algorithms so that the results are reduced to their least 
non-negative residue modulo N. 
i. e. Addition of two numbers X and Y modulo N, where X, YE [0, N- 1], can be 
performed by adding the X and Y and then subtracting N if the result is greater than or 
equal to N. 
3.3.1 Modular Multiplication 
There are two distinct methods of performing the modular multiplication of two numbers 
X and Y modulo N. 
The first is to compute the product $T = X \cdot Y$ and then reduce T by dividing it by 
N and keeping the remainder. The disadvantage with this technique is that the product, 
T, is twice the size of the arguments X and Y, and thus requires extra storage and 
manipulation space. This is especially critical in hardware with the large word-sizes used 
in RSA calculations. 
The second method involves modifying the previously stated iterative multiplication 
algorithms so that the product $(X \cdot Y)_N$ is computed directly with all intermediate results 
in the range $[0, N-1]$. 
Modifying the Right-to-Left algorithm leads to the following. 
Algorithm 9 (Right-to-Left Modular Multiplication) If X and Y are two positive 
k-bit integers less than N, and s(i) refers to the value of s after the i-th iteration, then 
setting 
$$s(0) = 0$$ 
and letting 
$$s(i+1) = \left( s(i) + 2^i \cdot x_i \cdot Y \right)_N$$ 
then on the k-th iteration we will have $s(k) = (X \cdot Y)_N$. 
Examining the right-hand-side of the s(i + 1) calculation however, reveals the need to 
perform a modular reduction of $2^i \cdot Y$ where $i = 0 \ldots k-1$. Since $2^i \cdot Y$ can be a very large 
integer for high values of i, this calculation involves a division by N, and so therefore is 
no better than the separate multiplication/division approach. 
A much better solution is obtained by modifying the Left-to-Right algorithm. 
Algorithm 10 (Left-to-Right Modular Multiplication) With X, Y and N as above, 
then setting 
$$s(0) = 0$$ 
and letting 
$$s(i+1) = \left( 2 \cdot s(i) + x_{k-i-1} \cdot Y \right)_N$$ 
will give $s(k) = (X \cdot Y)_N$. 
Here we see that the calculation of s(i+1) involves the reduction of a number with upper 
bound less than 3N (since $s(i), Y < N$) for any i. Therefore only N or 2N need be subtracted 
during each iteration of the algorithm, and hence this algorithm lends itself most 
easily to hardware implementations. Indeed, most existing long-word modular multipliers 
use this algorithm as a basis from which to develop more efficient, faster algorithms. 
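The bound of 3N can be seen at work in a software model of Algorithm 10, in which the reduction is done with at most one subtraction of N or 2N and no division. This is a behavioural sketch with hypothetical names, not a description of any particular circuit.

```python
def modmul_left_to_right(X, Y, N, k):
    """Algorithm 10: (X * Y) mod N for X, Y < N, where X has at most k bits."""
    s = 0
    for i in range(k - 1, -1, -1):       # x_{k-1} first
        x = (X >> i) & 1
        s = 2 * s + x * Y                # at most 3N - 3, because s, Y <= N - 1
        if s >= 2 * N:
            s -= 2 * N                   # subtract 2N
        elif s >= N:
            s -= N                       # subtract N
    return s
```

Every intermediate value of s stays within [0, N-1] after the conditional subtraction, so the working register never needs to be more than about two bits wider than N.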
3.3.2 Modular Exponentiation 
Modular exponentiation algorithms can be derived from non-modular exponentiation al- 
gorithms simply by replacing the non-modular multiplications with modular ones. For 
completeness, they are stated below. 
Algorithm 11 (Right-to-Left Modular Exponentiation) Given an integer A, a positive
k-bit integer E and modulus N, then setting
$$s(0) = 1, \quad t(0) = A$$
and letting
$$s(i+1) = \left(s(i) \cdot (t(i))^{e_i}\right)_N$$
$$t(i+1) = \left((t(i))^2\right)_N$$
results in
$$s(k) = (A^E)_N$$
Algorithm 12 (Left-to-Right Modular Exponentiation) Given an integer A, a positive
k-bit integer E and modulus N, then setting
$$s(0) = 1$$
and letting
$$s(i+1) = \left(\left((s(i))^2\right)_N \cdot A^{e_{k-i-1}}\right)_N$$
results in
$$s(k) = (A^E)_N$$
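The two exponentiation algorithms can be sketched side by side as below. The function names are illustrative, and Python's `%` operator stands in for whichever modular multiplier is actually used.

```python
def modexp_right_to_left(a: int, e: int, n: int, k: int) -> int:
    """Algorithm 11: scan the exponent bits e_0 ... e_{k-1}."""
    s, t = 1 % n, a % n
    for i in range(k):
        if (e >> i) & 1:          # multiply in t(i) only when e_i = 1
            s = (s * t) % n
        t = (t * t) % n           # t(i+1) = (t(i)^2) mod n
    return s


def modexp_left_to_right(a: int, e: int, n: int, k: int) -> int:
    """Algorithm 12: scan the exponent bits e_{k-1} ... e_0."""
    s = 1 % n
    for i in range(k):
        s = (s * s) % n           # square first
        if (e >> (k - i - 1)) & 1:
            s = (s * a) % n       # then multiply by A when the bit is set
    return s
```

Both functions assume that E fits in k bits. The right-to-left form needs the extra register t, whereas the left-to-right form always multiplies by the fixed operand A.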
3.4 Summary 
This chapter reviewed the binary representation of positive integers and explored funda- 
mental right-to-left and left-to-right binary evaluation techniques. These techniques were 
subsequently used to derive algorithms for iterative multiplication and exponentiation. 
These algorithms were then modified to perform standard modular operations, in partic- 
ular showing that standard modular multiplication is inherently a left-to-right process. 
Chapter 4 
Arithmetic Hardware 
In this chapter we will look at digital hardware techniques that allow for the efficient
implementation of long-integer arithmetic operations.
4.1 Adders 
There are numerous ways in which the addition of two positive k-bit numbers can be 
performed in hardware. In this section we will look at three of the most common methods 
applicable to long-integer addition, and review their design trade-offs in terms of circuit 
complexity and operational speed. 
4.1.1 Ripple Adder 
The ripple adder is the simplest of adder logic circuits and, as may be expected, also the 
slowest. Two k-bit positive integers can be added with k one-bit full adders as was shown 
in Figure 3.1. The circuit diagram for a one-bit full adder is shown in Figure 4.1. 
Figure 4.1: One-bit full adder (FA).
Letting $\Delta_{FA}$ represent the time required for a full adder to generate both the sum-out
and carry-out signals, then the time needed to add two k-bit numbers will be $k \cdot \Delta_{FA}$.
In most technologies, if the adder is implemented as in Figure 4.1, the time required
to generate the carry-out signal will be slightly less than that required to generate the
sum-out signal. This is because, usually, $\Delta_{XOR} \approx 3 \cdot \Delta_{NAND}$. However, in trying to allow
for both differing technologies and differing adder implementations, it will be assumed
throughout this thesis that $\Delta_{FA}$ applies to both sum-out and carry-out signals and that
$\Delta_{FA} = 2 \cdot \Delta_{XOR}$.
Furthermore, it will be assumed that in any reasonably complex design primitive - 
such as a full adder or flip-flop - there is no penalty involved in taking the inverted value 
of a signal as opposed to its true value. The justification for this approach is that, since 
the implementation of any particular design primitive will vary from one manufacturer's 
technology to another, it can easily happen that a particular signal is generated true in one 
technology and inverted in another. In practice, there is usually a way to speedily generate 
the desired true or inverted signal in any primitive of a few gates or more. This is true 
particularly in CMOS where the basic transistor unit is the p-type/n-type complementary 
pair. 
4.1.2 Carry-Completion Adder 
Although the ripple adder is simple and uses relatively little circuit area, it is very slow. Its 
main drawback is that it must always wait for the worst-case carry propagation from the 
0-th FA to the (k -1)-th FA, irrespective of whether any such carry is actually generated. 
Analysis has shown [25] that the average length of propagated carries when adding two
randomly chosen k-bit integers is of the order of $\log_2 k$. The carry-completion adder
takes advantage of this fact by incorporating extra circuitry within the adder to detect
when all carries have fully propagated. A simple block diagram is shown in Figure 4.2.
With reference to the diagram, the two numbers to be added, X and Y, are placed onto
the inputs of the CP (Carry Propagate) cells where an intermediate carry vector, C, is
generated. A CP cell generates its true and inverted carry-out signals [26] such that the
signals will be mutual inverses only when all of the carries from the preceding CP cells
have propagated past the current cell. Carry propagation is thus monitored by the k OR
gates and, once propagation has ceased, the k-input AND gate goes active indicating that
the first stage of the addition (the generation of C) is complete. The second stage may
now commence, which is simply the bitwise addition (modulo 2) of the vectors X, Y and
C.
In practice, the two stages are allowed to complete simultaneously, and the result
becomes stable at the same time as the AND gate goes active. Assuming the delay of a
CP cell is the same as that of an FA, we have the average time required to add two k-bit
numbers as $\Delta_{FA} \cdot \log_2 k$. Note however that the complexity of this adder is roughly twice
that of the ripple adder.
Figure 4.2: Carry-completion adder. 
4.1.3 Carry-Select Adder 
Although the carry-completion adder is, on average, faster than the ripple adder, it suffers 
from the disadvantage of a variable-length addition time. This makes the control circuitry 
surrounding the adder more complex than it would be for a fixed-time addition. The 
carry-select adder, on the other hand, is a fast, fixed-time adder circuit. 
The carry-select adder works by partitioning the k-bit bit-vectors X and Y into vectors
composed of b-bit sub-blocks, i.e. if $k = l \cdot b$, then using vector notation we have
$$X = [X_{l-1}, X_{l-2}, \ldots, X_0]$$
where now $X_i \in [0, 2^b - 1]$ is a b-bit sub-block. The value of X is given by
$$X = \sum_{i=0}^{l-1} 2^{ib} \cdot X_i$$
Initially, the sub-blocks of each vector are treated as individual sections - ignoring the 
relationship that each sub-block has with its neighbours. Each sub-block of the X vector 
is then involved in two simultaneous additions with its corresponding sub-block in the Y 
vector. One addition is performed with the carry-in of the sub-block adder active, and the 
other with it inactive. Since these additions are performed in parallel for all sub-blocks, 
it is not known at the time of the addition whether the carry input of any particular 
block (except for the lowest block) should be high or low. Thus we have to select, after 
the addition and for each sub-block, which of the two additions is the correct one. This 
selection is based on the carry-out of the previous sub-block addition. Figure 4.3 shows 
a simplified diagram of one sub-block of a carry-select adder. Assuming that each Sub-Block
Adder of Figure 4.3 is implemented as a b-bit ripple adder, then the time needed to
add two $k = l \cdot b$ bit numbers is approximately $b \cdot \Delta_{FA} + l \cdot \Delta_{MUX2TO1}$. The complexity of
the carry-select adder is typically around twice that of the ripple adder [27].
Figure 4.3: Carry-select adder sub-block. 
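The selection process can be sketched as a behavioural model. The function names are illustrative; in hardware the two sub-block additions are performed in parallel for all blocks, and only the multiplexer chain is sequential, whereas this model simply loops over the blocks.

```python
def ripple_add(x: int, y: int, carry_in: int, b: int):
    """b-bit addition; returns (sum mod 2^b, carry_out)."""
    total = x + y + carry_in
    return total & ((1 << b) - 1), total >> b


def carry_select_add(x: int, y: int, k: int, b: int) -> int:
    """Add two k-bit numbers using k/b sub-blocks of b bits each."""
    assert k % b == 0
    mask = (1 << b) - 1
    result, carry = 0, 0
    for blk in range(k // b):
        xb = (x >> (blk * b)) & mask
        yb = (y >> (blk * b)) & mask
        # Both candidate results are formed before the true carry is known.
        sum0, cout0 = ripple_add(xb, yb, 0, b)
        sum1, cout1 = ripple_add(xb, yb, 1, b)
        # 2-to-1 multiplexer driven by the carry-out of the previous block.
        sel_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= sel_sum << (blk * b)
    return result | (carry << k)
```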
4.2 Iterative Multipliers 
With reference to the iterative multiplication algorithms of the previous chapter (Algo- 
rithms 5 and 6) we see that multiplication can be implemented as repeated addition. This 
leads to the generic hardware implementation of a multiplier in Figure 4.4. Here we have 
a simple adder/accumulator circuit, where the product of two positive numbers,
$$X = \sum_{i=0}^{k-1} 2^i \cdot x_i, \quad Y = \sum_{i=0}^{k-1} 2^i \cdot y_i$$
is calculated by the accumulated summation of the partial products $x_i \cdot Y$ for $i = 0 \ldots k-1$.
Figure 4.4: Generic iterative multiplier.
An implementation of the Right-to-Left multiplication algorithm (Algorithm 5), for
example, would require an adder/accumulator configuration that initially resets the
accumulator to zero, then for each clock cycle $i = 0 \ldots k-1$, adds $x_i \cdot Y$ to the contents of
the accumulator right-shifted by one bit position. The lower k bits of the product are the
k bits that were shifted out of the accumulator during cycles 0 to k-1. The upper k bits
are those that are left in the accumulator.
The main problem with this circuit is the time taken to add the partial products $x_i \cdot Y$
during each cycle. Using any of the adders of Section 4.1 would lead to multiplication
times proportional to $k^\alpha$ where $1 < \alpha \le 2$. This is because the addition times of each
of the above methods depend on the length, k, of the operands. To remove this power
relationship on multiplication speed, and make it linear with k, a special multi-operand
adder called a carry-save adder is used.
4.2.1 Carry-Save Adder 
The carry-save adder is a fast, constant-time multi-operand adder suitable for implement- 
ing iterative multipliers. As its name suggests, the carries generated when adding numbers 
are not propagated but are `saved'. In other words, all the partial products of a multi- 
plication can be added in carry-save form, with carry-propagation performed only at the 
end of processing. 
The circuit diagram for a carry-save adder (CSA) is shown in Figure 4.5. The circular
components in this diagram are one-bit full adders and the rectangular components at the
bottom of the diagram represent clocked flip-flops.
Figure 4.5: Carry-save adder.
Thus three k-bit numbers, U, V and W,
$$U = \sum_{i=0}^{k-1} 2^i \cdot u_i, \quad V = \sum_{i=0}^{k-1} 2^i \cdot v_i, \quad W = \sum_{i=0}^{k-1} 2^i \cdot w_i$$
can be added together to form a (k+1)-digit result, where each digit of the result consists
of the two-bit combination $s_i + c_i$. In other words,
$$U + V + W = \sum_{i=0}^{k} 2^i \cdot (s_i + c_i)$$
Another way of looking at the result is as the sum of two distinct bit-vectors: the sum
vector S, and the carry vector C. Thus
$$U + V + W = S + C$$
where
$$S = \sum_{i=0}^{k} 2^i \cdot s_i, \quad C = \sum_{i=0}^{k} 2^i \cdot c_i$$
The big advantage of the CSA is that the addition of U+V+W to produce S and 
C takes a constant time equal to the delay of a single full adder. In other words, addition 
time is unrelated to operand size. 
The CSA can be used iteratively to add multiple numbers by feeding back the S and 
C outputs to the U and V inputs and adding successive bit-vectors via the W inputs. 
Thus on successive clock cycles the accumulated partial result held in the flip-flops at the 
bottom of the diagram is added to a new input operand W and the sum again stored in 
this register. At the end of this multi-operand addition process the S and C vectors can 
be added together, using one of the carry-propagation adders of Section 4.1, so that the 
result will be in conventional binary bit-vector form. 
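A word-level sketch of this behaviour is given below. It assumes that the whole bit-vector is held in a Python integer, so that one call to the illustrative function `csa` models one row of full adders operating in parallel; the names are not taken from the design itself.

```python
def csa(u: int, v: int, w: int):
    """One carry-save step: returns (S, C) with S + C == u + v + w."""
    s = u ^ v ^ w                              # per-bit sum outputs
    c = ((u & v) | (u & w) | (v & w)) << 1     # per-bit carries, weight 2
    return s, c


def multi_operand_sum(operands) -> int:
    """Accumulate many operands in carry-save form, then resolve once."""
    s, c = 0, 0
    for w in operands:
        s, c = csa(s, c, w)    # S and C fed back to the U and V inputs
    return s + c               # single carry-propagate addition at the end
```

For example, `csa(5, 6, 7)` gives `(4, 14)`, and 4 + 14 = 18 = 5 + 6 + 7.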
Using the CSA to implement the Right-to-Left multiplication algorithm, for example,
would require, as stated above, that the partial products $x_i \cdot Y$ be added to the right-shifted
contents of the accumulator. The partial products are generated by the AND gate
circuitry of Figure 4.6. The right-shift can be accomplished efficiently by `hardwiring'
it into the feedback of the S and C vectors. Thus the following equations would describe
the connections from S and C back to U and V,
$$u_i \leftarrow s_{i+1}$$
$$v_i \leftarrow c_{i+1}$$
Figure 4.6: Generating $x_i \cdot Y$.
Note that bits $s_0$ and $c_0$ are effectively shifted out of the accumulator on each iteration of
the algorithm. Since these bits form the least-significant k bits of the result they must be
saved into a k-bit shift register operating alongside the CSA as the calculation progresses.
Moreover, examining the circuit diagram of the CSA shows us that the $c_0$ bit is always
`0'. Therefore only the shifted-out $s_0$ bits need be saved and these directly form the lower
k bits of the result. The upper k bits of the result are formed from the carry-propagated
addition of the S and C vectors.
Using the property that $c_0 = 0$ always, we can perform a k-bit multiplication obtaining
the result in standard binary bit-vector form without performing a post-processing
carry-propagate addition simply by allowing the CSA to cycle through another k iterations
without adding any partial products into the accumulator. This works because the k
extra cycles will produce a 2k-bit vector from the shifted-out $s_0$ bits. Since the product of
two k-bit vectors can be expressed in at most 2k bits, the shifted-out 2k-bit vector
must be this result. Of course, the disadvantage of using this technique is that twice as
many iterations are required to complete the multiplication, but it does have the advantage
of not requiring a carry-propagate adder, and so uses less hardware.
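The 2k-cycle scheme can be checked with a short behavioural model. The sketch assumes that the accumulator is modelled as two Python integers S and C, and that the hardwired right-shift is applied after each addition; the function names are illustrative only.

```python
def csa(u: int, v: int, w: int):
    """One carry-save step: returns (S, C) with S + C == u + v + w."""
    return u ^ v ^ w, ((u & v) | (u & w) | (v & w)) << 1


def csa_multiply(x: int, y: int, k: int) -> int:
    """Right-to-left multiply; the product is read from the shifted-out bits."""
    s = c = 0
    shifted_out = 0
    for i in range(2 * k):
        # Cycles 0..k-1 add x_i * Y; cycles k..2k-1 add nothing (flush).
        w = y if (i < k and (x >> i) & 1) else 0
        s, c = csa(s, c, w)
        # c_0 is always 0, so s_0 alone is the next bit of the product.
        shifted_out |= (s & 1) << i
        s >>= 1     # hardwired right-shift in the feedback path
        c >>= 1
    return shifted_out
```

For example, `csa_multiply(13, 11, 4)` returns 143 without any carry-propagate addition having been performed.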
The time required to perform a multiplication using CSA hardware is the sum of the
times required for the iterative partial product summation and the time required for the
carry-propagated addition of the CSA S and C vectors. The latter is obviously dependent
upon the type of adder used for the carry-propagated addition, but the former can be
approximated by the product of the number of iterations required and the time required
for each iteration. They are, respectively,
Number of iterations = $k$
Iteration time = $\Delta_{AND} + \Delta_{FA} + \Delta_{FF}$
where, in this context, $\Delta_{FF}$ corresponds to the `setup' and delay time required by the
flip-flops that make up the accumulator register.
The circuit complexity can be approximated by
Number of bitslices = $k + 1$
Bitslice complexity = $\Omega_{AND} + \Omega_{FA} + 4 \cdot \Omega_{FF}$
where the $\Omega$ notation measures the gate-count of the design primitives. Note that in the
above bitslice complexity measure, the $4 \cdot \Omega_{FF}$ term reflects the fact that four flip-flops
are required per bitslice for the X, Y, S and C vectors.
4.2.2 High-Radix Iterative Multiplication 
The multipliers of the previous section have been of the radix-2 type, that is, simple
binary multipliers where the creation of partial products has been performed for one bit
of the multiplier operand, X, at a time. High-radix multipliers generate partial products
by looking at b bits of X at a time, and so are radix-$2^b$ multipliers.
A positive k-bit integer X can be viewed as a radix-$2^b$ vector consisting of l blocks of
b bits each, that is, an l-digit vector
$$X = [X_{l-1}, X_{l-2}, \ldots, X_0]$$
where each digit $X_i \in [0, 2^b - 1]$ and $l = \lceil k/b \rceil$. The value of X is given by
$$X = \sum_{i=0}^{l-1} 2^{ib} \cdot X_i$$
Each radix-$2^b$ digit of X can be expressed as
$$X_i = \sum_{j=0}^{b-1} 2^j \cdot X_{i,(j)}$$
where $X_{i,(j)}$ is the j-th bit of the i-th digit and corresponds to the (ib + j)-th bit of X,
that is
$$X_{i,(j)} = x_{ib+j}$$
The product of two radix-$2^b$ vectors, X and Y, can be expressed as
$$X \cdot Y = \sum_{i=0}^{l-1} 2^{ib} \cdot X_i \cdot Y$$
For example, the Right-to-Left multiplication algorithm becomes
$$s(i+1) = s(i) + 2^{ib} \cdot X_i \cdot Y$$
$$\phantom{s(i+1)} = s(i) + 2^{ib} \left(2^{b-1} \cdot X_{i,(b-1)} \cdot Y + 2^{b-2} \cdot X_{i,(b-2)} \cdot Y + \cdots + 2^0 \cdot X_{i,(0)} \cdot Y\right)$$
where, when implemented in CSA hardware, the $2^{ib}$ term on the right-hand side of the
above equation corresponds to a right-shift of the accumulator by b bits, and each partial
product of the form $2^j \cdot X_{i,(j)} \cdot Y$ corresponds to a horizontal line of full adders so that
this partial product may be added to the accumulated product.
For example, with b = 2, the partial products are generated two at a time and a 2-level
CSA network is required to sum these products. This is shown in Figure 4.7 where the
input vectors $W$ and $W_1$ correspond to
$$W = X_{i,(0)} \cdot Y, \quad W_1 = X_{i,(1)} \cdot Y$$
and the feedback from the accumulator outputs to the adder inputs, incorporating a
right-shift by 2 bits, is defined by
$$u_i \leftarrow s_{i+2}$$
$$v_i \leftarrow c_{i+2}$$
Figure 4.7: A 2-level CSA multiplier.
The time required for the summation of the partial products, assuming all $X_{i,(j)}$ bits
are available immediately following the accumulator's active clock edge, is the product of
Number of iterations = $\lceil k/b \rceil$
Iteration time = $\Delta_{AND} + b \cdot \Delta_{FA} + \Delta_{FF}$
The complexity is
Number of bitslices = $k + b$
Bitslice complexity = $b \cdot \Omega_{AND} + b \cdot \Omega_{FA} + 4 \cdot \Omega_{FF}$
Adder Network Optimisation 
For multi-level CSA adder networks there are various methods that can be employed to
speed up the propagation of signals through the adder levels.
One method involves the `collapsing' of successive adder levels into a single, larger
adder, and then performing logic optimization of this larger adder. For example, in a
2-level CSA design, the `upper' and `lower' full adders can be collapsed into a single larger
adder. To see how this works we must think of a one-bit full adder as a special case of
a more general class of adders. The one-bit full adder can be called a 3:2 adder because
it `compresses' three input signals of equal weight down to two output signals of differing
weights. That is, with the input signals denoted as $u_i$, $v_i$ and $w_i$, and the outputs as $s_i$
and $c_{i+1}$, then the adder obeys the following equation
$$u_i + v_i + w_i = 2c_{i+1} + s_i$$
Two 3:2 adders connected as in a 2-level CSA tree can be collapsed down into a single 5:3
adder (also known as a 4:2 compressor or a 4:2 counter) as shown in Figure 4.8. Here
the inputs and outputs obey the equation
$$r_i + s_i + t_i + u_i + c_i = 2c_{i+1} + 2b_{i+1} + a_i$$
Figure 4.8: A 5:3 adder cell.
In the diagram, the 5:3 adder is constructed directly from two 3:2 adders and so obviously
has the same function, but in practice the internals of the 5:3 adder can be redesigned so
that they retain the same function but offer a faster implementation. Various designs are
possible (see [28] and [29]) but the advantages they give in operational speed are generally
offset by the added complexity of the design, e.g. a 50% increase in speed accompanied
by a similar increase in circuit area. Higher order adders can be created in this way (i.e.
7:4, 9:5 etc.) but to be advantageous in a VLSI device would require careful full-custom
design.
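The 5:3 adder equation can be verified exhaustively for a cell built directly from two 3:2 adders. The internal wiring used below (the first adder taking r, s and t, the second taking its sum together with u and the incoming c) is an assumption made for the sketch, since the figure itself is not reproduced here.

```python
def full_adder(x: int, y: int, z: int):
    """3:2 adder: returns (sum, carry) with x + y + z == 2*carry + sum."""
    total = x + y + z
    return total & 1, total >> 1


def adder_5_3(r: int, s: int, t: int, u: int, c_in: int):
    """5:3 adder from two 3:2 adders; returns (a, b_out, c_out)."""
    p, c_out = full_adder(r, s, t)        # first level
    a, b_out = full_adder(p, u, c_in)     # second level
    return a, b_out, c_out


def check_all_inputs() -> bool:
    """Check r+s+t+u+c_in == 2*c_out + 2*b_out + a for all 32 inputs."""
    for n in range(32):
        bits = [(n >> j) & 1 for j in range(5)]
        a, b_out, c_out = adder_5_3(*bits)
        if sum(bits) != 2 * c_out + 2 * b_out + a:
            return False
    return True
```

With this wiring, c_out does not depend on c_in, so the intermediate carry passed between neighbouring cells cannot ripple any further along the row.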
Another method for adder network optimization that is useful for networks of 4 levels
and upwards is the optimization of adder interconnections. For example, with a 4-level
network, Figure 4.9 shows how the interconnections between 3:2 adders can be modified
so that the delay through the network is only $3\Delta_{FA}$ instead of $4\Delta_{FA}$. This type of
optimization can also be applied to networks of 5:3 adders, 7:4 adders etc. when a high
number of levels is to be implemented.
Figure 4.9: Adder interconnect optimization.
The final method of adder network optimization to be examined here is that of a
pipelined implementation of the separate levels. The basic strategy is to break up multi-level
adder networks by placing latches between the levels. This has the effect of dramatically
reducing the time required for each iteration of a multiplication but at the expense of
increasing the hardware complexity and increasing the total number of iterations required.
A conceptual diagram of a pipelined, iterative multiplier is shown in Figure 4.10. In
this diagram the adder networks, as implied by the labelling, consist of simple 1-level 3:2
adders but this need not necessarily be the case. Each network could consist of 2-level
3:2 adders or 1-level 5:3 adders etc. The p pipeline stages are used to accumulate partial
sums of partial products as they proceed from the top to the bottom of the diagram. At
the bottom is an adder/accumulator that is used to sum the partial sums as they arrive.
The adder/accumulator has a built-in hardwired shift as it feeds back its outputs to its
inputs as was the case in the simple CSA multiplier.
Figure 4.10: A pipelined iterative multiplier.
The time required for each iteration of a multiplication is the sum of the times required 
for a single adder network and the latch setup/hold times. The total number of iterations 
required is increased from the simple multi-level adder CSA design because of the need to 
fill/empty the pipeline at the beginning/end of a multiplication. 
For example, consider the case with radix-16 (b = 4) multiplication using a 2-stage 
pipeline plus adder/accumulator. The Right-to-Left algorithm will be used, thus the 
adder/accumulator has a built-in right-shift of 4 bits and the shifted out bits are assumed 
to be saved in a 4-bit wide shift register attached to the accumulator. If we limit the size
of the operands to just $l = 5$ digits then we can follow the process of multiplication as
follows.
Reset: Latch(0) =0 
Latch(1) =0 
Accum/SR =0 
Cycle 0: Latch (0) = 23 " Xo, (3) "Y+ 22 ' Xo, (2) "Y+ 21 " Xo, (1) "Y 
Latch(1) =0 
Accum/SR =0 
Cycle 1: Latch(0) = 23 " X1, (3) "Y+ 22 " Xi, (2) "Y+ 21 " Xl, (l) "Y 
Latch(1) = 23 " Xo, (3) "Y+ 22 " Xo, (2) "Y+ 21 " Xo, (1) "Y+ 2° " Xo, (o) "Y= Xo "1 
Accum/SR =0 
48 
Cycle 2: Latch(0) = 23 " X2, (3) "Y+ 22 " X2, (2) "Y+ 21 " X2, (1) -Y 
Latch(1) = 23 "Xß, (3) "Y+ 22 " Xl, (2) "Y+ 21 "X1, (1) -Y+ 2° " Xl, (o) -Y= Xt -Y 
Accum/SR = 216 " Xo "Y 
Cycle 3: Latch(0) = 23 ' X3, (3) "Y+ 22 " X3, (2) 'Y+ 21 ' X3, (1) "Y 
Latch(i) = 23 ' X2, (3) "Y+ 22 " x'2, (2) "Y+ 21 " X2, (i) -Y+ 2° " X2, (°) 'Y= X2 'Y 
Accum/SR = 216 " Xl "Y+212 " Xo -Y 
Cycle 4: Latch (0) = 23 * X4, (3) 'Y+ 22 ' 
X4, (2) -Y+ 21 -X4, (1) -Y 
Latch(l) = 23 . X3, (3) -Y+ 22 ' X3, (2) -Y + 21 - X3, (1) -Y+ 2° - 
X3, (U) -Y= X3 -Y 
Accum/SR = 216"X2"Y+212"Xl"Y+28"Xo"Y 
Cycle 5: Latch(0) =0 
Latch(1) = 23"X4, (3)"Y'+'22"X4, (2)"Y+2'-X4, (1)"Y+2°"X4, (O)"Y-X4"Y 
Accum/SR = 216. X3, Y+212, X2, Y+28, X1 . Y+24 "Xo"Y 
Cycle 6: Latch(0) =0 
Latch(1) =0 
Accum/SR = 2'6. X4. Y+212-X3"Y+28"X2. Y+24. X1 -+2°. X0 Y 
and so at the end of processing the accumulator/shift-register contains 
º-i 
E2ib. Xi, Y_X"Y 
iO 
In this example with each adder network being 1-level 3:2 adders, then
Number of iterations = $\lceil k/b \rceil + b - 3$
Iteration time = $\Delta_{AND} + \Delta_{FA} + \Delta_{FF}$
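The cycle-by-cycle trace above can be reproduced with a small register-level model. This is a sketch only: each latch is modelled as an ordinary integer rather than as a carry-save pair, and the names `pipelined_multiply` and `pending_low` are invented for the illustration.

```python
def pipelined_multiply(x: int, y: int, l: int, b: int = 4) -> int:
    """Model of the 2-stage pipeline plus accumulator/shift register."""
    digits = [(x >> (b * i)) & ((1 << b) - 1) for i in range(l)]
    latch0 = latch1 = acc = 0
    pending_low = 0   # 2^0 partial product waiting to join in stage 1
    for cycle in range(l + 2):
        # All registers are clocked together, so compute from old values.
        new_acc = (acc >> b) + (latch1 << (b * (l - 1)))
        new_latch1 = latch0 + pending_low          # now holds X_i * Y
        if cycle < l:
            d = digits[cycle]
            new_latch0 = (d & ~1) * y              # bits b-1..1 of the digit
            pending_low = (d & 1) * y              # bit 0 of the digit
        else:
            new_latch0, pending_low = 0, 0         # pipeline is draining
        latch0, latch1, acc = new_latch0, new_latch1, new_acc
    return acc
```

With l = 5 the loop runs for cycles 0 to 6, matching the trace, and the returned value is the full product X·Y.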
4.3 Multiplier Recoding 
In the previous section multipliers have been implemented by adding partial products
generated by terms of the form $X_{i,(j)} \cdot Y$ where $X_{i,(j)} \in \{0, 1\}$ as was shown in Figure 4.6.
But in digital hardware it is relatively easy to generate numbers of the form $2^\beta \cdot Y$ for
$\beta > 0$ by a simple left-shift of Y by $\beta$ bits. This fact can be taken advantage of in string
recoding theory [30].
First-order string recoding is the transformation of the bit-vector
$$X = [x_{k-1}, x_{k-2}, \ldots, x_0]$$
where $x_i \in \{0, 1\}$, to the representation
$$X' = [x'_k, x'_{k-1}, \ldots, x'_0]$$
where $x'_i \in \{-1, 0, 1\}$, such that the values of X and X' are the same, i.e.
$$\sum_{i=0}^{k-1} 2^i \cdot x_i = \sum_{i=0}^{k} 2^i \cdot x'_i \qquad (4.1)$$
This can be accomplished by a mapping of the bits of X to the digits of X' governed
by the following equation,
$$x'_i = -x_i + x_{i-1}$$
for $i = 0 \ldots k$ with $x_{-1} = x_k = 0$. That this produces a vector X' of the same value as X
can be seen by substituting the definition of $x'_i$ into the right-hand side of Equation 4.1
and simplifying.
Since each bit of X is in {0, 1}, the mapping is equivalent to
  $x'_i$   $x_i$   $x_{i-1}$   Reason
    0       0        0         No string
    1       0        1         End of string
   -1       1        0         Beginning of string
    0       1        1         Center of string
with the assumption that $x_{-1} = x_k = 0$.
The comments on the right-hand side refer to `strings' of 1's appearing in the bit-vector
representation of X. If the above mapping is applied while scanning X from right-to-left,
then the first 1-bit in a sequence of 1's will reveal itself with $x_i = 1$ and $x_{i-1} = 0$, i.e. the
beginning of a string of 1's. (Note that scanning starts with i = 0 and $x_{-1} = 0$.)
For example, with k = 8 and an arbitrary bit-vector X, then
  i:        8    7    6    5    4    3    2    1    0   -1
  $x_i$:    0    1    0    1    1    1    0    0    1    0
  $x'_i$:   1   -1    1    0    0   -1    0    1   -1
and to check we have
$$X = 2^7 + 2^5 + 2^4 + 2^3 + 2^0 = 185$$
$$X' = 2^8 - 2^7 + 2^6 - 2^3 + 2^1 - 2^0 = 185$$
Note that the mapping does not have to be performed serially from right-to-left. Since
each digit $x'_i$ depends only on bits $x_i$ and $x_{i-1}$, the mapping can equally well be applied
while scanning from left-to-right or even applied to all digits in parallel.
So, we now have a (k + 1)-digit vector with each digit in {-1, 0, 1}. At first sight this
does not seem too promising; it looks as though we have just over-complicated things. But
the vector does have one interesting property: no two adjacent non-zero digits have the
same sign. This can be seen to be true directly from the mapping table. This property is
used to advantage when second-order string recoding is performed.
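The first-order mapping is easily expressed in software. The sketch below uses an illustrative function name and returns the digits least-significant first, which is an arbitrary choice made for the example.

```python
def first_order_recode(x: int, k: int):
    """Return [x'_0, ..., x'_k] where x'_i = -x_i + x_{i-1}."""
    def bit(i: int) -> int:
        # x_{-1} and x_k are taken to be 0, as in the text.
        return (x >> i) & 1 if 0 <= i < k else 0
    return [bit(i - 1) - bit(i) for i in range(k + 1)]


def digits_value(digits, radix: int = 2) -> int:
    """Value of a digit vector given least-significant digit first."""
    return sum(d * radix**i for i, d in enumerate(digits))
```

For the example above, `first_order_recode(185, 8)` returns `[-1, 1, 0, -1, 0, 0, 1, -1, 1]`, which is the vector X' read from $x'_0$ up to $x'_8$, and `digits_value` of that list is 185.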
A second-order recoding can be performed by examining the first-order recoded vector
X' to generate the second-order recoded vector X''. It should be noted that X'' is a
radix-4 vector of approximately half the length of the radix-2 vector X'.
Explicitly, each digit $x''_i$ of X'' is generated by the equation
$$x''_i = 2 \cdot x'_{2i+1} + x'_{2i}$$
for $i = 0 \ldots \lceil (k+1)/2 \rceil - 1$. Since no two adjacent non-zero digits of X' are of the same sign,
the possible combinations of $x'_{2i+1}$ and $x'_{2i}$ that generate $x''_i$ are as follows.
  $x''_i$   $x'_{2i+1}$   $x'_{2i}$
     0          0            0
     1          0            1
    -1          0           -1
     2          1            0
     1          1           -1
    -2         -1            0
    -1         -1            1
and so we see that $x''_i \in \{-2, -1, 0, 1, 2\}$.
That X'' has equivalent value to X' can be seen directly from the equation governing
its construction. Thus we have created a second-order recoded vector X'' of the same
value as X but of approximately half the length of X with digits in the extended range of
{-2, -1, 0, 1, 2}. It is also possible to construct X'' from X in a parallel fashion since
$$x''_i = 2 \cdot x'_{2i+1} + x'_{2i}$$
$$\phantom{x''_i} = 2(-x_{2i+1} + x_{2i}) + (-x_{2i} + x_{2i-1})$$
$$\phantom{x''_i} = -2 \cdot x_{2i+1} + x_{2i} + x_{2i-1}$$
and so we see that each digit of X'' depends only on 3 adjacent bits of X.
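The parallel form of the second-order recoding can be sketched directly from the last equation. The function name is illustrative, and the digits are again returned least-significant first.

```python
def second_order_recode(x: int, k: int):
    """Return [x''_0, x''_1, ...] with x''_i = -2*x_{2i+1} + x_{2i} + x_{2i-1}."""
    def bit(i: int) -> int:
        # Bits outside 0..k-1 are taken to be 0.
        return (x >> i) & 1 if 0 <= i < k else 0
    length = (k + 2) // 2          # ceil((k + 1) / 2) radix-4 digits
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range(length)]
```

For X = 185 and k = 8 this gives `[1, -2, 0, -1, 1]`, and indeed 1 - 2·4 + 0·16 - 1·64 + 1·256 = 185.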
To multiply two positive numbers X and Y using the second-order recoded version of
X, we can write
$$X \cdot Y = \sum_{i=0}^{l-1} 2^{ib} \cdot x(i) \cdot Y$$
where b = 2 and $l = \lceil (k+1)/b \rceil$ and x(i) is the i-th element of the recoded X vector in
the range {-2, -1, 0, 1, 2}.
On comparison with the un-recoded (b = 2) multiplication method where we had
$$X \cdot Y = \sum_{i=0}^{l-1} 2^{ib} \cdot X_i \cdot Y$$
the main difference is that $X_i \in \{0, 1, 2, 3\}$, and we can see that it is the generation of
$3 \cdot Y$ that necessitates the use of a 2-level adder network because this multiple must be
generated by the separate sub-multiples of $1 \cdot Y$ plus $2 \cdot Y$. But if we use the recoded X
vector, then since we know that multiples of 2 can be generated by a simple left-shift, it
should be possible to implement a b = 2 multiplier with only a single level adder network.
All that is needed to do this is some means of implementing signed numbers in hardware.
This is the subject of the next section.
For the moment, assuming such a signed number system is available, we can see how
high-radix multipliers using recoding techniques can be constructed using only half the
number of adder levels that were required in the previous section.
A radix-$2^b$ (b a multiple of 2) multiplier can be constructed by again writing
$$X \cdot Y = \sum_{i=0}^{l-1} 2^{ib} \cdot x(i) \cdot Y$$
but this time with
$$x(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot x_j(i)$$
where $x_j(i)$ denotes the $(i \cdot b/2 + j)$-th digit of the second-order recoded X vector with each
digit in the range {-2, -1, 0, 1, 2}. Then there are b/2 adder levels, each level adding the
partial product $x_j(i) \cdot Y$ to the accumulated result.
For example, with b = 4,
$$X \cdot Y = \sum_{i=0}^{l-1} 2^{4i} \left(4 \cdot x_1(i) \cdot Y + x_0(i) \cdot Y\right)$$
is constructed with a 2-level adder. The first level adds multiples 0, ±Y, ±2Y, and the
second level adds multiples 0, ±4Y, ±8Y.
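The b = 4 case can be sketched arithmetically as follows. Ordinary signed Python integers stand in for the signed number system that the hardware would need, and both function names are assumptions made for the example.

```python
def second_order_recode(x: int, k: int):
    """Radix-4 digits in {-2..2}, least-significant first."""
    def bit(i: int) -> int:
        return (x >> i) & 1 if 0 <= i < k else 0
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range((k + 2) // 2)]


def recoded_multiply_radix16(x: int, y: int, k: int) -> int:
    """Radix-16 multiply using two recoded digits (two adder levels) per step."""
    d = second_order_recode(x, k)
    if len(d) % 2:
        d.append(0)                       # pad to a whole number of pairs
    acc = 0
    for i in range(len(d) // 2):
        level1 = d[2 * i] * y             # one of 0, +/-Y, +/-2Y
        level2 = 4 * d[2 * i + 1] * y     # one of 0, +/-4Y, +/-8Y
        acc += (level1 + level2) << (4 * i)
    return acc
```

Only shifted and negated copies of Y are ever added, so no multiple such as 3Y has to be formed by a separate addition.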
4.4 Signed Number Representations 
As was stated in the previous section, to enable multiplier recoding techniques to be used
for the construction of efficient iterative multipliers requires a means of representing both
positive and negative numbers in hardware. In this section we will examine various signed
number representations.
The three most common ways of representing signed numbers in digital hardware are
the 2's complement, 1's complement and sign-magnitude methods. In the following
subsections X is assumed to be a k-bit bit-vector, $X = [x_{k-1}, x_{k-2}, \ldots, x_0]$ where $x_i \in \{0, 1\}$.
The procedures for calculating the value of X based on this bit-vector representation are
given for each of the above signed number encoding schemes.
4.4.1 2's Complement Representation 
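The value of X is given by
$$X = -2^{k-1} \cdot x_{k-1} + \sum_{i=0}^{k-2} 2^i \cdot x_i$$
and thus the range of values that X may take on is
$$-2^{k-1} \le X \le 2^{k-1} - 1$$
To negate a 2's complement bit-vector (form the negative of the number), we use the
well-known `invert and add one' technique, ignoring any carry from the (k - 1)-th bit. A
proof of this technique is as follows.
Using the bit-inversion relation $\bar{x}_i = 1 - x_i$, and if X' is the result of inverting X and
adding one, i.e. $X' = [\bar{x}_{k-1}, \bar{x}_{k-2}, \ldots, \bar{x}_0] + 1$, then we have
$$X' = -2^{k-1} \cdot (1 - x_{k-1}) + \sum_{i=0}^{k-2} 2^i \cdot (1 - x_i) + 1$$
$$\phantom{X'} = \left(-2^{k-1} + \sum_{i=0}^{k-2} 2^i + 1\right) + \left(2^{k-1} \cdot x_{k-1} - \sum_{i=0}^{k-2} 2^i \cdot x_i\right)$$
$$\phantom{X'} = \left(-2^{k-1} + (2^{k-1} - 1) + 1\right) + (-1) \cdot \left(-2^{k-1} \cdot x_{k-1} + \sum_{i=0}^{k-2} 2^i \cdot x_i\right)$$
$$\phantom{X'} = (0) + (-1) \cdot (X)$$
$$\phantom{X'} = -X$$
and thus $[\bar{x}_{k-1}, \bar{x}_{k-2}, \ldots, \bar{x}_0] + 1 = -X$.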
To sign-extend a 2's complement bit-vector (convert from a k-bit bit-vector to an n-bit
bit-vector where n > k), we can use the following mapping. Given $X = [x_{k-1}, x_{k-2}, \ldots, x_0]$
and $X' = [x'_{n-1}, x'_{n-2}, \ldots, x'_0]$, n > k, then
$$x'_i = \begin{cases} x_i & \text{for } i = 0 \ldots k-2 \\ x_{k-1} & \text{for } i = k-1 \ldots n-1 \end{cases}$$
This obviously works when $x_{k-1} = 0$. To see how it works when $x_{k-1} = 1$, we note that
what we are trying to do is to make a contribution of $-2^{k-1}$ to the X' vector. To see how
this happens we simply write
$$-2^{k-1} = -2^k + 2^{k-1}$$
$$\phantom{-2^{k-1}} = -2^{k+1} + 2^k + 2^{k-1}$$
$$\phantom{-2^{k-1}} = -2^{n-1} + 2^{n-2} + \cdots + 2^{k-1}$$
and note that this is exactly what happens with the upper n-k bits in the above mapping.
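The value, negation and sign-extension rules for 2's complement can be exercised with a short sketch. Bit-vectors are held as Python lists with $x_0$ first, which is a convention chosen for the example, and all function names are illustrative.

```python
def twos_encode(v: int, k: int):
    """k-bit 2's complement bits of v, least-significant first."""
    return [(v >> i) & 1 for i in range(k)]


def twos_value(bits) -> int:
    """-2^(k-1) * x_{k-1} plus the weighted sum of the lower bits."""
    k = len(bits)
    low = sum(b << i for i, b in enumerate(bits[:-1]))
    return -(1 << (k - 1)) * bits[-1] + low


def twos_negate(bits):
    """Invert every bit and add one, dropping the carry out of bit k-1."""
    k = len(bits)
    raw = sum((1 - b) << i for i, b in enumerate(bits)) + 1
    return [(raw >> i) & 1 for i in range(k)]


def sign_extend(bits, n: int):
    """Copy x_{k-1} into positions k-1 ... n-1."""
    return bits + [bits[-1]] * (n - len(bits))
```

Note that the most negative value, $-2^{k-1}$, has no positive counterpart in k bits, so negating it returns the same bit-vector.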
4.4.2 1's Complement Representation 
The value of a k-bit 1's complement vector is given by
$$X = (-2^{k-1} + 1) \cdot x_{k-1} + \sum_{i=0}^{k-2} 2^i \cdot x_i$$
and so the range of values that X may take on is
$$-(2^{k-1} - 1) \le X \le 2^{k-1} - 1$$
Negation of a 1's complement bit-vector is performed by simply inverting every bit.
That is
$$-X = [\bar{x}_{k-1}, \bar{x}_{k-2}, \ldots, \bar{x}_0]$$
Sign-extension is the same as for 2's complement.
Both of these assertions can be proved in a similar manner to the 2's complement
proofs and are omitted here.
4.4.3 Sign-Magnitude Representation 
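The value of a k-bit sign-magnitude bit-vector is given by
$$X = (-1)^{x_{k-1}} \cdot \sum_{i=0}^{k-2} 2^i \cdot x_i$$
and so its range is
$$-(2^{k-1} - 1) \le X \le 2^{k-1} - 1$$
Negation of a sign-magnitude number is obviously performed by inverting the sign bit.
That is
$$-X = [\bar{x}_{k-1}, x_{k-2}, \ldots, x_0]$$
Sign-extension from $X = [x_{k-1}, x_{k-2}, \ldots, x_0]$ to $X' = [x'_{n-1}, x'_{n-2}, \ldots, x'_0]$, n > k, is
given by
$$x'_i = \begin{cases} x_i & \text{for } i = 0 \ldots k-2 \\ 0 & \text{for } i = k-1 \ldots n-2 \\ x_{k-1} & \text{for } i = n-1 \end{cases}$$
and is obvious.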
Upon examination of the above three methods of signed number representation we can 
see that they all share a common feature: they all use a special bit, $x_{k-1}$, to determine the 
sign of the number. When trying to implement an iterative multiplier using these signed 
number systems in a CSA-type architecture, this special bit causes major problems. This is 
because, as we have seen, iterative multipliers require a shift operation to be built into the 
adder/accumulator circuit. When dealing with signed number systems this shift operation 
is basically a sign-extension operation and, as we have seen above, to be able to sign-extend 
a number using any of these representations requires knowledge of the special sign-bit. The 
crux of the matter is that this special sign bit is the top bit of each bit-vector and can only 
be accurately determined by the fully carry-propagated additions/subtractions. Since the 
CSA architecture prohibits these propagations, there is no way to determine the sign of any 
partial results held in the accumulator during processing. Thus the above representations 
cannot be used within a CSA architecture. 
What is needed is a method of signed number representation that does not rely on any 
special sign bit. Such a method is the signed-digit representation. 
4.4.4 Signed-Digit Representation 
A signed-digit number is one in which each digit of the number can take on positive and
negative values. In general
$$A = [\alpha_{k-1}, \alpha_{k-2}, \ldots, \alpha_0]$$
is a k-digit vector whose value is given by
$$A = \sum_{i=0}^{k-1} r^i \cdot \alpha_i$$
where each $\alpha_i$ can take on values in the range
$$\alpha_i \in [-\gamma, \gamma]$$
for some fixed $\gamma$ in the range
$$\lceil (r-1)/2 \rceil \le \gamma \le r - 1$$
For example, with r = 4 and $\gamma = 3$, we have
$$A = \sum_{i=0}^{k-1} 2^{2i} \cdot \alpha_i$$
where $\alpha_i \in \{-3, -2, -1, 0, 1, 2, 3\}$ and so A has range
$$-(2^{2k} - 1) \le A \le 2^{2k} - 1$$
The negation of a signed-digit vector is simply the negation of each digit, i.e.
$$-A = [-\alpha_{k-1}, -\alpha_{k-2}, \ldots, -\alpha_0]$$
and `sign-extension' from $A = [\alpha_{k-1}, \alpha_{k-2}, \ldots, \alpha_0]$ to $A' = [\alpha'_{n-1}, \alpha'_{n-2}, \ldots, \alpha'_0]$ for n > k
simply involves filling the n-k upper digits of A' with zeros. That is
$$\alpha'_i = \begin{cases} \alpha_i & \text{for } i = 0 \ldots k-1 \\ 0 & \text{for } i = k \ldots n-1 \end{cases}$$
The addition of two signed-digit vectors, A and B, can be performed with carry-propagation
limited to one position to the left as follows. With $A = [a_{k-1}, a_{k-2}, \ldots, a_0]$
and $B = [\beta_{k-1}, \beta_{k-2}, \ldots, \beta_0]$ and their sum
$S = A + B = [s_k, s_{k-1}, \ldots, s_0]$, the
addition proceeds with the generation of intermediate sum digits $w_i$ and transfer digits
$t_{i+1}$ obeying the relation
$$a_i + \beta_i = r \cdot t_{i+1} + w_i$$
and is completed by the addition of appropriate intermediate sum and transfer digits
$$s_i = w_i + t_i$$
Note that the transfer digit generated in the first equation is in effect a limited carry from
the i-th adder digit to the (i + 1)-th digit. The transfer digit in the second equation is the
limited carry that was generated in the (i - 1)-th digit position. Thus the transfer digits
are actually carries whose propagation is limited to one place to the left.
In order to ensure that this limited carry-propagation/transfer-digit scheme works, the
possible values that $w_i$ and $t_i$ can assume must be limited so that
$$-\gamma \le s_i = w_i + t_i \le \gamma$$
For example, with $r = 4$ and $\gamma = 3$, then restricting
$$w_i \in \{-2, -1, 0, 1, 2\}, \quad t_i \in \{-1, 0, 1\}$$
ensures that $-3 \le w_i + t_i \le 3$. In particular, Table 4.1 shows just one method of generating
the transfer digit and partial sum based on the sum of the input digits $a_i + \beta_i$.
    a_i + β_i    t_{i+1}    w_i
       -6          -1       -2
       -5          -1       -1
       -4          -1        0
       -3          -1        1
       -2           0       -2
       -1           0       -1
        0           0        0
        1           0        1
        2           0        2
        3           1       -1
        4           1        0
        5           1        1
        6           1        2
Table 4.1: Signed-digit addition; transfer and intermediate sum digits.

Although this general signed-digit approach does have simple number manipulation
properties and a fast addition scheme, it suffers from a slight over-complexity [26] when
it comes to implementing the addition scheme in hardware. A better method for using
signed-digit numbers in high-performance hardware is shown in the next section.
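
As an illustration only, the radix-4 transfer-digit addition described above can be sketched in software. The function names `sd_add` and `sd_value` are hypothetical, digits are held least-significant first, and the thresholds follow the rule of Table 4.1 (a transfer of ±1 whenever the digit sum reaches ±3); this is a behavioural model, not a description of any hardware in the thesis.

```python
R = 4          # radix used in the worked example
GAMMA = 3      # digits lie in {-3, ..., 3}

def sd_add(a, b):
    """Add two radix-4 signed-digit vectors given least-significant digit first."""
    assert len(a) == len(b)
    k = len(a)
    w = [0] * (k + 1)  # intermediate sum digits w_i, each in {-2, ..., 2}
    t = [0] * (k + 1)  # transfer digits t_i, each in {-1, 0, 1}; t_0 = 0
    for i in range(k):
        p = a[i] + b[i]              # digit sum, in {-6, ..., 6}
        if p >= 3:
            t[i + 1] = 1
        elif p <= -3:
            t[i + 1] = -1
        w[i] = p - R * t[i + 1]      # a_i + b_i = r * t_{i+1} + w_i
    # s_i = w_i + t_i needs only positions i and i-1, so no carry ripples further.
    return [w[i] + t[i] for i in range(k + 1)]

def sd_value(digits):
    """Integer value of a signed-digit vector (least-significant digit first)."""
    return sum(d * R**i for i, d in enumerate(digits))

a = [3, -2, 1]    # 3 - 8 + 16 = 11
b = [2, 3, -3]    # 2 + 12 - 48 = -34
s = sd_add(a, b)
assert sd_value(s) == sd_value(a) + sd_value(b)
```

Each output digit depends on only two adjacent input positions, which is the property that makes the addition time independent of k.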
4.4.5 Redundant Signed-Digit (RSD) Representation 
The redundant signed-digit representation of signed numbers combines the flexibility of 
using a signed-digit approach with the simple hardware of CSA adders. 
A k-digit RSD vector X, written as
$$X = [x_{k-1}, x_{k-2}, \ldots, x_0]$$
has digits, $x_i$, in the range
$$x_i \in \{-1, 0, 1\}$$
and its value is given by
$$X = \sum_{i=0}^{k-1} 2^i \cdot x_i$$
In a manner similar to the CSA approach, RSD vectors can be viewed as the difference
(not sum) of two distinct vectors, such that
$$x_i = x_i^+ - x_i^-$$
where
$$x_i^+, x_i^- \in \{0, 1\}$$
and so the value of X is given by
$$X = \sum_{i=0}^{k-1} 2^i \cdot (x_i^+ - x_i^-)$$
or alternatively
$$X = X^+ - X^-$$
where
$$X^+ = \sum_{i=0}^{k-1} 2^i \cdot x_i^+, \qquad X^- = \sum_{i=0}^{k-1} 2^i \cdot x_i^-$$
The heart of the RSD approach is in the generalisation of the 3: 2 adder to a class of 
General Full Adders (GFA). Each member of the class can add a particular combination 
of positively and negatively weighted bits. Figure 4.11 shows the circuit symbols of these 
GFAs and their functions are summarized as follows. 
[Figure 4.11: General full adders (circuit symbols for Types 0 to 3).]

• Type 0: $u_i + v_i + w_i = 2c_{i+1} + s_i$
• Type 1: $-u_i + v_i + w_i = 2c_{i+1} - s_i$
• Type 2: $-u_i - v_i + w_i = -2c_{i+1} + s_i$
• Type 3: $-u_i - v_i - w_i = -2c_{i+1} - s_i$

Note that the standard one-bit full adder is classified as type 0, and that the `bubbles' 
on certain GFA inputs and outputs are actually implemented in hardware as inverters 
and they signify which of the inputs/outputs are negatively weighted. Also note that, in 
CMOS VLSI technology, inverters usually `come-for-free', meaning that taking the true or 
inverted value of a signal in a CMOS circuit normally just means connecting to a different 
point within the circuit. This is simply a consequence of the CMOS transistor-unit being 
a pair of complementary p-type and n-type transistors. 
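
The remark that the `bubbles' are simply inverters can be checked with a small behavioural model. The sketch below is an assumption-laden illustration: it builds each GFA type from one ordinary full adder with inverters on the negatively weighted inputs and outputs, and the names `full_adder`, `gfa` and `WEIGHTS` are hypothetical. Which pins carry the inverters in Figure 4.11 is not recoverable here, so this is only one consistent placement.

```python
def full_adder(u, v, w):
    """Standard 3:2 adder (type 0): u + v + w = 2*c + s. Returns (c, s)."""
    total = u + v + w
    return total >> 1, total & 1

def inv(bit):
    """Logical inversion of a single bit."""
    return 1 - bit

def gfa(kind, u, v, w):
    """Return (c, s) for GFA type 0..3 using one full adder plus inverters."""
    if kind == 0:
        return full_adder(u, v, w)
    if kind == 1:                       # -u + v + w = 2c - s
        c, s = full_adder(inv(u), v, w)
        return c, inv(s)
    if kind == 2:                       # -u - v + w = -2c + s
        c, s = full_adder(inv(u), inv(v), w)
        return inv(c), s
    if kind == 3:                       # -u - v - w = -2c - s
        c, s = full_adder(inv(u), inv(v), inv(w))
        return inv(c), inv(s)
    raise ValueError("kind must be 0, 1, 2 or 3")

# Arithmetic weights of (u, v, w, c, s) for each type, taken from the list above.
WEIGHTS = {
    0: (1, 1, 1, 1, 1),
    1: (-1, 1, 1, 1, -1),
    2: (-1, -1, 1, -1, 1),
    3: (-1, -1, -1, -1, -1),
}

def gfa_holds(kind):
    """Check the defining identity of one GFA type over all eight input patterns."""
    wu, wv, ww, wc, ws = WEIGHTS[kind]
    for n in range(8):
        u, v, w = (n >> 2) & 1, (n >> 1) & 1, n & 1
        c, s = gfa(kind, u, v, w)
        if wu * u + wv * v + ww * w != 2 * wc * c + ws * s:
            return False
    return True
```

The model works because a negatively weighted bit n contributes -n = (1 - n) - 1, so inverting it shifts the sum by a constant that the inverted outputs absorb.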
The addition of two k-digit RSD vectors X and Y to yield a (k + 1)-digit sum S
can be performed either by summing their respective positive and negative components
with fully carry-propagated additions, i.e. $S^+ = X^+ + Y^+$ and $S^- = X^- + Y^-$,
or, because the vectors have a redundant representation, by using a 2-level GFA adder
network as shown in Figure 4.12. Note that this method is much preferable since the
addition time is constant (2-GFA delay) because no carry propagation is necessary.

[Figure 4.12: RSD addition.]

Referring back to the section on 2's complement representation we can see that a 2's
complement vector can be viewed as a special case of an RSD vector. That is, for
$Y = [y_{k-1}, y_{k-2}, \ldots, y_0]$ with $y_i \in \{0, 1\}$ a 2's complement vector, and
$X = [x_{k-1}, x_{k-2}, \ldots, x_0]$
with $x_i = x_i^+ - x_i^- \in \{-1, 0, 1\}$ an RSD vector, the mapping
$$x_i^+ = \begin{cases} y_i & \text{for } i = 0 \ldots k-2 \\ 0 & \text{for } i = k-1 \end{cases} \qquad x_i^- = \begin{cases} 0 & \text{for } i = 0 \ldots k-2 \\ y_i & \text{for } i = k-1 \end{cases}$$
converts from 2's complement to RSD. This can be used to advantage when adding a 2's
complement vector to an RSD vector as shown in Figure 4.13. This circuit can be further
simplified by noting that the lower (k - 1)-th adder in the diagram simplifies down to a
direct connection as shown in Figure 4.14. Thus a 2's complement vector can be added
to an RSD vector by a single layer of GFAs. Since both of these representations allow for
signed numbers, this mechanism is ideal for implementing recoded multipliers as we shall
see in the next section.

[Figure 4.13: Addition of 2's complement vector to RSD vector.]

[Figure 4.14: Simplification of lower (k - 1)-th adder of Figure 4.13.]

Finally, a k-digit RSD vector, X, can be converted to a (k+1)-bit 2's complement vector
Y with the aid of a fully carry-propagated adder, as shown in Figure 4.15. The addition
works by converting the negative part of the RSD vector, $X^-$, to its 2's complement
representation on the inputs to the adder. Thus

[Figure 4.15: Conversion of RSD vector to 2's complement.]

$$Y = -2^k \cdot y_k + \sum_{i=0}^{k-1} 2^i \cdot y_i$$
which is identical to the value of X.
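
The two conversions of this section can be modelled directly on bit lists. The following is a minimal sketch with hypothetical function names; bit and digit lists are least-significant first, and the RSD-to-2's-complement step is modelled as X+ plus the inverted X- plus one, which is the usual `invert and add one' reading of the description above.

```python
def twos_value(bits):
    """Value of a 2's complement bit list (least-significant bit first)."""
    k = len(bits)
    return -(bits[k - 1] << (k - 1)) + sum(b << i for i, b in enumerate(bits[:k - 1]))

def rsd_value(x_plus, x_minus):
    """Value of an RSD vector held as its positive and negative bit lists."""
    return sum((p - m) * (1 << i) for i, (p, m) in enumerate(zip(x_plus, x_minus)))

def twos_to_rsd(y_bits):
    """Map a k-bit 2's complement vector to RSD: only the sign bit is negative."""
    k = len(y_bits)
    x_plus = list(y_bits[:k - 1]) + [0]
    x_minus = [0] * (k - 1) + [y_bits[k - 1]]
    return x_plus, x_minus

def rsd_to_twos(x_plus, x_minus):
    """Convert a k-digit RSD vector to (k+1)-bit 2's complement with one full addition."""
    width = len(x_plus) + 1
    mask = (1 << width) - 1
    xp = sum(b << i for i, b in enumerate(x_plus))
    xm = sum(b << i for i, b in enumerate(x_minus))
    total = (xp + (~xm & mask) + 1) & mask   # X+ + not(X-) + 1, modulo 2^(k+1)
    return [(total >> i) & 1 for i in range(width)]

y = [1, 0, 1, 1]                      # 1 + 4 - 8 = -3 as a 4-bit 2's complement value
xp, xm = twos_to_rsd(y)
assert rsd_value(xp, xm) == twos_value(y)
assert twos_value(rsd_to_twos(xp, xm)) == twos_value(y)
```

The extra output bit is needed because a k-digit RSD vector can reach values up to 2^k - 1 in magnitude, one bit more than a k-bit 2's complement vector can hold.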
4.5 A Recoded Multiplier 
As an example of how to implement a recoded multiplier using a mixture of 2's comple- 
ment and RSD signed number representations, a simple 1-level iterative multiplier will be 
presented. Because of the use of recoding, 2 bits of the multiplier operand, X, are examined
in each iteration and multiples 0, ±Y and ±2Y of the multiplicand, Y, are added to
the accumulated result. Also, the operands X and Y are assumed to be in 2's complement 
form. This means that we can directly multiply positive as well as negative numbers. 
Firstly, since Y is a k-bit 2's complement vector and we wish to generate multiples of 
±2Y, also in 2's complement form, for addition to the adder/accumulator, we shall have 
to generate the multiples as a (k + 1)-bit 2's complement vector $Y' = [y'_k, y'_{k-1}, \ldots, y'_0]$.
Remembering that negating a 2's complement vector means `invert and add one', then
multiples ±Y and ±2Y can be generated as follows.
• Generate +Y: $+Y = Y'$ where
$$y'_i = \begin{cases} y_i & \text{for } i = 0 \ldots k-2 \\ y_{k-1} & \text{for } i = k-1, k \end{cases}$$
• Generate -Y: $-Y = Y' + 1$ where
$$y'_i = \begin{cases} \bar{y}_i & \text{for } i = 0 \ldots k-2 \\ \bar{y}_{k-1} & \text{for } i = k-1, k \end{cases}$$
• Generate +2Y: $+2Y = Y'$ where
$$y'_i = \begin{cases} y_{i-1} & \text{for } i = 0 \ldots k-1 \text{ with } y_{-1} = 0 \\ y_{k-1} & \text{for } i = k \end{cases}$$
• Generate -2Y: $-2Y = Y' + 1$ where
$$y'_i = \begin{cases} \bar{y}_{i-1} & \text{for } i = 0 \ldots k-1 \text{ with } y_{-1} = 0 \\ \bar{y}_{k-1} & \text{for } i = k \end{cases}$$
The hardware required to implement this mapping is shown in Figure 4.16.

[Figure 4.16: Bitslice of 0, ±Y and ±2Y generation.]
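
The four mappings above can be exercised in a short behavioural sketch. The function name `gen_multiple` and its return convention (the (k+1)-bit vector Y' together with an `add one' flag) are assumptions made for illustration; bit lists are least-significant first.

```python
def gen_multiple(y_bits, m):
    """Return (yp_bits, add_one) such that m*Y = Y' + add_one modulo 2^(k+1).

    y_bits is a k-bit 2's complement vector, least-significant bit first,
    and m is one of 0, +1, -1, +2, -2.
    """
    k = len(y_bits)
    if m == 0:
        return [0] * (k + 1), 0
    if abs(m) == 1:
        base = list(y_bits) + [y_bits[k - 1]]   # sign-extend Y by one bit
    elif abs(m) == 2:
        base = [0] + list(y_bits)               # shift left, filling y_{-1} = 0
    else:
        raise ValueError("m must be one of 0, 1, -1, 2, -2")
    if m < 0:
        return [1 - b for b in base], 1         # invert; the 'add one' is deferred
    return base, 0

def unsigned(bits):
    """Unsigned integer value of a bit list (least-significant bit first)."""
    return sum(b << i for i, b in enumerate(bits))

y = [1, 1, 0, 1]                  # 4-bit 2's complement: 1 + 2 - 8 = -5
yp, add_one = gen_multiple(y, -2)
assert (unsigned(yp) + add_one) % 32 == (-2 * -5) % 32
```

Deferring the `add one' as a separate flag mirrors the way the text later feeds it into the adder array rather than performing a carry-propagated increment.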
The recoding of the multiplier operand X, when X is in 2's complement form, is 
particularly straightforward. This is because the string recoding techniques of Section 4.3 
automatically take care of 2's complement vectors without the need for the dummy xk =0 
bit. Thus a second order recoded vector of X, namely X" can be generated, if k is a 
multiple of 2, with k/2 digits. This direct conversion of 2's complement vectors to recoded 
vectors is called Booth encoding (see [26]). 
Thus X·Y can be computed with
$$X \cdot Y = \sum_{i=0}^{k/2-1} 2^{2i} \cdot x(i) \cdot Y$$
where $x(i) \in \{-2, -1, 0, 1, 2\}$ is the i-th digit of the recoded X vector.
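
For illustration, the second-order recoding of a 2's complement multiplier can be sketched with the standard radix-4 Booth rule, in which digit i is formed from bits 2i+1, 2i and 2i-1 of X. The recoding rule of Section 4.3 is not reproduced in this section, so the rule used here is the textbook one and the function names are hypothetical.

```python
def booth_radix4(x_bits):
    """Recode a k-bit 2's complement vector (LSB first, k even) into k/2 digits.

    Each digit is -2*x_{2i+1} + x_{2i} + x_{2i-1}, with x_{-1} = 0, and so lies
    in {-2, -1, 0, 1, 2}. Digits are returned least-significant first.
    """
    k = len(x_bits)
    assert k % 2 == 0
    digits = []
    prev = 0                                  # x_{-1} = 0
    for i in range(0, k, 2):
        digits.append(-2 * x_bits[i + 1] + x_bits[i] + prev)
        prev = x_bits[i + 1]
    return digits

def recoded_multiply(x_bits, y):
    """Accumulate sum(4^i * x(i) * Y) over the recoded digits of X."""
    return sum((4 ** i) * d * y for i, d in enumerate(booth_radix4(x_bits)))

x_bits = [1, 0, 1, 1, 0, 1]       # 6-bit 2's complement: 1 + 4 + 8 - 32 = -19
assert recoded_multiply(x_bits, 7) == -19 * 7
```

Note that no dummy sign bit is needed: the negative weight of the top bit of X is absorbed by the -2 coefficient of the most significant digit.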
The architecture of the multiplier is shown in Figure 4.17 where it should be noted
that $y'_{-1} = 1$ if a negative multiple of Y is being added (see Y multiple generation
definitions above). This represents the `add one' instruction from the `invert and add one' 2's
complement negation process.

[Figure 4.17: Recoded multiplier architecture.]

To implement the Right-to-Left multiplication algorithm a right-shift of 2 bits must be
performed by the accumulator every iteration. This is hardwired into the feedback loop
as
$$u_i^+ \leftarrow s_{i+2}^+, \qquad u_i^- \leftarrow s_{i+2}^-$$
for $i = 0 \ldots k-2$ with $u_{k-1}^+ = u_{k-1}^- = 0$ constant. The shifted out bits of the accumulator,
$s_1^+$, $s_1^-$, $s_0^+$ and $s_0^-$, are assumed to be collected in a shift-register and a full 2's complement
carry-propagated addition can be performed at the end of processing to yield a 2k-bit
result.
Thus a recoded multiplier has been developed with a single level adder network.
Performance figures are as follows.
    Number of iterations = $\lceil k/2 \rceil$
    Iteration time       = $\Delta_{MUX} + \Delta_{AND} + \Delta_{FA} + \Delta_{FF}$
    Number of bitslices  = $k + 1$
    Bitslice complexity  = $\Omega_{MUX} + \Omega_{AND} + \Omega_{FA} + 4 \cdot \Omega_{FF}$
On comparison with the unrecoded b = 2 multiplier we see that the iteration time has
been reduced by $\Delta_{FA} - \Delta_{MUX}$ and the complexity reduced by
$\Omega_{FA} + \Omega_{AND} - \Omega_{MUX}$ per bitslice.
4.6 Summary 
This chapter has studied the basic building blocks of arithmetic hardware. Efficient adder 
circuits and iterative multipliers were explained followed by a discussion of multiplier 
recoding techniques and the necessity for an efficient signed-number representation.
This culminated in the design of an efficient iterative multiplier with a recoded RSD
architecture that can perform signed multiplications with both operands and the result in
2's complement form.
This basic architecture, modified to perform Montgomery multiplications, will be used 
to construct the optimised multipliers of Chapter 7. 
Chapter 5 
Standard RSA Hardware 
Over the past 10-15 years there have been various proposals for implementing long-integer 
modular arithmetic circuits in custom ASIC devices, that would be suitable for use within 
an RSA cryptosystem. Few of these proposals have been realized in silicon. In this chapter 
we will review the methods that have been put forward to perform standard modular 
multiplication in custom hardware. By `standard' it is meant that the proposals are based 
on modified versions of the algorithms that appeared in Chapter 3. Where applicable, the 
successful implementation of a design will be noted. 
5.1 Multiple-Precision Arithmetic Hardware 
Long-integer arithmetic in software is performed using multi-precision techniques. Simply 
put, this means constructing long-integer operations, such as addition and multiplication, 
from smaller sized arithmetic primitives. These primitives are usually based on the natural 
word-size of the computer being used, e.g. 16-bit additions, multiplications etc. Whilst
long-integer arithmetic performed in this way is of course much slower than dedicated 
hardware, it does have the advantage of ease of implementation on more general arithmetic 
hardware [31] [32] [33] [34] [35]. 
This approach has been successfully used to implement RSA cryptography on dedicated 
DSP (Digital Signal Processing) chips [36] [37]. The rationale behind using a DSP chip 
is that such chips usually contain fast arithmetic processors, in particular, very fast 16-32 
bit multipliers are usually available. 
Multi-precision techniques have also been used in designing custom ASIC devices [38]. 
The idea here is that these chips will be small and low-powered and so suitable for em- 
bedding in SmartCards. 
5.2 Multiply-Divide Hardware 
As was mentioned in Section 3.3.1, one of the ways of performing modular multiplication 
is by first multiplying the operands and then dividing by the modulus and keeping the 
remainder. Whilst this approach is acceptable in multi-precision software, it leads to
inefficient use of circuit area when implemented in hardware. Nevertheless, some proposals
use this scheme including [39] and notably [40] where a chip has been fabricated, using
full-custom design techniques, that can perform RSA exponentiations at the rate of 8kbps
for key lengths of 1024 bits.
Other hardware division algorithms can be found in [41] [42] [43] [44] [45] [46]. 
5.3 Radix-2 Concurrent Multiply/Reduce Hardware 
In this section we will look at radix-2 serial multiplier implementations. That is, multipliers
that `consume' the multiplier operand X 1 bit at a time and perform modular reduction
concurrently with the multiplication (i.e. b = 1 multipliers from the previous chapter
modified to perform modular reduction).
5.3.1 Simple Modular Reduction 
Looking again at the Left-to-Right modular multiplication algorithm (Algorithm 10) we
have
$$s(i + 1) = \langle 2 \cdot s(i) + x_{k-i-1} \cdot Y \rangle_N$$
re-writing this as
$$r(i) = 2 \cdot s(i) + x_{k-i-1} \cdot Y$$
$$s(i + 1) = r(i) - q_i \cdot N$$
where $s(0) = 0$ and $q_i = \lfloor r(i)/N \rfloor \in \{0, 1, 2\}$ gives $s(k) = \langle X \cdot Y \rangle_N$, and X, Y and N are
all k-bit positive integers with $X, Y \in [0, N - 1]$. This can be expressed in pseudo-code as
shown in Figure 5.1. Upon completion of this code, $S = \langle X \cdot Y \rangle_N$.
 1. S := 0
 2. FOR I := 0 TO k-1
 3.   S := 2*S
 4.   IF S >= N
 5.     S := S - N
 6.   ENDIF
 7.   IF x[k-I-1] = 1
 8.     S := S + Y
 9.     IF S >= N
10.       S := S - N
11.     ENDIF
12.   ENDIF
13. ENDFOR
Figure 5.1: L-to-R modular multiplication.
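
A direct software transcription of Figure 5.1 is given below as a sketch. The function name `modmul_ltr` is hypothetical, and ordinary Python integers stand in for the k-bit registers.

```python
def modmul_ltr(x, y, n, k):
    """Left-to-Right modular multiplication: returns (x * y) mod n.

    x, y and n are k-bit non-negative integers with x, y in [0, n-1].
    """
    s = 0
    for i in range(k):
        s = 2 * s
        if s >= n:                      # full k-bit comparison (line 4 of the figure)
            s -= n
        if (x >> (k - i - 1)) & 1:      # multiplier bit x_{k-i-1}
            s += y
            if s >= n:                  # second full comparison (line 9 of the figure)
                s -= n
    return s

assert modmul_ltr(200, 123, 239, 8) == (200 * 123) % 239
```

Because s stays below n after every step, a single conditional subtraction suffices at each of the two points, but each one costs a full-length comparison.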
The main problem with this algorithm is the need to calculate $q_i$. From the pseudo-code
it can be seen that this involves a comparison of S and N in lines 4 and 9. Since
this is a full k-bit comparison of S and N, the time taken to perform this would be the
same as that taken for a fully carry-propagated addition of S and N. As we have seen
in Section 4.1 this can take a relatively long time. Nevertheless, this algorithm has been
used in an RSA chip [47] using a type of carry-completion adder. The implementation
was successful, if a little slow.
A slight improvement to the above algorithm can be made by, instead of keeping s(i)
in the range [0, N - 1], allowing it to cover the increased range of $s(i) \in [0, 2^k - 1]$. Thus
again
$$r(i) = 2 \cdot s(i) + x_{k-i-1} \cdot Y$$
$$s(i + 1) = r(i) - q_i \cdot N$$
but now $q_i = \lfloor r(i)/2^k \rfloor$. Note that, since the range of s(i) has increased, and it is assumed
that N is a k-bit integer
$$2^{k-1} < N < 2^k$$
then $q_i$ has the increased range of $q_i \in \{0, 1, 2, 3\}$. The advantage of this approach is that
$q_i$ is determined solely from bits k and k+1 of r(i); the `overflow' bits of r(i). Thus
no long-integer comparison is required. A pseudo-code interpretation of this algorithm is
shown in Figure 5.2.
 1. S := 0
 2. FOR I := 0 TO k-1
 3.   S := 2*S
 4.   IF S >= 2^k
 5.     S := S - N
 6.     IF S >= 2^k
 7.       S := S - N
 8.     ENDIF
 9.   ENDIF
10.   IF x[k-I-1] = 1
11.     S := S + Y
12.     IF S >= 2^k
13.       S := S - N
14.     ENDIF
15.   ENDIF
16. ENDFOR
Figure 5.2: L-to-R MM: overflow determination of subtractions of N.
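
Figure 5.2 can be transcribed in the same way. In this sketch the comparison against 2^k is written as a test of the bits above position k-1, which is the point of the method; the names `modmul_overflow` and `final_reduce` are hypothetical, and the final correction step is the single subtraction that the following paragraph of the text describes.

```python
def modmul_overflow(x, y, n, k):
    """L-to-R multiplication with reduction decided by overflow beyond bit k-1.

    Requires 2^(k-1) < n < 2^k and x, y in [0, n-1]. The result lies in
    [0, 2^k - 1] and is congruent to x * y modulo n.
    """
    assert (1 << (k - 1)) < n < (1 << k) and 0 <= x < n and 0 <= y < n
    s = 0
    for i in range(k):
        s = 2 * s
        if s >> k:                      # overflow bits set, i.e. S >= 2^k
            s -= n
            if s >> k:
                s -= n
        if (x >> (k - i - 1)) & 1:
            s += y
            if s >> k:
                s -= n
    return s

def final_reduce(s, n):
    """Bring a result from [0, 2^k - 1] back into [0, n-1] with one subtraction."""
    return s - n if s >= n else s

s = modmul_overflow(200, 123, 239, 8)
assert final_reduce(s, 239) == (200 * 123) % 239
```

One subtraction is enough in `final_reduce` because the result is below 2^k, which is itself below 2n under the stated bound on n.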
The disadvantage of this approach is that the result of the algorithm, s(k), is in an
extended range, and to bring it back to [0, N - 1] may require a subtraction of N. Also, all
additions and subtractions used here are fully carry-propagated so that the overflow can be
accurately determined. When using algorithms such as this one in a modular exponentiator
the first problem, that of the extended range of partial results, can be overcome if it
can be shown that operands with such an extended range are allowed as inputs to the
multiplication routine, i.e. during the repeated multiplications of an exponentiation the
range of the multiplication results does not diverge. This usually requires a trade-off
between the range of the operands/result and the range of $q_i$. The second problem, of
using carry-propagated additions, is not acceptable in high-performance implementations.
5.3.2 Residue-Table Reduction 
Instead of using the overflow bits to determine the multiple of N that should be subtracted,
as was done in the previous section, these bits can be used to index a small table of
pre-computed residues. If each entry in the table is denoted by T[j] for j = 0, 1, ..., then the
contents of each entry are
$$T[j] = \langle j \cdot 2^k \rangle_N$$
and they are used as follows
$$r(i) = 2 \cdot s(i) + x_{k-i-1} \cdot Y$$
$$s(i + 1) = \langle r(i) \rangle_{2^k} + T[q_i]$$
where again $q_i = \lfloor r(i)/2^k \rfloor$ is the overflow of r(i) beyond the k-th bit. Note that, since the
table entries are residues modulo N, they are added (not subtracted) to the accumulated
sum. If $q_i \in \{0, 1, 2, 3\}$ then, since T[0] = 0, a total of three entries are required in the
table.
A pseudo-code version of this algorithm is shown in Figure 5.3. The main advantage of
this approach should be clear from the code: residue reduction requires only one addition
as opposed to the previous methods where multiple subtractions were necessary.
1. S := 0
2. FOR I := 0 TO k-1
3.   S := 2*S
4.   IF x[k-I-1] = 1
5.     S := S + Y
6.   ENDIF
7.   q := S DIV 2^k
8.   S := <S>_{2^k} + T[q]
9. ENDFOR
Figure 5.3: L-to-R MM: residue-table lookup.
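
The residue-table method of Figure 5.3 can be sketched as follows. The function name `modmul_table` is hypothetical. As an assumption made to keep the sketch safe, the table is sized for three overflow bits (eight entries), which is more generous than the three-entry table described in the text, so that the code does not rely on the tighter bound on q.

```python
def modmul_table(x, y, n, k):
    """L-to-R multiplication with residue-table reduction.

    Requires 2^(k-1) < n < 2^k and x, y in [0, n-1]. The returned value is
    congruent to x * y modulo n but may exceed n; a final correction is needed.
    """
    assert (1 << (k - 1)) < n < (1 << k) and 0 <= x < n and 0 <= y < n
    mask = (1 << k) - 1
    table = [(j << k) % n for j in range(8)]   # T[j] = (j * 2^k) mod n
    s = 0
    for i in range(k):
        s = 2 * s
        if (x >> (k - i - 1)) & 1:
            s += y
        q = s >> k                             # overflow beyond bit k-1
        s = (s & mask) + table[q]              # drop the overflow, add its residue
    return s

s = modmul_table(200, 123, 239, 8)
assert s % 239 == (200 * 123) % 239
```

Replacing the overflow j·2^k by its residue changes the accumulator by a multiple of n, so the congruence is preserved with a single addition per iteration.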
In [48] and [49] Tomlinson implements this method using a CSA architecture. The
basic idea is shown in Figure 5.4. The inputs to the first and second level of adders are
respectively,
$$w_j = x_{k-i-1} \cdot y_j$$
$$t_j = j\text{-th bit of table residue } T[q_{i-1}]$$
and since, in modular multiplication, the Left-to-Right multiplication algorithm is used,
a built-in left shift must be included in the CSA feedback loop. So
$$u_j \leftarrow s_{j-1}, \qquad u'_j \leftarrow s'_{j-1}$$

[Figure 5.4: Tomlinson modular multiplier.]

Thus the algorithm can be expressed as
$$r(i) = 2 \cdot s(i) + x_{k-i-1} \cdot Y$$
$$s(i + 1) = |r(i)|_{2^k} + T[q_{i-1}]$$
where, if
$$r(i) = 2^{k+1} \cdot s'_{k+1} + \sum_{j=0}^{k} 2^j \cdot (s_j + s'_j)$$
then $q_i = 2 \cdot s'_{k+1} + s_k + s'_k$ and Tomlinson claims (but gives no proof) that $q_i \ne 4$ so that
$q_i \in \{0, 1, 2, 3\}$. Thus a 3-element residue table is required. The notation $|r(i)|_{2^k}$ is used
to imply the removal of the CSA overflow bits of r(i).
Note that, in this algorithm, on each iteration the table-residue corresponding to the
previous iteration, $T[q_{i-1}]$, is added to the accumulator. Thus no time is wasted during
each iteration on decoding the upper bits of s(i) and selecting the appropriate table entry.
This takes place on the following iteration in parallel with the addition of $x_{k-i-1} \cdot Y$ in
the first level of adders. Thus the adder array can be run at a rate approaching full
speed with very little time wasted in waiting for any intermediate `residue-determination'
calculations. Once the multiplication is complete, an assimilation of the CSA vectors S
and S' must be performed (fully carry-propagated addition) along with a subtraction of
at most 3N to bring the result into the range [0, N - 1].
In [50] Iwamura et al. propose a table-lookup scheme based on a modified CSA architecture.
In their architecture, the vertical bit-slices of the CSA adder are grouped together
horizontally into m-bit sections, m > 3. This has the effect of reducing the CSA register
size (carries are `saved' after m bit propagations so that the carry vector has fewer
elements) but at the expense of complicating the adder array. They show that the overflow
after each iteration is limited to 3 bits, and that its value is in the range [0, 6]. Therefore
their design requires a 6-element residue-table.
Chiou, in [51], proposes a similar technique but with a single-level CSA adder array.
With this array he alternately adds partial products $x_{k-i-1} \cdot Y$ and table residues. Thus
2k iterations are required to complete a modular multiplication (as opposed to k iterations
above) but, since only a 1-level adder is required to add either partial products or table
residues, each iteration is almost twice as fast as a standard 2-level CSA adder. He also
uses a table of 6 pre-computed residues.
A fully functional RSA chip has been implemented by the Belgian company Cryptech. 
In [52], Hoornaert et al. describe their system whereby the top few bits of s(i) are compared 
with the top few bits of the pre-computed binary expansion of 1/N. It is not stated how 
many bits are involved in this 'comparison', or how the comparison is performed, but the 
result is then used to index a 3-element residue-table. The original version of this chip was 
capable of performing 512-bit RSA exponentiations at the rate of 17kbps. A more recent 
version using improved VLSI technology (that is assumed to use the same algorithm) can 
operate at 32kbps. 
Recently, in [53], Chiou et al. proposed a system that takes the table-lookup scheme 
to its extreme. They use a 1-level CSA architecture with just one addition required per 
cycle and the multiplication completed in k cycles. A 7-element residue table is used with
three of the residues of the form $\langle j \cdot 2^k \rangle_N$ and the other four of the form
$\langle j \cdot 2^k + Y \rangle_N$
for j = 0, 1, 2, 3. Selection of the appropriate residue to add to the accumulator during
each iteration of the algorithm is performed by examining the overflow bits of the previous
iteration and the current multiplier bit $x_{k-i-1}$. Initially this scheme looks very attractive
but the hidden complication is that four of the table entries involve pre-computed residues 
of the multiplicand Y. These must be computed before each multiplication and, since 
they must be in binary, this involves fully carry-propagated additions and subtractions 
to be performed before the multiplication can take place. Unless these calculations are 
performed in parallel with fast adders, the time required for the multiplication as a whole 
is no better than with more conventional approaches. Of course if parallel fast adders are 
used then circuit complexity becomes an issue. 
5.3.3 Quotient Estimation 
Instead of using lookup-tables indexed by overflow bits, it is possible to estimate the
multiple of N that should be subtracted from partial results by examining the top few bits
of s(i) and N. In discarding the tables we can decrease the hardware requirements of a
modular multiplier.
Typically, the more bits of s(i) and N that are examined the smaller is the range to
which $q_i$ can be restricted. In general, this approach can be described by
$$r(i) = 2 \cdot s(i) + x_{k-i-1} \cdot Y$$
$$s(i + 1) = r(i) - q_i \cdot N$$
where $q_i = f(\mathrm{top}(s(i)), \mathrm{top}(N))$ is some specific calculation optimised for speed.
A pseudo-language interpretation is shown in Figure 5.5.
1. S := 0
2. FOR I := 0 TO k-1
3.   S := 2*S
4.   IF x[k-I-1] = 1
5.     S := S + Y
6.   ENDIF
7.   q := f(top(S), top(N))
8.   S := S - q*N
9. ENDFOR
Figure 5.5: L-to-R MM: quotient estimation.
In 1982 Brickell [54] proposed a design for fast modular multiplication using quotient
estimation. His design is based around an adder circuit called a Delayed Carry Adder
(DCA). The DCA is similar to the CSA in that carries are not propagated, but it differs
in that it is constructed from Half-Adders (HA). Figure 5.6 shows a single bit-slice of a
5-stage DCA. The circuit diagram of Figure 5.7 shows the logic of a half-adder. The DCA
has the property that its outputs, $s_i$ and $s'_{i+1}$ in the diagram, do not have a completely
redundant representation as in CSA. To be specific
$$s_i \cdot s'_{i+1} = 0$$
that is $s_i$ and $s'_{i+1}$ cannot both be equal to 1. This can be seen from the circuit diagram
of the HA. Brickell was able to use this fact to advantage in the feedback of the DCA's
output to its input during the iterations of the multiplication. Note that the speed of the
half-adder is about twice that of the full-adder, thus the adder array shown above is only
slightly slower than that of a 2-level CSA array.

[Figure 5.6: Delayed-carry adder (DCA).]

[Figure 5.7: Half-adder (HA).]

Brickell's algorithm for quotient estimation allows the partial result s(i) to overflow $2^k$
by 11 bits. A subtraction of either $2^{10} \cdot N$ or $2^{11} \cdot N$ is performed when appropriate. Thus
the range of q is {0, 1, 2}. The determination of which multiple to subtract is based upon
an addition involving the top 4 bits of s(i) and N. Since this is a fully carry-propagated
4-bit addition, it will take time, and it is likely that the DCA will not be able to operate
at full speed.
In [55] Baker proposed a quotient estimation technique using a CSA architecture that 
subtracts multiples ±N and ±2N. The selection of the appropriate multiple depends upon 
a comparison of the fully assimilated top 6 bits of s(i) and N. Since this involves 6-bit 
carry propagated arithmetic, it too will not allow the CSA to operate at full speed. 
Takagi proposed, in [56], to use an RSD architecture with each step of the algorithm
reduced separately using multiples ±N. That is
$$r(i) = 2 \cdot s(i) - q'_i \cdot N$$
$$s(i + 1) = r(i) + x_{k-i-1} \cdot Y - q''_i \cdot N$$
where $q'_i, q''_i \in \{-1, 0, 1\}$. Selection of 0, ±1 for q' and q'' is based on a comparison of the
top 3 digits of r(i) and s(i + 1) respectively. That is, if
$$v = 4 \cdot r_{k+1} + 2 \cdot r_k + r_{k-1} = 4(r_{k+1}^+ - r_{k+1}^-) + 2(r_k^+ - r_k^-) + (r_{k-1}^+ - r_{k-1}^-)$$
then
$$q'_i = \begin{cases} 0 & \text{if } v = 0 \\ 1 & \text{if } v > 0 \\ -1 & \text{if } v < 0 \end{cases}$$
Since an RSD vector has signed digits, the determination of the sign of v depends only
on the highest non-zero digit. However, finding the highest non-zero digit still implies
a propagation of some sort, albeit limited to 3 digits. Note that splitting the reduction
into two parts, that is separate reductions of r(i) and s(i + 1), can be advantageous since,
for $x_{k-i-1} = 0$, no addition of a partial product and therefore no reduction takes place.
However, this only happens approximately 50% of the time and so the multiple selection
logic would have to deal with about 4-5 digit propagations per iteration, making the
determination of $q_i = q'_i + q''_i$ again slower than the adder array.
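
The claim that the sign of v depends only on the highest non-zero digit can be checked exhaustively for three RSD digits. This is a small sketch with hypothetical function names; the digit triple is given most-significant first.

```python
def select_q(top_digits):
    """Return the sign of v = 4*d2 + 2*d1 + d0 for RSD digits (d2, d1, d0)."""
    d2, d1, d0 = top_digits
    v = 4 * d2 + 2 * d1 + d0
    return (v > 0) - (v < 0)

def leading_sign(top_digits):
    """Return the highest non-zero digit, scanning from the most significant end."""
    for d in top_digits:
        if d != 0:
            return d
    return 0

# The two selections agree on all 27 digit patterns.
patterns = [(a, b, c) for a in (-1, 0, 1) for b in (-1, 0, 1) for c in (-1, 0, 1)]
assert all(select_q(p) == leading_sign(p) for p in patterns)
```

The scan in `leading_sign` is the limited propagation that the text refers to: it is short, but it is still a sequential dependency across the examined digits.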
5.4 Radix-4 Concurrent Multiply/Reduce Hardware 
Implementing radix-4 modular multiplication, we have as a starting point
$$r(i) = 4 \cdot s(i) + X_{k-i-1} \cdot Y$$
$$s(i + 1) = r(i) - q_i \cdot N$$
where $X_i \in \{0, 1, 2, 3\}$ and $q_i = \lfloor r(i)/N \rfloor \in \{0, 1, 2, 3, 4, 5, 6\}$. Immediately we see that
the range of $q_i$ is much increased from the radix-2 case, and because of this, no proposals
have been published in the literature for high-radix residue-table modular multipliers.
The tables simply become too large. This leaves quotient estimation as the only viable
high-radix modular multiplication method.
Takagi extended his radix-2 multiplication scheme to a radix-4 method in [56]. Each 
iteration of the algorithm is split into four separate calculations,
$$r(i) = 2 \cdot s(i) - q'_i \cdot N$$
$$r'(i) = 2 \cdot r(i) - q''_i \cdot N$$
$$t(i) = x(i) \cdot Y - q'''_i \cdot N$$
$$s(i + 1) = r'(i) + t(i) - q''''_i \cdot N$$
where the vector X is recoded such that $x(i) \in \{-2, -1, 0, 1, 2\}$ and the multiples of N
are determined as above in the radix-2 case for $q'_i, q''_i, q'''_i, q''''_i \in \{-1, 0, 1\}$. Whilst being
a simple extension of the radix-2 method, it obviously suffers from the many calculations
that have to be performed during each iteration, and so is not very efficient.
In [57] Takagi improved upon this algorithm to give the following
$$r(i) = 4 \cdot s(i) + x(i) \cdot Y$$
$$s(i + 1) = r(i) - 4 \cdot q_{i-1} \cdot N$$
where $x(i), q_i \in \{-2, -1, 0, 1, 2\}$. Thus, since x(i) and $q_i$ share the same range, a simple
recoded RSD architecture (2-level adder array) can be used to implement the multiplier.
The problem with this approach however, is that to restrict $q_i$ to this range requires 5
parallel comparisons of the top 8 bits of s(i) and N. Since an 8-bit calculation requires
more time to complete than the $2 \cdot \Delta_{FA}$ delay of the adder array, this method again fails
to allow the adder array to operate at full speed.
In [58] Morita uses a similar algorithm that requires multiple concurrent comparisons 
involving the top 7 bits of s(i) and N. 
5.5 Radix-$2^b$ Concurrent Multiply/Reduce Hardware
Generalizing the quotient estimation technique to b-bit high-radix modular multipliers 
helps us to understand the trade-offs that are implicit in this approach. The trade-off 
concerns three elements, 
1. minimizing the calculation time of $q_i$,
2. minimizing the range of $q_i$, and
3. increasing b, the number of bits of X that are consumed during each iteration.
In general, trying to optimize just one of the above elements will have an adverse effect 
on the other two. 
For example, in [59] with their VICTOR design, Orup et al. show that, for X, Y < N
all lb-bit positive integers, i.e.
$$X = [X_{l-1}, X_{l-2}, \ldots, X_0]$$
where $X_i \in [0, 2^b - 1]$, and using the following algorithm
$$r(i) = 2^b \cdot s(i) + X_{l-i-1} \cdot Y$$
$$s(i + 1) = r(i) - 2^{2b} \cdot q_{i-1} \cdot N$$
then using a method to calculate $q_i$ that involves a multiplication of the top $\delta$ bits of s(i)
and the top $\epsilon$ bits of 1/N, an equation governing the design trade-offs is given in [59] as
follows
$$q_{MAX} = \left\lfloor \frac{2^b (2^{b+3-\delta} + 1 + 2^{1-b})}{1 - 2^{b-\epsilon}} \right\rfloor$$
For fixed values of b the above equation is asymptotic in $q_{MAX}$ as $\delta$ and $\epsilon$ are varied.
Table 5.1 shows the minimum values of $\delta$ and $\epsilon$ necessary to achieve minimum $q_{MAX}$ for
b = 1 ... 6.

    b     δ     ε     q_MAX
    1     6     5       4
    2     8     7       6
    3    10    10      10
    4    12    12      18
    5    14    13      34
    6    16    15      66
Table 5.1: Minimum δ and ε for minimum q_MAX.

Optimized designs are then derived by noting the following,
1. All integers in the range 0...10 can be expressed as the sum (or difference) of at most
two powers of 2, i.e. $3 = 2^1 + 2^0$, $7 = 2^3 - 2^0$, and $10 = 2^3 + 2^1$. Thus multiples $q_i \cdot N$
for $q_i \in [0, 10]$ can be expressed as the sum of two partial sub-multiples $q'_i \cdot N + q''_i \cdot N$
with $q'_i \in \{0, 2, 4, 8\}$ and $q''_i \in \{-1, 0, 1, 2\}$ that are easy to generate in hardware.
2. All integers in the range 0...42 can be expressed as the sum (or difference) of at
most three powers of 2, i.e. $11 = 2^3 + 2^1 + 2^0$, $27 = 2^5 - 2^2 - 2^0$, and $42 = 2^5 + 2^3 + 2^1$.
Thus multiples $q_i \cdot N$ for $q_i \in [0, 42]$ can be expressed as the sum of three partial
sub-multiples $q'_i \cdot N + q''_i \cdot N + q'''_i \cdot N$ with $q'_i \in \{0, 8, 16, 32\}$, $q''_i \in \{-8, -4, 0, 4, 8\}$
and $q'''_i \in \{-2, -1, 0, 1, 2\}$ that are easy to generate in hardware.
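
Both coverage claims in the list above can be confirmed by enumeration. The sketch below uses the digit sets exactly as stated there; the function names are hypothetical.

```python
def two_term_cover():
    """All values q' + q'' with q' in {0,2,4,8} and q'' in {-1,0,1,2}."""
    return {a + b for a in (0, 2, 4, 8) for b in (-1, 0, 1, 2)}

def three_term_cover():
    """All values q' + q'' + q''' drawn from the three sets given for 0...42."""
    return {
        a + b + c
        for a in (0, 8, 16, 32)
        for b in (-8, -4, 0, 4, 8)
        for c in (-2, -1, 0, 1, 2)
    }

assert set(range(0, 11)) <= two_term_cover()
assert set(range(0, 43)) <= three_term_cover()
```

Every sub-multiple in these sets is zero or plus or minus a power of 2 times N, so each can be produced by a shift and an optional inversion rather than by a multiplication.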
By selecting a b for which the asymptotic value of $q_{MAX}$ is less than one of the `optimum'
values of 10 or 42, and then reducing $\delta$ and $\epsilon$ such that $q_{MAX}$ is equal to (or slightly
less than) one of these values, an `optimum' trade-off in terms of $q_i$ calculation time and
range can be achieved for each selection of b. These values are shown in Table 5.2. In
terms of multiplier circuit complexity and the number of iterations required to complete a
multiplication, the optimum values for b in Table 5.2 are b = 3 and b = 5. This is because
the maximum values of b for which the ranges of $q_i$ are [0, 10] and [0, 42] are b = 3 and
b = 5 respectively.

    b     δ     ε     q_MAX
    1     3     3      10
    2     5     6      10
    3    10    10      10
    4     7     7      38
    5    10    11      42
    6     -     -       -
Table 5.2: Optimum δ and ε.
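
The trade-off can be explored numerically. The sketch below assumes that the bound from [59] has the form q_MAX = floor(2^b (2^(b+3-δ) + 1 + 2^(1-b)) / (1 - 2^(b-ε))); this form is a reading of a damaged equation and should be treated as an assumption, although it does reproduce every populated row of Tables 5.1 and 5.2. The function name `q_max` is hypothetical, and exact rational arithmetic is used to avoid rounding near integer boundaries.

```python
from fractions import Fraction
from math import floor

def q_max(b, delta, eps):
    """Assumed upper bound on the quotient digit for radix 2^b.

    delta is the number of top bits of s(i) examined and eps the number of
    top bits of 1/N; the bound is only meaningful for eps > b.
    """
    if eps <= b:
        raise ValueError("eps must exceed b for the bound to be finite")
    two = Fraction(2)
    numerator = two**b * (two**(b + 3 - delta) + 1 + two**(1 - b))
    denominator = 1 - two**(b - eps)
    return floor(numerator / denominator)

# Rows of Table 5.1 (minimum delta and eps for minimum q_MAX).
assert [q_max(1, 6, 5), q_max(2, 8, 7), q_max(3, 10, 10)] == [4, 6, 10]
# Rows of Table 5.2 (reduced delta and eps for q_MAX of 10 or 42).
assert [q_max(1, 3, 3), q_max(4, 7, 7), q_max(5, 10, 11)] == [10, 38, 42]
```

As δ and ε grow the bound approaches 2^b + 2, which is the asymptote referred to in the text.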
The architecture used for VICTOR is CSA with a limited amount of parallelism in the 
organization of the 3:2 adders. A general diagram is shown in Figure 5.8. With reference
to the diagram, and for the cases of b = 1...5, the adder array is constructed as follows

[Figure 5.8: VICTOR architecture (q·N generation, X·Y generation, four 3:2 adder arrays and the accumulator S, S').]

• b = 1: The multiplicand multiple $X' \cdot Y = X_{l-i-1} \cdot Y$, while $X'' \cdot Y = 0$. Therefore the
top 3:2 adder array shown in the diagram is not required. The modulus multiples,
$a' \cdot N$ and $a'' \cdot N$, are generated as in the 0...10 sum-of-two-powers-of-2 coding
scheme described above.
Thus a 3-level adder array is required.
• b = 2: The multiplicand multiples, $X' \cdot Y$ and $X'' \cdot Y$, are derived from $X_{l-i-1} \cdot Y$
in the obvious way ($X' = X_{l-i-1,(0)}$ and $X'' = X_{l-i-1,(1)}$). The modulus multiples,
$a' \cdot N$ and $a'' \cdot N$, are generated as in the sum-of-two-powers-of-2 coding scheme.
Thus a 4-level adder array is required.
• b = 3: The multiplicand multiples, $X' \cdot Y$ and $X'' \cdot Y$, are derived from the 3-bit
multiple $X_{l-i-1} \cdot Y$ by using the sum-of-two-powers-of-2 coding method. The
modulus multiples, $a' \cdot N$ and $a'' \cdot N$, are again generated by the
sum-of-two-powers-of-2 method.
Thus a 4-level adder array is again required.
• b = 4: The multiplicand multiples, $X' \cdot Y$ and $X'' \cdot Y$, are derived from the 4-bit multiple
$X_{l-i-1} \cdot Y$ by a combination of the sum-of-three-powers-of-2 coding technique and a
row of 3:2 adders to reduce the three powers-of-2 vectors to just two vectors. Thus a
1-level adder array is `hidden' inside the $X_i \cdot Y$ generation sub-block. The modulus
multiples, $a' \cdot N$ and $a'' \cdot N$, are also generated by the sum-of-three-powers-of-2 plus
1-level adder method. Therefore a 1-level adder array is also hidden inside the $q_i \cdot N$
generation sub-block.
Thus 6 1-level adders are used in the design, but because the generation of the
multiples of N is paralleled with the main adder array the delay through the whole
array is equal to $5 \cdot \Delta_{FA}$.
• b = 5: The multiplicand multiples, $X' \cdot Y$ and $X'' \cdot Y$, are derived from the 5-bit
multiple $X_{l-i-1} \cdot Y$ again by a combination of the sum-of-three-powers-of-2 coding
technique and a row of 3:2 adders. The modulus multiples, $a' \cdot N$ and $a'' \cdot N$, are
also generated by the sum-of-three-powers-of-2 method and a 3:2 adder array.
Therefore, 6 1-level adders are again required to implement the design and the delay
through the whole array is equal to $5 \cdot \Delta_{FA}$.
Examining the $q_i$ calculation requirements for the more area-efficient designs of $b = 3$
and $b = 5$, we see that in each case approximately 10 bits of $s(i)$ and $1/N$ are required
to be multiplied together to yield a value for the approximated quotient. Even with the
non-trivial sum-of-two-powers-of-2 and sum-of-three-powers-of-2 coding techniques being
used, this calculation is likely to take longer than the delay time of the adder array. If this
is not the case (maybe for $b = 5$), then we still cannot say that the adder array is running
at full speed because it will be the complex coding schemes that limit the cycle time of
the multiplier, particularly the coding of the modulus multiple $q_i \cdot N$.
In [60] Orton et al. propose a Diminished Radix (DR) modular multiplier. In this
scheme the k-bit modulus, N, is modified to produce a new modulus, M. The modification
is to multiply N by a relatively small number T, such that
\[ M = T \cdot N = 2^{k+c} - A \]
where T is chosen so that $A < 2^k$. Therefore T is, at most, a (c + 1)-bit number. This
modification ensures that the top c bits of M are all '1's.
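A small numerical sketch may help to make the modulus modification concrete. The choice $T = \lfloor 2^{k+c} / N \rfloor$ used below is an assumption made for illustration (it is one value of T that satisfies $A < 2^k$, and is not claimed to be the construction used in [60]); the function name is hypothetical.

```python
def diminished_radix_modulus(N, c):
    """Return (T, M, A, k) with M = T*N = 2**(k+c) - A and 0 < A < 2**k.

    N is an odd k-bit modulus; c is the 'modulus extension' in bits.
    """
    k = N.bit_length()
    T = (1 << (k + c)) // N      # assumed choice: largest T with T*N <= 2**(k+c)
    M = T * N
    A = (1 << (k + c)) - M       # equals 2**(k+c) mod N, hence A < N < 2**k
    return T, M, A, k
```

For example, with the 7-bit modulus N = 91 and c = 6 this gives T = 90, M = 8190 and A = 2, so the top six bits of M are all ones.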
The multiplication algorithm then proceeds as follows, 
r(i) = 26 " s(i) -}-Xß_; _1 "Y 
s(i+ 1) = I*(_)12k+C +2b " q1-1 "A 
where q; = 
l2 'ýJ and 1=f k- l. As before, the notation jr(i)12k+e means discard the 
overflow bits of r(i) beyond the (k + c)-th bit. 
Implied by the equation for $q_i$ (and proved in [60]) is that the range of $q_i$ depends to a
certain extent on the value of c. That is, the larger the 'modulus extension' c, the smaller
the range of $q_i$. This is in comparison with Orup et al.'s method which increased the
precision of the $q_i$ calculation to limit its range. Orton et al. found optimum values for c
that depend only on the multiplier bit-scan width b. They are

• b = 1: $c = 6$ gives $q_i \in [0, 3]$, and

• b = 2...10: $c = 2b + 5$ gives $q_i \in [0, 2^{b+2} - 1]$.
These are shown for $b = 1 \ldots 6$ in Table 5.3.

    b     c    q_MAX
    1     6      3
    2     9     15
    3    11     31
    4    13     63
    5    15    127
    6    17    255

Table 5.3: Optimum values for $b$, $c$ and $q_{MAX}$.

On comparison of Table 5.2 (the VICTOR design) with Table 5.3 we can see that, in the DR
case, a design trade-off has been made in favour of simplifying the calculation of $q_i$ at the
expense of increasing its range and slightly increasing the size of the modulus. The latter
trade-off, although slightly increasing the number of iterations required by the algorithm, is
trivial because the total number of iterations required for long-integer multiplication is
large. The former trade-off, that of increasing the range of the approximated quotient $q_i$,
can be alleviated to some extent by partitioning $q_i$ into 2-bit sub-blocks and holding values
for both A and 3A on-chip. Then each 2-bit digit of $q_i$ can use the stored value of A to
generate the multiples A and 2A, and the stored value of 3A to generate the remaining multiple
of 3A. This is effectively a hybrid quotient-estimation/table approach with the size of the
table less than any of the previously reviewed table-residue schemes.
For example, with $b = 2$ and $q_i \in [0, 15]$, then $q_i$ can be expressed as
\[ q_i = 4 \cdot q'_i + q''_i \]
with $q'_i, q''_i \in \{0, 1, 2, 3\}$. A simple diagram of the b = 2 CSA architecture is shown in
Figure 5.9.
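The 2-bit partitioning of $q_i$ can be sketched in software as follows. This is a behavioural illustration only: in hardware each digit's multiple would be a separate input to the carry-save adder array rather than being summed immediately, and the function name and `digits` parameter are hypothetical.

```python
def residue_multiple(q, A, digits):
    """Form q*A from the 2-bit digits of q using only the stored values A and 3A.

    The multiple 2A is a 1-bit left shift of A, so no third stored value is needed.
    """
    three_A = 3 * A                                   # held 'on-chip' alongside A
    stored = {0: 0, 1: A, 2: A << 1, 3: three_A}
    total = 0
    for j in range(digits):
        d = (q >> (2 * j)) & 3                        # j-th 2-bit digit of q
        total += stored[d] << (2 * j)                 # weight 4**j is a further shift
    return total
```

With `digits=2` this covers the b = 2 case above, where $q_i = 4 \cdot q'_i + q''_i$.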
[Figure 5.9: DR b = 2 architecture. Four 3:2 adder arrays, taking the two partial-product
vectors and the two partial-residue vectors as inputs, feed an accumulator.]

Thus the following implementations are possible,

• b = 1: With partial product $x_{l-i-1} \cdot Y$ and partial residue $q_i \cdot A$ where
$q_i \in \{0, 1, 2, 3\}$, then a 2-level CSA is required.

• b = 2: With partial products
$X_{l-i-1} \cdot Y = 2 \cdot X_{l-i-1,(1)} \cdot Y + X_{l-i-1,(0)} \cdot Y$
and partial residues $q_i \cdot A = 4 \cdot q'_i \cdot A + q''_i \cdot A$ where
$q'_i, q''_i \in \{0, 1, 2, 3\}$, then a 4-level CSA is required.

• b = 3: With partial products
$X_{l-i-1} \cdot Y = 4 \cdot X_{l-i-1,(2)} \cdot Y + 2 \cdot X_{l-i-1,(1)} \cdot Y + X_{l-i-1,(0)} \cdot Y$
and partial residues $q_i \cdot A = 16 \cdot q'_i \cdot A + 4 \cdot q''_i \cdot A + q'''_i \cdot A$ where
$q'_i \in \{0, 1\}$ and $q''_i, q'''_i \in \{0, 1, 2, 3\}$, then a 6-level CSA is required.

• b = 4: With partial products
$X_{l-i-1} \cdot Y = 8 \cdot X_{l-i-1,(3)} \cdot Y + 4 \cdot X_{l-i-1,(2)} \cdot Y + 2 \cdot X_{l-i-1,(1)} \cdot Y + X_{l-i-1,(0)} \cdot Y$
and partial residues $q_i \cdot A = 16 \cdot q'_i \cdot A + 4 \cdot q''_i \cdot A + q'''_i \cdot A$ where
$q'_i, q''_i, q'''_i \in \{0, 1, 2, 3\}$, then a 7-level CSA is required.

• b = 5: With partial products
$X_{l-i-1} \cdot Y = 16 \cdot X_{l-i-1,(4)} \cdot Y + 8 \cdot X_{l-i-1,(3)} \cdot Y + 4 \cdot X_{l-i-1,(2)} \cdot Y + 2 \cdot X_{l-i-1,(1)} \cdot Y + X_{l-i-1,(0)} \cdot Y$
and partial residues
$q_i \cdot A = 64 \cdot q'_i \cdot A + 16 \cdot q''_i \cdot A + 4 \cdot q'''_i \cdot A + q''''_i \cdot A$ where
$q'_i \in \{0, 1\}$ and $q''_i, q'''_i, q''''_i \in \{0, 1, 2, 3\}$, then a 9-level CSA is required.
On comparison with VICTOR we see that the DR method is good for $b \le 2$ but that it
needs more adders for $b > 2$. This is because the trade-off in the DR design has been
made in favour of the speed of $q_i$ calculation, as opposed to the VICTOR design where
the restriction of the range of $q_i$ and a more efficient multiple-encoding scheme have been
developed.
To assimilate $q_i$, from a (b + 2)-digit CSA vector, to binary form requires adder logic
with carries propagating through all b + 2 positions. Attempting to complete this assimilation
before $q_i \cdot A$ is needed may still not be possible. This is why Orton et al. suggested
using a pipelined approach for high-radix multiplication. Using this approach the addition
of the $q_i \cdot A$ multiples is delayed by p cycles such that the assimilation of $q_i$ has had
time to complete. The algorithm can be described as follows,
r(i) 26 " s(i) -}- X1_; _1 "Y 
s(i + 1) = Ir(i)12k+C + 2p' " qi-p "A 
where 
0 fori-p<0 
qi-p = 
ls(i-p)/2k}0, for0<i-p<l+p 
and also, since the modulus multiples are not subtracted immediately from the partial
result, c must be increased by pb bits such that
\[ c = (p + 2) b + 5 \]
The algorithm now requires $l + p$ iterations.
The architecture for implementing this multiplier is a concurrent pipelined design
where separate data-paths are used for the partial products and partial residues. For
example, whilst the pipeline design is not made explicit in [60], a possible architecture for
the case of $b = 4$ and $p = 3$ using multiples $X_{l-i-1}$ and $q_i$ such that
\[ X_{l-i-1} = 8 \cdot X_{l-i-1,(3)} + 4 \cdot X_{l-i-1,(2)} + 2 \cdot X_{l-i-1,(1)} + X_{l-i-1,(0)} \]
9i = 16"9, ý+4"qý'+qý"+, qýýn 
where $q'_i, q''_i \in \{0, 1, 2, 3\}$ use stored multiples of A and 3A, and
$q'''_i, q''''_i \in \{0, 1\}$ use only multiples of A, is shown in Figure 5.10.
[Figure 5.10: 3-stage pipelined, b = 4 DR multiplier. Separate columns of 3:2 adder arrays
for the partial products and the partial residues, with latches between pipeline stages.]
Using this concurrent pipelined technique it may well be possible to run the multiplier
at full speed (i.e. a speed determined solely by the delay inherent in each adder array and not
dependent upon quotient estimation delays) but the cost in hardware terms is large since
to achieve this 'break-even point' requires a large, complex multiplier. Also, with a large
number of pipeline stages the multiplier becomes inefficient because of the extra iterations
required to fill and empty the multiplier at the beginning and end of a multiplication
respectively. Nevertheless, the DR method is probably the most promising high-speed
method proposed so far among the 'standard' modular multiplication algorithms.
5.6 Other Proposed Systems 
A few companies have manufactured RSA chips without publishing details of the algorithms
and techniques used in their design. The most recent survey of known and working
RSA chips was conducted by Brickell in [61]. In this survey performance figures were
quoted for all designs and it should be noted that none of them outperformed the Cryptech
chip reviewed above.
Other radix-2 designs include [62] [63] [64] and [65]. The latter uses a super-fast experimental
150MHz silicon-on-insulator technology to achieve claimed rates of over 64kbps.
A simple radix-2 multiple subtraction algorithm is used but the precise details of its operation
are not given. It is thought that the algorithm is probably not very efficient, and the
speed of the device is directly attributable to the implementation technology. An efficient
algorithm implemented with this technology would no doubt yield much higher encryption
rates.
Other proposed designs based on 'standard' algorithms include Sedlak's [66] complex
'0'-skip multiplier/reducer incorporating barrel-shifters and complex control circuitry,
Iwamura's [50] systolic array modular multiplier using localized ROM-table lookup,
Kochanski's [67] processor array, and Prasanna's [68] highly parallel residue reduction and
selection method. Other designs, more suitable for short-word modular multipliers, have
been proposed by Alia [69] and Piestrak [70]. A small review of some of these (and other)
techniques can be found in [71]. They have not been included in any detail here because
they are thought to be either too complex or too inefficient for VLSI design.
5.7 Summary 
This chapter reviewed the current literature concerning the implementation of RSA
cryptosystems using standard modular multipliers. Descriptions of radix-2, radix-4 and general
radix-$2^b$ multipliers were included. In-depth descriptions of the VICTOR [59] and DR [60]
designs were given showing the trade-offs that have to be made in an effort to create a
fast and efficient design. The problem of quotient estimation was identified as the core
limiting factor in these designs. Designs with simple quotient estimation algorithms suffer
from an extended quotient range which leads to more adders being necessary to sum the
modulus multiple which in turn leads to increased multiplier circuitry and longer addition
times. Designs with complex quotient estimation circuitry are limited by the time taken
to calculate the next modulus multiple.
In Chapter 7 it will be shown that these limitations can be removed with optimised
Montgomery multipliers.
Chapter 6 
Montgomery Arithmetic 
In 1985 Peter Montgomery published a paper [72] showing that it is possible to perform 
modular arithmetic modulo N without having to perform divisions by N. The technique 
relies upon a non-standard representation of the residues modulo N and is explained in 
the following sections. 
6.1 Montgomery Multiplication 
As was shown in Chapters 2 and 3, the standard method of multiplying two positive
integers X and Y modulo N is,
\[ \langle X \cdot Y \rangle_N = X \cdot Y - Q \cdot N \qquad (6.1) \]
where
\[ Q = \left\lfloor \frac{X \cdot Y}{N} \right\rfloor \]
and so $\langle X \cdot Y \rangle_N \in [0, N-1]$.
The Montgomery product, P (an integer), of the integers X and Y can be expressed
as follows,
\[ P = \frac{X \cdot Y + Z \cdot N}{R} \qquad (6.2) \]
where R is a constant coprime to, and greater than, N. The integer Z can be viewed as
the number of N's that have to be added onto $X \cdot Y$ in order to make the sum
$X \cdot Y + Z \cdot N$ a multiple of R and thus make P an integer. We know that there is such
a multiple Z that can do this because R and N are coprime.
If N is an odd number (in RSA $N = p \cdot q$ is odd) then to satisfy the coprimeness
constraint we can make R a power of 2. Specifically if N is a k-bit integer
\[ 2^{k-1} < N < 2^k \]
then make
\[ R = 2^k \]
and thus we see that, in a binary system, division by R in Equation 6.2 is now a trivial
matter.
6.1.1 Calculating Z 
Since $X \cdot Y + Z \cdot N$ is a multiple of R, then
\[ X \cdot Y + Z \cdot N \equiv 0 \pmod{R} \]
and so
\[ Z \cdot N \equiv -X \cdot Y \pmod{R} \]
Now since $\gcd(R, N) = 1$, there exists an integer $N^{-1}$ such that
\[ N \cdot N^{-1} \equiv 1 \pmod{R} \]
therefore
\[ Z \equiv -X \cdot Y \cdot N^{-1} \pmod{R} \]
Since Z is in the numerator of Equation 6.2, we will take the least non-negative residue
of Z modulo R in order to limit the range of P, thus
\[ Z = \langle -X \cdot Y \cdot N^{-1} \rangle_R \]
or, with pre-computed constant $N' = \langle -N^{-1} \rangle_R$ dependent only upon N and R, then
\[ Z = \langle X \cdot Y \cdot N' \rangle_R \]
noting that, since $R = 2^k$, modular reduction modulo R is simple in a binary system.
Thus, if $X, Y < N$ and $Z < R$, the maximum value of P satisfies
\[ P_{MAX} < \frac{N \cdot N + R \cdot N}{R} \]
This bound is greater than N but less than 2N, therefore the range of P is certainly limited to
\[ P \in [0, 2N) \]
6.1.2 Interpreting P 
We have calculated the integer P, but what exactly does it mean? Well, from Equation
6.2 we have
\[ P \cdot R = X \cdot Y + Z \cdot N \]
therefore
\[ P \cdot R \equiv X \cdot Y \pmod{N} \]
and since $\gcd(R, N) = 1$ there exists an integer $R^{-1}$ such that
\[ R \cdot R^{-1} \equiv 1 \pmod{N} \]
therefore
\[ P \equiv X \cdot Y \cdot R^{-1} \pmod{N} \]
In other words P is the product of X and Y and a constant $R^{-1}$ modulo N. The constant
$R^{-1}$ depends only upon R and N and does not vary with X and Y. Thus, the Montgomery
method allows us to calculate a number that is related, by a constant, to the product of
two integers modulo a third integer, and whose range is much reduced from that of the
product. Therefore it should be possible to use Montgomery multiplications in algorithms
that require modular multiplication but that do not make any decisions based on the
results of these multiplications. A post-conversion operation can then be performed at the
end of the algorithm to remove the accumulated constants. As we saw in Section 3.3.2,
exponentiation algorithms fall into this category.
A pseudo-code interpretation of Montgomery multiplication with reduction of the product
P to the range [0, N-1] is shown in Figure 6.1. Thus, on completion of the code,
$P = \langle X \cdot Y \cdot R^{-1} \rangle_N$.

    t1 := X*Y
    t2 := <t1>_R
    Z  := <t2*N'>_R
    t3 := Z*N
    t4 := t1 + t3
    P  := t4/R
    IF P >= N
        P := P - N
    ENDIF

Figure 6.1: Montgomery modular multiplication.

Declaring reduction modulo R and division by R to be trivial operations, we see from
the pseudo-code that Montgomery multiplication involves three multiplications (one of them
modulo R) and a possible subtraction. Comparing this with standard modular reduction,
where a multiplication and a division are required, we see that Montgomery multiplication
effectively 'trades off' a division in favour of two multiplications. Since one of these
multiplications is modulo R, which when implemented using multi-precision operations is less
time-consuming than standard multiplication, and multiplication algorithms are inherently
faster than division algorithms anyway, the computational complexity of the Montgomery
approach may well be quite attractive. Selection of either standard or Montgomery algorithms
depends very much on the specifics of the implementation environment. As we shall see in the
remaining chapters, the Montgomery approach (with certain optimizations) can lead to very
efficient modular exponentiators in VLSI.
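The steps of Figure 6.1 can be sketched directly in software. The following is a minimal illustration, assuming $R = 2^k$ with N odd and $N < R$; the function name is hypothetical, and the constant $N'$ is recomputed on every call purely for brevity (in practice it would be pre-computed once per modulus). The three-argument `pow` with exponent -1 needs Python 3.8 or later.

```python
def montgomery_multiply(X, Y, N, k):
    """Montgomery product <X*Y*R^-1>_N for R = 2**k, following Figure 6.1.

    Assumes N is odd, N < 2**k and 0 <= X, Y < N.
    """
    r_mask = (1 << k) - 1
    n_prime = (-pow(N, -1, 1 << k)) & r_mask   # N' = <-N^-1>_R
    t1 = X * Y
    t2 = t1 & r_mask                           # <t1>_R: keep the low k bits
    Z = (t2 * n_prime) & r_mask                # <t2*N'>_R
    t4 = t1 + Z * N                            # now an exact multiple of R
    P = t4 >> k                                # division by R is a right shift
    if P >= N:                                 # P is in [0, 2N): one subtraction suffices
        P -= N
    return P
```

For instance, with N = 21 and k = 5 (R = 32), `montgomery_multiply(7, 11, 21, 5)` evaluates $7 \cdot 11 \cdot R^{-1} \bmod 21$, where $R^{-1} \equiv 2 \pmod{21}$.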
6.2 Montgomery Exponentiation 
Looking again at the modular exponentiation algorithms of Section 3.3.2 we have, for the
Right-to-Left algorithm, $s(0) = 1$, $t(0) = A$,
\[ s(i+1) = \langle s(i) \cdot (t(i))^{e_i} \rangle_N \]
\[ t(i+1) = \langle (t(i))^2 \rangle_N \]
Re-writing this in pseudo-code form gives Figure 6.2.

    S := 1, T := A
    FOR i := 0 TO k-1
        IF e_i = 1
            S := <S*T>_N
        ENDIF
        T := <T^2>_N
    ENDFOR

Figure 6.2: Right-to-Left modular exponentiation.

We see from this code that the $\langle S \cdot T \rangle_N$ operation is performed only when
$e_i = 1$ whilst the $\langle T^2 \rangle_N$ operation is performed on every iteration. If
Montgomery multiplications are used instead of standard modular multiplications then each
multiplication will introduce the constant factor of $R^{-1}$ into its result.
For example, the variable t(i) is modified as
\[ t(i+1) = \langle (t(i))^2 \cdot R^{-1} \rangle_N \]
Each squaring doubles the power of $R^{-1}$ already accumulated in t(i) and the Montgomery
multiplication then contributes one more factor. Thus, since in standard exponentiation
$t(i) = \langle A^{2^i} \rangle_N$, in Montgomery exponentiation t(i) will be
\[ t(i) = \langle A^{2^i} \cdot (R^{-1})^{2^i - 1} \rangle_N = \langle A^{2^i} \cdot R^{-(2^i - 1)} \rangle_N \]
Substituting this into the expression for evaluating s(i + 1) and also using Montgomery
multiplication in this calculation (which is performed only when $e_i = 1$) we have
\[ s(i+1) = \langle s(i) \cdot (A^{2^i} \cdot R^{-(2^i - 1)})^{e_i} \cdot (R^{-1})^{e_i} \rangle_N
          = \langle s(i) \cdot (A^{2^i} \cdot R^{-2^i})^{e_i} \rangle_N \]
which when multiplied over $i = 0 \ldots k-1$ gives
\[ s(k) = \langle A^E \cdot R^{-\sum_{i=0}^{k-1} e_i 2^i} \rangle_N = \langle A^E \cdot R^{-E} \rangle_N \]
Thus the result can be post-converted to $\langle A^E \rangle_N$ by Montgomery multiplying s(k)
by the 'constant' $\langle R^{E+1} \rangle_N$, where the +1 term in the exponent of $R^{E+1}$
negates the effect of the constant $R^{-1}$ introduced in the post-conversion Montgomery
multiplication. Thus the 'constant', $\langle R^{E+1} \rangle_N$, depends only on R, N and E. It
does not depend on A.
A similar post-conversion constant that depends only on R, N and E can be derived
for the Left-to-Right exponentiation method.
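The single post-conversion approach can be sketched as below. This is an illustrative model only: `mont` is a reference implementation that computes $\langle X \cdot Y \cdot R^{-1} \rangle_N$ through an explicit modular inverse rather than through Montgomery reduction, the function names are hypothetical, and $R = 2^k$ is assumed. In this sketch the loop is found to leave $\langle A^E \cdot R^{-E} \rangle_N$ in S, so the constant applied at the end is $\langle R^{E+1} \rangle_N$.

```python
def mont(X, Y, N, k):
    """Reference Montgomery product <X*Y*R^-1>_N with R = 2**k (N odd)."""
    r_inv = pow(1 << k, -1, N)
    return (X * Y * r_inv) % N

def rtl_exp_postconvert(A, E, N, k):
    """Right-to-Left exponentiation using only Montgomery products,
    corrected at the end by one multiplication with a constant."""
    S, T = 1, A % N
    e = E
    while e:
        if e & 1:
            S = mont(S, T, N, k)       # contributes A^(2^i) * R^(-2^i) overall
        T = mont(T, T, N, k)           # T becomes A^(2^(i+1)) * R^-(2^(i+1) - 1)
        e >>= 1
    const = pow(1 << k, E + 1, N)      # <R^(E+1)>_N: depends on R, N and E only
    return mont(S, const, N, k)
```

Because the constant must be recomputed whenever E changes, this form is mainly of interest when N and E are fixed.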
Although the above method is quite acceptable for modular exponentiations with N 
and E constant, another more general method is available that makes use of both pre- and 
post-conversions with all intermediate numbers represented in a special `N-residue' form. 
6.2.1 N-residue Representation 
In [72] Montgomery shows how numbers in the range [0, N-1] can be converted into
N-residue form where the modular operations of multiplication and addition can be performed
using Montgomery and standard methods respectively, such that the results of
these operations are also in N-residue form. At the end of processing the results can be
post-converted back to their normal residue representation modulo N.
Consider the constant
\[ H = \langle R^2 \rangle_N \]
that depends only on R and N. Substituting H for Y in the Montgomery multiplication
algorithm we get
\[ P = \langle X \cdot H \cdot R^{-1} \rangle_N = \langle X \cdot R^2 \cdot R^{-1} \rangle_N = \langle X \cdot R \rangle_N \]
So for two positive integers, X and Y, in the range [0, N-1] two new integers can be
calculated, X' and Y', via Montgomery multiplication such that
\[ X' = \langle X \cdot H \cdot R^{-1} \rangle_N = \langle X \cdot R \rangle_N \]
\[ Y' = \langle Y \cdot H \cdot R^{-1} \rangle_N = \langle Y \cdot R \rangle_N \]
Now, if the product of X and Y using standard modular multiplication is
$W = \langle X \cdot Y \rangle_N$, then converting this product as we did for X and Y gives
\[ W' = \langle W \cdot H \cdot R^{-1} \rangle_N = \langle X \cdot Y \cdot R \rangle_N \]
But now looking at the Montgomery product of X' and Y', we get
\[ \langle X' \cdot Y' \cdot R^{-1} \rangle_N = \langle (X \cdot R) \cdot (Y \cdot R) \cdot R^{-1} \rangle_N
   = \langle X \cdot Y \cdot R \rangle_N = W' \]
Thus we see that the Montgomery product of two 'converted' integers, X' and Y', is the
same as the 'conversion' of the standard modular product of the two integers X and Y. If
the integers X, Y and W are said to be in standard residue form, then their counterparts
X', Y' and W' are said to be in N-residue form.
The implication of the above is that,
1. conversion of an integer, X, from standard residue form to N-residue form is achieved
either by
(a) standard modular multiplication of X by R, or
(b) Montgomery multiplication of X by $H = \langle R^2 \rangle_N$,
2. the Montgomery product of any two integers in N-residue form yields a result that
is also in N-residue form,
3. the standard modular addition of any two integers in N-residue form yields a result
that is also in N-residue form, and
4. conversion of an integer, X', from N-residue form to standard residue form is
achieved either by
(a) standard modular multiplication of X' by $\langle R^{-1} \rangle_N$, or
(b) Montgomery multiplication of X' by 1.
At the beginning of the chapter it was stated that the Montgomery technique uses a
non-standard representation of residue classes. This can be understood as follows. Since
the conversion of an integer $X \in [0, N-1]$ to its N-residue form corresponds to the
calculation $X' = \langle X \cdot R \rangle_N$ then, because R and N are coprime, this operation
can be viewed as a re-ordering of the numbers 0...N-1. That is, if X is viewed as a variable
that can range over the interval [0, N-1], then letting X take on the numbers 0...N-1
in order leads to its N-residue representation X' taking on the numbers 0...N-1 in a
different order.
For example, with N = 21 and R = 32, Table 6.1 shows the standard and N-residue
representations of X as it ranges over [0, N-1].
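The re-ordering can be reproduced with a few lines of code. This is a small sketch with a hypothetical function name; it simply evaluates $\langle X \cdot R \rangle_N$ for each X and is not meant to suggest how the conversion is performed in hardware.

```python
def n_residue_table(N, R):
    """Return a list of (X, <X*R>_N) pairs for X = 0 .. N-1."""
    return [(X, (X * R) % N) for X in range(N)]

def is_permutation(table, N):
    """True if the N-residue column is a re-ordering of 0 .. N-1."""
    return sorted(residue for _, residue in table) == list(range(N))
```

For N = 21 and R = 32 the second column starts 0, 11, 1, 12, 2, ... and `is_permutation` holds because R and N are coprime.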
     X    std residue <X>_21    N-residue <32X>_21
     0           0                     0
     1           1                    11
     2           2                     1
     3           3                    12
     4           4                     2
     5           5                    13
     6           6                     3
     7           7                    14
     8           8                     4
     9           9                    15
    10          10                     5
    11          11                    16
    12          12                     6
    13          13                    17
    14          14                     7
    15          15                    18
    16          16                     8
    17          17                    19
    18          18                     9
    19          19                    20
    20          20                    10

Table 6.1: Standard and N-residue representations for N = 21 and R = 32.

6.2.2 N-residue Exponentiation
Using the notation that X' refers to the integer X in N-residue form, and denoting the
Montgomery product of two integers X and Y modulo N with the constant R as
\[ M_{R,N}(X, Y) = \langle X \cdot Y \cdot R^{-1} \rangle_N \]
then the conversion of an integer, X, to N-residue form is
\[ X' = M_{R,N}(X, H) \]
and conversion of X' from N-residue form to standard residue form is
\[ X = M_{R,N}(X', 1) \]
As before, H is the pre-computed constant $H = \langle R^2 \rangle_N$.
Thus, the Right-to-Left and Left-to-Right modular exponentiation algorithms of
Section 3.3.2 can be modified to use Montgomery multiplications by including pre- and
post-conversions into and out of N-residue form respectively (see for example [73]).
Algorithm 13 (Right-to-Left Montgomery N-residue Exponentiation) Given an
integer A, a positive k-bit exponent E, modulus N, constant R and pre-computed constant
$H = \langle R^2 \rangle_N$, then calculating $\langle A^E \rangle_N$ is a 3-stage process. Setting
\[ s(0) = 1, \quad t(0) = A \]
then
1. Pre-conversion
\[ s'(0) = M_{R,N}(s(0), H) \]
\[ t'(0) = M_{R,N}(t(0), H) \]
2. Processing (for $i = 0 \ldots k-1$)
\[ s'(i+1) = \begin{cases} s'(i) & \text{if } e_i = 0 \\ M_{R,N}(s'(i), t'(i)) & \text{if } e_i = 1 \end{cases} \]
\[ t'(i+1) = M_{R,N}(t'(i), t'(i)) \]
3. Post-conversion
\[ s(k) = M_{R,N}(s'(k), 1) \]
results in $s(k) = \langle A^E \rangle_N$.
A pseudo-code version of this algorithm is shown in Figure 6.3.

    S := 1, T := A
    S' := M_R,N(S, H)
    T' := M_R,N(T, H)
    FOR i := 0 TO k-1
        IF e_i = 1
            S' := M_R,N(S', T')
        ENDIF
        T' := M_R,N(T', T')
    ENDFOR
    S := M_R,N(S', 1)

Figure 6.3: R-to-L Montgomery N-residue exponentiation.
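Figure 6.3 translates almost line for line into the following sketch. The names `mont_mul` and `rtl_nresidue_exp` are hypothetical, $R = 2^k$ with N odd and $N < R$ is assumed, and $N'$ and H are recomputed inside the functions only to keep the example self-contained (they would normally be pre-computed once per modulus).

```python
def mont_mul(X, Y, N, k):
    """Montgomery product M_{R,N}(X, Y) = <X*Y*R^-1>_N for R = 2**k."""
    mask = (1 << k) - 1
    n_prime = (-pow(N, -1, 1 << k)) & mask   # N' = <-N^-1>_R (Python 3.8+)
    t = X * Y
    z = ((t & mask) * n_prime) & mask        # Z = <X*Y*N'>_R
    p = (t + z * N) >> k                     # exact division by R
    return p - N if p >= N else p

def rtl_nresidue_exp(A, E, N, k):
    """Right-to-Left exponentiation with pre- and post-conversion."""
    H = pow(1 << k, 2, N)                    # H = <R^2>_N
    S = mont_mul(1, H, N, k)                 # S' : 1 in N-residue form
    T = mont_mul(A % N, H, N, k)             # T' : A in N-residue form
    for i in range(E.bit_length()):          # scan e_0, e_1, ... upwards
        if (E >> i) & 1:
            S = mont_mul(S, T, N, k)
        T = mont_mul(T, T, N, k)
    return mont_mul(S, 1, N, k)              # back to standard residue form
```

Unlike the single post-conversion constant approach, nothing here depends on E apart from the loop itself, so the same pre-computed H serves every exponent.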
Algorithm 14 (Left-to-Right Montgomery N-residue Exponentiation) Given an
integer A, a positive k-bit exponent E, modulus N, constant R and pre-computed constant
$H = \langle R^2 \rangle_N$, then calculating $\langle A^E \rangle_N$ is a 3-stage process. Setting
\[ s(0) = 1 \]
then
1. Pre-conversion
\[ A' = M_{R,N}(A, H) \]
\[ s'(0) = M_{R,N}(s(0), H) \]
2. Processing (for $i = 0 \ldots k-1$)
\[ s'(i+1) = \begin{cases} M_{R,N}(s'(i), s'(i)) & \text{if } e_{k-i-1} = 0 \\
   M_{R,N}(M_{R,N}(s'(i), s'(i)), A') & \text{if } e_{k-i-1} = 1 \end{cases} \]
3. Post-conversion
\[ s(k) = M_{R,N}(s'(k), 1) \]
results in $s(k) = \langle A^E \rangle_N$.
A pseudo-code version of this algorithm is shown in Figure 6.4.

    S := 1
    A' := M_R,N(A, H)
    S' := M_R,N(S, H)
    FOR i := 0 TO k-1
        S' := M_R,N(S', S')
        IF e_(k-i-1) = 1
            S' := M_R,N(S', A')
        ENDIF
    ENDFOR
    S := M_R,N(S', 1)

Figure 6.4: L-to-R Montgomery N-residue exponentiation.
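A corresponding sketch of Figure 6.4 is given below, under the same assumptions as before ($R = 2^k$, N odd, $N < R$). The function names are hypothetical and the Montgomery product is repeated here so that the block is self-contained.

```python
def mont_mul(X, Y, N, k):
    """Montgomery product M_{R,N}(X, Y) = <X*Y*R^-1>_N for R = 2**k."""
    mask = (1 << k) - 1
    n_prime = (-pow(N, -1, 1 << k)) & mask   # N' = <-N^-1>_R (Python 3.8+)
    t = X * Y
    z = ((t & mask) * n_prime) & mask
    p = (t + z * N) >> k
    return p - N if p >= N else p

def ltr_nresidue_exp(A, E, N, k):
    """Left-to-Right exponentiation with pre- and post-conversion."""
    H = pow(1 << k, 2, N)                    # H = <R^2>_N
    A_res = mont_mul(A % N, H, N, k)         # A' : A in N-residue form
    S = mont_mul(1, H, N, k)                 # S' : 1 in N-residue form
    nbits = E.bit_length()
    for i in range(nbits):
        S = mont_mul(S, S, N, k)             # square on every iteration
        if (E >> (nbits - i - 1)) & 1:       # exponent bit e_(k-i-1)
            S = mont_mul(S, A_res, N, k)
    return mont_mul(S, 1, N, k)              # back to standard residue form
```

The Left-to-Right form keeps only one running variable, at the cost of the two Montgomery products in an iteration being sequentially dependent.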
6.3 Iterative Montgomery Multiplication 
Algorithms that perform Montgomery multiplication in an iterative fashion, similar to the
algorithms of Section 3.3.1, exist. These algorithms consume the multiplier, X, b bits at
a time for radix-$2^b$ multiplication.
We will look first at the b = 1 radix-2 multiplication algorithm, and then generalize
this to the b-bit radix-$2^b$ algorithm.
6.3.1 Radix-2 Montgomery Multiplication 
The following algorithm can be found in Montgomery's original paper [72]. 
Algorithm 15 (Radix-2 Montgomery Multiplication) Given a constant $R = 2^k$,
odd modulus $N < R$, and two positive integers $X, Y \in [0, N-1]$, then setting
\[ s(0) = 0 \]
and letting
\[ s(i+1) = \frac{s(i) + x_i \cdot Y + z_i \cdot N}{2} \]
with
\[ z_i = \langle s(i) + x_i \cdot Y \rangle_2 \]
will give
\[ s(k) = \frac{X \cdot Y + Z \cdot N}{R} \]
where $Z = \langle X \cdot Y \cdot N' \rangle_R$ for $N' = \langle -N^{-1} \rangle_R$. Also, the
quantities $z_i \in \{0, 1\}$ for $i = 0 \ldots k-1$ form the k bits of Z in its bit-vector
representation $Z = [z_{k-1}, z_{k-2}, \ldots, z_0]$.
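Proof: From the above we have
\[ 2 \cdot s(i+1) = s(i) + x_i \cdot Y + z_i \cdot N \]
For fixed k this gives
\[ 2 \cdot s(k) = s(k-1) + x_{k-1} \cdot Y + z_{k-1} \cdot N
   = \frac{s(k-2) + x_{k-2} \cdot Y + z_{k-2} \cdot N}{2} + x_{k-1} \cdot Y + z_{k-1} \cdot N \]
therefore
\[ 2^2 \cdot s(k) = s(k-2) + x_{k-2} \cdot Y + z_{k-2} \cdot N + 2 \cdot x_{k-1} \cdot Y + 2 \cdot z_{k-1} \cdot N \]
and, by extension, for variable a
\[ 2^a \cdot s(k) = s(k-a) + \sum_{i=0}^{a-1} 2^{a-i-1} \cdot x_{k-i-1} \cdot Y
   + \sum_{i=0}^{a-1} 2^{a-i-1} \cdot z_{k-i-1} \cdot N \]
For a = k this gives
\[ 2^k \cdot s(k) = s(0) + \sum_{i=0}^{k-1} 2^{k-i-1} \cdot x_{k-i-1} \cdot Y
   + \sum_{i=0}^{k-1} 2^{k-i-1} \cdot z_{k-i-1} \cdot N \]
Changing the order of summation and noting that $s(0) = 0$, $2^k = R$ and
\[ \sum_{i=0}^{k-1} 2^i \cdot x_i = X \]
then
\[ R \cdot s(k) = X \cdot Y + \sum_{i=0}^{k-1} 2^i \cdot z_i \cdot N \]
Assuming here that
\[ \sum_{i=0}^{k-1} 2^i \cdot z_i = Z \]
(it will be proven as a special case in the general radix-$2^b$ algorithm's proof), then
\[ s(k) = \frac{X \cdot Y + Z \cdot N}{R} \qquad \Box \]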
Note that the above algorithm proceeds in a Right-to-Left direction along the multiplier 
X. This is in contrast to all the standard modular multiplication algorithms reviewed in 
Chapter 5 which proceed in a Left-to-Right direction. 
A visual interpretation of the way in which this algorithm works can be understood as
follows. First, re-write the main loop of the algorithm as
\[ r(i) = s(i) + x_i \cdot Y \]
\[ s(i+1) = \frac{r(i) + z_i \cdot N}{2} \]
where now $z_i = \langle r(i) \rangle_2$. We can think of the calculation of r(i) as being the
'multiplication' part of each iteration, and the subsequent calculation of s(i + 1) as being
the 'reduction' part of each iteration. Now writing r(i) as a (k + 1)-bit bit-vector
\[ r(i) = [r^{(i)}_k, r^{(i)}_{k-1}, \ldots, r^{(i)}_0] \]
then the goal of the reduction part of each iteration is simply to stop r(i) from growing
in magnitude. It tries to do this by dividing r(i) by 2 (i.e. a right-shift of r(i) by one bit
position) but if r(i) is an odd number (i.e. $r^{(i)}_0 = 1$ so that $z_i = 1$) then it cannot do
this and keep the partial result s(i + 1) an integer. To overcome this problem r(i) must first
be converted to an even number so that $r^{(i)}_0 = 0$. Since we are working modulo N the
only number that we can add to r(i) without affecting the result is N. Also, since N is an
odd number then its addition to the odd r(i) will yield an even number and so enable the
reduction (division by 2) to take place. This happens on every iteration of the algorithm.
An alternative way of performing Montgomery multiplication is to completely separate
the multiplication and reduction phases of the algorithm. Thus if $T = X \cdot Y$ is the 2k-bit
product of two k-bit positive integers, such that
\[ T = [t_{2k-1}, t_{2k-2}, \ldots, t_0] \]
then, in a manner directly analogous to the above algorithm, T can be Montgomery
reduced to the range [0, 2N) by proceeding in the direction $i = 0 \ldots k-1$ and, whenever
bit $t_i$ of the current value of T is 1, adding $2^i \cdot N$ onto T. At the end of this process
the lower k bits of T will all be zero. If T is then right-shifted by k bit positions we will have
$T \equiv X \cdot Y \cdot 2^{-k} \equiv X \cdot Y \cdot R^{-1} \pmod{N}$.
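Both forms of the radix-2 algorithm can be sketched in a few lines. This is an illustrative software model with hypothetical function names, assuming $R = 2^k$, N odd and $X, Y < N$; no final conditional subtraction is included, so results lie in [0, 2N).

```python
def montgomery_radix2(X, Y, N, k):
    """Interleaved radix-2 Montgomery multiplication: returns (X*Y + Z*N) / 2**k."""
    s = 0
    for i in range(k):
        x_i = (X >> i) & 1        # multiplier bits are consumed right to left
        r = s + x_i * Y           # 'multiplication' part of the iteration
        z_i = r & 1               # z_i = <r(i)>_2: is r(i) odd?
        s = (r + z_i * N) >> 1    # 'reduction' part: make even, then halve
    return s

def montgomery_reduce(T, N, k):
    """Separated reduction: clear the low k bits of T by adding multiples of N."""
    for i in range(k):
        if (T >> i) & 1:          # bit i of the *current* T
            T += N << i           # adding 2**i * N clears it, since N is odd
    return T >> k
```

Applying `montgomery_reduce` to the full product `X * Y` yields the same value as `montgomery_radix2(X, Y, N, k)`, since in both cases the multiple of N added is the unique $Z < R$ that makes the sum divisible by R.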
6.3.2 Radix-$2^b$ Montgomery Multiplication
The radix-2 Montgomery multiplication algorithm can be generalized to the radix-$2^b$ case
as follows.
Algorithm 16 (Radix-$2^b$ Montgomery Multiplication) Given a constant $R = 2^{lb}$,
odd modulus $N < R$, and two positive integers $X, Y \in [0, N-1]$, expressing X as the
l-digit vector $X = [X_{l-1}, X_{l-2}, \ldots, X_0]$ with $X_i \in [0, 2^b - 1]$, then setting
\[ s(0) = 0 \]
and letting
\[ s(i+1) = \frac{s(i) + X_i \cdot Y + Z_i \cdot N}{2^b} \]
with
\[ Z_i = \langle (s(i) + X_i \cdot Y) \cdot N' \rangle_{2^b} \]
will give
\[ s(l) = \frac{X \cdot Y + Z \cdot N}{R} \]
where $Z = \langle X \cdot Y \cdot N' \rangle_R$ for $N' = \langle -N^{-1} \rangle_R$. Also, the
quantities $Z_i \in [0, 2^b - 1]$ for $i = 0 \ldots l-1$ form the l digits of Z in its vector
representation $Z = [Z_{l-1}, Z_{l-2}, \ldots, Z_0]$.
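Proof: Similar to the radix-2 proof we have, for variable a
\[ 2^{ab} \cdot s(l) = s(l-a) + \sum_{i=0}^{a-1} 2^{(a-i-1)b} \cdot X_{l-i-1} \cdot Y
   + \sum_{i=0}^{a-1} 2^{(a-i-1)b} \cdot Z_{l-i-1} \cdot N \]
which for $a = l$, $s(0) = 0$, $2^{lb} = R$ and
\[ \sum_{i=0}^{l-1} 2^{ib} \cdot X_i = X \]
gives
\[ R \cdot s(l) = X \cdot Y + \sum_{i=0}^{l-1} 2^{ib} \cdot Z_i \cdot N \]
Now to show that
\[ \sum_{i=0}^{l-1} 2^{ib} \cdot Z_i = Z \]
we can do the following. From the definition of the algorithm we have
\[ Z_i \equiv (s(i) + X_i \cdot Y) \cdot N' \pmod{2^b} \]
then setting $i = l-1$ and recursing down into s(i) we get
\[ Z_{l-1} \equiv (s(l-1) + X_{l-1} \cdot Y) \cdot N' \pmod{2^b} \]
\[ Z_{l-1} \equiv \left( \frac{s(l-2) + X_{l-2} \cdot Y + Z_{l-2} \cdot N}{2^b} + X_{l-1} \cdot Y \right) \cdot N' \pmod{2^b} \]
therefore
\[ 2^b \cdot Z_{l-1} \equiv (s(l-2) + X_{l-2} \cdot Y + Z_{l-2} \cdot N + 2^b \cdot X_{l-1} \cdot Y) \cdot N' \pmod{2^{2b}} \]
and, in general for variable a
\[ 2^{ab} \cdot Z_{l-1} \equiv \left( s(l-a-1) + \sum_{i=0}^{a} 2^{(a-i)b} \cdot X_{l-i-1} \cdot Y
   + \sum_{i=1}^{a} 2^{(a-i)b} \cdot Z_{l-i-1} \cdot N \right) \cdot N' \pmod{2^{(a+1)b}} \]
Setting $a = l-1$ gives
\[ 2^{(l-1)b} \cdot Z_{l-1} \equiv \left( s(0) + \sum_{i=0}^{l-1} 2^{(l-i-1)b} \cdot X_{l-i-1} \cdot Y
   + \sum_{i=1}^{l-1} 2^{(l-i-1)b} \cdot Z_{l-i-1} \cdot N \right) \cdot N' \pmod{2^{lb}} \]
Since $s(0) = 0$, $2^{lb} = R$ and changing the order of summation gives
\[ 2^{(l-1)b} \cdot Z_{l-1} \equiv \left( \sum_{i=0}^{l-1} 2^{ib} \cdot X_i \cdot Y
   + \sum_{i=0}^{l-2} 2^{ib} \cdot Z_i \cdot N \right) \cdot N' \pmod{R} \]
Now, since $N \cdot N' \equiv -N \cdot N^{-1} \equiv -1 \pmod{R}$ then
\[ 2^{(l-1)b} \cdot Z_{l-1} \equiv X \cdot Y \cdot N' - \sum_{i=0}^{l-2} 2^{ib} \cdot Z_i \pmod{R} \]
and therefore
\[ \sum_{i=0}^{l-1} 2^{ib} \cdot Z_i \equiv X \cdot Y \cdot N' \pmod{R} \]
Since $Z_i \in [0, 2^b - 1]$ the left-hand-side of the above equation is in the range
$[0, 2^{lb} - 1]$ and so
\[ \sum_{i=0}^{l-1} 2^{ib} \cdot Z_i = \langle X \cdot Y \cdot N' \rangle_R = Z \]
Returning to the main algorithm we therefore have
\[ s(l) = \frac{X \cdot Y + Z \cdot N}{R} \qquad \Box \]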
For the special case of b = 1, that is radix-2 multiplication, we have
\[ z_i = \langle (s(i) + x_i \cdot Y) \cdot N' \rangle_2 \]
but since $N \cdot N' = -1 + q \cdot R$ for some q, then with $R = 2^{lb}$ the right-hand-side of
the above equation is an odd number, and therefore so is the left-hand-side. Since N is odd
this means that N' also must be odd and so $\langle N' \rangle_2 = 1$. Thus the equation for
$z_i$ will simplify to
\[ z_i = \langle s(i) + x_i \cdot Y \rangle_2 \]
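The radix-$2^b$ iteration can be modelled as follows. This is a sketch with hypothetical names, assuming $R = 2^{lb}$, N odd, $N < R$ and $X, Y < N$. It uses the observation that, because $Z_i$ is reduced modulo $2^b$, only the low b bits of $N'$ are needed, and these equal $\langle -N^{-1} \rangle_{2^b}$.

```python
def montgomery_radix_2b(X, Y, N, l, b):
    """Radix-2**b Montgomery multiplication: returns (X*Y + Z*N) / 2**(l*b)."""
    mask = (1 << b) - 1
    n_prime_low = (-pow(N, -1, 1 << b)) & mask   # low b bits of N' (Python 3.8+)
    s = 0
    for i in range(l):
        X_i = (X >> (i * b)) & mask              # i-th radix-2**b digit of X
        r = s + X_i * Y
        Z_i = (r * n_prime_low) & mask           # Z_i = <(s(i) + X_i*Y) * N'>_{2^b}
        s = (r + Z_i * N) >> b                   # exact division by 2**b
    return s                                     # lies in [0, 2N)
```

With b = 1 the constant `n_prime_low` is always 1 and the loop reduces to the radix-2 algorithm, in agreement with the special case above.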
6.4 Montgomery Multiplier Implementations 
The Montgomery multiplication technique has been implemented in a few systems that 
perform RSA cryptography. These are as follows. 
6.4.1 Multi-precision Implementations 
In [37] Dusse et al. show how the Montgomery technique can be implemented on a standard
DSP chip. Multi-precision operations are based on the 24-bit word-size of the chip
with Montgomery reduction performed simultaneously with the convolution-like method
of multiplying large integers (see [74]).
The Montgomery method is also efficient for general software implementations of the
RSA scheme [75].
6.4.2 Systolic Array Implementations 
A systolic array is an n-dimensional array that consists of interconnected processing
elements (PEs) such that each element only communicates with its immediate neighbours.
For example a 1-dimensional systolic array is shown in Figure 6.5.

[Figure 6.5: A 1-dimensional systolic array. Operands enter a chain of processing elements
(PE) on the left; results leave on the right.]

As can be seen from the diagram, the operands enter the array on the left and the results
exit the array on the right. Each processing element performs some simple operation on the
data entering it from the left, and sends the result to the next element on the right. All
processing elements can be active simultaneously. The combined processing power of all these
simple processing elements performs the desired transformation on the input operands. The
systolic array gets its name from the movement of the data through the array which appears
to be a kind of 'pumping' action, similar to the pumping of the human heart, and thus
systolic.
In [76], Even coupled a systolic multiplier with an array of his own design that performs
Montgomery reduction. Assuming that the output of the multiplier array feeds directly
into the reducer array, then the combination array takes 2k clock cycles to Montgomery
multiply two k-bit integers modulo a k-bit modulus.
Sauerbrey, in [77], coupled a b-bit multiplier array and b-bit Montgomery reducer array
so that each processing element operates on b bits of data at a time. The array takes 2l
clock cycles to complete the multiplication.
In [78], Iwamura developed a single systolic array that could perform Montgomery
multiplication with each processing element operating on b bits of data. The array takes
2l clock cycles to complete.
The main advantage of using systolic arrays in VLSI design is that, although a global
clock signal must still be distributed to each processing element, all other communications
to/from processing elements are local (i.e. to/from neighbouring cells). This can simplify
the routing and buffering requirements for a VLSI design. Specific to the problem of
long-integer multiplier design and in particular with the parallel bit-slice approach that
has been developed in this thesis, it means that the quantities $X_i$ and $Z_i$ do not have to
be distributed to all processing elements simultaneously, thus simplifying this aspect of
the design. The disadvantages of systolic arrays for long-integer multiplication however
are twofold. Firstly, the complexity of each processing element tends to be greater than
that of a bit-slice element, and secondly, systolic arrays usually require twice the number
of clock cycles to `pump' the result out of the end of the array. These, when combined,
tend to make systolic array implementations of modular multipliers less efficient than their
parallel bit-slice counterparts.
6.4.3 A Pipelined Implementation 
In [75] and [79] Shand et al. show how the Montgomery method can be implemented using
pipelining. The method is similar to the pipeline technique used in the Diminished Radix
design of Orton et al. in [60] (reviewed in Chapter 5), in that it is designed to stop the
computation time of the $Z_i$ calculation from affecting performance.
In [79] Shand delays the addition of the $Z_i \cdot N$ multiple until the (i+p)-th cycle, where
p is the level of pipelining used. This is done by modifying the algorithm and generating
a slightly different $Z_i$, as follows.
With s(0) = 0, then let 
s(i+l) _ lI8(i)+Xi - 
YI 
+s(i-p)+Z; -n'N 2 2(P+1)b 
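where $s(i-p) = 0$ for $i - p < 0$, and
$$Z_{i-p} = \begin{cases} 0 & \text{for } i - p < 0 \\ (s(i-p) \cdot N')_{2^{(p+1)b}} & \text{for } i - p \ge 0 \end{cases}$$
then
$$s(l + p) = X \cdot Y \cdot R^{-1} \pmod{N}$$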
Note that the multiple $Z_i$ is generated such that, on the i-th cycle when s(i) is `sampled'
and the calculation of $(s(i) \cdot N')_{2^{(p+1)b}}$ is started, then on the subsequent cycles
$i+1, i+2, \ldots, i+p-1$ the lower bits of s(i) are all `predictable' in the sense that the intermediate
additions of $Z_{i-p+1} \cdot N, Z_{i-p+2} \cdot N, \ldots, Z_{i-1} \cdot N$ will not affect the group of b bits that,
on the (i+p)-th iteration with the addition of $Z_i \cdot N$, are set to zero in the intended
manner by that multiple. This is why $Z_i$ must be a (p+1)b-bit quantity as opposed to
the unpipelined b-bit quantity. But by delaying the addition of $Z_i \cdot N$ then, it is claimed,
enough time will be available to calculate this extended multiple.
The above method has been implemented by Shand et al. on a circuit board known
as a Programmable Active Memory (PAM). The PAM consists of an array of Field
Programmable Gate Array (FPGA) chips mounted on a slot-in computer card such that it is
suitable for use as a general-purpose co-processor engine for the host machine. Configuration
of the PAM for any processing task can be achieved by first designing the appropriate
FPGA circuitry (with suitable ECAD tools), generating the FPGA setup data from the
design and then downloading this data to the board. The host computer and co-processor
can then operate together to perform whatever specialized task is required.
Using such a PAM the fastest reported RSA processor to date has been constructed. 
When the prime components of the modulus $N = p \cdot q$ are known, then using the Chinese
Remainder Theorem (which gives a speed-up factor of approx. 400% in hardware - see 
Section 2.6.5), it can encipher data at a rate of 600kbps for 512-bit moduli and 165kbps 
for 1024-bit moduli. Shand et al. claim that the design can be shrunk down into a single 
gate-array device for even higher performance; but this has yet to be done. Although they 
do not state the number of pipeline stages that were used in the PAM design, it is unlikely 
that a large number of such stages can be implemented in a single gate-array device. 
6.5 Summary 
This chapter introduced Montgomery modular arithmetic. The technique and its meaning 
were explained and basic algorithms for Montgomery multiplication and exponentiation 
were studied. A review of some of the proposed hardware implementation schemes was 
given. 
The algorithms of this chapter together with the RSD architecture of Chapter 4 serve 
as the basis from which new optimised Montgomery multipliers are developed in the next 
chapter. 
Chapter 7 
Optimized Montgomery 
Multiplication 
In this chapter new, optimized versions of Montgomery multipliers will be presented that 
allow the multiplier adder array to operate at full-speed. That is, the determination of $Z_i$
(the number of N's to be added on each cycle) can be performed completely in parallel
with the operation of the adder array.
Optimized multipliers are designed for the cases of
• b = 1 with a simple CSA type architecture,
• b = 2 using recoding techniques and an RSD architecture, and
• b > 2 general high-radix multipliers with recoding and RSD architectures.
Very recent work by Eldridge and Walter in [80], [81] and [82] has shown how to build 
optimized designs for non-recoded multipliers with a CSA architecture. The methods 
developed here go further than this, and use efficient multiplier recoding techniques with 
an RSD architecture. The methods of Eldridge and Walter and those shown here have 
been developed concurrently and the work presented in this chapter is independent of their 
own. 
7.1 Radix-2 Multiplication 
In Section 6.3 the following general radix-$2^b$ algorithm was shown to perform Montgomery
multiplication. It will be called AMMM (Additive Montgomery Modular Multiplication).
With $s(0) = 0$, then letting
$$r(i) = s(i) + X_i \cdot Y$$
$$s(i+1) = \frac{r(i) + Z_i \cdot N}{2^b}$$
where $Z_i = (r(i) \cdot N')_{2^b}$ will give
$$s(k) = \frac{X \cdot Y + Z \cdot N}{R}$$
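The AMMM recurrence can be sketched in software as follows. This is only an illustrative model, not the hardware design: the function name `ammm` and the choice of `k` as the number of radix-$2^b$ digits (so that $R = 2^{kb}$) are assumptions made for the example.

```python
def ammm(X, Y, N, b, k):
    """AMMM sketch: returns s(k) = (X*Y + Z*N) / R for R = 2**(k*b).

    k is assumed here to be the number of radix-2**b digits of X.
    """
    base = 1 << b
    R = 1 << (k * b)
    n_prime = (-pow(N, -1, R)) % R          # N' = (-N^-1) mod R (Python 3.8+)
    s = 0                                   # s(0) = 0
    for i in range(k):
        x_i = (X >> (i * b)) & (base - 1)   # i-th radix-2**b digit of X
        r = s + x_i * Y                     # r(i) = s(i) + X_i * Y
        z_i = (r * n_prime) % base          # Z_i = (r(i) * N') mod 2**b
        s = (r + z_i * N) >> b              # the low b bits are zero, so this is exact
    return s
```

Note that `z_i` depends on `r`, which is why the hardware must wait for the first adder level before $Z_i \cdot N$ can be formed.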
Implementing this algorithm directly in hardware would lead to the design of Figure 7.1.
This diagram shows how the least-significant b bits of r(i) - the output from the top level
of adders - are used to generate $Z_i$. The value of $Z_i$ thus generated is then used to construct
$Z_i \cdot N$, which is then added to r(i) by the lower level of adders. The problem with this
approach is that the generation of $Z_i$ and $Z_i \cdot N$ takes place between the upper and lower
adders, and so, although r(i) is made available at the input of the lower adder immediately
after it emerges from the upper adder, the presentation of $Z_i \cdot N$ to the adder inputs is
delayed by the amount of time it takes to generate it, and thus the adder array cannot
operate at full speed. In order to optimize the design, the generation of $Z_i$ and $Z_i \cdot N$ must
be taken out of the critical path of the adder array. This is achieved in the next section.
Figure 7.1: Implementation of AMMM.
7.1.1 The DAMMM algorithm 
The DAMMM (Delayed Additive Montgomery Modular Multiplication) algorithm allows
$Z_i \cdot N$ to be generated in parallel with the operation of the top level of adders in the adder
array. It can be specified as follows.
Algorithm 17 (DAMMM) Given a constant $R = 2^{lb}$, odd modulus $N < R$, constant
$N' = (-N^{-1})_R$, and two positive integers $X, Y \in [0, N-1]$, expressing X as the l-digit
vector $X = [X_{l-1}, X_{l-2}, \ldots, X_0]$ with $X_i \in [0, 2^b - 1]$, then setting
$$s(0) = 0$$
and letting
$$r(i) = s(i) + 2^b \cdot X_i \cdot Y$$
$$s(i+1) = \frac{r(i) + Z_i \cdot N}{2^b}$$
with
$$Z_i = (s(i) \cdot N')_{2^b}$$
will give
$$s(l+1) = \frac{X \cdot Y + Z \cdot N}{R}$$
with $s(l+1) \in [0, 2N)$.
Proof: Noting that $X_i \cdot Y$ has been left-shifted by b bits (pre-multiplied by $2^b$) in the
calculation of r(i), therefore $r(i) \equiv s(i) \pmod{2^b}$ and so $Z_i$ can be calculated directly
from s(i).
On the first iteration $Z_0 = 0$ and because it takes an extra cycle for $X_i \cdot Y$ to be shifted
down to the least-significant bits of the accumulator, then in general, $Z_i$ in the DAMMM
algorithm will be the same as $Z_{i-1}$ in AMMM for $i = 1 \ldots l$. Therefore
$$\sum_{i=0}^{l} 2^{ib} \cdot Z_i = 2^b \cdot Z$$
where $Z = (X \cdot Y \cdot N')_R$, and since $X_l = 0$ we have
$$s(l+1) = \frac{2^b \cdot X \cdot Y + 2^b \cdot Z \cdot N}{2^b R} = \frac{X \cdot Y + Z \cdot N}{R}$$
∎
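As an informal check of the algorithm and its proof, the DAMMM recurrence can be modelled in a few lines. The sketch below is an assumption-laden software model (the name `dammm` and the test values are hypothetical); it differs from AMMM only in that $Z_i$ is taken from s(i) and the partial product is pre-shifted by b bits, at the cost of one extra iteration with $X_l = 0$.

```python
def dammm(X, Y, N, b, l):
    """DAMMM sketch: l+1 iterations, Z_i derived from s(i) rather than r(i)."""
    base = 1 << b
    R = 1 << (l * b)
    n_prime = (-pow(N, -1, R)) % R              # N' = (-N^-1) mod R (Python 3.8+)
    s = 0                                       # s(0) = 0
    for i in range(l + 1):                      # extra iteration: X_l = 0
        x_i = (X >> (i * b)) & (base - 1) if i < l else 0
        z_i = (s * n_prime) % base              # Z_i = (s(i) * N') mod 2**b
        r = s + ((x_i * Y) << b)                # r(i) = s(i) + 2**b * X_i * Y
        s = (r + z_i * N) >> b                  # exact division by 2**b
    return s
```

For example, with N = 239, b = 2 and l = 4 (so R = 256), `dammm(100, 200, 239, 2, 4)` gives 108, which equals $(X \cdot Y + Z \cdot N)/R$ with $Z = (X \cdot Y \cdot N')_R = 32$.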
An implementation of the DAMMM algorithm is shown in Figure 7.2.
Figure 7.2: Implementation of DAMMM.
Note that the lower b bits of the accumulator are all zero. This is essentially the function of adding
$Z_i \cdot N$ - to ensure that the b least-significant bits of the accumulator are all zero prior to
performing the b-bit right-shift operation embedded in the feedback of the accumulator's
outputs to the top-level adder's inputs.
The diagram also shows that, with the partial product $X_i \cdot Y$ shifted left by b bits,
the value of $Z_i$ no longer depends directly on the output of the top adder. In accordance
with Algorithm 17 $Z_i$ is derived from s(i) which, by retracing the feedback loop of s(i) to
its origin at the output of the accumulator (taking into account the b-bit right-shift built
into the loop), can be seen to be equivalent to sampling s(i) at the next to lowest b-bit
block of accumulator output. Thus $Z_i$ and $Z_i \cdot N$ generation can be performed in parallel
with both the generation of $X_i \cdot Y$ and the operation of the top level adder.
A radix-2 implementation of DAMMM with a CSA architecture is shown in Figure 7.3.
Figure 7.3: Radix-2 DAMMM CSA array.
For N a k-bit positive integer, the adder inputs are defined as follows.
$$W = x_i \cdot Y$$
where $x_i \in \{0, 1\}$ for iterations $i = 0 \ldots k-1$ and $x_k = 0$ on the final iteration.
$$G = [g_{k-1}, g_{k-2}, \ldots, g_0] = z_i \cdot N$$
where $z_i \in \{0, 1\}$ for iterations $i = 0 \ldots k$. The feedback loop maps register outputs to
adder inputs with a right-shift as,
$$u_j \leftarrow s_{j+1}, \qquad u'_j \leftarrow s'_{j+1}$$
for $j = 0 \ldots k-1$. The $s_{j+1}$ and $s'_{j+1}$ outputs correspond to the sum and carry outputs
respectively of the CSA accumulator.
The generation of the multiples $x_i \cdot Y$ and $z_i \cdot N$ is shown in Figure 7.4.
Figure 7.4: Radix-2 DAMMM $x_i \cdot Y$ and $z_i \cdot N$ generation.
Since each $x_i$ is merely the i-th bit of X, and assuming that X is stored in a shift-register that is clocked
by the same clock that operates the accumulator, then immediately after the active edge
of this clock the quantities $x_i$ and s(i) are available for use. A diagram showing the delay-path
of signals around the DAMMM circuit is shown in Figure 7.5.
Figure 7.5: Radix-2 DAMMM delay-path ($x_i$ -> AND -> $x_i Y$ -> FA -> r(i) -> FA -> s(i+1), in parallel with s(i) -> XOR -> $z_i$ -> AND -> $z_i N$).
Since the generation of $z_i$ is performed by an XOR gate and, with reference to the circuit diagram of the full
adder in Figure 4.1, the delay through an XOR gate is less than that through a full adder,
we can see that the multiple $z_i \cdot N$ will be presented to the second level of adders in the
CSA adder array just before r(i) is made available. Thus we have succeeded in removing
the calculation of $z_i \cdot N$ from the critical path of the adder array. The 2-level CSA array
can now operate at full speed.
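The behaviour of the radix-2 CSA array can be mimicked with a bit-parallel software model. The sketch below is an assumption, not a description of the actual circuit: `csa` models one level of full adders acting on whole integers, and `dammm_radix2_csa` keeps the accumulator as separate sum and carry words. Because N' is odd, $z_i$ is simply the least-significant bit of s(i), i.e. the XOR of the two least-significant accumulator bits.

```python
def csa(a, b, c):
    """One carry-save level: returns (sum, carry) with a + b + c == sum + carry."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def dammm_radix2_csa(X, Y, N, l):
    """Radix-2 DAMMM with a carry-save accumulator, R = 2**l, l+1 iterations."""
    S = C = 0                                  # sum and carry words, s(i) = S + C
    for i in range(l + 1):
        x_i = (X >> i) & 1 if i < l else 0     # x_l = 0 on the final iteration
        z_i = (S ^ C) & 1                      # single XOR gate: LSB of s(i)
        S1, C1 = csa(S, C, (x_i * Y) << 1)     # top adder level: add 2 * x_i * Y
        S2, C2 = csa(S1, C1, z_i * N)          # lower adder level: add z_i * N
        S, C = S2 >> 1, C2 >> 1                # both words are even, so the shift is exact
    return S + C                               # final carry-propagated addition
```

With N = 239 and l = 8 (R = 256) the call `dammm_radix2_csa(100, 200, 239, 8)` returns 108, the same value as the word-level recurrence.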
7.1.2 Result Range 
Although we have just created a fast radix-2 Montgomery multiplier, when used as part 
of a high-speed exponentiator the circuit will suffer from a major disadvantage. This is 
that the inputs are restricted to the range [0, N- 1] but that the result will occupy the 
range [0,2N). Thus it will not be possible to use the result of one multiplication as one of 
the operands in the next multiplication. To do this we have to ensure that the operand 
input ranges and result output ranges are the same. 
Consider the size of N limited such that
$$N < \frac{R}{4}$$
and inputs X and Y limited to the extended range
$$X, Y \in [0, R/2)$$
with Z as before in the range [0, R). Then we have
$$\frac{X \cdot Y + Z \cdot N}{R} < \frac{\frac{R}{2} \cdot \frac{R}{2} + R \cdot \frac{R}{4}}{R} = \frac{R}{2}$$
and thus the range of the result is the same as that of X and Y.
Since, at the beginning of an exponentiation, inputs to the multiplication routines will 
be in the range [0, N) which by the above restriction on N is contained in the range [0, R/2), 
this means that all of the intermediate results of multiplications during an exponentiation 
can be kept in the range [0, R/2), and the multiplications can proceed one after another 
without any intermediate correction steps (such as a comparison and possible subtraction 
of N from a multiplication result) being necessary. Although a carry-propagated addition 
of the S and S' vectors is still required at the end of a multiplication, the removal of 
any comparison and possible subtraction of N considerably simplifies the circuitry of the 
multiplier. The final result of the exponentiation is the only one that has to be reduced 
to [0, N- 1]. 
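The range argument can be checked numerically with a small sketch. The helper names `mont_product` and `chained_results` and the values N = 239, R = 1024 are hypothetical; the model simply evaluates $(X \cdot Y + Z \cdot N)/R$ and feeds each result back in as the next operand without any correction step.

```python
def mont_product(X, Y, N, R):
    """Montgomery product (X*Y + Z*N) / R with Z = (X*Y*N') mod R."""
    n_prime = (-pow(N, -1, R)) % R      # N' = (-N^-1) mod R (Python 3.8+)
    Z = (X * Y * n_prime) % R
    return (X * Y + Z * N) // R         # exact division

def chained_results(x, N, R, steps):
    """Repeated Montgomery squaring with no intermediate reduction modulo N."""
    out = []
    for _ in range(steps):
        x = mont_product(x, x, N, R)
        out.append(x)
    return out
```

Provided $4 \cdot N < R$ and the starting value lies in [0, R/2), every value returned by `chained_results` stays in [0, R/2), so it can be reused directly as an operand.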
7.1.3 Radix-2 DAMMM Performance Summary 
Using the condition $4 \cdot N < R$ from the previous section, we see that for N a k-bit modulus
we need $R \ge 2^{k+2}$. Since the DAMMM algorithm requires an extra iteration to complete,
we can summarize the radix-2 CSA implementation as follows.
Number of iterations = $k + 3$
Iteration time = $\Delta_{AND} + 2 \cdot \Delta_{FA} + \Delta_{FF}$
Number of bitslices = $k + 4$
Bitslice complexity = $5 \cdot \Omega_{FF} + 2 \cdot \Omega_{AND} + 2 \cdot \Omega_{FA}$
7.2 Radix-4 Multiplication 
Stating the DAMMM algorithm for radix-4 multiplication we have
$$r(i) = s(i) + 4 \cdot X_i \cdot Y$$
$$s(i+1) = \frac{r(i) + Z_i \cdot N}{4}$$
with $X_i, Z_i \in \{0, 1, 2, 3\}$ and where $Z_i = (s(i) \cdot N')_4$.
Applying string recoding to this algorithm we would first recode $X = [X_{l-1}, X_{l-2}, \ldots, X_0]$
using the technique of Section 4.3 so that
$$X = \sum_{i \ge 0} 2^{2i} \cdot x(i)$$
where $x(i) \in \{-2, -1, 0, 1, 2\}$. This can be accomplished either by working on one digit at
a time in either the left-to-right or right-to-left directions or by calculating all digits at once 
in parallel. This freedom in implementing the recoding technique is exactly what makes it 
attractive for use in fast, non-modular, parallel multipliers and also in some of the left-to- 
right modular multipliers of Chapter 5. The reason why it is used to recode the multiplier 
X in some of these left-to-right modular multipliers is because the recoding process can be 
performed one digit at a time in the left-to-right direction as the multiplier is `consumed' 
during multiplication. This leads to a more efficient multiplier design - compared to the 
parallel recoding approach - for long-integer applications. The reason why this recoding 
technique can be used in this way is because the recoding process does not generate any 
carries while it is being applied. Each digit of the second-order recoded vector depends 
only on groups of three adjacent bits of X. If carries were generated then, since carries 
propagate from right-to-left along a bit-vector, a carry generated at the i-th step of a left- 
to-right recoding process would necessitate the re-evaluation of all the previously generated 
i-1 digits. If however, a right-to-left multiplication technique were being used, then the 
generation of any carries during the recoding process would not matter. A carry generated 
when calculating the i-th digit of the recoded vector would simply be saved and used in the 
next step to generate the (i + 1)-th digit. Since Montgomery multiplication is essentially a 
right-to-left process, we can take advantage of this and develop a recoding technique that 
does indeed generate carries, but offers a reduced range for the recoded digits. 
7.2.1 Recoding X 
The following recoding technique differs from the technique of Section 4.3 in that, as
explained above, it is only suitable for right-to-left multiplication algorithms. It is based
around the simple observation that 3 = 4 - 1 and 2 = 4 - 2. The idea is that the digits of
a radix-4 vector each have an associated weight. That is, for $X = [X_{l-1}, X_{l-2}, \ldots, X_0]$ a
radix-4 vector, then its value is given by
$$X = \sum_{i=0}^{l-1} 2^{2i} \cdot X_i$$
where $2^{2i}$ is the weight associated with the i-th digit of the vector. This can be viewed as
the i-th digit having weight 4-times that of its right-hand neighbour the (i-1)-th digit.
The recoding procedure is then to look at each digit $X_i$ for $i = 0 \ldots l-1$ and set the
recoded digit x(i) to either 0, 1, -2 or -1 accordingly as to whether $X_i$ is either 0, 1, 2 or 3.
In the latter two cases, by the previous observations, a 1 must be added to the next digit $X_{i+1}$
to compensate for the negative values assumed by x(i) in this iteration. This is the
carry. The next iteration must therefore take note of this carry and, calling it c(i), leads
to the following
$$x(i) = \begin{cases} 0 & \text{for } X_i + c(i-1) = 0 \\ 1 & \text{for } X_i + c(i-1) = 1 \\ -2 & \text{for } X_i + c(i-1) = 2 \\ -1 & \text{for } X_i + c(i-1) = 3 \\ 0 & \text{for } X_i + c(i-1) = 4 \end{cases}$$
for $i = 0 \ldots l$ with $c(-1) = 0$ and
$$c(i) = \begin{cases} 0 & \text{for } X_i + c(i-1) = 0 \\ 0 & \text{for } X_i + c(i-1) = 1 \\ 1 & \text{for } X_i + c(i-1) = 2 \\ 1 & \text{for } X_i + c(i-1) = 3 \\ 1 & \text{for } X_i + c(i-1) = 4 \end{cases}$$
for $i = 0 \ldots l-1$.
Thus we have a right-to-left recoding system that converts
$$X = \sum_{i=0}^{l-1} 2^{2i} \cdot X_i$$
with $X_i \in \{0, 1, 2, 3\}$ to
$$X = \sum_{i=0}^{l} 2^{2i} \cdot x(i)$$
with $x(i) \in \{-2, -1, 0, 1\}$. Note the extra l-th digit in the recoded vector.
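The recoding rules above translate directly into a short routine. The following is a minimal sketch for non-negative X only, with a hypothetical function name; the two lookup tuples are the x(i) and c(i) tables indexed by $X_i + c(i-1)$, and the missing digit $X_l$ is taken as zero.

```python
def recode_radix4(X, l):
    """Right-to-left recoding of an l-digit radix-4 integer X >= 0.

    Returns l+1 digits x(0)..x(l), each in {-2, -1, 0, 1}.
    """
    digits = []
    carry = 0                                      # c(-1) = 0
    for i in range(l + 1):
        X_i = (X >> (2 * i)) & 3 if i < l else 0   # no digit X_l: treat it as 0
        t = X_i + carry                            # t is in 0..4
        digits.append((0, 1, -2, -1, 0)[t])        # x(i) table
        carry = (0, 0, 1, 1, 1)[t]                 # c(i) table
    return digits
```

For example, X = 27 has radix-4 digits [1, 2, 3] (most significant first); `recode_radix4(27, 3)` gives `[-1, -1, -2, 1]` (least significant first), and indeed -1 - 4 - 32 + 64 = 27.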
This recoding mechanism can be implemented as shown in Figure 7.6.
Figure 7.6: Recoding $X_i \in \{0, 1, 2, 3\}$ to $x(i) \in \{-2, -1, 0, 1\}$.
Note that the bits of X are assumed to be stored in a 2-bit wide shift register which is clocked by the
same clock as the flip-flop shown in the diagram. The circuit converts
$$X_i = 2 \cdot X_{i,(1)} + X_{i,(0)}$$
to the recoded digits
$$x(i) = -2 \cdot x_{(1)}(i) + x_{(0)}(i)$$
The $x(i) \cdot Y$ multiple generation circuit is shown in Figure 7.7.
Figure 7.7: Multiple $x(i) \cdot Y$ generation.
From this diagram we see that the $x(i) \cdot Y$ generation circuitry has been reduced from the MUX/AND configuration
of Figure 4.16 to just a single MUX. Also, the requirement for routing a special zero signal
to each bitslice has been removed.
The only problem with using the recoded value, x(i), instead of the original, $X_i$, is
that the signals $x_{(1)}(i)$ and $x_{(0)}(i)$ are not available on the active edge of the shift-register
clock. The delay of $2 \cdot \Delta_{HA}$ for these signals to become available would increase the cycle
time of the multiplier. To get around this problem we can pipeline the generation of x(i).
A simple 1-stage pipeline is all that is needed and this is shown in Figure 7.8.
Figure 7.8: Pipelined generation of x(i).
The clock shown in the diagram is the same clock that is used by the X shift-register and by the
multiplier's accumulator. Thus the x(i) signals are able to proceed directly to the $x(i) \cdot Y$
multiplexors on the active edge of the clock and no overhead is added to the cycle time of
the multiplication by using this recoding technique. Indeed, the delay of one AND gate
has been removed from the critical path.
7.2.2 Recoding Z 
On reviewing the theory of Montgomery multiplication we see that the only requirement
on $Z_i$ is that
$$r(i) + Z_i \cdot N \equiv 0 \pmod{2^b}$$
therefore $Z_i$ may take on any set of values that covers the range $[0, 2^b - 1]$ modulo $2^b$. This
means that for b = 2 we may map
$$Z_i \in \{0, 1, 2, 3\}$$
onto
$$z(i) \in \{0, 1, -2, -1\}$$
the only difference this will make is that, if
$$Z = \sum_{i=0}^{l-1} 2^{2i} \cdot z(i)$$
then Z may take on positive and negative values so that
$$Z \in (-R, R)$$
This in turn means that
$$P = \frac{X \cdot Y + Z \cdot N}{R}$$
will also assume positive and negative values. However, because we are using signed-digit
arithmetic this is not a problem. If N is limited as before such that
$$4 \cdot N < R$$
but now X and Y are limited such that
$$X, Y \in (-R/2, R/2)$$
then by a similar reasoning as was used before, we have
$$P \in (-R/2, R/2)$$
In other words, we allow the results of multiplications to be negative as well as positive
but their ranges will not diverge during an exponentiation. As we saw in Section 4.5 we
can easily perform signed-number multiplications using an RSD architecture.
7.2.3 An RSD Montgomery Multiplier 
For X and Y signed-numbers in 2's complement form, then from the above we have
$$|X|, |Y| < \frac{R}{2}$$
If $R = 2^{lb}$ then X and Y can be expressed as lb-bit 2's complement bit-vectors.
For
$$W = x(i) \cdot Y$$
with $x(i) \in \{-2, -1, 0, 1\}$ then W can be expressed as a (lb+1)-bit 2's complement vector,
so that
$$W = -2^{lb} \cdot w_{lb} + \sum_{j=0}^{lb-1} 2^j \cdot w_j + w_{-1}$$
where $w_{-1}$ is the `add one' term for when a negative multiple of Y is being created by
the `invert and add one' technique of 2's complement negation. This allows us to create
negative multiples of Y without having to perform the carry propagation implied by the
`add one' instruction. (See the recoded multiplier of Section 4.5).
Similarly
$$G = z(i) \cdot N$$
with $z(i) \in \{-2, -1, 0, 1\}$ so that
$$G = -2^{lb} \cdot g_{lb} + \sum_{j=0}^{lb-1} 2^j \cdot g_j + g_{-1}$$
Using the DAMMM algorithm with an RSD architecture leads to the circuit shown
in Figure 7.9. Note that at the top end of the adder array the G vector has been
sign-extended by two bit positions.
Figure 7.9: Radix-4 recoded DAMMM with RSD architecture.
A divide-by-4 is embedded in the feedback loop such that
$$u^-_j \leftarrow s^-_{j+2}$$
$$u^+_j \leftarrow s^+_{j+2}$$
for $j = 0 \ldots lb+1$ with $u^-_{lb+2} = u^+_{lb+2} = 0$.
Generating x(i) 
To generate $x(i) \in \{-2, -1, 0, 1\}$ from the lb-bit 2's complement vector X then, since
the DAMMM algorithm requires l+1 iterations, the first step is to sign-extend X to an
(l+1)b-bit 2's complement vector X'. Since b = 2 this means creating two bits $x'_{2l+1}$ and
$x'_{2l}$ so that
$$X' = -2^{2l+1} \cdot x'_{2l+1} + \sum_{j=0}^{2l} 2^j \cdot x'_j$$
has the same value as X. From Section 4.4.1 we see that
$$x'_{2l+1} = x'_{2l} = x_{2l-1}$$
and
$$x'_j = x_j$$
for $j = 0 \ldots 2l-1$. Therefore X' may be viewed as an (l+1)-digit vector $[X'_l, X'_{l-1}, \ldots, X'_0]$
with $X'_i \in \{0, 1, 2, 3\}$ for $i = 0 \ldots l-1$ but with $X'_l \in \{-1, 0\}$ because $X'_{l,(1)} = X'_{l,(0)}$.
On applying the right-to-left recoding technique we see that, when creating the l-th
recoded digit,
$$X'_l + c(l-1) \in \{-1, 0, 1\}$$
Since this can be expressed by $x(l) \in \{-2, -1, 0, 1\}$ this means that no carry should be
generated by the last recoded digit.
What this means in terms of the generating circuit of Figure 7.6 is that, as long as
we sign-extend X to an (l+1)-digit vector, we can use the circuit unchanged by simply
ignoring any (l+2)-th digit that it might generate.
Generating z(i) 
To generate z(i) we first have to calculate $Z_i = (s(i) \cdot N')_4$ then perform the mapping
$\{0, 1, 2, 3\} \to \{0, 1, -2, -1\}$. For
$$z(i) = -2 \cdot z_{(1)}(i) + z_{(0)}(i)$$
this is shown in Table 7.1. A circuit to generate z(i) is shown in Figure 7.10.

(s(i))_4   (N')_4   Z_i   z(i)   z_(1)(i)   z_(0)(i)
   0          1      0      0       0          0
   1          1      1      1       0          1
   2          1      2     -2       1          0
   3          1      3     -1       1          1
   0          3      0      0       0          0
   1          3      3     -1       1          1
   2          3      2     -2       1          0
   3          3      1      1       0          1
Table 7.1: Generating z(i)

Figure 7.10: Circuit for generating z(i).
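The mapping of Table 7.1 can be reproduced with a short routine, given here only as an illustrative sketch (the function name `recode_z` is hypothetical). It also makes visible that the two bits $z_{(1)}(i)$ and $z_{(0)}(i)$ are simply the two bits of $Z_i$, reinterpreted with weight -2 on the upper bit.

```python
def recode_z(s, n_prime):
    """Return (Z_i, z(i), z_(1)(i), z_(0)(i)) for radix 4, as in Table 7.1."""
    Z = (s * n_prime) % 4            # Z_i = (s(i) * N') mod 4
    z = (0, 1, -2, -1)[Z]            # mapping {0,1,2,3} -> {0,1,-2,-1}
    z1, z0 = Z >> 1, Z & 1           # z(i) = -2*z1 + z0
    return Z, z, z1, z0

# Rows of the table in the order (s mod 4, N' mod 4, Z_i, z(i), z1, z0).
rows = [(s, n) + recode_z(s, n) for n in (1, 3) for s in range(4)]
```

Each entry of `rows` corresponds to one line of the table; for instance the row for $(s(i))_4 = 1$, $(N')_4 = 3$ is `(1, 3, 3, -1, 1, 1)`.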
Assembling all of these elements into a radix-4 Montgomery multiplier will produce
the critical path delay diagram of Figure 7.11.
Figure 7.11: Radix-4 DAMMM delay path.
However, from this diagram we can clearly see that the time taken to calculate z(i) is too long. The adder array will not be able to
operate at full-speed. Other optimization methods are needed to speed up the calculation
of z(i), and they are presented in the following sections.
7.2.4 The MMDAMMM Algorithm 
The MMDAMMM (Modified Modulus Delayed Additive Montgomery Modular Multiplication)
algorithm enables us to remove the multiplicative part of the $Z_i$ calculation. The
basic idea is to create a new modulus, M, from the original modulus, N, by multiplying
N by a small constant.
Note that, since M is just a simple multiple of N, the new modulus can be used in
place of the old modulus in modular arithmetic calculations. This can be seen as follows.
Supposing $M = T \cdot N$ then
$$\begin{aligned} a &\equiv b \pmod{M} \\ &= b + q \cdot M \\ &= b + q \cdot (T \cdot N) \\ &= b + (q \cdot T) \cdot N \\ &\equiv b \pmod{N} \end{aligned}$$
Thus to perform, say, an exponentiation modulo N, we could perform all of the exponen- 
tiation's multiplications modulo M, and then, at the end of processing, reduce the final 
result to its least non-negative residue modulo N. 
As was suggested by Walter in [80], the new modulus M can be created such that
$(M')_{2^b} = \pm 1$. In the case of Walter's systolic design this did not offer any significant
improvements, but with the bitslice architecture presented in this thesis a considerable
speedup can be achieved since the calculation of $Z_i$ is much simplified. For the sake of
definiteness, we will create M such that $(M')_{2^b} = +1$.
Consider a modified modulus M, such that
$$M = N \cdot (N')_{2^b}$$
Now
$$N' \equiv -N^{-1} \pmod{R}$$
and since $R = 2^{lb}$ therefore
$$N' \equiv -N^{-1} \pmod{2^b}$$
thus
$$M \equiv N \cdot (-N^{-1}) \equiv -1 \pmod{2^b}$$
Similarly
$$M' \equiv -M^{-1} \pmod{R}$$
therefore
$$M' \equiv -M^{-1} \pmod{2^b}$$
and substituting $M \equiv -1 \pmod{2^b}$ we have
$$M' \equiv -(-1)^{-1} \equiv 1 \pmod{2^b}$$
and so therefore $(M')_{2^b} = 1$ for any odd modulus N as required. Note that to create M
we only had to multiply N by a b-bit quantity. Thus the growth in size from N to M is
minimal.
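The construction of M is easy to confirm numerically. The sketch below (function name and test ranges are hypothetical) builds M from N for a given digit size b and returns M together with $M' = (-M^{-1})_R$, so that the congruences $M \equiv -1$ and $M' \equiv 1 \pmod{2^b}$ can be inspected.

```python
def modified_modulus(N, b, R):
    """Return (M, M') with M = N * (N' mod 2**b) and M' = (-M^-1) mod R."""
    n_prime = (-pow(N, -1, R)) % R       # N' = (-N^-1) mod R (Python 3.8+)
    M = N * (n_prime % (1 << b))         # multiply N by a b-bit quantity
    m_prime = (-pow(M, -1, R)) % R       # M is odd, so the inverse exists
    return M, m_prime
```

For example, with N = 239, b = 2 and R = 256 we have N' = 241, so $(N')_4 = 1$ and M = N; with N = 237 we have $(N')_4 = 3$ and M = 711, which is again congruent to -1 modulo 4.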
Using the modified modulus M in the radix-4 DAMMM algorithm together with the
convergence restriction that $4 \cdot M < R$ and the multiplier recoding techniques developed
in the previous section, leads to the radix-4 recoded MMDAMMM algorithm as follows.
Algorithm 18 (Radix-4 recoded MMDAMMM) Given a constant $R = 2^{2l}$, odd
modulus N such that
$$2^4 \cdot N < R$$
a constant $N' = (-N^{-1})_R$, and a modified modulus M such that
$$M = N \cdot (N')_4$$
Then two integers $X, Y \in (-R/2, R/2)$, with X expressed as the (l+1)-digit vector
$X = [X_l, X_{l-1}, X_{l-2}, \ldots, X_0]$ with $X_i \in \{0, 1, 2, 3\}$ for $i = 0 \ldots l-1$ and $X_l \in \{-1, 0\}$, can be
Montgomery multiplied by setting
$$s(0) = 0$$
and letting
$$r(i) = s(i) + 4 \cdot x(i) \cdot Y$$
$$s(i+1) = \frac{r(i) + z(i) \cdot M}{4}$$
with $x(i) \in \{-2, -1, 0, 1\}$ being the recoded digits of X, and $z(i) \in \{-2, -1, 0, 1\}$ such that
$$z(i) \equiv s(i) \pmod{4}$$
will give
$$s(l+1) \equiv X \cdot Y \cdot R^{-1} \pmod{M}$$
with $s(l+1) \in (-R/2, R/2)$.
Proof: The condition $2^4 \cdot N < R$ means that the modified modulus M will satisfy $4 \cdot M < R$.
The rest follows from the proof of DAMMM together with the convergence restrictions
and recoding techniques given in the previous sections ∎
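Algorithm 18 can be exercised with a word-level software model. The sketch below is an assumption rather than a description of the RSD hardware: ordinary signed Python integers stand in for the redundant accumulator, the function name is hypothetical, and the sign digit $X_l$ is obtained from the sign of X.

```python
def mmdammm_radix4(X, Y, N, l):
    """Radix-4 recoded MMDAMMM sketch; returns (s(l+1), M, R)."""
    R = 1 << (2 * l)
    assert N % 2 == 1 and 16 * N < R
    n_prime = (-pow(N, -1, R)) % R           # N' = (-N^-1) mod R (Python 3.8+)
    M = N * (n_prime % 4)                    # modified modulus, M = -1 (mod 4)
    # Right-to-left recoding of the sign-extended X into l+1 digits.
    digits, c = [], 0
    for i in range(l):
        t = ((X >> (2 * i)) & 3) + c         # 2's complement digit plus carry
        digits.append((0, 1, -2, -1, 0)[t])
        c = (0, 0, 1, 1, 1)[t]
    digits.append((-1 if X < 0 else 0) + c)  # X_l in {-1, 0}; no further carry
    s = 0                                    # s(0) = 0
    for x_i in digits:
        z_i = (0, 1, -2, -1)[s % 4]          # z(i) = s(i) (mod 4), no multiply
        r = s + 4 * x_i * Y
        assert (r + z_i * M) % 4 == 0        # low two bits are cleared
        s = (r + z_i * M) // 4
    return s, M, R
```

The final result may be negative; reducing it to its least non-negative residue modulo N (for example `s % N`) gives $X \cdot Y \cdot R^{-1} \bmod N$.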
7.2.5 Generating z(i) under MMDAMMM 
Referring back to Figure 7.10 we see that, using MMDAMMM, it is possible to remove the
XOR gate from the z(i) generation circuit. Furthermore, since the carry-in to the right-hand
full adder is a `1', and the carry-out of the left-hand adder is not used, we can simplify
the logic of this circuit to that of Figure 7.12.
Figure 7.12: Optimized generation of z(i) using MMDAMMM.
Upon comparing the circuit of Figure 7.12 with that of the full adder in Figure 4.1 we see that the maximum delay path through
these circuits is the same, namely $2 \cdot \Delta_{XOR}$.
Using the MMDAMMM algorithm together with the z(i) generation circuit of Figure 7.12,
the x(i) generation circuit of Figure 7.8 and the RSD architecture of Figure 7.9
gives the critical delay path diagram of Figure 7.13.
Figure 7.13: Delay path for MMDAMMM.
From this diagram we can see that the generation of z(i) does not add to the iteration time of the multiplier. The adder array
can operate at full-speed.
7.2.6 Radix-4 recoded MMDAMMM Performance Summary 
Using the condition $2^{b+2} \cdot N < R$ then, for N a k-bit modulus and using b = 2, we need
$R \ge 2^{k+4}$. Since the MMDAMMM algorithm requires an extra iteration to complete, we
can summarize the radix-4 recoded RSD implementation as follows.
Number of iterations = $\lceil k/2 \rceil + 3$
Iteration time = $\Delta_{MUX} + 2 \cdot \Delta_{FA} + \Delta_{FF}$
Number of bitslices = $k + 8$
Bitslice complexity = $5 \cdot \Omega_{FF} + 2 \cdot \Omega_{MUX} + 2 \cdot \Omega_{FA}$
Comparing the radix-4 recoded design with an unrecoded design would show that the
hardware requirements have been much reduced since the former requires only two levels
of adders against the latter's four levels. Also, the iteration time is reduced because signal
propagation through a MUX and two FAs is less than that through an AND and four FAs.
On comparison with the radix-2 DAMMM CSA design we see that the iteration time
has been increased only by the difference $\Delta_{MUX} - \Delta_{AND}$. Compared with the delays of
$2 \cdot \Delta_{FA}$ and $\Delta_{FF}$ for a typical implementation technology, the difference is small. At the
same time the number of iterations required has almost halved. The bitslice complexity has
been increased by the difference $2 \cdot (\Omega_{MUX} - \Omega_{AND})$. Compared to the rest of the bitslice
circuitry, this increase is small. The number of bitslices is roughly the same. Although
an accurate comparison of the two designs would require details of the implementation
technology, we can make the qualitative statement that the recoded MMDAMMM RSD
design will go almost twice as fast as the DAMMM CSA design with only slightly more
hardware.
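To put numbers on the comparison, the iteration counts from the two performance summaries can be evaluated for a given modulus length. This is a small illustrative calculation only; the function names are hypothetical and the per-iteration delay difference is ignored.

```python
def iterations_radix2(k):
    """Radix-2 DAMMM CSA design: k + 3 iterations for a k-bit modulus."""
    return k + 3

def iterations_radix4(k):
    """Radix-4 recoded MMDAMMM RSD design: ceil(k/2) + 3 iterations."""
    return -(-k // 2) + 3          # -(-k // 2) is the integer ceiling of k/2

ratio_512 = iterations_radix2(512) / iterations_radix4(512)
```

For a 512-bit modulus the counts are 515 and 259 iterations respectively, a ratio of just under 2, in line with the qualitative statement above.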
7.3 Radix-$2^b$ Multiplication
In this section we will examine general radix-$2^b$ recoded multipliers with, for efficient
recoding, b an even number.
The general architecture for a b-bit MMDAMMM recoded RSD multiplier is shown in
Figure 7.14. Each adder sums a 2-bit multiple, {-2, -1, 0, 1}, of either Y or M.
Figure 7.14: MMDAMMM recoded RSD multiplier.
That is, each recoded digit x(i) of X and z(i) of Z is composed of `sub-digits' $x_j(i)$ and $z_j(i)$
respectively such that
$$x(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot x_j(i)$$
and
$$z(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot z_j(i)$$
where
$$x_j(i), z_j(i) \in \{-2, -1, 0, 1\}$$
For the adder array to operate at full-speed then
• the j-th sub-digit of x(i) must be ready in time $j \cdot \Delta_{FA}$ after the accumulator's active
clock edge, and
• the j-th sub-digit of z(i) must be ready in time $(j + b/2) \cdot \Delta_{FA}$ after the clock edge.
7.3.1 Generating x(i) 
We can recode a 2's complement vector $X = [X_l, X_{l-1}, \ldots, X_0]$ with $X_i \in [0, 2^b - 1]$ for
$i = 0 \ldots l-1$ and $X_l \in [-2^{b-1}, 2^{b-1} - 1]$ to
$$X = \sum_{i=0}^{l} 2^{ib} \cdot x(i)$$
where
$$x(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot x_j(i)$$
with $x_j(i) \in \{-2, -1, 0, 1\}$ expressed as
$$x_j(i) = -2 \cdot x_{j,(1)}(i) + x_{j,(0)}(i)$$
with the right-to-left recoding scheme discussed in Section 7.2.3. The circuitry required
to perform this recoding is essentially the same as that used in Figure 7.6 for the b = 2
case, but expanded out by b/2 steps. This is shown in Figure 7.15.
Figure 7.15: General x(i) recoding.
A 1-stage pipeline, similar to that of Figure 7.8, can also be used in the generation circuitry so that x(i) is
available directly after the clock edge.
7.3.2 Generating z(i) 
The quantity z(i) can be generated in one of two ways, either by lookup-table or by
calculation. The former approach offers simplicity but its performance may be
technology-dependent. That is, it is difficult to say whether z(i) can be generated fast enough to allow
full-speed adder array operation without examining technology-specific issues. The latter
approach however, allows us to derive timing values for z(i) in terms of the primitives that
are used to construct the rest of the multiplier circuit. This is the technique that will be
used in this section.
In a manner similar to the radix-4 case, z(i) is generated first by calculating
$$ Z_i = (s(i))_{2^b} $$
and then by recoding $Z_i$ to z(i) such that
$$ z(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot z_j(i) $$
satisfies $z(i) \equiv Z_i \pmod{2^b}$. The recoding process is essentially the same as that used
to generate x(i) with the exception that carries from one iteration to the next do not
have to be saved. Thus the z(i) generation circuitry is as shown in Figure 7.16.
Figure 7.16: Radix-$2^b$ z(i) generation.
Upon examining this circuit we see that, because the carry-in signals to the FA chain and to
the HA/OR chain are preset to `1' and `0' respectively, the $z_0(i)$ generation block can be
simplified. This simplification takes the form of removing the HA/OR block (because the
carry-in of the right-most block is zero, no carry-out will be generated and so the
$Z_{i,(1)}$ and $Z_{i,(0)}$ signals simply pass straight through to the $z_{0,(1)}(i)$ and $z_{0,(0)}(i)$ signals)
and optimizing the FA block such that the resultant circuit is basically the same as that of
Figure 7.12 but with the addition of a carry-out signal. The circuit is shown in Figure 7.17.
Figure 7.17: Radix-$2^b$ $z_0(i)$ generation.
From this diagram we can see that the maximum propagation delay of signals through
this circuit is bounded by $2 \cdot \Delta_{XOR}$.
Looking again at Figure 7.16 we see that the carry-out signals of the FA and HA/OR
chains for the $z_{b/2-1}(i)$ generation block are not used. Therefore this block can also be
optimized as shown in Figure 7.18. The maximum propagation delay for signals through
this circuit is $\Delta_{FA} + 2 \cdot \Delta_{XOR}$.
Figure 7.18: Radix-$2^b$ $z_{b/2-1}(i)$ generation.
Assembling the optimized z(i) generation circuitry of Figures 7.16, 7.17 and 7.18 and
setting $\delta_j$ equal to the generation delay of the j-th sub-digit $z_j(i)$ of z(i) we have
• $\delta_0 = \Delta_{FA}$,
• $\delta_j = \Delta_{FA} + j \cdot 2 \cdot \Delta_{FA} + \Delta_{HA}$ for j = 1 ... b/2 - 2,
• $\delta_{b/2-1} = \Delta_{FA} + (b/2 - 2) \cdot 2 \cdot \Delta_{FA} + \Delta_{FA} + 2 \cdot \Delta_{XOR}$.
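As was stated at the beginning of this section, in order for the adder array to operate at
full-speed we need to satisfy $\delta_j \le (j + b/2) \cdot \Delta_{FA}$. Examining each of the above cases in
order, we have
• For j = 0 then
$$ \delta_0 = \Delta_{FA} \le (b/2) \cdot \Delta_{FA} $$
which is satisfied.
• For j = 1 ... b/2 - 2 then
$$ \delta_j = \Delta_{FA} + j \cdot 2 \cdot \Delta_{FA} + \Delta_{HA} \le (j + b/2) \cdot \Delta_{FA} $$
which leads to
$$ j \cdot \Delta_{FA} \le (b/2 - 1) \cdot \Delta_{FA} - \Delta_{HA} $$
which for maximum j gives
$$ (b/2 - 2) \cdot \Delta_{FA} \le (b/2 - 1) \cdot \Delta_{FA} - \Delta_{HA} $$
and since $\Delta_{HA} < \Delta_{FA}$ the condition is satisfied.
• For j = b/2 - 1 then
$$ \delta_{b/2-1} = \Delta_{FA} + (b/2 - 2) \cdot 2 \cdot \Delta_{FA} + \Delta_{FA} + 2 \cdot \Delta_{XOR} \le (b - 1) \cdot \Delta_{FA} $$
which leads to
$$ (b - 2) \cdot \Delta_{FA} + 2 \cdot \Delta_{XOR} \le (b - 1) \cdot \Delta_{FA} $$
and since $\Delta_{FA} = 2 \cdot \Delta_{XOR}$ the two sides are equal and the condition is again satisfied.
Thus we have succeeded in generating z(i) fast enough to allow the adder array to
operate at full-speed.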
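The three timing cases above can be checked numerically. The following is a minimal Python sketch, not part of the original design flow: the function names are hypothetical, and the unit delays used in the usage lines (XOR = 1, FA = 2, HA = 1) are assumptions chosen only so that the FA delay equals twice the XOR delay and the HA delay is less than the FA delay.

```python
def z_digit_delays(b, d_fa, d_ha, d_xor):
    """Generation delay of each sub-digit z_j(i), j = 0 .. b/2 - 1 (b even, b >= 4)."""
    last = b // 2 - 1
    delays = [d_fa]                                                   # j = 0
    delays += [d_fa + j * 2 * d_fa + d_ha for j in range(1, last)]    # j = 1 .. b/2 - 2
    delays.append(d_fa + (b // 2 - 2) * 2 * d_fa + d_fa + 2 * d_xor)  # j = b/2 - 1
    return delays


def full_speed(b, d_fa, d_ha, d_xor):
    """True if every sub-digit meets the bound delay_j <= (j + b/2) * d_fa."""
    delays = z_digit_delays(b, d_fa, d_ha, d_xor)
    return all(d <= (j + b // 2) * d_fa for j, d in enumerate(delays))


# Assumed unit delays: XOR = 1, FA = 2 * XOR, HA = 1 (less than FA).
for b in (4, 6, 8, 16):
    print(b, z_digit_delays(b, 2, 1, 1), full_speed(b, 2, 1, 1))
```

With these assumed delays the last sub-digit meets its bound with equality, as in the derivation; making the XOR slower relative to the FA breaks the condition.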
7.3.3 MMDAMMM Performance Summary 
For N a k-bit modulus we can summarize the radix-$2^b$ recoded RSD implementation as
follows. Firstly, N is a k-bit quantity, therefore M is (k + b)-bit and R is (k + b + 2)-bit.
Since MMDAMMM requires l + 1 iterations, then
Number of iterations = $\lceil (k+2)/b \rceil + 2$
Iteration time = $\Delta_{MUX} + b \cdot \Delta_{FA} + \Delta_{FF}$
Number of bitslices = k + 2b
Bitslice complexity = $5 \cdot \Omega_{FF} + b \cdot \Omega_{MUX} + b \cdot \Omega_{FA}$
Although we have succeeded in generating z(i) fast enough so that, on paper anyway, 
the multiplier appears to be able to operate at full-speed, a physical realization of the 
device in VLSI silicon will require buffering of the z(i) signals as they are distributed to 
all bitslices of the processor. This will likely slow down the operation of the adder array. 
To get around this problem we will have to generate the z(i) signals and start to distribute 
them before they are needed. This is the subject of the next section. 
7.4 The MMDDAMMM Algorithm 
The MMDDAMMM (Modified Modulus Double Delayed Additive Montgomery Modular
Multiplication) algorithm is a simple extension of the MMDAMMM algorithm. The difference
is that the partial products $X_i \cdot Y$ are further left-shifted by b bits before being
added to the accumulated partial product.
Algorithm 19 (MMDDAMMM) Given a constant $R = 2^{lb}$, odd modulus N such that
$$ 2^{b+2} \cdot N < R, $$
a constant $N' = (-N^{-1})_R$, and a modified modulus M such that
$$ M = N \cdot (N')_{2^b}, $$
then two integers $X, Y \in (-R/2, R/2)$, with X expressed as the (l + 2)-digit vector
$X = [X_{l+1}, X_l, \ldots, X_0]$ with $X_i \in [0, 2^b - 1]$ for i = 0 ... l and
$X_{l+1} \in [-2^{b-1}, 2^{b-1} - 1]$, can be Montgomery multiplied by setting
$$ s(0) = 0 $$
and letting
$$ r(i) = s(i) + 2^{2b} \cdot x(i) \cdot Y $$
$$ s(i+1) = \frac{r(i) + z(i) \cdot M}{2^b} $$
with
$$ x(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot x_j(i) $$
where $x_j(i) \in \{-2, -1, 0, 1\}$ and
$$ z(i) \equiv s(i) \pmod{2^b} $$
with
$$ z(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot z_j(i) $$
where $z_j(i) \in \{-2, -1, 0, 1\}$. This will give
$$ s(l+2) = X \cdot Y \cdot R^{-1} \pmod{M} $$
with $s(l+2) \in (-R/2, R/2)$.
Proof: A straightforward extension of MMDAMMM. □
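The recurrence of the algorithm can be exercised with a small functional model. The Python sketch below is an illustration only and makes several assumptions: it works on ordinary integers rather than redundant signed-digit vectors, it takes x(i) to be the i-th radix-2^b digit of X directly (ignoring the carries that the recoding hardware passes between iterations), it takes M = N·(N' mod 2^b), and it picks z(i) as the representative of s(i) modulo 2^b that lies in the range spanned by the recoded digits. The function name `mmddammm` is hypothetical, b is assumed even, and Python 3.8 or later is needed for the modular inverse via `pow`.

```python
def mmddammm(X, Y, N, b, l):
    """Functional model of the MMDDAMMM recurrence; returns (s(l+2), M, R)."""
    R = 1 << (l * b)
    assert N % 2 == 1 and (N << (b + 2)) < R
    base = 1 << b
    n_prime = (-pow(N, -1, R)) % R       # N' = (-N^-1) mod R
    M = N * (n_prime % base)             # modified modulus, M = -1 (mod 2^b)
    z_low = -2 * (base - 1) // 3         # smallest value of a recoded radix-2^b digit
    s = 0
    for i in range(l + 2):
        if i <= l:
            x = (X >> (b * i)) & (base - 1)   # unsigned digits X_0 .. X_l
        else:
            x = X >> (b * i)                  # signed top digit X_{l+1}
        z = (s - z_low) % base + z_low        # z = s (mod 2^b), within the recoded range
        r = s + (x << (2 * b)) * Y            # partial product delayed by 2b bits
        t = r + z * M
        assert t % base == 0                  # the low b bits have been zeroed
        s = t // base
    return s, M, R


s, M, R = mmddammm(12345, -20000, 197, 4, 4)
print(s, (s * R - 12345 * -20000) % M)        # second value is 0 if s = X*Y*R^-1 (mod M)
```

The check printed at the end confirms only the congruence of the result; the range claim of the algorithm is not tested by this model.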
The implementation of this algorithm using recoding techniques for both x(i) and
z(i) would be very similar to the MMDAMMM case of the previous section. The main
difference is that one more iteration of the multiplier is required. However, there is a
way of altering the z(i) generation circuitry such that z(i) can be generated before it is
actually needed. We will examine this as follows.
7.4.1 Radix-4 MMDDAMMM 
Looking first at the simple case of b=2, then the arrangement of the two adder levels
that make up the multiplier is as shown in Figure 7.19. Using the z(i) generation method
of the previous section we would `sample' $(s(i))_{2^b}$ at the position shown as (*1) in the
diagram (refer back to Figure 7.2). In this section we will move the `sample-point' to the
position shown as (*2). Thus we will sample the value of $(s(i))_{2^b}$ to use for generating this
cycle's z(i) at the end of the previous cycle before the accumulator is clocked. The reason
we can do this is because there are no adders located between the (*1) and (*2) points,
and so the values at these points will be the same.
Figure 7.19: Radix-4 MMDDAMMM adder structure.
Why do we sample at (*2)? To answer this question we first have to make the assumption
that z(i) is available directly after the active clock edge that operates the accumulator
(i.e. assume that, contrary to the findings of the previous section, there is no $2 \cdot \Delta_{XOR}$
delay after the clock edge and before z(i) becomes available). Now, because there is no
part of the Y adder above the M adder in the sampling column (the b-bit wide column
in which (*2) resides, shown by dotted lines in the diagram) and also because we are at
the least-significant end of the adder array where the z(i) generation circuit resides (i.e.
there is no need to buffer the transmission of the z(i) signals to the $z(i) \cdot M$ multiplexor
of the sampling column, since they are next door to each other), then the addition of s(i)
and $z(i) \cdot M$ by the M adder will be complete in $\Delta_{MUX} + \Delta_{FA}$ time. This means that
there will be a $\Delta_{FA}$ timeslot available after the output of the M adder settles and before
the accumulator is clocked. Since, as we saw in the previous section, for b=2 it takes
$2 \cdot \Delta_{XOR} = \Delta_{FA}$ time to generate z(i), then it could actually be generated here. If we did
this then the calculation of z(i) would be complete before the accumulator is clocked, and
so z(i) could be made available directly after the clock edge. This actually agrees with our
initial assumption. The architecture for such an implementation is shown in Figure 7.20.
In order for this implementation to work, all that is required is that z(i) be available
on the active edge of the clock on the first iteration. But this is easy to arrange since,
from the algorithm, we can see that z(0) = 0 and this can be achieved in the circuitry of
Figure 7.20 by resetting the output flip-flops of the z(i) generation block at the same time
that the accumulator is reset before a multiplication is started. This fact, coupled with
the knowledge that the z(i) generation circuit and the multiplexor and adder circuits for
$z(i) \cdot M$ addition in the two least-significant b-bit columns of the adder array can all be
located physically close to each other, means that z(i) can always be calculated before the
accumulator is clocked, and therefore be available for use directly after the clock edge.
Figure 7.20: Radix-4 MMDDAMMM z(i) generation.
Putting all this together, it means that the z(i) signal is available $2 \cdot \Delta_{XOR}$ time before
it is needed by the $z(i) \cdot M$ multiplexors in each bitslice. Therefore this time is available
for the buffered transmission of z(i) to the bitslices. Exactly how this will help to keep
the adder array operating at full-speed depends very much on the technology used by the
VLSI implementation and also on the buffering strategy used by other signals (e.g. the
clock and x(i)) in their distribution to the bitslice processor.
Note that, referring again to Figure 7.20, irrespective of how far left-shifted the multiplicand
Y may be, it is not possible to move the s(i) sample point further left (either before
or after the accumulator) to try and gain more time for z(i) evaluation. The reason is that 
this would mean that, during the cycles where z(i) is being evaluated, further additions 
of multiples of M are being performed, and thus the particular z(i) being calculated will 
not, when used, be the correct value required to zero the lower bits of the accumulator. 
7.4.2 Radix-$2^b$ MMDDAMMM
To use this technique for b>2 then the generation of z(i) becomes a 2-stage process. For
$$ z(i) = \sum_{j=0}^{b/2-1} 2^{2j} \cdot z_j(i) $$
then, using the z(i) generation method of the previous section, each $z_j(i)$ is available at
or before the time it is required. To be more specific, all but the (b/2 - 1)-th sub-multiple
are available before they are needed. The last multiple is available exactly at the time it
is needed. Splitting the generation of z(i) into two stages so that
• Stage 1: Calculate $z_0(i)$
• Stage 2: Calculate $z_j(i)$ for j = 1 ... b/2 - 1
with the first stage completed before the accumulator is clocked and the second completed
afterwards, will result in each $z_j(i)$ for j = 0 ... b/2 - 1 becoming available at least $\Delta_{FA}$
before it is needed. Therefore at least $\Delta_{FA}$ time will be available for the transmission of
each $z_j(i)$ to the bitslice array. This is shown in Figure 7.21.
Figure 7.21: Radix-$2^b$ recoded MMDDAMMM multiplier.
7.4.3 Radix-$2^b$ MMDDAMMM Performance Summary
For N a k-bit modulus we need $R > 2^{k+b+2}$. Since the MMDDAMMM algorithm requires
an extra two iterations to complete, we can summarize the recoded radix-$2^b$ RSD
implementation as follows.
Number of iterations = $\lceil (k+2)/b \rceil + 3$
Iteration time = $\Delta_{MUX} + b \cdot \Delta_{FA} + \Delta_{FF}$
Number of bitslices = k + 3b
Bitslice complexity = $5 \cdot \Omega_{FF} + b \cdot \Omega_{MUX} + b \cdot \Omega_{FA}$
7.5 Summary 
In this chapter new, optimised designs for Montgomery multipliers have been presented
that allow the multiplier's adder array to operate at full-speed. These designs are for
radix-2, radix-4 and general radix-$2^b$ multipliers. The most promising design uses the
MMDDAMMM algorithm together with a recoded RSD architecture.
In the next chapter, technology-specific issues will be discussed in determining which
of these multipliers will be used to implement the high-speed RSA processor chip called
WHiSpER.
Chapter 8 
The WHiSpER Chip 
The WHiSpER (Wide-word High-Speed Encryption for RSA) chip is an integrated circuit 
device intended for use within RSA cryptosystems. It is a dedicated long-integer modular 
exponentiator. 
For the device to be particularly useful for implementing RSA cryptosystems over
computer and telecommunication networks, it needs to satisfy two goals,
• security; to accept moduli of at least 500 bits in length, and
• speed; to be able to perform encryption/decryption operations at a rate of not less
than 64kbps.
This chapter details the design decisions that had to be made so that the chip was 
both feasible to manufacture and satisfied its security and speed constraints. 
8.1 Technology 
The WHiSpER chip is implemented using GEC Plessey Semiconductor's (GPS) CLA70000
gate-array technology [83]. This is a 1 micron twin-well epitaxial CMOS process with two
levels of metal interconnect. A range of nine array sizes is available, from the smallest
CLA70XXX having 4929 gates, to the largest three,
• CLA76XXX; 110112 gates,
• CLA77XXX; 181260 gates,
• CLA78XXX; 256284 gates.
By a `gate' is meant the equivalent (in terms of circuit complexity) to a 2-input NAND 
gate. Using GPS terminology, one NAND gate can be implemented in a single Array 
Element (AE) which is composed of two p-type/n-type complementary transistor pairs. 
Thus the largest array in the range contains approximately 1 million transistors. 
A range of packaging technologies is available, with the largest arrays available in the 
following packages:
• Ceramic Leaded Chip Carrier;
• Power Ceramic Leaded Chip Carrier;
• Ceramic Pin Grid Array;
• Power Ceramic Pin Grid Array.
The Power versions of the above package types mount the chip `cavity-down' with a Cu/W 
heat-plate. The packages are available with, according to array size and package type, up 
to 257 pins. 
A CLA70000 design is created using design primitives called Cells and Macros. Cells 
range in complexity from simple 2-input NAND gates to D-type flip-flops and 1-bit full 
adders. Macros are pre-routed blocks of Cells that make up sub-circuits such as 4-bit 
counters, 4-bit multipliers etc. A complete design specifies the primitives and their inter- 
connections. Full specifications of GPS Cells and Macros are available in [83]. 
Design capture was performed on an Apollo workstation using Mentor Graphics version 
7 ECAD tools along with GPS CLA70000 technology libraries. Schematic entry and 
subsequent circuit simulation were performed using a combination of Mentor Graphics 
and GPS software. Full details of the GPS Mentor Design Kit can be found in [84], [85]
and [86].
A table showing the $\Delta$ and $\Omega$ characteristics of the four component types used for
analysing multiplier performance is shown in Table 8.1. Note that the $\Delta$ times are
maximum delay times that allow for a worst-case manufacturing process scenario.
Component   CLA70000 Cell   $\Delta$ (ns)   $\Omega$ (AEs)
NAND        NAND2           1               1
MUX         MUX4TO1         3               6
FA          FADD            6               8
FF          MDF             7               6
Table 8.1: CLA70000 Cell characteristics.
8.2 The Multiplier 
In order to choose an efficient multiplier for the WHiSpER chip, it is first necessary to 
analyse the time and cost figures given in Chapter 7 for the various different multiplier 
architectures and algorithms. This analysis is particular to the CLA70000 gate-array 
technology and is based upon the figures given in Table 8.1. 
8.2.1 Efficiency of Recoded Multipliers 
In Section 7.2.6 it was stated that the radix-4 recoded multiplier using an RSD architecture 
with the MMDAMMM algorithm would lead to a design that is almost twice as fast, using 
only slightly more hardware, than a radix-2 multiplier with a CSA architecture using the 
DAMMM algorithm. This can be checked as follows. 
Let $T_M$ indicate the time (in nanoseconds) required to perform a multiplication and
$R_M$ the rate of multiplications per second so that
$$ R_M = \frac{10^9}{T_M} $$
Let $G_M$ be the total number of gates used in the multiplier design and $E_M$ a measure of
multiplier efficiency such that
$$ E_M = \frac{R_M}{G_M} $$
is equal to multiplications per second per gate.
CSA DAMMM b=1
Acknowledging the fact that the AND gate construction for $x_i \cdot Y$ generation of Figure 4.6
can be optimized to use NAND gates, then from Section 7.1.3 we have
$$ T_M = (k + 3)(\Delta_{NAND} + 2 \cdot \Delta_{FA} + \Delta_{FF}) $$
$$ G_M = (k + 4)(2 \cdot \Omega_{NAND} + 2 \cdot \Omega_{FA} + 5 \cdot \Omega_{FF}) $$
which for k = 512 gives
$T_M$ = 10300
$R_M$ = 97087
$G_M$ = 24768
$E_M$ = 3.920
RSD MMDAMMM b=2
From Section 7.2.6 and assuming k is a multiple of 2, then
$$ T_M = (k/2 + 3)(\Delta_{MUX} + 2 \cdot \Delta_{FA} + \Delta_{FF}) $$
$$ G_M = (k + 8)(2 \cdot \Omega_{MUX} + 2 \cdot \Omega_{FA} + 5 \cdot \Omega_{FF}) $$
which for k = 512 gives
$T_M$ = 5698
$R_M$ = 175500
$G_M$ = 30160
$E_M$ = 5.819
From a simple comparison of the efficiency figures for each of the above implementations
we can see that the radix-4 recoded multiplier is better than the unrecoded version.
In particular, comparing the multiplication rate and gate count figures we see that the
recoded multiplier offers an 81% increase in multiplication speed at the expense of a 22%
increase in circuit complexity. This is broadly what was stated in Section 7.2.6.
In constructing performance figures for the general radix-$2^b$ RSD MMDDAMMM recoded
multiplier it will be useful also to create figures for a hypothetical radix-$2^b$ CSA
MMDDAMMM unrecoded multiplier. The latter figures may then be used as a kind of
`benchmark' for comparison purposes. This may be achieved by noting that, without using
recoding techniques, the generation of $Z_i$ will be simpler than the generation of the
recoded z(i), and that since twice as many adder levels are used to sum the partial product
$X_i \cdot Y$ then the adder array is still able to operate at full-speed.
CSA MMDDAMMM radix-$2^b$ (unrecoded)
Assuming that k+2 is a multiple of b we have
$$ T_M = ((k + 2)/b + 3)(\Delta_{NAND} + 2b \cdot \Delta_{FA} + \Delta_{FF}) $$
$$ G_M = (k + 3b)(2b \cdot \Omega_{NAND} + 2b \cdot \Omega_{FA} + 5 \cdot \Omega_{FF}) $$
which for k = 512 and b = 1, 2, 4, 6, 8 gives the performance figures of Table 8.2. Note that,
as b increases so the efficiency of the multiplier decreases with little speed improvement
obtained for high values of b.
CSA MMDDAMMM (unrecoded) (k=512)
        $T_M$ (x10^3)   $R_M$ (x10^3)   $G_M$ (x10^3)   $E_M$
b=1     10.3            97.1            24.7            3.931
b=2     8.3             120.5           34.2            3.523
b=4     7.3             137.0           53.5            2.561
b=6     6.9             144.9           73.1            1.982
b=8     6.8             147.1           93.3            1.577
Table 8.2: CSA MMDDAMMM (unrecoded) performance figures.
RSD MMDDAMMM radix-$2^b$ (recoded)
Assuming that k+2 is a multiple of b we have
$$ T_M = ((k + 2)/b + 3)(\Delta_{MUX} + b \cdot \Delta_{FA} + \Delta_{FF}) $$
$$ G_M = (k + 3b)(b \cdot \Omega_{MUX} + b \cdot \Omega_{FA} + 5 \cdot \Omega_{FF}) $$
which for k = 512 and b = 2, 4, 6, 8 gives the performance figures of Table 8.3.
RSD MMDDAMMM (recoded) (k=512)
        $T_M$ (x10^3)   $R_M$ (x10^3)   $G_M$ (x10^3)   $E_M$
b=1     -               -               -               -
b=2     5.7             175.4           30.0            5.847
b=4     4.5             222.2           45.1            4.927
b=6     4.1             243.9           60.4            4.038
b=8     3.9             256.4           76.1            3.369
Table 8.3: RSD MMDDAMMM (recoded) performance figures.
Upon comparison of these two tables we see that the recoded multipliers are both 
smaller and faster for each implementation b=2,4,6,8. For example, the b=2 recoded 
multiplier uses 12% less circuitry than the unrecoded b=2 multiplier, but goes 46%
faster. Also, because of the reduced circuitry requirements, for a given size of VLSI device
it may be possible to implement a recoded multiplier with a higher b value than could be 
implemented using the unrecoded method. For example, the b=8 recoded multiplier uses 
only 4% more circuitry than the b=6 unrecoded multiplier (and goes 77% faster). 
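The entries of Tables 8.2 and 8.3 follow directly from the two pairs of formulae and the Cell figures of Table 8.1. A minimal Python sketch of that calculation is given here; the function and dictionary names are hypothetical. The tabulated rates appear to have been derived from rounded multiplication times, so small differences in the last digit may be seen.

```python
# Cell characteristics taken from Table 8.1: delay in ns, size in Array Elements.
DELTA = {"NAND": 1, "MUX": 3, "FA": 6, "FF": 7}
OMEGA = {"NAND": 1, "MUX": 6, "FA": 8, "FF": 6}


def csa_unrecoded(k, b):
    """(T_M in ns, G_M in gates) for the unrecoded CSA MMDDAMMM multiplier."""
    t = ((k + 2) / b + 3) * (DELTA["NAND"] + 2 * b * DELTA["FA"] + DELTA["FF"])
    g = (k + 3 * b) * (2 * b * OMEGA["NAND"] + 2 * b * OMEGA["FA"] + 5 * OMEGA["FF"])
    return t, g


def rsd_recoded(k, b):
    """(T_M in ns, G_M in gates) for the recoded RSD MMDDAMMM multiplier."""
    t = ((k + 2) / b + 3) * (DELTA["MUX"] + b * DELTA["FA"] + DELTA["FF"])
    g = (k + 3 * b) * (b * OMEGA["MUX"] + b * OMEGA["FA"] + 5 * OMEGA["FF"])
    return t, g


def rate_and_efficiency(t_ns, gates):
    """(multiplications per second, multiplications per second per gate)."""
    rate = 1e9 / t_ns
    return rate, rate / gates


for b in (2, 4, 6, 8):
    for name, design in (("CSA", csa_unrecoded), ("RSD", rsd_recoded)):
        t, g = design(512, b)
        rate, eff = rate_and_efficiency(t, g)
        print(f"{name} b={b}: T={t:.0f} ns  R={rate:.0f}/s  G={g}  E={eff:.3f}")
```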
8.2.2 Multiplier Selection 
Assuming a 512-bit modulus and exponent, and also assuming that the exponent is a 
randomly selected element from the set of all possible 512-bit exponents, then the average 
number of bits set to a `1' in the exponent is 256. Using either the Right-to-Left or 
Left-to-Right Montgomery exponentiation algorithms gives 
Multiplications per exponentiation = 512 + 256 + 3 = 771
where the `+3' term accounts for pre- and post-conversion operations.
A 64kbps throughput for RSA requires
Exponentiations per second = $\frac{64000}{512}$ = 125
therefore
Multiplications per second = 125 · 771 = 96375
With reference to Table 8.3 we see that using a recoded RSD MMDDAMMM multiplier
requires that b ≥ 2.
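The throughput requirement can be reproduced with a few lines of Python. This is a sketch under the same assumptions as the text (a k-bit exponent with half of its bits set on average, one k-bit block per exponentiation, and three conversion multiplications); the function name is hypothetical.

```python
def required_multiplication_rate(bit_rate, k):
    """Return (multiplications per exponentiation,
               exponentiations per second,
               multiplications per second) for a k-bit modulus and exponent."""
    mults_per_exp = k + k // 2 + 3      # squarings + average multiplies + conversions
    exps_per_second = bit_rate / k      # one k-bit block per exponentiation
    return mults_per_exp, exps_per_second, mults_per_exp * exps_per_second


print(required_multiplication_rate(64000, 512))
```

Any multiplier whose rate in Table 8.3 exceeds the third value returned meets the 64kbps goal.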
The natural choice of which multiplier to use would obviously be the most efficient one 
that meets our requirements. This implies RSD MMDDAMMM b=2. However, to use 
this multiplier effectively would require a clock speed of 
$$ \frac{10^9}{\Delta_{MUX} + 2 \cdot \Delta_{FA} + \Delta_{FF}} = \frac{10^9}{22} \approx 45 \text{ MHz} $$
With the architecture of the multiplier consisting of a very large number of bitslices then 
the distribution of such a high-speed clock with minimum skew to all bitslices might cause 
problems. The concern is that the process of mapping the schematic design to silicon (the 
`layout' of the VLSI device) will become a critical part of the design, and that considerable 
manual input will be needed to make the result efficient. 
Another cause for concern is that, in analyzing the expected performance of the 
multiplier, we have used worst-case fabrication tolerances. A common practice in high- 
performance digital chip manufacture is to measure the fabrication process accuracy for 
each batch of chips produced. If the accuracy is good then chances are that the chips 
produced in this batch will be able to be reliably clocked at better than worst-case speeds. 
In practice several tolerance bands are used, and the chips produced are graded into a 
number of speed categories. If this technique were used with the b=2 multiplier then the 
higher clock speeds would be in excess of 50MHz. Generating a 50MHz clock off-chip and 
then trying to inject it into the chip is, generally speaking, not a good idea. Most manu- 
facturers use a lower frequency off-chip clock combined with an on-chip phase-locked-loop 
circuit to create high-frequency on-chip clocks. With CLA70000 gate-array technology 
such techniques are not available, and so it would be difficult to take advantage of any 
manufacturing process grading methods. 
The next most attractive multiplier is RSD MMDDAMMM b=4. For maximum 
efficiency this requires a clock-speed of 
$$ \frac{10^9}{\Delta_{MUX} + 4 \cdot \Delta_{FA} + \Delta_{FF}} = \frac{10^9}{34} \approx 29 \text{ MHz} $$
which is reasonable. On the question of efficiency, it goes 27% faster than the b=2 mul- 
tiplier for 50% extra circuitry, but the total number of gates is around 45 thousand which 
makes it acceptable for implementing within the array sizes mentioned in Section 8.1. 
Therefore this multiplier was chosen as the basis for the WHiSpER chip, with a conserva- 
tive operational clock speed of 25MHz. 
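The clock figures quoted for b=2 and b=4 come from the iteration time of the recoded multiplier. The short Python sketch below repeats the calculation with the worst-case delays of Table 8.1 as default arguments, and also shows the timing margin left by a 25MHz clock; the function name is hypothetical.

```python
def max_clock_mhz(b, d_mux=3, d_fa=6, d_ff=7):
    """Maximum clock frequency in MHz for an iteration of MUX + b*FA + FF (delays in ns)."""
    period_ns = d_mux + b * d_fa + d_ff
    return 1e3 / period_ns


for b in (2, 4):
    print(b, round(max_clock_mhz(b), 1))

# Margin per cycle when the b=4 multiplier is clocked at 25 MHz (40 ns period).
print(1e3 / 25 - (3 + 4 * 6 + 7))
```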
8.2.3 The Carry-Propagate Adder 
As was stated in Chapter 7, to make the RSD MMDDAMMM multiplier useful for expo- 
nentiations requires that a carry-propagate adder (CPA) be used to assimilate the S+ and 
S- vectors immediately after a multiplication so that the result may be used in further 
multiplications. However, before we select an appropriate CPA, an alternative architecture 
that does not require this adder will be investigated. 
Full RSD Multiplier 
Instead of assimilating the S+ and S- vectors after a multiplication, it is possible to 
construct a multiplier that uses these values directly. Assume Y is the result of a previous 
multiplication such that 
$$ Y = Y^+ - Y^- $$
then the partial products of Y can be expressed as
$$ x(i) \cdot Y = x(i) \cdot Y^+ - x(i) \cdot Y^- $$
For the RSD MMDDAMMM b=4 multiplier this gives
$$ x(i) \cdot Y = 4 \cdot x_1(i) \cdot Y^+ - 4 \cdot x_1(i) \cdot Y^- + x_0(i) \cdot Y^+ - x_0(i) \cdot Y^- $$
and assuming x(i) can be made available directly on the active clock edge then using the 
adder interconnection optimization technique of Section 4.2.2 leads to the architecture 
shown in Figure 8.1. Note that the generation of z(i) can still be performed before it is
required since there is a $3 \cdot \Delta_{FA}$ delay in the adder array before the M adders. Performance
figures for this multiplier, taking into account the extra adders and multiplexers for the
$x(i) \cdot Y$ generation and the extra flip-flops in the X and Y registers, are therefore as follows.
$$ T_M = ((k + 2)/b + 3) \cdot (\Delta_{MUX} + 5 \cdot \Delta_{FA} + \Delta_{FF}) $$
$$ G_M = (k + 3b) \cdot (6 \cdot \Omega_{MUX} + 6 \cdot \Omega_{FA} + 7 \cdot \Omega_{FF}) $$
which gives
$T_M$ = 5260
$R_M$ = 190114
$G_M$ = 66024
$E_M$ = 2.879
Figure 8.1: Full RSD MMDDAMMM b=2 multiplier.
RSD/CPA Multiplier 
Assuming that the carry-propagate adder is a simple ripple-adder, then the performance 
figures for this architecture are as follows. 
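$$ T_M = ((k + 2)/b + 3) \cdot (\Delta_{MUX} + 4 \cdot \Delta_{FA} + \Delta_{FF}) + (k + 3b) \cdot \Delta_{FA} $$
$$ G_M = (k + 3b) \cdot (4 \cdot \Omega_{MUX} + 4 \cdot \Omega_{FA} + 5 \cdot \Omega_{FF}) + (k + 3b) \cdot \Omega_{FA} $$
which gives
$T_M$ = 7615
$R_M$ = 131320
$G_M$ = 49256
$E_M$ = 2.666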
On comparing the Full RSD and RSD/CPA approaches we see that the Full RSD 
method is more efficient. To be specific, it gives a 45% speed increase for 34% extra 
circuitry. 
Although this seems to imply that the Full RSD approach is better than RSD/CPA,
it gives a somewhat false impression. This is because it was assumed that the assimilation 
adder was a simple ripple-adder. If the type of adder is changed to a more efficient design, 
then dramatic improvements can be made for little extra cost. 
RSD/FCPA Multiplier 
An efficient FCPA (Fast Carry-Propagate Adder, for example the carry-completion and
carry-select adders of Section 4.1) is the CLA70000 Macro ADT8 8-bit carry-select adder
block. It can add two 8-bit numbers in
$$ \Delta_{ADT} = 8 \text{ nanoseconds} $$
with a gate count of
$$ \Omega_{ADT} = 85 \text{ AEs} $$
This gives it a per-bit addition time of approximately 1ns and a per-bit gate count of
approximately 11. Thus it is approximately six times faster than the FADD Cell, but only
40% more expensive.
Using the ADT8 Cell as the FCPA gives the following performance figures.
$$ T_M = ((k + 2)/b + 3) \cdot (\Delta_{MUX} + 4 \cdot \Delta_{FA} + \Delta_{FF}) + (k + 3b) \cdot (\Delta_{ADT}/8) $$
$$ G_M = (k + 3b) \cdot (4 \cdot \Omega_{MUX} + 4 \cdot \Omega_{FA} + 5 \cdot \Omega_{FF}) + (k + 3b) \cdot (\Omega_{ADT}/8) $$
which gives
$T_M$ = 4995
$R_M$ = 200200
$G_M$ = 50632
$E_M$ = 3.954
Thus RSD/FCPA is both faster and smaller than Full RSD. On comparison with 
RSD/CPA we see that it is 52% faster with only 3% more circuitry. In fact, when compared 
with the figures of Table 8.3 we see that the carry-propagated addition adds only 11% to 
the overall multiplication time. Therefore we can conclude that the optimum multipliers 
do seem to be those of Chapter 7 together with a fast carry-propagate adder. 
In summary, the multiplier chosen for the WHiSpER chip is the RSD MMDDAMMM 
b=4 recoded multiplier with a carry-select adder to perform result assimilation. For a 512- 
bit modulus, the multiplier is approximately 50 thousand gates in size, and is theoretically 
capable of performing 200 thousand multiplications per second. 
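The three result-assimilation options can be compared side by side using the formulae given for each. The following Python sketch does this for k = 512 and b = 4 with the Cell figures of Table 8.1 and the ADT8 figures quoted above; the function names are hypothetical, and the per-bit ADT8 cost is taken as one eighth of the 8-bit block, which is the assumption that reproduces the tabulated gate count.

```python
D_MUX, D_FA, D_FF, D_ADT = 3, 6, 7, 8      # delays in ns
O_MUX, O_FA, O_FF, O_ADT = 6, 8, 6, 85     # sizes in Array Elements


def iterations(k, b):
    return (k + 2) / b + 3


def full_rsd(k=512, b=4):
    t = iterations(k, b) * (D_MUX + 5 * D_FA + D_FF)
    g = (k + 3 * b) * (6 * O_MUX + 6 * O_FA + 7 * O_FF)
    return t, g


def rsd_cpa(k=512, b=4):
    # Ripple adder: one FADD delay and one FADD cell per bitslice.
    t = iterations(k, b) * (D_MUX + 4 * D_FA + D_FF) + (k + 3 * b) * D_FA
    g = (k + 3 * b) * (4 * O_MUX + 4 * O_FA + 5 * O_FF) + (k + 3 * b) * O_FA
    return t, g


def rsd_fcpa(k=512, b=4):
    # Carry-select adder: one eighth of an ADT8 block per bitslice.
    t = iterations(k, b) * (D_MUX + 4 * D_FA + D_FF) + (k + 3 * b) * D_ADT / 8
    g = (k + 3 * b) * (4 * O_MUX + 4 * O_FA + 5 * O_FF) + (k + 3 * b) * O_ADT / 8
    return t, g


for name, design in (("Full RSD", full_rsd), ("RSD/CPA", rsd_cpa), ("RSD/FCPA", rsd_fcpa)):
    t, g = design()
    print(f"{name}: T={t:.0f} ns  R={1e9 / t:.0f}/s  G={g:.0f}  E={1e9 / t / g:.3f}")
```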
8.3 The Exponentiator 
The exponentiator performs calculations of the form $(A^E)_M$. Note that this calculation
is performed modulo M, that is, using the modified modulus that the multiplier uses.
A subsequent reduction of $(A^E)_M$ modulo N is assumed to take place after the main
exponentiation by a different circuit.
In this section we explore implementations of the alternative exponentiation schemes 
that were derived in Section 6.2. To simplify the discussion we will use the notation for 
Montgomery multiplications developed in Section 6.2.2 but using the modulus M. This 
implies that the results of the Montgomery multiplications are in the range [0, M- 1]. 
Whilst this is not strictly true when we use the MMDDAMMM multiplier of the previous
section, it makes little difference when investigating the alternative exponentiation
techniques. When it does become important, new notation will be introduced to handle 
it. 
Given that the multiplier has been chosen, the main consideration in designing the 
exponentiator is that of which algorithm to use. In Section 6.2 we saw that there are two 
different ways of performing Montgomery exponentiation; either with a post-computation 
involving a constant derived from R, M and E, or with pre- and post-conversion operations
into and out of the M-residue system respectively. The latter technique involves a
constant derived from R and M only.
The method chosen is the M-residue technique. This is for the reason that the calculation
of the pre-computed constant, $H = (R^2)_M$, depends only on R and M which are
both public knowledge. If the other method had been chosen then the computation of 
its associated constant, (R`+l)M, would require knowledge of the exponent, and since the 
exponent may be secret then performing this calculation may be inconvenient. It would 
depend upon the particular implementation of the RSA ciphersystem. 
The next sections examine the differences between the Right-to-Left and Left-to-Right 
M-residue exponentiation schemes. 
8.3.1 R-to-L M-residue Exponentiation 
The Right-to-Left M-residue exponentiation algorithm was stated in Section 6.2.2. The 
main loop of this algorithm performs the calculation of s'(i + 1) and t'(i + 1) as follows.
$$ s'(i+1) = \begin{cases} s'(i) & \text{if } e_i = 0 \\ \mathcal{M}_{R,M}(s'(i), t'(i)) & \text{if } e_i = 1 \end{cases} $$
$$ t'(i+1) = \mathcal{M}_{R,M}(t'(i), t'(i)) $$
From this we can see that the calculations of s'(i + 1) and t'(i + 1) are independent. That is,
each is derived from the s'(i) and t'(i) of the previous iteration. This means that s'(i + 1)
and t'(i + 1) can both be calculated at the same time.
Parallel R-to-L Exponentiation 
Using two concurrent parallel multipliers, then the calculations of s'(i + 1) and t'(i + 1)
can proceed as follows.
                             Multiplier   Multiplicand
Multiplier 1: s'(i + 1) ←    t'(i)        s'(i)
Multiplier 2: t'(i + 1) ←    t'(i)        t'(i)
From this we see that the multiplier operand is the same for both multipliers, therefore 
the multipliers can be implemented with one X register shared between them. 
Defining exponentiation performance figures in the same way as those for multiplication,
but using the subscript E, and since the number of iterations of the exponentiation
algorithm (including pre- and post-conversion operations) is k+2, we have
$$ T_E = (k + 2) \cdot T_M $$
$$ G_E = (k + 3b) \cdot (2b \cdot \Omega_{MUX} + 2b \cdot \Omega_{FA} + 9 \cdot \Omega_{FF} + 2 \cdot \Omega_{ADT}/8) $$
for k = 512 and b = 4 this gives
$T_E$ = 2567430
$R_E$ = 389.5
$G_E$ = 98119
$E_E$ = 4.031 · $10^{-3}$
Serial R-to-L Exponentiation 
Using a single multiplier with the same algorithm yields the following performance figures. 
Note that, because there are two variables in the algorithm, s'(i) and t'(i), an extra register
is needed to hold whichever one of them is not being used.
$$ T_E = (3k/2 + 3) \cdot T_M $$
$$ G_E = (k + 3b) \cdot (b \cdot \Omega_{MUX} + b \cdot \Omega_{FA} + 6 \cdot \Omega_{FF} + \Omega_{ADT}/8) $$
for k = 512 and b = 4 this gives
$T_E$ = 3851145
$R_E$ = 259.7
$G_E$ = 53776
$E_E$ = 4.829 · $10^{-3}$
Thus the serial multiplier is more efficient than the parallel implementation. The parallel 
multiplier provides a 50% speedup but at a cost of 82% extra hardware. 
8.3.2 L-to-R M-residue Exponentiation 
The main loop of the Left-to-Right exponentiation algorithm is
$$ s'(i+1) = \begin{cases} \mathcal{M}_{R,M}(s'(i), s'(i)) & \text{if } e_{k-i-1} = 0 \\ \mathcal{M}_{R,M}(\mathcal{M}_{R,M}(s'(i), s'(i)), A') & \text{if } e_{k-i-1} = 1 \end{cases} $$
and so, when two multiplications are required, we see that they have to be performed
sequentially. Noting that an extra register is required to hold the pre-converted number
A', then the performance figures for this method are
$$ T_E = (3k/2 + 3) \cdot T_M $$
$$ G_E = (k + 3b) \cdot (b \cdot \Omega_{MUX} + b \cdot \Omega_{FA} + 6 \cdot \Omega_{FF} + \Omega_{ADT}/8) $$
for k = 512 and b = 4 this gives
$T_E$ = 3851145
$R_E$ = 259.7
$G_E$ = 53776
$E_E$ = 4.829 · $10^{-3}$
which are identical to the serial Right-to-Left method. 
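The time and gate-count figures for the parallel and serial exponentiators can be recomputed from the multiplication time of the RSD/FCPA multiplier. The Python sketch below does so for k = 512 and b = 4; the function names are hypothetical, and the value T_M = 4995ns is taken from Section 8.2.3.

```python
T_M = 4995                         # ns, RSD/FCPA multiplication time for k = 512, b = 4
O_MUX, O_FA, O_FF, O_ADT = 6, 8, 6, 85


def parallel_exponentiator(k=512, b=4):
    """(T_E in ns, G_E in gates) for two multipliers sharing one X register."""
    t = (k + 2) * T_M
    g = (k + 3 * b) * (2 * b * O_MUX + 2 * b * O_FA + 9 * O_FF + 2 * O_ADT / 8)
    return t, g


def serial_exponentiator(k=512, b=4):
    """(T_E in ns, G_E in gates) for a single multiplier with one extra register."""
    t = (3 * k / 2 + 3) * T_M
    g = (k + 3 * b) * (b * O_MUX + b * O_FA + 6 * O_FF + O_ADT / 8)
    return t, g


tp, gp = parallel_exponentiator()
ts, gs = serial_exponentiator()
print(f"speed-up of parallel over serial: {ts / tp:.2f}")
print(f"hardware of parallel over serial: {gp / gs:.2f}")
```

The two ratios printed correspond to the 50% speed-up and 82% extra hardware quoted for the parallel arrangement.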
The main difference between the Right-to-Left and Left-to-Right techniques is that, in 
the former there are two variables, s'(i) and t'(i), that are continually updated, whereas in
the latter there is only one updated variable, s'(i), together with a pre-converted number, 
A', that is constant for most of the exponentiation. Under the assumption that keeping 
track of just one variable will simplify the control circuitry and reduce register-to-register 
communications, then the Left-to-Right exponentiation scheme is the one chosen for the 
WHiSpER chip. 
8.3.3 Optimizing L-to-R Exponentiation 
Using the notation that
$$ a = [b]_M $$
means
$$ a \equiv b \pmod{M} $$
but with a limited to some range not necessarily equal to [0, M - 1], then we can define
the operation of the RSD MMDDAMMM multiplier, call it $\lambda_{R,M}(X, Y)$, as
$$ \lambda_{R,M}(X, Y) = [X \cdot Y \cdot R^{-1}]_M $$
where the range is defined as
$$ \lambda_{R,M}(X, Y) \in (-R/2, R/2) $$
Pre-Conversion 
With $H = (R^2)_M$, then looking at the first three multiplications of the Left-to-Right
M-residue exponentiation algorithm we have
$$ A' = \lambda_{R,M}(A, H) = [A \cdot R]_M $$
$$ s'(0) = \lambda_{R,M}(1, H) = [R]_M $$
$$ s'(1) = \lambda_{R,M}(\lambda_{R,M}(s'(0), s'(0)), A') = [A \cdot R]_M $$
where the last equality is due to the most-significant-bit of the exponent being a `1'.
Examining these three identities we see that it is possible to collapse the calculations into
a single Montgomery multiplication. Thus
$$ A' = s'(1) = \lambda_{R,M}(A, H) $$
Post-Conversion 
From the Left-to-Right M-residue exponentiation post-conversion operation we have 
$$ s(k) = \lambda_{R,M}(1, s'(k)) $$
now since
$$ s'(k) \in (-R/2, R/2) $$
then according to the recoded MMDDAMMM algorithm (Section 7.4) we have
Is(k)l< 2'1+226, R. M <M 22b"R 
But we can show that $|s(k)| \ne M$ as follows.
If $s(k) = \pm M$ then obviously
$$ s(k) \equiv 0 \pmod{M} $$
which, since $A \in [0, N - 1]$ where $M = N \cdot (N')_{2^b} > N$, would mean that
$$ A = 0 $$
Since the M-residue representation of zero is zero, therefore
$$ A' = 0 $$
and by the last section
$$ s'(1) = 0 $$
since the calculation of s'(i + 1) involves the Montgomery product of s'(i) and A' which
would be zero, therefore
$$ s'(k) = 0 $$
moving back to standard residue representation we have
$$ s(k) = 0 $$
and so therefore $s(k) \ne \pm M$.
This means that the result range of an exponentiation is
$$ s(k) \in [-(M - 1), M - 1] $$
and so to perform a reduction to the range [0, M - 1] involves just a possible addition of
M.
The WHiSpER Exponentiation Algorithm 
The exponentiation algorithm of the WHiSpER chip is as follows. 
Algorithm 20 (Left-to-Right WHiSpER Exponentiation) Given an integer A, a
positive k-bit exponent E, modulus M, constant R and pre-computed constant
$H = (R^2)_M$, then calculating $A^E \pmod{M}$ in the range $[-(M-1), M-1]$ is a 3-stage process.
1. Pre-conversion
\[ A' = s'(1) = \lambda_{R,M}(A, H) \]
2. Processing (for $i = 1 \ldots k-1$)
\[ s'(i+1) = \begin{cases} \lambda_{R,M}(s'(i), s'(i)) & \text{if } e_{k-i-1} = 0 \\ \lambda_{R,M}(\lambda_{R,M}(s'(i), s'(i)), A') & \text{if } e_{k-i-1} = 1 \end{cases} \]
3. Post-conversion
\[ s(k) = \lambda_{R,M}(1, s'(k)) \]
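The three stages of Algorithm 20 can be sketched in Python as follows. This is a minimal illustration rather than a model of the hardware: the hypothetical helper `mont_mul` stands in for the Montgomery multiplier but returns the canonical residue in [0, M-1] instead of a value in (-R/2, R/2), and it obtains the inverse of R with Python's built-in `pow` (Python 3.8 or later), which the chip itself never needs to do.

```python
def mont_mul(x, y, R, M):
    """Stand-in for the Montgomery product: x * y * R^-1 reduced modulo M."""
    r_inv = pow(R, -1, M)  # modular inverse; requires gcd(R, M) = 1
    return (x * y * r_inv) % M


def whisper_exp(A, E, M, R):
    """Left-to-Right M-residue exponentiation with the collapsed pre-conversion."""
    if E < 1:
        raise ValueError("E must be a positive exponent")
    H = (R * R) % M                       # pre-computed constant H = (R^2)_M
    bits = bin(E)[2:]                     # e_{k-1} ... e_0, with e_{k-1} = '1'
    a_res = mont_mul(A, H, R, M)          # 1. pre-conversion: A' = s'(1)
    s = a_res
    for bit in bits[1:]:                  # 2. processing: e_{k-2} down to e_0
        s = mont_mul(s, s, R, M)          #    square for every bit
        if bit == "1":
            s = mont_mul(s, a_res, R, M)  #    extra multiply by A' for a '1' bit
    return mont_mul(1, s, R, M)           # 3. post-conversion
```

Because the sketch works with canonical residues, its result is already in [0, M-1]; the possible addition of M discussed above only arises with the signed range produced by the real multiplier.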
8.4 Register Variable Analysis 
In this section we will examine the whole process of performing an RSA exponentiation 
using the multiplier and exponentiator of the previous sections. Looking at the flow of data 
through these devices will allow us to derive an efficient VLSI design for the WHiSpER 
chip. 
It is assumed that we are given a modulus N, an exponent E, and successive integers
\[ A \in [0, N-1] \]
so that we must compute the results
\[ D = (A^E)_N \]
with as high a throughput as possible.
Assuming that the constant H was pre-calculated once and for all when the RSA
key-pair was generated, then the steps involved in performing an exponentiation are
1. Derive the modified modulus for our radix-$2^4$ MMDDAMMM multiplier
\[ M = N \cdot (N')_{2^4} \]
2. Pre-convert the input, A, to M-residue format, A'. Call this value B here so that
\[ B = \lambda_{R,M}(A, H) \]
3. Setting $s'(1) = B$ perform the main exponentiation loop for the
next-to-most-significant-bit of the exponent, $e_{k-2}$, to the least-significant-bit, $e_0$, so that
\[ s'(i+1) = \begin{cases} \lambda_{R,M}(s'(i), s'(i)) & \text{if } e_{k-i-1} = 0 \\ \lambda_{R,M}(\lambda_{R,M}(s'(i), s'(i)), B) & \text{if } e_{k-i-1} = 1 \end{cases} \]
will give $s'(k) = [A^E \cdot R]_M$ in the range $(-R/2, R/2)$.
4. Post-convert $s'(k)$ to a value $C = [A^E]_M$ in the range $[-(M-1), M-1]$ by
\[ C = \lambda_{R,M}(1, s'(k)) \]
5. Finally, reduce C to the range $[0, N-1]$ so that
\[ D = (C)_N = (A^E)_N \]
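The five steps above can be followed end to end in a short Python sketch. It rests on stated assumptions: N is odd, N' is taken to be the negative inverse of N modulo $2^4$ (as the name of the NIN block in Section 8.5.2 suggests), and the hypothetical helper `mont_mul` returns canonical residues rather than the signed range of the real multiplier. The function names are illustrative only.

```python
def mont_mul(x, y, R, M):
    """Stand-in for the Montgomery product: x * y * R^-1 modulo M."""
    return (x * y * pow(R, -1, M)) % M


def modified_modulus(N, b=4):
    """Step 1: M = N * (N')_{2^b}, with N' assumed to be -N^-1 modulo 2^b."""
    radix = 1 << b
    n_prime = (-pow(N, -1, radix)) % radix
    return N * n_prime


def rsa_exponentiate(A, E, N, R=1 << 512):
    """Compute D = A^E mod N by way of the modified modulus M (E >= 1)."""
    M = modified_modulus(N)               # step 1
    H = (R * R) % M                       # normally pre-computed with the key-pair
    B = mont_mul(A, H, R, M)              # step 2: pre-conversion
    s = B
    for bit in bin(E)[3:]:                # step 3: bits e_{k-2} ... e_0
        s = mont_mul(s, s, R, M)
        if bit == "1":
            s = mont_mul(s, B, R, M)
    C = mont_mul(1, s, R, M)              # step 4: post-conversion
    return C % N                          # step 5: reduce to [0, N-1]
```

Since M is a multiple of N, reducing C modulo N in the last step yields the same residue as a direct exponentiation modulo N.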
Looking at the way each of the variables is used during the exponentiation process, we
find the following.
• N: For the conversion of $N \to M$ and for reducing $C \to D$.
• M: Used by the multiplier for all multiplication operations.
• E: Examined one bit at a time between multiplications.
• H: Used by the multiplier but only in the first multiplication.
• A: Used by the multiplier but only in the first multiplication.
• B: Used by the multiplier during any iteration where $e_{k-i-1} = 1$.
• C: For reducing $C \to D$.
• D: Stored until read.
This implies that each variable has its own `bandwidth' requirement, i.e. how frequently
and in what way it is used. This can be used to advantage by storing the low-bandwidth
variables off-chip, in a separate RAM device, that is mounted alongside but under the
direct control of the WHiSpER chip.
Referring back to Chapter 7 we know that the multiplier has registers X, Y and M.
The first two are the multiplier and multiplicand registers respectively, whilst the third
always holds the modified modulus M. In Section 8.3.2 we noted that the Left-to-Right
exponentiator requires an extra register to hold the M-residue representation of its input
operand. Therefore call this register B. Since it is assumed that, when the WHiSpER chip
is operating, many exponentiations will be performed for the same modulus and exponent,
this suggests the following 4 procedures.
• Load: Load the M register with the modified modulus, $N \cdot (N')_{2^4}$, and determine
the most-significant bit of the exponent.
• Transfer: Store the contents of the X register (assumed to contain the result of a
previous exponentiation) to C. Load the X and Y registers with A and H.
• Exponentiate: Perform pre-conversion (storing M-residue representation of A in the
B register), main loop processing and post-conversion.
• Reduce: Reduce C to D.
Note that there is an enforced serialization in the first three processes, but that the
reduction process can be overlapped with the exponentiation process, i.e. the exponentiation of
the next input operand can be started before the current output has been fully reduced.
Also, the load process need only be performed when the RSA key is changed.
The WHiSpER chip control circuitry is based around these four processes, with hard- 
ware `double-buffering' techniques that allow the exponentiator to operate continuously 
whilst device i/o and the reduction process operate independently and in parallel. 
8.5 Architecture 
In this section we will explore the general architecture of the WHiSpER chip. As has
already been mentioned, the WHiSpER chip employs a recoded RSD MMDDAMMM $b = 4$
multiplier to perform RSA exponentiation. For ease of implementation, the multiplier is
constructed so that $R = 2^{512}$. This allows for very simple implementations of control
circuitry and counters within the WHiSpER chip. It also means that the maximum size of
RSA modulus, N, that can be accepted by WHiSpER, according to Section 7.4.3, is 506
bits.
The WHiSpER chip and its associated static RAM (SRAM) chip are shown in Figure 8.2.
Communication from the host device to the WHiSpER chip is via WHiSpER's
microprocessor interface port. This interface is designed to resemble a 512-byte RAM
device, so that, together with the interrupt facility, the WHiSpER chip may be connected
to almost any standard microprocessor based system.
8.5.1 The SRAM Device 
Figure 8.2: The WHiSpER and SRAM devices.
The SRAM chip of Figure 8.2 is a CMOS static RAM device of up to 64k x 8-bit in
size. The SRAM is used to hold RSA keys (modulus and exponent) together with their
pre-computed constants, H. The organization of the SRAM is as follows.
The 64-kbyte space of the SRAM is partitioned into 256 lots of 256-byte spaces. Each
256-byte space is then partitioned into 4 lots of 64-byte spaces. Note that 64 bytes = 512
bits. The four 64-byte spaces are then used to store an RSA key-pair (modulus N and
exponents $E_p$ and $E_s$) and pre-computed constant $H = (R^2)_M$ where $R = 2^{512}$ and M is
the modified modulus $M = N \cdot (N')_{2^4}$. This is shown in Figure 8.3. Thus the external
SRAM may range in size from 256 bytes to 64 kbytes and may hold from 1 to 256 RSA
key-pairs.
Address   Contents   Key-pair
FFC0      E_s        255
FF80      E_p        255
FF40      H          255
FF00      N          255
...       ...        ...
01C0      E_s        1
0180      E_p        1
0140      H          1
0100      N          1
00C0      E_s        0
0080      E_p        0
0040      H          0
0000      N          0
Figure 8.3: SRAM memory map.
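The layout of Figure 8.3 amounts to a simple address calculation, sketched below. The function and table names are hypothetical, and the field order (N, H, then the two exponents) is taken from the SRAM window addresses given in Section 8.6.1.

```python
# Offset of each 64-byte field inside a 256-byte key-pair block.
FIELD_OFFSET = {"N": 0x00, "H": 0x40, "Ep": 0x80, "Es": 0xC0}


def sram_address(key_pair, field):
    """Return the SRAM address of the first byte of a field of a key-pair."""
    if not 0 <= key_pair <= 255:
        raise ValueError("key-pair index must be in 0..255")
    # The key-pair index supplies the high 8 address bits.
    return (key_pair << 8) | FIELD_OFFSET[field]
```

For example, `sram_address(1, "H")` gives 0x0140, matching the map above.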
Access to the SRAM is available only through the WHiSpER microprocessor interface 
and limits the host to write-only access. This prevents unauthorized software running on 
the host system from reading any of the keys. 
8.5.2 WHiSpER 
A top-level block diagram of the WHiSpER chip is shown in Figure 8.4. The function of 
each block is as follows. 
MME - Montgomery Modular Exponentiator
The MME is an implementation of the exponentiator discussed in Section 8.3. Its
architecture is shown in Figure 8.5. It consists of the three main multiplier registers X, Y and
M, and the M-residue operand register B, together with the $x(i) \cdot Y$ and $z(i) \cdot M$
multiple generation circuits, the RSD adder array, the accumulator and the carry-propagate
adder. Parallel 512-bit wide data-paths are shown as broad arrows, whilst serial 4-bit wide
data-paths are shown as thin arrows.
Figure 8.4: The WHiSpER chip.
To perform an exponentiation, the operation of the MME is as follows (assuming the
M register has been pre-loaded with the modified modulus M).
1. The X and Y registers are serially loaded with the A and H values respectively.
2. The first multiplication is performed to yield the value $B = [A \cdot R]_M$. This value
corresponds to the calculation that `collapses' down the first three calculations of
the Left-to-Right Montgomery exponentiation as detailed in Section 8.3.3.
3. The B value just generated is parallel loaded into the X and Y registers.
4. The second multiplication is performed. This corresponds to the squaring operation
for the exponent bit $e_{k-2}$. During this multiplication the B register is serially loaded
with the B value as it emerges from the X register. The X register also reloads itself
with the B value. Thus at the end of the multiplication all three registers, B, X and
Y, contain the B value.
5. If the exponent bit $e_{k-2} = 1$ then the Y register is parallel loaded with the previous
result and a multiplication is performed.
6. The following is now performed for exponent bits $e_{k-3}$ to $e_0$.
Parallel load the X and Y registers with the result of the previous multiplication.
Perform the multiplication, serially loading the B value from the B register into
the X register as the X register is consumed during the multiplication.
If $e_i = 1$ then parallel load the Y register with the result of the previous
multiplication. Perform the multiplication.
7. The Y register is parallel loaded with the result of the previous multiplication. A
multiplication is performed with the x(i) generation circuitry switched to generate
$X = 1$.
8. The X register is parallel loaded with the result of the previous multiplication.
9. The X register serially unloads the value $C = [A^E]_M$ as it is loaded with a new A
value. The Y register is serially loaded with the H value.
Figure 8.5: MME - Montgomery Modular Exponentiator.
SMC - State Machine Controller 
The SMC provides all of the control signals necessary for the operation of WHiSpER. A
block diagram of the SMC is shown in Figure 8.6. The SMC consists of the four separate
semi-autonomous state machines, LSM, TSM, ESM and RSM, together with three general
purpose counters, iC, jC and kC, and an output latch. The purpose of each state machine
is as follows (they are closely related to the four processes of Section 8.4).
• LSM - Load State Machine. The LSM is in charge of loading the multiplier M
register with the modified modulus $M = N \cdot (N')_{2^4}$. It also finds the
most-significant-bit of the current exponent.
• TSM - Transfer State Machine. The TSM handles the serial transfers into and out
of the multiplier's X and Y registers before each exponentiation starts. Once the
transfer is complete, it will start the ESM and RSM state machines.
• ESM - Exponentiation State Machine. The ESM handles the operation of the MME.
It will signal the TSM when an exponentiation is complete. The ESM can temporarily
halt the operation of the RSM via stop/go signals.
• RSM - Reduction State Machine. The RSM performs the final reduction of a result
from the MME to the range $[0, N-1]$. It informs the TSM when it has completed.
Figure 8.6: SMC - State Machine Controller.
The ESM has stop/go control over the RSM because both the exponentiation
and reduction processes need access to the SRAM as they are working. The ESM
needs to access the exponent bits whilst the RSM needs to access the modulus N. The
ESM has control over the RSM so that the exponentiation is not delayed. The reduction
process has plenty of time in which to complete, so that a few halts will not affect overall
performance.
XMS - X, M Subtractor 
When the value C is serially transferred out of the MME then if $C \ge 0$ this circuit will
serially subtract M from it. Thus a 2's complement value C* is produced in the range
$[-(M-1), -1]$. This is done so that the final reduction process can simply keep adding
N to C* until a positive value is produced. This simplifies the reduction circuitry so
that completion is detected from the sign bit of C*, that is, no long-integer comparison is
required.
URAM - U RAM
The URAM is used to hold incoming A values and also to hold the C* value while it
is being reduced to D in the range $[0, N-1]$. The D value is held here until the host
microprocessor reads it.
NAdd - N Adder
The NAdd is an 8-bit multi-precision adder that is used to reduce C* to the range $[0, N-1]$.
It does this by adding N until C* goes positive.
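The cooperation of the XMS and NAdd blocks can be illustrated with ordinary Python integers. This is only a sketch of the arithmetic: the function names are hypothetical, the byte-serial 2's complement datapath is not modelled, and the loop stops as soon as the value is non-negative (sign bit clear), which is the condition the sign-bit test detects.

```python
def xms(C, M):
    """XMS stage: make the value negative by subtracting M when C >= 0."""
    return C - M if C >= 0 else C


def nadd_reduce(c_star, N):
    """NAdd stage: add N until the sign bit clears; returns (D, additions)."""
    additions = 0
    while c_star < 0:
        c_star += N
        additions += 1
    return c_star, additions
```

Because M is a multiple of N, subtracting M does not change the residue modulo N, so the value left after the additions is the required D. With a 4-bit $(N')_{2^4}$ the loop runs at most 15 times.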
SRI - SRAM Interface 
The SRI generates the necessary control signals to operate the external SRAM device. 
NIN - Negative Inverse of N
The NIN is a 4-bit register that holds the value $N_0' = (N')_{2^4}$.
NMult - N Multiplier
The NMult is a 4-bit multi-precision multiplier used to calculate $M = N \cdot (N')_{2^4}$.
MPI - MicroProcessor Interface 
The MPI performs microprocessor control signal translation and address decoding. 
IntCtrl - Interrupt Control 
Handles enabling/disabling and setting/resetting the external INT signal.
ComStat - Command and Status 
Interprets the host microprocessor commands and provides a device status register. 
KEY - Key-pair Selection 
An 8-bit register that defines the current SRAM key-pair being used. 
EMSB - Exponent Most-Significant-Bit
A 9-bit register holding the bit-position of the current exponent's most-significant-bit.
EC - Exponent Counter 
A 9-bit counter used for keeping track of the current exponent bit under examination 
during an exponentiation. 
TC - Transfer Counter 
A 7-bit counter used when performing transfers between URAM/SRAM and multiplier 
registers. 
RC - Reduction Counter 
A 6-bit counter used when performing the multi-precision reduction of $C^* \to D$.
ClkGen - Clock Generation 
Generates all necessary clock signals for the device. 
8.6 Operation 
As stated in the previous section, the WHiSpER chip appears as a 512-byte area of memory 
to the host microprocessor system. A map of this memory is shown in Figure 8.7. As 
can be seen from this diagram, the memory is partitioned into four main areas. 
• SRAM - A 256-byte window on the SRAM memory.
• URAM - Access to the 64-byte URAM.
• Registers - Access to the 8-bit read/write KEY register, and the 8-bit read-only
STATUS register.
• Commands - The WHiSpER operational commands. A write operation to a
command's address will cause that command to execute.
Address      Contents                   Area
100 - 1FF    SRAM (00) to SRAM (FF)     SRAM
040 - 07F    URAM (00) to URAM (3F)     URAM
009          STATUS                     Registers
008          KEY                        Registers
007          RESET                      Commands
006          (no command assigned)      Commands
005          INTDIS                     Commands
004          INTEN                      Commands
003          INTACK                     Commands
002          RDC                        Commands
001          EXP                        Commands
000          LDK                        Commands
Figure 8.7: WHiSpER memory map.
8.6.1 RAM 
The SRAM Window: Address 0x100 - 0x1FF
As was stated in Section 8.5.1 the SRAM is partitioned into 256 lots of 256-byte blocks.
Each block holds an RSA key-pair and its associated constant. The SRAM window
provides access to one of these blocks. The particular block in question is determined by the
KEY register (see below).
Stored within this block are N, H, $E_p$ and $E_s$ at microprocessor interface port addresses
0x100, 0x140, 0x180 and 0x1C0 respectively. Each number is stored `little-endian', that
is, starting with the least-significant-byte first. For example, if
\[ N = [N_{t-1}, N_{t-2}, \ldots, N_0] \]
where each $N_i$ is the $i$-th byte of N such that
\[ N = \sum_{i=0}^{t-1} 2^{8i} N_i \]
then $N_0$ will be stored at 0x100, $N_1$ will be stored at 0x101 and so on.
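On the host side, the little-endian byte order described above corresponds directly to Python's built-in integer conversions. The sketch below is illustrative only; the helper names are hypothetical and the 64-byte field length is the one given for each SRAM field.

```python
def to_field_bytes(value, length=64):
    """Encode an integer as a little-endian field, least-significant byte first."""
    return value.to_bytes(length, "little")


def from_field_bytes(data):
    """Decode a little-endian field back into an integer."""
    return int.from_bytes(data, "little")
```

Byte `to_field_bytes(N)[i]` is then the value to write at window address 0x100 + i.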
The URAM: Address 0x040 - 0x07F
The URAM is a 64-byte block of RAM for writing and reading exponentiation operands 
and results respectively. As in the SRAM case, numbers are written `little-endian'. 
8.6.2 Registers 
The KEY Register: Address 0x008 
The KEY register is an 8-bit read/write register that determines which 256-byte block of
SRAM, holding $\{N, H, E_p, E_s\}$, will be used for subsequent WHiSpER operations. It also
determines the 256-byte block that will be addressed by the microprocessor port's SRAM
window. Simply stated, the contents of this register provide the high 8 bits of the SRAM
address lines.
The STATUS Register: Address 0x009 
The STATUS register is an 8-bit read-only register that allows the host microprocessor to
determine the current state of the WHiSpER chip. It has three valid status bits.
• Bit 7 - UBUSY. When this bit is active-high it signals that the URAM is unavailable
for reading/writing by the host microprocessor.
• Bit 6 - SRBUSY. When this bit is active-high it signals that the SRAM is unavailable
for writing to by the host microprocessor.
• Bit 0 - ES. Along with the KEY register this bit signals which exponent will be used
when performing an exponentiation. ES=0 means $E_p$ will be used. ES=1 means $E_s$
will be used.
The UBUSY and SRBUSY bits will only go active as a direct result of the host
microprocessor issuing a WHiSpER command. Furthermore, they go active during the command
write operation. This means that there is no danger of the host microprocessor being
blocked while part-way through reading/writing the URAM or SRAM areas. All it need
do is make sure that the appropriate bit is inactive before proceeding.
The ES bit is both set and reset with the LDK command detailed below. 
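A host program might decode the STATUS byte along the following lines. The bit positions are those listed above; the constant and function names are hypothetical.

```python
UBUSY = 1 << 7   # URAM unavailable to the host
SRBUSY = 1 << 6  # SRAM unavailable for host writes
ES = 1 << 0      # exponent select: 0 = Ep, 1 = Es


def decode_status(status):
    """Split the 8-bit STATUS value into its three valid fields."""
    return {
        "ubusy": bool(status & UBUSY),
        "srbusy": bool(status & SRBUSY),
        "exponent": "Es" if status & ES else "Ep",
    }
```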
8.6.3 Commands 
The details of each command are presented next. The action of each command is summa- 
rized, followed by a block-level description of the command's operation. Refer to Figure 8.4 
for the WHiSpER block diagram. 
The LDK Command: Address 0x000 
The LDK command is executed by performing a microprocessor write operation to this
address. It loads the multiplier's M register with the modified modulus and finds the
most-significant-bit of the selected exponent. Selection of either $E_p$ or $E_s$ is effected by
the microprocessor writing either a `0' or a `1' respectively.
On issuing this command the SRBUSY status bit will go active. Following this, the
ComStat circuit will issue a signal to the SMC that starts up the LSM state machine. LSM
will now load NIN with $N_0'$ and proceed to transfer N, using the TC, from the SRAM
via the NMult circuit so that the modified modulus M is loaded into the multiplier's M
register. After this, the LSM will preset EC to 511 and then, counting down, will examine
each bit of the selected exponent in turn until the first `1' is found. When this happens,
the EMSB register will be loaded with the EC count. The command will terminate by
clearing the SRBUSY status bit and activating the external INT signal, if enabled.
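The most-significant-bit search performed by the LSM can be mimicked in a few lines. This is a behavioural sketch only: the function name is hypothetical, and returning `None` for an all-zero exponent is a choice made here, since the text does not say what the hardware does in that case.

```python
def find_emsb(exponent, start=511):
    """Count down from bit 511 and return the position of the first '1' bit."""
    for position in range(start, -1, -1):
        if (exponent >> position) & 1:
            return position
    return None  # no '1' bit found (behaviour assumed, not from the source)
```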
The EXP Command: Address 0x001 
The EXP command is executed by performing a microprocessor write operation to this
address. It takes the current A value that the microprocessor has previously placed into
the URAM and performs an exponentiation on it. At the same time that it reads the A
value, it will place a D value from the previous exponentiation into the URAM so that
the microprocessor can read it.
On issuing this command the UBUSY and SRBUSY status bits will both go active. 
ComStat then issues a signal to the SMC that starts up the TSM state machine. The TSM 
will now transfer the next exponentiation operand A from the URAM to the multiplier's 
X register, and the constant, H, from the SRAM to the multiplier's Y register. At the 
same time, the previous exponentiation result, C, will be read from the X register and 
transferred via the XMS to produce C* which will be written into the URAM. The TSM 
will now start up the ESM and RSM state machines. 
The ESM controls the MME to perform an exponentiation as discussed in Section 8.5.2.
The RSM controls the URAM, NAdd and SRI circuits to perform the reduction of C* to
D. It does this by using multi-precision (8-bit) additions to repeatedly add N to C*, and
stops when the result of an addition is positive.
Once the RSM signals completion, the UBUSY status bit goes inactive and INT is
asserted. The URAM is now available to the host microprocessor, which may read the
result, D, of the just-reduced previous exponentiation, and then write the next
exponentiation operand, A, into this memory. If the microprocessor does this, then it should issue
another EXP or RDC command immediately, without waiting for the current EXP
operation to complete. The command will be latched by ComStat (with the UBUSY status bit
activated once more) and acted upon as soon as the current EXP operation has completed.
When the ESM signals completion, the TSM resumes control once more. It checks
to make sure that the RSM has completed, and if so then it will wait for another EXP
or RDC command. If the command is already waiting (latched by ComStat) then it will
be executed immediately. Otherwise it will wait indefinitely (with SRBUSY active and
UBUSY inactive) for one of these commands to arrive.
The RDC Command: Address 0x002 
The RDC command is executed by performing a microprocessor write operation to this
address. It places a D value from the previous exponentiation into the URAM so that the
microprocessor can read it. The RDC command should be used to terminate a string of
EXP commands.
This command should only be issued after a previous EXP command. From the above 
description of the EXP command we know that the TSM will be expecting this command 
to arrive, and that the STATUS register will have SRBUSY active and UBUSY inactive. 
On receiving this command the UBUSY status bit will go active. The result of the
previous exponentiation, C, will be transferred from the MME via the XMS to produce
C* which will be written to the URAM. The RSM will now be invoked to reduce C* to
D. Upon completion, both the UBUSY and SRBUSY status bits will go inactive and the
INT signal will be asserted. The TSM will now return to its idle state and so the entire
chip will be at rest.
The INTACK Command: Address 0x003 
The INTACK command is executed by performing a microprocessor write operation to 
this address. If the external INT signal is active then this command will deactivate it. 
The INTEN Command: Address 0x004 
The INTEN command is executed by performing a microprocessor write operation to this
address. It enables the operation of the external INT signal. After a device reset, the INT
signal is disabled, and this command must be used to enable it.
The INTDIS Command: Address 0x005
The INTDIS command is executed by performing a microprocessor write operation to this
address. It disables the operation of the INT signal. If the INT signal is active then it
will be deactivated.
The RESET Command: Address 0x007 
The RESET command is executed by performing a microprocessor write operation to this
address. It performs a device reset. After a reset, the chip will be in an idle state awaiting
further commands. The KEY and STATUS registers will be cleared. The INT signal will
be disabled. The SRAM device should still contain its pre-reset information. The URAM
should be assumed to contain garbage. All other device registers should be assumed to
contain garbage.
8.6.4 Operation Examples 
Single Exponentiation 
The following example shows how to perform the single exponentiation $D = (A^E)_N$ using
STATUS register polling (non-interrupt operation).
1. Power-up reset. Interrupts disabled, KEY = 0, STATUS = 0.
2. Write N, H and E to the SRAM window.
N → 0x100
H → 0x140
E → 0x180
3. Write LDK(0) command.
4. Poll STATUS until SRBUSY goes inactive low. 
5. Write A to URAM. 
6. Write EXP command. 
7. Poll STATUS until UBUSY goes inactive low. 
8. Write RDC command. 
9. Poll STATUS until UBUSY goes inactive low. 
10. Read D from URAM. 
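The ten steps above translate into a short host routine, sketched here in Python. Everything outside the addresses and bit positions of Figure 8.7 is an assumption: the `bus` object with `read(addr)` and `write(addr, value)` methods is hypothetical, and `FakeWhisper` is a deliberately crude, instantaneous stand-in (it never reports busy and computes the result with Python's `pow`) included only so that the sequence can be exercised.

```python
# Host-visible addresses from the memory map of Figure 8.7.
LDK, EXP, RDC, STATUS = 0x000, 0x001, 0x002, 0x009
URAM, SRAM_WINDOW = 0x040, 0x100
UBUSY, SRBUSY = 0x80, 0x40            # STATUS bits 7 and 6


class FakeWhisper:
    """Hypothetical instantaneous stand-in for the chip (never reports busy)."""

    def __init__(self):
        self.mem = bytearray(0x200)
        self.es = 0
        self.result = 0

    def _load(self, base):
        return int.from_bytes(self.mem[base:base + 64], "little")

    def _store(self, base, value):
        self.mem[base:base + 64] = value.to_bytes(64, "little")

    def read(self, addr):
        return 0 if addr == STATUS else self.mem[addr]

    def write(self, addr, value):
        if addr == LDK:
            self.es = value & 1                      # 0 selects Ep, 1 selects Es
        elif addr == EXP:
            a = self._load(URAM)
            n = self._load(SRAM_WINDOW)
            e = self._load(SRAM_WINDOW + (0xC0 if self.es else 0x80))
            self._store(URAM, self.result)           # previous result appears
            self.result = pow(a, e, n)
        elif addr == RDC:
            self._store(URAM, self.result)
        elif addr >= URAM:
            self.mem[addr] = value                   # plain URAM/SRAM byte write


def put(bus, base, value):
    """Write a 64-byte little-endian number starting at base."""
    for i, byte in enumerate(value.to_bytes(64, "little")):
        bus.write(base + i, byte)


def wait_clear(bus, mask):
    """Poll STATUS until the given busy bit goes inactive low."""
    while bus.read(STATUS) & mask:
        pass


def single_exponentiation(bus, n, h, e, a):
    put(bus, SRAM_WINDOW + 0x00, n)                  # step 2
    put(bus, SRAM_WINDOW + 0x40, h)
    put(bus, SRAM_WINDOW + 0x80, e)
    bus.write(LDK, 0)                                # step 3: LDK(0) selects Ep
    wait_clear(bus, SRBUSY)                          # step 4
    put(bus, URAM, a)                                # step 5
    bus.write(EXP, 0)                                # step 6
    wait_clear(bus, UBUSY)                           # step 7
    bus.write(RDC, 0)                                # step 8
    wait_clear(bus, UBUSY)                           # step 9
    data = bytes(bus.read(URAM + i) for i in range(64))
    return int.from_bytes(data, "little")            # step 10
```

On real hardware the `bus` methods would wrap whatever port or memory access the host system provides, and the polling loops would do the actual waiting.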
Multiple Exponentiation 
The diagram of Figure 8.8 is a state-transition diagram showing how to perform the 
exponentiation of a sequence of As to yield their respective sequence of Ds. The sequence 
of As is denoted as 
A[1], A[2], ... , A[n] 
The process is interrupt driven, and it is assumed that the SRAM has been previously 
filled with keys. 
Figure 8.8: Multiple exponentiation state transition diagram. (States shown include Set KEY Reg, INTEN, A[i] → URAM, EXP, INTACK, RDC and INTDIS, with transitions on INT.)
8.7 Performance 
Based on the WHiSpER schematics of Appendix B and the LSM, TSM, ESM and RSM 
state machine diagrams of Appendix A (noting that the SMC clock signal oscillates at half 
the frequency of the main clock signal), then the expected performance of the WHiSpER 
chip can be summarized as follows. 
8.7.1 Key Load Process 
The number of clock cycles required by the LDK command is the sum of the cycles required
to load M into the multiplier and to find the most-significant-bit of the selected exponent.
They are
M load cycles = 1048
$e_{msb}$ search cycles = 12 to 5102
8.7.2 Transfer Process 
Assuming the RSM does not require more cycles than the ESM to complete, then the 
number of clock cycles required by the TSM to transfer operands and results into and out 
of the multiplier's registers is 
Transfer cycles = 1036 
8.7.3 Exponentiation Process 
The number of multiplications that have to be performed for each exponentiation is
1. Pre-conversion and $e_{msb}$: 1 multiplication.
2. Processing for $e_{msb-1} \ldots e_0$: average of 1½ multiplications per bit.
3. Post-conversion: 1 multiplication.
For a 506-bit exponent this gives an average number of 760 multiplications. With the
number of cycles per multiplication equal to 148, this gives
Exponentiation cycles = 112480
8.7.4 Reduction Process 
The multi-precision addition of N to C* by the RSM state machine is performed in 16-byte
blocks. It therefore takes 4 of these blocks to add a single N. Since we have a maximum
of $15 \cdot N$ to add, then a maximum of 60 blocks will be needed. Simulations have confirmed
that a minimum reduction rate of 1 block per multiplication is achieved by the RSM, and
so therefore any exponentiation where the exponent is greater than 60 bits will allow the
reduction to take place in parallel with the exponentiation, i.e. the reduction process will
not add to the exponentiation time.
8.7.5 RSA Throughput 
Assuming that the key has been loaded by the LDK command, then the throughput of the 
WHiSpER chip is defined to be the time taken to execute consecutive EXP commands. 
The number of cycles required to complete an EXP command is the sum of the cycles 
required by the transfer and exponentiation processes. 
With a clock rate of 25MHz, then the clock period is 40ns and we have 
Average EXP time = (1036 + 112480) · 40ns = 4.541ms
which gives 
Average EXPs per second = 220 
which, for a 506-bit key, gives a throughput of 
Average throughput = 111kbps 
For the worst-case exponent (where all exponent bits are '1') we need 1012 multipli- 
cations and the figures are 
Maximum EXP time = 6.032ms 
Minimum EXPs per second = 165 
Minimum throughput = 83kbps 
and so we see that this is still greater than the threshold rate of 64kbps. 
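The average and worst-case figures above can be reproduced with a few lines of integer arithmetic. The constants are those quoted in Sections 8.7.2, 8.7.3 and 8.7.5; the function name is hypothetical, and the exponentiation rate is rounded down to a whole number as in the text.

```python
CLOCK_PERIOD_NS = 40      # 25MHz clock
TRANSFER_CYCLES = 1036
CYCLES_PER_MULT = 148
KEY_BITS = 506


def exp_figures(multiplications):
    """Return (EXP time in ns, whole EXPs per second, throughput in bit/s)."""
    cycles = TRANSFER_CYCLES + multiplications * CYCLES_PER_MULT
    time_ns = cycles * CLOCK_PERIOD_NS
    exps_per_second = 1_000_000_000 // time_ns
    return time_ns, exps_per_second, exps_per_second * KEY_BITS


average = exp_figures(760)    # average-case exponent
worst = exp_figures(1012)     # all exponent bits '1'
```

Dividing the last element of each tuple by 1000 gives the throughput in kbps.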
8.7.6 Gate-Array Selection 
From Section 8.3.2 we have the size of the MME circuit set at approximately 54000 gates. 
Assuming the remaining control circuitry occupies less than 30% of the area of the MME, 
then the total number of gates in the WHiSpER chip will be approximately 70000. 
The conventional rule for gate-array design is that, for the design to be easily mapped 
into the array, it should not use more than 50% of the array's available gates. Applying 
this rule to the WHiSpER design suggests a gate-array with at least 140000 gates. Such 
an array is the CLA77XXX which has 181260 available gates. Thus the WHiSpER design 
would use approximately 40% of the array's gates, allowing room for manoeuvre in the
layout process.
8.7.7 Power Consumption 
In order to calculate the expected power consumption of the WHiSpER chip we need to
estimate the number of gates that change state during each cycle of the exponentiation
process. Looking at the MME we see that the Y and M registers will remain constant but
the X and B registers will change every cycle. It is also likely that the $x(i) \cdot Y$ and $z(i) \cdot M$
generation circuits, the adder array, the accumulator and the carry-propagate adder will
all change state on each cycle. This gives
\[ 4\,\Omega_{MUX} + 4\,\Omega_{FA} + 4\,\Omega_{FF} + \Omega_{ADT}/8 = 91 \]
gates per bitslice changing state on each cycle. With 518 bitslices, this gives approximately
47000 gates changing state on each cycle. Using the figure of 7µW/MHz power dissipation
per gate for the CLA70000 gate-array, and also using the rule-of-thumb from [83] that 15%
of the control circuitry changes state on each cycle, then we have a total power dissipation
of
Power Dissipation @ 25MHz ≈ 9 Watts
a not inconsiderable figure.
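As a rough cross-check of the 9 Watt figure, the estimate can be worked through as below. The control-circuit gate count of 16000 is an assumption made here (the difference between the 70000-gate total and the 54000-gate MME quoted in Section 8.7.6); the text does not state the count that was actually used.

```latex
\begin{align*}
P_{\mathrm{MME}} &\approx 47000 \times 7\,\mu\mathrm{W/MHz} \times 25\,\mathrm{MHz} \approx 8.2\,\mathrm{W} \\
P_{\mathrm{control}} &\approx 0.15 \times 16000 \times 7\,\mu\mathrm{W/MHz} \times 25\,\mathrm{MHz} \approx 0.4\,\mathrm{W} \\
P_{\mathrm{total}} &\approx 8.2\,\mathrm{W} + 0.4\,\mathrm{W} \approx 8.6\,\mathrm{W}
\end{align*}
```

Under these assumptions the total comes to roughly 8.6 W, which is consistent with the rounded figure of 9 Watts.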
The packaging technology used for the WHiSpER chip therefore needs to be of the 
Power type (see Section 8.1). Also, it would seem advisable that any implementation of 
the chip use a package mounted heatsink and fan combination, similar to the units used 
by modern high-performance microprocessors. 
8.8 Testability 
Post-fabrication testing of an integrated circuit involves applying a set of test stimuli (the 
`testvectors') to the device's inputs and monitoring the state of the device's outputs. The 
idea is that the testvectors will cause a state-change on the output of every gate in the 
device, and that these changes will, ultimately, be detectable on the device outputs, i.e.
if there is a fault in the device such that a particular gate does not change state when it 
should do, then either it is immediately detectable, or else its effect will be propagated 
to other parts of the circuit where, sooner or later, it will show itself as an unexpected 
state-change on a device output. 
The goal of full testability is to create both a design and a set of testvectors such that 
all possible fabrication faults can be detected in as short a time as possible. In practice, 
the percentage of detectable faults (the fault coverage) is rarely 100%, but a minimum 
coverage level of 95% is often quoted. 
The testing of the WHiSpER chip is twofold.
• A signature analysis technique is used for testing the MME circuit.
• All other circuitry is tested by exhaustive stimuli.
The signature analysis technique lets the MME perform a small exponentiation with known
input operands and then examines the result. The justification behind this approach is
that,
1. it can easily be ensured that all MME gates change state during the exponentiation,
and
2. due to the nature of modular arithmetic, a single fault affecting one bit of the
exponentiator circuit will very likely affect other bits too. This is simply a consequence of
the diffusion property of modular arithmetic that makes it attractive for cryptologic
purposes.
Thus we see that it is very likely that a fault in the MME will show up in the result of an 
exponentiation. 
The remaining circuitry of the WHiSpER chip can be tested by creating a set of test
vectors that toggle each node within the design, i.e. registers are loaded with
complementary values, state machines have all possible state transitions exercised, etc.
To increase the observability of potential faults within the design, several auxiliary
device inputs and outputs have been included. For full details see Appendix B, but in
essence the extra facilities offered are controlled by a TEST input signal which, when made 
active high, allows the examination of internal control and data signals on device output 
pins. 
Although not directly concerned with fault-testing, another facility provided by the
WHiSpER design is an externally programmable delay time for the MME's carry-propagation
adder. The rationale behind this is that, since the carry-propagation adder is constructed
from a chain of 64 8-bit ADT Macros, then a slight deviation in $D_{ADT}$ from that expected
could lead to a large difference in total addition time. The RAADsel(1:0) signal selects
between four alternative delay times (the default is the shortest with RAADsel = 00) and its
presence in the design is purely a precautionary measure that will also allow full evaluation
of the prototypes, i.e. allow the performance of the RSD adder array to be determined
separately from that of the carry-propagate adder.
8.9 The WHiSpER PC-Card 
Using the WHiSpER chip in conjunction with an IBM PC compatible requires that an 
interface card be built that slots into the PC's XT-bus. Specifications for the XT-bus can 
be found in [87]. 
The XT-bus address map is shown in Figure 8.9. Due to the fact that early interface
cards used only the lower 10 bits for address decoding (ignoring the high 6 bits) then for
interface cards that require more than a few port addresses it is necessary to implement
the interface card address decoding in a non-obvious manner.
Figure 8.9: XT-bus address map. (XT-bus address lines 15 to 0: High Address Decoding, 64 Slots; Low Address Decoding, 512 Ports.)
The WHiSpER chip needs a 512-byte address space, which requires 9 address bits. We
construct these 9 bits from the top 6 bits and the bottom 3 bits of the XT-bus address.
Card access is then decoded from XT-bus address bits 9 down to 3 by DIP-switches. Thus,
when looking at just the lower 10 bits of XT-bus address, the WHiSpER PC-Card will
appear as a block of 8 consecutive port addresses at some location defined by the
DIP-switch settings. However, access to the entire WHiSpER memory map is possible because
the interface card also decodes the upper 6 bits of address. In summary, if A15 ... A0 refer
to the XT-bus address lines, then address decoding is performed as follows,
• A9 must be a '1',
• A8 ... A3 are compared with the DIP-switch,
• A15 ... A10, A2 ... A0 map to WHiSpER address lines Addr(9:0).
Access to the card from software is performed by x86 assembler 'in' and 'out' instructions
(or HLL library functions).
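The decoding rules above can be illustrated with a short software model. This is only a sketch under stated assumptions: the function name, the encoding of the DIP-switch as a 6-bit integer, and the ordering of the combined address bits (A15 ... A10 as the upper six bits, A2 ... A0 as the lower three) are ours and are not taken from the card schematic.

```python
def decode_xt_address(addr, dip_switch):
    """Model of the PC-Card address decoder (hypothetical helper).

    addr       -- 16-bit XT-bus port address, A15..A0
    dip_switch -- assumed 6-bit DIP-switch value compared against A8..A3
    Returns the 9-bit WHiSpER address, or None if the card is not selected.
    """
    a9 = (addr >> 9) & 0x1        # A9 must be '1'
    a8_a3 = (addr >> 3) & 0x3F    # A8..A3 are compared with the DIP-switch
    if a9 != 1 or a8_a3 != (dip_switch & 0x3F):
        return None
    high = (addr >> 10) & 0x3F    # A15..A10 form the upper six address bits
    low = addr & 0x7              # A2..A0 form the lower three address bits
    return (high << 3) | low
```

With the DIP-switch set to 0x20 in this model, the card occupies the eight ports starting at 0x300 in the lower 10 bits, and the upper six bits select one of 64 such 8-byte windows.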
The circuit diagram for the WHiSpER PC-Card is shown in Figure 8.10. 
Figure 8.10: The WHiSpER PC-Card schematic. [Legible component labels: 74HCT373 latch,
74HCT688 comparator, 74LS266 open-collector quad XNOR, 6-bit DIP-switch with pull-up
resistors, 74HCT245 bidirectional buffer, 25MHz clock, 6264 150ns static RAM.]
8.10 Summary 
This chapter has presented the design of the WHiSpER chip. Selection of the appropri- 
ate Montgomery multiplier circuit was performed based on the technology issues of the 
CLA70000 series gate array. Furthermore, the exponentiation algorithm chosen favoured 
reduced circuit area and efficiency over raw speed. The architecture of the chip was 
described, showing how the use of operand/result double-buffering techniques and an ex- 
ternal SRAM device leads to an efficient, high-throughput device. This was followed by a 
description of the device's operation and an analysis of its expected performance. Finally, 
the details of a simple WHiSpER based IBM PC card were given. 
Chapter 9 
Conclusions 
In the last chapter the WHiSpER chip was presented, and it was shown that this chip 
could perform RSA exponentiations at a rate of over 100kbps for moduli of up to 506 bits 
in length. 
In this chapter the thesis is brought to a conclusion by discussing
1. how the WHiSpER chip can be used in RSA cryptosystems with moduli of around
1000 bits in length,
2. how the WHiSpER chip can be used to generate RSA keys, and finally
3. the achievements of the work presented here, followed by ideas for
further work.
9.1 The WHiSpER Chip and Extended Moduli 
A method for speeding up the RSA cryptosystem that has been presented in many pub- 
lications (see for example [79]) relies upon the use of the Chinese Remainder Theorem 
(CRT) for exponentiation with the secret key, and using a small preset exponent in the 
public key. 
The use of the CRT requires knowledge of the prime factors of N. However, since the 
CRT is used in the secret exponentiation then it is not unreasonable to assume that the 
prime factors of N are also available to the entity that has access to the secret exponent. 
Indeed, in [10] it is shown that the prime factors of N can be easily calculated if the public 
and secret exponents are known. 
The use of a small public exponent, such as the number 3, is acceptable [88] so long 
as certain measures are incorporated in the cryptosystem protocol (see [1]). For example, 
a small field in the pre-exponentiation framing operation is required to ensure that the 
probability of two messages being identical is sufficiently low. 
9.1.1 CRT Exponentiation Using The WHiSpER Chip 
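The RSA modulus N is the product of two prime numbers P and Q. As we saw in
Section 2.6.5 we can perform an exponentiation modulo N by performing exponentiations
modulo P and Q followed by two multiplications modulo N. That is,
\[
\langle A^E \rangle_N = \Big\langle \big\langle (A_P)^E \big\rangle_P \cdot \langle Q^{-1} \rangle_P \cdot Q
  + \big\langle (A_Q)^E \big\rangle_Q \cdot \langle P^{-1} \rangle_Q \cdot P \Big\rangle_N
\]
where
\[
A_P = \langle A \rangle_P, \qquad A_Q = \langle A \rangle_Q .
\]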
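Now, since P and Q are prime, then from Fermat's Little Theorem (Theorem 4) we
have
\[
\big\langle (A_P)^E \big\rangle_P = \big\langle (A_P)^{\langle E \rangle_{P-1}} \big\rangle_P, \qquad
\big\langle (A_Q)^E \big\rangle_Q = \big\langle (A_Q)^{\langle E \rangle_{Q-1}} \big\rangle_Q .
\]
So defining
\[
E_P = \langle E \rangle_{P-1}, \qquad E_Q = \langle E \rangle_{Q-1}
\]
and
\[
D_P = \big\langle (A_P)^{E_P} \big\rangle_P, \qquad D_Q = \big\langle (A_Q)^{E_Q} \big\rangle_Q
\]
then we have
\[
D = \langle A^E \rangle_N = \langle D_P \cdot Q'_P + D_Q \cdot P'_Q \rangle_N
\]
where
\[
Q'_P = \langle Q^{-1} \rangle_P \cdot Q, \qquad P'_Q = \langle P^{-1} \rangle_Q \cdot P
\]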
are pre-computed constants. For N and E around 1000 bits, and so assuming P and Q
of around 500 bits, the above calculation of D can roughly halve the number of
multiplications involved compared to non-CRT methods. Furthermore, since the calculations
of $D_P$ and $D_Q$ involve only 500-bit multiplications, each multiplication takes
approximately half the time of a non-CRT multiplication on a hardware multiplier. (In
software, using multi-precision arithmetic techniques, each multiplication would take
approximately one quarter of the time.)
Thus, with a WHiSpER and host computer combination, the CRT method could be 
used with moduli whose prime factors are each less than 506 bits in length, as follows. 
1. Pre-calculate on the host
\[
M_P = P \cdot \langle -P^{-1} \rangle_{\ldots}, \qquad M_Q = Q \cdot \langle -Q^{-1} \rangle_{\ldots}
\]
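\[
\begin{aligned}
E_P &= \langle E \rangle_{P-1} \\
E_Q &= \langle E \rangle_{Q-1} \\
H_P &= \langle 2^{1024} \rangle_{M_P} \\
H_Q &= \langle 2^{1024} \rangle_{M_Q} \\
Q'_P &= \langle Q^{-1} \rangle_P \cdot Q \\
P'_Q &= \langle P^{-1} \rangle_Q \cdot P
\end{aligned}
\]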
2. Load P, $H_P$ and $E_P$ into the first SRAM key storage bank. Load Q, $H_Q$ and $E_Q$
into the second SRAM key storage bank.
3. To perform the exponentiation $D = \langle A^E \rangle_N$ then
(a) calculate $A_P$ and $A_Q$ on the host,
(b) write $A_P$ to WHiSpER and issue LDK, EXP and RDC commands for first key
store,
(c) read $D_P$ from WHiSpER,
(d) write $A_Q$ to WHiSpER and issue LDK, EXP and RDC commands for second
key store,
(e) read $D_Q$ from WHiSpER,
(f) calculate $D = \langle D_P \cdot Q'_P + D_Q \cdot P'_Q \rangle_N$ on the host.
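The host-side arithmetic of the steps above can be sketched in software. This is a minimal illustration only: `whisper_exp` is a hypothetical stand-in that replaces the chip's LDK, EXP and RDC sequence with Python's built-in modular `pow`, the Montgomery constants M and H are omitted because they are internal to the chip, and all function names are our own. Python 3.8 or later is assumed for the modular inverse `pow(x, -1, m)`.

```python
def whisper_exp(a, e, n):
    """Stand-in for one LDK/EXP/RDC sequence on the chip (hypothetical)."""
    return pow(a, e, n)

def crt_precompute(p, q, e):
    """Step 1 constants, excluding the Montgomery constants M and H."""
    e_p = e % (p - 1)              # E_P = <E>_(P-1)
    e_q = e % (q - 1)              # E_Q = <E>_(Q-1)
    q_dash = pow(q, -1, p) * q     # Q'_P = <Q^-1>_P * Q
    p_dash = pow(p, -1, q) * p     # P'_Q = <P^-1>_Q * P
    return e_p, e_q, q_dash, p_dash

def crt_exponentiate(a, p, q, consts):
    """Step 3: two half-size exponentiations recombined modulo N = P*Q."""
    e_p, e_q, q_dash, p_dash = consts
    a_p, a_q = a % p, a % q                  # (a) reductions on the host
    d_p = whisper_exp(a_p, e_p, p)           # (b), (c) first key store
    d_q = whisper_exp(a_q, e_q, q)           # (d), (e) second key store
    return (d_p * q_dash + d_q * p_dash) % (p * q)   # (f) recombination
```

The exponent reduction relies on Fermat's Little Theorem, so the sketch assumes that A is not a multiple of P or Q.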
For maximum throughput the host should perform the calculations of $A_P$, $A_Q$ and D in
parallel with the WHiSpER exponentiations. To see that this is possible we can examine
the software modular multiplication implementations of Comba [34] and Bong and Ruland [35].
Bong shows that an 8MHz 80286 PC can perform approximately 200 512-bit modular
multiplications per second. Since doubling the modulus size approximately quadruples
the calculation time of multi-precision arithmetic routines, this corresponds to
approximately 50 1024-bit multiplications per second. Using a mixture of 512-bit and
1024-bit operations, it should be possible to perform 100 512-bit multiplications plus
25 1024-bit multiplications per second. Assuming that a 100MHz Pentium processor is at
least 10 times faster than an 8MHz 80286, a modern PC should be able to perform
approximately 1000 512-bit and 250 1024-bit multiplications in one second. Referencing
Section 8.7 we see that the WHiSpER chip performs approximately 220 exponentiations
per second. Therefore there should be ample time for the host device to perform two
500-bit reductions and two 1000-bit multiplications in the time it takes the WHiSpER
chip to perform two exponentiations. Thus the throughput of the CRT method will be
limited by the WHiSpER chip.
The time taken by the WHiSpER chip to perform two exponentiations with different
moduli is equal to twice the time required to execute the LDK, EXP and RDC commands.
Approximating the figures of Section 8.7 we have
key load       =   5000 cycles
transfer       =   1000 cycles
exponentiation = 112000 cycles
reduction      =   7000 cycles
which gives a total of approximately 125000 cycles. With the WHiSpER chip clocked
at 25MHz this corresponds to approximately 200 exponentiations per second. Since we
need two WHiSpER exponentiations per modulo N exponentiation this corresponds to
approximately 100 1000-bit exponentiations per second. In other words, for keys around
1000 bits in size,
RSA throughput with CRT = 100kbps
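The arithmetic behind this figure can be reproduced directly. The sketch below simply restates the approximate cycle counts quoted above; the names are ours and the numbers are the rounded values from the text, not measured data.

```python
CLOCK_HZ = 25_000_000

# Approximate cycle counts per LDK/EXP/RDC sequence, as quoted in the text.
CYCLES = {
    "key load": 5_000,
    "transfer": 1_000,
    "exponentiation": 112_000,
    "reduction": 7_000,
}

def crt_throughput_bps(modulus_bits=1000):
    """Estimated RSA throughput when two chip exponentiations are needed."""
    cycles_per_exp = sum(CYCLES.values())        # about 125000 cycles
    chip_exps_per_s = CLOCK_HZ / cycles_per_exp  # about 200 per second
    rsa_exps_per_s = chip_exps_per_s / 2         # two per modulo N result
    return rsa_exps_per_s * modulus_bits
```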
9.1.2 Host Exponentiation with a Small Exponent 
As we saw above, it should be possible to perform 500 1000-bit modular multiplications
per second on a fast PC. To achieve a 100kbps throughput rate for RSA with a small public
exponent, the exponent $E_P$ will be limited to the following values (remembering that
the exponent must be odd)
\[
E_P \in \{3, 5, 7, 9, 11, 13, 17\}
\]
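One way to arrive at this set, offered as our own reading rather than as the original derivation, is to note that 500 multiplications per second shared between 100 exponentiations per second leaves a budget of 5 multiplications per exponentiation. Counting the squarings and multiplications of plain binary exponentiation (an assumption about the method used on the host) then selects exactly the odd exponents listed.

```python
def binary_exp_multiplications(e):
    """Modular multiplications used by binary exponentiation for exponent e."""
    squarings = e.bit_length() - 1        # one per bit after the leading bit
    multiplies = bin(e).count("1") - 1    # one per further set bit
    return squarings + multiplies

def affordable_exponents(budget, limit=64):
    """Odd exponents from 3 up to limit whose cost fits the budget."""
    return [e for e in range(3, limit, 2)
            if binary_exp_multiplications(e) <= budget]
```

For example, the exponent 15 needs 3 squarings and 3 multiplications, one more than the budget allows, whereas 17 needs 4 squarings and 1 multiplication.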
Recalling from Section 2.7 that the exponents must be coprime to $\Phi(N)$, the choice
of which exponent to use depends on which of them is coprime to both P-1 and Q-1.
To ensure that one of them is coprime it may be necessary to create the primes P and Q
with this condition in mind. This can be done quite simply using Euclid's algorithm with
each candidate prime in the key generation process.
Thus it is possible to use the WHiSpER chip and a fast PC to implement the RSA
cryptosystem with key lengths around 1000 bits and still achieve encryption rates of
approximately 100kbps.
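The coprimality check described above is a one-line use of Euclid's algorithm. The following sketch, with names of our own choosing, filters the small exponent set against a candidate pair of primes using the standard library `math.gcd`.

```python
from math import gcd

SMALL_EXPONENTS = (3, 5, 7, 9, 11, 13, 17)

def usable_exponents(p, q):
    """Small exponents that are coprime to both P-1 and Q-1."""
    return [e for e in SMALL_EXPONENTS
            if gcd(e, p - 1) == 1 and gcd(e, q - 1) == 1]
```

During key generation a candidate prime could be rejected, or kept with a different exponent, according to whether this list is empty.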
9.2 The WHiSpER Chip and Key Generation 
As was shown in Section 2.7, RSA key generation is essentially prime number generation. 
The technique most commonly used to generate the large prime numbers needed for RSA 
keys is the primality testing of large odd random numbers. An efficient primality testing 
algorithm is that of Rabin in [89]. The algorithm is probabilistic in nature, which means 
that there is a small probability that the algorithm will declare an integer to be prime 
when it is not. However, this probability can be reduced to an arbitrarily small level with 
enough applications of the algorithm. 
9.2.1 Rabin's Primality Test 
In [89] Rabin shows how to test an arbitrary odd integer P for primality. First, write P
as
\[
P = 2^f \cdot E + 1
\]
where E is an odd number and $f \geq 1$. A random number $A \in [0, P-1]$, called a witness
to the primality or otherwise of P, is then used according to the pseudo-code algorithm
shown in Figure 9.1. The algorithm may return FALSE or TRUE, and their meanings
are as follows,
    S := <A^E>_P
    IF (S = 1) OR (S = P-1)
        RETURN TRUE
    ENDIF
    FOR j = 1 TO f
        S := <S^2>_P
        CASE OF
            S = 1   : RETURN FALSE
            S = P-1 : RETURN TRUE
        ENDCASE
    ENDFOR
    RETURN FALSE
Figure 9.1: Rabin's primality test.
• FALSE: P is definitely composite,
• TRUE: P is probably prime.
The probability that the algorithm returns TRUE when P is really composite is at most
1 in 4. In practice, a sequence of randomly chosen witnesses
A[1], A[2], ..., A[n]
is used to test the primality of P. If Rabin's test returns TRUE for every A[i], then the
probability that we are wrong in declaring P to be a prime is at most $2^{-2n}$.
For example, with 10 randomly generated witnesses A[1], A[2], ..., A[10] in the range
[0, P-1], if the algorithm returns TRUE when applied to each A[i] in turn, then the
probability that P is actually composite is approximately one in a million.
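A direct software transcription of Figure 9.1 is sketched below, using the same loop bound of f squarings as the figure. The function names are ours. As a deliberate deviation, and an assumption on our part, witnesses are drawn from [2, P-2] rather than [0, P-1], because the trivial witnesses 0, 1 and P-1 carry no information about P.

```python
import random

def rabin_test(p, a):
    """One application of Figure 9.1 to an odd integer p > 2 with witness a."""
    f, e = 0, p - 1
    while e % 2 == 0:          # write p = 2^f * e + 1 with e odd
        e //= 2
        f += 1
    s = pow(a, e, p)
    if s == 1 or s == p - 1:
        return True            # probably prime
    for _ in range(f):
        s = (s * s) % p
        if s == 1:
            return False       # definitely composite
        if s == p - 1:
            return True        # probably prime
    return False               # definitely composite

def is_probable_prime(p, n_witnesses=10, rng=random):
    """Repeat rabin_test with random witnesses; p must be odd and above 3."""
    return all(rabin_test(p, rng.randrange(2, p - 1))
               for _ in range(n_witnesses))
```

With the default of 10 witnesses this corresponds to the one-in-a-million example given above.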
9.2.2 Primality Testing on WHiSpER 
Using Rabin's primality test with the WHiSpER chip it is possible to test odd integers of 
up to 506 bits for primality. 
With $P = 2^f \cdot E + 1$ then Rabin's test requires an exponentiation and f squarings
modulo P. To do this on the WHiSpER chip requires pre-calculating
\[
M = P \cdot \langle -P^{-1} \rangle_{\ldots}
\]
and loading an SRAM key bank with
\[
\begin{aligned}
N &= P \\
H &= \langle 2^{1024} \rangle_M \\
E_0 &= E \\
E_1 &= 2
\end{aligned}
\]
The first exponentiation in Rabin's test is then started, after A[1] has been loaded into
the URAM, by issuing the commands LDK(0), EXP and RDC. Further squarings are then
performed by the commands LDK(1), EXP, EXP, ..., RDC until the algorithm terminates.
After each exponentiation the result can be read from the URAM.
If the algorithm returns TRUE for A[1] then A[2] is tested, if this returns true then 
A[3] is tested and so on until either the algorithm returns FALSE for some A[i] or the 
sequence of witnesses is exhausted. In the latter case P is declared to be prime. 
To estimate the time required to generate a k-bit prime number we must first estimate 
the time required to execute Rabin's test for one witness A. To do this we need to know 
the average number of squarings that will be performed for randomly chosen P. 
Consider a randomly chosen odd number P, then
• there is a probability of 1/2 that the bit-vector of P will end in '11',
• there is a probability of 1/4 that the bit-vector of P will end in '101',
• there is a probability of 1/8 that the bit-vector of P will end in '1001',
and so on. For Rabin's test this means that
• there is a probability of 1/2 that the test will require one exponentiation and one
squaring operation,
• there is a probability of 1/4 that the test will require one exponentiation and two
squaring operations,
• there is a probability of 1/8 that the test will require one exponentiation and three
squaring operations,
and so on. In general this implies that for randomly chosen k-bit odd integers the number
of squaring operations required by the algorithm will, on average, be
\[
\sum_{i=1}^{k-1} \frac{i}{2^i}
\]
which quickly converges to the value 2.
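As a check on this limit (our own working, not taken from the original derivation), the finite sum has a standard closed form,
\[
\sum_{i=1}^{k-1} \frac{i}{2^i} = 2 - \frac{k+1}{2^{k-1}}
\]
so that for k = 4 the sum is already 11/8 and for k = 500 the difference from 2 is negligible.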
Thus, on average, we will have to perform one exponentiation and two squaring op- 
erations when applying Rabin's test to randomly chosen integers. If these operations 
are performed by the WHiSpER chip for many prime candidates during the search for a 
prime number, and with the calculation of M and H for each candidate performed by the 
host in parallel with WHiSpER operation, then the time required to test each candidate 
will be the time required by the commands LDK(0), EXP, RDC, LDK(1), EXP, EXP
and RDC. For P around 500 bits this requires approximately 120000 cycles of WHiSpER's
clock, which at 25MHz corresponds to approximately 5ms. Therefore approximately 200
primality tests can be performed per second.
If it is assumed that all composite numbers fail the primality test on at most the second
witness (as is most often seen in practice), and using the prime distribution approximation
that $1/\log_e P$ is the probability that a number of the size of P is prime, then we should
be able to discover a 500-bit prime with fewer than 350 primality tests. This corresponds
to less than 2 seconds using the WHiSpER chip. A 1000-bit RSA key requires two such
primes, and therefore it should be possible to generate a 1000-bit RSA key in less than 4
seconds.
For 500-bit RSA keys the time required by the WHiSpER chip to perform 250-bit
primality tests reduces to less than 3ms. Thus over 350 such tests can be performed per
second. Since 250-bit primes are about twice as common as 500-bit primes, fewer
than 175 tests should be needed to find a 250-bit prime. Thus it should be possible to
generate a 500-bit RSA key in 1 second.
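The key generation estimates above follow from two numbers: the rate of primality tests and the expected number of candidates per prime. The sketch below reproduces the 1000-bit case under the same simplifying assumptions as the text (one test per candidate, a prime density of 1/log_e P); the function name and return layout are ours.

```python
import math

def keygen_estimate(prime_bits, cycles_per_test, clock_hz=25e6):
    """Return (tests per second, expected tests per prime, seconds per key)."""
    tests_per_s = clock_hz / cycles_per_test
    expected_tests = prime_bits * math.log(2)   # log_e(2^bits) candidates
    seconds_per_prime = expected_tests / tests_per_s
    return tests_per_s, expected_tests, 2 * seconds_per_prime
```

For 500-bit primes and 120000 cycles per test this gives roughly 208 tests per second, about 347 expected tests per prime and a little over 3 seconds for the two primes of a 1000-bit key, consistent with the bounds quoted above.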
9.3 Achievements 
The achievements of this project are threefold: a new algorithm, an efficient architecture
and the WHiSpER chip.
9.3.1 A New Algorithm 
The MMDDAMMM algorithm allows fast and efficient Montgomery multipliers to be
realised in VLSI hardware. It has two clear advantages over other algorithms,
• the calculation of the modulus multiple, $Z_i \cdot M$, that has to be added to the partial
result during each iteration - traditionally the bottleneck to processing speed in
modular multiplication - has been simplified to such an extent that it no longer
limits multiplier performance, and
• the range of $Z_i$ and $X_i$ is the same, allowing for efficient high-radix implementations
of the algorithm.
9.3.2 An Efficient Architecture 
Using an RSD approach to the design of an iterative Montgomery multiplier enables 
the use of string recoding techniques to further improve the efficiency and speed of the 
multiplier. The optimizations incorporated into the MMDDAMMM algorithm allow the 
recoding scheme to be used to full advantage. The performance bottleneck in this new
architecture is now the architecture itself, i.e. the delay of signals through the adder array.
9.3.3 The WHiSpER Chip 
A high-speed 506-bit RSA processor, the WHiSpER chip, has been designed and simulated 
and is expected to be able to perform RSA encryption/decryption at an average rate 
of 111kbps. At the time of writing, the WHiSpER chip is awaiting final layout and 
fabrication. Negotiations with GEC Plessey are in progress to achieve this end. 
The fastest known RSA processor using a similar technology is the Cryptech chip (see 
Chapter 5). This chip can perform a one-off 512-bit exponentiation at an equivalent rate 
of 32kbps. Sustained encryption rates, however, will be lower than this since the chip does 
not incorporate operand/result input and output buffering. Therefore it is expected that 
the WHiSpER chip will show, approximately, a fourfold increase in throughput compared 
to its nearest rival. 
Applications of the WHiSpER chip include its use as an RSA processor in high- 
throughput, computationally intensive, crypto-processing engines. These are found in 
the security service providers of computer security systems such as [90] and [91]. Also, 
because encryption rates of over 64kbps are guaranteed by the WHiSpER chip, then it 
could be used by moderate-speed communication networks (such as ISDN, see [92]) to 
provide a transparent security function for users. 
9.4 Further Work 
To achieve greater throughput for RSA hardware would require investigations into the 
following areas. 
9.4.1 Exponentiation Algorithms 
Although the exponentiation algorithms discussed in Chapter 3 are efficient when imple- 
mented in hardware, they are not the fastest available. 
Addition Chain Exponentiation 
An addition chain for a given number can be defined, from [93], as a list of numbers such
that,
• the first number is one,
• every number is the sum of two earlier numbers, and
• the given number is the last in the list.
To perform an exponentiation, the numbers can be viewed as the intermediate exponents 
that are calculated during the exponentiation process. 
The Right-to-Left and Left-to-Right exponentiation algorithms given in Chapter 3 are 
special case addition chains where each number in the chain is either the sum of the 
previous number with itself (the squaring operation), or the sum of the previous number 
and one* (the multiply operation). 
To show that generalized addition chains can permit faster exponentiation, consider
raising a number to the power 15. Using the standard Right-to-Left technique, the
addition chain becomes
1, 2, 3, 6, 7, 14, 15
which requires 3 squarings and 3 multiplications. A different addition chain for the number
15 is
1, 2, 3, 6, 12, 15
which still requires 3 squarings but only 2 multiplications.
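The two chains for 15 can be checked with a small evaluator. This is an illustrative sketch with names of our own: it walks a given chain, forms each new power as the product of two earlier powers, and counts a step as a squaring when the new exponent is double an earlier one.

```python
def power_by_chain(x, chain, n):
    """Evaluate x^chain[-1] mod n along an addition chain.

    Returns (result, squarings, multiplications). The chain must start at 1
    and every later entry must be the sum of two earlier entries.
    """
    powers = {1: x % n}
    squarings = multiplies = 0
    for target in chain[1:]:
        if target % 2 == 0 and target // 2 in powers:
            i = j = target // 2                 # doubling step: a squaring
        else:
            i = next(a for a in powers if target - a in powers)
            j = target - i                      # sum of two distinct entries
        powers[target] = (powers[i] * powers[j]) % n
        if i == j:
            squarings += 1
        else:
            multiplies += 1
    return powers[chain[-1]], squarings, multiplies
```

Note that the second chain needs the intermediate power for exponent 3 to be kept until the final step, which is the storage cost discussed in the text that follows.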
The use of addition chains for fast exponentiation has been investigated by various 
authors, see for example [25], [94] and [95]. The main problem is that computing the 
shortest addition chain for a given number is known to be an NP-complete problem. 
However, algorithms are available that can compute addition chains that are up to 20% 
shorter [93] than the Right-to-Left technique. 
With respect to modular exponentiation hardware, the main problem with using addi- 
tion chains is that several intermediate results may have to be stored. This is because, as 
the exponentiation is proceeding, the current multiplication is allowed to be the product 
of any two of the previous intermediate results. 
Further work would allow the WHiSpER chip to be modified in such a way that 
would allow extra register storage, for intermediate exponentiation results, to be made 
available within the device, and to implement increased flexibility in the exponentiation 
algorithm. This increased flexibility would allow pre-computed addition chains to be 
somehow encoded within the exponent space of the SRAM, defaulting to standard binary 
when no efficient chain can be found. 
Exponentiation with Table-lookup 
In [96] and [97] exponentiation algorithms are proposed which allow the number of
multiplications used in the calculation of $\langle A^E \rangle_N$ to be reduced. The technique
relies on computing a table of powers of A.
This idea could be used by the WHiSpER chip by reserving an area of the SRAM
for the table. The table could be computed by WHiSpER in a first-stage exponentiation
process, and then used subsequently to calculate $\langle A^E \rangle_N$.
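To give a flavour of the table-lookup idea, the sketch below shows the generic fixed-window (2^w-ary) method. It is not claimed to be the specific algorithm of [96] or [97]; the window width w and the function name are our assumptions. A table of the first 2^w powers of A is built once, after which each w-bit digit of the exponent costs w squarings and at most one multiplication by a table entry.

```python
def windowed_exp(a, e, n, w=4):
    """Fixed-window modular exponentiation using a table of powers of a."""
    table = [1 % n]
    for _ in range((1 << w) - 1):
        table.append((table[-1] * a) % n)   # table[i] = a^i mod n
    digits = []
    while e:
        digits.append(e & ((1 << w) - 1))   # w-bit digits, low digit first
        e >>= w
    result = 1 % n
    for d in reversed(digits):              # process the high digit first
        for _ in range(w):
            result = (result * result) % n
        if d:
            result = (result * table[d]) % n
    return result
```

On the WHiSpER chip the table would occupy the reserved SRAM area, so the window width would be bounded by the storage available.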
9.4.2 Improved Technology 
High-performance VLSI integrated circuits, such as modern RISC microprocessors, use
silicon-efficient full-custom design techniques and advanced manufacturing process
technologies. Current state-of-the-art technology uses 3.3V 4-level metal processes allowing
sub-0.5 micron features with die sizes that permit more than 3 million transistors on a
chip. This leads to clock frequencies in excess of 200MHz and power dissipations of over
20W. See, for example, the MIPS Technologies' R10000 microprocessor [98], DEC's Alpha
AXP microprocessor [99] and IBM/Motorola's PowerPC microprocessors [100] [101].
Using 0.5 micron technology to implement the b=2 MMDDAMMM recoded RSD
multiplier of Chapter 7 should allow this multiplier to be clocked at 100MHz. For
moduli of around 500 bits, approximately 200 thousand clock cycles are required
per exponentiation. This gives an RSA throughput of approximately 250kbps. From
Chapter 8 we see that the size of the multiplier is around 30 thousand gates or 120
thousand transistors.
Assuming that pipelining techniques similar to those of Section 6.4.3 can be applied to 
this multiplier, then we see that a four-stage pipelined architecture would probably use no 
more than 500 thousand transistors and yet be capable of performing RSA exponentiations 
in the region of 1Mbps. More pipelining and parallelism would produce still higher rates. 
Therefore, more work needs to be done in investigating pipelined implementations 
of the optimized Montgomery multipliers presented in Chapter 7 with regard to large 
full-custom designs using fast sub-micron technologies. 
9.4.3 Towards a New Architecture 
Recent proposals [102] [103] suggest using a Residue Number System (RNS) approach to 
long-integer modular arithmetic. The RNS system has the great advantage that when 
performing modular arithmetic modulo an integer constructed from many small primes 
(called the base of the RNS system, see [104]), addition and multiplication can be per- 
formed by many small calculations that proceed completely in parallel. The time required 
to complete one of these operations is thus very short. 
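The parallel nature of RNS arithmetic can be seen in a toy example. The base of five small primes below is hypothetical and far too small for cryptographic use, and the function names are ours. Each residue channel is updated independently, and the Chinese Remainder Theorem recovers the weighted value at the end (Python 3.8 or later is assumed for `math.prod` and the modular inverse `pow(x, -1, m)`).

```python
from math import prod

BASE = (3, 5, 7, 11, 13)    # hypothetical small pairwise-coprime base

def to_rns(x):
    """Residues of x with respect to each modulus of the base."""
    return tuple(x % m for m in BASE)

def rns_add(u, v):
    """Channel-wise addition; no carries pass between channels."""
    return tuple((a + b) % m for a, b, m in zip(u, v, BASE))

def rns_mul(u, v):
    """Channel-wise multiplication; each channel is independent."""
    return tuple((a * b) % m for a, b, m in zip(u, v, BASE))

def from_rns(r):
    """Recover the value modulo the product of the base via the CRT."""
    big_m = prod(BASE)
    return sum(ri * (big_m // m) * pow(big_m // m, -1, m)
               for ri, m in zip(r, BASE)) % big_m
```

Results are correct only modulo the product of the base (15015 here), which is why reducing modulo an unrelated integer such as an RSA modulus needs the division-like step discussed next.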
The main disadvantage of RNS however, is that, when used to perform arithmetic 
modulo an integer coprime to the base of the RNS system (such as will happen with RSA 
exponentiation), then this will involve a division-like operation and this is very difficult to 
do in RNS. The basic problem is that the RNS number system is a non-weighted system, 
and so comparison and division are not easily achievable. 
In summary, it is not known whether RNS systems will ever provide an efficient al- 
ternative to weighted number systems when it comes to implementing high-speed RSA 
hardware. 
9.5 Summary 
A new Montgomery multiplication algorithm and an efficient architecture have yielded 
the WHiSpER chip. In combination with the Chinese Remainder Theorem this chip can 
be used to implement the RSA cryptosystem with keys of up to 1000 bits in length and 
ciphering rates of approximately 100kbps. Primality testing can also be performed by this 
chip, with RSA key generation typically taking only a few seconds. 
Bibliography 
[1] Gustavus J. Simmons, editor. Contemporary Cryptology - The Science of Informa- 
tion Integrity. IEEE Press, 1992. 
[2] Jennifer Seberry and Josef Pieprzyk. Cryptography - An Introduction to Computer 
Security. Prentice-Hall, 1989. 
[3] D. W. Davies. Security for Computer Networks. Wiley-Interscience, 1984. 
[4] C. E. Shannon. Communication theory of secrecy systems. Bell Systems Technical
Journal, 28:656-715, 1949.
[5] Bruce Schneier. The IDEA encryption algorithm. Dr. Dobb's Journal, December 
1993. 
[6] Eli Biham and Adi Shamir. Differential cryptanalysis of DES-like cryptosystems. In 
Advances in Cryptology: CRYPTO 90. Springer-Verlag, 1991. 
[7] Adi Shamir. Differential cryptanalysis of the full 16-round DES. In Advances in 
Cryptology: CRYPTO 92. Springer-Verlag, 1993. 
[8] G. Carter, A. Clark, E. Dawson, and L. Nielsen. Analysis of DES double key mode. 
In Information Security - The Next Decade. Chapman and Hall, 1995. 
[9] Taher ElGamal. A public key cryptosystem and a signature scheme based on discrete
logarithms. IEEE Transactions on Information Theory, 31(4), 1985.
[10] Neal Koblitz. A Course in Number Theory and Cryptography. Springer-Verlag, 1987. 
[11] L. Harn. Public-key cryptosystem design based on factoring and discrete logarithms. 
IEE Proceedings: Computers and Digital Techniques, 141(3), May 1994. 
[12] Chih-Chwen Chuang and James George Dunham. Matrix extensions of the RSA 
algorithm. In Advances in Cryptology: CRYPTO 90. Springer-Verlag, 1991. 
[13] Paul C. van Oorschot. A comparison of practical public-key cryptosystems based on 
integer factorization and discrete logarithms. In Advances in Cryptology: CRYPTO 
90. Springer-Verlag, 1991. 
[14] R. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures
and public-key cryptosystems. Communications of the ACM, 21(2):120-126, 1978.
[15] Wiebren de Jonge and David Chaum. Attacks on some RSA signatures. In Advances 
in Cryptology: CRYPTO 85. Springer-Verlag, 1986. 
[16] Y. Desmedt. A chosen text attack on the RSA cryptosystem and some discrete 
logarithm schemes. In Advances in Cryptology: CRYPTO 85. Springer-Verlag, 1986. 
[17] John M. DeLaurentis. A further weakness in the common modulus protocol for the 
RSA cryptoalgorithm. Cryptologia, July 1984. 
[18] James A. Davis and Diane B. Holdridge. Factorization of large integers on a mas- 
sively parallel computer. In Advances in Cryptology: EUROCRYPT 88. Springer- 
Verlag, 1989. 
[19] Kurt Weiner. Squeamish ossifrage dents electronic armour. New Scientist, 7th May 
1994. 
[20] Alan Baker. A Concise Introduction to the Theory of Numbers. Cambridge Univer- 
sity Press, 1984. 
[21] M. R. Schroeder. Number Theory in Science and Communication. Springer-Verlag, 
1984. 
[22] Kenneth H. Rosen. Elementary Number Theory and its Applications. Addison- 
Wesley, 1984. 
[23] Charles C. Pinter. A Book of Abstract Algebra. McGraw-Hill Publishing Company, 
2 edition, 1990. 
[24] Keith Devlin. Microchip Mathematics - Number Theory for Computer Users. Shiva 
Publishing Ltd, 1984. 
[25] Donald E. Knuth. The Art of Computer Programming, volume 2: Semi-Numerical 
Algorithms. Addison-Wesley, 2 edition, 1981. 
[26] Kai Hwang. Computer Arithmetic - Principles, Architecture and Design. John Wiley 
and Sons, 1979. 
[27] Akhilesh Tyagi. A reduced-area scheme for carry-select adders. IEEE Transactions
on Computers, 42(10), October 1993.
[28] Mark R. Santoro and Mark A. Horowitz. SPIM: A pipelined 64x64-bit iterative
multiplier. IEEE Journal of Solid-State Circuits, 24(2), April 1989.
[29] Masoto Nagamatsu, Shigeru Tanaka, Junji Mori, Katsusi Hirano, Tatsuo Noguchi, 
and Kazuhisa Hatanaka. A 15-ns 32x32-b cmos multiplier with an improved parallel 
structure. IEEE Journal of Solid-State Circuits, 25(2), April 1990. 
[30] Stamatis Vassiliadis, Eric M. Schwarz, and Boik M. Sung. Hard-wired multipliers 
with encoded partial products. IEEE Transactions on Computers, 40(11), November 
1991. 
[31] G. R. Blakley. A computer algorithm for calculating the product AB modulo M.
IEEE Transactions on Computers, C-32(5):497-500, May 1983.
[32] Per Brinch Hansen. Multiple-length division revisited: A tour of the minefield. 
Software - Practice and Experience, 24(6), June 1994. 
[33] A. Selby and C. Mitchell. Algorithms for software implementations of RSA. Pro- 
ceedings of the IEE, 136(3), May 1989. 
[34] P. G. Comba. Exponentiation cryptosystems on the IBM PC. IBM Systems Journal, 
29(4), 1990. 
[35] Dieter Bong and Cristoph Ruland. Optimized software implementations of the
modular exponentation on general purpose microprocessors. Computers and Security,
8:621-630, 1989.
[36] Paul Barrett. Implementing the rivest shamir and adleman public key encryption al- 
gorithm on a standard digital signal processor. In Advances in Cryptology: CRYPTO 
86. Springer-Verlag, 1987. 
[37] Stephen R. Dusse and Burton S. Kaliski Jr. A cryptographic library for the motorola 
DSP56000. In Advances in Cryptology: EUROCRYPT 90. Springer-Verlag, 1991. 
[38] Dominique de Waleffe and Jean-Jacques Quisquater. CORSAIR: A smart card for 
public key cryptosystems. In Advances in Cryptology: CRYPTO 90. Springer-Verlag, 
1991. 
[39] P. A. Findlay and B. A. Johnson. Modular exponentation using recursive sums of 
residues. In Advances in Cryptology: CRYPTO 89. Springer-Verlag, 1990. 
[40] Andre Vandemeulebroecke, Etienne Vanzieleghem, Tony Denayer, and Paul G. A. 
Jespers. A new carry-free division algorithm and its application to a single-chip 
1024-b RSA processor. IEEE Journal of Solid-State Circuits, 25(3), June 1990. 
[41] Paolo Montuschi and Luigi Ciminiera. Over-redundant digit sets and the design of 
digit-by-digit division units. IEEE Transactions on Computers, 43(3), March 1994. 
[42] Tony M. Carter and James E. Robertson. Radix-16 signed-digit division. IEEE 
Transactions on Computers, 39(12), December 1990. 
[43] David M. Mandelbaum. A systematic method for division with high average bit 
skipping. IEEE Transactions on Computers, 39(1), January 1990. 
[44] Eric M. Schwarz and Michael J. Flynn. Parallel high-radix nonrestoring division. 
IEEE Transactions on Computers, 42(10), October 1993. 
[45] Milos D. Ercegovac and Tomas Lang. Simple radix-4 division with operands scaling. 
IEEE Transactions on Computers, 39(9), September 1990. 
[46] Milos D. Ercegovac, Tomas Lang, and Paolo Montuschi. Very-high radix division 
with prescaling and selection by rounding. IEEE Transactions on Computers, 43(8), 
August 1994. 
[47] Peter A. Ivey, Alan L. Cox, John R. Harbridge, and John K. Oldfield. A single- 
chip public key encryption subsystem. IEEE Journal of Solid-State Circuits, 24(4), 
August 1989. 
[48] A. Tomlinson. Modulo multiplier to enhance encryption rates. Electronic Engineer- 
ing, April 1990. 
[49] A. Tomlinson. Bit-serial modular multiplier. Electronics Letters, 25(24):1664, 23rd
November 1989.
[50] Keiichi Iwamura, Tsutomu Matsumoto, and Hideki Imai. High-speed implemen- 
tation methods for RSA scheme. In Advances in Cryptology: EUROCRYPT 92. 
Springer-Verlag, 1993. 
[51] Che Wun Chiou. A fast logic for modular multiplication. International Journal of 
Electronics, 74(6), 1993. 
[52] Frank Hoornaert, Marc Decroos, Joos Vandewalle, and Rene Govaerts. Fast
RSA-hardware: Dream or reality. Technical report. Cryptech NV/SA, Av. Lloyd George
7, 1050 Brussels, Belgium.
[53] C. W. Chiou and T. C. Yang. Iterative modular multiplication algorithm without 
magnitude comparison. Electronics Letters, 30(24), 24th November 1994. 
[54] Ernest F. Brickell. A fast modular multiplication algorithm with application to 
two-key cryptography. In Advances in Cryptology: CRYPTO 82. Springer-Verlag, 
1983. 
[55] P. W. Baker. Fast computation of A*B modulo N. Electronics Letters, 23(15), 16th 
July 1987. 
[56] Naofumi Takagi and Shuzo Yajima. Modular multiplication hardware algorithms 
with a redundant representation and their application to rsa cryptosystem. IEEE 
Transactions on Computers, 41(7), July 1992. 
[57] Naofumi Takagi. A radix-4 modular multiplication hardware algorithm for modular 
exponentiation. IEEE Transactions on Computers, 41(8), August 1992. 
[58] Hikaru Morita. A fast modular multiplication algorithm based on a higher radix. In 
Advances in Cryptology: CRYPTO 89. Springer-Verlag, 1990. 
[59] Holger Orup, Erik Svendsen, and Erik Andreasen. VICTOR: An efficient RSA 
hardware implementation. In Advances in Cryptology: EUROCRYPT 90. Springer- 
Verlag, 1991. 
[60] Glenn Orton, Lloyd Peppard, and Stafford Tavares. A design of a fast pipelined 
modular multiplier based on a diminished-radix algorithm. Journal of Cryptology, 
6: 183-208,1993. 
204 
[61] Ernest F. Brickell. A survey of hardware implementations of RSA. In Advances in 
Cryptology: CRYPTO 89. Springer-Verlag, 1990. 
[62] Gordon Rankine. THOMAS -a complete single chip RSA device. In Advances in 
Cryptology: CRYPTO 86. Springer-Verlag, 1987. 
[63] G. A. Orton, M. P. Roy, P. A. Scott, L. E. Peppard, and S. E. Tavares. VLSI 
implementation of public-key encryption algorithms. In Advances in Cryptology: 
CRYPTO 86. Springer-Verlag, 1987. 
[64] C. K. Koc and C. Y. Hung. Multi-operand modulo addition using carry-save adders. 
Electronics Letters, 26(6): 361-363,15th March 1990. 
[65] Peter A. Ivey, Simon N. Walker, Jon M. Stern, and Simon Davidson. An ultra, 
high speed public-key encryption processor. In Proceedings of the IEEE Integrated 
Circuits Conference, 1992. 
[66] Holger Sedlak. The RSA cryptography processor. In Advances in Cryptology: 
CRYPTO 87. Springer-Verlag, 1988. 
[67] Martin Kochanski. Developing an RSA chip. In Advances in Cryptology: CRYPTO 
85. Springer-Verlag, 1986. 
[68] B. S. Prasanna and P. V. Ananda Mohan. Fast VLSI architectures using nonredun- 
dant multibit recoding for computing all (mod N). IEE Proceedings: Circuits De- 
vices and Systems, 141(5), October 1994. 
[69] Giuseppe Alia and Enrico Martinelli. A VLSI modulo m multiplier. IEEE 7hansac- 
Lions on Computers, 40(7), July 1991. 
[70] Stanislaw J. Piestrak. Design of residue generators and multioperand modulo adders 
using carry-save adders. IEEE Transactions on Computers, 43(1), January 1994. 
205 
[71] Thomas Beth and Dieter Gollmann. Algorithm engineering for public key algorithms. 
IEEE Journal on Selected Areas in Communications, 7(4), May 1989. 
[72] Peter L. Montgomery. Modular multiplication without trial division. Mathematics 
of Computation, 44(170), April 1985. 
[73] S. J. Shepherd. A high-speed cryptographic engine. Electrical Engineering Depart- 
ment, University of Bradford, UK. 
[74] Dan Zuras. More on squaring and multiplying large integers. IEEE Transactions on 
Computers, 43(8), August 1994. 
[75] M. Shand, P. Bertin, and J. Vuillemin. Hardware speedups in long integer multipli- 
cation. In Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms 
and Architectures, July 1990. 
[76] Shimon Even. Systolic modular multiplication. In Advances in Cryptology: 
CRYPTO 90. Springer-Verlag, 1991. 
[77] Jorg Sauerbrey. A modular exponentiation unit based on systolic arrays. In Advances 
in Cryptology: A USCRYPT 92. Springer-Verlag, 1993. 
[78] Keiichi Iwamura, Tsutomu Matsumoto, and Hideki Imai. Systolic-arrays for modu- 
lar exponentiation using montgomery method. In Advances in Cryptology: EURO" 
CRYPT 92. Springer-Verlag, 1993. 
[79] M. Shand and J. Vuillemin. Fast implementations of RSA cryptography. In Pro- 
ceedings of the 11th IEEE Symposium on Computer Arithmetic, 1993. 
[80] Colin D. Walter. Systolic modular multiplication. IEEE Transactions on Computers, 
42(3), March 1993. 
206 
[81] Stephen E. Eldridge and Colin D. Walter. Hardware implementation of mont- 
gomery's modular multiplication algorithm. IEEE Transactions on Computers, 
42(6), June 1993. 
[82] C. D. Walter. Still faster modular multiplication. Electronics Letters, 31(4), 16th 
February 1995. 
[83] CMOS Semi-Custom CLA70000 ASIC Handbook. GEC Plessey Semiconductors, 
July 1992. 
[84] GEC Plessey Semiconductors. Mentor Design Kit: User Manual, November 1991. 
[85] GEC Plessey Semiconductors. Mentor Design Kit: Volume One, November 1991. 
[86] GEC Plessey Semiconductors. Mentor Design Kit: Volume Two, November 1991. 
[87] Lewis C. Eggebrecht. Interfacing to the IBM Personal Computer. Howard W. Sams 
and Company, 1983. 
[88] Uyless Black. The X Series Recommendations. McGraw-Hill, 1995. 
[89] Michael 0. Rabin. Probabilistic algorithm for testing primality. Journal of Number 
Theory, 12: 128-138,1980. 
[90] S. J. Shepherd, P. W. Sanders, and A. Patel. A comprehensive security system - the 
concepts, agents and protocols. Computers and Security, 9: 631-643,1990. 
[91] Warwick Ford and Brian O'Higgins. Public-key cryptography and open systems 
interconnection. IEEE Communications Magazine, July 1992. 
[92] William Stallings. Data and Computer Communications. MacMillan Publishing 
Company, 1991. 
`[93] Jurjen Bos and Matthijs Coster. Addition chain heuristics. In Advances in Cryptol- 
ogy: CRYPTO 89. Springer-Verlag, 1990. 
207 
[94] Y. Yacobi. Exponentiating faster with addition chains. In Advances in Cryptology: 
EUROCRYPT 90. Springer-Verlag, 1991. 
[95] Jorg Sauerbrey and Andreas Dietel. Resource requirements for the application of ad- 
dition chains in modulo exponentiation. In Advances in Cryptology: EUROCRYPT 
92. Springer-Verlag, 1993. 
[96] L. C. K. Hui and K. Y. Lam. Fast square-and-multiply exponentiation for RSA. Elec- 
tronics Letters, 30(17), 18th August 1994. 
[97] K. Y. Lam and L. C. K. Hui. Efficiency of SS(1) square-and-multiply exponentiation 
algorithms. Electronics Letters, 30(25), 8th December 1994. 
[98] MIPS R10000 microprocessor product overview. Technical report, MIPS Technolo- 
gies Incorporated, October 1994. 
[99] Digital 21064-AA microprocessor product brief. Technical report, Digital Equipment 
Corporation, February 1992. 
[100] Charles R. Moore. PowerPC 601 microprocessor. Technical report, IBM Corpora- 
tion, 1993. 
[101] James Kahle and Deene Ogden. PowerPC 603 microprocessor. Technical report, 
IBM Corporation, 1994. 
[102] Mahdi Abdelguerfi, Andrea Dunham, and Wayne Patterson. MRA: A computational 
technique for security in high-performance systems. Computer Security, A-37: 401- 
417,1993. 
[103] K. C. Posch and R. Posch. Residue number systems: A key to parallelism in public- 
key cryptography. In Proceedings of the Fourth IEEE Symposium on Parallel and 
Distributed Processing, 1-4 December 1992. 
208 
[104] Nicholas S. Szabo and Richard I. Tanaka. Residue Arithmetic and its Applications 
to Computer Technology. McGraw-Hill Book Company, 1967. 
209 
Appendix A 
The WHiSpER SMC 
The SMC is a state-machine that controls the operation of the WHiSpER chip. It is 
composed of four semi-autonomous smaller state-machines called LSM, TSM, ESM and 
RSM. Figures A.1, A.2, A.3 and A.4 show state-transition diagrams for each of these 
state-machines. See the schematics of Appendix B for their circuit diagrams. 
A.1 SMC Input Signals 
The following signals are all active high and control the operation of the SMC. 
• cLDK: LDK command signal from ComStat. 
• cEXP: EXP command signal from ComStat. Perform first exponentiation in sequence. 
• cEXP_RDC: EXP_RDC command signal from ComStat. Perform intermediate exponentiation 
and reduction in sequence. 
• cRDC: RDC command signal from ComStat. Perform final reduction at end of 
sequence. 
• e_i: current exponent bit indexed by EC counter. Used by ESM during the exponentiation 
process. 
• ECQ00: active when EC counter is at 0x000. Used by ESM during the exponentiation 
process. 
• RCQ3F: active when RC counter is at 0x3F. Used by the RSM during the reduction 
process. 
• Upos: active when currently accessed URAM byte has most-significant-bit of '0'. 
Used by RSM during the reduction process. 
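As an illustration of how the three status inputs in this list could be derived, the following Python sketch decodes ECQ00, RCQ3F and Upos from raw counter and memory values. The function name `status_flags`, its parameters and the assumed widths (a 9-bit EC counter, a 6-bit RC counter and 8-bit URAM bytes) are assumptions made for this sketch only; the actual decoding logic is defined by the schematics of Appendix B.

```python
def status_flags(ec: int, rc: int, uram_byte: int) -> dict:
    """Return the ECQ00, RCQ3F and Upos flags for the given raw values.

    Widths are assumptions for this sketch: EC is 9 bits, RC is 6 bits
    and a URAM byte is 8 bits.
    """
    return {
        "ECQ00": (ec & 0x1FF) == 0x000,    # EC counter is at 0x000
        "RCQ3F": (rc & 0x3F) == 0x3F,      # RC counter is at 0x3F
        "Upos": (uram_byte & 0x80) == 0,   # MSB of the URAM byte is '0'
    }


if __name__ == "__main__":
    print(status_flags(ec=0x000, rc=0x3F, uram_byte=0x7F))
```

All three flags are active high, matching the convention stated for the SMC inputs.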
A.2 SMC Output Signals 
The control signals produced by the SMC are all active high and have the following 
functions. 
• SRI: the SRAM is being used internally by the WHiSpER chip. Switch SRAM address, 
data and control lines to internally generated signals. No host microprocessor 
access to the SRAM is allowed. 
• SRT: the SRAM is being used by the transfer process. Switch SRAM address lines 
to the transfer counter. 
• URI: the URAM is being used internally by the WHiSpER chip. No host microprocessor 
access allowed. 
• URT: the URAM is being used by the transfer process. Switch URAM address lines 
to the transfer counter. 
• RCA: the SRAM and URAM are being used by the reduction process. Switch SRAM 
and URAM address lines to the reduction counter. 
• IntSet: set the INT to active low. 
• EMSBload: load the EMSB register. 
• NINload: load the NIN register. 
• ECen: enable EC counter operation. 
• ECload: load the EC counter with preset value 0x1FF. 
• TCload: load the TC counter with preset value 0x7F. 
• Txfer_H: transfer H from SRAM. 
• Txfer_N: transfer N from SRAM. 
• Txfer_HN: active whenever transfer operation is in progress (Txfer_H OR Txfer_N). 
• TxferClkEn: enable TxferClk. 
• RCload: load RC counter with preset value 0x3F. 
• RedcClkEn: enable RedcClk. 
• ESM_XclkEn: enable MME X register clock. 
• ESM_YclkEn: enable MME Y register clock. 
• ESM_AclkEn: enable MME accumulator clock. 
• ESM_BclkEn: enable MME B register clock. 
• ESM_Xpar: enable parallel loading of MME X register. 
• ESM_Ypar: enable parallel loading of MME Y register. 
• ESM_Asrst: enable synchronous reset of MME accumulator. 
• ESM_Xone: override MME x(i) generation circuitry to generate X=1. 
• ESM_Xs2X: sign extend the MME X register as it is consumed during a multiplication. 
• ESM_X2X: refill MME X register as it is consumed during a multiplication. 
• ESM_X2B: fill MME B register from X register as X register is consumed during a 
multiplication. 
• ESM_B2X: fill MME X register from B register as X register is consumed during a 
multiplication. 
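The counter-related outputs in this list (load a preset, then enable counting) can be pictured with a small behavioural model. The Python sketch below is not taken from the WHiSpER schematics: the class `LoadableCounter`, the counter widths, the down-counting direction and the priority of load over enable are all assumptions chosen for illustration. Only the preset values 0x1FF, 0x7F and 0x3F and the relation Txfer_HN = Txfer_H OR Txfer_N come from the list.

```python
class LoadableCounter:
    """Behavioural model of a counter with synchronous load and enable."""

    def __init__(self, width: int, preset: int) -> None:
        self.mask = (1 << width) - 1
        self.preset = preset & self.mask
        self.value = 0

    def clock(self, load: bool = False, enable: bool = False) -> int:
        """Advance one clock edge and return the new count."""
        if load:                      # assumption: load takes priority
            self.value = self.preset
        elif enable:                  # assumption: the counter counts down
            self.value = (self.value - 1) & self.mask
        return self.value


def txfer_hn(txfer_h: bool, txfer_n: bool) -> bool:
    """Txfer_HN is active whenever either transfer is in progress."""
    return txfer_h or txfer_n


# Presets taken from the signal list; widths are inferred from the presets.
ec = LoadableCounter(width=9, preset=0x1FF)   # loaded by ECload
tc = LoadableCounter(width=7, preset=0x7F)    # loaded by TCload
rc = LoadableCounter(width=6, preset=0x3F)    # loaded by RCload

if __name__ == "__main__":
    ec.clock(load=True)
    for _ in range(0x1FF):
        ec.clock(enable=True)
    print(hex(ec.value))
```

Under these assumptions, an EC counter loaded with 0x1FF reaches 0x000 after 511 enabled clock edges, which is the condition flagged by the ECQ00 input.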
A.3 SMC Internal Signals 
The following signals are all active high and are used internally by the SMC for output 
signal generation, for LSM, TSM, ESM and RSM inter-state-machine communications and 
for iC, jC and kC counter control. 
• SRIset, SRIrst: Set SRI = 1 or 0 respectively. 
• SRTset, SRTrst: Set SRT = 1 or 0 respectively. 
• URIset, URIrst: Set URI = 1 or 0 respectively. 
• URTset, URTrst: Set URT = 1 or 0 respectively. 
• RCAset, RCArst: Set RCA = 1 or 0 respectively. 
• ESM_Go: Issued by the TSM to start the ESM. 
• EXP_Done: Issued by the ESM to inform the TSM of its completion. 
• Reduce: Issued by the TSM to start the RSM reduction process. 
• NoReduce: Issued by the TSM so that the RSM just issues an interrupt but does 
not perform a reduction. 
• RDC_Done: Issued by the RSM to inform the TSM of its completion. 
• RSM_Go: Issued by the ESM to enable RSM operation. 
• RSM_Stop: Issued by the ESM to temporarily halt RSM operation. 
• i4, i64, i512: iC count signals. 
• j4, jRAAD: jC count signals. jRAAD is programmable via RAADsel(1:0) device 
inputs. 
• k4, k64: kC count signals. 
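Each set/reset pair in this list drives one of the SMC output flags (SRI, SRT, URI, URT or RCA). A minimal behavioural sketch of such a flag is given below in Python. The class name `SetResetFlag`, the clocked update, the cleared initial state and the choice that reset wins when both inputs are asserted together are assumptions for illustration, not details taken from the schematics.

```python
class SetResetFlag:
    """Behavioural model of a flag driven by a set/reset signal pair."""

    def __init__(self) -> None:
        self.q = False                # assumption: flag starts cleared

    def clock(self, set_: bool = False, rst: bool = False) -> bool:
        """Apply one clock edge and return the flag's new state."""
        if rst:                       # assumption: reset has priority
            self.q = False
        elif set_:
            self.q = True
        return self.q                 # unchanged when neither is asserted


if __name__ == "__main__":
    sri = SetResetFlag()
    sri.clock(set_=True)              # SRIset asserted: SRI = 1
    print(sri.q)
    sri.clock(rst=True)               # SRIrst asserted: SRI = 0
    print(sri.q)
```

The flag holds its value between pulses, so a state machine only needs to assert the set or reset signal for a single state.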
A.4 State-Transition Diagrams 
Figure A.1: LSM state-transition diagram. 
Figure A.2: TSM state-transition diagram. 
Figure A.3: ESM state-transition diagram. 
Figure A.4: RSM state-transition diagram. 
Appendix B 
The WHiSpER Schematics 
The following pages contain full circuit diagrams for the WHiSpER chip. 
[Pages excluded under instruction from the University.] 
