Customisable arithmetic hardware designs by Cheung, Chak-Chung Ray & Cheung, Chak-Chung Ray
University of London 
Imperial College of Science, Technology and Medicine 
Department of Computing 
Customisable Arithmetic Hardware Designs 
Chak-Chung Ray CHEUNG 
Submitted in part fulfilment of the requirements for the degree of 
Doctor of Philosophy in Computing of the University of London and 
the Diploma of Imperial College, April 2007 
Abstract 
Field-Programmable Gate Arrays (FPGAs) are ideal platforms for realizing customisable hard-
ware designs due to their flexibility. There is an increasing gap between design complexity 
and designer productivity offered by reconfigurable hardware. Design tools which can opti-
mise the trade-offs in area, performance and other metrics such as precision are desirable. In 
this thesis, we present several design generators targeting reconfigurable hardware that cover 
various computer arithmetic designs: elliptic curve cryptography design generator (ECCGen), 
prime number validation design generator (PrimeGen), integer mathematical generation tool 
(IMGen) for embedded processors, and non-uniform random number generator (RNGGen) for 
arbitrary distributions. 
The elliptic curve cryptography design generator, ECCGen, enables the generation of parame-
terisable building blocks for a System-on-Chip (SoC) architecture for elliptic curve cryptosys-
terns which target reconfigurable hardware. A four-level partitioning scheme is proposed to 
facilitate the exploration of area and speed trade-offs. 
The prime number validation design generator, PrimeGen, generates a scalable SoC architecture 
using scalable and non-scalable Montgomery multipliers based on user-defined parameters. The 
primality test is crucial for security systems, especially for most public-key schemes. This work 
demonstrates the flexibility and area trade-offs in using reconfigurable platforms for developing 
SoC applications. 
The integer mathematical generation tool, IMGen, generates a customizable library for floating-
point function evaluation based on integer instructions available in embedded processors. An 
evaluation of the IMGen approach by automatically selecting an approximation method for a 
given function for embedded integer processors is described. This work demonstrates accuracy 
and execution time trade-offs. 
The random number generator, RNGGen, is an automated tool for producing hardware-based 
non-uniform random number generators (RNG) using the inversion method. The designs are 
capable of generating random variates from arbitrary distributions. The word-lengths of the 
generated hardware designs are optimized using an analytical error analysis approach. This 
work demonstrates precision, area and memory requirement trade-offs. 
ii 
Acknowledgements 
I would like to express my greatest gratitude to the following important persons who have made 
my PhD study possible: 
• My supervisors, Wayne Luk and Peter Cheung, for giving me invaluable guidance in every 
aspect throughout my research at Imperial College. 
• My examiners, Naranker Dulay and Jose Luis Nuliez-Yafiez for giving me invaluable 
comments on my thesis. 
• My former supervisor at the Chinese University of Hong Kong (CUHK), Philip Leong, 
for guiding me to the field of reconfigurable computing and his useful advice. 
• Michael Flynn and John Villasenor, for their gracious support of three research visits at 
Stanford University and University of California, Los Angeles (UCLA). 
• Oskar Mencer, Paul Kelly and George Constantinides, for their insightful remarks and 
numerous inspirations. 
• Dong-U Lee, Nicholas Telle and Ashley Brown, for the promising collaborations. 
• My colleagues at Imperial College London, including Altaf Abdul Gaffar, Edward Ang, 
Kubilay Atasu, Tobais Becker, Jacob Bower, Ben Cope, Rob Dimond, Andreas Fidjeland, 
Gabriel Figueiredo de Countinho, Hao-Huan Fu, Abdelmalek Halawani, Chun-Hok Ho, 
Lee Howes, Jun Jiang, Matti Juvonen, Yuet-Ming Lam, Danny Lee, Terrence Mak, 
Oliver Pell, Philip Potter, Tero Rissa, Pete Sedcole, Henry Styles, Carlos Tavares, David 
Thomas, Timothy Todman, Qiang Wu, Chi-Wai Yu, and Sherif Yusuf, for their amiable 
support and mentorship. 
• My parents and my sister, Amy, for their love and encouragement. 
• My fiancee, Serena, for her endless moral support. 
• The Croucher Foundation Hong Kong and the Department of Computing at Imperial 
College London, for the generous financial support and the superior research environment. 
iii 
To my family 
iv 
Contents 
Abstract 
Acknowledgements 
1 Introduction 1 
1.1 Motivation 	  1 
1.2 Contributions 	  4 
1.3 Thesis structure 	  6 
1.4 Statement of originality 	  6 
1.5 Publications 	  7 
2 Background and Related Work 10 
2.1 Introduction 	  10 
2.2 Field-Programmable Gate Array (FPGA) 	  10 
2.2.1 	Architecture 	  10 
2.2.2 	Design tools 	  12 
2.2.3 	Very-High-Speed-IC Hardware Design Language (VHDL) 	  13 
vi CONTENTS 
2.3 
2.2.4 	Handel-C 	  
2.2.5 	Xilinx System Generator (SysGen) 	  
System-on-Chip (SoC) 	  
13 
13 
14 
2.3.1 Processors 	  14 
2.3.2 Communication Channels 	  15 
2.3.3 Reconfigurable fabrics 	  15 
2.3.4 Memories 	  16 
2.3.5 Field-Programmable SoC 	  16 
2.4 Cryptosystems 	  17 
2.4.1 Related work 	  17 
2.4.2 Arithmetic background 	  18 
2.4.3 Secret-key cryptosystems 	  19 
2.4.4 Public-key cryptosystems 	  21 
2.5 Pseudo-primality test 	  25 
2.5.1 Fermat primality test 	  25 
2.5.2 Rabin-Miller primality test 	  26 
2.6 Montgomery arithmetic 	  27 
2.6.1 Montgomery modular multiplication 	  28 
2.6.2 Montgomery modular exponentiation 	  30 
2.6.3 High-radix Montgomery algorithm 	  30 
2.6.4 Scalable Montgomery algorithm 	  31 
CONTENTS vii 
2.7 
2.8 
Function evaluation 	  
2.7.1 	Function evaluation using embedded processor 	  
2.7.2 	Random number generator 	  
Summary 	  
33 
33 
34 
36 
3 Customisable Elliptic Curve Cryptosystems 37 
3.1 Introduction 	  38 
3.2 Parallelised field multiplier 	  39 
3.3 System architecture 	  45 
3.3.1 	Field multiplication 	  45 
3.3.2 	Field inversion 	  46 
3.3.3 	Field squaring and field addition 	  48 
3.3.4 	Point multiplication 	  48 
3.3.5 	Point addition and point subtraction 	  51 
3.3.6 	Data embedding 	  51 
3.4 FPGA implementation 	  54 
3.5 Automatic generation and customisation 	  56 
3.6 Evaluations 	  57 
3.7 Extension: Reconfigurable ECC System-on-Chip 	  61 
3.7.1 	Introduction 	  61 
3.7.2 	Operations partitioning 	  63 
3.7.3 	Extended design generator: ECCGen 	  65 
viii CONTENTS 
3.7.4 	Performance evaluations 	  66 
3.7.5 	Secure web server 	  68 
3.8 Summary 	  69 
4 Reconfigurable Primality Test Design Generator 70 
4.1 Introduction 	  70 
4.2 Design flow 	  71 
4.2.1 	Speeding up computation 	  72 
4.2.2 	Montgomery modular multiplication 	  72 
4.2.3 	Montgomery modular exponentiation 	  75 
4.3 Scalable modular multiplier 	  76 
4.4 Hardware design generator - Prime Gen 	  78 
4.5 Evaluations 	  80 
4.6 Extension: System-on-Chip (SoC) design framework 	  83 
4.7 Summary 	  84 
5 Custom-Precision Function Evaluation for Embedded Processors 86 
5.1 Introduction 	  87 
5.2 Integer instruction sets 	  88 
5.3 Automation methodology 	  89 
5.3.1 	Overview 	  89 
5.3.2 	Polynomial approximation 	  95 
CONTENTS ix 
	
5.3.3 	Rational approximation 	  
5.3.4 	Custom precision code 	  
95 
95 
5.4 Error analysis 	  99 
5.5 Optimization 	  102 
5.5.1 	Optimised code generation 	  102 
5.5.2 	Compile time optimization 	  103 
5.5.3 	Polynomial degree automation 	  104 
5.5.4 	Instruction code space optimization 	  104 
5.6 Results 	  106 
5.6.1 	Experimental setup 	  106 
5.6.2 	Evaluations 	  106 
5.7 Summary 	  110 
6 Arbitrary Random Number Hardware Design Generator 111 
6.1 Introduction 	  112 
6.2 Design flow overview 	  114 
6.3 Hierarchical segmentation of ICDFs 	  117 
6.4 Inversion-based random number generator architecture 	  122 
6.5 Bit-width optimization 	  127 
6.5.1 	Range analysis 	  128 
6.5.2 	Coefficient transformation 	  128 
6.5.3 	Precision analysis 	  129 
6.6 Hardware implementation results 	  134 
6.7 Summary 	  143 
7 Conclusions 	 149 
7.1 	Summary of thesis achievements 	  149 
7.2 Future work 	  151 
7.2.1 ECCGen 	  152 
7.2.2 PrimeGen 	  152 
7.2.3 IMGen 	  153 
7.2.4 RNGGen 	  154 
7.3 The road ahead 	  154 
Bibliography 	 155 
List of Tables 
1.1 	Hardware comparison of customisable arithmetic designs 	  4 
1.2 	Customisation comparison of customisable arithmetic designs.  	4 
2.1 Security level comparison [Cer06] 	  24 
3.1 	Number of clock cycles required for each function execution. 	  55 
3.2 Speed comparison with the reference designs [LL02, Ros99]. 	  59 
3.3 Comparison with the parallel reference designs [LL02] 	  59 
3.4 	Comparison of speed per slices used for different designs 	  60 
3.5 	Comparison of critical paths for different FPGA platforms 	  60 
3.6 Comparison between our design and other existing hardware designs. 	 60 
3.7 	Statistics of field multiplication/inversion for different field size m 	  63 
3.8 Read/Write cycles using the CoreConnect bus 	  67 
3.9 	Level 2: Comparison of one field multiplication when m = 113 	  67 
3.10 Level 3: Comparison of one field inversion when m = 113. 	  67 
3.11 Level 4: Comparison of one point multiplication when m = 53 	  68 
xi 
4.1 Primality test comparison 	  81 
4.2 Processing element comparison with word size (w = 8) 	  82 
4.3 Prime-size comparison using SMM with 8 PEs. 	  82 
4.4 Word-size comparison for 128-bit and 1,024-bit SMM using 8 PEs 	  82 
5.1 Instruction latency comparison of different processors 	  88 
5.2 Embedded PowerPC arithmetic and logical instructions. 	  94 
5.3 Approximation errors. 	  100 
5.4 Degree-4 polynomial method comparisons. 	  108 
5.5 Degree-6 polynomial method comparisons. 	  109 
5.6 Comparisons of log(x) and ,‘/Y 	  109 
6.1 Maximum random number by using different input bit-widths 	  123 
6.2 Signal bit-widths found by Adaptive Simulated Annealing (ASA). 	  133 
6.3 Segment and bit-width information without quantised x. 	  134 
6.4 Segment and bit-width information with quantised x. 	  135 
6.5 Hardware resource usage of the proposed GRNG. 	  135 
6.6 Comparisons of different hardware GRNG 	  146 
6.7 Comparisons of different hardware Gaussian random number generators. 	. 	. . 147 
6.8 Hardware implementation results of the GRNG 	  148 
7.1 Design generators comparison 	  151 
xii 
List of Figures 
1.1 Overview of this thesis 	  6 
2.1 The architecture of a 2D FPGA. 	  11 
2.2 The architecture of a fine-grained -I- coarse-grained FPGA 	  12 
2.3 QuickMIPS chip die image [MIP03]  	17 
2.4 Finite fields for elliptic curve representation. 	  24 
2.5 The Rabin-Miller probabilistic primality test algorithm 	  27 
2.6 Montgomery modular multiplication. 	  29 
2.7 The multiple word radix-2 Montgomery multiplication algorithm. 	  32 
3.1 Parametrised, pipelined and parallel field multiplier design 	  40 
3.2 Example of a wiring block when m = 5 and p = 1. 	  40 
3.3 Interactions between different operations 	  45 
3.4 Datapath of the customisable ECC system 	  46 
3.5 The Montgomery algorithm for point multiplication. 	  49 
3.6 Applying multiple parallel field multipliers in point multiplication 	  50 
3.7 Overview of the customisable ECC system 	  55 
xiv 	 LIST OF FIGURES 
3.8 Area usage for various p and in values. 	  58 
3.9 Field multiplication comparison for various m values 	  59 
3.10 The limiting factors for Configurable SoC designs 	  62 
3.11 Diagram of the design flow. 	  65 
3.12 Architectural components in a single chip. 	  65 
3.13 Level 1: OPB timer block measurement of PowerPC 	  66 
3.14 Secure web server overview. 	  69 
4.1 The design flow of the scalable prime number validator 	  71 
4.2 Generating Montgomery modular constant c. 	  73 
4.3 Generating Montgomery modular multiplication S. 	  74 
4.4 Hardware optimisation for modular multiplication and squaring 	  74 
4.5 Generating Montgomery modular exponentiation S 	  75 
4.6 The architecture of the exponentiation unit. 	  76 
4.7 	The architecture of the scalable Montgomery multiplier. 	  77 
4.8 The architecture of a single Processing Element (PE) 	  77 
4.9 The dependency graph of the computation. 	  79 
4.10 The overview of PrimeGen design generator 	  80 
4.11 The analysis of the scalable Montgomery multiplier 	  83 
4.12 An overview of the SoC design framework. 	  84 
5.1 Overview of current and the proposed embedded processors. 	  87 
LIST OF FIGURES 	 xv 
5.2 Design methods using polynomial and rational approximations. 	  90 
5.3 Matlab design tool flow. 	  91 
5.4 Maple generates the polynomial coefficients. 	  91 
5.5 Maple generates the rational coefficients. 	  92 
5.6 	 f p2int and int2 f p operations 	  93 
5.7 64-bit integer shift left operation. 	  96 
5.8 64-bit shift algebraic right operation. 	  96 
5.9 	Degree one approximation for log (x) 	  98 
5.10 Static error analysis of degree-one log(x) 	  100 
5.11 Error analysis of log(x) range reconstruction 	  101 
5.12 Code generation optimization example 	  102 
5.13 The effect of compiler optimization 	  103 
5.14 Input/Output of the automated system. 	  104 
5.15 Our embedded system under testing. 	  105 
5.16 32-bit log(x) function evaluation without optimization. 	  105 
5.17 U-shaped curve for the 32-bit approximation. 	  106 
5.18 Maximum error comparison 	  107 
5.19 Speedup comparison 	  108 
6.1 Inverse cumulative distribution function of the Gaussian distribution. 	 113 
6.2 Design flow of the proposed approach. 	  114 
6.3 	Overview of the RNG architecture. 	  115 
xvi 	 LIST OF FIGURES 
6.4 Illustration of different segmentation schemes. 	  117 
6.5 The inversion plots for the Gaussian (Normal) distribution 	  118 
6.6 Segment ranges in binary representation for Bx = 7 	  119 
6.7 Variation of the number of segments m with error requirement. 	  121 
6.8 Variation of segment number m against different input bit-width. 	 122 
6.9 The total error of the 12-bit input designs    122 
6.10 Variation of the maximum u value. 	  123 
6.11 Hardware architecture for generating Gaussian random number. 	  124 
6.12 Architecture of the Tausworthe URNG in Fig. 6.11 	  125 
6.13 Illustration of the bit selection unit 	  126 
6.14 PDFs of the generated random numbers. 	  136 
6.15 Error plot in ulp using 216 randomly selected samples. 	  137 
6.16 PDF of the generated random numbers. 	  138 
6.17 Area, clock speed and latency variation. 	  139 
6.18 Pipeline registers distribution of the GRNG design. 	  139 
6.19 Overview of the GRNG using System Generator. 	  140 
6.20 Overview of the address decoding block of the GRNG. 	  141 
6.21 Area comparisons for variable error requirements. 	  142 
6.22 Latency comparisons for variable error requirements. 	  142 
6.23 Area comparisons using variable degree-d. 	  143 
6.24 Latency comparisons using variable degree-d. 	  143 
6.25 Area comparisons for variable input bit-widths. 	  144 
6.26 Latency comparisons for variable input bit-widths. 	  144 
xvii 
xviii 
Chapter 1 
Introduction 
1.1 Motivation 
Field programmable gate arrays (FPGAs) offer rapid-prototyping platforms for Application-
Specific Integrated Circuit (ASIC) designs and can be reprogrammed for different applications. 
The rapid growth of device density has provided a foundation for computer designers and 
architects for creating complex digital systems. One of the problems of the circuit design chain 
is the lack of satisfactory and convenient design tools that can optimise design trade-offs such 
as area and performance. Design reuse is usually enabled by deploying a customisable design 
flow. In other words, it is beneficial that we can achieve high-performance or area-efficiency 
by using the design tools. In this thesis, we describe how we produce various customisable 
arithmetic designs by using design generators effectively, and the use of FPGA devices for 
rapid prototyping. Our aim is to study hardware customisation for various arithmetic designs. 
Moore [Moo65] has predicted the growth of semiconductor industry for over 40 years that 
still happens to be true. With this sustainable growth in circuit size, hardware designers are 
increasingly challenged to achieve optimum designs and to fulfill design specifications. An 
ASIC is dedicated to execute one special application only, while an FPGA is a reprogrammable 
device with plenty of logic resources that enable programmability. Customisation is defined as 
"to make or change something according to user's needs" [Cam03]. 
2 	 Chapter 1. Introduction 
There are several reasons for the importance of customisation and the use of FPGAs: 
• FPGAs provide a shorter time to market than a fixed ASIC design. As the lifetime of 
hardware designs constantly shrink, the time to market requirement is a primary and 
vital concern. 
• FPGA devices are getting much larger every year. For example, the 90nm Xilinx Virtex-4 
device was released in Feb 2005, and the latest Xilinx Virtex-5 FPGA device using 65nm 
fabrication process was released in May 2006. Virtex-5 FPGAs deliver up to 330,000 logic 
cells (65 percent increased than the Virtex-4 generation) [v5f]. How can we make use of 
the extra capacity provided by the device vendor effectively without going through the 
whole design process? 
• Many applications may require minor changes of design after production. For instance, 
security requirements and wireless protocols may change every year because of newly 
released industrial standards in many networking applications. Thus, it is essential to 
offer more robust functionality. 
• A major challenge is to provide a reconfigurable or reusable hardware architecture for 
various applications without losing design productivity. Therefore, one question is how 
to make a customisable and automated design process? 
• Customisable design tools are important to support different trade-offs. For example, 
we can trade an increased area by a clock period reduction [CW97]. Or we can trade a 
reduced delay path by an increased power consumption. 
• The Xilinx floating-point core [Xi105] enables designs with customisable operation, bitwidth, 
latency and interface. Obviously, we can develop methodology [ALC+02] for automat-
ically determining the appropriate size of mantissa and exponent of the floating-point 
core. 
On the one hand, if we plan to achieve the smallest design for a given application, it is important 
to first analyze and understand that application. To reduce design time, we also introduce the 
1.1. Motivation 	 3 
design reuse of the components in the system, and introduce several design techniques such as 
optimisation of the datapath bitwidth and pipeline stages reduction for shortening the latency. 
On the other hand, performance improvement is one of the key advantages of using customised 
hardware over software implementation. Many researchers are able to accelerate algorithms 
implemented on general purpose processors [HVO4]. This is usually achieved by decomposing 
the target algorithm into fragments and parallelising them on dedicated hardware such as 
FPGA. 
Given the importance of customisation, in this thesis we have developed a customisable design 
generator library for improving the productivity of arithmetic hardware designs. The first two 
design generators target cryptographic hardware, while the last two design generators focus on 
function evaluation using embedded processors and reconfigurable hardware. 
For the first two design generators for cryptographic applications, finite-field arithmetic is 
used in elliptic curve cryptosystems while modular or modulo arithmetic is used for the prime 
number validators. Finite-field arithmetic refers to the operations applying to a limited number 
of elements in the finite field. The results of finite-field arithmetic result in an element within 
that defined field. Modular arithmetic [Mod07] is defined as a system of arithmetic for integers 
where numbers wrap around after they reach a certain value. 
For the two design generators involving function evaluation, integer arithmetic is used in 
custom-precision embedded processor design, while fixed-point arithmetic is used in random 
number generation. Integer arithmetic means computing without fractions because the gener-
ated fractions will be truncated. Fixed-point arithmetic is a generalisation of integer arithmetic; 
a fixed-point computation involves a fixed number of bits before and after the radix point, and 
is used in the final random number generator designs. A brief comparison for the customisable 
arithmetic hardware designs are shown in Table 1.1 and Table 1.2. 
4 	 Chapter 1. Introduction 
designs arithmetic hardware structure 
ECC (Chapter 3) finite-field parallel 	field 	multiplier: 
Fig. 3.1 
primality test (Chapter 4) modular Montgomery modular multi-
plier: Fig. 4.7 
custom-precision 	function 	evaluation 
for embedded processors (Chapter 5) 
integer 32-bit 	and 	emulated 	64-bit 
datapath: Fig. 5.4 and Fig. 5.5 
random number generator (Chapter 6) fixed-point address decoding unit ± poly-
nomial evaluation: Fig. 6.11 
Table 1.1: Hardware comparison of customisable arithmetic designs. 
designs customisations 
ECC (Chapter 3) key-size, degree of parallelism 
primality test (Chapter 4) scalability, bit-size, processing elements 
custom-precision 	function 	evaluation 	for 
embedded processors (Chapter 5) 
precision, speed, code-space area 
random number generator (Chapter 6) precision, area, random number quality 
Table 1.2: Customisation comparison of customisable arithmetic designs. 
1.2 Contributions 
The major contributions are divided into the following parts from cryptographic application 
designs, fundamental multiplication operations, to customisable design generators for embedded 
processors and random number generators. 
Customisable elliptic curve cryptosystems 
We optimise a customisable elliptic curve cryptosystem (ECC) that contains all ECC basic 
operations. This cryptosystem is built based on a parametric parallelised and pipelined field 
multiplier and an optimised point multiplication using multiple parallel field multipliers. More-
over, we develop a design generator that takes the key size and degree of parallelism of the 
design for producing efficient "hardcore" control and data path, and a parametric model for 
calculating the number of cycles for the system. (Chapter 3) 
1.2. Contributions 	 5 
Cryptographic system-level application designs 
We create a design generator for producing prime validators based on user-specified parameters, 
and propose an System-on-a-Chip design framework for prime generation. We also evaluate the 
trade-offs between the ECC system using pure embedded microprocessor and hybrid processor-
FPGA designs, and adopt our cryptosystem for speeding up the SSL key exchange in a secure 
client/server system. (Chapter 3) 
Fundamental multiplication operations 
We develop a scalable design method for mapping the Rabin-Miller prime test into hardware, 
and parallel designs for Montgomery modular arithmetic operations. We also provide imple-
mentations on the proposed hardware architectures in reconfigurable devices, with an evaluation 
of their effectiveness compared with other methods. (Chapter 4) 
Customisable function evaluation library for embedded processors 
We develop a customisable library for floating-point function evaluation based on the integer 
instruction set used in embedded processors. This approach enables full automatic code gen-
eration and code optimisation in order to obtain the best trade-off in embedded code space, 
performance and accuracy of arbitrary embedded processors. (Chapter 5) 
Design generator for producing random number generators 
We develop the best existing (in terms of throughput per area, as of October 2006) Gaussian 
random number generator using RNGGen, an automated hardware-based random number gen-
eration using the inversion method. This proposed method is capable of generating random 
numbers from arbitrary distributions. Three demonstrated functions are used to generate large 
number of samples using fixed-point arithmetic from the Gaussian (Normal), Exponential and 
Log-normal distributions. (Chapter 6) 
Customisable design generation 
Arithmetic hardware 
Cryptographic 
hardware 
Function evaluation 
hardware 
Chapter 3 Chapter 4 Chapter 5 Chapter 6 
Elliptic Curve 
Cryptographic 
hardware 
Primality test 
hardware 
Custom- 
precision 
function 
evaluation 
Random number 
generation 
hardware 
Finite-field Modular Integer Fixed-point 
6 	 Chapter 1. Introduction 
Figure 1.1: Overview of this thesis. 
1.3 Thesis structure 
This thesis is organized as follows. Chapter 2 describes an overview of related work and back-
ground of this thesis. Chapter 3 then presents a customisable Elliptic Curve Cryptosystem using 
the design generator (ECCGen). Chapter 4 presents a design generator (PrimeGen), hardware 
designs for primality test using scalable and non-scalable Montgomery multipliers. Chapter 5 
describes the integer mathematical function evaluation generator (IMGen) for embedded pro-
cessors. Chapter 6 presents an automated approach for producing hardware-based non-uniform 
random number generators (RNGGen) using the inversion method. Finally, Chapter 7 sum-
marises the research achievements of this thesis, and presents a possible future research plan. 
An overview of this thesis is shown in Fig. 1.1. 
1.4 	Statement of originality 
This thesis consists of the research work conducted in the Department of Computing at Imperial 
College London from January 2004 to December 2006 and in Electrical Engineering Department 
at UCLA in January to February, June to August 2006. I declare that the work presented in 
1.5. Publications 	 7 
this thesis is my own, except where acknowledged in the thesis. 
Chapter 3 involves collaboration with N. Telle. This work extends the work of N. Telle's MSc 
project: "A fast and customisable elliptic curve cryptosystem on an FPGA" which covers a 
parametrisable field multiplier. In ECCGen, our major achievements include the optimisa-
tion of the parametrisable field multiplier (further improved the field multiplication speed by 
100%), and applying and scheduling of multiple parallel field multipliers for point multiplica-
tion. Results show that on average 2-3 times speed improvement on point multiplication, the 
key operation in an ECC system can be achieved when compared with the previous work. 
Chapter 4 involves collaboration with A. Brown. A. Brown is responsible for the initial version of 
the non-scalable modular multiplier design. In Prime Gen, our major achievements include the 
exploration of scalable and non-scalable prime validation designs, and producing and optimising 
the design generator. 
Chapter 5 involves collaboration with D. Lee and 0. Mencer. D. Lee has suggested the study 
of higher polynomial degree function evaluation for embedded processors and the comparison 
with Intel XSca1eTM [ITN]. 0. Mencer has suggested the comparative study of polynomial and 
rational approximation in this work. 
Chapter 6 involves collaboration with D. Lee and J. Villasenor. D. Lee has suggested the use of 
hierarchical segmentation method for the highly non-linear function evaluation and bit-width 
optimisation of this work. J. Villasenor has motivated the study of various approaches for 
Gaussian random number generation. 
1.5 Publications 
The following is a list of publications by the thesis author, related to the work contained in 
this thesis. 
8 	 Chapter 1. Introduction 
Journal papers 
• R.C.C. Cheung, N. Telle, W. Luk, and P.Y.K. Cheung, "Customisable Elliptic Curve 
Cryptosystems", in IEEE Transactions on Very Large Scale Integration Systems, 13(9), 
pp. 1048-1059, September, 2005. (Based on Chapter 3 in this thesis). 
• D. Lee, A. Abdul Gaffar, R.C.C. Cheung, 0. Mencer, W. Luk, and G.A. Constantinides, 
"Accuracy Guaranteed Bit-Width Optimization", in IEEE Transactions on Computer-
Aided Design, 25(10), pp. 1990-2000, October, 2006. (Based on Chapter 6 in this thesis). 
• R.C.C. Cheung, D. Lee, J.D. Villasenor, and W. Luk, "Hardware Generation of Random 
Variates from Arbitrary Distributions via the Inversion Method", to appear in IEEE 
Transactions on Very Large Scale Integration (VLSI) Systems, 2007. (Based on Chapter 
6 in this thesis). 
• D. Lee, R.C.C. Cheung, J.D. Villasenor, and W. Luk, "Hardware implementation trade-
offs of polynomial approximations and interpolations", submitted to IEEE Transactions 
on Computers, August 2006. (Based on Chapter 6 in this thesis). 
• D. Lee, R.C.C. Cheung, J.D. Villasenor, and W. Luk, "Hierarchical Segmentation for 
Function Evaluation", submitted to IEEE Transactions on Computers, December, 2006. 
(Based on Chapter 6 in this thesis). 
Conference papers 
• R.C.C. Cheung, A. Brown, W. Luk, and P.Y.K. Cheung, "A Scalable Hardware Archi-
tecture for Prime Number Validation", in Proceedings of IEEE International Conference 
on Field Programmable Technology (FPT), pp. 177-184, 2004. (Based on Chapter 4 in 
this thesis). 
• N. Telle, W. Luk and R.C.C. Cheung, "Customising Hardware Designs for Elliptic Curve 
Cryptography", in Proceedings of the International Workshop on Systems, Architectures, 
1.5. Publications 	 9 
Modeling and Simulation (SAMOS'04), pp. 274-283, 2004. (Based on Chapter 3 in this 
thesis). 
• R.C.C. Cheung, W. Luk, and P.Y.K. Cheung, "Reconfigurable Elliptic Curve Cryptosys-
tem on a Chip", in Proceedings of IEEE Design Automation and Test in Europe (DATE), 
pp. 24-29, Vol. 1, 2005. (Based on Chapter 3 in this thesis). 
• G. Zhang, P.H.W. Leong, D. Lee, J.D. Villasenor, R.C.C. Cheung and W. Luk, "Ziggurat-
based Hardware Gaussian Random Number Generator", in Proceedings of International 
Conference on Field Programmable Logic and its Applications (FPL), pp. 275-280, 2005. 
(Based on Chapter 6 in this thesis). 
• R.C.C. Cheung, D. Lee, 0. Mencer, W. Luk, and P.Y.K. Cheung, "Automating Custom-
Precision Function Evaluation for Embedded Processors" , in Proceedings of ACM/IEEE 
International Conference on Compilers, Architecture, and Synthesis for Embedded Sys-
tems (CASES), pp. 22-31, 2005. (Based on Chapter 5 in this thesis). 
• D. Lee, R.C.C. Cheung, J.D. Villasenor, and W. Luk, "Inversion-based hardware Gaus-
sian random number generator: a case study of function evaluation via hierarchical seg-
mentation", in Proceedings of IEEE International Conference on Field-Programmable 
Technology (FPT), pp. 33-40, 2006. (Based on Chapter 6 in this thesis). 
Short papers 
• R.C.C. Cheung and A. Brown, "A Scalable System-on-a-chip Architecture for Prime Num-
ber Validation", in Proceedings of IEE SoC Design, Test and Technology Postgraduate 
Seminar, Loughborough, United Kingdom, 2004. (Based on Chapter 4 in this thesis). 
• R.C.C. Cheung, "A System on Chip Design Framework for Prime Number Validation 
using Reconfigurable Hardware", in Proceedings of International Conference on Field 
Programmable Logic and Applications (FPL), LNCS 3203, pp. 1186-1187, 2004. (Based 
on Chapter 4 in this thesis). 
Chapter 2 
Background and Related Work 
2.1 Introduction 
This chapter presents background material and related work of this thesis. Section 2.2 contains 
an introduction to FPGA technology and the design tools used in this thesis. Section 2.3 
covers an overview of System-on-Chip (SoC) systems. Section 2.4 discusses the background of 
cryptographic hardware and the associated arithmetic for Chapter 3. Section 2.5 introduces the 
importance of prime number test and the related work, while Section 2.6 describes Montgomery 
arithmetic and the latest developments; both sections cover essential background for Chapter 4. 
Section 2.7 reports existing function evaluation approaches for Chapter 5 and random number 
generators for Chapter 6. Finally, Section 2.8 summaries the chapter. 
2.2 Field-Programmable Gate Array (FPGA) 
2.2.1 Architecture 
A Field-Programmable Gate Array (FPGA), based on Very Large Scale Integrated (VLSI) cir-
cuit technology, consists of an array of pre-fabricated functional blocks and wire connections 
10 
—I 
_2 
3 
4 
Ian MI MIME EME MI 	=Mlle 
it 
 
	
111  	
• 
• 11 • 
11111111111•1111irli II Mil= 	MIMI. leEll= =MEM 
MM EE 
EOM 
=II NM 
MUNI. 
111 • 
switches 
in S box  
3 
vertical channel track ID 
1 	2 3 4 
horizontal 
channel 
logic cell 
II 2 3 4 S box 
wire segment 
C box 
switches 
in C box 
pin of logic cell 
=num' —MILV 
2.2. Field-Programmable Gate Array (FPGA) 	 11 
Figure 2.1: The architecture of a 2D FPGA. 
with user-programmability of logic and routing resources. FPGAs have a fast turn-around 
time and economic manufacturing cost especially for low-volume designs; they have been dom-
inating fast digital design prototyping. A typical 2D FPGA architecture is shown in Fig. 2.1. 
Configurable logic blocks (CLBs, inside the L-boxes) are separated by vertical and horizontal 
channels. The functionality of each CLB is defined by circuit designers using high-level hard-
ware description languages. There are W prefabricated parallel wire segments running between 
each pair of adjacent CLBs in both vertical and horizontal channels. The wire segments in a 
vertical (or horizontal) channel are aligned into W vertical (or horizontal) tracks; each track 
within a channel is assigned an integer in {1, 	, W} as its track ID. There are connection 
boxes (C-boxes) in the channel between adjacent CLBs. A switch box, located at each in-
tersection of vertical and horizontal channels, contains programmable switches to connect wire 
segments running from its surrounding C-boxes. Comprehensive survey papers on FPGAs from 
a high-level perspective [CH02] or from an architectural perspective [TCW+05] are available. 
Recent generation reconfigurable devices not only provide a large amount of logic blocks (fine-
grained unit), but also provide plenty of complex coarse-grained resources such as block-
multiplier, block-RAM and Digital Signal Processing (DSP) units for cost-effective design 
solutions. This allows designers to develop highly-customised architectures for their target 
algorithms. An example of a modern FPGA is shown in Fig. 2.2. More advanced configurable 
devices may contain high-performance RISC cores and multiple peripherals bus interfaces such 
as the Stretch 55530 software-configurable processor which contains a high-performance RISC 
D O DD 
1110 
OE DD DD 
CIE DD 
OE 
Coarse 
grained 
unit 
Coarse 
grained 
unit 
MEM 
DSP.  
MULT 
DD DD DD DD aD 
CIE DD 
CIE 
DD 
E LI 
MULT . 
DSP 
E LI 
O111 aD 
DD DD DD 
O111 DD 
12 	 Chapter 2. Background and Related Work 
Fine- 	 Fine- 	 Fine- 
grained grained grained 
units 	 units 	 units 
Figure 2.2: The architecture of a fine-grained + coarse-grained FPGA. 
core and 256KB memory [Str05]. 
2.2.2 Design tools 
In this section, we describe the commonly used design tools for circuit designers and the tools 
used in this thesis. 
Integrated Synthesis Environment (ISE) 
ISE [Xi106c] is a tool suite and design environment that allows circuit designers to produce and 
implement their designs efficiently. The input design entry is synthesized into netlist and the 
final bitstream is produced after the mapping, placement and routing steps. The bitstream can 
be directly downloaded to the target FPGA or Complex Programmable Logic Device (CPLD) 
chips. In this thesis, all the designs are placed and routed using the ISE toolset. For the 
synthesis step, we use both ISE and Synplify-Pro from Synplicity for producing the netlist. For 
the simulation step, we use the ModelSim Standard Edition for verifying the simulation results. 
2.2. Field-Programmable Gate Array (FPGA) 	 13 
Embedded Development Kit (EDK) 
EDK [Xi106b] enables designers constructing their hardware Intellectual Property (IP) blocks 
and interfacing with the Xilinx embedded hard/soft processor cores. The EDK also provides a 
rich set of peripheral cores for System-on-chip design. In this thesis, we use EDK to develop 
customisable System-on-Chip (SoC) designs. The ML310 board is used to execute the proposed 
designs. In order to interface between the embedded processors and the customised IP blocks, 
the IBM CoreConnect bus architecture is used to enhance design productivity. 
2.2.3 Very-High-Speed-IC Hardware Design Language (VHDL) 
VHDL [Ped04] is a hardware design language that enables designers to create circuits containing 
millions of logic gates for FPGA or ASIC designs. It is an industrial standard which allows 
designers to specify the designs at different levels of abstraction. We can design using the 
highest behavioural description or we can also use the lowest structural or register transfer 
level description. VHDL is an effective choice for designers when the design has a wide range 
of customisation. For instance, we can incorporate the minimum bit-width information of each 
signal in the VLSI circuit design into the generated VHDL code. 
2.2.4 Handel-C 
Handel-C [Ce103] is a high-level design language which is very close to "ANSI C" and with 
the support of parallelism and synchronisation. The Handel-C compiler also provides a timing 
model for each Handel-C statement. For instance, each assignment statement takes one clock 
cycle to execute. The Handel-C library also provides arbitrary bit-width support and floating-
point data-type. 
2.2.5 Xilinx System Generator (SysGen) 
The Xilinx System Generator [Xi106c1] is a data-flow block-based design framework which is in-
tegrated to the Matlab Simulink environment. Users can provide the test vectors in Matlab and 
collect the simulated results from the System Generator design entry. The fast and automated 
design translation enables designers to convert the high-level SysGen design into low-level RTL 
14 	 Chapter 2. Background and Related Work 
code or output bitstream easily. However, SysGen is not suitable for designers who want to 
generate a large number of output circuits with different customisations. 
2.3 System-on-Chip (SoC) 
With the advance in deep sub-micron, billions of transistors on a chip has laid the foundation 
for building complex and efficient systems. System-on-chip is basically a complete system 
containing digital, analog, and even mixed-signal components in a single chip. We can easily 
find many SoC designs in embedded and mobile devices. Digital components usually consist of 
various types of integrated circuits such as microprocessor, memories including RAM and ROM. 
The current SoC systems are usually realised by full-custom ASIC, standard-cell and FPGA 
technology. In the following sections, we discuss the major components found in a typical SoC. 
2.3.1 Processors 
From the ITRS roadmap, it shows that systems embedded with up to 100 processors will be 
common by 2010. Instruction Processor [DC03] is one of the key components in SoC systems. 
An embedded processor used to mean 4- or 8-bit micro-controller and the current mainstream is 
the use of MIPS and ARM processors. Companies like ARC and Tensilica are developing config-
urable processor for embedded SoC computing and producing ASIC custom processors. Their 
approach is manually selecting the computational intensive part as tasks and builds custom-
instruction for those tasks into the processor pipeline. On the other hand, there exists active 
research in customising processors, verification of the generated processor and reconfigurable 
processor for software designer using FPGAs such as the dynamically reconfigurable processor 
XiRisc [MCD+05]. 
Mescal [MKM+02] and LISA [HKN+01] are two of the major academic research groups for 
developing new methodologies and tool-chains for embedded systems and customisable cycle-
accurate processor architectures. The user-defined processor is modelled in a high-level architec-
tural description and the LISA framework generates a cycle-accurate Instruction-Set Simulator 
(ISS) for design exploration. 
2.3. System-on-Chip (SoC) 	 15 
2.3.2 Communication Channels 
SoC designers need to consider how to exchange data amongst the SoC components. Bus 
communication channel is commonly used, examples include CoreConnect [corO6] from IBM, 
AMBA [AMB02] from ARM, however how many buses are required and how to inter-connect 
them are difficult problems. Automated bus generation [RV04] reduced the hand design time 
in weeks to seconds. System-level synthesis research focuses on the topology designs [GG99] 
as well as sharing the communication architecture. Recent research focus lies on the reduction 
of average processing time using transaction-level communication layer [LRLD04, LRD04] in a 
SoC system. 
Another class of SoC communication is called Network-on-Chips (NoC) [BDM02]. NoC provides 
a reliable communication between the SoC components in a layered packet-based methodology. 
A recent work [BJM+05] shows automated design exploration using NoCs in SoC systems. 
2.3.3 Reconfigurable fabrics 
Reconfigurable logic [Lys03b] can be used in three major areas in SoC designs: 1) functional 
verification and debugging, 2) as a component in SoC system, and 3) Platform FPGA de-
signs [Lys03a]. 
Both reconfigurable fabrics such as FPGA and processors are the two main computational 
components embedded in an SoC. We can classify the research using FPGAs in SoC design into 
two categories: 1) custom instructions such that the reconfigurable logic is used as a functional 
unit in the processor, and 2) co-processor unit such that the FPGA unit has a direct datalink 
to the processor. Related work of the first class includes PRISC [RS94], Chimaera [HFHK04, 
YSBOO] and ConCISe [RS99], and the second class includes the GARP processor [CHWOO] 
which connects a reconfigurable array and a MIPS processor by a crossbar switch. These 
approaches mainly use functional or circuit layout simulation to get a good estimate of the 
processor speed, power and area requirements. 
Recent work for integrating a reconfigurable unit to a processor pipeline includes OneChip [CC01] 
16 	 Chapter 2. Background and Related Work 
which supports additional functional units, and the software model extension for a reconfig-
urable processor [RLP03]. The major achievement of these approaches is the execution speedup 
obtained from the simulation with the use of reconfigurable units and applied runtime recon-
figuration. However, most related work in this area are soft processor simulations. 
2.3.4 Memories 
Memory is an essential component in SoC PW051. However, instruction memory and data 
memory in embedded systems, are very expensive. The two key design constraints are the 
memory efficiency in terms of latency and clock speed, and the memory size in terms of power 
and area consumption. When we design the memory component in SoC, we need to consider 
using on-chip or off-chip memory. The main question of using memory is how do we interface 
the memory block with other components of an SoC. Current research focuses on dynamic and 
static energy minimisation, and the influence of memory and cache organisation. 
2.3.5 Field-Programmable SoC 
Field-Programmable SoC (FPSoC) refers to a chip or device that has integrated one or more 
processors, memory and other peripherals, especially programmable fabric. Industry vendors 
such as Xilinx, Altera and QuickLogic offer programmable fabric-based devices as a multi-
application platform. The major difference between their products is the number and the type 
of on-chip peripherals, besides different types of processor architectures and the bus connecting 
between the processor and the peripherals. For instance, a 32-bit AMBA bus is used to connect 
the programmable logic and the 32-bit MIPS processor in the QuickLogic QuickMIPS [MIP03] 
chip, and PLB/OPB bus is used to connect the programmable logic and the PowerPC processor 
and the soft-programmable MicroBlaze processor in Virtex-II Pro [Xi103] chip. Fig. 2.3 shows 
an example of FPSoC die image which contains the programmable fabric, peripherals and a 
processor core. Other FPSoCs include "E5 and A7 chips" by Triscend which have a 8051 
micro-controller core and an 32-bit ARM core, Atmel FPSLIC which has a MIPS core. 
2.4. Cryptosystems 	 17 
Figure 2.3: QuickMIPS chip die image [MIP03]. 
2.4 Cryptosystems 
Cryptography is an active research area from military world in the 19 century to our daily 
lives. There are three major types of cryptographic operations: secret-key system which uses 
the same key for encryption and decryption, public-key system which uses two different keys, 
and hashing which generates a short message digest. The advance in cryptographic algorithms 
has prevented many possible attacks such as impersonation attack and spoofing attack. This 
section presents an overview of the latest development of various cryptosystems. 
2.4.1 Related work 
In this section, we first present the importance of cryptosystems and then we describe the 
latest cryptographic hardware designs. The difficulty of the underlying Elliptic Curve Discrete 
Logarithm Problem (ECDLP) makes ECC cryptosystems suitable for applications that need 
long-term security. In 2002, the U.S. Government adopted ECC for protecting mission-critical 
information [Cer97]. For instance, the NIST has recommended specific curves for implemen-
tations [oCIoST00] and the IEEE has provided detailed specification for the choices of private 
key length and fields [P1300]. 
ECC research can be first divided into two groups depending on the underlying field repre- 
18 	 Chapter 2. Background and Related Work 
sentation: prime field, GF(p) and binary field, CF(2m). Two bases, Optimal Normal Basis 
(ONB) and Polynomial Basis (PB) are commonly used for manipulating binary fields. It is 
known that binary field is more suitable for hardware implementation, and ONB is said to be 
dominated fast squaring. However, point multiplication requires both efficient field multiplier, 
inversion [IT98], squaring and efficient coordinate system [LD99]. In recent years, there has 
been much software [HH1\400] and hardware [OPOO, OP01] research work on both GF(p) and 
GF(2m) focusing on the performance of point multiplication, and it is obvious that a fast point 
multiplication design is necessary and crucial. 
The first ECC hardware implementation [AMV89] is presented in 1989. Previous hardware work 
includes: the first ASIC implementation with Motorola M68008 micro-controller [AMV89], 
reconfigurable finite-field multiplier [JMEH02], ASIC designs for field operations over spe-
cific GF(2m) fields [Mas91], an ECC implementation on the 8051 microprocessor in smart 
cards [WBP00], an ECC processor on a smart card device [Pie00], and recent FPGA imple-
mentations for ECC designs [LL02, BDG+02, BDT+02, KPMF02, GSE+02, NGCEG03, LH04]. 
In this thesis, we aim at optimising the field multiplier for normal basis and adopting state-of-
the-art algorithm for point multiplication. As a result, we expect to build a customisable and 
efficient ECC cryptosystem which is presented in Chapter 3. 
2.4.2 Arithmetic background 
The field and group theories relevant to ECC designs have been extensively studied [LN94]. 
In this section, we briefly describe the mathematical basics that are necessary for building the 
field multiplier. The Polynomial Basis (PB) is formed by the set of {1, a, a2, 	, am-1} where 
a is a root of the non-reducible prime polynomial P(x) of degree m, while the Normal Basis 
(NB) is using another set {a, a2, 	a2m-1} [LN94]. There is always a trade-off using different 
bases for both software and hardware implementation. For instance, squaring is easier in NB 
than in PB, while inversion in NB is slower than in PB. The underlying arithmetic of both 
PB and NB has been extensively studied [Men93a] and the corresponding high-performance 
multipliers [RHcK03, ScK01] are also reported. 
2.4. Cryptosystems 	 19 
We first introduce field multiplication for normal basis in this section, and describe the archi-
tecture of our field multiplier in the next chapter. Given that we multiply two elements A, B 
of GF(2m) using normal basis and the product is another element C in the same field, then 
m-i 
icy — 	k a2 2j C c k 
j=0 	 k=0 
where C = A x B, and the coefficients of element C, ck (k E [0, m — 1]) can be expressed as 
follows [Ros99]: 
m-1 m-1 
A= 
i=0 
a ia 
Ck = ai±k • bj±k • Aii 
The Aii 's are the elements of the multiplication matrix A, which are either 0 or 1 [Ros99]. An 
optimal normal basis means that the number of nonzero Aij  is minimised. Since most of the A 
values are zero, the sum of inner product in the above equation can be written in the form: 
E ai+k bi+k Aii = br (ap aq ) 
i=o 
for some r, p,q E [0, m — 1], except for j = 0 where (for some r,p E [0, m — 1]): 
ai±k bj+k Aij = br(ap) 
i=0 
There are two types of ONB, Type I and Type II, which are determined by the m value and 
the values in the A matrix. The ECCGen presented in this thesis has handled both cases. 
2.4.3 Secret-key cryptosystems 
Secret-key cryptosystems are also called symmetric systems because the protocol requires that 
the sender and receiver first agree on a particular key before communication, and it is suitable 
for large dataset encryption. Since the dramatic growth of computational power and cluster 
computing, the old DES standard which has been adopted as national standard since 1976 is 
20 	 Chapter 2. Background and Related Work 
considered no longer secure. In 1997, the US National Institute of Standards and Technology 
(NIST) started to develop a new encryption standard: the Advanced Encryption Standard 
(AES). The Rijndael algorithm [DR02] developed by Daemen and Rijmen has been selected 
among 15 candidates as the Federal Information Processing Standard (FIPS) for its good per-
formance in both hardware and software implementation, low memory requirement, and high 
degree of instruction-level parallelism. 
Rijndael 
The Rijndael [DR02] is named after the two Belgian creators who are the winners of the AES 
competition. The new AES NIST cipher standard has three block sizes: 128 (AES-128), 192 
(AES-192) and 256 (AES-256) bits. The whole process from original data to encrypted data 
involves one initial round, r —1 standard round and one final round. The major transformations 
involve the following steps: 
• SubBytes - an input block is transformed byte-by-byte by using a special design substi-
tution box (S-Box). 
• ShiftRows - the bytes of the input are arranged into four rows. Each row is then rotated 
with a predefined step according to its row value. 
• MixColums - the arranged four rows structure is then transformed by using polynomial 
multiplication over CF(28) per column basis. 
• AddRoundKey - the input block is XOR-ed with the key in that round. 
There is one round AddRoundKey operation in the initial round; the standard round consists 
of all four above operations; the MixColums operation is removed in the final round operation 
while the rest three operations remain. On the other hand, the inverse transformations are 
applied for decryption. The round transformation can be parallelised for fast implementation. 
When comparing between 3DES, Rijndael not only provides higher security but also achieves 
better performance. The evaluation criteria of this competition were divided into three major 
2.4. Cryptosystems 	 21 
categories: security, costs and algorithm and implementation characteristics. In fact, most of 
the candidates have showed "no weakness demonstrated" in many security attacks. The costs 
are such as program size, memory requirement and chip area. Rijndael has been finally selected 
because of its consistent good performance in both software and hardware across a wide range 
of computing environments [DR02]. 
This block cipher standard includes three block sizes: 128 (AES-128), 192 (AES-192) and 256 
(AES-256) bits. The whole block encryption is divided into different rounds. For the AES-
128 standard, it consists of 10 rounds. Hodjat and Verbanwhede [HVO4] have presented a 
heavily pipelined into FPGA device and achieve over 21Gbit/s throughput. In recent years, 
many high performance AES designs have been published, most of them have explored the 
specific architectures in FPGAs such the use of block-RAM and block multiplier, moreover, 
these designs can usually achieve over 10Gbit/s. 
2.4.4 Public-key cryptosystems 
The concept of public-key cryptography [DH76] was first introduced by Whitfield Diffie and 
Martin Hellman in 1976. This section presents the two popular public-key cryptosystems: 
RSA [RSA78] and ECC [Kob87, Mi185]. Both systems are part of the IEEE P1363 Standard 
Specifications for Public-key Cryptography [P1300]. RSA is based on the Integer Factoriza-
tion Problem (IFP) and ECC is based on another difficult Elliptic Curve Discrete Logarithm 
Problem (ECDLP). There are many other existing cryptosystems such as El-Gamal [E1g85], 
NTRU [HPS98] and Hyper-Elliptic Curve Cryptography (HECC) [Kob89a, Kob89b] that have 
not been discussed in this section. 
RSA 
Rivest, Shamir and Adelman first presented the first public-key cryptosystem, RSA [RSA78], 
in 1977. Since then, RSA has been widely used in various applications. The public key used 
in RSA system consists of a large non-prime number N or the so called RSA modulus, which 
22 	 Chapter 2. Background and Related Work 
is chosen from the product of two large prime numbers and a random integer e. The system 
is robust due to the difficulty of factoring the product of these two large prime numbers. For 
instance, in the RSA cryptosystem, the modulus N = p • q is public, and the two prime numbers 
p and q are kept secret. The random integer e is relatively prime to (p-1)(q-1). The decryption 
key, d, can be computed by using the extended Euclidean algorithm as follows: 
e • d 1 mod (p —1)(q — 1) 
The public key is N and e, and the private key is d. Suppose m is the plaintext and c is the 
ciphertext. The encryption is done by 
c = me mod N 
and decryption can be computed by 
7 = Cd mod N 
The p and q values are no longer useful after the public and private keys have been calculated 
in the RSA system. It is obvious that the primality testing is important for generating the 
above two large prime numbers. As an example, two 512-bit primes are required for generating 
one 1024-bit RSA modulus N. 
Elliptic Curve Cryptography (ECC) 
Elliptic curve has a very strong mathematical background since it has been studied extensively 
for the past 150 years. In 1985, Neal Koblitz [Kob87] and Victor Miller started a new ECC 
research area independently. ECC [Men93b] is based on the intractability of the elliptic curve 
discrete logarithm problem (ECDLP) property. Given points P and Q, find a number k such 
that Q = k • P, k is called the discrete logarithm of Q to the base P. 
An elliptic curve example is shown below: 
2.4. Cryptosystems 	 23 
y2 =- x3 ax + b 
where x, y, a and b are real numbers and 4a3 27/ 2 0. 
Elliptic curves are usually represented using either prime field GFp where p is a prime, or 
represented using binary field GF2m where it has 2m elements, since the calculation of real 
number is slow and has round-off problem. A group is a set of elements with custom-defined 
arithmetic operations on those elements. The field Fp has elements from 0 to p — 1 and the 
points are all on the curve with modulo p operation. Thus, the ECC field always has a finite 
number of points. For many m values, the F2m field has both representations: polynomial basis 
(PB) representation and optimal normal basis (ONB) representation. In the binary F2m field, 
each of the 2' elements is an m-bit element. The ONB representation is classified into Type I 
and Type II, based on the particular m value [Ros99]. Below shows the field elements of GF(7) 
and GF(23) with an irreducible polynomial p(x) = x3 + x + 1: 
GF(7) = 000, 001, 010, 011, 100, 101, 110 
GF(23) = 000, 001, 010, 011, 110, 111, 101 
The security level of ECC using 112-bit keys is equivalent to 512-bit RSA, 168-bit ECC to 
1024-bit RSA and 196-bit ECC to 2048-bit RSA. A complete security comparison on various 
security systems is reported in [LV01]. An ECC challenge [Cer97] has also been setup to prove 
the security strength of ECC systems. The following table shows the NIST guidelines for 
public-key sizes with equivalent security levels. 
Finite Field 
An elliptic curve is represented by a finite number of points which are elements over a finite 
field [LN94]. Two commonly used fields are binary field GF(2m) and prime field GF(p) as 
shown in Fig. 2.4. There is no security and standardization differences between using two 
EC Point Arithmetic 
Finite Field 
Binary Field: GF( 2m) 
Polynomial [ Optimal Normal 
• 
Prime Field: GF(p) 
24 	 Chapter 2. Background and Related Work 
Security 
bit size 
Symmetric 
encryption algorithm 
RSA 
key size 
Bit length of p in 
prime field GFp  
Dimension m of size 
binary field GF(27n) size 
80 Skipjack 1024 192 163 
112 Triple-DES 2048 224 233 
128 AES-128 3072 256 283 
192 AES-192 7680 384 409 
256 AES-256 15360 521 571 
Table 2.1: Security level comparison [Cer06] 
Figure 2.4: Finite fields for elliptic curve representation. 
different fields. The binary field can be interpreted in many different ways; in addtion to the 
two most efficient ways PB and ONB, there are dual basis and triangular basis. The major 
consideration is the system requirement and the performance of implementation in software or 
hardware. For instance, binary field is generally chosen for its significant performance gain and 
size reduction over prime field, and ONB basis has an advantage of squaring field elements by 
simple rotation of the ONB vector representation. 
Elliptic Curve Digital Signature Algorithm (ECDSA) 
ECDSA is an ANSI [ANS99] standard and part of IEEE P1363 standards. We can use elliptic 
curve point multiplication Q = k • P where P is the base point which is randomly chosen on the 
curve for key generation, Q is the public key, and k is the private key. Furthermore, ECDSA is 
applied to signature generation and verification [KMV00]. Previous work [HHIVI00] has shown 
2.5. Pseudo-primality test 	 25 
that an ECC software implementation over special Koblitz curves takes 1.18ms and 2.70ms for 
signature generation and verification over GF(2163 ) field on a Pentium-II 4001VIHz machine. 
2.5 Pseudo-primality test 
Primality test is essential for prime number generation. Cryptography often uses large prime 
numbers, for instance, RSA public key algorithm requires 512-bit, 1,024-bit for key genera-
tion [Sch96]. Much research work has been done on primality test and generation, and most 
algorithms are based on factorization [Rie94]. In [AKS02], the primality test has been proved 
to be solvable in polynomial time. However, it requires a log operation which is expensive for 
hardware implementation; it also requires knowledge of the primality of preceding numbers, 
which is impractical for arbitrary prime testing. 
In [JPV00], Joye et al. has proposed an efficient prime number generation scheme based on 
pseudo-random number generation and implemented the design on a smart-card platform. Lu 
et al. [LdSP02] has further developed the RSA key generation for smart card using different 
prime number algorithms. 
There is a special class of very large prime numbers which have the representation of 2n-1. They 
are called Mersenne prime numbers. For example, the 44th Mersenne prime number 232'582'657 -1 
was discovered in September 2006. This prime number has 9,808,358 digits which has 650,000 
more digits than the previous record prime number. A clustered workforce GIMPS [GIM] in 
the Internet is currently dedicated to locate the next Mersenne number. 
2.5.1 Fermat primality test 
There are many existing primality tests [Rib91] such as Solovay-Strassen Test, Lehmann Test, 
Fermat test and Rabin-Miller Test. The Fermat test is based on the famous little Fermat 
theorem which states that 
an-1  = 1 mod n 
26 	 Chapter 2. Background and Related Work 
The pseudo-code of Fermat primality test is shown below. 
The Fermat primality test algorithm 
Input: The number under primality test: p 
1. Random choose a number a in [1, p-1] 
2. Let r = 0-1  
3. IF 
4. r 	1 mod p 
5. RETURN "Composite" 
6. ELSE 
7. RETURN "Pseudo-prime" 
2.5.2 Rabin-Miller primality test 
In this thesis, we have selected the Rabin-Miller probabilistic primality test algorithm (Fig. 2.5) 
as the core of the primality test. This algorithm is based on the properties of strong pseudo-
primes and has also been adopted by Mathematica's PrimeQ command [Mat]. This algorithm is 
developed by Rabin [Rab80] based on Miller's [Mi176] idea. The algorithm can quickly determine 
the primality of a given large number with a controllably small probability of error [Arn95]. It 
requires a number of small primes for repeatedly testing the input number. The probability 
of falsely identifying a composite as a prime decreases with every additional small prime used. 
This method enables trade-off between accuracy (more small primes) and efficiency (fewer small 
primes) for different applications. 
From the above algorithm, "Inconclusive" implies the number p maybe prime. The selection 
of number b is based on a set of small primes Z where 
2.6. Montgomery arithmetic 	 27 
Input: The number under primality test: p 
1. Random choose a number b in [1, p-1] E Z 
2. Let p = 2qxrn+1 where m is an odd integer 
3. IF either 
4. Case 1: btm - 1 (mod p) or 
5. Case 2: there is an integer i in [0, q-1] 
6. such that bmx2' = -1 (mod p) 
7. RETURN "Inconclusive" 
8. ELSE 
9. RETURN "Composite" 
Figure 2.5: The Rabin-Miller probabilistic primality test algorithm. 
Z = 2, 3, 5, 7, 11, 13, 17, 19, ..., n < p — 1 
With the use of first eight small primes, the 100% accuracy of primality test can be achieved 
for numbers up to 3.4 x 1014 [Arn95]. The actual number of small primes can be customised 
by the design generator presented in Section 4.4 in Chapter 4. 
2.6 Montgomery arithmetic 
The Montgomery multiplication algorithm is used to speed up the modular multiplication. The 
modular squaring and exponentiation are both accelerated by using Montgomery multipliers. 
Since the release of Montgomery algorithm in 1985 [Mon85], there are many modifications and 
developments based on the initial algorithm. The major reason is that many cryptosystems 
require efficient modular operations. For instance, the following modular multiplication when 
we use the Montgomery method is much faster than the standard multiplication for normal 
number. 
C=AxBmodN 
Also, we can apply the same Montgomery method to finite field Koc et al. [cKA98] show that the 
multiplication operation in the field CF(2m) can be implemented much faster than the standard 
28 	 Chapter 2. Background and Related Work 
multiplication. Below gives the modular multiplication of the field elements a(x), b(x), c(x) and 
n(x). In their paper, software results show that the proposed Montgomery method can run 
10-20 times faster than the one using the standard multiplication. Recent work includes the 
hardware multiplier designs [MOPV04] and the unified multiplier architecture for both RSA 
and ECC [oBPV03] cryptographic designs. 
c(x) = a(x) x b(x) mod n(x) 
2.6.1 Montgomery modular multiplication 
Current modular multiplication approaches are mostly based on the Montgomery algorithm [Mon85]. 
The simpler combinational logic used in this design has shortened the critical path and thus ac-
celerates the calculation. An efficient hardware implementation has been presented in [EW93]. 
The basic idea of Montgomery's algorithm is to multiply two integers modulo M (A x B mod 
M) without the use of division of M. We first generate a Montgomery constant c in order 
to transform the integers in m-residues and compute the multiplication with these m-residues. 
Finally, we transform this result back to the normal representation as shown in Fig. 2.6. The 
Montgomery algorithm computes 
Mont(A, B) = A • B • r-1  mod M 
where A, B < M, r such that GCD(M,r) = 1 and r-1  is the inverse of r modulo M, i.e. r-1•r = 1 
(mod M). Given two m-residues A and B where A = Xr mod M and B = Xr mod M, the 
Montgomery product is defined in m-residue format: C such that 
C-,--A•B•r-1  mod M 
and C is the m-residue of the product C = A • B (mod M). The overview of the Montgomery 
Modular Multiplication (MMM) is presented below. Note that the C value can be computed 
in both the integer domain and faster in the Montgomery domain. In fact, MMM can be 
implemented in many different ways [cKAK96]. 
Integer Domain Montgomery Domain 
Modular 
Multiplication 
\.,..
..,..... C = AB mod M 
r2 mod M 
(pre-computed) 
2.6. Montgomery arithmetic 	 29 
Figure 2.6: Montgomery modular multiplication. 
Radix-2 Montgomery multiplication algorithm: C = A * B (mod M) 
1. C = 0 
2. FOR i = 0 to N - 1 (N = number of bits in A) 
3. C = C + (ai * B) 
4. C = C + (ao * M) 
5. C = C / 2 
6. END FOR 
7. IF C >= M THEN C = C - M 
8. END 
Note that modular multiplication is used in modular exponentiation since it is beneficial if we 
compute a series of multiplications in the transformed domain, i.e. the Montgomery space. 
The pseudo code of the hardware description of this Montgomery multiplier is shown in the 
Chapter 4. 
30 	 Chapter 2. Background and Related Work 
2.6.2 Montgomery modular exponentiation 
The modular exponentiation is a series of multiply and squaring operations where the multi-
plication and squaring are all modulus by a given number. Fast exponentiation is important 
for RSA [cK95] and many cryptosystems. A very detailed survey is presented in [Gor98] on 
many existing exponentiation methods. In this section, we first present two well-known binary 
methods called LR Binary method, and RL Binary method. 
Input: M, e, n, Output: C = Me mod n, h = bitwidth of e 
LR Binary method 
1. IF e1i _1 = 1 THEN C = M ELSE C = 1 
2. FOR i = h - 2 to 0 
3. C= CxC (mod n) 
4. if e, = 1 THEN C = CxM (mod n) 
5. RETURN C 
RL Binary method 
1. C = 1; P=M 
2. FOR i = 0 to h - 2 
3. IF e, = 1 THEN C = CxP (mod n) 
4. P = PxP (mod n) 
5. if eh_1 = 1 THEN C = CxP (mod n) 
6. RETURN C 
2.6.3 High-radix Montgomery algorithm 
There has been a number of publications on high-radix Modular operations [Kor93, Tod00]. 
The key feature of high-radix (Radix-2k) Montgomery algorithm is the reduction of the number 
2.6. Montgomery arithmetic 	 31 
of iterations. Some existing architectures such as [Tak92] and [HWOO] have presented Radix-4 
architecture for RSA cryptosystems. In [TTcK01], Tenca et al. presented a scalable high-
radix (radix-8) modular multiplier. However, experimental results show the Radix-8 design 
needs special retiming technique for achieving better performance than Radix-2 design. As 
shown in the following R2k MM algorithm, Booth() refers to Booth encoding which encodes 
the bits according to an expression and sign ext. refers to sign extension. In this thesis, the 
implementation of high-radix Montgomery algorithm is not applied to our primality test, as a 
future work, we can investigate the trade-off between area and performance. 
High-Radix (Radix-2k ) Montgomery multiplication (R2k MM) algorithm 
Algorithm for computing S= X x Y mod M 
1. S = 0; x_1 = 0; 
2. FOR j= 0 to N- 1 STEP k 
3. qY3 = Booth(x3+k 	3-1) 
4. S = S + qyi x Y 
5. 013 = Sk _1..0 x  (2k — 	mod 2k 
6. S = sign ext. (SH-qM3 x M)/2k 
7. END FOR 
8. IF S>=M THEN S = S-M 
2.6.4 Scalable Montgomery algorithm 
The fundamental operation of the RSA public key cryptosystem is modular exponentiation, 
achieved by repeated modular multiplications. The Montgomery modular multiplication [Mon85] 
has been widely used to speed up the multiplication, squaring and exponentiation. There 
are many different extended algorithms [cKAK96] and software implementations based on the 
Montgomery algorithm, and the high-radix Montgomery modular exponentiation has been im- 
32 	 Chapter 2. Background and Related Work 
plemented in reconfigurable hardware [BP01] and optimised for specific reconfigurable archi-
tecture on RSA cryptosystems [DM02]. 
Input: X * Y (mod M), Output: S 
1. S = 0 
2. FOR i = 0 to n - 1 
3. (Ca, S(°)) = xi*y(o) + s(°) 
4. IF 	= 1 THEN 
5. (Cb,S(°)) = s(°) + s(°) 
6. FOR j = 1 to e 
7. (Ca,S(3) ) = Ca + xt *Y(3) + S (3) 
8. (Cb, S(3) ) = Cb + M(3) + S (3)  
9. SO-1) = (S i) ,S(21P1) 
10. END FOR 
11. ELSE 
12. FOR j = 1 to e 
13. (Ca, S(3)) = Ca + xi*Y(3) + SO)  
14. S(3-1) = (S(03) ,S,,,(311 )0 
15. END FOR 
16. END IF 
17. s(e ) = 0 
18. END FOR 
Figure 2.7: The multiple word radix-2 Montgomery multiplication algorithm. 
The re-usability and scalability of the Montgomery multiplier have been investigated in [TcK03] 
such that the design is no longer restricted to a fixed precision. The input operands are 
partitioned into multiple words. Fig. 2.7 shows the algorithmic description of the Radix-2 
Montgomery multiplication. The operand Y (multiplicand) is scanned word-by-word and the 
operand X (multiplier) is scanned bitwise. The loop index n is the bit width of the prime 
number and e is the number of partitions in the operand Y. 
The scalable design of Montgomery multiplication has been implemented as a coprocessor 
in reconfigurable RSA System-on-Chip building block in [FD01]. Furthermore, a high-radix 
design of scalable modular multiplier has also been discussed in [TTcK01]. We observe that 
the scalable design is particularly useful for very large precision such as the prime test. 
2.7. Function evaluation 	 33 
2.7 Function evaluation 
2.7.1 Function evaluation using embedded processor 
We consider an elementary function f(x), where x and f(x) have a given range [a, b]. The 
evaluation of f (x) usually consists of three steps [Mu197]: (1) range reduction [BDK+05] which 
reduces x over the interval [a, b] to a more convenient x' over a smaller interval [a', b'], (2) func-
tion approximation on the reduced interval, and (3) range reconstruction which expands the 
final result back to the original result range. 
Function evaluation for both hardware designs [SE94] and software designs [KZ90] has been 
well presented in the literature. Lee et al. [LAML04] explore the effects of using different input 
ranges and precisions on FPGA area and speed. Mencer et al. [ML04] study pipelined function 
evaluation using table lookup, CORDIC , rational and polynomial approximations. Other 
efforts include using small multipliers [ETMT00], IBM RISC System/6000 [Mar90], AMD K5 
processor [LAS95], Intel IA-64 [HKST99] in recent years. 
Paul et al. [PW76] discuss if dedicated instruction sets for elementary function evaluation 
should be incorporated into computer systems. On the other hand, much research has been 
focused on accuracy, such as multiple-precision for function evaluation [Bre76], IEEE floating-
point standard mathematical library [GB91], quad-double precision [HLB01.] and quadruple 
precision [ASO3] arithmetic. 
Recent work [IT03] develops a math library for the Intel XScaleTm architecture that has an 
efficient software implementation of the basic operations and math library routines. More-
over, code optimization for Digital Signal Processors (DSP) is also important. Another recent 
work [Nes01] shows that optimized C functions can achieve up to 300% performance gain. In 
this thesis, we achieve performance improvements by using the generated Integer Mathemat-
ical Library, IMGen, and an integer processor without any additional hardware cost. Our 
proposed method is easily applicable to any embedded integer processor such as the Xilinx 
MicroBlaze [Mic06], ARM processor in Altera Excalibur [Alt02] and DSPs. 
34 	 Chapter 2. Background and Related Work 
2.7.2 Random number generator 
Previous work on random number generation can be divided into two types: the generation of 
uniform random numbers (URNG), and the generation of non-uniform random numbers such 
as the Gaussian random number generation (GRNG). Random numbers can be generated using 
physical noise or jitter in an analog or mixed-signal circuit which is called a true random number 
generator (TRNG) [TLL03] that can produce genuine random numbers. Pseudo-random num-
ber generators (PRNG) use purely digital components, providing a seemingly random stream 
of numbers. However PRNGs are deterministic allowing the stream to be restarted from any 
previous point. 
Existing URNGs include the linear feedback shift register (LFSR) [Xi196], combined Tausworthe 
approach [Tau65, L'E96], and the Thomas method [TL05]. The major design concern is to 
ensure that the generated random numbers are uniformly distributed in the given range and that 
the stream is as statistically random as possible. Scalability of the RNG is also an important 
factor. For example, the Thomas design allows users to include additional FPGA configurable 
block to generate additional random bits per cycle. 
There exist many different approaches [Knu97] for generating non-uniform random numbers, 
such as the Box-Muller [bM58] and the Wallace [Wa196] methods which are based a series of 
number transformations, the Ziggurat method [MT00] which is a type of rejection-acceptance 
method, and the inversion method [HL03] which is well-known for its highly non-linear func-
tion approximation. Although the Wallace method does not involve function evaluation, non-
random correlations can be found between successive generated samples. The Box-Muller 
method transforms two uniformly distributed variables into two normally distributed variables, 
but these two generated normal random numbers also suffer a certain degree of correlation. For 
the Ziggurat method [ZLL+05], we cannot control the output rate since this is not a constant 
rate and makes it less suitable for hardware simulation. 
Recent research contributions on the hardware GRNGs include the work proposed by Boutil-
lon et al. [BDG03], Xilinx Intellectual Property (IP) core [Xi102], Fung et al. [FLP+04], and 
2.7. Function evaluation 	 35 
the work proposed by Lee et al. [LLVC04, LVLL06]. In [BDG03], the first hardware design 
using the Box-Muller is described and the central limit theorem is used to overcome the func-
tion approximation error. Their design is implemented on an Altera Flex 10K1000EQC240-1 
FPGA and offers 24.5 million samples per second. The Xilinx IP core running on a Virtex-II 
XC2V1000-6 FPGA is able to produce 245 million samples per second, while the ASIC design 
proposed by [FLP+04] using a six layer metal 0.18 µm process generates 182 million samples 
every second. In [LLVC04, LVLL06], these two designs focus on efficient GRNG based on the 
Box-Muller method. The major differences between them are, 1) in [LLVC04], the central limit 
theorem is used to reduce the approximation and quantization errors, resulting in one sample 
per clock cycle, 2) the maximum achievable a value (i.e. the standard deviation, a higher a 
value is required to represent a larger sample population) of the generated samples was 6.7, 
while in [LVLL06], the two generated samples every clock cycle are guaranteed to be accurate to 
one ulp and are able to reach 8.2a for a population of 1015 samples. On a Virtex-II XC2V4000-6 
FPGA, [LLVC04] is able to generate 133 millions of samples per second, and [LVLL06] is able 
to generate 466 samples per second. Alimohammad et al. [ACS05] have implemented a Box-
Muller based design on a Xilinx Virtex-II XC2V4000-6 FPGA. Their design has a throughput 
of 132 million Gaussian random samples per second up to 6.55u. A comparison of hardware 
designs using the Wallace and Ziggurat methods can be found in [LVLL06]. 
Hardware designs using the CDF inversion technique have previously been implemented by 
McCollum et al. [MLBP03] and Chen et al. [CMB04]. In [MLBP03], their design requires 
the use of from using a large onboard 262 KB SRAM on a Virtex XCV300 FPGA for the 
table lookup of the cumulative distribution function. In [CMB04] a large pre-calculated CDF 
inversion table using the on-chip memory is also required to transform uniform random numbers 
into non-uniform pseudo-random numbers. In contrast, in this thesis we describe an architecture 
using the hierarchical segmentation method [LLVC03] to dramatically reduce the lookup table 
size, and using the MiniBit [LAMLO5a] method to minimize the signal bit-widths, with the aim 
of generating a compact and efficient random number generator. We implement the flexible 
architecture using a Virtex-II device to compare with previous work, and choose a Virtex-
4 FPGA to obtain the best performance. The Virtex-4 FPGA has three major resources, 
36 	 Chapter 2. Background and Related Work 
1) configurable blocks known as slices which have two four-input look-up tables, embedded 
multiplexers, carry logic and two registers, 2) DSP-blocks which can perform an 18-bit by 18-
bit multiplication followed by a 48-bit addition, and 3) block-RAMs which can store a maximum 
of 18 Kb of data, using specific data bits and memory depths. 
2.8 Summary 
This chapter presents the background material and the related work of this thesis. Most ideas 
in this thesis are inspired by the related work described in this chapter. We also present some of 
the current design challenges and possible improvements of the previous work. In the following 
chapters, we present our customisable design generators for various arithmetic hardware designs. 
At first, we have developed two design generators (ECCGen and PrirneGen) for cryptographic 
hardware designs. Next another two design generators for function evaluation designs for 
embedded processors (IMGen) and random number generators (RNGGen) are presented. 
Chapter 3 
Customisable Elliptic Curve 
Cryptosystems 
This chapter presents a method for producing hardware designs for Elliptic Curve Cryptography 
(ECC) systems over the finite field GF(2m), using the optimal normal basis for the represen-
tation of numbers. Our field multiplier design is based on a parallel architecture containing 
multiple m-bit serial multipliers; by changing the number of such serial multipliers, designers 
can obtain implementations with different trade-offs in speed, size and level of security. A 
design generator, ECCGen has been developed which can automatically produce a customised 
ECC hardware design that meets user-defined requirements. To facilitate performance char-
acterisation, we have developed a parametric model for estimating the number of cycles for 
our generic ECC architecture. The resulting hardware implementations are among the fastest 
reported: for a key size of 270 bits, a point multiplication in a Xilinx Virtex-II XC2V6000 
FPGA at 35MHz can run over 1,000 times faster than a software implementation on a Xeon 
computer at 2.6GHz. 
The rest of the chapter is organised as follows. Section 3.1 introduces ECC cryptosystems 
and our approach. Section 3.2 focuses on the proposed parallel field multiplier. Section 3.3 
presents the architecture of our ECC cryptosystem and its components. Section 3.4 provides 
the mapping from the architecture to reconfigurable hardware, and the parametric model for 
37 
38 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
estimating design trade-offs. Section 3.5 addresses design automation and customisation by a 
design generator. Section 3.6 evaluates our results and compares them against existing hardware 
ECC implementations. Section 3.7 gives the extension work on reconfigurable ECC designs. 
Finally, Section 3.8 summarises our approach. 
3.1 Introduction 
Elliptic curve cryptography (ECC) is a public key cryptography system superior to the well-
known RSA cryptography: for the same key size, it gives a higher security level than RSA [BoPV03]. 
ECC has been adopted in a wide spectrum of applications from digital certificates in web server 
authentication [NGCEG03] to embedded processors [WGP03] in wearable devices. Section 2.4 
includes background material on ECC. 
This chapter presents a scalable method for producing hardware designs for ECC systems 
over binary field GF(2711) [LN94], using the Optimal Normal Basis (ONB) [MOVW88] for the 
representation of numbers. Our design is based on a field multiplier architecture with multiple 
m — bit serial multipliers operating in parallel. An unique feature of our approach is a design 
generator which can automatically produce a customised ECC hardware design that satisfies 
specific user-defined requirements targeting to different applications. This method enables 
designers to rapidly explore and implement a design with the best trade-offs in speed, size and 
level of security. 
When optimised for speed, our design generator produces ECC designs with extensive paralleli-
sation and pipelining. These designs do not involve instructions, to avoid overhead associated 
with instruction fetch and decode. Our architecture is generic: for instance, a user-defined 
parameter controls the amount of parallelism in evaluating field multiplication. Once this and 
other parameters are decided, various design-specific constants, wiring patterns and data widths 
are generated automatically. 
To facilitate performance characterisation, we have developed formuhe for estimating the num-
ber of cycles for our generic ECC architecture. These formulae are expressed in terms of various 
3.2. Parallelised field multiplier 	 39 
customisation parameters, such as the key size, the amount of parallelism in function evaluation, 
and the number of cycles for basic operations like point addition and point multiplication. 
As an example of our implemented designs, for a key size of 270 bits, a point multiplication, 
which is the slowest operation in the ECC method, can be computed in 0.17ms with our 
hardware design implemented in an XC2V6000 Field-Programmable Gate Array (FPGA) at 
35MHz. In contrast, a software implementation requires 196.71 ms on a Xeon computer at 
2.6GHz; so our FPGA design is more than 1150 times faster, while its clock speed is almost 74 
times slower than the Xeon processors. 
To summarise, our major achievements include: 
• a parametric parallel and pipelined design for field multiplication operation (Section 3.2). 
• an optimised point multiplication using parallel field multipliers (Section 3.3). 
• a parametric model for calculating the number of cycles for our generic ECC system and 
for estimating the security/size/speed trade-offs (Section 3.4). 
• a design generator that takes the key size and degree of parallelism of the design and 
generate efficient "hardcore" control and data path (Section 3.5). 
3.2 	Parallelised field multiplier 
One of the contributions of this chapter is the design of a parallel field multiplier for normal 
basis. This multiplier is used in many different parts of our customisable cryptosystem which 
is described in the next section. 
Our field multiplier architecture is based on having p copies of an m-bit multiplier operating 
in parallel. Fig. 3.1 shows the datapath of our field multiplier, which is inspired by a non-
parametrised architecture [M086]. The wiring block is automatically generated for different m 
40 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
m-bit input register (a) 	 m-bit input register (b) 
:—-4-2 ao 	 b2 
Ain  
Generated wiring block 
(input: 2m bits), (output: 3mp bits) 
1 	 p copies 
L 
	
L 
	
L 
	
L 
Combinational Logic 
I 	II 
AND & AND & AND & AND & L 
XOR XOR XOR XOR 
Tree Tree Tree Tree 
 	m-bit output register 
	 I-4-! (accumulator) 
• 
Figure 3.1: Parametrised, pipelined and parallel field multiplier design. 
values, an example of wiring pattern is shown in Fig. 3.2 in which the degree of parallelism p 
is one. The details of design generation is described in Section 3.5. 
a4 a a2 al  ao 	b4 b b2 b1 bo 
------------ 	 ----- 
	 -------------- 
Figure 3.2: Example of a wiring block when m = 5 and p = 1. 
The field multiplication operation is presented in pseudo-code as follows with the following 
properties: 
• parallelfdo (a) ; do (b) } means that both operations do (a) and do (b) are executed in 
parallel. 
• In a given clock cycle, the value of a field element x is the one it had at the previous clock 
cycle. This property results that the program parallel{a=b ; b=a} swaps the values of 
3.2. Parallelised field multiplier 	 41 
the field element a and b because the following two operations perform simultaneously at 
time t: a(t) = b(t-1) and b(t) = a(t-1). 
• xk denotes the kth  bit of the field element x. 
We first describe our serial field multiplier in the pseudo-code format, given that the wiring 
pattern has been first generated. 
1 fieldinult (a,b) 
{ 
2 	for i = 0 ... m-1 
{ 
//compute the wiring pattern 
3 	inputs = wiring(a,b); 
//compute the outputs of the L functions 
4 	parallel for k = 0 ... m-1 
{ 
5 	temp_k <- L(inputs); 
} 
//left rotate all registers 
{ 
6 	a <- left_rotate(a); 
7 	 b <- left_rotate(b); 
8 	 c <- left_rotate(c); 
} 
//XOR with the output register 
9 	c <- c XOR temp; 
} 
//final rotation: every c_k is at the right position 
10 c <- left_rotate(c); 
42 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
11 return c; 
} 
The algorithm shows that such a computation takes 4m + 1 cycles, as we can see that steps 3, 
5, {6-8} and 9 are repeated for m times, plus one final rotation in step 10. We pipeline this 
field multiplier since the computation of the L functions, the wiring pattern operations and the 
input/output rotations can be performed concurrently in different pipeline stages. 
The time complexity of the original m-bit serial algorithm is 0(m). The key element of our 
design is to parallelise it to achieve the complexity of 0(m/p) where p is the degree of paral-
lelism. Thus, the pipelined field multiplier is mapped into a parallel design such that for each 
k value, instead of computing only L(k, a, b) at one specific cycle, it will also compute 
L(k, lrotate(1, a), lrotate(1, b)), 
L(k, lrotate(2, a), lrotate(2, b)), • • • , 
L(k, lrotate(p — 1, a), lrotate(p — 1, b)), 
where lrotate(n, x) returns x rotated left by n bits. 
In order to compute these functions, a new wiring pattern which outputs p layers of m groups 
of 3 values is developed. These layers are derived from the wiring pattern that we generate for 
the serial multiplier as shown in Fig. 3.1. If a layer (number 0) outputs for some k (for the 
kth group of 3 values) the values aq , ar and bt, then the i th layer will output the values aq _i , 
ar _i and bt_i at position k — i. Besides the wiring modification, in each step the registers are 
rotated by p bits instead of only one bit that is used in the serial design. Next, we define 
m = 1.1 X p v 
where p is the quotient and v is the remainder, when m is not divisible by p. If v 0, an extra 
step that calculates the last v elements of each sum is added. 
3.2. Parallelised field multiplier 	 43 
The parallel and pipelined algorithm is shown below. Note that if m = p, this design only takes 
(1 + m/p + 1 + 1 = 4) cycles to compute the result. 
1 fieldmult_par(a,b,p) 
{ 
2 parallel 
{ 
3 	inputs = wiring(a, b); //inputs is a m*p*3 table 
4 	a <- left_rotate(a, p); //left rotate by p bits 
5 	b <- left_rotate(b, p); 
6 	clear(output[0] 	output[p-1]); 
} 
7 	for i = 0 ... µ-1 
{ 
8 	parallel 
{ 
//pipeline stage 1: compute wiring 
9 	inputs = wiring(a, b); 
//pipeline stage 2: compute L 
10 	parallel for j = 0 	p 
{ 
11 	parallel for k = 0 ... m-1 
{ 
12 	output[j]k <- L(wiring_inputs); 
} 
} 
//pipeline stage 3: rotating 
13 	a <- left_rotate(a, p); 
14 	b <- left_rotate(b, p); 
15 	c <- left_rotate(c, p) XOR 
44 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
output[0] XOR ... XOR output[p-1]; 
} 
} 
16 c <- left_rotate(c, p) XOR 
output[0] XOR ... XOR output[p-1]; 
17 if(v!=0) 
{ 
18 	parallel 
{ 
19 	a <- left_rotate(a, p); //realign a 
20 	b <- left_rotate(b, p); //realign b 
21 	c <- left_rotate(c, p); 
22 	parallel for j = 0..v 
{ 
23 	parallel for k = 0..m-1 
{ 
24 	output[j]k <- L(wiring_inputs); 
} 
} 
} 
25 	c <- c XOR output[0] XOR ... XOR output[v-1]; 
26 	c <- left_rotate(c, p); //final rotation 
I.  
27 else //only do the final rotation 
{ 
28 	c <- left_rotate(c, p) 
} 
29 return c; 
3.3. 	System architecture 	 45 
ECC operations ECC protocols 
A 
Point operations Point multiplication Point addition 
1. N 
Auxiliary operation Data embedding 
w- 	A 
Field operations Field multiplication Field inversion Field squaring 
Figure 3.3: Interactions between different operations. 
} 
The number of cycles is reduced if a larger p value is used. A field multiply operation takes 
Lin/Pi + 5 cycles if r 0, and (m/p) + 3 cycles otherwise. The two remainder cycles are from 
step 18 —25 in the above pseudo-code. In contrast, the non-pipelined serial version takes 4m+ 1 
cycles. 
3.3 System architecture 
This section describes the operations supported by our ECC architecture. These operations 
include field multiplication, field inversion and point multiplication, and their interaction is 
summarised in Fig. 3.3. In the following, m is the key size for our ECC architecture, which is 
a characteristic of the field. The subsections below cover field multiplication, field inversion, 
field squaring, point multiplication, point addition and data embedding respectively. 
3.3.1 Field multiplication 
In the proposed ECC customisable cryptosystem, the field multiplier presented in Section 3.2 
is the crucial component that is repeatedly used by other operations as shown in Fig. 3.3. The 
datapath of our system for various operations is shown in Fig. 3.4. Users are able to select the 
1FM Design 
FM i 
2FM Design 
FM; 
FM 
4FM Design 
PC 
Input 
Register 
Input 
Register 
Data Embedding 
Point Multiplier 
FPGA 
Dataln 
DataOut Output 
Register 
Field 
Inversion 
Field 
Multiplier 
rn/p + 3 cycle 
Status 
Control 	Register 
:I Control FM 	FM  
FM 	FM 
46 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
second level parallelism in term of the number of executable field multipliers to the maximum 
4 parallel field multipliers. Different scheduling optimisations have been applied to various 
designs which are summarised in the parametric model as shown in Section 3.4. Additional 
field multiplier does not bring further speed gain but area penalty. There is a trade-off between 
efficiency and area. 
Figure 3.4: Datapath of the customisable ECC system. 
3.3.2 Field inversion 
The algorithm we used for field inversion is based on the Fermat's theorem which states that 
in a normal basis: 
a-1  = a2m-2 = (a(2m-1-1) )2 
for all a 0 in GF2m. Using this formula to compute an inverse, i.e. a • a-1  = 1, it would require 
m multiplications. The following method was proposed by Itoh and Tsujii [IT98] and has been 
widely used for computing inversion. The following reduces the complexity of the inversion, as 
it is easy to calculate x2(m 1)/2+1 where x is shifted (m — 1)/2 times and then multiplied by x. 
m — 1 is even: 
a(2m-1-1) 	a(2(m-1)/2+1)-(2(m-0/2-1) 
= 	(a2
(m-0/2_1 )2(.1)/2+1 
m — 1 is odd: 
3.3. System architecture 	 47 
a ( 2rn- 1 _ 1) = a (2 ( (m - 2 ) / 2 _ 1) 2 ( m - 2 ) / 2 + 1) + 1 ) 
= ((a2(m-2)/2_1)2(--2)/2+1)2  • a 
The algorithm is described below: 
Input: a E GF(2m) to be inverted 
Output: x = 
• x i-- a ; s 	log2(m) —1 
• while s > 0 
— r 4— right shift m by s bits 
— y 4— left shift x by [r/2i 
— y 4— multiply x by y 
— if x is odd 
y 	left shift y by 1 bit 
y <— multiply x by y 
— x 	y 
— s 	s — 1 
• x i— left shift x by 1 bit 
• return x 
Since the time to perform a multiplication is usually much longer than the time to perform 
a shift operation, the number of cycles for this inversion algorithm can be approximated by 
the following equation [Ros99] where "'mutt  represents the number of cycles to perform a field 
multiplication: 
(log2(m — 1) + number_of_bits_set_in(m — 1) — 1) x Tmuu 
48 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
3.3.3 Field squaring and field addition 
The key benefit of using ONB representation is the simplicity of the squaring operation. Each 
element in the field GF(2m) is represented by m binary digits. Field squaring is simply a cyclic 
shift [WTS-1-85] while field addition is a Boolean XOR operation. Therefore these two functions 
take up little space and are efficient in hardware. 
3.3.4 Point multiplication 
This part presents the point operations based on the following elliptic curve: 
y2 + xy = x3 + a2x2 a6 
A point is described by its x and y value in the above equation. In our point multiplication, 
we use projective coordinates such that each point on the curve is described by (X, Y, Z) on 
the following curve: 
Y 2Z XYZ = X 3 -1- a2X2Z2 + a6Z3  
Instead of using the "add-and-double" (binary) method, we adopt the improved Montgomery 
Scalar Multiplication [LD99] in our design for point multiplication. Let P, P1 and P2 be three 
points of an elliptic curve E, and assume that the property P2 - P1 = P holds. Let the (affine) 
x-coordinate of 13, (i E {1, 2}) be Xi /Z, where X, and Zi are the first and third projective 
coordinate of P. 
As shown in Fig. 3.5, the y-coordinate of any of the three points is not required in computing 
the x-coordinate (X/Z) of P1 + P2 or of 2Pi. Hence, in the main loop of the algorithm, we can 
get rid of these y-coordinates and use the above formula to compute the x-coordinate of point 
doubling, 2P, and point addition, P1 + P2 at each step. When the main computation is over, 
3.3. System architecture 	 49 
	
Input: k E GF(27n) (k 	(k1-1, • • • 	ko)), P(x, y) a 
point of a certain curve E defined by a2 and a6 
Output: Q = k • P 
• X1 +— x; Z1  4— 1; X2 	x4 + a6; Z2 <— X2 
• for i from 1 — 1 downto 0 do 
if ki = 1 then 
(Zi, Xi ) 	Madd(Xi , Zl ,  X2, Z2); 
(Z2, X2) <— Md0Uble(X2 , Z2) 
else 
(Z2 , X2 ) 4— Madd(X2, Z2, X11 Z1): 
(Z1, X1) <— Mdouble(Xi , Z1) 
• return Q = Mxy(X1, Z1 , X2, Z2) 
where: 
• Madd(Xi , Z1, X2, Z2) returns (Z3, X3) where 
Z3  = (Xi • Z2 + X2 • Z1)2  
X3 -= X • Z3 + (Xi. • Z2) • (X2 • Z1) 
• Mdouble(Xi , Z1 ) returns (Z3, X3) where 
Z3 = Zi2 • Xi2  
X3 = Xi4 + a6 • Zi4 
• illxy(Xi , Z1, X2, Z2) returns Q(xi , yi ) where 
x1 = X1/Z1  
Yi = (xi +x) • ((xi + x) • (x2 +x)+ x2 + y)/x +y 
Figure 3.5: The Montgomery algorithm for point multiplication. 
we can get the y-coordinate of the result point by using the following formula which is proved 
in [LD99]: 
yl = (x1 + x) • ((x1 x) • ((x2 x) x2  + y)/x) 
The Montgomery algorithm for point multiplication we used is shown in Fig. 3.5. 
This algorithm requires 61 + 10 multiplications and only one inversion, where 1 is the number 
of bits needed to represent k in base 2. The value of 1 is usually close to the value of m. The 
algorithm used for point multiplication is mainly sequential, since each step of the loop needs 
Using one field multiplier 
T 2l  x2 
Z12 
Begin \jh  
Mdoubfe 
Begin Madd 
Xi =X,. Z2 
X = X + 
MHO 4inel° 
Z12 
CIDITie  
X2 	Some temporary variables 
are not shown in here. 
GAD 
Time 
Using four field multipliers 
Field mult 
Field add 
Field 
square 
50 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
the results of the previous step to start. 
Figure 3.6: Applying multiple parallel field multipliers in point multiplication. 
Since the two functions Madd and Mdouble can be computed independently in parallel as 
illustrated in Fig. 3.6, the number of cycles for point multiplication can be reduced. The reason 
is that the Madd function requires 4 field multiplications, and the Mdouble function requires 
2 field multiplications; the sequential execution of Madd and Mdouble takes the time of 6 field 
multiplications, while a design using 4 parallel field multipliers will significantly reduce the time 
to that for one field multiplier. Due to data dependency and area overhead, there is no gain in 
using more than 4 field multipliers. This improvement is significant, since each step in the loop 
is executed m times, where m is typically between 100 and 500 for secure systems [oCIoST00]. 
This flexible point multiplication and the underlying field multiplication can be made large 
or small, fast or slow depending on the performance required. Since other functions involve 
extensive use of the field multiplier, an optimised design is crucial and has a large impact on 
the final ECC system. 
3.3. System architecture 	 51 
3.3.5 Point addition and point subtraction 
Although point addition is not as common as point multiplication, some security protocols 
require both. The algorithm used to compute this operation is simple. Adding P(xi , yi ) and 
Q(x2, Y2) (P Q) gives R(x3 , y3) where: 
= (y2 Y1)/(X2 xl) 
13 = 02 +0+x1+x2+a2 
Y3 = 0(xi + x3) — yl 
The details of the algorithm can be found in [Ros99]. Our system will also support point 
subtraction. To compute subtraction, we develop an algorithm that can invert a point. The 
negation of a point P(x, y) is —P(x, x+y). The addition algorithm requires 1 field inversion and 
2 field multiplications. The point inversion algorithm takes negligible time, as it only requires 
an XOR operation. 
3.3.6 Data embedding 
Data embedding embeds data onto a point of an elliptic curve. Not all elements of GF(2m) are 
the x-coordinate of a point of a given elliptic curve, and the ECC technique only allows the 
encryption of a point on an elliptic curve. It is therefore necessary to embed data into a point 
in order to encrypt them. 
It has been shown[Ros99] that, given a specific elliptic curve, if 5 don't care bits are appended 
to m — 5 bits of data and forms data embed_data, there always exists at least one combination 
of the don't care bits such that the value x of embed_data stays on the curve. The outline of 
the embedding algorithm is as follows: 
52 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
Input: d data written in base 2 (the data must be of 
length m, — 5 bits) 
Output: M(x, y) point in which the data to be en-
crypted are stored 
• x 	append(d, 000002) 
• while not on_curve(x) 
increment x 
• compute_y(x) 
• return M(x, y) 
where on_curve(x) checks if x E GF(2m ) is the x-coordinate of a point of the curve that we are 
working on. compute_y(x) returns the y-coordinate of one of the two points whose x-coordinate 
is x. The equation of the curve is: 
y2 + xy = x3 + a2x2 + a6 y2 + xy + f(x) = 0 
where f (x) = —(x3 + a2 x2 + a6). Let y = zx. Given x 0, the equation becomes: 
(xz)2 x2z + f (x) =0 z2 + z + c = 0 
where c = f (x) • x-2 . The function on_curve(x) is described below: 
3.3. System architecture 	 53 
Input: x E GF(2m) bits 
Output: true if x is the x-coordinate of a point, false 
otherwise 
• c 	(x3 + a2x2 + a6) • x-2  
• trace <— XOR of all bits in c 
• if trace =1 then 
return false 
• else 
return true 
We compute —f(x) • x-2 instead of f(x) • x-2 since for all u E GF(2m), —u is defined by 
u + (—u) = 0, that is u XOR (—u) = 0 and —u = u. To compute the y-coordinate given the 
x-coordinate, We can rewrite the equation, that we are working on as (i E [0, m — 1]): 
1/2 	1/2 Z = Z 	C 	Zi = Zi_1 Ci-1 
Moreover, if z is a solution to our equation, then the other solution is z + 1. It is easy to prove 
this by assuming z2 + z + c= 0 and by calculating 
(z +1)2 +(z+1)+c = z2 +2z+1+z+1+c 
z2 + z + c + 2(z + 1) 
= z2 +z+c=0 
In this proof, an addition is just an XOR operation in GF(2m), then for all u E GF(2m ), 2u = 0. 
Since z + 1 is actually z in a normal basis, in one of the two solutions the least significant bit 
will be 0 and the other one will be 1. As a result, the least significant bit of the solution that 
we are looking for is equal to 0. We then further compute all the other bits one by one. To 
compute the y value, we simply multiply z by x. 
54 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
The algorithm that performs the operation compute_y(x) is described below: 
Input: x E CF(2m) (and c calculated in function 
on_curve(x)) 
Output: M(x, y) a point of the curve we are working 
on 
• Zo 4— 0 
• for i from 1 to m — 1 
zi_1 + 
• y4—x• Z 
• return y 
3.4 FPGA implementation 
This section presents the implementation of our design to produce a customisable encryp-
tion/decryption system. We have implemented our designs in the Handel-C language [Ce103]. 
The key components in our design are: field operations (multiplication, inversion, squaring, 
addition), point operations (multiplication, addition), and data embedding. Fig. 3.7 provides 
an overview of our system. 
Functions are implemented as shared logic. They will only be mapped once on the hardware. 
We then generate the routing logic and the control logic to send the appropriate data to these 
functions and fetch the result when necessary. The system performance has been optimised 
by exploring the maximum possible parallelism between operations at both the field-operation 
level and the point-operation level (Fig. 3.3). We also notice that because these functions called 
each other many times and have to send large values (usually two or three m bits values) to each 
other every time, the design's speed is limited by parameter passing. To tackle this problem, 
all function calls are pipelined. 
3.4. FPGA implementation 	 55 
                               
   
System 
Specification 
Field size: m 
parallelism: p 
      
ECC System 
Code Generator 
     
- — - • — • - — • — • 
	 •
Optional 
/ Elliptic Curve: E 
Point: P(x, y) 
L.. 
              
                        
                               
         
Synthesizable Handel C / VHDL designs 
       
      
Celoxica RC2000 Board (Xilinx XC2V6000 device) 
      
                               
     
Host Computer: Transmit data and control to the FPGA device 
   
                               
                               
  
Key 
Generation 
  
Encryption 
 
/ data / 
  
Decryption 
  
Embedding 
 
                         
Figure 3.7: Overview of the customisable ECC system. 
The estimation of the number of clock cycles required by each function is presented in Table 3.1. 
Notations for this table include: Tmuit,  Ti„ and Tpotrit _add which represent the number of cycles 
for field multiplication, field inversion and point addition functions respectively, and s(x) and 
ns(x) represent the number of set bits and of clear bits in the binary representation of x. The 
nb_atterripts in the data embedding formula represents the number of times the data need to 
be incremented to find the x-coordinate of a point. 
Operation Number of cycles 
Field operations 
Addition 
Squaring 
Multiplication 
Inversion 
1 
1 
5 + Lm/pj 	or 	3 + (m/p) if m is divisible by p 
2 + (ilog2(m)l— 2) x (3 + Trnult) + 3(111 — 1) x (3 + Tmutt) + ns(rn — 1) 
Point operations 
Addition 
Subtraction 
12 + 2Trnult + Tiny 
1 + Tpoint_add 
Point multiplications 
Unoptimised 
1 FM 
2 FM 
4 FM 
28 + 10Trnuit + Tinv + (m — flog2(k)1) + ([log2 (k)1 — 1) x (14 + 6Tmult) 
14 + 10Trnult + Tiny + (m — 11092(k)] ) + ([ 1 0 g2(101 — 1) x (8 + 6T.Lit) 
10 + 6T.ti + Tiny + (m — [log2(k)1) + (Flog2 (k)1 — 1) x (9 + 4Tmuit) 
9 + 5Trnult + Tiny + (m — flog2(k)l) + (rlog2(k)] — 1) x (4 + 2T.tt) 
Data embedding 6 + 3Tmuit + nb_attempts(8 + Tiny + Trnult) 
Table 3.1: Number of clock cycles required for each function execution. 
* FM means the number of Field Multiplier, Trnuit ,Tin„,Tpoint_add means the number of cycles for field 
multiplication, field inversion and point addition. 
56 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
3.5 Automatic generation and customisation 
This section presents a design generatorECCGen which can automatically produce implemen-
tations with optimised speed, size and level of security. The major customisable elements of a 
cryptosystem are: the key size in, the degree of parallelism p, and the protocols of the system. 
We develop a program that takes a valid ONB m-value (type I or type II optimal normal basis) 
and the degree of parallelism in the field multiplication function, and produces synthesizeable 
Handel-C code. This design generator first computes the A table for the given m using the 
algorithm presented in [Ros99]. This algorithm generates an m x 2 matrix (Lambda) where the 
jth row contains two values of i for which: 
if GF(2m) has an ONB of type I: 2i + 23 equals 1 or 0 in mod (in + 1), 
if GF(2m) has an ONB of type II: 2i ± 23 equals ±1 in mod (2m. + 1). 
When j = 0, there is only one value of i that satisfies the equations. From the A table, we 
generate the wiring pattern required by our field multiplication design. This wiring rearranges 
the m-bit inputs a and b into 3p m-bit variables: 
inputa1 [0] , • • • , inputal [p-1] , 
inputa2 [O] , • • • , inputa2 [p-1] , 
inputb [0] , • • • , inputb [p-1] , 
where: 
inputb [i] _k (the kth bit) is b_ (2k-i) , 
inputal [1] _k is a_ (Lambda [k, 0] +k-i) , 
inputa2 [i] _k is a_ (Lambda [k, +k-i). 
3.6. Evaluations 	 57 
This design generator computes all the constants that are used by various functions. In partic-
ular, it computes s = Lm/p] and r = m mod p used in field multiplication. It also calculates 
the size of those variables and various values, such as the constant 1 in GF(2m) which is an 
m-bit variable with all bits set. 
Our design generator ECCGen enables users to choose appropriate encryption protocols for their 
applications. For example, a system can easily implement an encryption/decryption protocol 
using a particular elliptic curve. Users can even store their private keys on the running FPGA. 
3.6 Evaluations 
In this section, we compare the performance of various software and hardware implementations 
for point multiplication, which is the bottleneck for ECC systems. We have implemented the 
software design [Ros99] on a machine running Intel Xeon 2.66GHz with 4GB of RAM. We also 
compare our design with existing FPGA ECC cryptosystems over CF(2771 ). The comparison 
for serial and parallel designs on different m and p values, where p refers to the degree of 
parallelisation, is presented in Table 3.2. Note that Place-and-Route (P&R) results mean 
the results that are obtained from the Celoxica DK3 and Xilinx ISE 6.2 tools, and measured 
results refer to the measured results from hardware realisation. Our hardware design has been 
implemented on an RC2000 board containing an XC2V6000 FPGA chip. We have also verified 
the correctness of each design using 300,000 consecutive point multiplication using data sent 
and received from the local PC. Note that from our results, for a given m, the change of p does 
not obviously change the critical path which shows that our designs are highly scalable and 
parametrisable. The "Speedup" column in Table 3.2, 3.3 and 3.6 shows the performance gain 
of our design over other methods. This gain is due to our highly parallel architecture which 
does not involve instruction fetch and decode. 
As shown in Fig. 3.8, the area requirements for various designs such as m = 113 and m = 162 
using one, two and four parallel field multipliers (1FM, 2FM and 4FM) show a linear growth 
of resource usage when the degree of parallelism p increases. The runtime for performing one 
ar
ea
  -
  s
lic
es
  u
se
d 
18000 
16000 
14000 
12000 
10000 
4000 
0 
8000 
6000 
10 	20 	30 	40 
degree of parallelism - p 
50 60 70 
m162-FM4 
m162-FM2 
- 
m113•FM2 --s - 
rni13-FM1 - 
0 
X 0 
• 
58 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
point multiplication is shown in Fig. 3.9 with respect to four different cases: m. is not divisible 
by p using 1FM, m is divisible by p using 1FM, 2FM and 4FM. In order to test the effect of 
using different curve and base points in the point multiplication, we have also randomly picked 
a number of test cases and verified our designs. 
Moreover, we compare our results with another hardware implementation [LL02] for different 
p-values as shown in Table 3.3. A larger speedup is achieved for the design with a smaller m 
value; we expect that this effect is due to the longer critical delay path in the design with a 
larger m value, which can be further optimised. 
Additionally, we compare our designs with previous work in terms of speed per unit area (the 
number of point multiplication per second divided by the number of hardware slices used) as 
shown in Table 3.4. It shows that our approach is more efficient than designs involving custom 
instruction processors [LL02], even when area is taken into account. 
Figure 3.8: Area usage for various p and m values. 
We also compare our design with other existing hardware cryptosystems. Table 3.6 shows the 
performance improvement by using our design (with the maximum implementable value for p) 
for the m-values that have been published. We adopt maximum parallelism since a large speed 
improvement can be obtained by a small area increase. As shown in Table 3.6, the timing values 
o. 
v, 	0.3 
0.2 
0.1 
0.6 
0.5 
0.4 
0.7 
0 
100 120 140 160 180 	200 
field size - m 
220 240 260 280 
1FM indivisible 
1FM divisible ---x--- 
2FM divisible 
4FM divisible 
3.6. Evaluations 	 59 
Figure 3.9: Field multiplication comparison for various m values 
* using one, two and four parallel field multipliers (divisible means m is divisible by p). 
P&R results Measured results Software [Ros99] Hardware (serial) [LL02] 
p clock 	ms clock 	ms clock 	ms 	speedup clock 	ms 	speedup 
m = 113 
2 42MHz 0.36 56MHz 0.27 2.6GHz 15.99 59.22 31MHz 4.3 15.93 
8 42MHz 0.13 56MHz 0.09 2.6GHz 15.99 177.67 31MHz 4.3 47.78 
16 42MHz 0.08 56MHz 0.06 2.6GHz 15.99 266.50 31MHz 4.3 71.67 
32 42MHz 0.07 56MHz 0.04 2.6GHz 15.99 399.75 31MHz 4.3 107.5 
56 42MHz 0.05 56MHz 0.04 2.6GHz 15.99 399.75 31MHz 4.3 107.5 
in = 162 
2 40MHz 0.83 54MHz 0.55 2.6GHz 45.67 83.04 29MHz* 9.39* 17.07 
8 40MHz 0.27 54MHz 0.17 2.6GHz 45.67 268.65 29MHz* 9.39*  55.24 
16 40MHz 0.18 54MHz 0.11 2.6GHz 45.67 415.18 29MHz* 9.39* 85.36 
32 40MHz 0.13 54MHz 0.07 2.6GHz 45.67 652.43 291\4Hz* 9.39* 134.14 
56 40MHz 0.10 54MHz 0.06 2.6GHz 45.67 761.17 29MHz* 9.39* 156.5 
m = 270 
2 24MHz 3.28 35MHz 2.24 2.6GHz 196.71 87.82 26MHz* 27.99* 12.50 
8 24MHz 0.92 35MHz 0.63 2.6011z 196.71 312.24 26MHz* 27.99* 44.43 
16 24MHz 0.53 35MHz 0.36 2.6GHz 196.71 546.42 26MHz* 27.99*  77.75 
32 24MHz 0.35 35MHz 0.24 2.6GHz 196.71 819.63 26MHz* 27.99*  116.63 
56 24MHz 0.25 35MHz 0.17 2.6GHz 196.71 1157.12 26MHz* 27.99*  164.65 
Table 3.2: Speed comparison with the reference designs [LL02, Ros99]. 
* The symbol (*) denotes extrapolated results based on published data for different m values. 
p 
Parallel [LL02] Our design Parallel [LL02] Our design 
time(ms) time(ms) 	speedup time(ms) time(ms) 	speedup 
m = 113 m = 473 
1 4.3 0.51 8.43 126.2 20.76 6.08 
2 2.6 0.27 9.63 69.2 10.51 6.58 
4 1.7 0.15 11.33 35.7 5.41 6.60 
8 1.06 0.09 11.78 19.1 2.85 6.70 
16 0.81 0.06 13.5 12.7 1.56 8.14 
32 0.79 0.04 19.75 - 0.91 - 
64 0.77 0.03 25.67 0.61 
Average 14.3 6.82 
Table 3.3: Comparison with the parallel reference designs [LL02]. 
60 	 Chapter 3. Customisable Elliptic Curve Cryptosysterns 
Designs m -= 113 
Leong (parallel) [LL02] 0.15 
FM1 0.77 
FM2 0.71 
FM4 0.70 
Table 3.4: Comparison of speed per slices used for different designs 
* number of point multiplication in a second, using one, two and four parallel field multipliers. 
for different FPGA architectures have been compared by using two ECC designs. It shows how 
the advance in process technology brings additional performance gain. For instance, the latest 
Virtex-4 architecture can even speedup our m113 design by 30%. The last two columns in 
Table 3.6 show the scaled timings and their speedup when the designs are mapped using the 
previous process technology. 
Platforms in = 53 m = 113 
Xilinx Virtex : XCV1000-4 27.323ns 33.132ns 
Xilinx Virtex : XCV1000-5 23.753ns 28.804ns 
Xilinx Virtex : XCV1000-6 21.211ns 25.721ns 
Xilinx Virtex-E : XCV2000E-8 18.092ns 23.441ns 
Xilinx Virtex-2 : XC2V6000-4 16.139ns 21.404ns 
Xilinx Virtex-4 : XC4VFX20-11 9.438ns 15.367ns 
Table 3.5: Comparison of critical paths for different FPGA platforms. 
Name Year Platform Basis m Clock 
MHz 
Timing 
ins 
FPGA 
Slices 
Our 
timing 
Speedup 
Agnew[AMV93] 1989 ASIC ONB 155 40.0 3.90 - 0.05 78.00 
Rosner [Ros98] 1998 XC4062 PB 168 16.3 4.47 956 0.06 74.50 
Gao [GSS99] 1999 XC4044XL ONB 53 - 2.40 813 0.02 120.00 
Okada [OTIT00] 2000 EPF1OK PB 163 3.0 80.30 0.05 1606.00 
Okada [OTITOO] 2000 0.25-urn ASIC PB 163 66.0 1.10 - 0.05 22.00 
Orlando [OP00] 2000 XCV400E PB 167 76.7 0.21 1512 0.06 3.50 
Goodman [GC00] 2000 ASIC PB 160 50.0 7.00 - 0.05 140.00 
Ernst [EKHH01] 2001 XC4085XLA ONB 155 37.0 1.30 2346 0.05 26.00 
Smart [Sma01] 2001 XC4000XL PB 191 - 11.82 - 0.08 147.75 
Serial [LL02] 2002 XCV1000 ONB 173 28.0 11.10 2148 0.07 158.57 
Parallel [LL02] 2002 XCV1000 ONB 113 31.0 0.75 8753 0.03 25.00 
Gura [GSE+02, GES02] 2002 XCV2000E PB 163 66.4 0.14 15768 0.05 2.80 
Ernst [EJM+02] 2002 AT94K40 PB 113 12.0 1.40 - 0.03 46.67 
Jung [JMEH02] 2002 AT94K40 PB 128 12.0 0.15 0.04 3.75 
Kerins [KPMF02] 2002 XCV2000 PB 176 40.0 6.90 - 0.07 98.57 
LFSR [BDG+02] 2002 XCV1000 PB 191 50.0 2.27 0.08 28.38 
Parallel [BDG+02] 2002 XCV1000 PB 191 50.0 0.27 0.08 3.38 
Nguyen [NGCEG03] 2003 XC2V6000 PB 233 100 3.35 - 0.12 27.91 
Satoh [ST03] 2003 0.13-urn ASIC PB 160 510.2 0.19 - 0.05 3.80 
Lutz [LH04] 2004 XCV2000E PB 163 66.0 0.075 - 0.05 1.50 
Mentens [MoPO4] 2004 XCV800 PB 160 47.0 3.810 - 0.05 76.20 
Table 3.6: Comparison between our design and other existing hardware designs. 
* PB stands for polynomial basis and ONB stands for optimal normal basis. 
We find that area varies approximately linearly with m and p. Given p = 2, for m = 113 the 
3.7. Extension: Reconfigurable ECC System-on-Chip 	 61 
design requires 6961 FPGA slices whereas for m = 270, it requires 14787 slices. So increasing 
the key by 2.4 times (or increasing the security level radix by V2270 — 2113), the design needs 
2.1 times as many slices. 
3.7 Extension: Reconfigurable ECC System-on-Chip 
This section presents a System-on-Chip (SoC) architecture for elliptic curve cryptosystems 
(ECC) which targets reconfigurable hardware. A four-level partitioning scheme is described for 
exploring the area and speed trade-offs. A design generator is used to generate parameterisable 
IP building blocks for the configurable SoC architecture. A secure web server, which runs 
on an embedded PowerPC processor, shows 2000 times speedup when the computationally-
intensive operations run on the customised IP blocks. The embedded on-chip timer block gives 
accurate performance information. The design factors of configurable SoC architectures are 
also discussed and evaluated. 
3.7.1 Introduction 
Field Programmable Gate Arrays (FPGAs) enable a high degree of parallelism and achieve or-
ders of magnitude speedup over general purpose processors [BOPV03]. Many applications such 
as signal, image, video, networking and security have been successfully mapped into reconfig-
urable hardware, for exploring the on-chip parallelism and achieving design flexibility. As the 
design moves to system-level integration, the control logic and the computationally-intensive 
logic are usually separated. Configurable System-on-Chip (CSoC) devices [Bec02, KT00], which 
are a combination of embedded microprocessors, memory and embedded programmable logic, 
have attracted academic research and also resulted in industrial products such as Triscend 
A7, Xilinx Virtex-II Pro and Altera Excalibur. Each of them has one or more embedded mi-
croprocessors using different instruction set architectures at different clock speeds, and large 
programmable logic resources. Previous work mainly focuses on the speedup from using pure 
Running on 
Pentium class 
PC 
2.6GHz 
Running on 
embedded uP 
300MHz 
Speed of 
Core Connect bus 
Speed of logic 
(Speed of memory 
bus 
(Speed of processor) 
High speed 
ASIC 
Reconfigurable 
FPGA 
62 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
FPGAs over processors, and analyses the mapping from different application domains to dif-
ferent configurable resources [GNVV04]. 
In this section, we aim to explore the huge design spaces provided by CSoC devices, to describe 
the limiting factors in designing CSoC applications and to predict the dominating factors when 
the technology grows. It is clear that in this design space, the two extreme cases are the pure 
software implementation running on a generic processor which has no additional area overhead 
and the hardware implementation purely running on configurable logic which gives the best 
performance and the largest area. As shown in Fig. 3.10, the software program can be run on a 
PC or an embedded processor or the design could be fully mapped into hardware. On the other 
hand, the use of CoreConnect [Xi106a] enables a high throughput bus model for connecting 
the processor and user logic and handling complex control logic. In this section, we address 
the following two questions. The answers to these questions can be found in Section 3.7.4 and 
Section 3.7.5. 
1 How much application logic should we put into the reconfigurable fabric? 
2 What are the limiting factors on the performance for a CSoC design, in particular, which 
factor dominates? 
Software 	Configurable SoC 	Hardware 
Figure 3.10: The limiting factors for Configurable SoC designs. 
In this section we first propose a system-on-a-chip approach and evaluate the effect of parti-
tioning in a single chip. Our main contributions include: 
3.7. Extension: Reconfigurable ECC System-on-Chip 	 63 
• parametrisable IP block using parallel field multipliers. 
• customisable IP blocks for four-level partitioning of ECC operations. 
• a SoC architecture and design generator for ECC IPs incorporating embedded processors. 
• performance evaluations of single IP blocks and the configurable SoC chip. 
The rest of the section is organised as follows. Section 3.7.2 describes the ECC operation 
partitioning of the customisable IP blocks. Section 3.7.3 presents the design generator of our 
approach, and the system integration between IP blocks and embedded microprocessors. Sec-
tion 3.7.4 evaluates our results and compares the software and hardware designs. Section 3.7.5 
describes a framework of a secure web server system running on the CSoC chip. 
3.7.2 Operations partitioning 
This section describes the operations in our ECC architecture. We use a bottom-up approach; 
we first break down the ECC operations and put them into IP block in the following section, 
then describe the design flow and how we construct the hardware architecture. 
Level 1 - pure software 
In this work, we adopt the open source software from Rosing [Ros99]. The basic routines such 
as the field and point multiplications are implemented in C and compiled for both Pentium 
PC and embedded processors. Table 3.7 shows the number of field multiplications and field 
inversions for one point multiplication. By using this information, we can evaluate the effect of 
replacing the software routine by configurable logic. 
field size m 53 113 131 233 270 
no. of multiplications 612 1530 1760 3732 4758 
no. of inversions 68 153 176 311 366 
Table 3.7: Statistics of field multiplication/inversion for different field size m. 
64 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
Level 2 - Field multiplication 
In our ECC cryptosystem, the field multiplier is based the work we described in the early 
part of this Section. For the pure hardware architecture, users are able to select the second 
level parallelism in terms of the number of executable field multipliers up to the maximum 
four parallel field multipliers. Different scheduling optimisations have been applied to the 
corresponding designs. There is a trade-off between efficiency and area, and the designer must 
select the best design point. 
Since field multiplication is the most computationally-intensive operation, we separate and 
build it as a single IP block using the design flow as shown in Fig. 3.11. The inputs to this 
block are two m-bit values and another m-bit data is generated at the end of the computation. 
Level 3 - Field inversion 
In level 3, we develop an IP block solely for computing field inversion. The pseudo-code of 
the field inversion can be found in the previous Section. The basic idea is to find an inverse 
m-bit field element by inputting an arbitrary m-bit field element. In this field inversion block, 
the field multiplication circuit described in level 2 is reused and is placed as a sub-component 
inside this block. 
Level 4 - Point multiplication 
Point multiplication is the key operation in ECC and it is the most time-consuming process. 
In our embedded logic core, users are able to preload some of the system options such as the 
base point and the curve information. This could save time as the basic point multiplication 
"Q = k x P" where P = P(x, y) that has two components. The bus width of our prototyping 
platform is 32-bit, as a result, the k and P values are tokenised and then sent between the 
embedded logic and the embedded processor. 
IP interface 
(VHDL code) 
Custom IP block 
-4111-10. VHDL code 
MicroBlaze 
softprocessor 
EDK 
PowerPC 
embedded 
microprocessor 
3.7. Extension: Reconfigurable ECC System-on-Chip 	 65 
3.7.3 Extended design generator: ECCGen 
In this Section, we develop a design generator that generates high-level Handel-C code for 
arbitrary field size and degree of parallelism. The generated Handel-C is then translated into 
VHDL; we use a developed wrapper to connect this user logic with the Interface connect core. 
As shown in Fig. 3.11, the customised IP block is thus connected to the bus with its own address 
space such that the embedded processor can directly send data to it. 
ECC design 
generator 
Handel-C code 
DK3 synthesis 
Figure 3.11: Diagram of the design flow. 
The system overview that makes use of embedded processors and configurable logic IP blocks 
which are generated in the design flow. As shown in Fig. 3.12, the On-Chip Peripheral Bus 
(OPB) provides a fast link between logic cores such as the configurable user IP block and the 
on-chip timer. The PowerPC processor is first connected to the high-speed Processor Logic Bus 
(PLB) and then uses a bridge to connect to the OPB peripherals. 
Network port 	Serial port 
i Soft-processor approach 
BRAM LMB_BRAM ctrl 
LMB 
Bus 
MicroBlaze OPB_Ethernet OPB_Uartlite 
OPB Bus 
BRAM PLB_BRAM ctrl 
PLB 	 
Bus 	PLB2OPB 
bridge OPB_IP_Block 
OPB_External 
memory ctrl 
PowerPC 	 SRAM 
Hard-processor approach 
Figure 3.12: Architectural components in a single chip. 
OPB_Timer 
OPB_GPIO 
66 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
3.7.4 Performance evaluations 
The results are divided into two parts: evaluation of a standalone ECC IP block using the 
Celoxica RC2000 board containing an XC2V6000 FPGA chip, and evaluation of the CSoC 
design using the Xilinx ML310 board containing an Virtex-II Pro XC2VP30 FPGA chip. The 
reported timing information and area usage are collected from the Xilinx place-and-route tools 
and the OPB on-chip hardware timer. 
CSoC Design 
Fig. 3.13 shows the speed of field multiplication and field inversion for different m values using 
the embedded PowerPC. We can see that the time spent for one inversion grows faster than the 
field multiplication. We also measure the time for one single data transfer and for a continuous 
1000 data item transfer. From Table 3.8, the overhead of consecutive transfers is reduced. 
The measured speedup factors by using configurable logic for different levels are reported in 
Table 3.9, 3.10 and 3.11. Note that the cycle count starts once the processor enables the 
operation and ends when a "done" signal is received from the configurable logic. 
0.045 	
field multiplication -1-- 
field inversion --x-x--- 
0.04 
0.035 
0.03 
0.025 
0.02 
0.015 
0.01 
0.005 
sp
ee
d:
  s
ec
on
d
 
          
           
           
           
 
0 	 
40 
        
         
 
60 	80 	100 	120 	140 	160 	180 	200 
field size: m 
Figure 3.13: Level 1: OPB timer block measurement of PowerPC 
* running at 300MHz designs. 
3.7. Extension: Reconfigurable ECC System-on-Chip 	 67 
Single read Singe write Mult. read Mult. write 
Cycle 100 102 65 58 
Table 3.8: Read/Write cycles using the CoreConnect bus. 
* sharing reduced cycles for multiple Read/Write. 
In Table 3.10, the required I/O cycles mean that in order to replace the software routine by 
programmable logic, four read cycles and four write cycles are needed. We can see that if the 
operation performed in logic is very fast and this overhead results from the I/O transfer cycles. 
In Table 3.9, there is actually no performance gain when the degree of parallelism is equal to 
or larger than four. On the other hand, there is no I/O overhead for the result using embedded 
microprocessor. In Table 3.9, we use the on-chip timer to measure the number of cycles taken 
to perform one field multiplication for both embedded processor and the IP block. 
parallelism 1 2 4 8 16 
cycle count 297 239 181 181 181 
speed (us) 3.15 2.62 1.99 1.92 2.18 
area (slices) 2063 2199 2487 2691 3494 
embedded PowerPC 117591 cycle 
Table 3.9: Level 2: Comparison of one field multiplication when m = 113. 
* (I/O cycles: 8W+4R). 
parallelism 1 2 4 8 16 
cycle count 1109 703 471 355 297 
speed (us) 13.70 8.78 4.99 3.94 3.87 
area (slices) 2708 2966 3251 3528 4299 
embedded PowerPC 972098 cycle 
Table 3.10: Level 3: Comparison of one field inversion when m = 113. 
* (I/O cycles: 4W+4R). 
Technology Trend Discussion 
The limiting factors of most CSoC systems are the speed of configurable logic, speed of the 
bus or switch connecting between the embedded processor and the configurable logic, the speed 
of embedded processors and the memory bandwidth. As shown in our results, the major 
bottleneck of the CSoC designs is the bandwidth and the latency of the bus connecting the 
68 	 Chapter 3. Customisable Elliptic Curve Cryptosystems 
parallelism 1 2 16 
m = 53 
cycle count 
speed (us) 
area (slices) 
19411 
206.04 
3712 
16555 
175.63 
3869 
3603 
47.09 
4529 
embedded PowerPC 19591629 cycle 
m = 113 
cycle count 
speed (us) 
area (slices) 
79761 
1021.82 
6001 
42767 
570.73 
6344 
9875 
123.01 
7677 
embedded PowerPC 166950475 cycle 
Table 3.11: Level 4: Comparison of one point multiplication when m = 53. 
* (I/O cycles: 6W-F4R) and m = 113 (I/O cycles: 12W-F8R) using one field multiplier. 
embedded processors and the configurable logic. For instance, an elliptic point in the size 113 
field needs to be divided into eight 32-bit words for data transmission. As shown in Table 3.8, 
each single R/W operation takes 100 cycles of embedded on-chip timer that is very expensive in 
high performance applications. We expect that using a wider bus-width in CSoC is the major 
trend. 
From our results, the clock speeds of all synthesized designs are around 70-901\41-1z which is 
slower than the embedded processors. The next generation reconfigurable logic is able to achieve 
up to 500MHz circuit, however the major factor affecting the clock speed of the configurable 
logic is still the critical path of the specific design. For the memory bandwidth, most systems use 
the CoreConnect architecture for connecting the embedded memory and the microprocessors, 
we also expect that a wide low-latency memory bus is crucial. 
3.7.5 Secure web server 
A secure web server using our CSoC framework is depicted in Fig. 3.14. In this system, a 
network logic core [Xi106a] is added to the OPB bus for communicating between the web server 
program running on the embedded processors and the network. A speedup factor of 2000 is 
achieved when we replace the m = 113 point multiplication running on the embedded processor 
by the customised level 4 IP block. 
x86 
debug 
PC 
UART 
x86 
Web 
Clients 
Stress 
test 
PC 
3.8. Summary 	 69 
Xilinx ML310 
FPGA 
MB/PPC 
Web 
Server 
UART Timer 4--• 
I.,_.  Custom 
IP block Ethernet 
Figure 3.14: Secure web server overview. 
We also illustrate that our CSoC chip can benefit another secure web service. Previous 
work [GSE+02] has implemented an ECC hardware system to accelerate the Secure Sockets 
Layer (SSL) in a secure client/server system. The ECC keys are embedded into the gener-
ated X.509 certificates. Their implementation involves an FPGA on a peripheral component 
interconnect (PCI) bus at 66.4MHz, and achieves 12.5 times speedup for the Elliptic Curve 
Key exchange (ECDH) operation when comparing with pure software implementation. Since 
the point multiplication in our design is twice as fast, our design would result up to 25 times 
speedup for ECC applications. 
3.8 Summary 
In this chapter, a customisable pipelined and parallelised elliptic curve cryptography design for 
various field operations has been proposed. This design supports various parameters, such as 
the key size and the degree of parallelism, to enable trade-off between level of security, design 
size and speed. A design generator has been developed to facilitate fast implementation. Per-
formance analysis shows that our design at 35 MHz is the fastest amongst existing hardware 
implementations. It can compute a point multiplication up to 1150 times faster than a software 
ECC application on a Xeon 2.66GHz computer. We also present a configurable SoC architec-
ture for elliptic curve cryptosystem and an embedded secure web system using reconfigurable 
hardware. A four-level partitioning scheme is proposed for exploring the area and speed trade-
off of pure software implementation on embedded processor and the hybrid hardware/software 
approach. Suggestions for future work can be found in Section 7.2.1. 
Chapter 4 
Reconfigurable Primality Test Design 
Generator 
This chapter presents a scalable architecture for prime number validation which targets recoil-
figurable hardware. Section 4.1 introduces the importance of the primality test. Section 4.2 
presents a design flow for mapping scalable prime number validators into hardware. Section 4.3 
covers a scalable architecture for modular multiplication. Section 4.4 presents a design gen-
erator PrimeGen which produces architectures with different design trade-offs, based on user-
specified parameters. Section 4.5 evaluates experimentally the performance and area usage of 
the proposed approach using FPGAs, and compares it with different architectures. Section 4.6 
presents a SoC design framework for prime number validation. Finally, Section 4.7 summarises 
our proposed research. 
4.1 Introduction 
The primality test is crucial for security systems, especially for most public-key schemes. The 
Rabin-Miller strong pseudo-prime test has been mapped into hardware, which makes use of a 
circuit for computing Montgomery modular exponentiation to further speed up the validation 
and to reduce the hardware cost. This chapter describes a design generator to produce a variety 
70 
Perform Rabin-Miller Test 
Prime number width 	 HDL Design Entry 
Generate Montgomery constant 
User specific 
word width 
z 
Validate large primes 
Synthesized Hardware Block 
Flexible small 
prime number 
Case 1: 
Perform modular 
exponentiation Number p 
under test 
Case 2: 
Perform modular 
squaring 
N 	 
non-scalable or scalable montgomery multiplier 
can be selectively used for implementation 
4.2. Design flow 	 71 
Figure 4.1: The design flow of the scalable prime number validator. 
of scalable and non-scalable Montgomery multipliers based on user-defined parameters. The 
performance and resource usage of our designs, implemented in Xilinx reconfigurable devices, 
have been explored using very large prime numbers. Our work demonstrates the flexibility and 
trade-offs in using reconfigurable platform for prototyping cryptographic hardware in embedded 
systems. It is shown that, for instance, a 1024-bit primality test can be completed in less than 
a second, and a low cost XC3S2000 FPGA chip can accommodate a 32k-bit scalable primality 
test with 64 parallel processing elements. 
4.2 Design flow 
In this section we present the design flow for our scalable prime number validator as shown 
in Fig. 4.1. The input to the system includes user-specified constraints, such as the bit-width 
of the input prime number and the word-width which is required for the scalable multiplier, 
while the output from the system is a synthesizeable hardware block that can be embedded in 
different cryptographic designs. This flexible design flow enables upgrade of existing security 
systems with small overhead. 
72 	 Chapter 4. Reconfigurable Primality Test Design Generator 
The major features of the proposed design include: 
• A variable compile-time prime number validator in which user can easily update the 
complexity of the system by controlling the prime width. 
• The variable input small primes determine the accuracy and performance of the system. 
• The user-specified word-width controls the modular operation in hardware. 
• Users are able to select different Montgomery architectures for hardware implementation 
and thus enhance rapid prototyping. 
4.2.1 Speeding up computation 
In this section, we show how we improve the hardware design of the standard operation by non-
scalable and scalable Montgomery operations. The Rabin-Miller algorithm is first implemented 
using standard multiplication and exponentiation with a sequential modulo-multiply. We ob-
serve that one of the bottlenecks of the Rabin-Miller algorithm is the modular exponentiation 
in testing the two validity checks in Section 2.5. 
Montgomery algorithm is widely used for modular multiplication and exponentiation. It re-
quires a constant at the start of algorithm to facilitate conversion between standard and Mont-
gomery space. The constant c is (2(2x (width(p)±2)) % p) where width(p) is the bit-width of the 
prime number p under test. We describe the hardware c generation in Fig. 4.2. In Prime Gen, 
this constant c has been precomputed and saved in hardware. Fig. 4.3 shows the pseudo-code 
of the hardware implementation. Note that the statements embraced by parallel{...} mean 
that they are executed concurrently in hardware. 
4.2.2 Montgomery modular multiplication 
Current modular multiplication approaches are mostly based on the Montgomery algorithm [Mon85]. 
The simpler combinational logic used in this design reduces the critical path and thus acceler- 
4.2. Design flow 	 73 
= 2(2x (width(p)±2)) Generating c 	 mod p 
1. Prime_width = user-specified input 
parallel { 
2. REM 	2(2x (width(p)+2)) 
3. DIV = shiftLeft(p, (prime_widthx2+5)) 
4. i = 0 
} 
5. FOR i = 0 to (Prime_widthx2+5) 
parallel { 
6. HOLD = REM 
7. REM = REM - DIV 
} 
parallel { 
8. if (MostSignificantBit(REM) == 1) 
9. REM = HOLD 
10. DIV = shiftRight(DIV) 
} 
11. END FOR 
12. RETURN c = REM 
Figure 4.2: Generating Montgomery modular constant c. 
ates the calculation. An efficient hardware implementation has been presented in [EW93]. Our 
modular multiplication component is built based on the design in [BP01]. 
The basic idea of the Montgomery algorithm is to multiply two integers modulo M, in other 
words (A x B mod M) without division by M. We first use the generated Montgomery constant 
c to transform the integers into m-residues and compute the multiplication with these m-
residues. Finally, we transform this result back to the normal representation. Note that modular 
multiplication is used in modular exponentiation, since it is beneficial if we compute a series of 
multiplications in the transformed domain, the Montgomery space. Fig. 4.3 shows the pseudo-
code of the hardware description. For example, "q = S (B0 ? A : 0)" checks if the LSB of B 
is true, then "q = S + A" else "q = S" . Fig. 4.4 shows an example of code-level optimisation 
which has greatly reduced the overall number of cycles for more than 50% for the design using 
non-scalable Montgomery multipliers. In Section 4.3, we introduce the scalable design and its 
optimisation using embedded memory. 
74 	 Chapter 4. Reconfigurable Primality Test Design Generator 
Algorithm for computingS=AxBmod M 
parallel { 
1. S = 0 
2. j = width(A) + 2 
} 
3. FOR i = j to 0 
parallel { 
4. q = S + (Bo? A : 0) 
5. S = shiftRight(q + (q0? M : 0)) 
6. B = shiftRight(B) 
} 
7. END FOR 
8. RETURN S 
Figure 4.3: Generating Montgomery modular multiplication S. 
Pre-optimisation 
1. FOR i = j to 0 
2. q = S + (Bo? A : 0) 
3. S = shiftRight((S + 
4. B = shiftRight(B) 
5. END FOR 
(go? M : 0) + (B0? A : 0)) 
Optimisation 
1. #define q (S + (B0? A : 0) 
2. FORT= j to 0 
parallel { 
3. S = shiftRight(q + (q07 M : 0)) 
4. B = shiftRight(B) 
} 
5. END FOR 
Modular squaring (both scalable and non-scalable) 
1. Mont_mod_mult_sq(R) { 
// transform R into Montgomery space 
2. R = MontgomeryModularMulti[0](c, R) 
// perform squaring 
3. R = MontgomeryModularMulti[0](R, R) 
// transform the final result into normal space 
4. R = MontgomeryModularMulti[0](1, R) 
} 
Figure 4.4: Hardware optimisation for modular multiplication and squaring. 
4.2. Design Bow 	 75 
Algorithm for computing S = X E mod M 
1. P [2] = 0, Z [2] = 0; 
parallel { 
// Apply two parallel multipliers 
2. P[0] = P[1] = MontgomeryModularMulti [0] (c , 1) 
3. Z [0] = Z [1] = MontgomeryModularMulti [1] (c, X) 
4. j = width (E) - 1 
5.
}  
FOR i = j to 0 
parallel { 
6. 	P[!io] = E0? 
MontgomeryModularMulti [0] 0') [io] ,Z [io] ) :P [io] 
7 . 	Z [ ! 	MontgomeryModularMulti [1] (Z [io] , Z [io] ) 
8.
}  
E = shiftRight (E) ; 
9. END FOR 
10 . RETURN S = MontgomeryModularMulti [0] (1, P [4)] ) 
Figure 4.5: Generating Montgomery modular exponentiation S. 
* (both scalable and non-scalable). 
4.2.3 Montgomery modular exponentiation 
Fig. 4.5 shows the algorithm for calculating the modular exponentiation using the Montgomery 
algorithm. This algorithm is not limited to the input bit-width and is suitable for replacing 
the standard sequential modulo-multiplier. The values of X and 1 are first transformed into 
Montgomery space by using the Montgomery constant c. Since there is no data dependency 
between the modular squaring and multiplying operation in line 6 and line 7, both operations 
are put into separated hardware and execute in parallel. The final result is transformed back 
to the standard domain for the Rabin-Miller primality test. 
The datapath of the modular exponentiation unit is depicted in Fig. 4.6. Two parallel multi-
pliers have been deployed in this unit together with four temporary storages, Po, Pl , Zo and Z1. 
The inputs to this unit are X and E which are stored in registers or memory depending on the 
architecture of the multiplier. The pre-stored Montgomery constant c is used for the first step 
calculation in Fig. 4.5. The control path, other temporary storage and memory decoding unit 
BRAM/Register 
Input X 
BRAM / Register 
store Montgomery 
Constant c 
BRAM / Register 
store Constant 1 
Parallel Scalable / 	Parallel Scalable / 
non-scalable non-scalable 
Multiplier 0 Multiplier 1 
BRAM/Register 
Input E 
BRAM / Registers (P0, P1, ZO, Zl) 
76 	 Chapter 4. Reconfigurable Primality Test Design Generator 
are not shown in this figure. 
Figure 4.6: The architecture of the exponentiation unit. 
* using two parallel Montgomery multipliers and multiple Block-RAMs. 
4.3 Scalable modular multiplier 
The previous section presents the general design flow of primality test using non-scalable multi-
plier. In this section, we present the mapping of scalable Montgomery multiplication as shown 
in Section 2.5 into technology independent hardware. 
The scalable Montgomery algorithm MWR2MM is shown in Fig. 2.7 for multiplying X and 
Y. The general idea is to repeatedly multiply Xi, the it h bit of X, with Y. The parallelism of 
the design can be explored by applying multiple processing elements (PEs) to calculate the Xi 
values in parallel, since there is no data dependency between these calculations. The datapath 
of the multiplier using p PEs is shown in Fig. 4.7. We can see that the RAM decoder produces 
the p-bits, Xi, Xi+i, 	Xid_p_1 and feeds these bits into p PEs. These signals are valid for j 
cycles and the memory decoder then extracts the next p-bits from the memory storing X. 
For the scalable version, the modular multiplier is replaced by the one described in Fig. 2.7. 
Note that the number of clock cycles and the performance of the scalable MWR2MM algorithm 
Processing 
Element p 
Processing 
Element 2 
Storing n-bit X 
Processing 
Element 1 
BRAM 
holds 
/1 
Xi+p -1 Xi Xi+1 
18k-bits BRAM RAM decoder 
BRAM 
holds 
Y 
18k bits BRAM 
Storing result S 
Compute: S = X * Y mod M using scalable Montgomery algorithm 
Yj Sj mj 
product 
generator 
product 
generator 
1 
adder & 
shift alignment 411 
delay one cycle 
4.3. Scalable modular multiplier 	 77 
Figure 4.7: The architecture of the scalable Montgomery multiplier. 
depends on the number of bits of the number under test and the user-specified partition size 
which is the word-length. The words extracted from Y, M are serially put into the first PE and 
are then pipelined through the other PEs. Fig. 4.9 shows the data dependency involved in the 
multiplication. In this simplified figure, only three PEs are shown. Note that two intermediate 
pipeline registers are installed between every two PEs. We can see that at time to and t i in 
Fig. 4.9, only the first PE, PEI , operates while other PEs are stalled. 
Figure 4.8: The architecture of a single Processing Element (PE). 
In the cycle when j = 0, the PE element determines the addition of M for the next j cycles 
78 	 Chapter 4. Reconfigurable Primality Test Design Generator 
within this PE element. We can refer to line 4-10, and line 12-15 in Fig. 2.7 for more details. 
The exact datapath of a single PE is described in Fig. 4.8. We use a multiplexor to select and 
store SV) , the bit-0 of the lowest word in S, in a register only when j = 0. The input Si is 
latched for computing the Si_1 in the next cycle as shown in line 9 and 14 of Fig. 2.7. 
We retarget the scalable Montgomery multiplier in FPGA and explore the available resources 
such as embedded memory. For instance, if we use a shift register for n-bit data, the hardware 
usage is linearly proportional to the input bit-width. In order to save area, our design stores 
the n-bit data into Block-RAM (BRAM) which holds up to 18k-bit data. Together with all 
the temporary storage, each design takes 20 BRAMs which is half of the available BRAMs in 
a Virtex-II XC2V1000 FPGA chip. In Fig. 4.7, each data bit in BRAM-X is decoded for a 
PE while in each cycle, the jth data in BRAM-Y and BRAM-M are stored in the first PE and 
propagated through the pipeline registers. The outputs of PEI , ... PEp_i are stored in the 
intermediate registers, so that in each cycle only the last PE, i.e. PER , is responsible to update 
the content in BRAM-S. 
4.4 Hardware design generator - PrimeGen 
This section presents the PrimeGen tool which we develop for transforming user-constraints 
into Hardware Description Language (HDL) format, as well as implementing hardware designs. 
PrimeGen is coded in C. Developers can easily embed the generated validator into their security 
designs. This generator enhances designer productivity and exploits available resources from 
the rapid-prototyping platform. An overview of the PrimeGen tool is described in Fig. 4.10. 
The objective of this design generator is to accelerate the overall design flow from specification 
to final implementation. Using this tool, designers can specify architectural-specific description 
of their designs. The predefined libraries consist of scalable and non-scalable Montgomery 
operators based on the Rabin-Miller algorithm. The existing libraries are developed in Handel-
C [Cel], and with slightly modification, the design generator is able to support other HDL 
output format for synthesis. 
j = 0 
determine the 
addition of M 
XO 
, YO 
Y1 
M1 delay 2 
clock cycles 
(insert 2 
stage registers) 
Y2 
SO x2  ,40 
S1 t1 
t2 S2 
t3 S3 
S4 t4 
Single 
PE 
Read from 
BRAM-S 
to SO 
4.4. Hardware design generator - PrimeGen 	 79 
Figure 4.9: The dependency graph of the computation. 
* scalable Montgomery multiplier using three processing elements. 
PrimeGen enables customisable designs such that users can have full control to the output HDL 
file. Two major operations in the design generator are the architectural selection routine and 
RAM decoding routine. In our design, the width of most internal signals are dependent on the 
prime number bit width. Second, user-specified word width controls the iteration of the inner 
while loop in the scalable Montgomery multiplier (the value e in Fig. 2.7). Third, besides the 
predefined small prime numbers, users are able to add or remove small prime numbers of the 
Rabin-Miller algorithm in the final HDL output. Consider step 14 in Fig. 2.7 as an example: 
so-1) = (s(03 ) , a„,(1:41),) 
Scalable 
Design? Generate 
parametrised 
Bit-mask 
for montgomery 
multiplication 
Rabin-Miller 
Algorithm 
Implementation 
Small Prime 
numbers Coded in C language 
---t 
Generated HDL design 
Hardware Synthesis 
--1  
Developer 
Launch Design 
Predefined libraries 
P 	 
Prime number 
bit width 
User specific 
word width 
PrimeGen Software 
Select 
multiplier 
architecture 
Synthesizeable 
HDL description 
non-scalable 
montgomery 
multiplier 
Synthesizeable 
HDL description 
scalable montgomery 
multiplier 
Reconfigurable 
Platform 
80 	 Chapter 4. Reconfigurable Primality Test Design Generator 
Figure 4.10: The overview of PrimeGen design generator 
the predefined libraries provide the exact implementation of a single PE for this particular bit 
concatenate operation. The bits in the j th and (j — 1)th word from S are concatenated using 
an internal register. PrimeGen is then used to expand the design into p parallel PEs. 
4.5 Evaluations 
Our designs can be implemented on any FPGA-based prototyping platform. For small size prob-
lems, we select the RC200E development board as the testing platform which contains a Xilinx 
Virtex-4 XC2V1000-4FG456C FPGA chip. For large size problems, the RC2000 development 
board which has a Virtex-II XC2V6000-4FF1152 FPGA has been selected for implementation. 
We also place and route our designs on a low cost Spartan XC3S2000-4-FG900 chip. The gen-
erated designs are synthesized by using Celoxica DK2 and PDK tools [Cel]. The FPGA chip is 
then configured as a prime number validator. 
4.5. Evaluations 	 81 
Virtex-II - XC2V1000 / XC2V6000 Spartan - XC3S2000 
Prime size 128-bit 256-bit 512-bit 1,024-bit 128-bit 256-bit 512-bit 1,024-bit 
Non-scalable design 
Area (slices) 2,630 5,140 10,153 20,131 2,630 5,140 10,153 20,131 
Clock cycle (ns) 36.75 49.51 80.72 122.34 43.67 62.35 95.01 162.24 
Performance (ms) 0.64 3.34 21.49 129.28 0.76 4.21 25.29 171.45 
Scalable design (8 PE word-size:w = 8-bit) 
Area (slices) 2,651 2,726 2,768 2,872 2,622 2,700 2,736 2,842 
Clock cycle (ns) 31.66 30.78 31.07 32.62 40.38 34.72 39.68 37.75 
Performance (ms) 20.69 109.37 734.39 5478.14 26.39 123.36 937.82 6338.95 
Scalable design (32 PE, word-size:w = 8-bit) 
Area (slices) 9,064 9,144 9,192 9,333 9,019 9,092 9,146 9,283 
Clock cycle (ns) 30.17 29.81 29.47 31.03 40.87 40.61 38.94 39.03 
Performance (ms) 12.70 56.07 286.18 1776.80 17.20 76.38 378.11 2235.08 
Table 4.1: Primality test comparison. 
Results are divided into two parts: performance and area. Table 4.1 shows the performance and 
area comparisons between non-scalable Montgomery multiplication and scalable Montgomery 
multiplication. From the collected results, the non-scalable designs produce the fastest execu-
tion time with a linearly increase of area penalty, while the scalable designs achieve relatively 
small and stable increase in both resource usage and critical path delay as the design scale 
grows. In the same table, we observe that by adopting multiple PE elements, the total number 
of cycles taken for the primality test has been much reduced while there is a slight effect to the 
delay path. As shown in [TTc1(01], speed up of 2 to 3 times can be achieved when we use two 
or three processing elements in the scalable designs when compared to the design with only one 
PE. Our experimental results confirm this speed improvement. 
We also study the trade-off of speed and resource usage of the scalable Montgomery multiplier 
using Virtex chip in three different dimensions: degree of processing element parallelism, prime-
size and word-size. In each case, one of these three parameters has been fixed and the results 
are shown in Table 4.2, 4.3 and 4.4. In Table 4.2, we observe that if we double the number of 
concurrent processing elements, the area is also doubled. As our scalable architecture exten-
sively uses BRAM and minimises other components in the datapath, the effect of area increase 
the prime-size is slight. In Table 4.3, there are 8 PEs, the comparison of different prime-size for 
two word sizes is shown. With the use of memory decoding components, the complexity of ad-
dressing has been much reduced, resulting in a relatively slow increase in area usage. Table 4.4 
shows the effect of changing the word-size with respect to the overall performance. Finally, we 
82 	 Chapter 4. Reconfigurable Primality Test Design Generator 
draw our summary in Fig. 4.11. 
8 PE 16 PE 32 PE 64 PE 
Prime size p = 1,024-bit 
Area (slices) 1,165 2,193 4,278 8,636 
Clock cycle (ns) 20.43 27.24 28.36 34.11 
Performance (ms) 3.35 2.48 1.55 1.25 
Prime size p = 4 096-bit 
Area (slices) 1,198 2,253 4,404 8,864 
Clock cycle (ns) 19.65 23.85 23.03 26.73 
Performance (ms) 47.68 29.82 15.23 9.83 
Table 4.2: Processing element comparison with word size (w = 8). 
Prime-size (bit) 1,024 2,048 4,096 8,192 
Word size w = 8-bit 
Area (slices) 1,165 1,197 1,198 1,230 
Clock cycle (ns) 20.43 18.91 19.65 20.28 
Performance (ms) 3.35 11.78 47.68 194.11 
Word size w = 32-bit 
Area (slices) 3,581 3,628 3,634 3,673 
Clock cycle (ns) 32.43 38.77 31.31 27.20 
Performance (ms) 1.73 7.01 20.54 67.72 
Table 4.3: Prime-size comparison using SMM with 8 PEs. 
Word-size (bit) w=4 w=8 w=16 w=32 
Prime size 128-bit 
Area (slices) 762 1,110 1,939 3,499 
Clock cycle (ns) 20.98 18.31 24.02 34.19 
Performance (ms) 0.14 0.08 0.07 0.08 
Prime size 1.024-bit 
Area (slices) 828 1,165 1,985 3,581 
Clock cycle (ns) 22.08 20.43 24.23 32.43 
Performance (ms) 6.89 3.35 2.19 1.73 
Table 4.4: Word-size comparison for 128-bit and 1,024-bit SMM using 8 PEs. 
The design of 16k-bit primality test can be fitted into twenty 18k-bit BRAMs, which means 
a single XC2V1000 or a low cost Spartan XC3S2000 chip can accommodate a 32k-bit test. 
Since memory is the major limitation of the scalable multiplier, we can use off-chip memory or 
cascading multiple FPGAs for testing the Mersenne numbers. Another simple method to test 
2,048 bits prime number X is to partition the input number into 2 parts: A, B and each part 
less PE 
 
more PE 
 
small 
word size 
large 
word size 
Smallest 
design 
More resource 
usage, 	faster 
less resource 
usage 
Fastest & 
largest design 
4.6. Extension: System-on-Chip (SoC) design framework 	 83 
Figure 4.11: The analysis of the scalable Montgomery multiplier. 
contains 1,024 bits. We assume that the 1,024 bit hardware validation has been implemented. 
We can use simple algebraic and divide-and-conquer methods to test this number. For instance, 
X x 	= A x 2"/2 + B and (X x 2')2 = (A x 2"/2 + B)2 =A x A x 2n + 2AB x 2"/2 + B2. 
Since modular arithmetic is commutative, associative and distributive, again we could use this 
method to perform primality test on very long integers. 
The major overhead of the scalable multiplier is the two stall operations happened to every PE 
element. It means that if we apply 8 PE elements in a 128-bit test, (2 x (1+2+3+4+5+6+7)) 
stall cycles have been wasted. However, this effect is smaller when the input numbers are 
larger. Our flexible design scheme can fit different problem objectives from small Elliptic 
Curve Cryptography, medium RSA to large Mersenne number test. 
4.6 	Extension: System-on-Chip (SoC) design framework 
This Section presents a System-on-Chip (SoC) design framework for prime number validation 
which targets reconfigurable hardware. This design framework has two parts: (I) a design flow 
from user-defined parameters to synthesizeable core, and (II) a rapid-prototyping platform for 
integrating user cores and on-chip processor into an SoC design. This framework can be easily 
extended for developing embedded systems and cryptographic applications. 
A general design framework that makes use of the embedded processor, the fast Processor 
Local Bus (PLB) and programmable user-logic on reconfigurable hardware is presented. In this 
84 	 Chapter 4. Reconfigurable Primality Test Design Generator 
FPGA 
1 
DDR-RAM 
1 
0,. p 	— 
.iii 
User 
Primality Test 	Logic 
-01 Primality Test 
Hardcore Processor 
Power PC 
41-- 	- 1 
' 
Primality Test 
1 II. 
Local Processor 
1 	, 
-i 	-- Primality Test 
.41 Primality Test 
BRAM ..• 
Bus (PLB) 
Primality Test 
ML300 Prototyping Board 
Figure 4.12: An overview of the SoC design framework. 
framework, divide and conquer technique is applied to prime number generation. The circuit 
in part (I) is used to validate one long prime number, the generated designs have been first 
synthesized to VHDL codes and connect to the PLB bus through the predefined bus interface as 
programmable PLB slave bus modules; the processor is used to generate high quality random 
numbers and to interface between user and on-chip validators; the PLB bus is selected for 
connecting PowerPC and the user logic. We have chosen Xilinx ML300 as the platform which 
contains Virtex-II Pro. The estimated result shows that the design is highly scalable that can 
accommodate up to 8 PLB slave modules for 256-bit prime generation in XC2VP125 and can 
operate at 120MHz. The figure below shows an outline of the proposed approach using Xilinx 
ML300 prototyping board. 
4.7 Summary 
This chapter presents a scalable architecture for prime number validation which targets recon-
figurable hardware. Two significant algorithms have been discussed: Rabin-Miller and Mont-
gomery algorithms. The Rabin-Miller strong pseudo-prime test has been successfully mapped 
into hardware. Several modular multipliers have been developed as libraries in the design gen-
erator, Prime Gen. In particular, the Montgomery modular multiplication is the core of other 
4.7. Summary 	 85 
modular operations used in the primality test. The scalability and parallelism of this design 
have been explored for very large prime numbers. We systematically implement the design for 
different bit lengths on reconfigurable devices using both scalable and non-scalable multipliers, 
and study the trade-off amongst different designs. Our architectures and tools appear to be 
useful for exploring efficient designs for use in embedded systems. Suggestions for future work 
can be found in Section 7.2.2. 
Chapter 5 
Custom-Precision Function Evaluation 
for Embedded Processors 
Due to resource and power constraints, embedded processors often cannot afford dedicated 
floating-point units. For instance, the IBM PowerPC processor embedded in Xilinx Virtex-II 
Pro FPGAs only supports emulated floating-point arithmetic, which leads to slow operation 
when floating-point arithmetic is desired. This chapter presents a customizable mathemati-
cal library using fixed-point arithmetic for elementary function evaluation. We approximate 
functions via polynomial or rational approximations depending on the user-defined accuracy 
requirements. The data representation for the inputs and outputs are compatible with IEEE 
single-precision and double-precision floating-point formats. Results show that our 32-bit poly-
nomial method achieves over 80 times speedup over the single-precision mathematical library 
from Xilinx, while our 64-bit polynomial method achieves over 30 times speedup. 
The rest of the chapter is organised as follows. Section 5.1 gives an introduction of this work. 
Section 5.2 presents the integer instruction set of the IBM PowerPC405. Section 5.3 presents the 
methodology for enabling the automatic library generation and optimizing the code generation. 
Section 5.4 provides error analysis for our function evaluation library. Section 5.5 describes 
the optimizations we perform for code generation and code space minimization. Section 5.6 
discusses results, and Section 5.7 provides a summary of this work. 
86 
5.1. 	Introduction 	 87 
C-code 
Instructions Instructions generation 
using Matlab 
Embedded Embedded Embedded 
integer integer integer 
processor processor processor 
7 
Math 
co-processor 
Floating-point 
emulation Our approach 
(a) (b) (c) 
Figure 5.1: Overview of current and the proposed embedded processors. 
(*Current approach includes (a) using co-processor and (b) using floating-point emulation.) 
5.1 Introduction 
The evaluation of elementary functions is often the performance bottleneck of many compute-
bound applications [LAML04]. Examples of these functions include logarithm log(x) and square 
root VY. Evaluating such functions efficiently while meeting the precision requirements is par-
ticularly important for embedded applications, where stringent resource and power constraints 
are enforced. 
Advanced FPGAs enable the development of configurable SoC systems and high-speed function 
evaluation units that are customised to particular applications. As shown in Fig. 5.1(a), in 
embedded systems, the integer processor is usually incorporated with one or more dedicated 
coprocessors such as a math coprocessor for fast function evaluation, which results in a trade-
off between area, cost and performance. Fig. 5.1(b) illustrates the emulated floating-point 
mathematical library from Xilinx [Xi104]. In this approach, floating-point arithmetic is emulated 
using integer operations only without the use of a coprocessor 1. Performance degradation and 
code space consumption are the two major problems for using this approach. 
In this chapter, we propose an Integer Mathematical Generation tool, IMGen, which makes 
use of optimised fixed-point (integer) arithmetic for internal computations. IEEE single and 
'Recently, Xilinx has released the Virtex-4 FX FPGA which has an Auxiliary Processor Unit (APU) [APU05] 
that can connect the math coprocessor using FPGA fabrics. In this work, we compare the designs without using 
this math coprocessor. 
88 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
Table 5.1: Instruction latency comparison of different processors. 
(* using floating-point co-processor.) 
ARM 7 PowerPC 750 MicroBlaze MIPS32 24k 
add 1 1 1 4 
multi 3 5 3 4 
finulti 4* 4* 6* 4 
idiv n/a 19 34 32 
double precision floating-point formats are used for both the input and output formats such that 
internal computation is transparent to the users. A design generator is used to automatically 
select the best polynomial/rational approximation for internal computations and the degree of 
computation for a given error tolerance. 
The main contributions of this work are: 
• IMGen, a customizable library for floating-point function evaluation based on the integer 
instruction set found in embedded processors. 
• automatic code generation and optimization within IMGen using Matlab to customise 
precision, and trade offs involving code space, performance and accuracy. 
• an evaluation of the proposed approach by automatically selecting polynomial or ratio-
nal approximation for a given function, accuracy requirement and execution time for 
embedded integer processors. 
5.2 Integer instruction sets 
Instruction processors play an important role in embedded systems. We classify these proces-
sors based on their datapath width such as 8-bit, 16-bit, etc. The automated library generation 
framework proposed in this work is applicable to all integer processors. To demonstrate our 
customizable precision-based function evaluation method, IMGen, we have selected the IBM 
PowerPC405 from Xilinx as a platform. The PowerPC405 [Xi104] is a 32-bit implementation 
5.3. Automation methodology 	 89 
of the PowerPC embedded-environment architecture that is derived from the PowerPC ar-
chitecture. It also has a fixed-point execution unit that is used for 32-bit computation and 
has thirty-two 32-bit general purpose registers. The processor has 16KB 2-way set associative 
instruction-cache and data-cache, respectively. In this work, we set the clock speed of the 
processor, the processor local bus and the bus components to 100MHz. 
Industrial integer processors include ARM, MIPS, PowerPC and the Xilinx embedded software 
processor MicroBlaze. Table 5.1 shows a latency comparison of different architectures. We 
include some important instructions such as multiplication (multi) and integer division (idiv) 
in this comparison. The design method we present in this work enables code generation for all 
these integer processors. 
5.3 Automation methodology 
This section introduces our approach to automating the customization of precision based func-
tion evaluation for integer processors. The method we propose is technology-independent, and 
we use the embedded PowerPC processor as a specific example to demonstrate the flexibility 
and performance gain of this method. 
5.3.1 Overview 
The general idea of this work is user-transparent fixed-point computation that provides an effi-
cient floating-point function library without any additional hardware cost. A library generator 
shown in Fig. 5.3 is developed for constructing IMGen which targets various integer processors. 
The Matlab based code generator, which has direct access to the Maple symbolic library, inputs 
the mathematical function name and the required accuracy, and outputs optimised C code. This 
generator enables users to customise the output precision based on the specific embedded ap-
plication by using a static error analysis. This error analysis is comprised of the approximation 
error reported by the Maple command and the quantization error induced by the fixed data- 
90 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
Evaluating f (x) = log(x) 
// Range Reduction 
input.sng_as_flt = x; 
exp = input.sng_as_fld.exp - 126; 
ix = fp2int(input); // y = ix; 
// Evaluation Method 
// f(y) where y = [0.5, 1) 
// e.g. degree-3 polynomial 
fl = ((c3 x y + c2) x y + ci) x y + co; 
// Range Reconstruction 
si = range(exp); // find exp x log(2) 
fi = (fi >> overflow) + si; 
output = int2fp(f1); 
output.sng_as_fld.exp += overflow; 
Evaluating f (x) = \/.7 
// Range Reduction 
input.sng_as_flt = x; 
exp = input.sng_as_fld.exp - 126; 
ix = fp2int(input); 
ix = (exp[0])? ix >> 1 : ix; // y 	ix; 
// Evaluation Method 
// f(y) where y = [0.25, 1) 
// e.g. degree-2 rational 
fl = ((c2 x y + ci) x y + co)/((d2 x y+d1) x  y + do); 
// Range Reconstruction 
exp = (exp[0])? exp+1 : exp; 
correction = exp >> 1; 
output.sng_as_fld.exp += exp - correction; 
Figure 5.2: Design methods using polynomial and rational approximations. 
(*For example, log(x) uses degree-3 polynomial approximation and N/Y uses degree-2 rational 
approximation.) 
path. This integer mathematical library can be integrated with any embedded software. The 
proposed approach enables: (1) automating the selection of approximation method, bitwidth 
and polynomial degree, (2) generating a customizable mathematical library from user-defined 
parameters to result evaluation, and (3) optimizing the fixed-point software implementations 
S S S Exp Exp Exp Fraction Fraction Fraction 
Coefficient generated from MATLAB using symbolic library - Maple 
int2fp 
S Exp Fraction 
5.3. Automation methodology 	 91 
Matlab 	IMGen 
Select approximation method 4r- 	 Function f(x) 
and polynomial degree Error requirement 
C code 
10. 
Embedded 
integer processor 
Integer 
processor 
instruction set 
Code optimization and 
adaptive datapath correction 
Generate code with the best 
degree and performance 
Compare error and speed 
improvement 
Figure 5.3: Matlab design tool flow. 
(*Using Matlab to generate the library C source code for the library. The user can specify the 
require accuracy/error requirement for the library.) 
for elementary function evaluation. We test the code generated by our framework by running 
on an embedded PowerPC processor. 
x 
S Exp Fraction 
fp2int c2 * 231  ci * 23°  co * 229 
Figure 5.4: Maple generates the polynomial coefficients. 
(*EVIGen generator adjusts the datapath that can maximise the accuracy. x is input, 
y = f (x) is output.) 
cl •231 c0 *2" dO •23° dl *23' 
 
   
92 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
x S Exp Fraction 
 
fp2int 
 
Generated coefficients 
S Exp Fraction S Exp Fraction 
Generated coefficients 
S Exp Fraction S Exp Fraction 
int2fp 
 
S Exp Fraction 
     
Figure 5.5: Maple generates the rational coefficients. 
(*IMGen generator adjusts the datapath that can maximise the accuracy. x is input, 
y = f (s) is output.) 
Automation 
Our Matlab based code generator enables automation from user-defined error requirement to 
any particular elementary functions. For a given error requirement, the generator first generates 
coefficients using the Maple symbolic library from Matlab for both polynomial approximation 
and rational approximation. In the next sub-section, we present the detail of the approxima-
tions and the operations for enabling custom precision. We provide both 32-bit and 64-bit 
internal integer computations in order to create the custom-precision library. In the approxi-
mations, we can get better performance by using fewer polynomial degrees, however we also get 
a higher approximation error. When the embedded processor evaluates a given function using 
the generated C code, the input data x is originally in floating-point format. The range reduc-
tion transforms this number into its integer representation by scaling. As shown in Fig. 5.2, 
since the exponent field of IEEE single precision is biased 127, we can directly use the mantissa 
field (fraction part) of the input data. The details of the integer and floating-point conversion 
are shown in Fig. 5.6. We provide an analytical error analysis and define the error terms. For 
example, the approximation error is introduced by the Maple command when we generate the 
polynomial coefficients, and the quantization error is introduced in the datapath since infinite 
precision is required to represent the exact value. 
C-code: Data structure for input/output 
typedef struct sng_flds { 
unsigned sgn : 1; // 0x8000 0000 
unsigned exp : 8; // 0x7F80 0000(bias 
127) 
unsigned val : 23; // Ox007F FFFF 
} SNG_FLD; 
C-code: fp2int - floating-point to 
integer 
output = input.val << 8; 
output += 0x80000000; 
output = output >> 1; 
C-code: int2fp - integer to 
floating-point 
input = (input >= 0)? 
(sign = 0, input) : (sign = 1, -input); 
exp = check_leading_zero(input); 
input = input << exp; 
output.val = (input & 0x80000000) >> 8; 
output.exp = 129 - exp; 
output.sgn = sign; 
5.3. Automation methodology 	 93 
Figure 5.6: fp2int and int2fp operations. 
(*floating point and integer data conversions.) 
Generation 
IMGen uses the calculated maximum errors including approximation and quantization errors 
to determine the best possible library for a given error requirement. The output of the library 
construction includes: a library file, a reported generation time, an error upper bound, cycle 
counts from the hardware timer, an actual measured speedup from the embedded system when 
comparing with the Xilinx emulated math library and code space usage of the embedded C 
library. Lee et al. [LANIL04] show that how we can achieve elementary function approximation 
using hardware, in this work we can apply the same principle to include more elementary 
functions. An important step in generating different functions is the efficient code generation 
for range reduction and range reconstruction. 
94 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
Table 5.2: Embedded PowerPC arithmetic and logical instructions. 
Move mr rD, rA 
Add Carrying addc rD, rA, rB 
Add Extended adde rD, rA, rB 
Count Leading Zeros Word cntlzw rA, rS 
Multiply Low Word mullw rD, rA, rB 
Multiply High Word mulhw rD, rA, rB 
OR immediate on rA, rS, UIMM 
Shift Left Word slw rA, rS, rB 
Shift Right Algebraic Word sraw rA, rS, rB 
Shift Right Word srw rA, rS, rB 
Subtract from Immediate Carrying subfic rD, rA, SIAM 
Optimization 
In the code generator, several optimization techniques are used such as loop unrolling. By 
unrolling loops, we simplify decimal point correction. The basic idea is to make use of the 
information collected during Matlab code generation. In other words, the integer computations 
inside the embedded processor are pre-computed, and as a result we achieve a performance 
gain. In Fig. 5.2, we describe two conventional ways of implementing two functions evaluated 
in this work. We show that how we achieve range reduction, function approximation and range 
reconstruction. As shown in the same figure, we use a degree-3 polynomial approximation when 
f (x) = log(x) and we use a degree-2 rational approximation when f (x) = 
The input x is first stored in a union data structure which has both floating — point (f p) and 
bit — selective elements such that we can easily select the sign-bit, mantissa and exponent of 
the input data. The IEEE single and double representations are used in the data structures. 
Next, we correct the exponent value since it is biased 127, and transform the input x into 
the corresponding integer x (which is y) in the figure. Finally, for the range reconstruction, 
different techniques are applied according to the input function. 
5.3. Automation methodology 	 95 
5.3.2 Polynomial approximation 
The generated floating-point coefficients are stored in an array and are scaled up to integers 
which maximises the use of the 32-bit word size. Horner's rule is used to evaluate the poly-
nomials. An adaptive datapath bit-width optimization is used to adjust the datapath for a 
correct decimal point in order to avoid any data overflow or underflow. As shown in Fig. 5.4, 
the input floating-point x is transformed into its equivalent integer format by fp2int(), and the 
final integer result is converted back to the floating-point domain by int2fp(). 
5.3.3 Rational approximation 
Rational approximation can be represented as follows: 
((eri x + 	 + co)  f(x) — 
((dri x + dn_i)x + • • • cli)x + do) 
In the code generation, the degrees of the numerator and the denominator are the same. The 
major bottleneck is the final division at the end of the evaluation. A smaller degree of rational 
approximation can always achieve a similar or better accuracy than a relatively higher degree 
in polynomial approximation. Fig. 5.5 shows a second-degree rational approximation with two 
multiplications, two additions and one division. 
5.3.4 Custom precision code 
We use fixed-point (integer) arithmetic for all internal calculations. First, we obtain the frac-
tional bits from the input value and scale it up to the corresponding integer value, while an 
adaptive scaling method for the polynomial coefficients is used to avoid the mis-calculation. 
The important point is to maintain the correct "virtual" decimal point. In the following sub-
sections, the input/output conversions and the internal basic operations are described. Note 
that we only need to modify these integer arithmetic operations for different integer processors. 
PowerPC assembly: s4.s5 = s2.s3 << sl 
msl(sl, s2, s3, s4, s5); 
r21 = sl, r22 = s2, r23 = s3 
subfic rll, r21, 32; 
slw r22, r22,  r21; 
srw r20, r23,  r11; 
or r22, r22,  r20; 
addi rll, r21, -32; 
slw r20, r23, r11; 
or r22, r22, r20; 
slw r23, r23, r21; 
s4 = r22, s5 = r23 
PowerPC assembly: s4.s5 = s2.s3 >> s1 
msar(sl, s2, s3, s4, s5); 
r21 = s1, r22 = s2, r23 = s3 
subfic rll, r21, 32; 
srw r23, r23, r21; 
slw r20, r22, r11; 
or r23, r23,  r20; 
addic. r11, r21, -32; 
sraw r20, r22, r11; 
ble $+8; 
on r23, r20, r20; 
sraw r22, r22, r21; 
s4 = r22, s5 = r23 
96 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
Figure 5.7: 64-bit integer shift left operation. 
(*using 32-bit registers (sl,s2,s3,s4,s5).) 
Figure 5.8: 64-bit shift algebraic right operation. 
(*using 32-bit registers (sl,s2,s3,s4,s5).) 
In Fig. 5.6, the 32-bit floating-point to integer conversions are described. Not that the current 
tool only handles the input between 0 and 1, "input.exp" is not used in this figure. For exam-
ple, the input data is corrected by adding the hidden most significant bit (MSB). The 64-bit 
conversion is not shown here but it basically uses the same principle. 
5.3. Automation methodology 	 97 
Add operation 
In Table 5.2, the main instructions that are used for IMGen are shown. The supported integer 
instructions include arithmetic, logical, compare, rotate and shift instructions. For example, 
the add carrying instruction addc adds the contents of register TA and rB to register rD. The 
latency of most instructions is one cycle, while the latencies for multiply and divide are 4 cycles 
and 35 cycles, respectively. Since PowerPC405 is an integer processor, system software requires 
floating-point emulation which is a call interface to subroutines within a floating-point run-time 
library. The subroutines emulate the operation of floating-point instructions. 
Addition and multiplication are the two major operations for function evaluation. In IMGen, 
we support both 32-bit and 64-bit implementations. Note we add two 64-bit data where s3 = 
sl +s2. Also, PowerPC405 is a 32-bit processor and the general purpose registers are all 32-bits, 
we need to use the add carrying and the add extended instructions to complete this operation. 
The carry from addc is added to the consecutive adde operation. Note that the same idea can 
be generalised to any 8-bit or 16-bit processor and to custom output precisions. 
Multiply operation 
In the code generator, we provide two library functions that are used for 32-bit and 64-bit 
integer multiplication. If the design has a lower accuracy requirement, a 32-bit implementation 
and the "multiply-high-word" mhw() operation is used. On the other hand, if a 64-bit datapath 
is selected, sl, s2 and s3 are stored into the corresponding MSB and LSB registers and the 
"multiply-low-word" mlw() operation is also used. For example, the two input values are 
moved to two general purpose registers, r21 and r22, and then perform the embedded assembly 
instructions "mullw" and "mulhw", the final result is placed at the r23 register and thus finally 
move back to the s3 variable. 
C1 (2,30) 
/32-bit 
32-bit 
d (3,29) 
98 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
Co (3,29) 
/:32-bit 
/ "\\ 
- 
/ 32-bit 
Y (3,29) 
Figure 5.9: Degree one approximation for log (x). 
Divide operation 
An optimised division operation is crucial for the rational approximation. We cannot directly 
use the 32-bit divide instruction from PowerPC since the result of the integer division for our 
case is always zero. It is because for this work, the output of the evaluated functions are between 
zero and one. Moreover, this division instruction uses a sequential divider in hardware and is 
very slow when compared to other instructions. As a result, we adopt the "shift-and-subtract" 
method [Par00] to perform this division. Our division simply uses "add" , "shift" and "negate" 
instructions. 
Shift operations 
Our adaptive datapath method corrects the decimal point involved in the polynomial and 
rational approximations. Depending on the input coefficients, the input range (the integer part 
of the coefficient) can be very large and we need to scale up the coefficient with a smaller scaling 
factor. This scaling operation is performed by bit shifting. In the proposed library, we have 
provided six shift operations: shift left, shift right and shift algebraic right for both 32-bit and 
64-bit designs. As shown in Fig. 5.7 and Fig. 5.8, the 64-bit shift left and shift algebraic right 
are provided. For example, the s2.s3 refers to the MSB of the 64-bit data storing at s2 while 
the LSB storing in s3. The final result is shifted by s1 bits and is stored into s4.s5. 
5.4. Error analysis 	 99 
5.4 Error analysis 
We divide the error analysis into two parts: approximation error and quantization error which 
is further divided into three parts: range reduction, function approximation and range recon-
struction. 
Approximation error 
Error is introduced in the function approximation step because approximation error is the 
error of approximating a function using polynomials that is independent of the precision we 
used. First, the Maple command we used to generate the polynomial coefficients reports this 
approximation error. Second, the quantization error is introduced in the datapath since the 
datapath of our designs is either 32-bit or 64-bit. We aim at maintaining the highest accuracy 
with the available bitwidth and try to minimise the accumulated quantization error. 
Etotal = Eapproximation 	Equantization 
Equantization = Epoly_approximation 	Erange_reconstruction 
The above two equations summarise the key ideas of this error analysis. The quantization error 
has three parts: range reduction, function approximation and range reconstruction. The range 
reduction step of the evaluated functions is exact where no quantization is involved as shown 
in Fig. 5.2. We summarise the approximation errors reported by Matlab in Table 5.3. We 
can see that the approximation error generated by using the rational method is much lower 
than the polynomial method. Note that for example, degree-3 means that the degree of both 
the numerator and the denominator are three. In this error analysis, we only consider the 
polynomial approximation step and the range reconstruction step. 
100 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
Table 5.3: Approximation errors. 
(*using polynomial and rational methods reported by Matlab for log (x).) 
Degree Polynomial approx. Rational approx. 
1 
2 
3 
0.02983005 
0.00342398 
0.00044161  
0.0008607941 
0.0000017146 
0.0000000032 
Ey = Epoly_approximation 
Ey = Eaccum Equantizationy 
Ey = (Ed + Eco ) + 2-29  
Ed = Ec, + 2-29  
Ec, = 2-31  
Eco = 2-3° 
Therefore, 
Ey = 	+ 2-29 + 2-3° + 2-29  
Ey = 5.12227 x 10-9  
Figure 5.10: Static error analysis of degree-one log(x). 
Quantization Error 
Much research work focuses on the bit-width optimization problem, especially for the fraction 
bit-width optimization on hardware design [LAML04]. These approaches mainly minimise 
the hardware bit-width in order to meet the error requirement due to the area and latency 
constraints. In this work, the problem we are solving is constrained by the fixed datapath. As 
shown in Fig. 5.9, since all the datapaths in this calculation are fixed to 32-bits, the proposed 
adaptive datapath method selects the best position of the decimal point. For example, we 
use one bit to store the integer part and 31 bits to store the fraction part of x. The Maple 
command which generates polynomial coefficient C1 needs two bits for the integer part, as a 
result the multiplication product d needs three bits for the integer part. When we perform an 
addition, the two operands need to have the same decimal point, thus the final fraction bits 
for the y = f(x) equals 29 bits. As shown below, we describe how we perform this static error 
analysis in the library generation. 
We denote Esignai  as the error of a particular signal. The polynomial approximation error 
5.4. Error analysis 	 101 
Erange_reconstruction = Elog 2 X exP 
Over f lowmax = 5 
Elog 2 = 2-31  
Eover flow = Elag 2 X 2 
Eaccum = Elog 2 X Eover flow 
Eaccum 2
-31 + 2-30 + 2-20 + 2-28 + 2-27 
There f ore, 
Erange_reconstruction = 4.1607 x 10-9  
Figure 5.11: Error analysis of log(x) range reconstruction. 
bound of the signals in Fig. 5.9 is described in Fig. 5.10. The error bound induced by the 
reconstruction is given in Fig. 5.11. The floating point value, input x is exactly transformed 
to its integer format, thus there is no quantization error at the input. Note that the static 
error analysis includes the quantization error in the computation, but not the approximation 
error by the Maple approximation. There are two main ways to quantise a signal: truncation 
and round-to-nearest. Here we define FB as the fractional bitwidth. Truncation and round-
to-nearest can cause a maximum error of 2-FB and 2-FB-1 As shown in the above equations, 
the coefficients Co and C1 are rounded to the nearest, and the error at the output y = f (x) is 
the total of accumulated error and the quantization error of this signal. The second part of the 
quantization error belongs to the range reconstruction. 
Below, we show the minimum error bounds provided by the IEEE single and double preci-
sion formats. It means that if the final computed error is smaller than these two values, the 
library/design can assure the accuracy requirement of either IEEE single or double precision. 
Emngle = 2-23 = 1.192 X 10-7  
Edouble = 2-52 = 2.220 X 10-16  
Now that the quantization error of the above example is lower than E3,„91,, however the approx-
imation error of this example is Eapprox = 0.02983 for the degree-1 case as shown in Table 5.3 
which is much higher than Esing/e. We apply this static analysis in our code generation to cover 
102 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
long ic[2] = {1488522235, -1456492463}; 
#define CORRECTION if (j==0) s3 = s3 << 1; 
unoptimised code 
ix = fp2int(input); 
iy = is [0] ; 
for (j=0; j<degree; j++){ 
mhw(ix, iy, s3); 
CORRECTION 
iy = s3 + is [j+1] ; 
} 
using loop-unrolling technique 
ix = fp2int(input); 
mhw(ix, ic [0] , s3); 
s3 = s3 << 1; 
iy = s3 + is [l] ; 
Figure 5.12: Code generation optimization example. 
all the possible errors in the calculation for determining the best possible embedded C code. 
5.5 Optimization 
This section discusses the optimization techniques we use for code generation, and also for code 
space optimization. 
5.5.1 Optimised code generation 
One of the code generation optimizations we used is loop unrolling. As shown in Fig. 5.12, two 
code fragments of degree-one function evaluation are used to illustrate the basic idea. The array 
is is used to hold the polynomial coefficients. CORRECTION represents the adaptive datapath 
correction for adjusting the decimal point. Fig. 5.17 shows that when we use a higher polynomial 
degree, a smaller approximation error is obtained while the accumulated quantization error is 
increased. This U-shaped curve not only describes the effect of changing the polynomial degree, 
I
, 	,  
, 	 Xilinx Math library I 	i , 
IMGen I 	1 
. , 
- 0., .„, . 
h 	Y 
tl 	• 
. . 
. . 
. . 
. . 
. . 
:,.. 
. .. . 
. . 
. . 
. . 
- 
. 
H
ar
dw
ar
e  
tim
er
  c
yc
le
  c
ou
nt
  
100000 
10000 
1000 
100 
5.5. Optimization 	 103 
-00 	-01 	-02 	-03 
Compiler optimization effort 
Figure 5.13: The effect of compiler optimization. 
(*on library performance for log(x) function evaluation.) 
but also shows the effect of using code generation optimization. The maximum error shown in 
the graph represents maximum error of the empirical study of thousands of sample data against 
the Matlab result. These code generation optimization techniques are also applied to 64-bit 
datapath designs and rational approximation designs. 
5.5.2 Compile time optimization 
Besides the code generation optimization, compilation techniques such as code reuse and register 
renaming are extremely important for instruction processors. Fig. 5.13 shows that the existing 
emulated math library is highly optimised since further compilation optimization does not affect 
the overall performance. On the other hand, the generated IMGen library is compiled with the 
highest optimization in order to compare with the emulated math library. 
104 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
User input >> imgen('log', 0.01) 
Phase 1: Maple command generates polynomial coefficients 
Phase 2: Static error analysis calculates quantization error 
Phase 3: Select polynomial/rational approximation 
Phase 4: Select 32-bit/64-bit implementation 
Phase 5: Generate embedded C code 
and execute in embedded integer processor 
Phase 6: Output performance data and statistical error 
text data bss 	dec hex filename 
44232 4296 48 48576 bdc0 TestApp/executable.elf 
cycle count for the Xilinx math library: 63335 
cycle count for the bus overhead: 	60 
cycle count for the IMGen library: 618 
average speedup: 1.13e+002 
maximum error : 0.0034241 
IMGen is generated and tested in 4.904e+001 seconds 
Figure 5.14: Input/Output of the automated system. 
5.5.3 Polynomial degree automation 
The inputs of IMGen generator are the mathematical function name and the error require-
ment. As shown in Fig. 5.14, the Matlab interface passes the parameters for the error analysis 
as described in the previous section. The embedded C code is generated and evaluated in the 
prototyping system with the best execution time and accuracy trade-off. In the same figure, 
it shows that the Matlab function to generate the library is for example "imgen('log',0.01)". 
The polynomial degree and the design methods are automatically selected by the library gen-
erator based on the input function and also the user-defined error requirement. 
5.5.4 Instruction code space optimization 
The change of the polynomial degree is independent of the code space usage for the emulated 
math library, whereas the code space of IMGen increases gradually with a higher polynomial 
degree. In this work, we demonstrate that the proposed framework not only provides a flexible 
and efficient custom-precision math library, but also reduces the required instruction code space 
5.5. Optimization 	 105 
Soft-processor approach 
cc 120 
100 
ctj 
E 80 
_c 
(r) 60 
a) 40 > 
2 P
ol
y n
om
ia
l
 
de
g r
ee
  
Memory 
Bus 
MicroBlaze BRAM BRAM ctrl OPB_Timer 
OPB Bus 
	 PLB 
BRAM 	BRAM 	Bus ctrl 
 PLB2OPB 
bridge OPB_IP_Block 
PowerPC 
Hard-processor approach 
Figure 5.15: Our embedded system under testing. 
(*PLB: processor local bus, OPB: on-chip peripheral bus, and BRAM: block RAM.) 
10 
8 
6 
4 
0_ = 20 
a) a) 	0 	  
cn 10-1 10-4 10-5 10-6 10-9  10-2 10-3  
error requirement Input 
Figure 5.16: 32-bit log(x) function evaluation without optimization. 
(*Speedup comparison between the Xilinx emulated floating point math library (precision: 
IEEE single) and IMGen (precision: variable).) 
which is very expensive in embedded systems. The major reason for the reduction is the fact 
that our method avoids the excessive floating point to integer conversions which happen to all 
the variables in the compiled program using the Xilinx math library. 
0 
20 2 4 6 8 10 12 14 16 18 
Polynomial degree 
10 
1 
0.1 
0.01 
0.001 
le-04 
le-05 
le-06 
M
ax
im
um
  e
rr
or
  
140 
120 
100 
o. 
80 a) a) eL 
60 u) 
40 
20 
r‘ 
Maximum error 
Speedup 
Speedup (optimized) 	 
...... 	
........... 
........................ 
106 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
5.6 Results 
This section presents the experimental setup and evaluation of the proposed approach using 
different methods. 
5.6.1 Experimental setup 
We evaluate the proposed design flow using the Xilinx ML310 system which has an XC2VP30 
Virtex-II Pro device [ML304] embedded with two PowerPC processors. The embedded system 
also enables prototyping of the soft-processor approach as shown in Fig. 5.15. The OPB_Timer 
is connected to the embedded PowerPC through the OPB and PLB buses. The measured cycle 
counts refer to the clock cycles of the OPB_Timer for the function evaluation. 
Figure 5.17: U-shaped curve for the 32-bit approximation. 
(*and quantization errors of log (x).) 
5.6.2 Evaluations 
As shown in Fig. 5.16, the result of evaluating log(x) in an embedded processor shows that over 
100 times speedup is achieved when the error requirement is higher than 10'. In this work, 
1 
M
ax
im
um
  e
rr
or
  
4 1 5 
Polynomial approximation 
Rational approximation 0.1 
0.01 
0.001 
le-04 
1 e-05 
1 e-06 
1 e-07 
0 2 	3 
Polynomial degree 
5.6. Results 	 107 
we use the OPB timer to measure the exact time for computation. The embedded processor 
sends the start and done signals to the timer before and after function evaluation. The register 
in the timer records the exact number of clock cycles. In order to measure bus latency, we send 
a start signal followed by a stop signal to the timer without running any extra instructions 
in the embedded processor. This OPB bus latency is defined as Toverhead, Tmath_library is the 
time taken to evaluate functions in the emulated math library, and TimG„ is the time (in cycle 
counts) spent in IMG en. 
Figure 5.18: Maximum error comparison. 
(*using polynomial method and rational method of log (x) .) 
The equation below describes how we measure the speedup using the OPB_Timer. We also 
randomly generate a large testbench dataset in order to statistically measure the speedup. 
Looking at Fig. 5.17, we can see that significant speedup is achieved when optimization tech-
niques are applied to code generation. Fig. 5.18 shows that the rational method achieves much 
lower error than the polynomial method for small degree. Note that the maximum error values 
shown in this graph are much higher than those in Table 5.3. This is because the dominating 
factor of quantization error varies with the degree value. On the other hand, in Fig. 5.19 we 
observe that the penalty of the division used in the rational method is high when compared to 
the polynomial method. The proposed design flow satisfies custom precision requirements for 
most space expensive embedded systems. The division in the rational method also limits the 
108 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors 
3500 	  120 
Polynomial (cycles) 
	
Polynomial (speedup) 	-o 
Rational (speedup) 
Rational (cycles) - 100 
3000 
2500 - 
8 	2000 a 0 
a) 
(-), 	1500 - > 0 
1000 - 
- 40 
500 - 
20 
1 
	
2 	3 
	
4 
Polynomial degree 
Figure 5.19: Speedup comparison. 
(*using polynomial method and rational method of log (x).) 
speedup factor. 
Speedup = (Tmath_library Toverhead)/ (TIMGen Toverhead) 
32-bit 32-bit (opt.) 64-bit 
Xilinx Math (cycles) 63759 63699 64370 
Bus latency (cycles) 60 60 60 
IMGen (cycles) 1369 672 1921 
Speedup factor 48x 103x 34x 
Measured error 0.00005886 0.00005920 0.00000159 
Table 5.4: Degree-4 polynomial method comparisons. 
(*comparisons of 32-bit, 32-bit-opt and 64-bit implementations using degree-4 polynomial 
method.) 
Given that we have a customised precision math library, what is the overall speedup in real 
world applications? To answer this question, we first need to profile the embedded system 
program. Then according to Amdahl's law, overall speedup = i_p+I 	pzs where P is the im-
proved computation portion, and S is the speedup factor for that improved portion of the 
system. Now, suppose that over 60% of the total run time is spent on function evaluations in 
a graphics/multimedia application, and the speedup factor is 100. The system overall speedup 
is 2.463. 
Single precision 
Latency [cycles] 
84 
215 
Double precision 
Latency [cycles] 
183 
433 
IMGen 
10-5  
Timer [cycles] 
587  
710 
Intel XScale" [12] 
5.6. Results 	 109 
log (x) VY 
Xilinx Math (cycles) 62725 9159 
Bus latency (cycles) 60 60 
IMGen (cycles) 1696 467 
Speedup factor 38x 22x 
Measured error 0.000004313 0.0008375 
Table 5.5: Degree-6 polynomial method comparisons. 
(*comparisons of log(x) and Vi polynomial approximations using degree-6 polynomial 
method.) 
Table 5.6: Comparisons of log(x) and fi 
(*polynomial approximations with reference designs [12].) 
It is obvious that using a 64-bit datapath can achieve a higher precision, however the more 
internal computations the slower the overall function evaluation. Here, we compare the accuracy 
and performance using 32-bit and 64-bit datapaths. Table 5.4 shows some results using the 
degree-4 polynomial method. We can see that, (1) the number of clock cycle measured by 
the OPB_Timer for both Xilinx math library and using our code generator, (2) and the 64-bit 
measured error is much lower than the 32-bit one. In the table, "Bus latency" refers to T over h ead 
which is the OPB bus latency. The execution time of each elementary math function routine 
varies depending on the function nature and the input arguments. For example, as shown in 
Table 5.5 the execution time of log (x) is almost 6 times slower than 	Although we use the 
degree-6 polynomial method for both evaluations, the speedup factor and the measured error 
are both highly dependent on the input function and input arguments. We also compare our 
designs with the Intel XScale' math library. Looking at Table 5.6, the latencies in terms 
of the XScale processor clock and the OPB_Timer are described. Our current designs are 
slower than the highly optimised Intel library, mainly because of its specialised range reduction 
techniques and an efficient table-lookup and software implementation. However, our approach 
demonstrates the flexibility of automatically generating customizable embedded C code for 
integer processors with different accuracy requirements. 
110 	 Chapter 5. Custom-Precision Function Evaluation for Embedded Processors  
5.7 Summary 
This chapter introduced a customizable library for floating-point function evaluation based 
on the integer instruction set used in embedded processors. We have developed an approach 
for automating code generation and optimization within this framework to customise preci-
sion in order to obtain the best trade-off in embedded code space, performance and accuracy. 
We evaluate this approach using the embedded PowerPC processor on a Xilinx Virtex-II Pro 
XC2VP30 FPGA. By trading off accuracy, we achieve over 80 times speedup compared to the 
single-precision reference mathematical library with a static error analysis below 10-5, while a 
64-bit polynomial method achieves 30 times speedup when compared to IEEE single-precision. 
One of the current limitations of the framework is that the quantization error of the 64-bit 
method is higher than IEEE double-precision, we can use other function evaluation techniques 
such as piecewise polynomial approximation and multiple words in the datapath to achieve a 
higher accuracy. Suggestions for future work can be found in Section 7.2.3. 
Chapter 6 
Arbitrary Random Number Hardware 
Design Generator 
We present an automated methodology for producing hardware-based random number genera-
tor (RNG) designs for arbitrary distributions using the inverse cumulative distribution function 
(ICDF). The ICDF is evaluated via piecewise polynomial approximation with a hierarchical seg-
mentation scheme that involves uniform segments and segments with size varying by powers 
of two which can adapt to local function non-linearities. Analytical error analysis is used to 
guarantee accuracy to one unit in the last place (ulp). Compact and efficient RNGs that can 
reach arbitrary multiples of the standard deviation a can be generated. For instance, a Gaus-
sian RNG based on our approach for a Xilinx Virtex-4 XC4VLX100-12 FPGA produces 16-bit 
random samples up to 8.2a. It occupies 487 slices, 2 block-RAMs, and 2 DSP-blocks. The 
design is capable of running at 371 I\4Hz and generates one sample every clock cycle. 
The rest of the chapter is organised as follows. Section 6.1 contains an introduction about 
random number generation. Section 6.2 covers the design flow of the proposed generic RNG. 
Section 6.3 focuses on the hierarchical segmentation method targeting the RNG. Section 6.4 
presents the hardware architecture of the inversion-based random number generator and its 
components. Section 6.5 addresses the bit-width optimization technique used in the design 
generator. Section 6.6 evaluates results of this work and compares them against existing work. 
111 
112 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Finally, Section 6.7 summarises the described approach. 
6.1 Introduction 
Random numbers are the key component in large scale simulations across many applications 
including communications [SSKV06], ray tracing [LRR04], and financial simulation [BGM97]. 
Clearly, the quality of random numbers plays a central role in ensuring that simulation results 
are meaningful. Although the most commonly used random number distributions are uniform 
and Gaussian, there are many cases in which random samples drawn from log-normal, expo-
nential, Rician, Rayleigh, or other distributions are of interest. In the communications field, 
for example, noise models are highly dependent on the specific propagation environment, and 
are quite often non-Gaussian in nature. Thus, there is a need for fast and accurate methods 
for generating samples corresponding to distributions appropriate for the target environment 
and application. 
While there is a long and rich history of work relating to non-uniform random number gener-
ation [Knu97], the overwhelming majority of this work has targeted software implementations 
where high precision is easily accessible. However, the higher speed offered by hardware for 
simulation applications ranging from communications to finance has stimulated growing in-
terest in hardware-based random number generators. This has led to reevaluation of many 
of traditional random number generator (RNG) methods in light of the constraints on preci-
sion and data flow regularity that characterize typical hardware platforms. For example, in 
software the best methods for Gaussian number generation are rejection-acceptance methods 
such as the Ziggurat method [MT00]. These methods can offer extremely high quality random 
numbers, but produce output samples conditionally, meaning that while the average output 
rate is known, the time-local output rate varies. This can lead to complications in applications 
that require new random number samples at specific clock intervals [ZLL+05]. Thus, hardware 
implementations typically target methods that produce outputs at deterministic intervals. 
In this work we introduce a general RNG design generator that produces hardware designs for 
6.1. Introduction 	 113 
Inverse CDF of the Gaussian (Normal) distribution 
0.1 	0.2 	0.3 	0.4 	0.5 
	
0.6 	0.7 	0.8 	0.9 
	
1 
x 
Figure 6.1: Inverse cumulative distribution function of the Gaussian distribution. 
generating random numbers from arbitrary distributions using the inversion method [BFS83]. 
The inversion method for generating non-uniform random numbers [Dev86] utilizes the inverse 
cumulative distribution function (ICDF) to convert a sample x of a uniform random variable 
over [0, 1) to a sample from the desired PDF through y = F-1  (x). Thus, the challenge in ICDF 
hardware development lies in creating an efficient and accurate circuit design for evaluating the 
function F-1(x). For example, Fig. 6.1 shows the ICDF of the Gaussian distribution, where x 
is a uniform random number and y is a sample from the Gaussian distribution. Such ICDFs are 
generally non-linear in the sense of having regions with high first or higher order derivatives. 
Hardware designs using the ICDF inversion technique have previously been implemented by 
McCollum et al. [MLBP03] and Chen et al. [CMB04]. In [MLBP03], a Gaussian ICDF is 
implemented via linear interpolation with evenly-spaced data points. This implementation 
leads to a large table size of 262 KB. In [CMB04], a pre-calculated ICDF inversion table using 
on-chip memory is utilized to transform uniform random numbers into non-uniform random 
numbers. This approach requires a 1 MB RAM for a 16-bit input / 16-bit output lookup table. 
The primary contributions of this work are a rigorous and automated framework and the associ-
ated tools for generation of hardware RNGs for arbitrary distributions via the inversion method. 
Techniques including analytical error analysis, bit-width optimization, hierarchical segmenta-
tion, and piecewise polynomial approximation are used in combination to guarantee accuracy 
of one ulp while also offering area- or latency-optimized designs. The resulting hardware ar-
chitectures are verified through FPGA implementation of designs for Gaussian, exponential, 
114 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
distribution input bit-width precision 
select distribution 
degree 
select segmentation 
(US, P2SL, P2SR, P2SLR) 
coefficient table 
segmentation table 
bit-width optimization 
signal bit-widths 
• 
hardware generation 
VHDL 
Figure 6.2: Design flow of the proposed approach. 
(*Gaussian Random Number Generator (GRNG) implementation.) 
and log-normal distributions. The combination of generality, automation, and memory-efficient 
designs makes the method presented here suitable for a wide range of simulation environments 
and applications. 
6.2 Design flow overview 
The RNG generator design flow is shown in Fig. 6.2. The designer first inputs a target distribu-
tion, the input bit-width and the output precision requirement. The input bit-width determines 
the maximum a value of the output random numbers. For instance, a wider input bit-width 
used in the input uniform random number to the design generator, higher a multiples of the 
generated random number output, because input values closer to zero and one can be repre-
sented. The output precision reflects the quality of the generated random numbers. The higher 
the precision, the better quality of the generated random numbers and the larger coefficient 
4 	
6.2. Design flow overview 	 115 
reset 
uniform random number generator 
uniform random 
numbers 
address decoding logic 
----- 
X2 
) 
x1 
V 
ROMO - segmentation table 
coefficient address 
, 
ROM1 - coefficient table 
Cd Cd-1 ". 	C1 Co 
degree-d polynomial evaluation 
I 
arbitrary 
distribution 
segmentation 
scheme 
generate 
ROM 
contents 
random numbers of arbitrary distribution 
Figure 6.3: Overview of the RNG architecture. 
(*random number generator architecture based on the inversion method.) 
table is used. Floating-point arithmetic is usually used in software designs, but in hardware 
implementations fixed-point arithmetic is preferred. It is because for example the area require-
ment for fixed-point design is generally smaller. Another parameter used in the design flow 
in the input interval. The input interval of the function evaluation is fixed to [0, 1) for the 
three described inverse functions in this work, but the design generator is capable of covering 
functions with variable input interval. 
Inverse CDFs for commonly use distributions are build-in functions inside MATLAB, so we use 
MATLAB to generate the polynomial coefficients. The MATLAB Symbolic Toolbox, which in-
cludes the MAPLE linear algebra package, is used to convert the Chebyshev coefficients [Mu197] 
into Horner's form for polynomial evaluation. The proposed design generator can use polyno-
mial approximation of any given degree, but in the proposed work we focus our attention on 
results up to degree-3. The "select distribution" step in Fig. 6.2 chooses a polynomial degree 
so as to find the smallest and most efficient implementation for a given distribution, input 
bit-width, and precision requirement. The "segmentation" step divides the interval into splines 
using a series of hierarchy and selects the adaptive segmentation scheme for storing the coef- 
116 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
ficients. The "bit-width optimization" step investigates the minimum number of bits required 
for each signal in the datapath. The final step "hardware generation" produces synthesizable 
VHDL code suitable for ASIC and FPGA realization. We carry out the entire design generation 
from the first step to the final output result inside MATLAB and it is fully automated. We also 
implement the GRNG design using Xilinx System Generator [Xi106c1], a MATLAB Simulink 
library, to maximise the performance of the GRNG design by inserting pipeline registers. 
Fig. 6.3 gives an overview of the general random number generator architecture based on the 
inversion method. The Hierarchical Segmentation Method (HSM) generates the segmentation 
table ROMO and the coefficient table ROM1 according to the given target distribution. Given 
an inverse CDF function, the proposed design generator determines which segmentation scheme 
is used and generates the corresponding address decoding circuit. The user input specification 
contains the information for the exact bit-width of the URNG. When the reset signal goes 
high, the URNG is initialised to generate random numbers using the pre-defined seeds. Two 
numbers (x1 and x2 ) are formed from the URNG output, x1 is used to address the segment 
information in order to get the correct coefficient address. The address decoding logic in this 
figure extracts x2 for the degree-d polynomial evaluation. Moreover, we quantise this x2 signal 
into X-2, will explain the benefit of using ±-2 instead of the input x2 for the polynomial evaluation 
in the bit-width optimization section. 
The methods described here are demonstrated with the following three distributions: 
	
X 	 1 
I F1 1( 2 	0) 1, Fi(Y) = ov27 foo 
e 	202 2 di 
[V 1  e— dt 
F2-1  (XI itt) F2(y) = 
.1 0 it 
20-2 
F3-1(xiii, a), F3(Y) = 
1 	y(1n(t)11) 2  
	10 	 
dt 
(1) f i is the ICDF of the Gaussian distribution with mean p = 0 and standard deviation 
a = 1, (2) f2 is the ICDF of the exponential distribution with t = 1, and (3) h is the ICDF 
of the log-normal distribution with p = 0 and a = 0.5. As shown in Fig. 6.1, the Gaussian 
distribution is symmetric, in implementation the absolute value of the first half of the Gaussian 
6.3. Hierarchical segmentation of ICDFs 	 117 
ICDF x = [0, 0.5) is approximated. For the reconstruction of the full distribution, a random 
bit is then used for the sign of the generated Gaussian sample. The exponential and log-normal 
distributions do not exhibit this symmetry property, and so are evaluated directly across the 
entire range of interest. 
6.3 	Hierarchical segmentation of ICDFs 
The commonly used segmentation method is uniform, where all segment lengths are equal [J.E99, 
dT05, POMB05, LAMLO5b] and the segment count is typically limited to powers of two. The 
major difficulty of the proposed random number generator is the approximation the inverse 
function for a given distribution, in particular the highly non-linear regions of the function. 
Although the uniform scheme leads to simple coefficient address computation, non-uniform 
segmentation enables segment lengths to be customised to the local function characteristics. 
Manual segmentation methods utilizing segments that vary in width by powers of two in con-
junction with uniform segments has been proposed [Hen89]. This work we apply the fully 
automated version of the Hierarchical Segmentation Method (HSM) [LLVC03] to construct 
random number generators for arbitrary distributions. 
US P2SL 
start 	 end 
	
start 	 end 
x 
P2SR 
x 
P2SLR 
start 	 end 	start 
	 end 
x 
	 x 
Figure 6.4: Illustration of different segmentation schemes. 
(*uniform segmentation US and three segmentations involving lengths that increase or 
decrease by powers of two: P2SL , P2SR, and P2SLR.) 
The HSM provides four basic segmentation schemes, denoted US, P2SL , P2SR, and P2SLR 
4 
3.5 
3 
2.5 
2 
1.5 
1 
0.5 
0  0.7 0.9 0.1 0.8 0.6 05 0.2 0.4 0.3 
li 
0.1 	0.2 	0.3 	0.4 	0 5 	0.6 	0.7 
	
0.8 
	
0.9 
X 
118 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Gaussian (Normal) distribution 
10 
8 
2 
0 
Exponential distribution 
0.1 
	
0.2 	0.3 
	
0.4 
	
05 
	0.6 
	
0.7 
	
0.8 
	
0.9 
Log-normal distribution 
7 
6 
5 
"-co 4 
3 
2 
1 
0 
Figure 6.5: The inversion plots for the Gaussian (Normal) distribution. 
(*using the P2SL segmentation, the Exponential distribution using the P2SR segmentation, 
and the Log-normal distribution using the P2SLR segmentation. The black and grey vertical 
lines represent the boundaries for the outer segmentation and inner segmentation 
respectively.) 
respectively and illustrated in Fig. 6.4. In US, segments are uniformly sized. In P2SL, the 
segment sizes increase by powers of two from the beginning of the input interval to the end of 
the interval, while in P2SR the segment sizes decrease by powers of two from the beginning to 
the end of the interval. In P2SLR , segment sizes increase by powers of two until the mid-point 
of the interval and then decrease by powers of two until the end is reached. As can be seen in 
Fig. 6.4 and Fig. 6.5, this range of options offers a way to match the segmentation to portions 
of a function that have singularities or narrow peaks. This method is hierarchical because the 
segmentation can be applied recursively: in the first pass, the entire interval [a, b) is subdivided 
/Bxo / Bx1  
4 
5 
6 
7 
—Ow 
1000000 11111111 
7 9 6 8 
Segment 
number 
0 0 0 0 0.0 0 
0 0 0 0 1 0 0 
0 0 0 1 i0 0 0 
0 0 0 1 1 0 0 
0 0 1 0 0 0 0 
0 0 1 1 0 0 0 
0 1 0 0 i0 
i 
0 0 
0 1 1 o lo 0 0 
1 0 0 0 0 0 0 
1 1 0 0 0 0 0 
0 0 0 0 0 1 1 
0 0 0 oil 1 1 
0 0 0 1 0 1 1 
0 0 0 1 1 1 
0 0 1 0 1 1 1 
0 0 1 1 1 1 1 
0 1 0 1 1 1 1 
0 1 1 1 ) 1 1 
1 0 1 1 1 1 1 
1 1 1 1 1 1 1 
xl 	 X2 
3 —0. 
8 —0. 
9 —pp- 
6.3. Hierarchical segmentation of ICDFs 	 119 
Figure 6.6: Segment ranges in binary representation for Bx = 7. 
(* P2SL outer segmentation, Bxo = 4, and .13,1 = 1. The four bits corresponding to xo are 
highlighted in bold. The bits to the left of the shadowed digit correspond to Xo.) 
using one of the above four methods into smaller segments, then in the next pass, each segment 
can be further subdivided, again using any of the four methods. Although HSM is applicable 
to any level of hierarchy, in practice it has been found to be sufficient to use two levels of 
segmentation to obtain near-optimal segmentation and the best design complexity. Moreover, 
in this work the second level of segmentation is fixed to US for the three given functions, 
although it is extensible to any level of segmentation. 
For the P2SL, P2SR, and P2SLR schemes, two barrel shifters are used. For the US scheme, 
the first barrel shifter can be omitted because the number of bits of Bxo stays unchanged. We 
also illustrate the idea of fixing the number of bits of x1  which is represented as Bx1. Although 
the total number of segments would increase as shown in Table 6.3, we can remove the second 
barrel shifter in which the overall design complexity may be reduced. It is because there is 
a trade-off between the increase of the segment numbers and the removal of this second left 
120 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
shifter which has B, + 13,1 number of bits wide. 
Applying the HSM to the three functions f1..3 gives P2SL, P2SR, and P2SLR, respectively, as 
shown in Fig. 6.5. The inner segmentation method is fixed to US which is shown as the grey 
vertical lines in Fig. 6.5. Note that the output range for the three functions are different, and 
this maximum y value and the total number of segments are both directly related to the input 
bit-widths which will be further discussed in the next section. 
For the two-level HSM segmentation, the input x is divided into three partitions. For the first 
partition xo, we need to compute the segment address by detecting the number of leading zeros 
for segments beginning with a zero, and detecting the number of leading ones for segments 
beginning with a one. The exact computation is as follows: 
P2SL_addr = 
P2SR_addr = 
P2SLR_addr = 
Br° — LZD(xo ), 	if MSB(xo ) = 0 
Bxo , 	 if MSB(x0 ) = 1 
{ 0, 	 if MSB(x0 ) = 0 
LOD(x0 ) + 1, 	if MSB(xo ) = 1 
Bx0 — LZD(xo ), 	if MSB(x0 ) = 0 
B„„ + LOD(x0 ) — 1, if MSB(xo ) = 1. 
(6.4) 
(6.5) 
(6.6) 
where LZD(xo ), LOD(x0 ), and MSB(xo ) return the leading zeros, leading ones, and the most 
significant bit of x0 respectively. 
For the first partition x0, it is necessary to compute the segment address by detecting the 
number of leading zeros for segments beginning with a zero, and detecting the number of 
leading ones for segments beginning with a one. Consider the case when B, = 7, the outer 
segmentation is P2SL, Bs° = 4, and Bx1 = 1. As illustrated in Fig. 6.6, it is possible to construct 
a maximum of five outer segments and five inner segments. 	0 gives the number of bits used 
for indexing the segments in the first partition. It is determined by our design generation tool, 
which makes use of a linear search algorithm to calculate the minimum number of segments m 
—•—Uniform 
- * -Hierarchical 
_ - - 
105 
E 
E 104 
E 
co' 
5 103 
.o°) 
E 
Z 
103 
6.3. Hierarchical segmentation of ICDFs 	 121 
Segment numbers comparsion for different error requirements 
2-8 	
2-10 	2  12 	2  14 	2-18 
	
2-20 
Error Requirement ereci 
Figure 6.7: Variation of the number of segments m with error requirement. 
(*for uniform and hierarchical segmentation of the function f i (Eqn. (6.1)) with input 
fraction bit of 16 bits.) 
for the coefficient table ROM1. Let xo be the set of bits that remain constant (i.e. the bits left 
of the shadowed digit in Fig. 6.6) within a given segment. The next partition uses the adjacent 
Bxi bits to the right of xo. The number of bits corresponding to the second level depends on 
the value of xo, since xo determines the value of Hio. 
The major purpose of using a better segmentation method is to prevent using excessively large 
number of segments, thus we compare designs using uniform segmentation and designs using 
HS1VI segmentation. We also study the effect of changing the input bit-width to the total 
number of segments in Fig. 6.8. It can be seen that the segment number increases slowly for 
HSM while exponentially for the uniform scheme. For the comparison using different output 
precisions, the variation for the uniform scheme is not increased dramatically, but the gap 
between uniform and HSM is still very notable. 
Fig. 6.9 shows that for the design using HSM segmentation, the average error of 20 evenly 
spaced input is smaller than the uniform design. Note that for the uniform design, it requires 
4096 segments and 90112 bits for ROM1. For the HSM design, we only need 40 segs and 1600 
bits (i.e. 1.8% of the coefficient table ROM1 using uniform segmentation) with the maximum 
error of 2.77 x 10-4 and the maximum error of uniform method is 3.56 x 10-4. 
Segment numbers comparsion for different input bit-widths 
Uniform 
- * -Hierarchical 
10 
101  
10 	12 	14 	16 	18 	20 
Input Bit-Width 
106 
105 
N 104 
E rn 
(S' 103 
-5 
6) 
10°  a 
Uniform? 
- * -Hierarchical  
122 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Figure 6.8: Variation of segment number m against different input bit-width. 
(*with error requirement accurate to 2-8 for uniform and hierarchical segmentation of the 
function fi (Eqn. (6.1)).) 
x 1 0' 
	Total e ror comparison for uniform and HSM for f1  
0.2 
	
0.4 
	
0.6 	0.8 	1 
Figure 6.9: The total error of the 12-bit input designs. 
(*and error requirement accurate to 2-11 for uniform and hierarchical segmentation of the 
function f j. (Eqn. (6.1)).) 
6.4 Inversion-based random number generator architec- 
ture 
The inversion method basically transforms one uniform random variable x over the interval [0, 1) 
into a non-uniform random number according to the given inverse distribution. For the URNG 
which generates the input uniform random variable x, we select Tausworthe URNGs [L'E96] 
due to its smaller and faster properties than only using LFSRs. There exist three LFSR-based 
URNGs in each Tausworthe URNG to enhance the equi-distribution property of the generated 
uniform random number. The generated uniform random number in each clock cycle has a 
large periodicity up to 288 	1025 which is sufficient for the purpose of this work. In Fig. 6.11, 
6.4. Inversion-based random number generator architecture 	 123 
Table 6.1: Maximum random number by using different input bit-widths. 
input bit-width 16 32 48 
fl 4.2 6.3 7.8 
12 11.1 22.2 33.3 
13 8.0 22.5 45.0 
Maximum achievable a—value using different input bit—widths 
16 
14 
12 
10 
—*— f1  
- • - f 2 
* fa 
8 
6 
4 	6 	8 	10 	12 	14 	16 	18 	20 	22 	24 
input bit—width [bits] 
Figure 6.10: Variation of the maximum a value. 
(*against different input bit-width for functions f l , f 2 and f3.) 
the two numbers shown in brackets for each signal represent the total bit-width and the fraction 
bit-width of the signal. For instance, the random number x is obtained by concatenating the 
two uniform random numbers from two URNGs. For a degree-2 function evaluation, we require 
three coefficients that are packed together to form one single coefficient entry inside the table, 
thus we can reduce the block-RAM usage. As we mentioned earlier in this work, one bit from 
the URNG is used to select the sign of the final Gaussian (Normal) random number which has 
5 integer bits and 11 fractional bits. 
For the proposed RNG generator, there are three major components in the architecture: a 
variable URNG, address decoding logic using HSM, and a degree-d polynomial evaluation using 
homer's rule. There is no feedback loop in this design. As shown in Fig. 6.11, the output from 
the first stage is a uniform random number x. In this work, the two specifications of the GRNG 
are a periodicity of 1015 and 16-bit random samples. From the Gaussian distribution, we need 
to represent up to 8.2a for a population of 1015 Gaussian random samples. In order to meet 
124 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
reset 
s.1  
uniform random number generator 
Tausworthe 
32-bit URNG 
Tausworthe 
32-bit URNG 
32 
concat 
20 
	(31:12) (0 	0) 
address decoding X / 52 
32 
Xo 	 Xl 	 X2 
/ 13 x0 /BX1 / Bx2  
LJ 
/ B22 bit 
selection 
unit 
PAi P2S 
unit 
ROMO 
addr  offset BX1 
Bx0 
Bs?, 
Bx , 
polynomial evaluation 
/1 
addr 
ROM1 
Cd Col-1 C1 CO 
/ B Cd  
	>e):I 	 
BDe2-2 
0 
De2•3 
/ 	> 
Bap2:$ • 
0 
, Cl  BC0 
   
   
   
D2 g 
/BD, 
	)10I 	Kv) BDO  
/By 
valid signal 
generator 
1 
valid 
(negate  
.1,  
\O 	sign  
16 
Gaussian random number 
Figure 6.11: Hardware architecture for generating Gaussian random number. 
(*based on the inversion and the HSIVI method for degree-d splines. ROMO contains 
information on the hierarchical segmentation, while ROM1 contains the polynomial 
coefficients to each spline segment. The grey "Q" squares perform quantization at run-time.) 
this 8.2a, we fix the input bit-width to 52. In Table 6.1, we show the maximum value of the 
generated random number by changing the input bit-widths using large values. As shown in 
seed3 
6.4. Inversion-based random number generator architecture 	 125 
reset 	 Tausworthe 32-bit URNG 
uniform random number 
Figure 6.12: Architecture of the Tausworthe URNG in Fig. 6.11. 
(*The output uniform random number is a 32-bit data.) 
Fig. 6.10, we observe the linear behaviour between the input bitwidth and the output u value. 
In this work, we concatenate two Tausworthe URNGs and extract the top 52 bits and the 
last one bit as the sign. In the proposed architecture, we select the Tausworthe URNG which 
provides superior randomness than the LFSR. Fig. 6.12 shows the hardware implementation of 
this component. 
We explain how we map the second component (i.e. the segmentation table and the address 
decoding logic) into the proposed GRNG design. Since two levels of segmentation are used in 
the HSM address decoding, the input x is divided into three parts: Bx0 leftmost bits of the 
argument x, BT , the middle portion, and B„ the rightmost bits as shown in Fig. 6.11. The 
number of bits of xo is found by the design generation tool that can guarantee the minimum 
number of segments for the coefficient table ROM1. The exact number of bits of x1, Bx , is 
determined by the P2S address and the Bx , pre-stored in ROMO. The final coefficient address 
is obtained by adding the offset data in ROMO with the value of x1 from the input x. The 
offset data represents the starting coefficient address of each outer segment. By adding this 
offset address with the value of x1, we are able to exactly locate the corresponding coefficient 
address for all inner segments. 
Xi X2 X0 
Xo X2 
P2S 
unit Bg0 barrel shifter (left) 
addr 
BX, 
barrel shifter (left) 
xo XI X2 
9 ROMO 
Xf 
X2 
Xf X2 
variable Bx , 
Bx0 
constant Bx , 
barrel shifter (left) 
X1 
X2 	> 
126 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
bit selection unit 
Figure 6.13: Illustration of the bit selection unit. 
(*in Fig. 6.11. The first barrel shifter can be omitted when US is used because B10 stays 
constant, the second barrel shifter is removed when the constant B11 design is used.) 
The bit selection unit shown in Fig. 6.13 has two versions, variable Bxi and constant .8,1 . In 
HSM, the number of bits in each segment is variable depending on the input value, thus we can 
adaptively locate the minimum number of segments. The overhead of having this flexibility is 
the increase in the address decoding complexities. For the US address decoding scheme, the 
Bso is constant and we only need to lookup ROMO to shift out x1 . In this case, the first barrel 
shifter is not required. For P2SL , P2SR, and P2SL,R , we need the first shifter to remove the 
leading Bso for the variable Bx1 design. Note that for this proposed work, we do not use US 
for the outer but use it for the inner segmentation. The second left barrel shifter is used to 
separate the remaining bits into x1 and x2. The size of the second barrel shifter is determined 
by Bx + max(Bxi ) in ROMO. For the second version of the proposed design using constant 
Bsi , after the first left shift which has removed the variable B10 , we only need to select the 
max(B,i ) bits for x1 and the rest of the bits represent x2. By introducing this simplification, we 
need to replicate some of the entries in ROM1 to compensate this gain. For example, suppose 
in the variable B11 design, there are only two segments but the max(Bx1) is 2 bits. In this case, 
we need to duplicate these two segments into 4 segments in the coefficient table. 
The outputs from the second stage are the coefficient address for ROM1 and x2 for the poly-
nomial evaluation. For a degree-d evaluation, it requires d + 1 table lookups, d multiplications 
6.5. Bit-width optimization 	 127 
and d additions as shown in Fig. 6.11. 
Now we obtain the correct ROM1 address and locate the corresponding coefficients. To approx-
imate the inversion function over the linear given range, we use polynomials with non-uniform 
segments given by the HSM method. In this work, polynomials are evaluated using Horner's 
rule for calculating the final output y: 
y = ((Cdx Cd_i)x + .)x + Co 
	 (6.7) 
where x is the input, d is the polynomial degree and C, are the polynomial coefficients. The 
hardware architecture for degree-d polynomial is shown in Fig. 6.11. Note that we use 12 instead 
of x for the polynomial evaluation for reducing the size of the operators and the signals. Now 
we obtain the approximation of the first half of the Gaussian distribution. From the URNG, we 
obtain one uniform random bit which is used to select between the output signal from stage 3 
and its negated signal. We now complete the reconstruction of the complete Gaussian (Normal) 
distribution by using the proposed architecture. 
We have described the segmentation and the coefficient generation, and also the components 
in the RNG. The next step is to determine the optimal bits required for each signal in the 
datapath, such that we can avoid data over-flow and also waste the hardware resources. 
6.5 Bit-width optimization 
Bit-widths of signals are important parameters that designers can tweak to improve the quality 
of a design in terms of area, latency and throughput. The idea is to use the minimal bit-widths 
to each signal resulting the most tight datapath, while each signal does not cause excessive 
error. For the data representation in the datapath, we perform two's complement fixed-point 
arithmetic. Given signal x, its integer bit-width (/B) is denoted IBx and its fractional bit-width 
(FB) by FBx , i.e. the total signal bit-width Bx = IBx + FBI . For the bit-width minimization 
problem, we adapt the MiniBit technique [LAMLO5a] described in [LV06] for function evaluation 
128 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
with polynomials. The problem is further divided into two parts: range analysis for determining 
the minimal IBs for controlling the range, followed by precision analysis for determining the 
minimal FBs required. 
6.5.1 Range analysis 
To compute the dynamic range of a signal, the local minima/maxima and the minimum/maximum 
input values of each signal are examined. The local minima/maxima can be found by comput-
ing the roots of the derivative. Once the dynamic range has been found, the required 1B can 
then be computed. In the proposed RNG generator, spline approximations are being used, the 
polynomial evaluation circuit needs to be shared among different sets of coefficients. The IB 
for each signal is found for every segment and stored in a vector. Since the signal needs to be 
wide enough to avoid overflow for the data with the largest dynamic range, the largest IB in 
the vector is used. 
6.5.2 Coefficient transformation 
For a given segment, the values of xo and x1 are implicitly known as constants, we use x2 
instead of the whole x for the polynomial evaluation. As a result, we are able to reduce the 
size of the operators. x2 is scaled to be over [0, 1). If x = [0, 1), this would involve masking 
out the bits corresponding to xo and x1 and shifting x by Bx0 Bx, bits to the left. However, 
the coefficients are generated with the assumption that x will be used for the polynomial 
arithmetic. If x2 is now being used, we need to perform coefficient transformation to ensure the 
results are the same. Below we illustrate this idea with an example using degree-1 polynomial. 
Transformation for polynomials of higher degree can be obtained using the same principle. If 
we group the bits representing xo and x1 as x0..1, and the number of bits required to represent 
this group is Bx0..1 . For a given degree-1 polynomial segment, x2 is given by 
x2 = (x — xo..1) x 28x0..1 	 (6.8) 
6.5. Bit-width optimization 	 129 
This equation can be rearranged and gives 
x2 
X -= 	 ± X0..1- 2Bx0..1 
(6.9) 
We can present a degree-1 polynomial by the equation 
y = Cix ± Co 	 (6.10) 
and transform this equation into 
Cl  y =-- 2Bxo..1 	x 2 + C1xo..1  + Co. (6.11) 
By examining the first and zeroth order terms, the newly transformed polynomial coefficients Cl 
and Co are now  2 xo B 	and C1xo..1 + Co, respectively. This coefficient transformation technique c' ..1  
is particularly useful for generating the highly non-linear Gaussian inverse CDF. 
6.5.3 Precision analysis 
There are three main sources of errors when evaluating functions in digital arithmetic: (1) the 
inherent error coo due to approximating the function with polynomials, (2) quantization error 
E Q due to finite precision effects incurred when evaluating the polynomials, and (3) the error 
of the final, output rounding step, which can cause a maximum error of 1/2 ulp. In the worst 
case, Ecc and EQ will contribute additively, so to achieve faithful rounding, these two added 
error must be less than 1/2 ulp. In the proposed design generator, we allocate a maximum of 
1/4 ulp for Ecc and the rest for EQ which proved to be a good balance between the two error 
sources. 
Quantization is usually performed in two modes: truncation which can cause a maximum error 
of 2' (1 ulp), and round-to-nearest which can cause a maximum error of 2-FB-1 (1/2 ulp). 
Round-to-nearest shall be performed at the output signal y to achieve faithful rounding which 
is required in this work, but either rounding mode can be used for the internal signals. Thus 
130 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
we perform truncation for the other internal signals. Since truncation results in better delay 
and area characteristics over round-to-nearest. Note that the quantization of the coefficients 
ed..° is performed with round-to-nearest at compile-time, while truncation is performed for the 
intermediate signals dd,2_2..o in the datapath at run-time (Fig. 6.11). 
Suppose "X is the quantised version and ex is the error due to the quantization of a signal x. We 
have, 
= x + ei-. 	 (6.12) 
For addition/subtraction operations z = x+y, the error E2 at the output z is given by (assuming 
truncation) 
= X±y=x±y+ci d-E- 2-FB2 	
(6.13) 
E2 
where 2-FBi is the quantization error at z. For multiplication, we get 
"y 
xy + acg 1)€. 	2-FBi 
	 (6.14) 
Ez 
	
YE" EEp- + 2-F14. 
It is clear that €2 is at its maximum when x and y are at their maximum absolute values. 
The addition and multiplication error expressions in Eqn. (6.13) and Eqn. (6.14) are applied 
to every operator, and an inequality to achieve faithful rounding is generated for the output 
signal. For the Horner's rule polynomial evaluation circuit in Fig. 6.11 stage 3, we assume the 
input x2 to the polynomial evaluation has no error, i.e. Ex2 = 0. In this section, we describe 
how do we quantise this input x2 into Z2 such that we do not need the full bit-width for the 
multiply-and-add for the polynomial evaluation. The main reason for this input analysis is 
that the output precision only requires 11 fractional bits. It is obvious that the design does not 
demand the whole range from the input bits, and gives rooms for tweaking the signals bit-width 
under a given error requirement. Precision analysis using a degree-1 polynomial y = Cl x x + Co 
6.5. Bit-width optimization 	 131 
which can be divided into two steps with quantization at each step. 
	
DO = Cl X Y2 	 (6.15) 
y = Do  + Co- 	 (6.16) 
The error ED0  at the signal Do is given by 
61)0 = C1Ex-2 Z2EC1 	Ef2 2-FBD0 
	 (6.17) 
Since we quantise the original x2 into ±-2 using truncation, the quantization error cE2 is 2-FBx-2. 
The error cy at the output y is given by 
Ey 	= ED0 EC° + 2
-F13, 	-1 
fc.• 
	 (6.18) 
For faithful rounding, the maximum output error, max(cy ), needs to be less than or equal to 1 
ulp, i.e. 
2-FBy > max(Ey ). 	 (6.19) 
Using Eqn. (6.17) ti  (6.19) gives 
2-FBy C1 x 2-Fsx-2 2-FBD0 2-FBc0-1 
2-FBy 
+2-FBc1-1 x (X2 + 2-FBI2) 
+2-FBy-1 + coo.  
(C1 x 2-FBx-2 2-FBD. 2-FBc0-i 
+2-Fscr-1 x 	2-FB,-2 ) 
+coo ) x 2. 
(6.20) 
132 	 Chapter 6. Arbitrary Random Number Hardware Design Generator  
Similarly for degree-2 polynomial, we get 
FBy 	(2-F Bco-1 2-FBc1-1 x (Z2 + 2-F13x-2 ) 
+2_Fsc2  x 	2-FB--2 )2 + 2-FBno 
±2-FBD1 x 	2-FB.-2) 
+2-FBD2  X (Z2 + 2-FB'-2 )2  
+C2 x 2-FBs-2 x (±-2 + 2-FB-2) 
+D1 x 2-FB32 + coo ) x 2. 
(6.21) 
Eqn. (6.20) and (6.21) are optimization problems, where the goal is to find the minimal FBs 
that minimise a certain cost function while satisfying the inequality. In this inequality, the left 
hand shows the maximum error that we can sustain at the output y. For the cost function, we 
supply an area model of the circuit according to different signal bit-widths. There are two steps 
for the determination of the fractional bits of all signals. The first step is uniform fractional bit 
width which is the same for all signals. For the non-uniform fractional bit width, the Adaptive 
Simulated Annealing (ASA) [Ing04] is used to solve this optimization problem. The ASA allows 
a faster convergence time than traditional simulated annealing. Inequalities such as Eqn. (6.21) 
are supplied as the constraint function and are the cost of the arithmetic operators supplied as 
the cost function to ASA. 
Table 6.2 shows the signal bit-widths found by ASA when evaluating f i accurate to 11 fractional 
bits with degree-2 splines. Comparisons for original and transformed coefficients for two input 
bit-widths are provided. With these signal bit-widths configurations, the output is guaranteed 
to have an absolute maximum error of 2-11 for all inputs. In this table, since there are leading 
zeros in the fractional part of a signal, we allow negative IBs in the bit-width optimization. 
The stored fixed-point coefficients are converted to integer (with an implicit binary point) for 
the hardware implementation. For the 12 input bits design, the largest reduction is from 31 to 
9 for C2 signal. We can see that the table is almost reduced by 50%. For larger input bits, this 
reduction is even more significant. The coefficient table (ROM1) size of "original" is 9120 bits, 
6.5. Bit-width optimization 	 133 
Table 6.2: Signal bit-widths found by Adaptive Simulated Annealing (ASA). 
(*when evaluating function J.]. accurate to 11 fractional bits using 24 input bits with degree-2 splines 
using variable Bx1 and without quantization at x2 . Comparisons for original and transformed coeffi-
cients are shown in [total bits (integer bits, fractional bits)].) 
input [bits] 12 24 
signal original transformed original transformed 
segments 40 40 80 80 
ROM1 [bits] 2960 1600 9120 3520 
Bc2 31 (18,13) 9 (-4,13) 57 (43, 14) 11 (-3,14) 
Bc, 25 (10, 15) 14 (-1, 15) 38 (23, 15) 14 (-1, 15) 
Bc0 18 (3, 15) 17 (3, 14) 19 (4, 15) 19 (4, 15) 
BD2 27 (9,18) 14 (-4,18) 40 (21, 19) 15 (-3,19) 
Bpi  25 (10,15) 14 (-1, 15) 37 (22, 15) 14 (-1, 15) 
Bp° 18 (2, 16) 18 (-1,19) 20 (2, 18) 18 (-1, 19) 
while its size for "transformed" is 3520 bits for the 24 input bits design. 
Table 6.3 shows the signal bit-widths found by ASA when evaluating fi accurate to 11 fractional 
bits with degree-1 and degree-2 splines. This table also shows the number of segments for storing 
the coefficients and the exact coefficient table size. We provide both the segment number counts 
of the design using variable Br , and constant Br ,. Since the number of inner segments of the 
constant Br, design is fixed which is the maximum number of the inner segment in the variable 
Br, design, the total number of segments is thus increased. However, with this additional ROM 
penalty, we can simplify the design by removing the second large barrel shifter, shorten the 
critical path, and reduce the pipeline stages. Further discussion is described in the hardware 
implementation section. 
With the quantization at the input x2, we can obtain a significant reduction in the signal 
bit-width of x2. For example, Bx2 is reduced from 48 bits to 20 bits for degree-2 splines approx-
imation and down to 22 bits for degree-1 design. The penalty is the increase in the coefficient 
bit-widths and also the internal signals. We can see that ROM1 has been increased by 12.5% 
for both variable and constant Bx , designs using degree-2 splines approximation, and by 10.7% 
for the designs using degree-1 splines approximation. However, this modification dramatically 
134 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Table 6.3: Segment and bit-width information without quantised x. 
(*Number of segments, table size and bit-widths for evaluating the augmented Gaussian inverse CDF 
function fi accurate to 11 fractional bits with degree-1 and degree-2 splines and using constant B,1 
and variable Bs1 . The bit-widths in the brackets indicate the IBs and the FBs.) 
design degree-1 degree-2 
.13,1 	[bits] variable constant variable constant 
segments 620 832 144 208 
ROM1 [bits] 17980 23296 6480 8320 
Bc, 11 (-3,14) 8 (-5,13) 
Bc1  10 (-3,13) 10 (-3,13) 14 (-1,15) 13 (-1,14) 
Bco 19 ( 5,14) 18 ( 5,13) 20 ( 5,15) 19 ( 5,14) 
BD2  15 (-3,18) 16 (-5,21) 
BD1 14 (-1,15) 13 (-1,14) 
BD° 19 (-3,22) 14 (-3, 17) 19 (-1,20) 20 (-1,21) 
reduce the size of the multiply-and-add datapath and also the number of intermediate pipeline 
registers which will be explained in details in the next section. 
6.6 Hardware implementation results 
This section discusses how we map the inversion-based RNG into FPGA technology and de-
termine the correctness of the generated random number samples. Comparisons with previous 
work and performance study of various configuration of the random number generator are also 
included. 
We inspect the exact resource allocation of the GRNG design using degree-2 approximation. 
The result in Table 6.5 shows that the P2S address decoding circuit of the GRNG consumes the 
largest portion of the hardware resources. In this work, the two Tausworthe URNGs consume 
131 slices. However for the exact size of the generated URNG in stage 1, it is actually directly 
proportional to the input bit-width. The two major components in the address decoding circuit 
are the leading zero detectors (LZDs) and the barrel shifter. We adopt the scalable method 
proposed by Oklobdzija [0k194] for the LZD and leading one detector (LOD), and the logic 
6.6. Hardware implementation results 	 135 
Table 6.4: Segment and bit-width information with quantised x. 
(*Number of segments, table size and bit-widths for evaluating the augmented Gaussian inverse CDF 
function fi accurate to 11 fractional bits with degree-1 and degree-2 splines and using constant Bx1 
and variable 13,1  and with quantised input x. The bit-widths in the brackets indicate the IBs and 
the FBs.) 
design degree-1 degree-2 
Bx1 [bits] variable constant variable constant 
segments 620 832 144 208 
ROM1 [bits] 19220 25792 6912 9360 
Bc2 12 (-3,15) 9 (-5,14) 
Bc, 11 (-3,14) 10 (-3.13) 15 (-1,16) 14 (-1,15) 
Bco 20 ( 5,15) 21 ( 5,16) 21 ( 5,16) 21 ( 5,16) 
BD2 - 16 (-3,19) 14 (-5,19) 
BD, - - 15 (-1,16) 14 (-1,15) 
BD,, 19 (-3,22) 19 (-3,22) 22 (-1,23) 22 (-1,23) 
B, 22 ( 0,22) 23 ( 0,23) 23 ( 0,23) 20 ( 0,20) 
Table 6.5: Hardware resource usage of the proposed GRNG. 
(*Gaussian random number generator using the hierarchical segmentation tool for degree-1 approxi- 
mations to function fi using variable and constant 13,1 bits.) 
component  slices  block-RAMs  DSP-blocks 
variable Bx1 - degree-1 
Tausworthe URNG 
address decoding 
polynomial evaluation 
141 
268 
134 
0 
0 
2 
0 
0 
2 
total 543 2 2 
constant .13,1 - degree-1 
Tausworthe URNG 141 0 0 
address decoding 221 0 0 
polynomial evaluation 125 2 2 
total 487 2 2 
barrel shifter proposed by Pillmeier et al. [PSI02] for the P2S decoding. 
Statistical tests, such as the X2 test or the Anderson-Darling test [Knu97] are not necessary, 
since (a) we know that the the inversion method itself is correct and (b) we generate the samples 
136 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Gaussian (Normal) distribution 
0.45 
0.4 
0.35 
0.3 
0.25 LL. 
0.2 
0.15 
0.1 
0.05 
0 
—6 	 0 	2 	4 	6 
x 
Exponential distribution 
0.8 
0.4 
0.2 
1 2 3 4 5 6 7 8 9 10 
x 
Log—normal distribution 
	
1 	 
0.9 - 
0.8 - 
0.7 
0.6- 
t•c's 0.5 - 
n. 0.4 - 
0.3 - 
0.2 - 
0.1 
0 	 
 
8 9 10 
x 
Figure 6.14: PDFs of the generated random numbers. 
(*from the proposed architecture for a population of ten million samples for three 
distributions. The black solid line indicates the ideal PDF of each distribution.) 
6.6. Hardware implementation results 	 137 
Error plot in ulp of the Gaussian random number samples 
1 
0.8 
0.6 
0.4 
0.2 
1; 0 
Liti  -0.2 
-0.4 
-0.6 
-0.8 
-1 
0 
	
0.1 
	
0.2 
	
0.3 
	
0.4 
	
0.5 
	
0.6 
	
0.7 
	
0.8 
	
0.9 
x 
Figure 6.15: Error plot in ulp using 216 randomly selected samples. 
(*for degree-2 spline approximation to function f i accurate to 11 fractional bits and 5 integer 
bits with 52 input bits and the bit-widths from Table 6.3 incorporated. The black curve 
indicates the inherent approximation error coo, while the grey curve indicates the error with 
finite precision effects. Over 95% of the outputs are exactly rounded: the remaining 5% are 
faithfully rounded.) 
accurately within the 16 bits resolution. Fig. 6.14 shows the PDF of the generated random 
number samples for a population of ten million of the three distributions, while Fig. 6.16 shows 
the PDF between 7a and 8.2a for a population of one million. In both cases, the acquired 
random number samples closely follow the true PDF of the associated distribution. Moreover, 
the generated random number samples are found to be accurate to 1 ulp. As shown in Fig. 6.15, 
we randomly select 216 samples and test the GRNG design for degree-2 spline approximation. 
Results show that 95% of the samples are exactly rounded (i.e. accurate to 1/2 ulp). 
Fig. 6.17 shows the area, clock speed and latency variation according to the variation of increas-
ing pipeline stages. For each instance of the pipelined designs, one random sample is produced 
in each clock cycle. We insert the pipeline register into the design according to the post place-
and-route timing analysis, these additional registers breakdown the critical path, improve the 
clock speed, and also induce area penalty. We can see that after 24 pipeline stages, the design 
reaches a maximum of 371 MHz which is limited by the critical path between the output of the 
block-RAMs and the input of the DSP-blocks inside the Xilinx Virtex-4 FPGA device. We also 
include Fig. 6.18 to show the design complexity and also the pipelining registers distribution 
of the proposed design. 
Table 6.6 shows the comparison between the proposed architecture and the fastest GRNG in 
High a region of the Gaussian (Normal) distribution x 
72 	7.4 	7.6 	7.8 	8 	8 2 
x 
0.9 
0.8 
0.7 
0.6 
7'< 
0.5 
Q_ 
0.4 
0.3 
0.2 
0.1 
138 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Figure 6.16: PDF of the generated random numbers. 
(*from the proposed architecture for a population of one million samples between 7o- and 
8.2a. The black solid line indicates the ideal Gaussian (Normal) PDF.) 
literature [LVLL06]. The two designs both generate very high quality random numbers up to 
8.2a. The random number samples are 16 bits composed of 5 integer bits and 11 fractional 
bits. To make a fair comparison, the two designs have the same ulp accuracy guarantee. The 
proposed inversion based GRNG is mapped into FPGA device using Xilinx System Generator 
8.1. An overview of the datapath design using System Generator is shown in Fig. 6.19 and also 
the address decoding block is shown in Fig. 6.20. We heavily pipeline the GRNG designs to 
achieve the most efficient and the most compact result. This table shows that the proposed 
design using degree-1 splines approximation with constant .E3 1 has the largest throughput / slice 
ratio. Moreover, the proposed design has greatly reduced the hardware resource usage of block-
RAMs and DSP-blocks due the use of hierarchical segmentation scheme and the quantization 
of x2 value. 
We explore the trade-offs between using different types of hardware resources using the Xilinx 
Virtex-II device. For instance, a table can be implemented using block-RAM or distributed-
RAM using slices. Table 6.8 shows the GRNG implemented using different FPGA resources. 
We can see that the design using slices only is almost two times the number of slices and has 
significantly lower clock speed than the original GRNG design. Also, the area and speed penalty 
of using slices to implement tables instead of block-RAMs is especially high, from 500 MHz 
to around 100 MHz. Hence dedicated FPGA resources such as block-RAMs and DSP-blocks 
should be used for compact and efficient designs. 
-140 
-110 y 
0 
- 90 	5`1 
Area [slices] 
- * -Clock Speed [MHz] 
* Latency Ens] 
700 170 
„A- 
- 50 
	 20 
6 8 	10 	12 	14 	16 	18 	20 
Pipeline Stages 
7± 560 
2 
-o 
a) a) 
((ft 420 
0 0 
V; 280 - a) 0 
polynomial evaluation Z-9 address decoding Z-13 URNG 
Tausworthe 
Tausworthe Z-1  
6.6. Hardware implementation results 	 139 
16-bit Gaussian random number generator 
Figure 6.17: Area, clock speed and latency variation. 
(*with number of pipeline stages for 16-bit Gaussian random number generator with 52-bit 
input and 11-bit output precision using degree-2 Horner's rule with variable Bx,. 
Block-RAMs and DSP-blocks available on the Virtex-4 device are utilised.) 
r 
Figure 6.18: Pipeline registers distribution of the GRNG design. 
(*using degree 1 splines approximation with constant Bx,. SHL refers to the logic barrel left 
shifter, and Z-1  and the black rectangle both refer to one pipeline stage.) 
For the results showing in Fig. 6.21 to Fig. 6.26, we implement the proposed architecture 
using pure combinational circuit with variable Bx, and synthesize the generated VHDL using 
Synplicity Synplify Pro 8.4. Xilinx ISE 8.1.02i is used for place-and-route with maximum effort 
configuration. 
Fig. 6.21 and Fig. 6.22 show that for a given 24-bit input width using degree-2 splines, the 
area increases because the design requires more segments for holding the coefficients. When 
precision is higher or equal to 12 bits, f2 uses more area than h because of the larger coefficient 
table and also the wider multiple-and-add evaluation. Since the P2SLR address decoding circuit 
is more complex, h has the longest latency in Fig. 6.22, Fig. 6.24, and Fig. 6.26. 
140 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
 
3 
0 
  
  
   
Figure 6.19: Overview of the GRNG using System Generator. 
We can observe the trade-off between using smaller degree and larger coefficient table, against 
the more complex datapath for the polynomial evaluation in Fig. 6.23. For low precision 
requirement such as 2-4 to 2-12, the evaluation using degree-1 has a smaller area requirement 
O. 
En 
6.6. Hardware implementation results 	 141 
0 
Figure 6.20: Overview of the address decoding block of the GRNG. 
than higher degrees. However, from the same figure we observe that the degree-3 design provides 
a better area cost than the degree-1 and degree-2 designs starting at the precision of 2-20. For 
the GRNG in this work that requires 11 fractional bits, as shown in Table 6.6, the design 
16 14 8 	10 	12 
Precision [bits] 
6 
f 
142 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Area comparisons for different output precision requirements 
cc 
2000 
1800 
1600 
4 
	
6 
	
8 	10 	12 
	
14 
	
16 
Precision [bits] 
Figure 6.21: Area comparisons for variable error requirements. 
(*and the input is fixed at 24 bits.) 
Latency comparisons for different output precision requirements 
Figure 6.22: Latency comparisons for variable error requirements. 
(*and the input is fixed at 24 bits.) 
using degree-1 is actually smaller than the degree-2 design due to its shorter multiply-and-add 
datapath. Table 6.7 also shows the comparisons of the proposed approach and other existing 
related work. 
We also study the effect of using more bits for the input while fixing the output precision in 
Fig. 6.25 and Fig. 6.26. These two figures show the comparison between three distributions. 
The general trend for all three distributions is increasing because we need a larger URNG in the 
input and thus more bits for the address decoding circuit. The bit-width optimization is highly 
related to the actual data storing in the ROM and the number of bits required to represent 
  
—*-- Degree-1 
- * - Degree-2 
* Degree-3 
.* ......... * 
_ 
100 
--*-- Degree-1 
- * - Degree-2 
Degree-3 90 
80 
7 Cbr 
a) 
io 
60 
50 
6 	8 	10 	12 	14 	16 	18 	20 	22 	24 
Precision [bits] 
40
4 
6.7. Summary 	 143 
Area comparisons for different polynomial degrees 
4 	6 	8 
	
10 	12 	14 	16 	18 	20 	22 	24 
Precision [bits] 
Figure 6.23: Area comparisons using variable degree-d. 
(*and variable error requirements for functions fi.) 
Latency comparisons for different polynomial degrees 
Figure 6.24: Latency comparisons using variable degree-d. 
(*and variable error requirements for functions fi.) 
these coefficients. For example, it is the main reason of fi using more slices than f2 when we 
using 20 or more input bits. 
6.7 Summary 
We have presented an automated hardware-based random number generation using the inver-
sion method. This proposed method is capable of generating random numbers from arbitrary 
distributions. Three demonstrated functions are used to generate large number of samples using 
4000 
3500 
3000 
2500 
2000 
1500 
1000 
500 
Area comparisons for different input bit—widths 
f 
- * - f 2 
f3 
a) 
Ti) 
ro 
900 
800 
700 
600 
500 
14 	16 	18 	20 
	
22 
	
24 
Input Bit—Width [bits] 
1200 	 
1100 - 
1000 - 
300 	 
12 
24 22 
75 
70 
60 
55 
512 
	14 	16 	18 	20 
Input Bit—Width [bits] 
144 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Figure 6.25: Area comparisons for variable input bit-widths. 
(*and the output error requirement is fixed at 11 fractional bits.) 
Latency comparisons for different input bit—widths 
Figure 6.26: Latency comparisons for variable input bit-widths. 
(*and the output error requirement is fixed at 11 fractional bits.) 
fixed-point arithmetic from the Gaussian (Normal), Exponential and Log-normal distributions. 
The MiniBit framework [LAMLO5a] incorporated in this approach facilitates the accurate error 
analysis and minimizing the signal bit-width in order to produce compact hardware implemen-
tations. The random number sample generated every clock is analytically guaranteed to be 
accurate to one unit in the last place by using the proposed method. For the Gaussian (Nor-
mal) random number generator, the proposed scalable design generator can accurately model 
the true Gaussian PDF up to very high a values. In this work, the generated random number 
samples can reach up to 8.2a by using 53 input bits and are accurate up to 11 fractional bits. 
6.7. Summary 	 145 
Suggestions for future work can be found in Section 7.2.4. 
Table 6.6: Comparisons of different hardware GRNG. 
(*Gaussian random number generators implemented on a Xilinx Virtex-II XC2V4000-6 (V2) FPGA and a Xilinx Virtex-4 XC4VLX100-12 
(V4) FPGA.) 
design Lee [LVLL06] proposed [variable Bx1 ] proposed [constant Bx1 ] 
method Box-Muller Inversion [degree-1] Inversion [degree-2] Inversion [degree-1] Inversion [degree-2] 
device V2 	V4 V2 	V4 V2 	V4 V2 	V4 V2 	V4 
slices 
block-RAMs 
DSP-blocks 
1528 	1452 
3 	3 
12 	12 
548 	543 
2 	2 
2 	2 
585 	579 
1 	1 
4 	4 
502 	487 
2 	2 
2 	2 
542 	523 
1 	1 
4 	4 
clock speed [MHz] 
samples / clock 
million samples / sec 
throughput / slice 
233 	375 
2 	2 
466 	750 
0.305 	0.518 
232 	371 
1 	1 
232 	371 
0.423 	0.683 
231 	370 
1 	1 
231 	370 
0.395 	0.639 
232 	371 
1 	1 
232 	371 
0.462 	0.762 
232 	367 
1 	1 
232 	367 
0.428 	0.702 C
ha
pt
er
  6
.  
A
rb
itr
ar
y  
R
an
do
m
  N
um
be
r  
H
ar
dw
ar
e  
D
es
ig
n  
G
en
er
at
or
  
if 
ue
tu
u r
nS
  
'2
'9
  
Table 6.7: Comparisons of different hardware Gaussian random number generators. 
(*implemented on a Xilinx Virtex-II XC2V4000-6 FPGA. "CLT" refers to the central limit theorem.) 
Xilinx [Xi106d] Zhang [ZLL+05] Lee [LLVC04] Lee [LAMLO5a] Lee [LVLL06] proposed - degree-1 
Box-Muller & CLT Ziggurat Box-Muller & CLT Wallace Box-Muller Inversion 
653 891 2514 770 1528 502 
4 4 2 6 3 2 
8 2 8 4 12 2 
168 168 133 155 466 232 
0.257 0.189 0.053 0.201 0.305 0.462 
method 
slices 
block-RAMs 
MULT18X18s 
million samples / s 
throughput / slice 
148 	 Chapter 6. Arbitrary Random Number Hardware Design Generator 
Table 6.8: Hardware implementation results of the GRNG. 
(*using degree-1 spline approximation with constant Bx , using different types of FPGA re-
sources (block-RAM (BRAM) and DSP-blocks (DSPs)) on a Xilinx Virtex-4 XC4VLX100-12 
FPGA.) 
FPGA resources used 
slices + BRAMs + DSPs 
slices + BR,AMs 
slices + DSPs 
slices 
slices BRAMs DSPs speed [MHz] 
487 2 2 371 
611 2 370 
1509 - 2 222 
1628 - 220 
Chapter 7 
Conclusions 
7.1 	Summary of thesis achievements 
This thesis presents a design generator library for producing customisable hardware designs. 
The key components of each chapter of this thesis are summarised as follows: 
Chapter 3 describes a customisable pipelined and parallelised elliptic curve cryptography (ECC) 
design for various field operations. This design supports various parameters, such as the key 
size and the degree of parallelism, to enable trade-off between level of security, design size and 
speed. A design generator, ECCGen, has been developed to facilitate fast implementation. 
Performance analysis shows that our design at 35 MHz is the fastest among existing hardware 
implementations (as of September 2005). It can compute a point multiplication up to 1150 
times faster than a software ECC application on a Xeon 2.66GHz computer. We also present 
a configurable System-on-Chip architecture for Elliptic Curve Cryptosystem and an embedded 
secure web system using reconfigurable hardware. A four-level partitioning scheme is proposed 
for exploring the area and speed trade-off of pure software implementation on embedded pro-
cessor and the hybrid hardware/software approach. 
Chapter 4 presents Prime Gen, a design generator for producing a scalable architecture for 
prime number validation which targets reconfigurable hardware. Two significant algorithms 
149 
150 	 Chapter 7. Conclusions 
have been discussed: the Rabin-Miller algorithm and the Montgomery algorithm. The Rabin-
Miller strong pseudo-prime test has been successfully mapped into hardware. Several modular 
multipliers have been developed as libraries in the design generator, PrimeGen. In particular, 
the Montgomery modular multiplication is the core of various modular operations used in 
the primality test. The scalability and parallelism of this design have been explored for very 
large prime numbers, up to 32k bits. We systematically implement the design for different bit 
lengths on reconfigurable devices using both scalable and non-scalable multipliers, and study 
the trade-offs among different designs. 
Chapter 5 presents IMGen, a customisable library for floating-point function evaluation based 
on the integer instruction set used in embedded processors. We have developed an approach 
for automating code generation and optimisation within this framework, to customize preci-
sion in order to obtain the best trade-off in embedded code space, performance and accuracy. 
We evaluate this approach using the embedded PowerPC processor on a Xilinx Virtex-II Pro 
XC2VP30 FPGA. By trading off accuracy, we achieve over 80 times speedup compared to the 
single-precision reference mathematical library with a static error analysis below 10-5, while a 
64-bit polynomial method achieves 30 times speedup when compared to IEEE single-precision. 
One of the current limitations of the framework is that the quantization error of the 64-bit 
method is higher than IEEE double-precision. We can use other function evaluation techniques 
such as piecewise polynomial approximation and multiple words in the datapath to achieve a 
higher accuracy. 
Chapter 6 describes RNGGen, an automated hardware-based random number generation using 
the inversion method. This proposed method is capable of generating random numbers from 
arbitrary distributions. Three functions are used to demonstrate the generation of a large 
number of samples using fixed-point arithmetic from the Gaussian (Normal), Exponential and 
Log-normal distributions. The MiniBit framework [LAMLO5a] incorporated in this approach 
facilitates the accurate error analysis and minimizing the signal bit-width in order to produce 
compact hardware implementations. The random number sample generated every clock is 
analytically guaranteed to be accurate to one unit in the last place by using the proposed 
method. For the Gaussian (Normal) random number generator, the proposed scalable design 
7.2. Future work 	 151 
generator can accurately model the true Gaussian PDF up to very high a values. In this work, 
the generated random number samples can reach up to 8.26 by using 53 input bits and are 
accurate up to 11 fractional bits. 
Table 7.1 shows the customisable parameters and the design trade-offs that are common to 
different design generators. For example, one of the goals for ECCGen and PrimeGen is to 
develop a design as fast as possible with a given area constraint. In ECCGen, we enable this 
by using a higher degree of parallelism, which is the same principle applying to the increased 
number of processing elements in Prime Gen. For IMGen and RNGGen, since the focus is more 
on area reduction (code-space reduction), precision analysis is involved with appropriate error 
analysis. 
design generators trade-off customisation 
ECCGen (Chapter 3) speed vs. area size 	of field 	multiplier, 
degree of parallelism 
PrimeGen (Chapter 4) speed vs. area size of modular multi-
plier, number of process-
ing element 
IMGen (Chapter 5) number of cycles vs. code space precision reqirement 
RNGGen (Chapter 6) accuracy vs. area precision reqirement 
Table 7.1: Design generators comparison. 
7.2 Future work 
The following sections describe possible future extensions to the work presented in this thesis. 
Some research extensions that can be applied to all design generators include: the exploration 
of power/energy optimisations, trade-offs associated with hardware/software codesign, and the 
investigation of run-time or partial reconfiguration opportunities. The latest 65nm Virtex-5 
FPGA has used 6-input look-up tables (LUTs), which means that each CLB is able to support 
more complicated logic. When building a general design generator, an architectural study of 
the latest FPGA is attractive and exciting. For instance, which coarse-grained component will 
be the next to be embedded into the 45nm FPGAs? 
152  
7.2.1 ECCGen 
Chapter 7. Conclusions 
 
Possible future developments for the ECCGen include: 
• the current system is built based on ONB representation, the next step can improve the 
system for both Polynomial Basis (PB) and dual field basis representation. 
• ECC system has been extensively use in smartcard device as a program running on the 
32-bit CPU or with the aid of cryptographic coprocessor. Our work can be used to rapid 
prototype an ECC smartcard system. 
• an efficient Hyper-Elliptic Curve Cryptosystem (HECC) is expected to achieve the same 
security level with this ECC system with a smaller field size. We expect a further extension 
work on HECC system. 
• the function of the ECC system can be extended such that new encryption protocol can 
be implemented over the presented system. 
• the speed of ECC systems can be improved by analysing the critical path. 
• area optimisation for the generated ECC designs is an important next step. 
• run-time reconfiguration (RTR) techniques can be applied to the ECC system such that 
different protocols and field size can be modified on chip. 
7.2.2 PrimeGen 
Possible future developments for the PrimeGen include: 
• the scalable architecture presented is architecture independent, which means we can port 
the whole design into the latest reconfigurable hardware such as Virtex-5 from Xilinx and 
Stratix-III from Altera. Moreover, we could make use of the embedded microprocessor in 
some FPGAs such as the PowerPC for improving the whole system performance. 
7.2. Future work 	 153 
• the extension of this scalable architecture across multiple reconfigurable hardware device 
enables the division of the original big problem such as solving the Mersenne Prime 
number. The inter-communication between different devices would be another interesting 
problem. 
• MicroBlaze provides a Fast Simplex Link (FSL) channel for including customisable IP 
blocks such as our prime test block. In this case, the microprocessor and the IP block 
operate at two different clock speeds while there is no degradation to the MicroBlaze 
speed. The FSL channel enables the data to be passed between hardware block and 
processor using C-macros, with the processor able to execute other instruction while 
waiting for the result. 
• RTR techniques can be applied in the validation. The implemented design can run-time 
select different algorithms depending on the actual input number under test. Extended 
libraries can be easily developed. 
7.2.3 IMGen 
Possible future work for the IMGen includes: 
• code generation for other instruction processors and comparison with floating-point co-
processors such as the Xilinx floating-point coprocessor using the new Auxiliary Processor 
Unit that integrates hardware accelerators and co-processors with the PowerPC processor. 
• use better range reduction techniques targeting software implementations [IT03, Mu197]. 
• use multi-precision techniques to provide very long precisions. 
• reconfigure a soft processor using FPGA technology at run-time to achieve the best trade-
off in speed, area and precision for specific applications. 
154  
7.2.4 RNGGen 
Chapter 7. Conclusions 
 
Possible future work for the RNGGen includes: 
• extend the use of the bit-width optimisation and segmentation method in other applica-
tions. 
• explore the use of the generated RNG in various applications. 
• explore the use of other customisation parameters such as the approximation methods.. 
This section presents many possible research extensions. The most challenging and exciting 
work is probably the exploration of which coarse-grained component is going to be embedded 
into reconfigurable fabric. For instance, if customised random number generator core and 
function evaluation core are going to be located next to a block-RAM or to a block multiplier, 
perhaps another fundamental question for the FPGA vendors will be "what kind of customisable 
parameters are used for these dedicated cores?" 
7.3 The road ahead 
The above future work shows possible research directions for further improving customisation 
and performance of arithmetic designs. Implementing security protocols in embedded systems 
faces the challenges of design complexity, limited resources and required security level. One of 
the speculated potential impacts for ECCGen and Prime Gen is the integration of RSA designs 
for embedded systems and the use of customisable Montgomery multiplier. For instance, a 
unified RSA plus ECC key token (SecurlD) enables an authentication to a user to a secure 
resource. 
Function evaluation is essential to many computing applications. Polynomial approximation 
and interpolation are two common methods for function evaluation. Interpolation involves the 
evaluation of a function from certain known values of the function. Both methods have their own 
7.3. The road ahead 	 155 
advantages and disadvantages. For example, interpolation involves the run-time computation 
of the polynomial coefficients [EL04], thus it imposes high computational burdens. For a 
memory-intensive embedded environment, interpolation may use fewer memory resources than 
polynomial approximation. Our research provides the foundation for automatically determining 
which method to use, given user-defined requirements. 
More power-efficient designs are going to dominate the next generation of mobile computing 
devices. The design generators in our current research have not covered the power/energy 
versus speed/area issues. It is a clear direction for further study involving our design tools. For 
instance, designers may want to examine the total power consumption per MHz with various 
precisions of designs using various approximation and interpolation methods. Questions like 
"which customisation scheme should be used for producing an optimal design that has the best 
trade-offs in area, delay, power and precision?" are exciting and useful to answer. 
Bibliography 
[ACS05] 	A. Alimohammad, B.F. Cockburn, and C. Schlegel. An iterative hardware gaus-
sian noise generator. In Proc. IEEE Pacific Rim Conf. on Communications, Com-
puters and signal Processing, pages 649-652, 2005. 
[AKS02] 	M. Agrawal, N. Kayal, and N. Saxena. Primes in P. ITT Kanpur, August 2002. 
[ALC+02] A. Abdul Gaffar, W. Luk, P.Y.K. Cheung, N. Shirazi, and J. Hwang. Automating 
Customisation of Floating-Point Designs. In Proceedings of International Confer-
ence Field-Programmable Logic and Applications(FPL), pages 523-533, 2002. 
[A1t02] Excalibur device overview data sheet, Altera Inc., 2002. 
http://www.altera.com/literature/ds/ds_arm.pdf.  
[AMB02] 	AMBA specification overview, 2002. http://www.arm.com/.  
[AMV89] 	G.B. Agnew, R.C. Mullin, and S.A. Vanstone. A fast elliptic curve cryptosys-
tem. In Advances in Cryptology: EUROCRYPT'89, LNCS 434, pages 706-708. 
Springer, 1989. 
[AMV93] 	G.B. Agnew, R.C. Mullin, and S.A. Vanstone. An implementation of elliptic 
curve cryptosystem over F2155. Journal on Selected Areas in Communications, 
11(5):804-813, 1993. 
[ANS99] 	ANSI X9.62, Public Key Cryptography for the Financial Services Industry: The 
Elliptic Curve Digital Signature Algorithm (ECDSA), 1999. 
156 
BIBLIOGRAPHY 	 157 
[APU05] Accelerated System Performance with APU-Enhanced 
Processing, 	Xcell 	Journal, 	Xilinx 	Inc., 	2005. 
http://www.xilinx.com/publications/xcellonline/xcell_52/xc_pdf/xc_v4acu52.pdf.  
[Arn95] 	F. Arnault. Rabin-Miller primality test: Composite numbers which pass it. Math-
ematics of Computation, 64:355-361, 1995. 
EAS031 	A. Akkas and M.J. Schulte. A quadruple precision and dual double precision 
floating-point multipler. In Proceedings of IEEE Symposium on Digital System 
Design, pages 76-81, 2003. 
[BDG+02] M. Bednara, M. Daldrup, J. Gathen, J. Shokrollahi, and J. Teich. Reconfigurable 
implementation of elliptic curve crypto algorithms. In Reconfigurable Architectures 
Workshop, 2002. 
[BDG03] 	E. Boutillon, J.L. Danger, and A. Gazel. Design of high speed AWGN com-
munication channel emulator. Analog Integrated Circuits and Signal Processing, 
34(2):133-142, 2003. 
[BDK+05] N. Brisebarre, D. Defour, P. Kornerup, J. Muller, and N. Revol. A new range-
reduction algorithm. IEEE Transactions on Computers, 54(3):331-339, 2005. 
[BD11402] 	L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. IEEE 
Computer, 35(1):70-78, 2002. 
[BDT+02] M. Bednara, M. Daldrup, J. Teich, J. Gathen, and J. Shokrollahi. Trade-off analy-
sis of FPGA based elliptic curve cryptography. In IEEE International Symposium 
on Circuits and Systems, pages 797-800, 2002. 
[Bec02] 	J. Becker. Configurable systems-on-chip (CSoC). In Integrated Circuits and Sys-
tems Design, pages 379-384, 2002. 
[BFS83] 	P. Bratley, B.L. Fox, and E.L. Schrage. A Guide to Simulation. Springer-Verlag, 
1983. 
158 	 BIBLIOGRAPHY 
[BGM97] 	A. Brace, D. Gqtarek, and M. Musiela. The market model of interest rate dynam-
ics. Mathematical Finance, 7(2):127-155, 1997. 
[BJM+05] 	D. Bertozzi, A. Jalabert, Srinivasan Murali, R. Tamhankar, S. Stergiou, L. Benini, 
and G. De Micheli. NoC synthesis flow for customized domain specific multipro-
cessor systems-on-chip. IEEE Transactions on Parallel and Distributed Systems, 
16(2):113-129, 2005. 
[bM58] 	G.E.P. box and M.E. Muller. A note on the generation of random normal deviates. 
Annals of Math. Stats., 29(2):610-611, 1958. 
[BoPV03] L. Batina, S.B Ors, B. Preneel, and J. Vandewalle. Hardware architectures for 
public key cryptography. Integration, the VLSI Journal, 34:1-64, 2003. 
[BP01] 	T. Blum and C. Paar. High radix montgomery modular exponentiation on recoil-
figurable hardware. IEEE Transactions on Computers, 50(7):759-764, 2001. 
[Bre76] 	R. Brent. Fast multiple-precision evaluation of elementary functions. Journal of 
the Association for Computing Machinery, 23(2):242-251, 1976. 
[Cam03] 	Cambridge 	Advanced 	Learner's 	Dictionary, 	2003. 
http://dictionary.cambridge.org/.  
[CC01] 	J.E. Carrillo and P. Chow. The effect of reconfigurable units in superscalar pro-
cessors. In Proceedings of International Symposium on Field Programmable Gate 
Arrays, pages 141-150, 2001. 
[Cel] 	Celoxica. http://www.celoxica.com/.  
[Ce103] 	Celoxica. Handel-C Language Reference Manual for DK2.0, Document RM-1003-
4.0, 2003. 
[Cer97] 	The Certicom ECC Challenge, 1997. http://www.certicom.com/.  
[Cer06] 	Certicom: The Basics of ECC, 2006. 
BIBLIOGRAPHY 	 159 
[CH02] 	K. Compton and S. Hauck. Reconfigurable computing: a survey of systems and 
software. ACM Computing Surveys, 34(2):171-210, 2002. 
[CHWOO] 	T.J. Callahan, J.R. Hauser, and J. Wawrzynek. The garp architecture and c 
compiler. IEEE Computer, 33(4):62-69, 2000. 
[cK95] 	C.K. Koc. RSA hardware implementation. Technical report, RSA Data Security, 
1995. 
[cKA98] 	C.K. Koc and K. Acar. Montgomery multiplication in G F (2k ). Designs, Codes 
and Cryptography, 14:57-69, 1998. 
[cKAK96] C.K. Koc, T. Acar, and B.S. Kaliski. Analyzing and comparing montgomery 
multiplication algorithms. IEEE Micro, 16(3):26-33, 1996. 
[CMB04] 	J. Chen, J. Moon, and K. Bazargan. Reconfigurable readback-signal generator 
based on a field-programmable gate array. IEEE Transactions on Magnetics, 
40(3):1744-1750, 2004. 
[cor06] 	CoreConnect bus architecture, 2006. http://www.ibm.com/chips/products/coreconnect/.  
[CW97] 	Jason Cong and Chang Wu. FPGA synthesis with retiming and pipelining for clock 
period minimization of sequential circuits. In Design Automation Conference, 
pages 644-649, 1997. 
[DC03] 	N. Dutt and K. Choi. Configurable processors for embedded computing. IEEE 
Computer, 36(1):120-123, 2003. 
[Dev86] 	L. Devroye. Non- Uniform Random Variate Generation. Springer-Verlag, 1986. 
[DH76] 	W. Diffie and M. Hellman. New directions in cryptography. IEEE Transactions 
on Inform. Theory, 22(6):644-654, 1976. 
[DM02] 	A. Daly and W. Marnane. Efficient architectures for implementing montgomery 
modular multiplication and RSA modular exponentiation on reconfigurable logic. 
160 	 BIBLIOGRAPHY 
In Proceedings of International Symposium on Field Programmable Gate Arrays, 
pages 40-49, 2002. 
[DR02] 	J. Daemen and V. Rijmen. The Design of Rijndael: AES 	 the Advanced En- 
cryption Standard. Springer-Verlag, 2002. 
[dT05] 	F. de Dinechin and A. Tisserand. Multipartite table methods. IEEE Transactions 
on Computers, 54(5):319-330, 2005. 
[EJM+02] M. Ernst, M. Jung, F. Madlener, S. Huss, and R. Bliimel. A reconfigurable system 
on chip implementation for elliptic curve cryptography over GF(2n). In Crypto-
graphic Hardware and Embedded Systems, LNCS 2523, pages 381-399. Springer, 
2002. 
[EKHH01] M. Ernst, S. Klupsch, 0. Hauck, and S.A. Huss. Rapid prototyping for hardware 
accelerated elliptic curve public-key cryptosystems. In Workshop on Rapid System 
Prototyping, pages 24-29, 2001. 
[EL04] 	M. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, 
2004. 
[E1g85] 	T. Elgamal. A public key cryptosystem and a signature scheme based on discrete 
logarithms. IEEE Transactions on Information Theory, 31(4):469-472, 1985. 
[ETMT00] M. Ercegovac, T. Tang, J. Muller, and A. Tisserand. Reciprocation, square root, 
inverse square root, and some elementary functions using small multipliers. IEEE 
Transactions on Computers, 49(7):628-637, 2000. 
[EW93] 	S.E. Eldridge and C.D. Walter. Hardware implementation of montgomery's mod-
ular multiplication algorithm. IEEE Transactions on Computers, 42(7):693-699, 
1993. 
[FD01] 	V. Fischer and M. Drutarovsky. Scalable RSA processor in reconfigurable hard-
ware - a SoC building block. In Proceedings of XVI Conference on Design of 
Circuits and Integrated Systems (DCIS), pages 327-332, 2001. 
BIBLIOGRAPHY 	 161 
[FLP+04] 	E. Fung, K. Leung, N. Parimi, M. Purnaprajna, and V.C. Gaudet. ASIC imple-
mentation of a high speed WGNG for communication channel emulation. In Pro-
ceedings of IEEE Workshop on Signal Processing Systems, pages 304-409, 2004. 
[GB91] 	S. Gal and B. Bachelis. An accurate elementary mathmetical library for the ieee 
floating point standard. ACM Transactions on Mathematical Software, 17(1):26-
45, 1991. 
[GC00] 	J. Goodman and A. Chandrakasan. An energy efficient reconfigurable public-key 
cryptography processor architecture. In Cryptographic Hardware and Embedded 
Systems LNCS 1965, pages 175-190, 2000. 
[GES02] 	N. Gura, H. Eberle, and S.C. Shantz. Generic implementations of elliptic curve 
cryptography using partial reduction. In ACM Computer and Communications 
Security, pages 108-116, 2002. 
[GG99] 	M. Gasteier and M. Glesner. Bus-based communication synthesis on system-level. 
ACM Transactions on Design Automation Electronic Systems, pages 1-11, 1999. 
[GIM] 	The greatest Internet Mersenne prime search. http://www.mersenne.org/.  
[GNVV04] Z. Guo, W. Najjar, F. Vahid, and K. Vissers. A quantitative analysis of the 
speedup factors of fpgas over processors. In ACM International Symposium on 
FPGAs, pages 162-170, 2004. 
[Gor98] 	D.M. Gordon. A survey of fast exponentiation methods. Journal Algorithms, 
27(1):129-146, 1998. 
[GSE+02] 	N. Gura, S.C. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, 
and D. Stebila. An end-to-end systems approach to elliptic curve cryptography. 
In Cryptographic Hardware and Embedded Systems, LNCS 2523, pages 349-365. 
Springer, 2002. 
162 	 BIBLIOGRAPHY 
[GSS99] 	L. Gao, S. Shrivastava, and G.E. Sobelman. Elliptic curve scalar multiplier design 
using FPGAs. In Cryptographic Hardware and Embedded Systems, LNCS 1717, 
pages 257-268. Springer, 1999. 
[Hen89] 	H. Henkel. Improved addition for the logarithmic number system. IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, 37(2):301-303, 1989. 
[HFHK04] S. Hauck, T.W. Fry, M.M. Hosler, and J.P. Kao. The chimaera reconfigurable 
functional unit. IEEE Transactions on Very Large Scale Integration (VLSI) Sys-
tems, 12(2):206-217, 2004. 
[HHM00] 	D. Hankerson, J. Hernandez, and A. Menezes. Software implementation of elliptic 
curve cryptography over binary fields. In Cryptographic Hardware and Embedded 
Systems, LNCS 1965, pages 1-24. Springer, 2000. 
[HKN+01] A. Hoffmann, T. Kogel, A. Nohl, G. Braun, 0. Schliebusch, 0. Wahlen, 
A. Wieferink, and H. Meyr. A novel methodology for the design of application-
specific instruction-set processors (ASIPs) using a machine description language. 
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 
20(11):1338-1354, 2001. 
[HKST99] J. Harrison, T. Kubaska, S. Story, and P.T.P. Tang. The computation of 
transendental functions on the is-64 architecture. Intel Technology Journal, Q4, 
pages 1-7, 1999. 
[HL03] 	W. HOrmann and J. Leydold. Continuous random variate generation by fast 
numerical inversion. ACM Transactions on Modeling Computer Simulation 
(TOMACS), 13(4):347-362, 2003. 
[HLB01] 	Y. Hida, X. Li, and D. Bailey. Algorithms for quad-double precision floating point 
arithmetic. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 
155-162, 2001. 
BIBLIOGRAPHY 	 163 
[HPS98] 	J. Hoffstein, J. Pipher, and J.H. Silverman. NTRU: A ring based public key 
cryptosystem. In Algorithmic Number Theory: Third International Symposium, 
pages 267-288, 1998. 
[HVO4] 	A. Hodjat and I. Verbauwhede. A 21.54 Gbit/s fully pipelined AES processor on 
FPGA. In Field-Programmable Custom Computing Machines, 2004. 
[HWOO] 	J.H. Hong and C.W. Wu. Radix-4 modular multiplication and exponentiation 
algorithms for the rsa public-key cryptosystem. In Asia and South Pacific Design 
Automation Conference, pages 565-570, 2000. 
[Ing04] 	L. Ingber. 	Adaptive Simulated Annealing (ASA) 25.15, 2004. 
http://www.ingber.com/#ASA.  
[IT98] 	T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in 
GF(2m) using normal bases. Information and Computation, 78(3):171-177, 1998. 
[IT03] 	C. Iordache and P. Tang. An overview of floating-point support and math li-
brary on the intel XScaleTM architecture. In Proceedings of IEEE Symposium on 
Computer Arithmetic, pages 122-128, 2003. 
[1E99] 	J.E. Stine and M.J. Schulte. The symmetric table addition method for accurate 
function approximation. Journal of VLSI Signal Processing, 32(2):167-177, 1999. 
[JMEH02] M. Jung, F. Madlener, E. Ernst, and S.A. Huss. A reconfigurable coprocessor 
for finite field multiplication in GF(2m). In IEEE Workshop on Heterogeneous 
reconfigurable SoC, 2002. 
[JPV00] 	M. Joye, P. Paillier, and S. Vaudenay. Efficient generation of prime numbers. In 
Cryptographic Hardware and Embedded Systems 2000, pages 340-354, 2000. 
[JW05] 	A.A. Jerraya and W. Wolf. Multiprocessor systems-on-chips. Morgan Kaufmann, 
2005. 
[KMV00] 	N. Koblitz, A. Menezes, and S. Vanstone. The state of elliptic curve cryptography. 
Designs, Codes and Cryptography, 19:173-193, 2000. 
164 	 BIBLIOGRAPHY 
[Knu97] 	D.E. Knuth. Seminumerical algorithms, volume 2 of The Art of Computer Pro-
gramming. Addison-Wesley, third edition, 1997. 
[Kob87] 	N. Koblitz. 	Elliptic curve cryptosystems. 	Mathematics of Computation, 
48(177):203-209, 1987. 
[Kob89a] 	N. Koblitz. A family of jacobians suitable for discrete log cryptosystems. In 
Advances in Cryptology - CRYPT0'88, pages 94-99. Springer, 1989. 
[Kob89b] 	N. Koblitz. Hyperelliptic cryptosystems. Journal of Cryptology, pages 139-150, 
1989. 
[Kor93] 	P. Kornerup. High-radix modular multiplication for cryptosystems. In Proceedings 
of the 11th IEEE Symposium on Computer Arithmetic, pages 277-283, 1993. 
[KPMF02] T. Kerins, E. Popovici, W. Marnane, and P. Fitzpatrick. Fully parameterizable 
elliptic curve cryptography processor over GF(2m). In Field-Programmable Logic 
and Applications, LNCS 2438, pages 750-759. Springer, 2002. 
[KT00] 	S. Knapp and D. Tavana. Field configurable system-on-chip device architecture. 
In Custom Integrated Circuits Conference, pages 155-158, 2000. 
[KZ90] 	I. Koren and 0. Zinaty. Evaluating of elementary functions in a numerical co-
processor based on rational approximations. IEEE Transactions on Computers, 
39(8):1030-1037, 1990. 
[LAML04] D. Lee, A. Abdul Gaffar, 0. Mencer, and W. Luk. Adaptive range reduction for 
hardware function evaluation. In Proceedings of IEEE International Conference 
on Field-Programmable Technology (FPT), pages 169-176, 2004. 
[LAMLO5a] D. Lee, A. Abdul Gaffar, 0. Mencer, and W. Luk. MiniBit: bit-width opti-
mization via affine arithmetic. In Proceedings of ACM/IEEE Design Automation 
Conference, pages 837-840, 2005. 
[LAML0513] D. Lee, A. Abdul Gaffar, 0. Mencer, and W. Luk. Optimizing hardware function 
evaluation. IEEE Transactions on Computers, 54(12):1520-1531, 2005. 
BIBLIOGRAPHY 	 165 
[LAS95] 	T. Lynch, A. Ahmed, and M. Schulte. The K5 transcendental functions. In 
Proceedings of IEEE Symposium on Computer Arithmetic, pages 163-170, 1995. 
[LD99] 	J. Lopez and R. Dahab. Fast multiplication on elliptic curves over GF(2m) without 
precomputation. In Cryptographic Hardware and Embedded Systems, LNCS 1717, 
pages 316-327. Springer, 1999. 
[LdSP02] 	C. Lu, A.L.M. dos Santos, and F.R. Pimentel. Implementation of fast RSA key 
generation on smart cards. In Proceedings of ACM symposium on Applied com-
puting, pages 214-220, 2002. 
[L'E96] 	P. L'Ecuyer. Maximally equidistributed combined Tausworthe generators. Math-
ematics of Computation, 65(213):203-213, 1996. 
[LH04] 	J. Lutz and A. Hasan. High performance FPGA based elliptic curve cryptographic 
co-processor. In IEEE Conference on Information Technology: Coding and Com-
puting, pages 486 -492, 2004. 
[LL02] 	P.H.W. Leong and K.H. Leung. A microcoded elliptic curve processor using FPGA 
technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 
10(5):550-559, 2002. 
[LLVC03] 	D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. Hierarchical segmentation 
schemes for function evaluation. In Proceedings of IEEE International Conference 
Field-Programmable Technology (FPT), pages 92-99, 2003. 
[LLVC04] 	D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. A Gaussian noise generator 
for hardware-based simulations. IEEE Transactions on Computers, 53(12):1523-
1534, 2004. 
[LN94] 	R. Lidl and H. Niederreiter. Introduction to Finite Fields and their Applications. 
Cambridge University Press, 1994. 
166 	 BIBLIOGRAPHY 
[LRD04] 	K. Lahiri, A. Raghunathan, and S. Dey. Design space exploration for optimiz-
ing on-chip communication architectures. IEEE Transactions on Computer-Aided 
Design of Integrated Circuits and Systems, 23(6):952-961, 2004. 
[LRLD04] K. Lahiri, A. Raghunathan, G. Lakshminarayana, and S. Dey. Design of high-
performance system-on-chips using communication architecture tuners. IEEE 
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 
23(5):620-636, 2004. 
[LRR04] 	J. Lawrence, S. Rusinkiewicz, and R. Ramamoorthi. Efficient BRDF impor-
tance sampling using a factored representation. ACM Transactions on Graphics, 
23(3):496-505, 2004. 
[LVO1] 	A.K. Lenstra and E.R. Verheul. Selecting cryptographic key sizes. Journal of 
Cryptology: the journal of the International Association for Cryptologic Research, 
14(4):255-293, 2001. 
[LV06] 	D. Lee and J.D Villasenor. A bit-width optimization methodology for polynomial-
based function evaluation. IEEE Transactions on Computers, 2006. accepted. 
[LVLL06] 	D. Lee, J.D. Villasenor, W. Luk, and P.H.W. Leong. A hardware Gaussian noise 
generator using the Box-Muller method and its error analysis. IEEE Transactions 
on Computers, 55(6), 2006. 
[Lys03a] 	P. Lysaght. Future design tools for platform FPGAs. In Proceedings of the 16th 
Symposium on Integrated Circuits and Systems Design, pages 275-280, 2003. 
[Lys0313] 	P. Lysaght. System-level design for FPGAs. In Proceedings of the 16th Symposium 
on Integrated Circuits and Systems Design, page 4, 2003. 
[Mar90] 	P. Markstein. Computation of elementary functions on the ibm risc system/6000 
processor. IBM Journal Res. Develop., 34(1):111-119, 1990. 
[Mas91] 	E.D. Mastrovito. VLSI Architectures for Computation in Galois Fields. PhD 
thesis, Linkoping University, 1991. 
BIBLIOGRAPHY 	 167 
[Mat] 	Rabin-Muller strong pseudo prime test. http://mathworld.wolfram.com/Rabin-
MillerStrongPseudoprimeTest.html. 
[MCD+05] C. Mucci, F. Campi, A. Deledda, A. Fazzi, M. Ferri, and M. Bocchi. A Cycle-
Accurate ISS for a Dynamically Reconfigurable Processor Architecture. In IEEE 
Proceedings of the Parallel and Distributed Processing Symposium (IPDPS), pages 
160-167, 2005. 
[Men93a] 	A.J. Menezes. Applications of Finite Fields. Kluwer Academic, 1993. 
[Men93b] 	A.J. Menezes. Elliptic Curve Public Key Cryptosystems. Kluwer Academic Pub-
lishers, Boston, MA, 1993. 
[Mic06] 	MicroBlaze 	processor 	reference 	guide, 	Xilinx 	Inc., 	2006. 
http://www.xilinx.com/ise/embedded/mb_ref_guide.pdf.  
[Mi176] 	G.L. Miller. Riemann's hypothesis and tests for primality. Journal of Computer 
Systems Science, 13(3):300-317, 1976. 
[Mi185] 	V. Miller. Uses of elliptic curves in cryptography. In Advances in Cryptology: 
Proceedings of CRYPTO'85 LNCS 218, pages 417-426. 1985. 
[MIP03] 	QL90xM QuickMIPS Multi-application System-on-Chip Product Brief, 2003. 
http://www.quicklogic.com/images/QuickMIPS_QL90xM_PB.pdf.  
[MKM+02] A. Mihal, C. Kulkarni, M. Moskewicz, M. Tsai, N. Shah, S. Weber, Yujia Jin, 
K. Keutzer, K. Vissers, C. Sauer, and S. Malik. Developing architectural plat-
forms: a disciplined approach. IEEE Design t1 Test of Computers, 19(6):6-16, 
2002. 
[ML04] 	0. Mencer and W. Luk. Parameterized high throughput function evaluation for 
FPGAs. Journal of VLSI Signal Processing Systems, 36(1):17-25, 2004. 
[ML304] ML310 development board user guide, Xilinx Inc., 2004. 
http://www.xilinx.com/products/boards/m1310/current/pcb/sch/ug068.pdf.  
168 	 BIBLIOGRAPHY 
[MLBP03] J.M. McCollum, J.M. Lancaster, D.W. Bouldin, and G.D. Peterson. Hardware 
acceleration of pseudo-random number generation for simulation applications. In 
Proceedings of IEEE Southeastern Symposium System Theory, pages 299-303, 
2003. 
[M086] 	 J. Massey and J. Omura. Computational method and apparatus for finite field 
arithmetic, 1986. 
[Mod07] 	Modular arithmetic, 2007. http://en.wikipedia.org/wiki/Modular_arithmetic.  
[Mon85] 	P. Montgomery. Modular multiplication without trial division. Mathematics of 
Computation, 44:519-521, 1985. 
[Moo65] 	G.E. Moore. Cramming more components onto integrated circuits. Electronics, 
38(8), 1965. 
[MoPO4] 	N. Mentens, S.B. Ors, and B. Preneel. An FPGA implementation of an elliptic 
curve processor over GF(2171 ). In Proceedings of the GVLSI, pages 454-457, 2004. 
[MOPV04] N. Mentens, S.B. Ors, B. Preneel, and J. Vandewalle. An FPGA implementation of 
a montgomery multiplier over GF(2m). In Proceedings of the 7th IEEE Workshop 
on Design and Diagnostics of Electronic Circuits and Systems (DDECS), pages 
121-128, 2004. 
[MOVW88] R. Mullin, I. Onyszchuk, S. Vanstone, and R. Wilson. Optimal normal bases in 
GRpn). Discrete Applied Mathematics, 22:149-161, 1988. 
[MT00] 	G. Marsaglia and W.W. Tsang. The Ziggurat method for generating random 
variables. Journal of Statistical Software, 5(8):1-7, 2000. 
[Mul97] 	J.M. Muller. Elementary Functions: Algorithms and Implementation. Birkhauser 
Verlag AG, 1997. 
[Nes01] 	A. Nesterov. Optimized math library for TMS320C67x DSP reference manual, 
2001. 
BIBLIOGRAPHY 	 169 
[NGCEG03] N. Nguyen, K. Gaj, D. Caliga, and T. El-Ghazawi. Implementation of elliptic 
curve cryptosystems on a reconfigurable computer. In International Conference 
on Field Programmable Technology (FPT), pages 60-67, 2003. 
[OBPV03] S.B. Ors, L. Batina, B. Preneel, and J. Vandewalle. Hardware implementation of a 
montgomery modular multiplier in a systolic array. In Proceedings of International 
Parallel and Distributed Processing Symposium, 2003, page 8, 2003. 
[oCIoST00] U.S. Department of Commerce/National Institute of Standard and 
Technology. 	FIPS 186-2, Digital Signature Standard (DSS), 2000. 
http://csrc.nist.gov/encryption/.  
[0k194] 	V.G. Oklobdzija. An algorithmic and novel design of a leading zero detector 
circuit: comparison with logic synthesis. IEEE Transactions on Very Large Scale 
Integration (VLSI) Systems, 2(1):124-128, 1994. 
[OP00] 	G. Orlando and C. Paar. A high performance reconfigurable elliptic curve for 
GF(2m). In Cryptographic Hardware and Embedded Systems, LNCS 1965, pages 
41-56. Springer, 2000. 
[OP01] 	G. Orlando and C. Paar. A scalable GF(p) elliptic curve processor architecture for 
programmable hardware. In Workshop on Cryptographic Hardware and Embedded 
Systems, LNCS 2162, pages 356-371, 2001. 
[OTIT00] 	S. Okada, N. Torii, K. Itoh, and M. Takenaka. Implementation of elliptic cruve 
cryptographic coprocessor over GF(2m) on an FPGA. In Cryptographic Hardware 
and Embedded Systems, LNCS 1965, pages 25-40. Springer, 2000. 
[P1300] 	IEEE P1363, Standard specifications for public key cryptography, 2000. 
[Par00] 	B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford 
University Press, 2000. 
[Ped04] 	V.A. Pedroni. Circuit Design with VHDL. The MIT Press, 2004. 
170 	 BIBLIOGRAPHY 
[Pie00] 	H. Pietilainen. Elliptic curve cryptography on smart cards. Master's thesis, 
Helsinki University of technology, 2000. 
[POMB05] J. Pineiro, S.F. Oberman, J.M. Muller, and J.D. Bruguera. High-speed function 
approximation using a minimax quadratic interpolator. IEEE Transactions on 
Computers, 54(3):304-318, 2005. 
[PSI02] 	M.R.. Pillmeier, M.J. Schulte, and E.G. Walters III. Design alternatives for barrel 
shifters. In Proceedings of SPIE Advanced Signal Processing Algorithms, Archi-
tectures, and Implementations, pages 436-447, 2002. 
[PW76] 	G. Paul and W. Wilson. Should the elementary function library be incorporated 
into computer instruction sets. ACM Transactions on Mathematical Software, 
2(2):132-142, 1976. 
[Rab80] 	M.O. Rabin. Probabilistic algorithm for primality testing. Journal of Number 
Theory, 12:128-138, 1980. 
[RHcK03] 	F. Rodriguez-Henriquez and .K. Kog. Parallel multipliers based on special irre-
ducible pentanomials. IEEE Transactions on Computers, 52(12):1535-1542, 2003. 
[Rib91] 	P. Ribenboim. The Little Book of Big Primes. Springer-Verlag, 1991. 
[Rie94] 	H. Riesel. Prime numbers and computer methods for factorization. Progress in 
Mathematics, 126, 1994. 
[RLP03] 	A.L. Rosa, L. Lavagno, and C. Passerone. Hardware/software design space explo-
ration for a reconfigurable processor. In Proceedings of the Design, Automation 
and Test in Europe (DATE), pages 570-575, 2003. 
[Ros98] 	M. Rosner. Elliptic curve cryptosystems on reconfigurable hardware. Master's 
thesis, Worcester Polytechnic Institute, 1998. 
[Ros99] 	M. Rosing. Implementing Elliptic Curve Cryptography. Manning Publications 
Co., 1999. 
BIBLIOGRAPHY 	 171 
[RS94] 	R. Razdan and M.D. Smith. A high-performance microarchitecture with hardware-
programmable functional units. In Proceedings of the 27th Annual International 
Symposium on Microarchitecture, pages 172-80, 1994. 
[RS99] 	R. Razdan and M. Smith. ConCISe: A compiler-driven CPLD-based instruction 
set accelerator. In IEEE Symposium on Field-Programmable Custom Computing 
Machines (FCCM), pages 92-101, 1999. 
[RSA78] 	R.L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures 
and public-key cryptosystems. Communications of the ACM, 21(2):120-126, 1978. 
[RV04] 	K.K. Ryu and V.J., III Mooney. Automated bus generation for multiprocessor 
SoC design. IEEE Transactions on Computer-Aided Design of Integrated Circuits 
and Systems, 23(11):1531-1549, 2004. 
[Sch96] 	B. Schneier. Applied Cryptography. John Wiley and Sons, 1996. 
[ScK01] 	B. Sunar and C.K. Koc. An efficient optimal normal basis type ii multiplier. IEEE 
Transactions on Computers, 50(1):83-87, 2001. 
[SE94] 	M. Schulte and E.E. Swartzlander Jr. Hardware designs for exactly rounded 
elementary functions. IEEE Transactions on Computers, 43(8):964-973, 1994. 
[Sina01] 	N.P. Smart. The hessian form of an elliptic curve. In Cryptographic Hardware and 
Embedded Systems LNCS 2162, pages 118-125, 2001. 
[SSKV06] 	L. Sun, H. Song, Z. Keirn, and B.V.K. Vijaya Kumar. Field programmable gate 
array (FPGA) for iterative code evaluation. IEEE Trans. Magnetics, 42(2):226-
231, 2006. 
[ST03] 	A. Satoh and K. Takano. A scalable dual-field elliptic curve cryptographic pro-
cessor. IEEE Transactions on Computers, 52(4):449-460, 2003. 
[StrO5] 	Stretch 55530 Datasheet, 2005. http://www.stretchinc.com/.  
172 	 BIBLIOGRAPHY 
[Tak92] 	N. Takagi. A radix-4 modular multiplication hardware algorithm for modular 
exponentiation. IEEE Transactions on Computers, 41(8):949-956, 1992. 
[Tau65] 	R.C. Tausworthe. Random numbers generated by linear recurrence modulo two. 
Mathematics and Computation, 19:201-209, 1965. 
[TcK03] 	A.F. Tenca and C.K. Koc. A scalable architecture for modular multiplication 
based on montgomery's algorithm. IEEE Transactions on Computers, 52(9):1215-
1221, 2003. 
[TCW+05] T.J. Todman, G.A. Constantinides, S.J.E. Wilton, 0. Mencer, W. Luk, and P.Y.K. 
Cheung. Reconfigurable computing: architectures and design methods. IEE Pro- 
ceedings of Computers and Digital Techniques, 152(2):193-197, 2005. 
[TLO5] 	D. Thomas and W. Luk. High quality uniform random number generation through 
LUT optimized linear recurrences. In Proceedings of IEEE International Confer-
ence Field-Programmable Technology (FPT), pages 61-68, 2005. 
[TLL03] 	K.H. Tsoi, K.H. Leung, and P.H.W. Leong. Compact FPGA-based true and 
pseudo random number generators. In Proceedings of IEEE Symposium Field-
Programmable Custom Computing Machines (FCCM), pages 51-61, 2003. 
[Tod00] 	G. Todorov. ASIC design, implementation and analysis of a scalable high-radix 
montgomery multiplier. Master's thesis, Oregon State University, 2000. 
[TTcK01] 	A.F. Tenca, G. Todorov, and C.K. Koc. High-radix design of a scalable modular 
multiplier. In Cryptographic Hardware and Embedded Systems 2001, pages 189-
205, 2001. 
[v5f] 	Virtex-5 	platform 	FPGA 	family 	fact 	sheet. 
http://www.xilinx.com/company/press/kits/v5/v5factsheet.pdf.  
[Wa196] 	C.S.. Wallace. Fast pseudorandom generators for normal and exponential variates. 
ACM Transactions on Mathematical Software, 22(1):119-127, 1996. 
BIBLIOGRAPHY 	 173 
[WBP00] 	A. Woodbury, D.V. Bailey, and C. Paar. Elliptic curve cryptography on smart 
cards without coprocessors. In IFIP CARDIS 2000, Smart Card Research and 
Advanced Application Conference. Kluwer, 2000. 
[WGP03] 	T. Wollinger, J. Guajardo, and C. Paar. Cryptography in embedded systems: An 
overview. In Embedded World Exhibition and Conference, pages 735-744, 2003. 
[WTS+85} C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura, and I.S. Reed. 
VLSI architectures for computing multiplications and inverses in GF(2m). IEEE 
Transactions on Computers, 34(8):709-717, 1985. 
[Xi196] 	Efficient Shift Registers, LFSI? Counters, and Long Pseudo-Random Sequence 
Generators, 1996. http://direct.xilinx.com/bvdocs/appnotes/xapp052.pdf.  
[Xi102] Additive White Gaussian Noise (AWGN) Core v1.0, 2002. 
http://www.xilinx.com/ipcenter/catalog/logicore/docs/awgn.pdf.  
[Xi103] Xilinx Inc. Xilinx Virtex-II Pro Product Brief, 2003. 
[Xi104] PowerPC 405 processorblock reference guide, Xilinx Inc., 2004. 
http://www.xilinx.com/ise/embedded/edk6_2docs/ppc405block_ref_guide.pdf.  
[Xi105] Floating-Point Operator v1.0, 2005. http://www.xilinx.com/bvdocs/ipcenter/.  
[Xi106a] 	Xilinx 	application 	note 	Embedded 	system 	example: 
web server design using MicroBlaze soft processor, 2006. 
http://www.xilinx.com/bvdocs/appnotes/xapp433.pdf.  
[Xi1061)] 	Embedded 	System 	Tools 	Reference 	Manual 	8.2i, 	2006. 
littp://www.xilinx.com/ise/embedded/edk82i_docs/est_rm.pdf.  
[Xi106c] Xilinx Inc. 	Xilinx ISE in-depth tutorial 8.2, 2006. 
http://direct.xilinx.com/direct/iseKtutorials/ise8tut.pdf.  
[Xi106d] 	Xilinx Inc. 	Xilinx System Generator User's Guide v8.2.02, 2006. 
http://www.xilinx.com/support/sw_manuals/sysgen_ug.pdf.  
174 	 BIBLIOGRAPHY 
[YSBOO] 	Z.A. Ye, N. Shenoy, and P. Banerjee. A c compiler for a processor with a re-
configurable functional unit. In Proceedings of International Symposium on Field 
Programmable Gate Arrays, pages 95-100, 2000. 
[ZLL+05] 	G. Zhang, P.H.W. Leong, D. Lee, J.D. Villasenor, R.C.C. Cheung, and W. Luk. 
Ziggurat-based hardware Gaussian random number generator. In Proceedings of 
IEEE International Conference Field-Programmable Logic and its Applications, 
pages 275-280, 2005. 
