Design and implementation of encryption algorithms in a coarse grain reconfigurable environment by Rhinelander, Jason P.

National Library of Canada / Bibliothèque nationale du Canada
Acquisitions and Bibliographic Services / Acquisitions et services bibliographiques
ISBN: 0-612-93054-8
The author has granted a non-exclusive licence allowing the National Library of
Canada to reproduce, loan, distribute or sell copies of this thesis in microform,
paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis
nor substantial extracts from it may be printed or otherwise reproduced without
the author's permission.
In compliance with the Canadian Privacy Act some supporting forms may have been
removed from this dissertation.
While these forms may be included in the document page count, their removal does
not represent any loss of content from the dissertation.
DESIGN AND IMPLEMENTATION OF ENCRYPTION ALGORITHMS IN A
COARSE GRAIN RECONFIGURABLE ENVIRONMENT
BY
JASON P. RHINELANDER
A Thesis submitted to the
School of Graduate Studies
in partial fulfillment of the
requirements for the degree of
Master of Engineering
FACULTY OF ENGINEERING AND APPLIED SCIENCE
MEMORIAL UNIVERSITY OF NEWFOUNDLAND
2003
MASTER OF ENGINEERING THESIS
OF
JASON P. RHINELANDER
APPROVED:
Thesis Committee
Major Professors
DEAN OF THE SCHOOL OF GRADUATE STUDIES
MEMORIAL UNIVERSITY OF NEWFOUNDLAND
2003
Abstract
In early 2000, Chameleon Systems Incorporated and Memorial University formed
a research agreement to evaluate the viability of the Chameleon Systems CS2112
Reconfigurable Communications Processor (RCP) for use in implementing popular
cryptographic algorithms. The CS2112 has a coarse grain reconfigurable architecture,
capable of run time reconfigurability.
The benefit of coarse grain reconfigurable architectures is that they can offer
many of the flexibilities found in software, such as reprogrammability and ease of
modification to implementation, while giving performance advantages of speed and
hardware encapsulation.
This research involves examining the implementation characteristics of two popular
symmetric key block ciphers, RC5 and RC6, and two popular cryptographic hash
algorithms, MD5 and SHA-1, with respect to the CS2112.
RC5 was designed as an iterative loop and then expanded to provide a parallel
pipeline to maximize the usage of the reconfigurable fabric. RC6 was designed as an
iterative loop and a pipeline. For both hash algorithms, initial designs were drafted
and performance figures were estimated from experience gained through simulation
and testing on a CS2112 development board.
By implementing these algorithms, the architecture of the CS2112 was evaluated
for its suitability for cryptographic applications. Moreover, the reconfigurable fabric
of the CS2112 was evaluated with respect to its support for the primitive operations
that are required for cryptographic algorithms.
The conclusions of this research and recommendations for future research are
directly related to resource use on the CS2112. In particular, support for control and
datapath logic, memory space, and global communication resources within the CS2112
were all design constraints. More specifically, it would be advantageous to have direct
support for accessing memory without using datapath resources. Also, hardware
support for data dependent logical rotations and unsigned integer multiplications would
greatly reduce resource usage and increase performance. Finally, the design process for
the CS2112 was sometimes time intensive and cumbersome, especially with respect
to layout and placement of reconfigurable logic. Advances in the area of automatic
placement and layout for coarse grain primitives would benefit the design process for
the CS2112 greatly.
Acknowledgements
I would like to acknowledge the sources of help that I have received while pursuing
my Master's degree in Electrical Engineering. I would like to thank my supervisors,
Dr. Howard Heys and Dr. Ramachandran Venkatesan, not only for their superb
guidance throughout my work, but for introducing me to the field of cryptography
and hardware design. I would also like to thank Dr. Mark Rollins from Chameleon
Systems Incorporated for his technical support and guidance.
This thesis would not have been possible without the sources of funding from: The
School of Graduate Studies at Memorial University, Dr. Heys and Dr. Venkatesan,
Chameleon Systems Inc., and the Government of Newfoundland and Labrador.
I would like to thank numerous colleagues and friends for their assistance and
support throughout the duration of my Master's research. I would like to thank my
parents for their constant support and guidance in all my endeavours. I also thank my
colleague Andrew Cook for his help and a "fresh view on things", and my special friend
Cindy O'Driscoll for all her proofreading and grammatical expertise.
Contents
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
1 Introduction
1.1 Motivation for Research
1.2 Research Scope
1.3 Thesis Outline
2 Block Ciphers, Hash Functions and Applications
2.1 Symmetric Key Block Ciphers
2.1.1 Electronic Codebook Mode (ECB)
2.1.2 Cipher-block Chaining Mode (CBC)
2.2 Hash Functions
2.3 Description of an HMAC
2.4 The RC5 Block Cipher
2.5 The RC6 Block Cipher
2.6 The MD5 Algorithm
2.7 The SHA-1 Algorithm
2.8 An Example Application: The IP Security Protocol (IPSec)
2.8.1 Authentication Header Protocol
2.8.2 Encapsulating Security Payload Protocol
2.9 Concluding Remarks
3 Hardware Architecture
3.1 Software vs. Hardware Algorithms
3.2 Application-Specific Integrated Circuits
3.2.1 Designs and Performance
3.3 Field-Programmable Gate Arrays
3.3.1 Implementations and Performance
3.4 Coarse Grain Reconfigurable Architecture
3.4.1 Survey of Existing Technologies
3.4.2 Cryptographic Applications
3.5 Chameleon Systems CS2112
3.5.1 CS2112 High Level Architecture Description
3.5.2 CS2112 Data Path Unit
3.5.3 CS2112 Local Store Memory
3.5.4 CS2112 Multiplier
3.5.5 CS2112 Control Logic Unit
3.5.6 Design Process For The CS2112 Fabric
4 Symmetric Block Cipher Design and Implementation
4.1 Diagram Use
4.2 RC5 Designs
4.2.1 RC5 Simple Iterative Design
4.2.2 Two Half-Round, Full Slice Version of RC5
4.2.3 Full Fabric RC5 Design
4.2.4 Summary of RC5 Results
4.3 RC6 Designs
4.3.1 Unsigned 32-bit Integer Multiplication
4.3.2 Iterative RC6 Design
4.3.3 Pipeline Primitives
4.3.4 Pipelined Multiplication
4.3.5 RC6 Full Pipelined Design
4.3.6 Summary of RC6 Designs
4.4 Summary
5 Evaluation of Message Digest Algorithms
5.1 MD5 Implementation
5.1.1 Performance and Usage Estimates for MD5
5.2 SHA-1 Implementation
5.2.1 Recursive Array Expansion
5.2.2 SHA-1 Auxiliary Function Design
5.2.3 Full SHA-1 Datapath
5.2.4 Performance and Resource Usage of SHA-1
5.3 Comparison of SHA-1 and MD5 Implementations
5.4 Summary
6 Summary and Conclusions
6.1 Summary of Results
6.2 CS2112 Architectural and Support Features
6.3 Considerations For Future Work
List of References
Appendices
A Sample Verilog Code for Selected Modules
A.1 RC5 Testbench
A.2 Iterative RC5 Top Level Module
A.3 Iterative RC5 Controller Module
A.4 Iterative RC5 Datapath Module
A.5 Unsigned Integer Multiplier Module Controller
A.6 Unsigned Integer Multiplier Module Datapath
A.7 Verilog Testbench For Controlling RC5 Iterative Pipeline
B ANSI C Code for Select Implementations
B.1 RC5 C Code For Testing
B.2 RC6 C Code For Testing
List of Tables
3.1 Results from AES candidates in ASIC technology.
3.2 Simulation results from RC6 ASIC designs using 128 bit keys.
3.3 Some results of FPGA simulations of AES candidates.
3.4 Some results of FPGA implementations of RC6.
3.5 Survey of existing coarse grain reconfigurable technologies.
4.1 Resource usage for the simple iterative version of RC5.
4.2 Timing information for the simple iterative version of RC5.
4.3 Resource usage for the two half-round design of RC5.
4.4 Timing for the full slice design of RC5.
4.5 Resource use for the full fabric version of RC5 (control logic excluded).
4.6 Resource estimates for a single slice of RC6 in the fabric.
4.7 Timing information for the RC6 pipeline.
4.8 Resource usage for a fully pipelined RC6.
4.9 Control logic resource usage for a fully pipelined RC6 design.
4.10 Summary of block ciphers on the CS2112.
5.1 Resource usage for preliminary MD5 implementation.
5.2 Delay through MD5 datapath.
5.3 Resource utilization of SHA-1.
5.4 Summary of hash algorithms on the CS2112.
6.1 Summary of designs on the CS2112.
Lis t of Figures
1.1 Desc ription of encryption with respect to a data network.
2,1 Block diagram of secure communications.
2.2 ECn mode 10
2.3 cac mode 11
2.4 Operation of HMAC. 14
2.5 Flow diagram of the ItC5 block cipher. 16
2.6 Illust ra t ion of a simplified S-bit left da ta dependent rotation 16
2.7 Flow diagram of the RC6 block cip her 17
2.8 Procedure of init ial p rocess ing arbitr ary length message . 18
2.9 Looking into the 1L.\105 function. 19
2.10 Operations involved in a single step of K ...\ ID5. 20
2.11 Procedure of processing arbitrary length message. 21
2.12 A decomposition of the H_SIIAI function. 22
2.13 A decomposition of Ii step in the ILSUAI function 23
2.14 ABpacket forma t . 26
2.15 ESP packet format. 27
3.1 A view of the ASIC design process. 31
3.2 Abstr acted internal FPCA structure. 34
3.3 Flexibility verses performance of hard ware technologies. 37
3.4 Examples uf 2D mesh connections. 38
3.5 An example of a liJlt'lU array cunfigurat iun. 38
3.6 AI) example of a crossbar configurat ion 39
3.7 AI) example of a Kresaa rray configuration Mult iple communication
schemes between processing elements are used. 39
3.8 Chameleo n ci pher chip, designed for encr yption 42
3.9 Process of swap ping active and background fabrics 43
3.10 High level block diagram of CS2112 compo nents . 44
3.ll High level deco mpositio n of rcconfigurablc fabric 45
3.12 Communication arrangement between processing clements, 16
:U 3 Datapath unit b lock diagram . 17
3.14 Multiplier block diagram . 19
3.15 CLU comm unicati on and interact ion with process ing clements . 50
3.16 Design flow for the CS2112 52
3.17 Screen ca pture of the graphical Hoorplan ncr. 53
4.1 Examp les of configured OP U str uctures 56
1.2 Two methods for describing memory structures containing one LSM
an d one OP U. 57
4.3 Abstracted block d igram of simple iterati ve RC5 des ign 59
4.4 Structural d iagram of data depen den t ro tation. 61
4.5 C......SIDE [loorplanner screen shot of simple iterative HC5 design. 63
1.6 High level abst raction of the two half-rou nd design of nG'). . 05
1.7 Four OPU implemen tation of the data dependent rota tion . f:i6
1.8 Screen capt ure of the two hal f-round RC5 fabric function. 67
4.9 Screen capture of the fnll fahric RC5 implementatiou 69
4.10 Creat ing a 32-bit unsigned integer multiplier. . 71
4.11 Iterative multipl ier setup 7.1
4.12 Need (If delay in a pipeli ne. 77
4.13 First-in , first -out queue setup 78
,;
,I.H Pipelined multiplier module .
·1.15 Mult iplier and co nt rolle r inter actio n
·1.1G Fix ed logical rotat ion by five hits
4.17 One full round ofRC6
,1.18 Descript ion of cont rol end dntupat.h interaction
4.19 RCG pipeline floorplan .
5.1 Implem ent at ion of F and G functions
5.2 Implementation of 1I and I function s. .
5.3 A proposed r.m5datapath for one ste p of IL.r.1D5.
5.4 Recursiv e ex pansion of W!O..15] to W lO..79].
5.5 Auxiliar y function implementation.
5.6 Auxiliar y fnnction implcmcnt at jon.
5.7 SHA-I datapath design
xii
79
81
sa
83
85
85
90
91
92
94
96
97
98
List of Abbreviations
AES Advanced Encryption Standard
AH Authentication Header
ALU Arithmetic Logic Unit
ARC Argonaut RISC Core
ASIC Application-Specific Integrated Circuit
CBC Cipher-block Chaining Mode
CLB Configurable Logic Block
CLU Control Logic Unit
CMOS Complementary Metal Oxide Semiconductor
CSM Control Store Memory
DES Data Encryption Standard
DMA Direct Memory Access
DPU Data Path Unit
DSA Digital Signature Algorithm
DSP Digital Signal Processing
ECB Electronic Codebook Mode
ESP Encapsulating Security Payload
FIR Finite Impulse Response
FPGA Field-Programmable Gate Array
FSM Finite State Machine
HDL Hardware Description Language
HMAC Hashed Message Authentication Code
ICV Integrity Check Value
IOB Input Output Block
IP Internet Protocol
IPSec IP Security Protocol
LSM Local Store Memory
MAC Message Authentication Code
MD Message Digest
MSP Message Security Protocol
NESSIE New European Schemes for Signatures, Integrity, and Encryption
NIST National Institute of Standards and Technology
PE Processing Element
PGP Pretty Good Privacy
PIO Programmable Input Output
PLA Programmable Logic Array
RCP Reconfigurable Communications Processor
RTL Register Transfer Level
S-HTTP Secure Hypertext Transfer Protocol
SA Security Association
SRB State Register Block
SSL Secure Sockets Layer
VPN Virtual Private Network
Chapter 1
Introduction
It is hard to imagine today's society without modern communication systems. While
it is a necessity for people to communicate with each other to share information, the
way in which this is undertaken has changed dramatically since the advent of global
voice and data networks.
The Internet has grown at an exponential rate in the last decade. Not only are
people using the Internet as a medium for communication, but the transmission and
storage of sensitive data has also seen increased usage. Online banking is a good
example of both the transmission and storage of sensitive data. A person must
transmit their account number along with a password to access their personal information
and accounts stored on the bank's computers. By the year 2007 it is estimated that
30% of Americans will use online banking and in the salary range of $50,000-$75,000,
usage will be 50% [1]. With respect to the Internet economy, quarterly growth figures
between 1999 and 2000 were estimated to be between 20% and 30% [2]. Corporations
are utilizing Virtual Private Networks (VPNs) to connect remote locations to
the private infrastructure of the company network.
While modern communication through global networks has increased communication
efficiency, it has also become easy for people to intercept data traveling across
shared networks such as the Internet. For example, a packet sniffer is a system that
looks at traffic flowing across a network so that a third party can view private
information. A packet sniffer is a common way to obtain user IDs and passwords [3].
Cryptography can be used to provide security to information flowing through a
publicly accessible network. Most cryptographic applications lie in the digital world of
computers, but cryptography has a much older past. Given below are some interesting
facts about the history of cryptography [4]:
• The first known emergence of a cryptographic substitution cipher occurred
around 1900 B.C. in a town called Menet Khufu, near the river Nile. Some
unique hieroglyphic symbols were used in place of normal ones.
• Ancient Egyptians used substitutions of hieroglyphs, and the use became more
popular with the increasing occurrence of tombs.
• Mechanical encryption devices were used extensively in World War II for the
encryption of military and political messages.
• Some of the first mechanical computers were invented and used by Marian
Rejewski to break codes produced from the German WWII Enigma machine
[5].
The use of data networks has been increasing at an astonishing rate and with
this growth in use, there is a need to secure private information across the network.
Within the scope of data security, encryption plays a large role. Encryption provides
data confidentiality, but there is also a need for the following security services [6]:
• Access Control: Maintains privileged access to information.
• Data Integrity: Prevents unauthorized modification of information.
• Authentication and Replay Prevention: Verifies a sender's identity and prevents
unauthorized re-transmission of information.
• Scalable Key Management: Allows the secure deployment of cryptographic keys.
• Accountability and Non-repudiation: Maintains the identity of sender and
prevents deniability.
The support of data integrity within a data network prevents the modification
of data while it is in transit across the network. A data integrity service over a
network often uses a hash function to produce a Message Digest (MD) of a message.
A digital signature technique such as the Digital Signature Algorithm (DSA) also
uses a hash algorithm in its operation [7]. Figure 1.1 is a typical hierarchy with respect
to encryption in a data network.
Figure 1.1: Description of encryption with respect to a data network (layers shown:
Application and Host Layer Encryption; Link Layer Encryption).
Encryption algorithms, or ciphers, can be implemented in hardware or software.
Some examples of popular secure communication protocols that are targeted for
software encryption are: Secure Hypertext Transfer Protocol (S-HTTP), Pretty Good
Privacy (PGP), Message Security Protocol (MSP), Secure Sockets Layer (SSL) and
IP Security Protocol (IPSec). Software encryption implementations are generally slower
than hardware implementations. Hardware implementations are used in high bandwidth,
low latency environments such as the link layer of a data network [6]. In addition to
potentially higher performance compared to software, hardware based encryption can
often be more secure than software, because it is harder for an attacker to obtain
information about the cipher during operation [8].
There are various ways of implementing encryption algorithms in hardware. One
way is through a special cryptographic processor that can be configured for various
algorithms. One example can be found in the CryptoManiac device [9]. CryptoManiac
is a cryptographic co-processor and is designed to speed software encryption.
Broadcom offers two cryptographic co-processors (BCM5840/41) that work at the host level
to aid encryption speeds of software [10]. Another way to use hardware encryption
is through an Application-Specific Integrated Circuit (ASIC) or Field-Programmable
Gate Array (FPGA) technology.
The field of ASICs is a broad one. ASICs can be full-custom integrated circuits or
semi-custom. A full-custom ASIC engineer will design some or all of the logic, circuits
and layout for a particular chip. In most application areas full-custom ASICs are not
as popular as they once were, but they are growing in the area of integrated analog and
digital ASICs [11]. Semi-custom ASICs are designed using standard cells that provide
functionality as simple as logic gates to the level of complexity of microprocessor cores.
Once the design is simulated and laid out it can be fabricated. The fabrication of
an ASIC involves the masking of layers of silicon similarly to standard integrated
circuits. Once the design is fabricated it cannot be changed.
An ASIC implementation is specifically designed for the cipher(s) of choice and
has the main advantage of speed. Another advantage of ASICs is that the designer
has complete control over placement, and is limited only by the design rules
imposed by the fabrication process. Cost, design time and lack of flexibility are some
disadvantages of ASIC design.
FPGAs are a more flexible way of designing algorithms in hardware. FPGAs were
developed initially as a fast way to prototype cells to be used in ASIC designs. Since
then FPGAs have grown in size and capability, allowing designers to implement market
products in FPGAs. FPGAs contain programmable logic devices that are set by
anti-fuse or static random access memory technology. The matrix of programmable logic
cells is connected together by a network of wires allowing communication between
cells [12].
An FPGA solution is considered to be a fine grain reconfigurable solution and
will allow faster design cycles because designs can be re-burned or reconfigured
without restarting the whole design process as with ASICs. A disadvantage of FPGA
implementation is that routing can have overhead and can be problematic [13].
Newer coarse grain architectures are emerging to exploit the advantages of
hardware while simultaneously offering the flexibility of software. Run-time reconfigurable
processing, ease of modification, and quick turnaround times in design and testing
are advantages of coarse grain architectures. Unlike an FPGA, coarse grain
architectures can use datapaths that are greater than 1-bit [13].
1.1 Motivation for Research
The Chameleon Systems Inc. CS2112 RCP is a coarse grain reconfigurable
architecture [14] designed for communication and Digital Signal Processing (DSP)
applications. The performance of coarse grain reconfigurable architectures with respect to
cryptographic applications can be dependent on resources and characteristics offered
by the specific reconfigurable product. With the increased need for secure
communication, commerce, faster transmission speeds, and increased traffic over public networks,
there is a need for implementation of new ciphers in hardware. A survey of companies
in 2001 shows that medium to large sized businesses utilize hardware based security
services over software based methods [15].
The Chameleon Systems CS2112 offers a reconfigurable environment for
encryption that gives the security and performance of hardware while offering the flexibilities
of software. Since the CS2112 is a general purpose communications processor with
support for many of the arithmetic and logical operations found in encryption, a research
agreement was developed between Memorial University and Chameleon Systems to
investigate and evaluate the performance of popular ciphers on the CS2112 [16].
It is the purpose of this research to not only investigate the performance of
popular encryption algorithms on the CS2112, but to investigate where the CS2112's
architecture is inadequate to support these algorithms efficiently and to determine
the advantages of using the CS2112 for security applications. Results of this research
were reported back to Chameleon Systems Inc. for future design considerations.
1.2 Research Scope
The purpose of this thesis is to investigate the suitability of symmetric key encryption
and cryptographic hash algorithms on the Chameleon Systems RCP. Two symmetric
key block ciphers that were explored are RC5 [17] and RC6 [18]. Hash functions
that were explored are MD5 [19] and SHA-1 [7]. Various design methods of these
algorithms were addressed along with testing and performance evaluation.
Before addressing the topic of this research, the reader will be given adequate
background in cryptography, hardware implementation technologies for encryption
algorithms, and a high level description of the CS2112 architecture. RC5 was the
first algorithm to be investigated and this focused on an iterative approach to cipher
implementation. Optimizations to this design were carried out to maximize use of
the CS2112. Next, RC6 was evaluated with a pipelined design optimized for high
speed operation. MD5 and SHA-1 were designed based on information gained from
designing RC5 and RC6 on the CS2112. When design and testing were completed
the CS2112 was evaluated for its suitability for supporting the selected algorithms with
respect to its processing resources and support for cryptographic primitives.
1.3 Thesis Outline
This thesis follows the outline below:
• Chapter One is an introduction to the research conducted with the CS2112.
• Chapter Two will introduce the reader to symmetric key block ciphers,
cryptographic hash functions, and give an example of a popular protocol that makes
use of both types of algorithms.
• Chapter Three will focus on different hardware implementation technologies
while giving some examples of cryptographic applications and performance.
The CS2112 is also introduced in Chapter Three, outlining its architecture and
target application areas.
• Chapter Four describes implementation and performance of symmetric key
block ciphers on the CS2112.
• Chapter Five describes the design and performance of hash functions on the
CS2112.
• Chapter Six provides conclusions with respect to the CS2112 and gives some
recommendations on coarse grain architectural support for cryptography in
relation to the CS2112.
Chapter 2
Block Ciphers, Hash Functions and
Applications
The purpose of this chapter is to give an overview of symmetric key block ciphers,
hash functions, and their applications such as the IP Security Protocol. Primitive
operations that ciphers and hash functions utilize will also be discussed.
2.1 Symmetric Key Block Ciphers
When a message or plaintext is encrypted using an encryption algorithm, it is
computationally infeasible to extract the information from the ciphertext unless the
corresponding decryption algorithm is used. Cryptology is the field that is made up of
Cryptography and Cryptanalysis. Cryptography is the field that involves mapping
plaintext to ciphertext in a secure fashion. The purpose of Cryptanalysis is to test
the security of ciphers by decrypting encrypted messages in a method not intended
by the decryption algorithm, in effect testing the security of the encryption. A block
cipher is a function that mathematically maps an n-bit plaintext block to an n-bit
ciphertext block, with the block size defined to be n [20].
Extra information, called a key, is required to execute an encryption algorithm.
If the same key is used for both encryption and decryption it is called a symmetric
key cipher [17]. The use of a symmetric key block cipher to transmit an encrypted
message is illustrated in Figure 2.1 [20].
The total possible number of keys is defined as the keyspace and the security of
a cipher is related to both the keyspace and n. A cipher is unconditionally secure if
ciphertext blocks and plaintext blocks are statistically independent. In general, an
increase in block size and/or in size of the keyspace will increase the implementation
cost of the cipher [20].
Figure 2.1: Block diagram of secure communications.
Iterated round ciphers involve a sequential loop of an internal function (a round),
involving blocks of plaintext. A round consists of simple cryptographic operations
such as additions and data dependent rotations. Each round uses a subkey that is
derived from the original key that is mixed with the data. Virtually all block ciphers
are iterated, and RC5 and RC6 operate in this fashion.
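The round primitives named above can be sketched in a few lines. The following is a minimal illustration, assuming 32-bit words; the names rotl32 and toy_round are hypothetical and are not taken from the thesis. The round step is modeled on the half-round form used by RC5 (XOR, rotate left by a data dependent amount, add a subkey) and is not a complete cipher.

```python
MASK32 = 0xFFFFFFFF  # all arithmetic is on 32-bit words


def rotl32(x: int, r: int) -> int:
    """Rotate the 32-bit word x left; only the low 5 bits of r are used."""
    r &= 31
    return ((x << r) | (x >> (32 - r))) & MASK32


def toy_round(a: int, b: int, subkey: int) -> int:
    """One illustrative round step: XOR, data dependent rotation, subkey addition.

    The rotation amount comes from the data word b, not from a constant,
    which is what makes the rotation "data dependent".
    """
    return (rotl32(a ^ b, b) + subkey) & MASK32


# Usage: rotate (1 XOR 3) = 2 left by 3 positions, then add the subkey 0x10.
print(hex(toy_round(0x1, 0x3, 0x10)))
```

Because the rotation amount changes with every data word, a fixed wiring of bits is not sufficient in hardware, which is why support for this primitive matters on a reconfigurable fabric.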
Block ciphers may be used in different modes of operation. Modes of operation
arise when encrypting a message that is longer than n bits [20]. Two modes of
operation will be discussed, Electronic Codebook Mode (ECB) and Cipher-block Chaining
Mode (CBC). ECB is a mode that does not involve feedback of a previous operation,
while CBC requires feedback from its previous operation. For the two modes of
operation discussed, a brief explanation of the advantages and disadvantages with respect
to security and error recovery will be presented.
2.1.1 Electronic Codebook Mode (ECB)
ECB is illustrated in Figure 2.2. The symbol n is the block size in bits, X_i represents
the i-th block of plaintext, C_i represents the i-th block of ciphertext, e() represents
the encryption function, d() represents the decryption function, and k represents the
key.
Figure 2.2: ECB mode.
When a message is more than n bits long, it is sectioned into n-bit blocks, each
block is encrypted separately, and decryption is carried out in a similar fashion. ECB
mode has the advantage of being the simplest encryption mode. An error in a
transmitted encrypted block will result in a full plaintext block being decrypted in error
on the receiver's end. ECB mode has the disadvantage that it does not hide data
patterns [20]. In ECB mode an observer can view ciphertext across an insecure
network and may sometimes have knowledge of the plaintext that is being transmitted.
The observer can then build a library of plaintext-ciphertext pairs allowing partial
decryption of future messages.
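The behaviour described above can be seen in a small sketch. The functions toy_e and toy_d below are hypothetical stand-ins for e() and d(): a deliberately insecure 4-byte toy cipher chosen only so the example is self-contained. It is assumed that the message length is a multiple of the block size.

```python
BLOCK = 4  # toy block size in bytes (n = 32 bits); an assumption for illustration


def toy_e(block: bytes, key: bytes) -> bytes:
    """Toy stand-in for e(): XOR with the key, then rotate the bytes. NOT secure."""
    mixed = bytes(b ^ k for b, k in zip(block, key))
    return mixed[1:] + mixed[:1]


def toy_d(block: bytes, key: bytes) -> bytes:
    """Inverse of toy_e: undo the byte rotation, then XOR with the key."""
    unrotated = block[-1:] + block[:-1]
    return bytes(b ^ k for b, k in zip(unrotated, key))


def ecb_encrypt(plaintext: bytes, key: bytes) -> list:
    """Split the message into blocks and encrypt each block independently."""
    blocks = [plaintext[i:i + BLOCK] for i in range(0, len(plaintext), BLOCK)]
    return [toy_e(x, key) for x in blocks]


def ecb_decrypt(blocks: list, key: bytes) -> bytes:
    """Decrypt each ciphertext block independently and rejoin the message."""
    return b"".join(toy_d(c, key) for c in blocks)


key = b"\x01\x02\x03\x04"
ct = ecb_encrypt(b"ABCDABCDWXYZ", key)
print(ct[0] == ct[1])  # identical plaintext blocks give identical ciphertext blocks
```

Since every block is processed independently, a corrupted ciphertext block damages only its own plaintext block, and repeated plaintext blocks remain visible in the ciphertext.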
2.1.2 Cipher-block Chaining Mode (CBC)
CBC is illustrated in Figure 2.3. CBC mode starts with an initial value, or IV,
and subsequent encryptions are carried out with the use of the previous ciphertext
block. The IV is required because this mode uses feedback. When the first
plaintext block is encrypted there is no previous ciphertext block to use, therefore a
value is provided externally so that the operation can proceed. CBC has the advantage
of hiding patterns in plaintext values.
Figure 2.3: CBC mode.
Some of the disadvantages of CBC mode are that an error in the transmitted
ciphertext block C_i will result in an incorrect deciphering of C_i and C_{i+1}. In addition,
the order of the blocks matters because the decryption requires the receiver to
have the previous block of ciphertext.
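The chaining and the two-block error propagation can be illustrated with a short sketch. As an assumption for illustration, toy_e and toy_d are a deliberately insecure 4-byte toy cipher standing in for e() and d(); all names are hypothetical, and the message length is assumed to be a multiple of the block size.

```python
BLOCK = 4  # toy block size in bytes; an assumption for illustration


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_e(block: bytes, key: bytes) -> bytes:
    """Toy stand-in for e(): XOR with the key, then rotate the bytes. NOT secure."""
    mixed = xor(block, key)
    return mixed[1:] + mixed[:1]


def toy_d(block: bytes, key: bytes) -> bytes:
    """Inverse of toy_e."""
    return xor(block[-1:] + block[:-1], key)


def cbc_encrypt(plaintext: bytes, key: bytes, iv: bytes) -> list:
    """Each plaintext block is XORed with the previous ciphertext block (or the IV)."""
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK):
        prev = toy_e(xor(plaintext[i:i + BLOCK], prev), key)
        out.append(prev)
    return out


def cbc_decrypt(blocks: list, key: bytes, iv: bytes) -> bytes:
    """Decryption needs the previous ciphertext block to undo the chaining."""
    prev, out = iv, []
    for c in blocks:
        out.append(xor(toy_d(c, key), prev))
        prev = c
    return b"".join(out)


key, iv = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"
ct = cbc_encrypt(b"ABCDABCDWXYZ", key, iv)
print(ct[0] != ct[1])  # the repeated plaintext block no longer repeats in the ciphertext
```

Corrupting one ciphertext block in this sketch garbles the corresponding plaintext block and the one after it, while later blocks decrypt correctly, matching the error behaviour described above.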
2.2 Hash Functions
Hash functions take a finite arbitrary length input and output a fixed length message
digest or simply a hash of the message. In other words, a hash function will map an
arbitrary ranged input to a fixed and smaller ranged output [20]. Hash functions are
used in both cryptographic and non-cryptographic applications.
Cryptographic hash functions are one way functions, meaning that it is computationally
infeasible to recover the original input based solely on the output. A collision occurs
when two inputs map to the same output value. The output value of a hash function is
regarded as a compact digital image or representation of the input data. For cryptographic
applications, a hash function must exhibit strong and weak collision avoidance. To
explore the concept of strong and weak collision avoidance, the hash function will be
defined as h(), the input to the function is x and another input to the function is
x' (different than x). Strong collision avoidance is observed if it is computationally
infeasible to find both x and x' such that h(x) = h(x'). Weak collision avoidance is
observed if given x, finding x' such that h(x) = h(x') is infeasible [20].
Hash functions are primarily used in data integrity schemes and may be keyed or
not keyed. A keyed hash function will take two inputs (data and a secret key) and
produce one output referred to as a Message Authentication Code (MAC). A hash
function that is not keyed can be configured as a MAC, producing a Hashed Message
Authentication Code (HMAC).
For the purposes of this research, single input, single output hash functions used
for authentication schemes will be addressed.
2.3 Description of an HMAC
HMAC is a mechanism for message authentication that utilizes a cryptographic hash
function. MACs provide a way to check the integrity of information transmitted across
an insecure medium [21]. The HMAC scheme uses a cryptographic hash function and
a secret key. The input information to an HMAC is the message to be authenticated,
and the secret key. It is assumed that only the sender and receiver have access to the
secret key. To describe the operation of HMAC, the following definitions are made:
• H(), a cryptographic hash function that processes an arbitrary length message
formatted into B-byte blocks.
• L, byte length of hash function output. It is assumed that L will be less than
B.
• K, the secret key used in the HMAC. The secret key is of variable length and
any length that is less than B will have zero bytes appended to the end of
the key. For key (K1) with length greater than B the following will occur,
H(K1) = K2. K2 has a length of L, and will be used as the secret key.
• ipad, the value 0x36 repeated B times.
• opad, the value 0x5c repeated B times.
Figure 2.4 illustrates the operation of an HMAC. In illustrations, adjoining blocks
of data represent the appending of two blocks into a single, larger block. The ⊕
operator is the bit-wise XOR operation.
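The definitions above can be sketched directly in Python using the standard hashlib module. This is a minimal sketch, assuming the usual HMAC construction H((K ⊕ opad) || H((K ⊕ ipad) || message)); the function name hmac_sketch is hypothetical, and the standard library hmac module is imported only so that the result can be cross-checked.

```python
import hashlib
import hmac  # standard library HMAC, used only to cross-check the sketch

def hmac_sketch(key: bytes, message: bytes, hash_name: str = "md5") -> bytes:
    def H(data: bytes) -> bytes:
        return hashlib.new(hash_name, data).digest()

    B = hashlib.new(hash_name).block_size   # 64 bytes for MD5 and SHA-1
    if len(key) > B:
        key = H(key)                        # K2 = H(K1), of length L
    key = key.ljust(B, b"\x00")             # append zero bytes up to B

    inner_key = bytes(k ^ 0x36 for k in key)   # K XOR ipad
    outer_key = bytes(k ^ 0x5C for k in key)   # K XOR opad
    return H(outer_key + H(inner_key + message))

tag = hmac_sketch(b"shared secret", b"message to authenticate")
reference = hmac.new(b"shared secret", b"message to authenticate",
                     hashlib.md5).digest()
```

Because the hash function is passed by name, the same sketch works with "sha1" in place of "md5", which reflects the scalability goal of the HMAC design.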
The construction of an HMAC provides two functions. The integrity of the message
is protected because the cryptographic hash function provides a digital fingerprint
of the original text. A message cannot be forged because a secret key is mixed
into the HMAC. HMACs are used in the Authentication Header (AH) protocol within
IPSec.
The performance of the HMAC depends on the underlying hash function and it
is desired to use hash functions that will perform well in software. The main goal is
to make HMACs scalable to faster or more secure hash functions in the future [21].
Figure 2.4: Operation of HMAC.
2.4 The RC5 Block Cipher
Ronald L. Rivest developed RC5 [17] at the MIT Laboratory for Computer Science
and it is a trademark of RSA Data Security. RC5 is an extremely compact cipher and
is suitable for both hardware and software implementations. Listed below are some
more notable characteristics of RC5 [17]:
• RC5 is a symmetric block cipher. The same cryptographic key is used for both
encryption and decryption. The ciphertext and plaintext are of fixed bit length.
• RC5 uses operations and instructions that are commonly found on typical
microprocessors.
• RC5 is iterative and can have a variable number of rounds.
• RC5 uses little memory so that it is useful with smart cards, mobile computing
platforms, microcontrollers, and other low memory environments.
• RC5 makes use of data dependent rotations as one of its diffusion primitives.
• RC5 is parameterized as RC5-w/r/b. The word size is defined as w and the
block size is 2w. The number of iterative rounds is given by r, and b specifies
the key size in bytes. For this research RC5-32/12/16 will be used.
The RC5 cipher is illustrated in Figure 2.5. The parameters A and B are 32-bit
blocks of plaintext. The array S[0..2r + 1] is composed of 32-bit words that are
created by manipulation of the initial key. The + operation is mod 2^32 addition and
L <<< R is the data dependent bit-wise left rotation of L by the amount R. To further
illustrate the process of a data dependent left rotation, a simplified operation is given
in Figure 2.6.
The bit-wise XOR operation is defined as ⊕. The i parameter is used to indicate
which round the algorithm is in: for example, the statement repeat i = 1..r is the
equivalent of a loop where the variable i is incremented by one each round. Therefore
in the second round of RC5, S[4] and S[5] are used in the algorithm.
Figure 2.5: Flow diagram of the RC5 block cipher.
[Figure 2.6 content: 10011010 <<< 00000010 = 01101010; rotating left by 2 (base 10) gives 106 (base 10).]
Figure 2.6: Illustration of a simplified 8-bit left data dependent rotation.
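The round structure described above can be sketched in Python for RC5-32/12/16. This is a minimal sketch under stated assumptions: the key expansion and the constants P32 and Q32 follow the commonly published RC5 description rather than anything given in this chapter, and the function names are hypothetical.

```python
W = 32                               # word size w in bits
R = 12                               # number of rounds r
MASK = (1 << W) - 1
P32, Q32 = 0xB7E15163, 0x9E3779B9    # constants from the published RC5 description

def rotl(x: int, n: int) -> int:
    n %= W                           # only the low lg(w) bits of n are used
    return ((x << n) | (x >> (W - n))) & MASK

def rotr(x: int, n: int) -> int:
    n %= W
    return ((x >> n) | (x << (W - n))) & MASK

def expand_key(key: bytes, rounds: int = R) -> list:
    u = W // 8                                   # bytes per word
    c = max(1, (len(key) + u - 1) // u)          # key length in words
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):        # load key bytes little-endian
        L[i // u] = ((L[i // u] << 8) + key[i]) & MASK
    t = 2 * (rounds + 1)                         # size of S[0..2r+1]
    S = [(P32 + i * Q32) & MASK for i in range(t)]
    A = B = i = j = 0
    for _ in range(3 * max(t, c)):               # mix the key into S
        A = S[i] = rotl((S[i] + A + B) & MASK, 3)
        B = L[j] = rotl((L[j] + A + B) & MASK, A + B)
        i, j = (i + 1) % t, (j + 1) % c
    return S

def encrypt_block(A: int, B: int, S: list, rounds: int = R):
    A = (A + S[0]) & MASK
    B = (B + S[1]) & MASK
    for i in range(1, rounds + 1):               # round i uses S[2i] and S[2i+1]
        A = (rotl(A ^ B, B) + S[2 * i]) & MASK
        B = (rotl(B ^ A, A) + S[2 * i + 1]) & MASK
    return A, B

def decrypt_block(A: int, B: int, S: list, rounds: int = R):
    for i in range(rounds, 0, -1):               # undo the rounds in reverse
        B = rotr((B - S[2 * i + 1]) & MASK, A) ^ A
        A = rotr((A - S[2 * i]) & MASK, B) ^ B
    return (A - S[0]) & MASK, (B - S[1]) & MASK

S = expand_key(bytes(16))                        # 16-byte all-zero key
ciphertext = encrypt_block(0x01234567, 0x89ABCDEF, S)
```

Each round consists only of an XOR, a data dependent rotation, and a modular addition, which is why the cipher maps well onto typical microprocessor instructions.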
2.5 The RC6 Block Cipher
RC6 was a submission to the National Institute of Standards and Technology (NIST)
for consideration as a candidate for the Advanced Encryption Standard (AES) in 1998
(Rijndael was chosen to be the algorithm for AES [22]). RC6 was also considered
for the New European Schemes for Signatures, Integrity, and Encryption (NESSIE)
specification, but did not make it through to the final round of selections due to
ongoing intellectual property rights issues [23]. RC6 is a direct evolution of RC5.
Figure 2.7: Flow diagram of the RC6 block cipher.
Since RC6 is an advancement of RC5 there are various similarities. RC6 is
parameterizable like RC5 with the same parameter format, namely, RC6-w/r/b. All of
the operations that are used in RC5 are also found in RC6. The operation of RC6 is
shown in Figure 2.7 [18]. From the flow diagram the differences in RC6 are evident.
There is a left rotation by lg(w), where lg() is the log2 operation. The use of four
w-bit input blocks of plaintext denoted as A, B, C and D makes RC6 a 128-bit block
cipher when w = 32. There is a permutation of the data blocks at the end of each
round. The biggest difference, especially from the viewpoint of performance, is the
operation f(), which represents the following relationship, f(x) = x(2x + 1) mod 2^32.
Hence, f() requires an unsigned integer multiplication operation. For this research,
RC6-32/20/16 will be used.
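The operations that distinguish RC6 from RC5 can be sketched in Python for w = 32. This is a minimal sketch of a single round only, assuming the round structure of the commonly published RC6 description; the names rc6_round, s_a and s_c (standing for the two round key words of that round) are hypothetical, and no key expansion is shown.

```python
W = 32
MASK = (1 << W) - 1
LG_W = 5                      # lg(w) = log2(32)

def rotl(x: int, n: int) -> int:
    n %= W
    return ((x << n) | (x >> (W - n))) & MASK

def f(x: int) -> int:
    # f(x) = x(2x + 1) mod 2^32: one unsigned integer multiplication
    return (x * (2 * x + 1)) & MASK

def rc6_round(A: int, B: int, C: int, D: int, s_a: int, s_c: int):
    t = rotl(f(B), LG_W)                  # fixed left rotation by lg(w)
    u = rotl(f(D), LG_W)
    A = (rotl(A ^ t, u) + s_a) & MASK     # data dependent rotation by u
    C = (rotl(C ^ u, t) + s_c) & MASK     # data dependent rotation by t
    return B, C, D, A                     # permutation (A, B, C, D) = (B, C, D, A)

state = rc6_round(0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210, 5, 7)
```

The two calls to f() are the only multiplications in the round, which is why this operation dominates the hardware cost discussed later.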
2.6 The MD5 Algorithm
MD5 is a cryptographic hash function that takes a message of arbitrary length and
produces a 128-bit message digest. MD5 exhibits weak and strong collision avoidance,
and is a popular hash algorithm that has found much use in Internet based message
authentication [19].
[Figure 2.8 labels: initial registers A: 01234567, B: 89abcdef, C: fedcba98, D: 76543210; compression function H_MD5; message digest (128 bits).]
Figure 2.8: Procedure of initial processing of an arbitrary length message.
Once the message of arbitrary length has been formatted into N 512-bit blocks as
shown in Figure 2.8, it is passed into the compression algorithm denoted as H_MD5.
The inputs to the compression function are one 512-bit block of the formatted
arbitrary length message, and four initialized registers, A, B, C, and D. The outputs
of the function are the modified values of the registers that were used by the input.
These four registers act as temporary buffers for subsequent calls to the compression
function. The values r0, r1, r2...rN-1 are used to denote the state of the four
registers before and after a call to the compression function where N - 1 would be the
last call to H_MD5. When each block of the message is processed in this fashion, the
values of the registers make up the 128-bit message digest or hash value [19].
Figure 2.9: Looking into the H_MD5 function.
There are 64 steps involved with the execution of H_MD5, as shown in Figure 2.9.
Each grouping of 16 steps is abstracted into a block showing inputs, outputs, and
data values involved. The F, G, H and I functions denote auxiliary functions that take
in three 32-bit values and output a single 32-bit value. These operations are solely
made up of bit-wise operations (AND, OR, XOR and NOT). The re-assignment of the
registers during execution is represented by a superscript prime ('). The final operation
is to add the old values of the 32-bit registers to the new modified values.
Figure 2.10: Operations involved in a single step of H_MD5.
The array T[0..63] is an array of 32-bit values derived from the sin() function.
The array S[0..63] contains the per-step amounts for the left rotation operation. The
array X[0..15] contains 32-bit words of the 512-bit message block that was passed
into the compression function. The mapping of b to k is accomplished by using a
permutation mod 16 operation, therefore elements of X are used and reused during
the execution of H_MD5.
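The arrays T, S and X and a single step of H_MD5 can be sketched in Python. This is a minimal sketch, assuming the table construction, rotation amounts and index mapping of the commonly published MD5 description; the names message_index and md5_step are hypothetical, and the full 64-step loop is not shown.

```python
import math

MASK = 0xFFFFFFFF

# T[b] is the integer part of 2^32 * |sin(b + 1)|, with b + 1 in radians.
T = [int(abs(math.sin(b + 1)) * 2**32) & MASK for b in range(64)]

# S[b] holds the fixed left rotation amount for step b (four values per group).
S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4

def rotl(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK

# Auxiliary functions: three 32-bit inputs, one 32-bit output, bit-wise only.
def F(x, y, z): return (x & y) | (~x & MASK & z)
def G(x, y, z): return (x & z) | (y & ~z & MASK)
def H(x, y, z): return x ^ y ^ z
def I(x, y, z): return y ^ (x | (~z & MASK))

def message_index(b: int) -> int:
    """Map step number b (0..63) to the index k of the message word X[k]."""
    if b < 16:
        return b
    if b < 32:
        return (5 * b + 1) % 16
    if b < 48:
        return (3 * b + 5) % 16
    return (7 * b) % 16

def md5_step(a, b_reg, c, d, func, x_k, t_b, s_b):
    """New register value: b + ((a + func(b, c, d) + X[k] + T[b]) <<< S[b])."""
    total = (a + func(b_reg, c, d) + x_k + t_b) & MASK
    return (b_reg + rotl(total, s_b)) & MASK

# Example: step 0 on the initial registers with an all-zero message word.
new_a = md5_step(0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, F, 0, T[0], S[0])
```

Within each group of 16 steps, message_index visits every element of X exactly once, which is the reuse of X noted above.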
2.7 The SHA-1 Algorithm
SHA-1 [7] is a hash algorithm used in digital signature schemes and for HMACs within
IPSec [24]. SHA-1 uses primitives from MD4, and there are similarities between MD5
and SHA-1. SHA-1 generates a 160-bit message digest from a message of arbitrary
length [7] and this process is illustrated in Figure 2.11.
[Figure 2.11 labels: initial registers A: 67452301, B: EFCDAB89, C: 98BADCFE, D: 10325476, E: C3D2E1F0; formatted message (padded and appended), length is a multiple of 512 bits; message digest (160 bits).]
Figure 2.11: Procedure of processing arbitrary length message.
The compression function is defined as H_SHA1. With each call to the compression
function, five 32-bit registers are used as input, labeled A, B, C, D, and E. A 512-bit
block of the arbitrary length message is also used as an input to the compression
function. The output of the function is the five registers modified from their original
values and they are used as the input to the next call to H_SHA1. Once all of the
message blocks are processed the final value of the registers make up the 160-bit
message digest [7].
The compressing function is decomposed in Figure 2.12. The 512-bit block is
stored in the 32-bit array element W[0..15], and is expanded to an 80 element array
of 32-bit values. The process of expanding W is recursive and only depends on W.
There are 80 steps in total for running the compression function. The 80 steps have
been further divided into groups of 20 steps. Each one of these sub-groupings is
illustrated as a rectangle with input, output and other parameters listed inside. The
functions F0, F1, F2, and F3 take as input three 32-bit values and output a single
32-bit value using bit-wise operations. The K parameter is a 32-bit constant that is
used for that particular grouping of steps. The operation of the first block is further
decomposed in Figure 2.13 to show the operations that take place in each step. The
specific operations involved are addition, rotation, and bit-wise logical operations
contained in F0 [7]. The other three groupings use F1, F2, and F3 respectively.
Figure 2.12: A decomposition of the H_SHA1 function.
Figure 2.13: A decomposition of a step in the H_SHA1 function.
For the purposes of this research it is important to look directly at the
implementation of the H_SHA1 function.
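As an illustration of the structure described in this section, H_SHA1 and the surrounding message formatting can be sketched in Python. This is a minimal software sketch, assuming the padding rule, constants and step function of the published SHA-1 standard; the names h_sha1, f_group, pad and sha1_sketch are hypothetical, and hashlib is imported only so that the result can be cross-checked.

```python
import hashlib   # used only to cross-check the sketch
import struct

MASK = 0xFFFFFFFF
INITIAL = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)   # one constant per 20 steps

def rotl(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & MASK

def f_group(group: int, x: int, y: int, z: int) -> int:
    if group == 0:                                  # F0: choose
        return (x & y) | (~x & MASK & z)
    if group == 2:                                  # F2: majority
        return (x & y) | (x & z) | (y & z)
    return x ^ y ^ z                                # F1 and F3: parity

def h_sha1(registers, block: bytes):
    W = list(struct.unpack(">16I", block))          # W[0..15] from the 512-bit block
    for t in range(16, 80):                         # recursive expansion, depends only on W
        W.append(rotl(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16], 1))
    a, b, c, d, e = registers
    for t in range(80):                             # 80 steps in four groups of 20
        g = t // 20
        temp = (rotl(a, 5) + f_group(g, b, c, d) + e + W[t] + K[g]) & MASK
        a, b, c, d, e = temp, a, rotl(b, 30), c, d
    # Add the old register values to the new ones.
    return tuple((old + new) & MASK for old, new in zip(registers, (a, b, c, d, e)))

def pad(message: bytes) -> bytes:
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + struct.pack(">Q", 8 * len(message))   # append length in bits

def sha1_sketch(message: bytes) -> bytes:
    registers = INITIAL
    padded = pad(message)
    for i in range(0, len(padded), 64):             # one call to H_SHA1 per block
        registers = h_sha1(registers, padded[i:i + 64])
    return struct.pack(">5I", *registers)           # 160-bit message digest

digest = sha1_sketch(b"abc")
```

Every step uses only 32-bit addition, fixed rotations and bit-wise logic, which are the operations a hardware implementation of H_SHA1 must provide.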
2.8 An Example Application: The IP Security Protocol (IPSec)
The following section is an overview of how symmetric key block ciphers and hash
algorithms are used in popular applications to provide general security to
communications over the Internet. IPSec is the proposed standard with respect to the security
architecture for the Internet protocol (IP) [25]. IPSec allows the implementation of
VPNs through the establishment of secure tunnels [26]. The biggest advantage of
a VPN over the Internet is that businesses can abandon private dial-in and leased
communication lines in favor of using more popular public connection methods.
Partners, suppliers, and customers can be seamlessly connected over a private network
that is based upon a public medium [26]. IPSec will be examined in terms of how
and where symmetric block ciphers and hash functions are utilized. IPSec is designed
to be algorithm-independent while offering a set of required algorithms for operation
on different platforms [25]. IPSec provides the following functionality to IP based
networks, including the Internet [25]:
• Access Control
• Connectionless Integrity
• Data Origin Authentication
• Protection Against Replay Attacks
• Confidentiality
For the purpose of this research it is important to look at how and where IPSec is
implemented. IPSec can be applied to the host level (often in software) or in
conjunction with an Internet gateway or router (possibly in hardware). Three implementation
possibilities arise:
1. Into the native Internet Protocol (IP) implementation at the host level or at
the gateway level, requiring access to IP source code [25].
2. Underneath the existing IP protocol stack, referred to as "Bump-in-the-stack"
or BITS. Implementation is transparent to existing architecture [25].
3. In external hardware referred to as "Bump-in-the-wire" or BITW. BITW is
often used in business and military applications and is usually IP addressable.
The BITW arrangement can act as a host or a security gateway (possibly both)
[25]. The BITW scheme will often require high throughput and a hardware
solution.
IPSec works in either a transport mode or tunnel mode. In either case there are
Security Associations (SAs) associated with each connection. In the creation of SAs
there are three types of protocols that work together to provide security services:
Encapsulating Security Payload (ESP), AH, and various key management protocols
[26].
2.8.1 Authentication Header Protocol
The role of the IP Authentication Header is to provide data origin authentication, data
integrity, and protection against replay attacks. To protect against replay attacks,
the receiver must check the sequence number of the incoming packets [24].
Algorithm options for the AH protocol are: HMAC MD5 or HMAC SHA-1. The
protocol authenticates the entire packet with the exception of the destination address
[24]. The format of the AH contains various fields as described in Figure 2.14. The
field "Authentication Data" contains the computation of the Integrity Check Value
(ICV). The ICV is the output of the specific HMAC algorithm used. The length
of the "Authentication Data" field is variable but its length must be a multiple of
32 bits for IP version 4 or 64 bits for IP version 6 [24], while non-multiple lengths are
explicitly padded [24].
Next Header (8 bits) | Payload Length (8 bits) | RESERVED (16 bits)
Security Parameters Index (SPI) (32 bits)
Sequence Number Field (32 bits)
Authentication Data (variable length)
Figure 2.14: AH packet format.
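The field layout of Figure 2.14 can be sketched with the Python struct module. This is a minimal sketch under stated assumptions: the function name build_ah is hypothetical, the Payload Length field is computed as the header length in 32-bit words minus two as specified for AH, and the full HMAC MD5 output is used as the ICV purely for simplicity.

```python
import hashlib
import hmac
import struct

def build_ah(next_header: int, spi: int, sequence_number: int, icv: bytes) -> bytes:
    if len(icv) % 4 != 0:
        raise ValueError("ICV must be padded to a multiple of 32 bits")
    total_words = (12 + len(icv)) // 4      # 12 bytes of fixed fields plus the ICV
    payload_length = total_words - 2        # AH length in 32-bit words, minus 2
    fixed = struct.pack("!BBHII",           # network byte order
                        next_header,        # Next Header, 8 bits
                        payload_length,     # Payload Length, 8 bits
                        0,                  # RESERVED, 16 bits
                        spi,                # Security Parameters Index, 32 bits
                        sequence_number)    # Sequence Number Field, 32 bits
    return fixed + icv                      # Authentication Data, variable length

# Hypothetical key and packet contents, for illustration only.
icv = hmac.new(b"shared secret", b"packet contents", hashlib.md5).digest()
header = build_ah(next_header=6, spi=0x1000, sequence_number=1, icv=icv)
```

With a 16-byte ICV the header is 28 bytes long, so the Authentication Data field already satisfies the 32-bit multiple requirement for IP version 4.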
HMACs are distinguished by underlying cryptographic hash functions. HMACs
are keyed algorithms and they provide data origin authentication and data integrity
that are dependent on the distribution of their key. In other words, if a packet is
sent from one party to another, and the receiver uses its key and deems the HMAC
is correct, it shows two things: first, that the HMAC must have been added by the
sender (and is not a forgery) [27], and second, the data within the packet has not
been modified. The performance of an HMAC is dependent on the performance of the
underlying cryptographic hash function [27], which is within the scope of this research.
2.8.2 Encapsulating Security Payload Protocol
The ESP protocol provides confidentiality, data origin authentication, connectionless
integrity, an anti-replay service, and limited traffic flow confidentiality [28]. ESP is
designed to be algorithm-independent and some cipher options are: Data Encryption
Standard (DES), 3DES, RC5, Blowfish, IDEA, CAST, and RC6 [26] [8]. ESP also
supports optional message authentication within its protocol. To provide
interoperability between different implementations the following algorithms are mandatory:
• DES in CBC mode
• HMAC MD5
• HMAC SHA-1
• Null Authentication Algorithm
• Null Encryption Algorithm
Security Parameters Index
Sequence Number
Payload Data (variable length)
Padding
Pad Length | Next Header
Authentication Data (variable length)
Figure 2.15: ESP packet format.
The format for the ESP packet is outlined in Figure 2.15 and is provided to
illustrate the steps involved when using the ESP protocol. When encrypting an ESP
packet the following steps take place [28]:
1. The original data packet is encapsulated into the ESP Payload field. If in
transport mode, only the upper layer transport protocol information is encapsulated.
In tunnel mode the entire data packet is encapsulated into the ESP packet.
2. The Padding field is added as required by the protocol.
3. The result is encrypted using a specified cipher in a specific mode, such as CBC.
Encrypted fields are Payload Data, Padding, Pad Length, and Next Header
fields. An IV vector may or may not be added to the Payload field. This
depends on the mode of operation and settings defined when the SA is created
by the specific connection.
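Steps 1 and 2 above, which produce the data handed to the cipher in step 3, can be sketched in Python. This is a minimal sketch under stated assumptions: the function name esp_plaintext is hypothetical, an 8-byte cipher block is assumed (as for DES in CBC mode), and the padding bytes follow the default ESP convention of counting 1, 2, 3 and so on.

```python
def esp_plaintext(payload: bytes, next_header: int, block_size: int = 8) -> bytes:
    """Build Payload Data | Padding | Pad Length | Next Header, ready for encryption."""
    # Pad so that payload + padding + the two trailer bytes fill whole cipher blocks.
    pad_length = (-(len(payload) + 2)) % block_size
    padding = bytes(range(1, pad_length + 1))      # 1, 2, 3, ... (default ESP padding)
    return payload + padding + bytes([pad_length, next_header])

# Transport mode example: the payload is upper layer data (hypothetical bytes),
# and next_header = 6 identifies TCP.
to_encrypt = esp_plaintext(b"hello", next_header=6)
```

The result is a whole number of cipher blocks, so it can be passed directly to a block cipher operating in CBC mode.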
The speed of the cipher is variable depending on the type, and how it is used
internally (e.g. number of rounds executed). For BITW implementations of IPSec, it
is advantageous to have fast cipher designs available so that data throughput can be
high.
2.9 Concluding Remarks
In this chapter, block ciphers and hash algorithms were introduced and explained
in basic detail to provide the reader an understanding of some of the operations
that take place when implementing these algorithms. IPSec was also described to
provide a general view of why there is a need for fast hardware based ciphers and
hash functions. It is important next to focus on hardware implementation methods
and how they can be applied to cryptographic applications.
Chapter 3
Hardware Architecture
In this chapter, various hardware platforms for encryption algorithms and hash
functions are discussed. The performance advantages of hardware versus software
applications of encryption will be discussed briefly. Next ASIC, FPGA, and coarse grain
technologies will be outlined and illustrated, with applications to encryption
algorithms that are of interest to this research. A high level description of the Chameleon
Systems CS2112 RCP is given with the design process required by the architecture.
3.1 Software vs. Hardware Algorithms
Cryptographic applications can be implemented in many different forms and across
various levels within a data network. The previous chapter showed that ciphers
and hash algorithms use many arithmetic operations (such as multiplication) and
bit-wise manipulations (such as rotation and permutation). For those applications,
specialized hardware is a faster choice than the use of software, which uses general
purpose processors. For example, a 600 MHz processor is incapable of encrypting
a T3 communications line (45 Mbit/s) with 3DES [29]. There is cryptographic
co-processor hardware that can aid the encryption speeds of general software platforms
with respect to protocols such as IPSec. For example, Broadcom's CryptoNetX line
of products offer full IPSec support at speeds up to 4.8 Gbit/s [10].
The lack of architectural support for some cryptographic operations is one reason
software based encryption on general purpose processors is relatively slow compared
to direct hardware implementations. It has been shown that adding architectural
support to a general purpose processor has improved encryption speeds. For instance,
a 59%-74% speedup in encryption was encountered by adding instruction set support
for fast substitutions, general permutations, rotates and modular arithmetic [29].
Overall, ciphers that rely heavily upon substitutions and permutations (such as DES)
benefit from having faster memory access and higher bandwidth, while ciphers that
are based heavily upon arithmetic operations (such as RC5 and RC6) benefit from
architectural support for their operations, such as rotation and multiplication [29].
3.2 Application-Specific Integrated Circuits
ASICs are a technology that is applicable to a wide variety of design areas. When
implementing a design using ASIC technology, a designer will use computer libraries
and simulation tools typically encapsulated within a Hardware Description Language
(HDL). Once simulation of the design is complete, the design enters the placement
and routing phase. The placement and routing phase is highly dependent upon the
technology and process used to fabricate the chip. A general design flow can be seen
in Figure 3.1 [30]. Two of the most popular HDLs are Verilog and VHDL. In the
simulation stages, a behavioral description of the design is created. The designer can
then break each block down into smaller units, possibly to the level of transistors and
gates [30]. There are many design processes that can automate certain parts of this
process depending on the particular process and provided libraries. The final stage
of the design process is to send the placed design for fabrication. Once the chip is
fabricated it can be tested and depending on the results the whole process may need
to be repeated if re-design is required. Industry examples of ASIC designs show that
it takes on average two to three repetitions of the design flow in Figure 3.1 to meet
design specifications [31].
Figure 3.1: A view of the ASIC design process.
There are both advantages and disadvantages to using ASICs for implementation.
The use of standard cells in a design allows the designer to be removed from some of
the underlying technology that is used for fabrication.
ASICs work at higher operating speeds than other technologies and work well
with integrated technologies such as analog cores, microprocessor cores, and high
speed IO [32]. ASICs also use less power than other implementation methods [31].
There are industry arguments that there is a widening gap between process
technology, speed of operation, and ASIC cells that make the process of ASIC design more
difficult and expensive [31]. Problems arise with an increase in operating speed. As
speed increases, transmission line effects start emerging into the connections between
cells. With older technological processes, many of the effects could be hidden inside
the standard cells provided by the manufacturer. Expansion of tool sets and tool
automation are required to address this problem [31]. Long design and fabrication
times are the biggest drawbacks to ASIC technology.
3.2.1 Designs and Performance
While the DES cipher is a popular algorithm to consider for implementation and
performance, there are other algorithms that have been gaining much attention for
their applicability to ASIC technology. There has been ongoing work with the finalists
for the AES to replace DES. Of these ciphers, the Rijndael algorithm was chosen
over ciphers such as RC6, IDEA, Serpent, MARS and Twofish. In the evaluation of
these ciphers, simulations were conducted using Mitsubishi Electric's 0.35 micron
Complementary Metal Oxide Semiconductor (CMOS) ASIC design library [33]. The
AES candidates were designed and simulated to explore bottlenecks in performance
with respect to ASICs, and they do not represent optimizations for performance. The
figures in Table 3.1 are important for exploring what primitive operations affect the
performance of popular ciphers [33].
Cipher      Performance (Gbit/s)
MARS        0.2256
RC6         0.204
Rijndael    1.95
Serpent     0.9316
Twofish     0.3941
Table 3.1: Results from AES candidates in ASIC technology.
This study also made particular mention of which operations were most time consuming
and had the largest effect on performance [33]. With respect to Rijndael,
substitutions, unsigned addition, and bit-wise logical operations had the biggest impact.
With respect to RC6, unsigned integer multiplications had the biggest impact on
performance [33]. The unsigned integer multiplication within RC6 is of interest to this
research.
Another investigation into the suitability of RC6 in ASIC technology was
conducted [34] [35]. Two versions of RC6 were used: a pipelined version optimized for
performance, and an iterative version that allows different modes of operation.
Simulations were performed using 0.5 micron CMOS technology [35], and were evaluated
with respect to throughput, number of transistors used, and chip area used. The
evaluations are given in Table 3.2 [35]. Results from the simulations suggest that
the performance of RC6 was dependent on the multiplication operation.
RC6 Design   Max Speed (Gbit/s)   Transistors   Max Area (sq mm)
Iterative    0.10                 450000        0.023
Pipelined    2.10                 9000000       0.52
Table 3.2: Simulation results from RC6 ASIC designs using 128 bit keys.
3.3 Field-Programmable Gate Arrays
Field-programmable gate arrays (FPGAs) provide a hardware platform that can be
considered a cross between software (use of programmable general purpose processing
units), and specific hardware implementations (ASICs). FPGAs are an example of a
reconfigurable hardware technology. Reconfigurable solutions can potentially provide
a faster operating platform than software, while offering greater flexibility than ASIC
technology [36].
FPGAs were conceptually designed to fit between programmable array logic (PAL)
devices and maskable programmable gate arrays or MPGAs. FPGAs are like PAL
devices in that they can be programmed using an electrical current. FPGAs are
similar to MPGAs in that they can accommodate complex designs within a single
device while utilizing arrays of logical gates [36]. Usually an FPGA is connected to an
SRAM device that contains configuration bits for the design. When the configuration
bits are transferred to the appropriate points on the chip, gates are configured for
use depending on where the software mapper places the design within the chip's
reconfigurable architecture. Overall the performance penalty for designs in an FPGA
versus an ASIC is on the order of five to ten times [37].
Figure 3.2: Abstracted internal FPGA structure.
The structure of the basic building blocks within a typical FPGA is illustrated in
Figure 3.2 [38]. This is taken from the Spartan family FPGA data sheet from Xilinx
Inc. The Configurable Logic Blocks (CLBs) illustrated are used to implement the
user's logic, and the Input Output Blocks (IOBs) are used to provide communication
between the internal structure of the FPGA and its external pins [38]. FPGAs are
deemed to be fine grain reconfigurable devices.
While FPGAs do not perform as fast as ASICs, their reconfigurable characteristics
make FPGAs a very attractive platform for application. In today's marketplace the
development cycle of an ASIC is often longer than the market lifetime of a product.
For instance, the time required to prototype an FPGA is a few weeks while it takes
months to fabricate an ASIC. Another advantage of FPGAs is that they can be
reprogrammed, whereas an ASIC needs to be replaced with a new chip when design
modifications are required [39].
With ASICs, a design is first simulated and synthesized for a particular technology
and then the netlist is passed to an ASIC design house for physical placement and
routing. All FPGA design, simulation, and implementation is done by the designer.
The placement and routing stage is accomplished by software called a mapper and is
dependent on the particular FPGA used. The process of mapping can be problematic
depending on application needs [39].
3.3.1 Implementations and Performance
Preliminary design of AES finalists was conducted in a similar fashion to Section 3.2.
The ciphers were designed to analyze potential bottlenecks in performance based on
primitive operations. The results shown in Table 3.3 are without the presence of
performance enhancing structures such as pipelining [40]. As shown earlier, the main
bottleneck observed for RC6 was the multiplication operation and this is of specific
interest with the design of RC6.
Cipher      Performance (Gbit/s)
Serpent     0.3394
Rijndael    0.3315
Twofish     0.1773
RC6         0.1039
MARS        0.0398
Table 3.3: Some results of FPGA simulations of AES candidates.
A detailed survey of the design of the AES candidates on FPGAs was conducted
using various methodologies. The evaluations were based on the Xilinx Virtex
XCV1000 device with a 40 MHz design constraint. Iterative designs of RC6 using
feedback and non-feedback modes, and optimized pipelines using feedback and
non-feedback modes are given in Table 3.4 [8].
Table 3.4: Some results of FPGA implementations of RC6.
The findings show how pipelining only has an advantage to performance when
non-feedback modes are used, such as ECB cipher mode. During this study the authors
found that the two biggest architectural challenges for RC6 were the implementation
of the multiplication and data dependent rotation operations. The authors made note
that the multiplication operation is actually a squaring and an addition operation
(X(2X + 1) = 2X^2 + X). Within an FPGA the operation can be treated with an
array squarer with summed partial products [8]. Summed partial products are used
in a different manner for the analysis of the multiplication operation used in this
research.
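The identity X(2X + 1) = 2X^2 + X and the idea of forming a square from summed partial products can be illustrated in Python. This is a software illustration only, not the array squarer circuit of [8]; the function names are hypothetical.

```python
MASK = 0xFFFFFFFF

def square_by_partial_products(x: int, width: int = 32) -> int:
    """Form x * x as the sum of shifted copies of x, one per set bit of x."""
    total = 0
    for i in range(width):
        if (x >> i) & 1:
            total += x << i          # partial product contributed by bit i
    return total

def f_via_squarer(x: int) -> int:
    # 2X^2 + X mod 2^32: a squaring, a shift by one bit, and an addition
    return (2 * square_by_partial_products(x) + x) & MASK

def f_direct(x: int) -> int:
    # X(2X + 1) mod 2^32 computed with a general multiplication
    return (x * (2 * x + 1)) & MASK

samples = [0, 1, 3, 0x12345678, 0x80000000, 0xFFFFFFFF]
agree = all(f_via_squarer(x) == f_direct(x) for x in samples)
```

Since the result is reduced mod 2^32, only the low 32 bits of the summed partial products are needed, which limits how much of a squarer a hardware design has to implement.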
An evaluation of the suitability of RC6 and CAST-256 for the Xilinx XC4000
family of FPGA was also conducted [41]. This study achieved an RC6 implementation
with a speed of 37 Mbit/s utilizing 91% of the resources of the device. With respect
to the target device, the overall conclusions were the following [41]:
• The multiplication operation was a primary source for resource utilization.
• Performance was affected by the multiplication operation.
• Pipelining is difficult due to hardware complexity.
• Larger devices will be needed to achieve faster performance.
3.4 Coarse Grain Reconfigurable Architecture
Coarse grain reconfigurable hardware offers a new approach to designs in a
hardware environment. FPGAs were defined to be fine grain architectures because their
reconfigurable elements (CLBs) had datapath widths of one bit [13]. The biggest
problem with FPGAs is that their fine grain nature yields much routing overhead
[13]. Figure 3.3 illustrates the flexibility vs. performance relationship between
hardware technologies [13].
Figure 3.3: Flexibility versus performance of hardware technologies.
Coarse grain architectures commonly support Processing Elements (PEs) that
have word level operations (for example, addition and subtraction may be supported
within one PE). Fast reconfiguration times can allow runtime reconfiguration. Some
coarse grain architectures allow multiple PEs to be combined to make larger data
width PEs. Combining two 32-bit adders to make a 64-bit adder is an example of
this. Coarse grain architectures allow combining PEs with much less overhead than
fine grain platforms, yielding a multi-granular architecture [13].
One of the biggest design challenges of coarse grain architectures is the way in
which the PEs communicate with each other. The connection of multiple PEs forms
what is often referred to as a "fabric". A fabric may be mesh based as shown in
Figure 3.4, where PEs will have connections to their neighbors, either side by side,
four ways or eight ways.
Processing elements can also be connected in linear arrays, illustrated in
Figure 3.5. Within the linear array, each element is connected to its neighbour in a line,
and this arrangement can be used for development of pipelined architectures. If a
pipeline must branch, the PEs must have some type of two dimensional realization
in their structure [15].
Figure 3.4: Examples of 2D mesh connections: four way nearest neighbor (NN)
connections and eight way nearest neighbor (8NN) connections.
Figure 3.5: An example of a linear array configuration.
Finally, multiple layer crossbar switching can be used to connect PEs together,
shown in Figure 3.6. While a crossbar can enable many PEs in a fabric to communicate
with each other, it is usually costly to do this. Some coarse grain architectures
use partial crossbars to communicate. In Figure 3.6, bidirectional pathways are
established where communication pathways overlap.
Figure 3.6: An example of a crossbar configuration.
Processing elements may exhibit one or more of the described types of interconnection
and communication pathways, and they may be unidirectional or bi-directional
in nature. Local and global communication buses can also be used in combination
with these structures. A KressArray style of architecture will support this
particular arrangement, and is illustrated in Figure 3.7 [42].
Figure 3.7: An example of a KressArray configuration. Multiple communication
schemes between processing elements are used.
Coarse grain architectures have varying applications. One application area is DSP
[42]. Many operations and algorithms used in DSP utilize multiplication and addition
operations. For example, to create a Finite Impulse Response (FIR) filter, a designer
could configure and connect processing elements together and then flow the input
signal through the datapath [14]. If the architecture is runtime reconfigurable, the
filter parameters could be changed while processing data, creating an adaptive filter.
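As a software reference for the multiply and add pattern described above, a direct form FIR filter can be sketched in C as follows. This is a minimal sketch and not a fabric mapping; the function name fir_filter and the use of 32-bit integer samples are assumptions made for illustration, and inputs are assumed small enough that the accumulator does not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Direct form FIR filter: y[n] = sum over k of h[k] * x[n - k].
 * Samples before x[0] are treated as zero. Each tap corresponds to
 * one multiply and one add, the operations a PE array would provide. */
void fir_filter(const int32_t *h, size_t ntaps,
                const int32_t *x, int32_t *y, size_t nsamples)
{
    assert(h != NULL && x != NULL && y != NULL && ntaps > 0);

    for (size_t n = 0; n < nsamples; n++) {
        int32_t acc = 0;
        /* Only taps that reach back to an existing sample contribute. */
        for (size_t k = 0; k < ntaps && k <= n; k++) {
            acc += h[k] * x[n - k];
        }
        y[n] = acc;
    }
}
```

Changing the contents of h between calls plays the role of changing the filter parameters at runtime, which is the software analogue of the adaptive filter mentioned above.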
With respect to symmetric ciphers and hash functions, many operations are similar
to those of DSP. An array of PEs can form an encryption array or a pipeline to
implement the cipher of choice. If the architecture is runtime reconfigurable, keys
can be changed or a new cipher put into place while running. Limiting factors for use with
respect to ciphers and hash functions consist of lack of support within the PEs for
some required operations, lack of PE resources to accommodate the cipher, and lack
of communication resources within the reconfigurable fabric.
Coarse grain architectures begin to show a divergence from their fine grain cousins
when addressing application space. A fine grain technology is rather universal in
nature and is meant for deployment in many areas such as DSP, cryptography, control
applications and telecommunications applications. Coarse grain architectures will be
more suited for a particular area, that is, one architecture may be suited for DSP
while another will be more suited for multimedia applications [13].
Coarse grain architectures exhibit some disadvantages over fine grain architectures.
As with FPGAs, processing elements can be left unused if there are not enough
communication resources available. An unused PE in a reconfigurable architecture
is a greater loss of computational resources than in an FPGA, since PEs are larger
than CLBs and are far fewer in number [42]. Another problem can arise if operations
and bit-manipulations require word lengths different than the data width supported
by the PEs, resulting in a waste of computational resources. Power consumption in
general should be less than that of an FPGA on a per implementation basis, but the
communication network must be carefully designed to support low power consumption
[42].
There are various methods for mapping designs onto a reconfigurable fabric. From
the examination of various architectures, the mapping method is highly dependent on
the type of PEs and the communication arrangement used. Some architectures will
go from an HDL description of an algorithm and automatically place PEs to
implement the design (with possible user interaction). Other architectures will try to
translate a high level description of a design in C or C++ to a hardware configuration
within the chip [13].
3.4.1 Survey of Existing Technologies
Surveys of existing coarse grain technologies exist and Table 3.5 is a compilation of
some existing technologies [13] [14]. The architecture field in the table represents
the general communication structure within the reconfigurable fabric and granularity
refers to the communication word width between processing elements.
Name              Architecture        Granularity
MorphICs          2D Array            Undisclosed
Chameleon CS2112  2D Array            32-bit
DReAM             2D Array            8 and 16-bit
CHESS             Hexagon Mesh        4-bit
MorphoSys         2D Mesh             16-bit
REMARC            2D Mesh             16-bit
PipeRench         1D Array            128-bit
Pleiades          Mesh and Crossbar   multi granular
Garp              2D Mesh             2-bit
RAW               2D Mesh             8-bit multi granular
Matrix            2D Mesh             8-bit multi granular
RaPiD             1D Array            16-bit
Colt              2D Array            1 and 16-bit inhomogeneous
KressArray        2D Mesh             selectable multiple NN
DP-FPGA           2D Array            1 and 4-bit
PADDI-2           Crossbar            16-bit
PADDI             Crossbar            16-bit
Table 3.5: Survey of existing coarse grain reconfigurable technologies.
3.4.2 Cryptographic Applications
Thus far, there has not been a reconfigurable architecture designed specifically for the
use of cryptography. DSP is a large field that requires many bit-level operations and
often requires high speeds of operation.
There has been some work into dynamically reconfigurable cipher cores in which
the reconfigurable nature of the chip has had cryptography as a primary design goal.
Chameleon (not to be confused with Chameleon Systems and the CS2112) is a cipher
implemented on a chip with architecture that contains reconfigurable features
specifically for bit-level encryption operations [43]. A block diagram illustrating important
components of the Chameleon cipher chip is given in Figure 3.8.
Figure 3.8: Chameleon cipher chip, designed for encryption.
The Chameleon cipher operates on 64 bits of plaintext and uses a 64-bit key. The
reconfigurable blocks within the architecture are used to generate new subkeys during
execution. The purpose of the Chameleon cipher chip was to add flexibility to the
hardware design, a benefit of utilizing a coarse grain reconfigurable architecture [43].
3.5 Chameleon Systems CS2112
The Chameleon Systems CS2112, also referred to as a Reconfigurable Communications
Processor (RCP), is an integrated system on a chip that contains a 32-bit coarse grain
reconfigurable fabric. The CS2112 was designed in 2000 with high speed signal
processing applications in mind. The architecture is designed to allow high speed parallel
processing of information, fast design to market time, and flexibility with respect to
individual application needs.
The benefit of the CS2112 is that it is a coarse grain reconfigurable system on
a chip. As discussed in Section 3.4, runtime reconfigurability allows the designer to
apply different algorithms during operation of the chip. The CS2112 incorporates
two reconfigurable fabrics; one is not currently processing and is referred to as the
background plane, while the other is actively processing and is referred to as the
active plane. The swapping of active and background planes is described in Figure
3.9 and can be accomplished in one clock cycle [14]. The swapping of planes allows
the user to load the background plane from external memory during runtime.
Figure 3.9: Process of swapping active and background fabrics.
The CS2112 coarse grain processing elements can also be reconfigured individually
during runtime based on control logic that is associated with each section of the
reconfigurable fabric. One of the advantages that the CS2112 has over traditional
ASICs is that it is reconfigurable, and the fact that it is reconfigurable at runtime makes it
more useful than FPGAs.
3.5.1 CS2112 High Level Architecture Description
A high level description of the CS2112 is illustrated in Figure 3.10. The components
of the CS2112 communicate over a 128-bit Roadrunner Bus. The Argonaut RISC
Core (ARC) processor provides the CS2112 with a general microprocessor on chip to
control the reconfigurable fabric, run user code, and control the other components
of the CS2112. The ARC processor has been optimized for the CS2112, and it
employs a four-stage pipeline and 54 general purpose 32-bit registers. There is an 8 kbyte
instruction cache and a 4 kbyte data memory. The Direct Memory Access (DMA)
subsystem contains sixteen channels for transferring data amongst the various
modules within the CS2112 [14].
Figure 3.10: High level block diagram of CS2112 components.
The CS2112 has four banks of 40 Programmable Input Output (PIO) pins, each bank
providing an I/O bandwidth of 3.2 Gbit/s. When all four banks are used
the total bandwidth of the CS2112 is 12.8 Gbit/s, which is the highest data transfer
speed of the RCP. The PIO pins of the CS2112 allow it to be integrated into larger
systems such as interfaces with FPGAs, analog to digital/digital to analog blocks and
external memory modules.
The reconfigurable fabric illustrated in Figure 3.11 is broken into four slices, each
containing three tiles. Each tile contains seven Data Path Units (DPUs) and two 16 by
24-bit multipliers. The basic PE building blocks within the fabric are the DPUs, with
added support from multipliers and memories. The Local Store Memories (LSMs) in
the CS2112 fabric provide a memory structure so the DPUs can store and retrieve
data.
Figure 3.11: High level decomposition of reconfigurable fabric.
The CS2112 DPU structure has multiple inputs from local, global, and feedback
communication pathways and buses, giving a communication arrangement shown in
Figure 3.12 [14]. A DPU can communicate locally with 8 DPUs downward or 7
upward, otherwise global routing must be used through vertical and horizontal buses.
If global routing is used, each tile has eight 32-bit data buses while each slice has
three groups of three 32-bit data buses (one per tile). The process of placing DPUs
in a design is done through a graphical floorplanner that comes with the software
tool set (C~SIDE tools) for the CS2112.
Figure 3.12: Communication arrangement between processing elements.
Each tile contains a Control Logic Unit (CLU) that provides support for design of
Finite State Machines (FSMs). FSMs are used to control the flow and sequencing of
the DPUs that make up the datapath. Each DPU also has an associated Control Store
Memory (CSM) (located in the CLU). The CSM stores the various configurations
for the DPU and during runtime the DPU can be reconfigured by the CLU. The
Programmable Logic Array (PLA) is a component within the CLU that provides
combinational logic for the user's FSMs.
3.5.2 CS2112 Data Path Unit
The most important functional building block within the CS2112 fabric is the DPU.
Figure 3.13 is a detailed block diagram illustrating the various inputs, outputs, and
functional blocks within a DPU.
The inputs to a DPU are abstracted to eight inputs on both the A and B sides.
Depending on the routing setup, these sixteen inputs originate from the local and
global interconnects illustrated in Figure 3.13. To use a local interconnect the input
must come from the output register of a DPU that is within seven DPUs below, or
eight DPUs above, otherwise a global interconnect must be used. When implementing
for the CS2112, the designer must keep in mind that communication resources will
place restrictions on the design. It is important to note that a DPU must be used to
address and transfer data from an LSM. In this process only even numbered DPUs
can read from an LSM, and odd numbered DPUs can write to an LSM.
Figure 3.13: Datapath unit block diagram.
The CS2112 DPU contains three 32-bit registers, one on each input, and one on
the output of the DPU. These registers can be configured to hold their value or to load
a new value. The registers on the inputs can be bypassed while the output register
cannot [14]. The delay of data flowing through a DPU is dependent on whether the
input registers are loaded or not. The delay will be two clock cycles if the registers
are enabled, or one clock cycle if they are bypassed.
The Arithmetic Logic Unit (ALU) within the DPU structure supports C and Verilog
operations. Some of the bit-level operations that the ALU supports are addition,
subtraction, bit-wise logical operations, and equality testing. The ALU can also be
set to pass data through without modification. In addition to ALU functionality,
each DPU contains a 32-bit barrel shifter that is also capable of bit-wise AND/OR
masking, word swapping, byte swapping, and word duplication. The configuration of
a DPU is done through the Verilog HDL. Section 3.5.6 provides a more detailed look
at how the fabric is programmed during the design phase.
3.5.3 CS2112 Local Store Memory
Within each tile on the CS2112 fabric are four LSMs. Each LSM is 32 bits wide and
128 locations deep. The LSM primitives can be connected together to provide deeper
memories (using a special chain input/output). With the assistance of an extra DPU,
wider memories can also be configured, but for the purposes of this research these
special memory configurations were not utilized.
3.5.4 CS2112 Multiplier
While many DSP algorithms require multiplication, it is not common in the area of
cryptography due to the fact that it is a computationally intensive operation. For
this research, the CS2112 multipliers will be used with the RC6 cipher. Each tile
within the fabric contains two multipliers illustrated in Figure 3.14 [44].
Figure 3.14: Multiplier block diagram.
Since the CS2112 was designed with DSP applications in mind, the hardware
multipliers on the fabric are designed to operate on fixed point two's complement
numbers [14]. The multipliers on the fabric will use two 16-bit operands to produce
a signed 32-bit result. With respect to RC6, the multiplication operation required
is an unsigned integer multiplication (mod $2^{32}$). There will be further discussion on
how the multiply operation was implemented on the fabric for RC6 in Section 4.3.1.
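To make the mismatch between 16-bit multipliers and a 32-bit product mod $2^{32}$ concrete, the following C sketch builds the RC6-style product from 16-bit halves. It is a generic textbook decomposition and is not claimed to be the method used on the fabric; it assumes unsigned 16 by 16 partial products, whereas the CS2112 multipliers are signed, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compute (a * b) mod 2^32 using only 16-bit by 16-bit unsigned
 * partial products. Writing a = a_hi * 2^16 + a_lo and likewise for b,
 * the a_hi * b_hi term is a multiple of 2^32 and drops out. */
uint32_t mul32_from_16bit_parts(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t low    = a_lo * b_lo;               /* full 32-bit partial product */
    uint32_t cross  = a_lo * b_hi + a_hi * b_lo; /* only its low 16 bits survive */
    uint32_t result = low + (cross << 16);

    /* Cross-check against the native wrapping multiplication. */
    assert(result == a * b);
    return result;
}
```

Three 16-bit multiplications suffice because the high partial product never reaches the low 32 bits. A signed multiplier would need an additional correction step, which this sketch does not attempt to model.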
3.5.5 CS2112 Control Logic Unit
The CLU is the control mechanism for the reconfigurable fabric of the CS2112. The
CLU provides control over the DPU configurations, synchronous state machines and
conditional operation. There is one CLU in each tile on the fabric and communication
pathways are illustrated in Figure 3.15 [14]. The following are key components of the
CLU:
• Control Store Memories (CSMs)
• State Register Blocks (SRBs)
• PLAs
• MUXing Plane
Figure 3.15: CLU communication and interaction with processing elements.
The CSMs store the configuration information for each DPU. They are eight locations
deep and provide eight different DPU configurations per DPU. The multipliers
also have CSMs associated with them which allow four configurations per multiplier.
There are eight SRBs in a CLU and the SRBs are used to register the PLA outputs.
The PLA has 16 inputs, 32 outputs and 32 product terms that make up the required
control logic. The MUXing plane controls the inputs to the PLA and inputs can come
from various sources including outputs from other PLAs in the same slice (local
communication), PLAs in another slice (global communication), or feedback from DPU
flag signals. For example, if a DPU was implemented as a counter, when the count
was completed the DPU can communicate back to the CLU through a flag signal.
PLAs can communicate with each other globally by the use of broadcast registers.
3.5.6 Design Process For The CS2112 Fabric
Designing for the CS2112 involves the use of a Verilog simulator, waveform viewer,
GNU C programming language tools (gcc, gdb, etc.), and the C~SIDE software
tool set developed for the CS2112. Figure 3.16 is a block diagram illustrating the
design flow for the CS2112.
Chameleon recommends that the starting point for development is a C function
of the algorithm that will be used [14]. The development process illustrated in
Figure 3.16 occurs in a Unix environment, with the exception of testing with the
CS2112 development board which is based in a Windows NT4 environment.
There are certain guidelines which are required when converting a C function to
a fabric function. The C function must not call any functions itself and if other
functions are called, the code must be restructured. Data can be passed by the use
of function arguments aligned to 128-bit boundaries, and the use of global values is
invalid. Finally, there is no floating point support in the conversion from a C function
to a fabric function.
Figure 3.16: Design flow for the CS2112.
The ARC processor is responsible for setting up and running the fabric. Fabric
parameters are set up through CS2112 BIOS function calls. Data is passed into
the fabric through streaming DMA (transfer while the fabric is running), or through
independent DMA (transfer before and after the fabric has run). Once the fabric is
set up, a start signal is sent telling the control logic it can begin the operation. Once
the fabric is finished, it sends a done signal to the ARC. Data is transferred out of
the fabric in the same manner it was passed in [14]. To execute the fabric function
the ARC calls a function that is defined by a #pragma statement. Appendix B.1
and Appendix B.2 are examples of C code that the ARC processor runs within the
CS2112 to control the fabric.
The implementation stage in Verilog requires the writing of structural Verilog code
to represent the function. The designer has much flexibility at this point because the
basic building blocks are DPUs, LSMs, and MULTs. Once the datapath is created
by wiring together the primitives, control logic is written as Register Transfer Level
(RTL) state machines with input signals coming from DPU flag outputs. The DPU
module is configured by ORing instruction mnemonics together to give a configuration
bit stream that is stored within a CSM. The CSM will use the bit stream to configure
the fabric at runtime. Designing for the CS2112 requires that all Verilog code be
encapsulated within a top level module that receives only start, done, reset and
clock signals. Appendix A.2 gives an example of a top level module definition. Within
the top level module is the logic for the datapath and controller. Appendix A.3 gives
example Verilog code for controller design, and Appendix A.4 gives example Verilog
code for datapath implementation.
Figure 3.17: Screen capture of the graphical floorplanner.
Once the Verilog design is tested and deemed to be correct through a waveform
viewer, the appropriate Verilog modules are loaded into the C~SIDE tools. Using
these tools, the design is checked for timing violations and it can be carried on to the
layout phase. The software tool set tries to place cells for layout within the fabric
using a simplistic approach that is successful for small designs.
Larger designs require the user to lay out the design by hand. Figure 3.17 is
an illustration of the C~SIDE interactive graphical floorplanner. Once the design
is laid out and is deemed to be "routable" by the floorplanner, the design can be
either simulated with the chip simulator that comes with the tools, or tested on a
development board provided by Chameleon Systems. Both methods were used to test
implemented designs in this research.
Chapter 4
Symmetric Block Cipher Design and
Implementation
In this chapter, the designs and results of RC5 and RC6 are given. Various design
methodologies and issues with the architecture of the CS2112 are discussed and there
are a total of five different designs between RC5 and RC6. Conventions used within
diagrams of datapath and control structures are outlined.
For the purpose of research conducted with the CS2112, two design strategies were
employed for the implementation of each cipher. Both strategies provided a wide
survey of performance, ease of implementation, and efficient use of fabric resources. With
RC5, an iterative design approach was used as well as a pseudo-pipelined approach
that used an iterative design as its basic building block. With respect to RC6, a more
efficient pipelined approach was employed, a design method that lends itself toward
the intended structure of the CS2112.
4.1 Diagram Use
For designs on the CS2112, many diagrams are used to describe algorithm design in
the fabric. Design structures within the reconfigurable fabric are illustrated as block
diagrams to show component interaction. Figure 4.1 shows some examples of how
DPUs are used in design diagrams.
Figure 4.1: Examples of configured DPU structures.
If a component within a DPU is configurable, the configuration is shown inside
the structure. The barrel shifter (illustrated on the left) is configured to shift the
input word twenty-seven bits to the right. The logical mask structure (illustrated on
the right) is configured for an AND mask with a vector that is initialized in the A
side register. Initialized registers are shown with an arrow intersecting horizontally
with the value labeled. ALU instructions are labeled within the structure and if an
ALU has multiple configurations they are all illustrated within the diagram. A flag
output from the ALU is represented by a horizontal arrow coming out of the ALU,
while DPU CSM instructions are given by a horizontal arrow into the ALU structure.
Memory structures are abstracted in two different forms, as shown in Figure 4.2.
Both structures contain a fabric LSM for memory space, and a DPU for memory
address generation. Within the structure, information is given about the memory
module and its particular use to the algorithm being implemented. For the purposes
of this research, all array elements within the memory are 32-bit words.
Figure 4.2: Two methods for describing memory structures containing one LSM and
one DPU.
Some complex multiple DPU structures within a datapath are illustrated as blocks,
with their respective operation labeled within. This is to reduce the complexity of
the descriptions and to aid in the legibility of the diagrams.
4.2 RC5 Designs
The following subsections illustrate various implementations of RC5 on the CS2112.
The designs are presented in the order in which they were created, with subsequent
designs developed as more experience was gained with the C~SIDE tools and in
simulation with Cadence VerilogXL.
4.2.1 RC5 Simple Iterative Design
For the first design and analysis of RC5 on the CS2112, a simplistic design approach
was used, both to gain familiarity with the development environment and to maximize
the chances of a successful design. The simple iterative design of RC5 operates as
follows:
1. Two 32-bit words of plaintext are used as input into the cipher as scalars from
the ARC processor. These words of plaintext are stored in DPU input registers
for simplicity, and are immediately accessible by the datapath.
2. A half of a round of RC5 is implemented in the fabric.
3. The associated control logic either configures a DPU to operate on the data and
pass the information to the next stage or it configures a DPU to hold its value.
4. Once a half round is completed the values swap and are loaded into the top of
the datapath for the next half round. Once both half rounds are completed, the
round description given in Figure 2.5 is finished.
5. When an appropriate number of rounds have been processed the controller will
assert the done signal and the ARC processor will transfer the ciphertext from
the appropriate registers as return values from the fabric function call.
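The steps above can be mirrored in software as a loop over half rounds with a swap between them. The following C sketch assumes the standard RC5 round with 32-bit words; the function names and the flat subkey array S are illustrative choices and are not taken from the fabric design or its appendices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Left rotation of x by the low five bits of y. */
uint32_t rotl32(uint32_t x, uint32_t y)
{
    y &= 31u;
    return (x << y) | (x >> ((32u - y) & 31u));
}

/* One half round of RC5: ((a XOR b) <<< b) + subkey. */
uint32_t rc5_half_round(uint32_t a, uint32_t b, uint32_t subkey)
{
    return rotl32(a ^ b, b) + subkey;
}

/* Iterative encryption: the same half round is reused 2 * rounds
 * times, swapping the two words after each pass. S must hold
 * 2 * rounds + 2 subkeys. */
void rc5_encrypt_iterative(const uint32_t *S, unsigned rounds,
                           const uint32_t pt[2], uint32_t ct[2])
{
    assert(S != NULL && rounds > 0);

    uint32_t a = pt[0] + S[0];   /* initial subkey additions */
    uint32_t b = pt[1] + S[1];

    for (unsigned h = 2; h < 2u * rounds + 2u; h++) {
        uint32_t t = rc5_half_round(a, b, S[h]);
        a = b;                   /* swap: the untouched word moves up */
        b = t;                   /* the new value becomes the other word */
    }
    ct[0] = a;                   /* an even number of swaps restores order */
    ct[1] = b;
}
```

Because every round consists of two half rounds, the words end in their original positions, so no final swap is needed before the ciphertext is returned.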
A high level abstraction of the controller and datapath for the simple iterative RC5
design is shown in Figure 4.3. The data dependent rotation is shown with inputs,
outputs, and control signals labeled. The controller is an FSM in Verilog, and was
mapped to the CLU within the fabric through the C~SIDE tools.
Initially the datapath was designed based on the required operations of the
algorithm. Multiple DPUs were used to build the data dependent rotation module
because this operation is not supported within the architecture of a single DPU. By
using an iterative approach, each operation in the algorithm can be translated to a
DPU configuration. As lines of code are executed in a sequential program, data flows
through the datapath in the form of a hot spot of activity, while DPUs above and
below the datapath mostly hold previous values.
The roles of the datapath elements for implementation of the algorithm described
in Figure 2.5 are as follows:
• DPU1: A + S[0] and holds the modified value for the completion of the round.
• DPU2: B + S[1] and holds the modified value for the completion of the round.
• DPU3: Bit-wise exclusive OR (⊕) operation required by RC5.
• DPU4: Addition of a subkey value to the data.
Figure 4.3: Abstracted block diagram of simple iterative RC5 design.
There are advantages to using this iterative approach in the design. Firstly, since
many DPUs are holding previous values, problems with race hazards and mistiming
due to latency in the datapath are simplified. Overall fewer DPUs are used than in a
pipelined approach because DPUs are not used solely for the purpose of delay, as is
required for a pipelined design.
A disadvantage to a simple iterative approach is that the datapath needs to be
configured while running by the controller as the hot spot of activity progresses. A
complex controller results from timing the configuration of the datapath in this
fashion. For example, the two DPUs that store the initial words of plaintext, A and B,
must initially add the first two subkeys to the plaintext and then hold the result.
Later in the execution of a round, these DPUs must load in the processed words
from the bottom of the datapath and store the values in their registers. These are
three different roles that the same physical structures must play during the
operation of RC5. The controller must be synchronized to switch configurations with the
progression of these roles. The task of designing a complex controller can become
quite cumbersome for the CS2112 and should be avoided for larger designs. Smaller
sub-controllers should be used or a less complex datapath should be designed. Also,
since there is usually a hot spot of activity traveling through the datapath, the fabric
is under-utilized, which results in a lower performance than a pipelined approach.
The performance of a design on the CS2112 can be estimated based on the amount
of latency through the datapath. If a DPU uses its input registers, it takes two clock
cycles to provide the output. It takes one clock cycle to load the output register with
the output (this register cannot be bypassed), and one clock cycle to load the input
registers. If the input registers are not used it takes one clock cycle for a DPU to
produce its output. If data is to be accessed from an LSM it takes three clock cycles
for the address generator DPU to produce the data from when the address was input
to the DPU. Writing to an LSM is immediate and takes one clock cycle.
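These latency rules lend themselves to a simple back-of-the-envelope estimator. The C sketch below adds the per-stage cycle counts stated above along a purely serial path; the type and function names are hypothetical, and a real design may overlap stages, so the result is only a rough upper-level estimate.

```c
#include <assert.h>
#include <stddef.h>

/* Kinds of stage on a serial path, with the cycle costs quoted in the text. */
enum stage_kind {
    DPU_REGISTERED,  /* DPU with input registers enabled: 2 cycles */
    DPU_BYPASSED,    /* DPU with input registers bypassed: 1 cycle */
    LSM_READ,        /* LSM read through an address generator DPU: 3 cycles */
    LSM_WRITE        /* LSM write: 1 cycle */
};

unsigned stage_cycles(enum stage_kind kind)
{
    switch (kind) {
    case DPU_REGISTERED: return 2;
    case DPU_BYPASSED:   return 1;
    case LSM_READ:       return 3;
    case LSM_WRITE:      return 1;
    }
    return 0; /* not reached for valid enum values */
}

/* Total latency of a serial chain of stages, in clock cycles. */
unsigned path_cycles(const enum stage_kind *path, size_t n)
{
    assert(path != NULL || n == 0);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += stage_cycles(path[i]);
    }
    return total;
}

/* Throughput in Mbit/s for a block of `bits` processed in `cycles`
 * clock cycles at `clock_mhz` MHz. */
double throughput_mbit_per_s(unsigned bits, unsigned cycles, double clock_mhz)
{
    assert(cycles > 0);
    return (double)bits * clock_mhz / (double)cycles;
}
```

For instance, a hypothetical path of one registered DPU, one bypassed DPU and one LSM read costs six cycles. A hypothetical 64-bit block that needs 128 cycles at 100 MHz would give 50 Mbit/s; these numbers are made up for illustration and are not measurements from this research.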
For the implementation of the data dependent rotation, five DPUs were used.
Since there is not a data dependent rotation primitive in C, the following definition
was used for the left data dependent rotation [17]:
#define ROTL(x,y) (((x)<<(y&(w-1))) | ((x)>>(w-(y&(w-1)))))
The << is the left logical shift operation, >> is the right logical shift operation, &
is the bit-wise logical AND operation, and | is the bit-wise logical OR operation. The
definition for the rotation module within RC5 was taken directly from the above C
declaration.
Figure 4.4: Structural diagram of data dependent rotation.
There are two possible inputs for the right operand into the rotation module.
The control logic was designed to use right operand one during the first round, then
use right operand two during subsequent rounds. This design choice was made to
correct a glitch that would occur due to mistiming in the datapath. To control this
behaviour, a control signal is used from the CLU into the appropriate DPU and can
be seen in Figure 4.4. The initial values of registers are labeled within the register
structures and are in hexadecimal. The BBA configuration of the barrel shifter will
shift the input by the value on the A input. If bit six of the input is one, the shift
will be left. If bit six is set to 0, then the shift is right. The following is a more detailed
explanation of the function of each structure within the rotation module:
• DPU1: y&(w-1) required by the rotation. ALU_OR is used here to set bit six
to '1'.
• DPU2: (x) << (y&(w-1)).
• DPU3: w - (y&(w-1)), and uses an AND mask to set bit six to '0' for a right
shift.
• DPU4: (x) >> (w - (y&(w-1))).
• DPU5: The final bit-wise OR operation for the rotation.
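The five roles listed above can be followed step by step in C. This sketch only mirrors the arithmetic of each stage for w = 32; it does not model the barrel shifter's direction bit or any timing, and the function name is hypothetical. One assumption differs from the macro: the right shift amount is masked so that a rotation by zero never shifts by the full word width, which is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

#define W 32u  /* word size in bits */

/* Data dependent left rotation split into five stages, one per DPU role. */
uint32_t rotl_five_step(uint32_t x, uint32_t y)
{
    uint32_t s     = y & (W - 1u);        /* stage 1: y & (w-1) */
    uint32_t left  = x << s;              /* stage 2: x << (y & (w-1)) */
    uint32_t r     = (W - s) & (W - 1u);  /* stage 3: w - (y & (w-1)), masked */
    uint32_t right = x >> r;              /* stage 4: x >> (w - (y & (w-1))) */

    /* The two shift amounts always add up to a whole number of words. */
    assert((s + r) % W == 0u);

    return left | right;                  /* stage 5: final bit-wise OR */
}
```

For example, rotl_five_step(0x12345678, 4) moves the top hex digit to the bottom and yields 0x23456781, and a rotation amount of 36 gives the same result because only the low five bits of y are used.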
The control logic and the above datapath were simulated using Cadence VerilogXL.
The design was then imported into the C~SIDE tools and timing and design
rules were checked. Automatic placement was attempted but was unsuccessful,
therefore manual routing of DPUs, LSMs, and control state registers was performed.
The C~SIDE tools do not attempt to re-arrange placement upon encountering a
non-routable design. A design will be deemed routable if all communication paths between
DPUs and LSMs through local and global communication are available and valid. An
annotated illustration of the floorplan layout of the simple iterative design of RC5 is
given in Figure 4.5.
After routin g of the design, the C code w~<; modified to genera te ciphert-ext from
within the ARC processo r and Irom a call to the fahric function. The ciphertext Wll.'l
thcn compared to verify correc t opera tion . Th e design WaH verif ied with the chip
simulator for the CS2112 u nd Oil t ile Chameleon SYIlW IIIS CS2112 developme nt 1110(1-
ute. Th e resource usage of thc simple iter ati ve version of HC5 is given ill Ta ble 4.1.
Sect ion .4.. .1 is the \'. 'l'ilog description of th e da rapat h. The cont rol logic for the dat -
apath required 13 "ta li'S ....-ith seven 3--bit output signals , and 1"'-0 I-bit fiag I>ignal..
62
Figure 4.5: G'" SIDE floorplarmer screen shot of simple iterative RC5 design.
from the datapath (not including start and done signals), and is illustrated ill Sec-
tion A.3 . Both th e datapath and contro ller are abst ract ed to a top level module that
the C"""SIDE to ols will interpret as the fabric function, and this module is described
in Section A.2
Resource   Count                                  Total
           Slice 0   Slice 1   Slice 2   Slice 3
DPU        12        0         0         0         12
LSM        1         0         0         0         1
MUL        0         0         0         0         0
Table 4.1: Resource usage for the simple iterative version of RC5.
After the testing of this design, simulation results and waveforms from the development module were used to measure the performance of the design. Table 4.2 shows both the total number of clock cycles from the start of the fabric to when the done signal is sent to the ARC processor and the number of clock cycles required for the data dependent rotation. The time required for the rotation is variable between two and three clock cycles because the operands may arrive at the module inputs at different times, changing the critical path through the module.
Table 4.2: Timing information for the simple iterative version of RC5.
Based on a 100 MHz clock in the fabric, the number of clock cycles between the start and done signals, and considering that 64 bits of data are being processed during operation, the simple iterative version of RC5 was determined to operate at 40.7 Mbit/s.
4.2.2 Two Half-Round, Full Slice Version of RC5
The next phase of research involved the application of RC5 such that it more efficiently used the resources of the reconfigurable fabric. Using a variation of the design in Section 4.2.1, two half rounds were placed onto a slice allowing two separate plaintext pairs to be processed in parallel. A high-level abstraction of the two round implementation of RC5 is given in Figure 4.6.
Based on the simple iterative design, some changes had to be made to allow this design to fit into one slice. The simple half-round uses twelve DPUs in total. A slice in the fabric contains 21 DPUs over three tiles. The following changes were needed to fit the design into a single slice:
• One DPU was removed from the data dependent rotation.
• The control state machine was re-designed to operate with only one counter, removing one DPU from the design.
[Figure: a datapath module containing two half round datapaths, alongside a controller module.]
Figure 4.6: High level abstraction of the two half-round design of RC5.
A DPU was removed from the data dependent rotation by feeding back and changing configurations of another DPU within the rotation module. To do this, a DPU within the rotation module must be sequenced by the controller during the execution of every round. An illustration of the four DPU data dependent rotation is given in Figure 4.7. From the modification of the rotation, the controller and the datapath elements required by the control logic, the full slice implementation of RC5 was accomplished using 10 DPUs in total. Resource usage is given in Table 4.3 for this design.
Figure 4.7: Four DPU implementation of the data dependent rotation (Left Operand <<< Right Operand).
The DPUs described in Figure 4.7 have the following roles:
• DPU1: y&(w - 1) required by the rotation. ALU_OR is used here to set bit six to '1'.
• DPU2: (x) << (y&(w - 1)).
• DPU3: (x) >>> (w - (y&(w - 1))).
• DPU4: Configurations: (1) uses bit-wise OR to provide the final rotation output, (2) provides DPU3 with (w - (y&(w - 1))) and (3) uses an AND mask to set bit six to '0' for a right shift.
Figure 4.8: Screen capture of the two half-round RC5 fabric function.
By interleaving and mixing the locations of DPUs from both half rounds, the entire datapath makes use of local communication. Global broadcasting should only be used when needed, and requires receiving DPUs to have registered inputs.
Resource   Count                                  Total
           Slice 0   Slice 1   Slice 2   Slice 3
DPU        19        0         0         0         19
LSM        2         0         0         0         2
MUL        0         0         0         0         0
Table 4.3: Resource usage for the two half-round design of RC5.
The design was fully implemented in Verilog and successfully laid out with the floor planner in the C~SIDE tool set. Simulation results were used to measure performance. Table 4.4 shows both the total number of clock cycles from the start of the fabric to when the done signal is sent to the ARC processor and the number of clock cycles required for the data dependent rotation.
Table 4.4: Timing for the full slice design of RC5.
The full slice version takes two blocks of 64-bit plaintext (128-bit total input) and has a throughput of 65.3 Mbit/s (with a fabric operating frequency of 100 MHz). The performance figure is not quite double that of the simple iterative design because processing time is lost when the top level controller receives a count done signal from the counter, and when the control signals are sent to the datapath to start the next round.
4.2.3 Full Fabric RC5 Design
Maximum use of the fabric was investigated by copying the design from Section 4.2.2 to the remaining three slices. Each slice iterates for three rounds and passes its data off to the next stage. This style is a pipelined method that uses an iterative design as its basic building block. The function of each half round module is different from Section 4.2.2. Instead of passing the data down locally to a half round module in the same slice, it broadcasts the data to the next slice. Therefore this design can be considered to be two separate RC5 pipelines. A floor plan layout of the full fabric design with the top-level control logic excluded is shown in Figure 4.9.
Figure 4.9: Screen capture of the full fabric RC5 implementation.
Simulation of this design was carried out by using the datapath and control logic associated with each half round. Top level control was done by the assertion of the control signals from the test bench. The test bench for the design is given in Section A.7 to illustrate the role a top level controller would need to play if implemented on the CS2112 and to allow for full simulation of the datapath. From the simulation results, it takes 214 clock cycles to receive the done signal from the fabric, and since the pipeline accommodates a total of 512 bits of plaintext, a throughput of 237 Mbit/s is achieved.
Resource   Count                                  Total
           Slice 0   Slice 1   Slice 2   Slice 3
DPU        18        18        18        18        72
LSM        2         2         2         2         8
MUL        0         0         0         0         0
Table 4.5: Resource use for the full fabric version of RC5 (control logic excluded).
4.2.4 Summary of RC5 Results
The three versions of RC5 thus far have all used an iterative half-round building block. The first simple iterated version utilized 14.3% of the total DPUs within the fabric. RC5 was designed for operation on a 32-bit general processing platform, such as in desktop computers and smartcards [17]. The fabric of the CS2112 suits RC5 well because the data width between its processing elements is 32 bits wide and the operations supported within the DPU are also 32-bit operations. The biggest restriction to a fast iterative RC5 design is that there is no data dependent rotation operation contained within a DPU. The data dependent rotation accounts for roughly 42% of the resources used by all three designs. Performance is increased by arranging the iterative half-rounds to provide a multistage pipeline. This method of encryption does not allow the cipher to be used in a feedback mode such as CBC mode.
4.3 RC6 Designs
The following sections describe the work done with RC6 on the CS2112. The fully implemented design of RC6 uses knowledge gained from RC5 on the CS2112. The most challenging component of the design of RC6 on the CS2112 is performing the $X(2X + 1) \bmod 2^{32}$ operation during the execution of a round. There are more operations to be performed in a round of RC6 than in RC5, therefore it will be assumed that more resources will be used on the fabric of the CS2112.
4.3.1 Unsigned 32-bit Integer Multiplication
The reconfigurable fabric of the CS2112 has two multipliers present in each tile. These multipliers operate on signed floating point values. With respect to RC6, the multipliers were used in the 16-bit mode, where two 16-bit operands multiply to generate a signed 32-bit result. The operation in RC6 is the unsigned integer multiplication mod $2^{32}$. That is, two 32-bit operands multiply together to give a 32-bit result. To produce a mod $2^{32}$ result, just the least significant 32 bits are taken.
To produce a 32-bit unsigned multiplier out of 16-bit signed multipliers, the operands must undergo special processing comprised of two steps. The first step forms a partial product with the sign bits masked. The second step is to incorporate the effect of the sign bits into the partial product. The 32-bit operands must be split into four 16-bit operands. The most significant bit of each 16-bit operand must be masked because these represent sign bits to the multipliers on the fabric, and not magnitude bits. The 16-bit masked operands must then be multiplied and the results summed together after appropriate shifting to preserve the magnitude of the intended multiplication. This process is illustrated in Figure 4.10 with the results of the 16-bit multiplications represented by 32-bit variables temp1, temp2, and temp3.
[Figure: two 32-bit operands are multiplied; each is split into high and low 16-bit halves, and the sign bit of each half is masked to 0, leaving 15 magnitude bits per half.]
Figure 4.10: Creating a 32-bit unsigned integer multiplier.
Variables temp1 and temp2 must be logically shifted by 16 bit positions to account for the magnitude of the result. When temp1, temp2, and temp3 are summed we have a partial result that represents the unsigned integer multiplication mod $2^{32}$ of two 32-bit numbers, without the contribution of the sign bits that were masked off.
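As a rough software illustration of this partial product, the sketch below forms $(X')^2 \bmod 2^{32}$ from products of 15-bit magnitudes, as a 16-bit signed multiplier would see them. The function name `masked_square` is hypothetical, and the assignment of temp1 and temp2 to the high-by-low and low-by-high products is an assumption based on the surrounding text; the high-by-high product is omitted because it only contributes to bits at or above $2^{32}$.

```python
MASK32 = 0xFFFFFFFF


def masked_square(x: int) -> int:
    """Return (X')^2 mod 2^32, where X' is x with bits 16 and 32 cleared."""
    hi = (x >> 16) & 0x7FFF   # high half, sign bit (bit 32) masked
    lo = x & 0x7FFF           # low half, sign bit (bit 16) masked
    temp1 = hi * lo           # high x low product
    temp2 = lo * hi           # low x high product, equal to temp1 when squaring
    temp3 = lo * lo           # low x low product
    # temp1 and temp2 are shifted left by 16 before the three terms are summed.
    return ((temp1 << 16) + (temp2 << 16) + temp3) & MASK32


x = 0x12345678
x_prime = x & 0x7FFF7FFF
print(masked_square(x) == (x_prime * x_prime) & MASK32)
```

Because every multiplier input is at most 15 bits wide, each product fits in a non-negative signed 32-bit result, which is the point of masking the sign bits first.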
Consider now the effect of the sign bits that were masked off to obtain the initial partial product. The multiplication operation in RC6 is $X(2X + 1) \bmod 2^{32} = (2X^2 + X) \bmod 2^{32}$ and can be viewed as squaring X, left shifting the result by one and an addition operation. Consider now the contribution of the sign bits to the operation $X^2 \bmod 2^{32}$. First let $X'$ be defined as X with bits 16 and 32 being set to zero. The following expressions can be written:
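\begin{align}
X^2 \bmod 2^{32} &= (X' + S_L 2^{15} + S_H 2^{31})(X' + S_L 2^{15} + S_H 2^{31}) \bmod 2^{32} \tag{4.1} \\
X^2 \bmod 2^{32} &= (X')^2 + S_L(2^{15}X') + S_H(2^{31}X') + S_L(2^{30}) \notag \\
&\quad + S_L S_H(2^{46}) + S_L(2^{15}X') + S_H(2^{31}X') \notag \\
&\quad + S_H S_L(2^{46}) + S_H(2^{62}) \bmod 2^{32} \tag{4.2} \\
X^2 \bmod 2^{32} &= (X')^2 + S_L(2^{16}X') + S_H(2^{32}X') + S_L(2^{30}) \bmod 2^{32} \tag{4.3} \\
X^2 \bmod 2^{32} &= (X')^2 + S_L(2^{16}X') + S_L(2^{30}) \bmod 2^{32} \tag{4.4}
\end{align}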
$S_L$ and $S_H$ are the values of the sign bits of X. Any terms with powers higher than $2^{31}$ drop out because the multiplication is mod $2^{32}$. The final result of $X^2$ simplifies to an $(X')^2$ term and, depending on the state of the sign bit, the addition of a shifted version of $X'$ and a constant. The value $(X')^2$ is the partial product developed by the addition of shifted temp1, shifted temp2, and temp3. $S_H$ has no effect in calculating $X^2$. It is worth noting that temp1 and temp2 are equivalent because we are calculating $X^2$. Based on the above result the following can be applied:
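\begin{align}
X(2X + 1) \bmod 2^{32} &= 2X^2 + X \bmod 2^{32} \tag{4.5} \\
X(2X + 1) \bmod 2^{32} &= 2\left((X')^2 + S_L(2^{16}X') + S_L(2^{30})\right) + X \bmod 2^{32} \tag{4.6} \\
X(2X + 1) \bmod 2^{32} &= 2(X')^2 + S_L(2^{17}X') + S_L(2^{31}) + X \bmod 2^{32} \tag{4.7}
\end{align}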
Given the arithmetic expression in Equation 4.7 for performing the $X(2X + 1)$ operation with the sign bits masked, the datapath for the 32-bit unsigned integer multiplication using signed floating point 16-bit multipliers can be created in the reconfigurable fabric.
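Equation 4.7 can be checked numerically against a direct evaluation of $X(2X + 1) \bmod 2^{32}$. The following Python sketch does so; the function names are hypothetical and Python's arbitrary-precision integers stand in for the fabric's multipliers and adders.

```python
import random

MASK32 = 0xFFFFFFFF


def rc6_mul_direct(x: int) -> int:
    """Reference value: X(2X + 1) mod 2^32 computed directly."""
    return (x * (2 * x + 1)) & MASK32


def rc6_mul_eq47(x: int) -> int:
    """X(2X + 1) mod 2^32 built from the masked operand, as in Equation 4.7."""
    s_l = (x >> 15) & 1                      # sign bit of the low half (bit 16)
    x_prime = x & 0x7FFF7FFF                 # X with bits 16 and 32 cleared
    partial = (x_prime * x_prime) & MASK32   # (X')^2 mod 2^32
    result = 2 * partial
    if s_l:                                  # sign-bit contribution
        result += (x_prime << 17) + (1 << 31)
    return (result + x) & MASK32             # final addition of X


random.seed(0)
samples = [random.getrandbits(32) for _ in range(1000)]
print(all(rc6_mul_direct(v) == rc6_mul_eq47(v) for v in samples))
```

Note that only the low sign bit selects between the two cases, which matches the observation that $S_H$ has no effect on the result.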
4.3.2 Iterative RC6 Design
Based on design experience with RC5, an investigation of performance of an iterative design of RC6 was undertaken. Many of the operations in RC6 are similar to RC5, therefore resource utilization and performance estimates can provide a rough guide when making implementation decisions. Figure 4.11 is an iterative design of a DPU configuration of Equation 4.7 from Section 4.3.1. The control logic resource usage was not addressed for the multiplier because datapath resources were deemed to be the biggest design restriction.
Figure 4.11: Iterative multiplier setup.
The number of DPUs required for the design of the multiplication module is estimated to be seven. Five DPUs are used in the multiplication, while two more are used for the fixed rotation of <<< lg(w) required right after the multiplication operation [17]. The following are the specific datapath elements required for an iterative unsigned multiplier:
• DPU1: Tests the state of bit 16. This information must be passed to the control module for the multiplier so that Equation 4.7 can be implemented.
• DPU2: Masks the sign bits of the input operand. The DPU sets bits 16 and 32 to zero.
• DPU3: Creates the summation of temp1, temp2, and temp3. The DPU also provides the 16-bit left shift to add the proper magnitude to temp1 and temp2.
• DPU4: Adds in the contribution of the sign bit. The DPU must either add or pass data depending on the value of the sign bit.
• MUL1: Multiplies high and low segments of the operand and produces temp1 and temp2 from Section 4.3.1. The operation being performed is $2(X')^2$, therefore temp1 and temp2 are equivalent.
• MUL2: Multiplies low segments of the operand and produces temp3 from Section 4.3.1.
A preliminary iterative datapath was designed. The resource usage for a single slice of RC6 is outlined in Table 4.6. Some DPUs can be removed by using one data dependent rotation module and by introducing control logic allowing it to accept multiple inputs from the datapath. Overall throughput would be decreased by such a choice because the rotation would be used for one part of the round and then another, effectively doubling the amount of time required by the rotation.
Resource   Count                                  Total
           Slice 0   Slice 1   Slice 2   Slice 3
DPU        20        0         0         0         20
LSM        2         0         0         0         2
MUL        2         0         0         0         2
Table 4.6: Resource estimates for a single slice of RC6 in the fabric.
Based on Figures 4.7 and 4.11 it will take approximately 14 clock cycles for one round of RC6. Based on a 100 MHz clock for the fabric and 20 rounds of operation, the upper limit is 45.7 Mbit/s. If this design were to be copied to the remaining three slices and implemented as a parallel pipelined design as in Section 4.2.3, the throughput would be 182.8 Mbit/s.
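The arithmetic behind these estimates can be reproduced with a short Python sketch. The helper name `throughput_mbit_s` is hypothetical; the 128-bit block size follows from RC6 operating on four 32-bit words.

```python
def throughput_mbit_s(bits: int, cycles: int, clock_mhz: float = 100.0) -> float:
    """Bits processed per microsecond, which equals Mbit/s."""
    return bits * clock_mhz / cycles


cycles_per_block = 14 * 20                           # 14 cycles per round, 20 rounds
single_slice = throughput_mbit_s(128, cycles_per_block)
four_slices = 4 * single_slice                       # one copy per slice

print(round(single_slice, 2), round(four_slices, 2))
```

The figures quoted in the text correspond to these values truncated to one decimal place.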
4.3.3 Pipeline Primitives
Thus far, all designs have taken an iterative approach, or have used an iterative design as a basic building block. An investigation of a pipelined approach for RC6 was conducted. By considering a pipelined or rolled out approach, the most important design focus is to maximize the use of DPUs within a design. One of the main applications of the CS2112 is DSP, where data flows through the fabric. An example of this type of application would be the use of a multiple tap, finite impulse response digital filter. For a pipelined version of RC6, ideally plaintext data blocks will enter the pipeline every clock cycle. With each clock cycle, each DPU will perform some operation on the data, and pass it on by the next clock cycle. The difference between this type of design approach and that of Section 4.2.3 is that the pipelined elements are individual DPUs, rather than groupings of DPUs.
A pipelined approach can provide an easier design approach with respect to the fabric of the CS2112. With a pipelined approach, most DPUs will have only one configuration in that they will perform the operation on new data every clock cycle, unless the pipeline is stalled waiting for information. Therefore, control logic is simplified from the viewpoint of managing various DPUs. Control resources may be occupied with respect to the specific state of data on the pipeline. For example, in the case of the unsigned integer multiplication in RC6, it is advantageous to have the data stream through the multiplier.
In Section 4.3.2, the iterative multiplier would send sign bit information back to the controller, which would then send control signals back to the datapath for correct operation. To prevent stalling the data flowing through the multiplier, the controller for the multiplier needs to retain sign bit information of incoming data. When the data reaches the part of the pipeline where the sign bit state matters, the controller will switch the DPU configuration. The controller needs a FIFO queue that is N spaces deep, where N represents the number of clock cycles from when the sign bit is detected to where the information is relevant in the pipeline. It will be seen how this requirement is satisfied in the design of the pipelined RC6 multiplier module.
With an iterative design there is a hot spot of computational activity traveling through the datapath. Some DPUs are conducting relevant work while others are just holding their value for use elsewhere in the datapath. A pipelined approach also has DPU elements whose sole purpose is to hold data. These DPUs are provided to delay data flowing through the pipeline so that it reaches all portions of the datapath at the proper times. Figure 4.12 illustrates a pipeline that will perform the operation D = A + (B + C) with and without DPUs used for delay.
The output values on the left pipeline are incorrect because there is no delay on the A operand. The addition of B and C takes one clock cycle and A must be delayed by one cycle as well. Since adding clock cycle delays to a pipeline is an important part of synchronizing the data flowing through the pipeline, it is important to see how delay was used in the design of a pipelined version of RC6. A one clock cycle delay is employed by using a DPU to pass its input value and load the output register [14]. Two clock cycle delays are created by using one DPU and having it load its respective input register, pass its value on, and load its output register. It takes one clock cycle to load a DPU register, and by stringing together DPUs we can create an arbitrary N clock cycle delay [14].
Figure 4.12: Use of delay in a pipeline.
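The effect of the missing delay can be seen in a small cycle-by-cycle model. The Python sketch below is an illustration only, with hypothetical names; it models one registered adder for B + C and an optional one-cycle delay register on A, and observes the final adder combinationally.

```python
def run_pipeline(a_stream, b_stream, c_stream, delay_a: bool):
    """Simulate D = A + (B + C) with a registered B + C stage."""
    sum_bc = 0      # output register of the DPU computing B + C
    a_delayed = 0   # output register of the delay DPU on the A path
    outputs = []
    for a, b, c in zip(a_stream, b_stream, c_stream):
        # Values seen this cycle come from registers loaded on earlier edges.
        a_operand = a_delayed if delay_a else a
        outputs.append(a_operand + sum_bc)
        # Clock edge: both registers load their new values.
        sum_bc = b + c
        a_delayed = a
    return outputs


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
c = [100, 200, 300, 400]
print(run_pipeline(a, b, c, delay_a=True))    # aligned, one cycle late
print(run_pipeline(a, b, c, delay_a=False))   # A paired with the previous B + C
```

With the delay register in place each output is the sum of operands that entered together; without it, every A is added to the B + C of the preceding input set.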
DPUs are a fundamental resource within the fabric of the CS2112 and it is wasteful to create a 20 clock cycle delay using 10 DPUs. Not only would this occupy approximately 12% of DPUs in the fabric but it would add to the complexity of creating a routable design. A more space efficient way to implement an arbitrary length clock cycle delay is to use two DPUs and an LSM to create a first-in, first-out data queue. Data can be read from an LSM every clock cycle, therefore the queue must be N spaces deep with each space holding a 32-bit data value. Figure 4.13 is an illustration of the fabric resources involved with the creation of a FIFO data queue.
Figure 4.13: First-in, first-out queue setup.
The DPU illustrated on the left side of the LSM in Figure 4.13 is a write address generator that writes to addresses 4N ahead of the read address generator, which is illustrated to the right of the LSM. The LSMs are addressed in byte wide locations; when the port size is set to 32 bits the address must be incremented by four places. Physical addressing for the LSMs is mod $2^9$, allowing the FIFO queue to operate without any control logic. The FIFO queue can be 128 spaces deep, resulting in a maximum delay of 128 clock cycles with this setup. There is a three clock cycle latency from when a read request is given to when information comes out of the read address generator accessing the LSM. A minimum of a three clock cycle delay can be used with the setup illustrated in Figure 4.13, but since it requires more resources than using DPUs for such a small delay (two DPUs and an LSM versus two DPUs), this setup is only useful for delays greater than four clock cycles.
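The addressing scheme can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions: the class name `LsmFifoDelay` is hypothetical, each 32-bit word is stored at its word-aligned byte address, and the three clock cycle read latency of the LSM is ignored.

```python
class LsmFifoDelay:
    """N-cycle delay line built from a wrapping read and write address pair."""

    ADDRESS_SPACE = 1 << 9   # physical addressing is mod 2^9 bytes
    WORD_BYTES = 4           # 32-bit port: addresses advance by four

    def __init__(self, n: int):
        if not 1 <= n <= self.ADDRESS_SPACE // self.WORD_BYTES:
            raise ValueError("delay must be between 1 and 128 words")
        self.memory = [0] * self.ADDRESS_SPACE
        self.read_addr = 0
        # The write address generator starts 4N bytes ahead of the reader.
        self.write_addr = (self.WORD_BYTES * n) % self.ADDRESS_SPACE

    def clock(self, value: int) -> int:
        """Advance one clock cycle: read the oldest word, then write a new one."""
        out = self.memory[self.read_addr]
        self.memory[self.write_addr] = value & 0xFFFFFFFF
        self.read_addr = (self.read_addr + self.WORD_BYTES) % self.ADDRESS_SPACE
        self.write_addr = (self.write_addr + self.WORD_BYTES) % self.ADDRESS_SPACE
        return out


fifo = LsmFifoDelay(5)
print([fifo.clock(v) for v in range(1, 11)])
```

Because both addresses wrap modulo $2^9$ on their own, no comparison or reset logic is needed, which is why the queue runs without a controller.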
4.3.4 Pipelined Multiplication
Figure 4.14 is a block diagram illustrating the pipelined unsigned integer multiplication module, and a Verilog description can be found in Section A.6. The square blocks represent delay elements with their delay value illustrated within the block.
The multiplication module did not use the FIFO data queue structure as illustrated in Figure 4.13. These structures will be needed in the RC6 datapath, due to long delay requirements. The functions of the remaining DPUs are described as follows:
• DPU1: Tests the state of bit 16. This information must be passed to the control module for the multiplier so that Equation 4.7 can be implemented.
• DPU2: Masks the sign bits of the input operand. The DPU will set bits 16 and 32 to zero.
• DPU4: Adds the temp1 and temp3 variables in the multiplication process.
• DPU5: Adds temp2 to the partial sum of temp1 and temp3.
• DPU6: Adds the contribution of the sign bit. The DPU passes through X only if the sign bit is 0 and adds $S_L(2^{17}X') + S_L(2^{31})$ from DPU3 if the sign bit is 1.
• DPU7: Creates the final summation to produce $X(2X + 1) \bmod 2^{32}$.
• MUL1: Creates the temp1 and temp2 partial multiplication products because temp1 is equivalent to temp2.
• MUL2: Creates the temp3 partial multiplication product.
Figure 4.14: Pipelined multiplier module.
An important part of the multiplier module is the control logic associated with controlling the process of the multiplication depending on the state of the sign bit. As stated previously, there is need for a control queue that will keep track of the sign bit of incoming data. Figure 4.15 is a high level illustration of how the pipelined multiplier and controller interact.
The control module was created by defining a finite state machine that took the sign bit in as an input. The queue assigns to three state registers within the controller, with the third state register being the output of the controller back into the datapath. The Verilog description of the multiplier controller is in Section A.5.
The control register queue actually contains the required CSM instructions for the multiplier datapath. In all states within the controller, the sign flag is checked and then the first position within the control queue is assigned. With each clock cycle the CSM instructions are moved forward and assignment to the next state register is made. The output of the control queue can then be tied directly into the controlling DPU in the multiplier datapath.
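The behaviour of such a control queue can be sketched as a three-stage shift register. In the Python model below, the class name and the two instruction labels are hypothetical stand-ins for the actual CSM instructions, which are not given in the text; the exact cycle alignment with the datapath is likewise an assumption.

```python
from collections import deque

PASS_X = "pass"          # hypothetical label: pass data through unchanged
ADD_SIGN_TERM = "add"    # hypothetical label: add the sign-bit contribution


class ControlQueue:
    """Shift register of control instructions selected by the sign flag."""

    def __init__(self, depth: int = 3):
        self.stages = deque([PASS_X] * depth, maxlen=depth)

    def clock(self, sign_bit: int) -> str:
        """Shift the queue by one stage and return the instruction leaving it."""
        out = self.stages[-1]   # last state register drives the datapath
        # appendleft on a full deque discards the entry at the right end.
        self.stages.appendleft(ADD_SIGN_TERM if sign_bit else PASS_X)
        return out


queue = ControlQueue()
print([queue.clock(bit) for bit in [1, 0, 1, 1, 0, 0, 0]])
```

Each instruction emerges a fixed number of cycles after its sign bit was sampled, so the datapath never has to stall while waiting for the controller.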
[Figure: a block is input to the multiplier datapath with every clock cycle, while the entries of the control queue are shifted down with every clock cycle.]
Figure 4.15: Multiplier and controller interaction.
After the operation of X(2X + 1) is performed, RC6 calls for a static rotation of the product of the multiplier module. The rotation is X(2X + 1) <<< lg(w), where lg(w) is the log base 2 of w (which is 32 in this implementation), resulting in a rotation by five bit positions to the left. Figure 4.16 illustrates the fixed rotation structure.
4.3.5 RC6 Full Pipelined Design
A good way to conceptualize the pipelined datapath for RC6 is an assembly line with a finite amount of space. Given the available resources of the reconfigurable fabric, the pipeline was designed to allow one round of RC6. The pipeline can fill up with independent plaintext blocks until the depth of the pipeline is reached. If it takes N clock cycles for an input block to reach the end of the pipeline, we can fit an additional N - 1 blocks of data in behind the initial block. Once the pipeline is filled, the output block and every block thereafter will feed back into the input of the pipeline until each block of data has been passed through the appropriate number of rounds. In effect the pipeline becomes circular for the remaining rounds. Top level control logic is required to increment subkey values, initialize input and output memories and broadcast other datapath control signals across the fabric. The RC6 pipeline assumes that the key is the same for all words in the pipeline. If different keys were required for data inside the pipeline, redesign of the datapath, controller and C code that the ARC processor uses to derive the subkeys would be required.
Figure 4.16: Fixed logical rotation by five bits (input X(2X + 1), output X(2X + 1) <<< lg 32).
Figure 4.17 is an abstracted diagram of the RC6 datapath. Delay elements are given as square blocks; the multiplier/fixed rotation and data dependent rotations are abstracted as blocks with their respective operations labeled within. Other elements are drawn as DPU structures. The input words of 32-bit plaintext are labeled A, B, C, and D with the output blocks labeled in the same fashion. The diagram illustrates one round of RC6 and does not show the feedback of the output of the datapath into the input, nor does it show the input and output memories involved with the operation.
Figure 4.17: One full round of RC6.
The functions of the following DPUs in the pipeline of Figure 4.17 are described as follows:
• DPU1: Adds S[0] to B of the plaintext in the first pass and in subsequent passes it will register the data.
• DPU2: Adds S[1] to D of the plaintext in the first pass and in subsequent passes it will register the data.
• DPU3: Executes the XOR operation required in Figure 2.7.
• DPU4: The bit-wise exclusive OR operation required in Figure 2.7.
• DPU5: Adds the S[2i] subkey to the data.
• DPU6: Adds the S[2i + 1] subkey to the data.
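For orientation, the round that this datapath implements can be written out in software. The Python sketch below follows the published RC6 round structure and assumes that Figure 2.7 shows the same round; the function names are hypothetical, and the key schedule that produces the subkey array `s` is not shown.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by the low five bits of r."""
    r &= 31
    return ((x << r) | (x >> (32 - r))) & MASK32


def rc6_prewhiten(a, b, c, d, s):
    """First-pass additions of S[0] to B and S[1] to D (DPU1 and DPU2)."""
    return a, (b + s[0]) & MASK32, c, (d + s[1]) & MASK32


def rc6_round(a, b, c, d, s, i):
    """One RC6 round, with i counting from 1."""
    t = rotl32((b * (2 * b + 1)) & MASK32, 5)      # multiplier and fixed rotation
    u = rotl32((d * (2 * d + 1)) & MASK32, 5)
    a = (rotl32(a ^ t, u) + s[2 * i]) & MASK32      # XOR, rotation, add S[2i]
    c = (rotl32(c ^ u, t) + s[2 * i + 1]) & MASK32  # XOR, rotation, add S[2i + 1]
    return b, c, d, a                               # word rotation (A, B, C, D) = (B, C, D, A)


subkeys = [0] * 44
subkeys[2], subkeys[3] = 5, 7
print(rc6_round(0, 1, 0, 0, subkeys, 1))
```

The two multiplications and the two data dependent rotations in this function are independent of each other within a round, which is what allows the datapath to run them side by side.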
Figure 4.18 is a high level block diagram showing all control logic and how it communicates with the datapath module. Since the RC6 pipeline is spread across the entire fabric, some signals from the top level controller must be broadcast across the fabric. In Figure 4.18, these signals are shown as dashed lines. Broadcast signals enter into a small controller (two or three states) that controls its local DPUs.
The pipeline for RC6 is 19 clock cycles deep. This is evident from the enabled registers on the initial DPU (input and output, giving two clock cycles of delay) and the 17 clock cycle delay element that provides the output. The timing information for this datapath can be found in Table 4.7.
RC6 Pipeline Timing        Clock Cycles
Start to Done Time         407
Data Dependent Rotation    3
Multiplication             [illegible]
Fixed Rotation             2
Table 4.7: Timing information for the RC6 pipeline.
Figure 4.18: Description of control and datapath interaction.
Figure 4.19 is a layout floor plan of the reconfigurable fabric captured from the floorplanner in the C~SIDE tool set. Slices two and three are mainly used for the pipelined multipliers. Slices one and four contain counters and datapath logic for rotation, addition, and read/write memories for operation. Slice four contains read and write memories for operation and remaining datapath logic.
Resource   Count                                  Total
           Slice 0   Slice 1   Slice 2   Slice 3
DPU        19        21        21        16        77
LSM        1         3         3         4         14
MUL        0         2         2         0         4
Table 4.8: Resource usage for a fully pipelined RC6.
Table 4.8 is a summary of the resources used within the fabric for the pipelined RC6. This version heavily utilizes the fabric, especially with respect to DPU usage, and in general would constrain larger designs of this nature. Control logic for the pipelined version of RC6 consisted of the resources illustrated in Table 4.9. The C code for RC6, along with the hardware call to the CS2112, for testing within ChipSim or in the development module, can be found in Section B.2.
Figure 4.19: RC6 pipeline floorplan.
Controller FSM   Instantiations   States        Internal Registers                 Inputs      Outputs
Multiplier       [illegible]      7             6 (3-bit)                          1 (1-bit)   1 (3-bit)
Read LSM         [illegible]      [illegible]   2 (3-bit), 2 (1-bit)               1 (1-bit)   1 (3-bit)
Write LSM        [illegible]      [illegible]   2 (3-bit), 2 (1-bit)               1 (1-bit)   1 (3-bit)
Subkey           2                2             2 (3-bit), 2 (1-bit)               1 (1-bit)   1 (3-bit)
Delay            [illegible]      1             2 (3-bit), 2 (1-bit)               1 (1-bit)   1 (3-bit)
RC6 Top          1                [illegible]   2 (5-bit), 6 (3-bit), 12 (1-bit)   3 (1-bit)   8 (3-bit)
Table 4.9: Control logic resource usage for a fully pipelined RC6 design.
The pipelined version of RC6 was fully simulated and tested using the CS2112 simulator and the development board. Based on waveform output support of the CS2112 development board and simulation results, the RC6 pipeline operates at 597.5 Mbit/s, encrypting 19 sets of plaintext at once in the circular pipeline.
4.3.6 Summary of RC6 Designs
The CS2112's architecture works well with a pipelined design because the datapath is more efficiently used. With an iterative design, a hot spot of activity iterates through the datapath with only a small portion of the datapath elements performing operations while the rest are holding data. With respect to the pipelined version of RC6, the datapath is more efficiently used because most of the DPUs are doing useful work each clock cycle, with the exception of DPUs and LSMs used for delay.
The biggest restriction to the performance of RC6 is that the unsigned integer multipliers take up roughly 50% of the resources of the fabric. Having unsigned integer multipliers on the fabric would save resources. It would be advantageous to have a method of accessing LSMs without using DPUs to generate read and write addresses. If this were the case, 18 DPUs (8 for read/write memories, 8 for FIFO delay elements, and 2 for subkey storage), accounting for 21.4% of the DPUs on the fabric, would be available for computational purposes.
4.4 Summary
This chapter contained various methods and design philosophies for implementing symmetric block ciphers on the CS2112. As illustrated in Table 4.10, RC5 was designed with an iterative nature in mind. RC5 first started with a simplistic version, and built up to a design that fully utilized the resources of the reconfigurable fabric. Fabric resources were used more effectively with a pipelined version (with an iterative core) of RC5, and speed increased significantly.
RC6 was evaluated in both an iterative and a pipelined fashion. A full fabric pipelined design was implemented once it was deemed that fabric resources would be adequate to accommodate a full pipelined round of RC6. The pipelined structure of RC6 yielded the highest speed of all designs.
Implementation               DPUs Used   Speed (Mbit/s)
RC5 Simple Iterative         12          40.7
RC5 Two Half Round           19          65.3
RC5 Full Fabric              72          237
RC6 Iterative Full Slice     20          45.7
RC6 Iterative Full Fabric    80          182.8
RC6 Pipelined Full Fabric    77          597.5
Table 4.10: Summary of block ciphers on the CS2112.
The next chapter will explore the viability of hash algorithms within the
reconfigurable fabric of the CS2112. While hash functions have many of the primitive
operations of symmetric block ciphers, due to the amount of processing required only
the compression functions of SHA-1 and MD5 were implemented.
Chapter 5
Evaluation of Message Digest Algorithms
It is the purpose of this chapter to provide an evaluation of the suitability of the
reconfigurable architecture of the CS2112 with respect to two popular message digest
algorithms. MD5, as discussed in Section 2.6, will be the first algorithm explored.
SHA-1, which was discussed in Section 2.7, will be the second algorithm discussed.
With respect to both algorithms, an estimate of resource usage along with
implementation issues will be addressed.
5.1 MD5 Implementation
Based on the complexity of the MD5 algorithm, it is feasible to use an iterative
kernel that will provide functionality for the compression function H_MD5 only. The
ARC processor will have to properly format the arbitrary length message for use with
the fabric function. Within the function, there is a set of auxiliary functions that
are bit-wise operations and are used within different steps of H_MD5. The auxiliary
functions are defined as follows [19]:
• F(X, Y, Z) = X & Y | NOT(X) & Z
• G(X, Y, Z) = X & Z | Y & NOT(Z)
• H(X, Y, Z) = X ⊕ Y ⊕ Z
• I(X, Y, Z) = Y ⊕ (X | NOT(Z))
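The four definitions above can be written as a short software model. This is only a sketch for checking the bit-wise behaviour, not the fabric configuration; the mask constant `MASK32` is an assumption introduced here to keep Python integers within one 32-bit word.

```python
# Software model of the MD5 auxiliary functions listed above.
# MASK32 is a helper introduced for this sketch (Python integers are unbounded).
MASK32 = 0xFFFFFFFF


def F(x, y, z):
    # Bit-wise select: where x is 1 take y, otherwise take z
    return ((x & y) | (~x & z)) & MASK32


def G(x, y, z):
    # Bit-wise select: where z is 1 take x, otherwise take y
    return ((x & z) | (y & ~z)) & MASK32


def H(x, y, z):
    # Bit-wise parity of the three inputs
    return (x ^ y ^ z) & MASK32


def I(x, y, z):
    # y XOR (x OR NOT z), with NOT z limited to 32 bits
    return (y ^ (x | (~z & MASK32))) & MASK32
```

F and G act as bit-wise multiplexers, which is why each maps onto a small number of logical operations.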
The reconfigurable fabric can be used for these functions, according to the
depiction in Figures 5.1 and 5.2, showing the datapath of the auxiliary functions. The
solid black lines represent clock cycle delays of data flow. Where possible, data should
be synchronized for easier timing of the datapath.
Figure 5.1: Implementation of F and G functions.
Figure 5.2: Implementation of H and I functions.
A preliminary design of the datapath for the compression function H_MD5, with
the variable rotation and auxiliary function abstracted, is shown in Figure 5.3. The
variable rotation is the version used in RC6 using five DPUs. The five DPU version
of the variable rotation uses one more DPU than the iterative version found in the RC5
designs, but does not require control logic. The rotation amounts for the execution
of H_MD5 are values that are stored within an LSM. The setup of T and X is done
by the ARC processor initially when the kernel call is made. In Figure 5.3 it is the
role of four DPUs to store the registered values A, B, C, and D. The role of these
four DPUs is to buffer and permute A, B, C, and D in each step and to do the
final addition after the four rounds are completed. The original value is stored in the
A register of each DPU while the permuted and feedback values are stored in the B
register of the DPUs. The addition is performed by adding the registers together.
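The buffering, permutation, and final addition described above correspond to the standard MD5 compression function. The following is a minimal software sketch of that function as defined in [19], not a model of the fabric timing; the names `h_md5`, `rotl32`, `T`, and `S` are chosen here for illustration.

```python
import math
import struct

MASK32 = 0xFFFFFFFF

# T[i]: sine-derived constants, S[i]: per-step rotation amounts, as in [19]
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK32 for i in range(64)]
S = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
     [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def h_md5(state, block):
    """Apply the compression function to one 64-byte block."""
    X = struct.unpack("<16I", block)
    a, b, c, d = state
    for i in range(64):
        if i < 16:
            f, k = (b & c) | (~b & d), i
        elif i < 32:
            f, k = (b & d) | (c & ~d), (5 * i + 1) % 16
        elif i < 48:
            f, k = b ^ c ^ d, (3 * i + 5) % 16
        else:
            f, k = c ^ (b | (~d & MASK32)), (7 * i) % 16
        # Three additions, the rotation, then the addition of b
        total = (a + (f & MASK32) + X[k] + T[i]) & MASK32
        new_b = (b + rotl32(total, S[i])) & MASK32
        # Permute the four registers for the next step
        a, b, c, d = d, new_b, b, c
    # Final addition of the original values after the four rounds
    return tuple((s + v) & MASK32 for s, v in zip(state, (a, b, c, d)))


# Usage: the message "abc" padded by hand into a single 512-bit block
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)
padded = b"abc" + b"\x80" + b"\x00" * 52 + struct.pack("<Q", 24)
digest = struct.pack("<4I", *h_md5(IV, padded))
print(digest.hex())
```

In this sketch the message padding plays the part assigned to the ARC processor, while `h_md5` stands in for the kernel call.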
5.1.1 Performance and Usage Estimates for MD5
A rough estimate of hardware usage is given in Table 5.1. These estimates are based
upon preliminary investigation of a datapath. As the final design was not
implemented, estimates of CLU resource usage for the MD5 datapath are not available.

Function                               Number of DPUs
Rotation                               5
Auxiliary Function                     4
Registered Inputs for Inits            4
Other DPUs For Other Arithmetic Ops    4
DPUs for LSM Address Generators        3
LSM Memories Used                      3
TOTAL                                  20
Table 5.1: Resource usage for preliminary MD5 implementation.

Figure 5.3: A proposed MD5 datapath for one step of H_MD5.

The design almost fills up an entire slice in the fabric. As was discovered with
RC5 and RC6, an iterative solution does not make efficient use of the fabric from a
computational point of view, but the size of the fabric and the nature of MD5 do
not allow for a fully pipelined version of the algorithm. This preliminary analysis
does not take into account the DPUs used for control purposes, such as counters.
There is a strong possibility that a solely finite state machine controller will not be
realized within the state bits of one slice's CLU, and the MD5 design will span two
slices. The number of DPUs in the critical path of data flow will determine how fast
MD5 will run when the hardware function call is made. Table 5.2 is an analysis of
the number of clock cycles required to get through the flow of the datapath.

Delay Through Datapath    Clock Cycles
Initial Stage             2 clk
Auxiliary Function        3 clk
Three Additions           3 clk
Variable Rotation         4 clk
Final Addition            1 clk
Total                     13 clk
Table 5.2: Delay through MD5 datapath.

It takes approximately 13 clock cycles to get through one of sixteen steps in
a round, with four total rounds in MD5. Therefore, it will take about 832 clock
cycles to process one 512-bit block of the formatted message. Assuming a 100 MHz
clock in the fabric and operating on a 512-bit message block, the throughput is 61.5
Mbit/s. Software calls to the hardware function H_MD5 will be made in processing
the arbitrary length message, therefore the actual performance of the algorithm will
be considerably slower due to latency in the ARC processor's execution, and DMA
latency of transferring the appropriate data to and from the fabric.
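The cycle count and throughput estimate above follow directly from the per-stage delays of Table 5.2. A small sketch of the arithmetic, using hypothetical names (`md5_stage_delays`, `block_throughput_mbps`) and the 100 MHz clock assumed in the text:

```python
# Per-stage delays in clock cycles, taken from Table 5.2
md5_stage_delays = {
    "initial stage": 2,
    "auxiliary function": 3,
    "three additions": 3,
    "variable rotation": 4,
    "final addition": 1,
}


def block_throughput_mbps(cycles_per_step, steps, clock_mhz=100.0, block_bits=512):
    """Return (cycles per block, throughput in Mbit/s) for an iterative kernel."""
    cycles = cycles_per_step * steps
    return cycles, block_bits * clock_mhz / cycles


cycles_per_step = sum(md5_stage_delays.values())
cycles, mbps = block_throughput_mbps(cycles_per_step, steps=4 * 16)
print(f"{cycles_per_step} clk/step, {cycles} clk/block, {mbps:.1f} Mbit/s")
```

The estimate covers only the fabric; processor and DMA latency are outside this calculation, as noted above.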
5.2 SHA-1 Implementation
SHA-1 is partially based on MD5 and as a result has many similarities. Section 2.7
outlines the SHA-1 algorithm. As with MD5, it is practical to implement one of 80
steps involved with compression function H_SHA1 as an iterative fabric function.
5.2.1 Recursive Array Expansion
During the execution of H_SHA1, W[0..15] is recursively expanded into an 80 element
array by the following operation:
W[t] = W[t-3] ⊕ W[t-8] ⊕ W[t-14] ⊕ W[t-16]    (t > 15)
Since this occurs with each call to H_SHA1, it makes sense to provide this function
in the hardware of H_SHA1. Figure 5.4 illustrates a datapath for the expansion of
W.
Figure 5.4: Recursive expansion of W[0..15] to W[0..79].
The operation of this module can be considered in two ways: first the expansion
of W can be performed, and then the rest of the algorithm can be carried out. A total
of 64 entries (entries 16 to 79) need to be filled, with each entry taking 4 clock cycles.
The overhead of reading from an LSM is only at the start, resulting in a one-time cost
of three clock cycles. Therefore the total time to fill W is 259 clock cycles.
The second method of filling W is that it can be performed while other parts of the
datapath are processing. For performance analysis, it must be determined whether
the module that expands W will be able to stay ahead and supply W[t] values for
all 80 steps of the kernel function. This point of view looks to expand W in parallel
with the main datapath of H_SHA1. The module contains W[t] values for t = 0 to
15, so for the first 15 steps the datapath can proceed. During this time an entry in
W can be filled every 4 clock cycles. The timing of the rest of the datapath must be
determined next.
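The expansion and the 259-cycle figure can be sketched in software as follows. The function names are hypothetical. Note one assumption made explicit in the code: FIPS 180-1 [7] applies a 1-bit left rotation to each expanded word, which the recurrence as printed above does not show, so the sketch exposes it as a `rotate` flag rather than asserting which form the fabric design used.

```python
MASK32 = 0xFFFFFFFF


def expand_w(first16, rotate=True):
    """Expand W[0..15] into W[0..79].

    rotate=True adds the 1-bit left rotation specified by FIPS 180-1;
    rotate=False follows the recurrence exactly as printed in the text.
    """
    w = list(first16)
    for t in range(16, 80):
        x = w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16]
        if rotate:
            x = ((x << 1) | (x >> 31)) & MASK32
        w.append(x)
    return w


def fill_cycles(entries=64, cycles_per_entry=4, lsm_startup=3):
    """Clock cycles to fill entries 16..79 when the expansion runs first."""
    return entries * cycles_per_entry + lsm_startup


w = expand_w(range(16))
print(len(w), fill_cycles())
```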
5.2.2 SHA-1 Auxiliary Function Design
As with MD5, SHA-1 makes use of bitwise auxiliary functions during its execution.
These are defined as follows [7]:
• F0(X, Y, Z) = (X & Y) | ((NOT(X)) & Z)
• F1(X, Y, Z) = X ⊕ Y ⊕ Z
• F2(X, Y, Z) = (X & Y) | (X & Z) | (Y & Z)
• F3(X, Y, Z) = X ⊕ Y ⊕ Z
Figures 5.5 and 5.6 are flow diagrams illustrating configurations of each DPU
involved with the implementation. The black line separates clock cycles of timing. Note
that F2 reuses a DPU by using a feedback to the ALU.
A significant difference between SHA-1 and MD5 is the auxiliary function
F2. The computation takes 4 clock cycles for F2 because ALU feedback is utilized so
that a DPU can be reused. The choice to use ALU feedback was made to minimize
resource usage. With respect to a performance analysis it will be assumed that four
clock cycles are needed to execute the auxiliary function, since this is the worst case
time for its execution.
Figure 5.5: Auxiliary function implementation.
5.2.3 Full SHA-1 Datapath
Figure 5.7 illustrates a full iterative design for one step of 80 for the H_SHA1 kernel
function. Some of the differences between MD5 and SHA-1 that cause SHA-1 to
use more resources than MD5 are:
• The use of static rotation instead of a variable rotation. This will use less
resources than a data dependent rotation.
• A different auxiliary function definition than MD5. In the case of SHA-1, the
auxiliary function takes one extra clock cycle, and requires some control logic
for ALU feedback mode. This will affect performance.
• The recursive expansion of the W array, which is derived from the formatted
arbitrary length message. The method of expanding W is recursive and is derived
from the 512-bit message block. The implementation of this requires a sizable
portion of resources within the fabric.
• The use of five 32-bit registers to produce a 160-bit message digest adds some
extra complexity and resource usage to the design.
Figure 5.6: Auxiliary function implementation.
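The differences listed above (static rotations, the four auxiliary functions, the W expansion, and five working registers) are all visible in a software model of the compression function. The sketch below follows the definition in FIPS 180-1 [7] rather than the fabric datapath of Figure 5.7; the names `h_sha1`, `rotl32`, and `K` are chosen here for illustration.

```python
import struct

MASK32 = 0xFFFFFFFF
# One additive constant per group of 20 steps, as defined in FIPS 180-1
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)


def rotl32(x, n):
    """Rotate a 32-bit word left by a fixed amount n (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def h_sha1(state, block):
    """Apply the compression function to one 64-byte block."""
    # Recursive expansion of W[0..15] to W[0..79]
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    a, b, c, d, e = state
    for t in range(80):
        if t < 20:
            f = (b & c) | (~b & d)            # F0
        elif t < 40:
            f = b ^ c ^ d                     # F1
        elif t < 60:
            f = (b & c) | (b & d) | (c & d)   # F2
        else:
            f = b ^ c ^ d                     # F3
        temp = (rotl32(a, 5) + f + e + w[t] + K[t // 20]) & MASK32
        # Static rotations only: 5 bits on a, 30 bits on b
        a, b, c, d, e = temp, a, rotl32(b, 30), c, d
    # Five 32-bit registers are added back to form the 160-bit result
    return tuple((s + v) & MASK32 for s, v in zip(state, (a, b, c, d, e)))


# Usage: the message "abc" padded by hand into a single 512-bit block
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
padded = b"abc" + b"\x80" + b"\x00" * 52 + struct.pack(">Q", 24)
digest = struct.pack(">5I", *h_sha1(IV, padded))
print(digest.hex())
```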
Figure 5.7: SHA-1 datapath design.
5.2.4 Performance and Resource Usage of SHA-1
Table 5.3 is a summary of the resource utilization of the design outlined in
Section 5.2.3. A considerable amount of control logic will be required to operate the
configuration changes of the auxiliary functions and to operate the expansion of W.
Two slices will be used with SHA-1, including the expansion of W. There is enough
time (four clock cycles are required to provide one W entry) for the expansion of W
to occur while the datapath is processing through the rest of the compression
function, which takes approximately 10 clock cycles per step. Therefore, with 80 steps,
each call to the kernel function will take 800 clock cycles if the expansion of W is
conducted in parallel. Assuming a 100 MHz clock for the fabric we can expect about
64 Mbit/s of throughput by processing a 512-bit message block in 800 clock cycles.
Function              Number of DPUs Used
W[] Expansion         10
Auxiliary Function    4
Top Level Datapath    16
Total                 30
LSMs Used             5
Table 5.3: Resource utilization of SHA-1.
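The claim that the expansion of W can stay ahead of the datapath can be checked with a simple schedule model. This is a sketch under stated assumptions, not a cycle-accurate simulation: it assumes the expansion starts at cycle 0 with the three-cycle LSM overhead from Section 5.2.1, that one entry is produced every 4 cycles, and that the datapath starts step t at cycle 10·t. All names are hypothetical.

```python
W_CYCLES_PER_ENTRY = 4   # cycles to produce one W entry (from the text)
STEP_CYCLES = 10         # approximate datapath delay per step (from the text)
LSM_STARTUP = 3          # one-time LSM read overhead (Section 5.2.1)


def w_ready_cycle(t):
    """Cycle at which W[t] is available; entries 0..15 are preloaded."""
    if t < 16:
        return 0
    return LSM_STARTUP + W_CYCLES_PER_ENTRY * (t - 15)


def step_start_cycle(t):
    """Cycle at which the datapath begins step t (assumed back-to-back steps)."""
    return STEP_CYCLES * t


# The expansion keeps up if every W[t] is ready before step t needs it
stays_ahead = all(w_ready_cycle(t) <= step_start_cycle(t) for t in range(80))

total_cycles = STEP_CYCLES * 80
throughput_mbps = 512 * 100.0 / total_cycles   # 512-bit block, 100 MHz clock
print(stays_ahead, total_cycles, throughput_mbps)
```

Under these assumptions the last entry, W[79], is ready long before step 79 begins, so the expansion adds no cycles to the 800-cycle estimate.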
5.3 Comparison of SHA-1 and MD5 Implementations
Since MD5 and SHA-1 are similar algorithms, it is useful to make comparisons
between their implementations in evaluating the CS2112. The performance of
SHA-1 and MD5 is similar, in the range of 60 Mbit/s to 65 Mbit/s. The
biggest deviation is the amount of resources used when implementing the algorithm.
MD5 uses an array of 32-bit words from the sine function while SHA-1 uses an
array of 32-bit words from a recursive expansion of elements from the 512-bit
message block. The implementation of expanding W in the fabric function is the reason
SHA-1 has more resource utilization than MD5.
5.4 Summary
This chapter explored the design of hash algorithms on the CS2112. Resource usage
and performance figures are given in Table 5.4. Hash algorithms exhibit many of the
primitive operations that can be found in symmetric key block ciphers. Due to the
size of the compression functions used in both algorithms in this chapter, an iterative
solution was the only design choice. The next chapter will summarize the results of
this research and provide some insight into algorithm implementation, optimization,
and specific architectural considerations for the CS2112.
Table 5.4: Summary of hash algorithms on the CS2112.
Chapter 6
Summary and Conclusions
6.1 Summary of Results
The results for all cryptographic algorithms on the CS2112 are found in Table 6.1.

Implementation                      DPUs Used   % of Total DPUs   Performance (Mbit/s)
RC5 Simple Iterative                12          14.30%            40.7
RC5 Two Half Round Full Slice       19          22.60%            65.3
RC5 Full Fabric                     72          85.70%            237
RC6 Iterative Full Slice            20          23.80%            45
RC6 Iterative Full Fabric           80          95.20%            182.8
RC6 Pipelined Full Fabric           77          91.70%            597.5
MD5 Iterative Single Slice          20          23.80%            61.5
SHA-1 Iterative Two Slice           30          35.70%            64
Table 6.1: Summary of designs on the CS2112.
The pipelined version of RC6 was the best performer of the block ciphers with
a speed more than twice that of the pipelined (with an iterative core) version of
RC5. Both versions used roughly the same number of DPUs, while RC5 has a more
simplistic iterative round structure than RC6.
The structure of the CS2112 fabric is more suited to the pipelined or unrolled
implementation of ciphers. The application space for the CS2112 is for streaming DSP
[14], and telecommunications applications [13], making pipelining the most efficient
use of resources.
The performance of the hash algorithms in this research was almost equivalent.
However, SHA-1 used more resources due to extra processing within the fabric. The
hash algorithms were similar in structure and were iterative in nature due to
limitations imposed by the available resources within the fabric.
6.2 CS2112 Architectural and Support Features
The fabric of the CS2112 is rich in operational features and many of these features
can be applied to the area of cryptography. Many of the arithmetic operations can
be found within one reconfigurable unit in the fabric, the datapath unit. Operations
such as the bit-wise exclusive OR, logical masking operations, and barrel shifting can
also be accomplished with one DPU.
With respect to the CS2112 the following features were found to be lacking within
the reconfigurable fabric:
• Support for a single clock cycle data dependent rotation within one DPU. The
designs required 5 DPUs or 4 DPUs with possible associated control logic to
implement a data dependent rotation. The design of a data dependent rotation
also took more than one clock cycle to complete the operation. If this could be
accomplished within one clock cycle, the speedup of the data dependent rotation
would be 2 to 3 times, and resource usage would be decreased by 75%-80%.
• Support for an unsigned 32-bit integer multiplication structure. RC6 requires this
operation and of all the resources used for the pipelined RC6 design, roughly
50% of the fabric was utilized for the unsigned integer multipliers. If a one-
clock cycle 32-bit unsigned multiplication module were to be used, a speedup
of 9 times would be achieved for the multiplication.
• While memory requirements were adequate for iterative block ciphers, the need
of a DPU for address generation when accessing an LSM is wasteful.
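For reference, the data dependent rotation named in the first item is a single expression in software, and the 75%-80% resource figure follows from replacing 4 or 5 DPUs with one. The sketch below is illustrative; `rotl32_variable` and `dpu_reduction_percent` are hypothetical names, and the use of the low five bits of the rotation amount follows the RC5 and RC6 definitions for 32-bit words.

```python
MASK32 = 0xFFFFFFFF


def rotl32_variable(value, amount):
    """Rotate a 32-bit word left by the low five bits of `amount`."""
    value &= MASK32
    n = amount & 31
    return ((value << n) | (value >> (32 - n))) & MASK32


def dpu_reduction_percent(dpus_before, dpus_after=1):
    """Percentage of DPUs saved if the rotation fit in `dpus_after` DPUs."""
    return 100.0 * (dpus_before - dpus_after) / dpus_before


print(hex(rotl32_variable(0x80000001, 1)))
print(dpu_reduction_percent(4), dpu_reduction_percent(5))
```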
All algorithms implemented in this research were originally designed to perform
well on general purpose processors. The algorithms use 32-bit words, with 32-bit
operations and manipulations. The fabric of the CS2112 has features which exploit
the characteristics of algorithms that were created with software in mind. The results
from this research cannot be expanded to algorithms designed for specific architectures
and platforms because no such algorithms were investigated.
Communication within the fabric is a mix of local and global buses in a two
dimensional arrangement, allowing for the creation of pipelined and iterative structures.
With respect to reconfigurable architectures in general, the data width of processing
elements is an important feature. For all algorithms explored in this research,
primitive operations were all carried out as 32-bit operations, allowing easy translation to
the reconfigurable fabric of the CS2112.
The use of communication and control resources was a design consideration for all
the algorithms, but did not cause problems that required redesign of the datapath.
The C~SIDE tool set provided by Chameleon Systems gives a full implementation
platform, including simulation and fabric mapping tools. The biggest drawback with using
the C~SIDE tools was that the mapper did not use any intelligent mapping algorithms
and for designs of five or more DPUs, all mapping was done manually. The
process of mapping also included placement of control logic.
6.3 Considerations For Future Work
The study of symmetric key block ciphers and hash algorithms was carried out for
the purposes of determining the suitability of the Chameleon Systems CS2112 for
hardware based cryptographic algorithms. In 2003 Chameleon Systems ceased to
exist as a corporate entity, and the CS2112 and related technology from Chameleon
have not found their way into the market. The suitability of coarse grain architectures
with respect to run time reconfigurability, quick design, shorter time to market, and
functional flexibility remains a motivating factor for further research in the design
of cryptographic algorithms in coarse grain environments.
References
[1] Online Banking Goes Mainstream in United States. NUA Web Site:
http://www.nua.ie/surveys.
[2] The Internet Economy Indicators Web Site: http://www.internetindicators.
com/keyfindings.html
[3] E. A. Fisch and G. B. White, Secure Computers and Networks: Analysis, Design,
and Implementation. CRC Press, 2000.
[4] D. Kahn, The Code Breakers: The Story of Secret Writing. Scribner, 1996.
[5] S. Singh, The Code Book. Doubleday, 1996.
[6] S. E. Forrester, "Security in data networks," BT Technology Journal, vol. 16,
no. 1, pp. 52-75, 1998.
[7] Federal Information Processing Standards Publication 180-1, 1995 April 17,
Announcing the Standard for Secure Hash Standard. National Institute of Standards
and Technology, 1995.
[8] B. C. A. J. Elbirt, W. Yip, and C. Paar, "An FPGA implementation and
performance evaluation of AES block cipher candidate algorithm finalists," in AES3:
The Third Advanced Encryption Standard Candidate Conference, 2000.
[9] L. Wu, C. Weaver, and T. Austin, "CryptoManiac: a fast flexible architecture
for secure communication," in Proceedings of the 28th annual international
symposium on Computer architecture, pp. 110-119, ACM Press, 2001.
[10] L. E. Frenzel, "Cryptochips: Help eliminate the security bottleneck," Electronic
Design, March 2003.
[11] M. J. S. Smith, ASICs... The Web Site. http://www-ee.eng.hawaii.edu/
~msmith/ASICs/HTML/ASICs.htm.
[12] A. Dandalis, V. K. Prasanna, and J. D. P. Rolim, "A comparative study of
performance of AES final candidates using FPGAs," in AES3: The Third Advanced
Encryption Standard Candidate Conference, 2000.
[13] R. Hartenstein, "Coarse grain reconfigurable architecture (embedded tutorial),"
in Proceedings of the conference on Asia South Pacific Design Automation
Conference, pp. 564-570, ACM Press, 2001.
[14] Chameleon Systems CS2112 User Manual. Chameleon Systems Incorporated,
2001.
[15] Integrated IPSec/MPLS Services and SSL-Based VPNs Fuel Solid Growth in
VPN. Infonetics Research: http://www.infonetics.com/resources/.
[16] Chameleon Systems Inc. - Memorial University Research Agreement. Chameleon
Systems Inc., October 2000.
[17] R. L. Rivest, "The RC5 encryption algorithm," in Proceedings of the 1994
Leuven Workshop on Fast Software Encryption, pp. 86-96, 1995.
[18] R. L. Rivest, M. Robshaw, R. Sidney, and Y. Yin, The RC6 Block Cipher. 1998.
[19] R. L. Rivest, The MD5 Message Digest Algorithm. 1992.
[20] S. A. V. Alfred J. Menezes, Paul C. van Oorschot, Handbook of Applied
Cryptography. CRC Press, 1997.
[21] H. Krawczyk, M. Bellare, and R. Canetti, HMAC: Keyed-Hashing for Message
Authentication. RFC Editor, 1997.
[22] National Institute of Standards and Technology. NIST Website: http://
www.nist.gov.
[23] Computer Security and Industrial Cryptography. NESSIE Web Site: https://
www.cosic.esat.kuleuven.ac.be/.
[24] S. Kent and R. Atkinson, IP Authentication Header. RFC Editor, 1998.
[25] S. Kent and R. Atkinson, Security Architecture for the Internet Protocol. RFC
Editor, 1998.
[26] R. Younglove, IPSec: What Makes It Work. 2000.
[27] C. Madson and R. Glenn, The Use of HMAC-MD5-96 within ESP and AH. RFC
Editor, 1998.
[28] S. Kent and R. Atkinson, IP Encapsulating Security Payload (ESP). RFC Editor,
1998.
[29] J. Burke, J. McDonald, and T. Austin, "Architectural support for fast
symmetric-key cryptography," in Proceedings of the ninth international
conference on Architectural support for programming languages and operating systems,
pp. 178-189, ACM Press, 2000.
[30] J. P. Huber and M. W. Rosneck, Successful ASIC Design The First Time
Through. Van Nostrand Reinhold, 1991.
[31] L. Stok and J. Cohn, "There is life left in ASICs," in Proceedings of the 2003
international symposium on Physical design, pp. 48-50, ACM Press, 2003.
[32] R. A. Rutenbar, M. Baron, T. Daniel, R. Jayaraman, Z. Or-Bach, J. Rose, and
C. Sechen, "(When) will FPGAs kill ASICs? (panel session)," in Proceedings of
the 38th conference on Design automation, pp. 321-322, ACM Press, 2001.
[33] T. K. Tetsuya Ichikawa and M. Matsui, "Hardware evaluation of the AES
finalists," in AES3: The Third Advanced Encryption Standard Candidate Conference,
2000.
[34] T. R. Bryan Weeks, Mark Bean and C. Ficke, "Hardware performance
simulations of round 2 advanced encryption standard algorithms," in AES3: The Third
Advanced Encryption Standard Candidate Conference, 2000.
[35] T. R. Bryan Weeks, Mark Bean and C. Ficke, "Hardware performance
simulations of round 2 advanced encryption standard algorithms (presentation)," in
AES3: The Third Advanced Encryption Standard Candidate Conference, 2000.
[36] K. Compton and S. Hauck, "Reconfigurable computing: a survey of systems and
software," ACM Computing Surveys (CSUR), vol. 34, no. 2, pp. 171-210, 2002.
[37] R. Tessier and W. Burleson, "Reconfigurable computing for digital signal
processing: A survey," Journal of VLSI Signal Processing, vol. 28, pp. 7-27, 2001.
[38] Spartan and Spartan-XL Families Field Programmable Gate Array Datasheet.
Xilinx Incorporated, 2002.
[39] C. Ajluni, "Field programmable gate arrays just aren't for prototyping
anymore," Electronic Design, April 2000.
[40] K. Gaj and P. Chodowiec, "Comparison of the hardware performance of the
AES candidates using reconfigurable hardware," in AES3: The Third Advanced
Encryption Standard Candidate Conference, 2000.
[41] M. Riaz and H. Heys, "The FPGA implementation of the RC6 and CAST-
256 encryption algorithms," in IEEE Canadian Conference on Electrical and
Computer Engineering, May 1999.
[42] R. W. Hartenstein, T. Hoffmann, and U. Nageldinger, "Design-space
exploration of low power coarse grained reconfigurable datapath array architectures,"
in Proceedings of the 10th International Workshop on Integrated Circuit
Design, Power and Timing Modeling, Optimization and Simulation, pp. 118-128,
Springer-Verlag, 2000.
[43] Y. Mitsuyama, Z. Andales, T. Onoye, and I. Shirakawa, "A dynamically
reconfigurable hardware-based cipher chip," in Proceedings of the conference on Asia
South Pacific Design Automation Conference, pp. 11-12, ACM Press, 2001.
[44] Chameleon Systems CS2112 Data Book. Chameleon Systems Incorporated, 2001.
Appendix A
Sample Verilog Code for Selected Modules
A.1 RC5 Testbench
I/~CH FOB RC5CIPHF./Il([ll//fl.
~~ ~:::nW~~n:;:~:~ to vuHy openti•• ~fon .yn"h.oi•.
_oLlorc6tb;
~~.<~:';:7t . n an ,
~:~~~~r~~~c~~ . ~:t .n.n.d•• oJ ,
;~;;r:l'~~~~ elk c _ 05·e lk ,
'io<loo."«5Koyo .lnel'>'l'"
<n t - I ,
. tar< e-<l,
" . <_0,
..one-I,
",ut<-O,
' 2000 ,
th.I.~;
'."bl~!I.
h ha . "P" D( "t t 5t op . oJ:uo" 1 ,
.,,:olla _r<Ol>e("1S".re51;
11~;~~t:;~~t \_ • . • · u t ''I.b 0<"'-'-'" do... · 'l.b · . ..t ."'art.do~.) ;
/f • ..,]
A.2 Iterative RC5 Top Level Module
IlT01'IX'IELIlODIIL.EIlEQlJIllEllBTCS2112IJ1CHlTECTtlRE.
_ olor<bt op le lll . ro t , otaT t . d"" . I :
::::~ ~~:~
1"P" . . . ..... '
o.tpu' don o ,
vito [2 ' O] lS>!.. ctl . n . .... '.ttl. lblotkl.tU ,.hift.'.cU,iblock2.tU
,..d d <- Ctl . t OOl. t U . t tl ,
.\• • • o i t . oo no . n os , t o UD'-dono -f l oS '
:,:;;"",-3<-1'1101 :
1"_,---3<- ,-110:
:,:;;.--_3<-" "' :
&'>...... 3<- 1'110:
::;; 3<-1'..1:
&O.~3<·"IIO:
:,:;;nIIo. 3 <. ""I '
...'_ ...... _3 <.' · 110,
:~;;.""_o.tor... l_4 <. , ' b! :
p_<un_e, t ..,..,I_4,. " bO ,
:':0"",_4' ."""
tg _nm_4 <· 1' 110,
u w
:; ;;,--- 4 <_""' ,
1"_'---4<- .·....:
u w
:;;; 4 <. 1. " "
"'_ 4<. '· ....:
u w
:;;;.....4<.' ... ' :
..._.--_4<_1'110:
::;;.--.4<- ""':
"'_......4 <.'·110:
-,
lllgiu. :
19ltul "-clb
~.gf'"(· 'oot , . "'·) :
O..: ·ho..-pr oboo( Ots O. r d . p l pol l ...," l :
II"-Ut.
n;!\.d pdp(
:~~~~~~~;:
. i l>lo<U_~l ( ib1oct l _«lJ ,
.1 bI0<k2_ctl(lbloct2_<'ll,
.:::~t:;~~:~~ :~~~~~_ot ll ,
. LS"_ctl(LSIl~<tl) ,
. vU U r _et l (n l t e , _et l) .
COW>t . '_etH COUD' . ' _Ctl> ,
. v. j . _dou_fl og(v. U _do.... _fl og l.
j~O~Dt_d""._flog{""""t __ ._fh~>
ro 5ctl ctl(
cll«dk> ,
:::~~~::~n> •
•t bl"".I _otl(l.bloc.' _ctlJ,
:~;~~i~~~~~~~~~_CtlJ •
t.!>I~ctl (lSII~<tl) ,
.:~::;:~~~~::~~~~:~t;; ~) '
. c"""t • • ~etl(<<>.... t e . _etl) .
v.tt_doM _llog(nlt .do •••f1og> .
::::::~d::;_flog(C"""U"""j log) ,
>,
A.3 Iterative RC5 Controller Module
~~ ~~~.:::~~ou,;,oler
_ ule r~ctl ( c lk . '"' •• t en, l.Sl!_et l . va i", _ctl ,lblO<-l<Lc tl , j blo< k2 _etl.e<ldr.,tl,
; :-ur.etl • • h11tt ,r _et l. vau _dOU._fl"ll , COIl.Ilt_do•• • nog .doDeJ:
J u t _. r ef ea , . , J~t ..oIvo t o _~t. ..... :
par .....r
IOU _ ~ · bOOOO .
IKIT_"bOOOl .
Rl - '·t>OOl (l .
R2 _ "bOOll ,
1l3_"bOlOO.
R4 _ " bO' OI .
R5_"bOllO.
~I; - "bOllI.
RT_". IOOO,
R8 ~ ".1001.
1<8. _ 4'01010 ,
~9 _ ".1011.
00lI~ _ ' ·.1100 :
.,
Iu !",tclk :
'-"f"' t ut :
'-"""t nart:
"" tp nc l~,(l] l.S!l_ctl:
outp ut [2,O]lbloctl_etl ,
Dut puC (2,0] 'bl"d2.etl :
""'put [2:11] sJl.l lter _c tl '
" " 'p uc [~ ,OJ a<\<lr_cd:
"" 'puc (2 , OJ "U c.cctl:
out put[2:0]COWlU T_Ctl ;
<>ot pucd""",
Input •• It_do~• • llog ;
' nputccunt _dc_.llog:
r~ (2,0] ~M.<tl,
reg [2 ,0] v&lt u _ttl,
.o g [2:0] Iblo<kl.etl:
.o gt2:0Jlblotk2 _<tl,
ng [2, 0 ] abll tu~ttl:
' ''! [2 :0J ..~M.etl,
n g [2:0) <ouRt or .ttl,
"'g ~one ;
reg [3:0) cunORt. nat.,
<"lI ta.or noon.ot . " ,
n g (2 ,OJ ......LS'l.<U,
r 03[2:0]""" . v.lt oc.<tl,
r og (2 , OJ ....... lblotlCtU'
<08 [2:0J n..t~ Il>l o<k2.C<I ,
reg 12: 0J "" . t.olll U u . ",I,
re g (2 ,OJ nR" t ....~• • etI,
~~~;~:~=:l::::::~t:~~ THE CONTROLLU
~:~~R t (po. " <le0 clk)
~:~~::~~~~£~ 'l>OOOO:
v.it~<.t tl <_ 3'1>000;
Ib lo ok l_etle_3'l>!IOO ;
i b loc k2. cU<_3 '1>OOO,
Mdr.ttl <. 3 '1>000,
<-,....t • • • ctl(-3'l>OOl;
.hilt~' .<tI <_ 3'1>000;
~0fI" e. I ' bO;
.~ .I ... ""ll "
""rr. nt _at ato<_ nut. o..t . :
I.SJI..ct}e· ...' .LSI' .et! ,
n itor . <tl <_ t . v.ltu_<tl :
lb lotllCttl<· _ll>loekl.ttl :
Ib l<>ek2.etl <- '. i bla<.k2.<tl,
<onn..r _ttle·nnt_<oUIlt U.<tl;
oMr_ctl(- n."._.ctl,
.hlftu_etl e••u,-.l>H ter _etl ;
~O" O e••o... ~o.... ;
II Co.b1 na tor1olbI<>tkJ ortb.<ootrolor IlOdvl ..
~::;;,:;:::::~:~ .t. or ot ut or vo1t~~on._ll"ll or <on.nt . d one_ llq ) I><>g;.
/li3llIClTOrnrrPlITUIF.3
4'1>OOOO,l><>g"
~:t~ ::~: :-:;~I:
.....l.SIl.ctl -3'bOI0;
n • • t.lbloek Cttl _3'bOI0 ;
...... lblotll2.eU ·a·bOIO;
....t .v .\tu.ctl - 3 ·1>OOO,
" ."_toW>.t ..r .ttl ·3·bOOl;
• ••t . .....r .c< l - 3' bOOO,
...... t _o.bll ..r~ttl • 3 ' 1>000;
n.....~"". - l'bO;
:=:t:~;:t:e~':,bOOOO;
ne . ' . l.SIl. <t 1 - 3 'bOOO,
. ..._lbl<>tkl_etl ·3'1>OOO;
",," . 'blo<k2. ctl -3 'bOOO;
. ........Hu~<tl . 3'1>000;
.~"_<onnt.<.et l _ a 'bOII1;
n.. . . . 0<ldr_ttl _3'l>OOl:
."" . ol>H t u _<tl ~ a'!>OOO;
••• t .<1o.D~ _ l ' bO;
::'~:~t:-~i: ,I>OO'O'
....n.J,.Sll_ct l ·a·bO'O'
......_iblQo kl . cU ·a'bOIO'
non _lb l ock2_<t l · a · bOI0,
nut_ultor_<tl ~ 3'1>000,
nr< t _""""ur_cu -a'l>OOl,
nu t_ addr_o tl · a · bOOl ;
nr< t.olllft or . oU ·a·l>OOO,
D•• t _do• • * I ' bO,
4·I>OOIG:bog i.
1....t _.t.t. -4 ·l>OOlI,
out.Ul".ctl·3 ·b0 1G,
oon_lblotk l_ctl ·a·I>OIO,
nu t _' bi o<:l<2. ct l _ a· bOlO;
"..t . ....."'.ct l · a· I>OO I ,
n,o:rt_coun t or. ctl ·3·I>OOI,
,.. .t. oddr . otl - a· I>OOI ,
. ..t.o~Hter _ ctl . J 'I>OOO,
D.n_dOlle ·l·bO,
.~
:~~~:~t:-!~'bO lOO '
n• • t _Ul" .<tl ·J·bOIO,
Dr<t_1blntkl_cU ·a'bOIO;
.... . t.1b.ock2. ct1 -a·bOI0 ;
.... ,, _ _ctl ·3·I>OOI'
Dr<t .o""nt <:<l - a· l>OOO ;
.... .. . """".ctl _3 'I>OO';
oo,,_obHtor_cU ·a·I>OOO,
... . t . don. -.·bO;
t ' bOIOO: bogl D
~:~~:~~;~:":·::::O~gl.
oc.t.l-'OI. ctl _ a'bOO l,
oo" _'b l oclll_cU · J'bO'O,
..... t _iblock2.ctl -3'bOIO'
n.......oi tnr_et l · 3 ·bOOl '
nut_con nt o<.oU ·a·bOO',
n..t.loddr.oU _3'bOOl;
n... ,-ohtf t or _ct l · 3 · bQOO,
n • • ' .don . -I 'bO;
::,:~;:t:-~I: ' bOl OO '
... t-.UO'!.-ctl -3·bO,O,
no" S oI Qok l _cU · a· bOI G,
n• • t . lblock2.otl -3·bOIO:
n..t _"ai to r _ctl·J ·bOOl;
n. ,' .cQ""tnr .o U · J'bOOl,
... t.oddr_ttl _3'bOOl,
..", _obHter _cU -a'bQOO,
n• • t .don o · '·bO,
:~~~::': t:-!I:'WIIO'
n."JSI'- otl · 3·WIO,
n... t . 1blQcU .cu ·a ·wlO:
... t . !.block 2.ctl _J ·WIO :
0..t_ ... 1to ,,_ct1·a'bOOO,
...t.coont ..- .ctl _3'bOO':
...._. ddr _ct l · a ·bOO(I:
..... _••u . ... o<l ·a·l>OOO:
odt_doll. ~ l'bO:
=~:~"~'lIOtll '
-'~o:<l _ 3 '10011I,
..U_lblocU_ct.l _ ' · IIOl>I,
-'_lbl0a2_ct.l • )'110111,
_n__","_~u _ )'IlOO1,
_ .._.-..._ct.l- ' · lIOOl,
_ .. ct.I · ) ·IIOlII ,
...._...;.t .... c_ct. l ·)·IIOOI'
_.n._·l·bO,
=:~l~':'bIOOll '
_"_I3!'I.~U ·)· IIOlII,
....._. lo1ocU_ct. l ·"lIOllI'
...' .lbl0ck2.~tl • 3'11010,
_ .._... U ... _cU·)'IIOOI,
::::=::i~:137~~1'
:::::~:, :c i~~; - )'bOO l;
4' bI00ll :I><oCi D
::~:1:~.;~~~~:I D
... .._l bl od l . cU · ) ' IIOIO;
... ... l blocr.!.U l ·)'IIOOI;
... . l ..... .... _<U· ) ·IIOOlI,
... .._eoutoc.~tl • ) ' IIOlII,
...t . _ . <tl - ) ·bOOll,
............ r .....C11· ) ·lIOlIl;
___ · 1'110;
=!~.:~=~,
..rt_ltll"""'I.C1I·)·IIOIO,
...._ltll0ck2_nl· , ·IoOIO,
... rt.-.I',",,_<11-1 ·1IOO1,
..rt_~"".~U ·3·IIOlII ,
...... _ _""1 - 3 ·bOOl '
.....__ r.... . ~U ·3·IIOOI,
-'._ -1'110;
4' .100"""1' ·U ( c_,.__t1ad "*c UI
=~-:~~ ~.:~:~~ ,
......I.lodl.UI ·)·IIOIO,
...._.bl<>ocU.ct.i -'·IIOIO'
........u.' _~U · 3 ' 1IOOlI;
... ~_~..,.' O:<l • ) ' IlOO1,
..... ..wr.~tl • ) 'bOOl ;
...._d i tt..._cU·)·bOOll '
.....-. ,.."
:''':'::::t:-~I :' b IO ' ';
... ...LB~_<tl • ) ' lIOlO;
... . t.lbl<><~'-cU • )'11010;
....._lbl<><~2+<tl • "bOOl;
D. .. _~.lt cU • ) ' bOOO;
..n , _<11 · 3 ·1IOO1,
.....__ <_<"I ·) ·bOOl;
... n.olLIr .., _<u _ ) ·bOOO ;
......_.·1 '110,
:::~~~~~~;,~ , ~ ~ ~ ;
.... .._i b lodl.c U _ 3'b'HO,
n. .._ib1o ck:C cU o 3' bOW ,
.~.._v" u ,_ ctl · 3 ' \>000 '
.... xt .C<'u.c"'_cU · 3'bOOI '
0. " . 6<ld, . « 1 -3 'l>OOl;
0."_obU' . , .ctl ·3'bOOO,
o.xc.d"". ol 'bl,
:~:~:~~~.:-~ ':, bOOlO,
0."~_<U _ 3'bO IO ,
0• • C.i b l . ck L «1 -3 · b0 10;
u. H _l b l ock2 . c U · 3 · bOOl ,
..... vo i ..' _ctl ·3'\>ooo'
...._c"" o t • •• ct l - 3 · bOOl;
n. .._addr . c tl o 3 ' bOO1 ,
• • • <- ."II,", . <tl _3·bOOO;
... ..""• • - I ' bO;
d:::~~~a~g~·4'''''00;
ou._LSIl. <U _3' bOIO,
ont _l b l <><kl _«1 ·3' bOI O,
•••• • lblock2.cU ·3 ·bOIO,
• •".v.ltu _ct 1 ·3·bOOO,
.... . c o"" t ..-. ctl _3 · bOOl ,
box e.,a ddc cU· 3 · bOOl ,
• • " . obift . ' _<t 1 - 3 ' bOOO,
o u ' .doa. ·Pbl ,
A.4 Iterative RC5 Datapath Module
~~ ~::••:~:;n'~~ •••• pac" d.lioIU•••
"/lJ...ooR),' • • l ......r
' j oc lu d . " CS<l 12 . I . " nI ctl o• • . ioclod."
T..ploc.'na tant hloo
::::::: :::::~:: ::~::~:::::: : ~~::~
• • fpar.. . p"l.O.IlEG.'"ITUl....ViLUE ·n'bO'
I >. lITS
;~: ~;~:l :
. " _'00 0 ,
.b .lDO( ) ,
. J1 "&--hI @;hO •
.1l~lowO ,
.dat~_to_l..() ,
.d~"_fr_. 1o.0.
: i::=:~~ ~;no,
CJ.lIB.Y LOGlC
~u:ry_ i ~O .
~o.rry_o~t()
1Ilk><!II.l~ d... ~rlpt1"" f or ...rhb1o circul...hift... .
_ 10 ..rC1TSI>if< ~ r«n,<n.~Mf.""_<tl.lop .<op,rop2 ,cutd.o'.):
input clk;
input rst;
input [31:0] lop;
input [31:0] rop;
input [31:0] rop2;
input [2:0] shifter_ctl;
:~;:,~~:~~l t~~~·:~~" tlto . oJ.S :
/lnATi PiTH .. l _ on to for w..hbl . c l r e\l l .. obif tlllS .
dafF*<.....hbb_ t l._ . hlft.i _B_REG_UITI AL..VALUE .32 ·bo<X>OOOlf:
d.. f"","" varhbio _Ci< .Obift.i.l_IlEG.lKltlAI.. VAWE · 32'bOOOOO020:
~~~"7 ~~:~~;;;;' ~_~~~j:n~:~~~i)~; ~:~~I~ !~:~o~::
dafpor"" varhblo . tl r _.Mft _i.INSTROCTIOW_I " ('I_ZEllO.1K I
::;~WJ~ !~~:~~ I 'iLU" I
CS2112J)l"J vor hbl . _tlr. "blft _I I
t1k(dk),
r .. (rot) •
. ,,_' nOO ,
b_IuO( r opl.
b_;",(rop2),
:::=~:~~~~:~~<tl),
fl Of:..blghO .
::~=~;2~~~; ),
. 1.._ ~<ldr O •
. h . _or... . ... O •
. tarry _I o O ,
tarr-,_O~tO
:E~J~~2~~E:::~::~~~sn;lX;'~~· . (: :~~~~o~ ~· ~O_1IEG I
t lk(t l tl .
r ot(ro t ) .
"_ l uO(t l r il .
, b. ' c O(l op ).
':::==~~ ~ l:
.11 "&--hi gb O ,
ll oS. 1 o~ () •
.dot .....t o _h a O.
""t"-frOlO_I_O •
•I"_44dr O •
.h, _u " ._."O.
c ... r y_I oO ,
c arT y_o ntO
// NEED TO MODIFY THIS BLOCK SO THAT BIT 6 OF (Y & (W-1)) IS CLEARED
dd pu- 1l& var i.abh _c J r _4h ltt _C•.l....Rm _l Il TI AL_VALm; . S2 ·bOOOOOO:!Q;
4.1p ... ...or h b l . _' J r _ultt _C. B.REG_lMlT UkYI.l,llE - S2 ·b OOOOOO U :
~.~PB-:A;:~ .(~~:~_~ .~~I~~~:O~~~~~; I'OPA_REG I
C3:l1l~_OPIJ n r i.bl._u r _"hH t _C(
~~~i ~~~: :
._ ; 000 ,
.b _J "O(c 1<I) ,
::::=~~~~~:
.fh&-b ighO,
·~:::~::i~o.
:~::~~~)~'.O ,
.1.._vrJt. _uO ,
carry_laO •
•c....J_ou tO
<lefpu-"':'~7~:~~~_ ~ '~r:~:~/~~~~~,· ('AU~ I 'OPU O-!tEC I 'IKUJ I
C3~112_l>Pl1 . ... I . bl . _OJr _. . ..t _O(
~~ ~: ~~:;:
o_J" O{c l r,! ) ,
.b_'oOO""l,
~1"' _ out"",,(cb4) ,
.c"," _add .(J 'bOOO),
.tl"ll .h l~hO,
~~~::i~() .
.~:::~~~) ~.() ,
1o, _wr l t . _o" O ,
<or'7_'oO,
carry_outO
I/ ORthetwov.l""otogootb.r, fJ a lobodclr<1l1 ...- 0bifth...
dofpu-.. n r io.hlo _oi r .o.h U t . E. n STlU.lCiI OlC O _ ('.I0_IJ I ' OPA_IIO_REG I
::S~I:~:.~~:~~:~:_~h::~E~R j 'O\li_ALU I 'LOIJl_O_RW;
:;~:: ~~~: :
. ::~~;~:;:::
. :::::~~;::.o~:")'
tl"ll_h;ghO ,
':~;;~~~~(l'
.1,,_0"",0,10_ _...-1<,,__ 0,
.c ar r y_la O ,
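The shifter instances above appear to mask the rotate amount to its low five bits, form the complementary amount from 32, shift in both directions, and OR the two results. A minimal behavioural sketch of that reading in C follows. The function name `var_rotl32` is hypothetical, and the sketch models only the arithmetic, not the pipeline timing of the datapath.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: rotate x left by the low five bits of y (w = 32). */
static uint32_t var_rotl32(uint32_t x, uint32_t y)
{
    uint32_t left  = y & 0x1fu;             /* y & (w - 1)                      */
    uint32_t right = (32u - left) & 0x1fu;  /* 32 - left, with the value 32     */
                                            /* itself folded back to 0          */
    return (x << left) | (x >> right);      /* OR the two shifted values        */
}
```

Masking the complementary amount keeps both shifts within 0..31, so a rotate amount of zero returns `x` unchanged instead of relying on a shift by the full word width, which C leaves undefined.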
// MODULE DESCRIPTION FOR THE RC5 DATAPATH
::~:t~:::~~~:~ : c:~~~~~:~: I~:~~r~:~~k~;;::~;r_ctl •
nU_d"" e _fl a,o;, cow>t_""""_fl"l!) ,
input clk;
input rst;
input [2:0] LSM_ctl;
input [2:0] iblock1_ctl;
input [2:0] iblock2_ctl;
l~putI2:0) __ctl ;
i nput [2 ,0) .~ifb~. ct l :
i.put CZ,O)" . ib"_ et l ,
~~~t (~~~~:::~:;i:;~'
out!"" <<>u:>.t _dOM _n OS'
1/ i~t."",1 " l rl c~
::~: ;~:~ ~; ;~.~~:~;t~'l:= ~n~d_.d"I • • ..o . u;
~~ ~: ~~l~:t~.i. UCS TO LOI Q IJ tHE VALDESFOR PUI II1UT ( Al I Ml)S [O]
da lp ...... Ibl.,cU . .LREG3rJUll..- ULUE • 32 ' hl>bbM8 c8 ;
dofparulbl"<kl . B.RtG .iJltlAL..ViLTIE -32'bO:
II UltiALADOOPf.lU.UOITOWI.DYALtrES
<lot p<>:raal bl ocU . JMStR lICt IOH_O - ('AO_II ( ' OPI _llEG (' IIO_IM ('GPtl_R£G (
' SHIf"t _OFF I 'llU_.lOOI · DIIT_ALUI ' LOAD_OJlEG) ,
II """. A
?~~;"'; ~~~~~sn:~:~=~ :'~~~:.~~~-~~~~ ;~i~=~~):
II IIOI.DIlIStRlICtIOli
" . , _ l bl<>et l . n rStl WCTI 0lI_2 - ( ' .....ILU_IM I '0l'1 .1\I'.O I ' WAD.A. 1Im I 'IIO. U I
CS~1l2.~~1 ~i:~:I : 'AIlI_PASSA t 'OIII_lll1 I ' W AO_OJ,EC) ;
c!k(clk) ,
.::~~~~; ~ '
.:=::~ i;~ )·
.=::~~~;~;Utl) .
. t b gJo'8t ( ) .
.~~~~~:~~~ () .
. d. u.f......l ..O •
. lo d<lr(} •
• 1o vr l U _""O,
•c u "Y_1o O .
carTJ. o~t()
>,
~~ ~ ~;~!~:t1;; '1i'l\IDS TO lJU D II tHE VlL tU f O!i pl..I.lll1UT (8 ) i.O S(1]
dol p<>:r'" Ib l ock2.i_I\Et::.UltlAL. VAllIE -32 ·hl a3 7!7fb :
~oIpuaa Ibl <>ek2.B_ P.Ec:_IMItIll..-YI.l.lllt - 32' 110:
II IMlTUL.I.DD OPf.lU.n Oll TO l.\lAO VlLlIES
dolP"'~~~:';~~;~I~~~_~ ::';'~~L~"II ,;:~=~~) ;80_U I
I f .... . i
det par Nl ; bloc~2."StllOCTIOJ_I' (' il , U I ' DPl.- I D_JlEllI ' LllAD_4. JlEll I
'00_1' I ' DPe.)l D_llEC I'SIlII'T3JPF I 'A LlJ. PASSA I ' OUT_~~D I 'Ul lD_ D_REC);
!I 1I01lI1IIStIWCTI(IlI
d.fpar_ ibl, od2 .IISTROCT1DN_2 . (' IJLII_IX I ' OPI_ REC I 'WAD_I _l<£C
~~~2~~ I ;:~~~~FF • ' IL1UIS~ I 'OIrUUJ I 'Ul lI l_ O_REG);
<lk(<lk ) ,
.rn(rot) •
. ... 1..0 0 •
.:-: ~~ :~) .
dpu _""tput(v2),
e"'_orld.r (; bloek2 _""iJ •
. fl "6--bl4!>O .
fl 'lg.. lovO ,
:::~ :::~;~:~~~() ,
.I a<Id.O.
lo w. .... _...O,
.c u I')"_lD().
e">T}".out (J
"(/iDD conflturati"" . . .
~:r;~~';~~~~~:I?~I~:~"1 1 ' :~~~~·~~~=il;~D~D~ ~~ I
!l IIDUll..tUCti_
delpac.. a<ld l . IXStll UCTIDN_l - ( ' I O. IJ I 'OFI..A£G I 'ope _REG l ' SO_U I
CS2 112_;;::I~i7 1 'll.lJ_iDD I ' OUT_ALlIl 'WIIl_D_REG);
. e lk ( elk } .
n «rot) ,
::-:~~;:~"l.
:~~=~:Z~~;t1l ,
flag_bigh O ,
. fl~lo. () •
•d.t ... <o_Io" O,
:~:::;:~)~Nl().
.10"_"' 1<. _... 0.
cor I')"_I DO .
CUTy _o utO
"!I I lIRe ""Utur.t\oD. . .ddpar.. . o r l. l I St1lIJCTI OlI_O - ( 'A O_U I ' OP4_MO_REG I ' BO_IN I ·OPB_IID.REG I
=1t2.~::::1~;i7 I ' ILIl.IOR I ·OOT. 41O I ' Ul.O.D. D. REC);
.clk(c lk) .
n t [ r.t ) ,
:=~::~: :
. ~::=~~;~).
t1 'lg..1l.1gbO,
.~~~::;~(),
.2:~~~~:::: ·
. eu r,_I . O .
COTr)' . ou t O
II lQodtm.t....dloo
~~~:,~:::~~ n:~~~ · I (: ~:'O~) : LIl.w_lJI£G I 'OPA.REG I
II ~o. l ...t .... ct IQ.
~~::;~:~~~~~~~=:; ; ' ~~~=~~~'~~~ii~4~ I 'OP!. REC I
11""ldl.oo< .... cti<m
~~rr;:~~~~~~:O~~:~il~r.:ii~:)~ 'HOLD.i.aEC I 'OPl j.E.CI
CS~li 2.DPIl oddrl:.o(
.~~~;;~~:
. ".':<'lO •
.b_l .0 0 .
:==~~~;:) .
fl>.g _klghO,
. t1 >.g. 1<"' 0 .
da.t .....tQ_l_O.
~.::;:~~:;~":=~:th) .
l ..... vc1c . _... O ,
eoorry.l.0 •
•• • rr 1. out O
L
II H ~ to.o.tol.th.S[J ........ , .
ddpar _ r ro y .OFFSE1 . 0,
d.lp4l" _rr..y. 1OO~_"A1'CII ' 4 'hO;
d.lpar rr .y .I.IlIllI. ....Tl;H. P A!lL£...... SI\ · 4 · kO:
do l par nA r c.y. WRlYE...PORT.lflDtH . 'lStI_PCMlT_SlZE . :U;
ddpar .."rr..y . llf.AOJ'Ol\ T.VIOTII . 'L SIU'llRT_S I ZE_32 :
CS1111 _L$Il ....... . y (
· ~~~::~~;.odd.O •
:i:::~~:=::~ ~~ ).
• 1 __.(1••_no . d.IUldr) •
. 1 d.d.t.n.....dA.u) .
clI.I'Ldat . _lo(32·bO)
L
~~CoUllto. 10 u.od IQ' ~..pln8 t ..c~ 01 t h. ...."., 01 ,,,,,nd . '''pl . tod
~1~~";o:::::~~:_F\EG.nlTlJ.U AL\IE • 32' hOOOOOOO.;
d~~:.;"7;~~~~~~~J.~(; l~~m~~.~~1?i:=~ I ·OI'u m. I
II Ik>ldI... t ru ctio.a
dolpar_coUllt .r,IN5T1l0CT1D1CI ' (' i.ALlJ.IN . <L010 . l _1\EGI·OPi .REG I
;~:;: ~.::~~:~~ I ' OUV Ul I ' LOl O. O.ut; I 'PLJ.UI;j):
delpar... conoto, .USTROCTI0JI.2 · ('!_lERll. 1N I ' LIl,w_A.1lEGI 'OPA.R:EC I
::~I~~ii~'\:..:.:~~-tASs, I ' OOT_ALlJ I 'LIl,w.O.RF.C);
:;~~i~~:
"" 00 0 ,
. b , i.OO ,
dl"' .""tl"'tO •
.......dd r('''''ot .,_ctl) ,
U og.b'gMc<>unt .d""'l. tlog).
: :~~~:~ ~~ ( ).
•<1.o.t l r ",".loO,
.1 odd rO .
h vr lt• • oaO •
. carry_laO .
.c o..y_<>utO
"/I M... d 0 volU"lI bl<>c~ t or t ....""trolle. to valt fo< th o <l<cul...
~~~~::.t:.~e~~~~~nTl.lL_mUE _ aa-eooocccoa.
~;f:::: ::~<:i:.::;~:~::~ _YiUJ& . 32'hOOOOOOOO;
dolr..... wol.t u .USTIIUCT IoN _o . ( ' B_ZERO_IM I'HlllJl_I _UC l'I.OJ.D_B_1Wl
~;I~:;~~~.~:~;IW) •
~:~~~!::~;;~~~::~ f?~~: :~:~ I
dk(cUd •
.:::~~~~~ ,
"_i~OO .
. d p ,,_ <>utpn t O ,
.c .._ oMr(woJ t.r~<tI ) ,
. f1.,; _h l gh {vo l t_ d _ o_ t l "l!:) .
:~;::~:~~~~o .
4oto_fr<>e_l ..() •
.~:::=~~;~() .
coney _lnO,
.""" y. <>Ut O
"v..,-cirShltt .,. ...Uter(ch.ut, ....Utor _etl.w3 ..... . -S •• 4):
A.5 Unsigned Integer Multiplier Module Controller
~~::~:'~~~,.3~~op.o.u c""troUo<
`define MUL1 3'd1
`define MUL2 3'd2
`define MUL3 3'd3
`define MUL5 3'd5
.ip...fl "ll.
output~ctl) :
::::~ ~~:~
input sign_flag;
output [2:0] output_ctl;
reg [2:0] output_ctl;
reg [2:0] current_state;
reg [2:0] next_state;
reg [2:0] sign_flag_delay1;
reg [2:0] sign_flag_delay2;
reg [2:0] sign_flag_delay3;
// DEFINE THE SEQUENTIAL BLOCK FOR THE CONTROLLER
always @(posedge clk)
begin
if (rst == 1) begin
owrtat_n• • • <. 'IDLE,
.1l!'Lfl......hJI.)·_'
.11I'I .f\_<lti.(.l~)·_ :
OIp..n......hyJ • )'_,
_pot_<U .""'lOO,
_.t..t..c....
_ ....t_ <-_.._.~:
___<d _ .l.p_U .....l .')'
.l.p..U-e.-oIo.I." ••lp_n l • .,2 ,
~_II_.I.(.l - •• p.n l.., l :
I/lSlIl GIl fU WTPVT Lint
' IDLE : ~""
. u t _..... - · IlIJI. I :
U (.lp dl-r-I' I>I) 1»&1•
•i&lLn ,,&-~ol .J 1 - 3'bOO l,
. ".. . l .. l>o&i b
:~O_f1O&-d.l.Y l_ 3 ' bOQO ;
•.....1 : ""1\ ....
... ,_ ...t. _'~:
IH.lp_Il-e-- j·OU l>opo
::";:~"S;.~:'1.3'bOOj,
:r_f1"S_<lti.J1~)'_:
•.-JU:I»g....
Mn._. ...t . ~ '''-'l.3 :
UI.l9'_Il_I·l>l1-e....
~~~""'::~:'1 • 3'_1,
:r_f1 ....<kJ..'1 . 3._'
'1IUL4:I>e&io
...._. ... .. _ ' IM.5,
:~:~~~::;:~;I>~) 3~ ~:
;:~::~~:JI . "bOOO:
•....u;:bq;..
_,_....... · ·IDLE:
H(. l gn_ll-.r-"I'bl) M~ln
:~~~~~t.::~ :Y I - a-soon
:~_1l.~_dol.YI_ 3 ' bOOO ;
A.6 Unsigned Integer Multiplier Module Datapath
II Un.I~.<>d .~ l"pli.,- . """10 f~< «6 . • ot . tb h ..., l t l pUor 00" vork>
~7.~~;':i>raU""of
`include "CS2112_Instructions.include"
I<>du1 . ... _lUL1tipllor_dp(elk .rot .Qp _lo •• i~_n~.r.._2...<t l . _ l tipllor _""-t "" t ) ;
:~:~~ ;~~~
,,,-V", (3 1,0)QP_in ;
I npo t (2: 0 ] r .._2_ cU ;
o~tpn< O.\gn. f1"l1 ;
Q~tp"t (31:0] ""l tipllOT _Qut ll" "
v l u (31: 0 ) vl ,v2.v3 .v4.v:;.~.v7 ,va."",vI0.v'1.vI 2.vI3 .~U .dol .J;
delp i gn_ ,,",U'U.B_REIl_I.ITUL_ULU& ' 32'bOOO<l&lOO;
delpar lgn_d.t.ct\ .•.JIJ;I:_,.I1'IJ...• .&U/E·n·hOOOOBOOO;
~~~~r.-~:;~~~~~~~:v~~~~~p~~~~~~~~~~~G~~~~? n.~~) ;
C:S2ll2_~~k~~~tt."I(
ro tlT. t l ,
·::~:~i;~-i.n) ,
::::::~~~~) .
f1 ag_b l v. (. 'v'_ll"ll) ,
·:~~;d~~~() .
l ."_. d u O .
h. .... vd t . . ... O.
corry_ l o O .
c ortJ_ Q"tO
d"fpar" .'l"_6et ""t2 .e, JW).I IITHL_HlU£ _32 ·b7fff7fff ;
delpor.. . lgn _dO..,t2.IM3T1\ilCllO__Q- ( ' oO_n l' OPo _lHO_IIOSKI' I'lO_I W!
· (lP!l. Ill'.<:I ' Il(lU)_.... lIEGl'II(lUl_ B_Ill'.<:I'SHII'T . OFl'.'lLU.PoSSll ' otIJ _OLUI·W W ..OJlf.C):
c:s2ll2_~~~;~C..<t2(
. ::~:i;~_in) .
dp~_ou~p.t ( vl) •
.c......<lr(3'bOOO),
n OR_U V.0•
floR_J ~vO •
.::~:=~;;;~~;),
: ~:''':~;~~;.O ,
. <:arry_I n C) ,
. c ..ry.ou~()
~~;"~I~~~:=:~~~:;~~.~:~:~:t1..:~~:~~I.OAIl_UJ'.cJ'
def~... I "" _lov _...1.1_IlEG_IJln1l..-VAWE e 3:l'h OOOOOOOO,
dd""r_ lo ,,_ lov ••• I.P.IIt<;.I.ITU~.UW& - 32'h(>l)OO(l(:<l(l,
CS2 112_~t~:;;:~ :"_"'.I(
: ::~~~~:~) ,
h_I""("O •
.::;;';:~~ ;
de fpar.. bi~lov_.uI.INSYI\IICTIOJ.O .« ",",,--Ao_nj'IIl!L.U1AD_l..RECI
' Il\IL_BO. n l' lIOI...l.OAIl. 1I.JlEll I ' Il\IL_1 _Hl _161 ' IIIJl...P_UI\I . 16 l'IIl!L _(RJT l'IlIJl...LOJ.D_O_IlEG);
def )'Q"_hi gh..l "" d . 1 _1lEG_IU TI.I.I...-VALUE _3:l ' hOOOOOOOO,
dd p.., .. h.igh_I.", _ I.p.llEG.n1TH~_VAlm: • 32'h()OO<;J(M)OO ,
CS2112_~~!;~~"".",l(
__100, _1) .
b_loOC_ll.
:~~=~:~;
IIDi'Il J .o.IlE, de l lly_l : I clkddllY
ddpu_d<ol_y.l.nSTHIICTI O' _O'oC' 10 _IXj '0I'1••0. 1lEG1' 1lO. I J I ' l1P&_Il.EGI
' HOUl_1 _UC !'lJ)J.D_P.Il.EGI 'SllIFT. 0l·' I ' AW_P ASSllI'mrT_iLU I 'I.OIJl _O_UC)'
c:l2 1 1 2..D:;t~:~:~: I (
: :: :~~~ : '
b_loO( ,,2),
. dpo.ou~"", (,," ) •
. c........, U · bOOO) ,
. fl "ll _hl gh O .
tl "ll _l "" C),
:::~::;;~:~;~() ,
. i::::~~~~.(),
carry.loO •
. cerrr _out O
II--~··-··--·_--------_·_----·~---·----- ·--~-----~--
II OPU HlIl£ o delay_I' 2 <l k o1el .,
~~:;::'::~~~a~~~:.u;:~~;~~W~~:~I~~=~:~~:~O~) '
c~m12_D~t::~:1111(
..tCut l,
lI.bOO .
h .1 ..o (v7) .
:::=~~~=C·
neg_bighO,
:£::=;;~~~i>.
. 1.........'C) ,
b ll _",U . _""C ).
.... r r ' _ta O •
. car ry_ QnO
/ 12cn Od.,
defpv_ dal.,_".I~IOIl_o-('jO_I . I·otl....AUl·-'_I . I·OI1I...&£CI'w.o_j .aEJll· I1lf.D•• •lml
·$Ilf".Ofl'I'llII.PJSSIII'llUT_&Lll I ·UWl_D.IEG):
C32 112~Er(
... l.aI'lO •
•1>• • ..0(0,."') .
: :=~~ j .
tl ...... .p o .
:~:::~::~~( l.
. d . ....h ·...l ..( l.
: ~::=~~;,, (}.
•<arTJ.taO •
. <arTJ_....O
II 2 cla 6al.,
_ _ 6al.'..2.1r!mlllCl"la..o-(·ao.I.I ·DPI....IEG I·-..I .I ·~...&£CI'UWl.j.UIH ·UWl•• _IEGI
·$Ilf ".DP'fI·&Lll .PlSllIIl 'DlII"_llII l ·UWl_O..JJ;l:l:
=112~~'::~:r:2(
.r n{rnl ,
.... . mOo,
l>.laO(v9) ,
: :::;;:~I.
.n lpo .
• n l a vO.
: ::~:~"7~(I.
: ::::,,"~~;'{l.
. <arTJ.'.O •
.carr,_"".O
II 1 cU dal.,
~~=':~~~~;~:~~~~;~~~~~~~~;~io~~;~' I · llP•.•O.R£Ct'LDlD • .l..-RF.G!
(32 U2__ dal.,. 4(
:~: ~;..~: :
' :::~~; I}.
•"'"'.....,....(.101•
. ....._ (3·1>000 1 •
.n ...... lpo.
~::~:~O.
•Oat ... f ....._l_O •
.~:::::~~~~.o .
.carry_bO •
•carry__t O
~:"'~~~~~i?"~~~~~=:C~~·I.I ·OPt..rml ·UIlD•.u""I 'UI"D.I.lEGt
cs::III 2_tIf'I;I-..U
.c n(cn) .
n~(rot) ,
: ::: ~: ;:~: :
<%>,<-o~tp"t(.s) •
. e..~*ddr(3·bOOO) •
.n~lgllO,
n o&,.lo v O •
.::~:=;;;;~~() .
•1.......dr O ,
.loa.uit. _oaO ,
eo<ry _;aO ,
earr1_"" tO
deil'""_ od<I_2 . n ST1UJCTIOK_o- l' .O_lM l' OI'A_l¢C I ' ~.n l ' 0P8.= I ' UliD. l_l¢C1
~~2~D::~~: 'SJIFT_AlIT_16 1'lLO _lIlDI ' OI1I"_l LUI ' l1llIl . O. UG ) ,
elldehl •
.nt(r.,).
*. I»o h,5 ).
. b_Jn O(v4) ,
: ::::=~;;:.;) ,
. f1q..h1gb O ,
. {los.l oaO.
""ta_to . 1... ( ) •
.~::~"7>~"'O .
l ....vr>t ._ .aO,
.carry. l a O ,
. CUT)".o~t()
dwfpar_ Bdd_3. USTlUlCTIOJU>-('iO_III 'OPl .RtCI 'IIO.UI '0l'll _UGI'WiD_• • UGI
~~2~D::~~:'SHl'TJHT_I I'IlU_lDDl'OIIT~WJI 'LOIJ)_O_AEC);
. ;~~ ; ;:~ : :
.UaO ( d:~;~~ ;V6) ,
.::=::~~~~;.
f1"l;./ligb O .
:~:::~~~i~() .
.~::~~;;:; ~.. o.
1. a. vr lta_aa O,
ear ry _l a O.
e or ry.o""O
dd par.. ... _I .A. IIJ'.c:_I . n u l.• •• lUK - 32·/l00000000:
~~S~:?::~fr~~=~~:~~~~~~~~~~:~~~:m~"~_a_R£.cl '
. ch ( en ) .
r.t(rat),
*_1110 .
b _'aO ( vl) .
:',,= ~~ ;~j ,
flag..blgbO .
: ;:::=;;2::~().
la __add.-O,
.1"_liTi~• • e"{) .
ca"ITJ.ioO .
Car TI.ou'()
de / paT'" c • • • 2.USTR\JCTIOM.00(·AO.1M1 'lIPA. MOJWl I ' OO_IMI ' I)pB.'O.Itt~1
· LOlD.A.Il£(lI · HOlJ).B_R~"I · SHIP1'.O""\ ·ILU.PiSSJ, I'(RJI". ll.Ol·LOlD.O.l\Ell),
ddp"' ... r u . 2 . ' MSTIlOCfI OW. ' . ( · AO_I MI · t»' I .... O..JWlI 'B O.I1 I1 '0I'B _. OJlt:G1
::S;~~;~jj::::~~~.B-I.ml ' SHIfT .OfT 1'1W• .uHl1'0IfI".1W I 'LOlD_OJ£.\l) ;
d1<ld.) •
. u , (u , l .
.b _ioO(u'OJ •
• • ioO( u8 "
: ::= ~;~::~: e 'l) .
n..,.lolgMl ,
· ~~:t~::i:;(l .
.~::~"7t,, (j ·
.'. ,, _"clt" . "oO ,
ea«J.bO ,
.carc1.ou,O
defpu.. l i co d l . I MSTRtlCTIOM. D--( · AO. IMI ' OPIJIO_REGI ' BO_IM! ' OPe . MO. REGI
;:'~;~jj~:;:~(e.Il£(lI'I.SL I' SIIFT.I.IIT.51'.lUI.P.lSSB I'(kIT.IUI I·UWl_O-ItEG)'
el'(e1k) .
.::~~~~~.
•b.1n0(u 12),
::==~;~;:':;,
. n ..,. b l gb.( ) .
· ~~~::i~ ( ) .
d. b . fr_ . h . ( ) •
.i:=~~~~;",() .
ea"ITJ.1"O ,
.cacc1.outO
dofp"".. U~o<l2_I"S1I!"CnO••D--('IO.lMI ' ffi'I..MO.A£ GI 'OO. IN I 'OPIUIO .REGI
' BOU>_I .IIJ):; I' HOUl_BJlE~ I'I.SII I'SIIl'T.AlIT_271 'A1.\J.P"'%BI ·O\JT. AU1 I'l.OlD_O.It£C);
CS21l2_D~:;;~~r
' . ; 00 0 .
b.I"0(vI2) .
: ::= ~~ ~ ;,
f1 aSJ>lpO .
f1 as .lou O .
::~:=~;~~:.:( ) .
'~==~~;::OO .
<Aery.hO,
ca«, .oot()
.dk(dll) •
. r n ( r n ) •
•_.!.G(. 14) •
•I>_UoG(.U).
:===~~::I_.oortpoId.
.fl O(...Mp U .
~::::::~o .
:=~ci~-()'
.l __..-I U . _ O •
.<....-,_'.0 .
.•....-,. - 0
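The instruction fields in the multiplier listing above appear to select 16-bit halves of the operands and 16-bit shifts, which suggests that the 32-bit product is assembled from 16 x 16-bit partial products. A minimal sketch of that technique in C is given here under that assumption. The function name `mul32_from_16` is hypothetical and the code is a behavioural model only, not a transcription of the datapath.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a * b, built only from 16 x 16-bit products. */
static uint32_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xffffu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffffu, b_hi = b >> 16;

    uint32_t low   = a_lo * b_lo;                /* full 32-bit partial product    */
    uint32_t cross = a_hi * b_lo + a_lo * b_hi;  /* only its low 16 bits survive   */
                                                 /* the shift below                */
    return low + (cross << 16);                  /* a_hi * b_hi falls above bit 31 */
}
```

RC6 only needs the product modulo 2^32 (for example in B * (2B + 1)), so the high-by-high partial product can be dropped entirely.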
A.7 Verilog Testbench For Controlling RC5 Iterative Pipeline
vi.... (31:0) n"ll.l. ...:ho:
.ir . (3 1 :0).' ''t:'' 1.".''':
.Jr. (31:0) n"llol.".2<;
. . .... (31 : 0) .'''t:''I . " .:N:
. . .... (31 : 0).~."_3Io;
...... (31 :CI)._2."..3lo:
....... [31: 0) 'UC.2. " _3c :
.1 .... (3 1 :0J .....2.".3<1:
.,... (31:0) n o,p.1. ' •• 4&;
...... (3I:Cl1......,• • ••41>:
..... (3 1:0) . Up3. ".4c ;
'O:i.... (31:01~3••••4d :
.... (31:01 '""• ...u..~t1:
>'W(l(31:0)"~.1__2 ,
'OS (3 1:0) .n.nol.1af'ot3:
..... [3 1:0J'""•...-l.~4:
. tro (31 : 0) .n...u.. _ pou :
. 1 (3 1:0 ).rt..--t._'po.2,
. 1 (31:0).n ......l . _ . poU :
.1 (31:CI)'""<.~._.po.4:
' OIl: Ao_ l.on .l •
.....-..'.
.... ...... .n . n ol . ' :
r . A ~•• 1. 1t . 2.
AO. "", . 2 .
go.ru~• • , ........ol _2:
....A &•• J.~1' _3.
A·........3 •
... . nm • • " . "' o1. 3:
<011: i.o" .4 .
11"'_ _4.
t"'.nw• • "'.nol .4 ;
.'re [2::oJ LSIl.cU. I:
.,re (2,OJ ' bl<><u.nL I;
~;: g~:l :':"-:~i~:·l ;
ortre {7: 01 ..... ft...et12. I;
wOre ,... l.'-_~I;
...... [2,01 I.5II..rtL2,
. , .. [2::0) Ibloclll.nLa,
" I.. {2:01 ' 1>I0d2..n L 2 :
.... [2:,ol_.« La,
~;: [~~.=~~7'.«1a.2:
. ,." [2:0} LSL «I. 3;
.h" ta :ol i~oclll .ctl . 3 :
. 1.. [2:01I blocU.ctL3,
::: g::: ~:i~~:~~ 12. 3 :
wl ... .... U _d<>...3,
v I..., (2 : 0} L$!I. <U . 4 ;
".~. [2 , 0) tb l <><:k L <lL4,
vlre (2 :0 ) lb l<><:k2.<tl."
"' to (2,0) ><!d~. <t l.4 ,
:~: (~~~."::~~:7~_<U2·"
rc~."""-f._.<tl«lj{
.dk(dk ) •
...<Ito< ) •
.........l< I _ <• • ) •
... -r-I U •
.&0'...--..---.11 1.1) •
.lSI... ctl (UJl . ctl. U •
•' b l oclll.nUlbl..aI.<t1.l).
:~~i~~n~rLU•
..U ...."'12I..........«12• • ) .
i~·--("""-e,-·I)
~d.MlI._.nl ..121
::~~~~:
P .u(p.· · . 2) •
.p /"' 2 ) •
.p onar-llp l .2) •
.LSIl."'1( i.SL"1.2l •
. l bl ..d l .ctlUbl<><:llI.<U.2) •
. ' bl<><k2.<U ( I~<><k2.«L2) •
._~<Uhddr."1_2l •
• U h.r."'12(uU,•••"'12.2).
}~lf~<i<>••(""lJ_d"" •• 2l
r<b..b &lf .ro.ll<C..l ..l3 (
.< l k (d k ) •
....( r ot).
p.lm;' {~Q_I"l<.3) •
., ..-r-(... _..-...3) •
.... ....... ........1(10 1. 3) •
. 1S!_ ",I (LSL <U . 3) .
: : :~=:~::: :~~::::::: :
.addr.<<l(04d• • n l .3) .
· ... U~*"_cU2 (.... U ....._C~U_) •
i~C-(It.&II.-_31
r <f>..l. olC ..-4_cU eU4 (
. c1 k(c Ik) .
n~(cIt) •
.p.1aa(p.iAi~ _41 .
p_...lr_.....41•
.~':i~~:;:---.n. ..... . 41•
. i.l""",' _",,lU. l oc u_nl_4) .
· ::"":::i~~~~~:t1·41.
ek i fto..-_cU 2 (o U t *" . rt12_ 4) .)~I_ ( II.o.lJ 41
~:ik~~::; ~-dp "'piC
n«rnl,
.• .-o~_ l blodl _<t l (lblocU~ctLII.
.:::::~~"::~i ~ ~~;~~:j:U. I I .
: :::::::~~~(~~~~~ ~ ::r.ora.u.
.oM.. 'blocU . ctl(lblo<kLc~ LIl.
· :::::~;:~~~~~~~~ij:~U) .
· ::::~~~:~ i~~~~~~ ::·_eUU1 •
. t __l (• • ur.al_h'l .. ~ Il •
.~'2(."..-.....J.~ U) •
• t . potJJ (. ....._ LI.....31•
.'-.4(•....--r~ I_t4) •
._H.~•. I_••_2a) •
._ 2< ,_..._~1 •
. _ <3( _ 1_' 0. 2<) .
,7"'4< 1_••_2<11
~;~~~;~_.. ....02(
... . (......1.
::::~:~::~:::::::==:::~::
.• . ctll_ . ct l . 2 1 .
:=:~~:~(~~~~~~;:r_nI2.2).
:::: :::=::::: :~::~::~:~::
. odd . _ _cU(eddr. n L 2) •
•odd _oIo.in__ctI2h IlIU_ _eU2..2) •
•0<Sd_lSII _ctl (L'lll _d L 2 ) •
. 1BI"" U s"'l!;o l _to.2II ) •
• i .pat 2( O<"f'" \.t" .2Il) •
. 1.p"t3(ot·soLto . 2c ) •
• lB pat4( ot "t" l _to _2<l) •
.""~ I ( '~"lI.2_~o_3. ) •
oot2 ( .~og.2_to . 3b) •
""t3(.~ ogo2_«>_3c) •
) 7" t4 ( ..ogd_~ O_3<l)
~:i;~~~;~-dp oup )(
..~( ... <l •
._-.. l blockl_<tl(l lIlockl_ct l..3l.
. 'I>1"".12_~.l U~I O<-.l2..cU_.) .
.........._ _ct l( __~. 1.." •
.......... 1ft.... ct12(oIIif...._ct U_,sI •
...-.J.SI\....U (l3l_«. 1... I •
.-"I~l oc:lr.l~~tlU~I"""L~tl_31 .
: :::~~i~~~ ;~UI .
•0<Id..UJ.n ... . rt1.2 lo11.1l .... _~U2...1 .
•"""..I3LcU(1SI_rtl _,sI •
. ........ U . .... 2".. o_W •
.~ub~ :-. ) •
. ....p«. ( ... op2.. _3<I•
.~(.up2.. _. ," •
. _ l hup3..... , 4al •
. """'(.<'Ip3_. o_nl •
._t3h~_to_4<l .
j 7"'U.~~to_4oI)
r~;~;~~.4p ••~(
r ot (r ot) •
..... ILib l odl.ctl (lb1bt U . ctl-4 J •
•• • •• • ~1 <><n_ct1 (1 ~I .cn_ctl .4 1 •
.......oddr~ctHa<ldr_t~L41 ,
.:::~~~~;(~~~~~:~:r_ct lU l •
. o4d _lb lockL<tHl bl <>< kLctl . 41.
· :::~-::~i~:~~;;~j~L41 .
: :::::~~:;(~~~~~:~r. cm.41 •
. ....,.tl (.~_ •• _UJ •
. lo"".,I.top3_<o. 4bl •
• Lo",,"(~_""_4cJ •
•1opo14 1.......... .... . 4,1) •
._.I I-"~._~""~ Il •
._u{.n...--J.__ poU l •
....,, 1.. .....-1__.""0.3).
i;""'4(... ..................4J
~"&l. boc""
·.""I ·r~_Io .• ocl0040·
· 1. <:1 ·rc!oIl.)'dtaco n . I..<l •
· h Kl ·ro;.5l(,oyaS :lo .l oc l •
' 1",, 1-.1. · r<f,j[.,..s~ :l'I>,l.d •
· l oc l ·retill"'oll d • . loc l •
•• • c1 · rc$I:..,aS 3b . lach.*" ·
· Iocl -.l . · <CU .., olIt 04 U d •
·h.el.... ·r<~..,olItq04b, loc l •
r"c'I 'W:
...
•"uaol _ito""tIC_n',,_,
::""01_1_" C~ 3:I '''_;
.._.-....... no&I._l c. I · ~I ;
.. ....-....l_"-t, c~ 3:I' ",,,7f Jtb:
:~;_.....u.__'4 ' _n".,..,.,,,t,,,
p . .... _n.r-1 . 1 .. I '~;
::0....1 . - 1 ' '' ,
p_~I <. I'~:
:::-.1 <.1'1> 1;
p.~l<.I'~ :
::0~1<.1 '1> 1;
1"_~I<- I'bO:
~Q. "",. I <. l'bI :
p . ........I<.I 'bO :
~Q.........t ..nl.2 <. 1'1>1 ,
'"P .....· It• ••1Ol.2.·"bO,
:::-.2<.1'1>1 :
......r-...2<.I 'bO :
p_~2 '- I'''I:
1J'I'• .--J "' I'bO:
:':0 2 •• 1'1>1:
p _ _2<-I'bO:
..00
:':0 2 . - 1.1>1:
p 2<.I'bO'
..00
:': 0...._2 <- 1.1>1;
~Q_n>._2 c_ l' bO;
..00
:;0· ··,,···..1. 3 <., ·1>1 :
~Q _.rt.....101.3 c. l'bO:
..00
iJ".....3<...1>1:
."1"_.--...3<-I "bO:
Appendix B
ANSI C Code for Select Implementations
B.1 RC5 C Code For Testing
/*
ANSI C Implementation of RC5-w/r/b encryption cipher.
Taken from "The RC5 Encryption Algorithm", Ronald L. Rivest,
MIT Laboratory for Computer Science.
Modified May 25th, 2001
Jason Rhinelander
Modified July 9th, 2001
Jason Rhinelander
Code will make function call and verify correct output.
*/
typedef unsigned long int WORD; // 4 bytes in WORD
#define w 32 // word size in bits
#define r 12 // number of rounds
#define b 16 // number of bytes in key
#define c 4 // number of words in key
// c = max(1, ceil(8*b/w))
#define t 26 // size of table S = 2*(r+1)
WORD S[128] __attribute__ ((aligned(16)));
WORD S2[128] __attribute__ ((aligned(16)));
WORD pt[2] __attribute__ ((aligned(16)));
WORD ct1[2] __attribute__ ((aligned(16)));
int ct2[2] __attribute__ ((aligned(16)));
unsigned char key[b] __attribute__ ((aligned(16)));
WORD P = 0xb7e15163;
WORD Q = 0x9e3779b9; // Magic constants for generation of
// the subkeys.
#pragma CMLN_FUNC_DEF rc5top(int in dp.iblock1.dpu.a,
int in dp.iblock2.dpu.a, int in dp.memArray.lsm[128],
int out *dp.iblock1.dpu.o, int out *dp.addr.dpu.o)
// Need to define rotation operators. Note x must be unsigned to get
// logical right shift.
#define ROTL(x,y) (((x)<<(y&(w-1))) | ((x)>>(w-(y&(w-1)))))
#define ROTR(x,y) (((x)>>(y&(w-1))) | ((x)<<(w-(y&(w-1)))))
// The encryption function.
void rc5_encrypt(WORD *pt, WORD *ct) // NB: 2 words in pt and ct
{
    WORD C;
    WORD D;
    WORD i, A = pt[0] + S[0], B = pt[1] + S[1];
    for (i = 1; i <= r; i++) {
        A = (ROTL(A ^ B, B)) + S[2*i];
        B = (ROTL(B ^ A, A)) + S[2*i + 1];
    }
    ct[0] = A;
    ct[1] = B;
}
// The decryption function.
void rc5_decrypt(WORD *ct, WORD *pt)
{
    WORD i, A = ct[0], B = ct[1];
    for (i = r; i > 0; i--) {
        B = ROTR(B - S[2*i + 1], A) ^ A;
        A = ROTR(A - S[2*i], B) ^ B;
    }
    pt[1] = B - S[1];
    pt[0] = A - S[0];
}
// Setup function for the S array.
void rc5_setup(unsigned char *Key)
{
    WORD i;
    WORD j;
    WORD k;
    WORD u = w/8;
    WORD A;
    WORD B;
    WORD L[c];
    // Initialize L, then S, then mix key into S.
    for (i = b-1, L[c-1] = 0; i != -1; i--) { L[i/u] = (L[i/u] << 8) + Key[i]; }
    for (S[0] = P, i = 1; i < t; i++) { S[i] = S[i-1] + Q; }
    for (A = B = i = j = k = 0; k < 3*t; k++, i = (i+1)%t, j = (j+1)%c)
    {
        A = S[i] = ROTL(S[i] + A + B, 3);
        B = L[j] = ROTL(L[j] + A + B, (A + B));
    }
    for (i = 0; i < t-2; i++) { S2[i] = S[i+2]; }
}
/* Any other code following this is for testing purposes
(ex generation the S[] array). */
int main() {
    int i;
    pt[1] = 0x21A5DBEE;
    pt[0] = 0x154B8F6D;
    // for(i = 0; i < b; i++) { key[i] = 0x00; }
    key[15] = 0x91;
    key[14] = 0x5F;
    key[13] = 0x46;
    key[12] = 0x19;
    key[11] = 0xBE;
    key[10] = 0x41;
    key[9] = 0xB2;
    key[8] = 0x51;
    key[7] = 0x63;
    key[6] = 0x55;
    key[5] = 0xA5;
    key[4] = 0x01;
    key[3] = 0x10;
    key[2] = 0xA9;
    key[1] = 0xCE;
    key[0] = 0x91;
    rc5_setup(key);
    /* Put the S[] into the lsm for the hardware call. */
    rc5_encrypt(pt, ct1);
#pragma CMLN_FUNC_CALL rc5top() SLICES=(0:1)
    rc5top(S[0], S[1], S2, &ct2[0], &ct2[1]);
    if ((ct1[0] == ct2[0]) && (ct1[1] == ct2[1])) { asm volatile ("mov r8, 0x10"); }
    else { asm volatile ("mov r8, 0x20"); }
}
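The listing above depends on the fabric pragmas and on WORD being exactly four bytes, which `unsigned long` is not on many current 64-bit compilers. As a hedged, software-only sketch (not the thesis code), the same RC5-32/12/16 routines can be written with `uint32_t` and checked against the all-zero example printed in Rivest's RC5 paper. All names prefixed `rc5_32_` and the helpers `rotl32` and `rotr32` are assumptions introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { RC5_R = 12, RC5_B = 16, RC5_C = 4, RC5_T = 26 };  /* RC5-32/12/16 */

static uint32_t rotl32(uint32_t x, uint32_t y)
{
    y &= 31u;
    return (x << y) | (x >> ((32u - y) & 31u));
}

static uint32_t rotr32(uint32_t x, uint32_t y)
{
    y &= 31u;
    return (x >> y) | (x << ((32u - y) & 31u));
}

/* Expand a 16-byte key into the 26-word subkey table S. */
static void rc5_32_setup(const unsigned char key[RC5_B], uint32_t S[RC5_T])
{
    uint32_t L[RC5_C] = {0, 0, 0, 0};
    uint32_t A = 0, B = 0;
    int i, j, k;

    for (i = RC5_B - 1; i >= 0; i--)              /* little-endian key load */
        L[i / 4] = (L[i / 4] << 8) + key[i];
    S[0] = 0xb7e15163u;                           /* P */
    for (i = 1; i < RC5_T; i++)
        S[i] = S[i - 1] + 0x9e3779b9u;            /* Q */
    for (i = j = k = 0; k < 3 * RC5_T;
         k++, i = (i + 1) % RC5_T, j = (j + 1) % RC5_C) {
        A = S[i] = rotl32(S[i] + A + B, 3u);
        B = L[j] = rotl32(L[j] + A + B, A + B);
    }
}

static void rc5_32_encrypt(const uint32_t S[RC5_T], const uint32_t pt[2], uint32_t ct[2])
{
    uint32_t A = pt[0] + S[0], B = pt[1] + S[1];
    int i;
    for (i = 1; i <= RC5_R; i++) {
        A = rotl32(A ^ B, B) + S[2 * i];
        B = rotl32(B ^ A, A) + S[2 * i + 1];
    }
    ct[0] = A;
    ct[1] = B;
}

static void rc5_32_decrypt(const uint32_t S[RC5_T], const uint32_t ct[2], uint32_t pt[2])
{
    uint32_t A = ct[0], B = ct[1];
    int i;
    for (i = RC5_R; i > 0; i--) {
        B = rotr32(B - S[2 * i + 1], A) ^ A;
        A = rotr32(A - S[2 * i], B) ^ B;
    }
    pt[1] = B - S[1];
    pt[0] = A - S[0];
}

/* Returns 1 if the all-zero key and plaintext give the published ciphertext
   and decryption restores the plaintext. */
static int rc5_32_selftest(void)
{
    unsigned char key[RC5_B] = {0};
    uint32_t S[RC5_T], pt[2] = {0, 0}, ct[2], back[2];

    rc5_32_setup(key, S);
    rc5_32_encrypt(S, pt, ct);
    rc5_32_decrypt(S, ct, back);
    return ct[0] == 0xEEDBA521u && ct[1] == 0x6D8F4B15u
        && back[0] == pt[0] && back[1] == pt[1];
}
```

Calling `rc5_32_selftest()` from a host-side test gives a software reference that does not depend on the hardware call, which can help separate key-schedule errors from datapath errors.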
B.2 RC6 C Code For Testing
/*
ANSI C Implementation of RC6-w/r/b encryption cipher.
KERNEL MODIFIED CODE for testing
10/10/02
This file is a fully testable implementation of RC6.
The key must be the same for all blocks of plaintext.
In main(), an easy way to change the plaintext values
that are going into the cipher is to change the
arithmetic parameters that are modifying the
seed values.
*/
typedef unsigned long int WORD; // 4 bytes in WORD
#define w 32 // word size in bits
#define r 20 // number of rounds
#define b 16 // number of bytes in key
#define c 4 // number of words in key
// c = max(1, ceil(8*b/w))
#define t 44 // size of table S = 2*(r+2)
WORD S[128] __attribute__ ((aligned(16))); // global visibility
WORD evenS[128] __attribute__ ((aligned(16)));
WORD oddS[128] __attribute__ ((aligned(16)));
WORD A[128] __attribute__ ((aligned(16)));
WORD B[128] __attribute__ ((aligned(16)));
WORD C[128] __attribute__ ((aligned(16)));
WORD D[128] __attribute__ ((aligned(16)));
WORD ctA[128] __attribute__ ((aligned(16)));
WORD ctB[128] __attribute__ ((aligned(16)));
WORD ctC[128] __attribute__ ((aligned(16)));
WORD ctD[128] __attribute__ ((aligned(16)));
WORD ctAfab[128] __attribute__ ((aligned(16)));
WORD ctBfab[128] __attribute__ ((aligned(16)));
WORD ctCfab[128] __attribute__ ((aligned(16)));
WORD ctDfab[128] __attribute__ ((aligned(16)));
// Magic constants for generation of
// the S[]
WORD P = 0xb7e15163, Q = 0x9e3779b9;
#pragma CMLN_FUNC_DEF rc6top(in int dp.rcounter.dpu.a,
in int dp.ArdMem.rdMemLSM.lsm[128],
in int dp.BrdMem.rdMemLSM.lsm[128], in int dp.CrdMem.rdMemLSM.lsm[128],
in int dp.DrdMem.rdMemLSM.lsm[128], in int dp.init1.dpu.a,
in int dp.init2.dpu.a, in int dp.final1.dpu.a,
in int dp.final2.dpu.a, in int dp.oddSubkeyMem.lsm[128],
in int dp.evenSubkeyMem.lsm[128],
out int dp.AwrMem.wrMem.lsm[128], out int dp.BwrMem.wrMem.lsm[128],
out int dp.CwrMem.wrMem.lsm[128], out int dp.DwrMem.wrMem.lsm[128])
// Need to define rotation operators. Note x must be unsigned to get
// logical right shift.
#define ROTL(x,y) (((x)<<(y&(w-1))) | ((x)>>(w-(y&(w-1)))))
#define ROTR(x,y) (((x)>>(y&(w-1))) | ((x)<<(w-(y&(w-1)))))
// The encryption function.
void rc6_encrypt(WORD *pt, WORD *ct) // NB: 4 words in pt and ct
{
    WORD i, B = pt[1] + S[0], D = pt[3] + S[1], A = pt[0], C = pt[2];
    WORD temp1, temp2, temp3;
    for (i = 1; i <= r; i++) {
        temp1 = ROTL((B * (2*B + 1)), 5);
        temp2 = ROTL((D * (2*D + 1)), 5);
        A = (ROTL(A ^ temp1, temp2)) + S[2*i];
        C = (ROTL(C ^ temp2, temp1)) + S[2*i + 1];
        // rotate the registers: (A,B,C,D) = (B,C,D,A)
        temp3 = A;
        A = B;
        B = C;
        C = D;
        D = temp3;
    }
    A = A + S[2*r + 2];
    C = C + S[2*r + 3];
    ct[0] = A;
    ct[1] = B;
    ct[2] = C;
    ct[3] = D;
}
int maxOf(int op1, int op2); // declared before its first use in rc6_setup
// Setup function for the S array.
void rc6_setup(unsigned char *Key)
{
    // NOTE ENTER KEY HERE!
    WORD L[c] = {0x00000000, 0x00000000, 0x00000000, 0x00000000};
    WORD A, B;
    int a = 0, s = 0, i = 0, j = 0, v = 0, u = w/8;
    S[0] = P;
    for (a = 1; a <= (2*r + 3); a++) { S[a] = S[a-1] + Q; }
    A = 0;
    B = 0;
    v = 3 * maxOf(c, 2*r + 4);
    for (s = 1; s <= v; s++) {
        A = ROTL(S[i] + A + B, 3);
        S[i] = A;
        B = ROTL(L[j] + A + B, A + B);
        L[j] = B;
        i = (i + 1) % (2*r + 4);
        j = (j + 1) % c;
    }
}
int maxOf(int op1, int op2) {
    if (op1 > op2) { return op1; }
    else { return op2; }
}
/* Any other code following this is for testing purposes (
ex generation the S[] array) */
int main(void) {
    // create arrays for input buffers
    WORD i, j;
    WORD ptIn[4];
    WORD ctOut[4];
    char *key;
    int pass;
    WORD rounds;
    rounds = 0x00000014;
    pass = 0;
    // pseudo random plaintext
    A[0] = 0xf758600;
    B[0] = 0x0112fe50;
    C[0] = 0x0023eff5;
    D[0] = 0xff45980;
    for (i = 1; i < 128; i++) {
        A[i] = (A[i-1] + 0xf456234) & 0xffffffff;
        B[i] = (0x459001ff + A[i]) & 0xffffffff;
        C[i] = (0x023ef031 + B[i]) & 0xffffffff;
        D[i] = (0x00260081 + C[i]) & 0xffffffff;
    }
    rc6_setup(key);
    for (i = 2; i < (t-2); i++) {
        if ((i % 2) == 0) { evenS[(i-2)/2] = S[i]; }
        else { oddS[(i-3)/2] = S[i]; }
    }
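    // NOTE: the loop header is missing from the source listing; iterating
    // over all 128 blocks is an assumption based on the array sizes.
    for (i = 0; i < 128; i++) {
        ptIn[0] = A[i];
        ptIn[1] = B[i];
        ptIn[2] = C[i];
        ptIn[3] = D[i];
        rc6_encrypt(ptIn, ctOut);
        ctA[i] = ctOut[0];
        ctB[i] = ctOut[1];
        ctC[i] = ctOut[2];
        ctD[i] = ctOut[3];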
    }
    // now make the hardware call.
#pragma CMLN_FUNC_CALL rc6top() SLICES=(0:4)
    rc6top(rounds, A, B, C, D, S[0], S[1], S[t-2], S[t-1], oddS, evenS, ctAfab,
           ctBfab, ctCfab, ctDfab);
    /* We will now compare values for correctness
    Range of valid data [block0 -> block18]:
    ctAfab[25]->ctAfab[43]
    ctBfab[23]->ctBfab[41]
    ctCfab[25]->ctCfab[43]
    ctDfab[23]->ctDfab[41]
    */
    for (i = 0; i < 19; i++) {
        if (ctA[i] == ctAfab[2+i] && ctB[i] == ctBfab[i] && ctC[i] == ctCfab[2+i]
            && ctD[i] == ctDfab[i]) {
            pass = pass + 1;
        }
    }
    if (pass == 19) { asm volatile ("mov r8, 0x10"); }
    else { asm volatile ("mov r8, 0x20"); }
    return 0;
}
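The RC6 listing above contains only the encryption direction and relies on the hardware call for its comparison. A minimal software-only sketch under stated assumptions follows: it uses `uint32_t` instead of WORD, passes the key as four 32-bit words, and adds a decryption routine that is not part of the original listing so that a round trip can be checked on a host machine. All `rc6_32_` names and the helpers `rol` and `ror` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { RC6_R = 20, RC6_C = 4, RC6_T = 44 };  /* RC6-32/20/16: 2r + 4 subkeys */

static uint32_t rol(uint32_t x, uint32_t y)
{
    y &= 31u;
    return (x << y) | (x >> ((32u - y) & 31u));
}

static uint32_t ror(uint32_t x, uint32_t y)
{
    y &= 31u;
    return (x >> y) | (x << ((32u - y) & 31u));
}

/* Key schedule: same mixing loop as the listing, 3 * max(c, 2r + 4) steps. */
static void rc6_32_setup(const uint32_t key_words[RC6_C], uint32_t S[RC6_T])
{
    uint32_t L[RC6_C], A = 0, B = 0;
    int i, j, s;

    for (i = 0; i < RC6_C; i++)
        L[i] = key_words[i];
    S[0] = 0xb7e15163u;
    for (i = 1; i < RC6_T; i++)
        S[i] = S[i - 1] + 0x9e3779b9u;
    for (i = j = 0, s = 1; s <= 3 * RC6_T; s++) {
        A = S[i] = rol(S[i] + A + B, 3u);
        B = L[j] = rol(L[j] + A + B, A + B);
        i = (i + 1) % RC6_T;
        j = (j + 1) % RC6_C;
    }
}

static void rc6_32_encrypt(const uint32_t S[RC6_T], const uint32_t pt[4], uint32_t ct[4])
{
    uint32_t A = pt[0], B = pt[1] + S[0], C = pt[2], D = pt[3] + S[1];
    uint32_t t, u, tmp;
    int i;

    for (i = 1; i <= RC6_R; i++) {
        t = rol(B * (2u * B + 1u), 5u);
        u = rol(D * (2u * D + 1u), 5u);
        A = rol(A ^ t, u) + S[2 * i];
        C = rol(C ^ u, t) + S[2 * i + 1];
        tmp = A; A = B; B = C; C = D; D = tmp;   /* (A,B,C,D) = (B,C,D,A) */
    }
    ct[0] = A + S[2 * RC6_R + 2];
    ct[1] = B;
    ct[2] = C + S[2 * RC6_R + 3];
    ct[3] = D;
}

/* Inverse of rc6_32_encrypt; added here only to allow a round-trip check. */
static void rc6_32_decrypt(const uint32_t S[RC6_T], const uint32_t ct[4], uint32_t pt[4])
{
    uint32_t A = ct[0] - S[2 * RC6_R + 2], B = ct[1];
    uint32_t C = ct[2] - S[2 * RC6_R + 3], D = ct[3];
    uint32_t t, u, tmp;
    int i;

    for (i = RC6_R; i >= 1; i--) {
        tmp = D; D = C; C = B; B = A; A = tmp;   /* (A,B,C,D) = (D,A,B,C) */
        u = rol(D * (2u * D + 1u), 5u);
        t = rol(B * (2u * B + 1u), 5u);
        C = ror(C - S[2 * i + 1], t) ^ u;
        A = ror(A - S[2 * i], u) ^ t;
    }
    pt[0] = A;
    pt[1] = B - S[0];
    pt[2] = C;
    pt[3] = D - S[1];
}
```

Encrypting a block with `rc6_32_encrypt` and feeding the result to `rc6_32_decrypt` with the same table `S` should return the original four words, which gives a quick host-side sanity check before comparing against the fabric output.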



