Flexible and resource efficient design for hardware implementation of the advanced encryption standard by Wang, Cheng



Flexible and Resource Efficient Design for
Hardware Implementation of the Advanced
Encryption Standard
by
© Cheng Wang, M.Sc., B.Eng.
A thesis submitted to the
School of Graduate Studies
in partial fulfilment of the
requirements for the degree of
Doctor of Philosophy
Faculty of Engineering and Applied Science
Memorial University of Newfoundland
March 2012
St. John's Newfoundland Canada
Abstract 
In this dissertation, we investigate the performance of a broad range of hardware
implementations of symmetric key block ciphers. The major focus of this dissertation is
dedicated to the investigation of the performance of hardware implementations of the
Advanced Encryption Standard (AES) as influenced by the implementation architecture.
The flexibility of the AES algorithm allows an extensive variety of implementation
architectures, making it necessary to investigate the performance characteristics of
these architectures. On the one hand, this investigation identifies the most
resource-efficient architecture for the AES implementation targeted at
resource-constrained applications; on the other hand, it discloses the unique
performance trade-offs of the different architectures, allowing flexible
implementation of AES.
In this dissertation, two perspectives of the implementation architecture of AES
are explored: 1) the pipeline configuration of the AES S-box with the composite field
structure and 2) the datapath architecture of AES. For the S-box pipeline
configuration, a gate-level approach for the pipeline of 2 to 7 stages and a
component-level approach for the pipeline of 2 to 4 stages are evaluated. For the
datapath architecture, parameterized architectures with the datapath width of 8, 16,
32, 64 or 128 bits and the unrolling factor of 1, 2, 5 or 10 for the 128-bit width are
evaluated, among which the 16, 32 and 64-bit architectures are novel designs.
Generic and typical models for these architectures are built. The performance
of these implementations in terms of timing, area, power and energy is benchmarked
based on Complementary Metal-Oxide Semiconductor (CMOS) technology following the
Application-Specific Integrated Circuit (ASIC) design flow. The benchmark results
demonstrate the potentially significant performance improvement obtained by selecting
an appropriate architecture under given throughput requirements compared with other
architectures.
For the investigation of S-box pipeline configurations, we examine the performance
of the S-box implementation for the 9 configurations over a wide range of
throughput requirements. Based on the results, we analyze the influence of the
pipeline configuration on the performance and identify the regular trends of the
resource costs varying with the pipeline configurations and the throughput
requirement/timing constraint. Numerically, there are maximum reductions of 51% in
area and 69% in power/energy with an appropriate pipeline configuration under a
throughput requirement/timing constraint compared with other configurations,
including no pipeline. For the investigation of datapath architecture, we examine the
performance of the datapath implementation for the 8 datapath architectures over a
wide range of throughput requirements. The quantitative performance of the
architecture implementations is presented, compared and analyzed. The most efficient
architecture implementation in area, peak power and energy, as well as in the overall
resource cost, is identified. The performance trade-offs over a range of architecture
parameter values are disclosed. In contrast to conventional belief, the most compact
architecture implementation with the 8-bit width does not help to minimize, but
actually increases, the energy consumption even when running at a low clock frequency.
As well, compact implementations do not result in the minimal peak power. In our
research, we also combine the most energy efficient S-box pipeline configuration and
datapath architecture identified from the above and demonstrate the further reduction
in energy consumption by the combination. It is found that the combination consumes
about 50% less energy than the most energy efficient datapath architecture
implementation alone and 80% less energy than the 8-bit width datapath architecture
implementation.
The last chapter of this dissertation discusses the design of a newly proposed
block cipher named PUFFIN2. PUFFIN2 is designated for lightweight applications
with modest security strength requirements. PUFFIN2 has an involutional structure
and can lead to a very compact implementation, which is smaller than the previous
most compact block cipher PRESENT.
Acknowledgements 
I would like to express my sincere appreciation and respect to my supervisor Dr.
Howard M. Heys. Being his student is really a beneficial and pleasurable experience.
His wisdom and knowledge guided me through the difficulties in my studies and
research. His kindness and tolerance added lots of flexibility and comfort to my life
during the studies and research. His rigorous scholarship and justice convinced me
to follow him for a lifetime.
I also need to direct my thanks to my supervisory committee members Dr.
Ramachandran Venkatesan and Dr. Lihong Zhang, who made efforts to support me,
review my thesis and give valuable suggestions. I would also like to thank Dr. Cheng
Li and Dr. Theodore S. Norvell for their help.
It is my great joy to know and get along with many good friends and lab mates
in the university, including Dr. Yuanlong Yu, Xiaobing Zhang, Jianiumg Wang, Amir
Zadeh, Ruoyu Su and Zelma Wang. Thanks for their company, support and
encouragement. Special gratitude to Dr. Yuanlong Yu for listening to me, understanding
me and encouraging me all along the days we worked towards the Ph.D. degree.
I am grateful to my landlords Betty and Steve, with whom I have lived since the first
week I came here. They accommodate me with a comfortable and carefree home so
that I can concentrate on my work at the university. They leave a light on every
night for my late return home, which makes me feel warm even on the coldest days.
Finally, I would like to mention that there is no way to fully appreciate my
parents for their love and support, and there is also no way to erase my regret at not
being able to accompany them over the years as their only child.
Contents 
Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Dissertation Outline
2 AES and the Hardware Implementations
2.1 The AES Algorithm
2.1.1 Round Function
2.1.2 Datapath
2.1.3 Key Schedule
2.1.4 Modes of Operation
2.2 Hardware Implementation of AES
2.2.1 S-Box Implementation
2.2.1.1 Hardware Structures of the AES S-Box
2.2.1.2 Performance of the AES S-Box Structures
2.2.1.3 Pipelined AES S-Box Implementations
2.2.1.4 Potential for S-box Pipelining
2.2.2 Datapath Implementation
2.2.2.1 High Throughput Implementation
2.2.2.2 Low Resource Cost Implementation
2.2.2.3 Potential for Different Architectures
2.2.3 Characterizing Datapath Implementation Architectures with Parameters
2.3 Summary
3 Methodology of Hardware Implementation and Performance Evaluation
3.1 Hardware Implementation
3.2 Performance Evaluation
3.2.1 Evaluation of Timing and Area
3.2.2 Power Consumption and Energy Consumption
3.2.2.1 Dynamic Power Consumption
3.2.2.2 Static Power Consumption
3.2.2.3 Energy Consumption
3.2.3 Evaluation of Power and Energy
3.3 Summary
4 Using Pipelined AES S-Boxes for Resource Efficient Purpose: An Example
4.1 Introduction
4.2 Architecture Design
4.2.1 The ShiftRows Component
4.2.2 S-Box
4.2.3 The MixColumns Component
4.2.4 Key Expansion Component
4.2.5 Overall Design
4.3 Implementation Results and Discussion
4.4 Summary
5 Exploration of S-Box Pipeline Configurations for Flexible and Efficient Implementation
5.1 Introduction
5.2 The S-box Structure for Pipelining
5.3 Applicability of the Pipelined AES S-Box
5.4 Pipelining the AES S-Box
5.4.1 Pipelining at the Component Level
5.4.2 Pipelining at the Gate Level
5.4.3 Comparing Placement Approaches
5.5 Methodology
5.5.1 Deriving the Candidate Implementations
5.5.2 Evaluation of the Performance
5.6 Experimental Results and Analysis
5.6.1 Performance versus Timing Constraints
5.6.2 Performance versus Pipeline Configurations for the Minimized Timing Constraint
5.6.3 Performance versus Pipeline Configurations for Given Timing Constraints
5.6.3.1 Area versus the Number of Pipeline Stages
5.6.3.2 Power/Energy versus the Numbers of Pipeline Stages
5.6.3.3 Area versus the Placement Approaches
5.6.3.4 Power/Energy versus the Placement Approaches
5.6.3.5 Energy-wise Costs
5.6.3.6 Trends in One Picture
5.6.4 Benefits of Using Pipelined S-Box Implementations
5.6.4.1 Benefits over Non-Pipelined Implementations
5.6.4.2 Benefits over Other Pipeline Configurations
5.6.4.3 Benefits of Providing More Performance Options/Trade-Offs
5.7 Generality of the Methodology and Results
5.8 Summary
6 Exploration of Datapath Architectures for Flexible and Efficient Implementation
6.1 Introduction
6.2 The Datapath Architectures of AES
6.2.1 Common Issues
6.2.1.1 S-box Structure
6.2.1.2 Key Expansion
6.2.1.3 Impact of Modes of Operation
6.2.2 Partial Datapath Architectures
6.2.2.1 S-boxes
6.2.2.2 ShiftRows Components
6.2.2.3 MixColumns Components
6.2.2.4 Novel 16-bit, 32-bit and 64-bit Datapath Architectures
6.2.3 Complete Datapath Architectures
6.3 Methodology
6.3.1 Deriving the Architecture Implementations
6.3.2 Evaluation of the Performance
6.4 Experimental Results and Analysis
6.4.1 Area
6.4.2 Peak Power Consumption
6.4.3 Average Energy Consumption
6.4.4 Overall Resource Cost
6.5 Summary
7 Demonstration of Combined Effects for Energy Efficiency
7.1 Introduction
7.2 The Methodology
7.3 Results and Analysis
7.4 Summary
8 Design of a Lightweight Block Cipher PUFFIN2
8.1 Introduction
8.2 Cipher Specification
8.2.1 Overall Structure
8.2.2 Basic Components
8.2.3 Encryption and Decryption Process
8.2.4 Key Schedule
8.3 Security Analysis
8.3.1 Differential and Linear Cryptanalysis
8.3.2 Related-Key Attacks
8.3.3 Weak Keys
8.3.4 Updated Cryptanalysis Results
8.4 Serialized Architecture for Hardware Implementation
8.5 Hardware Implementation Results
8.6 Summary
9 Conclusions
9.1 Summary of Research and Contributions
9.2 Suggestions for Future Work
Bibliography
A Description of the Operation of the ShiftRows Components
B Description of the Operation of the MixColumns Components
List of Tables 
4.1 Register states of the round operation datapath
4.2 Register states of the key expansion component
4.3 Implementation results
4.4 Normalized performance comparison of the architecture using a single S-box with different number of stages
6.1 Comparison of 32-bit AES datapath architectures
6.2 Assignments of the timing constraints (in ns) for the architectures according to the given throughputs
6.3 Areas of the architecture implementations (normalized to 978 GE)
6.4 Ratios of the area to the maximum throughput of the architectures (normalized to the value of U10)
6.5 Peak powers of the architecture implementations (normalized to 66.8 µW)
6.6 Average energy for the encryption of 128-bit plaintext of the architecture implementations (normalized to 0.73 nJ)
6.7 Average energy incurred due to static power for the encryption of 128-bit plaintext block of the architecture implementations (normalized to 1.1 pJ)
6.8 Average energy incurred due to dynamic power of the registers for the encryption of 128-bit plaintext of the architecture implementations (normalized to 34.2 pJ)
6.9 Overall resource cost of the architecture implementations (normalized to the value of W32 under 2.13 Gbps)
7.1 Assignments of the timing constraints (in ns) for the architectures under comparison according to the given throughputs
7.2 Areas of the combined architecture and selected datapath architecture implementations (normalized to 978 GE)
7.3 Peak powers of the combined architecture and selected datapath architecture implementations (normalized to 66.8 µW)
7.4 Average energy for the encryption of 128-bit plaintext of the combined architecture and selected datapath architecture implementations (normalized to 0.35 nJ)
8.1 S-box mapping of PUFFIN2 (in hexadecimal)
8.2 64-bit Permutation of PUFFIN2
8.3 Description of the components of the key schedule
8.4 Round distribution of PL64, PR64, L64 and R64
8.5 Implementation results of PUFFIN2 and serialized PRESENT
8.6 Count of hardware components of PUFFIN2 and serialized PRESENT
A.1 Contents of the registers of the 8-bit width ShiftRows component at the selected clock cycles
A.2 Contents of the registers of the 16-bit width ShiftRows component at the selected clock cycles
A.3 Contents of the registers of the 32-bit width ShiftRows component at the selected clock cycles
A.4 Contents of the registers of the 64-bit width ShiftRows component at the selected clock cycles
B.1 Contents of the registers of the 8-bit width MixColumns component for the clock cycles during a complete operation (m = n + 1)
B.2 Contents of the registers of the 16-bit width MixColumns component for the clock cycles during a complete operation (m = n + 1)
B.3 Contents of the registers of the 32-bit width MixColumns component for the clock cycles during a complete operation
List of Figures 
2.1 Layout of 128-bit plaintext/State block, key and state
2.2 The ShiftRows operation of AES
3.1 General digital ASIC design flow [53]
3.2 Charging current of the load capacitance of a CMOS inverter [55]
3.3 Short circuit current between supply power and ground of a CMOS inverter [55]
3.4 Static leakage currents of a CMOS inverter [55]
3.5 Power consumption and energy consumption
3.6 Power evaluation flow using PrimePower
4.1 Block diagram of the proposed AES encryption core architecture
4.2 Architecture of the AES encryption core with a 4-stage pipelined S-box
5.1 The typical datapath architectures of the AES: (a) loop-unrolled architecture, (b) round-iterative architecture, (c) fully serialized architecture
5.2 Component level pipelined S-box architectures: (a) 1-stage (no pipeline), (b) 2-stage, (c) 3-stage, (d) 4-stage
5.3 Derivation of the candidate implementations from the source HDL description of the S-Box
5.4 Illustration of the gate level approach of pipelining into 3 stages by register retiming
5.5 Normalized performance versus target timing constraints (ns), grouped according to the pipeline configuration
5.6 Normalized performance versus pipeline configurations under the synthesis constraint of minimized critical path delay
5.7 Normalized performance versus pipeline configurations under given timing requirements (from 0.35 ns/2.86 GHz to 1.50 ns/667 MHz)
5.8 Normalized performance versus pipeline configurations under given timing requirements (from 1.75 ns/571 MHz to 3.0 ns/333 MHz)
5.9 Normalized performance versus pipeline configurations under given timing requirements (4.0 ns/250 MHz and 8.0 ns/125 MHz)
5.10 Trends of optimal pipeline configurations for the throughput requirements from high to low
6.1 Generic model of the partial datapath architectures with width w ∈ {8, 16, 32, 64}
6.2 Structure of the ShiftRows component for the partial datapath architecture with the width of 8 bits
6.3 Structure of the ShiftRows component for the partial datapath architectures with the width of 16 bits
6.4 Structure of the ShiftRows component for the partial datapath architectures with the width of 32 bits
6.5 Structure of the ShiftRows component for the partial datapath architectures with the width of 64 bits
6.6 Structure of the MixColumns component for the partial datapath architectures with the width of 8 bits
6.7 Structure of the MixColumns component for the partial datapath architectures with the width of 16 bits
6.8 Structure of the MixColumns component for the partial datapath architectures with the width of 32 bits
6.9 Generic model of the complete datapath architectures with the unrolling factor r ∈ {1, 2, 5, 10}
6.10 Structure for unrolled architectures of the round function
6.11 Structure for unrolled architectures of the last round function for r ∈ {1, 2, 5}
6.12 Structure for unrolled architectures of the last round function for r = 10
8.1 Block diagram of the encryption (top) and decryption (bottom) processes
8.2 Block diagram of the key schedule
8.3 Serialized architecture of PUFFIN2
8.4 Contents of the 144-bit register at clock cycles 6, 37, 45, 53 and 57
Chapter 1 
Introduction 
In 2002, the Advanced Encryption Standard (AES) was established as the symmetric
key encryption algorithm that is officially endorsed by the United States government
for the next decades [1]. Since then, this algorithm has been widely adopted
by applications requiring security features. This has led to intensive research and
development of the efficient hardware implementation of AES. Although AES was
introduced around a decade ago and the most recent effort on the cryptanalysis has
been able to reduce the computational complexity for a key recovery by a factor of
1/4 compared with a brute force attack [2], its security is still regarded to be
sufficient for future needs.
Since the information security provided by AES is often realized through the
encryption and/or decryption of data transmitted over communication channels, a basic
requirement for a hardware implementation of AES is the throughput. Upon meeting
this basic requirement, the resource cost of the implementation is expected to be as
low as possible. The resource cost of an implementation includes the manufacture
cost and the running cost. The manufacture cost is related to the complexity of the
circuit, which is usually measured by the area of the circuit. The running cost
usually denotes the power consumption and/or energy consumption when running the
implementation. Power consumption and energy consumption of the implementation
are significant metrics for passively powered devices (e.g., contactless smart cards
and RFID tags) and battery powered devices since the battery life and the potential
workload are determined by these consumptions.
1.1 Motivation 
Due to the round-iterative nature of the AES algorithm, as well as the highly
regular and repeating operations defined in the datapath, the design of the datapath
architecture for the hardware implementation of the algorithm is extremely flexible
and, as a result, there exist a number of possible datapath architectures. However,
in the previous literature, only a small part of them have been discussed and
investigated. In addition, the AES S-box, as the most complex component in an AES
implementation, suffers from a long critical path and is conventionally pipelined only
for the benefit of speedup. However, there exist more pipeline architectures than
those adopted in previous works and also unexplored potential benefit for resource
efficiency when using pipelining.
Since each of these architectures (including both the datapath architectures and
the S-box pipeline architectures) could lead to an implementation with unique
performance characteristics and performance trade-offs, it is necessary to conduct an
investigation of them for a comprehensive understanding of the performance variation
under the different architectures, especially with the focus on the resource-related
performance. On one hand, this would identify the most resource-efficient architecture
for the AES implementation targeted at resource-constrained applications. On
the other hand, the performance trade-offs from these architectures would provide
more options for the flexible implementation of AES.
The design of the block cipher PUFFIN2 is motivated by the demand for block
ciphers targeted at lightweight applications. Since AES has relatively high
implementation complexity as is determined by the algorithm, there is demand for an
alternative block cipher algorithm that leads to lower implementation complexity
suitable for low price and low power applications. For these applications, the block
cipher is not required to be as strong as AES in terms of security level (i.e. the effort
required to crack the cipher) but the implementation complexity is expected to be as
low as possible.
1.2 Dissertation Outline
The dissertation is outlined as follows. Chapter 2 is the review of the AES
algorithm and the previous works on the hardware implementation of AES. Through the
analysis of the previous work, the necessity of the work presented in this dissertation
is shown. In Chapter 3, the methodology of the hardware implementation and the
performance evaluation adopted in this dissertation is introduced. Chapter 4 is a
preliminary study of using pipelined AES S-boxes for resource efficient purpose. In
this work, we replace the two S-boxes in an ultra compact AES implementation with
one pipelined S-box and are therefore able to double the throughput while the area
of the implementation remains very similar. Inspired by the work in Chapter 4, we
conduct a comprehensive study of the performance of the pipelined AES S-box with
various pipeline configurations in Chapter 5. This work shows a more complete
picture of the value in using pipelined S-boxes over non-pipelined ones in terms of
resource efficiency and design flexibility. Based on the results in this chapter, we also
analyze the generalizable trend of the variation of the appropriate pipeline
configuration over the different throughput requirements. In Chapter 6, we investigate
the resource efficiency and performance trade-offs of the datapath implementation of
AES with various datapath architectures and throughput requirements. In contrast to
conventional belief, this investigation discloses that the most power and energy
efficient datapath architectures under a given throughput requirement are not achieved
by the most compact architecture and there is significant reduction in power and
energy by using other appropriate datapath architectures. In Chapter 7, we
demonstrate the combined effectiveness in improving resource efficiency of the results
from Chapters 5 and 6 by examining the performance of the implementation with the
combination of an S-box pipeline configuration and a datapath architecture. In
Chapter 8, we present the design of PUFFIN2 with the relevant background of lightweight
block cipher design. Chapter 9 is the summary of the research and contributions in this
dissertation and the suggestions for future work.
Chapter 2 
AES and the Hardware 
Implementations 
In this chapter, the algorithm of AES is introduced. Previous work on high
performance and resource efficient AES hardware implementation is reviewed. The
necessity of the research presented in this dissertation is shown based on the review.
2.1 The AES Algorithm 
AES is a block cipher algorithm with a block size of 128 bits. The key size of AES
can be independently specified to 128, 192 or 256 bits, and accordingly there are 10,
12 or 14 iterative rounds to be performed for the encryption or decryption of a block
of the plaintext or ciphertext [1]. In the following, a 128-bit block size and 128-bit
key size are used for demonstration. The 128-bit plaintext block, key, round keys and
the intermediate results (called the State) between operations in AES are organized
in a rectangular array of bytes, as shown in Figure 2.1. Such an arrangement of bytes
facilitates some of the operations of AES which work on rows or columns of an array.
Byte 0,0   Byte 0,1   Byte 0,2   Byte 0,3
Byte 1,0   Byte 1,1   Byte 1,2   Byte 1,3
Byte 2,0   Byte 2,1   Byte 2,2   Byte 2,3
Byte 3,0   Byte 3,1   Byte 3,2   Byte 3,3
Figure 2.1: Layout of 128-bit plaintext/State block, key and state
2.1.1 Round Function
A round function of AES can be described in pseudo code notation as
Round(State, Round Key)
{
    SubBytes(State);
    ShiftRows(State);
    MixColumns(State);
    AddRoundKey(State, Round Key);
}
All round functions are the same in AES except the final round, which is slightly
different and expressed as
FinalRound(State, Round Key)
{
    SubBytes(State);
    ShiftRows(State);
    AddRoundKey(State, Round Key);
}
In the notation above, State denotes the intermediate result produced by the preceding
operation. SubBytes, ShiftRows, MixColumns and AddRoundKey are operations
in round functions and they are defined in the following.
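The SubBytes operation works as the substitution layer of AES. The 8-bit
non-linear transformation of SubBytes is referred to as the S-box. The operation
performs the non-linear transformation of each byte in the State according to the
S-box mapping of AES. The S-box mapping is the composition of two transformations:
(1) the multiplicative inverse over $GF(2^8)$ with the irreducible polynomial
$m(x) = x^8 + x^4 + x^3 + x + 1$ (with the exception of mapping zero to zero) and (2) an
affine transformation defined by
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \\ x_7 \end{bmatrix}
\oplus
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
\tag{2.1}
$$
where $x_n$ and $y_n$ are the (n + 1)-th bit of the input byte and output byte, respectively,
of the affine transformation, and the constant vector is the byte 63 (hexadecimal)
specified in [1].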
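As an illustration only, the two-step S-box mapping described above can be sketched in Python. The function names (`gf_mul`, `gf_inv`, `affine`, `sbox`) are hypothetical and chosen for this sketch; the inverse is computed by plain exponentiation rather than by any of the hardware structures discussed later, and the constant 0x63 added in the affine step is the one given in the AES standard [1].

```python
# Minimal sketch of the AES S-box mapping; names are illustrative assumptions.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # reduce as soon as the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), with zero mapped to zero."""
    if a == 0:
        return 0
    # a^254 equals a^(-1) because the multiplicative group has order 255.
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine(x: int) -> int:
    """Affine transformation of Equation (2.1); bit 0 is the least significant bit."""
    y = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        y |= bit << i
    return y


def sbox(x: int) -> int:
    """Compose the field inversion with the affine transformation."""
    return affine(gf_inv(x))
```

For example, `sbox(0x53)` evaluates to `0xED`, matching the worked example in the AES standard.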
The ShiftRows operation cyclically shifts left the bytes in the rows of the State
with different offsets as illustrated in Figure 2.2.
State
Byte 0,0   Byte 0,1   Byte 0,2   Byte 0,3
Byte 1,0   Byte 1,1   Byte 1,2   Byte 1,3
Byte 2,0   Byte 2,1   Byte 2,2   Byte 2,3
Byte 3,0   Byte 3,1   Byte 3,2   Byte 3,3
State'
Byte 0,0   Byte 0,1   Byte 0,2   Byte 0,3
Byte 1,1   Byte 1,2   Byte 1,3   Byte 1,0
Byte 2,2   Byte 2,3   Byte 2,0   Byte 2,1
Byte 3,3   Byte 3,0   Byte 3,1   Byte 3,2
Figure 2.2: The ShiftRows operation of AES
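The row rotation shown in Figure 2.2 can be sketched as follows. This is a minimal illustration that assumes the State is held as a list of four rows; the name `shift_rows` and the use of (row, column) tuples as stand-ins for bytes are choices made for this sketch only.

```python
def shift_rows(state):
    """Cyclically shift row r of a 4x4 State left by r byte positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Label each byte by its (row, column) position, as in Figure 2.2.
state = [[(r, c) for c in range(4)] for r in range(4)]
shifted = shift_rows(state)
```

Row 0 is left unchanged, while row 1 of `shifted` starts with the byte originally at position (1, 1), as in the figure.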
In the MixColumns operation, the columns of the State are considered as
polynomials with coefficients in $GF(2^8)$ and multiplied modulo $m(x) = x^4 + 1$ with a
fixed polynomial $c(x) = 03x^3 + 01x^2 + 01x + 02$. Given the (n + 1)-th column,
$n \in \{0, 1, 2, 3\}$, of the State $B_{0,n}$, $B_{1,n}$, $B_{2,n}$ and $B_{3,n}$, the (n + 1)-th column of the
State after MixColumns can be realized by the matrix multiplication as
$$
\begin{bmatrix} B'_{0,n} \\ B'_{1,n} \\ B'_{2,n} \\ B'_{3,n} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} B_{0,n} \\ B_{1,n} \\ B_{2,n} \\ B_{3,n} \end{bmatrix}
\tag{2.2}
$$
The ShiftRows operation and MixColumns operation can be viewed as a linear
transformation on the State.
The AddRoundKey operation performs the bitwise XOR of the State and the
round key. Each byte of the State is XORed with the byte of the round key with the
same position in the array.
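As a hedged sketch of Equation (2.2), the multiplication of one State column by the fixed matrix can be written with only shifts and XORs, since the matrix entries are limited to 01, 02 and 03. The helper names (`xtime`, `mix_column`, `add_round_key`) are assumptions of this sketch and are not taken from the dissertation.

```python
def xtime(a: int) -> int:
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def mix_column(col):
    """Apply the matrix of Equation (2.2) to one 4-byte column of the State."""
    b0, b1, b2, b3 = col

    def mul3(a):
        return xtime(a) ^ a     # 03 * a = (02 * a) XOR a

    return [
        xtime(b0) ^ mul3(b1) ^ b2 ^ b3,
        b0 ^ xtime(b1) ^ mul3(b2) ^ b3,
        b0 ^ b1 ^ xtime(b2) ^ mul3(b3),
        mul3(b0) ^ b1 ^ b2 ^ xtime(b3),
    ]


def add_round_key(state_bytes, key_bytes):
    """Bitwise XOR of the State and the round key, byte by byte."""
    return [s ^ k for s, k in zip(state_bytes, key_bytes)]
```

A commonly used check column is `[0xDB, 0x13, 0x53, 0x45]`, for which `mix_column` returns `[0x8E, 0x4D, 0xA1, 0xBC]`.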
2.1.2 Datapath 
An AES cipher with 128-bit block size and 128-bit key size has ten rounds in the
datapath. The datapath for encryption is described in pseudo code notation as
Datapath(Plaintext, Round Key)
{
    AddRoundKey(Plaintext, Round Key[1]);
    for (i = 1; i < 10; i++)
        Round(State, Round Key[i + 1]);
    FinalRound(State, Round Key[11]);
}
Note that 11 round keys are required: one prior to the first round and one for each
round. The datapath for decryption is achieved by reversing the encryption datapath,
with each operation replaced by its inverse.
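To make the sequencing of the encryption datapath concrete, the short sketch below only lists the operations in execution order instead of computing them. The function name `datapath_trace` and the string labels are hypothetical and serve purely as an illustration of the pseudo code above.

```python
def datapath_trace(num_rounds: int = 10):
    """List the operations of the encryption datapath in execution order."""
    trace = ["AddRoundKey[1]"]                      # initial key addition
    for i in range(1, num_rounds):                  # rounds 1 .. num_rounds - 1
        trace += ["SubBytes", "ShiftRows", "MixColumns", f"AddRoundKey[{i + 1}]"]
    # The final round omits MixColumns.
    trace += ["SubBytes", "ShiftRows", f"AddRoundKey[{num_rounds + 1}]"]
    return trace


trace = datapath_trace()
```

Counting the entries of `trace` confirms the observation in the text: 11 round keys are consumed, while MixColumns is applied only 9 times.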
2.1.3 Key Schedule
The key schedule of AES consists of two components: (1) the key expansion, where
the 128-bit key is expanded into 44 32-bit vectors; (2) the round key selection, where
the 44 32-bit vectors are segmented into 11 128-bit vectors, each of which is a round
key. The key expansion is described as follows
KeyExpansion(byte Key[16], word W[44])
{
    for (i = 0; i < 4; i++)
        W[i] = (Key[4 * i], Key[4 * i + 1], Key[4 * i + 2], Key[4 * i + 3]);
    for (i = 4; i < 44; i++)
    {
        temp = W[i - 1];
        if (i % 4 == 0)
            temp = SubBytes(RotBytes(temp)) ⊕ Rcon[i / 4];
        W[i] = W[i - 4] ⊕ temp;
    }
}
In the above description, RotBytes denotes a cyclic permutation operation that rotates
left the bytes in the word by one byte (e.g. a word (a, b, c, d) is shifted to (b, c, d,
a)) and ⊕ is the bitwise XOR operation; Rcon[i / 4] is a pre-defined constant. The round
key selection is described as
SubKeySel(word W[44], 4-word Round Key[11])
{
    for (i = 1; i <= 11; i++)
        Round Key[i] = (W[4 * (i - 1)], W[4 * (i - 1) + 1], W[4 * (i - 1) + 2], W[4 * (i - 1) + 3]);
}
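The key expansion and round key selection can be exercised with the following self-contained sketch. It assumes words are held as lists of four bytes, builds the S-box table by brute-force inversion (adequate for illustration, unrelated to any hardware structure), and numbers the round keys 1 to 11 with round key i taken from words 4(i - 1) to 4(i - 1) + 3. All function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def sbox(x: int) -> int:
    """S-box: field inversion (zero maps to zero) followed by the affine step."""
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    y = inv
    for shift in (1, 2, 3, 4):
        y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF   # 8-bit left rotation
    return y ^ 0x63


SBOX = [sbox(x) for x in range(256)]


def key_expansion(key: bytes):
    """Expand a 16-byte key into 44 words, each a list of 4 bytes."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]          # RotBytes
            temp = [SBOX[b] for b in temp]      # SubBytes on each byte
            temp[0] ^= rcon                     # XOR with Rcon[i / 4]
            rcon = gf_mul(rcon, 0x02)
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return w


def round_key_selection(w):
    """Group the 44 words into round keys numbered 1 to 11 (16 bytes each)."""
    return {i: sum(w[4 * (i - 1):4 * i], []) for i in range(1, 12)}


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
w = key_expansion(key)
round_keys = round_key_selection(w)
```

With the example key from the AES standard, word `w[4]` is `a0fafe17` and the last word `w[43]` is `b6630ca6`; round key 1 is the cipher key itself.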
2.1.4 Modes of Operation
As a block cipher, AES needs to work under a certain block cipher mode of
operation when used in applications. Basic block cipher modes of operation include
Electronic Codebook (ECB) mode, Cipher Block Chaining (CBC) mode, Cipher Feedback
(CFB) mode, Output Feedback (OFB) mode and Counter (CTR) mode [3]. Block
cipher modes of operation can be categorized as either feedback modes (e.g., CBC,
CFB, and OFB) or non-feedback modes (e.g., ECB and CTR) depending on whether
each encryption/decryption using the block cipher depends on the output of the
previous encryption/decryption. In this dissertation, some of the implementations are
supposed to work under non-feedback modes only.
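The distinction between feedback and non-feedback modes can be illustrated with a small sketch. The block function `toy_block_encrypt` below is a deliberately trivial stand-in (it is not AES and offers no security); it is used only so that the data dependencies of ECB, CBC and CTR become visible. All names and parameters are assumptions made for this illustration.

```python
BLOCK = 16  # block size in bytes


def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Stand-in for a block cipher (NOT AES, NOT secure): XOR with key, rotate bytes."""
    mixed = bytes(b ^ k for b, k in zip(block, key))
    return mixed[1:] + mixed[:1]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ecb_encrypt(blocks, key):
    # Non-feedback: every block is processed independently of the others.
    return [toy_block_encrypt(p, key) for p in blocks]


def cbc_encrypt(blocks, key, iv):
    # Feedback: block i cannot start before ciphertext block i - 1 is available.
    out, prev = [], iv
    for p in blocks:
        prev = toy_block_encrypt(xor(p, prev), key)
        out.append(prev)
    return out


def ctr_encrypt(blocks, key, nonce):
    # Non-feedback: the cipher input is a counter known in advance for every block.
    out = []
    for i, p in enumerate(blocks):
        counter = (nonce + i).to_bytes(BLOCK, "big")
        out.append(xor(p, toy_block_encrypt(counter, key)))
    return out


key = bytes(range(16))
blocks = [b"A" * BLOCK, b"A" * BLOCK]
ecb = ecb_encrypt(blocks, key)
cbc = cbc_encrypt(blocks, key, iv=bytes(BLOCK))
ctr = ctr_encrypt(blocks, key, nonce=0)
```

In the CTR case, the second block can be produced on its own from its counter value, whereas the CBC loop has to wait for the previous ciphertext block. This independence is what allows several blocks to be in flight at once in a non-feedback mode.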
2.2 Hardware Implementation of AES 
Since AES was proposed in 2001, there have been intensive investigations into the
efficient hardware implementation of AES. Generally, these investigations focus on
two areas: 1) the implementation of the AES S-box with low resource cost and 2) the
implementation of the entire cipher or the datapath of AES for a variety of design
requirements from high throughput to low resource cost. The following is the review
of the S-box and datapath implementations.
2.2.1 S-Box Implementation 
In the hardware implementation of an AES cipher, the S-box typically has a
complexity significantly higher than other functions and therefore has a major impact
on the performance of the overall AES implementation in terms of timing, area, power
and energy. For this reason, various hardware designs of the S-box have been proposed
aiming at the improvement of the efficiency in these perspectives of the performance.
2.2.1.1 Hardware Structures of the AES S-Box
The most straightforward realization of an AES S-box is a lookup table. A table
lookup in hardware can be constructed with either a ROM or a combinational circuit.
The ROM approach requires 256 bytes of space to store the lookup table, which
is costly in hardware area, particularly when multiple copies of the S-box need to
be implemented in an AES implementation. The combinational circuit approach
usually relies on a synthesis tool to translate the lookup table into the circuit and,
due to the high nonlinearity of the lookup table, the synthesis tool has to interpret
the transformation as a random mapping, resulting in a circuit with large area being
generated.
To achieve better area efficiency, some advanced approaches have been proposed by
exploiting the mathematical properties of the S-box mapping. Since the most costly
operation in the S-box is the multiplicative inverse over GF(2^8), Rijmen suggested in [4]
to decompose the finite field GF(2^8) to its sub-field GF(2^4) so that an element a in
GF(2^8) can be represented by a polynomial a = bx + c with b, c in GF(2^4), and then
the computation of the multiplicative inverse over GF(2^8) is converted to operations
over GF(2^4), which have much lower complexity in hardware. An ASIC implementation
of this approach is shown in [5]. This approach is further developed in [6] by
decomposing the elements in GF(2^4) to polynomials with coefficients in GF(2^2),
in such a way that the multiplicative inverse in GF(2^2) can be implemented in hardware
with only a swap of the two input bits. With the same decomposing approach, better
area efficiency is achieved in [7] by using elaborately selected normal bases for
element representation and other optimization techniques. These approaches are all
based on the decomposition of the operations in GF(2^8), so they are generally known
as S-boxes with composite field structures. Other S-box structures based on the
composite field structures have also been proposed, which further improve the
performance in timing, area and/or power through extensive exploration of the possible
construction configurations using polynomial basis (e.g., [8]), normal basis (e.g., [9],
[10], [11] and [12]) or mixed basis (e.g., [13]).
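To make the cost of the inversion concrete, the sketch below computes the S-box directly as a multiplicative inverse in GF(2^8) followed by the standard AES affine transformation. It is a reference model only: it works in GF(2^8) with the AES polynomial and does not perform the composite field decomposition of [4]-[7]. The helper names are illustrative.

```python
def gf256_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def gf256_inv(a):
    """Multiplicative inverse computed as a^254; 0 is mapped to 0."""
    result, base, e = 1, a, 254
    while e:
        if e & 1:
            result = gf256_mul(result, base)
        base = gf256_mul(base, base)
        e >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x):
    """AES S-box: field inversion followed by the affine transformation."""
    b = gf256_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


table = [sbox(x) for x in range(256)]  # the 256-byte lookup table
```

The `table` list corresponds to the ROM contents of the lookup table approach, while `gf256_inv` is the part that composite field structures replace with arithmetic in smaller sub-fields.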
The S-box implementation with low power consumption can be achieved by reducing
the switching activity of the circuit. A typical switching reduction design
is presented in [14]. It employs a decoder-permutation-encoder structure where the
decoder converts the binary representation of the 8-bit input to the one-hot
representation of 256 bits and the encoder does the inverse. In this way, any transition
of the S-box input can only cause the transitions of the signals on two lines in the
permutation (one from 1 to 0 and the other from 0 to 1), so the switching activity is
minimized in the permutation. Further, the decoder and encoder are also built with
power efficiency through the extensive search of all the possible structures. A low
power structure targeted at ASIC implementation of the composite field structure of
the AES S-box is shown in [15]. In this work, a multi-stage two-level logic structure
with buffers is used to balance the signal arrival times of gates so that the power
consumption caused by dynamic hazards in the circuit is reduced. Other low power
designs of the composite field structure of the AES S-box include [16] and [17]. The
implementation in [16] is a full-custom circuit design using pass transistor logic, which
usually has better power efficiency compared with conventional CMOS logic. The
work in [17] has a 7-stage pipelined S-box with the composite field structure based on
Algebraic Normal Form (ANF) representation and it achieves low power consumption
by reducing the hazards in the circuit through pipelining.
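The two-line property of the one-hot representation can be checked with a few lines of Python. The sketch counts how many signal lines change between two input values for a one-hot code and for a plain binary code; the function names are illustrative.

```python
def one_hot(x):
    """256-bit one-hot code of an 8-bit value: only bit x is set."""
    return 1 << x


def binary(x):
    """Plain 8-bit binary code."""
    return x


def toggled_lines(a, b, encode):
    """Number of signal lines that change when the input goes from a to b."""
    return bin(encode(a) ^ encode(b)).count("1")


print(toggled_lines(0x00, 0xFF, binary))   # all 8 binary lines toggle
print(toggled_lines(0x00, 0xFF, one_hot))  # exactly 2 one-hot lines toggle
```

For any pair of distinct inputs the one-hot count is exactly two, whereas the binary count can be anywhere from one to eight.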
2.2.1.2 Performance of the AES S-Box Structures
The performance of the S-box structures mentioned above in terms of timing, area
and power is characterized based on a 0.25-μm standard cell CMOS process in [18].
According to this characterization, the composite field structures have a small range
of performance in all perspectives compared with the other structures under
examination. By varying the synthesis constraints, the implementations of the composite
field structures can be built with critical path delays in the range of 4.93 ns to 9 ns,
areas in the range of 303 GE to 625 GE (where 1 GE, or Gate Equivalent, is the size
of a two-input NAND with the lowest drive strength in the process library), and
normalized power consumption (with respect to 4.45 μW and measured at 50 MHz) in
the range of 1.51 to 2.0. The implementations of the combinational circuit structure
for a lookup table have performance in timing, area and power in the ranges of
1.95 ns to 6.61 ns, 1301 GE to 1545 GE, and 1.00 to 1.18, respectively. The performance
of the implementations of the decoder-permutation-encoder structure ranges from
1.86 ns to 3.31 ns, 1399 GE to 2016 GE, and 0.27 to 0.42, respectively. The performance
of the composite field structures in [8], [9], [10], [11], [12] and [13] is not included
in the above characterization but, according to the comparisons in the original papers,
their performance is very close to that of the composite field structures examined in
[18] when compared with the lookup table structure and the decoder-permutation-encoder
structure. For the low power designs of the S-box with the composite field structure
like [15], [16] and [17], according to the original papers, the power consumption of
[15] is about 1/4 of [6]; the power consumption of [16] is very close to [15]; and the
power consumption of [17] is about half of [15].
These characterization results from [18] reveal that the composite field structures
are superior in area efficiency but inferior in timing and power consumption. On the
contrary, the lookup table structure and the decoder-permutation-encoder structure
have performance with the reverse trade-offs. Thus, they are usually targeted at
applications with different design requirements.
2.2.1.3 Pipelined AES S-Box Implementations
The straightforward effect of pipelining is the reduction of the critical path delay at
the cost of the hardware overhead of pipeline registers. As mentioned above, the
composite field structures lead to implementations with low complexity and high critical
path delay compared with the lookup table structure and the decoder-permutation-encoder
structure. The low complexity implies fewer pipeline registers in pipelining and the
high critical path delay implies more potential for delay reduction by pipelining.
Therefore, the composite field structures are more suitable for pipelining than the
other structures.
Previous work using pipelined S-boxes with composite field structures is targeted
at high throughput implementations with the fully unrolled architecture. Typical
examples include ASIC implementations [19] [20] or FPGA implementations [21] [22].
In [20], a throughput of around 66 Gbps is achieved by the AES implementation
with 3-stage pipelined S-boxes using a 0.18-μm CMOS technology. In [22], the
implementation produces a throughput of around 31 Gbps with the S-boxes pipelined
into roughly 7 stages on a Xilinx Spartan-III FPGA. In [21] and [22], the relation
between the critical path delay and the number of cascaded Lookup Tables (LUTs)
is analyzed and, based on the analysis as well as the consideration of the availability
of routing resources, the minimum realistic numbers of LUT-levels of 3 and 2 are
determined for one pipeline stage and accordingly the 4-stage and 7-stage pipelines
are adopted to produce the maximum throughput, respectively. In [19] and [20], the
trade-off between throughput and area of the overall implementation is explored by
adjusting the number of pipeline stages. However, only 2 and 4 stages are considered
in [19] and 2 and 3 stages in [20]. The design in [17] is another FPGA-based design
applying pipelining. However, the focus of the work is the comparison between a
7-stage pipelined S-box based on the optimum construction of the composite field
proposed by the author, a 5-stage pipelined S-box based on a conventional
construction of the composite field and the S-box from [15]. The results show that the
7-stage pipelined S-box with the proposed construction of the composite field has
similar power consumption but much higher throughput compared with the other
designs.
Pipelining the AES S-box implementation involves deciding on the appropriate
placement of the pipeline registers. While this is carefully investigated towards the
realistic minimum critical path delay in [22], the FPGA-specific approach is not
applicable in a standard ASIC design flow. For the ASIC implementations in [19] and
[20], a similar placement approach is adopted that has the pipeline registers inserted
only between the components of the S-box (e.g., the multiplicative inverter and the
multiplier). Since this is a coarse-grain approach, the critical path delays are not
necessarily well balanced due to the unbalanced complexities of the components. As
well, the limited positions for register insertion of this component level approach
would prevent implementing more pipeline stages than the 4 stages in [19] and [20].
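The balance problem of coarse-grain register placement can be illustrated with a small model. The sketch below searches all legal register positions between a list of blocks and returns the smallest achievable worst-stage delay. The delay values are hypothetical and chosen only to show the effect; they are not measurements from any cited design.

```python
from itertools import combinations


def best_stage_delay(delays, stages):
    """Smallest worst-stage delay when registers may sit only between blocks."""
    n = len(delays)
    best = float("inf")
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0, *cuts, n)
        worst = max(sum(delays[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, worst)
    return best


# Hypothetical delays (arbitrary units) for three unequal components.
components = [1.0, 4.0, 1.0]
# The same total delay, but with cut points available after every 0.5 units.
fine_grain = [0.5] * 12

print(best_stage_delay(components, 3))  # limited by the largest component
print(best_stage_delay(fine_grain, 3))  # stages can be balanced evenly
```

With three components, a 3-stage pipeline is bounded by the slowest component, and no pipeline deeper than three stages is possible at all. Finer cut points allow both better balance and more stages.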
2.2.1.4 Potential for S-box Pipelining
Based on the review above, it can be seen that the AES S-box implementation for high
speed, low area and low power purposes has been extensively explored. However, the
pipelined S-box implementations are exclusively targeted at high throughput and only
the advantage in timing or some trade-offs between timing and area are considered.
In addition, the appropriate placement of the pipeline registers is not explored based
on ASIC technology. For these reasons, the research presented in Chapter 5 is the
investigation of the pipelined S-box implementations in terms of timing, area, power
and energy, and this is based on an extensive exploration of the placement of the
pipeline registers targeted at ASIC technology, including an automatic placement
approach for 2 to 7-stage pipelines and a manual placement approach for 2 to 4-stage
pipelines.
2.2.2 Datapath Implementation
As is shown in the algorithm of AES, AES has a round iterative structure where the round
functions are identical (with the exception of the last round) and, in each round
function, the SubBytes operations on each byte and the MixColumns operations on each
4-byte group are identical. This nature of AES allows for high flexibility in the
implementation architecture of AES, in that different amounts of parallelism of the
identical operations in the implementation lead to different implementation
architectures. As the general trade-off, the more parallelism in the implementation,
the more throughput is achieved but the higher complexity is incurred. By exploring
the trade-off, different design requirements of AES implementation can be fulfilled
with different implementation architectures. The previous work on the overall
implementation of AES usually focuses on the fulfillment of the design requirements of
either high throughput or low resource cost.
2.2.2.1 High Throughput Implementation
The implementation with high throughput is usually fulfilled with the fully unrolled
architecture with round-level pipelining, which provides the maximum level of
parallelism of the identical operations and consequently the maximum throughput.
Another factor that affects throughput is the critical path delay of the implementation.
As the most complex component in an AES implementation, the S-box along with
other components in a round can be pipelined for short critical path delay, which is
known as inner round pipelining. Pipelined S-boxes/rounds are used in the
implementation with the fully unrolled architecture for further improvement of the
throughput.
An example of the fully unrolled architecture with round-level pipelining is [23],
where a fully pipelined AES implementation achieves a throughput of 17.8 Gbps
based on a Xilinx Virtex-II FPGA. There are many more works investigating inner
round pipelining for maximum throughput, including the ASIC implementations of [19] and
[20] and the FPGA implementations of [21], [24] and [22]. In [20], a throughput of around
66 Gbps is achieved by the AES implementation with 3-stage pipelined S-boxes using
a 0.18-μm CMOS technology. In [24], the AES implementation has a 5-stage inner
round pipeline for each of the rounds and achieves 26.64 Gbps based on a Virtex-E
FPGA. In [22], the implementation produces a throughput of around 31 Gbps with
the S-boxes pipelined into roughly 7 stages on a Xilinx Spartan-III FPGA.
2.2.2.2 Low Resource Cost Implementation
In contrast to the high throughput implementation, typical low resource cost
implementations of AES exploit the reverse trade-off between throughput and complexity.
While the high throughput implementations of AES usually have a fully implemented
datapath of 10 rounds with 128-bit width, lower complexity of the implementation can
be achieved by reducing the number of rounds implemented and the datapath width
based on a loop structure. While the complexity is reduced, the throughput is
also compromised since the parallelism in the processing of data is reduced.
Depending on the design requirement for throughput and complexity, many AES
implementations have been proposed based on a loop structure with various datapath
widths. Among those, the most commonly seen are the implementations with a one-round
loop and datapath widths of 128 bits, 32 bits or 8 bits. A typical example of the
128-bit width implementation is shown in [25]. For lower complexity AES implementations,
there are 32-bit AES implementations that reuse the datapath 4 times for the operation
of one round, as seen in [26], [27], [28] and [29]. By breaking the 32-bit operation
of MixColumns into serialized operations on 8-bit data, some recent designs realize
the AES implementation with an 8-bit datapath width for even lower implementation
complexity. Designs with an 8-bit width datapath include [30], [31] and [32]. The 8-bit
AES implementations are also referred to as the implementations with a fully serialized
architecture.
Regarding the performance of these low resource cost implementations, there is
usually no uniform benchmark for the comparison between them, in that
they are usually implemented on different platforms (ASIC or FPGA) with
different technologies or devices, and the performance variation caused by the
difference in implementation technology is not negligible. It is also necessary to
point out that these implementations with the fully serialized architecture are
targeted at the design requirement of low resource cost, which means low area, power
and energy. However, while it is straightforward to see the achievement of low area
and that power consumption can be scaled down by decreasing the clock frequency, it is
not clear if these implementations lead to low energy cost. The detailed distinction
between the power consumption and the energy consumption of an implementation is
introduced in Section 3.2.2.
2.2.2.3 Potential for Different Architectures
According to the above review, most of the previous work focuses on a specific
architecture for a specific design requirement, usually either the high throughput
implementations with the fully unrolled datapath architecture or the low area
implementations with a loop-based 128-bit, 32-bit or 8-bit width datapath. However,
there has been insufficient investigation of the performance of other datapath
architectures as well as of the performance of the various datapath architectures based
on the same benchmark, especially in terms of power and energy. In Chapter 6, we
conduct the investigation of an extensive range of the datapath architectures for their
performance in timing, area, power and energy based on a 90-nm standard cell ASIC
technology. These architectures include parameterized architectures with the datapath
width of 8, 16, 32, 64 or 128 bits and the unrolling factor of 1, 2, 5 or 10 for
the 128-bit width. Through this investigation, the performance trade-offs between
timing, area, power and energy over the different architectures are shown and the
power and energy efficient architectures are identified.
2.2.3 Characterizing Datapath Implementation Architectures
with Parameters
Since AES-E128 performs 10 rounds on a plaintext block of 128 bits to complete the
encryption, the algorithm inherently processes data in a batch of 128 bits. For the
datapath architectures with the width of 8, 16, 32 and 64 bits, the round of AES-E128
is partially implemented in hardware and the reuse of the hardware for 16, 8, 4
and 2 times is required to complete one round, respectively. These architectures are
called partial datapath architectures throughout this dissertation. When referring to
a partial datapath architecture, the notation Wn is used, where n is the width of
the datapath architecture. For the 128-bit width architectures, since the complete
round function is implemented, they are called complete datapath architectures in this
dissertation. Depending on the number of rounds implemented, an unrolling factor of
1, 2, 5 or 10 is used to characterize the complete datapath architectures, while the
notation Un is used to refer to a complete datapath architecture with the unrolling
factor of n. When pipelined S-boxes are used in the datapath architectures, the
notation Gn or Pn is used to indicate that the S-boxes are pipelined into n stages with
gate level pipelining or component level pipelining, respectively (refer to Chapter 5
for details about gate level pipelining and component level pipelining).
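The parameter notation can be decoded mechanically, which also makes the hardware reuse counts explicit. The following sketch is illustrative only: the function `describe` and its dictionary keys are hypothetical, and the `loop_iterations` field assumes that a design with unrolling factor n loops 10/n times to cover the 10 rounds, which is an inference rather than a statement from the text.

```python
import re

BLOCK_BITS = 128  # AES block size
ROUNDS = 10       # rounds of AES with a 128-bit key


def describe(arch):
    """Decode a label such as 'W32', 'U10' or 'U10P3' into its parameters."""
    m = re.fullmatch(r"([WU])(\d+)(?:([GP])(\d+))?", arch)
    if m is None:
        raise ValueError(f"unrecognized architecture label: {arch}")
    kind, n, pipe, stages = m.group(1), int(m.group(2)), m.group(3), m.group(4)
    if kind == "W":  # partial datapath: n is the datapath width in bits
        info = {"width": n, "unroll": 1, "passes_per_round": BLOCK_BITS // n}
    else:            # complete datapath: n is the unrolling factor
        info = {"width": BLOCK_BITS, "unroll": n, "passes_per_round": 1}
    # Assumption: the unrolled rounds are looped until all 10 rounds are done.
    info["loop_iterations"] = ROUNDS // info["unroll"]
    if pipe:
        style = "gate" if pipe == "G" else "component"
        info["sbox_pipeline"] = (style, int(stages))
    return info


for label in ("W08", "W32", "U01", "U10P3"):
    print(label, describe(label))
```

For example, W08 needs 16 passes through the hardware per round, while U10P3 is a fully unrolled 128-bit datapath whose S-boxes use 3-stage component level pipelining.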
The architectures of previous work on AES implementation can usually be
characterized by the datapath parameters mentioned above. For example, [23] corresponds
to U10. [20] includes U10, U10P2 and U10P3. [25] is U01. [26], [27], [28] and [29] are
W32. [30], [31] and [32] are examples of W08. In this dissertation, it is assumed that
the AES datapath architectures with the same datapath width, unrolling factor and/or
pipelining stage numbers have similar performance, and the performance of the
datapath architectures built in this dissertation is used to represent the performance
of the architectures with the same parameters, including those proposed in previous
work.
It should be noted that the key expansion component and the controller for
the AES implementation are not included in the datapath architectures that are
considered in this dissertation. The reason is that a specific datapath architecture
would not vary significantly between different implementations, while the key
expansion component and the controller can be highly flexible and independent of
that datapath architecture. This is especially true for the key expansion component,
which can be either implemented in order to generate the keys on-the-fly
or not implemented so that the keys are generated offline and stored for use by
the datapath implementation.
2.3 Summary
In this chapter, the AES algorithm and the state-of-the-art hardware implementations
of AES are reviewed and analyzed. Through the review and analysis, the potential for
the increase of the implementation efficiency or the performance trade-offs is shown by
exploring the architecture configurations of the S-box and the datapath. The research
presented in Chapters 4 to 7 is the investigation of this potential.
As the foundation of the research work in this dissertation, the methodology used
for the hardware implementation and performance evaluation in Chapters 4 to 8 is
presented in the next chapter.
Chapter 3 
Methodology of Hardware
Implementation and Performance 
Evaluation 
Since the research in this dissertation is based on the performance characterization
of the hardware implementation of various AES architectures, the methodology used
for building the hardware implementation of the AES architectures and evaluating
the performance of the implementations is presented in this chapter.
3.1 Hardware Implementation
The hardware implementation performed in this dissertation is based on ASIC
technology. The general digital ASIC design flow [33] is shown in Figure 3.1, which mainly
involves four stages: algorithm modeling, RTL coding, synthesis and layout.
Figure 3.1: General digital ASIC design flow [33]
In this dissertation, the RTL description of the hardware implementation is coded
in VHDL. The synthesis tool is Synopsys Design Compiler (DC, version X-2005.09)
[34] with the 0.18-μm CMOS standard cell library from TSMC (for the work in Chapters
4 and 8) or DC (version B-2008.09) with the 90-nm CMOS standard cell library
from STMicroelectronics (for the work in Chapters 5, 6 and 7). In this dissertation,
the estimation of the performance of the implementations relies on the synthesis tool
instead of the measurement of the fabricated circuit, so the physical layout stage
in Figure 3.1 is skipped and, instead, a virtual layout process is performed in the
topographical mode of DC during the synthesis process. The topographical mode
of DC allows layout-aware RTL synthesis, which performs a coarse placement
of the cells and extracts the interconnect resistance and capacitance from that
placement; this process is called a virtual layout process [34]. With virtual layout, the
parasitic capacitances of the physical layout of the design can be estimated more
accurately compared with the wireload model-based statistical approximations. The
estimation of capacitances ensures good correlation of the estimation in timing, area
and power based on the synthesis result to that of the final physical design. For the
virtual layout, the height-to-width ratio of the placement area and the area utilization
[34] are set to 1 and 0.7, respectively, for all the implementations. The synthesis process
generates the technology-dependent gate-level designs (netlists). The netlists are then
used to estimate the timing, area, power consumption and energy consumption of the
implementations. Although the estimation of the performance based on the synthesis tool
does not accurately reflect the realistic performance when the implementations are
fabricated, all the implementations built in this dissertation are based on the same
technology library and follow the same synthesis process, and therefore the relative
comparisons should hold closely, leading to a fair and objective comparison.
3.2 Performance Evaluation
The performance of the implementations in this dissertation is evaluated in terms
of timing, area, power consumption and energy consumption. In the next section,
the evaluation of timing and area is presented first. After that, since the power
consumption and the energy consumption of an implementation are distinguished in
this dissertation, the basic concepts of power consumption and energy consumption
are introduced before the consideration of their evaluation.
3.2.1 Evaluation of Timing and Area
In this dissertation, the timing of an implementation denotes the critical path delay.
The critical path delay and area of an implementation are acquired from the synthesis
report and the area is converted to Gate Equivalency (GE), where 1 GE is equal to
the area of a two-input NAND gate with the lowest drive strength in the technology
library. Although the GE metric does not exactly reflect the size of the
implementation after fabrication, the relative comparisons between different implementations
should hold closely and these are of most interest in this dissertation.
3.2.2 Power Consumption and Energy Consumption 
The total power consumption of a CMOS circuit consists of dynamic power
consumption and static power consumption. Dynamic power consumption is the power
consumed during gate switching, i.e., when the output of the gate is changing. Static
power consumption is the power consumed when the gate has voltage applied but is
not switching.
Figure 3.2: Charging current of the load capacitance of a CMOS inverter [35]
3.2.2.1 Dynamic Power Consumption
The primary source of dynamic power consumption is switching power consumption,
which is principally the result of charging of the load capacitance (output capacitance)
of a gate [35]. The charging of the load capacitance of a CMOS inverter is illustrated
in Figure 3.2. Each time the output of a gate changes from 0 to 1, the load
capacitance is charged and the energy consumed is
$E_{sw} = C_L V_{dd}^2$ (3.1)
where $C_L$ is the load capacitance and $V_{dd}$ is the supply voltage. Then, the switching
power can be described as
$P_{sw} = C_L V_{dd}^2 P_{0 \to 1} f_{clk}$ (3.2)
where $P_{0 \to 1}$ is the probability of the output switching from 0 to 1 and $f_{clk}$ is the
clock frequency.
Figure 3.3: Short circuit current between supply power and ground of a CMOS
inverter [35]
In addition to switching power, internal power also contributes to dynamic power.
Internal power is caused by the short circuit between power supply and ground when
the PMOS and NMOS transistors are conducting simultaneously during the switching
of the input, as shown in Figure 3.3. The energy consumed by the short circuit per
switching period is
$E_{sc} = t_{sc} V_{dd} I_{peak}$ (3.3)
where $t_{sc}$ is the time both transistors are on and $I_{peak}$ represents the short circuit
current. Therefore, the internal power consumption can be described as
$P_{int} = t_{sc} V_{dd} I_{peak} P_{switch} f_{clk}$ (3.4)
where $P_{switch}$ is the probability of the change in the input.
In sum, the dynamic power can be expressed as
$P_{dyn} = P_{sw} + P_{int}$ (3.5)
Since the internal power only occurs during the ramp time of the input signal of
the gate, which is normally very short, the overall dynamic power consumption is
dominated by switching power [35], and therefore the dynamic power consumption
can be approximated as
$P_{dyn} \approx C_L V_{dd}^2 P_{0 \to 1} f_{clk}$ (3.6)
According to the above expression, it is easy to see that, in the case of implementations
with standard cell CMOS technology, $P_{0 \to 1}$ and $f_{clk}$ are the factors that can be
considered in the design and implementation to scale down the power consumption,
while for full custom CMOS technology, more factors, including $C_L$ and $V_{dd}$, can be
taken into consideration for low power design.
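A small numerical sketch may help to show how each factor scales the switching power. It assumes the common textbook form in which switching power is the product of load capacitance, squared supply voltage, 0-to-1 switching probability and clock frequency. All numeric values below are hypothetical and chosen only for illustration.

```python
def switching_power(c_load, v_dd, p_0_to_1, f_clk):
    """Switching power in watts: C_L * V_dd^2 * P(0->1) * f_clk."""
    return c_load * v_dd ** 2 * p_0_to_1 * f_clk


# Hypothetical operating point: 10 fF load, 1.2 V supply,
# switching probability 0.25, 100 MHz clock.
base = switching_power(10e-15, 1.2, 0.25, 100e6)

half_clock = switching_power(10e-15, 1.2, 0.25, 50e6)      # halve f_clk
half_activity = switching_power(10e-15, 1.2, 0.125, 100e6)  # halve P(0->1)
half_voltage = switching_power(10e-15, 0.6, 0.25, 100e6)    # halve V_dd

print(base, half_clock / base, half_activity / base, half_voltage / base)
```

Halving the clock frequency or the switching probability halves the power, while halving the supply voltage reduces it to a quarter. The first two are the knobs available in a standard cell flow.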
Figure 3.4: Static leakage currents of a CMOS inverter [35]
3.2.2.2 Static Power Consumption
Static power consumption in a CMOS gate can be expressed as
$P_{stat} = V_{dd} I_{stat}$ (3.7)
where $I_{stat}$ is the total leakage current that flows between power supply and ground
for the period the gate is not switching. Current $I_{stat}$ consists of source or drain
leakage current and sub-threshold current. Source or drain leakage current is the
leakage current between the source or the drain and the substrate of a transistor,
and sub-threshold current is the leakage current flowing from the drain to the source
of a transistor operating in the weak inversion mode [35]. An illustration of these
phenomena is shown in Figure 3.4.
Since all gates in a circuit suffer from static power consumption during the
non-switching period, it is obvious that static power consumption is proportional to the
gate count of the circuit, i.e., the area of the circuit.
3.2.2.3 Energy Consumption
Energy consumption is the accumulated effect of power consumption over time,
expressed as
$E = \int p(t)\,dt$ (3.8)
where $p(t)$ is the power consumption at time $t$. The relation between energy
consumption and power consumption can be illustrated by Figure 3.5, where the curves
are instantaneous power consumption and the areas under the curves are energy
consumption. The two approaches shown in Figure 3.5 have different power consumption
($P_1 > P_2$) over time but their total energy consumptions are the same ($E_1 = E_2$).
Under the assumption that a computation task of a device can be completed with either
of the two approaches (i.e., in time $T_1$ for approach 1 and time $T_2$ for approach 2),
approach 1 is preferred if the device is powered by a battery because the task is
completed earlier ($T_1 < T_2$), and approach 2 should be selected if the device is passively
powered and $P_1$ cannot be well supplied. Consider now that, using approach 1, it is
possible to complete the task in a time $T_0 < T_1$. In this case, for energy constrained
(i.e., battery powered) devices, it is clearly preferable to use approach 1, since less
energy is used to complete the task, even though the instantaneous power consumption
is higher than for approach 2.
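The comparison of the two approaches can be reproduced numerically. The sketch below approximates the integral of power over time with a midpoint rule. The power levels and durations are hypothetical values chosen so that the two approaches consume the same energy, as in the discussion above.

```python
def energy(p, t_end, steps=1000):
    """Approximate E = integral of p(t) dt over [0, t_end] (midpoint rule)."""
    dt = t_end / steps
    return sum(p((k + 0.5) * dt) for k in range(steps)) * dt


# Hypothetical constant power levels and task durations.
P1, T1 = 2e-3, 1e-3    # approach 1: 2 mW for 1 ms
P2, T2 = 1e-3, 2e-3    # approach 2: 1 mW for 2 ms
T0 = 0.5e-3            # approach 1 finishing earlier, in 0.5 ms

e1 = energy(lambda t: P1, T1)
e2 = energy(lambda t: P2, T2)
e0 = energy(lambda t: P1, T0)

print(e1, e2, e0)
```

Here `e1` and `e2` are equal even though the power levels differ, and `e0` is lower than `e2` despite the higher instantaneous power, which is the case favourable to battery powered devices.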
Figure 3.5: Power consumption and energy consumption
3.2.3 Evaluation of Power and Energy 
The power consumption of the candidate implementations is estimated using PrimeTime
PX (version B-2008.06-SP2) from Synopsys [36]. The estimation is a gate-level
power analysis based on the switching activity of the netlist. According to the
definitions provided in the previous section, the dynamic power of a circuit can be
calculated based on its power parameters (e.g., parasitic capacitances), timing
information (e.g., clock frequency) and switching activity. The parasitic capacitances of
the circuit are predicted and extracted in the synthesis and virtual layout process
using the topographical mode of DC. The switching activity is obtained from the gate-level
simulation of the netlist with certain simulation vectors.
The power evaluation flow using PrimePower is shown in Figure 3.6. As is shown,
after the synthesis stage, a forward Switching Activity Interchange Format (SAIF)
[37] file is generated by Design Compiler as input to the simulation stage. The
forward SAIF file contains the annotations about which circuit elements are to be
traced during simulation. Then, the simulator runs gate-level simulation on the
netlist with the simulation vectors that represent the typical tasks of the
design. The switching activity during the simulation is captured and recorded in
a backward SAIF file in the form of timing and toggle attributes of pins and
ports. PrimePower calculates the power consumption using the backward SAIF file,
the netlist file, the timing information, the parasitic information and the
technology library. Since the calculation of power consumption by PrimePower
heavily relies on the switching activity driven by the simulation vectors, a
large number of random simulation vectors need to be generated to imitate the
practical usage of the design in order to achieve accurate power estimation.
PrimePower can report both the real-time power consumption of the circuit at any
time point during the simulation period and the average power consumption over
the period. With the report of real-time power consumption, the peak power
consumption of the circuit is also known. Energy consumption is calculated by
multiplying the average power consumption with the duration of the period of
interest. In this dissertation, the energy is determined over a period of time
for the implementation to produce one unit of throughput, such as a byte for the
S-box implementation and a 128-bit block for the datapath implementation.
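The energy calculation described above can be illustrated with a minimal Python sketch. The function names and all numeric values below are hypothetical and chosen only for illustration; they are not measurements from this dissertation.

```python
def average_and_peak(power_samples_w):
    """Return (average, peak) of a list of instantaneous power samples in watts."""
    average = sum(power_samples_w) / len(power_samples_w)
    return average, max(power_samples_w)


def energy_per_unit(avg_power_w, clock_hz, cycles_per_unit):
    """Energy in joules = average power x time needed to produce one unit of throughput."""
    period_s = cycles_per_unit / clock_hz
    return avg_power_w * period_s


# Hypothetical trace and operating point (assumptions, not thesis data).
avg_w, peak_w = average_and_peak([1.0e-6, 3.0e-6, 2.0e-6])

# Hypothetical datapath: 50 uW average power, 100 kHz clock, 253 cycles per 128-bit block.
e_block = energy_per_unit(avg_power_w=50e-6, clock_hz=100e3, cycles_per_unit=253)
e_bit = e_block / 128
```

The unit of throughput only enters through `cycles_per_unit`, so the same helper covers both the per-byte S-box case and the per-block datapath case.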
Figure 3.6: Power evaluation flow using PrimePower
[Flow diagram; labelled items: RTL design, timing constraint, technology library,
forward SAIF file, delays (SDF), backward SAIF file.]
3.3 Summary
In this chapter, the methodology for the hardware implementation and the
evaluation of the performance of the implementations in this dissertation is
described. In summary, the ASIC design flow based on standard cell technologies
is adopted for the hardware implementation. The performance of the hardware
implementations is evaluated based on the netlist from the synthesis results of
the RTL design for timing, area, power and energy. The estimated parasitic
capacitances of the implementations from the virtual layout process are used for
estimation of power and energy consumption.
Chapter 4 
Using Pipelined AES S-Boxes for 
Resource Efficient Purpose: An 
Example 
In this chapter, we propose a compact ASIC implementation for AES encryption
with 128-bit keys which employs a single 4-stage pipelined S-box shared by the
data path operation and the key expansion operation. Compared with the previous
smallest encryption-only ASIC implementation of AES [31], it achieves an increase
in throughput of 2.1 times while slightly reducing the gate count. This result
indicates that pipelined AES S-boxes are applicable in AES hardware
implementations targeted at low resource applications. The content of this
chapter is also presented in [38].
4.1 Introduction
An AES encryption core with an 8-bit data path was presented in [31] where two
S-boxes are implemented, one used by round operations and the other used by the
key expansion. Even though the throughput of this design is higher than other
compact designs, the critical path, which determines the maximum clock frequency
and consequently the throughput, is quite long because it comprises the entire
critical path of the S-boxes. The S-box is the most complex component in an AES
implementation and generally involves a large number of gates on its critical
path. Commonly in an AES implementation for high speed applications, the S-boxes
are pipelined to several stages in order to reduce the critical path of the
overall design. However, this method is seldom applied to compact
implementations for throughput improvement because it is assumed that pipeline
registers would incur large hardware overhead, which is not affordable for the
compact implementations targeted at low cost applications. In this chapter, the
applicability of using a pipelined S-box in compact AES hardware implementations
is examined. A new VLSI architecture design for AES implementation is proposed to
accommodate a 4-stage pipelined S-box and the implementation results show that
the new design can achieve more than double the throughput of [31] while slightly
reducing the gate count. In the remainder of this dissertation, the design from
[31] is referred to as the reference design.
Figure 4.1: Block diagram of the proposed AES encryption core architecture
4.2 Architecture Design
The block diagram of the proposed architecture design is shown in Figure 4.1. In
the architecture, the round operations have an 8-bit data path, and on the path,
the ShiftRows, SubBytes, MixColumns and AddRoundKey operations are performed
byte by byte in sequence by the corresponding components. To complete the
operation of one round of AES encryption, all the bytes of the State need to
traverse the round operation data path once, so totally 10 traversals are
required to encrypt a 128-bit plaintext after the data path loads it. The key
expansion component also has an 8-bit data path and generates round keys
on-the-fly using 128-bit keys. One S-box is used alternately by round operations
and the key expansion. During the period the S-box is occupied by the key
expansion component, the round operations are frozen by clock gating. The
proposed design is developed based on the reference design [31] and adopts the
same ShiftRows, MixColumns and S-box structures. The proposed design has modified
interconnection between components, key expansion component and data flow which
allow the interleaved use of one pipelined S-box between the data path operation
and the key expansion operation. The detailed architecture of the proposed design
is shown in Figure 4.2. All the paths in Figure 4.2 have a width of 8 bits except
those between two consecutive pipeline stages in the S-box, whose widths are the
internal data widths of the S-box at the places where the pipeline registers are
inserted. The blocks marked with "R" are 8-bit registers. The operation of each
component and their interaction will be described in the following sections.
Figure 4.2: Architecture of the AES encryption core with a 4-stage pipelined S-box
4.2.1 The ShiftRows Component
The ShiftRows component consists of 12 8-bit registers connected in series and
there are shortcuts from its input and every fourth register to its output. The
component takes bytes arriving in the order of State columns and reorders the
bytes while they are passing through. The detailed operation of the component is
described in [31].
4.2.2 S-Box
The S-box adopted in the proposed design is developed in [7]. Since the
computation of multiplicative inverse over GF(2^8) can be converted to the
computations in subfields, in [7] the S-box structure is examined for a number of
representations of subfields, including both polynomial bases and normal bases,
and the one leading to the implementation with the smallest gate count is
identified. In the proposed architecture, the S-box is pipelined to 4 stages. The
pipeline registers are placed between two consecutive stages but not shown in
Figure 4.2. The register placement is performed at the gate level. The exact
register placement is not presented. Refer to Chapter 5 for the details.
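As a functional reference for what the S-box computes, the following Python sketch evaluates the multiplicative inverse directly in GF(2^8) and then applies the AES affine transformation. It is only a behavioural model under that assumption: it does not use the subfield (composite field) decomposition of [7], and the function names are illustrative.

```python
AES_POLY = 0x11B  # AES reduction polynomial x^8 + x^4 + x^3 + x + 1


def gf256_mul(a, b):
    """Multiply two bytes as elements of GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def gf256_inv(a):
    """Multiplicative inverse in GF(2^8) via a^254; 0 maps to 0 as in AES."""
    if a == 0:
        return 0
    result, power, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf256_mul(result, power)
        power = gf256_mul(power, power)
        exponent >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def aes_sbox(x):
    """SubBytes for one byte: field inversion followed by the affine transformation."""
    inv = gf256_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
```

A hardware S-box such as the one in [7] produces the same byte mapping but restructures the inversion so that it can be built from small subfield operations.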
4.2.3 The MixColumns Component
The MixColumns component is a serial-in, parallel-out matrix multiplier. It takes
one byte input per clock cycle continuously for 4 cycles to receive a column of
the State. At every fourth clock cycle, the computation of the MixColumns
operation on the current column of the State is completed and the first byte of
the result is available at the output while the remaining three bytes are fed to
the input of the parallel-in, serial-out shift registers incorporated in the
MixColumns component. Subsequently, the three bytes are shifted out in the
following three cycles. The blocks X02 and X03 in Figure 4.2 generate the
products of the current input byte and 02h and 03h, respectively. The AND gates
are used to bypass the XOR gates. This is done by setting EN to 0 and thus
ensuring that the XOR operation does not change the data. During the loading of a
128-bit plaintext, only the shift registers at the right side of the component
are working to shift in and shift out the plaintext bytes in serial. Refer to
[31] for a detailed explanation of the component.
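The arithmetic performed by the X02 and X03 blocks can be sketched in Python as follows. This is a software model of the column computation only, assuming the standard AES MixColumns matrix; it evaluates all four output bytes at once and does not model the serial-in, parallel-out timing of the component. The function names are illustrative.

```python
def x02(b):
    """Multiply a byte by 02h in GF(2^8) with the AES polynomial 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def x03(b):
    """Multiply a byte by 03h: 03h * b = (02h * b) XOR b."""
    return x02(b) ^ b


def mix_column(col):
    """Apply MixColumns to one 4-byte column of the State."""
    a0, a1, a2, a3 = col
    return [
        x02(a0) ^ x03(a1) ^ a2 ^ a3,
        a0 ^ x02(a1) ^ x03(a2) ^ a3,
        a0 ^ a1 ^ x02(a2) ^ x03(a3),
        x03(a0) ^ a1 ^ a2 ^ x02(a3),
    ]


# A commonly quoted MixColumns test column.
mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Every output byte needs only the products by 02h and 03h plus XORs, which is why two small multiplier blocks suffice in the serial data path.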
4.2.4 Key Expansion Component
The key expansion component has an 8-bit data path, which is implemented mainly
by circularly connected shift registers R17 to R32. The bytes of a round key are
generated while the key state circulates through the shift registers and the
generation of a round key is completed every time all of the key state has
circulated along the path once. The computation of the next round key involves
the substitution of the last four bytes of the current round key. This is
realized by an 8-bit multiplexer switching the input of the S-box between the
round operation data path and the key expansion data path. During the load period
of key bits, the AND gate has EN set to 0 to bypass the XOR gate on the shift
register path.
4.2.5 Overall Design
In order to clarify the operation of the architecture, the states of the numbered
registers in Figure 4.2 in certain selected clock cycles are shown in Table 4.1
and Table 4.2 for the round operation component and the key expansion component,
respectively. For both tables, the output of the register during a clock period
is regarded as the state of the register. In Table 4.1, for each state N_m
(0 ≤ N ≤ 10, 1 ≤ m ≤ 16), N represents the N-th round within which the byte of
the State is processed (with the exception that N = 0 indicates the initial
plaintext) and m represents the m-th byte of the State in the order of columns.
The state N'_m represents the m-th byte of the State after the MixColumns
operation in the N-th round (after the SubBytes operation in the final round).
Similarly, in Table 4.2 the state of a register N_m indicates the m-th byte of
the N-th round key with the original key bits represented with N = 0. In both
tables, X indicates a state holding a useless byte. The operation of the
multiplexers and the AND gates can be easily determined from Table 4.1 and
Table 4.2. Clock gating is applied regularly to both round operation and key
expansion components. The selected cycles that demonstrate the occurrence of
clock gating are marked with * in Table 4.1 and Table 4.2. The registers that
require clock gating and the cycles when clock gating is active can be deduced
from Table 4.1 and Table 4.2. It should be mentioned that, as is shown in
Table 4.1 and Table 4.2, the architecture works slightly differently for the
final round than for the other rounds because the MixColumns operation is
skipped in the final round. It takes 256 clock cycles to complete the encryption
of a 128-bit plaintext including loading and unloading, and since there is
overlapping of three clock cycles during loading and unloading, the effective
clock count of the architecture is 253 for the encryption of a 128-bit block of
plaintext.
4.3 Implementation Results and Discussion
The proposed AES architecture design with a 4-stage pipelined S-box is
synthesized using Synopsys Design Compiler version X-2005.09 under 0.18-µm CMOS
standard cell technology from TSMC. The synthesis results of the proposed design
with the constraint of minimum area are reported in Table 4.3. Since it is
difficult to compare performance of implementations with different technologies,
the implementation results under 0.13-µm technology from [31] are not quoted for
comparison here, and instead, the reference design is implemented and synthesized
with the same tool and technology as the proposed design. The results are
presented in Table 4.3. It can be seen that the design with the pipelined S-box
uses slightly fewer gates than the reference design and achieves an increase in
throughput by a factor of 2.1. Although the overhead of control logic is not
included in the comparison, the slight increase in gates used for the controller
of the proposed design would be countered by the slight decrease in gates on the
data path. The implementation results and comparison show that, even though the
pipelined S-box would introduce the latency of several clock cycles per round
operation compared with the reference design, the reduction of the critical path
delay by using the pipelined S-box compensates for the increased latency
Table 4.1: Register states of the round operation data path
[Table body not legible in the source scan. Columns: Cycle, R1 to R8 (first
half) and R9 to R16 (second half). Rows: selected clock cycles from initial
loading to cycle 256, including 15, *16-20, 21, 28, 29, *41-44, *232-236, 241 and
256, where * marks cycles with clock gating.]
Table 4.2: Register states of the key expansion component
[Table body not legible in the source scan. Columns: Cycle, R17 to R32. Rows:
selected clock cycles from initial loading to cycle 256, including 17, 20, *21,
*23, *24-28, 29, 31, *45, *237, *240, *241 and 256, where * marks cycles with
clock gating.]
Table 4.3: Implementation results
                          Area    Max. Freq   Clock    Throughput
                          (GEs)   (MHz)       cycles   (Mbps)
Proposed                  2730    233         253      117.9
Reference design [31]     2815    69          160      55.2
Table 4.4: Normalized performance comparison of the architecture using a single
S-box with different number of stages
# Pipeline Stages           1      2      3      4      5
Area                        0.93   0.96   0.99   1      1.05
Throughput                  0.37   0.65   0.82   1      1.15
Ratio (Throughput/Area)     0.40   0.65   0.83   1      1.1
and brings significant boost to the throughput. Therefore, when throughput is a
concern for a low gate count AES hardware implementation, the proposed design
with a pipelined S-box is a much better choice than the reference design with two
S-boxes. Only the performance comparison with the reference design is presented
here because the reference design uses the lowest hardware cost among published
works based on an ASIC platform [31].
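The throughput comparison can be reproduced with a short Python sketch, assuming that the throughput of each core is the maximum clock frequency multiplied by the 128-bit block size and divided by the number of clock cycles per block. The figures used are those quoted for the two designs in Table 4.3 (233 MHz and 253 cycles for the proposed design, 69 MHz and 160 cycles for the reference design); the function name is illustrative.

```python
def throughput_mbps(max_freq_mhz, cycles_per_block, block_bits=128):
    """Throughput in Mbps for a core that emits one block every cycles_per_block cycles."""
    return max_freq_mhz * block_bits / cycles_per_block


proposed = throughput_mbps(233, 253)   # proposed design with 4-stage pipelined S-box
reference = throughput_mbps(69, 160)   # reference design [31] with two S-boxes
speedup = proposed / reference
```

Under this assumption the proposed design needs more cycles per block (253 against 160) but runs at a clock frequency more than three times higher, which is where the factor of about 2.1 comes from.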
In order to determine the influence of the number of pipeline stages on the
overall performance of a compact design, the scenarios for varying number of
pipeline stages are investigated. The area and throughput performance of the
architecture using a single S-box with a variety of pipeline stages is normalized
to the 4-stage pipeline scenario and shown in Table 4.4. It should be noted that
the architecture in Figure 4.1 only works with a 4-stage pipelined S-box and for
other stage numbers up to 5 the architecture requires minor changes to fit. For
more than 5 stages of pipeline, a major modification on the architecture is
required since it is getting complicated to share the S-box between the data path
and the key expansion when the latency of the S-box becomes larger than 5 clock
cycles. The differences in area between pipeline stage numbers come from the
different amount of pipeline registers used in each case. The figures of one
pipeline stage in Table 4.4 indicate the scenario of using a non-pipelined S-box.
From Table 4.4, it can be seen that the ratio of throughput/area is gradually
improved as the number of the pipeline stages increases. The architecture with a
4-stage pipelined S-box is selected to be specified in this chapter because it
has the best performance for an architecture with an area smaller than the
reference design [7]. The information in Table 4.4 also indicates that the
performance variance over different numbers of the pipeline stages of the S-box
implementation is worth investigation in order to identify the appropriate number
of pipeline stages for certain design requirements. In the next chapter, a
detailed and comprehensive investigation on the pipeline configurations of the
AES S-box is performed through the characterization of the performance of the
different pipelined S-box implementations under a variety of throughput
requirements with a 90-nm CMOS standard cell technology.
4.4 Summary
In this chapter, a new architecture design for compact hardware implementation of
an AES encryption core is presented. The new design features a 4-stage pipelined
S-box. The implementation results show that, compared with the previous smallest
encryption-only AES hardware implementation, the new design uses roughly the same
number of gates to achieve an increase of 2.1 times in throughput. The
implementation results indicate that not only are pipelined S-boxes applicable to
compact implementations of AES, they can actually be used to improve performance.
The content of this chapter is a preliminary attempt to improve the performance
of the AES implementation with pipelined S-boxes. Table 4.4 also shows some
preliminary results of the exploration of the performance trade-offs with the
different number of pipeline stages. Inspired by these results, in the next
chapter, a comprehensive investigation of the performance improvement and
trade-offs provided by pipelined AES S-boxes is performed and presented.
Chapter 5 
Exploration of S-Box Pipeline 
Configurations for Flexible and 
Efficient Implementation 
In this chapter, we present a comprehensive investigation of the pipeline
configurations for ASIC implementations of the Advanced Encryption Standard (AES)
substitution box (S-box). We consider pipeline configurations for the S-box with
a typical composite field structure by varying the number of pipeline stages and
the placement approach of pipeline registers. Besides the conventional placement
approach at the component level of the S-box, we adopt a new placement approach
at the gate level to achieve a fine-grained pipeline. The characterization shows
that there is notable performance improvement in timing, area, power and/or
energy efficiency by using an appropriate configuration compared with other
configurations including non-pipelined implementations.
5.1 Introduction
In contrast to the conventional usage of pipelined S-boxes in high throughput
implementations, there are very few previous works that investigate the potential
of pipelined S-boxes for resource efficiency so that they can be applied to
resource-constrained applications, such as lightweight embedded applications.
This is due to the fact that the area overhead introduced by the pipeline
registers appears to conflict with the effort of reducing area and consequently
reducing power and energy consumption.
In the next section, based on a typical composite field structure of the AES
S-box, we consider an extensive variety of pipeline configurations and
investigate their influence on all perspectives of the performance, including
timing, area, power and energy. The pipeline configurations consist of the number
of pipeline stages of 2 to 4 for the component level register placement approach
and 2 to 7 for the gate level register placement approach. We introduce the
application of the gate level approach for the pipeline of S-box implementations.
It exploits the retiming function of the synthesis tool to achieve the
fine-grained pipeline at the gate level without the violation of the standard
ASIC design flow. The performance of the pipelined S-box implementations is
benchmarked with a 90-nm CMOS standard cell technology.
Through the analysis of the performance, some obvious trends that reflect the
influence of the pipeline configurations on the performance are identified. In
addition to the timing improvement, notable improvements in terms of area, power
and energy efficiency are also observed by using an appropriate pipeline
configuration compared with implementations with no pipeline. These improvements
occur under the wide range of timing requirements we have examined, including the
requirements to which lightweight AES implementations are targeted. These results
are strong evidence that pipelined S-box implementations are not only suitable
for high throughput AES implementations, but also valuable to resource-efficient
AES implementations. The results also show that pipelining provides many more
performance options that allow more flexible implementation of the AES S-box
compared with non-pipelined implementations.
5.2 The S-box Structure for Pipelining
As is mentioned in Section 2.2.1, the composite field structures have low
complexity and this allows for more efficient pipelining in terms of the number
of pipeline registers. Therefore, the composite field structures are most
suitable for pipelining.
Although quite a number of composite field S-box structures are available,
including [5], [6], [7], [8], [9], [10] and [11], there is typically little
difference between them. They tend to be very close together in performance
compared with the other groups of S-box structures [18]. Since our purpose is to
investigate the general effectiveness of pipelining the S-box with a composite
field structure, we use the one from [5] as a typical and common composite field
structure for the study in this chapter. Although it is not the one with the best
performance among all the composite field structures, it has a relatively simple
and clean structure and can be easily pipelined. The pipeline of the S-box in
[19] and [20] is also based on this structure. Further, since the S-boxes for
encryption and for decryption have very similar structures, we focus our
investigation on the encryption S-box.
Figure 5.1: The typical data path architectures of the AES: (a) loop-unrolled
architecture, (b) round-iterative architecture, (c) fully serialized architecture
5.3 Applicability of the Pipelined AES S-Box
When considering replacing non-pipelined S-boxes with pipelined S-boxes in an AES
implementation, the primary concern is whether it would impact the throughput of
the cipher. Actually, the answer to this concern varies with the contexts,
including the architecture of the cipher and the mode of operation of the cipher.
In the following, the impact on the throughput will be discussed in the contexts
of three typical data path architectures of AES and the two groups of the
commonly used block cipher modes of operation: the non-feedback modes and the
feedback modes.
Three typical AES hardware architectures are the loop-unrolled architecture [19]
[20], the round-iterative architecture [39] and the fully serialized architecture
(with a datapath width of 8 bits) [31] [32], and the illustration of them is
shown in Figure 5.1. The loop-unrolled architecture is usually adopted for high
throughput implementations while the fully serialized architecture targets low
area implementations. The round-iterative architecture provides a trade-off
option in between. Non-feedback modes include electronic codebook (ECB) mode and
counter (CTR) mode, and feedback modes include cipher-block chaining (CBC) mode,
cipher feedback (CFB) mode and output feedback (OFB) mode. In each context, two
scenarios are considered after the replacement of the non-pipelined S-boxes with
the pipelined S-boxes: (a) the cipher runs at the same clock frequency, and (b)
the cipher runs at higher clock frequency. For convenience, the assumption is
made that the inputs and the round keys are able to be fed to the data path
whenever they are required.
The loop-unrolled architecture is developed particularly for the non-feedback
modes in order to fully exploit its capability to produce a 128-bit output per
clock cycle and, therefore, it is assumed that this architecture is only used for
the non-feedback modes. For scenario (a), there is no impact on the throughput
after the replacement. For scenario (b), the throughput would increase.
For the round-iterative architecture working in the non-feedback modes, the
throughput can be kept the same (scenario (a)) or increased (scenario (b)), due
to the fact that the number of inputs that can be processed simultaneously is
equal to the number of pipeline stages. In the cases of the feedback modes, there
is only one input that can be processed at one time due to the dependency between
the current input and previous output, and consequently the throughput drops by a
factor of the number of pipeline stages for scenario (a). However, if this cipher
works under one of the feedback modes but serves for parallel independent data
streams/channels with the number equal to or larger than the number of pipeline
stages, as is also assumed in [40], the throughput can be kept the same for
scenario (a) or increased for scenario (b).
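The cases discussed for the round-iterative architecture can be summarized in a small Python model. It is a simplification under stated assumptions: one round occupies exactly as many clock cycles as there are pipeline stages, inputs and round keys are always available, and `clock_scale` (a hypothetical parameter) is the ratio of the new clock frequency to the original one, so that `clock_scale = 1` corresponds to scenario (a).

```python
def relative_throughput(stages, feedback_mode, streams=1, clock_scale=1.0):
    """Throughput of a round-iterative core with a pipelined S-box,
    relative to the same core with a non-pipelined S-box."""
    if feedback_mode:
        # Only independent streams can fill the pipeline in a feedback mode.
        blocks_in_flight = min(stages, streams)
    else:
        # Non-feedback modes (ECB, CTR) can keep every stage busy.
        blocks_in_flight = stages
    return clock_scale * blocks_in_flight / stages


ecb_same_clock = relative_throughput(4, feedback_mode=False)             # unchanged
cbc_one_stream = relative_throughput(4, feedback_mode=True)              # drops by factor 4
cbc_four_streams = relative_throughput(4, feedback_mode=True, streams=4) # unchanged
```

Raising `clock_scale` above 1 models scenario (b), where the shorter critical path is used to increase the clock frequency.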
The fully serialized architecture with one S-box takes at least 16 clock cycles
to complete one round function and iterates for the number of rounds to process
one input. It is not able to handle more than one input at a time, so there is no
difference caused by the mode of operation in evaluating the impact on throughput
by pipelining. Assuming that the fully serialized architecture takes 16 clock
cycles to complete one round function and leads to the throughput T before an
n-stage pipelined S-box (n ≥ 2) is used, the number of clock cycles for one round
function becomes 16 + (n - 1) after the replacement and the throughput becomes
16T/(15 + n) in scenario (a). Considering that the fully serialized architecture
is usually used in the applications running at low clock frequency, the decline
of the throughput after the replacement is slight and acceptable if the pipelined
S-box has a small number of stages. For scenario (b), the throughput can be
increased above 16T/(15 + n) by increasing the clock frequency. It will be shown
in the following that there is no or very little penalty on the area and energy
efficiency if the clock frequency is increased for the S-box implementation built
for running at low clock frequency.
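The expression 16T/(15 + n) follows directly from the cycle counts and can be checked with a short Python sketch. The function name and the `clock_scale` parameter are illustrative; `clock_scale = 1` corresponds to scenario (a).

```python
from fractions import Fraction


def serialized_relative_throughput(n_stages, clock_scale=1):
    """Throughput of the fully serialized architecture with an n-stage S-box,
    as a fraction of the throughput T with a non-pipelined S-box."""
    cycles_before = 16                   # cycles per round, non-pipelined S-box
    cycles_after = 16 + (n_stages - 1)   # extra latency of the pipelined S-box
    return Fraction(cycles_before, cycles_after) * clock_scale


two_stage = serialized_relative_throughput(2)    # 16/17 of T
four_stage = serialized_relative_throughput(4)   # 16/19 of T
```

For a 4-stage S-box the loss at an unchanged clock is about 16%, and under this model any clock increase by more than a factor of 19/16 recovers it.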
In this work, the pipelined S-box implementations are compared mostly under
scenario (a), so that the throughput of the S-box remains the same for pipelined
and non-pipelined implementations, regardless of the number of pipeline stages.
Based on the above analysis, this context is prevalent.
5.4 Pipelining the AES S-Box
S-box implementations with the composite field structures create a long critical
path delay for the overall AES implementation. For most AES implementations, the
critical path of the overall implementation lies in the hardware of one round
function. A round function would perform the operations SubBytes, ShiftRows,
MixColumns and AddRoundKey in sequence in the data path part. Among these
operations, MixColumns can be built with 3 XOR gates on its critical path [19],
AddRoundKey is single XOR gates in parallel, and ShiftRows is a crossover of
wiring in hardware. Therefore, the critical path of the S-box takes up almost the
whole critical path of the overall implementation. Pipelining the S-box would
effectively reduce the critical path delay of the overall implementation. In this
section, we investigate the possible pipeline configurations applicable to the
S-box with a composite field structure. We refer to a pipeline configuration as a
combination of a placement approach of pipeline registers and a number of
pipeline stages.
Another benefit of pipelining an S-box with long critical path delay is the
reduction of glitches in the circuit and consequently the reduction of power and
energy consumption. Glitches are the unnecessary transitions of signals in a
circuit. They are generated when the input signals do not arrive simultaneously
at a gate and these glitches may propagate to generate more glitches throughout
the circuit. Since the dynamic power consumption of a circuit is proportional to
the number of transitions in the circuit and dominates the total power
consumption of a circuit in the CMOS process, the negative influence on the power
consumption caused by glitches can be severe if the amount of glitches is large.
According to the transition simulation performed in [41] on a direction detector
that has a long critical path delay, about 80% of the total transitions in the
circuit are glitches. An effective approach to reduce glitches is to insert
flip-flops in the circuit, as is shown in [42]. The work in [41] also shows that
there is significant reduction in power consumption if the optimum number of
flip-flops is used. Since the AES S-box with a composite field structure is also
a combinational circuit with long critical path delay, it is reasonable to
believe that pipelining can contribute to the power and energy efficiency of the
S-box implementation through the reduction of glitches and this is confirmed by
the characterization results in this chapter.
5.4.1 Pipelining at the Component Level
Pipelining at the component level is a coarse-grained placement approach. An
example that pipelines the S-box structure of [5] into 2, 3 and 4 stages is shown
in Figure 5.2, where A and B are elements in GF(2^4), ⊕ and ⊗ are addition and
multiplication over GF(2^4), respectively, and e is the constant in hexadecimal
notation. The dotted lines represent the positions of the registers for the
corresponding number of pipeline stages indicated at the bottom. The 2-stage and
3-stage pipelining are the same as used in [20]. Pipelining at the component
level can be realized at the Register Transfer Level (RTL) in the HDL description
of the S-box. The registers are placed according to the estimated complexity or
delays of the components before the S-box is synthesized into the gate-level
design. For this placement approach, the number of pipeline stages is examined up
to 4 in this chapter since the critical path delays of the stages become severely
unbalanced when the number is larger.
Figure 5.2: Component level pipelined S-box architectures: (a) 1-stage (no
pipeline), (b) 2-stage, (c) 3-stage, (d) 4-stage
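Why coarse-grained placement runs into unbalanced stages can be illustrated with a toy Python model. It assumes, purely for illustration, that the S-box is a linear chain of components with known delays and that registers may only be placed between components; the delay values are hypothetical and are not taken from the S-box of [5].

```python
from functools import lru_cache


def best_stage_delay(delays, stages):
    """Smallest achievable worst-stage delay when a chain of components
    is cut into `stages` contiguous, non-empty groups."""
    n = len(delays)
    prefix = [0]
    for d in delays:
        prefix.append(prefix[-1] + d)

    @lru_cache(maxsize=None)
    def solve(start, k):
        if k == 1:
            return prefix[n] - prefix[start]
        best = float("inf")
        # Leave at least one component for each of the remaining k - 1 stages.
        for cut in range(start + 1, n - k + 2):
            head = prefix[cut] - prefix[start]
            best = min(best, max(head, solve(cut, k - 1)))
        return best

    return solve(0, stages)


# Hypothetical component delays in arbitrary units (total 16).
component_delays = [4, 1, 3, 2, 6]
achieved = [best_stage_delay(component_delays, k) for k in range(1, 6)]
ideal = [sum(component_delays) / k for k in range(1, 6)]
```

In this toy case the worst-stage delay stops improving once the slowest single component dominates a stage, whereas the ideal, perfectly balanced delay keeps shrinking with the number of stages.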
56 
5.4.2 Pipelining at the Gate Level
A fine-grained placement approach for pipelining may be desired since it can avoid
the problem of unbalanced delays. The most fine-grained placement can be achieved
by pipelining at the gate level.
In the ASIC design flow using standard cell libraries, a gate level design, also
known as a netlist, is generated from the synthesis process applied to the RTL design.
It should be noted that inserting registers into the RTL design, as is done for the
component level pipeline, would not lead to a desirable gate level pipeline with
well balanced delays since the delays cannot be accurately estimated before the
synthesis. On the other hand, after the synthesis, the manual insertion of registers
into the netlist for gate level pipeline is not compliant with the standard ASIC design
flow and could violate the optimization effort of the synthesis, considering that the
manual pipeline would impose a major modification of the netlist that has been well
optimized by the synthesis towards synthesis constraints. To get rid of this dilemma,
we adopt the retiming function of the synthesis tool to perform the gate level pipeline
during the synthesis process. In this way, a non-pipelined RTL design of the S-box
would generate the gate level pipelined netlist after the synthesis process. The details
of the implementation of this approach in our experiment are described in Section
5.5.1.
For the gate level approach, theoretically, the critical path delay can be shortened
to the minimum by increasing the number of pipeline stages until there is only a single
gate left on the critical path of a stage. When the number of stages becomes large
and the gate delay on the critical path is close to the delay of the register, there is
little or no improvement of timing from more pipeline stages. In this chapter, the
number of pipeline stages is examined up to 7 for the gate level approach in order to
observe the saturation in timing improvement.
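The saturation argument can be illustrated with a simple first-order model. The model is an assumption made here for illustration and is not the timing model of the synthesis tool: the combinational delay is split evenly over n stages and every stage pays a fixed register overhead, so the clock period is roughly t_comb / n + t_reg. The function names and the delay values below are hypothetical.

```python
def stage_period(t_comb_ns, t_reg_ns, n_stages):
    """Idealized clock period: evenly balanced logic plus a fixed register overhead."""
    return t_comb_ns / n_stages + t_reg_ns


def saturation_table(t_comb_ns, t_reg_ns, max_stages):
    """Return (stages, period, gain over the previous stage count) for 1..max_stages."""
    rows = []
    previous = None
    for n in range(1, max_stages + 1):
        period = stage_period(t_comb_ns, t_reg_ns, n)
        gain = None if previous is None else previous - period
        rows.append((n, period, gain))
        previous = period
    return rows


if __name__ == "__main__":
    # Hypothetical delays: 2.0 ns of logic, 0.15 ns of register overhead.
    for n, period, gain in saturation_table(2.0, 0.15, 7):
        gain_text = "-" if gain is None else f"{gain:.3f}"
        print(f"{n} stage(s): period {period:.3f} ns, gain {gain_text} ns")
```

In this idealized model each additional stage brings a smaller gain than the previous one, and the period can never drop below the register overhead, which is the qualitative behaviour the 7-stage limit is meant to expose.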
5.4.3 Comparing Placement Approaches 
Despite its superiority in terms of balanced delays, the gate level placement approach
is inferior in terms of the number of pipeline register bits required. Compared with the
component level approach that only places register bits on the input or output ports of
a component, the gate level approach usually requires an expanded number of bits for
pipelining when the components are decomposed into gates. On the other hand, the
flexibility in register placement of the gate level approach allows the synthesis process
to have more optimization potential. Therefore, there is no straightforward way to
tell which approach leads to overall better performance of the implementations until
a comprehensive characterization and comparison is conducted, as is done in this
chapter.
5.5 Methodology
Although the number of pipeline stages is the essential factor determining the critical
path delay of pipelined S-box implementations, the synthesis tool can tune the
implementations with different pipeline configurations to meet the same timing constraint.
In this way, they can be compared to reflect the performance variation while varying
the pipeline configuration to identify the appropriate configurations that lead to the
most desired performance for different timing constraints. The S-box implementations
of different pipeline configurations are referred to as candidate implementations
in the next sections. The candidate implementations are tuned and compared against
the same timing constraints because timing usually has the first priority as a design
requirement. This is the basic methodology we adopt for the investigation of pipeline
configurations.
Figure 5.3: Derivation of the candidate implementations from the source HDL
description of the S-box
5.5.1 Deriving the Candidate Implementations
The derivation of all the S-box implementations based on the source HDL description
of the non-pipelined AES S-box is shown in Figure 5.3. The source HDL description
of the non-pipelined S-box in Figure 5.3 is the structural description of the S-box with
the composite field structure from [5], and the structure is shown in Figure 5.2. Both
the input and output ports in this HDL description are registered, as is indicated by
the dotted line at the input and output of the S-box in Figure 5.2.
In Figure 5.3, three categories of the HDL descriptions are derived firstly from the
source HDL description and then each of the HDL descriptions is used to generate
a number of the implementations under different timing constraints. Each of the
HDL descriptions reflects a pipeline configuration. Each of the implementations is
generated from a synthesis and virtual layout process based on the HDL description.
The first category of HDL descriptions contains that of the non-pipelined S-box,
which is exactly the same as the source HDL description. It is used to generate
the non-pipelined candidate implementations as the references. The second category
contains the HDL descriptions of the pipeline configurations with the component level
placement approach. They are derived by inserting the registers as D flip-flops into
the source HDL description according to Figure 5.2. The third category is the pipeline
configurations with the gate level placement approach.
Figure 5.4: Illustration of the gate level approach of pipelining into 3 stages by
register retiming
The gate level approach is performed with the register retiming function provided
by Synopsys Design Compiler [34]. The register retiming function moves registers
through the combinational logic of a design to optimize timing and area. An
illustration of how this function works is shown in Figure 5.4. Before the register
retiming, rows of registers need to be inserted in the source HDL description of the
S-box, according to the target number of stages and they can be simply placed at the
outputs of the S-box. Each row of registers initially has a width of 8 bits and while
it is moved to the appropriate position during the register retiming, the width will
be adjusted to fit the width of the datapath at the position. It should be noted that
although the design produced by the retiming process is subject to formal verification,
the correctness of our pipelined S-boxes can be easily verified by an exhaustive
test. Since the retiming process is automatic and the placement of registers may vary
from one candidate implementation to another, the detailed locations of the pipeline
registers are not presented in this chapter.
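Because the S-box has only 256 possible inputs, an exhaustive test is cheap. The sketch below shows one way such a test could look. It is an illustration under stated assumptions, not the test bench used in this work: the class PipelinedSBoxModel is a hypothetical behavioural stand-in for the simulated netlist, and the latency parameter is assumed to be the number of clock cycles between an input and its output. The reference values are computed from the standard S-box definition (inversion in GF(2^8) followed by the affine transformation).

```python
from collections import deque


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(x):
    """Multiplicative inverse as x^254, with 0 mapped to 0 as in AES."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, x)
    return result if x else 0


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x):
    """Reference AES S-box: field inversion followed by the affine transformation."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


REFERENCE = [sbox(x) for x in range(256)]


class PipelinedSBoxModel:
    """Hypothetical stand-in for the design under test: output lags input by `latency` clocks."""

    def __init__(self, latency):
        self.pipe = deque([None] * latency)

    def clock(self, byte_in):
        self.pipe.append(None if byte_in is None else REFERENCE[byte_in])
        return self.pipe.popleft()


def exhaustive_test(dut, latency):
    """Apply all 256 inputs back to back and compare every delayed output."""
    stimuli = list(range(256)) + [None] * latency  # extra clocks flush the pipeline
    outputs = [dut.clock(x) for x in stimuli]
    return all(outputs[x + latency] == REFERENCE[x] for x in range(256))


if __name__ == "__main__":
    print(exhaustive_test(PipelinedSBoxModel(3), 3))
```

In a real flow the model object would be replaced by the gate-level simulation of the retimed netlist, while the stimulus and the comparison would stay the same.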
For each pipeline configuration, a number of different timing constraints are used
in the synthesis process to produce different candidate implementations. These timing
constraints consist of minimized delay (timing constraint is set to 0) and some selected
constraints covering a wide range of timing requirements. These selected timing
constraints include 0.35 ns, 0.40 ns, 0.60 ns, and 0.80 ns as the tight constraints,
1.00 ns, 1.25 ns, 1.50 ns, 1.75 ns and 2.00 ns as the medium constraints, and 2.50 ns,
3.00 ns, 4.00 ns and 8.00 ns as the loose constraints. A candidate implementation is
omitted if it cannot meet its target timing constraint.
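Each selected timing constraint corresponds to a target clock frequency, which is simply the reciprocal of the clock period. The short sketch below tabulates this relation for the selected constraints; the list and function names are illustrative only.

```python
# Selected timing constraints in nanoseconds, grouped as in the text.
TIGHT_NS = [0.35, 0.40, 0.60, 0.80]
MEDIUM_NS = [1.00, 1.25, 1.50, 1.75, 2.00]
LOOSE_NS = [2.50, 3.00, 4.00, 8.00]


def clock_frequency_mhz(period_ns):
    """Clock frequency in MHz for a clock period given in ns (f = 1 / T)."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    for group, periods in (("tight", TIGHT_NS), ("medium", MEDIUM_NS), ("loose", LOOSE_NS)):
        for period in periods:
            print(f"{group:>6}: {period:.2f} ns -> {clock_frequency_mhz(period):.0f} MHz")
```

For example, 1.75 ns corresponds to about 571 MHz and 8.00 ns to 125 MHz.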
The synthesis and virtual layout process is performed in the topographical mode
of Synopsys Design Compiler, as is described in Chapter 3. The synthesis library is
the 90-nm CMOS standard cell library from STMicroelectronics with a core voltage
of 1.2 V and standard threshold voltage.
5.5.2 Evaluation of the Performance
After the candidate implementations are achieved from the above process, their
performance in terms of critical path delay, area, power and energy is estimated
according to the performance evaluation methodology described in Chapter 3. The power
consumption is estimated at the clock frequency corresponding to its target timing
constraint, and this may differ from the maximum clock frequency at which it is able
to work. For the candidate implementations that are built under the constraint of
minimized delay, the maximum clock frequency is used for the estimation. The
energy consumption of a candidate implementation is calculated as the product of its
power consumption and the clock period applied to it. This energy consumption is
normalized to be the average energy consumed to produce one byte output.
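The energy calculation can be sketched as follows, assuming that the S-box produces one output byte per clock cycle so that the energy per byte equals power times clock period. The function name and all numeric values are hypothetical and serve only to show the units and the arithmetic.

```python
def energy_per_byte_pj(power_mw, clock_period_ns):
    """Energy per output byte in pJ, assuming one byte per clock cycle (mW * ns = pJ)."""
    return power_mw * clock_period_ns


def rank_by(values):
    """Return the keys of a dict ordered from lowest to highest value."""
    return sorted(values, key=values.get)


if __name__ == "__main__":
    period_ns = 0.80  # one shared timing constraint
    power_mw = {"NP": 2.0, "P2": 1.2, "G3": 1.5}  # hypothetical power figures
    energy_pj = {name: energy_per_byte_pj(p, period_ns) for name, p in power_mw.items()}
    print(energy_pj)
    # Under a common clock period, ranking by power and by energy is identical.
    print(rank_by(power_mw) == rank_by(energy_pj))
```

Because the clock period is a common factor when implementations are compared under the same timing constraint, comparing energy is then equivalent to comparing power.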
The power consumption of the candidate implementations is estimated using
PrimeTime PX from Synopsys. The switching activity is obtained from the gate-level
simulation of the netlist with 10,000 random byte inputs. The same set of
random byte inputs is used for all the candidate implementations.
5.6 Experimental Results and Analysis
Following the methodology described in the last section, there are 10 HDL descriptions
derived from the source HDL description and a total of 127 candidate implementations
are built from them. We investigate the effect of the pipeline configurations upon the
performance of the candidate implementations in three ways. Firstly, each pipeline
configuration has its performance under different timing constraints examined.
Secondly, the candidate implementations from different pipeline configurations with the
minimized timing constraint are compared. Thirdly, the candidate implementations
from different pipeline configurations with given timing constraints are compared.
5.6.1 Performance versus Timing Constraints
The normalized area, power and energy of the candidate implementations with the
same pipeline configuration are placed in the same subfigure of Figure 5.5 according
to their timing constraints. The pipeline configuration of the component level or
the gate level placement approach with the pipeline of n stages is denoted as Pn or
Gn, respectively, and the non-pipelined configuration is denoted as NP. The area,
power and energy costs are normalized with respect to the leftmost values, which are
shown at the top of the subfigure, respectively. The performance of the candidate
implementations with the timing constraints of 4.00 ns and 8.00 ns is not shown in
Figure 5.5 because they have the same area and negligible difference in power and
energy compared with the implementations at 3.00 ns.
Figure 5.5: Normalized performance versus target timing constraints (ns), grouped
according to the pipeline configuration (legend: Area, Power, Energy; NP: no pipeline;
Gn: gate level placement with n stages; Pn: component level placement with n stages)
According to Figure 5.5, all the pipeline configurations have similar trends in
area, power and energy with the variation of the timing constraint. The trends reflect
the trade-offs between timing and the resource costs. When the timing constraint is
loosened, the area of the pipeline configuration drops until the minimum is reached.
In Figure 5.5, each of the pipeline configurations has the candidate implementation
with the minimal area under a certain timing constraint and for the further loosened
timing constraints, the area remains the same. Another common feature shared by almost
all the pipeline configurations is that the trend in energy does not exactly follow the
area trend. While the area is being reduced to the minimum, there is a noticeable
increase in energy. This indicates that the implementation with minimal area does
not lead to the minimal energy consumption. For each pipeline configuration, the
candidate implementation that has the minimum energy indicated in Figure 5.5 can
be used at the lower clock frequencies for better energy efficiency than those candidate
implementations achieved for the corresponding timing constraints. These candidate
implementations lead to the energy-wise costs of the pipeline configurations that are
examined in Section 5.6.3.
Figure 5.6: Normalized performance versus pipeline configurations under the
synthesis constraint of minimized critical path delay (legend: Delay, Area, Power,
Energy; NP: no pipeline; Gn: gate level placement with n stages; Pn: component level
placement with n stages)
5.6.2 Performance versus Pipeline Configurations for the Minimized
Timing Constraint
When the throughput of the S-box implementation is expected to be as high as
possible, the implementations with the minimized timing constraint from different
pipeline configurations are considered. Figure 5.6 shows the performance of these
candidate implementations. The delay, area, power and energy in Figure 5.6 are
normalized with respect to the corresponding minimum values, respectively, which
are shown at the top of the figure.
As expected, the critical path delay decreases with the increase of the number of
pipeline stages until it reaches the minimum for the pipeline configuration G5. After
G5, the delay remains the same for G6 and G7 and this indicates that the saturation
of timing improvement is reached, as is anticipated in Section 5.4. Among G5, G6
and G7 that have the same minimal delay, G6 is considerably better than the other
two in area, power and energy. Therefore, increasing the number of pipeline stages
beyond 6 does not improve the performance in timing, area, power or energy.
Directly comparing pipeline placement approaches, it can be seen that the gate
level approach leads to shorter delays, while the component level approach leads to
smaller area and less power and energy through comparisons between G2 and P2,
G3 and P3, and G4 and P4. This confirms the conjecture in the analysis of the
two approaches in Section 5.4 that the gate level approach has the advantage of the
well balanced critical path delay, while the component level approach requires fewer
pipeline register bits (due to the datapath width at the component ports) and this
contributes to efficiency in area, power and energy.
5.6.3 Performance versus Pipeline Configurations for Given
Timing Constraints
In the circumstance that a given throughput is expected from the S-box implementation,
the performances of the pipeline configurations with the same timing constraint
are compared, as is shown in Figures 5.7, 5.8 and 5.9. The timing constraint as well
as the corresponding clock frequency are presented at the top of each subfigure of Figures
5.7, 5.8 and 5.9. The area, power and energy costs are normalized with respect to the
leftmost values, respectively, that are shown at the top of the subfigure. Since energy
is the same metric as power under a given clock frequency, they share the same bars
in Figures 5.7, 5.8 and 5.9.
Figure 5.7: Normalized performance versus pipeline configurations under given
timing requirements (from 0.35 ns/2.86 GHz to 1.50 ns/667 MHz)
Figure 5.8: Normalized performance versus pipeline configurations under given
timing requirements (from 1.75 ns/571 MHz to 3.0 ns/333 MHz)
Figure 5.9: Normalized performance versus pipeline configurations under given
timing requirements (4.0 ns/250 MHz and 8.0 ns/125 MHz) (legend: Area,
Power/Energy; Gn: gate level placement with n stages; Pn: component level placement
with n stages; NP: no pipeline; EGn/EPn/ENP: energy-wise Gn/Pn/NP)
According to the discussion in Section 5.6.1, for a given pipeline configuration,
the most energy efficient implementation is generated with a modest but not necessarily
the loosest timing constraint. These candidate implementations are estimated for
power/energy at the clock frequencies lower than the frequency corresponding to
their timing constraints. Their power/energy consumption, if applicable, is reported
at the right in the subfigures according to the actual frequency, along with the areas
of these implementations, and is marked as the "Energy-wise" costs of the pipeline
configurations. To distinguish them from the energy-wise costs, the costs shown in the left
part of the subfigures are called regular costs.
In the next sections, we analyze the variance of the area and power/energy costs
for different pipeline configurations under the various timing constraints based on
Figures 5.7, 5.8 and 5.9.
5.6.3.1 Area versus the Number of Pipeline Stages
The area cost roughly follows the trend that, for a tight timing constraint, the
increase of pipeline stages gradually reduces the cost until the minimum is reached by an
appropriate number of stages. This trend is reflected in the cases from 0.35 ns/2.86
GHz to 1.25 ns/800 MHz. As the timing constraint is loosened to medium and below,
the cost gets larger with the increase of pipeline stages, as is shown in the remaining
cases of Figures 5.7, 5.8 and 5.9, where no pipeline leads to the minimal area cost.
This trend of area cost varying with timing constraints is reasonable because the
increase of pipeline stages could relieve the timing constraint imposed in each stage
and consequently improve the area efficiency for the tight timing constraint. Although
the increase of stages would raise the area cost by introducing more pipeline registers,
there could be overall gain and the appropriate number of pipeline stages maximizes
the gain. When the timing constraint is further loosened, the effect on area reduction
becomes less significant compared with the area increase by more pipeline stages
and therefore more stages would no longer reduce the area cost. Finally, when it is
realizable for the given timing constraint, the non-pipelined implementation results
in the minimal area and the more pipeline stages, the more area is incurred.
5.6.3.2 Power/Energy versus the Numbers of Pipeline Stages
By looking at the regular power/energy values (excluding the energy-wise values)
in Figures 5.7, 5.8 and 5.9, it can be seen that the trend in power/energy efficiency is
very similar to that of area efficiency and the reasoning follows similarly. The only
obvious difference between the trends is that pipelined implementations lead to better
power/energy efficiency than non-pipelined implementations even for
the medium to low throughput cases, as is shown in the cases from 1.50 ns/667 MHz
to 8.0 ns/125 MHz. This can be attributed to the reduction of glitches by using the
pipeline registers. It is supported by our experiment that, when the pipeline registers
of P2 in the cases of 4.0 ns/250 MHz and 8.0 ns/125 MHz are manually removed, the
power and energy increase to that of NP.
Overall, for a given timing constraint, the number of pipeline stages leading to
the best power/energy efficiency tends to be in the middle of the possible numbers.
A similar feature is observed in [41] where the increase of flip-flops gradually reduces
the power of a direction detector until the minimum is reached and then, as more
flip-flops are added, the power increases.
5.6.3.3 Area versus the Placement Approaches
Comparing the two approaches under the same number of pipeline stages where both
are applicable, the overall trend is that the component level approach is the more
efficient approach for the timing constraint ranged from tight to modest, as is reflected
in the cases from 0.60 ns/1.67 GHz to 1.50 ns/667 MHz. An exception occurs in the
case 0.40 ns/2.50 GHz where G4 is better than P4. This is because the target timing
constraint of P4 (0.40 ns) is very close to the minimum delay it can reach, as is
shown in Figure 5.5 (j). When the timing constraint is further loosened, the area cost
of the component level increases above that of the gate level approach, as is seen in
the cases from 1.50 ns/667 MHz to 8.0 ns/125 MHz. This is explainable since the
gate level approach has the flexibility in the placement of registers for further area
reduction while the component level approach has no potential for area reduction
over the tighter timing constraint cases, as is seen in Figure 5.5 (h) to (j).
5.6.3.4 Power/Energy versus the Placement Approaches
The trend of the power/energy versus the placement approaches does not exactly
follow that of the area versus the placement approaches. Similarly, the gate level
approach is more efficient in power/energy for the timing constraints from tight to
medium. For the further loosened timing constraints, while the gate level approach
leads to less area, it consumes more power/energy compared with the component
level approach. This means that the smaller area comes at the cost of compromised
power/energy efficiency, as is also found in Section 5.6.1.
Figure 5.10: Trends of optimal pipeline configurations for the throughput
requirements from high to low
5.6.3.5 Energy-wise Costs
The energy-wise costs in Figures 5.7, 5.8 and 5.9 are generated by the candidate
implementation with the minimal energy consumption (according to Figure 5.5) running
at the clock frequencies lower than its designated frequency. These power/energy
costs are lower than the corresponding costs generated by the candidate implementation
designated for the clock frequency but result in higher area (e.g., EG2 vs. G2).
By comparing between the energy-wise costs, the trends follow those of the regular
costs under the loose timing constraints. With the decrease of pipeline stages, the
minimal area is achieved by ENP while the minimal power/energy remains at EP3.
The component level approach is better than the gate level approach for both area
and power/energy.
5.6.3.6 Trends in One Picture
For a more intuitive comprehension of the trends analyzed above, the values shown
in Figures 5.7, 5.8 and 5.9 are converted to a gray scale map in Figure 5.10, where a
level of gray indicates the cost (the brighter, the lower) of the pipeline configuration
compared with others under the timing constraint. For any row in Figure 5.10, the
brightest and the darkest correspond to the lowest and the highest costs, respectively,
in the corresponding subfigure of Figures 5.7, 5.8 and 5.9, while the other shades in
the row are determined to be between them. The energy-wise costs in Figures 5.7,
5.8 and 5.9 are used for the power/energy columns.
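The row-wise conversion from costs to shades can be sketched as below. The text only states that the intermediate shades lie between the brightest and the darkest, so the linear interpolation used here is an assumption, as are the function name and the sample costs.

```python
def gray_levels(costs):
    """Map the costs of one row to gray levels in [0, 1].

    1.0 is the brightest shade (lowest cost) and 0.0 the darkest (highest cost).
    Intermediate costs are interpolated linearly, which is an assumption.
    """
    lowest, highest = min(costs), max(costs)
    if highest == lowest:
        return [1.0] * len(costs)  # all costs equal: every cell is equally bright
    return [(highest - c) / (highest - lowest) for c in costs]


if __name__ == "__main__":
    # Hypothetical normalized area costs of four configurations under one constraint.
    print(gray_levels([1.0, 0.7, 0.5, 0.9]))
```

Each row is scaled independently, so shades are comparable within one timing constraint but not across different constraints.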
Figure 5.10 shows a highly regular variance of area and power/energy costs under
the pipeline configurations and timing constraints. The arrows indicate the trends
of the most efficient pipeline configurations while the timing constraint varies from
tight to loose.
5.6.4 Benefits of Using Pipelined S-Box Implementations
Through the analysis of the candidate implementation results, the benefits of using
pipelined S-box implementations with an appropriate pipeline configuration are
clearly seen. These benefits can be sorted into three categories: 1) benefits over
non-pipelined implementations, 2) benefits over other pipeline configurations, and 3)
benefits of providing more performance options/trade-offs.
5.6.4.1 Benefits over Non-Pipelined Implementations
These benefits lie in the timing, area, power and energy for high throughput
requirements or lie in power/energy for medium to loose timing constraints. According to
our experiment, there is a maximum reduction of the critical path delay of 61% for the
minimized timing constraint (NP vs. G6 in Figure 5.6). There are maximum reductions
of 51% in area and 69% in power/energy for a given tight timing constraint (NP
vs. P3 in the case 0.80 ns/1.25 GHz of Figure 5.7 (d)). There is a maximum reduction
of 30% in power/energy for a medium to loose timing constraint (NP vs. P2 in the
case 3.0 ns/333 MHz of Figure 5.8 (d)).
It is worth noticing that, even under the very loose timing constraints, there
is a considerable reduction in power/energy of 28% (NP vs. P2 in both cases in
Figure 5.9) or 16% (ENP vs. EP3 in both cases in Figure 5.9) with compromised
area efficiency. This encourages the application of pipelined S-box implementations
in lightweight AES implementations, which usually run at a low clock frequency and
have traditionally rarely used pipelined S-box implementations.
5.6.4.2 Benefits over Other Pipeline Configurations
These benefits always exist for an appropriate pipeline configuration that leads to
the most efficient implementation. The appropriate pipeline configuration varies with
the timing constraints, as is displayed by Figure 5.10.
5.6.4.3 Benefits of Providing More Performance Options/Trade-Offs
The performance options/trade-offs can be significantly increased by using pipelined
S-box implementations compared with using only non-pipelined implementations. In
this way, besides the most efficient pipeline configuration, other pipeline configurations
could also lead to an implementation with desirable performance. For example,
in the case 1.50 ns/667 MHz of Figure 5.7 (g), NP and EP3 are the most efficient
pipeline configurations in terms of area and power/energy, respectively. As a trade-off
between them, P2 costs less area than EP3 and less power/energy than NP. Such a
combination of area and power/energy of P2 may be preferable for the design with a
combined requirement compared to an individual requirement for best area from NP
or power/energy from EP3.
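The idea of a useful trade-off can be made concrete with a small Pareto filter: a configuration that is optimal in neither metric is still a valid option as long as no other configuration is at least as good in both area and power/energy. The sketch below is illustrative only. The function names are hypothetical and the cost values are invented for the example; they are not measurements from this work.

```python
def dominates(a, b):
    """True if option a is no worse than b in both costs and strictly better in one."""
    return (a["area"] <= b["area"] and a["power"] <= b["power"]
            and (a["area"] < b["area"] or a["power"] < b["power"]))


def trade_off_options(options):
    """Keep the options that no other option dominates (the Pareto front)."""
    return [o for o in options if not any(dominates(p, o) for p in options)]


# Invented normalized costs, for illustration only.
OPTIONS = [
    {"name": "NP", "area": 1.00, "power": 1.00},
    {"name": "P2", "area": 1.10, "power": 0.80},
    {"name": "EP3", "area": 1.30, "power": 0.70},
    {"name": "G7", "area": 1.60, "power": 0.95},
]

if __name__ == "__main__":
    print([o["name"] for o in trade_off_options(OPTIONS)])
```

With these invented numbers the area-optimal, the power-optimal and one intermediate configuration all survive the filter, while a configuration that is worse than another in both metrics is discarded.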
5.7 Generality of the Methodology and Results 
In this section, we discuss the generality of the methodology and the experimental
results presented in this work.
In our methodology, the influence of the pipeline configurations on the performance
of S-box implementations is drawn based on the analysis of our experimental
results. We believe this is the most straightforward method to have a realistic evaluation
of the influence since the experimental implementations are achieved following
the same design flow with which realistic implementations are built. All the experimental
implementations are benchmarked upon the same technology library, leading
to a fair and objective comparison between the pipeline configurations. Although the
absolute quantities of the experimental results are technology-specific, the relative
comparisons should hold closely if another technology is applied and our analysis and
conclusions are mostly based on the relative instead of the absolute values. As well,
we also try to mitigate the impact of the technology-specific performance by using
gate equivalence as the metric for area.
Another issue related to generality is the gate level placement approach. This
approach is synthesis tool dependent since it relies on the retiming function of the
tool. However, this dependency should not have a major impact on the general
conclusions considering that whatever synthesis tool is used, the retiming function
should work towards a similar goal as the one we use and hence the results should
not significantly differ.
5.8 Summary
This chapter presents a comprehensive study of pipeline configurations generally
applicable to the AES S-box with a composite field structure. In particular, we
investigate an extensive range of pipeline configurations for their influence on the
performance of the S-box implementations in terms of timing, area, power and energy.
The pipeline configurations consist of no pipeline, a component level pipeline with
the number of pipeline stages from 2 to 4 and a gate level pipeline with the number
of pipeline stages from 2 to 7. The gate level pipeline utilizes the retiming function
of the synthesis tool for the feasible and effective register placement that is compliant
with the standard ASIC design flow. A total of 127 S-box implementations with varying
pipeline configurations are built with a 90-nm standard cell CMOS technology under
a variety of timing constraints.
Based on the performance of these implementations, the influence of the pipeline
configurations is discussed with trends that indicate how the appropriate pipeline
configurations for certain perspectives of the performance vary with the timing
constraints. The trends are found to be highly regular and explainable. They can
be used as the general reference for choosing the appropriate pipeline configurations
under a given design requirement in timing. By using the appropriate pipeline
configurations, notable performance improvement can be achieved compared with the
non-pipelined case. This indicates the merits of applying pipelined S-box implementations
for resource-efficient purposes, including lightweight applications. In addition
to the appropriate pipeline configurations that lead to the most efficient implementations,
it is also shown that other pipeline configurations are able to provide desirable
performance trade-offs in S-box implementations.
This work is the first that extensively investigates the pipeline configurations for
the AES S-box with a composite field structure and the resulting performance trade-offs.
It is also the first work that applies pipelining to AES S-box implementations
for resource efficient purposes, as is contrary to the conventional pipelining purpose
of speedup. It discloses the benefits of pipelined S-box implementations in terms of
resource efficiency. It also introduces the retiming function as a practical and effective
register placement approach for the gate level pipelining of S-boxes.
In the next chapter, we look at another perspective of the architecture of AES
implementations, the datapath architecture, and investigate the performance improvement
and trade-offs provided by different datapath architectures.
Chapter 6 
Exploration of Datapath 
Architectures for Flexible and 
Efficient Implementation 
In this chapter, we present the investigation of the performance of a variety of AES
datapath architectures based on the S-boxes with a composite field structure. These
architectures are parameterized by a datapath width of 8, 16, 32, 64 or 128 bits
and, for the 128-bit width, an unrolling factor of 1, 2, 5 or 10. Through this
characterization, the performance trade-offs affected by the architecture parameters are
extensively explored. The parameters leading to the best performance are identified.
It is found that the 8-bit width datapath, which is conventionally adopted for resource
efficient purposes, has the worst energy efficiency and does not result in the minimal
peak power among the architectures. As well, the 16, 32 and 64-bit width AES
datapath architectures are newly considered or represent improvements over previous
work.
6.1 Introduction
The flexibility of the AES algorithm allows parameterizable datapath architectures for
hardware implementations. There are two parameters that can specify the datapath
architecture of an AES implementation, the datapath width and the unrolling factor.
Generally, the possible parameters include the datapath widths of 8, 16, 32, 64 and
128 bits and the unrolling factors of 1, 2, 5 and 10 for the 128-bit width. The fully
unrolled pipelined architecture and the fully serialized architecture can be
parameterized as an unrolling factor of 10 and a datapath width of 8 bits, respectively.
It is obvious that these two architecture parameters can lead to the implementations
with the maximum throughput and the minimal area, respectively. However, there are
design requirements other than the maximum throughput or the minimal area (e.g.,
low power/energy or trade-offs between area and throughput). It is unclear which one
among the possible architectures can provide superior performance for such design
requirements. This motivates this work to characterize the performance of
implementations specified by the architecture parameters. We consider the performance in
terms of the area, peak power consumption and average energy consumption under
a given throughput requirement, and a variety of throughput requirements are
considered.
Since there is no standard architecture for given datapath parameters, we
consider generic parameterized AES datapath architectures for the characterization and
implement them based on the same standard cell CMOS technology so that the
characterization results are generalizable. These architectures are designated to perform
the AES encryption with 128-bit keys (referred to as AES-E128 in the following).
S-boxes with a typical composite field structure are adopted in these architectures.
The storage elements in these architectures are all based on registers or shift registers
that are composed of only standard cells in the CMOS technology.
A similar work based on FPGA technology is seen in [43]. However, it limits its
investigation to the architectures with an unrolling factor of 1, 2, 5 and 10 and only
the trade-offs between area and delay are explored. As well, for each architecture, only
the implementation synthesized for the maximum throughput is considered. Since
different timing constraints for synthesis could lead to quite different implementations
with different performance properties, they cannot all be realized by synthesizing
under the tightest timing constraint for the maximum throughput. The architectures
in this work are synthesized with a variety of timing constraints that fit the
throughput requirements of a wide range of AES applications, so that a panoramic view of
the performance of the implementations affected by the architecture parameters is
presented.
Another contribution of the work is the characterization of the novel shift
register based 16, 32 and 64-bit width datapath architectures. This provides more
performance trade-offs between the 8-bit and the 128-bit width architectures. These
architectures are designed for efficiency and generality. The number of clock cycles
required to complete an AES round is the minimal for the width, which is 8, 4 and
2, respectively. No specific memory macro is required since all the components in the
architectures are composed of standard cells.
6.2 The Datapath Architectures of AES 
In this chapter, we consider the datapath architectures that include the datapath
widths of 8, 16, 32, 64 and 128 bits and, for the 128-bit width, the unrolling factors
of 1, 2, 5 and 10. Correspondingly, we build the generic parameterizable datapath
architectures supporting these parameters for AES-E128, which performs the
encryption-only operation and has the key size of 128 bits and accordingly 10 rounds.
6.2.1 Common Issues
The common issues related to the datapath architectures characterized in this chapter
are described in the following.
6.2.1.1 S-box Structure
All the datapath architectures are based on the S-boxes with a composite field
structure from [5]. Although there are other S-box structures, such as minimized
combinational logic functions from the truth table [18] and decoder-permutation-encoder
structure [1.1), S-box implementations with a composite field structure provide
for balanced performance over timing, area and power [15] and are frequently
adopted for AES implementations with various architectures, such as in [I!.I] and [20]
with a fully unrolled architecture and in ]:.10] and [n] with an 8-bit width datapath
architecture. Therefore, we also adopt a composite field structure for the S-box
implementations in the architectures. There are a number of composite field
structures available, including [5], [6], [7], [8], [9], [10] and [13]. The performance of these
composite field structures is similar. We pick the one from [5] since it is a typical
composite field structure and the generic architectures characterized in this chapter
are expected to show the typical performance that can be provided by the
parameterized architectures.
6.2.1.2 Key Expansion
Since we only consider the encryption datapath of AES without the key expansion,
for all the architectures, it is assumed that the round keys are fed as inputs to the
architectures whenever they are required.
6.2.1.3 Impact of Modes of Operation
The datapath architectures also differ when considering the impact of block cipher
modes of operation. We consider two groups of modes of operation, non-feedback
modes (e.g., electronic codebook (ECB) mode and counter (CTR) mode)
and feedback modes (e.g., cipher-block chaining (CBC) mode, cipher feedback (CFB)
mode and output feedback (OFB) mode). While all of the architectures can work
under any mode of operation, the complete datapath architectures with the unrolling
factors of 2, 5 and 10 would not be fully utilized when working under feedback modes
since only the encryption of one plaintext block can be processed at one time due
to the dependency between the current encryption and previous encryption. For this
reason, we exclude the situation of the architectures working under feedback modes
when comparing performance.
Figure 6.1: Generic model of the partial datapath architectures with width
w ∈ {8, 16, 32, 64}
6.2.2 Partial Datapath Architectures
A generic model of the partial datapath architectures is shown in Figure 6.1 where
w is equal to the datapath width of a specific architecture and ⊕ denotes the bitwise
XOR operation for AddRoundKey.
Basically, the architectures have a w-bit datapath and on the path, AddRoundKey,
SubBytes, ShiftRows, MixColumns operations are performed in the sequence. A
128-bit plaintext block is loaded in w-bit pieces in serial, and after the required number
of iterations over the architecture, the ciphertext block is loaded out in the same
way. The round keys are also loaded in w-bit pieces. The last round key of each
encryption is loaded through the specific input Final_Key_In, so that the next plaintext
can be loaded in while the current ciphertext is being loaded out. This allows the
maximum utilization of the architectures without any idle hardware during loading
plaintexts/ciphertexts.
Figure 6.2: Structure of the ShiftRows component for the partial datapath
architectures with the width of 8 bits
On average, the datapath architecture with w-bit width can
complete the encryption of a plaintext block with (128/w) × 10 clock cycles (i.e., 160,
80, 40, 20 cycles required for the architectures with 8-bit, 16-bit, 32-bit and 64-bit
width, respectively).
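The cycle count above follows directly from the datapath width. As a minimal sketch (the function and constant names are illustrative assumptions, not part of the original design), the relation and the resulting throughput at a given clock period can be written as:

```python
# Illustrative helper names; only the formulas come from the text.
AES_BLOCK_BITS = 128
AES128_ROUNDS = 10

def cycles_per_block(width_bits):
    """Average clock cycles to encrypt one block on a w-bit partial datapath."""
    if width_bits not in (8, 16, 32, 64):
        raise ValueError("partial datapath width must be 8, 16, 32 or 64 bits")
    return (AES_BLOCK_BITS // width_bits) * AES128_ROUNDS

def throughput_bps(width_bits, clock_period_ns):
    """Throughput in bits per second, assuming full utilization of the datapath."""
    block_time_s = cycles_per_block(width_bits) * clock_period_ns * 1e-9
    return AES_BLOCK_BITS / block_time_s

if __name__ == "__main__":
    for w in (8, 16, 32, 64):
        print(w, cycles_per_block(w), throughput_bps(w, 1.5))
```

With a 1.5 ns clock period, the 8-bit datapath gives roughly 533 Mbps under this model.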
In addition to the datapath width, the partial datapath architectures differ in
the number of S-boxes, the ShiftRows component and the MixColumns component.
The details are presented as follows.
6.2.2.1 S-boxes
Each architecture has w/8 S-box(es) in parallel on the datapath. The S-boxes perform
the SubBytes operation and have a composite field structure from [5].
6.2.2.2 ShiftRows Components
Each of the partial datapath architectures has a specific shift register based
component for the ShiftRows operation. The structures of the components are shown in
Figures 6.2, 6.3, 6.4 and 6.5.
Figure 6.3: Structure of the ShiftRows component for the partial datapath
architectures with the width of 16 bits.
Figure 6.4: Structure of the ShiftRows component for the partial datapath
architectures with the width of 32 bits
Figure 6.5: Structure of the ShiftRows component for the partial datapath
architectures with the width of 64 bits.
Figure 6.6: Structure of the MixColumns component for the partial datapath
architectures with the width of 8 bits.
The components are composed of registers and multiplexers.
Each path in Figures 6.2, 6.3, 6.4 and 6.5 has the width of 8 bits. The
ShiftRows component for the 8-bit datapath architecture in Figure 6.2 was proposed
in [.14]. These components work as shift registers with the multiplexers determining
the flow of the data. The components can work without idle cycles by feeding inputs
and producing outputs continuously in each clock cycle. For the architectures with
the widths of 8, 16, 32 and 64 bits, it takes 16, 8, 4 and 2 cycles, respectively, to
complete the ShiftRows operation of a State. The details of the operation of the
components are described in Appendix A.
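The shift register components realize the standard ShiftRows byte permutation spread over several clock cycles. As a functional reference only (it does not model the registers or multiplexers of Figures 6.2 to 6.5, and the function name is an illustrative assumption), ShiftRows on a State stored column by column can be sketched as:

```python
def shift_rows(state):
    """Apply AES ShiftRows to a 16-byte State stored column by column.

    Byte index = row + 4 * column, as in FIPS-197. Row r is rotated
    left by r byte positions.
    """
    if len(state) != 16:
        raise ValueError("State must hold 16 bytes")
    out = [0] * 16
    for col in range(4):
        for row in range(4):
            # Output byte (row, col) comes from input byte (row, col + row mod 4).
            out[row + 4 * col] = state[row + 4 * ((col + row) % 4)]
    return out
```

Any of the w-bit components is expected to produce this same permutation once all 16 bytes of a State have passed through it.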
6.2.2.3 MixColumns Components
MixColumns is defined as an operation on 32-bit data. For the datapath architectures
with the width of 32 or 64 bits, one or two complete MixColumns operations are
implemented on the datapath. For those with the width of 8 or 16 bits, MixColumns
can be partially implemented as 1/4 or 1/2 operation, respectively, and the
implementation is reused to complete one MixColumns operation. The structures of the
components are shown in Figures 6.6, 6.7 and 6.8, respectively.
Figure 6.7: Structure of the MixColumns component for the partial datapath
architectures with the width of 16 bits
Figure 6.8: Structure of the MixColumns component for the partial datapath
architectures with the width of 32 bits.

Table 6.1: Comparison of 32-bit AES datapath architectures

                            Ours   [26]   [27]   [28]   [29]
Storage Element (in bits)   128    128    256    256    224
Clock Cycles                40     64     40     80     40
The components are composed of registers, multiplexers, xtime (XT)
components, XOR gates and AND gates (if applicable). The xtime operation is equivalent
to the multiplication of the input byte with the hexadecimal value 02 in GF(2^8)
modulo m(x) = x^8 + x^4 + x^3 + x + 1. It can be implemented with 3 XOR gates, as is shown
in [.15]. The AND gates are used to bypass the attached XOR gates. Each path in
Figures 6.6, 6.7 and 6.8 has the width of 8 bits. The MixColumns component for the
8-bit datapath architecture in Figure 6.6 was proposed in [:11[. The MixColumns
component for the 64-bit datapath architecture consists of two copies of that for
the 32-bit datapath architecture (Figure 6.8). The 8-bit, 16-bit, 32-bit and 64-bit
MixColumns components perform the complete MixColumns operation on a State
in 16, 8, 4 and 2 clock cycles, respectively, by feeding inputs and producing outputs
continuously in each clock cycle. The details of the operation of the components are
described in Appendix B.
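To make the xtime operation concrete, the following is a minimal software sketch of xtime and of MixColumns on a single 32-bit column built from it. It is a functional reference under the standard AES definition, not a model of the hardware components in Figures 6.6 to 6.8, and the function names are illustrative. The reduction step flips bits 4, 3 and 1 of the shifted byte, which corresponds to the 3 XOR gates mentioned above, while bit 0 simply takes the value of the old most significant bit.

```python
def xtime(b):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:      # an x^8 term appeared: reduce by m(x) = 0x11B
        b ^= 0x11B
    return b & 0xFF

def mix_column(col):
    """MixColumns on one 4-byte column using only XOR and xtime."""
    a0, a1, a2, a3 = col
    t = a0 ^ a1 ^ a2 ^ a3
    # Each output byte is 02*a_i + 03*a_(i+1) + a_(i+2) + a_(i+3),
    # rewritten as a_i ^ t ^ xtime(a_i ^ a_(i+1)).
    return [
        a0 ^ t ^ xtime(a0 ^ a1),
        a1 ^ t ^ xtime(a1 ^ a2),
        a2 ^ t ^ xtime(a2 ^ a3),
        a3 ^ t ^ xtime(a3 ^ a0),
    ]
```

The rewritten form needs one xtime per output byte, which is one reason xtime-based structures are attractive for compact MixColumns hardware.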
6.2.2.4 Novel 16-bit, 32-bit and 64-bit Datapath Architectures
The partial datapath architectures with the width of 16, 32 and 64 bits are novel
architectures. Architectures with 16-bit or 64-bit datapath widths have not been
discussed in previous literature. The 64-bit architecture requires the minimal amount
of storage equivalent to a 128-bit register and the minimal number of clock cycles to
complete an encryption (20 cycles) for a 64-bit AES datapath architecture. For the
16-bit architecture, although it is possible to be built with the minimal amount of
storage equivalent to a 128-bit register, we found that the overall area can be smaller
with 16 more bits of storage (totally equivalent to a 144-bit register) while still using
the minimal number of 80 clock cycles to complete the encryption of a plaintext
block for a 16-bit AES datapath architecture.
Architectures with a 32-bit datapath width have been investigated in previous
work, including [26], [27], [28] and [29]. Since the results of previous work are either
based on older ASIC technology than used in this work ([26]) or based on FPGA
technology ([27], [28] and [29]), we make a rough comparison between these
architectures based on the size of storage and the number of clock cycles to complete the
encryption of a plaintext block, as is shown in Table 6.1.
The storage in an architecture is necessary to hold the updated States in an AES
datapath and usually takes up a significant amount of the total hardware overhead of
the architecture. The minimal size of the storage for an AES datapath architecture is
128 bits. The number of clock cycles per encryption of a block and the critical path
delay determine the throughput of the architecture implementation. Considering
that the architectures under comparison all have the critical path delay determined
by the operation of the round function, the critical paths of the architectures would
be similar if they are based on the same technology and, hence, the number of clock
cycles becomes the dominant factor determining the throughput.
According to Table 6.1, our 32-bit AES datapath architecture is superior in at
least one of the two metrics compared with the previous works.
Figure 6.9: Generic model of the complete datapath architectures with the unrolling
factor r ∈ {1, 2, 5, 10}.
Among the architectures under comparison in Table 6.1, the storage of [26] and [29]
is based on registers while the storage of [27] and [28] is based on memory. The
ShiftRows operation in the register-based architectures [26] and [29] is not realized
as efficiently as in our design so that either more clock cycles or more registers are
required. For the memory-based architectures [27] and [28], the ShiftRows operation
is realized by the appropriate addressing while transferring the State bytes between
two 16-byte memories, and hence the storage of 256 bits is required. Doubling the
number of clock cycles is required for the architecture [28] compared with that of [27]
since [28] has separate loops for the SubBytes operation and the MixColumns
operation.
Figure 6.10: Structure for unrolled architectures of the round function
Figure 6.11: Structure for unrolled architectures of the last round function for
r ∈ {1, 2, 5}.
Figure 6.12: Structure for unrolled architectures of the last round function for
r = 10
6.2.3 Complete Datapath Architectures
A generic model of the complete datapath architectures is shown in Figure 6.9 where r
is equal to the unrolling factor. The complete datapath architecture with the unrolling
factor r, r ∈ {1, 2, 5, 10}, contains r round functions in series, a 128-bit multiplexer
(for r ∈ {1, 2, 5}) and a 128-bit bitwise XOR operation, denoted by ⊕. To allow for
simultaneous processing of r encryptions in the datapath, the round level pipeline is
employed, which means there are pipeline registers placed between the hardware of
two consecutive rounds.
The round functions are identical except for the last round function. The
structure of the round function is shown in Figure 6.10 and the structures of the last
round function for r ∈ {1, 2, 5} and for r = 10 are shown in Figures 6.11 and 6.12,
respectively. The round function in Figure 6.11 is able to work as either the ordinary
round function, as in Figure 6.10, or the round function without the MixColumns
operation for the last round of AES. The last round function in Figure 6.12 only
performs the round function without the MixColumns operation.
The ShiftRows components in Figures 6.10, 6.11 and 6.12 are implemented
simply as crossover wiring according to the definition of the ShiftRows operation.
The MixColumns components in Figures 6.10, 6.11 and 6.12 are implemented with
four copies of the 32-bit MixColumns component in Figure 6.8 concatenated in
parallel.
For r ∈ {1, 2, 5}, the architecture has an iterative loop within which r plaintext
encryptions are being processed. The data iterates for the required number of times
before the ciphertext is generated. For r = 10, the system processes 10 encryptions
simultaneously and the plaintext goes through the datapath once to generate the
ciphertext. For the maximum utilization of the architecture, r plaintext blocks can be
loaded continuously and processed simultaneously in a pipeline by the r round
functions. This is allowed for the cipher using a non-feedback cipher mode, such as
counter mode. For maximum utilization, the loading of the following plaintext blocks
is paralleled with the unloading of current ciphertext blocks. With the maximum
utilization, the architectures with r ∈ {1, 2, 5, 10} take 10 clock cycles to complete the
encryption of a plaintext block while r blocks are processed concurrently. Hence, the
throughput is determined as r blocks every 10 clock cycles.
6.3 Methodology
In order to characterize the performance of the datapath architectures, all the
architectures presented above are implemented with a 90-nm standard cell CMOS
technology from STMicroelectronics. The performance of the architectures is estimated
based on the synthesis results of the implementations. The performance we consider
for an implementation includes area, peak power consumption and average energy
consumption. The methods to derive the architecture implementations and to
estimate the performance are described in the following.
6.3.1 Deriving the Architecture Implementations
In practice, AES implementations are built for applications with various throughput
requirements, from high throughput for high speed applications to low throughput
for lightweight applications. In order to present the results adapted to a wide range
of applications, each datapath architecture is synthesized with a variety of timing
constraints to generate the architecture implementations meeting a number of selected
throughput requirements that range from high to low.
The timing constraints for a datapath architecture are determined in the way
that the implementations are clocked to produce the selected throughputs. The list
of the selected throughputs and the corresponding assignments of the timing
constraints for each of the architectures are shown in Table 6.2. In Table 6.2, the partial
datapath architectures with the width of w bits are denoted as Ww and the complete
datapath architectures with the unrolling factor of r are denoted as Ur. The
minimal realistic critical path delay of the architectures is very close to 1.50 ns under
the given technology library. Hence, the timing constraint of 1.50 ns is regarded as
the tightest realistic constraint. In Table 6.2, each timing constraint indicates there is
an implementation built with the architecture of the row and meeting the throughput
of the column clocked at the frequency corresponding to the timing constraint (e.g.,
the timing constraint of 3 ns corresponds to the clock frequency 333 MHz) and N/A
indicates that the corresponding throughput is not achievable for the datapath
architecture even under the tightest timing constraint. The architecture implementations
are assumed to work with the maximum utilization, i.e., under a non-feedback cipher
mode with continuously available plaintext and round keys.

Table 6.2: Assignments of the timing constraints (in ns) for the architectures
according to the given throughputs

      85.3   42.7   17.1   8.53   4.27   2.13   1.07   533    80.0   8.00   800
      Gbps   Gbps   Gbps   Gbps   Gbps   Gbps   Gbps   Mbps   Mbps   Mbps   kbps
W08   N/A    N/A    N/A    N/A    N/A    N/A    N/A    1.50   10.0   100.0  1k
W16   N/A    N/A    N/A    N/A    N/A    N/A    1.50   3.00   20.0   200.0  2k
W32   N/A    N/A    N/A    N/A    N/A    1.50   3.00   6.00   40.0   400.0  4k
W64   N/A    N/A    N/A    N/A    1.50   3.00   6.00   12.0   80.0   800.0  8k
U01   N/A    N/A    N/A    1.50   3.00   6.00   12.0   24.0   160.0  1.6k   16k
U02   N/A    N/A    1.50   3.00   6.00   12.0   24.0   48.0   320.0  3.2k   32k
U05   N/A    1.50   3.75   7.50   15.0   30.0   60.0   120.0  800.0  8k     80k
U10   1.50   3.00   7.50   15.0   30.0   60.0   120.0  240.0  1.6k   16k    160k
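The entries of Table 6.2 follow from the cycle counts of the architectures: under maximum utilization a Ww datapath completes one block every (128/w) × 10 cycles and a Ur datapath completes r blocks every 10 cycles. The sketch below recomputes a timing constraint from a target throughput; the function names and the 1% tolerance (used because the listed throughputs are rounded) are assumptions made for illustration.

```python
BLOCK_BITS = 128
ROUNDS = 10
MIN_PERIOD_NS = 1.50  # tightest realistic constraint stated in the text

def blocks_per_cycle(arch):
    """Average blocks completed per clock cycle under full utilization.

    'Wxx' is a partial datapath of width xx bits; 'Uxx' is a complete
    128-bit datapath with unrolling factor xx.
    """
    kind, value = arch[0], int(arch[1:])
    if kind == "W":
        return 1.0 / ((BLOCK_BITS // value) * ROUNDS)
    if kind == "U":
        return value / ROUNDS
    raise ValueError("unknown architecture label: " + arch)

def timing_constraint_ns(arch, throughput_bps):
    """Clock period in ns for the target throughput, or None if not achievable."""
    period_ns = BLOCK_BITS * blocks_per_cycle(arch) / throughput_bps * 1e9
    # Tolerate rounding of the listed throughputs (e.g. 17.1 Gbps for 17.07 Gbps).
    return period_ns if period_ns >= MIN_PERIOD_NS * 0.99 else None

if __name__ == "__main__":
    print(timing_constraint_ns("U05", 17.1e9))   # close to 3.75 ns
    print(timing_constraint_ns("W08", 1.07e9))   # None, i.e. N/A in the table
```

Because every architecture is clocked to the same throughput in a given column, the comparison in the following sections is made at equal throughput rather than at equal clock frequency.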
The synthesis and virtual layout process is performed in the topographical mode
of Synopsys Design Compiler, as is described in Chapter 3. The synthesis library is
the 90-nm CMOS standard cell library from STMicroelectronics with a core voltage
of 1.2 V and standard threshold voltage.
6.3.2 Evaluation of the Performance
After the architecture implementations are built, their performance in terms of area,
peak power and average energy is estimated following the performance evaluation
methodology described in Chapter 3. The average power and peak power
consumptions are estimated for the implementation running at the clock frequency that
produces the corresponding selected throughput. The average energy consumption of an
architecture implementation is calculated as the product of its average power
consumption, the clock period and the clock cycles per encryption of a plaintext block.
Thus, the energy consumption is normalized to be the average energy consumed to
encrypt one 128-bit plaintext.
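The energy normalization above is a simple product. As a minimal sketch (the function name, the units and the example numbers are illustrative assumptions, not measured values), it can be expressed as follows, where the cycle count is taken as the average number of clock cycles per block:

```python
def energy_per_block_pj(avg_power_uw, clock_period_ns, cycles_per_block):
    """Average energy in pJ to encrypt one 128-bit block.

    E = P_avg * T_clk * N_cycles. With power in microwatts and the period
    in nanoseconds the product is in femtojoules, hence the division by
    1000 to obtain picojoules.
    """
    return avg_power_uw * clock_period_ns * cycles_per_block / 1000.0

if __name__ == "__main__":
    # Hypothetical example: 100 uW average power, 10 ns clock, 160 cycles.
    print(energy_per_block_pj(100.0, 10.0, 160))
```

For a fixed average power, slowing the clock increases the energy per block proportionally, which is why the average power must be re-estimated at each selected throughput.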
The average power and peak power consumption of the architecture
implementations are estimated using PrimeTime PX from Synopsys. The switching activity
is obtained from the gate-level simulation of the netlist with 10,000 random 128-bit
plaintexts and the corresponding random round keys and control signals. The same
set of random plaintexts and round keys is used for the power estimation of all the
architecture implementations. PrimeTime PX can also break down the average power
of an implementation into the average dynamic power and the average static power or
the average powers of the combinational logic and the sequential logic of the circuit.
Accordingly, the average energy caused by the dynamic power and the static power
or the combinational logic and the sequential logic can be calculated.
6.4 Experimental Results and Analysis 
The performance of the implementations of the various architectures is presented
and analyzed in this section.
6.4.1 Area 
Table 6.3 presents the normalized areas of the architecture implementations. As is
expected, the area grows with the datapath width and the unrolling factor. For a
given architecture, the area drops with the loosening of the timing constraint until
the minimal area is reached and the resulting implementations with looser timing
constraints with the same area are actually the same implementation. According to
Table 6.3, the most area-efficient implementation is achieved by the architecture with
the width of 8 bits (W08), around 1/47 of the size of the datapath with the unrolling
factor of 10, under the throughputs achievable by the architecture W08 (from 533
Mbps to 800 kbps).
Table 6.4: Ratios of the area to the maximum throughput of the architectures
(normalized to the value of U10)
Architecture W08 W16 W32 W64 U01 U02 U05 U10
Area/Throughput 2.96 1.39 1.26 1.22 1.11 1.23
We also determine the area to throughput ratio of the implementation yielding
the maximum throughput for each of the architectures and the results, normalized to
the U10 result, are shown in Table 6.4. For example, the implementation of W32 with
the throughput 2.13 Gbps and the implementation of U02 with the throughput 17.1
Gbps are used to derive the corresponding values in the table. In this comparison,
U10 is the most efficient architecture in terms of the area cost yielding one unit of
throughput. Roughly, the efficiency becomes worse with the decrease of the unrolling
factor or the datapath width. Architecture W08, which leads to the most compact
implementations, results in the least efficient architecture and is about 3 times worse
than U10 in its area to throughput ratio.
6.4.2 Peak Power Consumption
For the AES implementations targeted at passively powered devices (e.g., contactless
smart cards and RFID tags), there is a rigorous constraint on the peak power
consumption since these devices usually have a very tight budget on power consumption
that is shared by all the components on the device. For this reason, the peak power
consumption of the architecture implementations is investigated in this section.
Table 6.5: Peak powers of the architecture implementations (normalized to 66.8 µW)

      85.3   42.7   17.1   8.53   4.27   2.13   1.07   533    80.0   8.00   800
      Gbps   Gbps   Gbps   Gbps   Gbps   Gbps   Gbps   Mbps   Mbps   Mbps   kbps
W08   N/A    N/A    N/A    N/A    N/A    N/A    N/A    1.72   1.46   1.46   1.46
W16   N/A    N/A    N/A    N/A    N/A    N/A    1.47   1.34   1.32   1.32   1.32
W32   N/A    N/A    N/A    N/A    N/A    1.24   1.12   1.08   1.08   1.08   1.08
W64   N/A    N/A    N/A    N/A    1.58   1.07
U01   N/A    N/A    N/A    3.59   1.94   1.22   1.23   1.23   1.23   1.23   1.23
U02   N/A    N/A    5.8    2.66   2.26   2.26   2.26   2.26   2.26   2.26   2.26
U05   N/A    17.84  9.35   5.85   5.85   5.85   5.85   5.85   5.85   5.85   5.85
U10   19.57  13.16  11.39  11.39  11.39  11.39  11.39  11.39  11.39  11.39  11.39
Table 6.5 presents the normalized peak power of all the architecture implementations.
For a given architecture implementation, the peak power would not vary with the
clock frequency, as is shown in Table 6.5. Under the same given throughput, the
architecture resulting in the minimal peak power is W64. It is obvious that the peak
power of an implementation is related to its area. This is the reason that the increase
of the unrolling factor beyond U01 increases the peak power. However, for the partial
datapath architectures, the decrease of the datapath width leads to the increase of the
peak power according to Table 6.5. This reflects the fact that there are more intense
instantaneous switching activities incurred by the partial datapath architecture with
a smaller datapath width. Especially for W08 and W16, they have more registers
than other partial datapath architectures and a register consumes more power than
other CMOS cells and is updated every clock cycle. Between W64 and U01, W64
also leads to less instantaneous switching activities and achieves the overall lower
peak power.
6.4.3 Average Energy Consumption
Energy consumption is usually a crucial constraint for battery-powered devices since
the capacity of the battery is limited. It is preferred to perform as many tasks as
possible under a given capacity of the battery. The average energy required to encrypt
a 128-bit plaintext block for each architecture implementation is shown in Table 6.6.
It can be seen in Table 6.6 that, for any architecture, loosening the timing
constraint would lead to the implementation with less energy consumption when working
for the designated throughput until the minimal is reached. Past this point, more
energy is consumed when the implementations work for lower throughputs. For these
cases, the increase of energy consumption is incurred by the static power of the circuit
and is independent of the switching activity but related to the duration the circuit is
powered on.
The average energy incurred by the static power of the architecture
implementations is shown in Table 6.7. It can be seen in Table 6.7 that the energy consumption
due to static power rises dramatically with the drop of the throughput (i.e., the
slowing down of clock frequency). This indicates that the highest energy efficiency of
the implementation can only be reached with the appropriate clock frequency under
which the total energy consumption (the energy by both the dynamic power and the
static power) is minimal, such as 80.0 Mbps for W08, 1.07 Gbps for U01 and 17.1
Gbps for U10. Conventionally, AES implementations targeted at lightweight
applications are usually made to work at an extremely low clock frequency. According to
the above analysis, this, in fact, does not necessarily save energy but may consume
more due to static power.
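The effect of static power on the energy per block can be illustrated with a simple two-term model. This is only a sketch under stated assumptions: the dynamic energy per block is treated as independent of the clock frequency for a given implementation, and the numeric values (50 pJ, 2 µW) are hypothetical and are not taken from Tables 6.6 or 6.7.

```python
def total_energy_pj(e_dynamic_pj, p_static_uw, clock_period_ns, cycles_per_block):
    """Dynamic energy per block plus static energy accumulated over the block time."""
    # uW * ns gives fJ; divide by 1000 to express the static part in pJ.
    e_static_pj = p_static_uw * clock_period_ns * cycles_per_block / 1000.0
    return e_dynamic_pj + e_static_pj

# Hypothetical 160-cycle datapath with 50 pJ dynamic energy per block
# and 2 uW of static power.
fast = total_energy_pj(50.0, 2.0, 10.0, 160)     # 10 ns clock period
slow = total_energy_pj(50.0, 2.0, 1000.0, 160)   # 1000 ns clock period

if __name__ == "__main__":
    print(fast, slow)
```

In this model the static term grows linearly with the clock period, so beyond some point a slower clock raises the total energy per block, which matches the trend described above.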
Comparing the implementations with the lowest average energy consumption
from each of the architectures, it can be seen that the U10 implementation with
the throughput of 17.1 Gbps or 8.53 Gbps is the lowest for U10 while the W08
implementation with the throughput of 80.0 Mbps is the lowest for W08, which is
about 2 times that of U10. By comparing the average energy consumption of the
implementations for a given throughput in Table 6.6, U10 remains as the most energy
efficient architecture for a wide range of the throughputs, from 85.3 Gbps to 533
Mbps. For lower throughputs, since the energy incurred by the static power gradually
becomes dominant in the total energy, the architecture implementations with smaller
area become more energy efficient. For example, the W08 architecture is substantially
more energy efficient than U10 at 800 kbps.
For a more in-depth analysis of the energy consumption of the architecture
implementations, we also break down the energy incurred by the dynamic power into
the energy used by the combinational logic and the energy used by the registers in
the implementations. We have found that there is a significant difference in the energy
caused by the dynamic power of the registers, as is shown in Table 6.8. It can be
seen in Table 6.8 that the energy consumption of the registers of the partial datapath
architectures is approximately proportional to the clock cycles required to complete
the encryption of a 128-bit plaintext block, which are 160, 80, 40 and 20 cycles for
W08, W16, W32 and W64, respectively. This is reasonable considering that the partial
datapath architectures have a similar number of registers and the registers are
updated approximately 160, 80, 40 and 20 times to complete the encryption, respectively.
On the other hand, the complete datapath architectures all need to
update a similar number of registers 10 times to complete one encryption. Therefore,
these architectures have relatively similar energy consumption of the registers,
and that is about 1/16, 1/8, 1/4 and 1/2 of the energy for W08, W16, W32 and
W64, respectively. This finding inspired us to investigate the further reduction of
the energy of the complete datapath architectures by removing the registers between
the consecutive round functions (excluding the architecture U01). However, our
experimental results show that these architectures consume significantly more energy
after the registers are removed due to the increment of energy caused by the severe
increase of glitching. Roughly, U02, U05 and U10 consume 2, 5 and 10 times more
energy after the registers are removed.
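The stated ratios follow directly from the number of register updates per encryption. A
small sketch of that arithmetic is given below; it assumes, as the text does, that all
architectures hold a similar number of registers so that register energy scales with the
number of updates. The variable names are illustrative only.

```python
from fractions import Fraction

# Register updates (clock cycles) per 128-bit block for the partial datapath
# architectures, as stated in the text.
PARTIAL_DATAPATH_UPDATES = {"W08": 160, "W16": 80, "W32": 40, "W64": 20}

# The complete (128-bit) datapath architectures update their registers 10 times.
COMPLETE_DATAPATH_UPDATES = 10

# Expected register energy of a complete datapath relative to each partial datapath,
# assuming energy is proportional to the number of updates.
register_energy_ratio = {
    name: Fraction(COMPLETE_DATAPATH_UPDATES, updates)
    for name, updates in PARTIAL_DATAPATH_UPDATES.items()
}

for name, ratio in register_energy_ratio.items():
    print(f"complete datapath vs {name}: {ratio}")
```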
Table 6.8: Average energy incurred due to dynamic power of the registers for the encryption of 128-bit plaintext of the
architecture implementations (normalized to 34.2 pJ)

     85.3   42.7   17.1   8.53   4.27   2.13   1.07   533    80.0   8.00   800
     Gbps   Gbps   Gbps   Gbps   Gbps   Gbps   Gbps   Mbps   Mbps   Mbps   kbps
W08  N/A    N/A    N/A    N/A    N/A    N/A    N/A    18.75  18.45  18.45  18.45
W16  N/A    N/A    N/A    N/A    N/A    N/A    9.13   9.26   8.8    8.8    8.8
W32  N/A    N/A    N/A    N/A    N/A    4.59   4.46   4.02   4.02   4.02   4.02
W64  N/A    N/A    N/A    N/A    2.97   3.15   2.39   2.39   2.39   2.39   2.39
U01  N/A    N/A    N/A    1.55   1.68   1.03   1.03   1.03   1.03   1.03   1.03
U02  N/A    N/A    1.52   1.8    1.14   1.14   1.14   1.14   1.14   1.14   1.14
U05  N/A    1.91   1.43   1.35   1.35   1.35   1.35   1.35   1.35   1.35   1.35
U10  1.49   1.76
Table 6.9: Overall resource cost of the architecture implementations (normalized to the value of W32 under 2.13 Gbps)

     85.3   42.7   17.1   8.53    4.27   2.13   1.07    533     80.0    8.00       800
     Gbps   Gbps   Gbps   Gbps    Gbps   Gbps   Gbps    Mbps    Mbps    Mbps       kbps
W08  N/A    N/A    N/A    N/A     N/A    N/A    N/A     5.57    24.79   265.94     4648.7
W16  N/A    N/A    N/A    N/A     N/A    N/A    2.1     2.76    21.27   241.18     4916.7
W32  N/A    N/A    N/A    N/A     N/A    1      1.55    2.83    19.37   238.09     683!H8
W64  N/A    N/A    N/A    N/A     1.26   1.22   2.25    4.49    31.15   428.43     15751.6
U01  N/A    N/A    N/A    2.79    2.17   2.67   5.29    10.55   74.99   1206.91    58082.31
U02  N/A    N/A    4.5    2.T::!  4.4    8.82   17.79   35.71   268.98  5825.8     3710848!}
U05  N/A    17.92  13.5   14.64   29.45  58.97  120.27  244.96  2114.7  71348.51   5732672.21
U10  10.48  9.72   20.42  40.99   82.91  168.2  348.15  735.52  8277.6  437831.35  39855107.53
6.4.4 Overall Resource Cost
In the previous sections, we have evaluated the cost of the architecture implementations
for each of the performance perspectives separately. In this section, we evaluate
the overall resource cost (ORC) of the architecture implementations by combining
their performance in area, peak power consumption, average energy consumption
and throughput, as

$$\mathrm{ORC} = \frac{\text{Area} \times \text{Peak Power} \times \text{Energy}}{\text{Throughput}}.$$

This metric reflects the combined cost required to yield the unit throughput, and the
lower the cost, the higher the efficiency. The overall resource cost of the architecture
implementations is shown in Table 6.9.
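As an illustration of how the metric and its normalization can be evaluated, a minimal
sketch follows. The function names and the sample inputs are hypothetical and are not
values from Table 6.9; any consistent set of units can be used because the results are
normalized to a reference implementation.

```python
def overall_resource_cost(area, peak_power, energy, throughput):
    """ORC = (area x peak power x energy) / throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return area * peak_power * energy / throughput


def normalize(costs, reference):
    """Express every cost relative to the cost of the reference implementation."""
    reference_cost = costs[reference]
    return {name: cost / reference_cost for name, cost in costs.items()}


# Hypothetical (area, peak power, energy, throughput) tuples in arbitrary units.
implementations = {
    "A": (2.0, 3.0, 4.0, 6.0),
    "B": (1.0, 2.0, 8.0, 2.0),
}

costs = {name: overall_resource_cost(*values) for name, values in implementations.items()}
normalized = normalize(costs, reference="A")
print(normalized)
```

A lower normalized value indicates a more efficient implementation; here the smaller
implementation "B" is still the costlier one because of its lower throughput and higher
energy.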
According to Table 6.9, the architecture implementation W32 under the throughput
2.13 Gbps provides the lowest overall resource cost and the implementations close
to that one also have low cost, such as W64 under 2.13 Gbps, W16 under 1.07 Gbps
and W08 under 1.07 Gbps. It can also be seen from Table 6.9 that, when the throughput
decreases, the overall resource cost increases dramatically.
6.5 Summary
In this chapter, the performance of parameterizable AES datapath architectures is
benchmarked based on a 90-nm standard cell CMOS technology in terms of area,
peak power consumption and average energy consumption. For this purpose, generic
and representative datapath architectures are built with the architecture parameters
including the datapath widths of 8, 16, 32 and 64 bits and the unrolling factors of
1, 2, 5 and 10 for the datapath width of 128 bits. Among these architectures, the
16-bit and 64-bit width architectures have not been discussed before in the literature
and the 32-bit width architecture is a novel architecture with the benefit of reduced
storage and/or fewer clock cycles compared to previous work.
For each of the architectures, a number of timing constraints are applied to
derive the architecture implementations fit for different throughput requirements. The
quantitative performance of the architecture implementations is presented, compared
and analyzed. The most efficient architecture implementations in area, peak power
and energy, as well as in the overall resource cost, are identified. The performance
trade-offs over a range of architecture parameter values are disclosed. In contrast
to conventional belief, the most compact architecture implementation with the 8-bit
width does not help to minimize, but actually increases, the energy consumption even
when running at a low clock frequency. As well, compact implementations do not result in
the minimal peak power.
Since the datapath architecture parameter values examined in this work cover an
extensive range of the possible values and the architecture implementations are fairly
compared based on the same standard cell CMOS technology, the results from this
chapter are generalizable and scalable to other technologies. Therefore, this work can
serve as a general reference for resource-efficient and flexible implementation of the
AES datapath.
In the next chapter, we combine the most energy efficient S-box pipeline
configuration from Chapter 5 and datapath architecture identified in this chapter, in order
to demonstrate the overall effectiveness of the research results presented in the two
chapters.
Chapter 7 
Demonstration of Combined 
Effects for Energy Efficiency 
In this chapter, we demonstrate the improvement in energy efficiency of the AES
datapath implementation by combining the appropriate pipeline configuration and
datapath architecture that are identified in Chapters 5 and 6, respectively. The
result shows significant reduction in energy consumption for the datapath
implementation with pipelined S-boxes compared with the datapath implementation with
non-pipelined S-boxes.
7.1 Introduction
The results in Chapters 5 and 6 show the performance improvements and performance
trade-offs under a variety of pipeline configurations for the S-box and architectures for
the datapath. In terms of the improvements in resource efficiency, the combination
of the appropriate pipeline configuration and the appropriate datapath architecture
would further improve the resource efficiency. In the remainder of this section, we
demonstrate the combined effect in resource efficiency by applying the energy-efficient
pipeline configuration from Chapter 5 to the energy-efficient datapath architecture
implementation from Chapter 6.
Specifically, we apply the 2-stage and 3-stage component level pipeline
configurations to the 128-bit datapath architecture with the unrolling factor of 10. According
to Chapter 5, the component level 2-stage and 3-stage pipelines are the two most
energy-efficient configurations for the timing constraints where they are applicable.
According to Chapter 6, the datapath architecture with the unrolling factor of 10
achieves the lowest energy consumption among the datapath architectures. We will
build the implementations with the combined pipeline configurations and the
datapath architectures and compare the performance with the implementations of the
non-pipelined S-box datapath architecture with the unrolling factor of 10 and the
datapath architecture with 8-bit width. The 8-bit datapath architecture is conventionally
adopted for low power/energy implementation of AES. These implementations do not
contain key expansion and it is assumed that the round keys are fed as inputs to the
implementations whenever they are required.
7.2 The Methodology
The methodology for implementing the combined architectures and evaluating the
performance follows that used in Chapters 5 and 6. In the following, the datapath
architecture with the unrolling factor of 10 and 2-stage component level pipelined
S-boxes is denoted as U10P2 and the datapath architecture with the unrolling factor
of 10 and 3-stage component level pipelined S-boxes is denoted as U10P3. The
denotations of the other datapath architectures follow from Chapter 6.
The list of the selected throughputs and the corresponding assignments of the
timing constraints for the combined architectures is shown in Table 7.1. The selected
throughputs include those used in Chapter 6 and some higher throughputs which are
achievable for the architectures due to the pipeline. It should be noted that, after
introducing the pipelined S-boxes in the datapath architectures, the implementation
can still produce a 128-bit output every clock cycle.
7.3 Results and Analysis
The normalized area, peak power and average energy of the implementations of the
combined architectures are shown in Tables 7.2, 7.3 and 7.4, respectively. For
comparison, the performance of the datapath architectures W08 and U10 is also
presented in the tables.
It can be seen that the combined architectures U10P2 and U10P3 can lead to
lower area than the datapath architecture U10 under the throughput 85.3 Gbps in
Table 7.2. Also, the combined architecture U10P3 has more pipeline stages in the
S-boxes but can lead to lower area than U10P2 under the throughput 171 Gbps.
There is significant reduction in energy by using the combined architectures.
Considering the minimal energy consumption for each of the architectures under
comparison (i.e., W08 under 10.0 Mbps, U10 under 2.13 Gbps, U10P2 under 5.33 or 2.13
Gbps and U10P3 under 5.33 Gbps) in Table 7.4, the combined architectures U10P2
and U10P3 can save maximally around 40% and 50% energy, respectively, compared
Table 7.4: Average energy for the encryption of 128-bit plaintext of the combined architecture and selected datapath
architecture implementations (normalized to 0.35 nJ)
Throughput: 256, 171, 85.3, 42.7, 17.1, 8.53, 4.27, 2.13, 1.07 Gbps; 533, 80.0, 8.00 Mbps; 800 kbps
W08 N/A N/A N/A N/A N/A N/A N/A N/A N/A 4.94 4.32 4.63 8.09
U10 N/A N/A 2.39 2.09 2.00 2.07 2.09 2.13 2.19 2.33 3.92 20.74 188.82
U10P2 N/A 1.68 1.28 1.23 1.23 1.24 1.26 1.3 1.38 1.53 3.27 21.65 205.62
U10P3 2.24 1.11 1.01 1.01 1.02 1.04 1.08 1.16 1.33 3.22 23.19 222.87
with the most energy efficient architecture U10 found in Chapter 6. Compared with
the datapath architecture W08, which is regarded as the most lightweight
implementation architecture for AES, the combined architectures U10P2 and U10P3 require as
low as around 1/5 of the minimal energy consumed by W08. The energy reduction
by the combined architectures U10P2 and U10P3 compared with U10 comes from the
energy reduction by using the component level 2-stage and 3-stage pipelined S-boxes,
as is shown in Chapter 5. U10P3 can consume less energy than U10P2 because
the component level 3-stage pipelined S-box has lower energy consumption than the
component level 2-stage pipelined S-box, as seen in Chapter 5.
These energy reductions come at the cost of the increased peak power. As is
seen in Table 7.3, W08 remains as the architecture with the minimal peak power and
compared with U10, U10P2 and U10P3 under the minimal energy consumption, the
peak power of W08 is about 1/8, 1/16 and 1/20 of them, respectively.
Based on the above analysis, it is clear that, for solely low energy purposes, the
architecture U10P3 should be adopted and the implementation should be constrained
and run under a high (but not the highest) throughput (e.g., between 85.3 Gbps and
4.27 Gbps). The architecture U10P2 leads to implementations with a similar
scenario but has around 7% decrease in area (54.83 versus 58.65 in Table 7.2) at the
cost of around 23% increase in energy (1.23 versus 1 in Table 7.4). In terms of area
and peak power, the architectures W08 and W64 identified in Chapter 6 remain as
the most efficient solutions, respectively.
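The percentages quoted in this section can be checked with a few lines of arithmetic.
The sketch below uses the normalized minimal energy values of W08, U10 and U10P2 from
Table 7.4, the value 1 for U10P3 and the two area values as quoted in the text; the
variable and function names are illustrative only.

```python
# Normalized minimal average energy per architecture (Table 7.4 and the text above).
MIN_ENERGY = {"W08": 4.32, "U10": 2.00, "U10P2": 1.23, "U10P3": 1.00}

# Normalized area values quoted in the text for the compared implementations (Table 7.2).
AREA = {"U10P2": 54.83, "U10P3": 58.65}


def relative_change(new, base):
    """Signed relative change of `new` with respect to `base`."""
    return (new - base) / base


saving_u10p2 = -relative_change(MIN_ENERGY["U10P2"], MIN_ENERGY["U10"])
saving_u10p3 = -relative_change(MIN_ENERGY["U10P3"], MIN_ENERGY["U10"])
ratio_to_w08 = MIN_ENERGY["U10P3"] / MIN_ENERGY["W08"]
area_decrease = -relative_change(AREA["U10P2"], AREA["U10P3"])
energy_increase = relative_change(MIN_ENERGY["U10P2"], MIN_ENERGY["U10P3"])

print(f"U10P2 saves {saving_u10p2:.1%} and U10P3 saves {saving_u10p3:.1%} versus U10")
print(f"U10P3 needs {ratio_to_w08:.2f} of the minimal W08 energy")
print(f"U10P2 versus U10P3: {area_decrease:.1%} less area, {energy_increase:.1%} more energy")
```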
The significant energy reduction found in this section indicates that those AES
implementation architectures that employ the minimal datapath width for low power
purposes (characterized as W08 in this dissertation, such as [31]) actually consume
much more energy than those architectures built for high throughput purposes
(characterized as U10, U10P2 and U10P3, such as [20]).
7.4 Summary
In this chapter, we demonstrate the combined effect in improving energy efficiency by
the appropriately combined S-box pipeline configuration and datapath architecture.
We combine the component level 2-stage and 3-stage pipelined S-boxes with the
datapath architecture with the unrolling factor of 10. The performance evaluation of
the implementations of the combined architectures shows that there is significant energy
reduction achieved by the combined architectures compared with the most energy
efficient datapath architecture identified in Chapter 6 or the most compact datapath
architecture conventionally used for low resource purposes.
Chapter 8 
Design of a Lightweight Block
Cipher PUFFIN2 
In previous chapters, we have focused on analyzing the implementation of AES. As an
alternative to AES, many others have proposed lightweight block cipher algorithms
targeted at low complexity implementation. In this chapter, we present a block cipher,
PUFFIN2, which is designed to be used with applications requiring very low circuit
area. PUFFIN2 is designed to be implemented exclusively with CMOS technologies
and in a serialized architecture, so that the maximum reuse of hardware components
is achieved, resulting in a very compact implementation. PUFFIN2 has a block size of
64 bits and a key size of 80 bits. Compared with a serialized implementation of the cipher
PRESENT, which has the same block size and key size and is claimed as the smallest
practical block cipher implementation to date, our cipher has 16% fewer gates using
the same CMOS technology. Further, PUFFIN2 inherently supports both encryption
and decryption while the serialized PRESENT is an encryption-only implementation.
The content of this chapter is also presented in [46].
8.1 Introduction 
Lightweight applications usually refer to the applications with extremely constrained
requirements on cost (complexity), power and/or energy consumption, such as RFID
tags and sensor networks. The block ciphers that are targeted at these applications
are usually called lightweight block ciphers. Although there have been plenty of
efforts put into the investigation of the implementation of AES with low complexity,
the complexity inherent in the algorithm imposes a lower bound on the complexity.
Most of the efforts on the efficient implementation of AES are actually the exploration
of the trade-off between area and delay where the sacrifice of one major aspect of the
performance is unavoidable.
Lightweight block cipher design is one of the recent trends in symmetric key
cryptography. There are many attempts made to investigate the potential of designing
lightweight block ciphers that lead to compact implementations without significant
compromise of the delay. Lightweight block cipher algorithms include PRESENT
[47][48], mCRYPTON [49], ICEBERG [27], HIGHT [50], SEA [51], PUFFIN [52],
KATAN, KTANTAN [53], MIBS [54] and LBlock [55].
It is well known that an efficient method to minimize hardware area is to reuse
one single piece of a hardware component multiple times instead of replicating
identical pieces for simultaneous operation. This hardware reduction method, also
known as a serialized architecture, is well suited to block ciphers which usually involve
cryptographic components consisting of identical function blocks, e.g., non-linear
substitution layers generally consist of identical S-boxes. Although this reduction of
hardware area comes at the penalty of increased execution time, the compromised
timing performance is still acceptable for many applications at which lightweight
block ciphers are targeted. The serialized architecture was first exploited and applied
to PRESENT in [48] and this is the smallest known implementation of a practical
block cipher. In the remainder of this chapter, this implementation is called serialized
PRESENT.
PUFFIN2 is a block cipher named after its predecessor PUFFIN and designed
to be implemented exclusively with a serialized architecture. The differences between
PUFFIN and PUFFIN2 lie in the number of rounds, the function order in a round and
the key schedule, which is fully redesigned for PUFFIN2 in order to perform it with
the datapath. Both PUFFIN and PUFFIN2 have the capability of both encryption
and decryption. In the next section it will be shown that the datapath of PUFFIN2
is exactly the same for encryption and decryption, so there is no hardware overhead
to accommodate the difference between encryption and decryption operations.
PUFFIN2 is more efficient than PRESENT for hardware implementation with a
serialized architecture. PUFFIN2 has the same block size and key size as PRESENT,
64-bit and 80-bit, respectively. Compared with the widely adopted 128-bit key size
and 128-bit block size of AES, an 80-bit key size with a 64-bit block size can result
in a compact implementation and still provide sufficient security for typical low cost
smart devices. Based on our ASIC implementation experiments, PUFFIN2 is realized
with 1083 gates, which is 16% less than serialized PRESENT with 1296 gates. The
block ciphers that appeared after PUFFIN2, including KATAN, KTANTAN [53],
MIBS [54] and LBlock [55], have very similar or even larger area compared with
PRESENT according to the comparison in these works. For this reason, PRESENT
is adopted as the reference for comparison in this chapter.
8.2 Cipher Specification 
A block cipher is generally constructed from two parts: datapath and key schedule.
The datapath is in the form of a product cipher from Shannon's theory [56] where
a number of identical or similar functions are concatenated to multiply the security
strength of the block cipher. The cipher is composed of a number of rounds, where the
operations within a round are referred to as a round function. Each round function
takes the output from the previous round (or the plaintext for the first round function)
and also receives a set of information (referred to as a round key) from
the key schedule, and generates the input for the following round (or the ciphertext for
the last round function). Confusion and diffusion are two basic techniques introduced
by Shannon [56] to obscure plaintext into ciphertext. Confusion complicates the
relationship between the plaintext, the ciphertext and the key, while diffusion spreads
the influence of confusion.
8.2.1 Overall Structure 
The proposed block cipher, PUFFIN2, adopts a simple involutional Substitution
Permutation Network (SPN) [57] with a data block size of 64 bits and key size of 80 bits
and consists of 34 rounds. Encryption and decryption processes are identical so the
same datapath can be used for both processes. The key schedule of the cipher
generates 64-bit round keys for each round on-the-fly (that is, in parallel to the processing
of the cipher data).
In a block cipher using an SPN structure, confusion and diffusion are performed
with the substitution layer and the transposition or permutation layer in a round
function, and the security is enhanced by concatenating the round functions. Compared
with the classical Feistel structure [58] where only half of the data block is processed
in parallel, the entire data block is processed at one time for the substitution and
permutation operations of a round function. An involutional structure allows the
cipher to have identical encryption and decryption processes, which is one of the design
goals of PUFFIN2. In general, a block cipher using an SPN structure has different
datapaths for encryption and decryption because the reverse of the encryption
process, which is equal to the decryption process, is usually not designed to be the same
as the forward encryption process. In PUFFIN2, we design the same forward and
reverse processes for the encryption by adopting involutional components in the
cipher whose forward operations used for encryption can be the same as the reverse
operations used for decryption. In this way, there is no separate encryption datapath
and decryption datapath required in the implementation, although encryption and
decryption cannot be done simultaneously since they share the same datapath.
Table 8.1: S-box mapping of PUFFIN2 (in hexadecimal)
Table 8.2: 64-bit permutation of PUFFIN2

     C0  C1  C2  C3  C4  C5  C6  C7
R0   13   2  60  50  51  27  10  36
R1   25   7  32  61   1  49  47  19
R2   34  53  16  22  57  20  48  41
R3    9  52   6  31  62  30  28  11
R4   37  17  58   8  33  44  46  59
R5   24  55  63  38  56  39  15  23
R6   14   4   5  26  18  54  42  45
R7   21  35  40   3  12  29  43  64
8.2.2 Basic Components 
Each round function of PUFFIN2 consists of 3 layers: a nonlinear substitution layer
S, a key addition layer A and a permutation layer P. The nonlinear substitution layer
S is composed of 16 identical 4 x 4 S-boxes, which are the same as the S-boxes used
in PUFFIN, and the S-box mapping is shown in Table 8.1. 4 x 4 S-boxes (which are
small compared to the 8 x 8 S-boxes of AES) are often found in lightweight block
ciphers because their implementations are compact and their comparative weakness
in security strength can be compensated by an increased number of rounds. The
key addition layer A performs a bitwise XOR with the 64-bit data block and the
64-bit round key provided by the key schedule. The permutation layer P is a bit
transposition of the 64-bit data block.
The permutation scheme of PUFFIN2 is borrowed from the 64-bit data block
permutation of PUFFIN, which fulfills the criterion that no two outputs of a 4 x 4
S-box are connected to the same S-box in the next round. The permutation scheme
is given in Table 8.2. According to this table, the (8 x m + n + 1)-th input is mapped
to the N-th output where N is the value located at C_n and R_m.
As can be seen from Tables 8.1 and 8.2, both the S-box mapping and the
permutation are involutional. According to the completeness property of cryptography
introduced in [59], the cipher with a tree-structured SPN and a specifically designed
permutation can achieve the completeness property within the fewest rounds, which
would be 3 for a 64-bit cipher with 4 x 4 S-boxes [59]. In PUFFIN as well as PUFFIN2,
due to the requirement of involutional permutation, a tree-structured SPN cannot
be adopted to achieve the completeness property in only 3 rounds. However, it can
be shown that the completeness property can still be reached after five rounds of
PUFFIN and PUFFIN2 with the S-box mapping and permutation presented above
[60].
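The two properties claimed for the permutation can be checked mechanically. The sketch
below assumes the values listed are a faithful transcription of Table 8.2, read row by
row so that list position 8m + n holds the entry at row R_m and column C_n; the function
names are illustrative. It tests that the mapping is a permutation of 1..64, that it is
an involution, and that the four outputs of each 4 x 4 S-box reach four different
S-boxes in the next round.

```python
# Table 8.2 read row by row (R0..R7, C0..C7); assumed transcription of the table.
PERMUTATION = [
    13, 2, 60, 50, 51, 27, 10, 36,
    25, 7, 32, 61, 1, 49, 47, 19,
    34, 53, 16, 22, 57, 20, 48, 41,
    9, 52, 6, 31, 62, 30, 28, 11,
    37, 17, 58, 8, 33, 44, 46, 59,
    24, 55, 63, 38, 56, 39, 15, 23,
    14, 4, 5, 26, 18, 54, 42, 45,
    21, 35, 40, 3, 12, 29, 43, 64,
]


def is_involution(perm):
    """True if applying the 1-indexed mapping twice returns every input position."""
    return all(perm[perm[i] - 1] == i + 1 for i in range(len(perm)))


def spreads_sbox_outputs(perm, sbox_bits=4):
    """True if the outputs of each S-box go to pairwise different S-boxes."""
    for start in range(0, len(perm), sbox_bits):
        targets = {(perm[i] - 1) // sbox_bits for i in range(start, start + sbox_bits)}
        if len(targets) != sbox_bits:
            return False
    return True


print("permutation of 1..64:", sorted(PERMUTATION) == list(range(1, 65)))
print("involution:", is_involution(PERMUTATION))
print("S-box outputs spread:", spreads_sbox_outputs(PERMUTATION))
```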
8.2.3 Encryption and Decryption Process 
The encryption and decryption processes are shown in Figure 8.1, where K_r denotes
the r-th round key and K'_r = P(K_r). The whole process consists of 34 rounds plus an
extra substitution layer. The explanation of selecting 34 as the number of rounds is
given in the next section. The extra substitution layer is required to form identical
encryption and decryption processes. For each round of the encryption/decryption
process, the 64-bit input data goes through the substitution layer S, the permutation
layer P and then is added with the round key to generate the input of the next round.
Figure 8.1: Block diagram of the encryption (top) and decryption (bottom)
processes
According to Figure 8.1, the encryption process of PUFFIN2 can be represented
as follows:

$$E_{[K_1, K_2, \ldots, K_{34}]} = \bigcirc_{r=1}^{34} (S \circ P \circ A_{K_r}) \circ S. \tag{8.1}$$

In the above expression, the notation $\circ$ means the concatenation of the basic
operations in one round such as substitution S and permutation P. The notation
$\bigcirc$ is used to represent the concatenation of 34 rounds of operation of
$(S \circ P \circ A)$. Decryption should be as follows:

$$D_{[K_1, K_2, \ldots, K_{34}]} = S \circ \bigcirc_{r=34}^{1} (A_{K_r} \circ P \circ S) \tag{8.2}$$

$$= \bigcirc_{r=34}^{1} (S \circ A_{K_r} \circ P) \circ S. \tag{8.3}$$

Because the substitution layer S and the permutation layer P are involutional,
we can have the following relationship:

$$A_{K_r} \circ P = P \circ A_{K'_r}, \quad K'_r = P(K_r), \tag{8.4}$$

and therefore, we can obtain the following:

$$D_{[K_1, K_2, \ldots, K_{34}]} = \bigcirc_{r=34}^{1} (S \circ P \circ A_{K'_r}) \circ S. \tag{8.5}$$

The above expression is consistent with the decryption process shown in Figure
8.1, which means the decryption process is similar in form to the encryption process.
In the decryption process, the round keys used in encryption are permuted with P
and applied in the reverse order.
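The property derived above, that one datapath serves both directions when the round
keys are permuted with P and reversed, can be exercised with a small software model.
This is only a sketch under stated assumptions: the S-box below is a hypothetical
involutional 4-bit mapping chosen for illustration (the PUFFIN2 S-box of Table 8.1 is
not reproduced here), the permutation is an assumed transcription of Table 8.2, the
round keys are random values rather than the output of the PUFFIN2 key schedule, and
the bit numbering convention and all function names are illustrative.

```python
import random

ROUNDS = 34

# Assumed transcription of Table 8.2 (1-indexed target position of each input bit).
PERM = [
    13, 2, 60, 50, 51, 27, 10, 36, 25, 7, 32, 61, 1, 49, 47, 19,
    34, 53, 16, 22, 57, 20, 48, 41, 9, 52, 6, 31, 62, 30, 28, 11,
    37, 17, 58, 8, 33, 44, 46, 59, 24, 55, 63, 38, 56, 39, 15, 23,
    14, 4, 5, 26, 18, 54, 42, 45, 21, 35, 40, 3, 12, 29, 43, 64,
]

# Hypothetical involutional 4-bit S-box (NOT the PUFFIN2 S-box).
SBOX = [5, 10, 7, 12, 9, 0, 15, 2, 13, 4, 1, 14, 3, 8, 11, 6]


def substitute(block):
    """Layer S: apply the same 4-bit S-box to each of the 16 nibbles."""
    out = 0
    for i in range(16):
        out |= SBOX[(block >> (4 * i)) & 0xF] << (4 * i)
    return out


def permute(block):
    """Layer P: move input bit i+1 to output bit PERM[i] (bit 1 is the LSB here)."""
    out = 0
    for i, target in enumerate(PERM):
        if (block >> i) & 1:
            out |= 1 << (target - 1)
    return out


def datapath(block, round_keys):
    """34 rounds of S, P, key addition, followed by the extra substitution layer."""
    for key in round_keys:
        block = permute(substitute(block)) ^ key
    return substitute(block)


def decryption_keys(round_keys):
    """Round keys for decryption: permuted with P and applied in reverse order."""
    return [permute(key) for key in reversed(round_keys)]


rng = random.Random(2012)
round_keys = [rng.getrandbits(64) for _ in range(ROUNDS)]
plaintext = 0x0123456789ABCDEF
ciphertext = datapath(plaintext, round_keys)
recovered = datapath(ciphertext, decryption_keys(round_keys))
print(f"{ciphertext:016x} -> {recovered:016x}")
```

The same `datapath` function is called for both directions; only the list of round keys
differs, which mirrors the absence of a separate decryption datapath in hardware.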
Table 8.3: Description of the components of the key schedule

Component  Function
S80        Substitution of the 80 bits
PL64       Permutation of the left 64 bits
PR64       Permutation of the right 64 bits
L64        Selection of the left 64 bits
R64        Selection of the right 64 bits

Table 8.4: Round distribution of PL64, PR64, L64 and R64

Round #                  Permutation  Selection
r = 1, 2, 33, 34         PL64         L64
r = 3, 4, 31, 32         PR64         R64
r = 5 + 4n, 0 <= n <= 6  PR64         R64
r = 6 + 4n, 0 <= n <= 6  PR64         R64
r = 7 + 4n, 0 <= n <= 5  PL64         L64
r = 8 + 4n, 0 <= n <= 5  PL64         L64
Figure 8.2: Block diagram of the key schedule
8.2.4 Key Schedule
The key schedule of PUFFIN2 operates on an 80-bit key and generates a 64-bit round
key for each round on-the-fly. The components used by the key schedule are listed
in Table 8.3 and the key schedule is demonstrated in Figure 8.2. The key schedule
consists of 34 round functions plus an extra substitution layer at the beginning. Each
of the round functions is comprised of a permutation layer PL64 or PR64 and a
substitution layer S80. PL64/PR64 permutes the left/right 64 bits of the 80-bit round
input and then S80 performs the substitution on the 80 bits of key data. Depending
on the selection component (L64 or R64), each 64-bit round key is generated by taking
the left or right 64 bits of the 80-bit intermediate value that feeds to the corresponding
round function. The detailed distribution of PL64 and PR64 along with L64 and R64
is shown in Table 8.4. The irregular distribution of PL64 and PR64 for each round
is intended to prevent related-key attacks and will be discussed in Section 8.3.
In order to maximize hardware resource reuse to achieve a compact
implementation, the substitution component S80 is designed to consist of 4 x 4 S-boxes that
are the same as the S-box used in the encryption and decryption processes, and the
64-bit permutation mapping in PL64 and PR64 is also the same mapping as in the
permutation layer in the encryption and decryption process.
The key schedule of PUFFIN2 is designed to be involutional to fulfill the design
goal of a fully involutional block cipher. The involutional property is achieved through
the following measures. In the first place, all basic components in the key schedule
are involutional. Secondly, the distribution of PL64 and PR64 along with L64 and
R64 is symmetric, and in order to achieve this, the round number of the key schedule
has to be a number that is double of an odd number and consequently 34 is selected
as the round number of the key schedule as well as the encryption and decryption
processes. As will be noted in Section 8.3, 34 rounds are also adequate to achieve an
appropriate level of security. Thirdly, there is an extra substitution layer S80 at the
beginning of the key schedule which makes the forward path of round key generation
identical to its backward path. The PL64 and S80 in the last round (round 34) of
the key schedule are only useful to compute the decryption key corresponding to an
encryption key. By applying a decryption key to the key schedule, the round keys
would be generated in the reverse order of that made by the encryption key. It is
also necessary to note that the round keys generated for decryption are permuted
versions of the round keys used in encryption, and this feature is required to provide
the correct decryption round keys for the decryption process mentioned in Section
8.2.3.
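The symmetry of the round distribution can be checked directly from the rules of Table
8.4. The following sketch encodes those rules in a hypothetical helper,
`round_components`, and tests that the sequence over the 34 rounds reads the same
forwards and backwards; it models only which component is used in which round, not the
key schedule itself.

```python
ROUNDS = 34


def round_components(r):
    """Return (permutation, selection) for round r following the rules of Table 8.4."""
    if r in (1, 2, 33, 34):
        return ("PL64", "L64")
    if r in (3, 4, 31, 32):
        return ("PR64", "R64")
    if 5 <= r <= 30:
        # r = 5 + 4n and r = 6 + 4n use the right side; r = 7 + 4n and r = 8 + 4n the left.
        return ("PR64", "R64") if (r - 5) % 4 in (0, 1) else ("PL64", "L64")
    raise ValueError("round number must be between 1 and 34")


distribution = [round_components(r) for r in range(1, ROUNDS + 1)]
is_symmetric = distribution == distribution[::-1]

left_rounds = sum(1 for perm, _ in distribution if perm == "PL64")
right_rounds = ROUNDS - left_rounds
print("symmetric:", is_symmetric)
print("PL64 rounds:", left_rounds, "PR64 rounds:", right_rounds)
```

Since the components are assigned to rounds in pairs, the 34 rounds form 17 pairs, and
the odd number of pairs lets the sequence mirror about its middle pair.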
8.3 Security Analysis
In this section, we analyze the security strength of PUFFIN2 under differential and
linear cryptanalysis and two major key schedule attacks.
8.3.1 Differential and Linear Cryptanalysis
Our proposed block cipher PUFFIN2 shares the same S-box and permutation mapping
in the encryption and decryption process as PUFFIN, so the differential and linear
cryptanalysis results of PUFFIN2 can be easily derived from those of PUFFIN in [60].
For differential cryptanalysis [61], the maximum differential characteristic
probability of the S-box is $p_S = 1/4$ and, based on the 4 x 4 S-boxes and the
involutional permutation, each round has at least one active S-box to form the path for a
differential characteristic. Hence, the upper bound of the differential characteristic
probability over 32 rounds is given by:

$$p_D \le (p_S)^{32} = 2^{-64}. \tag{8.6}$$

The differential characteristic probability $p_D$ indicates that about $2^{64}$ chosen
plaintext/ciphertext pairs would be required to mount a successful attack, the complexity
of which is close to a brute force dictionary attack on a 64-bit cipher. Therefore, it is
reasonable to consider PUFFIN2 to be resistant to differential cryptanalysis.
For linear cryptanalysis [62], the maximum linear approximation probability bias
of the S-box is $|\varepsilon_S| = 1/4$. Similar to the case in differential cryptanalysis, there is at
least one active S-box involved in each round to form a linear approximation. Hence,
the upper bound of the linear approximation bias $\varepsilon_L$ of 32 rounds is calculated with
the piling-up lemma [62] as follows:
$$|\varepsilon_L| \le 2^{31} \, |\varepsilon_S|^{32} = 2^{31} \left(\frac{1}{4}\right)^{32} = 2^{-33}. \tag{8.7}$$
According to [62], the number of known plaintexts required to perform linear
cryptanalysis is proportional to $1/\varepsilon_L^2$, which means an effective attack on PUFFIN2
with linear cryptanalysis requires about $2^{66}$ known plaintext/ciphertext pairs and
therefore is considered to be an impractical attack.
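
As a quick numerical check of the two bounds discussed above, the following minimal Python sketch recomputes them with exact rational arithmetic. The variable names are illustrative and do not come from the original text; the sketch assumes exactly one active S-box per round over 32 rounds, as in the analysis above.

```python
from fractions import Fraction
from math import log2

rounds = 32                  # number of rounds covered by the bounds
p_s = Fraction(1, 4)         # maximum differential probability of the S-box
eps_s = Fraction(1, 4)       # maximum linear bias magnitude of the S-box

# Differential characteristic bound: one active S-box in every round.
p_d = p_s ** rounds

# Piling-up lemma for n combined approximations:
# |eps| = 2^(n-1) * product of the individual bias magnitudes.
eps_l = 2 ** (rounds - 1) * eps_s ** rounds

# Data requirements scale roughly as 1/p_D and 1/eps_L^2, respectively.
log2_diff_pairs = log2(1 / p_d)
log2_lin_texts = log2(1 / eps_l ** 2)

print(log2(p_d), log2(eps_l), log2_diff_pairs, log2_lin_texts)
```

Changing `rounds` shows how quickly both data requirements grow when rounds are added, which is the lever the thesis later suggests for strengthening the cipher.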
It is necessary to mention that the success chance of the cryptanalysis may be
underestimated by the probabilities of the differential and linear approximations cal-
culated above. For differential cryptanalysis, the differential characteristic probability
is calculated based on the assumption that the data entering different S-boxes are in-
dependent and there is only a single differential approximation path involved for
the selected input and output differential patterns. Realistically, there are usually
multiple differential approximation paths involved on the same input and output dif-
ferential patterns so that the actual probability of the differential approximation is
larger than the probability calculated when considering only a single approximation
path. This concept is referred to as differentials [63]. Similarly, for linear cryptanaly-
sis, the actual linear approximation probability is larger than the estimation above by
taking into account the dependence of the S-box approximations and multiple linear
approximation paths with the same plaintext bits and data bits at the input to the
last round (referred to as a linear hull [64]).
8.3.2 Related-Key Attacks
The related-key attack was proposed in [65]. It exploits the regularity of the rela-
tionships between key schedule rounds and uses the chosen key relations to retrieve
the secret key information. The related-key attack generally finds its application on
those block ciphers that use the same algorithm to generate round keys for all the
rounds, such as the variants of LOKI [66] and Lucifer [67].
It is easy to see that our block cipher does not have this regularity property in
the key schedule because the permutation layers PL64 and PR64 are not regularly
distributed among rounds, and hence it is resistant to the related-key attack.
8.3.3 Weak Keys
Weak keys [68] are keys that make the key schedule produce identical round keys for
all or some of the rounds. For the key schedule of PUFFIN2, due to the existence of
nonlinear substitution layers, we do not find any weak key in the key space.
8.3.4 Updated Cryptanalysis Results
After PUFFIN and PUFFIN2 were proposed, their security strength has been
analyzed by other researchers, as is shown in [69] and [70]. In [69], a linear cryptanal-
ysis against PUFFIN is performed by taking into account linear hulls and recovers
4 bits of the last round key with the complexity less than 25<1. In [70], differential
cryptanalysis is mounted on both PUFFIN and PUFFIN2. The cryptanalysis in [70]
recovers the 80-bit key of PUFFIN2 with $2^{74.78}$ operations using $2^{52.3}$ chosen plain-
texts. This cryptanalysis complexity is lower than the estimation in Section 8.3.1
since it takes into account multiple differentials (following the framework presented
in [71]). The updated cryptanalysis results show that our own analysis underestimates
the attacker's success and hence overestimates the cost of key recovery. As a simple
way to enhance the security strength
of PUFFIN2 in order to thwart the lower attack cost found in [70], the number of
rounds of PUFFIN2 can be increased, and in doing so the area of the hardware im-
plementation of PUFFIN2 based on the loop-iterative structure is not increased.
Figure 8.3: Serialized architecture of PUFFIN2
8.4 Serialized Architecture for Hardware Implementation
The proposed block cipher PUFFIN2 is designed to be efficiently implemented with
a serialized hardware architecture. A serialized architecture is one in which
multiple identical hardware components that would process multiple tasks simultaneously
are replaced by a single instance of the component that processes those
tasks in series. Generally, a serialized architecture can lead to the minimal hardware
implementation among a variety of implementation architectures. In this section, we
introduce a serialized architecture for which the proposed block cipher PUFFIN2 is
well suited and that results in an ultra compact implementation.
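
To make the idea of serialization concrete, the sketch below contrasts a parallel substitution layer (16 S-box instances) with a serialized one (a single S-box reused over 16 steps). It is a behavioural Python model under stated assumptions, not the hardware description used in this work: `SBOX` is a placeholder 4 x 4 involution chosen for illustration and is not the actual PUFFIN2 S-box, and the function names are hypothetical.

```python
# Placeholder 4x4 S-box (an involution: SBOX[SBOX[x]] == x).
# NOT the PUFFIN2 S-box; used only to keep the example self-contained.
SBOX = [15 - x for x in range(16)]

def sub_layer_parallel(state: int) -> int:
    """Model 16 S-box instances acting on all nibbles of a 64-bit state at once."""
    out = 0
    for i in range(16):
        nibble = (state >> (4 * i)) & 0xF
        out |= SBOX[nibble] << (4 * i)
    return out

def sub_layer_serialized(state: int) -> tuple[int, int]:
    """Model one S-box reused over 16 steps; returns (result, steps used)."""
    steps = 0
    for _ in range(16):
        nibble = state & 0xF                          # rightmost nibble feeds the S-box
        state = (state >> 4) | (SBOX[nibble] << 60)   # rotate the result back in at the top
        steps += 1
    return state, steps

example = 0x0123456789ABCDEF
print(hex(sub_layer_parallel(example)), sub_layer_serialized(example)[1])
```

Both functions compute the same result; the serialized version trades 16 component instances for 16 time steps, which is the area/latency trade-off exploited in this section.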
The serialized architecture for PUFFIN2 is shown in Figure 8.3. This architecture
is constructed with two 4-bit 2-to-1 multiplexers, a 64-bit 2-to-1 multiplexer, a 4-bit
XOR adder A, a 4 x 4 S-box S, a 64-bit permutation P, and a 144-bit register.
The architecture has a 144-bit wide datapath (for both data and key) and the
hardware components operate on the datapath.
It is also worth mentioning that there is a 4-bit rotation structure on the data-
path, which is crucial to ensure that the hardware resources are shared properly in series
and that the block cipher algorithm runs correctly in the architecture. The 4-bit
rotation structure is realized by crossover wiring that maps bits 1 to 140 to bits 5 to
144 and bits 141 to 144 to bits 1 to 4 (through the adder A and the S-box S).
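
The crossover wiring described above can be modelled in a few lines of Python. This is a behavioural sketch only: the mapping of "bit k" to integer bit k-1 is an assumption made for illustration, `SBOX` is a placeholder involution rather than the real PUFFIN2 S-box, and `rotate_step` is a hypothetical name.

```python
WIDTH = 144
MASK = (1 << WIDTH) - 1

# Placeholder 4x4 involution, NOT the actual PUFFIN2 S-box.
SBOX = [15 - x for x in range(16)]

def rotate_step(reg: int, addend: int = 0) -> int:
    """One clock cycle of the 4-bit rotation on the 144-bit register.

    Bit k in the text is modelled as integer bit k-1 (an assumption).
    Bits 1..140 move to bits 5..144; bits 141..144 pass through the
    XOR adder A and the S-box S and land in bits 1..4.
    """
    top = (reg >> (WIDTH - 4)) & 0xF      # bits 141..144
    top ^= addend & 0xF                   # XOR adder A
    top = SBOX[top]                       # S-box S
    return ((reg << 4) & MASK) | top      # bits 1..140 -> bits 5..144

# 144 / 4 = 36 steps bring every 4-bit unit back to its starting position,
# after it has passed through A and S exactly once.
reg = 0x123456789ABCDEF0123456789ABCDEF01234
state = reg
for _ in range(36):
    state = rotate_step(state)
print(hex(state))
```

With a zero addend, 36 steps apply the S-box once to every 4-bit unit and return each unit to where it started, which is why one full rotation can double as one substitution layer.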
The 4-bit 2-to-1 multiplexer with a dashed 4-bit zero input in Figure 8.3 is called
the first 4-bit multiplexer and the other one is called the second 4-bit multiplexer.
The first 4-bit multiplexer is able to output a zero vector independent of its two
inputs, and this is achieved by ANDing the 4-bit output of the multiplexer with a
signal bit from the controller.
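
The zero-forcing behaviour of the first 4-bit multiplexer can be sketched as follows. The function and parameter names (`first_mux`, `sel`, `enable`) are hypothetical; only the AND gating of the multiplexer output by a controller bit comes from the description above.

```python
def first_mux(sel: int, in0: int, in1: int, enable: int) -> int:
    """4-bit 2-to-1 multiplexer whose output is ANDed with a control bit.

    With enable = 0 the output is the zero vector, regardless of
    in0, in1 and sel.
    """
    chosen = in1 if sel else in0
    gate = 0xF if enable else 0x0   # replicate the control bit across 4 bits
    return chosen & gate & 0xF

print(first_mux(0, 0xA, 0x5, 1), first_mux(1, 0xA, 0x5, 1), first_mux(1, 0xA, 0x5, 0))
```

Gating the output this way costs four 2-input AND gates instead of a third multiplexer input.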
In the following, we describe the workflow of the serialized architecture with
the example of the plaintext and key loading procedure and the first round of the encryp-
tion process. We call the bits that carry plaintext information and key information
during the encryption process internal plaintext bits and internal key bits, respec-
tively. A 64-bit plaintext plus an 80-bit encryption key is loaded in units of 4 bits
through the first 4-bit multiplexer to the datapath. The first 4-bit unit is presented
to the input of the 144-bit register in the first clock cycle and becomes available at
the output of the register in the second clock cycle. Each subsequent 4-bit unit is
added with a 4-bit zero vector and then fed through the S-box before being stored
in the 144-bit register. The 4-bit zero vector is generated by the initial output of
the 144-bit register. The 4-bit rotation structure makes sure each 4-bit unit is added
with a 4-bit zero vector and stored in the register bits next to the last 4-bit unit. The
loading procedure of the plaintext plus the key takes 36 clock cycles and during this
period the first substitution layers in the encryption process and key schedule are also
performed. In the 37th clock cycle, the 64-bit internal plaintext bits are permuted
with the 64-bit permutation by selecting position 1 of the 64-bit 2-to-1 multiplexer,
and then the rightmost 4 bits of the updated internal plaintext bits are added with
the rightmost 4 bits of the left 64 bits of the internal key bits by selecting position 0
of the second 4-bit multiplexer. It takes 16 clock cycles to complete the key addition
of 64 bits and this ends at the 52nd clock cycle. In the 53rd cycle, the rightmost 4
bits of the 80-bit internal key bits are added with a 4-bit zero vector in A and then
substituted by the S-box. In the 57th cycle, the left 64 bits of the 80-bit internal key
bits are permuted with the 64-bit permutation. In order to better demonstrate the
work of the architecture, the contents of the 144-bit register at clock cycles 6, 37, 45,
53 and 57 are shown in Figure 8.4 (a), (b), (c), (d) and (e), respectively. The dotted
4-bit block in Figure 8.4 (b) is the 4 bits to be added with the rightmost 4 bits of the
internal plaintext (after the permutation) in the 37th cycle.
Figure 8.4: Contents of the 144-bit register at clock cycles 6, 37, 45, 53 and 57
Table 8.5: Implementation results of PUFFIN2 and serialized PRESENT
                      Area    Max. Freq.   Clock Cycles        Throughput
                      (GEs)   (MHz)        per 64-bit block    @100 KHz
PUFFIN2               1083    326.8        1240                5.2 Kbps
Serialized PRESENT    1296    346.0        563                 11.4 Kbps
The 64-bit internal plaintext bits and the 80-bit internal key bits are rotated
within the 144-bit register and it takes 36 cycles to complete a full rotation. The
period of 36 cycles is also the time to complete a round of the encryption/decryption
process. Hence, the total time to complete the entire encryption/decryption including
the 16 cycles for the initial loading of the plaintext is given by:
$$(16 + 36 \times 34)\,\mathrm{cc} = 1240\,\mathrm{cc}, \tag{8.8}$$
where cc represents one clock cycle.
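
The cycle count in Eq. (8.8) and the resulting throughput at the 100 KHz clock used in Table 8.5 can be reproduced with a short calculation. The constant and function names below are illustrative, not taken from the original design files.

```python
LOAD_CYCLES = 16        # initial loading of the 64-bit plaintext, as in Eq. (8.8)
CYCLES_PER_ROUND = 36   # one full rotation of the 144-bit register
ROUNDS = 34
BLOCK_BITS = 64

def total_cycles(load=LOAD_CYCLES, per_round=CYCLES_PER_ROUND, rounds=ROUNDS):
    """Clock cycles needed to process one 64-bit block."""
    return load + per_round * rounds

def throughput_bps(cycles_per_block, clock_hz=100_000, block_bits=BLOCK_BITS):
    """Bits per second when one block completes every cycles_per_block cycles."""
    return block_bits * clock_hz / cycles_per_block

cycles = total_cycles()
print(cycles, round(throughput_bps(cycles) / 1000, 1))
```

Passing 563 cycles per block (the serialized PRESENT figure from Table 8.5) to `throughput_bps` gives the corresponding 11.4 Kbps entry.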
8.5 Hardware Implementation Results
The block cipher PUFFIN2 with the serialized architecture has been implemented and
synthesized with the 0.18-μm CMOS standard cell library from TSMC. Synopsys
Design Compiler version X-2005.09 has been used as our synthesis tool. We also
implemented the serialized PRESENT from [47], which is claimed to be the smallest
implementation of a block cipher with a 64-bit block size. Both of the implementations
are datapath-only implementations, which means their controllers are not included
in the implementations, and in both cases the controllers are negligible because they
can be realized with a small counter and a small amount of combinational logic. Our
implementation results of PUFFIN2 and the serialized PRESENT are shown in Table
8.5. In the table, the metric of gate equivalents (GEs) is used, where a unit of 1 GE
represents an area equivalent to a 2-input NAND gate.
According to Table 8.5, the implementation of PUFFIN2 is 16% smaller than the
serialized PRESENT implementation. As a trade-off, PUFFIN2 takes roughly double
the time of the serialized PRESENT to process the same amount of data. This is
because the datapath of PUFFIN2 is reused for the key schedule and the datapath op-
eration cannot be performed simultaneously with the key schedule operation. However,
in most lightweight applications, a long running time is not a serious issue.
It is necessary to point out that the gate count of the serialized PRESENT im-
plementation claimed in [47] is 1075 GE. The 221 GE overhead of our implementation
of the serialized PRESENT could be caused by the different synthesis library and the
use of scan flip flops with integrated multiplexers in [47] instead of the normal flip
flops and separated multiplexers found in our implementation. The same area reduc-
tion effect can be achieved in our implementation of PUFFIN2 with scan flip flops as
long as the 144-bit register is moved to the output of the 64-bit 2-to-1 multiplexer to
form the integrated flip flops and multiplexers. The position of the 144-bit register is
flexible in the serialized architecture, so this change would not have any influence on
the functionality.
Table 8.6: Count of hardware components of PUFFIN2 and serialized PRESENT
Components                            PUFFIN2          Serialized PRESENT
64-bit register (384 GE)              1 (35.5%)        1 (29.6%)
80-bit register (480 GE)              1 (44.3%)        1 (37.0%)
64-bit 2-to-1 multiplexer (153 GE)    1 (14.1%)        1 (11.8%)
80-bit 2-to-1 multiplexer (192 GE)    0                1 (14.8%)
4-bit 2-to-1 multiplexer (10 GE)      2 (1.8%)         3 (2.3%)
4 x 4 S-box (30 GE/32 GE)             1 (2.8%)         1 (2.5%)
4-bit XOR adder (11 GE)               1 (1.0%)         1 (0.9%)
5-bit XOR adder (14 GE)               0                1 (1.1%)
4 2-input AND gates (5 GE)            1 (0.5%)         0
Total gate count                      1083 GE (100%)   1296 GE (100%)
In order to have a clear comparison between the hardware complexity of PUF-
FIN2 and the serialized PRESENT, we list the count of the hardware components
required for both implementations in Table 8.6. The 144-bit register in PUFFIN2
is divided into a 64-bit register and an 80-bit register in Table 8.6, and the 36 4-bit
2-to-1 multiplexers in the two shift registers of the serialized PRESENT are merged
and shown as a 64-bit 2-to-1 multiplexer and an 80-bit 2-to-1 multiplexer in Table
8.6.
From Table 8.6, we can see the major area difference between PUFFIN2 and the
serialized PRESENT comes from the 80-bit 2-to-1 multiplexer, which accounts for
14.8% of the total area of the serialized PRESENT and does not exist in PUFFIN2.
It is also noticeable in Table 8.6 that the 144-bit register takes 80% of the hardware
resources of PUFFIN2, and this fact allows us to believe that the serialized imple-
mentation of PUFFIN2 has approached the area limit of the block ciphers that have
similar block size and key size.
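
The area figures quoted in this section can be cross-checked by tallying the component counts of Table 8.6. The sketch below is only a bookkeeping aid: the dictionary keys are shorthand labels, and the assignment of 30 GE to the PUFFIN2 S-box and 32 GE to the PRESENT S-box is a reading of the "30 GE/32 GE" table entry that is consistent with both totals.

```python
# component -> (unit area in GE, count), read from Table 8.6
PUFFIN2 = {
    "64-bit register": (384, 1),
    "80-bit register": (480, 1),
    "64-bit 2-to-1 mux": (153, 1),
    "4-bit 2-to-1 mux": (10, 2),
    "4x4 S-box": (30, 1),
    "4-bit XOR adder": (11, 1),
    "4 AND gates": (5, 1),
}
PRESENT = {
    "64-bit register": (384, 1),
    "80-bit register": (480, 1),
    "64-bit 2-to-1 mux": (153, 1),
    "80-bit 2-to-1 mux": (192, 1),
    "4-bit 2-to-1 mux": (10, 3),
    "4x4 S-box": (32, 1),
    "4-bit XOR adder": (11, 1),
    "5-bit XOR adder": (14, 1),
}

def total_ge(design):
    """Total gate equivalents of a design."""
    return sum(area * count for area, count in design.values())

def share(design, *names):
    """Percentage of the total area taken by the named components."""
    part = sum(design[n][0] * design[n][1] for n in names)
    return 100 * part / total_ge(design)

saving = 100 * (total_ge(PRESENT) - total_ge(PUFFIN2)) / total_ge(PRESENT)
print(total_ge(PUFFIN2), total_ge(PRESENT), round(saving, 1))
print(round(share(PUFFIN2, "64-bit register", "80-bit register"), 1))
```

The register share of about 80% is what motivates the remark that little area remains to be saved for this block and key size.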
8.6 Summary 
In this chapter we have proposed a new block cipher PUFFIN2 based on an involu-
tional SPN structure. The cipher with a 64-bit block size and an 80-bit key size can
provide sufficient security for low cost embedded devices and support both encryption
and decryption. We also introduced a serialized architecture based on which PUF-
FIN2 can be implemented with an ultra compact size. Compared with the serialized
PRESENT implementation, the datapath of PUFFIN2 uses 16% fewer gates. In gen-
eral, the PUFFIN2 block cipher is a secure, area-efficient structure in comparison to
other proposed compact block ciphers.
Chapter 9 
Conclusions 
Implementation of AES with minimal resource cost or a desired cost trade-off is typi-
cally the goal of hardware design of AES. This dissertation presents an investigation
that helps to reach this goal from the aspect of implementation architecture. This
chapter concludes the dissertation with a summary of the research and contri-
butions and suggestions for future work.
9.1 Summary of Research and Contributions 
The research and contributions of this dissertation include the following:
Performance Characterization of AES S-Box and Datapath Implementa-
tion Architectures
The research is the first work in the literature that examines and compares the per-
formance of an extensive range of AES S-box and datapath implementations in terms
of timing, area, power and energy based on the same implementation technology.
Previous research usually focuses on techniques that improve certain perspec-
tives of the performance based on a specific implementation architecture, rather than
exploring the performance variation across different implementation architectures. The
architecture range investigated in this dissertation covers most of the possible typi-
cal architectures of hardware AES implementation, which include the S-box pipeline
configurations of component level 2 to 4 stages and gate level 2 to 7 stages and the
datapath architectures with the widths of 8, 16, 32 and 64 bits and unrolling factors
of 1, 2, 5 and 10 for 128-bit width. The performance of these implementations is char-
acterized based on the same implementation platform under a variety of throughput
constraints which cover a wide range of design requirements for various applications
of AES. The performance characterization allows for better understanding of the in-
fluence on the performance from the aspect of implementation architecture. The
characterization results show the extensive performance trade-offs offered by different
implementation architectures and can serve as a general reference for flexible and
efficient AES implementations.
Identification of Resource Efficient AES S-Box and Datapath Implemen-
tation Architectures
The research is the first work in the literature that applies pipelining to the S-box
implementation for the purpose of resource efficiency, instead of the high throughput pur-
pose that is investigated in previous research. Obvious resource reduction in terms
of area, power and/or energy is achieved with the appropriate pipeline configurations
compared with the non-pipelined S-box implementation under the same throughput
constraint. The research also identifies the power/energy efficient datapath architec-
ture. It finds that the most energy efficient architecture is not the lowest
power architecture, which was previously used for both low power and low energy purposes.
By combining the appropriate pipeline configuration and datapath archi-
tecture, the energy consumption is further significantly reduced compared with the
lowest power architecture.
Development of Novel and Efficient AES Datapath Architectures
The research develops the AES datapath architecture with 16, 32 and 64-bit
width. While the 16 and 64-bit width architectures are newly presented, the 32-
bit architecture has novel aspects which give the benefit of reduced storage and/or
fewer clock cycles compared to previous work. All of these architectures are designed
to complete the operation with the minimal number of clock cycles and with the
minimal number or close to the minimal number of registers. No specific memory
macro is required for these architectures since all the components are composed of
standard cells. The development of these architectures contributes to the flexible
implementation of AES.
Design of a Lightweight Block Cipher PUFFIN2
PUFFIN2 is a lightweight block cipher developed based on its predecessor PUF-
FIN. The cryptanalysis of PUFFIN2 shows that it can provide modest security
strength suitable for many lightweight security applications. PUFFIN2 has an in-
volutional structure and the datapath can be used for key generation. These features
allow for very compact serialized hardware implementation. Compared with the pre-
vious most compact block cipher PRESENT, which requires different hardware for
encryption and decryption operations, PUFFIN2 can be built with smaller area and
has the same hardware for both encryption and decryption operations.
9.2 Suggestions for Future Work 
Due to the limit of time, some further in-depth research has not been conducted and
included in this dissertation. This work is described in the following.
Investigation of Pipeline Configurations for Specific AES S-Box Structures
The investigation of pipeline configurations for the AES S-box is based on a
typical composite field structure from [5] in order to show the typical effect on perfor-
mance by pipelining. There exist a number of other composite field S-box structures,
including [6], [7], [8], [9], [10] and [13], and these structures are usually elaborately
developed to achieve certain performance benefits over the typical structure we adopt
in this dissertation. It can be expected that investigating the pipeline configurations
for each of the specific S-box structures would result in pipelined S-box implemen-
tations with better performance than those examined in this dissertation. It is also
worthwhile to investigate new composite field structures that are specifically devel-
oped for pipelined structures. Moreover, the placement of pipeline registers examined
in this dissertation is determined either by the synthesis tool at the gate level or
by the virtual component boundary at the component level. There actually exist
many more options for register placement and these can be explored and optimized
for performance improvement. It is also necessary to conduct a performance com-
parison between the pipelined AES S-boxes based on composite field structures and
S-boxes of structures other than composite field structures, such as the decoder-
permutation-encoder structure for low power purposes in [14]. In this way, a more
complete picture of the performance characteristics of the hardware implementation of
the AES S-box can be achieved.
Investigation of AES Datapath Architecture with Inner Round Pipeline
In this dissertation, we only investigate the architectures with outer round pipelin-
ing, i.e., pipelining that is only applied between two consecutive round functions for the
128-bit width datapath, and some cases of inner round pipelining (the pipeline within
the round functions) by using the pipelined S-boxes (as is shown in Chapter 7). By
considering the entire hardware implementation of the round function of the AES
including ShiftRows and MixColumns, there are more datapath architecture pipeline
options available for the exploration of better performance and more performance
trade-offs.
Investigation of the Influence on Side Channel Attacks under Various AES
Datapath Architectures
Due to the increasing significance of the ability to resist side channel attacks for
block cipher implementations, it would be interesting to investigate the behaviour
of the different AES datapath architectures under side channel attacks. Attention
could be given to the investigation of the behaviour of the architectures under power
analysis since these architectures have quite different power characteristics, as
shown in this dissertation.
Design of the Successor of PUFFIN2 with Improved Security
According to the recently proposed cryptanalysis against PUFFIN2 in [70], PUF-
FIN2 does not fully achieve the originally claimed security strength and the vulnera-
bility lies in the involutional property of the cipher. It would be desirable to design
a new lightweight block cipher based on PUFFIN and PUFFIN2 that has en-
hanced security resistant to the recently proposed cryptanalysis while still featuring
an involutional structure so that the compactness and the support for both encryption
and decryption remain.
Bibliography
[1] US National Institute of Standards and Technology, "Advanced Encryption Stan-
dard," Federal Information Processing Standards Publication, no. 197, 2001.
[2] A. Bogdanov, D. Khovratovich, and C. Rechberger, "Biclique cryptanalysis of the
full AES," in Proceedings of the 17th international conference on The Theory and
Application of Cryptology and Information Security, Lecture Notes in Computer
Science, pp. 344-371, Springer-Verlag, 2011.
[3] W. Stallings, Cryptography and Network Security. Prentice Hall, fourth ed., 2000.
[4] V. Rijmen, "Efficient implementation of the Rijndael S-Box,"
http://www.esat.kuleuven.ac.be/~rijmen/rijndael/
[5] J. Wolkerstorfer, E. Oswald, and M. Lamberger, "An ASIC Implementation
of the AES SBoxes," in Proceedings of Topics in Cryptology (CT-RSA 2002),
vol. 2271, pp. 67-78, 2002.
[6] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A Compact Rijndael Hard-
ware Architecture with S-Box Optimization," in Proceedings of Advances in
Cryptology (ASIACRYPT 2001), vol. 2248 of Lecture Notes in Computer Sci-
ence, pp. 239-254, 2001.
[7] D. Canright, "A Very Compact S-Box for AES," in Proceedings of the Inter-
national Workshop on Cryptographic Hardware and Embedded Systems (CHES
2005), vol. 3659 of Lecture Notes in Computer Science, pp. 441-455, 2005.
[8] X. Zhang and K. K. Parhi, "On the Optimum Constructions of Composite Field
for the AES Algorithm," IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 53, pp. 1153-1157, 2006.
[9] S. Nikova, V. Rijmen, and M. Schläffer, "Using Normal Bases for Compact
Hardware Implementations of the AES S-Box," in Proceedings of the 6th Con-
ference on Security and Cryptography for Networks (SCN 2008), pp. 236-245,
2008.
[10] M. M. Kermani and A. Reyhani-Masoleh, "A Low-Cost S-box for the Advanced
Encryption Standard Using Normal Basis," in Proceedings of the IEEE Inter-
national Conference on Electro/Information Technology (EIT 2009), pp. 52-55,
2009.
[11] M. M. Wong, M. L. D. Wong, A. K. Nandi, and I. Hijazin, "Construction of
Optimum Composite Field Architecture for Compact High-Throughput AES S-
Boxes," IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 99, pp. 1-5, 2011.
[12] M. M. Wong, M. L. D. Wong, A. K. Nandi, and I. Hijazin, "Composite Field
GF(((2^2)^2)^2) Advanced Encryption Standard (AES) S-box with Algebraic Nor-
mal Form Representation in the Subfield Inversion," IET Circuits, Devices &
Systems, vol. 5, pp. 471-476, 2011.
[13] Y. Nogami, K. Nekado, T. Toyota, N. Hongo, and Y. Morikawa, "Mixed Bases for
Efficient Inversion in GF(((2^2)^2)^2) and Conversion Matrices of SubBytes of AES,"
in Proceedings of the International Workshop on Cryptographic Hardware and
Embedded Systems (CHES 2010), Lecture Notes in Computer Science, pp. 234-
247, 2010.
[14] G. Bertoni, M. Macchetti, L. Negri, and P. Fragneto, "Power-Efficient ASIC
Synthesis of Cryptographic S-Boxes," in Proceedings of the 14th ACM Great
Lakes symposium on VLSI (GLSVLSI04), pp. 277-281, 2004.
[15] S. Morioka and A. Satoh, "An Optimized S-box Circuit Architecture for Low
Power AES Design," in Proceedings of the International Workshop on Crypto-
graphic Hardware and Embedded Systems (CHES 2002), Lecture Notes in Com-
puter Science, pp. 271-295, Springer-Verlag, 2003.
[16] Y. Zeng, X. Zou, Z. Liu, and J. Lei, "A Low-Power Rijndael S-box Based on
Pass Transmission Gate and Composite Field Arithmetic," Journal of Zhejiang
University - Science A, vol. 8, pp. 1553-1559, 2007.
[17] S. Morioka and A. Satoh, "A High Throughput Low Power Compact AES S-
Box Implementation Using Composite Field Arithmetic and Algebraic Normal
Form Representation," in Proceedings of the 2nd Asia Symposium on Quality
Electronic Design (ASQED 2010), pp. 318-323, 2010.
[18] S. Tillich, M. Feldhofer, T. Popp, and J. Großschädl, "Area, Delay, and Power
Characteristics of Standard-cell Implementations of the AES S-Box," Journal of
Signal Processing Systems, vol. 50, pp. 251-261, 2008.
[19] X. Zhang and K. Parhi, "High-speed VLSI Architectures for the AES Algorithm,"
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12,
pp. 957-967, 2004.
[20] A. Hodjat and I. Verbauwhede, "Area-Throughput Trade-offs for Fully Pipelined
30 to 70 Gbits/s AES Processors," IEEE Transactions on Computers, vol. 55,
pp. 366-372, 2006.
[21] T. Good and M. Benaissa, "AES on FPGA from the Fastest to the Smallest," in
Proceedings of the International Workshop on Cryptographic Hardware and Em-
bedded Systems (CHES 2005), vol. 3659 of Lecture Notes in Computer Science,
pp. 427-440, 2005.
[22] T. Good and M. Benaissa, "Pipelined AES on FPGA with Support for Feedback
Modes (in a Multi-Channel Environment)," IET Information Security, vol. 1,
pp. 1-10, 2007.
[23] K. U. Järvinen, M. T. Tommiska, and J. O. Skyttä, "A Fully Pipelined Memo-
ryless 17.8 Gbps AES-128 Encryptor," in Proceedings of the 2003 ACM/SIGDA
eleventh international symposium on Field programmable gate arrays, pp. 207-
215, 2003.
[24] N. Iyer, P. Anandmohan, D. Poornaiah, and V. Kulkarni, "High Throughput,
Low Cost, Fully Pipelined Architecture for AES Crypto Chip," in Proceedings
of the 2006 Annual IEEE India Conference, pp. 1-6, 2006.
[25] M.-Y. Wang, C.-P. Su, C.-L. Horng, C.-W. Wu, and C.-T. Huang, "Single- and
Multi-core Configurable AES Architectures for Flexible Security," IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, vol. 18, pp. 541-552,
2010.
[26] S. Mangard, M. Aigner, and S. Dominikus, "A Highly Regular and Scalable AES
Hardware Architecture," IEEE Transactions on Computers, vol. 52, pp. 483-491,
2003.
[27] P. Chodowiec and K. Gaj, "Very Compact FPGA Implementation of the AES Al-
gorithm," in Proceedings of the International Workshop on Cryptographic Hard-
ware and Embedded Systems (CHES 2003), vol. 2779 of Lecture Notes in Com-
puter Science, pp. 319-333, 2003.
[28] N. Pramstaller and J. Wolkerstorfer, "A Universal and Efficient AES Co-
processor for Field Programmable Logic Arrays," in Proceedings of the 14th
Annual International Conference on Field-Programmable Logic and Applications
(FPL 2004), pp. 565-574, 2004.
[29] C. Chang, C. Huang, K. Chang, Y. Chen, and C. Hsieh, "High Throughput 32-
bit AES Implementation in FPGA," in Proceedings of the 9th IEEE Asia Pacific
Conference on Circuits and Systems (APCCAS 2008), pp. 1806-1809, 2008.
[30] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen, "AES Implementation on a Grain of
Sand," IET Information Security, vol. 152, pp. 13-20, 2005.
[31] P. Hämäläinen, T. Alho, M. Hännikäinen, and T. D. Hämäläinen, "Design and
Implementation of Low-Area and Low-Power AES Encryption Hardware Core,"
in Proceedings of the 9th EUROMICRO Conference on Digital System Design
(DSD 2006), pp. 577-583, 2006.
[:12[ T. Good and r.. 1. Bcnaissa, "692-IlW Advnnuxl Encryption Standard (AES) Oil 
n O.IJ-llnl Cr.. IOS." IEEE 7hlllSadlOl!S on Vel1l Large Scale Illtegroliot! (VLSI) 
Systems, vol. 18, pp. 1753 1757, 2010. 
[33) D. A. Group, Digitlll ASIC Design: A Tutolial all the Desi!}11 Flow. LU lid Uni-
VPfsity, SWl.'{lcn, 2005 
[34) SYIlOpSyS, D~~'ig1t Compiler Usn' Guide VersiOl! D-2010.0:J-SP2. 2010 
[35) r.. 1. Keilting, D. Flynn, H. Aitken, A. Gibbons, ilud 1'- Shi, Low Power Me/hOi/-
ology MlltIljli/: For System-all -Chip Df!.~ign. Springer, 20()7. 
[36) Synopsys, PrimeTime PX User Gllid~ Version D-201O.06. 2010. 
[37) F.-X . Standaert, C. Piret, G. HOll\TOY, J .-J. Qllisqllater, and J.-D. ['f'gat , "ICE-
BERG All Involutional Cipher Efficient for Block Encryption in Heconfigurahle 
Hnfdware," in Proceetiillgs of Ihe 11th intenwtiOllill Workshop 011 Fast Soft-
ware Encryption (FSE 2004). i£cture ,\"otes in Compute)' SciellCf', pp. 279 299, 
Springer- Verlag, 2004. 
[38) C. Wilng and II. Il eys "Using a P ipplined S-Box in Compact AES Hardware 
Implementations," in Pruceedillgii of the 8th IEEE IlItel1wtiOlw/ NEWCIlS COII -
ferellee , pp. 101- 104,2010 
[39) A. Hodjat, D. D. Hwang, B. Lai, K. Tiri , and 1. Vcrbnu\\'hcde, "A 3.84 Cbit.s/~ 
AES Crypto Coprocessor with Modes of Operation ill a O.IS-I!m C1-dOS Tech-
Ilology," ill Proceediu!}ii of the 15th ACM Clmt Lakes symposIUm QII VIS' 
(GLSIILSI 2005), pp. 60 63, 2005 
[40] S. K. Mathew, F. Sheikh, M. Kounavis, S. Gueron, A. Agarwal, S. K. Hsu,
H. Kaul, M. A. Anders, and R. K. Krishnamurthy, "53 Gbps Native Composite-
Field AES-Encrypt/Decrypt Accelerator for Content-Protection in 45 nm High-
Performance Microprocessors," IEEE Journal of Solid-State Circuits, vol. 46,
pp. 767-776, 2011.
[41] A. van der Werf, J. L. van Meerbergen, E. H. L. Aarts, W. F. J. Verhaegh, and
P. E. R. Lippens, "Efficient Timing Constraint Derivation for Optimally Retiming
High Speed Processing Units," in Proceedings of the 7th International Symposium
on High-Level Synthesis, pp. 48-53, 1994.
[42] J. Leijten, J. van Meerbergen, and J. Jess, "Analysis and Reduction of Glitches
in Synchronous Networks," in Proceedings of the 1995 European Conference on
Design and Test, pp. 398-401, 1995.
[43] J. Zambreno, D. Nguyen, and A. Choudhary, "Exploring Area/Delay Tradeoffs in
an AES FPGA Implementation," in Proceedings of the 14th Annual International
Conference on Field-Programmable Logic and Applications (FPL 2004), pp. 575-585,
2004.
[44] P. Hämäläinen, T. Alho, M. Hännikäinen, T. D. Hämäläinen, T. Järvinen,
P. Salmela, and J. Takala, "Efficient Byte Permutation Realizations for Compact
AES Implementations," in Proceedings of the 13th European Signal Processing
Conference (EUSIPCO 2005), 2005.
[45] H. Li and Z. Friggstad, "An Efficient Architecture for the AES Mix Columns
Operation," in Proceedings of the 2005 IEEE International Symposium on Circuits
and Systems (ISCAS 2005), pp. 4637-4640, 2005.
[46] C. Wang and H. M. Heys, "An Ultra Compact Block Cipher for Serialized
Architecture Implementations," in Proceedings of the 2009 Canadian Conference
on Electrical and Computer Engineering (CCECE 2009), pp. 1086-1090, 2009.
[47] A. Bogdanov, L. Knudsen, G. Leander, C. Paar, A. Poschmann, M. Robshaw,
Y. Seurin, and C. Vikkelsoe, "PRESENT: An Ultra-Lightweight Block Cipher,"
in Proceedings of the International Workshop on Cryptographic Hardware and
Embedded Systems (CHES 2007), vol. 4727 of Lecture Notes in Computer Science,
pp. 450-466, 2007.
[48] C. Rolfes, A. Poschmann, G. Leander, and C. Paar, "Ultra-Lightweight
Implementations for Smart Devices - Security for 1000 Gate Equivalents," in
Proceedings of Smart Card Research and Advanced Application Conference (CARDIS
2008), vol. 5189 of Lecture Notes in Computer Science, pp. 89-103, 2008.
[49] C. Lim and T. Korkishko, "mCrypton - A Lightweight Block Cipher for Security
of Low-Cost RFID Tags and Sensors," in Proceedings of Information Security
Applications (WISA 2005), vol. 3786 of Lecture Notes in Computer Science,
pp. 243-258, 2006.
[50] D. Hong, J. Sung, S. Hong, J. Lim, S. Lee, B. Koo, C. Lee, D. Chang, J. Lee,
K. Jeong, H. Kim, J. Kim, and S. Chee, "HIGHT: A New Block Cipher Suitable
for Low-Resource Device," in Proceedings of the International Workshop
on Cryptographic Hardware and Embedded Systems (CHES 2006), vol. 4249 of
Lecture Notes in Computer Science, pp. 46-59, 2006.
[51] F.-X. Standaert, G. Piret, N. Gershenfeld, and J.-J. Quisquater, "SEA: A
Scalable Encryption Algorithm for Small Embedded Applications," in Proceedings
of Smart Card Research and Applications (CARDIS 2006), vol. 3928 of Lecture
Notes in Computer Science, pp. 222-236, 2006.
[52] H. Cheng, H. Heys, and C. Wang, "PUFFIN: A Novel Compact Block Cipher
Targeted to Embedded Digital Systems," in Proceedings of the 11th EUROMICRO
Conference on Digital System Design Architectures, Methods and Tools
(DSD 2008), pp. 383-390, 2008.
[53] C. Cannière, O. Dunkelman, and M. Knežević, "KATAN and KTANTAN - A
Family of Small and Efficient Hardware-Oriented Block Ciphers," in Proceedings
of the International Workshop on Cryptographic Hardware and Embedded Systems
(CHES 2009), Lecture Notes in Computer Science, pp. 272-288, Springer-Verlag,
2009.
[54] M. Izadi, B. Sadeghiyan, S. Sadeghian, and H. Khanooki, "MIBS: A New
Lightweight Block Cipher," in Proceedings of Cryptology and Network Security
2009 (CANS 2009), Lecture Notes in Computer Science, pp. 334-348, Springer-
Verlag, 2009.
[55] W. Wu and L. Zhang, "LBlock: A Lightweight Block Cipher," in Proceedings of
the 9th International Conference on Applied Cryptography and Network Security
(ACNS 2011), Lecture Notes in Computer Science, pp. 327-344, Springer-Verlag,
2011.
[56] C. Shannon, "Communication Theory of Secrecy Systems," Bell System
Technical Journal, vol. 28, pp. 656-715, 1949.
[57] D. R. Stinson, Cryptography: Theory and Practice. CRC Press, third ed., 2006.
[58] H. Feistel, "Cryptography and Computer Privacy," Scientific American, vol. 228,
1973.
[59] J. B. Kam and G. I. Davida, "Structured Design of Substitution-Permutation
Encryption Networks," IEEE Transactions on Computers, vol. C-28, 1979.
[60] H. Cheng, Compact Hardware Implementation of Block Cipher with Concurrent
Error Detection. Master Thesis, Faculty of Engineering and Applied Science,
Memorial University, 2007.
[61] E. Biham and A. Shamir, "Differential Cryptanalysis of DES-like
Cryptosystems," in Proceedings of Advances in Cryptology (CRYPTO 1990), vol. 537 of
Lecture Notes in Computer Science, pp. 2-21, 1991.
[62] M. Matsui, "Linear Cryptanalysis Method for DES Cipher," in Proceedings of
Advances in Cryptology (EUROCRYPT 1993), vol. 765 of Lecture Notes in
Computer Science, pp. 386-397, 1994.
[63] X. Lai, J. L. Massey, and S. Murphy, "Markov Ciphers and Differential
Cryptanalysis," in Proceedings of Advances in Cryptology (EUROCRYPT 1991),
Lecture Notes in Computer Science, pp. 17-38, Springer-Verlag, 1991.
[64] K. Nyberg, "Linear Approximations of Block Ciphers," in Proceedings of
Advances in Cryptology (EUROCRYPT 1994), Lecture Notes in Computer Science,
pp. 439-444, Springer-Verlag, 1995.
[65] E. Biham, "New Types of Cryptanalytic Attacks Using Related Keys," in
Proceedings of Advances in Cryptology (EUROCRYPT 1993), vol. 765 of Lecture
Notes in Computer Science, pp. 229-246, 1994.
[66] L. Brown, J. Pieprzyk, and J. Seberry, "LOKI: A Cryptographic Primitive for
Authentication and Secrecy Applications," in Proceedings of Advances in
Cryptology (AUSCRYPT 1990), Lecture Notes in Computer Science, pp. 229-236,
1990.
[67] A. Sorkin, "Lucifer: a Cryptographic Algorithm," Cryptologia, vol. 8,
pp. 22-41, 1984.
[68] J. H. Moore and G. J. Simmons, "Cycle Structure of the DES for Keys Having
Palindromic (or Antipalindromic) Sequences of Round Keys," IEEE Transactions
on Software Engineering, vol. SE-13, pp. 262-273, 1987.
[69] G. Leander, "On Linear Hulls, Statistical Saturation Attacks, PRESENT and
a Cryptanalysis of PUFFIN," in Proceedings of Advances in Cryptology
(EUROCRYPT 2011), Lecture Notes in Computer Science, pp. 303-322, Springer-
Verlag, 2011.
[70] C. Blondeau and B. Gérard, "Differential Cryptanalysis of PUFFIN and
PUFFIN2," in ECRYPT Workshop on Lightweight Cryptography 2011, pp. 35-54,
2011.
[71] C. Blondeau and B. Gérard, "Multiple Differential Cryptanalysis: Theory and
Practice," in Proceedings of the International Workshop on Fast Software
Encryption (FSE 2011), vol. 6733 of Lecture Notes in Computer Science, pp. 35-54,
2011.
Appendix A
Description of the Operation of the
ShiftRows Components
The operation of the ShiftRows components shown in Figures 6.2, 6.3, 6.4 and
6.5 is controlled through the multiplexers. All the 8-bit registers are driven with a
continuous clock. In order to demonstrate the operation of these components, the
contents of the registers at some selected clock cycles are shown in Tables A.1, A.2,
A.3 and A.4 for Figures 6.2, 6.3, 6.4 and 6.5, respectively, where the first clock cycle
is denoted as CC00 and the n-th clock cycle after CC00 is denoted as CCn. The
content of a register is a byte of the State following the notation in Figure 2.1.
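
As a rough illustration of the byte ordering that a serialized ShiftRows stage has to realize, the following Python sketch lists the order in which State bytes enter and leave such a stage. It assumes the usual AES convention that Br,c is the byte in row r and column c, that row r is rotated left by r positions, and that bytes are streamed column by column. The helper names (shift_rows, serial_order, pending) are illustrative and are not taken from the thesis. The pending function further assumes an 8-bit datapath whose twelve registers are already full at CC00, with one byte entering and one byte leaving per clock cycle.

```python
def shift_rows(state):
    """Rotate row r of a 4x4 State left by r positions (AES ShiftRows)."""
    return [[state[r][(c + r) % 4] for c in range(4)] for r in range(4)]


def serial_order(state):
    """Flatten a State column by column, the assumed streaming order."""
    return [state[r][c] for c in range(4) for r in range(4)]


# Label every byte as B<row>,<column>.
state = [[f"B{r},{c}" for c in range(4)] for r in range(4)]

input_stream = serial_order(state)               # B0,0 B1,0 B2,0 B3,0 B0,1 ...
output_stream = serial_order(shift_rows(state))  # B0,0 B1,1 B2,2 B3,3 B0,1 ...


def pending(k, registers=12):
    """Bytes of the current State still stored k cycles after CC00.

    Assumption: `registers` bytes are already stored at CC00 and one byte
    enters and one byte leaves per cycle (8-bit width).
    """
    arrived = input_stream[:min(16, registers + k)]
    emitted = set(output_stream[:k])
    return [b for b in arrived if b not in emitted]


if __name__ == "__main__":
    print("in :", " ".join(input_stream))
    print("out:", " ".join(output_stream))
    for k in (0, 4, 12):
        print(f"CC{k:02d}:", " ".join(pending(k)))
```

Under these assumptions the set of stored bytes never exceeds twelve, which is why twelve 8-bit registers suffice even though the State has sixteen bytes; the positions of the bytes inside the registers depend on the multiplexer wiring of the figures and are not modelled here.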
Table A.1: Contents of the registers of the 8-bit width ShiftRows component at
the selected clock cycles
      R01    R02    R03    R04    R05    R06    R07    R08    R09    R10    R11    R12
CC00  B0,0   B1,0   B2,0   B3,0   B0,1   B1,1   B2,1   B3,1   B0,2   B1,2   B2,2   B3,2
CC01  B1,0   B2,0   B3,0   B0,1   B1,1   B2,1   B3,1   B0,2   B1,2   B2,2   B3,2   B0,3
CC04  B0,1   B1,0   B2,0   B3,0   B0,2   B1,2   B2,1   B3,1   B0,3   B1,3   B2,3   B3,2
CC05  B1,0   B2,0   B3,0   B0,2   B1,2   B2,1   B3,1   B0,3   B1,3   B2,3   B3,2   B'0,0
CC09  B1,0   B2,0   B3,1   B0,3   B1,3   B2,1   B3,2   B'0,0  B'1,0  B'2,0  B'3,0  B'0,1
CC12  B0,3   B1,0   B2,1   B3,2   B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1  B'2,1  B'3,1
CC16  B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1  B'2,1  B'3,1  B'0,2  B'1,2  B'2,2  B'3,2
Table A.2: Contents of the registers of the 16-bit width ShiftRows component at
the selected clock cycles
      R01    R02    R03    R04    R05    R06    R07    R08    R09    R10    R11    R12
CC00  B0,0   B1,0   B2,0   B3,0   B0,1   B1,1   B2,1   B3,1   B0,2   B1,2   B2,2   B3,2
CC01  B2,0   B3,0   B0,1   B1,0   B2,1   B3,1   B0,2   B1,2   B2,2   B3,2   B0,3   B1,3
CC02  B0,1   B1,0   B2,1   B3,0   B0,2   B1,2   B2,0   B3,1   B0,3   B1,3   B2,3   B3,2
CC03  B2,1   B3,0   B0,2   B1,0   B2,0   B3,1   B0,3   B1,3   B2,3   B3,2   B'0,0  B'1,0
CC05  B2,0   B3,1   B0,3   B1,0   B2,1   B3,2   B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1
CC06  B0,3   B1,0   B2,1   B3,2   B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1  B'2,1  B'3,1
CC08  B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1  B'2,1  B'3,1  B'0,2  B'1,2  B'2,2  B'3,2
Table A.3: Contents of the registers of the 32-bit width ShiftRows component at
the selected clock cycles
      R01    R02    R03    R04    R05    R06    R07    R08    R09    R10    R11    R12
CC00  B0,0   B1,0   B2,0   B3,0   B0,1   B1,1   B2,1   B3,1   B0,2   B1,2   B2,2   B3,2
CC01  B0,1   B1,0   B2,1   B3,0   B0,2   B1,2   B2,0   B3,1   B0,3   B1,3   B2,3   B3,2
CC02  B0,2   B1,0   B2,0   B3,1   B0,3   B1,3   B2,1   B3,2   B'0,0  B'1,0  B'2,0  B'3,0
CC03  B0,3   B1,0   B2,1   B3,2   B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1  B'2,1  B'3,1
CC04  B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1  B'2,1  B'3,1  B'0,2  B'1,2  B'2,2  B'3,2
Table A.4: Contents of the registers of the 64-bit width ShiftRows component at
the selected clock cycles
CC02  B'0,0  B'1,0  B'2,0  B'3,0  B'0,1  B'1,1  B'2,1  B'3,1
Appendix B
Description of the Operation of the
MixColumns Components
The operation of the MixColumns components shown in Figures 6.6, 6.7 and 6.8
is controlled through the multiplexers and the AND gates. All the 8-bit registers
are driven with a continuous clock. In order to demonstrate the operation of these
components, the contents of the registers at the clock cycles of an operation are shown
in Tables B.1, B.2 and B.3 for Figures 6.6, 6.7 and 6.8, respectively, where the first
clock cycle is denoted as CC00 and the n-th clock cycle after CC00 is denoted as
CCn. The content of a register is a byte following the notation in (2.2).
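
The accumulate-and-rotate behaviour that a byte-serial MixColumns datapath can use may be sketched in Python as follows. This is an illustration under stated assumptions rather than a description of the circuit in Figure 6.6: it assumes the standard AES MixColumns matrix and the field polynomial x^8 + x^4 + x^3 + x + 1, models four accumulators that start from zero (the role the AND gates are assumed to play at the start of a column), and rotates the partial sums by one register for every input byte. The names xtime, mul, mix_column and mix_column_serial are hypothetical.

```python
def xtime(b):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def mul(coef, b):
    """Multiply byte b by one of the MixColumns constants 01, 02 or 03."""
    if coef == 1:
        return b
    if coef == 2:
        return xtime(b)
    if coef == 3:
        return xtime(b) ^ b
    raise ValueError("only 01, 02 and 03 are needed for MixColumns")


def mix_column(col):
    """Reference MixColumns on one column [B0, B1, B2, B3]."""
    b0, b1, b2, b3 = col
    return [
        mul(2, b0) ^ mul(3, b1) ^ b2 ^ b3,
        b0 ^ mul(2, b1) ^ mul(3, b2) ^ b3,
        b0 ^ b1 ^ mul(2, b2) ^ mul(3, b3),
        mul(3, b0) ^ b1 ^ b2 ^ mul(2, b3),
    ]


def mix_column_serial(col):
    """Byte-serial variant: one input byte per cycle, four accumulators."""
    coefs = (1, 1, 3, 2)   # assumed multiplier in front of each accumulator
    acc = [0, 0, 0, 0]     # cleared at the start of a column (assumption)
    for byte in col:
        # Each accumulator takes its neighbour's partial sum plus coef * byte.
        acc = [acc[(i + 1) % 4] ^ mul(coefs[i], byte) for i in range(4)]
    return acc


if __name__ == "__main__":
    column = [0xDB, 0x13, 0x53, 0x45]
    print([hex(x) for x in mix_column(column)])
    print([hex(x) for x in mix_column_serial(column)])
```

After the first input byte the four accumulators hold B0, B0, 03B0 and 02B0, and after the fourth byte they hold the four output bytes of the column, so the serial loop and the direct formula agree.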
Table B.1: Contents of the registers of the 8-bit width MixColumns component for
the clock cycles during a complete operation (m = n + 1)
      R01                R02                R03                R04
CC00  B0,n ⊕ 02B1,n ⊕    B0,n ⊕ B1,n ⊕      03B0,n ⊕ B1,n      B0,m
      03B2,n ⊕ B3,n      02B2,n ⊕ 03B3,n    ⊕ B2,n ⊕ 02B3,n
CC01  B0,n ⊕ B1,n ⊕      03B0,n ⊕ B1,n      don't care         B0,m ⊕ B1,m
      02B2,n ⊕ 03B3,n    ⊕ B2,n ⊕ 02B3,n
CC02  03B0,n ⊕ B1,n      don't care         don't care         03B0,m ⊕
      ⊕ B2,n ⊕ 02B3,n                                          B1,m ⊕ B2,m
CC03  don't care         don't care         don't care         02B0,m ⊕ 03B1,m
                                                               ⊕ B2,m ⊕ B3,m

      R05                R06                R07
CC00  B0,m               03B0,m             02B0,m
CC01  03B0,m ⊕ B1,m      02B0,m ⊕ 03B1,m    B0,m ⊕ 02B1,m
CC02  02B0,m ⊕           B0,m ⊕ 02B1,m      B0,m ⊕ B1,m
      03B1,m ⊕ B2,m      ⊕ 03B2,m           ⊕ 02B2,m
CC03  B0,m ⊕ 02B1,m ⊕    B0,m ⊕ B1,m ⊕      03B0,m ⊕ B1,m
      03B2,m ⊕ B3,m      02B2,m ⊕ 03B3,m    ⊕ B2,m ⊕ 02B3,m
Table B.2: Contents of the registers of the 16-bit width MixColumns component
for the clock cycles during a complete operation (m = n + 1)
      R01                R02                R03
CC00  B0,n ⊕ B1,n ⊕      B0,m ⊕ B1,m        02B0,m ⊕ 03B1,m
      02B2,n ⊕ 03B3,n
CC01  don't care         02B0,m ⊕ 03B1,m    B0,m ⊕ B1,m ⊕
                         ⊕ B2,m ⊕ B3,m      02B2,m ⊕ 03B3,m

      R04                R05                R06
CC00  03B0,n ⊕ B1,n      03B0,m ⊕ B1,m      B0,m ⊕ 02B1,m
      ⊕ B2,n ⊕ 02B3,n
CC01  don't care         B0,m ⊕ 02B1,m ⊕    03B0,m ⊕ B1,m
                         03B2,m ⊕ B3,m      ⊕ B2,m ⊕ 02B3,m
Table B.3: Contents of the registers of the 32-bit width MixColumns component
for the clock cycles during a complete operation



