Multi-Mode Operator for SHA-2 Hash Functions by Glabb, Ryan et al.
Multi-Mode Operator for SHA-2 Hash Functions
Ryan Glabb, Laurent Imbert, Graham Jullien, Arnaud Tisserand, Nicolas
Veyrat-Charvillon
To cite this version:
Ryan Glabb, Laurent Imbert, Graham Jullien, Arnaud Tisserand, Nicolas Veyrat-Charvillon.
Multi-Mode Operator for SHA-2 Hash Functions. Journal of Systems Architecture, Else-
vier, 2007, Special Issue on Embedded Hardware for Cryptosystems, 52 (2-3), pp.127-138.
<10.1016/j.sysarc.2006.09.006>. <lirmm-00126262>
HAL Id: lirmm-00126262
https://hal-lirmm.ccsd.cnrs.fr/lirmm-00126262
Submitted on 19 Jun 2007
Journal of Systems Architecture 53 (2007) 127–138
www.elsevier.com/locate/sysarc
Multi-mode operator for SHA-2 hash functions
Ryan Glabb a, Laurent Imbert b,a,c, Graham Jullien a,
Arnaud Tisserand b, Nicolas Veyrat-Charvillon d,*
a ATIPS Laboratories, Department of Electrical and Computer Engineering, University of Calgary, Calgary, Alberta, Canada T2N 1N4
b Arith Group, LIRMM, CNRS – University Montpellier 2, 161 rue Ada, F-34392 Montpellier, France
c CISaC, Department of Mathematics and Statistics, University of Calgary, Calgary, Alberta, Canada T2N 1N4
d Arénaire Team, LIP (CNRS–ENSL–INRIA–UCBL), ÉNS de Lyon, 46 allée d’Italie, F-69364 Lyon, France
Received 28 April 2006; received in revised form 1 September 2006; accepted 15 September 2006
Available online 1 November 2006
Abstract
We propose an improved implementation of the SHA-2 hash family, with minimal operator latency and reduced hardware requirements. We also propose a high frequency version at the cost of only two cycles of latency per message. Finally we present a multi-mode architecture able to perform either a SHA-384 or SHA-512 hash or to behave as two independent SHA-224 or SHA-256 operators. Such capability adds increased flexibility for applications ranging from a server running multiple streams to independent pseudorandom number generation. We also demonstrate that our architecture achieves a performance comparable to separate implementations while requiring much less hardware.
© 2006 Elsevier B.V. All rights reserved.
Keywords: FPGA; Hash function; SHA-2 family; Multi-mode operator
1. Introduction
Cryptographic hash functions [1] are a fundamental tool in modern cryptography, used mainly to ensure data integrity when transmitting information over insecure channels. Hash functions are also used for the implementation of digital signature algorithms, keyed-hash message authentication codes and in random number generators. Many hash functions exist [2–4], but their actual security level is very difficult to estimate. Whenever weaknesses are found [5], security is compromised and any stand-alone implementations must be phased out, leading to costly upgrades toward a new hash function that is deemed secure at that time.
* Corresponding author. E-mail address: Nicolas.Veyrat-Charvillon@ens-lyon.fr (N. Veyrat-Charvillon).
For example, an algorithm has recently been discovered [6] that decreases the collision resistance of SHA-1 (Secure Hash Algorithm) [7], the most popular hash function so far, reducing the number of necessary computations from $2^{80}$ to $2^{69}$ and putting it below the accepted security threshold for high-security operations. Since then, the SHA-2 family of hash functions [8], developed by the National Institute of Standards and Technology (NIST), has become the new standard.
Table 1
Secure hash algorithm characteristics (all sizes are given in bits)
Algorithm   Word (w)   Message size (l)   Block (m)   Digest   Security
SHA-224     32         < 2^64             512         224      112
SHA-256     32         < 2^64             512         256      128
SHA-384     64         < 2^128            1024        384      192
SHA-512     64         < 2^128            1024        512      256
Due to their complexity and limited lifespan, cryptographic primitives are generally implemented in software on general purpose processors rather than on specialized hardware architectures. Hardware implementations are also far more expensive and often difficult to realize efficiently. On the other hand, software-based cryptographic algorithms are slower than their hardware counterparts, typically by one to three orders of magnitude. Many secure cryptographic algorithms such as AES (Advanced Encryption Standard) and SHA-1 were designed to be implemented in hardware, and are drastically less efficient when coded in software [1]. In terms of hardware implementations, the two principal approaches are Application-Specific Integrated Circuit (ASIC) technology and Field Programmable Gate Arrays (FPGAs). Due to their ease of use and lower cost, we have chosen FPGAs from the Xilinx Virtex and Spartan3 families for the prototyping phase and the synthesis results reported in this paper.
The aim of this work is to show the advantages of using reconfigurable hardware operators to compute various cryptographic primitives associated with the SHA-2 hash functions, using shared resources on a single chip-set.
2. SHA-2 hash standard
Throughout this paper, we will follow the definitions and notations used in the SHA-2 specification [8]. This specification details all steps of the hash algorithms and constants used in the computation. We will only report on the parts that are relevant to the implementation and optimization issues considered in this paper.
The SHA-2 hash standard specifies four secure hash algorithms, SHA-224, SHA-256, SHA-384, and SHA-512. All four of the algorithms are iterative, one-way hash functions that can process a message to produce a hashed representation called a message digest. Each algorithm can be described in two stages: preprocessing and hash computation. Preprocessing involves preparing the message through padding, parsing the padded message into m-bit blocks, and setting any initialization values to be used in the hash generation. The hash computation generates a message schedule from the padded message which is used, along with functions, constants and word operations, to iteratively generate a series of hash values. The final hash value generated by the hash computation is used to determine the message digest.
A message M of length l to be hashed is pro-
cessed by blocks of m bits. Each block is divided
in 16 w-bit words for computation, the word-size
w depending on the algorithm.
The most important difference between the four algorithms is the size of the message digest. Additionally, the algorithms differ in terms of the size of the blocks and words of data that are used during hashing (Table 1).
2.1. Functions and constants
Each hash algorithm uses six logical functions operating on w-bit words, denoted x, y, and z; the result of each function is also a w-bit word. In addition to the common Ch(x, y, z) and Maj(x, y, z) functions, four further functions are defined for SHA-224/256 and for SHA-384/512. The XOR operation ($\oplus$) in Ch and Maj may be replaced by a bitwise OR and produce identical results. $\mathrm{ROR}^n(x)$ and $\mathrm{SHR}^n(x)$ designate rotate right and shift right by n places, respectively.
\begin{align*}
\mathrm{Ch}(x, y, z) &= (x \wedge y) \oplus (\neg x \wedge z) \\
\mathrm{Maj}(x, y, z) &= (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z)
\end{align*}
2.1.1. Functions
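The 32-bit functions used for SHA-224 and SHA-256 are the following:
\begin{align*}
\Sigma_0^{\{256\}}(x) &= \mathrm{ROR}^{2}(x) \oplus \mathrm{ROR}^{13}(x) \oplus \mathrm{ROR}^{22}(x) \\
\Sigma_1^{\{256\}}(x) &= \mathrm{ROR}^{6}(x) \oplus \mathrm{ROR}^{11}(x) \oplus \mathrm{ROR}^{25}(x) \\
\sigma_0^{\{256\}}(x) &= \mathrm{ROR}^{7}(x) \oplus \mathrm{ROR}^{18}(x) \oplus \mathrm{SHR}^{3}(x) \\
\sigma_1^{\{256\}}(x) &= \mathrm{ROR}^{17}(x) \oplus \mathrm{ROR}^{19}(x) \oplus \mathrm{SHR}^{10}(x)
\end{align*}
The SHA-384 and SHA-512 algorithms use the following 64-bit functions:
\begin{align*}
\Sigma_0^{\{512\}}(x) &= \mathrm{ROR}^{28}(x) \oplus \mathrm{ROR}^{34}(x) \oplus \mathrm{ROR}^{39}(x) \\
\Sigma_1^{\{512\}}(x) &= \mathrm{ROR}^{14}(x) \oplus \mathrm{ROR}^{18}(x) \oplus \mathrm{ROR}^{41}(x) \\
\sigma_0^{\{512\}}(x) &= \mathrm{ROR}^{1}(x) \oplus \mathrm{ROR}^{8}(x) \oplus \mathrm{SHR}^{7}(x) \\
\sigma_1^{\{512\}}(x) &= \mathrm{ROR}^{19}(x) \oplus \mathrm{ROR}^{61}(x) \oplus \mathrm{SHR}^{6}(x)
\end{align*}
2.1.2. Constants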
Sixty-four 32-bit constants, $K_0^{\{256\}}, K_1^{\{256\}}, \ldots, K_{63}^{\{256\}}$, are used for the SHA-224 and SHA-256 algorithms. These words represent the first 32 bits of the fractional parts of the cube roots of the first 64 primes.
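As an illustration of this derivation, the following Python sketch recomputes the constants with exact integer arithmetic. The helper names (`first_primes`, `icbrt`, `round_constants`) are hypothetical and are not part of the paper or of the standard; the same routine called with w = 64 yields the eighty 64-bit constants of SHA-384 and SHA-512.

```python
# Sketch only: derive SHA-2 round constants from cube roots of primes.
# All helper names here are illustrative assumptions.

def first_primes(n):
    """Return the first n primes by trial division."""
    found, candidate = [], 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def icbrt(n):
    """Integer cube root: the largest r with r**3 <= n (binary search)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def round_constants(w):
    """First w fractional bits of the cube roots of the first 64 (w=32)
    or 80 (w=64) primes."""
    count = 64 if w == 32 else 80
    mask = (1 << w) - 1
    # cbrt(p * 2**(3w)) = cbrt(p) * 2**w, so the low w bits are the fraction.
    return [icbrt(p << (3 * w)) & mask for p in first_primes(count)]

K256 = round_constants(32)
K512 = round_constants(64)
print(hex(K256[0]), hex(K512[0]))  # 0x428a2f98 0x428a2f98d728ae22
```

Integer arithmetic is used rather than floating point because a double does not carry enough precision for 64 fractional bits.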
SHA-384 and SHA-512 use eighty 64-bit constants, $K_0^{\{512\}}, K_1^{\{512\}}, \ldots, K_{79}^{\{512\}}$. These words represent the first 64 bits of the fractional parts of the cube roots of the first eighty primes.
2.2. Preprocessing
Preprocessing consists of three steps: (1) padding the message M; (2) cutting the padded message into blocks; and (3) setting the initial hash value, $H^{(0)}$. The purpose of the padding is to ensure that the length of the padded message is a multiple of 512 or 1024 bits.
Let us consider an l-bit message M. A bit ‘1’ is appended, followed by k zero bits. For the SHA-224 and SHA-256 hash functions, k is the smallest non-negative solution to the equation $l + 1 + k \equiv 448 \pmod{512}$. A 64-bit block holding the binary representation of the message length l is then appended. The length of the padded message is now a multiple of 512 bits.
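The padding rule can be sketched in a few lines of Python. This is a minimal software model, assuming a byte-aligned message; the function name `pad` and its parameter `w` (word size, 32 or 64) are hypothetical. It uses the general form of the rule, with k chosen so that l + 1 + k is congruent to 14w modulo 16w and a length field of 2w bits.

```python
def pad(msg: bytes, w: int = 32) -> bytes:
    """Return the padded message for word size w (32 or 64 bits)."""
    l = 8 * len(msg)                      # message length in bits
    k = (14 * w - (l + 1)) % (16 * w)     # smallest k >= 0 with l+1+k = 14w (mod 16w)
    # message bits, then a '1' bit, then k zeros, then the 2w-bit length
    value = (((int.from_bytes(msg, "big") << 1) | 1) << (k + 2 * w)) | l
    total_bits = l + 1 + k + 2 * w
    return value.to_bytes(total_bits // 8, "big")

padded = pad(b"abc")
print(len(padded) * 8)   # 512
print(padded.hex())
```

For the three-byte message "abc" this gives k = 423 and a single 512-bit block.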
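For example, the (8-bit ASCII) message "abc" has length l = 24, so k = 448 − (24 + 1) = 423. The padded message consists of the 24 message bits 01100001 01100010 01100011, followed by a single ‘1’ bit, 423 zero bits, and the 64-bit binary representation of l = 24 (ending in 00011000).
The operation follows the same scheme for SHA-384 and SHA-512. A padded message whose length is a multiple of 1024 bits is obtained using a 128-bit encoded message length, with k satisfying the equation $l + 1 + k \equiv 896 \pmod{1024}$.
2.3. Parsing the padded message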
The padded message is cut into N m-bit blocks, $M^{(1)}, M^{(2)}, \ldots, M^{(N)}$. The m bits of each input block can be expressed as 16 w-bit words for all four hash algorithms. The first w bits of message block i are denoted $M_0^{(i)}$, the next w bits are $M_1^{(i)}$, and so on up to $M_{15}^{(i)}$.
2.4. Setting the initial hash value ($H^{(0)}$)
Before hash computation begins, the initial hash value $H^{(0)}$ is set, which consists of eight w-bit words.
• For SHA-224, $H^{(0)}$ is obtained by taking the 33rd to 64th bits of the fractional parts of the square roots of the ninth through sixteenth prime numbers.
• For SHA-256, we get $H^{(0)}$ by taking the first 32 bits of the fractional parts of the square roots of the first eight prime numbers.
• For SHA-384, the words of $H^{(0)}$ are the first 64 bits of the fractional parts of the square roots of the ninth through sixteenth prime numbers.
• For SHA-512, $H^{(0)}$ consists of eight 64-bit words obtained by taking the first 64 bits of the fractional parts of the square roots of the first eight prime numbers.
2.5. Secure hash algorithms
2.5.1. Hash computation of SHA-256 and SHA-512
We will describe SHA-256 and SHA-512
together, in order to stress their numerous simi-
larities.
We put Alg = 256, w = 32 and tmax = 63 for
SHA-256, and Alg = 512, w = 64 and tmax = 79
for SHA-512.
Both algorithms use:
• a message schedule of tmax + 1 w-bit words,
• eight working variables of w bits each,
• a hash value of eight w-bit words.
After preprocessing is completed, each message
block, M(1), . . . ,M(N), is processed in order.
Additions (+) are all performed modulo $2^w$, and $a \| b$ stands for the concatenation of a and b.
For i = 1 to N {
• Prepare the message schedule,
$$W_t = \begin{cases} M_t^{(i)} & 0 \le t \le 15 \\ \sigma_1^{\{Alg\}}(W_{t-2}) + W_{t-7} + \sigma_0^{\{Alg\}}(W_{t-15}) + W_{t-16} & 16 \le t \le t_{max} \end{cases}$$
• Initialize the eight working variables, a, b, c, d, e, f, g, and h, with the (i − 1)st intermediate hash value ($\|$ denotes the concatenation operator):
$$a \| b \| c \| d \| e \| f \| g \| h = H^{(i-1)}$$
• For t = 0 to $t_{max}$, perform the round
\begin{align*}
T_1 &= h + \Sigma_1^{\{Alg\}}(e) + \mathrm{Ch}(e, f, g) + K_t^{\{Alg\}} + W_t \\
T_2 &= \Sigma_0^{\{Alg\}}(a) + \mathrm{Maj}(a, b, c) \\
a \| \cdots \| h &= (T_1 + T_2) \| a \| b \| c \| (d + T_1) \| e \| f \| g
\end{align*}
• Compute the ith intermediate hash value
$$H^{(i)} = (a + H_0^{(i-1)}) \| \cdots \| (h + H_7^{(i-1)})$$
}
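The loop above translates almost line for line into software. The following Python sketch is a reference model only, not the hardware architecture of this paper; the function name `sha2` and the helpers are hypothetical. It is parameterized by the word size w, so that w = 32 gives SHA-256 and w = 64 gives SHA-512, and it derives the constants and $H^{(0)}$ from the primes as described in Sections 2.1.2 and 2.4.

```python
from math import isqrt

def first_primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found

def icbrt(n):
    """Largest r with r**3 <= n."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        lo, hi = (mid, hi) if mid ** 3 <= n else (lo, mid - 1)
    return lo

# Rotation/shift amounts: (Sigma0, Sigma1, sigma0, sigma1) for each word size.
AMOUNTS = {32: ((2, 13, 22), (6, 11, 25), (7, 18, 3), (17, 19, 10)),
           64: ((28, 34, 39), (14, 18, 41), (1, 8, 7), (19, 61, 6))}

def sha2(msg: bytes, w: int = 32) -> bytes:
    mask, rounds, wb = (1 << w) - 1, (64 if w == 32 else 80), w // 8
    S0, S1, s0, s1 = AMOUNTS[w]
    ror = lambda x, n: ((x >> n) | (x << (w - n))) & mask
    big = lambda x, r: ror(x, r[0]) ^ ror(x, r[1]) ^ ror(x, r[2])
    small = lambda x, r: ror(x, r[0]) ^ ror(x, r[1]) ^ (x >> r[2])
    primes = first_primes(rounds)
    K = [icbrt(p << (3 * w)) & mask for p in primes]
    H = [isqrt(p << (2 * w)) & mask for p in primes[:8]]
    # Preprocessing: append 0x80, zero bytes, then the 2w-bit length.
    data = msg + b"\x80"
    data += bytes((-len(data) - 2 * wb) % (16 * wb))
    data += (8 * len(msg)).to_bytes(2 * wb, "big")
    for off in range(0, len(data), 16 * wb):
        W = [int.from_bytes(data[off + wb * t: off + wb * (t + 1)], "big")
             for t in range(16)]
        for t in range(16, rounds):
            W.append((small(W[t - 2], s1) + W[t - 7]
                      + small(W[t - 15], s0) + W[t - 16]) & mask)
        a, b, c, d, e, f, g, h = H
        for t in range(rounds):
            T1 = (h + big(e, S1) + ((e & f) ^ (~e & g)) + K[t] + W[t]) & mask
            T2 = (big(a, S0) + ((a & b) ^ (a & c) ^ (b & c))) & mask
            a, b, c, d, e, f, g, h = (T1 + T2) & mask, a, b, c, (d + T1) & mask, e, f, g
        H = [(x + y) & mask for x, y in zip(H, (a, b, c, d, e, f, g, h))]
    return b"".join(x.to_bytes(wb, "big") for x in H)

print(sha2(b"abc", 32).hex())
```

The result can be compared with `hashlib.sha256` and `hashlib.sha512` from the Python standard library.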
After repeating those steps a total of N times (that is, after processing $M^{(N)}$), the resulting message digest of M is
$$H_0^{(N)} \| H_1^{(N)} \| H_2^{(N)} \| H_3^{(N)} \| H_4^{(N)} \| H_5^{(N)} \| H_6^{(N)} \| H_7^{(N)}$$
The algorithm for SHA-224 is identical to SHA-256, with the exception of using a different initial hash value and truncating the final hash value to its left-most 224 bits
$$H_0^{(N)} \| H_1^{(N)} \| H_2^{(N)} \| H_3^{(N)} \| H_4^{(N)} \| H_5^{(N)} \| H_6^{(N)}$$
Similarly, the SHA-384 algorithm is identical to SHA-512, except for the different initial hash value and the truncation of the final hash value to its left-most 384 bits
$$H_0^{(N)} \| H_1^{(N)} \| H_2^{(N)} \| H_3^{(N)} \| H_4^{(N)} \| H_5^{(N)}$$
3. Implementation of the SHA-2 hash functions
In this section, we describe an implementation of the SHA-2 family of algorithms that achieves zero latency; i.e. there are exactly 64 (resp. 80) cycles between the input of the first word of a block and the output of the intermediate hash for a SHA algorithm with 32-bit (resp. 64-bit) words. The message words are supplied during the first 16 cycles of computation.
This design, besides minimizing the computational overhead, is also very small, mainly because it avoids any unnecessary storage of data. Its throughput is, however, penalized by a long critical path. It is possible to achieve competitive results by pre-computing some of the data, thus reducing the critical path. This improvement requires only a small increase in hardware, and adds only a two-cycle latency to the hashing of a whole message.
3.1. General operator architecture
The general architecture, shown in Fig. 1, is a top level representation of the partitioning of the major functional blocks. This architecture can be applied to all hash algorithm modes described in this paper. The operation of each major functional block is as follows:
• The control unit manages all system operations
and processes. The control unit’s goal is to coor-
dinate new messages and new message blocks in
the system and manage relevant functions
appropriately.
• The padder realizes the message pre-processing,
handling all message data to be hashed.
• The message scheduler generates the Wt used by
the round computation.
• The round constant unit holds values of Kt.
• The round computation unit updates the a to h
variables, given their previous values, Kt and Wt.
• The intermediate hash is initialized with each new
message and updated at the end of each message
block processing.
The operation of the general operator begins
when a new message is ready to be hashed. The
intermediate hash is then initialized with H(0).
For the first 16 rounds, $M_t$ is transmitted to the message scheduler to provide the first values of $W_t$. After that, $W_t$ is computed recursively from its previous values $W_{t-2}$, $W_{t-7}$, $W_{t-15}$ and $W_{t-16}$. Along with $W_t$, the constant $K_t$ is transmitted for each round.
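This recursion only ever needs the last 16 words, which is what allows a 16-register scheduler. The Python sketch below models that behaviour for w = 32 under simplifying assumptions: one word is produced per loop iteration (standing in for one clock cycle), and the generator name `schedule_stream` and the use of a `deque` as the register file are hypothetical modelling choices.

```python
from collections import deque

MASK = 0xFFFFFFFF

def ror(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def sigma0(x):
    return ror(x, 7) ^ ror(x, 18) ^ (x >> 3)

def sigma1(x):
    return ror(x, 17) ^ ror(x, 19) ^ (x >> 10)

def schedule_stream(block_words, rounds=64):
    """Yield W_0 .. W_{rounds-1}, storing only the 16 most recent words."""
    regs = deque(maxlen=16)          # regs[i] holds W_{t-16+i} once full
    for t in range(rounds):
        if t < 16:
            w_t = block_words[t]     # M_t passes straight through
        else:
            # regs[14] = W_{t-2}, regs[9] = W_{t-7},
            # regs[1]  = W_{t-15}, regs[0] = W_{t-16}
            w_t = (sigma1(regs[14]) + regs[9] + sigma0(regs[1]) + regs[0]) & MASK
        regs.append(w_t)             # oldest word drops out automatically
        yield w_t

# Padded one-block message "abc" as sixteen 32-bit words.
block = [0x61626380] + [0] * 14 + [0x18]
words = list(schedule_stream(block))
print(hex(words[16]), hex(words[17]))
```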
The variables a to h are initialized at the beginning of each new block with the last value of the intermediate hash $H^{(i-1)}$ and updated 64 (resp. 80) times using $W_t$ and $K_t$. After this, the new intermediate hash value $H^{(i)}$ is produced by adding the variables a to h to the corresponding words of $H^{(i-1)}$.
The padder receives its input words via the
WRD_IN port, and the hash value can be read on
port WRD_OUT one word at a time using the H_PART
address port.
3.2. Implementation
In this first implementation of the standalone versions of the SHA-2 hash functions, our goal was to minimize the computational overhead, that is, to design components with zero latency relative to the hashing of an input message.
Fig. 1. General structure of the operators.
Fig. 2. Implementation of SHA2 padder.
As soon as information is available, we begin the computation without waiting for the complete block (or message). While introducing some design issues, this approach allows us to greatly reduce the hardware cost relative to other existing implementations. This is because, in exchange for a small control overhead, we remove all unnecessary storage required for buffering.
3.2.1. Padder
In the usual approach, the padder processes the message by blocks, storing a full block, padding as needed and outputting a whole padded block at a time. This is especially true in software, where storing and retrieving data are easily handled on a standard processor architecture. Using such a strategy in hardware, however, implies the use of a block-sized register and introduces unnecessary latency: for example, 16 cycles when the padder receives one input word per clock cycle.
The reason we can avoid this latency is that each computational round requires only one value of $W_t$, which is $M_t$ for the first 16 rounds and does not depend on the padder for subsequent rounds. Thus the padder only has to compute $M_t$ for the first 16 rounds, and this computation can be done on the fly, as soon as the block words are available.
Our padder processes one word at a time, counting the message length from the first word, given at the same time as the Message_new signal, until the Word_last signal rises, which indicates that the last bit of the actual message (indexed by Bits_valid) has been received. Then, the padder appends a bit ‘1’ to the message, followed by as many zeros as needed. The binary-encoded length of the message is finally appended to the message, and the Message_end signal is raised.
3.2.2. Message scheduler
From the standard, we know that words $W_0$ to $W_{15}$ do not have to be processed. Thus, they go directly through the message scheduler (see Fig. 3), and only the subsequent $W_t$ are computed, each of which depends on 4 of the 16 preceding values of $W_t$. Therefore we have to store the 16 previous words $W_t$ in order to be able to compute each new message schedule word.
Fig. 3. Implementation of message scheduler.
3.2.3. Round constants unit
A new round constant must be provided at each
round. We implemented this structure using RAM
blocks instead of ROM blocks because of the
FPGA target architectures we used. The RAM
blocks provide 512 32-bit storage locations. Only
one of these blocks was needed, including for the
SHA-384/512 hashes, because of their dual-port
capability.
The 1-cycle latency of the RAM blocks is accounted for in the control part of the architecture.
3.2.4. Round computation unit and intermediate hash
The round computation unit, given $W_t$, $K_t$ and a, b, . . . , h, computes the next values of the variables a to h using the equations provided in the SHA-2 standard. This computation is performed using a standard tree of carry-propagate adders utilizing the fast-carry adders provided on our target FPGA.
The variables are initialized at the beginning of a
new message block with the intermediate hash, and
updated every subsequent round with the output of
the round computation unit.
The intermediate hash is initialized to H(0) with
each new message and updated by adding the vari-
ables a,b, . . . ,h to its words after each processed
block.
3.2.5. Analysis
By computing $M_t$ and $W_t$ on the fly, the hashing operator is able to begin as soon as the message words are provided. It is then possible to achieve zero latency in the computation of intermediate hash results. Therefore, for a SHA-224/256 (resp. SHA-384/512) hash, the hashing of an N-block message takes exactly 64 × N (resp. 80 × N) cycles, including the initialization required by a new message.
The other advantage of this approach is that by processing all data as soon as it is provided, we remove all message buffering, thereby realizing important hardware savings compared to the usual approach, where computation only begins after the input of a complete block, adding at least 16 cycles of latency per block and an extra block-wide register to the design.
Fig. 4. Original critical path in SHA2 implementation (in thick).
3.3. Merging of SHA-224/256, and SHA-384/512
The only differences between SHA-224 and SHA-256 are the values used for $H^{(0)}$ and the number of bits of the digest that are used.
These two differences have no impact on the hardware cost: the values for $H^{(0)}$ are "random" sequences of bits in each case, and truncating the output does not reduce the required hardware since the whole intermediate hash value is needed for each block computation.
The same statements hold for SHA-384 and SHA-512, which also require almost the same hardware resources.
We therefore add two merged operators to our architecture, SHA-224/256 and SHA-384/512, each of which has an additional input Alt_Hash used to choose between the two variants. The only hardware overhead in these components, compared to the separate implementations, is a MUX used to select the suitable value of $H^{(0)}$.
4. Optimization
A synthesis of the previously described operators shows that their critical path is quite long, leading to a low operating frequency (Fig. 4). In order to increase the operating frequency, the computational process is modified to split the critical path into three clock cycles (Fig. 5). This leads to increased performance in terms of speed at the cost of only two cycles of latency per message and a small hardware overhead.
Fig. 5. Critical path for the optimized SHA2 implementation (in thick).
4.1. Determination of the critical path
Synthesis results using Synplify Pro show that the critical path runs from the padder, through the message scheduler and the round computation unit (via the evaluation of $T_1$), to the storage of the new intermediate hash value. We can shorten the critical path, and hence increase the throughput, at the cost of an increase in latency, as we discuss next.
4.2. Segmenting the computation delay
We can reduce the length of the critical path by computing some of the intermediate results used for the round computation during the previous rounds. For example, $W_t$ can be stored in a register in order to cut the critical path after the padder and message scheduler. This results in an increase of one cycle of latency for the message computation, since $M_t$ will be delayed by one cycle before being used. This part of the optimization requires no extra hardware since $W_{t-1}$ has to be stored in the message scheduler anyway. The value of $K_t$ can similarly be delayed by one cycle by acting on the address counter feeding the round constants memory block.
As can be seen from the round computation equations, variables c, d and g, h at round t are equal (except for the initialization round at t = 0) to b, c and f, g respectively at round t − 1. No precomputation will involve a and e (resp. b and f) since their values result from computations involving the previous values of a and e at round t (resp. t − 1).
Then $d + h + K_t^{\{512\}} + W_t$ and $h + K_t^{\{512\}} + W_t$ can be pre-computed efficiently at round t − 1.
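We introduce
$$c' = \begin{cases} H_3^{(i-1)} & \text{if } t = 0 \\ c & \text{else} \end{cases}, \qquad g' = \begin{cases} H_7^{(i-1)} & \text{if } t = 0 \\ g & \text{else} \end{cases}$$
\begin{align*}
p &= g' + K_{t+1}^{\{512\}} + W_{t+1} \\
q &= c' + g' + K_{t+1}^{\{512\}} + W_{t+1} \\
T_3 &= \Sigma_1^{\{512\}}(e) + \mathrm{Ch}(e, f, g) \\
T_4 &= \Sigma_0^{\{512\}}(a) + \mathrm{Maj}(a, b, c)
\end{align*}
One can then compute: $e = q + T_3$ and $a = p + T_3 + T_4$.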
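The equivalence between this split computation and the original round can be checked numerically. The Python sketch below does so for w = 64 on arbitrary (random) state, constants and schedule words, since the identity does not depend on the actual SHA-2 values. The function names are hypothetical, and the way p and q are primed before round 0 is one reading of the c′, g′ special case rather than a description of the authors' circuit.

```python
import random

W = 64
MASK = (1 << W) - 1

def ror(x, n):
    return ((x >> n) | (x << (W - n))) & MASK

def big_sigma0(x): return ror(x, 28) ^ ror(x, 34) ^ ror(x, 39)
def big_sigma1(x): return ror(x, 14) ^ ror(x, 18) ^ ror(x, 41)
def ch(x, y, z):   return (x & y) ^ (~x & z & MASK)
def maj(x, y, z):  return (x & y) ^ (x & z) ^ (y & z)

def run_standard(state, ks, ws):
    """Rounds exactly as in the standard: T1 and T2 computed in one step."""
    a, b, c, d, e, f, g, h = state
    for k, w in zip(ks, ws):
        t1 = (h + big_sigma1(e) + ch(e, f, g) + k + w) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)

def run_precomputed(state, ks, ws):
    """Same rounds, with p and q prepared one round ahead."""
    a, b, c, d, e, f, g, h = state
    # Priming before round 0: h and d come straight from the initial state.
    p = (h + ks[0] + ws[0]) & MASK
    q = (d + p) & MASK
    for t in range(len(ks)):
        next_p = next_q = 0
        if t + 1 < len(ks):
            # h and d of round t+1 are g and c of round t
            next_p = (g + ks[t + 1] + ws[t + 1]) & MASK
            next_q = (c + next_p) & MASK
        t3 = (big_sigma1(e) + ch(e, f, g)) & MASK
        t4 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (p + t3 + t4) & MASK, a, b, c, (q + t3) & MASK, e, f, g
        p, q = next_p, next_q
    return (a, b, c, d, e, f, g, h)

rng = random.Random(2006)
state = tuple(rng.getrandbits(W) for _ in range(8))
ks = [rng.getrandbits(W) for _ in range(80)]
ws = [rng.getrandbits(W) for _ in range(80)]
print(run_standard(state, ks, ws) == run_precomputed(state, ks, ws))
```

In each round only T3, T4 and two additions remain on the path to the new a and e, while p and q for the next round are computed in parallel.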
The original critical path is now cut after the padder and message scheduler operations by reading $W_{t-1}$ from the registers of the message scheduler instead of using the combinatorial value $W_t$ at round t. It is also cut by the pre-computation of p and q, as they can be used directly with $T_3$ and $T_4$ to compute a and e.
4.3. Analysis
The critical path reduction described above introduces two cycles of latency: one is due to the delaying of $W_t$, and the other to the computation of p and q. No extra hardware is required for $W_t$, since $W_{t-1}$ is stored in the message scheduler. The address t used for $K_t$ has to be delayed (requiring 7 flip-flops), and p and q are stored in registers at each round (two w-bit registers). Some extra logic and routing is also used in the computation of c′ and g′.
The hashing of an N-block message will now take 2 + 64 × N (resp. 2 + 80 × N) cycles for a SHA-224/256 (resp. SHA-384/512) hash. The synthesis results show that the hardware overhead is low compared to the speed improvement. The shorter critical path actually allowed smaller operators to be synthesized.
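The cycle counts stated in this paper can be turned into a small calculator. In the sketch below the function names are hypothetical, and the 100 MHz clock used in the example is an arbitrary illustrative figure, not a synthesis result. The throughput formula is the asymptotic rate for long messages, where one block of 16w bits is absorbed every 64 or 80 cycles.

```python
def hash_cycles(n_blocks, w, optimized=False):
    """Cycles to hash an N-block message: 64N or 80N, plus 2 when the
    critical path has been segmented."""
    rounds = 64 if w == 32 else 80
    return rounds * n_blocks + (2 if optimized else 0)

def peak_throughput_mbps(f_mhz, w):
    """Asymptotic throughput in Mb/s at clock frequency f_mhz (in MHz)."""
    rounds = 64 if w == 32 else 80
    return 16 * w * f_mhz / rounds

print(hash_cycles(3, 32), hash_cycles(3, 64, optimized=True))
print(peak_throughput_mbps(100, 32), peak_throughput_mbps(100, 64))
```

At equal clock frequency the 64-bit variants process 12.8 bits per cycle against 8 bits per cycle for the 32-bit variants.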
5. Merging of the SHA-2 family
Merging the SHA-2 family of functions into a single architecture is more efficient than implementing separate operators for each hash algorithm. For example, in [9], SHA-256, SHA-384 and SHA-512 were each implemented using a separate computational unit. During the computation of SHA-256, that implementation does not use the left half of the 64-bit datapath, and it is held to zero.
Our multi-mode SHA-2 operator has been designed to optimize the hardware efficiency. It is able to run either a hash function working on w = 64-bit words (SHA-384 or 512), or two w = 32-bit functions (SHA-224 or 256) running concurrently. When running in split mode, the operator can be considered as two separate operators each running a w = 32-bit hash.
5.1. Sharing the datapath
5.1.1. Comparison between the hash functions
The hash functions of the SHA-2 family share many similarities. We can classify them into two categories: the w = 32 bit functions, SHA-224 and SHA-256, and the w = 64 bit functions, SHA-384 and SHA-512. Given their respective word sizes, a large part of the datapaths are identical, and other parts can be shared efficiently:
• The padding is identical with regard to the respective word sizes. A message of length l is processed by blocks of 16 words, and a "1" is appended at the end, followed by as many zeros (k) as necessary in order to have $l + 1 + k \equiv 14w \pmod{16w}$. A 2-word binary representation of l is then appended.
• The message scheduler is identical for all hash functions, except for the $\sigma$ functions which depend on the word size.
• The definition of the initial hash value $H^{(0)}$ allows its implementation to be shared between the algorithms. That is, the left halves of the SHA-512 words of $H^{(0)}$ are the words of $H^{(0)}$ for SHA-256. Similarly for SHA-384, the right halves of the words of $H^{(0)}$ are the words of $H^{(0)}$ for SHA-224. For example for $H_0^{(0)}$:
  SHA-256 $H_0^{(0)}$ = 6a09e667
  SHA-512 $H_0^{(0)}$ = 6a09e667f3bbc908
  and
  SHA-224 $H_0^{(0)}$ = c1059ed8
  SHA-384 $H_0^{(0)}$ = cbbb9d5dc1059ed8
• In the functions defined by the SHA-2 standard, only Ch and Maj are identical for all algorithms. The $\sigma$ and $\Sigma$ operations are different, although they are based on the same idea, that is, a bitwise XOR of three different rotations/shifts of the input value; the rotate/shift values differ and thus cannot be shared. Since there are only two different sets of functions (one for w = 32 and another for w = 64), they are both hard-wired, with selection between the two using a MUX, which is a lower hardware cost solution than the use of a generic structure (barrel rotate/shifter).
• The round constants are the same for equal word sizes, and the value of $K_t$ for w = 32 is identical to the left half of the corresponding w = 64 constant. For example:
  SHA-224/256 $K_0$ = 428a2f98
  SHA-384/512 $K_0$ = 428a2f98d728ae22
• The round computation and the intermediate value definitions are the same for all SHA-2 algorithms, although the number of rounds differs depending on the word size. Only 64 rounds are performed for w = 32-bit hashes, and 80 for w = 64-bit hashes.
5.2. Physical sharing of the hardware
Our multi-mode architecture fits two 32-bit word hash functions and a single 64-bit hash into the same datapath. We denote by α and β the two 32-bit word hash functions that use the left and right halves of the datapath, respectively, and by γ the 64-bit hash that uses the whole datapath.
The physical sharing is accomplished by considering all operations realized for γ on 64-bit words as two separate 32-bit operations for α and β, by inhibiting dependencies between the two halves in the latter case. For registers and parallel operations, this involves no hardware overhead since the left and right halves are independent regardless of the operator mode. When an addition modulo $2^w$ is performed, a carry propagation exists between the right and left parts of the 64-bit words for γ that must be inhibited when computing two adjacent modulo $2^{32}$ additions for α and β. Besides this small logic overhead, control parts are duplicated in order to allow α and β to run concurrently as well as in parallel.
Fig. 7. Multi-mode implementation of the $W_t$ computation.
Fig. 6 shows the modifications required to the standard carry propagate adder, available on the FPGA, that allow either one modulo $2^{64}$ addition or two concurrent modulo $2^{32}$ additions to be performed.
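The behaviour of such a selectable adder can be modelled in software as follows. This is a behavioural sketch only; the function name `split_add` and the boolean `split` argument are hypothetical stand-ins for the adder and its mode input. The only difference between the two modes is whether the carry out of the low 32 bits is forwarded to the high half.

```python
MASK32 = 0xFFFFFFFF

def split_add(a, b, split):
    """64-bit addition, or two independent 32-bit additions when split is True."""
    low = (a & MASK32) + (b & MASK32)
    carry = 0 if split else low >> 32        # inhibit the carry in split mode
    high = (a >> 32) + (b >> 32) + carry
    return ((high & MASK32) << 32) | (low & MASK32)

print(hex(split_add(0x00000001FFFFFFFF, 0x0000000000000001, split=False)))
print(hex(split_add(0x00000001FFFFFFFF, 0x0000000000000001, split=True)))
```

In the first call the carry ripples into the upper half (result 0x200000000); in the second the lower half wraps to zero and the upper half is unchanged (result 0x100000000).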
5.2.1. Padder
In the multi-mode version of the padder, the word counter has been modified so that it can be used either as two separate 64-bit counters or as a single 128-bit counter. This implies a rather complicated management of the carry, since the last 4 bits (resp. 5 bits) of each message length for a w = 32-bit (resp. w = 64-bit) hash are given by the input Bit_valid. If the operator works in split mode, one carry used in the word counter must be discarded and Bit_valid used for the lower bits in the message length of α.
5.2.2. Message scheduler
The message scheduling for SHA-256 and SHA-512 is the same except for: (1) the word size, which is doubled; (2) the $\sigma_0$ and $\sigma_1$ functions, which consist of wiring; and (3) the number of rounds, which does not affect the logical structure of the scheduler. Fig. 7 illustrates the multi-mode computation of $W_t$. MUXes select data paths for each mode, and the previously introduced split adders are used to perform the modulo $2^w$ additions.
Fig. 6. We introduce a 64-bit/2 × 32-bit selectable modular adder (up). The registers can either be considered as one 64-bit or as two concurrent 32-bit registers (down).
5.2.3. Round constants unit
Since a dual-port 32-bit RAM block is used to provide the SHA-384/512 64-bit round constants, it can also be used, at the cost of some logic overhead in the address input, to provide two different 32-bit round constants for two concurrent SHA-224/256 hashes.
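A small software model illustrates the idea. The memory layout used here (high half of each 64-bit constant at an even address, low half at the following odd address) and the function name `read_constants` are assumptions made for the example; the paper does not specify the address mapping. Only the first four of the eighty constants are listed.

```python
# First four SHA-384/512 round constants (the full table has 80 entries).
K64 = [0x428a2f98d728ae22, 0x7137449123ef65cd,
       0xb5c0fbcfec4d3b2f, 0xe9b5dba58189dbbc]

# Hypothetical layout of the 32-bit wide memory: high half, then low half.
RAM = []
for k in K64:
    RAM += [k >> 32, k & 0xFFFFFFFF]

def read_constants(split, t_a, t_b=0):
    """Model the two read ports during one cycle.

    split=False: both ports fetch the two halves of K_t for round t = t_a.
    split=True : each port fetches the 32-bit constant (the high half)
                 for its own, independent round counter.
    """
    if split:
        return RAM[2 * t_a], RAM[2 * t_b]
    return RAM[2 * t_a], RAM[2 * t_a + 1]

high, low = read_constants(False, 0)
print(hex((high << 32) | low))                        # 64-bit mode, round 0
print([hex(k) for k in read_constants(True, 1, 3)])   # split mode, rounds 1 and 3
```

Because the 32-bit constants are the left halves of the 64-bit ones, no second table is needed in this model; the two modes differ only in how the port addresses are formed.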
In order to ensure the same latency properties as in the separate architectures, some logic has to be added to ensure the correct initialization of the computation when the mode is changed, since the constant unit must output either $K_0^{\gamma}$ or $K_0^{\alpha} \| K_0^{\beta}$ depending on the new mode.
5.2.4. Round computation unit and intermediate hash
The equations for computing the new values of
variables a, b, ..., h are the same for all hash functions
of the SHA-2 family, with appropriate changes
relating to the relative word sizes and with the
exception of the Σ functions. The only modification
of the round computation unit for the multi-mode
version therefore consists in using the split adders
and implementing both Σ^512 and Σ^256 operators
for each Σ function, in the same manner that was
used for the message scheduler.
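The selection between the two operators can be sketched in software. The rotation amounts below are those of FIPS 180-2; everything else (function names, the `split` flag, packing two 32-bit words into one 64-bit value) is a hypothetical model of the multiplexing and not the authors' circuit.

```python
MASK32 = 0xFFFFFFFF

# Rotation amounts of the Σ0 and Σ1 functions (FIPS 180-2), keyed by word size.
SIGMA0 = {32: (2, 13, 22), 64: (28, 34, 39)}
SIGMA1 = {32: (6, 11, 25), 64: (14, 18, 41)}


def rotr(x, n, w):
    """Rotate the w-bit word x right by n positions (0 < n < w)."""
    return ((x >> n) | (x << (w - n))) & ((1 << w) - 1)


def big_sigma(x, rots, w):
    """XOR of three rotations: in hardware this is wiring plus XOR gates."""
    r0, r1, r2 = rots
    return rotr(x, r0, w) ^ rotr(x, r1, w) ^ rotr(x, r2, w)


def multimode_big_sigma(x, table, split):
    """Apply the 64-bit operator to x, or the 32-bit operator to each
    half of x independently when `split` is set."""
    if not split:
        return big_sigma(x, table[64], 64)
    hi = big_sigma((x >> 32) & MASK32, table[32], 32)
    lo = big_sigma(x & MASK32, table[32], 32)
    return (hi << 32) | lo


if __name__ == "__main__":
    print(hex(multimode_big_sigma(1, SIGMA0, split=False)))
    print(hex(multimode_big_sigma((1 << 32) | 1, SIGMA0, split=True)))
```

Both operators are computed from the same input register, and the mode flag only selects which result is forwarded, which mirrors the MUX-based selection described for the message scheduler.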
The initial hash value, H(0), is selected through
additional logic that takes advantage of the similar
values in the different algorithms. The computation
of a new intermediate hash is performed using the
split adders.
5.2.5. Analysis
The multi-mode architecture shares the same
properties as separate architectures for the operators
in terms of latency and speed. It is also possible
to improve the speed performances with the same
segmentation of the critical path.
Fig. 8. Throughput vs Area for Spartan3 optimized for speed. All architectures are presented with their zero-latency (continuous) and optimized two-latency (dashed) variants. [Plot: "SHA-2 Resources vs. Throughput on Spartan3 FPGA"; x-axis: Required Slices (1000 to 3500); y-axis: Throughput (Mbps) (200 to 450); series: Unoptimized, Optimized; points labelled 224, 256, 224/256, 384, 512, 384/512, Multi-mode.]
6. Implementation results
This section summarizes our implementation
results using Synplify Pro as the synthesis tool.
The criteria considered are FPGA resources (slices),
maximum throughput (Mb/s) and their ratio in Mb/
s/slice. We synthesized our design using two targets:
the Spartan3-XC3S400 which is the most suitable
for our architectures and the Virtex 200/400XCV
which we used for providing accurate comparisons
with existing schemes.
Every hash function of the SHA-2 family was
synthesized as a stand-alone operator (224, 256,
384 and 512), or merged by word operating size
(224/256 or 384/512), and we also give our results
for the multi-mode architecture which is capable
of all SHA-2 family modes and can act as two
independent 32-bit (SHA-224/256) operators
simultaneously.
From a system-level perspective the decision to
support the merged or multi-mode operator is made
before synthesis. Once implemented, control signals
determine the mode of operation. Any of these
designs can be synthesized with either two cycles
of latency operation for increased clock frequency
or zero latency operation with a longer critical
path.
Table 2
FPGA synthesis results and comparison
Architecture            Slices   Freq. (MHz)   Cycles per block   Throughput (Mb/s)   Throughput/area (Mb/s/slice)
Reference architecture
[9] SHA-256             *2120    83            81                 262                 0.123
[9] SHA-384             *3932    74            97                 293                 0.075
[9] SHA-512             *4474    75            97                 396                 0.089
[10] SHA-384/512        *5828    38            Pipelined          479                 0.082
[9] SHA-256/384/512     *4768    74            81/97              233.9/390.6         0.049/0.082
Proposed architectures
SHA-224                 1297     77            64(+2)             269.5               0.208
SHA-256                 1306     77            64(+2)             308                 0.236
SHA-224/256             1260     69            64(+2)             276                 0.219
SHA-384                 2581     69            80(+2)             331                 0.128
SHA-512                 2545     69            80(+2)             442                 0.174
SHA-384/512             2573     66            80(+2)             422                 0.164
**Multi-mode SHA-2      2951     50            64/80(+2)          2 × 200/320         0.136/0.108
* 1 CLB = 2 slices for Virtex, target: Virtex 200/400XCV.
** Max multi-mode throughput is 400/640 Mbps = 2 × 200/320 Mbps.
6.1. Analysis of proposed architecture
It is apparent in Fig. 8 that the merged versions
of our operator provide no overhead in terms of
resources for both SHA-224/256 and SHA-384/
512. In terms of speed, the optimized critical path
provides an average 10% throughput improvement
across all modes and also reduces the average number
of required slices by 3%. This results in a large
improvement for the speed-to-area ratios, especially
with regard to our new multi-mode architecture.
6.2. Comparison with published implementations
We now compare our architectures with previ-
ously published stand-alone and multi-mode SHA-
2 implementations [9,10] (see Table 2).
The focus of [9] was to implement SHA-256, 384
and 512 in a single operator using the Virtex
XCV200 as a target. Ref. [10] discusses a pipelined
approach to a single chip SHA-384/512 architec-
ture. Another implementation of the SHA-512 func-
tion can be found in [11]. In all of these designs,
16–32 clock cycles are required for the padder to
process an input message block before computation
begins. This is avoided in our system thanks to an
‘on-the-fly’ padder that allows a clock cycle reduction
of up to 25% compared to [10].
Additionally, due to some pre-computation tech-
niques, we are able to achieve clock frequencies
higher than [10] and slightly less than [9] while at
the same time significantly reducing the hardware
costs.
Our multi-mode operator, in particular, uses con-
siderably fewer resources compared to the multi-
mode 256/384/512 implementation [9] with 2951
slices compared to 4768 slices and has a much better
throughput to area ratio.
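The comparison between the two multi-mode designs can be checked directly from the figures in Table 2. The short Python calculation below is only a worked example: the dictionary layout and variable names are illustrative, and the throughput pair for the proposed operator uses the values behind the ratio column of the table (2 × 200 Mb/s for two concurrent 32-bit hashes, 320 Mb/s for one 64-bit hash).

```python
# (slices, throughput in Mb/s for the 32-bit and 64-bit modes), from Table 2.
reference = {"slices": 4768, "throughput": (233.9, 390.6)}  # [9] SHA-256/384/512
proposed = {"slices": 2951, "throughput": (2 * 200.0, 320.0)}  # multi-mode SHA-2


def ratios(design):
    """Throughput-to-area ratio in Mb/s/slice for each mode."""
    return tuple(t / design["slices"] for t in design["throughput"])


# Relative reduction in slices of the proposed operator.
area_saving = 1 - proposed["slices"] / reference["slices"]

if __name__ == "__main__":
    print(f"slice reduction: {area_saving:.1%}")
    print("reference ratios:", [round(r, 3) for r in ratios(reference)])
    print("proposed ratios: ", [round(r, 3) for r in ratios(proposed)])
```

With these figures the proposed operator needs about 38% fewer slices, and its throughput-to-area ratio is higher in both modes (0.136 and 0.108 against 0.049 and 0.082 Mb/s/slice).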
7. Conclusion
In this paper, we have introduced a concurrent
SHA-2 operator which optimizes the datapath when
a 64-bit SHA-2 hash mode is supported and
removes all unnecessary latencies. The proposed
multi-mode architecture is able to perform a single
SHA-384 or SHA-512 hash function or to behave
as two independent computations of SHA-224 or
SHA-256 hash functions with minimal hardware
overhead. We demonstrated the benefit of integrating
a concurrent 32-bit mode when a 64-bit hash is
to be supported.
Additionally, the new architecture achieves a performance
comparable to previously published separate
implementations of these functions while
requiring much less hardware. Most importantly,
all of the new implementations presented in this
paper are more efficient than previously published
implementations when considering the throughput-to-area
ratio.
Acknowledgements
This work was financially supported through
iCORE (Informatics Circle of Research Excellence),
NSERC (Natural Sciences and Engineering
Research Council of Canada), CMC and an ACI
grant from the French ministry of Research and
Education.
References
[1] Alfred J. Menezes, Paul C. van Oorschot, Scott A. Vanstone,
Handbook of Applied Cryptography, CRC Press, 1997.
[2] Ronald L. Rivest, The MD5 Message-Digest Algorithm,
Internet informational RFC 1321, April 1992.
[3] H. Dobbertin, A. Bosselaers, B. Preneel, RIPEMD-160: a
strengthened version of RIPEMD, in: IWFSE: International
Workshop on Fast Software Encryption, LNCS, 1996.
[4] Vincent Rijmen, Paulo S.L.M. Barreto, The WHIRLPOOL
hash function, World-Wide Web document, 2001.
[5] Xiaoyun Wang, Dengguo Feng, Xuejia Lai, Hongbo Yu,
Collisions for hash functions MD4, MD5, HAVAL-128 and
RIPEMD, Cryptology ePrint Archive, Report 2004/199,
2004, p. 4.
[6] Xiaoyun Wang, Yiqun Lisa Yin, Hongbo Yu, Finding
collisions in the full SHA-1, Shandong University, Technical
Report, June 2005.
[7] National Institute of Standards and Technology, FIPS PUB
180-1: Secure Hash Standard, Gaithersburg, MD, USA,
NIST, April 1995.
[8] National Institute of Standards and Technology, FIPS PUB
180-2: Secure Hash Standard, Gaithersburg, MD, USA,
NIST, August 2002.
[9] N. Sklavos, O. Koufopavlou, The Journal of Supercomputing
31 (3) (2005) 227–248.
[10] M. McLoone, J.V. McCanny, Efficient single-chip implementation
of SHA-384 & SHA-512, in: IEEE Proceedings of
the International Conference on Field-Programmable Technology
(FPT), 2002, pp. 311–314.
[11] Grembowski, Lien, Gaj, Nguyen, Bellows, Flidr, Lehman,
Schott, Comparative analysis of the hardware implementa-
tions of hash functions SHA-1 and SHA-512, in: ISW:
International Workshop on Information Security, LNCS
2002.
