Hardware Design and Analysis of the ACE and WAGE Ciphers by Aagaard, Mark D. et al.
Hardware Design and Analysis of the ACE and WAGE Ciphers
Mark D. Aagaard, Marat Sattarov, and Nuša Zidarič
Department of Electrical and Computer Engineering
University of Waterloo, Ontario, Canada
{maagaard,msattaro,nzidaric}@uwaterloo.ca
Abstract
This paper presents the hardware design and analysis of ACE and WAGE, two candidate ciphers for
the NIST Lightweight Cryptography standardization. Both ciphers use sLiSCP’s unified sponge duplex
mode. ACE has an internal state of 320 bits, uses three 64 bit Simeck boxes, and implements both
authenticated encryption and hashing. WAGE is based on the Welch-Gong stream cipher and provides
authenticated encryption. WAGE has 259 bits of state, two 7 bit Welch-Gong permutations, and four
lightweight 7 bit S-boxes. ACE and WAGE have the same external interface and follow the same I/O
protocol to transition between phases. The paper illustrates how a hardware perspective influenced key
aspects of the ACE and WAGE algorithms. The paper reports area, power, and energy results for both
serial and parallel (unrolled) implementations using four different ASIC libraries: two 65 nm libraries, a
90 nm library, and a 130 nm library. ACE implementations range from a throughput of 0.5 bits-per-clock
cycle (bpc) and an area of 4210 GE (averaged across the four ASIC libraries) up to 4 bpc and 7260 GE.
WAGE results range from 0.57 bpc with 2920 GE to 4.57 bpc with 11080 GE.
1 Introduction
In 2013, NIST started the Lightweight Cryptography (LWC) project [1], with the end goal of creating a
portfolio of lightweight algorithms for authenticated encryption with associated data (AEAD), and
optionally hashing, in constrained environments [2]. For hardware-oriented lightweight algorithms, hardware
implementation results are an important criterion for assessment and comparison. In the first round of
the LWC evaluation, more than half of the candidates [3] reported hardware implementation results or
their estimates, ranging from complete implementation and analysis to partial implementation results and
theoretical estimates based on gate count. The submissions differ in the amount of analysis (from area
reported only for the cryptographic primitive to a thorough area breakdown of all components), in design
decisions (such as serial versus unrolled implementations), and in the ASIC and/or FPGA implementation
technologies used. Furthermore, some authors report results without the interface, some with the interface,
and in some cases, e.g. [6], the CAESAR Hardware Applications Programming Interface (API) for Authenticated
Ciphers [7] was used.
This paper explores different hardware design options for two of the LWC candidates, ACE [4] and WAGE [5].
The original and parallel implementations were synthesized using four different ASIC libraries, including
65 nm, 90 nm and 130 nm technologies. ACE implementations range from a throughput of 0.5 bits-per-clock
cycle (bpc) and an area of 4210 GE (averaged across the four ASIC libraries) up to 4 bpc and 7260 GE.
WAGE results range from 0.57 bpc with 2920 GE to 4.57 bpc with 11080 GE.
The paper is organized as follows: Section 2 briefly introduces ACE and WAGE. Section 3 lists the design
principles, presents the interface with the environment, and describes the implementations of both
ciphers. Section 4 describes the parallel implementations of ACE and WAGE. Implementation technologies
and results are summarized in Section 5.
2 Specifications of ACE and WAGE
Both ACE and WAGE permutations operate in a unified duplex sponge mode [9]. The 320 bit ACE permu-
tation offers both AEAD and hashing functionalities, and the 259 bit WAGE permutation supports AEAD
functionality. Because of the similarities between ACE and WAGE, this section begins with a short descrip-
tion of ACE and WAGE permutations, followed by a discussion on the unified duplex sponge mode for both
schemes, highlighting some differences.
This work was supported in part by the Canadian National Science and Engineering Research Council (NSERC); the
Canadian Microelectronics Corp (CMC); and Grant 60NANB16D289 from the U.S. Department of Commerce, National
Institute of Standards and Technology (NIST).
arXiv:1909.12338v1 [cs.CR] 26 Sep 2019
2.1 ACE Permutation
ACE has a 320 bit internal state S, divided into five 64 bit registers, denoted A, B, C, D, and E. The
320 bit ACE permutation uses the unkeyed reduced-round Simeck block cipher [10] with a block size of 64
and 8 rounds, denoted Simeck box SB-64, as the nonlinear operation. SB-64 is a lightweight permutation,
consisting of left cyclic shifts, AND gates, and XOR gates. Each round is parameterized by a single bit of
the round constant $rc = (q_7, q_6, \dots, q_0)$, where $q_j \in \{0, 1\}$ and $0 \le j \le 7$. The
algorithmic description of SB-64 is shown at the end of Algorithm 1, with the 64 bit input and output split
into halves, i.e. $(x_1 \| x_0)$ and $(x_9 \| x_8)$ respectively.
To construct ACE-step($S^i$), $0 \le i \le 15$, SB-64 is applied to the three registers A, C and E, each
using its own round constant $rc_0^i$, $rc_1^i$, $rc_2^i$. The 8 bit constants $rc_j^i$ are generated by an
LFSR with the feedback polynomial $x^7 + x + 1$, run in a 3-way parallel configuration to produce one bit
of each $rc_j^i$ per clock cycle. At each step, the outputs of SB-64 are added to registers B, D and E,
which are further parameterized by the step constants $(sc_0^i, sc_1^i, sc_2^i)$. The computation of the
step constants does not need any extra circuitry, but rather uses the same LFSR as the round constants:
the three feedback values together with all 7 state bits yield 10 consecutive sequence elements, which are
then split into three 8 bit step constants. The step constants are used once every 8th clock cycle. The
step function is then concluded by a permutation of all five registers. For the properties of SB-64, the
choice of the final permutation, and the number of rounds and steps, refer to [4].
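The 3-way parallel constant generator described above can be sketched in Python as follows. The sketch assumes the recurrence $a_{n+7} = a_{n+1} \oplus a_n$ for the feedback polynomial $x^7 + x + 1$; the initial state, the bit ordering, and the function names are illustrative assumptions and are not taken from the ACE specification [4]. The mapping of the 10 sequence bits onto the three step constants is deliberately left out.

```python
def lfsr_step(state):
    """One serial step; state holds 7 consecutive sequence bits (a_n, ..., a_{n+6})."""
    feedback = state[0] ^ state[1]          # a_{n+7} = a_n XOR a_{n+1}
    return state[1:] + (feedback,)


def lfsr_step3(state):
    """Three steps in one call, mirroring a 3-way parallel configuration.

    Returns the new state and the 10 consecutive sequence bits
    (the 7 state bits followed by the 3 feedback values).
    """
    f0 = state[0] ^ state[1]
    f1 = state[1] ^ state[2]
    f2 = state[2] ^ state[3]
    window = state + (f0, f1, f2)
    return window[3:], window


def period(state):
    """Number of serial steps until the state repeats."""
    start, count = state, 0
    while True:
        state = lfsr_step(state)
        count += 1
        if state == start:
            return count


init = (1, 1, 1, 1, 1, 1, 1)                # hypothetical non-zero initial state
serial = lfsr_step(lfsr_step(lfsr_step(init)))
parallel, window = lfsr_step3(init)
print(serial == parallel, len(window), period(init))
```

None of the three feedback values depends on another, which is why all three can be computed in the same clock cycle from the current 7 state bits.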
2.2 WAGE Permutation
WAGE is a hardware oriented AE scheme, built on top of the initialization phase of the well-studied, LFSR
based, Welch-Gong (WG) stream cipher [11, 12]. The WAGE permutation is iterative and has a round
function derived from the LFSR, the decimated Welch-Gong permutation WGP, and the small S-boxes SB.
Details, such as differential uniformity and nonlinearity of the WGP and SB and selection of the LFSR
polynomial can be found in [5]. The parameter selection for WAGE was aimed at balancing the security
and hardware implementation area, using hardware implementation results for many design decisions, e.g.,
field size, representation of field elements, LFSR polynomial, etc.
Both the LFSR and WGP are defined over $\mathbb{F}_{2^7}$ and the S-box is a 7 bit permutation.
$\mathbb{F}_{2^7}$ is defined with the primitive polynomial $f(x) = x^7 + x^3 + x^2 + x + 1$, and the field
elements are represented using the polynomial basis $PB = \{1, \omega, \dots, \omega^6\}$, where $\omega$
is a root of $f(x)$ (Table 1). The LFSR is defined by the feedback polynomial $\ell(y)$ (Table 1), which is
primitive over $\mathbb{F}_{2^7}$. The 37 stages of the LFSR also constitute the internal state of WAGE,
denoted $S^i = (S^i_{36}, S^i_{35}, \dots, S^i_1, S^i_0)$; the index $i$ marks the $i$-th iteration of the
permutation. For the element $x \in \mathbb{F}_{2^7}$, the decimated WG permutation with decimation
$d = 13$ is defined in Table 1. The 7 bit SB uses a nonlinear transformation Q and a permutation P, which
together yield one round $R = P \circ Q$. The SB itself iterates the function R 5 times, applies Q once,
and then complements the 0th and 2nd bit (Table 1).
Table 1: Specification parameters of WAGE
$a \in \mathbb{F}_{2^7}$   $a = \sum_{i=0}^{6} a_i \omega^i$, $a_i \in \mathbb{F}_2$; vector representation: $[a]_{PB} = (a_0, a_1, a_2, a_3, a_4, a_5, a_6)$
LFSR    $\ell(y) = y^{37} + y^{31} + y^{30} + y^{26} + y^{24} + y^{19} + y^{13} + y^{12} + y^{8} + y^{6} + \omega$, $f(\omega) = 0$
WGP7    $\mathrm{WGP7}(x^d) = x^d + (x^d + 1)^{33} + (x^d + 1)^{39} + (x^d + 1)^{41} + (x^d + 1)^{104}$
SB Q    $Q(x_0, x_1, x_2, x_3, x_4, x_5, x_6) \to (x_0 \oplus (x_2 \wedge x_3), x_1, x_2, x_3 \oplus (x_5 \wedge x_6), x_4, x_5 \oplus (x_2 \wedge x_4), x_6)$
SB P    $P(x_0, x_1, x_2, x_3, x_4, x_5, x_6) \to (x_6, x_3, x_0, x_4, x_2, x_5, x_1)$
SB R    $R(x_0, x_1, x_2, x_3, x_4, x_5, x_6) \to (x_6, x_3 \oplus (x_5 \wedge x_6), x_0 \oplus (x_2 \wedge x_3), x_4, x_2, x_5 \oplus (x_2 \wedge x_4), x_1)$
SB      $(x_0, x_1, x_2, x_3, x_4, x_5, x_6) \leftarrow R^5(x_0, x_1, x_2, x_3, x_4, x_5, x_6)$
        $(x_0, x_1, x_2, x_3, x_4, x_5, x_6) \leftarrow Q(x_0, x_1, x_2, x_3, x_4, x_5, x_6)$
        $(x_0, x_1, x_2, x_3, x_4, x_5, x_6) \leftarrow (\bar{x}_0, x_1, \bar{x}_2, x_3, x_4, x_5, x_6)$
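The entries of Table 1 can be turned into a small executable model. The following Python sketch is a software illustration only, not the hardware implementation discussed later. It assumes that a field element is stored as an integer whose bit $i$ holds the coefficient $a_i$, and that S-box inputs are tuples $(x_0, \dots, x_6)$; both encodings, and all function names, are assumptions made for this example.

```python
from itertools import product

F_POLY = 0b10001111  # f(x) = x^7 + x^3 + x^2 + x + 1; bit i is the coefficient of x^i


def gf_mul(a, b):
    """Multiply two elements of F_{2^7} in the polynomial basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x80:            # degree 7 reached: reduce modulo f(x)
            a ^= F_POLY
    return result


def gf_pow(a, e):
    """Square-and-multiply exponentiation in F_{2^7}."""
    result = 1
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result


def wgp(x, d=13):
    """Decimated WG permutation: WGP7 evaluated at x^d."""
    xd = gf_pow(x, d)
    y = xd ^ 1                  # x^d + 1 in characteristic 2
    return xd ^ gf_pow(y, 33) ^ gf_pow(y, 39) ^ gf_pow(y, 41) ^ gf_pow(y, 104)


def q_layer(x):
    x0, x1, x2, x3, x4, x5, x6 = x
    return (x0 ^ (x2 & x3), x1, x2, x3 ^ (x5 & x6), x4, x5 ^ (x2 & x4), x6)


def p_layer(x):
    x0, x1, x2, x3, x4, x5, x6 = x
    return (x6, x3, x0, x4, x2, x5, x1)


def sbox(x):
    """SB: five rounds of R = P o Q, one extra Q, then complement bits 0 and 2."""
    for _ in range(5):
        x = p_layer(q_layer(x))
    x0, x1, x2, x3, x4, x5, x6 = q_layer(x)
    return (x0 ^ 1, x1, x2 ^ 1, x3, x4, x5, x6)


sb_images = {sbox(x) for x in product((0, 1), repeat=7)}
wgp_images = {wgp(x) for x in range(128)}
print(len(sb_images), len(wgp_images))
```

Counting distinct images is a quick sanity check that both maps are bijections on the 128 possible 7 bit values.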
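Algorithm 1 ACE permutation
Input: $S^0 = A^0 \| B^0 \| C^0 \| D^0 \| E^0$
Output: $S^{16} = A^{16} \| B^{16} \| C^{16} \| D^{16} \| E^{16}$
for i = 0 to 15 do
    $S^{i+1} \leftarrow$ ACE-step($S^i$)
return $S^{16}$

Function ACE-step($S^i$):
    $A^i \leftarrow$ SB-64($A^i_1 \| A^i_0$, $rc^i_0$)
    $C^i \leftarrow$ SB-64($C^i_1 \| C^i_0$, $rc^i_1$)
    $E^i \leftarrow$ SB-64($E^i_1 \| E^i_0$, $rc^i_2$)
    $B^i \leftarrow B^i \oplus C^i \oplus (1^{56} \| sc^i_0)$
    $D^i \leftarrow D^i \oplus E^i \oplus (1^{56} \| sc^i_1)$
    $E^i \leftarrow E^i \oplus A^i \oplus (1^{56} \| sc^i_2)$
    $A^{i+1} \leftarrow D^i$
    $B^{i+1} \leftarrow C^i$
    $C^{i+1} \leftarrow A^i$
    $D^{i+1} \leftarrow E^i$
    $E^{i+1} \leftarrow B^i$
    return ($A^{i+1} \| B^{i+1} \| C^{i+1} \| D^{i+1} \| E^{i+1}$)

Function SB-64($x_1 \| x_0$, $rc$):
    $rc = (q_7, q_6, \dots, q_0)$
    for j = 2 to 9 do
        $x_j \leftarrow (L^5(x_{j-1}) \odot x_{j-1}) \oplus L^1(x_{j-1}) \oplus x_{j-2} \oplus (1^{31} \| q_{j-2})$
    return $(x_9 \| x_8)$
($L^k$ is left rotation by $k$ bits, $\odot$ is bitwise AND)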
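Algorithm 1 translates almost line by line into software. The Python sketch below is a minimal illustration, assuming that $x_1$ is the upper and $x_0$ the lower 32 bit half of a register and that bit $j$ of the integer `rc` is $q_j$. The round and step constants are passed in by the caller; the values used in the demonstration are arbitrary placeholders, not the constants of the ACE specification [4].

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def rotl32(x, r):
    """Left rotation of a 32-bit word by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def sb64(x, rc):
    """8-round Simeck box on x = x1 || x0 with an 8-bit round constant rc."""
    hi, lo = (x >> 32) & MASK32, x & MASK32           # x1, x0
    for j in range(8):
        q = (rc >> j) & 1                             # one constant bit per round
        new = (rotl32(hi, 5) & hi) ^ rotl32(hi, 1) ^ lo ^ (0xFFFFFFFE | q)
        hi, lo = new, hi
    return (hi << 32) | lo                            # x9 || x8


def ace_step(state, rc, sc):
    """One ACE step; state = (A, B, C, D, E), rc and sc are triples of 8-bit constants."""
    a, b, c, d, e = state
    a = sb64(a, rc[0])
    c = sb64(c, rc[1])
    e = sb64(e, rc[2])
    ones56 = MASK64 ^ 0xFF                            # 1^56 || 0^8
    b ^= c ^ (ones56 | sc[0])
    d ^= e ^ (ones56 | sc[1])
    e ^= a ^ (ones56 | sc[2])
    return (d, c, a, e, b)                            # (A, B, C, D, E) of the next step


# Placeholder constants and state, for illustration only.
demo_rc, demo_sc = (0x07, 0x0A, 0x5B), (0x50, 0x28, 0x14)
demo_state = (1, 2, 3, 4, 5)
print([hex(word) for word in ace_step(demo_state, demo_rc, demo_sc)])
```

A full permutation would call `ace_step` 16 times with the constants of each step; in the serial hardware the eight iterations of the loop in `sb64` correspond to eight clock cycles.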
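Algorithm 2 WAGE permutation
Input: $S^0 = (S^0_{36}, S^0_{35}, \dots, S^0_1, S^0_0)$
Output: $S^{111} = (S^{111}_{36}, S^{111}_{35}, \dots, S^{111}_1, S^{111}_0)$
for i = 0 to 110 do
    $S^{i+1} \leftarrow$ WAGE-StateUpdate($S^i$, $rc^i_0$, $rc^i_1$)
return $S^{111}$

Function WAGE-StateUpdate($S^i$, $rc^i_0$, $rc^i_1$):
    $fb = S^i_{31} \oplus S^i_{30} \oplus S^i_{26} \oplus S^i_{24} \oplus S^i_{19} \oplus S^i_{13} \oplus S^i_{12} \oplus S^i_{8} \oplus S^i_{6} \oplus (\omega \otimes S^i_0)$
    $S^{i+1}_{4} \leftarrow S^i_{5} \oplus SB(S^i_{8})$
    $S^{i+1}_{10} \leftarrow S^i_{11} \oplus SB(S^i_{15})$
    $S^{i+1}_{18} \leftarrow S^i_{19} \oplus WGP(S^i_{18}) \oplus rc^i_0$
    $S^{i+1}_{23} \leftarrow S^i_{24} \oplus SB(S^i_{27})$
    $S^{i+1}_{29} \leftarrow S^i_{30} \oplus SB(S^i_{34})$
    $S^{i+1}_{36} \leftarrow fb \oplus WGP(S^i_{36}) \oplus rc^i_1$
    $S^{i+1}_{j} \leftarrow S^i_{j+1}$ where $j \in \{0, \dots, 36\} \setminus \{4, 10, 18, 23, 29, 36\}$
    return $S^{i+1}$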
As mentioned before, the WAGE permutation is iterative, and repeats its round function
WAGE-StateUpdate($S^i$) 111 times, as shown in Algorithm 2. In each round, 6 stages of the LFSR are
updated nonlinearly, while all the remaining stages are just shifted. A pair of 7 bit round constants
$(rc^i_0, rc^i_1)$ is XORed with the pair of stages (18, 36). Round constants are produced by an LFSR of
length 7 with feedback polynomial $x^7 + x + 1$, implemented in a 2-way parallel configuration; see [5]
for details.
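The structure of one round of Algorithm 2 can be sketched in Python as shown below. The S-box and WGP are passed in as functions so that the sketch stays self-contained; the list layout (`state[j]` holds stage $S_j$ as a 7 bit integer with bit $i$ equal to coefficient $a_i$) and the helper names are assumptions made for this illustration. The demonstration uses dummy nonlinear functions and zero round constants purely to expose the shifting pattern.

```python
F_POLY = 0b10001111  # f(x) = x^7 + x^3 + x^2 + x + 1


def mul_omega(a):
    """Multiply a field element by omega (a shift followed by a conditional reduction)."""
    a <<= 1
    if a & 0x80:
        a ^= F_POLY
    return a


def wage_state_update(state, rc0, rc1, sb, wgp):
    """One round of the WAGE permutation on a list of 37 stages."""
    s = state
    fb = (s[31] ^ s[30] ^ s[26] ^ s[24] ^ s[19] ^ s[13] ^ s[12] ^ s[8] ^ s[6]
          ^ mul_omega(s[0]))
    nxt = [s[j + 1] for j in range(36)] + [0]     # plain shift: S_j <- S_{j+1}
    nxt[4] = s[5] ^ sb(s[8])                      # the six nonlinearly updated stages
    nxt[10] = s[11] ^ sb(s[15])
    nxt[18] = s[19] ^ wgp(s[18]) ^ rc0
    nxt[23] = s[24] ^ sb(s[27])
    nxt[29] = s[30] ^ sb(s[34])
    nxt[36] = fb ^ wgp(s[36]) ^ rc1
    return nxt


def dummy(x):
    """Stand-in for SB and WGP, used only to show the data movement."""
    return 0


demo = wage_state_update(list(range(37)), 0, 0, dummy, dummy)
print(demo)
```

With the dummy functions every stage except $S_{36}$ simply takes the value of its upper neighbour, which makes the six tap positions easy to see once real `sb` and `wgp` functions are substituted.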
2.3 The Unified Duplex Sponge Mode
ACE-AE-128 and WAGE-AE-128 use the unified duplex sponge mode from sLiSCP [9] (Figure 1). The
phases for encryption and decryption are: initialization, processing of associated data, encryption (Fig-
ure 1(a)) or decryption (Figure 1(b)), and finalization. Figure 1 also shows the domain separators for
each phase. The internal state is divided into a capacity part Sc (256 bits for ACE-AE-128 and 195 bits for
WAGE) and a 64 bit rate Sr, which for:
• ACE-AE-128 consists of bytes A[7], A[6], A[5], A[4], C[7], C[6], C[5], C[4]
• WAGE-AE-128 consists of the 0-th bit of stage S36, i.e., S36,0, and all bits of stages S35, S34, S28, S27,
S18, S16, S15, S9 and S8
The input data (associated data AD, message M or ciphertext C) is absorbed (or replaced) into the rate
part of the internal state. If the input data length is not a multiple of 64, padding with $(10^*)$ is
needed. In Figure 1, d denotes the number of 64 bit blocks of AD and m the number of 64 bit blocks of M and
C after padding. Refer to [4, 5] for further padding rules. No padding is needed during initialization and
finalization because both schemes use a 128 bit key. With the exception of tag extraction, both schemes
generate an output only during the encryption and decryption phases: the 64 bit output block is obtained
by the xor of the current input and rate.
Figure 1: Schematic diagram of the AEAD algorithm, where Perm is the ACE or WAGE permutation respectively:
(a) authenticated encryption, (b) verified decryption. The domain separators are 0x00 during
initialization and finalization, 0x01 while processing associated data, and 0x02 during encryption or
decryption.
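The difference between absorbing (encryption) and replacing (decryption) can be illustrated with a toy duplex model in Python. This is a sketch under strong simplifying assumptions: the state is modelled as a 64 bit rate and a 64 bit stand-in for the capacity, `toy_perm` is a placeholder mixing function and not the ACE or WAGE permutation, and the point at which the domain separator is xored into the capacity is chosen for illustration only.

```python
MASK64 = (1 << 64) - 1


def toy_perm(rate, cap):
    """Placeholder for Perm; NOT the ACE or WAGE permutation."""
    mixed = (rate * 0x9E3779B97F4A7C15 + 1) & MASK64
    return cap ^ mixed, rate


def process_blocks(state, blocks, decrypt, perm=toy_perm, dom_sep=0x02):
    """Encrypt or decrypt 64-bit blocks in duplex fashion."""
    rate, cap = state
    out = []
    for block in blocks:
        result = block ^ rate                 # output = current input XOR rate
        # Encryption absorbs M, decryption replaces the rate with C;
        # in both cases the rate ends up holding the ciphertext block.
        rate = block if decrypt else result
        cap ^= dom_sep                        # assumed position of the domain separator
        rate, cap = perm(rate, cap)
        out.append(result)
    return out, (rate, cap)


start = (0x0123456789ABCDEF, 0x0F1E2D3C4B5A6978)   # hypothetical state after the AD phase
message = [0x1111, 0x2222, 0x3333]
ciphertext, state_enc = process_blocks(start, message, decrypt=False)
plaintext, state_dec = process_blocks(start, ciphertext, decrypt=True)
print(plaintext == message, state_enc == state_dec)
```

Because both directions leave the ciphertext block in the rate, the encrypting and decrypting parties reach the same state before finalization, which is what allows the tags to match.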
Figure 1 also shows functions load-AE(N,K) and tagextract(S), which are straightforward for ACE. The
ACE load-AE(N,K) performs the loading of the 128 bit key K and nonce N , where the key is loaded into
registers A and C, the nonce into B and E, and the register D is loaded with zeros. The ACE tagextract(S)
extracts the 128 bit tag from registers A and C. Special care was taken in the specification of load-AE(N,K)
and tagextract(S) of WAGE to take advantage of the shifting nature of the LFSR, which will be discussed
in more detail in Section 3.3.
Figure 2: Schematic diagram of the ACE-H-256 algorithm, where Perm is the ACE permutation
The ACE HASH functionality is shown in Figure 2, with only two phases, namely absorbing and squeezing.
The only input is now the message M . Since the hash has a fixed length of 256 bits, the length of the
squeezing phase is fixed. ACE-H-256 is unkeyed, and the state is loaded with a fixed initialization vector
IV. More specifically, the function load-H(IV ) loads the state bytes B[7], B[6] and B[5] with bytes 0x80,
0x40, and 0x40 respectively, and sets all other state bits to zero.
3 Hardware Implementations
3.1 Hardware Design Principles and Interface with the Environment
The hardware implementations follow these design principles and assumptions:
1. Multi-functionality module. The system should include all supported operations in a single
module (Figure 3), because lightweight applications cannot afford the extra area for separate modules.
2. Single input/output ports. In small devices, ports can be expensive. To ensure that ACE and
WAGE are not biased in favour of the system, at the expense of the environment, the ciphers have
one input and one output port (Table 2). That being said, the authors agree with the proposed
lightweight cryptography hardware API’s [8] use of separate public and private data ports and will
update implementations accordingly.
3. Valid-bit protocol and stalling capability. The environment may take an arbitrarily long time
to produce any piece of data. For example, a small microprocessor could require multiple clock cycles
to read data from memory and write it to the system’s input port. The receiving entity must capture
the data in a single clock cycle (Figure 4). In reality, the environment can stall as well. In the future,
ACE and WAGE implementations will be updated to match the proposed lightweight cryptographic
hardware API’s use of a valid/ready protocol for both input and output ports.
4. Use a “pure register-transfer-level” implementation style. In particular, use only registers,
not latches; multiplexers, not tri-state buffers; synchronous, not asynchronous reset; no scan-cell
flip-flops; clock-gating is used for power and area optimization.
Since both ACE and WAGE use a unified sponge duplex mode, they share a common interface with the
environment (Table 2). The environment separates the associated data and the message/ciphertext, and
performs padding if necessary. The domain separators shown in Figure 1 are provided by the environment
and serve as an indication of the phase change for AEAD functionality. For ACE-H-256, the phase change
is indicated by the change of the i_mode(0) signal, as shown in Table 3. The hardware is unaware of the
lengths of individual phases, hence no internal counters for the number of processed blocks are needed.
Table 2: Interface signals
Input signal    Meaning
reset           resets the state machine
i_mode          mode of operation
i_dom_sep       domain separator
i_padding       the last block is padded
i_data          input data
i_valid         valid data on i_data
Output signal   Meaning
o_ready         hardware is ready
o_data          output data
o_valid         valid data on o_data

Table 3: Modes of operation
i_mode(1)   i_mode(0)   Mode        Operation or phase
0           0           ACE-E       Encryption
0           1           ACE-D       Decryption
1           0           ACE-H-256   Absorb
1           1           ACE-H-256   Squeeze
-           0           WAGE-E      Encryption
-           1           WAGE-D      Decryption
Figure 3: Top-level module and interface
Figure 4: Timing diagram for ACE during encryption (signals i_valid, i_data, o_data, o_ready and o_valid,
and the counters pcount, round and step; 8 rounds per step and 16 steps per permutation; pcount stalls
while waiting for i_valid)
The top-level module, shown in Figure 3, is also very similar for both ACE and WAGE. It depicts the interface
signals from Table 2, with only slight differences in bitwidths. Figure 4 shows the timing diagram for ACE
during the encryption phase of message blocks M5 and M6, which clearly shows the valid-bit protocol.
The first five lines show the top-level interface signals and line six shows the value of the permutation
counter pcount, which is a part of the ACE finite state machine (FSM) and keeps track of the 128 clock
cycles needed for one ACE permutation. After completing the previous permutation, the top-level module
asserts o_ready to signal to the environment that an ACE permutation just finished and new data can be
accepted. The environment replies with a new message block M5 accompanied by an i_valid signal. The
hardware immediately encrypts, returns C5 and asserts o_valid. This clock cycle is also the first round
of a new ACE permutation and o_ready is deasserted, indicating that the hardware is busy. Figure 4
shows the ACE hardware remaining busy (o_ready = 0) for the duration of one ACE permutation. When
pcount wraps around from 127 to 0, the hardware is again idle and ready to receive new input, in this case
M6. A few more details about the use of pcount will follow in Subsection 3.2. The interaction between
the top-level module and the environment during the encryption phase of WAGE is very similar, with 111
clock cycles for the completion of one permutation. More significant differences for the interaction with
the environment arise during loading, tag extraction and, of course, ACE-H-256.
3.2 ACE Datapath
Figure 5(a) shows the ACE datapath. The top and bottom of the figure depict the five 64 bit registers A,
B, C, D and E, followed by the hardware components required for normal operation during permutation,
absorbing, and replacing, which imposes input multiplexers controlled by the mode and the counter pcount.
Similarly, the output multiplexers are needed to accommodate encryption/decryption and tag generation
for ACE-AE-128 and squeezing for ACE-H-256. Furthermore, the output is forced to 0 during normal
operation. The registers A, C and E are split in half to accommodate inputs and outputs. The rest of
Figure 5(a) shows one step of the ACE permutation (Algorithm 1). The rounds and steps always use
the same hardware, but in different clock cycles, which forces the use of multiplexers inside the ACE
permutation. The last row of multiplexers accommodates loading.
Figure 5: The ACE module datapath and parallelization: (a) the ACE module datapath, (b) parallelization
p = 4 segment for registers A and B
3.3 WAGE Datapath
Because of the shifting nature of the LFSR, which in turn affects loading, absorbing and squeezing, the
WAGE datapath is slightly more complicated than the ACE datapath and hence is explained in two levels:
Figure 6: The WAGE cipher datapath

Table 4: wage_lfsr loading through D0
shift count   S8    S7    S6    S5    S4    S3    S2    S1    S0
0             K̂0    -     -     -     -     -     -     -     -
1             K̂2    K̂0    -     -     -     -     -     -     -
2             K̂4    K̂2    K̂0    -     -     -     -     -     -
3             K̂6    K̂4    K̂2    K̂0    -     -     -     -     -
4             K̂8    K̂6    K̂4    K̂2    K̂0    -     -     -     -
5             K̂10   K̂8    K̂6    K̂4    K̂2    K̂0    -     -     -
6             K̂12   K̂10   K̂8    K̂6    K̂4    K̂2    K̂0    -     -
7             K̂14   K̂12   K̂10   K̂8    K̂6    K̂4    K̂2    K̂0    -
8             K̂16   K̂14   K̂12   K̂10   K̂8    K̂6    K̂4    K̂2    K̂0

Figure 7: The wage_lfsr stages S0, . . . , S10 with multiplexers, XOR and AND gates for the sponge mode
1. wage_lfsr treated as a black box in Figure 6 with p = 1 (no parallelization)
• wage_lfsr: The LFSR has 37 stages with 7 bits per stage, a feedback with 10 taps and a module
for multiplication with ω (Table 1). The internal state of wage_lfsr is also the internal state S
of WAGE.
• WGP module implementing WGP: For smaller fields like $\mathbb{F}_{2^7}$, the WGP area, when
implemented as a constant array in VHDL/Verilog, i.e., as a look-up table, is smaller than when
implemented using components such as multiplication and exponentiation to powers of two [13,14].
However, the WGP is not stored in hardware as a memory array, but rather as a net of AND, OR, XOR
and NOT gates, derived and optimized by the synthesis tools.
• SB module: The SB is implemented in unrolled fashion, i.e. as purely combinational logic,
composed of 5 copies of R, followed by a Q and the final two NOT gates (Table 1).
• lfsr_c: The lfsr_c for generating the round constants was implemented in a 2-way parallel
fashion. It has only seven 1 bit stages and two XOR gates for the two feedback computations.
2. Extra hardware for the wage_lfsr in sponge mode. Figure 7 shows details for stages S0, . . . , S10. The
grey line represents the path for normal operation during the WAGE permutation. The additional
hardware for the entire wage_lfsr is listed below, with examples in brackets referring to Figure 7.
• The 64 bit i_data is padded with zeros to 70 bits, then fragmented into 7 bit wage_lfsr inputs
Dk, k = 0, . . . , 9, corresponding to the rate stages Sr. For each data input Dk there is a
corresponding 7 bit data output Ok (D1, O1 and D0, O0 in Figure 7).
• 10 XOR gates must be added to the Sr stages to accommodate absorbing, encryption and
decryption (XORs at stages S9, S8).
• 10 multiplexers to switch between absorbing and normal operation (Amux1, Amux0 at S9, S8).
• An XOR and a multiplexer are needed to add the domain separator i_dom_sep (Amux at S0).
• To replace the contents of the Sr stages, 10 multiplexers are added (Rmux1 at stage S9).
• Instead of additional multiplexers for loading, the existing Rmuxk, k = 9, 5, 4, 3, 0, multiplexers
are now controlled by replace or load and labelled RLmuxk (see RLmux0 on S8). Since all
non-input stages must keep their previous values, an enable signal lfsr_en is needed.
• Three 7 bit AND gates to turn off the inputs D6, D3 and D1 (the AND gate at D1).
• Four multiplexers are needed to turn off the SB during loading and tag extraction (SBmux at S4).
• The total hardware cost to support the sponge mode is: 24 7 bit multiplexers and one 2 bit
multiplexer, 10 7 bit XOR gates and one 2 bit XOR gate, and three 7 bit AND gates.
As mentioned in Section 2, special care was given to the design of loading and tag extraction. The existing
data inputs Dk are reused for loading, and the outputs Ok for tag extraction. The wage_lfsr is divided
into five loading regions using the inputs Dk, k = 9, 5, 4, 3, 0. For example, the region S0, . . . , S8 in
Figure 7 is loaded through input D0; however, instead of storing D0 ⊕ S8, the D0 data is fed directly
into S8, i.e. the RLmux0 disconnects the Amux0 output. The remaining stages in this region are loaded by
shifting, which requires the SBmux at S4. Note that there is no need to disconnect the two WGP, because
they are automatically disabled by loading through D9 and D4, located at stages S36 and S18 respectively.
The loading process is illustrated in Table 4, where K̂i is the i-th 7 bit block of the 128 bit key K.
Table 4 shows the key shifting through the LFSR stages in 9 clock cycles. The stages are shown in the
header row of Table 4, and the values “-” in the table denote the old, unknown values that are overwritten
by the new key. The state of stages S8, . . . , S0 after loading is finished is shown in the last row. The
tag is extracted in a similar fashion as loading, but from the data output Ok at the end of a particular
loading region, e.g., the region S9, . . . , S16, loaded through D3, is extracted through O1. The longest
tag extraction region is of length 9, which is the same as the longest loading region.
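The shifting behaviour summarized in Table 4 can be reproduced with a few lines of Python. This is an illustrative model of a single loading region only; the function name and the use of string labels in place of 7 bit key blocks are assumptions made for readability.

```python
def load_region(region, blocks):
    """Shift blocks into a loading region through its top stage, one per clock cycle.

    region[0] is the lowest stage (S0) and region[-1] is the input stage (S8 for D0).
    Returns the region contents after every clock cycle.
    """
    region = list(region)
    history = []
    for block in blocks:
        # Every stage takes the value of its upper neighbour; the top stage takes D0.
        region = region[1:] + [block]
        history.append(list(region))
    return history


labels = [f"K{2 * n}" for n in range(9)]          # K0, K2, ..., K16 as in Table 4
history = load_region(["-"] * 9, labels)
for count, row in enumerate(history):
    print(count, " ".join(reversed(row)))         # print S8 first, as in Table 4
```

After 9 clock cycles the first block fed in has reached S0 and no old value remains in the region, matching the last row of Table 4.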
3.4 Hardware-Oriented Design Decisions
The design process for ACE and WAGE tightly integrated cryptanalysis and hardware optimizations. A few
key hardware-oriented decisions are highlighted here; more can be found in the design rationale chapters
of [4, 5].
Functionally, it is equivalent for the boundary between phases to occur either before or after the permuta-
tion. For ACE and WAGE, the boundary was placed after the permutation updates the state register. This
means that the two-bit domain separator is sufficient to determine the value of many of the multiplexer
select lines and other control signals. All phases that have a domain separator of "00" have the same mul-
tiplexer select values. The same also holds true for "01". Unfortunately, this cannot be achieved for "10",
because encryption and decryption require different control signal values, but the same domain separator.
Using the domain separator to signal the transition between phases for encryption and decryption also
simplifies the control circuit. For hashing, the change in phase is indicated by the i_mode signal.
In applications such as lightweight cryptography, where clock speed is limited by power consumption
rather than by the delay through combinational circuitry, it is beneficial to lump as much
combinational circuitry as possible together into a single
clock cycle. This provides more optimization opportunities for the synthesis tools than if the circuitry
was separated by registers. For this reason, the ACE datapath was designed so that the input and out-
put multiplexers, one round of the permutation, and state loading multiplexers together form a purely
combinational circuit, followed by the state register.
4 Parallel Implementations
4.1 Parallelization in General
Both ciphers can be parallelized (unrolled) to execute multiple rounds per clock cycle, at the cost of
increased area. In the top-level schematic in Figure 3, the dashed stacked boxes indicate parallelization.
The FSM is parameterized with parameter p and used for un-parallelized (p=1) and parallelized (p>1)
implementations. Other components are replicated to show p copies, with p=3 in Figure 3. Such a
representation is symbolic; parallelization is applied only to the permutation, not the entire datapath. The
interface with the environment remains the same.
4.2 ACE
The p=1 un-parallelized ACE permutation performs a single round per clock cycle, which implies 8 clock
cycles per step. Parallel, i.e. unrolled, versions perform p rounds per clock cycle, and were implemented
for divisors of 8, i.e. p = 2, 4, 8. The ACE permutation could be parallelized further, e.g. two or more
steps in a single clock cycle. Figure 5(b) shows the example p=4 for registers A and B, with p=4 copies
of SB-64 connected in series. Each SB-64 has its own round constant $rc_0^k$, $k = 0, \dots, p-1$. The
round vs. step multiplexers are still needed, and can be removed only for values of p that are multiples
of 8. Also note the step constant indicated as $sc_0^{p-1}$. For p=4 a step is concluded in 2 clock
cycles. However, this requires a modification to the lfsr_c, which must now generate $p \cdot 3$ round
constant bits $rc_j^k$, $j = 0, 1, 2$, $k = 0, \dots, p-1$ per clock cycle. The last cycle within a step
requires 7 additional bits, which together with $rc_j^{p-1}$ yield 10 bits for the step constant
generation $sc_j^{p-1}$. In the case p=4 the lfsr_c must generate 12 constant bits in the first cycle and
19 constant bits in the second clock cycle of the step, which are then used for $rc_j^k$ and $sc_j^k$.
For the extra constant bits, the lfsr_c feedback was replicated, i.e. $(p-1) \cdot 3$ feedbacks in
addition to the original 3.
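The bookkeeping in this subsection can be checked with a short Python sketch. It only restates the counts given in the text (8 rounds per step, 16 steps per permutation, 3p constant bits per cycle plus 7 in the last cycle of a step, one 64 bit block per permutation); the function name and the dictionary keys are assumptions made for this example.

```python
def ace_constant_schedule(p):
    """Cycle and constant-bit counts for a p-way unrolled ACE permutation."""
    if p not in (1, 2, 4, 8):
        raise ValueError("this sketch covers only p = 1, 2, 4, 8")
    cycles_per_step = 8 // p
    bits_per_cycle = [3 * p] * cycles_per_step
    bits_per_cycle[-1] += 7                     # last cycle of a step also feeds the step constants
    cycles_per_permutation = 16 * cycles_per_step
    return {
        "cycles_per_step": cycles_per_step,
        "cycles_per_permutation": cycles_per_permutation,
        "constant_bits_per_cycle": bits_per_cycle,
        "extra_feedbacks": 3 * (p - 1),
        "throughput_bpc": 64 / cycles_per_permutation,
    }


for p in (1, 2, 4, 8):
    print(p, ace_constant_schedule(p))
```

For p=4 the sketch reports 12 and 19 constant bits in the two cycles of a step, and the throughput values for p = 1, 2, 4, 8 come out as 0.5, 1, 2 and 4 bpc.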
4.3 WAGE
Figure 8: The WAGE permutation with p=3
WAGE performs one clock cycle for the interaction with the environment, i.e. absorbing or replacing the
input data into the state, followed by 111 clock cycles of the WAGE permutation. Because 111 is divisible
only by 3 and 37, the opportunities to parallelize WAGE appear rather limited. However, by treating the
absorption or replacement of the input data into the internal state as an additional clock cycle in the
permutation, we increase the length of the permutation to 112 clock cycles. Because 112 has many
divisors, this allows parallelism of p = 2, 3, 4, 6, 8. The cost is a less than 1% decrease in performance for
the additional clock cycle and some additional multiplexers, because the clock cycle that loads data has
different behaviour than the normal clock cycles.
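The throughput figures reported for WAGE follow from the cycle counts above. The Python sketch below covers only the degrees of parallelization that appear in the implementation results (p = 1, 2, 3, 4, 8). It assumes that p rounds are executed per clock cycle, that the 112-cycle schedule is used when p divides 112, and that for p=3 the input cycle is kept separate from the 111/3 = 37 permutation cycles; this last point is an inference from the reported 1.68 bpc, not a statement from the text.

```python
def wage_cycles_per_block(p):
    """Clock cycles needed to process one 64-bit block with p-way unrolling."""
    if 112 % p == 0:
        return 112 // p            # input cycle folded into the permutation
    if 111 % p == 0:
        return 1 + 111 // p        # assumed: separate input cycle (p = 3)
    raise ValueError(f"p = {p} is not covered by this sketch")


def wage_throughput(p):
    """Throughput in bits per clock cycle, rounded to two decimals."""
    return round(64 / wage_cycles_per_block(p), 2)


for p in (1, 2, 3, 4, 8):
    print(p, wage_cycles_per_block(p), wage_throughput(p))
```

Under these assumptions the sketch yields 0.57, 1.14, 1.68, 2.29 and 4.57 bpc.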
Figure 8 shows the 3-way parallel wage_lfsr including all nonlinear components and their copies.
Multiplexers are not replicated, and hence, are not shown. For the components f, rc, WGP and SB in
Figure 8, the superscript k indicates the original (k = 0) and the two copies (k = 1, 2). Computation of
the three feedbacks $f^k$ is not shown but is conducted as
$f^k = S_{31+k} \oplus S_{30+k} \oplus S_{26+k} \oplus S_{24+k} \oplus S_{19+k} \oplus S_{13+k} \oplus S_{12+k} \oplus S_{8+k} \oplus S_{6+k} \oplus (\omega \otimes S_{0+k})$.
Similar to ACE, the generation of the WAGE round constants $rc_1^k$, $rc_0^k$ must be parallelized as
well. For readability, the two WGP were labelled $WGP_1^k$, $WGP_0^k$, with $WGP_1^0$, $WGP_0^0$ being
the original WGPs positioned at S36, S18, just like $rc_1^0$, $rc_0^0$. Similarly, the SBs were also
labelled $SB_j^k$, $j = 3, 2, 1, 0$, in decreasing order, i.e. $SB_3^0$ is the original SB with input S34.
5 Implementation Technologies and ASIC Implementation Results
Logic synthesis was performed with Synopsys Design Compiler version P-2019.03 using the compile_ultra
command and clock gating. Physical synthesis (place and route) and power analysis were done with
Cadence Encounter v14.13 using a density of 95%. Simulations were done in Mentor Graphics ModelSim
SE v10.5c. The ASIC cell libraries used were ST Microelectronics 65 nm CORE65LPLVT 1.25V, TSMC 65
nm tpfn65gpgv2od3 200c and tcbn65gplus 200a at 1.0V, ST Microelectronics 90 nm CORE90GPLVT and
CORX90GPLVT at 1.0V, and IBM 130 nm CMRF8SF LPVT with SAGE-X v2.0 standard cells at 1.2V.
Some past works have used scan-cell flip-flops to reduce area, because these cells include a 2:1 multiplexer
in the flip-flop which incurs less area than using a separate multiplexer. Scan-cell flip-flops were not used
because their use as part of the design would prevent their insertion for fault-detection and hence, prevent
the circuit from being tested for manufacturing faults. Furthermore, chip enable signals were removed from
all datapath registers, which are controlled by clock gating instead. This allows a further reduction of the
implementation area.
[Figure 9: scatter plot of throughput (bits/cycle; log scale, ticks at 0.5, 1.0, 2.0, 4.0) against area (GE;
ticks from 2500 to 12000). Data points A-1, A-2, A-4, A-8 (ACE) and W-1, W-2, W-3, W-4, W-8 (WAGE) are
shown for the ST Micro 65 nm, TSMC 65 nm, ST Micro 90 nm and IBM 130 nm libraries, with contour lines of
relative optimality labelled from 0.25 to 2.82.]
Throughput is measured in bits per clock cycle (bpc), and plotted on a log scale axis.
The area axis is scaled as log(Area^2).
Figure 9: Area^2 vs Throughput
Figure 9 shows area^2 vs. throughput for both ACE and WAGE with different degrees of parallelization,
denoted by W-p and A-p (p = 1, 2, 3, 4, 8). The throughput axis is scaled as log(Tput) and the area axis is
scaled as log(area^2). The grey contour lines denote the relative optimality of the circuits, measured as
Tput/area^2. Throughput is increased by increasing the degree of parallelization (unrolling), which reduces
the number of clock cycles per permutation round. For p=1, the area of WAGE (W-1) is less than that of ACE
(A-1), because WAGE has 259 registers, compared to 320 for ACE. As parallelization is increased, WAGE’s area
grows faster than ACE’s, because of the larger size of WAGE’s permutation. Going from p=1 to p=8 results
in a 1.72× area increase for ACE and a 3.80× increase for WAGE on average. Optimality for WAGE reaches a
maximum at p=3. For ACE, optimality continues to increase beyond p=8.
Table 5: Post-PAR implementation results
                 ST Micro 65 nm        TSMC 65 nm            ST Micro 90 nm        IBM 130 nm
Label    Tput    A      f      E       A      f      E       A      f      E       A      f      E
[A/W-p]  [bpc]   [GE]   [MHz]  [nJ]    [GE]   [MHz]  [nJ]    [GE]   [MHz]  [nJ]    [GE]   [MHz]  [nJ]
ACE
A-1      0.5     4250   720    27.9    4600   705    20.1    3660   657    62.2    4350   128    46.8
A-2      1       4780   618    18.4    5290   645    12.4    4130   628    35.8    4980   88.9   29.4
A-4      2       5760   394    15.1    6260   588    8.51    4940   484    25.4    5910   90.5   21.1
A-8      4       7240   246    11.4    8090   493    6.40    6170   336    19.4    7550   63.2   18.4
WAGE
W-1      0.57    2900   907    20.0    3290   1120   13.0    2540   940    39.2    2960   153    30.4
W-2      1.14    4960   590    19.1    5310   693    10.6    4280   493    34.4    4850   98.5   N/A
W-3      1.68    5480   397    20.4    5930   527    10.7    4770   414    31.2    5460   79.6   26.5
W-4      2.29    6780   307    24.0    7460   387    12.1    5790   277    32.9    6700   51.9   33.4
W-8      4.57    12150  192    38.5    11870  204    19.9    9330   137    49.9    10960  34.5   59.9
Note: Energy results were obtained with timing simulation at 10 MHz.
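As a rough cross-check, the area growth figures quoted above can be recomputed from the Table 5 areas. The sketch below is an illustration only: the dictionary layout and function names are hypothetical, the inputs are the rounded table values (so the last digit may differ slightly from the text), and the optimality metric is taken to be the unnormalized Tput/area^2, whereas Figure 9 plots a relative value.

```python
# Post-PAR areas in GE from Table 5, ordered as
# (ST Micro 65 nm, TSMC 65 nm, ST Micro 90 nm, IBM 130 nm).
AREA = {
    "A-1": (4250, 4600, 3660, 4350),
    "A-8": (7240, 8090, 6170, 7550),
    "W-1": (2900, 3290, 2540, 2960),
    "W-8": (12150, 11870, 9330, 10960),
}

# (throughput in bpc, area in GE) for ACE on ST Micro 65 nm, from Table 5.
ACE_ST65 = [(0.5, 4250), (1.0, 4780), (2.0, 5760), (4.0, 7240)]


def mean_area_growth(small: str, large: str) -> float:
    """Average, over the four libraries, of area(large) / area(small)."""
    ratios = [b / a for a, b in zip(AREA[small], AREA[large])]
    return sum(ratios) / len(ratios)


def optimality(tput_bpc: float, area_ge: float) -> float:
    """Unnormalized optimality metric Tput / area^2."""
    return tput_bpc / area_ge ** 2


ace_growth = mean_area_growth("A-1", "A-8")
wage_growth = mean_area_growth("W-1", "W-8")
print(f"ACE  p=1 -> p=8: {ace_growth:.2f}x area")
print(f"WAGE p=1 -> p=8: {wage_growth:.2f}x area")

# Optimality of the four ACE design points, in order of increasing p.
ace_optimality = [optimality(t, a) for t, a in ACE_ST65]
print(ace_optimality)
```

For ACE on ST Micro 65 nm, the computed optimality values increase with every step in p, in line with the trend described for Figure 9.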
As the roughly constant size of the shaded rectangles enclosing the data points shows, the relative
area increase with parallelization is largely independent of implementation technology.
Table 5 presents the same data points as Figure 9, with the addition of maximum frequency (f, MHz)
and energy per bit (E, nJ). Energy is measured as the average value while performing all cryptographic
operations over 8192 bits of data at 10 MHz. As the ACE throughput increases, energy per bit decreases
consistently, despite the higher circuit area and, therefore, higher power consumption. This is not the case
for WAGE, whose energy per bit rises again at higher degrees of parallelization. This phenomenon can be
explained by the higher relative area increase for WAGE, which comes from the higher complexity of WGP
with respect to SB-64. Connecting more WGPs in a combinational chain results in an exponential increase
of the number of glitches, which drastically increases power consumption.
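The energy trend can be related to power with a short calculation. Energy is power multiplied by time, and at a fixed clock frequency and a fixed amount of data the processing time is inversely proportional to throughput, so the average power of a p-way design relative to the serial design is approximately (E_p / E_1) × (Tput_p / Tput_1). The sketch below applies this to the ST Micro 65 nm columns of Table 5. The function name is hypothetical, and the result is an estimate implied by the table under these assumptions, not a measurement reported in the paper.

```python
# {p: (throughput in bpc, energy in nJ)} for ST Micro 65 nm, from Table 5.
ACE = {1: (0.5, 27.9), 2: (1.0, 18.4), 4: (2.0, 15.1), 8: (4.0, 11.4)}
WAGE = {1: (0.57, 20.0), 2: (1.14, 19.1), 3: (1.68, 20.4),
        4: (2.29, 24.0), 8: (4.57, 38.5)}


def relative_power(results: dict, p: int) -> float:
    """Average power of the p-way design relative to the p=1 design.

    Assumes the same clock frequency and the same amount of data for both,
    so that processing time scales as 1 / throughput.
    """
    tput_1, energy_1 = results[1]
    tput_p, energy_p = results[p]
    return (energy_p / energy_1) * (tput_p / tput_1)


for name, results in (("ACE", ACE), ("WAGE", WAGE)):
    for p in sorted(results):
        print(f"{name} p={p}: {relative_power(results, p):.2f}x the power of p=1")
```

Under these assumptions, the 8-way ACE design draws roughly 3.3 times the power of the serial design for 8 times the throughput, while the 8-way WAGE design draws roughly 15 times the power, which is consistent with the glitching explanation above.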
Table 6 summarizes the area on ST Micro 65 nm of the LWC submissions [3] that included synthesizable
VHDL or Verilog code. It reports both the area obtained using the ST Micro 65 nm process and tool flow
from this paper and the area reported in the submission documents. The various ciphers use different
protocols and interfaces, sometimes provide different functionality (e.g., with or without hashing), and use
different key sizes. As such, this analysis is imprecise and gives only a rough comparison with the ACE and
WAGE results. As the LWC competition progresses and the hardware API matures, more precise comparisons
will become possible. This preliminary analysis indicates that ACE and WAGE are among the smaller cipher
candidates.
Table 6: Area of LWC candidates on ST Micro 65 nm (post-PAR)
                                                 This work    Reported in submission documents [3]
Cipher        Module                             Area (kGE)   Area (kGE)   ASIC technology used
Drygascon     drygascon128_1round_cycle          29.6         N/A
Gage          gage1h256c224r008AllParallel       10.4         N/A
Lilliput-AE   lilliputaei128v1 encryptdecrypt    9.9          4.2          theoretical estimate for 5 lanes
Remus         remus_top                          7.4          3.6          TSMC 65nm
Subterranean  crypto_aead simple_axi4_lite       6.5          5.7          FreePDK 45nm
Triadx        triadx1                            1.5
Thash         thash1                             1.5
6 Conclusion
The goal of the ACE and WAGE design process was to build on the well-studied Simeck S-Box and Welch-Gong
permutation. The overall algorithms were designed to lend themselves to efficient implementations
in hardware and to scale well with increased parallelism. ACE has a larger internal state (320 bits, vs 259
for WAGE), but the ACE permutation is smaller than that of WAGE. This means the non-parallel version
of WAGE is smaller than that of ACE, but as parallelism increases, WAGE eventually becomes larger than
ACE. At 1 and 2 bits-per-cycle, the designs are relatively similar in area. A number of the NIST LWC
candidate ciphers provided synthesizable source code. A preliminary comparison with these ciphers on ST
Micro 65 nm indicates that ACE and WAGE are likely to be among the smaller candidates.
Acknowledgements This work benefited from the collaborative environment of the Communications Security
(ComSec) Lab at the University of Waterloo, and in particular from discussions with Kalikinkar Mandal,
Raghvendra Rohit, and Guang Gong.
References
[1] NIST Lightweight Cryptography https://csrc.nist.gov/Projects/Lightweight-Cryptography
[2] Submission Requirements and Evaluation Criteria for the Lightweight Cryptography Standardization
Process https://csrc.nist.gov/CSRC/media/Projects/Lightweight-Cryptography/documents/final-lwc-submission-requirements-august2018.pdf
[3] NIST Lightweight Cryptography round 1 candidates https://csrc.nist.gov/Projects/Lightweight-Cryptography/Round-1-Candidates
[4] M.D. Aagaard, R. AlTawy, G. Gong, K. Mandal, R. Rohit, “ACE: An Authenticated Encryption and
Hash Algorithm — Submission to the NIST LWC Competition”, March 2019,
https://csrc.nist.gov/CSRC/media/Projects/Lightweight-Cryptography/documents/round-1/spec-doc/ace-spec.pdf
[5] M.D. Aagaard, R. AlTawy, G. Gong, K. Mandal, R. Rohit, “WAGE: An Authenticated Cipher —
Submission to the NIST LWC Competition”, March 2019,
https://csrc.nist.gov/CSRC/media/Projects/Lightweight-Cryptography/documents/round-1/spec-doc/wage-spec.pdf
[6] B. Rezvani, W. Diehl, “Hardware Implementations of NIST Lightweight Cryptographic Candidates:
A First Look”, Cryptology ePrint Archive, Report 2019/824, 2019.
[7] E. Homsirikamol, W. Diehl, A. Ferozpuri, F. Farahmand, P. Yalla, J.P. Kaps, K. Gaj, “CAESAR
Hardware API.” Cryptology ePrint Archive, Report 2015/669, 2016.
[8] J.P. Kaps, W. Diehl, M. Tempelmeier, E. Homsirikamol, K. Gaj, “Hardware API for Lightweight
Cryptography”, 2019
[9] R. AlTawy, R. Rohit, M. He, K. Mandal, G. Yang and G. Gong. sLiSCP: Simeck-based Permutations
for Lightweight Sponge Cryptographic Primitives. In SAC (2017), C. Adams and J. Camenisch, Eds.,
Springer, pp 129-150.
[10] G. Yang, B. Zhu, V. Suder, M.D. Aagaard, and G. Gong. The Simeck family of lightweight block
ciphers. In CHES (2015), T. Güneysu and H. Handschuh, Eds., Springer, pp. 307-329.
[11] Y. Nawaz and G. Gong. The WG stream cipher. ECRYPT Stream Cipher Project Report 2005 33
(2005).
[12] Y. Nawaz, and G. Gong. WG: A family of stream ciphers with designed randomness properties. Inf.
Sci. 178, 7 (Apr. 2008), 1903-1916.
[13] M.D. Aagaard, G. Gong, and R.K. Mota. Hardware implementations of the WG-5 cipher for passive
RFID tags. In Hardware-Oriented Security and Trust (HOST), 2013, IEEE, pp. 29-34.
[14] Y. Luo, Q. Chai, G. Gong, and X. Lai. A lightweight stream cipher WG-7 for RFID encryption and
authentication. In 2010 IEEE Global Telecommunications Conference GLOBECOM 2010 (Dec 2010),
pp. 1-6.
