Design and Analysis of Dual-Rail Circuits
for Security Applications
Danil Sokolov, Student Member, IEEE, Julian Murphy, Student Member, IEEE,
Alexander Bystrov, Member, IEEE, and Alex Yakovlev, Member, IEEE
Abstract—Dual-rail encoding, return-to-spacer protocol, and hazard-free logic can be used to resist power analysis attacks by making
energy consumed per clock cycle independent of processed data. Standard dual-rail logic uses a protocol with a single spacer, e.g.,
all-zeros, which gives rise to energy balancing problems. We address these problems by incorporating two spacers; the spacers
alternate between adjacent clock cycles. This guarantees that all gates switch in every clock cycle regardless of the transmitted data
values. To generate these dual-rail circuits, an automated tool has been developed. It is capable of converting synchronous netlists
into dual-rail circuits and it is interfaced to industry CAD tools. Dual-rail and single-rail benchmarks based upon the Advanced
Encryption Standard (AES) have been simulated and compared in order to evaluate the method and the tool.
Index Terms—Alternating spacer protocol, cryptography, design automation, dual-rail encoding, hardware security, hazard-free
design, power analysis.

1 INTRODUCTION
SECURE applications, such as smart cards, require measures to resist attacks, e.g., Differential Power Analysis
(DPA) [1], [2]. Dual-rail encoding provides a method to
enhance the security properties of a system, making DPA
more difficult. As an example, in the design described in [3],
the processor can execute special secure instructions. These
instructions are implemented as dual-rail circuits whose
switching activity is meant to be independent from data.
While alternatives exist at the software level to balance
power, hardware solutions are also needed.
Special types of CMOS logic elements have been proposed
in [5], but this low-level approach requires changing gate
libraries and, hence, is costly for a standard cell or FPGA
user. As a solution, using balanced data encoding such as
dual-rail or together with self-timed design techniques has
been proposed in [6], [7].
In recent work [8], a methodology for the design of
secure circuits was proposed. The main advantage of the
method is that it is integrated in a standard design flow.
However, this approach suffers from the following draw-
backs: First, it is difficult to build a dual-rail gate which
consumes the same power regardless of data processed. Even
if such a secure gate is built for one set of parameters (output
load, supply voltage, environment temperature), it still can
expose unbalanced power consumption in other conditions.
Second, the use of positive logic and separation of comple-
mentary rails imply recalculation of inverted inputs of each
gate to the input of the circuit. This may cause a significant (up
to four times) increase in circuit size, for instance, in a tree of
XOR gates. The use of positive logic may also increase the
length of the critical path because additional inverters are
inserted. Finally, the method is only applicable to netlists
built of a limited subset of the library gates.
The clock signal is typically used as a reference in power
analysis techniques. System “desynchronization,” as in [6],
[9], can help hide the clock signal. To mask the operation of
a block of logic is a much more complex task which could
demand very expensive changes to the entire design flow.
A cheaper desynchronization method is to rebuild indivi-
dual blocks within the same synchronous infrastructure so
that their power signatures become independent of the
mode of operation and of the data processed. This method
is used in [9], where synchronous pipelines are transformed
into asynchronous circuits using dual-rail coding. Dual-rail
encoding was also successfully used in [10] to build a secure
Amulet core for smartcard applications.
These desynchronization methods represent a combina-
tion of two aspects of security: hiding the reference signal
and hiding the data being processed. The major leakage of
information about the processed data is due to the data-
dependent power signature. The correlation between the
data and the power signature can be minimized by
balancing and randomizing the data encoding with respect
to the power signature. In this paper, we concentrate on the
balancing of the data encoding only.
Our idea is to replace blocks in existing architectures
dominated by synchronous single-threaded CPU cores and
their slow buses, having no pipelining or concurrency, with
secure and hazard-free (having no glitches due to race
conditions) dual-rail circuits. Using the standard dual-rail
protocol with a single spacer still has certain balancing
problems due to the asymmetry between logic gates within
a dual-rail gate. In this paper, we address and solve these
problems by using a new protocol with two spacers
IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 4, APRIL 2005 449
. The authors are with the School of Electrical, Electronic, and Computer
Engineering, Merz Court, University of Newcastle, Newcastle upon Tyne,
UK, NE1 7RU.
E-mail: {danil.sokolov, j.p.murphy, a.bystrov, alex.yakovlev}@ncl.ac.uk.
Manuscript received 28 May 2004; revised 4 Sept. 2004; accepted 25 Oct.
2004; published online 15 Feb. 2005.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number TC-0182-0504.
0018-9340/05/$20.00 © 2005 IEEE Published by the IEEE Computer Society
alternating in time, leading to all gates switching within
every clock cycle. This is the first contribution of the paper.
The other idea is to stay as close to the standard industry
design flow as possible. Our method is applied via an
automated tool to a clocked single-rail netlist obtained by
standard RTL synthesis tools from a behavioral specifica-
tion (“push-button” approach). Such circuits have an
architecture depicted in Fig. 1a. The result is also a netlist
which can be simulated and passed to the back-end design
tools. Furthermore, all DFT (Design For Testability) features
incorporated at the logic synthesis stage are preserved in
our approach unchanged.
The resultant dual-rail circuit can be built in either of two
well-known architectures (cf. [11], [12]): self-timed dual-rail
or clocked dual-rail, Fig. 1b and Fig. 1c, respectively.
Self-timed dual-rail circuits do not have a clock and their
registers are controlled by a completion signal formed in the
completion detection logic. Being asynchronous, these
circuits should exhibit better throughput, but they suffer
from a significant size overhead due to additional logic
from completion detection.
Clocked dual-rail circuits do not have completion
detection logic and rely on the assumption that the
hazard-free dual-rail combinational logic switches by the
end of the clock period. In our method, this assumption is
easy to meet because the delay characteristics of the dual-
rail circuit are inherited from the single-rail prototype.
While the method and the tool support both dual-rail
architectures, in this paper, we concentrate on the latter one.
The security aspects of the system level, memory elements,
buses, etc., are also outside the focus of the paper. We
are looking at the security of logic circuits only.
The rest of the paper is organized as follows: First, the
theory of applying dual-rail coding to synchronous circuits
using a single spacer and two spacers is described, then the
operation of the tool is discussed. The AES benchmark
results and potential improvements follow and, finally, the
conclusions are presented.
2 METHOD
2.1 Single Spacer Dual-Rail
Dual-rail code uses two rails with only two valid signal
combinations {01, 10}, which encode values 0 and 1,
respectively. Dual-rail code is widely used to represent
data in self-timed circuits [12], [13], where a specific
protocol of switching helps to avoid hazards. The protocol
allows only transitions from all-zeros {00}, which is a
noncode word, to a code word and back to all-zeros, as
shown in Fig. 2a; this means the switching is monotonic.
The all-zeros state is used to indicate the absence of data,
which separates one code word from another. Such a state is
often called a spacer.
An approach for automatically converting single-rail
circuits to dual-rail, using the above signaling protocol, that
is easy to incorporate in the standard RTL-based design
flow has been described in [9]. Within this approach, called
Null-Convention Logic [14], one can follow one of two
major implementation strategies for logic: One is with full
completion detection through the dual-rail signals (NCL-D)
and the other with separate completion detection (NCL-X).
The former one is more conservative with respect to delay
dependence, while the latter one is less delay-insensitive,
but more area and speed efficient. For example, an AND
gate is implemented in NCL-D and NCL-X as shown in
Fig. 2b and Fig. 2c, respectively. NCL methods of circuit
construction exploit the fact that the negation operation in
dual-rail corresponds to swapping the rails. Such dual-rail
circuits do not have negative gates (internal negative gates,
for example, in XOR elements, are also converted into
positive gates), hence they are race-free under any single
transition.
If the design objective is only power balancing (as in our
case), one can abandon the completion detection channels,
relying on timing assumptions, as in standard synchronous
designs, thus saving a considerable amount of area and
power. This approach was followed in [16], considering the
Fig. 1. Design architectures. (a) Single-rail architecture (standard RTL
design). (b) Self-timed dual-rail architecture. (c) Clocked dual-rail
architecture.
Fig. 2. Single spacer dual-rail. (a) Dual-rail protocol. (b) NCL-D AND
gate. (c) NCL-X AND gate. (d) NCL-X AND gate in clocked environment.
circuit in a clocked environment, where such timing
assumptions were deemed quite reasonable to avoid any
hazards in the combinational logic. Hence, in the clocked
environment, the dual-rail logic for an AND gate is simply a
pair of AND and OR gates, as shown in Fig. 2d.
The above implementation techniques certainly help to
balance switching activity at the level of dual-rail nodes.
Assuming that the power consumed by one rail in a pair is
the same as in the other rail, the overall power consumption
is invariant to the data bits propagating through the dual-
rail circuit. However, the physical realization of the rails at
the gate level is not symmetric and experiments with these
dual-rail implementations show that power source current
leaks the data values.
For example, in the structure in Fig. 2d, we compare the
gate switching profiles when computing two different
binary sequences of values c for corresponding input
sequences on a and b. The first input sequence is a = 00,
b = 00, and the second one is a = 11, b = 11. The switching
profile of these sequences at the level of gates is different: In
the first sequence, there are four firings of the OR gate and, in
the second, there are four firings of the AND gate (note that we
counted both spacer → code word and code word → spacer
phases).
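
The firing counts above can be reproduced with a small behavioral model. The following is a minimal sketch, assuming the clocked dual-rail AND gate of Fig. 2d is an AND gate on rail 1 and an OR gate on rail 0; the function names and the (rail1, rail0) tuple encoding are illustrative assumptions, not part of the original design flow.

```python
# Behavioral model of a clocked dual-rail AND gate under the single spacer protocol.
# Assumption: rail c1 is driven by AND(a1, b1) and rail c0 by OR(a0, b0).

def encode(bit):
    """Dual-rail code word as (rail1, rail0): 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)

SPACER = (0, 0)  # all-zeros spacer

def count_firings(a_bits, b_bits):
    """Count output transitions of the AND and OR components."""
    firings = {"AND": 0, "OR": 0}
    and_out, or_out = 0, 0  # gate outputs while the inputs hold the spacer
    for a_bit, b_bit in zip(a_bits, b_bits):
        # Each cycle has a code word phase followed by a return to the spacer.
        for a, b in ((encode(a_bit), encode(b_bit)), (SPACER, SPACER)):
            new_and = a[0] & b[0]  # rail c1
            new_or = a[1] | b[1]   # rail c0
            firings["AND"] += int(new_and != and_out)
            firings["OR"] += int(new_or != or_out)
            and_out, or_out = new_and, new_or
    return firings

print(count_firings([0, 0], [0, 0]))  # {'AND': 0, 'OR': 4}
print(count_firings([1, 1], [1, 1]))  # {'AND': 4, 'OR': 0}
```

The model counts only output transitions; it says nothing about how much energy each firing costs.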
While there could be ways of balancing power con-
sumption between individual gates in dual-rail pairs by
means of modifications at the transistor level, adjusting
loads, and changing transistor sizes, etc., all such measures
are costly. Working with a standard logic library calls for a
more economical solution. We do not consider randomization
techniques in this paper as they can be applied indepen-
dently and possibly in conjunction with our method.
Synchronous flip-flops are built to be power efficient, so,
if they switch to the same value (data input remains the
same within several clocks), then nothing changes at the
output. The absence of the output transition saves power,
but, at the same time, it makes the power consumption data
dependent. In order to avoid this, we make flip-flops
operate in the return-to-spacer protocol, as in Fig. 2a. The
solution in Fig. 3a uses the master-slave scheme: writing to
the master is controlled by the positive edge of the clock
and writing to the slave is controlled by the negative edge.
At the same time, the high value of the clock enforces slave
outputs into zero (output spacer as in Fig. 2a) and the low
clock value enforces master outputs into one (a similar
spacer for the logic with active zero).
This circuit operates as explained in Fig. 3b. Both the
master and slave latches have their respective reset and
enable inputs (active zero for the master). The delay
between removing the reset signal and disabling writing
for each latch (hold time) is formed by the couple of buffers
in the clock circuit. Buffers between master and slave are
needed to delay m_code-set value until s_En-. The advantage
of this implementation is the use of a single cross-coupled
latch in each stage for a couple of input data signals.
2.2 Dual Spacer Dual-Rail
In order to balance the power signature, we propose using
two spacers [15] (i.e., two spacer states, {00} for all-zeros
spacer and {11} for all-ones spacer), resulting in a dual spacer
protocol as shown in Fig. 4. It defines the switching as
follows: spacer → code word → spacer → code word. The
polarity of the spacer can be arbitrary and possibly random, as
in Fig. 4a. A possible refinement for this protocol is the
alternating spacer protocol shown in Fig. 4b. The advantage of
the latter is that all bits are switched in each cycle of
operation, thus opening a possibility for perfect energy
balancing between cycles of operation.
As opposed to single spacer dual-rail, where, in each
cycle, a particular rail is switched up and down (i.e., the
same gate always switches), in the alternating spacer
protocol, both rails are switched from all-zeros spacer to all-
ones spacer and back. The intermediate states in this
switching are code words. In the scope of the entire logic
circuit, this means that, for every computation cycle, we
always fire all gates forming the dual-rail pairs. This makes
the circuit more resistant to DPA (see Section 3).
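
The difference between the two protocols can be illustrated at the level of a single dual-rail signal. The sketch below assumes a (rail1, rail0) encoding and counts how often each rail switches per cycle, where a cycle is taken as a code word followed by the next spacer; the function names are hypothetical.

```python
def encode(bit):
    """Dual-rail code word as (rail1, rail0): 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if bit else (0, 1)

def rail_transitions(bits, alternating):
    """Return a list of per-cycle (rail1, rail0) transition counts.

    The signal starts in the all-zeros spacer. With alternating=True the
    spacer after each code word alternates between (1, 1) and (0, 0).
    """
    state = (0, 0)
    next_spacer = (1, 1) if alternating else (0, 0)
    per_cycle = []
    for bit in bits:
        counts = [0, 0]
        for target in (encode(bit), next_spacer):
            for i in range(2):
                counts[i] += int(state[i] != target[i])
            state = target
        per_cycle.append(tuple(counts))
        if alternating:
            next_spacer = (0, 0) if next_spacer == (1, 1) else (1, 1)
    return per_cycle

print(rail_transitions([0, 1, 1], alternating=False))  # [(0, 2), (2, 0), (2, 0)]
print(rail_transitions([0, 1, 1], alternating=True))   # [(1, 1), (1, 1), (1, 1)]
```

In this model the single spacer protocol switches one rail twice per cycle, chosen by the data, while the alternating spacer protocol switches each rail exactly once per cycle regardless of the data.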
The new alternating spacer discipline cannot be directly
applied to the implementation techniques shown in Fig. 2b
and Fig. 2c. Those, both in the logic rails as well as in
completion detection, assume the fact that, for each pair of
rails, the {11} combination never occurs. In fact, the use of
all-ones spacer would upset the speed-independent imple-
mentation in Fig. 2b because the outputs of the second layer
elements would not be acknowledged during code word →
all-ones spacer transition. The completion detection for
those gates can, of course, be ensured by using an
Fig. 3. Single spacer dual-rail flip-flop. (a) Schematic. (b) Signal
transition graph.
Fig. 4. Dual spacer dual-rail protocol. (a) Arbitrary order of spacers.
(b) Alternating spacers.
additional three-input C-element, but this extra overhead
would make this implementation technique much less
elegant because of the additional acknowledgment signal
channel. In the single spacer structure, due to the principle
of orthogonality (one-hot) between min-terms a0·b0, a1·b0,
and a0·b1, only one C-element in the rail c0 fires per cycle.
If some parts of a dual-rail circuit operate using the
single spacer and other parts the alternating spacer
protocol, then spacer converters should be used. The
alternating-to-single spacer converter shown in Fig. 5a is
transparent to code words and enforces all-zeros spacer on the
output if the input is all-ones or all-zeros.
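
The stated behavior of the alternating-to-single spacer converter can be captured by a simple Boolean model. This is only a behavioral sketch: the gate-level structure of Fig. 5a is not reproduced here, and the Boolean form below is an assumption chosen to match the description (transparent to code words, all-zeros output for either spacer).

```python
def alternating_to_single(rail1, rail0):
    """Behavioral model: pass code words, map both spacers to (0, 0)."""
    out1 = rail1 & (1 - rail0)  # high only for code word (1, 0)
    out0 = rail0 & (1 - rail1)  # high only for code word (0, 1)
    return out1, out0

for word in [(0, 0), (1, 1), (1, 0), (0, 1)]:
    print(word, "->", alternating_to_single(*word))
```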
The implementation of a single-to-alternating spacer
converter, Fig. 5b, uses a toggle to decide which spacer to
inject: all-ones or all-zeros. The toggle can be constructed out
of two latches, as shown in Fig. 5c. It operates in the
following way:
x+ → x1+ → x− → x2+ → x+ → x1− → x− → x2−,
i.e., x1 changes on the positive edge of x and x2 switches on
its negative edge. The frequency of x1 and x2 is half the
frequency of x.
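
The toggle sequence can be checked with a small event-level model. The sketch below assumes a two-latch arrangement in which x1 takes the inverse of x2 on the rising edge of x and x2 copies x1 on the falling edge; this latch wiring is an assumption made for illustration and may differ from the circuit of Fig. 5c.

```python
def toggle_events(n_cycles):
    """List the signal events produced over n_cycles periods of x."""
    x1, x2 = 0, 0
    events = []
    for _ in range(n_cycles):
        events.append("x+")           # rising edge of x
        new_x1 = 1 - x2               # first latch samples the inverse of x2
        if new_x1 != x1:
            x1 = new_x1
            events.append("x1+" if x1 else "x1-")
        events.append("x-")           # falling edge of x
        if x2 != x1:                  # second latch copies x1
            x2 = x1
            events.append("x2+" if x2 else "x2-")
    return events

print(" -> ".join(toggle_events(2)))
# x+ -> x1+ -> x- -> x2+ -> x+ -> x1- -> x- -> x2-
```

Each of x1 and x2 completes one period for every two periods of x, which is the halving of frequency described above.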
The alternation of spacers in time is enforced by flip-
flops. The alternating spacer flip-flop can be built by
combining a single spacer dual-rail flip-flop with a single
spacer to alternating spacer converter. The power con-
sumption of the single spacer dual-rail flip-flop is data
independent due to the symmetry of its rails. The rails of
the spacer converter are also symmetric, which makes the
power consumption of the resultant alternating spacer flip-
flop data independent. The optimized version of such a
flip-flop (the toggle is moved outside) is depicted in Fig. 6.
This implementation uses the clk2 signal to decide which
spacer to inject on the positive phase of clk. The signal clk2
changes on the negative edge of the clock and is formed by
a toggle (one for the whole circuit) whose input is clk. The
timing assumption for clk2 is that it changes after the output
of single spacer flip-flop. Both the slave latch of the single
spacer flip-flop and the toggle which generates the clk2
signal are triggered by the negative edge of clk. The depth
of logic in the toggle is greater than in the slave latch of the
flip-flop. At the same time, clk2 goes to all flip-flops of the
circuit and requires buffering, which also delays it. This
justifies our timing assumption.
It should be mentioned that the inputs of the dual-rail
circuit must also support the alternating spacer protocol.
Moreover, the same spacer should appear each cycle on the
inputs of a dual-rail gate. That means the spacer protocol on
the circuit inputs and flip-flop outputs must be synchro-
nized in the reset phase.
2.3 Negative Gate Optimization
In CMOS, a positive gate is usually constructed out of a
negative gate and an inverter. That is why the total area
overhead in dual-rail logic is more than twofold compared
to single-rail. The use of positive gates is not only a
disadvantage for the size of a dual-rail circuit, but also for
the length of the critical path. Our method for negative gate
optimization [16] is described in this section.
If the all-zeros spacer of the dual-rail code is applied to a
layer of negative gates (NAND, NOR, AND-NOR, OR-
NAND), then the output will be all-ones spacer. The opposite
is also true: all-ones spacer is converted into all-zeros spacer.
The polarity of signals within code words remains the same if
the output rails are swapped.
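
The effect of a negative gate layer on spacers and code words can be verified on a single gate. The sketch below models a dual-rail AND gate built from a NAND and a NOR with swapped output rails; the (rail1, rail0) encoding and the function names are illustrative assumptions.

```python
def nand(*xs):
    return 1 - int(all(xs))

def nor(*xs):
    return 1 - int(any(xs))

def dual_rail_and_negative(a, b):
    """Dual-rail AND from negative gates; a and b are (rail1, rail0).

    The positive version is c1 = AND(a1, b1), c0 = OR(a0, b0). Here both
    gates are inverted and their outputs are swapped between the rails.
    """
    c1 = nor(a[1], b[1])   # inverted OR of the zero rails drives rail 1
    c0 = nand(a[0], b[0])  # inverted AND of the one rails drives rail 0
    return (c1, c0)

ONE, ZERO = (1, 0), (0, 1)
print(dual_rail_and_negative((0, 0), (0, 0)))  # (1, 1): all-zeros becomes all-ones
print(dual_rail_and_negative((1, 1), (1, 1)))  # (0, 0): all-ones becomes all-zeros
print(dual_rail_and_negative(ONE, ONE))        # (1, 0): code word for 1
print(dual_rail_and_negative(ONE, ZERO))       # (0, 1): code word for 0
```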
The spacer alternation between odd and even layers of
combinational logic can be used for negative gate optimization
of dual-rail circuits. The optimized circuit uses either all-
ones spacer or all-zeros spacer in different stages (the spacer
changes between the layers of logic) as captured in Fig. 7.
In order to optimize a dual-rail circuit for negative gates,
the following transformations should be applied: First, all
Fig. 5. Dual-rail converters. (a) Alternating-to-single spacer converter.
(b) Single-to-alternating spacer converter. (c) Toggle.
Fig. 6. Alternating spacer dual-rail flip-flop.
Fig. 7. Spacer polarity after logic optimization.
gates of positive dual-rail logic are replaced by negative
gates. Then, the output rails of those gates are swapped.
Finally, spacer polarity converters are placed at the wires that
connect the layers of logic of the same parity (odd-to-odd or
even-to-even).
Consider negative gate optimization using a simple
example shown in Fig. 8a. Dotted lines in the single-rail
circuit indicate signals which will be mapped into the dual-
rail with the all-ones spacer. The bar on the wire is the
location of a spacer polarity converter. The circuit in Fig. 8b
is obtained by replacing gates by their dual-rail versions.
These gates are built from traditional positive dual-rail
gates by adding signal inversion to their outputs and
swapping the output rails (the latter is needed to preserve
the polarity of signals in the output code words). The
operation of negation is implemented by a rail swapping
and does not require any logic gates. The spacer polarity
converter is implemented as a pair of inverters having their
outputs crossed in order to preserve the polarity of signals
in the output code words.
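
A short behavioral check of the spacer polarity converter, modeled as two inverters with crossed outputs. The function name and the (rail1, rail0) encoding are assumptions for illustration.

```python
def spacer_polarity_converter(rail1, rail0):
    """Two inverters with crossed outputs: each output rail is the
    inverse of the opposite input rail."""
    return (1 - rail0, 1 - rail1)

for word in [(0, 0), (1, 1), (1, 0), (0, 1)]:
    print(word, "->", spacer_polarity_converter(*word))
```

Spacers change polarity while code words pass through unchanged, which is why the outputs must be crossed.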
It is possible to combine such an optimization with the
alternation of spacers in time, Fig. 9.
3 GATE-LEVEL SECURITY: ENERGY ASPECTS
In this section, the security characteristics of logic gates
(single and dual-rail) are studied. All diagrams in this
section are the result of SPICE analog simulations using the
AMS-0.35 design kit.
Dual-rail encoding is known to improve resistance to
DPA attacks. If the information leakage is reduced to the
noise level, then filtering techniques have to be applied to extract it.
Filtering usually implies integration over time. In this
section, we analyze energy consumption as such an
integral measure.
We introduce two security characteristics: energy imbal-
ance and exposure time. By energy imbalance, we mean the
variation in energy consumption when processing different
data values. Exposure time is the time during which the
energy imbalance is exhibited.
3.1 Energy Imbalance
Energy imbalance (further referred to as imbalance) can be
measured as the variation in energy consumed by a circuit
processing different data. If e1 and e2 are the energy
consumptions of two input patterns, then the numerical
value of imbalance is calculated as:
$$d = \frac{|e_1 - e_2|}{e_1 + e_2} \cdot 100\%. \tag{1}$$
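
As a worked instance of (1), the imbalance can be computed directly from two energy values. The function name is hypothetical and the sample energies below are arbitrary illustrative numbers, not measurements from the experiments.

```python
def energy_imbalance(e1, e2):
    """Relative energy imbalance in percent, following (1)."""
    return abs(e1 - e2) / (e1 + e2) * 100.0

print(energy_imbalance(3.0, 1.0))  # 50.0
print(energy_imbalance(2.0, 2.0))  # 0.0
```

The measure is symmetric in its arguments and equals zero only when both input patterns consume the same energy.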
Single-rail circuits exhibit significant imbalance and
exposure time. The imbalance in a single-rail circuit is
mainly caused by the data-dependent quantity of switching
events.
In a dual-rail circuit with return-to-spacer protocol, the
number of switching events is constant for every clock
cycle. This reduces the data-dependency of power con-
sumption and this is verified in simulation results below.
However, the imbalance is not eliminated completely. It still
takes place due to the different power consumption of
complementary gates which form a dual-rail gate. For
example, the power signature of a 2-input dual-rail AND
gate (as in Fig. 2d) switching from all-zeros spacer to code zero
(the OR component is switching) is different from the same
gate switching from all-zeros spacer to code one (the AND
component is switching).
An experiment has been conducted in order to determine
the worst-case imbalance in a dual-rail gate. For this, we
chose a 3-input dual-rail NAND gate. Such a gate consists
of one standard 3-input NAND gate and one standard
3-input NOR gate. These gates have the maximum
difference between the number of transistor levels in their
pull-up and pull-down stacks. Four-input gates are not
considered as they may not be implementable in future low-voltage
technologies. The current consumption of a gate
consists of three components: the input generator current,
the gate current, and the output load current. In the
experiment, we determined the gate current, which is the
source of the imbalance, by subtracting the input generator
current from the overall current and removing the load of
the gate. For this, the same benchmark was simulated twice,
under VCC = 0 V and VCC = 3.3 V. The waveform of the
Fig. 8. Constructing negative gate dual-rail circuit. (a) Single-rail circuit. (b) Negative gate dual-rail circuit.
Fig. 9. Alternating spacer after negative logic optimization.
power supply current under VCC = 0 V was subtracted from
the waveform under VCC = 3.3 V. A single positive 1 ns
pulse with rise and fall times of 150 ps was applied to all the
inputs of each gate.
The gate currents are shown in Fig. 10. The imbalance is
obtained by comparison of the energy consumed during
switching. The energy waveforms in Fig. 10 are the
integrated current starting from time 0.
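
The post-processing described above (subtracting the unpowered waveform, then integrating from time 0) can be sketched as follows. This is a minimal illustration using trapezoidal integration on sampled waveforms; the function names and the sample data are assumptions, and the actual results in the paper come from SPICE simulation. Multiplying the integrated current by the constant supply voltage converts charge into energy.

```python
def gate_current(i_powered, i_unpowered):
    """Subtract the waveform simulated at VCC = 0 V from the powered one."""
    return [p - u for p, u in zip(i_powered, i_unpowered)]

def cumulative_integral(times, values):
    """Running trapezoidal integral of values over times, starting at 0."""
    total = 0.0
    result = [0.0]
    for k in range(1, len(times)):
        total += 0.5 * (values[k] + values[k - 1]) * (times[k] - times[k - 1])
        result.append(total)
    return result

def energy_waveform(times, currents, vcc=3.3):
    """Energy drawn from a constant supply: vcc times the integrated current."""
    return [vcc * q for q in cumulative_integral(times, currents)]

# Illustrative samples only (arbitrary units).
t = [0.0, 1.0, 2.0, 3.0]
i_gate = gate_current([3.0, 3.0, 3.0, 3.0], [1.0, 1.0, 1.0, 1.0])
print(cumulative_integral(t, i_gate))  # [0.0, 2.0, 4.0, 6.0]
```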
The full cycle energy imbalance calculated by (1) is
10.7 percent. The energy imbalance during the falling
transitions of the gates is 10.4 percent and, during the
rising transitions, it is 11.1 percent.
The imbalance (being a relative value) becomes smaller if
an identical output load is connected to both NAND and
NOR gates. Fig. 11 shows the gate currents when each gate
output load is simulated as a pair of capacitors connected to
the ground and VCC; each capacitor is 0.016 pF (equivalent to
four inverter inputs). The full cycle energy imbalance value
in this experiment is 2.1 percent. The energy imbalance
during the falling transitions of the gates is 4.8 percent and,
during the rising transitions, it is 1.2 percent.
In order to show that the 3-input NAND and NOR gates
exhibit the worst-case imbalance among simple comple-
mentary gates, the same set of experiments was also
conducted for 2-input NAND and NOR gates. The full
cycle energy imbalance in this experiment was 8.4 percent
for nonloaded gates and 1.3 percent for the gates loaded
with 0.0032 pF capacitors.
The experiments have shown that the worst-case im-
balance in a dual-rail circuit is 10.7 percent. This imbalance
only occurs if the gates are not loaded. The worst-case
imbalance under a realistic load is 2.1 percent. Further
imbalance reduction is possible by either modifying the
circuit so that it uses more symmetrical gates or by
modifying the gates at the transistor level.
3.2 Exposure Time
The longer the imbalance is visible, the easier it is to
measure. This is why the exposure time of the imbalance
should be minimized alongside the imbalance reduction. In
a dual-rail circuit, the exposure time depends on the spacer
protocol. We have evaluated the lower and upper bounds
for the exposure time on the single spacer and the
alternating spacer protocols. The clock cycle is used as a
measure of the exposure time.
In a dual-rail circuit using the single spacer protocol, the
lower bound of exposure time is one clock cycle and the
upper bound is the whole time the circuit operates. These
bounds can be derived from the analysis of a dual-rail gate
operation. The imbalance in a dual-rail gate is caused by
switching one of the components of a dual-rail gate. It is
visible until the other complementary single-rail gate
switches up and down. The lower bound is hit if the
switching of the first single-rail gate is delayed as long as
possible (until the end of the first half of the clock cycle) and
the other single-rail gate switches as early as possible (in the
beginning of the second half of the next clock cycle). In this
case, the exposure time is equal to 0.5 clock cycle. If the
combinational logic delay is small compared to the clock
period, then the lower bound becomes one clock cycle.
Fig. 10. Power signature of nonloaded gates. (a) 3-input NAND gate. (b) 3-input NOR gate.
Fig. 11. Power signature of loaded gates. (a) 3-input NAND gate. (b) 3-input NOR gate.
The upper bound depends upon data. If the gate output
switches between alternative code words in each cycle
(going through the spacer each time), then the upper bound
is 1.5 clock cycles (or one clock cycle under the assumption
of combinational logic delay being small). In this case,
however, the entropy of information at such an output is
zero. In order to make the output more informative, one can
somehow implement (this is not supported by any
industrial or, to the best of our knowledge, by any academic
tools) the Manchester serial code at that output. Then, the
upper bound will be 1.5-2 clock cycles (the second value is
for the small combinational logic delay). Finally, if no ad hoc
provisions are made in order to control the sequence of
switching, the upper bound becomes undefined. Our
benchmarks have shown that the average case of exposure
time can be about a dozen clock cycles.
The exposure time can be reduced by applying the
alternating spacer protocol. For this protocol, the exposure
time lower boundary is 0 (actually, one gate delay) and the
upper boundary is one clock cycle. Consider a dual-rail gate
operating in alternating spacer protocol. In the first half of the
clock cycle, one component of the dual-rail gate fires,
introducing a data-dependent imbalance. This imbalance is
exposed until the second half of the clock cycle when the other
complementary component fires, leading the energy con-
sumption to a data-independent constant value. If the first
single-rail gate fires just before the positive edge of the clock
and the complementary gate fires just after the positive edge
of the clock, then the lower boundary of the exposure time is
achieved. The upper boundary is reached if one single-rail
gate fires in the very beginning of the clock cycle and the other
gate fires in the very end of the same clock cycle. Under a
relatively slow clock, the exposure time is about half the clock
cycle and gets shorter under a faster clock.
The following experiment was performed to show the
influence of the spacer protocol on the exposure time. Single
spacer and alternating spacer protocols were applied to a
2-input dual-rail AND gate for one clock cycle. In each
protocol, two different code words were applied to the
inputs of the gate: both logical zeros and both logical ones.
The obtained energy waveforms are shown in Fig. 12, the
solid line for both logical zeros and the dotted line for both
logical ones code words. The experiment shows that, in the
single spacer protocol, the energy imbalance is not
compensated for in the current clock cycle, but can be
potentially compensated for in one of the following clock
cycles. In the alternating spacer protocol, the imbalance is
exposed only between the adjacent spacers, which reduces
the exposure time to less than one clock cycle.
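
The compensation effect can be illustrated with a toy energy model of the 2-input dual-rail AND gate. The sketch assumes the gate consists of an AND component on rail 1 and an OR component on rail 0, and assigns each component a fixed, hypothetical energy per output transition; real rising and falling transitions differ in energy, so this is only a qualitative illustration of the experiment, not a reproduction of Fig. 12.

```python
# Hypothetical energy per output transition, in arbitrary units (not measured values).
E_AND, E_OR = 10, 11

def cycle_energy(bit, end_spacer):
    """Cumulative energy after each half of one clock cycle for a = b = bit.

    The cycle starts from the all-zeros spacer, applies the code word, then
    applies end_spacer: (0, 0) for single spacer, (1, 1) for alternating.
    """
    code = (1, 0) if bit else (0, 1)  # (rail1, rail0)
    and_out, or_out = 0, 0            # gate outputs under the all-zeros spacer
    energy, trace = 0, []
    for a in (code, end_spacer):
        b = a                         # both inputs carry the same value
        new_and = a[0] & b[0]         # AND component drives rail c1
        new_or = a[1] | b[1]          # OR component drives rail c0
        if new_and != and_out:
            energy += E_AND
        if new_or != or_out:
            energy += E_OR
        and_out, or_out = new_and, new_or
        trace.append(energy)
    return trace

print(cycle_energy(0, (0, 0)), cycle_energy(1, (0, 0)))  # [11, 22] [10, 20]
print(cycle_energy(0, (1, 1)), cycle_energy(1, (1, 1)))  # [11, 21] [10, 21]
```

In this model the single spacer traces still differ at the end of the cycle, while the alternating spacer traces differ only after the first half and coincide once the second spacer is reached.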
3.3 Early Propagation and Memory Effect
Other security-related characteristics of a circuit are early
propagation [4] and memory effect [5], [4]. These character-
istics have much less impact on the security features of a
circuit than imbalance and exposure time. So, it is essential
to minimize the circuit imbalance and exposure time before
optimizing the circuit for the early propagation and
memory effect metrics.
The early propagation is the ability of a gate to fire
without waiting for all its inputs. Early propagation causes
the data-dependent distribution of circuit switching events
in time. The effect of early propagation is bounded by half
of the clock cycle. One way to avoid the early propagation is
to balance all paths by inserting buffers in such a way that
all inputs of each gate arrive simultaneously. In a dual-rail
circuit, NCL-D gates can be used in order to restrict the
early propagation effect to limited areas only.
The memory effect is the ability of a CMOS gate to
remember its previous state. It is shown by an example of a
2-input NOR gate simulated under two input sequences:
a = 00100, b = 01110 and a = 01100, b = 00110, see Fig. 13a,
Fig. 13b, respectively. The sequences vary in the second bit
only. However, the power signature shows a noticeable
difference in the fourth bit (marked with dotted circles).
This can be explained by the parasitic capacitor between
p-transistors which charges differently when processing the
second bit of the input sequences. The capacitor voltage is
shown at the bottom of the diagrams. A possible solution is
to modify the gates so that their parasitic
capacitors are charged independently of the input data.
For example, a 2-input NOR gate can be implemented with
two stacks of p-transistors controlled by the input signals in
different orders (i.e., <ab> and <ba>).
4 TOOL DESCRIPTION
The described procedure for converting a single-rail circuit into a dual-rail
circuit has been implemented as a software tool named
the “Verimap design kit.” It successfully interfaces with the
Cadence CAD tools. It takes as input a structural Verilog
netlist file, created by Cadence Ambit (or another logic
synthesis tool), and converts it into dual-rail netlist. The
resulting netlist can then be processed by Cadence or other
EDA tools.
The structure of our Verimap design kit is displayed in
Fig. 14. The main parts are the tool itself and two libraries.
The library of gate prototypes contains the description of gates
used in the input netlist. It facilitates the structural analysis
of the input netlist. The library of transformation rules defines
Fig. 12. Exposure time for dual-rail 2-input AND gate. (a) Single spacer protocol. (b) Alternating spacer protocol.
complementary gates needed for construction of the dual-
rail logic, the polarity of gate inputs and outputs, and
specifies if the corresponding dual-rail gate requires
completion signal (for asynchronous design only) and if it
inverts the spacer. If a predefined dual-rail implementation
of a gate is found in the library, the tool uses it; otherwise,
an implementation is built automatically using the rules.
The main function of the tool is the conversion of a single-rail
RTL netlist into a dual-rail netlist of either of two
architectures: self-timed or clocked (Fig. 1b and Fig. 1c, respectively). It
is done in four stages. First, a single-rail circuit is converted
into positive logic dual-rail. Second, the positive dual-rail
gates are replaced by negative dual-rail gates and the spacer
polarity inverters are inserted. Then, the completion signal
is generated (asynchronous design only). Finally, a wrapper
module connecting the dual-rail circuit to the single-rail
environment is added (optional).
Apart from generating netlists, the Verimap tool reports
statistics for the original and resultant circuits: estimated
area of combinational logic and flip-flops, number of
negative gates and transistors, number of wires.
The tool also generates a behavioral Verilog file that assists
the power analysis of the original and resultant circuits.
When included in a simulation testbench, this Verilog file
counts the number of switching events on each wire of the
circuits.
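The counting itself is simple, as the following Python sketch suggests. The tool does this in behavioral Verilog inside the testbench; the trace format used here (a list of samples, each mapping a wire name to 0 or 1) is a hypothetical stand-in chosen only to show the idea.

```python
from collections import Counter


def count_switching(trace):
    """Count value changes per wire between consecutive samples.

    trace is a list of dicts mapping wire name -> 0/1 (assumed format);
    every sample is expected to contain the same set of wires.
    """
    toggles = Counter()
    for prev, curr in zip(trace, trace[1:]):
        for wire, value in curr.items():
            if prev[wire] != value:
                toggles[wire] += 1
    return toggles


# Illustrative three-sample trace, not measured data
trace = [
    {"a": 0, "b": 0},
    {"a": 1, "b": 0},
    {"a": 0, "b": 1},
]
per_wire = count_switching(trace)
total = sum(per_wire.values())
```

Summing the per-wire counts over one clock cycle gives the switching activity figure used in the benchmarks.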
5 BENCHMARK RESULTS AND FUTURE
IMPROVEMENTS
This section summarizes the experiments performed to
characterize the proposed method in terms of security, size,
and power consumption. Two AES [17], [18] designs were
used: Open core AES [19] and AES with computable Sbox [20],
[21]. For each design, a single-rail AES circuit was
synthesized from RTL specification by using the Cadence
Ambit v4.0 tool and AMS-0.35 library. Our Verimap tool
was applied to the netlist generated by Ambit and the dual-
rail netlist was produced. The dual-rail circuits were
optimized for negative gates and used alternating spacer
dual-rail protocol. Both single-rail and dual-rail designs
were analyzed for static delays (SDF delay annotation) and
simulated in Verilog-XL v3.10. By keeping to the RTL design
flow, the netlists can be directly used in the back-end design
tools of Cadence.
The statistics for the parts of AES, namely, ciphers and
Sboxes, are shown in Table 1, Table 2, and Table 3.
The purpose of the first experiment was to evaluate the
correlation between data and switching activity of the
circuits. Switching activity is the number of switching
events in the circuit within one clock cycle. Table 1 presents
the minimum, average, and maximum switching activity
for the Sboxes and ciphers. These values were obtained by
simulating the circuits with a number of input vectors. In
single-rail Sbox, a transition is determined by a pair of input
vectors. The Cartesian product of previous/next 8-bit input
vectors includes 256 × 256 = 65,536 possible combinations.
For simulation of Sboxes, 10,000 random pairs of these
vectors were chosen. The ciphers were simulated with the
284 vectors supplied with the Open core AES testbench.
Note that, for dual-rail circuits, the switchings of single-rail
wires (e.g., reset, clock, and the signal which determines the
injecting spacer) were also taken into account.
The experiment shows a significant difference between
the min/average/max switching activity values for the
single-rail Sbox benchmarks. The minimum value is zero,
Fig. 13. Memory effect in NOR gate. (a) a,b: 0,0->0,1->1,1->0,1->0,0. (b) a,b: 0,0->1,0->1,1->0,1->0,0.
Fig. 14. Verimap design kit.
and the maximum values are up to 48 percent higher than
the average values. At the same time, switching activity for
the dual-rail circuits is constant. In the single-rail circuits, switching
activity varies significantly depending on data and, clearly,
they exhibit zero switching activity if the input data does
not change. In addition, many switching events in single-
rail circuits are caused by hazards and the single-rail Sbox
benchmarks are no exception. Here, the hazards caused up
to 80 percent of data-dependent switching events. The
number of switching events in dual-rail combinational logic
is constant for any input data and is equal to the number of
wires (as every second wire switches twice).
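This constancy can be checked on a toy example. The Python sketch below models one hazard-free dual-rail AND gate (six wires: inputs a, b and output z, each as a rail pair) and counts the transitions in one cycle that starts from the all-zeros spacer and returns either to the same spacer or to the all-ones spacer. The encoding and the function names are assumptions for illustration only.

```python
from itertools import product

ALL_ZEROS, ALL_ONES = (0, 0), (1, 1)   # the two spacers


def encode(bit):
    """Dual-rail code word as (rail_1, rail_0)."""
    return (1, 0) if bit else (0, 1)


def dr_and(a, b):
    """Positive-logic dual-rail AND: AND on the 1-rails, OR on the 0-rails."""
    return (a[0] & b[0], a[1] | b[1])


def wires(a, b):
    """State of all six wires: inputs a, b and output z."""
    return a + b + dr_and(a, b)


def toggles(before, after):
    return sum(x != y for x, y in zip(before, after))


def events_per_cycle(a_bit, b_bit, next_spacer=ALL_ZEROS):
    """Transitions for spacer -> code word -> next spacer (no hazards assumed)."""
    start = wires(ALL_ZEROS, ALL_ZEROS)
    data = wires(encode(a_bit), encode(b_bit))
    end = wires(next_spacer, next_spacer)
    return toggles(start, data) + toggles(data, end)


counts = {
    events_per_cycle(a, b, s)
    for a, b in product((0, 1), repeat=2)
    for s in (ALL_ZEROS, ALL_ONES)
}
```

In this model every input combination yields six events per cycle, one per wire on average. With a single spacer, one rail of each pair rises and falls; when the cycle ends in the other spacer, each rail switches exactly once.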
The single and dual-rail implementations of the
AES ciphers were simulated to enable comparison with
Sbox results. Switching activity in the Open core dual-rail
cipher is 351 percent higher than in the single-rail cipher
and 255 percent higher for the AES design with computed
Sboxes. These values are greater than the results for their
corresponding combinational logic Sboxes. The bigger
difference can be explained by the nature of computations
in complex circuits. They execute in bursts, which are
defined by the algorithm. Under a burst, the switching is
similar to our experiments with combinational circuits.
However, between the bursts, the situation is significantly
different: A single-rail circuit is inactive and a dual-rail
circuit continues to “burn power” by switching between
code words and spacers.
A possible way to address this issue is to implement clock
gating. This, however, should be different from the
conventional clock gating technique. It is important to
make it data-independent. At this stage, we do not see a
feasible way of implementing this at the netlist level. Most
likely it will require analysis of behavioral specifications.
We view this idea as a subject of future work.
In order to compare the security features of single
spacer and alternating spacer circuits, the AES design
with computable Sboxes was also converted into single
spacer dual-rail. Both single spacer and alternating spacer
dual-rail implementations were simulated with 284 input
vectors from the standard AES testbench in the encryp-
tion and decryption modes. The switching activities of
“1” and “0” rails were recorded separately. Table 2 shows
the worst case difference in switching activity between
“1” and “0” rails. The imbalance between the number of
switching events in the rail_1 and rail_0 is calculated as
imbalance ¼ rail 1rail 0j jrail 1þrail 0  100%. While the total switching
activity is the same in both implementations, the single
spacer implementation exhibits significant differences in the
number of switching events on the complementary rails. As
the complementary gates within a dual-rail gate have
different power consumptions, the power signature of the
single spacer dual-rail circuit becomes dependent on the
processed data. Alternating spacer dual-rail circuits do not
suffer from this leakage because all gates are switching in
every clock cycle.
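As a small worked example, the imbalance metric (the absolute difference between the switching counts of the two rails, divided by their sum, in percent) can be written as the Python function below. The function name and the counts passed to it are hypothetical and are not values from Table 2.

```python
def imbalance(rail_1, rail_0):
    """Imbalance in percent between switching counts on the two rails."""
    total = rail_1 + rail_0
    if total == 0:
        return 0.0  # no switching at all: treat as balanced
    return 100.0 * abs(rail_1 - rail_0) / total


# Hypothetical counts, for illustration only
balanced = imbalance(50, 50)   # both rails switch equally often
skewed = imbalance(60, 40)     # the "1" rail switches more often
```

A value of zero means that the complementary rails switch equally often, so differences in the power consumption of the complementary gates cannot show up as a data-dependent signature.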
The cost of improved security features is the increase in
the number of gates, wires and area, see Table 3.
The benchmarks indicate only 84-88 percent overhead in
gate numbers (a positive gate is counted as a pair of a
negative gate and an inverter) for AES design with
computable Sboxes. This is less than 100 percent due to
TABLE 1
Switching Activity
TABLE 2
Switching Activity in Dual-Rail Rails
TABLE 3
Circuit Size
the negative gate optimization. For Open core design, the
overhead is more than 100 percent due to the structure of its
Sbox module. During the negative logic optimization of
Open core Sbox, more inverters were inserted into the
noncritical path (as components of spacer inverters) than were
removed from the critical path.
The number of wires is increased by 117-145 percent. Wires
are duplicated in a dual-rail circuit and then spacer
converters are added, further increasing the number of wires.
The estimated area of the benchmarks' combinational
logic indicates a 102-127 percent overhead. A significant
area increase for flip-flops (228-289 percent) can be
explained by using dual-rail flip-flops constructed out of
standard logic gates. This can be improved by transistor
level optimization of the flip-flops.
Fig. 15 visualizes the security improvement for the AES
block. These diagrams have been generated from the AES
design versions with computed Sboxes: in single-rail and in
dual-rail using the alternating spacer protocol. As the Open
core AES design yielded similar plots, they are not shown.
Fig. 15a clearly shows the initial operation of the circuit and
the AES computation phases. The first peaks reveal the data
being clocked into the circuit, the middle peaks show the
iterative rounds being performed, and the last peaks show
the data being clocked out. In Fig. 15b, on the other hand,
the operation is now masked: the “clocking in and
out” and AES computation rounds are indistinguishable
from one another. The repetitive peaks correspond to the
spacer and data alternation.
The diagrams were generated using Synopsys Nanosim
mixed-signal simulation software, which permits
mixed-signal simulation up to 100x faster than pure
SPICE simulation. The single-rail and dual-rail AES
implementations were simulated using the same randomly
chosen key and input data. Simulations were then
repeated with different keys and input data to ensure
fairness; each simulation produced similar plots for the
dual-rail implementation, which confirms
the improvement in security.
The security improvement in combinational logic blocks
is illustrated in the example given in Fig. 16. The Open core
Sbox was simulated under 16 random data values. The
diagram in Fig. 16a shows the power signature in the single-
rail implementation and the four diagrams in Fig. 16b show
the power consumption of the dual-rail implementation
with alternating spacer. From these diagrams, one can see a
significant security improvement. One can also observe
early propagation effects, which will be addressed in our
future work.
It is clear that, in the AES designs, there are opportunities
to minimize power consumption as not all logic is necessarily
being used all the time. Industry synthesis tools can identify
sleep mode logic and use this information to annotate places
in the netlist which could be committed to sleep mode logic
later in the design flow. This low power optimization could be
utilized in our dual rail circuitry; one approach would be to
put a spacer on the input to the identified sleep mode logic
and hold this there for the clock cycles while it is not used. By
Fig. 15. Power signature for AES design with computed core Sboxes. (a) Single-rail implementation. (b) Dual-rail implementation with alternating spacer protocol.
Fig. 16. Power signature for Open Core Sbox. (a) Single-rail implementation. (b) Dual-rail implementation.
doing so, the switching is now zero, thus saving power. This
technique would not reveal data as the sleep mode logic is in a
“meaningless” spacer state. By using the synthesis tool to
identify the sleep mode logic, we are adhering to the RTL
design flow and our conversion tool could use the annotated
netlist to apply the optimization to dual rail circuits; note the
committal stage of the sleep mode logic would need to be
different from what the synthesis tools would do (simple
AND gates using a control signal). Presently, this has not been
implemented in the tool, but it has been investigated using schematic
entry with simple examples, which gave promising results.
This needs to be investigated further, together with the clock
gating idea.
6 CONCLUSIONS
We have presented a technique to balance switching
activity and energy consumption per clock cycle. Our
technique uses two spacers alternating in time within the
dual-rail logic framework. It is very cheap yet effective and
is supported by software tools that interface with standard
RTL design flow tools used by most ASIC designers. The
idea of using two spacers is deemed particularly efficient
for dual-rail logic, where the Hamming distance between
each spacer and a valid combination is the same. While it
can still be used without too much overhead in optimally
balanced k-of-n codes (e.g., 3-of-6), it would be much less
efficient in other popular codes such as 1-of-4 [22].
The AES benchmarks indicate that we have fully
eliminated the dependency which existed between data
and switching activity in the dual-rail circuits. The price to
pay for the improved security features is the increased
average switching activity and area overheads.
ACKNOWLEDGMENTS
The authors are grateful to A. Koelmans, A. Kondratyev,
S. Moore, A. Taubin, and the anonymous referees for useful
comments. EPSRC supports this work via GR/R16754
(BESST), GR/S81421 (SCREEN).
REFERENCES
[1] P. Kocher, J. Jaffe, and B. Jun, “Differential Power Analysis,” Proc.
Crypto, 1999.
[2] T. Messerges, E. Dabbish, and R. Sloan, “Examining Smart-Card
Security under the Threat of Power Analysis Attacks,” IEEE Trans.
Computers, vol. 51, no. 5, pp. 541-552, May 2002.
[3] H. Saputra, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, R. Brooks,
S. Kim, and W. Zhang, “Masking the Energy Behaviour of DES
Encryption,” Proc. Design and Test in Europe Conf. (DATE), 2003.
[4] S. Guilley, P. Hoogvorst, Y. Mathieu, R. Pacalet, and J. Provost,
“CMOS Structures Suitable for Secured Hardware,” Proc. Design
and Test in Europe Conf. (DATE), pp. 1414-1415, 2004.
[5] K. Tiri, M. Akmal, and I. Verbauwhede, “A Dynamic and
Differential CMOS Logic with Signal Independent Power Con-
sumption to Withstand Differential Power Analysis on Smart
Cards,” Proc. European Solid-State Circuits Conf. (ESSCIRC), 2002.
[6] S. Moore, R. Anderson, P. Cunningham, R. Mullins, and G. Taylor,
“Improving Smart Card Security Using Self-Timed Circuits,” Proc.
Int’l Symp. Asynchronous Circuits and Systems (ASYNC), pp. 211-
218, 2002.
[7] Z. Yu, S. Furber, and L. Plana, “An Investigation into the Security
of Self-Timed Circuits,” Proc. Int’l Symp. Asynchronous Circuits and
Systems (ASYNC), pp. 206-215, 2003.
[8] K. Tiri and I. Verbauwhede, “A Logical Level Design Methodol-
ogy for a Secure DPA Resistant ASIC or FPGA Implementation,”
Proc. Design and Test in Europe Conf. (DATE), 2004.
[9] A. Kondratyev and K. Lwin, “Design of Asynchronous Circuits
Using Synchronous CAD Tools,” Proc. Design Automation Conf.,
pp. 107-117, 2002.
[10] L. Plana, P. Riocreux, W. Bardsley, J. Garside, and S. Temple,
“SPA—A Synthesisable Amulet Core for Smartcard Applica-
tions,” Proc. Int’l Symp. Asynchronous Circuits and Systems
(ASYNC), pp. 201-210, 2002.
[11] J. Sparso and S. Furber, Principles of Asynchronous Circuit Design.
Kluwer Academic, 2001.
[12] Self-Timed Control Of Concurrent Processes, V. Varshavsky, ed.
Kluwer, 1990 Russian ed., 1986.
[13] I. David, R. Ginosar, and M. Yoeli, “An Efficient Implementation
of Boolean Functions as Self-Timed Circuits,” IEEE Trans.
Computers, vol. 41, no. 1, pp. 2-11, Jan. 1992.
[14] K. Fant and S. Brandt, “Null Convention Logic: A Complete and
Consistent Logic for Asynchronous Digital Circuit Synthesis,”
Proc. Int’l Conf. Application-Specific Systems, Architectures and
Processors (ASAP), pp. 261-273, 1996.
[15] D. Sokolov, J. Murphy, A. Bystrov, and A. Yakovlev, “Improving
the Security of Dual-Rail Circuits,” Proc. Workshop Cryptographic
Hardware and Embedded Systems (CHES), 2004.
[16] A. Bystrov, D. Sokolov, A. Yakovlev, and A. Koelmans,
“Balancing Power Signature in Secure Systems,” Proc. 14th UK
Asynchronous Forum, 2003.
[17] Nat’l Inst. of Standards and Technology, “Federal Information
Processing Standard 197, The Advanced Encryption Standard
(AES),” http://csrc.nist.gov/publications/fips/fips197/fips197.pdf, 2001.
[18] J. Daemen and V. Rijmen, The Design of Rijndael. Springer-Verlag,
2002.
[19] R. Usselmann, “Advanced Encryption Standard/Rijndael IP
Core,” http://www.asic.ws/, 2004.
[20] S. Mangard, M. Aigner, and S. Dominikus, “A Highly Regular and
Scalable AES Hardware Architecture,” IEEE Trans. Computers,
vol. 52, no. 4, pp. 483-491, Apr. 2003.
[21] J. Wolkerstorfer, E. Oswald, and M. Lamberger, “An ASIC
Implementation of AES S-Boxes,” Proc. RSA, 2002.
[22] W. Bainbridge and S. Furber, “Delay Insensitive System-on-Chip
Interconnect Using 1-of-4 Data Encoding,” Proc. Int’l Symp.
Asynchronous Circuits and Systems (ASYNC), 2001.
Danil Sokolov is a research associate in the
School of Electrical, Electronic and Computer
Engineering, University of Newcastle upon
Tyne. In 1994-1999, he studied computing
science at Kyrgyz-Russian Slavic University,
Kyrgyzstan. In 1998, he began to
contribute to the research projects of the
International Science and Technology Center
in Kyrgyzstan. After graduation, he continued his
collaboration with the International Science and
Technology Center as an IT specialist. In 2001, he joined the
Asynchronous Research Group at the University of Newcastle upon
Tyne as a PhD student and is currently writing his thesis. His research
interests include design of secure systems, automation of the design
process, modeling, and visualization of asynchronous systems. He is a
student member of the IEEE.
Julian Murphy received the Master of Engi-
neering degree in microelectronics and soft-
ware engineering from the University of
Newcastle upon Tyne, United Kingdom, in
2003. In 2003, he joined the Asynchronous
Research Group at the University of Newcastle
upon Tyne and is currently working toward the
PhD degree. His research interests include
developing secure VLSI hardware, side-chan-
nel countermeasures, and the implementation
of cryptographic hardware in general from design to layout. He is a
student member of the IEEE.
Alexander Bystrov is a lecturer in the School of
Electrical, Electronic and Computer Engineer-
ing, University of Newcastle upon Tyne. In 1980-
1986, he studied electronic engineering at St.
Petersburg State Electrical Engineering Univer-
sity, Russian Federation. After graduation and
until 1995, he worked as a research associate in
the Department of Radio Systems at the same
university, studying online and offline testing
methods. In 1995-1998, he did research on
testing of multilevel logic circuits at Napier University of Edinburgh and
received the PhD degree from that university. Since then, he
has worked in the Asynchronous Systems Laboratory at the University
of Newcastle upon Tyne. His research interests are modeling,
visualization, and design of asynchronous systems, arbitration, design
of low-latency asynchronous circuits, online testing, and security
systems design. He is a member of the IEEE and the IEEE Computer
Society.
Alex Yakovlev received the MSc and PhD
degrees in computing science from the Electro-
technical University of St. Petersburg, Russia,
where he had worked in the area of asynchro-
nous and concurrent systems since 1980 and, in
the period between 1982 and 1990, held
positions of assistant and associate professor
in the Computing Science Department. He is a
professor of computer systems design in the
School of Electrical, Electronic, and Computer
Engineering at the University of Newcastle, which he joined in 1991, and
where he currently heads the Microelectronic Systems Design Research
Group. He first visited Newcastle in 1984-1985 for research in VLSI and
design automation. After returning to Britain in 1990, he worked for one
year at the Polytechnic of Wales (now University of Glamorgan). His
current interests and publications are in the field of modeling and design
of asynchronous, concurrent, real-time, and dependable systems. He
has chaired the program committees of several international confer-
ences and is currently a chairman of the steering committee for the
Conference on Application of Concurrency to System Design. He is a
member of the IEEE. In 2002, he was among the finalists for the
Descartes Prize, awarded to the best scientific teams in Europe.
