Abstract-Side channel attacks exploit the physical properties of integrated circuits to extract sensitive information. They are becoming increasingly important in the context of the deployment of the Internet of Things. One of the most effective countermeasures consists of modifying the logic circuits to reduce the leakage through side channels. This paper presents a novel side channel attack tolerant balanced circuit (STBC) based on a dynamic and differential configuration. Its main feature is the use of an improved binary decision diagram (BDD) with a multi-output function and internal gate sharing to reduce the implementation area. Compared to the earlier proposed dual-rail pre-charge circuit with binary decision diagram (DP-BDD) technique, an area reduction of 13.7% is achieved. A fixed versus random t-test shows that STBC obtains a substantial reduction in information leakage, even though a small peak remains. Further, its input variable dependence is comparable with that of a normal CMOS circuit and similar to that of DP-BDD.
I. INTRODUCTION
As the Internet of Things is deployed on a global scale and the interest in cyber security grows, methods for defending against side-channel attacks (SCAs) are becoming increasingly important. These attacks exploit the physical properties of a system such as computation time, power consumption, and electromagnetic radiation to extract sensitive information such as secret keys or personal data. A broad range of countermeasures has been proposed that includes increasing the variation in cryptographic protocols (e.g., frequent key updates), masking and randomization, and adding noise. One of the most attractive solutions for hardware implementations is to use a secure logic style; this approach has the particular advantage that side channel resistance is built in at the lowest layer and integrated in the design flow. Secure logic styles reduce the signal-to-noise ratio (SNR) visible to an attacker by reducing the correlation between the Hamming distance of the circuit input transitions and the power consumption during sensitive operations. Secure logic styles use hiding methods that keep the power consumption constant and thus independent of the input values.
Our logic style follows the wave dynamic differential logic (WDDL) [1] solution that combines dynamic and differential approaches. This keeps the sum of the Hamming weights of the output values equal to 1 at all times. Further, our approach uses standard cells, which results in a much lower nonrecurring engineering (NRE) cost than special gates. WDDL and other differential logic styles are vulnerable to timing-based attacks due to the early propagation problem [2]. By balancing the input arrival times and reducing the propagation delay differences, this problem can be addressed at the cost of extra logic gates. Binary decision diagrams (BDDs) (cf. Bryant [5]) use nodes with an "if-then-else" algorithm to process the input variables. Akishita et al. [4] have proposed combining BDDs with dual-rail pre-charge logic, resulting in DP-BDD: they obtain a balanced configuration by changing every node to a multiplexor with two AND gates at the same logic depth and one OR gate at the subsequent logic level. DP-BDD is both dynamic and differential. BDD-based circuits have a critical path that is independent of the input values. Moreover, when input values are presented, the gates in a BDD-based circuit operate with reduced delay differences in the transition timing; note that some fine-tuning may be needed for each input value (e.g., the input value at the top node of the circuit has more delay than the input value at the bottom node of the circuit), even though this cannot be exactly calculated because of process variations. As a consequence, DP-BDD reduces the total delay and latency during the pre-charge and evaluation phases and solves the early propagation problem. Other dual-rail based circuits using standard cells (e.g., WDDL) need dedicated delay matching using dummy gates; this becomes more difficult in complex circuits such as a composite field based AES S-box.
However, BDD-based circuits always have the same tree structure independent of the S-box, which makes it easy to have the same logic depth for every input transition. For multiple output functions, DP-BDD uses several independent BDDs and a balanced configuration is obtained without dummy gates. However, these advantages of DP-BDD come at a cost: the size of the logic tree grows exponentially in the number of input and output bits. In the case of DP-BDD, the implementation area is up to 2.1 times larger than for WDDL [4]. This makes the use of these previous BDD-based secure logic styles impractical. This paper proposes a novel efficient secure logic style called the side channel attack tolerant balanced circuit (STBC). STBC minimizes BDD circuits by applying reduction rules and by searching for an optimized variable ordering, and it inherits the reduced propagation delay of the BDD approach. Compared with previous BDD-based methods, STBC uses an optimized variable ordering and multi-output BDD functions, which substantially reduces the circuit size through internal gate sharing. STBC obtains a balanced configuration through the insertion of dummy gates. Furthermore, as a countermeasure against SCAs, this balanced configuration also has a dynamic and differential operation that includes a pre-charge phase and an evaluation phase. In addition, STBC performs the same number of switching operations regardless of the input values because the number of active AND-OR gates does not depend on the input values. Hence, STBC increases efficiency and is robust against SCAs. Therefore, STBC is an excellent candidate for secure lightweight implementations in standard cell technology.
In Section II, BDDs are reviewed and the propagation delay issue is briefly introduced. Section III presents our novel STBC method and explains the structure and operation of the circuit. Section IV validates the STBC method by describing the implementation of an AES S-box; its security is evaluated based on a pre-layout HSPICE simulation and fixed versus random t-test. Section V concludes the paper.
II. BACKGROUND

A. BDD
A BDD is an efficient representation method for complex Boolean functions [3], [5]. It divides a structure into three parts: output nodes, internal nodes, and leaf nodes. In a BDD representation, each top (output) node is associated with one Boolean function (see Fig. 1(a)). In contrast, at the bottom of a BDD, there are two leaf nodes with constant values of 0 and 1. These leaf nodes represent the output values of the Boolean function, which are fixed to these constant values. In addition, internal nodes have edges with two possible directions. One edge direction flows into the node from the upper or "parent" node, and the other edge direction flows from the node to the lower or "child" nodes. Each internal node is allocated input variables; nodes at the same logic depth (same stage) correspond to the same input variable. A BDD follows an "if-then-else" algorithm, meaning that if the input value of one internal node equals 1, then it follows the solid child edge; if the input value equals 0, then it follows the dashed child edge. Therefore, after allocation is finished, a BDD configures every possible logic combination in a diagram. A BDD does not have a canonical form, as its form varies depending on the ordering of the input variables. The variable ordering can be optimized to minimize the number of internal nodes. An ordered BDD (OBDD) reduces the number of redundant nodes and merges corresponding internal nodes. After the optimization, the compacted BDD can then be used for compact and lightweight design as it has a minimal number of nodes and hence a minimal number of gates. The reduced OBDD (ROBDD) differs from the BDD in that it has a canonical form (see Fig. 1(b)). This ROBDD can be used for a multi-output Boolean function. When the ROBDD is configured as the concatenation of many single output functions, the size of the ROBDD can increase exponentially with the number of input values.
That said, a large Boolean function can be manipulated into a multi-output BDD configuration by sharing nodes and gates. Several Boolean output functions can then contain the same internal nodes, and these nodes can be shared across those output functions.
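The effect of variable ordering on the node count can be illustrated with a minimal sketch. It assumes a brute-force construction by Shannon expansion with a unique table, which is only practical for a handful of variables; the function `f`, the helper `build_robdd`, and the two orderings are hypothetical illustrations and not the optimization procedure used for STBC.

```python
def build_robdd(f, order):
    """Build a reduced ordered BDD for f under the given variable order.

    f takes a dict {variable: 0 or 1}. Returns (root, unique), where
    unique maps (var, low, high) to a node id. Ids 0 and 1 are the leaves.
    """
    unique = {}

    def mk(var, low, high):
        # Elimination rule: a node whose two children agree is redundant.
        if low == high:
            return low
        # Merging rule: identical (var, low, high) triples share one node.
        key = (var, low, high)
        if key not in unique:
            unique[key] = len(unique) + 2
        return unique[key]

    def build(level, assignment):
        if level == len(order):
            return int(bool(f(assignment)))
        var = order[level]
        low = build(level + 1, {**assignment, var: 0})   # "else" edge
        high = build(level + 1, {**assignment, var: 1})  # "then" edge
        return mk(var, low, high)

    return build(0, {}), unique


def f(a):
    # Illustrative function: x1*x2 + x3*x4 + x5*x6
    return (a[1] and a[2]) or (a[3] and a[4]) or (a[5] and a[6])


paired = len(build_robdd(f, [1, 2, 3, 4, 5, 6])[1])
interleaved = len(build_robdd(f, [1, 3, 5, 2, 4, 6])[1])
print(paired, interleaved)
```

For this function, keeping each product's variables adjacent gives 6 internal nodes, whereas separating them gives 14, which is why the search for a good ordering matters before the BDD is mapped to gates.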
B. Propagation delay problem
In [7], the dual-rail timing imbalance is well illustrated, where two input signals are shown to determine the output signal transition in the evaluation and pre-charge phases, respectively. Therefore, the difference of arrival time at the output nodes caused by the different delays of the input signals results in information leakage. This leakage increases with the complexity of the logic circuits, specifically the logic depth and number of gates. Although this problem can be mitigated by equalizing the delays of the input signals of the circuit, doing so imposes many constraints on the design process. Moreover, it is difficult to adjust all gates in a more complex circuit (e.g., an AES S-box with composite fields).
The propagation delay also varies depending on the input values. These variances reduce the efficiency of the circuit and may also increase its vulnerability to attacks such as fault sensitivity analysis (FSA) [8]. FSA exploits the vulnerability arising from the difference in the timing delay of input transitions. In this attack, the attacker selects an output bit with logical value one. Subsequently, the attacker varies the fault intensity and compares the critical path delay when the output bit switches to zero. This is because the differential circuit retains zero values at the outputs during the pre-charge phase, and a fault attack during the evaluation phase can only change the output values from either one to zero or zero to zero. Therefore, the output values with logical one operating with fault-free intermediate values enable detection of a fault attack, whereas with zero to zero changing values it is difficult to recognize whether a fault has been injected. Thus, FSA can exploit the difference in critical path delay depending on input transitions.
III. PROPOSED STBC METHOD
This section presents an STBC design method that offers several improvements: area reduction, constant power consumption, and reduced propagation delay. An optimized compact shared BDD can be achieved by determining the variable ordering with the minimum number of internal nodes during STBC design, and by removing the duplicated gates that can be shared in a multi-output function. A substantial reduction in the implementation area is obtained compared to the previous DP-BDD. Furthermore, as with DP-BDD, the propagation delay is decreased and the early propagation problem is eliminated.
A. Structure of STBC
1) Dynamic and differential configuration:
The STBC circuit has a dynamic and differential configuration. For dynamic operation, the enable signal of the dual-rail to single-rail converter is only used to keep all inputs at zero for the pre-charge phase. The differential representation of STBC has a symmetric configuration with a positive (true) and negative (false) representation of the circuit. The output function always has the same Hamming weight, which is independent of the input values. Fig. 2 shows the overall structure of STBC with a wrapper containing a rail-converter and a delay module for all input values to reduce the difference in arrival time between input values. The input variables of the delay module connect to buffers of different sizes corresponding to the stage of the BDD construction. For example, in Fig. 2, at the top node of a BDD with n inputs, the input value connects to a buffer chain of length n−1. However, this length needs to be estimated more accurately in a real physical implementation because the delays vary with process variation. In [4], a delay adjustment module is considered to reduce the effect of the difference in propagation delay between the input ports of each AND-OR gate, the effect of the load capacitance between the input ports of each AND-OR gate, and the difference in fan-out between the output signals of the AND-OR gates. However, this delay module also needs fine-tuning, and the required delay cannot be calculated exactly because the input signal delay varies further with process variation.
2) Multi-output function and gate sharing method:
A multi-output function means that a complex Boolean function has only one BDD construction with multiple outputs. To realize this multi-output function, multiple intermediate nodes must be shared, and this sharing removes duplicated nodes, substantially reducing the area compared to DP-BDD. Additionally, more gates can be shared by the merged internal nodes in the BDD using a multi-output function. Fig. 3 illustrates this gate sharing method. For example, at stage 2 in Fig. 3 , the two AND gates of s1 have the same input value x2, and they share the same input value from the output of the same OR gate in stage 1 (the previous stage). As the output values are the same between these two AND gates, one of these AND gates can be shared and removed. Similarly, gates s2 and s3 can be shared using one AND gate. Furthermore, if a gate has zero input, it can also be eliminated because the output value of the AND gate is always zero. Therefore, at stages 1 and 2, the AND gates are redundant and are thus eliminated.
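The saving from node sharing can be sketched at the BDD level with a toy example. The sketch below assumes a three-input parity function and its complement as the two outputs (chosen only because a differential circuit needs both polarities), and compares building each output separately with building both on one shared unique table. The helper `bdd_size` is hypothetical, and the sketch counts BDD nodes only; it does not model the AND-OR gate sharing of Fig. 3.

```python
def bdd_size(functions, n_vars):
    """Count internal nodes of a reduced BDD shared by all given outputs.

    Each function takes n_vars bits as positional arguments. Variables are
    tested in argument order. Leaves use ids 0 and 1.
    """
    unique = {}

    def mk(var, low, high):
        if low == high:          # redundant test, skip the node
            return low
        # setdefault returns the existing id when the triple is already known
        return unique.setdefault((var, low, high), len(unique) + 2)

    def build(f, bits):
        if len(bits) == n_vars:
            return int(bool(f(*bits)))
        return mk(len(bits), build(f, bits + (0,)), build(f, bits + (1,)))

    for f in functions:
        build(f, ())
    return len(unique)


def parity(a, b, c):
    return a ^ b ^ c


def not_parity(a, b, c):
    return 1 - (a ^ b ^ c)


separate = bdd_size([parity], 3) + bdd_size([not_parity], 3)
shared = bdd_size([parity, not_parity], 3)
print(separate, shared)
```

Built separately, the two outputs need 5 nodes each (10 in total); on a shared table only the second root is new, giving 6 nodes. Independent per-output BDDs, as in DP-BDD, correspond to the first count.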
3) Dummy gate insertions:
When there are not enough gates to obtain a balanced operation at every path of the structure, dummy AND-OR gates are inserted to obtain a balanced circuit. This allows every logical combination to have the same number of gates in its switching operation. For example, a Boolean function with four inputs has a switching operation with four AND-OR gates. In Fig. 3 , at stage 3, D1 and D2 are dummy gates that are inserted into the edge of the BDD to obtain a symmetric configuration and balanced operation. These dummy gates consist of AND-OR gates for which the OR gate has 0 as one of its inputs and the AND gate has 1 as one of its inputs. Additionally, these dummy gates yield a critical path that is independent of the input variables. The Boolean function always flows through four AND gates and four OR gates, ensuring that every input transition has the same propagation delay from stage 1 to stage 4.
B. Operation of an STBC
An STBC operates in two phases.
• Step 1. Pre-charge phase: In this phase, the output signals are all reset to zero. To maintain this status at the outputs, zero enable signals are inserted to all input values at the delay module in Fig. 2 . Every internal gate simultaneously resets its output value to zero because they are connected to the input signal of the delay module. Therefore, this operation resets the output nodes to zero for all stages of the BDD with only small delays due to the intrinsic delays of each gate.
• Step 2. Evaluation phase: In this phase, the input values are applied and each output signal transitions either from 0 to 0 or from 0 to 1. As in the pre-charge phase, every operation is executed with similar transition timing. To reduce the difference of the input value switching timing at different stages, a different input delay is generated in the delay module. This also helps to reduce vulnerability due to the differences between input delays.
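The two phases can be illustrated with a minimal behavioral sketch. It assumes a generic WDDL-style dual-rail AND gate as a stand-in (true rail computed by AND, false rail by OR of the complementary rails) rather than the STBC AND-OR multiplexer network, and all names (`PRECHARGE`, `encode`, `dr_and`, `rising_edges`, `transitions_per_cycle`) are hypothetical.

```python
from itertools import product

PRECHARGE = (0, 0)  # both rails held at zero during the pre-charge phase


def encode(bit):
    """Dual-rail value used in the evaluation phase: (true rail, false rail)."""
    return (bit, 1 - bit)


def dr_and(a, b):
    """Dual-rail AND: AND on the true rails, OR on the false rails."""
    return (a[0] & b[0], a[1] | b[1])


def rising_edges(before, after):
    """Number of rails that switch from 0 to 1."""
    return sum(1 for x, y in zip(before, after) if x == 0 and y == 1)


def transitions_per_cycle():
    """Rising edges at the gate output for every input pair, per cycle."""
    pre_out = dr_and(PRECHARGE, PRECHARGE)  # output rails after pre-charge
    counts = {}
    for a, b in product((0, 1), repeat=2):
        out = dr_and(encode(a), encode(b))  # output rails after evaluation
        counts[(a, b)] = rising_edges(pre_out, out)
    return counts


print(transitions_per_cycle())
```

In this model exactly one output rail rises in every evaluation, whichever input pair is applied, so the switching count carries no information about the data. This is the property that the dynamic and differential configuration is meant to provide.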
C. Advantages of STBC
1) Robustness against SCAs:
The dynamic and differential characteristic causes the output transitions to have the same Hamming weight and maintains a constant power consumption. In addition, the STBC circuit has the same number of switching gates for every input. For example, when the circuit has four inputs, as shown in Fig. 3, even when the gates are operating according to the logic paths of each Boolean function, the circuit always switches the same number of CMOS AND gates and OR gates for the output values of sixteen different input combinations. This improves the balance of power in the configuration while still yielding a critical path that is independent of the input values. These factors efficiently defend against SCAs.
2) Reducing propagation delay and implementation area:
The BDD-based configuration avoids the early propagation problem because it has the same logic depth 2n for a circuit with n inputs: every internal node consists of two AND gates in parallel followed by one OR gate. Thus, it is not influenced by differences in the input value arrival times. In addition, each phase starts at the same time as the input values are applied with small delays depending on the position (stages) of those values [4]. Many studies have reported how newly proposed countermeasures against power analysis attacks can increase the vulnerability to fault analysis [9], [10].
In particular, attacks based on fault injection such as fault sensitivity analysis (FSA) [8], which exploit the difference of critical path delay and propagation delay, are emerging as an effective way to undermine implementations protected against power analysis. However, STBC is robust against such a fault analysis attack because it has a balanced critical path delay and a reduced propagation delay; moreover, both are independent of the input values. In addition, STBC uses multiple outputs and shared gates in order to obtain a compact design. The area is reduced, even with a complementary representation to obtain a differential configuration. STBC offers a solution for exponential circuit growth: as the number of shared gates increases, it improves over DP-BDD in dealing with functions with a large number of input variables.
IV. SIMULATION RESULTS
A. Case study
To confirm the properties of STBC, the compact AES S-box proposed by Mentens et al. [11] was implemented in our simulation. The AES S-box is a typical target for SCAs, as it is the only non-linear component of the AES; other components such as ShiftRows require only wiring, and MixColumns is linear and very simple. The AES S-box was implemented with UMC 0.13 μm technology; HSPICE was used to simulate power traces at the pre-layout design. The AES S-box is an 8-bit to 8-bit mapping, meaning the BDD requires eight stages. At the same stage of the circuit topology, it has the same input variable, which is paired to the multiplexor containing the AND and OR gates (e.g., in Fig. 3, every multiplexor has x2 or the complement of x2 as one input value at stage 2). A short description of the construction procedure of an STBC is given below.
Construction procedure: First, the shared BDD (SBDD) was structured for multi-output functionality with a complementary output. The optimized variable ordering with the smallest node count is shown in Fig. 1(c). During this procedure, shared nodes were merged. Second, every node was changed to a multiplexor with AND-OR gates, and the shared and redundant AND gates of the shared nodes were then removed (Fig. 4(a)). Finally, dummy AND-OR gates were inserted on the insufficient edges (Fig. 4(b)). The final structure has a balanced configuration and thus generates the STBC. Therefore, this construction method can be used for the whole AES S-box design. The structure of the AES S-box following this method has a balanced tree configuration with 8-bit input and 8-bit output.
B. Security analysis
Fixed versus random t-test:
The fixed versus random t-test compares the leakage of fixed and random sets of traces [12]. For this test, the traces must be divided into two subsets: (i) $L_f$, which consists of traces with a fixed input transition; for our t-test these are all measurements in which any 8-bit input transitions to the all-zero 8-bit value, which is fixed for the simulation; (ii) $L_r$, which consists of traces with random input transitions; for our t-test these are all measurements with input transitions to values other than the all-zero 8-bit value. To apply Welch's t-test, we choose samples of size $N_f$ and $N_r$ from $L_f$ and $L_r$, respectively. From these samples we compute the mean values $\hat{\mu}_f(\tau)$ and $\hat{\mu}_r(\tau)$ and the standard deviations $\hat{\sigma}_f(\tau)$ and $\hat{\sigma}_r(\tau)$. With these parameters, the t-test statistic for each time instance $\tau$ in the traces is defined as follows:

$$t(\tau) = \frac{\hat{\mu}_f(\tau) - \hat{\mu}_r(\tau)}{\sqrt{\dfrac{\hat{\sigma}_f^2(\tau)}{N_f} + \dfrac{\hat{\sigma}_r^2(\tau)}{N_r}}}$$
The threshold level in the t-test was set to ±4.5σ [12], and the leakage of the CMOS-based and STBC-based AES S-boxes is compared. The small peaks that remain for STBC come from the fan-in and fan-out effects of the STBC. This problem can be solved by weighting the nodes that have a large fan-out or fan-in, which can be realized by increasing the gate strength; this issue is beyond the scope of this paper and is a task for future work. Nevertheless, the reduced leakage shows that the proposed STBC reduces the dependence on input values compared to standard CMOS, as the small difference between the two mean values signifies no data dependency. Therefore, these results confirm that STBC is a good countermeasure against SCAs. Additionally, DP-BDD is expected to have similar robustness to STBC, but without the small peak, because the fan-in and fan-out of the internal nodes of DP-BDD are better balanced than those of STBC.
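The per-sample Welch statistic used in this test can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `welch_t` and `leaky_points`, the trace layout (one list of samples per trace), and the toy traces are assumptions and do not reproduce the HSPICE data of this paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(fixed, rand):
    """Welch's t statistic for each time index.

    fixed, rand: lists of traces, each trace a list of samples of equal
    length. Returns one t value per time index.
    """
    n_f, n_r = len(fixed), len(rand)
    t = []
    # zip(*traces) iterates over time indices (columns of the trace matrix)
    for col_f, col_r in zip(zip(*fixed), zip(*rand)):
        mu_f, mu_r = mean(col_f), mean(col_r)
        var_f, var_r = variance(col_f), variance(col_r)  # sample variances
        denom = sqrt(var_f / n_f + var_r / n_r)
        t.append((mu_f - mu_r) / denom if denom > 0 else 0.0)
    return t


def leaky_points(t, threshold=4.5):
    """Time indices where |t| exceeds the detection threshold."""
    return [i for i, v in enumerate(t) if abs(v) > threshold]


# Toy traces: index 0 is data independent, index 1 depends on the class.
fixed_traces = [[1.0, 5.0], [1.2, 5.1], [0.8, 4.9]]
random_traces = [[1.1, 9.0], [0.9, 9.2], [1.0, 8.8]]
t_values = welch_t(fixed_traces, random_traces)
print(leaky_points(t_values))
```

With these toy traces only the second time index exceeds the 4.5 threshold, because the two class means differ there by much more than their spread.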
C. Performance comparison
1) Early propagation and delay imbalance problem:
Almost all secure logic styles have a propagation delay problem. Additional methods for improving the security properties of a secure logic style incur performance losses. Without sensitive and difficult input delay matching with dummy gates, a dependency on the input values can occur due to early propagation. These problems grow in more complex combinatorial circuits (e.g., an AES S-box with composite fields) and also lead to other vulnerabilities, such as to timing attacks and fault sensitivity analysis. However, the proposed STBC reduces this propagation problem because its logic depth for n-bit inputs is always 2n with a balanced configuration. In addition, the input delay can easily be tuned with the input delay module, and the resulting arrival time difference is smaller than for other differential secure logic styles because every internal node has two AND gates at the same level. Fig. 6 compares the propagation delay and delay variation depending on input values of a CMOS circuit and STBC.
2) Implementation area:
The STBC-based AES S-box offers a substantial reduction in implementation area compared to the previous BDD-based secure logic styles. Because of the construction characteristics of the STBC, additional gates are removed from the BDD circuit. If the circuit has many output functions, the sharing effect becomes larger than this enhancement. The AES S-box has an 8-bit input and output. The 8-bit output function can be one SBDD, which thus eliminates the shared nodes. In addition, several gates in the shared nodes can also be removed. Even though a few dummy gates are inserted to balance the circuit, the result is a much smaller implementation area than that of DP-BDD. For this implementation, OR-NAND/AND-NOR, OR, and NOR gates were used, and the STBC gate count is equivalent to 501×7/4 + 43×5/4 + 166 = 1,097 GE, which is less than the number obtained by DP-BDD (1,271 GE) in [4]. (Note that a NAND/NOR gate, an AND/OR gate, and an AND-NOR/OR-NAND gate are equivalent to 1 gate, 5/4 gates, and 7/4 gates, respectively.) STBC reduces the implementation area of DP-BDD by 13.7% because of its multi-output function with shared gates. The data in Figs. 5 and 6 demonstrate that performance loss is minimized in STBC while offering a higher level of security. While most countermeasures show limitations in their performance because of factors such as propagation delay and implementation area, STBC minimizes these problems. Table I summarizes the performance and security comparison between CMOS, DP-BDD [4], and STBC.
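The gate-equivalent figure and the area reduction quoted above can be checked with a short worked calculation. It assumes that the three terms of the sum correspond, in order, to the OR-NAND/AND-NOR gates (7/4 GE each), the OR gates (5/4 GE each), and the NOR gates (1 GE each), which matches the stated weights but is not spelled out term by term in the text.

```latex
\begin{align*}
\mathrm{GE}_{\mathrm{STBC}}
  &= 501 \cdot \tfrac{7}{4} + 43 \cdot \tfrac{5}{4} + 166 \cdot 1 \\
  &= 876.75 + 53.75 + 166 = 1096.5 \approx 1097 \\[4pt]
\text{area reduction}
  &= \frac{\mathrm{GE}_{\mathrm{DP\text{-}BDD}} - \mathrm{GE}_{\mathrm{STBC}}}{\mathrm{GE}_{\mathrm{DP\text{-}BDD}}}
   = \frac{1271 - 1097}{1271} = \frac{174}{1271} \approx 13.7\%
\end{align*}
```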
V. CONCLUSION
The proposed STBC is an efficient secure logic style that uses standard cells. It reduces the implementation area and propagation delay compared to previous secure logic styles based on standard cells. To reduce the design overhead of DP-BDD, STBC uses a multi-output function and gate sharing. In addition, the BDD configuration reduces the early propagation problem by keeping the same critical path delay independent of input values. In terms of security, STBC substantially reduces information leakage because of its balanced configuration along with its dynamic and differential configuration. Its security properties compare favorably with standard CMOS in the fixed versus random t-test. For these reasons, STBC is a practical and efficient alternative countermeasure against SCAs.
Clearly, avoiding routing imbalances is important for maintaining the proposed balanced property. The fat wire approach is a well-known routing method for dual-rail based secure logic styles. In addition, Cadence provides a SoC Encounter tool with a balanced routing option as one of its special routing techniques. These methods can be used to solve the imbalanced routing problem, even though applying such techniques to the construction proposed in this paper is not straightforward. We plan to explore how the parasitic routing capacitance affects security in future work.
