Abstract
Introduction
As FPGAs are readily available COTS components and provide a short time-to-market development cycle, they are well suited to high-end markets that require flexibility and confidentiality. As with ASICs, cryptographic FPGAs can be targeted by implementation-level attacks, which fall into two categories depending on whether they are active or passive. Active attacks consist in injecting faults so as to gain information from the corrupted execution results of the device [8] . Usual countermeasures consist in detecting errors, using codes or redundancy. Passive attacks, also called side-channel attacks (SCAs), consist in simply observing the device's emanations while it is performing a cryptographic operation. The attack consists either in computing a correlation coefficient between the acquired traces and the expected dissipation according to a key hypothesis (e.g. SPA, DPA, CPA, EMA) or in consulting a pre-characterized database (e.g. template attacks (TA) [4, 1, 2] ). Common countermeasures against observation attacks consist in randomizing the execution, using clock jitter, dummy or decoy clock cycles, or in blinding intermediate data words. The goal is to make any statistical treatment irrelevant. Another customary technique consists in balancing the circuit's activity, so as to make the dissipation data-independent. This approach alleviates the need for a high-quality randomness source, but in return demands a strong effort in the balancing process.
Protections against active attacks can take advantage of robust FPGA design strategies at the RTL level, such as register triplication, or of sensors that detect abnormal events. Concerning passive attacks, however, FPGAs are not intrinsically immune. On the contrary, FPGAs seem to have a propensity to leak a lot of information [19] . Compared to regular ASICs, the interconnection network is extremely dissipative, because it consists of active switches and long metal routing distances between logic cells.
In this paper we study a power consumption balancing strategy called "wave dynamic differential logic" (WDDL, introduced in [23] ) that is well suited to FPGAs. Its principle is to duplicate the netlist into a so-called 'true' part and a 'false' part, which share the same topology (interconnection graph). The graph is devised such that if any gate of one network switches, then the sibling gate of the dual network does not, and vice versa. This way, from a macroscopic standpoint, the activity of the circuit is constant. This is at least true at the logical level, i.e. at the first order.
The rest of the paper is organized as follows: Section 2 details the methodology used to obtain a WDDL netlist, and gives some indications on the overhead caused by switching from insecure to secure netlists. Then we report in Section 3 the experimental security improvements reached by two types of WDDL netlists (non-positive and positive) over an unprotected reference. Section 4 analyzes the performance of some synthesizers in mapping into logic gates the most complicated parts of a cryptographic algorithm, namely the substitution boxes. Finally, Section 5 concludes on the efficiency and the cost of protecting FPGAs against SCAs using a power-constant strategy and provides some suggestions for improvements.
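The constant-activity principle can be illustrated with a minimal sketch. It assumes the classical WDDL compound AND gate (an AND on the true rail paired with an OR on the false rail) and an all-zero precharge value; the function names are illustrative and are not taken from the design described in this paper.

```python
from itertools import product

def wddl_and(a_t, a_f, b_t, b_f):
    """Dual-rail AND: the true rail is an AND, the false rail is the dual OR."""
    return a_t & b_t, a_f | b_f

def rising_transitions(a, b):
    """Count output rails that rise when leaving the all-zero precharge state."""
    # Precharge phase: every input rail is 0, so both output rails are 0.
    pre_t, pre_f = wddl_and(0, 0, 0, 0)
    # Evaluation phase: each logical bit x is encoded as the pair (x, NOT x).
    out_t, out_f = wddl_and(a, 1 - a, b, 1 - b)
    return (out_t - pre_t) + (out_f - pre_f)

# Activity of the gate for every logical input pair (a, b).
activity = {(a, b): rising_transitions(a, b)
            for a, b in product((0, 1), repeat=2)}
print(activity)
```

Under these assumptions exactly one output rail rises in every evaluation phase, whatever the data, which is the first-order balance described above.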
Fitting WDDL into FPGAs

State-of-the-Art about Dual-Rail Logic in FPGAs
Kris Tiri reports in [23, 25, 24] implementation methods for WDDL in FPGAs. Other logic styles are also differential, such as MDPL, a masked logic introduced in [15] . Despite the great advantages provided by differential logic such as WDDL or MDPL, it has been shown in [22, 16] that this type of logic still has a slight imbalance due to early evaluation or technological bias. However, no successful attack on differential logic in FPGAs has been reported so far. Secured designs in FPGAs based on masked logic are also described by François-Xavier Standaert in [21] .
The seminal publication [23] suffers from a large area overhead due to the restriction to the minimal library consisting of only {INV, AND, OR} (3 gates). In [25] , a clustering method allows the use of all AND-OR combinations (166 gates in LuT4 FPGAs). The method is shown to be automatable thanks to an ASIC synthesizer in [24] .
WDDL in FPGAs has already been studied by Pengyuan Yu and Patrick Schaumont [26] . Separated Dynamic Dual-Rail Logic (SDDL), described in [23] , is shown experimentally to fail because of glitches caused by a race between a global signal (precharge) and local signals (differential data pairs). Double WDDL (DWDDL), introduced in [27, 26] , is more secure than SDDL, but it at least quadruples the implementation area, and an EMA could take advantage of the distance between the two duplicated areas. One previous paper [17] shows that an integrated antenna of about 40 µm extension can measure EM emanations selectively.
In the rest of this Section, we present a case study on the DES [12] cryptographic algorithm. As such, DES is no longer suitable for block encryption, because its key length of 56 bits is too short [7] . Massively parallel or networked machines can indeed exhaust the $2^{56}$ keys in a few days. This certainly compromises DES ciphertexts, and definitely ruins any hope of forward secrecy. AES [13] is the successor of DES, with a key length of at least 128 bits. However, when DES is used as DESX (the standard DES is sandwiched between an input and an output Vernam masking of the plain- and ciphertext), or as triple DES (as described in appendix 2 of [12] ), it is still considered secure. It has been selected, amongst others, for the international passport and is still used in banking applications, for instance. The main appeal of DES lies in its compactness when implemented in hardware. We have therefore used a fully-fledged DES (achieving simple and triple DES, with all specified modes of operation), whose architecture is described in [9] . As detailed later on in Sec. 3, we managed to fit several DES instances in a single FPGA.
Our goal is to evaluate WDDL (positive or not, denoted respectively WDDL+ and WDDL) on a real embedded cryptographic application. We emphasize that the results presented in this paper are the first experimental implementations of and attacks on a full-featured cryptographic system-on-chip equipped with a DES processor protected by WDDL. To be exact, we present WDDL and WDDL+ experimental results at the logical level only. The backend has been delegated without constraints to the automatic "partition, place-and-route" tools. However, given the symmetry of the WDDL and WDDL+ netlists, we assume that the backend implementation does not drastically deteriorate the logical symmetry between the dual networks. Albeit intuitive, this hypothesis nonetheless remains to be verified on more accurate setups. Our experience, detailed in the sequel, is that the most secure power-constant logic (WDDL+) is already fairly strong against straightforward attacks, without any supplementary backend-level intervention.
Design Secure Partitioning
In DES, the control is independent of the data. Whatever the key or the plaintext, the algorithm consists of sixteen consecutive rounds. It is thus only required to secure the datapath (made up of the message and the key paths). The control part is never attacked because it conveys no useful information.
Therefore, the datapath is implemented in WDDL, whereas the control remains regular. The source code of the cryptographic engine can be written in an HDL in behavioral (aka RTL) style. Only the ad hoc converters, which ensure the transcoding between single- and dual-rail logic blocks, are described in a structural style.
Synthesis in WDDL
On the one hand, the non-secure blocks, typically the control of DES, can be synthesized with the toolchain that comes with the FPGAs (Altera quartus or Xilinx ise). Alternatively, generic synthesizers, such as precision by Mentor Graphics or Synplify Pro by Synplicity, can be fair FPGA-vendor-independent substitutes.
On the other hand, secure blocks must pass through a more specific synthesis process. Power-constant dual-rail logic, such as the DES data- and key-paths, must be synthesized carefully. We base our WDDL design on the following dualization of a look-up-table (LuT) f : the dual gate g of f satisfies:
$g(x) = \overline{f}(\overline{x})$ , (1)
where $\overline{f}$ is the complement of f and $\overline{x}$ is the bitwise complement of the input word x. This corresponds to a 0x0 spacer for the netlist precharge. For the spacer to propagate, LuTs must satisfy the wave condition:
$\mathrm{LuT}(00 \cdots 0) = 0$ and $\mathrm{LuT}(11 \cdots 1) = 1$ . (2)
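As a hedged sketch of how the dualization and the wave condition can be checked on a configuration mask, the snippet below assumes that the dual is the De Morgan dual, g(x) = NOT f(NOT x), and that bit x of the mask stores the LuT output for the input word x. Both the bit ordering and the function names are assumptions made for illustration.

```python
def lut_output(mask, x):
    """Output of a LuT whose mask stores f(x) in bit x (assumed ordering)."""
    return (mask >> x) & 1

def dual_mask(mask, n_inputs=4):
    """Mask of g(x) = NOT f(NOT x), the De Morgan dual of f."""
    size = 1 << n_inputs
    dual = 0
    for x in range(size):
        x_bar = (size - 1) ^ x            # bitwise complement of the input word
        if lut_output(mask, x_bar) == 0:  # g(x) = 1 exactly when f(x_bar) = 0
            dual |= 1 << x
    return dual

def satisfies_wave_condition(mask, n_inputs=4):
    """True if LuT(00...0) = 0 and LuT(11...1) = 1."""
    last = (1 << n_inputs) - 1
    return lut_output(mask, 0) == 0 and lut_output(mask, last) == 1

or4 = 0xFFFE
and4 = dual_mask(or4)
print(f"{and4:04X}", satisfies_wave_condition(or4), satisfies_wave_condition(and4))
```

With this convention the dual of the OR4 mask FFFE is the AND4 mask 8000, and both masks propagate the all-zero spacer.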
For the sake of example, we illustrate how Eq. (1) applies to Altera low-end FPGAs. The encoding for the Altera Stratix 4 → 1 LuTs is given in Tab. 1 for some representative gates.
Therefore, the dual of a LuT4 gate, such as the one described in Fig. 1 , is obtained by replacing the look-up-table configuration mask:
• lut_mask="FFFE"; /* [OR4] Direct */ by
• lut_mask="8000"; /* [AND4] Dual */.
Now, the wave propagation constraint (2) restricts the set of usable LuT configurations; the synthesis is thus less efficient than with a complete library. This issue is further discussed in Sec. 4. To restrict the available gates, we need a way to constrain the synthesizer to use only some specified cells. This possibility exists only for ASIC synthesizers. Due to their internal heuristics, these tools require at least one flip-flop (DFF or fpga fdr), one inverter (IV or fpga iv) and one two-input gate (say AN2 or fpga an2).
A typical duplication for a WDDL DES datapath netlist is illustrated in Tab. 2. Regular 4 → 1 LuTs are named fpga lut4.
Synthesis in WDDL or in "Positive" WDDL (aka WDDL+)
Property (2) is not enough to ensure the confidentiality of the data. The design must also be free of any glitch. As a consequence, every tabulated function must be positive, i.e. monotone non-decreasing in each of its inputs. Otherwise, glitches can show up. As glitches are data-dependent, they are indeed a security weakness. For instance, in Fig. 2, the functions f (a, b, c 
Overhead Incurred by WDDL Netlist Styles
We studied the performance of the WDDL and WDDL+ DES modules. All the DES modules share the same 8-bit VCI interface with an addressing range of 8 bits. They all embed 2,048 memory bits (a 256-byte RAM); their area and maximal frequency are given in Tab. 3. The throughput is computed in simple DES-ECB with 56-bit key and in triple-DES-OCB with 112-bit key modes of operation. To simplify the estimation, we assume that the memory is infinite, and thus neglect the initial latency caused by the loading of the key and the final latency associated with saving the last block in RAM. The encryption of one block lasts 16 clock cycles for the single-ended DES module, but 2 × 16 for the dual-rail modules.
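The throughput estimate described above reduces to a one-line formula, sketched here under stated assumptions: the DES block is 64 bits wide, triple DES is modelled as three consecutive DES passes, and the clock frequency is a placeholder that is not taken from Tab. 3.

```python
def throughput_bps(f_max_hz, cycles_per_block, block_bits=64, passes=1):
    """Steady-state throughput, ignoring key loading and final RAM write-back."""
    return block_bits * f_max_hz / (cycles_per_block * passes)

F_HYPOTHETICAL = 100e6  # placeholder frequency in Hz, NOT a value from Tab. 3

# Single-ended DES: 16 cycles per block.
single_des = throughput_bps(F_HYPOTHETICAL, 16)
# Dual-rail DES: 2 x 16 cycles per block (precharge + evaluation).
dual_rail_des = throughput_bps(F_HYPOTHETICAL, 2 * 16)
# Dual-rail triple DES, assumed to cost three DES passes per block.
dual_rail_3des = throughput_bps(F_HYPOTHETICAL, 2 * 16, passes=3)

print(single_des, dual_rail_des, dual_rail_3des)
```

At equal clock frequency the dual-rail module therefore delivers half the throughput of the single-ended one; any difference in maximal frequency scales the result further.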
Experimental Evaluation of WDDL Security
State-of-the-Art about Attacks on FPGAs
The first attack on an FPGA (a handmade Xilinx Virtex 800 board) was reported in 2003 [14] . The impact of the RTL architecture on the leakage was studied the following year [19] . Some improvements, made possible by signal pre-processing (such as filtering and averaging), are presented in [20] . The overall conclusion of these studies is that unprotected implementations in FPGAs are vulnerable to side-channel attacks, even if their dissipation process is different from that of ASICs. Some acquisition improvements were made in 2007 [11] . On independent acquisition benches, an attack on AES programmed in an Altera Cyclone was implemented successfully by exploiting EMA signals [3] . The first attack on a complete system-on-chip embedding a cryptoprocessor is reported in [5] . This study shows that even small cryptographic applications are at risk in FPGAs.
Evaluation Methodology
We have embedded a secured DES processor into a custom system-on-chip (SoC) architecture called "SecMat" [6] . This setup represents a realistic usage of FPGAs as security devices. Additionally, the SoC was equipped with an unprotected DES processor, to serve as a reference.
Evaluation Board and its Customization
We chose a Parallax board for its simplicity, and also because it can accommodate the whole SecMat SoC along with the DES co-processors. We illustrate the synthesis on the example of Altera, but the principle could be applied as well to any other tool that can read a structural netlist. The board is shown in Fig. 3 . In the bottom left corner, one can notice the small power shunt circuitry based on a coil-resistor impedance. The advantage of this device is that no coupling capacitors have to be removed and that transient currents can be captured with enough sensitivity. This small intrusion makes it possible to perform acquisitions with a differential probe connected on each side of the coil.
Attacks on Experimental Power Traces
We attack the regular module with the Hamming distance model. This model corresponds to the CMOS power dissipation produced by a signal transition. The attack on the WDDL circuit uses the Hamming weight model, in which the dissipation depends on the signal level rather than on the transition. This is because the Hamming distance between the precharge state (all zeroes) and the evaluation state degenerates into a Hamming weight. Two power attacks, Differential Power Analysis (DPA) and Correlation Power Attack (CPA), as described in [10] , are performed. A first acquisition campaign of 67,753 traces, corresponding to the activity of the reference DES coprocessor, was performed. This module has been completely broken, as shown in Table 4 , by attacking the first round of DES. Table 4 shows the maximal covariance (for DPA) or correlation (for CPA) levels, which serve as a reference to be compared with the levels of the WDDL-protected DES. The CPA traces for every sbox are shown in Fig. 4 . Similar attacks are carried out on the protected DES modules. With a view to enhancing the attacker's strength, we focused the acquisitions (sampled at 20 Gsamples/s) around the first round of DES. Three sboxes of the non-positive WDDL module are recovered (see the WDDL rows of Tab. 4).
The CPA is more efficient than the DPA because the correlation is normalized, which makes it possible to recover the signal in a noisier environment. Only one of the WDDL+ sboxes is broken after 123,743 traces (see the WDDL+ rows of Tab. 4). The positive logic used in WDDL+ to remove the glitches thus provides greater robustness. However, the fact that one sbox can be broken shows that the countermeasure is not fully effective, and more sboxes could be broken by enhancing the sensitivity of the acquisition platform or the attack algorithm.
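To make the attack principle concrete, here is a minimal CPA sketch on simulated data. It is not the attack code used for the measurements: it assumes a toy 4-bit S-box (the PRESENT cipher S-box, chosen only to keep the key space small), a single leakage sample per trace equal to the Hamming weight of the S-box output plus Gaussian noise, and a hypothetical sub-key. The Hamming weight model mirrors the WDDL case, where the distance from the all-zero precharge state equals the weight of the evaluated word.

```python
import random

# Toy 4-bit S-box (PRESENT cipher), used only to keep the example small.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def hamming_weight(x):
    return bin(x).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def simulate_traces(key, n_traces, noise_sigma, rng):
    """One leakage sample per trace: HW of the S-box output plus noise."""
    plaintexts = [rng.randrange(16) for _ in range(n_traces)]
    leakage = [hamming_weight(SBOX[p ^ key]) + rng.gauss(0.0, noise_sigma)
               for p in plaintexts]
    return plaintexts, leakage

def cpa_attack(plaintexts, leakage):
    """Return the key guess whose HW prediction correlates best with the leakage."""
    scores = {}
    for guess in range(16):
        prediction = [hamming_weight(SBOX[p ^ guess]) for p in plaintexts]
        scores[guess] = pearson(prediction, leakage)
    return max(scores, key=scores.get), scores

rng = random.Random(2024)
SECRET_KEY = 0xB  # hypothetical sub-key for the simulation
plaintexts, leakage = simulate_traces(SECRET_KEY, 2000, 1.0, rng)
best_guess, scores = cpa_attack(plaintexts, leakage)
print(hex(best_guess), round(scores[best_guess], 3))
```

Because the Pearson coefficient is normalized by the standard deviations of both sequences, the ranking of key guesses does not depend on the gain of the acquisition chain, which is one way to read the robustness of CPA mentioned above.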
Synthesis Optimization of WDDL+ Netlists
Synthesis with Legacy Tools
Some substitution boxes are synthesized with various synthesizers. For DES and Kasumi (Feistel ciphers), the sboxes are given according to the standard. For AES (substitution-permutation network), the sbox and its inverse are studied.
The RTL description is tabulated. We chose this solution to avoid any segregation between the sboxes based on their internal structure. More compact netlists could be obtained with descriptions that take advantage of the mathematical structure of the Kasumi or AES sboxes. The results are listed in Tab. 5 for the Stratix using quartus version 7.1. Similar results for the Virtex-II using ise 9.2i are reported in Tab. 6. The performance of the Cadence ASIC synthesizers bgx shell (64-bit version v05.15-s095+1) and rc (64-bit version v06.10-s017 1) is assessed in Tab. 7. They are tuned to spend the maximal effort on area optimization. rc is not as good as bgx shell. The results obtained with ASIC synthesizers read as follows:
• The library "plain" contains all the $2^{2^n}$ cells: it is meant to provide a lower bound for the area.
• The library "WDDL" contains all the $2^{2^n - 2}$ cells that satisfy (2) . After synthesis, the inverters are removed, and the number of remaining gates is doubled.
• The library "WDDL positive" (WDDL+) contains the positive cells that, in addition, propagate 0 and 1, as per (2) . The number of these functions is equal to the Dedekind number $M(n)$ [18] minus two (the two constant functions zero and one). Although Dedekind first considered this question in 1897, there is still no concise closed-form expression for $M(n)$. There are 4 (resp. 18, 166, 7 579, 7 828 352) WDDL+ gates with two (resp. 3, 4, 5, 6) inputs.
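The three library sizes can be cross-checked by brute-force enumeration for small n. The sketch below assumes that bit x of a mask stores the LuT output for the input word x, and that "positive" means monotone non-decreasing in every input; the helper names are illustrative.

```python
def is_positive(mask, n):
    """True if raising any single input from 0 to 1 never lowers the output."""
    for x in range(1 << n):
        fx = (mask >> x) & 1
        for i in range(n):
            if not (x >> i) & 1:  # input i is 0 in x, so x | (1 << i) covers x
                if fx > (mask >> (x | (1 << i))) & 1:
                    return False
    return True

def wave_ok(mask, n):
    """Wave condition: f(0...0) = 0 and f(1...1) = 1."""
    return (mask & 1) == 0 and (mask >> ((1 << n) - 1)) & 1 == 1

def library_sizes(n):
    """Sizes of the 'plain', 'WDDL' and 'WDDL+' libraries for n-input LuTs."""
    masks = range(1 << (1 << n))
    plain = len(masks)
    wddl = sum(1 for m in masks if wave_ok(m, n))
    wddl_plus = sum(1 for m in masks if wave_ok(m, n) and is_positive(m, n))
    return plain, wddl, wddl_plus

for n in (2, 3, 4):
    print(n, library_sizes(n))
```

The enumeration is only practical up to n = 4 (65,536 masks); the counts for 5 and 6 inputs quoted above come from the known Dedekind numbers rather than from this kind of exhaustive search.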
A Novel Heuristic to Compact DES WDDL+ Sboxes
We present a heuristic for achieving a better synthesis for 6 → 4 substitution boxes, such as that of DES. The goal is [4] . Hence only 10 two-input gates lead to nontrivial LuT4 instances. All these 10 LuTs are instantiated, and shared between the tail logic of the 4 output bits. The tail part consists in 2 multiplexor trees (one true and one false) of 16 inputs and 4 inputs. This tree can be synthesized with 2 × 4 × (8 + 4 + 2 + 1) = 120 two-input multiplexors. The overall construction requires 120 + 10 = 130 positive LuT4s. This figure, albeit close to that obtained by the bgx shell and rc synthesizers, is better for all eight DES sboxes (refer to the last line of Tab. 7). Moreover, it is likely that, by using some peculiarities of the 6 → 4 sboxes, the 130 LuT4s score can be improved. For instance, it might happen that some inputs of the tail 4-input multiplexor tree are constant, or that some resources can be shared between the true and false dual networks.
In any case, we conclude that using ASIC synthesizers for generating positive logic is relevant, but that some optimizations are possible. As a perspective, we emphasize that there is room for custom WDDL+ synthesizers or for enriching legacy synthesis tools with new heuristics when the mapping library is recognized as positive.
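The LuT count of this construction can be re-derived with a few lines of arithmetic. The sketch assumes a balanced binary multiplexor tree with a power-of-two number of inputs and one LuT4 per two-input multiplexor; the constant names are illustrative.

```python
def mux_tree_size(n_inputs):
    """Number of two-input multiplexors in a balanced tree selecting 1 of n_inputs.

    Assumes n_inputs is a power of two.
    """
    count = 0
    while n_inputs > 1:
        n_inputs //= 2   # each tree level halves the number of candidates
        count += n_inputs
    return count

RAILS = 2         # true and false networks
OUTPUT_BITS = 4   # a 6 -> 4 sbox has four output bits
SHARED_LUTS = 10  # nontrivial two-input functions shared by all outputs

tail = RAILS * OUTPUT_BITS * mux_tree_size(16)
total = tail + SHARED_LUTS
print(tail, total)
```

A 16-input tree needs 8 + 4 + 2 + 1 = 15 multiplexors, which gives the 120 tail LuT4s and the total of 130 quoted in the text.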
Comparison between bgx shell and rc
As already shown in Tab. 7, bgx shell is better than rc at synthesizing substitution boxes. However, when it comes to simple blocks (everything but the sboxes), rc appears to produce more compact netlists than bgx shell. For instance, a two-input XOR or a two-input multiplexor is synthesized in 4 LuT4s by bgx shell, but in 2 LuT4s by rc. The solution found by rc is optimal. The XOR is mapped as the positive function $f(a_t, a_f, b_t, b_f) \doteq a_t \cdot b_f + a_f \cdot b_t$ , while the multiplexor is inferred as $f(s_t, s_f, a_t, b_t) \doteq a_t \cdot s_f + b_t \cdot s_t$ . Therefore, the area reported in Tab. 3 could be decreased by using bgx shell for the sboxes and rc for the rest of the DES datapath.
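The two mappings found by rc can be checked exhaustively. The following sketch verifies, for the true-rail functions quoted above, that they compute the intended logic on valid dual-rail codewords, that they propagate the all-zero spacer, and that they are positive (monotone). The helper names are illustrative and the check covers the true rail only.

```python
from itertools import product

def xor_true(a_t, a_f, b_t, b_f):
    """True rail of the dual-rail XOR: a_t.b_f + a_f.b_t."""
    return (a_t & b_f) | (a_f & b_t)

def mux_true(s_t, s_f, a_t, b_t):
    """True rail of the dual-rail multiplexor: a_t.s_f + b_t.s_t."""
    return (a_t & s_f) | (b_t & s_t)

def is_positive(f, n):
    """Monotone check: raising one input from 0 to 1 never lowers the output."""
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if f(*x) > f(*y):
                    return False
    return True

# Functional check on valid dual-rail codewords (x_f = NOT x_t).
xor_ok = all(xor_true(a, 1 - a, b, 1 - b) == a ^ b
             for a, b in product((0, 1), repeat=2))
mux_ok = all(mux_true(s, 1 - s, a, b) == (b if s else a)
             for s, a, b in product((0, 1), repeat=3))
# Precharge check: the all-zero spacer must propagate to the output.
spacer_ok = xor_true(0, 0, 0, 0) == 0 and mux_true(0, 0, 0, 0) == 0

print(xor_ok, mux_ok, spacer_ok,
      is_positive(xor_true, 4), is_positive(mux_true, 4))
```

Each function uses four rails and no complemented literal, so it fits a single positive LuT4; with one LuT per rail this is consistent with the 2 LuT4s reported for rc.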
Conclusions
The ability of power-constant logic styles to impede power attacks in FPGAs has been demonstrated experimentally. Incidentally, we report the first attack against a non-positive WDDL DES co-processor implemented in an FPGA. The attack relies on minimally intrusive acquisition circuitry and on DPA or CPA strategies. The positive WDDL version proves to be more secure than its non-positive counterpart. However, the slight imbalance of this robust logic type should be detectable with a more sensitive acquisition platform. A custom tool based on both FPGA and ASIC logic synthesizers has been built to obtain non-positive and positive WDDL netlists. The estimated overhead is about a factor of 2, but the area bloat can be reduced by using home-made heuristics. We report a constructive method to generate 6 → 4 sboxes in a more compact way than ASIC synthesizers. We therefore expect a new market for secured synthesizers to appear.
