The use of field programmable devices in security-critical applications is growing in popularity; in part, this can be attributed to their potential for balancing metrics such as efficiency and algorithm agility. However, in common with non-programmable alternatives, physical attack techniques such as fault and power analysis are a threat. We investigate a family of next-generation field programmable devices, specifically those based on the concept of time multiplexing, within this context: our results support the premise that extra, inherent flexibility in such devices can offer a range of possibilities for low-overhead, generic countermeasures against physical attack.
INTRODUCTION
Within the context of countermeasures against physical attack, flexibility often represents an important consideration. Specifically, realising such countermeasures can be easier if the underlying platform is more flexible; for example, although general-purpose processors are flexible in terms of what they can execute, their fixed design usually requires alteration to support efficient temporal skewing or shuffling. Field Programmable Gate Arrays (FPGAs), and reconfigurable fabrics more generally, therefore represent an interesting option. In particular, their flexibility can allow various generic countermeasures without the associated cost of platform redesign or redeployment (even if reconfiguration is required).
The Soft Gate Array (SGA) [1] is an SRAM-based example of a Time Multiplexed FPGA (TMFPGA) design: SGA aims to overcome the remaining limitations in other TMFPGA designs, namely dynamic and static power consumption. One might argue these advantages align with use-cases where physical attacks are an issue (namely embedded or mobile computing devices). As a result, we proactively investigate whether the added flexibility afforded by an SGA can be translated into mechanisms for realising generic countermeasures, with particular focus on Differential Power Analysis (DPA) [2].
Space restrictions mean we rely heavily on background in the full version of this paper [3]; our focus is presentation of associated results only. These results are produced by an experimental, VHDL-based cycle-accurate SGA simulator. We used a simple power model whereby power consumption scales in proportion to switching behaviour; this metric was extracted from each simulation with an existing framework [4] shown to give reliable estimations within the context of DPA. Each actual DPA attack was performed using the resulting data by applying the OpenSCA toolbox in Matlab [5].
GLITCHES AND EARLY EVALUATION IN CRYPTOGRAPHIC CIRCUITS
Mangard et al. [6] describe a DPA attack on an AES S-box with integrated, masking-based countermeasures. The attack succeeds due to transient behaviour, or glitches, on intermediate signals in the circuit. Such glitches typically result from unbalanced paths, and therefore apply to both FPGA- and ASIC-based implementations. An SGA protects against such glitches. One can view evaluation as an n-stage pipeline, wherein each stage is a single gate: the output of each stage is latched before being reused as an input, meaning the same transient behaviour is suppressed. For an in-depth discussion of the glitch-free behaviour of the SGA fabric the reader is referred to the full version of the paper [3].
Note that the number of transitions in the SGA-based implementation is clearly data-dependent (as for any CMOS-based hardware). However, standard countermeasures like hiding or masking, which have often been thwarted in the past by the effects of glitching and early evaluation, could potentially be more effective on an SGA which inhibits these effects by design.
SGA-BASED COUNTERMEASURES AGAINST POWER ANALYSIS ATTACKS
Overview
In a power analysis attack, an attacker monitors the power consumption of a target device while it evaluates some function; through analysis of the inputs, outputs and power consumption traces, the attacker hopes to recover some embedded, security-critical information from the target. To illustrate the problem this presents, we mounted a standard attack on a simulated SGA-based implementation of the AES S-box; no countermeasures were employed. Note that this operation alone (rather than the whole cipher for example) is a valid example, since correct prediction of the S-box input from the power consumed during evaluation allows an attacker to recover the entire AES key in a byte-wise fashion. To map the hypothetical S-box values to hypothetical power consumption values, we applied the Hamming-weight model, i.e., we assumed that the power consumption of the SGA is directly related to the number of bits set to one in the data being processed. Figure 1a illustrates the correlation coefficients computed for all 256 key candidates after processing 1,000 traces; the correct hypothesis (namely k = 173) leads to a significant correlation coefficient of ρ = 0.66. Figure 1b illustrates the evolution of correlation as the number of traces increases: incorrect key candidates are plotted in grey, while the correct key candidate is highlighted in black. Per [7, Page 148], this allows estimation of the number of traces required for a successful attack (namely ∼ 45); the same approach allows estimation of countermeasure efficacy (where relevant).
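The attack described above can be illustrated with a minimal sketch of a correlation-based attack under the Hamming-weight model. It is not the OpenSCA-based flow used for the reported results: the leakage (Hamming weight of the S-box output plus Gaussian noise), the noise level, the trace count, and all function names are assumptions made purely for illustration, so the correlation values it produces will not match Figure 1.

```python
import random


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Build the AES S-box: field inversion followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = 0x63
        for n in range(5):
            s ^= rotl8(inv, n)
        sbox.append(s)
    return sbox


SBOX = build_sbox()


def hamming_weight(x):
    """Number of bits set to one in x."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def simulate_traces(key, n_traces, noise_sigma, rng):
    """Toy leakage (an assumption): HW of the S-box output plus Gaussian noise."""
    plaintexts = [rng.randrange(256) for _ in range(n_traces)]
    traces = [hamming_weight(SBOX[p ^ key]) + rng.gauss(0.0, noise_sigma)
              for p in plaintexts]
    return plaintexts, traces


def cpa_attack(plaintexts, traces):
    """Correlate every key hypothesis with the traces; return (best key, all correlations)."""
    correlations = []
    for k in range(256):
        hypothesis = [hamming_weight(SBOX[p ^ k]) for p in plaintexts]
        correlations.append(pearson(hypothesis, traces))
    best = max(range(256), key=lambda k: abs(correlations[k]))
    return best, correlations


rng = random.Random(1)
pts, trs = simulate_traces(key=173, n_traces=500, noise_sigma=1.0, rng=rng)
best_key, corr = cpa_attack(pts, trs)
```

Plotting `corr` against the key index gives a picture analogous to Figure 1a, and re-running `cpa_attack` on growing prefixes of the traces gives one analogous to Figure 1b.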
Countermeasures
At a high level, and ignoring approaches such as hardware shielding, countermeasures against DPA can be classified as based on either a hiding (breaking the link between execution and traces) or a masking (breaking the link between execution and algorithm) approach. Hiding countermeasures typically attempt to make each trace constant for all possible values of the security-critical information, or entirely random; in both cases the premise is that a trace is no longer related to said information.
By leaning on existing design features (esp. the high degree of flexibility wrt. when and where computation occurs), an SGA-based fabric can be used to realise various generic countermeasures of these types. Each case requires at most a minor alteration to the base SGA design, and can be described as generic in that application can be (semi-) automated without relying on the functionality (i.e., algorithm) being implemented. Additionally, the countermeasures can often be composed to amplify the security benefits they provide individually. We use the rest of this section to describe both the approaches and results, which are summarised in Table 1.
Buffer randomisation
The eight bits of output from the S-box implementation are generated during different phases: to provide a collective final output at some boundary (either phase or system cycle), they need to be selectively buffered before communication to subsequent logic blocks. By design, the buffers are initialised to zero. Thus, under the power model, the Hamming weight of latched data is directly related to power consumption. For that reason, a DPA attack can recover the input with only a small number of traces.
As shown in our results in Table 1, the complexity of such an attack can be significantly increased by randomly initialising the buffers instead. Two directions are possible when considering how to realise this approach. First, one might focus on an unaltered SGA and attempt to interleave a PRNG implementation into the S-box using free slices; this would permit the PRNG output to initialise buffers with essentially no overhead. Second, one might alter the SGA to allow each slice to be controlled by a mode flag: in the i-th phase, said flag has the slice either operate as normal, or draw a configuration from some external (local or global) source of randomness (thus randomising computation).
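The intuition behind randomising the buffers can be sketched with an idealised toy model. Under the switching-based power model, latching a value into a buffer costs roughly the Hamming distance between the old and new contents: with a zero-initialised buffer this equals the Hamming weight of the output, whereas with a uniformly random initial value it is statistically independent of it. The function names and the single 8-bit buffer are assumptions for illustration; the model ignores all other switching activity in the fabric, so it does not reproduce the trace counts in Table 1.

```python
import random


def hamming_weight(x):
    """Number of bits set to one in x."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def buffer_toggles(initial, latched):
    """Bits that switch when `latched` overwrites `initial` (Hamming distance)."""
    return hamming_weight(initial ^ latched)


def leakage_correlation(n_samples, randomise, rng):
    """Correlate HW(output) with the buffer switching activity."""
    outputs = [rng.randrange(256) for _ in range(n_samples)]
    # Zero initialisation models the default design; random initialisation
    # models the countermeasure (hypothetical 8-bit buffer).
    inits = [rng.randrange(256) if randomise else 0 for _ in range(n_samples)]
    toggles = [buffer_toggles(i, o) for i, o in zip(inits, outputs)]
    weights = [hamming_weight(o) for o in outputs]
    return pearson(weights, toggles)


rng = random.Random(0)
rho_zero_init = leakage_correlation(5000, randomise=False, rng=rng)
rho_random_init = leakage_correlation(5000, randomise=True, rng=rng)
```

In this toy setting `rho_zero_init` is essentially 1 and `rho_random_init` is close to 0, which is the effect the countermeasure aims for on the buffers themselves.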
We performed two experiments: the first with reconfiguration of only those slices which directly influence the buffers (additional unused slice configurations were set to zero), and the second with reconfiguration of all unused slices with random values. In the former, the number of traces required increased only marginally, to ∼ 300; in the latter, the attack failed even given 10,000 traces. We stress that we artificially aborted the experiment after 10,000 traces, and that of course generation of additional traces will eventually allow recovery of the target value; the point is that the threshold for success is now so great that the attack should be deemed less viable.
Table 1: Results of a DPA attack on the AES S-box: effectiveness of various countermeasures. Note that the number of traces required for a successful attack is an estimate of the security level, and that the number of slices used is out of a possible 32 in total; the unused slices represent additional options for the implementation of countermeasures outlined throughout Section 3.
Phase randomisation
Imagine selecting an SGA parameterisation where instead of p phases, there are p′ = p + δ. One motivation for doing this could be some form of optimisation; for example, more phases in each system cycle might reduce the number of system cycles. Another, more pertinent, motivation is to include a degree of freedom (governed by δ) wrt. scheduling of computation. Specifically, one might consider:
Phase skewing: The idea is to randomise the point in time when a particular step of computation occurs (in our example, when a bit of the S-box output is computed). This is achieved by simply "skipping" δ randomly selected phases, meaning the overall system cycle takes p′ phases but a given i-th phase might not be evaluated when expected per the original schedule.
Dummy computation: Instead of idle phases as above (which could arguably be detected and eliminated by an attacker), an incremental extension is to have slices compute some dummy (or fake) computation.
In both cases, the randomised control signals required can be obtained relatively easily from existing digital clock managers [8]. Table 1 presents the result of mounting a DPA attack on such implementations. We stress that the experiments use δ = 1 only in order to show the results are consistent with theory: larger choices of δ cause the number of traces required to increase further, trivially achieving a much higher security level (with associated degradation in performance).
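The scheduling idea behind phase skewing and dummy computation can be sketched as follows. This is only an abstract model of the schedule, not of the SGA control logic: the function name, the use of a software random number generator in place of the clock-manager-derived control signals, and the `"dummy"` marker are all assumptions for illustration.

```python
import random


def skewed_schedule(p, delta, rng, dummy=False):
    """Spread p real phases over p + delta slots, skipping delta random slots.

    Skipped slots are idle (None) for phase skewing, or marked "dummy" when
    the slices should perform a fake computation instead. The relative order
    of the real phases is preserved, so dependencies are not violated.
    """
    total = p + delta
    skipped = set(rng.sample(range(total), delta))
    filler = "dummy" if dummy else None
    schedule, next_phase = [], 0
    for slot in range(total):
        if slot in skipped:
            schedule.append(filler)
        else:
            schedule.append(next_phase)
            next_phase += 1
    return schedule


rng = random.Random(0)
skew = skewed_schedule(p=8, delta=1, rng=rng)
fake = skewed_schedule(p=8, delta=2, rng=rng, dummy=True)
```

Each call yields a fresh schedule in which the i-th real phase lands in slot i up to slot i + δ, which is the time uncertainty an attacker has to overcome when aligning traces.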
With some effort, the same approaches are of course viable on FPGA-like platforms. Crucially however, the fact that phase-based evaluation is inherent on an SGA means the countermeasure can be applied generically; on an FPGA, the same is not true. For example, clock randomisation on an FPGA [8] requires at least some algorithm-specific detail about clock signals and clocking strategy. Additionally, on an SGA randomisation influences delay between the computation of single LUTs, and thus makes the countermeasure very fine-grained. The same is not true for FPGAs, where the same delays will relate to computation between two consecutive flip-flops with combinational logic between them: usually this represents a more coarse-grained approach.
A more aggressive approach still, which we defer to further work, would be a phase-oriented analogue to instruction shuffling; the idea would be to reorder phases (while retaining dependencies) instead of skewing them in time. Versus skewing per the description above, similar security benefits result and potential improvements wrt. efficiency are possible: if the phase dependencies allow, δ could be small (even zero) while producing the same security benefit. High-level, algorithm-specific implementations of this concept in hardware are known [9]; realising an algorithm-neutral version on an SGA seems difficult as the result of managing dependencies between phases. This is, for example, more difficult than the case of instruction dependencies in software [10].
Complementary computation
Approaches to ensuring constant power consumption during evaluation of some functionality can be considered at a variety of levels. For example, at a low level the concept of specialist logic styles [11, 12] can be considered; at a higher level options such as MUTE-AES [13] are possible. In the latter, the idea is to have two processors execute the same operation in lock-step, but ensure one computes with intermediate data that is the complement of the other: in essence, power consumption is balanced at each step of computation. However, realistic use of the concept needs to consider at least two criticisms, namely 1. construction of suitable complementary functionality seems hard to generalise to all high-level algorithms, and 2. a careful approach to synchronisation of steps, and the problem of early evaluation within those steps, is required. An SGA interconnect uses an LVDS scheme: this means signals are (and hence communication is) balanced by design. In addition, the availability of original and complement signals means computation can also be balanced with relatively little overhead: one produces a complementary LUT for each original LUT, and places the resulting slices so both evaluate in the same phase (using appropriate inputs from the interconnect). Crucially, the approach is generic and no alteration to the SGA architecture is required: only (semi-) automatable effort at design-time during synthesis and place and route is needed.
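One way to picture the construction of a complementary LUT is the following sketch. It assumes a dual-rail style convention in which the complementary LUT is fed the complemented inputs (available on the balanced interconnect) and produces the complemented output; this convention, the truth-table representation, and the function names are assumptions for illustration rather than a description of the actual synthesis flow.

```python
def complementary_lut(lut):
    """Return LUT g' with g'(~x) = ~g(x), given g as a truth table.

    `lut` is a list of 0/1 outputs indexed by the input value; its length
    must be a power of two (2**k entries for a k-input LUT).
    """
    n = len(lut)
    mask = n - 1
    return [1 - lut[~x & mask] for x in range(n)]


def combined_output_weight(lut):
    """Per-input sum of the original and complementary outputs."""
    comp = complementary_lut(lut)
    mask = len(lut) - 1
    # The original LUT sees x while the complementary LUT sees ~x.
    return [lut[x] + comp[~x & mask] for x in range(len(lut))]


and2 = [0, 0, 0, 1]               # 2-input AND as a truth table
or2 = complementary_lut(and2)     # De Morgan: the complement pair of AND is OR
weights = combined_output_weight(and2)
```

Because exactly one of the two outputs is one for every input, the combined Hamming weight of the pair is constant, which is the balancing property the countermeasure relies on under the stated power model.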
We note that somewhat analogous techniques exist for ASIC [14, 15] and FPGA [16] platforms. For an SGA however, one can reasonably expect less overhead wrt. routing (by virtue of existing interconnect design), and more flexibility wrt. any area constraints. Specifically, duplication of resources (per the FPGA-based solution) to ensure balanced power consumption can easily hit a limit: with m LUTs in the design before, at least 2m are required afterwards. If an FPGA has fewer than 2m LUTs, the approach is simply not viable. With an SGA however, time sharing allows one to "spread" those 2m LUTs over the same number of slices but more phases, effectively making the trade-off between security and time rather than security and area.
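The area-versus-time trade-off can be made concrete with a small back-of-the-envelope sketch. It assumes every slice can evaluate one LUT per phase and ignores dependencies between LUTs, so the phase counts are lower bounds; the design size of 40 LUTs and the FPGA capacity of 64 LUTs are hypothetical, while the figure of 32 slices is taken from the caption of Table 1.

```python
def phases_needed(num_luts, slices):
    """Minimum number of phases to evaluate num_luts LUTs on `slices` slices."""
    return -(-num_luts // slices)  # ceiling division on integers


def fits_on_fpga(num_luts, fpga_luts):
    """On a conventional FPGA every LUT needs its own physical resource."""
    return num_luts <= fpga_luts


m = 40           # hypothetical unprotected design size in LUTs
slices = 32      # slices available on the SGA (per Table 1)

unprotected_phases = phases_needed(m, slices)
balanced_phases = phases_needed(2 * m, slices)
balanced_fits_fpga = fits_on_fpga(2 * m, fpga_luts=64)
```

In this example the balanced design needs three phases instead of two on the SGA, whereas it would not fit at all on the hypothetical 64-LUT FPGA.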
CONCLUSION
In this paper, we have demonstrated the advantages that extra flexibility in the design of field programmable logic can afford wrt. security. With a focus on an SGA architecture, but the concept of time sharing more generally, we showed how a range of countermeasures against power analysis attacks can be realised in a low-overhead manner (versus alternatives such as FPGAs).
On one hand, simulated results can only go so far: as with most aspects of physical security, detail relating to concrete implementation can be very important. On the other hand, similar devices are already gaining traction (cf. the Tabula ABAX family). With this in mind, proactive rather than reactive (esp. once deployed) investigation of such topics can act as an important design guide. For example, results in Section 3 demonstrate that only minor alterations to a baseline SGA architecture can yield tangible benefits. Put another way, treating security (in this case against fault and power analysis, but more generally also) as a first-class design metric now could allow more satisfactory use in security-critical applications in the future.
Based on the initial potential illustrated here, one can identify (at least) three areas of further work:
1. use of a more accurate power model to mitigate the use of simulation and improve relevance to physical test devices,
2. study of a full AES implementation, and
3. answering some questions wrt. an SGA-specific tool-chain, in particular whether it is feasible to realise the generic countermeasures in a fully automatic way.
