Abstract-A growing number of connected objects, subject to both high-performance and low-resource constraints, embed lightweight ciphers to protect the confidentiality of the data they manipulate or store. Since those objects are easily accessible, they are prone to a whole range of physical attacks. Among these are fault attacks, against which countermeasures are usually expensive to implement, especially on off-the-shelf devices. For such devices, we propose a new generic software countermeasure, using SIMD instructions available in almost any off-the-shelf device, to thwart most fault attacks while preserving the performance of the targeted cipher.
I. Introduction
The expansion of the Internet of Things (IoT) brings many benefits but also raises a number of issues with respect to security and privacy. Lightweight cryptography (LWC) is investigated in order to address IoT security issues while seeking the best trade-off between security, power consumption, performance and footprint.
During the last few years, several lightweight block and stream ciphers have been proposed [1]-[5]. These ciphers are mainly designed to resist black-box mathematical attacks. However, since they are used in IoT devices in pervasive environments, implementation-related attacks must also be considered. Resistance against side channel attacks [6] is now considered a valuable property which should be taken into consideration when designing lightweight ciphers [7]-[9]. Another kind of physical attack, based on fault injection, must also be considered [10]. Many such attacks have been introduced [11], [12], and the proposed countermeasures have a significant impact on the performance and size of cryptographic implementations, especially for off-the-shelf devices with no particular hardware mechanism to thwart them.
In this paper, we introduce a new paradigm, based on the use of Single Instruction Multiple Data (SIMD) instructions, for implementing spatial redundancies to thwart fault attacks. First, we describe the concept of using SIMD instructions which are increasingly available in off-the-shelf IoT devices. Then, we introduce a method for implementing this countermeasure in a completely generic way. Finally, we report practical experiments on the block cipher PRIDE and on the stream cipher TRIVIUM before concluding with some future work.
II. Intra-Instruction Redundancy
Recently, a countermeasure based on Intra-Instruction Redundancy (IIR) [13] was proposed to thwart fault attacks. It consists in using a bit-sliced implementation of a given cipher applied to 32 input blocks. The aim is to exploit a 32-bit architecture (the most widely used architecture in IoT devices) by taking as input 15 blocks of data interleaved with 15 blocks of redundancy and 2 reference blocks. The reference blocks are constant inputs (plaintexts and keys) for which the corresponding ciphertexts are known. IIR mainly thwarts single-bit fault models thanks to the redundancy, and instruction skips thanks to the reference blocks. Unfortunately, multi-bit fault models can still be effective: for example, a two-bit fault on the two copies of a data block will be undetectable. Moreover, IIR requires, in most cases, a less efficient implementation of the cipher due to the Boolean circuit transformation overhead necessary for bit-slicing [14], 15 blocks of data as input per encryption, and n words to store and manipulate an n-bit input. However, using reference blocks as part of a countermeasure is very effective against instruction skips. We therefore investigated the possibility of keeping this property while using a conventional (i.e. non-bit-sliced) implementation of a cipher. Moreover, we also looked at the possibility of starting from an efficient 8-bit implementation (usually the preferred option for lightweight ciphers) on a 32-bit architecture.
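The bit-sliced layout and its weakness against multi-bit faults can be illustrated with a minimal sketch. The slot assignment below (each data block placed next to its redundant copy, followed by the two reference blocks) and the function names are assumptions made for illustration; they are not taken from [13].

```python
def bitslice(blocks, n_bits):
    """Transpose 32 n-bit blocks into n 32-bit words: word j holds bit j of every block."""
    assert len(blocks) == 32
    words = [0] * n_bits
    for slot, block in enumerate(blocks):
        for j in range(n_bits):
            words[j] |= ((block >> j) & 1) << slot
    return words


def unbitslice(words):
    """Inverse transposition: recover the 32 blocks from the bit-sliced words."""
    return [sum(((w >> slot) & 1) << j for j, w in enumerate(words))
            for slot in range(32)]


def iir_layout(data, ref_a, ref_b):
    """Hypothetical slot assignment: 15 (data, copy) pairs, then 2 reference blocks."""
    assert len(data) == 15
    slots = []
    for d in data:
        slots += [d, d]
    return slots + [ref_a, ref_b]


def copies_agree(slots):
    """Redundancy check: every data block must equal its copy."""
    return all(slots[2 * i] == slots[2 * i + 1] for i in range(15))


data = list(range(100, 115))
words = bitslice(iir_layout(data, 0xAAAA, 0x5555), 16)

single = list(words)
single[3] ^= 0b01          # one-bit fault: hits only one copy of data[0]

double = list(words)
double[3] ^= 0b11          # two-bit fault: hits both copies of data[0] identically
```

In this sketch the one-bit fault is caught by `copies_agree`, whereas the two-bit fault corrupts the data block and its copy in the same way and passes the check. Note also that a 16-bit block occupies 16 words, which reflects the n-words-per-n-bit-input cost mentioned above.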
III. Internal Redundancy Countermeasure
In this section we describe how SIMD instructions can be used to implement cryptographic algorithms resistant to fault attacks. We shall call this approach the Internal Redundancy Countermeasure (IRC).
A. General Principle
It is common to use a 32-bit implementation of a cipher on a 32-bit architecture to fully exploit the architecture's capabilities. However, adding spatial redundancy in this case incurs a large memory overhead. In order to decrease it, we propose to use an 8-bit implementation of the cipher applied simultaneously to 4 blocks within a 32-bit word. We also use reference blocks to increase the countermeasure's efficiency. Each manipulated word is thus composed of one data byte, the corresponding byte of the reference block, and a copy of each, interleaved. Figure 1 shows a typical example of such a word.
Figure 1: Using SIMD on a 32-bit word structure
Then, we replace each 8-bit operator with a single stream of 32-bit instructions that performs the same operation independently on each byte, in a SIMD fashion. This has a timing overhead, since it generally requires more instructions than the original 32-bit implementation would, but it greatly decreases the memory overhead, since it uses a single stream of instructions instead of 4 parallel ones (which is not always possible, depending on the architecture). Finally, at the end of encryption or decryption, comparisons are made involving the different copies and the stored reference ciphertext. All corresponding copies are expected to be equal to each other, and each obtained reference ciphertext is in turn expected to be equal to the stored reference ciphertext. Therefore, to perform a successful fault injection, an attacker must obtain the same fault on each copy of the data without affecting the reference block. This can be extremely difficult to achieve in practice, especially when the reference block is interleaved between copies of the data. We now describe the different ways of using IRC.
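As a minimal sketch of the principle above, the following Python code emulates 32-bit words and shows one "overloaded" 8-bit operator (addition modulo 256) applied independently to the four byte lanes. The lane order (data, reference, data, reference) and all function names are assumptions chosen for illustration. On a core with SIMD support, such an operation would typically map to a single byte-parallel instruction rather than the masking shown here.

```python
MASK32 = 0xFFFFFFFF
LOW7 = 0x7F7F7F7F    # low 7 bits of every byte lane
HIGH = 0x80808080    # top bit of every byte lane


def pack(data, ref):
    """Build the word data | ref | data | ref (most significant lane first)."""
    assert 0 <= data < 256 and 0 <= ref < 256
    return (data << 24) | (ref << 16) | (data << 8) | ref


def lanes(word):
    """Split a 32-bit word into its four byte lanes, most significant first."""
    return tuple((word >> shift) & 0xFF for shift in (24, 16, 8, 0))


def add8x4(x, y):
    """Lane-wise addition modulo 256: no carry crosses a byte boundary."""
    low = (x & LOW7) + (y & LOW7)              # carries stay inside each lane
    return (low ^ ((x ^ y) & HIGH)) & MASK32   # add the top bits without carry-out


def check(word, expected_ref):
    """Return the data byte, or None if a comparison fails.

    A real implementation would trap here instead of returning None.
    """
    d0, r0, d1, r1 = lanes(word)
    if d0 != d1 or r0 != expected_ref or r1 != expected_ref:
        return None
    return d0


x = pack(0xF0, 0x01)
y = pack(0x20, 0xFF)
z = add8x4(x, y)   # data lanes: 0xF0 + 0x20 mod 256, reference lanes: 0x01 + 0xFF mod 256
```

A plain 32-bit addition `(x + y) & MASK32` would let carries leak from one lane into the next, which is why the operator has to be rewritten. Bitwise operators such as XOR need no change, since they already act independently on every bit.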
B. Using SIMD on block ciphers
The construction of the words depends on the required security level. Generally, there are two possibilities to prevent the same fault on k spatial copies of blocks:
i. Fault detection: use $k+1$ copies of the data in each word and trap the system when they do not all lead to the same end result.
ii. Fault correction: use $2k+1$ copies of the data in each word and return the one which appears the most, by applying a majority vote among them. It provides additional security against safe-error attacks.
Our approach offers the possibility of using either of these two strategies. For fault detection, a representation such as the one given in Figure 1 can be used. For fault correction, a single reference block split into two nibbles (4-bit words) can be used. Then, each nibble is arranged between two copies of the data as depicted in Figure 2. The drawback of this latter method is that the nonlinear operations are more complex to implement since available SIMD instructions cannot be used. We now detail the case of fault detection. Let $E$ be an 8-bit implementation of a block cipher which takes as input a $b$-byte plaintext $P = P_1 \cdots P_b$, uses a $b'$-byte secret key $K = K_1 \cdots K_{b'}$ and produces a $b$-byte ciphertext $C = C_1 \cdots C_b$. Our scheme uses a $b$-byte reference plaintext $RP = RP_1 \cdots RP_b$, a $b'$-byte reference key $RK = RK_1 \cdots RK_{b'}$ and a $b$-byte reference ciphertext $RC = RC_1 \cdots RC_b$. First, for each $i \in \{1, \ldots, b\}$, the byte $P_i$ concatenated with $RP_i$, $P_i$ and $RP_i$ is stored in a 32-bit word as illustrated in Figure 3.
Figure 3: IRC on block ciphers - composition of words
For each $i \in \{1, \ldots, b'\}$, the byte $K_i$ concatenated with $RK_i$, $K_i$ and $RK_i$ is also stored in a 32-bit word. Then, the cipher is executed by means of a single stream of 32-bit instructions operating independently on each byte, denoted by $\mathrm{IRC}(E)$, to obtain, for each $i \in \{1, \ldots, b\}$, the byte $C_i$ concatenated with $RC_i$, $C_i$ and $RC_i$. Figure 4 shows the execution of $\mathrm{IRC}(E)$. Finally, comparisons are made involving each copy and the stored reference ciphertext. The ciphertext is returned only if all tests are valid as illustrated in Figure 5.
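The fault detection flow for block ciphers can be sketched as follows. The cipher here is a toy 4-byte construction invented for this example (it is neither PRIDE nor secure), with the key as long as the plaintext; the lane order, the `fault` parameter used to simulate an injection on the output words, and all names are assumptions.

```python
SBOX = [(7 * x + 3) % 256 for x in range(256)]   # toy bijective S-box


def pack(data, ref):
    """Word layout assumed here: data | ref | data | ref."""
    return (data << 24) | (ref << 16) | (data << 8) | ref


def lanes(word):
    return tuple((word >> shift) & 0xFF for shift in (24, 16, 8, 0))


def sbox8x4(word):
    """Apply the 8-bit S-box independently to each byte lane."""
    out = 0
    for shift in (24, 16, 8, 0):
        out |= SBOX[(word >> shift) & 0xFF] << shift
    return out


def toy_encrypt(pt, key, rounds=4):
    """Plain 8-bit toy cipher E: key addition, S-box, byte rotation."""
    state = list(pt)
    for _ in range(rounds):
        state = [SBOX[s ^ k] for s, k in zip(state, key)]
        state = state[1:] + state[:1]
    return bytes(state)


def irc_encrypt(pt, key, ref_pt, ref_key, ref_ct, rounds=4, fault=None):
    """IRC(E): same instruction stream as E, applied to packed words."""
    state = [pack(p, r) for p, r in zip(pt, ref_pt)]
    key_words = [pack(k, r) for k, r in zip(key, ref_key)]
    for _ in range(rounds):
        state = [sbox8x4(s ^ k) for s, k in zip(state, key_words)]
        state = state[1:] + state[:1]
    if fault is not None:                 # simulated fault: (word index, XOR mask)
        index, mask = fault
        state[index] ^= mask
    ciphertext = []
    for word, rc in zip(state, ref_ct):
        c0, r0, c1, r1 = lanes(word)
        if c0 != c1 or r0 != rc or r1 != rc:
            return None                   # a real implementation would trap here
        ciphertext.append(c0)
    return bytes(ciphertext)


REF_PT, REF_KEY = bytes([1, 2, 3, 4]), bytes([9, 8, 7, 6])
REF_CT = toy_encrypt(REF_PT, REF_KEY)     # stored reference ciphertext
PT, KEY = bytes([0x10, 0x20, 0x30, 0x40]), bytes([0xAA, 0xBB, 0xCC, 0xDD])
```

In this sketch, a fault on a single copy (mask `0x00000100`) or on all four lanes (mask `0x01010101`) makes `irc_encrypt` return `None`. Only a fault that hits both data copies identically while leaving both reference lanes intact (mask `0x01000100`) goes unnoticed, which is the situation the interleaved layout is meant to make hard to reach.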
C. Using SIMD on stream ciphers
Modern stream ciphers are usually composed of two parts:
i. Initialization: a function maps the secret key and a public initialization vector to an internal state $S_0$. Then, another function $E$ is applied to $S_0$ to produce a pre-keystream-generation internal state $S_1$.
ii. Keystream generation: a function $I$ is applied to $S_1$, which modifies its value and generates a byte of keystream (in the case of an 8-bit implementation). It produces as many keystream bytes as necessary to add to the plaintext in order to produce the ciphertext.
In this context, IRC consists in first applying to the initialization step the same method as previously described for block ciphers. The initial internal state $\mathrm{IRC}(RS_0, S_0)$ is composed of the internal state $S_0$, a reference internal state $RS_0$ and their respective copies. $RS_0$ is obtained from a reference key and IV, which can themselves be easily generated (rather than stored). IRC applies $E$ to $\mathrm{IRC}(RS_0, S_0)$ by means of a single stream of instructions operating independently on each byte, an operation that we shall denote by $\mathrm{IRC}(E)$. It obtains an internal state $\mathrm{IRC}(RS_1, S_1)$ composed of $S_1$, $RS_1$ and their copies. Figure 6 shows the initialization step protected by IRC in the case of fault detection, where $S_{n,i}$ (resp. $RS_{n,i}$) denotes the $i$-th byte of the internal state $S_n$ (resp. $RS_n$). Moreover, each operation in Figure 6 is performed for all $i \in \{1, \ldots, b\}$, with $b$ the number of bytes of $S_0$. Figure 7 shows the first keystream generation protected by IRC with the same notation as previously.
Figure 7: First keystream byte generation
However, even when the keystream byte is correct, $\mathrm{IRC}(RS_2, S_2)$ may contain a fault, since only a part of the internal state is generally used to produce the keystream. Consequently, IRC also makes comparisons on $\mathrm{IRC}(RS_2, S_2)$ to ensure its correct value.
Then, it replaces in each word the obtained reference internal state $RS_2$ by $S_1$ in order to protect each new keystream byte by the previous one, as illustrated in Figure 8. Each generated keystream word is hence composed of the current keystream byte, the previous one and their respective copies. IRC can therefore make comparisons between the copies of the keystream bytes and with the previously stored keystream byte. Figure 9 shows the generation of the following keystream bytes protected by IRC, with the same notation as previously. To determine when the first byte of keystream $K_1$ is returned, two options are possible:
i. Return $K_1$ after the comparisons in $\mathrm{IRC}(RK, K_1)$: we obtain a security similar to the case previously described for block ciphers.
ii. Return $K_1$ after the comparisons in $\mathrm{IRC}(K_1, K_2)$: we obtain an additional temporal redundancy since IRC generates $K_1$ twice consecutively and compares the obtained values. This costs one additional use of the iteration $I$, which is generally cheap.
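The chaining of keystream bytes can be sketched as follows, using option ii. The iteration function below is a toy invented for this example (it is not TRIVIUM), the sketch starts directly from $S_1$ and a reference state $RS_1$, and the lane order and names are assumptions. State comparisons are only shown for the first iteration.

```python
SBOX = [(5 * x + 1) % 256 for x in range(256)]   # toy bijective S-box


def pack(data, ref):
    """Word layout assumed here: data | ref | data | ref."""
    return (data << 24) | (ref << 16) | (data << 8) | ref


def lanes(word):
    return tuple((word >> shift) & 0xFF for shift in (24, 16, 8, 0))


def sbox8x4(word):
    out = 0
    for shift in (24, 16, 8, 0):
        out |= SBOX[(word >> shift) & 0xFF] << shift
    return out


def step8(state):
    """Plain 8-bit toy iteration I: returns (next state, keystream byte)."""
    return state[1:] + [SBOX[state[0] ^ state[1]]], state[0] ^ state[2]


def step(words):
    """The same iteration applied lane-wise to packed 32-bit words."""
    return words[1:] + [sbox8x4(words[0] ^ words[1])], words[0] ^ words[2]


def agree(word, expected_ref):
    """Return the data byte if the copies match and the reference lanes are as expected."""
    d0, r0, d1, r1 = lanes(word)
    if d0 != d1 or r0 != r1 or r0 != expected_ref:
        raise RuntimeError("fault detected")
    return d0


def irc_keystream(s1, rs1, n):
    """Generate n keystream bytes, releasing each one only after it was recomputed."""
    rs2, rk = step8(list(rs1))                     # reference values (stored or regenerated)
    words, ks = step([pack(d, r) for d, r in zip(s1, rs1)])
    k_prev = agree(ks, rk)                         # K_1 checked against the reference byte
    s2 = [agree(w, r) for w, r in zip(words, rs2)] # the whole state is checked as well
    words = [pack(d, p) for d, p in zip(s2, s1)]   # reference lanes now carry S_1
    out = []
    for _ in range(n):
        words, ks = step(words)                    # data lanes: K_{j+1}, other lanes: K_j again
        k_next = agree(ks, k_prev)
        out.append(k_prev)
        k_prev = k_next
    return bytes(out)


def plain_keystream(s1, n):
    state, out = list(s1), []
    for _ in range(n):
        state, k = step8(state)
        out.append(k)
    return bytes(out)
```

Once the reference lanes carry $S_1$, they lag the data lanes by exactly one iteration, so every later keystream word contains the current byte and a recomputation of the previous one without any further repacking. Producing `n` bytes takes `n + 1` iterations in this sketch, which corresponds to the one-iteration overhead of option ii.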
IV. Practical implementations & tests
In order to test IRC, we deployed it in the "fault detection mode" previously described on two different 32-bit architectures that are quite representative of the off-the-shelf devices used for IoT: an ARM Cortex-M3 micro-controller, on which we tested our own overloaded operators, and an ARM Cortex-M4 micro-controller, in order to exploit the SIMD instructions it provides. In both cases, we report experiments on one representative lightweight block cipher, PRIDE, and one representative lightweight stream cipher, TRIVIUM. Finally, we report practical fault injections against both IRC-protected implementations. Note that no countermeasure other than IRC was implemented.
A. IRC on PRIDE
PRIDE is a block cipher introduced by Albrecht et al. [2] in 2014. It is one of the most efficient lightweight block ciphers in terms of software implementation, as shown by the performance comparisons given in [2], [15]. The specifications of PRIDE are given in [2]. In Table I we compare the performance and footprint of the 8-bit implementation of PRIDE given in [16] and the same implementation protected by means of IRC. To be fair, we also include the performance of the 32-bit implementation of PRIDE given in [17], which of course achieves higher throughput on a 32-bit platform.
B. IRC on TRIVIUM
TRIVIUM is a stream cipher introduced in 2006 [5], [18]. It belongs to the eSTREAM portfolio of recommended stream ciphers and has been specified as an ISO standard [19]. The specifications of TRIVIUM are given in [5]. In Table II, we compare the performance and footprint of an 8-bit optimized implementation of TRIVIUM with and without IRC. To be fair, we also include the performance of another 32-bit optimized implementation of that cipher.
C. Fault Attacks
In order to test IRC, for both IRC-protected PRIDE and TRIVIUM, we injected faults into the chip at different temporal locations using EM pulses as in [17], because with this approach we did not need to decapsulate the chip and we were able to inject faults at precise enough instants. Practical fault attacks have been proposed against PRIDE [17], [20] using EM pulses and against TRIVIUM [12], [21]. It is of the utmost importance to thwart such attacks, since EM fault injection is a relatively low-cost means of injection. The EM pulse used had a duration of 200 ns, the voltage applied across the loop was varied in steps of 1 V between 180 V and 219 V, and we injected 250 pulses at each voltage, targeting the middle of the die. In total, we obtained 4823 faults (resp. 3703) from 10,000 EM injections in the case of PRIDE (resp. TRIVIUM). In both cases, IRC allowed us to fully thwart such fault injections: the use of both the spatial redundancy and the reference block was necessary to detect all the faults.
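The figures above can be cross-checked with a short computation. It assumes that the 250 pulses were injected at each of the 40 voltage settings, which is the reading consistent with the stated total of 10,000 injections; the variable names are illustrative.

```python
voltages = range(180, 220)          # 180 V to 219 V in steps of 1 V
pulses_per_voltage = 250            # assumption: 250 pulses at each voltage setting
total_injections = len(voltages) * pulses_per_voltage

faults = {"PRIDE": 4823, "TRIVIUM": 3703}   # fault counts reported in the text
fault_rate = {name: count / total_injections for name, count in faults.items()}

for name, rate in fault_rate.items():
    print(f"{name}: {rate:.2%} of injections produced a fault")
```

Under this assumption, slightly under half of the injections produced a fault on PRIDE and somewhat over a third on TRIVIUM, all of which were reported as detected.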
V. Conclusion & Discussion
In this paper we describe a new paradigm, based on the use of SIMD instructions, for implementing spatial redundancy to thwart fault attacks on any cipher: we have called it the Internal Redundancy Countermeasure (IRC). It consists in executing the cipher by means of a single stream of 32-bit instructions operating independently on each byte. The overhead of IRC depends on the targeted cipher, as illustrated in this paper. However, this overhead has to be weighed against the high fault coverage achieved. Moreover, IRC has been shown to work on a widespread processor core and hence does not need any hardware modification of existing processors that embed SIMD instructions. By illustrating the efficiency of this approach, we hope to encourage chip manufacturers to integrate dedicated SIMD instructions to help tackle the complex issue of protection against fault attacks. One further step in this research work will be to investigate how to efficiently enhance this scheme to thwart side channel attacks.
