Software-based cryptographic implementations can be vulnerable to side-channel analysis. Masking countermeasures rank among the most prevalent techniques against it, formally ensuring protection against value-based leakages. However, their applicability is hindered by two factors. First, a masking countermeasure involves a computational overhead that can render implementations inefficient. Second, physical effects such as glitches and distance-based leakages can reduce the security order in practice, rendering the masking protection less effective. This paper attempts to address both factors. In order to reduce the computational cost, we implement a high-throughput, bitsliced, 2nd-order masked implementation of the PRESENT cipher, using assembly on the ARM Cortex-M4. The implementation outperforms the current state of the art and is capable of encrypting a 64-bit block of plaintext in 6,532 cycles (excluding RNG), using 1,644 bytes of data RAM and 1,552 bytes of code memory. Second, we analyze experimentally the effectiveness of masking in ARM devices, i.e. we examine the effects of distance-based leakages on the security order of our implementation. We confirm the theoretical model behind distance leakages for the first time in ARM-based architectures.
Introduction
Nowadays, everyday devices, sensors, vehicles and other items are embedded with electronics, allowing network connectivity and information exchange. Often, these fairly simple devices need to maintain a high level of security against powerful adversaries with passive monitoring, as well as active tampering capabilities.
For instance, side-channel attacks (SCA) allow attackers to learn sensitive data by observing physical characteristics of a cryptographic implementation. Their discovery in 1999 by Kocher et al. [32] exposed a blind spot in theoretical, proof-driven cryptography and has motivated researchers to find efficient countermeasures. A very common option for provably secure software countermeasures is masking [18], which uses secret-sharing techniques to hinder key recovery.
The work described in this paper has been supported by the Netherlands Organization for Scientific Research NWO under project ProFIL (628.001.007).
However, the masking countermeasure can imply a severe performance overhead in terms of processing speed due to the quadratic computational complexity required [30] . Moreover, masking can formally ensure protection against a theoretical leakage model, namely the value-based model. As a result, device-specific divergence from the assumed model can lead to security order reduction. For instance, software devices often exhibit distance-based leakages, which have been theorized to reduce the order of a masked scheme by 50% [2] . This paper attempts to answer whether masking countermeasures and ARM devices are friends or foes. The contribution is twofold and extends to both the performance factor as well as the security order factor.
First, we improve the current state of the art by creating an efficient, bitsliced, 2nd-order implementation of PRESENT. The PRESENT cipher was selected due to its widespread applicability in the Internet of Things context. Our implementation requires 1,644 bytes of RAM, 1,552 bytes of code memory and encrypts 32 blocks of data in 209,023 clock cycles, achieving a throughput of 6,532 clock cycles per block, excluding the cost of random number generation. Thus, we demonstrate that ARM-based architectures can host masked implementations efficiently, given that the implementors opt for full-scale assembly programs and use efficient state representations.
Second, we examine potential distance-based leakages in ARM architectures. That is, we perform side-channel experiments in order to test whether our ARM Cortex-M4 device is prone to causing order reduction in our 2nd-order implementation. In addition, we confirm that the observed order reduction follows the theorized reduction established by Balasch et al. [2] . That is, we confirm the order-reduction theorem (Section 5) in ARM-based architectures for the first time.
In the next section, we describe the work of other practitioners who implemented PRESENT and relate their performance figures with our work. In Section 3 we offer a brief description of the PRESENT cipher. Section 4 discusses the design options and optimizations w.r.t. the masked ARM implementation, as well as the performance results. Section 5 links the order reduction model suggested by Balasch et al. [2] to our ARM-based device. Finally, Section 6 concludes and discusses future work.
Related work
In this section, we describe the work of those implementors that addressed the implementation of PRESENT in software. We do this in ascending order of word size according to the architecture.
4-bit architectures
Poschmann implemented PRESENT in different software platforms [39] . In a 4-bit µC, particularly an Atmel ATAM893-D at 2 MHz he obtained a performance figure of 55,734 cycles per block. He also implemented PRESENT in an 8-bit ATmega µC clocked at 4 MHz, obtaining a performance of 10,089 cycles.
8-bit architectures
Papagiannopoulos presented a bitsliced implementation of PRESENT on the 8-bit ATtiny85 µC. He applied bitslicing to the permutation and substitution layers using a bitslice factor of 8 [38]. That work relied on the PRESENT Sbox representation resulting from the application of the 2-stage Boyar-Peralta heuristic in tandem with SAT solvers [12]. He obtained a throughput of 2,967 cycles per block using 3,816 bytes of Flash and 256 bytes of SRAM. In this work, we use the same Sbox.
Dinu et al. also analyzed the suitability of a wide range of lightweight block ciphers in sensor-based applications on three different architectures: an 8-bit ATmega, a 16-bit MSP430 and a 32-bit ARM processor. They did not apply bitslicing and implemented the Cipher Block Chaining (CBC) and counter (CTR) modes of operation [23]. The CBC implementation requires 121,906 cycles on the ATmega processor whereas the CTR implementation can obtain one block of ciphertext in 15,239 cycles. Furthermore, the authors of [25] implemented PRESENT on an 8-bit ATtiny µC using 80-bit keys; their implementation required 11,343 cycles, 1,000 bytes of code and 18 bytes of RAM. Using the same platform, Papagiannopoulos decreased the cycle count to 8,712 in [37] by using a merged SP layer, squared and compact representations of the Sbox and minimal key register rotations. Finally, Rauzy et al. presented a design methodology for automatically inserting Dual-rail with Precharge Logic (DPL) in a software implementation of PRESENT [41]. They relied on a bitsliced implementation for the 8-bit AVR ATmega163 and require 235,427 cycles to obtain a single block of ciphertext.
Contribution. In this manuscript we present a very fast, 2nd-order protected implementation of the PRESENT block cipher by combining bitslicing and 2nd-order masking. We rely on the 32-bit ARM Cortex-M4 CPU. The analytical results can be seen in Section 4.3.
Our implementation can encrypt one PRESENT plaintext block in 6,532 cycles using 1,644 bytes of RAM and 1,552 bytes of ROM. To our knowledge, this is the first high-order protected implementation of PRESENT that includes a side-channel evaluation. We have evaluated our implementation against first, second and third-order attacks using state-of-the-art techniques (Section 5). None of the works described in this section performed such an exhaustive evaluation of their implementation while protecting it against second-order attacks [5, 17, 23, 25, 34, 37-39, 41]. Our performance figures suggest that our implementation is between 2.5 and 21.2 times faster than prior art relying on the same architecture (Section 4.4). Further, we have made our implementation available under the General Public License (GPL).
16-bit architectures
Finally, since the constructions found in PRESENT are also used on the hash functions SPONGENT and H-PRESENT [8, 10] , the same approaches we present in this manuscript can be applied to their implementation.
PRESENT
Given the need for alternative cryptographic primitives aimed at low-power and compact applications such as RFID and sensor networks, a variety of lightweight primitives such as PRESENT have been proposed in the last few years [9]. Standardized in ISO/IEC 29192-2:2012, PRESENT consists of a substitution-permutation (SP) network with 80/128-bit key sizes and 64-bit data blocks. It applies the following layers during 31 rounds to a 64-bit state b:
1. addRoundKey: During the execution of PRESENT, 32 round keys ($K_i$ with $1 \le i \le 32$) are generated via a key schedule using the encryption key K as an input. The last subkey, $K_{32}$, is used for post-whitening. Each round key $K_i$ has a size of 64 bits. Thus, each execution of addRoundKey consists of the XOR operation between the state and the round key, i.e. $b \leftarrow b \oplus K_i$.
2. sBoxLayer: This layer is a non-linear substitution operation that relies on a 4-bit Sbox ($\mathbb{F}_2^4 \to \mathbb{F}_2^4$), applied 16 times per round to the state. The 64-bit state is divided into 16 groups of 4 bits that feed the PRESENT Sbox (Table 1).
3. pLayer: This layer consists of a linear bit-wise permutation where each bit i of the state ($b_i$) is moved to another position P(i) according to Table 2.

Finally, the round subkeys are generated as follows. Given an 80-bit key K with bits $K_{79}, K_{78}, \ldots, K_0$, the 64-bit round key $K_i$ consists of the 64 leftmost bits of K, and K is then updated via the following operations:
1. K is rotated by 61 bits to the left.
2. The leftmost 4 bits are processed by the PRESENT Sbox.
3. The round counter is XORed with the bits $K_{19}, \ldots, K_{15}$ of K.
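The round structure and key schedule described above can be sketched as a plain, unprotected reference model. The following Python sketch is not the masked assembly implementation of this paper; it only illustrates the three layers and the key update. Table 1 and Table 2 are not reproduced in the text, so the Sbox values and the rule $P(i) = 16i \bmod 63$ (with $P(63) = 63$) are taken from the public PRESENT specification, and all function names are illustrative.

```python
# Unprotected PRESENT-80 reference sketch: one block at a time,
# no bitslicing and no masking. Names are illustrative only.

# Sbox from the public PRESENT specification (stands in for Table 1).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def p_index(i):
    """Destination P(i) of state bit i (stands in for Table 2)."""
    return 63 if i == 63 else (16 * i) % 63


def sbox_layer(state):
    """Apply the 4-bit Sbox to each of the 16 nibbles of the state."""
    out = 0
    for nibble in range(16):
        out |= SBOX[(state >> (4 * nibble)) & 0xF] << (4 * nibble)
    return out


def p_layer(state):
    """Move bit i of the state to position P(i)."""
    out = 0
    for i in range(64):
        out |= ((state >> i) & 1) << p_index(i)
    return out


def round_keys_80(key):
    """Return the 32 round keys K_1..K_32 derived from an 80-bit key."""
    keys = []
    for counter in range(1, 33):
        keys.append(key >> 16)  # the 64 leftmost bits of K
        # 1. rotate K by 61 bits to the left
        key = ((key << 61) | (key >> 19)) & ((1 << 80) - 1)
        # 2. pass the leftmost 4 bits through the Sbox
        key = (SBOX[key >> 76] << 76) | (key & ((1 << 76) - 1))
        # 3. XOR the round counter into bits K_19..K_15
        key ^= counter << 15
    return keys


def encrypt_block(plaintext, key):
    """31 rounds of addRoundKey, sBoxLayer, pLayer, then post-whitening."""
    keys = round_keys_80(key)
    state = plaintext
    for k in keys[:31]:
        state = p_layer(sbox_layer(state ^ k))
    return state ^ keys[31]
```

Calling `encrypt_block(0, 0)` can be compared against the all-zero test vector published with the cipher specification.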
Bitsliced Masking of PRESENT for ARM Cortex-M4
The current section describes the design choices investigated in order to develop a protected, high-throughput, assembly-based PRESENT implementation. Sections 4.1, 4.2 describe the logic-level optimizations performed, while Section 4.3 discusses the instruction-level improvements.
Bitslicing and Efficient Sbox Representation
CPU architectures tend to operate best on their native word size or half-words and they encounter performance issues with bit-level manipulation. To deal with this issue, the Cortex-M4 features bit-banding support, as well as a wide selection of bit-field instructions. However, applying them in the context of PRESENT requires extensive use of load and store instructions or numerous bit extractions/insertions, often resulting in poor performance.
Bitslicing is a technique introduced by Biham to tackle this inefficiency for DES [6] . Instead of using registers to store consecutive bits of a state, one uses them to hold one specific bit from several different states, effectively transforming bit-level operations into SIMD equivalents.
In our implementation, we employ a bitsliced representation of factor 32, i.e. we process 32 cipher blocks of 64 bits each in parallel, resulting in 256 bytes per bitsliced encryption. Doing so allows us to efficiently compute both the substitution and the permutation layer of PRESENT. Specifically, the Sbox can be decomposed into GF(2) operations which can be accelerated via SIMD-like instructions, so that it no longer requires memory lookup tables. Similarly, the bit permutations can be accelerated by directly exchanging the memory contents of the corresponding bitsliced bits according to the permutation pattern, instead of relying on bit extraction, insertion and shifting.
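To make the factor-32 representation concrete, the following Python sketch packs 32 blocks into 64 words of 32 bits (256 bytes in total) and shows that the permutation layer then reduces to moving whole words. The layout chosen here (word i holds bit i of every block, block j occupies bit position j) is an assumption for illustration; the text does not state the exact layout of the assembly implementation. The rule for P(i) is taken from the public PRESENT specification.

```python
def p_index(i):
    """Destination P(i) of state bit i, per the public PRESENT specification."""
    return 63 if i == 63 else (16 * i) % 63


def to_bitsliced(blocks):
    """Pack 32 64-bit blocks into 64 words of 32 bits.

    Assumed layout: bit j of word i holds bit i of block j.
    """
    assert len(blocks) == 32
    words = [0] * 64
    for j, block in enumerate(blocks):
        for i in range(64):
            words[i] |= ((block >> i) & 1) << j
    return words


def from_bitsliced(words):
    """Inverse of to_bitsliced: recover the 32 64-bit blocks."""
    blocks = [0] * 32
    for i, word in enumerate(words):
        for j in range(32):
            blocks[j] |= ((word >> j) & 1) << i
    return blocks


def p_layer_bitsliced(words):
    """Permute all 32 blocks at once by moving whole words.

    No bit extraction or shifting is needed: word i simply lands at P(i).
    """
    out = [0] * 64
    for i, word in enumerate(words):
        out[p_index(i)] = word
    return out
```

In this form, one word move permutes the same bit of all 32 blocks, which is why the permutation can be folded into the store addresses of the preceding layer.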
The GF(2) decomposition of the Sbox has sparked interest in the optimization of boolean circuits w.r.t. computational efficiency. In our implementation, we use the optimized boolean circuit suggested for PRESENT by Courtois et al. [21]. The optimized representation was generated by applying the Boyar-Peralta heuristic [12], which reduces the circuit's gate complexity, i.e. the number of AND, OR, XOR, NOT operations. The representation is shown below.
T1 = X2 ^ X1; T2 = X1 & T1; T3 = X0 ^ T2; Y4 = X3 ^ T3;
T2 = T1 & T3; T1 ^= Y4; T2 ^= X1; T4 = X3 | T2;
Y3 = T1 ^ T4; X3 = ~X3; T2 ^= X3; Y1 = Y3 ^ T2;
T2 |= T1; Y2 = T3 ^ T2;
Values X0-X3 represent the Sbox input bits, T1-T4 hold temporary values and Y1-Y4 are the output bits. The total cost is 14 operations: 4 non-linear (AND, OR) and 10 linear (XOR, NOT).
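The circuit above can be checked against the lookup-table form of the Sbox. The Python sketch below evaluates it on 16-bit words so that all 16 possible inputs are processed in parallel, which mirrors the bitsliced evaluation on a smaller scale. The bit ordering used here (X0 and Y1 as most significant bits, X3 and Y4 as least significant) is an assumption that was found to reproduce the table from the public PRESENT specification; the text itself does not state the ordering.

```python
# Sbox from the public PRESENT specification, used only as a reference.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def sbox_circuit(x0, x1, x2, x3, mask):
    """14-operation boolean circuit, evaluated lane-wise on machine words.

    `mask` limits the word width so that NOT behaves like a fixed-width
    register negation.
    """
    t1 = x2 ^ x1
    t2 = x1 & t1
    t3 = x0 ^ t2
    y4 = x3 ^ t3
    t2 = t1 & t3
    t1 ^= y4
    t2 ^= x1
    t4 = x3 | t2
    y3 = t1 ^ t4
    x3 = ~x3 & mask
    t2 ^= x3
    y1 = y3 ^ t2
    t2 |= t1
    y2 = t3 ^ t2
    return y1, y2, y3, y4


def circuit_matches_table():
    """Evaluate all 16 inputs at once: lane v carries the bits of nibble v."""
    mask = 0xFFFF
    x = [0, 0, 0, 0]
    for v in range(16):
        for k in range(4):
            # Assumed ordering: X0 is the most significant input bit.
            x[k] |= ((v >> (3 - k)) & 1) << v
    y1, y2, y3, y4 = sbox_circuit(x[0], x[1], x[2], x[3], mask)
    for v in range(16):
        out = ((((y1 >> v) & 1) << 3) | (((y2 >> v) & 1) << 2)
               | (((y3 >> v) & 1) << 1) | ((y4 >> v) & 1))
        if out != SBOX[v]:
            return False
    return True
```

With 32-bit words and `mask = 0xFFFFFFFF`, the same function processes one nibble position of 32 blocks per call.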
Boolean Masking
Chari et al. [18] were among the first to suggest that splitting intermediate values using a secret sharing scheme would force attackers to analyze joint distribution functions on multiple points. That is, a dth-order masking scheme splits a sensitive value x into d + 1 shares $(x_0, x_1, \ldots, x_d)$ as follows:
$$x = x_0 \oplus x_1 \oplus \cdots \oplus x_d$$
Assuming sufficient noise, it has been shown that the number of traces required for a successful attack grows exponentially w.r.t. the order d [18, 40]. Masking can be implemented from several angles. For example, Goubin et al. [26], Messerges [35] and recently Coron [19] applied the masking principle to lookup tables used in Sbox computation. Adopting a different implementation angle, Trichina [47], Canright [14], Akkar et al. [1] and Blömer et al. [7] applied masking in the context of GF operations used in Sbox computation. This operation-based approach was formalized by the secret-sharing approach of Ishai, Sahai and Wagner (ISW), which introduced the notion of private boolean circuits [30]. ISW provided implementors with a provably secure method to mask operations in GF(2) for any masking order d.
This work employs a bitsliced representation of PRESENT and enhances the implementation using a 2nd-order protection scheme. As demonstrated in Section 4.1, the Sbox is decomposed into GF(2) operations. Thus, ISW is our technique of choice in order to apply 2nd-order protection on the boolean operations required for the Sbox computation. Table 3 shows the ISW equivalent of common boolean operations when applied to bitsliced operands a and b, as well as the computational cost involved for each operation. The values $z_{i,j}$ with $1 \le i < j \le d + 1$ are drawn from a uniform random distribution and the remaining $z_{i,j}$ are computed using the ISW construction [30].
Note that the cost of the NOT operation is a single negation, the cost of the XOR operation is linear and the cost of the AND,OR operations is quadratic. In our implementation, the OR operation is converted to a single AND and three NOT operations in order to apply the ISW method. 
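The masked operations described above can be sketched in Python for three shares (d = 2) on 32-bit bitsliced words. This is a functional model only: it follows the textbook ISW construction [30], in which $z_{j,i} = (z_{i,j} \oplus a_i b_j) \oplus a_j b_i$ for $i < j$, and it says nothing about instruction ordering or register reuse, which matter for leakage on a real device. The helper names and the use of Python's `secrets` module as a randomness source are assumptions for illustration.

```python
import secrets

MASK32 = 0xFFFFFFFF
D = 2  # masking order; each value is split into D + 1 = 3 shares


def share(x, d=D):
    """Split a 32-bit word into d + 1 boolean shares."""
    shares = [secrets.randbits(32) for _ in range(d)]
    last = x
    for s in shares:
        last ^= s
    return shares + [last]


def unshare(shares):
    """Recombine shares by XOR (used for testing only)."""
    acc = 0
    for s in shares:
        acc ^= s
    return acc


def masked_not(a):
    """NOT costs a single negation: only one share is inverted."""
    return [a[0] ^ MASK32] + a[1:]


def masked_xor(a, b):
    """XOR is linear: it is applied share by share."""
    return [x ^ y for x, y in zip(a, b)]


def masked_and(a, b):
    """ISW AND gadget with quadratic cost in the number of shares."""
    n = len(a)
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            z[i][j] = secrets.randbits(32)  # fresh randomness
            z[j][i] = (z[i][j] ^ (a[i] & b[j])) ^ (a[j] & b[i])
    out = []
    for i in range(n):
        c = a[i] & b[i]
        for j in range(n):
            if j != i:
                c ^= z[i][j]
        out.append(c)
    return out


def masked_or(a, b):
    """OR via De Morgan: one masked AND and three masked NOTs."""
    return masked_not(masked_and(masked_not(a), masked_not(b)))
```

For example, `unshare(masked_and(share(a), share(b)))` equals `a & b` for any 32-bit words `a` and `b`, while no single share depends on the unmasked value.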
The quadratic computational complexity of non-linear operations can result in a computationally demanding masked Sbox. To avoid this, several techniques [15, 16, 21, 27, 45] focus on reducing the multiplicative complexity of an Sbox, i.e. the number of AND, OR operations. The decomposition that we currently use (shown in Section 4.1) is optimal w.r.t. multiplicative complexity, since brute-force techniques [28] demonstrate that the minimal complexity in GF(2) of cryptographically relevant, 4-bit Sboxes is 4 non-linear operations.
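To illustrate why the number of non-linear gates dominates the cost, the following Python sketch counts word-level operations for one masked evaluation of the 14-operation circuit of Section 4.1 (2 AND, 2 OR, 9 XOR, 1 NOT). The counts assume the textbook ISW gadget, with each OR rewritten as one AND plus three NOTs and each NOT applied to a single share. They are an estimate under these assumptions and are not the figures of Table 3 or of the actual assembly code.

```python
def isw_and_cost(d):
    """Word-level cost of one textbook ISW AND gadget at masking order d."""
    n = d + 1  # number of shares
    return {
        "and": n * n,                # all cross products a_i & b_j
        "xor": 2 * n * (n - 1),      # building z_{j,i} plus compression
        "rand": n * (n - 1) // 2,    # fresh random words z_{i,j}, i < j
    }


def masked_sbox_cost(d, n_and=2, n_or=2, n_xor=9, n_not=1):
    """Estimated cost of one masked Sbox circuit evaluation."""
    n = d + 1
    gadget = isw_and_cost(d)
    nonlinear = n_and + n_or         # each OR becomes one AND and three NOTs
    return {
        "and": nonlinear * gadget["and"],
        "xor": nonlinear * gadget["xor"] + n_xor * n,
        "not": n_not + 3 * n_or,
        "rand": nonlinear * gadget["rand"],
    }
```

Under these assumptions, `masked_sbox_cost(2)` gives 36 AND, 75 XOR, 7 NOT and 12 random words per circuit evaluation, compared to 14 operations and no randomness for the unmasked circuit.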
ARM-based Optimizations
Our implementation targets the ARM Cortex-M4 microcontroller architecture using ARM assembly with Thumb2 encoding. Thus, we use a 32-bit architecture with 14 general purpose registers designed for low-cost, low-power applications. The implementation board is the Riscure Pinata, which is based on the STM32F417IG SoC by ST and embeds a 32-bit ARM Cortex-M4 CPU clocked at 168 MHz. It features 1,024 Kbytes of Flash and 196 Kbytes of RAM. The device is also equipped with an on-board TRNG in order to generate the random values associated with our masking implementation. In the case of the STM32F417IG, the TRNG generates 32-bit random numbers via an integrated analog circuit. Note that the computational penalty w.r.t. random number generation is particularly steep when implemented on-the-fly, amounting to roughly 25 percent of the total computation. Still, we note that the random numbers can be precomputed, given that the application context allows for idle intervals between consecutive encryptions. Below, we discuss implementation details and efficiency improvements pertaining to the ARM architecture, memory organization and assembly instructions.
1. Memory organization: Our design requires two full bitsliced states in RAM, each comprising three sub-states corresponding to the three-share masking scheme. The two full bitsliced states are needed because the permutation layer would otherwise overwrite unprocessed data. We optimize for cycles by integrating the permutation into the Sbox and writing words to their permuted destination immediately after the Sbox computation. Wherever the code operates on shares, we organize our fetch and store operations in batches so as to reduce overhead. In most cases we use the LDM and STM instructions to load or store three or four words at a time. This yields improvements in the Sbox computation when reading in the next four words to be substituted, in the key schedule, where three words at a time are read in for processing, and also when converting a regular state representation from/to a bitsliced one.
2. Loop Unrolling: To improve the efficiency of our Sbox implementation, which processes twelve shares (four bitsliced data blocks of three shares each), we unroll the substitution process to reduce the unnecessary read/write steps required for a looped construction. The unrolling adds considerable size to the code, i.e. we deliberately trade code size for throughput. Note that unrolling is performed with memory access in mind. For example, as mentioned, adding the key schedule is performed in a loop of three words. This optimizes the key schedule operation and maximizes the amount of data we can bring from/to the RAM.
3. Key Schedule: The round key is not stored in a bitsliced fashion and the key schedule is computed on the fly. Note that round key precomputation is also a valid implementation option, assuming that the key does not need to be renewed often. Since key refreshing can act as a side-channel countermeasure, we chose to retain the on-the-fly key updates. Updating the round key requires a push through the Sbox for four bits each round. To that purpose, we use the Cortex-M4's UBFX instruction for extracting a contiguous series of bits from a word in an efficient manner. In addition, we use ARM's barrel shifter, which allows the second operand to be shifted at no additional cost before an instruction is performed.
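As a rough illustration of the two instruction-level features mentioned in the last item, the Python sketch below models the semantics of UBFX (unsigned bit-field extract) and of an EOR with a shifted second operand. It is a behavioral model only, not ARM assembly, and the key layout assumed in `key_top_nibble` (the upper 16 key bits held in the low half of a 32-bit word) is hypothetical; the text does not describe how the key words are arranged.

```python
MASK32 = 0xFFFFFFFF


def ubfx(rn, lsb, width):
    """Model of UBFX Rd, Rn, #lsb, #width: zero-extended bit-field extract."""
    return (rn >> lsb) & ((1 << width) - 1)


def eor_lsl(rn, rm, shift):
    """Model of EOR Rd, Rn, Rm, LSL #shift.

    The shift of the second operand is folded into the same instruction.
    """
    return rn ^ ((rm << shift) & MASK32)


def key_top_nibble(key_hi):
    """Extract the 4 leftmost key bits that feed the Sbox.

    Hypothetical layout: key_hi holds key bits K_79..K_64 in its low half.
    """
    return ubfx(key_hi, 12, 4)
```

A single extract replaces a shift followed by a mask, and the folded shift lets the round counter be aligned with bits $K_{19}, \ldots, K_{15}$ without a separate instruction.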
Performance Results
The current section summarizes the achieved performance results w.r.t. throughput and size. Tables 4 and 5 list the performance figures of the works described in Section 2. As mentioned, we outperform prior art on the same architecture by a factor between 2.5 and 21.2 [23].

Table 4 (excerpt; the last two columns give the platform and the cycle count)
      PRESENT-80   no   -   no   ATMega163   10,089
[39]  PRESENT-80   no   -   no   C167CR      19,460

As expected, the ISW implementation of the Sbox dominates CPU time, accounting for 95.88 percent of all clock cycles within the encryption process. A complete breakdown of the memory and time overheads required for different modules is provided in Table 6.
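The per-block figure follows directly from the bitsliced batch size. The short Python sketch below reproduces the arithmetic from the numbers reported in this paper (209,023 cycles for 32 blocks, a 168 MHz clock, 95.88 percent of cycles spent in the Sbox). The derived blocks-per-second value is an illustrative estimate that excludes random number generation and any I/O.

```python
TOTAL_CYCLES = 209_023     # cycles for one bitsliced encryption (32 blocks)
BLOCKS = 32                # bitslice factor
CLOCK_HZ = 168_000_000     # Cortex-M4 clock on the target board
SBOX_SHARE = 0.9588        # fraction of cycles spent in the masked Sbox

# Amortized cost of one 64-bit block.
cycles_per_block = TOTAL_CYCLES / BLOCKS

# Estimated throughput at full clock speed, excluding RNG and I/O.
blocks_per_second = CLOCK_HZ / cycles_per_block
bits_per_second = 64 * blocks_per_second

# Cycles attributable to the masked Sbox in one bitsliced encryption.
sbox_cycles = SBOX_SHARE * TOTAL_CYCLES
```

Here `cycles_per_block` rounds to the reported 6,532 cycles, and `blocks_per_second` comes out at roughly 25.7 thousand blocks per second.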
Masking Effectiveness in ARM Cortex-M4
In this section, we assess experimentally the security level (masking order) provided by the ISW masking scheme, taking into account the possibility of distance-based leakages in the ARM Cortex-M4. In addition, we investigate whether the theoretical repercussions of distance-based leakages can be confirmed experimentally. In other words, we examine whether the cost of "lazy engineering" as introduced by Balasch et al. [2] is applicable to an ARM-based microcontroller.

Table 5 (excerpt)
Ref.  Implementation                     Code (bytes)  RAM (bytes)
      PRESENT-80, CBC, MSP430            1,108         52
[23]  PRESENT-80, CBC, ARM               1,304         124
[23]  PRESENT-80, CTR, ATMega            1,416         54
[23]  PRESENT-80, CTR, MSP430            1,244         58
[23]  PRESENT-80, CTR, ARM               1,532         140
[37]  PRESENT-80                         1,794         -
[41]  PRESENT-80, bitslicing             1,620         288
[41]  PRESENT-80, bitslicing + DPL       3,056         352
Experimental Pitfalls
The effective and efficient evaluation of the actual mask order of cryptographic implementations remains an open problem due to several evaluation pitfalls. Regarding effectiveness, when evaluating a masking scheme via the measured power consumption, we face the pitfall of the limited attack scope. That is, a particular attack technique in use may fail to exploit the available leakage due to e.g. an unsuitable choice of intermediate values or an incorrect power model assumption. Moreover, introducing additional countermeasures on top of the masking scheme may render particular exploitation techniques ineffective, while the implementation remains vulnerable to different lines of attack.
In order to tackle this issue, the research community followed several approaches. Prior research established generic side-channel distinguishers such as Mutual Information Analysis (MIA) [4], the Kolmogorov-Smirnov and the Cramér-von Mises tests [48, 49], which require minimal assumptions about the noise and the power model of the device under test. On the other side of the spectrum, Standaert et al. [44] proposed an evaluation framework assuming the strongest possible adversary, equipped with extensive profiling capabilities and Bayesian templates.
While being effective, the aforementioned approaches focus on leakage exploitation and perform key recovery, which may require a large number of traces. Thus, they face the efficiency pitfall w.r.t. computational and storage requirements. Note that this increased demand for resources is magnified when inserting extra countermeasures in a masked implementation. Thus, it can be difficult to decide with confidence whether the masking order is reduced or not.
In order to evaluate the effective masking order, we opt for a more recent approach called leakage detection methodology [31] . This approach focuses on leakage detection and disregards exploitation. Thus, the acquisition and the computational cost is reduced while the methodology can retain its generic nature.
Despite the gain achieved via decoupling detection and exploitation, the leakage detection methodology still presents challenges w.r.t. efficiency. In the context of software masking, we need to combine multiple time samples in order to evaluate the masked implementation. Thus, we rely on the work by Schneider et al. [42] , who extended the leakage detection methodology into higher-order evaluations by providing efficient, incremental formulas that can handle the computation involved with minimal memory requirements. In certain cases, we also resort to traditional evaluation techniques such as correlation-power analysis (CPA) [13] , despite their limited attack scope, so as to enhance our discussion.
Bitsliced Masking and Distance-Based Leakages
In order to perform leakage detection and determine the actual masking order, we opt to use the fixed vs. random, non-specific t-test statistic. The process involves two steps: a custom acquisition of two trace sets (populations) and a population comparison based on statistical inference.
In the first step, we perform a fixed vs. random acquisition and obtain two distinct trace sets for comparison, $S_{fixed}$ and $S_{random}$, under the same encryption key. For $S_{fixed}$, the input plaintext is set to a fixed value, while for $S_{random}$, the input is drawn from a uniformly random distribution. Following the suggestion of Schneider et al. [42], the implementation receives the fixed or random plaintext in a non-deterministic and randomly-interleaved manner. This type of acquisition is performed in order to randomize the implementation's internal state and avoid measurement-related variations over time, e.g. due to environmental parameters. The evaluation test to be performed is non-specific, i.e. we target all sensitive values computed during encryption. Thus, we maintain a wide attack scope, without any prior assumptions on the leakage model or intermediate values.
The acquisition is performed on the ARM-based Pinata device, using a Picoscope 5203 oscilloscope and the Riscure current probe. The device clock operates at 168 MHz and the oscilloscope's sample rate is 1 GSample/sec. We also apply post-processing in the form of signal resampling.
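For the second step, we model the sets $S_{fixed}$ and $S_{random}$ as independent random samples $\{S^1_{fixed}, \ldots, S^n_{fixed}\}$ and $\{S^1_{random}, \ldots, S^m_{random}\}$ drawn from normal distributions with means $\mu_{fixed}$, $\mu_{random}$ and standard deviations $\sigma_{fixed}$, $\sigma_{random}$, where $\sigma_{fixed} \neq \sigma_{random}$ in general. Subsequently, leakage detection methods test the equality of the means $\mu_{fixed}$, $\mu_{random}$ (null hypothesis). Finding a statistic for this test is known as the Behrens-Fisher problem and an approximate solution is the Welch t-test [33] with υ degrees of freedom, as shown below.
$$w = \frac{\bar{S}_{fixed} - \bar{S}_{random}}{\sqrt{\frac{s_{fixed}^2}{n} + \frac{s_{random}^2}{m}}}, \qquad \upsilon = \frac{\left(\frac{s_{fixed}^2}{n} + \frac{s_{random}^2}{m}\right)^2}{\frac{\left(s_{fixed}^2/n\right)^2}{n-1} + \frac{\left(s_{random}^2/m\right)^2}{m-1}}$$
Here $\bar{S}$ and $s^2$ denote the sample mean and the sample variance of the respective trace set.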
The null hypothesis $H_{null}$ is rejected at a given level of significance α if $|w| > t_{\alpha/2,\upsilon}$, where $t_{\alpha/2,\upsilon}$ is the value of the Student t distribution with υ degrees of freedom. In the evaluation context, rejecting $H_{null}$ implies leakage detection, i.e. potential evidence of an ineffective masking scheme. A common rejection criterion that we also use in our analysis is $|w| > 4.5$, which, for υ > 1000, corresponds to a confidence level above 0.99999 (i.e. α < 0.00001) [22]. Note that the rejection of $H_{null}$ shouldn't be interpreted directly as an applicable vulnerability. Even after detection, the amount of traces required for exploitation may render an attack infeasible.
In this work, we need to evaluate the masking order provided by our ARM-based, 2nd-order masked cipher. From a theoretical point of view, a 2nd-order ISW masking countermeasure is capable of preventing value-based leakages of order 2 or less. However, practice has demonstrated that software implementations, including ARM microcontrollers, may exhibit leakages with large divergence from the value-based leakage abstraction. An exemplary case is the distance-based leakage model, observed by Daemen et al. [46], addressed by Coron et al. [20] and recently formalized by Balasch et al. [2]. This particular divergence leads to a reduction of the security order. Balasch et al. theorized that a dth-order scheme can reduce to order $\lfloor d/2 \rfloor$ and provided experimental validation using an AVR-based microcontroller. We will refer to this formalization as the order-reduction theorem. To address such leakage divergence issues in our implementation, we use the Welch t-test in order to verify experimentally the theoretical security claims.
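The fixed vs. random procedure can be sketched on synthetic data. The Python sketch below implements the Welch statistic and its degrees of freedom for a single time sample, together with a toy acquisition in which the two classes are randomly interleaved. The Gaussian noise model, the `leak` offset and all function names are assumptions for illustration; real traces contain many samples and the test is applied to each of them.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch t statistic and approximate degrees of freedom for two samples."""
    n, m = len(a), len(b)
    va = statistics.variance(a) / n
    vb = statistics.variance(b) / m
    w = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (n - 1) + vb ** 2 / (m - 1))
    return w, dof


def acquire(n_traces, leak, rng):
    """Toy single-sample acquisition with randomly interleaved classes.

    `leak` models a data-dependent offset that is only visible for the
    fixed input; leak = 0 models an implementation without 1st-order leakage.
    """
    fixed, rnd = [], []
    for _ in range(n_traces):
        if rng.random() < 0.5:
            fixed.append(leak + rng.gauss(0.0, 1.0))
        else:
            rnd.append(rng.gauss(0.0, 1.0))
    return fixed, rnd


def leaks(fixed, rnd, threshold=4.5):
    """Apply the |w| > 4.5 rejection criterion."""
    w, _ = welch_t(fixed, rnd)
    return abs(w) > threshold
```

With a few thousand traces per class, a small offset such as `leak = 0.2` is typically enough to push |w| far above 4.5, whereas `leak = 0` is expected to stay below the threshold.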
We commence the evaluation by testing the 1st-order security of our masked cipher. We perform the 1st-order t-test on the first round of bitsliced PRESENT. The size of both $S_{fixed}$ and $S_{random}$ is 10k traces with 30k samples per trace. The trace waveform and t-test results are visible in Figures 1 and 2. We observe that we remain well below the 4.5 threshold, indicating that our 2nd-order masked PRESENT implementation is able to maintain 1st-order security. To enhance our confidence, we also perform a 1st-order CPA attack with a large amount of traces (800k) to exploit potential 1st-order leakages. We use the HW model and a custom-made selection function due to the bitsliced Sbox computation. Similarly to Balasch et al. [3], the selection function must take into account that not all Sbox output bits leak at the same time due to the GF(2)-oriented Sbox implementation. Thus, our selection function focuses on key bits from different registers that, once combined through the Sbox, affect a single bit of the Sbox output. Attacking a section of the 1st round with 10k traces, while the RNG is disabled, is successful, confirming the validity of our choice w.r.t. the leakage model (HW) and selection function. The results are visible in Figure 3. We also perform the CPA attack with the RNG enabled and the results are visible in Figure 4. In order to manage the computation required, we employ the techniques suggested by Bottinelli et al. [11], i.e. we partition the 800k traces, compute the correlation coefficient per partition, then recombine in order to reduce the execution and memory workload. The results demonstrate that no 1st-order leakage can be exploited in the presence of our 2nd-order scheme. Both the t-test and the CPA results are in accordance with the order-reduction theorem, since a 2nd-order masked implementation can maintain $\lfloor 2/2 \rfloor = 1$ order of security in the presence of distance-based leakages.
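One way to keep a correlation computation over hundreds of thousands of traces manageable is to process the traces in partitions and keep only running sums, which can be merged afterwards. The Python sketch below shows this idea for a single time sample and a Hamming-weight prediction. It is a generic running-sum formulation and is not claimed to be the exact recombination formula of Bottinelli et al. [11]; the class and function names are illustrative.

```python
import math


def hamming_weight(v):
    """Number of set bits, used as the HW leakage prediction."""
    return bin(v).count("1")


class CorrAccumulator:
    """Running sums for Pearson correlation that can be merged."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, predictions, samples):
        """Add one partition of (prediction, measured sample) pairs."""
        for x, y in zip(predictions, samples):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.syy += y * y
            self.sxy += x * y

    def merge(self, other):
        """Combine the sums of two independently processed partitions."""
        self.n += other.n
        self.sx += other.sx
        self.sy += other.sy
        self.sxx += other.sxx
        self.syy += other.syy
        self.sxy += other.sxy

    def correlation(self):
        """Pearson correlation coefficient over everything seen so far."""
        num = self.n * self.sxy - self.sx * self.sy
        den = math.sqrt((self.n * self.sxx - self.sx ** 2)
                        * (self.n * self.syy - self.sy ** 2))
        return num / den
```

Each partition needs only six running values per time sample and key hypothesis, so partitions can be processed independently and the traces themselves never have to reside in memory at once.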
Assuming that our device exhibits distance-based leakage, it is of particular interest to prove experimentally that the order-reduction theorem holds when we test the 2nd-order security of our ARM-based masked implementation. Performing a 2nd-order evaluation requires pre-processing the acquired trace sets in order to generate all possible 2-tuples (pairs) of distinct samples via a combination function. Subsequently, the multivariate 2nd-order t-test is performed on the generated trace sets in order to determine the robustness of the 2nd order.
The main hindrance of this process is the computational complexity pertaining to generating and processing all $\binom{NoSamples}{2}$ sample pairs. Even with a small number of samples per trace, the evaluation cost can quickly become prohibitive. To address this issue, researchers have relied on intuitive selection of points of interest in conjunction with naive search [36] or they deployed heuristic techniques such as projection pursuits [24] to perform point of interest selection for higher-order attacks. In our evaluation, we follow the intuitive approach by focusing on a reduced version of the 1st round which contains the substitution layer. Inside this reduced round, we enumerate naively all possible pairs. Given the bitsliced nature of the implementation and the considerable RNG overhead, the reduced round has a length of 800 samples. In order to keep the processing cost manageable, we use the incremental formulas suggested by Schneider et al., which enable the efficient computation of the multivariate statistical moments required for 2nd-order t-tests. The memory-less feature of the computation yields significant improvement compared to straightforward computation techniques. In addition, we partition the reduced round into windows of 150 samples each and perform the test in each window independently. Figure 6 shows the t-test results using 10k fixed input traces and 10k random input traces for the sample window with the largest detected leakage. The test value slightly exceeds the threshold, indicating potential leakage. Thus, it hints at the experimental verification of the order-reduction theorem in our ARM-based device for 2nd-order ISW schemes. However, several concerns were raised over the t-test robustness, usually w.r.t. the exact threshold value (Appendix A from [2], [22]). As a result, it remains an open question whether 2nd-order leakages are practically exploitable in our context.
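The pre-processing step of a bivariate 2nd-order test can be sketched on synthetic data. In the Python sketch below, two time samples leak the two shares of a single masked bit; this toy 1st-order masked target is an assumption chosen for brevity and is not the 2nd-order implementation evaluated in this paper. Each sample on its own passes the 1st-order test, while the centered product of the pair does not. The sketch uses a straightforward two-pass computation rather than the incremental formulas of Schneider et al. [42].

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch t statistic for two samples."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)


def centered_product(s1, s2):
    """Combine two time samples per trace via the centered product."""
    m1, m2 = statistics.fmean(s1), statistics.fmean(s2)
    return [(u - m1) * (v - m2) for u, v in zip(s1, s2)]


def simulate(n, fixed, rng, sigma=0.5):
    """Toy traces: sample 1 leaks the mask m, sample 2 leaks x XOR m."""
    s1, s2 = [], []
    for _ in range(n):
        x = 0 if fixed else rng.getrandbits(1)
        m = rng.getrandbits(1)
        s1.append(m + rng.gauss(0.0, sigma))
        s2.append((x ^ m) + rng.gauss(0.0, sigma))
    return s1, s2


# Number of sample pairs to enumerate for the sizes quoted in the text.
pairs_reduced_round = math.comb(800, 2)   # 319,600 pairs over 800 samples
pairs_per_window = math.comb(150, 2)      # 11,175 pairs per 150-sample window
```

The pair counts show why windowing helps: enumerating pairs inside 150-sample windows costs far less than enumerating all pairs of the 800-sample reduced round, at the price of missing pairs that straddle two windows.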
To investigate this, we perform a 2nd-order CPA-based attack using the centered product combination function and the custom bitsliced selection function on the 1st round of PRESENT. The point selection window has size 100 samples and we use 100k traces. The results are visible in Figure 7 and show that the leakage is exploitable with roughly 60k traces.
As a result, we suggest that the order-reduction theorem remains applicable in software-based, masked implementations for the ARM Cortex-M4. However, we recommend that the exploitation is always verified in practice.
Moreover, we need to stress the fact that this type of behavior has been observed in a specific ARM-based device. Although it provides indications on the behavior of similar architectures, this experimental result should not be extrapolated as a hard fact w.r.t. all ARM Cortex-M devices. Naturally, a 3rd-order multivariate t-test is able to detect a large amount of leakage, as shown in Figure 8, which indicates that a 3rd-order attack is also applicable.
Conclusions
This paper investigated the speed and space requirements of a bitsliced implementation of PRESENT on the ARM Cortex-M4 architecture, protected with 2nd-order ISW masking. In addition, we explored and confirmed the applicability of the order-reduction theorem in the context of ARM-based devices. From the attacker's point of view, future work can involve deciding on the optimal strategy to attack masked implementations, given the amount of leakage available in different security orders. From the defender's point of view, implementors also need to investigate the computational cost of the randomness required for masking, which itself may pose a bigger issue than the quadratic computational complexity of masking.
