Abstract. Research on pairing-based cryptography has brought forth a wide range of protocols that are interesting for future embedded applications. One significant obstacle to the widespread deployment of pairing-based cryptography is its tremendous hardware and software requirements. In this paper we present three side-channel protected hardware/software designs for pairing-based cryptography that are nevertheless small and practically fast: our plain ARM Cortex-M0+-based design computes a pairing in less than one second. Using a multiply-accumulate instruction-set extension or a light-weight drop-in hardware accelerator that is placed between CPU and data memory improves the runtime by up to six times. With a 10.1 kGE drop-in module and a 49 kGE platform, our design is one of the smallest pairing designs available. Its very practical runtime of 162 ms for one pairing on a 254-bit BN curve and its reusability for other elliptic-curve based crypto systems make it a practical solution for microprocessor-based embedded applications.
Introduction
The field of pairing-based cryptography has become the key enabler for novel protocols and algorithms: privacy-aware group-signature schemes [9, 22], identity-based encryption schemes [7, 23], and, more recently, even provably leakage-resilient protocols [25] rely on pairing operations. The practical advantages of those protocols motivate their use in the very competitive markets of embedded microprocessors and smart cards.
The biggest implementation challenges of pairing-based cryptography are related to its tremendous resource and runtime requirements. Therefore, researchers started to implement optimized pairing operations for desktop computers [1, 6], for smart phones [20, 31], and as dedicated hardware modules [16, 24]. Cost-sensitive embedded applications, however, simply do not have the budget for such powerful application processors or 130-180 kGE of dedicated hardware.
For these embedded scenarios, implementations on light-weight RISC processors exist. For example, Szczechowiak et al. [33] need 17.9 seconds for a pairing on an ATmega microprocessor, Gouvêa et al. [18] need 1.9 seconds on an MSP430X microprocessor, and Devegili et al. [15] need 2.5 seconds on a Philips HiPerSmart MIPS microprocessor. Unfortunately, such runtimes are not very promising for real-world, interactive applications, as pairing-based protocols like group-signature schemes often rely on several pairing and group operations. The resulting overall runtimes of several seconds would be considerably too slow. Additionally, it is unclear to which degree timing-analysis, power-analysis, or fault-analysis attacks have been considered in those implementations.
These limitations motivated us to be the first to implement constant-runtime, side-channel protected optimal-Ate pairings using Barreto-Naehrig (BN) curves [4] on an ARM Cortex-M0+ [2, 3] microprocessor. The respective pairing runtime of 993 ms seems very promising as it is several times faster than related work, but might still be insufficient for interactive protocols. Therefore, it was necessary to improve performance by adding dedicated hardware.
In this paper, we present three reusable pairing platforms which offer runtimes of down to 162 ms while requiring at most 10.1 kGE of dedicated hardware, significantly less than similarly fast hardware implementations by related work. Our rigorous hardware/software co-design approach equipped one platform with a multiply-accumulate instruction-set extension and another platform with a drop-in accelerator [35]. By building a flexible, specially crafted drop-in module with several novel design ideas, we were able to improve the runtime of pairing and group operations up to ten times. This concept platform consisting of CPU, RAM, ROM, and drop-in module consumes merely 49 kGE of hardware in total, with 10.1 kGE of those being spent on the drop-in accelerator. The practicability of this platform is evaluated for several high-level pairing protocols [7, 8, 22], each operating in significantly less than one second. Its reusability for Elliptic Curve Cryptography (ECC) is further verified for secp160r1, secp256r1 [11, 29], and Curve25519 [5], requiring 11.9-36.8 ms for a side-channel protected point multiplication. Those results make the drop-in based platform highly suitable for embedded computing, smart cards, wireless sensor nodes, near-field communication, and the Internet of Things.
The paper is structured as follows: Section 2 gives an overview of pairings and Section 3 covers the implementation aspects of the high-level pairing arithmetic. In Section 4, the architectural options to build suitable pairing platforms are presented. The respective platforms are evaluated in Section 5 and compared with related work in Section 6. The (re-)usability of our drop-in platform is the subject of Section 7. Section 8 concludes the paper.
Background on Pairings
The wide range of cryptographic protocols in pairing-based cryptography is based on three cyclic groups $G_1$, $G_2$, $G_T$ of order $n$ and a bilinear pairing operation. A bilinear pairing $e : G_1 \times G_2 \to G_T$ accepts one element from each of the two additive groups $G_1$ and $G_2$, maps them to the multiplicative group $G_T$, and fulfills several properties:
Bilinearity: $e(aP, bQ) = e(P, Q)^{ab}$ for all $P \in G_1$, $Q \in G_2$ and all integers $a$, $b$.
Non-degeneracy: $e(P, Q) \neq 1$ for generators $P$ of $G_1$ and $Q$ of $G_2$.
Computability: $e(P, Q)$ can be computed efficiently.
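The properties of a bilinear map can be made concrete with a toy example. The following minimal sketch is not an elliptic-curve pairing and offers no security; it assumes hypothetical toy parameters (`n = 11`, `q = 23`, `g = 2`) and uses the integers modulo $n$ as both additive groups, so that the map $e(x, y) = g^{xy}$ is bilinear by construction.

```python
# Toy bilinear map for illustration only (NOT an elliptic-curve pairing).
# G1 = G2 = (Z_n, +), GT = the order-n subgroup of Z_q^* generated by g.
# All parameters below are hypothetical toy values.
n = 11   # prime group order
q = 23   # prime with n dividing q - 1
g = 2    # element of order n modulo q


def e(x: int, y: int) -> int:
    """Map (x, y) in Z_n x Z_n to g^(x*y) in GT."""
    return pow(g, (x * y) % n, q)


P, Q = 1, 1   # generators of the two additive groups
a, b = 3, 4   # arbitrary scalars

# g really has order n, so GT has n elements
assert g != 1 and pow(g, n, q) == 1
# bilinearity: e(aP, bQ) = e(P, Q)^(ab)
assert e(a * P % n, b * Q % n) == pow(e(P, Q), a * b, q)
# non-degeneracy: generators do not map to the identity
assert e(P, Q) != 1
```

In a real pairing the groups are elliptic-curve groups in which discrete logarithms are hard, which is exactly what this toy construction lacks.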
The groups $G_1$, $G_2$ are typically groups over elliptic curves and $G_T$ is a subgroup of a large extension field. However, only certain elliptic curves allow the definition of $G_1$, $G_2$, $G_T$ with an admissible bilinear pairing, e.g., [4, 27]. In this paper, we focus on the pairing-friendly elliptic curves by Barreto and Naehrig [4] of the form $E : y^2 = x^3 + b$ with $b \neq 0$ (BN curves). Ate pairings $a(Q, P)$ based on these curves can be described as follows:
Note that for $G_1$, $G_2$ and $G_T$ to have the same prime order $n$, $G_2$ and $G_T$ need to be subgroups of $E(\mathbb{F}_{p^{12}})$ and $\mathbb{F}^*_{p^{12}}$, respectively. The BN curves use a parameter $u$ such that a desired security level is achieved. This allows the computation of the prime $p$ and the prime group order $n$ in dependence of $u$:
$$p(u) = 36u^4 + 36u^3 + 24u^2 + 6u + 1, \qquad n(u) = 36u^4 + 36u^3 + 18u^2 + 6u + 1.$$
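As a sanity check of the parameterization, the sketch below evaluates the standard BN polynomials $p(u) = 36u^4 + 36u^3 + 24u^2 + 6u + 1$ and $n(u) = 36u^4 + 36u^3 + 18u^2 + 6u + 1$ for the two values of $u$ used later in this paper. The function and dictionary names are illustrative, and primality is not checked here.

```python
def bn_parameters(u: int):
    """Return (p, n, t) for a BN parameter u using the standard BN polynomials."""
    p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
    n = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1
    t = 6 * u**2 + 1          # trace of Frobenius, so that n = p + 1 - t
    return p, n, t


# Values of u as stated in the Parameters paragraph of this paper.
CURVES = {
    "BN158": 0x4000800023,
    "BN254": -0x4080000000000001,
}

for name, u in CURVES.items():
    p, n, t = bn_parameters(u)
    print(f"{name}: p has {p.bit_length()} bits, n has {n.bit_length()} bits")
    print(f"  p = {hex(p)}")
    print(f"  n = {hex(n)}")
```

The bit lengths of $p$ match the curve names, which is a quick way to confirm that a value of $u$ was transcribed correctly.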
As another benefit, BN curves possess an efficiently computable group homomorphism that exploits the curve's sextic twist $E'$. Utilization of this homomorphism allows the compression of the elements in $G_2$, which leads to a more efficient definition of the Ate pairing, namely
The pairing $a$ itself consists of the evaluation of a rational function $f_{\lambda,Q}$ and a final exponentiation that maps all cosets to the same unique representative:
$$a(Q, P) = f_{\lambda,Q}(P)^{(p^{12}-1)/n}.$$
Owing to the Frobenius homomorphism, the final exponentiation by $(p^{12} - 1)/n$ can be split into an easy part $(p^6 - 1)(p^2 + 1)$ and a hard part $(p^4 - p^2 + 1)/n$. The function $f_{\lambda,Q}$ can in general not be evaluated directly. However, Miller [26] described an important property of rational functions, namely
$$f_{i+j,Q} = f_{i,Q} \cdot f_{j,Q} \cdot \frac{\ell_{[i]Q,[j]Q}}{\nu_{[i+j]Q}}.$$
The property allows the computation of $f_{\lambda,Q}$ in polynomial time by merely evaluating vertical ($\nu$) and straight ($\ell$) lines in elliptic curve points using a double-and-add approach. Values of $\lambda$ with low Hamming weight result in a particularly fast computation of $f_{\lambda,Q}$; in that case the pairing becomes optimal. In this work, we used the efficient optimal-Ate pairing by Vercauteren [34].
[Figure 1. Box labels: Pairing; Integer Arith.; Hash, PRNG, Symmetric Algorithm, ...]
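The split of the final exponentiation into an easy and a hard part, as described above, can be checked numerically. The sketch below assumes the standard BN polynomials for $p$ and $n$ and the BN254 parameter $u$ used in this paper; it only verifies the integer identity and does not perform any field arithmetic.

```python
# BN254 parameter as used in this paper; p and n follow the standard BN polynomials.
u = -0x4080000000000001
p = 36 * u**4 + 36 * u**3 + 24 * u**2 + 6 * u + 1
n = 36 * u**4 + 36 * u**3 + 18 * u**2 + 6 * u + 1

# Easy part: cheap, because powers of p are Frobenius maps in F_{p^12}.
easy = (p**6 - 1) * (p**2 + 1)

# Hard part: (p^4 - p^2 + 1) / n must be an integer.
hard, remainder = divmod(p**4 - p**2 + 1, n)
assert remainder == 0

# Together they give the full exponent (p^12 - 1) / n.
assert (p**12 - 1) % n == 0
assert easy * hard == (p**12 - 1) // n

print("hard part has", hard.bit_length(), "bits")
```

The check works because $p^{12} - 1 = (p^6 - 1)(p^2 + 1)(p^4 - p^2 + 1)$ and $n$ divides the last factor for BN curves.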
High-Level Arithmetic
The computation of bilinear pairings over BN curves requires several layers of arithmetic. As illustrated in Figure 1, all arithmetic is based on a multi-precision integer arithmetic layer. On top of that, prime-field arithmetic and a tower of extension fields are built. The elliptic curve groups used as $G_1$ and $G_2$ utilize the prime field and its quadratic extension field, respectively. The largest extension field $\mathbb{F}_{p^{12}}$ is used by $G_T$. The pairing computation itself is based on the groups $G_1$, $G_2$, $G_T$, and their underlying field arithmetic.
Methodology. Our state-of-the-art implementations are based on the techniques used by Beuchat et al. [6] and Devegili et al. [14]. The pairing implementation uses the fast formulas by Costello et al. [13], the inversion trick by Aranha et al. [1], a lazy reduction technique in $\mathbb{F}_{p^2}$ [6, 31], and a slightly modified variant of the final exponentiation by Fuentes-Castañeda et al. [17] that requires less memory (see Appendix A.1). The prime-field inversion using Fermat's little theorem is optimized according to Appendix A.2. Since operations in $G_T$ and in the hard part of the final exponentiation take place in the cyclotomic subgroup of $\mathbb{F}^*_{p^{12}}$, dedicated squaring formulas are utilized [19]. The point multiplications in both elliptic curve groups use Montgomery ladders that are based on fast formulas [21] in homogeneous projective co-Z coordinates.
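The lazy reduction idea in $\mathbb{F}_{p^2}$ can be sketched as follows: three full-width integer multiplications are computed first, and only two modular reductions are applied at the end. This is a minimal sketch that assumes the representation $\mathbb{F}_{p^2} = \mathbb{F}_p[i]/(i^2 + 1)$, which is a common choice for BN254 but is an assumption here; the function names are illustrative and Python's big integers stand in for the multi-precision layer.

```python
# BN254 prime from the standard BN polynomial (u as used in this paper).
U = -0x4080000000000001
P = 36 * U**4 + 36 * U**3 + 24 * U**2 + 6 * U + 1


def fp2_mul_lazy(a, b, p=P):
    """Multiply a = a0 + a1*i and b = b0 + b1*i with i^2 = -1 (assumed).

    Uses three integer multiplications and only two reductions.
    """
    a0, a1 = a
    b0, b1 = b
    # Three double-width products, not yet reduced (Karatsuba style).
    t0 = a0 * b0
    t1 = a1 * b1
    t2 = (a0 + a1) * (b0 + b1)
    # Two reductions, one per coefficient of the result.
    c0 = (t0 - t1) % p
    c1 = (t2 - t0 - t1) % p
    return c0, c1


def fp2_mul_schoolbook(a, b, p=P):
    """Reference version: four multiplications, reduced after each one."""
    a0, a1 = a
    b0, b1 = b
    c0 = ((a0 * b0) % p - (a1 * b1) % p) % p
    c1 = ((a0 * b1) % p + (a1 * b0) % p) % p
    return c0, c1


x = (123456789, P - 5)
y = (P - 1, 987654321)
assert fp2_mul_lazy(x, y) == fp2_mul_schoolbook(x, y)
```

In an implementation based on Montgomery arithmetic, the subtraction before the reduction would additionally need a multiple of $p$ added to keep the intermediate value non-negative; Python's `%` hides that detail.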
Parameters. As this work aims to offer a certain degree of flexibility, both the 80-bit and the 128-bit security level are supported. The two elliptic curves BN158 [18] ($u = 4000800023_h$) and BN254 [30] ($u = -4080000000000001_h$) of the form $y^2 = x^3 + 2$ were chosen. Those lead to particularly fast execution times as the respective constants $\lambda$ of $f_{\lambda,Q}$ have low Hamming weights. The extension field $\mathbb{F}_{p^2}$ is represented as
, with $\zeta = (1 + i)$ for BN254 and $\zeta = \frac{1}{1+i}$ for BN158.
Implementation Attacks. An important aspect in the implementation of pairings and group arithmetic for embedded applications is the consideration of side-channel attacks. While scalar factors or exponents are typically the secret operands for operations in $G_1$, $G_2$ and $G_T$, an elliptic curve point may have to be protected in the case of pairing operations.
As a countermeasure to timing attacks, all implemented algorithms have constant, data-independent runtime. Therefore, some fast but vulnerable point multiplication algorithms, for example, are not used. Both the point multiplications in $G_1$, $G_2$ and the exponentiations in $G_T$ hence use Montgomery ladders. The implementation's countermeasures against first-order Differential Power Analysis (DPA) attacks comprise Randomized Projective Coordinates (RPC) [12] in both the pairing computation and the point multiplications in $G_1$ and $G_2$. To detect fault attacks on data, point multiplications in $G_1$ and $G_2$ include several point verifications. DPA and fault attacks on exponentiations in $G_T$ as well as fault attacks on pairings were also taken into consideration, but can better be handled on the protocol layer using randomization.
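The structure of a Montgomery ladder can be sketched for an exponentiation in a multiplicative group. The sketch below is a generic illustration, not the implementation evaluated in this paper: it works modulo a small prime instead of in $G_T$, and the names `cswap` and `ladder_pow` are hypothetical. Python itself gives no timing guarantees; the point is only that every key bit triggers the same sequence of one multiplication and one squaring.

```python
def cswap(a: int, b: int, bit: int):
    """Swap a and b if bit == 1, using a mask instead of a branch."""
    mask = -bit                  # 0 if bit == 0, all ones if bit == 1
    t = mask & (a ^ b)
    return a ^ t, b ^ t


def ladder_pow(g: int, k: int, p: int, bits: int) -> int:
    """Compute g^k mod p with a Montgomery ladder over a fixed number of bits."""
    r0, r1 = 1, g % p            # invariant: r1 == r0 * g (mod p)
    for i in reversed(range(bits)):
        bit = (k >> i) & 1
        r0, r1 = cswap(r0, r1, bit)
        r1 = (r0 * r1) % p       # always one multiplication ...
        r0 = (r0 * r0) % p       # ... and one squaring per bit
        r0, r1 = cswap(r0, r1, bit)
    return r0


# Small demonstration with hypothetical values.
assert ladder_pow(5, 123456789, 1000003, 32) == pow(5, 123456789, 1000003)
```

The loop always runs over the full bit length, including leading zero bits of the exponent, so the number of group operations does not depend on the secret.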
Hardware Architectures
To meet the high requirements of pairing-based cryptography in embedded devices, our goal was to equip a stand-alone microprocessor, designated for embedded applications, with a dedicated hardware unit such that: (i) Pairing computations are usable within interactive (e.g., authentication) protocols. (ii) A pre-existing microprocessor platform is modified only minimally. (iii) The overall hardware requirements, i.e., the costs, are kept small and considerably below the 100 kGE needed in related work [16, 24]. (iv) Embedded applications such as wireless sensor nodes and NFC should be practically feasible.
Figure 2 summarizes potential architectures that can be used to attain such goals. The straightforward solution (a), a sole off-the-shelf microprocessor, requires minimal hardware-development time, but potentially delivers insufficient performance. The runtimes desirable for interactive protocols can only be achieved by either adding powerful, dedicated instructions (b), or by adding dedicated co-processors. Contrary to a dedicated hardware module (c), a drop-in module (d) is memoryless and requires neither a Direct Memory Access (DMA) controller nor a multi-master bus. Wenger [35] showed the advantages of the drop-in concept in comparison to a dedicated hardware module for binary-field ECC. However, the applicability of this technique for prime-field based pairings is still an open question.
Following up on the potential architectures, we consecutively evaluate the practicability of a plain microprocessor design (a), a multiply-accumulate instruction-set extension (b), and a dedicated drop-in module (d).
The Used Microprocessor
The accomplishment of the initially set goals highly depends on the used microprocessor. As the runtime figures by Szczechowiak et al. [32] and Gouvêa et al. [18] discourage the use of an 8-bit or 16-bit microprocessor, a 32-bit microprocessor is preferred as a basis. Moreover, the bottleneck between computation unit and RAM is less of an issue if 32-bit interfaces are used. We hence decided to utilize a self-built processor functionally equivalent to the ARM Cortex-M0+ [2] , because the Cortex-M0+ was especially designed for embedded applications and currently is one of the smallest 32-bit processors in production. The Cortex-M0+ has 16 32-bit general-purpose registers of which 8 are efficiently usable. It comes with a mixed 16/32-bit Thumb/Thumb-2 instruction set and optionally either a 32-cycle or single-cycle 32-bit multiplier. In its minimum configuration, ARM specifies its Cortex-M0+ to require only 12 kGE in a 90 nm process technology.
The Software Framework
The biggest advantage of an off-the-shelf microprocessor is the availability of vast (open-source) toolchains. Thus a high-level framework capable of pairing-based cryptography using BN curves was created in C. It provides extension field arithmetic, elliptic curve operations, and bilinear pairings. The framework focuses on both good performance and low memory consumption. To achieve the latter, several optimizations were incorporated into the framework. First, virtually all of the memory is allocated on the stack. As stack variables are discarded at the end of each function, stack allocation facilitates the reduction of required memory by separating code into different functions. Second, allocated memory is reused where possible. Third, memory-optimized algorithms are used, e.g., for the final exponentiation as in Appendix A.1. Last, compiler optimizations are used to decrease the program size. For this purpose, the compiler options -ffunction-sections, -fdata-sections and the linker options -gc-sections, --specs=nano.specs are passed to the bare-metal ARM GNU toolchain (version 4.7.4).
The high-level pairing framework is common to all three evaluated platforms. The main difference between these platforms is the implemented finite-field arithmetic. While (a) and (b) control the whole finite-field arithmetic in software, (d) relies on finite-state machines to perform additions, subtractions and multiplications in $\mathbb{F}_p$ and $\mathbb{F}_{p^2}$. Nevertheless, all implementation options ensure constant runtime and consider side-channel attacks.
Assembly-Optimized Software Implementation (a)
The plain microprocessor platform (a) is based on a Cortex-M0+ with a single-cycle multiplier. Its hand-crafted assembly routines for optimized prime-field arithmetic always perform a reduction step to ensure constant runtime. This is accomplished by storing the reduction result either to the target or a dummy memory location via masking of the operand addresses. The crucial prime-field multiplication utilizes an unrolled Separated Product Scanning (SPS) method of the Montgomery multiplication [28] that is derived from [10]. The SPS variant is chosen because of the particular $\mathbb{F}_{p^2}$ multiplication technique [6, 31] we use, which performs the required three multiplications and two reductions separately. Product scanning can further be efficiently implemented on the processor if three registers are used as an accumulator, as presented in [36]. The reduction step for the curve BN254 is further optimized as several multiply-accumulates can be skipped due to the sparse prime [18].
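The separation of multiplication and reduction in SPS can be sketched on 32-bit words. The following is a minimal model, not the assembly routine described above: it first forms the double-width product column by column and then performs a product-scanning Montgomery reduction. All function names are illustrative, a Python integer stands in for the three-register accumulator, and the final select models the always-executed reduction step rather than the address masking used on the processor.

```python
W = 32
MASK = (1 << W) - 1


def to_words(x, n):
    """Split x into n little-endian W-bit words."""
    return [(x >> (W * i)) & MASK for i in range(n)]


def from_words(ws):
    return sum(w << (W * i) for i, w in enumerate(ws))


def product_scan_mul(a, b):
    """Column-wise (product scanning) multiplication of two n-word operands."""
    n = len(a)
    t = [0] * (2 * n)
    acc = 0                                  # models the three-word accumulator
    for k in range(2 * n - 1):
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]           # one multiply-accumulate
        t[k] = acc & MASK
        acc >>= W
    t[2 * n - 1] = acc
    return t


def product_scan_reduce(t, p, p_neg_inv):
    """Montgomery reduction of the 2n-word value t, again column by column."""
    n = len(p)
    m = [0] * n
    r = [0] * (n + 1)
    acc = 0
    for k in range(n):
        acc += t[k]
        for j in range(k):
            acc += m[j] * p[k - j]
        m[k] = (acc * p_neg_inv) & MASK      # makes the low word vanish
        acc += m[k] * p[0]
        acc >>= W
    for k in range(n, 2 * n):
        acc += t[k]
        for j in range(k - n + 1, n):
            acc += m[j] * p[k - j]
        r[k - n] = acc & MASK
        acc >>= W
    r[n] = acc
    return r


def mont_mul(x, y, p_int, n):
    """Return x * y * R^-1 mod p with R = 2^(W*n)."""
    p = to_words(p_int, n)
    p_neg_inv = (-pow(p[0], -1, 1 << W)) & MASK
    t = product_scan_mul(to_words(x, n), to_words(y, n))
    r = from_words(product_scan_reduce(t, p, p_neg_inv))
    reduced = r - p_int                      # subtraction is always computed
    return (reduced, r)[reduced < 0]         # select a result without early exit


# BN254 prime from the standard BN polynomial (u as used in this paper).
U = -0x4080000000000001
P_BN254 = 36 * U**4 + 36 * U**3 + 24 * U**2 + 6 * U + 1
R_INV = pow(1 << 256, -1, P_BN254)
x, y = 3**150 % P_BN254, P_BN254 - 2
assert mont_mul(x, y, P_BN254, 8) == (x * y * R_INV) % P_BN254
```

Keeping the two phases separate is what allows the $\mathbb{F}_{p^2}$ multiplication to run `product_scan_mul` three times but `product_scan_reduce` only twice.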
Multiply-Accumulate Hardware Extensions (b)
The performance of the prime-field multiplication significantly suffers from the 32 × 32 → 32 bit multiplier of the Cortex-M0+, which results in 80% of a pairing's runtime being spent in $\mathbb{F}_p$ multiplications. To improve this, the processor core is equipped in (b) with a multiply-accumulate extension similar to [36]. It adds the result of a full 32 × 32 → 64 bit multiplication to three accumulation registers in a single cycle. In order to avoid a modification of the compiler toolchain, the TST instruction, which is not required for prime-field multiplication, is reinterpreted as a multiply-accumulate instruction if a certain bit in the control register is set. The control register is manipulated accordingly at the beginning and the end of a prime-field multiplication. Besides accelerating multiply-accumulate operations, the extension means that the prime-field multiplication requires fewer registers for temporary variables, which we exploit by caching some of the operand words in the product scanning routine.
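The effect of such a multiply-accumulate instruction can be modeled in a few lines. The sketch below is a behavioral model only: the class name `MacUnit` and its methods are hypothetical, and the carry handling shown here is one plausible way to add a 64-bit product to three 32-bit accumulator registers.

```python
MASK32 = 0xFFFFFFFF


class MacUnit:
    """Behavioral model of a 32 x 32 -> 64 bit multiply-accumulate
    into three 32-bit accumulator registers (96 bits in total)."""

    def __init__(self):
        self.acc = [0, 0, 0]                 # low, middle, high word

    def mac(self, a: int, b: int) -> None:
        prod = (a & MASK32) * (b & MASK32)   # full 64-bit product
        lo, hi = prod & MASK32, prod >> 32
        s = self.acc[0] + lo
        self.acc[0] = s & MASK32
        s = self.acc[1] + hi + (s >> 32)     # propagate carry into middle word
        self.acc[1] = s & MASK32
        self.acc[2] = (self.acc[2] + (s >> 32)) & MASK32

    def shift_out(self) -> int:
        """Return the lowest word and shift the accumulator right by one word,
        as done at the end of each product-scanning column."""
        word = self.acc[0]
        self.acc = [self.acc[1], self.acc[2], 0]
        return word

    def value(self) -> int:
        return self.acc[0] | (self.acc[1] << 32) | (self.acc[2] << 64)


unit = MacUnit()
for _ in range(8):                           # one column of a 256-bit multiplication
    unit.mac(0xFFFFFFFF, 0xFFFFFFFF)
assert unit.value() == 8 * 0xFFFFFFFF**2
```

Eight maximal products fit comfortably into 96 bits, which is why three registers suffice as an accumulator for a column of a 256-bit product.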
The Drop-in Module (d)
As a consequence of the high-level runtime and area goals, it is of utmost importance to maximize the utilization of the invested chip hardware. To achieve this, a lightweight hardware drop-in accelerator is placed between processor and data memory. The respective design, which is shown in Figure 3 , uses a Cortex-M0+, but any other processor is equally suitable.
The drop-in module provides unrolled state machines and an appropriate arithmetic unit for 160-bit and 256-bit $\mathbb{F}_p$ multiplication, $\mathbb{F}_p$ addition and $\mathbb{F}_p$ subtraction. It further encompasses state machines to control $\mathbb{F}_{p^2}$ addition, $\mathbb{F}_{p^2}$ subtraction, $\mathbb{F}_{p^2}$ multiplication and $\mathbb{F}_{p^2}$ squaring. Several memory-mapped registers are used to control the drop-in module. A lightweight arbiter is built in which always gives preference to the CPU when the CPU wants to access the data memory. In such a case, the drop-in module is prepared to stall its operation.
The core element of our drop-in module is a multiply-accumulate unit that is used to perform a Finely Integrated Product Scanning (FIPS) [ 2 load operations. Instead of using a dual-port memory, we attain a perfectly utilized bus and a perfectly utilized multiplier by using a two-cycle multiply-accumulate unit that is based on a W × W/2-bit multiplier. This saves 3 kGE for W = 32 in a 130 nm process compared to a traditional W × W-bit multiplier.
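The idea behind a two-cycle multiply-accumulate on a W × W/2-bit multiplier can be sketched for W = 32. This is a behavioral model under stated assumptions: the order in which the two halves of operand B are processed and the function name are hypothetical, and the 68-bit accumulator width is taken from the component list of the module.

```python
W = 32
HALF = W // 2
ACC_BITS = 68                                 # accumulator width for W = 32, N = 8


def mac_two_cycle(acc: int, a: int, b: int) -> int:
    """Accumulate a * b using a 32 x 16-bit multiplier in two steps."""
    b_lo = b & ((1 << HALF) - 1)
    b_hi = b >> HALF
    acc += a * b_lo                           # cycle 1: 32 x 16-bit partial product
    acc += (a * b_hi) << HALF                 # cycle 2: second partial product, shifted
    return acc & ((1 << ACC_BITS) - 1)


acc = 0
for a, b in [(0xFFFFFFFF, 0xFFFFFFFF), (0xDEADBEEF, 0x0BADF00D)]:
    acc = mac_two_cycle(acc, a, b)
assert acc == 0xFFFFFFFF * 0xFFFFFFFF + 0xDEADBEEF * 0x0BADF00D
```

Since each multiply-accumulate needs two operand words from a single-port memory anyway, spreading the multiplication over two cycles matches the bus bandwidth while halving the multiplier.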
A finite-field operation is started by writing three memory pointer registers (OpA, OpB, and RES) and a control register. As those registers are mapped at consecutive addresses, the store-multiple instruction (STM) of the Cortex-M0+ can be used to efficiently start an operation. A started finite-field multiplication is performed using the following hardware components: a W ×W/2 = 32×16-bit multiplier, a ld(2N ) + 2W = 68-bit ACCumulator, a W = 32-bit register for operand A (OpAReg), a 3W/2 = 48-bit register for operand B (OpBReg), and a W = 32-bit WRITE register. In OpBReg, the top 32 bits are always written by the bus and the lowest 16 bits are used as an operand of the multiplier. Therefore, a sequence of shift/rotate operations is necessary to actually multiply the loaded operands. Table 1 visualizes the dataflow within the drop-in module. For a single multiply-accumulate operation five clock cycles are necessary. As the drop-in This data is later written to the address RES+i+j, when the bus is not utilized.
As the fully utilized bus needs some free cycles to write the result, we use a zig-zag product scanning technique (cf. Figure 4) [37] . In this technique, consecutive columns are traversed in different order, which allows caching of a single operand from one column to the next. This frees the bus for 2N cycles, which are exactly the 2N cycles required to store the computed results.
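The zig-zag traversal can be illustrated with a small load-counting model. In the sketch below, even columns are traversed with ascending index and odd columns with descending index, so that the last operand pair of one column and the first pair of the next column share one word. The function names are illustrative. In this simplified model one load is saved at each of the 2N−2 column transitions; the 2N free bus cycles stated above depend on dataflow details of the module that are not modeled here.

```python
def zigzag_columns(n: int):
    """Return, for each column k, the list of index pairs (i, j) with i + j = k,
    alternating the traversal direction from column to column."""
    cols = []
    for k in range(2 * n - 1):
        idx = list(range(max(0, k - n + 1), min(k, n - 1) + 1))
        if k % 2 == 1:
            idx.reverse()                     # odd columns run in opposite order
        cols.append([(i, k - i) for i in idx])
    return cols


def zigzag_mul(a, b, w=32):
    """Multiply two word arrays in zig-zag order and count operand loads,
    assuming the previously used word of each operand stays in a register."""
    n = len(a)
    acc, result, loads = 0, 0, 0
    last_i = last_j = None
    for k, col in enumerate(zigzag_columns(n)):
        for i, j in col:
            loads += (i != last_i) + (j != last_j)   # reuse a cached word if possible
            last_i, last_j = i, j
            acc += a[i] * b[j]
        result |= (acc & ((1 << w) - 1)) << (w * k)
        acc >>= w
    result |= acc << (w * (2 * n - 1))
    return result, loads


N = 8
a_words = [0xFFFFFFFF - i for i in range(N)]
b_words = [0x80000001 + i for i in range(N)]
product, loads = zigzag_mul(a_words, b_words)

a_int = sum(x << (32 * i) for i, x in enumerate(a_words))
b_int = sum(x << (32 * i) for i, x in enumerate(b_words))
assert product == a_int * b_int
print("loads:", loads, "instead of", 2 * N * N)
```

The traversal order does not change the result, because the multiply-accumulates within a column commute; it only changes which operand word is still available when the next column starts.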
Although the implemented FIPS multiplication is quite complex, the software running on the CPU is completely independent of the methodology used to perform finite-field arithmetic within the drop-in module. However, there are two implementation guidelines the software has to deal with. First, constant variables have to be temporarily copied to the data memory when being used. Second, there are two techniques to wait for the drop-in module to finish. A function delegating an operation to the drop-in module can either start an operation and wait for it to finish, or wait for a previously started operation to finish and only then start a new operation. The latter case is faster because the CPU and the drop-in module potentially work in parallel, i.e., the control flow operations involved in the invocation of the routines that call the drop-in module are done while the drop-in module is computing. However, temporary variables on the stack are freed once a function finishes, which requires additional wait statements within the extension-field arithmetic to prevent the drop-in module from accessing reallocated memory locations. Nevertheless, the utilization of the drop-in module is increased from 77.6% to 85.1% when the function first waits for previous operations to finish. Similarly, the utilization of the RAM is raised from 75.7% to 80.1% (cf. 34.6% in (b), 17.0% in (a)).
Implementation Results
To verify the achievement of the area and performance goals initially set, the three microprocessor-based platforms (a), (b) and (d) were evaluated with respect to hardware and software. Regarding the overall hardware platforms, runtime, area, power, and energy consumption are the distinguishing characteristics. Regarding the software part, the evaluation focuses on the runtimes of the underlying finite-field arithmetic and the most expensive operations used within protocols: the point multiplications in $G_1$ and $G_2$, the exponentiation in $G_T$, and the pairing operation.
The results in Table 2 show that the multiply-accumulate extension speeds up the prime-field multiplications by factors of 4.0-5.0, but leaves the prime-field additions unaffected. The same speed-ups are observed for prime-field inversions and point multiplications in $G_1$. However, the impact of the multiply-accumulate extension on the performance of both pairings and operations in $G_2$, $G_T$ is lower and lies between a factor of 2.1 and 3.3. Considering the performance of the drop-in module, an even greater speed-up is observed compared to the plain software implementation. In this case, prime-field multiplications, inversions and point multiplications in $G_1$ are up to 11.3 times faster, which eventually results in an up to 6.1 times faster computation of pairings. On average, operations using BN158 are 3.0 times faster than operations using BN254.
Throughout all implementations, the demand for data memory is kept relatively low, with a maximum of 1,876 bytes and 2,880 bytes for BN158 and BN254, respectively. Similarly, the program sizes are kept small, e.g., 18 KB for BN254. Given a typical clock frequency of 48 MHz, the performance results of the point multiplications in $G_1$, $G_2$, the exponentiation in $G_T$, and the pairing operation are illustrated in Figure 5.
While Table 2 focuses on the software part, the most important hardware characteristics are summarized in Table 3. The runtime is given for a single pairing computation. Both area and power measurements were determined for a 130 nm low-leakage UMC technology. The area results in a 90 nm UMC technology are explicitly marked. The designs were synthesized and their power and runtime evaluated for a clock frequency of 48 MHz. Both data and program memory were realized using RAM and ROM macros of appropriate sizes. The program memory encompasses all routines required to implement pairing-based protocols, i.e., pairings, operations in $G_1$, $G_2$, and $G_T$. These platforms are hence ready-to-use for future applications based on pairings over BN curves.
According to Table 3, BN254 pairing computations with reasonable performance are available at the cost of 57.7 kGE in a 130 nm process technology. Switching to the more advanced 90 nm process technology shrinks the design to 49.0 kGE, constituting one of the smallest available hardware designs for pairings with practical relevance. In terms of power consumption, the plain microprocessor design is, as expected, the most economical. The multiply-accumulate extension and the drop-in module increase power consumption by 25% and 70%, respectively. Due to their increased performance, these platforms are nevertheless more energy-efficient. Their respective demand for energy is 2.1 and 3.5 times lower.
Comparison with Related Work
As a consequence of our hardware/software co-design approach, comparison with related work focuses on two aspects. On the one hand, the pure software implementation on the Cortex-M0+ is brought into relation to other software implementations on low-resource hardware. On the other hand, the resulting hardware design is compared with other dedicated pairing hardware implementations.
The comparison of our software implementation with related implementations of Ate pairings over BN curves providing approximately 128-bit security is summarized in Table 4. Gouvêa et al. [18] provide highly optimized software implementations for the 16-bit microcontroller MSP430 and a variant of its successor MSP430X, which is equipped with a 32-bit multiplier (MPY32). The implementation by Devegili et al. [15] is evaluated on a 32-bit Philips HiPerSmart smart card, which has a SmartMIPS architecture and clearly is a direct competitor of Cortex-M0+-based smart cards. However, it is unclear to which extent side-channel resistance is considered by either of them.
As both the MSP430 and the Cortex-M0+ use a 16-bit instruction-set, it is important to highlight the exceptionally low program and data memory footprint of our implementations. It is however hard to compare the quality of an implementation when different frameworks and different microprocessors are involved.
Other pairing implementations for 32-bit ARM processors are limited to the Cortex-A series, such as in [20]. However, their pairing's runtime of 9.9 ms on
In comparison to [16] and [24], our drop-in-based platform is 2.2-3.1 times smaller with regard to total area consumption. In both [24] and our case the CPU and the data memory can be reused for other applications. In terms of dedicated hardware, our drop-in-based platform is 16.6 times smaller than the work of Fan et al. In exchange, their design is faster and provides the best area-runtime product according to Figure 6. However, it depends on the application how much hardware area is actually acceptable to be spent on a dedicated pairing accelerator.
Re-usability of our Drop-in Architecture
To emphasize the practicability of our low-area platforms for deploying cryptography to embedded environments, several protocols that are relevant in such a context have been assessed in terms of the performance to expect.
Using the Drop-in Module for Pairing-based Protocols. The short signature scheme by Boneh et al. [8] is interesting for constrained signature devices as it helps to reduce communication. As a representative of group signatures, which help to provide anonymous authentication, the scheme by Hwang et al. [22] was chosen. To be able to establish a random session key without the necessity of verifying public keys, the identity-based encryption scheme by Boneh et al. [7] in its Key Encapsulation Mechanism (KEM) variant was evaluated as it combines good performance with small parameters. Additionally, the leakage-resilient bilinear ElGamal KEM by Kiltz and Pietrzak [25] is taken into consideration because it is proven to have bounded side-channel leakage.
The number of computationally expensive operations and the expected overall runtime of each of the aforementioned protocols are presented in Table 6 . The runtimes are given for the drop-in module based platform. As the figures suggest, all of the protocols may be performed on the device with user interaction as response times lie noticeably below one second.
Using the Drop-in Module for ECC. In order to emphasize the reusability of our drop-in module based design, we also evaluated the performance of the standardized curves [11, 29] secp160r1 and secp256r1 and the performance of Curve25519 by Bernstein [5], which is often proposed as a replacement for the standardized NIST curves. Again, we follow the point multiplication methodology from [36], which relies on Montgomery ladders, randomized projective coordinates and multiple point validation checks. All implementations have similar hardware footprints and require 4.1 kGE (500 bytes) for RAM, 6.2 kGE (3,200 bytes) for ROM, 10.1 kGE for the drop-in module, 12.6 kGE for the Cortex-M0+, and 33 kGE in total (in a 90 nm UMC technology). Point multiplications for secp160r1, secp256r1, and Curve25519 need 570 kcycles, 1,765 kcycles, and 1,110 kcycles, respectively. Note that we do not take advantage of the special form of the underlying primes. However, with runtimes of 11.9-36.8 ms (at 48 MHz) the drop-in concept is clearly an enabler of elliptic-curve based interactive protocols.
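The runtimes and the total area quoted above follow directly from the stated cycle counts, clock frequency, and component areas. The short sketch below merely reproduces that arithmetic; the variable names are illustrative.

```python
CLOCK_HZ = 48_000_000                         # clock frequency used in the evaluation

# Point multiplication cycle counts in kcycles, as stated in the text.
kcycles = {"secp160r1": 570, "secp256r1": 1765, "Curve25519": 1110}

# kcycles * 1000 cycles, divided by the number of cycles per millisecond.
runtime_ms = {name: kc * 1000 / (CLOCK_HZ / 1000) for name, kc in kcycles.items()}

# Component areas in kGE (90 nm UMC technology), as stated in the text.
area_kge = {"RAM": 4.1, "ROM": 6.2, "drop-in module": 10.1, "Cortex-M0+": 12.6}
total_kge = sum(area_kge.values())

for name, ms in runtime_ms.items():
    print(f"{name}: {ms:.1f} ms")
print(f"total area: {total_kge:.1f} kGE")
```

The fastest and slowest curves give the 11.9-36.8 ms range, and the component areas add up to the stated 33 kGE.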
Conclusion
According to our evaluations of three microprocessor-based hardware designs, the utilization of a compact 32-bit microprocessor results in notably small pairing implementations. Requiring merely 45.2-49.0 kGE of chip area, we provided one of the smallest available hardware designs capable of bilinear pairings. The most prominent platform was however obtained by the construction of a dedicated drop-in hardware module for prime-field arithmetic. Its low area requirements and highly practical runtime facilitate pairing-based cryptography in interactive embedded applications.
