Identity-based cryptography relies on pairing functions, i.e., bilinear maps defined on elliptic curves. Computing pairings efficiently in software is currently an active research topic. Since such functions are complex and slow in software, dedicated hardware (HW) implementations are worth studying, but only very preliminary research is available at present. This work addresses the problem of designing parallel dedicated HW architectures, i.e., co-processors, for the Tate pairing, in the case of the Duursma-Lee algorithm in characteristic 3. Formal scheduling methodologies are applied to carry out an extensive exploration of the architectural solution space, and the resulting structures are evaluated by means of different figures of merit such as computation time, circuit area and combinations thereof. Comparisons with the few existing proposals are carried out, showing that a large space exists for the efficient parallel HW computation of pairings.
Introduction
Pairings play an important role in identity-based encryption (IBE). This form of cryptography was introduced by Shamir in 1984 [20]: a public key is derived from publicly identifiable information, such as an e-mail address, and the corresponding private key is created by binding the user identity with a trusted authority's secret key. This idea avoids relying on certificates to validate the authenticity of a public key and, in some situations, simplifies the infrastructure of a public key system. In the same paper, Shamir left the construction of an effective IBE algorithm as an open problem. Solutions implementing this idea were later presented by Boneh and Franklin [5], Cocks [6] and Sakai et al. [19]. Apart from the Cocks scheme, both the Boneh-Franklin scheme and the Sakai et al. scheme make use of bilinear maps, or pairings, defined over elliptic curves. Nowadays many other cryptographic protocols benefit from pairings, including a variety of signature and key-establishment schemes. A survey of pairing-based cryptographic protocols can be found in [7].
Today the Tate pairing [9] is considered the most convenient pairing function in terms of computational cost. Since 2002 much progress has been made on the computation of the Tate pairing over supersingular elliptic curves. The original algorithm was designed by Miller [17] and was modified by Barreto et al. [3], who obtained the so-called BKLS algorithm, while a closed formula in characteristic 3 was presented for the first time by Duursma and Lee [8]. More recently the Duursma-Lee algorithm has been improved by Kwon [16], Barreto et al. [2] and Granger et al. [12]. Related work on the implementation of such algorithms by means of dedicated hardware (HW) architectures has been carried out by Grabher et al. [11] and Kerins et al. [14]. In [11] an evaluation of FPGA architectures for polynomial and normal bases in characteristic 3 is presented, for the Duursma-Lee and Kwon algorithms; there the emphasis is on the efficiency of arithmetic in the field $\mathbb{F}_{3^m}$. In [14] the focus is instead on high-level parallelism, in the Duursma-Lee algorithm, for the operations in the extension field $\mathbb{F}_{q^k}$ (in particular with $q = 3^m$): these authors reduce computation time, but at the expense of a high number of function units and hence of the silicon area of the device.
In this paper we describe a method for designing parallel HW architectures for the Duursma-Lee algorithm in characteristic 3. Here the emphasis is on executing in parallel the operations in the base field $\mathbb{F}_{3^m}$ (not in $\mathbb{F}_{(3^m)^k}$ as in [14]), and we show that a large space of such solutions exists. We balance the trade-off between the speed and the size of the dedicated HW device. Such considerations are applicable to the design of HW accelerators for pairing-based primitives in embedded systems as well as in internet servers.
We point out that the software (SW) computation of the Tate pairing is relatively slow [2, 12], and therefore HW solutions are desirable. The existing studies [11, 14] are preliminary and either do not consider parallelism [11] or do so only at a high level [14], thus exploring the solution space only to a limited extent. Moreover, the Duursma-Lee algorithm (like the others) is decidedly complex, beyond the threshold of practical manual parallelization. We therefore adopt the methods and techniques of the theory of automated algorithm scheduling [10], for an exhaustive automated exploration of the parallelism of the selected algorithm. We develop all the necessary SW tools and generate several parallel architectures. Then we evaluate them with respect to typical figures of merit, such as computation time (T), circuit area (A) and combinations thereof (AT, etc.), and we compare them with solutions from the literature.
The paper is organized as follows: Section 2 summarizes the basic notions about the Tate pairing; Section 3 outlines the architecture and sketches the parallelization methodology; Section 4 sketches arithmetic details and function units; Section 5 presents several parallel architectures for the Tate pairing; Section 6 evaluates and compares such architectures to literature solutions; and Section 7 contains the conclusions and some suggestions for future research directions.
Review of the Tate Pairing
We give here the minimum mathematical details needed to introduce the Duursma-Lee algorithm in characteristic 3. Let $E(\mathbb{F}_q)$ be an elliptic curve defined over a field $\mathbb{F}_q$. Let $l$ be a positive integer, co-prime to $q$, such that $E(\mathbb{F}_q)$ contains a point of order $l$. In cryptography $l$ is usually a large prime such that $l \mid \#E(\mathbb{F}_q)$ and $l^2 \nmid \#E(\mathbb{F}_q)$. Let $k$ be the smallest integer satisfying $l \mid q^k - 1$. This value $k$ is the embedding degree of the curve with respect to $l$. The Tate pairing is defined in terms of rational functions over points of an elliptic curve, evaluated in a divisor. Let $P \in E(\mathbb{F}_q)[l]$; then $l(P) - l(O)$ is a principal divisor, hence there exists a rational function $f_P$ such that $\mathrm{div}(f_P) = l(P) - l(O)$. Let $Q \in E(\mathbb{F}_{q^k})$ be a point with coordinates in $\mathbb{F}_{q^k}$; then we construct a divisor $D_Q \in \mathrm{Div}^0(E)$ such that $D_Q \sim (Q) - (O)$. Such a divisor should be chosen so that its support is disjoint from that of the divisor of $f_P$; the Tate pairing is then defined as $e_l(P, Q) = f_P(D_Q)$, a map $e_l : E(\mathbb{F}_q)[l] \times E(\mathbb{F}_{q^k})/lE(\mathbb{F}_{q^k}) \rightarrow \mathbb{F}_{q^k}^*/(\mathbb{F}_{q^k}^*)^l$.
Over a supersingular elliptic curve there always exists a specific endomorphism $\phi$ (distortion map) [21], mapping an $l$-torsion point to another, linearly independent $l$-torsion point, i.e., $\phi : E(\mathbb{F}_q)[l] \rightarrow E(\mathbb{F}_{q^k})[l]$.
In particular, there exists a distortion map linking $l$-torsion points in the base field to those in the extension field. Papers [3, 13] list several cryptographically interesting curves and distortion maps. Using such maps, Barreto et al. [3] have improved the calculation of the Tate pairing by rearranging the Miller algorithm [17] into the Modified Tate pairing
$$\hat{e}(P, Q) = f_P(\phi(Q))^{(q^k - 1)/l} \in \mu_l, \qquad (1)$$
where $\mu_l$ is the subgroup of the $l$-th roots of unity in $\mathbb{F}_{q^k}^*$.
The quotient group $\mathbb{F}_{q^k}^*/(\mathbb{F}_{q^k}^*)^l$ is isomorphic to the group of elements of order dividing $l$ in $\mathbb{F}_{q^k}^*$ (i.e., to $\mu_l$), the isomorphism being the final exponentiation by $(q^k - 1)/l$. Now consider the following supersingular elliptic curve over a field of characteristic 3, i.e., over $\mathbb{F}_q$ with $q = 3^m$:
$$E_b : y^2 = x^3 - x + b, \qquad b \in \{-1, +1\}.$$
The embedding degree is $k = 6$ and, given an $l$-torsion point $Q = (x_Q, y_Q) \in E_b(\mathbb{F}_{3^m})[l]$, the distortion map is $\phi(x_Q, y_Q) = (\rho - x_Q, \sigma y_Q)$, where $\sigma, \rho \in \mathbb{F}_{3^{6m}}$ satisfy $\sigma^2 = -1$ and $\rho^3 = \rho + b$.
Duursma and Lee introduced in [8] a faster Tate pairing algorithm, using the group order $l = q^3 + 1 = 3^{3m} + 1$ instead of an $l$ dividing $q^3 + 1$; in this way they found a closed formula for $f_P(\phi(Q))$. The pseudo-code of the Duursma-Lee algorithm is shown in Algorithm 2.1 and is taken from [8].
Algorithm 2.1: Duursma-Lee as in [8]. To obtain results compatible with equation (1) and the BKLS algorithm [3], the output of Algorithm 2.1 must be raised to the power $\varepsilon = 3^{3m} - 1$. Further refinements for supersingular elliptic curves in characteristic 3, as well as for other characteristics, are outlined in [2, 12, 16]. There the main difference from Algorithm 2.1 is the removal of the cube root operations, replaced by two extra cube powers in the base field and one cube power in the extension field. To compute the Modified Tate pairing efficiently, a careful implementation of $\mathbb{F}_{3^m}$ arithmetic is needed. Following the approach used in [3, 11, 12, 14], we represent the extension field $\mathbb{F}_{3^{6m}}$ as a tower:
$$\mathbb{F}_{3^{2m}} = \mathbb{F}_{3^m}[\sigma]/(\sigma^2 + 1), \qquad \mathbb{F}_{3^{6m}} = \mathbb{F}_{3^{2m}}[\rho]/(\rho^3 - \rho - b).$$
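For reference, the loop body of Algorithm 2.1 has the following form, as it is commonly reported in the literature (e.g. [14]); the exact statement in [8] may differ in minor details. The inputs are $P = (x_P, y_P)$ and $Q = (x_Q, y_Q)$ in $E_b(\mathbb{F}_{3^m})$, and the output is $f_P(\phi(Q)) \in \mathbb{F}_{3^{6m}}$:

    f ← 1
    for i = 1 to m do
        x_P ← x_P^3 ;  y_P ← y_P^3            (cube powers in the base field)
        μ ← x_P + x_Q + b
        λ ← −y_P · y_Q · σ − μ^2
        g ← λ − μ·ρ − ρ^2                     (sparse element of the extension field)
        f ← f · g                             (multiplication in the extension field)
        x_Q ← x_Q^(1/3) ;  y_Q ← y_Q^(1/3)    (cube roots in the base field)
    end for
    return f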
Then we expand all the operations of Algorithm 2.1 into equivalent forms involving only operations in the base field $\mathbb{F}_{3^m}$. The comments in Algorithm 2.1 list the resulting operation counts; a squaring is counted as one multiplication.
We developed our work only for the Duursma-Lee algorithm in characteristic 3 because, by selecting the base field $\mathbb{F}_{3^m}$ carefully, cube roots and cube powers can be computed efficiently also in a polynomial basis representation. In the rest of the paper we model the algorithm entirely as a sequence of operations in the base field $\mathbb{F}_{3^m}$, in order to expose the maximum possible degree of parallelism.
Scheduling Methodology
In the automated design of application-specific circuits, the algorithmic description is commonly partitioned into a datapath, i.e., a set of interconnected function units, and a control path, usually a finite state machine or a processor. The former processes data and the latter sends commands to drive the units. Data flow through the datapath in steps, called control steps. Each unit is programmable to a limited extent and can execute one out of a set of similar operation types, and several replicated units may work in parallel. Scheduling the operations of the algorithm so that each control step achieves the best usage of the units is therefore of great importance for optimizing performance.
The algorithm is scheduled so as to minimize a target set of figures of merit. Here we want to execute in parallel as many operations as possible, and we consider the following figures of merit: computation time, circuit area and combinations thereof. Therefore we apply an instruction scheduling methodology with a resource-constrained approach [10].
This methodology is quite general and flexible, and explores the architectural solution space extensively. It works as follows: the algorithm is modeled as a data dependence graph (DDG), whose nodes and arcs represent operations and data dependencies among operations, respectively; where necessary, temporary variables store the intermediate results; each node is labeled with a time latency and a circuit area cost; scheduling the algorithm means sorting the nodes of the DDG in topological order, compatibly with the resource and latency constraints. The following analyses are performed:
1. Assume the number of units of each type is unlimited and the latencies of the units are all identical, then:
• compute the As-Soon-As-Possible (ASAP) schedule of the DDG;
• compute the As-Late-As-Possible (ALAP) schedule of the DDG.
From such schedules, determine for each unit type the maximum useful number of units, called the "variability range", beyond which allowing more units of that type would yield no further speed-up of the algorithm.
2. Having chosen a parameter to optimize (total computation time, total circuit area or a function thereof), and given a fixed set of function units (possibly replicated), each labeled with its effective latency, do the following:
• using the List-Based Scheduling (LBS) discipline, find the schedule of the algorithm compatible with the resource and latency constraints;
• for such schedule compute the figure of merit under observation (cost / performance estimate);
• and additionally compute the minimum number of temporary variables (register allocation).
3. Repeat step (2) by changing the number of units of each function unit type, but remaining within the bounds determined in step (1), which are valid for the unlimited case.
The entire procedure yields all the LB schedules obtainable by varying the resources. The LBS discipline uses the latency and the number of units of every function unit type to construct the schedule one control step at a time. It maintains a ready set of operations, i.e., those whose predecessors have already been scheduled and completed, and assigns them to the current control step until the available resources run out. To decide which of the ready operations of the same type to schedule first, the one having minimum mobility is selected. Mobility is the control-step interval delimited by the ASAP and ALAP schedules constrained by the actual resources and latencies; both schedules are recomputed whenever the resource availability changes. The LBS heuristic discipline is well established and is known for its efficiency in the general case [10].

We have developed, in the C language, all the SW tools needed for the scheduling methodology described above. The Duursma-Lee algorithm (Algorithm 2.1) contains a few initialization operations and a loop repeated a constant (and relatively high) number of times; most operations are in the loop body, while the rest is almost costless. Therefore we schedule only the loop body, and the resulting architecture iterates the computation as many times as the loop does. The loop body is expressed as a list of operations in the base field $\mathbb{F}_{3^m}$, with temporary variables, latency and area specifications; the list is given to the tools, which extract the labeled DDG of the algorithm and process it as explained above.
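To fix ideas, the following is a minimal C sketch of the resource-constrained list-based scheduling core; the node structure, the array bounds and the mobility input are simplified placeholders with respect to the actual tools.

    /* lbs_sketch.c -- minimal sketch of resource-constrained list-based
     * scheduling (LBS).  Node types, array bounds and the data-dependence
     * graph are simplified placeholders, not the actual Duursma-Lee DDG
     * handled by our tools. */
    #include <string.h>

    #define MAX_NODES 64
    #define MAX_TYPES 4                 /* add, mul, cube, cube root */

    typedef struct {
        int type;                       /* function-unit type of this operation */
        int latency;                    /* effective latency in control steps   */
        int npred, pred[8];             /* data-dependence predecessors         */
    } Node;

    /* ASAP lower bound: longest path from the DDG sources (unlimited units);
     * nodes are assumed to be listed in topological order.  The ALAP schedule
     * is computed symmetrically (longest path towards the sinks), and
     * mobility[i] = ALAP[i] - ASAP[i]. */
    void asap(const Node *n, int cnt, int *start)
    {
        for (int i = 0; i < cnt; i++) {
            start[i] = 0;
            for (int p = 0; p < n[i].npred; p++) {
                int q = n[i].pred[p];
                if (start[q] + n[q].latency > start[i])
                    start[i] = start[q] + n[q].latency;
            }
        }
    }

    /* Resource-constrained LBS: at every control step, ready operations (all
     * predecessors completed) are issued in order of increasing mobility until
     * the units of their type run out.  Units are not pipelined: an operation
     * occupies one unit for `latency` steps.  units[t] is assumed >= 1 for
     * every type actually used.  Returns the total number of control steps. */
    int lbs(const Node *n, int cnt, const int units[MAX_TYPES],
            const int *mobility, int *sched)
    {
        int done = 0, step = 0;
        memset(sched, -1, (size_t)cnt * sizeof *sched);
        while (done < cnt) {
            int busy[MAX_TYPES] = {0};
            for (int i = 0; i < cnt; i++)       /* units still busy from before */
                if (sched[i] >= 0 && sched[i] + n[i].latency > step)
                    busy[n[i].type]++;
            for (;;) {                          /* issue ops, min mobility first */
                int best = -1;
                for (int i = 0; i < cnt; i++) {
                    int ready = (sched[i] < 0);
                    for (int p = 0; ready && p < n[i].npred; p++) {
                        int q = n[i].pred[p];
                        ready = sched[q] >= 0 && sched[q] + n[q].latency <= step;
                    }
                    if (ready && busy[n[i].type] < units[n[i].type] &&
                        (best < 0 || mobility[i] < mobility[best]))
                        best = i;
                }
                if (best < 0) break;
                sched[best] = step;
                busy[n[best].type]++;
                done++;
            }
            step++;
        }
        return step;
    }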
Datapath and Function Units
The HW coprocessor we consider for the Duursma-Lee algorithm is a dedicated parallel architecture, as outlined in Section 3. The datapath contains function units for each operation type, and the units may be replicated. The units work at the level of the base field $\mathbb{F}_{3^m}$; hence all data are elements of $\mathbb{F}_{3^m}$. Each function unit has one or two inputs (for unary or binary operations) and one output, and contains buffer registers to store the operand(s) and the result (hence 2 or 3 registers); such buffers decouple the units from one another. A generic unit consumes one clock cycle for loading the operand(s) into the input buffers, one or more cycles for computing (depending on the unit type) and one final cycle for storing the result into the output buffer. A set of registers is also included to store temporary variables. All the registers (buffer and temporary) have the same size as an element of $\mathbb{F}_{3^m}$ (i.e., $m$ digits as in [11, 14], see below).
Function units and temporary registers are connected through dedicated busses and switches, each as wide as an element of $\mathbb{F}_{3^m}$. The control path drives all the units and registers, and dispatches data via the busses. Such an architecture is consistent with those adopted in [11, 14]. In all the cases we considered, almost all the units and registers turn out to have fan-in and fan-out of one or two, and the remaining few are only slightly above. This limits the number of busses and switches; hence we assume that their time and area costs are negligible.
The Tate pairing is a highly arithmetic-intensive primitive. Several works are dedicated to the hardware implementation of efficient arithmetic in characteristic 3, see [4, 11, 15, 18]. In these papers the implementation targets an FPGA device and the size is evaluated by counting the number of FPGA elements or equivalent gates needed. To estimate the area and latency costs of each function unit, here we follow a similar approach.
We use the base field $\mathbb{F}_{3^m}$, as in [11, 18], and we represent its elements in polynomial basis, the usual choice. An element of $\mathbb{F}_3$ is a digit in $\{-1, 0, 1\}$, encoded with two bits in such a way that swapping the bits inverts the sign. The elementary operations in $\mathbb{F}_3$ (addition, subtraction and multiplication) can be performed in one clock cycle by using two FPGA look-up tables (LUTs) of type 4:1.
The base field $\mathbb{F}_{3^m}$ is isomorphic to $\mathbb{F}_3[x]/(f(x))$, where $f(x)$ is an irreducible reduction polynomial of degree $m$ with coefficients in $\mathbb{F}_3$. We use a trinomial of the form $f(x) = x^m + a x^h + b$ with $a, b \in \mathbb{F}_3$. A generic element of $\mathbb{F}_{3^m}$ is a sequence of $m$ digits in $\mathbb{F}_3$. We need binary and unary function units with inputs $A, B \in \mathbb{F}_{3^m}$, programmable to a limited extent so as to invert the sign of $A$ or $B$. To do so, we use for each input an array of $2m$ two-way one-bit multiplexers (MX), which swap the operand bits when requested and thus invert its sign.
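To make the representation concrete, the following is a minimal software sketch of digit-wise addition and negation over this encoding (a standard bitsliced formulation; the specific two-bit encoding of +1 and −1 is an assumption, and the HW units realize the same functions with per-digit LUTs and MXs):

    #include <stdint.h>

    #define M      97
    #define WORDS ((M + 63) / 64)          /* 64-bit words per coefficient plane */

    /* An element of F_{3^m}: digit i is the bit pair (hi bit i, lo bit i);
     * assumed encoding: (0,0) = 0, (0,1) = +1, (1,0) = -1.
     * Swapping the two planes negates the element. */
    typedef struct { uint64_t hi[WORDS], lo[WORDS]; } GF3m;

    /* c = a + b, digit-wise in F_3 (bitsliced; the HW adder realizes the
     * same per-digit function with LUTs). */
    void gf3m_add(GF3m *c, const GF3m *a, const GF3m *b)
    {
        for (int i = 0; i < WORDS; i++) {
            uint64_t t  = (a->lo[i] | b->hi[i]) ^ (a->hi[i] | b->lo[i]);
            uint64_t lo = t ^ (a->hi[i] | b->hi[i]);
            uint64_t hi = t ^ (a->lo[i] | b->lo[i]);
            c->lo[i] = lo;
            c->hi[i] = hi;
        }
    }

    /* c = -a: swap the two bit planes (the MX arrays above do just this). */
    void gf3m_neg(GF3m *c, const GF3m *a)
    {
        for (int i = 0; i < WORDS; i++) {
            uint64_t t = a->hi[i];
            c->hi[i] = a->lo[i];
            c->lo[i] = t;
        }
    }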
The adder unit must perform the four algebraic additions $\pm A \pm B$. Such operations can be performed digit-wise in one clock cycle, using an array of $m$ adders in $\mathbb{F}_3$ (F3A) [4, 14] plus an array of $4m$ MXs for the signs of the two summands.

The multiplier unit must perform the four operations $(\pm A) \times (\pm B)$. The multiplication of two elements of $\mathbb{F}_{3^m}$ is performed in a digit-serial way: one factor is processed in full, the other in groups of $D$ digits per clock cycle; hence the total number of cycles is $\lceil m/D \rceil$. Assuming $D \le m - h$ holds, to simplify reduction, the area cost of the multiplier is the sum of the following contributions: $(m+1)D$ F3As; $(m+2)D - 2$ multipliers in $\mathbb{F}_3$ (F3M); some control logic, namely $4m$ MXs; some inner registers for partial products, equivalent to $(6m + 2D - 2)$ flip-flops (FF) [4]; and finally an array of $4m$ MXs for the signs of the two factors, as in the adder.

The cube power unit must perform the two operations $(\pm A)^3$. In fields of characteristic 3 represented in polynomial basis, this is a linear operation, implementable by thinning (spreading out) the coefficients of the operand $A$ [18]. The unit takes one clock cycle [4] with an area cost of $2m$ F3As plus $2m$ MXs for the operand sign.

The cube root unit must perform the two operations $\sqrt[3]{\pm A}$. In the same field type as before, the cube root can be computed as in [1], provided the reduction trinomial $f(x)$ satisfies $m \equiv h \equiv 1 \pmod 3$ or $m \equiv h \equiv 2 \pmod 3$. A unit specific to $f(x)$ has a latency of one clock cycle and an area of $(m-1)$ F3As plus $2m$ MXs for the radicand sign.
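As an illustration of the coefficient-thinning idea, the following software sketch cubes an element of $\mathbb{F}_3[x]/(f(x))$ with the digits stored as small integers mod 3 (with 2 standing for −1), rather than in the bitsliced form above; the trinomial coefficients FA and FB are illustrative placeholders, while the HW unit computes the same function combinationally in one cycle:

    /* Cube A(x) in F_3[x]/(f) by "thinning": A(x)^3 = sum a_i x^{3i},
     * followed by reduction modulo the trinomial f(x) = x^M + FA*x^H + FB.
     * FA and FB are assumed values, not the exact trinomial of the paper. */
    #define M  97
    #define H  16
    #define FA 1
    #define FB 2

    static int mod3(int v) { return ((v % 3) + 3) % 3; }

    void gf3m_cube(const int a[M], int c[M])
    {
        int t[3 * M - 2] = {0};              /* room for degree 3(M-1)        */
        for (int i = 0; i < M; i++)          /* thinning: a_i goes to x^{3i}  */
            t[3 * i] = a[i];
        for (int i = 3 * M - 3; i >= M; i--) /* reduce: x^M = -FA*x^H - FB    */
            if (t[i]) {
                t[i - M + H] = mod3(t[i - M + H] - FA * t[i]);
                t[i - M]     = mod3(t[i - M]     - FB * t[i]);
                t[i] = 0;
            }
        for (int i = 0; i < M; i++)
            c[i] = t[i];
    }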
Finally, all the registers (temporary and buffer) store $m$ digits of $\mathbb{F}_3$, hence they consist of arrays of $2m$ flip-flops; a read or write takes one clock cycle. The area and latency formulas of the function units are summarized in Table 1. As we want to compare our results with those in the literature, e.g. in [11, 14], we have chosen $m = 97$, $h = 16$ and $D = 4$; such values are mathematically and technologically sound, and fit all the previous constraints. Moreover, since every function unit contains input/output buffers, we have added two more clock cycles to each unit latency, to account for the delay of the read/write operations, and we have increased the unit area to account for the I/O buffers. Since we measure all areas in terms of FPGA LUTs, we assume the equivalences 1 MX = 1 LUT and 1 FF = 4/3 LUT, which are somewhat conservative for the FPGAs used for comparison [4, 22]. We stress that our results are rather stable with respect to these hypotheses. Table 2 shows the resulting area and latency costs of the various units.
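As a sanity check on these formulas, the following small C program evaluates the raw unit areas in equivalent LUTs, excluding the I/O buffers that Table 2 additionally accounts for, and assuming that one F3A or F3M costs two LUTs as in the 4:1-LUT implementation described above:

    /* area_sketch.c -- evaluates the raw area formulas of this section in
     * equivalent LUTs, excluding the I/O buffer registers added in Table 2.
     * Assumes 1 F3A = 1 F3M = 2 LUT, 1 MX = 1 LUT, 1 FF = 4/3 LUT. */
    #include <stdio.h>

    int main(void)
    {
        const double m = 97, D = 4;
        const double F3A = 2, F3M = 2, MX = 1, FF = 4.0 / 3.0;

        double adder = m * F3A + 4 * m * MX;                       /* +/-A +/- B        */
        double mult  = (m + 1) * D * F3A + ((m + 2) * D - 2) * F3M /* digit-serial core */
                     + 4 * m * MX                                  /* control logic     */
                     + (6 * m + 2 * D - 2) * FF                    /* inner registers   */
                     + 4 * m * MX;                                 /* operand signs     */
        double cube  = 2 * m * F3A + 2 * m * MX;
        double root  = (m - 1) * F3A + 2 * m * MX;
        int    mlat  = (int)((m + D - 1) / D);                     /* ceil(m/D) cycles  */

        printf("adder %.0f, multiplier %.0f, cube %.0f, cube root %.0f LUTs\n",
               adder, mult, cube, root);
        printf("multiplier compute latency: %d cycles (+2 for I/O buffers)\n", mlat);
        return 0;
    }

With $m = 97$, $h = 16$ and $D = 4$ this gives, for instance, $\lceil m/D \rceil = 25$ compute cycles for the multiplier, i.e., 27 cycles including the buffer read/write.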
Synthesis of Architectures
Here we present the results obtained by scheduling, as described in Section 3, the loop body of the Duursma-Lee algorithm (Algorithm 2.1). First we consider the ideal case, with an unlimited number of function units all having the same latency of one clock cycle; the total numbers of operations are those listed in Algorithm 2.1. Under such conditions execution takes 18 control steps, which is the absolute minimum. Table 2 lists the variability range of each unit type (see Section 3, point 1). The adder and multiplier units exhibit a rather high variability of 18 and 11, respectively, meaning that there is room for parallelization. Second, we conduct a scheduling campaign with the effective latency and area costs of each unit as in Table 2, optimizing the loop time T, measured in clock cycles, and the loop area-time product AT, measured in equivalent LUTs × clock cycles (see Section 4), while allowing different combinations of function units. In particular, since the multiplier has the highest time and area costs (Table 2), we look for the optimal schedules with respect to the selected figure of merit by allowing from 1 up to 11 multipliers (i.e., its variability range, Table 2). This is carried out as follows:
1. Keeping the number of multipliers fixed, compute all the schedules obtained by changing the numbers of adder, cube power and cube root units within their variability ranges.
2. For all such schedules, compute the figure of merit under observation (either T or AT) and select the schedule with the minimum value of the figure of interest.
3. Restart from point (1), increasing the number of multipliers; terminate when the variability range of the multiplier has been completely explored (a sketch of this exploration loop is given after the list).
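A minimal C sketch of this exploration loop follows, reusing the Node type and the lbs() routine sketched in Section 3; the unit areas, the register area and the variability ranges of the cube power and cube root units are illustrative placeholders, not the values of Table 2:

    /* explore_sketch.c -- sketch of the exploration loop above.  Requires the
     * Node type, MAX_NODES and lbs() from the Section 3 sketch. */
    enum { ADD = 0, MUL, CUBE, ROOT, NTYPES };   /* NTYPES == MAX_TYPES == 4  */

    static const long unit_area[NTYPES]  = { 600, 3200, 600, 400 }; /* placeholder LUTs */
    static const int  unit_range[NTYPES] = { 18, 11, 6, 2 };        /* cube/root guessed */
    #define REG_AREA (32L * 260)           /* 32 temporaries, placeholder size */

    void explore(const Node *ddg, int nops, const int *mobility)
    {
        int sched[MAX_NODES];
        for (int mul = 1; mul <= unit_range[MUL]; mul++) {
            long best_T = -1, best_AT = -1;
            for (int add = 1; add <= unit_range[ADD]; add++)
                for (int cb = 1; cb <= unit_range[CUBE]; cb++)
                    for (int rt = 1; rt <= unit_range[ROOT]; rt++) {
                        int units[NTYPES] = { add, mul, cb, rt };
                        long T = lbs(ddg, nops, units, mobility, sched);
                        long A = REG_AREA;                     /* temporaries    */
                        for (int k = 0; k < NTYPES; k++)
                            A += units[k] * unit_area[k];      /* function units */
                        if (best_T < 0 || T < best_T)        best_T  = T;
                        if (best_AT < 0 || A * T < best_AT)  best_AT = A * T;
                    }
            /* here: record the unit mixes attaining best_T and best_AT for this
             * multiplier count (they populate the rows of Tables 3 and 4) */
        }
    }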
In principle there may exist multiple optimal schedules with the same number of multipliers; in practice such an event is unlikely and never occurs in our scheduling campaign. Table 3 shows the optimal schedules with respect to the computation time T. Each row lists the numbers of function units of each type, the absolute value of T and the percentage increment with respect to the minimum time of 87 clock cycles, reached with 9 multipliers; no further speed-up is achieved with more such units. It is evident that with 7 multipliers the residual time gain achievable with more units is about 10%. Therefore, in general, the time-optimal architecture ought not to contain more than 7 multipliers.

The register allocation procedure shows that all the schedules in Table 3 require from 28 up to 32 temporary registers. Since register allocation does not depend on the area costs of the function units, we assume exactly 32 registers, which suffice for any possible schedule, and we always add their area cost to the total area A of the device.

Table 4 shows the optimal schedules with respect to the area-time product AT (where the area is estimated as in Table 2, plus the 32 temporary registers). Each row lists the numbers of function units of each type, the percentage increment of AT with respect to the minimum, reached with 5 multipliers, and the absolute time T. This is the trade-off that balances area and time costs, and it is in good agreement with the behavior of the $AT^2$ figure of merit as well (computed but not shown here) [10]. Solutions common to both Tables 3 and 4 are coincidental. Clearly, with more than 5 multipliers AT must grow again, while T tends to stabilize asymptotically.
Evaluation and Comparison
The works dealing with HW architectures for the Tate pairing are [11, 14]. Their authors analyze the modified Duursma-Lee algorithm (D-L) and use function units similar to those described in Section 4, with identical values of $m$, $h$ and $D$; hence the loop body iterates exactly $m = 97$ times. In [14] Kerins et al. give a parallel FPGA architecture, with units working in the extension field $\mathbb{F}_{3^{6m}}$ (and presumably with temporary registers, though nothing is stated). To synthesize such high-level units, they use 18 multipliers and 6 cube power units in $\mathbb{F}_{3^m}$, plus over 100 adders; their architecture computes the entire D-L algorithm in 8,924 clock cycles. Instead, in [11] Grabher et al. present an FPGA coprocessor for arithmetic in $\mathbb{F}_{3^m}$, which contains one unit per type and has 32 temporary registers (as we do) but works serially; higher-level operations are executed on a general-purpose processor. The reported running time of the entire D-L algorithm is 59,946 clock cycles.
For the complete D-L algorithm, we obtain a minimum computation time of 87m = 87 × 97 = 8,439 clock cycles with 9 multipliers and 4 adders (see Table 3); such a time is comparable with that of [14], but our architecture is far smaller. In the case of the minimum area-time product AT, 5 multipliers and 4 adders yield a total D-L time of 108m = 108 × 97 = 10,476 clock cycles (see Table 4), still comparable to the minimum time of [14] but with an even smaller area. In both cases, our total D-L time is about one order of magnitude shorter than that of [11]. In summary, scheduling proves worthwhile, and the performance gain is stable with respect to the assumptions we made.
Moreover, we also determined the optimal time schedule with a single function unit per type, which yields a time cost of 426 clock cycles for the loop body (see Table 4). The total time to compute the whole D-L algorithm is then 426m = 426 × 97 = 41,322 clock cycles, a reduction of about 31% with respect to the time of 59,946 clock cycles reported in [11]. Such a speed-up is due to the adoption of an automated and exhaustive scheduling methodology.
Conclusions
An analysis of the computation of the Tate pairing with the Duursma-Lee algorithm, by means of a parallel dedicated HW co-processor operating in $\mathbb{F}_{3^m}$, has been presented. The results show that the algorithm is well suited to parallel execution and that the cost-performance trade-off can be considerably improved with respect to the solutions in the literature. Future research directions include the analysis of other Tate pairing algorithms as well as of different architecture models, e.g. changing $m$ and $D$, an architecture with a limited number of busses (two or three), or a pipelined structure.
