Abstract
INTRODUCTION
Public key encryption systems such as Rivest-Shamir-Adleman (RSA) [1] and Elliptic Curve Cryptography (ECC) [2, 3] play a vital role in contemporary secure systems as they ensure the privacy, message integrity, authenticity, and non-repudiation requirements of secure communications.
The security strength of an encryption algorithm depends on its key size [4]. ECC is an increasingly attractive public-key method because it reaches a security level equivalent to that of other public-key algorithms such as RSA with shorter keys [4].
For real-time applications such as SSL server connections, a software implementation of an ECC scheme may not provide the desired performance level and hardware implementations are needed as they better meet the performance requirements. There are many factors which must be considered in the hardware implementation of ECC system designs such as:
• Power consumption, which is becoming the main design constraint with the advent of portable and contactless devices.
• Flexibility, so that the device is neither tied to a certain ECC curve nor requires a specific type of irreducible polynomial.
• Scalability, so that the device can provide various levels of security without changing the underlying hardware. For instance, the system supports different operand sizes to accommodate different levels of security. Otherwise, the system will become outdated within a short period of time.
• Throughput, which is a very important factor when the system is likely to handle thousands of operations per second, as in servers, router gateways, etc.
The operation that dictates the execution time of an elliptic curve cryptographic protocol is the point multiplication of the Galois Fields (GF) operands, and its hardware implementation would have a significant impact on the system performance [4] .
Another crucial parameter in the implementation of ECC architectures is the type of the underlying finite field upon which the elliptic curve operations are based. ECC implementations can use either prime fields GF(p) or binary finite fields GF(2^m), in which field elements are usually represented as binary polynomials. The latter are often chosen for hardware realizations as they require smaller circuits [5].
Related Work
Due to the increased significance of cryptography, several public-key cryptographic hardware designs for GF(2^m) have been proposed in the literature. The authors of [6] targeted a compact architecture that performs three different cryptographic algorithms: RSA, ECC and pairing-based cryptography. They tried to improve the time/area metrics by designing reusable functional units that can be shared among different modules, exploiting the similarity of the basic arithmetic operations of these three algorithms. However, they have not reported any results to show how much their design improved these metrics. The authors of [7] proposed an architecture that supports arbitrary operand sizes and provides multiplication, modular squaring and inversion operations. However, the computation time for the modular inversion of that architecture is in the order of O(m^2). In contrast, our proposed architecture performs the inversion operation in the order of O(m). Meanwhile, the architecture proposed in [8] performs multiplication in GF(2^m) for any value of m less than 256. That architecture offers flexibility in selecting the field size at the expense of being limited to only a few fixed irreducible polynomials. Furthermore, it does not support the inversion operation.
In [9], a reconfigurable design has been presented which supports variable operand sizes. The advantage of this architecture is its support for both GF(2^m) and GF(p). However, it is restricted to specific types of elliptic curves. An architecture for implementing LSD multipliers for binary fields GF(2^m) is presented in [10]. In this architecture, internal accumulators are deployed for storing intermediate results, and these extra accumulators are used to increase the maximum operating frequency by reducing the critical path delay of the multipliers. However, no hardware implementation results are given in [10]; only analytical area and timing reports are presented.
Alternative modular designs have been proposed in [10, 11] based on offline reconfiguration. For instance, [10] presents a modular Field Programmable Gate Array (FPGA) based architecture that uses very simple hardware components. However, it does not perform the inversion operation. Meanwhile, the architecture presented in [11] supports both binary and polynomial field multiplication but only considers a specific type of irreducible polynomials, and its reconfiguration is offline. Hence, it can only be used for modular multiplications and does not support other crucial ECC operations such as inversion and addition. In [12], the authors proposed a reordered normal basis multiplier, which lets the designer set a trade-off between area and speed, and implemented their architecture on a 780-pin FPGA.
Another reconfigurable design was proposed in [13] to support various operand sizes for both binary and prime fields. This design implements RSA operations as well. It also supports a power-gating approach to reduce wasted power. However, it requires the data to be aligned before being exchanged with the outside world, which complicates the design and increases the delay. A Graphics Processing Unit (GPU) implementation is proposed in [14]. In this work, Least Significant Bit (LSB) invariant scalar point multiplication for binary elliptic curves is implemented on NVidia graphics
Paper Contributions and Organization
Based on the above discussion of the related literature, only a few of the proposed architectures support key size selection and also save power for smaller key applications. In our previous work [17], a modular architecture was proposed which is not only area efficient but also offers these features. However, when a small key size is used, that architecture deactivates the unutilized modules even if there are other requests to the system, as is the case with the aforementioned related architectures. This common deficiency makes the system inefficient in terms of throughput and performance when there are many concurrent requests, which is typical of secure server systems.
In contrast, this paper alleviates such performance deficiencies by presenting a new parallel architecture that has the unutilized modules simultaneously operating to handle other requests in order to use the system more efficiently and increase the performance and throughput of the system.
The main contributions and characteristics of the proposed architecture that distinguish it from the existing literature are summarized as follows:
The rest of this paper is organized as follows. The necessary background on multiplication and inversion algorithms in GF(2^m) is given in Section 2. Then, the base architecture used in this paper is briefly discussed in Section 3. The new parallel architecture is discussed in Section 4. Simulation setup and results are analyzed in Section 5, and Section 6 concludes the paper.
MULTIPLICATION AND INVERSION OPERATIONS IN GF(2^m)
Elliptic curve cryptography (ECC) is a public-key cryptography scheme based on the algebraic structure of elliptic curves over finite fields. Point addition/subtraction, point doubling, and scalar point multiplication are geometrically defined operations on elliptic curves. Two implementation alternatives, namely the prime field GF(p) and the binary field GF(2^m), are used for ECC systems. We only discuss GF(2^m) arithmetic, as our proposed architecture is based on it. Equation (1) represents an elliptic curve over the binary field GF(2^m):
y^2 + xy = x^3 + ax^2 + b    (1)
where x, y, a, b are m-bit binary numbers and b is non-zero. More details about the theory underlying ECC operations can be found in [4].
The creation of a public key in ECC requires scalar point multiplication on the base point, P. Scalar point multiplication can be done by repeatedly performing point doubling and point addition. The basic operations for scalar point multiplication boil down to modular addition, subtraction, multiplication, and division on GF(2^m) operands [4]. Addition and subtraction are trivial operations that are simply done by a bit-wise XOR of the operands. Division is more complicated than multiplication and inversion and is therefore normally replaced by an inversion of the divisor followed by a multiplication [4]. In the following subsections, the implementation of the multiplication and inversion operations is briefly discussed.
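As a small illustration of why addition and subtraction are trivial in GF(2^m), the sketch below represents field elements as Python integers whose bits are the polynomial coefficients. The function names are hypothetical and chosen only for this example.

```python
def gf2m_add(a: int, b: int) -> int:
    """Add two GF(2^m) elements: coefficient-wise addition mod 2 is a bit-wise XOR."""
    return a ^ b


# In characteristic 2 every element is its own additive inverse,
# so subtraction is the same operation as addition.
gf2m_sub = gf2m_add

x = 0b1011  # x^3 + x + 1
y = 0b0110  # x^2 + x
print(bin(gf2m_add(x, y)))  # 0b1101, i.e. x^3 + x^2 + 1
```

No carries propagate between bit positions, which is why this operation can be completed in a single step regardless of operand width.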
Modular Multiplication in GF(2^m)
A modular multiplication in GF(2^m) computes the product A·B mod F, where A and B are the operands and F is the irreducible polynomial of the field. Table I shows the pseudo-code of the algorithm presented in [11], which is also the basis of our proposed architecture. More details about this algorithm can be found in [11]. In this paper, a modified version of this algorithm is used in order to support a selectable operand size.
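For readers without access to Table I, the following is a minimal sketch of a generic MSB-first shift-and-add multiplication in GF(2^m). It is a textbook formulation and is not claimed to be the exact algorithm of [11] or the modified version used in this paper; the function name and the convention that `f` includes the x^m term are assumptions of this example.

```python
def gf2m_mul(a: int, b: int, f: int, m: int) -> int:
    """Return (a * b) mod f in GF(2^m).

    a, b : field elements with degree < m (bits are polynomial coefficients)
    f    : irreducible polynomial of degree m, including the x^m term
    """
    c = 0
    for i in range(m - 1, -1, -1):   # scan the bits of b from MSB to LSB
        c <<= 1                      # multiply the partial result by x
        if (c >> m) & 1:             # degree reached m: reduce once by f
            c ^= f
        if (b >> i) & 1:             # add a when the current bit of b is set
            c ^= a
    return c


# Check against the worked example in the AES specification,
# which uses GF(2^8) with f = x^8 + x^4 + x^3 + x + 1 (0x11B).
print(hex(gf2m_mul(0x57, 0x83, 0x11B, 8)))  # 0xc1
```

Each loop iteration uses only a shift, a conditional XOR with the modulus, and a conditional XOR with the multiplicand, which matches the kind of simple operations the text attributes to the hardware algorithm.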
Modular Inversion in GF(2^m)
A modular inversion operation in GF(2^m) finds, for a non-zero operand A, the element A^(-1) such that A·A^(-1) ≡ 1 (mod F). We have presented an architecture that uses the algorithm presented in [18], as it can be easily modified to achieve a selectable-key-size architecture. The pseudo-code of the algorithm is given in Table II. Both the multiplication and inversion algorithms depend on very simple operations such as shift right, shift left, addition and XOR. Exploiting this feature, a hardware component has been designed that performs both multiplication and inversion, which saves area and power.
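To make the inversion operation concrete, here is a minimal sketch based on the polynomial extended Euclidean algorithm. It is a standard textbook method built from shifts and XORs and is shown only for illustration; it is not claimed to be the algorithm of [18] given in Table II. The function name is hypothetical, and `f` is assumed to be irreducible and to include its leading term.

```python
def gf2m_inv(a: int, f: int) -> int:
    """Return the inverse of a non-zero, reduced element a modulo f over GF(2)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(2^m)")
    u, v = a, f
    g1, g2 = 1, 0            # invariants: a*g1 = u (mod f) and a*g2 = v (mod f)
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:            # keep deg(u) >= deg(v) by swapping the pairs
            u, v = v, u
            g1, g2 = g2, g1
            j = -j
        u ^= v << j          # cancel the leading term of u
        g1 ^= g2 << j        # apply the same step to the cofactor
    return g1


# GF(2^3) with f = x^3 + x + 1: the inverse of x is x^2 + 1.
print(bin(gf2m_inv(0b010, 0b1011)))  # 0b101
```

The number of loop iterations depends on the bit pattern of the operand, which is consistent with the data-dependent inversion time discussed in the timing evaluation.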
OVERVIEW OF BASIC ARCHITECTURE
In our prior work [17], a modular multiplier was designed which implements all the arithmetic operations over GF(2^m) required for ECC computations. In this architecture, users can select the operand size according to their security needs. Let us assume that we want to design an m-bit multiplier, where m is the maximum operand size that would be needed in the foreseeable future. In this architecture, the m-bit binary operands are split into k smaller n-bit words (m = k × n) as follows:
A = (A_{k-1}, ..., A_1, A_0), where A_i = (a_{i,n-1}, ..., a_{i,1}, a_{i,0}), and a_ij are binary bits of multiplicand A
B = (B_{k-1}, ..., B_1, B_0), where B_i = (b_{i,n-1}, ..., b_{i,1}, b_{i,0}), and b_ij are binary bits of multiplier B
F = (F_{k-1}, ..., F_1, F_0), where F_i = (f_{i,n-1}, ..., f_{i,1}, f_{i,0}), and f_ij are binary bits of modulus F
Based on this decomposition, the basic multiplier architecture is composed of k modules, each n bits wide, as shown in Figure 1. Each of the modules 0 to k-1 is responsible for operating on one n-bit word of the operand. In this architecture two design parameters must be defined: (i) the number of words k, and (ii) the word width n. Suppose that k=8 and n=8. If a 24-bit computation is required, then the first three modules will be activated by the controller and the remaining modules will remain inactive.
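The word decomposition and module activation described above can be sketched as follows. The helper names are hypothetical, and placing the least significant word in module 0 is an assumption of this example rather than something the text states.

```python
def modules_needed(operand_bits: int, n: int) -> int:
    """Number of n-bit modules required for an operand of the given width."""
    return (operand_bits + n - 1) // n   # ceiling division


def split_into_words(value: int, n: int, k: int) -> list:
    """Split an operand into k words of n bits, word 0 being least significant."""
    mask = (1 << n) - 1
    return [(value >> (i * n)) & mask for i in range(k)]


def active_modules(operand_bits: int, n: int, k: int) -> list:
    """True for modules the controller would activate, False for idle ones."""
    used = modules_needed(operand_bits, n)
    return [i < used for i in range(k)]


# The example from the text: k = 8, n = 8, and a 24-bit computation.
print(modules_needed(24, 8))                      # 3
print(split_into_words(0xABCDEF, 8, 8)[:3])       # [239, 205, 171]
print(active_modules(24, 8, 8))
```

With n = 32, the same ceiling division gives 5 modules for a 160-bit operand and 8 modules for a 256-bit operand.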
Referring to the multiplication and inversion algorithms (Table I and Table II), a mechanism is required for selecting the appropriate MSBs of the operands, which is one of the tasks of the controller module. For example, if a three-word computation is needed, then the MSBs coming from the third module will be used for the calculation.
Inside each module, there are four shift registers to hold four input/output words of the operands.
The MSBs of each register are also connected to the controller, which selects the appropriate MSBs for the operands. In addition, each module has an ALU that is responsible for performing operations such as XOR and shifting. The controller unit generates the appropriate signals to coordinate the operations performed inside this ALU. The block diagram of the controller unit is presented in Figure 2. The controller has three sub-blocks: the Finite-State-Machine, the MSB-Selector, and the Load/Store module.
Choosing n=32 and k=8, the design was synthesized with 45 nm technology cells and the clock frequency set to 1 GHz. The main goal of this design was a low-power and area-efficient implementation. As Figure 3 shows, the power consumption is directly proportional to the number of used modules. Hence, the design saves power when a smaller key size operation is requested.
PROPOSED MODULAR PARALLEL ARCHITECTURE
In our previous work [17], the power consumption was reduced by tying it to the selected operand size. However, that system could not accept new requests even when free, unutilized modules were available to perform new tasks. In other words, it did not use the available hardware resources efficiently. Thus motivated, we propose a parallel architecture in this paper that increases the overall performance and throughput of the system by granting all requests as long as there are idle modules that can handle them. There are two ways to make the system operate in parallel. The first is to handle multiple instances of the same operation in parallel, e.g., several multiplications. Since high-level ECC operations are composed of different basic operations, this form of parallelism is not effective. For instance, a point doubling operation consists of multiplication, inversion, and addition operations. Consequently, if two point doubling requests do not arrive at the same time, the second request must wait for the first one to complete. Therefore, the system should be able to run multiple different operations in parallel. In our parallel architecture, the system handles different operations with different operand sizes concurrently. For instance, a multiplication and an inversion can be handled simultaneously.
The block diagram of the parallel architecture is depicted in Figure 4. Compared with Figure 1, the parallel architecture in Figure 4 has module separators between consecutive modules. These module separators are used to dynamically partition the modules, where each partition is responsible for one task. The module separator is a simple multiplexer which selects either zero or the previous module's output as the input of the next module. This partitioning is performed online, and its control signals come from the Module-Assigner, which is discussed later.
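The effect of the module separators can be illustrated with a one-bit left shift across a chain of modules. This is a behavioral sketch only: the function name, the word ordering (module 0 holds the least significant word), and the choice of a left shift as the example operation are assumptions made for illustration, not details taken from the design.

```python
def shift_left_partitioned(words, n, partition_starts):
    """Shift every partition left by one bit, independently of its neighbors.

    words            : list of n-bit module contents, module 0 first
    partition_starts : indices of modules whose separator selects zero,
                       i.e. the first module of each partition
    """
    mask = (1 << n) - 1
    out = []
    for i, w in enumerate(words):
        if i == 0 or i in partition_starts:
            carry_in = 0                                 # separator selects zero
        else:
            carry_in = (words[i - 1] >> (n - 1)) & 1     # previous module's MSB
        out.append(((w << 1) & mask) | carry_in)
    return out


words = [0x80, 0x81, 0x80, 0x00]
# Two tasks: modules 0-1 and modules 2-3. Module 2 ignores module 1's MSB.
print(shift_left_partitioned(words, 8, {0, 2}))   # [0, 3, 0, 1]
# One task spanning all four modules. Module 2 receives module 1's MSB.
print(shift_left_partitioned(words, 8, {0}))      # [0, 3, 1, 1]
```

Changing only the set of partition starts changes how the same hardware chain is divided among tasks, which is the role the text assigns to the multiplexers.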
The Controller unit shown in Figure 5 generates appropriate signals to coordinate operations performed in the system. This Controller has three sub-blocks: i) Module Assigner, ii) Task Controllers, and iii) Correlator which will be discussed in the following sub sections.
Module-Assigner:
The Module-Assigner connects the system to the outside environment. It receives requests from outside and replies to them. Each new request states how many modules it requires (word-count) and the instruction type to be performed, such as multiplication, division, or addition/subtraction. In response, the Module-Assigner checks (i) whether there is a free Task-Controller to take care of this particular task and (ii) whether there are enough free consecutive modules to handle it. If either condition fails, the Module-Assigner informs the requesting party that it cannot accept the task by putting zero on the grant signal. Otherwise, the grant signal will be one.
To check the first condition, the Module-Assigner uses the busy tag of each Task-Controller. To check the second condition, the Module-Assigner starts from the first module and checks whether it can find word-count free modules in a row. For example, if the task requires three modules, it starts by checking modules 0, 1 and 2.
If module 0 is taken, then it checks modules 1, 2 and 3. This process continues until it finds three consecutive free modules, after which it gives a task-number to that particular task in order to distinguish between tasks. If no such run of free modules exists, it refuses the new task. This search procedure is implemented by a lookup table. The size and complexity of this lookup table are proportional to the number of modules in the system.
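The two grant conditions and the search for consecutive free modules can be sketched in software as below. The hardware uses a lookup table; the sequential scan here is only a behavioral stand-in, and all function and variable names are hypothetical.

```python
def find_free_run(busy_modules, word_count):
    """Return the index of the first run of word_count free modules, or None."""
    if word_count < 1:
        return None
    run = 0
    for i, busy in enumerate(busy_modules):
        run = 0 if busy else run + 1
        if run == word_count:
            return i - word_count + 1
    return None


def try_grant(busy_modules, busy_controllers, word_count):
    """Return (controller, first_module) if the request is granted, else None."""
    if all(busy_controllers):                # condition (i): a free Task-Controller
        return None
    start = find_free_run(busy_modules, word_count)
    if start is None:                        # condition (ii): enough free modules in a row
        return None
    controller = busy_controllers.index(False)
    busy_controllers[controller] = True
    for i in range(start, start + word_count):
        busy_modules[i] = True
    return controller, start


modules = [False] * 8
controllers = [False, False]
print(try_grant(modules, controllers, 3))    # (0, 0)
print(try_grant(modules, controllers, 4))    # (1, 3)
print(try_grant(modules, controllers, 1))    # None: no free Task-Controller
```

A first-fit policy like this can leave gaps between partitions, so a request may be refused even when the total number of free modules is sufficient but they are not consecutive.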
After accepting the new request, the Module-Assigner will follow these steps:
3. When the result of the task at hand becomes ready, the corresponding Task-Controller raises its done-signal. Then, the Module-Assigner puts the task number on the outtask-number line to notify the outside parties that the results on the output bus belong to this specific task number. Similar to the loading phase, the modules send their results to the output bus one by one.
4. Eventually, the corresponding Task-Controller and all assigned modules become free.
Furthermore, suppose that task A arrives while task B is ready to send its results. The Module-Assigner gives priority to task B and rejects task A. This frees more modules for new tasks and, more importantly, prevents deadlock.
Task-Controllers
The block diagram of the Task-Controllers part is shown in Figure 6. This part has t independent units, each of which is the Finite-State-Machine of the basic architecture shown in Figure 2. Each of these units generates the appropriate signals to coordinate the operations performed inside the modules. These units are connected to the Module-Assigner and the Correlator. The Module-Assigner informs each unit when it is its turn to start computation through the load-complete signal. At the end of each task's computation, the corresponding unit informs the Module-Assigner by raising its done-signal.
Correlator:
The Correlator creates a channel between the Task-Controller units and their corresponding modules, as shown in Figure 7. The Correlator sends the op-codes generated by each Task-Controller to the modules and, in the opposite direction, sends the MSBs of the operands inside the modules to their respective Task-Controller.
For each Task-Controller unit, there is an MSB-Selector circuit inside the Correlator. These MSB-Selectors are identical to the MSB-Selector circuit in the basic architecture. The appropriate MSB bits selected by these circuits are forwarded to the corresponding Task-Controller units.
In the basic architecture [17], there was only one Finite-State-Machine and one MSB-Selector circuit. In the new parallel architecture, there are several of these units. In addition, the Load/Store module of the basic design is replaced by a more complex Module-Assigner unit, which handles task assignment in addition to controlling the load and store operations.
... busy with the inversion operation. In the next cycle, the multiplication is complete and the corresponding modules send their result. Finally, in cycle 8, the inversion is complete, modules 4, 5, 6 and 7 send their results as well, and the processor then returns to the idle state.
EXPERIMENTAL RESULTS
Based on the proposed architectures, a 1024-bit multiplier is designed in three different modes:
• 16-64 Mode: 16 modules, each 64 bits wide.
• 32-32 Mode: 32 modules, each 32 bits wide.
• 64-16 Mode: 64 modules, each 16 bits wide.
The designs are synthesized by Synopsys Design Compiler in the 45 nm technology node with the FreePDK45 Process Design Kit (PDK). The clock frequency is set to 1 GHz. The operating conditions are set to typical, the supply voltage is fixed at 1.1 V, and the temperature is set to 27°C.
For place and route, the Cadence SOC Encounter tool in the 45 nm technology node is used. Post-layout analysis is used for power and area estimation. Static timing verification is used to check the design for setup and hold time violations. For timing estimation, a software tool was also developed to generate random data vectors for the test benches, which are simulated with ModelSim.
Area
Table III summarizes the reported area figures from post-layout die area for 1024-bit implementation.
As Table III shows, the basic design is, as expected, more efficient in terms of area overhead when the number of modules increases. In the new parallel design, the Module-Assigner has a lookup table to keep track of the idle modules. The size of this lookup table grows with the number of modules, and its area increases accordingly. In the basic design, the complexity of the controller depends much less on the number of modules. As the number of modules increases, the number of I/O pins decreases, but the load-input and store-output phases become longer.
We compare the area of our designs in 32-32 mode with three other 1024-bit designs in Table IV. It must be noted that our architectures use a newer technology than the other designs, which also has a considerable effect on area.
Timing
Each operation is done in four phases, namely, check availability, load operands, computation and store result. The check-availability phase does not exist in the basic design, since that design handles just one task at a time. The duration of the check-availability phase is fixed at 4 clock cycles regardless of the granularity of the system and the number of controllers. The durations of the load and store phases depend on the number of modules required for that particular operation. The addition operation always needs one clock cycle, and the multiplication always needs a number of cycles equal to twice the operand size. The number of cycles of the inversion operation depends on the bit pattern of the input data. We performed a large number of simulations with varying input bit patterns to determine the average number of clock cycles needed for an inversion, which is 5.5 times the operand size. If the module size is changed while the maximum operand size is kept the same, only the durations of the load and store phases change, and this change is negligible. Consequently, the main computation time is proportional to the operand width, O(m), and is independent of the module size. This discussion is summarized in Table V.
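The cycle counts stated above can be combined into a rough estimate as sketched below. The function names are hypothetical. The assumption that the load and store phases each take one clock cycle per module is introduced only for this example (the text says only that these phases depend on the number of modules), and the point doubling estimate uses the operation mix quoted in the throughput evaluation (seven additions, three multiplications, one inversion).

```python
CHECK_AVAILABILITY_CYCLES = 4   # fixed; present only in the parallel design


def compute_cycles(op: str, operand_bits: int) -> float:
    """Computation-phase cycles for one basic operation (inversion is an average)."""
    if op == "add":
        return 1
    if op == "mul":
        return 2 * operand_bits
    if op == "inv":
        return 5.5 * operand_bits
    raise ValueError(f"unknown operation: {op}")


def task_cycles(op: str, operand_bits: int, word_bits: int, parallel: bool = True) -> float:
    """All four phases, ASSUMING one load cycle and one store cycle per module."""
    modules = (operand_bits + word_bits - 1) // word_bits
    check = CHECK_AVAILABILITY_CYCLES if parallel else 0
    return check + modules + compute_cycles(op, operand_bits) + modules


def point_doubling_compute_cycles(operand_bits: int) -> float:
    """Computation cycles for 7 additions, 3 multiplications and 1 inversion."""
    return (7 * compute_cycles("add", operand_bits)
            + 3 * compute_cycles("mul", operand_bits)
            + 1 * compute_cycles("inv", operand_bits))


print(compute_cycles("mul", 256))            # 512
print(task_cycles("mul", 256, 32))           # 532 under the load/store assumption
print(point_doubling_compute_cycles(160))    # 1847.0
```

Under these assumptions the load, store and check phases add only a few cycles to a computation phase of hundreds or thousands of cycles, which illustrates why the text treats their variation as negligible.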
The main improvement in our parallel architecture is the throughput of the system. To evaluate this claim, we simulate both systems in 32-32 mode and generate 100,000 point doubling operations, each of which requires a random, bounded number of modules. This is a typical situation for today's secure systems, as they are likely to handle hundreds of ECC operations per second. Point doubling is a good candidate for testing the utilization and throughput of the system because it involves all the basic operations. Each scalar point multiplication is composed of a series of point doublings and point additions. In terms of the basic operations involved, point doubling is very similar to point addition and is therefore a good approximation of scalar point multiplication as well. Each point doubling operation is composed of seven additions, three multiplications and one inversion, and has a random field size between 160 and 256 bits. In 32-32 mode, these operations require between 5 and 8 modules. 160-bit ECC is secure today and is used by most typical applications, and 256-bit ECC is secure until 2030 [19]. In this way, data from a wide range of secure applications is injected into the system. Table VI presents the differences between the results of the two architectures. As the results in Table VI show, the new architecture completes the workload almost 3.47 times faster than the basic architecture. Hence, the throughput of the parallel architecture is 3.47 times that of the basic design.
Where the change in key size is limited, the system response time and the controller area overhead deteriorate as the granularity of the system increases, as Table VII shows. As the modules become smaller, the response time increases because the load and store operations take more time.
Therefore, it is better to choose a low granularity in typical cases where the change in key size is limited. However, the flexibility of the system in supporting different key sizes is then low, as it does not support keys whose sizes are not multiples of the module size. This implies a trade-off between flexibility on the one hand and area and performance on the other, as in other existing parallel systems.
Power and Energy
Similar to the area evaluation, the power consumption of our designs in 32-32 mode is compared with three other 1024-bit designs in Table VIII. The architecture proposed in [7] does not report any measured power figures, only analytical results. Our proposed designs consume very little power compared to the two other designs because of (i) the use of power and/or clock gating techniques and (ii) the use of simple components.
Although the power consumption of the parallel architecture is higher than that of the basic design, it requires less energy to accomplish the tasks when there are several concurrent tasks. To illustrate the difference, assume that there are 100,000 point doubling tasks, as in the timing evaluation. As Table IX shows, the basic design consumes less power than the parallel design because, on average, only 25% of its modules are active, whereas 95% of the modules are active in the parallel design.
However, the parallel design completes the tasks in much less time and requires 9% less energy compared to basic architecture.
CONCLUSIONS
In this paper, a high-performance parallel arithmetic processor architecture for GF(2^m) has been proposed which supports all essential ECC operations. The architecture is modular, supports arbitrary operand sizes and is scalable to very large operand sizes. When a small key size operation is requested, the system deactivates the remaining modules to reduce the power consumption if there is no parallel request to the system. Otherwise, it handles new requests on the remaining unutilized modules in parallel with the other modules. The new requests can be different operations with different operand sizes, which lets the system use its resources very efficiently and considerably increases its overall performance and throughput. Adding higher-level operations such as point addition and point doubling to our architecture is future work toward our ultimate goal of designing a complete ECC processor.
GF(2 m ) Arithmetic Operations for Elliptic Curve Cryptography
Esmaeil Amini, Zahra Jeddi, Ahmed Khattab, and Magdy Bayoumi 
