Abstract. The current deployment of Digital Rights Management (DRM) schemes to distribute protected content and rights is leading the way to the massive use of sophisticated embedded cryptographic applications. Embedded microprocessors have been equipped with bulky and power-consuming coprocessors designed to suit particular data sizes. However, flexible cryptographic platforms are more desirable than devices dedicated to a particular cryptographic algorithm, as the increasing cost of chip fabrication favors large-volume production. This paper proposes a novel approach to embedded cryptography: a vector-based general-purpose machine capable of implementing a range of cryptographic algorithms. We show that vector processing ideas can be used to perform cryptography in an efficient manner which we believe is appropriate for high-performance, flexible and power-efficient embedded systems.
Introduction
Given the commercial value of digital content, its management in mobile equipment (such as PDAs, mobile phones or even smart-cards) has become a critical issue for content issuers, and Digital Rights Management (DRM) schemes are being developed to address it. For example, the Open Mobile Alliance (OMA) is working on a DRM architecture for the mobile industry [1]. In those DRM schemes, the distribution, management and protection of data rely on the use of complex cryptographic protocols and algorithms. In such a context, the processors used (in particular those in mobile equipment) face constraints of size, power, cost, performance and security.
During the past 15 years, many hardware modules for cryptographic applications have been published. Most of those proposals make use of processors which are very application-specific. They are not only optimized for one particular algorithm but also for particular data sizes, to suit market requirements. For security, counter-measures have been proposed, most of which are software-based, leading to bulkier code and slower programs. A hardware-software co-design approach is being undertaken by other researchers [2, 3, 4] in order to provide hardware that would reduce the cost of those software counter-measures.
Our approach uses data-parallel techniques for cryptographic applications. We first describe how we chose the vector design space. We then illustrate how cryptographic algorithms can be vectorized by giving two examples. This takes us to the design of the corresponding vector processing machine, before finally presenting results obtained on the functional simulator. With this approach, we propose an architecture which can achieve high performance and flexibility with little increase in control logic compared to scalar processors. These characteristics of performance and flexibility are particularly relevant to DRM applications, where cryptographic applications run on processors with very different constraints, ranging from the 'computer terminal' of the Rights Issuer to the small embedded chip of the DRM Agent found in mobile equipment.
Having a quantitative approach
During the past 15 years, we saw an explosion in the use of cryptographic processors for embedded applications. For secret-key algorithms those hardware implementations can be considered rather straightforward. For public-key systems however, given the complexity of the computations involved, designers have been implementing systems for static operand lengths (for example, long-precision number multipliers of a fixed width). Some have integrated crypto-oriented instructions into the instruction set of General Purpose Processors (GPPs) [5, 6, 7]. Others have taken a more scalable approach, as depicted in [8]. But none has taken a systematic approach in which hardware designers look for the design offering the 'best' trade-off between speed, security, chip area and power consumption.
Having identified this need, we went back to the architecture design space and looked for the architecture that would best allow us to undertake such a quantitative study. Note that this paper focuses on the micro-architecture design of a cryptographic accelerator. Issues of security (and related counter-measures) are beyond the scope of this paper.
A case for a vector architecture
According to [9], the architecture design space can be decomposed into the tree shown in Figure 1. Parsing through this design space, Single Instruction Scheme processors are chosen in order to maintain compatibility with existing smart-card chips. Having a Multiple Instruction Scheme would imply a multi-processor system, which does not fit the current power and size constraints on embedded chips. Instruction Level Parallel architectures can be put aside because having parallel instruction execution:
-requires complicated instruction decoding and scheduling units, which would go against our motivation of reducing complexity;
-implies the use of very sophisticated instruction decoders and issuers, which consume a lot of power;
-is not well suited to these particular applications: most cryptographic algorithms involve the sequential use of specific instructions/operations, leaving little room for parallelism at this level.
A Data Level Parallel approach was chosen because:
-the data used by those cryptographic algorithms can be decomposed into a vector of shorter data onto which operations can be applied in parallel (or partially in parallel), as illustrated in this paper;
-the instruction decoding is simpler;
-in terms of security, working on data in parallel can in theory reduce the relative contribution of each data piece to the external power consumption, as announced by [10].
Hence we used Data Level Parallel techniques to design our cryptographic processing unit. Our design's vector machine is controlled by a General Purpose Processor (GPP) which also allows the optimal execution of 'scalar' codes 1 .
Proposed methodology for vectorizing cryptographic algorithms
We chose two case studies to illustrate how cryptographic algorithms can be vectorized: the AES symmetric-key algorithm and modular multiplication based on Montgomery's algorithm (used in both RSA and Elliptic Curve public-key cryptography). For each of those case studies we look at its performance on a scalar MIPS-I architecture ([11, 12]) and identify the most time-consuming operations. We then show how the latter can be improved by a vector approach based on the instruction set defined in Appendix A. In Section 5 we show how these algorithms perform on our functional simulator.
Vectorizing the Advanced Encryption Standard
The AES algorithm is described in [13]. The algorithm works with key lengths of 128, 192 or 256 bits. In this study we concentrate on the 128-bit version of the AES, as it is representative of the structure of all three variants.
Scalar Implementation on the MIPS
Our test implementation on the scalar MIPS is illustrated in Figure 2. The key schedule is done first and the sub-keys are stored in RAM. The encryption process is then executed. No counter-measures are implemented. We focus on the encryption process. The SUBBYTE operation is a byte-wise look-up process; for this purpose we have a VBYTELD Vx, Ry, m instruction, as explained in Appendix A. Such an instruction can be implemented given the memory organization described in Section 4.2. Note that this optimization is also useful for the KEY-SCHEDULE.
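For instance, with the S-box base address held in a scalar register Ry and the 16 state bytes held in the first four words of a vector register Vx, a single VBYTELD Vx, Ry, 4 would perform the whole SUBBYTE step (the register assignments here are purely illustrative).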
Originally the SHIFTROWS function is composed of left rotations on each row of the data matrix; if each row of the data matrix were represented as a 32-bit word, the SHIFTROWS would be very simple. But in our implementation each 32-bit word is one column of the data matrix, hence the difficulty of implementing this operation. The MIXCOLUMNS operation is the most time-consuming one, as shown in Table 1. It is a matrix multiplication working on each column, as defined below:
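As specified in [13], each output column $(b_0, b_1, b_2, b_3)$ is obtained from the corresponding input column $(a_0, a_1, a_2, a_3)$ by multiplication with a fixed matrix over GF($2^8$):

$$\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} =
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}$$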
Each of the individual byte multiplications is done in the field GF($2^8$), modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$, whose binary representation is 0x11B. The central operation is hence multiplication by $x$ modulo $m(x)$. Given the instructions in Appendix A, the MIXCOLUMNS operation can be implemented with a short sequence of vector instructions.
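For reference, the arithmetic that such a vector sequence has to realize is summarized in the following minimal scalar C sketch of multiplication by $x$ modulo $m(x)$ (the 'xtime' operation) and of one MIXCOLUMNS column; the function names are ours, and this is an illustration of the underlying operations rather than the VeMICry assembly itself.

```c
#include <stdint.h>

/* Multiplication by x in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B). */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* One MIXCOLUMNS column: b = M * a with the fixed matrix over GF(2^8). */
static void mix_one_column(uint8_t b[4], const uint8_t a[4]) {
    for (int i = 0; i < 4; i++) {
        /* row i of the matrix is {02,03,01,01} rotated right by i positions */
        uint8_t x2 = xtime(a[i]);                                        /* 02 * a[i]   */
        uint8_t x3 = (uint8_t)(xtime(a[(i + 1) & 3]) ^ a[(i + 1) & 3]);  /* 03 * a[i+1] */
        b[i] = (uint8_t)(x2 ^ x3 ^ a[(i + 2) & 3] ^ a[(i + 3) & 3]);
    }
}
```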
Vectorizing Montgomery's Modular Multiplication
Two commonly used public-key algorithms are RSA and ECC 3. RSA is based on the modular exponentiation of large integers (typically 1024 to 2048 bits or more). ECC is based on the scalar multiplication of a point on an elliptic curve over a finite field (either $F_p$ with $p$ prime or $F_{2^m}$). In both cases the most critical operation is the long-precision modular multiplication. One of the most efficient ways to perform those multiplications is based on Montgomery's reduction algorithm, as shown in [14].
For our study, we looked at Elliptic Curve Cryptography over binary fields [15]. The basic modular multiplication consists of multiplying the co-ordinates of given points on the elliptic curve. Those co-ordinates have a polynomial representation and the multiplication is done modulo an irreducible polynomial in the same field. Modular multiplications have been thoroughly studied and optimized; methods like those proposed in [16], based on Montgomery's reduction, are among the fastest. As explained in [16], Montgomery's algorithm can be implemented so as to interleave the multiplication and the reduction phases. In the latter paper, the authors show that we can use Montgomery's algorithm to calculate $c(x) = a(x) \cdot b(x) \cdot r^{-1}(x) \bmod n(x)$, where $n(x)$ is an irreducible polynomial. Given that we are working in the field $F_{2^m}$, so that the polynomials involved in this algorithm are of length $m$, the authors in [16] show that $r(x)$ can be chosen as $r(x) = x^m$.
If we suppose that the multiplicand $a(x)$ can be decomposed into a linear combination of 32-bit polynomials denoted by $A_i(x)$, such that $a(x) = \sum_{i} A_i(x) \cdot x^{32i}$, we have the algorithm in Figure 3 for a 32-bit architecture: $C_0(x)$ is the least significant 32-bit word of the polynomial $c(x)$ and $N_0(x)$ is the pre-calculated 'Montgomery constant', such that $N_0(x) \cdot n_0(x) \equiv 1 \pmod{x^{32}}$, where $n_0(x)$ is the least significant 32-bit word of $n(x)$.

Vector Approach to Modular Multiplication
We looked at the vector instructions that can help to enhance the execution of this 'interleaved' Montgomery modular multiplication, and obtained the corresponding vector assembly code, whose structure and comments follow the algorithm in Figure 3.
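The vector listing itself is specific to the VeMICry instruction set; as a reference for the underlying computation, here is a minimal scalar C sketch of the word-level interleaved algorithm of Figure 3 (function and variable names such as mont_mul_gf2m and clmul32 are ours, chosen for illustration). It computes $c(x) = a(x) \cdot b(x) \cdot x^{-32s} \bmod n(x)$ for operands of $s$ 32-bit words:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_WORDS 64   /* enough for operands of up to 2048 bits in this sketch */

/* Carry-less (GF(2)) multiplication of two 32-bit polynomials -> 64-bit result. */
static uint64_t clmul32(uint32_t x, uint32_t y) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((y >> i) & 1)
            r ^= (uint64_t)x << i;
    return r;
}

/* acc(x) += w(x) * v(x), where v has s 32-bit words and acc has s+1 words. */
static void xor_mul_word(uint32_t *acc, uint32_t w, const uint32_t *v, size_t s) {
    uint32_t high = 0;                        /* upper product half, belongs to word j+1 */
    for (size_t j = 0; j < s; j++) {
        uint64_t p = clmul32(w, v[j]);
        acc[j] ^= (uint32_t)p ^ high;
        high    = (uint32_t)(p >> 32);
    }
    acc[s] ^= high;
}

/* Word-level interleaved Montgomery multiplication in GF(2^m).
 * Operands are stored least significant word first; n0inv is the pre-computed
 * 'Montgomery constant' N0(x) = n0(x)^{-1} mod x^32.
 * Result: c(x) = a(x) * b(x) * x^{-32s} mod n(x), held in s words (one extra
 * reduction by n(x) may be needed for a canonical representative). */
void mont_mul_gf2m(uint32_t *c, const uint32_t *a, const uint32_t *b,
                   const uint32_t *n, uint32_t n0inv, size_t s) {
    uint32_t acc[MAX_WORDS + 1] = {0};        /* requires s <= MAX_WORDS */

    for (size_t i = 0; i < s; i++) {
        xor_mul_word(acc, a[i], b, s);                    /* c += A_i(x) * b(x)     */
        uint32_t m = (uint32_t)clmul32(acc[0], n0inv);    /* M = C_0 * N_0 mod x^32 */
        xor_mul_word(acc, m, n, s);                       /* c += M(x) * n(x)       */
        for (size_t j = 0; j < s; j++)                    /* c = c / x^32           */
            acc[j] = acc[j + 1];
        acc[s] = 0;
    }
    for (size_t j = 0; j < s; j++)
        c[j] = acc[j];
}
```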
Proposed Architecture
Vector processor techniques have been widely used, either in super-computers like the Cray machine [17] or in Digital Signal Processing applications such as Intel's MMX or the T0 architecture described in [18]. In the latter example, the authors already use a MIPS-like scalar processor. In this section we present the foundations for our vector architecture.
Our design aims at offering high performance for the data-parallel cryptographic processes without penalizing scalar execution. Because of this, we take an approach where we start from an existing, high-performance General Purpose Processor and 'plug in' the vector co-processor. This is particularly convenient with the MIPS architecture, where co-processor interfaces are well defined, easing user Application Specific Extensions. The specification and definition of what we will call the Vectorial MIPS for Cryptography (VeMICry) has to be done on two levels:
-Resource/Architectural level: definition of the resources present in the vector unit (register files, processing units, memory interface units . . . ).
-Instruction scheduling and pipelining: specification of the vector instructions' execution with respect to the scalar pipeline, in addition to the inner pipeline of each vector instruction.
Architectural specification
Appendix A of [19] provides a comprehensive picture of the theory behind vector processing and its application to micro-processors. To suit the MIPS 'load-store' architecture and to avoid complex memory accesses, we chose a register-to-register vector architecture: we hence hope to reduce memory-register transfers, which are preferred attack paths for side-channel analysis. Note that in this paper we work only on code implemented directly in assembly language, which means that we will not be discussing compiler optimization techniques.
Vector Register File
The structure and architecture of the vector register file is the determining factor in defining the rest of the architecture. Six factors determine the structure of our vector register file:
-m: The size, in bits, of each element of a vector register. Currently m = 32.
-q: The number of such vector registers.
-p: The number of elements in each vector register. This will be called the depth of each vector.
-r: The number of lanes into which the vector registers are organized. This notion is borrowed from [18], where it is associated with the number of VPUs 4 available to the VeMICry. We have as many lanes as there are VPUs. Ideally we would have r = p, allowing us to work on the p elements in parallel: the j th VPU, for example, would be 'associated' with a register file made of all the j th elements of all the vector registers. However, in some cases, size and power constraints will not allow p VPUs. We leave r as a parameter in our analysis of the best performance-to-size trade-off. As a result, the j th VPU will be associated not only with the j th elements across the register file but also with the (j + r) th , (j + 2r) th . . . elements.
-l: The number of elements of a vector register onto which the operation is applied (the vector length). Our analysis revealed that it is interesting to work on vector lengths which are not necessarily equal to the depth of each vector register, especially in the case where r = p, both in terms of speed and power consumption. Setting the vector length could be done by setting a configuration register for example 5.
-The memory latency is also an important factor. It depends not only on the number of read and write ports per VPU but also on the definition of the interface with the memory, or even on how many 'memory banks' we could have in parallel. In our architecture, we propose to have one software-managed memory bank per lane. Within each 'bank' we have 4 concurrently accessible byte arrays of, say, 1 kilobyte each. Such a structure allows each VPU to fetch four bytes in parallel, which is especially useful for the VBYTELD instruction.
Fig. 4. Vector Register File
We obtain the register file architecture shown in Figure 4. We propose to study the influence of those six factors on performance and area. In addition to the vector registers, we identified the need for a Vector Conditional Register (VCR), which is a p-bit register; a Scalar Buffer Interface (SBI) register, acting as a buffer for scalar values shared between the scalar core and the vector processing unit; and a CARry buffer (CAR) to store the most significant word or carry when doing additions or multiplications (in particular when l = p).
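As an illustration of how these parameters and extra registers fit together, the following is a sketch of how a software model of the VeMICry register state might be declared (the identifiers are ours, not the simulator's, and the parameter values are those of the functional model of Section 5):

```c
#include <stdint.h>

/* Illustrative parameterization of the VeMICry vector register file. */
#define M_BITS  32   /* m: width of each vector element                */
#define Q_REGS  8    /* q: number of vector registers                  */
#define P_DEPTH 8    /* p: number of elements (depth) per register     */
#define R_LANES 8    /* r: number of lanes / VPUs                      */

typedef struct {
    uint32_t vreg[Q_REGS][P_DEPTH]; /* vector registers, m = 32 bits per element */
    uint32_t vcr;                   /* Vector Conditional Register, p bits used  */
    uint32_t sbi;                   /* Scalar Buffer Interface register          */
    uint32_t car;                   /* CARry buffer for add/multiply overflow    */
    uint32_t vlen;                  /* current vector length l, with l <= p      */
} vemicry_state_t;
```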
Vector Instruction Execution & Scheduling
In this section we briefly describe the scheduling and execution of the vector instructions. A vector instruction is meant to replace what would be, in software, a loop: a loop where the data being operated on are independent from each other and where the calculation of each iteration is independent from the calculation of the neighboring iterations. However, looking at some of the instructions in Appendix A, we can see that operations like VADDU do not obey this basic requirement. For such instructions we take advantage of the fact that the calculation on each element of the vector is only 'partially' independent from that on its neighbors.
From then on, we define three classes of vector instructions:
Definition 1. A Genuinely Independent Vector Instruction (GIVI) is one where the transformation applied to every element of the operand vectors is independent from the application of that same transformation on this same element's neighbors.
Definition 2. A Partially Independent Vector Instruction (PIVI) is one where the transformation applied to every element of the operand vectors depends partially on the result of the same operation applied to one of its neighbors.
Definition 3. A Memory Accessing Vector Instruction (MAVI) is a vector register-memory instruction where a memory access is required for the application of the required transformation on every element of the operand vectors.
Each of those groups of instructions has its own dependency constraints, which lead to a characteristic decomposition of the execution sequence for each group. The instruction decoding is handled by the scalar MIPS as part of its 'normal' five-stage pipeline:
-IF: Instruction Fetch.
-ID: Instruction Decode.
-EX: (Scalar) Execution Stage.
-DC: Data Cache read and alignment.
-WB: Write Back stage.
Upon the detection of a vector instruction, each VPU enters its own four-stage pipeline:
-Data Fetch (DF) stage, where each VPU fetches the (one or two, depending on the instruction) elements from the target vector registers. If a scalar register is involved, the value is fetched from that scalar register and written into the SBI register.
-Execute-Multiply (EXM) stage, where the VPU performs the corresponding multiplication or addition calculation for a PIVI. For a GIVI or a MAVI, nothing is done.
-Execute-Carry (EXC) stage, where the 'carry' selection is done for the PIVIs and the latter's calculation is completed. For a GIVI or a MAVI, the corresponding calculation/manipulation is done on the arguments fetched in the DF stage.
-Write Back (WB) stage, where the result from the VPU is written back to the corresponding element of the destination vector register.
It is left to the programmer to make sure the vector register length is properly set before issuing any vector instruction when working on vectors of length l < p.
GIVI execution
Let us consider the general case where p is 'too' large and we only have r VPUs, with r ≤ p (which may well be the case for embedded processors). This means each VPU will have to process several elements of the vector sequentially, iterating over the register file as many times as needed.

PIVI execution
In a Partially Independent Vector Instruction, the calculation on every element of the vector register depends on the calculation on the neighboring elements: the instructions concerned by this category are VADDU, VSPMULT, VSAMULT and VTRANSP. For the optimal scheduling of the PIVI instructions we assume that each VPU has an internal 'temporary' 32-bit register. Most of the above-mentioned instructions have to handle the addition of vector elements and to anticipate the carry being propagated from the neighboring least significant element. To do so, we assume that each VPU has a 32-bit Carry Select Adder (CSA): at each addition step the addition is performed for both cases of an 'incoming' carry of 0 and of 1, and the correct output is selected once the actual carry is known (a sketch of this carry-select scheme is given at the end of this section). In this way a PIVI instruction can have the same instruction issue rate as a GIVI.

Vector instruction chaining and hazards
If we work on vector depths which are greater than the number of VPUs, an instruction may take several iterations, as illustrated in Figure 5 for a GIVI instruction. The main type of hazard we might be confronted with is the data hazard. Data hazards occur when instruction I has as operand the result of the preceding instruction I − 1. With our vector operations, data hazards occur when an instruction takes only 1 or 2 iterations (i.e. p/r ≤ 2). For instructions having a larger number of iterations, the latency incurred by the multi-iteration process diffuses the data dependency. Table 2 describes the different data hazards that might occur between an instruction I − 1 and the instruction I and how, when this is possible, pipeline stalls can be avoided by using data feed-forward mechanisms.
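A minimal C sketch of that carry-select execution of VADDU follows (our own illustrative model, not the VeMICry hardware): during EXM each lane computes its 32-bit sum for both possible incoming carries, and EXC selects the correct result once the carry from the less significant lane is known.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t sum0, sum1;   /* sums assuming carry-in = 0 and carry-in = 1 */
    uint32_t cout0, cout1; /* corresponding carry-outs                    */
} csa_lane_t;

/* EXM stage of one lane: compute both candidate results. */
static csa_lane_t lane_add(uint32_t a, uint32_t b) {
    csa_lane_t r;
    uint64_t s0 = (uint64_t)a + b;
    uint64_t s1 = s0 + 1;
    r.sum0 = (uint32_t)s0;  r.cout0 = (uint32_t)(s0 >> 32);
    r.sum1 = (uint32_t)s1;  r.cout1 = (uint32_t)(s1 >> 32);
    return r;
}

/* Behavioural model of VADDU over l-element vectors (element 0 least significant). */
uint32_t vaddu(uint32_t *dst, const uint32_t *va, const uint32_t *vb, size_t l) {
    uint32_t carry = 0;
    for (size_t j = 0; j < l; j++) {           /* lanes work in parallel in hardware */
        csa_lane_t t = lane_add(va[j], vb[j]); /* EXM: both sums computed            */
        dst[j] = carry ? t.sum1 : t.sum0;      /* EXC: select once carry is known    */
        carry  = carry ? t.cout1 : t.cout0;
    }
    return carry;                              /* final carry, written to CAR        */
}
```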
Functional Simulation
We started by building a functional simulator for our VeMICry architecture: a functional simulator allows us to test the vector code presented in Sections 3.1 and 3.2. Moreover, with such a simulator we can perform performance studies in terms of instruction cycles and see the effect of the parameters from Section 4.2.
Use of the ArchC simulation tool
The ArchC tool is an architecture description language developed by the Computer Systems Laboratory of the Institute of Computing of the University of Campinas (www.archc.org). The tool allows one to build an architectural instruction simulator which is composed of:
-A description language used to describe the target architecture, including the memory hierarchy (AC ARCH) and the instruction set architecture (AC ISA).
-A simulator generator (ACSIM) which uses the above description to generate a Makefile, which is then used for building a SystemC model.
ArchC is based on the widely used SystemC [20] modeling framework and allows fairly simple architectures to be built, which is sufficient for our immediate needs. Moreover, the simulation software builder is based on GCC (www.gnu.org), so it is easy to modify the instruction set. The idea behind this study is to build a simulator of our VeMICry architecture to test the vector instructions described in Appendix A and perform some preliminary performance studies in terms of instruction cycles.

Table 2. Data Dependencies on the vector instructions
Building the functional model
Our architecture is based on the 32-bit MIPS architecture, with an instruction set fully compatible with the (basic) MIPS-I family. Moreover, we modified GCC's assembler to compile our vector code.
The backbone of the VeMICry model is composed of the definition files of the MIPS-I model, which we have extended to add our vector instructions. In our model:
-We have 8 vector registers (q = 8).
-Each vector is composed of 8 × 32-bit elements (p = 8).
-We have 8 VPUs working in parallel (r = 8). Hence there are eight lanes, where in the j th lane the j th VPU works across the j th elements of the vector registers.
-We assume that each instruction is executed in 1 cycle (this is only a functional model).
The simulator generates a series of basic statistics like the sequence of instructions executed (vemicry.dasm), a trace of the Program Counter (vemicry.trace) and the occurrences of each instruction along with the number of cycle-counts (vemicry.stats).
Functional simulation of vectorised AES
As explained previously, the vector instructions are used to optimize the SHIFTROWS, MIXCOLUMNS, ADDROUNDKEY and SUBBYTE operations. The KEY SCHEDULE is implemented as a separate routine.
We validated the results generated by our vector AES encryption code. Simulations show that encrypting 16 bytes (for an AES-128) takes 160 instruction cycles. In addition, the KEY SCHEDULE took 246 instruction cycles. Those figures represent a large gain in performance when compared to the same algorithms implemented on the scalar MIPS: for the scalar code, the key schedule took 519 instruction cycles and the encryption took 3283 cycles.
More performance gain is achieved when we encrypt larger data blocks. We ran simulations where we encrypted 32 bytes with the same key, i.e. we ran the KEY SCHEDULE once and the encryption code was modified to work on 8 words of each vector register. Encrypting 32 bytes took 182 instruction cycles. This illustrates a major advantage of our architecture: depending on the depth of the vector registers, we are able to encrypt large data tables with little performance penalty.
Another big advantage with our approach is that robust software counter-measures (like those described in [2] ) can be implemented to compensate for any side-channel information leakage.
Simulation of vectorized Montgomery Multiplication in binary fields
On the VeMICry, the calculation of Montgomery's constant takes 22 instruction cycles. The main part of the modular multiplication takes 97 instruction cycles. The same modular multiplication takes 22331 instruction cycles on the scalar MIPS.

Note that our test values are taken from the field GF($2^{191}$), which means that the data values have a maximum length of 192 bits. Given that in the current architecture each vector register has 8 elements, each vector register is used to hold the 192 bits of a variable. With a depth of 8, we could work on up to 256-bit ECC (with the same number of instruction cycles), which should cover what is expected to be required for the next 20 years or so.
Note that in the preceding example we perform a reduction by 32 bits at a time. However, one could envisage performing the reduction by 64 bits at a time, which would halve the number of loop iterations. In the algorithm depicted in Figure 3, each word would then be 64 bits wide, the pre-calculated $N_0$ would also be a 64-bit word, and the final shift would be by 64 bits.
We modified the vector code presented at the end of Section 3.2 to emulate this reduction by 64 bits. The calculation of $N_0$ took 72 instruction cycles and the modular multiplication itself took 84 instruction cycles. Note that $N_0$ can be calculated only once at the beginning of the signature algorithm; hence, for comparing performance, we focus only on the multiplication algorithm. The performance gain when doing a 64-bit reduction is of the order of 13% compared to the same algorithm implemented with a reduction by 32 bits. This gain is achieved at the expense of one additional vector register.
Ongoing research
To have a significant quantitative study, it makes sense to study modular multiplications on larger values, as in RSA. So the next phase of the study is to test the modular multiplication on 1024- to 2048-bit values and see how the number of instruction cycles changes as we vary the different sizes of the vector architecture. Then we will implement a synthesizable Verilog model to add the 'gate count' parameter to our benchmark.
Conclusion
In this paper we proposed a vector architecture for embedded cryptography. We have shown how the vector approach is relevant to cryptography and how cryptographic algorithms can be efficiently vectorized. We built and validated a functional model of our vector architecture. The vector architecture, combined with our proposed instructions, has helped us reduce the number of cycles taken for an AES encryption from 3283 on the MIPS-I to 160 on the VeMICry. Likewise, modular multiplication in the field GF($2^{191}$) has been reduced from 22331 instruction cycles to 84 cycles. We anticipate that each lane will be at most as complex as (if not less complex than) a scalar MIPS, which suggests that our vector approach is a sound one given the performance figures measured. Further research is currently being done to study the complexity of our vector architecture and find the best trade-off between performance, size and power consumption.
A Vector Instructions
The VeMICry processor is composed of two families of instructions: the scalar instructions which correspond to the conventional MIPS-I instruction set and the vector instructions tailored to suit cryptographic requirements.
Suppose we have a vector processor having q vector registers. Each vector register is a vector of p words of 32 bits each. We also have a Vector Condition Register (VCR), which contains p bits and which is used by conditional vector instructions to indicate whether the condition holds for each of the individual words of a vector. Moreover, we have a second 'scalar' register called the Carry Register (CAR), to which, for some instructions, 'carry' bits/words are written back. We also assume that we are able to work on an arbitrary vector length l with, of course, l ≤ p.
VBYTELD Vl, Ri, n: Each word of Vl is treated as four bytes. Each byte is an offset which is added to the address stored in Ri, and the byte stored at that address is read from the VPU's corresponding memory. The read byte is written to the same location as that of its original corresponding byte. This process is executed for n words of Vl.

VLOAD Vl, Ri, n: Loads into Vl the n consecutive 32-bit words from memory, starting from the address stored in Ri, with a stride of 1 (the notion of stride is introduced in Appendix A of [19]; a stride of '1' means that consecutive elements are read from consecutive memory addresses).

MTVCR Rj: Writes to VCR the value contained in the scalar register Rj.

MFVCR Rj: Copies the value contained in VCR to the scalar register Rj.
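As an illustration of the VBYTELD semantics described above, here is a short behavioural C model (our own sketch, not simulator code): every byte of the first n words of Vl is used as an offset from the base address in Ri and is replaced by the byte read at that offset, i.e. an in-register byte-wise table lookup such as the AES S-box.

```c
#include <stdint.h>
#include <stddef.h>

/* Behavioural model of VBYTELD Vl, Ri, n. */
void vbyteld(uint32_t *vl, const uint8_t *base /* address held in Ri */, size_t n) {
    for (size_t w = 0; w < n; w++) {          /* each of the n words of Vl      */
        uint32_t word = vl[w], out = 0;
        for (int b = 0; b < 4; b++) {         /* four bytes fetched in parallel */
            uint8_t offset = (uint8_t)(word >> (8 * b));
            out |= (uint32_t)base[offset] << (8 * b);
        }
        vl[w] = out;                          /* looked-up bytes keep their position */
    }
}
```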
