Modern cellular networks allow users to transmit information at high data rates, reach IP-based networks deployed around the world, and use sophisticated services. In this context, it is necessary not only to develop new radio interface technologies and improve existing core networks, but also to guarantee confidentiality and integrity during transmission. The KASUMI block cipher lies at the core of both the f8 data confidentiality algorithm and the f9 data integrity algorithm for Universal Mobile Telecommunications System (UMTS) networks. KASUMI implementations must deliver high performance with low power consumption to be suitable for network components. This paper describes a specialized processor core designed to execute the KASUMI algorithm efficiently. Experimental results show a performance improvement of two orders of magnitude over software-only implementations. We describe the design technique used, which can also be applied to other Feistel-like ciphers. The proposed architecture was implemented on an FPGA; results are presented and discussed.
the datapath and four specialized new instructions were defined to control the extended unit. A complex core could also be used to prove this concept, but the time needed to understand its internals and devise a way to extend it increases noticeably; the simplicity of the source code was therefore a key factor in the final decision. The processor core chosen for this work is the MyRISC core [14], which models a MIPS R2000 32-bit processor with a five-stage pipeline. The same approach can be used to integrate the proposed extension into other RISC-type processors. Figure 4 shows the organization of the proposed extension to the MyRISC core.
Processor organization
The part above the thick horizontal line corresponds to the initial RISC processor core as it is distributed. The components lying below the thick line correspond to the KASUMI extension, which carries out the processing corresponding to two rounds. The modules that store and generate data operands are located in the processor's Instruction Decode (ID) stage, whereas the modules that perform encryption operations belong to the Execute (EX) stage.
The extended register file
The new functional unit contains ten 32-bit registers that store the data it processes. Extended instructions move data from/to integer registers to/from a register within this new register file. Figure 5 shows the organization of this data unit. Registers 0 and 1 store the plaintext block the KASUMI functional unit works with; after the ciphering process the registers store the ciphertext block produced.
The 32 most significant bits of the block are stored in register 0, whereas register 1 stores the 32 least significant bits. The 128-bit encryption key K is split into four 32-bit parts and stored in registers 2 to 5. Registers 6 to 9 store the ciphering constants used along with the encryption key to generate the set of round keys {KLi, KOi, KIi} for each round i. There is no need to preload the array of constants, since these values are loaded automatically every time the RESET signal is asserted.
Any of the first six registers within the extended register file can be synchronously written by specifying its address and the value to store, in the same way as for integer registers. Registers 0 and 1 can also be written in parallel to store the ciphertext block produced by the block ciphering modules. These two kinds of write cannot be performed simultaneously. The array that stores the encryption key K (registers 2 to 5) is synchronously rotated upwards to compute the appropriate round keys for the next two rounds. The same applies to the array that stores the ciphering constants (registers 6 to 9). The only write allowed to occur at the same time as the rotation of the arrays is the parallel write of registers 0 and 1. The register file asynchronously outputs the contents of the ten internal registers. An additional output, also asynchronous, issues the contents of the register selected by an input address line.
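The write and rotation rules above can be sketched as a small behavioural model. This is only an illustration, not the authors' HDL: the class and method names are hypothetical, the packing of the eight 16-bit KASUMI constants into four 32-bit registers is an assumption, and so is the rotation amount (one register position per rotation, which corresponds to advancing the key schedule by two rounds).

```python
MASK32 = 0xFFFFFFFF

# Assumed reset values: the eight 16-bit KASUMI constants C1..C8,
# packed two per 32-bit register (the packing order is an assumption).
RESET_CONSTANTS = [0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210]


class KasumiRegisterFile:
    """Behavioural model of the ten-register extended register file."""

    def __init__(self):
        self.regs = [0] * 10
        self.reset()

    def reset(self):
        # RESET reloads the ciphering constants into registers 6 to 9.
        self.regs[6:10] = RESET_CONSTANTS

    def clock(self, addr=None, value=0, block=None, rotate=False):
        """Apply the synchronous updates of one clock edge."""
        if addr is not None and block is not None:
            raise ValueError("addressed and parallel writes are exclusive")
        if addr is not None and rotate:
            raise ValueError("only the parallel write may accompany a rotation")
        if addr is not None:
            if not 0 <= addr <= 5:
                raise ValueError("only registers 0 to 5 are addressable")
            self.regs[addr] = value & MASK32
        if block is not None:
            # Parallel write of the ciphertext block into registers 0 and 1.
            self.regs[0] = block[0] & MASK32
            self.regs[1] = block[1] & MASK32
        if rotate:
            # Assumed amount: rotate each array upwards by one register.
            self.regs[2:6] = self.regs[3:6] + self.regs[2:3]
            self.regs[6:10] = self.regs[7:10] + self.regs[6:7]

    def read(self, addr):
        # Asynchronous read of the register selected by addr.
        return self.regs[addr]
```

In this model a rotation combined with a parallel write of registers 0 and 1 is accepted, while an addressed write combined with either of them is rejected, mirroring the constraints stated in the text.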
The forwarding unit

The ciphering datapath
This module is parallel to the EX stage of the processor's datapath and performs the encryption process using the block issued by the forwarding unit and the round keys computed by the key generation unit. It carries out an odd round followed by an even round of the KASUMI algorithm in four steps: K1, K2, K3 and K4, where each step takes one clock cycle to complete.
The strategy followed to design this module is shown in figure 6. The manipulation strategy considers a pair of consecutive rounds: an odd round followed by an even round. It changes the structure of the pair without altering its effects, adds components that balance the structure, and exposes a design pattern that repeats. This pattern then becomes the basic building block, which is implemented once and reused until the ciphering process completes. Figure 6.a shows two reordered FO blocks, while figure 6.b shows the result of expanding the two FO blocks, splitting the 32-bit XOR gate located between them into two 16-bit XOR gates, and "unfolding" the datapath comprising the upper FO function block's output, the two 16-bit XOR gates and the lower FO function block.
Notice that both the 32-bit R0 input and the 32-bit R2 output are now split into two 16-bit lines. Figure 6.c shows the result of joining the two FO function blocks to highlight the parallelism between each pair of FI function blocks. Some 16-bit XOR gates with one zero input are added at certain places along the datapath so that it can be divided into three structurally similar sections, each with two FI blocks. The final ciphering datapath for the two rounds is shown in figure 7, where each pair of FI blocks from figure 6.c is grouped into a so-called dpFI block. This reordering process was originally proposed in [7], where a special-purpose reuse-based architecture for implementing the KASUMI algorithm was described. That architecture exploits the reuse of components to implement the eight rounds required by the algorithm.
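For reference, the function computed by one odd/even pair of rounds can be sketched in software as follows. The FL and FO structures follow the public KASUMI specification, not the restructured datapath of figures 6 and 7. The FI function is passed in as a parameter because its S7 and S9 tables are omitted here; `toy_fi` is a hypothetical placeholder and is not the real FI. The helper `xor32_split` only illustrates that splitting a 32-bit XOR into two 16-bit XORs leaves the result unchanged.

```python
MASK16 = 0xFFFF


def rol16(x, n):
    """Rotate a 16-bit word left by n bits (0 < n < 16)."""
    return ((x << n) | (x >> (16 - n))) & MASK16


def fl(x, kl):
    """FL function on a 32-bit word with subkeys kl = (KL1, KL2)."""
    left, right = x >> 16, x & MASK16
    right ^= rol16(left & kl[0], 1)
    left ^= rol16(right | kl[1], 1)
    return (left << 16) | right


def fo(x, ko, ki, fi):
    """FO function: three Feistel-like steps, each calling fi once."""
    left, right = x >> 16, x & MASK16
    for j in range(3):
        left, right = right, fi(left ^ ko[j], ki[j]) ^ right
    return (left << 16) | right


def two_rounds(left, right, odd_keys, even_keys, fi):
    """One odd round (FL then FO) followed by one even round (FO then FL)."""
    kl, ko, ki = odd_keys
    left, right = right ^ fo(fl(left, kl), ko, ki, fi), left
    kl, ko, ki = even_keys
    left, right = right ^ fl(fo(left, ko, ki, fi), kl), left
    return left, right


def xor32_split(a, b):
    """32-bit XOR computed as two independent 16-bit XORs."""
    hi = (a >> 16) ^ (b >> 16)
    lo = (a & MASK16) ^ (b & MASK16)
    return (hi << 16) | lo


def toy_fi(x, k):
    """Placeholder for FI (NOT the real S7/S9-based function)."""
    return rol16(x ^ k, 3)
```

With the real FI and the proper round keys, four successive calls to `two_rounds` would cover the eight rounds of the cipher. An XOR gate with a zero input, as added to balance the datapath, is likewise transparent because `x ^ 0 == x`.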
In spite of this multicycle operation, the ciphering datapath is not intended to work in a pipelined fashion. This means that an instruction that uses the ciphering datapath is not allowed to enter the K1 module until the previous instruction has left the K4 module. In the KASUMI functional synchronous registers are indicated by grey boxes. When the ciphering process reaches the K3 module it commands the KASUMI register file to rotate the arrays storing the encryption key and the ciphering constants, by means of a control signal indicated by a dashed line in 
The key generation unit
The key generation unit outputs two sets of round keys ({KL1,KO1,KI1} and {KL2,KO2,KI2}) and stores them into the ID/EX pipeline register to be issued to the ciphering datapath during the next clock cycle. This unit receives as inputs the four 32-bit words comprising the encryption key K from the forwarding logic and the four 32-bit words storing the ciphering constants from the extended register file. Figure 8 shows the organization of the key generation unit.
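The reason a fixed "rounds 1 and 2" key generator is enough once the key and constant arrays are rotated can be illustrated with the key schedule of the public KASUMI specification. The sketch below works on eight 16-bit words rather than on the four 32-bit registers of the hardware; the function names are hypothetical, and the rotation amount of two 16-bit words (one 32-bit register) per pair of rounds is an assumption about the unit described here.

```python
MASK16 = 0xFFFF

# KASUMI key-schedule constants C1..C8 from the public specification.
C = [0x0123, 0x4567, 0x89AB, 0xCDEF, 0xFEDC, 0xBA98, 0x7654, 0x3210]


def rol16(x, n):
    """Rotate a 16-bit word left by n bits (0 < n < 16)."""
    return ((x << n) | (x >> (16 - n))) & MASK16


def round_keys(k, c, i):
    """Return (KL, KO, KI) for round i (1-based).

    k and c are lists of eight 16-bit words: the key and the constants.
    """
    kp = [a ^ b for a, b in zip(k, c)]  # modified key K' = K xor C

    def word(arr, j):
        # Word with index i + j of the schedule, wrapping around after 8.
        return arr[(i - 1 + j) % 8]

    kl = (rol16(word(k, 0), 1), word(kp, 2))
    ko = (rol16(word(k, 1), 5), rol16(word(k, 5), 8), rol16(word(k, 6), 13))
    ki = (word(kp, 4), word(kp, 3), word(kp, 7))
    return kl, ko, ki


def rotate_two_words(words):
    """Rotate an eight-word array upwards by two 16-bit words."""
    return list(words[2:]) + list(words[:2])
```

Under these assumptions, applying the round-1 and round-2 logic to the rotated arrays yields exactly the keys of rounds 3 and 4, so the same generator can serve every pair of rounds.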
The extended instructions
Four instructions were added to the MIPS instruction set to control the extended KASUMI functional unit. Instruction formats for the four instructions are shown in 
The kxor3 instruction
It carries out the operation Rs ⊕ KRt, where Rs is an integer register and KRt is a register in the extended KASUMI register file. This instruction uses the integer EX and MEM pipeline stages and saves the result in the integer register addressed by Rd during the integer WB stage. Its mnemonic is kxor3 Rd, Rs, KRt.
The k2rnd instruction
It carries out the operations corresponding to a sequence of an odd round and an even round of the KASUMI block cipher. It does not need explicit operands; it uses the outputs of the forwarding logic and the key generation unit. A sequence of four k2rnd instructions performs the whole KASUMI algorithm. k2rnd is a multicycle instruction whose execution phase is made up of four cycles: K1, K2, K3 and K4. The next k2rnd instruction enters K1 only after the previous one has finished cycle K4. During the MEM stage this instruction issues the computed block to the extended register file so that it can be stored in registers 0 and 1. Since this operation is synchronous, the block is actually written when the instruction enters the WB stage. Its mnemonic is k2rnd.
Figure 10 illustrates the pipelined execution of the instructions making up the encryption process. The operands of the instructions are carefully chosen to show how the extended processor deals with special execution conditions. The first six kxor1 instructions load the plaintext block and the encryption key into the extended registers. The next four k2rnd instructions perform the encryption process using the operands stored by the previous instructions. Notice that the address of the target register in instruction 1, which is 0, equals the address of the first source register in instruction 2. For integer instructions this would cause a data hazard, and the value computed by instruction 1 in the EX stage would be bypassed to the ID stage of instruction 2 during the third clock cycle. However, for the instructions in figure 9 the bypassed value is ignored by the integer forwarding logic, since the target register of instruction 1 is an extended register, whereas the source register of instruction 2 is an integer register. This situation is called a false data hazard, and the processor handles it by appropriately setting a control signal. The same situation occurs during cycles 4 and 5.
A true data hazard occurs during the eighth cycle because the first k2rnd instruction needs to compute the two sets of round keys and, at this time, the encryption key K has not been completely stored.
Details about the execution of extended instructions
However, the forwarding logic in the KASUMI functional unit overcomes this problem. This module receives the bypassed data signals from the kxor1 instructions in the EX, MEM and WB stages of the integer pipeline and issues them to the key generation unit to produce the round keys needed in the next cycle. The KASUMI forwarding logic ignores any bypassed signal issued by an instruction different from kxor1 and kxor2. Note that the overlapped execution of a k2rnd instruction (10) and an integer instruction (11) 
Performance evaluation
The number of instructions needed to implement the KASUMI block cipher in software, using the standard MIPS32 instruction set, is much higher than the number of extended instructions needed when the proposed extension is used. cycles and assuming that the processor can operate at the same frequency as the extension unit, the proposed architecture can achieve a throughput of 385 Mbps.
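As a reading aid, a throughput figure of this kind follows the usual relation for a non-pipelined block cipher datapath. The symbols are introduced here for illustration and are not taken from the paper: B is the block size (64 bits for KASUMI), f_clk the clock frequency, and N_cycles the number of clock cycles spent per block.

```latex
T = \frac{B \cdot f_{\mathrm{clk}}}{N_{\mathrm{cycles}}}, \qquad B = 64\ \text{bits}
```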
Other architectures for the KASUMI algorithm have been reported. In [9], the authors report two architectures that implement logic for only one round, i.e. the FO and the FL function blocks. The first architecture, called Type 1, iterates over these two components eight times until completion of the process, feeding the design's output back to its input at the end of each iteration, sacrificing performance in the interests of achieving low hardware complexity. The Type 2 architecture contains a four-stage inner-round pipelined FO module that results in an increased operating frequency and an improved throughput, by a factor of four. The two-round architecture described in [10] takes advantage of both inner- and outer-round pipeline techniques to decrease the period of the clock and increase the throughput. Inner-round registers are negative edge-triggered, whereas outer-round registers are positive edge-triggered; consequently, the execution time of each round is one clock cycle. The pipelined design allows this circuit to process two blocks simultaneously, with an initial latency of eight cycles. The S-boxes in this architecture are implemented with combinational logic. The two architectures reported in [11] are similar to that described in [10]. The authors look to reduce the area required by implementing a two-round iterative architecture. An interesting fact about this design is that the S7 and S9 S-boxes are implemented as combinational logic and, alternatively, mapped to embedded memory blocks within the FPGA. There are registers at the end of each round, giving the architecture a total completion latency of 8 clock cycles when the S-boxes are implemented as combinational modules, and 40 cycles when the S-boxes are mapped to embedded memory blocks; this is due to the inner-round pipeline stages introduced by the registered outputs of the synchronous memory blocks. This two-round design is not intended to work in a pipelined fashion.
The structure of the KASUMI block cipher can be manipulated, by means of aggressive simplifications, to obtain inexpensive datapaths that carry out the ciphering process with long latencies. The work reported in [12] applies a simplification technique to design two KASUMI architectures with latencies of 56 and 32 cycles, respectively. A third architecture with a latency of 8 cycles is mentioned, and its results are provided, but the architecture is not fully described. A crypto processor consisting of a 32-bit RISC processor block and coprocessor blocks dedicated to the AES, KASUMI, SEED and triple-DES private-key algorithms and to the ECC and RSA public-key algorithms is described in [13]. The 32-bit RISC-type processor controls the dedicated crypto blocks and performs the interface operations with external devices such as memory and an I/O bus interface controller. The custom processing blocks are connected to the processor by a 64-bit bus. Table 6 shows a performance comparison of the proposed extension against other reported architectures. As the focus of this paper is to describe the KASUMI extension and the general approach by which it can be integrated into a RISC processor, the area data shown in the table includes only the area required by the extension itself, i.e. no other parts of the processor are considered. As seen from the table, the hardware resources required by the extension are similar to those required by the architectures that implement the KASUMI algorithm using a reuse or hybrid approach. Note that, apart from the pipelined architecture described in [10], the proposed extension achieves the highest throughput/area ratio due to the efficient reuse of the optimized two-round ciphering datapath.
Conclusions
This paper proposed a processor-based approach to the problem of efficiently implementing the KASUMI algorithm. The general approach consisted of three phases. First, the design of a high performance hardware module that performs two rounds of the KASUMI algorithm. Second, the addition of a functional unit to a RISC processor core intended to be used in embedded environments. Third, the extension of the instruction set of the processor to exploit the capabilities of the new hardware. Replacing a long sequence of arithmetic and logical instructions by dedicated hardware reduces code size by two orders of magnitude and, consequently, the number of clock cycles needed for completion of the ciphering process. The addition of a specialized hardware module for encryption avoids requesting that service from an external coprocessor. It is important to mention that the processor used was selected to validate the approach because of its simplicity; but this general approach can be used to add a similar extension unit to other RISC-like processors. 
