Abstract. In this paper, state-of-the-art hardware implementations of the MISTY1 block cipher are presented for area-constrained wireless applications. The proposed MISTY1 architectures are characterized by highly optimized transformation functions, i.e. FL and {FO-XOR-EKG}. The FL function re-utilizes logic AND-OR-XOR combinations, whereas the {FO-XOR-EKG} function explores 2 × compact design schemes for s-box implementation. A Combined Substitution Unit (CSU) and a threshold area implementation are proposed for the s-boxes based on Boolean reductions and Common Sub-expression Eliminations (CSEs). Besides, the {FO-XOR-EKG} function is designed for manifold operations of FO / FI functions, 32-bit XOR operation and ex-
Introduction
With the widespread usage of wireless devices and embedded systems, cryptographic hardware circuits are proving to be a critical component of modern System-on-Chips (SoCs), laying the foundations for network security. However, the provision of security features in communication networks comes at the cost of performance, either by increasing the area or by reducing the throughput. Considerable efforts are underway to optimize the hardware design / implementation of encryption algorithms for their respective applications. The MISTY1 block cipher, characterized by a repeated-loop structure, is highly suitable for area-constrained applications. Taking this into account, a study has been carried out on compact MISTY1 implementations for mobile applications, hand-held devices and RFIDs.
Developed by Mitsubishi Electric Corporation, MISTY1 is an ISO / IEC standardized 64-bit block cipher algorithm designed to process smaller blocks of data, e.g. an 8-byte PIN [1]. It has a proven probability parametric value of 2^-56 against "Linear and Differential Cryptanalysis" [2]. Many attacks have been proposed by researchers to break the MISTY1 block cipher. Although these attacks exposed several weaknesses of MISTY1, they could not compromise the full security of 8-round MISTY1 [3], [4]. Moreover, the time complexity and the large amount of plain-text data required to retrieve the secret key make it practically impossible to undermine the security of the MISTY1 block cipher. Therefore, MISTY1 is considered a secure algorithm and is currently being employed for online transactions and ATM networks.
A detailed study has been carried out on the existing hardware designs of MISTY1, KASUMI, AES, SHA-1, CAMELLIA and SAFER [5]-[22]. The review covered the comparison of performance parameters, i.e. area, speed and frequency. For high speed architectures, the commonly employed schemes include Look-Up Table (LUT) or Block RAM (BRAM) implementations of the s-boxes [10], [11], [19], [20]. In addition, the path delay is reduced by effective pipelining [11], [19]. As a result, a very high throughput of the order of 100 Gbps is obtained, which is employed for image encryption, Ethernet devices, sensor networks, etc. [19]. Though high speed architectures have a wide range of applications in sensor networks, the area they occupy makes the devices power-hungry. Thus, high speed designs are not suitable for portable devices and mobile equipment.
In comparison to high throughput encryption cores, compact designs make use of logic optimization techniques for the transformation functions and implement the s-boxes using combinational logic [5]-[10], [12]-[18], [21], [22]. Besides, re-utilization methodologies have also been implemented, exploiting the rolling feature of the architecture. The analysis carried out during this work indicates that the most area-efficient cryptographic hardware circuit comprises 1947 NAND gates for the AES algorithm [15]. Moreover, for low area MISTY1, a very compact hardware architecture consisting of 2331 NAND gates has been realized in [5]. Although this silicon area is already very small, we found that further architectural and logic optimizations can be carried out on the MISTY1 algorithm for very compact hardware implementations. The major contributions / salient features of our work are as under:
a. Optimization of S9 and S7 s-boxes.
i. Design and implementation of a combined substitution unit for S9 and S7 for compact MISTY1.
ii. Threshold area implementation of S9 / S7 s-boxes for very low area MISTY1.
The paper is organized as follows. The MISTY1 algorithm is briefly described in Sec. 2, followed by the S9 / S7 s-boxes optimization in Sec. 3. The optimized MISTY1 transformation functions are explained in Sec. 4. A detailed explanation of the proposed MISTY1 hardware architectures is presented in Sec. 5. Lastly, the ASIC results and the conclusion are summarized in Sec. 6 and 7 respectively.
64-bit MISTY1 Block Cipher
The MISTY1 algorithm is depicted in Fig. 1a and its constituent functions FO, FI and FL are shown in Figs. 1b-1d respectively. The algorithm transforms a 64-bit plain-text into a 64-bit cipher-text using a 128-bit secret key after n rounds of operation. The specifications described in [1] recommend n = 8 for the number of rounds. Moreover, for the n-round operation, the algorithm requires the generation of a 128-bit extended key by using its FI function. The odd rounds of MISTY1 consist of 2 × FL functions, the FO function and a 32-bit XOR, whereas the even rounds comprise only the FO function and a 32-bit XOR. Furthermore, the last round is an exception consisting of only 2 × FL functions. The outputs of the FL functions are finally concatenated to form the 64-bit cipher-text. The FO and FI functions have a Feistel-like structure, with FI consisting of the substitution functions S9 and S7. The substitution functions S9 and S7 map their respective 9-bit and 7-bit inputs to 9-bit and 7-bit outputs by logic operations or LUTs [1].
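The FI data flow and its use for extended key generation can be sketched in software as follows. This is a minimal behavioural sketch, not the hardware design of this paper: the FI structure and the rule "extended word i = FI(key word i, key word i+1)" follow the public MISTY1 specification, while `toy_s9` and `toy_s7` are hypothetical placeholder substitutions (the real S9 / S7 tables are omitted for brevity), so the values produced are not real MISTY1 outputs. The sketch makes visible why one FI evaluation needs {2 × S9 + S7}.

```python
from typing import Callable, List

SBox = Callable[[int], int]

# Hypothetical placeholder s-boxes (NOT the MISTY1 S9 / S7 tables).
toy_s9: SBox = lambda x: (5 * x + 3) % 512   # 9-bit in, 9-bit out
toy_s7: SBox = lambda x: (3 * x + 1) % 128   # 7-bit in, 7-bit out


def fi(fi_in: int, fi_key: int, s9: SBox, s7: SBox) -> int:
    """FI data flow: the 16-bit input is split into a 9-bit and a 7-bit half."""
    d9 = (fi_in >> 7) & 0x1FF
    d7 = fi_in & 0x7F
    d9 = s9(d9) ^ d7              # first S9, XOR with zero-extended 7-bit half
    d7 = (s7(d7) ^ d9) & 0x7F     # S7, XOR with the 9-bit half truncated to 7 bits
    d7 ^= (fi_key >> 9) & 0x7F    # upper 7 bits of the 16-bit FI key
    d9 ^= fi_key & 0x1FF          # lower 9 bits of the 16-bit FI key
    d9 = s9(d9) ^ d7              # second S9
    return (d7 << 9) | d9         # 16-bit output


def extended_key(sk_words: List[int], s9: SBox, s7: SBox) -> List[int]:
    """Eight 16-bit extended key words (128 bits) from eight 16-bit key words."""
    n = len(sk_words)
    return [fi(sk_words[i], sk_words[(i + 1) % n], s9, s7) for i in range(n)]


# Example 128-bit secret key split into eight 16-bit words (arbitrary values).
sk = [0x0011, 0x2233, 0x4455, 0x6677, 0x8899, 0xAABB, 0xCCDD, 0xEEFF]
ek = extended_key(sk, toy_s9, toy_s7)
print([f"{w:04x}" for w in ek])
```

Passing the s-boxes as parameters keeps the data flow separate from the substitution logic, which is the part the following sections optimize.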
S9 and S7 S-boxes Optimization

Scheme 1 -S9 / S7 Combined Substitution Unit (CSU)
In this scheme, a Combined Substitution Unit (CSU) is proposed for the S9 and S7 s-boxes to substitute 9-bit and 7-bit inputs with 9-bit and 7-bit outputs respectively on alternate clock cycles. The first step in the CSU design is the algebraic reduction of the S9 / S7 logic expressions. Accordingly, XOR gates are replaced by NOT gates (for both S9 and S7 logic expressions) and 3-input AND gates are reduced to maximize the use of 2-input AND gates (for S7 logic expressions only). Common Sub-expression Elimination (CSE) is then carried out on the combined S9 / S7 logic expressions, thus eliminating the redundant 2-input AND and AND-XOR sub-expressions. The reduced algebraic expressions of S9 and S7 are given in Tab. 1, whereas the common sub-expressions are shown in Tab. 2.
Tab. 1. Reduced algebraic expressions for S9 and S7.
Tab. 2. CSEs for S9 and S7.
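The effect of common sub-expression elimination on the AND-gate count can be illustrated with a small sketch. The three expressions below are hypothetical toy examples, not the S9 / S7 expressions of Tab. 1; each output is an XOR of terms, where a term is either a single input bit or a 2-input AND of two input bits.

```python
# Toy AND-XOR expressions (hypothetical, for illustration only).
# A term is a tuple of input-bit indices: (i,) is bit x_i, (i, j) is x_i AND x_j.
TOY_EXPRESSIONS = {
    "y0": [(0, 1), (2,), (3, 4)],
    "y1": [(0, 1), (3, 4), (5,)],
    "y2": [(3, 4), (1, 2), (0,)],
}


def and_gate_counts(expressions):
    """Return (AND gates without sharing, AND gates after sharing)."""
    all_ands = [t for terms in expressions.values() for t in terms if len(t) == 2]
    return len(all_ands), len(set(all_ands))


def evaluate_shared(expressions, x):
    """Evaluate all outputs, computing each distinct AND term only once."""
    cache = {}
    outputs = {}
    for name, terms in expressions.items():
        acc = 0
        for t in terms:
            if len(t) == 1:
                acc ^= x[t[0]]
            else:
                if t not in cache:          # shared sub-expression
                    cache[t] = x[t[0]] & x[t[1]]
                acc ^= cache[t]
        outputs[name] = acc
    return outputs, len(cache)


naive, shared = and_gate_counts(TOY_EXPRESSIONS)
print(f"AND gates: {naive} without sharing, {shared} with sharing")
```

In this toy case the six AND terms collapse to three distinct ones; the same bookkeeping applied to the combined S9 / S7 expressions is what the CSE step exploits.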
The AND gates of the reduced S9 and S7 logic expressions are shown by their respective bits for simplicity; however, the implementation is carried out by permuting the 9 bits to form 36 × combinations. The path delay of the CSU using a parallel AND-XOR hierarchy is expressed as (1), whereas the area reduction compared to the straight-forward s-boxes {2 × S9 + S7} is found to be 60.8%, as illustrated in Tab. 3.
Tab. 3 (column headings not recovered; the last two columns are taken to be area and area reduction in %, consistent with the text):
CSU              -     58    128   9   9    438   60.8
{2 × S9 + S7}    S9    89    101   -   -   1120   -
                 S7    104   77    -   -
                 S9    89
Scheme 2 -S9 / S7 Threshold Area Implementation
S9 / S7 threshold area implementation is depicted in Figs. 2 and 3 consisting of MUXes, AND, XORs and 1-bit high enabled registers. The proposed design scheme sets a threshold limit for area of S9 and S7 s-boxes to generate a throughput value > 4 Mbps.
The substituted bits for S9 and S7 are produced after 45 and 58 clock cycles respectively; these cycle counts are based on the maximum number of possible S9 and S7 logic sub-expressions. For instance, the S9 s-box has 36 × combinations for the AND gates and 9 × combinations for the respective 9 bits. The additional input bit 1 in the multiplexers is used to generate all the possible input values for the S9 and S7 substitution functions, whereas input bit 0 reproduces the register output for certain clock cycle operations of the {FO-XOR-EKG} function. The reset register values mentioned in Figs. 2 and 3 are based on the respective logic expressions of S9 / S7 and reduce the number of clock cycle operations. Table 4 summarizes the area reduction of 81% achieved by the proposed design as compared to the MISTY1 FI function with {2 × S9 + S7} s-boxes.
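The S9 figure of 45 clock cycles can be checked with a short enumeration. The sketch below assumes, as our reading of the text, that one candidate term (a 2-input AND of two distinct input bits, or a single input bit) is presented per clock cycle; the function name is hypothetical.

```python
from itertools import combinations


def s9_term_budget(n_bits: int = 9):
    """Count the candidate terms of a quadratic s-box on n_bits inputs."""
    and_pairs = list(combinations(range(n_bits), 2))  # unordered bit pairs
    single_bits = list(range(n_bits))                 # the input bits themselves
    return len(and_pairs), len(single_bits), len(and_pairs) + len(single_bits)


pairs, singles, total = s9_term_budget()
print(pairs, singles, total)
```

For 9 input bits this gives 36 AND combinations plus 9 single bits, i.e. the 45 cycles quoted for S9.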
Optimized MISTY1 Transformation Functions
FL Function Implementation
Figures 4 and 5 depict the FL functions generating a 32-bit output after 2 and 4 clock cycles respectively. The input to the proposed FL function is a 32-bit plain-text or the output of the {FO-XOR-EKG} function, and the outputs are saved in enabled registers. The design provides a reference for area reduction of the FL function and can also be configured for 8 and 16 clock cycle operation. The area reduction for the compact MISTY1 architectures is mainly due to the use of 1 × FL function; in addition, the area reduction with the proposed methodologies FL-1 and FL-2 is found to be 4% and 6.1% respectively as compared to the straight-forward FL function (ref. Fig. 1d). The area for the 8 / 16 clock cycle FL function is also mentioned in Tab. 5; since the difference in NAND gates w.r.t. the proposed FL-2 is insignificant, these variants are not implemented in this paper.
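For reference, the behaviour that any multi-cycle FL datapath has to reproduce can be written as a small software model. This is a sketch of the standard MISTY1 FL function (AND-XOR on the right half, then OR-XOR on the left half), not of the FL-1 / FL-2 circuits of Figs. 4 and 5; the parameter names `kl1` and `kl2` for the two 16-bit subkeys are our own.

```python
def fl(x: int, kl1: int, kl2: int) -> int:
    """Reference model of FL on a 32-bit word with two 16-bit subkeys."""
    left, right = (x >> 16) & 0xFFFF, x & 0xFFFF
    right ^= left & kl1        # AND-XOR step
    left ^= right | kl2        # OR-XOR step
    return (left << 16) | right


def fl_inv(y: int, kl1: int, kl2: int) -> int:
    """Inverse of fl: the same two steps applied in reverse order."""
    left, right = (y >> 16) & 0xFFFF, y & 0xFFFF
    left ^= right | kl2
    right ^= left & kl1
    return (left << 16) | right


word = 0x12345678
out = fl(word, 0xABCD, 0x0F0F)
print(f"{out:08x}", fl_inv(out, 0xABCD, 0x0F0F) == word)
```

Because FL uses only 16-bit AND, OR and XOR operations, a hardware version can trade clock cycles for area by processing narrower slices per cycle, which is the trade-off explored above.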
Novel Design of {FO-XOR-EKG} Function
The {FO-XOR-EKG} function is the core part of the proposed MISTY1 hardware architecture. The re-utilization methodology has been widely adopted for the optimum operation of the {FO-XOR-EKG} function. The idea behind the design of the proposed {FO-XOR-EKG} function is to perform the transformation operations, including FO / FI and the 32-bit XOR operation (appended with FO in rounds 1-8). Moreover, the design can generate the extended keys for onward use in the MISTY1 8-round operation. The accumulation of the above-mentioned functionalities in a single function reduces the circuit area considerably. The area reduction is complemented by the optimized implementation of the S9 and S7 s-boxes within the {FO-XOR-EKG} function. Thus, 2 × design schemes for the {FO-XOR-EKG} function, implementing the CSU based s-box and the S9 / S7 threshold area s-boxes, are shown in Figs. 6 
The two architectures primarily have the same design basis but differ in terms of clock cycle operations. In order to incorporate all the functionalities, 2 × 9-bit XORs and 2 × 7-bit XORs are appended to the optimized s-boxes, with the inputs fed by multiplexers and registers. In addition, a 16-bit secret key is added to the input multiplexer, and KO_i takes a variable value of either the 16-bit SK or 0s. Table 6 describes the algorithm / steps involved in the execution of the above-mentioned functions.
The EK generation and the FO function differ in the selection of the input text and the KO_i XOR. For EK generation, the input and KO_i are assigned as the 16-bit SK and 16-bit 0s respectively, whereas the input and KO_i for FO are FL L / FO O/P and SK respectively. Moreover, as compared to EK generation, the FO function has extended clock cycle operations carried out for 3 × FIs, 3 × XORs and the 32-bit XOR (refer to Fig. 1b for the FO function operations) [7], [10] is depicted in Tab. 7.
Area-Efficient MISTY1 Hardware Architectures
The proposed hardware architecture of the area-efficient 8-round MISTY1 algorithm is depicted in Fig. 8.
The input to the MISTY1 algorithm is a 64-bit plain-text (PT) and a 128-bit secret key (SK), and the output is a 64-bit cipher-text. A 128-bit extended key (EK) is generated by the {FO-XOR-EKG} function prior to the MISTY1 8-round operation and is saved in an external 128-bit extended key register for the subsequent round operations. The SK and EK are later used in conjunction for the MISTY1 round transformation operations. Generating the EK with the {FO-XOR-EKG} function directly reduces the circuit area, as it avoids the use of an independent key generation module, i.e. a separate FI function. However, it also reduces the throughput of the MISTY1 8-round operation, since the EKs have to be generated in advance, which requires multiple clock cycles. The speed, i.e. the throughput value (Mbps), of the proposed architectures can be calculated as: Throughput = Output (bits) / Clock Cycles (sec), where the denominator is the time in seconds taken by the clock cycles needed to produce the output.
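The throughput relation can be made concrete with a small helper. The sketch below expresses the time in the denominator as the number of clock cycles divided by the clock frequency; the cycle count and the frequency used in the example are hypothetical placeholders, not results of this paper.

```python
def throughput_mbps(output_bits: int, clock_cycles: int, clock_hz: float) -> float:
    """Throughput in Mbps for a block produced in `clock_cycles` cycles."""
    seconds = clock_cycles / clock_hz      # time taken by the clock cycles
    return output_bits / seconds / 1e6     # bits per second, scaled to Mbps


# Hypothetical example: one 64-bit block in 1000 cycles at 100 MHz.
example = throughput_mbps(output_bits=64, clock_cycles=1000, clock_hz=100e6)
print(f"{example:.2f} Mbps")
```

The helper makes the trade-off explicit: at a fixed clock frequency, every additional clock cycle spent on extended key generation or serialized s-box evaluation lowers the throughput proportionally.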
Hardware Implementation of Proposed MISTY1 Architectures
Hardware implementation of the proposed MISTY1 architectures is performed on an ASIC platform using a 180 nm, 1.8 V standard cell library with Synopsys Design Compiler, and is optimized for area. A comprehensive analysis was carried out to obtain the moderate speed MISTY1 architecture 1 by integrating FO-1 with FL-1, whereas FO-2 is configured with FL-2, resulting in the threshold area MISTY1 architecture 2. Table 8 
