Abstract-With increasing usage of hardware accelerators in modern heterogeneous System-on-Chips (SoCs), the distinction between hardware and software is no longer rigid. The domain of cryptography is no exception, and efficient hardware designs of so-called software ciphers are becoming increasingly popular. In this paper, for the first time, we propose an efficient hardware accelerator design for SOSEMANUK, one of the finalists of the eSTREAM stream cipher competition in the software category. Since SOSEMANUK combines the design principles of the block cipher Serpent and the stream cipher SNOW 2.0, we make our design flexible enough to support independent execution of Serpent and SNOW 2.0. In the process, we identify interesting design points and explore different levels of optimization. We perform a detailed experimental evaluation of the performance figures of each design point. The best throughput achieved by the combined design is 67.84 Gbps for SOSEMANUK, 33.92 Gbps for SNOW 2.0 and 2.12 Gbps for Serpent. Our design outperforms all existing hardware (as well as software) designs of Serpent, SNOW 2.0 and SOSEMANUK, along with those of all other eSTREAM candidates.

INTRODUCTION
The eSTREAM [17] competition aimed at identifying modern stream ciphers in two separate profiles, one for software and the other for hardware platforms. Out of 34 initial submissions, four software stream ciphers, namely, HC-128, Rabbit, Salsa20/12 and SOSEMANUK, and three hardware stream ciphers, namely, Grain v1, MICKEY 2.0 and Trivium, made it into the final portfolio.
With advancement of technology, the difference between hardware and software stream ciphers is becoming blurred day by day. To satisfy the shrinking energy budgets, dedicated accelerators and customized instruction-sets are also commonly found in modern processors [3] and heterogeneous multiprocessor System-on-Chips (SoCs). Along the same direction, recent years have witnessed several attempts in hardware accelerator designs of software ciphers [8] , [23] , [25] , [30] , [36] , [37] , [39] , [42] , [47] , [49] .
In the call for the AES competition [1], one of the requirements was that the cipher should be implementable in both hardware and software. After Rijndael [14] won the competition in 2001, the first few years were dominated by software implementations. Subsequently, however, many hardware designs have been attempted, and Intel now provides a dedicated AES instruction set in its x86 series of processors [3].
The story of the eSTREAM competition [17] is, however, different. It created two separate profiles for software and hardware. Some of the initial submissions, such as Rabbit and Salsa20/12, were made for both profiles. During later rounds the categorization was made exclusive and both Rabbit and Salsa20/12 were moved to the software category.
Motivation for SOSEMANUK Hardware Design
One of the primary reasons why SOSEMANUK hardware has not been attempted so far is that, from submission to final selection, SOSEMANUK [11] remained in the software category throughout. We find this categorization artificial, since software implementations often rely on efficient custom hardware or an accelerator that is tightly coupled with the general-purpose processor. In fact, in [21], the hardware performance of selected eSTREAM candidates was analyzed and the following interesting conclusion was drawn about SOSEMANUK.
With regard to SOSEMANUK; the utility as a hardware cipher is clear thus in our opinion requires adding to the hardware focus profile:
However, it is surprising that no hardware design was attempted for SOSEMANUK after [21], which remains the only hardware benchmark for this cipher so far. This is despite the fact that there is no practical attack on SOSEMANUK and the cipher retains its claimed 128-bit security. From a purely technical point of view, there are three hurdles against an efficient hardware implementation of SOSEMANUK, described below.
i) Identification of the stage distribution of the combinational path, so that maximum efficiency can be reached. This is discussed in detail in Section 3.
ii) The combination of unrolling and pipelining leads to a highly complex design space, for which it is hard to estimate the efficiency from theoretical analysis. We enumerated as many design points as possible and implemented each of them for detailed analysis. This is discussed in detail in Section 3.
iii) A difficult design decision is to create a flexible instruction set that supports the SOSEMANUK S-box and linear transformation (LT) in a very compact manner. We applied several compaction and overlapping techniques to fit the operations within a 32-bit instruction width and still provide great flexibility. This is discussed in detail in Section 4.
There exist hardware designs for the other eSTREAM software finalists, e.g., for HC-128 [8], Rabbit [42] and Salsa20/12 [23], [49]. In this paper, we complete the picture by proposing an efficient hardware design for SOSEMANUK. We design a flexible accelerator for SOSEMANUK with additional modes for the Serpent [10] block cipher and the SNOW 2.0 stream cipher [16], whose design principles are used to construct SOSEMANUK.
Motivation for Unified Architecture for Three Algorithms
The idea of a unified hardware architecture for cryptographic algorithms is not new. There are several interesting works that combine AES with other algorithms in a single design [5], [7], [35]. Due to stringent area constraints in embedded systems as well as increasing manufacturing costs, re-usable designs and flexible IPs are continuously sought after [33]. However, increasing flexibility comes at the cost of reduced efficiency, in terms of energy and runtime. Hence, designing such a unified architecture is challenging. To do this properly, one has to proceed either from an algorithmic kernel perspective [4], [40] or from a purely practical perspective. In this case, our motivation is to offer a practical solution that can support multiple cryptographic functionalities in one IP. Such combinations are commonly found in communication protocols. For example, the 4G LTE standard mandates the use of a block cipher, a stream cipher and authentication. It is interesting to note that, by moving deeper into the quest for a common algorithm/IP to address diverse security requirements, one ends up with fundamental constructions, e.g., SPONGE [6].
The origin of the name SOSEMANUK is explained in [11, Section 1]. Literally, it means snow-snake, which is appropriate since it combines Serpent (which literally means snake) and SNOW 2.0. Though the word snow does not imply any kind of snake (except possibly a snake-shaped object made out of snow), we note that the names of all three ciphers begin with the letter "S", which is itself serpentine in shape! Hence we take the liberty to refer to the three ciphers as three snakes in the title. The design is generally referred to as TripleS. A more specific notation to identify different design points is introduced later.
Our Contributions
We list our contributions as follows.
1) We propose a novel and efficient hardware design for SOSEMANUK. It is the first of its kind since, other than [21], no other hardware design of SOSEMANUK has been attempted.
2) For the first time, we present a flexible accelerator that combines a block cipher and two stream ciphers.
3) We identify 12 incremental design points in the design process and report optimizations and evaluations of each of them.
4) Our design outperforms all existing hardware (as well as software) designs of Serpent, SNOW 2.0 and SOSEMANUK, along with those of all other eSTREAM candidates.
5) We propose a tweak to prevent the differential fault attacks [29], [34] on SOSEMANUK with negligible increase in area and no compromise on throughput.
6) Duplicating hardware components to perform parallel data stream processing for throughput maximization is done in some of the existing cryptographic hardware designs [3]. We do not employ such tricks and, apart from absolute throughput and area, we also report throughput per area as one of the figures of merit.
2 BRIEF DESCRIPTION OF SERPENT, SNOW 2.0 AND SOSEMANUK
SOSEMANUK [11] combines the design philosophies of the block cipher Serpent [10] and the stream cipher SNOW 2.0 [16]. Below we mention the salient design features of each of the three ciphers.
Description of Serpent
Serpent was a candidate for the AES competition. It is a 32-round Substitution-Permutation (SP) network operating on four 32-bit words. It encrypts a 128-bit plaintext $P$ to a 128-bit ciphertext $C$ in 32 rounds under 33 128-bit subkeys $\hat{K}_0, \ldots, \hat{K}_{32}$. The cipher supports three different key lengths, namely 128-bit, 192-bit or 256-bit. Keys with fewer than 256 bits are expanded into full 256-bit keys by appending one "1" bit to the MSB end, followed by as many "0" bits as required. Serpent uses eight 4-bit to 4-bit S-boxes $S_0, \ldots, S_7$. The cipher can be formally described as
$$\hat{B}_0 = IP(P), \qquad \hat{B}_{i+1} = R_i(\hat{B}_i), \qquad C = FP(\hat{B}_{32}),$$
where $IP$ and $FP$ are the initial and the final permutations respectively over the 128 bit-positions and the round function $R_i$ is defined as
$$R_i(X) = \begin{cases} L(\hat{S}_i(X \oplus \hat{K}_i)), & i = 0, \ldots, 30, \\ \hat{S}_i(X \oplus \hat{K}_i) \oplus \hat{K}_{32}, & i = 31. \end{cases}$$
Here $L$ is a linear transformation and $\hat{S}_i$ is the application of the S-box $S_{i \bmod 8}$ 32 times in parallel.
In [10], the eight Serpent S-boxes act on 4-bit words and are defined as permutations of $\mathbb{Z}_{16}$. In each round, 32 copies of one S-box are applied in parallel to map a 128-bit input to a 128-bit output.
When the S-boxes are applied in bitslice mode, each of them acts as a 128-bit to 128-bit S-box and the initial and final permutation steps are no longer required. In bitslice mode, each S-box takes four 32-bit words as inputs and produces four 32-bit words as outputs. The notations ^, &, | and ~ mean bitwise XOR, AND, OR and NOT operations respectively. The notation a <op>= b means a = a <op> b. Each S-box has a bitsliced version. For example, the bitsliced version of $S_0$ is given below.

S-box S0(r0, r1, r2, r3, r4) {
    r3 ^= r0; r4 = r1; r1 &= r3; r4 ^= r2; r1 ^= r0; r0 |= r3;
    r0 ^= r4; r4 ^= r3; r3 ^= r2; r2 |= r1; r2 ^= r4; r4 = ~r4;
    r4 |= r1; r1 ^= r3; r1 ^= r4; r3 |= r0; r1 ^= r3; r4 ^= r3;
}

In each case, r0, r1, r2, r3 act as the input words and r4 acts as an auxiliary variable. The output indices are discussed in Section 4 (they are listed in the second column of Table 1).
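The correspondence between the bitsliced listing and the table form of $S_0$ can be checked with a few lines of Python. This is only a sketch: the lookup table `S0_TABLE` is taken from the Serpent specification [10] as an assumption, the helper names are ours, and the output order (r1, r4, r2, r0) follows the discussion of output indices in Section 4.

```python
MASK = 0xFFFFFFFF  # registers are 32-bit words

# S0 as a permutation of Z_16 (assumed from the Serpent specification [10]).
S0_TABLE = [3, 8, 15, 1, 10, 6, 5, 11, 14, 13, 4, 2, 7, 0, 9, 12]


def s0_bitsliced(r0, r1, r2, r3):
    """Bitsliced S0, mirroring the listing statement by statement.

    r4 is the auxiliary variable; outputs are returned in the
    order (r1, r4, r2, r0).
    """
    r3 ^= r0; r4 = r1; r1 &= r3; r4 ^= r2; r1 ^= r0; r0 |= r3
    r0 ^= r4; r4 ^= r3; r3 ^= r2; r2 |= r1; r2 ^= r4; r4 = ~r4 & MASK
    r4 |= r1; r1 ^= r3; r1 ^= r4; r3 |= r0; r1 ^= r3; r4 ^= r3
    return r1, r4, r2, r0


def s0_via_bitslice(x):
    """Evaluate S0 on one 4-bit value: bit i of x goes to bit 0 of word i."""
    words = [(x >> i) & 1 for i in range(4)]
    out = s0_bitsliced(*words)
    return sum((w & 1) << i for i, w in enumerate(out))


recovered = [s0_via_bitslice(x) for x in range(16)]
```

Here each 32-bit word carries a single useful bit, so one call processes one 4-bit input; in the real bitsliced mode the same 18 operations process 32 such inputs at once.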
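The bitsliced implementation LT of the linear transformation $L$ is given below.

LT(x0, x1, x2, x3) {
    x0 = ROTL(x0, 13); x2 = ROTL(x2, 3);
    x1 = x1 ^ x0 ^ x2; x3 = x3 ^ x2 ^ T32(x0 << 3);
    x1 = ROTL(x1, 1); x3 = ROTL(x3, 7);
    x0 = x0 ^ x1 ^ x3; x2 = x2 ^ x3 ^ T32(x1 << 7);
    x0 = ROTL(x0, 5); x2 = ROTL(x2, 22);
}

The 256-bit effective key (after necessary padding) is written as eight 32-bit words $w_{-8}, \ldots, w_{-1}$, which are then expanded into an intermediate key $w_0, \ldots, w_{131}$ by the following recurrence:
$$w_i = (w_{i-8} \oplus w_{i-5} \oplus w_{i-3} \oplus w_{i-1} \oplus \phi \oplus i) \lll 11$$
for $i = 0, \ldots, 131$, where $\phi$ is the fractional part of the golden ratio $(\sqrt{5} + 1)/2$, or 0x9e3779b9 in hexadecimal. Now the S-boxes are used to transform the prekeys $w_i$ into words $k_i$ of the round keys as follows:
$$\{k_0, k_1, k_2, k_3\} = S_3(w_0, w_1, w_2, w_3),$$
$$\{k_4, k_5, k_6, k_7\} = S_2(w_4, w_5, w_6, w_7),$$
$$\{k_8, k_9, k_{10}, k_{11}\} = S_1(w_8, w_9, w_{10}, w_{11}),$$
$$\{k_{12}, k_{13}, k_{14}, k_{15}\} = S_0(w_{12}, w_{13}, w_{14}, w_{15}),$$
$$\{k_{16}, k_{17}, k_{18}, k_{19}\} = S_7(w_{16}, w_{17}, w_{18}, w_{19}),$$
$$\vdots$$
$$\{k_{124}, k_{125}, k_{126}, k_{127}\} = S_4(w_{124}, w_{125}, w_{126}, w_{127}),$$
$$\{k_{128}, k_{129}, k_{130}, k_{131}\} = S_3(w_{128}, w_{129}, w_{130}, w_{131}).$$
Now, the $i$th subkey is formed as
$$K_i = \{k_{4i}, k_{4i+1}, k_{4i+2}, k_{4i+3}\}.$$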
The above assumes bitsliced implementation of the S-boxes. Otherwise IP needs to be applied to the round keys to place the key bits in proper position.
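The linear transformation and the prekey expansion of this section can be sketched in Python as follows. The prekey recurrence is taken from the Serpent specification [10] and should be treated as an assumption here; the function names `rotl`, `lt` and `expand_prekeys` are ours and do not correspond to any part of the accelerator.

```python
MASK = 0xFFFFFFFF   # T32: truncate to 32 bits
PHI = 0x9E3779B9    # fractional part of the golden ratio


def rotl(x, n):
    """Rotate a 32-bit word left by n positions (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def lt(x0, x1, x2, x3):
    """Bitsliced linear transformation LT on four 32-bit words."""
    x0 = rotl(x0, 13)
    x2 = rotl(x2, 3)
    x1 = x1 ^ x0 ^ x2
    x3 = x3 ^ x2 ^ ((x0 << 3) & MASK)
    x1 = rotl(x1, 1)
    x3 = rotl(x3, 7)
    x0 = x0 ^ x1 ^ x3
    x2 = x2 ^ x3 ^ ((x1 << 7) & MASK)
    x0 = rotl(x0, 5)
    x2 = rotl(x2, 22)
    return x0, x1, x2, x3


def expand_prekeys(key_words):
    """Expand w_{-8}..w_{-1} into the 132 prekeys w_0..w_131.

    List position j holds w_{j-8}, so w_{i-8}, w_{i-5}, w_{i-3}, w_{i-1}
    sit at positions i, i+3, i+5 and i+7.
    """
    w = list(key_words)
    for i in range(132):
        t = w[i] ^ w[i + 3] ^ w[i + 5] ^ w[i + 7] ^ PHI ^ i
        w.append(rotl(t, 11))
    return w[8:]
```

Since LT consists only of rotations, shifts and XORs, it is linear over GF(2), which is why hardware can realize the rotations as fixed wiring.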
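2.2 Description of SNOW 2.0
SNOW 2.0 [16] uses an LFSR of length 16 (each entry is a 32-bit word) with feedback polynomial
$$\pi(x) = \alpha x^{16} + x^{14} + \alpha^{-1} x^{5} + 1 \in \mathbb{F}_{2^{32}}[x],$$
where $\alpha$ is a root of
$$x^4 + \beta^{23} x^3 + \beta^{245} x^2 + \beta^{48} x + \beta^{239} \in \mathbb{F}_{2^8}[x]$$
and $\beta$ is a root of $x^8 + x^7 + x^5 + x^3 + 1 \in \mathbb{F}_2[x]$. Let $(s_{t+15}, s_{t+14}, \ldots, s_t)$ denote the state of the LFSR at time $t \geq 0$. There is a finite state machine (FSM) with two registers $R1$, $R2$ and an S-box $S$. The output of the FSM and the keystream word generated are respectively given by
$$F_t = (s_{t+15} \boxplus R1_t) \oplus R2_t, \qquad z_t = F_t \oplus s_t.$$
For $t \geq 0$, the registers $R1$ and $R2$ are updated as
$$R1_{t+1} = s_{t+5} \boxplus R2_t, \qquad R2_{t+1} = S(R1_t).$$
Note that $\boxplus$ means addition modulo $2^{32}$. According to the SNOW 2.0 specification [16], the cipher supports a secret key $K$ of either 128 or 256 bits and a 128-bit initialization vector $IV = (IV_3, IV_2, IV_1, IV_0)$. The 128-bit key is denoted by $(k_3, \ldots, k_0)$ and the 256-bit key is denoted by $(k_7, \ldots, k_0)$. For the 128-bit case, the LFSR is loaded as follows:
$$s_{15} = k_3 \oplus IV_0, \quad s_{14} = k_2, \quad s_{13} = k_1, \quad s_{12} = k_0 \oplus IV_1,$$
$$s_{11} = k_3 \oplus \mathbf{1}, \quad s_{10} = k_2 \oplus \mathbf{1} \oplus IV_2, \quad s_9 = k_1 \oplus \mathbf{1} \oplus IV_3, \quad s_8 = k_0 \oplus \mathbf{1},$$
$$s_7 = k_3, \quad s_6 = k_2, \quad s_5 = k_1, \quad s_4 = k_0,$$
$$s_3 = k_3 \oplus \mathbf{1}, \quad s_2 = k_2 \oplus \mathbf{1}, \quad s_1 = k_1 \oplus \mathbf{1}, \quad s_0 = k_0 \oplus \mathbf{1},$$
where $\mathbf{1}$ denotes the all-ones 32-bit word. For the 256-bit case, it is loaded as
$$s_{15} = k_7 \oplus IV_0, \quad s_{14} = k_6, \quad s_{13} = k_5, \quad s_{12} = k_4 \oplus IV_1,$$
$$s_{11} = k_3, \quad s_{10} = k_2 \oplus IV_2, \quad s_9 = k_1 \oplus IV_3, \quad s_8 = k_0,$$
and $s_i = k_i \oplus \mathbf{1}$ for $i = 0, \ldots, 7$. Next, the LFSR is clocked 32 times without producing any output and the new element to be inserted is given by
$$s_{t+16} = \alpha^{-1} s_{t+11} \oplus s_{t+2} \oplus \alpha s_t \oplus F_t.$$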
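2. Serpent1 is just one round of Serpent with the S-box $S_2$, but without the key addition and the linear transformation. The LFSR used is defined over the same finite field as in SNOW 2.0, but is of length 10 instead of 16. The new value is computed as
$$s_{t+10} = s_{t+9} \oplus \alpha^{-1} s_{t+3} \oplus \alpha s_t.$$
The FSM uses two 32-bit registers $R1$, $R2$ as in SNOW 2.0, but instead of an S-box connecting them, it has a transformation Trans connecting them. The update of the FSM for $t \geq 1$ and the output $f_t$ are given below,
$$R1_t = (R2_{t-1} + \mathrm{mux}(\mathrm{lsb}(R1_{t-1}), s_{t+1}, s_{t+1} \oplus s_{t+8})) \bmod 2^{32},$$
$$R2_t = \mathrm{Trans}(R1_{t-1}),$$
$$f_t = ((s_{t+9} + R1_t) \bmod 2^{32}) \oplus R2_t,$$
where $\mathrm{lsb}(x)$ is the least significant bit of the word $x$ and $\mathrm{mux}(c, x, y)$ selects $x$ if $c = 0$, or $y$ if $c = 1$, and
$$\mathrm{Trans}(z) = (M \times z \bmod 2^{32}) \lll 7, \qquad M = \texttt{0x54655307}.$$
The outputs of the FSM are grouped by four and then the output keystream words are generated as
$$(z_{t+3}, z_{t+2}, z_{t+1}, z_t) = \mathrm{Serpent1}(f_{t+3}, f_{t+2}, f_{t+1}, f_t) \oplus (s_{t+3}, s_{t+2}, s_{t+1}, s_t).$$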
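One step of the SOSEMANUK FSM can be sketched in Python as below. The update equations and the constant `M` are taken from the SOSEMANUK specification [11] and are assumptions as far as this text is concerned; the function names and the argument order of `fsm_step` are ours. The LFSR feedback (multiplication by $\alpha$ and $\alpha^{-1}$) is deliberately left out.

```python
MASK = 0xFFFFFFFF
M = 0x54655307  # multiplier of Trans (assumed from the SOSEMANUK specification [11])


def rotl(x, n):
    """Rotate a 32-bit word left by n positions (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK


def trans(z):
    """Trans(z) = (M * z mod 2^32) <<< 7."""
    return rotl((M * z) & MASK, 7)


def mux(c, x, y):
    """Select x if c == 0, y if c == 1."""
    return x if c == 0 else y


def fsm_step(r1_prev, r2_prev, s_t1, s_t8, s_t9):
    """One FSM update: returns (R1_t, R2_t, f_t).

    s_t1, s_t8 and s_t9 are the LFSR words s_{t+1}, s_{t+8}, s_{t+9}.
    """
    r1 = (r2_prev + mux(r1_prev & 1, s_t1, s_t1 ^ s_t8)) & MASK
    r2 = trans(r1_prev)
    f = ((s_t9 + r1) & MASK) ^ r2
    return r1, r2, f
```

Both register updates depend only on values from time $t-1$, which is what allows two consecutive FSM steps to be evaluated in one clock cycle in the unrolled design points.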
The key setup corresponds to the key setup of Serpent24, that produces 25 128-bit subkeys. The 128-bit IV is used as input to Serpent24 block cipher and the outputs ðY 
DESIGN SPACE EXPLORATION
In order to support flexibility of operation across and within a cipher, our proposed design is weakly programmable via custom assembly instructions. The design structure is as shown in Fig. 1 . By loading the program memory with the assembly instructions and setting up the I/O as shown in the figure, the design can be plugged in easily in a System-on-Chip (SoC) environment. Note that, the number of ports and port-width for output keystream and the data memory bank vary from design to design.
We started with the design and related optimizations in an incremental fashion and the process led to 12 different design points. For ease of discussion, let us introduce a few notations. Let $S_e$, $S_n$ and $S_o$ denote the implementations of the individual ciphers Serpent, SNOW 2.0 and SOSEMANUK respectively. Analogously, $S_{en}$ means a combined implementation of both Serpent and SNOW 2.0 and $S_{eno}$ means a combined implementation of all three ciphers. We use a second subscript $u$ preceded by a comma to denote a version of the same cipher with the LFSR unrolled (we explain shortly what unrolled means). We use a superscript $(n)$ to denote that there are $n$ pipeline stages in the design. For example, $S^{(2)}_{eo,u}$ means a two-stage implementation of SOSEMANUK and Serpent together with the LFSR unrolled.
From a preliminary RTL analysis, the critical paths for the S-box and LT are identified as shown in Fig. 2 (dotted lines). These could be further split into two pipeline stages. For Serpent, this decision is actually counter-productive since the throughput degrades from one round per cycle to one round per two cycles. On the other hand, the throughput (in terms of bits per cycle) of both SNOW 2.0 and SOSEMANUK remains the same. This boosts the throughput of SNOW 2.0 and SOSEMANUK, as a higher clock frequency can be achieved due to a shorter critical path.
The LFSR evolution of SOSEMANUK and SNOW 2.0 is always implemented in two pipeline stages, whereas Serpent (in $S_e$, $S_{en}$, $S_{eo}$ or $S_{eno}$) is implemented in either two or three stages. In Fig. 3, we show how the Serpent components were divided across the different pipeline stages for the two-stage implementation. In case of the three-stage implementation, the linear transformation is done in the third pipeline stage and the loading of input operands for LT is done in the second stage.
In Fig. 3, we show how the LFSR components were divided across the different pipeline stages. The rationale for generating the addresses in the first pipeline stage is twofold. First, it accommodates the one-cycle read latency of the storage. Second, the addresses for the next SNOW 2.0 iteration are already available in the LFSR and therefore the pipeline can operate at maximum throughput. Splitting the second pipeline stage into further stages would either cause a decrease in the throughput or require a complex bypass logic leading to the same critical path. We started with a basic design of three-stage Serpent, which we call $S^{(3)}_{e,\mathrm{basic}}$. From both the timing and the area perspective, several optimizations were applied to this design point, leading to an optimized version $S^{(3)}_e$. The optimizations are explained in Section 5. These optimizations were retained in the subsequent evolution of the design points. From S
INSTRUCTION SET DESIGN
For the programmability of the architecture, one could opt for a configurable input, where only three operational modes are specified to run SOSEMANUK, Serpent or SNOW 2.0. However, such a design would not allow for any algorithmic flexibility. We intended to design an ISA that would let users execute the three main modes as well as variants of these ciphers. For a typical bus interface, the instruction word needs to be nibble/byte/word-oriented. As we explored the flexibility of specifying the indices in Serpent rounds, the most compact opcode required 32 bits.
Serpent Rounds
For Serpent key scheduling, eight 32-bit words, namely, $w_0, \ldots, w_7$, are operated on with the same transformation, however, with different variable ordering. To keep the operator generic, the variable ordering is encoded in the instruction. For four different variable orderings, four different instructions are designed. Each bitsliced implementation of a Serpent S-box is triggered via one specific instruction. This requires a total of 4 + 8, i.e., 12 instructions.
Each of the Serpent bitsliced S-boxes takes five inputs, of which the first four contain four 32-bit words to process and the fifth one serves as an auxiliary variable. We use five variables, denoted by $r_i$, $i = 0, \ldots, 4$, for the S-box and the linear transformation. If the inputs to the S-box are in $r_0, r_1, r_2, r_3$, then $r_4$ is the default auxiliary variable and, as per the S-box definitions, the output indices are given in the second column of Table 1. These S-box outputs go directly as inputs to LT, which produces the outputs in the same locations. However, the outputs of LT need to be fed as input to the next S-box in the next round. Thus, after the first round, $S_0$ puts the outputs in $r_1, r_4, r_2, r_0$, which also remain the outputs of LT. In the second round, $S_1$ takes inputs from $r_1, r_4, r_2, r_0$ and produces outputs in $r_2, r_1, r_0, r_4$ (as per row $S_1$, column 4 in the table) instead of $r_2, r_0, r_3, r_1$ (row $S_1$, column 2 in the table). This continues and the indices for the first eight rounds are shown in the third and fourth column of Table 1.
In software, the S-box and LT are implemented typically as functions or macro and therefore in any Serpent round the S-box output indices and the LT input indices can remain the same. On the other hand, the default hardware implementation is to use signals [21] for passing the data between round key access, S-box computation and linear transformation. Unlike [21] , we created a software-controlled permutation network, resulting in a mux-based implementation. This provides for additional flexibility in controlling the mapping without any noticeable throughput degradation. The inputs and outputs of the permutation networks are shown in the fourth and the fifth column of Table 1 .
The mapping of the permutation network can be explained by an example as follows. Consider the fifth Serpent round, i.e., the row corresponding to $S_4$ in the table. The indices for the S-box input are 4, 1, 3, 2 and those for the S-box output are 1, 0, 4, 2. If one creates a list [4, 1, 3, 2, 0] of the input $r_i$ indices, where position four corresponds to $r_0$, then the positions of the output indices 1, 0, 4, 2 in this list are given by 1, 4, 0, 3 respectively. As shown in the table, this is precisely the output of the permutation network, which serves as the input to the next LT.
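The index bookkeeping described above can be reproduced with two small Python helpers. This is an illustrative sketch only: `rename_outputs` and `network_positions` are hypothetical names, and the only data used are the index tuples quoted in the text for $S_1$ and $S_4$.

```python
def _with_aux(inputs):
    """Append the one register index in 0..4 that is not an S-box input."""
    aux = (set(range(5)) - set(inputs)).pop()
    return list(inputs) + [aux]


def rename_outputs(inputs, default_outputs):
    """Actual output registers of an S-box whose inputs sit in `inputs`.

    `default_outputs` are the output indices valid when the inputs are
    in r0, r1, r2, r3 and r4 is the auxiliary variable.
    """
    full = _with_aux(inputs)
    return [full[d] for d in default_outputs]


def network_positions(inputs, outputs):
    """Positions of the S-box output indices in the list of input indices
    (auxiliary index last), i.e. the permutation network output."""
    full = _with_aux(inputs)
    return [full.index(o) for o in outputs]


# Second round: S1 reads r1, r4, r2, r0; its default outputs are r2, r0, r3, r1.
round2_outputs = rename_outputs([1, 4, 2, 0], [2, 0, 3, 1])

# Fifth round: S4 reads r4, r1, r3, r2 and writes r1, r0, r4, r2.
round5_network = network_positions([4, 1, 3, 2], [1, 0, 4, 2])
```

The two helpers are inverse views of the same renaming: one maps default indices to physical registers, the other maps physical registers back to positions.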
In the accelerator design exploration, this assembly control of permutation network indices allowed us to efficiently implement round key access, S-box implementation and linear transformation. The permutation network is decoupled from the combinatorial logic, which could be conveniently moved between pipeline stages for best timing results. A subtle benefit of this scheme is the possibility to accommodate different permutation network mapping for different algorithm variants.
For each of these Serpent round functions, at least four indices are required, where the fifth index can be computed from them. This requires a total of 12 3-bit indices, i.e., 36 bits. To restrict the instruction bit-width to 32, an instruction is issued before the first round specifying the input indices for the round key function. The input indices of the linear transformation of round $n$ act as the input indices of the S-box of round $n + 1$.
For every round, the same instruction with different index parameters is called. There is a special instruction for the final round, which skips the linear transformation. Therefore, total three different instructions for initialization, Serpent round and final round are needed.
The instruction set is flexible for diverse indexing options in the Serpent rounds as well as different order of S-box accesses during key scheduling. Naturally, the increase or decrease of Serpent rounds is also possible.
SNOW 2.0 Operations
The instruction set for SNOW 2.0 contains only three instructions namely, load key, initialization and keystream generation. The key, IV and keylength are loaded via input pins. This is followed by 32 rounds of initialization. Finally the keystream generation instruction is issued. Naturally, the datapath for initialization and keystream generation is shared.
The LFSR feedback polynomial is hardwired in the microarchitecture for maximizing the performance.
Additional SOSEMANUK Operations
The key scheduling and round function instructions from Serpent can be completely reused for SOSEMANUK. Additionally, SOSEMANUK initialization requires one instruction for loading the LFSR. This instruction requires two different sets of parameters. The first set specifies the indices of the input data and the second set specifies the LFSR indices. Since we store the output indices of every Serpent round, the first set of parameters is already available in the microarchitecture. Therefore, only the LFSR indices need to be specified. The values are stored into R1 and R2 when the LFSR indices are specified as 10 and 11 respectively. Note that SNOW 2.0 uses a larger LFSR compared to SOSEMANUK, leaving a few LFSR positions redundant.
For the encryption operation, two specific instructions, for keystream generation and for the Serpent round call, are designed. The keystream generation for SOSEMANUK uses a different transformation compared to SNOW 2.0, though the rest of the datapath is shared.
All the instructions are 32 bit wide, of which 2 bits are used to distinguish between different mode of the application. Currently, three different modes, i.e., Serpent, SNOW 2.0 and SOSEMANUK are supported. Depending on the mode, slightly different behavior for LFSR shifting, register 
MICROARCHITECTURE DESIGN AND OPTIMIZATIONS
We describe the different design choices and optimizations of the microarchitecture in the following sections.
Storage
We employed diverse types of storage for TripleS. In the following, the term register refers to Standard Cell Memories (SCM). For look-up tables and S-boxes, a suitable Memory Macro (MM) is selected by using a commercial memory compiler. There are three specific requirements for storage among Serpent, SNOW 2.0 and SOSEMANUK. For SNOW 2.0 and SOSEMANUK, the $\alpha$ and $\alpha^{-1}$ values are precomputed and stored in 256-entry, 32-bit wide look-up tables. For the initial implementation of SOSEMANUK, one read port is sufficient for both tables. For the unrolled version, two read ports are required for each.
For Serpent round key, 132 entry 32-bit wide storage with both read and write operations is required. Since each Serpent round requires four accesses to the storage, it is divided into two separate memories storing even and odd-indexed locations. For this purpose, a suitable dual-port memory macro was selected by using Faraday Memory Compiler [19] . SNOW 2.0 requires an S-box implementation with 32-bit input and 32-bit output. This S-box can be decomposed into the 8-bit input, 8-bit output Rijndael S-box and a few logical operations [16, Section 6] . The complete Rijndael S-box is hardcoded into the architecture, which incurs little area overhead and does not affect the runtime performance.
Sliding LFSR
SNOW 2.0 has 16 32-bit registers and SOSEMANUK requires only 10 32-bit registers. In our generic design we have 16 32-bit registers. When the design executes in SOSEMANUK mode, six slots of the LFSR are left unused. For SOSEMANUK keystream generation, four consecutively dropped words from the LFSR are XOR-ed with four consecutive Serpent1 outputs. We use the LFSR locations 0 to 3 to store the dropped words before they are XOR-ed. This utilizes the shifting naturally. The same effect is achieved in [21] by creating a separate shift register.
Unrolled LFSR
For some of our design points, we create a version with the LFSR unrolled for two steps with the aim of achieving better throughput. The idea is to perform two consecutive updates of the LFSR in one clock cycle. This involves shifting the LFSR by two positions and loading the positions $S_{t+9}$ and $S_{t+8}$ with two new values. Pictorially, the rolled and the unrolled versions of the LFSR are shown in Fig. 5. For clarity, the unrolling effects in the FSM update are not shown. Naturally, it involves two consecutive computations of R1 and R2 in the same clock cycle.
In principle, further unrolling is possible. However, the Serpent1 function for keystream generation is called after every four iterations of LFSR updates of SOSEMANUK. By unrolling two steps of output generation, the Serpent1 function needs to be called once after every two cycles (4/2) of our implementation. If we unroll one more iteration, it would mean that the Serpent1 function needs to be called after every $\lceil 4/3 \rceil = 2$ cycles. In other words, we need to wait for two cycles of our implementation anyway before generating the keystream words, and this gives us no advantage at all.
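The unrolling argument reduces to a ceiling division, which the following Python sketch makes explicit. It is an idealized model under our own assumptions: stalls and pipeline fill are ignored, and the function names are hypothetical.

```python
import math

LFSR_STEPS_PER_SERPENT1 = 4   # Serpent1 consumes four FSM outputs
BITS_PER_SERPENT1 = 128       # four 32-bit keystream words per call


def cycles_per_serpent1(unroll):
    """Clock cycles needed to collect four LFSR steps with `unroll`
    steps evaluated per cycle."""
    return math.ceil(LFSR_STEPS_PER_SERPENT1 / unroll)


def ideal_bits_per_cycle(unroll):
    """Idealized keystream bits per cycle, ignoring stalls."""
    return BITS_PER_SERPENT1 / cycles_per_serpent1(unroll)


summary = {u: (cycles_per_serpent1(u), ideal_bits_per_cycle(u)) for u in (1, 2, 3)}
```

Unrolling by three keeps the Serpent1 interval at two cycles, so the longer combinational path buys no extra bits per cycle over unrolling by two.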
Additional Optimizations
Apart from LFSR unrolling and optimization of the permutation network, several other design optimizations are performed to improve the throughput and area. These are briefly described in the following.
The rotate operations in Serpent contain constant operands. Instead of having a flexible rotation unit, dedicated bit wiring is used for the rotations. Each of the eight S-boxes in Serpent is accessed in a particular order. An 8-bit global register, called serpent_rk_index, is used for incrementing the indices of round key access. The five lower-order bits from the same register are used to determine the particular S-box to be called in a particular round. Serpent key scheduling requires eight registers, namely, $w_0, \ldots, w_7$. These are re-used during keystream generation for two different purposes. First, for storing the S-box input indices for the next Serpent round. Second, for holding the intermediate values $f_{t+3}, f_{t+2}, f_{t+1}, f_t$ of SOSEMANUK during the generation of its keystream.
Security Enhancement
Most of the attacks on SNOW 2.0 and SOSEMANUK have complexity more than $2^{128}$ and hence are not practical for a key length of 128 bits. These works include the guess and determine attacks in [2], [20], [48] and linear cryptanalysis of SOSEMANUK and SNOW 2.0 [13], [26].
There are two fault attacks on SOSEMANUK with better complexity. The differential fault attack of [34] requires around 6,144 faults, and this is a work equivalent to around $2^{48}$ SOSEMANUK iterations and a storage of around $2^{38.17}$ bytes. The time complexity of $2^{48}$ is dominated by a pruned complexity of $2^{16}$ to guess the values of eight LFSR states and eight FSM outputs and a complexity of $2^{32}$ to guess the initial value of R1. In [29], an improved attack is presented that requires only around 4,608 faults, 16 as in [34] ) for the first part and a complexity of $2^{32}$ to guess the initial value of R1. To prevent this fault attack, we duplicate the LFSR registers $S_1$, $S_8$ and $S_9$, since the fault attack in [34] must determine the complete LFSR state in order to be successful. We call this variant $S'^{(2)}_{eno,u}$. If at any step the two copies of any one of the three LFSR registers do not have the same value, then the process is aborted. Thus the complexity of guessing the LFSR states is increased by at least $2^{96}$, thereby moving the total complexity beyond $2^{128}$. Though the published fault attack [34] is not practical, we demonstrate the counter-measure just to emphasize the fact that such an attack (and any similar attack that may be devised in future) can be easily protected against with negligible decrease in performance.
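The duplication counter-measure can be modelled in software as shadow copies that are compared at every step. The sketch below is a behavioural illustration under our own assumptions, not the accelerator's RTL: the class `ShadowedLfsr`, the exception `FaultDetected` and the explicit `check` call are hypothetical names, and only the choice of protected positions 1, 8 and 9 comes from the text.

```python
PROTECTED = (1, 8, 9)  # LFSR positions that are duplicated


class FaultDetected(Exception):
    """Raised when a register and its duplicate disagree."""


class ShadowedLfsr:
    def __init__(self, state):
        self.state = list(state)
        # Duplicate copies of the protected registers only.
        self.shadow = {i: self.state[i] for i in PROTECTED}

    def write(self, index, value):
        """Legitimate write: update the register and, if protected, its copy."""
        self.state[index] = value
        if index in self.shadow:
            self.shadow[index] = value

    def check(self):
        """Abort if any protected register differs from its duplicate."""
        for i, copy in self.shadow.items():
            if self.state[i] != copy:
                raise FaultDetected(f"mismatch in LFSR position {i}")


lfsr = ShadowedLfsr(range(10))
lfsr.write(8, 0xDEADBEEF)
lfsr.check()          # copies agree, nothing happens
lfsr.state[8] ^= 1    # model a single-bit fault in one copy only
```

A fault that hits only one of the two copies is caught by the next `check`; a fault that flips both copies identically would go unnoticed in this simple model.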
PERFORMANCE EVALUATION
All the design points were first modeled in Synopsys Processor Designer version G-2012.06-SP2 Linux [45] , a high-level processor design environment. The algorithm outputs were verified with cycle-accurate instruction-set simulation.
Optimized RTL implementation is generated from the high-level description automatically, which is again functionally verified by running RTL simulation. The high-level design environment considerably reduced the modeling and exploration efforts. The RTL model complexity, in terms of lines of code, is approximately 20× that of the high-level description. On the other hand, as has been demonstrated with several commercial and academic studies, the RTL generated from the Processor Designer performs reasonably well when compared with manually developed RTL. The generated core, in this way, could be assembly-programmable and also retain high implementation efficiency. The implementation efficiency suffers to some extent, as has been shown in [9], particularly due to the pre-conceived structural template of processors.
The generated RTL model is synthesized with Synopsys Design Compiler version D-2010.03-SP4 [43], with the target technology being the UMC Faraday LL/RVT low-K process and the assumption of best conditions at 1.32 V and −40 °C. During synthesis, the compile_ultra option with high timing effort and topographical mode is used. Repeated synthesis with increasing clock frequency is performed as long as no timing violation is reported. The generated timing results are used to analyse the critical path and then timing optimizations to the high-level description of the model are applied accordingly. RTL switching activity is recorded and provided as an input to Synopsys PrimeTime version D-2010.03-SP4 [44] for obtaining power estimates. The performance estimates for memory structures are obtained by using Faraday Memory Compiler [19], 65 nm technology library. For all the design points, the memory access time satisfies the core frequency.
Area, Timing and Power
The evolution of design points is associated with corresponding area, timing and power figures. For convenience, first the area results are presented in Table 2, followed by throughput, power and energy-efficiency results in Table 3. In the following, the design evolution, its rationale and the observed results are presented stepwise. e . In order to achieve higher bits per cycle, we moved to a two-stage Serpent implementation. eno;u . We moved back again from a three-stage to a two-stage implementation. The rationale is that the critical path of the unrolled SOSEMANUK datapath is comparable to the critical path of a two-stage Serpent implementation. Therefore, it is advisable to retain the two-stage Serpent implementation for higher Serpent throughput. This reduces the throughput of eno;u . Finally, fault detection logic is implemented, which does not affect the throughput at all. The area increment is only 0.469 KGates. From Table 3, the variation of throughput along the design points can be observed. For Serpent, a move from a three-stage to a two-stage implementation is always associated with an increase in throughput. The implementation of SNOW 2.0 is done for all the design points in two pipeline stages, resulting in a throughput of 32 bits per cycle. For SOSEMANUK, the initial implementation at $S^{(2)}_{eno}$ and $S^{(3)}_{eno}$ generated 128 bits of output after every six cycles. This is due to four consecutive LFSR operations followed by a one-cycle stall when the intermediate values are loaded in the Serpent1 address generation instruction. In the sixth cycle, the Serpent1 function is executed. S For the multi-mode design points, the area efficiency results in terms of throughput per area are presented in Fig. 6. The gradual changes in the area efficiency between different points are as follows.
Initialization Latency
The initialization latencies of the different algorithms at the different design points are shown in Table 4. For Serpent, all the design points require 99 cycles for initialization. The initialization involves 33 rounds of key scheduling. For each round, there are two instructions for computing the recurrence, followed by a one-cycle S-box computation.
For SNOW 2.0, the initialization requires 32 initial clockings of the LFSR, which are accomplished in 32 cycles.
For SOSEMANUK, the truncated key schedule requires 25 rounds, with each round consuming three cycles as in Serpent. This is followed by the encryption of the IV with Serpent24. This operation requires one cycle for the initialization of the permutation network indices and one additional final cycle, which accesses the round key twice. In between, each of the 24 rounds requires two cycles in the three-stage design variant and one cycle in the two-stage design variant. The loading of the LFSR, R1 and R2 needs three cycles altogether.
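The cycle counts above can be tallied with a short calculation. This is a minimal sketch: the function names are illustrative, and the SOSEMANUK totals assume that the listed phases run strictly one after another with no overlap and that the two-cycle and one-cycle figures apply per Serpent24 round. Table 4 remains the authoritative source for the actual latencies.

```python
def serpent_init_cycles(rounds: int = 33, cycles_per_round: int = 3) -> int:
    # Two recurrence instructions plus one S-box cycle per key-schedule round.
    return rounds * cycles_per_round


def snow2_init_cycles(clockings: int = 32) -> int:
    # One LFSR clocking per cycle.
    return clockings


def sosemanuk_init_cycles(cycles_per_serpent24_round: int) -> int:
    # Assumption: all phases are sequential and simply add up.
    key_schedule = 25 * 3                      # truncated key schedule, 3 cycles per round
    setup = 1                                  # permutation network index initialization
    rounds = 24 * cycles_per_serpent24_round   # Serpent24 rounds applied to the IV
    final = 1                                  # final cycle accessing the round key twice
    load = 3                                   # loading the LFSR, R1 and R2
    return key_schedule + setup + rounds + final + load


if __name__ == "__main__":
    print("Serpent:", serpent_init_cycles())                 # 99, as stated in the text
    print("SNOW 2.0:", snow2_init_cycles())                  # 32, as stated in the text
    print("SOSEMANUK, three-stage:", sosemanuk_init_cycles(2))
    print("SOSEMANUK, two-stage:", sosemanuk_init_cycles(1))
```

Under these assumptions, the two-stage variant saves exactly 24 cycles over the three-stage variant, one per Serpent24 round.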
In terms of overall performance, the two best design points are $S_{eno,u}$ […] provides better performance for Serpent.
Benchmarking with Other Implementations
Before our current work, the hardware performance of SOSEMANUK had been discussed only in [21]. According to [21], the maximum clock frequency achieved in 0.13
observed that the design presented in [21] […] [21], which introduces another difficulty in comparison. Overall, our designs are driven primarily by flexibility and throughput. While extremely high throughput is not desirable in all deployment scenarios, it provides an additional knob to the end-user, who may comfortably reduce the clock frequency to achieve an intended throughput and thus reduce power consumption.
The best hardware implementation of SNOW 2.0 is due to [18]. They implemented the cipher on a Xilinx XC4VLX15 FPGA using Xilinx ISE 6.3.03i and report a throughput of 8.076 Gbps at a clock speed of 252.4 MHz. The throughput per slice was 3.42 Mbps/slice. The throughput of [18] in terms of bits per cycle is the same as that of the proposed design. The absence of standard cell synthesis results prevented us from further benchmarking the performance of the SNOW 2.0 algorithm.
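The relation between clock frequency, bits per cycle and throughput used in this comparison can be checked with a few lines. This is a minimal sketch with illustrative function names; the slice count is not quoted from [18] but merely implied by dividing the two reported figures, so it should be read as an estimate.

```python
def throughput_gbps(clock_mhz: float, bits_per_cycle: float) -> float:
    # Throughput = clock frequency x bits produced per cycle.
    return clock_mhz * 1e6 * bits_per_cycle / 1e9


def implied_slices(throughput_mbps: float, mbps_per_slice: float) -> float:
    # Estimate only: area implied by the reported throughput-per-slice figure.
    return throughput_mbps / mbps_per_slice


if __name__ == "__main__":
    # Figures reported for [18]: 252.4 MHz, 32 bits per cycle.
    print(f"{throughput_gbps(252.4, 32):.4f} Gbps")
    # Figures reported for [18]: 8.076 Gbps at 3.42 Mbps/slice.
    print(f"about {implied_slices(8076.0, 3.42):.0f} slices")
```

The computed value of roughly 8.077 Gbps agrees with the reported 8.076 Gbps up to rounding, which is consistent with a 32 bits per cycle datapath.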
The available published hardware implementation of Serpent in [28] […] i.e., eight times [31]. Even our multi-mode design point $S_{eno,u}$ achieves a frequency of 1,000 MHz, which is 8.13× higher than that reported in [28]. In terms of bits per cycle, the design in [28] reports 4× more throughput than ours due to its 4× replication of Serpent units.
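The two ratios above can be combined into a rough estimate of the net Serpent throughput relation. This is a back-of-the-envelope sketch, assuming that throughput is the product of clock frequency and bits per cycle and that both ratios refer to the same mode of operation; the symbols $T$ (throughput), $f$ (clock frequency) and $b$ (bits per cycle) are notation introduced here for illustration only.

```latex
\[
\frac{T_{\text{ours}}}{T_{[28]}}
  = \frac{f_{\text{ours}}}{f_{[28]}} \cdot \frac{b_{\text{ours}}}{b_{[28]}}
  \approx 8.13 \times \frac{1}{4}
  \approx 2.03
\]
```

Under these assumptions, the frequency advantage outweighs the 4× replication in [28] by roughly a factor of two.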
Though AES and SOSEMANUK are structurally different, it is interesting to note that the highest throughput obtained in our SOSEMANUK implementation outperforms state-of-the-art AES (both software and hardware) implementations [12] , [24] , [27] , [28] , [41] .
It is trivial to show a performance improvement for a dedicated accelerator compared to software implementations on general-purpose processors. For the sake of completeness, we compare our performance with that reported in [32]. There, the throughputs of SOSEMANUK and SNOW 2.0 are as given in Table 5. The proposed accelerator improves these performances by at least 3.7× and 1.8× for SOSEMANUK and SNOW 2.0, respectively.
To compare the throughput of our SOSEMANUK hardware with that of the other eSTREAM finalists, we quote the throughput results available in the literature. For the hardware category, we have the following pairs of throughputs (in Gbps), the first of which is in 0.25 μm [22] and the second in 0.13 μm [21]. […] Table 1) helps to encode it in small hardware logic. ii) The Serpent S-boxes and Linear Transformation have efficient bitsliced implementations (Section 2.1), suitable for compact hardware design. iii) In the design, four keystream words (128 bits) are generated as output in each round, which leads to high throughput. iv) The critical path is shorter compared to other eSTREAM candidates. This has been discussed in Section 5 of our paper and in [21] as well.

It is difficult to benchmark implementations across different process technology nodes and, moreover, across different technology generations. Nevertheless, it can be appreciated that our proposed SOSEMANUK hardware implementation is clearly comparable in throughput with several state-of-the-art hardware-oriented stream ciphers and improves upon the software performance significantly. Additionally, the flexibility provided by the presented design can be used for the following.
• Dynamically switching between SOSEMANUK, SNOW 2.0 and Serpent. Enhancing this to SNOW 3G [46] remains interesting future work.
• An ISA for the ciphers allows different software-based control of the algorithm flow. This can be used for security-performance trade-offs as well as a potential mechanism to counter side-channel attacks.
• Diverse indexing options in the Serpent rounds and S-box accesses leave considerable room for exploring completely new cipher designs.
CONCLUSION AND FUTURE WORK
We propose a hardware accelerator for the eSTREAM finalist software stream cipher SOSEMANUK. Since the cipher combines the design principles of the block cipher Serpent and the stream cipher SNOW 2.0, we also accommodate these two ciphers in our design. In terms of performance, our hardware beats all stand-alone hardware implementations of the three ciphers as well as the existing hardware designs for all the other ciphers of the eSTREAM portfolio. Because of the complicated design of SOSEMANUK, the hardware area is not suitable for light-weight applications; however, our design can certainly be used as a flexible hardware accelerator serving the purpose of both block and stream ciphers. It can be noted that LFSR unrolling of SNOW 3G resulted in diminishing area efficiency [38]. In that context, the unrolling results for SOSEMANUK obtained in this work are encouraging, and we intend to probe further unrolling possibilities. The area efficiency of SNOW 2.0 with the unrolling option can be explored, too.
