Abstract: The problem of securing data present on USB memories and SD cards has not been adequately addressed in the cryptography literature. While the formal notion of a tweakable enciphering scheme (TES) is well accepted as the proper primitive for secure data storage, the real challenge is to design a low cost TES which can perform at the data rates of the targeted memory devices. In this work, we provide the first answer to this problem. Our solution, called STES, combines a stream cipher with an XOR universal hash function. The security of STES is rigorously analyzed in the usual manner of the provable security approach. By carefully defining appropriate variants of the multi-linear hash function and the pseudo-dot product based hash function, we obtain controllable trade-offs between area and throughput. We combine the hash function with recent hardware oriented stream ciphers, namely Mickey, Grain and Trivium. Our implementations are targeted towards two low cost FPGAs: Xilinx Spartan 3 and Lattice ICE40. Simulation results demonstrate that the speeds of encryption/decryption match the data rates of different USB and SD memories. We believe that our work opens up the possibility of actually putting FPGAs within controllers of such memories to perform low-level in-place encryption.

INTRODUCTION
Traditionally, cryptography has been mainly used to secure data in transit. In the last decade, however, there has been an increase in interest in securing stored data. This interest is reflected in some recent standardizing efforts [6] and a galaxy of algorithms that have been proposed for securing stored data [21], [30], [31], [38], [43], [48]. A consensus has been reached among researchers that a type of symmetric encryption scheme, called a Tweakable Enciphering Scheme (TES) [30], is appropriate for encrypting data stored in storage devices which have a sector-wise organization, including hard disks and NAND flash memories.
The specific application which a TES is meant to serve is called low level disk encryption or in-place disk encryption [30] . In this application, it is assumed that the encryption/decryption algorithm resides in the disk controller and it views the storage media as a collection of sectors. The disk controller encrypts the data before writing and decrypts it after reading and before sending it to the operating system. This generic model of disk encryption is independent of other details like operating systems, file system types, etc.
Almost all the known constructions of TES use a block cipher as a building block. Some schemes like CMC, EME, EME* use only block ciphers, whereas schemes such as XCB [38], HCTR [48], HCH [21], TET [29], HEH [42], [43] use a block cipher along with a suitable hash function. In terms of efficiency, the current state-of-the-art for the cost of encryption/decryption per block is either two block cipher calls [28], [30], [31]; or one block cipher call plus one $GF(2^n)$ multiplication, where $n$ is the block size of the underlying block cipher [43].
Though there has been an active effort in designing new TES, there are only a few works which report efficient implementations of such schemes. To our knowledge the only works which report implementations of TES in hardware are [19], [37]. The designs in [37] are targeted towards the Xilinx Virtex 4 family of FPGAs while the designs in [19] are targeted towards the Virtex 5 family. The throughput reported in [19] is very encouraging, as all reported designs obtain more than 10 Gbits/sec of throughput. The best design in terms of speed provides a throughput of more than 15 Gbits/sec. These implementations are prototypical studies and they firmly demonstrate that the speed of TES can match the data rates of modern day disk controllers. We note that the studies are targeted towards high end FPGAs and so they may not be cost efficient for large scale deployment in commercial hard disks. On the other hand, the design philosophies adopted in these works can be easily adapted to the design of ASICs. One can then expect the throughput rates to be higher, as ASICs are capable of operating at much higher frequencies compared to FPGAs. The use of ASICs, on the other hand, will involve a longer design cycle and the loss of reconfigurability.
An algorithm called XTS is widely used in many publicly available products for the purpose of disk encryption. This algorithm was described in an IEEE standard [5] and is essentially a two-key version of the algorithm XEX designed by Rogaway [40] . The security provided by XTS is significantly weaker than that of a TES. As recorded in the standard [5] itself, XTS is vulnerable to an easy mix-and-match attack which breaks its security as a TES. The argument provided for adopting XTS was that the algorithm is much faster than the then existing TES algorithms. The issue of security-efficiency trade-off of XTS is best exemplified by the following comments by Rogaway appearing in [4] : "the nominally "correct" solution for (length-preserving) enciphering of disk sectors and the like is to apply a tweakable, strong PRP (aka wide-blocksize encryption) to the (entire) data unit . . . because of its much weaker security properties, I expect that XTS is an appropriate mechanism choice only in the case that one simply cannot afford the computation or latency associated to computing a strong PRP."
In our work, we show that stream cipher based TESs provide low area and low power solutions and can still match the data rates of the target devices. Since it is possible to obtain the required efficiency without any loss in security, there is no reason to use an algorithm which provides qualitatively lower security guarantee. In view of this, we do not consider XTS any further in this work.
The Case for Low Cost Solutions
Storage is an integral part of numerous modern devices. For example, non-trivial storage is provided in modern smart phones and cameras. Even personal bio-medical instruments such as meters for measuring serum glucose or blood pressure have facilities for storing past readings. From a user point of view, the data present on the smart phone of a top political or business personality is no less sensitive than the data present on his or her laptop. With devices being increasingly inter-connected and also connected to the net, it may be easy for a piece of malware to load itself into a smart phone and transmit sensitive data. Keeping the stored data in an encrypted form provides a baseline protection to such malicious activity.
Most resource constrained devices do not have hard disks but rely on flash memories. NAND type flash memory has a similar organization as hard disks. Consequently, TES algorithms that are applicable for hard disks can also be applied here. But, one needs to keep in mind the constraints in these devices in terms of area and power utilization. A significant increase in the area will lead to an increase in both the cost and the size of the device. Further, a high demand on power will drain out the battery much sooner. Both of these will negatively impact the utility of the device to the user.
A basic constraint for deployment of any cryptographic protection mechanism is that the performance of the system should not degrade. In the context of stored data, this means that the speed of encryption and decryption should match the raw data rates of the device. Further, any solution whose cost is high is unlikely to be adopted. All of these lead to the following question.
Is it possible to design a scheme for securing stored data under the following constraints?
1) The area and power requirement must be low.
2) The actual cost of implementation must be low.
3) The speed should match the data rates of the target device.
Contributions of This Work
In this paper, we provide an affirmative answer to the above question through a new construction of a TES (which we call STES) using hardware oriented stream ciphers. Here we provide an overview of the several aspects of the solution. Details are worked out over the rest of the paper.
The application that we have in mind is to encrypt flash memories which have a block-wise organization. Our target applications include USB memories and memory cards like those specified in the SD standard [8], and high density smart cards which contain NAND flash memories of the order of megabytes for storing user data [32]. The SD standard classifies memory cards into four categories based on their speeds. These categories are named normal speed, high speed, ultra high speed-I (UHS-I) and ultra high speed-II (UHS-II), and the required bus speeds are 12.5 MB/sec, 25 MB/sec, 50-104 MB/sec and 156-312 MB/sec, respectively. The UHS-II category of devices is recommended only for special applications like storing high quality streaming video.
These speed requirements are to be contrasted with the speeds of modern hard disks using technologies like serial ATA and native command queuing, which can achieve data rates of more than 3 Gigabits/sec (SATA revision 3.0 specifies a speed of 6 Gigabits/sec). So, for encryption of SD cards, the required encryption speed is not very demanding, and is much less than the speeds achieved by the TES implementations reported in [19], [37]. As discussed above, speed is only one of the issues. The designs in [19], [37] are neither low area nor low power. Being targeted towards Virtex 4 and 5 FPGAs, they are also not low cost. Hence, judged only by the speed needs of constrained devices, the designs in [19], [37] are over-engineered.
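To make the comparison concrete, the bus speeds quoted above can be converted to bits per second. The following is a minimal sketch that uses only the figures quoted in the text; it assumes 1 MB = 10^6 bytes (the text does not state the convention), and the names `SD_BUS_SPEED_MB_S` and `mb_per_s_to_gbit_per_s` are illustrative.

```python
# Upper end of each SD bus-speed range quoted in the text, in MB/s.
SD_BUS_SPEED_MB_S = {
    "normal speed": 12.5,
    "high speed": 25.0,
    "UHS-I": 104.0,
    "UHS-II": 312.0,
}


def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert MB/s to Gbit/s, assuming 1 MB = 10**6 bytes (an assumption)."""
    return mb_per_s * 8 / 1000


# Throughput an encryption core would need in order to keep up with each class.
REQUIRED_GBIT_S = {
    name: mb_per_s_to_gbit_per_s(speed) for name, speed in SD_BUS_SPEED_MB_S.items()
}

for name, rate in REQUIRED_GBIT_S.items():
    print(f"{name:>12}: {rate:.3f} Gbit/s")
```

Under this assumption even the fastest category stays below 2.5 Gbit/s, well under the throughput of more than 10 Gbits/sec reported for the designs in [19].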
Low cost designs. Our target platform is low cost yet performant FPGAs. Examples are the Xilinx Spartan 3 and Lattice ICE40 FPGAs [36] . It may be possible to put such an FPGA in a personal device like a smart phone [7] . TES designs targeted towards such platforms can be directly deployed to small devices.
Low area/power. As mentioned earlier, most TES designs are based on block ciphers. Only one work [45] outlines a TES which uses a stream cipher supporting an initialization vector (IV). The stream cipher based construction in [45] has a bug and the corrected construction appears in [44] . This is the starting point of our work. The description in [44] , [45] is at a high level using a stream cipher and a hash function. We mention below our choice of the stream cipher, the design of the hash function and the consequent development of a new TES scheme that we call STES. The details of our construction are sufficiently different from that in [44] , [45] to necessitate a separate complete security analysis.
The eStream [3] Profile-2 portfolio provides three stream ciphers which have very small hardware footprint, namely Grain128, Mickey 2.0 and Trivium. We consider all these three candidates and later describe implementation results using varying opportunities for parallelism.
The TES construction requires a hash function with provably low collision and differential probabilities. The usual polynomial hash is one way to design such a hash function. But this requires a finite field multiplier over $GF(2^\ell)$, where $\ell$ is the IV length of the underlying stream cipher. This may not be a good choice for low area/power designs. Instead, we chose the multi-linear hash function [18], [24] for implementation. When used directly, this also requires a $GF(2^\ell)$ multiplier. We, however, use the so-called Toeplitz version of this hash function, where it is possible to use a $GF(2^d)$ multiplier where $d$ is a divisor of $\ell$. We call $d$ the data path of the hash function. By varying $d$, we can achieve a trade-off between the size and the throughput of the hash function. The theoretical possibility of obtaining a hardware efficient hash function using a Toeplitz version of the multi-linear hash was indicated in [46].
Another interesting hash function is based on Winograd's pseudo-dot product [49] and has been mentioned in [13]. Again, a direct implementation of this hash function requires a $GF(2^\ell)$ multiplier, whereas using a Toeplitz version (suggested in [46]) one can use a $GF(2^d)$ multiplier for some $d$ dividing $\ell$. For a fixed message of a particular length, the total number of multiplications required by the pseudo-dot product based hash function is about half that required by the multi-linear hash function. This seems to suggest that the pseudo-dot product hash function should be the one of choice. Somewhat surprisingly, we show that from the point of view of parallel hardware implementation, there is no significant difference in speed of the two functions. But the pseudo-dot product based hash functions provide some advantage over the multi-linear ones in terms of area when implemented with larger data paths.
Later, we present different design decisions for the hash function and the implementation results. We believe that our implementation of the hash functions is of independent interest and will be useful for other applications. There is no work in the literature which provides such a careful hardware implementation of the multi-linear and the pseudo-dot product hash functions as we do.
As mentioned earlier, the stream cipher based TES construction in [45] is at a high level. The actual issue of choosing the hash function is not adequately addressed. The multi-linear and the pseudo-dot product hash functions that we use require a long hash key. Due to its length, the key cannot be stored and has to be generated using the stream cipher itself. As a result, the security proof in [45] no longer applies. We carefully work out the complete proof and obtain a security bound which improves upon the ones given in [44], [45].
The result of all this is that STES is a low power/area design and can be implemented in low cost FPGAs. Further, this is achieved while retaining the usual guarantee of rigorous security analysis. To a designer, STES holds out the dual attractiveness of formal security analysis combined with low cost and low area/power implementations.
PRELIMINARIES
In this section we give an overview of the basic primitives used for the new construction. For a binary string $X$, $\mathrm{bits}(X, i, j)$ denotes the substring of $X$ extending from position $i$ to position $j$. For binary strings $X$ and $Y$, $X \| Y$ denotes the concatenation of $X$ and $Y$. We shall often treat $n$-bit binary strings as elements of $GF(2^n)$, and for $X, Y \in \{0,1\}^n$, $X \oplus Y$ and $X \cdot Y$ (or sometimes $XY$) denote addition and multiplication in $GF(2^n)$.
Stream Ciphers with IV
Modern stream ciphers, such as those in the eStream [3] portfolio, take as input a short secret key $K$ and a short initialization vector (IV) and produce a "long" and random looking string of bits. Let $SC_K : \{0,1\}^\ell \to \{0,1\}^L$ be a stream cipher with IV, i.e., for every choice of $K$ from a certain pre-defined key space $\mathcal{K}$, $SC_K$ maps an $\ell$-bit IV to an output string of length $L$ bits. The length $L$ is assumed to be long enough for practical sized messages to be encrypted. For our application, $L$ is determined by the length of the sector. By $SC^i_K(IV)$ we denote the first $i$ bits of the output of $SC_K(IV)$.
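The interface $SC^i_K(IV)$ can be sketched in software as follows. This is only an illustration of the input/output behaviour: SHAKE-128 from the Python standard library is used as a stand-in for the keystream generator and is not one of the eStream ciphers considered in this paper; the function name `sc` and the input encoding are assumptions.

```python
import hashlib


def sc(key: bytes, iv: bytes, nbits: int) -> str:
    """Return the first nbits of the keystream for (key, iv) as a '0'/'1' string.

    SHAKE-128 is a software stand-in for SC_K (an assumption for
    illustration); it is NOT Grain, Mickey or Trivium.
    """
    nbytes = (nbits + 7) // 8
    # Length-prefix the key so that distinct (key, iv) pairs encode distinctly.
    data = len(key).to_bytes(2, "big") + key + iv
    stream = hashlib.shake_128(data).digest(nbytes)
    bits = "".join(f"{byte:08b}" for byte in stream)
    return bits[:nbits]  # truncate to exactly nbits


# Example: a sector-sized keystream and a shorter prefix of the same stream.
key, iv = b"\x01" * 10, b"\x02" * 10
full = sc(key, iv, 4096)
prefix = sc(key, iv, 100)
print(full.startswith(prefix))
```

The prefix relation holds because a longer request extends the same output stream, which mirrors the definition of $SC^i_K(IV)$ as the first $i$ bits of $SC_K(IV)$.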
Multilinear Universal Hash (MLUH)
A keyed hash function is chosen from an indexed family $\{H_\tau\}_\tau$ of hash functions. The key $\tau$ is chosen uniformly at random from the index set. Suppose the range consists of $\ell$-bit strings. The hash function is said to be universal if, for distinct $X$ and $X'$, $\Pr[H_\tau(X) = H_\tau(X')] = 1/2^\ell$; further, it is said to be XOR universal (XU) if, for distinct $X$ and $X'$ and any $\ell$-bit string $\alpha$, $\Pr[H_\tau(X) \oplus H_\tau(X') = \alpha] = 1/2^\ell$.

We will be interested in a particular type of hash function called the multi-linear function [18], [24]. The following definition is a variant based on the so-called Toeplitz construction. An MLUH with data path $d$ uses an $(m+b-1)d$-bit key $K$ to map a $dm$-bit message $M$ to a $db$-bit digest. The message $M$ is written as $M = M_1 \| M_2 \| \cdots \| M_m$ and the key $K$ is written as $K = K_1 \| K_2 \| \cdots \| K_{m+b-1}$, where each $M_i$ and $K_j$ is $d$ bits long. We define

$$\mathrm{MLUH}^{d,b}_K(M) = h_1 \| h_2 \| \cdots \| h_b, \tag{1}$$

where

$$h_i = M_1 \cdot K_i \oplus M_2 \cdot K_{i+1} \oplus \cdots \oplus M_m \cdot K_{i+m-1}, \qquad 1 \le i \le b.$$

The additions and multiplications are in the field $GF(2^d)$. Note that the message and key lengths are multiples of $d$. This restriction can be lifted by appropriate padding. We do not perform this, since for our application it is easy to ensure that the condition holds.

It is not difficult to show that an MLUH is an XU function. More specifically, for a uniform random key $K$, any pair of distinct messages $M, M'$ of length $dm$ bits and any $\alpha \in \{0,1\}^{db}$,

$$\Pr[\mathrm{MLUH}^{d,b}_K(M) \oplus \mathrm{MLUH}^{d,b}_K(M') = \alpha] = \frac{1}{2^{db}}. \tag{2}$$
The XOR universal property for a more general version of the multi-linear hash function has been proved in [46] .
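The Toeplitz multi-linear hash can be illustrated with a small software model. The sketch below assumes $d = 8$ with the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ (the AES polynomial, chosen here only for illustration; the field representation used in our hardware is not implied), and the function names are hypothetical. The helper `differential_count` checks the XU property exhaustively, which is feasible only for very small $d$.

```python
from itertools import product


def gf_mul(a, b, d, poly):
    """Multiply a and b in GF(2^d); poly encodes the modulus, x^d term included."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> d:  # degree reached d, so reduce
            a ^= poly
    return result


def mluh(key, msg, b, d=8, poly=0x11B):
    """Toeplitz multi-linear hash: digest block i is the XOR over j of msg[j] * key[i + j]."""
    m = len(msg)
    assert len(key) == m + b - 1, "key must hold m + b - 1 blocks"
    digest = []
    for i in range(b):
        h = 0
        for j in range(m):
            h ^= gf_mul(msg[j], key[i + j], d, poly)
        digest.append(h)
    return digest


def differential_count(msg1, msg2, alpha, b, d, poly):
    """Count keys K with mluh(K, msg1) XOR mluh(K, msg2) == alpha (exhaustive)."""
    m = len(msg1)
    count = 0
    for key in product(range(1 << d), repeat=m + b - 1):
        h1 = mluh(key, msg1, b, d, poly)
        h2 = mluh(key, msg2, b, d, poly)
        if [x ^ y for x, y in zip(h1, h2)] == list(alpha):
            count += 1
    return count


# Toy check over GF(4) with modulus x^2 + x + 1: m = 2, b = 2, so 4^3 = 64 keys.
print(differential_count((1, 2), (3, 2), (0, 0), b=2, d=2, poly=0b111))
```

With 64 keys and 16 possible digest differences, each difference should be hit by exactly 4 keys, in agreement with the probability $1/2^{db}$.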
Pseudo-Dot Product Based Universal Hash
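Another construction of a hash function can be based on Winograd's pseudo-dot product [49]. This has been pointed out in [13]. We describe the Toeplitz variant of the pseudo-dot product construction as mentioned in [46]. This will be denoted by PD. The PD construction uses an $(m+2b-2)d$-bit key $K$ to map a $dm$-bit message $M$, with $m$ even, to a $db$-bit digest. Writing $M = M_1 \| \cdots \| M_m$ and $K = K_1 \| \cdots \| K_{m+2b-2}$ in $d$-bit blocks, we define

$$\mathrm{PD}^{d,b}_K(M) = h_1 \| h_2 \| \cdots \| h_b, \tag{3}$$

where

$$h_i = \bigoplus_{j=1}^{m/2} (M_{2j-1} \oplus K_{2i+2j-3}) \cdot (M_{2j} \oplus K_{2i+2j-2}), \qquad 1 \le i \le b.$$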
The PD function is also an XU hash function and can be proved by a simple probability calculation.
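A software model of the Toeplitz pseudo-dot product hash is sketched below under the same illustrative assumptions as before ($d = 8$ with the AES reduction polynomial by default; hypothetical function names). It assumes an even number $m$ of message blocks and shifts the key window by two blocks per digest block, which is what makes the key length $(m + 2b - 2)d$ bits. Each digest block costs $m/2$ multiplications, against $m$ for the multi-linear hash.

```python
from itertools import product


def gf_mul(a, b, d, poly):
    """Multiply a and b in GF(2^d); poly encodes the modulus, x^d term included."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> d:  # degree reached d, so reduce
            a ^= poly
    return result


def pd_hash(key, msg, b, d=8, poly=0x11B):
    """Toeplitz pseudo-dot product hash with m/2 multiplications per digest block."""
    m = len(msg)
    assert m % 2 == 0, "this sketch assumes an even number of message blocks"
    assert len(key) == m + 2 * b - 2, "key must hold m + 2b - 2 blocks"
    digest = []
    for i in range(b):
        h = 0
        for j in range(0, m, 2):
            left = msg[j] ^ key[2 * i + j]
            right = msg[j + 1] ^ key[2 * i + j + 1]
            h ^= gf_mul(left, right, d, poly)
        digest.append(h)
    return digest


def pd_differential_count(msg1, msg2, alpha, b, d, poly):
    """Count keys K with pd_hash(K, msg1) XOR pd_hash(K, msg2) == alpha (exhaustive)."""
    m = len(msg1)
    count = 0
    for key in product(range(1 << d), repeat=m + 2 * b - 2):
        h1 = pd_hash(key, msg1, b, d, poly)
        h2 = pd_hash(key, msg2, b, d, poly)
        if [x ^ y for x, y in zip(h1, h2)] == list(alpha):
            count += 1
    return count


# Toy check over GF(4): m = 2, b = 1, so 16 keys and 4 possible differences.
print(pd_differential_count((1, 2), (3, 2), (0,), b=1, d=2, poly=0b111))
```

In this toy setting every digest difference is produced by exactly 4 of the 16 keys, which is the XU property with probability $1/2^{db}$.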
Tweakable Enciphering Scheme
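A tweakable enciphering scheme is a pair of indexed families of functions $\{E_K\}_{K \in \mathcal{K}}$ and $\{D_K\}_{K \in \mathcal{K}}$, where $E_K : \mathrm{Tweak} \times \mathrm{Msg} \to \mathrm{Cpr}$ and $D_K : \mathrm{Tweak} \times \mathrm{Cpr} \to \mathrm{Msg}$. Here $\mathcal{K}$ is the key space and Tweak, Msg and Cpr are non-empty finite sets of binary strings. For every key $K$, tweak $T$ and message $X$, the scheme is length preserving, i.e., $|E_K(T, X)| = |X|$, and decryption inverts encryption, i.e., $D_K(T, E_K(T, X)) = X$.

The applicability to disk encryption comes via the following relation. For encryption, the tweak $T$ is taken to be the sector address and the message $X$ is taken to be the content of the sector. Let $Y$ be the output of the encryption of $X$ under the tweak $T$ and the secret key $K$. By the length preserving constraint, $Y$ is of the same length as $X$. The value of $X$ is overwritten using $Y$ in the sector pointed to by $T$. Thus, after the whole disk is encrypted, it consists only of the encrypted values. Note that the encryption is done sector-wise and so decryption can also be done sector-wise. This means that the value $Y$ in a particular sector with address $T$ can be decrypted without disturbing the values of the other sectors.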
In a typical in-place disk encryption application, the sectors are individually encrypted and stored in the encrypted form. The encryption/decryption module resides just above the disk controller. An encrypted sector read by the disk controller is decrypted by the module and returned to the calling routine. Similarly, the writing of any sector to the disk is first encrypted by the module and then the disk controller writes it to the appropriate sector. This mode of operation ensures that if the disk is accessed without the secret key, then it appears to be random.
In the definition of TES, Tweak, Msg and Cpr are mentioned to be non-empty finite sets of binary strings. For disk encryption applications, $\mathrm{Msg} = \mathrm{Cpr}$ and is the set of all binary strings whose lengths are equal to the length of a sector. Similarly, Tweak is taken to be the set of all binary strings of some fixed length $\ell$, where $2^\ell$ is greater than the number of sectors in a disk.
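The data flow of in-place, sector-wise encryption can be sketched as below. The cipher used here is an insecure placeholder (a tweak-dependent XOR pad derived from SHAKE-128); it is not a TES and is not STES, and serves only to exercise the interface in which the tweak is the sector address and the ciphertext overwrites the sector. The 512-byte sector size, the 8-byte tweak encoding and all names are assumptions.

```python
import hashlib

SECTOR_BYTES = 512  # assumed sector size, for illustration only


def _pad(key: bytes, tweak: bytes, nbytes: int) -> bytes:
    """Tweak-dependent pad. Placeholder only: this does NOT give TES security."""
    return hashlib.shake_128(key + tweak).digest(nbytes)


def toy_encrypt(key: bytes, tweak: bytes, sector: bytes) -> bytes:
    """Length-preserving placeholder for E_K(T, X)."""
    return bytes(x ^ y for x, y in zip(sector, _pad(key, tweak, len(sector))))


toy_decrypt = toy_encrypt  # XOR with the same pad undoes the encryption


class EncryptedDisk:
    """Stores only ciphertext; each sector is processed independently."""

    def __init__(self, key: bytes, num_sectors: int):
        self.key = key
        blank = bytes(SECTOR_BYTES)
        self.sectors = [
            toy_encrypt(key, self._tweak(addr), blank) for addr in range(num_sectors)
        ]

    @staticmethod
    def _tweak(address: int) -> bytes:
        return address.to_bytes(8, "big")  # the tweak is the sector address

    def write(self, address: int, data: bytes) -> None:
        assert len(data) == SECTOR_BYTES
        # The ciphertext overwrites the sector in place.
        self.sectors[address] = toy_encrypt(self.key, self._tweak(address), data)

    def read(self, address: int) -> bytes:
        return toy_decrypt(self.key, self._tweak(address), self.sectors[address])


disk = EncryptedDisk(b"k" * 16, num_sectors=4)
disk.write(1, bytes(range(256)) * 2)
print(disk.read(1) == bytes(range(256)) * 2)
```

Because the sector address enters as the tweak, identical plaintext written to two different sectors is stored as different ciphertext, and any single sector can be read back without touching the others.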
CONSTRUCTION OF STES
The description of the encryption algorithm using STES is given in Fig. 1 and a schematic diagram is shown in Fig. 3. The construction is parameterized by a stream cipher SC supporting $\ell$-bit IVs, a hash function MLUH with data path $d$ and a fixed $\ell$-bit string fStr. This is emphasized by writing STES[SC, MLUH, fStr]. When one or more of the parameters are clear from the context, we drop these for simplicity of notation; if all three parameters are clear, then we simply write STES. We assume that $d \mid \ell$. Plaintexts and tweaks are of fixed length. If $P$ is any plaintext and $T$ is any tweak, then we also assume that $d \mid (|P| + |T| - 2\ell)$. For practical implementations, the restrictions on $d$ are easy to ensure, as we discuss later.
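The two divisibility conditions on $d$ can be checked mechanically. The sketch below is a hypothetical helper, not part of the algorithm in Fig. 1; the example values (a 4096-bit sector and 80-bit tweak and IV) are assumptions chosen only for illustration.

```python
def check_parameters(plaintext_bits: int, tweak_bits: int, iv_bits: int, d: int):
    """Validate d | l and d | (|P| + |T| - 2l).

    Returns (b, m): the number of d-bit blocks in an l-bit string and in the
    (|P| + |T| - 2l)-bit string that remains after two l-bit parts are removed.
    """
    if iv_bits % d != 0:
        raise ValueError("d must divide the IV length l")
    hashed_bits = plaintext_bits + tweak_bits - 2 * iv_bits
    if hashed_bits <= 0 or hashed_bits % d != 0:
        raise ValueError("d must divide |P| + |T| - 2l")
    return iv_bits // d, hashed_bits // d


# Assumed example: 4096-bit sector, 80-bit tweak, 80-bit IV, data path d = 8.
print(check_parameters(4096, 80, 80, 8))
```

With these assumed sizes, $d = 8$ satisfies both conditions, whereas $d = 40$ divides $\ell = 80$ but not $|P| + |T| - 2\ell = 4016$, so it would be rejected.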
The secret key for STES is the secret key $K$ of the underlying stream cipher. In this context, we would like to mention the role that the parameter fStr can play. From the point of view of the formal security analysis, there is no restriction on fStr. Thus, this can be used as a secret customization option. In other words, for actual deployment, one may choose a uniform random value for fStr and keep it secret. This provides an additional layer of obscurity over and above the provable security analysis that we perform. There is another advantage to using fStr as part of the secret key. The security bound that is obtained is in terms of the IV length $\ell$, and the number of queries for which security holds can be obtained as a function of $2^\ell$. The key length $|K|$ should be at least $\ell$ for the analysis to be meaningful. If the key length is equal to $\ell$, then certain "out of model" attacks may apply, as has been pointed out in [22]. Increasing the key length by keeping fStr as part of the secret key may help in preventing such attacks.
Apart from the secret key K, the input to the encryption algorithm of STES is the tweak T and a plaintext P . Similarly, the input to the decryption algorithm of STES consists of T and the ciphertext C.
The encryption algorithm begins with some length calculations, and fixes values for the variables $\ell_1$, $\ell_2$ and $\ell_3$ which determine the key lengths necessary for the different calls to MLUH made later in the algorithm. Next, the input plaintext is parsed into three parts $P_1$, $P_2$ and $P_3$, where $P_1$ and $P_2$ are both $\ell$ bits long and $P_3$ is $|P| - 2\ell$ bits long. In Line 9, $(\ell_1 + \ell_2 + \ell)$ bits are generated from the stream cipher $SC_K$ using fStr as input. These bits are parsed into the strings $\tau'$ and $\tau''$, which are later used as keys for MLUH, and the $\ell$-bit mask $\beta$ discussed later in this section. The part $P_3$ of the message and the tweak $T$ are hashed using MLUH and mixed with the message parts $P_1$ and $P_2$ to generate two strings $A_1$ and $A_2$. These strings are used as input to the function Feistel, which is described in Fig. 2. The function Feistel receives two keys $K$ and $\tau''$ and mixes the input strings $A_1$ and $A_2$ by appropriate use of the hash MLUH and the stream cipher SC. The inverse function for Feistel is also shown in Fig. 2.
The call to Feistel also returns the string $W$ of length $\ell_3$, which is the length of $P_3$. This string $W$ is XORed with $P_3$ to obtain $C_3$. The Feistel network produces two more $\ell$-bit outputs $B_1$ and $B_2$, which are used to produce the ciphertexts $C_1$ and $C_2$ respectively.
From the description given in Figs. 1 and 2, it may appear that the string $W$ of length $\ell_3$ is required to be actually returned to the main body of the algorithm. This, however, is not the case. For example, depending on the specific design choices, it may be possible to start computing $W$ (in line 3, Fig. 2) as soon as a few bits of $Z_2$ (in line 18, Fig. 1) are available.
Variant of STES Using PD
The description of the algorithm STES can be trivially modified by using the pseudo-dot product hash PD which is described in Section 2.2. If PD is used instead of MLUH, the required key lengths are to be suitably changed. For hashing $m$ blocks of message (where each block is $d$ bits long), the PD construction requires $m + 2b - 2$ blocks of keys, where $bd$ is the length of the output. Hence the parameter $\ell_1$ in Line 4 must be fixed to $b_1 + 2b - 2$ and $\ell_2$ to $(3b - 2)d$. All the calls to MLUH should be replaced by PD in the algorithm STES and in the function Feistel, with the same parameters as appear in the descriptions. This variant is formally denoted as STES[SC, PD, fStr].
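The difference in key material between the two hash functions follows directly from the key-length formulas $(m + b - 1)d$ and $(m + 2b - 2)d$ given in Section 2. The sketch below simply evaluates these formulas; the function names and the example values ($\ell = 80$, $d = 8$, hence $b = 10$, and $m = 502$ message blocks) are assumptions for illustration.

```python
def mluh_key_bits(m: int, b: int, d: int) -> int:
    """Key length in bits of the Toeplitz multi-linear hash: (m + b - 1) * d."""
    return (m + b - 1) * d


def pd_key_bits(m: int, b: int, d: int) -> int:
    """Key length in bits of the Toeplitz pseudo-dot product hash: (m + 2b - 2) * d."""
    return (m + 2 * b - 2) * d


# Assumed example: l = 80 and d = 8, so b = l / d = 10 digest blocks.
b, d = 10, 8

# Hashing an l-bit input means m = b message blocks.
print(mluh_key_bits(b, b, d), pd_key_bits(b, b, d))

# Hashing a longer input of m = 502 blocks (assumed size).
print(mluh_key_bits(502, b, d), pd_key_bits(502, b, d))
```

For an $\ell$-bit input ($m = b$) the PD key length evaluates to $(3b - 2)d$, matching the value stated for $\ell_2$; in general PD needs $(b - 1)d$ more key bits than MLUH for the same $m$, $b$ and $d$.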
Some Characteristics of the Construction
The internal keys. Although STES is parameterized by a single stream cipher key $K$, internally it generates three more keys $\beta$, $\tau'$ and $\tau''$, where $\tau'$ and $\tau''$ are used as the keys for $H$ and $\beta$ is used to mask the outputs of $H$. If it is possible to store $\tau'$ and $\tau''$, then the call to SC in Line 9 is not required and this will gain some efficiency. But, as the key $\tau'$ is much larger in size compared to $K$, for most applications this will not be practical. Using a register to store this key within the FPGA will greatly push up the area requirement. For designs targeted at small area implementations, it is necessary to generate the hash key on the fly.
One of the roles of the key $\beta$ is to ensure that the masked output of $H$ is equal to fStr with low probability. It is used in Lines 13 and 18 in two different ways to mask the respective outputs of $H$. The reason for this is to break symmetry. To see this, consider Fig. 3 with the mask applied to the second hash output replaced by $\beta$ itself. Then the output of the encryption algorithm applied to $(X_1, X_2, X_3)$ is equal to the output of the decryption algorithm applied to $(X_2, X_1, X_3)$, which leads to an easy distinguishing attack. Masking the output of the second hash call with a different function of $\beta$ prevents this situation. This effect can also be seen by studying the proof of Claim 4 given in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TC.2014.2366739. Instead of reusing $\beta$, one may choose to use two independent masks (each generated by the stream cipher) in Lines 13 and 18. Since the only requirement is to break symmetry, using independent masks is unnecessary and will lead to an increase in the amount of key material. So, we have chosen not to adopt this approach.
Message length. STES is defined only for fixed length messages. This is because our intended application is encryption of disks and flash memories, where the message is a sector and has a fixed length. It is possible to extend the construction to accommodate variable length messages. This can be achieved by appropriately padding the message and including the message length as an input of the hash computation in lines 13 and 18 of the encryption algorithm. This, however, results in slightly more computation than the fixed length case. Since variable length messages are not required in our context, we chose to cut out the additional complexity of padding and formatting from the algorithm.
Efficiency. The computationally costly operations that take place in the algorithm are the calls to the stream cipher and the hash functions. There is one call to the stream cipher from the main body of the algorithm to generate the hash keys and the other two calls are part of the function Feistel. It is to be noted that in real life stream ciphers are quite fast in generation of the outputs, but when a stream cipher is called on different initialization vectors then there is a significant time required for initialization. The three calls to the stream cipher required in STES are all on different initialization vectors. Hence, stream cipher initializations occupy a significant amount of the time required for STES.
The hash functions MLUH and PD can be implemented very efficiently in hardware with a proper choice of the data path $d$. The choice of $d$ dictates the amount of parallelism possible. Recall that the main goal of the construction is to enable a hardware realization which uses a small amount of hardware resources. A proper choice of the stream cipher and the data path can help in realizing a circuit with adequate throughput but with a small hardware footprint. These issues are discussed in detail in Section 5, where we demonstrate that STES meets the expected efficiency requirements both in terms of time and circuit size.
Parallelism. There exists ample scope to exploit parallelism in the construction of STES. In the hardware implementation that we present later, we decided to use two stream cipher cores. Our specific design choices give rise to an architecture where decryption is a bit faster than encryption. This characteristic of the design has some good practical implications, as read speeds in memory are faster than write speeds and it is expected that a typical block would be read many more times than it would be written.
SECURITY OF STES
In this section, we state the usual security theorem for STES. To do this, we need to introduce the appropriate notions of security of a stream cipher with IV and that of a tweakable enciphering scheme.
Pseudo-Random Function (PRF)
Let Dom and Ran be two non-empty finite sets of binary strings. Let $\{f_K\}$ be an indexed set of functions where $f_K : \mathrm{Dom} \to \mathrm{Ran}$ and $K$ is chosen from some index set $\mathcal{K}$. The notion of pseudo-random function is formalized in the following manner.
Let $K$ be chosen uniformly at random from $\mathcal{K}$. An adversary $A$ has to distinguish $f_K$ from a uniform random function $f^*$, where $f^*$ is chosen uniformly at random from the set of all functions from Dom to Ran. $A$ is a probabilistic algorithm which has oracle access to either $f_K$ or $f^*$. Suppose the oracle is $f_K$. The adversary $A$ submits queries $X_1, \ldots, X_q$ to the oracle and gets back the responses $f_K(X_1), \ldots, f_K(X_q)$. $A$ can make the queries in an adaptive manner, i.e., it can decide on the $i$th query after receiving the responses to the first $(i-1)$ queries. Without loss of generality, we will assume that the queries are distinct. At the end of the interaction, $A$ outputs a bit.
Let $\Pr[A^{f_K} \Rightarrow 1]$ denote the probability that $A$ outputs 1 after interacting with $f_K$. This probability is over the randomness of $f_K$ (arising from the random choice of $K$) as well as the randomness of $A$.
Similarly, let $A$ interact with $f^*$ and let $\Pr[A^{f^*} \Rightarrow 1]$ denote the probability that $A$ outputs 1 after interacting with $f^*$. The advantage of $A$ in distinguishing $f_K$ from the uniform random $f^*$ is defined as follows:

$$\mathrm{Adv}^{\mathrm{prf}}_f(A) = \Pr[A^{f_K} \Rightarrow 1] - \Pr[A^{f^*} \Rightarrow 1]. \tag{4}$$

Let $\mathrm{Adv}^{\mathrm{prf}}_f(t, q, s)$ be the supremum of the advantages of all adversaries running in time $t$, making $q$ queries and providing a total of $s$ bits in all their queries. The quantity $s$ is called the query complexity. Note that $\mathrm{Adv}^{\mathrm{prf}}_f(t, q, s)$ is never negative even though the quantity defined in (4) can sometimes be negative. The value of $\mathrm{Adv}^{\mathrm{prf}}_f(t, q, s)$ is called the PRF-bound for $f$.
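The definition of the advantage can be illustrated by an exact computation on a toy example. The sketch below is an assumption-laden illustration, not part of the analysis of STES: the family $f_K(x) = x \oplus K$ on 2-bit strings is a deliberately bad PRF, and the two distinguishers are hypothetical. The probabilities are computed by enumerating all keys and all functions, so no sampling is involved.

```python
from fractions import Fraction
from itertools import product

N_BITS = 2
DOMAIN = range(1 << N_BITS)  # Dom = Ran = all 2-bit strings
KEYS = range(1 << N_BITS)    # toy family f_K(x) = x XOR K


def distinguisher(oracle):
    """Output 1 iff the answers to queries 0 and 1 differ by exactly 1."""
    return 1 if oracle(0) ^ oracle(1) == 1 else 0


def flipped(oracle):
    """The same test with the output bit complemented."""
    return 1 - distinguisher(oracle)


def advantage(dist):
    """Pr[dist^{f_K} => 1] - Pr[dist^{f*} => 1], computed exactly."""
    real_hits = sum(dist(lambda x, k=k: x ^ k) for k in KEYS)
    real = Fraction(real_hits, len(KEYS))
    # A uniform random function is a uniformly chosen table of outputs.
    tables = list(product(DOMAIN, repeat=len(DOMAIN)))
    ideal_hits = sum(dist(lambda x, t=t: t[x]) for t in tables)
    ideal = Fraction(ideal_hits, len(tables))
    return real - ideal


print(advantage(distinguisher), advantage(flipped))
```

The first distinguisher always outputs 1 against $f_K$ but only with probability 1/4 against a random function, giving advantage 3/4; complementing its output gives $-3/4$, which shows why the signed quantity for a single adversary can be negative while the supremum over adversaries cannot.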
Stream Cipher With IV
Recall that for a key $K$ from the key space $\mathcal{K}$, a stream cipher with IV is a function $SC_K : \{0,1\}^\ell \to \{0,1\}^L$. The basic idea of security is that for a uniform random $K$ and for distinct inputs $IV_1, \ldots, IV_q$, the strings $SC_K(IV_1), \ldots, SC_K(IV_q)$ should appear to be independent and uniform random to an adversary. This is formalized by requiring a stream cipher to be a PRF. See [12] for further discussion on this issue. $\mathrm{Adv}^{\mathrm{prf}}_{SC}(t, q, s)$ denotes the PRF-advantage of SC against any adversary that runs in time $t$, makes $q$ queries and has query complexity $s$.
Tweakable Enciphering Scheme
Consider a $\mathrm{TES} = (E_K, D_K)_{K \in \mathcal{K}}$. Let $K$ be chosen uniformly at random and let an adversary $A$ be given access to the oracles $(E_K, D_K)$. A query to $E_K$ is a pair $(T, X)$ and a query to $D_K$ is a pair $(T, Y)$, where $T$ is a tweak, $X$ is a message and $Y$ is a ciphertext. Appropriate responses to the queries are provided to the adversary $A$.
Note that $A$ is allowed to make the queries in an adaptive manner, i.e., for the $i$th query it can decide on whether to send it to $E_K$ or $D_K$, and on the content of the query, based on the responses it has received to the previous $(i-1)$ queries. The restrictions are that no two queries to $E_K$ should be equal; no two queries to $D_K$ should be equal; if $Y$ has been obtained as a response to a query $(T, X)$ to $E_K$, then the query $(T, Y)$ to $D_K$ is not allowed; and similarly, if $X$ has been obtained as a response to a query $(T, Y)$ to $D_K$, then the query $(T, X)$ to $E_K$ is not allowed. Such queries are pointless, as the adversary already knows the answers to them. Let $\Pr[A^{E_K, D_K} \Rightarrow 1]$ be the probability that $A$ outputs 1 after interacting with the oracles $E_K$ and $D_K$.
For each tweak $T$, let $\Pi(T, \cdot)$ be a length preserving permutation chosen uniformly at random from the set of all length preserving permutations from Dom to Ran. By $\Pi$ we denote the collection $\{\Pi(T, \cdot)\}_{T \in \mathrm{Tweak}}$ of all these tweak-indexed uniform random length preserving permutations. For each $T$, let $\Pi^{-1}(T, \cdot)$ be the inverse of $\Pi(T, \cdot)$ and let $\Pi^{-1}$ be the collection $\{\Pi^{-1}(T, \cdot)\}_{T \in \mathrm{Tweak}}$.
Suppose $E_K$ and $D_K$ are replaced by $\Pi$ and $\Pi^{-1}$ respectively and consider the interaction of $A$ with $\Pi$ and $\Pi^{-1}$. Let $\Pr[A^{\Pi, \Pi^{-1}} \Rightarrow 1]$ be the probability that $A$ outputs 1 after such interaction. The advantage of $A$ is defined as follows:

$$\mathrm{Adv}^{\pm\widetilde{\mathrm{prp}}}_{\mathrm{TES}}(A) = \Pr[A^{E_K, D_K} \Rightarrow 1] - \Pr[A^{\Pi, \Pi^{-1}} \Rightarrow 1]. \tag{5}$$

Define $\mathrm{Adv}^{\pm\widetilde{\mathrm{prp}}}_{\mathrm{TES}}(t, q, s)$ to be the supremum of the advantages of all adversaries which run in time $t$, make a total of $q$ queries and send a total of $s$ bits in all the queries. Security of a scheme against an adversary which has access to both the encryption and the decryption oracles is called security as a strong pseudo-random permutation.
Suppose now that the oracles $E_K$ and $D_K$ are replaced by two oracles which return independent and uniform random strings on any input. More precisely, if $(T, X)$ is a query to $E_K$, then an independent and uniform random string of length equal to the length of $X$ is returned; if $(T, Y)$ is a query to $D_K$, then similarly, an independent and uniform random string of length equal to the length of $Y$ is returned. Let $\Pr[A^{\$, \$} \Rightarrow 1]$ be the probability that $A$ outputs 1 after such interaction. The advantage of $A$ is defined as follows:
$$\mathbf{Adv}^{\pm\mathrm{rnd}}_{\mathrm{TES}}(A) = \left| \Pr[A^{E_K, D_K} \Rightarrow 1] - \Pr[A^{\$, \$} \Rightarrow 1] \right|.$$
Define $\mathbf{Adv}^{\pm\mathrm{rnd}}_{\mathrm{TES}}(t, q, s)$ to be the supremum of the advantages of all adversaries which run in time $t$, make a total of $q$ queries and send a total of $s$ bits in all the queries.
The $\pm\mathrm{rnd}$ and $\pm\widetilde{\mathrm{prp}}$ advantages are related as follows:
For a proof of (7) see [20] , [31] .
Security Statement for STES
The following theorem specifies the security of STES. 
Here $t'$ is $t + t''$, where $t''$ is the time to process $q$ queries using STES[SC, H, fStr].
The theorem guarantees that if the stream cipher acts like a random function then, for any adversary which makes a reasonable number of queries, the advantage of distinguishing STES[SC, H, fStr] from a tweak-indexed family of length preserving permutations is small. The proof of the theorem consists of a standard game transition argument and a combinatorial analysis of some collision probabilities. The full proof is given in Appendix A, available in the online supplemental material.
HARDWARE IMPLEMENTATION OF STES
STES can be instantiated in various ways by plugging in different possibilities for the stream cipher and the hash function. There are, however, some common design ideas. The goal of this section is to describe the basic ideas behind the different implementations. Since we have implemented a number of designs, it is not possible to describe the details of all the designs. Neither is this necessary. From our descriptions of certain specific designs, it is possible to understand the details of the other designs. Results, however, are presented for all the implementations and these are consistent with our design goals of low area/power implementations.
Below we describe our basic design decisions. Then come the descriptions of the individual implementations of the different hash functions and the stream ciphers, followed by the description of the STES implementation and the data flow and timing analysis.
Basic Design Decisions
The important design decisions are the following.
Message length. Our target is encryption of fixed size blocks. So, STES has been designed for fixed length messages. In particular, we consider the message length to be 512 bytes. This value matches the current size of memory blocks. It is to be observed that the design philosophies are quite general and can be scaled suitably for other message lengths.
Stream cipher. We chose the following stream ciphers: Grain128 [34], Trivium [17] and Mickey128 2.0 [9]. These are the eStream [3] finalists among the hardware oriented designs. There are several works in the literature which report compact hardware implementations of these ciphers [15], [23], [26].
There are different ways to implement these ciphers with varying amounts of hardware cost. In particular, Grain128 and Trivium are amenable to parallelization: one can design hardware which gives an output of only one bit per clock as in [15], or exploit the parallelization and widen the data path to give more throughput at the cost of more hardware as in [34] and [17]. For instantiations with Grain128 and Trivium we tried different data paths for the stream ciphers and thus implemented multiple designs which provide a wide range of throughput. As mentioned in [35], there exists no trivial way to parallelize Mickey. So, our implementations of Mickey use a data path of one bit. The various parameters considered for implementing the stream ciphers are shown in Table 1.
Hash function. The main component required to implement MLUH or PD is a finite field multiplier. In Section 2.2, we describe MLUH parameterized on the data path $d$, which signifies that the multiplications in MLUH with data path $d$ take place in the field $\mathrm{GF}(2^d)$. We consider values of $d$ equal to 4, 8, 16, 32 and 40. The corresponding irreducible polynomials used to perform field multiplications are given in Table 2. The number of multipliers used for implementing the MLUH varies with the value of the data path.
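As a reminder of what a multiplier in $\mathrm{GF}(2^d)$ computes, the following Python sketch performs a bit-serial shift-and-add multiplication with reduction. It is only a software model, not the hardware multiplier of the paper. The reduction polynomials used below ($x^8 + x^4 + x^3 + x + 1$ and $x^4 + x + 1$) are well-known irreducible polynomials chosen here as assumptions; the polynomials of Table 2 may differ.

```python
def gf_mul(a, b, d, poly):
    """Multiply a and b in GF(2^d).

    poly is the reduction polynomial written as an integer that includes
    the x^d term (e.g. 0x11B for x^8 + x^4 + x^3 + x + 1).
    """
    result = 0
    for _ in range(d):
        if b & 1:            # add the current multiple of a if this bit of b is set
            result ^= a
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> d:           # degree reached d: reduce modulo poly
            a ^= poly
    return result


# d = 8 with the (assumed) polynomial 0x11B, d = 4 with the (assumed) polynomial 0x13
print(hex(gf_mul(0x57, 0x83, 8, 0x11B)))
print(gf_mul(0b0010, 0b1000, 4, 0x13))
```

A hardware multiplier evaluates the same function combinationally (or in a few pipeline stages) rather than in $d$ sequential steps.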
The case of PD is a bit curious. For implementing PD with data path $d$ we use multipliers in $\mathrm{GF}(2^{d/2})$. This decision was taken to make the design suitable for STES. Note that in the case of PD each multiplication requires two blocks of message and key material, which is not the case in MLUH. Since the message and key blocks are generated on the fly, this design helps in preventing data stalls in the circuit. This issue is discussed in more detail in Section 5.2.
Target FPGA. We target our designs for Xilinx Spartan 3 and Lattice ICE40 FPGAs. The rationale is that these are considered suitable for implementing hardware/power constrained designs. Moreover, they are cheap and one can consider deploying these FPGAs directly into a commercial device. In particular, in Spartan 3 the LUTs within a SLICEM can be used as a 16 × 1-bit distributed RAM or as a 16-bit shift register (SRL16 primitive). This functionality of a 16-bit shift register has been previously exploited to achieve compact implementations of stream ciphers [15]. We follow this suggestion.
The ICE40 FPGAs do not provide such functionality. On the other hand, their architectural design specifically supports low power implementations. Our experimental results also suggest that they are much more competitive in this respect compared to Spartan 3 devices. To the best of our knowledge, there is no previous work reporting stream cipher implementations on this class of FPGAs.
Implementation of the Universal Hash Function
Implementations of the two hash functions MLUH and PD are described separately. In both cases, the size of the digest is equal to $\ell$, which is the size of the IV of the stream cipher.
Design of MLUH. An MLUH with data path $d$ denotes that the multiplications are performed in $\mathrm{GF}(2^d)$. We choose $d$ such that $d$ divides $\ell$ and as before, we denote $b = \ell/d$. For convenience of exposition, we recall the description of MLUH: for message blocks $M_1, \ldots, M_m$ and key blocks $K_1, \ldots, K_{m+b-1}$, the output is $(h_1, \ldots, h_b)$ with
$$h_j = M_1 K_j \oplus M_2 K_{j+1} \oplus \cdots \oplus M_m K_{m+j-1}, \qquad 1 \le j \le b.$$
The basic strategy for the above computation is to apply $b$ different multipliers. For a fixed $\ell$, since $b = \ell/d$, as $d$ grows, the value of $b$ decreases. The computation proceeds column-wise. The $b$ multiplications $M_1 K_1, M_1 K_2, \ldots, M_1 K_b$ form the first column. Since there are $b$ multipliers, these can be done in parallel. The results of these multiplications are stored separately. In the next step, the products $M_2 K_2, M_2 K_3, \ldots, M_2 K_{b+1}$ are computed in parallel and these results are xor-ed with the previous results. This is continued until all columns have been computed.
In Fig. 4, we showcase the above strategy of computing MLUH with a specific architecture for $d = 8$ and $\ell = 80$, so that $b = 10$, i.e., there are 10 multipliers where each multiplier can multiply two elements of $\mathrm{GF}(2^8)$. The whole architecture consists of ten 8-bit registers, ten multipliers and ten 8-bit accumulators. All the registers are connected in cascade, forming a 10-stage first in first out (FIFO) structure with parallel access to all states. In Fig. 4, the registers are labeled regk1, regk2, ..., regk10. These registers are used to store ten 8-bit blocks of the key. Each multiplier takes one of its inputs from the FIFO structure and the other directly from the input line $m_i$. Initially all registers in the FIFO and all accumulators hold the value zero.
The FIFO is fed with the key blocks $K_1, K_2, \ldots$, one in each cycle, through the input line labeled $k_i$ in the figure. After ten clock cycles, the FIFO is full, i.e., the registers contain the key blocks $K_1, K_2, \ldots, K_{10}$, the input line $m_i$ carries the message block $M_1$, and the multiplications in the first column of MLUH are performed. Then each product is accumulated in the respective accumulator. In the next clock, the FIFO contains the key blocks $K_2, \ldots, K_{11}$ and the input line $m_i$ carries the message block $M_2$; the second column of multiplications is computed and these results are accumulated in the respective accumulators. This is continued until all the columns have been processed. The final output of MLUH is obtained by concatenating the final values in the accumulators. The control unit is not shown in the figure; it consists of a counter and some comparators.
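The column-wise schedule can be mimicked in software. The sketch below is a behavioural model under stated assumptions, not the authors' hardware description: it takes the hash to be $h_j = \bigoplus_i M_i K_{i+j-1}$ as read from the FIFO description above, uses $d = 8$, and uses the polynomial $x^8 + x^4 + x^3 + x + 1$ as a placeholder for the one in Table 2. The function names are hypothetical.

```python
def gf_mul(a, b, d=8, poly=0x11B):
    """Bit-serial multiplication in GF(2^d); poly is an assumed placeholder."""
    r = 0
    for _ in range(d):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> d:
            a ^= poly
    return r


def mluh_direct(msg, key, b):
    """Reference: h_j = XOR over i of M_i * K_{i+j-1} (0-based: key[i + j])."""
    out = []
    for j in range(b):
        h = 0
        for i, m in enumerate(msg):
            h ^= gf_mul(m, key[i + j])
        out.append(h)
    return out


def mluh_fifo(msg, key, b):
    """Column-wise evaluation with a b-stage key FIFO and b accumulators."""
    fifo = [0] * b                      # regk1 .. regkb, initially zero
    acc = [0] * b                       # one accumulator per multiplier
    key_stream = iter(key)
    for _ in range(b):                  # b cycles to fill the FIFO
        fifo = fifo[1:] + [next(key_stream)]
    for i, m in enumerate(msg):         # one column per message block
        for j in range(b):              # the b multipliers work in parallel
            acc[j] ^= gf_mul(m, fifo[j])
        if i + 1 < len(msg):            # shift in the next key block
            fifo = fifo[1:] + [next(key_stream)]
    return acc


msg = list(range(1, 21))                        # 20 message blocks of 8 bits
key = [(7 * i + 3) % 256 for i in range(29)]    # m + b - 1 = 29 key blocks
digest = mluh_fifo(msg, key, 10)                # ten 8-bit blocks, i.e. 80 bits
```

The model makes the key requirement visible: hashing $m$ blocks to $b$ blocks consumes $m + b - 1$ key blocks and uses $b$ multiplications per column.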
Design of PD. For hashing an $m$-block message to a $b$-block output, where each block is $d$ bits long, MLUH requires $mb$ multiplications in $\mathrm{GF}(2^d)$. In comparison, PD requires only $mb/2$ multiplications (assuming $m$ is even). So, the total number of multiplications required by PD is about half that of MLUH. This suggests that PD should be the design of choice. But that is not necessarily true, at least in our context, as we describe below. Each multiplication in PD is of the form $(M_i \oplus K_j)(M_{i+1} \oplus K_{j+1})$. So, performing this multiplication requires two blocks of message and key material. In contrast, for MLUH a multiplication can be performed when one block of message and one block of key are available. This difference is important.
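The multiplication count can be checked with a small software model. The sketch below is an assumption-laden illustration: the source only fixes the shape $(M_i \oplus K_j)(M_{i+1} \oplus K_{j+1})$ of each product, so the choice of shifting the key by two blocks for each successive output block, the field size and the reduction polynomial are all assumptions made here, and the name `pd_hash` is hypothetical.

```python
def gf_mul(a, b, d=8, poly=0x11B):
    """Bit-serial multiplication in GF(2^d); poly is an assumed placeholder."""
    r = 0
    for _ in range(d):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> d:
            a ^= poly
    return r


def pd_hash(msg, key, b):
    """Pseudo-dot-product hash of an even number of blocks to b blocks.

    Returns (digest, number_of_multiplications). The key offset 2*j per
    output block is an assumption of this sketch.
    """
    assert len(msg) % 2 == 0, "PD pairs up message blocks"
    digest, mults = [], 0
    for j in range(b):
        h = 0
        for i in range(0, len(msg), 2):
            left = msg[i] ^ key[i + 2 * j]
            right = msg[i + 1] ^ key[i + 1 + 2 * j]
            h ^= gf_mul(left, right)
            mults += 1
        digest.append(h)
    return digest, mults


m, b = 20, 10
msg = list(range(1, m + 1))
key = [(11 * i + 5) % 256 for i in range(m + 2 * (b - 1))]
digest, mults = pd_hash(msg, key, b)
print(mults, m * b)   # PD multiplications versus the m*b needed by MLUH
```

Each product consumes two message blocks and two key blocks at once, which is the property that drives the design choice discussed next.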
In STES, PD is to be used in conjunction with a stream cipher. For the second hashing in the encryption algorithm, the message and key blocks to be used by PD are obtained as an output from the stream cipher. To keep the multipliers in step with the rate at which these blocks arrive, we decided to construct a PD with $d/2$-bit multipliers when the message and key blocks are considered to be $d$-bit blocks. Here we showcase an architecture for PD with 4-bit multipliers which produces an 80-bit output. Later, we use this PD construction with a stream cipher with an 8-bit data path to construct STES.
We implement the PD as shown in Fig. 5. The methodology adopted is the same as in the case of the architecture of MLUH (shown in Fig. 4), in the sense that here also we compute column-wise. But as we use 4-bit multipliers, to get an 80-bit output we require 20 multipliers. We assume that the key blocks ($K_j$) and message blocks ($M_i$) are obtained as 8-bit blocks, but we treat each multiplication as (M ...
This 4-bit design of the PD is not more efficient than the MLUH architecture. Although the 4-bit multipliers used in PD are smaller than the 8-bit multipliers, we require twice as many of them and twice as many registers; together with the extra xors required at the inputs of the multipliers, this makes the PD architecture marginally more costly than the MLUH architecture.
Our experiments (presented later) show PD to be more efficient in terms of area for higher data paths. This can be explained by an intuitive argument. The asymptotic area complexity of a fully parallel $k$-bit multiplier (like the Karatsuba multiplier) is super-linear in $k$ [33]; thus the total area of a $k$-bit multiplier is expected to be larger than that of two $k/2$-bit multipliers for large values of $k$.
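The super-linear growth can be illustrated by counting the single-bit AND operations of a recursive Karatsuba multiplication of binary polynomials. This is a rough proxy only: it ignores the xor gates and the modular reduction, so it is not an area estimate for the actual multipliers, and the function names are hypothetical.

```python
def clmul(a, b):
    """Schoolbook carry-less (GF(2)[x]) multiplication, used as a reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_gf2(a, b, k):
    """Multiply two k-bit binary polynomials (k a power of two).

    Returns (product, and_count), where and_count is the number of
    1-bit AND operations used by the recursion.
    """
    if k == 1:
        return a & b, 1
    h = k // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo, c0 = karatsuba_gf2(a0, b0, h)
    hi, c1 = karatsuba_gf2(a1, b1, h)
    mid, c2 = karatsuba_gf2(a0 ^ a1, b0 ^ b1, h)
    mid ^= lo ^ hi                       # recovers a0*b1 + a1*b0
    return (hi << k) ^ (mid << h) ^ lo, c0 + c1 + c2


for k in (4, 8, 16, 32):
    _, ands_k = karatsuba_gf2((1 << k) - 1, (1 << k) - 1, k)
    _, ands_half = karatsuba_gf2((1 << (k // 2)) - 1, (1 << (k // 2)) - 1, k // 2)
    print(k, ands_k, 2 * ands_half)      # one k-bit multiplier vs two k/2-bit ones
```

Under this count one $k$-bit multiplier costs three times a $k/2$-bit multiplier, whereas PD needs only two of the smaller ones in its place, which is in line with the intuitive argument above.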
The total number of clock cycles required to compute PD is also a bit more than for MLUH. This is due to the increased amount of key material necessary for computing PD. For the specific architecture described in Fig. 5, the twenty registers in the FIFO have to be filled before the first column of PD can be computed, and this implies an initial delay of 20 cycles if we assume that we obtain 8 bits of key material in each cycle, whereas the initial delay in the case of MLUH is only 10 cycles.
The multipliers used in both PD and MLUH are Karatsuba multipliers, implemented following the same design strategy as presented in [19]. The irreducible polynomials used to implement the multipliers are listed in Table 2. These multipliers are smaller than the ones presented in [19] as they operate on smaller operands. To keep the speed high, and since there are no dependencies between multiplications in MLUH and PD, the multipliers for $d > 4$ were divided into almost balanced pipeline stages after a careful re-timing process; the specific number of stages used in each implementation is reported in Tables 3 and 4.
Implementation of Stream Ciphers
In this work we consider three stream ciphers: Trivium, Grain128 and Mickey128-2.0. These are the eStream hardware oriented stream ciphers and are in general very easy to implement in hardware, as they are constructed from simple structures like shift registers and simple Boolean functions. All three can be implemented using a shift register as the basic primitive.
We implement the stream ciphers using various data paths; here by a data path we mean the number of bits of output the stream cipher can produce in each clock cycle. A lower data path uses less parallelism and thus can be implemented with fewer hardware resources. The various data paths that we consider for the three stream ciphers, along with some other important parameters, are given in Table 1.
For all three stream ciphers, the bit-wise versions (i.e., the ones with a data path of one bit) can be implemented in a very compact way in Spartan-3 devices. Spartan-3 FPGAs can configure the Look-Up Table (LUT) in a SLICEM slice as a 16-bit shift register without using the flip-flops available in each slice. Shift-in operations are synchronous with the clock, and the output length is dynamically selectable. A separate dedicated output allows the cascading of any number of 16-bit shift registers to create whatever size of shift register is needed. Each configurable logic block can be configured using four of the eight LUTs as a 64-bit shift register. Such a usage of the LUT in Spartan-3 is called an SRL16 primitive [50]. This SRL16 primitive can be used to implement the shift registers of the stream ciphers [15]. SRL16 supports only a single entry and a single bit shift, so if the data path is 1 then this primitive can be directly used. For higher data paths the synthesizer also accommodates parts of the design within the available SRL16 blocks, thus giving rise to very compact designs in Spartan 3.
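The behaviour described above (single-bit shift-in, selectable output length, cascading through a dedicated output) can be modelled in a few lines. The sketch below is a behavioural Python model only; the class `SRL16` and the helper `clock_chain` are hypothetical names introduced here and do not reproduce the vendor primitive's port list.

```python
class SRL16:
    """Behavioural model of a 16-bit shift register with a selectable tap."""

    def __init__(self):
        self.bits = [0] * 16

    def clock(self, d_in):
        """Shift in one bit; return the bit leaving the last stage (cascade output)."""
        out = self.bits[-1]
        self.bits = [d_in] + self.bits[:-1]
        return out

    def tap(self, addr):
        """Read the bit that is addr + 1 shifts old (addr in 0..15)."""
        return self.bits[addr]


def clock_chain(chain, d_in):
    """Clock a cascade of SRL16 models once; return the bit leaving the chain."""
    for reg in chain:
        d_in = reg.clock(d_in)
    return d_in


# Four cascaded 16-bit registers behave as one 64-bit shift register.
chain = [SRL16() for _ in range(4)]
outputs = [clock_chain(chain, 1 if t == 0 else 0) for t in range(64)]
last_stage = chain[3].tap(15)   # the single 1 has reached the 64th position
```

A single-entry, single-shift structure of this kind matches a 1-bit data path directly, which is why the bit-wise cipher versions map onto it so compactly.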
We implemented the stream ciphers with all data paths specified in Table 1. We did not implement Mickey128-2.0 with data paths larger than 1, as such parallelization in Mickey is not straightforward to obtain.
Here, as an example, we explain in detail a specific architecture of Trivium with a 2-bit data path. The internal state of Trivium is a 288-bit shift register; for implementation purposes it is divided into three registers SR1, SR2 and SR3 as shown in Fig. 6. All three shift registers have two inputs and two outputs, and in each clock cycle their internal states are shifted by two positions. Initially SR1 and SR2 are loaded with the 80-bit key $K$ and the 80-bit IV respectively. SR3 has as initial value the string $1^3 \| 0^{108}$. In Fig. 6 it can be seen that for each shift register the feedback function depends on some bits of the register itself and on a function computed from some bits of the previous register. For example, the feedback of shift register SR3 depends on some bits of SR2 and two bits of SR3 itself. It is easy to see in Fig. 6 that the feedback functions for all registers and the function computing the final outputs $S_{even}$ and $S_{odd}$ are replicated twice, only with different inputs. In the case of Grain128, widening the data path also consists of replicating the feedback functions of the shift registers and the output function. Increasing the data path of Grain128 and Trivium brings a significant increase in throughput, since it reduces the time used for setup and gives a parallel output for the stream.
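The replication idea can be checked in software: two copies of the round functions, with taps displaced by one position, produce the same two output bits and the same next state as two single-bit steps. The sketch below follows the bit numbering of the public Trivium specification (state bits $s_1, \ldots, s_{288}$ stored at list indices 0 to 287), which may be mirrored relative to the layout of Fig. 6; the function names and the mapping of the two outputs to $S_{even}$ and $S_{odd}$ are assumptions of this sketch.

```python
def _round(s, j=0):
    """Output bit and three feedback bits, with all taps moved j positions back."""
    t1 = s[65 - j] ^ s[92 - j]
    t2 = s[161 - j] ^ s[176 - j]
    t3 = s[242 - j] ^ s[287 - j]
    z = t1 ^ t2 ^ t3
    t1 ^= (s[90 - j] & s[91 - j]) ^ s[170 - j]
    t2 ^= (s[174 - j] & s[175 - j]) ^ s[263 - j]
    t3 ^= (s[285 - j] & s[286 - j]) ^ s[68 - j]
    return z, t1, t2, t3


def step_1bit(s):
    """One clock of the 1-bit data path: every register shifts by one."""
    z, t1, t2, t3 = _round(s)
    return [t3] + s[0:92] + [t1] + s[93:176] + [t2] + s[177:287], [z]


def step_2bit(s):
    """One clock of the 2-bit data path: replicated functions, shift by two."""
    za, t1a, t2a, t3a = _round(s, 0)
    zb, t1b, t2b, t3b = _round(s, 1)
    s = ([t3b, t3a] + s[0:91] + [t1b, t1a] + s[93:175]
         + [t2b, t2a] + s[177:286])
    return s, [za, zb]


def trivium_init(key_bits, iv_bits):
    """Load the 80-bit key and IV and run the 4 * 288 initialization clocks."""
    s = key_bits + [0] * 13 + iv_bits + [0] * 4 + [0] * 108 + [1, 1, 1]
    for _ in range(4 * 288):
        s = step_1bit(s)[0]
    return s


key = [(i * i + i // 3) % 2 for i in range(80)]   # arbitrary demonstration bits
iv = [(i // 2 + i // 5) % 2 for i in range(80)]
state = trivium_init(key, iv)

a, out_1bit = list(state), []
for _ in range(64):
    a, z = step_1bit(a)
    out_1bit += z

b, out_2bit = list(state), []
for _ in range(32):
    b, z = step_2bit(b)
    out_2bit += z
```

The displacement by one position is valid because none of the taps sits at the first position of a register, so the bit a tap would read after one shift is simply its neighbour in the current state.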
Implementation of STES
We implemented STES with all three stream ciphers with the data paths specified in Table 1. When we consider a stream cipher with data path $d$ in implementing STES, we use the hash function with the same value of the data path, i.e., we use multipliers in $\mathrm{GF}(2^d)$ if the hash function is MLUH, and if it is PD then the multipliers are in $\mathrm{GF}(2^{d/2})$.
We explain in detail an 8-bit data path implementation using Trivium and MLUH; for other instantiations of the stream cipher and hash the basic design remains the same. Note that Trivium uses an 80-bit IV and an 80-bit key.
In Fig. 7 we show the generic architecture for encrypting/decrypting with STES. We explain the architecture with reference to the algorithm of STES (Fig. 1) and the Feistel network (Fig. 2).
The circuit presented in Fig. 7 consists of the following basic elements:
1) The MLUH constructed with 8-bit multipliers as discussed in Section 2.2. In the diagram this component is labeled MLUH.
2) Two stream cipher cores labeled SC1 and SC2.
3) Two 80-bit registers RegH1 and RegH2 which are used to store the output of MLUH.
4) Four registers labeled regF1, regF2, regKh and regb. All these registers are 80 bits long and are formed by ten 8-bit registers connected in cascade, so that they can be used as a FIFO queue. The same structure was used in the design of MLUH and PD.
5) One special register regb1 which is able to store 80 bits of data and rotate them by one bit position. This register outputs 8 bits of data in each clock cycle when the control input ce is activated.
6) Seven multiplexers labeled 1, 2, 3, 4, 5, 6 and 7.
7) The control unit, whose details are not shown in the figure.
8) The connections between MLUH and the registers RegH1 and RegH2, which have a data path of 80 bits. All other connections have a data path of 8 bits.
9) The input lines $M_i$, IV and K, which receive the data and tweak, the initialization vector and the key respectively.
10) The output line $C_i$, which outputs the ciphertext.
The MLUH block computes the hash: it receives as inputs message blocks $M_i$, tweak blocks $T_i$ and key blocks $K_i$, and gives the result of MLUH on its output port S. The registers RegH1 and RegH2 receive the output from S as input; in this case $|S| = 80$ bits. RegH1 and RegH2 are designed to give 8-bit blocks as outputs in each clock cycle on their output port BO. The MLUH receives its input from the 3 × 1 multiplexer labeled 1. Notice that in the algorithm of STES, the MLUH is called on three different inputs. Multiplexer 1 helps in selecting these inputs. In the algorithm MLUH is called on two different keys $\tau'$ and $\tau''$; thus, MLUH can receive the key from two different sources: the key $\tau'$ is received directly from the output of the stream cipher SC1 or SC2, and the key $\tau''$ is received either directly from stream cipher SC1 or from the register regKh, which is used to store $\tau''$. To accommodate this selection of keys, the input port Ki of MLUH receives its input from the 2 × 1 multiplexer 5.
We use two stream ciphers SC1 and SC2. Both take the key from the input line K of the circuit. SC1 receives its IV from multiplexer 2, which selects between the input line IV and $F_1$. Multiplexer 3 feeds the IV to the stream cipher SC2; it selects between IV and $F_2$.
In the algorithm of STES we can see that the output of MLUH is xored with either b or the one-bit-rotated copy of b (cf. register regb1), depending on which hash ($Z_1$ or $Z_2$) is being computed and whether encryption or decryption is being executed. The selection between these two values is made with multiplexer 7.
In the encryption mode the stream W is generated using SC2, but in the decryption mode it is generated by SC1. Multiplexer 6 is used to select the correct stream cipher to produce the ciphertext or plaintext.
The data flow and timing analysis of the architecture in Fig. 7 are provided in Appendix B, available in the online supplemental material.
EXPERIMENTAL RESULTS
We implemented STES on two different families of FPGAs: Lattice ICE40 and Xilinx Spartan 3. For Spartan 3 we used the device xc3s400-fg456 and in case of Lattice ICE40 we selected LP8KCM225. The place and route results in case of Spartan 3 were generated using Xilinx-ISE version 10.1. For ICE40 we used Silicon Blue Tech iCEcube release 2011.12.19577. We measured the power consumption of the circuits using
For our implementations we report performance in terms of throughput, area and power consumption. For Spartan 3 we report area in terms of the number of slices and for ICE40 in terms of the number of logic cells. It is to be noted that the size of a Spartan 3 slice is almost equal to twice the size of an ICE40 logic cell.
In this section we present the experimental results in two parts. First in Section 6.1 we report performance data of the primitives, i.e., the stream ciphers and the hash function and in Section 6.2 we report results of STES using various instantiations of the primitives.
Primitives
Tables 3 and 4 present the performance of MLUH and PD, respectively. For all the cases shown in the tables, we consider hashing a message of 4,016 bits to 80 bits. We also computed MLUH and PD with 96-bit outputs; these results are similar and hence are not shown. Note that we consider only data paths which divide the output length. Hence, for 80-bit outputs (which are used with Trivium) we did not implement the hash functions with a 32-bit data path; similarly, for 96-bit outputs (which are used with Grain) we did not implement the versions with a 40-bit data path.
It is clear from the tables that in both Spartan 3 and ICE40 the throughput increases with the data path, at the cost of area. For MLUH-1b, we obtain a very high frequency, as in this case the multiplier is only an AND gate. As the data path increases, the complexity of the circuit implementing the multiplication grows, which lengthens the critical path of the circuit. For the 16-bit and 40-bit implementations we break the critical path of the multiplier by dividing it into balanced pipeline stages; the number of pipeline stages was carefully selected to maintain a high operating frequency. This is the reason why all 8, 16 and 40-bit implementations operate at similar frequencies on both Spartan 3 and ICE40.
In Table 4 , the performance of PD is shown. For 4-bit data paths, the size of PD is a bit bigger than MLUH but for bigger data paths the size of PD is smaller. This conforms to the argument that we provided in Section 5.2. As explained in Section 5.2, the number of cycles required to compute PD is also marginally more than that of MLUH. Moreover, due to the more complex circuitry of PD it operates lower frequencies and thus achieves lesser throughput than MLUH.
For both MLUH and PD we can see that the number of logic cells required on the ICE40 FPGA is almost double the number of slices required on Spartan 3. It is to be noted that a logic cell in ICE40 has far fewer components than a Spartan 3 slice, which explains the difference in area between the two families. Moreover, the ICE40 implementations operate at slightly lower frequencies than the Spartan 3 implementations; this can also be explained by the fact that, since an ICE40 logic cell has fewer components, the critical paths of the ICE40 implementations pass through more logic resources.
As mentioned above, Tables 3 and 4 present performance data for hashing when the output is 80 bits. In our design, the value of the data path has to divide the bit size of the output. This is the reason these two tables do not provide performance data for the 32-bit data path. We also ran experiments with 96-bit output and a 32-bit data path. The results are as follows: MLUH-32b requires 1,248 slices, operates at a frequency of 174.75 MHz, has three pipeline stages and achieves a throughput of 4,529.64; the corresponding figures for PD-32b are 905, 170.48, 2 and 4,244.27. As expected, the area required by PD is smaller and the throughput is also slightly lower.
In Table 5 , we present the performance data of Trivium, Grain and Mickey with various data paths. In the tables the names of the stream ciphers are suffixed with the data path.
The bit-wise implementations of Trivium and Grain128 on Spartan 3 were done using SRL16 primitives and this allowed us to obtain very compact designs: 49 slices for Trivium-1b and 67 slices for Grain128-1b. Grain128-1b is larger than Trivium-1b due to the complexity of its output and feedback functions. Mickey128 was implemented only with a one-bit data path because there is no direct way to parallelize it.
The data in Table 5 show that increasing the data path does not have much effect on the total area of Grain128 and Trivium. For example, Trivium-8b requires 148 slices and Trivium-16b requires 203 slices. Though one would expect that doubling the data path would require double the hardware resources, that is not the case. The growth in area is small because in Trivium the state is stored in a 288-bit shift register independent of the size of the data path. For wider data paths we only need to replicate the output and the feedback functions a suitable number of times.
Wider data path implementations of Grain128 behave in the same way as the implementations of Trivium. As Grain128 has a 96-bit IV, and we require that the data path must divide the IV length, we do not implement Grain128 with a 40-bit data path, which we do for Trivium.
Experimental Results on STES
Using the primitives described in Section 6.1 we construct STES. The performance results are shown in Tables 6 and 7. The tables show data for STES implemented with various stream cipher instantiations and data paths. The tables also show the power consumption characteristics of the implementations. Note that we did not include PD implementations for data paths of less than 4 bits, as our specific design of PD does not allow 1-bit data paths and for 2-bit data paths there is no advantage of PD over MLUH.
From Table 6 we can observe the following:
1) Among the one-bit data path implementations, STES with Trivium achieves the smallest area and STES with Mickey is the fastest, closely followed by STES with Grain128. STES[Grain,MLUH]-1b has the best throughput per area metric.
2) The implementations which use Grain128 are in general faster than the ones using Trivium, because the implementations with Grain need fewer clock cycles than the implementations with Trivium.
3) For higher data paths, the constructions with PD have smaller area than the constructions with MLUH. This is consistently observed for $d > 8$. For 4-bit data path implementations the MLUH based constructions are smaller, and STES[Grn,ML]-8b is also slightly smaller than STES[Grn,PD]-8b.
4) The constructions with PD operate at a slightly lower frequency than the MLUH based constructions; this is due to the fact that PD has more complex circuitry. Moreover, PD based constructions take $4|IV|/d$ cycles more than the MLUH based constructions. Thus the PD based constructions have a lower throughput.
In Table 7 we present the experimental results for implementations of STES on ICE40. The comparative behavior reflected in Table 7 is almost the same as the behavior of the implementations on Spartan 3 shown in Table 6. In general the implementations on ICE40 are slower than the implementations on Spartan 3, but the power consumption on ICE40 is much better. In particular we observed that the static power consumption in ICE40 remains constant for all variants. This is probably due to the fact that ICE40 was specifically designed for low power applications, hence its architecture has special characteristics which allow it to run with a very low power consumption. The throughput/power (TPP) metric for all constructions is significantly better for ICE40.
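To give a feel for the size of the cycle overhead of the PD based constructions, the expression $4|IV|/d$ can be evaluated for the IV lengths mentioned in the text (80 bits for Trivium, 96 bits for Grain128). The data paths chosen below are examples; the symbol $\Delta_{\mathrm{PD}}$ is introduced here only for this illustration.

```latex
\[
\Delta_{\mathrm{PD}} = \frac{4\,|IV|}{d}
\]
\[
\text{Trivium } (|IV| = 80):\quad d = 8 \Rightarrow \Delta_{\mathrm{PD}} = 40,
\qquad d = 16 \Rightarrow \Delta_{\mathrm{PD}} = 20
\]
\[
\text{Grain128 } (|IV| = 96):\quad d = 8 \Rightarrow \Delta_{\mathrm{PD}} = 48,
\qquad d = 16 \Rightarrow \Delta_{\mathrm{PD}} = 24
\]
```

The overhead therefore shrinks in proportion to the data path, which is consistent with PD becoming more attractive for wider data paths.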
Comparison with Block Cipher Based Constructions
As mentioned earlier, STES is strongly motivated by the construction presented in [45]. Here we present some results and estimations on that construction. The construction in [45] does not use a stream cipher; it uses a block cipher in counter mode to do the bulk encryption, and the suggested hash functions are either normal polynomial hashes or BRW polynomials. The performance of the construction in [45] when implemented using an AES with a 128-bit key and a normal polynomial hash is shown in Table 8. The table reports four implementations, which are described below:
TES-AESs-1s. TES in [45] implemented with a sequential AES128 and a fully parallel 128-bit Karatsuba multiplier in Spartan 3.
TES-AESs-4s. Sequential AES with one 4-stage pipelined 128-bit multiplier implemented in Spartan 3.
TES-AESp-4s. 10-stage pipelined AES with a 4-stage pipelined 128-bit multiplier implemented in Virtex 5.
TES-sAES-1s. This is an estimation based on a very compact AES reported in [41], and a polynomial hash which uses four 32-bit multipliers as used in the case of MLUH32b. The estimation is based on the data in [41] that the AES occupies 167 slices and takes 42 cycles to produce a single block of ciphertext. The estimated number of slices is obtained by summing the slices of the components, and the frequency is estimated by assuming that the critical path of the circuit is given by the component with the longest critical path. Real implementations may change these figures.
The results in Table 8 show that the implementations of the TES described in [45] with a sequential AES in Spartan 3 take up much more area than our designs with stream ciphers, and that our designs with data paths of more than 8 bits achieve higher throughput with smaller area and lower power consumption.
TES-AESp-4s is a huge design and it does not fit in a Spartan 3 device (note that the slices in Virtex 5 have many more resources than the slices in Spartan 3, so the slice counts in these two families are not directly comparable). TES-AESp-4s achieves throughput similar to the designs of HMCH[Poly] and HEH[Poly] reported in [19], but it cannot in any sense be considered a lightweight design. The performance of TES-AESp-4s does show, however, that the TES in [45] can achieve quite high throughput.
The design philosophy adopted in TES-sAESs-1s is probably the one best comparable to our stream cipher based designs, as in TES-sAESs-1s we intend to use a very compact AES. The estimation shows that such an implementation would also occupy quite a large area without achieving a good throughput.
Use of lightweight block ciphers. In recent years there have been numerous proposals for lightweight block ciphers such as PRESENT [14], KATAN, KTANTAN [16], KLEIN [25] and LED [27]. These block ciphers are designed to minimize the hardware resources required to implement them. In a generic description of a block cipher based TES any secure block cipher can be used; thus there is no technical difficulty in plugging a lightweight block cipher into an existing description of a TES, and this would lead to a low cost design compared to the AES alternatives that we just discussed. It should be noted, however, that lightweight block ciphers are mainly designed for specific applications such as RFID authentication, and are not designed for bulk encryption modes of operation. The lightweight block ciphers have small block lengths; for example, all the schemes mentioned above have a block length of 64 bits or lower. Such small block lengths restrict their use in a TES, as the block length of the block cipher used in a block cipher based TES is an important security parameter. Recall that all existing block cipher based TES have a security bound of the form $c s^2 / 2^n$, where $s$ is the query complexity of the adversary, $c$ is a small constant and $n$ is the block length of the underlying block cipher. Thus, the security guarantees provided by the known reductions are not sufficient if $n$ is less than or equal to 64. Very recently, two lightweight block cipher families called SIMON and SPECK [10] have been proposed. Both these families include a specification which uses a 128-bit block size. These block ciphers can be interesting in the context of designing low cost block cipher based TES and, provided they survive cryptanalysis, designing TES around them may form possible future work.
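The effect of the block length on the bound $c s^2 / 2^n$ can be made concrete with a small calculation. The sketch below rests on assumptions that are not fixed by the text: the constant is taken as $c = 1$, the query complexity $s$ is counted in $n$-bit blocks, and the acceptable advantage of $2^{-20}$ is an arbitrary illustrative threshold. The function names are hypothetical.

```python
import math


def max_blocks(n, log2_eps, c=1):
    """Largest s (in n-bit blocks) with c * s^2 / 2^n <= 2^log2_eps."""
    return math.isqrt((2 ** (n + log2_eps)) // c)


def max_bytes(n, log2_eps, c=1):
    """Data volume corresponding to max_blocks, in bytes."""
    return max_blocks(n, log2_eps, c) * n // 8


for n in (64, 128):
    s = max_blocks(n, -20)
    print(f"n = {n:3d}: s <= 2^{s.bit_length() - 1} blocks, "
          f"about 2^{max_bytes(n, -20).bit_length() - 1} bytes")
```

Under these assumptions a 64-bit block length keeps the bound below $2^{-20}$ only for about $2^{25}$ bytes (32 MiB) of queried data, far less than the capacity of the targeted memories, whereas a 128-bit block length allows about $2^{58}$ bytes.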
CONCLUSION
The main design goal of STES was to obtain a TES which can be implemented in a compact form and would have low power consumption. Our experiments validate that STES does achieve these goals to a large extent.
In the introduction we mentioned the speeds recommended by the SD standard. The commercially available memories do not achieve the values specified in the standard. In Table 9 we present the maximum speed of the various classes of memories sold by ADATA [1], [2]. The first four rows correspond to SD cards and the last three rows give data for USB memories. In the last two columns we mention our smallest design which can attain the necessary data rates of the given memory devices. Table 9 clearly demonstrates that STES can serve as a viable scheme for encryption in a large class of non-volatile memory devices.
Debrup Chakraborty received the BE degree in mechanical engineering from Jadavpur University, Kolkata, India, in 1997, and the MTech and PhD degrees in computer science from the Indian Statistical Institute, Kolkata, in 1999 and 2005, respectively. He is currently a researcher in the Computer Science Department of Centro de Investigación y de Estudios Avanzados del IPN, Mexico City, Mexico. His current research interests include design and analysis of provably secure symmetric encryption schemes, efficient software/hardware implementations of cryptographic primitives, pattern recognition, and neural networks.
Cuauhtemoc Mancillas-López received the BE degree in electronic and communications engineering from ESIME-Instituto Politécnico Nacional.
