Abstract-A new class of polynomials was introduced by Bernstein (Bernstein 2007), which were later named by Sarkar as Bernstein-Rabin-Winograd (BRW) polynomials (Sarkar 2009). For the purpose of authentication, BRW polynomials offer a considerable computational advantage over usual polynomials: $(m-1)$ multiplications for usual polynomial hashing versus $\lfloor m/2 \rfloor$ multiplications and $\lceil \log_2 m \rceil$ squarings for BRW hashing, where $m$ is the number of message blocks to be authenticated. In this paper, we develop an efficient pipelined hardware architecture for computing BRW polynomials. The BRW polynomials have a nice recursive structure which is amenable to parallelization. While exploring efficient ways to exploit the inherent parallelism in BRW polynomials we discover some interesting combinatorial structural properties of such polynomials. These are used to design an algorithm to decide the order of the multiplications which minimizes pipeline delays. Using the nice structural properties of the BRW polynomials we present a hardware architecture for efficient computation of BRW polynomials. Finally, we provide implementations of tweakable enciphering schemes proposed in Sarkar 2009 which use BRW polynomials. This leads to the fastest known implementation of disk encryption systems.
where $X = (x_1, \ldots, x_m) \in \mathbb{F}_q^m$ and $h \in \mathbb{F}_q$. Traditionally, the evaluation of $\mathrm{Poly}_h(X)$ has been done using Horner's rule, which requires $(m-1)$ multiplications and $(m-1)$ additions in $\mathbb{F}_q$. In the rest of this paper, we will refer to $\mathrm{Poly}_h(\cdot)$ as a normal polynomial.
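As a small illustration of the operation count for a normal polynomial, the sketch below applies Horner's rule and counts the multiplications. It assumes the polynomial has the form $x_1 h^{m-1} + \cdots + x_{m-1} h + x_m$ and, purely for readability, works in a prime field with the stand-in modulus `P`; both choices are assumptions made for this example, and the function name `horner` is hypothetical.

```python
P = 2**61 - 1  # stand-in prime modulus (assumption, not the field of the hardware design)

def horner(xs, h):
    """Return (value, multiplications) for x_1*h^(m-1) + ... + x_m mod P, m >= 1."""
    acc, mults = xs[0] % P, 0
    for x in xs[1:]:
        acc = (acc * h + x) % P   # one multiplication and one addition per block
        mults += 1
    return acc, mults

blocks = [3, 1, 4, 1, 5, 9, 2, 6]
value, mults = horner(blocks, h=7)
print(mults)  # m - 1 multiplications for m blocks
```

The counter makes the $(m-1)$ multiplications visible; the field arithmetic itself is only a placeholder.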
Bernstein [2] introduced a new class of polynomials which were later named in [23] as Bernstein-Rabin-Winograd (BRW) polynomials. BRW polynomials on $m$ message blocks defined over $\mathbb{F}_q$ have the interesting property that they can be used to provide authentication, but, unlike the normal polynomial, they can be evaluated using only $\lfloor m/2 \rfloor$ multiplications in $\mathbb{F}_q$ and $\lceil \log_2 m \rceil$ squarings. Thus, these polynomials potentially offer a computational advantage over the normal ones.
The use of BRW polynomials in hardware has not been addressed to date. As will be clear from discussions later, the structure of a BRW polynomial is fundamentally different from the normal ones, and there are some subtleties associated with their efficient implementation that are worthy of further analysis.
In particular, the recursive definition of a BRW polynomial gives it a certain structure which is amenable to parallelization. It turns out that to take advantage of this parallel structure one needs to carefully schedule the order of multiplications involved in the polynomial evaluation. The scheduling is determined by the dependencies in the multiplications and also by the desired level of parallelization and hardware resources available.
The contributions of this paper are twofold. First, we present a hardware architecture for efficient evaluation of BRW polynomials. The hardware design heavily depends on the careful analysis of the inherent parallelism in the structure of a BRW polynomial. This leads to a method to determine the order in which the different multiplications are to be performed so that the polynomial can be evaluated using a minimum number of clock cycles.
As our second contribution, we present efficient hardware implementations of two tweakable enciphering schemes (TESs) which use BRW polynomials. Comparisons are made with various other existing constructions which make use of normal polynomials. One of the most important applications of a TES is disk encryption. As a consequence of our implementation and comparative study, we conclude that TESs using BRW polynomials provide the fastest options for disk encryption.
Computing BRW polynomials in hardware. From the point of view of hardware realizations, the most crucial building block of a polynomial hash function is a field multiplier. Digit-serial multipliers yield compact designs in terms of area and enjoy short critical paths but they require several clock cycles in order to compute a single field multiplication. In contrast, fully parallel multipliers are able to compute one field multiplication every clock cycle. However, due to their large critical path, these multipliers seriously compromise the design's maximum achievable clock frequency.
Since polynomial hash blocks require the batch computation of a relatively large number of products, it makes sense to utilize pipelined multiplier architectures. In this work, we decided to utilize a $k$-stage pipelined multiplier with $k = 2, 3$. After a latency period required to fill up the pipe, these architectures are able to perform one field multiplication every clock cycle. The advantage is a much shorter critical path than the one associated with fully parallel multiplier schemes [3].
In using a pipelined multiplier, our main concern is to find a proper ordering of the multiplications which would minimize the delay in the pipeline. In the ideal case, there should always be multiplications ready to be done at every clock cycle. Another objective is to reduce the need to store the intermediate results so that one can minimize the extra storage locations utilized in the circuit.
To achieve this we analyze the structure of the BRW polynomial. Our analysis views the polynomial as a tree where addition and multiplication nodes are interconnected with each other. Viewing the BRW polynomial as a tree immediately gives us information about the dependence of the various operations required for its computation.
We discover some interesting properties of the tree, and use these properties to design a scheduling algorithm. The scheduling algorithm takes as input a BRW polynomial and the desired number of pipeline stages and outputs the schedule (or order) in which the different multiplications are to be performed. This schedule has several attractive features.
1. For pipeline structures with two or three stages, we give a full characterization of the number of clock cycles that is required for computing the polynomial.
2. The schedule ensures that the pipeline delays would be minimal.
3. The scheduling algorithm greedily attempts to minimize the storage. Experimental data show that the requirement of extra storage grows very slowly with the increase in the number of blocks.
Utilizing the schedule produced by the scheduling algorithm, we arrived at a hardware architecture that is meant for computing BRW polynomials with a fixed number of message blocks. We showcase a specific architecture which uses 31 blocks of messages and a 3-stage pipelined Karatsuba multiplier.
Tweakable enciphering schemes using BRW polynomials. In the second contribution of this paper, we use BRW polynomials for efficient hardware implementation of TESs. These are length-preserving block-cipher modes of operation which provide security in the sense of strong pseudorandom permutations. A fully defined TES for arbitrary length messages using a block cipher was first presented in [13]. In [13], it was also first stated that a possible and important application area for this type of encryption scheme is low-level disk encryption.
Since then, there has been a lot of activity toward constructions and analysis of such schemes. Most TES proposals fall into three basic categories: Encrypt-Mask-Encrypt type, Hash-ECB-Hash type, and Hash-Counter-Hash type. The schemes which fall within the first category use two layers of encryption with a lightweight masking layer in between. Examples are the modes CMC [13], EME [14], EME* [11]. The constructions of the other two categories use two layers of hashing with a single layer of encryption between the two hash layers. In the Hash-ECB-Hash type constructions an electronic code book mode forms the encryption layer, whereas in the case of Hash-Counter-Hash constructions a counter mode of operation is used for the encryption layer. Some modes of Hash-ECB-Hash type are PEP [6], TET [12], HEH [22], whereas the modes XCB [19], HCTR [25], HCH [7], ABL [20] fall under the Hash-Counter-Hash type.
The main components of Encrypt-Mask-Encrypt type constructions are block ciphers, and to encrypt an $m$-block message these constructions require about $2m$ block-cipher calls. On the other hand, the constructions of the type Hash-ECB-Hash and Hash-Counter-Hash require computation of two polynomial hash functions in addition to the block-cipher calls. These constructions require about $m$ block-cipher calls along with additional finite field multiplications to encrypt an $m$-block message.
The modes that have been mentioned above use a normal polynomial evaluation, i.e., they compute the function $\mathrm{Poly}_h(\cdot)$. The modes PEP, TET, HEH, HCH, HCTR, XCB all require the evaluation of two such polynomials, each of them on about $m$ blocks; thus these modes require $2m$ finite field multiplications and about $m$ block-cipher calls. In a recent work [23], a class of new TESs was reported, which can be instantiated either by a normal polynomial or a BRW polynomial. The usage of a BRW polynomial has the advantage that it can hash $m$ blocks using about $m/2$ multiplications, whereas normal polynomial evaluation would require $m$ multiplications. This decreases the computation cost significantly over the previously known modes.
Almost all TES schemes known before [23] were implemented on various hardware platforms in [18]. In [18], a careful analysis of the possible parallelism for all the modes was done and the designs tried to exploit the schemes' parallelism to their fullest extent. The designs were targeted toward the Virtex 4 family of FPGAs and the main design goal was speed. The designs used a 10-stage pipelined AES encryption/decryption core. For hashing, a fully parallel Karatsuba multiplier was employed for performing the field multiplications. In those modes where both block-cipher and multiplier blocks were required, the critical path was decided by the latter block. The obtained throughput figures were satisfactory with respect to the design goal, which was to match the speed of modern-day disk controllers (the interested reader can see [18] for a detailed discussion of the design decisions and the results obtained in that work). In [18], the constructions reported in [23] were not included as these constructions are more recent.
In this work, we provide efficient hardware implementations of some of the most efficient schemes reported in [23]. The fundamental difference of the schemes reported in [23] from the previous schemes is in the use of the BRW polynomials, which are significantly different in structure from the normal polynomials. Our analysis and implementation of BRW polynomials significantly bring down the length of the critical path. Further, due to the drastic reduction in the required number of field multiplications, the latency of the whole circuit also goes down. The combined effect is to provide significantly higher throughput compared to the designs studied in [18].
The constructions in [23] can also be instantiated using a normal polynomial. We compare the performance of the different instantiations. For a TES using normal polynomials we also use a pipelined multiplier and run parallel instances of Horner's rule. Our strategy of computing a normal polynomial using pipelined multipliers is similar to the strategy used in [24].
The organization of the rest of the paper is as follows: In Section 2, we define the BRW polynomials and present a tree-based analysis of such polynomials. Using the tree structure of the BRW polynomials we develop a scheduling algorithm and provide analysis of the scheduling algorithm. Finally, based on the scheduling algorithm we present the hardware architecture for computing BRW polynomials. In Section 3, we provide implementation details of the hardware architecture used for evaluating a BRW polynomial.
In Section 4, we describe the algorithms HEH and HMCH, which are the two new tweakable enciphering schemes proposed in [23]. These algorithms are analyzed from the perspective of efficient hardware implementation and specific design decisions are formulated. In Section 5, we discuss the experimental results obtained from our hardware realizations. The paper is concluded in Section 6.
BRW POLYNOMIALS
A special class of polynomials was introduced in [2] for fast polynomial hashing and subsequent use in message authentication codes. In [2], the origin of these polynomials was traced back to Rabin and Winograd [21], but the construction presented in [2] has subtle differences compared to the construction in [21]. The modifications were made keeping an eye on the issue of computational efficiency. Later in [23], these polynomials were used in the construction of tweakable enciphering schemes and the class of polynomials was named BRW polynomials.
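Let $X_1, X_2, \ldots, X_m, h \in \mathbb{F}_q$, then the BRW polynomial $H_h(X_1, \ldots, X_m)$ is defined recursively as follows:
• $H_h() = 0$;
• $H_h(X_1) = X_1$;
• $H_h(X_1, X_2) = X_2 h + X_1$;
• $H_h(X_1, X_2, X_3) = (h + X_1)(h^2 + X_2) + X_3$;
• $H_h(X_1, \ldots, X_m) = H_h(X_1, \ldots, X_{t-1})(h^t + X_t) + H_h(X_{t+1}, \ldots, X_m)$, if $t \in \{4, 8, 16, 32, \ldots\}$ and $t \le m < 2t$.
Computationally the most important property is that for $m \ge 2$, $H_h(X_1, \ldots, X_m)$ can be computed using $\lfloor m/2 \rfloor$ multiplications and $\lceil \lg m \rceil$ squarings. In the rest of the paper, we will use either $H_h(\cdot)$ or $\mathrm{BRW}_h(\cdot)$ to denote a BRW polynomial.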
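The recursion can be sketched in a few lines of Python that also count the multiplications. The sketch assumes the base cases $H_h(X_1, X_2) = X_2 h + X_1$ and $H_h(X_1, X_2, X_3) = (h + X_1)(h^2 + X_2) + X_3$, and it replaces the field $\mathbb{F}_q$ by a prime field with the stand-in modulus `P` so that ordinary integer arithmetic suffices; the name `brw` and the modulus are assumptions made for illustration only.

```python
P = 2**61 - 1  # stand-in prime modulus (assumption; the paper works in a binary field)

def brw(xs, h):
    """Evaluate the BRW recursion on the blocks xs with key h.

    Returns (value, multiplications). Squarings used to obtain h^2, h^4, ...
    are kept separate and are not counted as multiplications.
    """
    # Key powers h^(2^i), obtained by repeated squaring.
    powers = {1: h % P}
    t = 1
    while 2 * t <= max(len(xs), 2):
        powers[2 * t] = powers[t] * powers[t] % P
        t *= 2

    def rec(blocks):
        m = len(blocks)
        if m == 0:
            return 0, 0
        if m == 1:
            return blocks[0] % P, 0
        if m == 2:
            return (blocks[1] * powers[1] + blocks[0]) % P, 1
        if m == 3:
            value = (powers[1] + blocks[0]) * (powers[2] + blocks[1]) + blocks[2]
            return value % P, 1
        t = 1 << (m.bit_length() - 1)        # largest power of two <= m
        left, ml = rec(blocks[:t - 1])        # H_h(X_1, ..., X_{t-1})
        right, mr = rec(blocks[t:])           # H_h(X_{t+1}, ..., X_m)
        return (left * (powers[t] + blocks[t - 1]) + right) % P, ml + mr + 1

    return rec(list(xs))

value, mults = brw(range(1, 32), h=5)
print(mults)  # 15 multiplications for 31 blocks
```

For every $m \ge 2$ the counter returned by this sketch equals $\lfloor m/2 \rfloor$, in line with the property stated above.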
A Tree-Based Analysis
A BRW polynomial $H_h(X_1, \ldots, X_m)$ can be represented as a tree $T_m$ which contains three types of nodes, namely, multiplication nodes, addition nodes, and leaf nodes. The tree $T_m$ will be called a BRW tree and can be recursively constructed using the following rules:
1. For $m = 2, 3$ it is easy to construct $T_m$ directly as shown in Fig. 1.
2. If $m = 2^s$, for some $s \ge 2$, the root of $T_m$ is a multiplication node. The left subtree of the root consists of a single addition node which in turn has the leaf nodes $h^m$ and $X_m$ as its left and right child, respectively. The right subtree of the root is the tree $T_{m-1}$.
3. If $2^s < m < 2^{s+1}$ for some $s \ge 2$, the root is an addition node with its left subtree as $T_{2^s}$ and the right subtree as $T_{m-2^s}$.
A construction of the BRW tree $T_{16}$ corresponding to the polynomial $H_h(X_1, \ldots, X_{16})$ is shown in Fig. 2. According to this construction, the following two properties hold.
• Any leaf node is either a message block $X_j$ or it is $h^k$, for some $j, k$.
• For a multiplication node, either its left child is labeled by a message block $X_j$ and the right child is labeled by $h$; or its left child is an addition node which in turn has a message block $X_j$ and $h^k$ as its children for some $j$ and $k$.
As a consequence, for a multiplication node, there is exactly one leaf node in its left subtree which is labeled by a message block. As we are only interested in multiplications, we can ignore the addition nodes and thus simplify the BRW tree by deleting the addition nodes from it. We shall address the issue of addition later when we describe our specific design in Section 3, and we will then see that ignoring the additions as we do now has no significant consequences from the efficient implementation perspective. We reduce the tree $T_m$ corresponding to the polynomial $H_h(X_1, \ldots, X_m)$ to a new tree by applying the following steps in sequence.
1. Label each multiplication node $v$ by $j$ where $X_j$ is the leaf node of the left subtree rooted at $v$.
2. Remove all nodes and edges in the tree $T_m$ other than the multiplication nodes.
3. If $u$ and $v$ are two multiplication nodes, then add an edge between $u$ and $v$ if $u$ is the most recent (nearest) multiplication-node ancestor of $v$ in $T_m$.
The procedure above will delete all the addition nodes from the tree $T_m$. We shall call the resulting structure a collapsed forest (as the new structure may not always be connected, but its connected components would be trees) and denote it by $F_m$. Note that for every $m$, there is a unique BRW tree $T_m$ and hence a unique collapsed forest $F_m$. The collapsed forests corresponding to polynomials $H_h(X_1, \ldots, X_{16})$ and $H_h(X_1, \ldots, X_{30})$ are shown in Fig. 3.
By construction, the number of nodes in a collapsed forest $F_m$ is equal to the number of multiplication nodes in $T_m$. The nodes of $F_m$ are labeled with integers. Label $j$ of a node in $F_m$ signifies that either the multiplicands are $X_j$ and $h$; or, one of the multiplicands is $(X_j + h^k)$ for some $k$. As a result, there is a unique multiplication associated with each node of a collapsed forest.
For example, the multiplication $(X_2 + h^2) * (X_1 + h)$ is associated with the node labeled 2 in Fig. 3. Refer to Fig. 1 to see this. Similarly, if the outputs of nodes labeled 4 and 6 are $A$ and $B$, respectively, then the multiplication associated with the node labeled 8 is $(X_8 + h^8) * (A + B + X_7)$. This procedure easily generalizes and it is possible to explicitly write down the unique multiplication associated with any node of a collapsed forest. So, the problem of scheduling the multiplications in $T_m$ reduces to obtaining an appropriate sequencing (linear ordering) of the nodes of $F_m$.
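The collapsed forest can also be generated directly from the recursive rules for $T_m$, without building the full tree first. The following sketch returns, for each multiplication node, the list of its children in $F_m$, together with the roots of the forest. The function name `collapsed_forest` and the dictionary representation are choices made for this illustration, and the sketch assumes that the trees for two and three blocks each contain a single multiplication node labeled by their second block.

```python
def collapsed_forest(m):
    """Return (children, roots) for the collapsed forest F_m.

    children maps each node label to the labels of its children;
    roots lists the nodes that have no parent.
    """
    children = {}

    def build(offset, n):
        # Handles the blocks X_{offset+1}, ..., X_{offset+n} and returns the
        # multiplication nodes of this part that have no multiplication
        # ancestor inside it.
        if n < 2:
            return []
        if n <= 3:
            children[offset + 2] = []            # one independent multiplication
            return [offset + 2]
        t = 1 << (n.bit_length() - 1)            # largest power of two <= n
        children[offset + t] = build(offset, t - 1)   # node for (h^t + X_t) * H(...)
        return [offset + t] + build(offset + t, n - t)

    roots = build(0, m)
    return children, roots

children, roots = collapsed_forest(16)
print(children[8], children[16], roots)
```

For $m = 16$ the sketch yields a single tree rooted at the node labeled 16, and the node labeled 8 has the nodes 4 and 6 as children, which matches the example multiplication given above.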
The structure of the collapsed forest corresponding to a polynomial $H_h(\cdot)$ helps us to visualize the dependencies of the various multiplications involved in the computation of $H_h(\cdot)$. The following definitions would help us to characterize dependencies among those operations.
Definition 1. Let $v$ be a node in a collapsed forest $F$. The level of $v$ in $F$, denoted by $\mathrm{level}_F(v)$, is the number of nodes present in the longest path from $v$ to a leaf node. A node $v$ in $F$ such that $\mathrm{level}_F(v) = 0$ is said to be independent. Any node $v$ with $\mathrm{level}_F(v) > 0$ is said to be dependent.
6. If $x$ is a label of a node and $x \equiv 2 \pmod 4$ then the node is an independent node.
7. If $x$ is a label of a node and $x \equiv 0 \pmod 8$ then $x$ has at least $x - 2$ and $x - 4$ as its children.
8. If $x$ is the label of a node and $x \equiv 4 \pmod 8$, then $x - 2$ is the only child of $x$.
Scheduling of Multiplications
Our goal, as stated earlier, is to design a circuit for computing BRW polynomials using a pipelined multiplier. If we use a pipelined multiplier with $N$ stages, then $N$ clock cycles would be required to complete one multiplication, but in each clock cycle $N$ different multiplications can be processed, as long as these $N$ multiplications happen to be independent of each other, i.e., none of these $N$ multiplications should depend on the results of the others. Thus, if it can be guaranteed that $N$ independent multiplications are available in each clock cycle then the circuit will require $m + N$ clock cycles to complete $m$ multiplications (there would be an initial latency of $N$ clocks for filling the pipe and thereafter the result of one multiplication would be produced in each subsequent clock cycle). A collapsed forest is a convenient way to view the dependencies among the various multiplications which are required to compute a BRW polynomial. In this section, we propose an algorithm Schedule which uses a collapsed forest to output a multiplication schedule. The aim of the algorithm is to minimize the number of clock cycles.
For designing the scheduling algorithm we require two lists $L_1$ and $L_2$. For a list $L$ and an element $x$ of $L$, we shall require the following operations. Note that Pop($L$) does not delete the first element from $L$. Two successive pop operations from $L$ without any intermediate delete operation will result in the same element. Each node in the collapsed forest has two fields, NC and ST, associated with it. If $x$ is a node in the collapsed forest then $x.NC$ represents the number of children of node $x$, and $x.ST$ denotes the time at which the node $x$ was inserted into the list $L_2$ (the requirement of ST will become evident soon). Let Parent($x$) denote the parent of node $x$ in the collapsed forest.
The algorithm for scheduling is described in Fig. 4 . The algorithm uses a function Process which is also depicted in Fig. 4 . The inputs to the algorithm are m and a variable NS which represents the number of pipeline stages. The outputs from Step 103 of Process form a sequence of integers. This provides the desired sequence of multiplications.
Before the main while loop begins (in line 11), the list $L_1$ contains all the independent nodes in the collapsed forest corresponding to the given polynomial and $L_2$ is empty. Within the while loop no nodes are inserted in $L_1$, but new nodes are inserted into and get deleted from $L_2$. $L_2$ is a queue, i.e., the nodes get deleted from $L_2$ in the same order as they enter it. The way we define the operations Pop(), Delete(), and Insert() guarantees this.
At any given clock cycle, the nodes in the forest can be in four possible states: unready, ready, scheduled, and completed. A node $x$ is unready if there exists a node $y$ on which $x$ is dependent but $y$ has not been completed yet. A node becomes ready if all nodes on which it depends are completed. A node can only be scheduled after it is ready. Once a node is scheduled it takes $NS$ clock cycles to get completed.
In the beginning, the nodes with level zero, i.e., the independent nodes, are the only nodes in the ready state, all others being in the unready state. These independent nodes are listed in $L_1$ at the beginning, and no more nodes are added to $L_1$ afterwards. Thus, the nodes in $L_1$ can be scheduled at any time. As the algorithm proceeds, nodes get scheduled in line 103 of the function Process.
After a node is scheduled the algorithm updates the field NC (number of children) of its parent. When the last child of a given node $x$ is scheduled, $x$ is inserted into the list $L_2$, and the time when its last child was scheduled is recorded in the field ST of $x$.
If a node is in $L_2$ then it is certain that all its children have been scheduled, but not necessarily completed. The condition in line 12 checks if the last child of a given node in $L_2$ has already been completed, and if a node $x$ passes this check then it is ready to be scheduled.
For each execution of the while loop (lines 10 to 20) at most one node gets scheduled and once a node is scheduled it is deleted from the corresponding list. The condition on the while loop (line 10) checks whether both the lists are empty and the condition on line 12 checks whether the first element of $L_2$ is ready; in the next two propositions we state why these checks would be sufficient.
Proposition 2. If $L_1$ and $L_2$ are both empty then there are no nodes left to be scheduled. Further, the algorithm terminates, i.e., the condition that $L_1$ and $L_2$ are both empty is eventually attained.
Proof. Suppose both $L_1$ and $L_2$ are empty but there is a node $v$ which is left to be scheduled. As $L_1$ contains all independent nodes in the beginning and it is empty, $v$ is not an independent node. As $v$ has not been scheduled and it is not in $L_2$, there must be a child of $v$ which has not been scheduled. As there must exist a path from $v$ to some independent node $x$, applying the same argument repeatedly we would conclude that there exists some independent node $x$ which has not been scheduled. This gives rise to a contradiction as $L_1$ is empty. For the second statement, note that as long as $L_1$ is nonempty, each iteration of the while loop results in exactly one node of $F_m$ being added to the schedule. This node is either a node in $L_2$ (if there is one such node), or, it is a node of $L_1$.
Once $L_1$ becomes empty, if $L_2$ is also empty, then by the first part, the scheduling is complete. If $L_2$ is nonempty, then let $v$ be the first element of $L_2$. It may be possible that an iteration of the while loop does not add a node to the existing schedule. This happens if $clock - v.ST \le NS$. But, the value of $v.ST$ does not change while the value of $clock$ increases. So, at some iteration, the condition $clock - v.ST > NS$ will be reached and the node $v$ will be output as part of the call Process$(v, L, clock)$. □
Proposition 3. If the first element of $L_2$ is not ready to be scheduled then no other elements in $L_2$ would be ready.
Proof. Let $v$ be the first element in $L_2$; as $v$ is not ready to be scheduled, $clock - v.ST \le NS$. Let $u$ be any other node in $L_2$; as $u$ was added to $L_2$ later than $v$, $u.ST > v.ST$ and so $clock - u.ST < clock - v.ST \le NS$. Thus, $u$ is also not ready to be scheduled. □
Some examples about the running of the algorithm Schedule are provided in Section 2 of the supplementary material, available online.
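The behaviour of the scheduling procedure described in this section can be explored with a short simulation. The sketch below is not a transcription of Fig. 4: the function names, the use of `deque` for the lists $L_1$ and $L_2$, the increasing order of the independent nodes in $L_1$, and in particular the readiness test `clock - st >= ns` (a parent may start $NS$ clocks after its last child was started) are assumptions. The readiness convention was chosen because it reproduces the order 2, 6, 4 quoted for $m = 6$ and $NS = 2$ in the proof of Theorem 1; the clock bookkeeping of the original algorithm may differ by an offset.

```python
from collections import deque

def forest_children(m):
    """Children lists of the collapsed forest F_m, keyed by node label."""
    children = {}
    def build(offset, n):
        if n < 2:
            return []
        if n <= 3:
            children[offset + 2] = []
            return [offset + 2]
        t = 1 << (n.bit_length() - 1)          # largest power of two <= n
        children[offset + t] = build(offset, t - 1)
        return [offset + t] + build(offset + t, n - t)
    build(0, m)
    return children

def schedule(m, ns):
    """Return (order of node labels, number of clocks used to schedule them)."""
    children = forest_children(m)
    parent = {c: v for v, cs in children.items() for c in cs}
    nc = {v: len(cs) for v, cs in children.items()}    # unscheduled children
    l1 = deque(sorted(v for v in children if nc[v] == 0))  # independent nodes
    l2 = deque()                                       # FIFO of waiting parents
    st = {}                                            # clock of last child
    order, clock = [], 0
    while l1 or l2:
        clock += 1
        if l2 and clock - st[l2[0]] >= ns:             # assumed readiness test
            x = l2.popleft()
        elif l1:
            x = l1.popleft()
        else:
            continue                                   # pipeline bubble
        order.append(x)
        p = parent.get(x)
        if p is not None:
            nc[p] -= 1
            if nc[p] == 0:                             # last child scheduled
                st[p] = clock
                l2.append(p)
    return order, clock

print(schedule(6, 2))
print(schedule(16, 2))
```

Under these assumptions the simulation schedules the three nodes for $m = 6$ in three clocks, while for $m = 16$ one clock passes without a node being scheduled, because the root depends on every other node.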
Optimal Scheduling
Given a BRW polynomial on $m$ message blocks, the number of nodes in the corresponding collapsed forest is $p = \lfloor m/2 \rfloor$. The scheduling of these nodes is said to be optimal if one node can be scheduled in each clock cycle, thus requiring $p$ clock cycles to schedule all the nodes. If such a scheduling is possible for a given value of the number of stages ($NS$) we say that the scheduling admits a full pipeline, as such a scheduling will not give rise to any pipeline delays.
The above notion of optimality is a strong one and an optimal scheduling will not exist for all values of m and NS. Existence of an optimal scheduling for NS stages means that in each clock cycle NS independent nodes are available.
If $m$ is a power of two then it is easy to see that the collapsed forest would contain a single tree and the root would be dependent on all other nodes (as is the case in Fig. 3a); thus no scheduling procedure can yield an optimal scheduling for such an $m$ for any $NS > 1$.
Also, as the number of pipeline stages increases, for an optimal scheduling to be possible, more independent multiplications are required. For small values of $NS$, however, the following theorem gives the conditions for which Schedule gives an optimal scheduling for $NS = 2$ and $3$.
Theorem 1. Let $H_h(X_1, X_2, \ldots, X_m)$ be a BRW polynomial and let $p = \lfloor m/2 \rfloor$ be the number of nodes in the corresponding collapsed forest. Let $clks$ be the number of clock cycles taken by Schedule to schedule all nodes, then
Proof. Both the proofs are by induction. We present the proof only for $NS = 2$ as the other case is similar. For $p = 3$ (i.e., $m = 6$) the explicit output of the algorithm is 2, 6, 4, and it takes three clock cycles to schedule the three nodes; this proves that the base case is true. Suppose the results hold for some $p \ge 3$ and we wish to show the results for $p + 1$.
There are the following cases to consider:
1. $p + 1 \equiv 1 \pmod 4$. Then, $p \equiv 0 \pmod 4$, hence by the induction hypothesis the $p$ nodes were scheduled in $p + 1$ cycles, signifying that there was one cycle when no node was scheduled. The last node in this case has label $2(p + 1)$, and as $2(p + 1) \equiv 2 \pmod 4$, the last node is an independent node (from Proposition 1); hence the last node can be scheduled in the missed cycle, and the total clocks required for $p + 1$ nodes would be $p + 1$.
2. $p + 1 \equiv 2 \pmod 4$. Then, $p \equiv 1 \pmod 4$, hence by the induction hypothesis $p$ nodes were scheduled in $p$ cycles. The last node to be scheduled has label $2(p + 1)$ and $2(p + 1) \equiv 4 \pmod 8$, and hence by Proposition 1 it has only one child and the label of the child is $2p$. Considering the previous case, $2p$ was not the last node to be scheduled; hence, the node $2(p + 1)$ can be scheduled in the $(p + 1)$th cycle.
3. $p + 1 \equiv 3 \pmod 4$. Then, $p \equiv 2 \pmod 4$, hence $p$ nodes were scheduled in $p$ cycles. The last node to be scheduled has label $2(p + 1)$ and $2(p + 1) \equiv 2 \pmod 4$, and hence by following the same arguments as in case 1 the nodes can be scheduled in $p + 1$ cycles.
4. $p + 1 \equiv 0 \pmod 4$. Then, $p \equiv 3 \pmod 4$, hence by the induction hypothesis $p$ nodes were scheduled in $p$ cycles. The last node to be scheduled has label $2(p + 1)$, and by Proposition 1 it would have nodes with labels $2p$ and $2(p - 1)$ as its children. Considering cases 2 and 3, if $p$ nodes are scheduled then the last node to be scheduled has label $2(p - 1)$, which is a child of the node $2(p + 1)$; hence the node $2(p + 1)$ cannot be scheduled in the $(p + 1)$th cycle. Thus, the number of cycles required would be $p + 2$.
This completes the proof. □
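Read together, the four cases give a compact summary of the clock count for $NS = 2$. The display below is inferred from the cases of the proof for $p \ge 3$ and is offered as a reading aid, not as a restatement of Theorem 1:

```latex
\[
\mathit{clks} =
\begin{cases}
p + 1, & \text{if } p \equiv 0 \pmod 4,\\
p,     & \text{otherwise,}
\end{cases}
\qquad (NS = 2,\ p \ge 3).
\]
```

For instance, $p = 3$ ($m = 6$) needs three clocks by the base case, while $p = 4$ ($m = 8$) falls under case 4 and needs five.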
From the proof above one can obtain a recursive description of the output of the scheduling algorithm for $NS = 2$. Let $p \ge 4$, and let $x_1, \ldots, x_p$ be the sequence for $p$, where $x_1, \ldots, x_p \in \{2, 4, \ldots, 2p\}$. Then, the following is the construction of the sequence for $p + 1$:
• If $p + 1 \equiv 0 \pmod 2$, then output the sequence $x_1, \ldots, x_p, 2(p + 1)$;
• If $p + 1 \equiv 3 \pmod 4$, then output the sequence $x_1, \ldots, x_p, 2(p + 1)$; and
• If $p + 1 \equiv 1 \pmod 4$, then output the sequence $x_1, \ldots, x_{p-1}, 2(p + 1), x_p$.
Similarly, if $NS = 3$ and $x_1, \ldots, x_p$ is the sequence for $p \ge 6$, then the following is the construction of the sequence for $p + 1$:
• If $p + 1 \equiv 0 \pmod 2$, then output the sequence $x_1, \ldots, x_p, 2(p + 1)$;
• If $p + 1 \equiv 1 \pmod 4$, then output the sequence $x_1, \ldots, x_{p-2}, x_{p-1}, 2(p + 1), x_p$; and
• If $p + 1 \equiv 3 \pmod 4$, then output the sequence $x_1, \ldots, x_{p-2}, 2(p + 1), x_{p-1}, x_p$.
As stated, our definition of optimality is a strong one. It is possible to define optimality in a weaker sense as follows: Given a BRW polynomial and a number of stages $NS$, a scheduling of the multiplication nodes is called weakly optimal if it takes the minimum number of clock cycles among all possible schedules for the given polynomial and the given value of $NS$. Using this weaker definition of optimality it would be guaranteed that for any polynomial and any value of $NS$ a weakly optimal schedule always exists. Moreover, if a schedule is optimal in the stronger sense that we formulated, it would also be optimal in the weaker sense. Characterizing weak optimality seems to be a combinatorially difficult task and is not required for our case, as for our work we can mostly show strong optimality or small deviations from it.
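The recursive description for $NS = 2$ translates into a short loop. In the sketch below the starting sequence 2, 6, 4, 8 for $p = 4$ is an assumption: it is obtained by appending the root node 8 to the sequence 2, 6, 4 quoted for $p = 3$ in the proof of Theorem 1. The function name `ns2_sequence` is hypothetical, and the case $NS = 3$ could be handled analogously once a starting sequence for $p = 6$ is fixed.

```python
def ns2_sequence(p):
    """Order of the node labels for p >= 4 nodes and NS = 2, built recursively."""
    if p < 4:
        raise ValueError("the recursive description starts at p = 4")
    seq = [2, 6, 4, 8]                 # assumed starting sequence for p = 4
    for q in range(5, p + 1):          # q plays the role of p + 1 in the text
        new = 2 * q
        if q % 4 == 1:
            # The new independent node fills the slot before the last node.
            seq = seq[:-1] + [new, seq[-1]]
        else:
            # q even, or q = 3 (mod 4): the new node is simply appended.
            seq = seq + [new]
    return seq

print(ns2_sequence(15))
```

For $p = 15$ (for example $m = 30$) the loop produces a sequence that begins 2, 6, 4, 10, 8 and ends with 24, 28, 30.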
The Issue of Extra Storage
Optimizing the number of clock cycles should not be the only goal for a scheduling algorithm. An important resource associated with a pipelined architecture is the extra storage required for the intermediate results. The issue of storage in the case of computing BRW polynomials is simple; we illustrate it with an example. Refer to the diagram of the collapsed forest in Fig. 3b, and suppose the nodes are scheduled in the order given in (2); then the starting times and finishing times (in clocks) of the nodes would be as below.
Note that the results of the multiplications in nodes 2, 10, 18, 26 which are completed in the clocks 3, 5, 7, and 9, are further used to compute the multiplications in the nodes 4, 12, 20, and 28 which are started in the clocks 9, 10, 11, and 12, respectively. Hence, the results obtained in the clocks 3, 5, 7, and 9 are all needed to be stored. If we continue in this manner we shall see that the scheduling in (2) would require a significant amount of extra storage for storing the intermediate results.
In contrast to the scheduling in (2), if we follow the algorithm Schedule, then the starting and the finishing time of the nodes would be as:
The number of intermediate storage locations required for this schedule is just one, as can be seen from the following considerations.
• Node 2 is completed in clock 3 and in the same clock node 4 gets started, which requires the result of the multiplication completed in clock 3; thus the result of node 2 is not required to be stored.
• In clock 4 node 6 is completed and node 10 is started; as 10 does not depend on 6, the result of node 6 needs to be stored.
• Continuing in this way we see that only the results of nodes 6, 8, 12, and 20 need to be stored (they are underlined in the table above).
• But, this does not mean that four distinct storage locations are required, as the storage locations can be reused.
• Note that node 8 is ready in clock 7 and it is required to be stored. Node 6 was stored previously, and the result was already utilized when node 8 started in clock 5. Thus, the location used for storing 6 can be used to store 8.
1. A result $x$ is required to be stored if it is completed in a certain clock $t$ and the node $y$ which starts at $t$ is not a parent of $x$.
2. If there exists a storage location which stores results that have been already used, then the location can be reused; otherwise a new storage location must be defined.
The extra storage requirement for Schedule grows very slowly with the increase in the number of message blocks. Fig. 5 shows the number of storage locations for various numbers of message blocks for $NS = 3$.
The values reported in Fig. 5 are obtained by the procedure described above. It may be possible to come up with a closed form formula which shows the amount of extra storage required for each configuration. This combinatorial problem is not straightforward and remains open. For all practical purposes the procedure depicted above can give an exact count of the extra amount of storage required.
A HARDWARE ARCHITECTURE FOR THE EFFICIENT EVALUATION OF BRW POLYNOMIALS
Utilizing the properties of the BRW polynomials discussed in the previous sections, we propose a hardware architecture for computing such polynomials. We "showcase" our architecture for 31 blocks of messages using a three-stage pipelined multiplier. The number of message blocks of the polynomial and the number of pipeline stages of the multiplier can be varied without hampering the design philosophy. This issue of scalability is discussed later. Each block is 128 bits long, and so the multiplication, addition, and squaring operations take place in the field $\mathbb{F}_{2^{128}}$ generated by the irreducible polynomial
This specific design is also useful for designing the tweakable enciphering schemes discussed in Section 4.
The schematic diagram of the proposed architecture is shown in Fig. 6, where the principal component is a three-stage pipelined Karatsuba multiplier denoted as KOM. (We postpone a detailed description of the multiplier design to Section 4 of the supplementary material, available online.) At the output of the multiplier, we placed two accumulators, ACC1 and ACC2, which are used to accumulate intermediate results. Both squaring blocks in Fig. 6 are equipped with output registers that save the last field squaring computation. The multiplier block KOM has two inputs designated as inMa and inMb.
The first multiplier input (inMa) is the field addition of three values:

1. The first value is the output of a multiplexer block M1 that selects between the key $h$ and either of the two accumulators.
2. The second value is the output of another multiplexer that selects between the last output produced by the multiplier and zero.
3. The third value is the input signal inA.

The second multiplier input (inMb) is the field addition of two values:

1. The first value is taken from the output of a multiplexer M2 that selects the output of Sqr1, the output of Sqr2, or the key $h$.
2. The second value is the input inB.

As was discussed in Section 2, the computation of a 31-block BRW polynomial, denoted $H_h(P_1, \ldots, P_{31})$, requires $\lfloor 31/2 \rfloor = 15$ multiplications. Fig. 7 gives the time diagram that specifies how these 15 multiplications are scheduled. The final value of the polynomial $H_h(P_1, \ldots, P_{31})$ is obtained in just 18 clock cycles.
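The multiplication count quoted above can be checked with a short software model. The sketch below assumes the standard recursive BRW definition from the literature (Section 2 is not reproduced here, so the recursion is restated as an assumption) and assumes the field modulus $x^{128} + x^7 + x^2 + x + 1$, which the text does not state. The names `gf_mul` and `brw` are illustrative only. Squarings of $h$ are recomputed inline and are not counted as multiplications.

```python
MOD = (1 << 128) | 0x87  # ASSUMED modulus: x^128 + x^7 + x^2 + x + 1

def gf_mul(a, b):
    """Multiply in GF(2^128); bit i of an integer is the coefficient of x^i."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a ^= MOD
    return r

def brw(h, blocks, counter):
    """Evaluate BRW_h(blocks); counter[0] counts non-squaring multiplications."""
    m = len(blocks)
    if m == 0:
        return 0
    if m == 1:
        return blocks[0]
    if m == 2:
        counter[0] += 1
        return gf_mul(blocks[0], h) ^ blocks[1]
    if m == 3:
        counter[0] += 1
        return gf_mul(h ^ blocks[0], gf_mul(h, h) ^ blocks[1]) ^ blocks[2]
    t = 4
    while 2 * t <= m:
        t *= 2                      # t is the power of two with t <= m < 2t
    ht, e = h, 1
    while e < t:
        ht, e = gf_mul(ht, ht), 2 * e   # h^t by repeated squaring
    left = brw(h, blocks[:t - 1], counter)
    right = brw(h, blocks[t:], counter)
    counter[0] += 1
    return gf_mul(left, ht ^ blocks[t - 1]) ^ right

count = [0]
brw(0x1234, list(range(1, 32)), count)  # 31 hypothetical message blocks
print(count[0])  # prints 15
```

Each product has the shape (accumulated value + block) times (power of $h$ + block), which is the same shape as the inMa and inMb inputs described above.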
The data-flow specifics of the architecture in Fig. 6 are shown in the time diagram of Fig. 7. This figure shows the data stored or produced in the various blocks at each clock cycle, along with the order in which the multiplications are performed. $M_1, \ldots, M_{15}$ denote the 15 multiplications to be computed; the multiplicands are shown in the rows designated inMa and inMb, which are the two inputs of the KOM block. The row designated C denotes the output of the multiplier. As a three-stage pipelined multiplier is used, the result of a multiplication scheduled at clock $i$ is available at C in clock $i + 3$.
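The timing rule in this paragraph can be written down directly. The following is a minimal sketch (the function name `result_clocks` is an assumption) showing that issuing one multiplication per clock in clocks 1 to 15 yields the last product in clock 18.

```python
def result_clocks(issue_clocks, stages=3):
    """Clock at which each product appears at the multiplier output C."""
    return [c + stages for c in issue_clocks]

issue = list(range(1, 16))      # M_1 .. M_15 issued in clocks 1 .. 15, no bubbles
ready = result_clocks(issue)
print(ready[0], ready[-1])      # prints: 4 18
```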
The rows ACC1 and ACC2 denote the values which are accumulated in the accumulators in the various clock cycles. Note that an entry $M_i$ in either of the rows representing the state of the two accumulators signifies that the value $M_i$ is xor-ed into the current value of the accumulator, and an entry $*M_i$ denotes that the accumulator is initialized with $M_i$.
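The two kinds of accumulator entries can be modeled in a few lines. This is a hedged behavioral sketch: the class name `Accumulator`, the method `step`, and the small integer values are all hypothetical stand-ins for 128-bit field elements.

```python
class Accumulator:
    """Behavioral model of ACC1/ACC2: field addition is bitwise xor."""

    def __init__(self):
        self.value = 0

    def step(self, product, init=False):
        # init=True models an entry '*M_i' (load); init=False models 'M_i' (xor in)
        self.value = product if init else self.value ^ product
        return self.value

acc = Accumulator()
acc.step(0xA, init=True)   # entry '*M_i': the accumulator is initialized
acc.step(0x3)              # entry 'M_j': the value is xor-ed in
print(hex(acc.value))      # prints 0x9
```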
The rows squaring1 and squaring2 show the state of the output registers of the squaring circuits. In one clock cycle, each squaring circuit can compute the square of the current content of its output register, maintain its current state, or initialize its value with $h^2$, taking $h$ as a fresh input. As depicted in Fig. 7, the computation of the polynomial $H_h(X_1, \ldots, X_{31})$ can be completed in 18 clock cycles, and the final value can be obtained from the accumulator ACC2.
The circuit shown in Fig. 6 uses the strategy of computing the squares as required on the fly. An alternative strategy would be to precompute the required powers of h and store them in registers. By using this strategy we can get rid of the squaring circuits at the cost of some extra storage, and come up with a circuit which would be very similar to the circuit described in Fig. 6 .
If the precomputing strategy is adopted, then for computing $H_h(P_1, \ldots, P_{31})$ we need to store $h^2$, $h^4$, $h^8$, and $h^{16}$ in registers. The multiplexer which feeds inMb would in this case be a five-input multiplexer, where four of the inputs come from the registers storing the squares and the fifth input is the input line $h$. As squaring in binary extension fields is easy, the two strategies do not differ significantly in performance; this becomes evident from the experimental results. Irrespective of the way in which squarings are performed, the construction of the circuit follows the scheduling strategy dictated by the algorithm Schedule. According to Theorem 1, if a three-stage pipelined multiplier is used, then for computing $H_h(P_1, \ldots, P_{31})$ the 15 multiplications can be scheduled in 15 clock cycles without any pipeline bubbles. Fig. 7 shows that this is indeed the case: from clock 1 to clock 15, a multiplication is scheduled in every clock cycle without any pipeline delays. The extra storage required for the intermediate products is provided by the accumulator ACC1, which stores the products $M_2$, $M_5$, $M_6$, and $M_9$.
ACC2 is used to accumulate the final result. Note that the products $M_{10}$, $M_{13}$, $M_{14}$, and $M_{15}$ are accumulated in order in ACC2. These multiplications correspond to the nodes 16, 30, 24, and 28 of the collapsed forest (see Fig. 3b), which in turn are the roots of the trees.
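The precomputation strategy described above amounts to four successive squarings of the key. The sketch below is an assumption-laden software model: the modulus $x^{128} + x^7 + x^2 + x + 1$ is assumed (the text does not state it here), the key value is hypothetical, and `precompute_powers` is an illustrative name. It also checks that squaring is linear in a binary field, which is the reason squaring is cheap in hardware.

```python
MOD = (1 << 128) | 0x87  # ASSUMED modulus: x^128 + x^7 + x^2 + x + 1

def gf_mul(a, b):
    """Multiply in GF(2^128); bit i of an integer is the coefficient of x^i."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a ^= MOD
    return r

def precompute_powers(h, count=4):
    """Return [h^2, h^4, h^8, h^16] (for count=4) by repeated squaring."""
    powers, p = [], h
    for _ in range(count):
        p = gf_mul(p, p)
        powers.append(p)
    return powers

h = 0x0123456789ABCDEF0123456789ABCDEF   # hypothetical key
powers = precompute_powers(h)
m2_inputs = [h] + powers                 # the five inputs of the inMb multiplexer

# Squaring is linear over GF(2): (a + b)^2 = a^2 + b^2.
a, b = 0x1234, 0xBEEF
assert gf_mul(a ^ b, a ^ b) == gf_mul(a, a) ^ gf_mul(b, b)
print(len(m2_inputs))  # prints 5
```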
Scalability. The architecture presented previously is meant for 31-block messages. But the same design philosophy can be used for k-block messages for any fixed k.
Here, we give a short description of how the circuit for computing $H_h(P_1, \ldots, P_m)$ grows with $m$. A three-stage pipelined multiplier is assumed. For ease of exposition, we shall only consider the case where the powers of $h$ are precomputed.
The main components of the circuit will be the two multiplexers connected to the inputs of the multiplier, the accumulators, and the registers storing the powers of $h$. If $H_h(P_1, \ldots, P_m)$ requires $s$ powers of $h$ to be stored, M2 would be an $(s + 1)$-input multiplexer. The number of accumulators required would be at most one more than the number of extra storage locations required. For a given polynomial $H_h(P_1, \ldots, P_m)$, the number of extra storage locations required by Schedule can be determined using the procedure described in Section 2.4.
If the number of accumulators required is $\alpha$, then M1 would be replaced by an $(\alpha + 1)$-input multiplexer, where $\alpha$ inputs come from the accumulators and the last one is the input line $h$. The data-flow specifics can be obtained automatically from the algorithm Schedule.
TES CONSTRUCTIONS BASED ON BRW POLYNOMIALS
We devote this section to an application of BRW polynomials to the construction of a cryptographically useful object. As stated in the Introduction, a recent work [23] suggested that BRW polynomials can be used instead of normal polynomials to design tweakable enciphering schemes (TES) of the hash-ECB-hash and hash-counter-hash families. Tweakable enciphering schemes are known to be useful in the design of in-place disk encryption schemes, and in light of the present standardization activities of the IEEE working group on security in storage [1], the study of these schemes has gained much importance. In [23], it was claimed that TES constructions using BRW polynomials would be far more efficient than their counterparts which use normal polynomials. The claim was justified using operation counts, as a BRW polynomial requires about half as many multiplications as a normal polynomial. However, real design issues were not considered in [23], and thus there are no hard experimental data demonstrating the speedups that can be achieved by the use of such polynomials. Here, we concentrate on the real design issues for hardware implementation of some of the schemes described in [23], and ultimately provide experimental results which confirm that TES with BRW polynomials have higher throughput than those using normal polynomials.
The Schemes
There are two basic schemes described in [23], named HEH and HMCH. The schemes can be instantiated in different ways for different applications. The encryption and decryption algorithms for HEH and HMCH are described in Figs. 8 and 9, respectively. The descriptions are for a specific instantiation which is suitable for disk encryption. In the description of the algorithms we assume that $E_K : \{0,1\}^n \to \{0,1\}^n$ is a block cipher whose inverse is $E_K^{-1} : \{0,1\}^n \to \{0,1\}^n$. The additions and multiplications are all in the field $\mathbb{F}_{2^n}$ represented by an irreducible polynomial $\tau(x)$ of degree $n$ which is primitive. For our implementations we use the field $\mathbb{F}_{2^{128}}$ and $\tau(x)$
An $A \in \{0,1\}^n$ can be seen as a polynomial $a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}$ where each $a_i \in \{0,1\}$; thus every $n$-bit string $A$ can be treated as an element of $\mathbb{F}_{2^n}$. By $xA$ we mean the $n$-bit binary string corresponding to the polynomial $x(a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}) \bmod \tau(x)$. This operation can be performed easily by a shift and a conditional xor. In the description, $\psi_h(\cdot)$ can be instantiated in two different ways: it can be either $h \cdot \mathrm{Poly}_h(\cdot)$ or $h \cdot \mathrm{BRW}_h(\cdot)$.
Analysis of the Schemes and Design Decisions
We analyze here the schemes presented in Section 4 from the perspective of efficient hardware implementation and thus arrive at some basic strategies for designing them. The implementation is targeted toward the disk encryption application; thus, in the following discussion we only consider messages of fixed length, namely 512 bytes, i.e., 32 blocks of 128 bits.$^2$ Our primary design goal is speed, but we try to keep the area reasonable. The basic components of both schemes are a block cipher (which we chose to instantiate using AES-128) and the polynomial hash (either Poly or BRW). Thus, in terms of hardware, the basic components required are an AES (both encryption and decryption cores) and an efficient finite-field multiplier. As the focus of this work is on BRW polynomials, in the rest of this section we discuss only the instantiation with BRW polynomials; the instantiation with $\mathrm{Poly}_h()$ is briefly discussed in Section 4.4.
Referring to the algorithm HEH.Encrypt Fig. 9 , requires $(m + 1)$ encryption calls to the block cipher, and for HMCH.Decrypt$^T_{h,K}$, $m$ encryption calls and one decryption call to the block cipher are required. The $(m - 1)$ block-cipher calls required by both the encryption and decryption procedures of HMCH can be parallelized. Thus, for both modes the bulk of the block-cipher calls can be parallelized. This suggests that a pipelined implementation of AES would be useful for implementing the ECB mode in HEH and the counter-type mode in HMCH. The computation of $\mathrm{BRW}_h(\cdot)$ can also be suitably parallelized (as discussed in Section 3). Thus, we also decided to use a pipelined multiplier to compute the BRW hash.
Out of the many AES designs reported in the literature [17], [10], [5], [15], [8], we decided to implement a 10-stage pipelined AES core architecture with the counter mode and/or the electronic code book functionalities. This decision was based on the fact that the structure of the AES algorithm admits a natural ten-stage pipeline design, where after 11 clock cycles one can get an encrypted block in each subsequent clock cycle. We refrain from using deeper pipeline designs such as the ones reported in [16], because such designs would incur a higher latency, i.e., the total delay before a single block of ciphertext can be produced would be higher with more pipeline stages. As the message lengths in the target application are particularly small (512 bytes), such deep pipeline designs are not suitable for a disk sector encryption application.

2. 512 bytes is the current size of disk sectors, though starting from this year hard disks with sector sizes of 4,096 bytes are also commercially available. The basic design strategy that we present has the required scalability.
As a target device for the implementation, we chose FPGAs of the Virtex 5 family, which are among the most efficient devices available on the market. In [4], a highly optimized AES design suitable for Virtex 5 FPGAs was reported. One important design decision taken in [4] was to implement the byte substitution table using the LUT fabric, in contrast to previous AES designs where extra block RAMs were used to store the look-up tables. This change has a positive impact on both the area and the length of the critical path, giving rise to better performance. The design described in [4] is sequential. The AES design implemented in this work closely follows the techniques used in [4], but we suitably adapt and extend them to a pipelined design.
Another important characteristic of the AES design presented here is that we did not attempt to design a stand-alone core equipped with encryption and decryption functionalities but instead, we chose to design separate cores for encryption and decryption. This gave us better throughput and also provided some extra flexibility in terms of optimization.
One of the TES schemes requires a sequential AES decryption core. In our experiments, we were unable to obtain good performance for the decryption core using the strategies described in [4]. Hence, for the design of the sequential decryption core, we adopted ideas from [9], where the AES transformations inverse byte substitution (IBS) and inverse mixcolumn (IMC) are combined in a single module called an inverse T-box. We implemented these T-boxes using large multiplexer blocks and the FPGA fabric, thus avoiding the slower block RAMs. The price to pay for this design decision is that our AES decryption core occupies twice as many slices as the design reported in [4] (see Table 2 for details).
As mentioned earlier, for the field multiplier we decided to use a three-stage pipelined Karatsuba multiplier. The number of stages was fixed with an eye on the critical path of the circuit. Once we had fixed our design for AES, we selected the number of pipeline stages of the multiplier so that its critical path matches that of the AES. As both components are used in the same circuit, if a very high number of pipeline stages were selected for the multiplier, the critical path would be given by the AES while the latency of multiplication would increase. Several exploratory experiments suggested that a three-stage pipeline is optimal, as the critical path of such a circuit just matches that of the AES circuit (see Section 4 of the supplementary material, available online, for more details on the field multiplier design).
Both HEH[$\psi$] and HMCH[$\psi$] were proved to be secure as tweakable enciphering schemes in [23]. The security proof requires $\psi_h()$ to be an almost xor universal (AXU) hash function. Both $h \cdot \mathrm{BRW}(X_1, \ldots, X_{m-1})$ and $h \cdot \mathrm{Poly}(X_1, \ldots, X_{m-1})$ are AXU. If $\pi : \{1, \ldots, m-1\} \to \{1, \ldots, m-1\}$ is a fixed permutation, then it is easy to see that $h \cdot \mathrm{BRW}(X_{\pi(1)}, X_{\pi(2)}, \ldots, X_{\pi(m-1)})$ is also AXU. Thus, using any fixed ordering of the message blocks for evaluating each of the BRW polynomials in the modes does not hamper their security properties. This observation is important in the context of hardware implementations of HEH[BRW] and HMCH[BRW], since an optimal computation of BRW polynomials requires an order of the message blocks different from the normal order. In our case, the permutation $\pi()$ is dictated by the algorithm Schedule. If $m = 31$ and the number of pipeline stages of the multiplier is three, the permutation needed for the correct execution of Schedule is shown in Table 1.
Thus, for implementing HEH[BRW].Encrypt, $\psi_h(P_1, \ldots, P_{31})$ in line 2 of the encryption algorithm in Fig. 8 is replaced by $h \cdot \mathrm{BRW}(P_{\pi(1)}, \ldots, P_{\pi(31)})$. A similar change is made in line 10 of the encryption algorithm and in lines 2 and 10 of the decryption algorithm. For implementing HMCH[BRW] we replace $\psi_h(P_2, \ldots, P_{32})$ in line 2 of Fig. 9 by $h \cdot \mathrm{BRW}(P_{\pi(1)+1}, P_{\pi(2)+1}, \ldots, P_{\pi(31)+1})$. A similar change is made in line 10 of the encryption algorithm and in lines 2 and 10 of the decryption algorithm.
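The index bookkeeping of the two substitutions can be illustrated with a small helper. This is only a sketch: the function name `reorder` is an assumption, and the three-element permutation is hypothetical (it is not the permutation of Table 1, which is not reproduced here). An offset of 0 corresponds to the HEH case and an offset of 1 to the HMCH case, where the hashed blocks start at the second block.

```python
def reorder(blocks, pi, offset=0):
    """Return [P_{pi(1)+offset}, ..., P_{pi(k)+offset}] using 1-based indices."""
    return [blocks[p + offset - 1] for p in pi]

pi = [2, 1, 3]                       # hypothetical fixed permutation
blocks = ["P1", "P2", "P3", "P4"]

print(reorder(blocks, pi))            # HEH-style:  ['P2', 'P1', 'P3']
print(reorder(blocks, pi, offset=1))  # HMCH-style: ['P3', 'P2', 'P4']
```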
An analysis of the possibility of parallelization and the details of the scheduling for the modes HMCH[BRW] and HEH[BRW] are provided in the supplementary material, available online. As our design of the BRW architecture requires two message blocks in each cycle, to exploit the parallelism we decided to use two pipelined AES encryption cores for both HMCH[BRW] and HEH[BRW]. This leads to the maximum possible throughput at the cost of more area. For decryption in HEH[BRW], two pipelined decryption cores are also required, as the bulk encryption in HEH[BRW] is done using an electronic code book type mode. In the case of HMCH[BRW], however, the bulk encryption is done using a counter mode; hence, for decryption in HMCH[BRW] a single sequential AES core is enough. In Section 5, we provide results for different variants of both modes using the different AES cores that we implemented.
Architecture of HMCH[BRW]
We implemented the modes HEH[BRW] and HMCH[BRW]. For both modes, encryption and decryption functionality were implemented in a single chip. In this section, we only describe the architecture for HMCH[BRW], which uses two pipelined encryption cores and a single sequential decryption core; it is shown in Fig. 10. For ease of exposition, Fig. 10 shows only the encryption part of the circuit; the sequential decryption core, an additional component of the circuit, is omitted for the sake of simplicity. The main components of the general architecture depicted in Fig. 10 are the following: a BRW polynomial hash block (which corresponds to the circuit shown in Fig. 6), two AES cores (equipped with both electronic code book and counter mode functionalities), and two $x^2$Times blocks. The $x^2$Times blocks compute $x^2 A$, where $A \in \mathbb{F}_{2^{128}}$. The architecture also includes five registers to store the values $M_1$, $\beta_1$, $\beta_2$, $U_1$, and $S$, and makes use of six multiplexer blocks labeled mux1 to mux6 in the figure. When the $x^2$Times block is first activated, it simply outputs the value placed at its input (for the circuit of Fig. 10, this input value corresponds to either $\beta_1$ or $\beta_2$). Thereafter, at each clock cycle the field element $x^2 A$ is produced as output, where $A \in \mathbb{F}_q$ is the last value computed by this block. The control unit of this architecture consists of a ROM storing a microprogram of 67 microinstructions, each consisting of a 28-bit control word. Additionally, the control unit uses a counter that gives the address of the next instruction to be executed.
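The behavior of the block that multiplies by $x^2$ can be modeled in software as two applications of the shift-and-conditional-xor step. The sketch below is a behavioral model only, under stated assumptions: the modulus $x^{128} + x^7 + x^2 + x + 1$ and the bit ordering (bit $i$ is the coefficient of $x^i$) are assumed, and the names `xtimes` and `X2Times` are illustrative.

```python
MASK = (1 << 128) - 1

def xtimes(a):
    """Multiply by x: one shift and one conditional xor (ASSUMED modulus 0x87 tail)."""
    a <<= 1
    if a >> 128:                 # the x^128 term appeared, so reduce it
        a = (a & MASK) ^ 0x87
    return a

class X2Times:
    """First activation passes the input through; later clocks multiply by x^2."""

    def __init__(self):
        self.state = None

    def clock(self, value_in=None):
        if self.state is None:
            self.state = value_in
        else:
            self.state = xtimes(xtimes(self.state))
        return self.state

blk = X2Times()
print(blk.clock(5), blk.clock(), blk.clock())  # prints: 5 20 80
```

Stepping by $x^2$ rather than $x$ is consistent with two cores each handling every other block, although the exact assignment of blocks to cores is given by Fig. 10 and not by this sketch.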
The general data flow of Fig. 10 can be described as follows. First, the parameter $\beta_1$ is computed as $\beta_1 = E_K(T)$. This is done by properly selecting mux1 and mux2 so that the tweak $T$ gets encrypted in a single mode by the AES even core. The value so obtained is stored in the register reg$\beta_1$, and $\beta_2 = x\beta_1$ is also computed and stored in reg$\beta_2$. Then, the plaintext blocks $P_2, \ldots, P_m$ are fed into the BRW hash block through the inputs inA and inB by the proper selection of mux4 and mux5. After 21 clock cycles, the hash of the plaintext blocks is available at outHash, allowing the computation of the parameter $M_1$ as $M_1 = \mathrm{outHash} \oplus P_1$, where $P_1$ is taken from the input signal inB. The parameter $U_1$ is computed as $E_K(M_1)$ by selecting the second input of mux1 as the input value for the AES even core. The value so computed is stored in reg$U_1$. At this point the circuit of It is noticed that this last computation is achieved in 28 clock cycles using the two AES cores in parallel. The encrypted blocks $C_i$ for $i = 2, \ldots, m$ are simultaneously sent to the circuit's outputs outA and outB and to the BRW hash block through a proper selection of mux4 and mux5. After 21 clock cycles, the hash of the cipher blocks is available at outHash, allowing the computation of the block $C_1$ as $C_1 = \mathrm{outHash} \oplus U_1$, where $U_1$ was previously computed and stored as explained above.
HEH[Poly] and HMCH[Poly]
For the sake of comparison we also implemented HEH[Poly] and HMCH[Poly]. As stated in Section 4, these schemes can be obtained by replacing $\psi_h()$ by $\mathrm{Poly}_h()$ in the algorithms of Figs. 8 and 9. When a normal polynomial is used for the constructions, the usual Horner's rule is the most efficient way to compute it. At first glance, the advantages of a pipelined multiplier cannot be exploited due to the sequential nature of Horner's rule. In [24], a three-way parallelization strategy was proposed to evaluate a normal polynomial using three different multipliers, thus running three different instances of Horner's rule in parallel. We adapt the strategy presented in [24] by using a three-stage pipelined multiplier to evaluate a normal polynomial using Horner's rule. where
Note that the multiplications in $p_1$ do not depend on the multiplications in $p_2$ and $p_3$, and so on. Hence, a three-stage pipelined multiplier can be used to compute $h\,\mathrm{Poly}_h(P_1, P_2, \ldots, P_{31})$. If $h^2$ and $h^3$ are precomputed, then the computation of the polynomial can be completed in 35 clock cycles.
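One way to realize such a three-way split is to group the terms of $h \cdot \mathrm{Poly}_h$ by the exponent of $h$ modulo 3 and to run one Horner chain in $h^3$ per group. The sketch below is a software check of that idea and not necessarily the exact decomposition used in the paper. It assumes $\mathrm{Poly}_h(P_1, \ldots, P_m) = P_1 h^{m-1} + \cdots + P_m$ and the modulus $x^{128} + x^7 + x^2 + x + 1$; the names `horner` and `three_way` are illustrative.

```python
MOD = (1 << 128) | 0x87  # ASSUMED modulus: x^128 + x^7 + x^2 + x + 1

def gf_mul(a, b):
    """Multiply in GF(2^128); bit i of an integer is the coefficient of x^i."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> 128:
            a ^= MOD
    return r

def horner(h, blocks):
    """Sequential reference: h * (P_1 h^(m-1) + ... + P_m)."""
    acc = 0
    for x in blocks:
        acc = gf_mul(acc, h) ^ x
    return gf_mul(acc, h)

def three_way(h, blocks):
    """Three independent Horner chains in h^3, one per residue of the exponent mod 3."""
    h2 = gf_mul(h, h)
    h3 = gf_mul(h2, h)
    m = len(blocks)
    chains = [0, 0, 0]
    for i, x in enumerate(blocks):
        r = (m - i - 1) % 3              # block i carries h^(m-i); r = (m-i-1) mod 3
        chains[r] = gf_mul(chains[r], h3) ^ x
    # chain 0 is scaled by h, chain 1 by h^2, chain 2 by h^3
    return gf_mul(chains[0], h) ^ gf_mul(chains[1], h2) ^ gf_mul(chains[2], h3)

h = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0   # hypothetical key
blocks = [(i * 0x9E3779B97F4A7C15) & ((1 << 128) - 1) for i in range(1, 32)]
print(three_way(h, blocks) == horner(h, blocks))  # prints True
```

Consecutive blocks update different chains, so each multiplication is independent of the two issued just before it, which is what allows a three-stage pipeline to stay busy.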
For HEH[BRW] we used two pipelined AES encryption and decryption cores, and for HMCH[BRW] we used two pipelined encryption cores and a single sequential decryption core. The use of two AES cores gave us considerable savings in the number of clock cycles (as discussed in Section 3 of the supplementary material, available online), as $h \cdot \mathrm{BRW}(\cdot)$ can be computed in only 21 cycles. However, $h \cdot \mathrm{Poly}(\cdot)$ requires 35 clock cycles to complete, and hence dedicating two cores to this task does not give rise to any savings. Hence, for HEH[Poly] we used one pipelined AES encryption core and one pipelined AES decryption core, and for HMCH[Poly] we used one pipelined AES encryption core and one sequential AES decryption core.
EXPERIMENTAL RESULTS
In this section, we present the experimental results obtained from our implementations. All reported results were obtained from place-and-route simulations, where the target device is a XILINX Virtex 5 xc5vlx330-2ff1760. Table 2 shows the performance of the basic primitives. For simplicity in comparing the results, the area figures of the AES cores shown in Table 2 do not include the area of the AES key scheduling algorithm, which was implemented at a cost of 750 slices with a small associated critical path that did not affect the maximum clock frequency achievable by the rest of the design. Table 2 clearly shows that $\mathrm{BRW}_h(\cdot)$ is much faster than $\mathrm{Poly}_h(\cdot)$, but $\mathrm{BRW}_h(\cdot)$ occupies more slices than $\mathrm{Poly}_h(\cdot)$. We note that only the pipelined AES decryption core achieved a lower frequency than the hash blocks. Thus, in the case of HMCH[BRW], which does not use the pipelined decryption core, the critical path is given by the hash block, and in the case of HEH[BRW] the critical path is given by the pipelined decryption core.
For both HEH[BRW] and HMCH[BRW] we implemented three variants, which we name 1, 2, and 3. The naming conventions along with the performance of the variants are described in Table 3. It is worth mentioning that all six TES variants implemented require a minimum of three and a maximum of four AES cores, but only one key scheduling block was implemented per TES variant. The results reported in Table 3 include the key schedules. Table 3 also shows the variants using Poly. From the results shown in Table 3 we can infer the following: The closest work with which our designs can be compared is [18]. In [18] ] achieve much better throughput using lesser area. However, one needs to be careful in comparing the area metric, as the structure of the slices in Virtex 4 and Virtex 5 is very different. All in all, our designs achieve much better throughput than the ones reported in [18] for the following reasons:
- The Virtex 5 technology adopted in this work allows for higher achievable frequencies.
- Our AES core design is especially suited to the Virtex 5 technology and uses the special slice structure of such devices. As a result, much better frequencies than those of the designs reported in [18] can be achieved.
- The multiplier used in [18] is a combinatorial circuit which produces one product in each clock cycle; this design gives a much longer critical path than our pipelined multiplier. Hence, our circuits for HEH[Poly] and HMCH[Poly] operate at much higher frequencies and thus give better throughput.
CONCLUSION
We studied BRW polynomials from a hardware implementation perspective and designed an efficient architecture to evaluate them. The design of the architecture was based on a combinatorial analysis of the structural properties of BRW polynomials. Our experiments show that BRW polynomials are an efficient alternative to normal polynomials. Moreover, we explored hardware architectures for tweakable enciphering schemes using BRW polynomials, and the results show that TES designed using BRW polynomials are a far better alternative than those using normal polynomials. In spite of the comprehensive study presented in this paper, we think that the following open problems are worthy of further study:
1. In this work, a full characterization of the algorithm Schedule was given for small values of $m$. Although this is enough for our purposes and for all other practical purposes, a full characterization for arbitrary values of $m$ may be an interesting combinatorial exercise. Such a characterization may also tell us which configurations of the collapsed forest admit a full pipeline for a given number of pipeline stages. This study would help to define a weaker form of optimality (as mentioned in Section 2.3) which would be achievable in all cases.
2. Given a fixed number of pipeline stages, we presented a method for counting the number of extra storage locations for each configuration of the collapsed forest. A somewhat more formal combinatorial analysis may yield a closed-form formula for counting the extra storage locations.
3. Our designs for HEH and HMCH strive to exploit parallel computation opportunities assuming that the messages to be encrypted are 32 AES blocks long, which is the size of a disk sector. In a practical application, though, multiple sectors may be written to or read from a disk at the same time. This opens up the possibility of parallelizing across sectors. The structure of both HEH and HMCH would allow such parallelism, yielding architectures that can give some extra savings on the total number of clock cycles reported here (the interested reader is referred to Section 3.1 of the supplementary material, available online, for more details).

His current research interests include design and analysis of provably secure symmetric encryption schemes, efficient software/hardware implementations of cryptographic primitives, pattern recognition, and neural networks.
Cuauhtemoc Mancillas-López received the BE degree in electronic and communications engineering from ESIME-Instituto Politécnico Nacional (IPN), Mexico, in 2004, and the MSc degree in computer science from CINVESTAV-IPN, Mexico, where he is currently working toward the PhD degree in the Computer Science Department. His current research interests include design and analysis of provably secure symmetric encryption schemes, efficient software/hardware implementations of cryptographic primitives, and computational arithmetic.
