The contribution of the paper is two-fold. First, we design a novel permutation-based hash mode of operation FP, and analyze its security. The FP mode is derived by replacing the hard-to-invert primitive of the FWP mode (designed by Nandi and Paul, and presented at Indocrypt 2010) with an easy-to-invert permutation; since easy-to-invert permutations with good cryptographic properties are normally easier to design, and are more efficient, than hard-to-invert functions, the FP mode is more suitable for practical applications than the FWP mode.
1 Introduction
Block-cipher-based hash modes
Iterative hash functions are generally built from two components: (1) a basic primitive $C$ with finite domain and range, and (2) an iterative mode of operation $H$ to extend the domain of the hash function; the symbol $H^C$ denotes the hash function based on the mode $H$, which invokes $C$ iteratively to compute the hash digest. Therefore, to design an efficient hash function one has to be innovative with both the mode $H$ and the basic primitive $C$. The Merkle-Damgård mode used with a secure block cipher was the most attractive choice for building a practical hash function; some examples are the SHA family [30] and MD5 [34]. The security of a hash function based on the Merkle-Damgård mode crucially relies on the fact that $C$ is collision and preimage resistant. The compression function $C$ achieves these properties when it is constructed using a secure block cipher [9]. However, several security issues changed this popular design paradigm in the last decade. The first concern is that the Merkle-Damgård mode of operation, irrespective of the strength of the primitive $C$, came under a host of generic attacks; the length-extension attack [13], the expandable message attack [22], multi-collision attacks [20] and the herding attack [10, 21] are some of them. Several strategies were discovered to thwart the above attacks. Lucks came up with the proposal of making the output of the primitive $C$ at least twice as large as the hash output [26]; this proposal is outstanding since, apart from rescuing the security of the Merkle-Damgård mode, it is also simple and easy to implement. Another interesting proposal was HAIFA, which injects a counter into the compression function $C$ to rule out many of the aforementioned attacks [7]. Using the results of [9], it is easy to see that the Wide pipe and the HAIFA constructions are secure when the underlying primitive is a secure block cipher.
Despite the aforementioned foolproof design strategies, it turns out that using a block cipher as the basic primitive of a hash function may not be the best alternative, for several reasons. (1) A hash function does not need both the encryption and decryption functions of a block cipher; one of them could be avoided. (2) The key schedule of a block cipher often turns out to be weak [8]. (3) Furthermore, the key schedule weaknesses of a block cipher render invalid the very common ideal cipher assumption on which the security of block-cipher-based hash functions is usually based; note that an ideal cipher assumption is stronger than an ideal permutation assumption since, in the former case, an extra assumption is that a huge number of ideal permutations need to be independent too. (4) The amount of memory needed to implement a wide block cipher is larger, due to the 'extra' key schedule, than that needed for an equally sized permutation.

Table 1: Indifferentiability bounds of permutation-based hash modes, where the hash size is $n$ bits in each case; $\pi$ denotes the permutation, or one of many equally sized permutations. FP Ext1 is a natural variant of FP with parameters shown in the row. The $\epsilon$ is a small fraction due to the preimage attack on JH presented in [6]. Msg-blk denotes the message block.
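| Mode of operation | Msg-blk ($\ell$) | Size of $\pi$ ($a$) | Rate ($\ell/a$) | Indiff. bound (lower) | Indiff. bound (upper) | # of independent permutations |
|---|---|---|---|---|---|---|
| Hamsi [24] | $n/8$ | $2n$ | 0.07 | $n/2$ | $n$ | 1 |
| Luffa [5] | $n/3$ | $n$ | 0.33 | $n/4$ | $n$ | 3 |
| Sponge [4] | $n$ | $3n$ | 0.33 | $n$ | $n$ | 1 |
| Sponge [4] | $n$ | $2n$ | 0.5 | $n/2$ | $n/2$ | 1 |
| JH [29] | $n$ | $2n$ | 0.5 | $n/2$ | $n(1 - \epsilon)$ | 1 |
| Grøstl [17] | $n$ | $2n$ | 0.5 | $n/2$ | $n$ | 2 |
| FP | $n$ | $2n$ | 0.5 | $n/2$ | $n$ | 1 |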
Permutation-based hash modes
For the reasons described in the previous section, the popularity of permutation-based hash functions has been on the rise since the discovery of weaknesses in the Merkle-Damgård mode. Sponge [3], Grøstl [17], JH [39], Luffa [11] and the Parazoa family [1] are some of them. We note that 9 out of 14 semi-finalist algorithms, and 3 out of 5 finalist algorithms, of the NIST SHA-3 hash function competition are based on permutations. Also, NIST selected Keccak, which is a permutation-based Sponge construction, as the winner of the SHA-3 competition. Another notable example is MD6 [35]. In Table 1, we compare the generic security and performance (measured in terms of rate) of various well-known permutation-based hash modes.
Our contribution

The FP mode
Our first contribution is a proposal for a new hash mode of operation FP based on a single wide-pipe permutation (see Figure 1). The FP mode is derived from the FWP (or Fast Wide Pipe) mode designed by Nandi and Paul at Indocrypt 2010 [31]. The difference between the FWP and the FP mode is simple: the FP mode is obtained when the underlying hard-to-invert function $f : \{0,1\}^{m+n} \to \{0,1\}^{2n}$ of the FWP mode is replaced by an easy-to-invert permutation $\pi : \{0,1\}^{2n} \to \{0,1\}^{2n}$. There are a number of practical reasons for switching from FWP to FP: (1) easy-to-invert permutations are usually efficient, and such permutations with strong cryptographic properties are abundant in the literature (e.g. the JH, Grøstl and Keccak permutations); (2) hard-to-invert functions are difficult to design, or they are less efficient.
On the other hand, easy-to-invert permutations, even though they are faster, have some drawbacks; the most crucial of them is that they allow the attacker to use reverse queries in addition to forward queries, and, as a result, make the adversary inherently more powerful. Therefore, a good deal of caution is required to design a hash mode of operation that uses permutations. We show that the FP mode based on an ideal permutation is indifferentiable from a random oracle up to approximately $2^{n/2}$ queries (forward and reverse together); this means that the FP mode is secure against all generic attacks (including (multi-)collision, 2nd preimage, and herding attacks) up to approximately $2^{n/2}$ queries, under the assumption that the underlying $2n$-bit permutation is structurally strong. Moving further, we performed experiments to implement our indifferentiability framework with randomly generated graphs using C programs, and our experiments strongly indicate that the indifferentiability security of the FP mode could be improved close to $n$ bits. Another important feature of our work is that the security guarantee is based on only one assumption, as for the Sponge and JH: that the underlying permutation should not display any structural weaknesses; note that the security of many permutation-based hash functions (e.g. Grøstl and Luffa) requires additional assumptions such as independence of several ideal permutations. In Table 1, we compare the FP mode and a natural extension of it, FP Ext1, with other permutation-based hash functions. It is noteworthy that the FP mode exhibits the best security/rate trade-off when the internal permutation size is fixed.
Design and FPGA implementation of SAMOSA
Our second contribution is establishing the practical usefulness of the FP mode. As an example, we design a concrete hash function family SAMOSA. It is based on the FP mode, where the internal primitives are the P-permutations of the Grøstl hash function. We provide security analysis of SAMOSA, demonstrating its resistance against any known practical attacks.
As demonstrated by the AES and the SHA-3 competitions, the security of a cryptographic algorithm alone is not sufficient to make it stand out from among multiple candidates competing to become a new American or international standard. Excellent performance in software and hardware is necessary to make a cryptographic protocol usable across platforms and commercial products. Assuring good performance in hardware is typically more challenging, since hardware design requires involved and specialized training, and, as it turns out, the majority of designer groups lack experience and expertise in that area.
In the case of SAMOSA, the algorithm design and hardware evaluation have been performed side by side, leading to a full understanding of all design decisions and their influence on hardware efficiency. In this paper, we present an efficient high-speed architecture of SAMOSA, and show that this architecture outperforms the best known architecture of Grøstl in terms of the throughput-to-area ratio by 24% to 51%. These results have been demonstrated using two representative FPGA families from two major vendors, Xilinx and Altera. As shown in [16], these results are also very likely to carry over to any future implementations based on ASICs (Application Specific Integrated Circuits). Additionally, we demonstrate that SAMOSA consistently ranks above BLAKE, Skein and Grøstl in our FPGA implementations. Although it still loses to Keccak and JH, its relative distance to these algorithms is substantially smaller than that of Grøstl, despite using the same underlying operations. This performance gain is accomplished without any known degradation of the security strength.
Additionally, SAMOSA's dependence on many AES operations makes it suitable for software implementations that use general-purpose processors with AES instruction sets, such as AES-NI. Finally, in both software and hardware, SAMOSA could be an attractive choice for applications where both confidentiality and authentication are required to share AES components. One such example is IPSec, a protocol used for establishing Virtual Private Networks, which is one of the fundamental building blocks of secure electronic transactions over the Internet.
Although SAMOSA comes too late for the current SHA-3 competition, it still has a chance to contribute to better understanding of the security and performance bottlenecks of modern hash functions, and to find niche platforms and applications in which it may outperform the existing and upcoming standards.
Notation and convention
Throughout the paper we let $n$ be a fixed integer. While representing a bit-string, we follow the convention of low-bit first (or little-endian bit ordering). For concatenation of strings, we use $a \| b$, or just $ab$ if the meaning is clear. The symbol $\langle n \rangle_m$ denotes the $m$-bit encoding of $n$. The symbol $|x|$ denotes the bit-length of the bit-string $x$, or sometimes the size of the set $x$. Let x
Definition of the FP Mode
Suppose $n \ge 1$. Let $\pi : \{0,1\}^{2n} \to \{0,1\}^{2n}$ be the $2n$-bit permutation used by the FP mode. The hash function $\mathrm{FP}^{\pi}$ is a mapping from $\{0,1\}^*$ to $\{0,1\}^n$. The diagram and description of the FP transform are given in Figures 1 and 3(a), where $\pi$ is modeled as an ideal permutation. Below we define the padding function $\mathrm{pad}_n(\cdot)$.

Padding function $\mathrm{pad}_n(\cdot)$. It is an injective mapping from $\{0,1\}^*$ to $\cup_{i \ge 1} \{0,1\}^{ni}$, where the message $M \in \{0,1\}^*$ is mapped into a string
The function $\mathrm{pad}_n(M) = M \| 1 \| 0^t$ satisfies the above properties ($t$ is the least non-negative integer such that $|M| + 1 + t \equiv 0 \pmod{n}$). Note that $k = \lceil (|M|+1)/n \rceil$. In addition to the injectivity of $\mathrm{pad}_n(\cdot)$, we will also require that there exists a function $\mathrm{dePad}_n(\cdot)$ that can efficiently compute $M$, given $\mathrm{pad}_n(M)$. Formally, the function $\mathrm{dePad}_n : \cup_{i \ge 1} \{0,1\}^{in} \to \{\bot\} \cup \{0,1\}^*$ computes $\mathrm{dePad}_n(\mathrm{pad}_n(M)) = M$ for all $M \in \{0,1\}^*$, and otherwise $\mathrm{dePad}_n(\cdot)$ returns $\bot$. The padding rule described above also satisfies this property.
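As an illustration, the padding rule and its inverse can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: bit-strings are modeled as Python strings of '0' and '1' characters, `None` stands in for $\bot$, and the function names `pad` and `de_pad` are chosen here for the example.

```python
from typing import Optional

def pad(msg: str, n: int) -> str:
    """Return msg || 1 || 0^t, with t the least t >= 0 making the length a multiple of n."""
    t = (-(len(msg) + 1)) % n
    return msg + "1" + "0" * t

def de_pad(padded: str, n: int) -> Optional[str]:
    """Invert pad; return None (standing in for the symbol ⊥) on any other input."""
    last_one = padded.rfind("1")
    if last_one < 0 or len(padded) % n != 0:
        return None
    # A valid padding has at most n - 1 trailing zeros after the final 1.
    if len(padded) - last_one > n:
        return None
    return padded[:last_one]

if __name__ == "__main__":
    for message in ("", "1011", "1" * 8):
        padded = pad(message, 8)
        print(repr(message), "->", padded, "->", repr(de_pad(padded, 8)))
```

The check on the number of trailing zeros is what makes `de_pad` reject strings that are a multiple of $n$ bits long but could not have been produced by `pad`.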
Indifferentiability Framework: An Overview
We first define the indifferentiability security framework, and briefly discuss its significance. Definition 3.1 (Indifferentiability framework) [13] 
The significance of the framework is as follows. Suppose an ideal primitive $G$ is indifferentiable from an algorithm $T$ based on another ideal primitive $F$. In such a case, any cryptographic system $P$ based on $G$ is as secure as $P$ based on $T^F$ (i.e., $G$ replaces $T^F$ in $P$). For a more detailed explanation, we refer the reader to [27]. Some limitations of the indifferentiability framework have recently been discovered in [15] and [33]. They offer a deep insight into the framework; nevertheless, the observations are not known to affect the security of the indifferentiable hash functions in any meaningful way.
An oracle, a system, and a game. An oracle is an algorithm (accessed by another oracle or algorithm) which, given an input as an appropriately defined query, responds with an output. For example, in Figure 2 (a), T , F, G and S are oracles. A system is a set of oracles (e.g. System 1 = (T, F), System 2 = (G, S) in Figure 2(a) ). A game is the interaction of a system with an adversary. We refrain from providing a formal definition of a game, since such formalization will not be necessary in our analysis.
Main Theorem: Birthday Bound for FP Mode
Let $\mathrm{RO} : \{0,1\}^* \to \{0,1\}^n$ and $\pi : \{0,1\}^{2n} \to \{0,1\}^{2n}$ be a random oracle and an ideal permutation, respectively. Our indifferentiability framework uses three systems G0 = (FP, $\pi$, $\pi^{-1}$), G1 = (FP1, S1, S1$^{-1}$), and G2 = (RO, $S$, $S^{-1}$) (see Figure 2). The description of FP1, $S$, $S^{-1}$, S1, and S1$^{-1}$ will be provided in Section 10. Now we state our main theorem using Definition 3.1.
, and $\sigma \le K 2^{n/2}$, where $K$ is a fixed constant derived from $\varepsilon$.
In the next few sections, we will prove Theorem 4.1 by breaking it into several components. First, we briefly describe what the theorem means: it says that no adversary with unbounded running time can mount a nontrivial generic attack on the hash function $\mathrm{FP}^{\pi}$ using at most $K 2^{n/2}$ queries. The parameter $K$ is an increasing function in $\varepsilon$, and is constant for all $n > 0$. To reduce the notational complexity, we shall derive the indifferentiability bound assuming $\varepsilon = 0.5$, for which we shall derive $K = 1/\sqrt{56}$.
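To get a feel for the size of the bound $K 2^{n/2}$ with $K = 1/\sqrt{56}$, the following short sketch computes its base-2 logarithm for a few digest sizes. The digest sizes listed are illustrative choices made here, and the helper name `query_bound_log2` is hypothetical.

```python
import math

def query_bound_log2(n: int, K: float = 1 / math.sqrt(56)) -> float:
    """Return log2(K * 2^(n/2)), i.e. the query bound expressed in bits."""
    return n / 2 + math.log2(K)

if __name__ == "__main__":
    for n in (224, 256, 384, 512):  # illustrative digest sizes, not from the text
        print(f"n = {n}: bound is about 2^{query_bound_log2(n):.2f} queries")
```

Since $\log_2(1/\sqrt{56}) \approx -2.9$, the constant costs slightly under three bits relative to the plain birthday exponent $n/2$.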
Outline of the Proof. Our proof of Theorem 4.1 uses the blueprint developed in [28] and [29] dealing with indifferentiability security analysis of the FWP and JH modes of operation. However, to make the paper self-contained, we write out the proof from scratch. The proof consists of the following two components (see Definition 3.1): (1) Construction of a simulator $\mathcal{S} = (S, S^{-1})$ with the worst-case running time
. This is done in Section 10. (2) Showing that, for any adversary A with unbounded running time,
where the systems G0 = (FP, $\pi$, $\pi^{-1}$) and G2 = (RO, $S$, $S^{-1}$). The systems G0 and G2 are called the main systems. Proof of (1) is, again, composed of proofs of the following three (in)equations:
• In Sections 8 and 9, we will concretely define the simulator pair (S, S −1 ) and a new intermediate system G1. Using them we will show in Section 10,
• In Sections 11 and 12, we will appropriately define a set of events BAD i and GOOD i in the system G1, and will establish that
• In Section 13, we complete proof of (1) by establishing that
where
5 Organization of the paper
Proof of main theorem
Sections 6 to 13 are devoted to completing the proof of Theorem 4.1. The three parts of the proof (i.e., proving (2), (3), and (4)) are done in Sections 10, 12, and 13. These sections make use of the results that are developed in the following sections: Section 6 defines the data structures used by all systems and proofs; Sections 7, 8, 9, and 11 give detailed descriptions of all the systems.
Design and implementation of SAMOSA
In Section 14, we propose a new concrete hash function named SAMOSA, and provide its security analysis. Finally, in Section 15, we give FPGA hardware implementation results for SAMOSA.
Data Structures
The systems G0, G1, and G2 have been mentioned in Section 4 (see the schematic diagram in Figure 2(b)). Their pseudocode is given in Figures 3(a), 5, and 3(b), respectively. In this section we describe several data structures used by these systems.
Objects used in pseudocode
Oracles
The main component of a system is the set of oracles that receive queries from the adversary. In Figure 2 (b), any algorithm that receives a query is an oracle. Note that, except the adversary A, each rectangle denotes an oracle.
The systems use a total of 9 oracles. The oracles FP, FP1, and RO are mappings from $\{0,1\}^*$ to $\{0,1\}^n$. The oracle $S$ is a mapping from $\{0,1\}^{2n}$ to $\{0,1\}^{2n}$. The permutations $\pi$, $\pi^{-1}$, S1, and S1$^{-1}$ are all defined on $\{0,1\}^{2n}$, while $S^{-1}$ is a mapping from $\{0,1\}^{2n}$ to $\{0,1\}^{2n} \cup \{\bot\}$. Instruction-by-instruction descriptions of these oracles and the subroutines they use are provided in the subsequent sections.
Global and local Variables
The oracles described above will use several global and local variables. The local variables are re-initialized at every new invocation of the system, while the global data structures maintain their states across queries. The tables $D_l$, $D_s$ and $D_\pi$ are global variables initialized with $\bot$. The graphs $T_\pi$ and $T_s$ are also global variables, which initially contain only the root node $(IV, IV')$. All other variables are local, and they are initialized with $\bot$.
Query and round: definitions
In Figure 2 (b), an arrow denotes a query. The submitter and receiver algorithms of a query are denoted by the rectangles attached to the head and the tail of the arrow. 
Fresh and old queries. The current short query can also be of two disjoint types: (1) an old query, which is already present in the relevant database (e.g. for G1, when an adversary submits an s-query which is an intermediate π-query of a previously submitted long query); or (2) a fresh query, which is so far not present in the relevant database.
Message block. In order to compare the time complexities of the oracles FP, FP1 and RO on a uniform scale, we recall the notion of a message block. A long query M -irrespective of the oracle -is assumed to be a sequence of k message blocks
Note that, for FP and FP1, every message block $m_i$ corresponds to a $\pi$-query $x \| m_i$ for some bit-string $x$. However, it is not known how the RO processes the message blocks of a long query $M$. We assume that the RO processes the message blocks sequentially, and that the time taken to process a message block is the same for FP, FP1 and RO.
Round (and query).
The time interval to process a short query or a message block is defined as a round. We assume that each round takes an equal amount of time. To simplify the analysis, henceforth, unless otherwise specified, a query would mean either a short query or a message block.
Rules of the game. An adversary never re-submits an identical query. Moreover, an $s$-query (or $s^{-1}$-query) is also not submitted if it matches the output of a previous $s^{-1}$-query (or $s$-query).
Graph theoretic objects used in proof of main theorem
In addition to objects defined in the section above, we will use the following notions for a rigorous mathematical analysis of our results.
Suppose $\pi : \{0,1\}^{2n} \to \{0,1\}^{2n}$ is an ideal permutation, and $D$ is a finite set of pairs of the form $(x, \pi(x))$.
Reconstructible message
At a high level, $M$ is a reconstructible message for the set $D$ if $D$ contains all the $\pi$-queries and responses $(x, \pi(x))$ required to compute
(Full) Reconstruction graph
To put it loosely, a reconstruction graph stores reconstructible messages on its branches. A full reconstruction graph stores all reconstructible messages. We now define them formally. A weighted digraph $T = (V, E)$ is defined by the set of nodes $V$ and the set of weighted edges $E$. A weighted edge $(v, w, v') \in E$ is an ordered triple such that $v, v' \in V$, and $w$ is the weight of the ordered pair $(v, v')$.

Definition 6.1 (Reconstruction graph) Suppose a weighted digraph $T = (V, E)$ is such that $V$ is a set of $2n$-bit strings, and, for all $(a, b, c) \in E$, the weight $b$ is an $n$-bit string. The graph $T$ is called a reconstruction graph for $D$ if, for every $(y_1 y'_1, m_2, y_2 y'_2) \in E$, the following equation holds: $y_2 y'_2 = \pi(y_1 m_2) \oplus (y'_1 \| 0)$ (all variables are $n$ bits each), where $(y_1 m_2, \pi(y_1 m_2)) \in D$.

(An example of a reconstruction graph is given in Figure 4, which will be discussed in detail in the subsequent sections.)

A branch $B$ of a reconstruction graph $T$, rooted at $IV\,IV'$, is fertile if $\mathrm{dePad}_n(m_1 m_2 \cdots m_k) \neq \bot$, where $(m_1, m_2, \cdots, m_k)$ is the sequence of weights on the branch $B$.
Remark: Each fertile branch of a reconstruction graph corresponds to exactly one reconstructible message.
Definition 6.2 (Full reconstruction graph) A reconstruction graph $T$ (for the set $D$) is full if, for each reconstructible message $M$ (for $D$), $T$ contains a fertile branch $B$ that corresponds to $M$.

FP(M)
  01. if M ∈ Dom(D_l) then return D_l[M];
  02. m_1 m_2 ... m_k := pad_n(M);
  03. y_0 := IV, y'_0 := IV';
  04. for (i := 1, 2, ..., k) y_i y'_i := π(y_{i-1} || m_i) ⊕ (y'_{i-1} || 0);
  05. r := π(y_k || y'_k);
  06. D_l[M] := r[n, 2n-1];
  07. return D_l[M];

π(x)
  11. if x ∉ Dom(D_π) then D_π[x] ←$ {0,1}^{2n} \ Rng(D_π);
  12. return D_π[x];

π^{-1}(r)
  21. if r ∉ Rng(D_π) then D_π^{-1}[r] ←$ {0,1}^{2n} \ Dom(D_π);
  22. return D_π^{-1}[r];

(a) System G0 = (FP, π, π^{-1}). For all i, |m_i| = |y_i| = |y'_i| = |r|/2 = n.

RO(M)
  001. if M ∈ Dom(D_l) then return D_l[M];
  002. D_l[M] ←$ {0,1}^n;
  003. return D_l[M];

S(x)
  100. r ←$ {0,1}^{2n};
  101. if r ∈ Rng(D_s) then Abort;
  102. ℳ := MessageRecon(x, T_s);
  103. if |ℳ| = 1 then r[n, 2n-1] := RO(M);
  104. D_s[x] := r;
  105. FullGraph(D_s);
  106. return r;

MessageRecon(x, T_s)
  201. ℳ' := FindBranch(x, T_s);
  202. ℳ := {dePad(X) | X ∈ ℳ'};
  203. return ℳ;

S^{-1}(r)
  300. x ←$ {0,1}^{2n};
  301. if x ∈ Dom(D_s) then Abort;
  302. D_s[x] := r;
  303. FullGraph(D_s);
  304. return x;

(b) System G2 = (RO, S, S^{-1}).

Figure 3: The main systems G0 and G2
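The pseudocode of G0 can be exercised on toy parameters. The sketch below is only an illustration under stated assumptions: bit-strings are Python strings of '0'/'1', the block size `n = 8` and the all-zero initial values are arbitrary choices made here, the ideal permutation is lazily sampled by rejection, and the class and function names are hypothetical. The table $D_l$ is omitted, since the lazily sampled permutation already makes repeated evaluations consistent.

```python
import random

def pad_blocks(msg: str, n: int) -> list:
    """Split msg || 1 || 0^t into n-bit blocks m_1, ..., m_k."""
    padded = msg + "1" + "0" * ((-(len(msg) + 1)) % n)
    return [padded[i:i + n] for i in range(0, len(padded), n)]

def xor_bits(a: str, b: str) -> str:
    return "".join("1" if p != q else "0" for p, q in zip(a, b))

class LazyPermutation:
    """Lazily sampled random permutation on width-bit strings (forward and inverse)."""
    def __init__(self, width: int, rng: random.Random):
        self.width, self.rng = width, rng
        self.fwd, self.inv = {}, {}

    def _fresh(self, used: dict) -> str:
        while True:  # rejection sampling from the points not used so far
            v = format(self.rng.getrandbits(self.width), f"0{self.width}b")
            if v not in used:
                return v

    def forward(self, x: str) -> str:
        if x not in self.fwd:
            r = self._fresh(self.inv)
            self.fwd[x], self.inv[r] = r, x
        return self.fwd[x]

    def inverse(self, r: str) -> str:
        if r not in self.inv:
            x = self._fresh(self.fwd)
            self.fwd[x], self.inv[r] = r, x
        return self.inv[r]

def fp_hash(msg: str, n: int, perm: LazyPermutation, iv: str, iv2: str) -> str:
    y, y2 = iv, iv2
    for m in pad_blocks(msg, n):
        r = perm.forward(y + m)
        # y_i y'_i := r XOR (y'_{i-1} || 0^n): only the first half is masked.
        y, y2 = xor_bits(r[:n], y2), r[n:]
    return perm.forward(y + y2)[n:]  # the digest is r[n, 2n-1]

if __name__ == "__main__":
    n = 8  # toy block size (assumption)
    perm = LazyPermutation(2 * n, random.Random(1))
    print(fp_hash("1011", n, perm, "0" * n, "0" * n))
```

Rejection sampling is adequate only for toy widths where few points are queried; it is used here to keep the sketch short.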
View
Very loosely, the data structure view records the history of the interaction between a system and an adversary. Let $x_i$ and $y_i$ be the $i$-th query from the adversary and the corresponding response from the system. The view of the system after $j$ queries is the sequence of queries and responses $\{(x_1, y_1), \ldots, (x_j, y_j)\}$.
Main System G0
Following the definition provided in Section 2, the system G0 implements the FP hash function using the ideal permutations $\pi$ and $\pi^{-1}$. See Figure 3(a).
Main System G2
See Figure 3 (b) for the pseudocode. The random oracle RO defined in Section 4 is implemented through lazy sampling. The only remaining part is to construct the simulator-pair (S, S −1 ). Our design strategy for the simulator-pair is fairly straightforward and simple. Before going into the details, we first provide a high level intuition.
Intuition for the simulator pair (S, S −1 )
The purpose of the simulator pair $(S, S^{-1})$ is two-pronged: (1) to output values that are indistinguishable from the output of the ideal permutation $(\pi, \pi^{-1})$, and (2) to respond in such a way that $\mathrm{FP}^{\pi}(M)$ and $\mathrm{RO}(M)$ are identically distributed. It will easily follow that, as long as the simulator pair $(S, S^{-1})$ is able to output values satisfying the above conditions, no adversary can distinguish between G0 and G2.
To achieve (1), the simulator $S$, for a distinct input $x$, should output a random value such that the distributions of $S(x)$ and $\pi(x)$ are close. Similarly, the simulator $S^{-1}$, for a distinct input $r$, should give outputs such that the random variables $S^{-1}(r)$ and $\pi^{-1}(r)$ follow statistically close distributions.
To achieve (2), the simulator-pair needs to generate reconstructible messages from the set D s . To accomplish this, it needs to do the following:
• Assessing the power of the adversary: To assess the adversarial power, the simulator pair $(S, S^{-1})$ maintains the full reconstruction graph $T_s$ for the set $D_s$ that contains all $s$- and $s^{-1}$-queries and responses; this helps the simulator keep track of all 'FP-mode-compatible' messages (more formally, all reconstructible messages) that can be formed using the elements of $D_s$. This is accomplished by a special subroutine FullGraph. The pictorial representation of the reconstruction graph $T_s$ is given in Figure 4.
• Adjusting the elements of the tables $D_l$ and $D_s$: Whenever a new reconstructible message $M$ is found, the simulator makes this crucial adjustment: it assigns $\mathrm{FP}^{S}(M) := \mathrm{RO}(M)$. It is fairly intuitive that, if $S$ and $\pi$ produce outputs according to statistically close distributions, then the distributions of $\mathrm{FP}^{S}(M)$ and $\mathrm{FP}^{\pi}(M)$ are also close. Since $\mathrm{FP}^{S}(M) = \mathrm{RO}(M)$, the distributions of $\mathrm{RO}(M)$ and $\mathrm{FP}^{\pi}(M)$ are also close. This is accomplished by the subroutine MessageRecon.
Detailed description of the simulator pair (S, S −1 )
We first describe the two most important parts of the simulator pair: the subroutines FullGraph and MessageRecon. To determine such messages $M$, first, FindBranch$(x, T_s)$ collects all branches between the nodes $(IV, IV')$ and $x$; then, it selects the sequence of weights $X = m_1 m_2 \cdots m_k$ for all such branches. Finally it returns the set $\{M = \mathrm{dePad}(X)\}$ for all $X$. If no such $M \neq \bot$ is found, then the subroutine returns the empty set. With the definition of the above subroutines, we now describe how $S$ and $S^{-1}$ respond to queries.
FullGraph(D s
An $s$-query and response (for $S$): For an $s$-query, the simulator $S$ assigns a uniformly sampled $2n$-bit value to $r$; if $r$ matches an old range point in $D_s$ then the round is aborted. Then the subroutine MessageRecon$(x, T_s)$ is invoked, which returns a set of reconstructible messages $\mathcal{M}$. If $|\mathcal{M}| = 1$ then the RO is invoked on $M \in \mathcal{M}$, and the value is assigned to $r[n, 2n-1]$. Finally, the graph $T_s$ is updated by FullGraph, before $r$ is returned. In Appendix B, we show that the worst-case running time of $S$ after $\sigma$ queries is $O(\sigma^5)$.
An $s^{-1}$-query and response (for $S^{-1}$): For an $s^{-1}$-query, the simulator $S^{-1}$ assigns a uniformly sampled $2n$-bit value to $x$; if $x$ matches an old domain point in $D_s$ then the round is aborted. Finally, the graph $T_s$ is updated by FullGraph, before $x$ is returned. In Appendix B, we show that the worst-case running time of $S^{-1}$ after $\sigma$ queries is $O(\sigma^5)$.
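The behaviour of the simulator pair can be illustrated with a toy Python model. Everything below is a sketch under explicit assumptions rather than the construction analysed in the paper: the block size `N = 16`, the initial values, and all function names are chosen here for illustration; `full_graph` rebuilds the graph from scratch as a fixed point of the edge rule of Definition 6.1 (the paper's FullGraph is not reproduced); and branch search is restricted to simple paths.

```python
import random

N = 16                       # toy block size n in bits (assumption)
IV, IV2 = "0" * N, "1" * N   # hypothetical initial values IV and IV'
rng = random.Random(2024)    # seeded only to make the sketch repeatable

def rand_bits(k):
    return format(rng.getrandbits(k), f"0{k}b")

def xor_bits(a, b):
    return "".join("1" if p != q else "0" for p, q in zip(a, b))

def pad(msg):
    return msg + "1" + "0" * ((-(len(msg) + 1)) % N)

def de_pad(bits):
    i = bits.rfind("1")
    if i < 0 or len(bits) % N != 0 or len(bits) - i > N:
        return None          # stands in for the symbol ⊥
    return bits[:i]

def full_graph(D):
    """All edges (v, m, child) reachable from the root under Definition 6.1."""
    edges, seen, frontier = set(), {IV + IV2}, [IV + IV2]
    while frontier:
        v = frontier.pop()
        for x, r in D.items():
            if x[:N] == v[:N]:                      # query x = y || m leaves node y y'
                child = xor_bits(r[:N], v[N:]) + r[N:]
                edges.add((v, x[N:], child))
                if child not in seen:
                    seen.add(child)
                    frontier.append(child)
    return edges

def message_recon(x, edges):
    """De-padded weight sequences of all simple branches from the root to node x."""
    out, found = {}, set()
    for v, m, c in edges:
        out.setdefault(v, []).append((m, c))
    def walk(node, weights, on_path):
        if node == x:
            M = de_pad(weights)
            if M is not None:
                found.add(M)
            return
        for m, c in out.get(node, []):
            if c not in on_path:
                walk(c, weights + m, on_path | {c})
    walk(IV + IV2, "", {IV + IV2})
    return found

class Simulator:
    def __init__(self):
        self.D_s, self.D_l, self.T_s = {}, {}, set()

    def RO(self, M):                                 # lazily sampled random oracle
        if M not in self.D_l:
            self.D_l[M] = rand_bits(N)
        return self.D_l[M]

    def S(self, x):
        r = rand_bits(2 * N)
        if r in self.D_s.values():
            raise RuntimeError("Abort")
        Ms = message_recon(x, self.T_s)
        if len(Ms) == 1:
            r = r[:N] + self.RO(next(iter(Ms)))      # r[n, 2n-1] := RO(M)
        self.D_s[x] = r
        self.T_s = full_graph(self.D_s)
        return r

    def S_inv(self, r):
        x = rand_bits(2 * N)
        if x in self.D_s:
            raise RuntimeError("Abort")
        self.D_s[x] = r
        self.T_s = full_graph(self.D_s)
        return x

def fp_via_simulator(sim, msg):
    """Evaluate the FP mode with S in place of the permutation."""
    padded, y, y2 = pad(msg), IV, IV2
    for i in range(0, len(padded), N):
        r = sim.S(y + padded[i:i + N])
        y, y2 = xor_bits(r[:N], y2), r[N:]
    return sim.S(y + y2)[N:]

if __name__ == "__main__":
    sim = Simulator()
    print(fp_via_simulator(sim, "1011") == sim.RO("1011"))
```

In this toy run, evaluating the mode through $S$ reproduces the random oracle's answer on the same message, which is the consistency property the simulator is designed to maintain. Rebuilding the graph on every query is wasteful and is done here only for brevity.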
Intermediate system G1
The pseudocode is provided in Figure 5 . For the sake of clear understanding, we first discuss the motivation for designing this system.
Motivation for G1
The main motivation for constructing a new system G1 is that it is difficult to compare the executions of the systems G0 and G2 instruction by instruction. The difficulty arises from the fact that G2 has a graph $T_s$ and two extra subroutines, FullGraph and MessageRecon, while G0 has no such graphs or subroutines. To get around this difficulty, we reduce G0 to an equivalent system G1 by endowing it with additional memory for constructing a similar graph $T_\pi$, and supplying it with the additional subroutines MessageRecon and PartialGraph. These additional components do not result in any difference in the input and output distributions of the systems G0 and G1 for any adversary (this result is formalized in Proposition 10.1); therefore, in the indifferentiability framework, G0 can be replaced by G1. Even though G1 and G2 now appear 'close', there are still important differences. The most crucial of them is that, in the former case, the long queries are processed as a sequence of $\pi$-queries; therefore, current $s$- and $s^{-1}$-queries of G1 may match old $\pi$-queries and responses, while such events are not possible for G2. This difference comes with two implications:
1. The reconstruction graph $T_\pi$ in G1 is built using $s$-, $s^{-1}$-, $\pi$-queries, and their responses stored in the
2. In G1, the reconstruction graph $T_\pi$ may not be full for the set $D_\pi$, since the subroutine PartialGraph adds only a few nodes, rather than all nodes, to $T_\pi$ every round; by contrast, the reconstruction graph $T_s$ for G2, built by the subroutine FullGraph, is necessarily full for the set $D_s$. In Section 11, we identify a set of events in the system G1, and then, in Section 12, show that, if those events do not occur, then the reconstruction graphs in both the systems are full.
FP1(M)
  001. m_1 m_2 ... m_{k-1} m_k := pad_n(M);
  002. y_0 := IV, y'_0 := IV';
  003. for (i := 1, ..., k) {
  004.   r := π(y_{i-1} m_i);
  005.   y_i y'_i := r ⊕ (y'_{i-1} || 0);
  006.   if y_{i-1} m_i is fresh then PartialGraph(y_{i-1} m_i, r, T_π); }
  007. if Type3 then BAD := True;
  008. r := π(y_k y'_k);
  009. if Type0-b then BAD := True;
  010. if y_k y'_k is fresh then PartialGraph(y_k y'_k, r, T_π);
  011. D_l[M] := r[n, 2n-1];
  012. return D_l[M];

MessageRecon(x, T_s)
  201. ℳ' := FindBranch(x, T_s);
  202. ℳ := {dePad(X) | X ∈ ℳ'};
  203. return ℳ;

π(x)
  301. if x ∉ Dom(D_π) then D_π[x] ←$ {0,1}^{2n} \ Rng(D_π);
  302. return D_π[x];

π^{-1}(r)
  501. if r ∉ Rng(D_π) then D_π^{-1}[r] ←$ {0,1}^{2n} \ Dom(D_π);
  502. return D_π^{-1}[r];

|m_i| = |m| = |y_i| = |y'_i| = |y_c| = |y'_c| = |y| = |y'| = |y*| = |r|/2 = n, for all i.
Detailed description of G1
Now we describe G1 in detail. For the moment, we postpone the description of the Type0, 1, 2, 3 and 4 events until Section 11, since they do not impact the output and the global data structures of G1. We first discuss the subroutines used by the oracles FP1, S1 and S1$^{-1}$.
PartialGraph$(x, r, T_\pi)$. This subroutine is invoked whenever a fresh $\pi$- or $\pi^{-1}$-query, with $r = \pi(x)$, is encountered. The subroutine updates the reconstruction graph $T_\pi$ with $(x, r)$ in the following way: First, the subroutine ContactPoints$(y_c = x[0, n-1])$ is invoked, which returns a set $C$ containing all nodes in $T_\pi$ that have $y_c$ as their least significant $n$ bits. The size of $C$ determines the number of fresh nodes to be added to $T_\pi$ in the current iteration. Using the members of $C$ and the new pair $(x, r)$, new weighted edges are constructed, stored in $E$, and added to $T_\pi$ using the subroutine AddEdge. See Figure 4 for a pictorial description.
Note that the reconstruction graph T π may not be full for the elements in D π ; hence the name PartialGraph.
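The following toy sketch illustrates why a graph maintained this way can fail to be full. It is an assumption-laden illustration, not the paper's pseudocode: the 4-bit block size, the root value, the two query-response pairs, and the function names are all invented for the example, and the edge rule is the one from Definition 6.1.

```python
N = 4                       # toy block size in bits (assumption)
ROOT = "0000" + "0000"      # hypothetical root node IV || IV'

def xor_bits(a, b):
    return "".join("1" if p != q else "0" for p, q in zip(a, b))

def contact_points(y_c, nodes):
    """Nodes of the graph whose first n bits equal y_c."""
    return {v for v in nodes if v[:N] == y_c}

def partial_graph(x, r, nodes, edges):
    """Attach the pair (x, r) to the current contact points only (one pass)."""
    new_edges = [(v, x[N:], xor_bits(r[:N], v[N:]) + r[N:])
                 for v in contact_points(x[:N], nodes)]
    for v, m, child in new_edges:   # plays the role of AddEdge
        edges.add((v, m, child))
        nodes.add(child)

def build(pairs):
    nodes, edges = {ROOT}, set()
    for x, r in pairs:
        partial_graph(x, r, nodes, edges)
    return edges

# Two made-up query-response pairs: q1 leaves the root, q2 leaves q1's child node.
q1 = ("0000" + "1010", "1100" + "0011")
q2 = ("1100" + "0101", "0110" + "1001")

if __name__ == "__main__":
    print(len(build([q1, q2])))  # q2 finds its contact point: both edges are added
    print(len(build([q2, q1])))  # q2 arrives too early: its edge is never added
```

When the pair for the second block is recorded before the pair for the first block, it has no contact point at that moment and is never revisited, so the resulting graph misses an edge that a full reconstruction graph for the same set of pairs would contain.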
MessageRecon$(x, T_s)$: This subroutine, which determines new reconstructible messages, has already been described in the context of G2. Note that the graph $T_s$ is the maximally connected subgraph of $T_\pi$ with the root node $(IV, IV')$, generated by the $s$- and $s^{-1}$-queries and responses stored in $D_s$; $x$ is the current $s$-query.
Now we describe how the oracles S1, S1 −1 , and FP1 respond to queries. A long query and response (for FP1): FP1 mimics FP, while updating the graph T π using the subroutine PartialGraph, whenever a fresh π-query is generated.
First Part of Main Theorem: Proof of (2)
From the definitions of systems G0 and G1, given in Sections 7 and 9, we are well equipped to prove (2).
Proposition 10.1 For any distinguishing adversary A,
Proof. From the description of S1 and S1$^{-1}$, we observe that, for all $x \in \{0,1\}^{2n}$, $\mathrm{S1}(x) = \pi(x)$ and $\mathrm{S1}^{-1}(x) = \pi^{-1}(x)$. Likewise, from the descriptions of FP1 and FP, for all $M \in \{0,1\}^*$, $\mathrm{FP1}(M) = \mathrm{FP}(M)$. □

11 Type0, 1, 2, 3, and 4 of System G1
In this section, we concretely define the Type0, Type1, Type2, Type3 and Type4 events of the system G1 (see Figure 5). Informally, they will be called 'bad' events, since these events set the variable BAD in G1. We first provide the motivation for these events.
Motivation
We recall that the adversary submits $s$-, $s^{-1}$- and long queries to the system G1 and receives responses, and, based on the history of query-response pairs (known as the view), she then tries to distinguish G1 from G2. Intuitively, those events are called 'bad' for which the outputs from the $\pi$ and $\pi^{-1}$ oracles of G1 can be predicted by the adversary with a probability better than when interacting with G2. These events primarily involve various forms of collision occurring in the graph $T_\pi$, allowing the adversary to generate non-trivial reconstructible messages. Secondly, we need to catch the events where current queries match old queries too. One can intuit that these events help the adversary in distinguishing G1 from G2. It is also important to note that, if $T_\pi$ is not a full reconstruction graph, then the adversary can also use this fact to compel G1 to produce outputs different from those of G2 (since G2 always maintains the full reconstruction graph $T_s$). The next sections deal with concrete definitions of these events, keeping the above motivation in mind.
Classifying elements of D π , branches of T π , and π/π −1 -queries
The Type0 to Type4 events depend on the elements in D π , the branches of T π , and the types of π-and π −1 -queries. In the following sections we first classify them.
Elements of D π : six types
The query-response pairs of D π are classified according to their known and unknown parts. The known part of a query-response pair is the part that is present in the view of the system G1, or that can be derived from the view with probability 1; the unknown part is not present in the view, and it cannot be derived from the view with probability 1. We observe that there are six types of such a pair, and we denote them by Q0, Q1, Q2, Q3, Q4 and Q5 in Figure 6 (a); the head and tail nodes -each 2n bits -denote the input to, and the output from the query. A two-sided arrowhead indicates that the corresponding input-output pair is generated from either a π-or a π −1 -query. The red and green circles -each n bits -denote unknown and known parts.
Branches of T π : four types
The branches of T π can be classified into four types, as shown in Figure 6( 
The π-and π −1 -queries: nine types
We observe that -based on the types described in the sections above -the current π-and π −1 -query can be categorized into the following classes.
1. Current π-query is an s-query. This can be of two types. 
Type0 and Type1 on Fresh queries
Intuition
We address the classes 1a, 2a, and 3ci of Section 11.2.3 together, since they are connected by the fact that the π-or π −1 -query is fresh. It is straightforward to notice that, if the outputs of the fresh queries are uniformly distributed, then distinguishing between G1 and G2 is difficult: Type0 events are designed to measure the degree to which the outputs of the π-and π −1 -queries are uniformly distributed. The second scenario is when the adversary is able to generate a non-trivial reconstruction message, for distinguishing G1 from G2. This is possible, if the fresh π-query causes a node collision in the graph T π , or if it causes an old query to be attached to a fresh node, or if an s −1 -query can be attached to a node of T π . Type1 events cover these events. In addition, we require that the absence of these events make the graph T π a full reconstruction graph. Detailed descriptions are below.
Type0: Distance from the uniform
Type0 event occurs when the output of a fresh π/π −1 -query is distinguishable from the uniform distribution $U[0, 2^{2n} - 1]$. A Type0 event can be of three types: event Type0-a occurs when a fresh π-query is an s-query; event Type0-b occurs when a fresh π-query is the final π-query of a long query; event Type0-c occurs when an s −1 -query is a fresh π −1 -query.
Figure 7: Type1 events of G1, including the reverse query collision on n bits (Type1-c). All arrows are n bits each. Red arrow denotes fresh n bits of output from the ideal permutation π/π −1 . The symbol "=" denotes n-bit equality.
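To give a feel for the "distance from the uniform" idea behind Type0, the sketch below computes the statistical distance between the output of a fresh query to a lazily sampled permutation (uniform over the values not yet used) and the uniform distribution over the whole range. This is only an illustration under that simplifying assumption; the function name is hypothetical and the exact Type0 conditions of G1 are not reproduced here.

```python
from fractions import Fraction


def distance_from_uniform(domain_size, used):
    """Statistical distance between the uniform distribution on the
    domain_size - used unused points and the uniform distribution on
    all domain_size points."""
    n_free = domain_size - used
    p_fresh = Fraction(1, n_free)       # probability of each unused point
    p_unif = Fraction(1, domain_size)   # probability under the full uniform
    # Half the L1 distance: unused points differ by p_fresh - p_unif,
    # used points have probability 0 versus p_unif.
    return Fraction(1, 2) * (n_free * (p_fresh - p_unif) + used * p_unif)
```

The result simplifies to used / domain_size, so the distance stays small as long as the number of previous queries is small compared with the range size.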
Type1: Collision in T π
There are three types of Type1 events (see Figure 7) . The purpose is to ensure that, if they do not occur then (1) no non-trivial reconstructible message can be generated by the adversary, (2) the growth of T π every round is "small", and (3) T π is a full reconstruction graph for the set D π .
• Type1-a. Suppose yy ′ is a fresh 2n-bit node generated when a fresh π-query is attached to T π . This event occurs when yy ′ collides with another node in T π ; this collision can be used to generate at least two reconstructible messages in the next rounds -one of them can be used to distinguish G1 from G2. It is important to note that, even though we are interested in 2n-bit node collision, Type1 event captures collision on the least significant n bits of the nodes. Therefore, it includes a bigger set of events than necessary. This is done to bound the growth of the graph; more precisely, it allows at most one fresh node to be added in the next round, if this event does not occur.
• Type1-b. Suppose yy ′ is a fresh node as defined above. This event occurs if yy ′ collides with any element in Dom(D π ); like before, this event can also be used to form a non-trivial reconstructible message. In a similar manner as Type1-a, we define Type1-b event when y collides with the least significant n bits of any element in Dom(D π ), and, as a result, it covers more events than required. Exactly like the Type1-a event, this is used to bound the growth of the graph, that is, it ensures that no new nodes can be added to T π in the present round, if this event does not occur.
• Type1-c. This event occurs when the output of the current s −1 -query collides with any node in T π , and thereby, the absence of this event precludes the formation of a reconstructible message. Like the previous two types, we define this event when a node and the output of the s −1 query collide on the least significant n bits. The absence of this event ensures that the s −1 -query is not added to T π .
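The three Type1 checks above all reduce to a collision test on the least significant n bits. A minimal sketch follows, assuming that nodes, domain elements and query outputs are encoded as Python integers whose low-order n bits are the "least significant n bits" of the text; all function names are hypothetical.

```python
def low_bits(value, n):
    """Least significant n bits of an integer-encoded value (encoding assumed)."""
    return value & ((1 << n) - 1)


def collides_on_low_bits(value, others, n):
    """True if value agrees with any element of others on the low n bits."""
    target = low_bits(value, n)
    return any(low_bits(o, n) == target for o in others)


def type1_a(fresh_node, graph_nodes, n):
    # Fresh node versus the existing nodes of the graph.
    return collides_on_low_bits(fresh_node, graph_nodes, n)


def type1_b(fresh_node, domain, n):
    # Fresh node versus the elements of Dom(D_pi).
    return collides_on_low_bits(fresh_node, domain, n)


def type1_c(inv_output, graph_nodes, n):
    # Output of the current inverse query versus the nodes of the graph.
    return collides_on_low_bits(inv_output, graph_nodes, n)
```

Testing only n bits flags more events than a full 2n-bit comparison would, which matches the conservative choice described in the bullets.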
Remark: Our conservative choice of Type1 events eventually degrades the indifferentiability bound of FP. The bound of n/2 bits of this paper seems likely to be improved by relaxing the above conditions. We experimented with a smaller set of events than the ones mentioned above, and obtained an indifferentiability bound very close to n bits. However, constructing a theoretical proof of this turns out to be an involved task.
Type2, Type3 and Type4 on Old queries
Intuition
Now we deal with the classes 1b, 2b, 3a, 3b and 3cii of Section 11.2.3. All of them address the issue when the current queries match old ones. The classes 1b or 2b happen when an s-or s −1 -query matches one of six types of old elements stored in D π ; these events can potentially help the adversary in distinguishing between G1 and G2, and we identify class 1b as Type2, and class 2b as Type4 events; the case by case analysis of the events will follow in a while.
The remaining classes are now 3a, 3b and 3cii, when the adversary submits a long query -say M -to the oracle FP1, and it is found that M is already present on some (fertile) branch of the graph T π (3a and 3b), or it is not present at all on any branch of T π (3c). The class 3c necessarily includes a fresh π-query, and this scenario has already been considered in Type0, and Type1; one can also see that class 1b (or Type2 events) already included the class 3cii. So we skip them here. The other classes -3a and 3b -are crucial now, and they represent when M corresponds to an already present red or green branch of T π (definitions in Section 11.2). We ignore the classes 3b, and 3aii, since they do not help the adversary in distinguishing systems.
So now we focus on the class 3ai, which deals with the final π-query of a red branch. Depending on the type of branch, the adversary tries to predict the most significant n bits of the final π-query (i.e., the hash output) with non-trivial probability; she succeeds only for Type3 events that will be discussed shortly.
Type2
Recall that a query-response pair in D π can be of six types: Q0 to Q5. Type2 event is divided into several cases depending on the type of the current s-query.
Type2-Q1, Type2-Q2, and Type2-Q4 events occur, if the s-query is type Q1, Q2 and Q4 respectively.
Type2-Q3 event occurs, if the s-query is type Q3, and if the most significant n bits are distinguishable from the uniform distribution.
Type2-Q5.
We observe that a Q5 query can be located in two different types of branch in T π , as shown in Figures 8(b)(I) and (II).
• Type2-Q5-1 occurs if the current s-query is Q5, and is located in a type I branch, and if the least significant n bits are distinguishable from random.
• Type2-Q5-2 occurs if the current s-query is Q5, and is located in a type II branch.
Type3
In this case, we consider the final π-query of a red branch as the current query. 
Type4
This event is shown in Figure 9 . The Type4 event occurs, if the current s −1 -query is equal to the output of an old query of type Q1, Q2, Q3, Q4 or Q5.
Second Part of Main Theorem: Proof of (3)
With the help of the Type0 to Type4 events described in Sections 11.3 and 11.4, we are equipped to prove (3). First, we fix a few definitions.
Definitions: GOOD i and BAD i
$\mathrm{GOOD}_i$ and $\mathrm{BAD}_i$. $\mathrm{BAD}_i$ denotes the event that the variable BAD is set during round i of G1, that is, that a Type0, Type1, Type2, Type3 or Type4 event occurs. Let the symbol $\mathrm{GOOD}_i$ denote the event $\neg \bigvee_{j=1}^{i} \mathrm{BAD}_j$. The symbol $\mathrm{GOOD}_0$ denotes the event when no queries are submitted. From a high level, the intuition behind the construction of the $\mathrm{BAD}_i$ event is straightforward: we will show that if $\mathrm{BAD}_i$ does not occur, and if $\mathrm{GOOD}_{i-1}$ did occur, then the views of G1 and G2 (after i rounds) are identically distributed for any attacker A.
$\mathrm{GOOD1}_i$ and $\mathrm{BAD1}_i$. In order to get around a small technical difficulty in establishing the uniform probability distribution of certain random variables, we need to modify the above events $\mathrm{GOOD}_i$ and $\mathrm{BAD}_i$ slightly. The event $\mathrm{BAD1}_i$ occurs when Type0, Type2, Type3 or Type4 events occur in the i-th round. The event $\mathrm{GOOD1}_i$ is defined as $\mathrm{GOOD}_{i-1} \wedge \neg \mathrm{BAD1}_i$.
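The bookkeeping behind these definitions can be sketched as follows. The per-round flags are assumed to be given as Python lists of booleans, and treating GOOD 0 as always true is an assumption made for the illustration; the function names are hypothetical.

```python
def good_events(bad):
    """bad[i-1] is True iff BAD_i occurred in round i.
    Returns [GOOD_0, GOOD_1, ..., GOOD_sigma], where GOOD_i holds iff
    none of BAD_1, ..., BAD_i occurred (GOOD_0 is taken to be True)."""
    good = [True]
    for b in bad:
        good.append(good[-1] and not b)
    return good


def good1_events(good, bad1):
    """bad1[i-1] is True iff BAD1_i occurred in round i.
    Returns [GOOD1_1, ..., GOOD1_sigma] with
    GOOD1_i = GOOD_{i-1} and not BAD1_i."""
    return [good[i - 1] and not bad1[i - 1] for i in range(1, len(bad1) + 1)]
```

Once a bad event occurs in some round, every later GOOD flag is false, which is what makes the round-by-round induction possible.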
Proof of (3)
To prove (3) we need to show two things:
The proof of (6) is straight-forward. To prove (5), we proceed in the following way. Observe
If we can show that
then (7) reduces to (5), since
As a result, we focus on establishing (8), which is done in Appendix C.
Third (or Final) Part of Main Theorem: Proof of (4)
To prove (4), we need to individually compute the probabilities of the Type0, Type1, Type2, Type3 and Type4 events described in Sections 11.3 and 11.4. Since we assume
The definition of the Type1 event guarantees that $T_\pi$ has i nodes after i − 1 rounds, given $\mathrm{GOOD}_{i-1}$. We assume $i \le 2^{n/2}$; this implies $(2^n - i) \ge \frac{1}{2} 2^n$.
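The implication used here is easy to check numerically. The sketch below assumes even n and tests the worst case, namely the largest i permitted by the assumption; the function name is hypothetical.

```python
def half_bound_holds(n):
    """Check 2^n - i >= 2^(n-1) at the worst case i = 2^(n/2), for even n."""
    worst_i = 1 << (n // 2)  # largest i allowed by i <= 2^(n/2)
    return (1 << n) - worst_i >= (1 << (n - 1))


# The inequality holds for every even n >= 2, including the hash sizes
# n = 256 and n = 512 considered later in the paper.
checked = {n: half_bound_holds(n) for n in (2, 8, 256, 512)}
```

Since the left-hand side only grows as i decreases, checking the largest i is sufficient.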
Estimating probability of Type0
From Section 11.3.2 we obtain,
Estimating probability of Type1
From Section 11.3.3 we obtain,
Estimating probability of Type2
From Section 11.4.2 we obtain,
Estimating probability of Type3
From Section 11.4.3 we obtain,
Estimating probability of Type4
From Section 11.4.4 we obtain,
Final computation
We conclude by combining the above bounds into the following inequality which holds for
A New Hash Function Family SAMOSA
Now we design a concrete hash function family SAMOSA based on the FP mode defined in Section 2. In the subsequent sections, we also provide security analysis and hardware implementation results of SAMOSA.
Description of SAMOSA
The SAMOSA hash family is based on the FP mode and the P-permutation of the Grøstl hash function family. Letting n denote the length of the hash in bits (n = 256 or 512), the complete description of the hash function SAMOSA-n is provided in Figure 10. SAMOSA is composed of three components: (1) the FP mode and the padding rule $\mathrm{pad}_n(\cdot)$ (see Section 2), (2) the initial value $IV \| IV' = \langle 0 \rangle_n \| \langle n \rangle_n$, and (3) the Grøstl permutation $P_{2n}$ (see [17]).
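As an illustration of component (2), the sketch below builds the 2n-bit initial value as the n-bit encoding of 0 followed by the n-bit encoding of n. The big-endian byte order and the function names are assumptions made for this example; the authoritative encoding is the one in Figure 10.

```python
def encode(value, n_bits):
    """n-bit encoding <value>_n as bytes (big-endian byte order is assumed)."""
    return value.to_bytes(n_bits // 8, "big")


def initial_value(n):
    """Concatenation <0>_n || <n>_n for hash size n (illustrative)."""
    return encode(0, n) + encode(n, n)


iv_256 = initial_value(256)  # 64 bytes, matching the 512-bit permutation
iv_512 = initial_value(512)  # 128 bytes, matching the 1024-bit permutation
```

Making the second half depend on n separates the 256-bit and 512-bit variants at the very first permutation call.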
Security analysis of the SAMOSA family
There are two ways to attack the SAMOSA hash function family: (1) Attacking the FP mode and (2) attacking the underlying permutation P 512 or P 1024 . In the next subsections we present the analysis results on the mode and the permutations. Based on that we conjecture that the SAMOSA family cannot be attacked non-trivially with work less than the brute force.
Security of the FP mode.
In Section 4 we have shown that the FP mode is indifferentiable from a random oracle up to approximately $2^{n/2}$ queries (up to a constant factor), where n is the hash size in bits. Our experiments with the FP mode suggest that it may be possible to improve the bound to nearly $2^n$ queries. The analysis implies that it is hard to attack any concrete hash function based on the FP mode without discovering non-trivial weaknesses in the underlying permutation. In our case, the permutations are P 512 and P 1024 of the Grøstl hash family.
Security analysis of Grøstl permutations P 512 and P 1024 .
The permutations P 512 and P 1024 of the Grøstl hash function have been two of the most heavily analyzed primitives in the SHA-3 hash function competition [32, 37, 19, 23, 40] . The best analysis of P 512 so far has been the discovery of differential properties for up to 9 (out of 10) rounds with work $2^{368}$ and memory $2^{64}$; for the permutation P 1024 , the best analysis is the discovery of differential properties for up to 10 (out of 14) rounds with work $2^{392}$ and memory $2^{64}$. Given the enormous costs of implementing these attacks, and also given the large amount of third-party cryptanalysis the permutations of Grøstl have resisted so far, it seems fair to say that P 512 and P 1024 are secure for all practical purposes.
15 FPGA Implementations of SAMOSA-256 and SAMOSA-512
Motivation and previous work
In case the security of two competing cryptographic algorithms is the same or comparable, their performance in software and hardware decides which one of them gets selected for use by standardization organizations and industry.
In this section, we will analyze how SAMOSA compares to Grøstl, one of the five final SHA-3 candidates, from the point of view of performance in hardware. This comparison makes sense, because both algorithms share a very significant part, permutation P, but differ in terms of the mode of operation. The FP mode requires only a single permutation P, while the Grøstl mode requires two permutations P and Q, executed in parallel. Our goal is to determine how much hardware area is saved by replacing the Grøstl construction with the FP mode. We would also like to know whether these savings come at the expense of any significant throughput drop. Finally, we would like to analyze how significant the improvement is in terms of the throughput to area ratio, a primary metric used to evaluate the efficiency of hardware implementations in terms of a trade-off between speed and cost of the implementation.
Multiple hardware implementations of Grøstl (and its earlier variant, referred to as Grøstl-0) have been reported in the literature and in the on-line databases of results (see [38] , [2] ). Most of these implementations use two major hardware architecture types: a) parallel architectures, denoted (P+Q), in which Grøstl permutations P and Q are implemented using two independent units, working in parallel, and b) quasi-pipeline architectures, denoted (P/Q), in which the same unit, composed of two pipeline stages, is used to implement both P and Q, and the computations belonging to these two permutations are interleaved [16] . Additional variants of each architecture type are possible, and the two most efficient ones are the basic iterative architecture (denoted as x1), and the vertically folded architecture with folding factor 2 (denoted as /2(v)) [16] .
A summary of implementation results, obtained for various architectures, using Xilinx Virtex 5 FPGAs, is given in Table 2 . Although the implementation by Latif et al. [25] is currently the most efficient on Virtex 5, this implementation relies on the use of low-level Xilinx FPGA primitives, and as a result is not portable to FPGAs of other vendors, such as Altera. Since our implementation of SAMOSA presented in this paper is fully portable, and does not use any low-level primitives, we compare it with the second best design of Grøstl reported earlier in the literature [16] , which has the same features. This design is based on the quasi-pipelined basic iterative architecture denoted as x1 (P/Q). This way, we will also be able to provide a comparison for an alternative FPGA family, Stratix III from Altera.

Table 2

| Source | Architecture | Throughput | Area | Throughput/Area |
|---|---|---|---|---|
| Latif et al. [25] | x1 (P/Q) | 6200 | 1419 | 4.37 |
| Gaj et al. [16] | x1 (P/Q) | 6117 | 1795 | 3.41 |
| Homsirikamol et al. [18] | x1 (P/Q) | 6072 | 1912 | 3.18 |
| Gaj et al. [16] | /2(v) (P/Q) | 3721 | 1195 | 3.11 |
| Homsirikamol et al. [18] | /2(v) (P+Q) | 4014 | 1598 | 2.51 |
| Gaj et al. [16] | x1 (P+Q) | 7213 | 2906 | 2.48 |
| Baldwin et al. [2] | x1 (P+Q) | 7709 | 3137 | 2.46 |
| Guo et al. [2] | x1 (P+Q) | 5027 | 3798 | 1.32 |
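The last column of Table 2 can be cross-checked against the first two. The sketch below assumes that the first numeric column is the throughput and the second is the area, with units left unspecified; this reading is consistent with the reported ratios but is an inference from the numbers, not something stated in the table.

```python
# (source, architecture, throughput, area, reported throughput/area)
TABLE_2 = [
    ("Latif et al. [25]", "x1 (P/Q)", 6200, 1419, 4.37),
    ("Gaj et al. [16]", "x1 (P/Q)", 6117, 1795, 3.41),
    ("Homsirikamol et al. [18]", "x1 (P/Q)", 6072, 1912, 3.18),
    ("Gaj et al. [16]", "/2(v) (P/Q)", 3721, 1195, 3.11),
    ("Homsirikamol et al. [18]", "/2(v) (P+Q)", 4014, 1598, 2.51),
    ("Gaj et al. [16]", "x1 (P+Q)", 7213, 2906, 2.48),
    ("Baldwin et al. [2]", "x1 (P+Q)", 7709, 3137, 2.46),
    ("Guo et al. [2]", "x1 (P+Q)", 5027, 3798, 1.32),
]


def ratio_mismatches(rows, tolerance=0.005):
    """Return the rows whose reported ratio differs from throughput / area
    by at least the rounding tolerance of a two-decimal figure."""
    return [row for row in rows if abs(row[2] / row[3] - row[4]) >= tolerance]


mismatches = ratio_mismatches(TABLE_2)  # expected to be empty
```

Every reported ratio agrees with throughput divided by area to two decimal places, and the rows are sorted by that ratio in decreasing order.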
High-speed architectures of SAMOSA and Grøstl
In case of SAMOSA, the best high-speed architecture is the basic iterative architecture shown in Figure 11 . In this architecture, a single round of the permutation P is implemented as combinational logic, and executed in a single clock cycle. As a result, r clock cycles are required to process each h-bit message block (where r is the number of SAMOSA rounds; r = 10 for h = 256, and r = 14 for h = 512), and the throughput becomes equal to $h/(r \cdot T_{CLK})$, where $T_{CLK} = 1/f_{CLK}$ is the clock period.
In case of Grøstl, the best high-speed architecture, based on Table 2 , is a quasi-pipelined architecture, denoted as x1 (P/Q). This architecture is shown in Figure 12 . The most important difference compared to the architecture of SAMOSA is that the central part of this architecture can be used to implement either a round of P or a round of Q, depending on the value of a control signal. We denote this logic as the P/Q round. Additionally, in order to speed up processing, we introduce a pipeline register that divides this logic into two independent pipeline stages. As a result, at the same time, one of these stages can process a part of permutation P, and the other can process a part of permutation Q. A total of 2r + 1 clock cycles are required to finish r rounds of both P and Q, and the clock frequency increases compared to the non-pipelined version. The throughput of this architecture is given by $b/((2r + 1) \cdot T_{CLK})$, where $T_{CLK} = 1/f_{CLK}$ is the clock period and b = 2h is the message block size and the datapath width in the Grøstl architecture.
For fairness, both designs use the same circuit interface, proposed in [12] , the same design methodology, and the same coding style. In particular, both designs use 64-bit input and output data buses, and the standard I/O units known as SIPO (Serial-In Parallel-Out) and PISO (Parallel-In Serial-Out).
The padding units of SAMOSA and Grøstl are illustrated in Figure 13 . The major difference between these two padding units is the existence of Block Counter in the padding unit of Grøstl. This counter and the following multiplexer have a small effect on the area of the Grøstl implementation with padding unit, but are not likely to affect the critical path, and thus the throughput of the entire circuit.
Comparison of SAMOSA and Grøstl in terms of the hardware performance
Below, we compare SAMOSA and Grøstl in terms of three major hardware performance metrics: Area, Throughput, and Throughput to Area Ratio. The exact results of the comparison are shown in Tables 3 and 4 . All results were generated using Xilinx ISE v13.1 and Altera Quartus II v11.1. Automated Tool for Hardware EvaluatioN (ATHENa) [2] was used to automate the optimization and result extraction process. No low-level primitives and no embedded resources (such as Block Memories or DSP units) were used in our implementations, which makes them fully portable among multiple FPGA families from various vendors. Each design has been implemented in two different versions: with and without padding unit. The designs with padding unit are more complete, while the designs without padding units are more suitable for comparison with hardware implementations presented in earlier academic papers on Grøstl and other SHA-3 candidates (as these implementations typically did not contain padding units).
Comparison in terms of Area
As shown in Tables 3 and 4 , for comparable hardware architectures, SAMOSA has significantly lower area requirements than Grøstl. For Xilinx FPGAs, the area reduction is between 27 and 35%; for Altera FPGAs, it is between 31 and 34%. This reduction is explained as follows. First, the P round is simpler than the P/Q round, as the relevant logic does not need to be switched from implementing the P permutation to implementing the Q permutation of Grøstl. Although both permutations are quite similar, they still differ in two out of four major operations: AddRoundConstant and ShiftBytes. Additional area requirements may result from inserting a pipeline register between the two stages of the P/Q round, as shown in Figure 12 (in some FPGA families, these registers may be combined with the preceding logic and no increase in the number of configurable logic units will be observed). Secondly, SAMOSA requires less surrounding logic than Grøstl. The total width of registers outside of the P round in the basic iterative architecture of SAMOSA is 3h. In Grøstl, the registers outside of the P/Q round have the total width of 2b = 4h. The total width of the multiplexers outside of the P round in SAMOSA is 4h. The width of similar multiplexers outside of the P/Q round in Grøstl is 5b = 10h. Finally, the number of 2-input XOR gates in SAMOSA is h, while in Grøstl it is 3b = 6h. Additionally, in the designs with padding unit, SAMOSA benefits from eliminating Block Counter from the padding logic, as shown in Figure 13 . All these differences amount to a significant advantage of SAMOSA over Grøstl in terms of the circuit area. This advantage is particularly important taking into account that one of the major weaknesses of Grøstl is its inherently large area in any high-speed hardware implementation.
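The widths of the surrounding logic quoted above can be tabulated for concrete values of h. The sketch below only restates those figures (3h, 4h, h for SAMOSA; 2b, 5b, 3b with b = 2h for Grøstl); the function and key names are hypothetical, and the totals are bit widths, not FPGA area estimates.

```python
def surrounding_logic_widths(h):
    """Bit widths of registers, multiplexers and 2-input XOR gates outside
    the round logic, as quoted in the text, for block parameter h."""
    b = 2 * h  # Groestl message block size and datapath width
    samosa = {"registers": 3 * h, "multiplexers": 4 * h, "xor_gates": h}
    groestl = {"registers": 2 * b, "multiplexers": 5 * b, "xor_gates": 3 * b}
    return samosa, groestl


samosa_256, groestl_256 = surrounding_logic_widths(256)
samosa_512, groestl_512 = surrounding_logic_widths(512)
```

For h = 256, for example, the multiplexer width drops from 2560 to 1024 bits and the XOR count from 1536 to 256.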
Comparison in terms of Throughput
SAMOSA and Grøstl have very similar throughput equations. For SAMOSA the throughput is given by $f_{CLK} \cdot h/r$, while for Grøstl it is $f_{CLK} \cdot 2h/(2r + 1)$. Since r is relatively large (10 for the 256-bit hash function variants, and 14 for the 512-bit hash function variants), $2h/(2r + 1) \approx h/r$, and thus the primary difference comes from different clock frequencies.
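The approximation 2h/(2r + 1) ≈ h/r can be made exact with rational arithmetic. The sketch below compares the number of message bits processed per clock cycle in the two architectures, leaving out the clock frequency, which depends on the implementation; the function names are hypothetical.

```python
from fractions import Fraction


def samosa_bits_per_cycle(h, r):
    """Message bits per clock cycle in the basic iterative architecture."""
    return Fraction(h, r)


def groestl_bits_per_cycle(h, r):
    """Message bits per clock cycle in the quasi-pipelined architecture."""
    return Fraction(2 * h, 2 * r + 1)


def cycle_ratio(h, r):
    """Groestl bits per cycle relative to SAMOSA; equals 2r / (2r + 1)."""
    return groestl_bits_per_cycle(h, r) / samosa_bits_per_cycle(h, r)


ratio_256 = cycle_ratio(256, 10)  # 20/21, about 0.952
ratio_512 = cycle_ratio(512, 14)  # 28/29, about 0.966
```

At equal clock frequency the two designs therefore differ by less than 5% per cycle, which is why the clock frequency dominates the throughput comparison.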
As shown in Tables 3 and 4 , the quasi-pipelined implementation of Grøstl has higher clock frequency than the basic iterative architecture of SAMOSA. However, this difference is relatively small. It does not exceed 20% for Xilinx Virtex 5 implementations, and 6% in case of Altera Stratix III implementations.
The critical paths of both architectures are marked with bold lines in Figures 11 and 12 . In case of SAMOSA the critical path includes P round, one XOR gate, and two multiplexers. In case of Grøstl, it covers P/Q round stage 2, one XOR gate, and two multiplexers. In theory, one could expect a larger difference in frequency due to pipelining. However, in practice, the effect of pipelining is limited due to difficulties of dividing critical path into two equal halves. Additionally, the frequency of Grøstl before pipelining is already quite high (and similar to the frequency of SAMOSA), and its increase is limited also by the delays of other signal paths in the circuit.
Effect of padding in low-area architectures
SAMOSA has a simpler padding unit. The difference is shown in Figure 13 for the case of byte padding, i.e., padding of messages that end on a boundary of a byte. The elimination of Block Counter reduces the complexity of the control unit as well as the area associated with the padding logic. This reduction, although relatively minor for high-speed implementations, may prove to be quite significant for low area implementations.
Comparison of SAMOSA with the SHA-3 finalists
Tables 5 and 6 present the comparison between SAMOSA and the SHA-3 finalists using the best single-message architecture, i.e., architecture capable of processing only one message at a time. All algorithms have been implemented without padding units, in two variants, with 256-bit and 512-bit output, in Xilinx Virtex 5 and Altera Stratix III FPGAs. The primary metric used for comparison is throughput to area ratio. All results, other than the results for SAMOSA, are based on [16] .
In terms of the throughput to area ratio, SAMOSA performs consistently better than BLAKE, Grøstl and Skein, and loses only to Keccak and JH in both 256-bit and 512-bit variants. Additionally, it reduces the gap in performance to Keccak and JH as compared to Grøstl. It also outperforms Skein in the 512-bit variant on Xilinx Virtex 5, where Grøstl loses to Skein. Furthermore, due to its similarity to Grøstl, SAMOSA has an additional advantage compared to other SHA-3 candidates when resource sharing with the Advanced Encryption Standard (AES) is possible, as demonstrated in [36] .
Conclusion and Open Problems
This paper proposes a novel permutation-based hash mode of operation named FP. Our indifferentiability security analysis establishes that the new mode is secure against all generic attacks up to approximately $2^{n/2}$ queries; more interestingly, our experimental results, based on randomly generated reconstruction graphs using C programs, suggest that the security bound can be improved to nearly $2^n$ queries (n is the hash size in bits). We leave the proof of this improved result as an open problem. We also design a concrete hash function family SAMOSA based on the FP mode and the P permutations of the SHA-3 finalist Grøstl; we claim it is hard to attack SAMOSA with complexities significantly less than brute force. Our FPGA hardware implementations of SAMOSA show remarkable improvement in the throughput to area ratio compared to the SHA-3 finalists Grøstl, BLAKE and Skein. It is still not known how efficient SAMOSA is in software. We leave the software implementations of SAMOSA as future work.
A Definitions
C Proof of (8)
(8) is as follows: Pr
Let V 1 i and V 2 i denote the views of the systems G1, and G2 respectively, after i queries have been processed. To prove (8) , it suffices to show that given GOOD1 σ , the views V 1 σ and V 2 σ are identically distributed. We do this by induction on the number of queries i = σ.
Induction Hypothesis: Given GOOD1 i , V 1 i and V 2 i are identically distributed.
Base: When i = 0, then no query has been made; therefore the hypothesis is true.
Induction Step: Now assume the induction hypothesis holds. We have to show that if GOOD1 i+1 occurred, then V 1 i+1 and V 2 i+1 are identically distributed.
) denote the input-output pairs for the systems G1 and G2 respectively in the i + 1st round. Note that the induction hypothesis implies that V 1 i and V 2 i are identically distributed given GOOD i occurred. Also note that
. A little reflection shows that proving the induction step is equivalent to proving the following proposition.
Proposition C.1 (Proof of Induction Step) Given GOOD1 i+1 and 
Proof.
1. This result is easy since
2. To prove this, we first establish the following lemma which is the main ingredient in our proof.
Lemma C.2
The reconstruction graphs T s of the systems G1 and G2 are isomorphic after i rounds, given GOOD i and
Proof. For a fresh π-or π −1 -query, the graph T π of system G1 is augmented in one phase (see the subroutine PartialGraph of Figure 5 ). In that phase, all possible nodes generated from a fresh π-query are added to the graph T π . A straightforward analysis of the Type1 events shows that if these events do not occur then no nodes can be added beyond this phase.
We continue by considering all possible cases based on a set of conditions for the system G1 in the i + 1st round. Our decision tree produced 17 cases, which have been derived from a sequence of questions (see Figure 14) : Cases 1 through 9 consider when I i+1 is an s-query, cases 10 and 11 consider when I i+1 is an s −1 -query, while cases 12 through 17 consider when I i+1 is the round input of a long query.
Implication. This case is impossible since GOOD1 i+1 implies that Type2 event did not occur for G1 in the current i + 1st round. 
