Abstract-Because of the weakening of the widely-used SHA-1 hash algorithm and concerns over the similarly-structured algorithms of the SHA-2 family, the US NIST has initiated the SHA-3 contest in order to select a suitable drop-in replacement. In this paper we describe a high-speed hardware implementation of the SHA-3 hash function candidate KECCAK on FPGA devices, obtained with the unrolling method.
I. INTRODUCTION
In today's world of e-mail, internet banking, online shopping, and other sensitive digital communications, cryptography has become a vital tool for ensuring the privacy of data transfers. Hash functions operate at the root of many popular cryptographic methods in current use, such as the Digital Signature Standard (DSS), Transport Layer Security (TLS) and Internet Protocol Security (IPSec) protocols, numerous random number generation algorithms, encryption algorithms, all-or-nothing transforms, and password storage mechanisms [1].
As cryptographic algorithms become more widely used, the need for high-speed implementations of these algorithms increases. Software-based implementations of cryptographic algorithms fall short in performance in many applications, e.g. on heavily loaded servers. Therefore, an obvious need for high-speed implementations exists.
In many of these cryptographic schemes, the throughput of the incorporated hash function determines the throughput of the system. Especially in applications where transmission and reception rates are high, any latency or delay in calculating the digital signature of a data packet degrades the network's quality of service [2].
Reprogrammable hardware is an almost ideal choice for cryptographic implementations because high speed can be achieved without significant reduction in flexibility. Flexibility, meaning that the design can be easily changed or modified, is of especially great importance in cryptographic implementations for the following reasons. First, a cryptographic algorithm can be considered secure only until proven otherwise. If a severe flaw in an algorithm is found, the algorithm must be replaced with a more secure one. Second, in many applications, a large variety of different algorithms are in use, and therefore, it should be easy to change from one algorithm to another.
Following the weakening of the widely-used SHA-1 hash algorithm and concerns over the similarly-structured algorithms of the SHA-2 family, NIST has set up the SHA-3 competition with the goal of identifying one (or more) modern hash functions which can act as a drop-in replacement for the SHA-2 family [3].
The KECCAK hash function is one of the candidates accepted by NIST for the SHA-3 hash function competition. In this paper we describe the implementation of KECCAK on FPGAs.
The paper is organized as follows. Section II presents the KECCAK algorithm. Section III describes some optimization techniques. The characteristics of the FPGA implementation are presented in Section IV. Section V presents the obtained results, and Section VI concludes the paper.
II. KECCAK ALGORITHM
The KECCAK hash function produces a final message digest of 256 bits, which depends on the input message, composed of multiple blocks of 1024 bits each. Each input message block is XORed onto a part of the current state and the result is passed through the KECCAK-f permutation. The KECCAK algorithm consists of 3 stages: (i) initialization and padding; (ii) absorbing phase; and (iii) squeezing phase. Pseudocode for this algorithm is given in [4], [5].
The state is logically grouped into a 5×5 matrix of 64-bit words. The KECCAK-f permutation consists of 24 rounds, which are identical except for the addition of a round-dependent constant. Each round has five steps (θ, ρ, π, χ and ι), which feature simple logical operations and permutations of the state bits. The initial state is all zero, and in each absorbing step the introduced data is mixed with the current state.
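The three stages and the five round steps described above can be sketched in Python as follows. This is only a software illustration, not the hardware design of this paper. The function names (`rol64`, `keccak_f1600`, `keccak_sponge`) are hypothetical, and the padding suffix `0x01` is an assumption, since the text does not specify the padding rule. The 1024-bit block size, the 256-bit digest, the 5×5 state of 64-bit words and the 24 rounds are taken from the text.

```python
MASK64 = (1 << 64) - 1

def rol64(a, n):
    """Rotate a 64-bit word left by n bits."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) & MASK64

def keccak_f1600(lanes):
    """Apply 24 rounds of KECCAK-f to a 5x5 array of 64-bit words, lanes[x][y]."""
    lfsr = 1  # generates the round-dependent constants bit by bit
    for _ in range(24):
        # theta: XOR each word with the parities of two neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each word and move it to a new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: the only non-linear step, applied row by row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: add the round constant to word (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes

def keccak_sponge(message, rate_bits=1024, out_bytes=32, suffix=0x01):
    """Hash `message` with a sponge over KECCAK-f (suffix 0x01 is an assumption)."""
    assert rate_bits % 64 == 0 and 0 < rate_bits < 1600
    rate = rate_bits // 8
    # (i) initialization and padding: all-zero state, pad to a whole number of blocks
    padded = bytearray(message) + bytes(rate - len(message) % rate)
    padded[len(message)] ^= suffix
    padded[-1] ^= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    # (ii) absorbing: XOR each block onto part of the state, then permute
    for off in range(0, len(padded), rate):
        for i in range(rate // 8):
            word = int.from_bytes(padded[off + 8 * i:off + 8 * i + 8], "little")
            lanes[i % 5][i // 5] ^= word
        lanes = keccak_f1600(lanes)
    # (iii) squeezing: read output from the same part of the state
    out = bytearray()
    while True:
        for i in range(rate // 8):
            out += lanes[i % 5][i // 5].to_bytes(8, "little")
        if len(out) >= out_bytes:
            return bytes(out[:out_bytes])
        lanes = keccak_f1600(lanes)
```

Calling `keccak_sponge(b"abc")` returns a 32-byte digest under the assumed parameters. Setting `rate_bits=1088` and `suffix=0x06` selects the standardized SHA3-256 configuration, which is convenient for checking the permutation against an existing implementation.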
III. OPTIMIZATION TECHNIQUES
This section discusses methods for architectural optimization in an FPGA. Speed has three primary definitions, depending on the context of the problem: throughput, latency, and timing [6].
Several techniques have been proposed to improve the implementation. The most relevant are:
• Unrolling techniques that optimize the data dependency. An unrolled architecture implements multiple rounds of the core compression function in combinational logic, thereby reducing the number of clock cycles required to compute the hash. This comes at the cost of an increase in area. The number of rounds unrolled, k, must be a divisor of the total number of rounds, n, of the algorithm, so that the number of clock cycles needed to execute the algorithm decreases by a factor of k. The goal is for the minimum clock period to increase by a factor smaller than k, which yields shorter latency and higher throughput [7].
• Use of embedded memories to store the required constant values.
• Use of pipelining techniques to achieve higher working frequencies. Because the data computation is highly dependent, the resulting throughput is usually not improved, and more complex control logic is required.
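The trade-off behind unrolling can be illustrated with a short Python sketch. The helper names `unroll_options` and `unrolling_pays_off` are hypothetical, and the clock periods passed to the second function are placeholder values, not measurements from this work. Only the round count of 24 comes from the text.

```python
def unroll_options(total_rounds):
    """List (k, cycles) for every unroll factor k that divides total_rounds."""
    return [(k, total_rounds // k)
            for k in range(1, total_rounds + 1)
            if total_rounds % k == 0]

def unrolling_pays_off(k, base_period_ns, unrolled_period_ns):
    """Latency improves only if the clock period grows by less than a factor k."""
    return unrolled_period_ns < k * base_period_ns

# Valid unroll factors for a 24-round permutation
for k, cycles in unroll_options(24):
    print(f"k={k:2d}: {cycles:2d} clock cycles for 24 rounds")

# Hypothetical periods: unrolling by 2 stretches the period from 5 ns to 8 ns
print(unrolling_pays_off(2, 5.0, 8.0))
```

With these placeholder numbers the unrolled design needs 12 cycles of 8 ns instead of 24 cycles of 5 ns, so the latency for the 24 rounds drops from 120 ns to 96 ns.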
IV. IMPLEMENTATION
The architecture of the core is illustrated in Fig. 1. The I/O buffer allows the core to compute the absorbing phase while the words of the next block are transferred through the bus. Two R blocks each implement the round function, and each of them operates in a different clock cycle. As shown in Fig. 1, the control signals determine whether the buffer is in input or output mode and specify which of the R blocks is active in a given clock cycle. The processing of a complete message block requires 16 clock cycles.
V. RESULT
The presented hashing core was captured in VHDL and was fully simulated and verified using Model Technology's ModelSim simulator. We used Altera Quartus II and Xilinx ISE to evaluate the VHDL design on FPGA devices. These tools provide estimates of the amount of resources needed and the maximum clock frequency reached [8, 9].
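The throughput is calculated by:
$$\text{Throughput} = \frac{\text{block size} \times \text{clock frequency}}{\text{clock cycles}}$$
where the block size is 1024 bits and the number of clock cycles required per block is 16.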
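The throughput relation, block size times clock frequency divided by the number of clock cycles per block, can be evaluated with a few lines of Python. The function name `throughput_bps` is hypothetical, and the 100 MHz clock is a placeholder chosen only for illustration, not a result of this work. The block size of 1024 bits and the 16 clock cycles are taken from the text.

```python
def throughput_bps(block_bits, clock_hz, cycles_per_block):
    """Throughput in bits per second for one block processed every cycles_per_block cycles."""
    return block_bits * clock_hz / cycles_per_block

# Assumed 100 MHz clock (placeholder), 1024-bit block, 16 cycles per block
example = throughput_bps(1024, 100e6, 16)
print(f"{example / 1e9:.2f} Gbit/s")  # prints 6.40 Gbit/s for the assumed clock
```

Because the block size and cycle count are fixed by the architecture, the throughput of a given design scales linearly with the maximum clock frequency reported by the synthesis tools.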
VI. CONCLUSION
In this paper we have described the implementation of KECCAK on FPGAs. As seen in the last two tables, the throughput increases at the cost of an increase in area.
