Abstract. There exists a broad range of RFID protocols in the literature that propose hash functions as cryptographic primitives. Since Keccak was selected as the winner of the NIST SHA-3 competition in 2012, the question has been how far we can push the limits of Keccak to fulfill the stringent requirements of passive low-cost RFID. In this paper, we address this question by presenting a hardware implementation of Keccak that aims for lowest power and lowest area. Our smallest (full-state) design requires only 2 927 GEs (for designs with external memory available) and 5 522 GEs (total size including memory). It has a power consumption of 12.5 µW at 1 MHz on a low-leakage 130 nm CMOS process technology. As a result, we provide a design that needs 40 % fewer resources than related work. Our design is even smaller than the smallest SHA-1 and SHA-2 implementations.
Introduction
Radio Frequency Identification (RFID) is a technology that makes great demands on cryptographers to implement secure applications. The main challenges are the limited power consumption of tags that are in the field as well as the limited chip area that is available. In the past, several RFID-protocol designers proposed to use hash functions to provide cryptographic services. Hash functions are basic building blocks to implement, e.g., digital signatures or privacy-preserving protocols. However, it has been shown that these building blocks cannot be implemented as efficiently as other cryptographic primitives like AES or PRESENT, as highlighted by M. Feldhofer and C. Rechberger [13] and A. Bogdanov et al. [12]. It remains an open question whether Keccak is a suitable candidate for those devices and whether it can fulfill these demands.
Before Keccak was selected as the winner of the NIST SHA-3 competition in October 2012, several authors reported performance results for ASIC platforms. Most of them target high-speed implementations, which require between 27 and 56 kGEs (synthesized on 90 or 130 nm CMOS process technology).
or contactless payment. In this context, RFID faces several security and privacy challenges. Most of these applications carry enough sensitive information to require strong cryptographic services. Secure RFID is also essential for new applications that require integrity of tag data, confidentiality during communication, and authentication or proof-of-origin to prevent counterfeiting, a major challenge where RFID might help to stop the process of piracy.
In the following, we list the principal design criteria and requirements for security-enabled RFID devices.
Reading Range and Power. The primary concern in passive RFID systems is the limited power that is available for the tags. Tags draw their energy from the electromagnetic field of a reader and use internal capacitors to buffer the energy to perform computations. The available energy thereby depends on various factors such as the distance to the reader, the size of the antenna, the operating frequency, and the field strength of the reader. Inductively coupled tags operating in the 13.56 MHz frequency range typically have enough power available. The magnetic field of the readers is quite high (1.5 to 7.5 A/m). This means that there are several milliwatts of power available for the tags to perform cryptographic operations. Long-range tags (e.g., UHF EPC Gen2 tags), in contrast, have a reading range of several meters. These tags have only a fraction of that power available, i.e., a few microwatts that are drawn from the electromagnetic (far-)field of the reader. Thus, these tags have to operate in an environment where the available power is up to 1 000 times lower than in short-range HF systems. In practice, the total power consumption of those devices is typically limited to at most 10-15 µW per MHz on average and 3-30 µW peak power (depending on read or write operations) [33, 34].
Costs and Chip Area. During the last decade, several authors made chip area estimations for low-cost passive RFID tags. Some of the first estimations were made by S. Sarma from the MIT Auto-ID Center [35, 36, 37] and S. Weis in 2003 [41]. They predicted the costs for a low-cost tag to be 5 (dollar) cents in the near future and accordingly estimated the actual die size of a low-cost tag to be between 5 000 and 15 000 gate equivalents, of which only up to 2 000 gates are usable for security purposes. Similar estimations were made in 2008 by D. Ranasinghe and P. Cole (both from the Auto-ID Lab Adelaide), who reported numbers from 2 000 to 5 000 GEs for security-related functions [33]. They stated that the number of available gates naturally increases over the years due to improvements in manufacturing and process technology, as also highlighted by M. Feldhofer and J. Wolkerstorfer in [15].
Speed and Response Time. Tags have to answer the reader within a specific response time. This time is usually very short, i.e., 15-250 µs for EPC Gen2 tags (nominal range), 320 µs for ISO/IEC 15693 tags, and 86-91 µs for ISO/IEC 14443 tags. However, a tag is in principle not required to finish the computation within this short period of time (if that is even possible). Instead, a challenge-response protocol is needed that allows a larger time frame for cryptographic operations (without causing a noticeable delay). Thus, challenging a tag that is, for example, clocked at 1.5 MHz would take a reasonable 4.8 ms to perform a computation needing 7 200 clock cycles.
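The timing example above is a plain cycles-over-frequency calculation. The following minimal sketch reproduces it; the function name is illustrative and not part of any tag or reader specification.

```python
def computation_time_ms(clock_cycles: int, clock_hz: float) -> float:
    """Time in milliseconds needed to run `clock_cycles` cycles at `clock_hz`."""
    return clock_cycles / clock_hz * 1e3

# Example from the text: 7 200 cycles on a tag clocked at 1.5 MHz.
example_ms = computation_time_ms(7_200, 1.5e6)
print(f"{example_ms:.1f} ms")
```

The same helper can be used to check whether a given cycle count fits into the time frame a particular challenge-response protocol allows.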
Hash Functions for RFID
Among the first to propose hash functions for RFID protocols were S. Weis, S. Sarma, R. Rivest, and D. W. Engels in 2003 [24, 41]. They made use of the difficulty of inverting a one-way hash function to realize access control services for low-cost EPC tags. The so-called "hash-lock" protocol works as follows. First, the owner of the tag generates a random key and sends the hash of this key (i.e., the MetaID) to the tag. After that, the tag stores the hash and locks its memory. To unlock the memory, the owner has to send the original key to the tag, which hashes it and compares the digest with the stored MetaID. Another proposal was made by A. Shamir, who presented the RFID protocol SQUASH (squashed form of SQUare-hASH) in 2008 [38]. He described a tag authentication scenario using a challenge-response protocol where the tag and the reader share a secret key S. The reader issues a random number R and sends it to the tag. After that, the tag calculates H(S, R), where H represents a public hash function. The tag sends the hash back to the reader, which can independently calculate the same message digest to prove the authenticity of the tag. As a cryptographic primitive, A. Shamir proposed to use the 64-bit SQUASH function, which is based on the well-studied Rabin encryption scheme. Note that the SQUASH function does not provide collision resistance since it is not necessarily required for the given RFID authentication scenario (this, however, lowers the resource requirements for practical implementations).
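Both protocols described above reduce to a few hash evaluations. The sketch below illustrates the message flow only. It assumes SHA3-256 from the Python standard library as a stand-in for the hash function H (it is not SQUASH), assumes that H(S, R) is computed over the concatenation of S and R, and uses hypothetical class and function names.

```python
import hashlib
import hmac
import secrets

def H(data: bytes) -> bytes:
    # Stand-in one-way hash (assumption): SHA3-256 from the standard library.
    return hashlib.sha3_256(data).digest()

class HashLockTag:
    """Toy model of a tag that implements the hash-lock idea."""

    def __init__(self) -> None:
        self.meta_id = b""
        self.locked = False

    def lock(self, meta_id: bytes) -> None:
        # The tag stores the MetaID (the hash of the owner's key) and locks itself.
        self.meta_id = meta_id
        self.locked = True

    def try_unlock(self, key: bytes) -> bool:
        # The tag hashes the received key and compares it with the stored MetaID.
        if self.locked and hmac.compare_digest(H(key), self.meta_id):
            self.locked = False
        return not self.locked

def tag_response(secret: bytes, challenge: bytes) -> bytes:
    # Challenge-response: the tag answers with H(S, R), here H(S || R).
    return H(secret + challenge)

# Hash-lock flow
owner_key = secrets.token_bytes(16)
tag = HashLockTag()
tag.lock(H(owner_key))
wrong_key_rejected = not tag.try_unlock(b"not the key")
right_key_accepted = tag.try_unlock(owner_key)

# Challenge-response flow: the reader recomputes the digest independently.
S = secrets.token_bytes(16)
R = secrets.token_bytes(8)
authentic = hmac.compare_digest(tag_response(S, R), H(S + R))
```

A real tag would additionally need replay protection and a defined encoding of S and R; both are outside the scope of this sketch.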
An approach to calculate a message digest using block ciphers has been proposed by H. Yoshida et al. [42] in 2005 and by A. Bogdanov et al. [12] in 2008. The latter authors presented DM-PRESENT which is based on the 64-bit cipher PRESENT as well as H-PRESENT that provides a 128-bit security level.
The first sponge-construction-based hash function was presented by G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche at the ECRYPT Hash Workshop in 2007 [5]. Since then, several hash-function proposals were made with respect to RFID applications, including Keccak, QUARK [2], Spongent [11], and Photon [18].
Related Work on Keccak Implementations. There exist several Keccak implementations, most of which have been designed for FPGAs. High-speed implementations have been reported by J. Strömbergson [39], B. Baldwin et al. [3], E. Homsirikamol et al. [22], K. Kobayashi et al. [31], F. Gürkaynak et al. [20], and K. Gaj et al. [16, 17]. Low-area FPGA designs have been presented by S. Kerckhof et al. [29], J.-P. Kaps et al. [26], and B. Jungk and J. Apfelbeck [25].
As for ASIC designs, there exist many high-speed variants proposed by S. Tillich et al. [40], A. Akin et al. [1], L. Henzen et al. [21], and X. Guo et al. [19].
Note that there is also a recent open-source project at OpenCores.org [23]. To the authors' knowledge, there are only two publications that report a low-area implementation of Keccak on ASICs. The Keccak team reported numbers for a low-area version of Keccak needing 9.3 kGEs (including memory) on a 130 nm CMOS process technology [10]. In 2010, E. B. Kavun and T. Yalcin presented several low-resource designs of Keccak for RFID in [27]. Their full-state version (1 600 bits) needs about 20 kGEs on the same process technology.
Keccak Specification and Design Exploration
In this section, we first give a brief overview of Keccak with a focus on parameters likely to be integrated in the SHA-3 standard. Afterwards, we explore different design decisions and discuss various optimizations for practical implementations.
The Sponge Construction. Keccak is based on a new cryptographic hash family, the so-called sponge function family [7] . As opposed to existing hash constructions, which are classically based on the Merkle-Damgård construction, a fixed length permutation f is used to allow the handling of arbitrary length input and to produce fixed length outputs, e.g., 224, 256, 384, or 512 bits. The permutations are performed on a state with a fixed size of b bits.
The state is cut into two parts of size r (rate) and c (capacity), respectively. The rate defines the number of input bits which are processed in one block permutation. The capacity c of the sponge function represents the remaining bits of the state, i.e., c = b − r. The authors of Keccak proposed values for r and c in their submitted Keccak specification [9] , e.g., b = 1 600 bits, r = 1 088 bits, and c = 2n = 512, where n is the length of the output.
Hashing works as follows. First, the state is initialized with 0^b (i.e., b zero bits) and the input is padded to a length that is a multiple of r using the very simple multi-rate padding scheme [8]. After that, it is cut into blocks of size r. During the initial absorbing phase, the message blocks are XORed with the first r bits of the state, followed by a single state permutation f. After the sponge has absorbed the whole message, it switches to the squeezing mode in which r bits are output iteratively (again followed by single state permutations f).
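The absorb/squeeze procedure can be sketched generically, independent of the concrete permutation. In the following minimal sketch, bits are modeled as Python lists, the padding follows the multi-rate rule (a 1 bit, as few 0 bits as needed, a final 1 bit), and `toy_f` is a deliberately trivial placeholder permutation. It is not Keccak-f and provides no security; all function names are illustrative.

```python
def pad10star1(msg_bits, r):
    """Multi-rate padding: append 1, then j zeros, then 1, so the length is a multiple of r."""
    j = (-len(msg_bits) - 2) % r
    return msg_bits + [1] + [0] * j + [1]

def sponge(msg_bits, f, b, r, out_len):
    """Generic sponge over a b-bit state with rate r (capacity c = b - r)."""
    state = [0] * b                      # state initialized with b zero bits
    padded = pad10star1(msg_bits, r)
    # Absorbing phase: XOR each r-bit block into the first r state bits, then permute.
    for i in range(0, len(padded), r):
        block = padded[i:i + r]
        state = [s ^ m for s, m in zip(state[:r], block)] + state[r:]
        state = f(state)
    # Squeezing phase: output r bits at a time, permuting between outputs.
    out = []
    while len(out) < out_len:
        out += state[:r]
        if len(out) < out_len:
            state = f(state)
    return out[:out_len]

def toy_f(state):
    # Placeholder permutation (assumption): rotate by one position and flip one bit.
    return [state[-1] ^ 1] + state[:-1]

digest_bits = sponge([1, 0, 1], toy_f, b=16, r=8, out_len=12)
```

With the parameters quoted above (b = 1 600, r = 1 088), the same skeleton applies once `toy_f` is replaced by a real Keccak-f[1600] implementation.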
The Keccak-f Permutation. The authors of Keccak proposed seven different state-permutation functions Keccak-f that can be used. The permutation Keccak-f organizes the b-bit state as a 3-D matrix with dimension 5×5×w, with w = 2^ℓ and ℓ ∈ {0, ..., 6}. This matrix can be split into slices and lanes. A slice is a matrix composed of 25 bits with constant z coordinate (5 bits in each row and 5 bits in each column). A lane is a simple array consisting of w bits of constant x and y coordinate. Figure 1 shows the structure of the state.
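To make the two views of the state concrete, the sketch below extracts lanes and slices from a flat bit array. It assumes the usual bit ordering in which bit a[x][y][z] is stored at position w(5y + x) + z; the helper names are illustrative.

```python
W = 64  # lane length w for the 1 600-bit state

def bit_index(x, y, z, w=W):
    # Assumed flat ordering: position w * (5y + x) + z holds bit a[x][y][z].
    return w * (5 * y + x) + z

def get_lane(state_bits, x, y, w=W):
    """The w bits with constant (x, y)."""
    return [state_bits[bit_index(x, y, z, w)] for z in range(w)]

def get_slice(state_bits, z, w=W):
    """The 25 bits with constant z, as 5 rows (index y) of 5 bits (index x)."""
    return [[state_bits[bit_index(x, y, z, w)] for x in range(5)] for y in range(5)]

# Label every bit with its own position to make the mapping visible.
labelled_state = list(range(25 * W))
lane_00 = get_lane(labelled_state, 0, 0)
slice_0 = get_slice(labelled_state, 0)
```

In this layout a lane is contiguous in memory while the bits of a slice are spread over all 25 lanes, which is the access-pattern problem the later sections address.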
The Keccak-f permutation is a round-based function; each of the 12 + 2ℓ rounds (with ℓ = log2 w) consists of five parts: the transformations θ, ρ, π, χ, and ι. For a more in-depth explanation of Keccak we refer to the Keccak reference [8].
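The relation between lane length, state size, and round count can be tabulated in a few lines. The sketch below uses the parameterization of the Keccak reference (w = 2^ℓ, b = 25w, 12 + 2ℓ rounds); the function name is illustrative.

```python
def keccak_f_params(ell: int) -> dict:
    """Lane length w, state size b, and round count for a given ell in 0..6."""
    w = 2 ** ell
    return {"ell": ell, "w": w, "b": 25 * w, "rounds": 12 + 2 * ell}

all_params = [keccak_f_params(ell) for ell in range(7)]
for p in all_params:
    print(f"b = {p['b']:>4} bits, w = {p['w']:>2}, rounds = {p['rounds']}")
```

The two state sizes considered in this paper correspond to ℓ = 6 (1 600 bits) and ℓ = 5 (800 bits).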
Design Exploration and Decisions
We decided to analyze the hardware complexity of Keccak-f with a state size of both 1 600 (full-state) and 800 bits. For each design, we implemented two versions. The first version aims for lowest power and lowest area (Version 1). The second version (Version 2) targets the same goals but tries to find an optimal trade-off between power, area, and speed without a significant penalty in any one of them. For both designs, we decided to use narrow datapaths, i.e., 8 and 16 bits. This is because narrower datapaths would result in unacceptable throughput penalties while wider datapaths exceed the limited power and area requirements. Moreover, we serialized all operations and re-used the applied components as much as possible.
Figure 2 shows the basic hardware architecture of our designs. It consists of a controller, a datapath, a Look-up Table (LUT) for constants, an input/output interface, and an external RAM block. As a requirement, our design should feature all necessary components for Keccak (permutation calculation, sponge function, input handling including padding) and should be flexible (support multiple output lengths).
Memory Type and I/O Interfaces. We decided to use RAM macros for state storage because they typically require fewer resources than standard-cell-based designs (in terms of power and area). For our first version, we decided to use an 8-bit memory interface; for the second version, we use a 16-bit memory interface (to improve speed). As a major requirement, no more than b bits (the size of the state, e.g., 1 600 bits) should be used. As input/output interface, we chose to implement an 8-bit AMBA APB interface, which is very simple and provides a standardized communication interface.
Constants: LUT vs. LFSR. The round constants for the ι transformation as well as the ρ rotation offsets are stored in a simple LUT. The round constants can also be generated using a 7-bit Linear Feedback Shift Register (LFSR), but this would require more power and area.
Lane- and Slice-wise Processing. Software implementations as well as the compact co-processor described in [10] operate lane-wise, i.e., lanes are fetched from the memory and are subsequently processed. This approach, however, needs a lot of additional storage and is slow on the small data buses we are using.
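For readers who want to see what the LFSR alternative computes, the sketch below generates the ι round constants for the 64-bit lane case in software. It follows the commonly used software formulation with feedback polynomial x^8 + x^6 + x^5 + x^4 + 1 held in an 8-bit variable; the register width and layout of a hardware LFSR such as the one mentioned above may differ, and the function name is illustrative.

```python
def round_constants_w64(num_rounds: int = 24):
    """Generate the iota round constants for 64-bit lanes with a software LFSR."""
    lfsr = 0x01
    constants = []
    for _ in range(num_rounds):
        rc = 0
        for j in range(7):
            # One LFSR step over GF(2); 0x71 encodes the feedback taps x^6 + x^5 + x^4 + 1.
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 0x02:
                rc |= 1 << ((1 << j) - 1)  # only bit positions 2^j - 1 can be set
        constants.append(rc)
    return constants

rcs = round_constants_w64()
print([hex(c) for c in rcs[:3]])
```

Each constant has at most seven non-zero bits (positions 0, 1, 3, 7, 15, 31, and 63), which is why a table of constants stays small in a LUT-based design.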
An interesting alternative, namely slice-wise processing, was proposed by B. Jungk and J. Apfelbeck [25]. Although initially designed and implemented for FPGAs, slice-wise processing serves as an excellent starting point for a low-resource ASIC implementation. All operations except ρ can be performed on a slice-per-slice basis. In order to perform the four slice-wise transformations (π, χ, ι, and θ) on a slice in a single cycle, the rounds of the Keccak-f permutation must be rearranged: the initial round solely consists of θ and ρ, followed by 23 rounds of π, χ, ι, θ and ρ, and the final round consists of π, χ and ι. This round schedule differs slightly from the one used by Jungk and Apfelbeck.
The ρ transformation as well as the sponge computations cannot be performed slice-wise but only on a lane-per-lane basis. For this reason, we use both lane- and slice-wise processing and combine these two approaches into a single datapath. This combination is a challenge when using an external memory, as it must be possible to access both slices and lanes while still using the full bandwidth of the memory bus and keeping the core's internal storage small. We tackle this problem using a technique called interleaving, which will be explained in the next section.
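The rearranged schedule applies exactly the same sequence of step mappings as the standard 24 rounds; only the round boundaries move. The following sketch checks this by flattening both schedules into lists of step names (the names and helper functions are illustrative).

```python
STANDARD_ROUND = ["theta", "rho", "pi", "chi", "iota"]

def standard_schedule(rounds: int = 24):
    """Step sequence of the unmodified permutation."""
    return STANDARD_ROUND * rounds

def rearranged_schedule(rounds: int = 24):
    """Initial round (theta, rho), rounds - 1 modified rounds, final round (pi, chi, iota)."""
    first = ["theta", "rho"]
    middle = ["pi", "chi", "iota", "theta", "rho"] * (rounds - 1)
    last = ["pi", "chi", "iota"]
    return first + middle + last

same_sequence = standard_schedule() == rearranged_schedule()
```

One consequence worth noting is that the ι in each modified round belongs to the preceding standard round, so the round-constant index has to be chosen accordingly.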
Low-Power Optimizations. To reduce current drain, we integrated clock gating and operand isolation techniques. In the case of clock gating, registers are only clocked whenever new values should be stored. Operand isolation sets the inputs of combinational parts, whose outputs are not needed in the current cycle, to a constant value, i.e., to 0. Both these methods reduce switching activity which is the main contributor to power consumption in CMOS technology. Applying these techniques to our design helps us to drastically reduce power consumption while the area impact is kept low.
The Keccak Architectures
In this section, we first describe two hardware architectures for the full-state Keccak algorithm. Our first design (Version 1) aims for lowest power and area. Our second design (Version 2) trades area for higher throughput. After that, we discuss the implications of smaller state sizes and present two architectures using 800 bits only.
Figure 3 shows the datapath architecture of our design. It provides an 8-bit memory interface and is mainly composed of an interleave and de-interleave unit, two 64-bit registers, one slice unit, and two ρ units.
[...] of the upper lane. Using this technique, a single n-bit memory word contains information about 2 lanes but only n/2 slices. This helps us to drastically decrease the size of the internal memory needed, as will be explained later. Because the state consists of an odd number of lanes, one selected lane has to be stored non-interleaved; we chose the lane [0, 0], since this is the only one with a ρ offset equal to 0. Therefore, we can skip this single lane in this phase.
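The property that an n-bit memory word covers 2 lanes but only n/2 slices can be illustrated with a small bit-interleaving sketch. The exact bit order used in the design is not recoverable from the text here, so the layout below (even bit positions from the lower lane, odd positions from the upper lane) is an assumption, as are the function names.

```python
def interleave(lower: int, upper: int, w: int = 64) -> int:
    """Merge two w-bit lanes into one 2w-bit value, alternating their bits."""
    out = 0
    for z in range(w):
        out |= ((lower >> z) & 1) << (2 * z)       # even positions: lower lane
        out |= ((upper >> z) & 1) << (2 * z + 1)   # odd positions: upper lane
    return out

def deinterleave(value: int, w: int = 64):
    """Inverse of interleave(): recover the two lanes."""
    lower = upper = 0
    for z in range(w):
        lower |= ((value >> (2 * z)) & 1) << z
        upper |= ((value >> (2 * z + 1)) & 1) << z
    return lower, upper

def to_words(value: int, total_bits: int, n: int = 8):
    """Split a value into n-bit memory words, least significant word first."""
    return [(value >> i) & ((1 << n) - 1) for i in range(0, total_bits, n)]

# A lane pair whose lower lane has only the bits of slices 0..3 set.
words = to_words(interleave(0b1111, 0), total_bits=128, n=8)
```

Under this layout the first 8-bit word holds the bits of both lanes for slices 0 to 3 only, so fetching one word from every lane pair yields complete bits for four slices rather than eight.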
Version 1: Pushing the Limits towards Lowest Power and Area
Combined Slice- and Lane-Processing. The two 64-bit registers r0 and r1 combined store either two lanes or four slices. In the latter case, only 100 out of 128 bits are used. The interleaved memory technique described above allows us to load and store two lanes at full bus speed (i.e., 16 memory cycles on an 8-bit bus) and four slices in only 13 cycles. When not using interleaving, the size of the registers needs to be increased to 100 bits in order to store 8 slices.
Figure 4 shows the architecture of the slice-processing unit. The π operation is a rewiring of the input, χ is computed on the 5 rows of one slice in parallel, and ι is a single XOR with a bit of the round constant. For the θ transformation, the column parity of the previous slice is stored in a 5-bit register. The parity of a slice is computed and XORed with the stored parity. The result is then added to each of the 5 rows. In the initial and final round, some parts must be skipped. For this reason, two multiplexers allow bypassing of blocks.
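The role of the 5-bit parity register can be illustrated in software. The sketch below computes θ once lane-wise, in the usual software style, and once slice by slice while keeping only the column parity of the previous slice. It covers θ in isolation (π, χ, and ι are omitted), models lanes as Python integers, and uses illustrative names throughout.

```python
import random

W = 64
MASK = (1 << W) - 1

def theta_lanewise(a):
    """Reference theta on a[x][y], each entry a W-bit lane."""
    c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
    rot1 = lambda v: ((v << 1) | (v >> (W - 1))) & MASK
    d = [c[(x - 1) % 5] ^ rot1(c[(x + 1) % 5]) for x in range(5)]
    return [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]

def get_slice(a, z):
    return [[(a[x][y] >> z) & 1 for y in range(5)] for x in range(5)]

def column_parity(s):
    return [s[x][0] ^ s[x][1] ^ s[x][2] ^ s[x][3] ^ s[x][4] for x in range(5)]

def theta_slicewise(a):
    """Theta computed one slice at a time with a 5-bit parity register."""
    out = [[0] * 5 for _ in range(5)]
    parity_reg = column_parity(get_slice(a, W - 1))  # preload with the last slice
    for z in range(W):
        s = get_slice(a, z)
        cur = column_parity(s)
        for x in range(5):
            # Parity of column x-1 in this slice and of column x+1 in the previous slice.
            d = cur[(x - 1) % 5] ^ parity_reg[(x + 1) % 5]
            for y in range(5):
                out[x][y] |= (s[x][y] ^ d) << z
        parity_reg = cur
    return out

rng = random.Random(1)
state = [[rng.getrandbits(W) for _ in range(5)] for _ in range(5)]
results_match = theta_lanewise(state) == theta_slicewise(state)
```

The preload of the register with the parity of slice 63 corresponds to the wrap-around of the rotation by one position in the lane-wise formulation.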
A single ρ unit is made up of a barrel shifter and a register with half the size of the memory-bus width. The upper 4 bits of the rotation offset are handled by proper register addressing while the lower 2 bits are done by actual shifts to the left.
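The split of the 6-bit rotation offset into an addressing part and a small shift can be mimicked in software. The sketch below assumes the 8-bit-bus case, i.e., a 4-bit wide ρ unit: the upper 4 offset bits select which nibble is read (pure addressing), and the lower 2 bits are a left shift by 0 to 3 positions that pulls in the top bits of the neighboring nibble. Names are illustrative, and the sketch does not model cycle timing.

```python
import random

W = 64
MASK = (1 << W) - 1

def rotl(v: int, n: int) -> int:
    """Plain rotation of a W-bit lane to the left by n positions."""
    n %= W
    return ((v << n) | (v >> (W - n))) & MASK

def rho_nibblewise(lane: int, offset: int) -> int:
    """Rotate via nibble addressing (upper 4 offset bits) and a 0..3-bit shift."""
    q, s = divmod(offset, 4)                 # q: nibble address offset, s: fine shift
    nib = [(lane >> (4 * i)) & 0xF for i in range(16)]
    out = 0
    for i in range(16):
        cur = nib[(i - q) % 16]              # nibble selected by addressing
        prev = nib[(i - q - 1) % 16]         # neighbor supplying the shifted-in bits
        word = ((cur << s) | (prev >> (4 - s))) & 0xF
        out |= word << (4 * i)
    return out

rng = random.Random(3)
sample = rng.getrandbits(W)
all_offsets_match = all(rho_nibblewise(sample, k) == rotl(sample, k) for k in range(64))
```

Only the fine shift needs shifter logic; the coarse part of the rotation costs nothing beyond address generation.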
The Round Computation. The computation of a single modified round consists of two main phases: the slice-processing phase and the ρ transformation phase:
- In the slice-processing phase, the column parity of slice 63 (after having applied ι • χ • π) is first computed and stored in the parity register. Then, the following is repeated 16 times: four slices are loaded within 13 clock cycles and, after performing θ • ι • χ • π on each slice, the result is stored in memory.
- For the ρ phase, two lanes are fetched from memory. With the help of two separate ρ units, the lanes are implicitly rotated by the specified offsets and stored back to memory. This is done for all 24 lanes which have an offset other than 0.
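The transfer counts quoted for the two phases follow from the bus width alone. The sketch below recomputes them for the 8-bit bus; it covers only the bus-transfer bookkeeping and does not attempt to reproduce the total cycle count of a round, which also depends on controller details not given here. Variable names are illustrative.

```python
import math

BUS_BITS = 8      # memory interface width of Version 1
SLICE_BITS = 25   # bits per slice
LANE_BITS = 64    # bits per lane (1 600-bit state)

# Bus transfers needed to move two lanes or four slices.
cycles_two_lanes = math.ceil(2 * LANE_BITS / BUS_BITS)
cycles_four_slices = math.ceil(4 * SLICE_BITS / BUS_BITS)

# Iterations per phase of one modified round.
slice_iterations = LANE_BITS // 4   # 64 slices, four at a time
lane_pairs = 24 // 2                # 24 lanes with non-zero offset, two at a time

print(cycles_two_lanes, cycles_four_slices, slice_iterations, lane_pairs)
```

Four slices occupy 100 bits, which is why the transfer takes 13 rather than 12.5 bus cycles.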
Version 2: Trading Area for Higher Throughput
The previously described design requires few resources in terms of power and area but lacks speed and throughput. The main drawback is the use of an 8-bit memory interface and the asymmetric datapath. During the slice processing, 25 bits are processed at once while the ρ phase operates on only 8 bits, which is inefficient in terms of power. We therefore make use of a 16-bit memory interface that allows writing of single bytes to trade some gates for higher speeds. The cycle count for the ρ phase is therefore cut in half. For the slice-processing unit, this is not the case. Instead, a single 16-bit word has information on 8 slices but only 4 slices can be stored in the 128-bit internal register. Thus, 8 bits have to be discarded. With further optimizations (reading the upper byte of a 16-bit memory word in the next cycle after writing the lower byte), the cycle count for the permutation can be decreased by about 30 %. The number of additional gates for these modifications is marginal and limited to the need for 8-bit wide ρ units (shifter and register) and the increase of the RAM-macro cell due to the additional 8-bit pre-charge logic, write logic, and sense amplifiers.
Adapting to an 800-Bit State
Our design can also be used with an 800-bit state; only small additions to the controller are necessary to support both state sizes. When restricting to 800 bits, some optimizations are possible. First, only half of the RAM size is required. Second, the size of the internal registers can be cut down to a total of 100 bits, i.e., the memory needed to store four slices. A single lane now consists of 32 bits, which reduces memory requirements in the lane-processing phase to 64 bits. Furthermore, the number of rounds is reduced from 24 to 22. The cycle count needed for a single Keccak-f round is reduced by a factor of 2. For detailed implementation results see Section 5.
A possible trade-off between area and speed is to extend the used interleaving scheme to more than two lanes. When interleaving four 32-bit lanes into one 128-bit word, four lane registers and a 16-bit memory interface are needed. The core area will be comparable to that of the 1 600-bit version, while saving roughly 1 000 cycles per permutation compared to the 16-bit 2-lane case. However, to minimize the area requirements, we did not implement this approach.
For even smaller state sizes, i.e., 400 or 200 bits, the number of lanes used in the interleaving scheme has to be chosen according to the desired cycle count and area requirements. 
Results
We implemented both designs in VHDL using a mixed tool design flow. For synthesis, we used the Synopsys Design Compiler 2012.06 that generates a netlist targeting the FSC0L_D standard-cell library from Faraday. This library is based on the UMC 0.13 µm low-leakage process, which has a standard supply voltage of 1.2 V. The following area results have been obtained after synthesis (with low-area optimizations enabled); power values have been generated using Cadence Encounter Power System v8.10 after place and route (using Cadence Encounter RTL-to-GDSII). We further used low-leakage RAM macros from Faraday as storage blocks. Circuit size is expressed in terms of gate equivalents (GE); 1 GE is the area occupied by a 2-input NAND gate. All values have been determined for a hash output length of 256 bits; the capacity c was set to 512 bits as suggested by the Keccak authors [9].
Table 1 and Table 2 show the area usage of our 1 600-bit designs for different chip components. For our lowest-area version, the two registers use almost 40 % of the occupied area. The slice unit needs the largest combinational part with 13 %. The higher-throughput version needs slightly more area mainly due to the larger ρ units, the controller, and the 16-bit RAM macro interface, i.e., 221 GEs for the core (and 155 GEs in addition for the larger RAM macro). In total it is 6.38 % larger.
Table 3 provides more results including throughput and power. It shows that our higher-throughput version needs 32 % fewer clock cycles (15 427 instead of 22 570); this translates to a throughput of 44.3 kbps (for Version 1) and 64.8 kbps (for Version 2) at a clock frequency of 1 MHz. The power consumption values are nearly the same: our low-area version needs 5.5 µW per MHz of power (core only) and 12.5 µW per MHz (with memory included), and our higher-throughput version needs 5.6 µW per MHz and 13.7 µW per MHz, respectively. The maximum frequency of the core is 61 MHz.
Comparison with Related Work.
We compare our solutions with the two most relevant publications of low-resource full-state Keccak implementations. The comparison shows that our work requires significantly less area, i.e., 41 % less than the implementation of [10] (note that the authors estimated the total size of their low-area design to be 9.3 kGEs including an external 64-bit memory). Our design is also more compact than the work of E. B. Kavun and T. Yalcin [27] (by about a factor of 4). We also compare our designs with the smallest SHA-1 and SHA-2 implementations from [32] and [30]. Our design has about the same size as SHA-1 and needs about 36 % less area than SHA-2. The power values of our design are also compelling: it requires less than 15 µW per MHz (including memory), which is 72 % less than [27].
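The area ratios against the two Keccak designs can be recomputed from the figures quoted in this paper. The sketch below uses the total size of Version 1 (5 522 GEs), the 9.3 kGEs reported for [10], and the rounded "about 20 kGEs" for [27]; since the last figure is approximate, the resulting factor is only indicative. Variable names are illustrative.

```python
ours_total_ge = 5_522        # Version 1, 1 600-bit state, including memory
keccak_team_ge = 9_300       # low-area design of [10], including memory
kavun_yalcin_ge = 20_000     # full-state design of [27], "about 20 kGEs"

saving_vs_keccak_team = 1 - ours_total_ge / keccak_team_ge
factor_vs_kavun_yalcin = kavun_yalcin_ge / ours_total_ge

print(f"area saving vs. [10]: {saving_vs_keccak_team:.1%}")
print(f"size factor vs. [27]: {factor_vs_kavun_yalcin:.1f}x")
```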
Results for an 800-Bit State
We also adapted our design for use with an 800-bit state. As a result, the size of the core could be decreased by roughly 300 GEs (mainly due to the use of smaller registers, cf. Section 4.3). In fact, 2 611 GEs are needed for our low-area version (Version 1) and 2 837 GEs are needed for the higher-throughput variant (Version 2). In addition to these savings, the RAM size requirements are halved. The 8-bit RAM macro for the low-area version needs 2 016 GEs and the 16-bit RAM macro needs 2 108 GEs. Thus, our designs require 4 627 GEs and 4 945 GEs in total, respectively. Regarding power consumption, the smaller-state versions need slightly less power, i.e., 12.4 and 13.1 µW per MHz. The cycle count for both versions drops by more than 50 %: 10 712 clock cycles are needed for Version 1 and 7 464 clock cycles are required for Version 2. The throughput, however, suffers due to the smaller chosen block size of 800 − 2 × 256 = 288 bits. It decreases to 26.9 and 38.6 kbps.
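The totals, cycle reductions, and throughput figures for the 800-bit designs can be recomputed from the numbers given in the text. The sketch below does so at the 1 MHz clock used throughout; the 1 600-bit cycle counts (22 570 and 15 427) are those reported for the full-state designs, and all names are illustrative.

```python
CLOCK_HZ = 1_000_000
BLOCK_BITS = 800 - 2 * 256   # rate of the 800-bit variant with 256-bit output

designs_800 = {
    "Version 1": {"core": 2_611, "ram": 2_016, "cycles": 10_712, "cycles_1600": 22_570},
    "Version 2": {"core": 2_837, "ram": 2_108, "cycles": 7_464, "cycles_1600": 15_427},
}

summary = {}
for name, d in designs_800.items():
    summary[name] = {
        "total_ge": d["core"] + d["ram"],
        "cycle_drop": 1 - d["cycles"] / d["cycles_1600"],
        "kbps": BLOCK_BITS / d["cycles"] * CLOCK_HZ / 1e3,
    }

for name, s in summary.items():
    print(f"{name}: {s['total_ge']} GEs, {s['cycle_drop']:.1%} fewer cycles, {s['kbps']:.1f} kbps")
```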
Discussion
As already stated in the introduction and in Section 2, our primary goal was to determine a lower bound for Keccak in terms of power and area. The following points invite further discussion:
- The throughput of our design is relatively low but still acceptable for the targeted RFID applications. Increasing throughput is possible by adapting our design to broader memory interfaces (i.e., 32 bits). This, of course, will increase the area and power requirements.
- The use of 1 600 and 800-bit Keccak for low-cost passive RFID tags has to be considered with caution: our smallest designs require about 5.5 kGEs and 4.6 kGEs, respectively. But there exist more compact hardware implementations that use primitives like block ciphers, which can be used in a mode that provides hashing capabilities [12, 13].
- Integration: if external memory is available, e.g., in implementations where other chip components share a common memory, only the core logic has to be integrated, requiring around 3 kGEs. Note that our design makes use of an 8-bit (standardized) AMBA interface and can therefore be easily adapted to existing designs.
- The difference between the 1 600 and 800-bit versions of our Keccak implementations is significant. The 800-bit version is about 900 GEs smaller in size while being twice as fast.
- For even more "lightweight" applications, the properties of the design might be modified (though it might then no longer conform to the standard), e.g., by modifying the level of collision resistance or by reducing the size of the state to 400 bits or fewer as suggested by [27]. Note that such smaller-state versions are specified by the Keccak team but are not likely to be part of the SHA-3 standard.
- We did not integrate any countermeasures against implementation attacks, which have to be considered in scenarios where Keccak is used for authenticated encryption, for instance. Keccak can be protected using, for example, secret-sharing techniques as shown by G. Bertoni [4, 6]. Note that this will increase the area requirements. Future work has to evaluate low-resource SCA and fault-attack countermeasures for Keccak.
Conclusions
With the results given in this paper, we show that full-state Keccak can be implemented with less than 5.5 kGEs. There is room for improvement, and it can be expected that the limits will be pushed further down towards a point where integration into passive low-cost tags becomes more attractive. As of now, and without making any modifications or restrictions for certain RFID applications, we obtain power values that are below 15 µW at 1 MHz (thus guaranteeing high reading ranges) while providing 128 bits of security.
