Abstract. Physical Unclonable Functions (PUFs) have properties that make them very attractive for a variety of security-related applications. Due to their inherent dependency on the physical properties of the device that contains them, they can be used to uniquely bind an application to a particular device for the purpose of IP protection. This is crucial for the protection of FPGA applications against illegal copying and distribution. In order to exploit the physical nature of PUFs for reliable cryptography a so-called helper data algorithm or fuzzy extractor is used to generate cryptographic keys with appropriate entropy from noisy and non-uniform random PUF responses. In this paper we present for the first time efficient implementations of fuzzy extractors on FPGAs where the efficiency is measured in terms of required hardware resources. This fills the gap of the missing building block for a full FPGA IP protection solution. Moreover, in this context we propose new architectures for the decoders of Reed-Muller and Golay codes, and show that our solutions are very attractive from both the area and error correction capability points of view.
Introduction
Virtually all keyed cryptographic primitives, regardless of whether they are based on public-key or private-key cryptography, assume the secrecy of the key used to encrypt/sign a given message. Since the late 90's, there has been a lot of interest in developing methods to guard against key compromise at the protocol level but also at the physical level [1, 2, 3] . By physical level, we mean mechanisms which can make the platform where cryptographic primitives run (more) secure to key compromise. One of the most interesting of these methodologies is the idea of Physical Unclonable Functions (PUFs) as introduced in [1] . A PUF is a primitive that maps challenges T i to responses R i , which are highly dependent on the physical properties of the device in which the PUF is contained or embedded.
We will write R i ← PUF(T i ) to denote the response R i of a PUF to a challenge T i . Physical Unclonable Functions have essentially two parts: i) a physical part and ii) an operational part. The physical part is a physical system that is very difficult to clone. It inherits its unclonability from uncontrollable process variations during manufacturing. In the case of PUFs on an IC such process variations are typically deep-submicron variations such as doping variations in transistors. The operational part corresponds to the function. In order to turn the physical system into a function a set of challenges T i (stimuli) has to be available to which the system responds with a set of sufficiently different responses R i . Examples of PUFs include optical PUFs [1] , silicon PUFs [4] , coating PUFs [2] , IntrinsicPUFs [5] , and LC-PUFs [6] . Regardless of their particular instantiation, their unclonability, and tamper evidence properties have made PUFs very useful tools in IP protection and secure key storage applications.
IP Protection For FPGAs. Field Programmable Gate Arrays (FPGAs) are gaining widespread acceptance as substitutes for ASICs in many applications. In fact their re-programmability has made them very attractive in the embedded market, where software and functionality updates can be common and desirable by customers. As a result of this shift, it is increasingly the case that the functionality of an embedded system is presented in the form of a bit configuration file or, in the case of microprocessors, in the form of a program. Thus, the very property that makes FPGAs so attractive (their programmability) also makes it very easy for counterfeiters to copy an IP developer's configuration file and create a similar product without the up-front cost of Intellectual Property (IP) development. This problem was introduced most recently 1 by Simpson and Schaumont [8] . In particular, the authors in [8] showed that by using a PUF on an FPGA they could develop protocols that allow binding of a particular IP to a particular FPGA. Their protocols also allow proving authenticity of the IP to the hardware platform. In [5] , the authors further reduce the computation and communication complexity of the protocols in [8] and introduce the idea of Intrinsic-PUFs based on the start-up values of SRAM memory values. Both based their protocols on symmetric-key primitives. In [9] , the authors observe that by introducing public-key cryptography, the corresponding private-key does not need to ever leave the FPGA, even during the enrollment stage, thus increasing the security of the overall system. A common characteristic of all PUF-based protocols in [5, 8, 9] is the derivation of a key(s) from the PUF, which is used to encrypt a piece of IP and authenticate its origin. In the remainder of the paper, we will refer to the encrypting operation for ease of presentation but it is clear that our discussion extends to the computation of Message Authentication Codes (MACs) and/or signatures on a particular IP block.
The Need for a Helper Data Algorithm. Notice that PUF responses are noisy by nature. In other words, two calls to the PUF with the same challenge T i will produce two different but closely related responses R i , R i , where the measure of closeness can be defined via a distance function. We will make the distance function more explicit in Sect. 2. Intuitively, the distance function should be small among responses originating from the same device and very large for PUF responses originating from different devices. Nevertheless, it is clear that the plain PUF response can not be used as the key, since this would mean that the data encrypted under response R i could not be decrypted with response R i , even if both responses originate from the same PUF embedded in the same device 2 . In order to derive reliable and uniform strings from (imperfect) sources of randomness, such as a PUF, the concept of a fuzzy extractor or helper data algorithm were introduced in [10, 11] .
Related Work. To our knowledge, there is no previous description of the complexities and design choices made to implement a helper data algorithm on hardware and, more specifically, on FPGAs. In both [12, 13] the noisy nature of a PUF is acknowledged. Their solution to the problem is to add an error correcting stage based on BCH codes. Other codes are not considered and no detailed explanation of how to choose the code is given. Gassend [14] also considers the problem of noisy measurements in PUFs by considering Hamming codes and product codes. The solutions based on product codes in [14] is only able to correct up to two errors. Gassend gets around this problem by trying different challenges until the response has a sufficiently small number of errors that they can be solved. It is worth noticing that a similar problem to the one we are considering is present in biometrics. In fact, the first fuzzy extractor construction [15] was aimed at biometric applications. Dodis et al. [11] also describe a software implementation of a fuzzy extractor based on BCH codes. Somewhat related to our construction is the construction of Hao et al. [16] where they implement a two stage error correcting scheme for biometric applications (iris recognition). The scheme in [16] uses first a Hadamard code and then a Reed-Solomon code in a concatenated manner. Notice that the authors in [16] do not consider the hardware implementation of their schemes. In addition, Reed-Solomon codes are optimized for burst errors and thus, are not applicable to our solution since errors present in PUF responses tend to be random.
Our Contributions. In this paper, we focus on the study and implementation of fuzzy extractors on FPGAs, as [5, 8, 9] assume the existence of such a block but do not provide explicit constructions nor investigate the hardware costs of fuzzy extractors on FPGAs. Our work can be seen as the final block necessary to generate cryptographic keys and, thus, allows for the construction of full IPprotection solutions on FPGAs. We focus on making an efficient choice of code. By efficient, we mean two things. First, we aim to be able to reconstruct the same key with high probability. In other words, given a code we want to achieve an error probability (the probability that an error pattern happens that can not be corrected by the chosen error correcting code) of at least 10 −6 ≈ 2 −20 . We argue in Sect. 3.2 that this is a conservative estimate, which can be applied for most applications. In this respect, we show empirically that from the codes considered, the best codes (meaning those that can achieve a low error probability) are BCH codes [17, 18] .
Our second efficiency measure refers to hardware resources. In particular, once we have achieved a certain error probability, we desire that the error correcting decoding algorithm implementation be as area efficient as possible. This, in fact, is a key requirement and makes our work fundamentally different from other helper data algorithm implementations. In particular, the aim of our solution is not the implementation of a helper data algorithm on an FPGA by itself. Rather, our aim is to implement a helper data algorithm in as little hardware as possible and, in the process, allow for the secure deployment of IP. The IP block, in fact, is the one that should determine the FPGA resources, not the helper data algorithm. In our search for an area efficient solution, we turned to different code constructions. We find that concatenation of codes as introduced by Forney [19] , allows for the use of codes that are much simpler to implement and possibly more area efficient than BCH codes. In particular, an odd repetition code followed by a Reed-Muller code or a Golay code can satisfy our error probability requirements. We expect that such construction will incur in considerable area savings with respect to a construction based on BCH codes only. We also propose new architectures for the decoders of Reed-Muller and Golay codes, which are of independent interest. In addition, we identify which universal hash function constructions from those already known in the literature are most suitable for small area implementations. These results are described with focus on an implementation results targeting a Spartan-3E Xilinx FPGA, which is a typical FPGA used in low cost applications.
Notation. Algebraically a binary linear code C with message length k and codeword length n is a k-dimensional subspace of F . An important data structure related to the linear code is the generator matrix G whose rows are elements of a basis for the linear code. For a binary linear [n, k, d]-code, we can write the generator matrix in the standard form as G = (I k |P ), where I k is the k × k identity matrix and P is a k × (n − k) matrix. The parity check matrix is then found as H = (P T |I n−k ). This is an n − k by n matrix such that the inner product of any codeword with any column of H equals zero. Encoding a message m is accomplished by computing v = mG. The syndrome of a received word r = v + e, where v and e are a codeword and error, respectively, is defined as S r = Hr = He. We refer the reader to [20, 21] as standard references for error correcting codes.
Helper Data Algorithms
PUF responses can not be used as a key (as in e.g. [2] ) in a cryptographic primitive for two reasons. First, PUF responses are obtained through measurements on physical systems, which are typically noisy. This leads to a problem since cryptographic functions are very sensitive to noise on their inputs. Second, PUF responses are not uniformly distributed. Hence, even if there was no noise, the response would not form a cryptographically secure key. In order to deal with both issues a Helper Data Algorithm (HDA) or Fuzzy Extractor or has to be used. In the remainder of this paper, we will use the two terms interchangeably. For the precise definition of a Fuzzy Extractor and Helper Data algorithm we refer to [10, 11] .
In general a helper data algorithm deals with both issues (noise and nonuniformity of keys) by implementing first an information reconciliation phase and second, by applying a privacy amplification or randomness extraction primitive. In order to implement those two primitives, helper data W are generated during the enrollment phase. During this phase, carried out in a trusted environment, a probabilistic procedure called Gen is run. Later, during the key reconstruction or authentication phase, the key is reconstructed based on a noisy measurement R i and the helper data W . During this phase, a procedure called Rep is performed. We present one of the constructions for such procedures previously described in [11] . Other constructions as well as constructions for other metrics can be found in [11] . Notice that all constructions have an error correcting stage. Optimizing such stage will be our focus in the next sections.
Construction Based on Code Offset. In order to implement the procedures Gen and Rep an error correction code C and a set H of universal hash functions [22] is required. The parameters [n, k, d] of the code C are determined by the length of the responses R and the number of errors t that have to be corrected. The distance d of the code is chosen such that t errors can be corrected. The Genprocedure takes as input a PUF response(s) R and produces as output a key K and helper data W = (W 1 , W 2 ), i.e., (K; W ) = (K; (W 1 , W 2 )) ← Gen(R). This is achieved as follows. First, a code word C S ← C is chosen at random from C. Then, a first helper data vector equal to W 1 = C S ⊕R is generated. Furthermore, a hash function h i is chosen at random from H and the key K is defined as K ← h i (R). The helper data W 2 is set to i. During the key reconstruction phase the procedure Rep is run. It takes as input a noisy response R from the same PUF and helper data W and reconstructs the key K i.e. K ← Rep(R , W ). This is accomplished according to the following steps: (1) Information Reconciliation: Using the helper data W 1 , W 1 ⊕ R is computed. Then, the decoding algorithm of C is used to obtain C S . From C S , R is reconstructed as R = W 1 ⊕ C S ; and (2) Privacy amplification: The helper data W 2 is used to choose the correct hash function h i ∈ H and to reconstruct the key as K = h i (R). Notice that we have implicitly assumed the use of a binary code. The security of the above constructions has been established in [10, 11] .
Searching for Good Linear Codes and Efficient HDAs
We will model the noise present in PUF responses as a binary symmetric channel (BSC). In particular, in a BSC, the bit error probability p b specifies the probability that a transmitted information bit is received in error. Then, with probability 1 − p b the sent information bit equals the received one. We assume that all bits are independent, which turns out to be a good assumption as shown in [9, 5] . In [9] , the authors also show that the probability that a string of n bits has more than t errors is given by:
Notice that the value of P total will determine the minimum distance of the code and thus, the size of the code. We argue that a conservative value is P total ≤ 10 −6 . To see this, we relate the failure error probability of our system to the error probability of the hardware platform (in this case FPGAs), which is given in terms of the Failure-In-Time (FIT) unit. Given a FIT rate λ, the probability that a failure will occur until time t is given by P Failure until time t (t) = 1 − exp(−λt). For both Xilinx and Altera devices, the lowest FIT rate that we found in [23, 24] was five for older devices. All device families manufactured in newer process technology have a FIT rate higher than twelve. Assuming conservatively a FIT rate equal to five, for a 15-year period the resulting failure probability is 6.610 −4 . Thus, it is clear that assuming P total ≤ 10 −6 is quite conservative for any realistic application (i.e. applications that should last more than a month).
The Naive Approach: Simple Codes
We consider codes which can be used to correct random errors (as opposed to burst errors) in a received word r. Thus, we do not consider Reed-Solomon codes, which have very good burst error correcting capabilities, as well as convolutional codes for similar reasons. We also discarded LDPC codes [25] , which are very efficient but require very large and sparse binary matrices, thus making them resource intensive in hardware applications (see e.g. [26, 27] for designs targeting FPGAs, which occupy more than 50% of a high-end FPGA). We do not consider Hadamard codes explicitly but notice that they are equivalent to first-order ReedMuller codes. Similarly, the Hamming code is a [2 m − 1, 2 m − m − 1, 3]-code and can therefore correct only one error. Thus, it is not very useful to decode received vectors with high error rates as in the applications we are considering. In order to attain the desired error probability of 10 −6 , we simply start by looking at what can be achieved via straight forward use of the following codes: repetition code, Reed-Muller codes of the first order 3 , binary Golay code, and binary BCH codes. We refer to [28] for extensive tables for different error probabilities. For our particular case, we summarize the relevant code parameters in Table 1 . In Table 1 , 
4080
the last column refers to the number of SRAM source bits required to obtain 171 "error-free" bits which can then be hashed to obtain 128 random and uniformly distributed bits. This is based on the amount of entropy in SRAM PUFs reported in [5] . Then, the last column in Table 1 can be computed as n 171/k , where n and k, refer to the code parameters. We observe that in this table, we have only considered the code as stated in the first column or a shortened version of it in the case of the [1020, 43, 439]-BCH code. Shortening a code 4 is one of many different code modifications and one which we found useful empirically. We also notice that the number of errors that a shortened code can correct is at least t. However, correcting the additional error patterns enabled by the code shortening results in additional decoder complexity. Notice that neither the Golay code nor the [256, 9, 128]-ReedMuller code provide our desired error probability and that the BCH codes provide a very low error rate. Thus, the question that we ask is if we can extract a 128-bit cryptographically secure key with less than 4080 bits of SRAM.
A New Construction Based on Concatenated Codes
Notice that the previous scheme has several disadvantages. First, a scheme based on the repetition code alone, although low complexity, requires more than 5000 bits of SRAM. Second, although we have not discussed the complexity of the decoders yet, BCH decoder algorithms are very complex and, thus, we expect it to be expensive in terms of area. Thus, in this section we propose a new scheme based on concatenation of two error correcting codes C 1 and C 2 [19] to get less complexity while achieving the same or comparable error probabilities. Given two codes C 1 and C 2 with parameters [n 1 , k 1 
Notice that very long codes can be constructed from considerably shorter ones. Furthermore, a concatenated code can correct (depending on construction) random and burst errors simultaneously and the decoding complexity of two short codes is lower than the complexity of the entire code. Based on our discussion on fuzzy extractors in Sect. 2, our new constructions for the procedures Gen and Rep are described in Algorithms 1 and 2, respectively.
Algorithm 1. Gen c Algorithm for Concatenated Codes
Before considering specific examples, we discuss what the error probability will be for our concatenated codes. Intuitively, the main idea of our construction is to first use a rather simple code, let's say C 2 to bring the number of errors down. Then, with the second code, C 1 , the remaining errors are corrected. By a clever choice of the first code also the second code, has to correct only a few errors making the scheme more efficient. A similar idea is presented in [29] as a way to cope with extremely noisy channels. In particular, [29] reduced the error probability by using a repetition code and then combine the first stage with a more powerful code. For example, a simple calculation with (1) will demonstrate that just using the [3, 1, 3] repetition code, it is possible to bring the error probability from 15% to 6%. In general, the resulting error probability can be estimated as follows. Given two codes C 1 , C 2 with parameters [n 1 , k 1 
, respectively, the Rep c -procedure will first decode with Decode C2 and the result will be decoded with Decode C1 . Thus, the error probabilities P 2 and P 1 after decoding with the decoding algorithms of C 2 and C 1 , respectively, correspond to: where ui ∈ {0, 1} and lu = l1 · n1. 7: Perform Steps 4 through 12 of Algorithm 1 to recover K.
where p b is the bit error probability of the source, in our case the noise in the PUF response. Notice that P i is the word error probability. In other words, the probability that a word will be in error after decoding. This is equal to the bit error probability in the case of the repetition code but not in the case of Golay, Reed-Muller or BCH codes. However, it is well known [30] that the resulting bit error probability is always less or equal to the word error probability as estimated in (2) . Thus, we are being conservative in our estimates and the results in Table 2 correspond to worst case error probabilities. We report in Table 2 the best constructions that we found. Extensive tables for different PUF bit-error probabilities can be found in [28] . Our first observation about Table 2 is that we can achieve probabilities ≤ 10 −6 not using BCH codes. This is a nice outcome of the construction, since this allows us to consider other codes which accept a more efficient implementation. Nevertheless, BCH codes error correction capabilities are indisputable. A second and perhaps more important observation is that when we combine a repetition code with a BCH code in a concatenated construction, the code size and thus, the number of SRAM source bits required decreases considerably (compare 4080 in Table 1 with the shortened [226, 86, 43]-BCH code at 2260 bits). Thus, it is clear that our construction offers considerable advantages. [23, 12, 7] 8.00 E-06 3105 RM [16, 5, 8] 1.70 E-06 5040 repetition [11, 1, 11] G24[24, 13, 7] 5.41 E-07 3696 G23 [23, 12, 7] 4.52 E-07 3795
HDA Architecture and Implementation Results
In addition to the error correcting properties of the codes that we previously considered, we also consider their performance from a hardware perspective. In particular, in this work we aim to make designs as small as possible in order to reserve space for the actual IP block to be implemented on the FPGA. We propose decoder architectures for first order Reed-Muller codes and for the binary Golay codes. For hash functions, we take the architecture proposed by Krawczyk [31] since this family accepts a more efficient implementation than all the other ones proposed or described in [32, 31, 33] . Due to space constraints, we refer to [28] for an exact analysis of their complexity.
Reed-Muller Codes. In this work we only consider first-order Reed-Muller codes because of their simple decoding algorithms. The procedure for decoding these codes is shown in Algorithm 3. To describe the process of generating the characteristic vectors let us denote the row of the generator matrix corresponding with the variable v i by v Compute the dot product of each of these vectors with the received message.
4:
Compute the majority of the values of the dot products and assign it to ui. 5: end for 6: Multiply (u1, · · · um) by the sub-matrix consisting of the last m rows of G to get the vector s with 2 m entries. 7: Assign the majority of the entries in s + x to u0.
vectors corresponding with a variable. We are aware of two different hardware structures proposed in the literature for hardware based decoding of R(1, m) codes which are also described in [21, Chapter 13] . However, since we target a low resource implementation, we propose an architecture that is a factor of m (asymptotically) smaller than those proposed in [21] at the cost of additional processing time. We defer a more detailed comparison of the architectures to the full version of the paper.
The newly proposed design is based on Algorithm 3 and is shown in Fig. 1 . The most important parts of this circuit are the GM-generator and the CV-generator modules. The output of the GM-module periodically consists of each column of the generator matrix. Due to the structure of the R(1, m) generator matrix and considering the fact that only the last m rows of the (m+1)×2 m generator matrix are required, it is easy to verify that GM-generator can be realized using an m-bit counter. The CV-generator module generates the bits of each of the characteristic vectors using the current column of the generator matrix, based on the index of the characteristic vector, the m − 1 values of k j from (3), which is the output of CV Index and the variable being currently decoded which is the output of Variable counter. CV-generator consists of several multiplexers, which output an input value or its logical inverse, and the required circuitry to compute the product in (3). The Majority Decoder is a counter which can be either m bit or m+1 bit wide and its content must be compared with 2 m−1 or 2 m , respectively. Hence its output is its (m − 1)th or mth bit, respectively. The output of Algorithm 3 is the message corresponding to the appropriate codeword whereas for PUF applications the codeword is needed. After the message is correctly decoded it is multiplied, again using the inner product module, by the generator matrix and the result is stored in the Received Word module. Our decoder requires m2 m−1 2 m = m2 2m−1 iterations to process each row of the generator matrix with 2 m columns for each of the 2 m−1 characteristic vectors and each of the m decoded messages. To this time the 2 m iterations for processing each of the 2 m columns of the generator matrix should be added. The Received Word module is used for both input and output of the values. The overall complexity of the decoder is shown in Table 3 . 
Golay
Code. An arithmetic decoder as seen in Algorithm 4 uses the weight structure of the syndrome to determine the error patterns [34] . P is the nonidentity part of the generator matrix. The vector cp i is the ith column of P and rp i the ith row, respectively. The error vector is denoted as e = (x, y) where x and y are vectors of length 12. A vector x i is the zero vector with a 1 at the ith position. Our proposed circuit shown in Figure 2 can be derived from Algorithm 4. The main steps in the decoding process are the computation of the syndrome and helper syndrome, the determination of the Hamming weight, GF (2) addition and the comparison of the Hamming weights with a constant.
We use a dot product block, which consists of 1 AND, 1 XOR and 1 FF for the serial computation of the syndrome and the helper syndrome. These syndromes are of length 12 and therefore the Hamming weight is at most 12 which can be represented with 4 bits. The result of the dot product or the GF (2) addition respectively is loaded into a 12-bit shift register and a simple 4-bit counter counts the number of ones. The 4-bit counter requires 8 XOR, 8 AND and 4 OR gates. The result is then compared with a constant depending on the step of Algorithm 4. This step requires a 4-bit comparator and therefore 12 AND, 8 OR and 8 NOT gates. The constants are stored in memory and need 3 FFs.The GF (2) addition is done with an XOR gate. Furthermore we need 24 FF for the error vector e, 24 FF to store the received vector and 288 FF for the generator matrix. The gate complexity without the control circuit is shown in Table 3 . The above circuit can be optimized by removing the shift register and constructing the generator matrix on the fly, so we need to store only 24 bits for the first column and 23 bits for the second column. The other columns are determined by a simple shift operation. The complexity of the circuit without control can be seen in Table 3 .
Universal Hashing. As previously mentioned, we use a construction due to Krawczyk [31] . This construction makes use of random binary matrices, where the hash value h A (x) is the Boolean multiplication of the matrix A by the message x. Krawczyk [31] shows how this can be implemented using a simple LFSR. We only need to store the first column of the matrix and the next columns are generated by the LFSR. The circuit is shown in Fig. 3. For a 128- (1, 4) , R (1, 5) , R(1, 6)) as well as the Toeplitz design for a universal hash function based on [31] in VHDL. We target the Spartan-3E 500 FPGA and use Xilinx ISE v9.2 for our tooling. We synthesized and mapped all designs. The Spartan-3E family of devices corresponds to low end FPGAs typically used in automotive and consumer electronic applications. The results are shown in Table 4 . We observe that our designs our very space efficient. In particular, combining any repetition code decoder (a counter plus logic), a Reed-Muller or Golay code, any of the designs for the hash function our utilization is not greater than 10%. It seems clear that the Reed-Muller code is both superior in area but also in error correction performance (based on Table 2 ). Al- though, the Golay code decoder can be optimized by generating the parity check matrix on the fly, in terms of area it is still outperformed by the Reed-Muller codes. We can estimate the complexity of the overall helper data algorithm by assuming a concatenated construction with a Repetition code, a Reed-Muller code, a Toeplitz-based hash function and 100% overhead for control. Even then, the overall fuzzy extractor requires less than 10% of the FPGA resources. Unfortunately, we have not found any BCH code implementations on FPGAs to which we can compare. However, the BCH decoding algorithms themselves are much more complex, thus, it is expected that their hardware complexity will be similarly higher. In the full version of the paper, we expect to have a full BCH decoder and thus, be able to fully compare all our constructions.
Conclusion
We present the first efficient implementations of helper data algorithms on FPGAs. Helper data algorithms are used to extract cryptographic keys from the noisy response of, e.g., a Physical Unclonable Function (PUF). PUFs have become an attractive subject of research due to their nice properties for unforgeable authentication and secure key storage purposes. In particular, one can deploy them to securely bind applications to the underlying hardware, a mechanism that has various applications and most prominently IP protection. Our solution offers the last missing building block toward real world IP protection on FPGAs. Our helper data algorithms are efficient with regard to the required hardware resources, which is important for hardware design. In the design of our helper data algorithms, we make use of various linear codes constructions, each with own advantages and shortcomings. These constructions are then compared in terms of error correction capabilities and hardware resource usage, giving the designer the necessary tools to make an informed decision when implementing a helper data algorithm.
