Abstract: We present a novel design for a tiny application-specific programmable processor for BCH decoding. The design is optimized for use in a PUF key extractor, where low area overhead is extremely important. Due to its flexible nature, it can support a wide range of BCH codes. The complete design for a BCH(413, 296, 13) decoder requires only 1% (less than 70 slices) of the available resources of a small FPGA.
I. Introduction
One of the requirements of most cryptographic systems is the ability to securely generate, store and recover high-quality secret keys. The high-quality property requires the key to be both unique and unpredictable. That generating such a secure key is not trivial was recently made clear once again by Lenstra et al. [1], who showed that a large number of public RSA keys share prime factors, making them instantly exploitable. Designing secure storage for keys is not trivial either and often increases system implementation overhead.
Physically Unclonable Function (PUF) [2] key extractors [3] [4] [5] aim to solve both these problems. Each physical instantiation of an extractor produces a unique, unpredictable, fixed key by design, generated from the inherent randomness of the PUF. Since the key can always be regenerated with the extractor, there is no need for expensive, secure non-volatile memory. An essential part of any PUF key extractor is an error correction block.
Contribution: We present a novel design for a tiny and application-specific programmable processor for BCH decoding, a perfect fit for use in a PUF key extractor.
Paper outline: In Section II, we introduce the notation used throughout the paper and give background information on BCH code construction and decoding algorithms. Section III describes the design of our processor. Results for synthesis and runtime are presented in Section IV. Finally, conclusions are given in Section V.
II. Background
In this section, the notation used throughout the paper is explained. Next, we look at the mathematical background of BCH codes. A short overview of BCH code construction is given and the ideas behind BCH decoding algorithms are shown. Since this paper focuses on the design and implementation of a processor, we do not go deeper into the mathematics behind these algorithms.
A. Notation
A binary Galois field is written as $\mathbb{F}_{2^x}$. The symbol ⊕ denotes addition over $\mathbb{F}_{2^x}$, i.e. an XOR operation, and ⊗ denotes multiplication. An element of $\mathbb{F}_{2^x}$ is written as a capital letter, e.g. A. C(n, k, t) stands for a BCH code with code length n, data length k and number of corrigible errors t.
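As an illustration of this notation, the following minimal Python sketch implements ⊕ and ⊗ for a small field. The field size u = 4, the reduction polynomial x^4 + x + 1 and the function names are assumptions chosen for the example; they are not taken from the design.

```python
# Minimal sketch of arithmetic in a binary Galois field.
# Assumptions (not from the paper): u = 4 and reduction polynomial x^4 + x + 1.
U = 4
PRIM_POLY = 0b10011  # x^4 + x + 1


def gf_add(a: int, b: int) -> int:
    """Addition in a binary field is a bitwise XOR."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication, reduced modulo PRIM_POLY."""
    result = 0
    for _ in range(U):
        if b & 1:          # add the current multiple of a if this bit of b is set
            result ^= a
        b >>= 1
        a <<= 1            # multiply a by x
        if a >> U:         # degree reached U: subtract (XOR) the reduction polynomial
            a ^= PRIM_POLY
    return result


def gf_pow(a: int, e: int) -> int:
    """Raise a to the non-negative power e by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result
```

With these assumptions, the element 0b0010 plays the role of α, and its powers run through all 15 non-zero field elements before returning to 1.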
B. BCH code construction
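A BCH code C(n, k, t) is defined by its generator polynomial G, which is constructed as follows [6, 7]. First, one selects the size u of the underlying field $\mathbb{F}_{2^u}$, an element $A \in \mathbb{F}_{2^u}$ and a starting exponent b. For each $i = b, \ldots, b + 2t - 1$, let $M_i$ be the minimal polynomial of $A^i$ over $\mathbb{F}_2$.
G is defined as the least common multiple of all $M_i$:
$$G = \operatorname{lcm}(M_b, M_{b+1}, \ldots, M_{b+2t-1})$$
This gives a code of length n = ord(A), with k = n − deg(G). In this paper, we only consider codes for which A = α, a primitive element of $\mathbb{F}_{2^u}$, and b = 1, i.e. primitive narrow-sense BCH codes.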
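Codewords C are created by padding a data word $D \in \mathbb{F}_{2^k}$ to length n and adding to this the padded D reduced modulo the code's generator polynomial, i.e.:
$$C = (D \otimes x^{n-k}) \oplus \big((D \otimes x^{n-k}) \bmod G\big) \qquad (1)$$
Eq. 1 clearly shows that C is always a multiple of G, since addition and subtraction coincide over a binary field:
$$C \bmod G = \big((D \otimes x^{n-k}) \bmod G\big) \oplus \big((D \otimes x^{n-k}) \bmod G\big) = 0$$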
Shortening the data word by m bits also shortens the codeword by m bits. E.g. from C(255, 21, 55), one can create C(235, 1, 55), which has the same generator polynomial and error correction capabilities.
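The encoding rule and the effect of shortening can be sketched in a few lines of Python. Polynomials over $\mathbb{F}_2$ are stored as integers, with bit i holding the coefficient of x^i. The example code BCH(15, 7, 2), its generator polynomial x^8 + x^7 + x^6 + x^4 + 1 and the function names are assumptions made for illustration only.

```python
# Sketch of systematic BCH encoding with polynomials over F_2 stored as integers.
# Assumption (not from the paper): example code BCH(15, 7, 2).
G = 0b111010001  # x^8 + x^7 + x^6 + x^4 + 1


def poly_mod(a: int, g: int) -> int:
    """Remainder of a(x) divided by g(x) over F_2."""
    deg_g = g.bit_length() - 1
    while a and a.bit_length() - 1 >= deg_g:
        # cancel the leading term of a with a shifted copy of g
        a ^= g << (a.bit_length() - 1 - deg_g)
    return a


def bch_encode(d: int, n: int, k: int, g: int) -> int:
    """Pad d to length n and add the remainder of the padded word modulo g."""
    padded = d << (n - k)
    return padded ^ poly_mod(padded, g)


# Full-length code: 7 data bits, 15 codeword bits.
codeword = bch_encode(0b1010011, 15, 7, G)

# Shortened by m = 2 bits: 5 data bits, 13 codeword bits, same generator.
short_codeword = bch_encode(0b10011, 13, 5, G)
```

Both codewords leave a zero remainder when divided by G, and the shortened codeword equals the full-length codeword of the same data word with two leading zero bits.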
C. BCH decoding
BCH decoding consists of three steps: syndrome calculation, error locator polynomial calculation and error location calculation. Each of these steps is explained in more detail in the next paragraphs.
1) Syndrome calculation:
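The first decoding step is calculating the so-called syndromes. One takes a received codeword $R \in \mathbb{F}_{2^n}$, which is the sum of an error-free codeword C and an error vector E, and evaluates it as a polynomial. The syndromes are the evaluation results of R(x) for $x = \alpha^i$, with i = 1, . . . , 2t. A syndrome $S_i$ is thus defined as
$$S_i = R(\alpha^i) = C(\alpha^i) \oplus E(\alpha^i) = E(\alpha^i)$$
where the last step holds because every $\alpha^i$ with $1 \le i \le 2t$ is a root of G, and therefore of C.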
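A minimal Python sketch of the syndrome calculation follows. It evaluates R(x) with Horner's rule, which is one possible evaluation order and not necessarily the one used in the firmware. The field $\mathbb{F}_{2^4}$ with reduction polynomial x^4 + x + 1, the example BCH(15, 7, 2) codeword and all names are assumptions made for illustration.

```python
# Sketch of syndrome calculation for a small example code.
# Assumptions (not from the paper): field F_{2^4} with x^4 + x + 1, code BCH(15, 7, 2).
U = 4
PRIM_POLY = 0b10011
ALPHA = 0b0010  # primitive element alpha


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication, reduced modulo PRIM_POLY."""
    result = 0
    for _ in range(U):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> U:
            a ^= PRIM_POLY
    return result


def poly_eval(r: int, n: int, x: int) -> int:
    """Evaluate R(x) by Horner's rule; bit i of r is the coefficient of x^i."""
    acc = 0
    for i in reversed(range(n)):
        acc = gf_mul(acc, x) ^ ((r >> i) & 1)
    return acc


def syndromes(r: int, n: int, t: int) -> list:
    """Return [S_1, ..., S_2t], where S_i = R(alpha^i)."""
    result = []
    x = 1
    for _ in range(2 * t):
        x = gf_mul(x, ALPHA)  # x runs through alpha^1, alpha^2, ...
        result.append(poly_eval(r, n, x))
    return result


# A valid codeword of the assumed code (multiple of x^8 + x^7 + x^6 + x^4 + 1).
CODEWORD = 0b101001101110000
clean = syndromes(CODEWORD, 15, 2)
# The same codeword with bit errors injected at positions 3 and 10.
noisy = syndromes(CODEWORD ^ (1 << 3) ^ (1 << 10), 15, 2)
```

For the error-free codeword all syndromes are zero, whereas the corrupted word yields non-zero syndromes that depend only on the error positions.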
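2) Error locator polynomial calculation: Suppose we have an error vector $E = x^{l_1} + x^{l_2} + \ldots + x^{l_y}$. Then the values of the first three syndromes are:
$$S_1 = \alpha^{l_1} + \alpha^{l_2} + \ldots + \alpha^{l_y}$$
$$S_2 = \alpha^{2 l_1} + \alpha^{2 l_2} + \ldots + \alpha^{2 l_y}$$
$$S_3 = \alpha^{3 l_1} + \alpha^{3 l_2} + \ldots + \alpha^{3 l_y}$$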
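The Berlekamp-Massey (BM) algorithm [8, 9], when given a list of syndromes $S_i$, returns an error locator polynomial
$$\Lambda(x) = \prod_{j=1}^{y} \left(1 + \alpha^{l_j} x\right)$$
whose roots are the inverses $\alpha^{-l_j}$ of the error locators $\alpha^{l_j}$.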
One of the problems with the original BM algorithm is that it requires an inversion of an element $A \in \mathbb{F}_{2^u}$ in each of its 2t iterations. To eliminate this costly operation, Burton [10] devised an inversionless version of the algorithm. Several authors have suggested improvements to this algorithm in the form of space-time tradeoffs, e.g. [11] [12] [13].
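To make the idea of an inversionless iteration concrete, the Python sketch below follows one commonly published inversionless formulation of the algorithm. It is not necessarily the variant implemented in the firmware, and the field $\mathbb{F}_{2^4}$ with reduction polynomial x^4 + x + 1, the example syndromes and all names are assumptions. The returned polynomial is a scalar multiple of the error locator polynomial, which leaves its roots unchanged.

```python
# Sketch of an inversionless Berlekamp-Massey iteration.
# Assumptions (not from the paper): field F_{2^4} with x^4 + x + 1, t = 2.
U = 4
PRIM_POLY = 0b10011


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication, reduced modulo PRIM_POLY."""
    result = 0
    for _ in range(U):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> U:
            a ^= PRIM_POLY
    return result


def inversionless_bm(s: list, t: int) -> list:
    """Return coefficients [c_0, ..., c_t] of a scalar multiple of Lambda(x).

    s holds the syndromes S_1..S_2t (s[0] is S_1). No field inversion is used.
    """
    lam = [1] + [0] * t   # current locator estimate
    b = [1] + [0] * t     # auxiliary polynomial
    gamma, k = 1, 0       # last non-zero discrepancy and control counter
    for r in range(2 * t):
        # discrepancy between the predicted and the actual syndrome
        delta = 0
        for i in range(min(r, t) + 1):
            delta ^= gf_mul(lam[i], s[r - i])
        shifted_b = [0] + b[:-1]  # x * b(x)
        new_lam = [gf_mul(gamma, lam[i]) ^ gf_mul(delta, shifted_b[i])
                   for i in range(t + 1)]
        if delta != 0 and k >= 0:
            # -k - 1 equals a bitwise inversion of k in two's complement
            b, gamma, k = lam, delta, -k - 1
        else:
            b, k = shifted_b, k + 1
        lam = new_lam
    return lam


# Syndromes of an error vector x^3 + x^10 in the assumed field.
locator = inversionless_bm([15, 10, 11, 8], 2)
```

Each iteration uses only multiplications and additions; the price is that the result is scaled by an unknown non-zero constant, which is harmless for root finding.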
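3) Error location calculation: Finding the roots of Λ(x) gives the locations of the errors in R. The Chien search algorithm [14] is an efficient way of evaluating Λ(x) at all possible values $\alpha^i$. It does this by reducing the multiplications in the evaluation formula to constant-factor multiplications, noting that each intermediate term of $\Lambda(\alpha^{i+1})$ differs by a constant factor from the corresponding term of $\Lambda(\alpha^i)$. With $\lambda_j$ denoting the coefficients of Λ(x):
$$\Lambda(\alpha^{i}) = \lambda_0 + \lambda_1 \alpha^{i} + \lambda_2 \alpha^{2i} + \ldots + \lambda_t \alpha^{ti}$$
$$\Lambda(\alpha^{i+1}) = \lambda_0 + (\lambda_1 \alpha^{i})\,\alpha + (\lambda_2 \alpha^{2i})\,\alpha^{2} + \ldots + (\lambda_t \alpha^{ti})\,\alpha^{t}$$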
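The constant-factor update can be sketched in Python as follows. The sketch assumes the common convention that an error at position l contributes a factor (1 + α^l x) to Λ(x), so that a root α^i corresponds to position −i modulo 2^u − 1. The field $\mathbb{F}_{2^4}$ with reduction polynomial x^4 + x + 1, the example locator polynomial and all names are likewise assumptions.

```python
# Sketch of a Chien search using only constant-factor multiplications.
# Assumptions (not from the paper): field F_{2^4} with x^4 + x + 1, and a root
# alpha^i of Lambda(x) marking an error at position (-i) mod (2^U - 1).
U = 4
PRIM_POLY = 0b10011
ALPHA = 0b0010


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication, reduced modulo PRIM_POLY."""
    result = 0
    for _ in range(U):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> U:
            a ^= PRIM_POLY
    return result


def chien_search(lam: list) -> list:
    """Return the sorted error positions encoded by the roots of Lambda(x)."""
    order = (1 << U) - 1
    # constants alpha^0, alpha^1, ..., one per coefficient of Lambda
    consts = [1]
    for _ in range(len(lam) - 1):
        consts.append(gf_mul(consts[-1], ALPHA))
    terms = list(lam)  # terms[j] holds lam[j] * alpha^(i*j), starting at i = 0
    positions = []
    for i in range(order):
        total = 0
        for term in terms:
            total ^= term           # Lambda(alpha^i) is the sum of all terms
        if total == 0:
            positions.append((order - i) % order)
        # step from alpha^i to alpha^(i+1): one constant multiplication per term
        terms = [gf_mul(term, c) for term, c in zip(terms, consts)]
    return sorted(positions)


# A scalar multiple of (1 + alpha^3 x)(1 + alpha^10 x) in the assumed field.
errors = chien_search([2, 13, 9])
```

For a shortened code of length n, only positions smaller than n are meaningful; any larger position would indicate more than t errors.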
III. Design
In general, BCH decoders are designed for high throughput, since they are most often used in high-throughput communication devices. In our case, however, the BCH decoder is intended for error correction of the output of a physically unclonable function (PUF) [2], i.e. PUF key extraction [3] [4] [5]. In this setting, throughput is only a secondary requirement, since the PUF generates relatively little data to correct and error correction has to happen only once, at startup. Furthermore, a PUF key extractor is generally part of a larger design and thus should be as small as possible. As such, our design approach towards the BCH decoder is markedly different from the de facto standard of using systolic arrays [11-13, 15, 16]. The primary goals of the design are to be as small as possible and to be flexible, since different PUF types require different BCH parameters. Time efficiency comes second. In the following sections, the design of the BCH decoder is explained in detail.
A. Hardware
In order to execute the three algorithms necessary for BCH decoding, a controller is needed. Furthermore, this controller needs to be easily adaptable to different code parameters, because the type of BCH code used in a PUF key extraction device depends on many factors, such as PUF error rate, PUF output width and final key length [4]. Due to these requirements, a microcontroller design seems best suited for the decoder.
Components
The BCH decoding processor consists of three main components, which are shown in Fig. 1 . Each of them is described in the next paragraphs. 
1) Data block:
The data block consists of a data RAM block, which stores all data necessary for the decoding as well as the corrected codeword, and an attached arithmetic unit (ALU). Since virtually all arithmetic for BCH decoding is over elements in $\mathbb{F}_{2^u}$, only a single Galois field operation is supported by the ALU: single-cycle multiply-accumulate, with the ability to execute either multiplication or addition separately. The ALU contains a single register for the accumulator and has a dual-port input from the RAM.
2) Address block: Part of the novelty of our design is the use of a dedicated address block. This block consists of a tiny address RAM, of only 5 elements, and an attached ALU. The reason for including a separate address block is explained later on. The ALU works over elements in Z and supports increase by one, decrease by one and binary inversion, which is equal to negation followed by a decrease by one in two's complement notation. This allows the address block to be used for address pointer storage, for array pointer arithmetic and for keeping track of counter values.
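The three address ALU operations can be modelled in a few lines of Python. The register width of 8 bits and the function names are assumptions for the sake of the example; the model only illustrates the two's complement identity mentioned above.

```python
# Sketch of the three address ALU operations on fixed-width registers.
# Assumption (not from the paper): 8-bit address registers.
WIDTH = 8
MASK = (1 << WIDTH) - 1


def inc(x: int) -> int:
    """Increase by one, wrapping around at the register width."""
    return (x + 1) & MASK


def dec(x: int) -> int:
    """Decrease by one, wrapping around at the register width."""
    return (x - 1) & MASK


def inv(x: int) -> int:
    """Binary inversion: flip every bit of the register."""
    return ~x & MASK


def negate(x: int) -> int:
    """Two's complement negation, used here only to check the identity."""
    return -x & MASK


# Binary inversion equals negation followed by a decrease by one.
identity_holds = all(inv(x) == dec(negate(x)) for x in range(1 << WIDTH))
```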
3) Controller:
The controller consists of a firmware ROM, as well as an FSM to interpret this machine code and control the microprocessor.
Communication
Not only are both the data block and the address block controlled by the controller, both also have outputs connected to it. This allows the controller to compare the content of the RAMs or the result of an arithmetic operation to some fixed value. The controller can block write signals going to both RAM blocks, which allows conditional execution for all instructions.
Code analysis of the three algorithms shows that almost every arithmetic operation takes place on array elements. This led to the development of the address block, which allows very efficient array pointer arithmetic. The address input of the data RAM is wired straight to the output of the address RAM. Therefore, only indirect access to data elements is supported. Since at most five address pointers are needed at any time, the address of these pointers can be included in each instruction word. Thus, this "forced" indirect addressing is actually one of the nice aspects of the processor, driving down both firmware size and runtime. For example, an array sum can be programmed with just three instructions: accumulate, increase address pointer and conditional branch.
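The three-instruction array sum can be mimicked with a small behavioural model in Python. The model is a sketch under several assumptions: the loop ends when the address pointer equals a fixed end address, pointer 0 of the address RAM is used, and the names are hypothetical. It does not reproduce the actual instruction encoding or timing.

```python
# Behavioural sketch of an array sum built from three steps:
# accumulate, increase address pointer, conditional branch.
# Assumptions (not from the paper): loop exit by comparing the pointer with a
# fixed end address; pointer 0 of a five-entry address RAM is used.
ADDRESS_RAM_SIZE = 5


def array_sum(data_ram: list, start: int, end: int) -> int:
    """XOR-accumulate data_ram[start:end] using indirect addressing only."""
    address_ram = [0] * ADDRESS_RAM_SIZE
    address_ram[0] = start
    accumulator = 0
    while True:
        # accumulate: the data RAM is addressed through the address RAM
        accumulator ^= data_ram[address_ram[0]]
        # increase address pointer
        address_ram[0] += 1
        # conditional branch: leave the loop once the pointer reaches the end
        if address_ram[0] == end:
            break
    return accumulator


total = array_sum([3, 5, 6, 9], 0, 4)
```

Because addition in a binary field is an XOR, the accumulator ends up holding the field sum of all array elements.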
B. Software
The three algorithms for BCH decoding are implemented in an assembly language for the hardware described in the previous section. In this section, we list the processor's instruction set architecture (ISA) and go over some of the techniques used to achieve a time-efficient implementation.
Instruction Set Architecture
Table I lists the instruction set architecture of the processor. All instructions are 10 bits wide and contain bit fields for conditional execution and (if applicable) target and destination address pointer(s). Some instructions are implemented specifically with the target algorithms in mind. E.g. the rotr instruction also sets a conditional execution flag depending on the LSB of the affected data word. This eliminates the need for a separate check and allows the inner loop of the syndrome calculation algorithm to be implemented with only two instructions.

Optimization Techniques
In order to improve the runtime of our firmware, a few techniques are used.
First of all, the algorithms' inner loops are all unrolled. This reduces the overhead of costly conditional jumps back to the start of the loop. Pre- and post-loop patch code is avoided by manually tuning the number of loop unrolls to the code parameters, which keeps the impact on firmware size low. This loop optimization technique improves the runtime of our initial firmware by up to 30%. The next big improvement in runtime is due to the combination of multiplication and addition in a single-cycle instruction. The merge of these two instructions results in a further 38% speedup of our error location calculation algorithm. Code duplication, in order to move conditional branches out of loops, improves the runtime of the Berlekamp-Massey implementation by another 20%. The support for conditional execution speeds up the syndrome calculation algorithm further, by 28%, due to the elimination of conditional jumps in the inner loop. Finally, the last improvement to runtime is due to improved memory management, with syndrome calculation seeing a 64% speed increase over an implementation with straightforward variable placement.
Table I. Instruction set architecture of the processor. Columns: Opcode, Result, Cycles.
IV. Implementation
In the next paragraphs, we list the results for FPGA synthesis of our design and show the impact of code parameters on runtime.
Synthesis
Our design is completely implemented in Verilog and was synthesized for the Xilinx © Virtex-6™ family of FPGAs using Xilinx ISE 12.2 M.63c with design strategy 'Area reduction with Physical synthesis'. As can be seen in Table II, the total size of our design is very small and changes little for different BCH codes. No separate RAM blocks are used, since our design uses RAM and ROM blocks which are implemented within LUTs. Thus, the listed slice count is the actual total size that the design requires. To the best of our knowledge, a comparison with existing BCH decoders is nearly impossible and makes little sense. This is due to the target application of our design: PUF key extraction. The primary goal of our design is compactness, whereas for existing designs it is high throughput [11] [12] [13] [15] [16] [17]. Further complicating the comparison, the area of other implementations is either given for an ASIC implementation [11, 15, 16] or simply not stated [12, 13, 17]. Furthermore, the codes used for our target application are generally defined over $\mathbb{F}_{2^u}$ where 8 ≤ u ≤ 10, with high error correcting capabilities of 3-10% [3] [4] [5], and our firmware is optimized with this in mind. We have not been able to find designs for such code parameters. Finally, most publications deal with Reed-Solomon decoding, which requires slightly different algorithms than those needed for BCH decoding, making fair comparisons even harder.
Runtime
Code parameters greatly influence the runtime of each algorithm. The high-order approximate formulas for each algorithm's runtime in Table III clearly show that t has the largest influence, unless very long BCH codes are used. In this same table, formulas are given for the ideal runtime, which we define as the number of cycles needed if each inner loop iteration takes one cycle, no matter how many operations are inside the loop, without parallel execution.
Comparing these ideal runtime formulas with the formulas for our implementation shows that the processor is very efficient. Of note are the syndrome calculation and error location calculation implementations, which on average require only 2-4 times more cycles than in the ideal case, even with the overhead of conditional loop branches. Table IV lists the runtime of our processor for the example BCH codes. It clearly shows that the number of corrigible errors t has the largest effect on the runtime.
V. Conclusion
We have presented the design and implementation of both hardware and software for a tiny application-specific programmable BCH decoding processor. Our design requires less than 1% of the available resources (70 slices) for a BCH(413, 296, 13) decoder on a small Virtex-6 FPGA, and gets close to the ideal runtime for two out of the three required algorithms.
Due to its extremely small size, it is the perfect match for a PUF key extraction system. Such a system will spend multiple milliseconds interfacing with a PUF [4], and thus the speed of our design is well within acceptable limits.
