We design and implement a cryptographic biometric authentication system using a microcoded architecture. The security properties of the biometric matching process are obtained by means of a fuzzy vault scheme. The algorithm is implemented in a reprogrammable, microcoded coprocessor called FV16. We present the micro-architecture of FV16 as well as a dedicated assembler for this architecture. Our coprocessor can be attached to an ARM processor, and offers an 83-fold cycle count improvement when the fuzzy vault algorithm is migrated from embedded ARM software (13.8 million cycles) to the FV16 coprocessor (166 thousand cycles).
INTRODUCTION
An authentication system based on biometric information offers greater security, and is more convenient than the traditional methods of personal verification. Along with the rapid growth of this emerging technology, the system performance, including the matching accuracy and speed, is continuously improved. In a fingerprint-based biometric matching system, the comparison is made between the features extracted from an input fingerprint image and a reference template. Because of the uniqueness and sensitivity of the reference template data, secure storage is a key factor for the biometric system security. This is especially an issue for embedded applications. Special precautions must therefore be taken to protect the template from possible attacks.
A naive approach is to encrypt the template using a secret key such as a PIN. When a matching operation needs to be performed, the system decrypts the template using the PIN and then performs the biometric matching. However, this defeats the purpose of biometric devices, which is to be independent of PIN codes entered by the user. Moreover, a dedicated attacker could still extract the secret key, and in turn the template, using a side-channel attack (SCA) [1]. A cleaner solution is to store a noninvertible transformed version of the template, for instance a hash, on the embedded device, and to perform the comparison in the transformed space. The main property of a cryptographic hash function is that it is one-way, so that the output hash value gives no information about the input [2]. Consequently, similarity between inputs is not reflected in the output hash values. For fingerprint verification, hashing is therefore not suitable: different scans of the same fingerprint are never exactly the same, so their hash values will always differ. To address this problem, we adopt the idea of a fuzzy vault [3], [4] to conduct the biometric authentication. In a fuzzy vault scheme, a transformed version of the minutiae is stored together with a large set of noise data. A suitable fuzzy vault matching algorithm is then able to distinguish between noise and input data points. In this work, we design and implement a fingerprint verification system using this technique. In order to construct the system efficiently and to make it reconfigurable, we build a domain-specific microcoded coprocessor, which is optimized for fuzzy vault algorithms. It can be used for a class of applications that require a fuzzy vault scheme.
This paper is organized as follows. Section 2 introduces the algorithm we adopt and the possible design approaches. Section 3 discusses the system implementation and design flow in detail. Section 4 shows results and Section 5 draws the conclusions.
CRYPTOGRAPHIC BIOMETRICS
Application
A novel cryptographic technique called the fuzzy vault scheme has been proposed recently [3], [4]. It integrates well-known error-control coding methods and cryptographic techniques, and can be used to combine biometric authentication and encryption. The objective of this algorithm is to store biometric data in a 'vault', a cryptographically safe store. The classic fingerprint vault construction, however, is based on the assumption that the fingerprint features to match are perfectly aligned, a condition that is very difficult to achieve in practice. The algorithm we adopt in our work addresses this alignment problem in a systematic way to make a complete and adaptive authentication system based on the fuzzy vault [8].
As shown in Fig. 1 , in order to address the security problem posed by the leakage of the stored biometric information, instead of templates, we store a machine-generated bit stream as the PIN on the device and present it as the coefficients for a Galois Field encoding polynomial in the enrollment phase. This polynomial is used to encode the minutiae template, generating the lockset of the fingerprint fuzzy vault. The next step is to add a large number of noise points to conceal this lockset. The combination of the lockset and the noise points forms the fuzzy vault. In the fingerprint matching phase, fuzzy vault unlocking needs to be performed to generate a code PIN'. Comparison of PIN and PIN' will indicate whether the matching is successful or not [8] . 
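The enrollment step described above can be sketched in a few lines of Python. This is only an illustration of the classic vault-locking idea, not the alignment-aware algorithm of [8]: the field polynomial `0x1100B`, the function names, and the use of the minutiae values directly as x-coordinates are all assumptions made for the example.

```python
import secrets

# Assumed field polynomial x^16 + x^12 + x^3 + x + 1; the paper does not state one.
POLY = 0x1100B


def gf_mul(a, b):
    """Multiply two elements of GF(2^16) by shift-and-add with reduction."""
    result = 0
    for _ in range(16):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10000:
            a ^= POLY
    return result


def poly_eval(coeffs, x):
    """Evaluate the PIN polynomial (low-order coefficient first) at x.

    Horner's rule is used; addition in GF(2^16) is XOR.
    """
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def lock_vault(pin_coeffs, minutiae, n_chaff):
    """Return shuffled (x, y) pairs: the lockset plus n_chaff noise points."""
    genuine = {x: poly_eval(pin_coeffs, x) for x in minutiae}
    vault = dict(genuine)
    while len(vault) < len(genuine) + n_chaff:
        x = secrets.randbelow(1 << 16)
        y = secrets.randbelow(1 << 16)
        # A noise point needs a fresh x and must not lie on the polynomial.
        if x not in vault and y != poly_eval(pin_coeffs, x):
            vault[x] = y
    points = list(vault.items())
    secrets.SystemRandom().shuffle(points)
    return points
```

Only the points derived from the template satisfy the polynomial relation, which is what the unlocking phase exploits when it reconstructs PIN'.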
Design Approach
Fuzzy vault matching is characterized by complex decision-making as well as complex data processing. In addition, we target an implementation on a portable, resource-constrained platform. This means we need a specialized architecture, as given by one of the options of Fig. 2. A software solution based on standard programmable components such as a CPU can fall short in execution speed as well as in energy consumption. A full-hardware design in FPGA or ASIC, on the other hand, will achieve the required performance at the expense of flexibility and design cost. This leads to a specialized programmable solution, such as a DSP or an Application Specific Instruction Processor (ASIP). In our approach, we adopt a microcoded coprocessor architecture, where we have full control over all the function blocks in the datapath, the communication network, and the controller. Instead of constructing the system based on a predefined processor core, we begin from the application specifications and define our own datapath, from which a specific microcoded coprocessor called FV16 is developed. In this work, FV16 is developed in three steps: (1) Identify recurring and intensively used operations, for which special hardware modules are constructed; (2) Platform design: create the interconnect, storage and control architecture to integrate the datapath elements, and, together with this architecture, define an instruction set, which is an abstracted version of the design; (3) Decompose the C program into assembly instructions and generate microcode using a customized assembler.
Fig. 2. Design approaches for embedded systems
IMPLEMENTATION
Architecture
The architecture of our coprocessor, FV16, is shown in Fig. 3. In the fuzzy vault algorithm, all operations execute in the $GF(2^{16})$ field. Thus all the fingerprint minutiae feature elements are represented by 16-bit integers, and the fuzzy vault construction and unlocking procedures can be fully described using 16-bit arithmetic. The coprocessor is microcoded, with a separate datapath and controller, which makes the design more programmable. As shown in the figure, our system includes an ALU, a register file (RF), a data RAM, and a data address generator (DAG). A program counter (PC), an instruction register (IR), and a 1-bit condition register (Z) are included in the control unit. Since the PC is a 16-bit register, 64K bytes of program memory can be addressed. The Z register stores the result of the last compare operation: if the comparison result is true, Z holds "1"; otherwise it holds "0". During execution, the FV16 coprocessor first fetches an instruction from program memory into the IR and sends it to the decoder. The decoded instruction is then executed. If the instruction performs a comparison, the condition register Z is updated. For conditional branch instructions, the execution depends on the logic level of Z, which is combined with the inputs from the IR to determine the proper sequence of control vectors. Each instruction therefore requires at least three cycles to complete: fetch, decode and execute. In order to speed up the system, a three-stage pipeline is implemented, which allows a new instruction to be issued every clock cycle. For multiple-cycle instructions, such as RAM accesses and Galois Field multiplications, the assembler adds "NOP" instructions to avoid pipeline stalls. This simplifies the pipeline controller design.
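The NOP padding performed by the assembler can be pictured with a short Python sketch. The mnemonics and the latency table below are hypothetical (the text only states that RAM accesses and Galois Field multiplications take multiple cycles), so this shows the idea rather than the behavior of the actual FV16 assembler.

```python
# Hypothetical mnemonics and extra latencies (cycles beyond one execute stage).
EXTRA_CYCLES = {"GFM": 15, "LDR": 1, "STR": 1}


def insert_nops(program):
    """Pad each multi-cycle instruction with NOPs so its result is ready
    before the next real instruction executes."""
    padded = []
    for instr in program:
        padded.append(instr)
        mnemonic = instr.split()[0]
        padded.extend(["NOP"] * EXTRA_CYCLES.get(mnemonic, 0))
    return padded
```

Because the padding is decided at assembly time, the pipeline controller never has to detect a hazard or hold an instruction back.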
In a high-level language programming environment, address generation is done by the compiler, which performs variable allocation and converts all index expressions into integer-arithmetic operations. In the FV16 coprocessor, we implement a dedicated hardware data address generator (DAG), similar to those found in DSP processors. Given a base address, this address generation unit can provide an address increment, an address reset, or any particular address, depending on the request. Such hardware address generation improves the execution performance and eases the programming of the coprocessor.
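A behavioral model helps to make the three DAG requests concrete. The sketch below is an assumption about how such a unit might behave: the class and method names are invented for illustration, and the 13-bit address width is inferred from the 8K-word data RAM mentioned in the programmer's model.

```python
class DataAddressGenerator:
    """Behavioral model of an address generator with a base register."""

    def __init__(self, base=0, width=13):
        self.mask = (1 << width) - 1   # 13 bits address an 8K-word RAM
        self.base = base & self.mask
        self.addr = self.base

    def increment(self):
        """Advance to the next word (wrapping around) and return the address."""
        self.addr = (self.addr + 1) & self.mask
        return self.addr

    def reset(self):
        """Return to the base address."""
        self.addr = self.base
        return self.addr

    def load(self, addr):
        """Jump to an arbitrary address."""
        self.addr = addr & self.mask
        return self.addr
```

Walking an array then costs no ALU instructions: the program issues an increment request alongside each memory access instead of computing the index in software.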
Special Functional Block Models
Besides the blocks discussed above, the architecture diagram in Fig. 3 contains several more function units. These blocks are designed as special function modules to make the coprocessor faster. They are identified by means of algorithm analysis, which shows that some functions are used extensively. The underlying framework of our system is Galois Field $GF(2^{16})$ arithmetic, so a Galois Field multiplier is included as a special computation unit. During the vault construction, a large number of randomly distributed noise points are needed to protect the biometric information. The unlocking matrix used in the unlock procedure also includes random elements. Therefore a pseudo-random number generator is included to generate the noise required by the algorithm. In addition, a triangle block is needed for calculating the physical distance between two elements. Next we explain these blocks individually.
Galois Field Multiplier
All the calculations in the fuzzy vault algorithm are based on $GF(2^{16})$ arithmetic, so a Galois Field adder and a Galois Field multiplier are required. While the GF adder is implemented with a bitwise XOR, GF multiplication is more complicated. As shown in Fig. 4, we use a bit-serial Galois Field multiplier architecture. First, the shift register is initialized to the all-zero state. In each clock cycle, the partial product vector is added to the current state. After 16 cycles the product is available. We implemented this entire unit as a single functional block, called the Galois Field Multiplier block (GFM). Besides using GFM, we considered two alternative implementations of Galois Field multiplication: writing the multiplication algorithm in C for a general-purpose ARM processor, and programming it in assembly instructions for the ALU of the FV16 coprocessor. The cycle counts required by these three methods are shown in Fig. 6 (GFM) on a log scale. With the GFM special block, the FV16 coprocessor needs a total of 85K cycles for Galois Field multiplication. In contrast, it takes 1.02M cycles on the same coprocessor without the GFM block and 7.79M cycles for C software running on the ARM. Thus, the GFM functional unit inside FV16 improves the cycle count by a factor of about 90 relative to the ARM software.
Fig. 4. Block diagram for bit-serial GF multiplier
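The bit-serial scheme can be modeled in software with one loop iteration per clock cycle. The following Python sketch assumes the field polynomial x^16 + x^12 + x^3 + x + 1 (`0x1100B`), which is a common choice for $GF(2^{16})$ but is not stated in the paper; the function name is likewise illustrative.

```python
POLY = 0x1100B  # assumed field polynomial x^16 + x^12 + x^3 + x + 1


def gf_mul_bit_serial(a, b):
    """Multiply a and b in GF(2^16), processing one bit of b per 'cycle'."""
    state = 0                       # shift register starts in the all-zero state
    for i in range(15, -1, -1):     # 16 iterations, most significant bit first
        state <<= 1                 # multiply the running state by x
        if state & 0x10000:
            state ^= POLY           # reduce modulo the field polynomial
        if (b >> i) & 1:
            state ^= a              # add (XOR) the partial product for this bit
    return state
```

Each iteration needs only a shift, a conditional XOR with the polynomial, and a conditional XOR with the operand, which is why the hardware version fits in a small shift-register structure while a software version on a general-purpose ALU spends several instructions per bit.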

Pseudo-random Number Generator
In order to provide the random elements for the vault construction and unlocking procedures, we adopt a 16-bit Linear Feedback Shift Register (LFSR) pseudo-random number generator in our design. The feedback is defined by a primitive polynomial chosen for minimal hardware. Every clock cycle, the shift register generates a 16-bit random value. For comparison, we also implement the pseudo-random number generator in assembly code without the special hardware, and in C targeting the ARM. The assembly code implements an LFSR in software and the C program uses the rand() function. Fig. 6 shows the performance results (RNG), indicating that the RNG block makes the system more than 3 times faster than the coprocessor-based design without the RNG special module, and 580 times faster than the software-only implementation on the ARM processor.
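A 16-bit LFSR of this kind can be sketched as follows. The feedback polynomial used here, x^16 + x^14 + x^13 + x^11 + 1, is a well-known maximal-length choice and is an assumption for illustration only; it is not necessarily the polynomial used in FV16, and the function names are hypothetical.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one clock and return the new state.

    Taps correspond to the assumed polynomial x^16 + x^14 + x^13 + x^11 + 1.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_period(seed=0xACE1):
    """Count the clocks needed for the register to return to its seed."""
    state, steps = seed, 0
    while True:
        state = lfsr16_step(state)
        steps += 1
        if state == seed:
            return steps
```

With a primitive polynomial, any nonzero seed cycles through all 65535 nonzero 16-bit states before repeating; the all-zero state is a fixed point and must be avoided as a seed.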
Triangle Block
At the beginning of the unlock procedure, the input value needs to be compared with the values in the fuzzy vault to find the closest elements for constructing the unlock set. This comparison needs to determine the physical distance between two feature points rather than simply compare two numbers. According to the minutiae feature extraction procedure [8], each element in the lock set and the unlock set is constructed from a pair of minutiae coordinates (r, θ). The distance between two elements is therefore a geometric distance between two points given in (r, θ) form.
Since the FV16 coprocessor has only one ALU, implementing this distance function with basic instructions would take a large number of cycles. In order to speed up the system, we design a function block especially for this calculation, whose function diagram is shown in Fig. 5. Fig. 6 presents the performance improvement obtained with the triangle computation block (TRI). The system requires 10 times fewer clock cycles than the coprocessor without the TRI block, and 100 times fewer clock cycles than a C program running on an ARM processor.
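As a point of reference, the Euclidean distance between two points given in polar form follows from the law of cosines. The Python sketch below assumes that this is the distance intended and uses floating-point arithmetic for clarity; the fixed-point formulation actually computed by the TRI block is not given in the text, and the function names are illustrative.

```python
import math


def polar_distance(p, q):
    """Euclidean distance between two points given as (r, theta) pairs.

    Law of cosines: d^2 = r1^2 + r2^2 - 2*r1*r2*cos(theta1 - theta2).
    """
    r1, t1 = p
    r2, t2 = q
    sq = r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * math.cos(t1 - t2)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def closest(query, candidates):
    """Return the candidate point nearest to the query point."""
    return min(candidates, key=lambda c: polar_distance(query, c))
```

The expression needs three multiplications, a cosine and a square root per comparison, which illustrates why a single-ALU implementation is expensive and a dedicated block pays off.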
Programmer's Model
Based on the architecture discussed above, we construct a programmer's model for operating all the function blocks. Each instruction is 16 bits wide and, in most cases, is divided into three fields: the operation, the address of the source register and the address of the destination register. The operation is encoded in the first 8-bit field. Table 1 presents the instructions and their corresponding operation codes. Some data-moving instructions and branch instructions belong to special types whose operation codes are not encoded in 8 bits; we discuss them in detail below. All the instructions can be classified as one of five types.
Addressing Modes
There are three addressing modes for the registers of the FV16 coprocessor. The first is register direct addressing, which moves data from one register to another. The second is immediate addressing, in which the data is contained in the instruction. The third is inherent addressing, where the instruction always uses the same source or destination.
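The common three-field instruction format with register direct addressing can be illustrated with a small encoder and decoder. The opcode values and mnemonics below are hypothetical, and the split of the remaining 8 bits into two 4-bit register fields is an assumption; only the 16-bit width and the leading 8-bit operation field come from the text.

```python
# Hypothetical opcode values; the real ones are listed in Table 1.
OPCODES = {"ADD": 0x01, "XOR": 0x02, "CMP": 0x03}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def encode(mnemonic, src, dst):
    """Pack opcode (8 bits), source (4 bits) and destination (4 bits)."""
    if not (0 <= src < 16 and 0 <= dst < 16):
        raise ValueError("register index out of range")
    return (OPCODES[mnemonic] << 8) | (src << 4) | dst


def decode(word):
    """Unpack a 16-bit instruction word into (mnemonic, src, dst)."""
    return MNEMONICS[(word >> 8) & 0xFF], (word >> 4) & 0xF, word & 0xF
```

For example, under these assumptions `encode("XOR", 3, 5)` yields the word `0x0235`, and `decode(0x0235)` recovers `("XOR", 3, 5)`.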
Data Transfer Operations
The coprocessor uses an internal 8K × 16-bit RAM block, which can be accessed with dedicated instructions. These instructions read from and write to the memory, and take care of address generation.
Branch Instructions
Since loops are used extensively in the fuzzy vault scheme, branch instructions, including unconditional and conditional jumps, are provided to support decision-making and control flow.
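The interplay between the compare operation, the Z register and the two kinds of jumps can be shown with a toy interpreter. All mnemonics (`ADDI`, `CMP`, `JZ`, `JMP`) and the assumption that the compare tests for equality are hypothetical; the sketch only mirrors the control-flow mechanism described in the text.

```python
def run(program, regs, max_steps=10_000):
    """Execute a list of (op, a, b) tuples and return the register file."""
    pc, z = 0, 0
    for _ in range(max_steps):
        if pc >= len(program):
            return regs
        op, a, b = program[pc]
        pc += 1
        if op == "ADDI":        # regs[a] += immediate b, with 16-bit wraparound
            regs[a] = (regs[a] + b) & 0xFFFF
        elif op == "CMP":       # Z = 1 when the two registers are equal
            z = int(regs[a] == regs[b])
        elif op == "JZ":        # conditional jump: taken only when Z == 1
            if z:
                pc = a
        elif op == "JMP":       # unconditional jump
            pc = a
        else:
            raise ValueError(f"unknown op {op}")
    raise RuntimeError("step limit exceeded")


# Count r0 up until it equals r1.
LOOP = [
    ("ADDI", 0, 1),      # 0: r0 += 1
    ("CMP", 0, 1),       # 1: Z = (r0 == r1)
    ("JZ", 4, None),     # 2: leave the loop when Z == 1
    ("JMP", 0, None),    # 3: otherwise repeat
]
```

Running `run(LOOP, [0, 5])` iterates five times and returns with both registers equal to 5.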
Design Flow
In the previous sections we discussed the architecture of the microcoded coprocessor FV16, as well as the programmer's model for writing assembly instructions. In this section, we present the system design flow, which is shown in Fig. 7. Given the application specifications, the designer partitions the C specification into driver software running on the embedded ARM and a specialized coprocessor. Both the coprocessor architecture and the software running on it need to be designed. A specialized language, GEZEL [9], is used to construct the datapath for the coprocessor. At the same time, the part of the C program that needs to run on the coprocessor is converted to assembly code following the programmer's model. The microcode is then generated by the assembler and stored in the program ROM, which is part of the microcoded coprocessor. All three components, namely the C program for the driver, the GEZEL code for the coprocessor datapath, and the microcode running on the coprocessor, are co-simulated in GEZEL. GEZEL is an open design environment for domain-specific micro-architectures, linked with an ARM instruction-set simulator [9]. As an example, Fig. 8 shows the GEZEL design for the Galois Field multiplier used in the FV16 coprocessor.
In order to assemble the instructions for the FV16 microcoded coprocessor architecture, we build a dedicated assembler based on a publicly available universal retargetable assembler framework from Tomasz Sztekja [10]. This is a powerful assembler and linker package, written entirely in Java. It is very flexible and can support almost any architecture. More importantly, it is open source, so users can port it to their own processors.
RESULTS
Following the design flow, we implement the fingerprint verification algorithm based on the fuzzy vault scheme on the application-specific microcoded coprocessor FV16. Using cycle-true simulation in GEZEL, we find that the whole procedure completes in 166K cycles. For comparison, we also write the embedded software in C and cross-compile it into an executable that is simulated on an ARM instruction-set simulator (ISS). The simulation shows that it takes over 13.8M cycles to finish the algorithm. In terms of source code size, 1400 lines of GEZEL code describe the datapath of the FV16 coprocessor, and 1024 lines of assembly code implement the algorithm.
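The overall improvement factor follows directly from the two cycle counts reported above, taking 13.8M as the (lower-bound) ARM figure:

```latex
\[
\text{speedup} \;=\; \frac{13.8 \times 10^{6}\ \text{cycles}}{166 \times 10^{3}\ \text{cycles}} \;\approx\; 83.1
\]
```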
After performance evaluation, the secure vault fingerprint verification system based on the FV16 coprocessor can be converted into synthesizable VHDL and run on a reconfigurable FPGA platform. The Synplicity tool Synplify Pro is used to perform the synthesis, with a Xilinx Virtex2 XC2V1000 as the target platform. The synthesis results are shown in Table 2.
CONCLUSIONS
We design an application-specific microcoded coprocessor called FV16, on which we build a HW/SW co-design for a secure biometric authentication system. An instruction set and a programmer's model are defined for writing assembly programs that target this architecture. We also propose a complete design flow that shows how the design tasks are integrated and how other applications can be mapped onto this microcoded coprocessor. The results show that our coprocessor reduces the cycle count by a factor of more than 83 compared to a software-only implementation.
ACKNOWLEDGMENTS
