Abstract-This paper presents the implementation of a speaker-verification system on a field-programmable gate array (FPGA). The algorithm is executed in software on an embedded system that includes a MicroBlaze microprocessor connected to a vector floating-point unit (VFPU). The VFPU is designed to speed up every vector floating-point operation involved in the verification algorithm, whereas the microprocessor controls the process and executes the remaining operations. With a clock frequency of 40 MHz, the system executes the complete algorithm in real time, processing a voice frame in 9.1 ms. The same verification process was carried out on two other systems: 1) an ARM Cortex-A8 microprocessor; and 2) a MicroBlaze configured with the scalar floating-point unit provided by Xilinx. The experimental results show that, compared with these two systems, the proposed system reduces the number of clock cycles by factors of 11.2× and 15.4×, respectively. The main advantage of the VFPU is its flexibility, which allows the software to be adapted quickly to changes in both the system and the user requirements. The algorithm was tested on a public database that contains utterances of different users acquired under different environmental conditions, providing good recognition rates.
requiring more expensive capture devices. Such a feature is fundamental to the rapid expansion of this technology in the low-cost consumer market. Nowadays, some smartphones already include applications that allow authorized users to unlock their phones by uttering a passphrase, replacing the traditional methods based on keys or PINs. In addition, commercial transactions and remote personal verification by phone are applications of speaker verification that could already be used by a large segment of the population [1], [2].
Most speaker-verification systems adopt an architecture that consists of two differentiated phases: 1) enrollment; and 2) classification. During the enrollment phase, the system is trained with several utterances provided by a specific user. Utterances are processed in order to obtain a set of features that represent the specific physical structure of the individual's vocal tract. Although there are many approaches for extracting these features, the mel-frequency cepstrum coefficients (MFCC) have been the most widely used in speaker verification, as well as in speech recognition [3]-[5]. These coefficients reliably represent the distinguishing features of the voice signal of a particular user. In addition, they are robust against background noise, providing high recognition accuracy even in text-independent scenarios. Finally, based on these coefficients, a model for each user is generated. In the classification phase, the user whose identity is to be verified pronounces a new utterance. This utterance is processed and its feature vector, based on the MFCC, is extracted and compared against the model stored during the enrollment phase. The comparison yields a result that is used to accept or deny the identity claimed by the user.
The classification phase, also known as the matching process, is based on algorithms capable of distinguishing the extracted features of any individual from those of the genuine user. Algorithms based on generative models, such as hidden Markov models or Gaussian mixture models, have traditionally been applied to speaker verification [6], [7]. The success of these statistical methods depends on the proper estimation of their parameters, which are calculated by maximizing the likelihood of the data for a particular model. Such estimation provides good convergence properties, although it does not guarantee finding the global maximum [8], [9]. In contrast, the support vector machine (SVM) is a discriminative approach that is intended to estimate a decision surface directly, rather than modeling a probability distribution function [10]. Several versions of this algorithm, which use different kernels, have already been employed in speaker verification, providing outstanding matching results [11].
On the other hand, due to the large number of computationally demanding operations performed during the verification process, the computational cost of these algorithms is usually high. Additionally, the platform on which the system is implemented should be able to manage a significant amount of data in real time. These issues are especially important in text-independent scenarios, where the verification process requires utterances of several seconds in length. Microprocessors of moderate cost have been proposed as a compromise solution for the implementation of biometric systems. The flexibility of a software implementation allows the rapid development of applications using fixed hardware architectures. However, the complexity of some algorithms means that only high-performance microprocessors, such as those included in personal computers, can process them in real time, at the expense of increasing the price and power consumption of the whole system.
Hardware architectures based on field-programmable gate arrays (FPGAs) are another alternative for implementing these algorithms. At a reasonably low cost, these devices allow the design of high-speed architectures useful for implementing biometric algorithms [12], [13]. There are also some publications addressing the implementation of speaker- or speech-recognition systems on FPGAs. Most of these publications proposed the implementation of only one stage, feature extraction or classification, for accelerating the total execution time [14]-[17]. Recently, Ramos-Lara et al. [18] presented a more advanced implementation of a complete speaker-verification system on an FPGA. The authors pointed out that their implementation was able to carry out the feature extraction and classification stages in a shorter time than the frame length (real-time processing). However, this performance was achieved at the expense of designing the system for a specific sample frequency and a particular codification rate, and by transforming the original floating-point operations to a fixed-point format. Note that any change in any of these parameters involves redesigning the entire system.
These drawbacks can be partially overcome by adding a floating-point unit (FPU) as part of the FPGA design. However, most of the proposed FPUs only include basic arithmetic operations (mul, add, sub, and div) and are unable to process biometric algorithms in real time [19], [20]. An additional disadvantage, within the framework of biometrics, is that these math coprocessors only admit scalar numbers as input operands. Many functions used in speaker-verification systems perform computations whose input operands are vectors of variable length. Although the FPU is able to resolve these computations, the time needed for their calculation could be significantly reduced if a vector floating-point unit (VFPU) is used [21], [22]. When operating with vectors, the VFPU increases the throughput by reducing both the number of CPU fetches and the number of memory accesses.
This paper presents the implementation of a whole MFCC-SVM speaker-verification system on an FPGA. The system consists of the Xilinx MicroBlaze general-purpose 32-bit microprocessor and a VFPU that calculates any vector operation defined in floating-point format. The architecture of the VFPU is generic, so that it can be easily adapted to other soft-core microprocessors or FPGA families. Compared with a custom-hardware implementation, the main feature of the VFPU is its flexibility, which makes it easy to modify the algorithm or add new processing stages. Besides, in the particular case of a speaker-verification system, such flexibility allows the VFPU to be designed independently of the number of bits used for the codification of the input samples or the number of coefficients included in the feature vector. Furthermore, in applications in which voice samples are affected by environmental conditions (background noise, distortion, etc.), or users share a marked characteristic in their voice (due to age, prosodic features, etc.), the parameters can be quickly changed to adapt the system to such characteristics and improve the recognition rates. For instance, in [23], a training process that includes simulated noisy data and provides an improvement higher than 23% compared with a classic training method was proposed. In [24], it is shown that, owing to the effect of aging, the recognition rates can be degraded by approximately 20% every one or two years. Finally, in [25], it is demonstrated that the recognition performance based on cepstral features is improved by adding higher-level information, including prosodic and lexical features, which allows the equal error rate to be reduced by up to 19%.
This paper is organized as follows. Section II describes the basic theory of the algorithm, which is based on MFCC and SVM. Section III presents the internal structure of the VFPU and the floating-point operations it implements. Section IV shows the experimental results. Finally, the conclusions are drawn in Section V.
II. ONLINE SPEAKER-VERIFICATION ALGORITHM
The architecture of the proposed speaker-verification system is well documented [18], so it is only briefly reviewed here. Specifically, the feature-extraction block is based on the MFCC, whereas the SVM algorithm is the basis of the model-creation and classification blocks.
A. MFCC-Based Feature Extraction
The MFCC are the essential elements used for obtaining the feature vector x_m. During feature extraction, the speech signal is segmented into frames 25 ms in length. The frame m is the basic unit from which the feature vector is obtained. Each frame overlaps the previous one by 15 ms, so the frame advance is 10 ms.
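The framing scheme can be sketched as follows. The 8 kHz sampling rate and the function name `split_into_frames` are assumptions made only for illustration; the text fixes just the 25 ms frame length and the 10 ms frame advance.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=25, advance_ms=10):
    """Split a sample sequence into overlapping frames (illustrative sketch)."""
    frame_len = sample_rate_hz * frame_ms // 1000    # samples per frame
    advance = sample_rate_hz * advance_ms // 1000    # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += advance
    return frames


# One second of dummy samples at the assumed 8 kHz rate
one_second = list(range(8000))
frames = split_into_frames(one_second)
overlap = len(frames[0]) - 80   # samples shared by two consecutive frames
```

With these assumed values each frame holds 200 samples and consecutive frames share 120 samples, that is, the 15 ms overlap stated above.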
Usually, the optimal number of parameters that form the feature vector is 26. The first one, C_0(m), is the natural (Napierian) logarithm of the energy localized in a temporal window; the following 12 parameters, C_1(m), ..., C_12(m), are directly based on the MFCC, which represent the spectral envelope of the voice signal [3]-[5]; and the last 13 parameters are known as differential or delta coefficients and are denoted as ΔC_0(m), ..., ΔC_12(m). Such delta coefficients represent the variation of the MFCC between adjacent frames. Table I summarizes the complete sequence of functions and operations involved in the calculation of the feature vector. Functions are listed in sequential order according to their execution (the final parameters are outlined in bold). Note that the first parameter is directly obtained in Step 3. The MFCC are calculated in Step 10, after applying several functions to the original frame m. Finally, delta coefficients are obtained in Step 11 with a delay of 2 frames. As can be seen, computing these delta coefficients requires the MFCC of the previous frames m − 2 and m − 1 and of the subsequent frames m + 1 and m + 2.
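As an illustration of why frames m − 2 to m + 2 are needed, the sketch below uses the regression formula commonly found in speech front ends, Δc(m) = Σ_k k · (c(m + k) − c(m − k)) / (2 · Σ_k k²) with k = 1, 2. This formula is an assumption: Table I is not reproduced in this section, and the authors' exact expression may differ. The function name `delta_coefficients` is hypothetical.

```python
def delta_coefficients(frames, m, window=2):
    """Delta of each static parameter at frame m, using frames m-window..m+window.

    Assumed regression formula (not taken from the paper's Table I).
    """
    norm = 2 * sum(k * k for k in range(1, window + 1))   # equals 10 for window = 2
    n_coeffs = len(frames[m])
    return [
        sum(k * (frames[m + k][c] - frames[m - k][c]) for k in range(1, window + 1)) / norm
        for c in range(n_coeffs)
    ]


# Five consecutive frames of 13 static parameters that grow by 1 per frame
static = [[float(m + c) for c in range(13)] for m in range(5)]
deltas = delta_coefficients(static, m=2)
```

Because the deltas of frame m depend on frame m + 2, they can only be produced two frames late, which matches the delay of 2 frames mentioned in the text.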
B. Model Creation
The aim of the training stage is to create a model for each user, which contains the main characteristics that represent the user's voice. The data employed to build such a model include several feature vectors of this user (genuine), as well as other feature vectors belonging to different people (impostors). The model is obtained offline, using a training algorithm that runs on a desktop computer and, in this particular case, employs the public database BANCA [26]. Since the classification stage is implemented by an SVM algorithm, the model consists of a set of Q support vectors y_j(i) (i = 0, ..., 25; j = 0, ..., Q − 1) and their associated parameters (ρ, γ, P_j). Note that the size of y_j(i) (26 elements) coincides with the number of parameters that form the feature vector x_m. LIBSVM and Torch are libraries suitable for training classifiers based on SVM models [27], [28]. The training algorithm is executed several times, changing the number of feature vectors and adjusting a characteristic threshold called ρ. The purpose is to find the number Q of support vectors y_j(i) that leads to the best classification results. In this particular case, we found that the optimal number Q is 3636. Each of these support vectors y_j(i) has an associated constant P_j (Lagrange coefficient) provided as a result of the training algorithm, along with an additional parameter γ common to all support vectors. The optimal values obtained for γ and ρ are 0.4 and −0.44, respectively. Besides, the total number of feature vectors used in this offline training process is 8000, including genuine and impostor users. Thus, the model λ for a specific user can be described as
(1)
C. SVM-Based Classification
The aim of the SVM classifier is to figure out whether the feature vector matches the user's model or not. Such a comparison yields a binary result, which assigns 1 when the match is positive and 0 when it is negative. Note that this comparison is performed for each frame m.
SVM maps the input data into a higher-dimensional feature space, in which a hyperplane that separates the classes and maximizes the margin between them is found. Such a transformation, from an initial space to another of higher dimension, is achieved by means of a kernel function. One of the most common kernels used in identification is the radial basis function. This function, adapted to the context of speaker verification, is represented by the following expression:
where x_m(i) is the feature vector of frame m, described by
The calculation of (2) is, by far, the most time-consuming process and involves the most intensive computations. The result α(m), along with the threshold parameter ρ obtained during the training stage, is employed for determining the assignment value β(m)
Finally, after analyzing all the frames in the utterance, if the percentage (denoted as matching) of feature vectors belonging to the user's model exceeds a threshold, the identity claimed by the user is confirmed as genuine (T refers to the total number of frames)
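The classification steps can be sketched as follows. Expressions (2), (4), and (5) are not reproduced in this section, so the sketch assumes the standard radial-basis form α(m) = Σ_j P_j · exp(−γ · Σ_i (x_m(i) − y_j(i))²), a decision β(m) = 1 when α(m) > ρ, and matching = 100 · Σ_m β(m) / T. These forms, the function names, and the toy model values are all assumptions, not the trained model described in Section II-B.

```python
import math


def frame_score(x, support_vectors, coeffs, gamma):
    """alpha(m): weighted sum of RBF kernel values between x and each support vector."""
    total = 0.0
    for y, p in zip(support_vectors, coeffs):
        sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        total += p * math.exp(-gamma * sq_dist)
    return total


def frame_decision(x, model):
    """beta(m): 1 if the frame is attributed to the claimed user, else 0 (assumed rule)."""
    alpha = frame_score(x, model["sv"], model["coeffs"], model["gamma"])
    return 1 if alpha > model["rho"] else 0


def matching_percentage(feature_vectors, model):
    """Percentage of frames attributed to the user's model."""
    hits = sum(frame_decision(x, model) for x in feature_vectors)
    return 100.0 * hits / len(feature_vectors)


# Toy two-dimensional model with made-up values (hypothetical, for illustration only)
toy_model = {"sv": [[0.0, 0.0], [3.0, 3.0]], "coeffs": [1.0, -1.0], "gamma": 0.4, "rho": 0.0}
utterance = [[0.1, 0.0], [0.0, 0.2], [2.9, 3.0], [0.2, 0.1]]
matching = matching_percentage(utterance, toy_model)
```

In this toy case three of the four frames lie near the positively weighted support vector, so three quarters of the frames are attributed to the user. The final accept or reject decision would compare `matching` with a threshold whose value is not given in this section.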
III. VECTOR FLOATING-POINT UNIT ARCHITECTURE

A. VFPU Description
Figs. 1 and 2 show the internal architecture of the VFPU presented in this paper. Its main features can be summarized as follows.
1) The VFPU executes computations on vectors of arbitrary size using single-precision (32-bit) operands defined by the IEEE-754 standard. Using this format, full compatibility between the data shared by the microprocessor and the VFPU is ensured. Although a design based on half precision (16-bit) would consume fewer hardware resources, it would require a data-conversion block to guarantee compatibility between both formats, which may introduce a penalty on the execution time of some stages. Furthermore, with half precision, the computations performed in some stages may produce underflow (overflow) errors, which would have to be managed to avoid their potential effect on the recognition process.
2) Computations can be performed with vectors stored in external memory, scalar numbers provided by the microprocessor, or any combination of them. Likewise, the result of any computation can be placed in external memory or read by the microprocessor.
3) The internal architecture of the VFPU is designed to optimize vector computations based on the execution of a set of basic floating-point operations. In this way, these computations can be performed without unnecessary accesses to external memory for storing temporary results.
The data path, described in Fig. 1, basically consists of four blocks. The Bus Interface connects the VFPU to the microprocessor through the system bus. This interface manages the writing, in registers S0-S7, of both scalar operands and the memory addresses at which the vectors are located. The Memory FIFO reads these vectors directly from external RAM and writes to the same memory the resulting vector obtained after finishing a set of operations. The Register File contains 16 registers of 32 bits (R0 to R15), which store the results provided by the FPU. These registers can be used as new operands in subsequent operations. The FPU is designed to perform the operations described in Table II.
As Fig. 2 shows, the internal design of the FPU includes a specific block capable of performing the exponential function, which is the basis of the kernel employed by the SVM classifier. The addition of this block, as part of the VFPU, is necessary to solve the algorithm in real time, since SVM is the most time-consuming process. Furthermore, Table II also presents the throughput and the latency (in clock cycles, T_CLK) for each operation. As the FPU is internally segmented into several stages, the number of results per second available at its output depends on the type of operations launched by the control unit. For instance, although the latency of an exponential function is 37 · T_CLK, when such an operation is executed consecutively more than once, the second and subsequent results are obtained every 29 · T_CLK (throughput of 34.5 kFLOP/MHz). The rest of the operations behave in a similar way, so that, as indicated in Table II, the maximum throughput provided by the VFPU is 1 MFLOP/MHz.
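The relation between latency and throughput quoted for the exponential can be checked with a few lines of arithmetic. The helper name `pipelined_cycles` and the simple model (first result after the latency, one further result per issue interval) are assumptions for illustration; only the 37-cycle and 29-cycle figures come from the text.

```python
def pipelined_cycles(n_ops, latency, interval):
    """Cycles needed to obtain n_ops consecutive results from one pipelined block."""
    return latency + (n_ops - 1) * interval


EXP_LATENCY = 37    # cycles until the first exponential result (from the text)
EXP_INTERVAL = 29   # cycles between consecutive exponential results (from the text)

four_exponentials = pipelined_cycles(4, EXP_LATENCY, EXP_INTERVAL)

# One result every 29 cycles means 1/29 results per cycle, i.e. per MHz of clock
# the block delivers 1e6/29 results per second, expressed here in kFLOP/MHz.
exp_throughput_kflop_per_mhz = 1000.0 / EXP_INTERVAL
```

Under this model the sustained rate is 1000/29 ≈ 34.5 kFLOP/MHz, in agreement with the figure given above, while an operation that accepts a new operand every cycle reaches 1 MFLOP/MHz.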
The denormalization block (Denorm) turns the IEEE-754 format into a fixed point more suited to carry out any operation. Once the process of calculation is completed, the normalization (Norm) and rounding (Round) blocks perform the opposite operation, providing the result in the original format.
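To make the role of the Denorm block more concrete, the sketch below separates an IEEE-754 single-precision value into its sign, exponent, and fraction fields and restores the hidden leading 1 of the significand. It only illustrates the format; the internal fixed-point representation actually used by the hardware is not described in this section, and the function names are hypothetical.

```python
import struct


def decode_single(value):
    """Return (sign, biased exponent, 23-bit fraction) of an IEEE-754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def to_fixed_significand(value):
    """Return (sign, unbiased exponent, 24-bit integer significand).

    Valid for normal numbers only:
    value = (-1)**sign * significand * 2**(exponent - 23)
    """
    sign, exponent, fraction = decode_single(value)
    significand = (1 << 23) | fraction      # restore the hidden leading 1
    return sign, exponent - 127, significand


sign, exp2, sig = to_fixed_significand(-2.5)
rebuilt = (-1) ** sign * sig * 2.0 ** (exp2 - 23)
```

Rebuilding the value from the three fields, as the last line does, corresponds conceptually to the inverse step carried out by the Norm and Round blocks.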
B. Architecture for Maximum Speed Processing
The design of the VFPU should be performed in order to ensure the capacity of the system to work in real time, so that the feature extraction and matching processing of frame m should be performed before a new frame (m + 1) is ready to be processed (in our particular case, this time is 10 ms, which is the frame advance). The aim of this section is to highlight the computational features of the architecture proposed in Section III-A. For that purpose, a simple example based on the processing function represented in (6) is chosen
Note that this function is very similar to the kernel used by the SVM classifier shown in (2), but setting γ and P_j to −1 and 1, respectively, and fixing the number of support vectors to 3636. It should be pointed out that, as described in Section II-B, the actual values used in the experimental results for γ, ρ, and P_j are different and are obtained by applying the training algorithm. For the sake of simplicity, let us suppose that initially the resolution of (6) is performed using only one control unit (UC). The program that should be executed is shown in Fig. 3.
Clearly, in such a program, two loops can be identified: 1) the external loop indexed by j and related to the number of support vectors; and 2) the internal loop managed by index i, whose upper limit is determined by the size of the feature vector. Note that such a code is highly vectorizable, that is, the elements of the input (output) vectors x_m(i) and y_j(i) are read (written) in sequential order and the same sequence of operations is repeated for all the elements of a vector. As will be shown in Section IV, this property is very important for the VFPU to achieve high acceleration factors. Thus, the microprocessor manages the execution process and solves the nonvectorizable code, whereas the VFPU is in charge of performing all the vectorizable operations involved in (6). Fig. 4 shows that such operations are launched sequentially, according to their particular latency given in Table II. This simple structure works properly, but it has two important drawbacks.
1) The UC only launches a new operation if both the input operands and the block used for implementing the operation are available. As a consequence, the additions, subtractions, and multiplications employed to calculate the exponent of (6) take 4 · T_CLK each. This time is far from the maximum throughput achievable by these operations (1 MFLOP/MHz).
2) Note that once the exponential function is launched, the processing of a new exponent could be started (the exponential result is not required for its evaluation). However, since operations are executed sequentially, the UC must wait until the result of the exponential is accumulated in register R12.
The efficiency of this structure can be readily improved by modifying the software program to mitigate the first drawback. The idea, shown in Fig. 5, is very simple and allows the throughput to be maximized when calculating the exponent value. Now, the outer loop is partially unrolled to compute groups of four support vectors at the same time. For instance, in the example of Fig. 5, the VFPU operates with support vectors (j + i), where i ∈ [0:3], j = 4 · k, and k ∈ [0:908]. Thus, the UC launches the same operation four times in four consecutive clock cycles, once for each of the four support vectors processed in parallel. As Fig. 4 shows, and taking the latencies presented in Table II into account, with the initial pseudocode any support vector can be processed in 353 · T_CLK. However, as indicated in Fig. 5, with the proposed software modification, four support vectors can be processed in 440 · T_CLK, which is equivalent to processing each one in only 110 · T_CLK. Note that the second and subsequent exponentials are calculated in 29 · T_CLK, according to the throughput of this operation.
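The effect of the unrolling can be illustrated in software. Neither expression (6) nor the program of Fig. 3 is reproduced in this section, so the kernel below (a sum over support vectors of the exponential of a squared distance, with a negative sign chosen to keep the toy values bounded) is an assumption, as are the function names. The sketch only shows that grouping four support vectors per inner-loop step yields the same result while exposing four independent operations that can be issued back to back.

```python
import math


def kernel_sum(x, support_vectors):
    """Plain double loop: outer loop over support vectors j, inner loop over elements i."""
    acc = 0.0
    for y in support_vectors:
        exponent = 0.0
        for xi, yi in zip(x, y):
            diff = xi - yi             # subtraction
            exponent += diff * diff    # multiplication and addition
        acc += math.exp(-exponent)     # exponential and accumulation
    return acc


def kernel_sum_unrolled4(x, support_vectors):
    """Outer loop unrolled by four: each inner step serves four support vectors."""
    assert len(support_vectors) % 4 == 0
    acc = 0.0
    for j in range(0, len(support_vectors), 4):
        group = support_vectors[j:j + 4]
        exponents = [0.0, 0.0, 0.0, 0.0]
        for i, xi in enumerate(x):
            for slot in range(4):      # four independent operations per step
                diff = xi - group[slot][i]
                exponents[slot] += diff * diff
        for e in exponents:
            acc += math.exp(-e)
    return acc


x = [0.1 * i for i in range(26)]
svs = [[0.1 * i + 0.01 * j for i in range(26)] for j in range(8)]
plain = kernel_sum(x, svs)
unrolled = kernel_sum_unrolled4(x, svs)

# Cycle figures quoted in the text
cycles_per_sv_sequential = 312 + 41   # exponent + exponential with accumulation
cycles_per_sv_unrolled = 440 / 4
```

The cycle figures at the end restate the numbers given in the text: 353 cycles per support vector for the sequential program against 110 cycles when four support vectors are processed together.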
On the other hand, the speed of the memory controller, along with the amount of data to be read, may limit the performance of the VFPU. The execution time needed to process (6) is affected not only by the memory bandwidth but also by the improvement obtained when processing groups of four support vectors in parallel. From these considerations, this time could be approximated by the following expression:
where max(a, b) is a function that returns the maximum value. N_INIT represents the initial latency (not included in Figs. 4 and 5) necessary for the initialization of the UC (5 · T_CLK) and of those registers (4 · T_CLK) that work as accumulators (N_INIT = 9 · T_CLK).
The second drawback, related to the wait-state cycles introduced by the UC, could be eliminated by including two control units (UC1 and UC2) and using pipeline techniques at function level. With this new structure, UC1 controls the calculation of the exponent, whereas UC2 manages the evaluation of the exponential. Since both control units try to access the FPU, an arbiter is needed to manage the permissions (Fig. 1). Fig. 6 shows that the control units launch operations following a specific sequence, so that during the calculation of the jth exponent, the exponential of the previous one, the (j − 1)th, is processed in parallel. Using only one control unit (Fig. 4), the exponent and the exponential function (including the accumulation) are solved in 312 · T_CLK and 41 · T_CLK, respectively. However, if both operations are launched in parallel, the execution time is mainly dominated by the calculation of the exponent, which is the longer operation. In addition, if this hardware structure is combined with programming the VFPU so that groups of four support vectors are processed at the same time (loop unrolling), the resulting throughput is substantially increased. This improvement is represented in Fig. 7, which shows how the operations are launched in a specific order by the control units and their impact on the execution time. This design of the VFPU has some interesting features.
1) Note that the operations (subtraction, multiplication, and addition) launched by UC1 are executed in pipeline (throughput of 1 MFLOP/MHz), so that their execution time is equal to the number of operations managed by this control unit (N_UC1 = 312 · T_CLK).
2) The operations launched by UC2 interrupt the pipeline created by UC1, since UC2 takes control of the bus arbiter and stops the operations controlled by UC1. The delay introduced by this interruption adds an execution time equal to the number of operations managed by UC2 (N_UC2 = 8 · T_CLK).
3) As Fig. 2 shows, the normalization and rounding blocks are shared by all operations except abs(a) and neg(a). Thus, when a new result is available at the output of the exponential, UC1 is forced to delay the launching of a new operation by 1 · T_CLK. This clock cycle is the time needed by the Norm block of Fig. 2 to normalize any value. Therefore, the delay added to the execution time is equal to 4 · T_CLK, which is the number of exponential functions managed by UC2.
In order to obtain a general expression, let Q and F be the number of support vectors and the size of a feature vector, respectively, and let N_INIT be the initial configuration delay defined previously. The time needed to solve (6) using a VFPU with two control units and resolving groups of four support vectors in parallel could be approximated by
Note that like (7), this execution time T_2UC does not depend on the speed of the memory controller, since the sum of N_UC1 = (4 · 3 · F) and N_UC2 = (4 · 2) is higher than 4 · 2 · F (the number of cycles devoted to reading data from external memory). Analyzing expression (8) and substituting 26 and 3636 for F and Q, respectively, it is easy to conclude that the total number of clock cycles needed to calculate (6) is 302,697 · T_CLK. Consequently, the average throughput provided when processing these computations is about 0.96 MFLOP/MHz, very close to the maximum theoretical value of 1 MFLOP/MHz provided by the VFPU.
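The quoted totals can be reproduced with the cycle counts given in the text. Expression (8) itself is not reproduced in this section; the sketch assumes that N_INIT is paid once per group of four support vectors, an assumption that happens to reproduce the reported number of cycles. The variable names are illustrative.

```python
Q = 3636                # number of support vectors
F = 26                  # size of the feature vector
N_INIT = 9              # initialization cycles (5 for the UC + 4 for the accumulators)
N_UC1 = 4 * 3 * F       # sub, mul, add per element for four support vectors
N_UC2 = 4 * 2           # exponential and accumulation for four support vectors
N_NORM = 4              # one stall cycle per exponential result

cycles_per_group = N_INIT + N_UC1 + N_UC2 + N_NORM    # assumed composition of (8)
total_cycles = (Q // 4) * cycles_per_group

# Floating-point operations in (6): 3 per element plus exp and accumulate per vector
flops = Q * (3 * F + 2)
throughput_mflop_per_mhz = flops / total_cycles
time_ms_at_40mhz = total_cycles / 40e6 * 1e3
```

With these values, 909 groups of 333 cycles give 302,697 cycles, and 290,880 operations in that time correspond to about 0.96 MFLOP/MHz. At 40 MHz this is roughly 7.6 ms for the kernel alone, below the 10 ms frame advance.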
A simple way of increasing the computational capability of the VFPU is to increase the number of lanes that form the architecture. Lanes are implemented by creating N identical copies of the FPU included in Fig. 1. Thus, from a theoretical point of view, the execution time is reduced by a factor of N, since the system is able to process N groups of four support vectors y_j(i) in parallel. However, this reduction is achieved at the expense of increasing the total area by a factor of N. Additionally, the more lanes are included, the greater the computational capability of the VFPU, but also the more significant the limitations introduced by the memory controller. If the system includes N lanes, that is, N identical FPUs, expression (8) is modified as follows:
Note that time reduction is not proportional to the number of lanes N. In fact, depending on both the size of the feature vector F and the number of lanes N, this time T NLanes could be limited by either the amount of memory accesses (N · 4 · 2 · F) or the number of cycles needed by both control units 4 · (3 · F + 3) to solve the operations. Thus, the addition of new lanes does not always provide the expected benefits in terms of computational capability.
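The interplay between the two limits can be explored numerically. Expression (9) is not reproduced in this section; the sketch assumes the form T = Q / (4 · N) · (N_INIT + max(N · 4 · 2 · F, 4 · (3 · F + 3))), built from the two terms named in the text. This assumed form reproduces both the single-lane cycle count reported for (8) and the 4.83 ms two-lane figure quoted in Section IV-B. The function name is hypothetical.

```python
def cycles_n_lanes(q, f, n_lanes, n_init=9):
    """Estimated cycles and limiting factor for n_lanes FPU copies (assumed form of (9))."""
    memory = n_lanes * 4 * 2 * f        # reads of x and y for 4 support vectors per lane
    compute = 4 * (3 * f + 3)           # cycles needed by both control units
    groups = q / (4 * n_lanes)
    bound = "memory" if memory > compute else "compute"
    return groups * (n_init + max(memory, compute)), bound


results = {n: cycles_n_lanes(3636, 26, n) for n in (1, 2, 4)}
two_lane_ms = results[2][0] / 40e6 * 1e3    # execution time at 40 MHz
```

Under this assumption, one lane is limited by the control units, whereas two and four lanes are limited by memory accesses, so going from two to four lanes barely reduces the cycle count.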
However, in situations in which the memory access is the most restrictive term in (9), some modifications can be added to the design of the VFPU. Usually, these modifications involve a tradeoff between resource utilization and performance. For instance, the vector x_m(i), which is identical for all support vectors y_j(i) (j = 0, ..., Q − 1), could be read only once in order to save memory accesses. Such a vector could be initially stored in an internal circular shift register (CSR) and used when required by the operations involved in (6). Thus, when including a CSR as part of the VFPU, (9) would be modified as follows:
where (N_INIT + F) represents the time needed to read x_m(i), which could be neglected when compared with the rest of the terms of (10). Unlike (9), when N = 2 the execution time is now limited by the number of operations launched by both control units, rather than by the memory accesses. After performing some preliminary designs, we found that including the CSR increases the area by about 0.1 · N.
Table III shows the tradeoff between the area and the performance for different values of N and using one or two control units (1 UC or 2 UCs). Results are normalized with respect to the simplest design, which is based on one lane and one UC (first row of Table III). Note that the maximum value tested for N is four since, with a higher value, the execution time would be limited by the memory access. In the design based on two UCs and one lane, the inclusion of a CSR only increases the area and does not give any additional advantage in terms of speed. However, the use of two UCs is interesting, since it reduces the time by 25.8% and only increases the area by 10%. For two lanes, the fastest solution is achieved by including two UCs and a CSR. The CSR increases the area by about 8.4% with respect to the design based on two UCs, but it also reduces the time by 22.9%. Likewise, when N = 4 the addition of a second control unit does not reduce the resolution time. However, when the CSR is added, the area increases by about 9%, but a significant improvement is obtained in the execution time (48.9%). As will be shown in Section IV, the optimal solution is obtained by including only one lane and two UCs. This is the structure employed in the experimental results, since such an implementation is able to process frames in real time using the minimum area and providing the maximum throughput.
IV. EXPERIMENTAL RESULTS
In order to experimentally prove the advantages of our proposal, an XC3S2000 Spartan-3 FPGA has been selected for its implementation. The system includes a MicroBlaze microprocessor that executes the whole speaker-verification algorithm in software. The VFPU, designed from scratch in VHSIC Hardware Description Language (VHDL), is connected to the microprocessor through the system bus and solves any vector floating-point computation. The square root and division operations are based on a radix-2 restoring algorithm. The logarithm and exponential functions are developed following a CORDIC algorithm. The rest of the operations are implemented by combinational circuits.
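For readers unfamiliar with it, the radix-2 restoring scheme mentioned above produces one quotient bit per iteration: the partial remainder is shifted, the divisor is subtracted on trial, and the previous remainder is restored when the result is negative. The following is an integer sketch of the textbook algorithm, not the VHDL design of the paper, whose significand datapath is not described in this section; the function name and the 24-bit default width are assumptions.

```python
def restoring_divide(dividend, divisor, n_bits=24):
    """Radix-2 restoring division of unsigned integers.

    Returns (quotient, remainder); dividend must fit in n_bits.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder = 0
    quotient = 0
    for bit in range(n_bits - 1, -1, -1):
        # Shift the partial remainder and bring in the next dividend bit
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        trial = remainder - divisor            # trial subtraction
        if trial >= 0:
            remainder = trial                  # keep the difference, quotient bit = 1
            quotient = (quotient << 1) | 1
        else:
            quotient <<= 1                     # restore (discard trial), quotient bit = 0
    return quotient, remainder


q, r = restoring_divide(100, 7)
```

Each iteration costs one shift and one subtraction, which is why this family of algorithms needs a number of cycles proportional to the operand width.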
The clock frequency used to obtain the experimental results is 40 MHz. Program and data are located in a 2-MB external SRAM. This memory is connected to both the microprocessor and the VFPU, which have direct access to read and write data. Moreover, other peripherals such as timers, UARTs, and input-output ports are also implemented as part of the embedded system. Table IV shows the resources of the FPGA required for the implementation of the whole system and the maximum clock frequency reported by the synthesis tool.
A. Recognition Results
The equal error rate (EER) is defined as the point of the detection error tradeoff (DET) curve where the false match rate (FMR) and the false nonmatch rate (FNMR) are equal. This parameter is usually accepted as a measure of the quality of a biometric algorithm. As expected, the best EER (7%) is obtained for the database with utterances acquired under controlled conditions. In contrast, the worst results, which correspond to utterances obtained in adverse and degraded conditions, present an EER ranging between 15% and 17%.
B. Speed Processing
In order to compare the performance of the proposed VFPU in terms of speed, the speaker-verification algorithm was executed on two additional systems: an ARM Cortex-A8 microprocessor clocked at 720 MHz and the MicroBlaze microprocessor configured with its own FPU designed by Xilinx. The results for ARM are given for two different implementations. In the first one, a standard compilation was performed on code based on single instruction single data (SISD) operations. In the second one, the code was rewritten to include NEON instructions, which were programmed by means of intrinsics to increase performance through vectorization [29]. The latter implementation is usually faster, since NEON performs single instruction multiple data (SIMD) processing on several floating-point lanes. Such dedicated instructions are also used to load (store) vector data between the registers and the external memory. Note that with these four implementations, it is easy to compare the performance obtained by nonvectorized (μBlaze + FPU and ARM-SISD) versus vectorized implementations (μBlaze + VFPU and ARM-SIMD). Table V shows the execution time for these four implementations, including the feature extraction and matching stages. The table also presents the specific execution time for each function described in Table I. The results are also presented in clock cycles, so that they can be scaled to the operating frequency of a faster FPGA with a higher speed grade.
As mentioned earlier, the complete processing of a frame should be carried out in <10 ms (frame advance). Only the embedded system designed with the VFPU (9.1 ms) and the ARM-SIMD execution using NEON instructions (5.71 ms) are capable of executing the whole speaker-verification algorithm in time. However, it is important to point out that ARM achieves this result using a clock frequency 18 times higher than that of the proposed VFPU. Therefore, if both systems used the same frequency, our proposal would be about 11.29 times faster than the ARM-SIMD execution. Furthermore, it is noteworthy that the NEON architecture consists of four lanes, whereas the current VFPU implementation includes only one lane. If the VFPU were built with two lanes, then according to (9) the frame matching stage would be processed in 4.83 ms with a clock of 40 MHz, which is faster than the ARM-SIMD implementation, which takes 5.55 ms (Column 5 of Table V).
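The frequency-normalized comparison can be checked with a few lines of arithmetic. The sketch below converts the reported frame times into clock cycles and takes their ratios; the helper name `cycles` is an illustrative assumption, and the 140 ms figure for the scalar FPU is the approximate value quoted in the text.

```python
def cycles(time_ms, f_mhz):
    """Clock cycles consumed in time_ms at a clock frequency of f_mhz."""
    return time_ms * f_mhz * 1e3


vfpu = cycles(9.1, 40)     # MicroBlaze + VFPU at 40 MHz
neon = cycles(5.71, 720)   # ARM Cortex-A8 with NEON at 720 MHz
fpu = cycles(140, 40)      # MicroBlaze + Xilinx scalar FPU (approximate time)

# Ratios of cycle counts, i.e. speedups at equal clock frequency.
speedup_vs_neon = neon / vfpu
speedup_vs_fpu = fpu / vfpu
```

Under these figures the VFPU needs about 3.64 × 10^5 cycles per frame, and the cycle ratios come out at roughly 11.29 against ARM-NEON and 15.4 against the scalar FPU.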
When configuring MicroBlaze with the scalar FPU supplied by Xilinx, the execution time is approximately 140 ms, which compared with our proposal leads to an average acceleration (Column 4 of [...] nonvectorizable computations involved in a specific code.
4) Loop unrolling to process groups of up to four operations at the same time.
5) Usually, the configuration time of the VFPU is considered negligible when compared with the computational time. However, when short vectors are processed, this simplification may not hold. In such cases, the code is hardly accelerated because the configuration and computation times have similar values.
Table V shows the degree of compliance (low, medium, or high) with these five factors for each stage involved in the whole algorithm. Note that the stages with higher degrees of compliance provide higher acceleration factors. For instance, the frame matching and the logarithm stages are accelerated by 17.25× and 22.43×, respectively, since they meet all the factors mentioned before. In contrast, the stage devoted to calculating the delta coefficients is accelerated only by 2×: although its code is vectorizable, the configuration time of the VFPU is not negligible, pipelining techniques cannot be applied, and exponential functions are not used. In general, stages included in the frame extraction step provide lower accelerations, since they only meet some of the five factors described previously.
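The effect of the configuration time on short vectors can be illustrated with a simple cost model. The sketch below assumes a fixed configuration cost per vector operation plus a per-element cost; the function name and all cycle counts are hypothetical placeholders, not measurements of the VFPU.

```python
def vector_speedup(n, scalar_cycles_per_elem, vector_cycles_per_elem, config_cycles):
    """Speedup of a vector unit over scalar code for a vector of n elements.

    Assumed model: the scalar code costs n * scalar_cycles_per_elem, and the
    vector unit costs a one-off configuration plus n * vector_cycles_per_elem.
    """
    scalar_total = n * scalar_cycles_per_elem
    vector_total = config_cycles + n * vector_cycles_per_elem
    return scalar_total / vector_total


# Hypothetical costs: 20 cycles/element scalar, 1 cycle/element vector,
# and 100 cycles to configure the vector unit.
short_vec = vector_speedup(4, 20, 1, 100)      # configuration dominates
long_vec = vector_speedup(1024, 20, 1, 100)    # configuration is amortized
```

In this model the speedup approaches the ratio of per-element costs only when the vector is long enough to amortize the configuration; for very short vectors the vector unit can even be slower than the scalar code.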
Ramos-Lara et al. [18] presented a custom-hardware implementation of a speaker-verification system. Extrapolating their results to a frequency of 40 MHz, a frame would be processed in 5.8 ms. Other similar proposals are presented in [15] and [30], leading to different results. However, when these custom-hardware designs are compared with the VFPU implementation, some drawbacks should be pointed out.
1) The hardware design presented in [18] is based on fixed-point arithmetic. The authors achieved highly accurate results using a variable word length, whose size is adjusted to obtain results similar to those produced in floating-point arithmetic. Thus, any change in the feature vector (for example, adding the second derivative of the MFCC coefficients), or in the number of bits used to encode the input samples, involves redesigning the overall system. Owing to the flexibility of our implementation, any of these changes requires only a simple modification of the software program. Such flexibility is not offered by any of the custom-hardware designs presented in [18], [15], or [30]. Additionally, the design described in [18] relies on specific tools (Xilinx Core Generator) that are only valid for a particular FPGA vendor. In contrast, as mentioned before, the architecture of the VFPU is generic, so that it can easily be implemented in any FPGA.
2) Floating-point computations inherently offer a large dynamic range, which is especially important when processing extremely large data sets or data sets whose range may be unpredictable. This characteristic is well suited to intensive computations such as the kernel function used by the SVM classifier.
3) The software code is usually written using float-type variables, since by default many functions (logarithm, exponential, trigonometric, square root, etc.) are defined and implemented in floating-point arithmetic. Thus, such arithmetic can be mapped directly onto hardware operations represented in this format. Fixed-point arithmetic, however, requires additional effort, since the original program must be transformed. Further, when performing operations in fixed-point arithmetic, there is a risk of producing an overflow, underflow, or round-off error. In particular, this could happen if the database or the size of the input samples changes.
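The overflow and round-off risks of fixed-point arithmetic can be illustrated with a small sketch. It assumes a signed 16-bit Q1.15 format with two's-complement wraparound, chosen only for illustration; the design in [18] uses variable word lengths, and the helper names below are hypothetical.

```python
def to_q15(x):
    """Quantize x to signed 16-bit Q1.15, wrapping on overflow."""
    raw = int(round(x * 32768))
    return ((raw + 32768) % 65536) - 32768


def from_q15(raw):
    """Convert a Q1.15 integer back to a real value."""
    return raw / 32768


def q15_add(a, b):
    """Add two Q1.15 integers with 16-bit two's-complement wraparound."""
    return ((a + b + 32768) % 65536) - 32768


# Overflow: the true sum 1.25 lies outside the representable range [-1, 1).
a, b = to_q15(0.75), to_q15(0.5)
wrapped = from_q15(q15_add(a, b))

# Round-off: 0.1 is not exactly representable with 15 fractional bits.
roundoff = abs(from_q15(to_q15(0.1)) - 0.1)
```

In this format the sum 0.75 + 0.5 wraps around to −0.75, whereas a floating-point addition would simply return 1.25. The quantization error for 0.1 stays below half a least-significant bit (2^−16), which is the kind of bound a fixed-point designer has to re-verify whenever the input range or word length changes.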
Moreover, expression (8) is quite consistent with the results shown in Table V. Note that since $\gamma = -1$ and $P_j = 1$, $N_{UC2}$ is equal to 16 (eight new multiplications are added). This theoretical expression, evaluated at a frequency of $f_{CLK} = 40$ MHz, leads to an execution time of about 7.75 ms, which represents an error of approximately 0.5% with respect to the real value of 7.79 ms. Such error is mainly due to the communication delays produced by the configuration time of the VFPU.
V. CONCLUSION
This paper presented the design and implementation on FPGA of a complete biometric algorithm for speaker verification. The feature extraction stage is based on the calculation of the MFCC, whereas the classification is performed by means of an SVM model. The paper also described a generic VFPU architecture that solves all the vector floating-point computations involved in the algorithm. Additionally, the architecture provides high flexibility, which allows the parameters of the algorithm to be quickly adapted to different conditions related to the acquisition of samples or to the particular features of a group of users. Its design includes two control units that maximize the throughput and allow the entire algorithm to be solved in real time. The performance of the VFPU was compared with two systems of similar features: the FPU provided by Xilinx and the ARM Cortex-A8 microprocessor. Experimental results show that each frame is processed by the VFPU in $3.64 \times 10^{5}$ clock cycles, which represents an acceleration factor of 11.2× and 15.4× when compared with systems based on an ARM-NEON microprocessor and the FPU of Xilinx, respectively.
