Abstract-The Levenberg-Marquardt (LM) algorithm is a nonlinear parameter learning algorithm that converges accurately and quickly. This paper demonstrates for the first time to our knowledge, a real-time implementation of the LM algorithm on field programmable gate arrays (FPGAs). It was used to train neural networks to solve the eXclusive Or function (XOR), and for 3D-to-2D camera calibration parameter estimation. A Xilinx Virtex-5 ML506 was used to implement the LMA as a hardware-in-the-loop system. The XOR function was approximated in only 13 iterations from zero initial conditions, usually the same function is approximated in thousands of iterations using the error backpropagation algorithm. Also, this type of training not only reduced the number of iterations but also achieved a speed up in excess of 3 × 10 6 when compared to the software implementation. A real-time camera calibration and parameter estimation was performed successfully on FPGAs. Compared to the software implementation the FPGA implementation led to an increase in the mean squared error and standard deviation by only 17.94% and 8.04% respectively. The FPGA increased the calibration speed by a factor of 1.41 × 10
single camera [5] , or systems dealing with projective geometry estimation [6] . The algorithm investigated can also be used for motion compensation systems [7] , [8] , and robotic vision systems [9] [10] [11] [12] [13] .
To demonstrate nonlinear parameter estimation we used neural network (NN) parameter estimation and camera calibration as practical examples. If we consider camera parameter estimation for an automated optical inspection system, the system will be able not only to tell that a fault occurred in a production line but also show how such a fault occurs and what caused the fault based on the value of the parameters estimated by the learning algorithm. It should be noted that this study is not limited to NNs or camera calibration it can integrated into a wider range of applications depending on nonlinear parameter estimation.
Replicating advanced learning algorithms on dedicated hardware is a challenging problem. In the field of NNs, and nonlinear parameter estimation, most studies are limited to the popular back-propagation algorithm and offline full precision training performed in a software environment [14] . This study builds on previous work which investigated the LM-algorithm in reduced precision, [15] .
The paper is organized as follows. Section II provides a brief background on the problem. Section III presents details of the two experiments conducted in this paper. Section IV provides the results of these experiments. The paper ends with conclusions in Section V.
II. BACKGROUND
The LM algorithm finds a solution of a system of nonlinear equations, y = φx, by finding the parameters, φ, that link dependent variables, y, to independent variables, x, by minimizing an error of a function of said system by using error gradient information for every parameter considered in the system. The LM algorithm in (1) , estimates the parameters that make up a specific system function recursively until convergence by finding the appropriate change, Δφ, leading to smaller errors. The LM algorithm depends on error, E, the Hessian matrix H, the gradient of the error, ∇J, a scalar μ which controls the trust region, and I is the identity matrix.
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/ Fig. 1 . Diagram of proposed Levenberg-Marquardt-algorithm partitioning between hardware (FPGA) and software (CPU). Fig. 1 shows a generic supervised learning procedure, in this case adapted to the LM algorithm. The upper part in the diagram is the operation phase of a function, and the lower part is the phase which adjusts the parameters in the operational phase. The upper part of the diagram shows the operation of a function depending on certain weights and parameters. The function operation tends to start at specified initial starting weights which are adjusted by a learning algorithm to reach the desired state of operation with the lowest errors. The lower part in the diagram represents the LM-algorithm training algorithm. The arrows indicate the flow of signals and parameters to and from the operation and the training sections of the learning system. The operation phase tends to be less complex than the learning phase, so some studies managed to port simple learning algorithms onto a field programmable gate array (FPGA). Porting learning algorithms onto an FPGA tends to be challenging, particularly when it is related to online training NNs on the FPGA, [14] . Usually, the processing is ported to hardware and the training remains in software or in some cases the error back propagation algorithm is performed on hardware [16] . This study, for the first time to our knowledge, shows the implementation of the more complex LM algorithm in hardware. In Fig. 1 , we propose to keep the high-speed operation in software as its hardware counter part has been comprehensively studied in literature. The hardware implementation of a complex second-order learning algorithm such as the LM algorithm has not been extensively researched. Also, the amount of speed up gained from speeding up the learning phase on FPGA far exceeds the already fast implementation of the operation phase in software.
Equation (1) summarizes the algebraic operation required to perform the LM algorithm. To generate the hardware configuration to be implemented on the FPGA, the algebraic operations were deconstructed to simpler parts. Even though the LM algorithm solves nonlinear functions, its main operation resembles a solver of linear systems. In this study, we use the Xilinx AccelDSP software package to construct and integrate three separate operations that comprise this solver, see Fig. 2 . In order to build this system for FPGA, we use cores that make up the algebraic operations of the LM algorithm. Fig. 2 provides a diagram composed of three cores that are implemented on FPGA. The three cores that comprise the LM algorithm can be broken down into a QR factorization, QR, a matrix multiplication, mtimes, and an upper triangular system solver, triangSolver. It was decided to split the LM algorithm into two separate parts to allow for faster debug, core generation time reduction, and due to the limited resources on an FPGA. Both parts can be integrated into other projects requiring real-time nonlinear parameter estimation. The first part contains the first two cores that factorize the Hessian matrix and multiply the Q matrix by the gradient of the Jacobian, ∇J. The second part solves the system using a back-substitution algorithm.
We used the orthogonal, or QR, factorization because it can deconstruct any rectangular matrix, A, to a product of a unitary matrix and an upper triangular matrix, A = QR, where Q is the orthogonal or unitary, and R is the upper triangular matrix. Unitary matrices are desirable for numerical computation because they preserve length, angles, and do not magnify errors, [17] . QR factorization was chosen over LU factorization because the QR factorization does not require any pivoting or permutations without resorting to division operations. The error gradient is then multiplied by the unitary matrix Q, ensuring that no increase in range occurs due to these multiplications since the maximum value Q can take is 1. After performing the factorization and multiplication, the results of those two cores are routed to a triangular system solver which uses the back substitution method to find the solution Δφ.
III. EXPERIMENT
Two experiments were conducted for a real-time LM-algorithm implementation on the FPGA. The first experiment used the FPGA-LM algorithm to train a NN to solve the eXclusive Or function (XOR) problem. The second experiment made use of the FPGA-LM algorithm for explicit camera calibration.
In the following subsections, we present how we implemented the LM algorithm on the FPGA. To our knowledge, this is the first time this type of work has been done.
The design and simulation of the experiments required several tools and development environments. Matlab 2009b, Xilinx ISE [18] with the AccelDSP development environment were used for software and hardware prototyping. An FPGA was used as a hardware-in-the-loop (HIL) system connected to the computer by ethernet; several studies used similar development environments [19] [20] [21] . The NN toolbox in Matlab 2009b was used to simulate the XOR training in floating point as a benchmark. The real-time operation of the FPGA-LM algorithm design was loaded onto the FPGA and was invoked by the NN toolbox as a custom training function. The nonlinear parameter estimation module in the Camera Calibration toolbox for Matlab, [22] , was replaced with the real-time FPGA-based LMA to perform the main optimization operation in the camera calibration algorithm. A Xilinx Vertix-5 ML506 development board, an Intel Core2Duo 3.0 GHz processor, 4 GB RAM, and Windows XP operating system was used to compile and connect the learning module on the FPGA.
It should be noted that AccelDSP limits the QR factorization and the triangular system solvers specification to have a maximum word length of 24 bits and matrices dimensions to 32 units wide. Another limitation of the FPGA is the maximum speed of any design operating in a HIL configuration, all the designs were limited to a maximum of 100 MHz. This was found to be sufficient as the current system design is intended to be a proof of concept of real-time operating nonlinear estimators to be used within stand-alone hardware systems requiring very highspeed and low-power solution where the need of a real-time solution far outweighs the costs of having an integrated FPGA.
A. Exclusive OR function
The XOR estimation using a NN is considered as a benchmark operation for various training techniques. The XOR is a function of two variables whose interaction forms a system that is linearly inseparable; meaning that it cannot be solved using a linear system, see Fig. 3 .
To simulate NN training of the XOR function, two input signals with 32 binary samples were used to estimate the magnitude of the signals traversing through the NN. The data were divided into three segments; with 24 samples for training (75%), four samples (15%) for validation leaving four samples (15%) for testing. NNs neuron parameter may have both weights and biases. The NN constructed had two layers, four weights linking the input-to-hidden layer with no biases, and two weights linking the hidden-to-output layer with a 1 bias at the output neuron, see Fig. 4 . The biases from the hidden layer were removed as the NN reached the same solution without them. This type of structure comprises a NN with seven parameters. For repeatability, the NNs were initialized to zero starting values.
In order to find the appropriate ranges for the variables used in the FPGA-LM algorithm, a floating point simulation of XOR-NN training was used to collate the variables and find the appropriate ranges, [23] . The input variables were quantized to fit the fixed-point format of Q (24, 19) ; comprising a signed number with a range of 2 4 and 2 −19 resolution. The output of the XOR-NN was set to Q(48,38), a precision generated using the AccelDSP workflow. 
B. Camera Mapping
Camera calibration is a process of finding the nonlinear parameters which map world coordinate system (WCS) in 3D to camera coordinate system (CCS) in 2D. The camera calibration operation and optimization has a similar setting as NNs when it comes to the optimization to find the camera system parameters with studies performing the operation phase on hardware and the learning phase in software [24] , [25] . Camera calibration adopts an established analytical framework that uses the LM algorithm for the estimation of both the linear and nonlinear distortions of the camera system. We make use of the wellknown camera calibration toolbox used in both Matlab and OpenCV, [22] . The calibration test data are found in [26] . Five planes and their reference grid were used for calibration, see Fig. 5 . The FPGA-LM algorithm was integrated into the toolbox by integrating it into the main optimization routine.
The LM algorithm is used to find the parameters mapping the reference 3D coordinates into the 2D camera coordinates. This type of mapping uses a combination of a linear and a nonlinear system of equations. The linear system is used for the estimation of linear camera parameters which describe the rotation, translation, and scaling vectors as in (2) points by using rotation matrix R, a translation matrix T , a focal point estimate f and center point estimate c.
The nonlinear effects of the camera system are estimated using equations dealing with radial and tangential distortions as in (3), where ρ (1, 2, 5) specify the extent of the radial distortion term r 2 = x 2 + y 2 , and the tangential distortion is specified by ρ (3, 4) , where a 1 = 2xy, a 2 = r 2 + 2x 2 , a 3 = r 2 + 2y 2 .
The number of unique parameters in both equations amount to a vector with a total of 12 parameters. It should be noted that the camera calibration toolbox for Matlab uses the data from all the images to construct a single vector containing all the parameters derived from all of the images. However, hardware development tools limit the size of the FPGA-LM algorithm leading to image mapping of each image on an individual basis. Table I shows the specifications and the number of components used by the two parts of the FPGA-LM algorithm; this summary of components excludes the logic required to time the internal functioning of the core. However, it is correlated with the number of cycles required per function call. It can be noted that only a few multipliers and adders were used, the clock cycles per function call were high. This is due to the process in which all cores were generated using the default settings in AccelDSP; mainly that all vector operations were rolled into loops and not performed in parallel. The loop-rolling process decreases the amount of multipliers and addition operations by using local memories which are triggered to supply intermediate values during algebraic operations. Loop unrolling reduces the clock cycles needed to perform the required operation; however, for the current study there was no need to produce cores with lower latency and higher resource usage as the FPGA used is large enough to accommodate for the XOR FPGA-LMalgorithm implementation. Both cores took up to 71% of the resources available in this FPGA. After generating the cores, the FPGA-LM algorithm was integrated into the Matlab NN toolbox by designing a custom training function replacing the internal software LM algorithm. The results of the software LM algorithm and FPGA-LM algorithm can be seen in Fig. 6 . The upper graph in this figure shows that the software solution convergence stopped after only nine iterations; if the backpropagation algorithm was used, we can expect the optimization to require tens of thousands of iterations. The LM algorithm on software reached the optimization stopping criteria by reaching a level lower than the minimum gradient. The lower graph in Fig. 6 shows the convergence of the hardware trained NN. The solution converged in 13 iterations, incurring only four more training iterations than software. The FPGA-LM algorithm optimization ended when the maximum allowable μ value was reached. It can be assumed that the NN converged to the correct solution in both hardware and software as they both reached error levels lower than 10 −6 . Table II provides a summary of the software and hardware optimization results. The summary indicates the number of epochs taken and the mean squared error (MSE) of the solution. We can note that the hardware solution had a slower rate of convergence after the sixth epoch attributed to the lower precision nature of the learning system on hardware.
IV. RESULTS

A. XOR
In order to compare hardware and software performance, we compare the time it takes to perform the main LM-algorithm function in both cases. We assume that there are no signal routing delays or memory transfer delays to and from the FPGA chip. The nine ephods in the software implementation of the XOR LM algorithm took 0.3982 s. The FPGA-LM algorithm required 13 iterations at 100 MHz, taking up to 13 μs, accelerating the learning algorithm by a factor of 3.0633 × 10 6 . For reference, a recent study [27] which implemented the training of a NN using the popular backpropagation that solves the XOR problem in 760 iterations reaching an error levels ten orders of magnitude higher than training performed in the LM algorithm. 
B. Camera Calibration
Table III provides a summary of hardware utilization required by the FPGA-LM algorithm suitable to calculate the required 12 parameters for the WCS-to-CCS (3D-to-2D) mapping. It should be noted that part 1 of the LM algorithm was generated and simulated in fixed point on the computer due to the fact that the routing of the algorithm requires more resources than the one available on the Virtex-5 SX50T available for this study. The amount of logic required to time the design function and make it cycle true is correlated with the amount of clock cycles per function call, so the routing delay prohibited the generation of an FPGA netlist which fits on the FPGA under consideration. Several vector operations were unrolled to reduce the amount of timing; however, they did not reduce the amount of logic required in order to fit on the FPGA chip. A simulation of HDL code comprising the operation of Part1 shows that it introduced an error of a magnitude of only 10 −5 . Hence, for this experiment only the 12 × 12 back substitution system (part 2) was operating in real time on the FPGA hardware taking only 7273 slices from the total of 32 640 slices available on the Virtex5 SX50T. In order to fit the whole solution onto a single chip, we need to use either a newer Virtex-6 chip or an LX85 Virtex5 FPGA. Fig. 7 shows the parameter estimation convergence rate for the 12 parameters both in software, in the top graph, and through the FPGA, on the lower graph. The parameters initial and end values vary in their range extensively; this type of normalization is used for the purpose of illustrating the conver- gence of the software and FPGA solutions of the LM algorithm. The graph displays the convergence rate after collating the parameters during the optimization process and normalizing them to a range of [0, 1] and taking the absolute values. The software convergence is smoother than the FPGA and reaches a lower error level. From the FPGA system convergence figure, we notice that the parameter estimation was not as smooth and that some of the parameters did not reach a lower level of error after the 16th iteration. The camera calibration optimization system has a maximum of 30 iterations. Fig. 8 shows a comparison of the error scatter from the floating point optimization on the left and the fixed-point FPGA optimization on the right. The figure shows that the error resulting from the FPGA solution is almost similar to the software solution. Table IV shows the summary of camera calibration in terms of the number of iterations and the mean squared error (MSE) and standard deviation (σ) of the mapping error for the five images in their original format and also when the images and the reference grid are rescaled to [0, 1] . When mapping the raw images, we notice that the floating LM algorithm ends in 20 iterations because the criteria of lowest change in the norm of parameters reached a level below 10 −9 . In all of the cases and data considered, the FPGA-LM algorithm terminated when the maximum number of iterations was reached. The FPGA-LM algorithm had satisfactory performance in terms of error and standard deviation when compared to software. When the data was in its original scale the FPGA-LM algorithm MSE performance increased by 173.5% on average, the standard deviation of the FPGA solution was also 60.10% larger than the software solution. The lower part of Table IV shows the performance of the camera calibration mapping the rescaled data. All images and their reference grid were mapped to be in the range [0, 1] . This type of rescaling process was first applied in this study to try to reduce the ranges of the resultant Hessian matrix, the error gradient matrices, and also to reduce the condition number of the Hessian matrix. The condition number of the matrix has a significant effect when dealing with quantized data which on occasion leads to singularities that do not occur when working in full precision. The rescaling reduced the initial condition number; however, the Hessian matrices that occur later in the algorithm are more difficult to solve even on a floating point precision; this can be seen in the worse performance of the software LM algorithm. Image 3 TABLE IV  CAMERA CALIBRATION LM-ALGORITHM TRAINING SUMMARY had an ill-conditioned Hessian matrix that could not be solved on neither software nor hardware. Image 5 was solved on both platforms; however, the asterisk indicates that the first iteration of the FPGA-LM algorithm was solved on floating point with subsequent iterations on hardware, a situation which would be preferable in real-world applications where the initial conditions are obtained using the highest precision available, and the adaptive system will only need to account for small changes that a system undergoes after calibration. Even though the rescaling rendered some of the mapping examples inoperable, the average increase in the hardware solution MSE was lowered to 17.94% more than the software solution while the standard deviation of the error remained almost the same only increasing by 8.04%.
The software implementation for the calibration of image 1 in software LM algorithm took 0.4228 s. The FPGA-LM algorithm required 30 iterations at 100 MHz, taking up to 30 μs, resulting in an operation almost 1.41 × 10 6 faster than software.
V. CONCLUSION
The LM algorithm was programmed and used successfully in real time on an FPGA, for the first time to our knowledge. It was found that the hardware implementation of the LM algorithm was able to find NN parameters to solve the XOR problem in just 13 iterations compared to existing studies of hardware learning using the back propagation algorithm. The FPGA-LM-algorithm NN training speed was increased by a factor of 3 × 10 6 . The same algorithm was also used to find the optimal mapping parameters in the camera calibration problem having a penalty of only 17.94% increase in the MSE and an 8.04% increase in the standard deviation when compared to a software solution. The FPGA-LM algorithm for camera calibration speed was increased by a factor of 1.4 × 10 6 . We were able to get these results by using high level software and hardware development environments. However, the limitations of the development environment restricted the variables to 24 bits of precision, and the FPGA hardware area restricted the size of the system that needs to be solved.
The FPGA hardware used in this study proved to be a restriction for portions of the algorithms used. By using a newer FPGA with more resources, all the algorithms can be ported to hardware including the operation phase of the function estimation allowing for the elimination of some of the restrictions posed by the current deployed solution. A solution implemented fully in hardware would allow for a reduction in the latency of the solution and can remove the dependency on software.
This study provides a roadmap for the implementation of the LM algorithm on fixed-point hardware which can be ported onto a dedicated ASIC for high-speed low-power applications. In most cases, the requirement for a real-time low-powered solution outweighs the costs of including an FPGA. To this extent, the solution demonstrated here can be readily integrated into a robotics vision and control or in an optical inspection system. To operate viably, both of these systems ideally require a real-time parameter estimation module.
