Abstract-We consider the problem of enabling fixed-point implementation of linear algebra kernels on low-cost embedded systems, as well as motivating more efficient computational architectures for scientific applications. Fixed-point arithmetic presents additional design challenges compared to floating-point arithmetic, such as having to bound peak values of variables and control their dynamic ranges. Algorithms for solving linear equations or finding eigenvalues are typically nonlinear and iterative, making solving these design challenges a nontrivial task. For these types of algorithms, the bounding problem cannot be automated by current tools. We focus on the Lanczos iteration, the heart of well-known methods such as conjugate gradient and minimum residual. We show how one can modify the algorithm with a low-complexity scaling procedure to allow us to apply standard linear algebra to derive tight analytical bounds on all variables of the process, regardless of the properties of the original matrix. It is shown that the numerical behavior of fixed-point implementations of the modified problem can be chosen to be at least as good as a floating-point implementation, if necessary. The approach is evaluated on field-programmable gate array (FPGA) platforms, highlighting orders of magnitude potential performance and efficiency improvements by moving form floating-point to fixed-point computation.
INTRODUCTION
T HE Lanczos iteration [1] is the key building block in modern iterative numerical methods for computing eigenvalues or solving systems of linear equations involving symmetric matrices. These methods are typically used in scientific computing applications, for example when solving large sparse linear systems of equations arising from the discretization of partial differential equations (PDEs) [2] . In this context, iterative methods are preferred over direct methods, because they can be easily parallelized and they can better exploit the sparsity in the problem to reduce computation and, perhaps more importantly, memory requirements [3] . However, these methods are also interesting for smalland medium-scale problems arising in real-time embedded applications. In the embedded domain, on top of the advantages previously mentioned, iterative methods allow one to trade off computation time for accuracy in the solution and enable the possibility of terminating the method early to meet real-time deadlines. An example application is real-time optimal decision making with applications in, for instance, advanced control systems [4] .
In both cases more efficient forms of computation, in the form of new computational architectures and algorithms that allow for more efficient architectures, could enable new applications in many areas of science and engineering. In high-performance computing (HPC), power consumption is quickly becoming the key limiting factor for building the next generation of computing machines [5] . In embedded computing, cost, power consumption, computation time and size constraints often limit the complexity of the algorithms that can be implemented, thereby limiting the capabilities of the embedded solution.
Porting floating-point algorithm implementations to fixed-point arithmetic is an effective way to address these limitations. Because fixed-point numbers do not require mantissa alignment, the circuitry is significantly simpler and faster. The smaller delay in arithmetic operations leads to lower latency computation and shorter pipelines. The smaller resource requirements either result in more performance through parallelism for a given silicon budget, or a reduction in silicon area and lower power consumption and cost. This latter observation is especially important, since the cost of chip manufacturing increases at least quadratically with silicon area [6] . It is for this reason that fixed-point architectures are ubiquitous in high-volume low-cost embedded platforms, hence any new solution based on increasingly complex sophisticated algorithms must be able to run on fixed-point architectures to achieve high-volume adoption. In the HPC domain, heterogeneous architectures that incorporate fixedpoint processing could help to lessen the effects of the power wall, which is the major hurdle in the road to exascale computing [7] .
However, while fixed-point arithmetic is widespread for simple digital signal processing (DSP) operations, it is typically assumed that floating-point arithmetic is necessary for solving general linear systems or general eigenvalue problems, due to the potentially large dynamic range in the data and consequently on the algorithm variables. Furthermore, the Lanczos iteration is known to be sensitive to numerical errors [8] , hence moving to fixed-point arithmetic could potentially worsen the problem and lead to unreliable numerical behaviour.
To be able to take advantage of the simplicity of fixed-point circuitry and reduce cost, power and computation time, the complexity burden shifts to the algorithm design process [9] . In order to have a reliable fixed-point implementation one has to be able to establish bounds on all variables of the algorithm to avoid online shifting, which would negate any speed advantages, and avoid overflow errors. In addition, the bounds should be of the same order to minimize loss of precision when using constant wordlengths. There are several tools in the design automation community for handling this task [10] . However, because the Lanczos iteration is a nonlinear iterative algorithm, all state-of-the-art bounding tools fail to provide practical bounds. Unfortunately, most linear algebra kernels (except extremely simple operations) are of this type and they suffer from the same problem.
Summary of Contribution
We propose a novel scaling procedure first introduced by us in [11] to tackle the fixed-point bounding problem for the nonlinear and recursive Lanczos kernel. The procedure gives tight bounds for all variables of the Lanczos process regardless of the properties of the original matrix, while minimizing the computational overhead. The proof, based on linear algebra, makes use of the fact that the scaled matrix has all eigenvalues inside the unit circle. This kind of analysis is currently well beyond the capabilities of state-of-the-art automatic methods [12] . We then discuss the validity of the bounds under finite precision arithmetic and give simple guidelines to be used with existing error analysis [8] to ensure that the absence of overflow is maintained under inexact computation. We also show how the same scaling approach can be used for bounding variables in other nonlinear recursive linear algebra kernels based on matrix-vector multiplication, such as the Arnoldi iteration [13] -a generalization of the Lanczos kernel for non-symmetric matrices.
The potential efficiency improvements of the proposed approach are evaluated on an FPGA platform. While Moore's law has continued to promote FPGAs to a level where it has become possible to provide substantial acceleration over microprocessors by directly implementing floating-point linear algebra kernels [14] - [17] , floating-point operations remain expensive to implement, mainly because there is no hard support in the FPGA fabric to facilitate the normalisation and denormalisation operations required before and after every floating-point addition or subtraction. This observation has led to the development of tools aimed towards fusing entire floating-point datapaths, reducing this overhead [18] , [19] . However, the hard integer support built into the FPGA fabric means that the FPGA is a good example platform where the efficiency gap between fixed-point and floating-point computation is very large, which is why they are used for evaluation in this paper, even though the presented results are equally applicable for implementing these algorithms on embedded microcontrollers and fixed-point DSPs.
To exploit the architecture flexibility in an FPGA we present a parameterizable architecture generator where the user can tune the level of parallelization and the data type of each signal. This generator is embedded in a design automation tool that selects the best architecture parameters to minimize latency, while satisfying the accuracy specifications of the application and the FPGA resources available. Using this tool we show that it is possible to get sustained FPGA performance very close to the peak theoretical generalpurpose graphics processing unit GPGPU performance when solving a single Lanczos problem to equivalent accuracy. If there are multiple independent problems to solve simultaneously it is possible to exceed the peak floating-point performance of a GPGPU. If one considers the power consumption of both devices, the fixed-point Lanczos solver on the FPGA is more than an order of magnitude more efficient than the peak GPGPU efficiency. The test data are obtained from a benchmark set of equations coming from a Boeing 747 optimal controller.
Outline
The remainder of the paper is organised as follows: Section 2 describes the Lanczos algorithm; Section 3 presents our scaling procedure and contains the analysis to guarantee the absence of overflow in the Lanczos process; Section 4 presents numerical results showing that the numerical quality of the linear equation solution does not suffer by moving to fixedpoint arithmetic; in Section 5 we introduce an FPGA design automation tool that generates minimum latency architectures given accuracy specifications and resource constraints; in Section 6 this tool is used to evaluate the potential relative performance improvement between fixed-point and floatingpoint FPGA implementations and perform an absolute performance comparison against the peak performance of a high-end GPGPU; Section 7 discusses the possibility of extending this methodology to other nonlinear recursive kernels based on matrix-vector multiplication. Section 8 concludes the paper. [20] , or to solve systems of linear equations of the form using the conjugate gradient CG [21] method when is positive definite or the minimum residual MINRES [22] method when is indefinite. The Arnoldi iteration, a generalisation of Lanczos for non-symmetric matrices, is used in the generalized minimum residual (GMRES) method for general matrices and is dicussed in Section 7. The Lanczos (and Arnoldi) algorithms account for the majority of the computation in these methodsthey are the key building blocks in modern iterative algorithms for solving formulations of linear systems appearing in all kinds of applications.
Methods involving the Lanczos iteration are typically used for large sparse problems arising in scientific computing where direct methods, such as LU and Cholesky factorization, cannot be used due to prohibitive memory requirements [23] . However, iterative methods have additional properties that also make them good candidates for small problems arising in real-time applications, since they allow one to trade-off accuracy for computation time [24] .
FIXED-POINT ANALYSIS
A floating-point number consists of a sign bit, an exponent and a mantissa. The exponent value moves the binary point with respect to the mantissa. The dynamic range-the ratio of the smallest to largest representable number-grows doubly exponentially with the number of exponent bits, therefore it is possible to represent a wide range of numbers with a relatively small number of bits. However, because two operands can have different exponents it is necessary to perform denormalisation and normalisation operations before and after every addition or subtraction, leading to greater resource usage and longer delays. In contrast, fixed-point numbers have a fixed number of bits for the integer and fraction fields, i.e. the exponent does not change with time and it does not have to be stored. Fixed-point hardware is the same as for integer arithmetic, hence the circuitry is simple and fast, but the representable dynamic range is limited. However, if the dynamic range in the data is also limited and fixed, a 32-bit fixed-point processor can provide more precision than a 32-bit floating-point processor because there are no bits wasted for the exponent.
There are several challenges that need to be addressed before implementing an application in fixed-point. Firstly, one should determine the worst-case peak values for every variable in order to avoid overflow errors. For linear timeinvariant (LTI) algorithms it is possible to use discrete-time system theory to put tight analytical bounds on worst-case peak values [25] . A linear algebra operation that meets such requirements is matrix-vector multiplication, where the input is a vector within a given range and the matrix does not change over time. For some nonlinear non-recursive algorithms interval arithmetic [26] can be used to propagate data ranges forward through the computation graph [27] . Often this approach can be overly pessimistic for non-trivial graphs because it cannot take into account the correlation between variables.
Linear algebra kernels for solving systems of equations, finding eigenvalues or performing singular value decomposition are nonlinear and recursive. The Lanczos iteration belongs to this class. For this type of computation the bounds given by interval arithmetic quickly blow up, rendering useless information. Table 1 highlights the limitations of state-of-the-art bounding tool Gappa [12] -a tool based on interval arithmetic-for handling the bounding problem for one iteration of Algorithm 1. Even when only one iteration is considered, the bounds quickly become impractical as the problem size grows, because the tool cannot use any extra information in matrix beyond bounds on the individual coefficients. Other recent tools [28] that can take into account the correlation between input variables can help to tighten the single iteration bounds, but there is still a significant amount of conservatism. More complex tools [29] that can take into account additional prior information on the input variables can further improve the tightness of the bounds. However, as shown in Table 1 , the complexity of the procedure limits its usefulness to very small problems. In addition, the bounds given by all these tools will grow further for more than one iteration. As a consequence, these types of algorithms are typically implemented using floating-point arithmetic because the absence of overflow errors cannot be guaranteed, in general, with a practical number of fixed-point bits for practical problems.
Despite the acknowledged difficulties there have been several fixed-point implementations of nonlinear recursive linear algebra algorithms. CG-like algorithms were implemented in [12] , [28] Given and
The tool described in [29] can also use the fact that . Note that has unit norm, hence , and can be trivially scaled such that all coefficients are in the given range. '-' indicates that the tool failed to prove any competitive bound. Our analysis will show that when all the eigenvalues of have magnitude smaller than one, holds independent of for all iterations . *No other bound smaller than could be proved in less than two days.
[30], [31] , whereas the Lanczos algorithm was implemented in [32] . Bounds on variables were established through simulationbased studies and adding a heuristic safety factor. In the targeted DSP applications, the types of problems that have to be processed do not change significantly over time, hence this approach might be satisfactory, especially if the application is not safety critical. In other applications, such as in optimization solvers, the range of linear algebra problems that need to be solved on the same hardware is so varied that it is not possible to assign wordlengths based on simulation in a practical manner. Importantly, in safety-critical applications analytical guarantees are desirable, since overflow errors can lead to catastrophic behaviour of the system [33] .
A Scaling Procedure for Bounding Variables
We propose the use of a diagonal scaling matrix to redefine the problem in a new co-ordinate system to allow us to control the bounds in all variables, such that the same fixed precision arithmetic can efficiently handle problems with a wide range of matrices. For example, if we want to solve the symmetric system of linear equations , where , we propose instead to solve the problem where and the elements of the diagonal matrix are chosen as to ensure the absence of overflow in a fixed-point implementation. The solution to the original problem can be recovered easily through the transformation . An important point is that the scaling procedure and the recovery of the solution still have to be computed using floating-point arithmetic, due to the potentially large dynamic range and unboundness in the problem data. However, since the scaling matrix is diagonal, the cost of these operations is comparable to the cost of one iteration of the Lanczos algorithm. Since many iterations are typically required, most of the computation is still carried out in fixed-point arithmetic.
In order to illustrate the need for the scaling procedure, Fig. 1 shows the evolution of the range of values of (Line 4 in Algorithm 1) throughout the solution of one optimization problem from the benchmark set described in Section 4. Notice that a different Lanczos problem has to be solved at each iteration of the optimization solver. Since the range of Lanczos problems that have to be solved on the same hardware is so diverse, without using the scaling matrix (2) it is not possible to decide on a fixed data format that can represent numbers efficiently for all problems. Based on the simulation results, with no scaling one would need to allocate 22 bits for the integer part to be able to represent the largest value of occurring in this benchmark set. Furthermore, using this number of bits would not guarantee that overflow will not occur on a different set of problems. The situation is similar for all other variables in the algorithm.
Instead, when using the scaling matrix (2) we have the following results: Lemma 1. The scaled matrix with as in (2) has, for any non-singular symmetric matrix , spectral radius .
Proof. Let be the absolute sum of the offdiagonal elements in row , and let be a Gershgorin disc with centre and radius . Consider an alternative non-symmetric preconditioned matrix . The absolute row sum is equal to 1 for every row of , hence the Gershgorin discs associated with this matrix are given by . It is straightforward to show that these discs always lie inside the interval between 1 and when , which is the case here. Hence, according to Geshgorin's circle theorem [23 where (7) follows from the properties of matrix norms and the fact that . The equality follows from the 2-norm of a real symmetric matrix being equal to its largest absolute eigenvalue [23, Theorem 2.3.1].
We continue by bounding and , which are used in Lines 2, 4, 5 and 6 of Algorithm 1 and represent the coefficients of the tridiagonal matrix described in (1) . The eigenvalues of the tridiagonal approximation matrix (1) [34] can be upper bounded by one if the input is known to be smaller than one. A possible source of problems both for fixed-point and floating-point implementations is encountering . However, this would mean that we have already computed a perfect tridiagonal approximation to , i.e. the roots of the characteristic polynomial of are the same as those of the characteristic polynomial of , signalling completion of the Lanczos process. If an upper bound estimate for is available, it is possible to bound all variables analytically without using the scaling matrix (2). However, the bounds will lose uniformity, i.e. the elements of would still be in but the elements of would be in . The scaling operation that has been suggested in this section is also known as diagonal preconditioning. However, the primary objective of the scaling procedure is not to accelerate the convergence of the iterative algorithm, the objective of standard preconditioning. Sophisticated preconditioners attempt to increase the clustering of eigenvalues. Our scaling procedure, which has the effect on normalising the 1-norm of the rows of the matrix, can be applied after a traditional accelerating preconditioner. However, since this will move the eigenvalues, it cannot be guaranteed that the scaling procedure will not have a negative effect on the convergence rate. In cases where the convergence rate is not satisfactory, a better strategy could be to include the objective of normalising the 1-norm of the rows of the matrix in the design of the accelerating preconditioner to trade-off the goals of fast convergence and implementation cost. Since the design of sophisticated preconditioners is very application-specific it is not possible to describe a general procedure to achieve this goal.
Validity under Inexact Computations
We now use Paige's error analysis of the Lanczos process [8] to adapt the previously derived bounds in the presence of finite precision computations. We are interested in the worst-case error in any component. In the following, we will denote with the deviation of variable from its value under exact arithmetic.
Unlike with floating-point arithmetic, fixed-point addition and subtraction operations involve no round-off error, provided there is no overflow and the result has the same number of fraction bits as the operands [35] , which will be assumed in this section. For multiplication, the exact product of two numbers with fraction bits can be represented using fraction bits, hence a -bit truncation of a 2's complement number incurs a round-off error bounded from below by . Recall that in 2's complement arithmetic, truncation incurs a negative error both for positive and negative numbers.
The maximum absolute component-wise error in the variables involved in Algorithm 1 is summarised in the following proposition: Proposition 1. When using fixed-point arithmetic with fraction bits and assuming no overflow errors, the maximum difference in the variables involved in the Lanczos process described by Algorithm 1 with respect to their exact arithmetic values can be bounded by:
for all iterations , where .
Proof. In the original analysis presented in [8] , higher order error terms are ignored since every term is assumed to be significantly smaller than one for the analysis to be valid, hence, higher order terms have a negligible impact on the final results. We do the same here as it significantly clarifies the presentation. According to [8] , the departure from unit norm in the Lanczos vectors can be bounded by for all iterations . In the worst case, all the error can be attributed to the same element in , hence, neglecting higher-order terms we have leading to (10) .
The error in Line 3 of Algorithm 1 can be written, using the properties of matrix and vector norms, as where the last term represents the maximum componentwise error in matrix-vector multiplication. Using (15) one can infer that the bound on is the same as the bound on given by (10), leading to (11).
Neglecting
where the last term arises from the maximum round-off error in the dot-product computation. Using the bounds given by Theorem 1 one arrives at (12) .
Going back to the original analysis in [8] one can use the fact that the 2-norm of the error in the relationship (9), i.e.
can be bounded from below by for all iterations . One can infer that the bound on is the same as (16) . Using the properties of vector norms leads to (13) .
The error in Line 6 of Algorithm 1 is given by Using the bounds given by Theorem 1 yields (14) . ◽
The error bounds given by Proposition 1 enlarge the bounds given in Theorem 1. In order to prevent overflow in the presence of round-off errors the integer bitwidth for has to increase by bits, which will be one bit in all practical cases. For the remaining variables, which have bounds that depend on , one has two possibilities-either use extra bits to represent the integer part according to the new larger bounds, or adjust the spectral radius through the scaling matrix (2) such that the original bounds still apply under finite precision effects.
The latter approach is likely to provide effectively tighter bounds. We now outline a procedure, described in the following lemma, for controlling and give an example showing how to make use of it.
Lemma 2. If each element of the scaling matrix (2) is multiplied by
, where is a small positive number, the scaled matrix has, for any non-singular symmetric matrix , spectral radius .
Proof. We now have . The new Gershgorin discs are given by , which can be easily proved to lie inside the interval between and . ◽ For instance, if one decides to use fraction bits on the benchmark problems described in Section 4 with dimension , the worst error bounds would be given by
The value of in Lemma 2 is chosen such that the following inequalities are satisfied which, for the given values, is satisfied by .
In this section we show that even though these algorithms are known to be vulnerable to round-off errors [36] they can still be executed using fixed-point arithmetic reliably by using our proposed approach. In order to evaluate the numerical behaviour of the Lanczos method, we examine the convergence of the Lanczos-based MINRES algorithm, described in Algorithm 2, which solves systems of linear equations by minimising the 2-norm of the residual . Notice that in order to bound the solution vector , one would need an upper bound on the spectral radius of , which depends on the minimum absolute eigenvalue of . In general it is not possible to obtain a lower bound on this quantity, hence the solution update cannot be bounded and has to be computed using floatingpoint arithmetic. Since our primary objective is to evaluate the numerical behaviour of the computationally-intensive Lanczos kernel, the operations outside Lanczos are carried out in double precision floating point arithmetic. Fig. 2 shows the convergence behaviour of a single precision floating point implementation and several fixed-point implementations for a symmetric matrix from the University of Florida Sparse Matrix Collection [37] . All implementations exhibit the same convergence rate. There is a difference in the final attainable accuracy due to the accumulation of round-off errors, which is dependent on the precision used. The figure also shows that the fixed-point implementations have a stable numerical behaviour, i.e. the accumulated round-off error converges to a finite value. The numerical behaviour is similar for all linear systems for which the MINRES algorithm converges to the solution in double precision floating-point arithmetic.
In order to investigate the variation in attainable accuracy for a larger set of problems we created a benchmark set of approximately 1000 linear systems, coming from an optimal controller for a Boeing 747 aircraft model [38] , [39] under many different operating conditions. The linear systems are the same size ( ) but the condition numbers range from 50 to . The problems were solved using a 32-bit fixed-point implementation, and single and double precision floating-point implementations, and the attainable accuracy was recorded in each case. The results are shown in Fig. 3 . As expected, double precision floating-point with 64 bits achieves better accuracy than the fixed-point implementations. However, single precision floating-point with 32-bits consistently achieves less accurate solutions. Since single precision only has 23 mantissa bits, a fixed-point implementation with 32 bits can provide better accuracy than a floating-point implementation with 32 bits if the problems are formulated such that the full dynamic range offered by a fixed representation can be efficiently utilized across different problems. Fig. 3 also highlights that these problems are numerically challenging-if the scaling matrix (2) is not used, even the floating-point implementations fail to converge. This suggests that the proposed scaling procedure can also improve the numerical behaviour of floatingpoint iterative algorithms along with achieving its main goal for bounding variables. An example application supporting this claim is described in [39] .
The final attainable accuracy , denoted by , is determined by the machine unit round-off . When using floor rounding , where denotes the number of bits used to represent the fractional part of numbers. It is wellknown [40] that when solving systems of linear equations these quantities are related by O where is the condition number of the coefficient matrix. We have observed that, for , this relationship holds with approximate equality for all tested problems, including the problem described in Fig. 2 . For smaller bitwidths, excessive round-off error leads to unpredictable behaviour of the algorithm for the described application.
The constant of proportionality, which captures the numerical difficulty of the problem, is different for each problem. This constant cannot be computed a priori, but relationship (17) allows one to calculate it after a single simulation run and then determine the attainable accuracy when using a different number of bits, i.e. we can predict the numerical behaviour when using bits by shifting the histogram for 32 fixed-point fraction bits.
PARAMETERIZABLE FPGA ARCHITECTURE
The results derived in Section 3 can be used to implement Lanczos-based algorithms reliably in low cost and low power fixed-point architectures, such as fixed-point DSPs and embedded microcontrollers. In this section, we will evaluate the potential efficiency improvements in FPGAs. In these platforms, for addition and subtraction operations, fixed-point units consume one order of magnitude fewer resources and incur one order of magnitude less arithmetic delay than floating-point units providing the same number of mantissa bits [41] . These platforms also provide flexibility for synthesizing different computing architectures. We now describe our architecture generating tool, which takes as inputs the data type, number of fraction bits , level of parallelization and the latencies of an adder/subtracter ( ), multiplier ( ), square root ( ) and divider ( ) and automatically generates an architecture described in VHDL.
The proposed compute architecture for implementing Algorithm 1 is shown in Fig. 4 . The most computationally intensive operation is the matrix-vector multiplication in Line 3 of Algorithm 1. This is implemented in the block labelled , which is a parallel pipelined dot-product unit consisting of a parallel array of multipliers followed by an adder reduction tree of depth . The remaining operations of Algorithm 1 are all vector operations and scalar operations that use dedicated components.
The degree of parallelism in the circuit is parameterized by parameter . For instance, if , there will be two blocks operating in parallel, one operating on the odd rows of , the other on the even. All links carrying vector signals will branch into two links carrying the even and odd components, respectively, and all arithmetic units operating on vector links will be replicated. Note that the square root and divider, which consume most resources, only operate on scalar values, hence there will only be one of each of these units regardless of the value of . For the memory subsystem, instead of having independent memories each storing one column of , there will be independent memories, where half of the memories will store the even rows and the other half will store the odd rows of .
The latency for one Lanczos iteration in terms of clock cycles is given by where is the number of cycles it takes the reduction circuit, illustrated in Fig. 5 , to reduce the incoming streams to a single scalar value. This operation is necessary when computing and . Note in particular that for a fixed-point implementation with and , the reduction circuit is a single adder and , as expected. Table 2 shows the latency of the arithmetic units under different number representations.
For cases where is large and the resources available cannot provide parallel multipliers, it is possible to reduce the size of the dot-product unit, which consumes most resources. For the case where , the dot product unit consists of parallel multipliers, followed by an adder tree of depth . An extra adder is needed at the output to add up the intermediate results, hence the overall latency of the dot-product computation does not change. The dotted links in Fig. 4 would carry one value every two cycles.
Design Automation Tool
In order to evaluate the performance of our designs for a given target FPGA chip we created a design automation tool that generates optimum designs with respect to the following rule: subject to P > where is defined in (18) with the explicit dependence of latency on the number of fraction bits noted. P represents the probability that any problem chosen at random from the benchmark set meets the user-specified accuracy constraint , and is used to model the fact that for any finite precision representation-fixed point or double precision floating point-there will be problem instances that fail to converge to the desired accuracy for numerical reasons. The user can specify -the tolerance on the error, and -the proportion of problems allowed to fail to converge to the desired accuracy. In the remainder of the paper, we set , which is reasonable for the application domain for the data used.
is a vector representing the utilization of the different FPGA resources: flip-flops (FFs), look-up tables (LUTs), embedded multipliers and embedded RAM, for the Lanczos architecture illustrated in Fig. 4 with parallelism degree and a -bit fixed point datapath.
This problem can be easily solved. First, determine the minimum number of fraction bits necessary to satisfy the accuracy requirements (20) by making use of the information in Fig. 3 and the relationship (17) . Once is fixed, find the maximum such that (21) remains satisfied using the information in Table 3 and a model for the number of LUTs, flipflops and embedded multipliers necessary for implementing each arithmetic unit for different number of bits and data representations [41] . Note that the actual resource utilization of the generated designs can differ slightly from the model predictions. However, the modeling error has been verified to be insignificant compared to the efficiency improvements that will be presented in Section 6.
Since the entire matrix has to be stored on-chip, memory is typically the limiting factor for implementations with a small number of bits. For larger numbers of bits embedded multipliers limit the degree of parallelization. In the former case, storage of some of the columns of is implemented using banks of registers so FFs become the limiting resource. In the latter case, some multipliers are implemented using LUTs so these become the limiting resource. Fig. 6 shows the trade-off between latency and FFs offered by the floating-point Lanczos implementations and two fixed-point implementations that, when embedded inside a MINRES solver, meet the same accuracy requirements as the single and double precision floating-point implementations. The trade-off is similar for other resources. We can see that the fixed-point implementations make better utilization of the available resources to reduce latency while providing the same solution quality.
PERFORMANCE EVALUATION
In this section we will evaluate the relative performance of the fixed-point and floating-point implementations under the resource constraint framework of Section 5 for a Virtex 7 XT 1140 FPGA [42] . Then we will evaluate the absolute performance and efficiency of the fixed-point implementations against a high-end GPGPU with a peak floating-point performance of 1 TFLOP/s.
The trade-off between latency (18) and accuracy requirements for our FPGA implementations is investigated in Fig. 7 . For high accuracy requirements a large number of bits are needed reducing the extractable parallelism and increasing the latency. As the accuracy requirements are relaxed it becomes possible to reduce latency by increasing parallelism. The figure also shows that the fixed-point implementations provide a better trade-off even when the accuracy of the calculation is considered.
The simple control structures in our design and the pipelined arithmetic units allow the circuits to be clocked at frequencies up to 400 MHz. Noting that each Lanczos iteration requires operations we plot the number of operations per second (OP/s) against accuracy requirements in Fig. 8 . For extremely high accuracy requirements, not attainable by double precision floating-point, a fixed-point implementation can still achieve approximately 100 GOP/s. Since double precision floating-point only has 52 mantissa bits, a 53-bit fixed-point arithmetic can provide more accuracy if the dynamic range is controlled. For accuracy requirements of and the fixed-point implementations can achieve approximately 200 and 300 GOP/s, respectively. Larger problems would benefit more from incremental parallelization leading to greater performance improvements, especially for lower accuracy requirements.
The GPGPU curves are based on the NVIDIA C2050 [43] , which has a peak single precision performance of 1.03 TFLOP/s and a peak double precision performance of 515 GFLOP/s. It should be emphasized that while the solid lines represent the peak GPGPU performance, the actual sustained performance can differ significantly [44] . In fact, [45] reported sustained performance well below 10% of the peak performance when implementing the Lanczos kernel on this GPGPU.
The trade-off between performance and accuracy requirements is important for the range of applications that we consider. For some HPC applications, high accuracy requirements, even beyond double precision, can be a high priority. On the other hand, for some embedded applications that require the repeated solution of similar problems, accuracy can be sacrificed for the ability to apply actions fast and respond quickly to new events. In some of these applications, solution accuracy requirements of can be perfectly reasonable.
The results presented so far have assumed that we are processing a single Lanczos problem at a time. Using this approach the arithmetic units in our circuit are always idle for some fraction of the iteration time. In addition, because the constant term in (18) is relatively large, the effect of incremental parallelization on latency reduction becomes small very quickly. In the situation when there are many independent problems available [46] , it is possible to use the idle computational power by time-multiplexing multiple problems into the same circuit to hide the pipeline latency and keep arithmetic units busy [47] . The number of problems needed to fill the pipeline is given by the following expression If the extra storage needed does not hinder the achievable parallelism, it is possible to achieve much higher performance, exceeding the peak GPGPU performance for most accuracy requirements even for small problems, as shown in Fig. 8b . Using this approach there is a more direct transfer between parallelization and sustained performance. The sharp improvement in performance for low accuracy requirements is a consequence of a significant reduction in the number of embedded multiplier blocks necessary for implementing multiplier units, allowing for a significant increase in the available resources for parallelization.
For the Virtex 7 XT 1140 [42] FPGA from the performanceoptimized Xilinx device family, Xilinx power estimator [48] was used to estimate the maximum power consumption at approximately 22 Watts. For the C2050 GPGPU [43] , the power consumption is in the region of 100 Watts, while a host processor consuming extra power would still be needed . The dotted line, the cross and the circle represent fixed-point and double and single precision floating-point implementations, respectively. The jumps in the fixed-point curve mark the points at which the level of parallelization changes. for controlling the data transfer to and from the GPGPU. Hence, for problems with modest accuracy requirements, there will be more than one order of magnitude difference in power efficiency when measured in terms of operations per watt between the sustained fixed-point FPGA performance and the peak GPGPU floating-point performance.
OTHER LINEAR ALGEBRA KERNELS
It is expected that the same scaling procedure presented in Section 3 will also be useful for bounding variables in other iterative linear algebra algorithms based on matrix-vector multiplication. 10: end for
11: return
The standard Arnoldi iteration [13] , described in Algorithm 3, transforms a non-symmetric matrix into an upper Hessenberg matrix (upper triangle and first lower diagonal are non-zero) with similar spectral properties as using an orthogonal transformation matrix . At every iteration the approximation is refined such that where and . Since the matrix is not symmetric it is not necessary to apply a symmetric scaling procedure; hence, instead of solving , we solve and the computed solution remains the same as the solution to the original problem. The following proposition summarises the variable bounds for the Arnoldi process:
Proposition 2. Given the scaling matrix (2), the Arnoldi iteration applied to , for any non-singular matrix , has intermediate variables with the following bounds for all , and :
where denotes the iteration number and and denote the component of a vector and component of a matrix, respectively. Proof. According to the proof of Lemma 1, the spectral radius of the non-symmetric scaled matrix is still bounded by . As with the Lanczos iteration, the eigenvalues of the approximate matrix are contained within the eigenvalues of even throughout the intermediate iterations. One can use the relationship (8) to show that the coefficients of the Hessenberg matrix are bounded by . The bounds for the remaining expressions in the Arnoldi iteration are obtained in the same way as in Theorem 1. ◽ It is expected that the same techniques could be applied to other related kernels such as the unsymmetric Lanczos process or the power iteration for computing maximal eigenvalues.
CONCLUSION
Fixed-point computation is more efficient than floating-point from the digital circuit point of view. We have shown that fixed-point computation can also be suitable for problems that have traditionally been considered floating-point problems if enough care is taken to formulate these problems in a numerically favourable way. Even for algorithms known to be vulnerable to numerical round-off errors, accuracy does not necessarily have to be compromised by moving to fixed-point arithmetic if the dynamic range can be controlled such that a fixed-point representation can represent numbers efficiently.
Implementing an algorithm using fixed-point arithmetic gives more responsibility to the designer, since all variables need to be bounded in order to avoid overflow errors that can lead to unpredictable behaviour. We have proposed a scaling procedure that allows us to bound and control the dynamic range of all variables in the Lanczos method-the building block in iterative methods for solving the most important linear algebra problems, which are ubiquitous in engineering and science. The proposed methodology is simple to implement but uses linear algebra theorems to establish bounds, which is currently well beyond the capabilities of state-of-theart automatic tools for solving the bounding problem.
The capability for implementing these algorithms using fixed-point arithmetic could have an impact both in the high performance and embedded computing domains. In the embedded domain, it has the potential to open up opportunities for implementing sophisticated functionality in low cost systems with limited computational capabilities. For highperformance scientific applications it could help in the effort to reach exascale levels of performance while keeping the power consumption costs at an affordable level. For other applications there are substantial processing performance and efficiency gains to be realized. 
