ABSTRACT
Introduction.
In scientific computations one of the recurring problems is finding the solution of systems of linear equations. To find such solutions there are two main categories: direct methods, such as Gaussian elimination or the Cholesky Decomposition (where the solution is given by evaluating a derived formula) and iterative methods such as the Conjugate Gradient or the more generalized version GMRES. In this latter approach a solution is refined, based on previous results, until a certain acceptable value is reached. * THE AUTHORS WOULD LIKE TO ACKNOWLEDGE THE SUP-PORT OF THE EPSRC (GRANT EP/C549481/1 AND EP/E00024X /1) As the size of a problem increases, matrix and vector operations become increasingly intensive in terms of computation and may require significant processing time. Nonetheless they can be accelerated by performing, whenever possible, parallel operations. As a result of advancements in FPGA density, massively parallel floating point computation has become feasible at the increased cost of resource utilization. Previous implementations have explored the benefits of using FPGAs. These implementations include a Cholesky approach that achieved a performance increase of 50% over software on a APEX EP20K15000E FPGA [1] ; a Jacobi solver implementation on a Xilinx VirtexII Pro XC2VP50 which achieved a speedup of 1.3 to 36.8 relative to a highend processor, depending on the matrix structure [2] ; and two CG implementations. One of these implementations used the Logarithmic Number System (LNS) and reached up to 0.94 GFLOPS on a VirtexII-6000 [3] , while the other used a rational number system representation and achieved 0.27 GFLOPS on a VirtexII Pro XC2VP4 [4] . Table 1 summarizes FPGA implementations of Conjugate Gradient method in terms of year of publication, number system, input problem structure, device and GFLOPS achieved. This paper introduces the first Conjugate Gradient FPGA implementation to take advantage of a banded structure input matrix using IEEE floating point arithmetic standard [6] . The main contributions of this paper are thus:
• a parameterizable widely-parallel and deeply-pipelined hardware design for the solution of systems of diagonally banded linear equations,
• results showing that a sustained performance of 32 GFLOPS is possible for a Virtex5-330T, translating to at least an order of magnitude speedup compared to the performance of an AMD Opteron 1220 CPU at 2.8 GHz.
• a comparison with previous implementations demonstrating how it is possible to process larger matrix orders by taking advantage of their structure. 
Hardware Implementation.
This implementation exploits the banded structure of the input matrix and performs the CG required matrix-by-vector operation one row at a time. As a result, we trade-off full parallelism with resource reutilization to increase matrix orders. The main aspects under analysis are the achievable speed, latency and throughput. It is is also important to understand resource growth with input matrix order and band width, and determine the maximum matrix order attainable with this computation scheme. This resource utilization is reduced compared to a dense matrix, since this implementation only stores band elements for each row of the structured matrix. Due to the nature of this algorithm the data iterates through the hardware as depicted in Fig. 1 . These iterations are used to spread the loading of problems and eliminate potential I/O bottlenecks as described in [5] .
The modular floating point arithmetic units used in this design are provided by Xilinx Core Generator FP v3. With all units having maximum latency, a formula for number of clock cycles required per iteration of the CG method can be derived (1) . This expression grows linearly with the matrix order due to the sequential vector-by-vector operations required to implement the matrix-by-vector multiplication. There is also a logarithmic growth of cycles per iteration due to the vector-by-vector reduction tree as depicted schematically in Fig. 2 . It is also possible to deduce the number of problems that can be processed simultaneously in a pipeline fashion. This formula (2) is equivalent to the number of clock cycles per iteration divided by the number of clock cycles required to input each problem, n, from the internal FIFOs. In order to have an integer number of problems in this pipeline and hence utilize the full instantiated FP units a FIFO, of size κ, has be inserted into the system in order to ensure that the number of clock cycles required per iteration is a multiple of n. 
Since the latency of the addition floating point module is 12 cycles, this is the number of interleaved pipelined problems required to efficiently utilize the the vector-by-vector operation. In this scheme a simple feedback loop can be used to implement the required summation reduction tree as described in Fig 2. This property affects the clock cycles required per iteration (1) as well as the total number of pipelinable problems P max given by (2). 
Resource utilization
Instantiated resources are optimized and are kept in operation throughout each iteration, avoiding any pipeline stalls. This implementation requires a number of floating point computational units that grows linearly with matrix band width, w, given by 2w + 13. Theoretical storage utilization grows as Θ(nw) due to the need to store nw elements for each input matrix A. To store these values, in a FIFO, a mixture of embedded BlockRAMs and SRLC32 primitives are used [7] . This mixture depends on the length of the FIFO. When this length is below 64, they are implemented solely using SRLC32 primitives. When above 64, they are implemented by combining BlockRAMs and SRLC32 primitives for efficiency. This is due to the fact that Xilinx Coregen BlockRAM FIFOs are only available in sizes of 2 n with n > 3; thus SRLC32 primitives are used to take up any slack. These discrete configurations explain the BlockRAM utilization jumps in Fig. 3 . This figure plots the resource usage as a function of the matrix order when the band width is equal to five, w = 5. Results showed that for the extreme case where band width is equal to its matrix order, w = n, the Virtex5-330T can accommodate matrix orders up to 92. For band widths smaller than 92 the limiting resources are the available registers. When this is not the case, the limiting resources becomes the DSP48 embedded blocks. Best fit resource usage functions for DSP48Es, LUTs, and registers as a function of matrix order and band width are described in (3), (4), and (5) respectively. DSP48Es(n, w) = 2w + 8
kLUTs(n, w) = 676 + 36n + 87w (4) kREGs(n, w) = 1007 + 91n + 108w (5) Fig. 3 . BlockRAMs, Registers, DSP48s, and LUTs resource utilization with matrix order for the Virtex5-330T. Light lines represent the best fit based on the synthesis reports for band width equivalent to five, w = 5. The dark line represent the upper bound of BlockRAM utilization.
Software comparison and discussion
The peak FLOPS performance (6) is given by the total number of active floating point units, which is a function of the matrix band width, w, and the maximum achievable frequency. FLOPS = F Punits × FPGA freq (6) Acceleration relative to software is provided by pipelining and parallelization. An important speedup is provided by the matrix-by-vector operation, as depicted in Fig. 2 . This operation which requires n(2w − 1) sequential oper- Fig. 4 . Iteration time per CG problem on a CPU and FPGA with w = 5. The bold line describes a high-end theoretical peak CPU performance and a measured ALTAS implementation, and three lighter lines show the FPGA Virtex5-330T implementation using a full pipeline (P = Pmax), the pipeline with 5 problems, and with a single problem.
ations in software, is reduced to the latency of the multiplication core, which is given by a constant plus the latency of the adder tree plus n − 1 cycles. The performance improvement between the FPGA and CPU is is given by (7) . In this formula, the numerator comprises of the FPGA's operating frequency, FPGA freq, times the number of problems on the pipeline, P , times the number of clock cycles required for a software iteration and a denominator that is comprised of the CPU's operating frequency, CPU freq, times the clocks per iteration on the FPGA. This is assuming the CPU is able to process a floating point operation every clock cycle, which is an overly optimistic view on CPUs. When the FPGA's pipeline is full, with P max problems given by (2) , this formula is reduced to (8) .
FPGA freq × (2wn + 4w + 5n) × P CPU freq × (29n + 12 log 2 w + 115)
Speedup f ull = FPGA freq × (2wn + 4w + 5n) CPU freq × n (8) For the case where w = 5, Fig. 4 compares the growth in processing time, for each CG iteration, on an FPGA and on a AMD Opteron 2.8 GHz. One bold line represents the top theoretical performance of the high end CPU performing a FP operation every clock cycle while the other represents the same CPU running an ATLAS optimized implementation of the CG algorithm [8] . Three lines are shown for the FPGA implementation: one representing a full pipeline, other for 5 problems in this pipeline and another for a single problem. It is possible to verify that the performance of the FPGA, with pipeline full, is superior to the CPU theoretical maximum, even for the low parallelism provided by w = 5.
For the case where there is only one problem in the FPGA's pipeline, performance is matched and surpassed for matrix orders above a certain threshold, which itself varies with the degree to which the pipeline is filled.
Conclusions.
This paper has described a deeply-pipelined and widelyparallel Conjugate Gradient FPGA implementation using the IEEE single precision floating point standard for problems with a banded input matrix. It was demonstrated that matrix bands up to 92 are achievable, with peak floating point performance above 32 GFLOPS for the Virtex5-330T device. With the lower capacity and readily available Virtex2-6000, it is possible to process matrices with band widths up to 28 and reach the floating point performance of 8 GFLOPS. Taking advantage of hardware parallelization, the required processing time for a single iteration is reduced from Θ(nw) to Θ(n), at the cost of increasing hardware utilization from Θ(1) to Θ(w). With this parallelization combined with pipelining it is possible to outperform a high-end CPU. In the case of the higher density FPGA, this performance is improved by at least an order of magnitude. This implementation also takes advantage of the iterative nature of the CG algorithm in order to spread the data transfer, which then becomes well within the FPGA's I/O capacity.
4 References.
