This study treats architecture and implementation of a FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering, and simplifies placement and routing on the chip. The authors show that such architecture is especially well suited for full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources, in order to allow for a large number of processing elements (PEs). Each PE uses 8 Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area size and high clock frequency. Finally, the authors quantify the performance of accelerator implemented in Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 GFLOPS), by comparing it to DGEMM function from MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
Introduction
Today's FPGAs are fast and large enough to allow hardware implementation of various algorithms that work faster compared to their software-only counterparts executing on generalpurpose microprocessors [1] , [2] , [3] , [4] , [5] . There is a plethora of research efforts regarding the use of FPGA accelerators to speed up critical parts of computationally-intensive programs. They vary in scope and way in which acceleration is accomplished; however, they all rely on some kind of parallelism, and their performance is determined by the number of concurrently working functional units.
Due to its significance in science and engineering, matrix multiplication methods and their optimisations are a very often studied subject in the field of both software and hardware design. Its inherent parallelism is especially interesting from the aspect of various parallel and distributed systems. FPGA-accelerated matrix multiplication became a viable faster alternative to software implementations from the moment when FPGA started to offer a potentially better multiplication performance than microprocessors, that is, when they started to include a dedicated multiplier blocks [6] .
Related work
There are several recent works which treat the problem of performing double-precision floating-point matrix multiplication in FPGA. Architecture by Dou et al. [7] consists of a master processor and a linear array of P processing elements. The master processor divides the input matrices into tiles with dimensions Si x N and N x Sj, respectively, and schedules them to the processing array, which calculates one Si x Sj block of the result before moving to the next one.
Each processing element (PE) contains one multiplier and one adder, two register banks with Si/P words of storage for storing elements of the first matrix, and two register banks with Si×Sj/P words, for storing intermediate results. Elements of the second matrix are not reused, and therefore not stored. The total used local storage is M = 2×Si+2×Si×Sj words, and the required input bandwidth at maximum performance is B=2× P /√M /2 words per clock cycle.
Zhuo and Prasanna [8] give the comprehensive overview of previous works on integer and single-precision matrix multiplication in FPGA, and identify the challenges associated with their possible expansion to double-precision arithmetic. They then introduce two architectures, and three corresponding algorithms. The first two algorithms work with small matrices (those which can fit into accelerator internal memory), which allow them to achieve the optimal latency depending on the available communication bandwidth. Both algorithms divide input matrices into rectangular blocks, but differ in number of multipliers and adders per processing element, and number of required processing elements for a given size of input matrices. The third algorithm is suitable for larger matrices, and uses the block matrix multiplication algorithm with square blocks. It employs a linear list of processing elements, each with one multiplier and one adder, similarly to architecture by Dou et al. [7] . However, it swaps the execution order of the two inner loops, out of three loops which constitute the matrix multiplication algorithm. In that way, there is no need for two sets of registers, and for a given local memory size M and number of processing elements P, the algorithm requires input bandwidth B=2× P /√M .
Architecture by Kumar and al. [9] uses an algorithm for scheduling input data to processing elements which has the same loop execution order as that of Zhuo and Prasanna [8] . However, instead of a systolic array-like structure (in which every PE communicates only with the adjacent ones), it uses broadcast to distribute the same elements of the first matrix simultaneously to all PEs.
The elements of second and resultant matrices are exchanged only with the adjacent PEs, as is the case with the other two related works.
All of the three architectures are equivalent to each other, and have the same performance of 2×P FLOPS per clock cycle. They use the classical block matrix multiplication algorithm, and can multiply two square matrices of order N in N 3 /P clock cycles. They have the form of a linear list of processing elements, which allows them to be scalable, that is, easily expandable to larger devices or multiple FPGAs. The architectures are also modular, because they treat floating-point arithmetic blocks as modules, which can be interchanged with modules having different implementation or characteristics, without affecting the properties of the architecture. Finally, all of the architectures have balanced processing power and bandwidth requirements. This represents the optimal use of available resources, as the computing and communication phases completely overlap. The respective papers also discuss trade-offs between local memory size and required communication bandwidth. However, they do not distinguish between input and output bandwidth, and take into account only their aggregate value. This is appropriate when communication channel is bus-based, and therefore half-duplex. However, as the full matrix multiplication has highly asymmetric traffic patterns in inbound and outbound directions, and, as the most backplane and network technologies transition to point-to-point, full-duplex architectures (the notable example being PCI-Express), that leave the communication channel in outbound direction almost unutilised. Dou et at. [7] implement double-precision floating-point units which are IEEE-754 [12] compliant, with the exception of denormal number support. They use FPGAs with 18x18 integer multiplier blocks, and construct a floating-point multiplier from 9 such blocks. Work of Kumar et al. [9] is more recent, and use FPGAs with 25x18 multipliers. In spite of that, their floating-point multiplier design require 13 such blocks. The level of IEEE-754 compliance is the same as that of Dou et at. [7] . Zhuo and Prasanna [8] describe three floating-point multiplier and adder implementations with different level of IEEE standard compliance: the "fully-compliant", "moderately-compliant" (similar to those in [7] and [9] ) and "least-compliant" (which, in addition to the absence of denormal support, also lack all the rounding modes except "round toward zero" and does not generate exceptions). They specify the number of pipeline stages, area, and clock frequency for adders and multipliers using all three level of compliance. However, they do not include implementation details and number of used multiplier blocks. 
Essence of the proposed approach
Existing solutions to FPGA-accelerated dense matrix multiplication problem have very similar architectures, because they all depend on the classic block matrix multiplications algorithm.
Faster algorithms do exist [10] , [11] , however, they are much more complex, and generally not suitable for hardware implementation. Since there is not much room for improvements in the architectural domain, it is possible to achieve better performance primarily by implementing larger number of smaller and faster floating-point units.
This paper presents an architecture and implementation of a FPGA accelerator for multiplication of matrices in IEEE double-precision format, which aims to be as fast as possible by using the following techniques, not found in other related works:
• A block matrix multiplication architecture which returns the result blocks as soon as they are computed, and leaves their final accumulation to the host processor. This allows for a less constrained placement and routing on FPGA, and consequently higher clock frequency.
Such architecture also exhibits a much more symmetric communication pattern (similar inbound and outbound bandwidth requirements), which make it especially well suited for full-duplex communication links. We show that the additional load exhibited on the host processor is relatively small, as the accelerator computes all multiplications and almost all (n-1 out of n) additions.
• Implementation of floating-point units in an efficient way, with only 8 embedded FPGA 25x18 integer multiplier blocks per floating-point multiplier. This allows realisation of larger number of processing elements. The floating-point units are also deeply pipelined, which contributes to the high clock frequency, but also in some instances reduces area, because it allows better utilisation of embedded blocks. We show that the latencies associated with the deeper pipelines do not have any negative implications, as they are in any case much smaller than the other latencies present in the system. After presenting the accelerator architecture and implementation, we quantify the performance gain of doing FPGA-accelerated matrix multiplication, in comparison to software implementations from several high-performance libraries, executing on commodity microprocessors. We take into account the current state of the art in both microprocessor and FPGA technology. Modern microprocessors perform very well, because they have high clock frequencies, multiple cores, and appropriate support in the form of floating-point vector SIMD units. However, we show that FPGA-accelerated matrix multiplication can still achieve several times higher performance.
Accelerator architecture
The accelerator consists of a linear list of multiplier-adder processing elements (PE), with memories for buffering input (and possibly output) data spread equally across all PEs (Fig. 1 ).
Although the PE connection pattern in the form of a tree is also possible [13] , the linear list has the advantage of a much more regular structure, which allows simpler routing between PEs and consequently the higher clock frequency. After the initial latency, a list of n PEs multiply two nelement vectors in one clock cycle, or two square matrices of order n in n 
= i×j×k output blocks. The ratio of input to output traffic is s
=2, occurs for i=1 and k=1. However, as i or k increase, s (2) decreases rapidly, and 
. When utilising a full-duplex link, Algorithm 2 require b FD
times less bandwidth. For k=1, there is no bandwidth reduction. However, for 7 In each clock cycle, accelerator performs n multiplications, n-1 product accumulations and multiplies two n x n matrices in n 2 clock cycles. The input matrices can be the blocks of larger matrices. The FIFO R in variant (a) facilitates the multiplication of block matrices, by storing a block of the result, until the accelerator multiplies and accumulates all the corresponding input blocks. There is also an additional adder for that purpose. FIFO's total size is n 2 elements, and in each clock cycle elements rotate to the left. When the multiplication of the last block completes, the multiplexer MX allows the FIFO contents to be shifted out and sent to the host computer. Variant (b) immediately returns elements of multiplied matrices without buffering. In the case of block matrix multiplication, it relies on the host computer to add together matrices representing parts of a resultant matrix block. This variant has the advantage of being simpler and easier to place and route, and also uses less BRAM resources.
a sufficiently large k, Algorithm 2 reduces the required bandwidth almost by a factor of two. The rest of this paper discuss Algorithm 2 and the corresponding accelerator architecture shown in Fig. 1b.
Before the matrix multiplication starts, the host computer must send to the accelerator the initial n 2 elements of X 11 The one remaining addition is done on the host processor, which, in total, performs 1/n of all additions. However, the more useful measure of accelerator performance is the time necessary to multiply two matrices of given size. According to Algorithm 2, the accelerator can multiply matrices X and Y in T comp = i×j×k×n The Algorithm 1 is equivalent to the third algorithm from [8] , and algorithms from [7] and [9] when they use minimal local memory size of Ω(k 2 ) words. They have the constant bandwidth requirements of B HD =3 and B FD =2 words per clock cycle. All of the related papers further pursue the idea of data reuse by allowing lower input traffic at the expense of larger input buffers. For memory of size M words, and p processing elements, they require √M / p times lower bandwidth in input direction. This leads to more complex block scheduling algorithms, which use more logic and memory resources. That, in turn, limit the number of PEs which can be implemented on a chip, and also has negative impact on maximum achievable clock frequency. In contrast to this approach, our Algorithm 2 reuse input blocks exactly to the extent needed to perfectly balance traffic in both directions. Thus, it is able to optimally utilise full-duplex links and has a relatively simple architecture. This simplicity, together with an efficient implementation of floating-point units, allows us to place more PEs on a chip than previously possible, and consequently achieve better performance.
Implementation of floating-point units
When utilising embedded multipliers and memories, FPGAs use about 20 times more transistors for the same amount of logic than standard-cell-based ASICs [14] . The difference is even larger in comparison to integrated circuits based on full-custom design (such as commodity microprocessors). As a consequence, clock speed of FPGAs is typically an order of magnitude lower than that of microprocessors and the only way to accomplish better performance is by use of relatively high level of parallelism. It is desirable to implement as many functional units as possible and at the same time maximise their clock frequency.
Design techniques for optimal design of arithmetic circuits in FPGA differ significantly from the techniques used in ASIC VLSI design [15] . The reason for this is fixed structure and scarcity of 
Design considerations
The accelerator is implemented using Xilinx XC6VSX475T FPGA, because of its large number of embedded multipliers (2016). We report resource utilisation and speed measurements as simulated with this device. However, it should be also pointed out that this FPGA does not exist in the fastest (-3) speed grade, which imposes an additional limit to the maximal achievable clock frequency.
Addition / subtraction
To perform a two-operand floating-point addition/subtraction, it is necessary to assure that both operands have the same exponent by right shifting the mantissa of the smaller one and increasing its exponent accordingly, add the mantissas if they have the same sign (effective addition), or subtract the smaller mantissa from the larger otherwise (effective subtraction). Sign of the result is equal to the sign of the larger mantissa. It may then be necessary to normalise the result by shifting it one place to the right (in the case of effective addition) or up to n places to the left (in the case of effective subtraction), where n is the width of the mantissa. Finally, it is necessary to round the result according to the chosen rounding mode. We use this, "classic" algorithm for floating-point addition, because it has the minimal resource requirements. Variants which use dual data paths, leading zero prediction and rounding before final normalisation offer lower latency, at the cost of much greater complexity [16] . They are commonly used in microprocessors and other designs requiring low latency, but they are not practical for FPGA implementation.
The floating-point adder consists of d a =10 pipeline stages, A1-A10. In stage A1, we compare mantissas and calculate exponent difference, which determine the number of places smaller mantissa should be shifted to the right. In order to compute correctly rounded results (as defined in [12] ), we extend the right side of the operands with three additional bits (guard, round and sticky bit, respectively), which are initially zero. In stage A2, there are three levels of logic, each implementing a funnel shifter. The usual approach is to use six shifters, in order to perform a 32, 16, 8, 4, 2 and 1 bit shifts. However, the six-input LUT architecture of the target FPGA allows combining two shifts in the same LUT, thus reducing the number of required logic levels. We or together all the bits which are shifted out, and keep them as the rightmost (sticky) bit. In stage A3, mantissas are swapped if necessary, in order to assure that first mantissa is always the larger one.
We perform the addition/subtraction with ripple-carry adder which use one LUT per result bit and a dedicated fast carry-propagation chain. Although the adder is very efficient in terms of used resources, its propagation delay increases linearly with operand size [17] and with 56-bit operands it becomes a clock-limiting factor. Since, for this purpose, low logic count and high clock speed are 
Multiplication
To perform a floating-point multiplication of two operands, it is necessary to multiply their mantissas, add exponents, and calculate the result sign as xor of operand signs. The most complex part is the multiplication of the mantissas. In the case of IEEE double precision format, mantissas are 53 bits wide, and their efficient (both in area and speed) multiplication on FPGA require the use of embedded multiplier-adder blocks. In Xilinx Virtex-5 and Virtex-6 devices, these blocks are called DSP48E and DSP48E1, respectively, and contain a 25×18 bit signed multiplier (24×17 bit unsigned), followed by a 48-bit adder with 17-bit shifter before its other input.
There are several ways to use DSP48E/E1 blocks as "tiles," parts of a larger multiplication parallelogram [18] . Each block compute one partial product, and add it together with a part of previously computed result. We propose the design shown in Fig. 3 , to maximise the use of internal 48-bit adders for accumulating partial products. Variant (a) uses fewer DSP48E/E1 blocks, but more LUTs. Because of the overall availability of logic resources in the target device, we use variant (b).
The floating-point multiplier consists of d m =17 pipeline stages, M1-M17. The relatively large number of stages is necessary because DSP48E/E1 blocks require two clock cycles in order to execute multiplication and addition at full speed. For that purpose, they have internal pipeline 11 Fig. 3 : Implementing 53×53 bit unsigned integer multiplier using Virtex-5 DSP48E or Virtex-6 DSP48E1 blocks. Multiplier (a) uses 6 DSP48E/E1 blocks and more LUTs, while multiplier (b) uses 8 DSP48E/E1 blocks and less LUTs. Variant with 7 DSP48E/E1 blocks is also possible.
registers after both multiplier and adder. We use blocks VII and VIII as 5×24 bit multipliers, and the other six blocks (I, II, III, IV, V and VI) as 24×17 bit multipliers. In stage M1, we calculate the exponent and sign of the result, and also the partial product corresponding to block VII and the partial products which do not use DSP blocks (the dark-coloured areas in Fig. 3b ). Multiplications corresponding to blocks I and VIII take place in stage M3 and those corresponding to blocks II, III, IV, V and VI in stages M5, M7, M9, M11 and M13, respectively. The adder in each block accumulate the parts of the result which are located above and to the right from it (as in Fig. 3b ) and calculated in previous stages. Those additions take place in stages M2, M4, M6, M8, M10, M12 and M14. Because the rightmost 52 bits of the complete 106-bit product are necessary only to calculate the sticky bit, we can or them together as soon as they are computed. In stage M15, we normalise the result mantissa and compute the rounding digit. If we assume that input numbers are normal (MSB=1), the multiplication result can be either normal, or may require a single normalisation left shift. The rounding takes place in the stages M16 and M17 (using a two-stage adder as in Fig. 2b ).
The resource utilisation is 447 LUTs and 520 flip-flops and achieved clock frequency is 492.12 MHz. The described multiplier work only with normal operands. It is relatively easy to add support for subnormal numbers by placing leading zero counters and left shifters on multiplier inputs (to pre-normalise the mantissas), while increasing the exponents width from 11 to 17 bits Table 1 .
In order to compare the accelerator performance to that of microprocessors, we measure the time necessary to execute a double precision matrix multiplication (DGEMM) function from four highly optimised BLAS libraries, with randomly generated square matrices, on two different computers with comparable microprocessors. Both microprocessors are manufactured in 45 nm process and belong to the same technological generation as the 40 nm Virtex-6 FPGA. We observe the similar performance in all processor/library combinations (Table 1) . Since this measurements are used only in order to establish a baseline software implementation, we do not analyse the small differences which exist between them. )). We believe this choice is due to the worse numerical stability of such algorithms.
In comparison with software implementations using only one processor core (single-thread library version), the FPGA accelerator is 20 times faster than the slowest and 18 times faster than the fastest processor/library combination. In comparison with software implementations using all four processor cores (multiple-thread library version), the FPGA accelerator is 5.6 times faster than the slowest and 4.5 times faster than the fastest processor/library combination.
The achieved results are consistent with the previously published predictions, according to which FPGAs will continue to offer better floating-point performance than microprocessors [6] .
However, we believe that with the current generation of FPGAs and microprocessors, this gap has reached its maximum, and that it will begin to shrink in the future. We base that prediction on the following observations: the recently announced 28 nm Xilinx Virtex-7 FPGA family offer 1.78 times more DSP48E1 blocks compared to Virtex-6, and also some marginal increase in clock frequency (we do not consider other FPGA manufacturers, such as Altera and Lattice, as they do not offer devices of such size). At the same time, 32 nm Intel Sandy Bridge processor offer the vector instruction set (AVX) two times wider than the previous SSE, and also better performance per core compared to the previous generation of processors [19] . There is currently up to 8 cores per chip (in Intel processors), and this number will probably significantly increase in the future [20] . This is not to say that FPGA accelerators will not still be able to achieve better performance in the areas for which microprocessors do not have direct support, but it will become increasingly difficult for
FPGAs to outperform them in floating-point arithmetic.
Conclusion
This paper has proposed an architecture and corresponding implementation for a FPGAaccelerated floating-point matrix multiplication. To our knowledge, it is the first such work to demonstrate not only comparable, but several times faster performance than that of commodity microprocessors from the same technological generation.
The architecture is as simple as possible, in order to minimise resource utilisation and maximise clock frequency. The employed block matrix multiplication algorithm sends the result blocks to the host processor as soon as they are computed. The consequence of this approach is that, while all multiplication and almost all additions are done in FPGA, 1/n of all additions, where n is block order, must be performed on the host processor. However, this number is very small: in our The implementation is also performance-oriented, and focuses on efficient use of embedded FPGA resources. Each PE uses 8 DSP blocks, which allows for a total of 252 PEs. In comparison, a related work based on the same size DSPs [8] require13 DSP blocks per PE. We have also proposed a multiplier design with 6 DSP blocks (which would equal to 336 PEs), however it would require the target device with more general-logic resources. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area size and high clock frequency. They can be readily reused in any other project related to FPGA floating-point arithmetic.
We have compared the accelerator performance with that of high-end general-purpose microprocessors. In order for comparison to be as objective as possible, we have performed measurements using processors with large amounts of cache memory and high clock frequency, executing matrix multiplication functions from highly optimised libraries. The FPGA accelerator achieved results 18 times better than the fastest single-core, and 4.5 better than the fastest four-core software implementation.
