Modular exponentiation of matrices on FPGA-s by Herendi, T. & Major, R.
ar
X
iv
:1
11
1.
36
68
v1
  [
cs
.D
S]
  1
5 N
ov
 20
11
Acta Univ. Sapientiae, Informatica, 3, 2 (2011) 172–191
Modular exponentiation of matrices on
FPGA-s
Tama´s HERENDI
University of Debrecen
email: herendi.tamas@inf.unideb.hu
Roland Sa´ndor MAJOR
University of Debrecen
email: mroland@digikabel.hu
Abstract. We describe an efficient FPGA implementation for the expo-
nentiation of large matrices. The research is related to an algorithm for
constructing uniformly distributed linear recurring sequences. The design
utilizes the special properties of both the FPGA and the used matrices to
achieve a very significant speedup compared to traditional architectures.
1 Introduction
Field-programmable gate arrays (FPGA) offer a number of special options in
computation. Utilizing the unique properties of an FPGA, some algorithms
that are impractical to implement on a more traditional architecture can be-
come both convenient to create and resource-efficient. The programmable ar-
ray of look-up tables commonly found on an FPGA provide both flexibility
in creating logic to suit specific needs and naturally lend themselves to great
parallelism in computations.
Fast operations on matrices are of great practical interest. Ways to speed
up certain matrix calculations still find their way into numerous applications.
Faster implementations of matrix algorithms can be achieved either from a
“software” point of view, by improving upon the algorithm itself, or from a
“hardware” point of view, by using faster or differently structured architec-
tures.
Computing Classification System 1998: B.2.4
Mathematics Subject Classification 2010: 65F60 11Y55
Key words and phrases: pseudo random number generators, linear recurring sequences,
uniform distribution, matrix exponentiation, parallel arithmetic, FPGA design, hardware
acceleration of computations, hardware implementation of computations
172
Modular exponentiation of matrices on FPGA-s 173
Theoretical improvements on matrix algorithms include Strassen’s algo-
rithm [12] and the Coppersmith-Winograd algorithm [2]. The naive algorithm
for matrix multiplication is a well-known Θ(n3) algorithm. Strassen’s algo-
rithm uses an idea similar to the Karatsuba-multiplication. It has a time
complexity of O(nlg 7) by dividing the matrices into sub-matrices. Then by
multiplying them in a different arrangement, it manages an overall lower multi-
plication count compared to the classical algorithm. Research implementing it
on the Cell Broadband Engine can be found in [5]. Strassen’s algorithm and its
applicability to the project is briefly discussed in Section 7. The Coppersmith-
Winograd algorithm further improves the complexity to O(n2.376) by combin-
ing the idea of Strassen with the Salem-Spencer theorem. [9] discusses and
compares the performance of implementations of these algorithms.
Numerous research has been done on creating efficient realizations of differ-
ent matrix operations on different architectures. [8] and [10] both use FPGAs
to perform matrix inversion.
The design presented here is an implementation of matrix multiplication
on an FPGA. Works of similar nature can be found in [1] and [4], dealing
with FPGA configurations used for floating point matrix multiplication. [11]
uses an FPGA design for digital signal processing. [3] discusses another FPGA
implementation for accelerating matrix multiplication.
The research in this paper is related to an algorithm for the construction of
pseudo random number generators. It requires the exponentiation of large ma-
trices to an extremely high power. This allows for numerous optimizations to
be made on the FPGA implementation, resulting in an extremely fast design.
A speedup factor of ∼200 is achieved compared to a highly optimized program
on a more traditional architecture.
We give the details of a design implemented on a Virtex-5 XC5VLX110T
FPGA that multiplies two 896× 896 sized matrices. The matrices are defined
over the mod 4 residue class ring. Using this property and the fact that the
hardware uses 6-LUTs (Lookup Tables), we describe first a module that com-
putes the dot product of vectors taken from Z284 in a single clock cycle at
100MHz clock speed. With these modules we construct a matrix multiplier
module that computes the C ∈ Z20×204 product matrix of A ∈ Z
20×28d
4 and
B ∈ Z28d×204 in d clock cycles at 100MHz. The significance of the value 28
in the implementation and its experimental determination is also discussed.
Finally, we describe how to use these modules for multiplying matrices taken
from Z896×8964 . The proposed algorithm deals with the management of stored
data in such a way that it can be accomplished completely in parallel with
the computations. The resulting design completes the multiplication in 64800
174 T. Herendi, R. Major
clock cycles at 100MHz.
Future work for increasing the size of the used matrices, and further opti-
mizing the design’s performance using Strassen’s algorithm is also described.
2 Mathematical background
The present work is initiated by a method for the construction of uniformly
distributed pseudo random number generators. (See [7].) The generator uses
recurring sequences modulo powers of 2 of the form
un ≡ ad−1un−1 + ad−2un−2 + · · · + a0un−d mod 2
s, ai ∈ {0, 1, 2, 3}, s ∈ Z
+
The theoretical background can be found in [6].
The construction assumes that the values a0, a1, . . . , ad−1 are such that
xd − ad−1x
d−1 − · · · − a0 ≡ (x− 1)
2P(x) mod 2
holds for some P(x) irreducible polynomial. It is practical to choose P(x) to
have maximal order, since the order of P is closely related to the period length
of the corresponding recurring sequence. The sequence un obtained this way
does not necessarily have uniform distribution, however exactly one of the
following four sequences does:
u
(0)
n ≡ ad−1u
(0)
n−1 + ad−2u
(0)
n−2 + · · · + a1u
(0)
n−d+1 + a0u
(0)
n−d mod 2
s
u
(1)
n ≡ ad−1u
(1)
n−1 + ad−2u
(1)
n−2 + · · · + a1u
(1)
n−d+1 + (a0 + 2)u
(1)
n−d mod 2
s
u
(2)
n ≡ ad−1u
(2)
n−1 + ad−2u
(2)
n−2 + · · · + (a1 + 2)u
(2)
n−d+1 + a0u
(2)
n−d mod 2
s
u
(3)
n ≡ ad−1u
(3)
n−1 + ad−2u
(3)
n−2 + · · · + (a1 + 2)u
(3)
n−d+1 + (a0 + 2)u
(3)
n−d mod 2
s .
For the details see [7]. Finding the sequence with uniform distribution is of
interest. Let
M(u) =


0 1 . . . 0 0
...
...
. . .
...
...
0 0 . . . 1 0
0 0 . . . 0 1
a0 a1 . . . ad−2 ad−1


Modular exponentiation of matrices on FPGA-s 175
be the companion matrix of sequence u. To find which of the above se-
quences has a uniform distribution, we have to compute M(u)2
d+1−2 mod 4.
If M(u)2
d+1−2 mod 4 equals the identity matrix, then the period length of un
is 2d+1 − 2, which means it is not the sequence we are searching for.
The exponentiation of matrices to high powers can quickly become time
consuming on traditional computers. The aim of the project was to utilize
the special properties of an FPGA to achieve a significant upgrade in speed
compared to implementations on more traditional architectures.
3 Hardware used in the implementation
The project was implemented on a Xilinx XUPV505-LX110T development
platform. The board features a variety of ports for communication with the
device. As a first approach the RS-232 serial port was used to send data
between the board and a PC. A high-speed PCI Express connection is also
available if the amount of data transferred would necessitate its use.
The board’s most prominent feature is the Virtex-5 XC5VLX110T FPGA.
The FPGA’s main tool for computation is the array of 6-input look-up tables,
arranged into 17280 Slices, with four look-up tables found in each Slice, adding
up to a total of 69120 LUTs. A single 6-input LUT can store 64 bits of data,
where its six input bits are used as an address to identify the single bit of data
that is to be outputted. By manipulating the 64 bit content of the look-up
table, it can be configured to carry out arbitrary Boolean functions with at
most six input bits. In our design they are used to create LUTs performing
a multiply-accumulate function, which are hierarchically arranged into larger
and more complex modules. One out of four LUTs on the device can also be
used as a 32 bit deep shift register; these are the basis to implement containers
storing the data, which is directly fed to the computational module.
Attached to the board, there is a 256MB DDR2 SODIMM module, which
is used for storing data exceeding the amount that can be practically stored
on the FPGA.
4 Structure of modules used in the computation
The basic elements of the design are the LUTs denoted by L(a, b, s) = c, where
a, b, c and s are two-digit binary numbers. The function carried out by L is a
multiply-accumulate (for short: MA) function, i.e.:
c ≡ (a · b) + s mod 4 .
176 T. Herendi, R. Major
Let a = 2α1 + α0, b = 2β1 + β0, s = 2σ1 + σ0, c = 2γ1 + γ0, where
α0, α1, β0, β1, σ0, σ1, γ0, γ1 ∈ {0, 1}, and L = (l1, l0) where l1 and l0 are two
single bit LUTs, according to the following:
• l0(α0, β0, σ0) = γ0
• l1(a, b, s) = γ1
Figure 1: The structure of L(a, b, s)
We remark that while l0 needs only three input bits to accomplish its func-
tion, l1 requires all six bits of input.
The LUTs l0 and l1 were configured to the values shown in Table 1 and
Table 2 to perform the multiply-accumulate function.
P
P
P
P
P
P
P
P
(α0, β0)
σ0 0 1
(0,0) 0 1
(0,1) 0 1
(1,0) 0 1
(1,1) 1 0
Table 1: Contents of l0
With the help of these basic units one can compute the dot product w of
two vectors u = (u0, u1, . . . , un−1) and v = (v0, v1, . . . , vn−1). Let us define a
modulem = (L[0], L[1], . . . , L[n−1]) by cascading nMA units denoted by L[i].
In this module m we use the output of a given MA unit as the sum input of
the next unit, i.e. si+1 = ci for i = 0, 1, . . . , n − 2, where si and ci are the s
input and c output of L[i].
Therefor m is a function that accepts a pair of vectors u, v of two-digit
numbers of length n and outputs on cn−1 the two-digit dot-product of the two
vectors, i.e. m(u, v) = w.
Modular exponentiation of matrices on FPGA-s 177
P
P
P
P
P
P
P
P
(a, b)
s
0 1 2 3
(0,0) 0 0 1 1
(0,1) 0 0 1 1
(0,2) 0 0 1 1
(0,3) 0 0 1 1
(1,0) 0 0 1 1
(1,1) 0 1 1 0
(1,2) 1 1 0 0
(1,3) 1 0 0 1
(2,0) 0 0 1 1
(2,1) 1 1 0 0
(2,2) 0 0 1 1
(2,3) 1 1 0 0
(3,0) 0 0 1 1
(3,1) 1 0 0 1
(3,2) 1 1 0 0
(3,3) 0 1 1 0
Table 2: Contents of l1
Figure 2: The structure of m(u, v)
In total, the number of LUTs used inm is 2n. Note that vectors of arbitrary
length can be used in the computation if we connect the output of module m
to the sum input of L[0] (cn−1 = s0), and then iteratively shift u and v onto
the module’s input by n elements at a time:
178 T. Herendi, R. Major
Function iterated m(u, v) // k = length(u) = length(v)
1. Define κ = ⌈ k
n
⌉, v ′, u ′ ∈ Zκ·n4
2. for i = 0 to κ · n − 1 do // fill v and u with 0’s
3. if i < k then v ′i = vi else v
′
i = 0
4. if i < k then u ′i = ui else u
′
i = 0
5. end for
6. Define vtemp, utemp,w, let w = 0
7. for i = 0 to κ− 1 do // shift v ′ and u ′ to vtemp and utemp
8. vtemp = (v
′
i·n, v
′
1+(i·n), . . . , v
′
n−1+(i·n))
9. utemp = (u
′
i·n, u
′
1+(i·n), . . . , u
′
n−1+(i·n))
10. w = w+m(vtemp, utemp)
11. end for
12. return w
end Function
Here u ′ and v ′ are the extensions of u and v by 0’s.
We shall see that the number chosen for n is critical in setting many char-
acteristics of the entire project. The experiment used for determining n will
be discussed in the following chapter.
Our aim is to obtain a module that performs the matrix multiplication of
A,B ∈ Zk×k4 , where Z4 is the mod 4 residue class ring. In the following, let
C ∈ Zk×k4 be the output matrix, such that C = A× B. Furthermore, let ai be
the ith row of matrix A and let bj be the jth column of matrix B.
The multiplier units denoted by m are used to create more complex mod-
ules in a hierarchical manner. First, by taking ten m multiplier blocks we
create a row of multipliers R = (m0,m1, . . . ,m9). This is used to compute ten
consecutive elements of a single row of the output matrix:
R(ai, bj, bj+1, . . . , bj+9) = (ci,j, ci,j+1, . . . , ci,j+9) ,
where ci,j = ai ·bj. The input vector ai is used by all ten multiplier units of R.
The length of these vectors, as mentioned above, can be arbitrary, but vectors
of length greater than n will need to be iteratively shifted to the input of R.
By taking ten row multipliers we can create a unitM10×10 = (R0, R1, . . . , R9)
which outputs a 10× 10 sub-matrix of C:
Modular exponentiation of matrices on FPGA-s 179
M10×10(ai, ai+1, . . . , ai+9, bj, bj+1, . . . , bj+9) =


ci,j ci,j+1 · · · ci,j+9
ci+1,j ci+1,j+1 · · · ci+1,j+9
...
. . .
ci+9,j ci+9,j+1 · · · ci+9,j+9

 .
Finally, four such units are arranged so that a 20 × 20 sub-matrix of C could
be obtained as output:
M20×20(ai, ai+1, . . . , ai+19, bj, bj+1, . . . , bj+19) =


ci,j ci,j+1 · · · ci,j+19
ci+1,j ci+1,j+1 · · · ci+1,j+19
...
. . .
ci+19,j ci+19,j+1 · · · ci+19,j+19

 .
The M20×20’s inputs are twenty vectors from both matrices A and B. Because
of hardware constraints — in particular the number of LUTs on the used device
— a larger arrangement of multipliers would be impractical to implement. The
module M20×20 is comprised of 400 m multiplier units. Figure 3 shows the
hierarchy of units used to build M20×20.
Figure 3: The structure of M20×20
The M20×20 unit can be used iteratively to multiply matrices of arbitrary
size, producing 20×20 sub-matrices of the output matrix C with each iteration.
After inputting twenty rows from matrix A and twenty columns from matrix
B and obtaining the desired output, we can simply repeat the process for a
180 T. Herendi, R. Major
set of rows and columns of A and B respectively, until we obtain the entire
output matrix C:
Function large matrix mult(A,B)
1. Define κ = ⌈ k
20
⌉, A ′, B ′, C ′ ∈ Zκ·n×κ·n4
2. for i = 0 to 20κ− 1 do
3. for j = 0 to 20κ− 1 do
4. if i < k and j < k a ′ij = aij else a
′
ij = 0
5. if i < k and j < k b ′ij = bij else b
′
ij = 0
6. end for end for
7. for i = 0 to κ− 1 do
8. for j = 0 to κ− 1 do
9. C ′[i,ji+19,j+19] = M20×20(ai, ai+1, . . . , ai+19, bj, bj+1, . . . , bj+19)
10. end for end for
11. return C ′[0,0k−1,k−1]
end Function
Here
C ′[
i,j
k,l] =


c ′i,j c
′
i,j+1 · · · c
′
i,l
c ′i+1,j c
′
i+1,j+1 · · · c
′
i+1,l
...
. . .
c ′k,j c
′
k,j+1 · · · c
′
k,l

 .
Note that in the naive algorithm large matrix mult(A,B), during the main
loop (lines 7-10), for each twenty rows read from A, the entire matrix B is read.
During the whole procedure, matrix A will be read entirely exactly once, while
matrix B will be read κ times. Methods improving on this number are described
in section 6.
Since for almost all practical cases the size k of matrices A,B ∈ Zk×k4 will be
greater than the parameter n, the vectors taken from these matrices will need
to be iteratively shifted onto the input of the multiplier M20×20, n elements
at a time. Therefore, an efficient way to both store and then use the vectors
taken from the matrices is the creation of FIFO type containers made of shift
registers.
Let tdn be a shift register of width n and depth d. It means that t
d
n can store
at most d vectors of length n, or equivalently a single vector of length at most
nd. We choose d such that nd ≥ k, thus it can store one row or column from
the input matrices A or B. Let the vector filling tdn be f = (f0, f1, . . . , fd−1),
Modular exponentiation of matrices on FPGA-s 181
where fi ∈ Z
n
4 , i = 0, 1, . . . , d − 1. In practice, t
d
n is a queue data structure.
In a single step, tdn outputs a vector of length n and shifts its content by n
places. For the ith activation, the container will output fi. After d activations,
the container becomes empty.
One container tdn is used to store a single row or column of matrices A
or B respectively. Connecting twenty of them in parallel, denoted by Td20n =
(tdn[0], t
d
n[1], . . . , t
d
n[19]), we obtain a container that stores twenty rows or column-
s. This is exactly the amount of data the M20×20 multiplier structure requires
as input in d iteration steps. After d activations Td20n has shifted all its stored
data to M20×20, broken up into pieces of length n for each activation. Two
such Td20n containers are connected to M20×20, one for the rows taken from
matrix A and one for the columns taken from matrix B.
Figure 4: The structure of Td20n
Using M20×20 and T
d
20n in a proper structure, we can execute one iteration
cycle of the computation. After filling one Td20n container with the desired
twenty rows from matrix A and one Td20n container with the desired twenty
columns from matrix B, we simply send d activation signals to the containers.
This will shift the data onto M20×20, which computes the 20 × 20 product
matrix in the way described in function iterated m(u, v). The number of
steps in one iteration cycle is d.
5 Experimental determination of parameters
Now, we turn to the determination of n (how many MA modules should be
connected into a single multiplier m). This sets the length of the vectors that
we use in the computation in a single step and thus has an effect on many other
technical parameters of the design. The goal was to find the greatest number
182 T. Herendi, R. Major
such that the multiplier would still reliably produce the correct dot product
in a single clock cycle. Clearly, this number dependents on the used hardware
and the clock frequency. For the device used, the chosen clock frequency was
100 MHz, the default frequency provided by the board.
The following experiment was devised to determine the value of n:
Let S be a multiplier m, called the “Subject”, and let E0, E1, . . . , E9 be ten
more m multipliers, called the “Examiners”. Informally, the Examiners’ duty
was to verify the answers given by the Subject to questions they already knew
the answer to. The “questions” here are test data: two vectors v, u of length n
generated by the following sequence to obtain suitable pseudo-random values:
Di = Di−1 +Di−2 + 2Di−4 +Di−5,
where D0 = D1 = D2 = D3 = 0, D4 = 1.
More formally, let p ∈ Z10 be a counter that cycles between values 0, 1, . . . , 9,
incrementing its value by one with each clock cycle, and returning to value
0 after 9. For each clock cycle during the experiment, the following happens
depending on the value of p:
• The output of S is checked for equality with the output of Ep. If inequality
is detected, then an error is noted.
• The test data Ep+1 is currently working on is given to S.
• New test data is given to Ep−1.
Procedure testing
1. Let S, E0, E1, . . . , E9 be m multipliers
2. Let D be the test data generator
3. Let i ∈ N, p ∈ Z10
4. forever do
5. i = i+ 1
6. p ≡ i mod 10
7. if Sout 6= Ep out then return ERROR
8. Ep−1 in ← D(i)
9. Sin ← Ep+1 in
10. end forever
end Procedure
Note that the output of S is checked every clock cycle, which yields that
S has only a single cycle to calculate its answer to the question it was given
Modular exponentiation of matrices on FPGA-s 183
Figure 5: Activity of testing module when counter’s value is p
in the preceding clock cycle. A given Examiner, however, has ten times more
time to work on its test data. Once in every ten clock cycles, new data is given
to the Examiner to work on, and its output is only checked nine clock cycles
later, just before it is given new input again. This way the Examiners have
enough time to compute the correct answer to the question by the time it is
needed.
As the initial value for n, we have chosen 16, a number small enough to
be reasonably expected to pass the criteria set for n, but large enough to
be of interest. If the experiment reported no error, meaning the Subject was
flawlessly able to calculate the dot product for a sufficiently long time, then
the value of n was increased and the experiment repeated. After the first
error was encountered, meaning the Subject was not able to keep up with the
calculations, the largest value was chosen for n for which there were no errors.
On the used device, the largest such value was found to be 28 at a clock
speed of 100 MHz and setting the length of m multipliers to n = 28 were able
to work error-free for days without interruption.
184 T. Herendi, R. Major
6 Computation of large matrices
In the Section 4 we gave an algorithm for using the described modules for
computing the product of large matrices. Following the description, the im-
plemented design would make use of the parallelism offered by the FPGA
only in the computation of dot products. Making further use of parallel oper-
ations, the design’s performance can be significantly improved. In this section
we describe the implementation choices made to raise the overall performance.
The biggest factor to consider is the management of data. When comput-
ing the product of large matrices, the amount of data to store and to move
between the computation modules can easily exceed the size which can be
practically stored on the FPGA. Fortunately, as mentioned before, a 256MB
DDR2 SODIMM is connected to the board as the main data storage device.
A module is generated using the Memory Interface Generator v3.5 intellectual
property core provided by Xilinx to implement the logic needed to communi-
cate with the DDR2 RAM. The module is structured hierarchically, connecting
the memory device to a user interface. All communication with the device is
done through two FIFO queues: one queue to send the command and address
signals, while the other queue is used for write data and write data mask (when
masking is allowed).
A naive utilization of the memory would be to simply read the required data
before each iteration of the computation, and writing the output back after
it is finished. An undesirable effect of this approach would be that the design
would spend significantly more time with memory management than with the
actual computation. The desired result would be that memory management
(and all other auxiliary operations) were done during the time interval of the
computation. Note that since both the size of the matrices and the multiplier
module is fixed, the time the multiplication consumes is a fixed constant, which
cannot be lowered. Optimally, the time of the computation should be an upper
bound for the running time of the entire design. The difficulty of reaching this
optimum lies in the high speed of the multiplier modules compared to the
memory module.
One way to resolve the problem caused by slow transmission speed is to
increase the amount of data stored on the FPGA. Informally, the main idea is
to keep enough data in a prepared state, i.e. by the time the multiplier module
finishes all of its computations, we have enough new data to continue working.
More formally, let us define the following quantities:
• Let d be the time necessary to complete one iteration of the computation.
Modular exponentiation of matrices on FPGA-s 185
As described in the previous sections, this is equal to the depth of the
containers Td20n.
• Let κ = ⌈ k20⌉, where k is the size of the matrices. (A,B ∈ Z
k×k
4 ) This
quantity is already used in algorithm large matrix mult. For the rest
of the section, it is practical to think of A and B as κ × κ sized block
matrices, where each element is a 20× 20 matrix.
• Let f(A,B) be an arbitrary algorithm executing matrix multiplication
on A and B, including the memory management needed for the com-
putation. Let K(f) be the number of times the algorithm needs to fill a
Td20n container, i.e. the number of times it has to read twenty rows or
columns from the matrices. Note that completely reading either input
matrices once means filling Td20n containers κ times, since one T
d
20n can
store twenty rows or columns at a time. Algorithm large matrix mult’s
main loop (starting at line 7) reads twenty rows from matrix A (filling a
Td20n once) and reads matrix B entirely for each step. Since the loop has
κ steps, it follows that K(large matrix mult) = κ2 + κ.
• Let δ be the time it takes to fill a Td20n container. This quantity depends
on both the width and depth of the container. The total time f(A,B)
spends on reading from memory to fill the containers is K(f)δ.
• Let Φ(f) be the total time the design has to spend with memory man-
agement. This is the sum of the time it spends on reading matrices A
and B from the memory and the time it spends on writing the product
matrix C into the memory. The number of times f has to read A and
B from the memory depends on f. Note that since the size of the total
output matrix C is the same as the size of A and B, writing C into the
memory takes time equal to reading either matrices once from the mem-
ory. In other words, it takes κδ time. The total time the design has to
spend with memory management is Φ(f) = K(f)δ + κδ.
• Let Γ(f) be the time f(A,B) spends on the computation itself. From the
definition of d and κ it follows that C(large matrix mult) = dκ2.
The goal here is to reduce K(f) in such a way that the data required for the
next iteration of the computation is always ready by the time the previous
iteration ends. If this arrangement is achieved then C(f) becomes the upper
bound for the running time of the design.
186 T. Herendi, R. Major
Storing more data on the FPGA can be done by adding more Td20n containers
to the design. During an iteration only two such containers are used directly.
The rest can be used to load data necessary for the forthcoming iteration steps.
Suppose the design has z + 2 pieces of Td20n containers. We assign z of the
containers to store rows from matrix A, called “row-stores”, and two of them to
store columns from matrix B, called “column-stores”. With this arrangement,
we can carry out z− 1 iterations of the computation, using up the data stored
in z − 1 row-stores and one column-store. This leaves one row-store and one
column-store to load new data into during the computation. Using the above
definitions, the allover computation takes (z − 1)d time.
Figure 6: Configuration of data stored on the FPGA
If we use all z row-stores and one column-store for the computation while
the remaining column-store is devoted to loading new columns into, then we
would have to load all z row-stores with new rows once we read all the columns
before we can continue the computation. This would take zδ time for each case
where we read all the columns but haven’t read all the rows yet, which happens
⌊κz ⌋ times. In total, it would add ⌊
κ
z ⌋zδ to the running time.
Instead, the computation of the output matrix moves slightly diagonally.
See Figure 7. The z − 1 row-stores used in the computations store a total of
(z− 1) · 20 rows. Initially, the row-stores are filled with rows a0 → a(z−1)·20−1.
Modular exponentiation of matrices on FPGA-s 187
New rows are loaded in at a slower pace than columns are. By the time all
columns are read once, the contents of the row-stores have shifted exactly to
the next segment of data needed, the next (z− 1) · 20 rows. After matrix B is
completely read once, the row-stores are filled with rows a
(z−1)2˙0 → a2(z−1)·20−1.
Reading rows and columns proceeds in this manner until we’ve completely
read matrix A once. For this reason, it is practical to choose z such that
(z − 1) | κ. All together we read matrix B κ
z−1
times and matrix A once.
During each z−1 iterations shown in Figure 6, twenty new columns and (z−1)·20κ
new rows are loaded into the column-store and row-store currently unused by
the computation. When the unused row-store is filled with twenty new rows, it
becomes active, to be used in the following iterations. The row-store containing
the rows with the least index becomes inactive in the computation and starts
accepting the new rows read.
Figure 7: Progression of computations through matrix C
Function improved matrix mult(A,B)
1. Define z, κ = ⌈ k
20
⌉, A ′, B ′, C ′ ∈ Zκ·n×κ·n4
2. for i = 0 to 20κ− 1 do
3. for j = 0 to 20κ − 1 do
4. if i < k and j < k then a ′ij = aij else a
′
ij = 0
5. if i < k and j < k then b ′ij = bij else b
′
ij = 0
6. end for end for
188 T. Herendi, R. Major
7. Fill the row-stores with rows a0 → a(z−1)·20−1
8. Fill the column-stores with columns b0 − b19
9. For i = 1 to κ
2
z−1
Do in parallel: |perform z− 1 iterations of the computation
|READ the next 20 columns mod κ · 20
|READ the next (z−1)·20κ rows mod κ · 20
|WRITE the result of the previous z− 1 iterations
11. return C ′[0,0k−1,k−1]
end Function
The possible values for the parameters used in this section depend on the
used hardware.
The size of the matrices used in the implementation are determined by
parameters n = 28 and d = 32. The LUTs on the device that comprise the Td20n
containers can be configured as d = 32 bit deep shift registers. For this reason
the matrices are of size 896 × 896. Rows with length k = 896 are the largest
that can be stored in containers that are one LUT deep, making them any
larger would double the number of LUTs needed for creating a Td20n. Because
of the limited number of LUTs which can be used for storage purposes, z = 10
was chosen. This yields that twelve Td20n containers are defined in the design.
Dealing with matrices larger than k = 896 is part of future work.
For convenience, time quantities are measured in clock cycles at 100MHz,
the clock speed of the M20×20 multiplier.
The value of δ depends on the DDR2 RAM used. The device was used at
200MHz, and has a 64 bit wide physical data bus.
From these values we determine the following parameters:
• κ = ⌈89620 ⌉ = 45,
• K(improved matrix mult) = κz−1κ+ κ =
45
9 · 45 + 45 = 270,
• δ = 140 clock cycles at 100MHz,
• Φ(improved matrix mult) = K(improved matrix mult)δ+κδ = 270·
140 + 45 · 140 = 44100 clock cycles at 100MHz,
• Γ(improved matrix mult) = dκ2 = 32 · 452 = 64800 clock cycles at
100MHz.
Modular exponentiation of matrices on FPGA-s 189
The goal of Γ(improved matrix mult) > Φ(improved matrix mult) is a-
chieved, meaning that the running time of the design is equal to the time used
by the computation.
The speedup provided by the configuration can be shown by comparing its
performance to a similar implementation created on a more traditional archi-
tecture. A highly optimized C++ program was created for a machine using
an Intel E8400 3GHz Dual Core processor with 2GB RAM. The algorithm is
strongly specialized for the task, making use of all available options for in-
creasing performance. It uses 64 bit long variables to perform multiplication
on 16 pairs of two-digit elements at once in parallel on both processor cores.
The running time of the multiplication of matrices of the same size is over
100 ms. The FPGA implementation, as mentioned above, achieves a runtime
of ∼0.6 ms. On average, a speedup factor of 200 is reached using the described
FPGA design.
7 Future work
The future course of research will focus on increasing the size of the used
matrices.
As mentioned in the previous section, simply increasing the depth d of the
Td20n containers would be impractical. Since a single LUT on the device can
only be configured as a 32 bit deep shift register, setting d > 32 would double
the number of LUTs needed for a Td20n, and the design is already using well
over half of the device’s LUTs that can be configured this way (13440 out of
17280, to be exact). Increasing the size of the matrices this way would require
the restructuring of both the multiplier module and the algorithm used for
memory management.
Instead, the currently implemented module can be used as a basic unit for
the multiplication of larger matrices. Then the entries of the large matrices
are 896× 896 blocks.
This also allows for further optimization using Strassen’s algorithm. Suppose
we double the matrix sizes, interpreting them as matrices with four blocks.
Using the classical algorithm, multiplying two 1792×1792 sized matrices would
take eight multiplication of the blocks. Using a divide-and-conquer strategy,
we can exchange one multiplication for a few extra additions.
[
A11 A12
A21 A22
]
·
[
B11 B12
B21 B22
]
=
[
−D2 +D4 +D5 +D6 D1 +D2
D3 +D4 D1 −D3 +D5 −D7
]
,
190 T. Herendi, R. Major
where
D1 = A11(B12 − B22)
D2 = (A11 +A12)B22
D3 = (A21 +A22)B11
D4 = A22(B21 − B11)
D5 = (A11 +A22)(B11 + B22)
D6 = (A12 −A22)(B21 + B22)
D7 = (A11 −A21)(B11 + B12).
This algorithm, with its O(nlg 7) time complexity, could speed up the design on
large matrices. We should note however, that the speed of the extra additions
have to be carefully considered. Since the multiplication is already extremely
fast, a similar improvement may also be necessary for additions if the overall
performance upgrade is to remain significant.
Acknowledgements
Research supported by the TA´MOP 4.2.1/B-09/1/KONV-2010-0007 project
and TARIPAR3 project grant Nr. TECH 08-A2/2-2008-0086.
References
[1] F. Bensaali, A. Amira, R. Sotudeh, Floating-point matrix product on
FPGA, IEEE/ACS International Conference on Computer Systems and
Applications, Amman, Jordan, 2007, pp. 466–473. ⇒173
[2] D. Coppersmith, S. Winograd, Matrix multiplication via arithmetic pro-
gressions, J. Symbolic Comput. 9, 3 (1990) 251–280. ⇒173
[3] N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan, Hardware
acceleration of matrix multiplication on a Xilinx FPGA, MEMOCODE
’07 Proc. 5th IEEE/ACM International Conference on Formal Methods
and Models for Co-Design, Nice, France, 2007, pp. 97–100. ⇒173
Modular exponentiation of matrices on FPGA-s 191
[4] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, G. N. Gaydadjiev, 64-bit
floating-point FPGA matrix multiplication, Proc. 2005 ACM/SIGDA
13th international symposium on Field-programmable gate arrays, Mon-
terey, CA, USA, 2005, pp. 86–95. ⇒173
[5] T. J. Earnest, Strassen’s Algorithm on the Cell Broadband Engine, 2008,
http://mc2.umbc.edu/docs/earnest.pdf ⇒173
[6] T. Herendi, Uniform distribution of linear recurrences modulo prime pow-
ers, J. Finite Fields Appl. 10, 1 (2004) 1–23. ⇒174
[7] T. Herendi, Construction of uniformly distributed linear recurring se-
quences modulo powers of 2 (to appear). ⇒174
[8] A. Irturk, S. Mirzaei, R. Kastner, An Efficient FPGA Implementation of
Scalable Matrix Inversion Core using QR Decomposition, UCSD Techni-
cal Report, CS2009-0938, 2009. ⇒173
[9] B. Kakaradov, Ultra-fast matrix multiplication, An empirical anal-
ysis of highly optimized vector algorithms, Stanford Undergraduate
Research Journal 3 (2004) 33–36. ⇒173
[10] M. Karkooti, J. R. Cavallaro, C. Dick, FPGA implementation of matrix
inversion using QRD-RLS algorithm, Proc. 39th Asilomar Conference on
Signals, Systems, and Computers, Pacific Grove, CA, USA, 2005, pp.
1625–1629. ⇒173
[11] S. M. Qasim, A. A. Telba, A. Y. AlMazroo, FPGA design and imple-
mentation of matrix multiplier architectures for image and signal pro-
cessing applications, IJCSNS International Journal of Computer Science
and Network Security 10, 2 (2010) 168–176. ⇒173
[12] V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969)
354–356. ⇒173
Received: May 17, 2011 • Revised: October 11, 2011
