Solving tri-diagonal linear systems using field programmable gate arrays by Warne, David et al.
This is the author’s version of a work that was submitted/accepted for publication in the following source:
Warne, David, Kelson, Neil A., & Hayward, Ross F. (2012) Solving tri-diagonal linear systems using field programmable gate arrays. In 4th International Conference on Computational Methods (ICCM 2012), 25-28 November 2012, Crowne Plaza, Gold Coast, QLD. (Unpublished)
This file was downloaded from: http://eprints.qut.edu.au/55985/
© Copyright 2012. Please consult the authors.
Notice: Changes introduced as a result of publishing processes such as
copy-editing and formatting may not be reflected in this document. For a
definitive version of this work, please refer to the published source:
Queensland University of Technology 
CRICOS No. 00213J 
 
 Solving Tri-Diagonal Linear Systems 
using Field Programmable Gate Arrays  
D. J. Warne (1,2), N. A. Kelson (1), R. F. Hayward (2)

(1) High Performance Computing and Research Support, QUT
(2) School of Electrical Engineering and Computer Science, QUT
CRICOS No. 00213J a university for the real world®
Outline 
• Background 
− Reconfigurable Computing 
− Field Programmable Gate Arrays 
− Tri-diagonal linear Systems 
− Tri-diagonal Matrix Algorithm 
• Hardware Design 
− TDMA modules 
− Solver Pipeline 
• Results 
− Simulation 
− Implementation in Hardware 
− Implementation Vs Simulation 
• Conclusion 
− Future Work 
− Summary 
 
Background 
Reconfigurable Computing (RC) 
• “Blank Slate” Computing. 
− User can control data path at runtime. 
− Device can be reconfigured “on-the-fly”. 
• But Why? 
− Allows for algorithm-specific processors to be developed. 
− Improved power efficiency is possible (FLOPS/Watt).
Reconfigurable Computing (cont...) 
• Novo-G (NSF/CHREC): 
− 96  Altera Stratix-IV E530 FPGAs. 
− 192 Altera Stratix-III E260 FPGAs. 
• Maxwell (FHPCA): 
− 32 Xilinx Virtex-4 LX160 FPGAs. 
− 32 Xilinx Virtex-4 FX100 FPGAs. 
• Specialist FPGA co-processor boards. 
− MPRACE,  RAPTOR, Nallatech, RASC 
Field Programmable Gate Arrays (FPGA) 
• Common device used for RC.
− Array of configurable logic blocks.
    • Look-up tables.
    • Multiplexers.
− Distributed Block RAM.
− Programmable interconnects.
• Potential for massive-scale parallelism.
Using FPGAs as Co-processors 
• From software to hardware.
− Take a compute-intensive operation.
− Design a circuit which partially or completely implements the operation.
− Define a data transfer model between CPU and FPGA.
− Configure at runtime to obtain a special-purpose accelerator card.
• Nice in theory, but a change in paradigm for the software engineer.
A Tri-diagonal Linear System Solver for FPGAs
• Tri-diagonal systems are ubiquitous in scientific applications.
• Tri-diagonal systems are simple to solve.
− Good for a first attempt at co-processing with FPGAs.
− No general purpose FPGA tri-diagonal solver designs available in the literature.
• Data transfer overhead stays small even for large matrices.
− Only three diagonals + the RHS vector need to be sent: O(N) values rather than the O(N^2) of a dense matrix.
Tri-diagonal Linear Systems 
• A tri-diagonal linear system has the form,

\[
\begin{aligned}
a_{11} x_1 + a_{12} x_2 &= b_1 \\
a_{i(i-1)} x_{i-1} + a_{ii} x_i + a_{i(i+1)} x_{i+1} &= b_i, \quad i \in [2, 3, \dots, n-1] \\
a_{n(n-1)} x_{n-1} + a_{nn} x_n &= b_n
\end{aligned}
\]

• In the equivalent matrix equation,

\[
A x = b
\]

• A is banded with a bandwidth of 1.

\[
a_{ij} = 0, \quad \forall\, i, j : |i - j| > 1
\]
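As a concrete illustration (a sketch added for clarity, not taken from the slides), for n = 4 the condition that a_ij vanishes whenever |i − j| > 1 gives the following zero pattern:

\[
A =
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
\]

Only the 3n − 2 = 10 entries on the three central diagonals can be non-zero.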
Tri-diagonal Linear Systems (cont...)
• Only need to store the main diagonal, principal sub-diagonal, and principal super-diagonal.

\[
A \;\rightarrow\;
\begin{cases}
U = [\,0,\ a_{12},\ a_{23},\ \dots,\ a_{(n-1)n}\,] \\
D = [\,a_{11},\ a_{22},\ a_{33},\ \dots,\ a_{nn}\,] \\
L = [\,a_{21},\ a_{32},\ \dots,\ a_{n(n-1)},\ 0\,]
\end{cases}
\]

• This structure simplifies the LU-decomposition and forward/backward substitution.
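The storage scheme can be sketched in Python as follows. This is a minimal illustration, assuming the matrix arrives as a dense list of rows; the function name `pack_tridiagonal` is hypothetical, and 0-based indexing replaces the 1-based indexing used on the slides. It follows the padding convention of the slide, with a leading zero in U and a trailing zero in L so that all three arrays have length n.

```python
def pack_tridiagonal(A):
    """Pack a dense n x n tri-diagonal matrix into three length-n arrays.

    Hypothetical helper: U holds the super-diagonal with a leading 0,
    D the main diagonal, L the sub-diagonal with a trailing 0.
    """
    n = len(A)
    # Main diagonal a_11 .. a_nn
    D = [A[i][i] for i in range(n)]
    # Super-diagonal a_12 .. a_(n-1)n, padded at the front
    U = [0.0] + [A[i][i + 1] for i in range(n - 1)]
    # Sub-diagonal a_21 .. a_n(n-1), padded at the back
    L = [A[i + 1][i] for i in range(n - 1)] + [0.0]
    return U, D, L


# Usage: a 3 x 3 example matrix (values chosen for illustration only)
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
U, D, L = pack_tridiagonal(A)
```

With this layout the super-diagonal entry coupling rows i−1 and i sits at the same index as row i, which is what lets the decomposition read one row of (U, D, L) at a time.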
Tri-diagonal Matrix Algorithm (TDMA) 
LU-decomposition

    L[1] ← L[1] / D[1]
    for i in 2 to n
        D[i] ← D[i] − L[i−1] U[i]
        L[i] ← L[i] / D[i]
    end

Forward/Backward Substitution

    x ← b
    for i in 2 to n
        x[i] ← x[i] − L[i−1] x[i−1]
    end
    x[n] ← x[n] / D[n]
    for i in n−1 to 1
        x[i] ← x[i] − x[i+1] U[i+1]
        x[i] ← x[i] / D[i]
    end
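For reference, the two loops can be written as a short software routine. The following Python is a sketch of the sequential TDMA under the storage convention of the previous slide (U padded with a leading zero, L with a trailing zero), using 0-based indices; it is not the hardware design. The name `tdma_solve` is hypothetical, and the sketch assumes no pivoting is needed (for example, a diagonally dominant matrix), so every D[i] stays non-zero.

```python
def tdma_solve(U, D, L, b):
    """Solve a tri-diagonal system with the TDMA (no pivoting).

    U, D, L are the padded super-, main- and sub-diagonals; b is the RHS.
    Inputs are copied so the caller's arrays are left untouched.
    """
    n = len(D)
    U, D, L, x = list(U), list(D), list(L), list(b)

    # LU-decomposition: L becomes the multipliers, D the pivots
    L[0] = L[0] / D[0]
    for i in range(1, n):
        D[i] = D[i] - L[i - 1] * U[i]
        L[i] = L[i] / D[i]

    # Forward substitution
    for i in range(1, n):
        x[i] = x[i] - L[i - 1] * x[i - 1]

    # Backward substitution
    x[n - 1] = x[n - 1] / D[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (x[i] - x[i + 1] * U[i + 1]) / D[i]
    return x


# Usage: the system below was built so that its solution is (1, 2, 3)
x = tdma_solve(U=[0.0, -1.0, -1.0],
               D=[2.0, 2.0, 2.0],
               L=[-1.0, -1.0, 0.0],
               b=[0.0, 0.0, 4.0])
```

Each loop touches every row exactly once, which is why the cycle counts on the later slides scale linearly in the number of equations.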
  
Parallel Methods For Solving Tri-diagonal Systems
• The TDMA is the best-performing sequential method.
• Other methods can improve performance on parallel architectures.
− Two-way Elimination.
− Cyclic Reduction.
• These are beyond the scope of this project.
− Simple first attempt.
 
Hardware Design 
TDMA Hardware Modules 
[Figure: block diagrams of the two hardware modules. Left: LU-decomposition + Forward Substitution. Right: Backward Substitution.]
Solver Pipeline 
• Backward substitution depends on the forward loop completing.
• Number of cycles required to process a single system of N equations is:

\[
c = 2N
\]

• Naively, if we want to solve M systems of N equations then the total number of cycles is:

\[
c_{total} = 2MN
\]

• But here one loop is always idle.
Solver Pipeline (cont...) 
• We can exploit parallelism available on FPGAs.

\[
c_{total} = N + NM
\]

• We can begin factorising the next matrix while backward substituting the current matrix.
• Nearly halves the number of cycles.
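The "nearly halves" claim can be checked with a short calculation, taking the naive count 2MN and the pipelined count N + NM from these slides. The labels c_naive and c_pipelined are introduced here only for the comparison:

\[
\frac{c_{pipelined}}{c_{naive}} = \frac{N + NM}{2MN} = \frac{M + 1}{2M} = \frac{1}{2} + \frac{1}{2M}
\]

For an illustrative M = 100 systems (a value chosen here, not from the slides) the ratio is 0.505, and it approaches 1/2 as M grows. The extra N cycles are the fill time of the pipeline: the first backward substitution cannot start until the first forward pass has finished.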
Solver Pipeline (cont...) 
• Two register banks
− If Factorise/FSub is routed to Bank 0 then Bank 1 is routed to BSub, and vice versa.
− The switch is triggered by an “end-of-system” symbol.
• One row is input per cycle.
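The bank switching can be modelled in software to see where the N + NM cycle count comes from. The Python below is a rough scheduling sketch, not the hardware design: it assumes each stage takes exactly N cycles per system (one row per cycle) and that the banks swap once per system. The function name `pingpong_schedule` and the tuple layout are hypothetical.

```python
def pingpong_schedule(num_systems, n):
    """Model a two-bank pipeline of Factorise/FSub followed by BSub.

    Returns (total_cycles, schedule), where each schedule entry is
    (start_cycle, fwd_system, fwd_bank, bwd_system, bwd_bank) and
    None marks an idle stage.
    """
    schedule = []
    # One slot of n cycles per system, plus one final slot to drain BSub
    for slot in range(num_systems + 1):
        fwd = slot if slot < num_systems else None   # system being factorised
        bwd = slot - 1 if slot >= 1 else None        # system being back-substituted
        fwd_bank = fwd % 2 if fwd is not None else None
        bwd_bank = bwd % 2 if bwd is not None else None
        # The two stages must never share a register bank
        assert fwd_bank is None or fwd_bank != bwd_bank
        schedule.append((slot * n, fwd, fwd_bank, bwd, bwd_bank))
    total_cycles = len(schedule) * n
    return total_cycles, schedule


# Usage: 3 systems of 4 equations (illustrative sizes)
total, schedule = pingpong_schedule(num_systems=3, n=4)
for entry in schedule:
    print(entry)
```

In this model the total is (M + 1) slots of N cycles, matching N + NM, while the naive approach would spend two slots per system.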
 
Results 
Simulation Results 
Performance (CPU = Intel X5650)

Compiler and flags    Runtime     Speedup (FPGA simulation vs. this run)
gcc -O0               142 ms      212x
gcc -O2               48 ms       72x
gcc -O3               50 ms       75x
icc -O0               375 ms      560x
icc -O2               7 ms        10x
icc -O3               3 ms        4.5x
FPGA simulation       0.671 ms    1x

Accuracy Vs IEEE-754 32-bit

Floating Point Operation    Mean Relative Error    Maximum Relative Error
+                           3.01e-08               2.91e-07
-                           2.40e-08               4.36e-07
×                           4.44e-08               1.12e-07
/                           3.33e-08               6.21e-07
TDMA                        1.79e-05               1.79e-04
Implementation in Hardware 
• Target system was the SGI RC100 blade.
− Operating within an Altix 4700 server.
− Two Xilinx Virtex-4 LX200 FPGAs.
− Five 8 MB QDR SRAM DIMMs per FPGA.
− Each cycle, each FPGA may read and write 128 bits from/to SRAM.
 
 
Implementation Vs Simulation 
• Real runtime 590 ms!
− This is surely not just data transfer.
− Slower than all CPU runs.
• Data transfer on RC100
− Configuration can significantly affect performance:
    • Direct I/O.
    • Buffered I/O.
    • Streaming DMA.
− Can’t diagnose as our Altix 4700 is decommissioned.
• We are certain that runtimes closer to the simulated time are achievable.
Conclusion 
Future Work 
• Re-implement memory interface to be compatible with an alternate system.
− Nallatech PCIe-280 Virtex-5 board.
• Better exploit fine grain parallelism.
− Two-Way Elimination?
− Cyclic reduction?
− Block TDMA?
• Compare with other accelerator technology.
− GPUs.
− Intel Xeon Phi (Many Integrated Core).
Summary 
• RC with FPGAs is a promising method for low power HPC.
• Initial experimentation has been performed with a TDMA design for FPGAs.
• Many improvements remain to be made, but it was a useful exercise to bootstrap the discovery process.
• Future work will extend the design to other platforms and improve performance.
Thank-you! 
