Efficient Realization of Givens Rotation through Algorithm-Architecture
  Co-design for Acceleration of QR Factorization by Merchant, Farhad et al.
1Efficient Realization of Givens Rotation through
Algorithm-Architecture Co-design for
Acceleration of QR Factorization
Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Senior Member, IEEE, Soumyendu Raha,
S K Nandy, Senior Member, IEEE,, Ranjani Narayan, and Rainer Leupers
Abstract—We present efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better
performance in terms of Gflops/watt over state-of-the-art realizations on multicore, and General Purpose Graphics Processing Units
(GPGPUs). GGR is an improvement over classical Givens Rotation (GR) operation that can annihilate multiple elements of rows and
columns of an input matrix simultaneously. GGR takes 33% lesser multiplications compared to GR. For custom implementation of
GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to pipeline of a
Processing Element (PE). In PE, GGR attains speed-up of 1.1x over Modified Householder Transform (MHT) presented in the
literature. For parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable
Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also
outperforms General Matrix Multiplication (gemm) by 10% in-terms of Gflops/watt which is counter-intuitive.
Index Terms—Parallel computing, orthogonal transforms, dense linear algebra, multiprocessor system-on-chip, instruction level
parallelism
F
1 INTRODUCTION
QR factorization/decomposition (QRF/QRD) is a prevalent oper-
ation encountered in several engineering and scientific operations
ranging from Kalman Filter (KF) to computational finance [1][2].
QR factorization of a non-singular matrix Am×n of size m×n is
given by
A = QR (1)
where Qm×m is an orthogonal and Rm×n is upper triangular ma-
trix. There are mainly three methods to compute QR factorization,
1) Householder Transform (HT), 2) Givens Rotation (GR), and 3)
Modified Gram-Schmidt (MGS). MGS is used in the embedded
systems where numerical accuracy of the final solution is not
critical, while HT is employed in High Performance Computing
(HPC) applications since it is numerically stable operation. GR
is applied in the application domains pertaining to embedded
systems where numerical stability of the end solution is criti-
cal [3]. We sense here an opportunity in GR for applicability
in domains beyond embedded systems. Specifically, we foresee
opportunity in GR in generalization for annihilation of multiple
elements of an input matrix simultaneously where annihilation
regime spans over columns and rows. It is intended to expose
higher parallelism in classical GR through generalization and also
• Farhad Merchant and Rainer Leupers are with Institute for Communi-
cation Technologies and Embedded Systems, RWTH Aachen University,
Germany Email:{farhad.merchant, leupers}@ice.rwth-aachen.de
• Tarun Vatwani and Anupam Chattopadhyay are with School of Computer
Science and Engineering, Nanyang Technological University, Singapore
E-mail: {tvatwani,anupam}@ntu.edu.sg
• Soumyendu Raha and S K nandy are with Indian Institute of Science,
Bangalore
• Ranjani Narayan is with Morphing Machines Pvt. LTd.
Manuscript received October 20, 2016;
reduction in total number computations in GR. Such a general-
ization is possible by combining several Givens sequences and
performing common computations required for updating trailing
matrix beforehand. Surprisingly, such an approach is nowhere
presented in the literature and that has reduced relevance of GR
in several application domains. It has been emphasized in the
literature that GR is more suitable for orthogonal decomposition
of sparse matrices while for orthogonal decomposition of dense
matrices, HT is more suitable over GR [4][5]. For implementations
on multicore and General Purpose Graphics Processing Units
(GPGPUs), a library based approach is followed for Dense Linear
Algebra (DLA) computations where highly tuned packages based
on specifications given in Basic Linear Algebra Subprograms
(BLAS) and Linear Algebra Package (LAPACK) are developed
[6]. Several realizations of QR factorization are discussed in
section 2.3. Typically, block QR factorization routine (xgeqrf -
where x indicates double/single precision) that is part of LAPACK
is realized as a series of BLAS calls. dgeqr2 and dgeqrf operations
are shown in figure 1. In dgeqr2, Double Precision Matrix-vector
Fig. 1: dgeqr2 and dgeqrf Operations
(dgemv) operation is dominant while in dgeqrf, Double Precision
ar
X
iv
:1
80
3.
05
32
0v
2 
 [c
s.D
C]
  2
3 M
ar 
20
18
2Matrix Multiplication (dgemm) is dominant as depicted in the
figure 1. dgemv can attain up to 10% of the theoretical peak
in multicore platforms and 15-20% in GPGPUs while dgemm
can attain up to 40-45% in multicore platforms and 55-60% in
GPGPUs at ≈ 65W and ≈ 260W respectively [7]. Considering
low performance of multicores and GPGPUs for critical DLA
computations, it could be a prudent approach to move away
from traditional BLAS/LAPACK based strategy in software and
accelerate these computations on a customizable platform that
can attain order of magnitude higher performance than state-of-
the-art realizations of these software packages. A special care
has to be taken in designing of an accelerator that is capable
of attaining desired performance while maintaining generality of
the accelerator in also supporting other operations in the domain
of DLA computations. Coarse-grained Reconfigurable Architec-
tures (CGRAs) are a good candidate for the domain of DLA
computations since they are capable of attaining performance of
Application Specific Integrated Circuits (ASICs) while flexibility
of Field Programmable Gate Arrays (FPGAs) [8][9][10]. Recently,
there have been several proposals in the literature in developing
BLAS and LAPACK on custamizable CGRA platforms through
algorithm-architecture co-design where macro operations in the
operations pertaining to DLA computations are identified and re-
alized on a Reconfigurable Data-path (RDP) that is tightly coupled
to the processor pipeline [11][12]. In this paper, we focus on
acceleration of GR based QR factorization, where classical GR is
generalized to achieve Generalized Givens Rotation (GGR) where
GGR has 33% lesser multiplications compared to GR. Several
macro operations in GGR are identified and realized on RDP to
achieve superior performance compared to Modified Householder
Transform (MHT) presented in [7]. Major contributions in this
paper to achieve efficient realization of GR based QR factorization
are as follows:
• We improvise over Column-wise Givens Rotation (CGR)
presented in [13] and present GGR. While CGR is capable
of simultaneous annihilation of multiple elements of a
column in the input matrix, GGR can annihilate multiple
elements of rows and columns simultaneously
• Several macro operations in GGR are identified and im-
plemented on an RDP that is tightly coupled to pipeline
of a Processing Element (PE) resulting in 81% of the
theoretical peak in PE. This implementation outperforms
Modified Householder Transform (MHT) based QR fac-
torization (dgeqr2ht) implementation presented in [7] by
10% in PE. GGR based QR factorization also outperforms
dgemm in PE by 10% which is counter-intuitive
• Arguably, moving away from BLAS for realization of
GGR based QR factorization attains 10% higher perfor-
mance than the classical way of implementation where
Level-3 BLAS is used as a dominant operation in the state-
of-the-art software packages for multicore and GPGPUs.
This claim is validated by several case studies on multicore
and GPGPUs where it is also shown that moving away
from BLAS and LAPACK in these platforms does not
yield performance improvement
• For parallel realization in REDEFINE, we attach PE in
REDEFINE framework where 10% higher performance is
attained over dgeqr2ht implementation presented in [7].
We show that sequential realization in PE and parallel
realization of GGR based QR factorization in REDEFINE
are scalable. Furthermore, it is shown that the speed-up in
parallel realization in REDEFINE over sequential realiza-
tion in PE is commensurate with the hardware resources
employed in REDEFINE and the speed-up asymptotically
approaches theoretical peak of REDEFINE CGRA
For our implementations in PE and REDEFINE, we have
used double precision Floating Point Unit (FPU) presented in
[14] with recommendations presented in [15]. Organization of
the papers is as follows: In section 2, we discuss about CGR,
REDEFINE and some of the FPGA, multicore, and GPGPU based
realizations of QR factorization. Case studies on dgemm, dgeqr2,
dgeqrf, dgeqr2ht, and dgeqrfht are presented in section 3. GGR
and implementation of GGR in multicore, GPGPU, and PE is
discussed in 4. Parallel realization of GGR in REDEFINE CGRA
is discussed in 5. We conclude our work in section 6.
Abbreviations/Nomenclature:
Abbreviation/Name Expansion/Meaning
AVX Advanced Vector Extension
BLAS Basic Linear Algebra Subprograms
CE Compute Element
CFU Custom Function Unit
CGRA Coarse-grained Reconfigurable Architecture
CPI Cycles-per Instruction
CUDA Compute Unified Device Architecture
EREW-PRAM Exclusive-read Exclusive-write Parallel Random Ac-cess Machine
DAG Directed Acyclic Graph
FMA Fused Multiply-Add
FPGA Field Programmable Gate Array
FPS Floating Point Sequencer
FPU Floating Point Unit
GPGPU General Purpose Graphics Processing Unit
GR Givens Rotation
GGR Generalized Givens Rotation
HT Householder Transform
ICC Intel C Compiler
IFORT Intel Fortran Compiler
ILP Instruction Level Parallelism
KF Kalman Filter
LAPACK Linear Algebra Package
MAGMA Matrix Algebra on GPU and Multicore Architectures
MGS Modified Gram-Schmidt
MHT Modified Householder Transform
NoC Network-on-Chip
PE Processing Element
PLASMA Parallel Linear Algebra Software for Multicore Archi-tectures
QUARK Queuing and Runtime for Kernels
RDP Reconfigurable Data-path
XGEMV/xgemv Single/Double Precision General Matrix-vector Multi-plication
XGEMM/xgemm Single/Double Precision General Matrix Multiplication
XGEQR2/dgeqr2 Single/Double Precision QR Factorization based onHouseholder Transform (with XGEMV)
XGEQRF/xgeqrf Blocked Single/Double Precision QR Factorizationbased on Householder Transform (with XGEMM)
XGEQR2HT/xgeqr2ht Single/Double Precision QR Factorization based onModified Householder Transform
XGEQR2GGR/
xgeqr2ggr
Single/Double Precision QR Factorization based on
Generalized Givens Rotation
XGEQRFHT/xgeqrfht
Blocked Single/Double Precision QR Factorization
based on Modified Householder Transform (with
XGEMM)
XGEQRFGGR/ xge-
qrfggr
Blocked Single/Double Precision QR Factorization
based on Generalized Givens Rotation (with XGEMM)
PACKAGE ROUTINE/
package routine
Naming convention followed for different routines per-
taining to different packages. E.g., BLAS DGEMM/
blas dgemm is a Double Precision General Matrix
Multiplication Routine in BLAS
2 BACKGROUND AND RELATED WORK
CGR presented in [13] is discussed in section 2.1 and REDEFINE
CGRA is discussed in section 2.2. A detailed review of yesteryear
realizations of QR factorization is presented in section 2.3.
32.1 Givens Rotation based QR Factorization
For a 4 × 4 matrix X = xij , xij ∈ R4×4, applying 3 Givens
sequences simultaneously yields to the matrix GX shown in
equation 2.
GX =
p3
x11x12+s11
p3
x11x13+s12
p3
x11x14+s13
p3
0 x11s11p3p2 −
x12p2
p3
x11s12
p3p2
− x13p2p3 x11s13p3p2 −
x14p2
p3
0 x21s21p2p1 −
x22p1
p2
x21s22
p2p1
− x23p1p2 x21s23p2p1 −
x24p1
p2
0 x31x42p1 − x41x32p1 x31x43p1 − x41x33p1 x31x44p1 − x41x34p1

=

p3
x11x12+s11
p3
x11x13+s12
p3
x11x14+s13
p3
0 k1s11 − x12l1 k1s12 − x13l1 k1s13 − x14l1
0 k2s21 − x22l2 k2s22 − x23l2 k2s23 − x24l2
0 cx42 − x32s cx43 − x33s cx44 − x34s

(2)
Corresponding Directed Acyclic Graphs (DAGs) for CGR for
annihilation of x41, x31, and x21 and update of the second column
of the matrix X are shown in figure 1. For an input matrix of size
n×n classical GR takes n(n−1)2 sequences while CGR takes n−1
sequences. Furthermore, if the number of multiplications in GR is
GRM and number of multiplications in CGR is CGRM then
CGRM =
2n3 + 3n2 − 5n
2
(3)
GRM =
4n3 − 4n
3
(4)
Taking ratio of equation 3 and 4
α =
CGRM
GRM
=
3(2n+ 5)
8(n+ 1)
(5)
From equation 5, as n → ∞, α → 34 . As we increase the size of
the matrix, the number of multiplications in CGR asymptotically
approaches to 34 times the number of multiplications in GR. Im-
plementation details of CGR on systolic array and in REDEFINE
can be found in [13].
2.2 REDEFINE CGRA
REDEFINE CGRA is a customizable massively parallel Multi-
processor System on Chip (MPSoC) where several Tiles are con-
nected through Network-on-Chip (NoC) [16]. Each Tile consists
of a Compute Element (CE) and a Router. CEs in REDEFINE
can be enhanced to support several applications domains like
signal processing and Dense Linear Algebra (DLA) computations
by attaching domain specific Custom Function Units (CFUs)
to the CEs [17]. REDEFINE framework is shown in figure 3.
A Reconfigurbale Data-path is tightly coupled to a Processing
Element (PE) that is CFU for REDEFINE as shown in the figure
3. Performance of PE in dgemm and MHT based QR factorization
(dgeqr2ht) is shown in figure 4(a) [7]. Performance of PE over
several Architectural Enhancements (AEs) is shown in figure 4(b).
It can be observed in the figure 4(a) that PE attains 3-100x
better performance in dgemm and dgeqr2ht while PE with tightly
coupled RDP is capable of attaining 74% of the theoretical peak
performance of PE in dgemm as shown in figure 4(b). Performance
in dgemm an dgeqr2ht is attained through algorithm-architecture
co-design where macro operations in dgemm and dgeqr2ht are
identified and realized on RDP. We apply similar technique in this
exposition for GR where we first present GGR and identify macro
operations in GGR that are realized on RDP. GGR implementation
(dgeqr2ggr) outperforms dgeqr2ht and dgemm. Further details of
dgemm and dgeqr2ht realizations can be found in [17], [18], [19],
and [7].
2.3 Related Work
Due to wide range of applications in the embedded systems do-
main, GR has been studied extensively in the literature specifically
for implementation purpose since it was first proposed in [20].
An alternate ordering of Givens sequences was presented in [21].
According to the scheme presented in [21], the alternate ordering
is amenable to parallel implementation of GR while it does not
focus on fusing several Givens sequences to annihilate multiple
elements. For an input matrix of n × n, the alternate ordering
presented in [21] can annihilate maximum n2 elements in parallel
by executing disjoint Givens sequences simultaneously. Pipeline
Given sequences for computing QR decomposition is presented in
[22] where its is proposed to execute Givens sequences in pipeline
fashion to update the pair of rows partially updated by the previ-
ous Givens sequences. Analysis of this strategy is presented for
Exclusive-read Exclusive-write Parallel Random Access Machine
(EREW-PRAM) that shows that the pipeline strategy is twice
as fast compared to the classical GR. Greedy Givens algorithm
is presented in [23] that executes compound disjoing Givens
sequences in parallel assuming unlimited parallelism case. A high
speed tournament GR and VLSI implementation of tournament
GR is presented in [3] where a significant improvement is reported
in ASIC over triangular systolic array. The scheduling scheme
presented in [3] is similar to the one presented in [21] where
disjoint Givens sequences are applied to compute QR decompo-
sition. FPGA implementation of GR is presented in [24] while
ASIC implementation of square-root free GR is presented in [25].
A novel technique to avoid underflow/overflow in computation of
QR decomposition using classical GR is presented in [26] that
results in numerical stable realization of GR in LAPACK. A two-
dimensional systolic array implementation of GR is presented
in [27] where classical GR is implemented on two-dimensional
systolic array with diagonal elements of the array performs com-
plex operations like square root and division while the rest of
the array performs matrix update. Restructuring of tridiagonal and
bidiagonal algorithms for QR decomposition is presented in [28].
The restructuring strategy presented in [28] has several advantages
like it is capable of exploiting vector instructions in the modern
architectures, reduces memory operations, and the matrix updates
are in the form of Level-3 BLAS thus capable of exploiting cache
architecture through reuse of data compared to classical GR where
it is Level-2 BLAS. Although, the scheme presented in [28] re-
arranges memory bound operations like Level-2 BLAS in classical
GR to compute bound operations like Level-3 BLAS, the scheme
does not reduce computations unlike GGR. In general, works re-
lated to GR in the literature focus on different scheduling schemes
for parallelism and exploitation of architectural features for the
targeted platform. In case of GGR, total work is reduced compared
to the classical GR while also being architecture platform friendly.
For implementation of GR, there has been no work in the literature
where macro operations in the routine are identified and realized
carefully for performance.
3 CASE STUDIES
For our experiments on multicore and GPGPUs we use high-
tly tuned software packages PLASMA and MAGMA. Software
4Fig. 2: One Iteration of Column-wise Givens Rotation
Fig. 3: REDEFINE Framework
  
Int
el 
Co
re
Nv
idi
a G
TX
48
0 S
M
Al
ter
a S
tra
tix
 IV
Cl
ea
rS
pe
ed
 C
SX
70
0
Int
el 
Co
re
 i7
 H
as
sw
ell
Nv
idi
a T
es
la 
C2
05
0
0
10
20
30
40
50
60
70
80
90
100
Performance Improvement In-terms of 
Gflops/watt in DGEMM
Performance Improvement In-terms of 
Gflops/watt in DGEQR2HT
Platforms
P
er
fo
rm
an
ce
 Im
pr
ov
em
en
t I
n-
te
rm
s 
of
 G
flo
ps
/w
at
t
(a) Performance Comparison of PE
with Other Platforms for dgemm and
dgeqr2ht
  AE0 AE1 AE2 AE3 AE4 AE5
0
10
20
30
40
50
60
70
80
20x20
40x40
60x60
80x80
100x100
Architectural Enhancements
P
er
ce
nt
ag
e
(b) Performance of PE In-terms of
Theoretical Peak Performance in PE
for dgemm
Fig. 4: Performance and Performance Comparison of PE where
dgeqr2ht is Modified Householder Transform based QR
Factorization
stacks of these packages are shown in figure 5. Performance
of PLASMA and MAGMA depends on efficiency of underlying
BLAS realization as well as realization of scheduling schemes for
multicore and GPGPUs. For implementation of GGR in PLASMA
and MAGMA for multicore and GPGPUs, we add routines to
BLAS and LAPACK. Performance in-terms of theoretical peak
performance in dgemm, dgeqr2, and dgeqrf is depicted in figure
6(a) and performance in-terms of Gflops/watt for these routines is
depicted in figure 6(b).
3.1 dgemm
dgemm is a Level-3 BLAS routine that has three loops and time
complexity of O(n3) for a matrix of size n × n. Performance
of dgemm in LAPACK in-terms of theoretical peak of underlying
(a) PLASMA Software Stack (b) MAGMA Software Stack
Fig. 5: PLASMA and MAGMA Software Stacks
  
X 103
(a) Performance In-terms of Theoret-
ical Peak Performance of Underly-
ing Platform in dgemm, dgeqr2, and
dgeqrf in LAPACK, PLASMA, and
MAGMA
  1x1 2x2 3x3 4x4 5x5 6x6 7x7 8x8 9x9 10x10
0
0.2
0.4
0.6
0.8
1
1.2
1.4
MAGMA_DGEMM
MAGMA_DGEQRF
MAGMA_DGEQR2
PLASMA_DGEQRF
LAPACK_DGEQR2
LAPACK_DGEQRF
Matrix Size
G
flo
ps
/w
at
t
X 103
(b) Performance In-terms of
Gflops/watt in dgemm, dgeqr2, and
dgeqrf in LAPACK, PLASMA, and
MAGMA
Fig. 6: Performance in dgemm, dgeqr2, and dgeqrf in LAPACK,
PLASMA, and MAGMA
platform is shown in figure 6(a) and in-terms of Gflops/watt in
MAGMA is shown in figure 6(b). It can be observed in the figure
6(a) that the performance attained by dgemm in Intel Core i7
and Nvidia Tesla C2050 is hardly 25% and 57% respectively. In-
terms of Gflops/watt it is 1.22 in Nvidia Tesla C2050. Due to
trivial nature of dgemm algorithm, we do not reproduce dgemm
algorithm here while standard dgemm algorithm can be found in
[29].
3.2 dgeqr2
Pseudo code of dgeqr2 is described in algorithm 1. It can be
observed in the pseudo code in the algorithm 1 that, it contains
three steps, 1) computation of a householder vector for each
column 2) computation of householder matrix P , and 3) update
of trailing matrix using P = I − 2vvT where I is an identity
5Algorithm 1 dgeqr2 (Pseudo code)
Allocate memory for input/output matrices and vectors
for i = 1 to n do
Compute Householder vector v
Compute P where P = I − 2vvT
Update trailing matrix using dgemv
end for
matrix. For our experiments, we use Intel C Compiler (ICC) and
Intel Fortran Compiler (IFORT). We also use different compiler
switches to improve the performance of dgeqr2 in LAPACK on
Intel micro-architectures. In Intel Core i7 4th Gen machine which
is a Haswell micro-architecture, CPI attained saturates at 1.1 [7].
In case when compiler switch −mavx is used that enables use
of Advanced Vector Extensions (AVX) instructions, the Cycles-
Per-Instruction (CPI) attained is increased. This behavior is due to
AVX instructions that use Fused Multiply Add (FMA). Due to this
fact, the CPI reported by VTune™can not be considered as a mea-
sure of performance for the algorithms and hence we accordingly
double the instruction count reported by Intel VTune™.
In case of GPGPUs, dgeqr2 in MAGMA is able to achieve
up to 16 Gflops in Tesla C2050 which is 3.1% of the theoretical
peak performance of Tesla C2050 while performance in terms of
Gflops/watt is as low as 0.04 Gflops/watt.
3.3 dgeqrf
Algorithm 2 dgeqrf (Pseudo Code)
1: Allocate memories for input/output matrices
2: for i = 1 to n do
3: Compute Householder vectors for block column m× k
4: Compute P matrix where P is Computed using House-
holder vectors
5: Update trailing matrix using dgemm
6: end for
Pseudo code for dgeqrf routine is shown in algorithm 2. In
terms of computations, there is no difference between algorithms
1, and 2. In a single core implementation, dgeqrf is observed to
be 2-3x faster than dgeqr2. The major source of efficiency in
dgeqrf is efficient exploitation of processor memory hierarchy and
dgemm routine which is a compute bound operation [30][31]. In
Nvidia Tesla C2050, dgeqrf in MAGMA is able to achieve up to
265 Gflops which is 51.4 % of theoretical peak performance of
Nvidia Tesla C2050 as shown in the figure 6(a) which is 90.5%
of the performance attained by dgemm. In dgeqr2 in MAGMA,
performance attained in terms of Gflops/watt is as low as 0.05
Gflops/watt while for dgemm and dgeqrf it is 1.23 Gflops/watt and
1.09 Gflops/watt respectively in Nvidia Tesla C2050 as shown in
figure 6(b). In case of dgqrf in PLASMA, the performance attained
is 0.39 Gflops/watt while running dgeqrf in four cores.
3.4 dgeqr2ht
Pseudo code for dgeqr2ht is shown in algorithm 3. A clear differ-
ence between dgeqr2 in the algorithm 1, dgeqrf in the algorithm
2, and dgeqr2ht in the algorithm 3 is in updating of trailing
matrix. dgeqr2 uses dgemv operation which is a memory bound
operation, and dgeqrf uses dgemm operation which is compute
bound operation while dgeqr2ht uses an operation that is more
Algorithm 3 dgeqr2ht (Pseudo Code)
1: Allocate memories for input/output matrices
2: for i = 1 to n do
3: Compute Householder vectors for block column m× k
4: Compute PA where PA = A− 2vvTA
5: end for
dense in-terms of computations resulting in lower θ where θ is a
rough quantification of parallelism through DAG based analysis
of routines dgeqr2, dgeqrf, and dgeqr2ht. In case of no-change
in the computations in the improved routine after re-arrangement
of computations (in this case fusing of the loops), θ translates
into ratio of number of levels in the DAG of improved routine to
number of levels in the DAG of classical routine. Implementation
of dgeqr2ht clearly outperforms implementation of dgeqrf in PE as
shown in figure 7(a). In the figure 7(a), the performance is shown
in-terms of percentage of theoretical peak performance of PE
and also in-terms of percentage of theoretical peak performance
normalized to the performance attained by dgemm in the PE. It
can be observed in the figure 7(a) that the performance attained
by dgeqr2ht is 99.3% of the performance attained by dgemm in
the PE. Furthermore, it can be observed in figure 7(b) that the
performance achieved in-terms of Gflops/watt in the PE is 35
Gflops/watt compared to performance attained by dgeqrf which
is 25 Gflops/watt. In multicore and GPGPUs, such a fusing does
not translate into improvement due to limitations of the platform.
Further details of dgeqr2ht implementation can be found in [7].
Based on case studies on dgemm, dgeqr2, dgeqrf, and dgeqr2ht,
we make following observations:
• dgeqrf in highly tuned software packages like PLASMA
and MAGMA attains 16% and 51% respectively. This
leaves a further scope for improvement in these routines
through careful analysis
• Typically, dgeqrf routine that has compute bound op-
eration like dgemm achieves 80-85% of the theoretical
peak performance attained by dgemm in multicore and
GPGPUs
• dgeqr2ht attains similar performance as dgeqrf in multi-
core and GPGPUs while it clearly outperforms dgeqrf in
PE
We further propose generalization in CGR presented in [13]
and derive GGR that can outperform dgeqr2ht in PE and in
REDEFINE.
4 GENERALIZED GIVENS ROTATION AND IMPLE-
MENTATION
From equation 1, matrix An×n, annihilation of element in the last
row and first column which is (n,1) would require application of
one Givens sequence
Gn,1A =
[
R(1)
0
]
(6)
where matrix R(k) is a matrix with k zero elements in the lower
triangle of the matrix and has undergone k-updates. In general
Givens matrix is given by equation 7.
Gi,j = diag(Ii−2, G˜i,j , Im−i) with G˜ =
[
c s
−s c
]
(7)
6  2x2 4x4 6x6 8x8 10x10 12x12
0
20
40
60
80
100
120
Peak of the PE in DGEQR2 Peak of the PE in DGEQRF
Peak of the PE in DGEQRFHT Peak of the PE in DGEQR2HT
Peak of the DGEMM in DGEQR2 Peak of the DGEMM in DGEQRF
Peak of the DGEMM in DGEQRFHT Peak of the DGEMM in DGEQR2HT
Matrix Size
P
er
ce
nt
ag
e
X 10
(a) Performance Comparison of PE with Other Plat-
forms for dgemm and dgeqr2ht
  2x2 4x4 6x6 8x8 10x10 12x12
0
5
10
15
20
25
30
35
40
DGEQR2 DGEQRF
DGEQRFHT DGEQR2HT
Matrix Size
G
flo
ps
/w
at
t
X 10
(b) Performance of PE In-terms of Theoretical Peak
Performance in PE for dgemm
Fig. 7: Performance of dgeqr2ht in PE
where c = Ai−1,jt , s =
Ai,j
t and t =
√
A2i−1,j +A
2
i,j . It takes
n−1 Givens sequences to annihilate n−1 elements in the matrix
A. There is a possibility to apply multiple Givens sequences
to annihilate multiple elements in a column of a matrix. Thus,
extending equation 6 to annihilate 2-elements in the first column
of the matrix A
Gn−1,1Gn,1A =
[
R(2)
0
]
(8)
where (Gn−1,1Gn,1)T (Gn−1,1Gn,1) =
(Gn−1,1Gn,1)(Gn−1,1Gn,1)T = I . Extending further equation
8 to annihilate n − 1 elements of the first column of the input
matrix A
G2,1G3,1...Gn−1,1Gn,1A =
[
R(n−1)
0
]
(9)
where (G2,1G3,1...Gn−1,1Gn,1)T (G2,1G3,1...Gn−1,1Gn,1)
= (G2,1G3,1...Gn−1,1Gn,1)(G2,1G3,1...Gn−1,1Gn,1)T = I .
Formulation in equation 9 can annihilate n − 1 elements in the
first column of the input matrix A. Furthering annihilation of
n − 1 elements to (n − 1) + (n − 2) elements that results in
matrix R(n−1)+(n−2) where n − 1 elements in the first column
and n− 2 elements in the second column of matrix R are zero as
given by equation 10.
(G3,2G4,2...Gn−1,2Gn,2)(G2,1G3,1...Gn−1,1Gn,1)A =[
R(n−1)+(n−2)
0
]
(10)
where ((G3,2G4,2...Gn−1,2Gn,2)(G2,1G3,1...Gn−1,1Gn,1))
((G3,2G4,2...Gn−1,2Gn,2)(G2,1G3,1...Gn−1,1Gn,1))T =
((G3,2G4,2...Gn−1,2Gn,2)(G2,1G3,1...Gn−1,1Gn,1))T
((G3,2G4,2...Gn−1,2Gn,2)(G2,1G3,1...Gn−1,1Gn,1)) = I . To
annihilate n(n−1)2 elements in the lower triangle of the input
matrix A, it takes n− 1 sequences and the Givens matrix shrinks
by one row and one column with each column annihilation.
Further generalizing equation 10,
(Gn,n−1)(Gn−1,n−2Gn,n−2)....(G3,2G4,2...Gn−1,2Gn,2)
(G2,1G3,1...Gn−1,1Gn,1)A =
[
R(n−1)+(n−2)
0
]
(11)
Fig. 8: CGR and GGR
and (Gn,n−1)(Gn−1,n−2Gn,n−2)....(G3,2G4,2...
Gn−1,2Gn,2)(G2,1G3,1...Gn−1,1Gn,1) = QT and QQT =
QTQ = I . Equation 11 represents GGR in equation form.
Algorithm 4 Generalized Givens Rotation (Pseudo code)
Allocate memory for input/output matrices and vectors
for i = 1 to n do
Compute 2− norm of the column vector
Update row 1
update rows 2 to n
end for
A pictorial view for 8 × 8 matrix that compares CGR with
GGR is shown in figure 8, and GGR pseudo code is given in
algorithm 4. It can be observed in the figure 8 that CGR operates
column-wise and takes total 7 iterations to upper traingularize
the matrix of size 8 × 8 while GGR operates column-wise as
well as row-wise and can upper triangularize matrix in a single
iteration. It can be observed in the algorithm 4 that the update
of the first row and the rest of the rows can be executed in
parallel. Furthermore, there also exist parallelism across the out
loop iterations in GGR. Theoretically, classical GR takes n(n−1)2
iterations to upper triangularize an input matrix of size n×n, and
CGR takes n− 1 iterations to upper triangularize an input matrix
of size n× n while GGR can upper triangularize a matrix of size
n× n in 1 iteration.
7However, in practical scenario, it is not possible to accom-
modate large matrices in the registers or Level 1 (L1) cache
memory. Hence, a sophisticated matrix partitioning schemes are
required to efficiently exploit the memory hierarchy in multicores
and GPGPUs.
4.1 GGR in Multicore and GPGPU
For multiocre realization of GGR, we use PLASMA framework
and for GPGPU realization, we use MAGMA framework depicted
in figure 5.
4.1.1 GGR in PLASMA
To implement GGR in multicore architectures, we first imple-
ment dgeqr2ggr routine in LAPACK and we use that routine
in PLASMA. GGR implementation in LAPACK is shown in
algorithm 5 as a pseudo code. It can be observed in the algorithm
5 that the most computationally intensive part in GGR is update
function. In our implementation, update function becomes part of
BLAS while in LAPACK, we implement dgeqr2ggr function that
is wrapper function of update function that calls update function
n times for input matrix of size n× n as shown in algorithm 5.
Algorithm 5 lapack dgeqr2ggr (GGR in LAPACK) (Pseudo code)
Allocate memory for input/output matrices and vectors
for i = 1 to n do
update(L, A(i,i), A, LDA, I, N, M, Tau(i), Beta)
end for
update()
Initialize k,l, and s vectors to 0
Calculate 2-norm of the column vector
Calculate k, l, and s vectors
Update row i of the matrix
Update row i+ 1 to n using k, l, and s vectors
Performance comparison of dgeqr2ggr, dgeqrfggr, dgeqr2,
dgeqrf, dgeqr2ht, and dgeqrfht in LAPACK and PLASMA is
shown in figure 9. It can be observed in the figure 9 that despite
more computations in dgeqr2ggr, the performance of dgeqr2ggr is
comparable to dgeqr2, and performance of dgeqrfggr is compara-
ble to dgeqrf in LAPACK and PLASMA. We compare run-time
of the different routines normalized to dgemm performance in
respective software packages since total number of computations
in HT, MHT, and GGR are different. Furthermore, in our im-
plementation of GGR in PLASMA, we use dgemm for updating
trailing matrix.
4.1.2 GGR in MAGMA
For implementation of magma dgeqr2ggr, we insert routines in
MAGMA BLAS. Pseudo code for the inserted routine is shown in
algorithm 6. The implementation consists of 3 functions, drnm2
that computes 2-norm of the column vector, computation of k,
and l vectors by function klvec and update of trailing matrix
by dtmup function. It can be observed in the algorithm 6 that
the most computationally intensive part in the routine is dtmup
function. There are two routines implemented, 1) dgeqr2ggr where
trailing matrix is updated using method shown in equation 2, 2)
dgeqrfggr where trailing matrix is updated using dgemm. Perfor-
mance of dgeqr2ggr and dgeqrfggr is shown in figure 9. It can
be observed in the figure 9 that the performance of dgeqr2ggr is
similar to performance of dgeqr2 in MAGMA while performance
Algorithm 6 magma dgeqr2ggr (GGR in MAGMA) (Pseudo
code)
dnrm2<<<grid, threads,0,queue− >cuda stream()
>>>(n,m,dC,lddc,dv,ddot)
klvec<<<grid,threads,0,queue− >cuda stream()
>>>(n,m,ddot,lddc,dv,dk,dl)
dtmup<<<n,threads,0,queue− >cuda stream()
>>>(n,m,dC,lddc,ddot,dv,dk,dl)
dnrm2()
for k = 1 to m do
for j = 0 to BLOCK SIZE do
w+ = A[ldda∗(row)+(m−1−j−k)]∗V shared[j]
if row == 0 then
dot[ldda ∗ (row) + (m − 1 − k − j)] =
−copysign(sqrt(w), vector[0])
else if row! = 0 then
dot[ldda ∗ (row) + (m− 1− k − j)] = w
end if
end for
end for
klvec()
if row == 0 then
dl[row] = 0
dk[row] = 1/dot[0]
else if row == m− 1 then
dl[row] = dv[row]/dot[row − 1]
dk[row] = dv[row − 1]/dot[row − 1]
else if (row > 0)&&(row < (m− 1)) then
dl[row] = dot[row]/dot[row − 1]
dk[row] = dv[row − 1]/(dot[row] ∗ dot[row − 1])
end if
dtmup()
tx = threadIdx.x
dC[m−1] = (dk[m−1]∗dC[m−1])−(dl[m−1]∗dC[m−2])
for j = m− 2− tx to 1 do
dC[j] = (dk[j] ∗ dot[j])− (dl[j] ∗ dC[j − 1])
j = j −BLOCK SIZE
end for
dC[0] = dk[0] ∗ dot[0]
of dgeqrfggr is similar to the performance of dgeqrf in magma.
Despite abundant parallelism available in dgeqr2ggr, GPGPUs are
not capable of exploiting this parallelism due to serialization of
the routine while in dgeqrfggr, GPGPUs perform similar to dgeqrf
due to dominance of dgemm in the routine.
4.2 GGR in PE
For implementation of dgeqr2ggr, and dgeqrfggr in PE, we use
similar method as presented in [7]. We perform DAG based
analysis of GGR and identify macro operations in the DAGs.
These macro operations are then realized on RDP that is tightly
coupled to PE as shown in figure 10. PE consists of two modules,
1) Floating Point Sequencer (FPS), and 2) Load-Store CFU.
FPS performs double precision floating point computations
while Load-Store CFU is responsible for loading and storing of
data in registers, and Local Memory (LM) to/from the Global
Memory (GM). Operation in the PE can be defined by following
steps:
80
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1x1 2x2 3x3 4x4 5x5 6x6 7x7 8x8 9x9 10x10 
LAPACK_DGEQR2 LAPACK_DGEQRF LAPACK_DGQRR2HT
LAPACK_DGEQRFHT LAPACK_DGQRR2GGR LAPACK_DGEQRFGGR
PLASMA_DGEQR2 PLASMA_DGEQRF PLASMA_DGEQR2GGR
PLASMA_DGEQRFGGR MAGMA_DGEQR2 MAGMA_DGEQRF
MAGMA_DGEQR2GGR MAGMA_DGEQRFGGR
P
e
rf
o
rm
an
ce
 (
N
o
rm
al
iz
e
d
 t
o
 d
ge
m
m
 p
e
rf
o
rm
an
ce
) 
Matrix Size 
x103 
Fig. 9: Performance of GGR in Different Packages and Platforms
• Send a load request to GM for input matrix and store input
matrix elements in LM
• Move input matrix elements to registers in the FPS
• Perform computations and store the results back to LM
• Write results back to GM if the computed elements are the
final output or if they are not to be used immediately
Up to 90% of overlap in computation and communication is
attained in the PE for dgemm as presented in [17]. To identify
macro operations, considering example of 4 × 4 matrix shown in
the figure 2 and equation 2. It can be observed in the figure 2
that computing Givens Generation (GG) is an square root of inner
product while computing update of the first row is similar to inner
product. Furthermore, computing rows 2,3, and 4 is determinant
of 2 × 2 matrix. We map these row updates on RDP as shown in
figure 12. It can be observed in the figure 12 that the RDP can be
re-morphed to perform scalar multiplication (MUL/DOT1), inner
product of 2-element vectors (DOT2), inner product of 3-element
vectors, determinant of 2 × 2 matrix (DET2), and innter product
of 4-element vectors (DOT4) operations. In our implementation
we ensure that the reconfiguration of RDP is minimal to improve
energy efficiency of the PE. We introduce custom instructions like
DOT4, DOT3, DOT2, DOT1, and DET2 instructions in PE to
implement these macro operations in RDP along with instruction
that can reconfigure RDP. Different routines are realized using
these instructions as shown in figure 11. In the figure 11, it
can be observed that irrespective of routine for QR factorization
implemented in PE, the communication pattern remains consistent
across the routines. In our implementation of dgeqr2ggr, we
ensure that RDP is configured to perform two DET2 instructions
in parallel that maximizes resource utilization of RDP. In our
implementation of dgeqr2ggr, instructions in the two function
UPDATE ROW1 and UPDATE are merged such that the pipeline
stalls in the RDP are minimized. Such an approach is not possible
in implementation of dgeqrfggr or dgeqrfht since trailing matrix
is updated using dgemm routine and until the matrix required to
update the trailing matrix update is not computed, trailing matrix
update can not be processed.
Speed-up in dgeqr2ggr over different routines is shown in
figure 13(a). It can be observed in the figure 13(a) that the
speed-up attained in dgeqr2ggr over other routines range between
1.1 to 2.25x. Performance in-terms of the percentage of peak
performance normalized to the performance attained by dgemm
for different routines for QR factorization is shown in figure
13(b). A counter-intuitive observation that can be made here
is that dgeqr2ggr can achieve performance that is higher than
the performance attained by dgemm in the PE while dgeqr2ht
performance reported in [7] is same as performance attained by
dgemm. dgeqr2ggr can achieve performance that is up to 82%
of the theoretical peak of PE. dgeqr2ggr attains 10% higher
Gflops/watt over dgeqr2ht which is the best performing routine as
reported in [7] and [19]. Furthermore, improvement in dgeqr2ggr
over other platforms is shown in figure 13(d). It can be observed
in the figure 13(d) that the performance improvement in PE for
dgeqr2ggr over dgeqr2ggr in off-the-shelf platforms is ranging
from 3-100x.
5 PARALLEL IMPLEMENTATION OF GGR IN RE-
DEFINE
An experimental setup for implementation of dgeqrfggr and
dgeqr2ggr is shown in figure 14 where PE that outperforms other
off-the-shelf platforms for DLA computations is attached as a
CFU to the Routers in REDEFINE CEs.
For implementation of dgeqrfggr and dgeqr2ggr in REDE-
FINE, it requires an efficient partitioning and mapping scheme
that can sustain computation to communication ratio that is com-
mensurate with the hardware resources of the platform, and also
ensures scalability. We follow similar strategy presented in [7] for
realization of dgeqrfggr and dgeqr2ggr in REDEFINE and propose
a general scheme that is applicable for the input matrix of any
size. For our experiments, we consider 3 different configurations
in REDEFINE consisting of 2 × 2, 3 × 3, and 4 × 4 Tile arrays.
Assuming that input matrix is of the size N ×N and REDEFINE
Tile array is of size K × K , the input matrix can be partitioned
into the blocks of NK × NK sub-matrices. Since, objective of our
experiment is to show scalability of our solution, we choose N
and K such that N%K = 0. Matrix partitioning and REDEFINE
mapping is depicted in figure 15. As shown in the figure 15, for
Tile array of size 2 × 2, we follow scheme 1 where input matrix
is partitioned in to sub-matrices of size 2 × 2. For Tile array
of size 3 × 3 the input matrix is partitioned in to sub-matrices
of size 3 × 3 and as the input matrix, the scheme 1 is used to
sustain computation to communication ratio. In implementation
of dgeqrfggr, we update trailing matrix using dgemm while in
implementation of dgeqr2ggr, the trailing matrix is updated using
DET2 instructions. Attained speed-up over sequential realization
in different Tile array sizes and matrix sizes is shown in figure 16.
Speed-up in dgeqr2ggr, dgeqr2, dgeqrf, dgeqrfht, and dgeqr2ht
over sequential realizations of these routines. It can be observed
in the figure 16(a) that the speed-up attained in dgeqr2ggr is com-
mensurate with the Tile array size in REDEFINE. For a Tile array
size of K × K , speed-up asymptotically approaches K × K as
depicted in the figure 16(a). Performance of dgeqr2ggr, dgeqrfggr,
dgeqr2, dgeqrf, dgeqrfht, and dgeqr2ht in-terms of theoretical peak
performance of Tile array size is shown in figure 16(b). It can be
observed in the figure 16(b) that dgeqrf2ggr can attain up to 78%
of the theoretical peak performance in REDEFINE for different
Tile array sizes.
6 CONCLUSION
Generalization of Givens Rotation was presented that resulted in
lower multiplication count compared to classical Givens Rotation
9Fig. 10: Processing Element with RDP
Fig. 11: Implementation of dgeqr2, dgeqrf, dgeqrfht, dgeqr2ht,
dgeqr2ggr, and dgeqrfggr in PE
Fig. 12: Different Configurations in RDP to Support
Implementation of dgeqr2ggr
operation. Generalized Givens Rotation was implemented on mul-
ticore and General Purpose Graphics Processing Units where the
performance was limited due to inability of these platforms in
exploiting available parallelism in the routine. It was proposed
to move away from traditional software packages based approach
to architectural customizations for Dense Linear Algebra compu-
tations. Several macro operations were identified in Generalized
Givens Rotation and realized on a Reconfigurable Data-path that
is tightly coupled to pipeline of a Processing Element. Generalized
Givens Rotation outperformed Modified Householder Transform
presented in the literature by 10% in Processing Element where
Modified Householder Transform is implemented with similar
approach of algorithm-architecture co-design. For parallel realiza-
tion, the Processing Element was attached to REDEFINE Coarse-
grained Reconfigurable Architecture as a Custom Function Unit
and scalibility of the solution was shown where speed-up in
parallel realization asymptotically approaches Tile array size in
REDEFINE.
REFERENCES
[1] G. Cerati, P. Elmer, S. Lantz, I. MacNeill, K. McDermott, D. Riley,
M. Tadel, P. Wittich, F. Wrthwein, and A. Yagil, “Traditional tracking
with kalman filter on parallel architectures,” Journal of Physics:
Conference Series, vol. 608, no. 1, p. 012057, 2015. [Online]. Available:
http://stacks.iop.org/1742-6596/608/i=1/a=012057
[2] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy, and
R. Narayan, “Achieving efficient realization of kalman filter on CGRA
through algorithm-architecture co-design,” CoRR, vol. abs/1802.03650,
2018. [Online]. Available: http://arxiv.org/abs/1802.03650
[3] M.-W. Lee, J.-H. Yoon, and J. Park, “High-speed tournament givens
rotation-based qr decomposition architecture for mimo receiver,” in
Circuits and Systems (ISCAS), 2012 IEEE International Symposium on,
may 2012, pp. 21 –24.
[4] A. George and J. W. Liu, “Householder reflections versus givens
rotations in sparse orthogonal decomposition,” Linear Algebra and its
Applications, vol. 88-89, pp. 223 – 238, 1987. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/002437958790111X
[5] A. Edelman, “Large dense numerical linear algebra in 1993: the parallel
computing influence,” IJHPCA, vol. 7, no. 2, pp. 113–128, 1993.
[Online]. Available: http://dx.doi.org/10.1177/109434209300700203
[6] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney,
J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen,
“Lapack: A portable linear algebra library for high-performance
computers,” in Proceedings of the 1990 ACM/IEEE Conference on
Supercomputing, ser. Supercomputing ’90. Los Alamitos, CA, USA:
IEEE Computer Society Press, 1990, pp. 2–11. [Online]. Available:
http://dl.acm.org/citation.cfm?id=110382.110385
[7] F. A. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K. Nandy,
and R. Narayan, “Efficient realization of householder transform through
algorithm-architecture co-design for acceleration of qr factorization,”
IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99,
pp. 1–1, 2018.
[8] J. York and D. Chiou, “On the asymptotic costs of multiplexer-based
reconfigurability,” in The 49th Annual Design Automation Conference
2012, DAC ’12, San Francisco, CA, USA, June 3-7, 2012, 2012, pp. 790–
795. [Online]. Available: http://doi.acm.org/10.1145/2228360.2228503
[9] Z. E. Ra´kossy, F. Merchant, A. A. Aponte, S. K. Nandy, and A. Chat-
topadhyay, “Efficient and scalable cgra-based implementation of column-
wise givens rotation,” in ASAP, 2014, pp. 188–189.
[10] Z. E. Ra´kossy, F. Merchant, A. A. Aponte, S. K. Nandy,
and A. Chattopadhyay, “Scalable and energy-efficient reconfigurable
accelerator for column-wise givens rotation,” in 22nd International
Conference on Very Large Scale Integration, VLSI-SoC, Playa del
Carmen, Mexico, October 6-8, 2014, 2014, pp. 1–6. [Online]. Available:
http://dx.doi.org/10.1109/VLSI-SoC.2014.7004166
[11] S. Das, K. T. Madhu, M. Krishna, N. Sivanandan, F. Merchant,
S. Natarajan, I. Biswas, A. Pulli, S. K. Nandy, and R. Narayan, “A
framework for post-silicon realization of arbitrary instruction extensions
on reconfigurable data-paths,” Journal of Systems Architecture -
10
0
0.5
1
1.5
2
2.5
2x2 4x4 6x6 8x8 10x10 12x12
Speed-up over DGEQRFHT
Speed-up over DGEQRF
Speed-up over DGEQR2
Speed-up over DGEQR2HT
Sp
e
e
d
-u
p
 
Matrix Size 
x10 
(a) Speed-up in dgeqr2ggr over dgeqr2, dgeqrf, dge-
qrfht, and dgeqr2ht in PE
2x2 4x4 6x6 8x8 10x10 12x12
0
20
40
60
80
100
120
DGEQR2 (PE) DGEQRF (PE) DGEQRFHT (PE)
DGEQR2HT (PE) DGEQR2 (DGEMM) DGEQRF (DGEMM)
DGEQRFHT (DGEMM) DGEQR2HT (DGEMM) DGEQR2GGR (PE)
DGEQR2GGR (DGEMM) DGEQRFGGR (PE) DGEQRFGGR (DGEMM)
P
e
rc
e
n
ta
ge
 
Matrix Size 
x10 
(b) Performance of dgeqr2ggr, dgeqrfggr, dgeqr2, dge-
qrf, dgeqrfht, and dgeqr2ht in PE In-terms of Theoret-
ical Peak Performance Normalized to Performance of
dgemm in the PE
2x2 4x4 6x6 8x8 10x10 12x12 
0 
5 
10 
15 
20 
25 
30 
35 
40 
45 
DGEQR2 DGEQRF DGEQRFHT 
DGEQR2HT DGEQR2GGR 
G
fl
o
p
s/
w
at
t 
Matrix Size 
x10 
(c) Performance of dgeqr2ggr, dgeqr2, dgeqrf, dgeqrfht,
and dgeqr2ht in PE In-terms of Gflops/watt
0
20
40
60
80
100
120 Performance Improvement In-terms
of Gflops/watt in DGEMM
Performance Improvement In-terms
of Gflops/watt in DGEQR2HT
Performance Improvement In-terms
of Gflops/watt in DGEQR2GGR
P
e
rf
o
rm
an
ce
 Im
p
ro
ve
m
en
t 
Platforms 
(d) Performance Improvement of dgeqr2ggr, dgeqr2,
dgeqrf, dgeqrfht, and dgeqr2ht in PE over Different
Platforms
Fig. 13: Performance of dgeqr2ggr in PE
Fig. 14: Experimental Setup in REDEFINE with PE as a CFU
Embedded Systems Design, vol. 60, no. 7, pp. 592–614, 2014. [Online].
Available: http://dx.doi.org/10.1016/j.sysarc.2014.06.002
[12] G. Ansaloni, P. Bonzini, and L. Pozzi, “Egra: A coarse grained recon-
figurable architectural template,” Very Large Scale Integration (VLSI)
Systems, IEEE Transactions on, vol. 19, no. 6, pp. 1062–1074, June
2011.
[13] F. Merchant, A. Chattopadhyay, G. Garga, S. K. Nandy, R. Narayan,
and N. Gopalan, “Efficient QR decomposition using low complexity
11
Fig. 15: Matrix Partitioning and Scheduling on REDEFINE Tile
Array
column-wise givens rotation (CGR),” in 2014 27th International
Conference on VLSI Design and 2014 13th International Conference
on Embedded Systems, Mumbai, India, January 5-9, 2014, 2014, pp.
258–263. [Online]. Available: https://doi.org/10.1109/VLSID.2014.51
[14] F. Merchant, N. Choudhary, S. K. Nandy, and R. Narayan, “Efficient
realization of table look-up based double precision floating point
arithmetic,” in 29th International Conference on VLSI Design and 15th
International Conference on Embedded Systems, VLSID 2016, Kolkata,
India, January 4-8, 2016, 2016, pp. 415–420. [Online]. Available:
http://dx.doi.org/10.1109/VLSID.2016.113
[15] F. Merchant, A. Chattopadhyay, S. Raha, S. K. Nandy, and
R. Narayan, “Accelerating BLAS and LAPACK via efficient
floating point architecture design,” Parallel Processing Letters,
vol. 27, no. 3-4, pp. 1–17, 2017. [Online]. Available: https:
//doi.org/10.1142/S0129626417500062
[16] M. Alle, K. Varadarajan, A. Fell, R. R. C., N. Joseph, S. Das,
P. Biswas, J. Chetia, A. Rao, S. K. Nandy, and R. Narayan, “REDEFINE:
Runtime reconfigurable polymorphic asic,” ACM Trans. Embed. Comput.
Syst., vol. 9, no. 2, pp. 11:1–11:48, Oct. 2009. [Online]. Available:
http://doi.acm.org/10.1145/1596543.1596545
[17] F. Merchant, A. Maity, M. Mahadurkar, K. Vatwani, I. Munje, M. Kr-
ishna, S. Nalesh, N. Gopalan, S. Raha, S. Nandy, and R. Narayan,
“Micro-architectural enhancements in distributed memory cgras for lu
and qr factorizations,” in VLSI Design (VLSID), 2015 28th International
Conference on, Jan 2015, pp. 153–158.
[18] M. Mahadurkar, F. Merchant, A. Maity, K. Vatwani, I. Munje,
N. Gopalan, S. K. Nandy, and R. Narayan, “Co-exploration of NLA
kernels and specification of compute elements in distributed memory
cgras,” in XIVth International Conference on Embedded Computer
Systems: Architectures, Modeling, and Simulation, SAMOS 2014, Agios
Konstantinos, Samos, Greece, July 14-17, 2014, 2014, pp. 225–232.
[Online]. Available: http://dx.doi.org/10.1109/SAMOS.2014.6893215
[19] F. Merchant, T. Vatwani, A. Chattopadhyay, S. Raha, S. K.
Nandy, and R. Narayan, “Achieving efficient QR factorization
by algorithm-architecture co-design of householder transformation,”
in 29th International Conference on VLSI Design and 15th
International Conference on Embedded Systems, VLSID 2016, Kolkata,
India, January 4-8, 2016, 2016, pp. 98–103. [Online]. Available:
http://dx.doi.org/10.1109/VLSID.2016.109
[20] W. Givens, “Computation of plane unitary rotations transforming a
general matrix to triangular form,” Journal of the Society for Industrial
and Applied Mathematics, vol. 6, no. 1, pp. pp. 26–50, 1958. [Online].
Available: http://www.jstor.org/stable/2098861
[21] J. Modi and M. Clarke, “An alternative givens ordering,” Numerische
Mathematik, vol. 43, pp. 83–90, 1984. [Online]. Available: http:
//dx.doi.org/10.1007/BF01389639
[22] M. Hofmann and E. J. Kontoghiorghes, “Pipeline givens sequences
for computing the qr decomposition on a erew pram,” Parallel
Computing, vol. 32, no. 3, pp. 222 – 230, 2006. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167819105001638
[23] E. J. Kontoghiorghes, “Greedy givens algorithms for computing
the rank-k updating of the qr decomposition,” Parallel Computing,
vol. 28, no. 9, pp. 1257 – 1273, 2002. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167819102001321
[24] S. Aslan, S. Niu, and J. Saniie, “Fpga implementation of fast qr decom-
position based on givens rotation,” in Circuits and Systems (MWSCAS),
2012 IEEE 55th International Midwest Symposium on, aug. 2012, pp.
470 –473.
[25] L. Ma, K. Dickson, J. McAllister, and J. Mccanny, “Modified givens rota-
tions and their application to matrix inversion,” in Acoustics, Speech and
Signal Processing, 2008. ICASSP 2008. IEEE International Conference
on, 31 2008-april 4 2008, pp. 1437 –1440.
[26] D. Bindel, J. Demmel, W. Kahan, and O. Marques, “On computing
givens rotations reliably and efficiently,” ACM Trans. Math. Softw.,
vol. 28, no. 2, pp. 206–238, Jun. 2002. [Online]. Available:
http://doi.acm.org/10.1145/567806.567809
[27] X. Wang and M. Leeser, “A truly two-dimensional systolic array fpga
implementation of qr decomposition,” ACM Trans. Embed. Comput.
Syst., vol. 9, no. 1, pp. 3:1–3:17, Oct. 2009. [Online]. Available:
http://doi.acm.org/10.1145/1596532.1596535
[28] F. G. V. Zee, R. A. van de Geijn, and G. Quintana-Ortı´, “Restructuring
the tridiagonal and bidiagonal QR algorithms for performance,” ACM
Trans. Math. Softw., vol. 40, no. 3, p. 18, 2014. [Online]. Available:
http://doi.acm.org/10.1145/2535371
[29] G. H. Golub and C. F. Van Loan, Matrix computations (3rd ed.).
Baltimore, MD, USA: Johns Hopkins University Press, 1996.
[30] J. A. Gunnels, C. Lin, G. Morrow, and R. A. van de Geijn,
“A flexible class of parallel matrix multiplication algorithms,”
in IPPS/SPDP, 1998, pp. 110–116. [Online]. Available: http:
//dx.doi.org/10.1109/IPPS.1998.669898
[31] J. Wasniewski and J. Dongarra, “High performance linear algebra
package for FORTRAN 90,” in Applied Parallel Computing, Large Scale
Scientific and Industrial Problems, 4th International Workshop, PARA
’98, Umea˚, Sweden, June 14-17, 1998, Proceedings, 1998, pp. 579–583.
[Online]. Available: http://dx.doi.org/10.1007/BFb0095385
Farhad Merchant is a Postdoctoral Research
Fellow at Institute for Communication Technolo-
gies and Embedded Systems, RWTH Aachen
University, Germany. Previously he has worked
as a Researcher at Research and Technology
Center, Robert Bosch and before that he was
Research Fellow at Hardware and Embedded
Systems Lab, School of Computer Science and
Engineering, Nanyang Technological University,
Singapore. He received his PhD from Computer
Aided Design Laboratory, Indian Institute of Sci-
ence, Bangalore, India. His research interests are algorithm-architecture
co-design, computer architecture, reconfigurable computing, develop-
ment and tuning of high performance software packages
Tarun Vatwani is a Research Associate at Hard-
ware and Embedded Systems Lab, School of
Computer Science and Engineering, Nanyang
Technological University, Singapore. He is a
B.Tech. graduate from Indian Institute of Tech-
nology, Jodhpur, India, His research interests
are computer architecture, high performance
computing, machine learning, performance tun-
ing of different software packages.
Anupam Chattopadhyay received his B.E. de-
gree from Jadavpur University, India in 2000.
He received his MSc. from ALaRI, Switzerland
and PhD from RWTH Aachen in 2002 and 2008
respectively. From 2008 to 2009, he worked as
a Member of Consulting Staff in CoWare R&D,
Noida, India. From 2010 to 2014, he led the
MPSoC Architectures Research Group in RWTH
Aachen, Germany as a Junior Professor. Since
September, 2014, he is appointed as an assis-
tant Professor in SCE, NTU.
12
0
2
4
6
8
10
12
14
16
2x2 (DGEQR2HT)
3x3 (DGEQR2HT)
4x4 (DGEQR2HT)
2x2 (DGEQR2)
3x3 (DGEQR2)
4x4 (DGEQR2)
2x2 (DGEQRF)
3x3 (DGEQRF)
4x4 (DGEQRF)
2x2 (DGEQRFHT)
3x3 (DGEQRFHT)
4x4 (DGEQRFHT)
2x2 (DGEQR2GGR)
3x3 (DGEQR2GGR)
4x4 (DGEQR2GGR)
Sp
e
e
d
-u
p
 
Matrix Size 
(a) Speed-up in dgeqr2ggr, dgeqrfggr, dgeqr2, dgeqrf,
dgeqr2ht, dgeqrfht Routines in REDEFINE over Their
Sequential Realization in PE
P
e
rc
e
n
ta
ge
 
Matrix Size 
0
10
20
30
40
50
60
70
80
2x2 (DGEQR2HT)
3x3 (DGEQR2HT)
4x4 (DGEQR2HT)
2x2 (DGEQR2)
3x3 (DGEQR2)
4x4 (DGEQR2)
2x2 (DGEQRF)
3x3 (DGEQRF)
4x4 (DGEQRF)
2x2 (DGEQRFHT)
3x3 (DGEQRFHT)
4x4 (DGEQRFHT)
2x2 (DGEQR2GGR)
3x3 (DGEQR2GGR)
4x4 (DGEQR2GGR)
(b) Performance of dgeqr2ggr, dgeqrfggr, dgeqr2, dge-
qrf, dgeqrfht, and dgeqr2ht in REDEFINE In-terms of
Theoretical Peak Performance of REDEFINE
Fig. 16: Performance of dgeqr2ggr in PE
Soumyendu Raha obtained his PhD in Scientific
Computation from the University of Minnesota in
2000. Currently he is a Professor of the Com-
putational and Data Sciences Department at the
Indian Institute of Science in Bangalore, which
he joined in 2003, after having worked for IBM
for a couple of years. His research interests are
in computational mathematics of dynamical sys-
tems, both continuous and combinatorial, and
in co-development and application of comput-
ing systems for implementation of computational
mathematics algorithms.
Ranjani Narayan has over 15 years experience
at IISc and 9 years at Hewlett Packard. She
has vast work experience in a variety of fields
computer architecture, operating systems, and
special purpose systems. She has also worked
in the Technical University of Delft, The Nether-
lands, and Massachusetts Institute of Technol-
ogy, Cambridge, USA. During her tenure at HP,
she worked on various areas in operating sys-
tems and hardware monitoring and diagnostics
systems. She has numerous research publica-
tions.She is currently Chief Technology Officer at Morphing Machines
Pvt. Ltd, Bangalore, India.
S K Nandy is a Professor in the Department of
Computational and Data Sciences of the Indian
Institute of Science, Bangalore. His research in-
terests are in areas of High Performance Em-
bedded Systems on a Chip, VLSI architectures
for Reconfigurable Systems on Chip, and Archi-
tectures and Compiling Techniques for Hetero-
geneous Many Core Systems. Nandy received
the B.Sc (Hons.) Physics degree from the In-
dian Institute of Technology, Kharagpur, India,
in 1977. He obtained the BE (Hons.) degree
in Electronics and Communication in 1980, MSc.(Engg.) degree in
Computer Science and Engineering in 1986, and the Ph.D. degree in
Computer Science and Engineering in 1989 from the Indian Institute
of Science, Bangalore. He has over 170 publications in International
Journals, and Proceedings of International Conferences, and 5 patents.
Rainer Leupers received the M.Sc. (Dipl.-
Inform.) and Ph.D. (Dr. rer. nat.) degrees in Com-
puter Science with honors from TU Dortmund
in 1992 and 1997. From 1997-2001 he was
the chief engineer at the Embedded Systems
chair at TU Dortmund. In 2002, he joined RWTH
Aachen University as a professor for Software
for Systems on Silicon. His research comprises
software development tools, processor architec-
tures, and system-level electronic design au-
tomation, with focus on application-specific mul-
ticore systems. He published numerous books and technical papers and
served in committees of the leading international EDA conferences. He
received various scientific awards, including Best Paper Awards at DAC
and twice at DATE, as well as several industrial awards. Dr. Leupers
is also engaged as an entrepreneur and in turning novel technologies
into innovations. He holds several patents on system-on-chip design
technologies and has been a co-founder of LISATek (now with Synop-
sys), Silexica, and Secure Elements. He has served as consultant for
various companies, as an expert for the European Commission, and in
the management boards of large-scale projects like HiPEAC and UMIC.
He is the coordinator of EU projects TETRACOM and TETRAMAX on
academia/industry technology transfer.
