The density matrix renormalization group algorithm on kilo-processor
  architectures: implementation and trade-offs by Nemes, Csaba et al.
The density matrix renormalization group algorithm on kilo-processor architectures:
implementation and trade-offs
Csaba Nemesa,∗, Gergely Barczac, Zolta´n Nagya,b, O¨rs Legezac, Pe´ter Szolgaya,b
aFaculty of Information Technology, Pe´ter Pa´zma´ny Catholic University, Budapest, Hungary
bCellular Sensory and Wave Computing Laboratory, Computer and Research Automation Institute, Hungarian Academy of Sciences,
Budapest, Hungary
c Strongly Correlated Systems ”Lendu¨let” Research Group, Department of Theoretical Solid State Physics, Wigner Research Centre for
Physics, Hungarian Academy of Sciences, Budapest, Hungary
Abstract
In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to
balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization
group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator.
As the most time-dominant step of the diagonalization can be expressed as a list of dense matrix operations, the DMRG
is an appealing candidate to fully utilize the computing power residing in novel kilo-processor architectures.
In the paper a smart hybrid CPU-GPU implementation is presented, which exploits the power of both CPU and
GPU and tolerates problems exceeding the GPU memory size. Furthermore, a new CUDA kernel has been designed
for asymmetric matrix-vector multiplication to accelerate the rest of the diagonalization. Besides the evaluation of the
GPU implementation, the practical limits of an FPGA implementation are also discussed.
Keywords: strongly correlated systems, DMRG, GPU acceleration, FPGA acceleration
1. Introduction and Related Works
DMRG is a variational numerical approach developed
to treat low-dimensional interacting many-body quantum
systems efficiently [1, 2, 3]. In fact, it has become an
exceptionally successful method to study the low energy
physics of strongly correlated quantum systems which ex-
hibit chain-like entanglement structure [4]. For example,
it can be applied to simulate properties of anisotropic ma-
terials, such as polymers [5], or to describe accurately the
electronic structure of open d shell molecules [6], which is
beyond the capability of standard quantum chemical ap-
proaches. Additionally, the interacting system of atoms
trapped in an optical lattice, proposed as physical im-
plementation of quantum computer, is also tractable via
DMRG [7].
The original DMRG algorithm [1] was introduced in
1992 by Steven R. White and was formulated as a single
threaded algorithm. In the past various works have been
carried out to accelerate the DMRG algorithm on shared
[8] [9] and distributed memory [10, 11, 12, 13] architec-
tures, however, none of them took advantage of recent kilo-
processor architectures: graphical processing unit (GPU)
and field-programmable gate array (FPGA).
One of the first parallelizations was [8] converting the
projection operation to matrix-matrix multiplications and
∗Corresponding author
Email address: nemes.csaba@itk.ppke.hu (Csaba Nemes )
accelerating them via OpenMP interface. In [12] a sim-
ilar approach was presented for distributed memory en-
vironment (up-to 1024 cores) optimizing the communica-
tion between the cores, while in [13] the acceleration of
the computation of correlation function had been investi-
gated. Recently, [9] presented an acceleration on shared
memory architectures exploiting SU(2) symmetries, while
[14] proposed a novel direction for paralellization via a
modification of the original serial DMRG algorithm.
Graphical processing unit has been successfully em-
ployed in neighboring research areas to accelerate matrix
operations. In [15] GPU is used to accelerate tensor con-
tractions in plaquette renormalization states (PRS), which
can be regarded as an alternative technique to tensor prod-
uct states (TPS) or the DMRG algorithm. In [16] the
second-order spectral projection (SP2) algorithm has been
accelerated, which is an alternative technique to calcu-
late the density matrix via a recursive series of generalized
matrix-matrix multiplications.
In this paper we present the first attempt (to our best
knowledge) to investigate how the DMRG method can
utilize the enormous computing capabilities of novel kilo-
processor architectures (GPU, FPGA). In case of GPU a
smart hybrid CPU-GPU acceleration is presented, which
tolerates problems exceeding the GPU memory size, con-
sequently, supporting wide range of problems and GPU
configurations. Contrary to the previous acceleration at-
tempts not only the projection operation is accelerated,
Preprint submitted to Computer Physics Communications September 24, 2013
ar
X
iv
:1
30
9.
55
71
v1
  [
co
nd
-m
at.
str
-el
]  
22
 Se
p 2
01
3
but further parts of the diagonalization are also computed
on the GPU. In case of FPGA the performance limits of a
possible implementation are estimated and discussed.
The rest of the paper is organized as follows. Section 2
describes the models which are used as test cases to demon-
strate the operation of the algorithm. Symmetries which
can be exploited to decrease the computational require-
ments of the algorithm and the algorithm itself are pre-
sented in Sections 3 and 4, respectively. Acceleration on
GPU is presented in three sections (5, 6 and 7), while lim-
its of an FPGA implementation are described in Section 8.
Finally, implementation results and conclusions are given
in Sections 9 and 10, respectively.
2. Investigated models
In order to illustrate the underlying features of the al-
gorithm it is applied to the so-called spin- 12 Heisenberg
model and the spin- 12 Hubbard model. The selected mod-
els describe how to compute the Hamiltonian of the system
of interest, while the main task is to find some of the low-
lying eigenvalues and eigenvectors of the Hamiltonian by
a diagonalization algorithm. In practice instead of solving
the problem for the complete Hilbert space directly, var-
ious physical phenomena can be exploited to reduce the
complexity of the problem.
2.1. Heisenberg model
The Heisenberg model describes the physics of mag-
netic systems and provides theoretical description of vari-
ous experimental measurements. In the model a magnetic
system is simulated on a lattice of interacting spins. A
microscopic magnetic moment (spin) is localized at each
lattice site j and described by a quantized, two-valued vari-
able, σj ∈ {↑, ↓}, related to the two possible orientations
of the spin. Limiting the interactions to only neighboring
spins – which is often a good approximation – the Hamil-
tonian of the model is written as
H =
1
2
N−1∑
j=1
(
S+j S
−
j+1 + S
−
j S
+
j+1
)
+ ∆
N−1∑
j=1
Szj S
z
j+1 (1)
where S+j , S
−
j operators change, while S
z
j measures the
orientation of the spin on lattice site j. The overall behav-
ior of the system can be tuned via the relevant parameter
∆. The explicit matrix representation of an operator Oj
acting on site j of a chain with N spins is given as
Oj =
j−1⊗
i=1
I⊗O ⊗
N⊗
i=j+1
I (2)
where I is the identity and O is one of the followings
S+ =
(
0 1
0 0
)
, S− =
(
0 0
1 0
)
, Sz =
1
2
(
1 0
0 −1
)
. (3)
The Hamiltonian of N spins acts on the tensor product
space of dimension 2N , that is the dimension of the com-
plete Hilbert space grows exponentially as the size of the
system increases.
2.2. Hubbard model
The Hubbard model was introduced to describe elec-
trons in solids to characterize the transition between insu-
lating and conducting systems. The single-band Hubbard
model provides appropriate description of low tempera-
ture systems where all particles are in the lowest Bloch
band and the long-ranged interactions between the parti-
cles can be neglected due to strong screening effects [17].
More recently various multi-band Hubbard models have
been applied to high-temperature superconductivity [18]
and systems of higher spin to understand the behavior of
optically trapped ultracold atoms [7].
In the general spin-F system each lattice site is charac-
terized by 2F + 1 two dimensional vectors. Each vector is
assigned with a distinct label (from {−F,−F + 1, ..., F −
1, F}) called spin polarization value (denoted by σ). A
vector assigned to a spin polarization σ describes two or-
thogonal states: the site is occupied ([0; 1]) by the particle
of spin polarization σ or not ([1; 0]). As a consequence, a
lattice site of spin-F possesses 22F+1 internal degrees of
freedom.
The lattice model of interacting particles of spin-F con-
sists of two competing terms: the kinetic term, which de-
scribes the tunneling of particles between neighboring lat-
tice sites, and the local potential term, which describes
on-site density-density interaction measuring the attrac-
tion or repulsion between the interacting particles. The
single-band, fermionic Hubbard model of spin-F is defined
on a chain with N sites as
H = −t
N−1∑
j=1
F∑
σ=−F
(
c†j,σcj+1,σ + h.c.
)
+
U
2
N∑
j=1
∑
σ 6=σ′
nj,σnj,σ′
(4)
where t measures the hopping amplitude between neigh-
boring sites and U is the interaction strength. Creation
and annihilation operator acting on site j with spin po-
larization σ, denoted as c†j,σ and cj,σ, adds or removes
a particle located on site j with spin polarization σ. The
particle density of spin polarization σ on site j is measured
by operator nj,σ = c
†
j,σcj,σ. The explicit matrix represen-
tation of an operator Oj,σ acting on site j and polarization
σ is constructed as
Oj,σ =
F ′(j−1)⊗
i=1
Φ⊗Oσ ⊗
F ′N⊗
i=F ′(j+1)
I (5)
Oσ =
σ−1⊗
i=−F
Φ⊗O ⊗
F⊗
i=σ+1
I (6)
Φ =
(
1 0
0 −1
)
(7)
2
where F ′ = 2F + 1, I is the identity, Φ is the fermionic
phase-factor and O is one of the followings
c† =
(
0 0
1 0
)
, c =
(
0 1
0 0
)
. (8)
The Hamiltonian describing the spin-F system of N lattice
sites acts on the tensor product space of dimension 2F
′N ,
and similarly to the Heisenberg model, the dimension of
the complete Hilbert space blows up exponentially. Com-
paring to the bosonic operators of the Heisenberg model,
the key differences in the construction of operators are
the appearance of internal quantum number, σ, and the
presence of the phase-factor describing the antisymmetric
nature of fermionic systems. To ease the comparison of
the two models only the F = 12 case is presented, however,
the observed tendencies are valid for higher F values.
3. Symmetries to be exploited
In many systems the Hamilton operator does not change
the value of a measurable quantity, i.e., it commutes with
the operator connected to that measurable quantity. These
operators are called symmetry operators and can be used
to cast the Hilbert space to smaller independent subspaces[19].
Consequently, instead of solving a large matrix eigenvalue
problem, the eigenvalue spectrum can be determined by
solving several smaller problems. In the Heisenberg model
the total spin projection, Sz =
∑N
j=1 S
z
j , is such a symme-
try operator. Meanwhile, in the Hubbard model of spin-F
the total particle number associated to each spin polariza-
tion σ, Nσ =
∑N
j=1 nj,σ, is conserved. Thus, the distinct
quantum numbers helps to partition the Hilbert space into
multiple independent subspaces corresponding to a given
combination of quantum number values.
A given symmetry operator shares the same eigenvec-
tors of the Hamiltonian, thus the eigenstates of the Hamil-
tonian can be labeled by the eigenvalues of the symmetry
operator (quantum number, Q), and the Hilbert space can
2
4
8
16
32
64
128
256
512
1024
2048
4096
4096
4096
1E+0
1E+1
1E+2
1E+3
1E+4
1E+5
1E+6
1E+7
1E+8
No quantum 
numbers
Using quantum 
number(Q=0)
Max sector size
Number of retained block states
D
im
en
si
on
 o
f H
ilb
er
t s
pa
ce
Figure 1: Exploiting the projection symmetry in the Heisenberg
model the Hilbert space of the superblock can be restricted to the
subspace corresponding to Q=0.
be decomposed into subspaces (sectors) spanned by the
eigenvectors of each quantum number value [20]. Intro-
ducing a quantum number based representation, the sparse
operators (Eqs. 2, 5) can be decomposed to a set of smaller
but dense matrices, furthermore the Hamiltonian operator
(Eqs. 1, 4) becomes blockdiagonal.
4. Algorithm
The DMRG approach has two phases, in the infinite-
lattice algorithm the approximated Hilbert space of a finite
system of N interacting spins is built up iteratively, while
in the optional finite-lattice algorithm the number of the
interacting spins is fixed and further iterations are carried
out to increase the accuracy of the computed results. As
in both cases the iterations are very similar, for the sake of
simplicity, we consider only the infinite-lattice algorithm.
The detailed description of the algorithm can be found in
the original work [1] and various reviews [2, 3], here only
the key steps of an iteration of the infinite-lattice algorithm
are summarized in Algorithm 1 providing the basis of our
analysis.
Algorithm 1 One iteration of the infinite-lattice algo-
rithm
1: Load a left and a right block.
2: Form the superblock configuration.
3: Compute the lowest eigenstate of the superblock
Hamilton HSB . (Davidson method)
4: for each block do
5: Construct the density matrix for the given block
from the lowest eigenstate.
6: Compute the eigenvalues of the density matrix.
(Lanczos method)
7: Renormalize the basis of the block by keeping
states with high eigenvalues.
In the two-site DMRG procedure four subsystems (left
block describing l sites, 1 site, 1 site, right block describing
r sites) compose the finite system of N = (l + 2 + r) sites
called superblock. The sites contained in each block are
described maximally by m, optimally chosen states, which
can be significantly smaller than the exactly required ql or
qr basis, where q is the degree of freedom of one site. As
the central sites of the superblock are represented exactly
by q-q states, the size of the superblock Hilbert space is
q2m2. Considering, however, the symmetries mentioned
above, the problem can be restricted to a subspace of the
superblock corresponding to a particular Q value. E.g.,
in case of Heisenberg and Hubbard models the size of the
superblock Hilbert space can be reduced significantly as
demonstrated in Figures 1 and 2, respectively. It is, how-
ever, clear that even using symmetry operators the dimen-
sion of the reduced space grows exponentially with the size
of the lattice (if no truncation is done).
The infinite-lattice algorithm starts with the four site
configuration, where each block contains a single spin. In
3
4
16
64
256
1024
4096
4096
4096
4096
4096
4096
4096
4096
4096
1E+0
1E+1
1E+2
1E+3
1E+4
1E+5
1E+6
1E+7
1E+8
1E+9
No quantum 
numbers
Using quantum 
number(Q=[N/2,
N/2])
Max sector size
Number of retained block states
D
im
en
si
on
 o
f H
ilb
er
t s
pa
ce
Figure 2: Exploiting the conservation of particle number in the spin-
1
2
Hubbard model the Hilbert space of the superblock can be re-
stricted to the subspace corresponding to Q=[N
2
,N
2
].
each iteration step both blocks are enlarged by a single
site, making the complete system increase by two, until
the desired system size, N , is reached. In each iteration of
the DMRG algorithm, the lowest-lying eigenvector of the
corresponding superblock Hamiltonian (HSB) is obtained
by the iterative Davidson or Lanczos algorithm. (In the
paper the Davidson algorithm is considered.) From the
lowest eigenstate the density matrix is constructed which
carry the information how to optimally truncate the basis
of the enlarged block (m  ql+1) in order to keep the
problem size manageable [21].
The most time-consuming part of a full iteration is the
step of the Davidson routine which carries out the projec-
tion operation (X ′ = HSBX). Instead of constructing and
storing the enormous HSB matrix of size O
(
m4
)
explic-
itly, it is computationally favorable to obtain the projected
vector X ′ directly from the matrices of size O (m2) com-
posing HSB .
The HSB can be directly expressed by the operators
of the original four subsystems (l -1-1-r strategy) or by the
operators of two intermediate systems (LR strategy), so
called enlarged blocks, which come from the contraction
of each block with its neighboring site. In the current
implementation only the second strategy is investigated,
however, the first one is also straightforward and will be
included in the near future.
There are several practical benefits of these strategies.
First of all, the memory bandwidth limited matrix-vector
multiplication (BLAS Level 2) is converted to matrix-matrix
multiplication (BLAS Level 3) which can be efficiently ac-
celerated. Secondly, skipping of the explicit Kronecker
multiplications not only restructures the computation, but
decreases the number of operations. Finally, both strate-
gies drastically decrease the size of the matrices which
take part in the operations and thus the memory foot-
print of the algorithm. In case of LR and l -1-1-r strategy
the largest matrix has a size of O((mq)2) and O((m)2),
respectively. The second strategy is more favorable in ex-
treme situations when the GPU memory is limited and q
(internal degrees of freedom) is large (e.g. spin-F Hubbard
model with large F ).
4.1. LR strategy
In the LR strategy the HSB is expressed with operators
A
(L)
α and B
(R)
α defined on the left (L := l + 1) and right
(R := r + 1) enlarged blocks, respectively, as
HSB =
∑
α
A(L)α ⊗B(R)α , (9)
where the index α iterates over the distinct operator com-
binations required to construct the superblock Hamilto-
nian. Exploiting Kronecker multiplication properties, the
projected vector X ′ can be computed by matrix-matrix
multiplications as
X˜ ′ =
∑
α
A(L)α X˜B
(R)T
α , (10)
where vector X of size [BcolAcol] is reshaped to matrix
X˜ of size [Bcol, Acol] and vector X ′ of size [BrowArow] is
reshaped to matrix X˜ ′ of size [Brow, Arow].
Algorithm 2 The computation of the projected vector X ′
in case of l -1-1-r strategy strategy.
Require: size(X) = [DcolCcolBcolAcol]
1: function projectX l11r(A,B,C,D,X)
2: X1=reshape(X) as size(X1)=[D
col, Ccol, Bcol, Acol]
3: for each (i, j) do X ′1(:, :, i, j)=DX1(:, :, i, j)
4: for each (i, j) do X ′′1 (:, :, i, j)=X
′
1(:, :, i, j)C
T
5: X2=reshape(X
′′
1 ) as size(X2)=[D
rowCrow, Bcol, Acol]
6: for each (i) do X ′2(:, :, i)=X2(:, :, i)B
T
7: X3=reshape(X
′
2) as size(X3)=[D
rowCrowBrow, Acol]
8: X ′3=X3A
T
9: X ′=reshape(X ′3) as size(X
′)=[DrowCrowBrowArow]
10: return X ′
In the practical implementation Equation 10 operates
on even smaller matrices as the operators are decomposed
according to quantum numbers. Instead of a sparse ma-
trix A(L) several dense matrices A
(L)
qi→qj are stored repre-
senting how A(L) transforms the subspace (sector) corre-
sponding to qi to the one corresponding to qj . To compute
X ′ in case of a given A(L)α , B
(R)
α operator pair all possible
A
(L)
α,qi→qj , B
(R)
α,qk→ql transition pairs shall be submitted to
Equation 10 and each time only the corresponding ik and
jl segments of X and X ′ shall be used as
X˜ ′jl+ = A
(L)
α,i→jX˜ikB
(R)T
α,k→l (11)
where X˜ik and X˜
′
jl indicate the reshaped ik and jl seg-
ment of vector X and X ′, respectively. Fortunately, the
reshape operation has no computational cost as the data
in the memory is untouched and only the row/col sizes are
changing.
4
2
4
8
16
32
64
128
256
512
1024
2048
4096
4096
4096
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Davidson – Core-i7
Davidson – GTX570
Davidson – Xeon E5
Davidson – K20
H_SB*V – Core-i7
H_SB*V  – GTX570
H_SB*V  – Xeon E5
H_SB*V – K20
Number of retained block states
Pe
rc
en
ta
ge
 o
f t
ot
al
 ti
m
e
Figure 3: Heisenberg model: Runtime of the Davidson algorithm
and its HSBV operation compared to the total time of a DMRG
iteration step as the number of retained block states increases. CPU-
only versions are indicated by Core-i7 and Xeon E5, while hybrid
versions are indicated by GTX 570 and K20.
4.2. l-1-1-r strategy
In the l -1-1-r strategy the HSB is expressed by the
operators of the four subsystems:
HSB =
∑
α
A(l)α ⊗ aα ⊗ bα ⊗B(r)α , (12)
where the index α again iterates over the distinct operator
combinations required to construct the superblock Hamil-
tonian.
Similarly to the LR strategy, Kronecker multiplica-
tion properties can be exploited to compute the projected
vector X ′ efficiently with matrix-matrix operations, how-
ever, in this case a more complicated data storage and
several tensor multiplications are needed to avoid unnec-
essary memcopy operations. Using the procedure pro-
jectX l11r(), which computes the projected vector X ′
for one matrix quadruplet and is described in Algorithm 2,
the HSB can be calculated as
X ′ =
∑
α
projectX l11r(A(l)α , aα, bα, B
(r)
α , X) . (13)
In the similar manner as shown in LR strategy, A(l), a, b
and B(r) operators can be decomposed according to quan-
tum numbers and instead of large sparse matrix operations
several smaller dense matrix operations shall be submitted
to Algorithm 2. Furthermore, none of the reshape oper-
ations of Algorithm 2 involves practical data movement,
only the size descriptor variables are changing.
5. Runtime & Parallelism
In case of the Heisenberg and Hubbard model the run-
time analysis of the DMRG algorithm is shown in Figures 3
and 4, respectively. As the Davidson algorithm, which is
summarized in Algorithm 3, is the most time-dominant
part and takes more than 97% of the total time in the
4
16
64
256
1024
4096
4096
4096
4096
4096
4096
4096
4096
4096
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Davidson – Core-i7
Davidson – GTX570
Davidson – Xeon E5
Davidson – K20
H_SB*V – Core-i7
H_SB*V  – GTX570
H_SB*V  – Xeon E5
H_SB*V – K20
Number of retained block states
Pe
rc
en
ta
ge
 o
f t
ot
al
 ti
m
e
Figure 4: Similar to Figure 3 but for the Hubbard model.
CPU-only reference implementation, it has been chosen for
acceleration. Unfortunately, the full Davidson algorithm
cannot be implemented on the GPU as the problem size
in real world simulations usually exceeds the GPU mem-
ory size. Instead, a hybrid approach shall be implemented,
which can adjust the GPU workload according to the avail-
able GPU memory and the CPU-GPU performance ratio.
Algorithm 3 One iteration of the Davidson algorithm
Require: Previous (i−1) basis vectors already computed.
1: function davidsonIter(i)
2: W (:, i) = HSB · V (:, i) . BLAS-3: dgemm()
3: B(:, i) = V T ·W (:, i) . BLAS-2: dgemv trans()
4: [λ, y]←− get smallest eigvalue and vector of B
5: x = V · y . BLAS-2: dgemv()
6: r = −λ · x+W · y
7: if norm(r) ≈ 0 then
8: return with x and success
9: else
10: correct r
11: // orthonormalize r against V :
12: s = V T · r . BLAS-2: dgemv trans()
13: r = r − V · s . BLAS-2: dgemv()
14: normalize r and append to V
15: return without success
In the Davidson algorithm inherent parallelism can be
observed at two levels. First, at low level, all the matrix
and vector operations can be accelerated. Secondly, at the
level of projection computation (line 2 in Algorithm 3),
which is the most time-dominant part of the Davidson
algorithm itself taking more than 75% of the total time, the
projection can be computed as a sum of the independent
(AX)BT operations (see Equation 10).
At low level, the CPU part of the algorithm uses the
Basic Linear Algebra Subroutine (BLAS) interface and the
Intel MKL Library [22] for algebraic operations including
operator contractions, inner operations of both Davidson
and Lanczos algorithms [23] and operator transformations.
Unfortunately, in the Davidson algorithm, all the opera-
5
2
4
8
16
32
64
128
256
512
1024
2048
4096
4096
4096
0
1000
2000
3000
4000
5000
6000
GTX 570
K20
gemv_trans() 
memory request
Number of retained block states
A
va
ila
bl
e 
m
em
or
y 
fo
r 
ge
m
v_
tr
an
s(
) (
M
B
)
Figure 5: Remained free GPU memory after the projection opera-
tion, which can be used for gemv trans() and the memory request of
the gemv trans() in case of the Heisenberg model.
tions except the projection are BLAS level 2 matrix-vector
multiplications, which are bandwidth limited and not ideal
for acceleration. There is block extension [24] of the al-
gorithm, the so called Davidson-Liu, to determine a few
of the lowest eigenvalues, where more than one candidate
vectors are added at once resulting in BLAS level 3 op-
erations, however, in the current DMRG implementation
only the lowest eigenvalue is investigated. The remaining
option is to store as much data in GPU memory as possible
and execute the corresponding operations on GPU.
At the level of projection operation the independence
of matrix multiplications provides a straightforward hy-
brid parallelization and a future multi-GPU modification
of the current implementation. Acceleration can be im-
proved by developing the appropriate scheduling of the
matrix operations for different matrix sizes and architec-
tures.
6. Accelerating matrix-vector multiplications
Jacobi-Davidson version [25] of the original Davidson
algorithm [26] is a preconditioned subspace iteration tech-
nique [23] aimed at computing a few of the extreme eigen-
pairs of large sparse symmetric matrices and commonly
used in the DMRG implementations [2, 3]. In the pre-
sented work the [27] version of the algorithm (available in
Netlib [28]) is implemented.
In each iteration the subspace is extended with a new
base vector (V (:, i)), which is stored in the memory ac-
companied by its projection (W (:, i)). As the size of these
vectors can be very large (see Figures 1, 2) depending on
the model and the number of retained block states (m),
they cannot be fully stored in the GPU memory. How-
ever, in order to accelerate a matrix-vector multiplication
with GPU, at least the matrix shall be stored in the GPU
memory. In the current implementation the matrix of the
basis vectors (V ), which is used four times (see comments
in Algorithm 3) in BLAS level 2 operations, has been se-
2 4 6 8 10 12 14 16 18 20 22 24
0
5
10
15
20
25
30
35
40
Nvidia 
CuBlas::dgemv()
Nvidia 
CuBlas::dgemm()
Intel MKL on Core-i7
PCIe 2.0 limit
Kernel (belt width=4)
Kernel (belt width=8)
Kernel (belt width=16)
Width of the matrix
Pe
rf
or
m
an
ce
 (G
Fl
op
s)
Figure 6: GTX 570: Performance of the presented gemv trans() ker-
nel with different belt widths is compared to the performance of
the available implementations in case of matrix height 5e5. Addi-
tionally, the PCIe limit is displayed as PCIe throughput limits the
performance of the GPU acceleration if the transfer of the multiplier
vector cannot be avoided.
lected to be stored, although the storage of the projected
vector matrix (W ) can be added later as well.
In each iteration the new basis vector is loaded to
the GPU memory in the background (if there is available
space) and the workload of the BLAS level 2 operations is
shared between the CPU and GPU: CPU process the new
basis vector, while GPU operates on the older ones. With
this technique the power of both CPU and GPU can be ex-
ploited and the transfer time of the matrix can be hidden.
The implementation is flexible: if there is no more space on
the GPU or the CPU performance justifies it, more than
one base vector can be processed on the CPU leaving less
work for GPU. As shown in Figure 5, where the storage
requirement of V is compared with available free space af-
ter the projection operation, V cannot be fully stored on
a GTX 570 GPU even in case of the simple Heisenberg
model (m = 4096).
There are two types of BLAS level 2 operations: V TX
and V X indicated by gemv trans() and gemv() in Algo-
rithm 3, respectively. In the first case the multiplier vector
can be loaded while in the second case the result can be
written in smaller parts (∼5e5) to overlap with the com-
putation.
6.1. gemv trans()
In case of gemv trans() both MKL and CuBlas libraries
give poor performance for the special asymmetric matrix
size (∼5e5x20) required in our application (see Figures 6,
7, 8, 9 and 10), therefore, a new CUDA kernel has been de-
signed. The presented results are measured without data
communication, as in case of line 3 the multiplier vector is
already in the GPU memory providing ideal acceleration.
To estimate the performance of line 12, where the mul-
tiplier vector has to be loaded, the limiting factor of the
PCIe 2.0 is also displayed. In this case the overall perfor-
mance cannot exceed the PCIe limit even with overlapped
communication. In both cases the estimation of the overall
6
acceleration shall be carried out in an integral fashion, as
in each iteration the thickness of the matrix is increased by
one until a user defined limit (20 in the presented DMRG
test-cases) is reached.
The basic idea of the new kernel (see Algorithm 4)
can be summarized as follows. Each thread is associated
with a column of the matrix. Each thread loads the cor-
responding vector element and multiplies the elements of
the associated column. As threads of a warp load consec-
utive elements of the vector and the matrix, the coalesced
reading is obvious. If the number of threads (grid size *
thread block size) is less than the length of the matrix,
each thread is associated with a new unprocessed column
(coalesced readings again) as long as there is any. Af-
ter processing a new column each thread accumulates the
results to the results of the first column. Finally, the accu-
mulated results shall be summed across the threads, which
can be efficiently done via a sum reduction [29] in shared
memory. If the belt is smaller than the width of the ma-
trix, the whole procedure can be repeated (outer loop).
Algorithm 4 Proposed kernel for asymmetric
gemv trans()
1: function kernel gemv trans(
2: mtx, mtx width, mtx len, vec)
3: for each belt do
4: for (i = (blockIdx.x*blockDim.x) + threa-
dIdx.x; i < mtx len; i += (blockDim.x*gridDim.x))
do
5: priv = vec[i];
6: #define TMP(j) reg##j += mtx[i+
belt offset + j* mtx length] * priv;
7: BOOST PP REPEAT(BELT SIZE,TMP);
8: ...
9: // Sum Reduction
10: if (threadIdx.x == 0) then // Save results
The size of the shared memory requirement of a thread
block, which is equal to the size of a thread block multi-
plied with the height of the belt, can be a limiting factor
of the performance because in case of large shared mem-
ory usage less thread blocks can be assigned to one phys-
ical multiprocessor. In the presented measurements the
optimal height of the belt has been investigated, however,
even with the optimal height the performance decreases as
the width of the matrix increases. For extreme, asymmet-
ric matrices, which are used in our application, significant
speed-up (x4-5) can be reached compared to the CuBlas li-
brary, however, as the matrix tends to be more symmetric
the performance of the CuBlas::dgemm() operation (red
line in the Figures) increases and exceeds the performance
of the new kernel.
In case of the new Kepler architecture (K20), in which
Streaming Multiprocessor has significantly more CUDA
Cores than the SM of Fermi GPUs (GTX 570), the per-
multiprocessor warp occupancy shall be increased to use
2 4 6 8 10 12 14 16 18 20 22 24
0
5
10
15
20
25
30
35
40
Nvidia 
CuBlas::dgemv()
Nvidia 
CuBlas::dgemm()
Intel MKL on Xeon E5
PCIe 2.0 limit
Kernel (belt width=2)
Kernel (belt width=4)
Kernel (belt width=8)
Width of the matrix
Pe
rf
or
m
an
ce
 (G
Fl
op
s)
Figure 7: K20, no shuffle operation in the kernel: Performance of
the presented gemv trans() kernel with different belt widths is com-
pared to the performance of the available implementations in case
of matrix height 5e5. Additionally, the PCIe limit is displayed as
PCIe throughput limits the performance of the GPU acceleration if
the transfer of the multiplier vector cannot be avoided.
2 4 6 8 10 12 14 16 18 20 22 24
0
5
10
15
20
25
30
35
40
Nvidia 
CuBlas::dgemv()
Nvidia 
CuBlas::dgemm()
Intel MKL on Xeon E5
PCIe 2.0 limit
Kernel (belt width=4)
Kernel (belt width=8)
Kernel (belt width=16)
Width of the matrix
Pe
rf
om
an
ce
 (G
Fl
op
s)
Figure 8: K20, shuffle operation enabled: Performance of the pre-
sented gemv trans() kernel with different belt widths is compared
to the performance of the available implementations in case of ma-
trix height 1e5. Additionally, the PCIe limit is displayed as PCIe
throughput limits the performance of the GPU acceleration if the
transfer of the multiplier vector cannot be avoided.
all the available cores [30]. A new warp-level intrinsic
called the shuffle operation can be used to decrease the
shared memory requirement of the sum reduction algo-
rithm to increase the occupancy. In Figures 8, 9 and 10
the results of the new kernel extended with the shuffle
operation are displayed. The height of the optimal belt
is slightly increased as the shared memory request is de-
creased. Unfortunately, the shuffle operation provides only
a small performance gain in case of our kernel (compare
Figures 7 and 9).
The performance of the kernel is mainly dominated by
the speed of the coalesced reading of the matrix elements.
The gemv trans() operation is bandwidth limited in both
CPU and GPU architectures, however, the memory band-
width on GPU (e.g. GDDR5 in GTX 570: 152GB/s or
GDDR5 in K20 208GB/s) is usually higher than on CPU
(e.g. DDR3-1333 in dual channel with Core-i7: 21.2GB/s
7
2 4 6 8 10 12 14 16 18 20 22 24
0
5
10
15
20
25
30
35
40
Nvidia 
CuBlas::dgemv()
Nvidia 
CuBlas::dgemm()
Intel MKL on Xeon E5
PCIe 2.0 limit
Kernel (belt width=4)
Kernel (belt width=8)
Kernel (belt width=16)
Width of the matrix
Pe
rf
or
m
an
ce
 (G
Fl
op
s)
Figure 9: Similar to Figure 8 but for matrix height 5e5.
2 4 6 8 10 12 14 16 18 20 22 24
0
5
10
15
20
25
30
35
40
Nvidia 
CuBlas::dgemv()
Nvidia 
CuBlas::dgemm()
Intel MKL on Xeon E5
PCIe 2.0 limit
Kernel (belt width=4)
Kernel (belt width=8)
Kernel (belt 
width=16)
Width of the matrix
Pe
rf
or
m
an
c e
 (G
Fl
op
s)
Figure 10: Similar to Figures 8 and 9 but for matrix height 1e6.
or DDR3-1066 in quad channel with Xeon E5: 34.1GB/s).
The maximal memory throughput reached by the new ker-
nel (measured with matrix size 16× 5e5) was 114.7GB/s,
134.8GB/s and 143.7GB/s on GTX 570, on K20 without
shuffle and on K20 with suffle, respectively.
The results of the acceleration of the selected BLAS
level 2 operations of the Davidson algorithm on the Xeon
E5 + K20 architecture are summarized in Table 1. (On the
GTX 570 card the memory is too small to accelerate other
operations besides the projection.) Line 3 is accelerated
well as no extra communication is needed while the other
operations are either limited by the PCIe or the DDR3
throughput.
6.2. gemv()
The gemv() operation can be efficiently accelerated by
the standard CuBlas library even in case of asymmetric
matrices (see Figure 11). Unfortunately, merging of the
CPU and GPU results is slow on one CPU thread and
is the bottleneck of the acceleration. The implementation
can be improved by multithreaded merging to enable quad
channel memory or by computing everything on the GPU,
however, this is not always possible.
2 4 6 8 10 12 14 16 18 20 22 24
0
5
10
15
20
25
30
35
40
CuBlas::dgemv
CuBlas::dgemm
Intel MKL Xeon 
E5
DDR3 merge 
limit
Width of the matrix
Pe
rf
or
m
an
ce
 (G
Fl
op
s)
Figure 11: Performance of the gemv normal() operation of the avail-
able implementations in case of K20. Additionally DDR3 limit is
displayed, as in case of a CPU-GPU hybrid implementation, the
merge operation is limited by the DDR3 throughput.
7. Accelerating projection operation
The acceleration of the independent (AX)BT opera-
tions implementing the projection operation is based on
the observation that A and B matrices are already avail-
able before the Davidson algorithm starts and do not change
during the Davidson iterations. The necessary (AX)BT
operations are described by a list of operation records in
which each record contains all the necessary information
to compute an operation like Equation 11. For example,
it stores information from which segment of X (input) to
which segment of X ′ (output) the operation transforms.
The host side algorithm to handle the operation records
is summarized in Algorithm 5. During the construction of
the operation records the workload associated to each out-
put is computed. (Multiple operations can use the same
input or write the same output segment.) Next, the opera-
tion records are partitioned between CPU and GPU based
on the performance ratio of the two architectures. To avoid
merging of outputs all operation records corresponding to
the same output shall be computed on the same architec-
ture, however, to create a balanced workload partition-
ing, this is not always possible. During partitioning the
output associated to the largest workload is selected for
GPU iteratively as long as the desired workload ratio is
Table 1: Runtime of the accelerated matrix-vector operations of the
Davidson algorithm (see Algorithm 3)
Xeon E5 Xeon E5 + K20 speedup
Heisenberg
model
Line 3 20.21 4.64 4.36
Line 5 19.07 10.45 1.83
Line 12 20.05 5.94 3.38
Line 13 17.55 9.68 1.81
Hubbard
model
Line 3 113.80 22.66 5.02
Line 5 94.29 54.22 1.74
Line 12 114.00 37.67 3.03
Line 13 87.29 50.21 1.74
8
not exceeded. If the reached workload ratio is far from the
desired, the operation records of the output associated to
the next largest workload are partitioned between the two
architectures.
Algorithm 5 Host side algorithm to handle the operation
records
1: Create operation records and determine the workload
(FLOP) for each output.
2: Partition the operation records between CPU and
GPU based on their performance ratio and the out-
put workload statistics.
3: Selects scheduling strategy for the operations to be
computed on GPU.
4: Set-up the workload for GPU based on the selected
strategy.
After partitioning the proper scheduling strategy is se-
lected based on the memory requirements of the opera-
tion records. Three different strategies can be selected for
three different uses cases, however, currently only the first
two, more complex strategies have been implemented and
tested. The first strategy (4Streams) is designed for small
problem size, when all A, B, X, X ′ matrices and tempo-
rary matrices T for storing intermediate results can be held
in the GPU memory. The second strategy (NoStreams) is
designed for medium-sized problems, where all A, B and
X ′ matrices can be stored in GPU memory, but from X
and T only the processed matrices are allocated. In case
of extra-sized problems a third strategy (NoStreamAnd-
Storage) can be designed in which even A and B matrices
cannot be fully stored in the GPU memory. The mem-
ory footprint of the matrices in case of different strategies
are shown in Figure 12. In the demonstrated examples
the second strategy is sufficient for all DMRG iterations
as both GPU cards have enough memory.
The (AX)BT operations are implemented using the
cuBLAS libary [31], which is a BLAS implementation ded-
icated for Nvidia GPUs. In the demonstrated strategies
2
4
8
16
32
64
128
256
512
1024
2048
4096
4096
4096
0
0
0,01
0,1
1
10
100
1000
NoStreams
4Streams
Number of retained block states
M
em
or
y 
fo
o t
pr
in
t (
M
B
)
Figure 12: GPU memory footprints of the two strategies are com-
pared in case of the Heisenberg model.
two important features of the GPUs are exploited, which
are provided via the CUDA driver [32] and also accessible
through the cuBLAS library. The first feature is that mul-
tiple CUDA kernels can be executed simultaneously on the
GPU, while the second feature is that memory I/O oper-
ations can be executed in the background. From the as-
pect of programming both features can be accessed via the
CUDA streams. Streams are sequences of operations that
execute in issue-order, but operations in different streams
may run concurrently or interleaved.
In the 4Streams strategy there is enough GPU memory
to execute several (AX)BT simultaneously. One stream is
created for each output and operations corresponding to a
given output are assigned to the same stream to avoid in-
terference. For each stream a sufficiently large temporary
matrix is allocated to store the temporary result of AX.
CUDA operations are dispatched to hardware queues
in issue order [32]. To enable asynchronous concurrent
kernel execution in CUDA environment memory transfers
and kernels shall be issued in a depth-first order. Inside
the engine (kernel) queue an operation is dispatched if all
preceding calls in the same stream have been completed
and all preceding calls of the same queue have been dis-
patched. Consequently, to avoid blocking calls kernels of
the same streams shall not be issued immediately after
each other. As one (AX)BT operation consists of two ker-
nels, the kernel calls shall be separated and interleaved
with kernels of operations of other streams. To reach four
parallel streams (hence the name of the strategy) kernels
from four different stream shall be interleaved.
Algorithm 6 Grouping operation records in case of
4Streams strategy
1: function orderAndGroupRecords(records,
maxstream)
2: Sort records by input frequency.
3: setVisitedRecords.clear()
4: for each record i do
5: if i.stream ∈ setVisitedRecords then
6: for each record j following i do
7: if j.stream 6∈ in setVisitedRecords then
8: swap(i,j) and break
9: if i.stream is 6∈ in setVisitedRecords then
10: vecGroup.last().insert(i)
11: setVisitedRecords.insert(i.stream)
12: if setVisitedRecords.size()=maxstream
then
13: vecGroup.add(new Group)
14: setVisitedRecords.clear()
15: else
16: vecGroup.add(new Group)
17: vecGroup.last().insert(i)
18: setVisitedRecords.clear()
19: setVisitedRecords.insert(i)
return vecGroups
9
Overlapping of the transfer time of input segments with
kernel execution makes further constraints on the order of
the operation records: only those operation records shall
be issued which use already loaded input segments. To be
able to interleave different streams, it is favorable to load
the input segment first which is used by the most streams.
Algorithm 7 Dispatching operation records
1: for each group g do
2: for each record i in g do
3: Init copy of input segment Xi (when first used)
4: Init Ti = (AiXi)
5: for each record i in g do
6: Init X ′i = TiB
T
i
7: for each output segments of X ′ do Init copy back.
In case of 4Streams strategy the reordered operation
records are grouped (see Algorithm 6) such a way that
kernels belonging to the same group can be interleaved
(see Algorithm 7) in execution. First, operation records
are sorted to load the more frequently used input segments
earlier. Then, the records are iterated and each record
is potentially swapped backwards to create groups of four
consecutive operations belonging to four different streams.
In practice some technical constraints have been added to
slightly alter swapping behavior, which is not discussed
here for the sake of simplicity.
CUDA operations are launched according to the oper-
ation records as summarized in Algorithm 7. For the sake
of brevity the synchronization between the streams is not
shown as it can be implemented with CUDA events in a
straightforward way. The same code can be used for both
strategies as the NoStreams strategy can be represented by
groups which contain only one operation record. In case
of NoStreams strategy the preparation of the operation
records is much simpler and contains only the sorting by
input frequency to reach I/O overlap with computation.
The performance of the two strategies is compared in
Figures 13 and 14. Significant improvement can only mea-
sured at medium sized matrices (100-800 for GTX 570
and 100-1500 for K20), in which case several operations
shall be executed concurrently to keep all the CUDA cores
busy required for high performance. Slightly bigger gain
can be observed in case of K20 GPU which has 2496 Ke-
pler CUDA cores as opposed to GTX 570 having only 480
Fermi CUDA cores. Operations on large matrices (∼1500
for GTX 570 and ∼3000 for K20) provide enough work
for each CUDA core to approach the theoretical maxi-
mum double performance (180 GFlops for GTX570 and
1.17 TFlops for K20) without streams.
The two strategies are also compared by the run-time
of the simulated models in Table 2. In case of K20 the
concurrent kernel execution has a slightly greater benefit,
however, in both models operations on larger matrices,
where concurrency has no benefit, dominates the run-time.
0 200 400 600 800 1000 1200 1400 1600
40
60
80
100
120
140
160
180
200
128
256
512
1024
2048
4096 Strategy 
“4Streams”
Strategy 
“NoStreams”
Intel Core-i7, 
MKL::dgemm
GTX 570,  
CuBlas::dgemm, 
no streams
GTX570, 
CuBlas::dgemm, 
4 streams
Matrix size
Pe
rf
or
m
an
c e
 (G
Fl
op
s)
Figure 13: GTX 570, Heisenberg model: Performance of the two
strategies is compared. Additionally, the performance of CuBLAS
and MKL dgemm() in reference measurements is displayed as the
function of matrix size. Labels indicate the number of retained block
states at the displayed DMRG iterations.
0 200 400 600 800 1000 1200 1400 1600
50
150
250
350
450
550
650
750
850
950
1050
256
512
1024
2048
4096 Strategy 
“4Streams”
Strategy 
“NoStreams”
Intel Xeon E5, 
MKL::dgemm
K20, 
CuBlas::dgemm, 
no streams
K20, 
CuBlas::dgemm, 
4 streams
Matrix size
Pe
rf
or
m
an
c e
 (G
Fl
op
s)
Figure 14: Similar to Figure 13 but on K20 architecture.
In models where more symmetries are enabled the size of
the matrices tends to be smaller, consequently, in these
models the concurrency also tends to be more effective.
The performance results of the full projection compu-
tation including both CPU and GPU computations are
shown in Figures 15, 16, 17 and 18. The quality of the
acceleration is highly affected by the applied workload
ratio which depends on the performance ratio of CPU
and GPU at the given matrix size. In the configuration
file different ratios can be set for different matrix sizes
and in each DMRG iteration the user-defined ratio is se-
lected according to the average matrix size of the opera-
Table 2: Total time of strategies is compared
Model
NoStream
(sec)
4Streams
(sec)
decrease
GTX570
Heisenberg 671.54 652.58 2.82%
Hubbard 2980.27 2957.82 0.75%
K20
Heisenberg 244.67 227.33 7.09%
Hubbard 1056.33 1012.56 4.14%
10
32
64
128
256
512
1024
2048
4096
4096
4096
0
50
100
150
200
250
300
50%
55%
60%
65%
70%
75%
80%
85%
90%
GPU
Workload
GTX570 + 
Core-i7
GPU
Contribution
CPU
Contribution
Number of retained block states
Pe
rf
or
m
an
c e
 (G
Fl
op
s)
Pe
rc
en
ta
ge
 o
f t
he
 to
ta
l w
or
kl
oa
d
Figure 15: GTX 570, Heisenberg model: Performance results of the
hybrid CPU-GPU acceleration of the projection operation. Blue
bars associated to the secondary vertical axis indicate the ratio of
the current GPU workload.
16
64
256
1024
4096
4096
4096
4096
4096
4096
4096
4096
4096
0
50
100
150
200
250
300
30%
40%
50%
60%
70%
80%
90%
100%
110%
GPU
Workload
GTX570 + 
Core-i7
GPU
Contribution
CPU
Contribution
Number of retained block states
Pe
rf
or
m
an
c e
 (G
Fl
op
s)
Pe
rc
en
ta
ge
 o
f t
he
 to
ta
l w
or
kl
oa
d
Figure 16: Similar to Figure 15 but for the Hubbard model on GTX
570.
tion records. If the workload is properly distributed 257.8
GFlops (×3.2 speed-up) and 1071.1 GFlops (×6.1 speed-
up) can be reached on GTX 570 and on K20, respectively.
8. Limits of FPGA implementation
Field programmable gate arrays (FPGAs) are prog-
rammable integrated circuits used to realize reconfigurable
digital circuits via configurable logic resources and rout-
ing. Although originally developed for telecommunica-
tion and digital signal processing applications novel high
performance FPGAs (featuring high-speed embedded re-
sources, plentiful on-chip memory, high-level programming
tools and significantly lower power consumption than CPU
or GPU) are promising candidates for high performance
computing (HPC).
To estimate the performance of an FPGA implementa-
tion of the DMRG method the acceleration of the pro-
jection operation expressed as a series of dense matrix
multiplications (see Equation 11) shall be investigated.
The floating-point matrix-matrix multiplication was al-
ready implemented [33] on FPGA very efficiently using the
32
64
128
256
512
1024
2048
4096
4096
4096
0
200
400
600
800
1000
1200
30%
40%
50%
60%
70%
80%
90%
100%
110%
GPU
Workload
K20 + Xeon 
E5
GPU
Contribution
CPU
Contribution
Number of retained block states
Pe
rf
or
m
an
c e
 (G
Fl
op
s)
Pe
rc
en
ta
ge
 o
f t
he
 to
ta
l w
or
kl
oa
d
Figure 17: Similar to Figures 15 and 16 but for the Heisenberg model
on K20.
1024
4096
4096
4096
4096
4096
4096
4096
4096
4096
0
200
400
600
800
1000
1200
30%
40%
50%
60%
70%
80%
90%
100%
110%
GPU
Workload
K20 + Xeon 
E5
GPU
Contribution
CPU
Contribution
Number of retained block states
Pe
rf
or
m
an
c e
 (G
Fl
op
s)
Pe
rc
en
ta
ge
 o
f t
he
 to
ta
l w
or
kl
oa
d
Figure 18: Similar to Figures 15, 16 and 17 but for the Hubbard
model on K20.
rank-1 update scheme. Kumar et al. demonstrated that
the performance is not limited by the PCIe bandwidth,
which connects the FPGA to the host CPU, and nearly
full utilization of the processing elements can be reached.
The idea behind the rank-1 update approach is that
instead of inner products between the rows of the left
matrix and the columns of the right matrix outer prod-
ucts between the columns of the left matrix and the rows
of the right matrix are carried out and resulting matri-
ces are summarized. The advantage of the approach is
that instead of multiply–accumulate operations (MACCs)
multiply-add operations (MADDs) are used, which are in-
dependent from each other and can be pipelined to reach
high processing element utilization.
Assuming the previously described design the process-
ing elements can be fully utilized and the following best-
case estimations can be made according to the area re-
quirements of the floating-point units. In an ideal case at
most 114 or 193 multiply-add units can be implemented
on the largest Virtex-6 ( XC6VSX475T ) and Virtex-7
(XC7VX1140T) FPGAs, respectively. The estimated clock
frequency of the architectures are 437.82 and 443.65MHz
which would result in 99.82 and 171.2 GFLOPS comput-
ing performance. This performance achievement can be
11
compared to the performance of the mid-range GTX 570
GPU used in the paper. As the development time in case
of FPGA is still much longer than in case of GPU and the
high-end GPUs could significantly outperform the FPGA
in this problem class, the GPU architecture is the better
candidate for the acceleration.
9. Implementation Results
The implemented algorithm has been tested both on
a mid-range (Intel Core-i7 2600 3.4 GHz CPU + NVidia
GTX 570 GPU) and on a high-end configuration (Intel
Xeon E5-2640 2.5 GHz CPU + NVidia K20 GPU); the
results are displayed in Table 3 and 4, respectively. All
CPU-only measurements have been executed with multi-
threading enabled (4 threads on Core-i7 and 6 threads on
Xeon E5). The mid-range configuration with GPU is ap-
proximately 2.3-2.4 times faster than without GPU, while
the high-end configuration is accelerated by 3.4-3.5 times
using the GPU. However, a change from a mid-range, mul-
tithreaded CPU to a high-end CPU+GPU configuration
can produce 6.5-7 times acceleration.
Table 3: Heisenberg model: final timings compared
Time(sec)
Speed-up compared to
Core-i7 Xeon E5
Core-i7 1489.64 1 0.53
Core-i7 + GTX 570 652.58 2.28 1.21
Xeon E5 789.65 1.89 1
Xeon E5 + K20 227.33 6.55 3.47
Table 4: Hubbard model: final timings compared
Time(sec)
Speed-up compared to
Core-i7 Xeon E5
Core-i7 7210.72 1 0.48
Core-i7 + GTX 570 2957.82 2.44 1.16
Xeon E5 3433.16 2.10 1
Xeon E5 + K20 1012.56 7.12 3.39
To support the comparison of the results of the two
investigated models the key parameters affecting compu-
tational complexity are summarized in Table 5. Using the
same number of retained block states the Hubbard model
has larger values for all key parameters except the maxi-
mum sector size and the maximum matrix size. In case of
the Hubbard model more symmetries are exploited which
results in smaller sectors and, consequently, smaller ma-
trices.
In case of K20 the acceleration of the projection and
the matrix-vector operations is compared in Figures 19
and 20. The projection is accelerated by 5.7 times which
is in accordance with the theoretical performance capabili-
ties of the two architectures. Currently on Xeon processor
(see Figure 3) the projection operation is only accounted
Table 5: Model comparison in case of Xeon E5 + K20.
Heisenberg Hubbard ratio
Time(s) 244.67 1067.89 4.36
Flop 1.22E+014 4.89E+014 4.01
Max HSB size 12.24E+06 15.32E+06 1.25
Max Sector size 4.00E+06 3.47E+06 0.87
Average number of
sectors
9.36 50.71 5.42
Max matrix size 1704.23 1145.24 0.67
Peak GPU memory
footprint
950.47 1155.48 1.22
Average number of
Davidson iterations using
random starting vector
60.79 122.43 2.01
for 75% of the total run-time, therefore, the overall ac-
celeration is also affected by the rest of the operations
of the Davidson algorithm. Fortunately, as the number
of retained states (m) increases the time-dominance of the
projection also increases, which anticipates even better ac-
celeration for real-world simulations with large m.
As the acceleration of the full Davidson algorithm can
be limited by the GPU memory, an adaptive solution shall
be implemented which accelerates as much of the algo-
rithm as possible. Currently four matrix-vector operation
of the algorithm is accelerated in case of sufficient GPU
memory, however, latter acceleration of the rest of the op-
erations will be implemented as well.
10. Conclusion
In this paper acceleration of the DMRG algorithm us-
ing novel kilo-processor architectures (GPU, FPGA) has
been investigated. The GPU architecture has been found
to a promising accelerator, as the most time-dominant step
of the algorithm, the projection operation, can be formu-
lated as independent dense matrix multiplications, which
Xeon E5 Xeon E5 +K20
0
100
200
300
400
500
600
700
Davidson H_SB*V
Davidson GEMV
Davidson rest
not DavidsonT
im
e(
s)
x5.7
x2.45
x1.04
x0.99
Figure 19: K20, Heisenberg model: Acceleration of different parts of
the algorithm is compared for m = 4096.
12
Xeon E5 Xeon E5 +K20
0
500
1000
1500
2000
2500
3000
Davidson H_SB*V
Davidson GEMV
Davidson rest
not DavidsonT
im
e(
s)
x5.69
x2.43
x0.92
x1.01
Figure 20: K20, Hubbard model: Acceleration of different parts of
the algorithm is compared for m = 4096.
are ideal workload for GPUs. Moreover, in case of high-
end GPUs the acceleration of the projection is so remark-
able, that it is worth to consider the acceleration of the
rest of the algorithm to obtain a decent overall speed-
up. In the presented implementation some asymmetric
matrix-vector multiplications of the diagonalization have
been identified as the second most time-dominant part of
the algorithm and a new CUDA kernel has been designed
to efficiently accelerate these operations. The resulting
parallelized DMRG implementation is a hybrid CPU-GPU
solution, which distributes the workload according to the
performance and memory capabilities of the configurations
and anticipates a straightforward multi-GPU extension,
which is part of our current developments.
In subsequent works our DMRG implementation will
be utilized to investigate more complex models focusing
on physical problems of current interests, which are hard
to treat due to their high computational demands. For
example, high-spin fermionic systems relevant for ultra-
cold atomic experiments and extensions to treat ab-initio
quantum chemical applications [34, 35, 36] which are al-
ready under progress. Furthermore, a straightforward gen-
eralization of the presented algorithm to accelerate tensor
network states (TNS) algorithms [37] is another promising
research direction.
In general, FPGA chips have lower operating frequen-
cies than in case of GPU architectures and the attached
on-board memories are also smaller and slower. FPGAs
can outperform other architectures in such problems where
a custom arithmetic unit can reach a significantly better
utilization, however, in case of dense matrix operations of
the DMRG algorithm nearly ideal utilization of the CUDA
cores is reached. As the estimated performance of the con-
sidered high-end FPGAs is around the mid-range GPUs
and the development time for FPGA is significantly longer,
the GPU architecture is preferred for the acceleration of
the DMRG algorithm.
Acknowledgement
This research was supported by TA´MOP-4.2.2.C-11
/1/KONV-2012-0004, TA´MOP-4.2.1./B-11/2/KMR-2011-
002, TA´MOP-4.2.2./B-10/1-2010-0014 and the Hungarian
Research Fund (OTKA) under Grant Nos. NN110360,
K100908 and K84267.
References
[1] S. R. White, Density matrix formulation for quantum renormal-
ization groups, Phys. Rev. Lett. 69 (1992) 2863–2866.
[2] R. M. Noack, S. R. Manmana, Diagonalization and Numerical
Renormalization-Group-Based Methods for Interacting Quan-
tum Systems, AIP Conf. Proc. 789 (2004) 93–163.
[3] U. Schollwo¨ck, The density-matrix renormalization group, Rev.
Mod. Phys. 77 (2005) 259–315.
[4] O¨. Legeza, R. Noack, J. So´lyom, L. Tincani, Applications
of quantum information in the density-matrix renormalization
group, in: Computational Many-Particle Physics, Vol. 739 of
Lecture Notes in Physics, Springer-Verlag, Berlin Heidelberg,
2008.
[5] W. Barford, Electronic and Optical Properties of Conjugated
Polymers, Oxford University Press, 2005.
[6] G. Barcza, O¨. Legeza, K. H. Marti, M. Reiher, Quantum-
information analysis of electronic states of different molecular
structures, Phys. Rev. A 83 (2011) 012508.
[7] M. Lewenstein, A. Sanpera, V. Ahufinger, B. Damski,
A. Sen(De), U. Sen, Ultracold atomic gases in optical lattices:
mimicking condensed matter physics and beyond, Advances in
Physics 56 (2) (2007) 243–379.
[8] G. Hager, E. Jeckelmann, H. Fehske, G. Wellein, Parallelization
strategies for density matrix renormalization group algorithms
on shared-memory systems, Journal of Computational Physics
194 (2) (2004) 795 – 808.
[9] G. Alvarez, Implementation of the su(2) hamiltonian symme-
try for the {DMRG} algorithm, Computer Physics Communi-
cations 183 (10) (2012) 2226 – 2232.
[10] G. K.-L. Chan, An algorithm for large scale density matrix
renormalization group calculations, The Journal of chemical
physics 120 (2004) 3172.
[11] Y. Kurashige, T. Yanai, High-performance ab initio density ma-
trix renormalization group method: Applicability to large-scale
multireference problems for metal compounds, J. Chem. Phys.
130.
[12] S. Yamada, T. Imamura, M. Machida, Parallelization design on
multi-core platforms in density matrix renormalization group
toward 2-d quantum strongly-correlated systems, in: High Per-
formance Computing, Networking, Storage and Analysis (SC),
2011 International Conference for, 2011, pp. 1–10.
[13] J. Rinco´n, D. J. Garc´ıa, K. Hallberg, Improved parallelization
techniques for the density matrix renormalization group., Com-
puter Physics Communications 181 (8) (2010) 1346–1351.
[14] E. M. Stoudenmire, S. R. White, Real-space parallel density
matrix renormalization group, Phys. Rev. B 87 (2013) 155137.
[15] J. Yu, H.-C. Hsiao, Y.-J. Kao, Gpu accelerated tensor contrac-
tions in the plaquette renormalization scheme, Computers &
Fluids 45 (1) (2011) 55 – 58.
[16] M. J. Cawkwell, E. J. Sanville, S. M. Mniszewski, A. M. N.
Niklasson, Computing the density matrix in electronic struc-
ture theory on graphics processing units, Journal of Chemical
Theory and Computation 8 (11) (2012) 4094–4101.
[17] F. Gebhard, The Mott Metal-Insulator Transition: Models and
Methods, Springer, 1997.
[18] P. W. Anderson, The Theory of Superconductivity in the High-
Tc Cuprate Superconductors, Princeton University Press, 1997.
[19] A. I. To´th, C. P. Moca, O¨. Legeza, G. Zara´nd, Density matrix
numerical renormalization group for non-abelian symmetries,
Phys. Rev. B 78 (2008) 245109.
13
[20] J. F. Cornwell, Group Theory in Physics, An Introduction, Aca-
demic Press, 1997.
[21] O¨. Legeza, J. Ro¨der, B. A. Hess, Controlling the accuracy of the
density-matrix renormalization-group method: The dynamical
block state selection approach, Phys. Rev. B 67 (2003) 125114.
[22] Intel, Math kernel library 11.0, http://http://software.
intel.com/en-us/intel-mkl (2013).
[23] Y. Saad, Numerical Methods for Large Eigenvalue Problems,
Manchester University Press, Manchester, 1992.
[24] B. Liu, The simultaneous expansion for the solution of several of
the lowest eigenvalues and corresponding eigenvectors of large
real-symmetric matrices, Numerical Algorithms in Chemistry:
Algebraic Method (1978) 49–53.
[25] G. G. Sleijpen, H. Van der Vorst, A jacobi–davidson iteration
method for linear eigenvalue problems, SIAM Journal on Matrix
Analysis and Applications 17 (2) (1996) 401–425.
[26] E. R. Davidson, The iterative calculation of a few of the low-
est eigenvalues and corresponding eigenvectors of large real-
symmetric matrices, Journal of Computational Physics 17 (1)
(1975) 87 – 94.
[27] M. Sadkane, R. Sidje, Implementation of a variable block david-
son method with deflation for solving large sparse eigenprob-
lems, Numerical Algorithms 20 (2-3) (1999) 217–240.
[28] Netlib repository, https://netlib.org (2013).
[29] Cuda c programming guide 5.0,
http://docs.nvidia.com/cuda/cuda-c-programming-
guide/index.html (2013).
[30] Kepler tuning guide 1.0, http://docs.nvidia.com/cuda/
kepler-tuning-guide/index.html (2013).
[31] CUBLAS library 5.0, https://developer.nvidia.com/cublas
(2013).
[32] CUDA library 5.0, http://www.nvidia.com/object/cuda
(2013).
[33] V. Kumar, S. Joshi, S. Patkar, H. Narayanan, Fpga based
high performance double-precision matrix multiplication, Inter-
national Journal of Parallel Programming 38 (3-4) (2010) 322–
338.
[34] O¨. Legeza, T. Ro¨hwedder, R. Schneider, Numerical approaches
for high-dimensional PDE’s for quantum chemistry, in: Ency-
clopedia of Applied and Computational Mathematics, Springer,
2013.
[35] K. Marti, M. Reiher, The density matrix renormalization group
algorithm in quantum chemistry, Z. Phys. Chem. 224 (2010)
583–599.
[36] G. K.-L. Chan, Low entanglement wavefunctions, Wiley Inter-
disciplinary Reviews: Computational Molecular Science 2 (6)
(2012) 907–920.
[37] R. Orus, A Practical Introduction to Tensor Networks: Matrix
Product States and Projected Entangled Pair States, ArXiv e-
printsarXiv:1306.2164.
14
