In the numerical analysis of strongly correlated quantum lattice models one of the leading algorithms developed to balance the size of the effective Hilbert space and the accuracy of the simulation is the density matrix renormalization group (DMRG) algorithm, in which the run-time is dominated by the iterative diagonalization of the Hamilton operator. As the most time-dominant step of the diagonalization can be expressed as a list of dense matrix operations, the DMRG is an appealing candidate to fully utilize the computing power residing in novel kilo-processor architectures.
Introduction and Related Works
DMRG is a variational numerical approach developed to treat low-dimensional interacting many-body quantum systems efficiently [1, 2, 3] . In fact, it has become an exceptionally successful method to study the low energy physics of strongly correlated quantum systems which exhibit chain-like entanglement structure [4] . For example, it can be applied to simulate properties of anisotropic materials, such as polymers [5] , or to describe accurately the electronic structure of open d shell molecules [6] , which is beyond the capability of standard quantum chemical approaches. Additionally, the interacting system of atoms trapped in an optical lattice, proposed as physical implementation of quantum computer, is also tractable via DMRG [7] .
The original DMRG algorithm [1] was introduced in 1992 by Steven R. White and was formulated as a single threaded algorithm. In the past various works have been carried out to accelerate the DMRG algorithm on shared [8] [9] and distributed memory [10, 11, 12, 13] architectures, however, none of them took advantage of recent kiloprocessor architectures: graphical processing unit (GPU) and field-programmable gate array (FPGA).
One of the first parallelizations was [8] converting the projection operation to matrix-matrix multiplications and accelerating them via OpenMP interface. In [12] a similar approach was presented for distributed memory environment (up-to 1024 cores) optimizing the communication between the cores, while in [13] the acceleration of the computation of correlation function had been investigated. Recently, [9] presented an acceleration on shared memory architectures exploiting SU(2) symmetries, while [14] proposed a novel direction for paralellization via a modification of the original serial DMRG algorithm.
Graphical processing unit has been successfully employed in neighboring research areas to accelerate matrix operations. In [15] GPU is used to accelerate tensor contractions in plaquette renormalization states (PRS), which can be regarded as an alternative technique to tensor product states (TPS) or the DMRG algorithm. In [16] the second-order spectral projection (SP2) algorithm has been accelerated, which is an alternative technique to calculate the density matrix via a recursive series of generalized matrix-matrix multiplications.
In this paper we present the first attempt (to our best knowledge) to investigate how the DMRG method can utilize the enormous computing capabilities of novel kiloprocessor architectures (GPU, FPGA). In case of GPU a smart hybrid CPU-GPU acceleration is presented, which tolerates problems exceeding the GPU memory size, consequently, supporting wide range of problems and GPU configurations. Contrary to the previous acceleration attempts not only the projection operation is accelerated, but further parts of the diagonalization are also computed on the GPU. In case of FPGA the performance limits of a possible implementation are estimated and discussed.
The rest of the paper is organized as follows. Section 2 describes the models which are used as test cases to demonstrate the operation of the algorithm. Symmetries which can be exploited to decrease the computational requirements of the algorithm and the algorithm itself are presented in Sections 3 and 4, respectively. Acceleration on GPU is presented in three sections (5, 6 and 7), while limits of an FPGA implementation are described in Section 8. Finally, implementation results and conclusions are given in Sections 9 and 10, respectively.
Investigated models
In order to illustrate the underlying features of the algorithm it is applied to the so-called spin- Hubbard model. The selected models describe how to compute the Hamiltonian of the system of interest, while the main task is to find some of the lowlying eigenvalues and eigenvectors of the Hamiltonian by a diagonalization algorithm. In practice instead of solving the problem for the complete Hilbert space directly, various physical phenomena can be exploited to reduce the complexity of the problem.
Heisenberg model
The Heisenberg model describes the physics of magnetic systems and provides theoretical description of various experimental measurements. In the model a magnetic system is simulated on a lattice of interacting spins. A microscopic magnetic moment (spin) is localized at each lattice site j and described by a quantized, two-valued variable, σ j ∈ {↑, ↓}, related to the two possible orientations of the spin. Limiting the interactions to only neighboring spins -which is often a good approximation -the Hamiltonian of the model is written as
where S + j , S − j operators change, while S z j measures the orientation of the spin on lattice site j. The overall behavior of the system can be tuned via the relevant parameter ∆. The explicit matrix representation of an operator O j acting on site j of a chain with N spins is given as
where I is the identity and O is one of the followings
The Hamiltonian of N spins acts on the tensor product space of dimension 2 N , that is the dimension of the complete Hilbert space grows exponentially as the size of the system increases.
Hubbard model
The Hubbard model was introduced to describe electrons in solids to characterize the transition between insulating and conducting systems. The single-band Hubbard model provides appropriate description of low temperature systems where all particles are in the lowest Bloch band and the long-ranged interactions between the particles can be neglected due to strong screening effects [17] . More recently various multi-band Hubbard models have been applied to high-temperature superconductivity [18] and systems of higher spin to understand the behavior of optically trapped ultracold atoms [7] .
In the general spin-F system each lattice site is characterized by 2F + 1 two dimensional vectors. Each vector is assigned with a distinct label (from {−F, −F + 1, ..., F − 1, F }) called spin polarization value (denoted by σ). A vector assigned to a spin polarization σ describes two orthogonal states: the site is occupied ([0; 1]) by the particle of spin polarization σ or not ([1; 0]). As a consequence, a lattice site of spin-F possesses 2 2F +1 internal degrees of freedom.
The lattice model of interacting particles of spin-F consists of two competing terms: the kinetic term, which describes the tunneling of particles between neighboring lattice sites, and the local potential term, which describes on-site density-density interaction measuring the attraction or repulsion between the interacting particles. The single-band, fermionic Hubbard model of spin-F is defined on a chain with N sites as
where t measures the hopping amplitude between neighboring sites and U is the interaction strength. Creation and annihilation operator acting on site j with spin polarization σ, denoted as c † j,σ and c j,σ , adds or removes a particle located on site j with spin polarization σ. The particle density of spin polarization σ on site j is measured by operator n j,σ = c † j,σ c j,σ . The explicit matrix representation of an operator O j,σ acting on site j and polarization σ is constructed as
where F = 2F + 1, I is the identity, Φ is the fermionic phase-factor and O is one of the followings
The Hamiltonian describing the spin-F system of N lattice sites acts on the tensor product space of dimension 2 F N , and similarly to the Heisenberg model, the dimension of the complete Hilbert space blows up exponentially. Comparing to the bosonic operators of the Heisenberg model, the key differences in the construction of operators are the appearance of internal quantum number, σ, and the presence of the phase-factor describing the antisymmetric nature of fermionic systems. To ease the comparison of the two models only the F = 1 2 case is presented, however, the observed tendencies are valid for higher F values.
Symmetries to be exploited
In many systems the Hamilton operator does not change the value of a measurable quantity, i.e., it commutes with the operator connected to that measurable quantity. These operators are called symmetry operators and can be used to cast the Hilbert space to smaller independent subspaces [19] . Consequently, instead of solving a large matrix eigenvalue problem, the eigenvalue spectrum can be determined by solving several smaller problems. In the Heisenberg model the total spin projection, S z = N j=1 S z j , is such a symmetry operator. Meanwhile, in the Hubbard model of spin-F the total particle number associated to each spin polarization σ, N σ = N j=1 n j,σ , is conserved. Thus, the distinct quantum numbers helps to partition the Hilbert space into multiple independent subspaces corresponding to a given combination of quantum number values.
A given symmetry operator shares the same eigenvectors of the Hamiltonian, thus the eigenstates of the Hamiltonian can be labeled by the eigenvalues of the symmetry operator (quantum number, Q), and the Hilbert space can be decomposed into subspaces (sectors) spanned by the eigenvectors of each quantum number value [20] . Introducing a quantum number based representation, the sparse operators (Eqs. 2, 5) can be decomposed to a set of smaller but dense matrices, furthermore the Hamiltonian operator (Eqs. 1, 4) becomes blockdiagonal.
Algorithm
The DMRG approach has two phases, in the infinitelattice algorithm the approximated Hilbert space of a finite system of N interacting spins is built up iteratively, while in the optional finite-lattice algorithm the number of the interacting spins is fixed and further iterations are carried out to increase the accuracy of the computed results. As in both cases the iterations are very similar, for the sake of simplicity, we consider only the infinite-lattice algorithm. The detailed description of the algorithm can be found in the original work [1] and various reviews [2, 3] , here only the key steps of an iteration of the infinite-lattice algorithm are summarized in Algorithm 1 providing the basis of our analysis. Construct the density matrix for the given block from the lowest eigenstate.
6:
Compute the eigenvalues of the density matrix. (Lanczos method)
7:
Renormalize the basis of the block by keeping states with high eigenvalues.
In the two-site DMRG procedure four subsystems (left block describing l sites, 1 site, 1 site, right block describing r sites) compose the finite system of N = (l + 2 + r) sites called superblock. The sites contained in each block are described maximally by m, optimally chosen states, which can be significantly smaller than the exactly required q l or q r basis, where q is the degree of freedom of one site. As the central sites of the superblock are represented exactly by q-q states, the size of the superblock Hilbert space is q 2 m 2 . Considering, however, the symmetries mentioned above, the problem can be restricted to a subspace of the superblock corresponding to a particular Q value. E.g., in case of Heisenberg and Hubbard models the size of the superblock Hilbert space can be reduced significantly as demonstrated in Figures 1 and 2 , respectively. It is, however, clear that even using symmetry operators the dimension of the reduced space grows exponentially with the size of the lattice (if no truncation is done).
The infinite-lattice algorithm starts with the four site configuration, where each block contains a single spin. In Hubbard model the Hilbert space of the superblock can be restricted to the subspace corresponding to Q=[
each iteration step both blocks are enlarged by a single site, making the complete system increase by two, until the desired system size, N , is reached. In each iteration of the DMRG algorithm, the lowest-lying eigenvector of the corresponding superblock Hamiltonian (H SB ) is obtained by the iterative Davidson or Lanczos algorithm. (In the paper the Davidson algorithm is considered.) From the lowest eigenstate the density matrix is constructed which carry the information how to optimally truncate the basis of the enlarged block (m q l+1 ) in order to keep the problem size manageable [21] .
The most time-consuming part of a full iteration is the step of the Davidson routine which carries out the projection operation (X = H SB X). Instead of constructing and storing the enormous H SB matrix of size O m 4 explicitly, it is computationally favorable to obtain the projected vector X directly from the matrices of size O m 2 composing H SB .
The H SB can be directly expressed by the operators of the original four subsystems (l -1-1-r strategy) or by the operators of two intermediate systems (LR strategy), so called enlarged blocks, which come from the contraction of each block with its neighboring site. In the current implementation only the second strategy is investigated, however, the first one is also straightforward and will be included in the near future.
There are several practical benefits of these strategies. First of all, the memory bandwidth limited matrix-vector multiplication (BLAS Level 2) is converted to matrix-matrix multiplication (BLAS Level 3) which can be efficiently accelerated. Secondly, skipping of the explicit Kronecker multiplications not only restructures the computation, but decreases the number of operations. Finally, both strategies drastically decrease the size of the matrices which take part in the operations and thus the memory footprint of the algorithm. In case of LR and l -1-1-r strategy the largest matrix has a size of O((mq)
2 ) and O((m) 2 ), respectively. The second strategy is more favorable in extreme situations when the GPU memory is limited and q (internal degrees of freedom) is large (e.g. spin-F Hubbard model with large F ).
LR strategy
In the LR strategy the H SB is expressed with operators A (L) α and B (R) α defined on the left (L := l + 1) and right (R := r + 1) enlarged blocks, respectively, as
where the index α iterates over the distinct operator combinations required to construct the superblock Hamiltonian. Exploiting Kronecker multiplication properties, the projected vector X can be computed by matrix-matrix multiplications asX
where Algorithm 2 The computation of the projected vector X in case of l -1-1-r strategy strategy.
for each (i, j) do X 1 (:, :, i, j)=DX1(:, :, i, j)
4:
for each (i, j) do X 1 (:, :, i, j)=X 1 (:, :, i, j)C T 5:
for each (i) do X 2 (:, :, i)=X2(:, :, i)B 
8:
return X
In the practical implementation Equation 10 operates on even smaller matrices as the operators are decomposed according to quantum numbers. Instead of a sparse matrix
qi→qj are stored representing how A (L) transforms the subspace (sector) corresponding to q i to the one corresponding to q j . To compute
α,q k →q l transition pairs shall be submitted to Equation 10 and each time only the corresponding ik and jl segments of X and X shall be used as (11) whereX ik andX jl indicate the reshaped ik and jl segment of vector X and X , respectively. Fortunately, the reshape operation has no computational cost as the data in the memory is untouched and only the row/col sizes are changing. 
l-1-1-r strategy
In the l -1-1-r strategy the H SB is expressed by the operators of the four subsystems:
where the index α again iterates over the distinct operator combinations required to construct the superblock Hamiltonian.
Similarly to the LR strategy, Kronecker multiplication properties can be exploited to compute the projected vector X efficiently with matrix-matrix operations, however, in this case a more complicated data storage and several tensor multiplications are needed to avoid unnecessary memcopy operations. Using the procedure projectX l11r(), which computes the projected vector X for one matrix quadruplet and is described in Algorithm 2, the H SB can be calculated as
In the similar manner as shown in LR strategy, A (l) , a, b and B (r) operators can be decomposed according to quantum numbers and instead of large sparse matrix operations several smaller dense matrix operations shall be submitted to Algorithm 2. Furthermore, none of the reshape operations of Algorithm 2 involves practical data movement, only the size descriptor variables are changing.
Runtime & Parallelism
In case of the Heisenberg and Hubbard model the runtime analysis of the DMRG algorithm is shown in Figures 3  and 4 , respectively. As the Davidson algorithm, which is summarized in Algorithm 3, is the most time-dominant part and takes more than 97% of the total time in the CPU-only reference implementation, it has been chosen for acceleration. Unfortunately, the full Davidson algorithm cannot be implemented on the GPU as the problem size in real world simulations usually exceeds the GPU memory size. Instead, a hybrid approach shall be implemented, which can adjust the GPU workload according to the available GPU memory and the CPU-GPU performance ratio. 
[λ, y] ←− get smallest eigvalue and vector of B
5:
if norm(r) ≈ 0 then // orthonormalize r against V :
12:
normalize r and append to V
15:
return without success
In the Davidson algorithm inherent parallelism can be observed at two levels. First, at low level, all the matrix and vector operations can be accelerated. Secondly, at the level of projection computation (line 2 in Algorithm 3), which is the most time-dominant part of the Davidson algorithm itself taking more than 75% of the total time, the projection can be computed as a sum of the independent (AX)B
T operations (see Equation 10 ). At low level, the CPU part of the algorithm uses the Basic Linear Algebra Subroutine (BLAS) interface and the Intel MKL Library [22] for algebraic operations including operator contractions, inner operations of both Davidson and Lanczos algorithms [23] and operator transformations. Unfortunately, in the Davidson algorithm, all the opera- tions except the projection are BLAS level 2 matrix-vector multiplications, which are bandwidth limited and not ideal for acceleration. There is block extension [24] of the algorithm, the so called Davidson-Liu, to determine a few of the lowest eigenvalues, where more than one candidate vectors are added at once resulting in BLAS level 3 operations, however, in the current DMRG implementation only the lowest eigenvalue is investigated. The remaining option is to store as much data in GPU memory as possible and execute the corresponding operations on GPU.
At the level of projection operation the independence of matrix multiplications provides a straightforward hybrid parallelization and a future multi-GPU modification of the current implementation. Acceleration can be improved by developing the appropriate scheduling of the matrix operations for different matrix sizes and architectures.
Accelerating matrix-vector multiplications
Jacobi-Davidson version [25] of the original Davidson algorithm [26] is a preconditioned subspace iteration technique [23] aimed at computing a few of the extreme eigenpairs of large sparse symmetric matrices and commonly used in the DMRG implementations [2, 3] . In the presented work the [27] version of the algorithm (available in Netlib [28] ) is implemented.
In each iteration the subspace is extended with a new base vector (V (:, i)), which is stored in the memory accompanied by its projection (W (:, i) ). As the size of these vectors can be very large (see Figures 1, 2 ) depending on the model and the number of retained block states (m), they cannot be fully stored in the GPU memory. However, in order to accelerate a matrix-vector multiplication with GPU, at least the matrix shall be stored in the GPU memory. In the current implementation the matrix of the basis vectors (V ), which is used four times (see comments in Algorithm 3) in BLAS level 2 operations, has been se- lected to be stored, although the storage of the projected vector matrix (W ) can be added later as well.
In each iteration the new basis vector is loaded to the GPU memory in the background (if there is available space) and the workload of the BLAS level 2 operations is shared between the CPU and GPU: CPU process the new basis vector, while GPU operates on the older ones. With this technique the power of both CPU and GPU can be exploited and the transfer time of the matrix can be hidden. The implementation is flexible: if there is no more space on the GPU or the CPU performance justifies it, more than one base vector can be processed on the CPU leaving less work for GPU. As shown in Figure 5 , where the storage requirement of V is compared with available free space after the projection operation, V cannot be fully stored on a GTX 570 GPU even in case of the simple Heisenberg model (m = 4096).
There are two types of BLAS level 2 operations: V T X and V X indicated by gemv trans() and gemv() in Algorithm 3, respectively. In the first case the multiplier vector can be loaded while in the second case the result can be written in smaller parts (∼5e5) to overlap with the computation.
gemv trans()
In case of gemv trans() both MKL and CuBlas libraries give poor performance for the special asymmetric matrix size (∼5e5x20) required in our application (see Figures 6, 7, 8, 9 and 10) , therefore, a new CUDA kernel has been designed. The presented results are measured without data communication, as in case of line 3 the multiplier vector is already in the GPU memory providing ideal acceleration. To estimate the performance of line 12, where the multiplier vector has to be loaded, the limiting factor of the PCIe 2.0 is also displayed. In this case the overall performance cannot exceed the PCIe limit even with overlapped communication. In both cases the estimation of the overall acceleration shall be carried out in an integral fashion, as in each iteration the thickness of the matrix is increased by one until a user defined limit (20 in the presented DMRG test-cases) is reached.
The basic idea of the new kernel (see Algorithm 4) can be summarized as follows. Each thread is associated with a column of the matrix. Each thread loads the corresponding vector element and multiplies the elements of the associated column. As threads of a warp load consecutive elements of the vector and the matrix, the coalesced reading is obvious. If the number of threads (grid size * thread block size) is less than the length of the matrix, each thread is associated with a new unprocessed column (coalesced readings again) as long as there is any. After processing a new column each thread accumulates the results to the results of the first column. Finally, the accumulated results shall be summed across the threads, which can be efficiently done via a sum reduction [29] in shared memory. If the belt is smaller than the width of the matrix, the whole procedure can be repeated (outer loop).
Algorithm 4 Proposed kernel for asymmetric gemv trans()
1: function kernel gemv trans( 2: mtx, mtx width, mtx len, vec) 3: for each belt do 4: for (i = (blockIdx.x*blockDim.x) + threadIdx.x; i < mtx len; i += (blockDim.x*gridDim.x)) do #define TMP(j) reg##j += mtx[i+ belt offset + j* mtx length] * priv;
7:
BOOST PP REPEAT(BELT SIZE,TMP); 8: ...
9:
// Sum Reduction 10: if (threadIdx.x == 0) then // Save results The size of the shared memory requirement of a thread block, which is equal to the size of a thread block multiplied with the height of the belt, can be a limiting factor of the performance because in case of large shared memory usage less thread blocks can be assigned to one physical multiprocessor. In the presented measurements the optimal height of the belt has been investigated, however, even with the optimal height the performance decreases as the width of the matrix increases. For extreme, asymmetric matrices, which are used in our application, significant speed-up (x4-5) can be reached compared to the CuBlas library, however, as the matrix tends to be more symmetric the performance of the CuBlas::dgemm() operation (red line in the Figures) increases and exceeds the performance of the new kernel.
In case of the new Kepler architecture (K20), in which Streaming Multiprocessor has significantly more CUDA Cores than the SM of Fermi GPUs (GTX 570), the permultiprocessor warp occupancy shall be increased to use Figure 7 : K20, no shuffle operation in the kernel: Performance of the presented gemv trans() kernel with different belt widths is compared to the performance of the available implementations in case of matrix height 5e5. Additionally, the PCIe limit is displayed as PCIe throughput limits the performance of the GPU acceleration if the transfer of the multiplier vector cannot be avoided. Figure 8 : K20, shuffle operation enabled: Performance of the presented gemv trans() kernel with different belt widths is compared to the performance of the available implementations in case of matrix height 1e5. Additionally, the PCIe limit is displayed as PCIe throughput limits the performance of the GPU acceleration if the transfer of the multiplier vector cannot be avoided.
all the available cores [30]. A new warp-level intrinsic called the shuffle operation can be used to decrease the shared memory requirement of the sum reduction algorithm to increase the occupancy. In Figures 8, 9 and 10 the results of the new kernel extended with the shuffle operation are displayed. The height of the optimal belt is slightly increased as the shared memory request is decreased. Unfortunately, the shuffle operation provides only a small performance gain in case of our kernel (compare Figures 7 and 9) .
The performance of the kernel is mainly dominated by the speed of the coalesced reading of the matrix elements. The gemv trans() operation is bandwidth limited in both CPU and GPU architectures, however, the memory bandwidth on GPU (e.g. GDDR5 in GTX 570: 152GB/s or GDDR5 in K20 208GB/s) is usually higher than on CPU (e.g. DDR3-1333 in dual channel with Core-i7: 21.2GB/s The results of the acceleration of the selected BLAS level 2 operations of the Davidson algorithm on the Xeon E5 + K20 architecture are summarized in Table 1 . (On the GTX 570 card the memory is too small to accelerate other operations besides the projection.) Line 3 is accelerated well as no extra communication is needed while the other operations are either limited by the PCIe or the DDR3 throughput.
gemv()
The gemv() operation can be efficiently accelerated by the standard CuBlas library even in case of asymmetric matrices (see Figure 11) . Unfortunately, merging of the CPU and GPU results is slow on one CPU thread and is the bottleneck of the acceleration. The implementation can be improved by multithreaded merging to enable quad channel memory or by computing everything on the GPU, however, this is not always possible. Figure 11 : Performance of the gemv normal() operation of the available implementations in case of K20. Additionally DDR3 limit is displayed, as in case of a CPU-GPU hybrid implementation, the merge operation is limited by the DDR3 throughput.
Accelerating projection operation
The acceleration of the independent (AX)B
T operations implementing the projection operation is based on the observation that A and B matrices are already available before the Davidson algorithm starts and do not change during the Davidson iterations. The necessary (AX)B T operations are described by a list of operation records in which each record contains all the necessary information to compute an operation like Equation 11. For example, it stores information from which segment of X (input) to which segment of X (output) the operation transforms.
The host side algorithm to handle the operation records is summarized in Algorithm 5. During the construction of the operation records the workload associated to each output is computed. (Multiple operations can use the same input or write the same output segment.) Next, the operation records are partitioned between CPU and GPU based on the performance ratio of the two architectures. To avoid merging of outputs all operation records corresponding to the same output shall be computed on the same architecture, however, to create a balanced workload partitioning, this is not always possible. During partitioning the output associated to the largest workload is selected for GPU iteratively as long as the desired workload ratio is GPU based on their performance ratio and the output workload statistics. 3: Selects scheduling strategy for the operations to be computed on GPU. 4: Set-up the workload for GPU based on the selected strategy.
After partitioning the proper scheduling strategy is selected based on the memory requirements of the operation records. Three different strategies can be selected for three different uses cases, however, currently only the first two, more complex strategies have been implemented and tested. The first strategy (4Streams) is designed for small problem size, when all A, B, X, X matrices and temporary matrices T for storing intermediate results can be held in the GPU memory. The second strategy (NoStreams) is designed for medium-sized problems, where all A, B and X matrices can be stored in GPU memory, but from X and T only the processed matrices are allocated. In case of extra-sized problems a third strategy (NoStreamAndStorage) can be designed in which even A and B matrices cannot be fully stored in the GPU memory. The memory footprint of the matrices in case of different strategies are shown in Figure 12 . In the demonstrated examples the second strategy is sufficient for all DMRG iterations as both GPU cards have enough memory.
The (AX)B T operations are implemented using the cuBLAS libary [31] , which is a BLAS implementation dedicated for Nvidia GPUs. In the demonstrated strategies two important features of the GPUs are exploited, which are provided via the CUDA driver [32] and also accessible through the cuBLAS library. The first feature is that multiple CUDA kernels can be executed simultaneously on the GPU, while the second feature is that memory I/O operations can be executed in the background. From the aspect of programming both features can be accessed via the CUDA streams. Streams are sequences of operations that execute in issue-order, but operations in different streams may run concurrently or interleaved.
In the 4Streams strategy there is enough GPU memory to execute several (AX)B T simultaneously. One stream is created for each output and operations corresponding to a given output are assigned to the same stream to avoid interference. For each stream a sufficiently large temporary matrix is allocated to store the temporary result of AX.
CUDA operations are dispatched to hardware queues in issue order [32] . To enable asynchronous concurrent kernel execution in CUDA environment memory transfers and kernels shall be issued in a depth-first order. Inside the engine (kernel) queue an operation is dispatched if all preceding calls in the same stream have been completed and all preceding calls of the same queue have been dispatched. Consequently, to avoid blocking calls kernels of the same streams shall not be issued immediately after each other. As one (AX)B
T operation consists of two kernels, the kernel calls shall be separated and interleaved with kernels of operations of other streams. To reach four parallel streams (hence the name of the strategy) kernels from four different stream shall be interleaved. Sort records by input frequency.
3:
setVisitedRecords.clear() 4: for each record i do 5: if i.stream ∈ setVisitedRecords then 6: for each record j following i do 7: if j.stream ∈ in setVisitedRecords then 8: swap(i,j) and break 9: if i.stream is ∈ in setVisitedRecords then 10: vecGroup.last().insert(i) 11: setVisitedRecords.insert(i.stream) 12: if setVisitedRecords.size()=maxstream then 13: vecGroup.add(new Group) 14: setVisitedRecords.clear()
15:
else 16: vecGroup.add(new Group) 17: vecGroup.last().insert(i) 18: setVisitedRecords.clear() 19: setVisitedRecords.insert(i) return vecGroups Overlapping of the transfer time of input segments with kernel execution makes further constraints on the order of the operation records: only those operation records shall be issued which use already loaded input segments. To be able to interleave different streams, it is favorable to load the input segment first which is used by the most streams.
Algorithm 7 Dispatching operation records
1: for each group g do 2:
for each record i in g do
3:
Init copy of input segment X i (when first used)
4:
for each record i in g do 6:
for each output segments of X do Init copy back.
In case of 4Streams strategy the reordered operation records are grouped (see Algorithm 6) such a way that kernels belonging to the same group can be interleaved (see Algorithm 7) in execution. First, operation records are sorted to load the more frequently used input segments earlier. Then, the records are iterated and each record is potentially swapped backwards to create groups of four consecutive operations belonging to four different streams. In practice some technical constraints have been added to slightly alter swapping behavior, which is not discussed here for the sake of simplicity.
CUDA operations are launched according to the operation records as summarized in Algorithm 7. For the sake of brevity the synchronization between the streams is not shown as it can be implemented with CUDA events in a straightforward way. The same code can be used for both strategies as the NoStreams strategy can be represented by groups which contain only one operation record. In case of NoStreams strategy the preparation of the operation records is much simpler and contains only the sorting by input frequency to reach I/O overlap with computation.
The performance of the two strategies is compared in Figures 13 and 14 . Significant improvement can only measured at medium sized matrices (100-800 for GTX 570 and 100-1500 for K20), in which case several operations shall be executed concurrently to keep all the CUDA cores busy required for high performance. Slightly bigger gain can be observed in case of K20 GPU which has 2496 Kepler CUDA cores as opposed to GTX 570 having only 480 Fermi CUDA cores. Operations on large matrices (∼1500 for GTX 570 and ∼3000 for K20) provide enough work for each CUDA core to approach the theoretical maximum double performance (180 GFlops for GTX570 and 1.17 TFlops for K20) without streams.
The two strategies are also compared by the run-time of the simulated models in Table 2 . In case of K20 the concurrent kernel execution has a slightly greater benefit, however, in both models operations on larger matrices, where concurrency has no benefit, dominates the run-time. In models where more symmetries are enabled the size of the matrices tends to be smaller, consequently, in these models the concurrency also tends to be more effective.
The performance results of the full projection computation including both CPU and GPU computations are shown in Figures 15, 16, 17 and 18 . The quality of the acceleration is highly affected by the applied workload ratio which depends on the performance ratio of CPU and GPU at the given matrix size. In the configuration file different ratios can be set for different matrix sizes and in each DMRG iteration the user-defined ratio is selected according to the average matrix size of the opera- tion records. If the workload is properly distributed 257.8 GFlops (×3.2 speed-up) and 1071.1 GFlops (×6.1 speedup) can be reached on GTX 570 and on K20, respectively.
Limits of FPGA implementation
Field programmable gate arrays (FPGAs) are programmable integrated circuits used to realize reconfigurable digital circuits via configurable logic resources and routing. Although originally developed for telecommunication and digital signal processing applications novel high performance FPGAs (featuring high-speed embedded resources, plentiful on-chip memory, high-level programming tools and significantly lower power consumption than CPU or GPU) are promising candidates for high performance computing (HPC).
To estimate the performance of an FPGA implementation of the DMRG method the acceleration of the projection operation expressed as a series of dense matrix multiplications (see Equation 11 ) shall be investigated. The floating-point matrix-matrix multiplication was already implemented [33] on FPGA very efficiently using the rank-1 update scheme. Kumar et al. demonstrated that the performance is not limited by the PCIe bandwidth, which connects the FPGA to the host CPU, and nearly full utilization of the processing elements can be reached.
The idea behind the rank-1 update approach is that instead of inner products between the rows of the left matrix and the columns of the right matrix outer products between the columns of the left matrix and the rows of the right matrix are carried out and resulting matrices are summarized. The advantage of the approach is that instead of multiply-accumulate operations (MACCs) multiply-add operations (MADDs) are used, which are independent from each other and can be pipelined to reach high processing element utilization.
Assuming the previously described design the processing elements can be fully utilized and the following bestcase estimations can be made according to the area requirements of the floating-point units. In an ideal case at most 114 or 193 multiply-add units can be implemented on the largest Virtex-6 ( XC6VSX475T ) and Virtex-7 (XC7VX1140T) FPGAs, respectively. The estimated clock frequency of the architectures are 437.82 and 443.65MHz which would result in 99.82 and 171.2 GFLOPS computing performance. This performance achievement can be compared to the performance of the mid-range GTX 570 GPU used in the paper. As the development time in case of FPGA is still much longer than in case of GPU and the high-end GPUs could significantly outperform the FPGA in this problem class, the GPU architecture is the better candidate for the acceleration.
Implementation Results
The implemented algorithm has been tested both on a mid-range (Intel Core-i7 2600 3.4 GHz CPU + NVidia GTX 570 GPU) and on a high-end configuration (Intel Xeon E5-2640 2.5 GHz CPU + NVidia K20 GPU); the results are displayed in Table 3 and 4, respectively. All CPU-only measurements have been executed with multithreading enabled (4 threads on Core-i7 and 6 threads on Xeon E5). The mid-range configuration with GPU is approximately 2.3-2.4 times faster than without GPU, while the high-end configuration is accelerated by 3.4-3.5 times using the GPU. However, a change from a mid-range, multithreaded CPU to a high-end CPU+GPU configuration can produce 6.5-7 times acceleration. To support the comparison of the results of the two investigated models the key parameters affecting computational complexity are summarized in Table 5 . Using the same number of retained block states the Hubbard model has larger values for all key parameters except the maximum sector size and the maximum matrix size. In case of the Hubbard model more symmetries are exploited which results in smaller sectors and, consequently, smaller matrices.
In case of K20 the acceleration of the projection and the matrix-vector operations is compared in Figures 19  and 20 . The projection is accelerated by 5.7 times which is in accordance with the theoretical performance capabilities of the two architectures. Currently on Xeon processor (see Figure 3 ) the projection operation is only accounted for 75% of the total run-time, therefore, the overall acceleration is also affected by the rest of the operations of the Davidson algorithm. Fortunately, as the number of retained states (m) increases the time-dominance of the projection also increases, which anticipates even better acceleration for real-world simulations with large m.
As the acceleration of the full Davidson algorithm can be limited by the GPU memory, an adaptive solution shall be implemented which accelerates as much of the algorithm as possible. Currently four matrix-vector operation of the algorithm is accelerated in case of sufficient GPU memory, however, latter acceleration of the rest of the operations will be implemented as well.
Conclusion
In this paper acceleration of the DMRG algorithm using novel kilo-processor architectures (GPU, FPGA) has been investigated. The GPU architecture has been found to a promising accelerator, as the most time-dominant step of the algorithm, the projection operation, can be formulated as independent dense matrix multiplications, which are ideal workload for GPUs. Moreover, in case of highend GPUs the acceleration of the projection is so remarkable, that it is worth to consider the acceleration of the rest of the algorithm to obtain a decent overall speedup. In the presented implementation some asymmetric matrix-vector multiplications of the diagonalization have been identified as the second most time-dominant part of the algorithm and a new CUDA kernel has been designed to efficiently accelerate these operations. The resulting parallelized DMRG implementation is a hybrid CPU-GPU solution, which distributes the workload according to the performance and memory capabilities of the configurations and anticipates a straightforward multi-GPU extension, which is part of our current developments. In subsequent works our DMRG implementation will be utilized to investigate more complex models focusing on physical problems of current interests, which are hard to treat due to their high computational demands. For example, high-spin fermionic systems relevant for ultracold atomic experiments and extensions to treat ab-initio quantum chemical applications [34, 35, 36] which are already under progress. Furthermore, a straightforward generalization of the presented algorithm to accelerate tensor network states (TNS) algorithms [37] is another promising research direction.
In general, FPGA chips have lower operating frequencies than in case of GPU architectures and the attached on-board memories are also smaller and slower. FPGAs can outperform other architectures in such problems where a custom arithmetic unit can reach a significantly better utilization, however, in case of dense matrix operations of the DMRG algorithm nearly ideal utilization of the CUDA cores is reached. As the estimated performance of the considered high-end FPGAs is around the mid-range GPUs and the development time for FPGA is significantly longer, the GPU architecture is preferred for the acceleration of the DMRG algorithm.
