We have accelerated an ab-initio QMC electronic structure calculation using General Purpose computing on Graphical Processing Units (GPGPU). The part of the code causing the bottleneck for extended systems is replaced by CUDA-GPGPU subroutine kernels which build up spline basis set expansions of electronic orbital functions at each Monte Carlo step. We have achieved a speedup of a factor of 30 for the bottleneck for a simulation of solid TiO 2 with 1,536 electrons. To improve the performance with GPGPU we propose a new updating scheme for Monte Carlo sampling, quasisimultaneous updating, which is intermediate between configuration-by-configuration updating and the widely-used particle-by-particle updating. The error in the energy due to by the single precision treatment and the new updating scheme is found to be within the required accuracy of ∼ 10 −3 hartree per primitive cell.
I. INTRODUCTION
General Purpose computing on Graphical Processing Units (GPGPU) [1, 2] has attracted recent interest in HPC (High Performance Computing) for accelerating calculations at a reasonable cost. Environments for developing GPGPU, such as CUDA (Compute Unified Device Architecture) [3, 4] , have also contributed to the recent trend for using GPGPU for scientific applications with much increased portability [2] . Electronic structure calculations [5] form one of the largest fields within HPC and there have been many attempts to accelerate such calculations using GPGPU [2] . Electronic structure calculation using quantum Monte Carlo (QMC) methods can provide highly reliable estimates of material properties for a wide range of compounds [6, 7, 10] . The very high computational demands are not so important because of the inherently high efficiency of massively parallel computational facilities for Monte Carlo computations [11] . There have been several attempts to apply GPGPU to ab-initio QMC electronic structure calculations [12, 13] .
Previously we reported GPGPU acceleration of a QMC calculation for molecular systems, in which we achieved a speedup of more than a factor of 20 [12] . The key idea was to replace only the bottleneck subroutines in the main code by the CUDA kernel running on the GPU.
We emphasized that the replacement of the entire simulation code by its GPU version is not practical from the viewpoint of version administration [12] . This becomes more serious for practical program packages with large number of users, as is common in ab-initio electronic structure simulations [12] .
It was challenging to achieve substantial acceleration using such a 'partial replacement strategy', and it should give a speedup of at least more than a factor of ten to be advantageous to use multi-core processor technology. In Ref. [12] the main code written in Fortran90 (F90) was partially replaced by the GPU kernel, which were at the object code level. Users could switch back to the original CPU version of the subroutine if the GPU was not available. In the previous study GPGPU was applied to molecular simulations, although solid systems are the most attractive target for GPU-QMC electronic structure simulations [10] because of the vast CPU time required and the potential of QMC to achieve more reliable results than frameworks such as density functional theory (DFT).
The bottleneck in the present work differs from that in our previous molecular simulation [12] . In our previous work the bottleneck was the routine for computing the Hartree fields corresponding to the particle configuration [14] . In the present work the bottleneck is the routine for evaluating the single particle orbitals at the required particle positions. We have achieved a speedup of more than a factor of 30 with GPGPU compared with the single core performance of the conventional CPU evaluation. This acceleration does not, in principle, harm the MPI (massively parallel interface) parallelization efficiency, which is essentially the same as in our previous work [12] . The conventional MPI parallel evaluation [10] can be accelerated further by attaching a GPU to each node. In QMC calculations the electronic orbitals are calculated many times at different electronic positions. It is quite common in ab-initio electronic structure methods, including QMC, that one builds up orbital functions for given electronic positions. Our implementation achieved here would be useful also in self-consistent field (SCF) methods used in density functional theories (DFT) or molecular orbitals (MO) methods.
MPI parallelization has successfully been used in QMC electronic structure calculations [6, 7, 10] , obtaining ∼ 99% parallel efficiency by distributing the huge number of configurations over the processing nodes. On each node the evaluation is usually sequential, though there have been several attempts to exploit further parallelization within the node using, for example, OpenMP [6] . The evaluations performed on each node include updating an electronic configuration R (α) = ( r N ), and sampling with the updated configuration, where α is the index for MC steps. There are two major types of updating scheme, the configuration-by-configuration scheme (simultaneous updating) and the particle-by-particle scheme (PbP, sequential updating). In the former, attempted trial N-electron moves are generated to update a configuration, R (α) → R (α+1) , and then the new configuration is accepted or rejected. In the latter, a trial move of a single electron is attempted and accepted or rejected, r
, and the process is repeated N times.
Sequential updating is more efficient than simultaneous updating, and it is widely used in QMC electronic structure calculations [6] . For hybrid parallelization, including GPGPU and OpenMP, one seeks further parallelization in the sequential evaluation within an MPI node.
The GPGPU performs the accept/reject steps for each particle 'simultaneously'. The ratio of the probabilities evaluated in the Metropolis accept/reject algorithm [7] differs both from those for simultaneous and sequential updating. In this sense our updating scheme can be viewed as 'quasi-simultaneous updating' (Q.S.). This scheme is designed to obtain GPGPU acceleration by improving the sequential memory access (so called 'coalescing'), and the concealment of memory latencies. We have confirmed that our new updating scheme does not change the results within the required statistical accuracy, namely the chemical accuracy.
The paper is organized as follows. In §II we briefly summarize the VMC method (Variational Monte Carlo method). The evaluation of the orbitals represented in a spline basis set is the bottleneck in the computation, as described in this section. The benchmark systems used in the performance evaluation are also introduced. §III is devoted to a description of the GPU architecture. The structure of processors and memories in the GPU used in the present work is briefly explained. Other features, such as how we assign the number of threads and blocks for parallel processing, are discussed in §IV, in connection with the design and implementation of the quasi-updating scheme. The quasi-updating scheme is also introduced in this section. Several other possible implementations with different updating schemes or thread/block assignments are introduced here and their performances are compared. The results are summarized in §V, including comparisons of the energies, operation costs and data transfers, and the dependence of the performance on system size. In §VI we discuss the results, comparing with the ideal performance in terms of operations and memory access. We also discuss the possibility of using high-speed memory devices in the GPU and the relation to linear algebra packages.
II. QMC ELECTRONIC STRUCTURE CALCULATION
A. VMC
In ab-initio calculations the system is specified by a hermitian operatorĤ called the Hamiltonian [15] . The operator includes information about the positions and charges of the ions, the number of electrons, and the form of the potential functions in the system. The fundamental equation at the electronic level is the many-body Schrödinger equation, which takes the form of a partial differential equation withĤ acting on a multivariate function Ψ ( r 1 , · · · , r N ), known as the many-body wave function, where N denotes the number of electrons. The energy of the system, E, is the eigenvalue of the partial differential equation.
The equation has the variational functional [7] 
which is minimized when the above integral is evaluated with Ψ being an exact solution of the eigen equation. For a trial Ψ the functional can be evaluated as an average of the local energy, E L ( r 1 , · · · , r N ) = Ψ −1Ĥ Ψ over the probability density distribution
In VMC the energy is evaluated by Monte Carlo integration using the Metropolis algorithm to generate configurations R
distributed according to the probability distribution
with M being typically of the order of millions. The trial function Ψ is improved by an optimization procedure so that the integral of Eq. (1) can be minimized numerically [8, 9] . Since each E L R (α) is evaluated independently, the summation over α can be divided into subsummations distributed over the processors by MPI with high efficiency [10] . In this work GPGPU is used to accelerate the evaluation of each E L R (α) , rather than parallelization over the suffix α. We used the 'CASINO' program package [6] for the VMC calculations.
There are several possible forms of trial Ψ, and we chose to use the common Slater-Jastrow type wave function [7, 10] ,
where e J( R) is known as the Jastrow factor [16, 17] . Ψ D is a Slater determinant [18] Ψ
which is an anti-symmetrized product of one-particle orbital functions,
. The number of independent orbitals, L, can be reduced by using the symmetries of the system.
5
The bottleneck of the whole simulation has been found to be the construction of the {ψ l ( r)} [6] . In this study the computational power of the GPU is devoted to the bottleneck process, as described in the following subsection.
B. Orbital evaluation
In each MPI process the following evaluations are performed sequentially:
1. An attempted move, R (α) → R (α+1) , is randomly generated, 2. The updated probability p( R (α+1) ) and the ratio ξ = p(
3. Based on the ratio ξ, the attempted move R (α+1) is accepted or rejected,
Each configuration is a set of electronic positions (we omit the spin coordinate for simplicity), 
In practical QMC calculations for extended systems the orbital functions are expanded in a B-spline basis set {Θ s ( r)} [6, 19, 20] as
The index s runs over the subset of the spatial sites within the unit cell of the periodic system. The spline basis functions, {Θ s ( r)}, have non-zero values only at sites s within the fourth nearest neighbor of the position r along each direction. The total number of terms in the summation (6) is therefore 64 = 4 3 (four spatial points along each direction in the three dimensional space), as a subset of the whole lattice within the unit cell amounting to S = 50 3 ∼ 60 3 . The lattice is indexed by {s} S s=1 , as depicted schematically in the two dimensional plane in Fig. 1 . The subset {s}
is the spatial region where {Θ s ( r)} has non-zero values, depending on the given r. Since the indices introduced so far are complicated we summarize them in Table I .
The value of Θ s ( r) at a s-lattice site, R s = (X s , Y s , Z s ), is given by the function depending on the distance between r and R s as, , are precomputed and provided as an input file, stored in memory at the beginning of the simulation. For each MC step with a updated particle position r j , the subset
is identified and used in the summation (6) . Denoting a l,s( r j ) = a [l, j x , j y , j z ] as an array, s ( r j ) = (j x , j y , j z ) forms a simply connected region in the three dimensional space but it does not allow sequential memory access in one dimensional address space, as it is interrupted with some stride due to the higher dimensions (see Fig. 1 ). The orbital index l, is, however, inherently one dimensional and we exploit this for sequential memory access, which is very important for GPGPU, as discussed in §IV.B. We have identified the 'multiply and add' operations in Eq. (6), by which the orbital
, are evaluated as the bottleneck in the present QMC simulation [6] . This operation appears at every MC step when the particle position is updated, r
The number of operations for a single evaluation is proportional to the number of orbitals, L, and hence to N. In the present study we treat system sizes up to L = 384 and N=1,536.
C. Benchmark systems
To investigate the dependence of the acceleration on system size, we prepared three different benchmark systems, as reported in Table II . For each system the atomic cores are replaced by pseudopotentials [5] in Si-diamond (N=216) and cubic TiO 2 (N=648 and 1,536). The periodic boundary conditions for (3 × 3 × 3) or (4 × 4 × 4) arrays of unit cells form a simulation cell. More detailed specifications for each system are given in Ref. [21] for Si and Ref. [22] for TiO 2 . The bottleneck routine (orbital evaluation) to be replaced by the GPU processing kernel occupies 20∼30% of the entire CPU time for TiO 2 (N=1,536), as analyzed by a profiler (Intel VTune Amplifier [23] ). This depends on the choice of the 'dcorr' parameter in CASINO [6, 24] , which specifies the interval between sampling; in order to reduce the correlation in the sampling, the local energy is evaluated every 'dcorr' MC steps (an MC step corresponds to the update of a configuration). The ratio of the CPU time spent in the bottleneck is reduced from 39.5% to 22.0% by increasing 'dcorr' from one to ten, as measured for a simulation with 10,000 MC steps. Typically 'dcorr = 5' is chosen, for which the reduction becomes 27.5%. The reason for this dependence is that the orbital evaluation is called not only by the configuration updating but also by the local energy evaluation. Increasing 'dcorr' means less frequent evaluation of the local energy and hence less frequent calls to the orbital evaluation.
III. GPU
General descriptions of the architecture of a GPU can be found in the literature [2] [3] [4] , and our previous paper [12] also provides such a description. This section provides the minimum amount of information required to understand the present work which was performed with the NVIDIA GeForce GTX 480 architecture.
A GPU has hundreds of processing cores. The key points for the acceleration in the present work can be summarized as follows: (1) Parallelized tasks are distributed over many cores. The large number of processing cores of a GPU allows the whole task to be processed more rapidly than by a CPU. (2) The parallelized tasks are grouped into several sets (called 'warps'). The GPU processes each warp in order ('barrel processing'), skipping those still waiting for data load from memory. As there are usually hundreds of warps, barrel processing conceals the memory latency. In our previous work [12] the acceleration was achieved mainly by dividing a huge number of loops into several subsets and distributing them over GPU processor cores. The present work does not follow this strategy, and we do not divide the loop for the summation in Eq. (6) . Instead a huge number of independent parallel tasks,
are distributed over the GPU processing cores. Other key points for the present work include, (3) memory latency is much improved when the access occurs with sequential memory address (memory coalescing), and (4) a command set called a 'Fused Multiply Add (FMA) which performs multiply and add operations within a clock cycle (two operations at once).
A. Processors and performance
GTX480 has 480 processor cores, each of which is termed a 'streaming core' (SP) for AMD products while 'cuda core' is used for NVIDIA products. In the present paper we use the term SP. The specs of the GTX480 are summarized in Table III peak performance of GTX480 is hence evaluated as
Note that, unlike the previous GTX275, multiply-and-add operations are not subject to a SFU in the present GTX480. For evaluating Eq. (6) there is hence no place for a SFU to be applied, giving an ideal performance to be compared with our achievement of
by omitting the contribution from SFU. Though the present work concentrates on single precision GPU evaluation, the ideal double precision performance is estimated to be 672
[GFlops], which is half of that for single precision. GeForce GTX480, however, limits it to a quarter of this value, 168 [GFlops], by its driver, for some reason [12] . These estimates are summarized in Table IV , compared with that of the CPU (Intel Core i7 920) used in the present work. Table V summarizes the specification of a computational node used for the experiments.
On each node an Intel Core i7 920 processor [25] and a GPU is mounted on a mother board.
Hyper-Threading [26] in the Core i7 processor is turned off, and it is used purely as a fourcore CPU. Compute Capability specifies the version of hardware level controlled by CUDA, It is essential in GPGPU coding to design efficiently the parallelized tasks (termed 'threads') to be grouped into subsets with several different classes. With the variety of memory devices provided in a GPU, see Table VI , each subset has a different 'distance' from these devices. The performance of the GPGPU is critically affected by the choice of theses subsets because a good design can effectively reduce the memory latency. In the present study all of the threads are grouped into 'blocks'. Threads within a block can share memory devices with fewer latencies.
Each block is assigned to a SM by which the threads within the block are processed. The SM processes 32 threads at once, as in vector processing. A bunch of 32 threads is called a 'warp'. When a warp accesses with sequential memory addresses, the latency is much reduced (called 'coalescing'). To conceal the memory latency, the scheduler and dispatcher for warps monitor which warps are immediately executable (namely which ones have already completed their memory loads). Then the scheduled warps are processed sequentially by the SM. A schematic picture is shown in Fig. 3 .
. . . Table VI contains only those kinds of memory relevant to this study, and excludes the texture memory [4] . Off-chip memories are located within the GPU board but not on the device chip. They have larger capacities and are accessible directly from CPU hosts but are in general slower. On-chip memories are complementary, namely with higher speed and lower capacity. Data required for GPU processing is transferred from the mother board to off-chip global memory, and is then loaded to on-chip shared memory, as usual.
The capacity of the global memory ranges from 1 GB to 3 GB, depending on the product.
As a trade-off against the large capacity it is about 100 times slower than on-chip memories.
In GTX480, 768 KB of off-chip L2 cache is available to cover the low speed of the global memory. Another off-chip memory with high speed accessibility is the 64 KB constant memory. Via the constant caches located on every SM the constant memory can be accessed from all the threads with higher speed, although it is limited to read-only. As its name suggests, it is used to store constants referred to by threads. Data loading from the global memory takes at least 200 cycles, and more usually 400 ∼ 600 cycles. To conceal the latency, the GPU administrates all of the warps and monitors whether it is ready to be executed with the completion of data load. With sufficient warps one can ensure that the processors are almost always executing operations without waiting for data loads. To achieve better concealment it is essential to design the code so that it maintains a large number of warps. Since it depends on the specs of each architecture, such as the number of threads per warp, and the maximum possible number of threads per block etc., programming for better performance requires tuning for each GPU product. 
IV. CODING
Only the bottleneck routine for evaluating Eq. (6) is replaced by the CUDA kernel code executed on the GPU. The interface between the main code in F90 and the CUDA kernel is the same as that in our previous study [12] .
A. Quasi-simultaneous updating
To construct appropriate parallelized degrees of freedom, we introduced a new scheme for the MC updating of configurations. Let us denote a configuration at MC step α by
N ), and consider the update of a particle position r
In configuration-by-configuration updating (simultaneous updating), the accept/reject of the updating is evaluated based on the ratio of the probabilities in Eq. (2),
while in particle-by-particle (PbP, sequential updating),
The index j in ξ (j) seq means that the accept/reject step is made for each particle move, unlike in configuration-by-configuration updating. These two updating schemes give slightly different values for the statistical estimates because of the different paths of the evaluations, but they coincide with each other within the statistical errors. We introduce another updating scheme based on the ratio
termed 'quasi-simultaneous updating' (Q.S.). In this scheme the accept/reject evaluation for the jth particle at step (α + 1) is based on the previously fixed α step configuration and on each particle position j, which gives N individual parallel tasks.
Evaluating Eq. (11) reduces to computing the orbital functions with updated positions,
, which requires (N × L) independent evaluations. New trial moves at the (α + 1) step
are generated on a CPU and sent to a GPU and the orbital functions are evaluated (see Fig.   4 ). For TiO 2 (4×4×4) this gives N × L = 1536 × 384 = 589, 824 parallelized tasks. Note that the parallelized multiplicity (N × L) scales as N 2 since the number of orbitals L ∝ N.
The concealment of memory latency is more efficient when the multiplicity increases, and hence we expect better performance for larger systems.
(GPU) accept/reject for determinants.
accept/reject for Jastrow factor.
Generate random walks 
B. Assigning blocks
Each thread evaluating ψ l r (α+1) j is labelled by (j, l). As mentioned in §II.B, the memory access for the coalescing should take l as the sequential index for the data load of a l,s( r j ) = a [l, j x , j y , j z ], as required for Eq. (6). We therefore assigned the threads sharing the same j with sequentially varying l = 1 ∼ L within a block for the coalescing. The number of orbitals, L, cannot therefore exceed the maximum possible number of threads within a block, which is 1,024 for GTX480. The number L can usually be reduced to L < N using the symmetry of the system. It is then a natural choice to assign L within a block rather than N, which is always being larger than L. For non-magnetic solid systems as in the present work, L is reduced to ∼ N/4 by the symmetries, giving the limit N 4,096 for GTX480. For magnetic systems, L ∼ N/2, and hence N 2,048, and for systems without time-reversal symmetry it becomes 1,024. This limitation is consistent with the maximum simulation size of contemporary QMC electronic structure calculations, which are able to treat at most a few thousand electrons in extended systems [21] . Furthermore the limit of 1,024 in GTX480 is expected to double in future architectures [30] , making this issue less important.
When evaluating Eq. (6), all the threads within a block refer to the same Θ s ( r j ) with fixed j. Once a warp loads {Θ s ( r j )} 64 s=1 , these data are stored in L1 and L2 cache including their neighboring data. We can then expect effective cache hits for the data load. The operation of each thread, 64 terms multiply and add, easily fits the FMA of the GPU. (2) and (1), and with larger system sizes N. The Table   VIII . 'PbP' and 'Q.S.' stand for the particle-by-particle and quasi-simultaneous updating schemes, respectively. The 'Index' column shows which indices in Eq. (6) are assigned to threads and blocks in the GPU. For system sizes N , see Table II for a more detailed description of the systems. improvement from (1) to (2) is attributed to the increased number of threads in Q.S. due to processing the particle indices j = 1 ∼ N simultaneously. This also brings about improved efficiency in the data transfer between the CPU and GPU, which is simultaneous transfer in version (2) and repeated transfers for each j in PbP version (1). The improvement from (2) to (3) arises because the number of operations performed on each thread is increased;
multiply and add summation with 64 terms in (3) and just one multiplication in (2) . The fact that memory coalescing only works in (3) also contributes to the improvement, for which a detailed analysis will be given in §VI.C. 
B. Deviation in energy values
In GPGPU for scientific simulations it is important to consider whether the deviation in the results due to the single precision operations are within the required accuracy. For the present electronic structure simulation the deviation should be within ∆E ∼ 0.001 [hartree] in the energy estimation, known as the chemical accuracy. Table IX (0a) and (1) gives the deviation due to the change from double to single precision evaluation.
The deviation is within the statistical error bars, but the agreement is poorer than single precision, which is expected to be correct to around six digits. To understand this we must remember that the energies given in Table IX are statistical estimates obtained from different accept/reject paths after 1,000,000 MC steps. When we look at the difference after a single MC step, agreement is confirmed within six digits, as shown in Table X . Even such small deviations may give rise to different decisions along a accept/reject branch. Once a different decision occurs, the subsequent random walk takes different paths, giving different estimates which are outside of the six digits but within the statistical error bars. 'PbP' stands for the particle-by-particle updating scheme.
Energy (hartree) (0a) CPU/PbP (double prec.) -7.95075065699
(1) GPU/PbP (single prec.) -7.95075065360
A comparison of (0a) and (0b) in Table IX gives the deviation due purely to the two different updating schemes. As expected it is confirmed that these energies agree to within the statistical error bars. The result of (2,3) includes both the deviation due to the changes in the updating scheme and the numerical precision. Comparing with the reference (0a)
demonstrates that our updating scheme keeps the result within the required accuracy.
In the present study, the updated orbital values from the GPGPU/single precision evaluations are used only to determine if updated particle positions are accepted or rejected. The energies reported in Table IX were calculated using CPU/double precision. The difference between single and double precisions alters the paths of the random walks and hence where the R space is sampled. If the energies themselves were also evaluated using single precision it might introduce significant biases, but we have not investigated this here. As mentioned in §VI.d, there are further possibilities for accelerating the second largest bottleneck, which is the updating of the Slater determinants (or equivalents) using GPGPU. In this scheme the updated single precision value of the many-body wave function is used to evaluate the energies, and the errors in these energies should be considered carefully to make sure the results are still within chemical accuracy. sizes N (number of electrons). 'PbP' and 'Q.S.' stand for the particle-by-particle and quasisimultaneous updating schemes, respectively.
so that the efficiency of the GPU improves for larger systems. Coalescing in implementation (3) also becomes more effective for larger L. However there is a drawback for larger systems in the data transfer costs, especially for returning
from the GPU to CPU.
As overall cancellation, the acceleration seems to scale as ∼ N , rather than ∼ N 2 (the number of threads). Table XI summarizes the data transfer times using ver. (3), which
show that the transfer cost increases with system size. The percentage of the total GPU time is, however, decreased because the number of GPU operations also increases. number of warps among all the implementations.
VI. DISCUSSIONS
To prevent redundancy we omit discussions of the evaluation of how much of the original CPU code is optimized, and of how we interpret the acceleration factors achieved in terms of the actual usefulness of GPGPU, as these were discussed in our previous report [12] .
A. Acceleration performance
The ideal limit of the acceleration factor to be compared with our achievement of 30.7 is that of (1345/10.64)=126.4, where 1345 [GFlops] is the ideal limit for the GPU discussed in §III.A and 10.64 is the single core performance of the Intel Core i7 processor used here for implementation (0a). This limit might be achieved when the ratio of the cost of memory loads to that for operations approaches zero, although this is not possible in practice. This ratio corresponds to those shown in the last column of Table XI as in percentages. As shown in the table, the ratio is decreased for larger systems and hence we expect better performance.
As another evaluation of our achievement, we estimated the FLOPS of our implementations applied to TiO 2 (N=1,536) as listed in should be chosen to form the coefficients {a ls }. The actual FLOPS should therefore be larger than those given in the table. In this evaluation, our achievement of a factor 30.7 gives an 'PbP' and 'Q.S.' stand for the particle-by-particle and quasi-simultaneous updating schemes, respectively.
Performance ( 
B. Performances in memory access
Table XIII summarizes the memory load performances of each implementation as measured by 'Compute Visual Profiler' for CUDA [28] . The increase in the number of 32 bit load from (1) to (2) is simply because of the increased number of threads due to the simultaneous processing with respect to the number of particles N in Q.S. From the coalescing in (3) we see a remarkable reduction in the number of memory loads by a factor of ∼ 27. Correspondingly the throughput becomes much closer to the peak value, 177.4 GB/s, compared with the poor achievements in (1) and (2) due to the random access of {a js }.
The SM activity in Table XIII is a measure of the efficiency of the concealment of the memory latency. This quantity is defined as the ratio of the number of cycles at which the operations started after the completion of memory loads to the total number of cycles taken for the kernel execution. With efficient concealing, at least one of the threads is always ready for execution at each cycle, and hence this quantity is expected to be close to 100%. Since the concealment becomes more effective with larger numbers of warps, the SM activity is increased from (1) to (2) by the increased number of threads. Though the number is decreased again in (3) by a factor of ∼ 64, greatly accelerated memory accesses by the coalescing improve the SM activity to 100%.
C. Shared memory and read only memory
In our previous work [12] we found that it was quite effective to exploit high-speed readonly memory. In the present work we have investigated further improvements by exploiting high-speed read-only memory but found no significant gains. This is summarized as follows:
(i) in the present case the data required for the operation on each thread is large and does not fit into the high-speed memory devices, and (ii) even without explicit use of high-speed devices, the compiler automatically assigns them to L1 cache, and the explicit data transfer to shared memory by the user gives rather slower performance than automatic assignment.
The data loads considered in the present case are {ã l,s } S∼250,000 s=1
, which is initially stored in the global memory. After choosing the subset {a l,s } 64 s=1 L l=1 from {ã l,s }, our best implementation (3) loads them by coalescing access to the global memory. However, the access speed to the on-chip shared memory is around 100 times faster than that. [31] . A block sharing a shared memory device has a common j so the set
is shared by all the threads within the block. The 64 KB capacity of the device corresponds to 16,000 (= 64 KB/4B) elements of single precision data. It is therefore possible to accommodate {a l,s } when L < 16, 000/64 = 250, so that the total number of orbitals is less than 250. Table II shows that Si (3 × 3 × 3) and TiO 2 (3 × 3 × 3) correspond to this case. Storing the data within the on-chip memories becomes advantageous when the data is referenced repeatedly by the warps. The number of repeated references is given in our case by the ratio of L ∼ 250 to the warp size, 32, which is less than 10. Beyond this number of repeats the SM switches to another process with different j and hence the corresponding new data for {a l,s } should be loaded again from the global memory. We have tried such an implementation using shared memory devices but we did not find any improvements over (3), possibly because of the small number of repeated references. Such an improvement would already be included implicitly in (3) by L1 cache acting on the shared device. The explicit usage of the device as the shared memory seems to reduce the performance.
Another choice is to use constant memory. Unlike shared memory it is located off chip and blocks with different j indices pick up their subset {a l,s }. The memory should therefore be able to provide the whole set of {ã l,s }, which is far beyond the size of the constant memory and is therefore infeasible. Another data required for the GPU operation
in Eq. (6), the total size of which can be accommodated within the constant memory for N < 250. Our trial implementation using constant memory for {Θ s } actually gives a slight improvement in performance, by ∼ 4%, but this is only applicable to smaller system sizes. The data Θ s r
is updated at every MC step and hence the life time in cache is short, giving only a slight improvement in performance.
D. Acceleration of Slater Determinant Updating
The bottleneck operation of updating of orbital functions, which are computed by the GPU, gives a system size dependence of O(N 2 ) [6] . The corresponding updating of the manybody wave function, (5), actually takes O(N 2 ) in the PbP implementation [7, 29] , in spite of the fact that evaluating a determinant scales as O(N 3 ). In PbP, say r j → r ′ j , only one column of the determinant including r ′ j is updated at each step. The updating of the determinant can hence be evaluated based on the Laplace expansion with respect to the column, and the other co-factors are unchanged. This algorithm, known as Sherman-Morrison updating [29] therefore requires only O(N 2 ) operations [6] . In our Q.S., the updated orbitals are stored in an array and are read sequentially to update the determinant using this algorithm. The cost of the updating of the many-body wave function amounts to 19.4% of the time compared to 39.5 % for the present bottleneck, the orbital updating, when N=1,536, dcorr=1. The ratio of the former to the latter increases with larger dcorr. This operation mostly involves BLAS routines (ddot and daxpy) which could be replaced by CUBLAS in future work.
E. As a prototype of linear algebra problems
The evaluation of Eq. (6) can be written in terms of linear algebra,
Θ 2 ( r j ) . . .
for which the matrix A is defined by A = {a ls } = a l,s( r j ) = A ( r j ) .
If the matrix A is constant, we can write Eq. (13) as a matrix multiplication:
and using this we could increase the ratio of operations to data transfer, and hence the efficiency of GPGPU by a large amount. Fig. 1 , however, reminds us that the matrix A in Eq. (14) is randomly varying in its elements choice indexed by r j , but within the given constant elements {ã l,s }, and hence we cannot reduce Eq. (13) into a unified form as Eq.
(15).
Since our calculation of Eq. (6) is linear algebra we considered whether it could be performed efficiently using CUBLAS. To use CUBLAS we have to construct A ( r j ) on the CPU at every MC step, transfer it to the GPU, and then call CUBLAS. In our case the randomly varying matrix A is not large, however, the cost of constructing it cannot be compensated by the high performance of CUBLAS. Since the enlarged matrixÃ = {ã ls } S s=1 is a constant matrix, only one data transfer to the GPU is required in principle, but this is not possible with CUBLAS which requires the operands to be transferred every time for the operations.
In this sense, our achievement here corresponds to an effective GPGPU implementation of the algorithm for the prototype as explained above, Eq. (13) with a randomly chosen matrix.
VII. CONCLUDING REMARKS
We have applied GPGPU to the evaluation of the orbital functions in ab-initio Quantum Monte Carlo electronic structure calculations, which we identified as the computational bottleneck. For efficiency we propose a new updating scheme for generating trial moves for the walkers in the Monte Carlo sampling, which we call quasi-simultaneous updating. Using this scheme we achieved a speedup of more than a factor of 30 compared with using a single core CPU. The GPGPU implementation gives a deviation in energy from original CPU evaluation which is smaller that the required chemical accuracy. Though the effective performance in operations amounts to 74 GFlops, which is only 5.5% of the peak performance, the memory throughput reaches 153 GB/s, which is 86% of the peak value with almost perfect concealment of memory latency as shown by the SM activity. The implementation presented here is a prototypical problem of linear algebra with a sort of random matrix, processed by GPGPU. 
