An implementation is presented of an uncontracted Rys quadrature algorithm for electron repulsion integrals, including up to g functions on graphical processing units (GPUs). The general GPU programming model, the challenges associated with implementing the Rys quadrature on these highly parallel emerging architectures, and a new approach to implementing the quadrature are outlined. The performance of the implementation is evaluated for single and double precision on two different types of GPU devices. The performance obtained is on par with the matrix−vector routine from the CUDA basic linear algebra subroutines (CUBLAS) library.
Introduction
The evaluation of two-electron (2e -) repulsion integrals (ERI) is a major computational step in determining the electronic structure of molecules using ab initio quantum chemistry and density functional theory (DFT) methods. 1 Accelerating the integral calculations significantly reduces the overall runtime of the direct Hartree-Fock (HF) 2 and the post-HF methods, e.g., many body perturbation methods. 3 In 1951, Boys 4 proposed using Gaussian functions as a standard atomic basis set for quantum chemistry computations because the integrals over the Gaussian functions can be evaluated efficiently in closed form. Since then, many different algorithms have been developed to evaluate ERIs over Gaussian functions. The Gauss quadrature method using orthogonal Rys polynomials, developed by Dupuis, Rys, and King (DRK), 5 is a general algorithm that is applicable to a wide range of integrals that arise in computational chemistry. Besides the original Rys quadrature, other ERI algorithms have been developed by, for example, Pople and Hehre (PH), 6 McMurchie and Davidson (MD), 7 Obara and Saika (OS), 8 and Head-Gordon and Pople (HGP). 9 Some of the modifications to the original DRK algorithm are due to Lindh, Ryu, and Liu (LRL) 10 and Dupuis and Marquez (DM). 11 Each of the developed schemes is more efficient for particular cases of integrals, while less efficient or inapplicable for other cases. In practice, quantum chemistry codes, such as the general atomic and molecular electronic structure system (GAMESS), 12 include several different ERI methods in order to take optimal advantage of the best method for particular integral and angular momentum types. The focus of the present work is on ERIs over higher (e.g., d, f) angular momentum functions.
Computationally the ERI calculations scale as ∼M 3 to ∼M 4 , where M is the number of basis functions used in the calculation, and the scalability range depends on the amount of integral prescreening that can be applied. For the most common calculations, M is typically less than a thousand, while larger calculations could require thousands of basis functions.
The ERI calculations are specific to the domain of computational chemistry and the related fields. They are much less common than general methods, such as Fourier transforms and linear algebra kernels, and typically are not as optimized as the basic linear algebra subroutines (BLAS) libraries. A typical HF or DFT calculation requires both ERI and linear algebra computations. However, ERI computations tend to dominate the overall time, since they require more floating point operations (flops). Moreover, unlike numerical linear algebra kernels that exhibit well-defined memory access patterns and simple long loop structures, ERI calculations have to account for many types of integral classes, and therefore, iteration variables do not have a simple linear relationship to the data elements which must be accessed. To help speed up the time needed to complete the ERI calculations, a general graphical processing unit (GPU) programming model has been implemented. The goal is to implement high angular momentum uncontracted ERIs in this scheme.
Electron Repulsion Integrals
Gaussian functions are taken as the standard basis for most ab initio methods. The Cartesian form of a primitive uncontracted one-electron Gaussian basis function with the center located at the origin takes the form of eq 1:
where R is the Gaussian exponent that governs the spatial extent of the function, r measures the distance from the atomic origin, and a x , a y , and a z are local quantum numbers that determine the net angular momentum L a by eq 2:
Individual Gaussian functions, like those described in eq 1, are generally called "primitive" functions. Especially for lower angular momentum functions (e.g., s and p functions), the actual basis functions are taken to be linear combinations ("contractions") of primitive Gaussians:
On the other hand, functions with higher angular momentum (e.g., d, f, and g functions) are typically uncontractedsthe focus of this work.
An uncontracted ERI in terms of these one-electron functions can be expressed as:
A contracted ERI can be constructed from a series of uncontracted ERIs (eq 5):
The angular momentum L a specifies the shape of the function and is denoted by the letters s, p, d, f, etc., for angular momentum values of 0, 1, 2, 3, etc., respectively. Functions with the same angular momentum that differ only in a x , a y , and a z indices belong to the same shell. Grouping functions into shells allows the ERIs to be evaluated more efficiently. The size (i.e., the number of functions) of a shell with angular momentum L a is and the size of an ERI shell block is where is the binomial coefficient evaluated as n(n -1)/2, and L a , L b , L c , and L d are the angular momenta of the four atomic orbitals in the ERI.
Note that the individual ERIs have an eight-fold symmetry, since (ij|kl) ) (ji|kl) ) (ij|lk), etc.; however, the ERIs are computed as shell blocks, rather than individual integrals. So generally, the eight-fold symmetry is only relevant between blocks, not within a block.
2.1. Rys Quadrature. The Rys quadrature proposed by DRK is efficient for higher order integrals (integrals with a higher order angular momentum) that are required for very accurate calculations that include electron correlation. However, it is less efficient for lower order highly contracted integrals. An attractive feature of the Rys quadrature is that it is very stable numerically, an important advantage for higher order integrals. Unlike other methods mentioned in the Introduction Section, it has a very low memory footprint, making it amenable for architectures with smaller caches, such as the GPUs of interest in this work.
The basic idea of the Rys quadrature is to evaluate the integral using a numerical Gaussian quadrature based on a set of orthogonal Rys polynomials. Equation 4 can be expressed, using i, j, k, l to denote functions of a primitive uncontracted ERI shell block (ab|cd), in the form As suggested in eq 8, X depends on the Gaussian exponents and centers. Equation 6 can be written as eq 11:
where P L is a polynomial of degree L, eq 10, with the coefficients C m in eq 6. Equation 11 can be evaluated exactly by an N-point (where N is an integer greater than L/2) Gaussian quadrature:
The W R and t R are weights and roots of the Rys polynomial, respectively. For example, a (dd|dd) block will have L ) 2 + 2 + 2 + 2 ) 8 and N ) 8/2 + 1 ) 5, and a (gg|ff) block will have L ) 4 + 4 + 3 + 3 ) 14 and N ) 14/2 + 1 ) 8. Separation of variables allows the terms of the P L polynomial, which are integrals over dr 1 dr 2 (see eq 4) to be written as a product of three two-dimensional (2-D) integrals I x , I y , and I z over dx 1 dx 2 , dy 1 dy 2 , and dz 1 dz 2 , respectively.
The overall ERI formula becomes
The 2-D integrals of a shell block have array dimensions (in FORTRAN/MATLAB notation, where commas delimit dimensions and colons specify the range) as shown below:
The first or leading dimension in eq 16 corresponds to the number of roots, and the last four dimensions correspond to Cartesian exponents for each function in a shell block. When constructing the ERI block, the 2-D integrals will be reused multiple times, hence, the computational and memory advantage of calculating ERIs as a block. For example, to construct the first six integrals of the (pp|pp) shell block, the following 2-D Cartesian integrals are used (multiplication by a constant factor is implied):
The roots and weights of the Rys polynomials can be evaluated by polynomial approximations 13 or by using a general Stieltjes procedure.
14 The 2-D Cartesian integrals are evaluated efficiently using recurrence and transfer relationships. The recurrence relationships generate 2-D integrals with all angular momenta shifted to centers i and k from (ss|ss) 2-D integrals, and transfer relationships shift the angular momentum to centers j and l to generate the desired 2-D integrals. For the details of these relationships, the reader is referred to the original DRK paper. 5 The quadrature step itself, i.e., the summation over the roots, eq 15, is the time-consuming step of the ERI shell calculation, requiring: flops, where the factor of 3 is from the two multiplications and an addition in each step. The transfer relationships scale as
, requiring many fewer operations than the quadrature step as the angular momentum increases. The recurrence relationships and root evaluation require even fewer flops than the transfer relationships for higher order integrals. Therefore, since an efficient implementation of the quadrature step determines the overall performance of the algorithm, the main topic of this paper is the efficient parallel implementation of the quadrature. Unlike the recurrence and transfer equations, which have predictable memory access patterns and can be expressed as simple vector operations, the quadrature step has complex memory access patterns which span a large data set and depend on the particular ERI class being evaluated. For example, the evaluation of the (ff |ff) ERI block requires (7)(4)(4)(4)(4) ) 5376 floating point (FP) numbers for three 2-D integral arrays (X, Y, Z) and 10 4 ) 10000 FP numbers for the final integral. For double precision numbers, the overall memory would be 123008 Bytes, well beyond the size of a typical L1 data cache.
Algorithm 1 outlines the basic structure of the Rys quadrature. Its simplicity obscures the fact that the Cartesian indices do not have a simple relationship to the iteration variables and must be either tabulated or each case must be programmed specifically for a particular ERI class. 
Graphical Processing Units
GPU technology has emerged as a viable computing platform for general purpose application programming, also known as general purpose computation on graphical processing units (GPGPU). The GPUs offer high-density arithmetic units at the expense of larger cache sizes and control units. In terms of linear algebra kernels, the GPUs can approach 20 and 70 giga floating point operations per second (GFLOPS) for matrix-vector and matrix-matrix routines, respectively, on current double precision (DP) capable devices 15 that have a theoretical peak of around 90 GFLOPS.
3.1. Compute Unified Device Architecture. Among the current GPGPU technologies, the NVIDIA compute unified device architecture (CUDA) 16 language environment is available for several GPU devices and is the target implementation choice. CUDA is a unified hardware computing architecture and programming model for graphics as well
as general-purpose processors. The current CUDA device architecture consists of a scalable array of streaming multiprocessors (SM), see Figure 1 . Each SM consists of eight scalar processors (SP), a multithreaded instruction unit, on-chip shared memory, one double precision unit, and two special purpose transcendental functional units. Under the CUDA programming model, the GPU is viewed as a highly multithreaded compute device capable of executing many threads in parallel. The threads execute a sequence of instructions in a data parallel fashionssingle-instruction multiple threads (SIMT).
Computationally demanding code paths of an application are isolated into functions (kernels in NVIDIA terminology) that are compiled into the instruction set architecture of the GPU device. The CUDA programming interface is designed with a minimal set of extensions to the C/C++ language. A runtime library provides functions to manage the compute device, to perform memory operations, and to run the devicespecific functions. The main goal of the programming environment is to develop scalable and efficient parallel programs.
A computational kernel is launched from the host and executed by T threads (T is application specific) on the device. The threads are hierarchically arranged as a grid of blocks and as a block of threads, as shown in Figure 2 (adopted from the programmers manual). 17 Each thread within a thread block has a unique set of (x, y, and z) indices that allow three-dimensional (3-D) data to be mapped onto a thread block. Each thread block has unique x and y coordinates, which map all the thread blocks onto a 2-D grid of blocks.
The logical memory space seen by the threads can be hierarchically arranged based on the data visibility (see Figure 1 ). Each thread has access to its local registers on the processor. Threads in a block can access and share data via the parallel data cache, called shared memory. The registers and the shared memory have a low latency and are limited resources available to the threads. One of the biggest challenges in designing the kernels lies in optimizing the per-thread register and the shared memory usage. Each thread also has access to a private local memory and a global memory space that are both part of the device memory and have high data access latencies. Apart from these, an application can also use the read-only constant and the texture memories that are cached.
Access to the main memory has a high latency, on the order of hundreds of cycles. To achieve full bandwidth, accesses to the main memory must be coalesced, meaning consecutive threads access consecutive memory elements to achieve full memory bandwidth. Coalescing ensures that multiple memory requests are being served simultaneously rather than served sequentially. Although only one thread block can execute at any given time on a SM, multiple thread blocks can be actiVe, thereby hiding the memory latency by overlapping the computations and the communications. An active block in CUDA terminology is a block that is ready for execution whenever a SM becomes free, e.g., when the current executing thread block starts fetching the memory. The number of active thread blocks is limited by the register and the shared memory usage.
The execution of a thread block is further batched into a series of warps that are consecutively arranged based on the thread number in batches of 32. To maximize the parallel performance, all threads in a warp must execute an identical GPU instruction, otherwise the warp is said to diverge, and the differing instructions are executed sequentially.
CUDA Rys Quadrature Implementation

Related Work. Ufimtsev and Martinez
18 evaluated the ERIs of s and p functions on GPUs in single precision using the MD algorithm. They later developed an entire Hartree-Fock code that runs on a GPU and showed improved performance 19 over the CPU code. Yasuda 20 implemented the Rys quadrature on a GPU in enhanced single precision for s and p integrals (i.e., some double precision computations were emulated in the software but still using single precision hardware). A new interpolation formula was proposed for the roots and the weights, and an error analysis for the quadrature was given. Some work has also been done to implement ERI algorithms on IBM CELL and FPGAs, 21 however only for the rather limited case of (ss|ss) ERIs.
To our knowledge there has not been a reported implementation of an ERI algorithm for d or higher angular momentum functions on GPUs or on other accelerators. The main difficulty seems to have been the limited amount of fast memory and the amount of code that must be generated for many cases involving higher angular momentum functions. This is the focus of the present work.
4.2. Implementation Considerations. Since the ERI computations are memory bound, the main consideration in designing the CUDA Rys quadrature is to optimize the memory access patterns and the data reuse. The 2-D integrals are reused multiple times to construct different ERIs and should, therefore, be loaded into shared memory. This also implies that an ERI block should be mapped onto a single thread block, as shared memory access and synchronization is limited to thread block boundaries. The ERI blocks are mapped onto the grid so that each thread block computes one ERI block. For the purposes of discussion, block is used to refer to both thread and ERI blocks.
Device memory loads and stores should be coalesced to parallelize memory accesses with high latencies. Multiple thread blocks should be active on a single SM in order to hide memory latency by overlapping computation and communication. In order to have multiple active thread blocks, the shared memory and the registers should be used sparingly. To illustrate hardware constraints, if a GPU has only 1638 4-byte registers and 16 KB of shared memory, then a kernel using 32 registers per thread and 2688 bytes of shared memory per thread block has a limit of 512 threads imposed by the register use and 6 active thread blocks per SM due to shared memory availability.
Clearly, for the larger ERI classes, the entire set of 2-D integrals cannot be kept in shared memory all at once but must be loaded from the device memory as needed. The iteration through the ERIs should be done so as to minimize the number of device memory loads. If the ERIs are only computed on the GPU but are not contracted right away, e.g., to form the Fock operator, then there is no reuse of the final ERIs.
4.3. Implementation Design. The current CUDA capable hardware imposes a limit of 512 or 768 maximum threads per block, depending on the particular GPU device. Consider the (dd|dd) ERI block case. The size of the entire block is 6 4 ) 1296 elements, exceeding the maximum number of threads. However, it is possible to map multiple elements to a single thread, e.g., by mapping i, j, and k indices, corresponding to the first three shells of the block, to a unique thread and iterating over the last index l. Since the thread block is 3-D, the mapping of the i, j, and k shell index to a thread is natural. Algorithm 2 outlines the general idea. The algorithm is in Python-like pseudo code, with ## signifying comments, and the indices and loops over the roots, N, are implied.
Algorithm 2: CUDA Rys Quadrature, i, j, and k Mapping
In terms of shared memory, i, j, and k mapping requires all the 2-D integrals of a specific l index. For the (dd|dd) ERI case, this means that the shared memory overhead for
) ) 135 elements per 2-D integral block. Though three 2-D integral blocks are needed per iteration, it is most likely that one of the previous Cartesian indices will stay the same. This means that the corresponding 2-D integral is already in the shared memory, reducing the memory communication by a third. For example, the construction of a d shell is outlined below. The three rows correspond to I x , I y , and I z Cartesian indices. The indices marked with an asterisk represent load operations. Though there are a total of 18 indices, only 13 indices must be loaded if the shell is arranged to minimize loads.
The above order may differ from the requirements of an application, however, restoring the desired ordering is trivial. In the (ff|ff) ERI case, mapping three indices to threads is not possible, as it requires 1000 threads. However, we can map the i and j indices and loop over the k and l indices in a similar fashion, as outlined in Algorithm 3. Table 1 for a |ff) case. In the example, the number of memory loads is 216. The first row of Table 1 shows the data access pattern for x, y, and z 2-D integrals when the canonical ERI ordering is used. There are some block patterns that are visible in the x and z dimensions. But these blocking patterns are not optimal from the perspective of data reuse because the blocks are small. To improve the overall memory performance, the integrals can be reordered such that one of the 2-D integrals has a well-defined block structure, for example, the z integral.
Template-based Code
Generation. If the cases described above are implemented in CUDA directly, then the register usage is high. To reduce the register use, the loops over the outer indices can be unrolled explicitly for each possible case, e.g. for |pp), |pd), |pf), etc. Programming all of the cases by hand is prohibitive, as it requires a large amount of code. However, using a template-based approach, all of the cases can be generated automatically from a single template.
There exists a number of template engines, e.g., the venerable m4 macro processor, 22 but the Python-based 21 Cheetah template engine 24 is chosen for this project. In Cheetah templates, the Python statements that control the code generation are embedded directly in the source code, similar to the manner in which traditional C preprocessor directives are used. Other benefits of using Cheetah are the ability to write complex support modules in Python and to reuse existing Python utilities.
Since generating code from a template is straightforward, the root summation loops were also explicitly unrolled. This was done to allocate registers to store a single set of 2-D integrals in registers rather than in shared memory. The benefit of doing so is that the use of shared memory and the bank conflicts are reduced. All of the shared memory is arranged in banks; the number of banks for the current hardware is 16, i.e., half-warp size. A bank conflict arises when multiple threads in a half-warp access different memory locations mapped onto the same bank, simultaneously resulting in the serialization of threads in the half-warp. 17 The bank conflicts occur often if the leading dimension (in this case, the number of roots, N) is a divisor of the bank size; accesses to an array with a leading dimension of 8 causes 8 bank conflicts. For other cases, the bank conflicts occur much less often; a leading dimension of 7 causes only one bank conflict. The bank conflicts lead to warp serialization, where the warp threads execute the instructions sequentially rather than executing the same instruction in a single instruction multiple thread (SIMT) or a lock-step fashion. Warp serialization is highly undesirable, and bank Table 1 . Index Ordering for |ff) Case conflicts are a serious performance issue, degrading the overall performance by about 25%. The bank conflicts are especially pronounced when the number of roots is even.
Results and Discussion
The performance of the implementation is evaluated on the NVIDIA GeForce GTX 275 and the Tesla T10 processors. The GTX 275 is a regular graphics card with a double precision support, an 1 GB of device memory, and a clock speed of 1.1 GHz. The Tesla T10 is a dedicated HPC accelerator with double precision (DP) support, 4 GB of device memory, and a clock speed of 1.35 GHz. The Tesla processor is capable of delivering approximately 20 GFLOPS on the level 2 BLAS double precision general matrix vector (DGEMV) routine and 70 GFLOPS on the level 3 BLAS double precision general matrix multiply (DGEMM) routine. The GTX 275 is approximately 25% slower than the Tesla.
The basis set used was purely synthetic with exponents of 1.5, which is representative of the values for higher angular momentum functions. The quoted timings do not include the GPU-CPU communication time.
The performance of the quadrature was evaluated by counting the total number of quadrature operations given by where n block is the number of ERI/thread blocks. The above metric also accounts for multiplication by a constant factor that incorporates contraction coefficients and normalization factors. For example, the (ff|ff) block requires a 21 (10 4 ) ) 210000 flop count. The total flop count is divided by the execution time on the GPU to obtain the GFLOPS metric. The execution time does not include the memory transfer overheads between the host and the GPU. The transfer time latency of the ERIs from GPU to host is several times longer than that of the actual execution time.
The performance results on the GTX 275 and Tesla boards are presented in Tables 2 and 3 , respectively. As can be seen from these tables, the performance depends to a large degree on the ERI class. The larger ERI classes (i.e., higher collective angular momentum) perform better on average than the smaller classes. The computations with an odd number of roots, cf., eq 13, e.g., (gg|dd), (ff|ff), etc., tend to have fewer bank conflicts than those with an even number of roots, e.g., (gg|ff), as discussed in Section 4. Consequently, the performance of ERI classes with an odd number of roots is higher, as much as 25% in an extreme case. The difference between the single and double precision performance is roughly a factor of 2, as previously predicted by Ufimtsev and Martinez. 25 One would expect this difference to favor single precision even more strongly, since the number of SP units is eight times the number of DP units. This suggests that the computations are memory bound rather than compute bound. The performance depends heavily on the mapping used, cf., Section 4.3. As one would expect, the "larger" i, j, and k mapping performs better than the i and j mapping for cases with lower i and j angular momenta (such as the (pp|ff) ERI block), since the shared memory reuse and parallelism is much higher. The difference between the two mappings for the same ERI class can be as high as a factor of 5. However, when the i and j angular momenta are higher (such as the (ff|pp) ERI block), the i and j mapping is only slightly outperformed by the i, j, and k mapping. Interestingly, comparing the best performance for the (pp|ff) and (ff|pp) ERI block gives very similar performance. One could use ERI index symmetry such as (ij|kl) ) (kl|ij) in these cases to ensure that the first two indices are always lower, so the i, j, and k mapping algorithm could always be used when the memory is available.
The difference in performance between the generic GTX GPU and the Tesla T10, presented in Table 4 , is 25-30% across the single and double precision performance. This is e Map is the ERI to thread mapping; the best performing mapping is shown in bold.
consistent with the higher clock speed of the Tesla compute device. In terms of registers and shared memory, Tesla does not have an advantage over the generic GTX GPU. This is reflected in the same mapping on both GPUs having the best performance for a particular ERI case. The performance of the Rys quadrature is on par with the performance of the (S/G) DGEMV routines in the CUBLAS library. In terms of the peak theoretical performance, it is possible to achieve approximately 30% in the best case. The poor-performing lower angular momentum ERI classes utilize the hardware at 10% efficiency in the worst case. However, the two mapping implementations (ijk and ij) are not specifically optimized for these ERI classes.
Previous work in this field focused on s and p integrals, which produce small, highly contracted integral blocks that can be performed entirely in shared memory and registers. Therefore, the ratio of computation to memory traffic is high. Moreover, on Telsa and older architectures, there is a ratio of 8:1 for single vs double precision floating point units. For higher angular momentum integrals, computation cannot be done entirely in shared memory and registers, so partial values must be read and stored in global memory. This helps to explain the difference in performance between this work and previous work.
The difference in performance between the GPU and the CPU, which is also presented in Table 4 , is very promising. The Rys quadrature used in GAMESS, which was also used as a benchmark by Ufimtsev and Martinez, is a legacy FORTRAN implementation that underperforms on modern CPUs. As can be seen from Table 4 , the original Rys quadrature implementation is only 15% efficient at best on a modern 8 GFLOP processor. In order to achieve good performance on both CPU and GPU, the algorithm must be implemented in a way suitable for instruction level parallelism.
Conclusions and Future Work
This work has demonstrated the ability to obtain comparable or better performance to that of an optimized DGEMV routine for a Rys quadrature implementation of two-electron integral computationssa core computation for electronic structure algorithms. Since the focus of this work is on higher angular momentum integrals, the use of the double precision units on the GPU is also highlighted. In order to achieve the best performance, memory access patterns and data reuse have been optimized. The code implementation has been greatly facilitated by using the template-based code generator. Not only does it allow for fast prototyping of various algorithms, it also provides a developer-friendly framework for the developer to focus on the main issues associated with the algorithm and allows the details associated with the many angular momenta and single vs double precision cases to be handled automatically. The use of templates could eventually be taken one step further, so that the generated code could be optimized further, depending on the GPU architecture. However, this may prove to be impractical as the NVIDIA compiler is refined to take advantage of different GPU architectures.
Some improvements are still possible with respect to data reuse, but the gains are unlikely to be high. The improvements would be due to more aggressive memory caching and memory access pattern reordering. The greatest overall improvement will come from reusing the ERI blocks as soon as they are formed on the GPU, e.g., to construct the Fock matrix. The 2-D Fock matrix is formed from 4-D ERI blocks; so, if computed on the GPU device, then the memory transfer would just be that of the Fock matrix (of order M 2 where M is the number of basis functions) rather than those of all the ERI blocks. Therefore, the computation of the Fock matrix on the GPU increases the flop count and reduces the amount of memory to be transferred to the host, resulting in overall greater performance. 17 The direct use of the ERIs on the GPU device is necessary, as the memory transfer of the raw ERIs between GPU and host is several times longer than the computation itself. The amount of data that must be transferred from the host to the GPU to start the computation is small; moreover, since it is small, it can be transferred asynchronously while the computation is running.
The contracted ERIs and the ERIs of small angular momentum functions have not been addressed directly in this work. The implementation of the Rys quadrature roots recurrence and transfer relationships is also not discussed explicitly but will be presented in a future publication. The accuracy and utility of single precision vs double precision computations will be considered in future work. In addition, future work will include the incorporation of the ERIs into modern algorithms for full electronic structure theory calculations.
