-Broadcasts data to all threads, no global memory accesses!
• For large data, shared memory used as a program-managed cache, coefficients loaded on-demand:
-Tiles sized large enough to service entire inner loop runs, broadcast to all 64 threads in a block
-Complications: nested loops, multiple arrays, varying length
-Key to performance is to locate tile loading checks outside of the two performance-critical inner loops
-Only 27% slower than hardware caching provided by constant memory (GT200)
-Next-gen "Fermi" GPUs will provide larger on-chip shared memory, L1/L2 caches, reduced control overhead
Array tile loaded in GPU shared memory. Tile size is a power-of-two, multiple of coalescing size, and allows simple indexing in inner loops (array indices are merely offset for reference within loaded tile).
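The tile scheme described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the kernel behind the timings: the names `tiledSum`, `tiledDemoMaxRelError`, `start`, `len`, the constants, and the stand-in arithmetic `c[k] * (x + k)` are all hypothetical. It keeps only the structural points from the text: a 64-thread block, a power-of-two tile whose base is aligned down to a 64-byte (16-float) boundary, a tile check placed outside the inner loop, and inner-loop indexing that is a plain offset into the loaded tile.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK   = 64;                // threads per block (from the text)
constexpr int TILE    = 64;                // floats per tile, a power of two
constexpr int ALIGN   = 16;                // 16 floats = 64-byte coalescing unit
constexpr int MAX_RUN = TILE - ALIGN + 1;  // longest run that always fits a tile

// Hypothetical kernel: out[i] = sum over runs s and k < len[s] of
// coeff[start[s] + k] * (x[i] + k), with coeff cached in shared memory.
__global__ void tiledSum(const float* coeff, const int* start, const int* len,
                         int nRuns, const float* x, float* out, int n)
{
    __shared__ float tile[TILE];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const float xi = (i < n) ? x[i] : 0.0f;  // idle threads still help load tiles
    float acc = 0.0f;
    int tileBase = -1;                       // no tile loaded yet

    for (int s = 0; s < nRuns; ++s) {
        const int s0 = start[s], l = len[s]; // identical for every thread
        // Tile check lives outside the inner loop. The condition is uniform
        // across the block, so the barriers inside it are safe.
        if (tileBase < 0 || s0 < tileBase || s0 + l > tileBase + TILE) {
            tileBase = s0 & ~(ALIGN - 1);    // align down to a 64-byte boundary
            __syncthreads();                 // everyone is done with the old tile
            for (int t = threadIdx.x; t < TILE; t += blockDim.x)
                tile[t] = coeff[tileBase + t];   // coalesced load
            __syncthreads();                 // new tile visible to all threads
        }
        const float* c = tile + (s0 - tileBase); // simple offset into the tile
        for (int k = 0; k < l; ++k)
            acc += c[k] * (xi + (float)k);   // all threads read c[k]: broadcast
    }
    if (i < n) out[i] = acc;
}

// Hypothetical host driver: runs the kernel on synthetic data and returns the
// maximum relative error against a double-precision CPU reference.
float tiledDemoMaxRelError()
{
    const int n = 1000, nRuns = 40;
    std::vector<int> start(nRuns), len(nRuns);
    int nCoeff = 0;
    for (int s = 0; s < nRuns; ++s) {        // varying run lengths, 1..MAX_RUN
        start[s] = nCoeff;
        len[s] = 1 + (s * 7) % MAX_RUN;
        nCoeff += len[s];
    }
    std::vector<float> coeff(nCoeff + TILE, 0.0f);  // pad by one full tile
    for (int j = 0; j < nCoeff; ++j) coeff[j] = 0.01f * (1 + j % 13);
    std::vector<float> x(n), out(n);
    std::vector<double> ref(n, 0.0);
    for (int i = 0; i < n; ++i) {
        x[i] = 0.001f * i;
        for (int s = 0; s < nRuns; ++s)
            for (int k = 0; k < len[s]; ++k)
                ref[i] += (double)coeff[start[s] + k] * ((double)x[i] + k);
    }

    float *dCoeff, *dX, *dOut; int *dStart, *dLen;
    cudaMalloc(&dCoeff, coeff.size() * sizeof(float));
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMalloc(&dStart, nRuns * sizeof(int));
    cudaMalloc(&dLen, nRuns * sizeof(int));
    cudaMemcpy(dCoeff, coeff.data(), coeff.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dStart, start.data(), nRuns * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dLen, len.data(), nRuns * sizeof(int), cudaMemcpyHostToDevice);

    tiledSum<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dCoeff, dStart, dLen, nRuns, dX, dOut, n);
    cudaError_t err = cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dCoeff); cudaFree(dX); cudaFree(dOut); cudaFree(dStart); cudaFree(dLen);
    if (err != cudaSuccess) return INFINITY; // report failure as a huge error

    float maxRel = 0.0f;
    for (int i = 0; i < n; ++i)
        maxRel = fmaxf(maxRel, (float)(fabs(out[i] - ref[i]) / ref[i]));
    return maxRel;
}
```

Because the tile base is aligned down by up to 15 floats, a run of at most `MAX_RUN` elements always fits in a single tile, so the inner loop needs no bounds or reload logic. Padding the coefficient array by one full tile lets the load loop read a whole tile unconditionally. In this sketch the run table is read from global memory for brevity; where it should live in practice is left open.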
64-Byte memory coalescing block boundaries
Full tile padding
CUDA-const-cache-JIT* 1 0.27 173.
(JIT 40% faster)
C60, basis set 6-31Gd. We used an unusually high-resolution MO grid for accurate timings. A more typical calculation has 1/8th the grid points.
* Runtime-generated JIT kernel compiled using batch-mode CUDA tools
** Reduced-accuracy approximation of expf(), cannot be used for zero-valued MO isosurfaces
