Massively parallel architectures offer the potential to significantly accelerate an application relative to their serial counterparts. However, not all applications exhibit an adequate level of data and/or task parallelism to exploit such platforms. Furthermore, the power consumption associated with these forms of computation renders "scaling out" for exascale levels of performance incompatible with modern sustainable energy policies. In this work, we investigate the potential for field-programmable gate arrays (FPGAs) to feature in future exascale platforms, and their capacity to improve performance per unit power measurements for the purposes of scientific computing. We have focussed our efforts on Variational Monte Carlo, and report on the benefits of co-processing with an FPGA relative to a purely multicore system.
I. INTRODUCTION
Quantum Monte Carlo (QMC) encompasses a class of techniques for approximating expectation values to quantum mechanical observables for many-electron systems [1] . By casting the time-independent Schrödinger equation into integral form, the high-dimensional integration -intractable through quadrature techniques -can be realised through a stochastic sampling of the manyelectron wavefunction. Relative to deterministic counterparts, QMC offers the potential to combine the favourable scaling of meanfield techniques and the accuracy of post-Hartree Fock methods [2] . Two forms of QMC in particular have gained traction as popular ab initio methodologies [3] : Variational Monte Carlo (VMC) and Diffusion Monte Carlo (DMC). Commonly, VMC is used to variationally optimise free parameters in the wavefunction, with DMC being subsequently applied to compute production level observables for the molecular system in question.
Prior to the failure of Dennard scaling [4] , application developers were able to indulge in a "free lunch", whereby they could exploit the continual increase in processor clock frequencies with new generations of hardware to directly accelerate an application. Since on-chip power densities no longer scale down with transistor density, manufacturers instead look towards heat dissipation technologies in an effort to accommodate Moore's law [5] . However, such techniques have limitations and eventually require manufacturers to reduce on-chip voltages, forecasted to eventually culminate in an era of "dark silicon" [6] . To address the unsustainability of frequency scaling, the tendency has been for new generations of hardware to increase the number of parallel compute units on a single chip. As a result, the onus falls onto the software * sc2018@cam.ac.uk developer to exploit either operations or tasks that can be undertaken in parallel within an application. Furthermore, with the advent of distributed memory systems and graphical processing units (GPUs) at contemporary high-performance computing (HPC) facilities, the software developer has an enormous arsenal of hardware solutions for exploiting parallelism within an application.
For QMC methodologies, the associated embarrassing parallelism lends itself well to modern parallel architectures. There is a significant body of work demonstrating the near-linear scaling of performance with respect to the available hardware concurrency [7, 8] . More recently, GPUs have been utilised as co-processors to deliver impressive performance gains [9] . However, "scaling out" is not a sustainable solution to simulating increasingly complex systems. The power consumption associated with distributed memory systems and GPUs does not align with modern energy policies. Indeed, a consensus appears to have emerged [10] [11] [12] that the power consumption of HPC is one of the most prodigious barriers to attaining exascale levels of performance. Field-Programmable Gate Arrays (FPGAs) are appealing candidates as processing units for novel exascale platforms [13, 14] . These devices combine low power consumption with the capacity to exploit both data and task parallelism, and are therefore applicable to a wider set of applications than solutions relying on data parallelism alone. To date, use of FPGAs in electronic structure theory has been somewhat limited, although several works demonstrate the power of this platform to accelerate scientific applications within this domain. [15, 16] In this work, we look to port the compute-intensive portions of a VMC calculation to an FPGA and assess the performance relative to a CPU-bound application. We elaborate on the optimisations incorporated into our design to optimise the implementation for the purposes of co-processing. Finally, we demonstrate that our appli-cation benefits from the use of FPGAs in terms of both raw compute performance and power consumption.
II. VARIATIONAL MONTE CARLO
The following section is not intended to present an exhaustive exposition of techniques within QMC. Rather, the reader is directed to the excellent review of Foulkes and coworkers [1] for further details.
A. Implementation
Consider a molecular system comprising N nuclei and n electrons, where R and r denote their collective position vectors respectively. The time-independent electronic wavefunction of this system is represented by Ψ(r; R), where the parametric dependence on R arises from the application of the Born-Oppenheimer approximation. Expanding Ψ(r; R) in terms of the exact (orthonormalised) eigenfunctions of the molecular Hamiltonian,Ĥ(r, R), the energy of the system is determined by the expectation value Ĥ (r, R) Ψ . Alternative (timeindependent) physical quantities can be obtained by substituting the molecular Hamiltonian for the appropriate corresponding operator. From the variational principle it can be shown that for any approximate wavefunction, Ψ T (r; R), referred to as the "trial wavefunction", the expectation value Ĥ (r, R) Ψ T , provides an upper bound to the true energy of the system. Consequently, the variational principle provides an important metric for quantifying the quality of an approximate trial wavefunction, with lower energies implying a higher quality wavefunction. Herein, all explicit functional dependencies are omitted unless a new quantity is introduced. Through the time-independent Schrödinger equation, the energy associated with a particular trial wavefunction can be expressed as
where integration is performed over the 3n electronic degrees of freedom. With some foresight, multiplication by the identity Ψ T /Ψ T casts this expression into a form amenable to solution through stochastic methods:
The above "importance sampled form" utilises |Ψ T | 2 as a probability density function from which samples of the "local energy", E L (r) = Ψ T −1Ĥ Ψ T , can be drawn. Through Monte Carlo, the resultant average of the local energy over N MC (uncorrelated) samples is asymptotic to the variational energy of the trial wavefunction:
where r (i) denotes a sample, and equality emerges in the limit N MC → ∞. Variational Monte Carlo (VMC) provides a means for implementing the above stochastic process. Considering an ensemble of "walkers", each representing a discrete sample of Ψ T with a particular electronic configuration in position space, the simulation proceeds by stochastically propagating each walker through configuration space. By randomly displacing an electronic configuration, r ← r, the Metropolis-Hastings algorithm can be invoked to accept the step according to the transition probability
A basic overview of the VMC algorithm follows in Algorithm 1.
Algorithm 1 Variational Monte Carlo
for iMC = 1, . . . , NMC do for iWalker = 1, . . . , Nw do for iEl = 1, . . . , n do r iEl ← riEl + δ Compute ΨT (r ) if |ΨT (r )/ΨT (r)| 2 > U(0, 1) then Compute ∇ 2 r ΨT (r ) Update Ψ −1 T (r ) Accumulate EL(r ) riEl ← r iEl return ET
B. The Trial Wavefunction
Since the many-electron wavefunction is unknown in closed-form for all but the most trivial of systems, approximations must be invoked. Owing to the antisymmetry of the fermionic wavefunction, a determinant is a particularly appropriate mathematical form for the trial wavefunction. Subject to the spin-invariance of the operator whose associated observable is to be computed, the trial wavefunction can be written as the product of α-and β-spin components,
where D(r) is referred to as the Slater matrix, and r α , r β denote the set electron position with the appropriate spin. The Slater matrix is composed of a set of oneparticle molecular orbitals, ψ(r),
where ω symbolises an arbitrary spin-state. A molecular orbital is constructed as a linear combination of atomic orbitals (LCAO),
where φ(r) denotes an atomic orbital, N AO is the number of atomic orbitals in the expansion and {c ij } are the expansion coefficients for the i th molecular orbital. The atomic orbital is formed from a linear combination of N p primitive functions,
In the above, R j is the position vector of the atom to which the atomic orbital is centred, ζ k is the width of the gaussian function and d jk are the expansion coefficients for the j th atomic orbital. f ( j , m j , r) is a function of the azimuthal and colatitudinal quantum numbers associated with the atomic orbital, i.e. its angular momentum.
It is common practice to multiply the above determinantal form of the wavefunction with a function of interparticle distances, referred to as the Jastrow factor, to account for the correlation effects between particles [17] . While inclusion of the Jastrow factor is one of the powerful features of QMC methods in general, we omit discussion of it in our work for the sake of simplicity. Furthermore, the sum of gaussian functions arising in the expression for the molecular orbital is often substituted for a cubic spline to ease the computational burden of complexity scaling with the number of atomic orbitals in the system [18] . We avoid use of these splines to ensure our application remains compute bound. We wish to stress that our resultant implementation is optimised for a subset of VMC capabilities. Rather than claim our results are representative of all VMC calculations, we hope this work serves to illustrate the potential benefits of using FPGAs, and as such should be considered exploratory as opposed to all-encompassing.
III. FPGAS A. Basics and Nomenclature
From the beginning it is worth clarifying the distinction between latency and throughput, the former being the time taken to traverse a computational workflow, and the latter being the number of outputs from the workflow per unit time; for the processing of a single data item, these are just reciprocally related. For multiple data items, pipeline parallelism offers the capacity to increase computational throughput at the cost of latency. The rationale behind leveraging latency for throughput is best illustrated by example, for which we will refer to Figure 1 . Consider a "stream" of data, x 0 , x 1 , . . ., where the subscript enumerates order. In Figure 1a , we observe an unpipelined implementation of three data-dependent modules: f , g and h, each delivering a result through composition with the preceding modules, i.e. f (x), g(x, f ) and h(x, f, g), where nested parentheses are omitted for clarity. Each module has an associated latency, f , g , h . The latency of this workflow is given by the longest route through the workflow, so here is simply the sum of the individual module latencies. The maximum frequency at which a clock can drive the workflow is determined by the propagation delay associated with the longest pathway in the workflow. As the module latencies differ, allowing multiple elements of the data stream to concurrently reside within the workflow results in improper behaviour. The occupancy of data within modules of the workflow with respect to time is depicted in the right-hand side of Figure 1a . Figure 1b depicts the same workflow as that in 1a, but with the addition of registers (red boxes and their associated latencies) capable of storing data, across the workflow. As such, each module effectively inherits its latency from the maximum module latency in the workflow, g in this case. However, introduction of the registers allows the workflow to process multiple data concurrently. In other words, the data stream exhibits temporal, or pipeline, parallelism, i.e. the task can be executed as a cascade of sub-tasks [19] . The resultant pipelined implementation then increases the throughput of the application, at the cost of a delay in processing a single datum.
An FPGA grants the application developer the means to realise a computational pipeline spatially in silicon. The FPGA is a matrix of configurable logic blocks (CLBs), fundamental programmable units comprising a lookup table and flip-flop (a logic unit capable of storing a state), amongst other logical units depending upon the chip. Signals are routed through the CLBs by a series of programmable switches, theoretically permitting the transmission of a signal between any two CLBs (although in practice one would wish to co-localise the connected CLBs for an optimal configuration). Typically, a number of "hard blocks" (such as digital signal processors, dedicated floating point units, static random access memory (SRAM) blocks, etc.) are also available on the chip for specialised tasks that may be costly to implement directly from CLBs. A configuration for the FPGA is loaded at runtime into a flash memory. The contents of this memory are used to configure the programmable switches, routing signals as per the uploaded configuration. For the types of com- 
(a) Unpipelined three-module workflow, each module being dependent upon the output of the former in sequence. A single data element must traverse the entire workflow prior to the entry of the next data element. While the latency is comparatively lower than for a pipelined case, the throughput is also low. . 
(b) Pipelined three-module workflow, each module being dependent upon the output of the former in sequence.
Multiple data elements are allowed to reside within the workflow simultaneously. The latency of the workflow is higher than the unpipelined case owing to the necessity that all modules have associated registers to increase the effective module latency to that of the highest module latency. However, throughput is twice as high as for the unpipelined case, with a new result output every g once the pipeline is filled.
FIG. 1:
An application comprising a series of data-dependent modules and its amenability to optimisation through pipeline parallelism.
putation considered in this work, the FPGA is able to interact with a general purpose processing unit through, for instance, a PCIe connection. Since the FPGA is configured to perform a specific set of computations, the overheads associated with general purpose computing (such as scheduling and interrupts controlled by an operating system) are eliminated. Furthermore, since the clock frequency at which an FPGA configuration can be run is dictated by the propagation delay across the chip, the implementation will be clocked at far lower frequency than a CPU or GPU. These two factors result in a dramatic decrease in the power consumption of an FPGA. As a rough indication, an FPGA consumes roughly an order of magnitude less power than a CPU or GPU. As such, FPGAs are attractive devices for the purposes of low-power computation.
Naturally there are some fairly sizeable obstacles to the use of FPGAs. The conventional means for programming FPGAs requires knowledge of low-level unabstracted hardware description languages (HDLs). Use of these languages requires expertise in low-level design, and is therefore typically inaccessible to application developers from the physical sciences. Consequently, development times for FPGA-based applications are enormously lengthy. However, there exist a number of tools available to the developer for porting a complex application to FPGAs. High Level Synthesis (HLS) tools, such as those marketed by the major FPGA vendors Xilinx and Intel/Altera, offer the capacity to annotate high-level source code (C/C++ and OpenCL predominantly) with preprocessor directives, subsequently used to construct a HDL implementation of the application. Nevertheless, high performance implementations require considerable restructuring relative to a conventional CPU-based alternative to facilitate the inference of a pipelined implementation by the HLS. While the level of expertise required by the developer is reduced, a significant understanding of the platform is imperative. An alternative approach is to use an embedded domain specific language (EDSL), providing a library for a highlevel host language to support FPGA-based constructs, such as streams of data. The implementation is analysed to construct an abstract syntax tree, which can subsequently be used to generate the HDL implementation of the application. An example of such a facility is the extension to java provided by Maxeler Technologies, of which more will be said later in this work. Similar to HLS, an EDSL requires that the developer write their application in a form amenable to constructing a FPGA implementation, and consequently still requires a knowledge of the platform.
B. Application Overview
Gothandaraman and coworkers [15] have previously ported a VMC application to an FPGA, directing their efforts to potential energy and trial wavefunction evaluation kernels. Through the use of pipelining and fixed point arithmetic, significant improvements in performance were obtained relative to a serial multicore implementation. However, their work was limited to bosonic systems, the trial wavefunction consequently being expressible as a product of functions of pairwise particulate distances, i.e. there is no need to invoke a determinantal form for the trial wavefunction. We have targeted fermionic VMC, and as a result our implementation requires significant departures from previous efforts.
To fully exploit the reconfigurability of an FPGA, we have chosen to write an in-house VMC code so as to not restrict ourselves to the data structures and algorithmic workflows utilised in more sophisticated packages [2, 8, 20] . Rather, we have been able to write the application so as to optimise computation through co-processing with an FPGA. It is worth reiterating that we consider here only a subset of VMC functionality; our implementation is all-electron and Jastrow-free. We have aspired to optimise our CPU implementation so as to possess a reasonable benchmark. Our VMC code is written in ISO C99, with all data structures written in a "struct of arrays" (SoA) format. The application is multithreaded using OpenMP. We have attempted to replicate the benefits of code encapsulation offered by object-oriented approaches; a struct is utilised as a means for storing data and function pointers. The Ensemble t struct, for instance, comprises electron positions, Slater matrices, laplacians, and all other data one may associate with a walker, along with methods for manipulating these data. Our application foregoes physically motivated data structures to facilitate optimal cache behaviour [21] . Such optimisations require a significant effort on the part of the developer, but their implementation can help to ameliorate an application from becoming overly memorybound. Consider, for instance, a set of n particles, each described by a position vector in R 3 . One can conceive of two separate data formats: the contiguous dimensions, i.e. {x 1 , . . . , x n , y 1 , . . . , y n , z 1 , . . . , z n }; and the physically motivated triplets, {x 1 , y 1 , z 1 , x 2 , y 2 , z 2 , . . . , x n , y n , z n }. An example of explicit computation of the unique pairwise square distances between particles is given in Algorithm 2. The first two columns of Table I give the number of cache loads and associated proportion of cache misses from use of the two data formats. The contiguous case gives significantly improved cache performance since the square distance between particles is computed dimension-at-a-time, i.e. one computes dx,dy,dz in sequence, prior to reduction and computation of the square root. Allowing the compiler to vectorise the code, one finds that the implementation effectively computes dx between iAt and several jAt concurrently. Since the x−degrees of freedom are contiguous, a single fetch is adequate to serve the vectorised operation. In contrast, for the triplet storage, one finds that a number of elements in the cache are redundant for the calculation of dx, i.e. the y, z degrees of freedom. As such, the cache does not contain all of the required data to serve the vectorised instruction, resulting in suboptimal caching. The final two columns then consider the multithreaded implementation of the pairwise distance kernel given the contiguous data structure. The cases differ based on which loop is parallelised: iAt or jAt. Given parallelisation of the iAt loop, one finds each thread attempting to load its assigned iAt simultaneously, along with corresponding jAt. The result can lead to exceedingly poor performance as a result of cache-thrashing. On the other hand, parallelising the jAt loop results in each thread working on a single iAt, and the cache contains all required data for the parallelisation over jAt.
In the following section, we adopt the conventional nomenclature "host" and "device" to denote the CPU and the FPGA, respectively (see Figure 2) . The device is connected to the host through PCI express (PCIe), per-TABLE I: Wall time and percentage of cache misses (from the number of cache references) for the contiguous and triplet data structures (first two columns), and parallelising the iAt and jAt loops of the pairwise distance kernel using OpenMP multithreading. Data obtained from unique pairwise distances between 100,000 particles. Multithreading data obtained using two threads with the contiguous data structure. mitting memory transfers to proceed by direct memory access (DMA), i.e. without blocking the CPU. Owing to the fact that VMC is embarrassingly parallel, it should hold that the difference in performance between a host and a host with device is conserved in a multiprocessing environment, with n hosts and n devices. Our study then compares single host against single host with device. Our FPGA implementation is written in the EDSL provided by Maxeler Technologies, maxj. maxj is an extension to java, providing functionalities for dealing with sequences of data which are streamed across an FPGA implementation of the application. Such an implementation is referred to as "dataflow". maxj is interpreted by the MaxCompiler, which constructs a register transfer level (RTL) implementation of the application. The RTL is subsequently passed to a synthesis tool (supplied by the chip vendor) to construct the FPGA configuration.
IV. IMPLEMENTATION DETAILS
Throughout, we use the MAIA board produced by Maxeler Technologies, possessing an Altera Stratix V FPGA chip and 48GB of on-board DDR3 DRAM. Our host processor is an Intel Xeon E5-2640 with 6 physical cores, clocked at 2.5GHz. The molecular system we work with is a lattice of 64 molecules of H 2 . We choose the STO-6G basis set for the sake of simplicity, although our described implementation is general enough to utilise an arbitrary basis set. Evaluating the trial wavefunction and its derivatives is the major computational bottleneck of VMC. For small systems such as Be 2 , profiling of our code reveals over 70% of the compute time is localised in the routines which evaluate atomic and molecular orbitals. As the system size increases, the trial wavefunction routines dominate further: upwards of 90% of compute time for the lattice of hydrogen. As such, our target for offloading to the FPGA is obvious. While the Sherman-Morrison inverse (and determinant) updating routine accounts for a non-negligible portion of runtime, it has been noted that the routine is fairly efficient on CPUs [9] since the inverse matrices fit comfortably in cache. We also note that an improved inverse updating scheme has recently been published [22] . In the future, it may be interesting to incorporate these routines into an FPGA implementation. However, for the purposes of this work, we concentrate our efforts specifically on the trial wavefunction evaluation kernel.
A. Workflow Outline
The algorithmic workflow outlined in Algorithm 1 serves to illustrate the data dependencies of the individual kernels, and the kernels which must be serialised. An electron must be displaced, and the trial wavefunction subject to this displacement evaluated. Should the new trial wavefunction meet the Metropolis criterion, the inverse and laplacian of the trial wavefunction are computed, and the local energy accumulated. Should the trial wavefunction not meet the Metropolis criterion, the electron is reverted to its position prior to displacement. There is no scope for task-level parallelism, rendering it difficult to occupy both host and device simultaneously. Simply offloading the computationally intensive portions of the calculation to the device and blocking the CPU until completion is inefficient from a co-processing perspective. The offload cost must be taken into account when benchmarking against a host-bound application. Any savings made by the co-processor in calculation time must be adjusted for the overhead associated with the offload. The embarrassingly data-parallel nature of our application can be used to mask the offload cost. By simply splitting our ensemble into two equally-sized ensembles, Ensemble A and Ensemble B, one can be processed by the host while the other is processed by the device. In effect, we artificially create pipeline-parallelism through data-parallelism. However, this enforced task-parallelism comes at the price of additional memory consumption. Consider momentarily Ensemble A in isolation of Ensemble B. At t 0 , it is offloaded to the DFE, where the constituent electrons are displaced, the Slater matrix differences and laplacians are computed, and subsequently sent back to the host. The host is now holds Ensemble A at t 1 , where all trial moves have speculatively been accepted. Upon proceeding to the Metropolis step on the host, one finds that upon rejection of a particular move, the data associated with the previous configuration has A host is a regular multicore CPU, all processing units having access to some bank of main memory. The device is able to interact with the host through PCIe. Memory transactions between host and device can take place between the main memory of the host and the FPGA, which can either utilise the data directly, or route it to device DRAM.
been overwritten, resulting in the requirement that the host recompute the data from t 0 . Now, consider the case where the data associated with Ensemble A is double-buffered. The data at t 0 is stored in both the ensemble's buffers. The double-buffer is offloaded to the device which again speculatively updates Ensemble A to t 1 , the data for which is subsequently passed back to the double-buffer. Now, when attempting the Metropolis step, the host has access to data from both t 0 and t 1 , allowing either rejection or acceptance of a trial move without the need to recompute data. Finally, let us reintroduce Ensemble B. Now, while Ensemble A is being processed by the device, Ensemble B can undergo the Metropolis step on the host having previously been speculatively updated on the device. The two ensembles can subsequently alternate being processed by the host and device, resulting in a non-blocking simulation. Furthermore, the host is free to run multithreaded, and the device need never block the host since memory transfers can proceed through DMA (unless the host is undertaking some memory-intensive task, in which case the southbridge will be overfaced). The number of threads the host uses can be tuned to minimise the amount of idle time and power consumed. An overview of this workflow is given in Figure 3 .
B. Dataflow Trial Wavefunction Evaluation
Owing to the enormous computational efforts associated with hardware synthesis, our main aim the construction of a generic dataflow implementation for the trial wavefunction kernel. By generic, we mean that the implementation should be able to process a system of arbitrary complexity (within reason), without the need for resynthesis. Naturally, some constraints must be placed on the FPGA configuration such that it does not simply exhaust all resources on-chip. However, we attempt to minimise these constraints in an attempt to maintain freedom in the design space. In evaluating the trial wavefunction, a number of data parallel regions are amenable to either hardware unrolling or temporal parallelism through the use of stream variables. In constructing a general dataflow solution, our starting point is the identification of a loop which is bounded by some system-independent parameter.
The only such loop is given by Equation (8), the reduction over primitives to calculate the value of an atomic orbital. For the vast majority of basis sets, the maximum number of primitives associated with an atomic orbital is 6. As such, we choose to unroll the loop over primitives in hardware. In doing so, an adder tree is also required to perform the reduction over primitives. From Equation (7), a given atomic orbital φ j (r) is utilised by all molecular orbital LCAOs. Specifically, a fused multiply-add operation is required to accumulate the atomic orbital multiplied by contraction coefficient onto each molecular orbital accumulation. Since the atomic orbital value is available at the end of the pipeline stage which performs the reduction over primitives, an optimal strategy involves immediate use of the atomic orbital value, permitting the calculation of the next atomic orbital in sequence. We consequently unroll the molecular orbital accumulators in hardware. We undertake a similar accumulation over second derivatives of the molecular orbitals since there is the scope for significant data reuse. It should be recognised that unrolling over molecular orbitals and their second derivatives is equivalent to the accumulation of a row of the Slater matrix and laplacian. Our choice for temporal unrolling is now enforced: since we accumulate a single row of the Slater matrix and laplacian at a time, a single electron position vector must be presented to the implementation for N AO consecutive kernel ticks. After this period has passed, the row of the Slater matrix and laplacian will have fully accumulated. These quantities can be offloaded to external memory, the accumulators reset and the next electron streamed through, i.e. the next row of the Slater matrix and laplacian. Since all walkers, and all electrons thereof, must pass through the kernel in this way, we find that our kernel will be required to run for N w × n × N AO ticks. Such counter logic can be realised through three nested counters, over walkers, electrons and atomic orbitals.
A number of quantities are fixed over the course of of the calculation: atomic position vectors, molecular orbital coefficients and primitive parameters. Fast access to these quantities is possible through memory-mapping them to the on-chip ROM of the device from the host. However, we only have access to a counter over all atomic orbitals arising in the LCAO for random access, which could result in significant duplication of data should we construct our memory-mapped ROMs in one-to-one correspondence with the atomic orbital counter. In practice, we utilise a set of decoder ROMs, mapping the atomic orbital counter to the address of quantities corresponding to that atomic orbital in the memory-mapped ROMs. The size of the decoder elements need only accommodate the number of bits to required to represent the total number of atomic orbitals, resulting in significant memory savings.
The overall schematic for our implementation is given in Figure 4 . An electron is streamed from DRAM and subjected to a random displacement. As the displaced electron position vector is required by the host for computing the change in potential energy contribution to the local energy, it is streamed back to device DRAM for offload to the host. The vector and square distance between the displaced electron and an atomic orbital centre is computed, with these quantities being used to compute the values of the primitives associated with the atomic orbital under consideration. A reduction amongst the primitives yields the value of the atomic orbital. The atomic orbital value is subsequently passed onto the molecular orbital accumulators, a fused multiply-add being used to accumulate to the LCAO. Finally, upon complete accumulation, the molecular orbital values are passed to device DRAM.
Our implementation is reliant upon two key kernels which must also be instantiated within our trial wavefunction kernel: a pseudorandom number generator is required for the displacement of electrons, and an exponential function is required for the calculation of primitive values. Our implementation of these two kernels is discussed in detail in Appendices A and B.
C. Numerical Precision
Floating point arithmetic (see Figure 5b ) involves several sequentially dependent operations. Addition of floating point numbers, for instance, requires that: the arguments are aligned to the same exponent; the mantissas are summed; the result is normalised (mantissa is shifted and exponent adjusted); and the mantissa is finally rounded to fit within the parameterised bit width. Naturally, these operations can be pipelined, floating point arithmetic thereby being suitable for the out-of-order, deeply pipelined architectures or modern
FIG. 4: Wavefunction evaluation dataflow implementation.
Red nodes correspond to on-chip memories, yellow nodes to lookup tables. Green nodes are computational kernels, and arrows denote dataflow between elements of the implementation. The blue node represents an on-board addressable DRAM block.
computers. Furthermore, the integration of dedicated ICs, such as floating points units (FPUs), into (or sometimes external to) a CPU permits the offloading of these comparatively cumbersome operations to specialised co-processing units without stalling the CPU. The usage of floating point numerics within an FPGA configuration presents two potential issues: ease of implementation and impact on performance. With regards to the former, a historical obstacle to incorporating floating point numerics into FPGA designs has been the requirement that implementations are written with HDL. maxj permits the definition of a floating point data type through a single invocation of the method dfeFloat(), taking both the integer and mantissa widths of the floating point representation as arguments. The data type can subsequently be used much like any other data type, with the implementation details handled by the compile-time synthesis.
Concerning the second issue, the impact on performance of the FPGA design as a result of floating point arithmetic, the associated latency becomes more problematic than for the CPU. Data streams which are to be combined with the result of a particular floating point operation must be buffered on-chip (as must the actual floating point manipulation stages), so as to accommodate the floating point latency. As a result, logical resources are exhausted without contributing to throughput. Some vendors facilitate the usage of floating point arithmetic through the inclusion of digital signal processors (DSPs) within the fabric of the FPGA. However, such resources are limited, and do not remedy the issues associated with buffering other data streams. Indeed, the usage of DSPs places additional constraints on the place and route, inevitably leading to a potential difficulties in synthesising a particular configuration. One particularly appealing solution is to transition from floating to fixed point arithmetic, the latter being equivalent to integer arithmetic (see Figure 5a) . One is consequently faced with a number of choices: the width of the representation; whether the numbers are signed or unsigned (i.e. whether two's-complement arithmetic must be supported); and where the radix point partitions integer from fractional parts, amongst other related decisions. To address each of these choices, we must know both the dynamical range and required precision of the quantities to be manipulated. Such matters will be discussed in the following section.
Fixed Point Support for Co-Processing
There is no native support for fixed point numerics in ISO C99. Nonetheless, it is fairly straightforward to support both fixed point representations and arithmetic through the use of the signed integer data type. However, writing an application which supports fixed point numerics throughout is complicated by a number of issues, particularly when multiple fixed point representations are required:
1. Owing to the lack of support for templated functions and operator overloading by the C99 standard, the host code will be bloated and cumbersome.
2. It is difficult to verify whether an entire application is amenable to fixed point treatment owing to the high-dimensionality of the design space. Supporting fixed point numerics throughout may consequently be an unproductive use of the developer's time should particular operations require floating point treatment.
3. Use of fixed point arithmetic relies on the construction of bespoke mathematical routines. One consequently foregoes the efficiency of the C standard library routines which have benefited from years of optimisation.
4. Should the width of the fixed point data type not be byte-aligned, the effective bandwidth of memory transactions will be suboptimal, leading to poor cache efficiency.
By choosing to use fixed point numerics exclusively in the device code, one is able to circumvent the above obstacles to some degree. One particularly appealing aspect of this choice is the constraint of our fixed point design space to an, admittedly complex, kernel. Furthermore, maxj possesses native support for fixed point numerics, including exception handling for under-/overflow of the representation along with the inference of representation width from a dynamical range of a quantity. As such, the developer is excused from dealing with low level code optimisation. However, the question remains of the interface between host and device, i.e. at what point the fixed-to-floating point conversions (and vice versa) take place. The simplest interface involves the offloading of data from the host to the device DRAM in floating point. Upon streaming of data into the DFE, all quantities are cast to and manipulated with an appropriate fixed point representation. Finally, the quantities are cast back to floating point before streaming back to device DRAM. The host is then able to retrieve the data in a natively supported data type. We find this approach somewhat wasteful. Our kernel fully unrolls the Slater matrix and laplacian streams over the dimensionality of the Slater matrices. Consequently, a large number of casts will be performed concurrently, for both input and output streams. Casting between fixed and floating point representations consumes significant on-chip logic resources. As such, a sizeable portion of on-chip resources will be devoted to casting between numerical representations, an overhead associated with the use of the co-processor. Naturally, such an implementation is best avoided. A preferable scheme materialises when considering the means by which data is transferred between host and device DRAM. Since the MAIA card resides on the PCIe bridge of a compute node, all data transactions between host and device proceed through PCIe. The PCIe driver is instantiated as an IP core in the fabric of the FPGA, requiring data pass through the FPGA over the course of data transactions. Memory transactions which use PCIe are considered a bottleneck of distributed memory systems. The prudent developer of such applications will consequently minimise memory transactions over FIG. 5: Fixed and Floating Point numerical representations using 9 bits (for the sake of demonstration, i.e. without the implication that 9-bit representations are utilised in our work). Note that these forms are inadequate for intermediary stages in arithmetic operations, where guard digits are used to prevent overflow of the representation.
the course of computation to obtain an optimal implementation. Since data transfer must take place at some point, and the data must pass through the FPGA, it is reasonable to pass the data streams through interface kernels, casting the data as it arrives from the PCIe. A free parameter in our implementation is the width of the stream we cast in the interface kernels. Since the clock frequency of the kernel is known, along with the bandwidth of the PCIe, it is a simple task to evaluate the width of the data stream which saturates the PCIe transfer, and is therefore optimal. For instance, consider a kernel clocked at 100MHz. The bandwidth of a PCIe 2.0 x16 bridge is 8 GB/s. Then, a stream of width 80 bytes saturates the PCIe transfer, equating to 10 double or 20 single precision numbers. A final consideration is the storage of the fixed point data in the device DRAM should the representation not be byte-aligned. It is difficult to find a generic storage width for arbitrary width fixed point data, since DRAM accesses must be burst-aligned. For example, the MAIA card utilised in this work possesses 6 DDR3 channels, each channel having a width of 64 bits. The DRAM can function with a burst size of 8 or 4, resulting in a burst width of 384 or 192 bytes, respectively. Since the burst width will exceed the PCIe saturated width, data must be buffered in the interface kernel for a number of kernel ticks before offloading the buffer to DRAM. Such a scheme is facilitated by using regular widths for the fixed point storage. As such, we choose a storage width of 64 bits. This choice naturally reduces the effective DRAM latency, but for the sake of ease of implementation, we proceed with this parameterisation.
V. RESULTS

A. Fixed Point
Use of a single fixed point representation throughout our kernel is not a feasible option. A number of variables possess vastly different dynamic ranges. While a single fixed point representation could in principle be found to accommodate all variables, precision of the representation will necessarily be leveraged for flexibility. As such, our strategy is to track the dynamic range of each variable over the course of a simulation and classify the number of fundamentally different dynamic ranges that arise. To narrow our design space, we will constrain each representation to the same width. We are consequently able to eliminate degrees of freedom from the device space that are only amenable to systematic optimisation strategies. In ascertaining an adequate fixed point representation, we must consider three independent points:
1. Whether two's complement is required.
2. The dynamic range of the variables, such that the integer width can be determined.
3. The number of fractional bits required to yield a stable calculation.
The first two points are easily dealt with through analysis. As a further constraint, we concern ourselves specifically with ensembles which have undergone a period of equilibration, thereby further restricting the dynamic range of runtime variables. It is, however, difficult to ascertain the required number of fractional bits without resorting to a full systematic exploration of fractional widths. Our task is complicated all the more given that in establishing the adequacy of a fractional width, a fully converged VMC calculation is required. Naturally such studies are not feasible when using a device simulator, meaning individual DFEs must be synthesised for each fractional width, which again is prohibitively expensive. As such, we employ a scheme requiring a single Monte Carlo step with the device simulator. The displacement of electrons is deactivated, and the kernel computes the Slater matrices and associated derivatives for the same Monte Carlo samples residing on the host. Upon passing these quantities back to the host, we monitor the ensemble averaged local energy and acceptance rate for these dummy moves. If the computed averaged local energy differs from the original value by less than 5 × 10 −4 a.u., we deem the implemented fractional bit width adequate. Table II shows such results for a varying number of walkers. It is clear that a total fixed point width of 38 bits is the appropriate fixed point representation for our calculation. An interesting optimisation which could be undertaken in the future centres around the work of Ceperley and Dewing [23] , where a penalty function is applied to random walks with noisy data. We have no reason to assume that truncating a numerical representation introduces a bias into the resultant calculation, and so it appears that one might be able to utilise a shorter fixed width representation and utilise the reported penalty method to compensate for the resulting imprecision. While the acceptance rate will decrease, one can in principle leverage the inefficiency in the Monte Carlo for potential accelerations deriving from the use of a shorter fixed width representation. We plan to investigate the application of the penalty method to this end in the near future.
B. Performance
The final resource utilisation of our FPGA implementation is given in Table III . While our synthesised design is moderately light on consumption of logic and dedicated arithmetic units, the on-chip memory proves to be a limiting factor to increasing the complexity of our design. Both the LUTs of our exponential units and the trial wavefunction parameters, particularly the molecular orbital expansion coefficients, are particularly culpable for such high utilisation. However, given the form of our trial wavefunction, avoiding such overheads is difficult. It is possible to force the MaxCompiler to not pipeline sections of the application, thereby reducing onchip memory consumption, but at the cost of increased difficulty in the design meeting timing.
Our final results correspond to the overall performance of our application relative to a purely host-bound implementation. We consider a number of ensemble sizes propagated for a fixed number of Monte Carlo steps. In the lower pane of Figure 6 , we plot the accelerations relative to the multithreaded host benchmark for a variable number of threads running on the host while the device computes the trial wavefunction. For small ensemble sizes, we observe a reduction in the performance relative to larger ensemble sizes owing to the overhead associated with using the co-processor. However, for ensemble sizes of 256 walkers and above, we see the improvement in performance converge towards an overall acceleration of roughly 30× relative to the multithreaded host implementation. Across all ensemble sizes, we note that the host need only instantiate 3 or 4 threads of execution to tend towards peak performance. An interesting additional result is some metric quantifying the comparative power consumptions of the two calculations. While Maxeler Technologies provide a command line utility to query the power consumption of a board, we are not entirely clear on the resolution or accuracy of this quantity. However, in-house testing on a small board with power readings taken at the power outlet reveals the command line utility is in good agreement with the outlet readings. For the MAIA board, we observed a peak power consumption of 27.6W throughout. It is, unfortunately, a little more difficult to establish a power consumption of the host. The operating system is ultimately in control of scheduling, so the overhead associated with other general purpose tasks must be accounted for. A further difficulty is what to include in the power consumption for the host. Typically, only thermal design powers (TDP) are reported by chip manufacturers, which are reported to be in poor correspondence with power consumption at peak operation (peak power consumptions equating to roughly 1.5× that of the reported TDPs. [24] ). The TDP of the Intel Xeon E5-2640 is reported to be 95W. Furthermore, it is unclear whether the power consumption of RAM chips and other peripherals are to be considered. Rather than attempt to speculate on such matters, we compose a metric given the limited data at our disposal:
where P benchmark is the power consumption of the multithreaded host implementation, P co-processing is the power consumption of the device plus that of the number of threads utilised by the co-processing implementation [25] . S is the speedup of the co-processor relative to the multithreaded host, and E is then the performance-adjusted reduction in power consumption offered by the co-processor. E is plotted in the upper pane of Figure 6 . Given that the performance of our application begins to converge to some peak acceleration when the host runs with upwards of three threads, it is unsurprising that the performance-adjusted power consumption is larger for fewer threads than more. 
VI. CONCLUSION
We have ported the computationally expensive trial wavefunction evaluation kernel from a VMC application to an FPGA platform. Through co-processing with a multicore host, we have established that our implementation offers significant benefits in terms of raw compute performance and reduced power consumption. While our VMC is minimal from the perspective of complexity of trial wavefunction, we hope that this work acts to instigate further investigation. Developer time and on-chip resources remain significantly limiting factors in the use of FPGA co-processing for complex scientific applications.
However, these limitations are forecasted to become decreasingly problematic with significant effort being directed towards alleviation. The new Intel/Altera Stratix X, for instance, possesses roughly twice the number of on-chip resources as the Stratix V used in this work. Work within our group is currently being directed towards adding further complexity to our implementation, including support for Jastrow factors and localised orbitals.
We are also looking to support a DMC application, requiring the first derivative of the trial wavefunction, in addition to support for an FPGAbased Sherman-Morrison inverse updating kernel. We hope that such endeavours will aid in the porting of scientific applications to forecasted exascale platforms.
where I ∈ Z w×w is the identity matrix and a is a w−bit number, a w−1 being the highest-order bit of a. This form is most convenient for computation, as one is able to evaluate the vector-matrix product of Equation (A1) through simple bitwise operations, without having to explicitly form and store A,
where is the right-shift operator, and x 0 denotes the lowest-bit of x.
To improve the k-distribution of the MT, a "tempering" is used for the output pseudorandom numbers, accomplished through application of a tempering matrix, T . As with the twist-transformation matrix, we are spared from explicitly forming/ storing the matrix T through enforcing that it satisfy a set of pipelined operations:
where u, s, t, l are tempering bit shifts, and b, c are tempering bit masks. The ampersand is used in the conventional sense to denote bitwise AND. The process we have just outlined is most amenable to FPGA implementation. The recurrence relationship of (A1) lends itself to realisation with a "linear feedback shift register" (LFSR), where the output of the register containing an initial sequence of (x 0 , x 1 , . . . , x k−1 is simultaneously fed to the back of the LFSR to construct x k , and so on. Furthermore, the pipelined operations outlined in the tempering steps of Equations (A4) -(A7) are suitable for cascaded operations in the fabric of the FPGA. The MT is then a natural candidate as a PRNG for FPGAs, and is our choice for on-chip pseudorandom number generation.
Transcendental functions, by definition, cannot be computed exactly in a finite number of algebraic steps. As such, algorithms which implement transcendental functions must employ some approximation, leveraging speed of convergence for an increase in operational complexity. The exponential function pervades quantum chemistry owing to its utility in simplifying the integrals arising in deterministic methodologies. While useful integral properties are an irrelevance for QMC techniques, linear combinations of exponentials are still commonly used to construct atomic orbitals, as stated in Equation (8) . As such, in order that our FPGA implementation of the trial wavefunction evaluation routines be as efficient as possible, we have chosen to construct an exponential function which can be queried as rapidly as possible, while consuming minimal on-chip resources.
Historically, the CORDIC algorithm [27] has been utilised on FPGAs for trigonometric and hyperbolic functions. The identity cosh(x)+sinh(x) = exp(x) can subsequently be used to approximate the exponential function. Since CORDIC is composed entirely of adds and shifts, it leads to a significant reduction in complexity relative to polynomial approximations which require explicit multiplication. However, iterative nature of the CORDIC algorithm renders it suboptimal for our purposes -the data dependency between iterations is unsuitable for the dataflow implementation we seek to create.
We have instead utilised a lookup table (LUT) approach [28, 29] .
While such implementations are memory-intensive, modern FPGA chips have significant amounts of on-chip RAM (of the order of megabytes), and so the cost of storing a few thousand floating point numbers is fairly inconsequential. In order to utilise a LUT-based approach to approximating the exponential function, one must manipulate the identity exp(x) = 2 log 2 (e) x = 2 x log 2 (e) .
We proceed to define x log 2 (e) = y i + y f , where y i and y f are the integer and fractional parts, respectively, of the original argument scaled by log 2 (e). We consequently arrive at the expression exp(x) = 2 (yi+y f ) = 2 yi × 2 y f .
A similar partitioning can be undertaken to split y f down into smaller pieces. Performing this additional partitioning of the fractional part, we can split the approximation down into a series of independent lookups and multiplication of each returned value.
