Field-Programmable Gate Arrays and Quantum Monte Carlo: Power Efficient
  Co-processing for Scalable High-Performance Computing by Cardamone, Salvatore et al.
Field-Programmable Gate Arrays and Quantum Monte Carlo:
Power Efficient Co-processing for Scalable High-Performance Computing
Salvatore Cardamone,1, ∗ Jonathan R. Kimmitt,1, 2 Hugh G. A. Burton,1 and Alex J. W. Thom1
1University Chemical Laboratory, Lensfield Road, Cambridge CB2 1EW, United Kingdom
2Department of Computer Science and Technology, William Gates Building,
J. J. Thomson Ave, Cambridge CB3 0FD, United Kingdom
(Dated: August 8, 2018)
Massively parallel architectures offer the potential to significantly accelerate an application rel-
ative to their serial counterparts. However, not all applications exhibit an adequate level of data
and/or task parallelism to exploit such platforms. Furthermore, the power consumption associated
with these forms of computation renders “scaling out” for exascale levels of performance incom-
patible with modern sustainable energy policies. In this work, we investigate the potential for
field-programmable gate arrays (FPGAs) to feature in future exascale platforms, and their capacity
to improve performance per unit power measurements for the purposes of scientific computing. We
have focussed our efforts on Variational Monte Carlo, and report on the benefits of co-processing
with an FPGA relative to a purely multicore system.
I. INTRODUCTION
Quantum Monte Carlo (QMC) encompasses a class
of techniques for approximating expectation values
to quantum mechanical observables for many-electron
systems[1]. By casting the time-independent Schro¨dinger
equation into integral form, the high-dimensional integra-
tion — intractable through quadrature techniques — can
be realised through a stochastic sampling of the many-
electron wavefunction.
Relative to deterministic counterparts, QMC offers the
potential to combine the favourable scaling of mean-
field techniques and the accuracy of post-Hartree Fock
methods[2]. Two forms of QMC in particular have gained
traction as popular ab initio methodologies[3]: Varia-
tional Monte Carlo (VMC) and Diffusion Monte Carlo
(DMC). Commonly, VMC is used to variationally op-
timise free parameters in the wavefunction, with DMC
being subsequently applied to compute production level
observables for the molecular system in question.
Prior to the failure of Dennard scaling[4], application
developers were able to indulge in a “free lunch”,
whereby they could exploit the continual increase in
processor clock frequencies with new generations of
hardware to directly accelerate an application. Since
on-chip power densities no longer scale down with tran-
sistor density, manufacturers instead look towards heat
dissipation technologies in an effort to accommodate
Moore’s law[5]. However, such techniques have limita-
tions and eventually require manufacturers to reduce
on-chip voltages, forecasted to eventually culminate in
an era of “dark silicon”[6].
To address the unsustainability of frequency scaling,
the tendency has been for new generations of hardware
to increase the number of parallel compute units on a
single chip. As a result, the onus falls onto the software
∗ sc2018@cam.ac.uk
developer to exploit either operations or tasks that can
be undertaken in parallel within an application. Further-
more, with the advent of distributed memory systems
and graphical processing units (GPUs) at contemporary
high-performance computing (HPC) facilities, the soft-
ware developer has an enormous arsenal of hardware
solutions for exploiting parallelism within an application.
For QMC methodologies, the associated embarrassing
parallelism lends itself well to modern parallel architec-
tures. There is a significant body of work demonstrating
the near-linear scaling of performance with respect to
the available hardware concurrency[7, 8]. More recently,
GPUs have been utilised as co-processors to deliver
impressive performance gains[9]. However, “scaling out”
is not a sustainable solution to simulating increasingly
complex systems. The power consumption associated
with distributed memory systems and GPUs does not
align with modern energy policies. Indeed, a consensus
appears to have emerged[10–12] that the power con-
sumption of HPC is one of the most prodigious barriers
to attaining exascale levels of performance.
Field-Programmable Gate Arrays (FPGAs) are appeal-
ing candidates as processing units for novel exascale
platforms[13, 14]. These devices combine low power con-
sumption with the capacity to exploit both data and task
parallelism, and are therefore applicable to a wider set
of applications than solutions relying on data parallelism
alone. To date, use of FPGAs in electronic structure
theory has been somewhat limited, although several
works demonstrate the power of this platform to ac-
celerate scientific applications within this domain.[15, 16]
In this work, we look to port the compute-intensive
portions of a VMC calculation to an FPGA and assess
the performance relative to a CPU-bound application.
We elaborate on the optimisations incorporated into our
design to optimise the implementation for the purposes
of co-processing. Finally, we demonstrate that our appli-
ar
X
iv
:1
80
8.
02
40
2v
1 
 [p
hy
sic
s.c
om
p-
ph
]  
7 A
ug
 20
18
2cation benefits from the use of FPGAs in terms of both
raw compute performance and power consumption.
II. VARIATIONAL MONTE CARLO
The following section is not intended to present an ex-
haustive exposition of techniques within QMC. Rather,
the reader is directed to the excellent review of Foulkes
and coworkers[1] for further details.
A. Implementation
Consider a molecular system comprising N nuclei and
n electrons, where R and r denote their collective po-
sition vectors respectively. The time-independent elec-
tronic wavefunction of this system is represented by
Ψ(r;R), where the parametric dependence on R arises
from the application of the Born–Oppenheimer approx-
imation. Expanding Ψ(r;R) in terms of the exact (or-
thonormalised) eigenfunctions of the molecular Hamilto-
nian, Hˆ(r,R), the energy of the system is determined
by the expectation value 〈Hˆ(r,R)〉Ψ. Alternative (time-
independent) physical quantities can be obtained by sub-
stituting the molecular Hamiltonian for the appropriate
corresponding operator.
From the variational principle it can be shown that
for any approximate wavefunction, ΨT (r;R), referred
to as the “trial wavefunction”, the expectation value
〈Hˆ(r,R)〉ΨT , provides an upper bound to the true en-
ergy of the system. Consequently, the variational prin-
ciple provides an important metric for quantifying the
quality of an approximate trial wavefunction, with lower
energies implying a higher quality wavefunction. Herein,
all explicit functional dependencies are omitted unless a
new quantity is introduced.
Through the time-independent Schro¨dinger equation, the
energy associated with a particular trial wavefunction
can be expressed as
ET =
∫
drΨ∗T HˆΨT∫
drΨ∗TΨT
, (1)
where integration is performed over the 3n electronic de-
grees of freedom. With some foresight, multiplication
by the identity ΨT /ΨT casts this expression into a form
amenable to solution through stochastic methods:
ET =
∫
dr|ΨT |2 HˆΨTΨT∫
dr|ΨT |2 . (2)
The above “importance sampled form” utilises |ΨT |2 as
a probability density function from which samples of
the “local energy”, EL(r) = ΨT
−1HˆΨT , can be drawn.
Through Monte Carlo, the resultant average of the local
energy over NMC (uncorrelated) samples is asymptotic
to the variational energy of the trial wavefunction:
ET ∼ lim
NMC→∞
1
NMC
NMC∑
i=1
EL(r
(i)) , (3)
where r(i) denotes a sample, and equality emerges in the
limit NMC →∞.
Variational Monte Carlo (VMC) provides a means for
implementing the above stochastic process. Considering
an ensemble of “walkers”, each representing a discrete
sample of ΨT with a particular electronic configuration in
position space, the simulation proceeds by stochastically
propagating each walker through configuration space. By
randomly displacing an electronic configuration, r′ ←
r, the Metropolis–Hastings algorithm can be invoked to
accept the step according to the transition probability
P (r′ ← r) = min
1, ∣∣∣∣∣ΨT (r′)ΨT (r)
∣∣∣∣∣
2
 . (4)
A basic overview of the VMC algorithm follows in Algo-
rithm 1.
Algorithm 1 Variational Monte Carlo
for iMC = 1, . . . , NMC do
for iWalker = 1, . . . , Nw do
for iEl = 1, . . . , n do
r′iEl ← riEl + δ
Compute ΨT (r
′)
if |ΨT (r′)/ΨT (r)|2 > U(0, 1) then
Compute ∇2r′ΨT (r′)
Update Ψ−1T (r
′)
Accumulate EL(r
′)
riEl ← r′iEl
return ET
B. The Trial Wavefunction
Since the many-electron wavefunction is unknown in
closed-form for all but the most trivial of systems, ap-
proximations must be invoked. Owing to the antisym-
metry of the fermionic wavefunction, a determinant is a
particularly appropriate mathematical form for the trial
wavefunction. Subject to the spin-invariance of the oper-
ator whose associated observable is to be computed, the
trial wavefunction can be written as the product of α-
and β-spin components,
ΨT (r
α, rβ) = det[Dα(rα)]× det[Dβ(rβ)] , (5)
where D(r) is referred to as the Slater matrix, and rα, rβ
denote the set electron position with the appropriate
spin. The Slater matrix is composed of a set of one-
particle molecular orbitals, ψ(r),
Dω(rω) =
 ψ1(r
ω
1 ) . . . ψnω (r
ω
1 )
...
. . .
...
ψ1(r
ω
n) . . . ψnω (r
ω
n)
 , (6)
3where ω symbolises an arbitrary spin-state. A molecular
orbital is constructed as a linear combination of atomic
orbitals (LCAO),
ψi(r) =
NAO∑
j=1
cijφj(r) , (7)
where φ(r) denotes an atomic orbital, NAO is the num-
ber of atomic orbitals in the expansion and {cij} are the
expansion coefficients for the ith molecular orbital. The
atomic orbital is formed from a linear combination of Np
primitive functions,
φj(r) = f(`j ,mj , |r −Rj |)
Np∑
k=1
djk exp(−ζk|r −Rj |2)
= f(`j ,mj , r)
Np∑
k=1
ηjk(r) . (8)
In the above, Rj is the position vector of the atom to
which the atomic orbital is centred, ζk is the width
of the gaussian function and djk are the expansion
coefficients for the jth atomic orbital. f(`j ,mj , r) is a
function of the azimuthal and colatitudinal quantum
numbers associated with the atomic orbital, i.e. its
angular momentum.
It is common practice to multiply the above determi-
nantal form of the wavefunction with a function of inter-
particle distances, referred to as the Jastrow factor, to
account for the correlation effects between particles[17].
While inclusion of the Jastrow factor is one of the pow-
erful features of QMC methods in general, we omit dis-
cussion of it in our work for the sake of simplicity. Fur-
thermore, the sum of gaussian functions arising in the
expression for the molecular orbital is often substituted
for a cubic spline to ease the computational burden of
complexity scaling with the number of atomic orbitals in
the system[18]. We avoid use of these splines to ensure
our application remains compute bound.
We wish to stress that our resultant implementation is
optimised for a subset of VMC capabilities. Rather than
claim our results are representative of all VMC calcula-
tions, we hope this work serves to illustrate the potential
benefits of using FPGAs, and as such should be consid-
ered exploratory as opposed to all-encompassing.
III. FPGAS
A. Basics and Nomenclature
From the beginning it is worth clarifying the distinc-
tion between latency and throughput, the former
being the time taken to traverse a computational
workflow, and the latter being the number of outputs
from the workflow per unit time; for the processing of
a single data item, these are just reciprocally related.
For multiple data items, pipeline parallelism offers the
capacity to increase computational throughput at the
cost of latency. The rationale behind leveraging latency
for throughput is best illustrated by example, for which
we will refer to Figure 1.
Consider a “stream” of data, x0, x1, . . ., where the
subscript enumerates order. In Figure 1a, we observe
an unpipelined implementation of three data-dependent
modules: f , g and h, each delivering a result through
composition with the preceding modules, i.e. f(x),
g(x, f) and h(x, f, g), where nested parentheses are
omitted for clarity. Each module has an associated
latency, `f , `g, `h. The latency of this workflow is given
by the longest route through the workflow, so here
is simply the sum of the individual module latencies.
The maximum frequency at which a clock can drive
the workflow is determined by the propagation delay
associated with the longest pathway in the workflow. As
the module latencies differ, allowing multiple elements
of the data stream to concurrently reside within the
workflow results in improper behaviour. The occupancy
of data within modules of the workflow with respect to
time is depicted in the right-hand side of Figure 1a.
Figure 1b depicts the same workflow as that in 1a,
but with the addition of registers (red boxes and their
associated latencies) capable of storing data, across
the workflow. As such, each module effectively inherits
its latency from the maximum module latency in the
workflow, `g in this case. However, introduction of the
registers allows the workflow to process multiple data
concurrently. In other words, the data stream exhibits
temporal, or pipeline, parallelism, i.e. the task can be
executed as a cascade of sub-tasks[19]. The resultant
pipelined implementation then increases the throughput
of the application, at the cost of a delay in processing a
single datum.
An FPGA grants the application developer the means
to realise a computational pipeline spatially in silicon.
The FPGA is a matrix of configurable logic blocks
(CLBs), fundamental programmable units comprising
a lookup table and flip-flop (a logic unit capable of
storing a state), amongst other logical units depending
upon the chip. Signals are routed through the CLBs
by a series of programmable switches, theoretically
permitting the transmission of a signal between any two
CLBs (although in practice one would wish to co-localise
the connected CLBs for an optimal configuration).
Typically, a number of “hard blocks” (such as digital
signal processors, dedicated floating point units, static
random access memory (SRAM) blocks, etc.) are also
available on the chip for specialised tasks that may be
costly to implement directly from CLBs.
A configuration for the FPGA is loaded at runtime into
a flash memory. The contents of this memory are used
to configure the programmable switches, routing signals
as per the uploaded configuration. For the types of com-
4Module f
Latency : 10
Module g
Latency : 20
Module h
Latency : 10
Input
Output
x0
x0
x0
x0
x0
x0x1
x1
x1
x1
x1
Input
t0
t1
t2
t3
t4
t5
t6
t7
t8
t9
f g h Output
L = 40
Unpipelined
Total Latency, L = `f + `g + `h = 40
Throughput, T = L−1 = 0.025−1
(a) Unpipelined three-module workflow, each module being
dependent upon the output of the former in sequence. A
single data element must traverse the entire workflow prior
to the entry of the next data element. While the latency is
comparatively lower than for a pipelined case, the
throughput is also low.
.
Module f
Latency : 10
Module g
Latency : 20
Module h
Latency : 10
Input
Output
10 10
10
20
20 20
x0
x0
x0
x0
x0
x0
x0
x0
x1
x1
x1
x1
x1
x1
x1
x1
x2
x2
x2
x2
x2
x2
x3
x3
x3
x3
x4
x4
Input
t0
t1
t2
t3
t4
t5
t6
t7
t8
t9
f g h Output
L = 60
Pipelined
Total Latency, L = 3×max(`f , `g, `h) = 60
Throughput, T = 3× L−1 = 0.05
(b) Pipelined three-module workflow, each module being
dependent upon the output of the former in sequence.
Multiple data elements are allowed to reside within the
workflow simultaneously. The latency of the workflow is
higher than the unpipelined case owing to the necessity that
all modules have associated registers to increase the effective
module latency to that of the highest module latency.
However, throughput is twice as high as for the unpipelined
case, with a new result output every `g once the pipeline is
filled.
FIG. 1: An application comprising a series of data-dependent modules and its amenability to optimisation through
pipeline parallelism.
putation considered in this work, the FPGA is able to
interact with a general purpose processing unit through,
for instance, a PCIe connection. Since the FPGA is
configured to perform a specific set of computations, the
overheads associated with general purpose computing
(such as scheduling and interrupts controlled by an
operating system) are eliminated. Furthermore, since
the clock frequency at which an FPGA configuration
can be run is dictated by the propagation delay across
the chip, the implementation will be clocked at far lower
frequency than a CPU or GPU. These two factors result
in a dramatic decrease in the power consumption of
an FPGA. As a rough indication, an FPGA consumes
roughly an order of magnitude less power than a CPU
or GPU. As such, FPGAs are attractive devices for the
purposes of low-power computation.
Naturally there are some fairly sizeable obstacles to
the use of FPGAs. The conventional means for pro-
gramming FPGAs requires knowledge of low-level unab-
stracted hardware description languages (HDLs). Use of
these languages requires expertise in low-level design, and
is therefore typically inaccessible to application develop-
ers from the physical sciences. Consequently, develop-
ment times for FPGA-based applications are enormously
lengthy.
However, there exist a number of tools available to the
developer for porting a complex application to FPGAs.
High Level Synthesis (HLS) tools, such as those mar-
keted by the major FPGA vendors Xilinx and Intel/Al-
tera, offer the capacity to annotate high-level source code
(C/C++ and OpenCL predominantly) with preprocessor
directives, subsequently used to construct a HDL imple-
mentation of the application. Nevertheless, high perfor-
mance implementations require considerable restructur-
ing relative to a conventional CPU-based alternative to
facilitate the inference of a pipelined implementation by
the HLS. While the level of expertise required by the
developer is reduced, a significant understanding of the
platform is imperative.
An alternative approach is to use an embedded domain
specific language (EDSL), providing a library for a high-
level host language to support FPGA-based constructs,
such as streams of data. The implementation is analysed
to construct an abstract syntax tree, which can subse-
quently be used to generate the HDL implementation
of the application. An example of such a facility is the
extension to java provided by Maxeler Technologies, of
which more will be said later in this work. Similar to
HLS, an EDSL requires that the developer write their
application in a form amenable to constructing a FPGA
implementation, and consequently still requires a knowl-
edge of the platform.
B. Application Overview
Gothandaraman and coworkers[15] have previously
ported a VMC application to an FPGA, directing
their efforts to potential energy and trial wavefunction
5evaluation kernels. Through the use of pipelining and
fixed point arithmetic, significant improvements in
performance were obtained relative to a serial multicore
implementation. However, their work was limited to
bosonic systems, the trial wavefunction consequently
being expressible as a product of functions of pairwise
particulate distances, i.e. there is no need to invoke
a determinantal form for the trial wavefunction. We
have targeted fermionic VMC, and as a result our
implementation requires significant departures from
previous efforts.
To fully exploit the reconfigurability of an FPGA, we
have chosen to write an in-house VMC code so as to not
restrict ourselves to the data structures and algorithmic
workflows utilised in more sophisticated packages[2, 8,
20]. Rather, we have been able to write the application
so as to optimise computation through co-processing with
an FPGA. It is worth reiterating that we consider here
only a subset of VMC functionality; our implementation
is all-electron and Jastrow-free.
We have aspired to optimise our CPU implementation
so as to possess a reasonable benchmark. Our VMC code
is written in ISO C99, with all data structures written
in a “struct of arrays” (SoA) format. The application
is multithreaded using OpenMP. We have attempted to
replicate the benefits of code encapsulation offered by
object-oriented approaches; a struct is utilised as a means
for storing data and function pointers. The Ensemble t
struct, for instance, comprises electron positions, Slater
matrices, laplacians, and all other data one may associate
with a walker, along with methods for manipulating these
data.
Our application foregoes physically motivated data struc-
tures to facilitate optimal cache behaviour[21]. Such op-
timisations require a significant effort on the part of the
developer, but their implementation can help to ame-
liorate an application from becoming overly memory-
bound. Consider, for instance, a set of n particles, each
described by a position vector in R3. One can conceive
of two separate data formats: the contiguous dimensions,
i.e. {x1, . . . , xn, y1, . . . , yn, z1, . . . , zn}; and the physically
motivated triplets, {x1, y1, z1, x2, y2, z2, . . . , xn, yn, zn}.
An example of explicit computation of the unique pair-
wise square distances between particles is given in Algo-
rithm 2.
Algorithm 2 Pairwise Distances using Contiguous or
Triplet Data Structures
static const size_t nAtoms = 100000 , nDims = 3 ;
static float positions[nAtoms*nDims] ;
static size_t xi, yi, zi = 0 ;
static size_t xj, yj, zj = 0 ;
for( size_t iAt=0 ; iAt <nAtoms ; ++iAt ) {
#ifdef CONTIGUOUS
xi = iAt ; yi = nAtoms + xi ; zi = nAtoms + yi ;
#else /* TRIPLET */
xi = iAt * nDims ; yi = xi + 1 ; zi = yi + 1 ;
#endif /* #ifdef CONTIGUOUS */
for( size_t jAt=iAt+1 ; jAt <nAtoms; ++jAt ) {
#ifdef CONTIGUOUS
xj = jAt ; yj = nAtoms + xj ; zj = nAtoms + yj;
#else /* TRIPLET */
xj = jAt * nDims ; yj = xj + 1 ; zj = yj + 1 ;
#endif /* #ifdef CONTIGUOUS */
float dx = positions[xi] - positions[xj] ;
float dy = positions[yi] - positions[yj] ;
float dz = positions[zi] - positions[zj] ;
float rSq = dx*dx + dy*dy + dz*dz ;
}
}
The first two columns of Table I give the number of
cache loads and associated proportion of cache misses
from use of the two data formats. The contiguous
case gives significantly improved cache performance
since the square distance between particles is computed
dimension-at-a-time, i.e. one computes dx,dy,dz in
sequence, prior to reduction and computation of the
square root. Allowing the compiler to vectorise the code,
one finds that the implementation effectively computes
dx between iAt and several jAt concurrently. Since the
x−degrees of freedom are contiguous, a single fetch is
adequate to serve the vectorised operation. In contrast,
for the triplet storage, one finds that a number of
elements in the cache are redundant for the calculation
of dx, i.e. the y, z degrees of freedom. As such, the cache
does not contain all of the required data to serve the
vectorised instruction, resulting in suboptimal caching.
The final two columns then consider the multithreaded
implementation of the pairwise distance kernel given the
contiguous data structure. The cases differ based on
which loop is parallelised: iAt or jAt. Given paralleli-
sation of the iAt loop, one finds each thread attempting
to load its assigned iAt simultaneously, along with
corresponding jAt. The result can lead to exceedingly
poor performance as a result of cache-thrashing. On
the other hand, parallelising the jAt loop results in
each thread working on a single iAt, and the cache
contains all required data for the parallelisation over jAt.
In the following section, we adopt the conventional
nomenclature “host” and “device” to denote the CPU
and the FPGA, respectively (see Figure 2). The device is
connected to the host through PCI express (PCIe), per-
6TABLE I: Wall time and percentage of cache misses (from the number of cache references) for the contiguous and
triplet data structures (first two columns), and parallelising the iAt and jAt loops of the pairwise distance kernel
using OpenMP multithreading. Data obtained from unique pairwise distances between 100,000 particles.
Multithreading data obtained using two threads with the contiguous data structure.
Contiguous Triplet iAt OpenMP jAt OpenMP
Wall Time (s) 20.107 22.831 15.936 13.057
Cache Misses 0.279% 0.534% 16.479% 0.594%
mitting memory transfers to proceed by direct memory
access (DMA), i.e. without blocking the CPU. Owing to
the fact that VMC is embarrassingly parallel, it should
hold that the difference in performance between a host
and a host with device is conserved in a multiprocessing
environment, with n hosts and n devices. Our study
then compares single host against single host with
device.
Our FPGA implementation is written in the EDSL
provided by Maxeler Technologies, maxj. maxj is an
extension to java, providing functionalities for dealing
with sequences of data which are streamed across an
FPGA implementation of the application. Such an
implementation is referred to as “dataflow”. maxj is
interpreted by the MaxCompiler, which constructs a
register transfer level (RTL) implementation of the
application. The RTL is subsequently passed to a
synthesis tool (supplied by the chip vendor) to construct
the FPGA configuration.
IV. IMPLEMENTATION DETAILS
Throughout, we use the MAIA board produced by
Maxeler Technologies, possessing an Altera Stratix V
FPGA chip and 48GB of on-board DDR3 DRAM. Our
host processor is an Intel Xeon E5-2640 with 6 physical
cores, clocked at 2.5GHz. The molecular system we work
with is a lattice of 64 molecules of H2. We choose the
STO-6G basis set for the sake of simplicity, although our
described implementation is general enough to utilise an
arbitrary basis set.
Evaluating the trial wavefunction and its derivatives is
the major computational bottleneck of VMC. For small
systems such as Be2, profiling of our code reveals over
70% of the compute time is localised in the routines
which evaluate atomic and molecular orbitals. As the
system size increases, the trial wavefunction routines
dominate further: upwards of 90% of compute time
for the lattice of hydrogen. As such, our target for
offloading to the FPGA is obvious.
While the Sherman–Morrison inverse (and determinant)
updating routine accounts for a non-negligible portion
of runtime, it has been noted that the routine is fairly
efficient on CPUs[9] since the inverse matrices fit com-
fortably in cache. We also note that an improved inverse
updating scheme has recently been published[22]. In
the future, it may be interesting to incorporate these
routines into an FPGA implementation. However, for
the purposes of this work, we concentrate our efforts
specifically on the trial wavefunction evaluation kernel.
A. Workflow Outline
The algorithmic workflow outlined in Algorithm 1
serves to illustrate the data dependencies of the individ-
ual kernels, and the kernels which must be serialised.
An electron must be displaced, and the trial wavefunc-
tion subject to this displacement evaluated. Should the
new trial wavefunction meet the Metropolis criterion, the
inverse and laplacian of the trial wavefunction are com-
puted, and the local energy accumulated. Should the
trial wavefunction not meet the Metropolis criterion, the
electron is reverted to its position prior to displacement.
There is no scope for task-level parallelism, rendering it
difficult to occupy both host and device simultaneously.
Simply offloading the computationally intensive portions
of the calculation to the device and blocking the CPU
until completion is inefficient from a co-processing per-
spective. The offload cost must be taken into account
when benchmarking against a host-bound application.
Any savings made by the co-processor in calculation time
must be adjusted for the overhead associated with the of-
fload.
The embarrassingly data-parallel nature of our applica-
tion can be used to mask the offload cost. By simply
splitting our ensemble into two equally-sized ensembles,
Ensemble A and Ensemble B, one can be processed by
the host while the other is processed by the device. In
effect, we artificially create pipeline-parallelism through
data-parallelism. However, this enforced task-parallelism
comes at the price of additional memory consumption.
Consider momentarily Ensemble A in isolation of
Ensemble B. At t0, it is offloaded to the DFE, where the
constituent electrons are displaced, the Slater matrix dif-
ferences and laplacians are computed, and subsequently
sent back to the host. The host is now holds Ensemble A
at t1, where all trial moves have speculatively been ac-
cepted. Upon proceeding to the Metropolis step on the
host, one finds that upon rejection of a particular move,
the data associated with the previous configuration has
7Host
CPU 1 CPU 2
Memory
Device
DRAM
FPGA
PCIe
FIG. 2: Schematic of the host-device architecture. A host is a regular multicore CPU, all processing units having
access to some bank of main memory. The device is able to interact with the host through PCIe. Memory
transactions between host and device can take place between the main memory of the host and the FPGA, which
can either utilise the data directly, or route it to device DRAM.
been overwritten, resulting in the requirement that the
host recompute the data from t0.
Now, consider the case where the data associated with
Ensemble A is double-buffered. The data at t0 is stored
in both the ensemble’s buffers. The double-buffer is of-
floaded to the device which again speculatively updates
Ensemble A to t1, the data for which is subsequently
passed back to the double-buffer. Now, when attempt-
ing the Metropolis step, the host has access to data from
both t0 and t1, allowing either rejection or acceptance of
a trial move without the need to recompute data.
Finally, let us reintroduce Ensemble B. Now, while
Ensemble A is being processed by the device, Ensemble B
can undergo the Metropolis step on the host having previ-
ously been speculatively updated on the device. The two
ensembles can subsequently alternate being processed by
the host and device, resulting in a non-blocking simula-
tion. Furthermore, the host is free to run multithreaded,
and the device need never block the host since mem-
ory transfers can proceed through DMA (unless the host
is undertaking some memory-intensive task, in which
case the southbridge will be overfaced). The number
of threads the host uses can be tuned to minimise the
amount of idle time and power consumed. An overview
of this workflow is given in Figure 3.
B. Dataflow Trial Wavefunction Evaluation
Owing to the enormous computational efforts as-
sociated with hardware synthesis, our main aim the
construction of a generic dataflow implementation for
the trial wavefunction kernel. By generic, we mean that
the implementation should be able to process a system
of arbitrary complexity (within reason), without the
need for resynthesis. Naturally, some constraints must
be placed on the FPGA configuration such that it does
not simply exhaust all resources on-chip. However, we
attempt to minimise these constraints in an attempt to
maintain freedom in the design space. In evaluating the
trial wavefunction, a number of data parallel regions
are amenable to either hardware unrolling or temporal
parallelism through the use of stream variables. In
constructing a general dataflow solution, our starting
point is the identification of a loop which is bounded
by some system-independent parameter. The only
such loop is given by Equation (8), the reduction over
primitives to calculate the value of an atomic orbital.
For the vast majority of basis sets, the maximum number
of primitives associated with an atomic orbital is 6. As
such, we choose to unroll the loop over primitives in
hardware. In doing so, an adder tree is also required to
perform the reduction over primitives.
From Equation (7), a given atomic orbital φj(r) is
utilised by all molecular orbital LCAOs. Specifically, a
fused multiply-add operation is required to accumulate
the atomic orbital multiplied by contraction coefficient
onto each molecular orbital accumulation. Since the
atomic orbital value is available at the end of the pipeline
stage which performs the reduction over primitives, an
optimal strategy involves immediate use of the atomic
orbital value, permitting the calculation of the next
atomic orbital in sequence. We consequently unroll
the molecular orbital accumulators in hardware. We
undertake a similar accumulation over second derivatives
of the molecular orbitals since there is the scope for
significant data reuse. It should be recognised that
unrolling over molecular orbitals and their second
derivatives is equivalent to the accumulation of a row of
the Slater matrix and laplacian.
Our choice for temporal unrolling is now enforced: since
we accumulate a single row of the Slater matrix and
laplacian at a time, a single electron position vector must
be presented to the implementation for NAO consecutive
kernel ticks. After this period has passed, the row of the
Slater matrix and laplacian will have fully accumulated.
8Ensemble A
Ensemble B
t mod 2 Offload to DRAM
Trial Wavefunction Calculation
Unload from DRAM
Metropolis Accept/Reject
Ensemble Averages
Synchronise Host and Device
Stream from
DRAM
Stream to
DRAM
In
cr
em
en
t
C
o
u
n
te
r,
t
FIG. 3: Non-Blocking implementation of VMC. The dashed line denotes the partition between host (yellow) and
device (green) kernels. The destination of the two ensembles alternates between counter ticks, and can be handled
by simple ternary logic (represented by multiplexing).
These quantities can be offloaded to external memory,
the accumulators reset and the next electron streamed
through, i.e. the next row of the Slater matrix and
laplacian. Since all walkers, and all electrons thereof,
must pass through the kernel in this way, we find that
our kernel will be required to run for Nw×n×NAO ticks.
Such counter logic can be realised through three nested
counters, over walkers, electrons and atomic orbitals.
A number of quantities are fixed over the course of
of the calculation: atomic position vectors, molecular
orbital coefficients and primitive parameters. Fast access
to these quantities is possible through memory-mapping
them to the on-chip ROM of the device from the
host. However, we only have access to a counter over
all atomic orbitals arising in the LCAO for random
access, which could result in significant duplication of
data should we construct our memory-mapped ROMs
in one-to-one correspondence with the atomic orbital
counter. In practice, we utilise a set of decoder ROMs,
mapping the atomic orbital counter to the address
of quantities corresponding to that atomic orbital in
the memory-mapped ROMs. The size of the decoder
elements need only accommodate the number of bits
to required to represent the total number of atomic
orbitals, resulting in significant memory savings.
The overall schematic for our implementation is given
in Figure 4. An electron is streamed from DRAM and
subjected to a random displacement. As the displaced
electron position vector is required by the host for
computing the change in potential energy contribution
to the local energy, it is streamed back to device DRAM
for offload to the host. The vector and square distance
between the displaced electron and an atomic orbital
centre is computed, with these quantities being used to
compute the values of the primitives associated with the
atomic orbital under consideration. A reduction amongst
the primitives yields the value of the atomic orbital.
The atomic orbital value is subsequently passed onto the
molecular orbital accumulators, a fused multiply-add
being used to accumulate to the LCAO. Finally, upon
complete accumulation, the molecular orbital values are
passed to device DRAM.
Our implementation is reliant upon two key kernels
which must also be instantiated within our trial wave-
function kernel: a pseudorandom number generator is
required for the displacement of electrons, and an expo-
nential function is required for the calculation of primi-
tive values. Our implementation of these two kernels is
discussed in detail in Appendices A and B.
C. Numerical Precision
Floating point arithmetic (see Figure 5b) involves
several sequentially dependent operations. Addition of
floating point numbers, for instance, requires that: the
arguments are aligned to the same exponent; the man-
tissas are summed; the result is normalised (mantissa
is shifted and exponent adjusted); and the mantissa
is finally rounded to fit within the parameterised bit
width. Naturally, these operations can be pipelined,
floating point arithmetic thereby being suitable for the
out-of-order, deeply pipelined architectures or modern
9Primitive 1
d1 exp
(
− ζ1s2
)
BRAM
ζ
BRAM
d
LUT
exp()
Primitive Np
dNp exp
(
− ζNps2
)
• • • • •
BRAM
ζ
BRAM
d
LUT
exp()
Distances
s = |r′ −R|
Displace
r′ = r + δ
PRNG
δ
BRAM
Orbital Centres
R
R
D
R
A
M
rr′
r′
Atomic Orbital
f(s; `,m)
∑
Np
ηi
BRAM
Angular
Momenta
`,m
s
s2 s2
η1(s
2; ζ, d) ηNp(s
2; ζ, d)
Molecular Orbital Accumulators
φ(r′)
BRAM
Molecular Orbital Coefficients
ψ1(r
′)
ψ2(r
′)
ψ3(r
′)
• • • • • •
ψn−1(r′)
ψn(r
′)
FIG. 4: Wavefunction evaluation dataflow implementation. Red nodes correspond to on-chip memories, yellow
nodes to lookup tables. Green nodes are computational kernels, and arrows denote dataflow between elements of the
implementation. The blue node represents an on-board addressable DRAM block.
computers. Furthermore, the integration of dedicated
ICs, such as floating points units (FPUs), into (or some-
times external to) a CPU permits the offloading of these
comparatively cumbersome operations to specialised
co-processing units without stalling the CPU.
The usage of floating point numerics within an FPGA
configuration presents two potential issues: ease of
implementation and impact on performance. With
regards to the former, a historical obstacle to incor-
porating floating point numerics into FPGA designs
has been the requirement that implementations are
written with HDL. maxj permits the definition of a
10
floating point data type through a single invocation of
the method dfeFloat(), taking both the integer and
mantissa widths of the floating point representation
as arguments. The data type can subsequently be
used much like any other data type, with the imple-
mentation details handled by the compile-time synthesis.
Concerning the second issue, the impact on perfor-
mance of the FPGA design as a result of floating point
arithmetic, the associated latency becomes more prob-
lematic than for the CPU. Data streams which are to be
combined with the result of a particular floating point
operation must be buffered on-chip (as must the actual
floating point manipulation stages), so as to accommo-
date the floating point latency. As a result, logical re-
sources are exhausted without contributing to through-
put. Some vendors facilitate the usage of floating point
arithmetic through the inclusion of digital signal proces-
sors (DSPs) within the fabric of the FPGA. However,
such resources are limited, and do not remedy the issues
associated with buffering other data streams. Indeed, the
usage of DSPs places additional constraints on the place
and route, inevitably leading to a potential difficulties in
synthesising a particular configuration.
One particularly appealing solution is to transition from
floating to fixed point arithmetic, the latter being equiv-
alent to integer arithmetic (see Figure 5a). One is conse-
quently faced with a number of choices: the width of the
representation; whether the numbers are signed or un-
signed (i.e. whether two’s-complement arithmetic must
be supported); and where the radix point partitions in-
teger from fractional parts, amongst other related deci-
sions. To address each of these choices, we must know
both the dynamical range and required precision of the
quantities to be manipulated. Such matters will be dis-
cussed in the following section.
1. Fixed Point Support for Co-Processing
There is no native support for fixed point numerics
in ISO C99. Nonetheless, it is fairly straightforward to
support both fixed point representations and arithmetic
through the use of the signed integer data type. How-
ever, writing an application which supports fixed point
numerics throughout is complicated by a number of is-
sues, particularly when multiple fixed point representa-
tions are required:
1. Owing to the lack of support for templated func-
tions and operator overloading by the C99 stan-
dard, the host code will be bloated and cumber-
some.
2. It is difficult to verify whether an entire applica-
tion is amenable to fixed point treatment owing to
the high-dimensionality of the design space. Sup-
porting fixed point numerics throughout may con-
sequently be an unproductive use of the developer’s
time should particular operations require floating
point treatment.
3. Use of fixed point arithmetic relies on the construc-
tion of bespoke mathematical routines. One conse-
quently foregoes the efficiency of the C standard
library routines which have benefited from years of
optimisation.
4. Should the width of the fixed point data type not
be byte-aligned, the effective bandwidth of mem-
ory transactions will be suboptimal, leading to poor
cache efficiency.
By choosing to use fixed point numerics exclusively in
the device code, one is able to circumvent the above
obstacles to some degree. One particularly appealing
aspect of this choice is the constraint of our fixed
point design space to an, admittedly complex, kernel.
Furthermore, maxj possesses native support for fixed
point numerics, including exception handling for under-
/overflow of the representation along with the inference
of representation width from a dynamical range of a
quantity. As such, the developer is excused from dealing
with low level code optimisation.
However, the question remains of the interface between
host and device, i.e. at what point the fixed-to-floating
point conversions (and vice versa) take place. The
simplest interface involves the offloading of data from
the host to the device DRAM in floating point. Upon
streaming of data into the DFE, all quantities are cast
to and manipulated with an appropriate fixed point
representation. Finally, the quantities are cast back to
floating point before streaming back to device DRAM.
The host is then able to retrieve the data in a natively
supported data type.
We find this approach somewhat wasteful. Our kernel
fully unrolls the Slater matrix and laplacian streams over
the dimensionality of the Slater matrices. Consequently,
a large number of casts will be performed concurrently,
for both input and output streams. Casting between
fixed and floating point representations consumes signif-
icant on-chip logic resources. As such, a sizeable portion
of on-chip resources will be devoted to casting between
numerical representations, an overhead associated with
the use of the co-processor. Naturally, such an imple-
mentation is best avoided.
A preferable scheme materialises when considering the
means by which data is transferred between host and
device DRAM. Since the MAIA card resides on the PCIe
bridge of a compute node, all data transactions between
host and device proceed through PCIe. The PCIe driver
is instantiated as an IP core in the fabric of the FPGA,
requiring data pass through the FPGA over the course
of data transactions. Memory transactions which use
PCIe are considered a bottleneck of distributed memory
systems. The prudent developer of such applications
will consequently minimise memory transactions over
11
b8
(−1)b8
b7
b72
2
b6
b62
1
b5
b52
0
b4
b42
−1
b3
b32
−2
b2
b22
−3
b1
b12
−4
b0
b02
−5
Sign Bit Integer Part Fractional Part
(a) 9-bit fixed point representation. A single “Sign Bit” is reserved to represent the signedness of the number. The “Integer
Part” and “Fractional Part” are separated by an implicit radix point. The decimal representation of is given by
(−1)b8 × (b7b6b5.b4b3b2b1b0)2
.
b8
(−1)b8
b7 b6 b5 b4 b3 b2 b1 b0
2(b7b6b5)2−4 1.(
∑4
i=0 bi2
i−5)
Sign Bit Exponent
(Bias of 4)
Mantissa
(Normalised)
(b)
FIG. 5: Fixed and Floating Point numerical representations using 9 bits (for the sake of demonstration, i.e. without
the implication that 9-bit representations are utilised in our work). Note that these forms are inadequate for
intermediary stages in arithmetic operations, where guard digits are used to prevent overflow of the representation.
the course of computation to obtain an optimal imple-
mentation.
Since data transfer must take place at some point, and
the data must pass through the FPGA, it is reasonable
to pass the data streams through interface kernels,
casting the data as it arrives from the PCIe. A free
parameter in our implementation is the width of the
stream we cast in the interface kernels. Since the
clock frequency of the kernel is known, along with the
bandwidth of the PCIe, it is a simple task to evaluate
the width of the data stream which saturates the PCIe
transfer, and is therefore optimal. For instance, consider
a kernel clocked at 100MHz. The bandwidth of a PCIe
2.0 x16 bridge is 8 GB/s. Then, a stream of width 80
bytes saturates the PCIe transfer, equating to 10 double
or 20 single precision numbers.
A final consideration is the storage of the fixed point
data in the device DRAM should the representation not
be byte-aligned. It is difficult to find a generic storage
width for arbitrary width fixed point data, since DRAM
accesses must be burst-aligned. For example, the MAIA
card utilised in this work possesses 6 DDR3 channels,
each channel having a width of 64 bits. The DRAM can
function with a burst size of 8 or 4, resulting in a burst
width of 384 or 192 bytes, respectively. Since the burst
width will exceed the PCIe saturated width, data must
be buffered in the interface kernel for a number of kernel
ticks before offloading the buffer to DRAM. Such a
scheme is facilitated by using regular widths for the fixed
point storage. As such, we choose a storage width of 64
bits. This choice naturally reduces the effective DRAM
latency, but for the sake of ease of implementation, we
proceed with this parameterisation.
V. RESULTS
A. Fixed Point
Use of a single fixed point representation throughout
our kernel is not a feasible option. A number of variables
possess vastly different dynamic ranges. While a single
fixed point representation could in principle be found to
accommodate all variables, precision of the representa-
tion will necessarily be leveraged for flexibility. As such,
our strategy is to track the dynamic range of each vari-
able over the course of a simulation and classify the num-
ber of fundamentally different dynamic ranges that arise.
To narrow our design space, we will constrain each repre-
sentation to the same width. We are consequently able to
eliminate degrees of freedom from the device space that
are only amenable to systematic optimisation strategies.
In ascertaining an adequate fixed point representation,
we must consider three independent points:
1. Whether two’s complement is required.
2. The dynamic range of the variables, such that the
integer width can be determined.
3. The number of fractional bits required to yield a
stable calculation.
The first two points are easily dealt with through analy-
sis. As a further constraint, we concern ourselves specif-
ically with ensembles which have undergone a period
of equilibration, thereby further restricting the dynamic
range of runtime variables. It is, however, difficult to
ascertain the required number of fractional bits without
resorting to a full systematic exploration of fractional
12
widths. Our task is complicated all the more given that
in establishing the adequacy of a fractional width, a fully
converged VMC calculation is required. Naturally such
studies are not feasible when using a device simulator,
meaning individual DFEs must be synthesised for each
fractional width, which again is prohibitively expensive.
As such, we employ a scheme requiring a single Monte
Carlo step with the device simulator. The displacement
of electrons is deactivated, and the kernel computes the
Slater matrices and associated derivatives for the same
Monte Carlo samples residing on the host. Upon passing
these quantities back to the host, we monitor the ensem-
ble averaged local energy and acceptance rate for these
dummy moves. If the computed averaged local energy
differs from the original value by less than 5× 10−4a.u.,
we deem the implemented fractional bit width adequate.
Table II shows such results for a varying number of walk-
ers. It is clear that a total fixed point width of 38 bits is
the appropriate fixed point representation for our calcu-
lation.
An interesting optimisation which could be undertaken
in the future centres around the work of Ceperley and
Dewing[23], where a penalty function is applied to ran-
dom walks with noisy data. We have no reason to assume
that truncating a numerical representation introduces a
bias into the resultant calculation, and so it appears that
one might be able to utilise a shorter fixed width rep-
resentation and utilise the reported penalty method to
compensate for the resulting imprecision. While the ac-
ceptance rate will decrease, one can in principle leverage
the inefficiency in the Monte Carlo for potential accel-
erations deriving from the use of a shorter fixed width
representation. We plan to investigate the application of
the penalty method to this end in the near future.
B. Performance
The final resource utilisation of our FPGA implemen-
tation is given in Table III. While our synthesised design
is moderately light on consumption of logic and dedi-
cated arithmetic units, the on-chip memory proves to be
a limiting factor to increasing the complexity of our de-
sign. Both the LUTs of our exponential units and the
trial wavefunction parameters, particularly the molecu-
lar orbital expansion coefficients, are particularly culpa-
ble for such high utilisation. However, given the form
of our trial wavefunction, avoiding such overheads is dif-
ficult. It is possible to force the MaxCompiler to not
pipeline sections of the application, thereby reducing on-
chip memory consumption, but at the cost of increased
difficulty in the design meeting timing.
Our final results correspond to the overall performance
of our application relative to a purely host-bound im-
plementation. We consider a number of ensemble sizes
propagated for a fixed number of Monte Carlo steps. In
the lower pane of Figure 6, we plot the accelerations rel-
ative to the multithreaded host benchmark for a vari-
able number of threads running on the host while the de-
vice computes the trial wavefunction. For small ensemble
sizes, we observe a reduction in the performance relative
to larger ensemble sizes owing to the overhead associ-
ated with using the co-processor. However, for ensemble
sizes of 256 walkers and above, we see the improvement
in performance converge towards an overall acceleration
of roughly 30× relative to the multithreaded host im-
plementation. Across all ensemble sizes, we note that
the host need only instantiate 3 or 4 threads of execu-
tion to tend towards peak performance. An interesting
additional result is some metric quantifying the compar-
ative power consumptions of the two calculations. While
Maxeler Technologies provide a command line utility to
query the power consumption of a board, we are not en-
tirely clear on the resolution or accuracy of this quantity.
However, in-house testing on a small board with power
readings taken at the power outlet reveals the command
line utility is in good agreement with the outlet read-
ings. For the MAIA board, we observed a peak power
consumption of 27.6W throughout.
It is, unfortunately, a little more difficult to establish a
power consumption of the host. The operating system
is ultimately in control of scheduling, so the overhead
associated with other general purpose tasks must be ac-
counted for. A further difficulty is what to include in the
power consumption for the host. Typically, only thermal
design powers (TDP) are reported by chip manufactur-
ers, which are reported to be in poor correspondence with
power consumption at peak operation (peak power con-
sumptions equating to roughly 1.5× that of the reported
TDPs.[24]). The TDP of the Intel Xeon E5-2640 is re-
ported to be 95W. Furthermore, it is unclear whether the
power consumption of RAM chips and other peripherals
are to be considered.
Rather than attempt to speculate on such matters, we
compose a metric given the limited data at our disposal:
E =
Pbenchmark × S
Pco-processing
(9)
where Pbenchmark is the power consumption of the
multithreaded host implementation, Pco-processing is
the power consumption of the device plus that of
the number of threads utilised by the co-processing
implementation[25]. S is the speedup of the co-processor
relative to the multithreaded host, and E is then the
performance-adjusted reduction in power consump-
tion offered by the co-processor. E is plotted in the
upper pane of Figure 6. Given that the performance
of our application begins to converge to some peak
acceleration when the host runs with upwards of three
threads, it is unsurprising that the performance-adjusted
power consumption is larger for fewer threads than more.
13
Total Fixed Width
Nw 30 32 34 36 38 40 Double
128 -65.1254 -67.4323 -68.4111 -69.6823 -69.6893 -69.6895 -69.6895
256 -64.2432 -68.1332 -68.0236 -69.7899 -69.7900 -69.7901 -69.7901
512 -54.9473 -63.5421 -64.9021 -69.7523 -69.7892 -69.7894 -69.7894
1024 -63.5443 -61.5423 -66.3427 -69.6912 -69.7899 -69.7899 -69.7899
2048 -67.2313 -68.1998 -69.7712 -69.7652 -69.7893 -69.7895 -69.7895
TABLE II: Ensemble-averaged local energy over multiple numbers of walkers for varying fixed width
representations. A sign bit and the number of integer bits have been dictated by the dynamic range of variables. We
find that a total width of 38 bits is adequate to reproduce the ensemble-averaged local energy computed in double
precision (final column) to within a tolerance of 5× 10−4 a.u.
FIG. 6: Acceleration of the host-device implementation for various ensemble sizes relative to a multithreaded host
implementation (lower pane). Performance-adjusted reductions in power consumption are also plotted (upper pane).
Component Number Available Utilisation
Logic 262400 51.36%
18× 18 Multipliers 3926 32.32%
DSP Blocks 1963 36.37%
On-Chip Memory (M20K) 2567 98.64%
TABLE III: Resource utilisation of our final synthesised
FPGA configuration.
VI. CONCLUSION
We have ported the computationally expensive trial
wavefunction evaluation kernel from a VMC application
to an FPGA platform. Through co-processing with a
multicore host, we have established that our implemen-
tation offers significant benefits in terms of raw compute
performance and reduced power consumption. While
our VMC is minimal from the perspective of complexity
of trial wavefunction, we hope that this work acts to
14
instigate further investigation.
Developer time and on-chip resources remain signifi-
cantly limiting factors in the use of FPGA co-processing
for complex scientific applications. However, these
limitations are forecasted to become decreasingly prob-
lematic with significant effort being directed towards
alleviation. The new Intel/Altera Stratix X, for instance,
possesses roughly twice the number of on-chip resources
as the Stratix V used in this work.
Work within our group is currently being directed to-
wards adding further complexity to our implementation,
including support for Jastrow factors and localised
orbitals. We are also looking to support a DMC
application, requiring the first derivative of the trial
wavefunction, in addition to support for an FPGA-
based Sherman–Morrison inverse updating kernel.
We hope that such endeavours will aid in the porting
of scientific applications to forecasted exascale platforms.
ACKNOWLEDGMENTS
AJWT would like to thank the Royal Society for a
University Research Fellowship under grants UF110161
and UF160398 and a Research Grant number RG140728.
This work has been performed as part of the EXTRA[13]
consortium, supported by the funding from the EU Hori-
zon 2020 research and innovation programme under grant
No 671653. Computations were performed on a Maxeler
Galava DFE obtained as part of the Maxeler University
Program, as well as the Delorean MAIA cluster as part of
the STFC Hartree Centre. We would also like to thank
Dr Jose´ Gabriel de Figueiredo Coutinho, Dr Timothy
Todman and Prof. Wayne Luk of Imperial College Lon-
don for their advice and use of MAIA compute resources.
APPENDICES
Appendix A: Pseudorandom Number Generation
The Mersenne Twister (MT) is a pseudo-random num-
ber generator (PRNG) whose period is given by 2z − 1,
z being a Mersenne prime[26]. The most common re-
alisation of the PRNG is parameterised by z = 19937,
hereafter referred to by MT19937. The period of the re-
sultant pseudorandom sequence is therefore adequately
long that repetition will not be observed, even over cos-
mological periods of time with an output frequency of
1fs−1.
The recurrence of a sequence is a commonly invoked
metric for ascertaining the quality of a random number
sequence. The MT19937 is k–distributed to 32-bit accu-
racy, ∀(1 ≤ k ≤ 623), establishing it as a superior PRNG.
The pseudorandom sequence output by a MT is given by
the recurrence relation
xk+n = xk+m ⊕
(
xuk ||xlk+1
)A , (A1)
where ⊕ and || denote the bitwise XOR and concatena-
tion, respectively, and n is the recurrence of the sequence.
We have utilised vector notation to emphasise that the
random numbers are each composed of w–bits; xu and
xl are then the upper and lower bit segments of x. The
length of these segments is ascertained by a single pa-
rameter, r: xu corresponds to the upper w − r bits of
x, and xl denotes the lower r bits of x. Finally, A is
a matrix, termed the “twist-transformation”. The twist-
transformation matrix is of rational normal form,
A =
[
0 I
aw−1 a
]
, (A2)
where I ∈ Zw×w is the identity matrix and a is a w−bit
number, aw−1 being the highest-order bit of a. This
form is most convenient for computation, as one is able
to evaluate the vector-matrix product of Equation (A1)
through simple bitwise operations, without having to ex-
plicitly form and store A,
xA =
{
x 1
(x 1)⊕ a
if x0 = 0
if x0 = 1
(A3)
where  is the right-shift operator, and x0 denotes the
lowest-bit of x.
To improve the k–distribution of the MT, a “temper-
ing” is used for the output pseudorandom numbers, ac-
complished through application of a tempering matrix,
T . As with the twist-transformation matrix, we are
spared from explicitly forming/ storing the matrix T
through enforcing that it satisfy a set of pipelined op-
erations:
y0 ← x ⊕ (x u) (A4)
y1 ← y0 ⊕ ((y0  s) & b) (A5)
y2 ← y1 ⊕ ((y1  t) & c) (A6)
y3 ← y2 ⊕ (y2  l) (A7)
where u, s, t, l are tempering bit shifts, and b, c are tem-
pering bit masks. The ampersand is used in the conven-
tional sense to denote bitwise AND.
The process we have just outlined is most amenable
to FPGA implementation. The recurrence relationship
of (A1) lends itself to realisation with a “linear feedback
shift register” (LFSR), where the output of the register
containing an initial sequence of (x0,x1, . . . ,xk−1 is si-
multaneously fed to the back of the LFSR to construct
xk, and so on. Furthermore, the pipelined operations
outlined in the tempering steps of Equations (A4) - (A7)
are suitable for cascaded operations in the fabric of the
FPGA. The MT is then a natural candidate as a PRNG
for FPGAs, and is our choice for on-chip pseudorandom
number generation.
15
Appendix B: The Exponential Function
Transcendental functions, by definition, cannot be
computed exactly in a finite number of algebraic steps.
As such, algorithms which implement transcendental
functions must employ some approximation, leveraging
speed of convergence for an increase in operational com-
plexity. The exponential function pervades quantum
chemistry owing to its utility in simplifying the integrals
arising in deterministic methodologies. While useful in-
tegral properties are an irrelevance for QMC techniques,
linear combinations of exponentials are still commonly
used to construct atomic orbitals, as stated in Equation
(8). As such, in order that our FPGA implementation
of the trial wavefunction evaluation routines be as ef-
ficient as possible, we have chosen to construct an ex-
ponential function which can be queried as rapidly as
possible, while consuming minimal on-chip resources.
Historically, the CORDIC algorithm[27] has been
utilised on FPGAs for trigonometric and hyperbolic func-
tions. The identity cosh(x)+sinh(x) = exp(x) can subse-
quently be used to approximate the exponential function.
Since CORDIC is composed entirely of adds and shifts, it
leads to a significant reduction in complexity relative to
polynomial approximations which require explicit mul-
tiplication. However, iterative nature of the CORDIC
algorithm renders it suboptimal for our purposes – the
data dependency between iterations is unsuitable for the
dataflow implementation we seek to create.
We have instead utilised a lookup table (LUT)
approach[28, 29]. While such implementations are
memory-intensive, modern FPGA chips have significant
amounts of on-chip RAM (of the order of megabytes),
and so the cost of storing a few thousand floating point
numbers is fairly inconsequential. In order to utilise a
LUT-based approach to approximating the exponential
function, one must manipulate the identity
exp(x) =
(
2log2(e)
)x
= 2x log2(e) . (B1)
We proceed to define x log2(e) = yi+yf , where yi and yf
are the integer and fractional parts, respectively, of the
original argument scaled by log2(e). We consequently
arrive at the expression
exp(x) = 2(yi+yf ) = 2yi × 2yf . (B2)
A similar partitioning can be undertaken to split yf down
into smaller pieces. Performing this additional partition-
ing of the fractional part, we can split the approximation
down into a series of independent lookups and multipli-
cation of each returned value.
[1] W. Foulkes, L. Mitas, R. Needs, and G. Rajagopal, Re-
views of Modern Physics 73, 33 (2001).
[2] R. Needs, M. Towler, N. Drummond, and P. L. R´ıos,
Journal of Physics: Condensed Matter 22, 023201 (2009).
[3] B. L. Hammond, W. A. Lester, and P. J. Reynolds,
Monte Carlo methods in ab initio quantum chemistry,
Vol. 1 (World Scientific, 1994).
[4] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bas-
sous, and A. R. LeBlanc, IEEE Journal of Solid-State
Circuits 9, 256 (1974).
[5] D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang, Het-
erogeneous computing with OpenCL 2.0 (Morgan Kauf-
mann, 2015).
[6] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankar-
alingam, and D. Burger, in Computer Architecture
(ISCA), 2011 38th Annual International Symposium on
(IEEE, 2011) pp. 365–376.
[7] L. Shulenburger, J. Kim, K. P. Esler, J. McMinis, M. A.
Morales, B. K. Clark, and D. M. Ceperley, Hybrid al-
gorithms in quantum Monte Carlo., Tech. Rep. (San-
dia National Lab.(SNL-NM), Albuquerque, NM (United
States), 2011).
[8] J. Kim, A. T. Baczewski, T. D. Beaudet, A. Benali, M. C.
Bennett, M. A. Berrill, N. S. Blunt, E. J. L. Borda,
M. Casula, D. M. Ceperley, et al., Journal of Physics:
Condensed Matter 30, 195901 (2018).
[9] K. P. Esler, J. Kim, D. M. Ceperley, and L. Shu-
lenburger, Computing in Science & Engineering 14, 40
(2012).
[10] S. Amarasinghe, M. Hall, R. Lethin, K. Pingali, D. Quin-
lan, V. Sarkar, J. Shalf, R. Lucas, K. Yelick, P. Balaji,
et al., in Report of the 2011 Workshop on Exascale Pro-
gramming Challenges (2011).
[11] J. Shalf, S. Dosanjh, and J. Morrison, in International
Conference on High Performance Computing for Com-
putational Science (Springer, 2010) pp. 1–25.
[12] K. Bergman, S. Borkar, D. Campbell, W. Carlson,
W. Dally, M. Denneau, P. Franzon, W. Harrod,
K. Hill, J. Hiller, et al., Defense Advanced Research
Projects Agency Information Processing Techniques Of-
fice (DARPA IPTO), Tech. Rep 15 (2008).
[13] D. Stroobandt, A. L. Varbanescu, C. B. Ciobanu,
M. Al Kadi, A. Brokalakis, G. Charitopoulos, T. Tod-
man, X. Niu, D. Pnevmatikatos, A. Kulkarni, et al.,
in Reconfigurable Communication-centric Systems-on-
Chip (ReCoSoC), 2016 11th International Symposium on
(IEEE, 2016) pp. 1–7.
[14] C. B. Ciobanu, A. L. Varbanescu, D. Pnevmatikatos,
G. Charitopoulos, X. Niu, W. Luk, M. D. Santambrogio,
D. Sciuto, M. Al Kadi, M. Huebner, et al., in Computa-
tional Science and Engineering (CSE), 2015 IEEE 18th
International Conference on (IEEE, 2015) pp. 339–342.
[15] A. Gothandaraman, G. D. Peterson, G. L. Warren, R. J.
Hinde, and R. J. Harrison, Parallel Computing 34, 278
(2008).
[16] B. Cooper, S. Girdlestone, P. Burovskiy, G. Gaydadjiev,
V. Averbukh, P. J. Knowles, and W. Luk, Journal of
chemical theory and computation 13, 5265 (2017).
[17] N. Drummond, M. Towler, and R. Needs, Physical Re-
view B 70, 235119 (2004).
[18] A. Williamson, R. Q. Hood, and J. Grossman, Physical
Review Letters 87, 246406 (2001).
16
[19] A. A. Freitas and S. H. Lavington, in Mining Very Large
Databases with Parallel Processing (Springer, 2000) pp.
61–69.
[20] L. K. Wagner, M. Bajdich, and L. Mitas, Journal of
Computational Physics 228, 3390 (2009).
[21] A. Mathuriya, Y. Luo, R. C. Clay III, A. Benali, L. Shu-
lenburger, and J. Kim, in Proceedings of the Interna-
tional Conference for High Performance Computing, Net-
working, Storage and Analysis (ACM, 2017) p. 38.
[22] T. McDaniel, E. F. DAzevedo, Y. W. Li, K. Wong, and
P. R. Kent, The Journal of Chemical Physics 147, 174107
(2017).
[23] D. Ceperley and M. Dewing, The Journal of chemical
physics 110, 9812 (1999).
[24] J. L. Hennessy and D. A. Patterson, Computer architec-
ture: a quantitative approach (Elsevier, 2011).
[25] When only a subset of physical cores are utilised in the
multithreaded co-processing implementation, the power
consumption of each core is given by the TDP divided
by the number of physical cores.
[26] M. Matsumoto and T. Nishimura, ACM Transactions
on Modeling and Computer Simulation (TOMACS) 8,
3 (1998).
[27] R. Andraka, in Proceedings of the 1998 ACM/SIGDA
sixth international symposium on Field programmable
gate arrays (ACM, 1998) pp. 191–200.
[28] C. C. Doss and R. L. Riley, in Field-Programmable Cus-
tom Computing Machines, 2004. FCCM 2004. 12th An-
nual IEEE Symposium on (IEEE, 2004) pp. 229–238.
[29] E. Jamro, K. Wiatr, and M. Wielgosz, in Field Pro-
grammable Logic and Applications, 2007. FPL 2007. In-
ternational Conference on (IEEE, 2007) pp. 718–721.
