A Reconfigurable Vector Instruction Processor for Accelerating a
  Convection Parametrization Model on FPGAs by Nabi, Syed Waqar et al.
A Reconfigurable Vector Instruction Processor for
Accelerating a Convection Parametrization Model on
FPGAs
Syed Waqar Nabi
∗
School of Computing Science,
University of Glasgow,
Glasgow G12 8QQ, UK.
syed.nabi@glasgow.ac.uk
Saji N. Hameed
University of Aizu,
Tsuruga, Ikki-machi,
Aizuwakamatsu-shi, Japan.
saji@u-aizu.ac.jp
Wim Vanderbauwhede
School of Computing Science,
University of Glasgow,
Glasgow G12 8QQ, UK.
wim@dcs.gla.ac.uk
ABSTRACT
High Performance Computing (HPC) platforms allow scien-
tists to model computationally intensive algorithms. HPC
clusters increasingly use General-Purpose Graphics Process-
ing Units (GPGPUs) as accelerators; FPGAs provide an at-
tractive alternative to GPGPUs for use as co-processors, but
they are still far from being mainstream due to a number
of challenges faced when using FPGA-based platforms. Our
research aims to make FPGA-based high performance com-
puting more accessible to the scientific community. In this
work we present the results of investigating the acceleration
of a particular atmospheric model, Flexpart, on FPGAs.
We focus on accelerating the most computationally inten-
sive kernel from this model. The key contribution of our
work is the architectural exploration we undertook to arrive
at a solution that best exploits the parallelism available in
the legacy code, and is also convenient to program, so that
eventually the compilation of high-level legacy code to our
architecture can be fully automated. We present the three
different types of architecture, comparing their resource uti-
lization and performance, and propose that an architecture
where there are a number of computational cores, each built
along the lines of a vector instruction processor, works best
in this particular scenario, and is a promising candidate for
a generic FPGA-based platform for scientific computation.
We also present the results of experiments done with vari-
ous configuration parameters of the proposed architecture,
to show its utility in adapting to a range of scientific appli-
cations.
1. INTRODUCTION
Atmospheric and climate scientists are one of the most im-
portant users of high-performance computing (HPC) plat-
forms. Models predicting global weather and climate, or
more localized phenomenon like after-effects of the Fukushima
nuclear disaster or the Icelandic volcano eruption, require
the execution of computationally intensive models.
Today’s HPC systems use massive clusters of multi-core
∗Also affiliated with:
Namal College, Mianwali, 42250, Pakistan.
This is an extended Pre-Print version of the work that was presented in part
at the international symposium on Highly-Efficient Accelerators and Re-
configurable Technologies (HEART2014), Sendai, Japan, June 9-11, 2014.
CPUs. Increasingly, GPGPUs are being utilized as co-processors
to off-load computationally intensive kernels, as they can of-
fer much greater thread-level parallelism at a lower power
budget.
Field-Programmable Gate Arrays (FPGAs) are another
viable option for accelerating scientific models by working as
co-processors that off load computationally intensive kernels
in a manner similar to the way it is done with GPUs. FPGAs
moreover can achieve the same performance at much lower
power budgets than either CPU only or CPU-GPU systems,
due to their ability to match hardware to the algorithm.
However, the scientific community in general has been reluc-
tant to use FPGA-based HPC systems due to the challenges
associated with programming and exploiting the parallelism
available in an FPGA, and also due to poor portability and
standardisation [1].
Our work is motivated by the need to make FPGA based
HPC systems more accessible to the scientific community.
We propose an FPGA-based architecture in the context of a
particular atmospheric particle dispersion model, Flexpart,
with the view to map the various granularities of parallelism
available in a typical scientific model to the parallelism avail-
able in the FPGA hardware. In addition, we aim to make
this architecture straight-forward to program using a high-
level language and eventually achieve fully automated com-
pilation of legacy code that was originally written for single-
core CPUs to our FPGA system.
2. FPGAS FOR ACCELERATING SCIEN-
TIFIC COMPUTATION
FPGAs are inherently parallel architectures that allow
compile-time reconfiguration of the hardware through Hard-
ware Description Languages (HDLs) like Verilog. While the
overhead of this flexibility results in them being less power
and area efficient than Application-Specific ICs (ASICs),
they can be much more power-efficient than instruction pro-
cessors like CPUs and GPUs. This is because FPGAs can
be tailored to the algorithm that is being executed, provide
parallelism at various granularities, and can give compara-
ble overall throughput at much lower frequencies. So larger,
faster systems can be built with lower power-budgets.
There are a number of challenges associated with using
FPGAs for HPC. FPGAs, like any other co-processor, are
limited by the IO bandwidth due to the loose coupling with
ar
X
iv
:1
50
4.
04
58
6v
1 
 [c
s.A
R]
  1
7 A
pr
 20
15
the host CPU, typically over an interface like PCI Express.
This constraint places and upper limit on the achievable
speed up from any acceleration co-processor. FPGAs are
hence suitable for accelerating small pieces of code – the
kernels – that take the most amount of time to execute.
Programming FPGAs is a very different paradigm from
programming instruction processors, and it is generally con-
sidered outside the domain of software programmers. There
has been considerable progress on this front in recent years,
for example the HLS tools of Xilinx[2], and Altera’a venture
into using OpenCL for programming FPGAs [3]. However,
it can safely be said that this field is far from mature, and
lagging considerably behind the kind of tools and support
available to programmers of CPUs and GPUs.
It is not trivial to convert legacy code for implementation
on FPGAs even with the introduction of high-level program-
ming paradigms. Porting an existing CPU-based code to a
HPC platform based on FPGAs may in fact even be more
difficult than writing completely new code for that FPGA-
based system. Poor portability and a lack of standardization
adds to the challenge and has lead to a reluctance in the sci-
entific community to use FPGAs for their HPC needs.
In view of these challenges, our work looks at atmospheric
models – specifically the Flexpart particle dispersion model
– and we investigate a number of architectural options that
could be used to 1) effectively exploit parallelism at various
granularity levels available in an algorithm and 2)to do it
in a way that makes it easier to program the FPGA, and
eventually can lead to a fully automated compiler.
3. THE FLEXPART LAGRANGIAN PARTI-
CLE DISPERSION MODEL
Flexpart is a meso-scale atmospheric particle dispersion
model [4]. It can model the transport and diffusion of par-
ticles – where particles can be tracers in the air or packets
of air itself. The term meso-scale means it covers phenom-
ena in an area of somewhere between 5 kilometres to several
hundred kilometres. Applications of models at this scale can
include phenomena like transport of radio-nuclide, pollution
transport, greenhouse gas cycles etc.
3.1 Lagrangian and Eulerian Models
Flexpart is a Lagrangian particle dispersion model. This is
one of the two approaches typically taken in fluid dynamics
for modelling the flow of fluids, the other being Eulerian
modelling.
A Lagrangian model models the trajectory of a large num-
ber of particles for every time step. In Flexpart for example,
each particle is moved over time through a 3-dimensional
space, and the final result is produced by counting the par-
ticles in each grid-box.
An Eulerian model on the other hand models the mass
flux of fluid and its change over time at grid boundaries.
A Lagrangian model is more accurate as it accurately
models the trajectory of each individual particle, which is
not effected by the grid-size. The grid is only applied at the
output. In a Eulerian model, a particle released from a point
source instantaneously looses its position in the grid-box
(this is called Numerical Diffusion). A Lagrangian model is
good at representing narrow plumes, in which case its com-
putation requirements can be quite modest as well. How-
ever, in general a Lagrangian model would be more compu-
tationally intensive than a Eulerian model.
3.2 The Working of Flexpart
Flexpart uses Lagrangian dynamics to model particle dis-
persion at the meso-scale. Figure 1 shows the main steps in
the Flexpart model.
Track particles 
in a velocity 
field 
Model sub-grid 
scale processes 
Modify particle 
properties 
Count particles, 
extract 
concentration 
Figure 1: The Working of the Flexpart Particle Dis-
persion Model
Flexpart uses meteorological fields taken from the ECMWF
numerical weather prediction model data [5] on a latitude /
longitude 60-km resolution grid. The number of levels in the
vertical can be up to 60. This data is interpolated and used
by the model to track the trajectory of individual particles
over time in a given velocity field.
There are some phenomena, e.g. convective transport,
which are not captured by this grid-scale modelling, and
they are called sub-grid scale phenomena. Flexpart uses
parametrization schemes for modelling the effect of these
phenomena on the dispersion of particles. As we will see in
the next section, the parametrization of convective transport
places a significant computational burden on the processor.
If required, the individual particle properties can me mod-
ified, for example, loss of mass through radioactive decay.
These three steps are repeatedly executed, and the output
is produced at the required time intervals or at the end of
simulation by counting particles and extracting the concen-
tration for each grid-box.
3.3 The Flexpart Source Code
The Flexpart model has been written in Fortran 77. It is
relevant to note that the author did not make any attempts
to parallelise the code as the model is strictly sequential [4].
These meteorological inputs to Flexpart from the ECMWF
weather model are available at 3-hour time intervals, whereas
the flexpart integration step is typically much smaller, for
example 60 seconds in case of our experiments. Their inter-
polation forms one key element of computation in Flexpart.
The other key method is the modelling of dispersion, which
moves each particle through a certain distance for that in-
tegration time step.
As discussed earlier, sub-grid scale phenomena like con-
vective transport of particles cannot be modelled by trajec-
tory calculations, and have to be parametrized. Our current
work focuses on acceleration of the parametrization algo-
rithm used for modelling the effects of convective turbulence.
3.4 The Convect Method
Flexpart uses the scheme proposed by Emmanuel and
Zivkovic-Rothman[6] for parametrizing the effect of convec-
tion on the position of individual particles. The model uses
Simulated time (hrs) 03
hrs
12
hrs
120
hrs
360
hrs
Simulation time
(sec)
192 657 5,693 27,150
Synch. steps 180 720 7200 21,600
Calls to Convect per
step (thousands)
0.5 6.2 3,558 66,299
Average calls per
time step
2.7 8.6 464 3,069
Percentage compute
time taken by Con-
vect
0.02 0.06 7.55 35
Table 1: Putting Convect in Perspective
grid-scale data available from the ECMWF model and pro-
duces a displacement probability matrix, giving the required
mass flux information based on which particles can be re-
distributed.
The Convect kernel is called once per grid-column for
each simulation time step. While subsequent calls to Con-
vect for the same grid-column must be executed sequentially
due to a data dependency, the calls to Convect for different
grid-columns in the same time step are independent of each
other. The Convect kernel requires a number of computa-
tional steps that must be executed in a linear fashion, with
data-dependencies from one step to the other. These obser-
vations form the basis of our proposed solution to parallelize
the execution of this method on FPGAs.
3.5 Analysis of Flexpart and Convect for Ac-
celeration
Flexpart has already been ported for GPU acceleration,
and a 10× performance improvement on a NVIDIAs Tesla
C1060 has been reported [7]. However, this comparison ex-
cludes the Convect kernel. The authors of Flexpart indicate
that the Convect kernel consumes a significant proportion
of the overall computation time [4], hence our work focused
on this method for acceleration on FPGAs. The protracted
nature of the Convect kernel makes it an inconvenient algo-
rithm to tackle, but it also makes it a good representative
scientific model that can be used for investigating accelera-
tion of scientific models in general, which is our long-term
aim.
Initial simulation seemed to indicate that the execution of
Convect has a proportionately negligible contribution to the
overall processing time. For example, in a 12-hour simulated-
time run, the Convect kernel takes a mere 0.06% of the 657
seconds of simulation time (See Table 1).
Longer simulation runs however indicated that calls are
made to the Convect kernel increasingly as the simulated
time progresses, as can be seen from the fourth row in Table
1. Convect – being a parametrisation of the actual physi-
cal phenomena – is not called on a per-particle basis, but
on a per-grid-column basis. So it only needs to be invoked
for those vertical columns of air where particles are present.
Initially, the particles are typically localised into few grid-
columns. As time progresses, due to the dispersion, particles
spread out and Convect needs to be called more and more
often for each time-step. By extrapolating our results, we
see that if Convect were to be called for every vertical col-
umn of air in every time step right from the beginning of the
simulation, it would take up to 80% of the entire simulation
time. Consequently, we focused on porting the Convect ker-
nel for FPGA acceleration.
4. ARCHITECTURAL EXPLORATION
4.1 Design Considerations
The application domain of scientific computing in general
requires considerable arithmetic operations on large arrays
of numbers. So there are critical loops performing arithmetic
operations on large data, and the flexibility of FPGAs allows
us to deal with these loops in a number of different ways, as
our following experiments will show.
Identification of parallelism of the appropriate granularity
is a very important consideration for accelerating any code
on a HPC system. FPGA can be configured to offer various
granularities of parallelism, ranging from instruction-level,
data-level to thread-level. Hence we can map the architec-
ture to the parallelism exposed by the algorithm rather than
trying to do it the other way round.
In a single call to Convect, the sequence of processes have
data-dependencies from one process to the other. Repeated
calls to convect in different time-steps for the same grid-
column also have a data-dependency. However, the calls to
convect in a particular time step for different grid-columns
can be executed completely independent of and parallel to
each other.
We have explored three architectural alternatives for im-
plementing a Convect kernel on FPGAs. The common fea-
ture in all of these architectures is that they are meant to be
replicated on an FPGA to make use of the fact that calls to
Convect for different grid-columns are independent of each
other and can run in parallel. The three alternatives differ
in the amount of parallelism that we expose inside a single
call to Convect.
The first two architectures are what may be called “naive”
approaches; they provide a baseline at two extremes, and
lead towards the proposed third option.
In all three architectures, we conservatively use 64-bit
fixed-point numbers, with 32 bits for the fractional part.
Since the data available from ECMWF models is in floating-
point, we instantiate a floating-point to fixed-point converter
in the FPGA for dynamic conversions in all three cases [8].
The slice usage and calculation latency estimates are shown
for Virtex-5 XC5VLX devices.
4.2 Option 1: Fully Tiled Implementation
The first architecture is a baseline implementation of a
small sub-set of the Convect kernel, which computes an in-
termediate result used subsequently. We unroll the loop,
and dedicate a hardware unit for each arithmetic operation.
A single iteration of the loop involves 6 multiplications, two
divisions, two additions, and one inversion.
Like most loops in the Convect kernel, this one loops over
each vertical level of air in that grid-column, which num-
bered 24 in our simulation runs. We replicated the compute-
unit 24 times. See Figure 2 for a block diagram of this stage.
Note that it does not require any controller to sequence the
operations as the algorithm has been programmed into the
circuit. A synchronisation barrier is required as different
arithmetic units can have different computational latencies.
This approach would require a unique and custom stage
for each major loop in the Convect kernel, and as our results
Local Memory I/F 
Array 
Mem 
Read 
Logic 
Array 
Mem 
Array 
Mem 
X 
X 
X 
X 
/ 
Sy
n
c-
B
ar
ri
e
r 
X 
/ 
Sy
n
c-
B
ar
ri
e
r 
+ 
+ 
~ 
X 
This logic is  and the array memories are  
replicated N times for iteration of unrolled  
loop (e.g. N = 24) 
Flat 
Data 
Figure 2: First architectural option: fully unrolled
and tiled implementation
will show, it is not a feasible option. However, the architec-
ture serves as an interesting baseline to compare our other
approaches.
4.3 Option 2: Fully Sequential Implementa-
tion
Our second approach explores the other extreme of reusing
minimum arithmetic units sequentially to compute the loop.
This requires a controller and a way to communicate the se-
quence of operations to it, which would be in the form of in-
structions stored in a memory. Thus our design approaches
that of a typical – very basic – instruction processor.
One departure from the conventional instruction proces-
sor in our core is the additional logic around each arith-
metic unit for handling array operations. This allows the
main control unit and instructions to work at coarser level
of array operations, resulting in a very simple controller and
instructions (See Figure 3).
Control 
Wrapper 
Control 
Wrapper 
Control 
Wrapper 
Control 
Wrapper 
Local Memory I/F 
Read 
Logic 
/ + ~ X 
Register 
Arrays 
Controller 
(FSM) 
Code 
Memory 
1 
Figure 3: Second architectural option: fully sequen-
tial implementation
The main advantage of this approach is that it allows very
convenient programming of the algorithm. While this par-
ticular experiment was for the Convect kernel, the architec-
ture can work as a generic processor. The reconfigurability
of the FPGA allows it to be optimised for each implemen-
tation, similar to approach taken by [9].
However, this architecture offers no parallelism inside the
execution of a single Convect kernel, which is executed in
a strictly sequential fashion. There is also the overhead of
having a controller around each single arithmetic unit. The
result is an architecture that has the overheads of a typical
sequential instruction processor, yet would be operating at
much lower frequencies than those at which CPUs operate
because we are implementing it on an FPGA.
Our final proposal builds on the strength of this approach,
and mitigates the overheads by incorporating design ele-
ments from the very rich vector-processor design domain.
4.4 Option 3: Vector Implementation
Based on the first two experiments we propose a more
flexible solution. Our final architecture introduces vector
instructions and multiple arithmetic units.
The convect method, like most scientific applications, re-
quires arithmetic operations on large vectors or arrays. Such
applications are very well suited for the SIMD approach,
and vector processors have been used successfully in a num-
ber of super-computing applications. The Cray computer is
a vector-processor [10] based on which a number of super-
computing platforms have been built. The Earth Simulator
[11] at JAMSTECH in Japan is based on a cluster of NEC’s
NX-9 vector processors.
Our proposed solution implements a Harvard-style vector
instruction processor with an array of arithmetic functional
units. We introduce vector registers and at the same time
we maintain a set of scalar instructions and registers to deal
with the non-vector parts of the algorithm, as encountered
frequently in a protracted kernel like Convect.
Since our motive has been to evolve towards a generic pro-
cessing core for scientific computations, we have introduced
various levels of flexibility in the design. Compile time pa-
rameters can be used to tweak these design aspects:
• Optional use of floating-point/fixed-point converter,
and fixed-point number representation format.
• Size of vector registers bank (width and depth).
• Choice between combinatorial or sequential functional
units, and ability to mix both in the same design.
• Different number of functional units for each arith-
metic operation. This is a very useful configuration
option as we will show later in our results.
We can explore the flexibility along these dimensions with
very little effort, and it will be trivial to automate these
choices at compile time. This will enable us to make use of
a key strength of FPGAs where it can not be matched by
either CPUs or GPUs; the compile-time reconfigurability of
hardware. Our processor core will only instantiate the hard-
ware required for instructions in a given algorithm known at
compile-time.
Figure 4 shows a block diagram of our proposed archi-
tecture. This architecture has been developed in view of a
specific problem of executing the Convect kernel, though the
instruction set allows generic coding as well.
The wrapper around the ALU for sequencing the opera-
tions is required to deal with the different number of different
arithmetic units, which may or may not be the same as the
size of the vector. The required sequencing and collection
of data is completely transparent at the main control-unit
level, and is dealt by the sequencing wrapper which hides
this complexity from rest of the architecture as well as from
the compiler or programmer.
Load_Store
VectorPUnit
Vector
Registers
Scalar
Registers
ControlPUnit CodeMemory
ALUPSequencer
VectorPALU
A A A A A
M M M M
D D
Flattened
PData
Memory
Interface
Floating-Point
to/from
Fixed-Point
Conversion
InterfacePRegisters
Figure 4: Third architecture option: Vector Pro-
cessing Unit
5. RESOURCE UTILIZATION AND CAL-
CULATION LATENCY COMPARISONS
In this section we present the results of simulation and
synthesis of various architectural options on Xilinx’s XC5VLX
devices. We first compare the three architectures that we
have presented earlier, and then we explore different config-
urations of the proposed vector architecture.
5.1 Comparing the Three Architectures
Figure 5 shows the comparison of resource utilization and
calculation latency of the three architectures presented ear-
lier. For the proposed vector processor solution, we have
chosen 8 units each for addition and multiplication, and 24
units for division, which is what 8-8-24 means in the figure.
39063 
5269 
12889 
1083 
32643 
2082 
0
10000
20000
30000
40000
0
10000
20000
30000
40000
50000
Fully Tiled Fully Sequential Vector 8-8-24
C
al
cu
la
ti
o
n
 L
at
e
n
cy
 (
n
s)
 
Sl
ic
e
 U
ti
liz
at
io
n
 
Architecture Options 
Slices Calculation Latency (ns)
Figure 5: Resource utilization and calculation la-
tency comparison of three architectures
The second option of fully sequential implementation con-
sumes about 7× fewer resources and 8× more time for the
same operation as the fully tiled option 1 (See Figure 5), so
there is not much difference between the two in this respect.
Note however the comparison of the second option (se-
quential processor)with the third one, that is the vector core,
with 8 adders, 8 mulitipliers and 24 dividers. We can see an
almost 16× decrease in latency for a mere 2.5× increase in
the use of slices. Another way of looking at this is that we
could instantiate three cores of the first option which make
the resource utilization almost the same as the vector pro-
cessor, yet it will still be almost 6× slower than the vector
processor. This establishes a clear case for the vector pro-
cessor based architecture.
5.2 Exploring the Proposed Architecture
With the flexibility of instantiating different number of
arithmetic units in the vector processor core, we experi-
mented with different mixtures of functional units. Note
that the divide unit is a sequential one, while adder and mul-
tiplier are combinatorial, with the adder having additional
logic to allow subtraction as well. The multiplier takes up
the most logic resources, while the adder takes up the least
resources.
Our first set of experiments was on symmetric architec-
tures that keep the number of units for each arithmetic unit
equal, so that we can judge the effect of uniformly scaling
the vector ALU, from 1 unit each, to one unit each for every
vector element. The results are outlined in Figure 6.
4
8
1
2
 
5
0
4
8
 
5
3
2
5
 9
3
0
0
 
1
1
5
7
7
 
1
9
3
6
0
 
33200 
16763 
8545 4436 3066 1696 
0
1
0
0
0
0
2
0
0
0
0
3
0
0
0
0
4
0
0
0
0
0
5
0
0
0
1
0
0
0
0
1
5
0
0
0
2
0
0
0
0
2
5
0
0
0
1-1-1 2-2-2 4-4-4 8-8-8 12-12-12 24-24-24
C
al
cu
la
ti
o
n
 L
at
e
n
cy
 (
n
s)
 
Sl
ic
e
 U
ti
liz
at
io
n
 
Vector ALU Configuration (Adder-Multiplier-Divider) 
Slices Calculation Latency (ns)
Figure 6: Comparing different symmetrical configu-
rations of ALU in the vector architecture
The power of the vector architecture can be seen here,
where going from a 1-1-1 configuration to a 24-24-24 con-
figuration takes up 4× more resources, but yields a latency
improvement of around 20×. We can also see better yields
for resources from a configuration of 4-4-4 and upwards.
Next we investigated asymmetrical ALU configurations,
by keeping two of the arithmetic unit arrays at 8 instances
each, and increasing the third one to see the relative im-
provements in performance and increase in resources. Our
adders and multipliers are pure combinatorial units, whereas
the divider is a sequential unit, so one can expect the replica-
tion of the divider to yield higher dividends than the other.
This expectation is confirmed by our results as shown in
Figure 7, where the three asymmetrical configurations are
compared with each other against the baseline 8-8-8 option.
Note that increasing adders and especially multipliers in-
creases slice usage considerably, with very little impact on
latency, which is largely determined by the sequential unit
in the design, that is the divider. As expected, the best
trade-off is adding dividers which significantly cuts the la-
tency (2.1×)for a relatively modest increase in resource us-
age (1.4×) when compared with a baseline 8-8-8 configura-
tion. This is the reason we used the 8-8-24 configuration
to compare the vector architecture against the other two in
Figure 5.
9
3
0
0
 
1
0
3
2
6
 
1
6
1
0
2
 
1
2
8
9
1
 
4436 4305 4175 
2088 
0
1000
2000
3000
4000
5000
0
4000
8000
12000
16000
20000
8-8-8 24-8-8 8-24-8 8-8-24
C
al
cu
la
ti
o
n
 L
at
e
n
cy
 (
n
s)
 
Sl
ic
e
 U
ti
liz
at
io
n
 
Vector ALU Configuration (Adder-Multiplier-Divider) 
Slices Calculation Latency (ns)
Figure 7: Comparing asymmetrical configurations of
ALU in the vector architecture against a symmetric
8-8-8 baseline
6. DISCUSSION, CONCLUSION AND FU-
TUREWORK
Our observation after experiments with accelerating the
Convect kernel in context of accelerating Flexpart has given
interesting insights into the subject of acceleration of sci-
entific computations on FPGAs in general. We saw that
naive approaches like fully unrolling loops and replicating
arithmetic units for each and every operation very quickly
bloat into an unmanageable size, and would be impractical
except perhaps for very simple kernels where the same or a
very small set of operations is repeated many times. Using
the Convect kernel as a representative algorithm for scien-
tific computations, we can ill afford this approach, especially
when one adds the effort required to program a custom cir-
cuit for each algorithm.
The other two options both have a major advantage of
being programmable in a way very similar to instruction
processors. Based on our experiments, our proposal for ac-
celerating scientific models on FPGAs is to use an array of
custom vector processors instantiated for the algorithm.
Further experiments with the vector processor indicate
the high degree of compile-time configurability of the pro-
posed architecture. Note that all these experiments were
done in the context of a particular set of equations, for which
we judged the trade-offs. For a different set of equations in
another model, some other mix of functional units would
yield a different optimal configuration. This aspect of our
investigation emphasises the inherent flexibility in the archi-
tecture that can allow it to be configured around different
kinds of algorithms.
The architecture allows parallelism at three levels:
• At the top (thread) level by replicating the symmetric
computational cores, each executing an instance of the
function.
• It is possible to have a pipeline of asymmetric compu-
tational cores, each working on different parts of the
algorithm. This has not been explored in case of Con-
vect and is part of our future work.
• Data-level parallelism by introduction of multiple arith-
metic units in the ALU.
It is a soft processor that can be customised at compile-
time to tailor the architecture to the problem. The compu-
tation core has a dedicated code memory whose contents are
finalised at compile-time. This means that the core can be
customised as per the requirements of the set of instructions
it is to execute. Programming the architecture at assembly
level is quite straightforward as it is essentially a Harvard-
style vector processor. While it is not a trivial task, it is
certainly possible to take legacy high-level code, partition
it to a set of heterogeneous vector cores on an FPGA, and
compile the instructions for each core into their code mem-
ory, in a fully automated manner.
In view of the above, we foresee considerable innovation
for and contribution to the field of HPC for scientific appli-
cations that can be made as we move further in our investi-
gations.
7. REFERENCES
[1] W. Vanderbauwhede and K. Benkrid,
High-Performance Computing Using FPGAs.
Springer, 2013.
[2] A. Madariaga, J. Jimenez, J. Martin, U. Bidarte, and
A. Zuloaga, “Review of electronic design automation
tools for high-level synthesis,” in Applied Electronics
(AE), 2010 International Conference on, Sept 2010,
pp. 1–6.
[3] T. Czajkowski, U. Aydonat, D. Denisenko,
J. Freeman, M. Kinsner, D. Neto, J. Wong,
P. Yiannacouras, and D. Singh, “From opencl to
high-performance hardware on fpgas,” in Field
Programmable Logic and Applications (FPL), 2012
22nd International Conference on, Aug 2012, pp.
531–534.
[4] A. Stohl, C. Forster, A. Frank, P. Seibert, and
G. Wotawa, “Technical note: The lagrangian particle
dispersion model flexpart version 6.2,” Atmospheric
Chemistry and Physics, vol. 5, no. 9, pp. 2461–2474,
2005. [Online]. Available:
http://www.atmos-chem-phys.net/5/2461/2005/
[5] ECMWF, “User guide to ecmwf products 2.1,” 1995.
[6] M. Z.-R. Emanuel, Kerry A., “Development and
evaluation of a convection scheme for use in climate
models,” Journal of the Atmospheric Sciences, vol. 56,
no. 11, pp. 1766–1782, Jun. 1999.
[7] H. M. Jiye ZENG, Tsuneo MATSUNAGA, “Using
nvidia gpu for modelling the lagrangian particle
dispersion in the atmosphere,” in International
Environmental Modelling and Software Society
(iEMSs) International Congress on Environmental
Modelling and Software Modelling for Environments
Sake, Fifth Biennial Meeting, Ottawa, Canada, 2010.
[8] L. Saldanha and R. Lysecky, “Float-to-fixed and
fixed-to-float hardware converters forrapid
hardware/software partitioning of floating point
software applications to static and dynamic fixed
point coprocessors,” Design Automation for Embedded
Systems, vol. 13, no. 3, pp. 139–157, 2009.
[9] S. Chalamalasetti, S. Purohit, M. Margala, and
W. Vanderbauwhede, “Mora - an architecture and
programming model for a resource efficient coarse
grained reconfigurable processor,” in Adaptive
Hardware and Systems, 2009. AHS 2009. NASA/ESA
Conference on, July 2009, pp. 389–396.
[10] R. M. Russell, “The cray-1 computer system,”
Commun. ACM, vol. 21, no. 1, pp. 63–72, Jan. 1978.
[11] S. Habata, M. Yokokawa, and S. Kitawaki, “The earth
simulator system,” NEC Research and Development,
vol. 44, no. 1, pp. 21–26, 2003.
