Mapping adaptive particle filters to heterogeneous reconfigurable systems by Chau, TCP et al.
Mapping Adaptive Particle Filters to Heterogeneous Reconfigurable Systems
Thomas C.P. Chau, Department of Computing, Imperial College London, UK
Xinyu Niu, Department of Computing, Imperial College London, UK
Alison Eele, Department of Engineering, University of Cambridge, UK
Jan Maciejowski, Department of Engineering, University of Cambridge, UK
Peter Y.K. Cheung, Department of Electrical and Electronic Engineering, Imperial College London, UK
Wayne Luk, Department of Computing, Imperial College London, UK
This paper presents an approach for mapping real-time applications based on particle filters to heteroge-
neous reconfigurable systems, which typically consist of multiple FPGAs and CPUs. A method is proposed
to adapt the number of particles dynamically and to utilise run-time reconfigurability of FPGAs for reduced
power and energy consumption. A data compression scheme is employed to reduce communication overhead
between FPGAs and CPUs. A mobile robot localisation and tracking application is developed to illustrate
our approach. Experimental results show that the proposed adaptive particle filter can reduce computation
time by up to 99%. Using run-time reconfiguration, we achieve a 25-34% reduction in idle power. A 1U system
with four FPGAs is up to 169 times faster than a single-core CPU and 41 times faster than a 1U CPU server
with 12 cores. It is also estimated to be 3 times faster than a system with 4 GPUs.
Categories and Subject Descriptors: C.1.3 [Processor Architectures]: Other Architecture Styles—hetero-
geneous (hybrid) systems; C.3 [Special-purpose and Application-based Systems]: Real-time and embed-
ded systems
General Terms: Algorithms, Design, Performance
Additional Key Words and Phrases: Particle filters, Sequential Monte Carlo, Reconfigurable systems, FP-
GAs, Run-time reconfiguration
ACM Reference Format:
Thomas C.P. Chau, Xinyu Niu, Alison Eele, Jan Maciejowski, Peter Y.K. Cheung and Wayne Luk, 2014.
Mapping Adaptive Particle Filters to Heterogeneous Reconfigurable Systems. ACM Trans. Reconfig. Tech.
Syst. 1, 1, Article 1 (March 2014), 17 pages.
DOI:http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
The particle filter (PF), also known as the sequential Monte Carlo (SMC) method, is a statistical
technique for estimating the state of dynamic systems with non-linear and non-Gaussian properties.
PF has been studied in various application areas including object tracking [Happe et al. 2011],
robot localisation [Montemerlo et al. 2002], speech recognition [Vermaak et al. 2002] and air
traffic management [Eele and Maciejowski 2011].
This work is supported in part by the European Union FP7 under grant agreement number 257906, 287804
and 318521, by UK EPSRC grant number EP/L00058X/1, EP/I012036/1 and EP/G066477/1, by Maxeler
University Programme, by Xilinx, and by the Croucher Foundation.
Authors’ addresses: T.C.P. Chau, X. Niu and W. Luk, Department of Computing, Imperial College London;
P.Y.K. Cheung, Department of Electrical and Electronic Engineering, Imperial College London; A. Eele and
J. Maciejowski, Department of Engineering, University of Cambridge.
© 2014 ACM 1539-9087/2014/03-ART1 $15.00
ACM Transactions on Reconfigurable Technology and Systems, Vol. 1, No. 1, Article 1, Publication date: March 2014.
PF keeps track of a large number of particles, each of which contains information about how
a system would evolve. The underlying concept is to approximate a sequence of states
by a collection of particles. Each particle is weighted to reflect the quality of an
approximation. The more complex the problem, the larger the number of particles that are
needed. One drawback of PF is its long execution time, which limits its practical use.
This paper presents an efficient solution to PF. We derive an adaptive algorithm that
adjusts its computation complexity at run time based on the quality of results. To map
our algorithm to a heterogeneous reconfigurable system (HRS) consisting of multiple
FPGAs and CPUs, we design a pipeline-friendly data structure to make effective use
of the stream computing model. Moreover, we accelerate the algorithm with a data
compression scheme and data control separation.
The key contributions of this paper include:
(1) An adaptive PF algorithm which adapts the size of particle set at run-time. The
algorithm is able to reduce computation workload while maintaining the quality of
results.
(2) Mapping the proposed algorithm to a scalable and reconfigurable system by follow-
ing the stream computing model. A novel data structure is designed to take advan-
tage of the architecture and to alleviate the data transfer bottleneck. The system
uses the run-time reconfigurability of FPGA to switch between computation mode
and low-power mode.
(3) An implementation of a robot localisation application targeting the proposed sys-
tem. Compared to a non-adaptive and non-reconfigurable implementation, the idle
power of our proposed system is reduced by 25-34% and the overall energy con-
sumption decreases by 17-33%. Our system with four FPGAs is up to 169 times
faster than a single core CPU, 41 times faster than a 1U CPU server with 12 cores,
and 3 times faster than a modelled four-GPU system.
2. BACKGROUND AND RELATED WORK
This section briefly outlines the PF algorithm. A more detailed description can be found
in [Doucet et al. 2001]. PF estimates the state of a system by a sampling-based approx-
imation of the state probability density function. The state of a system in time-step t
is denoted by Xt. The control and observation are denoted by Ut and Yt respectively.
Three pieces of information about the system are known a-priori:
— p(X0) is the probability of the initial state of the system,
— p(Xt|Xt−1, Ut−1) is the state transition probability of the system’s current state
given a previous state and control information,
— p(Yt|Xt) is the observation model describing the likelihood of observing the mea-
surement at the current state.
PF approximates the desired posterior probability p(Xt|Y1:t) using a set of P particles
$\{\chi_t^{(i)}\}_{i=1}^{P}$ with their associated weights $\{w^{(i)}\}_{i=1}^{P}$. X0 and U0 are
initialised. This computation consists of three iterative steps.
(1) Sampling: A new particle set $\{\tilde{\chi}_t^{(i)}\}_{i=1}^{P}$ is drawn from the distribution
p(Xt|Xt−1, Ut−1), forming a prediction of the distribution of Xt.
(2) Importance weighting: The likelihood $p(Y_t|\tilde{\chi}_t^{(i)})$ of each particle is calculated.
The likelihood indicates whether the current measurement Yt matches the predicted state
$\{\tilde{\chi}_t^{(i)}\}_{i=1}^{P}$. Then each particle is assigned a weight $w^{(i)}$ with respect to
the likelihood.
(3) Resampling: Particles with higher weights are replicated and the number of
particles with lower weights is reduced. With resampling, the particle set has a
smaller variance. The particle set is then used in the next time-step to predict the
posterior probability. The distribution of the resulting particles $\{\chi_t^{(i)}\}_{i=1}^{P}$
approximates p(Xt|Y1:t).
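The three steps above can be illustrated with a minimal Python sketch of one filter iteration. It is not the authors' implementation: the 1-D state, the Gaussian motion and measurement models, and the names `pf_step`, `motion_std` and `meas_std` are assumptions chosen only to keep the example short.

```python
import math
import random


def pf_step(particles, control, measurement, rng, motion_std=0.1, meas_std=0.5):
    """One sampling / importance weighting / resampling iteration (1-D toy model)."""
    # (1) Sampling: draw each new particle from an assumed p(X_t | X_{t-1}, U_{t-1}).
    predicted = [x + control + rng.gauss(0.0, motion_std) for x in particles]

    # (2) Importance weighting: Gaussian likelihood p(Y_t | particle), then normalise.
    weights = [math.exp(-0.5 * ((measurement - x) / meas_std) ** 2) for x in predicted]
    total = sum(weights)
    if total == 0.0:
        # All likelihoods underflowed: fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]

    # (3) Resampling: replicate high-weight particles, drop low-weight ones.
    return rng.choices(predicted, weights=weights, k=len(predicted))


# Track a target that moves 0.5 units per step, observed with noise.
rng = random.Random(0)
particles = [rng.uniform(-5.0, 5.0) for _ in range(2000)]
true_state = 0.0
for _ in range(20):
    true_state += 0.5
    observation = true_state + rng.gauss(0.0, 0.5)
    particles = pf_step(particles, 0.5, observation, rng)
estimate = sum(particles) / len(particles)
```

Each particle is processed independently in steps (1) and (2), which is the property that the hardware mapping later exploits; only the normalisation and the resampling need information from the whole set.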
The particles in PF are independent of each other. This means the algorithm can
be accelerated using specialised hardware with massive parallelism and pipelin-
ing. In [Happe et al. 2011], an approach for PF on a hybrid CPU/FPGA platform
is developed. Using a multi-threaded programming model, computation is switched
between hardware and software during run-time to react to performance require-
ments. Resampling algorithms and architectures for distributed PFs are proposed
in [Bolic et al. 2005].
Adaptive PFs have been proposed to improve performance or quality of state estimation
by controlling the number of particles dynamically. Likelihood-based adaptation
controls the number of particles such that the sum of weights exceeds a pre-specified
threshold [Koller and Fratkina 1998]. Kullback-Leibler distance (KLD) sampling is
proposed in [Fox 2003], which offers better quality results than the likelihood-based
approach. KLD sampling is improved in [Park et al. 2010] by adjusting the
variance and gradient of data to generate particles near high-likelihood regions.
The above methods introduce data dependencies in the sampling and importance
weighting steps, so they are difficult to parallelise. An adaptive PF is proposed
in [Bolic et al. 2002] that changes the number of particles dynamically based on esti-
mation quality. In [Chau et al. 2012], adaptive PF is extended to a multi-processor sys-
tem on FPGA. The number of particles and active processors change dynamically but
the performance is limited by soft-core processors. In [Liu et al. 2007], a mechanism
and a theoretical lower bound for adapting the sample size of particles are presented.
Our previous work [Chau et al. 2013a] presents a hardware-friendly adaptive PF. The
algorithm is mapped to an accelerator system which consists of an FPGA and a CPU.
However, the system suffers from a large communication overhead when the particles
are transferred between the FPGA and CPU. Moreover, the scalability of the adaptive
PF algorithm to multiple FPGAs is not covered. In this paper, we extend our previous
work to address the problems mentioned above.
3. ADAPTIVE PARTICLE FILTER
This section introduces an adaptive PF algorithm which changes the number of parti-
cles at each time-step. The algorithm is inspired by [Liu et al. 2007] and we transform
it to a pipeline-friendly version for mapping to the stream computing architecture.
This algorithm is shown in Algorithm 1 which consists of four stages.
3.1. Stage 1: Sampling and Importance Weighting (line 8 to 9)
At the initial time-step (t = 0), the maximum number of particles is used, i.e.
P0 = Pmax. At the subsequent time-steps, the number of particles is denoted as Pt.
First, the particle set $\{\chi_t^{(i)}\}_{i=1}^{P_t}$ is sampled to
$\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}$. Then a weight from $\{w^{(i)}\}_{i=1}^{P_t}$ is assigned to
each particle. As a result, $\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}$ and $\{w^{(i)}\}_{i=1}^{P_t}$ give an
estimation of the next state.
During sampling and importance weighting, the computation of each particle is
independent of the others. The mapping of computation to FPGAs is described in
Section 4.
ALGORITHM 1: Adaptive PF algorithm
 1: $P_0 \leftarrow P_{max}$
 2: $\{X_0^{(i)}\}_{i=1}^{P_0} \leftarrow$ random set of particles
 3: $t = 1$
 4: for each step $t$ do
 5:   $r = 0$
 6:   while $r \le$ itl_repeat do
 7:     -- On FPGAs --
 8:     Sample a new state $\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}$ from $\{\chi_t^{(i)}\}_{i=1}^{P_t}$
 9:     Calculate unnormalised importance weights $\{\tilde{w}^{(i)}\}_{i=1}^{P_t}$ and accumulate the weights as $w_{sum}$
10:     Calculate the lower bound of sample size $\tilde{P}_{t+1}$ by Equation 1
11:     -- On CPUs --
12:     Sort $\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}$ in descending order of $\{\tilde{w}^{(i)}\}_{i=1}^{P_t}$
13:     if $\tilde{P}_{t+1} < P_t$ then
14:       $P_{t+1} = \max(\lceil \tilde{P}_{t+1} \rceil, P_t/2)$
15:       Set $a = 2P_{t+1} - P_t$ and $b = P_{t+1}$
16:       -- Do the following loop in parallel --
17:       for $i$ in $P_t - P_{t+1}$ do
18:         $\tilde{\chi}_{t+1}^{(i)} = \frac{\chi_{t+1}^{(a)} \tilde{w}^{(a)} + \chi_{t+1}^{(b)} \tilde{w}^{(b)}}{\tilde{w}^{(a)} + \tilde{w}^{(b)}}$
19:         $\tilde{w}^{(i)} = \tilde{w}^{(a)} + \tilde{w}^{(b)}$
20:         $a = a + 1$ and $b = b - 1$
21:       end for
22:     else if $\tilde{P}_{t+1} \ge P_t$ then
23:       $a = 0$ and $b = 0$
24:       for $i$ in $P_{t+1} - P_t$ do
25:         if $\tilde{w}^{(a)} < \tilde{w}^{(a+1)}$ and $a < P_{t+1}$ then
26:           $a = a + 1$
27:         end if
28:         $\tilde{\chi}_{t+1}^{(P_t+b)} = \tilde{\chi}_{t+1}^{(a)}/2$
29:         $\tilde{\chi}_{t+1}^{(a)} = \tilde{\chi}_{t+1}^{(a)}/2$
30:         $\tilde{w}^{(P_t+b)} = \tilde{w}^{(a)}/2$
31:         $\tilde{w}^{(a)} = \tilde{w}^{(a)}/2$
32:         $b = b + 1$
33:       end for
34:     end if
35:     Resample $\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}$ to $\{\chi_{t+1}^{(i)}\}_{i=1}^{P_{t+1}}$
36:     $r = r + 1$
37:   end while
38: end for
3.2. Stage 2: Lower Bound Calculation (line 10)
This stage derives the smallest number of particles that are needed in the next time-step
in order to bound the approximation error. The adaptive algorithm seeks a value
which is less than or equal to Pmax. This number, denoted as $\tilde{P}_{t+1}$, is referred to as the
lower bound of sampling size. It is calculated by Equations 1 to 4.
$$\tilde{P}_{t+1} = \sigma^2 \cdot \frac{P_{max}}{\mathrm{Var}(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t})} \tag{1}$$

$$\sigma^2 = \sum_{i=1}^{P_t} \left( w^{(i)} \cdot \tilde{\chi}_{t+1}^{(i)} \right)^2 - 2 \cdot E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) \cdot \sum_{i=1}^{P_t} \left( (w^{(i)})^2 \cdot \tilde{\chi}_{t+1}^{(i)} \right) + \left( E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) \right)^2 \cdot \sum_{i=1}^{P_t} (w^{(i)})^2 \tag{2}$$

$$\mathrm{Var}(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) = \sum_{i=1}^{P_t} \left( w^{(i)} \cdot (\tilde{\chi}_{t+1}^{(i)})^2 \right) - \left( E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) \right)^2 \tag{3}$$

$$E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) = \sum_{i=1}^{P_t} w^{(i)} \cdot \tilde{\chi}_{t+1}^{(i)} \tag{4}$$
As shown in Equations 2 to 4, $w^{(i)}$ is a normalised term. To calculate $w^{(i)}$, a traditional
software-based approach is to iterate through the set of particles twice. The sum of
weights $w_{sum}$ and the unnormalised weights $\tilde{w}^{(i)}$ are calculated in the first iteration. Then
$w^{(i)}$ is obtained by dividing $\tilde{w}^{(i)}$ by $w_{sum}$ in the second iteration. However, this method
is inefficient for FPGA implementation. Since 2Pt cycles are needed to process Pt pieces
of data, the throughput is reduced to 50%.
To fully utilise deep pipelines targeting an FPGA, we perform function transformation.
Given $w^{(i)} = \tilde{w}^{(i)} / w_{sum}$, we factor $w_{sum}$ out of Equations 2 to 4. By doing so, we obtain
a transformed form as shown in Equations 5 to 7. $w_{sum}$ and $\tilde{w}^{(i)}$ are computed
simultaneously in two separate data paths. At the last clock cycle of the particle stream, $\sigma^2$,
$\mathrm{Var}(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t})$ and $E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t})$ are obtained. The details of
the FPGA kernel design are explained in Section 4.
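$$\sigma^2 = \frac{1}{(w_{sum})^2} \cdot \left( \sum_{i=1}^{P_t} \left( \tilde{w}^{(i)} \cdot \tilde{\chi}_{t+1}^{(i)} \right)^2 - 2 \cdot E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) \cdot \sum_{i=1}^{P_t} \left( (\tilde{w}^{(i)})^2 \cdot \tilde{\chi}_{t+1}^{(i)} \right) + \left( E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) \right)^2 \cdot \sum_{i=1}^{P_t} (\tilde{w}^{(i)})^2 \right) \tag{5}$$

$$\mathrm{Var}(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) = \frac{1}{w_{sum}} \cdot \sum_{i=1}^{P_t} \left( \tilde{w}^{(i)} \cdot (\tilde{\chi}_{t+1}^{(i)})^2 \right) - \left( E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) \right)^2 \tag{6}$$

$$E(\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}) = \frac{1}{w_{sum}} \cdot \sum_{i=1}^{P_t} \tilde{w}^{(i)} \cdot \tilde{\chi}_{t+1}^{(i)} \tag{7}$$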
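The equivalence between the two-pass form (Equations 2 to 4) and the single-pass form (Equations 5 to 7) can be checked with a short Python sketch. It assumes a scalar state per particle, and the function names and sample values are hypothetical. The single-pass version keeps only running sums, which is what allows one particle to be consumed per cycle in a pipeline.

```python
def lower_bound_two_pass(states, raw_weights, p_max):
    """Equations 1 to 4: normalise the weights first, then evaluate the sums."""
    w_sum = sum(raw_weights)                       # first pass over the particles
    w = [rw / w_sum for rw in raw_weights]         # second pass: normalisation
    mean = sum(wi * x for wi, x in zip(w, states))
    var = sum(wi * x * x for wi, x in zip(w, states)) - mean ** 2
    sigma2 = (sum((wi * x) ** 2 for wi, x in zip(w, states))
              - 2.0 * mean * sum(wi * wi * x for wi, x in zip(w, states))
              + mean ** 2 * sum(wi * wi for wi in w))
    return sigma2 * p_max / var


def lower_bound_single_pass(states, raw_weights, p_max):
    """Equations 1 and 5 to 7: accumulate unnormalised sums in one sweep."""
    w_sum = s_wx = s_wxx = s_w2 = s_w2x = s_w2xx = 0.0
    for x, rw in zip(states, raw_weights):
        w_sum += rw                 # running sum of weights
        s_wx += rw * x              # numerator of Eq. 7
        s_wxx += rw * x * x         # first term of Eq. 6
        s_w2 += rw * rw             # three sums used by Eq. 5
        s_w2x += rw * rw * x
        s_w2xx += (rw * x) ** 2
    # Divisions by w_sum happen once, after the last particle has streamed through.
    mean = s_wx / w_sum
    var = s_wxx / w_sum - mean ** 2
    sigma2 = (s_w2xx - 2.0 * mean * s_w2x + mean ** 2 * s_w2) / w_sum ** 2
    return sigma2 * p_max / var


states = [1.0, 2.0, 3.0, 4.0]
raw_weights = [0.2, 0.5, 0.9, 0.4]
two_pass = lower_bound_two_pass(states, raw_weights, 8192)
single_pass = lower_bound_single_pass(states, raw_weights, 8192)
```

Both functions return the same lower bound up to floating-point rounding, while the single-pass version reads each particle only once.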
3.3. Stage 3: Particle set size tuning (line 12 to 34)
The adaptive approach tunes the particle set size to fit the lower bound $\tilde{P}_{t+1}$. This
stage is done on the CPUs because the operations involve non-sequential data access
that cannot be mapped efficiently to FPGAs.
The particles are sorted in descending order according to their weights. As the new
sample size can increase or decrease, there are two cases:
—Case I: Particle set reduction when P˜t+1 < Pt
The new size Pt+1 is set to $\max(\lceil \tilde{P}_{t+1} \rceil, P_t/2)$. Since the new size is smaller than
the old one, some particles are combined to form a smaller particle set. Figure 1
illustrates the idea of particle reduction. The first 2Pt+1 − Pt particles with higher
weights are kept and the remaining 2(Pt − Pt+1) particles are combined in pairs. As a
result, Pt − Pt+1 new particles are injected to form the target particle set with
Pt+1 particles. We combine the particles deterministically to keep the statements
in the loop independent of each other. The loop can therefore be unrolled to
execute the statements in parallel. The complexity of the loop is
$O\left(\frac{P_t - P_{t+1}}{N_{parallel}}\right)$, where Nparallel indicates the level of parallelism.
Fig. 1. Particle set reduction: (a) the first 2Pt+1 − Pt particles are kept and the last 2(Pt − Pt+1)
particles with lower weights are combined in pairs; (b) the Pt − Pt+1 injected particles join the kept
ones to form Pt+1 new particles.
—Case II: Particle set expansion when P˜t+1 ≥ Pt
The new size Pt+1 is set to $\tilde{P}_{t+1}$. Some particles are taken from the original
set and are inserted to form a larger set. Particles with larger weights
have more descendants. As shown in lines 22 to 34, the process requires picking
the particle with the largest weight at each iteration of particle insertion. Since the
particle set is pre-sorted, the complexity of particle set expansion is O(Pt+1 − Pt).
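The reduction case can be sketched in Python as follows. This is an illustration rather than the authors' code: it uses 0-based indices, rounds Pt/2 up so that the new size is an integer, and pairs the first merged particle with the last one, all of which are assumptions about details the text leaves open. The function name `reduce_particle_set` is hypothetical.

```python
import math


def reduce_particle_set(states, weights, lower_bound):
    """Shrink a particle set sorted by descending weight by merging low-weight pairs."""
    p_t = len(states)
    # New size: the lower bound, but never less than half of the current size.
    p_next = max(math.ceil(lower_bound), math.ceil(p_t / 2))
    if p_next >= p_t:
        return list(states), list(weights)
    kept = 2 * p_next - p_t                  # highest-weight particles kept unchanged
    new_states = list(states[:kept])
    new_weights = list(weights[:kept])
    # Merge the remaining 2 * (p_t - p_next) particles in pairs. Every iteration
    # reads a disjoint pair, so the iterations are independent of each other.
    for k in range(p_t - p_next):
        a = kept + k                         # walks forward through the merged range
        b = p_t - 1 - k                      # walks backward from the last particle
        w = weights[a] + weights[b]
        if w > 0.0:
            merged = (states[a] * weights[a] + states[b] * weights[b]) / w
        else:
            merged = 0.5 * (states[a] + states[b])
        new_states.append(merged)
        new_weights.append(w)
    return new_states, new_weights


states = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
weights = [0.30, 0.20, 0.15, 0.12, 0.10, 0.07, 0.04, 0.02]
new_states, new_weights = reduce_particle_set(states, weights, 5.2)
```

With eight particles and a lower bound of 5.2, the new size is six: the four heaviest particles are kept and the last four are merged into two, and the total weight is unchanged.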
3.4. Stage 4: Resampling (line 35)
Resampling is performed to pick Pt+1 particles from $\{\tilde{\chi}_{t+1}^{(i)}\}_{i=1}^{P_t}$ to form
$\{\chi_{t+1}^{(i)}\}_{i=1}^{P_{t+1}}$. The process has a complexity of O(Pt+1).
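The text does not name the resampling scheme. As one possibility, the sketch below uses systematic resampling, a common linear-time choice; treating it as the scheme used here is an assumption, and the function name is hypothetical.

```python
import random


def systematic_resample(states, weights, n_out, rng):
    """Draw n_out particles with probability proportional to weight in one sweep."""
    total = sum(weights)
    step = total / n_out
    pointer = rng.uniform(0.0, step)         # single random offset for all draws
    out = []
    i = 0
    cumulative = weights[0]
    for _ in range(n_out):
        # Advance to the particle whose cumulative weight covers the pointer.
        while pointer > cumulative and i < len(states) - 1:
            i += 1
            cumulative += weights[i]
        out.append(states[i])
        pointer += step
    return out


rng = random.Random(1)
dominated = systematic_resample(["a", "b", "c"], [0.0, 1.0, 0.0], 4, rng)
uniform = systematic_resample([10, 20, 30, 40], [0.25, 0.25, 0.25, 0.25], 4, rng)
```

Because the selected indices come out in non-decreasing order, copies of the same particle are adjacent in the output, which is convenient for run-based compression of the resampled stream.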
4. HETEROGENEOUS RECONFIGURABLE SYSTEM
This section describes the proposed heterogeneous reconfigurable system (HRS). It is
scalable to cope with different FPGA devices and applications. HRS also takes advan-
tage of the run-time reconfiguration feature for power and energy reduction.
4.1. Mapping adaptive PF to HRS
The system design of HRS is shown in Figure 2. A heterogeneous structure is employed
to make use of multiple FPGAs and CPUs. FPGAs and CPUs communicate through
high bandwidth buses. FPGAs are responsible for (1) sampling, (2) importance weight-
ing, and (3) lower bound calculation. The data paths on the FPGAs are fully-pipelined.
Each FPGA has its own on-board dynamic random-access memory (DRAM) to store
the large amount of particle data. On the other hand, the CPUs gather all the parti-
cles from FPGAs to perform particle set size tuning and resampling.
4.2. FPGA Kernel Design
Sampling, importance weighting and lower bound calculation are the most computation-intensive
stages. In each time-step, these three stages are iterated itl_repeat
times. An FPGA kernel is designed to accelerate them.
Figure 4 shows the components of the FPGA kernel. The kernel is fully pipelined to
achieve one output per clock cycle. It can also be replicated as many times as FPGA
resources allow, and the replications can be split across multiple FPGA boards. The kernel
takes three inputs from the CPUs or on-board DRAM: (1) states, (2) controls, and
(3) seeds. Application specific parameters are stored in ROMs. Three building blocks
correspond to the sampling, importance weighting and lower bound calculation stages
as described in Section 3.
For sampling and importance weighting, the computation of each particle is independent
of the others. Particles are fed to the FPGAs as a stream, as shown in Figure 3.
Each block of the particle stream consists of a number of data fields which store the
information of a particle. The number of data fields is application dependent. In every
clock cycle, one piece of data is transferred from the on-board memory to an FPGA data
path. Each FPGA data path has a long pipeline where each stage is filled with a piece
of data, and therefore many particles are processed simultaneously. Fixed-point data
representation is customised at each pipeline stage to reduce the resource usage.
Fig. 2. Heterogeneous reconfigurable system (solid lines: data paths; dotted lines: control paths). The
FPGAs perform sampling, importance weighting and lower bound calculation; the CPUs perform particle
set resizing and resampling. The loop repeats while r < itl_repeat and proceeds to the next time-step
when r == itl_repeat.
Fig. 3. A particle stream. Each block holds one particle as N data fields (Field 1 to Field N); block 1
starts at burst address 1 and block 2 at burst address N+1.
Meanwhile, the accumulation of wsum introduces a feedback loop. A new weight
arrives every cycle, which is faster than a floating-point unit can complete the
addition of the previous weight. In order to achieve one result per clock cycle,
a fixed-point data path is implemented while ensuring that no overflow or underflow occurs.
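The idea behind the fixed-point accumulator can be sketched in Python. The bit widths below (`FRAC_BITS`, `ACC_BITS`) are illustrative assumptions, not the widths used in the actual design. Integer addition stands in for the single-cycle fixed-point adder, and the overflow check mirrors the requirement that the accumulator must be wide enough for the largest particle count.

```python
FRAC_BITS = 24   # assumed number of fractional bits per weight
ACC_BITS = 48    # assumed accumulator width in bits


def to_fixed(value, frac_bits=FRAC_BITS):
    """Quantise a non-negative real weight to an unsigned fixed-point integer."""
    return int(round(value * (1 << frac_bits)))


def accumulate_fixed(weights, frac_bits=FRAC_BITS, acc_bits=ACC_BITS):
    """Sum weights with integer adds; each add can complete in one cycle in hardware."""
    limit = (1 << acc_bits) - 1
    acc = 0
    for w in weights:
        acc += to_fixed(w, frac_bits)
        if acc > limit:
            raise OverflowError("accumulator too narrow for this particle count")
    return acc / (1 << frac_bits)


w_sum = accumulate_fixed([0.125] * 1024)

# A deliberately narrow accumulator shows the overflow condition being detected.
try:
    accumulate_fixed([1.0] * 4, frac_bits=4, acc_bits=5)
    overflow_detected = False
except OverflowError:
    overflow_detected = True
```

Sizing the accumulator so that the maximum particle count times the largest representable weight still fits is what rules out overflow; the number of fractional bits bounds the quantisation error of each weight.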
4.3. Timing model for run-time reconfiguration
We derive a model to analyse the computation time of HRS. The model helps us to
design a configuration schedule that satisfies the real-time requirement and, if necessary,
amend the application’s specification. The model will be validated by experiments
in Section 6.
Fig. 4. FPGA kernel design. Blocks: control, ROMs for application parameters, random number generator,
sampling, importance weighting, weight accumulation and lower bound calculation. Inputs from DRAM:
current states and seed. Outputs: next-state particles, weights, sum of weights and lower bound.
The computation time (Tcomp) of HRS consists of three components: (1) data path
time Tdatapath, (2) CPU time TCPU, and (3) data transfer time Ttran. The sampling,
importance weighting and resampling processes are repeated itl_repeat times in
every time-step.
$$T_{comp} = itl\_repeat \cdot (T_{datapath} + T_{CPU} + T_{tran}) \tag{8}$$
Data path time, Tdatapath, denotes the time spent on the FPGAs. Pt denotes the
number of particles at the current time-step and fFPGA denotes the clock frequency of
the FPGAs. L is the length of the pipeline. Ndatapath denotes the number of data paths
on one FPGA board. NFPGA is the number of FPGA boards in the system.
$$T_{datapath} = \left( \frac{P_t}{f_{FPGA} \cdot N_{datapath}} + L - 1 \right) \cdot \frac{1}{N_{FPGA}} \tag{9}$$
CPU time, TCPU , denotes the time spent on the CPUs. The clock frequency and
number of threads of the CPUs are represented by fCPU and Nthread respectively. par
is an application-specific parameter in the range of [0, 1] which represents the ratio of
CPU instructions that are parallelisable, and α is a scaling constant derived empiri-
cally.
$$T_{CPU} = \alpha \cdot \frac{P_t}{f_{CPU}} \cdot \left( 1 - par + \frac{par}{N_{thread}} \right) \tag{10}$$
Data transfer time, Ttran, denotes the time of moving a particle stream between
the FPGAs and the CPUs. df is the number of data fields of a particle. For example, if a
particle contains the information of coordinates (x, y) and heading h, df = 3. Given that
the constant 1 represents the weight and the constant 2 accounts for the movement
of data in and out of the FPGAs, and bwdata is the bit-width of one data field, the
expression (2 · df + 1) · bwdata is regarded as the size of a particle.
fbus is the clock frequency of the bus connecting the CPUs to the FPGAs and lane is
the number of bus lanes connected to one FPGA. Since many buses, such as PCI
Express, encode data during transfer, the fraction of transferred bits that carry
payload is denoted by eff (for PCI Express Gen2 the value is 8/10). In our previous work
[Chau et al. 2013a], the data transfer time had a significant performance impact on HRS.
To reduce the data transfer overhead, we introduce a data compression technique that is
described in Section 5.
$$T_{tran} = \frac{(2 \cdot df + 1) \cdot bw_{data} \cdot P_t}{f_{bus} \cdot lane \cdot eff \cdot N_{FPGA}} \tag{11}$$
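Equation 11 can be evaluated with a few lines of Python. The parameter values below are illustrative assumptions only: df = 3 and 32-bit fields follow the (x, y, h) example above, the bus is taken as PCI Express Gen2 x8 with a signalling rate of 5 GHz per lane, and the particle count is hypothetical. The result is an upper bound on link performance, not a measured figure.

```python
def transfer_time(p_t, df, bw_data, f_bus, lane, eff, n_fpga):
    """Equation 11: seconds needed to move P_t particles to the FPGAs and back."""
    particle_bits = (2 * df + 1) * bw_data      # df fields in, df fields plus weight out
    return particle_bits * p_t / (f_bus * lane * eff * n_fpga)


# Assumed example: 8192 * 1024 particles, 3 fields of 32 bits, PCIe Gen2 x8, one FPGA.
t_one_fpga = transfer_time(p_t=8192 * 1024, df=3, bw_data=32,
                           f_bus=5e9, lane=8, eff=0.8, n_fpga=1)
# The same stream split across four FPGA boards, each with its own link.
t_four_fpga = transfer_time(p_t=8192 * 1024, df=3, bw_data=32,
                            f_bus=5e9, lane=8, eff=0.8, n_fpga=4)
```

The model is linear in Pt and inversely proportional to NFPGA, so halving the particle count or doubling the number of boards has the same effect on the transfer time.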
In real-time applications, each time-step is fixed and is known as the real-time bound
Trt. The derived model helps system designers to ensure that the computation time
Tcomp is shorter than Trt. An idle time Tidle is introduced to represent the time gap
between the computation time and real-time bound.
$$T_{idle} = T_{rt} - T_{comp} \tag{12}$$
Figure 5(a) illustrates the power consumption of an HRS without run-time reconfiguration.
It shows that the FPGAs are still drawing power after the computation finishes.
By exploiting run-time reconfiguration as shown in Figure 5(b), the FPGAs are loaded
with a low-power configuration during the idle period. Such a configuration minimises
the amount of active resources and the clock frequency. Equation 13 describes the sleep
time when the FPGAs are idle and loaded with the low-power configuration.
Reconfiguration is beneficial only when the sleep time is positive.
$$T_{sleep} = T_{idle} - T_{config} \tag{13}$$
Configuration time, Tconfig, denotes the time needed to download a configuration
bit-stream to the FPGAs. sizebs represents the size of bitstream in bits. fconfig is the
configuration clock frequency in Hz and bwconfig is the width of the configuration port.
$$T_{config} = \frac{size_{bs}}{f_{config} \cdot bw_{config}} \tag{14}$$
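Equations 12 to 14 amount to a simple decision rule, sketched below in Python. The bitstream size, configuration clock and port width are hypothetical values chosen for illustration; the real-time bound and computation time are placeholders as well.

```python
def config_time(size_bs, f_config, bw_config):
    """Equation 14: seconds needed to download a bitstream of size_bs bits."""
    return size_bs / (f_config * bw_config)


def sleep_time(t_rt, t_comp, t_config):
    """Equations 12 and 13: time left in low-power mode after reconfiguring."""
    t_idle = t_rt - t_comp
    return t_idle - t_config


def should_reconfigure(t_rt, t_comp, t_config):
    """Reconfiguring to the low-power design only pays off if some sleep time remains."""
    return sleep_time(t_rt, t_comp, t_config) > 0.0


# Hypothetical numbers: 150 Mbit bitstream, 100 MHz configuration clock, 32-bit port.
t_config = config_time(size_bs=150e6, f_config=100e6, bw_config=32)
t_sleep = sleep_time(t_rt=5.0, t_comp=1.885, t_config=t_config)
worthwhile = should_reconfigure(t_rt=5.0, t_comp=1.885, t_config=t_config)
# A time-step that is almost fully used leaves no room for reconfiguration.
too_tight = should_reconfigure(t_rt=5.0, t_comp=4.99, t_config=t_config)
```

When the computation nearly fills the time-step, the configuration time exceeds the idle time and the sleep time becomes negative, so the FPGAs should keep the computation configuration loaded.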
5. OPTIMISING TRANSFER OF PARTICLE STREAM
In Section 4, the data transfer time depends on the number of particles and the bus
bandwidth between the CPUs and FPGAs. It can be a major performance bottleneck
as depicted in [Chau et al. 2013a]. Referring to Figure 6(a), each block stores the data of
a particle. When the CPUs finish processing, all data are transferred from the CPUs
to the FPGAs. The data transfer time cannot be reduced by implementing more FPGA
data paths or increasing the FPGAs’ clock frequency because the bottleneck is at the
bus connecting the CPUs and FPGAs.
Fig. 5. Power consumption of the HRS over time: (a) without reconfiguration; (b) with reconfiguration
to low-power mode during idle.
To improve the data transfer performance, we design a data structure which facilitates
compression of particles. The idea comes from an observation of the resampling
process: some particles are eliminated and the vacancies are filled by replicating
non-eliminated particles. Replication means data redundancy exists. For example, in the
original data structure shown in Figure 6(a), particle 1 has three replicas and
particle 2 is eliminated, so particle 1 is stored and transferred three times.
By using the data structure in Figure 6(b), data redundancy is eliminated by storing
every particle once. Each particle is also transferred once. As a result, the data transfer
time and memory space are reduced.
An HRS often contains DRAM which transfers data in bursts in order to maximise
the memory bandwidth. This works well with the original data structure, where the
data are organised as a sequence from the lower address space to the upper. With the
new data structure, however, the data access pattern is no longer sequential, because the
address can go back and forth. The DRAM controller needs to be modified so that the
transfer throughput is not affected by the change of data access pattern. As
illustrated in Figure 6(b), a tag sequence is used to indicate the address of the next
block. For example, after reading the data of particle 1, the burst address is at N. If
the tag is one, the next burst address points to the start of the next block at
N + 1. Otherwise, the burst address returns to the start address of the current block
(which is 1), so the same particle is read again. The data are still addressed in bursts,
so the performance is not degraded.
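The tag mechanism can be modelled in software as a run-based encoder and decoder. The Python sketch below is an assumption about the layout, with one tag per output particle and copies of the same particle adjacent in the resampled order; the hardware details of the modified DRAM controller are not specified here, and the function names are hypothetical.

```python
def compress(resampled_ids, particle_data):
    """Store each surviving particle once and emit one tag per output particle."""
    blocks, tags = [], []
    for pos, pid in enumerate(resampled_ids):
        if pos == 0 or pid != resampled_ids[pos - 1]:
            blocks.append(particle_data[pid])       # first copy: store the block
        last_copy = pos + 1 == len(resampled_ids) or resampled_ids[pos + 1] != pid
        tags.append(1 if last_copy else 0)          # 1: advance, 0: re-read block
    return blocks, tags


def decompress(blocks, tags):
    """Replay the stream: tag 0 re-reads the current block, tag 1 moves to the next."""
    out, block = [], 0
    for tag in tags:
        out.append(blocks[block])
        if tag == 1:
            block += 1
    return out


# Resampling kept particle 1 three times and particle 3 once; particle 2 was eliminated.
particle_data = {1: (0.5, 1.0, 0.1), 2: (1.5, 0.2, 0.7), 3: (2.0, 2.5, 0.3)}
resampled_ids = [1, 1, 1, 3]
blocks, tags = compress(resampled_ids, particle_data)
restored = decompress(blocks, tags)
```

Only two blocks are stored instead of four, and the decoder reproduces the original four-particle stream from the blocks and the tag sequence.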
The data transfer time with compression is given by Equation 15. Rep is the average number
of replications of the particles, and therefore the size of the resampled particle stream
is reduced by a factor of Rep. Rep ranges from 1 to Pt, depending on the
distribution of particles after the resampling process. The effect of Rep on data transfer
time is evaluated in the next section.
$$T_{tran} = \frac{\left( \frac{df}{Rep} + df + 1 \right) \cdot bw_{data} \cdot P_t}{f_{bus} \cdot lane \cdot eff \cdot N_{FPGA}} \tag{15}$$
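A short Python sketch shows how Equation 15 behaves as Rep grows. The bus parameters and particle count are the same kind of illustrative assumptions as before (PCI Express Gen2 x8, 32-bit fields, df = 3), and the function name is hypothetical.

```python
def transfer_time_compressed(p_t, df, bw_data, rep, f_bus, lane, eff, n_fpga):
    """Equation 15: transfer time when the inbound stream is compressed by Rep."""
    particle_bits = (df / rep + df + 1) * bw_data
    return particle_bits * p_t / (f_bus * lane * eff * n_fpga)


bus = dict(p_t=8192 * 1024, df=3, bw_data=32, f_bus=5e9, lane=8, eff=0.8, n_fpga=1)
t_rep1 = transfer_time_compressed(rep=1, **bus)      # no replication: same as Eq. 11
t_rep4 = transfer_time_compressed(rep=4, **bus)
t_rep_large = transfer_time_compressed(rep=1e9, **bus)
ratio_rep4 = t_rep4 / t_rep1
ratio_limit = t_rep_large / t_rep1
```

With Rep = 1 the expression reduces to Equation 11. Only the inbound df fields are compressed, so the ratio approaches (df + 1)/(2 · df + 1) as Rep grows, which is 4/7 for df = 3.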
Fig. 6. (a) Particle stream before compression; (b) compressed particle stream. After the resampling
process, some particles are eliminated and the remaining particles are replicated. Data compression is
applied so that every particle is stored and transferred once only.
6. EXPERIMENTAL RESULTS
To evaluate the performance of the HRS and compare it with other systems,
we implement an application which uses PF for localisation and tracking of a mobile
robot. The application is proposed in [Montemerlo et al. 2002] to track the locations of
moving objects conditioned upon robot poses over time. Given an a priori learned map, a
robot receives sensor values and moves at regular time intervals. Meanwhile, M moving
objects are tracked by the robot. The states of the robot and objects at time t are
represented by a state vector Xt:
$$X_t = \{R_t, H_{t,1}, H_{t,2}, \ldots, H_{t,M}\} \tag{16}$$
Rt denotes the robot’s pose at time t, and Ht,1, Ht,2, ..., Ht,M denote the locations of
the M objects at the same time.
The following equation is used to represent the posterior of the robot’s location:
$$p(X_t|Y_t, U_t) = p(R_t|Y_t, U_t) \prod_{m=1}^{M} p(H_{t,m}|R_t, Y_t, U_t) \tag{17}$$
Yt is the sensor measurement and Ut is the control of the robot at time t. The robot
path posterior p(Rt|Yt, Ut) is represented by a set of robot-particles. The distribution of
an object’s location p(Ht,m|Rt, Yt, Ut) is represented by a set of object-particles, where
each object-particle set is attached to one particular robot-particle. In other words, if
there are Pr robot-particles representing the posterior over robot path, there are Pr
object-particle sets, each with Ph particles.
In the application, the area of the map is 12m by 18m. The robot makes a movement
of 0.5m every five seconds, i.e. Trt = 5 s. The robot can track eight moving objects at the
same time. A maximum of 8192 particles are used for robot-tracking and each robot-
particle is associated with 1024 object-particles. Therefore, the maximum number of
data path cycles is 8*8192*1024=67,108,864. Each particle being streamed into the
FPGAs contains coordinates (x,y) and heading h which are represented by three single
precision floating-point numbers. A particle being streamed out of the FPGAs
also contains a weight in addition to the coordinates. From Equation 11, the size of a
particle is (2 · 3 + 1) · 32 bits = 224 bits.
6.1. System Settings
HRS: Two reconfigurable accelerator systems from Maxeler Technologies are used.
The system is developed using MaxCompiler, which adopts a stream computing model.
—MaxWorkstation is a microATX form factor system which is equipped with one Xilinx
Virtex-6 XC6VSX475T FPGA. The FPGA has 297,600 lookup tables (LUTs), 595,200
flip-flops (FFs), 2,016 digital signal processors (DSPs) and 1,064 block RAMs. The
FPGA board is connected to an Intel i7-870 CPU (4 physical cores, 8 threads in total,
clocked at 2.93 GHz) via a PCI Express Gen2 x8 bus. The maximum bandwidth of
the PCI Express bus is 2 GB/s according to the specification provided by Maxeler
Technologies.
—MPC-C500 is a 1U server accommodating four FPGA boards, each of which has a
Xilinx Virtex-6 XC6VSX475T FPGA. Each FPGA board is connected to two Intel
Xeon X5650 CPUs (12 physical cores, 24 threads in total, clocked at 2.66 GHz) via a
PCI Express Gen2 x8 bus.
To support run-time reconfigurability, there are two FPGA configurations:
—Sampling and importance weighting configuration is clocked at 100 MHz. Two data
paths are implemented on one FPGA to process particles in parallel. The total resource
usage is 231,922 LUTs (78%), 338,376 FFs (56%), 1,934 DSPs (96%) and 514 block RAMs
(48%).
—Low-power configuration is clocked at 10 MHz, with 5,962 LUTs (2%), 6,943 FFs
(1%) and 12 block RAMs (1%). It uses minimal resources just to maintain communication
between the FPGAs and CPUs.
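The clock rate and number of data paths of the sampling and importance weighting configuration are enough to estimate the data path time. The sketch below assumes that each data path accepts one particle per clock cycle; this assumption is not stated in this section, but it is consistent with the data path times reported later (0.336 s, 0.168 s and 0.084 s for 2, 4 and 8 data paths). The names are illustrative.

```python
# Sketch of the data path time implied by the configuration above.
# Assumption (not stated in this section): each data path accepts one
# particle per clock cycle, so time = cycles / (data paths * clock rate).

CLOCK_HZ = 100e6              # sampling and importance weighting configuration
MAX_CYCLES = 8 * 8192 * 1024  # maximum number of data path cycles


def data_path_time_s(num_data_paths: int, cycles: int = MAX_CYCLES) -> float:
    """Seconds needed to stream `cycles` particles through all data paths."""
    return cycles / (num_data_paths * CLOCK_HZ)


if __name__ == "__main__":
    for paths in (2, 4, 8):
        print(paths, round(data_path_time_s(paths), 3))
```

Under this assumption the data path time halves each time the number of data paths doubles, which is why adding FPGA boards reduces it proportionally.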
CPU: The CPU performance results are obtained from a 1U server that hosts two
Intel Xeon X5650 CPUs. Each CPU is clocked at 2.66 GHz. The program is written
in C and optimised by the Intel Compiler with SSE4.2 and the -fast flag enabled.
OpenMP is used to utilise all the processor cores.
GPU: An NVIDIA Tesla C2070 GPU is hosted inside a 4U server. It has 448 cores
running at 1.15 GHz and has a peak performance of 1288 GFlops. The program is
written in C for CUDA and optimised to use all the cores available. To get more
comprehensive results for comparison, we also estimate the performance of multiple GPUs.
The estimation is based on the fact that the first three stages (sampling, importance
weighting, lower bound calculation) can be evenly distributed to every GPU and be
computed independently, so the data path and data transfer speedup scales linearly
with the number of GPUs. On the other hand, the last two stages (particle set resizing,
resampling) are computed on the CPU no matter how many GPUs are used; therefore,
the CPU time does not scale with the number of GPUs.
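The multi-GPU estimate described above can be written as a one-line model: the data path and data transfer times are divided by the number of GPUs, while the CPU time is left unchanged. The sketch below is an illustration of that rule with hypothetical function and parameter names; the single-GPU inputs used in the example are the GPU(1) values reported in Table II.

```python
# Sketch of the multi-GPU estimate: the first three stages scale with the
# number of GPUs, the last two stages (on the CPU) do not.

def estimate_total_s(data_path_s: float, transfer_s: float,
                     cpu_s: float, num_gpus: int) -> float:
    """Estimated total computation time for `num_gpus` GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return data_path_s / num_gpus + transfer_s / num_gpus + cpu_s


if __name__ == "__main__":
    # Single-GPU measurements from Table II: 1.000 s, 0.360 s, 0.117 s.
    for n in (1, 2, 4):
        print(n, round(estimate_total_s(1.000, 0.360, 0.117, n), 3))
```

Because the CPU term is constant, the estimated speedup flattens as GPUs are added: the total can never fall below 0.117 s in this model.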
6.2. Adaptive PF versus Non-adaptive PF
The comparison of adaptive and non-adaptive PF is shown in Table I. Both model
estimation and experimental results are listed. Initially, the maximum number of particles
is instantiated for global localisation. For the non-adaptive scheme, the particle set
size does not change. The estimated and measured total computation times are 1.328
seconds and 1.885 seconds, respectively. The discrepancy arises because the effective
bandwidth of the PCI Express bus is lower than its maximum bandwidth.
For the adaptive scheme, the number of particles varies from 573k to 67M, and the
computation time scales linearly with the number of particles. From Table I, both the
model and experiment show a 99% reduction in computation time.
Table I. Comparison of adaptive and non-adaptive PF on HRS (MaxWorkstation with one FPGA, no data
compression is applied)

                                    Non-adaptive PF       Adaptive PF
                                    Model      Exp.       Model      Exp.
No. of particles                    67M        67M        573k       573k
Data path time Tdatapath (s)        0.336      0.336      0.003      0.003
CPU time TCPU (s)                   0.117      0.117      0.001      0.001
Data transfer time Ttran (s)        0.875      1.432      0.007      0.012
Total comp. time Tcomp (s)          1.328      1.885      0.011      0.016
Comp. speedup (higher is better)    1x         1x         120.7x     117.8x
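The model figures in Table I can be checked with a few lines of arithmetic. The sketch below assumes that the quoted 2 GB/s means 2 × 2^30 bytes per second; this reading is an assumption made here only because it reproduces the 0.875 s model transfer time exactly. Function names are illustrative.

```python
# Sketch reproducing the "Model" columns of Table I.
# Assumption: "2 GB/s" is read as 2 * 2**30 bytes per second, chosen only
# because it reproduces the 0.875 s model figure.

PARTICLE_BYTES = 224 // 8        # 224 bits per particle (Equation 11)
PCIE_BYTES_PER_S = 2 * 2**30     # assumed interpretation of 2 GB/s
MAX_PARTICLES = 8 * 8192 * 1024  # 67,108,864


def transfer_time_s(num_particles: int,
                    bandwidth: float = PCIE_BYTES_PER_S) -> float:
    """Time to move every particle across the PCI Express bus."""
    return num_particles * PARTICLE_BYTES / bandwidth


def total_time_s(data_path_s: float, cpu_s: float, transfer_s: float) -> float:
    """Total computation time as the sum of its three components."""
    return data_path_s + cpu_s + transfer_s


def speedup(non_adaptive_s: float, adaptive_s: float) -> float:
    """Ratio of non-adaptive to adaptive total computation time."""
    return non_adaptive_s / adaptive_s


if __name__ == "__main__":
    t_tran = transfer_time_s(MAX_PARTICLES)
    print(t_tran, round(total_time_s(0.336, 0.117, t_tran), 3))
    print(round(speedup(1.328, 0.011), 1), round(speedup(1.885, 0.016), 1))
```

Under the same accounting, the measured 1.432 s would correspond to an effective bandwidth of roughly 1.3 × 10^9 bytes per second, which illustrates the gap between effective and maximum bandwidth mentioned in the text.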
[Figure 7 plot. Horizontal axis: wall-clock time (s), 0 to 140. Vertical axes (log scale): number of
particles, 1 to 1e+08; components of computation time (s), 0.001 to 1000. Series: No. of particles,
Data transfer time, Data path time, CPU time, Idle time.]
Fig. 7. Number of particles and components of total computation time versus wall-clock time
[Figure 8 plot. Horizontal axis: wall-clock time (s), 0 to 140. Vertical axis: localisation error (m),
0 to 4. Series: Adaptive, Non-adaptive.]
Fig. 8. Localisation error versus wall-clock time
Figure 7 shows how the number of particles and the components of total computation
time vary over the wall-clock time (passage of time from the start to the completion of
the application). Although the number of particles is reduced in the proposed design,
the results in Figure 8 show that the localisation error is not adversely affected. The
error is the highest during initial global localisation and it is reduced when the robot
moves.
[Figure 9 plot. Horizontal axis: number of replications, 0 to 20. Vertical axis: data transfer time (s),
0 to 0.4.]
Fig. 9. Effect on the data transfer time by particle stream compression
6.3. Data Compression
Figure 9 shows the reduction in data transfer time after applying data compression.
A higher number of replications means a lower data transfer time. The data transfer
time has a lower bound of 0.212 seconds because the data from the FPGAs to the
CPUs are not compressible. Only the particle stream after the resampling process is
compressed when it is transferred from the CPUs to the FPGAs.
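Resampling replaces low-weight particles with copies of high-weight ones, which is why the stream sent from the CPUs to the FPGAs contains replicated particles. The encoding used by the authors is defined elsewhere in the paper and is not reproduced here. Purely as a generic illustration of why replication makes the stream compressible, the sketch below run-length encodes a particle stream, assuming that copies of the same particle are adjacent in the stream. All names are hypothetical.

```python
# Generic run-length sketch, NOT the authors' compression scheme.
# Assumption: replicated particles appear consecutively in the stream.
from itertools import groupby
from typing import List, Tuple

Particle = Tuple[float, float, float]  # (x, y, heading)


def compress(stream: List[Particle]) -> List[Tuple[Particle, int]]:
    """Collapse each run of identical particles into (particle, count)."""
    return [(p, sum(1 for _ in run)) for p, run in groupby(stream)]


def decompress(pairs: List[Tuple[Particle, int]]) -> List[Particle]:
    """Expand (particle, count) pairs back into the full stream."""
    out: List[Particle] = []
    for particle, count in pairs:
        out.extend([particle] * count)
    return out


if __name__ == "__main__":
    a, b = (1.0, 2.0, 0.5), (3.0, 4.0, 1.0)
    stream = [a, a, a, b]
    print(compress(stream))
    print(decompress(compress(stream)) == stream)
```

With an average of r copies per particle, such an encoding sends roughly one entry per r particles in the CPU-to-FPGA direction, while the FPGA-to-CPU direction is unaffected; this matches the qualitative behaviour in Figure 9, where the transfer time approaches a fixed lower bound.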
6.4. Performance comparison of HRS, CPUs and GPUs
Table II shows the performance comparison of the CPUs, GPUs and HRS.
Data path time: Considering the time spent on the data paths only, HRS is up
to 328 times faster than a single-core CPU and 76 times faster than a 12-core CPU
system with 24 threads. In addition, it is 12 times and 3 times faster than one GPU
and four GPUs, respectively.
Data transfer time: The data transfer time of HRS is shown in three rows. The first
row shows the situation when the PCI Express bandwidth is 2 GB/s. The second row
shows the performance when PCI Express Gen3 x8 (7.88 GB/s) is used such that the
bandwidth is comparable with that of the GPU system. When multiple FPGA boards
are used, the data transfer time decreases because multiple PCI Express buses are
utilised simultaneously. The third row shows the performance when data compression
is applied, assuming that each particle is replicated 20 times on average.
CPU time: The CPU time of HRS is shorter than that of the CPU and GPU systems
because part of the resampling process of object-particles is performed on the FPGA using
the Independent Metropolis-Hastings (IMH) resampling algorithm [Miao et al. 2011].
The IMH resampling algorithm is optimised for the deep pipeline architecture where each
particle occupies a single stage of the pipeline. On the CPUs and GPU, the computation
of the particles is shared by threads and therefore the IMH resampling algorithm is
not applicable.
Total computation time: Considering the overall system performance, HRS is up
to 169 times faster than a single-core CPU and 41 times faster than a 12-core CPU system.
In addition, it is 9 times faster than one GPU, and 3 times faster than four GPUs.
Notice that both CPU configurations violate the real-time constraint of 5 seconds,
taking 27.95 seconds and 6.697 seconds respectively.
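The speedup figures quoted above are ratios of the total computation times listed in Table II, and the real-time check is a comparison against the 5-second bound. The sketch below recomputes both; the dictionary layout and function names are illustrative.

```python
# Sketch: overall speedups and real-time check from the Table II totals.

TOTAL_TIME_S = {
    "CPU(1)": 27.95, "CPU(2)": 6.697,
    "GPU(1)": 1.477, "GPU(2)": 0.797, "GPU(3)": 0.457,
    "HRS(1)": 0.589, "HRS(2)": 0.304, "HRS(3)": 0.165,
}
REAL_TIME_BOUND_S = 5.0  # one time-step


def speedup(baseline: str, system: str) -> float:
    """How many times faster `system` is than `baseline`."""
    return TOTAL_TIME_S[baseline] / TOTAL_TIME_S[system]


def meets_real_time(system: str) -> bool:
    """True if the total computation time fits within one time-step."""
    return TOTAL_TIME_S[system] <= REAL_TIME_BOUND_S


if __name__ == "__main__":
    for base in ("CPU(1)", "CPU(2)", "GPU(1)", "GPU(3)"):
        print(base, round(speedup(base, "HRS(3)"), 1))
    print([name for name in TOTAL_TIME_S if not meets_real_time(name)])
```

Rounded to the nearest integer, the ratios against HRS(3) give the 169, 41, 9 and 3 quoted in the text, and only the two CPU configurations fail the real-time check.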
Table II. Performance comparison of HRS, CPUs and GPU

                        CPU(1) a   CPU(2) a   GPU(1) b   GPU(2) b   GPU(3) b   HRS(1) c   HRS(2) d   HRS(3) d
Clock freq. (MHz)       2660       2660       1150       1150       1150       100        100        100
Precision               single     single     single     single     single     single     single     single
                                                                               + custom   + custom   + custom
Level of parallelism    1          24         448        896        1792       2+8 e      4+24 e     8+24 e
Data path time (s)      27.530     6.363      1.000      0.500      0.250      0.336      0.168      0.084
Data path speedup       1x         4.3x       27.5x      55.1x      110.1x     81.9x      163.9x     327.7x
Data tran. time (s)     0          0          0.360      0.180      0.090      1.432 f    0.716 f    0.358 f
                                                                               0.363 g    0.182 g    0.091 g
                                                                               0.223 h    0.111 h    0.056 h
CPU time (s)            0.420      0.334      0.117      0.117      0.117      0.030      0.025      0.025
Total comp. time (s)    27.95      6.697      1.477      0.797      0.457      0.589      0.304      0.165
Overall speedup         1x         4.2x       18.9x      35.1x      61.2x      47.5x      91.9x      169.4x
Comp. power (W)         183        279        287        424        698        145        420        480
Comp. power eff.        1x         0.7x       0.6x       0.4x       0.3x       1.3x       0.4x       0.4x
Idle power (W)          133        133        208        266        382        95         360        360
Idle power eff.         1x         1x         0.6x       0.5x       0.4x       1.4x       0.4x       0.4x
Energy (J) i            677/5115   673/1868   1041/1157  1331/1456  1911/2054  489/595    1896/1914  1994/2012
Energy eff.             1x         1x/2.7x    0.7x/4.4x  0.5x/3.5x  0.4x/2.5x  1.4x/8.6x  0.4x/2.7x  0.3x/2.5x

a 2 Intel Xeon X5650 CPUs @2.66 GHz (12 cores supporting 24 threads).
b 1/2/4 NVIDIA Tesla C2070 GPUs and 1 Intel Core i7-950 CPU @3.07 GHz (4 cores supporting 8 threads).
c 1 Xilinx XC6VSX475T FPGA and 1 Intel Core i7-870 CPU @2.93 GHz (4 cores supporting 8 threads).
d 4 Xilinx XC6VSX475T FPGAs and 2 Intel Xeon X5650 CPUs @2.66 GHz (12 cores supporting 24 threads).
e Number of FPGA data paths and number of CPU threads.
f Each FPGA communicates with CPUs via a PCI Express bus with 2 GB/s bandwidth.
g Each FPGA communicates with CPUs via a PCI Express Gen3 x8 bus with 7.88 GB/s bandwidth.
h Each FPGA communicates with CPUs via a PCI Express Gen3 x8 bus with data compression.
i Cases for 573k and 67M particles in a 5-second interval.

Power and energy consumption: In real-time applications, we are interested in
the energy consumption per time-step. Figure 10 shows the power consumption of
HRS, CPUs and GPU over a period of 10 seconds (2 time-steps). The system power is
measured using a power meter which is connected directly between the power source
and the system. All the curves of HRS show peaks when HRS is in the computation
mode and troughs when it is in the low-power mode. The power during the configuration
period lies between the two modes. On the HRS with one FPGA, run-time reconfiguration
reduces the idle power consumption by 34% from 145W to 95W. In other
words, over a 5-second time-step, the energy consumption is reduced by up to 33%. On
the HRS with four FPGAs, the idle power consumption is reduced by 25% from 480W
to 360W, and hence the energy consumption is reduced by up to 17%.
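The percentages above can be recomputed from the power and energy figures in Table II. The sketch below uses a simple two-level energy model (computation power while computing, idle power for the rest of the time-step) that is inferred here from the table entries rather than stated in the text. It reproduces the CPU and GPU energy entries; the HRS entries also include reconfiguration overhead, so the measured HRS values from Table II are used directly. Names are illustrative.

```python
# Sketch of per-time-step energy, inferred from Table II (not a formula
# given in the text). Reconfiguration overhead is NOT modelled.

STEP_S = 5.0  # length of one time-step in seconds


def energy_per_step_j(comp_power_w: float, idle_power_w: float,
                      comp_time_s: float, step_s: float = STEP_S) -> float:
    """Energy in joules: computing at comp power, then idling until the step ends."""
    if comp_time_s >= step_s:
        # The computation overruns the time-step, so there is no idle phase.
        return comp_power_w * comp_time_s
    return comp_power_w * comp_time_s + idle_power_w * (step_s - comp_time_s)


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    print(round(energy_per_step_j(183, 133, 27.95)))  # CPU(1), 67M particles
    print(round(energy_per_step_j(287, 208, 1.477)))  # GPU(1), 67M particles
    # Idle power with and without run-time reconfiguration, one FPGA.
    print(round(100 * relative_reduction(145, 95)))
    # Energy per step: 145 W held for 5 s versus the measured 489 J.
    print(round(100 * relative_reduction(145 * STEP_S, 489)))
```

The energy reduction is bounded by the idle power reduction: it approaches 34% (one FPGA) or 25% (four FPGAs) only when the system spends almost the whole time-step idle, which is the 573k-particle case.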
The run-time reconfiguration methodology is not limited to the Maxeler systems; it
can also be applied to other FPGA platforms. The resource management software of our
system (MaxelerOS) simplifies run-time reconfiguration, and hence we can focus on
studying the impact of run-time reconfiguration on energy saving.
To identify the speed and energy trade-off, we produce a graph as shown in Figure 11.
Each data point represents the computation time versus energy consumption
of a system setting. Among all the systems that satisfy the real-time requirement, the
HRS with one FPGA consumes the smallest amount of energy. None of the CPU system
configurations meets the real-time requirement. HRS(3), the HRS with four FPGAs, is the
fastest among all the systems in comparison; therefore it is able to handle larger problems
and more complex applications.
7. CONCLUSION
This paper presents an approach for accelerating adaptive particle filters for real-time
applications. The proposed heterogeneous reconfigurable system demonstrates a significant
reduction in power and energy consumption compared with CPUs and GPUs.
Fig. 10. Power consumption of HRS, CPU and GPU in one time-step. Notice that the computation time of
the CPU system exceeds the 5-second real-time requirement. (The lines of HRS(2) and HRS(3) overlap.)
[Figure 11 plot. Horizontal axis: energy consumption (J), 0 to 5000. Vertical axis (log scale):
run-time per time-step (s), 0.01 to 100. Data points: CPU(1), CPU(2), GPU(1), GPU(2), GPU(3), HRS(1),
HRS(2), HRS(3). Reference line: real-time bound.]
Fig. 11. Run-time vs. energy consumption of HRS, CPUs and GPUs (5-second time-step, 67M particles;
refer to Table II for system settings)
The adaptive algorithm reduces computation time while maintaining the quality of
results. The approach is scalable to systems with multiple FPGAs. A data compression
technique is used to mitigate the data transfer overhead between the FPGAs and
CPUs.
In the future, heterogeneous reconfigurable systems will be developed for various
particle filters that are more compute-intensive and have more stringent real-time
requirements than the ones described above. Air traffic management [Chau et al. 2013b]
and traffic estimation [Mihaylova et al. 2007] are example applications that can
substantially benefit from the proposed approach in meeting current and future
requirements. Further work will also be required to automate the optimisation of designs
targeting heterogeneous reconfigurable systems.
ACKNOWLEDGMENTS
This work is supported in part by the European Union Seventh Framework Programme under grant agreement
numbers 257906, 287804 and 318521, by UK EPSRC grant numbers EP/L00058X/1, EP/I012036/1 and
EP/G066477/1, by the Maxeler University Programme, by Xilinx, and by the Croucher Foundation. The authors
thank Oliver Pell at Maxeler Technologies for comments on the paper.
REFERENCES
Miodrag Bolic, Petar M. Djuric, and Sangjin Hong. 2005. Resampling algorithms and architectures for
distributed particle filters. IEEE Trans. Signal Processing 53, 7 (2005), 2442–2450.
Miodrag Bolic, Sangjin Hong, and Petar M. Djuric. 2002. Performance and complexity analysis of adaptive
particle filtering for tracking applications. In Proc. Asilomar Conf. Signals, Systems, and Computers,
Vol. 1. 853–857.
Thomas C.P. Chau, Wayne Luk, Peter Y.K. Cheung, Alison Eele, and Jan Maciejowski. 2012. Adaptive
Sequential Monte Carlo Approach for Real-Time Applications. In Proc. Int. Conf. Field Programmable
Logic and Applications. 527–530.
Thomas C.P. Chau, Xinyu Niu, Alison Eele, Wayne Luk, Peter Y.K. Cheung, and Jan Maciejowski. 2013a.
Heterogeneous Reconfigurable System for Adaptive Particle Filters in Real-Time Applications. In Proc.
Int. Symp. Applied Reconfigurable Computing. 1–12.
Thomas C.P. Chau, James S. Targett, Marlon Wijeyasinghe, Wayne Luk, Peter Y.K. Cheung, Benjamin Cope,
Alison Eele, and J.M. Maciejowski. 2013b. Accelerating Sequential Monte Carlo Method for Real-time
Air Traffic Management. In Proc. Int. Symp. Highly Efficient Accelerators and Reconfigurable
Technologies.
Arnaud Doucet, Nando de Freitas, and Neil Gordon. 2001. Sequential Monte Carlo methods in practice.
Springer.
A. Eele and J.M. Maciejowski. 2011. Comparison of Stochastic Optimisation Methods for Control in Air
Traffic Management. In Proc. IFAC World Congress.
Dieter Fox. 2003. Adapting the sample size in particle filters through KLD-sampling. Int. Trans. Robotics
22, 12 (2003), 985–1003.
Markus Happe, Enno Lübbers, and Marco Platzner. 2011. A self-adaptive heterogeneous multi-core
architecture for embedded real-time video object tracking. Journal Real-Time Image Processing (2011), 1–16.
Daphne Koller and Raya Fratkina. 1998. Using learning for approximation in stochastic processes. In Proc.
Int. Conf. Machine Learning. 287–295.
Zhibin Liu, Zongying Shi, Mingguo Zhao, and Wenli Xu. 2007. Mobile robots global localization using
adaptive dynamic clustered particle filters. In Proc. Int. Conf. Intelligent Robots and Systems. 1059–1064.
Lifeng Miao, Jun Jason Zhang, Chaitali Chakrabarti, and Antonia Papandreou-Suppappola. 2011.
Algorithm and Parallel Implementation of Particle Filtering and its Use in Waveform-Agile Sensing.
J. Signal Process. Syst. 65, 2 (2011), 211–227.
Lyudmila Mihaylova, René Boel, and Andreas Hegyi. 2007. Freeway traffic estimation within particle
filtering framework. Automatica 43, 2 (2007), 290–300.
Michael Montemerlo, Sebastian Thrun, and William Whittaker. 2002. Conditional particle filters for
simultaneous mobile robot localization and people-tracking. In Proc. Int. Conf. Robotics and Automation.
695–701.
Sang-Hyuk Park, Young-Joong Kim, and Myo-Taeg Lim. 2010. Novel adaptive particle filter using adjusted
variance and its application. Int. Journal Control, Automation and Systems 8, 4 (2010), 801–807.
Jaco Vermaak, Christophe Andrieu, Arnaud Doucet, and Simon John Godsill. 2002. Particle methods for
Bayesian modeling and enhancement of speech signals. IEEE Trans. Speech and Audio Processing 10,
3 (2002), 173–185.
Received June 2013; revised February 2014; accepted March 2014
ACM Transactions on Reconfigurable Technology and Systems, Vol. 1, No. 1, Article 1, Publication date: March 2014.
